Job Description
Are you obsessed with system uptime, performance at scale, and the elegance of automated infrastructure? CloudScale Dynamics is looking for a Senior Site Reliability Engineer to join our high-impact engineering team in San Francisco. You will be the architect of our reliability strategy, bridging the gap between development and operations to ensure our global platform remains resilient, performant, and secure.
You will work alongside elite software engineers to define error budgets, lead incident response, and implement cutting-edge observability solutions. If you thrive in high-stakes environments and love solving complex distributed-systems problems, we want to hear from you.
Responsibilities
- Design, build, and maintain scalable, high-availability cloud infrastructure on AWS/GCP.
- Drive capacity planning and performance tuning for high-traffic microservices.
- Lead post-mortem analyses and implement long-term fixes that prevent incidents from recurring.
- Develop automation tools to manage infrastructure-as-code (Terraform) and CI/CD pipelines.
- Implement advanced monitoring, logging, and tracing solutions (Prometheus, Grafana, ELK).
- Champion SRE best practices across engineering squads, including code reviews and architectural audits.
- Participate in a collaborative on-call rotation to uphold a 99.99% service availability target.
Qualifications
- Bachelor’s degree in Computer Science or equivalent practical experience.
- 5+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Expertise in Linux system internals and networking (TCP/IP, DNS, HTTP, TLS).
- Advanced proficiency in at least one language: Go, Python, or Java.
- Deep understanding of container orchestration platforms, specifically Kubernetes.
- Proven experience with IaC tools such as Terraform or Pulumi.
- Strong problem-solving skills and the ability to remain calm under pressure during outages.