Job Description
Are you obsessed with system performance, scalability, and uptime? NexusCloud Systems is looking for a Senior Site Reliability Engineer to join our elite infrastructure team in San Francisco. You will play a pivotal role in designing, building, and maintaining our global cloud infrastructure, ensuring our platform remains performant and resilient under massive scale.
You will bridge the gap between development and operations, applying software engineering practices to infrastructure problems to create a seamless, automated, and highly available environment.
Responsibilities
- Architect and maintain highly available, scalable, and secure cloud infrastructure using Terraform and Kubernetes.
- Automate operational tasks, including deployment pipelines, monitoring, and incident response, using Go or Python.
- Lead post-mortem investigations and drive root cause analysis for production incidents to prevent recurrence.
- Define and manage Service Level Objectives (SLOs) and Error Budgets for critical microservices.
- Collaborate with cross-functional teams to influence architectural decisions for performance and reliability.
- Optimize cloud resource utilization and cost-efficiency across multi-region environments.
- Participate in a collaborative on-call rotation to ensure platform integrity 24/7.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
- 5+ years of experience in SRE, DevOps, or Software Engineering roles focused on infrastructure.
- Deep proficiency in cloud platforms (AWS or GCP) and container orchestration with Kubernetes (K8s).
- Strong programming skills in Go, Python, or Ruby, with a focus on automation tools.
- Extensive experience with Infrastructure as Code (IaC) tooling, specifically Terraform or Pulumi.
- Advanced knowledge of observability stacks including Prometheus, Grafana, and ELK/Datadog.
- Strong understanding of CI/CD methodologies and distributed systems architecture.