Job Description

Are you obsessed with uptime, scalability, and system performance? Nexus Cloud Infrastructure is looking for a Senior Site Reliability Engineer to join our high-impact engineering team in San Francisco. You will be the architect behind our mission-critical distributed systems, ensuring that our platform remains resilient under extreme load. If you thrive in a culture of automation, blameless post-mortems, and cutting-edge cloud architecture, we want to meet you.

Responsibilities

Design and maintain highly available distributed systems on AWS and Kubernetes.
Automate operational tasks using Go, Python, or Terraform to eliminate toil.
Lead incident response efforts and conduct blameless post-mortems to improve system reliability.
Optimize cloud infrastructure costs while maintaining peak performance metrics.
Collaborate with development teams to integrate CI/CD best practices early in the software development lifecycle.
Implement robust monitoring, logging, and alerting strategies using Datadog and Prometheus.

Qualifications

5+ years of experience in Site Reliability Engineering or Systems Engineering.
Deep expertise in managing production-grade Kubernetes clusters at scale.
Strong proficiency in infrastructure-as-code tools such as Terraform or Pulumi.
Advanced scripting skills in Go, Python, or Bash.
Proven experience with observability platforms like Prometheus, Grafana, or Datadog.
Strong understanding of Linux internals, networking, and security best practices.
Experience in architecting for cloud environments (AWS, GCP, or Azure).
