Job Description
Are you obsessed with uptime, scalability, and system performance? Nexus Cloud Infrastructure is looking for a Senior Site Reliability Engineer to join our high-impact engineering team in San Francisco. You will be the architect behind our mission-critical distributed systems, ensuring that our platform remains resilient under extreme load. If you thrive in a culture of automation, blameless post-mortems, and cutting-edge cloud architecture, we want to meet you.
Responsibilities
- Design and maintain highly available distributed systems on AWS and Kubernetes.
- Automate operational tasks using Go, Python, or Terraform to eliminate toil.
- Lead incident response efforts and conduct blameless post-mortems to improve system reliability.
- Optimize cloud infrastructure costs without compromising performance.
- Collaborate with development teams to integrate CI/CD best practices early in the software development lifecycle.
- Implement robust monitoring, logging, and alerting strategies using Datadog and Prometheus.
Qualifications
- 5+ years of experience in Site Reliability Engineering or Systems Engineering.
- Deep expertise in managing production-grade Kubernetes clusters at scale.
- Strong proficiency in infrastructure-as-code tools such as Terraform or Pulumi.
- Advanced scripting skills in Go, Python, or Bash.
- Proven experience with observability platforms like Prometheus, Grafana, or Datadog.
- Strong understanding of Linux internals, networking, and security best practices.
- Experience architecting for cloud environments (AWS, GCP, or Azure).