Job Description
Are you obsessed with uptime, scalability, and system performance? NexusScale is looking for a Senior Site Reliability Engineer to join our high-impact infrastructure team in San Francisco. You will be responsible for building, scaling, and maintaining the mission-critical systems that power our global cloud platform. We move fast, automate everything, and value engineering excellence above all else.
Responsibilities
- Architect and maintain highly available, distributed systems on AWS/GCP.
- Develop and implement robust automation for CI/CD pipelines and infrastructure provisioning using Terraform.
- Lead incident response and perform deep-dive post-mortems to ensure long-term system health.
- Optimize service performance and cost through proactive capacity planning and resource management.
- Collaborate with development teams to integrate observability, monitoring, and alerting frameworks.
- Contribute to the evolution of our platform strategy and security posture.
Qualifications
- 5+ years of experience in SRE, DevOps, or large-scale systems engineering roles.
- Advanced proficiency in Python, Go, or Ruby for automation and tool development.
- Deep expertise in container orchestration using Kubernetes and Docker.
- Strong background in managing large-scale cloud infrastructure (AWS/GCP/Azure).
- Proven experience with monitoring tools such as Prometheus, Grafana, or Datadog.
- Strong understanding of network protocols, security best practices, and database reliability.
- BS/MS in Computer Science, Engineering, or equivalent practical experience.