Job Description
Build the future of cloud resilience.
At NexusCloud, we empower global enterprises with highly available, scalable, and ultra-fast infrastructure. We are looking for a Senior SRE who is passionate about automation, observability, and the 'you build it, you run it' philosophy to join our mission-critical platform team in San Francisco.
You will work alongside elite engineers to design complex distributed systems and drive our transition to full GitOps and automated self-healing infrastructure.
Responsibilities
- Design and implement robust automation for infrastructure provisioning and maintenance.
- Manage production availability and latency through proactive monitoring and performance tuning.
- Lead incident response and perform root-cause analysis for complex system issues.
- Develop and maintain CI/CD pipelines to ensure seamless software delivery.
- Scale infrastructure to support a massive increase in user traffic while optimizing cloud costs.
- Mentor junior SREs and promote a culture of operational excellence across engineering teams.
- Collaborate with development teams to ensure services are production-ready from day one.
Qualifications
- 5+ years of experience in Site Reliability Engineering, DevOps, or Software Engineering.
- Expert-level proficiency with Kubernetes, Docker, and container orchestration at scale.
- Strong coding skills in Go, Python, or Ruby.
- Deep expertise in public cloud environments (AWS, GCP, or Azure), managed through Infrastructure as Code tools such as Terraform or Pulumi.
- Proven experience with observability platforms such as Prometheus, Grafana, ELK, or Datadog.
- Strong understanding of distributed systems, networking (TCP/IP, DNS, load balancing), and security principles.
- Excellent troubleshooting skills and the ability to thrive in a high-pressure, fast-paced environment.