Job Description
At NexusCloud, we are building the next generation of high-availability infrastructure to power global enterprises. We are seeking a Senior Site Reliability Engineer to bridge the gap between development and operations. You will be responsible for defining, maintaining, and scaling our core systems, with a target of 99.999% uptime achieved through automation and architectural excellence.
You will join a high-performance team that values engineering rigor, blameless post-mortems, and a culture of continuous learning. If you are passionate about observability, distributed systems, and solving complex production challenges, we want to hear from you.
Responsibilities
- Design and manage highly available, fault-tolerant distributed systems on GCP/AWS.
- Lead the automation of infrastructure provisioning using Terraform and Kubernetes.
- Establish SLOs, SLIs, and robust monitoring strategies using Prometheus, Grafana, and Datadog.
- Conduct deep-dive incident analysis and blameless post-mortems to improve system resilience.
- Collaborate with software development teams to drive architectural improvements and capacity planning.
- Develop and maintain CI/CD pipelines to ensure rapid, safe deployment cycles.
- Participate in an on-call rotation to maintain system health and performance.
Qualifications
- 5+ years of experience in SRE, DevOps, or Software Engineering roles.
- Strong expertise in Linux systems administration and container orchestration (Kubernetes).
- Proficiency in Go, Python, or Ruby for infrastructure automation and tooling.
- Deep understanding of cloud-native architecture and distributed databases (Cassandra, PostgreSQL, or Redis).
- Experience with IaC tools like Terraform, Pulumi, or Ansible.
- Solid grasp of networking fundamentals (TCP/IP, DNS, Load Balancing, CDN).
- Bachelor’s degree in Computer Science, Engineering, or equivalent professional experience.