Job Description
Are you obsessed with system uptime, latency, and scalability? NexusCloud Systems is looking for an elite Senior Site Reliability Engineer to join our core infrastructure team. In this role, you will bridge the gap between development and operations, ensuring our high-traffic platform remains resilient, performant, and secure. You will work with a modern tech stack in a culture that values automation, blameless post-mortems, and engineering excellence.
Responsibilities
- Design and maintain highly available, distributed cloud systems on AWS.
- Automate operational tasks using Python, Go, or Bash to eliminate manual toil.
- Lead incident response efforts and conduct deep-dive post-mortem analyses.
- Optimize CI/CD pipelines to improve deployment velocity and system reliability.
- Manage Infrastructure as Code (IaC) using Terraform and Kubernetes manifests.
- Mentor junior engineers on SRE best practices and system architecture design.
- Define and track Service Level Objectives (SLOs) and error budgets for production services.
Qualifications
- 5+ years of experience in SRE, DevOps, or large-scale Systems Engineering.
- Expert-level proficiency with Kubernetes (EKS) and container orchestration.
- Deep understanding of AWS cloud services and networking fundamentals.
- Strong coding skills in Go, Python, or Ruby.
- Hands-on experience with Prometheus, Grafana, and the ELK stack for monitoring and logging.
- Proven track record of managing high-traffic production environments with 99.99%+ uptime.
- Excellent problem-solving skills and the ability to thrive in high-pressure situations.