Job Description
Are you obsessed with uptime, performance, and automation? NexusScale is looking for a Senior Site Reliability Engineer to help us build and scale our high-traffic cloud infrastructure. You will be the bridge between development and operations, ensuring our systems are not just reliable, but resilient. If you thrive in a culture of blameless post-mortems and infrastructure-as-code, we want to hear from you.
Responsibilities
- Design, build, and maintain highly available, distributed cloud systems on AWS.
- Automate infrastructure provisioning using Terraform and CI/CD best practices.
- Monitor system performance, troubleshoot bottlenecks, and implement proactive optimization strategies.
- Lead incident response and conduct thorough post-mortem analyses to prevent recurrence.
- Collaborate with engineering teams to improve software delivery speed and reliability.
- Manage capacity planning and resource allocation to ensure cost-efficiency.
- Develop and maintain internal tooling to streamline deployment workflows.
Qualifications
- 5+ years of experience in SRE, DevOps, or large-scale systems engineering.
- Deep expertise in AWS cloud services (EC2, EKS, RDS, S3).
- Proficiency in infrastructure-as-code tools such as Terraform or CloudFormation.
- Strong programming skills in Python, Go, or Ruby for automation scripting.
- In-depth knowledge of Kubernetes and container orchestration at scale.
- Experience with observability platforms like Datadog, Prometheus, or Grafana.
- Excellent analytical skills and the ability to solve complex production issues under pressure.