Job Description
Are you obsessed with high availability, scalability, and system resilience? Nexus Cloud Systems is looking for a Senior Site Reliability Engineer to join our core infrastructure team. In this role, you will bridge the gap between software engineering and systems operations, ensuring our global platforms remain performant and bulletproof.
You will work on cutting-edge cloud-native technologies, driving our transition toward automated infrastructure and self-healing systems. If you thrive in complex distributed environments, we want to hear from you.
Responsibilities
- Architect, implement, and maintain highly available and scalable cloud infrastructure on AWS/GCP.
- Develop and maintain CI/CD pipelines to streamline deployment velocity and reliability.
- Lead incident response and conduct post-mortem analyses to identify root causes.
- Implement Infrastructure as Code (IaC) using Terraform and Ansible to ensure configuration consistency.
- Optimize system performance and resource utilization to manage cloud infrastructure costs effectively.
- Design robust monitoring, alerting, and logging solutions to gain deep visibility into production systems.
- Collaborate with cross-functional development teams to enforce reliability best practices throughout the SDLC.
Qualifications
- 5+ years of experience in SRE, DevOps, or large-scale Systems Engineering roles.
- Expertise in at least one major cloud platform (AWS, GCP, or Azure) and in container orchestration with Kubernetes.
- Strong proficiency in scripting and automation (Python, Go, or Bash).
- Deep understanding of IaC tools such as Terraform, CloudFormation, or Pulumi.
- Experience with observability platforms like Prometheus, Grafana, Datadog, or ELK Stack.
- Proven track record of managing high-traffic production environments with 99.99% uptime requirements.
- Excellent communication skills with the ability to lead technical initiatives and mentor junior engineers.