Job Description
Are you obsessed with system reliability and architectural scalability? NexusCloud Systems is seeking a world-class Senior Site Reliability Engineer to help us maintain 99.999% availability for our global cloud infrastructure.
You will work at the intersection of software engineering and systems operations, bridging the gap between development and production to deliver lightning-fast, resilient services. Join a team where your automation-first mindset is the primary driver of our success.
Responsibilities
- Design, build, and maintain robust infrastructure as code (IaC) using Terraform and Pulumi.
- Lead incident response protocols and perform thorough post-mortem analyses to prevent recurrence.
- Optimize cloud costs and system performance through proactive capacity planning and resource tuning.
- Implement advanced observability patterns using Prometheus, Grafana, and the ELK stack.
- Automate manual operational tasks with Python or Go scripts to reduce toil.
- Collaborate with cross-functional product teams to ensure high-availability designs from the outset.
- Mentor junior engineers on best practices for cloud-native security and resiliency.
Qualifications
- Bachelor’s or Master’s degree in Computer Science or Engineering, or equivalent practical experience.
- 5+ years of experience in SRE, DevOps, or systems engineering roles.
- Deep proficiency in cloud platforms (AWS, GCP, or Azure) and container orchestration (Kubernetes).
- Strong programming skills in at least one language: Go, Python, or Ruby.
- Hands-on experience with CI/CD pipeline optimization (GitHub Actions, Jenkins).
- Solid understanding of distributed systems architecture and microservices communication patterns.
- Excellent analytical thinking and the ability to troubleshoot complex, large-scale production issues.