Job Description
Are you an expert in architecting resilient, high-scale distributed systems? NexusCloud Systems is seeking a visionary Senior Site Reliability Engineer to join our core infrastructure team. You will play a pivotal role in ensuring the availability, latency, and efficiency of our global cloud platform.
We operate at the intersection of software engineering and systems administration, focusing on automation, monitoring, and proactive failure prevention to deliver an unparalleled experience to our millions of users.
Responsibilities
- Design, build, and maintain scalable infrastructure to support high-traffic cloud services.
- Implement CI/CD pipelines to streamline deployment processes and reduce manual overhead.
- Define and track Service Level Objectives (SLOs) and error budgets for critical microservices.
- Conduct blameless post-mortems and lead root cause analysis for production incidents.
- Optimize cloud resource utilization to drive efficiency and cost-reduction initiatives.
- Automate operational tasks using Python, Go, or Ruby to eliminate repetitive work.
- Mentor junior engineers on best practices for observability and incident response.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.
- 5+ years of experience in SRE, DevOps, or large-scale systems engineering roles.
- Deep expertise in public cloud environments (AWS, GCP, or Azure).
- Proven proficiency with container orchestration and packaging tools, specifically Kubernetes and Helm.
- Strong experience with Infrastructure-as-Code (Terraform, Pulumi, or CloudFormation).
- Deep understanding of observability stacks (Prometheus, Grafana, Datadog, or ELK).
- Strong communication skills with the ability to articulate complex technical concepts to non-technical stakeholders.