Job Description
Are you obsessed with system uptime, latency, and automated recovery? CloudScale Dynamics is seeking a visionary Senior Site Reliability Engineer to join our core infrastructure team. In this role, you will bridge the gap between development and operations, building highly scalable systems that power our global platform.
We prioritize engineering excellence, blameless post-mortems, and a proactive culture of automation. If you thrive in high-stakes environments and want to influence the architecture of mission-critical services, we want to meet you.
Responsibilities
- Design, implement, and maintain highly available, distributed cloud infrastructure.
- Automate manual operational tasks using Python, Go, or Bash to increase system efficiency.
- Lead incident response and perform deep-dive root cause analysis for production outages.
- Develop and manage CI/CD pipelines to ensure seamless deployment across global clusters.
- Implement comprehensive monitoring, alerting, and observability strategies using tools like Prometheus and Grafana.
- Collaborate with cross-functional software teams to influence architectural decisions for performance and scale.
- Manage capacity planning and resource optimization to control cloud infrastructure costs.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.
- 5+ years of experience in SRE, DevOps, or Software Engineering roles.
- Deep expertise in managing large-scale infrastructure on AWS or GCP.
- Advanced proficiency with Kubernetes (K8s) orchestration and containerization strategies.
- Strong programming skills in Go, Python, or Java.
- Solid understanding of Infrastructure as Code (IaC) using Terraform or Pulumi.
- Proven ability to troubleshoot complex distributed systems under pressure.