Job Description
Are you obsessed with uptime, scalability, and system performance? CloudScale Dynamics is seeking a visionary Senior Site Reliability Engineer to join our mission-critical infrastructure team. In this role, you will bridge the gap between development and operations, building robust automated systems that ensure our global platform remains lightning-fast and resilient under extreme load.
You will work at the intersection of software engineering and systems architecture, driving our transition to a fully observable, cloud-native ecosystem. If you are passionate about minimizing toil and maximizing reliability, we want to hear from you.
Responsibilities
- Architect and maintain highly scalable, distributed cloud infrastructure on AWS/GCP.
- Drive the automation of operational tasks through robust CI/CD pipelines and Infrastructure as Code (Terraform/Pulumi).
- Lead incident response, root cause analysis, and post-mortem reviews to improve system resiliency.
- Optimize system performance, latency, and throughput through systematic capacity planning.
- Develop and implement sophisticated monitoring, logging, and alerting strategies using Prometheus, Grafana, and ELK.
- Foster a reliability-first engineering culture by mentoring junior team members.
- Collaborate with product teams to define and meet ambitious Service Level Objectives (SLOs).
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.
- 5+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Deep expertise in Linux systems administration and container orchestration (Kubernetes).
- Strong proficiency in at least one high-level programming language (Go, Python, or Java).
- Expert knowledge of cloud platforms (AWS, GCP, or Azure) and networking fundamentals (TCP/IP, DNS, Load Balancing).
- Proven experience with Infrastructure as Code tools such as Terraform or Ansible.
- Demonstrated ability to solve complex production issues under pressure.