Job Description
Are you obsessed with uptime, scalability, and system performance? CloudScale Dynamics is looking for a Senior Site Reliability Engineer to join our core infrastructure team. In this role, you will be the bridge between development and operations, ensuring our high-traffic microservices architecture remains robust, resilient, and performant as we scale globally.
You will work on cutting-edge cloud-native technologies, drive automation, and champion a culture of reliability. We are looking for an engineer who thrives on solving complex distributed systems challenges and automating away manual toil.
Responsibilities
- Design and maintain highly available, distributed systems in a multi-cloud environment (AWS/GCP).
- Implement and manage Infrastructure as Code (IaC) using Terraform and Pulumi.
- Drive capacity planning and performance tuning for high-traffic API services.
- Define and track Service Level Objectives (SLOs) and Error Budgets to improve system health.
- Lead post-mortem analysis and incident response procedures to minimize mean time to recovery (MTTR).
- Collaborate with software development engineers (SDEs) to design fault-tolerant system architectures.
- Mentor junior engineers on observability and monitoring best practices.
Qualifications
- 5+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Proficiency in programming languages such as Go, Python, or Java.
- Deep expertise in container orchestration (Kubernetes) and service mesh technologies.
- Strong background in cloud architecture (AWS or GCP) and networking fundamentals.
- Expert-level experience with observability stacks like Prometheus, Grafana, and ELK.
- Proven experience automating infrastructure using CI/CD pipelines and IaC tooling.
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.