Job Description
Are you obsessed with system reliability, performance, and automation? CloudScale Dynamics is seeking a Senior Site Reliability Engineer to join our core infrastructure team. You will be instrumental in scaling our global platforms, ensuring 99.999% availability, and building the next generation of our observability stack.
You will work at the intersection of software engineering and systems operations, leveraging your expertise to eliminate manual toil through code. Join a team of world-class engineers where performance meets innovation.
Responsibilities
- Design, build, and maintain highly scalable, distributed production systems on AWS.
- Automate infrastructure provisioning using Terraform and CI/CD pipelines.
- Conduct incident response and blameless post-mortems to improve service stability.
- Optimize cloud costs and system performance through proactive capacity planning.
- Develop self-healing mechanisms and automated monitoring/alerting strategies.
- Collaborate with development teams to ensure software is deployable and observable.
- Mentor junior engineers on best practices for infrastructure-as-code and reliability.
Qualifications
- 5+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Deep proficiency with AWS (EC2, EKS, RDS, S3) and modern cloud architectures.
- Strong programming skills in Go, Python, or Ruby.
- Expert-level knowledge of Kubernetes and container orchestration at scale.
- Deep understanding of observability tools like Prometheus, Grafana, and Datadog.
- Strong grasp of Linux system internals, networking, and security best practices.
- Proven ability to troubleshoot complex issues in distributed environments.