Job Description
Are you obsessed with uptime, scalability, and performance? NexusScale is looking for a Senior Site Reliability Engineer (SRE) to help build the next generation of our global cloud infrastructure. You will work alongside world-class engineers to automate our systems, optimize our Kubernetes clusters, and ensure our services remain rock-solid for millions of users worldwide.
We value pragmatic engineering, blameless post-mortems, and a proactive approach to technical debt. If you are an automation-first engineer who thrives on solving complex distributed systems problems, we want to hear from you.
Responsibilities
- Design, implement, and maintain highly available distributed systems on GCP and AWS.
- Automate infrastructure provisioning and configuration management using Terraform and Ansible.
- Drive capacity planning, performance analysis, and tuning of our production microservices.
- Lead incident response efforts and conduct blameless post-mortems to improve system reliability.
- Implement observability tooling and alerting strategies (Prometheus, Grafana, ELK stack).
- Mentor junior engineers and promote DevOps best practices across the engineering organization.
- Optimize cloud spend and resource utilization without compromising system performance.
Qualifications
- 5+ years of experience in Site Reliability Engineering, DevOps, or Software Engineering.
- Expert-level proficiency with Kubernetes, Docker, and container orchestration at scale.
- Strong programming skills in Python, Go, or Ruby for automation and tool development.
- In-depth knowledge of cloud architecture (AWS or GCP) and networking (TCP/IP, DNS, load balancing).
- Experience with Infrastructure as Code (IaC) using Terraform or similar tools.
- Strong problem-solving skills with the ability to troubleshoot complex production issues under pressure.
- Excellent communication skills and the ability to collaborate effectively in a remote-friendly environment.