Job Description
Join NexusScale, a hyper-growth leader in cloud-native infrastructure, as a Senior Site Reliability Engineer. You will be at the heart of our mission to build resilient, scalable systems that power thousands of global enterprises. We are looking for an expert in distributed systems who thrives on automation, observability, and proactive problem-solving.
You will work alongside elite software engineers to bridge the gap between development and operations, ensuring our platform maintains 99.999% availability while we continuously deploy new features.
Responsibilities
- Design, implement, and maintain highly available and scalable distributed systems on GCP.
- Automate manual operational tasks using Go, Python, or Terraform to reduce toil.
- Lead incident response, root cause analysis, and post-mortem reviews to improve system resiliency.
- Develop and maintain advanced monitoring, logging, and alerting stacks (Prometheus, Grafana, ELK).
- Collaborate with product teams to define and track SLOs and SLIs.
- Mentor junior engineers on best practices for cloud architecture and performance tuning.
- Manage capacity planning and resource optimization to ensure cost-efficient infrastructure.
Qualifications
- 5+ years of experience in Site Reliability Engineering, DevOps, or Software Engineering.
- Deep expertise in Kubernetes, Docker, and container orchestration at scale.
- Proficiency in at least one major programming language: Go, Python, or Java.
- Strong background in infrastructure-as-code (IaC) tools such as Terraform, Pulumi, or Ansible.
- Solid understanding of CI/CD pipelines (GitHub Actions, GitLab CI, or Jenkins).
- Hands-on experience with cloud-native security practices and protocols.
- Excellent communication skills with the ability to influence cross-functional technical teams.