Job Description
Are you obsessed with system reliability and massive scale? NexusCloud Systems is seeking a Senior Site Reliability Engineer to join our high-impact infrastructure team. You will be the architect behind our mission-critical services, ensuring 99.999% uptime while building the next generation of our automated global platform.
We operate at the intersection of software engineering and systems operations, building the tools that empower our developers to ship code faster and more reliably. If you thrive in a culture of blameless post-mortems, infrastructure-as-code, and complex problem-solving, we want to hear from you.
Responsibilities
- Architect and maintain highly available, scalable cloud infrastructure on AWS/GCP.
- Drive capacity planning, performance tuning, and latency optimization across our microservices stack.
- Automate manual processes through CI/CD pipelines and infrastructure-as-code (Terraform/Ansible).
- Lead incident response efforts and conduct deep-dive post-mortem analyses to prevent recurrence.
- Develop and manage service-level objectives (SLOs) and error budgets to balance velocity and reliability.
- Mentor junior engineers and promote engineering excellence across the organization.
Qualifications
- 5+ years of experience in Site Reliability Engineering, DevOps, or Software Engineering.
- Expert-level proficiency in Go, Python, or Java.
- Deep understanding of distributed systems, networking (TCP/IP, DNS, Load Balancing), and Linux internals.
- Proven hands-on experience with Kubernetes, Docker, and container orchestration at scale.
- Extensive experience with infrastructure provisioning tools (Terraform, Pulumi, or CloudFormation).
- Strong knowledge of observability stacks (Prometheus, Grafana, Datadog, or ELK).
- Excellent communication skills with the ability to lead cross-functional technical projects.