Job Description

Are you obsessed with system uptime, latency, and scalability? NexusCloud Systems is looking for an elite Senior Site Reliability Engineer to join our core infrastructure team. In this role, you will bridge the gap between development and operations, ensuring our high-traffic platform remains resilient, performant, and secure. You will work with a modern tech stack in a culture that values automation, blameless post-mortems, and engineering excellence.

Responsibilities

Design and maintain highly available, distributed cloud systems on AWS.
Automate operational tasks using Python, Go, or Bash to eliminate manual toil.
Lead incident response efforts and conduct deep-dive post-mortem analyses.
Optimize CI/CD pipelines to improve deployment velocity and system reliability.
Manage Infrastructure as Code (IaC) utilizing Terraform and Kubernetes manifests.
Mentor junior engineers on SRE best practices and system architecture design.
Define and track Service Level Objectives (SLOs) and Error Budgets for production services.

Qualifications

5+ years of experience in SRE, DevOps, or large-scale Systems Engineering.
Expert-level proficiency with Kubernetes (EKS) and container orchestration.
Deep understanding of AWS cloud services and networking fundamentals.
Strong coding skills in Go, Python, or Ruby.
Hands-on experience with Prometheus, Grafana, and the ELK stack for monitoring.
Proven track record of managing high-traffic production environments with 99.99%+ uptime.
Excellent problem-solving skills and ability to thrive in high-pressure situations.

Senior Site Reliability Engineer (SRE)
