Job Description
Are you obsessed with system uptime, latency, and scalable architecture? Nexus Cloud Infrastructure is seeking a Senior SRE to join our mission-critical platform team. You will be responsible for defining the reliability standards of our global cloud infrastructure and automating manual operational tasks at scale.
We operate a massive, distributed environment and believe in the 'SRE as software engineering' philosophy. If you thrive in high-pressure environments and love solving complex, system-level challenges, we want to hear from you.
Responsibilities
- Design and maintain robust, fault-tolerant infrastructure on AWS and Kubernetes.
- Automate operational processes through CI/CD pipelines and Infrastructure-as-Code (Terraform/Ansible).
- Conduct incident response and post-mortem analysis to prevent recurrence of systemic issues.
- Develop and manage monitoring, alerting, and observability stacks (Prometheus, Grafana, Datadog).
- Collaborate with development teams to ensure high availability and performance throughout the SDLC.
- Participate in an on-call rotation to support mission-critical production services.
- Optimize cloud resource utilization to balance performance with cost efficiency.
Qualifications
- 5+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Strong proficiency in Linux internals, networking, and security best practices.
- Deep experience with Kubernetes, Docker, and container orchestration at scale.
- Advanced coding skills in Python, Go, or Ruby for automation and tool development.
- Proven track record of managing large-scale infrastructure using Terraform or similar IaC tools.
- Experience with high-traffic distributed systems and microservices architectures.
- Bachelor’s degree in Computer Science or equivalent practical industry experience.