Job Description
Are you obsessed with system reliability and massive scale? NexusCloud Systems is looking for a Senior Site Reliability Engineer to join our high-impact infrastructure team. You will be instrumental in building the backbone of our global SaaS platform, ensuring 99.999% uptime, and automating complex cloud-native environments.
We value engineering excellence, pragmatic decision-making, and a deep commitment to observability and performance optimization.
Responsibilities
- Design and maintain highly scalable, fault-tolerant infrastructure on AWS/GCP.
- Develop automation tooling to reduce manual operational toil using Python or Go.
- Lead incident response efforts and conduct blameless post-mortems to improve system resilience.
- Implement and refine SLOs, SLIs, and comprehensive monitoring dashboards.
- Collaborate with DevOps teams to integrate CI/CD pipelines and infrastructure-as-code (Terraform).
- Manage Kubernetes clusters at scale, ensuring resource optimization and security compliance.
Qualifications
- 5+ years of experience in SRE, Systems Engineering, or DevOps roles.
- Advanced proficiency with Docker and container orchestration, particularly Kubernetes.
- Strong expertise in infrastructure automation (Terraform, Ansible, or Pulumi).
- Proven experience with cloud providers (AWS preferred) and networking concepts.
- Experience with observability stacks like Prometheus, Grafana, or Datadog.
- Deep understanding of distributed systems and microservices architectures.
- Bachelor's degree in Computer Science, Engineering, or equivalent practical experience.