Job Description
Are you obsessed with system uptime, latency, and scalable architecture? NexusCloud Systems is looking for a Senior Site Reliability Engineer to join our core infrastructure team. In this role, you will be the bridge between development and operations, ensuring our global services remain resilient, performant, and secure.
We operate at massive scale and favor automation over manual intervention. You will have the autonomy to shape the future of our cloud infrastructure and help us scale to meet the demands of millions of users.
Responsibilities
- Design, build, and maintain highly available, distributed cloud infrastructure on AWS.
- Automate operational tasks using Go, Python, or Terraform to eliminate manual toil.
- Proactively identify and resolve performance bottlenecks across the stack.
- Participate in a regular on-call rotation that supports our 99.99% service uptime target.
- Lead post-mortem analysis and implement systemic improvements to prevent recurrence.
- Collaborate with product engineering teams to optimize cloud costs and resource allocation.
- Implement robust monitoring, alerting, and observability frameworks.
Qualifications
- 5+ years of experience in SRE, DevOps, or large-scale systems engineering.
- Expert-level proficiency in AWS services (EKS, RDS, DynamoDB, S3).
- Strong coding skills in Go, Python, or similar languages for automation.
- Deep understanding of Kubernetes and container orchestration patterns.
- Hands-on experience with Infrastructure as Code (Terraform or Pulumi).
- Excellent communication skills with the ability to lead cross-functional incident responses.
- Proven track record of improving system reliability in a high-traffic production environment.