Job Description
Are you obsessed with uptime, performance, and building resilient systems at scale? NexusScale is looking for a Senior Site Reliability Engineer to join our core infrastructure team. In this role, you will bridge the gap between development and operations, implementing automated solutions to ensure our global platform remains lightning-fast and highly available. You will work on cutting-edge Kubernetes clusters, multi-cloud architectures, and observability platforms that drive our mission forward.
Responsibilities
- Design and maintain highly available, scalable infrastructure on AWS and GCP.
- Automate operational tasks through infrastructure-as-code (Terraform, Pulumi).
- Conduct blameless post-mortems and lead incident response for high-severity outages.
- Develop and refine CI/CD pipelines to accelerate deployment velocity.
- Implement advanced monitoring, logging, and tracing solutions (Prometheus, Grafana, ELK).
- Proactively identify performance bottlenecks and capacity constraints.
- Collaborate with engineering teams to improve system architecture and reliability patterns.
Qualifications
- 5+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Expertise in containerization and container orchestration (Docker, Kubernetes).
- Proficiency in at least one programming or scripting language (Python, Go, or Ruby).
- Deep understanding of distributed systems and cloud-native architecture.
- Experience with IaC tools like Terraform or CloudFormation.
- Strong background in Linux internals, networking protocols, and security best practices.
- Proven ability to manage and troubleshoot large-scale production environments.