Job Description
Are you obsessed with uptime, performance, and building resilient systems at scale? Nexus Cloud Systems is seeking a visionary Senior Site Reliability Engineer to join our core infrastructure team in San Francisco. You will play a pivotal role in bridging the gap between development and operations, ensuring our high-traffic microservices architecture remains bulletproof, scalable, and efficient.
You will work alongside elite engineers to shape the future of our cloud-native infrastructure, leveraging cutting-edge tools to automate, monitor, and optimize our distributed systems.
Responsibilities
- Architect and maintain highly available, scalable cloud infrastructure on AWS and Kubernetes.
- Automate operational tasks and infrastructure provisioning using Terraform and CI/CD pipelines.
- Lead incident response efforts, conduct blameless post-mortems, and implement long-term fixes.
- Implement observability solutions to gain deep insights into system performance and capacity.
- Collaborate with development teams to embed reliability practices into the software development lifecycle.
- Optimize cloud resource utilization to balance performance with cost-efficiency.
- Define and track Service Level Objectives (SLOs) and Error Budgets to ensure a superior user experience.
Qualifications
- Bachelor’s or Master’s degree in Computer Science or Engineering, or equivalent practical experience.
- 5+ years of experience in SRE, DevOps, or Software Engineering roles.
- Deep expertise in Kubernetes, Docker, and container orchestration at scale.
- Proficiency in Go, Python, or Ruby for automation and tool development.
- Strong background in cloud platforms (AWS preferred) and Infrastructure-as-Code (Terraform/CloudFormation).
- Experience with monitoring and observability stacks such as Prometheus, Grafana, or Datadog.
- Solid understanding of Linux internals, networking protocols, and distributed system architectures.