Job Description
At NexusScale, we are architecting the backbone of modern cloud infrastructure. We are seeking a passionate Senior Site Reliability Engineer to join our high-impact team in San Francisco. You will be instrumental in bridging the gap between development and operations, ensuring our global services remain resilient, performant, and scalable.
If you are obsessed with automation, system observability, and engineering excellence, we want to hear from you.
Responsibilities
- Design, implement, and maintain highly available distributed systems on GCP/AWS.
- Automate infrastructure provisioning using Terraform and CI/CD pipelines.
- Drive incident management processes and lead post-mortem analysis to identify root causes.
- Optimize service performance and latency through capacity planning and load testing.
- Implement proactive monitoring and alerting strategies to minimize MTTR.
- Collaborate with engineering teams to integrate reliability best practices into the SDLC.
- Participate in an on-call rotation to maintain 99.99% service uptime.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related technical field.
- 5+ years of experience in Site Reliability Engineering or DevOps roles.
- Advanced proficiency in Python, Go, or Ruby for automation and tool development.
- Deep expertise in container orchestration using Kubernetes (EKS/GKE).
- Strong knowledge of Linux systems architecture and networking (TCP/IP, DNS, load balancing).
- Experience with observability tools such as Prometheus, Grafana, or Datadog.
- Proven ability to troubleshoot complex issues in large-scale microservices environments.