Job Description
Are you obsessed with system reliability and massive scale? Nexus Cloud Infrastructure is seeking a visionary Senior Site Reliability Engineer to join our core platform team in San Francisco. You will play a critical role in architecting, automating, and securing our high-traffic global infrastructure, ensuring 99.999% availability for our enterprise clients.
We operate at the intersection of software engineering and systems operations. If you thrive on solving complex distributed systems problems and value clean, maintainable code, this is your next career destination.
Responsibilities
- Design and maintain highly available, scalable distributed systems on public cloud infrastructure.
- Develop and implement robust CI/CD pipelines to streamline deployment workflows.
- Lead incident response and conduct blameless post-mortems to improve system resilience.
- Optimize cloud resource utilization to balance performance with cost-efficiency.
- Define and implement Service Level Objectives (SLOs) and Error Budgets for critical services.
- Collaborate with cross-functional teams to integrate security best practices (DevSecOps) into the infrastructure lifecycle.
Qualifications
- Bachelor’s degree in Computer Science or Engineering, or equivalent practical experience.
- 5+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Deep expertise in AWS or GCP cloud environments and infrastructure-as-code (Terraform/Pulumi).
- Proficiency in Go, Python, or Ruby for automation and tool development.
- Strong understanding of container orchestration platforms, specifically Kubernetes.
- Experience with observability stacks (Prometheus, Grafana, Datadog) and distributed tracing.
- Excellent communication skills with the ability to lead technical discussions across teams.