Job Description
At NexusCloud, we are architecting the next generation of global cloud infrastructure. We are looking for a visionary Senior Site Reliability Engineer to join our core engineering team in San Francisco. You will be the guardian of our platform, focusing on scalability, automation, and reliability to ensure a seamless experience for millions of users worldwide.
This is a high-impact role where your work directly influences the architectural roadmap. If you are passionate about distributed systems, observability, and infrastructure-as-code, we want to hear from you.
Responsibilities
- Architect, build, and maintain highly available, scalable, and resilient distributed systems.
- Lead incident response and perform deep-dive post-mortems to identify root causes and prevent recurrence.
- Automate operational tasks, deployments, and infrastructure provisioning using CI/CD pipelines.
- Optimize cloud costs and resource utilization without compromising system performance.
- Develop and maintain comprehensive monitoring, logging, and alerting strategies to ensure system health.
- Mentor junior engineers and promote a culture of operational excellence and engineering best practices.
- Collaborate closely with product and development teams to ship secure and performant features.
Qualifications
- 5+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Strong proficiency with cloud providers (AWS, GCP, or Azure).
- Expertise in Infrastructure-as-Code tools such as Terraform, CloudFormation, or Pulumi.
- Deep experience with container orchestration platforms (such as Kubernetes or ECS).
- Advanced scripting skills in Python, Go, or Ruby.
- Solid understanding of observability stacks (Prometheus, Grafana, ELK, Datadog).
- Proven ability to troubleshoot complex performance issues in a microservices environment.