Job Description
Are you obsessed with system uptime, performance at scale, and the art of automation? NexusCloud Systems is seeking a Senior Site Reliability Engineer to join our core infrastructure team in San Francisco. You will play a pivotal role in ensuring our global cloud architecture remains resilient, secure, and performant as we scale to millions of daily active users.
We leverage a modern stack including Kubernetes, Go, Terraform, and AWS. If you thrive in high-stakes environments and enjoy solving complex distributed systems problems, we want to talk to you.
Responsibilities
- Design and maintain high-availability systems to ensure 99.99% service uptime.
- Automate infrastructure provisioning and configuration management using Terraform and Ansible.
- Lead incident response and conduct blameless post-mortems for production outages.
- Develop and maintain monitoring, logging, and alerting systems using Prometheus and Grafana.
- Collaborate with software engineering teams to improve application performance and reliability.
- Optimize cloud infrastructure costs through proactive resource management.
- Establish and maintain rigorous security standards and compliance best practices.
Qualifications
- Bachelor’s degree in Computer Science or Engineering, or equivalent practical experience.
- 5+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Proficiency in Go, Python, or Ruby for infrastructure automation.
- Deep expertise in managing large-scale Kubernetes clusters in production.
- Strong background in cloud architecture, specifically AWS or GCP services.
- Solid understanding of CI/CD pipelines (GitHub Actions, Jenkins, or GitLab CI).
- Experience with observability tools (Datadog, New Relic, or Prometheus).
- Exceptional problem-solving skills and ability to thrive in a fast-paced environment.