Job Description
Are you obsessed with system uptime, performance at scale, and automating the mundane? NexusCloud Systems is looking for a Senior Site Reliability Engineer to join our core infrastructure team in San Francisco. You will play a pivotal role in ensuring our global cloud platform remains resilient, performant, and secure as we navigate a period of hyper-growth.
We foster a culture of blameless post-mortems, deep technical curiosity, and collaborative problem-solving. If you thrive on solving complex distributed systems challenges, this is the environment for you.
Responsibilities
- Design and maintain high-availability distributed systems on public cloud infrastructure (AWS/GCP).
- Automate infrastructure provisioning and configuration management using Terraform and Ansible.
- Manage incident response workflows and conduct deep-dive post-mortems to improve platform reliability.
- Optimize system performance and cost through rigorous observability, monitoring, and capacity planning.
- Develop and maintain CI/CD pipelines to ensure rapid, safe, and reliable software deployment.
- Mentor junior engineers and promote best practices in software engineering and operations.
- Collaborate with cross-functional teams to bridge the gap between development and operations.
Qualifications
- 5+ years of experience in Site Reliability Engineering, DevOps, or large-scale Systems Engineering.
- Expertise in Linux system internals, networking, and security best practices.
- Proficiency in at least one high-level programming language (Go, Python, or Java).
- Hands-on experience with container orchestration platforms such as Kubernetes.
- Strong background in observability tools such as Prometheus, Grafana, Datadog, or the ELK stack.
- Proven ability to troubleshoot complex performance bottlenecks in microservices architectures.
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.