Job Description
Are you an expert in architecting highly scalable, fault-tolerant infrastructure? NexusCloud Systems is seeking a Senior Site Reliability Engineer to join our core platform team in San Francisco. You will play a pivotal role in optimizing our global cloud footprint, ensuring 99.999% uptime, and pioneering our infrastructure-as-code (IaC) strategy.
We foster a culture of blameless post-mortems, continuous integration, and engineering excellence. If you are passionate about automation, performance tuning, and distributed systems, we want to hear from you.
Responsibilities
- Design, implement, and maintain robust infrastructure-as-code (IaC) using Terraform and Pulumi.
- Manage large-scale Kubernetes clusters across multi-cloud environments.
- Lead incident response and perform deep-dive root cause analysis for production outages.
- Develop automated monitoring, alerting, and logging systems to ensure proactive issue detection.
- Collaborate with development teams to optimize application performance and CI/CD deployment pipelines.
- Participate in an on-call rotation to maintain system reliability and performance SLAs.
- Mentor junior engineers and advocate for SRE best practices across the engineering department.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.
- 5+ years of experience in SRE, DevOps, or Software Engineering roles.
- Advanced proficiency with AWS, GCP, or Azure, and with container orchestration using Kubernetes.
- Strong coding skills in Go, Python, or Ruby.
- Deep understanding of Linux internals, networking, and distributed systems.
- Proven experience managing high-traffic production environments with strict SLOs/SLIs.
- Excellent troubleshooting skills and the ability to thrive in a fast-paced, collaborative environment.