Job Description
Are you obsessed with system performance, scalability, and uptime? NexusCloud Systems is looking for a Senior SRE to join our infrastructure team in the heart of San Francisco. You will play a critical role in building and maintaining the resilient, high-traffic systems that power our global platform.
You will work at the intersection of software engineering and operations, using modern observability tools to keep our services performant and reliable at scale. If you thrive in a high-growth environment and enjoy solving complex distributed systems problems, we want to hear from you.
Responsibilities
- Design, build, and maintain highly available, scalable, and resilient distributed systems.
- Automate infrastructure management using Infrastructure as Code (IaC) tools like Terraform and Pulumi.
- Drive capacity planning, performance tuning, and system optimization initiatives.
- Lead incident response efforts and conduct blameless post-mortems to improve system reliability.
- Develop and maintain monitoring, logging, and alerting frameworks that support 99.99% uptime.
- Partner with software engineering teams to improve deployment pipelines and CI/CD automation.
- Mentor junior team members and foster a culture of engineering excellence.
Qualifications
- 5+ years of experience in Site Reliability Engineering, DevOps, or Systems Engineering.
- Expert-level proficiency in public cloud environments (AWS, GCP, or Azure).
- Deep knowledge of Kubernetes, container orchestration, and microservices architecture.
- Fluency in programming languages such as Go, Python, or Java for automation and tooling.
- Experience with observability platforms like Prometheus, Grafana, Datadog, or New Relic.
- Strong understanding of CI/CD methodologies and tools (GitHub Actions, Jenkins, or GitLab CI).
- Ability to thrive in a collaborative, fast-paced work culture, whether remote-first or hybrid.