Job Description
At NexusCloud, we are redefining global infrastructure scalability. We are seeking a Senior Site Reliability Engineer to join our core platform team. You will own our uptime, performance, and automation strategies, keeping our distributed systems reliable through rapid growth.
This is a high-impact role where you will bridge the gap between software development and operations, fostering a culture of reliability while leveraging cutting-edge cloud-native technologies.
Responsibilities
- Architect and maintain highly available distributed systems on GCP/AWS.
- Lead incident response and conduct blameless post-mortems to improve system resilience.
- Implement Infrastructure-as-Code (IaC) using Terraform and Crossplane.
- Automate manual operational workflows to increase engineering velocity.
- Define and track Service Level Objectives (SLOs) and Error Budgets.
- Collaborate with product teams to embed SRE principles into the SDLC.
- Mentor junior engineers on best practices for performance tuning and monitoring.
Qualifications
- Bachelor’s degree in Computer Science or equivalent practical experience.
- 5+ years of experience in SRE, DevOps, or Software Engineering roles.
- Advanced proficiency in Go, Python, or Java.
- Deep expertise in Kubernetes, Docker, and service mesh technologies (Istio/Linkerd).
- Proven track record of managing large-scale production environments on public cloud providers.
- Strong background in observability tools such as Prometheus, Grafana, and Datadog.
- Excellent communication skills with the ability to explain complex technical concepts to non-technical stakeholders.