Job Description
Are you obsessed with system uptime, performance at scale, and automating the impossible? NexusCloud Systems is seeking a world-class Senior Site Reliability Engineer to join our core infrastructure team. In this role, you will bridge the gap between development and operations, building robust, self-healing systems that power our global platform.
You will work with a modern tech stack, influence architectural decisions, and champion a culture of reliability. We are looking for an engineer who treats infrastructure as code and believes that manual intervention is a bug to be squashed.
Responsibilities
- Architect and maintain highly available, scalable, and secure cloud infrastructure on AWS.
- Automate operational workflows using Terraform, Ansible, and Python/Go.
- Lead incident response efforts and conduct blameless post-mortems to improve system resilience.
- Optimize cloud resource utilization to balance performance with cost-efficiency.
- Develop and maintain monitoring, alerting, and observability frameworks (Prometheus/Grafana/Datadog).
- Collaborate with engineering squads to integrate CI/CD best practices into the development lifecycle.
Qualifications
- 5+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Expert-level proficiency in AWS cloud services and Kubernetes orchestration.
- Strong coding skills in Python, Go, or Ruby for automation and tool development.
- Deep understanding of Linux internals, networking, and distributed systems.
- Proven experience with Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
- Ability to participate in an on-call rotation and handle complex system troubleshooting.
- Strong communication skills and a passion for mentoring junior team members.