Job Description
Are you obsessed with system reliability, performance optimization, and scalable architecture? NexusCloud Systems is looking for a Senior Site Reliability Engineer to join our high-impact platform team. You will be the architect behind our mission-critical infrastructure, ensuring 99.99% uptime for our global enterprise clients.
We operate in a modern, cloud-native environment where automation is the default. If you enjoy solving complex distributed systems challenges and fostering an environment of engineering excellence, this role is for you.
Responsibilities
- Design, implement, and maintain highly available, fault-tolerant infrastructure on AWS/GCP.
- Drive capacity planning and performance tuning to ensure seamless scalability.
- Automate infrastructure provisioning and configuration management using Terraform and Ansible.
- Champion SRE best practices, including error budget management and blameless post-mortems.
- Collaborate with engineering teams to improve CI/CD pipelines and deployment velocity.
- Monitor system health and respond to high-priority production incidents.
- Develop internal tools to improve developer productivity and system observability.
Qualifications
- 5+ years of experience in Site Reliability Engineering, DevOps, or Software Engineering.
- Expert-level proficiency with Docker and container orchestration on Kubernetes.
- Strong coding skills in Go, Python, or Ruby.
- Deep understanding of distributed systems, networking (TCP/IP, DNS, load balancing), and Linux internals.
- Proven track record of managing large-scale cloud infrastructure (AWS preferred).
- Experience with observability stacks like Prometheus, Grafana, ELK, or Datadog.
- Strong communication skills and a passion for mentoring junior team members.