Job Description

Are you obsessed with system reliability, performance optimization, and scalable architecture? NexusCloud Systems is looking for a Senior Site Reliability Engineer to join our high-impact platform team. You will be the architect behind our mission-critical infrastructure, ensuring 99.99% uptime for our global enterprise clients.
We operate in a modern, cloud-native environment where automation is the default. If you enjoy solving complex distributed systems challenges and fostering an environment of engineering excellence, this role is for you.

Responsibilities

Design, implement, and maintain highly available, fault-tolerant infrastructure on AWS/GCP.
Drive capacity planning and performance tuning to ensure seamless scalability.
Automate infrastructure provisioning and configuration management using Terraform and Ansible.
Champion SRE best practices, including error budget management and blameless post-mortems.
Collaborate with engineering teams to improve CI/CD pipelines and deployment velocity.
Monitor system health and respond to high-priority production incidents.
Develop internal tools to improve developer productivity and system observability.

Qualifications

5+ years of experience in Site Reliability Engineering, DevOps, or Software Engineering.
Expert-level proficiency with Docker and container orchestration on Kubernetes.
Strong coding skills in Go, Python, or Ruby.
Deep understanding of distributed systems, networking (TCP/IP, DNS, Load Balancing), and Linux internals.
Proven track record of managing large-scale cloud infrastructure (AWS preferred).
Experience with observability stacks like Prometheus, Grafana, ELK, or Datadog.
Strong communication skills and a passion for mentoring junior team members.
