Job Description
Are you obsessed with system reliability, performance, and automation? NexusCloud Solutions is seeking a Senior Site Reliability Engineer to join our core infrastructure team. In this role, you will bridge the gap between development and operations, ensuring our high-traffic global platforms remain scalable, resilient, and performant. You will be instrumental in driving our move toward complete infrastructure-as-code and cloud-native observability.
Responsibilities
- Design, build, and maintain highly available, scalable, and secure cloud infrastructure.
- Automate infrastructure provisioning and configuration management using Terraform and Ansible.
- Drive incident response and post-mortem analysis to maintain 99.99% system uptime.
- Implement and manage advanced monitoring, logging, and alerting systems (Prometheus, Grafana, ELK).
- Collaborate with cross-functional engineering teams to optimize application performance and CI/CD pipelines.
- Conduct regular capacity planning and load testing to ensure seamless peak-traffic handling.
- Mentor junior engineers and advocate for SRE best practices across the organization.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.
- 5+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Deep expertise in AWS/GCP cloud platforms and container orchestration (Kubernetes).
- Advanced proficiency in scripting languages such as Python, Go, or Bash.
- Proven experience with infrastructure-as-code tools like Terraform or Pulumi.
- Strong understanding of distributed systems, networking protocols (TCP/IP, DNS, HTTP), and load balancing.
- Excellent analytical, problem-solving, and communication skills in a fast-paced environment.