Job Description
Are you obsessed with system performance, scalability, and uptime? NexusCloud Systems is looking for a Senior Site Reliability Engineer to help us build and maintain high-traffic cloud infrastructure. In this role, you will bridge the gap between development and operations, ensuring our platform remains resilient in a fast-paced environment.
You will play a pivotal role in shaping our SRE culture, automating manual toil, and optimizing our AWS-based microservices architecture. If you thrive on solving complex distributed systems problems and championing reliability, we want to hear from you.
Responsibilities
- Design, build, and maintain highly available, scalable, and secure cloud infrastructure on AWS.
- Automate operational tasks through infrastructure-as-code (Terraform, Ansible) to reduce manual toil.
- Implement proactive monitoring, logging, and alerting strategies using Datadog and Prometheus.
- Lead incident response, root cause analysis, and post-mortem reviews to improve system reliability.
- Collaborate with cross-functional software engineering teams to optimize application performance.
- Develop and manage CI/CD pipelines to ensure seamless and reliable code deployments.
- Establish and maintain Service Level Objectives (SLOs) and error budgets for core services.
Qualifications
- 5+ years of experience in Site Reliability Engineering, DevOps, or Systems Engineering.
- Expert-level proficiency with the AWS ecosystem (EC2, EKS, RDS, S3, IAM).
- Strong coding skills in Python, Go, or Ruby for automation and tool development.
- Hands-on experience with Kubernetes orchestration and containerization (Docker).
- Advanced knowledge of Linux systems administration, networking, and security best practices.
- Proven ability to troubleshoot complex performance issues in distributed systems.
- Strong communication skills and a collaborative mindset for cross-team initiatives.