Job Description
Are you obsessed with system uptime, performance at scale, and automating the mundane? NexusCloud Systems is looking for a Senior Site Reliability Engineer to join our core infrastructure team in San Francisco. You will play a pivotal role in designing, building, and maintaining our high-availability cloud platforms, ensuring that millions of users enjoy a seamless experience every day.
We prioritize engineering solutions over manual intervention and seek an SRE who thrives on tackling complex architectural challenges in a fast-paced environment.
Responsibilities
- Design and manage highly scalable, distributed systems hosted on AWS or GCP.
- Drive capacity planning, performance tuning, and infrastructure optimization.
- Automate infrastructure provisioning using Infrastructure as Code (Terraform, Ansible).
- Implement advanced monitoring, logging, and alerting strategies to improve observability.
- Lead incident response and conduct blameless post-mortems.
- Collaborate with development teams to integrate CI/CD best practices.
- Mentor junior engineers on reliability engineering standards and best practices.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.
- 5+ years of experience in SRE, DevOps, or Software Engineering roles.
- Deep expertise in Linux systems administration and container orchestration (Kubernetes).
- Proficiency in scripting or programming languages (Python, Go, or Ruby).
- Hands-on experience with cloud infrastructure (AWS or GCP) and networking protocols.
- Strong problem-solving skills and the ability to debug complex issues across the stack.
- Excellent communication skills with a collaborative, growth-oriented mindset.