Senior Site Reliability Engineer

Senior Site Reliability Engineer - NVIDIA

Join our team in Santa Clara, CA, USA as a Senior Site Reliability Engineer. At NVIDIA, you'll be part of the team shaping the future of computing and guaranteeing the smooth operation of our brand-new technologies. Our mission is to leverage AI's power to build outstanding and pioneering solutions that have a significant impact on the world.

What you'll be doing:

Own the solutions you build, collaborating with cross-functional teams to successfully implement them.

Collaborate with various teams in a fast-paced environment to ensure seamless project completion.

Continuously improve solution provisioning and management through automation.

Identify areas to improve service resiliency using industry-standard practices.

Detect performance issues and recommend solutions to maintain world-class service quality.

Conduct capacity management and planning to meet ongoing operational needs.

Participate in incident reviews, assist in root cause identification, and write RCA reports.

Deliver SRE solutions in a globally distributed, multi-cloud hybrid environment spanning AWS, GCP, and on-premises infrastructure.

Ensure the highest level of uptime and Quality of Service (QoS) for internal customers through operational excellence.

Participate in the team's on-call rotation.

What we need to see:

B.S. degree in Computer Science or a related technical field (or equivalent experience), with over 10 years of experience building and supporting critical services.

Proficiency in Kubernetes administration, modern CI/CD techniques, and Infrastructure as Code (IaC).

Deep understanding of Linux operating systems and TCP/IP fundamentals.

Expertise with at least one major cloud service provider, such as AWS, GCP, or Azure.

Demonstrated proficiency with end-to-end SRE capabilities and observability.

Proficient in monitoring, metrics gathering, APM, container management, and log collection tools.

5+ years of coding/scripting experience in at least two high-level programming languages such as Python, Go, Ruby, or Groovy.

Creative problem solver with excellent debugging skills and great communication and documentation abilities.

Ways to stand out from the crowd:

Linux certification from a well-known vendor such as Red Hat or Oracle.

Prior experience managing large-scale Kubernetes deployments in production.

Strong skills in modern container networking and storage architecture.

Well-known cloud certification(s).

Hands-on experience working with Slurm/LSF environments.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.
