- Company Name
- NVIDIA
- Job Title
- Senior Site Reliability Engineer - DGX Cloud
- Job Description
**Job Title**
Senior Site Reliability Engineer – DGX Cloud
**Role Summary**
Lead the design, implementation, and operation of large‑scale, high‑availability GPU cloud services on Kubernetes and related cloud platforms. Drive automation, performance tuning, capacity planning, and incident management to ensure maximum reliability and uptime while enabling rapid, reliable deployment of new features.
**Expectations**
- Deliver end‑to‑end reliability for internal and external GPU cloud services.
- Reduce manual operational work through automation and tooling.
- Proactively identify and mitigate potential outages, and conduct blameless post‑mortems.
- Participate in the on‑call rotation and own the incident response lifecycle.
- Communicate clearly with cross‑functional teams and influence architecture decisions.
**Key Responsibilities**
- Design, build, and maintain operational aspects of large‑scale Kubernetes clusters, focusing on performance, monitoring, logging, and alerting.
- Lead service lifecycle activities, from architecture review and pre‑launch capacity planning through deployment and continuous refinement.
- Develop and maintain tools, platforms, and frameworks that support distributed system operations.
- Measure and monitor availability, latency, and overall health of live services.
- Automate scaling, configuration, and deployment processes to sustain high‑scale growth.
- Conduct post‑mortems and implement post‑incident improvements.
- Support production systems through an on‑call rotation and rapid incident response.
**Required Skills**
- 10+ years of experience in infrastructure automation and distributed system design.
- Proficient in Python, Go, Perl, or Ruby; strong coding and debugging skills.
- Deep knowledge of Linux, networking, and containerization (Docker/Kubernetes).
- Experience with public/private cloud environments (Kubernetes, OpenStack, Docker).
- Expertise in capacity management, performance tuning, and automated deployment pipelines.
- Excellent problem‑solving and communication skills, with an ownership mindset.
**Required Education & Certifications**
- Bachelor’s degree in Computer Science, Engineering, or related technical field, or equivalent professional experience.
Santa Clara, United States
On-site
Senior
03-11-2025