- Company Name
- DeepRec.ai
- Job Title
- Senior ML Infra Engineer
- Job Description
-
Job title: Senior Machine Learning Infrastructure Engineer
Role Summary: Own, build, and scale end‑to‑end ML infrastructure for physics‑based foundation models in a fast‑moving AI startup. Drive production‑grade training, fine‑tuning, serving, and data pipelines across cloud and on‑prem environments, partnering closely with customers and executive leadership.
Expectations: At least 3 years of experience designing and deploying scalable ML infrastructure, with proven proficiency in AWS/GCP/Azure, Kubernetes, Docker, IaC, and distributed training frameworks. Strong Python, debugging, and execution skills. Optional: a background in physics or simulation, experience with regulated deployments and GPU optimization, and open‑source contributions.
Key Responsibilities:
- Architect and manage multi‑GPU/multi‑node distributed training and fine‑tuning clusters.
- Design low‑latency, highly reliable inference and model serving systems.
- Build secure, automated fine‑tuning pipelines for customer data workflows.
- Deploy ML solutions across cloud and on‑prem (including enterprise/air‑gapped) environments.
- Construct data pipelines for large‑scale simulation and CFD datasets.
- Implement observability, monitoring, and debugging across training, serving, and data pipelines.
- Collaborate directly with customers on deployment, integration, and scaling.
- Rapidly transition prototypes to production‑grade infrastructure.
Required Skills:
- Distributed training frameworks (PyTorch Distributed, DeepSpeed, Ray, etc.)
- Cloud platforms (AWS, GCP, Azure) and hybrid deployment environments
- Kubernetes, Docker, and infrastructure‑as‑code tools (e.g., Terraform, Helm)
- Python programming and an understanding of the end‑to‑end ML lifecycle
- Distributed systems, networking, and security fundamentals
- Debugging, performance tuning, and scaling experience
- Strong communication and collaboration skills, and the ability to own work independently
Required Education & Certifications:
- Bachelor’s (or higher) in Computer Science, Engineering, or related STEM field (or equivalent experience)
---
San Francisco, United States
On-site
Senior
16-03-2026