- Company Name
- Greylock Partners
- Job Title
- Machine Learning Infrastructure Engineers (Multiple Opportunities)
- Job Description
-
Job Title: Machine Learning Infrastructure Engineer
Role Summary: Design, build, and operate scalable, reliable infrastructure to support machine learning workloads across multiple startup investments.
Expectations: 3+ years of industry experience in ML infrastructure or AI engineering; ability to collaborate with data science and product teams; strong background in distributed systems and cloud platforms.
Key Responsibilities:
• Architect and maintain production-grade ML pipelines (data ingestion, preprocessing, model training, inference).
• Design and deploy scalable, fault‑tolerant compute and storage solutions (containers, Kubernetes, GPU clusters, object storage).
• Automate CI/CD for ML workflows, including model versioning, testing, and rollout.
• Monitor performance, resource utilization, and reliability; implement metrics and alerts, and tune systems based on what they show.
• Collaborate with security, compliance, and operations teams to enforce best practices.
• Optimize cost and performance for large‑scale ML workloads.
Required Skills:
• Proficiency in distributed systems concepts (networking, fault tolerance, load balancing).
• Hands‑on experience with cloud providers (AWS, GCP, Azure) and container orchestration (Kubernetes).
• Experience with infrastructure-as-code tools (Terraform, CloudFormation, Pulumi).
• Scripting/automation skills (Python, Bash, Go).
• Knowledge of ML frameworks and tooling (TensorFlow, PyTorch, MLflow) and data pipelines (Spark, Beam).
• Familiarity with monitoring/observability tools (Prometheus, Grafana, ELK).
• Strong debugging and troubleshooting abilities.
Required Education & Certifications:
• Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related field.
• Certifications such as AWS Certified Solutions Architect, GCP Professional Cloud Architect, or similar are a plus.
San Francisco Bay Area, United States
Hybrid
Junior
18-11-2025