Osmosis (YC W25)

1 Job

8 Employees

About the Company

Osmosis is a reinforcement fine-tuning platform that helps companies create task-specific models that outperform foundation models at a fraction of the cost.

Listed Jobs

Company Name: Osmosis (YC W25)
Job Title: Machine Learning Engineer
Job Description:

**Role Summary**
Design, develop, and maintain highly scalable distributed reinforcement learning (RL) training infrastructure. Lead the implementation of novel RL algorithms, optimize GPU utilization, and build post‑training pipelines that support continual learning. Engage directly with customers for production deployments and custom model development in a fast‑paced, customer‑driven environment.

**Expectations**
- Deliver robust, production‑grade distributed RL systems that outperform baseline foundation models on performance, latency, and cost.
- Drive optimization of GPU resource allocation and system efficiency.
- Collaborate with cross‑functional teams to integrate new RL algorithms into existing pipelines.
- Provide technical leadership to customers, translating business requirements into scalable ML solutions.

**Key Responsibilities**
- Implement state‑of‑the‑art RL algorithms and scalable post‑training pipelines using PyTorch, Megatron‑LM, or similar frameworks.
- Design and manage resource allocation frameworks for dynamic GPU utilization across AWS Fargate, Kubernetes, and SageMaker.
- Build and maintain micro‑service backends (Python FastAPI, Go) for training orchestration.
- Develop customer‑facing tooling and APIs to support custom model deployments and performance monitoring.
- Optimize codebases for low‑latency inference (e.g., vLLM, SGLang) and large‑scale training (e.g., FSDP).
- Collaborate with data engineers to manage datasets in S3 and DynamoDB.

**Required Skills**
- Strong background in reinforcement learning and experience deploying RL at scale.
- Expertise in distributed training with PyTorch (including FSDP) and large‑scale language model frameworks (Megatron‑LM, SkyRL).
- Proficiency in Python (FastAPI, PyTorch) and Go for backend services; familiarity with React, TypeScript, and Next.js is a plus.
- Deep understanding of cloud infrastructure: AWS (SageMaker, Fargate), Docker, Kubernetes.
- Skill in GPU resource management, dynamic scheduling, and system optimization.
- Solid knowledge of data storage services (S3, DynamoDB) and CI/CD pipelines.
- Strong analytical, problem‑solving, and communication skills for customer‑facing interactions.

**Required Education & Certifications**
- Bachelor’s or Master’s degree in Computer Science, Machine Learning, or a related field.
- 3+ years of proven experience in ML engineering with a focus on RL or large‑scale neural network training.
- Certifications in cloud technologies (e.g., AWS Certified DevOps Engineer, AWS Certified Machine Learning Specialty) are advantageous but not mandatory.

San Francisco, United States

On site

13-01-2026