- Company Name: GEICO
- Job Title: Machine Learning Engineer
- Job Description:
**Job Title**
Senior Machine Learning Engineer
**Role Summary**
Design, build, and maintain scalable ML platforms for training, fine‑tuning, and serving large language models (LLMs) on Azure. Lead platform reliability, cost optimization, and DevOps practices while mentoring a growing engineering team.
**Expectations**
Deliver high‑availability, secure, and cost‑effective ML infrastructure; mentor junior staff; drive best‑practice adoption across the organization; collaborate cross‑functionally with data scientists, product, and research teams.
**Key Responsibilities**
- Design and implement scalable infrastructure for LLM training, fine‑tuning, and inference (Llama, Mistral, Gemma, etc.).
- Architect and manage Kubernetes clusters (AKS) with GPU scheduling, autoscaling, and resource optimization.
- Build and maintain feature stores for model training and inference pipelines.
- Develop LLM inference systems using vLLM, TensorRT‑LLM, and custom serving solutions.
- Ensure ≥99.9% uptime through robust monitoring, alerting, and incident response (Prometheus, Grafana, Azure Monitor).
- Create and maintain CI/CD pipelines (Azure DevOps, GitHub Actions, MLOps tools).
- Optimize platform performance and cost across Azure regions, and evaluate hybrid cloud options (AWS/GCP).
- Implement security, compliance, backup, and disaster‑recovery plans.
- Mentor engineers and data scientists on platform design, coding standards, and MLOps best practices.
- Document runbooks and technical onboarding material, and deliver internal training sessions.
- Collaborate with product engineers to embed ML capabilities into customer‑facing applications.
**Required Skills**
- Python (proficient); Go, Rust, or Java (preferred).
- Open‑source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama).
- Kubernetes (Helm, operators, GPU scheduling).
- Azure services: AKS, Azure ML, Container Registry, Storage, Networking.
- Feature store experience (Chronon, Feast, Tecton, Azure ML Feature Store).
- Inference optimization (vLLM, TensorRT‑LLM).
- Infrastructure as Code: Terraform, ARM templates.
- Monitoring tools: Prometheus, Grafana, Azure Monitor.
- Strong written and verbal communication; proven mentoring abilities.
**Required Education & Certifications**
- Bachelor’s degree in Computer Science, Engineering, or related technical field (or equivalent experience).
- 5+ years of software engineering with an infrastructure or MLOps focus; 2+ years of large‑scale ML deployment; 1+ year of LLM experience.