Skills

Communication, Leadership, Python, Java, Go, Rust, Incident Response, GitHub, CI/CD, DevOps, Kubernetes, Monitoring, Azure Kubernetes Service (AKS), Azure DevOps, Networking, Research, Training, Machine Learning, Azure, AWS, cloud platforms, GCP, Data Science, Large Language Models, CI/CD Pipelines, Terraform, Prometheus, Grafana, GitHub Actions

Job Specifications

At GEICO, we offer a rewarding career where your ambitions are met with endless possibilities.

Every day we honor our iconic brand by offering quality coverage to millions of customers and being there when they need us most. We thrive through relentless innovation to exceed our customers’ expectations while making a real impact for our company through our shared purpose.

When you join our company, we want you to feel valued, supported and proud to work here. That’s why we offer The GEICO Pledge: Great Company, Great Culture, Great Rewards and Great Careers.

The GEICO AI ML Infrastructure team is seeking an exceptional Senior ML Platform Engineer to build and scale our machine learning infrastructure with a focus on Large Language Models (LLMs) and AI applications. This role combines deep technical expertise in cloud platforms, container orchestration, and ML operations with strong leadership and mentoring capabilities. You will be responsible for designing, implementing, and maintaining scalable, reliable systems that enable our data science and engineering teams to deploy and operate LLMs efficiently at scale. The candidate must have excellent verbal and written communication skills and a proven ability to work both independently and in a team environment.

Key Responsibilities

ML Platform & Infrastructure

Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
Design, implement, and maintain feature stores for ML model training and inference pipelines
Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as a backup or for specialized use cases

DevOps & Platform Engineering

Design and maintain robust CI/CD pipelines for ML model deployment using Azure DevOps, GitHub Actions, and MLOps tools
Implement automated model training, validation, deployment, and monitoring workflows
Set up comprehensive observability using Prometheus, Grafana, Azure Monitor, and custom dashboards
Continuously optimize platform performance, reducing latency and improving throughput for ML workloads
Design and implement backup, recovery, and business continuity plans for ML platforms

Technical Leadership & Mentoring

Mentor junior engineers and data scientists on platform best practices, infrastructure design, and ML operations
Lead comprehensive code reviews focusing on scalability, reliability, security, and maintainability
Design and deliver technical onboarding programs for new team members joining the ML platform team
Establish and champion engineering standards for ML infrastructure, deployment practices, and operational procedures
Create technical documentation, runbooks, and deliver internal training sessions on platform capabilities

Cross-Functional Collaboration

Work closely with data scientists to understand requirements and optimize workflows for model development and deployment
Collaborate with product engineering teams to integrate ML capabilities into customer-facing applications
Support research teams with infrastructure for experimenting with cutting-edge LLM techniques and architectures
Present technical solutions and platform roadmaps to leadership and cross-functional stakeholders

Required Qualifications

Experience & Education

Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent experience)
5+ years of software engineering experience with a focus on infrastructure, platform engineering, or MLOps
2+ years of hands-on experience with machine learning infrastructure and deployment at scale
1+ years of experience working with Large Language Models and transformer architectures

Technical Skills - Core Requirements

Proficient in Python; strong skills in Go, Rust, or Java preferred
Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
Proficient in Kubernetes, including custom operators, Helm charts, and GPU scheduling
Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Hands-on experience with inference optimization using vLLM, TensorRT-LLM, Tri

About the Company

GEICO (Government Employees Insurance Company) offers a variety of insurance such as vehicle, property, business, life, umbrella, travel, pet, jewelry and more. The company, which was founded in 1936, is the third-largest auto insurer in the United States and insures vehicles in all 50 states and Washington, D.C. GEICO, a member of the Berkshire Hathaway family of companies, constantly strives to make lives better by protecting people against unexpected events while saving them money and providing an outstanding customer e...