Job Specifications
Director of Machine Learning (Healthcare AI)
Location: New York City, on-site (hybrid considered)
Employment: Full-time | Team: Founding/Engineering Leadership
Compensation: Competitive base + meaningful equity
Why this role
We're building the machine intelligence that powers a safe, scalable AI doctor. As Director of Machine Learning, you'll own the ML roadmap end to end, shipping production systems that deliver diagnostic reasoning, chronic disease management, and HIPAA-compliant data processing at scale. You'll set the technical bar, hire and mentor the team, and harden the safety rails that make medical AI trustworthy.
What you'll lead (zero to one to scale)
ML Strategy & Org Building
Define the ML/AI vision, architecture standards, and research-to-production pipeline.
Build and lead a high-performing team of ML engineers, applied scientists, and MLOps engineers.
Establish the SDLC for models: evaluation, safety, monitoring, rollback, and post-incident learning.
Clinical Reasoning & LLM Systems
Own multi-agent reasoning (debate/consensus) and tool-use policies for clinical tasks.
Scale retrieval-augmented generation (RAG) over thousands of clinical guidelines, with provenance and audit trails.
Drive prompt/program synthesis, fine-tuning, and distillation for low-latency inference.
Data & Evaluation
Stand up HIPAA-compliant data pipelines: de-identification, labeling, weak supervision, and active learning.
Define gold-standard evals with clinicians: accuracy, safety, fairness, and explanation quality.
Build offline/online experiment frameworks (A/B, counterfactual, shadow deploys).
Safety & Compliance
Implement guardrails: contraindication checks, uncertainty calibration, human-in-the-loop escalation.
Detect bias and drift across demographics and care pathways; maintain model cards and documentation.
Partner with compliance on FDA/IEC considerations and real-world performance tracking.
Platform & Scale
Collaborate with Platform/Backend to deliver sub-second inference at peak load.
Architect multi-region, fault-tolerant model serving with deterministic backstops.
Align with product on domain workflows, clinician tooling, and cost/performance tradeoffs.
Responsibilities
Translate ambiguous clinical problems into prioritized ML roadmaps with clear success metrics.
Ship production models (LLMs + classical ML) for triage, summarization, document generation, and decision support.
Lead design reviews; author high-signal design docs and experiment plans.
Partner with Backend to expose safe, well-versioned inference APIs and feature stores.
Create an evidence pipeline: offline eval, shadow deployment, gated release, then continuous monitoring.
Recruit, coach, and level up the team; establish hiring rubrics and technical standards.
Communicate progress and risk to execs, clinicians, and cross-functional partners.
Qualifications (must-have)
8-12+ years in ML/AI with 3-5+ years leading teams; startup or zero-to-one experience.
Deep hands-on experience with LLMs/gen-AI (OpenAI, Anthropic, LLaMA, etc.), RAG, fine-tuning, and optimization.
Strong applied track record shipping real-time production ML systems at scale.
Proficiency in Python and PyTorch/TensorFlow; solid software engineering fundamentals.
Comfortable with MLOps: feature stores, model registries, CI/CD for models, observability, canary/rollback.
Demonstrated work in safety-critical contexts: uncertainty, bias/fairness, post-deployment monitoring.
Excellent written/spoken communication; proven ability to partner with product, clinical, and platform teams.
Nice to have
Healthcare/med-tech experience; familiarity with HL7, FHIR, EHR integrations, and clinical workflows.
Publications/patents in NLP, clinical ML, safety, or multi-agent systems.
Experience with FDA SaMD guidance, IEC 62304/82304, or similar frameworks.
Background in multilingual NLP, speech (ASR/TTS), or predictive risk modeling.
Prior seed/Series A leadership; hiring from scratch and scaling teams.
What success looks like
90 days:
ML strategy, safety gates, and evaluation plan in place.
First high-impact model shipped behind a feature flag with shadow testing and clinician review.
180 days:
Multi-agent reasoning + RAG stack serving production traffic with measurable lifts in accuracy and time-to-answer.
End-to-end monitoring (quality, bias, cost) live; on-call and incident playbooks operational.
12 months:
ML org of 6-12 high performers; roadmap delivering quarterly value.
Documented safety and performance portfolio suitable for payer/provider and regulatory conversations.
Our stack (indicative)
Modeling: PyTorch, Transformers, vLLM/ONNX/TensorRT, LoRA/QLoRA, vector DBs (FAISS/pgvector)
Data/MLOps: Python, Airflow/Prefect, Spark/Ray, Feast, MLflow/W&B, Kubernetes, Argo, Docker
Serving: FastAPI/gRPC, Kafka/Redpanda, Redis, OpenTelemetry, Prometheus/Grafana
Cloud: AWS/GCP, Terraform, multi-region deployments
Security/Compliance: PHI de-identification, KMS/HSM, policy engines, audit logging
What we offer
Fo