Job Specification
Machine Learning Researcher – Pretraining Systems
Location: New York (relocation required); United States-based candidates preferred
Team: Advanced Modeling Research
Discipline: Pretraining Dynamics & Large-Scale Optimization
The Opportunity
Our client, which consistently attracts elite ML and DL researchers working with some of the largest datasets in the world, is building a small research group focused on frontier-scale pretraining, where systems design, data mixtures, and optimization dynamics converge. This role sits at the intersection of modeling, distributed training, and empirical science: understanding how scale transforms representation, and how to steer it.
You’ll operate as both experimentalist and theorist—running controlled ablations at scale, profiling training behaviors, and distilling the principles that make large models learn efficiently.
A PhD is highly valued for this role; however, high-performing candidates with a master's degree may be considered. We typically require 2–5 years of post-PhD experience.
Core Research Areas
Investigate pretraining objectives that enhance generalization, compositional reasoning, and long-horizon coherence.
Design data mixture experiments that balance entropy, redundancy, and signal—mapping mixture composition to model scaling efficiency.
Develop instrumentation for training dynamics (loss surfaces, gradient flow, activation distributions) to predict inflection points during pretraining.
Collaborate on distributed systems optimization—scheduling, sharding, and checkpointing for multi-node, high-throughput pretraining runs.
Explore representation diagnostics across model scales—alignment drift, retention, and capability formation.
Build evaluation harnesses for emergent behavior tracking—reasoning, tool-use proxy metrics, and temporal consistency tests.
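To give a flavor of the training-dynamics instrumentation described above, here is a minimal sketch: a toy gradient-descent loop that records the gradient norm at every step, the simplest signal one might log before moving on to loss surfaces or activation distributions. All names and hyperparameters here are illustrative, not drawn from any actual codebase.

```python
import numpy as np

def grad_and_norm(w, X, y):
    """Gradient of mean-squared error for a linear model, plus its L2 norm."""
    g = 2.0 / len(X) * X.T @ (X @ w - y)
    return g, float(np.linalg.norm(g))

# Toy synthetic regression problem (stand-in for a real training run)
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))
w_true = rng.normal(size=8)
y = X @ w_true + 0.1 * rng.normal(size=64)

w = np.zeros(8)
lr = 0.05
norms = []
for step in range(50):
    g, n = grad_and_norm(w, X, y)
    norms.append(n)  # record the per-step gradient norm before the update
    w -= lr * g
```

In a real pretraining run the same idea scales up: per-step norms (and richer statistics) are logged across parameter groups to anticipate instabilities and inflection points.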
Candidate Profile
We’re interested in individuals who’ve gone beyond running large models—those who’ve interpreted them. You may have:
Designed or scaled pretraining runs (10B+ parameters) or equivalent high-throughput distributed learning systems.
Authored or contributed to research in scaling laws, mixture sampling, or self-supervised pretraining.
Developed deep familiarity with distributed optimization frameworks (FSDP, DeepSpeed, Megatron-LM, JAX/TPU).
Demonstrated skill in profiling model behavior, from gradient noise scale to tokenization effects.
Run data-centric experiments: filtering, mixing, and quality assessment for large corpora.
Applied numerical rigor to questions of efficiency, throughput, and empirical reproducibility.
Technical Stack
Frameworks: PyTorch / JAX, custom distributed schedulers, FSDP / DeepSpeed / Megatron
Languages: Python, C++ (or similar, for profiling and system instrumentation)
Compute Scale: Multi-node clusters (A100/H100 class GPUs or TPUv4/5)
Data Systems: Versioned mixtures, tokenizer pipelines, distributed sampling
Research Ethos
This team values results that are:
Empirical — grounded in reproducible scaling evidence.
Data-aware — understanding that data is the architecture.
Systematic — bridging algorithmic intuition with compute pragmatism.
Quantitative — every hypothesis testable by metrics that matter: throughput, loss curvature, generalization slope.
Indicators of Fit
You’ve seen a 100B-parameter model diverge—and can explain why.
You can quantify the cost of a tokenization decision.
You can reduce a 72-hour pretraining run to 48 hours with the same validation curve.
You think about scaling laws the way others think about architecture diagrams.
Reward
Note: this client prioritizes calibre. Compensation will be highly competitive, in line with a culture of high performance and a high bar to entry.
About Sentiro Partners | Leadership for the Augmentation Era
We continuously engage with the world's elite researchers, engineers, and data quants — the technical leaders shaping the next generation of intelligent systems.
Sentiro Partners works with pioneering organizations across America, Europe, and Asia to identify the minds advancing Data Science, Machine Learning, Quant Research, and AI Engineering. In the Augmentation Era, intelligence is amplified by algorithms and human insight.
Sentiro Partners was founded by Adrian Clarke, a veteran data science headhunter.
About the Company
Leadership for the Augmentation Era™. At Sentiro Partners, we architect executive teams that drive innovation, leverage AI and data strategically, and create sustainable competitive advantage for visionary organizations. We provide Executive Search, Talent Augmentation, and Talent Advisory services globally.
Our Practices:
Digital & Technology Leadership
Data & AI Leadership
Semiconductor
Private Equity & Growth
CEO & Executive Leadership
Board & Governance
Connect with us: explore@sentiropartners.com