Braintrust

AI Evaluation Engineer

Hybrid

Paris, France

Freelance

15-01-2026


Skills

Python Programming, Software Development, Large Language Models

Job Specifications

Job Description

This is a contract engagement, initially 6 months, with potential for a long-term extension.

Location: Paris-based preferred; remote within Europe possible for a strong candidate.

We are building and evaluating state-of-the-art large language models (LLMs) and are looking for experienced software engineers to join our evaluation and annotation team. This role sits at the intersection of real-world software engineering, model evaluation, and applied AI, and is critical to improving model reliability, reasoning, and code quality.

You will design challenging coding tasks, evaluate model outputs against rigorous benchmarks, identify failure modes, and contribute to reinforcement learning and model improvement workflows.

This is not a junior annotation role. We are looking for practitioners with deep hands-on coding experience who can think like both an engineer and an evaluator.
What You’ll Do

- Create high-quality coding prompts and reference answers (benchmark-style, e.g. SWE-Bench-like problems).
- Evaluate LLM outputs for code generation, refactoring, debugging, and implementation tasks.
- Identify and document model failures, edge cases, and reasoning gaps.
- Perform head-to-head evaluations between private LLMs (Mistral-based) and leading external models.
- Build or configure coding environments to support evaluation and reinforcement learning (RL).
- Follow detailed annotation and evaluation guidelines with high consistency.
What We’re Looking For

- 5+ years of professional software development experience.
- Strong Python skills (required).
- Knowledge of at least one additional programming language (bonus).
- 1+ year of coding annotation and/or LLM evaluation experience (part-time OK) for a major frontier AI lab or AI infrastructure company.
- Prior code reviewer experience is a plus.
- Proven ability to apply structured evaluation criteria and write clear technical feedback.
- Fluent in English (written and spoken).
- Team lead or mentoring experience is a strong plus.
Why This Role

- Work hands-on with cutting-edge LLMs.
- Apply real-world engineering judgment to model evaluation and improvement.
- High-impact, technical work with a focused, senior team.

About the Company

Braintrust is revolutionizing hiring with Braintrust AIR, the world's first and only end-to-end AI recruiting platform. Trained with human insights and proprietary data, Braintrust AIR reduces time to hire from months to days, instantly matching you with pre-vetted qualified candidates and conducting the first-round phone screen for you. Trusted by hundreds of Fortune 1000 enterprises including Nestlé, Porsche, Atlassian, Goldman Sachs, and Nike, Braintrust AIR is making talent acquisition professionals dramatically more effective.