Braintrust

Senior Coding Annotator / LLM Evaluation Engineer (Contract)

Remote · Paris, France · Senior · Freelance · 15-01-2026

Skills

Python Programming · Software Development · Large Language Models

Job Description

This is a contract engagement, initially six months, with potential for a long-term extension.

Location: Paris-based preferred; remote within Europe considered for strong candidates.

We are building and evaluating state-of-the-art large language models (LLMs) and are looking for experienced software engineers to join our evaluation and annotation team. This role sits at the intersection of real-world software engineering, model evaluation, and applied AI, and is critical to improving model reliability, reasoning, and code quality.

You will design challenging coding tasks, evaluate model outputs against rigorous benchmarks, identify failure modes, and contribute to reinforcement learning and model improvement workflows.

This is not a junior annotation role. We are looking for practitioners with deep hands-on coding experience who can think like both an engineer and an evaluator.

What You’ll Do

Create high-quality coding prompts and reference answers (benchmark-style, e.g., SWE-Bench-like problems).
Evaluate LLM outputs for code generation, refactoring, debugging, and implementation tasks.
Identify and document model failures, edge cases, and reasoning gaps.
Perform head-to-head evaluations between private LLMs (Mistral-based) and leading external models.
Build or configure coding environments to support evaluation and reinforcement learning (RL).
Follow detailed annotation and evaluation guidelines with high consistency.

What We’re Looking For

5+ years of professional software development experience.
Strong Python skills (required).
Knowledge of at least one additional programming language (bonus).
1+ year of coding annotation and/or LLM evaluation experience (part-time OK) for a major frontier AI lab or AI infrastructure company.
Prior code reviewer experience is a plus.
Proven ability to apply structured evaluation criteria and write clear technical feedback.
Fluent in English (written and spoken).
Team lead or mentoring experience is a strong plus.

Why This Role

Work hands-on with cutting-edge LLMs.
Apply real-world engineering judgment to model evaluation and improvement.
High-impact, technical work with a focused, senior team.

About the Company

Braintrust is revolutionizing hiring with Braintrust AIR, the world's first and only end-to-end AI recruiting platform. Trained with human insights and proprietary data, Braintrust AIR reduces time to hire from months to days, instantly matches you with pre-vetted, qualified candidates, and conducts the first-round phone screen for you. Trusted by hundreds of Fortune 1000 enterprises including Nestlé, Porsche, Atlassian, Goldman Sachs, and Nike, Braintrust AIR is making talent acquisition professionals 100x more effective.