**Company Name:** Mistral AI
**Job Title:** AI Engineer, Product
**Job Description**
**Role Summary**
Design, build, and maintain the evaluation, A/B testing, and release infrastructure for large language models within a product-focused team. Collaborate with research scientists to turn measurable improvements in quality, latency, safety, and reliability into production releases, backed by end‑to‑end observability and robust deployment pipelines.
**Expectations**
- Autonomous, product‑driven mindset; able to form hypotheses, run experiments, and iterate quickly.
- Clear written and spoken communication; capable of authoring documentation and reporting results to cross‑functional stakeholders.
- Deep technical proficiency in TypeScript or Python; strong knowledge of the production LLM lifecycle, including prompts, function calling, and system prompts.
- Experience designing and operating observability solutions (logging, tracing, dashboards, alerting).
- Hands‑on expertise in evaluation creation, metric definition, A/B test design, and data‑driven rollout decisions.
- Familiarity with safety systems (moderation, PII handling, guardrails) and release operations (canary, shadowing, automated rollbacks) is highly desirable.
**Key Responsibilities**
- Build and maintain an LLM evaluation framework covering reference tests, heuristics, and model‑graded checks.
- Define, track, and share metrics such as task success, helpfulness, hallucination proxies, safety flags, latency, and cost.
- Design, execute, and analyze A/B tests for prompts, models, and system prompts; recommend rollout or rollback based on results.
- Implement structured logging, tracing, dashboards, and alerting for all LLM calls to provide comprehensive observability.
- Operate the model release pipeline, including canary and shadow traffic, sign‑offs, SLO‑based rollback criteria, and regression detection.
- Improve core behaviors (memory write/retrieve policies, intent classification, follow‑ups, routing, tool‑call reliability) through evaluation and experimentation.
- Create reusable templates, documentation, and onboarding materials enabling other teams to author safe evaluations.
- Partner with science teams to diagnose regressions and lead post‑mortem analyses.
**Required Skills**
- Programming: TypeScript or Python (proficient).
- Production LLM experience: prompts, tool/function calling, system prompts.
- Evaluation design and A/B testing proficiency; metric engineering.
- Observability: structured logging, tracing, dashboards, alerting.
- Product thinking: hypothesis formation, experiment design, result interpretation.
- Strong written and verbal communication; autonomous working style.
- Desired: safety systems expertise (moderation, PII/redaction, guardrails).
- Desired: release operations experience (canary/shadowing, automated rollbacks, experiment platforms).
**Required Education & Certifications**
- Bachelor’s degree or higher in Computer Science, Software Engineering, or a related technical field.
- Certifications in cloud platforms or observability tools are a plus but not mandatory.