CAST

Data Engineer Intern / Data Enablement with AI for AI

Hybrid

Meudon, France

Internship

03-12-2025


Skills

Python, Data Cleaning, Data Engineering, Neo4j, Version Control, Training, Architecture, Git, Agile, Pandas, LangChain, NLP

Job Specifications

CAST, a software company based in Meudon, is the market leader in Software Intelligence.

Working at CAST R&D means being an important part of a highly talented, fast-paced, multicultural, Agile team.

Overview

We’re building the foundation to ground AI with AAA Software Intelligence (Aggregated, Accurate, and Augmented) sourced from real-world software and technology projects. This role goes beyond manual curation: it's about using AI to empower AI. You will leverage LLMs, embeddings, and NLP tools to clean, enrich, and validate data, so that AI systems and autonomous agents can rely on it for training and contextual understanding.

Responsibilities

Aggregate and structure data from software ecosystems (codebases, APIs, tickets, documentation, architecture specs).
Apply LLMs, embeddings, and NLP tools to automate: data cleaning, entity extraction, metadata tagging, and semantic annotation.
Build and maintain semantic pipelines for LLM fine-tuning and RAG (Retrieval-Augmented Generation).
Organize datasets into formats suitable for Agent-to-Agent (A2A) interactions: APIs, vector DBs, knowledge graphs, etc.
Collaborate with AI teams to evolve schemas, prompts, labeling strategies, and evaluation data.
Ensure strong data lineage, reproducibility, and version control.

Requirements

Experience in data engineering, ML data ops, or structured data curation.
Proficient in Python, with strong data pipeline skills (Pandas, PyArrow, regex, Airflow).
Experience with LLMs or NLP tools (e.g., Hugging Face, spaCy, LangChain).
Ability to use AI to clean, enrich, classify, and organize technical content.
Strong understanding of tokenization, chunking, and model input preparation.
Experience working with software project data: Git repos, APIs, technical documentation, etc.

Bonus Skills

Knowledge of vector DBs (FAISS, Qdrant, Weaviate) or knowledge graphs (Neo4j, RDF, SPARQL).

About the Company

Businesses move faster using CAST technology to understand, improve, and transform their software. Through semantic analysis of source code, CAST produces 3D maps and dashboards to navigate inside individual applications and across entire portfolios. This intelligence empowers executives and technology leaders to steer, speed, and report on initiatives such as technical debt, GenAI, modernization, and cloud. As the pioneer of the software intelligence field, CAST is trusted by the world’s leading companies and governments, t...