CAST

www.castsoftware.com

2 Jobs

1,253 Employees

About the Company

Businesses move faster using CAST technology to understand, improve, and transform their software. Through semantic analysis of source code, CAST produces 3D maps and dashboards for navigating inside individual applications and across entire portfolios. This intelligence empowers executives and technology leaders to steer, accelerate, and report on initiatives such as technical debt, GenAI, modernization, and cloud. As the pioneer of the software intelligence field, CAST is trusted by the world’s leading companies and governments, as well as by their consultancies and cloud providers. See it all at castsoftware.com.

Listed Jobs

Company Name
CAST
Job Title
Data Engineer / Data Enablement with AI
Job Description
**Job Title:** Data Engineer / Data Enablement with AI

**Role Summary:** Develop and maintain AI‑driven data pipelines that aggregate, cleanse, enrich, and semantically annotate software ecosystem data. Use LLMs, embeddings, and NLP tools to create structured knowledge assets for training and autonomous AI agents.

**Expectations:**
- Deliver production‑level semantic pipelines that feed LLM fine‑tuning and RAG workflows.
- Maintain rigorous data lineage, reproducibility, and version control.
- Act as a technical bridge between data engineering and AI research teams, shaping schemas, prompts, and labeling strategies.

**Key Responsibilities:**
1. Aggregate and structure diverse software data (code, APIs, tickets, docs, architecture specs).
2. Apply LLMs, embeddings, and NLP for automated data cleaning, entity extraction, metadata tagging, and semantic annotation.
3. Build, test, and maintain semantic pipelines for LLM fine‑tuning and Retrieval‑Augmented Generation (RAG).
4. Organize datasets into formats suitable for Agent‑to‑Agent interactions (APIs, vector databases, knowledge graphs).
5. Collaborate on schema evolution, prompt engineering, labeling strategy, and evaluation data creation.
6. Implement data lineage, reproducibility, and version‑control best practices.

**Required Skills:**
- 3+ years in data engineering, ML data ops, or structured data curation.
- Proficiency in Python; strong data‑pipeline skills (Pandas, PyArrow, regex, Airflow).
- Experience with LLM/NLP libraries (Hugging Face, spaCy, LangChain).
- Ability to use AI for cleaning, enriching, and classifying technical content.
- Solid understanding of tokenization, chunking, and model input preparation.
- Familiarity with software project data: Git, APIs, technical documentation.

**Bonus Skills:**
- Knowledge of vector databases (FAISS, Qdrant, Weaviate).
- Experience with knowledge graphs (Neo4j, RDF, SPARQL).

**Required Education & Certifications:**
- Bachelor’s (or higher) in Computer Science, Data Engineering, Computational Linguistics, or a related field.
- Certifications in data engineering or AI (e.g., GCP Data Engineer, AWS Big Data Specialty) preferred but not mandatory.
Meudon, France
Hybrid
05-11-2025
Company Name
CAST
Job Title
Intern Data Engineer / Data Enablement for AI
Job Description
**Job Title:** Intern Data Engineer / Data Enablement for AI

**Role Summary:** Assist in building the foundational data layer that powers AI systems by aggregating, structuring, and enriching software project data. Use LLMs, embeddings, and NLP tools to automate data cleaning, entity extraction, and semantic annotation, creating ready‑to‑use datasets for fine‑tuning and Retrieval‑Augmented Generation (RAG).

**Expectations:**
- Develop and maintain scalable semantic pipelines.
- Apply advanced AI techniques for data curation and quality assurance.
- Deliver reproducible, lineage‑tracked datasets for downstream AI models.
- Work cross‑functionally with AI research and engineering teams.

**Key Responsibilities:**
- Aggregate data from software ecosystems (code, APIs, tickets, docs, architecture specs).
- Clean and enrich data using LLMs, embeddings, and NLP (Hugging Face, spaCy, LangChain).
- Extract entities, tag metadata, perform semantic annotation, and prepare tokenized chunks for model input.
- Build and manage pipelines (Airflow, Pandas, PyArrow) that feed RAG and LLM fine‑tuning workflows.
- Format datasets for Agent‑to‑Agent interaction (vector databases, knowledge graphs, APIs).
- Ensure robust data lineage, reproducibility, and version control.
- Collaborate on schema evolution, prompt design, labeling strategies, and evaluation metrics.

**Required Skills:**
- Python programming with data‑pipeline libraries (Pandas, PyArrow, regex, Airflow).
- Experience in data engineering, ML data ops, or structured data curation.
- Familiarity with LLM/NLP frameworks (Hugging Face, spaCy, LangChain).
- Knowledge of tokenization, chunking, and model input preparation.
- Ability to work with software project data (Git repos, APIs, technical docs).

**Required Education & Certifications:**
- Current student or recent graduate in Computer Science, Data Science, or a related field.
- Optional certifications in data engineering or machine learning (e.g., Coursera, edX, GCP/AWS data services).
Meudon, France
Hybrid
03-12-2025