**Company Name:** CAST
**Job Title:** Data Engineer / Data Enablement with AI
**Role Summary:**
Develop and maintain AI‑driven data pipelines that aggregate, cleanse, enrich, and semantically annotate software ecosystem data. Use LLMs, embeddings, and NLP tooling to create structured knowledge assets for model training and for autonomous AI agents.
**Expectations:**
- Deliver production‑level semantic pipelines that feed LLM fine‑tuning and RAG workflows.
- Maintain rigorous data lineage, reproducibility, and version control.
- Act as a technical bridge between data engineering and AI research teams, shaping schemas, prompts, and labeling strategies.
**Key Responsibilities:**
1. Aggregate and structure diverse software data (code, APIs, tickets, docs, architecture specs).
2. Apply LLMs, embeddings, and NLP for automated data cleaning, entity extraction, metadata tagging, and semantic annotation.
3. Build, test, and maintain semantic pipelines for LLM fine‑tuning and Retrieval‑Augmented Generation (RAG); a minimal sketch follows this list.
4. Organize datasets into formats suitable for Agent‑to‑Agent interactions (APIs, vector databases, knowledge graphs).
5. Collaborate on schema evolution, prompt engineering, labeling strategy, and evaluation data creation.
6. Implement data lineage, reproducibility, and version‑control best practices.
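For illustration only, a minimal sketch of the kind of semantic pipeline described above: naive fixed‑size chunking of technical documents plus embedding with a sentence‑transformers model. The model name, chunk size, and the `Chunk` record schema are placeholder assumptions, not CAST's actual stack.

```python
# Minimal sketch: chunk technical docs and attach embeddings as metadata.
# Model name, chunk size, and the Chunk schema are illustrative assumptions.
from dataclasses import dataclass, field

from sentence_transformers import SentenceTransformer  # Hugging Face ecosystem


@dataclass
class Chunk:
    source: str    # originating file, ticket, or API spec
    text: str      # chunk body
    embedding: list = field(default_factory=list)


def chunk_text(source: str, text: str, max_words: int = 200) -> list[Chunk]:
    """Naive fixed-size chunking by word count; real pipelines respect structure."""
    words = text.split()
    return [
        Chunk(source=source, text=" ".join(words[i : i + max_words]))
        for i in range(0, len(words), max_words)
    ]


def embed_chunks(chunks: list[Chunk]) -> list[Chunk]:
    model = SentenceTransformer("all-MiniLM-L6-v2")  # small, widely used model
    for chunk, vec in zip(chunks, model.encode([c.text for c in chunks])):
        chunk.embedding = vec.tolist()
    return chunks


if __name__ == "__main__":
    docs = {"README.md": "CAST analyzes software architecture and dependencies."}
    chunks = [c for src, txt in docs.items() for c in chunk_text(src, txt)]
    print(f"{len(embed_chunks(chunks))} chunks embedded")
```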
**Required Skills:**
- 3+ years in data engineering, ML data ops, or structured data curation.
- Proficient in Python; strong data‑pipeline skills (Pandas, PyArrow, regex, Airflow).
- Experience with LLMs/NLP libraries (Hugging Face, spaCy, LangChain).
- Ability to use AI for cleaning, enriching, and classifying technical content.
- Solid understanding of tokenization, chunking, and model input preparation (see the sketch after this list).
- Familiarity with software project data: Git, APIs, technical documentation.
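To make the tokenization and input‑preparation bullet concrete, here is a hedged sketch that uses a Hugging Face tokenizer to pack sentences into a model's context window; the checkpoint and the 512‑token budget are assumptions, not a prescribed setup.

```python
# Sketch: token-aware input preparation with a Hugging Face tokenizer.
# The checkpoint and the 512-token budget are illustrative assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


def fits_context(text: str, budget: int = 512) -> bool:
    """True if the text fits the model's input window, special tokens included."""
    return len(tokenizer(text)["input_ids"]) <= budget


def split_to_budget(text: str, budget: int = 512) -> list[str]:
    """Greedily pack sentences into chunks that each respect the token budget."""
    batch, out = [], []
    for sentence in text.split(". "):
        candidate = ". ".join(batch + [sentence])
        if batch and not fits_context(candidate, budget):
            out.append(". ".join(batch))  # flush the full batch
            batch = [sentence]
        else:
            batch.append(sentence)
    if batch:
        out.append(". ".join(batch))
    return out
```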
**Bonus Skills:**
- Knowledge of vector databases (FAISS, Qdrant, Weaviate); toy sketches of both bonus areas follow this list.
- Experience with knowledge graphs (Neo4j, RDF, SPARQL).
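For illustration only, a toy FAISS index over chunk embeddings; the dimensionality matches the earlier embedding sketch and the vectors are random stand‑ins, not real data.

```python
# Sketch: index chunk embeddings in FAISS for nearest-neighbour retrieval.
# Dimension and data are toy values; real setups persist IDs and the index.
import faiss
import numpy as np

dim = 384  # matches all-MiniLM-L6-v2 embeddings from the earlier sketch
index = faiss.IndexFlatL2(dim)  # exact L2 search; swap for IVF/HNSW at scale

embeddings = np.random.rand(1000, dim).astype("float32")  # stand-in vectors
index.add(embeddings)

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)  # five nearest chunks
print(ids[0], distances[0])
```

And a similarly hedged sketch of a software knowledge graph built with rdflib; the namespace and triples are invented for illustration.

```python
# Sketch: a toy software knowledge graph queried with SPARQL via rdflib.
# The example.org namespace and all triples are invented for illustration.
from rdflib import RDF, Graph, Literal, Namespace

EX = Namespace("http://example.org/software#")
g = Graph()
g.add((EX.BillingService, RDF.type, EX.Microservice))
g.add((EX.BillingService, EX.callsAPI, EX.PaymentsAPI))
g.add((EX.BillingService, EX.documentedIn, Literal("billing/README.md")))

# Which services call the Payments API?
for (svc,) in g.query(
    "SELECT ?svc WHERE { ?svc <http://example.org/software#callsAPI> "
    "<http://example.org/software#PaymentsAPI> }"
):
    print(svc)  # -> http://example.org/software#BillingService
```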
**Required Education & Certifications:**
- Bachelor’s (or higher) in Computer Science, Data Engineering, Computational Linguistics, or related field.
- Certifications in data engineering or AI (e.g., GCP Data Engineer, AWS Big Data Specialty) preferred but not mandatory.