- Company Name
- Goliath Partners
- Job Title
- Data Engineer, Machine Learning
- Job Description
**Job Title:**
Data Engineer, Machine Learning
**Role Summary:**
Design and implement large‑scale data pipelines and architecture to support a real‑time multimodal generalist agent. Focus areas include embedding generation, vector search, semantic ranking, and data systems that enable long‑term memory and multimodal inference. Collaborate closely with ML and research teams to integrate retrieval components with LLMs and vision‑language models for multi‑step reasoning and autonomous action.
**Expectations:**
- Deliver scalable, high‑performance retrieval pipelines (embeddings, vector indexing, similarity and semantic ranking).
- Build and maintain data systems that support memory, long‑term state, and multimodal inference.
- Create tooling for data ingestion, transformation, storage, and access.
- Integrate data infrastructure into LLM and vision‑language workflows, ensuring low latency and high fidelity.
- Continuously evaluate, test, and monitor system performance, proposing enhancements.
**Key Responsibilities:**
1. Construct and optimize retrieval pipelines (embedding generation, vector search, semantic similarity, ranking).
2. Architect data infrastructure for memory, long‑term state tracking, and multimodal inference.
3. Build ingestion, ETL, and feature‑store components for text, image, and audio data.
4. Integrate data pipelines with LLMs and vision‑language models to enable planning, reasoning, and autonomous actions.
5. Work with ML engineers and researchers to define data requirements and validate system outputs.
6. Ensure data quality, consistency, and compliance while monitoring pipeline health.
7. Employ CI/CD for data pipelines and utilize monitoring dashboards (e.g., Prometheus, Grafana).
8. Research emerging AI‑related data technologies and recommend adoption.
**Required Skills:**
- Proficient in Python (or Scala) and SQL; experience with Spark, Beam, or similar distributed data frameworks.
- Hands‑on experience with large‑scale vector search (FAISS, Milvus, Pinecone, Weaviate, or equivalent).
- Deep understanding of embedding strategies, semantic search, and similarity metrics.
- Familiarity with multimodal data processing (text, image, audio).
- Strong command of cloud data services (AWS S3/GCP Cloud Storage, BigQuery, Redshift, Snowflake).
- Knowledge of model serving, feature stores, and MLOps best practices.
- Experience with CI/CD pipelines for data engineering (Git, Docker, Kubernetes).
- Excellent debugging, profiling, and performance optimization skills.
- Bonus: familiarity with reinforcement learning pipelines and agentic workflows.
**Required Education & Certifications:**
- Bachelor’s or Master’s degree in Computer Science, Data Engineering, Data Science, or a related field.
- Cloud certifications (AWS Certified Data Analytics – Specialty, GCP Professional Data Engineer, or Azure Data Engineer Associate) are a plus.
San Francisco Bay, United States
Hybrid
24-11-2025