Job Specification - Lead AI Engineer
Autonomous Systems • Multi-Agent Architectures • Long-Context RAG • MLOps & Cloud Infrastructure
Location: London, UK (Hybrid)
Level: Lead / Principal
Type: Full-Time, Permanent
Salary: Competitive + Equity
Start: Immediate
About Trigma.ai
Trigma.ai is a next-generation AI product and services company building intelligent,
autonomous systems at scale. We work across healthcare, enterprise automation, and public
sector domains, delivering production-grade AI infrastructure that is robust, governed, and
impactful. We are growing our core engineering team and are looking for an exceptional Lead AI
Engineer to shape the direction of our AI platform and mentor a high-calibre team.
Role Overview
As Lead AI Engineer at Trigma.ai, you will own the design and delivery of our core AI and ML
platform—spanning agentic systems, foundation model infrastructure, MLOps pipelines, and
multi-cloud architecture. You will bring deep technical expertise alongside the ability to lead,
mentor, and set architectural direction. This is a hands-on leadership role: you will write
production code, define system design, and drive engineering excellence across the team.
1. Key Responsibilities
Agentic & Autonomous AI Systems
• Design and implement stateful, multi-agent architectures supporting long-horizon
reasoning, tool-driven execution, interrupt/resume logic, and checkpoint-based recovery.
• Build graph-based agent controllers, decision state machines, and adaptive execution frameworks for complex, multi-step AI workflows.
• Develop agent-optimized RAG systems with hierarchical chunking, hybrid retrieval, relevance scoring, and context-window management strategies.
• Create evaluation and reliability frameworks for non-deterministic agents, including failure simulation, regression tracking, and behavioral stability metrics.
ML Platform & Foundation Model Infrastructure
• Lead development of modular ML platforms covering data ingestion, pipeline orchestration, and automated fine-tuning for large foundation models.
• Architect scalable, distributed training and inference infrastructure across AWS, GCP, and Azure, optimizing for GPU utilization and cost-efficiency.
• Build and maintain feature stores, metadata registries, and experiment tracking systems to ensure reproducibility and team-wide consistency.
• Productionize agentic and ML workloads using containerization, autoscaling, and execution isolation strategies.
MLOps & AI Governance
• Design and implement end-to-end MLOps frameworks using SageMaker, Vertex AI, Terraform, CI/CD pipelines, and container orchestration (Kubernetes, ArgoCD).
• Develop real-time observability frameworks for model drift, bias, performance monitoring, and cost tracking using tools such as Prometheus, Grafana, Arize AI, and Evidently.
• Embed governance, safety, and auditability controls into AI pipelines—including RBAC, data lineage, encryption, and GDPR-aligned access policies.
• Ensure compliance with relevant standards (SOC 2, ISO 27001) across cloud AI workloads.
Technical Leadership
• Define and champion architectural standards, design patterns, and engineering best practices across the AI engineering team.
• Mentor and develop a team of ML and data engineers, conducting code reviews, pair programming, and structured knowledge-sharing sessions.
• Collaborate with product, data science, and business stakeholders to translate complex requirements into scalable, production-ready AI solutions.
• Author internal documentation, architectural guidance, and evaluation strategies to build organisational AI capability.
2. Required Experience & Skills
Core Technical Skills
• 10+ years of software and ML engineering experience, with at least 3 years in a senior or lead AI/ML engineering role.
• Deep expertise in Python and at least one additional language (Go, Scala, Java).
• Proven experience designing and shipping agentic AI systems in production, including multi-agent coordination, tool use, and long-context reasoning.
• Strong grasp of RAG architectures, vector stores (Pinecone, Weaviate, FAISS), and embedding-based retrieval systems.
ML & GenAI Frameworks
• Hands-on experience with PyTorch, TensorFlow, Keras, Scikit-learn, XGBoost, and LightGBM.
• Practical experience with Hugging Face Transformers, LangChain, MLflow, Ray, and Weights & Biases.
• Familiarity with foundation and multimodal models (CLIP, Flamingo, DALL·E, Stable Diffusion, Perceiver IO).
MLOps & Infrastructure
• Extensive hands-on experience with AWS (SageMaker, Lambda, EKS, Bedrock), GCP (Vertex AI, BigQuery), and Azure ML.
• Proficiency with Kubernetes, Docker, Terraform, ArgoCD, Helm, and GitHub Actions.
• Experience with model serving frameworks such as Triton Inference Server, TorchServe, or BentoML.
• Strong background in building and operating real-time data pipelines using Kafka, Spark, Airflow, dbt, and Flink.
Observability, Security & Co