Job Specifications
Role: AI Architect
Location: London, UK
Experience level: 15+ years
Job Description:
1) Architecture & Solution Design
Define reference architectures for GenAI systems: RAG, agentic orchestration, tool/function calling, multi-step reasoning workflows, memory patterns, and context strategies.
Design multi-tenant and enterprise-scale GenAI platforms with clear separation of concerns: UI, orchestration, retrieval, inference, evaluation, and observability.
Select model strategies: hosted LLMs, open-weight models, fine-tuning vs. prompt/RAG, latency and cost tradeoffs, and deployment patterns.
2) Agentic AI Orchestration & Tooling
Architect agent systems (single/multi-agent) including:
Task decomposition, planners/executors, reflection/verification loops
Tool use patterns (APIs, databases, search, workflow engines)
Guardrails to prevent unsafe tool actions and hallucinated commands
Build reliable flows for “human-in-the-loop” decision points and approvals (e.g., procurement, customer comms, incident triage).
3) Retrieval, Knowledge Systems & Data Design
Lead design of knowledge ingestion pipelines:
document parsing, chunking strategies, embeddings, metadata, lineage, freshness SLAs
Architect vector search and hybrid retrieval:
semantic + keyword, reranking, filtering, ACL-aware retrieval
Ensure retrieval respects access control, PII handling, data residency, and auditability.
4) Production Engineering, Reliability & Cost
Set non-functional requirements for GenAI workloads:
SLOs, latency budgets, fallback models, caching, rate limiting
Design cost controls: prompt/token optimization, model routing, batching, and usage governance.
Implement resiliency patterns: circuit breakers, retries, queue-based orchestration, idempotency.
5) Security, Risk & Responsible AI
Establish AI security posture:
prompt injection defenses, data exfiltration controls, tool sandboxing
Define policies and controls for:
sensitive data, logging, redaction, encryption, secret management, and auditing
Collaborate with risk/compliance to drive:
model governance, content safety, bias/quality monitoring, and regulatory alignment
6) Evaluation, Observability & Continuous Improvement
Create evaluation frameworks:
offline evals (golden sets), automated regression, and scenario-based testing
Instrument systems for observability:
traces, prompt versioning, retrieval diagnostics, tool-call logs, and outcome metrics
Run A/B tests and iterate on prompts, retrieval, and agent policies based on measurable outcomes.
7) Leadership & Stakeholder Management
Partner with product leaders to identify high-value use cases and define the roadmap.
Mentor engineers and data scientists on best practices for LLM apps.
Produce architecture artifacts: ADRs, threat models, system diagrams, runbooks.
Required Skills & Experience
Core Technical Skills (Must Have)
8+ years in software/solution architecture, with 2+ years delivering GenAI/LLM solutions in production.
Strong knowledge of LLMs: prompting patterns, context windows, tool/function calling, model limitations, and safety risks.
Agentic AI design experience:
orchestrators, workflows, multi-step reasoning, tool usage, HITL patterns
RAG expertise:
embeddings, vector DBs, hybrid retrieval, reranking, chunking strategies, evaluation
Cloud architecture (Azure/AWS/GCP) with production engineering rigor:
microservices, containers (Docker/K8s), serverless, CI/CD
Solid programming skills (one or more):
Python, TypeScript/JavaScript, Java, C#
Experience with APIs and integration patterns:
REST/gRPC, event-driven systems, queues, workflow engines
Security & Governance (Must Have)
Understanding of GenAI-specific threats:
prompt injection, data leakage, jailbreaks, insecure tool calling
Familiarity with enterprise controls:
IAM, key management, encryption, network isolation, audit logging
Responsible AI practices:
evaluation, content moderation, privacy, and compliance-by-design
Architecture & Systems Skills (Must Have)
Distributed system design:
scalability, fault tolerance, caching, performance tuning
Observability:
logging/metrics/tracing, prompt/version tracking, monitoring SLIs/SLOs
Cost management and performance optimization:
model selection/routing, token reduction, caching, batching
Preferred / Nice-to-Have Skills
Fine-tuning approaches:
LoRA/QLoRA, instruction tuning, adapters, distillation (when appropriate)
Experience with:
knowledge graphs, semantic layers, enterprise search
Advanced evaluation:
LLM-as-judge with safeguards, rubric scoring, adversarial testing
MLOps/LLMOps toolchains:
experiment tracking, feature stores, model registries, data quality tools
Domain experience:
customer support automation, developer productivity copilots, IT ops agents, finance or healthcare compliance
Experience building platforms:
reusable agent frameworks, reusable RAG components, multi-team enablement