- Company Name: Enigma
- Job Title: Senior Infrastructure Engineer | Kubernetes | Docker | Terraform | Python | GPU | Onsite, London
- Job Description:
**Job Title**
Senior Infrastructure Engineer – Kubernetes, Docker, Terraform, Python, GPU
**Role Summary**
Lead the design, deployment, and operation of a production Kubernetes platform that powers long‑running, failure‑prone reinforcement‑learning agent workloads. Own the end‑to‑end lifecycle of containerised evaluation environments, ensuring high availability, efficient GPU utilisation, robust observability, and secure sandboxing for untrusted code execution.
**Expectations**
- Provide on‑call coverage and incident response for mission‑critical workloads.
- Drive infrastructure reliability, scalability, and performance improvements.
- Collaborate with research, data science, and ops teams to align environment capabilities with evolving training needs.
- Maintain clear, actionable documentation, runbooks, and dashboards.
**Key Responsibilities**
- Own and evolve the Kubernetes runtime: scheduling, lifecycle management, and autoscaling for agent runs lasting multiple hours or days.
- Optimise GPU scheduling, resource allocation, and image layering to minimise cold‑start times and maximise utilisation.
- Design storage patterns for datasets, model checkpoints, and transient state.
- Build observability: metrics, logs, traces, dashboards, and alerting tied to SLOs (e.g., rollout success rate, environment health, queue latency).
- Create debugging playbooks and runbooks for OOMs, memory leaks, performance regressions, and network/storage issues.
- Implement reliability engineering: retry/backoff strategies, checkpointing, idempotency, graceful degradation, and chaos testing with failure injection.
- Harden sandboxing: container isolation, network policies, secrets management, audit logging, and rate limiting of external API calls.
**Required Skills**
- Deep production experience managing Kubernetes (resource limits, affinity/taints, priorities, autoscaling, node health).
- Strong distributed‑systems fundamentals: idempotency, retries, failure domains, incident response.
- Practical observability: metrics, structured logging, tracing.
- Ability to build tooling in Python and/or Go.
- Infrastructure‑as‑code and automation: Helm, Terraform, GitOps workflows.
- Redis expertise for high‑throughput, session‑oriented workloads.
- GPU scheduling, container runtimes, Linux performance tuning, and networking fundamentals.
**Required Education & Certifications**
- Bachelor’s (or higher) degree in Computer Science, Engineering, or a related field, or equivalent professional experience.
- Kubernetes certification (CKA/CKAD) preferred.
- Terraform, cloud‑infra, or DevOps certifications advantageous.