- Company Name
- UK Health Security Agency
- Job Title
- Senior Specialist Engineer (SRE)
- Job Description
**Job Title**
Senior Specialist Engineer (Site Reliability Engineer)
**Role Summary**
Lead the design, implementation, and operation of highly available, scalable, and resilient cloud‐native and on‑premises HPC platforms. Drive automation, observability, and performance tuning to meet and exceed Service Level Objectives (SLOs) for UK Health Security Agency services. Mentor junior engineers and collaborate with high‑performance computing, AI, and research computing teams.
**Expectations**
- Operate within a hybrid model, spending at least 60% of working hours at a core HQ.
- Report to the Principal Specialist Engineer SRE and align with the HPC/SRE/AI unit’s objectives.
- Deliver measurable improvements in reliability, performance, and capacity planning.
- Enforce CI/CD pipelines, infrastructure‑as‑code (IaC), and incident‑management best practices.
**Key Responsibilities**
- Remediate infrastructure and operational issues, ensuring continuous service availability.
- Architect, develop, and manage multi‑cloud HPC platforms and on‑premises infrastructure.
- Monitor, tune, and optimize system performance to meet defined SLOs and SLIs.
- Perform root‑cause analysis, run post‑mortems, and implement preventive measures.
- Design and maintain robust monitoring, alerting, and observability dashboards to reduce alert fatigue and accelerate response times.
- Automate repetitive tasks via IaC, scripting, and tooling to reduce operational toil.
- Conduct capacity and performance planning to support current and future workloads.
- Define, track, and continuously improve SLOs, SLIs, and error budgets.
- Advocate SRE best practices across engineering teams and provide knowledge transfer to peers.
- Support AI requirements and integration within the HPC environment.
**Required Skills**
- Proven experience as an SRE or in a related role (≥5 years).
- Strong expertise in cloud platforms (AWS, Azure, GCP) and hybrid deployments.
- Hands‑on experience with CI/CD, IaC (Terraform, Pulumi, CloudFormation), and container orchestration (Kubernetes).
- Proficiency in scripting/automation (Python, Bash, Go).
- Deep knowledge of observability tools (Prometheus, Grafana, OpenTelemetry, ELK stack).
- Performance tuning, capacity planning, and load‑testing skills.
- Incident response, root‑cause analysis, and post‑mortem practices.
- Excellent communication, mentoring, and collaboration abilities.
- Familiarity with HPC, AI workloads, and scientific computing environments is a plus.
**Required Education & Certifications**
- Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
- Relevant certifications (e.g., AWS Certified Solutions Architect, GCP Professional Cloud Architect, Kubernetes Administrator, or similar) preferred.
Sutton-at-Hone, United Kingdom
On site
Senior
02-12-2025