- Company Name
- Arcus Search
- Job Title
- Staff Site Reliability Engineer
- Job Description
**Role Summary**
Lead and architect the Site Reliability Engineering practice for a highly secure technology incubator. Own end‑to‑end reliability, scalability, and security of cloud‑native systems, guiding a small team of SREs and collaborating with cross‑functional stakeholders across research, finance, and energy domains.
**Expectations**
- Immediate ownership of SRE strategy and execution in a nascent unit.
- Build foundational infrastructure, platform policies, and process frameworks from scratch.
- Foster a culture of automation, observability, and continuous improvement.
- Provide technical mentorship to junior SREs and influence broader engineering practices.
**Key Responsibilities**
- Design, implement, and maintain a Kubernetes platform at scale.
- Develop infrastructure-as-code using Terraform or equivalent IaC tools.
- Architect and run CI/CD pipelines that enforce quality gates and rapid releases.
- Build and sustain an observability stack (metrics, logs, tracing) that supports reliability targets.
- Design and enforce network security controls across the stack.
- Establish incident management and post‑mortem processes, and define SLOs/SLAs.
- Collaborate with product, security, and operations teams to align architectural decisions.
- Lead capacity planning, performance testing, and scaling initiatives.
**Required Skills**
- Advanced proficiency with Kubernetes (cluster design, operators, HA).
- Strong IaC experience (Terraform, CloudFormation, Pulumi).
- Expertise in CI/CD tooling (Git, Jenkins, GitLab CI, Argo CD, or similar).
- Deep understanding of observability principles and tooling (Prometheus, Grafana, Jaeger, ELK/EFK).
- Network security design expertise (VPC, firewalls, segmentation).
- Proven ability to build and operate cloud, hybrid, or on‑prem infrastructure in enterprise environments.
- Excellent problem‑solving, communication, and mentorship skills.
- Familiarity with SLO/SLA management and incident response.
**Required Education & Certifications**
- Bachelor’s degree in Computer Science, Engineering, or related field (Master’s preferred).
- Certifications:
  - Certified Kubernetes Administrator (CKA) or equivalent.
  - Terraform Associate or similar IaC credential.
  - Relevant cloud certification (e.g., AWS Certified Solutions Architect, Azure Solutions Architect, Google Professional Cloud Architect).
  - Networking/security certifications (e.g., CCNP, CCSP) are a plus.