- Company Name: Gradle Technologies
- Job Title: Senior Site Reliability Engineer
- Job Description:
**Job Title**
Senior Site Reliability Engineer
**Role Summary**
Lead the creation and ongoing operations of a new, distributed SRE team for a cloud‑native SaaS platform. Manage reliability, performance, and availability of production services, including Kubernetes clusters on AWS, artifact registries, and related infrastructure. Own incident response, automation, observability, and disaster recovery while driving best‑practice SRE culture across engineering teams.
**Expectations**
- 5+ years in SRE, DevOps, or equivalent, operating large‑scale production services.
- Proven expertise in Kubernetes (EKS, on‑prem), AWS (EC2, RDS, S3, EKS), and Infrastructure as Code (Terraform).
- Strong incident‑management track record, including on‑call participation and post‑mortem ownership.
- Deep understanding of SRE principles: SLAs, SLOs, error budgets, and reliability tooling.
- Advanced scripting skills (Python, Bash) and an automation mindset.
- Excellent written and verbal English for asynchronous cross‑time‑zone communication.
**Key Responsibilities**
- Operate and maintain all production instances and supporting services.
- Participate in a follow‑the‑sun on‑call rotation, leading incident detection, triage, resolution, and post‑mortems.
- Design, implement, and maintain end‑to‑end observability (logs, metrics, traces, alerts).
- Automate deployments, upgrades, monitoring, self‑healing, and recovery workflows.
- Build reliability into new features from inception in collaboration with engineering.
- Own disaster‑recovery plans, backups, and business‑continuity exercises.
- Communicate incident status and planned maintenance to customers.
- Optimize performance, resource usage, and operational costs.
- Evolve SaaS operations as scale grows, establishing SRE practices in new teams.
**Required Skills**
- Kubernetes administration (deployment, scaling, troubleshooting).
- AWS operations (EKS, EC2, RDS, S3).
- Terraform, CloudFormation, or equivalent IaC.
- Prometheus, Grafana, ELK/EFK stacks, distributed tracing.
- Incident‑response tooling (PagerDuty, Opsgenie, or similar).
- Scripting: Python, Bash, or Go for automation.
- Knowledge of SLO/SLA definition and monitoring.
- Disaster‑recovery design and execution.
- Strong documentation and asynchronous communication.
**Required Education & Certifications**
- Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent technical experience.
- Optional certifications: Certified Kubernetes Administrator (CKA), AWS Certified Solutions Architect – Associate, or equivalent SRE‑focused credentials.
San Francisco Bay, United States
Remote
Senior
09-02-2026