- Company Name
- BetterUp
- Job Title
- Senior Site Reliability Engineer
- Job Description
**Role Summary**
Lead the design, deployment, and operation of cloud‑native infrastructure that powers BetterUp’s scalable platform. Drive reliability across production systems, build automated monitoring and incident response, and harness AI tools for proactive maintenance. Collaborate cross‑functionally to embed SRE practices early in the development cycle.
**Expectations**
- Deliver highly available, secure, and efficient production environments on AWS.
- Automate infrastructure provisioning and configuration with Terraform, maintaining version control and auditability.
- Scale Kubernetes clusters to meet traffic demands while ensuring performance and resilience.
- Implement advanced observability and alerting using Datadog, Prometheus, or OpenTelemetry.
- Integrate AI‑powered log analysis, anomaly detection, and predictive maintenance to reduce MTTR.
- Champion continuous improvement through data‑driven retrospectives and reliability metrics.
**Key Responsibilities**
1. Build, version, and maintain AWS infrastructure with Terraform across multiple environments.
2. Deploy, scale, debug, and secure Kubernetes clusters, managing node pools, networking, and storage.
3. Design and implement intelligent alerting, tracing, and logging pipelines that surface actionable insights.
4. Automate incident response workflows; create self‑healing services and bot integrations.
5. Evaluate and prototype emerging AI/ML tools for monitoring, log analytics, and predictive alerts.
6. Partner with engineering, product, and security teams to embed reliability into code reviews, CI/CD pipelines, and release processes.
7. Conduct post‑incident analyses and produce reliability dashboards and reports for stakeholders.
**Required Skills**
- 4+ years of experience in SRE, DevOps, or infrastructure engineering.
- Deep experience with AWS services (EC2, EKS, RDS, Lambda, CloudFormation).
- Proficiency in Terraform and infrastructure-as-code (IaC) best practices.
- Hands‑on Kubernetes operations: deployment, scaling, troubleshooting, RBAC, network policies.
- Strong observability skill set: Datadog, Prometheus, OpenTelemetry, LogQL, Grafana.
- Debugging and troubleshooting distributed systems; root‑cause analysis.
- Proficiency in scripting (Python, Bash, Go).
- Comfortable communicating complex incidents to engineers, product managers, and executives.
- Enthusiasm for AI tooling; experience using copilots, LLM assistants, or AI‑based monitoring solutions.
- Builder mindset: proactively automate manual processes and propose tooling enhancements.
**Required Education & Certifications**
- Bachelor’s degree in Computer Science, Engineering, or related technical field, or equivalent work experience.
- AWS certifications (e.g., AWS Certified Solutions Architect – Professional, AWS Certified DevOps Engineer – Professional) are highly desirable.
---