- Company Name: Workonomics
- Job Title: Senior Site Reliability Engineer
- Job Description:
**Job title:** Senior Site Reliability Engineer
**Role Summary:**
Senior SRE leading reliability and observability initiatives across a high-throughput, real-time decision platform. Drives infrastructure modernization (Terraform + Kubernetes), rebuilds the observability stack, and explores AI-assisted operations to meet five-nines (99.999%) reliability targets for a multi-team SaaS product.
**Expectations:**
- Deliver scalable, secure, highly available infrastructure and observability.
- Own end‑to‑end incident response, post‑mortem analysis, and continuous improvement cycles.
- Collaborate cross-functionally with product, security, and DevOps teams.
- Publish best‑practice guidelines and tooling for use by 150+ engineers.
- Demonstrate measurable improvements in reliability and observability coverage.
**Key Responsibilities:**
- Lead migration from legacy CloudFormation/EC2 stacks to Terraform‑based, Kubernetes‑oriented infrastructure.
- Design and implement modular, reusable Terraform modules and Kubernetes operators.
- Build and maintain the observability architecture (logging, metrics, tracing, and alerting), moving from ELK to a modern stack (e.g., Loki, Prometheus/Thanos, Tempo, or equivalent).
- Set up comprehensive instrumentation for microservices in Python, Go, or JavaScript.
- Develop and enforce operational guardrails, SLO/SLA definitions, and error budgets.
- Pilot AI/ML tools to reduce toil (incident classification, root‑cause discovery, automated remediation).
- Mentor junior SREs and engineering teams on best practices.
- Participate in on‑call rotation and lead post‑mortem documentation.
**Required Skills:**
- Deep expertise with AWS services (EKS, EC2, S3, RDS, etc.).
- Advanced Terraform skills – module design, state management, CI/CD integration.
- Kubernetes fundamentals (cluster ops, Helm, CRDs, RBAC, networking).
- Modern programming in Python, Go, or JavaScript (API clients, automation scripts).
- Proven experience building observability solutions (logs, metrics, traces).
- Incident response, root‑cause analysis, SLO/SLA/BLP implementation.
- Familiarity with CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins).
- Strong command‑line, scripting, and debugging abilities.
**Bonus Skills:**
- Experience optimizing observability cost/performance (data retention strategies, sampling).
- Contributions to open‑source monitoring or reliability tools.
- AI/ML experimentation in SRE context (e.g., incident chatbot, anomaly detection).
- Knowledge of chaos engineering and reliability and throughput testing.
**Required Education & Certifications:**
- Bachelor’s or Master’s degree in Computer Science, Engineering, or related field (or equivalent professional experience).
- Professional certifications such as AWS Certified Solutions Architect or DevOps Engineer – Professional, or Certified Kubernetes Administrator (CKA) preferred.
---