- Company Name
- Tempest Vane Partners
- Job Title
- Senior Site Reliability Engineer
- Job Description
**Role Summary**
Lead the design, implementation, and operation of high‑availability, scalable infrastructure for a complex trading platform. Own and evolve SRE principles, monitoring, and automation to ensure reliable service delivery across on‑prem and cloud environments.
**Expectations**
- Own end‑to‑end reliability of distributed systems; set SLIs/SLOs and conduct blameless post‑mortems.
- Take on‑call duty in rotation (about one month per year) and resolve incidents with minimal downtime.
- Drive continuous improvement in tooling, pipelines, and observability to support rapid deployment.
**Key Responsibilities**
- Define and embed SRE best practices, processes, and standards across engineering teams.
- Design, deploy, and maintain the observability stack (Prometheus, Grafana, Loki, Tempo/OTEL).
- Configure and enforce reliability requirements for Kubernetes‑hosted applications, balancing cost, performance, and resilience.
- Develop automation and tooling (Python, Bash, Go) for deployment pipelines, health checks, and recovery workflows.
- Collaborate with development teams to improve service stability, scalability, and fault tolerance using SLOs and post‑mortem analysis.
- Participate in the on‑call rotation (≈1 week/month), covering incident response and post‑incident reviews.
- Mentor junior SRE staff and foster a culture of operational excellence.
**Required Skills**
- 5+ years in an SRE or equivalent role managing complex, distributed systems.
- Expertise in CloudWatch, Prometheus, Grafana, Loki, Tempo, and OTEL.
- Deep knowledge of Kubernetes, Docker, and container orchestration.
- Hands‑on experience with AWS (preferred) and on‑prem infrastructure.
- Proficient in scripting/programming (Python, Bash, Go) for automation and CI/CD pipelines.
- Strong understanding of DevOps principles, CI/CD, and Agile workflows.
- Excellent communication skills for cross‑team collaboration.
- Bonus: Experience with PostgreSQL, Redis, Snowflake, Kafka/Solace, Airflow, or similar tools.
**Required Education & Certifications**
- Bachelor’s degree in Engineering or Computer Science, or equivalent practical experience.
- AWS Certified Solutions Architect or similar cloud certification is a plus.