- Company Name
- Black Rock Solutions INC
- Job Title
- SRE Manager
- Job Description
-
Job Title: SRE Manager
Role Summary: Lead a high‑performing Site Reliability Engineering team to deliver mission‑critical, highly available distributed systems. Own reliability strategy, incident response, observability, automation, and capacity planning to meet aggressive SLAs.
Expectations:
- Build and mentor a scalable SRE organization.
- Own end‑to‑end reliability metrics and continuous improvement.
- Drive cross‑functional incident management and post‑mortem culture.
Key Responsibilities:
- Define, own, and evolve SLIs, SLOs, and reliability roadmaps.
- Lead major incidents, coordinate cross‑functional response, and implement corrective actions to eliminate repeat failures.
- Architect and deploy observability, monitoring, and alerting solutions (metrics, logs, traces) to reduce mean time to detect (MTTD) and mean time to recover (MTTR).
- Improve platform scalability and resilience through automation, CI/CD pipelines, IaC, capacity planning, and performance testing.
- Partner with Engineering, Security, and Product to influence architecture, deploy robust runbooks, and embed reliability in the development lifecycle.
Required Skills:
- Kubernetes, Docker, Prometheus, Grafana, Terraform, AWS.
- Incident management, post‑mortem discipline, on‑call rotation management.
- Cloud‑native architecture, IaC, CI/CD practices; ability to both lead and contribute technically.
- Strong communication and people‑management capabilities.
- Preferred: Go, Python, Jenkins.
Required Education & Certifications:
- Bachelor’s degree in Computer Science, Engineering, or related field.
- Preferred: relevant certifications such as AWS Certified Solutions Architect, Certified Kubernetes Administrator, or equivalent.