- Company Name
- TechnoSphere, Inc.
- Job Title
- Software Engineer
- Job Description
**Job Title:** SRE – Director
**Role Summary:**
Leads a global Site Reliability Engineering (SRE) organization to design, build, and maintain highly available, scalable, and resilient systems. Drives automation, reliability best practices, and observability across engineering, product, security, and operations teams to deliver exceptional customer experiences.
**Expectations:**
- Define and execute an SRE strategy aligned with business and engineering goals.
- Build and mentor a high‑performing, cross‑functional SRE team.
- Own service level agreements, objectives, and indicators (SLAs/SLOs/SLIs) and ensure targets are consistently met.
- Lead incident management, root‑cause analysis, and continuous improvement initiatives.
- Champion automation, cost‑optimized cloud architecture, and infrastructure‑as‑code practices.
- Communicate reliability metrics and progress to senior leadership and stakeholders.
**Key Responsibilities:**
- **Leadership & Strategy:** Recruit, develop, and manage SRE talent; foster a culture of reliability and performance.
- **Reliability Engineering:** Set and monitor SLAs/SLOs/SLIs; develop runbooks, tooling, and automation frameworks; oversee incident response and RCA processes.
- **Platform & Infrastructure:** Partner with Infrastructure, DevOps, and Cloud teams to design scalable architectures; promote IaC (Terraform, Ansible), CI/CD pipelines (Jenkins, ArgoCD, GitHub Actions), and modern observability tools (Prometheus, Grafana, Datadog, New Relic).
- **Collaboration & Communication:** Act as a reliability evangelist; enable engineering teams to own service reliability; report metrics to leadership; coordinate with security, compliance, and governance teams to ensure regulatory adherence.
**Required Skills:**
- Deep expertise in cloud platforms (AWS, GCP) and hybrid/multi‑cloud environments.
- Strong experience with containers and orchestration (Docker, Kubernetes).
- Proficiency in monitoring/observability tools (Prometheus, Grafana, Datadog, New Relic).
- Mastery of automation and IaC tools (Terraform, Ansible) and CI/CD systems.
- Proven track record managing large‑scale, high‑availability distributed systems.
- Excellent leadership, communication, and team development abilities.
**Required Education & Certifications:**
- Bachelor’s (or Master’s) degree in Computer Science, Engineering, or a related field.
- 15+ years of software engineering or infrastructure experience, including 5+ years in SRE or DevOps leadership roles.
- Preferred: Cloud certifications (e.g., AWS Certified DevOps Engineer, Google Cloud Professional Cloud DevOps Engineer), experience in regulated industries (telecom/communications), and senior experience at top consulting firms.