- Company Name: TechDoQuest
- Job Title: Site Reliability Engineer (SRE)
- Job Description:

**Job Title:** Site Reliability Engineer (SRE)
**Role Summary:**
Responsible for ensuring the reliability, scalability, and performance of production systems. The role encompasses end‑to‑end automation, monitoring, incident response, and continuous improvement across cloud and container platforms.
**Expectations:**
- 3+ years in SRE, DevOps, or a related reliability role.
- Proven experience with multi‑cloud (AWS, Azure, GCP) and Kubernetes‑based environments.
- Demonstrated ownership of incident management, root‑cause analysis, and post‑mortem processes.
**Key Responsibilities:**
- Design, implement, and maintain infrastructure automation using Ansible, Terraform, or equivalent.
- Build and operate end‑to‑end monitoring, observability, and alerting with Dynatrace, Moogsoft, and Elastic Stack.
- Enhance CI/CD pipelines (Jenkins, GitHub Actions, UrbanCode Deploy) and deployment automation via Helm, Docker, or Kubernetes manifests.
- Lead incident response, conduct root‑cause analysis, and drive corrective actions to completion.
- Manage cloud resources, container orchestration, and distributed systems at scale; perform capacity planning and load testing.
- Enforce security, compliance, and governance (IAM, encryption, Vault, SOC 2, etc.).
- Document runbooks, SLOs/SLAs, and operational playbooks.
**Required Skills:**
- **Automation & Scripting:** Ansible, Python, PowerShell.
- **Monitoring & Observability:** Dynatrace, Moogsoft, Elastic Stack (Elasticsearch, Logstash, Kibana).
- **Incident Management:** ServiceNow ticketing, CMDB.
- **Cloud Platforms:** AWS, Azure, or GCP (compute, storage, serverless, networking).
- **Container & Orchestration:** Kubernetes/OpenShift, Docker, Helm.
- **Databases & Storage:** SQL Server, NoSQL (Cassandra, Redis) with replication/HA.
- **Security & Compliance:** IAM, encryption, Vault, vulnerability scanning, SOC 2.
- **CI/CD & DevOps:** Jenkins, GitHub Actions, UrbanCode Deploy, Artifactory/Nexus, Git branching strategies.
- **Performance Engineering:** JMeter load testing, capacity planning, SLI/SLO/SLA definition.
**Required Education & Certifications:**
- Bachelor’s degree in Computer Science, Engineering, or equivalent.
- Optional certifications: AWS Certified Solutions Architect, Azure Solutions Architect, GCP Professional Cloud Architect, Certified Kubernetes Administrator (CKA), Ansible Tower Certified, HashiCorp Certified: Terraform Associate, ITIL Foundation.