- Company Name
- Humankind Global Recruitment
- Job Title
- Site Reliability Engineer
- Job Description
**Job Title:**
Site Reliability Engineer
**Role Summary:**
Ensure the reliability, availability, and performance of mission‑critical applications and infrastructure. Bridge development and operations by designing automation, monitoring, and incident response practices. Mentor junior SREs and collaborate with cross‑functional teams to embed reliability into system design.
**Expectations:**
- 7+ years of SRE or equivalent infrastructure/operations experience.
- Proven leadership in incident management and post‑mortem culture.
- Ability to mentor and influence engineering and operations teams.
- Commitment to driving continuous improvement of tooling, processes, and performance.
**Key Responsibilities:**
- Maintain and improve system reliability, uptime, and performance across production.
- Define, track, and enforce SLOs, SLIs, and SLAs.
- Design, implement, and maintain IaC and deployment automation for scalable infrastructure.
- Integrate reliability practices into CI/CD pipelines.
- Deploy and manage monitoring, logging, and alerting solutions; define key metrics.
- Lead incident response, root‑cause analysis, and blameless post‑mortems.
- Mentor junior SREs and collaborate with engineering teams to build reliability into system architecture.
- Identify and implement improvements to reduce toil, optimize resources, and enhance service quality.
**Required Skills:**
- Infrastructure‑as‑Code: Terraform, Ansible, AWX (or equivalent).
- Containerization & Orchestration: Docker, Kubernetes.
- Cloud Platforms: AWS (S3, EC2, etc.), Google Cloud, or Azure (working knowledge preferred).
- Monitoring / Observability: Zabbix, Prometheus, Grafana; ELK stack, Fluentd, OpenTelemetry.
- Messaging & Caching: Redis, RabbitMQ.
- Distributed Systems: CAP theorem, eventual consistency concepts.
- Networking & Security: TCP/IP, HTTP/HTTPS, DNS, SMTP, load balancing, WAF, SIEM, service hardening, IPv4/IPv6 fundamentals, routing protocols (BGP, OSPF).
- Scripting/Programming: Python, Go, Bash.
- CI/CD & DevOps: Jenkins, GitLab, GitHub, basic DevSecOps practices.
**Required Education & Certifications:**
- Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent practical experience).
- Relevant certifications preferred: AWS Certified Solutions Architect, Google Cloud Professional Cloud Architect, Certified Kubernetes Administrator (CKA), Red Hat Certified System Administrator (RHCSA), or equivalent.