Company Name: Zipliens
Job Title: Senior Site Reliability Engineer
Job Description:
Role Summary: Lead the reliability, scalability, and security of a legal technology platform. Own production and non‑production environments, design and maintain CI/CD pipelines, build observability tooling, and drive incident response and root‑cause analysis. Collaborate closely with software engineers and leadership to ensure high‑quality, secure, and resilient deployments.
Expectations:
- 7+ years in Site Reliability Engineering, DevOps, or Infrastructure Engineering.
- Proven experience scaling cloud‑based production systems (AWS, GCP, or Azure).
- Expertise in incident response, root‑cause analysis, and systemic improvement.
- Strong automation background: CI/CD, IaC, containerization.
- Advanced monitoring, logging, and alerting knowledge.
- Solid understanding of cloud security, secrets management, and backup strategies.
- Proficiency in scripting (Python, Go, Bash) and IaC tools (Terraform, CloudFormation).
- Excellent written and verbal communication and cross‑functional collaboration skills.
- Able to work onsite at least 3 days a week.
Key Responsibilities:
1. Maintain and improve availability, performance, and reliability of production/non‑production environments.
2. Identify scalability and capacity risks; recommend mitigation strategies.
3. Enhance system observability through monitoring, logging, and alerting; define reliability metrics.
4. Lead incident investigations; develop preventive measures.
5. Shape and evolve reliability standards and practices.
6. Build and continually improve CI/CD pipelines for reliable, repeatable deployments.
7. Automate infrastructure provisioning, configuration, and operational workflows.
8. Develop tooling to improve performance, observability, and deployment confidence.
9. Standardize deployment practices and operational readiness across services.
10. Establish and enforce best practices for access controls, secrets management, and system hardening.
11. Ensure backup, recovery, and disaster‑readiness strategies are tested and reliable.
12. Partner with leadership on security reviews, compliance initiatives, and risk mitigation.
Required Skills:
- Expertise with AWS, GCP, or Azure production environments.
- Hands‑on experience with Terraform, CloudFormation, Docker, Kubernetes, and CI/CD tooling.
- Strong troubleshooting and incident response skills.
- Experience with monitoring, logging, and alerting platforms (Prometheus, Grafana, ELK, CloudWatch, etc.).
- Cloud security fundamentals: access control, secrets management, backup strategies.
- Scripting proficiency (Python, Go, Bash).
- Strong written and verbal communication and stakeholder collaboration.
- Ability to work onsite at least 3 days a week.
Required Education & Certifications:
- Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent professional experience).
- Relevant certifications (AWS Certified SysOps Administrator, Google Cloud Professional Cloud DevOps Engineer, Certified Kubernetes Administrator, etc.) preferred but not required.