- Company Name
- OVHcloud
- Job Title
- Senior Site Reliability Engineer (Overnight)
- Job Description
-
Job title: Senior Site Reliability Engineer (Overnight)
Role Summary:
Provides operational support and incident response for OVHcloud services during overnight shifts (Sunday‑Thursday, 10:00 PM‑6:00 AM CST) as part of 24/7 coverage. Maintains the availability, performance, and reliability of distributed Linux/Unix and Windows infrastructure, automates operational tasks, and contributes to new product deployments.
Expectations:
• Work the standard overnight shift and participate in the weekend on‑call rotation.
• Manage comprehensive SRE responsibilities, including monitoring, alerting, root‑cause analysis, and documentation.
• Deliver proactive automation, scripting, and tooling to increase reliability and efficiency.
• Collaborate with development teams on microservices, APIs, and UAT for product launches.
• Communicate findings and recommendations clearly to technical and business stakeholders.
Key Responsibilities:
- Monitor alerting systems (Grafana, Nagios) and tune their configurations to support a 99.99% availability target.
- Diagnose incidents using data‑driven analysis, perform root‑cause analysis, and document solutions.
- Develop and maintain automation scripts in Bash, Python, Go, or Perl.
- Contribute to microservice build, deployment, and troubleshooting.
- Configure and deploy infrastructure via Terraform, Ansible, or Puppet.
- Maintain monitoring, metrics, and logging stacks (OpenSearch, Grafana).
- Produce automated reports and dashboards for teams and leadership.
- Write and update knowledge‑base articles, SOPs, and incident post‑mortems.
- Conduct User Acceptance Testing for new services and features.
Required Skills:
- 5+ years in an SRE, DevOps, or related role.
- 3+ years administering Linux/Unix and Windows systems.
- Experience with microservices, APIs, and cloud‑native architectures.
- Proficiency in scripting (Perl, Python, Bash, Go).
- Hands‑on experience with monitoring and alerting tools (Nagios, Grafana, OpenSearch).
- Familiarity with configuration management (Puppet, Ansible) and IaC (Terraform).
- Knowledge of virtualization and containers (Docker, Kubernetes).
- Strong analytical, problem‑solving, and documentation skills.
- Effective prioritization and ability to handle competing tasks.
Required Education & Certifications:
- Bachelor’s degree in Computer Science or related field preferred; equivalent experience acceptable.
- No certifications are required, but relevant cloud or DevOps certifications are a plus.