- Company Name
- Alibaba Cloud
- Job Title
- Cloud Platform Site Reliability Engineer
- Job Description
**Job Title:** Cloud Platform Site Reliability Engineer
**Role Summary:**
Design, build, and operate highly available cloud workloads to ensure 99.99% availability for enterprise customers. Lead incident response, stability engineering, and automation initiatives across application, database, and middleware tiers.
**Expectations:**
- Maintain continuous service uptime during large‑scale events and peak periods.
- Own the end‑to‑end incident lifecycle, including triage, response, root‑cause analysis, and post‑mortem.
- Implement and maintain operational dashboards, alerting, and automation to reduce manual toil.
- Collaborate with R&D to embed stability best practices into product development and release cycles.
- Participate in on‑call rotations, meeting SLA requirements for issue resolution.
**Key Responsibilities:**
- Monitor and maintain production applications, databases, and middleware daily.
- Design and enforce stability metrics, service level objectives, and change management processes.
- Execute full‑stack disaster recovery drills and emergency response plans (1‑minute alert, 5‑minute triage, 10‑minute recovery).
- Develop and automate platforms for unattended change, risk inspection, and red/blue team testing.
- Provide technical support during high‑traffic events (e.g., global conferences, business peaks).
- Coordinate across teams and run post‑incident reviews to drive systemic improvements.
- Respond to customer inquiries and proactively resolve stability risks.
**Required Skills:**
- Strong knowledge of cloud platform architecture (IaaS, PaaS, SaaS) and container orchestration (Kubernetes, ECS).
- Proficiency in infrastructure as code (Terraform, CloudFormation) and configuration management (Ansible, Chef, Puppet).
- Experience with CI/CD pipelines, release automation, and Blue‑Green/Canary deployments.
- Deep understanding of monitoring, logging, and alerting (Prometheus, Grafana, ELK, cloud‑native stack).
- Incident management expertise: SIEM, alert triage, root‑cause analysis, post‑mortem documentation.
- Familiarity with security and compliance controls, including risk and vulnerability inspection.
- Strong scripting skills (Python, Bash, PowerShell).
- Excellent communication and collaboration skills for cross‑functional teamwork.
**Required Education & Certifications:**
- Bachelor’s degree or higher in Computer Science, Information Technology, or a related field.
- Professional certifications preferred:
  - Certified Kubernetes Administrator (CKA) or Certified Kubernetes Security Specialist (CKS)
  - AWS Certified DevOps Engineer – Professional, Azure DevOps Engineer Expert, or Google Cloud Professional DevOps Engineer
  - ITIL Foundation or equivalent service‑management certification
---