- Company Name
- Alibaba Cloud
- Job Title
- Cloud Platform SRE
- Job Description
-
Role Summary:
Ensure uninterrupted, highly available production environments for enterprise‑grade cloud services. Develop and enforce stability standards, manage incidents, automate reliability tooling, and support large‑scale customer events to maintain >99.99% uptime.
Expectations:
* Participation in 24/7 on‑call rotations, with responses that meet SLA targets.
* Rapid incident triage, root‑cause analysis, and post‑mortem reviews.
* Continuous improvement of stability metrics and automation pipelines.
* Collaboration with R&D on production readiness and on support during critical peak periods.
Key Responsibilities:
* Daily operations, monitoring, and maintenance of applications, databases, and middleware.
* Incident response, cross‑team coordination, and root‑cause analysis.
* Design and enforce stability standards, metrics, and governance campaigns.
* Lead full‑stack disaster recovery, phased change rollouts, and emergency response drills (1‑5‑10 model: detect within 1 minute, locate within 5 minutes, recover within 10 minutes).
* Build and maintain automated change‑management, monitoring, and alerting platforms.
* Support large‑scale events (e.g., Olympics, peak business periods) with technical and operational planning.
* Perform risk and vulnerability inspections, and conduct red/blue team exercises.
* Provide expertise in capacity planning, performance diagnostics, and system hardening.
Required Skills:
* 3+ years of SRE/DevOps experience in cloud environments.
* Strong knowledge of cloud infrastructure (AWS, Azure, GCP) and automation (Python, Bash).
* Proficiency with monitoring/alerting tools (Prometheus, Grafana, CloudWatch, etc.).
* Hands‑on experience with incident management, root‑cause analysis, and post‑mortem documentation.
* Expertise in change management, disaster recovery, and high‑availability design.
* Excellent communication, problem‑solving, and teamwork across distributed teams.
* Familiarity with container orchestration (Kubernetes) and AIOps practices.
Required Education & Certifications:
* Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
* Cloud & DevOps certifications such as AWS Certified Solutions Architect, Azure Solutions Architect, GCP Professional Cloud Architect, or CNCF Certified Kubernetes Administrator (CKA).
* Incident response or related professional certifications preferred.