Alibaba Cloud

www.alibabacloud.com

6 Jobs

4,580 Employees

About the Company

Established in September 2009, Alibaba Cloud develops highly scalable cloud computing and data management services providing large and small businesses, financial institutions, governments and other organizations with flexible, cost-effective solutions to meet their networking and information needs. A business of Alibaba Group, one of the world’s largest e-commerce companies, Alibaba Cloud operates the network that powers Alibaba Group’s extensive online and mobile commerce ecosystem and sells a comprehensive suite of cloud computing services to support sellers and other third-party entities participating in this ecosystem.


Follow us:
Twitter: www.twitter.com/alibaba_cloud
Facebook: https://www.facebook.com/alibabacloud/

Listed Jobs

Company Name
Alibaba Cloud
Job Title
Site Reliability Engineer (Apsara Lab)
Job Description
**Job Title**
Site Reliability Engineer

**Role Summary**
Build, operate, and continuously improve a highly available, performant model‑service platform. Ensure reliability through monitoring, incident response, automation, and customer issue resolution.

**Expectations**
- Maintain service SLA targets and system uptime.
- Take part in the on‑call rotation and manage critical incidents.
- Deliver automation tools to reduce manual operations.
- Communicate effectively in both Chinese and English.

**Key Responsibilities**
- Deploy, operate, and incrementally enhance the platform and its standalone website.
- Design and refine monitoring, log collection, and alerting strategies (metrics, dashboards, thresholds).
- Rapidly diagnose and resolve failures at the network, service, and hardware levels; lead root‑cause analysis (RCA) and implement long‑term fixes.
- Investigate and solve customer API QoS issues, coordinating with development on application clusters, edge networks, and infrastructure.
- Develop scripts and tools in Python or Go for deployment, scaling, fault recovery, and other operational workflows.
- Build diagnostic toolchains to accelerate issue resolution and improve customer satisfaction.

**Required Skills**
- 3+ years in SRE, DevOps, or backend development focused on distributed systems.
- Programming in Python, Go, Java, or C++.
- Experience with Linux, TCP/HTTP, and databases.
- Strong incident‑management and on‑call experience.
- Fluency in Chinese and English.

**Preferred Skills**
- Knowledge of MaaS and AI infrastructure.
- Expertise with Kubernetes, Prometheus, Istio, Calico, and other cloud‑native components.
- Experience building large‑scale monitoring systems and using observability data for operational insights.

**Required Education & Certifications**
- Bachelor’s degree or higher in Computer Science, Engineering, or a related technical field.
- Relevant certifications (e.g., Google SRE, Certified Kubernetes Administrator, AWS Certified DevOps Engineer) are preferred.
Seattle, United States
On site
Junior
08-12-2025
Company Name
Alibaba Cloud
Job Title
Cloud Platform Site Reliability Engineer
Job Description
**Job Title:**
Cloud Platform Site Reliability Engineer

**Role Summary:**
Design, build, and operate highly available cloud workloads to ensure 99.99% availability for enterprise customers. Lead incident response, stability engineering, and automation initiatives across application, database, and middleware tiers.

**Expectations:**
- Maintain continuous service uptime for large‑scale events and peak periods.
- Own the end‑to‑end incident lifecycle, including triage, response, root‑cause analysis, and post‑mortem.
- Implement and enforce operational dashboards, alerting, and automation to reduce manual toil.
- Collaborate with R&D to embed stability best practices into product development and release cycles.
- Participate in on‑call rotations, meeting SLA requirements for issue resolution.

**Key Responsibilities:**
- Daily monitoring and maintenance of production applications, databases, and middleware.
- Design and enforce stability metrics, service level objectives, and change management processes.
- Execute full‑stack disaster recovery drills and emergency response plans (1‑minute alert, 5‑minute triage, 10‑minute recovery).
- Develop and automate unattended change, risk inspection, and red/blue team testing platforms.
- Provide technical support during high‑traffic events (e.g., global conferences, business peaks).
- Conduct cross‑team coordination and post‑incident reviews to drive systemic improvements.
- Respond to customer inquiries and proactively resolve stability risks.

**Required Skills:**
- Strong knowledge of cloud platform architecture (IaaS, PaaS, SaaS) and container orchestration (Kubernetes, ECS).
- Proficiency in infrastructure as code (Terraform, CloudFormation) and configuration management (Ansible, Chef, Puppet).
- Experience with CI/CD pipelines, release automation, and Blue‑Green/Canary deployments.
- Deep understanding of monitoring, logging, and alerting (Prometheus, Grafana, ELK, cloud‑native stack).
- Incident management expertise: SIEM, alert triage, root‑cause analysis, post‑mortem documentation.
- Familiarity with security and compliance controls, including risk and vulnerability inspection.
- Strong scripting skills (Python, Bash, PowerShell).
- Excellent communication and collaboration skills for cross‑functional teamwork.

**Required Education & Certifications:**
- Bachelor’s degree or higher in Computer Science, Information Technology, or related field.
- Professional certifications preferred:
  - Certified Kubernetes Administrator (CKA) or Certified Kubernetes Security Specialist (CKS)
  - AWS Certified DevOps Engineer – Professional, Azure DevOps Engineer Expert, or Google Cloud Professional DevOps Engineer
  - ITIL Foundation or equivalent service‑management certification
Sunnyvale, United States
On site
08-12-2025
Company Name
Alibaba Cloud
Job Title
Quality Assurance Engineer
Job Description
**Job Title**
Quality Assurance Engineer – Construction Testing & Commissioning

**Role Summary**
Lead on‑site supervision and technical testing of electrical and mechanical facilities for large‑scale data center projects. Plan, execute, and document all facility testing and commissioning activities to ensure compliance with design specifications, safety regulations, and performance targets while engaging stakeholders and escalating issues as needed.

**Expectations**
- Allocate at least 30% of work time to on‑site activities.
- Deliver comprehensive testing plans, execute tests, and close out commissioning on schedule and within budget.
- Communicate findings clearly to both technical and non‑technical audiences.
- Maintain rigorous documentation, reports, and audit trails.
- Resolve complex technical problems and coordinate inter‑disciplinary team efforts.

**Key Responsibilities**
1. Supervise and coordinate on‑site construction testing of electrical, mechanical, HVAC, fire, plumbing, and monitoring systems.
2. Develop and execute detailed testing and commissioning plans, including test scripts, qualification procedures, and acceptance criteria.
3. Collect, analyze, and document test results; produce concise reports and traceability matrices.
4. Ensure adherence to safety, quality, and regulatory standards; conduct inspections and verify corrective actions.
5. Escalate technical issues to project stakeholders; facilitate cross‑functional meetings and progress updates.
6. Collaborate with design, procurement, and construction teams to identify and mitigate risks.
7. Validate system performance against design specifications and operational requirements.

**Required Skills**
- Minimum 5 years of experience in facility testing and commissioning for large‑scale infrastructure projects.
- Strong knowledge of construction quality management and on‑site supervision.
- Proficient in creating and executing test plans, interpreting test results, and reporting.
- Excellent stakeholder communication and presentation skills.
- Analytical problem‑solving ability and strong documentation skills.
- Project management and coordination aptitude; ability to work cross‑functionally.
- Competence in electrical and mechanical systems: power, cooling, ventilation, fire‑fighting, plumbing, drainage, and monitoring.

**Required Education & Certifications**
- Bachelor’s degree in Electrical Engineering, Mechanical Engineering, or related field.
- Master’s degree or Professional Engineer (PE) license preferred.
- Additional certifications (e.g., PMP, CPK, ASHRAE) are advantageous.
Sunnyvale, United States
On site
Mid level
31-12-2025
Company Name
Alibaba Cloud
Job Title
Cloud Network SRE Engineer
Job Description
**Job Title:**
Cloud Network SRE Engineer

**Role Summary:**
Assure the reliability, scalability, and performance of a large‑scale cloud networking platform. Design and maintain automated operations, monitor and troubleshoot incidents, and collaborate on architecture enhancements to keep business services available and secure.

**Expectations:**
* Deliver rapid resolution of network incidents within SLA windows.
* Drive automation and process standardization to improve operational efficiency.
* Actively research and apply emerging networking technologies to enhance stability.
* Communicate effectively through clear documentation and cross‑team collaboration.

**Key Responsibilities:**
1. Maintain and improve cloud networking stability, ensuring continuous user service.
2. Design, implement, and manage automated operations systems and tooling.
3. Monitor, alert, and troubleshoot network issues; respond quickly to incidents.
4. Participate in architecture reviews and performance optimization of network services.
5. Track industry trends in cloud networking to recommend innovative improvements.
6. Manage on‑call duties and service‑level incident resolution.

**Required Skills:**
* 3+ years of cloud computing or network operations experience.
* Proficiency in scripting/programming (Python, Golang, Java, or similar).
* Strong Linux system administration and command‑line proficiency.
* Experience with public cloud platforms (AliCloud, AWS, Azure) and their networking products.
* Knowledge of databases (MySQL, Redis) for troubleshooting.
* Solid understanding of network protocols, distributed systems, and performance tuning.
* Strong troubleshooting and incident response capabilities.
* Excellent written and verbal communication for documentation and collaboration.

**Required Education & Certifications:**
* Bachelor’s degree in Computer Science, Information Technology, or related field.
* Relevant certifications such as AWS Certified Solutions Architect, Azure Network Engineer Associate, or equivalent (preferred but not mandatory).
Sunnyvale, United States
On site
Junior
31-12-2025