Alibaba Cloud

Cloud Infrastructure – Site Reliability Engineer (SRE)

On site

Sunnyvale, United States

$171,000/year

Junior

Full Time

19-01-2026

Skills

Python, Java, Go, Incident Response, Kubernetes, Monitoring, Architecture, Risk Control, Shell, Chaos Engineering, Kafka, Terraform, Infrastructure as Code

Job Specifications

The Alibaba Cloud Native Message Middleware Team is responsible for Alibaba Cloud's messaging products, including RocketMQ. We are committed to building a more stable, user-friendly, streaming-capable, and large-scale messaging platform for the future.

Cloud Product Operations & Reliability

Oversee stability maintenance, performance tuning, and high-availability architecture design for cloud middleware, including messaging middleware (Kafka/RocketMQ).

Manage the containerized middleware lifecycle on Kubernetes clusters: implement deployments, auto-scaling, version upgrades, and resource optimization in K8s environments.

Incident Response & Root Cause Analysis

Lead the troubleshooting of middleware-related incidents (e.g., message backlog, service registration failures) through log analysis, distributed tracing, and monitoring systems.

Develop diagnostic tools using Java/Go to resolve production issues, performance bottlenecks, and compatibility challenges.

Automation & Operational Excellence

Build Python/Go/Shell automation tools to standardize middleware deployment, monitoring, and disaster recovery workflows.

Implement chaos engineering experiments, capacity planning strategies, and failover mechanisms to enhance system resilience.

Strong scripting skills in Shell/Python and experience with Infrastructure as Code (IaC) tools (Terraform preferred).

Minimum Qualifications:

Experience: More than 2 years of experience in distributed systems reliability engineering, familiarity with high-availability architecture design, and proficiency in at least one of Python, Go, or Java.

Messaging: Experience with cluster management, message reliability assurance, and performance optimization for Kafka/RocketMQ.

Hands-on experience deploying middleware on Kubernetes (Helm/Operator preferred).

Automation: Ability to turn operations experience into automated solutions, and familiarity with a range of messaging middleware, e.g., Kafka and RocketMQ.

Preferred Qualifications:

SRE Practices: Familiar with core SRE practices (incident review, error budgeting, chaos engineering) and experienced in building automated risk control systems.

The pay range for this position at commencement of employment is expected to be between $104,400 and $171,000/year. However, base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience.

If hired, employee will be in an “at-will position” and the Company reserves the right to modify base salary (as well as any other discretionary payment or compensation program) at any time, including for reasons related to individual performance, Company or individual department/team performance, and market factors.

About the Company

Established in September 2009, Alibaba Cloud develops highly scalable cloud computing and data management services providing large and small businesses, financial institutions, governments and other organizations with flexible, cost-effective solutions to meet their networking and information needs. A business of Alibaba Group, one of the world’s largest e-commerce companies, Alibaba Cloud operates the network that powers Alibaba Group’s extensive online and mobile commerce ecosystem and sells a comprehensive suite of cloud ...