- Company Name
- TP-Link
- Job Title
- Senior Site Reliability Engineer
- Job Description
**Job title:** Senior Site Reliability Engineer
**Role Summary:**
Lead the design, deployment, and operation of microservices on Kubernetes across a multi‑cloud environment (AWS, Azure, OCI, GCP). Ensure high availability, scalability, security, and compliance while driving observability, automation, and incident response. Mentor junior staff and participate in on‑call rotations.
**Expectations:**
- Act as technical SME for Kubernetes‑based microservices.
- Collaborate with Cloud, DevOps, and development teams on deployment, scaling, and disaster recovery.
- Maintain and improve observability, SLOs, and compliance (ISO 27001, SOC 2, GDPR).
- Lead and resolve incidents, perform post‑mortem analysis, and implement preventive measures.
- Provide mentorship, documentation, and support for evolving processes.
**Key Responsibilities:**
1. Define and implement Kubernetes deployments for microservices on multi‑cloud platforms.
2. Perform load and chaos testing to validate scalability and reliability.
3. Build observability pipelines (metrics, logs, traces) across AWS, OCI, Azure, GCP.
4. Develop and maintain disaster‑recovery plans with DevOps and development teams.
5. Identify and remediate production risks related to resource allocation, Horizontal Pod Autoscaler (HPA) configuration, JVM tuning, etc.
6. Automate workflows using Python, Go, Bash, or related scripting languages.
7. Define and track KPIs for SLA/SLO/SLI with business stakeholders.
8. Produce architectural documentation, operating procedures, and design records.
9. Enforce security and compliance controls: IAM, network security, application security, data protection.
10. Lead incident response, conduct root‑cause analysis, and recommend solutions.
11. Evaluate and pilot new tools/technologies in collaboration with product teams.
12. Mentor junior SREs; participate in the on‑call rotation, including after hours and weekends.
**Required Skills:**
- 5+ years as a Site Reliability Engineer or equivalent.
- Strong programming/scripting proficiency in Java, Python, Bash, or PowerShell.
- Hands‑on experience with SRE/DevOps practices, cloud operations, and cloud security.
- Deep knowledge of Kubernetes orchestration and cloud services (AWS, Azure, OCI, GCP).
- Experience building observability stacks (Prometheus, Grafana, ELK, etc.).
- Familiarity with load/chaos testing tools.
- Expertise in incident management, post‑incident reviews, and root‑cause analysis.
- Strong analytical and problem‑solving skills, with the ability to work independently.
- Ability to produce clear technical documentation and maintain compliance standards.
**Required Education & Certifications:**
- Bachelor’s degree in Computer Science, Information Technology, or a related discipline.
- 5+ years of SRE/DevOps experience in a cloud environment.
- Preferred: Professional-level cloud certifications (AWS Certified Solutions Architect (Professional), Azure Solutions Architect Expert, Google Cloud Professional Cloud Architect).