- Company Name
- Symphony
- Job Title
- Site Reliability Engineer
- Job Description
-
Job Title: Site Reliability Engineer
Role Summary:
Deliver and maintain scalable, highly available cloud‑native services across global, distributed environments. Drive engineering teams toward a “you build it, you run it” culture by embedding resiliency, scalability, and operational ownership into development and deployment pipelines. Own platform automation, GitOps practices, and observability to ensure reliable, customer‑centric services.
Expectations:
* Champion a production‑centric mindset, guiding peers through design decisions that affect uptime and resilience.
* Own long‑term platform health, driving continuous improvement of infrastructure, automation, and monitoring.
* Engage in incident response, on‑call duties, and post‑mortem analysis to reduce incident recurrence.
Key Responsibilities:
1. Design, build, and maintain cloud infrastructure using IaC (Terraform) and Kubernetes cluster management (Helm).
2. Administer and troubleshoot Linux systems, ensuring performance, security, and availability.
3. Architect and operate cloud‑native solutions on GCP and AWS, using each platform's native services.
4. Implement an observability stack (metrics, logs, and alerts) to detect and resolve operational issues proactively.
5. Develop automation tools in Python or Go to streamline deployments, scaling, and incident response.
6. Participate in a 24/7 on‑call rotation; respond to incidents, outages, and network performance issues.
7. Collaborate across product, engineering, and operations teams to align technical, operational, and business goals.
Required Skills:
* Infrastructure‑as‑Code with Terraform (or equivalent) and configuration management.
* Deep knowledge of Kubernetes, Helm, and cluster lifecycle.
* Strong Linux administration and troubleshooting expertise.
* Hands‑on experience with GCP and AWS cloud services and architectures.
* Proven experience implementing observability solutions (monitoring, logging, alerting).
* Networking fundamentals (TCP/IP, DNS, load balancing, VPC, firewall, routing).
* Programming/automation skills in Python, Go, or similar.
* Excellent communication, problem‑solving, and cross‑functional collaboration.
* Ability to work independently and lead projects in a fast‑paced environment.
* Willingness to participate in a 24/7 on‑call rotation.
Required Education & Certifications:
* Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent professional experience.
* Preferred certifications: AWS Certified Solutions Architect, Google Cloud Professional Cloud Architect, Certified Kubernetes Administrator (CKA), HashiCorp Certified: Terraform Associate.
---