- Company Name: CATHEXIS
- Job Title: Site Reliability Engineer (req-174)
- Job Description:
**Job Title:** Site Reliability Engineer
**Role Summary**
Design, deploy, and maintain the Kubernetes clusters and cloud infrastructure that support AI‑driven solutions, with responsibility for their reliability, scalability, and security. Manage CI/CD pipelines, automation, monitoring, and incident response across AWS, Azure, and GCP environments.
**Expectations**
- Deliver 99.95% uptime and rapid incident resolution.
- Continuously improve infrastructure performance and cost efficiency.
- Automate provisioning, scaling, and monitoring using IaC tools.
- Ensure compliance with security standards and regulatory requirements.
- Collaborate closely with development, service, and operations teams.
**Key Responsibilities**
- Deploy, monitor, and scale applications on Kubernetes clusters; maintain Helm charts and cluster resources.
- Provision and configure cloud infrastructure (AWS, Azure, GCP) with Terraform, CloudFormation, or equivalent IaC tools.
- Implement and tune monitoring, alerting, and logging for Kubernetes, CI/CD, and infrastructure components.
- Lead incident response: diagnose root causes, implement fixes, and improve post‑mortem processes.
- Automate infrastructure workflows with Terraform, Ansible, or similar tools.
- Enforce security best practices, including RBAC, encryption, vulnerability scanning, and compliance checks.
- Collaborate with cross‑functional teams to integrate application development and infrastructure delivery.
- Identify and remediate performance bottlenecks and reliability gaps proactively.
**Required Skills**
- Kubernetes cluster administration, Helm, and resource management.
- Cloud platform expertise: AWS, Azure, and/or GCP.
- Infrastructure as Code: Terraform, CloudFormation, or similar.
- Monitoring tools: Prometheus, Grafana, ELK, or equivalent.
- Programming in a structured or object‑oriented language: Python, Java, C/C++, Ruby, or JavaScript.
- Distributed storage knowledge: NFS, HDFS, Ceph, Amazon S3.
- Experience with CI/CD systems (e.g., Jenkins, GitLab CI).
- Agile/Scrum workflow experience.
- Strong problem‑identification, troubleshooting, and performance tuning skills.
- Security & compliance fundamentals (RBAC, encryption, vulnerability scanning).
**Required Education & Certifications**
- Bachelor’s degree or equivalent in Computer Science, Engineering, or related field.
- Active Secret Clearance (required).
- 2+ years of experience managing on‑premises and cloud environments.
- Certifications in Kubernetes (CKA/CKAD) and/or cloud platforms (AWS Certified Solutions Architect, Azure Administrator, GCP Professional Cloud Architect) are a plus.