- Company Name: Archer
- Job Title: Sr. DevOps
- Job Description:
**Job Title:** Sr. DevOps Engineer
**Role Summary:**
Lead the design, deployment, and maintenance of highly available, scalable infrastructure for cloud and on‑premise environments. Drive best practices in CI/CD, configuration management, and observability to support large‑language‑model (LLM) services and general application delivery.
**Expectations:**
- Deliver secure, reliable, and performant infrastructure that meets operational, compliance, and performance goals.
- Collaborate with development, MLOps, and security teams to align infrastructure with product requirements.
- Participate in the on‑call rotation to resolve incidents and maintain service availability.
**Key Responsibilities:**
- Design, deploy, and manage Kubernetes & Docker workloads on AWS, GCP, Azure, and on‑premise data centers.
- Build and maintain IaC solutions using Terraform and Ansible for consistent provisioning.
- Implement and optimize CI/CD pipelines for rapid, automated releases.
- Administer Linux and Windows servers, including troubleshooting and patch management.
- Configure observability with Datadog (or Prometheus/Grafana, New Relic) for logging, tracing, and alerting.
- Operate, scale, and monitor LLM infrastructure utilizing LiteLLM, OpenRouter, or similar frameworks.
- Automate repetitive tasks with Bash and Python scripting.
- Ensure compliance with security and network best practices in hybrid environments.
- Support MLOps workflows (e.g., MLflow, Kubeflow) as needed.
**Required Skills:**
- 5+ years in DevOps, SRE, or infrastructure engineering.
- Deep expertise in Kubernetes (design, deployment, troubleshooting) & Docker.
- Proven experience managing multi‑cloud (AWS, GCP, Azure) and on‑premise environments.
- Advanced Linux administration; strong knowledge of Windows Server.
- Mastery of IaC tools (Terraform, Ansible).
- Proficiency in Bash & Python automation.
- Extensive experience with monitoring/observability platforms (Datadog, Prometheus, Grafana, New Relic).
- Hands‑on deployment and management of LLM services (LiteLLM, OpenRouter).
**Nice‑to‑Have Skills:**
- Familiarity with Kubernetes distributions (K3s, Rancher, OpenShift).
- Network configuration, firewall, and security best practices.
- MLOps tools such as Kubeflow.
**Required Education & Certifications:**
- Bachelor’s degree in Computer Science, Engineering, or related field preferred.
- Certifications: CKA, CKAD, or AWS/Azure/GCP cloud certifications (or equivalent).