- Company Name: Reducto
- Job Title: Founding Infrastructure Engineer
- Job Description:
**Job title**: Founding Infrastructure Engineer
**Role Summary**: Lead the design, build, and operation of a scalable, highly available infrastructure stack to support AI/ML workloads and real‑time model deployments. Automate cloud provisioning, monitoring, and incident response, establishing reliability best practices across cloud and on‑prem environments.
**Expectations**:
- Deliver robust systems with a strong quality focus.
- 5+ years of experience building and operating production‑grade, reliable infrastructure for high‑throughput systems.
- Proven experience with cloud platforms, Kubernetes, container orchestration, networking, and storage.
- Hands‑on experience with Python (or an equivalent language) and the ability to build custom tooling for diagnostics and automation.
- Early‑stage or scale‑up background; comfortable driving technical direction in a fast‑moving environment.
- Passion for AI/ML, observability, incident management, and open‑source contributions.
**Key Responsibilities**:
1. Architect & deploy modular, horizontally scalable infrastructure for AI/ML pipelines and live inference.
2. Implement end‑to‑end observability: metrics, logs, traces, alerts, dashboards.
3. Build and maintain CI/CD pipelines, infrastructure as code, and automated deployment workflows.
4. Diagnose, troubleshoot, and optimize performance bottlenecks and reliability incidents.
5. Create tooling (dashboards, remediation bots, scripts) to accelerate debugging and automation.
6. Collaborate with ML, product, and security teams to shape architecture and security posture.
7. Keep abreast of cloud, SRE, and observability trends; drive adoption of new tools/techniques.
8. Lead incident response, post‑mortem analysis, and continuous improvement cycles.
**Required Skills**:
- Cloud platforms (AWS, GCP, Azure) – infrastructure design, cost optimization, security.
- Container orchestration (Kubernetes) – deployment, scaling, networking.
- Networking, access control & storage fundamentals (VPC, subnets, IAM, S3/Blob/GCS, block/managed disks).
- Python (or similar) for automation, SDK usage, and tooling.
- Observability stack (Prometheus, Grafana, Elastic Stack, Datadog, OpenTelemetry).
- CI/CD, GitOps, Terraform/Helm/Ansible.
- Incident management practices and tooling (SRE, PagerDuty, Opsgenie).
- Familiarity with AI/ML model serving and GPU/accelerator scheduling.
- Open‑source mindset; contributions to reliability or infrastructure projects preferred.
**Required Education & Certifications**:
- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
- Relevant certifications (AWS Certified Solutions Architect, GCP Professional Cloud Architect, CKA/CKAD, or equivalent) are a plus.
San Francisco, United States
On‑site
Mid‑level
24-12-2025