- Company Name
- Hyperbolic Labs, Inc.
- Job Title
- Head of Infrastructure
- Job Description
**Job Title**
Head of Infrastructure
**Role Summary**
Lead the design, evolution, and operation of Hyperbolic Labs’ globally distributed GPU cloud. Build and scale the systems powering the peer‑to‑peer GPU marketplace, inference fabric, and core platform primitives. Own the end‑to‑end infrastructure roadmap—distributed systems design, resource orchestration, networking, security, and global capacity strategy—while growing a high‑performance engineering organization and partnering with product, security, platform, and GTM stakeholders.
**Expectations**
- 10+ years in infrastructure, systems engineering, or distributed systems; 5+ years managing senior ICs and engineering managers.
- Proven record of delivering multi‑year infrastructure roadmaps and translating ambiguous business needs into clear technical direction.
- Demonstrated ability to build, scale, and mentor engineering teams across infra, platform, and SRE disciplines.
- Strong judgment balancing velocity, reliability, cost, and security in high‑stakes, fast‑moving environments.
**Key Responsibilities**
- Architect and implement a globally scalable GPU cloud ecosystem (peer‑to‑peer marketplace, inference, and platform primitives).
- Define and maintain the infrastructure roadmap and strategy, including multi‑cloud, on‑prem, and edge GPU topologies.
- Lead the engineering organization: hiring, coaching, and establishing performance and excellence standards.
- Drive operational excellence: uptime targets of 99.9% to 99.99%, incident response, resilience engineering, and cost optimization.
- Build and maintain observability systems (metrics, tracing, logging, alerting) and automate operations via IaC, GitOps, and related automation frameworks.
- Oversee capacity planning, load forecasting, and scalability testing.
- Partner cross‑functionally with Product, Security, Platform, and GTM to align infrastructure with AI workload requirements.
- Champion a security‑first mindset: workload isolation, network security, IAM, hardening, and compliance.
**Required Skills**
- Deep expertise in distributed systems, OS internals, networking, and resource orchestration.
- Hands‑on experience with container orchestration at global scale (Kubernetes, Nomad, SLURM, custom schedulers).
- Strong coding background (Go, Rust, Python, or similar).
- Experience architecting multi‑cloud, on‑prem, and edge GPU workloads.
- Mastery of infrastructure‑as‑code, automation, and GitOps workflows.
- Design and implementation of observability stacks (metrics, tracing, logging).
- Incident response, reliability, and cost‑optimization experience.
- Knowledge of security best practices (workload isolation, network security, IAM, mTLS, service mesh) and of low‑latency communication.
**Required Education & Certifications**
- Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related field.
- (Optional) Contributions to open‑source infra tools, kernels, schedulers, or distributed systems libraries.
San Francisco, United States
On-site
26-12-2025