Bagel Labs

www.bagel.com

1 Job

19 Employees

About the Company

Bagel Labs is a machine learning and cryptography research lab building a permissionless, privacy-preserving machine learning ecosystem.

Listed Jobs

Company Name
Bagel Labs
Job Title
Member of Technical Staff (Infra)
Job Description
**Job Title**

Member of Technical Staff (Infra)

**Role Summary**

Design, build, and continuously optimize distributed GPU infrastructure to train and serve large diffusion models. Work across multi-node clusters, blending systems work, performance engineering, and research enablement to achieve high throughput, low latency, and cost-effective model serving.

**Expectations**

- Demonstrated ability to take full ownership of complex, ambiguous problems.
- Strong agency: self-directed decision-making, rapid experimentation, and proactive issue resolution.
- Proficiency in Linux system administration, networking, and debugging of production GPU workloads.
- Baseline familiarity with modern deep-learning frameworks (PyTorch) and distributed systems (NCCL, torch.distributed, Kubernetes).
- Willingness to modify model code for system performance when necessary.
- Commitment to observability, reproducibility, and automation of deployment pipelines.

**Key Responsibilities**

1. Build and operate end-to-end distributed training stacks for diffusion models (U-Net, DiT, video diffusion, world-model variants).
2. Implement, tune, and integrate parallelism and sharding (data, tensor, pipeline, ZeRO/FSDP, expert, and diffusion-specific techniques).
3. Profile and eliminate GPU bottlenecks across kernels, memory, communication, and I/O (CUDA graphs, kernel fusion, attention kernels, NCCL tuning).
4. Own inference serving for diffusion workloads, ensuring high throughput, predictable latency, dynamic batching, multi-GPU execution, and variable-resolution support.
5. Design robust orchestration for heterogeneous and preemptible environments (on-prem, bare metal, cloud, spot) with checkpointing, resumability, and fault tolerance.
6. Develop actionable observability (step-time breakdowns, VRAM headroom, NCCL health, queueing, tail latency, error budgets, cost per sample).
7. Apply pragmatic quantization and precision strategies (BF16/FP16/TF32/FP8, INT8/INT4), balancing quality and performance.
8. Improve developer velocity through reproducible environments, CI for performance regressions, and automated cluster provisioning and rollouts.
9. Author clear internal documentation and occasional public technical deep-dives.

**Required Skills**

- Linux fundamentals, networking basics, and production incident debugging.
- Deep GPU performance instincts: profiling, memory behavior, and kernel-level thinking; familiarity with CUDA tooling.
- Hands-on experience scaling training and inference across multiple GPUs and nodes.
- Comfort implementing parallelism and sharding with PyTorch, NCCL, torch.distributed, FSDP/ZeRO, or equivalent.
- Experience building reliable deployment pipelines (containers, rollouts, versioning, rollback, secrets, config management).
- Ability to read and modify model code for infrastructure performance needs.

**Bonus Skills**

- Open-source contributions to performance or distributed-systems libraries (PyTorch internals, Triton, xFormers/FlashAttention, NCCL tooling, Ray, Kubernetes operators).
- Diffusion-specific serving and optimization experience (Diffusers, ComfyUI, custom schedulers, distillation, few-step generation).
- Compiler experience (TensorRT, torch.compile/Inductor, XLA, CUDA graphs).
- Multi-tenant GPU platform design (isolation, fair scheduling, QoS).
- Cost-engineering understanding of GPU cluster expenditures.

**Required Education & Certifications**

Not specified. (Candidates typically possess relevant technical training, such as a bachelor's degree in Computer Science, Electrical Engineering, or a related field.)
Toronto, Canada
On-site
11-01-2026