**Company Name**
Together AI
**Job Description**
**Job Title**
LLM Inference Frameworks and Optimization Engineer
**Role Summary**
Design, develop, and optimize large‑scale, low‑latency inference engines for text, image, and multimodal models. Focus on distributed parallelism, GPU/accelerator efficiency, and software‑hardware co‑design to deliver high‑throughput, fault‑tolerant AI deployment.
**Expectations**
- Lead end‑to‑end development of inference pipelines for LLMs and vision models at scale.
- Demonstrate measurable improvements in latency, throughput, or cost per inference.
- Collaborate cross‑functionally with hardware, research, and infrastructure teams.
- Deliver production‑ready, maintainable code in Python/C++ with CUDA.
- Communicate technical trade‑offs to stakeholders.
**Key Responsibilities**
- Build fault‑tolerant, high‑concurrency distributed inference engines for multimodal generation.
- Engineer parallelism strategies (expert/Mixture‑of‑Experts, tensor, and pipeline parallelism).
- Apply CUDA graph, TensorRT/TRT‑LLM, and PyTorch compilation (torch.compile) optimizations.
- Tune KV‑cache systems (e.g., Mooncake, PagedAttention).
- Conduct performance bottleneck analysis and co‑optimize GPU/TPU/custom accelerator workloads.
- Integrate model execution plans into end‑to‑end serving pipelines.
- Maintain code quality, documentation, and automated testing.
**Required Skills**
- 3+ years of experience in deep‑learning inference, distributed systems, or HPC.
- Proficiency in Python and C++/CUDA; familiarity with GPU programming (CUDA/Triton/TensorRT).
- Deep knowledge of optimization for transformer, large language, vision, and diffusion models.
- Experience with LLM inference frameworks (TensorRT‑LLM, vLLM, SGLang, TGI).
- Knowledge of model quantization, KV cache systems, and distributed scheduling.
- Strong analytical, problem‑solving, and performance‑driven mindset.
- Excellent collaboration and communication skills.
**Nice‑to‑Have**
- RDMA/RoCE, distributed filesystems (HDFS, Ceph), Kubernetes experience.
- Contributions to open‑source inference projects.
**Required Education & Certifications**
- Bachelor’s degree (or higher) in Computer Science, Electrical Engineering, or a related field.
- Certifications in GPU programming or distributed systems are a plus.
San Francisco, United States
On‑site
Junior
14-12-2025