- Company Name: Datadog
- Job Title: AI Research Engineer - Datadog AI Research (DAIR)
- Job Description:
**Job title:** AI Research Engineer – Datadog AI Research (DAIR)
**Role Summary:**
Design, build, and scale end‑to‑end AI systems that transform research concepts into production‑ready services within cloud observability and security domains.
**Expectations:**
- Deliver robust, scalable ML pipelines and inference engines.
- Transition prototypes into reliable, customer‑facing features.
- Collaborate closely with scientists, product, and engineering teams.
- Maintain high code quality, reproducibility, and observability.
**Key Responsibilities:**
- Construct and manage datasets, training and evaluation pipelines, benchmarks, and tooling.
- Implement, experiment with, and tune large‑scale models (forecasting, anomaly detection, multimodal analysis, autonomous agents, code‑repair agents).
- Orchestrate distributed training and reinforcement learning (RL) with Ray or equivalent frameworks; handle scaling, scheduling, and fault tolerance.
- Profile models for reliability, performance, and cost; optimize GPU usage.
- Create automated benchmarks and regression tests for all key research areas.
- Integrate advanced AI capabilities into product pipelines and harden prototypes into production services.
- Produce high‑quality code, documentation, and open‑source artifacts to support community and internal reproducibility.
**Required Skills:**
- Strong software engineering background, preferably in observability, SRE, or security.
- Proficiency in Python; familiarity with Rust, C++, Go, or a similar systems language.
- Experience with distributed computing frameworks (Ray, Slurm, etc.).
- Practical knowledge of ML frameworks (PyTorch, JAX), containerization, orchestration, and GPU acceleration.
- Expertise in training, fine‑tuning, and inference of large foundation models.
- Ability to communicate design trade‑offs to technical and non‑technical stakeholders.
- Passion for open‑source and open‑science; experience establishing benchmarks and sharing artifacts.
**Bonus Experience (not mandatory):**
- Proven track record of bridging research prototypes and production deployments.
- GPU programming/optimization with CUDA.
- Development of production data pipelines and applications.
- Contributions to research publications.
**Required Education & Certifications:**
- Bachelor’s degree or higher in Computer Science, Engineering, or related field (equivalent experience accepted).
- Certifications in distributed systems, ML engineering, or cloud platforms are advantageous.