Boson AI

Internship - MScAC

Hybrid

Toronto, Canada

Full Time

12-12-2025

Skills

Creativity, Python, Go, Research, Data collection, Attention to detail, Training, Machine Learning, Deep Learning, Benchmarking, NLP

Job Specifications

Boson AI is an early-stage startup of 30 scientists. We are building large language tools for interaction and entertainment. Our founders, Alex Smola and Mu Li, together with a team of Deep Learning, Optimization, NLP, AutoML and Statistics scientists and engineers, are working on high-quality generative AI models for language and beyond.

NOTE: Please apply to this role only if you are a student at the University of Toronto enrolled in the MScAC program.

We are seeking interns from the University of Toronto to join us in our Toronto office. As part of your role, you will work on modeling and training LLMs, understanding and interpreting model behavior, and aligning models to human values. The ideal candidate has an interest and background in engineering and machine learning, and is motivated to develop state-of-the-art models towards AGI. Some potential research topics include the following:

Data extraction and annotation

Data collection and extraction for LLMs have come a long way. Even small models are now trained on 20+ trillion tokens, a significant fraction of all the text mankind has ever created. Compare that to audio models, which are trained on 10-100 million hours of audio, the equivalent of 20-200 human lifetimes.

Your task is to help design and implement tools for automatic collection of data for a large number of languages (30+) and to build data extraction and annotation pipelines that do not rely heavily on human engineering. That is, you will use unsupervised and semi-supervised techniques to build tools that automatically (and iteratively) refine audio for large audio model training. See e.g. https://arxiv.org/abs/2505.13404 (Nvidia Granary) for a description of some of the more basic components. You will go beyond that by integrating annotation and model training into one iterative loop.
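
As a rough illustration of the kind of iterative loop described above (not Boson AI's actual pipeline), here is a minimal Python sketch of semi-supervised self-training for audio annotation: the current model labels the pool, only high-confidence clips are kept for retraining, and the rejected clips are revisited with the improved model. The helpers transcribe and fine_tune are hypothetical stand-ins for a real ASR/annotation model and training job.

```python
import random
from dataclasses import dataclass

@dataclass
class Clip:
    path: str
    language: str
    transcript: str = ""
    confidence: float = 0.0

def transcribe(model_version: int, clip: Clip) -> tuple[str, float]:
    # Stand-in for the real annotation model: pretend quality improves
    # as the model is retrained (hence the dependence on model_version).
    return f"transcript of {clip.path}", min(1.0, random.random() + 0.1 * model_version)

def fine_tune(model_version: int, clips: list[Clip]) -> int:
    # Stand-in for a training job: "training" just bumps the model version.
    print(f"fine-tuning on {len(clips)} accepted clips")
    return model_version + 1

def refine(pool: list[Clip], rounds: int = 3, threshold: float = 0.9) -> int:
    """Each round: annotate the pool, keep only high-confidence clips,
    retrain on them, then revisit the rejected clips with the better model."""
    model_version = 0
    for _ in range(rounds):
        for clip in pool:
            clip.transcript, clip.confidence = transcribe(model_version, clip)
        accepted = [c for c in pool if c.confidence >= threshold]
        model_version = fine_tune(model_version, accepted)
        pool = [c for c in pool if c.confidence < threshold]
    return model_version

pool = [Clip(path=f"clip_{i:04d}.wav", language=lang)
        for i, lang in enumerate(["en", "de", "hi", "yo"] * 25)]
refine(pool)
```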

This project works for you if you are good at scripting, enjoy experimentation and working with large amounts of data, and have a moderate understanding of statistics and machine learning.

Audio evaluation and benchmarking

Evaluating large audio models is unique insofar as it relies not only on the correctness of the pronunciation but also on a wide range of stylistic elements: cadence, emotion, matching soundstage, and so on. In other words, it isn't enough for the voice to sound good; it also needs to fit the context and the intended emotion and style. This requires the design of novel benchmarking tools.

Your task is to help design and implement such algorithms for both evaluating and improving audio models. See e.g. https://arxiv.org/abs/2505.23009 (Boson EmergentTTS-Eval) for a description of some of the challenges. An improved benchmark should be multilingual; it should be able to address and assess both global and local issues in the generated sound; and it should be able to assess conversational audio rather than only monologue generation. We will use the scoring methodology both to evaluate models and (after modification) within the training of models proper.
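
To make the multi-dimensional scoring idea concrete, here is a minimal Python sketch (the dimensions, the judge, and the file names are illustrative assumptions, not the EmergentTTS-Eval methodology) that aggregates per-dimension judge scores by language, so that both global quality and language-specific weaknesses become visible.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

DIMENSIONS = ("pronunciation", "prosody", "emotion", "context_fit")

@dataclass
class Sample:
    audio_path: str
    language: str
    target_emotion: str

def judge(sample: Sample) -> dict[str, float]:
    # Stand-in for a listening model or human rater returning 0-1 scores
    # per dimension; a real system would analyse the generated audio.
    return {dim: 0.5 for dim in DIMENSIONS}

def evaluate(samples: list[Sample]) -> dict[str, dict[str, float]]:
    """Aggregate per-dimension scores by language."""
    buckets: dict[str, dict[str, list[float]]] = defaultdict(lambda: defaultdict(list))
    for s in samples:
        for dim, score in judge(s).items():
            buckets[s.language][dim].append(score)
    return {lang: {dim: mean(v) for dim, v in dims.items()}
            for lang, dims in buckets.items()}

report = evaluate([Sample("out_001.wav", "en", "excited"),
                   Sample("out_002.wav", "fr", "calm")])
print(report)
```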

This project works for you if you are good at experimentation, ideally speak more than one language, and have a good ear for audio. You should be comfortable with Python and scripting languages for automated evaluation and benchmarking.

Text evaluation and benchmarking

Evaluating modern LLMs presents a critical paradox. While frontier models like GPT-5 show amazing capabilities on a range of tasks, their deployment in high-stakes, real-world applications is often blocked by crucial failures. These include (a) persistent hallucinations, especially when context is complex or imperfectly provided; (b) instability, i.e. the inability to produce the same correct, high-quality result reliably over millions of runs; and (c) difficulty in robustly following a large number of constraints (e.g., 100+ instructions) within a long context (e.g., 20K+ tokens). While mitigations like reasoning or multi-agent workflows might solve a problem in 1 of 100 attempts, real-world applications, especially enterprise ones, cannot tolerate the associated latency or cost and demand 100/100 accuracy. Probing the maximum capabilities and failure modes of current LLMs is therefore essential for moving the entire field of generative AI forward.

Your task is to help design and implement new evaluation mechanisms and benchmarks that assess LLMs for these critical, deployment-blocking issues. The benchmark should involve a human in the loop, produce verifiable outcomes, and simulate complicated real-world use cases. See e.g. https://arxiv.org/pdf/2406.12045 (tau-bench) as an example.
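
As one concrete way to quantify the reliability issue above, here is a minimal Python sketch in the spirit of tau-bench's pass^k metric (the task names and run outcomes are made up): given repeated independent runs of an agent on each task, it estimates pass@k (at least one of k runs succeeds) and the stricter pass^k (all k runs succeed) using the usual combinatorial estimators.

```python
import math
from statistics import mean

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled runs succeeds,
    given c successes observed over n independent runs."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Probability that all k sampled runs succeed -- the stricter
    reliability measure in the spirit of tau-bench's pass^k."""
    if c < k:
        return 0.0
    return math.comb(c, k) / math.comb(n, k)

# results[task_id] = list of booleans, one per independent run of the agent
results = {
    "refund_request": [True, True, False, True, True, True, True, False],
    "policy_lookup":  [True] * 8,
}

for k in (1, 2, 4):
    avg_pass = mean(pass_at_k(len(r), sum(r), k) for r in results.values())
    avg_rel = mean(pass_hat_k(len(r), sum(r), k) for r in results.values())
    print(f"k={k}: pass@k={avg_pass:.3f}  pass^k={avg_rel:.3f}")
```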

This project works for you if you are good at experimentation, are creative, and have strong attention to detail. You should be comfortable with building an agentic pipeline and enjoy thinking critically about how and why models fail, not just whether they get an answer right.

Model training and efficient optimization

Efficient model training is one of the most important parts of building large multimodal models. It spans a wide range of aspects:

About the Company

We are transforming how stories are told, knowledge is learned, and insights are gathered.