Foundational Speech Models and their Efficient Training with NVIDIA NeMo {AI Talks with Tea/Coffee #30}

#WIE #current #speech #tools #webinars #training #technology #engineering #software #society #computer #architecture


https://landing.signalprocessingsociety.org/ieee-sps-webinars-27-aug-2025

The intersection of speech and language models offers unique opportunities and challenges. This talk provides a comprehensive walkthrough of speech-language model research from NVIDIA NeMo. We cover several types of models, such as the attention-encoder-decoder Canary-1B and LLM-based architectures such as SALM and BESTOW. In particular, we highlight the challenges in training and inference efficiency of such models and propose robust solutions via 2D bucketing and the batch size OOMptimizer. Finally, we highlight the difficulty of preserving text-domain capabilities in speech-augmented training and present several possible solutions: EMMeTT, VoiceTextBlender, and Canary-Qwen-2.5B.
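To give a flavor of the 2D bucketing idea mentioned above: speech-to-text samples vary in both input audio duration and output token length, so grouping by duration alone can still batch a short-transcript sample with a long-transcript one and waste compute on padding. The sketch below is a hypothetical, minimal pure-Python illustration of the concept, not the NeMo/Lhotse API; the function and bucket-edge names are invented for this example.

```python
# Minimal sketch of 2D bucketing: assign each (duration, num_tokens) sample
# to a bucket indexed by BOTH its input-duration bin and its output-length
# bin, so every mini-batch drawn from one bucket has similar shapes on both
# axes and needs little padding. Hypothetical illustration only.
from collections import defaultdict

def bucket_2d(samples, dur_edges, tok_edges):
    """Group (duration_sec, num_tokens) pairs into 2D buckets.

    dur_edges / tok_edges are ascending upper bounds for each bin;
    values above the last edge fall into an overflow bin.
    """
    def bin_index(value, edges):
        for i, edge in enumerate(edges):
            if value <= edge:
                return i
        return len(edges)  # overflow bin

    buckets = defaultdict(list)
    for dur, toks in samples:
        key = (bin_index(dur, dur_edges), bin_index(toks, tok_edges))
        buckets[key].append((dur, toks))
    return dict(buckets)

samples = [(3.2, 10), (3.5, 48), (14.0, 60), (29.0, 200)]
buckets = bucket_2d(samples, dur_edges=[5, 15, 30], tok_edges=[16, 64, 256])
# (3.2, 10) and (3.5, 48) have nearly identical durations but very
# different output lengths, so 1D duration bucketing would batch them
# together and pad the token axis; 2D bucketing keeps them apart.
```

A real implementation would additionally tune a per-bucket batch size (the role the abstract assigns to the OOMptimizer), since short-audio/short-text buckets can fit far larger batches than long ones.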

About the Presenter:

Piotr Żelasko received the B.S. and M.Sc. degrees in acoustic engineering and the Ph.D. degree in electronic engineering from AGH University of Krakow, Poland, in 2013, 2014, and 2019, respectively.

He is currently a research scientist at NVIDIA NeMo, building multitask and multimodal models and efficient training infrastructure. He previously held a research scientist position at JHU's CLSP and developed speech technology at several companies (Techmo, Avaya, Meaning.Team).

Dr. Żelasko is a co-author of the next-generation Kaldi toolkit (k2) and the maintainer of Lhotse.





  • Starts 31 July 2025 04:00 AM UTC
  • Ends 27 August 2025 04:00 AM UTC
  • No Admission Charge


  Speakers

Piotr Żelasko

Topic:

Foundational Speech Models and their Efficient Training with NVIDIA NeMo







Agenda

https://landing.signalprocessingsociety.org/ieee-sps-webinars-27-aug-2025

Please register at the link above and on vTools.
