Advancing Speech Processing with End-to-End Modeling and LLM Integration
Abstract
The field of speech processing is currently dominated by end-to-end (E2E) models, which optimize a single model directly toward the final objective rather than optimizing multiple sub-models separately. This trend is particularly notable in automatic speech recognition (ASR). In this talk, we will provide an overview of E2E ASR models and discuss recent advancements from an industry perspective. We will then examine the extension of E2E modeling beyond ASR to applications such as multi-speaker ASR and simultaneous speech translation, where ASR traditionally serves as only one of several components. This trend ultimately unlocks multimodal intelligence by integrating speech capabilities into large language models (LLMs). We will highlight the most recent developments in this area, which present unprecedented opportunities for the field.
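To make the E2E idea concrete, the sketch below (illustrative only, not from the talk or Dr. Li's work) shows a single network mapping audio features directly to token probabilities and trained with one CTC objective; in a traditional hybrid pipeline, the acoustic model, pronunciation lexicon, and language model would each be built and tuned separately. The model class, layer sizes, and token inventory here are hypothetical choices.

```python
# Minimal sketch of E2E ASR: one model, one loss, joint optimization.
import torch
import torch.nn as nn

class TinyE2EASR(nn.Module):
    # n_tokens = 29: 26 letters + space + apostrophe + CTC blank (index 0)
    def __init__(self, n_mels=80, hidden=256, n_tokens=29):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_tokens)

    def forward(self, feats):                   # feats: (batch, time, n_mels)
        enc, _ = self.encoder(feats)
        return self.head(enc).log_softmax(-1)   # (batch, time, n_tokens)

model = TinyE2EASR()
ctc = nn.CTCLoss(blank=0)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy batch: 4 utterances of 100 frames; targets are token-id sequences.
feats = torch.randn(4, 100, 80)
targets = torch.randint(1, 29, (4, 20))         # 1..28, never the blank
log_probs = model(feats).transpose(0, 1)        # CTCLoss expects (time, batch, tokens)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 100),
           target_lengths=torch.full((4,), 20))
loss.backward()                                 # one objective updates every parameter
opt.step()
```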
Date and Time
- Date: 07 Mar 2025
- Time: 06:30 PM to 09:00 PM
- All times are (UTC-08:00) Pacific Time (US & Canada)
Location
- Santa Clara University
- 500 El Camino Real
- Santa Clara, California 95053
- United States
- Building: Sobrato Campus for Discovery and Innovation Building
- Room Number: 1302
Registration
- Registration opens: 07 February 2025 12:00 AM
- Registration closes: 07 March 2025 12:00 PM
- All times are (UTC-08:00) Pacific Time (US & Canada)
- Admission fee?
Speakers
Jinyu Li of Microsoft
Biography:
Dr. Li is an IEEE Fellow, elected for contributions to deep-learning-based speech technology innovation and commercialization. He was a member of the IEEE Speech and Language Processing Technical Committee from 2017 to 2023 and served as an associate editor of IEEE/ACM Transactions on Audio, Speech, and Language Processing from 2015 to 2020. He received the Asia-Pacific Signal and Information Processing Association (APSIPA) Industrial Distinguished Leader award in 2021 and the APSIPA Sadaoki Furui Prize Paper Award in 2023. He was named a Distinguished Industry Speaker by the IEEE Signal Processing Society in 2025.
Agenda
6:30 – 7:00 PM – Check-in, networking, food, and drinks
7:00 – 8:00 PM – Presentation by Dr. Jinyu Li
8:00 – 8:30 PM – Q&A