Distant conversational speech recognition: Challenges and Opportunities
Abstract:
State-of-the-art ASR systems excel on close-talk benchmarks but struggle with far-field conversational speech, where error rates remain above 20%. Current benchmark datasets inadequately assess generalization across domains and real-world conditions, often relying on oracle segmentation that yields overly optimistic results. Distant ASR (DASR) faces unique challenges including overlapping speech, varied recording setups, and dynamic speaker interactions that significantly complicate system development. Despite these difficulties, spontaneous conversational speech represents the next frontier for developing more human-like AI agents capable of natural multi-party communication. This talk presents recent advances in DASR through three interconnected efforts: (1) the CHiME-7 and CHiME-8 DASR challenges, which established rigorous benchmarks for generalizable robust meeting transcription, (2) end-to-end joint modeling that unifies speaker diarization and speech recognition into a single framework, moving beyond traditional pipeline approaches, and (3) synthetic data generation leveraging large language models and text-to-speech systems to create realistic multi-speaker training data at scale.
Date and Time
- Starts: 07 October 2025 07:00 AM UTC
- Ends: 15 October 2025 07:00 AM UTC
- No Admission Charge
Speakers
Biography:
Samuele Cornell is a postdoctoral research associate at Carnegie Mellon University's Language Technologies Institute, in Prof. Shinji Watanabe's research group (WAVLab). He received a Master's degree in electronic engineering (summa cum laude) from Università Politecnica delle Marche in 2019 and a doctoral degree in Information Engineering from the same institution in 2023. His research interests lie mainly in robust speech processing (speech enhancement, speech separation, diarization, automatic speech recognition) for distant multi-talker conversational scenarios, as well as the broader field of machine listening (sound event detection and classification), with over 50 publications in these areas.
He has also authored and contributed significantly to several popular open-source speech-processing toolkits (e.g., SpeechBrain, ESPnet, Asteroid for source separation). He has organized and co-organized popular audio-processing challenges in sound event detection, robust speech processing, and speech enhancement, including DCASE Task 4 (2021, 2022, 2024), CHiME (lead organizer of the CHiME-7/8 DASR challenges), and URGENT (2024 and 2025). More recently, he co-led the JSALT 2025 EMMA team on end-to-end multi-channel multi-talker ASR.