Attention-Guided Audio Compression for Multimodal LLMs
Audio compression is often proposed as a way to improve the efficiency of multimodal large language models, but its impact on downstream task performance remains underexplored. This talk examines how semantic neural audio codecs behave under token reduction constraints, using cross-modal attention as a signal to discard frames with low semantic content. On audio question-answering benchmarks, attention-guided frame selection removes 10–30% of frames while matching baseline accuracy and answer consistency. The results also point to a critical compression threshold (keep ratio ~0.7) below which performance degrades sharply. Finally, the talk discusses an "answer consistency paradox," in which models remain highly self-consistent (>98%) even as accuracy degrades, and what this decoupling of consistency from correctness means for evaluating compressed multimodal systems in low-resource deployments.
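As a rough illustration of the frame-selection idea in the abstract, the sketch below keeps the highest-scoring fraction of audio frames given one attention score per frame. It is not the speaker's implementation: the function name `select_frames`, the assumption that cross-modal attention has already been aggregated into a single score per frame, and the example scores are all hypothetical.

```python
def select_frames(attention_scores, keep_ratio=0.7):
    """Return the sorted indices of the frames to keep.

    attention_scores: one non-negative score per audio frame (assumed here
    to be cross-modal attention already aggregated per frame).
    keep_ratio: fraction of frames to retain, in (0, 1].
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n = len(attention_scores)
    if n == 0:
        return []
    # Always keep at least one frame so the model still receives some audio.
    n_keep = max(1, round(keep_ratio * n))
    # Rank frames by score (highest first); ties go to the earlier frame.
    ranked = sorted(range(n), key=lambda i: (-attention_scores[i], i))
    # Restore temporal order so the kept frames stay a valid sequence.
    return sorted(ranked[:n_keep])


# Hypothetical scores for ten frames; a keep ratio of 0.7 drops the three
# lowest-scoring frames (indices 0, 2 and 6).
scores = [0.02, 0.30, 0.01, 0.25, 0.05, 0.10, 0.03, 0.12, 0.04, 0.08]
kept = select_frames(scores, keep_ratio=0.7)
print(kept)
```

Removing 10–30% of frames, as described above, corresponds to keep ratios between 0.9 and 0.7 in this sketch. Sorting the retained indices preserves the original frame order, which a sequence model would generally expect.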
Speakers
Prerana Rane
Biography:
Prerana Rane is a researcher and engineer working at the intersection of speech and audio machine learning and multimodal AI. She holds an M.S. in Computer Engineering from Virginia Tech. She spent seven years at Intel's Next Generation and Standards group, developing PHY-layer systems and algorithms for 5G NR. She represented Intel as a 3GPP RAN1 delegate across Releases 16–18, with over 30 technical contributions, 17 patents and multiple proposals adopted into the 5G standards. She is an IEEE Senior Member and serves as Secretary of the IEEE Signal Processing Society Santa Clara Valley Chapter. Her broader research interests span signal processing, audio and speech machine learning, and efficient multimodal systems.
Address: California, United States