Face Video Compression with Generative Models
Video coding is a fundamental and ubiquitous technology in modern society. Generations of international video coding standards, such as the widely deployed H.264/AVC and H.265/HEVC and the latest H.266/VVC, provide essential means for enabling video conferencing, video streaming, video sharing, e-commerce, entertainment, and many other video applications. These standards all rely on fundamental signal processing and information theory to encode generic video efficiently with favorable rate-distortion behavior.
In recent years, rapid advances in deep learning and artificial intelligence have made it possible to manipulate images and videos with deep generative models. Of particular interest to the field of video coding is the application of deep generative models to compressing talking-face video at ultra-low bit rates. By focusing on talking faces, generative models can effectively learn the inherent structure of human faces, including their composition, movement, and posture, and deliver promising results using very little bandwidth. At ultra-low bit rates, where even the latest video coding standard H.266/VVC is apt to suffer from severe blocking artifacts and blurriness beyond the point of recognition, generative methods can maintain clear facial features and vivid expressions in the reconstructed video. Further, generative face video coding techniques are inherently capable of manipulating the reconstructed face and promise to deliver a more interactive experience.
In this talk, we start with a quick overview of traditional and deep learning-based video coding techniques. We then focus on face video coding with generative networks and present two schemes that send different deep information in the bitstream: one sends compact temporal motion features, and the other sends 3D facial semantics. We compare their compression efficiency and visual quality with those of the latest H.266/VVC standard and showcase the power of deep generative models in preserving vivid facial images with very little bandwidth. We also present visualization results that demonstrate the ability of the 3D facial semantics-based scheme to interact with the reconstructed face video and animate virtual faces.
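To make the bandwidth argument above concrete, the short sketch below gives a rough, purely illustrative estimate (not a figure from the talk) of why sending a handful of compact motion features per frame, rather than coded pixels, can reach ultra-low bit rates. The keypoint count, quantization precision, frame rate, and the conventional-codec bitrate are all assumed numbers chosen only for illustration.

```python
# Illustrative back-of-envelope estimate: bit rate of a generative face video
# coding scheme that transmits only compact motion features per frame.
# All constants below are assumptions for illustration, not measured values.

FRAME_RATE = 25           # frames per second (assumed)
NUM_KEYPOINTS = 10        # compact motion features per frame (assumed)
VALUES_PER_KEYPOINT = 2   # e.g., x and y coordinates
BITS_PER_VALUE = 8        # assumed quantization precision

# Per-frame payload for the generative scheme (motion features only;
# the one-time cost of the transmitted reference frame is ignored here).
bits_per_frame = NUM_KEYPOINTS * VALUES_PER_KEYPOINT * BITS_PER_VALUE
generative_kbps = bits_per_frame * FRAME_RATE / 1000

# Order-of-magnitude bitrate often used for conferencing-quality video with
# conventional codecs (assumed, not a measured H.266/VVC figure).
conventional_kbps = 200

print(f"Generative scheme: ~{generative_kbps:.1f} kbps")                      # ~4 kbps
print(f"Conventional codec: ~{conventional_kbps} kbps")
print(f"Rough ratio: ~{conventional_kbps / generative_kbps:.0f}x fewer bits")
```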
Date and Time
- Date: 03 May 2023
- Time: 12:00 PM to 01:00 PM
- All times are (GMT-05:00) US/Eastern
Location
- Address: United States
Hosts
- Co-sponsored by Fairleigh Dickinson University
Registration
- Starts 15 March 2023 12:00 PM
- Ends 03 May 2023 02:00 PM
- All times are (GMT-05:00) US/Eastern
- No Admission Charge
Speakers
Dr. Yan Ye of the Video Technology Lab of Alibaba’s Damo Academy, Sunnyvale, CA
Biography:
Dr. Yan Ye received her Ph.D. from the University of California, San Diego, and her B.S. and M.S. from the University of Science and Technology of China. She is currently the Head of the Video Technology Lab of Alibaba’s Damo Academy in Sunnyvale, California. Prior to Alibaba, she held various management and technical positions at InterDigital, Dolby Laboratories, and Qualcomm.
Throughout her career, Dr. Ye has been actively involved in developing international video coding and streaming standards in the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). She is currently an Associate Rapporteur of ITU-T VCEG, the Group Chair of the INCITS/MPEG task group, and a focus group chair of ISO/IEC MPEG Visual Quality Assessment. Her research interests include advanced video coding, processing, and streaming algorithms; real-time and immersive video communications; AR/VR/MR; and deep learning-based video coding, processing, and quality assessment algorithms.