Image generation with end-to-end training and benefits of a good VAE
Latent diffusion models underlie modern image generation; they require a variational auto-encoder (VAE) for image encoding and decoding, and a diffusion transformer for generation. While end-to-end training has been the spirit of deep learning, latent diffusion models are, surprisingly, not trained end-to-end, which creates representation bottlenecks. In this talk, I will introduce our work that jointly trains the VAE and the diffusion transformer, and show how it accelerates training and yields high-quality images. Further, I will discuss use cases where the resulting end-to-end trained VAEs bring significant benefits, including higher-quality text-to-image generation and automatic agentic search of diffusion transformer architectures. I will conclude with new perspectives.
Liang Zheng
Australian National University
Dr. Liang Zheng is an Associate Professor at the Australian National University and a Research Scientist at Canva. He is interested in representation learning for perception and generation. He has contributed many useful datasets and methods to the object re-identification field that were later adopted in wider domains. He is currently working on image generation, covering both pre-training and post-training. He is a Program Chair for ACM MM’24, MM’28, and AVSS’24, and a General Chair for AVSS’27 and DICTA 2027. He is a regular area chair for major conferences and an Associate Editor for TPAMI. He holds bachelor’s degrees in Biology and Economics and a PhD in Computer Science from Tsinghua University.