UMD Center for Machine Learning Visiting Talk
TALK: "From Filtering to Fingerprints: Constructing Pretraining Datasets for LLMs and Measuring Biases in the Data"
VISITING SPEAKER: Reinhard Heckel, an associate professor of machine learning at the Technical University of Munich
WHEN: Friday, December 6, 2024 at 11 a.m.
LOCATION: 5105 Iribe Center, University of Maryland
ABSTRACT: In this talk, we first discuss how pretraining datasets for large language models (LLMs) are sourced from the web through heuristic and machine-learning-based filtering techniques. We then investigate biases in these pretraining datasets through dataset classification experiments. Building on prior work demonstrating the existence of biases in popular computer vision datasets, we analyze popular open-source pretraining text datasets derived from CommonCrawl, including C4, RefinedWeb, DolmaCC, RedPajama-V2, FineWeb and others. Although these datasets are obtained with similar filtering and deduplication steps, LLMs can classify surprisingly well which dataset a single text sequence belongs to, significantly better than a human can. This indicates that popular pretraining datasets have their own unique biases or fingerprints.
BIO: Reinhard Heckel is an associate professor of machine learning in the Department of Computer Engineering at the Technical University of Munich, and an adjunct faculty member at Rice University. From 2017 to 2019, he was an assistant professor of electrical and computer engineering at Rice University. Before that, Heckel was a postdoctoral researcher in the Berkeley Artificial Intelligence Research Lab at UC Berkeley and a researcher at IBM Research Zurich. He completed his Ph.D. in 2014 at ETH Zurich and was a visiting Ph.D. student at Stanford University's Statistics Department. Heckel's work focuses on machine learning, artificial intelligence, and information processing. He specializes in developing algorithms and foundations for deep learning, particularly for medical imaging; establishing mathematical and empirical underpinnings for machine learning; and using DNA as a digital information technology.
Date and Time
- Date: 06 Dec 2024
- Time: 11:00 AM to 12:00 PM
- All times are (UTC-05:00) Eastern Time (US & Canada)
Location
- 8125 Paint Branch Dr
- College Park, Maryland
- United States 20740
- Building: Brendan Iribe Center for Computer Science and Engineering
- Room Number: 5105
Hosts
- Co-sponsored by UMD Center for Machine Learning