UMD Center for Machine Learning Visiting Talk : vTools Events

UMD Center for Machine Learning Visiting Talk

#AI #DataScience #GenAI

TALK: "From Filtering to Fingerprints: Constructing Pretraining Datasets for LLMs and Measuring Biases in the Data"

VISITING SPEAKER: Reinhard Heckel, an associate professor of machine learning at the Technical University of Munich

WHEN: Friday, December 6, 2024 at 11 a.m.

LOCATION: 5105 Iribe Center, University of Maryland

ABSTRACT: In this talk, we first discuss how pretraining datasets for large language models (LLMs) are sourced from the web through heuristic and machine-learning-based filtering techniques. We then investigate biases in these pretraining datasets through dataset classification experiments. Building on prior work demonstrating the existence of biases in popular computer vision datasets, we analyze popular open-source pretraining text datasets derived from CommonCrawl, including C4, RefinedWeb, DolmaCC, RedPajama-V2, FineWeb, and others. Although these datasets are obtained with similar filtering and deduplication steps, LLMs can classify surprisingly well which dataset a single text sequence belongs to, and they do so significantly better than a human can. This indicates that popular pretraining datasets have their own unique biases, or fingerprints.

BIO: Reinhard Heckel is an associate professor of machine learning in the Department of Computer Engineering at the Technical University of Munich and an adjunct faculty member at Rice University. From 2017 to 2019, he was an assistant professor of electrical and computer engineering at Rice University. Before that, Heckel was a postdoctoral researcher in the Berkeley Artificial Intelligence Research Lab at UC Berkeley and a researcher at IBM Research Zurich. He completed his Ph.D. in 2014 at ETH Zurich and was a visiting Ph.D. student at Stanford University's Statistics Department. Heckel's work focuses on machine learning, artificial intelligence, and information processing. He specializes in developing algorithms and foundations for deep learning, particularly for medical imaging; establishing mathematical and empirical underpinnings for machine learning; and utilizing DNA as a digital information technology.

Location

8125 Paint Branch Dr
College Park, Maryland
United States 20740
Building: Brendan Iribe Center for Computer Science and Engineering
Room Number: 5105

Co-sponsored by UMD Center for Machine Learning

Speakers

Reinhard Heckel

Address: Germany