Characterizing the Confidence of LLM-based Evaluators

#llm #deeplearning #nlp

Invited presenter at the Light, Eyes and Neural networkS (LENS) reading group


Rickard Stureborg

 

Abstract:

Considerable research effort has gone into improving Large Language Models (LLMs) as automatic evaluators of NLP tasks. This work generally aims for high correlation with human judgements on the same task, but it remains unclear what level of correlation is good enough for practical applications of LLM-based automatic evaluation systems. This paper characterizes these LLM evaluators’ confidence in ranking candidate NLP models and develops a configurable Monte Carlo simulation method for estimating it. We show that even automatic metrics with low correlation with human judgement can reach high-confidence rankings of candidate models with reasonable evaluation set sizes (hundreds of examples). Further, we describe tradeoff curves between LLM evaluator performance (i.e., correlation with humans) and evaluation set size: a loss in correlation can be compensated for with a modest increase in the evaluation set size. We validate our results on RoSE, a text summarization dataset, and find that our estimates of confidence align with empirical observations.
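
The abstract sketches the method at a high level. As a rough illustration of the kind of Monte Carlo simulation it describes, the sketch below estimates how often a noisy evaluator recovers the true ranking of candidate models; the bivariate-normal noise model, the exact-ranking success criterion, and all function and variable names here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def ranking_confidence(true_scores, rho, n_examples, n_trials=10_000, seed=0):
    """Monte Carlo estimate of the probability that an evaluator whose
    per-example scores correlate with human judgement at level `rho`
    recovers the true ranking of the candidate models (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    true_rank = np.argsort(true_scores)
    successes = 0
    for _ in range(n_trials):
        means = []
        for mu in true_scores:
            human = mu + rng.standard_normal(n_examples)           # latent human judgements
            noise = rng.standard_normal(n_examples)                # evaluator-specific noise
            evaluator = rho * human + np.sqrt(1 - rho**2) * noise  # corr(evaluator, human) = rho
            means.append(evaluator.mean())
        if np.array_equal(np.argsort(means), true_rank):
            successes += 1
    return successes / n_trials

# Two candidate models whose true quality differs by 0.1 standard deviations,
# scored by an evaluator with correlation 0.3 to humans on 500 examples:
print(ranking_confidence([0.0, 0.1], rho=0.3, n_examples=500))
```

Sweeping `rho` and `n_examples` in a simulation like this traces out the tradeoff curves the abstract mentions: a lower-correlation evaluator needs a larger evaluation set to reach the same ranking confidence.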

 

Bio:

Rickard Stureborg is a PhD candidate in Computer Science at Duke University, where his research focuses on high-subjectivity tasks in natural language processing, including applications to misinformation and the automatic evaluation of machine-generated text. He also works as a researcher at Grammarly, where he helps build the next generation of tools for integrating AI into people’s writing workflows. His interdisciplinary research has been published in leading peer-reviewed venues across several fields, including natural language processing (*ACL conferences), artificial intelligence (NeurIPS and AAAI workshops), human-computer interaction (CHI), optics (SPIE, Journal of Biomedical Optics), and public health (Vaccine). He is serving a three-year term on Duke’s Board of Trustees, where he is a member of the Graduate Education and Research Committee.



  Date and Time

  • Date: 05 Jun 2024
  • Time: 01:45 PM to 03:15 PM
  • All times are (UTC-04:00) Santiago

  Location

  • San Carlos de Apoquindo 2500, Las Condes
  • Universidad de los Andes
  • Santiago, Región Metropolitana
  • Chile
  • Building: Edificio de Ingeniería
  • Room Number: Sala I-103

  Hosts

  • Artificial Intelligence, Data Science and Applications (AIDA), Universidad de los Andes, Chile
  • Co-sponsored by Jose Delpiano

  Registration

  • Starts 28 May 2024 12:00 AM
  • Ends 05 Jun 2024 03:15 PM
  • All times are (UTC-04:00) Santiago
  • No Admission Charge


  Speakers

Rickard “Rich” Stureborg, Duke University and Grammarly
