Characterizing the Confidence of LLM-based Evaluators

#llm #deeplearning #nlp

Invited presenter at the Light, Eyes and Neural networkS (LENS) reading group


Rickard Stureborg

 

Abstract:

Considerable research effort has gone into improving Large Language Models (LLMs) as automatic evaluators of NLP tasks. This work generally aims for high correlation with human judgements on the same task, but it remains unclear what level of correlation is good enough for practical applications of LLM-based automatic evaluation systems. This paper characterizes these LLM evaluators’ confidence in ranking candidate NLP models and develops a configurable Monte Carlo simulation method for estimating it. We show that even automatic metrics with low correlation with human judgement can reach high-confidence rankings of candidate models with reasonable evaluation set sizes (hundreds of examples). Further, we describe tradeoff curves between LLM evaluator performance (i.e., correlation with humans) and evaluation set size: a loss in correlation can be compensated for with a modest increase in the evaluation set size. We validate our results on RoSE, a text summarization dataset, and find that our estimates of confidence align with empirical observations.
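
The abstract sketches the method at a high level. As a rough illustration of the kind of Monte Carlo simulation it describes, the sketch below estimates how often a noisy evaluator recovers the true ranking of candidate models; the bivariate-normal noise model, the exact-ranking success criterion, and all function and variable names here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def ranking_confidence(true_scores, rho, n_examples, n_trials=10_000, seed=0):
    """Monte Carlo estimate of the probability that an evaluator whose
    per-example scores correlate with human judgement at level `rho`
    recovers the true ranking of the candidate models (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    true_rank = np.argsort(true_scores)
    successes = 0
    for _ in range(n_trials):
        means = []
        for mu in true_scores:
            human = mu + rng.standard_normal(n_examples)           # latent human judgements
            noise = rng.standard_normal(n_examples)                # evaluator-specific noise
            evaluator = rho * human + np.sqrt(1 - rho**2) * noise  # corr(evaluator, human) = rho
            means.append(evaluator.mean())
        if np.array_equal(np.argsort(means), true_rank):
            successes += 1
    return successes / n_trials

# Two candidate models whose true quality differs by 0.1 standard deviations,
# scored by an evaluator with correlation 0.3 to humans on 500 examples:
print(ranking_confidence([0.0, 0.1], rho=0.3, n_examples=500))
```

Sweeping `rho` and `n_examples` in a simulation like this traces out the tradeoff curves the abstract mentions: a lower-correlation evaluator needs a larger evaluation set to reach the same ranking confidence.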

 

Bio:

Rickard Stureborg is a PhD candidate in Computer Science at Duke University, where his research focuses on high-subjectivity tasks in natural language processing, including applications to misinformation and the automatic evaluation of machine-generated text. He also works as a researcher at Grammarly, where he helps build the next generation of tools for integrating AI into people’s writing workflows. His interdisciplinary research has been published in leading peer-reviewed venues across several fields, including natural language processing (*ACL conferences), artificial intelligence (NeurIPS and AAAI workshops), human-computer interaction (CHI), optics (SPIE, Journal of Biomedical Optics), and public health (Vaccine). He is serving a three-year term on Duke’s Board of Trustees, where he is a member of the Graduate Education and Research Committee.



  Date and Time

  • Date: 05 Jun 2024
  • Time: 01:45 PM to 03:15 PM
  • All times are (UTC-04:00) Santiago

  Location

  • San Carlos de Apoquindo 2500, Las Condes
  • Universidad de los Andes
  • Santiago, Región Metropolitana
  • Chile
  • Building: Edificio de Ingeniería
  • Room Number: Sala I-103

  Hosts

  • Artificial Intelligence, Data Science and Applications (AIDA), Universidad de los Andes, Chile
  • Co-sponsored by Jose Delpiano

  Registration

  • Starts 28 May 2024 12:00 AM
  • Ends 05 Jun 2024 03:15 PM
  • All times are (UTC-04:00) Santiago
  • No Admission Charge


  Speakers

Rickard “Rich” Stureborg, Duke University and Grammarly
