Human Alignment on Evaluating LLMs for Literature Review Extraction
Author(s)
Sam Work, MSc1, William Bentley, BSc2, Christoph R Schlegel, PhD2.
1Reliant AI Inc., Montréal, QC, Canada, 2Reliant AI Europe GmbH, Berlin, Germany.
OBJECTIVES: Software for literature reviews (LRs) increasingly incorporates Large Language Models (LLMs) to automate elements of the screening and extraction phases. This is marketed as a way to reduce costs by making human oversight more efficient and reducing workloads. However, alignment with human decision-making is essential to trust in these systems. When experts cannot agree on a “ground truth,” benchmarking LLM extractors against a single human label is unreliable. While initial investigations into accuracy have focused on the screening stage, few have examined inter-human alignment and human-LLM alignment at the extraction stage. This research assesses human reviewer agreement on epidemiology data points extracted from abstracts by an LLM-based system.
METHODS: Three human reviewers independently assessed the correctness of 72 data points extracted by an LLM-based system across four fields (measure type, measure value, population description, condition) from epidemiology-focused PubMed abstracts. Extractions were verified against the source abstract field by field. The reviewer agreement rate was calculated for all data points, after which the human reviewers provided their reasons for scoring and discussed each disagreement.
RESULTS: Total agreement across all 72 data points among the three human reviewers was low, with a Fleiss’ Kappa of 0.204. Disagreement was highest for the most ambiguous field (population description), likely due to individual interpretation of which information was meaningful to include. However, there was also substantial disagreement on the binary categorization of measure type.
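The agreement statistic reported above can be reproduced from a ratings matrix. The following is a minimal sketch of the standard Fleiss' kappa computation for multiple raters; the toy rating matrix is illustrative only and is not the study data.

```python
# Fleiss' kappa for N items each rated by the same number of raters.
# Each row of `counts` gives, for one item, how many raters chose each
# category (e.g. [3, 0] = all three reviewers marked it "correct").
def fleiss_kappa(counts):
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Per-item observed agreement P_i
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items
    # Expected chance agreement P_e from marginal category proportions
    totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
    p_j = [t / (n_items * n_raters) for t in totals]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 6 items rated correct/incorrect by 3 reviewers
ratings = [[3, 0], [2, 1], [1, 2], [3, 0], [0, 3], [2, 1]]
print(round(fleiss_kappa(ratings), 3))  # → 0.299
```

Values near 0 indicate agreement close to chance level, which is why the observed 0.204 is interpreted as low inter-reviewer agreement.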
CONCLUSIONS: The lack of human consensus in assessing data extraction highlights the difficulty of benchmarking these systems’ output. Understanding the limits of human agreement and their causes has useful implications for designing LLM-enabled systems. Built-in explainability could improve the human verification process, but should itself be tested by scoring inter-reviewer agreement. Adoption of these systems for LR extraction requires confidence that fewer humans are indeed sufficient to ensure accuracy.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
MSR125
Topic
Economic Evaluation, Epidemiology & Public Health, Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas