Assessing Accuracy of Unsupervised Large Language Model Tools for Identifying Key Criteria Supporting Abstract Screening
Author(s)
Jeremy Schneider, BA, Kevin Kallmes, BS, MA, JD, Sierra Colon, BS, MS.
Nested Knowledge, St. Paul, MN, USA.
OBJECTIVES: Artificial intelligence systems are increasingly used in screening for systematic literature reviews, but unsupervised Large Language Model (LLM) approaches remain under-explored. Adaptive Smart Tags (ASTs), a specialized LLM-based feature in Nested Knowledge, extract and classify qualitative information from abstracts using a structured tagging system. Building on a prior evaluation of a supervised model, this study evaluates ASTs’ unsupervised performance in identifying and categorizing key data from 27 abstracts of randomized controlled trials (RCTs) on GLP-1 receptor agonists (GLP-1 RAs) for obesity.
METHODS: An AI Smart Search identified 419 studies published since 2017 evaluating GLP-1 RAs for weight loss. The first 50 abstracts were dual-screened by two researchers for calibration; the remaining records were screened using a hybrid approach involving one researcher and a Robot Screener, with discrepancies adjudicated by consensus. Screening identified 27 RCT abstracts in which weight loss was a primary outcome. A tag hierarchy was developed encompassing five dimensions: Study Type, Population, Intervention, Comparators, and Outcomes. ASTs were applied to each abstract and evaluated on whether they assigned the correct tags across seven key categories: Obesity, Study Type, Gender, GLP-1 RA, Comparators, Clinical Outcomes, and Safety Outcomes. Based on manual review, each tag output was classified as Correct (True Positive), Missing (False Negative), Incorrect (False Positive), Partially Correct (0.5 True Positive, 0.5 False Negative), or Not in Abstract (True Negative). Four performance metrics (recall, accuracy, precision, and F1-score) were calculated across all categories.
RESULTS: ASTs demonstrated consistently strong performance, identifying nearly all relevant elements. Precision was 0.9650±0.0670, recall was 1.0±0, F1 was 0.9810±0.0364, and accuracy was 0.9678±0.0613.
CONCLUSIONS: Specialized, unsupervised LLMs can provide high accuracy in identifying key concepts during abstract screening. Their adoption has the potential to significantly reduce time spent on screening and extraction without compromising quality.
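The scoring scheme described in the Methods can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the label names, the function names `tally` and `metrics`, and the example label list are all hypothetical, and the counts are made up purely to show the arithmetic. The only element taken from the abstract is the mapping of each manual label to confusion-matrix cells, including the 0.5/0.5 split for partially correct tags.

```python
# Contribution of each manual label to the confusion-matrix cells,
# following the scheme stated in the Methods. Label names are hypothetical.
LABEL_WEIGHTS = {
    "correct": {"tp": 1.0},
    "missing": {"fn": 1.0},
    "incorrect": {"fp": 1.0},
    "partial": {"tp": 0.5, "fn": 0.5},
    "not_in_abstract": {"tn": 1.0},
}


def tally(labels):
    """Sum weighted confusion-matrix counts over a list of manual labels."""
    counts = {"tp": 0.0, "fp": 0.0, "fn": 0.0, "tn": 0.0}
    for label in labels:
        for cell, weight in LABEL_WEIGHTS[label].items():
            counts[cell] += weight
    return counts


def metrics(counts):
    """Compute precision, recall, F1, and accuracy from weighted counts."""
    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))
    total = tp + fp + fn + tn
    # Guard against empty denominators; returning 0.0 here is an assumption.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Illustrative labels for one category (invented, not study data).
example_labels = (["correct"] * 8 + ["partial"] * 2
                  + ["incorrect"] + ["not_in_abstract"] * 2)
example_counts = tally(example_labels)
example_metrics = metrics(example_counts)
```

Running `metrics` once per category and then summarizing the per-category values would be one way to arrive at figures of the mean±spread form reported in the Results, although the abstract does not state how the ± values were derived.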
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
MSR38
Topic
Methodological & Statistical Research, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
Diabetes/Endocrine/Metabolic Disorders (including obesity)