Assessing Accuracy of Unsupervised Large Language Model Tools for Identifying Key Criteria Supporting Abstract Screening

Author(s)

Jeremy Schneider, BA, Kevin Kallmes, BS, MA, JD, Sierra Colon, BS, MS.
Nested Knowledge, St. Paul, MN, USA.

Presentation Documents

ISPOREurope25_Schneider_MSR38_POSTER.pdf

OBJECTIVES: Artificial Intelligence systems are increasingly being used in Screening for Systematic Literature Reviews, but unsupervised Large Language Model (LLM) approaches are under-explored. Adaptive Smart Tags (ASTs), a specialized LLM-based feature in Nested Knowledge, extract and classify qualitative information from abstracts using a structured tagging system. Building on a prior evaluation of a supervised model, this study evaluates ASTs’ unsupervised performance in identifying and categorizing key data from 27 abstracts of randomized controlled trials (RCTs) on GLP-1 receptor agonists (GLP-1 RAs) for obesity.
METHODS: An AI Smart Search identified 419 studies published since 2017 evaluating GLP-1 RAs for weight loss. The first 50 abstracts were dual-screened by two researchers for calibration; the remaining records were then screened using a hybrid approach involving one researcher and a Robot Screener, with discrepancies adjudicated by consensus. Screening identified 27 RCT abstracts in which weight loss was a primary outcome. A tag hierarchy was developed encompassing five dimensions: Study Type, Population, Intervention, Comparators, and Outcomes. ASTs were applied to each abstract and then evaluated on whether they applied the correct tags across seven key categories: Obesity, Study Type, Gender, GLP-1 RA, Comparators, Clinical Outcomes, and Safety Outcomes. Tag outputs were manually classified as Correct (True Positive), Missing (False Negative), Incorrect (False Positive), Partially Correct (0.5 True Positive, 0.5 False Negative), or Not in Abstract (True Negative). The performance metrics recall, accuracy, precision, and F1-score were calculated across all categories.
RESULTS: ASTs demonstrated consistently strong performance, identifying nearly all relevant elements. Precision was 0.9650±0.0670, Recall was 1.0±0, F1 was 0.9810±0.0364, and Accuracy was 0.9678±0.0613.
CONCLUSIONS: Specialized, unsupervised LLMs can provide high accuracy in identifying key concepts during abstract screening. Their adoption has the potential to significantly reduce time spent on screening and extraction without compromising quality.
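
The scoring scheme described in METHODS can be illustrated with a minimal Python sketch. It assumes each tag output carries one of the five manual labels and that a Partially Correct output contributes 0.5 to true positives and 0.5 to false negatives, as stated in the abstract. The label strings, the `score_tags` function name, and the example data are hypothetical and are not taken from the study. The sketch does not reproduce the reported ± values, since the abstract does not state how they were aggregated.

```python
from collections import Counter

# Contribution of each manual label to the confusion-matrix cells,
# following the scheme in METHODS. The label strings are hypothetical.
LABEL_WEIGHTS = {
    "correct": {"tp": 1.0},
    "missing": {"fn": 1.0},
    "incorrect": {"fp": 1.0},
    "partially_correct": {"tp": 0.5, "fn": 0.5},
    "not_in_abstract": {"tn": 1.0},
}


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def score_tags(labels):
    """Compute precision, recall, F1, and accuracy from manual labels."""
    cells = Counter()
    for label in labels:
        for cell, weight in LABEL_WEIGHTS[label].items():
            cells[cell] += weight

    tp, fp, fn, tn = (cells[k] for k in ("tp", "fp", "fn", "tn"))
    return {
        "precision": _ratio(tp, tp + fp),
        "recall": _ratio(tp, tp + fn),
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
    }


if __name__ == "__main__":
    # Hypothetical labels for one category; not data from the study.
    example = (
        ["correct"] * 8
        + ["partially_correct", "incorrect"]
        + ["not_in_abstract"] * 2
    )
    for name, value in score_tags(example).items():
        print(f"{name}: {value:.4f}")
```

Keeping the label-to-cell weights in a single table makes the partial-credit rule explicit and easy to change if a different weighting were preferred.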

Conference/Value in Health Info

2025-11, ISPOR Europe 2025, Glasgow, Scotland

Value in Health, Volume 28, Issue S2

Code

MSR38

Topic

Methodological & Statistical Research, Study Approaches

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

Diabetes/Endocrine/Metabolic Disorders (including obesity)

Presentation (CTI)