QUALITY ASSESSMENT IN A SYSTEMATIC LITERATURE REVIEW USING AN ARTIFICIAL INTELLIGENCE MODEL
Author(s)
Philippe Martin, PhD1, Michelle Di Risio, MSc, PhD1, Olivia Colman, MPH2, Corrine Gregory, MPH2, Lee Hughes, MS2, Liz Lunn, BA3
1Knight Therapeutics, Montréal, QC, Canada, 2Costello Medical, Boston, MA, USA, 3Costello Medical, Manchester, United Kingdom
OBJECTIVES: To evaluate the performance of an artificial intelligence (AI) model in quality assessment (QA) of randomized controlled trials (RCTs).
METHODS: AI prompts were developed to extract risk of bias (RoB) assessments from RCTs included in a systematic literature review (SLR) of attention deficit hyperactivity disorder (ADHD) medications. A modified Cochrane RoB 1.0 tool was used, with additional questions addressing cross-over designs and study funding. Prompts were refined based on human judgement of output quality and run in GPT-4o (temperature: 0.7) for all questions. A human reviewer verified AI outputs against the publications, classifying each as a true positive (accurate, as reported in the publication), false positive (data reported that are not present in the publication), or false negative (data present in the publication but not reported). These classifications were used to compute recall (the proportion of relevant data accurately identified) and precision (the proportion of outputs that were correct), and F1 scores (the harmonic mean of precision and recall; all measures range from 0 to 1) were calculated. A predefined threshold of ≥0.70 was considered a ‘good’ F1 score.
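The following minimal Python sketch illustrates how precision, recall, and F1 can be derived from human-verified true positive, false positive, and false negative counts; the function name and example counts are illustrative assumptions, not values taken from the study.

# Compute precision, recall, and F1 from human-verified counts for one study.
# The example counts below are hypothetical, not study data.
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # correct outputs among all reported outputs
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # relevant data accurately identified
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

GOOD_F1_THRESHOLD = 0.70  # predefined threshold for a 'good' F1 score

p, r, f1 = precision_recall_f1(tp=9, fp=3, fn=1)  # hypothetical counts
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f} good={f1 >= GOOD_F1_THRESHOLD}")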
RESULTS: Among 32 studies, median recall was 1.0 (range: 0.31-1.0) and median precision was 0.68 (range: 0.31-0.94). The F1 threshold was met in 29/32 studies; two of the remaining studies scored close to the threshold (0.63 and 0.67), and the third (a ClinicalTrials.gov record) was an outlier at 0.31. Across individual checklist questions, the highest F1 score was 0.99 (outcome reporting bias) and the lowest was 0.52 (attrition bias).
CONCLUSIONS: The AI model demonstrated high recall, efficiently identifying QA checklist items when present. Precision varied across records, reflecting common misinterpretations of data. Although most studies met the F1 threshold, the AI appeared less consistent when extracting and interpreting data from a clinical trial registry record than from manuscript publications. These findings support AI’s value in QA of RCTs but emphasize the need for continuous human-in-the-loop verification to ensure accuracy.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR187
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas