QUALITY ASSESSMENT IN A SYSTEMATIC LITERATURE REVIEW USING AN ARTIFICIAL INTELLIGENCE MODEL
Author(s)
Philippe Martin, PhD1, Michelle Di Risio, MSc, PhD1, Olivia Colman, MPH2, Corrine Gregory, MPH2, Lee Hughes, MS2, Liz Lunn, BA3
1Knight Therapeutics, Montréal, QC, Canada, 2Costello Medical, Boston, MA, USA, 3Costello Medical, Manchester, United Kingdom
OBJECTIVES: To evaluate the performance of an artificial intelligence (AI) model in quality assessment (QA) of randomized controlled trials (RCTs).
METHODS: AI prompts were developed to extract risk of bias (RoB) assessments from RCTs included in a systematic literature review (SLR) of attention deficit hyperactivity disorder (ADHD) medications. A modified Cochrane RoB 1.0 tool was used, with additional questions addressing cross-over designs and study funding. Prompts were refined based on human judgement of output quality and run in GPT-4o (temperature: 0.7) for all questions. A human reviewer verified AI outputs against the publications, classifying each as a true positive (accurate, as reported in the publication), false positive (data reported that are not present in the publication), or false negative (data present in the publication but not reported). These classifications were used to compute recall (the proportion of relevant data accurately identified) and precision (the proportion of outputs that were correct), and F1 scores (the harmonic mean of precision and recall; all measures range from 0 to 1) were calculated. A predefined threshold of ≥0.70 was considered a ‘good’ F1 score.
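The following minimal Python sketch illustrates how precision, recall, and F1 can be derived from human-verified true positive, false positive, and false negative counts; the function name and example counts are illustrative assumptions, not values taken from the study.

# Compute precision, recall, and F1 from human-verified counts for one study.
# The example counts below are hypothetical, not study data.
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # correct outputs among all reported outputs
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # relevant data accurately identified
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

GOOD_F1_THRESHOLD = 0.70  # predefined threshold for a 'good' F1 score

p, r, f1 = precision_recall_f1(tp=9, fp=3, fn=1)  # hypothetical counts
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f} good={f1 >= GOOD_F1_THRESHOLD}")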
RESULTS: Among 32 studies, median recall was 1.0 (range: 0.31-1.0) and median precision was 0.68 (range: 0.31-0.94). The F1 threshold was met in 29/32 studies; two of the remaining studies scored close to the threshold (0.63 and 0.67), and the third (a ClinicalTrials.gov record) was an outlier at 0.31. Across individual checklist questions, the highest F1 score was 0.99 (outcome reporting bias) and the lowest was 0.52 (attrition bias).
CONCLUSIONS: The AI model demonstrated high recall, efficiently identifying QA checklist items when present. Precision varied across records, reflecting common misinterpretations of data. Although most studies met the F1 threshold, the AI appeared less consistent when extracting and interpreting data from a clinical trial registry record than from manuscript publications. These findings support AI’s value in QA of RCTs but emphasize the need for continuous human-in-the-loop verification to ensure accuracy.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR187
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas