Harnessing AI for Quality Assessment in Economic Literature Reviews: Development and Testing of Drummond Checklist Prompts

Author(s)

Andrew Lim, BSc, Chen Yin, MD, Jennifer Sara Evans, PhD.
Costello Medical, Singapore, Singapore.
OBJECTIVES: To develop and test AI prompts for quality assessments in economic literature reviews using the 10-item Drummond checklist, and to compare the accuracy of AI versus human quality assessments.
METHODS: Prompts were developed iteratively and entered into OpenAI's GPT-4o model, each run using both a context prompt and a copy of the article for quality assessment. The structure and content of the prompts were initially explored on a development set of four economic evaluations in non-small cell lung cancer. The best-performing prompt was then applied to a separate test set (n=5) from the same disease area. F1 scores (the harmonic mean of precision and recall; range 0-1) were calculated at each iteration, with a target of ≥0.70. The F1 scores for AI-generated assessments were compared with the scores from human assessments of the same articles.
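As an illustration of the F1 metric described in the methods, the following is a minimal Python sketch of how a score could be computed for one article's checklist answers. The abstract does not state the reference standard or which answer was treated as the positive class, so the function name `f1_score`, the choice of "Yes" as the positive label, and the example answers are all hypothetical assumptions, not the authors' procedure or data.

```python
def f1_score(ai_answers, reference_answers, positive="Yes"):
    """Return F1 (harmonic mean of precision and recall) for paired answers.

    Assumption: `positive` is the label counted as a positive finding;
    the abstract does not specify this choice.
    """
    pairs = list(zip(ai_answers, reference_answers))
    tp = sum(1 for a, r in pairs if a == positive and r == positive)
    fp = sum(1 for a, r in pairs if a == positive and r != positive)
    fn = sum(1 for a, r in pairs if a != positive and r == positive)
    if tp == 0:
        # No true positives: precision or recall is zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical answers to the 10 checklist items for a single article.
reference = ["Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes"]
ai = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "No", "No", "Yes", "Yes"]

score = f1_score(ai, reference)
print(f"F1 = {score:.2f}; meets 0.70 target: {score >= 0.70}")
```

In this made-up example the AI has one false positive (item 3) and one false negative (item 7), so precision and recall are both 6/7 and the F1 score is about 0.86.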
RESULTS: In the development set (n=4), five prompt iterations produced F1 scores ranging from 0.77 to 0.84, with improvement in all but one iteration, while human assessments achieved 0.92. When the prompt that achieved the highest F1 score was run on the test set, the AI model achieved an F1 score of 0.78, compared to 0.90 for human assessments. The AI tended to generate less critical quality assessments than humans, especially for sources such as conference abstracts and health technology assessment reports. Its performance was more reliable for full-text journal articles; however, false positives, in which the AI incorrectly identified information that was not present in the article, remained persistent.
CONCLUSIONS: The AI model shows promise for quality assessments using the 10-item Drummond checklist, particularly for journal articles. However, a human-in-the-loop approach will remain important given the AI's lower F1 scores and persistent false positives. Future work should explore broader testing across disease areas and the use of other variants of the Drummond checklist.

Conference/Value in Health Info

2025-09, ISPOR Real-World Evidence Summit 2025, Tokyo, Japan

Value in Health Regional, Volume 49S (September 2025)

Code

RWD310

Topic Subcategory

Reproducibility & Replicability

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
