Data Extraction in Literature Reviews Using an Artificial Intelligence Model: Prompt Development and Testing

Author(s)

Liz Lunn, BA (Hons)1, Shona Cross, PhD2, Ambar Khan, MSc3, Giuseppina Magri, PhD2, David Slater, MMath4, Saarang Tiwari, BTech3, Molly Murton, MSc2;
1Costello Medical Consulting, Manchester, United Kingdom, 2Costello Medical Consulting, Cambridge, United Kingdom, 3Costello Medical Consulting, London, United Kingdom, 4Costello Medical Consulting, Remote, United Kingdom
OBJECTIVES: To develop prompts for the extraction of economic data from publications using an artificial intelligence (AI) model, and to compare the accuracy of AI-extracted data against human-extracted data.
METHODS: Prompts were developed iteratively and sent to the OpenAI API (gpt-4o model; temperature 1) alongside a context prompt containing the publication text. The structure and content of the prompts were first explored on a development set of three articles reporting both economic evaluations (EEs) and cost and resource use (CRU). Following this, multiple iterations were performed on test sets comprising articles from two different disease areas (EEs: n=7, CRU data: n=4, utility data: n=4). After each iteration, F1 scores (the harmonic mean of precision and recall; score range 0-1) were calculated and the prompts were refined, with the aim of achieving an F1 score of at least 0.70, or the best possible score. Scores for AI-extracted data were compared to the score for human extraction of the same data.
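For illustration, a minimal Python sketch of the extraction and scoring workflow described above is given below. It is not the authors' implementation: the function names, message structure and item-level scoring are assumptions; only the model name (gpt-4o), the temperature (1) and the F1 definition are taken from the abstract.

    from openai import OpenAI

    client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

    def extract_economic_data(extraction_prompt: str, publication_text: str) -> str:
        # Send the developed prompt alongside a context prompt containing the article text
        response = client.chat.completions.create(
            model="gpt-4o",
            temperature=1,
            messages=[
                {"role": "system", "content": extraction_prompt},
                {"role": "user", "content": publication_text},
            ],
        )
        return response.choices[0].message.content

    def f1_score(extracted_items: set, reference_items: set) -> float:
        # F1 is the harmonic mean of precision and recall over extracted data items
        true_positives = len(extracted_items & reference_items)
        precision = true_positives / len(extracted_items) if extracted_items else 0.0
        recall = true_positives / len(reference_items) if reference_items else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)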
RESULTS: Two iterations of prompts were tested for EEs; F1 scores were 0.38 and 0.73, compared with 0.98 for human-extracted data. Five iterations were tested for CRU data; F1 scores ranged from 0.37 to 0.71, compared with 0.91 for human-extracted data, with improvements at all but one iteration. Four iterations were tested for utility data; F1 scores ranged from 0.52 to 0.74, compared with 0.96 for human-extracted data, with improvements at each iteration. AI extractions were more accurate for simpler information (e.g., model details and incremental costs) than for more complex aspects (e.g., identifying model input sources, differentiating between costs and resource use, and identifying all utility outputs).
CONCLUSIONS: The AI model produced promising outputs following prompt development and refinement on a small set of articles, particularly for simple information. However, performance is currently more limited, relative to human extraction, for complex economic information. Further optimization and testing on a larger number of articles and disease areas are required.

Conference/Value in Health Info

2025-05, ISPOR 2025, Montréal, Quebec, CA

Value in Health, Volume 28, Issue S1

Code

MSR84

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
