Data Extraction in Literature Reviews Using an Artificial Intelligence Model: Prompt Development and Testing
Author(s)
Liz Lunn, BA (Hons)1, Shona Cross, PhD2, Ambar Khan, MSc3, Giuseppina Magri, PhD2, David Slater, MMath4, Saarang Tiwari, BTech3, Molly Murton, MSc2
1Costello Medical Consulting, Manchester, United Kingdom, 2Costello Medical Consulting, Cambridge, United Kingdom, 3Costello Medical Consulting, London, United Kingdom, 4Costello Medical Consulting, Remote, United Kingdom
OBJECTIVES: To develop prompts for the extraction of economic data from publications using artificial intelligence (AI), and to compare the accuracy of AI-extracted data against human-extracted data.
METHODS: Prompts were developed iteratively and sent to OpenAI (gpt-4o model; temperature 1) alongside a context prompt containing the publication text. The structure and content of the prompts were first explored on a development set of three articles reporting both economic evaluations (EEs) and cost and resource use (CRU) data. Following this, multiple iterations were performed on test sets comprising articles from two different disease areas (EEs: n=7; CRU data: n=4; utility data: n=4). After each iteration, F1 scores (the harmonic mean of precision and recall; range 0-1) were calculated, and the prompts were refined with the aim of achieving an F1 score of at least 0.70, or the best score possible. F1 scores for AI-extracted data were compared with those for human extraction of the same data.
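The abstract does not publish its prompts or scoring pipeline, so the sketch below is purely illustrative: the prompt wording, the JSON field names, and the set-based matching of extracted items are all assumptions, not the authors' method. It shows one plausible shape of the workflow described above: sending an extraction prompt alongside a context prompt containing the publication text to gpt-4o (temperature 1) via the OpenAI Python SDK, then computing an F1 score against a human-extracted gold standard.

```python
# Illustrative sketch only; prompt text, field names, and item-matching
# granularity are assumptions, as the abstract does not report them.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical extraction prompt; the study's prompts were developed iteratively.
EXTRACTION_PROMPT = (
    "Extract the following economic evaluation fields from the article text "
    "and return them as JSON: model type, comparators, incremental costs, "
    "incremental QALYs, ICER. Use null for any field not reported."
)

def extract(article_text: str) -> str:
    """Send the extraction prompt alongside a context prompt containing the
    publication text, as described in METHODS (gpt-4o, temperature 1)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=1,
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": article_text},  # the context prompt
        ],
    )
    return response.choices[0].message.content

def f1_score(extracted: set[str], gold: set[str]) -> float:
    """F1 = harmonic mean of precision and recall over extracted data items.
    Treating each data item as a set element is an assumption; the abstract
    does not state how AI-extracted items were matched to the gold standard."""
    if not extracted or not gold:
        return 0.0
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted)
    recall = true_positives / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, if a prompt iteration extracts 8 items, 6 of which match the 10 items in the human gold standard, precision is 0.75, recall is 0.60, and F1 is 2(0.75)(0.60)/(0.75+0.60) = 0.67, just below the 0.70 target that triggered further prompt refinement.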
RESULTS: Two iterations of prompts were tested for EEs; F1 scores were 0.38 and 0.73, compared with 0.98 for human-extracted data. Five iterations were tested for CRU data; F1 scores ranged from 0.37 to 0.71, compared with 0.91 for human-extracted data, with improvements at all but one iteration. Four iterations were tested for utility data; F1 scores ranged from 0.52 to 0.74, compared with 0.96 for human-extracted data, with improvements at each iteration. AI extractions were more accurate for simpler information (e.g., model details and incremental costs) than for more complex aspects (e.g., identifying model input sources, differentiating between costs and resource use, and identifying all utility outputs).
CONCLUSIONS: The AI model produced promising outputs following prompt development and refinement on a small set of articles, particularly for simple information. However, performance currently remains more limited than human extraction for complex economic information. Further optimization and testing on a larger number of articles and disease areas are required.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
MSR84
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas