Automated Data Extraction in Systematic Literature Reviews (SLRs): Assessing the Accuracy and Reliability of a Large Language Model (LLM)
Speaker(s)
Shree A1, Farraia M2, Pathak S3, Slim M4, Cichewicz A5, Mittal L3, Casañas i Comabella C6
1Evidera, a part of Thermo Fisher Scientific, Bhubaneswar, OR, India, 2Evidera, a part of Thermo Fisher Scientific, London, UK, 3Evidera, a part of Thermo Fisher Scientific, Bangalore, KA, India, 4Evidera, a part of Thermo Fisher Scientific, Hamilton, ON, Canada, 5Evidera, a part of Thermo Fisher Scientific, Waltham, MA, USA, 6Evidera, a part of Thermo Fisher Scientific, London, LON, UK
OBJECTIVES: SLRs are crucial for decision-making but involve labor-intensive data extraction processes. Advances in artificial intelligence, specifically LLMs, offer an opportunity to support SLR data extraction, but their performance requires validation. This study aimed to assess the accuracy and reliability of GPT-4-assisted data extraction from peer-reviewed randomized controlled trials (RCTs).
METHODS: Prompts were developed to extract ~30 variables encompassing study design, patient characteristics, and outcome data from RCTs assessing treatments for atopic dermatitis. Prompts were optimized iteratively using a random sample of three full-text publications. AI-generated data extraction for 10 RCTs was carried out twice (8 days apart) and compared against human-validated extractions. Each extracted variable was rated as correct, incorrect, missing, or incomplete. Accuracy was calculated by dividing the number of correctly extracted variables by the total number of variables reported in each study. Reliability of extractions between testing days was assessed using the intraclass correlation coefficient (ICC).
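The Python sketch below illustrates how the accuracy and reliability metrics described above could be computed, assuming a pandas/pingouin workflow. It is not the authors' code: the rating encoding, the example values, and the choice of ICC model are illustrative assumptions, and the abstract does not specify which ICC form was used.

```python
# Minimal sketch (not the authors' implementation) of the accuracy and
# reliability calculations described in the METHODS section.
import pandas as pd
import pingouin as pg

# Hypothetical per-variable ratings for one extraction run: one row per
# variable reported in a study, with the human-adjudicated status of the
# LLM output (correct / incorrect / missing / incomplete).
ratings = pd.DataFrame({
    "study":    ["RCT01", "RCT01", "RCT01", "RCT02", "RCT02"],
    "variable": ["age_mean", "sample_size", "easi75_wk16", "age_mean", "sample_size"],
    "status":   ["correct", "correct", "missing", "incorrect", "correct"],
})

# Accuracy per study = correctly extracted variables / variables reported in the study.
accuracy = (
    ratings.assign(correct=ratings["status"].eq("correct"))
           .groupby("study")["correct"]
           .mean()
)
print(accuracy)

# Reliability between the two extraction runs (8 days apart), shown here as an
# ICC on hypothetical per-study accuracy scores, with studies as targets and
# runs as raters. Values are illustrative only.
runs = pd.DataFrame({
    "study":    [f"RCT{i:02d}" for i in range(1, 11)] * 2,
    "run":      ["day_1"] * 10 + ["day_9"] * 10,
    "accuracy": [0.90, 0.75, 0.96, 0.80, 0.66, 0.88, 0.92, 0.70, 0.84, 0.78,
                 0.88, 0.78, 0.96, 0.82, 0.70, 0.86, 0.92, 0.68, 0.84, 0.80],
})
icc = pg.intraclass_corr(data=runs, targets="study", raters="run", ratings="accuracy")
print(icc[["Type", "ICC"]])  # single- and average-score ICC estimates
```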
RESULTS: The mean overall accuracy was 84% (range: 66%-96%) across all RCTs. Accuracy was highest for patient characteristics, with a mean of 90% (range: 75%-100%), followed by study characteristics (81% [60%-100%]) and outcomes (80% [55%-100%]). Missing data ranged from 4% to 28% and were more frequent for outcome variables. Incorrect extractions ranged from 0% to 5%. ICCs for patient characteristics, outcomes, and study design variables were 0.95 (excellent), 0.85 (good), and 0 (poor), respectively.
CONCLUSIONS: GPT-4-assisted data extraction achieved relatively high accuracy. The LLM performed well but showed limitations in extracting data accurately and reliably: missing data were mainly attributable to information presented in figures or tables that could not be parsed, and incorrect extractions to the selection of the wrong time points or summary statistics. Further refinement and validation are needed to enhance the reliability of GPT-4-assisted data extraction in SLRs.
Code
MSR187
Topic
Methodological & Statistical Research, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis
Disease
No Additional Disease & Conditions/Specialized Treatment Areas