Automated Data Extraction in Systematic Literature Reviews (SLRs): Assessing the Accuracy and Reliability of a Large Language Model (LLM)
Speaker(s)
Shree A1, Farraia M2, Pathak S3, Slim M4, Cichewicz A5, Mittal L3, Casañas i Comabella C6
1Evidera, a part of Thermo Fisher Scientific, Bhubaneswar, OR, India, 2Evidera, a part of Thermo Fisher Scientific, London, UK, 3Evidera, a part of Thermo Fisher Scientific, Bangalore, KA, India, 4Evidera, a part of Thermo Fisher Scientific, Hamilton, ON, Canada, 5Evidera, a part of Thermo Fisher Scientific, Waltham, MA, USA, 6Evidera, a part of Thermo Fisher Scientific, London, LON, UK
OBJECTIVES: SLRs are crucial for decision-making but involve labor-intensive data extraction processes. Advances in artificial intelligence, specifically LLMs, offer an opportunity to support SLR data extraction, but their performance requires validation. This study aimed to assess the accuracy and reliability of GPT-4-assisted data extraction from peer-reviewed randomized controlled trials (RCTs).
METHODS: Prompts were developed to extract ~30 variables encompassing study design, patient characteristics, and outcome data from RCTs assessing treatments for atopic dermatitis. Prompts were optimized iteratively using a random sample of three full-text publications. AI-generated data extraction for 10 RCTs was carried out twice (8 days apart) and compared against human-validated extractions. Each extracted variable was rated as correct, incorrect, missing, or incomplete. Accuracy was calculated by dividing the number of correctly extracted variables by the total number of variables reported in each study. Reliability of extractions between testing days was assessed using the intraclass correlation coefficient (ICC).
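The Python sketch below illustrates how the accuracy and reliability metrics described above could be computed, assuming a pandas/pingouin workflow. It is not the authors' code: the rating encoding, the example values, and the choice of ICC model are illustrative assumptions, and the abstract does not specify which ICC form was used.

```python
# Minimal sketch (not the authors' implementation) of the accuracy and
# reliability calculations described in the METHODS section.
import pandas as pd
import pingouin as pg

# Hypothetical per-variable ratings for one extraction run: one row per
# variable reported in a study, with the human-adjudicated status of the
# LLM output (correct / incorrect / missing / incomplete).
ratings = pd.DataFrame({
    "study":    ["RCT01", "RCT01", "RCT01", "RCT02", "RCT02"],
    "variable": ["age_mean", "sample_size", "easi75_wk16", "age_mean", "sample_size"],
    "status":   ["correct", "correct", "missing", "incorrect", "correct"],
})

# Accuracy per study = correctly extracted variables / variables reported in the study.
accuracy = (
    ratings.assign(correct=ratings["status"].eq("correct"))
           .groupby("study")["correct"]
           .mean()
)
print(accuracy)

# Reliability between the two extraction runs (8 days apart), shown here as an
# ICC on hypothetical per-study accuracy scores, with studies as targets and
# runs as raters. Values are illustrative only.
runs = pd.DataFrame({
    "study":    [f"RCT{i:02d}" for i in range(1, 11)] * 2,
    "run":      ["day_1"] * 10 + ["day_9"] * 10,
    "accuracy": [0.90, 0.75, 0.96, 0.80, 0.66, 0.88, 0.92, 0.70, 0.84, 0.78,
                 0.88, 0.78, 0.96, 0.82, 0.70, 0.86, 0.92, 0.68, 0.84, 0.80],
})
icc = pg.intraclass_corr(data=runs, targets="study", raters="run", ratings="accuracy")
print(icc[["Type", "ICC"]])  # single- and average-score ICC estimates
```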
RESULTS: The mean overall accuracy was 84% (range: 66%-96%) across all RCTs. Accuracy was highest for patient characteristics, with a mean of 90% (range: 75%-100%), followed by study characteristics (81% [60%-100%]) and outcomes (80% [55%-100%]). Missing data ranged from 4% to 28% and were more frequent for outcome variables. Incorrect extractions ranged from 0% to 5%. ICCs for patient characteristics, outcomes, and study design variables were 0.95 (excellent), 0.85 (good), and 0 (poor), respectively.
CONCLUSIONS: GPT-4-assisted data extraction achieved relatively high accuracy. The LLM performed well but showed limitations in extracting data accurately and reliably: missing data were mainly attributable to information presented in figures or tables that could not be parsed, and incorrect extractions to the selection of the wrong time points or summary statistics. Further refinement and validation are needed to enhance the reliability of GPT-4-assisted data extraction in SLRs.
Code
MSR187
Topic
Methodological & Statistical Research, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis
Disease
No Additional Disease & Conditions/Specialized Treatment Areas