Using Large Language Models (LLMs) for Data Extraction in Literature Reviews: An Enhanced Approach
Author(s)
Lambova A1, Matev K1, Gallinaro J2, Guerra I2, Rtveladze K3, Caverly S2
1IQVIA, Sofia, Bulgaria, 2IQVIA, London, LON, UK, 3IQVIA, London, LON, UK
OBJECTIVES: Systematic literature reviews are pivotal for market access decisions regarding novel medical products. However, extracting clinical evidence for health technology assessment dossiers remains labor-intensive and error-prone. Last year, a generative pre-trained transformer 4 (GPT-4)-based algorithm demonstrated the potential of LLMs for generating an initial extraction of clinical data from publications, covering study details, patient characteristics, safety, efficacy, and quality-of-life outcomes. Accuracy ranged from 45% to 83%, with the highest performance for study details and the lowest for patient characteristics. Our objective was to enhance the previous algorithm, particularly for complex variables with historically low accuracy rates.
METHODS: A new LLM-based multistep approach was developed to overcome some of the challenges of complex clinical data extraction, such as extraction of sub-group variables, processing of long papers, and generation of structured output. Using LLM retrievers, an embedding model, and GPT-4, relevant information for the variables was extracted in an unstructured format. Iterative prompt engineering, guided by subject matter experts, refined the information retrieval process. An LLM-based method was then used to construct predefined extraction tables from the extracted text. The accuracy of the algorithm was measured for 70 patient characteristic variables across 10 studies by comparing the generated extraction with a manual extraction performed by humans.
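The retrieval step described in the methods can be illustrated with a minimal sketch. The abstract does not specify the chunking rule, the embedding model, or the retriever, so everything here is an assumption: `chunk_text`, `embed`, `cosine`, and `retrieve` are hypothetical helper names, and the bag-of-words `embed` function is only a toy stand-in for a real embedding model. The sketch shows the general idea of splitting a long paper into chunks and passing only the chunks most similar to a variable-specific query on to the LLM.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40):
    """Split a long paper into fixed-size word windows (assumed chunking rule)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the variable-specific query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Hypothetical sentences, not taken from any study in the abstract.
paper_chunks = [
    "Patients were randomised to treatment or placebo.",
    "Median age was 61 years in the treatment arm.",
    "Adverse events were reported in both arms.",
]
top = retrieve("median age of patients", paper_chunks, k=1)
```

In a pipeline of this kind, the retrieved chunks would then be placed in the prompt for the extraction model, which keeps long papers within the model's context window.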
RESULTS: Initial results showed an average accuracy of 70%, ranging from 35% to 100% across the 10 studies extracted. Notably, patient characteristic extraction improved substantially compared with the previous result (45%). The studies contained between 1 and 5 sub-groups of interest, including historical controls. The algorithm correctly identified all relevant sub-groups, and also identified some additional, more granular sub-groups.
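As an illustration of how per-study and average accuracy figures of this kind might be computed, the following is a minimal sketch. The abstract does not state the matching rule used to compare generated and manual extractions, so the exact-match-after-normalisation rule, the function names, and all values below are assumptions chosen for illustration only; none of the numbers come from the study.

```python
def normalise(value):
    """Case- and whitespace-insensitive comparison key (assumed matching rule)."""
    return " ".join(str(value).lower().split())


def study_accuracy(generated, manual):
    """Share of manually extracted variables matched by the generated extraction."""
    if not manual:
        raise ValueError("manual extraction is empty")
    hits = sum(
        1
        for variable, expected in manual.items()
        if variable in generated
        and normalise(generated[variable]) == normalise(expected)
    )
    return hits / len(manual)


def mean_accuracy(per_study):
    """Unweighted mean of per-study accuracies."""
    return sum(per_study) / len(per_study)


# Hypothetical extractions for one study (illustrative values only).
manual = {"median_age": "61", "female_pct": "48%", "n_randomised": "120", "ecog_0_pct": "35%"}
generated = {"median_age": "61 ", "female_pct": "52%", "n_randomised": "120"}

accuracy = study_accuracy(generated, manual)  # 2 of 4 variables match
average = mean_accuracy([accuracy, 1.0])      # averaged with a second, perfect study
```

An unweighted mean treats every study equally; weighting by the number of variables per study would be an alternative design choice if studies differ widely in how many variables they report.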
CONCLUSIONS: Implementing a complex multistep approach enhances LLM-based clinical data extraction. Independent improvements at each step contribute to overall precision. Our algorithm demonstrates promising results, paving the way for efficient clinical data extraction even for complex variables and population sub-groups.
Conference/Value in Health Info
Value in Health, Volume 27, Issue 12, S2 (December 2024)
Code
MSR135
Topic
Methodological & Statistical Research, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis
Disease
No Additional Disease & Conditions/Specialized Treatment Areas