Key Considerations in the Use of Large Language Models for Data Extraction in Health Economics and Outcomes Research
Author(s)
Elise Aronitz, MSc, Jayson Brian Habib, MPH, Christopher Olsen, BHSc, Kevin Hou, PhD, Nicole Ferko, MSc.
EVERSANA, Burlington, ON, Canada.
OBJECTIVES: The use of large language models (LLMs) to improve the efficiency of data extraction is less explored than their use for article screening in systematic reviews. This research identifies and examines key considerations for optimizing the use of LLMs in data extraction.
METHODS: Using R, articles of interest were submitted to GPT via the OpenAI application programming interface (API), authenticated with an API key. The OpenAI Playground was used to extract data from images. Engineered prompts instructed the LLM to extract the relevant data, and prompts were standardized across both platforms.
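The abstract describes an R-based workflow but does not publish the authors' code or prompt wording. The sketch below is an illustrative Python equivalent that builds a chat-completion request for data extraction; the prompt text, field names, and model name are assumptions, and the actual network call (which requires an API key) is deliberately omitted.

```python
import json

# Hypothetical standardized extraction prompt -- the abstract does not
# publish the authors' engineered prompts, so this wording is illustrative.
EXTRACTION_PROMPT = (
    "Extract the sample size, treatment arms, and primary outcome "
    "from the article below. Return the result as JSON."
)

def build_extraction_request(article_text: str, model: str = "gpt-4o") -> dict:
    """Build a chat-completions payload asking the LLM to extract data."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": article_text},
        ],
        # A low temperature favours deterministic, verbatim extraction.
        "temperature": 0,
    }

# In practice this payload would be POSTed to the chat-completions endpoint
# with an Authorization: Bearer <API key> header (e.g., via httr2 in R);
# the call is omitted here so the sketch stays offline and runnable.
payload = build_extraction_request("Full text of the article of interest...")
print(json.dumps(payload)[:40])
```

Keeping the prompt in a single constant makes it straightforward to standardize wording across platforms, as the abstract describes.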
RESULTS: Document pre-processing, such as optical character recognition (OCR) and redaction, was identified as a critical consideration. OCR pre-processing reduced error rates for text-based inputs, and the LLM accurately extracted information from both simple and complex tables. Redaction also showed promising results in improving accuracy. Additionally, document format (i.e., plain text vs. image) affected extraction accuracy, particularly for complex nested tables: uploading images often yielded better results, as the LLM could extract from complex nested tables that caused inaccuracies when documents were supplied as plain text. Finally, increasing the complexity of stratification increased error risk; for instance, extracting data by treatment alone was more likely to yield accurate results than extracting by both treatment and timepoint.
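The abstract reports that redacting documents before extraction improved accuracy but does not specify what was redacted or how. The sketch below shows one minimal way such a pre-processing step could look; the patterns (trial registry IDs, email addresses) and placeholder tokens are purely illustrative assumptions.

```python
import re

# Hypothetical redaction step: mask identifying strings before sending
# text to the LLM. The abstract does not specify what was redacted, so
# the patterns below are illustrative only.
PATTERNS = {
    "NCT_ID": re.compile(r"\bNCT\d{8}\b"),           # ClinicalTrials.gov IDs
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text: str) -> str:
    """Replace each matched pattern with a neutral placeholder token."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

sample = "Trial NCT01234567; contact j.doe@example.com for data."
print(redact(sample))
# -> Trial [NCT_ID]; contact [EMAIL] for data.
```

Running the redaction locally, before any API call, also keeps the masked identifiers from ever leaving the analyst's machine.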
CONCLUSIONS: Careful consideration of several factors is essential to ensure accurate data extraction when using LLM-based tools. Overall, leveraging LLMs has the potential to significantly enhance the efficiency of data extraction, provided these factors are accounted for.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
MSR88
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas