Key Considerations in the Use of Large Language Models for Data Extraction in Health Economics and Outcomes Research

Author(s)

Elise Aronitz, MSc, Jayson Brian Habib, MPH, Christopher Olsen, BHSc, Kevin Hou, PhD, Nicole Ferko, MSc.
EVERSANA, Burlington, ON, Canada.
OBJECTIVES: Using large language models (LLMs) to make data extraction more efficient has been explored less than using them for article screening in systematic reviews. This research identifies and explores important considerations to help optimize the use of LLMs in data extraction.
METHODS: Articles of interest were sent to GPT from R through an application programming interface (API), authenticated with an API key. The OpenAI Playground was used to extract data from images. Engineered prompts instructed the LLM to extract the relevant data, and the same standardized prompts were used on both platforms.
RESULTS: Document pre-processing, such as optical character recognition (OCR) and redaction, was identified as a critical consideration. OCR processing reduced error rates for text-based input, and the LLM was able to accurately extract information from both simple and complex tables. Redaction also showed promising results in improving accuracy. Additionally, document format (i.e., plain text vs. image) was observed to affect extraction accuracy, particularly with complex nested tables. Uploading images often yielded better results, as the LLM could extract from complex nested tables that caused inaccuracies when the same documents were supplied as plain text. Finally, increasing the complexity of stratification increased the risk of error. For instance, extracting data stratified by treatment alone was more likely to yield accurate results than extracting data stratified by both treatment and timepoint.
CONCLUSIONS: Careful consideration of several factors is essential to ensure accurate data extraction when using LLM-based tools. Overall, leveraging LLMs has the potential to significantly enhance the efficiency of data extraction, provided these factors are accounted for.
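The stratification point in the results can be illustrated with a small sketch. It is not the authors' implementation (the abstract reports using R and the OpenAI Playground, and does not publish prompts); it is a minimal Python example, under stated assumptions, of how a standardized prompt might be built and how the number of values requested grows as strata are added. The function names, the prompt wording, and the sample treatments and timepoints are all hypothetical, and no API call is made.

```python
import json
from itertools import product


def build_prompt(outcome, strata):
    """Build one standardized extraction prompt (hypothetical wording)."""
    keys = ", ".join(strata)  # joins the stratum names, e.g. "treatment, timepoint"
    return (
        f"Extract '{outcome}' from the document below. "
        f"Return a JSON list with one object per combination of: {keys}. "
        f"Each object must have the keys: {keys}, value. "
        "Use null when a value is not reported."
    )


def expected_cells(strata):
    """List every stratum combination the reply is expected to cover."""
    names = list(strata)
    return [dict(zip(names, combo)) for combo in product(*strata.values())]


def missing_cells(reply_text, strata):
    """Return the stratum combinations absent from a JSON reply."""
    rows = json.loads(reply_text)
    names = list(strata)
    seen = {tuple(row.get(n) for n in names) for row in rows}
    return [c for c in expected_cells(strata)
            if tuple(c[n] for n in names) not in seen]


# Illustrative, made-up strata: adding timepoint doubles the cells to extract.
by_treatment = {"treatment": ["Drug A", "Placebo"]}
by_both = {"treatment": ["Drug A", "Placebo"],
           "timepoint": ["Week 12", "Week 24"]}
print(len(expected_cells(by_treatment)), len(expected_cells(by_both)))
```

Each added stratum multiplies the number of values the model must locate and label correctly, which is one plausible reason the finer stratification was more error-prone. A check such as `missing_cells` only flags combinations the reply omitted; it does not verify that the extracted values are correct, so review against the source document is still needed.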

Conference/Value in Health Info

2025-05, ISPOR 2025, Montréal, Quebec, CA

Value in Health, Volume 28, Issue S1

Code

MSR88

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
