An Open-Source LLM Framework For Transparent Data Extraction From Research Papers
Author(s)
Kangping Zeng, MS1, Rong Liu, PhD1, Sayeed Salam, PhD2, Marko Zivkovic, PhD2, Zarmina Khankhel, MPH2, Lynn Okamoto, PharmD2.
1Stevens Institute of Technology, Hoboken, NJ, USA, 2Genesis Research Group, Hoboken, NJ, USA.
OBJECTIVES: Data extraction is a time-consuming and error-prone, yet essential, part of systematic reviews. The advent of large language models (LLMs) offers promising opportunities for automation, but challenges persist, including substantial costs and piecemeal extraction of evidence without its relationships. We present a novel framework that leverages open-source LLMs to address these issues and streamline the extraction process.
METHODS: The framework uses an open-source LLM to analyze each paragraph of a research paper and generate candidate answers for predefined attributes. The quality of these answers is assessed based on their alignment with the original text, their internal consistency within paragraphs, and their cross-paragraph coherence. Using a feature-based scoring mechanism, the top-k ranked answers are synthesized into knowledge graphs that capture targeted data attributes, such as study characteristics and the populations and interventions evaluated, together with their relationships across papers (e.g., from a single review). These knowledge graphs can facilitate efficient systematic review tasks, including eligibility analysis and quality control.
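To make the score-and-rank step concrete, here is a minimal Python sketch of how per-paragraph candidates might be scored, ranked, and assembled into a knowledge graph. The Candidate fields, the linear scoring weights, and the use of networkx are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch: score per-paragraph candidate answers on the three
# quality features named above, keep the top-k, and link them to the paper
# in a small knowledge graph. All names and weights are assumptions.
from dataclasses import dataclass
import networkx as nx

@dataclass
class Candidate:
    attribute: str      # e.g. "population" or "intervention"
    answer: str         # text produced by the LLM for this attribute
    paragraph_id: int   # source paragraph, enabling precise traceability
    alignment: float    # overlap with the original paragraph text
    consistency: float  # agreement with other answers in the same paragraph
    coherence: float    # agreement with answers from other paragraphs

def score(c: Candidate, w=(0.5, 0.25, 0.25)) -> float:
    """Linear feature-based score; the weights are assumptions."""
    return w[0] * c.alignment + w[1] * c.consistency + w[2] * c.coherence

def top_k(candidates: list[Candidate], k: int = 3) -> list[Candidate]:
    """Rank all candidates for an attribute and keep the k best."""
    return sorted(candidates, key=score, reverse=True)[:k]

def build_graph(paper_id: str, kept: list[Candidate]) -> nx.DiGraph:
    """Attach each kept answer to its paper, tagged with its source paragraph."""
    g = nx.DiGraph()
    g.add_node(paper_id, kind="paper")
    for c in kept:
        node = f"{c.attribute}:{c.answer}"
        g.add_node(node, kind="attribute_value", paragraph=c.paragraph_id)
        g.add_edge(paper_id, node, relation=c.attribute)
    return g
```

Because each graph node retains its source paragraph_id, every extracted value can be traced back to the exact passage it came from, which is what supports the human verification described in the results.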
RESULTS: As a pilot study, the framework was evaluated on 30 expert-annotated research papers sampled from different literature reviews. The framework achieved an F1-score of 0.86 when extracting 10 key study attributes, outperforming GPT-4o by 8%. The extractions were delivered at one-tenth the computational cost using a 7B-parameter open-source LLM, making the framework a cost-effective solution for large-scale data extraction tasks. Additionally, our framework provides precise traceability of extracted data for human verification, whereas baseline models such as GPT-4o offer only vague location information.
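For context on the reported metric, a hedged sketch of how per-attribute F1 could be computed against the expert annotations; exact-match set comparison is an assumption, and the study's actual matching criteria may differ:

```python
# Hypothetical evaluation helper: F1 for one attribute, comparing the
# framework's extracted values with expert gold annotations.
def f1(extracted: set[str], gold: set[str]) -> float:
    if not extracted or not gold:
        return 0.0
    tp = len(extracted & gold)      # values both the model and experts found
    precision = tp / len(extracted)
    recall = tp / len(gold)
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```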
CONCLUSIONS: The results highlight the potential of open-source LLM frameworks to enhance data extraction while providing enough transparency to minimize misinformation. With human oversight, these frameworks present a viable, cost-effective alternative for expediting literature reviews at scale.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
MSR19
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas