Open-Source LLMs Performance on Information Retrieval Tasks for Health Outcomes Research

Author(s)

Achilleas Livieratos, PhD1, Junjing Lin, PhD2, Di Zhang, PhD3, All-shine Chen, PhD4, Maria Kudela, PhD4, Yuxi Zhao, PhD4, Cynthia Basu, PhD4, Sai Hurrish Dharmarajan, PhD5, Margaret Gamalo, PhD4.
1SPAIML Scientific Working Group, New York, NY, USA, 2SPAIML Scientific Working Group/Takeda Pharmaceuticals, New York, NY, USA, 3SPAIML Scientific Working Group/Teva Pharmaceuticals, New York, NY, USA, 4SPAIML Scientific Working Group/Pfizer, New York, NY, USA, 5SPAIML Scientific Working Group/Sarepta Therapeutics, New York, NY, USA.


OBJECTIVES: We evaluated the effectiveness of open-source large language models (LLMs) in information retrieval tasks for Health Economics and Outcomes Research (HEOR). Specifically, our objectives were to analyze the performance of four open-source LLMs (Qwen2-72B, Llama-3.1-8B, Mistral-7B, and Phi-3-Mini-4K) on data extraction tasks from clinical abstracts and full manuscripts, using a proprietary model, OpenAI's o1, as an evaluator.
METHODS: Our methodology included zero-shot learning assessments across three scenario types: open-ended prompts for abstracts, open-ended prompts for full manuscripts, and narrow, task-specific prompts for full manuscripts, applied to immunology-focused publications sourced from PubMed. We adopted a simplified Fine-grained Language Model Evaluation based on Alignment Skill Sets (FLASK) framework, assessing models on six critical metrics: Accuracy, Robustness, Creativity, Insights, Quantitative Information, and Logical Reasoning. Pairwise output comparisons and win-rate calculations provided insight into model performance across the varied prompt scenarios.
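The pairwise win-rate calculation described above can be sketched as follows. This is a minimal illustration only: the model names are taken from the abstract, but the judgment data and the tie-handling rule (ties counted as half a win) are assumptions, not the study's actual scoring procedure.

```python
# Hypothetical evaluator judgments: (model_a, model_b, winner), where the
# winner is chosen by the evaluator model for one prompt scenario.
# Data below is illustrative, not from the study.
judgments = [
    ("Qwen2-72B", "Llama-3.1-8B", "Qwen2-72B"),
    ("Qwen2-72B", "Mistral-7B", "tie"),
    ("Qwen2-72B", "Phi-3-Mini-4K", "Qwen2-72B"),
    ("Llama-3.1-8B", "Mistral-7B", "Llama-3.1-8B"),
]

def win_rates(judgments):
    """Win rate per model = wins / comparisons entered.

    Assumption: a tie contributes half a win to each model.
    """
    wins, total = {}, {}
    for a, b, winner in judgments:
        for m in (a, b):
            total[m] = total.get(m, 0) + 1
            wins.setdefault(m, 0.0)
        if winner == "tie":
            wins[a] += 0.5
            wins[b] += 0.5
        else:
            wins[winner] += 1.0
    return {m: wins[m] / total[m] for m in total}

rates = win_rates(judgments)
```

Under this toy data, Qwen2-72B enters three comparisons with two wins and one tie, giving a win rate above 50%, which mirrors the kind of summary statistic the abstract reports.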
RESULTS: While there were no statistically significant differences across models, certain trends emerged. Notably, Qwen2-72B demonstrated the strongest performance, especially in open-ended tasks, achieving win rates above 50% against the other models. These findings suggest that open-source LLMs, particularly with refined prompt engineering, are viable alternatives to proprietary models for HEOR-specific applications. The study supports the adoption of open-source models in pharmaceutical research, highlighting their flexibility, cost-effectiveness, and adaptability to regulatory requirements.
CONCLUSIONS: Open-source LLMs present an underutilized yet promising tool for HEOR, offering substantial benefits in scalability and customization for data-driven health outcomes research. Further investigation into varied document types and prompt configurations may deepen understanding of their utility in HEOR and medical affairs.

Conference/Value in Health Info

2025-05, ISPOR 2025, Montréal, Quebec, CA

Value in Health, Volume 28, Issue S1

Code

SA1

Topic

Study Approaches

Disease

SDC: Systemic Disorders/Conditions (Anesthesia, Auto-Immune Disorders (n.e.c.), Hematological Disorders (non-oncologic), Pain)
