Open-Source LLMs Performance on Information Retrieval Tasks for Health Outcomes Research
Author(s)
Achilleas Livieratos, PhD1, Junjing Lin, PhD2, Di Zhang, PhD3, All-shine Chen, PhD4, Maria Kudela, PhD4, Yuxi Zhao, PhD4, Cynthia Basu, PhD4, Sai Hurrish Dharmarajan, PhD5, Margaret Gamalo, PhD4.
1SPAIML Scientific Working Group, New York, NY, USA, 2SPAIML Scientific Working Group/Takeda Pharmaceuticals, New York, NY, USA, 3SPAIML Scientific Working Group/Teva Pharmaceuticals, New York, NY, USA, 4SPAIML Scientific Working Group/Pfizer, New York, NY, USA, 5SPAIML Scientific Working Group/Sarepta Therapeutics, New York, NY, USA.
OBJECTIVES: We evaluated the effectiveness of open-source large language models (LLMs) in information retrieval tasks for Health Economics and Outcomes Research (HEOR). Specifically, our objectives were to analyze the performance of four open-source LLMs (Qwen2-72B, Llama-3.1-8B, Mistral-7B, and Phi-3-Mini-4K) on data extraction tasks from clinical abstracts and full manuscripts, using a proprietary model, OpenAI's o1, as an evaluator.
METHODS: We conducted zero-shot assessments across three prompt scenarios—open-ended prompts for abstracts, open-ended prompts for full manuscripts, and narrow, task-specific prompts for full manuscripts—applied to immunology-focused publications sourced from PubMed. We adopted a simplified Fine-grained Language Model Evaluation based on Alignment Skill Sets (FLASK) framework, assessing models on six critical metrics: Accuracy, Robustness, Creativity, Insights, Quantitative Information, and Logical Reasoning. Pairwise output comparisons and win-rate calculations provided insight into model performance across the prompt scenarios.
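The abstract describes the pairwise comparisons and win-rate calculation without implementation details. As a minimal, illustrative sketch (not the authors' code), the following Python snippet shows how per-model win rates could be computed from pairwise verdicts returned by a judge model such as OpenAI's o1; the judgment records, function names, and toy data below are assumptions for illustration only.

```python
from collections import defaultdict

# Models and FLASK metrics named in the abstract.
MODELS = ["Qwen2-72B", "Llama-3.1-8B", "Mistral-7B", "Phi-3-Mini-4K"]
METRICS = ["Accuracy", "Robustness", "Creativity",
           "Insights", "Quantitative Information", "Logical Reasoning"]

def win_rates(judgments):
    """Compute per-model win rates from pairwise judge verdicts.

    judgments: iterable of (model_a, model_b, winner) tuples, where
    winner is model_a, model_b, or None for a tie. In the study these
    verdicts would come from prompting the evaluator model; here they
    are supplied directly as placeholders.
    """
    wins = defaultdict(int)
    comparisons = defaultdict(int)
    for a, b, winner in judgments:
        comparisons[a] += 1
        comparisons[b] += 1
        if winner is not None:
            wins[winner] += 1
    return {m: wins[m] / comparisons[m] for m in comparisons}

# Toy example (hypothetical data): Qwen2-72B preferred in every pairing.
toy = [("Qwen2-72B", other, "Qwen2-72B") for other in MODELS[1:]]
print(win_rates(toy))  # {'Qwen2-72B': 1.0, 'Llama-3.1-8B': 0.0, ...}
```

In practice each (document, metric) pair would contribute one judgment per model pairing, so a model's win rate aggregates over documents, the six FLASK metrics, and opponents.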
RESULTS: While no significant differences were observed across models, certain trends emerged. Notably, Qwen2-72B performed best, particularly on open-ended tasks, achieving win rates above 50% against the other models. These findings suggest that open-source LLMs, particularly with refined prompt engineering, are viable alternatives to proprietary models for HEOR-specific applications. The study supports the adoption of open-source models in pharmaceutical research, highlighting their flexibility, cost-effectiveness, and adaptability to regulatory requirements.
CONCLUSIONS: Open-source LLMs are an underutilized yet promising tool for HEOR, offering substantial benefits in scalability and customization for data-driven health outcomes research. Further investigation into varied document types and prompt configurations may deepen understanding of their utility in HEOR and medical affairs.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, Canada
Value in Health, Volume 28, Issue S1
Code
SA1
Topic
Study Approaches
Disease
SDC: Systemic Disorders/Conditions (Anesthesia, Auto-Immune Disorders (n.e.c.), Hematological Disorders (non-oncologic), Pain)