HELMET: A Benchmark Dataset for Evaluating Generative AI in Health Economics and Outcomes Research

Author(s)

Behnam Sharif, PhD1, Andre Verhoek, MSc2
1Benady Consulting LTD, Calgary, AB, Canada; 2SkyHTA, Gouda, The Netherlands
OBJECTIVES: Generative artificial intelligence (GenAI), particularly large language models (LLMs), has shown potential for automating tasks such as data extraction, evidence synthesis, and document labelling in health economics. However, there is no standardized benchmark for evaluating LLM performance on these tasks, particularly within the domains of cost-effectiveness models (CEM) and budget impact models (BIM). This study introduces the Health Economics Language Model Evaluative and Testing dataset (HELMET), designed to address this gap and advance AI applications for Health Economics and Outcomes Research (HEOR).
METHODS: HELMET comprises document-query-output triplets for three tasks: data extraction, evidence synthesis, and information labelling. A total of 728 CEM, 256 BIM, and 183 systematic literature review (SLR) publications across indications in oncology, immunology, rare diseases, and chronic conditions were identified and retrieved from PubMed. Full texts were used to construct the dataset, with queries generated by a prompt-based LLM pipeline (gpt-3.5-turbo with the llama-index library). Data-extraction queries were generated for individual sentences in abstracts and stored alongside abstract-masked documents. For evidence synthesis, schemas summarizing evidence scopes were created from SLR result tables and used to guide query development. Information-labelling queries categorized document sections by domain and subheading. Baseline performance was assessed using state-of-the-art LLMs with metrics such as query-text alignment and token-level analysis.
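The document-query-output triplet construction described above can be sketched as follows. The class and field names, the placeholder string, and the example content are illustrative assumptions for exposition only, not the authors' actual schema or pipeline:

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    """One benchmark item: source document, generated query, expected output.
    This is a hypothetical schema, not HELMET's actual data format."""
    document: str  # full text with the abstract masked out
    query: str     # LLM-generated question targeting one abstract sentence
    output: str    # the abstract sentence the query should recover

def mask_abstract(full_text: str, abstract: str) -> str:
    """Replace the abstract with a placeholder so a model answering the
    query cannot simply copy the target sentence from the document."""
    return full_text.replace(abstract, "[ABSTRACT MASKED]")

# Hypothetical example of building one data-extraction triplet
abstract_sentence = "The ICER was $42,000 per QALY gained."
paper = "Title...\n" + abstract_sentence + "\nMethods...\nResults..."
triplet = Triplet(
    document=mask_abstract(paper, abstract_sentence),
    query="What was the reported ICER per QALY gained?",
    output=abstract_sentence,
)
```

Masking the abstract forces evaluated models to locate the answer in the body text, which is what makes the triplet usable as an extraction benchmark rather than a copy task.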
RESULTS: HELMET contains 17,179 triplets for data extraction, 1,647 for evidence synthesis, and 18,980 for information labelling. Validation revealed a Pearson correlation of 0.92 between query length and abstract section length in the data-extraction set, confirming that queries were proportionally aligned with section content and length. Token analysis showed fewer missing tokens in queries than in outputs across the three datasets (89%, 82%, 78%), confirming comprehensive capture of context and alignment with LLM benchmarking standards.
CONCLUSIONS: HELMET provides a robust framework to evaluate and refine LLMs for HEOR applications, including evidence synthesis and economic modelling. By streamlining these processes, HELMET can support efficient decision-making in health economics, enhancing tools for researchers and developers.

Conference/Value in Health Info

2025-05, ISPOR 2025, Montréal, Quebec, CA

Value in Health, Volume 28, Issue S1

Code

P25

Topic

Health Technology Assessment

Disease

SDC: Diabetes/Endocrine/Metabolic Disorders (including obesity), SDC: Oncology, SDC: Rare & Orphan Diseases, SDC: Systemic Disorders/Conditions (Anesthesia, Auto-Immune Disorders (n.e.c.), Hematological Disorders (non-oncologic), Pain), STA: Multiple/Other Specialized Treatments
