Comparative Analysis of Large Language Models for Extracting Patient-Reported Outcome Measures From Clinical Trial Protocols in Lymphoma
Author(s)
Attila Imre, PharmD1, Dalma Hosszú, MA2, Anna Bogos, BA2, Balázs Nagy, PhD1, Judit Józwiak-Hagymásy, MSc2, Tamas Agh, MSc, PhD, MD2, Ákos Bernard Józwiak, PhD2
1Semmelweis University, Center for Health Technology Assessment, Budapest, Hungary, 2Syreon Research Institute, Budapest, Hungary
OBJECTIVES: Patient-reported outcome measures (PROMs) are essential in clinical trials as they directly capture data on patients' experiences with their health conditions. Collecting information on the types of PROMs used in clinical trials faces challenges due to varying terminologies and fragmented data. The current study aimed to evaluate the performance of large language models (LLMs) in extracting PROMs from lymphoma clinical trial protocols in the ClinicalTrials.gov database, using a zero-shot approach, and to compare their performance against an expert-established reference standard.
METHODS: The outcome lists of a sample of 300 lymphoma clinical trial protocols were independently reviewed by domain experts to identify PROMs, establishing a gold standard. Three LLMs (gemma-2-9b-it, llama-3.3-70b, and gpt-4o-mini-2024-07-18) were then applied to the same dataset with a tailor-made prompt to extract PROMs. Accuracy, precision, and recall were calculated to evaluate the performance of the LLMs against the gold standard.
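The abstract does not publish the tailor-made prompt or the scoring convention, so the sketch below shows only one plausible shape of such a pipeline in Python. The prompt wording, the use of the OpenAI chat API for gpt-4o-mini, and the TP / (TP + FP + FN) definition of accuracy are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: the study's actual prompt, client code, and
# metric definitions are not published in this abstract.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ZERO_SHOT_PROMPT = (
    "You will receive the outcome measure list of a clinical trial protocol. "
    "Return every patient-reported outcome measure (PROM) it mentions, one "
    "standard instrument name per line (e.g., EQ-5D-5L, FACT-Lym). "
    "If none is present, return NONE."
)

def extract_proms(outcome_list: str,
                  model: str = "gpt-4o-mini-2024-07-18") -> set[str]:
    """Single zero-shot call; returns the extracted PROM names as a set."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic decoding aids reproducibility
        messages=[
            {"role": "system", "content": ZERO_SHOT_PROMPT},
            {"role": "user", "content": outcome_list},
        ],
    )
    answer = response.choices[0].message.content.strip()
    if answer.upper() == "NONE":
        return set()
    return {line.strip() for line in answer.splitlines() if line.strip()}

def micro_metrics(gold: list[set[str]],
                  predicted: list[set[str]]) -> dict[str, float]:
    """Micro-averaged scores over (protocol, PROM) pairs.

    The abstract does not define 'accuracy' for set-valued extraction;
    TP / (TP + FP + FN) is one common convention and is assumed here.
    """
    tp = sum(len(g & p) for g, p in zip(gold, predicted))
    fp = sum(len(p - g) for g, p in zip(gold, predicted))
    fn = sum(len(g - p) for g, p in zip(gold, predicted))
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": tp / (tp + fp + fn),
    }
```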
RESULTS: Among the three LLMs, llama-3.3-70b demonstrated the highest performance, achieving an accuracy of 73.36%, a precision of 90.48%, and a recall of 79.50%. The gemma-2-9b-it model showed moderate performance (63.25% accuracy, 80.27% precision, and 74.90% recall), while gpt-4o-mini-2024-07-18 had the lowest accuracy (62.36%) but maintained a relatively high precision of 87.23% and a recall of 68.62%. Notably, all models made characteristic mistakes (e.g., confusing EQ-5D versions, or including FACT-G when it was part of another PROM), which can be addressed through standard downstream data curation steps.
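The two characteristic error patterns named above lend themselves to rule-based post-processing. The sketch below shows what such curation might look like; the alias table, the mapping of the bare "EQ-5D" label to a specific version, and the subsumption rule are assumptions for illustration, as the authors' curation steps are not described.

```python
# Illustrative post-hoc curation for the two error types named in the
# results; the alias table and subsumption rules are assumptions, not the
# authors' pipeline.

def _norm(name: str) -> str:
    """Case/spacing/hyphen-insensitive key for alias lookup."""
    return name.upper().replace(" ", "").replace("-", "")

# Canonical names for frequently confused variants (mapping the bare
# "EQ-5D" label to the 3L version is an arbitrary illustrative choice).
ALIASES = {
    "EQ5D": "EQ-5D-3L",
    "EQ5D5L": "EQ-5D-5L",
    "EUROQOL5D": "EQ-5D-3L",
}

# FACT-G is the generic core of the FACT family, so it should not be
# counted separately when a composite such as FACT-Lym is also listed.
SUBSUMED_BY = {"FACT-G": {"FACT-Lym"}}

def curate(proms: set[str]) -> set[str]:
    """Canonicalize names, then drop PROMs embedded in a listed composite."""
    canonical = {ALIASES.get(_norm(p), p) for p in proms}
    return {p for p in canonical
            if not (p in SUBSUMED_BY and SUBSUMED_BY[p] & canonical)}

print(curate({"EQ-5D", "FACT-G", "FACT-Lym"}))  # {'EQ-5D-3L', 'FACT-Lym'}
```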
CONCLUSIONS: LLMs, notably llama-3.3-70b, can serve as effective tools for identifying PROMs in lymphoma clinical trial protocols, potentially reducing the manual effort required for systematic evidence synthesis. Beyond direct extraction, these tools can support live evidence management of PROMs, help generate synthetic training data for simpler models, and facilitate automated pipelines for tracking PROM usage trends in clinical research. These findings establish a strong baseline for further studies to enhance LLM performance.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
P36
Topic
Patient-Centered Research
Topic Subcategory
Health State Utilities, Patient-reported Outcomes & Quality of Life Outcomes
Disease
No Additional Disease & Conditions/Specialized Treatment Areas, SDC: Oncology