Utilizing LLMs to Enhance Patient-Reported Outcome Measures: Application to the EQ-5D and Bolt-ons
Author(s)
Jan Heijdra Suasnabar, MSc.
Biomedical Data Science, Leiden University Medical Center, Leiden, Netherlands.
OBJECTIVES: Large language models (LLMs) have shown promising applications in healthcare, yet little is known about their potential to improve the measurement of patient-reported outcomes, which are central to informing decision-making across health system levels. We explored the use of LLMs to develop or extend patient-reported outcome measures (PROMs) based on patient-reported free-text data.
METHODS: The GPT-4o model was used to analyze data from 1,977 members of the Dutch Celiac Association who completed the EQ-5D-5L and narratively described the impact of celiac disease on their lives. Prompts to the LLM were designed to identify possible additional dimensions (i.e., ‘bolt-on’ dimensions) to the EQ-5D-5L, and to produce preliminary bolt-on item wordings for selected dimensions. Evaluation of the approach comprised: comparison of the LLM-identified dimensions with those identified by two alternative approaches (i.e., qualitative analysis and topic modelling); text-entry level agreement (i.e., Cohen’s Kappa) on identified dimensions; suitability of LLM-generated bolt-on wordings, rated against existing criteria using Likert scales; and a critical appraisal consisting of face validity assessments and a SWOT analysis.
RESULTS: The LLM identified 12 potential bolt-on dimensions to the EQ-5D-5L, of which 9 were also identified using qualitative analysis, and 5 using topic modelling. Text-entry level agreement between the LLM and qualitative approaches was ‘substantial’ or ‘almost perfect’ for all but two dimensions, which showed poor/fair agreement (median Kappa=0.70, IQR=0.44-0.89). The LLM-generated potential bolt-on wordings for the 4 most common dimensions (i.e., ‘Dietary restrictions’, ‘Fatigue’, ‘Social participation’, and ‘Gastrointestinal symptoms’) scored 4/5, 4.4/5, 4.3/5, and 4.2/5, respectively, when assessed against existing criteria.
CONCLUSIONS: This study demonstrates the potential of LLMs to inform the development or modification of PROMs based on patient-reported text data. The approach’s dependency on prompt quality limits its generalizability and reliability. Further research should assess the approach’s transferability across disease areas and data sources (e.g., social media, EHRs).
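ILLUSTRATIVE SKETCH: The text-entry level agreement reported above rests on Cohen’s Kappa, which corrects the observed agreement between two coders for the agreement expected by chance. The Python sketch below shows one way such a statistic could be computed for a single dimension. It is not the authors’ code. The function names, the binary coding (1 = dimension present in a text entry, 0 = absent), and the example data are all hypothetical. The verbal labels follow the commonly used Landis and Koch bands, which the abstract’s wording appears to match; that mapping is an assumption.

```python
def cohens_kappa(coder_a, coder_b):
    """Cohen's Kappa for two equal-length sequences of categorical codes."""
    if len(coder_a) != len(coder_b) or len(coder_a) == 0:
        raise ValueError("Both coders must rate the same non-empty set of entries.")
    n = len(coder_a)
    # Observed agreement: share of entries given the same code by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: for each code, the product of the two coders' marginal rates.
    codes = set(coder_a) | set(coder_b)
    expected = sum((list(coder_a).count(c) / n) * (list(coder_b).count(c) / n)
                   for c in codes)
    if expected == 1.0:
        # Both coders used one identical code throughout; Kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def agreement_label(kappa):
    """Verbal label using the Landis and Koch bands (assumed, not stated in the abstract)."""
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical codes for one dimension across ten text entries.
llm_codes = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
qualitative_codes = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

kappa = cohens_kappa(llm_codes, qualitative_codes)
print(round(kappa, 2), agreement_label(kappa))
```

In this made-up example the coders agree on 9 of 10 entries (observed agreement 0.90) and chance agreement is 0.54, so Kappa is (0.90 - 0.54) / (1 - 0.54), about 0.78. Repeating the calculation per dimension and summarizing the resulting values would give a median and IQR of the kind reported in the results.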
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
P61
Topic
Methodological & Statistical Research, Patient-Centered Research
Topic Subcategory
Instrument Development, Validation, & Translation, Patient-reported Outcomes & Quality of Life Outcomes
Disease
No Additional Disease & Conditions/Specialized Treatment Areas