Advancing the Science of Qualitative Patient Preference Assessment Using Large Language Models
Author(s)
Ted Grover, PhD1, Emanuel Krebs, MA1, Deirdre Weymann, MA1, Morgan Ehman, MA1, Dean Regier, PhD2.
1Regulatory Science Lab, BC Cancer Research Institute, Vancouver, BC, Canada, 2School of Population and Public Health, University of British Columbia, Vancouver, BC, Canada.
OBJECTIVES: Qualitative patient preference assessment (PPA) often uses thematic analysis to develop themes from interview or focus group transcripts. Large language models (LLMs) show initial promise for performing inductive thematic analysis of qualitative healthcare data, yet no empirical studies have investigated the use of LLMs to facilitate qualitative PPA. We developed multiple LLM prompt frameworks for thematic analysis and evaluated the similarity of LLM-generated themes against human-analyzed themes within a qualitative PPA study context.
METHODS: We customized the open-source Hermes-3-Llama-3.1-70B LLM to perform inductive thematic analysis on focus group transcripts from a previously published qualitative PPA study using three SmartGPT prompt frameworks. We evaluated LLM-generated themes against human-analyzed themes using the Sentence-T5-XXL language embedding model. Sentence-level theme similarity was assessed using Jaccard similarity coefficients (0-1 range), retaining only comparisons that met an empirically determined cosine similarity threshold to ensure semantic validity. We further evaluated LLM themes for similarity in lexical diversity and reading grade-level metrics and benchmarked semantic similarity results against published similarity thresholds previously applied to qualitative healthcare data.
RESULTS: All prompt frameworks generated themes with Jaccard similarity coefficients against human-analyzed themes ranging from 0.46 to 0.64, indicating moderate to strong semantic overlap. Our best-performing framework, which instructed the LLM to pursue thematic saturation, scored closest to human-analyzed themes on all reading grade-level metrics and improved semantic similarity by 12% compared to published benchmarks. Our worst-performing framework produced themes with moderate semantic overlap and hallucinated findings not identified in the human-analyzed themes.
CONCLUSIONS: LLMs can perform inductive thematic analysis of qualitative patient preference data, producing themes substantively similar in content and style to human-analyzed themes when augmented with sufficient domain-specific context. While LLMs may augment thematic analysis, the contextual nature of qualitative analysis remains a challenge requiring collaborative LLM frameworks integrating human expertise. Our work can inform best practices for LLM use in qualitative PPA to improve healthcare decision-making.
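To make the comparison pipeline in METHODS concrete, the snippet below is a minimal Python sketch of the embedding-based theme comparison. It assumes the sentence-transformers library and the "sentence-transformers/sentence-t5-xxl" checkpoint; the example sentences, the 0.75 cosine threshold, and the matching rule are illustrative placeholders rather than the study's exact procedure or empirically determined values.

```python
# A minimal, hypothetical sketch of the theme-comparison step (not the study's released code).
from sentence_transformers import SentenceTransformer, util

# Embedding model family named in METHODS.
model = SentenceTransformer("sentence-transformers/sentence-t5-xxl")

# Placeholder theme sentences; real inputs would be sentence-split theme summaries.
llm_theme_sentences = [
    "Participants preferred treatments requiring fewer clinic visits.",
    "Cost was a secondary concern relative to side effects.",
]
human_theme_sentences = [
    "Patients favoured regimens with fewer hospital visits.",
    "Side-effect burden outweighed cost in participants' decisions.",
]

COSINE_THRESHOLD = 0.75  # placeholder; the study used an empirically determined cutoff

# Embed both sets of sentences and compute pairwise cosine similarities.
emb_llm = model.encode(llm_theme_sentences, convert_to_tensor=True)
emb_human = model.encode(human_theme_sentences, convert_to_tensor=True)
cosine_matrix = util.cos_sim(emb_llm, emb_human)

# Retain only LLM sentences whose best human match clears the threshold,
# then compute a Jaccard-style coefficient: matched sentences over the union.
matched = sum(
    1
    for i in range(len(llm_theme_sentences))
    if cosine_matrix[i].max().item() >= COSINE_THRESHOLD
)
union = len(llm_theme_sentences) + len(human_theme_sentences) - matched
jaccard = matched / union
print(f"Sentence-level Jaccard similarity: {jaccard:.2f}")
```

Treating threshold-matched sentences as the intersection and all unique sentences as the union keeps the coefficient on the reported 0-1 scale; the abstract does not specify the exact matching rule, so this pairing is one plausible reading rather than the authors' definitive method.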
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
MSR16
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas