CAN AI LEARN PATIENT PREFERENCES? PREDICTING PATIENT PREFERENCE HETEROGENEITY USING LARGE LANGUAGE MODELS

Author(s)

Tina Cheng1, Juan M. Gonzalez, PhD2, Matthew Engelhard, MD, PhD3, Shelby Reed, RPh, PhD2, Semra Ozdemir, PhD2;
1Duke University (Doctoral Student), Durham, NC, USA, 2Duke Clinical Research Institute, Durham, NC, USA, 3Duke University, Durham, NC, USA
OBJECTIVES: Prior work suggests that large language models (LLMs), such as GPT-4, can predict average patient choices with moderate accuracy (~70%) when trained on homogeneous preference data. It remains unclear whether LLMs can predict individual-level choices that reflect heterogeneity in preferences. This study examines the ability of GPT-4 to predict individual decisions when provided with information representing heterogeneous preference patterns.
METHODS: GPT-4 was evaluated using synthetic preference data derived from real patient responses to a discrete-choice experiment on kidney transplant acceptance. Synthetic data were generated from the original study’s mixed-logit model, with individual-level preference parameters drawn from estimated coefficient distributions. The final dataset mirrored the original study and included 605 patients, each completing six binary choice tasks comparing two kidney options varying by time with regular function, time with low function, and time to transplant. For each patient, one task was held out. GPT-4 was instructed to predict the held-out choice using 1, 3, or 5 prior tasks from the same individual and to report prediction confidence (0-100). GPT-4 also generated a summary of distinct patient preference types.
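The held-out evaluation protocol described above can be sketched in code. This is a minimal illustration, not the authors' actual pipeline: the task attributes (time with regular function, time with low function, time to transplant) come from the abstract, but the data structure, function names, and prompt wording are assumptions for illustration only.

```python
# Hypothetical sketch of the held-out-task protocol: show a synthetic
# patient's prior binary choice tasks as few-shot context, then ask the
# model to predict the held-out choice with a 0-100 confidence.
# Prompt wording and task schema are illustrative assumptions.

def format_task(task):
    """Render one binary kidney-offer choice task as plain text."""
    line = ("Option {label}: {regular} yrs regular function, "
            "{low} yrs low function, {wait} yrs wait to transplant")
    return (line.format(label="A", **task["option_a"]) + "\n" +
            line.format(label="B", **task["option_b"]))

def build_prompt(prior_tasks, held_out_task, n_context):
    """Few-shot prompt using n_context (e.g., 1, 3, or 5) prior tasks
    from the same individual, as in the study design."""
    parts = ["A kidney transplant patient answered these choice tasks:"]
    for t in prior_tasks[:n_context]:
        parts.append(format_task(t))
        parts.append("Patient chose: " + t["choice"])
    parts.append("Predict the patient's choice for this new task "
                 "(answer 'A' or 'B') and report a confidence from 0 to 100:")
    parts.append(format_task(held_out_task))
    return "\n\n".join(parts)
```

The resulting string would be sent to the model (e.g., via a chat-completion call); varying `n_context` over 1, 3, and 5 reproduces the three conditions compared in the study.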
RESULTS: The proportion of held-out tasks correctly predicted by GPT-4 across synthetic participants increased with additional prior tasks: 75.5% (95% CI: 72.0-78.8) using one task, 78.2% (95% CI: 74.7-81.3) using three tasks, and 80.0% (95% CI: 76.7-83.0) using five tasks. GPT-4's self-reported confidence increased from 75.6 with one task to approximately 79 with either three or five tasks. Early hallucinations (predicting outcomes for the wrong tasks) were observed but were resolved through iterative prompt refinement. GPT-4's generated summary aligned with the latent-class subgroup preference patterns identified in the original study.
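The reported intervals are consistent with Wilson score intervals for a binomial proportion over the 605 held-out tasks. As a check, the one-task result (75.5% corresponds to 457 of 605 correct; the count is inferred from the rounded percentage, not stated in the abstract) reproduces the reported 72.0-78.8 interval:

```python
import math

def wilson_ci(successes, n, z=1.959964):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# 457/605 correct (75.5%) with one prior task as context
lo, hi = wilson_ci(457, 605)
print(round(100 * lo, 1), round(100 * hi, 1))  # 72.0 78.8
```

The same calculation with 473/605 (78.2%) yields the reported 74.7-81.3 interval for the three-task condition.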
CONCLUSIONS: LLMs show promise in predicting individual patient preferences while accounting for preference heterogeneity. However, performance can be sensitive to prompt design and data structure. Further work is needed to assess robustness, calibration, and clinical applicability.

Conference/Value in Health Info

2026-05, ISPOR 2026, Philadelphia, PA, USA

Value in Health, Volume 29, Issue S6

Code

PCR151

Topic

Patient-Centered Research

Disease

SDC: Urinary/Kidney Disorders, STA: Surgery
