FROM REAL-WORLD DATA (RWD) TO DIGITAL TWINS: BUILDING MODELS FOR PATIENT-LEVEL COUNTERFACTUAL PREDICTION IN ONCOLOGY

Author(s)

Sandra Griffith, PhD, Joe Manfredonia, ME, Marcello Ricottone, MA, Richard Knoche, PhD, Aaron B. Cohen, MD, MSCE, Jacqueline Law, PhD, Melissa Estevez, MS;
Flatiron Health, New York, NY, USA
OBJECTIVES: Digital twins (DT), models capable of generating patient-level counterfactual predictions, may improve oncology drug development success rates by informing trial design, contextualizing results, and enhancing statistical power. Training generalizable outcome-prediction models requires large, diverse, and longitudinally rich data, as well as an understanding of methodological approaches. This study leverages the depth and scale of real-world data (RWD) to evaluate the feasibility of DT models and compare performance across approaches.
METHODS: This retrospective study used the Flatiron Health Research Database. Features were generated from demographics and from structured and ML/LLM-extracted clinical variables (e.g., Charlson comorbidity index [CCI] and sites of metastases [SOM]). We trained four models (penalized pooled logistic regression [LR], XGBoost [XGB], Multi-layer Perceptron, and Graph Attention Network [GAT]) to predict real-world overall survival (rwOS) in patients with Stage IV non-small cell lung cancer initiating first-line platinum chemotherapy between 2011 and 2016, a period contemporaneous with chemotherapy use in trials. Model performance was assessed using C-index, time-dependent area under the curve (AUC(t)), integrated Brier score (IBS), mean absolute difference (MAD) between predicted and observed survival curves, and median rwOS. Calibration was assessed in clinically important subpopulations.
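The abstract does not say how the pooled logistic regression was set up. One common formulation, shown here only as a minimal sketch under that assumption, expands each patient into one row per follow-up interval (a "person-period" table), fits a logistic model for the per-interval hazard, and multiplies the complements of the predicted hazards to obtain a survival curve. The function names, the one-month interval, and the row layout are hypothetical and not taken from the study.

```python
import math
from typing import Dict, List, Sequence


def to_person_period(patient_id: str, follow_up_months: float, died: bool,
                     interval: float = 1.0) -> List[Dict[str, object]]:
    """Expand one patient into one row per interval at risk (hypothetical layout).

    The event flag is 1 only in the final interval, and only if the patient died;
    censored patients contribute rows with event = 0 throughout.
    """
    n_intervals = max(1, math.ceil(follow_up_months / interval))
    rows = []
    for k in range(n_intervals):
        is_last = k == n_intervals - 1
        rows.append({
            "patient_id": patient_id,
            "interval": k,
            "event": int(died and is_last),
        })
    return rows


def survival_from_hazards(hazards: Sequence[float]) -> List[float]:
    """Convert per-interval hazards h_k into S(k) = prod_{j <= k} (1 - h_j)."""
    survival, s = [], 1.0
    for h in hazards:
        s *= 1.0 - h
        survival.append(s)
    return survival


# Illustrative, made-up patients: one death at 2.5 months, one censored at 2 months.
rows_death = to_person_period("p1", 2.5, died=True)
rows_censored = to_person_period("p2", 2.0, died=False)
curve = survival_from_hazards([0.1, 0.2])
```

A penalized logistic regression fitted on the stacked rows, with the interval index and patient features as covariates, would then supply the hazards passed to `survival_from_hazards`.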
RESULTS: Training (n=12,088) and test (n=3,964) cohorts were selected. All four methods performed well (C-index 0.66-0.70; AUC(12) 0.70-0.75; IBS 0.14-0.15; MAD 0.5-2.1%), comparable to published results using different methods. LR/XGB demonstrated better discrimination, while GAT exhibited superior calibration. Predicted median rwOS aligned with observed median rwOS (9-10 vs 10 months). Results were robust across subgroups. Top features varied by method and included ECOG, SOM, CCI, and labs (e.g., albumin and creatinine).
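For readers less familiar with the reported metrics, the sketch below illustrates two of them on made-up data: Harrell's C-index for discrimination, and a MAD between predicted and observed survival curves. The abstract does not define MAD precisely, so the code assumes it is the absolute difference between the cohort-average predicted curve and a Kaplan-Meier estimate, averaged over a time grid. All function names and inputs are hypothetical, and this is not the authors' implementation.

```python
from typing import List, Sequence


def harrell_c_index(times: Sequence[float], events: Sequence[int],
                    risk_scores: Sequence[float]) -> float:
    """Fraction of comparable pairs in which the earlier death has the higher risk."""
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a pair is comparable only if the shorter time is a death
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def kaplan_meier(times: Sequence[float], events: Sequence[int],
                 grid: Sequence[float]) -> List[float]:
    """Kaplan-Meier survival estimate evaluated at each time point in grid."""
    event_times = sorted({t for t, e in zip(times, events) if e})
    curve = []
    for g in grid:
        s = 1.0
        for t in event_times:
            if t > g:
                break
            at_risk = sum(1 for u in times if u >= t)
            deaths = sum(1 for u, e in zip(times, events) if u == t and e)
            s *= 1.0 - deaths / at_risk
        curve.append(s)
    return curve


def mean_absolute_difference(predicted_curves: Sequence[Sequence[float]],
                             observed_curve: Sequence[float]) -> float:
    """Assumed MAD: mean over the grid of |average predicted S(t) - observed S(t)|."""
    n = len(predicted_curves)
    mean_pred = [sum(c[k] for c in predicted_curves) / n
                 for k in range(len(observed_curve))]
    return sum(abs(p - o) for p, o in zip(mean_pred, observed_curve)) / len(observed_curve)


# Made-up toy cohort: follow-up in months, 1 = death observed, 0 = censored.
times, events = [2, 4, 6, 8], [1, 1, 0, 1]
c_index = harrell_c_index(times, events, risk_scores=[0.9, 0.7, 0.5, 0.2])
observed = kaplan_meier(times, events, grid=[1, 2, 4, 8])
mad = mean_absolute_difference(
    [[1.0, 0.8, 0.6, 0.1], [1.0, 0.7, 0.4, 0.1]], observed)
```

In this framing, a low MAD indicates good cohort-level calibration even when the C-index, which depends only on the ranking of patients, is moderate. That is one way the differing strengths of LR/XGB and GAT could coexist.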
CONCLUSIONS: DT models performed well across methodological approaches. Understanding their relative strengths and limitations while integrating rich LLM-extracted clinical features can help accelerate clinical research and drug development. External validation of DT models is ongoing.

Conference/Value in Health Info

2026-05, ISPOR 2026, Philadelphia, PA, USA

Value in Health, Volume 29, Issue S6

Code

MSR219

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

SDC: Oncology, STA: Personalized & Precision Medicine
