FROM REAL-WORLD DATA (RWD) TO DIGITAL TWINS: BUILDING MODELS FOR PATIENT-LEVEL COUNTERFACTUAL PREDICTION IN ONCOLOGY

Author(s)

Sandra Griffith, PhD, Joe Manfredonia, ME, Marcello Ricottone, MA, Richard Knoche, PhD, Aaron B. Cohen, MD, MSCE, Jacqueline Law, PhD, Melissa Estevez, MS.
Flatiron Health, New York, NY, USA.

OBJECTIVES: Digital twins (DT), models that generate patient-level counterfactual predictions, may improve oncology drug development success rates by informing trial design, contextualizing results, and enhancing statistical power. Training generalizable outcome-prediction models requires large, diverse, longitudinally rich data and an understanding of the available methodological approaches. This study leverages the depth and scale of real-world data (RWD) to evaluate the feasibility of DT models and to compare performance across modeling approaches.
METHODS: This retrospective study used the Flatiron Health Research Database. Features were generated from demographics and from structured and ML/LLM-extracted clinical variables (e.g., Charlson comorbidity index [CCI] and sites of metastases [SOM]). We trained four models (penalized pooled logistic regression [LR], XGBoost [XGB], multilayer perceptron [MLP], and graph attention network [GAT]) to predict real-world overall survival (rwOS) in patients with stage IV non-small cell lung cancer who initiated first-line platinum chemotherapy between 2011 and 2016, a period contemporaneous with chemotherapy use in trials. Model performance was assessed using the C-index, time-dependent area under the curve (AUC(t)), integrated Brier score (IBS), mean absolute difference (MAD) between predicted and observed survival curves, and agreement between predicted and observed median rwOS. Calibration was assessed in clinically important subpopulations.
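As an illustration of two of the metrics named above, the following is a minimal sketch, not the authors' implementation. It assumes Harrell's formulation of the C-index (ignoring tied event times) and assumes that MAD is the average absolute gap between the cohort-mean predicted survival curve and the Kaplan-Meier curve on a shared time grid; the abstract does not specify either detail. All function names are hypothetical.

```python
import numpy as np


def harrell_c_index(time, event, risk):
    """Share of comparable pairs in which the patient who died first
    also received the higher predicted risk (ties in risk count 0.5)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    risk = np.asarray(risk, dtype=float)
    concordant, comparable = 0.0, 0
    for i in range(len(time)):
        if not event[i]:
            continue  # a comparable pair must be anchored on an observed death
        for j in range(len(time)):
            if time[j] > time[i]:  # patient j was still alive when i died
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable


def kaplan_meier(time, event, grid):
    """Kaplan-Meier survival estimate evaluated at each time in `grid`."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    surv = np.ones(len(grid))
    for k, t in enumerate(grid):
        s = 1.0
        for u in np.unique(time[event & (time <= t)]):
            at_risk = np.sum(time >= u)
            deaths = np.sum(event & (time == u))
            s *= 1.0 - deaths / at_risk
        surv[k] = s
    return surv


def mean_abs_difference(pred_surv, time, event, grid):
    """ASSUMED definition of MAD: mean over `grid` of the absolute gap between
    the cohort-average predicted survival curve and the Kaplan-Meier curve.
    `pred_surv` has one row per patient and one column per grid time."""
    mean_pred = np.asarray(pred_surv, dtype=float).mean(axis=0)
    observed = kaplan_meier(time, event, grid)
    return float(np.mean(np.abs(mean_pred - observed)))
```

Here `risk` stands for any model output that increases with hazard (for example, one minus predicted survival at a fixed horizon). A production analysis would more likely rely on an established survival-analysis library than on these quadratic-time loops.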
RESULTS: Training (n=12,088) and test (n=3,964) cohorts were selected. All four methods performed well (C-index 0.66-0.70; AUC(12) 0.70-0.75; IBS 0.14-0.15; MAD 0.5-2.1%), comparable to published results using different methods. LR and XGB demonstrated better discrimination, while GAT exhibited superior calibration. Predicted median rwOS aligned with observed median rwOS (9-10 vs 10 months). Results were robust across subgroups. Top features varied by method and included ECOG performance status, SOM, CCI, and laboratory values (e.g., albumin and creatinine).
CONCLUSIONS: DT models performed well across methodological approaches. Understanding their relative strengths and limitations while integrating rich LLM-extracted clinical features can help accelerate clinical research and drug development. External validation of DT models is ongoing.

Conference/Value in Health Info

2026-05, ISPOR 2026, Philadelphia, PA, USA

Value in Health, Volume 29, Issue S6

Code

MSR219

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

SDC: Oncology, STA: Personalized & Precision Medicine

Presentation (CTI)