Imputing Breast Cancer Stage in a Large EHR Dataset: Light Gradient Boosting Machine Algorithm and Explainable Artificial Intelligence

Author(s)

Julia A. O'Rourke, PhD1, Jin Yu, PhD2, Ellen Stein, MS1, Zuzanna Drebert, PhD1, Marley Boyd, Jr., MS1, Mike Temple, MD1, E. Susan Amirian, PhD1.
1TriNetX, LLC., Cambridge, MA, USA, 2Computer Science, Northeastern University, Boston, MA, USA.

OBJECTIVES: Electronic health record (EHR) data are often missing cancer staging. Advanced machine learning builds accurate but uninterpretable models; explainable AI deciphers the logic behind these models. In this study, Light Gradient Boosting Machine (LightGBM) was used to impute breast cancer stage at initial diagnosis, and SHAP (SHapley Additive exPlanations) was used to explore feature importance in the underlying model.
METHODS: TriNetX harmonizes de-identified patient data from 69 US healthcare organizations, most of which (~75%) are academic medical centers. Using structured EHR data, female breast cancer patients with index dates (stage/diagnosis date) from 2000 to the present who had at least 10 encounters in the window from 1 month before to 8 months after the index date were examined. The LightGBM model was trained on >400 features covering demographics, diagnoses, procedures, medications, and sentence embeddings (extracted from the code-based notes).
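The inclusion rule described in METHODS can be sketched as a small filter. This is an illustration only, not the study's code: the function name `is_eligible`, the constants, and the approximation of "1 month" and "8 months" as 30 and 240 days are all assumptions, and the example dates are made up.

```python
from datetime import date, timedelta

# Assumptions (not stated in the abstract): months are approximated as 30 days,
# and "2000 to present" means an index date on or after 2000-01-01.
WINDOW_BEFORE = timedelta(days=30)    # ~1 month before index
WINDOW_AFTER = timedelta(days=240)    # ~8 months after index
MIN_ENCOUNTERS = 10
EARLIEST_INDEX = date(2000, 1, 1)


def is_eligible(index_date, encounter_dates):
    """Return True if the patient meets the encounter-density criterion."""
    if index_date < EARLIEST_INDEX:
        return False
    start = index_date - WINDOW_BEFORE
    end = index_date + WINDOW_AFTER
    # Count only encounters that fall inside the window around the index date.
    in_window = [d for d in encounter_dates if start <= d <= end]
    return len(in_window) >= MIN_ENCOUNTERS


# Hypothetical patients: one with weekly encounters, one with sparse follow-up.
index_date = date(2015, 6, 1)
dense_encounters = [index_date + timedelta(days=7 * k) for k in range(-2, 10)]
sparse_encounters = [index_date + timedelta(days=60 * k) for k in range(10)]

dense_ok = is_eligible(index_date, dense_encounters)
sparse_ok = is_eligible(index_date, sparse_encounters)
```

In this sketch the first patient has 12 encounters inside the window and passes, while the second has only 5 inside the window and is excluded even though 10 encounters exist overall.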
RESULTS: De-identified data from 460,616 women with breast cancer were used (30,473 with documented stage). Age and race distributions of patients with known stage differed from those with undocumented stage. Patients with known stage were younger (mean 58.8 vs. 61.5 years) and included a greater proportion of Black women (19.2% vs. 9.8%) and a lower proportion of White women (67.1% vs. 72.1%). After optimization, the predictive model reached 88% accuracy (the base model had 67% accuracy). SHAP analysis revealed that the LightGBM model assigned patients to higher stages based on the presence of increasingly complex diagnostic and treatment codes: in situ carcinoma diagnosis and the absence of complex interventions (stage 0); partial mastectomies, sentinel node biopsies, and receptor status testing (stage 1); systemic therapies, continuing nodal testing, and more diverse imaging (stage 2); secondary lymph node involvement, additional imaging complexity, and more systemic treatments (stage 3); extensive imaging, testing, and greater procedure complexity (stage 4).
CONCLUSIONS: SHAP analysis revealed the expected diagnostic and treatment patterns. Advanced ML algorithms can impute missing stage at diagnosis with acceptable accuracy.
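
To make the idea behind SHAP concrete, the sketch below computes exact Shapley values by brute force for a toy scoring function. It is not the study's model and does not use the `shap` library or LightGBM (in practice SHAP uses a faster tree-specific algorithm for boosted trees). The feature names, the `toy_score` function, and its coefficients are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley attributions of predict(instance) relative to predict(baseline)."""
    features = list(instance)
    n = len(features)

    def value(subset):
        # Features in `subset` take the instance value; the rest stay at baseline.
        x = {f: (instance[f] if f in subset else baseline[f]) for f in features}
        return predict(x)

    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_score(x):
    # Hypothetical "higher stage" score with one interaction term.
    return (1.0 * x["systemic_therapy"]
            + 0.5 * x["nodal_biopsy"]
            + 2.0 * x["systemic_therapy"] * x["advanced_imaging"])


patient = {"systemic_therapy": 1, "nodal_biopsy": 1, "advanced_imaging": 1}
reference = {"systemic_therapy": 0, "nodal_biopsy": 0, "advanced_imaging": 0}
phi = shapley_values(toy_score, patient, reference)
```

The attributions sum to the difference between the patient's score and the reference score, which is the additivity property that lets SHAP values be read as per-feature contributions to a prediction. The interaction term is split evenly between the two features involved.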

Conference/Value in Health Info

2025-05, ISPOR 2025, Montréal, Quebec, CA

Value in Health, Volume 28, Issue S1

Code

MSR81

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Missing Data

Disease

No Additional Disease & Conditions/Specialized Treatment Areas, SDC: Oncology