Validating a Machine-Learning Approach to Cancer Stage Identification Using Medicare Claims and SEER Data

Speaker(s)

Smith R1, Miller-Wilson LA2, Ho N2, Cuyun Carter G2, Fayyaz I1, Pope A1, Pelizzari P1, Pyenson B1
1Milliman, Inc., New York, NY, USA, 2Exact Sciences Corporation, Madison, WI, USA

OBJECTIVES: Administrative claims data can provide information about real-world costs, treatments, and mortality for millions of patients with cancer, but claims’ diagnosis codes lack stage information, limiting research applications. Accurate assignment of cancer stage at diagnosis through claims would expand population research capabilities. This work aimed to build and validate a predictive machine-learning algorithm to assign patients’ cancer stage at diagnosis using claims data.

METHODS: Patients with incident non-small cell lung cancer (NSCLC), colon cancer (CC), or rectal cancer (RC) diagnosed in 2016-2017 were identified in the SEER-Medicare data. Patients were excluded if they had <1 month of Medicare Parts A/B/D enrollment in 2016-2017, <12 months of A/B/D enrollment prior to diagnosis, cancer-related treatment within one year of index, or prior cancer diagnoses. Patients’ claims were flagged for evidence, frequency, and timing of cancer-related surgeries, anti-cancer therapies, radiation therapy, hospice, and death. These flags, together with demographics, frailty-related diagnoses, and nursing home residence, were tested as predictors of patients’ SEER-derived AJCC stage for each cancer type. Analysis was conducted in R Statistical Software (v4.1.2; R Core Team 2021) using predictive multinomial logistic regression (nnet package; Venables and Ripley 2002). A separate model was trained for each cancer type on 70% of that cancer’s sample and tested on the remaining 30%.
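The modeling step described above (claims-derived flags as predictors, a multinomial logistic regression, and a 70/30 train/test split) can be illustrated with a minimal sketch. This is not the authors' code: the study used R's nnet package, whereas the sketch below is a plain-Python softmax regression fitted by gradient descent. The three binary flags (surgery, systemic therapy, hospice), the three stage groups, the noise level, and all function names are hypothetical and the data are synthetic.

```python
import math
import random

# Hypothetical "typical" flag patterns (surgery, systemic therapy, hospice)
# for three illustrative stage groups: 0 = early, 1 = regional, 2 = metastatic.
STAGE_PATTERNS = {0: (1, 0, 0), 1: (1, 1, 0), 2: (0, 1, 1)}

def make_toy_rows(n_per_class, seed=1, noise=0.1):
    """Build synthetic (flags, stage) rows; each flag flips with probability `noise`."""
    rng = random.Random(seed)
    rows = []
    for label, pattern in STAGE_PATTERNS.items():
        for _ in range(n_per_class):
            flags = tuple(f if rng.random() >= noise else 1 - f for f in pattern)
            rows.append((flags, label))
    return rows

def split_70_30(rows, seed=0):
    """Shuffle the rows and return (train, test) in a 70/30 split."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fit_multinomial(rows, n_classes, epochs=300, lr=0.5):
    """Fit softmax (multinomial logistic) regression by batch gradient descent."""
    n_features = len(rows[0][0]) + 1  # +1 for the intercept
    weights = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(epochs):
        grads = [[0.0] * n_features for _ in range(n_classes)]
        for features, label in rows:
            x = [1.0] + list(features)
            probs = softmax([sum(w * v for w, v in zip(ws, x)) for ws in weights])
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == label else 0.0)
                for j in range(n_features):
                    grads[k][j] += err * x[j]
        for k in range(n_classes):
            for j in range(n_features):
                weights[k][j] -= lr * grads[k][j] / len(rows)
    return weights

def predict(weights, features):
    """Return the class with the highest linear score."""
    x = [1.0] + list(features)
    scores = [sum(w * v for w, v in zip(ws, x)) for ws in weights]
    return max(range(len(scores)), key=scores.__getitem__)

rows = make_toy_rows(n_per_class=100)
train, test = split_70_30(rows)
weights = fit_multinomial(train, n_classes=3)
accuracy = sum(predict(weights, f) == y for f, y in test) / len(test)
print(f"held-out accuracy on synthetic data: {accuracy:.3f}")
```

In the study, one such model would be fitted per cancer type; the sketch covers a single sample only, and its accuracy says nothing about the reported results.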

RESULTS: CC staging accuracy was 82.3% [CI 80.6%-83.9%] (by stage – 0/1/2A/2B: 86%, 2C/3: 72%, 4: 89%; n = 7,145). NSCLC staging accuracy was 77.5% [CI 76.2%-78.8%] (by stage – 0/1/2: 78%, 3: 64%, 4: 82%; n = 13,494). RC staging accuracy was 67.4% [CI 62.7%-71.8%] (by stage – 0/1/2A/2B: 68%, 2C/3: 61%, 4: 79%; n = 1,424). Models most accurately identified patients with stage 4 disease.
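As an illustration of how figures of this kind (overall accuracy with a confidence interval, plus accuracy within each true stage group) can be computed from test-set predictions, here is a minimal sketch. The abstract does not state the interval method or confidence level; a 95% Wilson score interval is assumed here purely for illustration. The function names, stage labels, and counts are hypothetical and are not taken from the study.

```python
import math

def accuracy_by_stage(actual, predicted):
    """Share of patients in each true stage group whose stage was predicted correctly."""
    totals, hits = {}, {}
    for a, p in zip(actual, predicted):
        totals[a] = totals.get(a, 0) + 1
        hits[a] = hits.get(a, 0) + (1 if a == p else 0)
    return {stage: hits[stage] / totals[stage] for stage in totals}

def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 assumes a 95% level)."""
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical test-set labels, not study data.
actual = ["4", "4", "3", "0-2", "0-2", "3"]
predicted = ["4", "4", "0-2", "0-2", "0-2", "3"]
by_stage = accuracy_by_stage(actual, predicted)

# Hypothetical counts: 823 correct predictions out of 1,000 test patients.
low, high = wilson_interval(correct=823, n=1000)
print(by_stage, round(low, 3), round(high, 3))
```

The interval narrows as the test sample grows, which is consistent with the rectal cancer model, the smallest sample, having the widest reported interval.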

CONCLUSIONS: Multinomial logistic regression using administrative claims data represents a useful approach to cancer staging, yielding superior results compared with previously published algorithms. Machine-learning algorithms may be a viable way to assign patients’ cancer stage at diagnosis using claims data alone.

Code

MSR32

Topic

Methodological & Statistical Research, Study Approaches

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas