Validating a Machine-Learning Approach to Cancer Stage Identification Using Medicare Claims and SEER Data

Author(s)

Smith R1, Miller-Wilson LA2, Ho N2, Cuyun Carter G2, Fayyaz I1, Pope A1, Pelizzari P1, Pyenson B1
1Milliman, Inc., New York, NY, USA, 2Exact Sciences Corporation, Madison, WI, USA

OBJECTIVES: Administrative claims data can provide information about real-world costs, treatments, and mortality for millions of patients with cancer, but claims’ diagnosis codes lack stage information, limiting research applications. Accurate assignment of cancer stage at diagnosis through claims would expand population research capabilities. This work aimed to build and validate a predictive machine-learning algorithm to assign patients’ cancer stage at diagnosis using claims data.

METHODS: Patients with incident non-small cell lung cancer (NSCLC), colon cancer (CC), or rectal cancer (RC) diagnosed in 2016-2017 were identified using SEER-Medicare data. Patients with <1 month of Medicare Parts A/B/D enrollment in 2016-2017, <12 months of A/B/D enrollment prior to diagnosis, cancer-related treatment within one year of index, or prior cancer diagnoses were excluded. Patients’ claims were flagged for evidence, frequency, and timing of cancer-related surgeries, anti-cancer therapies, radiation therapy, hospice, and death. These flags, together with demographics, frailty-related diagnoses, and nursing home residence, were tested as predictors of patients’ SEER-derived AJCC stage for each cancer type. Analysis was conducted with R Statistical Software (v4.1.2; R Core Team 2021) using predictive multinomial logistic regression (nnet package; Venables and Ripley 2002). A separate model was trained for each cancer type on 70% of that cancer's sample and tested on the remaining 30%.
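
As an illustration only, the 70/30 split and multinomial logistic regression described above can be sketched in a few lines. This is not the authors' code: the study used R with the nnet package, whereas this sketch uses plain Python with a hand-written softmax regression fitted by gradient descent. The three predictors (surgery, systemic therapy, hospice), the three stage groups, the synthetic probabilities, and all function names are hypothetical and chosen only to make the example self-contained.

```python
import math
import random

def make_toy_patients(n=300, seed=1):
    """Generate synthetic (flags, stage_group) rows. All probabilities are invented."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        stage = rng.randrange(3)  # 0 = early, 1 = regional, 2 = stage 4 (hypothetical grouping)
        surgery = 1.0 if rng.random() < (0.9, 0.8, 0.2)[stage] else 0.0
        systemic = 1.0 if rng.random() < (0.1, 0.7, 0.6)[stage] else 0.0
        hospice = 1.0 if rng.random() < (0.02, 0.1, 0.6)[stage] else 0.0
        rows.append(((surgery, systemic, hospice), stage))
    return rows

def split_70_30(rows, seed=0):
    """Shuffle a copy of rows and return (train, test) with 70% of rows in train."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def softmax(scores):
    """Convert raw class scores to probabilities (shifted by the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(weights, x):
    """Linear score per class; the last weight in each row is the intercept."""
    xb = list(x) + [1.0]
    return [sum(wj * xj for wj, xj in zip(wk, xb)) for wk in weights]

def fit_multinomial(train, n_classes, n_features, lr=0.5, epochs=300):
    """Fit softmax (multinomial logistic) regression by batch gradient descent."""
    weights = [[0.0] * (n_features + 1) for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * (n_features + 1) for _ in range(n_classes)]
        for x, y in train:
            xb = list(x) + [1.0]
            probs = softmax(class_scores(weights, x))
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == y else 0.0)
                for j, xj in enumerate(xb):
                    grad[k][j] += err * xj
        for k in range(n_classes):
            for j in range(n_features + 1):
                weights[k][j] -= lr * grad[k][j] / len(train)
    return weights

def predict(weights, x):
    """Return the class index with the highest score."""
    scores = class_scores(weights, x)
    return max(range(len(scores)), key=lambda k: scores[k])
```

In the study's design, a split and fit of this kind would be repeated independently for each cancer type, so that each cancer has its own coefficients.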

RESULTS: CC staging accuracy was 82.3% [CI 80.6%-83.9%] (by stage – 0/1/2A/2B: 86%, 2C/3: 72%, 4: 89%; n = 7,145). NSCLC staging accuracy was 77.5% [CI 76.2%-78.8%] (by stage – 0/1/2: 78%, 3: 64%, 4: 82%; n = 13,494). RC staging accuracy was 67.4% [CI 62.7%-71.8%] (by stage – 0/1/2A/2B: 68%, 2C/3: 61%, 4: 79%; n = 1,424). Models most accurately identified patients with stage 4 disease.
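
For readers who want to reproduce figures of this kind on their own data, the following Python sketch computes overall accuracy, accuracy by stage, and a 95% confidence interval. It rests on two assumptions not stated in the abstract: that "by stage" accuracy means the share of patients with a given true stage who were assigned that stage, and that a Wilson score interval is an acceptable stand-in for whatever interval method the authors used. The function names are hypothetical.

```python
import math

def overall_accuracy(actual, predicted):
    """Share of patients whose predicted stage equals the reference stage."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / len(actual)

def accuracy_by_stage(actual, predicted):
    """Per-stage accuracy: among patients with each true stage, share predicted correctly."""
    totals, hits = {}, {}
    for a, p in zip(actual, predicted):
        totals[a] = totals.get(a, 0) + 1
        hits[a] = hits.get(a, 0) + (1 if a == p else 0)
    return {stage: hits[stage] / totals[stage] for stage in totals}

def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z defaults to the 95% level)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half
```

The interval width depends on the number of test-set patients, which is one reason the RC interval reported above is wider than those for CC and NSCLC.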

CONCLUSIONS: Multinomial logistic regression using administrative claims data represents a useful approach to cancer staging, yielding superior results compared with previously published algorithms. Machine-learning algorithms may be viable to assign patients’ cancer stages at diagnosis using claims data alone.

Conference/Value in Health Info

2023-05, ISPOR 2023, Boston, MA, USA

Value in Health, Volume 26, Issue 6, S2 (June 2023)

Code

MSR32

Topic

Methodological & Statistical Research, Study Approaches

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
