Estimating US Cancer Prevalence With Community Oncology EHR-Based Datasets via National Benchmarks From SEER and NPCR
Author(s)
Herman Ray, PhD1, Justin Li, BA2, Sheenu Chandwani, MPH, PhD2, Claudio D'Ambrosio, PhD2.
1Kennesaw State University, Kennesaw, GA, USA, 2ConcertAI, LLC, Cambridge, MA, USA.
OBJECTIVES: Cancer statistics from the established national registries, SEER and NPCR, provide benchmarks for US cancer incidence. Contemporary real-world data (RWD) sources typically do not directly reflect national incidence or prevalence. This study demonstrates a methodology that aligns RWD summaries with the national incidence and prevalence benchmarks, producing estimates similar to the benchmarks for both statistics.
METHODS: US prostate cancer incidence data for cases diagnosed between 2011 and 2020 were obtained from NPCR and SEER and stratified by age, race, region, stage, and diagnosis year. ConcertAI’s EHR-based dataset (RWD360®) was weighted to match NPCR incidence distributions using raking methodology in R (v4.4.1). Raking iteratively adjusted the sample weights until the weighted sample matched the population marginal distributions. Limited-duration prevalence estimates were calculated by applying annual 12- to 60-month survival rates from SEER to the incidence weights according to the time since diagnosis.
RESULTS: The weighted cohort from RWD360® achieved an absolute average deviation of 2.28% (SD 2.16%) from NPCR incidence counts and distributions for the weighting variables. For variables not included in the weighting, the absolute average deviation was substantially larger and varied by attribute. Based on the weighting scheme, the surviving prostate cancer population in 2020 was estimated to be 1,060,001. The 5-year limited-duration prevalence estimates from the weighted cohorts were within 6.15% of published estimates.
CONCLUSIONS: This analysis demonstrates the feasibility of weighting an EHR-based RWD sample to align with national cancer incidence benchmarks, providing a method for estimating prevalence consistent with reference data. While focused on prostate cancer, the approach may apply to other tumors and leverage more current RWD incidence data to improve contemporary prevalence estimates and address the multi-year lag in traditional cancer registry data.
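ILLUSTRATIVE SKETCH: The two computational steps described in METHODS, raking to marginal benchmarks and applying survival to the resulting weights, can be sketched as follows. This is a minimal Python illustration, not the authors' implementation (the study used R v4.4.1). The function names, the two raking variables, the toy records, the target totals, and the survival values are all hypothetical placeholders and are not SEER, NPCR, or RWD360® figures. Survival is keyed by whole years since diagnosis, with 1.0 assumed for the diagnosis year.

```python
from collections import defaultdict


def rake(records, targets, max_iter=200, tol=1e-8):
    """Iterative proportional fitting on record-level weights.

    records: list of dicts holding one category per raking variable.
    targets: {variable: {category: target_total}}; totals must agree
             across variables and every category must occur in records.
    Returns a list of weights aligned with `records`.
    """
    weights = [1.0] * len(records)
    for _ in range(max_iter):
        max_shift = 0.0
        for var, margin in targets.items():
            # Current weighted total in each category of this variable.
            current = defaultdict(float)
            for rec, w in zip(records, weights):
                current[rec[var]] += w
            # Rescale so this variable's weighted margin hits its target.
            for i, rec in enumerate(records):
                factor = margin[rec[var]] / current[rec[var]]
                weights[i] *= factor
                max_shift = max(max_shift, abs(factor - 1.0))
        # Stop once a full pass over all variables barely moves the weights.
        if max_shift < tol:
            break
    return weights


def limited_duration_prevalence(records, weights, survival, index_year):
    """Sum of incidence weights, each scaled by survival to index_year.

    Cases diagnosed outside the window covered by `survival` are excluded.
    """
    total = 0.0
    for rec, w in zip(records, weights):
        years_since_dx = index_year - rec["dx_year"]
        if years_since_dx in survival:
            total += w * survival[years_since_dx]
    return total


# Hypothetical toy sample (placeholder values only).
records = [
    {"age": "<65", "region": "South", "dx_year": 2019},
    {"age": "<65", "region": "Other", "dx_year": 2020},
    {"age": "65+", "region": "South", "dx_year": 2018},
    {"age": "65+", "region": "Other", "dx_year": 2020},
    {"age": "65+", "region": "South", "dx_year": 2016},
]

# Hypothetical benchmark margins; both variables sum to the same total.
targets = {
    "age": {"<65": 40.0, "65+": 60.0},
    "region": {"South": 50.0, "Other": 50.0},
}

# Hypothetical survival by whole years since diagnosis (5-year window).
survival = {0: 1.00, 1: 0.98, 2: 0.97, 3: 0.96, 4: 0.95}

weights = rake(records, targets)
prevalence = limited_duration_prevalence(records, weights, survival, 2020)
print("weights:", [round(w, 3) for w in weights])
print("limited-duration prevalence:", round(prevalence, 3))
```

Raking matches only the marginal totals of each weighting variable, not their joint distribution, which is consistent with the larger deviations reported for attributes that were not part of the weighting.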
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
EPH131
Topic
Epidemiology & Public Health
Disease
SDC: Oncology