WHEN ONE CODE DEFINES MILLIONS: SENSITIVITY OF PERIPHERAL ARTERY DISEASE COHORTS TO OPERATIONAL DEFINITION CHOICES IN REAL-WORLD DATA
Author(s)
Scott L. DuVall, PhD1, Jared H. Kamauu, BA2, Aimee Harrison, MFA3, Michael Buck, PhD3, Craig G Parker, MD, MS4, Allise G Kamauu, MS5, Aaron Kamauu, MPH, MS, MD6;
1PurpleLab Healthcare Analytics, Senior Vice President, Real-World Evidence, Taylorsville, UT, USA, 2Navidence Inc, Lehi, UT, USA, 3Navidence, Aurora, CO, USA, 4Navidence, Sandy, UT, USA, 5Navidence, Salt Lake City, UT, USA, 6Navidence, Inc., Bountiful, UT, USA
OBJECTIVES: Composite phenotypes such as peripheral artery disease (PAD) are commonly defined in real-world research using diagnosis code lists derived from clinical guidelines or prior studies. While variability in these operational definitions is recognized, the population-level impact of individual codes is rarely quantified. This study assessed the sensitivity of PAD cohort size to operational definition choices, with corroborating analyses using other phenotypes.
METHODS: Multiple published real-world PAD cohort definitions with explicit ICD-10-CM code lists were identified from the literature. Each definition was replicated using PurpleLab® CLEAR Claims. Cohort sizes were compared across definitions, with overlap assessed at the patient and code levels. Code-level contributions to cohort inclusion were examined to identify high-impact diagnosis codes. Parallel sensitivity assessments were conducted for myocardial infarction (MI), stroke, and transient ischemic attack (TIA) to evaluate generalizability.
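The comparison described in the methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the claims rows, patient IDs, definition names, and the helper functions `build_cohort` and `code_contributions` are all hypothetical, and the sketch assumes claims are available as simple (patient ID, diagnosis code) pairs. It shows one plausible way to measure cohort size, patient-level and code-level overlap, and how many patients each code brings in that no other code in the same list would capture.

```python
from collections import defaultdict

def build_cohort(claims, code_list):
    """Return the set of patient IDs with at least one diagnosis code in code_list."""
    codes = set(code_list)
    return {pid for pid, code in claims if code in codes}

def code_contributions(claims, code_list):
    """For each code, return (patients with the code, patients included ONLY by it)."""
    codes = set(code_list)
    patients_by_code = defaultdict(set)
    codes_by_patient = defaultdict(set)
    for pid, code in claims:
        if code in codes:
            patients_by_code[code].add(pid)
            codes_by_patient[pid].add(code)
    result = {}
    for code in codes:
        total = patients_by_code.get(code, set())
        # A patient is uniquely attributable when this is their only qualifying code.
        unique = {p for p in total if len(codes_by_patient[p]) == 1}
        result[code] = (len(total), len(unique))
    return result

# Toy data (assumption): five fictional patients, not real claims.
claims = [
    ("p1", "I73.9"),
    ("p2", "I73.9"), ("p2", "I70.211"),
    ("p3", "I70.211"),
    ("p4", "I70.25"),
    ("p5", "E11.51"),
]

# Two hypothetical definitions; neither is one of the published code lists.
definition_a = ["I70.211", "I70.25", "E11.51"]   # omits I73.9
definition_b = ["I73.9", "I70.211"]              # includes I73.9

cohort_a = build_cohort(claims, definition_a)
cohort_b = build_cohort(claims, definition_b)

shared_patients = cohort_a & cohort_b                 # patient-level overlap
shared_codes = set(definition_a) & set(definition_b)  # code-level overlap
contrib_b = code_contributions(claims, definition_b)

print(len(cohort_a), len(cohort_b), len(shared_patients), sorted(shared_codes))
print(contrib_b["I73.9"])
```

In this toy example, the patient captured only through I73.9 drops out of any definition that omits the code, which is the kind of single-code effect the analysis is designed to surface.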
RESULTS: Five PAD cohorts were constructed using definitions ranging from 52 to 351 diagnosis codes. Three large code lists (324-351 codes) identified 6.21-6.91 million patients, while two smaller lists (52-57 codes) identified 5.15 and 5.76 million patients, respectively. Despite minimal overlap between the two smaller lists (60 shared codes), cohort sizes were comparable to those derived from substantially larger definitions. The three larger lists included approximately 300 codes absent from the smaller definitions, collectively contributing minimal incremental patients, while a single diagnosis code (I73.9 peripheral vascular disease, unspecified) absent from all three larger lists accounted for approximately 1.88 million patients. Similar sensitivity to individual diagnosis codes was observed in all cohorts.
CONCLUSIONS: Operational definition choices for PAD can result in multi-million-patient differences in cohort size driven by a small number of high-impact diagnosis codes rather than overall code list size. These findings underscore the importance of data-informed phenotype design aligned with study intent, as inclusion or exclusion of specific codes may substantially alter cohort size and underlying patient populations.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
RWD1
Topic
Real World Data & Information Systems
Topic Subcategory
Data Protection, Integrity, & Quality Assurance, Reproducibility & Replicability
Disease
SDC: Cardiovascular Disorders (including MI, Stroke, Circulatory)