A Statistical Method for Quality Assessment of Demographic Variables in EHR Data
Author(s)
Hui Wang, Ph.D.
Lumbrita LLC, Los Gatos, CA, USA
OBJECTIVES: We propose a statistical method for assessing the consistency of demographic variables, such as sex, in electronic health record (EHR) data.
METHODS: A random sample of 70K patients with hypertension was extracted from the Truveta database (2018-2024) and enriched with 1.8K patients whose self-reported sex was inconsistent with their sex-restrictive diagnosis codes. The cohort was then restricted to 47K patients by requiring each patient to have >= 7 diagnosis codes of any kind and >= 1 sex-restrictive code. Assuming a binomial distribution with a Beta prior, we derived for each patient the Bayesian posterior probability that the proportion of female (male) restrictive codes, interpreted as the probability of being female (male), is greater than 0.8. The Beta prior parameters were estimated empirically from the data.
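The posterior probability described in METHODS can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function names, the symbols `k` (female-restrictive codes for a patient), `n` (all sex-restrictive codes for that patient), `alpha` and `beta` (Beta prior parameters), and the method-of-moments prior estimate are all hypothetical, since the abstract does not say how the prior was estimated. With a Beta(alpha, beta) prior and a binomial likelihood, the posterior is Beta(alpha + k, beta + n - k), and the score is its upper tail beyond 0.8.

```python
from math import exp, lgamma, log


def _beta_cf(a, b, x, max_iter=300, eps=3e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def beta_cdf(x, a, b):
    """Regularized incomplete beta function I_x(a, b), i.e. the Beta(a, b) CDF."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = exp(lgamma(a + b) - lgamma(a) - lgamma(b)
                + a * log(x) + b * log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    # Use the symmetry I_x(a, b) = 1 - I_{1-x}(b, a) for faster convergence.
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def beta_prior_from_moments(mean, var):
    """ASSUMPTION: method-of-moments prior; the abstract does not name its estimator."""
    common = mean * (1.0 - mean) / var - 1.0
    if common <= 0.0:
        raise ValueError("variance too large for a Beta distribution")
    return mean * common, (1.0 - mean) * common


def posterior_prob_exceeds(k, n, alpha, beta, cutoff=0.8):
    """P(proportion > cutoff | k of n codes), posterior Beta(alpha + k, beta + n - k)."""
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    return 1.0 - beta_cdf(cutoff, alpha + k, beta + n - k)


# Hypothetical patient: 10 sex-restrictive codes, all female-restrictive,
# scored under a uniform Beta(1, 1) prior chosen only for illustration.
score = posterior_prob_exceeds(k=10, n=10, alpha=1.0, beta=1.0)
```

Where SciPy is available, `scipy.stats.beta.sf(0.8, alpha + k, beta + n - k)` computes the same tail probability; the pure-Python version above only avoids the dependency.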
RESULTS: Among the 47K patients, 23K were self-reported female and 24K self-reported male. The proportion of female (male) restrictive codes relative to the total number of sex-restrictive codes had mean 99.1% (96.9%) and SD 8.1% (16.2%). There were 1743 patients whose self-reported sex was inconsistent with their sex-restrictive diagnosis codes: 525 were self-reported female and 1218 were self-reported male. When a patient was defined as female if the posterior probability of being female was > 80% and that of being male was <= 20% (and vice versa for male), 1268 (73%) of the inconsistencies were detected; with thresholds of > 95% and <= 5%, 1410 (81%) were detected. This demonstrates the usability and flexibility of these posterior probabilities in flagging data quality issues.
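The threshold rule in RESULTS can be written as a small decision function. The sketch below is illustrative only: `infer_sex`, `is_flagged`, and the argument names are hypothetical, and it assumes the two posterior probabilities (`p_female`, `p_male`) have already been computed for each patient.

```python
def infer_sex(p_female, p_male, high=0.80, low=0.20):
    """Return 'female', 'male', or None when neither rule is satisfied."""
    if p_female > high and p_male <= low:
        return "female"
    if p_male > high and p_female <= low:
        return "male"
    return None  # the codes are not decisive at these thresholds


def is_flagged(self_reported, p_female, p_male, high=0.80, low=0.20):
    """Flag a patient whose code-inferred sex contradicts the self-reported sex."""
    inferred = infer_sex(p_female, p_male, high, low)
    return inferred is not None and inferred != self_reported


# Hypothetical patient recorded as male whose codes point strongly to female.
flag_80 = is_flagged("male", p_female=0.97, p_male=0.01)
# The same patient under the > 95% and <= 5% thresholds.
flag_95 = is_flagged("male", p_female=0.97, p_male=0.01, high=0.95, low=0.05)
```

Keeping the thresholds as parameters mirrors the two settings reported in the abstract (> 80% / <= 20% and > 95% / <= 5%).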
CONCLUSIONS: The proposed posterior probability score can be used as a quality index to assess inconsistencies between self-reported sex and diagnosis data, both quantitatively and qualitatively. The method can be extended to incorporate information from procedure and medication codes.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
MSR137
Topic
Methodological & Statistical Research
Disease
No Additional Disease & Conditions/Specialized Treatment Areas