Evaluating Fairness Across Machine Learning Algorithms in Health Models Incorporating Race/Ethnicity as Predictors
Author(s)
Yizhi Liang, MBBS, MS1, Chen Sheng, MS, MD2, Jize Luo, BA3, Beier Chen, LLB, MS4;
1University of Southern California, Department of Pharmaceutical and Health Economics, Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, USA, 2Boston University, Department of Epidemiology, School of Public Health, Boston, MA, USA, 3University of Southern California, Department of Computer Science, School of Engineering, Los Angeles, CA, USA, 4University of Southern California, Department of Pharmaceutical and Health Economics, Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, USA
OBJECTIVES: Incorporating social factors such as race and ethnicity in health prediction models remains a topic of debate. While previous studies have assessed the fairness of including social factors, the roles of different machine learning algorithms and of hyperparameter optimization remain underexplored. This study examines the tradeoffs between model performance and fairness across various algorithms and scenarios in health predictive modeling.
METHODS: We examined two cases: (1) predicting cardiovascular diseases using National Health and Nutrition Examination Survey data from 2007 to 2018 and (2) predicting adverse pregnancy outcomes using U.S. live birth certificate data from 2016 to 2023. For each case, we compared general and race-specific models across three scenarios: (1) race-neutral (RN), which excludes race/ethnicity as a predictor; (2) race-sensitive (RS), which includes race/ethnicity; and (3) RN with race-stratified cross-validation. We evaluated the performance of eight algorithms, each with Bayesian hyperparameter optimization and nested resampling to ensure generalizability. Model performance was assessed using the area under the receiver operating characteristic curve (AUC). The 95% confidence intervals (95% CIs) were constructed with bootstrapping.
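The evaluation pipeline described above can be sketched in scikit-learn. This is a minimal illustration on synthetic data, not the authors' code: the abstract specifies Bayesian hyperparameter optimization, for which a plain grid search is substituted here to keep the sketch dependency-free, and a gradient-boosting classifier stands in for the eight algorithms compared.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict

rng = np.random.default_rng(0)

# Synthetic binary-outcome data standing in for the study cohorts.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Inner loop: hyperparameter search (grid search as a stand-in for the
# Bayesian optimization used in the abstract).
inner = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)

# Outer loop: held-out predictions so tuning never sees the evaluation fold.
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
proba = cross_val_predict(inner, X, y, cv=outer, method="predict_proba")[:, 1]

# Bootstrap 95% CI for the AUC, resampling individuals with replacement.
boot = []
for _ in range(200):
    idx = rng.integers(0, len(y), len(y))
    if len(np.unique(y[idx])) < 2:  # skip resamples with only one class
        continue
    boot.append(roc_auc_score(y[idx], proba[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"AUC {roc_auc_score(y, proba):.2f} (95% CI {lo:.2f}, {hi:.2f})")
```

Nesting the tuner inside the outer cross-validation is what makes the reported AUC an honest generalization estimate; tuning and scoring on the same folds would bias the AUC upward.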
RESULTS: The first case included 4,942 individuals (41.4% non-Hispanic White, 27.0% Hispanic, 22.3% non-Hispanic Black, 9.3% Other); the second comprised 24,765,394 singleton live births (53.2% non-Hispanic White, 25.5% Hispanic, 13.9% non-Hispanic Black, 7.5% Other). Among models built on the whole sample, across both cases and all scenarios, XGBoost consistently demonstrated the highest AUC (0.82, 95% CI 0.80, 0.85), while K-nearest neighbors (KNN) showed the lowest (0.64, 95% CI 0.62, 0.67). However, XGBoost exhibited substantial variability across racial/ethnic groups, with the poorest performance among Black individuals, while KNN showed minimal variation across groups.
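The subgroup comparison reported above amounts to stratifying the AUC by race/ethnicity. A minimal sketch, using hypothetical group labels and synthetic risk scores (not the study's data), shows how a model's discrimination can be audited per group:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 4000

# Hypothetical group labels roughly mirroring the cohort mix, plus
# synthetic outcomes and risk scores correlated with the outcome.
group = rng.choice(
    ["NH White", "Hispanic", "NH Black", "Other"], size=n, p=[0.45, 0.27, 0.20, 0.08]
)
y = rng.integers(0, 2, size=n)
score = 0.8 * y + rng.normal(size=n)

# Per-group AUC: a model can look strong overall yet lag in one subgroup,
# which is exactly the XGBoost pattern the results describe.
subgroup_auc = {
    g: roc_auc_score(y[group == g], score[group == g]) for g in np.unique(group)
}
for g, auc in sorted(subgroup_auc.items()):
    print(f"{g}: AUC = {auc:.2f}")
```

In practice the fairness comparison would also bootstrap each subgroup's AUC, since the smallest groups carry the widest intervals.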
CONCLUSIONS: Fairness evaluations in predictive modeling require consideration beyond algorithm choice and tuning alone. To improve decision-making, researchers should optimize hyperparameters and assess fairness-performance trade-offs across diverse algorithms and racial/ethnic subgroups, ensuring robustness and generalizability in predictive models.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
P14
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas