Evaluating Fairness Across Machine Learning Algorithms in Health Models Incorporating Race/Ethnicity as Predictors
Author(s)
Yizhi Liang, MBBS, MS1, Chen Sheng, MS, MD2, Jize Luo, BA3, Beier Chen, LLB, MS4;
1University of Southern California, Department of Pharmaceutical and Health Economics, Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, USA, 2Boston University, Department of Epidemiology, School of Public Health, Boston, MA, USA, 3University of Southern California, Department of Computer Science, School of Engineering, Los Angeles, CA, USA, 4University of Southern California, Department of Pharmaceutical and Health Economics, Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, USA
OBJECTIVES: Incorporating social factors such as race and ethnicity in health prediction models remains a topic of debate. While previous studies have assessed the fairness of including social factors, the roles of different machine learning algorithms and of hyperparameter optimization remain underexplored. This study examines the tradeoffs between model performance and fairness across various algorithms and scenarios in health predictive modeling.
METHODS: We examined two cases: (1) predicting cardiovascular diseases using National Health and Nutrition Examination Survey data from 2007 to 2018 and (2) predicting adverse pregnancy outcomes using U.S. live birth certificate data from 2016 to 2023. For each case, we compared general and race-specific models across three scenarios: (1) race-neutral (RN), which excludes race/ethnicity as a predictor; (2) race-sensitive (RS), which includes race/ethnicity; and (3) RN with race-stratified cross-validation. We evaluated the performance of eight algorithms, each with Bayesian hyperparameter optimization and nested resampling to ensure generalizability. Model performance was assessed using the area under the receiver operating characteristic curve (AUC). The 95% confidence intervals (95% CIs) were constructed with bootstrapping.
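The evaluation pipeline described above can be sketched in scikit-learn. This is a minimal illustration on synthetic data, not the authors' code: the abstract specifies Bayesian hyperparameter optimization, for which a plain grid search is substituted here to keep the sketch dependency-free, and a gradient-boosting classifier stands in for the eight algorithms compared.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict

rng = np.random.default_rng(0)

# Synthetic binary-outcome data standing in for the study cohorts.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Inner loop: hyperparameter search (grid search as a stand-in for the
# Bayesian optimization used in the abstract).
inner = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)

# Outer loop: held-out predictions so tuning never sees the evaluation fold.
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
proba = cross_val_predict(inner, X, y, cv=outer, method="predict_proba")[:, 1]

# Bootstrap 95% CI for the AUC, resampling individuals with replacement.
boot = []
for _ in range(200):
    idx = rng.integers(0, len(y), len(y))
    if len(np.unique(y[idx])) < 2:  # skip resamples with only one class
        continue
    boot.append(roc_auc_score(y[idx], proba[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"AUC {roc_auc_score(y, proba):.2f} (95% CI {lo:.2f}, {hi:.2f})")
```

Nesting the tuner inside the outer cross-validation is what makes the reported AUC an honest generalization estimate; tuning and scoring on the same folds would bias the AUC upward.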
RESULTS: The first case included 4,942 individuals (41.4% non-Hispanic White, 27.0% Hispanic, 22.3% non-Hispanic Black, 9.3% Other); the second comprised 24,765,394 singleton live births (53.2% non-Hispanic White, 25.5% Hispanic, 13.9% non-Hispanic Black, 7.5% Other). Among models built on the whole sample, across both cases and all scenarios, XGBoost consistently demonstrated the highest AUC (0.82, 95% CI 0.80, 0.85), while K-nearest neighbors (KNN) showed the lowest (0.64, 95% CI 0.62, 0.67). However, XGBoost exhibited substantial variability across racial/ethnic groups, with the poorest performance among Black individuals, while KNN showed minimal variation across groups.
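The subgroup comparison reported above amounts to stratifying the AUC by race/ethnicity. A minimal sketch, using hypothetical group labels and synthetic risk scores (not the study's data), shows how a model's discrimination can be audited per group:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 4000

# Hypothetical group labels roughly mirroring the cohort mix, plus
# synthetic outcomes and risk scores correlated with the outcome.
group = rng.choice(
    ["NH White", "Hispanic", "NH Black", "Other"], size=n, p=[0.45, 0.27, 0.20, 0.08]
)
y = rng.integers(0, 2, size=n)
score = 0.8 * y + rng.normal(size=n)

# Per-group AUC: a model can look strong overall yet lag in one subgroup,
# which is exactly the XGBoost pattern the results describe.
subgroup_auc = {
    g: roc_auc_score(y[group == g], score[group == g]) for g in np.unique(group)
}
for g, auc in sorted(subgroup_auc.items()):
    print(f"{g}: AUC = {auc:.2f}")
```

In practice the fairness comparison would also bootstrap each subgroup's AUC, since the smallest groups carry the widest intervals.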
CONCLUSIONS: Fairness evaluations in predictive modeling require consideration beyond algorithm choice and tuning alone. To improve decision-making, researchers should optimize hyperparameters and assess fairness-performance trade-offs across diverse algorithms and racial/ethnic subgroups, ensuring robustness and generalizability in predictive models.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
P14
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas