INTERPRETABLE MACHINE LEARNING TO PREDICT CATASTROPHIC HEALTH EXPENDITURE RISK IN CHINA: EVIDENCE FROM NATIONALLY REPRESENTATIVE SURVEY DATA
Author(s)
Danyang Wei, Bachelor of Management; Min Hu, PhD
Fudan University, Shanghai, China
Fudan University, Shanghai, China
OBJECTIVES: The incidence of catastrophic health expenditure (CHE) among Chinese households remains high by global standards, revealing gaps in financial risk protection. Prior studies have largely relied on conventional statistical approaches to examine correlates of CHE. This study aimed to develop a machine learning-based prediction model for household CHE risk in China and to identify key predictive factors to inform timely and targeted policy interventions.
METHODS: We used nationally representative data from the 2022 China Family Panel Studies (CFPS), including 8,000 households after data cleaning. CHE was defined as medical spending ≥40% of non-food expenditure, with 25% and 10% thresholds used for robustness checks. Guided by the Andersen behavioral model, predictors were grouped into predisposing factors (e.g., age, sex, education), enabling factors (e.g., income, insurance, access-related indicators), and need factors (e.g., chronic disease, self-reported health, hospitalization). Decision tree (DT), random forest (RF), and XGBoost classifiers were trained using an 80/20 train-test split with five-fold cross-validation and class-imbalance handling. Model performance was evaluated using AUROC, accuracy, precision, recall, and F1 score. SHAP was applied to interpret the best-performing model and quantify feature importance.
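The CHE definition above can be illustrated with a minimal sketch. The field names (`medical_spending`, `total_expenditure`, `food_expenditure`) and the handling of households with no non-food expenditure are assumptions made for illustration; they are not taken from the CFPS codebook or from the authors' code.

```python
from dataclasses import dataclass


@dataclass
class Household:
    # Hypothetical field names, all in the same currency unit and period.
    medical_spending: float
    total_expenditure: float
    food_expenditure: float


def is_che(h: Household, threshold: float = 0.40) -> bool:
    """Flag CHE when medical spending is >= threshold * non-food expenditure."""
    non_food = h.total_expenditure - h.food_expenditure
    if non_food <= 0:
        # Assumption: the share is undefined here, so the record is rejected
        # rather than silently classified.
        raise ValueError("non-food expenditure must be positive")
    return h.medical_spending / non_food >= threshold


def che_incidence(households: list[Household], threshold: float = 0.40) -> float:
    """Share of households flagged as CHE at the given threshold."""
    flags = [is_che(h, threshold) for h in households]
    return sum(flags) / len(flags)


# Toy data only; these are not CFPS records.
sample = [
    Household(5000, 20000, 8000),  # share ~0.417
    Household(3500, 20000, 8000),  # share ~0.292
    Household(1500, 20000, 8000),  # share 0.125
    Household(500, 20000, 8000),   # share ~0.042
]

for t in (0.40, 0.25, 0.10):
    print(f"threshold {t:.0%}: incidence {che_incidence(sample, t):.2f}")
```

Lowering the threshold can only add households to the CHE group, which is why incidence rises monotonically from the 40% to the 10% definition.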
RESULTS: CHE incidence was 9.69% at the 40% threshold (17.39% at 25% and 38.01% at 10%). Subgroup analyses showed higher CHE incidence among adults aged ≥75 years and among those who were hospitalized, in poor health, in the lowest income group, or living with chronic disease (all p<0.001). XGBoost achieved the best discrimination (AUROC=0.806), outperforming RF (0.797) and DT (0.772). SHAP ranked hospitalization (SHAP=0.381), age (0.301), household composition (0.277), self-reported health (0.223), and income (0.178) as top contributors. Category-level SHAP suggested increased risk associated with low income, advanced age, poor health, hospitalization, and having older household members. Findings were robust across alternative thresholds.
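As a reading aid for the AUROC figures above: AUROC equals the probability that a randomly chosen CHE household receives a higher predicted risk than a randomly chosen non-CHE household, with ties counted as one half. The sketch below computes it directly from that definition. The function name `auroc` and the toy labels and scores are illustrative assumptions, not the authors' evaluation code or data.

```python
def auroc(y_true: list[int], scores: list[float]) -> float:
    """AUROC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy labels (1 = CHE) and predicted risks; not study data.
labels = [1, 0, 1, 0, 0]
risks = [0.9, 0.3, 0.4, 0.5, 0.1]
print(round(auroc(labels, risks), 3))
```

This pairwise form is quadratic in sample size and is meant only to show the definition; a rank-based library implementation is preferable for data at survey scale.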
CONCLUSIONS: Interpretable machine learning enables early prediction of household CHE risk and identification of actionable signals, supporting risk stratification, dynamic monitoring, and targeted interventions to improve financial protection and progress toward universal health coverage (UHC).
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR138
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas