Cardiovascular Events Prediction on a Synthetic Cohort in Low-Income Setting with Machine Learning
Speaker(s)
Carrasquilla Sotomayor M1, Chiavegatto Filho ADP1, Salcedo Mejía F2, Alvis Zakzuk NJ3
1Universidade de São Paulo, São Paulo, Brazil, 2ALZAK, Cartagena, Bolívar, Colombia, 3Universidad de la Costa, Barranquilla, Atlántico, Colombia
OBJECTIVES: To identify the optimal machine learning (ML) model for predicting cardiovascular events (CVE) in patients enrolled in a cardiovascular program during 2013-2018.
METHODS: We developed two ML predictive models: 1) with the first fatal or non-fatal CVE as the primary outcome, and 2) for subsequent events. Demographic, clinical, anthropometric, and epidemiological covariates were included as predictors. The models were trained using synthetic data for 20,000 patients simulated from an original data set of 93,552 patients enrolled in a cardiovascular cohort program in Colombia. Data were simulated to replace sensitive values while causing minimal distortion of the statistical distribution.
We trained four ML algorithms for structured data on 70% of the sample: Random Forest (RF), XGBoost (XGB), LightGBM (LGBM), and CatBoost; the models were then tested on the remaining 30%. Hyperparameters were selected using random search, and variable selection was optimized with Boruta. For model selection, we identified the model with the highest AUC-ROC, precision, and recall.
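The 70/30 partition described above can be sketched in plain Python. This is only an illustration under stated assumptions: the abstract does not say whether the split was stratified by outcome, so the stratification here is an assumption, and the function name `stratified_split`, the seed, and the 5% event rate of the toy cohort are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.3, seed=42):
    """Return (train_idx, test_idx) so each class keeps its share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class before cutting
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical toy cohort: 20,000 patients, 5% with an event (illustrative rate only).
labels = [1] * 1000 + [0] * 19000
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_rate, test_rate)
```

Splitting within each outcome class keeps the event rate the same in the training and test partitions, which matters when events are rare and a purely random cut could leave the test set with too few cases to estimate recall.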
RESULTS: For the first CVE model, the AUC-ROC values were 0.79 for RF, 0.81 for XGB, 0.80 for LGBM, and 0.829 for CatBoost, the highest. However, the model reported a recall of 0.0912 when using the unbalanced outcome sample (<5.8% occurrence). Among 555 patients with a CVE history (30.63% subsequent CVE occurrence), the second model performed worse, showing little predictive potential for subsequent CVE risk. Its highest AUC-ROC was obtained with CatBoost (0.695), with an improved recall (0.392). All models identified time to event, high-risk categorization, age, microalbuminuria, and levels of creatinine, cholesterol, and glycemia as strong predictors.
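A high AUC-ROC alongside a low recall, as reported for the first model, is possible because AUC-ROC measures ranking across all thresholds while recall depends on a single decision threshold. The sketch below illustrates this with a hypothetical toy sample of 2 events among 20 patients; the scores, the thresholds, and the function names `auc_roc` and `precision_recall` are illustrative assumptions, not the study's data or the threshold the authors used.

```python
def auc_roc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are classified as events."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy sample: 2 events among 20 patients, with low predicted risks overall.
y_true = [1, 1] + [0] * 18
scores = [0.60, 0.30] + [0.35, 0.25, 0.20, 0.15, 0.10, 0.10] + [0.05] * 12

auc = auc_roc(y_true, scores)                       # ranking quality, threshold-free
default_pr = precision_recall(y_true, scores, 0.5)  # conventional 0.5 cut-off
lowered_pr = precision_recall(y_true, scores, 0.3)  # lower cut-off trades precision for recall
print(auc, default_pr, lowered_pr)
```

In this toy sample the ranking is nearly perfect, yet the 0.5 cut-off misses half of the events; lowering the cut-off recovers them at the cost of a false positive. With rare outcomes, predicted probabilities tend to be small, so the choice of threshold can dominate the reported recall.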
CONCLUSIONS: ML models offer a few advantages, especially when dealing with large datasets and unbalanced events, while maintaining high performance and reliability in their predictions. Our findings provide an additional tool to support decision-making on prevention routes in primary and secondary care and to enhance patients’ quality of life.
Code
MSR10
Topic
Epidemiology & Public Health, Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Public Health
Disease
Cardiovascular Disorders (including MI, Stroke, Circulatory)