Cardiovascular Events Prediction on a Synthetic Cohort in Low-Income Setting with Machine Learning

Speaker(s)

Carrasquilla Sotomayor M¹, Chiavegatto Filho ADP¹, Salcedo Mejía F², Alvis Zakzuk NJ³
¹Universidade de São Paulo, São Paulo, Brazil, ²ALZAK, Cartagena, Bolívar, Colombia, ³Universidad de la Costa, Barranquilla, Atlántico, Colombia

Presentation Documents

Poster MSR10_Cardiovascular Events Prediction - Machine Learning-Ispor 2024139125.pdf

OBJECTIVES: To identify the optimal machine learning (ML) model for predicting cardiovascular events (CVE) in patients enrolled in a cardiovascular program during 2013-2018.

METHODS: We developed two ML predictive models: 1) one with the first fatal or non-fatal CVE as the primary outcome, and 2) one for subsequent events. Demographic, clinical, anthropometric, and epidemiological covariates were included as predictors. The models were trained using synthetic data for 20,000 patients simulated from an original data set of 93,552 patients enrolled in a cardiovascular cohort program in Colombia. Data were simulated to replace sensitive values while causing minimal distortion of the statistical distribution.

We trained four ML algorithms for structured data on 70% of the sample: Random Forest (RF), XGBoost (XGB), LightGBM (LGBM), and CatBoost; the models were then tested on the remaining 30%. Hyperparameters were selected using random search, and variable selection was optimized with Boruta. For model selection, we chose the model with the highest AUC-ROC, precision, and recall.
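The training and evaluation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the data come from scikit-learn's `make_classification` as a hypothetical stand-in for the synthetic cohort, only Random Forest is shown, the hyperparameter grid is invented for the example, and the Boruta variable-selection step is omitted.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Hypothetical stand-in for the synthetic cohort: roughly 6% positive outcomes.
X, y = make_classification(
    n_samples=2000, n_features=12, n_informative=6,
    weights=[0.94, 0.06], random_state=0,
)

# 70/30 split, stratified so both parts keep a similar event rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Random search over an illustrative (assumed) hyperparameter grid.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [50, 100, 200],
        "max_depth": [3, 5, 8, None],
        "min_samples_leaf": [1, 5, 20],
    },
    n_iter=5, scoring="roc_auc", cv=3, random_state=0,
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out 30%.
proba = search.predict_proba(X_test)[:, 1]
pred = search.predict(X_test)
auc = roc_auc_score(y_test, proba)
precision = precision_score(y_test, pred, zero_division=0)
recall = recall_score(y_test, pred, zero_division=0)
print(f"AUC-ROC={auc:.3f} precision={precision:.3f} recall={recall:.3f}")
```

The same split and search could be repeated for each of the other three algorithms, and the resulting test-set metrics compared side by side.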

RESULTS: For the first CVE model, the AUC-ROC values were 0.79 for RF, 0.81 for XGB, and 0.80 for LGBM; CatBoost had the highest performance (0.829). However, the model reported a recall of 0.0912 when using the unbalanced outcome sample (<5.8% occurrence). Among 555 patients with a CVE history, the second model (30.63% subsequent CVE occurrence) performed worse, with little predictive potential for subsequent CVE risk. The highest AUC-ROC was obtained with CatBoost (0.695), which showed an improved recall (0.392). All models identified time to event, high-risk categorization, age, microalbuminuria, and creatinine, cholesterol, and glycemic levels as strong predictors.
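A high AUC-ROC alongside a low recall, as reported for the first model, is possible because AUC-ROC depends only on how cases are ranked, whereas recall depends on the classification threshold. The sketch below illustrates this with invented toy scores (not study data); the function names `auc_roc` and `recall_at` and the 0.5 threshold are assumptions made for the example.

```python
def auc_roc(pos_scores, neg_scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def recall_at(pos_scores, threshold):
    """Share of true events whose predicted risk reaches the threshold."""
    flagged = sum(1 for p in pos_scores if p >= threshold)
    return flagged / len(pos_scores)


# Invented toy risk scores: events are rare and mostly score below 0.5,
# yet they still rank above most non-events.
pos = [0.62, 0.41, 0.35, 0.30]
neg = [0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.20, 0.22,
       0.25, 0.28, 0.31, 0.33, 0.36, 0.38, 0.40, 0.45]

auc = auc_roc(pos, neg)
recall_default = recall_at(pos, 0.5)
print(f"AUC-ROC={auc:.3f} recall at 0.5={recall_default:.2f}")
```

In this toy case most events rank above most non-events, so the AUC-ROC is high, but only one of the four events crosses the 0.5 cut-off, so recall is low.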

CONCLUSIONS: ML models offer a few advantages, especially when dealing with large datasets and unbalanced events, maintaining high performance and reliability in their predictions. Our findings provide an additional tool to support decision-making on prevention routes in primary and secondary care to enhance patients’ quality of life.

Code

MSR10

Topic

Epidemiology & Public Health, Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Public Health

Disease

Cardiovascular Disorders (including MI, Stroke, Circulatory)

ISPOR 2024

May 5-8, 2024