Cardiovascular Events Prediction on a Synthetic Cohort in Low-Income Setting with Machine Learning

Author(s)

Carrasquilla Sotomayor M1, Chiavegatto Filho ADP1, Salcedo Mejía F2, Alvis Zakzuk NJ3
1Universidade de São Paulo, São Paulo, Brazil, 2ALZAK, Cartagena, Bolívar, Colombia, 3Universidad de la Costa, Barranquilla, Atlántico, Colombia

OBJECTIVES: To identify the optimal machine learning (ML) model for predicting cardiovascular events (CVE) in patients enrolled in a cardiovascular program during 2013-2018.

METHODS: We developed two ML predictive models: 1) with the first fatal or non-fatal CVE as the primary outcome, and 2) for subsequent events. Demographic, clinical, anthropometric, and epidemiological covariates were included as predictors. The models were trained using synthetic data for 20,000 patients simulated from an original data set of 93,552 patients enrolled in a cardiovascular cohort program in Colombia. Data were simulated to replace sensitive values while causing minimal distortion of the statistical distribution.

We trained four ML algorithms for structured data on 70% of the sample: Random Forest (RF), XGBoost (XGB), LightGBM (LGBM), and CatBoost; the models were then tested on the remaining 30%. Hyperparameters were selected using random search, and variable selection was optimized with Boruta. Models were selected on the basis of the highest AUC-ROC, precision, and recall.
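The workflow described above can be sketched in Python with scikit-learn. This is only a minimal illustration under stated assumptions, not the authors' code: the data are a toy imbalanced sample from `make_classification`, Random Forest stands in for all four algorithms, the hyperparameter ranges are hypothetical, and the Boruta variable-selection step is omitted.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Toy imbalanced data standing in for the cohort (NOT the study data).
X, y = make_classification(
    n_samples=2000, n_features=12, n_informative=6,
    weights=[0.94, 0.06], random_state=0,
)

# 70% train / 30% test, stratified so both parts keep a similar event rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Random search over a small, hypothetical hyperparameter space.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 150),
        "max_depth": randint(3, 10),
        "min_samples_leaf": randint(1, 20),
    },
    n_iter=5, scoring="roc_auc", cv=3, random_state=0,
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out 30%.
proba = search.predict_proba(X_test)[:, 1]
pred = search.predict(X_test)
metrics = {
    "auc_roc": roc_auc_score(y_test, proba),
    "precision": precision_score(y_test, pred, zero_division=0),
    "recall": recall_score(y_test, pred, zero_division=0),
}
print(metrics)
```

Repeating the same search for each candidate algorithm and comparing the resulting `metrics` dictionaries would mirror the model-selection step; the stratified split is a design choice assumed here, since the abstract does not state how the 70/30 partition was drawn.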

RESULTS: For the first-CVE model, AUC-ROC was 0.79 for RF, 0.81 for XGB, and 0.80 for LGBM; CatBoost performed best (0.829). However, this model reported a recall of 0.0912 on the unbalanced outcome sample (<5.8% occurrence). The second model, fitted on 555 patients with a history of CVE (30.63% subsequent CVE occurrence), performed worse and showed little predictive potential for subsequent CVE risk. Its highest AUC-ROC was again obtained with CatBoost (0.695), with an improved recall (0.392). All models identified as strong predictors the time to event, high-risk categorization, age, microalbuminuria and creatinine, cholesterol, and glycemic levels.
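A high AUC-ROC can coexist with a low recall because AUC-ROC measures how well events are ranked above non-events, whereas recall depends on the cut-off used to classify a patient as an event. The sketch below illustrates this with hypothetical scores for 20 patients (4 events); the numbers, the function names, and the 0.5 and 0.15 thresholds are assumptions for illustration, since the abstract does not report the threshold used.

```python
def auc_roc(y_true, scores):
    """Share of (event, non-event) pairs where the event scores higher; ties count 0.5."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are classified as events."""
    pred = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(pred, y_true))
    fp = sum(p == 1 and y == 0 for p, y in zip(pred, y_true))
    fn = sum(p == 0 and y == 1 for p, y in zip(pred, y_true))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical predicted risks: 4 events followed by 16 non-events.
y_true = [1, 1, 1, 1] + [0] * 16
scores = [0.60, 0.30, 0.25, 0.10] + [0.05] * 8 + [0.12] * 4 + [0.20] * 3 + [0.28]

print("AUC-ROC:", auc_roc(y_true, scores))
print("threshold 0.50:", precision_recall(y_true, scores, 0.50))
print("threshold 0.15:", precision_recall(y_true, scores, 0.15))
```

In this toy case the ranking is good, yet only one event exceeds the 0.5 cut-off; lowering the threshold recovers more events at the cost of precision. Whether a similar mechanism explains the recall reported for the first-CVE model cannot be determined from the abstract alone.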

CONCLUSIONS: ML models offer a few advantages, especially when dealing with large datasets and unbalanced events, maintaining high performance and reliability in their predictions. Our findings provide an additional tool to support decision-making on prevention routes in primary and secondary care and to enhance patients’ quality of life.

Conference/Value in Health Info

2024-05, ISPOR 2024, Atlanta, GA, USA

Value in Health, Volume 27, Issue 6, S1 (June 2024)

Code

MSR10

Topic

Epidemiology & Public Health, Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Public Health

Disease

Cardiovascular Disorders (including MI, Stroke, Circulatory)
