A DEEP LEARNING-BASED COMPARATIVE STUDY OF CARDIOVASCULAR DISEASE PREDICTION USING MULTIFACTORIAL HEALTH INDICATORS

Fabiha; Dr. Muhammad Ashraf; Engr. Muhammad Akram Khan; Muhammad Ameen; Dr. Akbar Khan; Muhammad Zahid Khan

Abstract

Cardiovascular diseases (CVDs) remain the leading cause of mortality globally, necessitating advanced risk prediction models to improve early detection and prevention. This study evaluates multiple machine learning approaches for predicting cardiovascular disease risk using the BRFSS 2015 health indicators dataset and compares their effectiveness against traditional assessment methods. After systematic data preprocessing and feature engineering, we used a dataset of 253,680 records with 17 health indicators, including BMI, blood pressure, cholesterol, smoking status, diabetes, physical activity, and demographic variables. The data were split 70%-30%: 70% for training and 30% for testing. The algorithms tested were Naive Bayes, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Random Forest, XGBoost, Logistic Regression, Linear SVC, Elastic Net, Gradient Boosting Trees, Artificial Neural Networks (ANN), a Multilayer Perceptron (MLP), and Self-Organizing Maps (SOM). Performance was assessed using accuracy, precision, recall, F1-score, and classification reports. Results showed marked differences in predictive performance among the algorithms. The highest accuracies were achieved by the Multilayer Perceptron (90.78%), Gradient Boosting Trees (90.76%), Logistic Regression (90.74%), XGBoost (90.72%), and SVM (90.63%). The weakest performers were Naive Bayes (83.06%) and Self-Organizing Maps (59.94%). These results suggest that ensemble methods and neural networks were more effective on this dataset, which we attribute to their ability to capture the complex non-linear relationships inherent in cardiovascular risk factors.
Feature importance analysis showed that traditional risk factors, including high blood pressure, high cholesterol, age, and BMI, remained significant predictors, while adding lifestyle factors (physical activity, smoking status, and alcohol consumption) improved model performance. Class imbalance made positive cases harder to predict; even so, most models classified cardiovascular disease status with high specificity and moderate sensitivity. This analysis adds to the cardiovascular risk prediction literature by demonstrating that machine learning methods, especially ensemble methods and neural networks, performed much better in cardiovascular risk prediction than conventional statistical approaches. This is consistent with evidence supporting the integration of machine learning into clinical decision systems to improve cardiovascular risk assessment and patient stratification.
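
To make the relationship between accuracy, sensitivity, and specificity under class imbalance concrete, the following is a minimal sketch of how these metrics follow from a confusion matrix. The function name `classification_metrics` and all counts are hypothetical illustrations, not values from this study; the example assumes a test set of 1,000 records in which 10% are positive.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a model predicts no positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts (NOT from the study): 100 positives, 900 negatives.
model = classification_metrics(tp=30, fp=20, fn=70, tn=880)

# Baseline that always predicts the majority (negative) class.
majority = classification_metrics(tp=0, fp=0, fn=100, tn=900)

if __name__ == "__main__":
    for name, metrics in (("model", model), ("majority baseline", majority)):
        summary = ", ".join(f"{k}={v:.3f}" for k, v in metrics.items())
        print(f"{name}: {summary}")
```

In this illustrative setting the always-negative baseline is already 90% accurate while detecting no positive cases, which is why recall, specificity, and F1-score are reported alongside accuracy when classes are imbalanced.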