Leveraging Machine Learning for Stroke Prediction: An Empirical Study on Clinical and Behavioral Risk Factors

Jungmin Lim

University of California, Irvine, Irvine, CA, United States

Volume 1, Issue 1, January 2026

ISSN: 3070-6432

Keywords

Machine Learning, Stroke Prediction, Clinical Risk Factors, Behavioral Risk Factors, Healthcare Analytics

Abstract

This study investigates the application of machine learning techniques to predicting stroke risk using clinical, behavioral, and demographic features. Multiple classification models were evaluated, and the random forest classifier achieved the highest performance, with a recall of 98% for the stroke class and an AUC of 0.98. Feature importance analysis showed that age, average glucose level, and BMI were the most influential predictors. From an operational perspective, integrating predictive modeling into healthcare systems can facilitate early risk detection and support personalized care strategies.

Conclusion

This study examined how machine learning can be applied to predict stroke risk from clinical and behavioral factors, and explored the implications for healthcare management and business analytics. Among the models tested, the random forest classifier achieved the best performance, with an accuracy of 94% and an AUC of 0.98, demonstrating strong predictive power in identifying individuals at high risk. Feature importance analysis indicated that age, average glucose level, and BMI were the most influential predictors, followed by marital status, hypertension, and work type. These findings suggest that both physiological and lifestyle-related factors contribute meaningfully to stroke prediction.

This performance is likely attributable to the algorithm's ensemble structure. By aggregating predictions from many decorrelated decision trees, each built on a bootstrap sample and random subsets of the predictors, a random forest can approximate complex non-linear relationships and higher-order interactions without requiring a prespecified functional form. Moreover, because each tree uses threshold-based splits on the predictor values, the model depends mainly on the ordering of the observations rather than their exact magnitude, which makes it less sensitive to extreme values.
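The point about threshold-based splits can be illustrated with a minimal sketch. It is not the study's model or data: the `glucose` and `stroke` values below are invented for illustration, and `best_split` is a hypothetical helper that fits a single decision stump by minimizing weighted Gini impurity. The sketch checks that inflating an extreme value, or applying a monotone transform such as a logarithm, leaves the chosen partition of the observations unchanged.

```python
import math

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def best_split(values, labels):
    """Return (threshold, left_indices) for the single split that
    minimizes the weighted Gini impurity of the two child nodes."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = values[order[k - 1]], values[order[k]]
        if lo == hi:
            continue  # no threshold can separate equal values
        left, right = order[:k], order[k:]
        score = (
            len(left) * gini([labels[i] for i in left])
            + len(right) * gini([labels[i] for i in right])
        ) / len(order)
        if best is None or score < best[0]:
            # Place the threshold midway between the neighboring values.
            best = (score, (lo + hi) / 2, frozenset(left))
    return best[1], best[2]

# Invented toy data (assumption): one predictor with an extreme value of 900.
glucose = [80, 85, 90, 95, 150, 160, 170, 900]
stroke = [0, 0, 0, 0, 1, 1, 1, 1]

# Same data with the extreme value made ten times larger.
glucose_inflated = glucose[:-1] + [9000]
# Same data after a monotone (order-preserving) transform.
glucose_log = [math.log(v) for v in glucose]

thr_raw, left_raw = best_split(glucose, stroke)
thr_inflated, left_inflated = best_split(glucose_inflated, stroke)
thr_log, left_log = best_split(glucose_log, stroke)

print("raw:     ", thr_raw, sorted(left_raw))
print("inflated:", thr_inflated, sorted(left_inflated))
print("log:     ", thr_log, sorted(left_log))
```

In all three cases the same observations fall on each side of the split. Only the numeric threshold changes under the log transform, since it is expressed in the units of the transformed predictor. A full random forest repeats this kind of split search across many trees, so the same order-based behavior carries over.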
