Reliability and Clinical Utility of Machine Learning to Predict Stroke Prognosis: Comparison with Logistic Regression

Article information

J Stroke. 2020;22(3):403-406
Publication date (electronic) : 2020 September 29
doi : https://doi.org/10.5853/jos.2020.02537
1Department of Neurology, Asan Medical Center, Seoul, Korea
2Clinical Research Center, Asan Medical Center, Seoul, Korea
3Department of Clinical Epidemiology and Biostatistics, Asan Medical Center, Seoul, Korea
4Asan Institute for Life Sciences, Asan Medical Center, Seoul, Korea
5Department of Neurology, Kyung Hee University Medical Center, Seoul, Korea
6Department of Neurology, Pusan National University Hospital, Busan, Korea
7Department of Neurology, Dong-A University Hospital, Busan, Korea
8Department of Neurology, Hallym University Sacred Heart Hospital, Anyang, Korea
9Department of Neurology, Korea University Ansan Hospital, Ansan, Korea
10Department of Neurology, Chosun University Hospital, Gwangju, Korea
11Department of Neurology, Dongguk University Ilsan Hospital, Goyang, Korea
12Department of Neurology, Keimyung University Medical Center, Daegu, Korea
13Department of Neurology, Hallym University Kangdong Sacred Heart Hospital, Seoul, Korea
14Department of Neurology, Pusan National University Yangsan Hospital, Yangsan, Korea
Correspondence: Dong-Wha Kang Department of Neurology, Asan Medical Center, University of Ulsan College of Medicine, 88 Olympic-ro 43-gil, Songpa-gu, Seoul 05505, Korea Tel: +82-2-3010-3440 Fax: +82-2-474-4691 E-mail: dwkang@amc.seoul.kr
*

These authors contributed equally to the manuscript as first author.

Received 2020 June 25; Revised 2020 July 29; Accepted 2020 August 13.

Dear Sir:

The accurate prediction of functional recovery after a stroke is essential for post-discharge treatment planning and resource utilization. Recently, machine learning (ML) algorithms using baseline clinical variables have demonstrated better performance for predicting the functional outcome of ischemic stroke than preexisting scoring systems developed with conventional statistics [1,2]. However, most studies compared model performance by the area under the curve (AUC) alone, and ML and conventional statistical approaches have not been sufficiently evaluated in terms of reliability and clinical utility [3]. We aimed to compare the performance of ML models with that of a conventional logistic regression (LR) model in predicting stroke outcome, evaluating accuracy, reliability, and clinical utility through AUC comparison, calibration, and decision curve analysis, using the KOrean Stroke Neuroimaging Initiative (KOSNI) database.

Using clinical variables measurable at admission (Supplementary methods 1), we trained various ML algorithms, including deep learning (DL), support vector machine (SVM), random forest (RF), and XGBoost (XGB), as well as a conventional LR model, to predict a 3-month modified Rankin Scale (mRS) score >2 or >1 (Supplementary methods 2). Receiver operating characteristic (ROC) curve analysis was performed to evaluate the sensitivity and specificity of each model across decision thresholds. Calibration was evaluated using a reliability diagram and the expected calibration error (ECE) to assess the agreement between predicted and actual outcomes [4]. Decision curve analysis was performed to assess the clinical utility of the developed models (Supplementary methods 3) [5].
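As an illustration of the calibration metric used above, ECE can be computed by binning predicted probabilities into equal-width bins and taking the bin-weighted mean gap between the mean predicted probability and the observed outcome rate. The sketch below is not the study's actual code; it assumes NumPy arrays of binary outcomes and predicted probabilities.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE: bin-weighted mean gap between predicted probability and the
    observed outcome rate, over equal-width probability bins."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each probability to one of n_bins bins (probability 1.0 goes
    # into the last bin rather than overflowing).
    bin_ids = np.clip(np.digitize(y_prob, edges[1:-1]), 0, n_bins - 1)
    n, ece = len(y_true), 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        observed = y_true[mask].mean()    # actual outcome rate in this bin
        predicted = y_prob[mask].mean()   # mean predicted probability in this bin
        ece += mask.sum() / n * abs(observed - predicted)
    return ece
```

A perfectly calibrated model (predicted probabilities matching observed frequencies in every bin) yields an ECE of 0, so lower values indicate better reliability.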

Six thousand seven hundred thirty-one patients were included from 10 tertiary stroke centers in South Korea. This study was approved by the Institutional Review Boards of all participating institutions, and comprehensive written informed consent was obtained from the patients enrolled in the prospective study. The earlier 4,709 patients (70%) by admission date were used for training, whereas the remaining 2,019 (30%) were used as a test set for evaluating final performance. The baseline characteristics stratified by outcome are summarized in Supplementary Table 1.
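The chronological 70/30 split described above can be sketched as follows. The records, field layout, and dates below are hypothetical placeholders, not KOSNI data; the actual database schema may differ.

```python
from datetime import date

# Hypothetical records: (admission date, 3-month mRS).
records = [
    (date(2015, 1, 5), 0), (date(2016, 6, 30), 2), (date(2015, 2, 10), 3),
    (date(2017, 2, 1), 5), (date(2015, 3, 1), 1), (date(2016, 1, 15), 4),
    (date(2017, 8, 20), 0), (date(2018, 3, 5), 3), (date(2019, 1, 2), 2),
    (date(2018, 9, 12), 1),
]

# Sort by admission date, then take the earliest 70% as the training set
# and the most recent 30% as the held-out test set.
records.sort(key=lambda r: r[0])
n_train = round(len(records) * 0.7)
train, test = records[:n_train], records[n_train:]

# Binarize the outcome (mRS > 2 = poor functional outcome) for model fitting.
train_labels = [int(mrs > 2) for _, mrs in train]
```

Splitting by admission date rather than at random means the test set mimics prospective deployment: the model is evaluated on patients admitted after all training patients.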

Compared with the LR model (AUC of the ROC curve: 0.860 for predicting mRS >2; 0.831 for predicting mRS >1), DL achieved an AUC of 0.864 for predicting mRS >2 (P=0.11) and 0.834 for predicting mRS >1 (P=0.06), which were not statistically different. The AUCs of SVM, RF, and XGB were 0.871 (P<0.001), 0.870 (P=0.01), and 0.871 (P<0.01) for mRS >2, and 0.838 (P<0.001), 0.844 (P<0.001), and 0.843 (P<0.001) for mRS >1, respectively, demonstrating better performance than the LR model (Figure 1). The detailed confusion matrix and accuracy are described in Supplementary Table 2. In the reliability diagram, the ECE value of SVM was the lowest for predicting both mRS >2 (0.020) and mRS >1 (0.037), suggesting that the SVM model was the best calibrated (Figure 2). Decision curve analysis indicated that the clinical benefit across risk thresholds was similar for the ML and LR models (Figure 2).
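The P-values above come from DeLong's test. As a rough, self-contained illustration of how two models' AUCs can be compared on the same test set, a paired bootstrap gives a similar comparison; the scoring vectors and function names below are illustrative only, and this is not the test used in the study.

```python
import numpy as np

def roc_auc(y, s):
    """ROC AUC via the Mann-Whitney statistic (ties receive half credit)."""
    y, s = np.asarray(y), np.asarray(s)
    pos, neg = s[y == 1], s[y == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

def bootstrap_auc_diff(y, s_a, s_b, n_boot=2000, seed=0):
    """Approximate two-sided p-value for AUC(a) == AUC(b), paired over
    the same test subjects, via nonparametric bootstrap resampling."""
    rng = np.random.default_rng(seed)
    y, s_a, s_b = map(np.asarray, (y, s_a, s_b))
    obs = roc_auc(y, s_a) - roc_auc(y, s_b)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))
        if y[idx].min() == y[idx].max():   # resample must contain both classes
            continue
        diffs.append(roc_auc(y[idx], s_a[idx]) - roc_auc(y[idx], s_b[idx]))
    diffs = np.asarray(diffs)
    # How often does the centered bootstrap distribution exceed |observed|?
    p = np.mean(np.abs(diffs - diffs.mean()) >= abs(obs))
    return obs, p
```

DeLong's test computes the variance of the AUC difference analytically from the paired placement values; the bootstrap trades that closed form for simplicity.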

Figure 1.

Receiver operating characteristic curve of classifiers to predict modified Rankin Scale (mRS) >2 (A) and mRS >1 (B). The P-value was calculated using DeLong's test, comparing the curve of logistic regression (LR) with that of each machine learning model. AUC, area under curve; DL, deep learning; SVM, support vector machine; RF, random forest; XGB, XGBoost.

Figure 2.

Comparison of calibration and clinical utility across algorithms. (A, B) Reliability diagrams and (C, D) decision curves of models predicting 3-month modified Rankin Scale (mRS) >2 (A, C) and mRS >1 (B, D). LR, logistic regression; ECE, expected calibration error; DL, deep learning; SVM, support vector machine; RF, random forest; XGB, XGBoost.

Our study shows that ML models had better discriminative power, as evaluated by AUC, and better reliability than conventional LR models in predicting clinical outcome after a stroke. It should be noted, however, that both ML and LR models demonstrated moderate-to-good performance, and the ML models did not outperform the LR models in terms of clinical utility.

A strength of this study is that we evaluated the reliability and clinical utility of the models in addition to comparing their discriminative power. Assessment of the agreement between predicted and actual outcomes on the calibration plot is a requisite for model validation [6]. The clinical net benefit also needs to be evaluated, using decision curve analysis [7]. Our results indicate that ML was comparable or superior to LR in terms of reliability and clinical net benefit.
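For reference, the quantity plotted on a decision curve is the net benefit at a risk threshold pt, conventionally TP/n − (FP/n) · pt/(1 − pt). A minimal sketch, with made-up outcomes and predicted probabilities, might be:

```python
import numpy as np

def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold:
    TP/n - (FP/n) * pt / (1 - pt), the standard decision curve analysis formula."""
    y_true = np.asarray(y_true)
    pred = np.asarray(y_prob) >= threshold   # "treat" decision at this threshold
    n = len(y_true)
    tp = np.sum(pred & (y_true == 1))        # correctly treated (true positives)
    fp = np.sum(pred & (y_true == 0))        # unnecessarily treated (false positives)
    return tp / n - (fp / n) * threshold / (1 - threshold)
```

Plotting this quantity over a range of thresholds for each model, alongside the "treat all" and "treat none" strategies, yields the decision curves compared in Figure 2.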

ML is effective in dealing with wide data, where the number of variables per study subject is relatively large and interactions between variables exist [8]. Introducing mixed-media data, including imaging (computed tomography, magnetic resonance imaging) and biosignal data acquired from continuous monitoring (blood pressure, heart rate, electrocardiography, and electroencephalography), in addition to clinical variables with numeric or symbolic features, may enable the development of more accurate predictive ML models [9]. Training to predict an outcome with a strong signal-to-noise ratio, rather than one with a poor signal-to-noise ratio such as clinical outcome, may also improve the predictive power of ML [3].

A limitation of our study is that we used only baseline clinical variables; treatment-related factors were not included in model construction. Variables associated with acute stroke management to prevent stroke progression or recurrence, as well as patients' willingness to engage in active rehabilitation, could have a significant impact on functional recovery.

In conclusion, our study revealed that ML algorithms using baseline clinical parameters had better accuracy and reliability than, and clinical net benefit similar to, traditional LR models in predicting functional recovery after acute ischemic stroke.

Supplementary materials

Supplementary materials related to this article can be found online at https://doi.org/10.5853/jos.2020.02537.

Supplementary Table 1.

Characteristics of patients based on outcomes

Supplementary Table 2.

Confusion matrix and accuracy

Supplementary methods 1.

Selection of variables which were used as input for model

Supplementary methods 2.

Developments of model

Supplementary methods 3.

Evaluation of reliability and clinical benefit

Acknowledgements

This research was supported by grants from the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health and Welfare, Republic of Korea (grant numbers: HI18C2383 and HI12C1847).

Notes

The authors have no financial conflicts of interest.

References

1. Nishi H, Oishi N, Ishii A, Ono I, Ogura T, Sunohara T, et al. Predicting clinical outcomes of large vessel occlusion before mechanical thrombectomy using machine learning. Stroke 2019;50:2379–2388.
2. Heo J, Yoon JG, Park H, Kim YD, Nam HS, Heo JH. Machine learning-based model for prediction of outcomes in acute stroke. Stroke 2019;50:1263–1265.
3. Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol 2019;110:12–22.
4. Naeini MP, Cooper GF, Hauskrecht M. Obtaining well calibrated probabilities using Bayesian binning. Proc Conf AAAI Artif Intell 2015;2015:2901–2907.
5. Kerr KF, Brown MD, Zhu K, Janes H. Assessing the clinical impact of risk prediction models with decision curves: guidance for correct interpretation and appropriate use. J Clin Oncol 2016;34:2534–2540.
6. Van Calster B, Nieboer D, Vergouwe Y, De Cock B, Pencina MJ, Steyerberg EW. A calibration hierarchy for risk models was defined: from utopia to empirical data. J Clin Epidemiol 2016;74:167–176.
7. Van Calster B, Wynants L, Verbeek JFM, Verbakel JY, Christodoulou E, Vickers AJ, et al. Reporting and interpreting decision curve analysis: a guide for investigators. Eur Urol 2018;74:796–804.
8. Bzdok D, Altman N, Krzywinski M. Statistics versus machine learning. Nat Methods 2018;15:233–234.
9. Mitchell TM. Does machine learning really work? AI Mag 1997;18:11.
