Deep Learning–Based Automatic Classification of Stroke Size in Patients With Atrial Fibrillation
Dear Sir:
The optimal timing for initiating direct oral anticoagulants (DOACs) in patients with stroke and atrial fibrillation (AF) remains uncertain, largely because the risk of hemorrhagic transformation varies. Previous studies suggest that this risk is related to stroke severity as measured by the National Institutes of Health Stroke Scale (NIHSS) score [1,2]. However, the NIHSS score reflects not only the size of the infarct but also its location, potentially leading to discrepancies between the score and the actual infarct volume. Moreover, hemorrhagic transformation after ischemic stroke is closely related to infarct size [3]. Based on these findings, the Early versus Late Initiation of Direct Oral Anticoagulants in Post-ischemic Stroke Patients with Atrial Fibrillation (ELAN) trial stratified stroke size into minor, moderate, and major categories based on imaging, showing that early initiation of DOACs is likely safe and may reduce the risk of recurrent ischemic events [4].
In real-world practice, the scarcity of certified stroke centers and inconsistent access to in-person neurological consultation pose significant challenges [5]. Consequently, many institutions struggle to maintain sufficient expertise to reliably classify stroke size based on imaging criteria. Furthermore, imaging-based risk stratification in large-scale clinical trials is often limited by the need for precise stroke-volume assessment and the availability of multiple neurologists. To address these needs, we developed a deep learning algorithm that automatically classifies stroke size according to these imaging criteria, using diffusion-weighted imaging (DWI) data from stroke patients with AF.
The algorithm was trained on 1,091 DWI scans of ischemic stroke attributable to AF, collected from four hospitals between 2011 and 2021. An external validation dataset comprising 1,265 DWI scans was collected from 11 non-overlapping hospitals between 2017 and 2020 (Supplementary Methods and Supplementary Figure 1). The institutional review boards of all centers approved the study, and written informed consent was obtained.
For the training/internal validation dataset, stroke size classification was determined by an experienced vascular neurologist (WSR) using a standardized visual rating scheme from the ELAN trial (Supplementary Methods) [4]. For the external validation dataset, stroke size classification was independently determined by two vascular neurologists (WSR and HK) using the same criteria. In cases of disagreement, a consensus was reached and used as the ground truth for external validation. Stroke locations in the external validation dataset were classified into supratentorial, infratentorial, and mixed lesions (Supplementary Methods).
Infarct lesions on DWIs were automatically segmented using a validated 3D U-Net algorithm (JLK-DWI; JLK Inc., Seoul, Korea) [6]. For classification, we employed an EfficientNet3D model, modified to accept two-channel inputs (DWI and segmentation mask) and to output three classes corresponding to the stroke size categories. Additional details on model development are provided in Supplementary Methods and Supplementary Figure 2.
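As an illustration only, the two-channel input and three-class output described above can be sketched in plain Python. This is not the authors' implementation: the helper names `stack_channels` and `predict_label`, the toy volume, the example scores, and the use of a softmax to turn raw scores into a label are all assumptions made for this sketch.

```python
import math

CLASSES = ("minor", "moderate", "major")  # three output classes named in the text


def stack_channels(dwi, mask):
    """Pair a DWI volume with its binary lesion mask as a two-channel input.

    Both arguments are nested lists indexed [slice][row][col]. The result is
    indexed [channel][slice][row][col] (channel 0 = DWI, channel 1 = mask).
    """
    def shape(vol):
        return (len(vol), len(vol[0]), len(vol[0][0]))

    if shape(dwi) != shape(mask):
        raise ValueError("DWI and mask must have identical dimensions")
    return [dwi, mask]


def predict_label(logits):
    """Convert three raw scores into (label, probabilities) via softmax (assumed)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return CLASSES[probs.index(max(probs))], probs


# Toy 2x2x2 volume with hypothetical values, standing in for a real DWI patch.
dwi = [[[0.1, 0.9], [0.2, 0.8]], [[0.0, 0.7], [0.3, 0.1]]]
mask = [[[0, 1], [0, 1]], [[0, 1], [0, 0]]]
x = stack_channels(dwi, mask)

# Hypothetical network outputs; a real model would compute these from x.
label, probs = predict_label([0.2, 2.1, -0.5])
```

In a real pipeline the stacked volume would be passed to the trained network; here the scores are supplied by hand so that the mapping from scores to a stroke size label is visible.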
The algorithm’s performance was compared with expert consensus using percentage agreement, Cohen’s kappa, and area under the receiver operating characteristic curve (AUC). Inter-rater percentage agreement and Cohen’s kappa were also calculated, comparing classifications made by two vascular neurologists. Details for statistical analysis are provided in Supplementary Methods.
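For readers less familiar with these agreement statistics, the following minimal sketch computes percentage agreement and Cohen's kappa from two lists of labels. It assumes unweighted kappa (the letter does not state whether a weighted variant was used), and the function names and the six example ratings are hypothetical.

```python
from collections import Counter


def percent_agreement(a, b):
    """Percentage of cases on which two raters give the same label."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Unweighted Cohen's kappa: agreement corrected for chance."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of the two raters' marginal label frequencies.
    p_expected = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical ratings for six scans (not study data).
rater_1 = ["minor", "minor", "moderate", "major", "major", "moderate"]
rater_2 = ["minor", "moderate", "moderate", "major", "major", "major"]

agreement = percent_agreement(rater_1, rater_2)  # 4 of 6 scans match
kappa = cohens_kappa(rater_1, rater_2)           # (2/3 - 1/3) / (1 - 1/3) = 0.5
```

The example shows why kappa is lower than raw agreement: with three evenly used categories, a third of the matches would be expected by chance alone.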
The mean (SD) ages for the internal and external datasets were 73.6 (10.3) years and 75.2 (10.2) years, respectively, with 54.4% and 53.0% of participants being male (Supplementary Tables 1 and 2). In the external validation dataset, the percentage agreement and Cohen’s kappa between the deep learning algorithm and the consensus of two vascular neurologists were 87.4% (95% confidence interval [CI], 85.4–89.2) and 0.81 (95% CI, 0.78–0.84), respectively, with comparable performance in the training/internal validation dataset. The algorithm also demonstrated notable accuracy for each stroke size classification (Table 1). The AUC values for classifying minor, moderate, and major stroke categories in the external validation dataset were 0.988, 0.955, and 0.988, respectively, with similar performance in the training/internal validation dataset (Figure 1). In comparison, the percentage agreement between the two vascular neurologists was 74.6%, with a Cohen’s kappa of 0.62 (Supplementary Table 3).
Table 1. Agreement, confusion matrix, and diagnostic accuracy of stroke size classification between a deep learning algorithm and vascular neurologists
Figure 1. Receiver operating characteristic (ROC) curves for the classification of stroke size using a deep learning algorithm. (A) Comparisons of the ROC curves for minor, moderate, and major classification in the internal validation dataset. The area under the ROC curve (AUC) values for classifying minor, moderate, and major stroke categories in the internal validation dataset were 0.973, 0.929, and 0.970. (B) Comparisons of the ROC curves for minor, moderate, and major classification in the external validation dataset. The AUC values for classifying minor, moderate, and major stroke categories in the external validation dataset were 0.988, 0.955, and 0.988.
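A per-class AUC for a three-class model is commonly obtained by treating each class in turn as "positive" and all others as "negative" (one-vs-rest). The sketch below assumes that convention, which the letter does not spell out, and uses the rank-based definition of AUC: the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, with ties counted as one half. The function name and data are hypothetical.

```python
def one_vs_rest_auc(y_true, scores, positive):
    """AUC for one class against all others, via pairwise comparisons.

    y_true  : list of true class labels
    scores  : model score (e.g. predicted probability) for the `positive` class
    positive: the class treated as positive
    """
    pos = [s for y, s in zip(y_true, scores) if y == positive]
    neg = [s for y, s in zip(y_true, scores) if y != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    # Count positive-negative pairs ranked correctly; ties count as 0.5.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities of "major" (not study data).
y_true = ["major", "minor", "major", "moderate", "minor"]
p_major = [0.9, 0.1, 0.4, 0.6, 0.2]

auc_major = one_vs_rest_auc(y_true, p_major, positive="major")  # 5 of 6 pairs
```

The pairwise form is quadratic in the number of cases, which is fine for illustration; library implementations use sorting to reach the same value more efficiently.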
After stratifying by stroke location, the deep learning algorithm showed high agreement with stroke experts for supratentorial and infratentorial lesions, achieving Cohen’s kappa values of 0.82 (95% CI, 0.79–0.85) and 0.85 (95% CI, 0.76–0.93), respectively (Table 2). For mixed lesions, agreement was lower with a kappa of 0.61 (95% CI, 0.49–0.74). The mean infarct volume also varied among stroke size classifications within each lesion location category (Supplementary Figure 3 and Supplementary Table 4).
Table 2. Agreement in stroke size classification between a deep learning algorithm and the consensus of vascular neurologists, categorized by infratentorial, supratentorial, and mixed locations
In patients who underwent DWI within 24 hours of symptom onset, Cohen’s kappa was 0.81 (95% CI, 0.78–0.88). When the window was extended to 48 hours, the model exhibited similar performance, with a Cohen’s kappa of 0.81 (95% CI, 0.78–0.87) (Supplementary Table 5).
Additionally, stroke size predicted by the algorithm was significantly associated with the frequency of symptomatic hemorrhagic transformation (Supplementary Figure 4). The mean processing time from raw image to output in a graphics processing unit (GPU) environment was 5.188 seconds (SD, 0.654) across 100 randomly selected DWI scans.
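The association with symptomatic hemorrhagic transformation was assessed with a Cochran–Armitage trend test (Supplementary Figure 4). As a rough sketch of how such a test works, the code below implements the standard normal-approximation form with equally spaced scores for the ordered categories. The counts are hypothetical and chosen only for illustration; they are not the study's data, and the score spacing is an assumption.

```python
import math


def cochran_armitage(events, totals, scores=None):
    """Two-sided Cochran-Armitage trend test for ordered groups.

    events[i] and totals[i] are the event count and group size in category i.
    Returns (z, p_value) using the normal approximation.
    """
    if scores is None:
        scores = list(range(len(events)))  # equally spaced scores 0, 1, 2, ...
    n_all = sum(totals)
    p_bar = sum(events) / n_all            # pooled event proportion
    # Trend statistic: score-weighted excess of observed over expected events.
    t_stat = sum(s * (r - n * p_bar) for s, r, n in zip(scores, events, totals))
    s1 = sum(n * s for s, n in zip(scores, totals))
    s2 = sum(n * s * s for s, n in zip(scores, totals))
    variance = p_bar * (1 - p_bar) * (s2 - s1 * s1 / n_all)
    z = t_stat / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value


# HYPOTHETICAL counts (not from the study): events per predicted class.
events = [0, 3, 9]          # minor, moderate, major
totals = [100, 100, 100]
z, p_value = cochran_armitage(events, totals)
```

A positive z indicates that the event proportion rises across the ordered categories, which is the pattern a size-dependent bleeding risk would produce.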
In this study, we developed and validated a deep learning algorithm to classify stroke size in AF-related stroke using multicenter and multivendor datasets, achieving excellent agreement with stroke experts. To our knowledge, this is the first study to develop a deep learning model that automatically classifies stroke size for severity prediction based on DWI.
Several observational studies have established that the risk of hemorrhagic transformation in AF-related stroke is closely related to infarct size, supporting neuroimaging-based risk stratification to minimize intracranial hemorrhage [7,8]. In addition to the ELAN trial, a recent meta-analysis suggested that early DOAC initiation may reduce recurrent ischemic stroke risk by 36% without increasing intracranial hemorrhage [9]. Our model effectively classified minor, moderate, and major cases separately and showed even higher agreement when categorizing patients into non-major versus major cases. These findings suggest our model could help guide DOAC initiation timing, particularly for physicians with less experience.
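Higher agreement after merging categories is expected: once minor and moderate are pooled as non-major, disagreements between those two classes no longer count. The short sketch below illustrates this with hypothetical labels; the function names and the data are assumptions made for illustration.

```python
def collapse(labels):
    """Map three-class labels onto the non-major versus major dichotomy."""
    return ["major" if label == "major" else "non-major" for label in labels]


def agreement_rate(a, b):
    """Fraction of cases with identical labels."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical expert and model labels for five scans (not study data).
expert = ["minor", "moderate", "moderate", "major", "major"]
model = ["moderate", "moderate", "minor", "major", "moderate"]

three_class = agreement_rate(expert, model)                  # 2 of 5 match
two_class = agreement_rate(collapse(expert), collapse(model))  # 4 of 5 match
```

Only the last scan, where a major infarct is labeled moderate, remains a disagreement after collapsing; this is also the type of error with the most bearing on a decision to delay anticoagulation.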
Furthermore, the algorithm’s mean processing time from raw DWI input to stroke size classification was approximately 5 seconds (Supplementary Discussion). This rapid processing could facilitate large-scale studies, enabling further research on infarct volume, DOAC initiation, and the risk of intracranial hemorrhage.
In conclusion, this algorithm has the potential to assist less experienced physicians in optimizing DOAC initiation timing and supports the use of large neuroimaging datasets in future research.
Supplementary materials
Supplementary materials related to this article can be found online at https://doi.org/10.5853/jos.2025.00423.
Supplementary Table 1. Baseline characteristics for training/internal validation and external validation dataset
Supplementary Table 2. Detailed characteristics of MRI vendors and protocols for training/internal validation and external validation dataset
Supplementary Table 3. Agreement of stroke size classification between two vascular neurologists
Supplementary Table 4. Stroke volume of stroke size classification in the datasets for external validation in total, supratentorial, infratentorial, and mixed location
Supplementary Table 5. Subgroup analysis based on symptom onset to imaging time (<24 hr and <48 hr)
Supplementary Figure 1. A flowchart of the patient selection process. MRI, magnetic resonance imaging; DWI, diffusion-weighted imaging; AF, atrial fibrillation; NVAF, nonvalvular atrial fibrillation.
Supplementary Figure 2. Deep learning model to classify stroke size. Infarct lesions on diffusion-weighted imaging (DWI) were segmented using a validated 3D U-Net algorithm (JLK-DWI, JLK Inc., Seoul, Korea). The DWI images and segmentation masks were processed into 256×256×64 voxel patches, serving as two-channel inputs (DWI signal intensities and binary segmentation masks) for a 3D adaptation of EfficientNet (EfficientNet3D, efficientnet-b0 configuration). The model was designed to classify stroke size into three categories: minor, moderate, and major.
Supplementary Figure 3. Log-transformed infarct volume of stroke size classification in the datasets for external validation in supratentorial, infratentorial, and mixed location. Stroke volume tended to increase progressively across minor, moderate, and major stroke size categories within each location. Additionally, stroke volume varied by location, being largest in supratentorial regions. Detailed information and statistical data on stroke volume are provided in Supplementary Table 4.
Supplementary Figure 4. Cochran–Armitage test between predicted stroke size classification and symptomatic hemorrhagic transformation (sHT). Based on the stroke size classification predicted by the algorithm, cases classified as major had the highest rate of actual sHT, while cases classified as minor showed no sHT. The Cochran–Armitage test showed a significant association between the stroke size predicted by the deep learning model and sHT.
Notes
Funding statement
None
Conflicts of interest
Hokyu Kim, Hoyoun Lee, and Wi-Sun Ryu are employees of JLK Inc. Dong-Eog Kim reports holding stocks in JLK Inc. Hee-Joon Bae reports holding stocks in JLK Inc., as well as grants from Bayer Korea, Bristol Myers Squibb Korea, Chong Kun Dang Pharmaceutical Corp., Dong-A ST, Korean Drug Co., Ltd., Samjin Pharm, and Takeda Pharmaceuticals Korea Co., Ltd., and personal fees from Amgen Korea, Bayer, Daiichi Sankyo, JW Pharmaceutical, Hanmi Pharmaceutical Co., Ltd., Otsuka Korea, SK Chemicals, and Viatris Korea, outside the submitted work. The other authors report no conflicts of interest.
Author contribution
Conceptualization: WSR, HJB. Study design: WSR, HJB, HK. Methodology: WSR, HJB, HK, HL. Data collection: DYK, HGJ, KJL, BJK, MKH, KHC, DIS, DEK, JMP, KK, JGK, SJL, MSO, KHY, BCL, HKP, KSH, YJC, JCC, SIS, JHH, THP, JHK, WJK, JL, HJB. Investigation: HK, HL. Statistical analysis: HK, WSR, HJB. Writing—original draft: HK. Writing—review & editing: WSR, HJB. Approval of final manuscript: all authors.
Acknowledgments
The authors appreciate the contributions of all members of the Comprehensive Registry Collaboration for Stroke in Korea to this study.
