1. Introduction
Ischemic stroke remains one of the leading causes of death and disability worldwide. Every year, millions of people suffer a stroke, making it a major public health problem in many countries. One of the key factors determining a patient’s prognosis is the speed and accuracy of diagnosis. Timely and high-quality diagnosis of stroke is necessary to start effective treatment, which can significantly reduce the risk of irreversible consequences and disability. In recent years, significant attention has been paid to the use of modern technologies, such as deep learning [1], to improve diagnostic methods based on medical images, including magnetic resonance imaging (MRI) and computed tomography (CT). From an economic standpoint, the fastest and most cost-effective diagnostic method is non-contrast CT.
Medical imaging plays a key role in stroke diagnosis, and brain image segmentation is a crucial step in assessing a patient’s condition. Traditional segmentation methods, such as threshold filtering [2] or active contour segmentation [3], are limited in their ability to accurately and efficiently identify ischemic stroke lesions, especially in 3D images.
In recent years, significant progress in medical image segmentation has been achieved through the use of 3D convolutional neural networks (3D CNNs) [4], which enable more accurate extraction of spatial information from tomographic data. Modern deep learning models, including CNNs, are applied in a wide range of tasks—from medical image analysis and disease diagnosis to handwritten text recognition and natural language processing. A key challenge is the development of architectures capable of efficiently processing 3D medical images, such as MRI and CT scans, where capturing spatial dependencies is crucial for an accurate diagnosis.
One of the most well-known and effective architectures is 3D U-Net. Studies conducted since 2021 have shown that this model outperforms alternative approaches, such as fully convolutional networks (FCNs) and V-Net, particularly in segmenting complex pathologies, including ischemic stroke [5]. Modifications of 3D U-Net, such as integration with various attention layers and enhanced decoding mechanisms, have significantly improved accuracy when working with limited annotated data [6]. In particular, Zhang et al. (2024) proposed a liver segmentation model for CT images based on 3D U-Net with a dual attention mechanism, which significantly improves accuracy and robustness to noise by enabling a deeper analysis of spatial–contextual relationships [7]. Such modifications confirm the effectiveness of attention mechanisms in processing volumetric medical data.
Although the FCN and V-Net models are also used for medical image segmentation, they are inferior to 3D U-Net in terms of accuracy and robustness when processing medical data. The FCN performs well in 2D segmentation; however, when applied to 3D images, it suffers from spatial information loss, which reduces prediction accuracy [8]. Similarly, while 3D V-Net is effective for volumetric data segmentation, it requires substantial computational resources and exhibits poorer generalization across heterogeneous datasets compared to 3D U-Net [9,10].
One of the key challenges in medical image segmentation is class imbalance, where healthy tissue occupies a significantly larger area in an image than the affected regions. This imbalance can hinder model training and reduce the accuracy of pathology segmentation. To address this issue, researchers have proposed various methods, such as weighted loss functions and enhanced sample generation algorithms. Recent studies, such as the work by Yeung et al. (2022), demonstrate that these methods significantly improve segmentation performance in cases of severe class imbalance [11]. Garcia-Salgado et al. (2024) also developed an Attention U-Net model with a generalized Dice focal loss function, which enabled high segmentation accuracy of ischemic lesions on MRI images even in the presence of severe class imbalance. This highlights the effectiveness of combining attention mechanisms with adapted loss functions in stroke segmentation tasks [12]. Additionally, approaches that integrate data from multiple sources have been proposed, further improving predictions in complex cases [13]. Another major challenge is the need to develop more accurate and robust models capable of handling various types of medical images and quickly adapting to new data.
To improve the accuracy and reliability of ischemic stroke diagnosis based on medical imaging, there is a growing need for approaches that combine predictions from multiple models to achieve more reliable results. Amirgaliyev et al. (2021) demonstrated the application of ensemble methods in computer vision tasks, highlighting the importance of combining different algorithms to enhance reliability and recognition accuracy [14]. Berikov and Cherikbayeva (2018) introduced a combined method based on cluster ensembles and kernel functions, demonstrating high efficiency in searching for an optimal classifier [15].
The study by Dobshik et al. (2023) explores the application of 3D CNNs for segmenting acute ischemic stroke lesions on non-contrast CT images. The authors demonstrate the effectiveness of deep learning in automated medical data processing, emphasizing the importance of three-dimensional analysis for improving diagnostic accuracy. This approach confirms the potential of using 3D CNNs for stroke segmentation and can serve as a foundation for further development of ensemble methods [16]. Additionally, Vezakis et al. (2024) proposed a hybrid multi-dimensional approach that combines 2D U-Net and 3D Attention U-Net for organ segmentation using positron emission tomography (PET) data. Their work demonstrates that integrating different U-Net modifications can achieve accurate segmentation with minimal computational cost [17]. Yousef et al. (2023) explored the optimization of U-Net models for brain MRI, proposing a combination of enhanced U-Net architectures that deliver high segmentation accuracy while reducing computational costs—an especially important factor in clinical practice [18].
According to a number of studies, the 3D U-Net architecture is widely used due to its ability to extract spatial features at multiple scales, making it one of the most effective models for medical image segmentation. However, its limited spatial context coverage, which is associated with the fixed convolution kernel size, may reduce its effectiveness in capturing long-range dependencies between image regions. This is particularly relevant for ischemic stroke segmentation, where lesion boundaries can be blurred or poorly defined.
To overcome this limitation, there has been growing interest in transformer-based architectures in recent years, which were originally developed for natural language processing tasks. For example, the Segment Anything Model (SAM), introduced by Meta AI in 2023, has become a key example of a scalable segmentation model capable of working with various types of images without the need for fine-tuning [19]. These models exhibit a strong ability to model global contextual relationships, making them promising for the analysis of complex three-dimensional medical data. Based on this, hybrid architectures that combine the U-shaped structure with a transformer-based encoder have emerged. One of the most prominent examples of this approach is the UNETR (U-Net with Transformers) model, proposed by Hatamizadeh et al. (2022), which uses a vision transformer (ViT) as the encoder. This approach allows for the extraction of spatial features without the need for prior image aggregation, thereby improving segmentation accuracy [20]. An advancement of this idea is the Swin UNETR model, which is based on the Swin Transformer and utilizes window-based self-attention and hierarchical feature representation. This enables the model to effectively capture both local and global dependencies within the image. Swin UNETR has demonstrated high accuracy in brain structure segmentation and, according to recent studies [21,22,23], holds significant potential for application in automated stroke diagnosis tasks.
Despite significant progress, several challenges remain in stroke diagnosis using deep learning methods. One of the key issues is the need to improve model interpretability, which is particularly crucial in medical practice, where errors can have serious consequences. In recent years, researchers have increasingly focused on developing models that provide not only high accuracy but also the ability to interpret neural network decisions. For example, the study by Jabal et al. (2022) demonstrates how visualization techniques can be used to interpret deep model decisions in the context of medical diagnostics [24].
Additionally, a promising direction is the optimization of computational operations required for efficient processing of three-dimensional medical data. For example, in the patent by Tynymbaev et al. (2024) [25], a device was proposed for multiplying polynomials modulo an irreducible polynomial with integrated modular reduction. Although this solution was primarily developed for cryptographic applications, it reflects a general trend toward improving computational efficiency, which may also be relevant for computer-aided medical image analysis systems that require high performance when working with volumetric data.
The aim of this study is to develop a highly accurate and robust neural network model for ischemic stroke segmentation on 3D CT images. We hypothesize that employing an ensemble of transformer-based models (SE-UNETR and Swin UNETR) will achieve higher accuracy and robustness in 3D ischemic stroke lesion segmentation on CT images compared to individual models. To achieve this goal, this study proposes an ensemble approach for automated 3D segmentation of ischemic stroke lesions in brain CT scans. The ensemble is composed of two transformer-based architectures: the SE-UNETR model developed as part of this work, which is a modified version of UNETR incorporating squeeze-and-excitation (SE) blocks, and the Swin UNETR architecture, which is based on window-based self-attention and hierarchical feature representation. The model predictions are combined using a weighted voting mechanism, allowing adaptive consideration of each model’s confidence in forming the final output. The proposed approach was evaluated on real clinical data from 98 patients with confirmed ischemic stroke diagnoses.
The novelty of the proposed method lies in the systematic application and evaluation of an ensemble of transformer models for the task of 3D stroke segmentation on non-contrast CT images. Unlike existing studies that focus on individual architectures or their modifications, this research demonstrates that combining models with different attention mechanisms can lead to a significant improvement in segmentation accuracy. This study is the first to conduct a comparative analysis of transformer model ensembling for stroke segmentation on CT data. Notably, the effectiveness of an ensemble combining SE-UNETR and Swin UNETR has not been previously investigated on this clinical dataset. The experiments conducted on real clinical data confirm the effectiveness of the proposed ensemble and add originality to this work, highlighting its practical relevance for the development of reliable solutions in the field of medical diagnostics.
The study evaluates the effectiveness of the proposed ensemble compared to individual models and explores potential directions for further model improvement.
The paper is organized as follows: Section 2 provides a detailed description of the materials and methods, including the dataset used, data preprocessing steps, model architecture, and evaluation metrics. Section 3 presents the results obtained. Section 4 is dedicated to the discussion of the findings, while the conclusion (Section 5) summarizes the study and suggests potential directions for future work.
2. Materials and Methods
2.1. Dataset
For this study, a dataset of CT brain images from 98 patients diagnosed with acute ischemic stroke was used. The data were acquired using a Philips Ingenuity CT scanner and provided by the International Center for Tomography, SB RAS [16], ensuring their high quality and accuracy. The CT datasets used in this study are not publicly available due to medical data-sharing agreements. However, they may be provided by the corresponding author upon reasonable request, subject to confidentiality requirements. All images were provided in a fully de-identified form and stored in NIfTI (.nii) [26] format, which is standard in medical imaging and neuroscience. Prior to being transferred for research purposes, all metadata containing personal or potentially identifying information (patient name, patient ID, date of birth, age, sex, study date and time, institution information, etc.) were removed from the original data. Only fully anonymized data were used for the analysis. Demographic and clinical information about the patient population was not provided.
Each CT image consists of a series of slices, ranging from 306 to 505, depending on the specific patient’s characteristics. The size of each slice is 512 × 512 pixels, providing sufficient detail for analysis. The slice thickness is 0.5 mm, which is a standard parameter for modern CT scanners and ensures the necessary accuracy for segmentation and analysis.
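For illustration, a volume in this format can be read with NiBabel, which is used later in this work for preprocessing; the filenames below are hypothetical placeholders:

```python
import nibabel as nib

# Load one anonymized study and its lesion mask (hypothetical filenames).
img = nib.load("patient_001.nii")
mask = nib.load("patient_001_mask.nii")

volume = img.get_fdata()    # Hounsfield-unit array, e.g., (512, 512, 306..505)
labels = mask.get_fdata()   # binary lesion mask of the same shape
print(volume.shape, img.header.get_zooms())  # dimensions and voxel spacing (mm)
```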
Each image set is accompanied by manual segmentation performed by experienced radiologists. The segmentation procedures were carried out by two specialists with 9–13 years of experience and PhD degrees in radiology and radiation therapy, using the 3D Slicer 5.6.2 [27] software. During the segmentation process, brain regions affected by ischemic stroke were identified, allowing these data to be used for training and testing diagnostic and segmentation algorithms. To ensure the high quality and reliability of the ground truth, the annotations were agreed upon by the specialists. The image dataset in this case refers to a three-dimensional CT volume consisting of a sequence of axial slices. Segmentation is performed on each slice within the 3D volume, and as a result, a volumetric mask is generated, corresponding to anatomical structures and pathological lesions throughout the entire image volume. In total, approximately 29,988–49,490 slices were processed in this study, providing a large and diverse set of training and testing examples for building and evaluating the segmentation model.
In this study, no missing data were identified. All provided CT images and corresponding segmentations were complete, with no missing slices or corrupted files. Prior to analysis, the data underwent integrity and completeness checks: the number of scans and segmentations was compared, image dimensions were verified, and the absence of empty masks or corrupted files was confirmed. No patients were excluded from the dataset, as it was provided in a complete form.
For a clear demonstration of the data structure and annotation quality, Figure 1 presents axial CT slices from three different patients with overlaid masks of the affected areas (shown in red). The images are accompanied by an intensity scale in Hounsfield units (HU), reflecting tissue density, as well as a linear scale along the axes indicating the real size of the images in pixels. The red area corresponds to the annotated lesion mask, which was derived from the original annotations made by experienced radiologists. This visualization allows for the simultaneous evaluation of lesion localization, segmentation quality, and image detail, which is crucial for the subsequent training and testing of the model.
The dataset was randomly split by patients into the training and validation sets. Out of a total of 98 patients, 80% (78 patients) were allocated for training and the remaining 20% (20 patients) for validation. This approach prevents the overlap of slices from the same patient between the training and validation sets and ensures a proper assessment of the model’s generalization capability.
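A minimal sketch of such a patient-wise split, assuming patient identifiers as strings and a fixed random seed chosen here purely for illustration:

```python
import random

def split_by_patient(patient_ids, val_fraction=0.2, seed=42):
    """Patient-wise train/validation split: no patient contributes slices
    to both sets (sketch; the seed value is an illustrative assumption)."""
    ids = sorted(patient_ids)
    random.Random(seed).shuffle(ids)
    n_val = round(len(ids) * val_fraction)
    return ids[n_val:], ids[:n_val]   # train_ids, val_ids

train_ids, val_ids = split_by_patient([f"patient_{i:03d}" for i in range(1, 99)])
print(len(train_ids), len(val_ids))   # 78 20
```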
This work is a retrospective observational non-interventional study based on previously acquired CT images. Due to its non-interventional nature, registration in a clinical trial registry was not required.
2.2. Preprocessing
When working with CT images for stroke diagnosis, it is essential to perform a series of preprocessing steps to enhance data quality and extract relevant features for further analysis. The main preprocessing steps include the following:
Brain binarization (HB-BET)
The first step is to extract the brain region from the entire CT image [28]. For this purpose, the HB-BET (Hunt and Bender Brain Extraction Tool) method is used, which is an algorithm designed to remove unnecessary parts of the image (e.g., the skull), leaving only the brain region. This is crucial for eliminating irrelevant elements and isolating the brain structure for further analysis.
The algorithm utilizes threshold segmentation and gradient analysis to extract the brain region. The main binarization equation can be expressed as

$$B(x, y, z) = \begin{cases} 1, & I(x, y, z) > T \\ 0, & \text{otherwise,} \end{cases}$$

where $I(x, y, z)$ is the pixel intensity at point $(x, y, z)$ and $T$ is the threshold value, which is determined during the image analysis process.

The threshold value $T$ is calculated based on the statistical characteristics of the intensities in the image. $T$ is defined as

$$T = \mu + k\sigma,$$

where $\mu$ is the mean intensity value within the presumed brain region, $\sigma$ is the standard deviation of these values, and $k$ is an empirical coefficient.
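A minimal NumPy sketch of this thresholding rule; the soft-tissue window used to estimate the statistics and the value of k are assumptions, since the text does not specify them:

```python
import numpy as np

def binarize_brain(volume_hu, k=0.5):
    """Threshold binarization with T = mu + k * sigma (sketch)."""
    # Assumption: statistics are estimated over a rough soft-tissue HU window,
    # standing in for the "presumed brain region" of the text.
    tissue = volume_hu[(volume_hu > 0) & (volume_hu < 100)]
    threshold = tissue.mean() + k * tissue.std()
    return (volume_hu > threshold).astype(np.uint8)
```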
Cropping to the non-zero region
After extracting the brain region, the next step is cropping the image to the non-zero region, as described in Algorithm 1. This helps remove the background and areas with zero values that do not contain useful information.
Algorithm 1: Cropping the brain region to the non-zero area.
Input: Three-dimensional binary brain mask obtained after HB-BET.
Output: Cropped CT volume containing only the brain region.
Steps:
- (a) Determine the minimum and maximum coordinates of non-zero voxels along each axis: $(x_{\min}, x_{\max})$, $(y_{\min}, y_{\max})$, $(z_{\min}, z_{\max})$.
- (b) Crop the volume to the region $[x_{\min} : x_{\max},\, y_{\min} : y_{\max},\, z_{\min} : z_{\max}]$.
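Algorithm 1 translates directly into a few lines of NumPy: the bounding box of the non-zero mask voxels is computed along each axis and used to crop the volume.

```python
import numpy as np

def crop_to_nonzero(volume, brain_mask):
    """Crop a CT volume to the bounding box of non-zero brain-mask voxels."""
    coords = np.argwhere(brain_mask > 0)              # (n_voxels, 3) indices
    (x0, y0, z0), (x1, y1, z1) = coords.min(0), coords.max(0)
    return volume[x0:x1 + 1, y0:y1 + 1, z0:z1 + 1]
```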
Intensity normalization
The next preprocessing step is intensity normalization [29]. CT images can be acquired with different parameters and intensity ranges, which can complicate analysis. Normalization of all pixel intensities to a standard range of [0, 1] was performed, improving stability when applying machine learning methods. This helps to avoid issues related to brightness and contrast differences across different scans.
Min-max normalization was performed using the following formula:

$$I_{\text{norm}} = \frac{I - I_{\min}}{I_{\max} - I_{\min}},$$

where $I_{\max}$ and $I_{\min}$ are the maximum and minimum intensity values in the image.
Contrast enhancement
After normalization, contrast can be further improved using the histogram equalization method. This technique is particularly useful for medical images, where it is essential to highlight small and barely visible structures, such as brain lesions.
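A combined sketch of the normalization and contrast-enhancement steps; applying histogram equalization slice by slice with scikit-image is one plausible reading, since the text does not state whether equalization is performed in 2D or 3D:

```python
import numpy as np
from skimage import exposure

def normalize_and_equalize(volume):
    """Min-max normalization to [0, 1], then histogram equalization."""
    v_min, v_max = volume.min(), volume.max()
    normalized = (volume - v_min) / (v_max - v_min + 1e-8)
    # Assumption: equalize each axial slice independently (slices on axis 0).
    return np.stack([exposure.equalize_hist(s) for s in normalized], axis=0)
```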
Figure 2 presents an example from the dataset after preprocessing, illustrating the results of image processing aimed at improving data quality and enhancing key features for further analysis.
2.3. Methods
This study presents the architecture of an ensemble model designed for automated 3D segmentation of ischemic stroke lesions in brain CT images. The proposed ensemble is composed of two transformer-based models, SE-UNETR and Swin UNETR [21], each contributing to improved segmentation accuracy and robustness through differences in attention mechanisms and spatial–contextual feature processing. SE-UNETR was selected due to its integration of SE blocks, which improve feature selectivity and accuracy for small lesions, while Swin UNETR was chosen for its efficient combination of local and global attention with low computational costs. The model predictions are combined using a weighted voting mechanism, providing a more reliable final decision.
The first model in the ensemble is SE-UNETR—a modified version of the transformer-based UNETR architecture that was developed in this study for 3D segmentation of brain CT images. A distinctive feature of this model is the integration of SE blocks [30] into the decoder, which enables adaptive channel recalibration and contributes to more accurate delineation of pathological regions. The developed modification is referred to as SE-UNETR.
UNETR [20] combines the strengths of U-Net and the ViT, enabling the simultaneous use of both local and global image features. The input to the model is a 3D CT image of size (H × W × D × 1), where H, W, and D represent the height, width, and depth of the volume, respectively. In the first stage, the input image is divided into patches of size 16 × 16 × 16 voxels, and each patch is linearly projected into a feature space with 768 dimensions, forming the input to the ViT encoder.
The encoder consists of 12 transformer blocks, each containing a multi-head self-attention mechanism, layer normalization (LayerNorm), and fully connected layers (Multilayer Perceptron, MLP). Features are extracted after every third transformer block (i.e., after layers 3, 6, 9, and 12), which are then used to construct the decoder. This enables skip connections similar to the classic U-Net model, preserving spatial information from the early layers.
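For orientation, a baseline UNETR with this configuration is available in the MONAI library used in this work; a hedged instantiation might look as follows (the decoder feature_size is an assumption, as the text does not state it):

```python
from monai.networks.nets import UNETR

# Baseline UNETR: ViT encoder with 12 transformer blocks, hidden size 768,
# 16 x 16 x 16 patches; intermediate ViT features feed the decoder skips.
backbone = UNETR(
    in_channels=1,
    out_channels=1,
    img_size=(128, 128, 128),
    hidden_size=768,
    mlp_dim=3072,
    num_heads=12,
    feature_size=16,   # assumption: decoder width is not given in the text
)
```

SE-UNETR extends the decoder of this backbone with SE blocks, as described below.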
The main difference between SE-UNETR and the baseline UNETR model is the addition of SE blocks after each convolutional layer in the decoder. A detailed diagram of one such decoder block is shown in Figure 3.
Each decoder block includes two consecutive 3D convolutional layers with a kernel size of 3 × 3 × 3, followed by instance normalization (IN) and a ReLU activation function. To enhance selectivity for the most informative features, a SE block [30] is applied after each convolutional layer, performing adaptive channel recalibration. The SE block operates in the following stages:
Squeeze: Spatial features are collapsed into a single vector using global average pooling, allowing the aggregation of information for each channel.
Excitation: The resulting vector is passed through two fully connected layers with activation functions. This enables the model to learn which channels are more relevant for the specific task. A sigmoid function is applied at the output to generate weights that represent the importance of each channel.
Scale: The weight coefficients are reshaped back to match the input tensor dimensions and used to scale (recalibrate) each channel. This enhances informative features and suppresses less important ones.
These blocks allow the model to focus on the most relevant feature channels, adapting its attention to stroke-specific characteristics, such as faint, blurred, or small lesions. After the SE blocks, the output tensor is passed to the next decoder level, followed by a transposed convolution and concatenation with the corresponding encoder feature map.
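A minimal PyTorch sketch of a 3D SE block operating in these three stages; the reduction ratio of the bottleneck is an assumption, as the text does not specify it:

```python
import torch
import torch.nn as nn

class SEBlock3D(nn.Module):
    """Squeeze-and-excitation block for 3D feature maps (sketch)."""

    def __init__(self, channels: int, reduction: int = 8):  # reduction assumed
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool3d(1)      # global average pooling
        self.excite = nn.Sequential(                # two fully connected layers
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                           # per-channel weights in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c = x.shape[:2]
        w = self.squeeze(x).view(b, c)              # squeeze: (B, C)
        w = self.excite(w).view(b, c, 1, 1, 1)      # excitation: channel weights
        return x * w                                # scale: recalibrate channels
```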
In the final stage, the model uses a 1 × 1 × 1 convolution to reduce the number of channels to 1, corresponding to the binary segmentation mask (lesion/non-lesion). The resulting prediction map has the same dimensions as the input image: (H × W × D × 1).
The second method in the ensemble is the Swin-UNETR [21] architecture, which combines the strengths of the Swin Transformer and a U-Net-like decoder for 3D medical segmentation. Unlike the classical UNETR model, which is based on global attention (ViT), Swin-UNETR employs localized window-based attention mechanisms (window-based multi-head self-attention, W-MSA), allowing it to efficiently capture both local and global dependencies with lower computational overhead.
The model takes as input 3D data of size H × W × D × 1, representing a single-channel CT volume. In the first stage, the image is divided into non-overlapping 3D patches, which are projected into a feature space with 48 dimensions. These patches are then processed through four encoder stages with block depths of 2, 2, 2, and 2, where each stage uses Swin Transformer blocks with the number of attention heads set to 3, 6, 12, and 24, respectively. Between the stages, a patch-merging operation is used to reduce the spatial resolution and increase the number of channels, forming a hierarchical representation. The outputs from each stage are preserved and passed to the decoder via U-Net-style skip connections. The decoder includes transposed convolution blocks and convolutional layers that restore the spatial resolution and refine object localization. The final layer transforms the features into an output map of size H × W × D × 1. A sigmoid function is applied to generate a probabilistic mask, enabling the interpretation of each voxel’s value as the probability of belonging to the target class.
Figure 4 illustrates the architecture of Swin-UNETR, which comprises four encoder stages based on the Swin Transformer, a U-Net-like decoder with skip connections, and an output head for binary segmentation. The model was trained using the DiceLoss function with the parameter sigmoid=True, which enables effective handling of imbalanced data in binary segmentation tasks.
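As a sketch, this configuration can be instantiated with MONAI’s SwinUNETR implementation together with the stated loss; argument names vary somewhat across MONAI versions (older releases also require an img_size argument):

```python
import torch
from monai.losses import DiceLoss
from monai.networks.nets import SwinUNETR

# Swin UNETR as described: feature size 48, depths (2, 2, 2, 2),
# attention heads (3, 6, 12, 24), single input and output channel.
model = SwinUNETR(
    in_channels=1,
    out_channels=1,
    feature_size=48,
    depths=(2, 2, 2, 2),
    num_heads=(3, 6, 12, 24),
)

loss_fn = DiceLoss(sigmoid=True)   # sigmoid applied inside the loss, as stated

x = torch.randn(1, 1, 128, 128, 128)               # dummy (B, C, H, W, D) volume
loss = loss_fn(model(x), torch.zeros(1, 1, 128, 128, 128))
```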
The use of an ensemble makes it possible to combine the strengths of each individual model and improve the robustness of the final output. This is especially critical in a clinical context, where the reliability and stability of predictions are of utmost importance. Unlike a single model, an ensemble reduces the likelihood of incorrect segmentation by aligning the outputs of multiple independent instances.
The ensemble is formed using a weighted voting method, where the final prediction is computed based on the outputs of all models, considering their confidence scores. In this case, the Dice coefficients are used as confidence scores, reflecting the accuracy of each individual model on the validation dataset. The Dice coefficient was chosen as a confidence measure because it is a widely accepted metric for evaluating segmentation quality in medical imaging tasks. It is particularly sensitive to the overlap between predicted and ground-truth regions, which is critical when analyzing pathologies such as ischemic brain lesions. The ensemble prediction $\hat{P}$ is computed using the following formula:

$$\hat{P} = \frac{\sum_{i=1}^{N} D_i P_i}{\sum_{i=1}^{N} D_i},$$

where $N$ is the number of models in the ensemble (in this case, $N = 2$), $P_i$ is the prediction of the i-th model, and $D_i$ is the Dice coefficient of the i-th model, calculated on the validation dataset.
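A sketch of this fusion rule; normalizing the Dice weights to sum to one and thresholding the fused probability map at 0.5 are assumptions about details the text leaves open:

```python
import numpy as np

def weighted_voting(prob_maps, dice_scores, threshold=0.5):
    """Dice-weighted fusion of per-model probability maps (sketch).

    prob_maps: list of arrays (H, W, D) with per-voxel probabilities.
    dice_scores: validation Dice of each model, used as its weight.
    """
    w = np.asarray(dice_scores, dtype=np.float64)
    w /= w.sum()                                      # normalize (assumption)
    fused = sum(wi * p for wi, p in zip(w, prob_maps))
    return (fused > threshold).astype(np.uint8)       # final binary mask

# e.g., mask = weighted_voting([p_se_unetr, p_swin_unetr], [0.7835, 0.7667])
```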
In this study, the overall process of ischemic stroke segmentation is presented as a sequence of four stages: data collection, preprocessing, model training, and prediction ensembling. In the first stage, 3D brain CT images are collected from patients with confirmed ischemic stroke. Next, image preprocessing is performed, including brain binarization using the HB-BET algorithm to remove the skull region, cropping the non-empty region to eliminate background, intensity normalization, and contrast enhancement to improve visual differentiation between tissues. The preprocessed data are then passed to the training module, where two neural network models are developed: the modified SE-UNETR architecture proposed in this study, and the Swin UNETR architecture with window-based self-attention. The final stage involves combining the predictions from these two models using a weighted voting mechanism, which takes into account the confidence level of each model. This approach improves the accuracy and robustness of the final segmentation, which is particularly important under conditions of clinical uncertainty.
Figure 5 shows a diagram of the proposed approach.
This diagram illustrates the complete pipeline of building the ensemble model—from the preprocessing of raw images to generating the final prediction—providing a clear overview of the entire process.
2.4. Evaluation Metrics
The Dice similarity coefficient (DSC) is one of the key metrics for evaluating the quality of medical image segmentation. It measures the similarity between the predicted mask S and the ground-truth mask G, with values in the range [0, 1], where 1 indicates perfect overlap and 0 represents no overlap at all.

DSC is defined as follows:

$$DSC = \frac{2\,|S \cap G|}{|S| + |G|},$$

or, in terms of predictions,

$$DSC = \frac{2\,TP}{2\,TP + FP + FN},$$

where $TP$ (true positive) is the number of correctly classified pathological pixels, $FP$ (false positive) is the number of pixels incorrectly classified as pathological, and $FN$ (false negative) is the number of pixels that the model failed to identify as pathological.

The Dice coefficient is widely used in medical image processing as it accounts for both missed ($FN$) and incorrectly predicted ($FP$) regions, making it more informative than simple classification accuracy [31].
The following metrics were used for the quantitative evaluation of model performance [32]:

Sensitivity (recall) is the model’s ability to correctly identify affected areas:

$$Sensitivity = \frac{TP}{TP + FN}.$$

Specificity is the ability of the model to exclude false-positive predictions:

$$Specificity = \frac{TN}{TN + FP},$$

where $TN$ (true negative) is the number of correctly classified non-pathological pixels.

Precision is the proportion of correctly classified lesions among all positive predictions:

$$Precision = \frac{TP}{TP + FP}.$$
Before computing the evaluation metrics (DSC, sensitivity, specificity, and precision), the continuous-valued probability maps output by the models were converted into binary segmentation masks. A fixed threshold of 0.5 was applied: pixels with a predicted probability greater than 0.5 were assigned to the positive class, and all others to the negative class. This binarization step allowed the predicted masks to be directly compared with the binary ground-truth annotations, enabling accurate calculation of evaluation metrics.
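A NumPy sketch of this binarization and the metric definitions above:

```python
import numpy as np

def evaluate(prob_map, ground_truth, threshold=0.5, eps=1e-8):
    """Binarize a probability map and compute DSC, sensitivity,
    specificity, and precision against a binary ground-truth mask."""
    pred = prob_map > threshold
    gt = ground_truth.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    return {
        "dice": 2 * tp / (2 * tp + fp + fn + eps),
        "sensitivity": tp / (tp + fn + eps),
        "specificity": tn / (tn + fp + eps),
        "precision": tp / (tp + fp + eps),
    }
```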
3. Results
The training was conducted using the proposed ensemble of transformer-based models, SE-UNETR and Swin UNETR, with each trained using the Adam optimizer [33] at a fixed learning rate. All network parameters (SE-UNETR and Swin UNETR) were initialized using PyTorch’s default initialization. Models were trained for 120 epochs with a batch size of two. The dataset was split into the training and test sets in an 80:20 ratio. The training set contained 39,936 non-overlapping 3D patches of size 16 × 16 × 16 voxels, which were extracted from 78 training volumes (after resizing to 128 × 128 × 128). Binary cross-entropy was used as the loss function to optimize training, while the Dice coefficient was employed to evaluate segmentation quality. The models were implemented in Python 3.11.13 using the PyTorch framework 2.6.0+cu124 and the MONAI library 1.5.0. NumPy 2.0.2 and NiBabel 5.3.2 were used for data preprocessing and image processing, and Matplotlib 3.10.0 was used for visualization. The training was performed in the Google Colab Pro+ environment on an NVIDIA A100 [34] GPU with CUDA 12.4 support.
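A condensed sketch of this training setup; train_loader stands in for an assumed PyTorch/MONAI data pipeline, model is either network, and the learning rate is left at Adam’s default because the exact value is not specified here:

```python
import torch
from monai.metrics import DiceMetric

optimizer = torch.optim.Adam(model.parameters())    # Adam, as stated in the text
criterion = torch.nn.BCEWithLogitsLoss()            # binary cross-entropy on logits
dice_metric = DiceMetric(include_background=True, reduction="mean")  # validation

for epoch in range(120):                            # 120 epochs, batch size two
    model.train()
    for volumes, masks in train_loader:             # assumed DataLoader
        optimizer.zero_grad()
        loss = criterion(model(volumes), masks)
        loss.backward()
        optimizer.step()
```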
To illustrate the training dynamics, Figure 6 presents the plots of the Dice coefficient and loss function values on both the training and validation datasets. As the number of epochs increases, the Dice coefficient improves while the loss decreases, indicating progressive model convergence. The closeness of the training and validation curves suggests good generalization performance and no signs of overfitting.
The performance evaluation results of the proposed model ensemble are presented in Table 1. The highest Dice coefficient was achieved by the ensemble of SE-UNETR and Swin UNETR, reaching 0.7983, which exceeds the scores of the individual models: 0.7835 for SE-UNETR and 0.7667 for Swin UNETR. This indicates improved segmentation accuracy due to the combination of predictions from different architectures.
The ensemble model also demonstrated the best sensitivity (89.75%) and precision (94.91%), which is especially important for medical diagnostics, where under-segmentation of pathological regions can lead to serious consequences. The high specificity values (99.9%) for all models are explained by the substantial predominance of healthy tissue over pathological areas in CT images, which is typical in stroke segmentation tasks. Thus, the proposed ensemble approach provides more stable and reliable segmentation compared to using individual models alone.
Figure 7 presents a visual comparison of ischemic stroke lesion segmentation performed using the SE-UNETR model, the Swin UNETR model, and their ensemble. The first column shows the original CT image, and the second column displays the ground-truth mask manually annotated by experts, followed by the predictions from the two individual models and the final ensemble prediction. It can be observed that each model handles the task differently: SE-UNETR produces a smoother but somewhat blurred delineation of the lesion, while Swin UNETR yields sharper boundaries but with partial omissions. In contrast, the ensemble prediction shows the best alignment with the ground-truth mask. The ensemble result features a more precise and complete coverage of the pathological region, highlighting its advantage over the individual models.
Figure 8 presents a comparison between ground-truth segmentation masks and the results obtained using the ensemble of the SE-UNETR and Swin UNETR models. For analysis, three representative slices from a volumetric brain CT scan were selected—the initial (slice 32), middle (slice 64), and final (slice 96)—allowing for an assessment of the model’s consistency across different scanning stages.
The first column shows the original CT slices, depicting the anatomical structure of the brain. The second column contains the ground-truth masks, where the highlighted areas correspond to ischemic lesions annotated by experts and are used as the reference for training and validation. The third column displays the predictions produced by the ensemble model. As seen in the images, the model effectively identifies the affected regions, accurately reproducing both the shape and localization of the stroke lesions. The top row shows a slice without signs of ischemia, and the ensemble prediction correctly reflects the absence of pathological changes, indicating high model specificity. The middle and bottom rows display slices with prominent ischemic alterations; in these cases, the model precisely and consistently replicates the true lesion shape. Despite minor differences in detail, the ensemble demonstrates robustness to noise and artifacts, as well as strong generalization ability. This confirms the high quality of segmentation and the reliability of the ensemble approach when applied to clinical data.
Figure 9 provides a visual representation of ischemic stroke segmentation results for three different patients.
Each row corresponds to a single CT slice and consists of three columns: first column—the original CT image, second column—overlay of the ground truth mask annotated by experts (in blue), third column—overlay of the ensemble model’s prediction (in shades of red). These examples demonstrate the model’s robustness across various lesion localizations and shapes. In the first case, the model captures an extended lesion area closely matching the ground-truth annotation in size. In the second example, there is a high degree of agreement between the prediction and the reference: the ensemble model accurately traces the contour of the affected region in the lower-right part of the brain. The third case also shows strong overlap between the predicted and ground-truth masks, particularly in the central and frontal regions. These visualizations confirm that the ensemble model is capable of accurately identifying ischemic lesions in different patients, maintaining a high degree of correspondence with expert annotations in both shape and location of the pathological regions. This result highlights the model’s clinical potential and its applicability in automated CT image analysis.
As a control experiment, the proposed method was compared with the results presented by Dobshik et al. (2023) [16], in which a convolutional 3D neural network was used for ischemic lesion segmentation. To ensure a fair comparison, the same CT dataset was used as in the present study. The results showed that the proposed ensemble approach provides more accurate segmentation of pathological regions compared to the method described by Dobshik et al. (2023) [16].
Additionally, an alternative ensemble consisting of three architecturally identical 3D U-Net models was implemented and tested. The primary difference among them was the random initialization of weights, which introduced the necessary prediction diversity due to the stochastic nature of training. Ensembling such models helps reduce the influence of random fluctuations during training and lowers the risk of overfitting—an especially important factor when dealing with medical images, which are characterized by high variability and limited data availability. However, this ensemble demonstrated lower performance compared to the transformer-based ensemble comprising SE-UNETR and Swin UNETR. The Dice coefficients for the individual 3D U-Net models were 0.7132, 0.7145, and 0.7018, with a final ensemble Dice score of 0.7253, which is lower than the result achieved by the proposed transformer ensemble (0.7983). It is important to note that both ensembles were trained on the same dataset, which consists of CT images from 98 real patients with confirmed ischemic stroke. This consistent setup ensures an objective comparison and allows for a more accurate evaluation of the effectiveness of different architectures under equivalent conditions.
The summary results are presented in Table 2, which provides a comparative analysis of various neural network architectures for the task of ischemic stroke segmentation. Specifically, the table includes the results of the proposed SE-UNETR + Swin UNETR ensemble with weighted voting, an alternative 3D U-Net ensemble, as well as previously published approaches: SE-Res 3D U-Net (Dobshik et al., 2023) [16], METrans—a transformer-based model (Wang et al., 2022) [35]—and a modified 3D U-Net model (Omarov et al., 2022) [36]. As shown in the table, the proposed approach achieves the highest Dice coefficient among all the models considered, confirming its superior accuracy and effectiveness in automated segmentation of ischemic lesions in CT images.
The developed model demonstrated a significant increase in the Dice coefficient, indicating more accurate segmentation and improved detection of pathological regions. The obtained results confirm the high effectiveness of using ensemble deep learning architectures in medical image analysis tasks. In total, the dataset included CT scans from 98 patients, with each volume containing between 306 and 505 axial slices, resulting in approximately 30,000–50,000 annotated slices. This large number of samples provided sufficient variability for training deep neural networks. To ensure a reliable evaluation, a patient-wise split of 80% for training and 20% for validation was applied, eliminating any overlap between the sets. In the test set comprising 20 patients, the proportion of voxels affected by acute ischemic stroke was only 0.8% of the total number of voxels, indicating a strong class imbalance. With a model sensitivity of 89.75% and a specificity of 99.99%, the overall proportion of misclassifications was approximately 0.09% of all voxels in the test set. Errors were observed mainly as isolated false-positive predictions on individual slices, while on the remaining slices of the corresponding studies, the model demonstrated accurate segmentation. Moreover, repeated experiments with multiple random splits demonstrated the robustness of the proposed method, as evidenced by an average Dice coefficient of , indicating low result variability.
4. Discussion
The results of this study demonstrate the effectiveness of the proposed ensemble of transformer-based models, SE-UNETR and Swin UNETR, for the task of automated ischemic stroke segmentation in CT images. Based on quantitative metrics and visual analysis, the proposed approach was shown to outperform both alternative architectures implemented in this work and previously published methods, including SE-Res 3D U-Net [16], METrans [35], and modified versions of 3D U-Net [36].
One of the key factors contributing to the reliability of the comparison was the use of a unified dataset consisting of CT scans from 98 real patients with confirmed ischemic stroke. This approach eliminated the influence of dataset variability and allowed for an objective evaluation of architectural differences. The comparison with an alternative ensemble composed of three 3D U-Net models trained with different weight initializations revealed that, despite the stability and simplicity of this architecture, its performance was lower (Dice = 0.7253) than that of the transformer ensemble (Dice = 0.7983). This highlights the potential of transformer-based models in medical image segmentation tasks.
While recent open-source segmentation frameworks such as SAM [19] show promising results on natural and certain medical images, they are primarily designed for 2D image inputs. In contrast, our task involves volumetric 3D CT data with specific intensity characteristics (Hounsfield units) and thin-slice resolution (0.5 mm). Direct application of SAM would require substantial architectural adaptation to handle volumetric data and voxel-level annotations, as well as retraining on CT-specific datasets. Therefore, we focused on transformer-based 3D segmentation architectures, which are better suited to the nature of our data.
In accordance with the recommendations of the CLAIM 2024 checklist (Checklist for Artificial Intelligence in Medical Imaging) [37], we analyzed our study in terms of transparency, reproducibility, and clinical contextualization. All data were provided in a fully anonymized form, and demographic and clinical information about the patients was not available (item 36). This limits the ability to analyze the impact of population characteristics on the results. In future multicenter studies, we plan to use extended datasets that include such information.
The method was tested only on an internal dataset. Although the training sample included CT images from 98 real patients, all data were obtained from a single institution. At the current stage, the model was tested exclusively on this internal data, which limits the diversity of clinical and technical conditions and may affect its generalizability when applied to external data acquired from different scanners or other centers (item 33). In the future, multicenter external validation is planned to assess the generalizability of the proposed approach.
This study reports averaged metrics (Dice ), which allows for an assessment of variability. However, due to the limited size of the test set (20 patients), we did not perform bootstrapping or calculate confidence intervals, as such methods can be statistically unstable with small samples (item 29). In future studies with larger and multicenter datasets, we plan to use bootstrapping and cross-validation for a rigorous estimation of confidence intervals.
We demonstrated the stability of the method on repeated random splits but did not perform specific stress tests (e.g., adding noise, artifacts, or modifying scanning protocols) (item 30). This is because the current study primarily focused on the ensemble architecture and model comparisons. In the future, we plan to additionally assess the algorithm’s robustness to data variations and real-world clinical conditions. We noted that the model’s errors occurred predominantly as isolated false-positive predictions; however, a systematic classification of error types was not performed (item 39). This is due to the limited number of erroneous cases in the test set. In future studies with larger datasets, we plan to categorize errors (e.g., false positives at vessel boundaries, false negatives in small lesions, and errors due to artifacts) and illustrate them with examples, which will provide a better understanding of the method’s limitations.
Explainability methods were not implemented in the current study (item 31). Special attention should be given in the future to improving model interpretability. The “black box” problem, characteristic of most modern neural network approaches, remains a significant barrier to widespread clinical adoption. In future research, we plan to incorporate explainable artificial intelligence (XAI) methods, such as Grad-CAM (gradient-weighted class activation mapping) or SHAP (Shapley additive explanations), which allow visualization of the image regions that most influenced the model’s final decision. This will help increase trust among medical professionals and facilitate the integration of the technology into existing diagnostic protocols.
In a clinical context, automatic segmentation of ischemic lesions on CT can significantly reduce image analysis time and improve the reproducibility of assessments compared to manual annotation (items 38 and 41). For practical application, it is crucial to define acceptable error ranges. For example, small inaccuracies at the lesion boundary may be clinically insignificant when assessing lesion volume, whereas missing small but critically located lesions could affect therapeutic decision-making. In future studies, we will focus on the clinical interpretation of the results, including determining acceptable accuracy thresholds at which the use of the model remains safe and effective. We also plan to assess the impact of automatic segmentation on diagnostic decisions and treatment outcomes. Furthermore, we will develop scenarios for seamless integration of the model into existing radiology systems and decision-making protocols, which will require appropriate technical infrastructure, staff training, and multicenter validation. These steps will help ensure the model’s operational stability and its acceptance by the medical community.
This study was conducted within the framework of project AP26195405, “Development of Combined Deep Neural Network Models for Interpretable Analysis of Medical Images,” which has local ethics approval (IRB No. A922) (item 27).
Thus, although the present study has several limitations, it demonstrates the potential of ensemble transformer architectures for the task of ischemic stroke segmentation on CT images. The recommendations of the CLAIM 2024 checklist allow not only a critical evaluation of the current study but also the development of a roadmap for future steps, ranging from dataset expansion and external validation to the implementation of explainability methods and the integration of the model into clinical workflows. We view this work as a foundation for subsequent research aimed at enhancing the transparency, reproducibility, and clinical applicability of artificial intelligence methods in medical imaging.