1. Introduction
Ischemic stroke remains one of the leading causes of death and disability worldwide. Every year, millions of people suffer a stroke, making it a major public health problem in many countries. One of the key factors determining a patient’s prognosis is the speed and accuracy of diagnosis. Timely and high-quality diagnosis of stroke is necessary to start effective treatment, which can significantly reduce the risk of irreversible consequences and disability. In recent years, significant attention has been paid to the use of modern technologies, such as deep learning [1], to improve diagnostic methods based on medical images, including magnetic resonance imaging (MRI) and computed tomography (CT). From an economic standpoint, the fastest and most cost-effective diagnostic method is non-contrast CT.
Medical imaging plays a key role in stroke diagnosis, and brain image segmentation is a crucial step in assessing a patient’s condition. Traditional segmentation methods, such as threshold filtering [2] or active contour segmentation [3], are limited in their ability to accurately and efficiently identify ischemic stroke lesions, especially in 3D images.
In recent years, significant progress in medical image segmentation has been achieved through the use of 3D convolutional neural networks (3D CNNs) [4], which enable more accurate extraction of spatial information from tomographic data. Modern deep learning models, including CNNs, are applied in a wide range of tasks—from medical image analysis and disease diagnosis to handwritten text recognition and natural language processing. A key challenge is the development of architectures capable of efficiently processing 3D medical images, such as MRI and CT scans, where capturing spatial dependencies is crucial for an accurate diagnosis.
One of the most well-known and effective architectures is 3D U-Net. Studies conducted since 2021 have shown that this model outperforms alternative approaches, such as fully convolutional networks (FCNs) and V-Net, particularly in segmenting complex pathologies, including ischemic stroke [5]. Modifications of 3D U-Net, such as integration with various attention layers and enhanced decoding mechanisms, have significantly improved accuracy when working with limited annotated data [6]. In particular, Zhang et al. (2024) proposed a liver segmentation model for CT images based on 3D U-Net with a dual attention mechanism, which significantly improves accuracy and robustness to noise by enabling a deeper analysis of spatial–contextual relationships [7]. Such modifications confirm the effectiveness of attention mechanisms in processing volumetric medical data.
Although the FCN and V-Net models are also used for medical image segmentation, they are inferior to 3D U-Net in terms of accuracy and robustness when processing medical data. The FCN performs well in 2D segmentation; however, when applied to 3D images, it suffers from spatial information loss, which reduces prediction accuracy [8]. Similarly, while 3D V-Net is effective for volumetric data segmentation, it requires substantial computational resources and exhibits poorer generalization across heterogeneous datasets compared to 3D U-Net [9,10].
One of the key challenges in medical image segmentation is class imbalance, where healthy tissue occupies a significantly larger area in an image than the affected regions. This imbalance can hinder model training and reduce the accuracy of pathology segmentation. To address this issue, researchers have proposed various methods, such as weighted loss functions and enhanced sample generation algorithms. Recent studies, such as the work by Yeung et al. (2022), demonstrate that these methods significantly improve segmentation performance in cases of severe class imbalance [11]. Garcia-Salgado et al. (2024) also developed an Attention U-Net model with a generalized Dice focal loss function, which enabled high segmentation accuracy of ischemic lesions on MRI images even in the presence of severe class imbalance. This highlights the effectiveness of combining attention mechanisms with adapted loss functions in stroke segmentation tasks [12]. Additionally, approaches that integrate data from multiple sources have been proposed, further improving predictions in complex cases [13]. Another major challenge is the need to develop more accurate and robust models capable of handling various types of medical images and quickly adapting to new data.
To improve the accuracy and reliability of ischemic stroke diagnosis based on medical imaging, there is a growing need for approaches that combine predictions from multiple models to achieve more reliable results. Amirgaliyev et al. (2021) demonstrated the application of ensemble methods in computer vision tasks, highlighting the importance of combining different algorithms to enhance reliability and recognition accuracy [14]. Berikov and Cherikbayeva (2018) introduced a combined method based on cluster ensembles and kernel functions, demonstrating high efficiency in searching for an optimal classifier [15].
The study by Dobshik et al. (2023) explores the application of 3D CNNs for segmenting acute ischemic stroke lesions on non-contrast CT images. The authors demonstrate the effectiveness of deep learning in automated medical data processing, emphasizing the importance of three-dimensional analysis for improving diagnostic accuracy. This approach confirms the potential of using 3D CNNs for stroke segmentation and can serve as a foundation for further development of ensemble methods [16]. Additionally, Vezakis et al. (2024) proposed a hybrid multi-dimensional approach that combines 2D U-Net and 3D Attention U-Net for organ segmentation using positron emission tomography (PET) data. Their work demonstrates that integrating different U-Net modifications can achieve accurate segmentation with minimal computational cost [17]. Yousef et al. (2023) explored the optimization of U-Net models for brain MRI, proposing a combination of enhanced U-Net architectures that deliver high segmentation accuracy while reducing computational costs—an especially important factor in clinical practice [18].
According to a number of studies, the 3D U-Net architecture is widely used due to its ability to extract spatial features at multiple scales, making it one of the most effective models for medical image segmentation. However, its limited spatial context coverage, which is associated with the fixed convolution kernel size, may reduce its effectiveness in capturing long-range dependencies between image regions. This is particularly relevant for ischemic stroke segmentation, where lesion boundaries can be blurred or poorly defined.
To overcome this limitation, there has been growing interest in transformer-based architectures in recent years, which were originally developed for natural language processing tasks. For example, the Segment Anything Model (SAM), introduced by Meta AI in 2023, has become a key example of a scalable segmentation model capable of working with various types of images without the need for fine-tuning [19]. These models exhibit a strong ability to model global contextual relationships, making them promising for the analysis of complex three-dimensional medical data. Based on this, hybrid architectures that combine the U-shaped structure with a transformer-based encoder have emerged. One of the most prominent examples of this approach is the UNETR (U-Net with Transformers) model, proposed by Hatamizadeh et al. (2022), which uses a vision transformer (ViT) as the encoder. This approach allows for the extraction of spatial features without the need for prior image aggregation, thereby improving segmentation accuracy [20]. An advancement of this idea is the Swin UNETR model, which is based on the Swin Transformer and utilizes window-based self-attention and hierarchical feature representation. This enables the model to effectively capture both local and global dependencies within the image. Swin UNETR has demonstrated high accuracy in brain structure segmentation and, according to recent studies [21,22,23], holds significant potential for application in automated stroke diagnosis tasks.
Despite significant progress, several challenges remain in stroke diagnosis using deep learning methods. One of the key issues is the need to improve model interpretability, which is particularly crucial in medical practice, where errors can have serious consequences. In recent years, researchers have increasingly focused on developing models that provide not only high accuracy but also the ability to interpret neural network decisions. For example, the study by Jabal et al. (2022) demonstrates how visualization techniques can be used to interpret deep model decisions in the context of medical diagnostics [24].
Additionally, a promising direction is the optimization of computational operations required for efficient processing of three-dimensional medical data. For example, in the patent by Tynymbaev et al. (2024) [25], a device was proposed for multiplying polynomials modulo an irreducible polynomial with integrated modular reduction. Although this solution was primarily developed for cryptographic applications, it reflects a general trend toward improving computational efficiency, which may also be relevant for computer-aided medical image analysis systems that require high performance when working with volumetric data.
The aim of this study is to develop a highly accurate and robust neural network model for ischemic stroke segmentation on 3D CT images. We hypothesize that employing an ensemble of transformer-based models (SE-UNETR and Swin UNETR) will achieve higher accuracy and robustness in 3D ischemic stroke lesion segmentation on CT images compared to individual models. To achieve this goal, this study proposes an ensemble approach for automated 3D segmentation of ischemic stroke lesions in brain CT scans. The ensemble is composed of two transformer-based architectures: the SE-UNETR model developed as part of this work, which is a modified version of UNETR incorporating squeeze-and-excitation (SE) blocks, and the Swin UNETR architecture, which is based on window-based self-attention and hierarchical feature representation. The model predictions are combined using a weighted voting mechanism, allowing adaptive consideration of each model’s confidence in forming the final output. The proposed approach was evaluated on real clinical data from 98 patients with confirmed ischemic stroke diagnoses.
The novelty of the proposed method lies in the systematic application and evaluation of an ensemble of transformer models for the task of 3D stroke segmentation on non-contrast CT images. Unlike existing studies that focus on individual architectures or their modifications, this research demonstrates that combining models with different attention mechanisms can lead to a significant improvement in segmentation accuracy. This study is the first to conduct a comparative analysis of transformer model ensembling for stroke segmentation on CT data. Notably, the effectiveness of an ensemble combining SE-UNETR and Swin UNETR has not been previously investigated on this clinical dataset. The experiments conducted on real clinical data confirm the effectiveness of the proposed ensemble and add originality to this work, highlighting its practical relevance for the development of reliable solutions in the field of medical diagnostics.
The study evaluates the effectiveness of the proposed ensemble compared to individual models and explores potential directions for further model improvement.
The paper is organized as follows: Section 2 provides a detailed description of the materials and methods, including the dataset used, data preprocessing steps, model architecture, and evaluation metrics. Section 3 presents the results obtained. Section 4 is dedicated to the discussion of the findings, while the conclusion (Section 5) summarizes the study and suggests potential directions for future work.
2. Materials and Methods
2.1. Dataset
For this study, a dataset of CT brain images from 98 patients diagnosed with acute ischemic stroke was used. The data were acquired using a Philips Ingenuity CT scanner and provided by the International Center for Tomography, SB RAS [16], ensuring their high quality and accuracy. The CT datasets used in this study are not publicly available due to medical data-sharing agreements. However, they may be provided by the corresponding author upon reasonable request, subject to confidentiality requirements. All images were provided in a fully de-identified form and stored in NIfTI (.nii) [26] format, which is standard in medical imaging and neuroscience. Prior to being transferred for research purposes, all metadata containing personal or potentially identifying information (patient name, patient ID, date of birth, age, sex, study date and time, institution information, etc.) were removed from the original data. Only fully anonymized data were used for the analysis. Demographic and clinical information about the patient population was not provided.
Each CT image consists of a series of slices, ranging from 306 to 505, depending on the specific patient’s characteristics. The size of each slice is 512 × 512 pixels, providing sufficient detail for analysis. The slice thickness is 0.5 mm, which is a standard parameter for modern CT scanners and ensures the necessary accuracy for segmentation and analysis.
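For illustration, a volume in this format can be read with NiBabel, which is used later in this work for preprocessing; the filenames below are hypothetical placeholders:

```python
import nibabel as nib

# Load one anonymized study and its lesion mask (hypothetical filenames).
img = nib.load("patient_001.nii")
mask = nib.load("patient_001_mask.nii")

volume = img.get_fdata()    # Hounsfield-unit array, e.g., (512, 512, 306..505)
labels = mask.get_fdata()   # binary lesion mask of the same shape
print(volume.shape, img.header.get_zooms())  # dimensions and voxel spacing (mm)
```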
Each image set is accompanied by manual segmentation performed by experienced radiologists. The segmentation procedures were carried out by two specialists with 9–13 years of experience and PhD degrees in radiology and radiation therapy, using the 3D Slicer 5.6.2 [27] software. During the segmentation process, brain regions affected by ischemic stroke were identified, allowing these data to be used for training and testing diagnostic and segmentation algorithms. To ensure the high quality and reliability of the ground truth, the annotations were agreed upon by the specialists. The image dataset in this case refers to a three-dimensional CT volume consisting of a sequence of axial slices. Segmentation is performed on each slice within the 3D volume, and as a result, a volumetric mask is generated, corresponding to anatomical structures and pathological lesions throughout the entire image volume. In total, approximately 29,988–49,490 slices were processed in this study, providing a large and diverse set of training and testing examples for building and evaluating the segmentation model.
In this study, no missing data were identified. All provided CT images and corresponding segmentations were complete, with no missing slices or corrupted files. Prior to analysis, the data underwent integrity and completeness checks: the number of scans and segmentations was compared, image dimensions were verified, and the absence of empty masks or corrupted files was confirmed. No patients were excluded from the dataset, as it was provided in a complete form.
For a clear demonstration of the data structure and annotation quality, Figure 1 presents axial CT slices from three different patients with overlaid masks of the affected areas (shown in red). The images are accompanied by an intensity scale in Hounsfield units (HU), reflecting tissue density, as well as a linear scale along the axes indicating the real size of the images in pixels. The red area corresponds to the annotated lesion mask, which was derived from the original annotations made by experienced radiologists. This visualization allows for the simultaneous evaluation of lesion localization, segmentation quality, and image detail, which is crucial for the subsequent training and testing of the model.
The dataset was randomly split by patients into the training and validation sets. Out of a total of 98 patients, 80% (78 patients) were allocated for training and the remaining 20% (20 patients) for validation. This approach prevents the overlap of slices from the same patient between the training and validation sets and ensures a proper assessment of the model’s generalization capability.
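A minimal sketch of such a patient-wise split, assuming patient identifiers as strings and a fixed random seed chosen here purely for illustration:

```python
import random

def split_by_patient(patient_ids, val_fraction=0.2, seed=42):
    """Patient-wise train/validation split: no patient contributes slices
    to both sets (sketch; the seed value is an illustrative assumption)."""
    ids = sorted(patient_ids)
    random.Random(seed).shuffle(ids)
    n_val = round(len(ids) * val_fraction)
    return ids[n_val:], ids[:n_val]   # train_ids, val_ids

train_ids, val_ids = split_by_patient([f"patient_{i:03d}" for i in range(1, 99)])
print(len(train_ids), len(val_ids))   # 78 20
```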
This work is a retrospective observational non-interventional study based on previously acquired CT images. Due to its non-interventional nature, registration in a clinical trial registry was not required.
2.2. Preprocessing
When working with CT images for stroke diagnosis, it is essential to perform a series of preprocessing steps to enhance data quality and extract relevant features for further analysis. The main preprocessing steps include the following:
Brain binarization (HB-BET)
The first step is to extract the brain region from the entire CT image [28]. For this purpose, the HB-BET (Hunt and Bender Brain Extraction Tool) method is used, which is an algorithm designed to remove unnecessary parts of the image (e.g., the skull), leaving only the brain region. This is crucial for eliminating irrelevant elements and isolating the brain structure for further analysis.
The algorithm utilizes threshold segmentation and gradient analysis to extract the brain region. The main binarization equation can be expressed as

$$B(x, y, z) = \begin{cases} 1, & I(x, y, z) > T \\ 0, & \text{otherwise,} \end{cases}$$

where $I(x, y, z)$ is the pixel intensity at point $(x, y, z)$ and $T$ is the threshold value, which is determined during the image analysis process.

The threshold value $T$ is calculated based on the statistical characteristics of the intensities in the image. $T$ is defined as

$$T = \mu + k\sigma,$$

where $\mu$ is the mean intensity value within the presumed brain region, $\sigma$ is the standard deviation of these values, and $k$ is an empirical coefficient.
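A minimal NumPy sketch of this thresholding rule; the soft-tissue window used to estimate the statistics and the value of k are assumptions, since the text does not specify them:

```python
import numpy as np

def binarize_brain(volume_hu, k=0.5):
    """Threshold binarization with T = mu + k * sigma (sketch)."""
    # Assumption: statistics are estimated over a rough soft-tissue HU window,
    # standing in for the "presumed brain region" of the text.
    tissue = volume_hu[(volume_hu > 0) & (volume_hu < 100)]
    threshold = tissue.mean() + k * tissue.std()
    return (volume_hu > threshold).astype(np.uint8)
```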
Cropping to the non-zero region
After extracting the brain region, the next step is cropping the image to the non-zero region, as described in Algorithm 1. This helps remove the background and areas with zero values that do not contain useful information.
Algorithm 1: Cropping the brain region to the non-zero area.
Input: Three-dimensional binary brain mask obtained after HB-BET.
Output: Cropped CT volume containing only the brain region.
Steps:
- (a) Determine the minimum and maximum coordinates of non-zero voxels along each axis: $(x_{\min}, x_{\max})$, $(y_{\min}, y_{\max})$, $(z_{\min}, z_{\max})$.
- (b) Crop the volume to the region $[x_{\min} : x_{\max},\, y_{\min} : y_{\max},\, z_{\min} : z_{\max}]$.
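Algorithm 1 translates directly into a few lines of NumPy: the bounding box of the non-zero mask voxels is computed along each axis and used to crop the volume.

```python
import numpy as np

def crop_to_nonzero(volume, brain_mask):
    """Crop a CT volume to the bounding box of non-zero brain-mask voxels."""
    coords = np.argwhere(brain_mask > 0)              # (n_voxels, 3) indices
    (x0, y0, z0), (x1, y1, z1) = coords.min(0), coords.max(0)
    return volume[x0:x1 + 1, y0:y1 + 1, z0:z1 + 1]
```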
Intensity normalization
The next preprocessing step is intensity normalization [29]. CT images can be acquired with different parameters and intensity ranges, which can complicate analysis. Normalization of all pixel intensities to a standard range of [0, 1] was performed, improving stability when applying machine learning methods. This helps to avoid issues related to brightness and contrast differences across different scans.
Min-max normalization was performed using the following formula:

$$I_{\text{norm}} = \frac{I - I_{\min}}{I_{\max} - I_{\min}},$$

where $I_{\max}$ and $I_{\min}$ are the maximum and minimum intensity values in the image.
Contrast enhancement
After normalization, contrast can be further improved using the histogram equalization method. This technique is particularly useful for medical images, where it is essential to highlight small and barely visible structures, such as brain lesions.
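A combined sketch of the normalization and contrast-enhancement steps; applying histogram equalization slice by slice with scikit-image is one plausible reading, since the text does not state whether equalization is performed in 2D or 3D:

```python
import numpy as np
from skimage import exposure

def normalize_and_equalize(volume):
    """Min-max normalization to [0, 1], then histogram equalization."""
    v_min, v_max = volume.min(), volume.max()
    normalized = (volume - v_min) / (v_max - v_min + 1e-8)
    # Assumption: equalize each axial slice independently (slices on axis 0).
    return np.stack([exposure.equalize_hist(s) for s in normalized], axis=0)
```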
Figure 2 presents an example from the dataset after preprocessing, illustrating the results of image processing aimed at improving data quality and enhancing key features for further analysis.
2.3. Methods
This study presents the architecture of an ensemble model designed for automated 3D segmentation of ischemic stroke lesions in brain CT images. The proposed ensemble is composed of two transformer-based models, SE-UNETR and Swin UNETR [21], each contributing to improved segmentation accuracy and robustness through differences in attention mechanisms and spatial–contextual feature processing. SE-UNETR was selected due to its integration of SE blocks, which improve feature selectivity and accuracy for small lesions, while Swin UNETR was chosen for its efficient combination of local and global attention with low computational costs. The model predictions are combined using a weighted voting mechanism, providing a more reliable final decision.
The first model in the ensemble is SE-UNETR—a modified version of the transformer-based UNETR architecture that was developed in this study for 3D segmentation of brain CT images. A distinctive feature of this model is the integration of SE blocks [30] into the decoder, which enables adaptive channel recalibration and contributes to more accurate delineation of pathological regions. The developed modification is referred to as SE-UNETR.
UNETR [20] combines the strengths of U-Net and the ViT, enabling the simultaneous use of both local and global image features. The input to the model is a 3D CT image of size (H × W × D × 1), where H, W, and D represent the height, width, and depth of the volume, respectively. In the first stage, the input image is divided into patches of size 16 × 16 × 16 voxels, and each patch is linearly projected into a feature space with 768 dimensions, forming the input to the ViT encoder.
The encoder consists of 12 transformer blocks, each containing a multi-head self-attention mechanism, layer normalization (LayerNorm), and fully connected layers (Multilayer Perceptron, MLP). Features are extracted after every third transformer block (i.e., after layers 3, 6, 9, and 12), which are then used to construct the decoder. This enables skip connections similar to the classic U-Net model, preserving spatial information from the early layers.
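For orientation, a baseline UNETR with this configuration is available in the MONAI library used in this work; a hedged instantiation might look as follows (the decoder feature_size is an assumption, as the text does not state it):

```python
from monai.networks.nets import UNETR

# Baseline UNETR: ViT encoder with 12 transformer blocks, hidden size 768,
# 16 x 16 x 16 patches; intermediate ViT features feed the decoder skips.
backbone = UNETR(
    in_channels=1,
    out_channels=1,
    img_size=(128, 128, 128),
    hidden_size=768,
    mlp_dim=3072,
    num_heads=12,
    feature_size=16,   # assumption: decoder width is not given in the text
)
```

SE-UNETR extends the decoder of this backbone with SE blocks, as described below.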
The main difference between SE-UNETR and the baseline UNETR model is the addition of SE blocks after each convolutional layer in the decoder. A detailed diagram of one such decoder block is shown in Figure 3.
Each decoder block includes two consecutive 3D convolutional layers with a kernel size of 3 × 3 × 3, followed by instance normalization (IN) and a ReLU activation function. To enhance selectivity for the most informative features, a SE block [30] is applied after each convolutional layer, performing adaptive channel recalibration. The SE block operates in the following stages:
Squeeze: Spatial features are collapsed into a single vector using global average pooling, allowing the aggregation of information for each channel.
Excitation: The resulting vector is passed through two fully connected layers with activation functions. This enables the model to learn which channels are more relevant for the specific task. A sigmoid function is applied at the output to generate weights that represent the importance of each channel.
Scale: The weight coefficients are reshaped back to match the input tensor dimensions and used to scale (recalibrate) each channel. This enhances informative features and suppresses less important ones.
These blocks allow the model to focus on the most relevant feature channels, adapting its attention to stroke-specific characteristics, such as faint, blurred, or small lesions. After the SE blocks, the output tensor is passed to the next decoder level, followed by a transposed convolution and concatenation with the corresponding encoder feature map.
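A minimal PyTorch sketch of a 3D SE block operating in these three stages; the reduction ratio of the bottleneck is an assumption, as the text does not specify it:

```python
import torch
import torch.nn as nn

class SEBlock3D(nn.Module):
    """Squeeze-and-excitation block for 3D feature maps (sketch)."""

    def __init__(self, channels: int, reduction: int = 8):  # reduction assumed
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool3d(1)      # global average pooling
        self.excite = nn.Sequential(                # two fully connected layers
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                           # per-channel weights in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c = x.shape[:2]
        w = self.squeeze(x).view(b, c)              # squeeze: (B, C)
        w = self.excite(w).view(b, c, 1, 1, 1)      # excitation: channel weights
        return x * w                                # scale: recalibrate channels
```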
In the final stage, the model uses a 1 × 1 × 1 convolution to reduce the number of channels to 1, corresponding to the binary segmentation mask (lesion/non-lesion). The resulting prediction map has the same dimensions as the input image: (H × W × D × 1).
The second method in the ensemble is the Swin-UNETR [21] architecture, which combines the strengths of the Swin Transformer and a U-Net-like decoder for 3D medical segmentation. Unlike the classical UNETR model, which is based on global attention (ViT), Swin-UNETR employs localized window-based attention mechanisms (window-based multi-head self-attention, W-MSA), allowing it to efficiently capture both local and global dependencies with lower computational overhead.
The model takes as input 3D data of size H × W × D × 1, representing a single-channel CT volume. In the first stage, the image is divided into non-overlapping 3D patches, which are projected into a feature space with 48 dimensions. These patches are then processed through four encoder stages with block depths of 2, 2, 2, and 2, where each stage uses Swin Transformer blocks with the number of attention heads set to 3, 6, 12, and 24, respectively. Between the stages, a patch-merging operation is used to reduce the spatial resolution and increase the number of channels, forming a hierarchical representation. The outputs from each stage are preserved and passed to the decoder via U-Net-style skip connections. The decoder includes transposed convolution blocks and convolutional layers that restore the spatial resolution and refine object localization. The final layer transforms the features into an output map of size H × W × D × 1. A sigmoid function is applied to generate a probabilistic mask, enabling the interpretation of each voxel’s value as the probability of belonging to the target class.
Figure 4 illustrates the architecture of Swin-UNETR, which comprises four encoder stages based on the Swin Transformer, a U-Net-like decoder with skip connections, and an output head for binary segmentation. The model was trained using the DiceLoss function with the parameter sigmoid=True, which enables effective handling of imbalanced data in binary segmentation tasks.
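As a sketch, this configuration can be instantiated with MONAI’s SwinUNETR implementation together with the stated loss; argument names vary somewhat across MONAI versions (older releases also require an img_size argument):

```python
import torch
from monai.losses import DiceLoss
from monai.networks.nets import SwinUNETR

# Swin UNETR as described: feature size 48, depths (2, 2, 2, 2),
# attention heads (3, 6, 12, 24), single input and output channel.
model = SwinUNETR(
    in_channels=1,
    out_channels=1,
    feature_size=48,
    depths=(2, 2, 2, 2),
    num_heads=(3, 6, 12, 24),
)

loss_fn = DiceLoss(sigmoid=True)   # sigmoid applied inside the loss, as stated

x = torch.randn(1, 1, 128, 128, 128)               # dummy (B, C, H, W, D) volume
loss = loss_fn(model(x), torch.zeros(1, 1, 128, 128, 128))
```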
The use of an ensemble makes it possible to combine the strengths of each individual model and improve the robustness of the final output. This is especially critical in a clinical context, where the reliability and stability of predictions are of utmost importance. Unlike a single model, an ensemble reduces the likelihood of incorrect segmentation by aligning the outputs of multiple independent instances.
The ensemble is formed using a weighted voting method, where the final prediction is computed based on the outputs of all models, considering their confidence scores. In this case, the Dice coefficients are used as confidence scores, reflecting the accuracy of each individual model on the validation dataset. The Dice coefficient was chosen as a confidence measure because it is a widely accepted metric for evaluating segmentation quality in medical imaging tasks. It is particularly sensitive to the overlap between predicted and ground-truth regions, which is critical when analyzing pathologies such as ischemic brain lesions. The ensemble prediction $\hat{P}$ is computed using the following formula:

$$\hat{P} = \frac{\sum_{i=1}^{N} D_i P_i}{\sum_{i=1}^{N} D_i},$$

where $N$ is the number of models in the ensemble (in this case, $N = 2$), $P_i$ is the prediction of the i-th model, and $D_i$ is the Dice coefficient of the i-th model, calculated on the validation dataset.
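A sketch of this fusion rule; normalizing the Dice weights to sum to one and thresholding the fused probability map at 0.5 are assumptions about details the text leaves open:

```python
import numpy as np

def weighted_voting(prob_maps, dice_scores, threshold=0.5):
    """Dice-weighted fusion of per-model probability maps (sketch).

    prob_maps: list of arrays (H, W, D) with per-voxel probabilities.
    dice_scores: validation Dice of each model, used as its weight.
    """
    w = np.asarray(dice_scores, dtype=np.float64)
    w /= w.sum()                                      # normalize (assumption)
    fused = sum(wi * p for wi, p in zip(w, prob_maps))
    return (fused > threshold).astype(np.uint8)       # final binary mask

# e.g., mask = weighted_voting([p_se_unetr, p_swin_unetr], [0.7835, 0.7667])
```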
In this study, the overall process of ischemic stroke segmentation is presented as a sequence of four stages: data collection, preprocessing, model training, and prediction ensembling. In the first stage, 3D brain CT images are collected from patients with confirmed ischemic stroke. Next, image preprocessing is performed, including brain binarization using the HB-BET algorithm to remove the skull region, cropping the non-empty region to eliminate background, intensity normalization, and contrast enhancement to improve visual differentiation between tissues. The preprocessed data are then passed to the training module, where two neural network models are developed: the modified SE-UNETR architecture proposed in this study, and the Swin UNETR architecture with window-based self-attention. The final stage involves combining the predictions from these two models using a weighted voting mechanism, which takes into account the confidence level of each model. This approach improves the accuracy and robustness of the final segmentation, which is particularly important under conditions of clinical uncertainty.
Figure 5 shows a diagram of the proposed approach.
This diagram illustrates the complete pipeline of building the ensemble model—from the preprocessing of raw images to generating the final prediction—providing a clear overview of the entire process.
2.4. Evaluation Metrics
The Dice similarity coefficient (DSC) is one of the key metrics for evaluating the quality of medical image segmentation. It measures the similarity between the predicted mask S and the ground-truth mask G, with values in the range [0, 1], where 1 indicates perfect overlap and 0 represents no overlap at all.

DSC is defined as follows:

$$DSC = \frac{2\,|S \cap G|}{|S| + |G|},$$

or, in terms of predictions,

$$DSC = \frac{2\,TP}{2\,TP + FP + FN},$$

where $TP$ (true positive) is the number of correctly classified pathological pixels, $FP$ (false positive) is the number of pixels incorrectly classified as pathological, and $FN$ (false negative) is the number of pixels that the model failed to identify as pathological.

The Dice coefficient is widely used in medical image processing as it accounts for both missed ($FN$) and incorrectly predicted ($FP$) regions, making it more informative than simple classification accuracy [31].
The following metrics were used for the quantitative evaluation of model performance [32]:

Sensitivity (recall) is the model’s ability to correctly identify affected areas:

$$Sensitivity = \frac{TP}{TP + FN}.$$

Specificity is the ability of the model to exclude false-positive predictions:

$$Specificity = \frac{TN}{TN + FP},$$

where $TN$ (true negative) is the number of correctly classified non-pathological pixels.

Precision is the proportion of correctly classified lesions among all positive predictions:

$$Precision = \frac{TP}{TP + FP}.$$
Before computing the evaluation metrics (DSC, sensitivity, specificity, and precision), the continuous-valued probability maps output by the models were converted into binary segmentation masks. A fixed threshold of 0.5 was applied: pixels with a predicted probability greater than 0.5 were assigned to the positive class, and all others to the negative class. This binarization step allowed the predicted masks to be directly compared with the binary ground-truth annotations, enabling accurate calculation of evaluation metrics.
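A NumPy sketch of this binarization and the metric definitions above:

```python
import numpy as np

def evaluate(prob_map, ground_truth, threshold=0.5, eps=1e-8):
    """Binarize a probability map and compute DSC, sensitivity,
    specificity, and precision against a binary ground-truth mask."""
    pred = prob_map > threshold
    gt = ground_truth.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    return {
        "dice": 2 * tp / (2 * tp + fp + fn + eps),
        "sensitivity": tp / (tp + fn + eps),
        "specificity": tn / (tn + fp + eps),
        "precision": tp / (tp + fp + eps),
    }
```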
3. Results
The training was conducted using the proposed ensemble of transformer-based models, SE-UNETR and Swin UNETR, with each trained using the Adam optimizer [33] at a fixed learning rate. All network parameters (SE-UNETR and Swin UNETR) were initialized using PyTorch’s default initialization. Models were trained for 120 epochs with a batch size of two. The dataset was split into the training and test sets in an 80:20 ratio. The training set contained 39,936 non-overlapping 3D patches of size 16 × 16 × 16 voxels, which were extracted from 78 training volumes (after resizing to 128 × 128 × 128). Binary cross-entropy was used as the loss function to optimize training, while the Dice coefficient was employed to evaluate segmentation quality. The models were implemented in Python 3.11.13 using the PyTorch framework 2.6.0+cu124 and the MONAI library 1.5.0. NumPy 2.0.2 and NiBabel 5.3.2 were used for data preprocessing and image processing, and Matplotlib 3.10.0 was used for visualization. The training was performed in the Google Colab Pro+ environment on an NVIDIA A100 [34] GPU with CUDA 12.4 support.
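A condensed sketch of this training setup; train_loader stands in for an assumed PyTorch/MONAI data pipeline, model is either network, and the learning rate is left at Adam’s default because the exact value is not specified here:

```python
import torch
from monai.metrics import DiceMetric

optimizer = torch.optim.Adam(model.parameters())    # Adam, as stated in the text
criterion = torch.nn.BCEWithLogitsLoss()            # binary cross-entropy on logits
dice_metric = DiceMetric(include_background=True, reduction="mean")  # validation

for epoch in range(120):                            # 120 epochs, batch size two
    model.train()
    for volumes, masks in train_loader:             # assumed DataLoader
        optimizer.zero_grad()
        loss = criterion(model(volumes), masks)
        loss.backward()
        optimizer.step()
```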
To illustrate the training dynamics, Figure 6 presents the plots of the Dice coefficient and loss function values on both the training and validation datasets. As the number of epochs increases, the Dice coefficient improves while the loss decreases, indicating progressive model convergence. The closeness of the training and validation curves suggests good generalization performance and no signs of overfitting.
The performance evaluation results of the proposed model ensemble are presented in Table 1. The highest Dice coefficient was achieved by the ensemble of SE-UNETR and Swin UNETR, reaching 0.7983, which exceeds the scores of the individual models: 0.7835 for SE-UNETR and 0.7667 for Swin UNETR. This indicates improved segmentation accuracy due to the combination of predictions from different architectures.
The ensemble model also demonstrated the best sensitivity (89.75%) and precision (94.91%), which is especially important for medical diagnostics, where under-segmentation of pathological regions can lead to serious consequences. The high specificity values (99.9%) for all models are explained by the substantial predominance of healthy tissue over pathological areas in CT images, which is typical in stroke segmentation tasks. Thus, the proposed ensemble approach provides more stable and reliable segmentation compared to using individual models alone.
Figure 7 presents a visual comparison of ischemic stroke lesion segmentation performed using the SE-UNETR model, the Swin UNETR model, and their ensemble. The first column shows the original CT image, and the second column displays the ground-truth mask manually annotated by experts, followed by the predictions from the two individual models and the final ensemble prediction. It can be observed that each model handles the task differently: SE-UNETR produces a smoother but somewhat blurred delineation of the lesion, while Swin UNETR yields sharper boundaries but with partial omissions. In contrast, the ensemble prediction shows the best alignment with the ground-truth mask. The ensemble result features a more precise and complete coverage of the pathological region, highlighting its advantage over the individual models.
Figure 8 presents a comparison between ground-truth segmentation masks and the results obtained using the ensemble of the SE-UNETR and Swin UNETR models. For analysis, three representative slices from a volumetric brain CT scan were selected—the initial (slice 32), middle (slice 64), and final (slice 96)—allowing for an assessment of the model’s consistency across different scanning stages.
The first column shows the original CT slices, depicting the anatomical structure of the brain. The second column contains the ground-truth masks, where the highlighted areas correspond to ischemic lesions annotated by experts and are used as the reference for training and validation. The third column displays the predictions produced by the ensemble model. As seen in the images, the model effectively identifies the affected regions, accurately reproducing both the shape and localization of the stroke lesions. The top row shows a slice without signs of ischemia, and the ensemble prediction correctly reflects the absence of pathological changes, indicating high model specificity. The middle and bottom rows display slices with prominent ischemic alterations; in these cases, the model precisely and consistently replicates the true lesion shape. Despite minor differences in detail, the ensemble demonstrates robustness to noise and artifacts, as well as strong generalization ability. This confirms the high quality of segmentation and the reliability of the ensemble approach when applied to clinical data.
Figure 9 provides a visual representation of ischemic stroke segmentation results for three different patients.
Each row corresponds to a single CT slice and consists of three columns: first column—the original CT image, second column—overlay of the ground truth mask annotated by experts (in blue), third column—overlay of the ensemble model’s prediction (in shades of red). These examples demonstrate the model’s robustness across various lesion localizations and shapes. In the first case, the model captures an extended lesion area closely matching the ground-truth annotation in size. In the second example, there is a high degree of agreement between the prediction and the reference: the ensemble model accurately traces the contour of the affected region in the lower-right part of the brain. The third case also shows strong overlap between the predicted and ground-truth masks, particularly in the central and frontal regions. These visualizations confirm that the ensemble model is capable of accurately identifying ischemic lesions in different patients, maintaining a high degree of correspondence with expert annotations in both shape and location of the pathological regions. This result highlights the model’s clinical potential and its applicability in automated CT image analysis.
As a control experiment, the proposed method was compared with the results presented by Dobshik et al. (2023) [16], in which a convolutional 3D neural network was used for ischemic lesion segmentation. To ensure a fair comparison, the same CT dataset was used as in the present study. The results showed that the proposed ensemble approach provides more accurate segmentation of pathological regions compared to the method described by Dobshik et al. (2023) [16].
Additionally, an alternative ensemble consisting of three architecturally identical 3D U-Net models was implemented and tested. The primary difference among them was the random initialization of weights, which introduced the necessary prediction diversity due to the stochastic nature of training. Ensembling such models helps reduce the influence of random fluctuations during training and lowers the risk of overfitting—an especially important factor when dealing with medical images, which are characterized by high variability and limited data availability. However, this ensemble demonstrated lower performance compared to the transformer-based ensemble comprising SE-UNETR and Swin UNETR. The Dice coefficients for the individual 3D U-Net models were 0.7132, 0.7145, and 0.7018, with a final ensemble Dice score of 0.7253, which is lower than the result achieved by the proposed transformer ensemble (0.7983). It is important to note that both ensembles were trained on the same dataset, which consists of CT images from 98 real patients with confirmed ischemic stroke. This consistent setup ensures an objective comparison and allows for a more accurate evaluation of the effectiveness of different architectures under equivalent conditions.
The summary results are presented in Table 2, which provides a comparative analysis of various neural network architectures for the task of ischemic stroke segmentation. Specifically, the table includes the results of the proposed SE-UNETR + Swin UNETR ensemble with weighted voting, an alternative 3D U-Net ensemble, as well as previously published approaches: SE-Res 3D U-Net (Dobshik et al., 2023) [16], METrans—a transformer-based model (Wang et al., 2022) [35]—and a modified 3D U-Net model (Omarov et al., 2022) [36]. As shown in the table, the proposed approach achieves the highest Dice coefficient among all the models considered, confirming its superior accuracy and effectiveness in automated segmentation of ischemic lesions in CT images.
The developed model demonstrated a significant increase in the Dice coefficient, indicating more accurate segmentation and improved detection of pathological regions. The obtained results confirm the high effectiveness of using ensemble deep learning architectures in medical image analysis tasks. In total, the dataset included CT scans from 98 patients, with each volume containing between 306 and 505 axial slices, resulting in approximately 30,000–50,000 annotated slices. This large number of samples provided sufficient variability for training deep neural networks. To ensure a reliable evaluation, a patient-wise split of 80% for training and 20% for validation was applied, eliminating any overlap between the sets. In the test set comprising 20 patients, the proportion of voxels affected by acute ischemic stroke was only 0.8% of the total number of voxels, indicating a strong class imbalance. With a model sensitivity of 89.75% and a specificity of 99.99%, the overall proportion of misclassifications was approximately 0.09% of all voxels in the test set. Errors were observed mainly as isolated false-positive predictions on individual slices, while on the remaining slices of the corresponding studies, the model demonstrated accurate segmentation. Moreover, repeated experiments with multiple random splits demonstrated the robustness of the proposed method, as evidenced by an average Dice coefficient of , indicating low result variability.
4. Discussion
The results of this study demonstrate the effectiveness of the proposed ensemble of transformer-based models, SE-UNETR and Swin UNETR, for the task of automated ischemic stroke segmentation in CT images. Based on quantitative metrics and visual analysis, the proposed approach was shown to outperform both alternative architectures implemented in this work and previously published methods, including SE-Res 3D U-Net [16], METrans [35], and modified versions of 3D U-Net [36].
One of the key factors contributing to the reliability of the comparison was the use of a unified dataset consisting of CT scans from 98 real patients with confirmed ischemic stroke. This approach eliminated the influence of dataset variability and allowed for an objective evaluation of architectural differences. The comparison with an alternative ensemble composed of three 3D U-Net models trained with different weight initializations revealed that, despite the stability and simplicity of this architecture, its performance was lower (Dice = 0.7253) than that of the transformer ensemble (Dice = 0.7983). This highlights the potential of transformer-based models in medical image segmentation tasks.
While recent open-source segmentation frameworks such as SAM [19] show promising results on natural and certain medical images, they are primarily designed for 2D image inputs. In contrast, our task involves volumetric 3D CT data with specific intensity characteristics (Hounsfield units) and thin-slice resolution (0.5 mm). Direct application of SAM would require substantial architectural adaptation to handle volumetric data and voxel-level annotations, as well as retraining on CT-specific datasets. Therefore, we focused on transformer-based 3D segmentation architectures, which are better suited to the nature of our data.
In accordance with the recommendations of the CLAIM 2024 checklist (Checklist for Artificial Intelligence in Medical Imaging) [37], we analyzed our study in terms of transparency, reproducibility, and clinical contextualization. All data were provided in a fully anonymized form, and demographic and clinical information about the patients was not available (item 36). This limits the ability to analyze the impact of population characteristics on the results. In future multicenter studies, we plan to use extended datasets that include such information.
The method was tested only on an internal dataset. Although the training sample included CT images from 98 real patients, all data were obtained from a single institution. At the current stage, the model was tested exclusively on this internal data, which limits the diversity of clinical and technical conditions and may affect its generalizability when applied to external data acquired from different scanners or other centers (item 33). In the future, multicenter external validation is planned to assess the generalizability of the proposed approach.
This study reports averaged metrics (Dice ), which allows for an assessment of variability. However, due to the limited size of the test set (20 patients), we did not perform bootstrapping or calculate confidence intervals, as such methods can be statistically unstable with small samples (item 29). In future studies with larger and multicenter datasets, we plan to use bootstrapping and cross-validation for a rigorous estimation of confidence intervals.
We demonstrated the stability of the method on repeated random splits but did not perform specific stress tests (e.g., adding noise, artifacts, or modifying scanning protocols) (item 30). This is because the current study primarily focused on the ensemble architecture and model comparisons. In the future, we plan to additionally assess the algorithm’s robustness to data variations and real-world clinical conditions. We noted that the model’s errors occurred predominantly as isolated false-positive predictions; however, a systematic classification of error types was not performed (item 39). This is due to the limited number of erroneous cases in the test set. In future studies with larger datasets, we plan to categorize errors (e.g., false positives at vessel boundaries, false negatives in small lesions, and errors due to artifacts) and illustrate them with examples, which will provide a better understanding of the method’s limitations.
Explainability methods were not implemented in the current study (item 31). Special attention should be given in the future to improving model interpretability. The “black box” problem, characteristic of most modern neural network approaches, remains a significant barrier to widespread clinical adoption. In future research, we plan to incorporate explainable artificial intelligence (XAI) methods, such as Grad-CAM (gradient-weighted class activation mapping) or SHAP (Shapley additive explanations), which allow visualization of the image regions that most influenced the model’s final decision. This will help increase trust among medical professionals and facilitate the integration of the technology into existing diagnostic protocols.
In a clinical context, automatic segmentation of ischemic lesions on CT can significantly reduce image analysis time and improve the reproducibility of assessments compared to manual annotation (items 38 and 41). For practical application, it is crucial to define acceptable error ranges. For example, small inaccuracies at the lesion boundary may be clinically insignificant when assessing lesion volume, whereas missing small but critically located lesions could affect therapeutic decision-making. In future studies, we will focus on the clinical interpretation of the results, including determining acceptable accuracy thresholds at which the use of the model remains safe and effective. We also plan to assess the impact of automatic segmentation on diagnostic decisions and treatment outcomes. Furthermore, we will develop scenarios for seamless integration of the model into existing radiology systems and decision-making protocols, which will require appropriate technical infrastructure, staff training, and multicenter validation. These steps will help ensure the model’s operational stability and its acceptance by the medical community.
This study was conducted within the framework of project AP26195405, “Development of Combined Deep Neural Network Models for Interpretable Analysis of Medical Images,” which has local ethics approval (IRB No. A922) (item 27).
Thus, although the present study has several limitations, it demonstrates the potential of ensemble transformer architectures for the task of ischemic stroke segmentation on CT images. The recommendations of the CLAIM 2024 checklist allow not only a critical evaluation of the current study but also the development of a roadmap for future steps, ranging from dataset expansion and external validation to the implementation of explainability methods and the integration of the model into clinical workflows. We view this work as a foundation for subsequent research aimed at enhancing the transparency, reproducibility, and clinical applicability of artificial intelligence methods in medical imaging.