Applied Sciences
  • Article
  • Open Access

5 December 2024

SCENet: Small Kernel Convolution with Effective Receptive Field Network for Brain Tumor Segmentation

1 College of Computer and Information Engineering, Xinjiang Agricultural University, Urumqi 830052, China
2 College of Information Science and Engineering, Hohai University, Nanjing 210098, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Applications of Computer Vision and Image Processing in Medicine

Abstract

Brain tumors are serious conditions that can cause great trauma to patients, endangering their health and even leading to disability or death. Accurate preoperative diagnosis is therefore particularly important. Deep-learning-based brain tumor segmentation plays an important role in the preoperative treatment planning process and has achieved good performance. However, one remaining challenge is an insufficient ability to extract features with a large receptive field in the encoder layers and to guide the selection of deep semantic information in the decoder layers. We propose a small kernel convolution with effective receptive field network (SCENet) based on UNet, which involves a small kernel convolution with effective receptive field shuffle module (SCER) and a channel spatial attention module (CSAM). The SCER module utilizes the inherent properties of stacked convolutions to obtain an effective receptive field and improve the ability to extract features with a large receptive field. The CSAM in the decoder layers preserves more detailed features and captures clearer contours of the segmented image by calculating channel and spatial weights. An ASPP module is introduced into the bottleneck layer to enlarge the receptive field and capture multi-scale detailed features. Furthermore, extensive experiments were performed to evaluate the performance of our model on the BraTS2021 dataset. SCENet achieved dice coefficient scores of 91.67%, 87.70%, and 83.35% for whole tumor (WT), tumor core (TC), and enhancing tumor (ET), respectively. The results show that the proposed model achieves state-of-the-art performance compared with more than twelve benchmarks.

1. Introduction

A brain tumor is an area of injury or disease within the brain, which can range from small to large, from few to many, and from relatively harmless to life threatening. Common intracranial tumors include gliomas, metastases, and meningiomas. Glioma is the most common primary central nervous system tumor [1]. According to the heterogeneity and biological behavior of tumor cells, as well as the presence or absence of vascular proliferation in the tumor body, gliomas are classified into four grades by the WHO. Grades 1–2 represent low-grade malignancy, while grades 3–4 indicate highly malignant tumors [2]. Traditional imaging diagnosis mainly grades gliomas based on their morphology, signal characteristics, degree of enhancement of tumor parenchyma, presence or absence of cystic necrosis, and degree of peritumoral edema [3]. Low-grade gliomas have uniform signals, clear boundaries, and mostly lack enhancement. High-grade brain tumors are typically associated with vasogenic edema, mass effect, and intravenous contrast enhancement [4]. Although there has been significant progress in surgery, radiation therapy, and chemotherapy in recent years, the prognosis still needs to be improved. The purpose of treatment is to remove as much of the tumor as possible while minimizing damage to normal brain tissue, prolonging the patient’s life and preserving their quality of life as much as possible. Therefore, segmenting the subregions of the tumor and determining the degree of tumor invasion into surrounding structures before surgery can be helpful for selecting treatment plans and formulating the best surgical plan [5]. Magnetic Resonance Imaging (MRI) [6] is currently the preferred method for brain tumor imaging due to its excellent soft tissue resolution, high signal-to-noise ratio, and multi-directional imaging. In the BraTS 2021 dataset, it includes four modalities: T1-weighted (T1), T1-enhanced contrast (T1ce), T2-weighted (T2), and fluid-attenuated inversion recovery (FLAIR) [7]. The four modalities are usually input simultaneously into the network to support effective analysis, that is, multimodal brain tumor segmentation.
The simple procedure of manual and semi-automatic segmentation can be seen in Figure 1. Typically, radiologists begin by importing brain tumor imaging data. They then examine the data across the axial, sagittal, and coronal planes to scrutinize the image details. Ultimately, they utilize the region of interest (ROI) for calibration and measurement. Compared to traditional manual and semi-automatic diagnostic methods used by radiologists, brain tumor segmentation methods based on artificial intelligence are characterized by fast segmentation speed, high diagnostic accuracy, and a reduced missed diagnosis rate. UNet [8] is widely used for medical image processing; it fully captures the feature information of images with a special symmetrical structure and uses efficient restoration techniques to segment the lesion site. In order to fully utilize depth information, a 3D UNet, which performs well in the field of medical image segmentation, was proposed by Özgün Çiçek et al. [9] and has been extensively researched.
Figure 1. The simple procedure of manual and semi-automatic segmentation of medical images.
The feature extraction in UNet is mainly undertaken by combining two convolutions, but each convolution has a receptive field that is too small. This is a common challenge; the insufficient extraction of features with a large receptive field can lead to suboptimal segmentation results. The attention mechanism can effectively compensate for the insufficient extraction of features by the two convolutions in UNet. Recently, one of the most popular ideas in UNet and 3D UNet research has been to apply the attention mechanism after the two convolutions to extract more image features in the encoder blocks. Although some attention mechanisms improve the performance of brain tumor segmentation, further improvement is needed. To improve brain tumor image segmentation, we propose SCENet, which builds an effective receptive field in the encoder layers and applies channel and spatial weighting in the decoder layers.
The main contributions of our work are as follows:
(1)
We proposed the Small Kernel Convolution with Effective Receptive Field Network (SCENet) based on UNet, which can effectively improve brain tumor segmentation performance.
(2)
We designed an SCER block to enhance the effectiveness and efficiency of extracting features with an effective receptive field in the encoder layers.
(3)
We used a CSAM attention mechanism to select the more important detailed features in the decoder layers.
(4)
The ASPP module was introduced to the bottleneck layer to enlarge the receptive field in order to capture more detailed features.
The rest of this paper is laid out as follows: The related work is described in Section 2. The materials and methods and details of SCENet are presented in Section 3. Subsequently, in Section 4, the comparison results and ablation experiments are presented and analyzed in detail. Section 5 introduces limitations and future perspectives. Finally, Section 6 reveals the conclusions.

3. Methodology

3.1. Network Architecture

The overall architecture of our method is depicted in Figure 2. On the left branch, Stage 1, Stage 2, Stage 3, and Stage 4 form the encoder, which is mainly responsible for extracting image features. In the right section, Decoder 1, Decoder 2, Decoder 3, and Decoder 4 gradually restore the detailed information of the image. Each encoder stage contains two convolutions, two SCER blocks, and a downsampling block. The evolving normalization–activation layers (EVO Norm) [39] are applied after each convolution, which has a kernel of 3 × 3 × 3 and a stride of 1. The SCER blocks take advantage of small kernel convolutions with an effective receptive field to fuse local semantic information with features that have a large receptive field. Layer normalization (LN) [40] followed by a convolution with kernel 2 × 2 × 2 and stride of 2 serves as the downsampling block of each encoder stage.
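As an illustrative PyTorch sketch (not the authors' released implementation), one encoder stage could be organized as follows; the SCERBlock placeholder and the GroupNorm-based stand-ins for EVO Norm and layer normalization are assumptions:

```python
import torch
import torch.nn as nn

class EncoderStage(nn.Module):
    """One encoder stage: two 3x3x3 convolutions, two SCER blocks, and a downsampling block."""
    def __init__(self, in_ch, out_ch, scer_block=nn.Identity):
        super().__init__()
        # Two 3x3x3 convolutions with stride 1; GroupNorm + SiLU stands in for EVO Norm here.
        self.convs = nn.Sequential(
            nn.Conv3d(in_ch, in_ch, kernel_size=3, stride=1, padding=1),
            nn.GroupNorm(8, in_ch), nn.SiLU(),
            nn.Conv3d(in_ch, in_ch, kernel_size=3, stride=1, padding=1),
            nn.GroupNorm(8, in_ch), nn.SiLU(),
        )
        # Two SCER blocks (Section 3.2); nn.Identity is only a placeholder.
        self.scer = nn.Sequential(scer_block(), scer_block())
        # Downsampling block: layer-norm-like normalization, then a 2x2x2 convolution with stride 2.
        self.down = nn.Sequential(
            nn.GroupNorm(1, in_ch),
            nn.Conv3d(in_ch, out_ch, kernel_size=2, stride=2),
        )

    def forward(self, x):
        skip = self.scer(self.convs(x))   # sent to the matching decoder via the skip connection
        return self.down(skip), skip

# usage: (1, 32, 32, 32, 32) -> downsampled (1, 64, 16, 16, 16) plus the skip tensor
stage = EncoderStage(32, 64)
down, skip = stage(torch.randn(1, 32, 32, 32, 32))
```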
Figure 2. An illustration of the proposed SCENet for brain tumor image segmentation.
The decoder contains two convolutions with kernel 3 × 3 × 3 and stride of 1, a CSAM block, a trilinear interpolation operation, and a convolution with kernel 1 × 1 × 1 and stride of 1. In the decoder, the convolution with kernel 1 × 1 × 1 and stride of 1 is used to improve the non-linear expression ability of the network, and EVO Norm is applied after it. A trilinear interpolation operation gradually restores the image size by a factor of two. To focus on important features and ignore unimportant ones, a channel and spatial attention mechanism, called CSAM, is applied after the trilinear interpolation operation. The last operation of each decoder uses two convolutions with kernel 3 × 3 × 3 and stride of 1 to recover segmentation details.
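A corresponding decoder stage, again as a hedged sketch (the CSAM placeholder and the channel bookkeeping are assumptions, and normalization layers are omitted), might look like this:

```python
import torch
import torch.nn as nn

class DecoderStage(nn.Module):
    """One decoder stage: 1x1x1 convolution, trilinear upsampling by 2, CSAM attention,
    then two 3x3x3 convolutions that recover segmentation details."""
    def __init__(self, in_ch, out_ch, csam=None):
        super().__init__()
        self.pointwise = nn.Conv3d(in_ch, out_ch, kernel_size=1, stride=1)
        self.up = nn.Upsample(scale_factor=2, mode="trilinear", align_corners=False)
        self.csam = csam                                    # CSAM module from Section 3.3
        self.refine = nn.Sequential(
            nn.Conv3d(out_ch * 2, out_ch, kernel_size=3, padding=1),  # *2: CSAM concatenates its inputs
            nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
        )

    def forward(self, x, skip):
        x = self.up(self.pointwise(x))                      # restore spatial size by a factor of two
        # Without a CSAM instance, fall back to a plain concatenation with the skip feature.
        x = self.csam(skip, x) if self.csam is not None else torch.cat([skip, x], dim=1)
        return self.refine(x)
```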
To further reduce the number of parameters, a 1 × 1 × 1 convolution and EVO Norm were added to the skip connections between the encoder and the corresponding decoder, which halves the number of channels. There are two CSAM modules and an Atrous Spatial Pyramid Pooling (ASPP) [41] module in the bottleneck layer. ASPP with atrous rates of 1, 2, 3, and 5 is applied to capture multi-scale contextual image information. At the end of the network, a convolution with kernel 1 × 1 × 1 and a stride of 1 serves as the classifier. The dimensions of the input image are 4 × 128 × 128 × 128, and the classifier layer maps the deep features to 3 × 128 × 128 × 128.
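The bottleneck ASPP with rates 1, 2, 3, and 5 can be sketched as below; this is a minimal variant (no image-pooling branch or normalization), so details may differ from the DeepLab-style module actually used:

```python
import torch
import torch.nn as nn

class ASPP3D(nn.Module):
    """Minimal 3D Atrous Spatial Pyramid Pooling with atrous rates 1, 2, 3, and 5."""
    def __init__(self, in_ch, out_ch, rates=(1, 2, 3, 5)):
        super().__init__()
        # One dilated 3x3x3 convolution per rate; padding=rate keeps the spatial size unchanged.
        self.branches = nn.ModuleList(
            [nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r) for r in rates]
        )
        # A 1x1x1 convolution fuses the concatenated multi-scale branches.
        self.fuse = nn.Conv3d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

# usage on a bottleneck feature map
aspp = ASPP3D(256, 256)
y = aspp(torch.randn(1, 256, 8, 8, 8))   # -> (1, 256, 8, 8, 8)
```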

3.2. Small Kernel Convolution with Effective Receptive Field Shuffle Module (SCER)

It is widely acknowledged that the larger the receptive field, the larger the range of the original image a unit can come into contact with, which also means that it may capture more features with a large receptive field and higher-level semantic features. Conversely, a smaller receptive field indicates that the features tend to be more localized and detailed. In UNet, there are already two 3 × 3 × 3 convolutions, but they lack the large receptive field needed to extract such features. Convolutional operations are the fundamental units of UNet and can extract effective features without additional operations, so stacking more convolutional kernels to obtain a large receptive field is undoubtedly a fast and simple method. It is a fact that larger convolutional kernels, such as 7 × 7 × 7, bring greater receptive fields, but they also bring a larger number of parameters. Karen Simonyan et al. [42] and Christian Szegedy et al. [43] suggested that the receptive field of one convolution with kernel 5 × 5 × 5 is similar to that of a stack of two 3 × 3 × 3 convolution layers, and the receptive field of one convolution with kernel 7 × 7 × 7 is close to that of a stack of three 3 × 3 × 3 convolution layers. However, the difference in their parameter counts is huge. Taking 2D as an example, one convolution with kernel 7 × 7 has 49C² parameters, where C is the channel number, whereas a stack of three 3 × 3 convolutional layers has only 27C² parameters [42]. To further reduce the number of parameters brought by the large receptive field, we introduced the idea of the lightweight ShuffleNetV2 [44], which mainly aims to improve the effect while reducing the number of parameters.
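In 3D, the same reasoning gives 343C² parameters for one 7 × 7 × 7 convolution versus 3 × 27C² = 81C² for three stacked 3 × 3 × 3 convolutions (ignoring biases). A quick PyTorch check of these counts; the channel number C = 64 is an arbitrary choice for illustration:

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

C = 64
big = nn.Conv3d(C, C, kernel_size=7, padding=3, bias=False)            # one 7x7x7 convolution
stack = nn.Sequential(*[nn.Conv3d(C, C, kernel_size=3, padding=1, bias=False) for _ in range(3)])

print(n_params(big))    # 343 * C^2 = 1,404,928
print(n_params(stack))  #  81 * C^2 =   331,776
```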
A diagram of the SCER block is shown in Figure 3. Firstly, in order to reduce the number of parameters in the network, the input feature map is evenly divided into two branches along the channel dimension. The upper branch extracts features with an effective receptive field, using a stack of three 3 × 3 × 3 convolution layers to enlarge the receptive field, while the lower branch preserves the original semantic features. Next, a channel shuffling operation ensures the interaction of information between the two branches. The SCER block can be represented as follows:
$$Z = \mathrm{shuffle}\Big(\mathrm{concat}\Big(\mathrm{split}\big(x, \tfrac{C}{2}\big),\ \mathrm{Conv}_{3\times3\times3}\big(\mathrm{Conv}_{3\times3\times3}\big(\mathrm{Conv}_{3\times3\times3}\big(\mathrm{split}\big(x, \tfrac{C}{2}\big)\big)\big)\big)\Big)\Big)$$
where $x$ denotes the input feature map; split is the split operation with parameter $\tfrac{C}{2}$, which divides the feature map evenly along the channel dimension; $\mathrm{Conv}_{3\times3\times3}$ is a convolution with kernel 3 × 3 × 3 and stride of 1; and shuffle denotes the channel shuffling operation.
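A minimal PyTorch sketch of the SCER block under the equation above (normalization and activation layers, e.g. EVO Norm, are omitted here and would follow each convolution in practice):

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    """Interleave channels from the two branches (ShuffleNet V2-style shuffle)."""
    b, c, d, h, w = x.shape
    return x.view(b, groups, c // groups, d, h, w).transpose(1, 2).reshape(b, c, d, h, w)

class SCERBlock(nn.Module):
    """Split channels in half, run three stacked 3x3x3 convolutions on one branch to build an
    effective 7x7x7 receptive field, keep the other branch unchanged, concatenate, then shuffle."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.stack = nn.Sequential(
            nn.Conv3d(half, half, kernel_size=3, stride=1, padding=1),
            nn.Conv3d(half, half, kernel_size=3, stride=1, padding=1),
            nn.Conv3d(half, half, kernel_size=3, stride=1, padding=1),
        )

    def forward(self, x):
        identity, branch = torch.chunk(x, 2, dim=1)    # split(x, C/2)
        branch = self.stack(branch)                    # three stacked 3x3x3 convolutions
        return channel_shuffle(torch.cat([identity, branch], dim=1))

# usage
block = SCERBlock(32)
y = block(torch.randn(1, 32, 16, 16, 16))              # output keeps the input shape
```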
Figure 3. An illustration of building blocks of SCER Block.

3.3. Channel Spatial Attention Module (CSAM)

A diagram of the CSAM block is shown in Figure 4, which includes two parts: a channel attention and a spatial attention mechanism. The upper part is channel attention, and the lower part is the spatial attention mechanism. The channel attention involves global average pooling (GAP), convolution with kernel size 3 × 3 × 3, and the Sigmoid Activation Function, which enable the model to automatically learn the importance of each channel and adjust the representation of input data based on the contribution of each channel. The spatial attention incorporates average pooling (AP), max pooling (MP), convolution with kernel size 7 × 7 × 7 and Sigmoid Activation Function, which can help the model better focus on the features of different regions in the image to improve the representation ability and performance of the model. The CSAM is an attention module that combines channel attention and spatial attention mechanisms to improve the representation ability of image features, and it achieves global perception and the importance adjustment of image features by simultaneously focusing on feature information in both channel and spatial dimensions.
Figure 4. An illustration of building blocks of CSAM block.
Compared to the CBAM [45] attention mechanism, CSAM has two inputs: one is the information transmitted by the encoder, and the other is the information transmitted by the previous layer of the decoder. The advantage of the CSAM module is that it integrates the shallow information transmitted by the encoder to the corresponding decoder with the deep information transmitted from the previous layer, and it guides the selection of detailed features from the previous layer through spatial and channel attention mechanisms. The channel attention can be represented as follows:
$$Z_{c} = \mathrm{Sigmoid}\big(\mathrm{Conv}_{3\times3\times3}\big(\mathrm{GAP}\big(Z_{in}^{L} + Z_{in}^{L-1}\big)\big)\big) \times Z_{in}^{L}$$
where $Z_{in}^{L}$ denotes the shallow information transmitted by the encoder to the corresponding decoder, $Z_{in}^{L-1}$ represents the detailed features transmitted from the previous layer, GAP is global average pooling, $\mathrm{Conv}_{3\times3\times3}$ indicates a convolution with kernel size 3 × 3 × 3, and Sigmoid is the sigmoid activation function.
The spatial attention can be described as follows:
$$Z_{s} = \mathrm{Sigmoid}\big(\mathrm{Conv}_{7\times7\times7}\big(\mathrm{concat}\big(\mathrm{AP}\big(Z_{in}^{L} + Z_{in}^{L-1}\big),\ \mathrm{MP}\big(Z_{in}^{L} + Z_{in}^{L-1}\big)\big)\big)\big) \times Z_{in}^{L-1}$$
where AP denotes average pooling, MP represents max pooling, and concat is the concatenation operation.
Our proposed CSAM block can be depicted as follows:
$$Z = \mathrm{concat}\big(Z_{c} + Z_{s},\ Z_{in}^{L-1}\big)$$
where concat denotes the concatenation operation.
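A sketch of CSAM that follows the three equations above; the channel-wise mean/max used for AP and MP mirror CBAM's convention and are an assumption, as is the omission of normalization layers:

```python
import torch
import torch.nn as nn

class CSAM(nn.Module):
    """Channel and spatial attention driven by the sum of the encoder skip feature
    (Z_in^L) and the upsampled decoder feature (Z_in^{L-1})."""
    def __init__(self, channels):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool3d(1)                       # GAP
        self.conv_c = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.conv_s = nn.Conv3d(2, 1, kernel_size=7, padding=3)  # acts on stacked AP/MP maps
        self.sigmoid = nn.Sigmoid()

    def forward(self, z_enc, z_dec):                   # z_enc = Z_in^L, z_dec = Z_in^{L-1}
        s = z_enc + z_dec
        # Channel attention: Z_c = Sigmoid(Conv3(GAP(s))) * Z_in^L
        z_c = self.sigmoid(self.conv_c(self.gap(s))) * z_enc
        # Spatial attention: Z_s = Sigmoid(Conv7(concat(AP(s), MP(s)))) * Z_in^{L-1}
        ap = s.mean(dim=1, keepdim=True)
        mp = s.amax(dim=1, keepdim=True)
        z_s = self.sigmoid(self.conv_s(torch.cat([ap, mp], dim=1))) * z_dec
        # Output: concat(Z_c + Z_s, Z_in^{L-1})
        return torch.cat([z_c + z_s, z_dec], dim=1)

# usage: two inputs with matching shapes; the output has twice as many channels
csam = CSAM(32)
out = csam(torch.randn(1, 32, 16, 16, 16), torch.randn(1, 32, 16, 16, 16))
```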

4. Experiments and Results

4.1. Datasets and Preprocessing

The BraTS dataset, from the Brain Tumor Segmentation Challenge, is a publicly available medical imaging resource that serves as a foundation for the research and development of brain tumor segmentation algorithms. It is a collection of MRI images from a multitude of patients with brain tumors, sourced from various medical centers. The widely used BraTS 2021 challenge provides 1251 cases for training and 219 for online validation [7,46,47]. The 1251 training cases include ground truth annotations provided by certified neuroradiologists, while the ground truth for the 219 validation cases remains undisclosed to the public, with evaluation results accessible solely through the online validation process. We used 80% of the 1251 training cases for training and the remaining 20% for offline validation. Additionally, we submitted our model for evaluation on the Synapse platform, which can be accessed at https://www.synapse.org/#platform (accessed on 29 November 2024).
To facilitate the accurate segmentation of brain tumor images by our network, we first loaded the BraTS 2021 dataset into our program during the preprocessing phase. We employed SimpleITK [48] and MONAI for image processing, using the Z-score normalization method to standardize each image. Subsequently, we minimized the background while ensuring the inclusion of the entire brain and randomly downsampled the image from an initial size of 240 × 240 × 155 to a reduced size of 128 × 128 × 128. We clipped all intensity values to the 1st and 99th percentiles of the non-zero voxel distribution within the volume. We also applied several data augmentation techniques, including channel rescaling within the range of 0.9 to 1.1, channel intensity shifting between −0.1 and 0.1, the addition of Gaussian noise with a mean of 0 and a standard deviation of 0.1, channel dropping, and random flipping with a probability of 80% along each spatial axis. We employed a strategic training regimen for our model during the training phase, and following model optimization, we resized the images back to their original dimensions. Ultimately, we submitted our results to the official platform for assessment.
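A NumPy sketch of the normalization, clipping, and augmentation steps described above (the exact ordering and the per-channel handling in the real SimpleITK/MONAI pipeline may differ):

```python
import numpy as np

def normalize_modality(volume, low=1.0, high=99.0):
    """Clip to the 1st/99th percentiles of non-zero voxels, then apply Z-score normalization."""
    nonzero = volume[volume > 0]
    lo, hi = np.percentile(nonzero, [low, high])
    volume = np.clip(volume, lo, hi)
    return (volume - nonzero.mean()) / (nonzero.std() + 1e-8)

def augment(image, label, rng=None):
    """Intensity rescale (0.9-1.1), intensity shift (+/-0.1), Gaussian noise (std 0.1),
    and random flips with probability 0.8 along each spatial axis."""
    if rng is None:
        rng = np.random.default_rng()
    image = image * rng.uniform(0.9, 1.1) + rng.uniform(-0.1, 0.1)   # applied per channel in practice
    image = image + rng.normal(0.0, 0.1, size=image.shape)           # additive Gaussian noise
    for axis in (-3, -2, -1):                                        # the three spatial axes
        if rng.random() < 0.8:
            image, label = np.flip(image, axis=axis), np.flip(label, axis=axis)
    return image.copy(), label.copy()
```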

4.2. Implementation Details

Our network was implemented using Python 3.8.10 and PyTorch 1.11.0. For the training phase, we utilized a single NVIDIA RTX A5000 graphics card (PNY, Parsippany, NJ, USA) equipped with 24 GB of memory, in conjunction with an AMD EPYC 7551P processor (AMD, Santa Clara, CA, USA). As detailed in Table 1, we initiated training with a learning rate of 1 × 10−4 and a batch size of 1. Throughout the training process, we employed the Ranger [49] optimizer to fine-tune our network. Additionally, we incorporated the standard Dice loss [50] into our network architecture. The dimensions of both input and output data were consistently set to 128 × 128 × 128.
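As a hedged illustration of the loss setup, a standard soft Dice loss over the three output channels could look as follows; the paper's exact variant and the Ranger optimizer configuration are not reproduced here, and Adam is shown only as a stand-in:

```python
import torch
import torch.nn as nn

class SoftDiceLoss(nn.Module):
    """Soft Dice loss averaged over the WT/TC/ET output channels."""
    def __init__(self, eps=1e-5):
        super().__init__()
        self.eps = eps

    def forward(self, logits, target):
        probs = torch.sigmoid(logits)                 # (B, 3, D, H, W)
        dims = (2, 3, 4)
        inter = (probs * target).sum(dims)
        denom = probs.sum(dims) + target.sum(dims)
        dice = (2 * inter + self.eps) / (denom + self.eps)
        return 1.0 - dice.mean()

# training configuration sketch (Adam stands in for the Ranger optimizer used in the paper)
# model = SCENet(...)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# loss_fn = SoftDiceLoss()
```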
Table 1. Model parameter configuration.

4.3. Evaluation Metrics

Quantitative and qualitative assessments were conducted using established evaluation metrics, including the Dice Similarity Coefficient (Dice) score and the Hausdorff distance (HD) [51,52]. The Dice score serves as a metric for the similarity between two sets. In the context of image segmentation, it quantifies the resemblance between the segmentation outcomes predicted by the network and the manual annotations, and is expressed as follows:
$$\mathrm{Dice} = \frac{2TP}{2TP + FP + FN}$$
where TP, FP, and FN denote true-positive cases, false-positive cases, and false-negative cases, respectively. The Hausdorff distance (HD) signifies the maximum distance between the boundary of the predicted segmentation and the actual region boundary. A lower HD value indicates a smaller error in the predicted boundary segmentation, which corresponds to higher quality. The HD can be mathematically represented as follows:
$$D(P, T) = \max\left\{ \sup_{t \in T} \inf_{p \in P} d(t, p),\ \sup_{p \in P} \inf_{t \in T} d(t, p) \right\}$$
where t and p represent points on the real region boundary T and the predicted segmentation region boundary P, respectively; d(·,·) represents the distance between t and p; sup denotes the supremum and inf denotes the infimum.
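A small sketch of both metrics for binary masks; note that it computes the plain symmetric Hausdorff distance on full voxel sets, whereas the BraTS evaluation typically reports the 95th-percentile variant (HD95) on region boundaries:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_score(pred, target):
    """Dice = 2TP / (2TP + FP + FN), since pred.sum() = TP + FP and target.sum() = TP + FN."""
    pred, target = pred.astype(bool), target.astype(bool)
    tp = np.logical_and(pred, target).sum()
    return 2 * tp / (pred.sum() + target.sum() + 1e-8)

def hausdorff_distance(pred, target):
    """Symmetric Hausdorff distance between the voxel coordinate sets of two binary masks."""
    p = np.argwhere(pred)     # predicted region voxel coordinates
    t = np.argwhere(target)   # ground-truth region voxel coordinates
    return max(directed_hausdorff(p, t)[0], directed_hausdorff(t, p)[0])
```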

4.4. Comparison with Other Methods

We compared twelve advanced models to evaluate the advantages of the proposed model. In the evaluation, four indicators are used: whole tumor (WT), tumor core (TC), enhancing tumor (ET), and average dice value (AVG). The numbers of compared networks are 2, 2, 3, and 5 for 2024, 2023, 2022, and classic networks, respectively. The five classic networks are 3D UNet, Att-UNet, UNETR, TransBTS, and VT-UNet. Eight of the compared architectures are variants based on the basic UNet, and four are Transformer-based structures. Table 2 and Figure 5 show the online validation results on BraTS2021. Our WT, TC, ET, and AVG scores are 91.67%, 87.70%, 83.35%, and 87.57%, respectively.
Table 2. The online validation results for the comparison of different methods on BraTS 2021.
Figure 5. Comparison of the dice results of different segmentation methods.
The HD values, shown in Table 2 and Figure 6, are 5.34, 8.03, and 19.41 for the three tumor subregions (WT, TC, and ET), respectively. Compared with the 3D U-Net baseline, our WT, TC, ET, and average dice results increased by 3.65%, 11.53%, 7.15%, and 7.44%, respectively. Compared with VT-UNet, a pure Transformer architecture, our WT, TC, ET, and average dice results increased by 0.01%, 3.29%, 2.6%, and 1.96%, respectively. Compared with UNETR, a CNN + Transformer hybrid, our WT, TC, ET, and average dice values increased by 0.78%, 3.97%, 2.42%, and 2.39%, respectively. The results show that our network improved markedly on all targets. Although not all indicators are optimal, the average dice of the network is the highest among all comparison results, which also demonstrates that the proposed network performs better in brain tumor segmentation tasks.
Figure 6. Comparison of HD results of different segmentation methods.
Figure 7 illustrates the visualization outcomes of the SCENet model when applied to the BraTS 2021 dataset, showcasing five randomly selected cases. The cases with identifiers 00631, 00446, 00586, 00618, and 00625 in Figure 7 were segmented by SCENet. The combined green, yellow, and red regions, the union of red and yellow, and the yellow regions correspond to whole tumor (WT), tumor core (TC), and enhancing tumor (ET), respectively. Generally speaking, SCENet’s segmentation results closely match the labeled ground truth. When compared to the networks based on 3D UNet or Transformers presented in Table 2, our model’s performance is superior, and our network also outperforms other UNet-based architectures. Collectively, our architecture and its constituent modules have achieved outstanding results on the BraTS 2021 dataset, laying a solid foundation for future research.
Figure 7. Visualization results of medical cases. The union of green, yellow and red, the union of red and yellow, and the yellow labels represent WT, TC, and ET, respectively.

4.5. Ablation Experiments

4.5.1. Ablation Study of Each Module in SCENet

We conducted ablation experiments to verify the effect of different modules in this architecture. Experiment (Expt) A represents the baseline without any modules. The SCER module, CSAM attention mechanism, and ASPP module are used in Experiments B, C, and D, respectively. The different combinations of SCER module, CSAM attention mechanism, and ASPP module are tested in Experiments E, F, and G, respectively.
The WT value of the combination of SCER and CSAM is the highest, as can be seen in Figure 8 and Table 3. From the results, it can be seen that the effect is not significant when using CSAM or ASPP alone, but there is a significant improvement when both are used in the network. This indicates that increasing the multi-scale receptive field in the bottleneck layer is beneficial, but without effective feature selection after the input to the decoder, it is difficult to achieve improved performance. When SCER, CSAM, and ASPP are all used, the average dice is the best, at 87.57%. These results show that our network and all of its modules can be effectively applied to brain tumor segmentation tasks.
Figure 8. The result of ablation study of each module in SCENet.
Table 3. The result of ablation study of each module in SCENet. The symbol "√" indicates that it has been selected for use in the network.

4.5.2. Ablation Study of the Number of Stacking Convolution Layers and Kernel Size in the SCER Module

To verify the effectiveness of replacing a convolution of kernel size 7 × 7 × 7 with a stack of three convolution layers of kernel size 3 × 3 × 3 in the SCER module, we conducted three sets of experiments on the SCER.
The results are shown in Table 4 and Figure 9. It is generally believed that a convolution with kernel size 7 × 7 × 7 has a large receptive field and should therefore perform best; however, except for the best ET result, its other indicators are not as good as those of the other experiments. A large receptive field does indeed improve the segmentation of small target areas, but the large number of parameters leads to a decrease in the final result. A stack of three convolutions with kernel size 3 × 3 × 3 and a single convolution with kernel size 7 × 7 × 7 have the same receptive field, but the stacked version has far fewer parameters, and its results are the best among these three experiments. We also reduced the stack by one convolution on the basis of Experiment C; as Experiment B in Table 4 shows, the effect did not improve, which confirms that stacking three 3 × 3 × 3 convolutional layers is the best choice in the proposed network.
Table 4. The results of ablation experiments for the number of stacking convolution layers and kernel size in the SCER module.
Figure 9. The results of ablation experiments for the number of stacking convolution layers and kernel size in the SCER module.

4.5.3. Comparative Experiment of the SCER Module of SCENet with the Shuffle Block of ShuffleNet V2

The SCER module and the shuffle block of ShuffleNet V2 are indeed somewhat similar, but there are still differences in their structures.
To demonstrate that the SCER module does indeed perform better than the shuffle block in the SCENet network, we conducted two sets of experiments. From Figure 10 and Table 5, the WT, TC, ET, and average dice results of the SCER module are 91.15%, 86.40%, 82.17%, and 86.57%, respectively, compared with 90.73%, 84.99%, 82.39%, and 86.04% for the shuffle block. This result shows that our module outperforms the shuffle block in brain tumor segmentation tasks.
Figure 10. The results of the comparative experiment of the SCER module with the shuffle block.
Table 5. The results of the comparative experiment of the SCER module with the shuffle block.

5. Limitations and Future Perspectives

At present, deep learning stands as a prominent area of research. The advancement of attention mechanisms and the refinement of algorithms have significantly improved the quality of brain tumor MRI images. Nonetheless, there is a dearth of experiments that translate these advancements into clinical practice. Moving forward, the emphasis of research should pivot towards clinical applications to develop algorithms that are more grounded in real-world utility.
Bioinformatics stands as a crucial field of advancement, exerting a substantial impact on the etiological analysis and prognostic evaluation of brain tumors. The seamless integration of medical imaging, clinical diagnostics, and bioinformatics can significantly enhance the efficacy of brain tumor treatments [56].

6. Conclusions

In this study, we propose SCENet based on UNet, which integrates an SCER block and a CSAM block. The SCER block takes advantage of stacking three small kernel convolutions to form an effective receptive field, which greatly improves feature extraction ability. The CSAM module achieves a good segmentation performance by combining spatial and channel attention mechanisms to guide the selection of deep semantic information. The ASPP module is introduced into the bottleneck layer to obtain richer semantic features. In addition to comparing our results with the classic 3D UNet, we also compared them with networks based on UNet or Transformers. Our results yielded WT, TC, ET, and average dice scores of 91.67%, 87.70%, 83.35%, and 87.57%, respectively. We also conducted ablation experiments on the SCER module, CSAM module, and ASPP, which demonstrated the effectiveness of our modules. By comparing a 7 × 7 × 7 convolution with three stacked 3 × 3 × 3 convolutions in the SCER block, it can be seen that although their receptive fields are similar, the stack of three 3 × 3 × 3 convolutions yielded better results in this network. The combination experiment of ASPP and CSAM shows that multi-scale receptive fields in the bottleneck layer are beneficial, but the improvement is not significant without effective feature selection in the decoder. By comparing the performance of the SCER block in SCENet with the shuffle block of ShuffleNet V2 in this architecture, the results show that the proposed module is superior to the shuffle block. Furthermore, quantitative and qualitative experiments demonstrated the accuracy of SCENet in brain tumor segmentation tasks. Our architecture and the proposed modules can provide effective ideas for subsequent research. We believe that the encouraging results obtained with SCENet will inspire further research into brain tumor segmentation.

Author Contributions

Conceptualization, N.C. and B.G.; methodology, B.G.; software, P.Y.; data curation, R.Z.; writing—original draft, B.G.; writing—review and editing, N.C. and B.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Jiangsu Provincial Key Research and Development Program (BE2020714). The APC was funded by Cao, N.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Datasets released to the public were analyzed in this study. The BraTS2021 dataset can be found through the following link: https://www.med.upenn.edu/cbica/brats2021/#Data2 (accessed on 15 October 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AG: Attention Gate
AP: Average Pooling
ASPP: Atrous Spatial Pyramid Pooling
AVG: Average Dice Value
CNN: Convolutional Neural Network
CSAM: Channel Spatial Attention Module
Dice: Dice Similarity Coefficient
ET: Enhancing Tumor
EVO Norm: Evolving Normalization–Activation Layers
Expt: Experiment
FCN: Fully Convolutional Networks
FLAIR: Fluid-Attenuated Inversion Recovery
GAP: Global Average Pooling
HD: Hausdorff Distance
LN: Layer Normalization
MP: Max Pooling
ROI: Region of Interest
SCENet: Small Kernel Convolution with Effective Receptive Field Network
SCER: Small Kernel Convolution with Effective Receptive Field Shuffle Module
SGE: Spatial Group-Wise Enhance
T1: T1-Weighted
T1ce: T1-Enhanced Contrast
T2: T2-Weighted
TC: Tumor Core
ViT: Vision Transformer
VT-UNet: Volumetric Transformer
WHO: World Health Organization
WT: Whole Tumor

References

  1. Ibebuike, K.; Ouma, J.; Gopal, R. Meningiomas among intracranial neoplasms in Johannesburg, South Africa: Prevalence, clinical observations and review of the literature. Afr. Health Sci. 2013, 13, 118–121. [Google Scholar] [CrossRef] [PubMed]
  2. Herholz, K.; Langen, K.-J.; Schiepers, C.; Mountz, J.M. Brain tumors. Semin. Nucl. Med. 2012, 42, 356–370. [Google Scholar] [CrossRef] [PubMed]
  3. Wu, J.; Su, R.; Qiu, D.; Cheng, X.; Li, L.; Huang, C.; Mu, Q. Analysis of DWI in the classification of glioma pathology and its therapeutic application in clinical surgery: A case-control study. Transl. Cancer Res. 2022, 11, 805–812. [Google Scholar] [CrossRef]
  4. Chen, J.; Qi, X.; Zhang, M.; Zhang, J.; Han, T.; Wang, C.; Cai, C. Review on neuroimaging in pediatric-type diffuse low-grade gliomas. Front. Pediatr. 2023, 11, 1149646. [Google Scholar] [CrossRef] [PubMed]
  5. Verma, N.; Cowperthwaite, M.C.; Burnett, M.G.; Markey, M.K.J. Differentiating tumor recurrence from treatment necrosis: A review of neuro-oncologic imaging strategies. Neuro-Oncology 2013, 15, 515–534. [Google Scholar] [CrossRef]
  6. Bauer, S.; Wiest, R.; Nolte, L.-P.; Reyes, M. A survey of MRI-based medical image analysis for brain tumor studies. Phys. Med. Biol. 2013, 58, R97. [Google Scholar] [CrossRef]
  7. Bakas, S.; Akbari, H.; Sotiras, A.; Bilello, M.; Rozycki, M.; Kirby, J.S.; Freymann, J.B.; Farahani, K.; Davatzikos, C. Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features. Sci. Data 2017, 4, 170117. [Google Scholar] [CrossRef]
  8. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  9. Çiçek, Ö.; Abdulkadir, A.; Lienkamp, S.S.; Brox, T.; Ronneberger, O. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2016: 19th International Conference, Athens, Greece, 17–21 October 2016; pp. 424–432. [Google Scholar]
  10. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  11. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  12. Norman, B.; Pedoia, V.; Majumdar, S. Use of 2D U-Net convolutional neural networks for automated cartilage and meniscus segmentation of knee MR imaging data to determine relaxometry and morphometry. Radiology 2018, 288, 177–185. [Google Scholar] [CrossRef]
  13. Sevastopolsky, A. Optic disc and cup segmentation methods for glaucoma detection with modification of U-Net convolutional neural network. Pattern Recognit. Image Anal. 2017, 27, 618–624. [Google Scholar] [CrossRef]
  14. Roy, A.G.; Conjeti, S.; Karri, S.P.K.; Sheet, D.; Katouzian, A.; Wachinger, C.; Navab, N. ReLayNet: Retinal layer and fluid segmentation of macular optical coherence tomography using fully convolutional networks. Biomed. Opt. Express. 2017, 8, 3627–3642. [Google Scholar] [CrossRef] [PubMed]
  15. Skourt, B.A.; El Hassani, A.; Majda, A. Lung CT image segmentation using deep neural networks. Procedia Comput. Sci. 2018, 127, 109–113. [Google Scholar] [CrossRef]
  16. Chen, C.; Liu, X.; Ding, M.; Zheng, J.; Li, J. 3D dilated multi-fiber network for real-time brain tumor segmentation in MRI. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, 13–17 October 2019; pp. 184–192. [Google Scholar]
  17. Raza, R.; Bajwa, U.I.; Mehmood, Y.; Anwar, M.W.; Jamal, M.H. dResU-Net: 3D deep residual U-Net based brain tumor segmentation from multimodal MRI. Biomed. Signal Process. Control 2023, 79, 103861. [Google Scholar] [CrossRef]
  18. Ahmad, P.; Qamar, S.; Shen, L.; Rizvi, S.Q.A.; Ali, A.; Chetty, G. Ms unet: Multi-scale 3d unet for brain tumor segmentation. In Proceedings of the International MICCAI Brainlesion Workshop, Singapore, 18 September 2022; pp. 30–41. [Google Scholar]
  19. Gammoudi, I.; Ghozi, R.; Mahjoub, M.A. An Innovative Approach to Multimodal Brain Tumor Segmentation: The Residual Convolution Gated Neural Network and 3D UNet Integration. Trait. Signal 2024, 41, 141–151. [Google Scholar] [CrossRef]
  20. Soni, V.; Singh, N.K.; Singh, R.K.; Tomar, D.S. Multiencoder-based federated intelligent deep learning model for brain tumor segmentation. IMA 2023, 34, e22981. [Google Scholar] [CrossRef]
  21. Olisah, C.C. SEDNet: Shallow Encoder-Decoder Network for Brain Tumor Segmentation. arXiv 2024, arXiv:2401.13403. [Google Scholar]
  22. Chen, R.; Lin, Y.; Ren, Y.; Deng, H.; Cui, W.; Liu, W. An efficient brain tumor segmentation model based on group normalization and 3D U-Net. IMA 2024, 34, e23072. [Google Scholar] [CrossRef]
  23. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; Mcdonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention u-net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar]
  24. Liu, D.; Sheng, N.; He, T.; Wang, W.; Zhang, J.; Zhang, J. SGEResU-Net for brain tumor segmentation. Math. Biosci. Eng. 2022, 19, 5576–5590. [Google Scholar] [CrossRef] [PubMed]
  25. Tian, W.W.; Li, D.W.; Lv, M.Y.; Huang, P. Axial attention convolutional neural network for brain tumor segmentation with multi-modality mri scans. Brain Sci. 2023, 13, 12. [Google Scholar] [CrossRef]
  26. Zhang, L.; Lan, C.; Fu, L.; Mao, X.; Zhang, M. Segmentation of brain tumor MRI image based on improved attention module Unet network. SIViP 2023, 17, 2277–2285. [Google Scholar] [CrossRef]
  27. Liu, D.; Sheng, N.; Han, Y.; Hou, Y.; Liu, B.; Zhang, J. SCAU-net: 3D self-calibrated attention U-Net for brain tumor segmentation. Neural Comput. 2023, 35, 23973–23985. [Google Scholar] [CrossRef]
  28. Zeeshan Aslam, M.; Raza, B.; Faheem, M.; Raza, A. AML-Net: Attention-based multi-scale lightweight model for brain tumor segmentation in internet of medical things. CAAI Trans. Intell. Technol. 2024; early view. [Google Scholar] [CrossRef]
  29. Kharaji, M.; Abbasi, H.; Orouskhani, Y.; Shomalzadeh, M.; Kazemi, F.; Orouskhani, M. Brain Tumor Segmentation with Advanced nnU-Net: Pediatrics and Adults Tumors. Neurosci. Inform. 2024, 4, 100156. [Google Scholar] [CrossRef]
  30. Pang, B.; Chen, L.; Tao, Q.; Wang, E.; Yu, Y. GA-UNet: A Lightweight Ghost and Attention U-Net for Medical Image Segmentation. J. Imaging Inform. Med. 2024, 37, 1874–1888. [Google Scholar] [CrossRef] [PubMed]
  31. Tang, Y.; Han, K.; Guo, J.; Xu, C.; Xu, C.; Wang, Y. GhostNetV2: Enhance cheap operation with long-range attention. Proc. Adv. Neural Inf. Process. Syst. 2022, 35, 9969–9982. [Google Scholar]
  32. Peiris, H.; Hayat, M.; Chen, Z.; Egan, G.; Harandi, M. A robust volumetric transformer for accurate 3D tumor segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Singapore, 18–22 September 2022; pp. 162–172. [Google Scholar]
  33. Jia, Q.; Shu, H. Bitr-unet: A cnn-transformer combined network for mri brain tumor segmentation. In Proceedings of the International MICCAI Brainlesion Workshop, Virtual Event, 27 September 2021; pp. 3–14. [Google Scholar]
  34. Wang, W.; Chen, C.; Ding, M.; Yu, H.; Zha, S.; Li, J. TransBTS: Multimodal brain tumor segmentation using transformer. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France, 27 September–1 October 2021; pp. 109–119. [Google Scholar]
  35. Sun, Q.; Fang, N.; Liu, Z.; Zhao, L.; Wen, Y.; Lin, H. HybridCTrm: Bridging CNN and transformer for multimodal brain image segmentation. J. Healthc. Eng. 2021, 2021, 7467261. [Google Scholar] [CrossRef] [PubMed]
  36. Hatamizadeh, A.; Tang, Y.; Nath, V.; Yang, D.; Myronenko, A.; Landman, B.; Roth, H.R.; Xu, D. Unetr: Transformers for 3d medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 574–584. [Google Scholar]
  37. Cai, Y.; Long, Y.; Han, Z.; Liu, M.; Zheng, Y.; Yang, W.; Chen, L. Swin Unet3D: A three-dimensional medical image segmentation network combining vision transformer and convolution. BMC Med. Inform. Decis. Mak. 2023, 23, 33. [Google Scholar] [CrossRef] [PubMed]
  38. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2022, arXiv:2010.11929. [Google Scholar]
  39. Liu, H.; Brock, A.; Simonyan, K.; Le, Q. Evolving normalization-activation layers. Adv. Neural Inf. Process. Syst. 2020, 33, 13539–13550. [Google Scholar]
  40. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  41. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  42. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  43. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  44. Ma, N.; Zhang, X.; Zheng, H.-T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  45. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  46. Baid, U.; Ghodasara, S.; Mohan, S.; Bilello, M.; Calabrese, E.; Colak, E.; Farahani, K.; Kalpathy-Cramer, J.; Kitamura, F.C.; Pati, S. The rsna-asnr-miccai brats 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv 2021, arXiv:2107.02314. [Google Scholar]
  47. Menze, B.H.; Jakab, A.; Bauer, S.; Kalpathy-Cramer, J.; Farahani, K.; Kirby, J.; Burren, Y.; Porz, N.; Slotboom, J.; Wiest, R. The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Trans. Med. Imaging 2014, 34, 1993–2024. [Google Scholar] [CrossRef]
  48. Lowekamp, B.C.; Chen, D.T.; Ibáñez, L.; Blezek, D. The design of SimpleITK. Front. Neuroinformatics 2013, 7, 45. [Google Scholar] [CrossRef]
  49. Wright, L.; Demeure, N. Ranger21: A synergistic deep learning optimizer. arXiv 2021, arXiv:2106.13731. [Google Scholar]
  50. Dice, L.R. Measures of the amount of ecologic association between species. Ecology 1945, 26, 297–302. [Google Scholar] [CrossRef]
  51. Kim, I.S.; McLean, W. Computing the Hausdorff distance between two sets of parametric curves. Commun. Korean Math. Soc. 2013, 28, 833–850. [Google Scholar] [CrossRef]
  52. Aydin, O.U.; Taha, A.A.; Hilbert, A.; Khalil, A.A.; Galinovic, I.; Fiebach, J.B.; Frey, D.; Madai, V.I. On the usage of average Hausdorff distance for segmentation performance assessment: Hidden error when used for ranking. Eur. Radiol. Exp. 2021, 5, 4. [Google Scholar] [CrossRef]
  53. Wu, Q.; Pei, Y.; Cheng, Z.; Hu, X.; Wang, C. SDS-Net: A lightweight 3D convolutional neural network with multi-branch attention for multimodal brain tumor accurate segmentation. Math. Biosci. Eng. 2023, 20, 17384–17406. [Google Scholar] [CrossRef] [PubMed]
  54. Håversen, A.H.; Bavirisetti, D.P.; Kiss, G.H.; Lindseth, F. QT-UNet: A self-supervised self-querying all-Transformer U-Net for 3D segmentation. IEEE Access 2024, 12, 62664–62676. [Google Scholar] [CrossRef]
  55. Akbar, A.S.; Fatichah, C.; Suciati, N.; Za’in, C. Yaru3DFPN: A lightweight modified 3D UNet with feature pyramid network and combine thresholding for brain tumor segmentation. Neural Comput. Appl. 2024, 36, 7529–7544. [Google Scholar] [CrossRef]
  56. Papacocea, S.I.; Vrinceanu, D.; Dumitru, M.; Manole, F.; Serboiu, C.; Papacocea, M.T. Molecular Profile as an Outcome Predictor in Glioblastoma along with MRI Features and Surgical Resection: A Scoping Review. Int. J. Mol. Sci. 2024, 25, 9714. [Google Scholar] [CrossRef] [PubMed]
