1. Introduction
Precisely delineating the boundaries of brain tumor regions in MRI is an important basis for diagnosis, treatment planning, surgical evaluation, and follow-up. However, brain tumors vary widely in shape and have complex boundaries, so manual segmentation is time-consuming and labor-intensive, and its accuracy is difficult to guarantee. Automatic, computer-based segmentation of brain tumors can greatly improve radiologists' efficiency and segmentation accuracy, and therefore has significant clinical value.
With the rise of artificial intelligence, deep learning techniques are widely used in image processing, information systems, and natural language processing [1,2]. Among these, the convolutional neural network (CNN), one of the representative deep learning algorithms, performs well in image-related tasks and has greatly advanced image segmentation, classification, detection, and related technologies since its initial proposal [3]. Subsequently, a large number of excellent network models have emerged, including ResNet and DenseNet, enriching the applications of convolutional neural networks in various fields [4]. Because convolutional neural networks can automatically learn representative and complex features directly from a dataset, training robust models with strong learning ability without manually designed input features, they are widely used in brain tumor segmentation. With the continuous development and improvement of medical imaging equipment, experts and scholars have proposed various effective automated brain tumor segmentation methods [5,6,7,8].
Magnetic resonance images of brain tumors are available in axial, coronal, and sagittal views, and brain tumors present significantly different information in each view. Classified by how they utilize these MRI views, current deep learning segmentation methods are mainly either single-view methods based on the axial view or multi-view methods that still focus primarily on that view. For segmentation methods that use 2D slices in the axial view, although the axial view contains partial information about the brain, the complete lesion area cannot be observed from this view alone [9]. Further information on the location and shape of the brain tumor must be determined by combining the coronal and sagittal views [10]. To combine information from multiple views, some researchers decomposed the standard 3D convolution kernel into axial intra-slice and inter-slice convolution kernels, which perform convolution operations in the axial plane and in the view perpendicular to it, respectively [11]. However, the receptive fields of the two convolutions differ, and only two view orientations are used, so the extracted 3D spatial information remains limited. Another way to combine three-view information is to split the 3D dataset into axial, coronal, and sagittal 2D slices and extract features from each view with a separate 2D CNN, which makes fuller use of the spatial information and further improves segmentation accuracy [12]. However, each individual model can only perform limited feature extraction on images from a single view during training, so complete extraction of contextual image information relies on integrating the separate networks. Modeling each view independently before fusion ignores the correlation between view slices and increases model complexity.
In addition to taking full advantage of the axial, coronal, and sagittal views of the brain tumor, convolution kernels of different sizes must also be considered to adapt to brain tumor lesions of various sizes and further improve segmentation accuracy. Because brain tumors appear at arbitrary sizes in MRI, segmentation models need to adapt to lesions of different sizes. A large receptive field covers a wider context and more semantic information, which is crucial for processing large brain tumor lesions; in contrast, a small receptive field better captures local detail, facilitating finer boundary delineation and more accurate predictions, especially for small lesions. However, brain tumor segmentation models built on standard convolution use only a single kernel size per convolutional layer. This results in small, fixed receptive fields, limiting the network's ability to adaptively represent features of lesions with varying sizes [13]. Dilated convolution sets different dilation rates for the standard convolution kernel by inserting zero-valued positions between the kernel's elements. This varies the effective kernel size and thus flexibly expands the receptive field. Dilated convolution yields a larger receptive field without changing the feature map size, avoiding the need for pooling-based downsampling. A single dilated convolution, however, has one specific receptive field: kernels with small dilation rates learn detailed information well but cannot capture larger-scale contextual features, whereas kernels with large dilation rates extract features over large receptive fields but lose more detail [14]. Using a single kernel for feature extraction therefore reduces the network's ability to generalize to objects of different sizes.
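As a concrete illustration (our addition, not from the cited works), the effective kernel size of a dilated convolution follows directly from this definition:

$$k_{\text{eff}} = k + (k - 1)(r - 1),$$

where $k$ is the kernel size and $r$ is the dilation rate. A 3 × 3 × 3 kernel with $r = 2$ thus covers a 5 × 5 × 5 region, and with $r = 4$ a 9 × 9 × 9 region, while the number of learned weights stays at 27.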
Therefore, to obtain different receptive fields while accounting for both local detail and global semantic information, pyramid-structured feature extractors [15,16] adopt multiple dilated convolution kernels in parallel. In the ASPP structure [17], several dilated convolutional layers with different dilation rates are applied in parallel to represent targets of arbitrary sizes, and their outputs are combined to integrate the information extracted from the various receptive fields. To some extent, this improves the model's robustness to image scale variations, but it cannot adapt to targets that combine high feature similarity with varied sizes. Moreover, this feature pyramid structure obtains receptive fields of different sizes through parallel branches with independent kernels, so the computational cost grows with the number of parallel branches.
1.1. Motivation
After examining conventional CNN-based brain tumor segmentation methods, we found that they often use information from only a single view. Physicians, however, typically combine information from three views: axial, coronal, and sagittal. Moreover, brain tumors and their subregions have complex, irregular boundary structures; a standard convolution kernel cannot automatically adapt to the varied tumor sizes, connectivity, and boundary concavities while simultaneously extracting these similar features. To address these issues, we propose an end-to-end 3D brain tumor segmentation network based on hierarchical multi-view convolution and kernel-sharing dilated convolution (MVKS-Net), in which the 3D multi-view convolution is inspired by physicians' segmentation process and the kernel-sharing dilated convolution characterizes the similar textures within the irregular regions of brain tumors.
1.2. Contributions
The contributions of this study are as follows:
We propose an axial–coronal–sagittal fusion convolution (ACSF), which decouples the standard 3D convolution into convolutions on three orthogonal views: axial, coronal, and sagittal. By combining the image features extracted in the axial, coronal, and sagittal planes, each pixel in the brain tumor image receives two additional view-based discriminations, further refining its classification.
We propose a hierarchical decoupled multi-scale fusion module based on ACSF convolution. By incorporating short connections with residual-like structures between the multi-view convolutional blocks for multi-scale feature fusion, image information flows smoothly through each feature subgroup, and the module's receptive field gradually enlarges, improving the network's perception of 3D spatial contextual information.
We propose a kernel-sharing dilated convolution (KSDC). Multiple branches with different dilation rates share a single kernel, allowing the network to simultaneously learn brain tumor features that differ in size yet are highly similar. This better represents complex boundaries and improves segmentation accuracy. In addition, kernel sharing significantly reduces computational cost.
The remainder of this study is structured as follows. Related works are discussed in Section 2. Section 3 describes the framework of the proposed brain tumor segmentation network. Section 4 provides the experimental results and compares them with current advanced methods. In Section 5, we summarize the proposed method and discuss its prospects. In Section 6, we present future research directions.
3. Method
The architecture of the proposed 3D brain tumor segmentation network with multi-view fusion convolution and kernel-sharing dilated convolution (MVKS-Net) is shown in Figure 1. The main body of the network consists of hierarchical multi-view fusion convolution modules and kernel-sharing dilated convolution modules, with a constant 32 channels per layer. The network input is a 3D image block formed by stitching the four brain tumor modalities, with each block sized 128 × 128 × 128.
In the feature encoding stage, the input image first passes through a 3 × 3 × 3 convolution, which converts the 4-channel image block into a 32-channel block of size 64 × 64 × 64. The ACSF convolution module then adaptively extracts features in the axial, coronal, and sagittal planes of the 3D image block, improving the model's ability to capture multi-view spatial information. Each convolution operation is followed by synchronized batch normalization and a ReLU activation. Downsampling uses 3 × 3 × 3 convolution with stride 2. In the final stage of the encoder, the KSDC module generates receptive fields of different sizes and scans the input features multiple times, better extracting high-level semantic information and adapting to brain tumor lesions of different sizes. The decoding stage upsamples the feature maps with trilinear interpolation, and skip connections between the encoder and decoder concatenate the upsampled features with the encoder's high-resolution features. The details of the network structure are given in Table 1.
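To make the data flow concrete, the following is a minimal PyTorch sketch of the encoder–decoder skeleton just described; it is our own reconstruction, not the authors' released code. The names `MVKSNetSkeleton` and `conv_bn_relu`, the two-level encoder depth, and the final upsampling back to full resolution are assumptions for illustration; the ACSF and KSDC modules of Sections 3.1–3.3 are replaced here by plain 3 × 3 × 3 convolutions so the skeleton runs end to end.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(cin, cout, stride=1):
    """3x3x3 convolution followed by batch normalization and ReLU."""
    return nn.Sequential(
        nn.Conv3d(cin, cout, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm3d(cout),
        nn.ReLU(inplace=True),
    )

class MVKSNetSkeleton(nn.Module):
    def __init__(self, in_ch=4, ch=32, num_classes=4):
        super().__init__()
        self.stem = conv_bn_relu(in_ch, ch, stride=2)  # 4 x 128^3 -> 32 x 64^3
        self.enc1 = conv_bn_relu(ch, ch)               # stand-in for an ACSF module
        self.down1 = conv_bn_relu(ch, ch, stride=2)    # 3x3x3 conv, stride 2
        self.enc2 = conv_bn_relu(ch, ch)               # stand-in for an ACSF module
        self.down2 = conv_bn_relu(ch, ch, stride=2)
        self.bottleneck = conv_bn_relu(ch, ch)         # stand-in for the KSDC module
        self.dec2 = conv_bn_relu(2 * ch, ch)           # input doubled by skip concat
        self.dec1 = conv_bn_relu(2 * ch, ch)
        self.head = nn.Conv3d(ch, num_classes, 1)

    def forward(self, x):
        s1 = self.enc1(self.stem(x))
        s2 = self.enc2(self.down1(s1))
        b = self.bottleneck(self.down2(s2))
        # Decoder: trilinear upsampling plus encoder skip connections.
        u2 = F.interpolate(b, scale_factor=2, mode="trilinear", align_corners=False)
        d2 = self.dec2(torch.cat([u2, s2], dim=1))
        u1 = F.interpolate(d2, scale_factor=2, mode="trilinear", align_corners=False)
        d1 = self.dec1(torch.cat([u1, s1], dim=1))
        logits = self.head(d1)
        # Recover the 128^3 input resolution (our assumption; the paper does not
        # detail the final upsampling step).
        return F.interpolate(logits, scale_factor=2, mode="trilinear",
                             align_corners=False)

out = MVKSNetSkeleton()(torch.randn(1, 4, 128, 128, 128))  # -> (1, 4, 128, 128, 128)
```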
3.1. ACSF Convolution
To directly extract 3D spatial information from the axial, coronal, and sagittal views of MR images, we propose a new convolution method, namely axial–coronal–sagittal fusion (ACSF) convolution. ACSF convolution decomposes the standard 3D convolution into asymmetric convolutions along the axial, coronal, and sagittal directions; the specific implementation is shown in Figure 2. Suppose the 3D input image is $X \in \mathbb{R}^{C_{in} \times H \times W \times D}$ and the 3D output image is $Y \in \mathbb{R}^{C_{out} \times H' \times W' \times D'}$, where $C_{in}$ and $C_{out}$ represent the input and output channels; $H$, $W$, and $D$ represent the height, width, and depth of the input image; and $H'$, $W'$, and $D'$ represent the height, width, and depth of the output image, respectively. Instead of presenting the 3D image as 2D slices in the three view planes, we split and reshape the 3 × 3 × 3 convolution kernel into three parts, inserting an extra dimension of size one at different indices to generate kernels of size 3 × 3 × 1, 3 × 1 × 3, and 1 × 3 × 3.
By learning features in the three views of brain tumors, H–W, H–D, and W–D, respectively, the single-view representation of each convolution branch is obtained: $Y_{HW} = \mathrm{Conv3D}_{3\times3\times1}(X)$, $Y_{HD} = \mathrm{Conv3D}_{3\times1\times3}(X)$, and $Y_{WD} = \mathrm{Conv3D}_{1\times3\times3}(X)$, where $Y_{HW}, Y_{HD}, Y_{WD} \in \mathbb{R}^{C_{out} \times H' \times W' \times D'}$. With the adaptive weights $\alpha$, $\beta$, and $\gamma$ assigned to each branch, the three-dimensional features of the axial, coronal, and sagittal views are calculated as

$$Y = \alpha \cdot \mathrm{Conv3D}_{3\times3\times1}(X) + \beta \cdot \mathrm{Conv3D}_{3\times1\times3}(X) + \gamma \cdot \mathrm{Conv3D}_{1\times3\times3}(X),$$

where $\mathrm{Conv3D}$ stands for the three-dimensional convolution operation, and $\alpha$, $\beta$, and $\gamma$ weight each view's output. This weighting strategy helps to automatically select the most valuable information from the different views and suppresses features that do not contribute to segmentation accuracy. That is, the three weighted view-specific feature maps are fused to form the output feature map. With the help of ACSF convolution, the integrated coronal and sagittal image features subject each pixel to two additional discriminations, providing a further basis for its classification, improving pixel-level classification accuracy, and benefiting the final segmentation accuracy of the model.
In addition, for a convolution kernel of size $k$, ACSF convolution has about $3k^2 C_{in} C_{out}$ parameters, while standard 3D convolution has about $k^3 C_{in} C_{out}$. The two are essentially equal when the kernel size is 3 ($3k^2 = k^3 = 27$), but for kernel sizes greater than 3, ACSF convolution has fewer parameters than standard 3D convolution (e.g., $75\,C_{in}C_{out}$ versus $125\,C_{in}C_{out}$ for $k = 5$). This characteristic makes it practical to use large kernels.
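The following is a minimal PyTorch sketch of ACSF convolution as described above; it is our own reconstruction, not the authors' released code, and the names `ACSFConv`, `alpha`, `beta`, and `gamma` (scalar per-branch weights) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ACSFConv(nn.Module):
    """Axial-coronal-sagittal fusion convolution (sketch)."""
    def __init__(self, cin, cout):
        super().__init__()
        # 3x3x1, 3x1x3, and 1x3x3 kernels for the H-W, H-D, and W-D views;
        # the padding keeps the spatial size unchanged.
        self.conv_hw = nn.Conv3d(cin, cout, (3, 3, 1), padding=(1, 1, 0), bias=False)
        self.conv_hd = nn.Conv3d(cin, cout, (3, 1, 3), padding=(1, 0, 1), bias=False)
        self.conv_wd = nn.Conv3d(cin, cout, (1, 3, 3), padding=(0, 1, 1), bias=False)
        # Adaptive per-branch weights, learned jointly with the kernels.
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, x):  # x: (N, C_in, H, W, D)
        return (self.alpha * self.conv_hw(x)
                + self.beta * self.conv_hd(x)
                + self.gamma * self.conv_wd(x))

y = ACSFConv(32, 32)(torch.randn(1, 32, 16, 16, 16))  # -> (1, 32, 16, 16, 16)
# Kernel parameters: 3 * (3*3*1) * 32 * 32 = 27,648, matching 3k^2*Cin*Cout for k = 3.
```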
3.2. Hierarchical Multi-View Fusion Module Based on ACSF Convolution
As shown in Figure 3, the input image $X$ first undergoes a 1 × 1 × 1 convolution, after which its 32 channels are divided equally into four subgroups, $X_1$, $X_2$, $X_3$, and $X_4$, each with eight channels. To compensate for the limited feature extraction capability of each individual group, we add ACSF convolution units in parallel on the different feature-channel subgroups, applying ACSF convolution to subgroups $X_2$, $X_3$, and $X_4$. Finally, short connections are applied between the corresponding subgroups:

$$Y_1 = X_1, \qquad Y_i = F_{\mathrm{ACSF}}(X_i + Y_{i-1}), \quad i = 2, 3, 4,$$

where $F_{\mathrm{ACSF}}$ stands for ACSF convolution. The feature map of the previous subgroup, after its convolution operation, is accumulated into the input of the next subgroup. This not only promotes the smooth flow of image information through each feature subgroup, but the short connections between subgroups also keep enlarging the module's receptive field, making it possible to capture richer multi-scale information of MR images from different views. We concatenate the features $Y_1$, $Y_2$, $Y_3$, and $Y_4$ and finally use a 1 × 1 × 1 convolution kernel to further adjust the feature maps and feature channels obtained under the different receptive fields of the groups. A residual connection between the input and output further improves the stability of the model's information flow. Each layer of the network adopts this ACSF convolution module with short connections for feature extraction, further improving the segmentation accuracy of brain tumors.
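A sketch of this hierarchical module under the same caveats (our reconstruction, not the authors' code; `HierarchicalACSF` is a hypothetical name, and `ACSFConv` refers to the sketch given in Section 3.1):

```python
import torch
import torch.nn as nn

class HierarchicalACSF(nn.Module):
    """Hierarchical multi-view fusion module (sketch). The 32 channels are
    split into four 8-channel subgroups; subgroups 2-4 pass through ACSF
    convolutions with Res2Net-style short connections, and a 1x1x1 convolution
    plus a residual connection fuses the groups."""
    def __init__(self, ch=32, groups=4):
        super().__init__()
        g = ch // groups                      # channels per subgroup (8 here)
        self.pre = nn.Conv3d(ch, ch, 1)
        self.branches = nn.ModuleList(ACSFConv(g, g) for _ in range(groups - 1))
        self.post = nn.Conv3d(ch, ch, 1)

    def forward(self, x):
        xs = torch.chunk(self.pre(x), 4, dim=1)   # X1..X4
        ys = [xs[0]]                              # Y1 = X1 (identity branch)
        for i, branch in enumerate(self.branches, start=1):
            # Short connection: the previous output feeds the next subgroup.
            ys.append(branch(xs[i] + ys[-1]))
        out = self.post(torch.cat(ys, dim=1))     # fuse Y1..Y4
        return out + x                            # residual connection

y = HierarchicalACSF()(torch.randn(1, 32, 16, 16, 16))  # -> (1, 32, 16, 16, 16)
```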
3.3. Kernel-Sharing Dilated Convolution
When the features of the tumor lesion region and its boundaries in the MR image are highly similar, a multi-branch structure with multiple convolution kernels of various sizes in parallel cannot extract features well, because the large and small kernels learn inconsistent weight parameters. To address this problem, we propose a new mechanism called kernel-sharing dilated convolution (KSDC). The overall structure of the proposed KSDC module is shown in Figure 4.
Suppose the input 3D image is $X \in \mathbb{R}^{C \times H \times W \times D}$, where $C$ represents the input channels and $H$, $W$, and $D$ represent the height, width, and depth of the input feature map, respectively.
The input image is processed in three parallel branches. The first branch performs a 1 × 1 × 1 convolution to obtain the feature map $Y_1$; the second branch performs pyramidal dilated convolution to obtain the feature map $Y_2$, where the dilation rate is varied to generate receptive fields of different sizes; the third branch applies global average pooling to the input image followed by upsampling to recover the feature-map size, yielding the feature map $Y_3$. Finally, the three feature maps are fused to obtain the output feature $Y$:

$$Y_1 = W_{1\times1\times1} * X, \qquad Y_2 = \sum_{r \in R} W_{3\times3\times3}^{(r)} * X, \qquad Y_3 = \mathrm{Up}(\mathrm{GAP}(X)),$$
$$Y = Y_1 + Y_2 + Y_3, \qquad Y \in \mathbb{R}^{C' \times H' \times W' \times D'},$$

where $W_{1\times1\times1}$ represents the 1 × 1 × 1 convolution kernel; $W_{3\times3\times3}^{(r)}$ represents the shared 3 × 3 × 3 convolution kernel applied with dilation rate $r$; $R$ represents the set of variable dilation rates; $\mathrm{GAP}$ represents global average pooling; $\mathrm{Up}$ represents upsampling; and $C'$, $H'$, $W'$, and $D'$ represent the channels, height, width, and depth of the output feature map, respectively.
Multiple branches with different dilation rates share a single kernel. The input feature maps are scanned multiple times with receptive fields of different sizes, adapting to lesion features of various sizes. Compared with parallel branches using multiple convolution kernels of different sizes, our sharing strategy reduces the number of model parameters because the kernel parameters are shared, which helps reduce the computational cost. Moreover, sharing effectively increases the number of training samples seen by the kernel, which improves its representation ability and helps improve segmentation performance.
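A sketch of the KSDC module follows (our reconstruction; fusion by summation, the dilation-rate set, and the 1 × 1 × 1 projection after pooling are assumptions). The key point is that every dilated branch reuses the same 3 × 3 × 3 weight tensor, so adding branches adds receptive fields but no parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KSDC(nn.Module):
    """Kernel-sharing dilated convolution (sketch)."""
    def __init__(self, cin, cout, rates=(1, 2, 4)):
        super().__init__()
        self.rates = rates
        self.point = nn.Conv3d(cin, cout, 1)            # branch 1: 1x1x1 conv
        # One shared weight tensor for all dilated branches (kernel sharing).
        self.shared = nn.Parameter(torch.empty(cout, cin, 3, 3, 3))
        nn.init.kaiming_normal_(self.shared)
        self.proj = nn.Conv3d(cin, cout, 1)             # channel match after GAP (our addition)

    def forward(self, x):
        y1 = self.point(x)
        # Branch 2: the shared kernel scans the input once per dilation rate;
        # padding = rate keeps the spatial size constant for a 3x3x3 kernel.
        y2 = sum(F.conv3d(x, self.shared, padding=r, dilation=r) for r in self.rates)
        # Branch 3: global average pooling, then upsample back to input size.
        gap = F.adaptive_avg_pool3d(x, 1)
        y3 = F.interpolate(self.proj(gap), size=x.shape[2:], mode="trilinear",
                           align_corners=False)
        return y1 + y2 + y3                             # fuse the three branches

y = KSDC(32, 32)(torch.randn(1, 32, 16, 16, 16))  # -> (1, 32, 16, 16, 16)
```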
5. Conclusions
This study proposes an efficient multimodal brain tumor segmentation network called MVKS-Net. By replacing standard convolution with multi-view fusion convolution and kernel-sharing dilated convolution, the network reaches average Dice coefficients of 78.16%, 89.52%, and 83.05% for ET, WT, and TC on the BraTS2020 validation set, with only 0.5 M parameters and 28.56 G floating-point operations. The results show that our network combines high segmentation accuracy with low computational resource consumption, providing a strong reference for clinicians performing brain tumor segmentation.
The proposed network deeply exploits the characteristics of brain tumor images. Hierarchical multi-view fusion convolution, with its ensemble discrimination idea, further improves brain tumor segmentation accuracy. In addition, kernel-sharing dilated convolution incorporates a scale-adaptive view of tumor feature similarity, combining features at different scales to adapt to complex tumor boundaries. MVKS-Net is grounded in MR image information; a natural extension is to model the positional correlations and inclusion relationships among the three tumor regions, so that the relative positions of the peritumoral edema, tumor core, and enhancing tumor can be explicitly introduced into the model. Moreover, the model's accuracy is constrained by the small size of the current dataset. In future work, we will consider extending the lightweight, efficient, and concise MVKS-Net to weakly supervised scenarios.
6. Future Work
Although the network proposed in this paper achieves promising results, several aspects can still be refined and improved. Future research could proceed in the following directions:
Firstly, due to limited computing resources, our network takes cropped image blocks of MRI brain tumor images as input, so the tumor feature information the network learns is incomplete. In the future, larger input image blocks could be used to capture more comprehensive tumor information and improve segmentation accuracy.
Secondly, our network directly concatenates the four modalities of brain tumor MR images. However, each modality reflects different brain tumor tissue information to different degrees, and fully exploiting the complex relationships between modalities would help guide segmentation. In the future, multimodal fusion strategies could be considered to learn the complex, nonlinear, complementary information between modalities and thus efficiently fuse and refine multimodal features.