A Multi-Organ Segmentation Network Based on Densely Connected RL-Unet
Abstract
1. Introduction
- We add a local inductive bias module to the Swin-Transformer block to help the RLSwin-Transformer module learn local features as well as long-range dependencies, and we replace the MLP module with a Res-MLP module to prevent feature loss during transmission, thereby increasing segmentation accuracy.
- We design a new double up-sampling module that combines bilinear up-sampling with dilated convolutions of different dilation rates for feature extraction and image restoration (see the sketch after this list).
- We design a densely connected structure at the decoding end and introduce an attention mechanism to restore resolution and generate the segmentation map.
- We propose a novel loss function that significantly increases the segmentation accuracy of small targets.
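As a rough illustration of the first two bullets, the following PyTorch sketch shows one plausible form of a Res-MLP block and of the double up-sampling module. The module names, layer sizes, and dilation rates here are our own assumptions for exposition, not the authors' exact design.

```python
# Hypothetical sketches of a Res-MLP block and a double up-sampling module;
# all hyperparameters (hidden ratio, dilation rates) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResMLP(nn.Module):
    """MLP with a residual shortcut so features are not lost in transmission."""

    def __init__(self, dim: int, hidden_ratio: int = 4, drop: float = 0.0):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim * hidden_ratio)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(dim * hidden_ratio, dim)
        self.drop = nn.Dropout(drop)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The shortcut lets the original features bypass the MLP unchanged.
        return x + self.drop(self.fc2(self.act(self.fc1(x))))


class DoubleUpSample(nn.Module):
    """Bilinear up-sampling followed by parallel dilated convolutions."""

    def __init__(self, in_ch: int, out_ch: int, dilations=(1, 2, 4)):
        super().__init__()
        # padding == dilation keeps the spatial size for 3x3 kernels.
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=d, dilation=d) for d in dilations]
        )
        self.fuse = nn.Conv2d(out_ch * len(dilations), out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Restore resolution first, then extract context at several dilation rates.
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```

For example, `DoubleUpSample(96, 48)` doubles the spatial resolution while halving the channel count, the usual pattern at each decoder stage of a U-Net-style network.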
2. Method
2.1. Integral Structure
2.2. RLSwin-Transformer Module
2.3. Densely Connected Double Up-Sampling Modules (DUSM)
2.4. New Loss Function
- Focal Loss (FL): FL [39] increases the weight of hard-to-classify samples in the loss function, so the loss contribution of samples that are easy to predict is reduced while that of hard samples is amplified. Training therefore concentrates on the challenging samples.
- Dice Loss (DL): DL [40] is derived from the Dice coefficient, a metric of how similar two samples are. The coefficient lies between 0 and 1, and the higher it is, the greater the overlap between the two samples; the loss is obtained by subtracting the coefficient from 1.
- Cross Entropy Loss (CE): CE [41] measures the discrepancy between the predicted distribution and the true distribution; the smaller the cross entropy, the more similar the two distributions are. A sketch combining the three terms follows this list.
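The sketch below shows one minimal way these three terms could be assembled into a composite loss. The focal parameters (`alpha`, `gamma`) and the weighting coefficients are illustrative assumptions, not the values used in the paper.

```python
# Hedged sketch of a composite FL + DL + CE loss; weights and focal parameters
# are placeholders, not the authors' values.
import torch
import torch.nn.functional as F


def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    # FL = -alpha * (1 - p_t)^gamma * log(p_t): down-weights easy samples.
    ce = F.cross_entropy(logits, target, reduction="none")
    p_t = torch.exp(-ce)  # probability assigned to the true class
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()


def dice_loss(logits, target, eps=1e-6):
    # DL = 1 - 2|X ∩ Y| / (|X| + |Y|), averaged over classes.
    num_classes = logits.shape[1]
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    inter = (probs * one_hot).sum(dim=(0, 2, 3))
    denom = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
    return 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()


def combined_loss(logits, target, w_focal=1.0, w_dice=1.0, w_ce=1.0):
    # Weighted sum of the three terms; the weights are hypothetical placeholders.
    return (
        w_focal * focal_loss(logits, target)
        + w_dice * dice_loss(logits, target)
        + w_ce * F.cross_entropy(logits, target)
    )
```

Here `logits` is a (N, C, H, W) prediction tensor and `target` a (N, H, W) tensor of class indices, the usual convention for 2D multi-organ segmentation.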
3. Experiments
3.1. Evaluation Metrics and Datasets
3.2. Experiment Settings
3.3. Experiment Results
Methods | Aorta | Gallbladder | Kidney (L) | Kidney (R) | Liver | Pancreas | Spleen | Stomach | DSC (%) | HD (mm) |
---|---|---|---|---|---|---|---|---|---|---|
ViT [18] | 44.38 | 39.59 | 67.46 | 62.94 | 89.21 | 43.14 | 75.45 | 69.87 | 61.50 | 39.61 |
U-Net [7] | 89.18 | 65.42 | 76.86 | 70.64 | 93.35 | 55.37 | 89.80 | 76.01 | 77.03 | 39.70 |
Swin-Unet [20] | 86.34 | 63.45 | 81.13 | 75.69 | 93.63 | 56.83 | 87.94 | 73.22 | 77.28 | 26.93 |
AttU-Net [10] | 88.59 | 64.42 | 81.73 | 76.77 | 93.99 | 63.68 | 89.56 | 71.18 | 78.74 | 29.54 |
R2U-Net [42] | 88.26 | 68.97 | 76.94 | 71.36 | 91.86 | 57.36 | 87.36 | 74.88 | 77.12 | 32.12 |
GCASCADE [28] | 81.55 | 79.75 | 83.06 | 64.46 | 62.89 | 64.43 | 82.87 | 89.39 | 76.05 | 12.34 |
nnFormer [29] | 92.04 | 70.17 | 86.57 | 86.25 | 96.84 | 83.35 | 90.51 | 86.83 | 86.57 | 10.63 |
MISSFormer [30] | 86.99 | 68.65 | 85.21 | 82.00 | 94.41 | 65.67 | 91.92 | 80.81 | 81.96 | 18.20 |
RL-Unet (ours) | 86.95 | 67.48 | 82.68 | 79.47 | 93.84 | 55.97 | 90.21 | 77.41 | 80.13 | 22.07 |
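In the results tables of this paper, per-organ columns and DSC are Dice similarity coefficients in percent (higher is better) and HD is the Hausdorff distance in millimeters (lower is better). For reference, a minimal DSC helper over binary masks (our own illustration, not the authors' evaluation code):

```python
# Minimal sketch of the per-organ DSC metric reported above: percentage
# overlap of a predicted binary mask with the ground-truth mask.
import numpy as np


def dice_score(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6) -> float:
    """DSC = 2|P ∩ G| / (|P| + |G|), reported in the tables as a percentage."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 100.0 * (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)
```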
3.4. Ablation Experiment
3.5. Learning Rate Hyperparameter
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Azad, R.; Aghdam, E.K.; Rauland, A.; Jia, Y.; Avval, A.H.; Bozorgpour, A.; Karimijafarbigloo, S.; Cohen, J.P.; Adeli, E.; Merhof, D. Medical image segmentation review: The success of u-net. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 1–20.
- Wang, G.; Li, W.; Zuluaga, M.A.; Pratt, R.; Patel, P.A.; Aertsen, M.; Doel, T.; David, A.L.; Deprest, J.; Ourselin, S.; et al. Interactive medical image segmentation using deep learning with image-specific fine tuning. IEEE Trans. Med. Imaging 2018, 37, 1562–1573.
- Yuan, F.; Zhang, Z.; Fang, Z. An effective cnn and transformer complementary network for medical image segmentation. Pattern Recognit. 2022, 136, 109228.
- Zhou, T.; Li, L.; Bredell, G.; Li, J.; Unkelbach, J.; Konukoglu, E. Volumetric memory network for interactive medical image segmentation. Med. Image Anal. 2023, 83, 102599.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
- Siddique, N.; Paheding, S.; Elkin, C.P.; Devabhaktuni, V. U-net and its variants for medical image segmentation: A review of theory and applications. IEEE Access 2021, 9, 82031–82057.
- Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.; Wu, J. Unet 3+: A full-scale connected unet for medical image segmentation. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual Conference, 4–8 May 2020; IEEE: New York, NY, USA, 2020; pp. 1055–1059.
- Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention u-net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999.
- Milletari, F.; Navab, N.; Ahmadi, S. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; IEEE: New York, NY, USA, 2016; pp. 565–571.
- Fan, J.; Hua, Q.; Li, X.; Wen, Z.; Yang, M. Biomedical sensor image segmentation algorithm based on improved fully convolutional network. Measurement 2022, 197, 111307.
- Bose, S.; Chowdhury, R.S.; Das, R.; Maulik, U. Dense dilated deep multiscale supervised u-network for biomedical image segmentation. Comput. Biol. Med. 2022, 143, 105274.
- Schlemper, J.; Oktay, O.; Schaap, M.; Heinrich, M.; Kainz, B.; Glocker, B.; Rueckert, D. Attention gated networks: Learning to leverage salient regions in medical images. Med. Image Anal. 2019, 53, 197–207.
- Oza, P.; Sharma, P.; Patel, S.; Kumar, P. Deep convolutional neural networks for computer-aided breast cancer diagnostic: A survey. Neural Comput. Appl. 2022, 34, 1815–1836.
- Yang, H.; Yang, D. Cswin-pnet: A cnn-swin transformer combined pyramid network for breast lesion segmentation in ultrasound images. Expert Syst. Appl. 2023, 213, 119024.
- Bozorgpour, A.; Azad, R.; Showkatian, E.; Sulaiman, A. Multi-scale regional attention deeplab3+: Multiple myeloma plasma cells segmentation in microscopic images. arXiv 2021, arXiv:2105.06238.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306.
- Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 205–218.
- Li, B.; Liu, S.; Wu, F.; Li, G.; Zhong, M.; Guan, X. Rt-unet: An advanced network based on residual network and transformer for medical image segmentation. Int. J. Intell. Syst. 2022, 37, 8565–8582.
- Zhou, H.; Guo, J.; Zhang, Y.; Yu, L.; Wang, L.; Yu, Y. nnformer: Interleaved transformer for volumetric segmentation. arXiv 2022, arXiv:2109.03201.
- Huang, X.; Deng, Z.; Li, D.; Yuan, X.; Fu, Y. Missformer: An effective transformer for 2d medical image segmentation. IEEE Trans. Med. Imaging 2022, 42, 1484–1494.
- Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 3–11.
- Azad, R.; Heidari, M.; Shariatnia, M.; Aghdam, E.K.; Karimijafarbigloo, S.; Adeli, E.; Merhof, D. Transdeeplab: Convolution-free transformer-based deeplab v3+ for medical image segmentation. In International Workshop on PRedictive Intelligence In MEdicine; Springer: Berlin/Heidelberg, Germany, 2022; pp. 91–102.
- Azad, R.; Arimond, R.; Aghdam, E.K.; Kazerouni, A.; Merhof, D. Dae-former: Dual attention-guided efficient transformer for medical image segmentation. In International Workshop on PRedictive Intelligence in MEdicine; Springer: Berlin/Heidelberg, Germany, 2023; pp. 83–95.
- Zhao, Y.; Li, J.; Hua, Z. Mpsht: Multiple progressive sampling hybrid model multi-organ segmentation. IEEE J. Transl. Eng. Health Med. 2022, 10, 1–9.
- Rahman, M.M.; Marculescu, R. G-cascade: Efficient cascaded graph convolutional decoding for 2d medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 1–6 January 2024; pp. 7728–7737.
- Zhou, H.-Y.; Guo, J.; Zhang, Y.; Han, X.; Yu, L.; Wang, L.; Yu, Y. nnformer: Volumetric medical image segmentation via a 3d transformer. IEEE Trans. Image Process. 2023, 32, 4036–4045.
- Huang, X.; Deng, Z.; Li, D.; Yuan, X. Missformer: An effective medical image segmentation transformer. arXiv 2021, arXiv:2109.07162.
- Li, Z.; Li, D.; Xu, C.; Wang, W.; Hong, Q.; Li, Q.; Tian, J. Tfcns: A cnn-transformer hybrid network for medical image segmentation. In Proceedings of the International Conference on Artificial Neural Networks, Bristol, UK, 6–9 September 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 781–792.
- Chen, L.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
- Huang, J.; Fang, Y.; Wu, Y.; Wu, H.; Gao, Z.; Li, Y.; Ser, J.D.; Xia, J.; Yang, G. Swin transformer for fast mri. Neurocomputing 2022, 493, 281–304.
- Lin, A.; Chen, B.; Xu, J.; Zhang, Z.; Lu, G.; Zhang, D. Ds-transunet: Dual swin transformer u-net for medical image segmentation. IEEE Trans. Instrum. Meas. 2022, 71, 1–15.
- Ma, J.; Tang, L.; Fan, F.; Huang, J.; Mei, X.; Ma, Y. Swinfusion: Cross-domain long-range learning for general image fusion via swin transformer. IEEE/CAA J. Autom. Sin. 2022, 9, 1200–1217.
- Fekri-Ershad, S.; Alsaffar, M.F. Developing a tuned three-layer perceptron fed with trained deep convolutional neural networks for cervical cancer diagnosis. Diagnostics 2023, 13, 686.
- Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Conference, 19–25 June 2021; pp. 13733–13742.
- Guo, M.H.; Liu, Z.N.; Mu, T.J.; Hu, S.M. Beyond self-attention: External attention using two linear layers for visual tasks. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5436–5447.
- Tran, G.S.; Nghiem, T.P.; Nguyen, V.T.; Luong, C.M.; Burie, J. Improving accuracy of lung nodule classification using deep learning with focal loss. J. Healthc. Eng. 2019, 2019, 156416.
- Huang, Q.; Sun, J.; Ding, H.; Wang, X.; Wang, G. Robust liver vessel extraction using 3D u-net with variant dice loss function. Comput. Biol. Med. 2018, 101, 153–162.
- Hu, K.; Zhang, Z.; Niu, X.; Zhang, Y.; Cao, C.; Xiao, F.; Gao, X. Retinal vessel segmentation of color fundus images using multiscale convolutional neural network with an improved cross-entropy loss function. Neurocomputing 2018, 309, 179–191.
- Alom, M.Z.; Hasan, M.; Yakopcic, C.; Taha, T.M.; Asari, V.K. Recurrent residual convolutional neural network based on u-net (r2u-net) for medical image segmentation. arXiv 2018, arXiv:1802.06955.
- Wang, H.; Xie, S.; Lin, L.; Iwamoto, Y.; Han, X.-H.; Chen, Y.-W.; Tong, R. Mixed transformer u-net for medical image segmentation. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; IEEE: New York, NY, USA, 2022; pp. 2390–2394.
Methods | ET (enhancing tumor) | TC (tumor core) | WT (whole tumor) | DSC (%) |
---|---|---|---|---|
U-Net [7] | 83.92 | 87.71 | 90.70 | 87.44 |
U-Net++ [24] | 84.21 | 87.76 | 91.30 | 87.76 |
U-Net3+ [9] | 82.21 | 86.52 | 91.17 | 86.63 |
nnFormer [29] | 83.93 | 87.82 | 91.52 | 87.74 |
AttU-Net [10] | 85.08 | 87.76 | 91.53 | 88.12 |
Swin-Unet [20] | 84.16 | 86.77 | 91.01 | 87.31 |
RL-Unet (ours) | 85.27 | 88.12 | 91.52 | 88.36 |
Methods | RV (right ventricle) | Myo (myocardium) | LV (left ventricle) | DSC (%) |
---|---|---|---|---|
R50+AttnUnet [19] | 87.58 | 79.20 | 93.47 | 86.75 |
R50+Unet [19] | 87.10 | 80.63 | 94.92 | 87.55 |
Trans-Unet [19] | 86.67 | 87.27 | 95.18 | 89.71 |
Swin-Unet [20] | 85.77 | 84.42 | 94.03 | 88.07 |
MT-Unet [43] | 86.64 | 89.04 | 95.62 | 90.43 |
PVTCASCADE [28] | 89.97 | 88.90 | 95.50 | 91.46 |
RL-Unet (ours) | 88.54 | 87.65 | 95.46 | 91.18 |
Methods | Benign | Malignant | DSC (%) |
---|---|---|---|
U-Net [7] | 70.49 | 63.47 | 66.98 |
AttU-Net [10] | 73.30 | 62.95 | 68.13 |
Swin-Unet [20] | 73.31 | 62.13 | 67.72 |
U-Net++ [24] | 75.56 | 65.52 | 70.54 |
U-Net3+ [9] | 75.07 | 66.19 | 70.67 |
RL-Unet (ours) | 74.92 | 66.09 | 70.51 |
Methods | DSC (%) | HD (mm) |
---|---|---|
Baseline | 77.28 | 26.93 |
Baseline+ResMLP | 78.36 | 28.12 |
Baseline+ResMLP+DUSM | 78.59 | 26.45 |
Baseline+ResMLP+DUSM+Attention | 79.69 | 26.78 |
Baseline+ResMLP+DUSM+Attention+L2 | 80.13 | 22.07 |
Baseline+ResMLP+DUSM+Attention+L1 | 79.95 | 24.21 |
Learning Rate | DSC (%) | HD (mm) |
---|---|---|
0.01 | 78.51 | 28.67 |
0.05 | 80.13 | 22.07 |
0.1 | 78.87 | 27.43 |
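A learning rate of 0.05 gives the best trade-off in this table. Purely as an illustration of applying that value (the choice of optimizer and its momentum and weight-decay settings below are our assumptions; the paper's actual training configuration is given in Section 3.2):

```python
# Illustrative only: the optimizer choice and momentum/weight-decay values are
# assumptions; only the learning rate of 0.05 comes from the table above.
import torch
import torch.nn as nn

model = nn.Conv2d(1, 9, kernel_size=3, padding=1)  # stand-in for an RL-Unet instance
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9, weight_decay=1e-4)
```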