A Multi-Scale Cross-Fusion Medical Image Segmentation Network Based on Dual-Attention Mechanism Transformer

Cui, Jianguo; Wang, Liejun; Jiang, Shaochen

doi:10.3390/app131910881

Open AccessArticle

A Multi-Scale Cross-Fusion Medical Image Segmentation Network Based on Dual-Attention Mechanism Transformer

by

Jianguo Cui

,

Liejun Wang

^*

and

Shaochen Jiang

College of Information Science and Engineering, Xinjiang University, Urumqi 830049, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(19), 10881; https://doi.org/10.3390/app131910881

Submission received: 28 August 2023 / Revised: 26 September 2023 / Accepted: 27 September 2023 / Published: 30 September 2023

Download

Browse Figures

Versions Notes

Abstract

:

The U-net network, with its simple and powerful encoder–decoder structure, dominates the field of medical image segmentation. However, convolution operations are limited by receptive fields. They do not have the ability to model long-range dependencies, but Transformer has the capability of long-term modeling thanks to its core self-attention mechanism, which has been widely applied in the field of medical image segmentation. However, both CNNs and Transformer can only perform correlation calculations for a single sample, ignoring the correlation between different samples. To address these problems, we propose a new Transformer, which we call the Dual-Attention Transformer (DAT). This module captures correlations within a single sample while also learning correlations between different samples. The current U-net and some of its variant models have the problem of inadequate feature fusion, so we also improve the skip connection to strengthen the association between feature maps at different scales, reduce the semantic gap between the encoder and decoder, and further improve the segmentation performance. We refer to this structure as DATUnet. We conducted extensive experiments on the Synapse and ACDC datasets to validate the superior performance of our network, and we achieved an average DSC (%) of 83.6 and 90.9 and an average HD95 of 13.99 and 1.466 for the Synapse and ACDC datasets, respectively.

Keywords:

medical image segmentation; Transformer; convolutional neural network

1. Introduction

Medical image segmentation refers to segmenting certain parts of medical images with special meanings and extracting relevant features to assist doctors in obtaining necessary pathological information from complicated medical image data, thereby facilitating doctors’ further diagnosis and treatment of patients and, thus, reducing the workload of doctors’ clinical diagnosis. Current research in medical image segmentation technology is directed in a wide range of directions, such as segmentation tasks for different medical image data: MRI images of the brain [1,2]; dermatographic images [3] and ultrasound images of the heart [4,5,6,7]; and medical image segmentation studies for specific diseases or tasks: tumor segmentation [8], ocular vascular segmentation [9,10], organ segmentation [11,12], etc. Traditional medical image segmentation methods usually require the manual design of features and rules, but such methods are limited by manual experience and time cost, which have difficulty coping with large-scale and complex medical images.

In recent years, deep learning has become a mainstream method in the field of medical imaging. This method utilizes the powerful expressive power of neural networks to learn the features of medical images and then achieves automatic segmentation of different tissue structures in the images; this approach avoids the limitations of traditional methods. The most representative one is U-net [13], which is based on convolutional neural networks [14]. U-Net employs a symmetric encoder–decoder structure for image processing and enhances detail retention by skipping connections in order to reduce the loss of spatial information caused by the downsampling process. By borrowing the excellent structure of U-net [13], researchers have further developed many similar variant models, such as TransUnet [15], TransUnet++ [16], Swin-Unet [17], 3DUnet [18], Unetr [19], etc. Each of these models has its own characteristics and shows good results on different datasets. However, a convolutional neural network (CNN) [14] may have some limitations when it comes to processing medical images. Because CNNs are based on local perception and feature extraction, they may be limited when processing medical images with a large range of contextual information. This can lead to inaccuracies or missing details in the segmentation results. A CNN’s limited capacity to compute correlations for a single sample and its inability to capture long-range relationships prevent future model performance improvement. Additionally, because the features in the encoder and decoder stages are incompatible, it is important to investigate an efficient feature fusion technique in medical image segmentation networks. When feature maps of different scales are simply fused, the segmentation’s accuracy may suffer.

Recently, the revolutionary enhancements brought by Vision Transformer [20] to the field of computer vision have attracted a lot of attention from researchers, and an increasing number of researchers have devoted themselves to it [21,22,23,24]. Transformer, because of its core self-attention mechanism, is able to model potential correlations between global elements and possesses good context modeling capabilities, which precisely compensates for the disadvantage of a CNN’s lack of dependence on capturing long distances. Some efforts have attempted to apply Transformer to medical image segmentation [24,25,26], such as TransUnet [15], proposed by Chen et al, which combines a CNN and Transformer and improves the disadvantage of CNN-based networks with poor global modeling capabilities through Transformer. There are also pure Transformer-based structures, such as Swin-Unet [17], which take the form of a sliding window computation that restricts attention to a single window, effectively reducing the computational effort. These networks were proposed to some extent to solve the problem of the poor global modeling capability of CNNs, and some good results were achieved. The Transformer model can be limited in handling long-distance dependencies. For some image segmentation tasks with a wide range of contextual information, Transformer may not be able to effectively capture correlations at a distance, resulting in inaccuracies or missing details in the segmentation results, and the secondary complexity of the self-attention mechanism can lead to the inability to handle large resolution images [27]. In addition, the self-attention mechanism in Transformer generally only considers the correlation within a single sample. However, the structure of medical images is more fixed, and the semantic information is simpler, so the correlation between different samples is extremely important. All of these issues provide more possibilities for subsequent optimization.

To solve the above problem, we redesigned the network structure, and this network model uses our proposed DAT for coding and decoding images. Our DAT module takes the form of a two-layer attention, which can focus on both the similarity between a single sample and different samples. In this module, we use an effective attention mechanism [28] to compute the correlation within a single sample, which is equivalent to the performance of dot-product attention but with a significant reduction in memory computation. To calculate the correlation between samples, we used an external attention mechanism [29], which learns the potential relationships across the dataset by introducing two learnable shared memory units. Meanwhile, to solve the problem of inadequate feature fusion mentioned above, we introduced the CTrans (Channel Transformer) module [30] to enhance the fusion of feature maps of different scales at the codec side to reduce the semantic divide and further enhance the segmentation performance.

Our work can be summarized in two points:

We propose the DAT module, which can focus on both internal attention and external attention and helps to enhance the correlation between different samples.
Through a multi-scale feature fusion method, the feature information of different scales is effectively combined, and the effective information is fully preserved.
A pure Transformer structure, DAT-Unet, for medical image segmentation is designed, and the effectiveness of our method is proved on two different public datasets. Compared to other methods, our method has many advantages.

2. Relate Work

2.1. CNN-Based Methods

The research on medical image segmentation has important practical value. It can help doctors quickly and accurately identify and locate different tissue structures inside a patient’s body. CNNs, as a powerful image processing tool, have been widely used in medical image segmentation. CNN-based medical image segmentation algorithms mainly consist of two stages: feature extraction and segmentation prediction. In the feature extraction stage, high-level semantic features are extracted from the input image through multiple convolutional operations, which can effectively capture the correlations between different tissue structures. In the segmentation prediction stage, the encoded features are passed into the decoder, and pixel-level segmentation results are generated by deconvolution operations. In recent years, various CNN models for medical image segmentation have emerged, such as 3DUNet [18], which replaces all 2D operations with 3D operations to achieve 3D medical image segmentation; MultiResUnet [31], which improves the segmentation of retinal blood vessels by combining ResNet [32] with an attention mechanism; and U-Net3+, which proposes a full-scale skip connection approach. Each of these models has its own characteristics, but they all have the disadvantages of insufficient global modeling ability and the inability to learn correlations between different samples.

2.2. Transformer-Based Methods

Transformer was initially designed for natural language processing (NLP) tasks. Because of its powerful global modeling capabilities, some researchers have tried to apply it to computer vision. For example, Transunet [15], combines Transformer with a CNN in a hybrid coding form, providing a powerful segmentation model. Swin-Unet [17] is the first pure Transformer architecture for medical image segmentation, which greatly reduces the computational amount of the self-attention mechanism by limiting it to a single window in the form of a sliding window. However, Transformer-based networks generally have the drawback of secondary computational complexity, which is used in medical image segmentation. Additionally, Transformer can only focus on the correlation within the sample due to the limitations of its internal structure, which also limits the segmentation ability of the model.

2.3. Skip Connection

Fully convolutional networks (FCNs) [14] achieve end-to-end pixel-level prediction via convolutional layers instead of fully connected layers, but, since images suffer from semantic gaps during this end-to-end structure transfer, U-net uses a skip connection to improve this problem. Skip connection has been shown to help recover fine-grained detail information of images, and, in response to this finding, researchers have proposed many novel models, such as Unet++ [33], which adopts a multi-level feature fusion mechanism to effectively reduce the reliance on single-channel features; Attention U-net [34], which introduces attention gates in skip connection to enhance the model’s focus on useful features; and Dae-Former [27], which uses a cross-attention mechanism to connect features of the encoding side and the decoding side to more effectively retain the underlying information. However, simply connecting the output features of the encoder with the input features of the decoder is not sufficient to fully fuse the semantic information at different scales and achieve the global modeling ability. Additionally, the shallow features are distant from the deeper features, and the semantic gap between them is large. Relying solely on the merger of simple channels may even harm the performance of the final model, as has been demonstrated in a previous work [30]. It is very necessary to explore an effective method for skip connection.

3. Methods

3.1. Overall Architecture

Figure 1 shows the overall architecture of DATUnet. DATUnet takes the same end-to-end U-shaped structure as U-net, and the whole network architecture consists of three parts: an encoder, decoder, and skip connection. On the encoder side, we first perform coarse extraction of the feature map using a CNN and then input the extracted feature map to the DAT module for global correlation and external correlation calculation of the feature map. We use a total of three encoder blocks on the encoder side, each of which is composed of two consecutive DAT blocks. We use the DAT blocks for the smaller-size downsampling blocks and still use convolutional operations to extract features in the upper sampling part, which can effectively reduce the memory of the model and lower costs. Moreover, in order to ensure the coherence of the spatial information, we use the method of partially overlapping patch merging before each encoder block, so that the different patches can be linked to each other.

In the skip connection, we introduce the CTrans module for feature fusion, which consists of two parts, the Channel-wise Cross-Fusion Transformer (CCT) module that fuses the encoder output features and the Channel-wise Cross-Attention module (CCA) module that enhances feature fusion at the decoder side. The CCT module effectively combines feature information at different scales by cross-fusing features at the encoder and passes the feature information generated after fusion to the decoder. On the decoder side, we also use the DAT module to decode the feature maps and use the CCA module to guide the fusion of the features output from CCT and the features from the decoder. Finally, the output split map is generated by a linear projection layer. Next, we will present the specific details of each of our modules separately.

3.2. Dual-Attention Transformer (DAT)

The traditional Vision Transformer can only focus on the correlation of a single sample, but we believe that the correlation between different samples also helps to improve the performance of the model, so we propose the DAT module. In Figure 2, our DAT consists of an effective attention block and an external attention cascade, where the effective attention mechanism is used to compute intra-sample correlations, and the external attention mechanism is used to compute correlations between different samples. We use the module for both the encoding part and the decoding part of the model. However, considering the complexity of Transformer computation, we only use the double-layer Transformer block for encoding in the feature map part where the resolution is small.

E_{b l o c k 1} (X, Q_{1}, K_{1}, V_{1}) = E_{1} (Q_{1}, K_{1}, V_{1}) + X

(1)

M L P_{1} (E_{b l o c k 1}) = M L P (L N (E_{b l o c k 1}))

(2)

E_{b l o c k 2} (E_{b l o c k 1}, Q_{2}, M_{k}, M_{v}) = E_{2} (M L P_{1} (E_{b l o c k 1}) + E_{b l o c k 1}) + M L P_{1} (E_{b l o c k 1})

(3)

M L P_{2} (E_{b l o c k 2}) = M L P (L N (E_{b l o c k 2}))

(4)

D u a l A t t e n t i o n (E_{b l o c k 2}) = M L P_{2} (E_{b l o c k 2}) + E_{b l o c k 2}

(5)

In Equations (1)–(5) above, we present the steps for the calculation of DAT.

E_{b l o c k 1} ()

and

E_{b l o c k 2} ()

represent the effective attention block and external attention block, respectively.

E_{1}

and

E_{2}

represent the effective attention and external attention, respectively.

M L P

, as in Equation (6), represents the Mix-FFN [35] feedforward neural network, which consists of depth-wise convolutions and fully connected layers. In ViT, position encoding is typically added to determine the position of each patch. However, this fixed position encoding method leads to a decrease in accuracy during testing due to interpolation processing. Therefore, an

M L P

with a

3 \times 3

convolutional layer is used to dynamically represent the positional relationships between different patches.

M L P (X) = F C (G L U E (D W - C o n v_{3 \times 3} (F C (X))))

(6)

3.3. Efficient Attention

In the self-attention mechanism (7), the feature map is typically subject to two dot-product operations. First, calculate the similarity between pixels in the image through

S = Q K^{T}

; then, perform a weighted sum through

D = S V

. The calculation complexity is

O (n^{2})

. The high computational complexity limits the use of attention mechanisms in image processing.

S (Q, K, V) = S o f t max (Q K^{T}) V

(7)

Therefore, Shen et al. proposed the efficient attention mechanism (8) to address the issue of high computational complexity in self-attention mechanisms. This is an attention mechanism that performs similarly to the standard attention mechanism but with lower computational complexity.

ρ ()

represents the Softmax normalization function.

E (Q, K, V) = ρ (Q) (ρ {(K)}^{T} V)

(8)

In contrast to the self-attention mechanism, the efficient attention mechanism notices that, in the normalization process, if the channel dimension of Q and the spatial dimension of K are normalized separately, this is equivalent to normalizing

Q K^{T}

as a whole. Therefore, the efficient attention mechanism distributes the normalization of

Q K^{T}

to the query and key individually and utilizes the associativity of matrix multiplication

(Q K^{T}) V \to Q (K^{T} V)

to reduce the calculation complexity from

O (n^{2})

to

O (d_{k} d_{v})

, where n represents the input size and

d_{k}, d_{v}

represents the channel dimensions of the key and value. In practical applications,

d_{k}

and

d_{v}

are often significantly smaller than n.

3.4. External Attention

In image processing, most pixels are only closely related to a small number of elements. The standard attention score matrix has a lot of redundancy, and self-attention mechanisms can only focus on intra-sample correlations. To enable features to be more accurately implemented and to generate inter-sample correlations, Guo et al. [29] proposed the external attention mechanism. This is an attention mechanism constructed by two linear layers and two normalized layers, which not only takes into account the correlation between all samples but also reduces the computational complexity. The calculation steps of self-attention are written as follows in Equation (9):

\begin{matrix} A = {(a)}_{i, j} = s o f t max (Q K^{T}) \\ F_{o u t} = A V \end{matrix} \begin{matrix} a_{i, j} = \vec{q_{i}} \cdot \vec{k_{j}} \\ f_{i} = \sum a_{i, j} v_{j} \end{matrix}

(9)

The similarity calculation in self-attention can be understood as projecting the samples onto a specific subspace. Here, represents the attention score matrix, where represents the correlation between the i-th pixel of the query and the j-th pixel of the key. Then, we weight and sum each to obtain the final attention score. EA, on the other hand, uses a shared subspace to calculate the attention score, as in Equation (10).

\begin{matrix} A = {(a)}_{i, j} = N o r m (F {M_{}}^{T}) \\ F_{o u t} = A M_{} \end{matrix} \begin{matrix} a_{i, j} = \vec{q_{i}} \cdot \vec{m_{j}} \\ f_{i} = \sum a_{i, j} m_{j} \end{matrix}

(10)

All queries are projected onto the subspace M to calculate

{(a)}_{i, j}

and then weighted and summed with

m_{j}

to reconstruct

f_{j}

. The matrix A is inferred from prior knowledge of the dataset and is normalized using double norm. In contrast to the self-attention mechanism,

{(a)}_{i, j}

in the external attention mechanism represents the similarity between the i-th pixel and the j-th row of M, where

M \in R^{S \times d}

is the memory unit that serves as a shared base space, similar to the x, y, and z axis in a 3D coordinate system. It can represent features from all feature spaces. Because it is shared, it can model the relationships between all samples, and it is calculated with each pixel, forming connections with each pixel as well. The memory matrix M is initialized randomly and records the memory of the entire training set. The external attention mechanism is computed using the following Equations (11)–(12), where

M_{k}

and

M_{v}

represent the parameter matrices of two linear transformations, similar to K and V in the attention mechanism. As M is a shared space, during each training sample, the features in

M_{k}

and

M_{v}

are updated and iterated through the similarity matrix in A, providing the ability to model the correlation between different samples. The computational complexity of the external attention is

O (d S N)

, where d is the feature dimension, N is the number of elements, and S is the number of elements in the memory unit M.

A = D o u b l e N o r m (F {M_{k}}^{T})

(11)

F_{o u t} = A M_{v} .

(12)

Double Normalization

As shown in Figure 3, in the self-attention mechanism, normalization is usually performed using Softmax normalization along each row. However, this can lead to issues in which a single column in query or key with extremely large or small values can cause the corresponding column in the attention matrix to be extremely large or small as well, leading to difficulty in representing the similarities between pixels. Therefore, the external attention mechanism uses double normalization to eliminate this effect. As in Equations (13)–(15), in order to prevent a column in the attention map from being too large or too small, EA first applies Softmax to the query matrix and to the result of matrix multiplication, column by column, which prevents the values inside the feature map from being too large or too small. After eliminating this effect, the L1 normalization calculation is performed row by row to obtain the weight relationship between different pixels, so that the obtained weight relationship can highlight the correlation between different pixels more.

{(\tilde{a})}_{i, j} = F M_{k}^{T}

(13)

{\hat{a}}_{i, j} = exp ({\tilde{a}}_{i, j}) / \sum_{k}^{} exp ({\tilde{a}}_{k, j})

(14)

a_{i, j} = {\hat{a}}_{i, j} / \sum_{k}^{} {\hat{a}}_{i, k} .

(15)

3.5. Multi-Scale Channel Cross-Fusion Module

Since most current medical image segmentation networks have inadequate feature fusion, we introduce the CTrans module to improve this phenomenon. The module enables both full fusion of multi-scale features and information interaction in the channel dimension, effectively preserving feature information. The CTrans module consists of two parts: one is a multi-headed Channel-wise Cross-Fusion Transformer (CCT) and the other is a Channel-wise Cross-Attention module (CCA). CCT reduces the loss of detailed information by cross-fusing the channel information of feature maps at different scales on the encoder side and then passes the fused information into the CCA module. The CCA module, on the other hand, fuses the output of the CCT module with the features of the upsampling stage layer by layer on the decoding side, weighting the semantic and detailed features between the cross-layer image blocks. The CCT and CCA modules are described in detail below.

3.5.1. Channel-Wise Cross-Fusion Transformer

The CCT module involves three steps: multi-scale feature embedding, multi-head channel cross-attention, and multi-layer perceptron.

The multi-scale feature embedding first downsamples the input feature map to obtain four different scales of output

S_{i} \in R^{\frac{H W}{2^{i}} \times C_{i}}, (i = 1, 2, 3, 4)

. Firstly, the input feature map is convolved with different kernel sizes and strides to adjust to blocks of the same size (e.g., 14 × 14), which compresses spatial information while retaining channel information. Then, each feature vector is position-encoded to preserve the positional information of the feature map, resulting in tokens

T_{i} \in R^{\frac{H W}{i^{2}} \times C_{i}}, (i = 1, 2, 3, 4)

. Finally, the tokens for the four scales are concatenated along the channel dimension to form

T_{Σ}

.

The multi-head channel cross-attention mechanism is used to weight and refine the semantic and detail features between inter-channel image blocks. To achieve this, as expressed in Equation (16), we derive

Q_{i}

and

K, V

using linear transformations on the five outputs

T_{i}

(i = 1, 2, 3, 4, Σ)

embedded with multi-scale features.

Q_{i} = T_{i} \times W_{q i}, K = T_{Σ} \times W_{k}, V = T_{Σ} \times W_{v}

(16)

Let W be the weight matrix, where

W_{q i} \in R^{N_{i} \times d}

,

W_{k} \in R^{N_{Σ} \times d}

,

W_{v} \in R^{N_{Σ} \times d}

represents the feature channel dimension and, in our settings,

N_{1} = 64

,

N_{2} = 128

,

N_{3} = 256

,

N_{4} = 512

,

N_{Σ} = 960

represents the sequence length (i.e., the patch numbers). Because the encoder has multiple scales of output, we expand

Q_{i}

and K, V from (B, N, C) to (B, num_head, N, C) at dim = 1 to fuse the features, where num_head is set to 4. At this point,

Q_{i} \in R^{N_{i} \times d}

,

K \in R^{N_{Σ} \times d}

,

V \in R^{N_{Σ} \times d}

. After computing the similarity matrix M using

Q_{i}

and K, we perform a weighted sum with V to obtain

C A_{i}

. The specific formula is as follows (17):

C A_{i} = M_{i} V^{T} = σ [ψ (\frac{Q_{i}^{T} K}{\sqrt{C_{Σ}}})] V^{T} = σ [ψ (\frac{W_{Q_{i}}^{T} T_{i} T_{Σ} W_{K}}{\sqrt{C_{Σ}}})] W_{V}^{T} T_{Σ}^{T}

(17)

where

ψ ()

represents the instance normalization function and

σ ()

represents the Softmax normalization function. In order to facilitate gradient propagation, we take the mean over the num_head dimension and restore the dimensionality back to (B, N, C). Finally, we add the attention matrix to the original input, pass it through an MLP layer, and perform residual computation to obtain four output matrices

O_{i} \in R^{C \times H \times W} (i = 1, 2, 3, 4)

. These

O_{i}

matrices are concatenated with the features from the decoder

D_{i} \in R^{C \times H \times W} (i = 1, 2, 3, 4)

and further enhanced by the CCA module for feature fusion, as shown in Equations (18) and (19):

M C A_{i} = (C {A_{i}}^{1} + C {A_{i}}^{2} + \dots + C {A_{i}}^{N}) / N

(18)

O_{i} = M C A_{i} + M L P (Q_{i} + M C A_{i})

(19)

Here,

C A_{i}^{N}

represents the output of the multi-head channel cross-attention mechanism, where N is the number of heads;

O_{i}

is the output of the CCT module, and

Q_{i}

is the query vector. In practice, we set N to 4.

As shown in Figure 4, after passing through the multi-head channel cross-attention fusion mechanism, the feature map generates four different scale outputs. Each input feature is related to the other input features, which effectively reduces the problem of information loss during the transmission of images.

3.5.2. Channel-Wise Cross-Attention (CCA)

To reduce the semantic imbalance between the outputs of the CCT module and the decoder-side features, we adopt the CCA module to enhance the fusion of these two types of features. As show in Figure 5, CCA is a type of channel-wise attention mechanism that strengthens channels with more information and suppresses those with less information by weighting them.

For each layer of CCT, the output

O_{i}

is passed to the corresponding decoder-side feature

D_{i}

through the CCA module. In CCA, we perform average pooling and flatten operations on both

O_{i}

and

D_{i}

to generate the vector

G (O_{i}) \in R^{C \times 1 \times 1} G (D_{i}) \in R^{C \times 1 \times 1}

, where

G

represents global average pooling. This is performed to compress spatial information while preserving channel-wise information, as in Equation (20). Then, we use two weight matrices

L_{1} \in R^{C \times C}

,

L_{2} \in R^{C \times C}

to weight and sum the two vectors to highlight channels with important information. Finally, the vectors are reshaped to their original size for each layer.

M_{i} = L_{1} \cdot G (O_{i}) + L_{2} \cdot G (D_{i})

(20)

Compared to regular skip connections, our method reduces the loss of spatial and channel information during feature map transmission through channel-wise fusion, effectively improving the segmentation performance.

4. Experiment

4.1. Datasets

Dataset: To validate the performance of our model, we selected the publicly available Synapse Abdomen Multi-Organ and ACDC Heart datasets. The Synapse dataset contains 30 abdominal CT scans, covering eight abdominal organs (aorta, gallbladder, spleen, left kidney, right kidney, liver, pancreas, spleen, and stomach), with a total of 3779 axial CT images. We randomly divided these thirty cases into 18 training sets and 12 testing sets.

The ACDC Heart Segmentation dataset contains 100 MRI scans of the heart with three labels for Left Ventricle (LV), Right Ventricle (RV), and Myocardium (Myo). Similarly, we divided the dataset into 80 training samples and 20 testing samples. We used a Dice similarity coefficient (DSC (%)) and a 95% Hausdorff distance (HD95 (mm)) as evaluation metrics to assess our model’s performance on the dataset.

4.2. Loss Function

A combination of cross-entropy (21) and Dice loss functions (22) was used in our experiments. Cross-entropy loss is a common classification loss function that measures the difference between predicted results and ground truth annotations. In the computer visual field, it is commonly used to predict the class of an image, where each pixel is classified as either foreground or background. This loss function can help the model better distinguish between the foreground and background and reduce mis-segmentation.

L_{C r o s s E n t r o p y L o s s} = - \sum_{x}^{} (p (x) log q (x))

(21)

The Dice loss function is a similarity-based loss function, as in Equation (22). It measures the degree of similarity between the predicted results and the true labeling. In medical image segmentation, the Dice loss function is often used for multi-classification problems, where each pixel is classified into categories. This loss function can help the model better preserve the shape and structural information of the target region and reduce over-smoothing.

L_{D i c e L o s s} = 1 - 2 \times \frac{\sum_{i} P_{i} G_{i}}{\sum_{i} P_{i} + \sum_{i} G_{i}} .

(22)

In this formula,

P_{i}

represents the probability value that the i-th element belongs to a certain class, and

G_{i}

represents the true label value of the i-th element. In our experiments, we combined the two loss functions with a weighted sum, as shown in Equation (23), where

λ_{1} = 0.4

and

λ_{2} = 0.6

represent the weights of cross-entropy and Dice losses, respectively. The combined loss function is defined as follows:

L_{t o t a l - l o s s} = λ_{1} \times L_{C r o s s E n t r o p y L o s s} + λ_{2} \times L_{D i c e L o s s} .

(23)

4.3. Evaluation Metrics

We followed the evaluation metrics of the baseline model, which are the DSC (%) and HD95 (mm). DSC (24) is used to measure the similarity between the predicted value and the label value. Its formula is as follows:

D S C = \frac{2 T P}{F P + 2 T P + F N}

(24)

In the formulas, TP represents the true positives (both the model and ground truth label classify a pixel as positive), FP represents the false positives (the model classifies a pixel as positive, but the ground truth label classifies it as negative), and FN represents the false negatives (the model classifies a pixel as negative, but the ground truth label classifies it as positive). The range of the DSC is from 0 to 1, and a higher value indicates a better segmentation performance in which the predicted results have a higher degree of overlap with the ground truth. HD95 is a metric that measures the distance difference between the predicted results and the ground truth. Its formula is as follows (25):

H D 95 = H D (P, G) a t 95 %

(25)

where P represents the predicted segmentation result of the model, G represents the ground truth, and

H D (P, G)

represents the Hausdorff distance between P and G. The Hausdorff distance is a metric that measures the distance between two sets, and it represents the maximum distance from each point in one set to the nearest point in the other set. HD95 represents the 95th percentile of the Hausdorff distance between P and G, which is the value that ranks in the top 95% of all distance values. The range of HD95 is from zero to positive infinity, and a smaller value indicates a better segmentation performance in which the predicted results have a smaller distance difference from the ground truth labels.

4.4. Experimental Details

We set the resolution of the input image to 224 × 224 and augmented the data with random flips and rotations to prevent overfitting of the model. We set the experimental batch size to eight, the initial value of the learning rate to 0.05, the momentum to 0.9, and the decay of the weights to 0.0001. The model was trained for 400 rounds on a single GPU with 16 G of video memory, and the type of graphics card was the NVIDIA Tesla T4 16 GB.

5. Experimental Results

We have compared our model with some CNN and Transformer methods, and the comparison results are shown in Table 1 and Table 2. Table 1 shows the experimental results on the Synapse dataset, and Table 2 shows the experimental results on the ACDC dataset. To highlight the experimental results, the best data are marked with the superscript “1”.

The ablation performances of DATUnet on the Synapse and ACDC datasets are shown in Table 3 and Table 4. We marked the best result with the superscript “1”.

5.1. Comparative Experiment

To validate the overall performance of our model, we conducted experiments on both the Synapse Abdomen Multi-Organ and ACDC Heart Segmentation datasets. The experimental results are shown in Table 1 and Table 2. To fully demonstrate the superiority of our model, we compared the experimental results with traditional CNN segmentation models and Transformer-based segmentation models, and the data showed that our experimental results have significantly surpassed some based on CNN or Transformer models. From the table, we can see that the CNN-based model generally has higher results in arterial segmentation than the Transformer method, mainly because the span of the arterial organ is small and the local correlation of the CNN can be fully utilized, while for a larger span of an organ, such as the pancreas, Transformer has better results due to its global correlation.

Our model improves the segmentation results for almost all organs, and this all-around improvement confirms the effectiveness of our method. Our DAT module, which combines an effective attention mechanism and external attention mechanism, focuses on both intra-sample correlation and inter-sample correlation, and this multifaceted information effectively improves the overall segmentation performance of the model. Meanwhile, the CTrans module fully combines multi-scale information by cross-fusing the channel information from feature maps of different scales to further enhance the performance of the model. Our model is highly competitive, against both CNN-based models and Transformer-based models. On the Synapse dataset, our model attains a mean DSC (%) sum of 83.6%, while the HD95 (mm) is reduced to 13.99, achieving the best result. Similarly, on the ACDC dataset, our model’s DSC (%) reached 90.35, which is also the best result.

5.2. Evaluation of Dataset

To provide a more intuitive demonstration of the superior performance of our model compared to other models, we conducted visual comparisons of the dataset segmentation results. Figure 6 shows a visual comparison of the Synapse dataset, and Figure 7 shows a visual comparison of the ACDC dataset.

Figure 6 shows the qualitative comparison results of different methods on the Synapse dataset. We divided the comparison methods into those that are CNN-based and those that are Transformer-based. We can see in the picture that some models produce the wrong segmentations for organs, such as some models in the second row that interpret part of the background information as foreground information, and Unet and Swin-Unet in the fourth row have a big difference in the segmentation and label value for the pancreas. In addition, some models are too vague in the boundary positioning of different organs. For example, the stomach and liver parts in the network segmentation diagram of AttUnet, Swin-Unet, etc. in the third line are mixed together, so the boundary cannot be clearly defined. These errors in segmentation are due to the limitations of these network models, which only simply connect the output features of the encoder to the input features of the decoder. The semantic gap between the shallow and deep features is too large, and this simple dependency relationship cannot fully fuse multi-scale feature information, leading to segmentation errors. However, our channel-cross fusion Transformer module can improve this problem. The differences in segmentation performance accuracies are due to the limitations of CNN-based or Transformer-based network models. Our DAT module can simultaneously focus on the correlation between individual samples and the correlation between different samples. This multidimensional information can help our model achieve more accurate segmentation results.

To further verify the generalization performance of DATUnet, we conducted qualitative comparisons on the ACDC Heart Segmentation dataset in Figure 7. It can be seen that the segmentation results of Swin-Unet have rough organ edges, and TransUnet has poor segmentation results for small targets, such as in the second row and fourth column, where a part of the Myocardium organ is wrongly segmented as the Right Ventricle organ. For example, in the third column of the second row, some Myo organs are incorrectly divided into RV organs, and, in the third column of the third row, the Myo undersegmentation is serious. The UNet network on the first line incorrectly identifies the background information as the RV organ. However, our model performs well on both large and small targets, which further demonstrates that our model can improve the segmentation results in a comprehensive manner

5.3. Ablation Experiments

To evaluate the importance and effectiveness of each component in our model and better understand their behaviors and performances, we conducted ablation experiments on the Synapse Abdomen Multi-Organ and ACDC Heart Segmentation datasets, and the results are shown in Table 3 and Table 4. It can be seen that adding each module to the model improved the segmentation performance, and applying them to the network together further improved the segmentation ability of our model. From the tables, we can see that our model made progress in the segmentation results of many organs, and the external attention mechanism utilized the correlation between different samples to improve the segmentation performance. The CTrans module can strengthen the fusion of multi-scale feature information, further improving the segmentation performance of the model. On the Synapse dataset, our model achieved a mean DSC (%) of 83.6% and an HD95 (mm) of 13.99, which is a significant improvement compared to the baseline Dae-former model, with an increase of 1.21% in the DSC and a decrease of 3.65 in the HD. On the ACDC dataset, our model improved the average DSC performance by 1.3% and decreased the HD by 0.121, compared to the baseline model. Our model has strong competitiveness compared to both CNN-based models and Transformer-based models.

In combination with Figure 8, it can be seen that adding each module improves the accuracy of the segmentation results compared to the baseline model. For example, in the first row and second column, the addition of the CTrans module makes the boundary of the stomach smoother, and the small target, the right kidney, which was not segmented before, became segmented. In the second and fourth columns of the second row, we can see that both the CTrans module and the EA module have improved the boundary determinacy of the left kidney and liver. From the visual comparison of the ablation experiments, we can see that the CTrans and EA modules are dedicated to improving the overall segmentation performance, which may manifest in different aspects of the segmentation, such as the accuracy of segmentation for different organs and the improvement of boundary determinacy and smoothness. As shown in Figure 9, it can be seen on the visualized segmentation map of the ACDC dataset that the baseline model suffers from incorrect segmentation (e.g., RV in the third row) and organ edges that are not smoothed (e.g., LV in the second row and Myo in the first row), but the addition of the CTrans and EA modules significantly improves these phenomena. The segmentation effect occurs when CTrans and EA are added at the same time, resulting in the best segmentation effect.

When constructing the DAT, we used the effective attention mechanism and the external attention mechanism. In order to fully combine the advantages of these two attention mechanisms, we also compared three possible connection modes as shown in the figure, namely, sequential cascade (Figure 10a), simple addition fusion (Figure 10b), and complex addition fusion (Figure 10c). The method of a is the scheme we finally choose, which has been introduced above, and the method of b is the simple addition, but this method lacks the normalization operation and leads to instability of the backpropagation. The method of c normalized the output of each attention block and then merged it into the FFN, which achieved good results, but it was still inferior to scheme a. We used the same architecture as the network in Figure 1 and tested all three scenarios on the Synapse dataset, as shown in Table 5. We can see that the sequential fusion approach gave the best results, so we ultimately chose sequential fusion as the base layer of our network.

In Table 6, we show the number of parameters of DATUnet. Compared with DAE-Former, we can see that we have a similar number of parameters but a slightly better performance. For network models with a smaller number of parameters, our model has a better Dice score and HD distance.

6. Conclusions

In this paper, we propose a new medical image segmentation model that mainly uses our designed DAT module for encoding and decoding while simultaneously considering the remote correlation of individual samples and the correlation between different samples. Our model effectively combines different fine-grained features through channel-cross fusion to improve the segmentation performance. We validated the superiority of our method on the Synapse Abdomen Multi-Organ and ACDC Heart Segmentation datasets, and, through ablation experiments, we confirmed the importance of each module. The experiments showed that our model significantly outperformed the baseline models, demonstrating its effectiveness in improving the segmentation performance. In a future work, we will focus on how to enhance the generalization performance of the network and reduce the amount of computation, so that the network can be more suitable for practical engineering needs.

Author Contributions

Conceptualization, J.C.; methodology, J.C.; software, J.C. and S.J.; validation, S.J. and L.W.; formal analysis, L.W. and S.J.; data curation, J.C.; writing—original draft preparation, J.C.; writing—review and editing, J.C. and L.W.; visualization, J.C. and S.J. All authors have read and agreed to the published version of the manuscript.

Funding

Special Funds for the Central Government to Guide Local Science and Technology Development under Grant ZYYD2022C19 and the Scientific and Technological Innovation 2030 Major Project under Grant 2022ZD0115802.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The Synapse and ACDC datasets are openly available at: https://www.synapse.org/#!Synapse:syn3193805/wiki/217789 (accessed on 15 April 2023) and https://www.creatis.insa-lyon.fr/Challenge/acdc/ (accessed on 25 April 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

Zhu, J.; Yang, G.; Lio, P. A residual dense vision transformer for medical image super-resolution with segmentation-based perceptual loss fine-tuning. arXiv 2023, arXiv:2302.11184. [Google Scholar]
Isensee, F.; Kickingereder, P.; Wick, W.; Bendszus, M.; Maier-Hein, K.H. Brain tumor segmentation and radiomics survival prediction: Contribution to the brats 2017 challenge. In Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: Proceedings of the Third International Workshop, BrainLes 2017, Held in Conjunction with MICCAI 2017, Quebec City, QC, Canada, 14 September 2017; Revised Selected Papers; Springer: Cham, Switzerland, 2018; pp. 287–297. [Google Scholar]
Zhang, Y.; Liu, H.; Hu, Q. Transfuse: Fusing transformers and cnns for medical image segmentation. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2021: Proceedings of the 24th International Conference, Strasbourg, France, 27 September–1 October 2021, Proceedings, Part I; Springer: Cham, Switzerland, 2021; pp. 14–24. [Google Scholar]
Zhao, Z.; Zhu, A.; Zeng, Z.; Veeravalli, B.; Guan, C. Act-net: Asymmetric co-teacher network for semi-supervised memory-efficient medical image segmentation. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 1426–1430. [Google Scholar]
Tragakis, A.; Kaul, C.; Murray-Smith, R.; Husmeier, D. The fully convolutional transformer for medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 3660–3669. [Google Scholar]
Zeng, D.; Wu, Y.; Hu, X.; Xu, X.; Yuan, H.; Huang, M.; Zhuang, J.; Hu, J.; Shi, Y. Positional contrastive learning for volumetric medical image segmentation. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2021: Proceedings of the 24th International Conference, Strasbourg, France, 27 September–1 October 2021, Proceedings, Part II; Springer: Cham, Switzerland, 2021; pp. 221–230. [Google Scholar]
Yang, L.; Qi, L.; Feng, L.; Zhang, W.; Shi, Y. Revisiting weak-to-strong consistency in semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7236–7246. [Google Scholar]
Liu, X.; Shih, H.A.; Xing, F.; Santarnecchi, E.; Fakhri, G.E.; Woo, J. Incremental Learning for Heterogeneous Structure Segmentation in Brain Tumor MRI. arXiv 2023, arXiv:2305.19404. [Google Scholar]
Alom, M.Z.; Hasan, M.; Yakopcic, C.; Taha, T.M.; Asari, V.K. Recurrent residual convolutional neural network based on u-net (r2u-net) for medical image segmentation. arXiv 2018, arXiv:1802.06955. [Google Scholar]
Sun, X.; Fang, H.; Yang, Y.; Zhu, D.; Wang, L.; Liu, J.; Xu, Y. Robust retinal vessel segmentation from a data augmentation perspective. In Ophthalmic Medical Image Analysis: Proceedings of the 8th International Workshop, OMIA 2021, Held in Conjunction with MICCAI 2021, Strasbourg, France, 27 September 2021; Springer: Cham, Switzerland, 2021; pp. 189–198. [Google Scholar]
Roy, S.; Koehler, G.; Ulrich, C.; Baumgartner, M.; Petersen, J.; Isensee, F.; Jaeger, P.F.; Maier-Hein, K. MedNeXt: Transformer-driven Scaling of ConvNets for Medical Image Segmentation. arXiv 2023, arXiv:2303.09975. [Google Scholar]
Rahman, M.M.; Marculescu, R. Multi-scale Hierarchical Vision Transformer with Cascaded Attention Decoding for Medical Image Segmentation. arXiv 2023, arXiv:2303.16892. [Google Scholar]
Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: Proceedings of the 18th International Conference, Munich, Germany, 5–9 October 2015, Proceedings, Part III; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
Xu, L.; Wang, L.; Li, Y.; Du, A. Big Model and Small Model: Remote modeling and local information extraction module for medical image segmentation. Appl. Soft Comput. 2023, 136, 110128. [Google Scholar] [CrossRef]
Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In Computer Vision—ECCV 2022 Workshops: Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022, Proceedings, Part III; Springer: Cham, Switzerland, 2022; pp. 205–218. [Google Scholar]
Tran, M.; Vo-Ho, V.K.; Le, N.T. 3DConvCaps: 3DUnet with Convolutional Capsule Encoder for Medical Image Segmentation. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; pp. 4392–4398. [Google Scholar]
Hatamizadeh, A.; Tang, Y.; Nath, V.; Yang, D.; Myronenko, A.; Landman, B.; Roth, H.R.; Xu, D. Unetr: Transformers for 3d medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, New Orleans, LA, USA, 18–24 June 2022; pp. 574–584. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Chen, C.F.R.; Fan, Q.; Panda, R. Crossvit: Cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 357–366. [Google Scholar]
Park, N.; Kim, S. How do vision transformers work? arXiv 2022, arXiv:2202.06709. [Google Scholar]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
Lin, A.; Chen, B.; Xu, J.; Zhang, Z.; Lu, G.; Zhang, D. Ds-transunet: Dual swin transformer u-net for medical image segmentation. IEEE Trans. Instrum. Meas. 2022, 71, 4005615. [Google Scholar] [CrossRef]
Ji, Y.; Zhang, R.; Wang, H.; Li, Z.; Wu, L.; Zhang, S.; Luo, P. Multi-compound transformer for accurate biomedical image segmentation. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2021: Proceedings of the 24th International Conference, Strasbourg, France, 27 September–1 October 2021, Proceedings, Part I; Springer: Cham, Switzerland, 2021; pp. 326–336. [Google Scholar]
Chen, D.; Yang, W.; Wang, L.; Tan, S.; Lin, J.; Bu, W. PCAT-UNet: UNet-like network fused convolution and transformer for retinal vessel segmentation. PLoS ONE 2022, 17, e0262689. [Google Scholar] [CrossRef] [PubMed]
Azad, R.; Arimond, R.; Aghdam, E.K.; Kazerouni, A.; Merhof, D. Dae-former: Dual attention-guided efficient transformer for medical image segmentation. arXiv 2022, arXiv:2212.13504. [Google Scholar]
Shen, Z.; Zhang, M.; Zhao, H.; Yi, S.; Li, H. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 3531–3539. [Google Scholar]
Guo, M.H.; Liu, Z.N.; Mu, T.J.; Hu, S.M. Beyond self-attention: External attention using two linear layers for visual tasks. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5436–5447. [Google Scholar] [CrossRef] [PubMed]
Wang, H.; Cao, P.; Wang, J.; Zaiane, O.R. Uctransnet: Rethinking the skip connections in u-net from a channel-wise perspective with transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2022; Volume 36, pp. 2441–2449. [Google Scholar]
Ibtehaz, N.; Rahman, M.S. MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Netw. 2020, 121, 74–87. [Google Scholar] [CrossRef] [PubMed]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. UNet++: A nested U-Net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: Proceedings of the 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September 2018; Springer: Cham, Switzerland, 2018; pp. 3–11. [Google Scholar]
Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention u-net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar]
Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
Milletari, F.; Navab, N.; Ahmadi, S.A. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar]
Fu, S.; Lu, Y.; Wang, Y.; Zhou, Y.; Shen, W.; Fishman, E.; Yuille, A. Domain adaptive relational reasoning for 3d multi-organ segmentation. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2020: Proceedings of the 23rd International Conference, Lima, Peru, 4–8 October 2020; Proceedings, Part I; Springer: Cham, Switzerland, 2020; pp. 656–666. [Google Scholar]
Schlemper, J.; Oktay, O.; Schaap, M.; Heinrich, M.; Kainz, B.; Glocker, B.; Rueckert, D. Attention gated networks: Learning to leverage salient regions in medical images. Med. Image Anal. 2019, 53, 197–207. [Google Scholar] [CrossRef] [PubMed]
Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.W.; Wu, J. UNet 3+: A full-scale connected unet for medical image segmentation. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 1055–1059. [Google Scholar]
Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
Wang, H.; Xie, S.; Lin, L.; Iwamoto, Y.; Han, X.H.; Chen, Y.W.; Tong, R. Mixed transformer u-net for medical image segmentation. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 2390–2394. [Google Scholar]
Heidari, M.; Kazerouni, A.; Soltany, M.; Azad, R.; Aghdam, E.K.; Cohen-Adad, J.; Merhof, D. Hiformer: Hierarchical multi-scale representations using transformers for medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 6202–6212. [Google Scholar]
Rahman, M.M.; Marculescu, R. Medical image segmentation via cascaded attention decoding. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 6222–6231. [Google Scholar]
Huang, X.; Deng, Z.; Li, D.; Yuan, X. Missformer: An effective medical image segmentation transformer. arXiv 2021, arXiv:2109.07162. [Google Scholar] [CrossRef] [PubMed]
Gu, Z.; Cheng, J.; Fu, H.; Zhou, K.; Hao, H.; Zhao, Y.; Zhang, T.; Gao, S.; Liu, J. Ce-net: Context encoder network for 2d medical image segmentation. IEEE Trans. Med Imaging 2019, 38, 2281–2292. [Google Scholar] [CrossRef] [PubMed]
Xu, G.; Wu, X.; Zhang, X.; He, X. Levit-unet: Make faster encoders with transformer for medical image segmentation. arXiv 2021, arXiv:2107.08623. [Google Scholar] [CrossRef]

Figure 1. Schematic diagram of the DATUnet.

Figure 2. Overall frame diagram of the proposed Dual-Attention Transformer module.

Figure 3. Double normalization diagram.

Figure 4. Multi-head cross-attention detail diagram.

Figure 5. Channel-wise cross-attention detail diagram.

Figure 6. Visualizes the results of different methods on the Synapse dataset.

Figure 7. Visualizes the results of different methods on the ACDC dataset.

Figure 8. Ablation segmentation results of CTrans and EA on the Synapse dataset.

Figure 9. Ablation segmentation results of CTrans and EA on the Synapse dataset.

Figure 10. Diagram of different connection strategies. From left to right: (a) sequential fusion, (b) simple additive fusion, and (c) complex additive fusion.

Table 1. Comparative experimental results on the Synapse dataset. For each data, we mark the best results with the superscript “1”.

Method	Model	DSC↑	HD95↓	Aorta	Gallbladder	Kidney (L)	Kidney (R)	Liver	Pancreas	Spleen	Stomach
CNN	V-Net [36]	68.81	-	75.34	51.87	77.10	80.75	87.84	40.05	80.56	56.98
	DARR [37]	69.77	-	74.74	53.77	72.31	73.24	94.08	54.18	89.90	45.96
	R50 UNet [13]	74.68	36.87	84.18	62.84	79.19	71.29	93.35	48.23	84.41	73.92
	R50 AttnUNet [38]	75.57	36.97	55.92	63.91	79.20	72.71	93.56	49.37	87.19	74.95
	UNet [13]	76.85	39.70	89.07	69.72	77.77	68.60	93.43	53.98	86.67	75.58
	UNet++ [33]	78.13	25.65	89.27	62.35	83.00	78.98	94.53	56.70	85.99	74.20
	UNet3+ [39]	73.81	30.82	86.32	59.06	79.16	71.26	93.13	46.56	84.94	70.08
	DeepLabv3+ [40]	77.63	39.95	88.04	66.51	82.76	74.21	91.23	58.32	87.43	73.53
	Att-UNet [34]	77.77	36.02	89.55	68.88	77.98	71.11	93.57	58.04	87.30	75.75
Transformer	R50 ViT [15]	71.29	32.87	73.73	55.13	75.80	72.20	91.51	45.99	81.99	73.95
	ViT [15]	61.50	39.61	44.38	39.59	67.46	62.94	89.21	43.14	75.45	69.78
	TransUNet [15]	77.48	31.69	87.23	63.13	81.87	77.02	94.08	55.86	85.08	75.62
	Swin-UNet [17]	79.13	21.55	85.47	66.53	83.28	79.61	94.29	56.58	90.66	76.60
	MT-UNet [41]	78.59	26.59	87.92	64.99	81.47	77.29	93.06	59.46	87.75	76.81
	HiFormer [42]	80.39	14.70	86.21	65.69	85.23	79.77	94.61	59.52	90.99	81.08
	TransCASCADE [43]	82.68	17.34	86.63	68.48	87.66	84.56	94.43	65.33	90.79	83.52 ¹
	MISSFormer [44]	81.96	18.20	86.99	68.65	85.21	82.00	94.41	65.67	91.92	80.81
	DAE-Former [27]	82.43	17.46	88.96	72.30 ¹	86.08	80.88	94.98	65.12	91.94 ¹	79.19
	DATUnet(Ours)	83.64 ¹	13.99 ¹	89.78 ¹	70.52	89.06 ¹	84.93 ¹	95.65 ¹	68.01 ¹	91.61	79.53

Table 2. Comparative experimental results on the ACDC dataset. For each data, we mark the best results with the superscript “1”.

Method	Model	DSC↑	RV	Myo	LV
CNN	R50 U-Net [13]	87.55	87.10	80.63	94.92
	R50 Att-UNet [38]	86.75	87.58	79.20	93.47
	CE-Net [45]	87.21	85.68	83.97	91.98
	UNet [13]	88.28	86.08	86.04	92.72
	UNet++ [33]	89.06	87.66	86.47	93.06
	UNet3+ [39]	88.28	86.08	86.04	92.72
Transformer	R50 ViT [15]	87.57	86.07	81.88	94.75
	TransUNet [15]	89.71	88.86	84.53	95.73
	Swin-UNet [17]	90.00	88.55	85.62	95.83 ¹
	UNETR [19]	88.61	85.29	86.52	94.02
	DAE-Former [27]	89.00	87.78	86.99	92.22
	DATUnet(Ours)	90.35 ¹	88.89 ¹	88.79 ¹	93.42

Table 3. Results of ablation experiments on the Synapse dataset.

Method	DSC↑	HD95↓
DAE-Former [27]	82.43	17.46
DAE-Former + CTrans	82.64	13.18 ¹
DAE-Former + EA	83.12	16.49
DATUnet(Ours)	83.64 ¹	13.99

Table 4. Results of ablation experiments on the ACDC dataset.

Method	DSC↑	RV	Myo	LV
DAE-Former [27]	89.00	87.78	86.99	92.22
DAE-Former + CTrans	90.08	88.51	88.58	93.16
DAE-Former + EA	89.68	87.12	90.23 ¹	91.69
DATUnet(Ours)	90.35 ¹	88.89 ¹	88.79	93.42 ¹

Table 5. Comparison of different dual-attention strategies on the Synapse dataset.

Dual-Attention Strategy	DSC↑	HD95↓
Sequential	83.64	13.99
Simple Additive	80.13	26.22
Complex Additive	82.58	17.31

Table 6. Comparison of the number of parameters.

Methods	Params(M)	DSC↓	HD95↓
DeepLapv3 + (CNN) [40]	59.50	77.63	39.95
Swin-Unet [17]	27.17	79.13	21.55
TransUNet [15]	105.28	77.48	31.69
LeVit-Unet [46]	52.17	78.53	16.84
MISSFormer [44]	42.5	81.96	18.20
HiFormer [42]	25.51	80.39	14.70
DAE-Former [27]	48.1	82.43	17.46
DATUnet	59.44	83.64	13.99

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Cui, J.; Wang, L.; Jiang, S. A Multi-Scale Cross-Fusion Medical Image Segmentation Network Based on Dual-Attention Mechanism Transformer. Appl. Sci. 2023, 13, 10881. https://doi.org/10.3390/app131910881

AMA Style

Cui J, Wang L, Jiang S. A Multi-Scale Cross-Fusion Medical Image Segmentation Network Based on Dual-Attention Mechanism Transformer. Applied Sciences. 2023; 13(19):10881. https://doi.org/10.3390/app131910881

Chicago/Turabian Style

Cui, Jianguo, Liejun Wang, and Shaochen Jiang. 2023. "A Multi-Scale Cross-Fusion Medical Image Segmentation Network Based on Dual-Attention Mechanism Transformer" Applied Sciences 13, no. 19: 10881. https://doi.org/10.3390/app131910881

APA Style

Cui, J., Wang, L., & Jiang, S. (2023). A Multi-Scale Cross-Fusion Medical Image Segmentation Network Based on Dual-Attention Mechanism Transformer. Applied Sciences, 13(19), 10881. https://doi.org/10.3390/app131910881

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Multi-Scale Cross-Fusion Medical Image Segmentation Network Based on Dual-Attention Mechanism Transformer

Abstract

1. Introduction

2. Relate Work

2.1. CNN-Based Methods

2.2. Transformer-Based Methods

2.3. Skip Connection

3. Methods

3.1. Overall Architecture

3.2. Dual-Attention Transformer (DAT)

3.3. Efficient Attention

3.4. External Attention

Double Normalization

3.5. Multi-Scale Channel Cross-Fusion Module

3.5.1. Channel-Wise Cross-Fusion Transformer

3.5.2. Channel-Wise Cross-Attention (CCA)

4. Experiment

4.1. Datasets

4.2. Loss Function

4.3. Evaluation Metrics

4.4. Experimental Details

5. Experimental Results

5.1. Comparative Experiment

5.2. Evaluation of Dataset

5.3. Ablation Experiments

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI