Article

Attention-Based Two-Branch Hybrid Fusion Network for Medical Image Segmentation

1 Computer Engineering Department, Taiyuan Institute of Technology, Taiyuan 030008, China
2 College of Computer Science and Technology, Taiyuan Normal University, Jinzhong 030619, China
3 College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(10), 4073; https://doi.org/10.3390/app14104073
Submission received: 7 April 2024 / Revised: 4 May 2024 / Accepted: 6 May 2024 / Published: 10 May 2024
(This article belongs to the Section Biomedical Engineering)

Abstract

Accurate segmentation of medical images is vital for disease detection and treatment. Convolutional Neural Networks (CNN) and Transformer models are widely used in medical image segmentation due to their exceptional capabilities in image recognition and segmentation. However, CNNs often lack an understanding of the global context and may lose spatial details of the target, while Transformers struggle with local information processing, leading to reduced geometric detail of the target. To address these issues, this research presents a Global-Local Fusion network model (GLFUnet) based on the U-Net framework and attention mechanisms. The model employs a dual-branch network that utilizes ConvNeXt and Swin Transformer to simultaneously extract multi-level features from pathological images. It enhances ConvNeXt’s local feature extraction with spatial and global attention up-sampling modules, while improving Swin Transformer’s global context dependency with channel attention. The Attention Feature Fusion module and skip connections efficiently merge local detailed and global coarse features from CNN and Transformer branches at various scales. The fused features are then progressively restored to the original image resolution for pixel-level prediction. Comprehensive experiments on datasets of stomach and liver cancer demonstrate GLFUnet’s superior performance and adaptability in medical image segmentation, holding promise for clinical analysis and disease diagnosis.

1. Introduction

Malignant tumors, commonly referred to as cancer, are serious illnesses that pose a threat to human health second only to cardiovascular and cerebrovascular conditions [1]. The International Agency for Research on Cancer (IARC), a branch of the World Health Organization, released the latest global cancer statistics for 2020. These figures revealed approximately 19.29 million new cancer cases worldwide and roughly 9.96 million cancer-related deaths. The incidence and mortality rates of cancer in China rank among the highest globally [2]. Approximately 4.57 million new cases are reported annually, accounting for 23.7% of the world’s total new diagnoses, while over 3 million fatalities are recorded, making up 30.14% of global cancer-related deaths. Nearly half of the patients in China are diagnosed in the middle or late stages, and even after radical surgery, about 50% of patients experience recurrence and metastasis. Thus, the crucial factor in combating cancer is timely identification and prompt management of the disease, which can substantially enhance the likelihood of patient survival.
As modern medical practice continues to evolve, the prevalent method for the early detection of cancer remains the examination of histopathologic sections, which is considered the gold standard for cancer diagnosis in contemporary medicine [3]. Identifying and analyzing cell nuclei in histopathological sections allows serious diseases like cancer to be detected during the early clinical stage or even the pre-clinical phase, affording ample time for appropriate preventative action. However, traditional pathology analysis is performed by pathologists who examine the morphology of specific tissue cells through a microscope and rely on their own experience to diagnose the sections. The large number of tissue cells contained in a pathology sample, the high degree of cellular similarity, and the small field of view of the microscope make fully diagnosing a section very time-consuming, which seriously limits diagnostic efficiency.
The progress in deep learning has made Convolutional Neural Networks (CNNs) increasingly pivotal for medical image segmentation tasks. Their automated feature extraction, robust adaptability, and fast processing speed allow them to overcome the drawbacks of conventional detection techniques and enhance the precision and stability of medical image segmentation [4,5]. Encoder-decoder architectures were introduced to image segmentation by Long et al. [6], whose Fully Convolutional Network (FCN) pioneered end-to-end semantic segmentation and marked the first deep learning application in this domain. Ronneberger et al. [7] introduced the U-Net architecture, an extension of FCN that facilitates the multi-scale extraction of both shallow and deep image features during the encoding phase. To mitigate the loss of fine-grained details caused by consecutive downsampling, skip connections link the encoder and decoder at corresponding stages, enabling the network to harness features at varied levels of granularity. Numerous derivatives of the U-Net model have been developed, capitalizing on its exceptional segmentation capabilities, including UNet++ [8], Res-Unet [9], Attention U-Net [10], DenseUNet [11], R2U-Net [12], KiU-Net [13], and UNet 3+ [14]. These models are designed specifically for medical image segmentation and achieve competitive performance.
Despite the notable success of convolutional neural networks in medical imaging, their inherent inductive bias means each convolutional kernel processes only a local region of the image. This limitation hampers the network’s capacity to model extensive context and form long-range dependencies. Some work attempts to model long-range dependencies within convolutional networks, including attention mechanisms [15,16,17]. However, these methods still have considerable drawbacks in global context modeling because they are not tailored to medical image segmentation, whereas the Transformer can address these issues. The Transformer architecture [18], originally designed for sequence-to-sequence tasks in Natural Language Processing (NLP), has sparked widespread interest within the Computer Vision (CV) community. Dosovitskiy et al. [19] first introduced the Vision Transformer (ViT) to computer vision, reimagining image classification as a sequence modeling problem and achieving superior results through pre-training on extensive external datasets. Zheng et al. [20] introduced the Segmentation Transformer (SETR), demonstrating that replacing the traditional encoder in encoder-decoder networks with the Transformer achieves exceptional performance in segmentation tasks. The Swin Transformer [21] later integrated a local receptive field through a shifted-window self-attention mechanism, which reduced computational complexity and enhanced both efficiency and precision, surpassing former state-of-the-art methods in dense prediction tasks such as image classification, object detection, and semantic segmentation. TransUnet [22] proposed a U-shaped architecture with a CNN-Transformer hybrid encoder and a multi-stage up-sampling decoder, advancing performance in these domains. To capture multi-range interactions, Zhang et al. [23] combined detailed features derived by CNNs with discriminative maps of varying resolutions using a Transformer pyramid, then adaptively accessed several receptive fields to obtain the best segmentation results. Nonetheless, the majority of current research either replaces convolution with Transformer layers or stacks the two consecutively without considering the relationship between their channels and spatial locations, which limits the ability to capture fine-grained information.
To address these issues, this research presents a dual-branch hybrid network model (GLFUnet) built upon the U-Net framework, which leverages attention mechanisms to effectively integrate fine and coarse features across various scales from the CNN and Transformer branches. First, hierarchical features are generated in parallel by the ConvNeXt and Swin Transformer encoders. ConvNeXt’s lowest-scale features are then enhanced with spatial attention, and the Global Attention Up-Sampling (GAU) module extracts the global contextual information of high-level features, while channel attention strengthens the global insights of the Swin Transformer branch. The AtFF module then employs hierarchical up-sampling with skip connections to effectively capture both low-level spatial details and high-level semantic context. Lastly, the fused features are progressively restored to the input image’s original resolution for pixel-level prediction.
The workflow of the proposed GLFUnet method is presented in Figure 1. First, the gastric and liver cancer datasets undergo data augmentation through various geometric manipulations such as vertical and horizontal flipping, random rotation within a −90° to 90° range, and color jittering. The images are then fed into the GLFUnet model for feature extraction and network training. Subsequently, unseen test images are used for segmentation. Finally, we assess the proposed model’s performance and precision using evaluation metrics such as Dice, mIOU, and Accuracy.
The following are the main contributions of this paper:
  • The advantages of combining different deep learning models to extract global and local information are discussed. We present a dual-branch hierarchical global-local fusion network that integrates CNN and Transformer models for lesion region segmentation in pathology images.
  • To tackle the challenge of indistinct segmentation outcomes from merged features, this paper employs the Attention Feature Fusion (AtFF) module. This module synergizes global and local encoder branch features using an attention mechanism that effectively forges long-range correlations between coarse global and detailed local information, enhancing the extraction of both types of details.
  • By incorporating deep supervision and an additional segmentation head, we progressively restore the merged features to the input image’s resolution. This approach mitigates gradient vanishing and accelerates convergence for pixel-level prediction under deeply supervised training.
  • Comprehensive comparative and ablation studies on various segmentation datasets confirm that our model is adept for histopathology image segmentation. It surpasses the performance of many prevalent segmentation techniques, with each module contributing to enhanced segmentation accuracy.
The structure of the paper is as follows: Section 2 offers an overview of essential concepts, focusing on research related to multi-scale feature integration and CNN-based image segmentation. Section 3 presents the GLFUnet model’s architecture, which combines attention mechanisms with algorithmic procedures. Section 4 discusses the experimental findings and their analysis, while Section 5 concludes and provides future perspectives.

2. Related Work

The pertinent foundations employed in this research are briefly described in this section. First, Section 2.1 introduces image segmentation models based on CNNs and Transformers; then, Section 2.2 describes feature fusion networks.

2.1. CNN and Transformer

Convolutional Neural Networks (CNNs) are a cornerstone of deep learning, particularly adept at processing grid-structured data like images. The architecture comprises several kinds of layers: convolutional, activation, pooling, and fully connected. In the convolutional layer, an array of trainable filters extracts features by scanning the input. The activation layer introduces nonlinearity, allowing the network to learn complex patterns. The pooling layer downsamples the data, aiding computational efficiency and preventing overfitting. The fully connected layer connects every neuron in one layer to every neuron in the next, mapping the abstracted features to output classes. CNNs are efficient thanks to weight sharing and local receptive fields, which minimize the number of parameters, and their multi-layered design captures the hierarchical nature of data, making them pivotal in applications such as medical image segmentation. Notably, architectures like the U-Net encoder-decoder [7] and its offshoots have demonstrated superior performance in this domain. The UNet++ [8] and UNet 3+ [14] networks, proposed by Zhou et al. and Huang et al., respectively, design series of nested and dense skip paths to reduce semantic gaps. Attention U-Net [10] proposes a unique Attention Gate (AG) mechanism that focuses on targets with varying sizes and shapes. To enhance retinal blood vessel segmentation performance, ResUNet [9] incorporates a weighted attention method. DenseUNet [11] substitutes dense connections for the skip connections in U-Net. R2U-Net [12] optimizes feature representation by combining the benefits of residual networks. KiU-Net [13] presents a unique design that exploits undercomplete and overcomplete features to enhance the segmentation of tiny anatomical structures. UNeXt [24] merges MLPs with UNet, substantially cutting the parameter count without compromising segmentation efficacy. All of these techniques remain grounded in CNNs.
An increasing number of Transformer-based techniques are appearing in CV tasks, spurred by the Transformer’s [18] success in a variety of NLP tasks. Central to the Transformer is Self-Attention, a mechanism that enables the model to dynamically weigh the significance of various elements in a sequence relative to one another. This allows parallel processing of the full input sequence, circumventing the long-term dependency and sequential-processing limitations of traditional Recurrent Neural Networks (RNNs). The Transformer architecture also incorporates positional encoding to maintain the sequential order of data, a multi-head attention setup to enhance representational capability, and a feed-forward network with layer normalization to refine feature extraction. With its versatile and modular design, the Transformer has not only revolutionized tasks such as language modeling and machine translation but has also been widely adopted in image recognition, generative modeling, and other areas, expanding the scope of research and application possibilities. ViT [19] was among the first vision transformers to show that Transformer-only architectures can reach state-of-the-art performance in image recognition when pre-trained on extensive datasets. To train on the smaller ImageNet-1K dataset, DeiT [25] proposes data-efficient training techniques and knowledge distillation. The Swin Transformer [21], with its hierarchical structure, reduces computational complexity and attains state-of-the-art results across a wide range of tasks through its proposed shifted-window self-attention. Meanwhile, SETR [20] treats semantic segmentation as a sequence prediction problem by employing the Transformer as the encoding component.
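To make the self-attention operation described above concrete, the following minimal PyTorch sketch implements single-head scaled dot-product self-attention; it is an illustrative simplification, not the multi-head or windowed variants used by ViT or the Swin Transformer.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Minimal single-head scaled dot-product self-attention (illustrative only)."""
    def __init__(self, dim):
        super().__init__()
        self.scale = dim ** -0.5
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)   # joint Q, K, V projection
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (batch, tokens, dim)
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale       # pairwise token similarities
        attn = attn.softmax(dim=-1)                         # attention weights per query
        return self.proj(attn @ v)                          # weighted sum of the values

tokens = torch.randn(2, 196, 96)        # e.g., 14 x 14 patches with 96-dim embeddings
out = SelfAttention(96)(tokens)         # out: (2, 196, 96)
```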
Although the Transformer excels in global context modeling, it struggles to capture fine-grained details, particularly in medical imaging. Consequently, efforts have been made to integrate the strengths of CNNs and Transformers. For instance, TransUNet [22] proposed a hybrid CNN-Transformer encoder along with a multi-stage upsampling decoder for better global context modeling in medical image segmentation. In contrast, SwinUnet [26] is a mirror-symmetric, U-Net-inspired, purely Transformer-based model for medical image segmentation, featuring Swin Transformer blocks and a patch expanding layer in its design. Additionally, Medical Transformer [27] introduces a transformer architecture equipped with gated axial attention blocks to train efficiently on medical images. Meanwhile, TransFuse [28] employs an innovative dual-branch approach combining ViT with ResNet to enhance global context modeling without losing low-level local detail, demonstrating significant advantages in pathological image segmentation research [29,30].

2.2. Feature Fusion Network

Enhancing the performance of image target segmentation requires the fusion of more discriminative multiscale features, and handling and representing multi-scale information well is a primary challenge in this field. The Feature Pyramid Network (FPN) approach by Lin et al. [31] was among the first to construct a feature pyramid, combining adjacent feature levels through lateral connections and top-down pathways to enhance feature representation. Building on the feature hierarchy, Liu et al. [32] added a second bottom-up path aggregation network (PANet) atop the FPN to boost information flow and minimize the gap between the lowest and highest feature levels. To better combine semantic and spatial information, DLA [33] created hierarchical deep aggregation structures and iterative deep aggregation. Aiming to achieve multiscale feature rebalancing, Pang et al. [34] utilized IoU-balanced sampling, a Balanced Feature Pyramid, and a Balanced L1 loss in the Libra R-CNN framework, targeting imbalances at the sample, feature, and objective levels, respectively. Tan et al. [35] introduced a Weighted Bidirectional Feature Pyramid Network (BiFPN) to facilitate efficient multiscale feature integration. For leveraging cross-scale features, STDL [36] proposed a scale-transfer module, M2Det [37] offered a U-shaped design for fusing multiscale features, and G-FRNet [38] incorporated gate units to manage the exchange of cross-feature information. NAS-FPN [39] uses a neural architecture search approach to identify a more reliable fusion structure that yields a strong single-shot detector. Liu et al. [40] developed an adaptive multiscale feature fusion network capable of dynamically learning or suppressing scale-specific features at fusion time; by deriving a spatial fusion coefficient for each feature at every scale and location, this approach enhances the scalability of the features, realizes their adaptive integration, and automatically learns or suppresses potentially conflicting (i.e., inconsistent) information in the fusion space. In this research, we design an intermediate fusion approach that better captures coarse-grained and fine-grained information by using an attention mechanism.

3. Method

Figure 2 illustrates the comprehensive design of our proposed end-to-end dual-branch hybrid network architecture. The Swin Transformer branch captures global contextual information, whereas the ConvNeXt branch extracts spatial information at various scales. Features of matching resolutions from both branches are combined within the Attentional Feature Fusion (AtFF) module, where channel and spatial attention mechanisms harvest global and local feature information and the Global Attention Up-Sampling module selectively blends pertinent data, which is then conveyed through skip connections to the decoder module (depicted in the green dashed box in Figure 2). The decoder recovers the image’s information and corresponding spatial dimensions and then produces the segmentation results and loss values. Additionally, to maximize the synergy of the two branches, we assign weights to the loss values generated by the three components and merge them for unified training.

3.1. ConvNeXt Branch

The ConvNeXt branch encoder has two components, displayed in the yellow dashed box in Figure 2. The first component serves as a stem that pre-processes the original image for feature extraction using a convolutional layer with a 4 × 4 kernel and a stride of 4, followed by a layer normalization (LN) layer to accelerate training of the network. At this point, the image dimension changes from [H, W, 3] to [H/4, W/4, C]. The second part contains 4 stages; the block count of each stage is changed from (3, 3, 9, 3) to (3, 3, 3, 3), and a downsampling operation is performed between every two stages, which expands the receptive fields of the different layers of ConvNeXt and enables each neuron to perceive a wider area of the pathology image. This helps ConvNeXt learn a larger range of contextual information and improves understanding of the overall structure. Every downsampling operation employs a convolutional layer with a 2 × 2 kernel and a stride of 2, preceded by a layer normalization (LN) layer to prevent gradient explosion. The AtFF module receives the feature maps extracted at each stage, whose dimensions are [H/4, W/4, C], [H/8, W/8, 2C], [H/16, W/16, 4C], and [H/32, W/32, 8C]. After extensive experimentation, the hyperparameter C is set to 96.
Each ConvNeXt block has three layers. The first layer is a 7 × 7 depthwise-separable convolution followed by a layer normalization (LN) layer with 96 output channels. The second layer is a 1 × 1 convolution followed by a GELU (Gaussian Error Linear Unit) activation function; it performs only channel fusion, mapping the channels from C = 96 to 384. The third layer is a 1 × 1 convolution that maps the 384 channels from the second layer back to 96. Finally, a residual connection sums the block’s input and output.
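As a concrete illustration of this block structure, the following PyTorch sketch reproduces the 7 × 7 depthwise convolution, LN, 1 × 1 expansion to 384 channels with GELU, 1 × 1 reduction back to 96 channels, and the residual sum; the 4 × 4 stride-4 stem and the stage-wise downsampling described above are omitted, and this is a simplified sketch rather than the exact GLFUnet implementation.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Sketch of one ConvNeXt block as described above (dim = 96 by default)."""
    def __init__(self, dim=96, expansion=4):
        super().__init__()
        # 7x7 depthwise (depth-separable) convolution
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)                    # LN applied over the channel dim
        self.pwconv1 = nn.Linear(dim, expansion * dim)   # 1x1 conv: 96 -> 384
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(expansion * dim, dim)   # 1x1 conv: 384 -> 96

    def forward(self, x):                   # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)           # (N, H, W, C) so LN/Linear act on channels
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)
        return shortcut + x                 # residual connection

x = torch.randn(1, 96, 56, 56)              # e.g., a 224x224 input after the 4x stem
print(ConvNeXtBlock()(x).shape)             # torch.Size([1, 96, 56, 56])
```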

3.2. Swin Transformer Branch

As shown in the blue dashed box in Figure 2, the Swin Transformer branch is constructed in four stages, with the number of Swin Transformer Blocks changed from (2, 2, 6, 2) to (2, 2, 2, 2) across the stages. The input image x ∈ R^(H × W × 3) has spatial resolution H × W and three channels. In the patch partitioning module, the image for the Swin Transformer branch is split into non-overlapping patches of size 4 × 4, so the image dimensions change from [H, W, 3] to [H/4, W/4, 48]. In stage 1, the channel-wise patch features of each position are mapped to a high-dimensional space by a linear transformation in the Patch Embedding layer, which also establishes the positional relationship between the image patches. This resizes the feature map from [H/4, W/4, 48] to [H/4, W/4, C], which is then fed to the Swin Transformer Block. Stages 2 through 4 replicate this process, except that the feature map first undergoes Patch Merging, which changes its dimensions from [H/4, W/4, C] to [H/8, W/8, 2C] (and correspondingly at later stages), before being input into the Swin Transformer Block. The feature output from each stage is directed to the AtFF module for integration.
The Swin Transformer Block is composed of LayerNorm (LN) layers, a Window-based Multi-head Self-Attention (W-MSA) layer, residual connections, and a two-layer Multilayer Perceptron (MLP). The detailed module configuration is depicted in Figure 3: the input vector is first processed by an LN layer and then channeled into the W-MSA module, whose window-based attention scheme significantly decreases the model’s computational complexity; it then passes through the residual structure before moving to the next LN layer. In the following block, a Shifted-Window MSA (SW-MSA) module enables information exchange across neighboring windows. LN layers are applied before every MSA module and MLP, with residual connections following each MSA and MLP. With this shifted-window mechanism, two successive Swin Transformer Blocks are calculated as in Equations (1) to (4):
\hat{z}^{l} = \text{W-MSA}(\text{LN}(z^{l-1})) + z^{l-1}
z^{l} = \text{MLP}(\text{LN}(\hat{z}^{l})) + \hat{z}^{l}
\hat{z}^{l+1} = \text{SW-MSA}(\text{LN}(z^{l})) + z^{l}
z^{l+1} = \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}
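The following sketch shows how Equations (1)–(4) wire two consecutive blocks together; the window-based attention and MLP sub-modules are treated as black boxes, and a single LayerNorm is reused here for brevity even though each block owns its own normalization layers in practice.

```python
import torch
import torch.nn as nn

def swin_block_pair(z, w_msa, sw_msa, mlp1, mlp2, ln):
    """Residual wiring of two consecutive Swin Transformer blocks (Eqs. (1)-(4))."""
    z_hat = w_msa(ln(z)) + z           # Eq. (1): pre-LN W-MSA with residual
    z = mlp1(ln(z_hat)) + z_hat        # Eq. (2): pre-LN MLP with residual
    z_hat = sw_msa(ln(z)) + z          # Eq. (3): shifted-window attention (SW-MSA)
    z = mlp2(ln(z_hat)) + z_hat        # Eq. (4): pre-LN MLP with residual
    return z

# Shape-only check with identity stand-ins for the attention and MLP sub-modules.
tokens = torch.randn(1, 3136, 96)      # 56 x 56 tokens, C = 96
out = swin_block_pair(tokens, nn.Identity(), nn.Identity(),
                      nn.Identity(), nn.Identity(), nn.LayerNorm(96))
```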

3.3. Attentional Feature Fusion Module

We introduce an innovative Attention feature fusion (AtFF) module, illustrated in Figure 4, which amalgamates an attention mechanism with a multimodal fusion mechanism to effectively blend the hierarchical features from CNN and Transformer. To counteract the spatial information loss due to downsampling, we employ a dual-branch skip link (depicted in red) to integrate the high-level global context with the low-level precise details from both global and local encoders. The AtFF module is linked to the preceding fusion module through an additive process, enabling the fusion of local and global features at the same stage, as observed in Figure 4. Therefore, the skip connection module (SCM) process can be expressed as Equations (5)–(8):
\hat{t}_{i} = \text{ChannelAttention}(t_{i}), \quad i = 1, 2, 3, 4
\hat{r}_{4} = \text{SpatialAttention}(r_{4})
\hat{r}_{i} = \text{GAU}(r_{i}, \hat{r}_{i+1}), \quad i = 1, 2, 3
l_{scm} = \text{SCM}(\hat{t}_{i} + \hat{r}_{i}, l_{i+1})
where GAU stands for the Global Attention Up-Sampling module calculation, and ChannelAttention and SpatialAttention indicate the channel attention and spatial attention computations, respectively. t_i and r_i are the features of the Transformer branch and the CNN branch at stage i, respectively, and l_i is the output of the AtFF module.
To improve feature representation capability, the Transformer branch features use channel attention to determine which channels of the feature map carry important global information; channels carrying less crucial information receive diminished focus. The SE-Block is employed to implement channel attention, thereby enhancing the dissemination of global information from the Transformer branch. The SE-Block is composed of two components, Squeeze and Excitation. Given the feature maps t_i, the Squeeze operation compresses each feature map by a global average pooling operation, yielding a 1 × 1 × C vector of real values. The specific calculation is shown in Equation (9):
z = F_{\text{squeeze}}(t_{i}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} t_{i}(i, j)
To accurately represent channel dependencies, the subsequent Excitation operation is essential to utilize the insights obtained from the Squeeze stage. We employ a straightforward gating method with sigmoid activation in our work. The specific computational representations are as in Equations (10) and (11):
s = F_{\text{excitation}}(z, W) = \sigma(W_{2}\,\delta(W_{1} z))
\hat{t}_{i} = F_{\text{scale}}(t_{i}, s) = t_{i} \cdot s
where σ represents the sigmoid activation function, δ represents the ReLU activation function, W_1 ∈ R^((c/r) × c), and W_2 ∈ R^(c × (c/r)). F_scale denotes the channel-wise product.
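A minimal PyTorch sketch of this Squeeze-and-Excitation channel attention, following Equations (9)–(11), is given below; the reduction ratio r is an assumed hyperparameter not specified in the text.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel attention via Squeeze-and-Excitation, following Eqs. (9)-(11)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),   # W1: c -> c/r
            nn.ReLU(inplace=True),                        # delta
            nn.Linear(channels // reduction, channels),   # W2: c/r -> c
            nn.Sigmoid(),                                 # sigma
        )

    def forward(self, t):                     # t: (N, C, H, W) Transformer-branch feature
        z = t.mean(dim=(2, 3))                # squeeze: global average pooling -> (N, C)
        s = self.fc(z).view(t.size(0), -1, 1, 1)
        return t * s                          # scale: channel-wise reweighting

t = torch.randn(2, 96, 56, 56)
t_hat = SEBlock(96)(t)                        # same shape, channels reweighted
```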
The CBAM (Convolutional Block Attention Module) block serves as a spatial filter for the most reduced-scale features of the convolutional neural network (CNN) branch; it enhances local details and suppresses irrelevant regions to address the problem of noisy low-level CNN features. The feature maps are transformed from C × H × W to C × 1 × 1 by passing them through two parallel MaxPool and AvgPool layers. A shared MLP with a ReLU activation then produces two activated results from the pooled descriptors. Following an element-wise summation of the two outputs, a sigmoid activation function is applied to derive the channel attention output. This output is then subjected to max pooling and average pooling along the channel dimension, resulting in two 1 × H × W feature maps. These maps are concatenated via the Concat operation, converted into a 1-channel feature map by a 7 × 7 convolution, and finally passed through a sigmoid function to obtain the spatial attention feature map. The specific computational expressions are detailed in Equations (12) and (13):
m = \sigma(\text{MLP}(\text{AvgPool}(r_{4})) + \text{MLP}(\text{MaxPool}(r_{4})))
\hat{r}_{4} = \sigma(f^{7 \times 7}([\text{AvgPool}(m); \text{MaxPool}(m)]))
where σ represents the sigmoid activation function and f 7 × 7 signifies a convolutional layer with a 7 × 7 convolutional kernel.
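For reference, the sketch below follows the standard CBAM formulation implied by Equations (12) and (13): a shared MLP over average- and max-pooled channel descriptors produces the channel attention, which reweights r_4 before the 7 × 7 spatial attention is applied. The reduction ratio and the multiplicative application of both attention maps are assumptions of this sketch rather than details stated in the text.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """CBAM-style channel + spatial attention for the lowest-scale CNN feature r4 (sketch)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                        # shared MLP for both pooled vectors
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)   # 7x7 conv on [avg; max] maps

    def forward(self, r4):                               # r4: (N, C, H, W)
        n, c, _, _ = r4.shape
        avg = self.mlp(r4.mean(dim=(2, 3)))              # AvgPool branch -> shared MLP
        mx = self.mlp(r4.amax(dim=(2, 3)))               # MaxPool branch -> shared MLP
        m = torch.sigmoid(avg + mx).view(n, c, 1, 1) * r4    # channel-refined feature, Eq. (12)
        spatial = torch.cat([m.mean(dim=1, keepdim=True),    # average over channels
                             m.amax(dim=1, keepdim=True)], dim=1)  # max over channels
        return torch.sigmoid(self.conv(spatial)) * m     # spatial attention applied, Eq. (13)

r4_hat = CBAM(768)(torch.randn(1, 768, 7, 7))            # e.g., the 8C feature at H/32
```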
The global attention upsampling module (GAU) subsequently extracts the high-level features’ global contextual information by employing these features to guide the weighted computation of the low-level features through a global pooling operation. The number of channels in the CNN feature map is reduced when a 3 × 3 convolution is applied to the low-level features. The high-level features generate a global context, which is then multiplied by the low-level features after being processed by a 1 × 1 convolution with batch normalization and ReLU nonlinearity. Ultimately, the high-level features are added to the weighted low-level features and undergo a progressive upsampling process.
\hat{r}_{i} = \text{GAU}(r_{i}, \hat{r}_{i+1}) = \hat{r}_{i+1} + \delta(\text{BN}(\text{GAP}(\hat{r}_{i+1}))) \otimes \text{Conv}(r_{i}), \quad i = 1, 2, 3
where Conv is a 3 × 3 convolutional layer, δ denotes the ReLU activation function, and GAP denotes global average pooling.
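The following PyTorch sketch captures the GAU computation just described: a 3 × 3 convolution on the low-level feature, global pooling followed by a 1 × 1 convolution with BN and ReLU on the high-level feature, channel-wise weighting, and the final addition after upsampling. The channel configuration shown is an assumption for illustration and may differ from the exact GLFUnet settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GAU(nn.Module):
    """Global Attention Upsample sketch: pooled high-level context gates the low-level feature."""
    def __init__(self, low_ch, high_ch):
        super().__init__()
        self.conv_low = nn.Conv2d(low_ch, high_ch, kernel_size=3, padding=1)  # 3x3 conv on r_i
        self.conv_att = nn.Sequential(                                        # 1x1 conv + BN + ReLU
            nn.Conv2d(high_ch, high_ch, kernel_size=1),
            nn.BatchNorm2d(high_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, r_low, r_high):
        att = self.conv_att(F.adaptive_avg_pool2d(r_high, 1))     # GAP -> channel weights
        low = self.conv_low(r_low) * att                          # weighted low-level feature
        high = F.interpolate(r_high, size=r_low.shape[2:], mode="bilinear", align_corners=False)
        return high + low                                         # add high-level to weighted low-level

gau = GAU(low_ch=96, high_ch=192).eval()   # eval() so BN accepts the pooled 1x1 input
r1, r2_hat = torch.randn(1, 96, 56, 56), torch.randn(1, 192, 28, 28)
print(gau(r1, r2_hat).shape)               # torch.Size([1, 192, 56, 56])
```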
During upsampling, l_scm either doubles its resolution while halving its channel count, or quadruples its resolution while keeping its channel count, by employing a cross-scale expansion layer in place of the downsampling and patch merging layers. The cross-scale expansion layer within AtFF_1 amplifies the resolution of the input features fourfold (h_i × w_i × dim_i → 4h_i × 4w_i × dim_i). In contrast, within AtFF_{2,3,4} the input features’ resolution is transformed from h_i × w_i × dim_i to 2h_i × 2w_i × dim_i/2. Consequently, the proposed cross-scale expansion layer (csel) can be expressed as follows:
l_{csel} = \text{concat}\left(\left[\text{transpose}_{t}(l_{scm})\right]_{t=1}^{T}\right)
where T denotes the total number of transposed convolutions. We set T to 4 in AtFF_1 and T to 2 in AtFF_{2,3,4}. Since AtFF_4 serves as the initial upsampling block, it receives no input from any preceding module. Therefore, the computation of the AtFF_4 module proceeds as follows:
l_{4} = \text{concat}\left(\left[\text{transpose}_{t}\left(\text{SCM}(\hat{t}_{i} + \hat{r}_{i})\right)\right]_{t=1}^{T}\right), \quad l_{4} \in \mathbb{R}^{\frac{h}{8} \times \frac{w}{8} \times 4c}
The calculation of the AtFF_{2,3} modules is shown in Equation (17):
l_{i} = \text{concat}\left(\left[\text{transpose}_{t}\left(\text{SCM}(\hat{t}_{i} + \hat{r}_{i}, l_{i+1})\right)\right]_{t=1}^{T}\right), \quad l_{i} \in \mathbb{R}^{\frac{h}{2^{i}} \times \frac{w}{2^{i}} \times 2^{i-2}c}, \quad i = 2, 3
The calculation of the AtFF_1 module is shown in Equation (18):
l_{1} = \text{concat}\left(\left[\text{transpose}_{t}\left(\text{SCM}(\hat{t}_{i} + \hat{r}_{i}, l_{2})\right)\right]_{t=1}^{T}\right), \quad l_{1} \in \mathbb{R}^{h \times w \times c}
where SCM is the skip connection module and t_i and r_i are the stage-i features of the Transformer and CNN branches, respectively, as in Equations (5)–(8). The result of AtFF_i is l_i. Following the output of the AtFF_1 module, a linear projection (LP) layer is applied to accomplish pixel-level segmentation of the image x.
l_{0} = \text{Segmentation}(x) = \text{LP}(l_{1})
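The cross-scale expansion layer of Equations (15)–(18) can be sketched as T parallel transposed convolutions whose outputs are concatenated along the channel dimension; the channel sizes in the usage lines are illustrative assumptions rather than the exact GLFUnet configuration.

```python
import torch
import torch.nn as nn

class CrossScaleExpansion(nn.Module):
    """Cross-scale expansion layer (csel) sketch following Eqs. (15)-(18):
    T parallel transposed convolutions are applied to the fused feature and concatenated."""
    def __init__(self, in_ch, out_ch, scale, T):
        super().__init__()
        assert out_ch % T == 0
        self.branches = nn.ModuleList([
            nn.ConvTranspose2d(in_ch, out_ch // T, kernel_size=scale, stride=scale)
            for _ in range(T)])

    def forward(self, l_scm):
        return torch.cat([branch(l_scm) for branch in self.branches], dim=1)

# AtFF_{2,3,4}: 2x resolution, half the channels, T = 2 (assumed channel sizes)
csel_2 = CrossScaleExpansion(in_ch=192, out_ch=96, scale=2, T=2)
# AtFF_1: 4x resolution, same channel count, T = 4
csel_1 = CrossScaleExpansion(in_ch=96, out_ch=96, scale=4, T=4)
print(csel_2(torch.randn(1, 192, 28, 28)).shape)   # torch.Size([1, 96, 56, 56])
```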
The complete process of GLFUnet is shown in Algorithm 1:
Algorithm 1: Attention-based two-branch hybrid fusion network for medical image segmentation
Input: The image to be segmented x, the training batch size, learning rate, momentum, max epoch.
Output: Trained network model.
Applsci 14 04073 i001

3.4. Loss Function

The complete network undergoes end-to-end training using a combination of weighted Dice loss and weighted binary cross-entropy loss, denoted as L = L_dice^w + L_bce^w. A simple prediction head that directly restores the input feature map to its original resolution generates the segmentation predictions. To address slow convergence and gradient vanishing, we adopt a deep supervision approach that supervises both AtFF_1 and AtFF_4. The overall training loss is detailed in Equation (20):
L_{total} = \alpha L(G, \text{head}(l_{0})) + \beta L(G, \text{head}(l_{1})) + \gamma L(G, \text{head}(l_{4}))
where G is the ground truth and α, β, and γ are hyperparameters that may be adjusted.
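A sketch of this training objective is shown below: each supervised head combines a soft Dice loss with binary cross-entropy, and the three heads are weighted by α, β, and γ as in Equation (20). The per-class weighting implied by the superscript w, and the upsampling of the auxiliary outputs l_1 and l_4 to the label resolution, are omitted here for brevity.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss on sigmoid probabilities; target is a float binary mask."""
    prob = torch.sigmoid(logits).flatten(1)
    target = target.flatten(1)
    inter = (prob * target).sum(dim=1)
    return 1 - ((2 * inter + eps) / (prob.sum(dim=1) + target.sum(dim=1) + eps)).mean()

def seg_loss(logits, target):
    """L = Dice + BCE for one prediction head."""
    return dice_loss(logits, target) + F.binary_cross_entropy_with_logits(logits, target)

def total_loss(head_l0, head_l1, head_l4, target, alpha=0.5, beta=0.3, gamma=0.2):
    """Deeply supervised objective of Eq. (20); the weights follow Section 4.1."""
    return (alpha * seg_loss(head_l0, target)
            + beta * seg_loss(head_l1, target)
            + gamma * seg_loss(head_l4, target))
```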

4. Experiments and Analysis of Results

4.1. Implementation Details

The experiments are built on the PyTorch deep learning framework (Python 3.8, PyTorch 1.11.0). To enhance data variability, all samples undergo random rotations within [−90°, 90°], vertical flipping, and horizontal flipping at the outset of the experiment. The model is optimized with mini-batch gradient descent. For five-fold cross-validation, the datasets were randomly split into training and testing sets at an 8:2 ratio. The model is trained for a total of 20 epochs with a batch size of 64. Following comparative tests, the optimal values of the parameters α, β, and γ were set to 0.5, 0.3, and 0.2, respectively. An Adam optimizer with a learning rate of 5 × 10−4 is used. The initial parameter choices were informed by the existing literature [28,41]. The hardware platform for these experiments was a PC equipped with an Intel(R) Core (TM) [email protected] processor and an NVIDIA GeForce RTX 3090 graphics card with 24 GB of memory, running the Windows 10 Professional operating system. The development environment was JetBrains PyCharm 2021.2 Professional Edition.
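For illustration, the sketch below assembles these settings into a minimal training loop; the GLFUnet model (assumed here to return the three supervised outputs l_0, l_1, l_4) and the data loader are assumed to be defined elsewhere, and plain BCE stands in for the full Dice + BCE objective of Section 3.4.

```python
import torch
import torch.nn.functional as F

# Hyperparameters from Section 4.1 (sketch; model and loader come from elsewhere).
EPOCHS, BATCH_SIZE, LR = 20, 64, 5e-4
ALPHA, BETA, GAMMA = 0.5, 0.3, 0.2                      # deep-supervision loss weights

def train(model, loader, device="cuda"):
    optimizer = torch.optim.Adam(model.parameters(), lr=LR)
    model.train()
    for _ in range(EPOCHS):
        for images, masks in loader:                    # mini-batch updates
            images, masks = images.to(device), masks.to(device)
            l0, l1, l4 = model(images)                  # assumed: three supervised heads
            loss = sum(w * F.binary_cross_entropy_with_logits(out, masks)
                       for w, out in ((ALPHA, l0), (BETA, l1), (GAMMA, l4)))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```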

4.2. Evaluation Metrics

Since this research presents a medical image segmentation approach, model accuracy is assessed using standard image segmentation criteria. The model’s segmentation efficacy is gauged with evaluation metrics such as the Dice Similarity Coefficient (DSC), Mean Intersection over Union (mIoU), and Accuracy (Acc) to provide an impartial analysis of the methodology’s effectiveness.
The Dice similarity coefficient measures how similar the predicted target area and the actual target area are. The similarity coefficient aggregated over all test outcomes within the test set is reported as Dice, with its mathematical expression provided in Equation (21):
Dice = \frac{2 \times TP}{2 \times TP + FP + FN}
The Intersection over Union (IoU) measures the ratio of the overlap to the union of the predicted and actual target regions. The IoU aggregated over all test outcomes in the test set is defined as mIoU, expressed by the formula shown in Equation (22):
mIoU = \frac{TP}{TP + FP + FN}
Accuracy (Acc) is the ratio of correctly classified pixels to the total number of pixels; a higher accuracy indicates a more effective classification. The calculation for Acc is given by Equation (23):
Acc = \frac{TP + TN}{TP + TN + FP + FN}
where TP, TN, FP, and FN signify the counts of correctly identified positives, correctly identified negatives, erroneously identified positives, and erroneously identified negatives within the feature’s image pixel points, respectively.
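The sketch below computes these three metrics from a pair of binary masks using the pixel-level counts defined above; in practice, Dice and IoU would additionally be averaged over all test images (and, for mIoU, over classes), which is omitted here.

```python
import numpy as np

def segmentation_metrics(pred, gt, eps=1e-6):
    """Dice, IoU, and Accuracy from binary masks, following Eqs. (21)-(23)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    dice = 2 * tp / (2 * tp + fp + fn + eps)
    iou = tp / (tp + fp + fn + eps)
    acc = (tp + tn) / (tp + tn + fp + fn + eps)
    return dice, iou, acc

pred = np.random.rand(224, 224) > 0.5        # toy predicted mask
gt = np.random.rand(224, 224) > 0.5          # toy ground-truth mask
print(segmentation_metrics(pred, gt))
```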

4.3. Experimental Results and Analysis of Different Datasets

4.3.1. Gastric Cancer Dataset

This paper uses the BOT gastric slice dataset from 2017 [42], which contains 140 normal sections and 560 sections with gastric cancer. The slices were stained with hematoxylin-eosin (H&E) at 20× magnification. The gastric slices have a resolution of 2048 × 2048, and the data provider has partially labeled the tumor locations. Because this resolution is too large for the network to process directly, we crop 224 × 224 regions from the gastric images to create the training set. Sample patches are shown in Figure 5. Due to the limited data available, all samples underwent horizontal flipping, vertical flipping, and random rotations within [−90°, 90°] to augment the training set. For experimental analysis, the dataset was partitioned into training and testing subsets at a ratio of 8:2.
The GLFUnet medical image segmentation approach presented in this study is contrasted with various methods, including U-Net, ResUNet, FCN, DeeplabV3, ConvNeXt, Medical Transformer, TransUnet, SwinUnet, and TransFuse, in order to more thoroughly test the efficacy of this model. Table 1 displays the Dice, mIoU, and Acc statistics for GLFUnet on the dataset for stomach cancer as well as the other nine segmentation models.
ConvNeXt performs better than the other CNN-based models, achieving 88.05%, 73.75%, and 88.50% on Dice, mIoU, and Acc, respectively. Among the Transformer-based models, SwinUnet performs best, reaching 91.21%, 79.84%, and 92.25%. Our proposed GLFUnet outperforms both of these state-of-the-art methods, with 91.65%, 79.87%, and 92.51%, respectively, and thus achieves the best results on the gastric cancer dataset. Compared to the best Transformer-based model, SwinUnet, the improvements are 0.44%, 0.03%, and 0.26%, respectively; compared to the best CNN-based model, ConvNeXt, the improvements are 3.60%, 6.12%, and 4.01%.
To provide a clearer comparison, Figure 6 presents a visual contrast of selected segmentation outcomes for gastric cancer pathology images. It is evident that the Transformer-based architectures emphasize global contextual information extraction and excel at long-range relationship modeling, yielding sharper edges than the CNN-based models. The GLFUnet results show that the segmentation outputs obtained by integrating Swin Transformer and ConvNeXt via the AtFF module are more closely aligned with the annotated mask images, further substantiating the efficacy of the proposed approach. Overall, the GLFUnet segmentation exhibits superior performance.

4.3.2. Liver Cancer Dataset

Liver cancer images from five patients were obtained from the Third Xiangya Hospital of Central South University in China [43]. Qualified pathologists carefully annotated the tumor and healthy tissue regions in each sample, whose resolutions are approximately 40,000~60,000 × 30,000~50,000 pixels. To ensure a smoother overall segmentation outcome, overlapping was allowed between adjacent patches, set at half the patch size both horizontally and vertically [43,44]. The sliding-window size was chosen considering the number of samples and the resolution of the pathological images: using a window of 448 × 448 and a stride of 224, the liver samples were sectioned into small patches, which were subsequently adjusted to 224 × 224 based on the model’s input size. Figure 7 illustrates a representative liver pathology sample. For experimental purposes, the dataset was divided into training and test sets at an 8:2 ratio.
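A minimal sketch of this overlapping sliding-window cropping (stride equal to half the patch size) is shown below; the toy array stands in for a region of a whole-slide image, and handling of border remainders and the subsequent resizing to the network input size are omitted.

```python
import numpy as np

def sliding_window_patches(wsi, patch=448, stride=224):
    """Crop overlapping patches (stride = half the patch size) from an (H, W, 3) image array;
    patch coordinates are kept so predictions can be stitched back together."""
    h, w = wsi.shape[:2]
    patches, coords = [], []
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            patches.append(wsi[y:y + patch, x:x + patch])
            coords.append((y, x))
    return np.stack(patches), coords

wsi = np.zeros((2048, 2048, 3), dtype=np.uint8)   # toy stand-in for a slide region
patches, coords = sliding_window_patches(wsi)
print(patches.shape)                              # (64, 448, 448, 3)
```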
The GLFUnet medical image segmentation approach presented in this study is contrasted with various methods, including U-Net, ResUNet, FCN, ConvNeXt, Medical Transformer, TransUnet, SwinUnet, TransFuse and DHUnet, in order to more thoroughly test the efficacy of this model. Table 2 presents the Dice, mIoU, and Acc scores for GLFUnet on the liver cancer dataset, alongside those of nine other segmentation models.
ConvNeXt performs better than the other CNN-based models, as evidenced by its experimental results of 92.69%, 86.52%, and 97.24% on Dice, mIoU, and Acc, respectively. Our proposed GLFUnet model achieves 93.36%, 86.93%, and 97.51% for Dice, mIoU, and Acc, respectively. Among the Transformer-based models considered, DHUnet achieves the top scores of 92.76%, 86.64%, and 97.43%. Compared to the leading Transformer-based model, DHUnet, GLFUnet improves by 0.60%, 0.29%, and 0.08%, respectively, while it improves by 0.67%, 0.41%, and 0.27% over the best CNN-based model, ConvNeXt. Consequently, GLFUnet yields the highest performance on the liver cancer dataset. For a more intuitive comparison, Figure 8 displays a visual comparison of the segmentation results for liver cancer pathology images, showing that the GLFUnet results are closest to the annotated mask images.

4.4. Ablation Experiments

In this study, a suite of controlled ablation tests was carried out on the segmentation datasets to examine the influence of the various components on model performance. This included disabling the cross-scale expansion layer (csel) and the attentional feature fusion (AtFF) module within the model. Additionally, the fusion technique employed in the model was replaced with Concat and Add fusion methods to validate the effectiveness of the fusion strategy used here. The ablation results in Table 3 indicate that the model performs better with the AtFF module than without it, confirming that the attention-based fusion module enhances medical image segmentation precision. Line graphs of the outcomes are displayed in Figure 9 and Figure 10, where the x-axis represents the different module replacements and the y-axis denotes the metric percentages. The heights of the curves show that the GLFUnet model achieves the best segmentation efficacy.

5. Discussion

In the current study, we implemented GLFUnet, a deep neural network architecture, for the task of segmenting pathological images. The model synergistically combines the spatial information extraction prowess of the ConvNeXt branch with the global contextual modeling of the Swin Transformer branch, and its efficacy is further augmented by the attentional feature fusion module and the decoder module. We then evaluated the segmentation performance of the proposed network. Comprehensive experimental results on two distinct datasets, one for gastric cancer and the other for liver cancer, with varying complexities in cancer type, density, and target size, affirmed the strong performance and adaptability of GLFUnet in medical image segmentation. In addition, we compared GLFUnet against a roster of widely recognized segmentation models, including U-Net, ResUNet, FCN, ConvNeXt, MedT, TransUnet, SwinUnet, and TransFuse. Our comparative analysis revealed that GLFUnet not only outperforms these models but also boosts the accuracy and efficiency of medical image segmentation tasks.
A noteworthy advantage of GLFUnet lies in its innovative dual-branch hybrid network design, which effectively surmounts the challenges faced by Convolutional Neural Networks (CNNs) in grasping global context and the relative limitation of Transformer models in processing fine-grained local details. Furthermore, the integration of the Attentional Feature Fusion (AtFF) module represents another stride forward, as it introduces an attention mechanism to sharpen the focus on critical features and selectively merges valuable information from features with corresponding resolutions across branches.
Despite its robustness, GLFUnet encounters certain limitations inherent to its design: the autonomous encoding of each branch leads to increased computational demands and a higher number of parameters that require learning, contrasting with the more streamlined nature of single-branch networks. It is well-established that hyperparameters play a crucial role in dictating the performance of deep neural networks, including variables such as the number of hidden layer nodes, mini-batch size, learning rate, weight decay, and the number of training epochs. However, in this study, we confined ourselves to using default or moderate values for these hyperparameters without engaging in exhaustive tuning. This implies that there is substantial room for enhancing our results by means of hyperparameter optimization and the selection of more optimal values.

6. Conclusions

This research introduces the GLFUnet model for medical image segmentation, which builds upon the U-Net framework. It integrates features derived from both Swin Transformer and ConvNeXt encoders in parallel, utilizing an attentional feature fusion module. The model effectively captures spatial and semantic information through skip connections. GLFUnet is adept at extracting both global and local detailed information crucial for pathology segmentation. Evaluations on datasets of stomach and liver cancer demonstrate GLFUnet’s superior performance and adaptability against other segmentation models. Ablation tests confirm the efficacy of each introduced component. However, the integration of the attention mechanism increases computational demands and prolongs processing time. Future work will focus on refining the model’s architecture to reduce complexity and streamline operations. Additionally, experimenting with various CNN and Transformer combinations will be pursued to enhance segmentation efficiency and precision.

Author Contributions

Conceptualization, J.L., S.M. and L.P.; methodology, J.L. and S.M.; software, S.M.; validation, S.M.; formal analysis, J.L. and S.M.; investigation, J.L. and S.M.; resources, S.M. and L.P.; data curation, J.L., S.M. and L.P.; writing—original draft preparation, S.M.; writing—review and editing, S.M. and L.P.; visualization, S.M.; supervision, J.L.; project administration, J.L.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Taiyuan Institute of Technology scientific research initial funding (2022KJ092), and the Opening Foundation of Shanxi Key Laboratory of Signal Capturing & Processing (2022-01).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Two datasets were used in this study. One was a public dataset (BOT gastric slice dataset [42]), and the other was supported by data from the Third Xiangya Hospital of Central South University.

Conflicts of Interest

The authors declare no conflict of interest. The funder had no role in the design of the study, in the writing of the manuscript or in the decision to publish the results.

References

  1. World Health Organization. Comprehensive Cervical Cancer Control: A Guide to Essential Practice; World Health Organization: Geneva, Switzerland, 2006.
  2. Lawrence, S.; Giles, C.L.; Tsoi, A.C.; Back, A.D. Face recognition: A convolutional neural-network approach. IEEE Trans. Neural Netw. 1997, 8, 98–113. [Google Scholar] [CrossRef] [PubMed]
  3. Nelson, C.J.; Cho, C.; Berk, A.R.; Holland, J.; Roth, A.J. Are gold standard depression measures appropriate for use in geriatric cancer patients? A systematic evaluation of self-report depression instruments used with geriatric, cancer, and geriatric cancer samples. J. Clin. Oncol. 2010, 28, 348. [Google Scholar] [CrossRef] [PubMed]
  4. Olabarriaga, S.D.; Smeulders, A.W.M. Interaction in the segmentation of medical images: A survey. Med. Image Anal. 2001, 5, 127–142. [Google Scholar] [CrossRef] [PubMed]
  5. Asadi-Aghbolaghi, M.; Azad, R.; Fathy, M.; Escalera, S. Multi-level context gating of embedded collective knowledge for medical image segmentation. arXiv 2020, arXiv:2003.05056. [Google Scholar]
  6. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  7. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer International Publishing: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  8. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September 2018; Proceedings 4. Springer International Publishing: Berlin/Heidelberg, Germany, 2018; pp. 3–11. [Google Scholar]
  9. Xiao, X.; Lian, S.; Luo, Z.; Li, S. Weighted Res-UNet for High-Quality Retina Vessel Segmentation. In Proceedings of the 2018 9th International Conference on Information Technology in Medicine and Education (ITME), Hangzhou, China, 19–21 October 2018; pp. 327–331. [Google Scholar] [CrossRef]
  10. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention u-net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar]
  11. Li, X.; Chen, H.; Qi, X.; Dou, Q.; Fu, C.W.; Heng, P.A. H-DenseUNet: Hybrid densely connected UNet for liver and tumor segmentation from CT volumes. IEEE Trans. Med. Imaging 2018, 37, 2663–2674. [Google Scholar] [CrossRef] [PubMed]
  12. Alom, M.Z.; Yakopcic, C.; Taha, T.M.; Asari, V.K. Nuclei segmentation with recurrent residual convolutional neural networks based U-Net (R2U-Net). In Proceedings of the NAECON 2018-IEEE National Aerospace and Electronics Conference, Dayton, OH, USA, 23–26 July 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 228–233. [Google Scholar]
  13. Valanarasu, J.M.J.; Sindagi, V.A.; Hacihaliloglu, I.; Patel, V.M. Kiu-net: Towards accurate segmentation of biomedical images using over-complete representations. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, 4–8 October 2020; Proceedings, Part IV 23. Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 363–373. [Google Scholar]
  14. Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.W.; Wu, J. Unet 3+: A full-scale connected unet for medical image segmentation. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1055–1059. [Google Scholar]
  15. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803. [Google Scholar]
  16. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 603–612. [Google Scholar]
  17. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  18. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11. [Google Scholar]
  19. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  20. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890. [Google Scholar]
  21. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  22. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021. [Google Scholar] [CrossRef]
  23. Zhang, Z.; Sun, B.; Zhang, W. Pyramid Medical Transformer for Medical Image Segmentation. arXiv 2021. [Google Scholar] [CrossRef]
  24. Valanarasu, J.M.J.; Patel, V.M. Unext: Mlp-based rapid medical image segmentation network. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Singapore, 18–22 September 2022; Springer Nature: Cham, Switzerland, 2022; pp. 23–33. [Google Scholar]
  25. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR: New York, NY, USA, 2021; pp. 10347–10357. [Google Scholar]
  26. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 205–218. [Google Scholar]
  27. Valanarasu, J.M.; Oza, P.; Hacihaliloglu, I.; Patel, V.M. Medical transformer: Gated axial-attention for medical image segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, 27 September–1 October 2021; Proceedings, Part I 24. Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 36–46. [Google Scholar]
  28. Zhang, Y.; Liu, H.; Hu, Q. Transfuse: Fusing transformers and cnns for medical image segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, 27 September–1 October 2021; Proceedings, Part I 24. Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 14–24. [Google Scholar]
  29. Nguyen, C.; Asad, Z.; Deng, R.; Huo, Y. Evaluating transformer-based semantic segmentation networks for pathological image segmentation. In Proceedings of the Medical Imaging 2022: Image Processing, San Diego, CA, USA, 20–24 February 2022 and 21–27 March 2022; SPIE: Houston, TX, USA, 2022; Volume 12032, pp. 942–947. [Google Scholar]
  30. Shamshad, F.; Khan, S.; Zamir, S.W.; Khan, M.H.; Hayat, M.; Khan, F.S.; Fu, H. Transformers in medical imaging: A survey. Med. Image Anal. 2023, 88, 102802. [Google Scholar] [CrossRef]
  31. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  32. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
  33. Yu, F.; Wang, D.; Shelhamer, E.; Darrell, T. Deep layer aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2403–2412. [Google Scholar]
  34. Pang, J.; Chen, K.; Shi, J.; Feng, H.; Ouyang, W.; Lin, D. Libra r-cnn: Towards balanced learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 821–830. [Google Scholar]
  35. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  36. Zhou, P.; Ni, B.; Geng, C.; Hu, J.; Xu, Y. Scale-transferrable object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 528–537. [Google Scholar]
  37. Zhao, Q.; Sheng, T.; Wang, Y.; Tang, Z.; Chen, Y.; Cai, L.; Ling, H. M2det: A single-shot object detector based on multi-level feature pyramid network. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 9259–9266. [Google Scholar]
  38. Amirul Islam, M.; Rochan, M.; Bruce, N.D.; Wang, Y. Gated feedback refinement network for dense image labeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3751–3759. [Google Scholar]
  39. Ghiasi, G.; Lin, T.Y.; Le, Q.V. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7036–7045. [Google Scholar]
  40. Liu, S.; Huang, D.; Wang, Y. Learning spatial fusion for single-shot object detection. arXiv 2019, arXiv:1911.09516. [Google Scholar]
  41. Isensee, F.; Jaeger, P.F.; Kohl, S.A.; Petersen, J.; Maier-Hein, K.H. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 2021, 18, 203–211. [Google Scholar] [CrossRef] [PubMed]
  42. Li, Y.; Li, X.; Xie, X.; Shen, L. Deep learning based gastric cancer identification. In Proceedings of the 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), Washington, DC, USA, 4–7 April 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 182–185. [Google Scholar]
  43. Wang, L.; Pan, L.; Wang, H.; Liu, M.; Feng, Z.; Rong, P.; Chen, Z.; Peng, S. DHUnet: Dual-branch hierarchical global–local fusion network for whole slide image segmentation. Biomed. Signal Process. Control. 2023, 85, 104976. [Google Scholar] [CrossRef]
  44. Li, Z.; Tao, R.; Wu, Q.; Li, B. DA-RefineNet: Dual-inputs attention refinenet for whole slide image segmentation. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1918–1925. [Google Scholar]
Figure 1. Workflow diagram of the proposed GLFUnet network.
Figure 2. General framework of the GLFUnet model, consisting of the CNN branch, the Transformer branch, the AtFF module, and skip connections.
Figure 3. The diagram of the Swin Transformer Block.
Figure 4. The diagram of the Attention Feature Fusion (AtFF) module.
Figure 5. Diagram of the pathologic samples from the stomach. (a,b) are Tumor, (c) is Normal.
Figure 6. Visualization of the segmentation results of the gastric cancer dataset.
Figure 7. Diagram of the pathologic samples of the liver. (a,b) are Tumor, (c) is Normal.
Figure 8. Visualization of segmentation results for the liver cancer dataset.
Figure 9. Comparison of line plots of ablation experiments on the liver cancer dataset.
Figure 10. Comparison of line plots of ablation experiments on the gastric cancer dataset.
Table 1. Data for each indicator on the gastric cancer dataset.

| Category | Methods | Dice (%) | mIOU (%) | Acc (%) |
|---|---|---|---|---|
| CNN-based models | U-Net | 83.67 | 72.43 | 88.63 |
| | Res-UNet | 86.81 | 72.08 | 88.92 |
| | FCN | 84.92 | 69.72 | 86.95 |
| | DeeplabV3 | 87.67 | 77.44 | 88.10 |
| | ConvNeXt | 88.05 | 73.75 | 88.50 |
| Transformer-based models | MedT | 87.12 | 72.86 | 88.06 |
| | TransUnet | 90.76 | 79.07 | 91.46 |
| | SwinUnet | 91.21 | 79.84 | 92.25 |
| | TransFuse | 90.28 | 79.72 | 92.16 |
| Ours | GLFUnet | 91.65 | 79.87 | 92.51 |
Table 2. Data for each indicator on the liver cancer dataset.

| Category | Methods | Dice (%) | mIOU (%) | Acc (%) |
|---|---|---|---|---|
| CNN-based models | U-Net | 92.36 | 86.08 | 97.05 |
| | Res-UNet | 91.59 | 84.73 | 96.67 |
| | FCN | 91.81 | 85.23 | 96.74 |
| | ConvNeXt | 92.69 | 86.52 | 97.24 |
| Transformer-based models | MedT | 90.87 | 83.56 | 96.76 |
| | TransUnet | 91.53 | 84.67 | 96.66 |
| | SwinUnet | 91.70 | 84.53 | 97.12 |
| | TransFuse | 90.68 | 82.93 | 96.97 |
| | DHUnet | 92.76 | 86.64 | 97.43 |
| Ours | GLFUnet | 93.36 | 86.93 | 97.51 |
Table 3. Ablation experiments on two datasets.

| Model | Dice (%) Liver | Dice (%) Gastric | mIOU (%) Liver | mIOU (%) Gastric | Acc (%) Liver | Acc (%) Gastric |
|---|---|---|---|---|---|---|
| GLFUnet-AtFF | 92.46 | 89.64 | 85.56 | 78.37 | 96.49 | 91.23 |
| GLFUnet-csel | 93.78 | 91.61 | 86.90 | 79.52 | 96.58 | 92.37 |
| GLFUnet-Concat | 93.73 | 88.81 | 86.81 | 76.85 | 97.53 | 92.05 |
| GLFUnet-Add | 93.41 | 89.90 | 86.74 | 77.93 | 97.31 | 92.13 |
| GLFUnet | 93.86 | 91.65 | 86.93 | 79.87 | 97.56 | 92.51 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
