Attention-Based Two-Branch Hybrid Fusion Network for Medical Image Segmentation

: Accurate segmentation of medical images is vital for disease detection and treatment. Convolutional Neural Networks (CNN) and Transformer models are widely used in medical image segmentation due to their exceptional capabilities in image recognition and segmentation. However, CNNs often lack an understanding of the global context and may lose spatial details of the target, while Transformers struggle with local information processing, leading to reduced geometric detail of the target. To address these issues, this research presents a Global-Local Fusion network model (GLFUnet) based on the U-Net framework and attention mechanisms. The model employs a dual-branch network that utilizes ConvNeXt and Swin Transformer to simultaneously extract multi-level features from pathological images. It enhances ConvNeXt’s local feature extraction with spatial and global attention up-sampling modules, while improving Swin Transformer’s global context dependency with channel attention. The Attention Feature Fusion module and skip connections efficiently merge local detailed and global coarse features from CNN and Transformer branches at various scales. The fused features are then progressively restored to the original image resolution for pixel-level prediction. Comprehensive experiments on datasets of stomach and liver cancer demonstrate GLFUnet’s superior performance and adaptability in medical image segmentation, holding promise for clinical analysis and disease diagnosis.


Introduction
Malignant tumors, commonly referred to as cancer, are serious illnesses that pose the greatest threat to human health, second only to cardiovascular and cerebrovascular conditions [1]. The International Agency for Research on Cancer (IARC), a branch of the World Health Organization, released the latest global cancer statistics for 2020. These figures revealed that there were approximately 19.29 million new cases of cancer worldwide, with cancer causing approximately 9.96 million deaths. The incidence and mortality rates of cancer in China rank among the highest globally [2]. Approximately 4.57 million new cases are reported annually, accounting for 23.7% of the world's total new diagnoses, while over 3 million fatalities are recorded, making up 30.14% of global cancer-related deaths. Nearly half of the patients in China have been diagnosed in the middle and late stages, and even after radical surgery, about 50% of the patients will still have recurrence and metastasis. Thus, the crucial aspect in combating cancer is the timely identification and prompt management of the disease, which can substantially enhance the likelihood of patient survival.
As modern medical practices continue to evolve, the prevalent method for the early detection of cancer involves the examination of histopathologic sections. This technique is considered the gold standard for cancer diagnosis within contemporary medicine [3].
Identifying and analyzing cell nuclei in histopathological sections allows for the detection of serious diseases like cancer during the early clinical stage or even the pre-clinical phase. This allows ample time to take appropriate preventative action. However, traditional pathology analysis is performed by pathologists who examine the morphology of specific tissue cells through a microscope and rely on their own experience to diagnose the sections. The large number of tissue cells contained in a pathology sample, the high degree of cellular similarity, and the small field of view of the microscope mean that fully diagnosing a section takes a great deal of time, which seriously hinders diagnostic efficiency.
The progress in deep learning has led to Convolutional Neural Networks (CNNs) becoming increasingly pivotal for medical image segmentation tasks. The automated feature extraction, robust adaptability, and fast processing speed of convolutional neural networks allow them to efficiently overcome the drawbacks of conventional detection techniques and enhance the precision and stability of medical image segmentation [4,5]. The field of image segmentation was introduced to encoder and decoder architectures by Long et al. [6]. They introduced the Fully Convolutional Network (FCN), pioneering an end-to-end semantic segmentation approach for images and marking the initial deep learning application in this domain. Ronneberger et al. [7] introduced the U-Net architecture, an extension of FCN that facilitates the multi-scale extraction of both superficial and profound image features during the encoding phase. To mitigate the loss of fine-grained details due to consecutive downsampling, skip connections are incorporated to link the encoder and decoder at corresponding stages, enabling the network to harness features at varied levels of granularity. Numerous derivatives of the U-Net model have been developed, capitalizing on its exceptional segmentation capabilities. These models include UNet++ [8], Res-Unet [9], Attention U-Net [10], DenseUNet [11], R2U-Net [12], KiU-Net [13], and UNet 3+ [14]. They are designed specifically for medical image segmentation and achieve strong performance.
Despite the notable success of convolutional neural networks in medical imaging, their inherent inductive bias means each convolutional kernel is limited to processing a local region of the image. This limitation hampers the network's capacity to model extensive context and form long-range dependencies. Some work attempts to model long-range dependencies within convolutional networks, including attention mechanisms [15-17]. However, these methods still have many drawbacks in terms of global context modeling, since they are not tailored to the segmentation of medical images, whereas the Transformer can address these issues. The Transformer architecture [18], originally designed for sequence-to-sequence tasks in Natural Language Processing (NLP), has sparked widespread interest within the Computer Vision (CV) community. Dosovitskiy et al. [19] first introduced the Vision Transformer (ViT) to the realm of computer vision, which reimagined image classification as a sequence modeling challenge, achieving superior outcomes through pre-training on extensive external datasets. Zheng et al. [20] introduced the Segmentation Transformer (SETR), demonstrating that replacing the traditional encoder in encoder-decoder networks with the Transformer achieves exceptional performance in segmentation tasks. The Swin Transformer [21] later integrated a local receptive field through a shifted-window self-attention mechanism, which optimized computational complexity and enhanced both efficiency and precision, surpassing former state-of-the-art methods in dense prediction tasks such as image classification, object detection, and semantic segmentation. TransUnet [22] proposed a U-shaped architecture with a CNN-Transformer hybrid encoder and a multi-stage up-sampling decoder, advancing the model's performance in these domains. In order to capture multi-range interactions, Zhang et al. [23] combined detailed features derived by CNNs with discriminative maps of varying resolutions using a Transformer pyramid, then accessed several receptive fields using an adaptive method to obtain the best segmentation results. Nonetheless, the majority of current research uses Transformer layers in place of convolution or stacks the two consecutively, without taking into account the connection between their channels and locations, which has drawbacks when it comes to capturing fine-grained information.
To address these issues, this research presents a dual-branch hybrid network model (GLFUnet) built upon the U-Net framework, which effectively integrates fine and coarse features across various scales within the CNN and Transformer modules, leveraging attention mechanisms. First, hierarchical features are generated in parallel by the ConvNeXt and Swin Transformer encoders. ConvNeXt's lowest-level features are enhanced with spatial attention. Subsequently, the Global Attention Up-sampling (GAU) module extracts the high-level features' global contextual information, leveraging channel attention to incorporate the Swin Transformer's global insights. The AtFF module then employs hierarchical up-sampling with skip connections to effectively capture both low-level spatial details and high-level semantic context. Lastly, the blended features are progressively restored to the input image's original resolution for pixel-level prediction.
The workflow of the proposed GLFUnet method is presented in Figure 1. To begin with, the gastric and liver cancer datasets undergo data augmentation through various geometric manipulations such as vertical and horizontal flipping, random rotation within a −90° to 90° range, and color jittering. Following this, the images are fed into the GLFUnet model for feature extraction and network training. Subsequently, we use unseen test images for the segmentation process. Ultimately, we assess the proposed model's performance and precision using evaluation metrics such as Dice, mIOU, and Accuracy.
The following are the main contributions of this paper:

•	The advantages of combining different deep learning models to extract global and local information are discussed. We present a dual-branch hierarchical global-local fusion network that integrates CNN and Transformer models for lesion region segmentation in pathology images.

•	To tackle the challenge of indistinct segmentation outcomes from merged features, this paper employs the Attention Feature Fusion (AtFF) module. This module synergizes global and local encoder branch features using an attention mechanism that effectively forges long-range correlations between coarse global and detailed local information, enhancing the extraction of both types of details.

•	By incorporating deep supervision and an additional segmentation head, we methodically restore the merged features to the input image's resolution. This approach mitigates gradient vanishing and accelerates convergence for pixel-level prediction during training.

•	Comprehensive comparative and ablation studies on various segmentation datasets confirm that our model is well suited for histopathology image segmentation. It surpasses the performance of many prevalent segmentation techniques, with each module contributing to enhanced segmentation accuracy.
The structure of the paper is as follows: Section 2 offers an overview of essential concepts, focusing on research related to multi-scale feature integration and CNN-based image segmentation. Section 3 presents the GLFUnet model's architecture, which combines attention mechanisms with algorithmic procedures. Section 4 discusses the experimental findings and their analysis, while Section 5 concludes and provides future perspectives.

Related Work
The pertinent foundations employed in this research are briefly described in this section. First, Section 2.1 introduces image segmentation models based on CNNs and Transformers; then, Section 2.2 describes feature fusion networks.

CNN and Transformer
Convolutional Neural Networks (CNNs) are a cornerstone of deep learning, particularly adept at processing grid-structured data like images. The architecture comprises various layers: convolutional, activation, pooling, and fully connected. In the convolutional layer, an array of trainable filters extracts features from the input by scanning it, thus reducing dimensionality. The activation layer introduces nonlinearity, allowing the network to learn complex patterns. The pooling layer downsamples the data, aiding in computational efficiency and preventing overfitting. The fully connected layer connects every neuron in one layer to every neuron in another, mapping the abstracted features to output classes. CNNs are efficient due to weight sharing and local receptive fields, which minimize the number of parameters. Their multi-layered design captures the hierarchical nature of data, making them pivotal in applications such as medical image segmentation. Notably, architectures like the U-Net encoder-decoder [7] and its offshoots have demonstrated superior performance in this domain. The UNet++ [8] and UNet 3+ [14] networks proposed by Zhou and Huang et al. design a series of nested and dense skip paths to reduce semantic gaps. A unique Attention Gate (AG) mechanism model, proposed by Attention U-Net [10], focuses on targets with varying sizes and forms. In order to enhance retinal blood vessel segmentation performance, ResUNet [9] incorporates a weighted attention method. DenseUNet [11] substitutes dense connections for skip connections in U-Net. The R2U-Net [12] optimizes feature representation by combining the benefits of residual networks. KiU-Net [13] presents a unique design that makes use of incomplete and overcomplete features to enhance the segmentation of tiny anatomical components. UNeXt [24] presents a merger of MLP with UNet, substantially cutting the parameter count without compromising segmentation efficacy. These techniques are all still grounded in CNNs.
An increasing number of transformer-based techniques are showing up in CV tasks, spurred by the Transformer's [18] success in a variety of NLP tasks. Central to the Transformer's operation is the concept of self-attention, a process that enables the model to dynamically weigh the significance of various elements in a sequence relative to one another. This approach enables parallel processing of the full input sequence, circumventing the issue of long-term dependencies that traditional Recurrent Neural Networks (RNNs) struggle with, as well as their inability to process information concurrently. The Transformer architecture also encompasses positional encoding to maintain the sequential order of data, a multi-head attention setup to enhance the model's representation capabilities, and both a feed-forward network and layer normalization to refine the feature extraction process. With its versatile and modular design, the Transformer has not only revolutionized tasks such as language modeling and machine translation but has also been widely adopted in image recognition, generative modeling, and other areas, thus expanding the scope of research and application possibilities. ViT [19] is among the newer vision transformers and initially showed that Transformer-exclusive architectures can reach state-of-the-art performance in image recognition when pre-trained on extensive datasets. To train on the smaller ImageNet-1K dataset, DeiT [25] proposes data-efficient training techniques and knowledge distillation.
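At the heart of this mechanism, for query, key, and value matrices Q, K, and V derived from the input sequence, the scaled dot-product self-attention takes its standard form:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where d_k is the dimensionality of the keys; multi-head attention applies this operation in several learned subspaces in parallel and concatenates the results.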
The Swin Transformer [21], with its hierarchical structure, reduces computational complexity and attains state-of-the-art results in a wide range of tasks through its proposed shifted-window based self-attention. Meanwhile, SETR [20] treats semantic segmentation as a sequence prediction problem by employing the Transformer as an encoding component.
Although the Transformer excels in global context modeling, it struggles with capturing fine-grained details, particularly in medical imaging. Consequently, efforts have been made to integrate the strengths of both CNNs and Transformers. For instance, TransUNet [22] proposed a mixed CNN and Transformer encoder along with a multi-stage upsampling decoder for better global context modeling in medical image segmentation. In contrast, SwinUnet [26] is a mirror-symmetric, U-Net inspired, Transformer-based model for medical image segmentation, featuring Swin Transformer blocks and a patch expanding layer in its design. Additionally, Medical Transformer [27] introduces a transformer architecture equipped with gated axial attention blocks to train efficiently on medical images. Meanwhile, TransFuse [28] employs an innovative dual-branch approach combining ViT with ResNet to enhance global context modeling without losing low-level local detail, demonstrating significant advantages in pathological image segmentation research [29,30].

Feature Fusion Network
Enhancing the performance of image target segmentation requires the fusion of more discriminative multiscale features, and handling and representing multi-scale information well is a primary challenge in this field. The Feature Pyramid Network (FPN) approach by Lin et al. [31] was among the initial studies to construct a feature pyramid, alternately combining adjacent feature levels using lateral connections and top-down pathways to enhance feature representation. Furthering the overall feature hierarchy, Liu et al. [32] added a second bottom-up path aggregation network (PANet) atop the FPN to boost information flow and minimize the gap between the lowest and highest feature levels. To better combine semantic and spatial information, DLA [33] created hierarchical deep aggregation structures and iterative deep aggregation. Aiming to achieve multiscale feature rebalancing, Pang et al. [34] utilized balanced IoU sampling, a balanced feature pyramid, and a balanced L1 loss in the Libra R-CNN framework; these strategies target imbalances at the sample level, feature level, and objective level, respectively. Tan et al. [35] introduced a Weighted Bidirectional Feature Pyramid Network (BiFPN) to facilitate efficient multiscale feature integration. For leveraging cross-scale features, STDL [36] proposed a scale-transfer module, M2Det [37] offered a U-shaped design for fusing multiscale features, and G-FRNet [38] incorporated gate units to manage the exchange of cross-feature information. A neural architecture search approach is used by NAS-FPN [39] to identify a more reliable fusion structure that yields the best single-shot detector. Liu et al. [40] developed an Adaptive Multiscale Feature Fusion Network that derives a spatial fusion coefficient for each feature at every scale location, enhancing the scalability of the features and realizing their adaptive integration, while potentially conflicting (i.e., inconsistent) information is automatically learned or suppressed at fusion time. In this research, we design an intermediate fusion approach that better captures coarse-grained and fine-grained information by using an attention mechanism.

Method
Figure 2 illustrates the comprehensive design of our proposed end-to-end dual-branch hybrid network architecture. The Swin Transformer branch captures global contextual information, whereas the ConvNeXt branch extracts spatial information at various scales. Features of matching resolutions from both branches are combined within the Attentional Feature Fusion (AtFF) module. Here, channel and spatial attention mechanisms harvest global and local feature information, and the global attention up-sampling module selectively blends pertinent data, which is then conveyed through a skip connection to the decoder module (depicted in the green dashed box in Figure 2). The decoder structure produces the segmentation results and loss values after recovering the image's information and corresponding spatial dimensions. Additionally, to maximize the synergy of the two branches, we assign weights to the loss values generated by the three components and merge them for unified training.

ConvNeXt Branch
There are two components to the ConvNeXt branch encoder, displayed in the yellow dashed box in Figure 2. The initial component serves as a stem that pre-processes the original image for feature extraction using a convolutional layer with a 4 × 4 kernel and a stride of 4, followed by a layer normalization (LN) layer to accelerate the training of the network. At this point, the image dimensions change from [H, W, 3] to [H/4, W/4, C]. The second part contains 4 stages; the block count of each stage is changed from (3, 3, 9, 3) to (3, 3, 3, 3), and a downsampling operation is performed between every two stages, which expands the receptive fields of the different layers of ConvNeXt and enables each neuron to perceive a wider area of the pathology image. This helps ConvNeXt learn a larger range of contextual information and improves understanding of the overall structure. Every downsampling operation employs a convolutional layer with a 2 × 2 kernel and a stride of 2, preceded by a layer normalization (LN) layer to prevent gradient explosion. The AtFF module receives the feature maps extracted at each stage; the output feature map dimensions are [H/4, W/4, C], [H/8, W/8, 2C], [H/16, W/16, 4C], and [H/32, W/32, 8C]. After extensive experimentation, the hyperparameter C is set to 96.

The ConvNeXt block itself has three layers. The first layer is a 7 × 7 depthwise separable convolution followed by a layer normalization (LN) layer, with 96 output channels. The second layer consists of a 1 × 1 convolution followed by a GELU (Gaussian Error Linear Unit) activation function; this layer performs only channel fusion, mapping the channels from C = 96 to 384. The third layer is a 1 × 1 convolution that maps the 384 channels from the second layer back to 96. Finally, a residual connection sums the block's input and output.
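A minimal PyTorch sketch of this block (channel width 96 for the first stage; module and variable names are illustrative, not the authors' code) could look like:

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Sketch of the block described above: 7x7 depthwise conv -> LN -> 1x1 conv (expand 4x) -> GELU -> 1x1 conv (project back) -> residual."""
    def __init__(self, dim: int = 96):
        super().__init__()
        # 7x7 depthwise convolution (groups=dim makes it depthwise)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)            # normalization over the channel dimension
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # 1x1 conv as a linear layer: 96 -> 384
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)   # 1x1 conv: 384 -> 96

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x                              # [B, C, H, W]
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                 # [B, H, W, C] so LayerNorm/Linear act on channels
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                 # back to [B, C, H, W]
        return shortcut + x                       # residual connection

# Example: a 224x224 input passed through the 4x4 stride-4 stem gives a 56x56 map with 96 channels
feat = torch.randn(1, 96, 56, 56)
out = ConvNeXtBlock(96)(feat)                     # same shape as the input
```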

Swin Transformer Branch
As shown in the blue dashed box in Figure 2, the Swin Transformer branch is constructed in four stages, with the number of Swin Transformer Blocks changed from (2, 2, 6, 2) to (2, 2, 2, 2) across the stages. An input image x ∈ R^(H × W × 3) has spatial resolution H × W and three channels. In the patch partitioning module, the image for the Swin Transformer branch is segmented into non-overlapping patches of size 4 × 4; consequently, the image dimensions change from [H, W, 3] to [H/4, W/4, 48]. In stage 1, the channel data of each patch are mapped to a high-dimensional space by a linear transformation in the Patch Embedding layer, establishing the positional relationship between the image patches. This alteration resizes the feature map from [H/4, W/4, 48] to [H/4, W/4, C], which is then provided to the Swin Transformer Block. The subsequent stages 2 through 4 replicate this process, except that beforehand the feature map undergoes Patch Merging, which modifies its dimensions, e.g., from [H/4, W/4, C] to [H/8, W/8, 2C], prior to being input into the Swin Transformer Block. The feature output of each stage is directed to the AtFF module for integration.
The Swin Transformer Block is composed of two LayerNorm (LN) layers, a Window-based Multi-head Self-Attention (W-MSA) layer, residual connections, and a two-layer Multilayer Perceptron (MLP) unit. The detailed module configuration is depicted in Figure 3, where the input vector is initially processed by an LN layer. Subsequently, the input vectors are channeled into the W-MSA module, which employs a shifted-window approach that significantly decreases the model's computational complexity. Finally, the output passes through the residual structure before moving to the next LN layer. LayerNorm (LN) layers are applied before every MSA module and MLP, with residual connections following each MSA and MLP. This shifted-window mechanism dictates that two successive Swin Transformer Blocks are calculated as in Equations (1)-(4).
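Following the standard Swin Transformer formulation, the two consecutive blocks (with regular and shifted windows) are computed as:

$$\hat{z}^{l} = \text{W-MSA}\big(\text{LN}(z^{l-1})\big) + z^{l-1}, \quad (1)$$
$$z^{l} = \text{MLP}\big(\text{LN}(\hat{z}^{l})\big) + \hat{z}^{l}, \quad (2)$$
$$\hat{z}^{l+1} = \text{SW-MSA}\big(\text{LN}(z^{l})\big) + z^{l}, \quad (3)$$
$$z^{l+1} = \text{MLP}\big(\text{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1}, \quad (4)$$

where ẑ^l and z^l denote the outputs of the (shifted) window attention module and the MLP of block l, respectively, and SW-MSA is the shifted-window multi-head self-attention.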

Attentional Feature Fusion Module
We introduce an innovative Attention Feature Fusion (AtFF) module, illustrated in Figure 4, which combines an attention mechanism with a multimodal fusion mechanism to effectively blend the hierarchical features from the CNN and Transformer branches. To counteract the spatial information loss caused by downsampling, we employ a dual-branch skip link (depicted in red) to integrate the high-level global context with the low-level precise details from both the global and local encoders. The AtFF module is linked to the preceding fusion module through an additive process, enabling the fusion of local and global features at the same stage, as observed in Figure 4. The skip connection module (SCM) process can therefore be expressed as Equations (5)-(8), where GAU stands for the Global Attention Up-sampling module, ChannelAttention and SpatialAttention indicate the channel attention and spatial attention computations, t_i and r_i are the features of the Transformer branch and CNN branch in stage i, respectively, and l_i is the output of the AtFF module.

To improve the feature representation capability, the Transformer branch features use channel attention to determine which parts of the feature map carry important global information; channels carrying less crucial information receive diminished focus. The SE-Block is employed to enact channel attention, thereby enhancing the dissemination of global information from the Transformer branch. The SE-Block is composed of two components, Squeeze and Excitation. After collecting the feature maps (t) via a global average pooling operation, the Squeeze operation compresses each feature map, producing a 1 × 1 × C sequence of real values; the specific calculation is shown in Equation (9). To accurately model channel dependencies, the subsequent Excitation operation utilizes the information obtained from the Squeeze stage. We employ a straightforward gating mechanism with sigmoid activation; the specific computations are given in Equations (10) and (11), where σ represents the sigmoid activation function, δ represents the ReLU activation function, and F_scale is the channel-wise product.

For the convolutional neural network (CNN) branch, a CBAM (Convolutional Block Attention Module) block serves as a spatial filter for the lowest-scale features. It enhances local details and suppresses irrelevant regions to address the issue of noisy low-level CNN features. The feature maps are transformed from C × H × W to C × 1 × 1 by passing them through two parallel MaxPool and AvgPool layers. Subsequently, an MLP module generates channel attention maps, and the ReLU activation function yields two activated results. Following an element-wise summation of the two outputs, a sigmoid activation function is applied to derive the channel attention output. This output is then subjected to maximum pooling and average pooling operations, producing two 1 × H × W feature maps. These maps are concatenated, converted into a single-channel feature map via a 7 × 7 convolution, and finally a sigmoid function is used to obtain the spatial attention feature map. The specific computations are detailed in Equations (12) and (13), where σ represents the sigmoid activation function and f^(7×7) signifies a convolutional layer with a 7 × 7 kernel.

The Global Attention Up-sampling (GAU) module subsequently extracts the global contextual information of the high-level features and uses it to guide the weighted computation of the low-level features through a global pooling operation. A 3 × 3 convolution is applied to the low-level features to reduce the number of channels in the CNN feature map. The high-level features generate a global context vector, which is processed by a 1 × 1 convolution with batch normalization and ReLU nonlinearity and then multiplied with the low-level features. Ultimately, the high-level features are added to the weighted low-level features, which then undergo a progressive upsampling process, where Conv is a 3 × 3 convolutional layer, δ denotes the ReLU activation function, and GAP denotes global average pooling.

Finally, during upsampling, l_scm either doubles in resolution with a 50% reduction in channels or quadruples in resolution while keeping the channel count, by employing a cross-scale expansion layer instead of the downsampling and patch merging layers. The cross-scale expansion layer within AtFF_1 amplifies the resolution of the input features fourfold (h_i × w_i × dim_i → 4h_i × 4w_i × dim_i), whereas within AtFF_2,3,4 the input features are transformed from h_i × w_i × dim_i to 2h_i × 2w_i × dim_i/2. The proposed cross-scale expansion layer (csel) uses T transposed convolutions in total; we set T to 4 in AtFF_1 and T to 2 in AtFF_2,3,4. Given that AtFF_4 serves as the initial upsampling block, it lacks input from any preceding module, while the calculations of the AtFF_2,3 modules and the AtFF_1 module are given in Equations (17) and (18), respectively, where SCM is the skip connection, t_i and r_i are the features of the Transformer and CNN branches in stage i, respectively, and l_i is the result of AtFF_i. Following the output from the AtFF_1 module, a linear projection (LP) layer is applied to accomplish pixel-level segmentation of the image x.
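As an illustration of how the three attention components described above fit together, a compact PyTorch sketch (module names are ours, not the authors' code) might read:

```python
import torch
import torch.nn as nn

class ChannelAttentionSE(nn.Module):
    """SE-style channel attention: squeeze (global average pooling) then excitation (sigmoid gating)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = t.shape
        s = self.fc(t.mean(dim=(2, 3)))           # squeeze to a length-C vector, then excitation
        return t * s.view(b, c, 1, 1)             # channel-wise rescaling (F_scale)

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: channel-wise max/avg pooling, concat, 7x7 conv, sigmoid."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, r: torch.Tensor) -> torch.Tensor:
        mx, _ = r.max(dim=1, keepdim=True)         # 1 x H x W map of channel maxima
        avg = r.mean(dim=1, keepdim=True)          # 1 x H x W map of channel means
        attn = torch.sigmoid(self.conv(torch.cat([mx, avg], dim=1)))
        return r * attn

class GAU(nn.Module):
    """Global Attention Up-sampling: a high-level global context reweights the low-level features."""
    def __init__(self, low_ch: int, high_ch: int, out_ch: int):
        super().__init__()
        self.low_conv = nn.Conv2d(low_ch, out_ch, kernel_size=3, padding=1)   # 3x3 conv on low-level features
        self.high_gate = nn.Sequential(                                       # GAP -> 1x1 conv -> BN -> ReLU
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(high_ch, out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )
        self.high_proj = nn.Conv2d(high_ch, out_ch, kernel_size=1)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # assumes the low-level map has twice the spatial resolution of the high-level map
        weighted_low = self.low_conv(low) * self.high_gate(high)              # low-level features weighted by global context
        return weighted_low + self.up(self.high_proj(high))                   # add the upsampled high-level features
```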
The complete process of GLFUnet is shown in Algorithm 1:
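As a rough PyTorch-style outline of this process (names such as convnext_branch, swin_branch, atff, and seg_head are placeholders, not the authors' API), one training step could be sketched as:

```python
def train_step(model, batch, criterion, optimizer, weights=(0.5, 0.3, 0.2)):
    """One hypothetical GLFUnet training step: dual-branch encoding, AtFF fusion, deep supervision."""
    x, mask = batch                                  # image and ground-truth mask
    alpha, beta, gamma = weights                     # loss weights reported in Section 4

    r = model.convnext_branch(x)                     # CNN features r_1..r_4 (local details)
    t = model.swin_branch(x)                         # Transformer features t_1..t_4 (global context)

    fused, aux = None, {}
    for i in (4, 3, 2, 1):                           # fuse from the deepest stage upwards
        fused = model.atff[i](r[i], t[i], prev=fused)  # AtFF_4 has no preceding input (prev=None)
        if i in (4, 1):
            aux[i] = model.seg_head[i](fused)        # deeply supervised predictions at AtFF_4 and AtFF_1

    pred = model.final_head(fused)                   # pixel-level prediction at input resolution
    loss = (alpha * criterion(pred, mask)            # assumed assignment of the alpha/beta/gamma weights
            + beta * criterion(aux[1], mask)
            + gamma * criterion(aux[4], mask))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```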


Loss Function
The complete network undergoes end-to-end training using a combination of weighted dice loss and weighted binary cross-entropy loss, denoted as L = L_dice^w + L_bce^w. A simple segmentation head that directly restores the input feature map to the original resolution generates the segmentation predictions. To address the issues of slow convergence and gradient vanishing, we incorporate a deeply supervised approach, which includes supervision of both AtFF_1 and AtFF_4. The overall training loss function is detailed in Equation (20), where G is the ground truth and α, β, and γ are adjustable hyperparameters.
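A minimal sketch of such a combined loss (assuming a binary segmentation setting; the weighting of the three supervised outputs follows the α, β, γ values reported in Section 4 and is our reading, not a verbatim reproduction of Equation (20)):

```python
import torch
import torch.nn.functional as F

def dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft dice loss for binary masks (logits and target of shape [B, 1, H, W], target in {0, 1})."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

def seg_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Combined dice + binary cross-entropy loss applied to each supervised output."""
    return dice_loss(logits, target) + F.binary_cross_entropy_with_logits(logits, target)

def total_loss(pred, pred_atff1, pred_atff4, target, alpha=0.5, beta=0.3, gamma=0.2):
    """Deeply supervised objective: weighted sum over the final head and the AtFF_1 / AtFF_4 heads."""
    return (alpha * seg_loss(pred, target)
            + beta * seg_loss(pred_atff1, target)
            + gamma * seg_loss(pred_atff4, target))
```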

Implementation Details
The experiments are implemented with the PyTorch deep learning framework. To enhance data variability, all samples undergo random rotations within [−90°, 90°], vertical flipping, and horizontal flipping at the outset of the experiment. The model is trained with mini-batch stochastic optimization. The experimental research in this paper was carried out using Python 3.8 and PyTorch 1.11.0. For five-fold cross-validation, the datasets were randomly split into training and test sets at an 8:2 ratio. The model is trained for a total of 20 epochs with a batch size of 64. Following comparative tests, the optimal values of the parameters α, β, and γ were determined to be 0.5, 0.3, and 0.2, respectively. An Adam optimizer with a learning rate of 5 × 10−4 is used. The initial parameter choices were informed by the existing literature [28,41]. The hardware platform for these experiments was a PC equipped with an Intel(R) Core(TM) i7-10700KF @ 3.80 GHz processor, an NVIDIA GeForce RTX 3090 graphics card, 24 GB of memory, and the Windows 10 Professional operating system. The development environment was JetBrains PyCharm 2021.2 Professional Edition.
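For reference, the augmentation and optimizer settings listed above map naturally onto standard PyTorch/torchvision calls (a sketch under the assumption that torchvision transforms are used; the authors' exact pipeline is not shown):

```python
import torch
from torchvision import transforms

# Geometric augmentations described in the text: random rotation in [-90°, 90°] and random flips
train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=90),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.ToTensor(),
])

# Optimizer settings reported in the text: Adam with a learning rate of 5e-4
def make_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    return torch.optim.Adam(model.parameters(), lr=5e-4)

# Training schedule reported in the text
BATCH_SIZE = 64
MAX_EPOCHS = 20
```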

Evaluation Metrics
Since a medical image segmentation approach is presented in this research, the model's accuracy is assessed using standard image segmentation criteria. The segmentation efficacy is gauged with evaluation metrics such as the Dice Similarity Coefficient (DSC), Mean Intersection over Union (mIoU), and Accuracy (Acc) to provide an impartial analysis of the method's effectiveness.
Dice is the similarity coefficient, which shows how similar the actual target region and the predicted target region are. The aggregate similarity coefficient over all test outcomes in the test set is reported as Dice, with its mathematical expression provided in Equation (21).
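In terms of the pixel-level true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) defined in the next paragraph, the three metrics referenced as Equations (21)-(23) take their standard forms:

$$\mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN}, \qquad \mathrm{IoU} = \frac{TP}{TP + FP + FN}, \qquad \mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN},$$

with mIoU obtained by averaging the IoU over classes and over the test set.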
The Intersection over Union (IoU) coefficient illustrates the proportion of overlap between the predicted and ground-truth sets relative to their union. The mean of the IoU coefficients over all test outcomes in the test set is defined as mIoU, represented by the formula in Equation (22). Acc is Accuracy: the ratio of correctly identified pixels to total pixels; the efficacy of the classification improves with an increased accuracy rate. The calculation of Acc is given in Equation (23), where TP, TN, FP, and FN signify the counts of correctly identified positives, correctly identified negatives, erroneously identified positives, and erroneously identified negatives among the image pixels, respectively.

The GLFUnet medical image segmentation approach presented in this study is contrasted with various methods, including U-Net, ResUNet, FCN, DeeplabV3, ConvNeXt, Medical Transformer, TransUnet, SwinUnet, and TransFuse, in order to more thoroughly test the efficacy of this model. Table 1 displays the Dice, mIoU, and Acc statistics for GLFUnet on the gastric cancer dataset as well as for the nine other segmentation models. ConvNeXt performs better than the other CNN-based models, as evidenced by its testing results of 88.05%, 73.75%, and 88.50% on Dice, mIoU, and Acc, respectively. Among the Transformer-based models, SwinUnet performs the best, reaching 91.21%, 79.84%, and 92.25% for Dice, mIoU, and Acc, respectively. Our proposed GLFUnet performs better than ConvNeXt and SwinUnet, two current state-of-the-art methods, achieving 91.65%, 79.87%, and 92.51%, respectively. As a result, GLFUnet performs the best on the gastric cancer dataset. The improvement is 0.44%, 0.03%, and 0.26%, respectively, compared with the best Transformer-based model SwinUnet, and 3.60%, 6.12%, and 4.01%, respectively, compared with the best CNN-based model ConvNeXt.
To provide a clearer comparison, Figure 6 presents a visual contrast of selected segmentation outcomes for gastric cancer pathology images. It is evident that the Transformer-based architectures emphasize global contextual information extraction and excel in long-range relationship modeling, yielding sharper edges than the results from the CNN-based models. GLFUnet's segmentation outputs, resulting from the integration of Swin Transformer and ConvNeXt via the AtFF module, are more closely aligned with the annotated mask images, further substantiating the efficacy of our suggested approach.


Liver Cancer Dataset
Images of liver cancer from five patients were obtained from the Third Xiangya Hospital of Central South University in China [43]. Qualified pathologists carefully marked the tumor and healthy tissue regions in each sample, whose resolutions are approximately 40,000~60,000 × 30,000~50,000 pixels. To ensure a smoother overall segmentation outcome, overlapping was allowed between adjacent patches, set at half the patch size both horizontally and vertically [43,44]. The selection of the sliding window size also considered the number of samples and the resolution of the pathological images. Employing a sliding window with dimensions of 448 × 448 and a stride of 224, the liver samples were sectioned into small patches. Subsequently, based on the model's input size, the patches were resized to 224 × 224. Figure 7 illustrates a representative liver pathology sample diagram. For experimental purposes, the dataset used in this study was divided into training and test sets at an 8:2 ratio.
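A simple sketch of this tiling scheme (overlapping 448 × 448 windows with a stride of 224, then resizing for the model input; array and function names are illustrative):

```python
import numpy as np
from PIL import Image

def tile_slide(slide: np.ndarray, patch: int = 448, stride: int = 224, out_size: int = 224):
    """Cut a whole-slide image (uint8 H x W x 3 array) into overlapping patches and resize them."""
    patches = []
    h, w = slide.shape[:2]
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            crop = slide[y:y + patch, x:x + patch]                      # 448 x 448 tile with 50% overlap
            crop = np.asarray(Image.fromarray(crop).resize((out_size, out_size)))
            patches.append(((y, x), crop))                              # keep the location for re-stitching
    return patches
```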
The GLFUnet medical image segmentation approach is also contrasted with various methods on this dataset, including U-Net, ResUNet, FCN, ConvNeXt, Medical Transformer, TransUnet, SwinUnet, TransFuse, and DHUnet, in order to more thoroughly test the efficacy of the model. Table 2 presents the Dice, mIoU, and Acc scores for GLFUnet on the liver cancer dataset, alongside those of the nine other segmentation models. ConvNeXt performs better than the other CNN-based models, as evidenced by its experimental results of 92.69%, 86.52%, and 97.24% on Dice, mIoU, and Acc, respectively. Our proposed GLFUnet model achieves 93.36%, 86.93%, and 97.51% for Dice, mIoU, and Acc, respectively. Among the Transformer-based models considered, DHUnet achieves the top scores of 92.76%, 86.64%, and 97.43%. Compared with the leading Transformer-based model DHUnet, GLFUnet sees improvements of 0.60%, 0.29%, and 0.08%, respectively, while it improves by 0.67%, 0.41%, and 0.27% over the best CNN-based model ConvNeXt. Consequently, GLFUnet yields the highest performance on the liver cancer dataset. For a more intuitive comparison, Figure 8 visualizes selected segmentation results on the liver cancer dataset.

Ablation Experiments
In this study, a suite of controlled ablation tests was carried out on the segmentation datasets to examine the influence of various components on model performance. This included disabling the cross-scale expansion layer (csel) and the Attentional Feature Fusion (AtFF) module within the model. Additionally, the fusion technique employed in the model was substituted with Concat and Add fusion methods to validate the effectiveness of the fusion strategy used herein. The data from the ablation experiments in Table 3 indicate that the model performs better with the AtFF module than without it, underscoring the significance of the attention-based fusion module for refining medical image segmentation accuracy. Line graphs illustrating the outcomes are displayed in Figures 9 and 10, where the x-axis represents different module replacements and the y-axis denotes the metric percentages. The metric heights in the line charts show that the full GLFUnet model achieves the best segmentation efficacy.


Discussion
In the current study, we have implemented GLFUnet, a sophisticated deep neural network architecture, for the task of segmenting pathological images. This model synergistically combines the spatial information extraction prowess of the ConvNeXt branch with the adept global contextual information assimilation of the Swin Transformer branch. The efficacy of the model is further augmented by the concerted effort of the attentional feature fusion module and the decoder module. Following this integration, we meticulously evaluated the segmentation performance of our proposed network. Comprehensive experimental results from two distinct datasets, one specific to gastric cancer and the other to liver cancer, showcased varying complexities in terms of cancer type, density, and target size. These trials affirmed the commendable performance and adaptability of GLFUnet within the realm of medical image segmentation. In addition, we pitted GLFUnet against a roster of widely recognized segmentation models, including U-Net, ResUNet, FCN, ConvNeXt, MedT, TransUnet, SwinUnet, and TransFuse. Our comparative analysis revealed that GLFUnet not only outperforms these models but also significantly boosts the accuracy and efficiency of medical image segmentation tasks.
A noteworthy advantage of GLFUnet lies in its innovative dual-branch hybrid network design, which effectively surmounts the challenges faced by Convolutional Neural Networks (CNNs) in grasping global context and the relative limitation of Transformer models in processing fine-grained local details. Furthermore, the integration of the Attentional Feature Fusion (AtFF) module represents another stride forward, as it introduces an attention mechanism to sharpen the focus on critical features and selectively merges valuable information from features with corresponding resolutions across branches.
Despite its robustness, GLFUnet encounters certain limitations inherent to its design: the autonomous encoding of each branch leads to increased computational demands and a higher number of parameters that require learning, contrasting with the more streamlined nature of single-branch networks. It is well established that hyperparameters play a crucial role in dictating the performance of deep neural networks, including variables such as the number of hidden layer nodes, mini-batch size, learning rate, weight decay, and the number of training epochs. However, in this study, we confined ourselves to using default or moderate values for these hyperparameters without engaging in exhaustive tuning. This implies that there is substantial room for enhancing our results by means of hyperparameter optimization and the selection of more optimal values.

Conclusions
This research introduces the GLFUnet model for medical image segmentation, which builds upon the U-Net framework. It integrates features derived from both the Swin Transformer and ConvNeXt encoders in parallel, utilizing an attentional feature fusion module. The model effectively captures spatial and semantic information through skip connections. GLFUnet is adept at extracting both the global and local detailed information crucial for pathology segmentation. Evaluations on stomach and liver cancer datasets demonstrate GLFUnet's superior performance and adaptability against other segmentation models. Ablation tests confirm the efficacy of each introduced component. However, the integration of the attention mechanism increases computational demands and prolongs processing time. Future work will focus on refining the model's architecture to reduce complexity and streamline operations. Additionally, various CNN and Transformer combinations will be explored to enhance segmentation efficiency and precision.

Figure 1. Workflow diagram of the proposed GLFUnet network.


Figure 2. General framework of the GLFUnet model, consisting of the CNN branch, the Transformer branch, the AtFF module, and skip connections.


Algorithm 1: Attention-based two-branch hybrid fusion network for medical image segmentation.
Input: the image to be segmented x, the training batch size, learning rate, momentum, max epoch.
Output: trained network model.
1: for epoch in max epoch do
2:   r_i = ConvNeXt(x)
3:   t_i = Transformer(x)
…
15:  Dice, Jaccard, Acc = …
16: end for

Figure 5. Diagram of the pathologic samples from the stomach. (a,b) are Tumor, (c) is Normal.

Figure 6. Visualization of the segmentation results of the gastric cancer dataset.


Figure 7. Diagram of the pathologic samples of the liver. (a,b) are Tumor, (c) is Normal.


Figure 8. Visualization of segmentation results for the liver cancer dataset.


Figure 9. Comparison of line plots of ablation experiments on the liver cancer dataset.

Figure 10. Comparison of line plots of ablation experiments on the gastric cancer dataset.




Table 1. Data for each indicator on the gastric cancer dataset.


Table 2. Data for each indicator on the liver cancer dataset.

Table 3. Ablation experiments on two datasets.
