CCFNet: Collaborative Cross-Fusion Network for Medical Image Segmentation

Abstract: The Transformer architecture has gained widespread acceptance in image segmentation. However, it sacrifices local feature details and requires extensive training data, which hinders its integration into computer-aided medical image segmentation. To address these challenges, we introduce CCFNet, a collaborative cross-fusion network that continuously and interactively fuses a CNN and a Transformer to exploit context dependencies. In particular, when integrating CNN features into the Transformer, the correlations between local and global tokens are adaptively fused through collaborative self-attention fusion to minimize the semantic disparity between these two types of features. When integrating Transformer features into the CNN, a spatial feature injector reduces the spatial information gap between features caused by the asymmetry of the extracted features. In addition, CCFNet runs the Transformer and the CNN in parallel and independently encodes hierarchical global and local representations while aggregating the different features, which preserves both global representations and local features. Experimental results on two public medical image segmentation datasets show that our approach performs competitively with current state-of-the-art methods.


Introduction
Accurate image segmentation can more clearly identify changes in anatomic or pathologic structure in medical images [1], which is crucial in various computer-aided diagnostic applications, including lesion contouring, surgical planning, and three-dimensional reconstruction. Medical image segmentation can detect and locate the boundaries of lesions in an image, thus helping to quickly identify the potential existence of tumors and cancerous regions, which may help clinicians save diagnosis time and improve the chance of finding tumors [2]. Traditionally, the symmetric encoder-decoder structure has been the standard in medical image segmentation, and U-Net [3] has become the benchmark of choice among its variants.
The U-Net model is built from convolutions, whose fundamental operation, the convolution operator, has two characteristics, weight sharing and local connection, which ensure the affine invariance of the model. Although these characteristics help to create effective and versatile medical imaging systems, they still require additional improvements to aid clinicians in early disease diagnosis [4]. Researchers have proposed various methods to add global context to convolutional neural networks (CNNs); influential examples include introducing an attention mechanism [5][6][7][8] and expanding convolution kernels [9][10][11] to extend the receptive field. However, the locality of the convolutional layer's receptive field prevents networks from using long-range semantic dependencies effectively, so their learning ability is limited to a relatively small area and they fail to fully explore object-level information. This is especially true for organs whose texture, shape, and size vary considerably between patients, for which they typically yield weaker performance.
Transformer [12] has shown high performance in language learning tasks, and the attention-based model has emerged as an appealing solution because of its ability to efficiently handle very long sequence dependencies, adapting to various vision tasks. Recent research has demonstrated the ability of Transformer modules to entirely take over the role of conventional convolutions by manipulating sequences of picture patches; the most representative of these is Vision Transformer (ViT) [13]. There have been many works proving that the ViT model can promote the development of many computer vision tasks, including semantic segmentation [14], object detection [15], and image classification [13], among others. The accomplishments of ViT in processing natural images have captivated the medical field. In response, researchers are delving into the capabilities of Transformer for medical image segmentation to address the intrinsic receptive field limitations of CNNs and make them suitable for medical imaging applications [16][17][18][19].
However, the performance of Transformer-based models largely depends on pre-training [20,21]. There are two problems with Transformer-based models' pre-training process. First, there is a scarcity of comprehensive and widely accepted large datasets for pre-training in the medical imaging field due to the extensive time professionals need to dedicate to annotating medical images (in contrast, ImageNet [22], a large dataset, is available for pre-training natural scene images). Second, pre-training consumes much time and many computing resources. Moreover, using large natural image datasets for pre-training medical image segmentation models is difficult because of the domain difference between medical and natural pictures. In addition, there are also some open challenges in different types of medical images. For example, when Swin UNETR [23] is pre-trained on a CT dataset and subsequently applied to different medical imaging modalities like MRI, its performance tends to decline. This is attributed to the significant differences in regional characteristics between CT and MRI images [24].
Fully exploiting CNNs' and Transformer's respective advantages to effectively integrate fine-grained and coarse-grained information in images, thereby boosting the precision and performance in deep learning models, has become a research direction researchers are actively working on. In Figure 1, we summarize various medical image segmentation methods that utilize a combined CNN and Transformer hybrid architecture. As shown in Figure 1a, researchers have incorporated Transformer into models with CNN as the backbone in different ways, either by adding them or replacing certain architectural components to create a network that combines Transformer and a CNN in a serial or embedded fashion. However, this strategy only uses stacking to fuse fine-grained and coarse-grained features, which may reduce the fusion effect and not fully leverage the synergistic capabilities of both network types. Figure 1b,c illustrate parallel frameworks of a CNN and Transformer, extracting distinct feature information from both structures and merging them multiple times before passing them to the decoder for decoding. In Figure 1b, additional branches are used to fuse the CNN and Transformer branches, but this introduces network overhead and lacks effective interaction. For example, TransFuse [25] uses a branch composed of BiFusion modules to simultaneously utilize the different characteristics of the CNN branch and the Transformer branch during the feature extraction process, alleviating the problem of ignoring intermediate features due to serial connections. However, its upsampling method struggles to effectively restore the mid-layer information, leading to a loss of detail.
In Figure 1c, each layer extracts features independently with the CNN and the Transformer, then adds and merges them as the input of the next CNN or Transformer layer. However, due to the semantic differences between Transformer and CNN features, this strategy limits the effectiveness of the fusion. For example, both HiFormer [26] and CiT-Net [27] use CNN and Transformer branches to extract features, respectively, but HiFormer only implements one-way feature fusion from the CNN to the Transformer, while CiT-Net only fuses the two features through a single fusion method and sends them to different branches. Both ignore the different characteristics of the two branches.
In our study, we present the collaborative cross-fusion network (CCFNet), designed for medical image segmentation. CCFNet's encoder consists of two parts: first, in the shallow layers of the encoder, a CNN is employed to acquire convolutional depth features, compensating for detail lost in upsampling; second, in the deep encoder layers, a collaborative cross-fusion of the Transformer and the CNN is applied, which simultaneously enhances the local and global representations of the image. As shown in Figure 1d, compared with other combination strategies, CCFNet achieves close information interaction between the CNN and the Transformer while continuously modeling local and global representations. By continuously aggregating hierarchical global and local representations and exchanging information between them, collaborative cross-fusion achieves closer interaction and more thorough feature fusion, which benefits CCFNet in medical image segmentation tasks.
In CCFNet, high-resolution features are processed with depthwise convolutions to extract fine-grained local information. Since low-resolution features contain more global information (location and semantic information), feature prediction can fuse long-distance global information, and the self-attention mechanism facilitates the capture of deep information [28]. CCFNet processes low-resolution features through a parallel fusion of the CNN and Transformer within the collaborative cross-fusion module (CCFM). This method capitalizes on the self-attention mechanism's robust long-range dependency modeling to ensure accurate medical image segmentation. Considering the complementarity of the two network features, the CCFM sequentially delivers global context from the Transformer branch to the CNN feature maps through the spatial feature injector (SFI) block. This integration significantly boosts the global perceptual capabilities of the CNN branch. Likewise, the collaborative self-attention fusion (CSF) block progressively reintroduces the local features from the CNN branch back into the Transformer, enriching the local detail and creating a dynamic interplay of fused features. In this way, local-global feature complementarity is achieved and the network's feature-encoding capabilities are enhanced. The experiments utilize Synapse, an openly accessible dataset for multi-organ medical image segmentation. When the average Dice similarity coefficient (DSC) scores are compared with those from other hybrid models, our proposed CCFNet shows improved accuracy in organ segmentation. This paper's important contributions are summarized below:

• In the collaborative cross-fusion module, the CSF block is designed to adaptively fuse the correlation between the local tokens and the global tokens and reorganize the two features to introduce the convolution-specific inductive bias into the Transformer. The spatial feature injector block is designed to reduce the spatial information gap between local and global features, avoiding the asymmetry of extracted features and introducing the global information of the Transformer into the CNN.

• On two publicly accessible medical image segmentation datasets, CCFNet outperforms other competitive segmentation models, validating its effectiveness and superiority.

Related Work

CNN
The first widely adopted CNN model for medical image segmentation was U-Net [3], named for its U-shaped network structure, which consisted of a symmetrical encoder-decoder network composed of convolution, downsampling, upsampling, and stitching operations. U-Net employed an encoder to capture global contextual information by downsampling feature representations. Conversely, the decoder restored these features to the original input resolution for semantic prediction through upsampling. Additionally, skip connections linked outputs from the encoder with corresponding decoder layers at the same resolution, helping to recover spatial details lost in downsampling and enhancing detail preservation. V-Net [29] was developed specifically for 3D image segmentation, focusing on direct, end-to-end training to facilitate volume segmentation in MRI scans of the prostate.
Most U-Net variants were based on the residual block [30], dense block [31], and attention mechanism [5][6][7] to improve the network's segmentation capabilities. The concepts behind dense and residual blocks can increase the rate of feature reuse and alleviate the problem of vanishing gradients. Dense-UNet [32] and Res-UNet [33], inspired by dense and residual connections, respectively, adapted the U-Net structure by replacing its sub-modules with versions that incorporate these connections, enhancing the network's architecture. UNet++ [34] added the dense block and convolutional layers between its decoder and encoder, aimed at narrowing the semantic differences in its processed features to improve segmentation accuracy. MultiResUNet [35] combined the MultiRes module and U-Net, in which the MultiRes module extends the concept of residual connections. In this configuration, feature maps produced by three 3 × 3 convolutions were spliced together as a combined feature map and then added to the outcome of a 1 × 1 convolution performed on the input feature map.
An attention mechanism in the network enables it to concentrate on vital sections while suppressing irrelevant input to improve processing efficiency. AttnUNet [8] introduced an attention mechanism into U-Net, adjusting encoder output features before they are concatenated with matching decoder features at each resolution level. This attention setup generates a gating signal that modulates feature importance at different spatial locations. FocusNet [36] employed a dual encoder-decoder architecture, leveraging attention gating to transfer relevant features from the decoder of one U-Net to the encoder of another. This mechanism enhances the propagation of features throughout the network.
Beyond changes to the network structure, the authors of nnUNet [37] asserted that further gains could be achieved by gaining deeper insight into the data, using training techniques tailored to medical data, and implementing suitable preprocessing. They proposed a robust U-Net-based adaptive framework (including 2D and 3D frameworks) that automatically adjusted itself to preprocess data and selected the best network architecture for the task without human intervention.

Transformer
The groundbreaking Transformer architecture [12] sparked a paradigm shift in natural language processing and swiftly established itself as a commonly used foundational model for visual comprehension tasks. The Transformer performs well on vision tasks because of its ability to model global context, but it has the inherent drawback of not exploiting local spatial context in images as a CNN does. Recent works have evolved toward possible solutions to overcome this drawback, and most extant studies in medical image segmentation employ CNN-Transformer hybrid models for feature processing.
TransUNet [16] introduced the integration of the Transformer architecture into medical image segmentation frameworks. It framed the segmentation task as a sequence-to-sequence prediction, incorporating self-attention mechanisms. This model adopts a hybrid CNN-Transformer design, strategically combining the spatial details extracted by CNN features with the Transformer's capabilities. Additionally, within the realm of medical image segmentation, several other hybrid models built upon the U-Net architecture have also seen enhancements. For example, UCTransNet [17] adopted the Channel Transformer (CTrans) module, leveraging the channel attention mechanism as a modification to the traditional U-Net skip connections. This implementation includes one sub-module designed for channel-wise cross-attention via Transformer technology and another for multi-scale channel fusion. These enhancements facilitate the efficient integration of multi-scale channel data with decoder features to enhance clarity and resolution. Like TransUNet, UNETR [18] employed a Transformer as its encoder along with a convolutional decoder to build segmentation maps. MT-UNet [19] developed a U-shaped network, utilizing the mixed Transformer module (MTM) designed for both intra-sample and inter-sample learning, aimed at enhancing the precision of medical image segmentation. First, the MTM applied the LGG-SA module to assess self-similarity. It then explored inter-sample connections using an external attention mechanism. MSFusion-UNet [38] adopted a flexible and efficient multi-stream fusion encoder to facilitate multi-scale fusion of multiple imaging stream features through spatial attention. LeViT-UNet [39] used LeViT blocks to build the encoder of a U-Net variant, which enables it to learn long-range dependencies more efficiently. The DCA module [40] used double cross-attention on the U-Net framework and enhanced skip connections to address the semantic gap between encoder and decoder features. TransAttUnet [41] adopted an effective feature screening method by jointly designing multi-scale skip connections and multi-level guided attention. It learned non-local interactions between encoding features and passed key features to the decoder. From a macro perspective, these architectures integrated a CNN and Transformer in a serial combination manner.
In terms of parallel strategies, TransFuse [25] implemented the BiFusion module, which performed spatial attention on the CNN branch and channel attention on the Transformer branch. Following this, operations such as convolution, multiplication, concatenation, and residuals were executed to facilitate the merging of features from both branches. On this basis, FAFuse [42] improved the BiFusion module through a four-axis fusion module to improve the ability of representation learning. The HiFormer [26] algorithm uses CNN and Transformer branches to extract features. To provide positioning information and reuse features, it incorporates a skip connection that conveys local CNN features to the Transformer. CiT-Net [27] employs dynamically deformable convolutions within its CNN branch to enhance feature extraction capabilities, while it incorporates the compact convolution projection and the SW-ACAM module in the Transformer branch to more effectively capture long-term dependencies across dimensions. CSwin-PNet [43] connected a CNN and Swin Transformer [44] as the backbone for feature extraction, building a pyramidal network structure for feature encoding and decoding. CT-Net [45] utilized an asymmetric asynchronous branch parallel structure to efficiently extract local and global representations while reducing unnecessary computational costs. DPCTN [46] combined the dual-branch fusion of a CNN and Transformer. To reduce the information loss during the information pooling process, DPCTN specially adopted a three-branch transposed self-attention module to significantly improve the segmentation performance.
Several other studies also exist, ranging from developing hybrid CNN and Transformer models to improving the Transformer blocks themselves to handle the complexities of medical imagery. Impacted by the general adoption of the Swin Transformer, Swin-Unet [28] proposed an innovative approach by substituting the convolutional blocks with Swin Transformer blocks for 2D medical image segmentation, and this marked the introduction of the first entirely Transformer-based U-shaped architecture. DS-TransUNet [47] expanded upon Swin-Unet by incorporating an encoder designed to handle multi-scale inputs and integrating a new multi-scale feature fusion module. This module uses the self-attention mechanism to effectively link global dependencies among features from various scales, enhancing the quality of segmentation across diverse medical images. MedT [48] improved upon the current system by incorporating a control mechanism within the self-attention module, specifically to tackle the scarcity of medical image segmentation datasets. Additionally, it introduced a local-global training strategy, optimizing the model's training process on medical images, which can further improve the performance.

Method
CCFNet follows a U-shaped structure featuring hierarchical decoder and encoder sections, in which skip connections facilitate the linkage between the decoder and encoder. It is essential to note that CCFNet is structured with two branches, which process information differently, as shown in Figure 2.

CNN Branch
The CNN branch adopts a feature pyramid structure. In the Transformer branch, patch embedding projects image patches into vectors, which loses local details; in the CNN, the convolution kernels slide across overlapping regions of the feature maps, which makes it possible to preserve fine local features. As a result, the CNN branch is able to supply local feature details to the Transformer branch. Specifically, as the network depth increases in the CNN branch, the resolution of the feature maps gradually decreases, the number of channels gradually increases, the receptive field gradually increases, and the feature encoding changes from local to global. Given an input image $x \in \mathbb{R}^{H \times W \times D_0}$, where $H \times W$ is its spatial resolution and $D_0$ is the number of channels, the feature maps generated by $F_{\mathrm{CNN}}(\cdot)$ are represented as
$$F_{\mathrm{CNN}}(x; \Theta) = \{ f_l \}_{l=1}^{L}, \qquad f_l \in \mathbb{R}^{\frac{H}{2^l} \times \frac{W}{2^l} \times D_l},$$
where $D_l$ represents the dimension of the feature map, $\Theta$ represents the parameters of the CNN branch, and $L$ represents the number of feature layers. Specifically, the first block $f_1$ is made up of two 3 × 3 convolutions with strides 1 and 2, and each convolution block is followed by normalization and the GELU activation function to extract initial local features (such as edge and texture information). As shown in Figure 3a, $f_2$ and $f_3$ are stacked from SEConv blocks, each composed of three convolutional blocks and an SE module [6]. The number of SEConv blocks in $f_2$ and $f_3$ is 2 and 6, respectively. The efficient and lightweight SE module can be seamlessly integrated into the CNN architecture; it helps CCFNet to enhance local details, suppress irrelevant regions, correct channel features by modeling the relationship between channels, and improve the representational capacity of the network. The CNN branch of the parallel fusion layer is a six-layer stack of modules, each consisting of a DFE block and an SFI block. The feature map output $C_i$ of each layer has the same resolution size $(\frac{H}{16}, \frac{W}{16}, D_4)$, and the output of the $i$-th layer is expressed as
$$C_i = \mathrm{SFI}\big(\mathrm{DFE}(C_{i-1}),\, T_i\big),$$
where $T_i$ is the $i$-th layer's CSF-block-coded image representation on the Transformer branch, with the same resolution as $C_{i-1}$. The structures of the DFE block and SFI block are shown in Figure 3b. More detailed operations are described in the parallel fusion layer.

Transformer Branch
The CNN branch obtains rich local features under a limited receptive field through convolution operations, while the Transformer branch performs global self-attention through attention mechanisms. The Transformer branch has the same input image $x \in \mathbb{R}^{H \times W \times D_0}$ as the CNN branch. Following [13,16], in the patch embedding we first divide $x$ into a sequence of $N = \frac{H}{P} \times \frac{W}{P}$ patches, where each patch has size $P \times P$ and $P$ defaults to 16. After splitting the input image into small patches, the patches are flattened into a sequence of vectors $\{ x_p^i \in \mathbb{R}^{P^2 \cdot D_0} \mid i = 1, \dots, N \}$ and fed to a trainable linear layer, which maps the vectorized patches $x_p$ into an embedding space with output dimension $D_4$. Then, to facilitate the fusion with the CNN branch, a reshape operation is used to generate the feature map $T_0 \in \mathbb{R}^{\frac{H}{16} \times \frac{W}{16} \times D_4}$, which can be expressed as
$$T_0 = \mathrm{Reshape}\big([x_p^1 E;\, x_p^2 E;\, \dots;\, x_p^N E]\big),$$
where $E \in \mathbb{R}^{(P^2 \cdot D_0) \times D_4}$ is the patch embedding projection. The Transformer branch in the parallel fusion layer consists of six CSF blocks of attention operations, and each CSF block consists of a ConvAttention and a CMLP (convolution multi-layer perceptron). The feature map output $T_i$ of each layer has the same resolution size $(\frac{H}{16}, \frac{W}{16}, D_4)$. Therefore, the output of the $i$-th layer can be written as follows:
$$T_i = \mathrm{CSF}\big(C'_{i-1},\, T_{i-1}\big),$$
where $C'_{i-1}$ and $T_{i-1}$ are the two inputs of the CSF block, $C'_{i-1}$ is the intermediate output of the $i$-th layer DFE module on the CNN branch, which has the same resolution as $T_{i-1}$, and $T_i$ is the encoded image representation. The structure of a CSF block is illustrated in Figure 3b. More detailed operations are described in the parallel fusion layer.
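The patch embedding step can be illustrated with a small sketch in plain Python (nested lists instead of PyTorch tensors, purely for readability). The function names `patchify`, `embed`, and `reshape_tokens` are hypothetical and not taken from the paper, and the projection matrix is a toy stand-in for the trainable $E$.

```python
def patchify(image, patch_size):
    """Split an H x W x D0 image (nested lists) into flattened P x P patches.

    Returns N = (H / P) * (W / P) vectors of length P * P * D0, in row-major
    patch order.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("H and W must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            vector = []
            for y in range(top, top + patch_size):
                for x in range(left, left + patch_size):
                    vector.extend(image[y][x])  # append the D0 channel values
            patches.append(vector)
    return patches


def embed(patches, projection):
    """Multiply every flattened patch by a (P*P*D0) x D4 projection matrix."""
    out_dim = len(projection[0])
    return [
        [sum(v * projection[k][d] for k, v in enumerate(patch)) for d in range(out_dim)]
        for patch in patches
    ]


def reshape_tokens(tokens, grid_h, grid_w):
    """Arrange the N tokens back into a (H/P) x (W/P) grid of D4-vectors."""
    return [tokens[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Toy usage: a 4 x 4 single-channel image with P = 2 gives N = 4 tokens.
image = [[[y * 4 + x] for x in range(4)] for y in range(4)]
toy_projection = [[1, 1], [1, 0], [1, 0], [1, 0]]  # (P*P*D0) x D4 = 4 x 2
t0 = reshape_tokens(embed(patchify(image, 2), toy_projection), 2, 2)
```

With the default $P = 16$, the grid has size $\frac{H}{16} \times \frac{W}{16}$, which matches the resolution of the CNN feature maps that the tokens are later fused with.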

Parallel Fusion Layer
The parallel fusion layer has two branches, namely, the Transformer branch and the CNN branch, which process information in distinct ways. In the CNN branch, local features are collected hierarchically through convolution operations, and local cues are saved as feature maps. The parallel fusion layer fuses the feature representation of the CNN in a parallel manner through cascaded attention modules, which maximizes the preservation of local features and global representations. The parallel fusion layer is composed of six stacked CCFMs.
An image has two completely different representations: global features and local features. The former focus on modeling object-level relationships between distant parts, while the latter capture fine-grained details and are beneficial for pixel-level localization and tiny object detection. As shown in Figure 3b, a CCFM is used to efficiently combine the encoded features of the Transformer and the CNN, interactively fusing convolution-based local features and Transformer-based global representations.
The CCFM has two inputs, $C_{i-1}$ and $T_{i-1}$, where $C_{i-1}$ is the input on the CNN branch with the same resolution as $T_{i-1}$, $T_{i-1}$ is the input on the Transformer branch, and $C'_{i-1}$ is the feature map formed after extracting features on the CNN branch, with the same resolution and number of channels as $T_{i-1}$, which can be expressed as
$$C'_{i-1} = \mathrm{DFE}(C_{i-1}).$$
The Transformer aggregates information between global tokens, but the CNN only aggregates information in the limited local field of view of the convolution kernel, which leads to semantic differences between Transformer and CNN features. Therefore, by superimposing the feature maps of the CNN and the Transformer, the CSF block adaptively fuses the self-attention weights with the information common to both in order to calculate the mutual relationship between local tokens and global tokens.
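The data flow through the stacked CCFMs can be sketched as follows. This is only an illustration of the wiring described in the text, assuming that the SFI block receives the DFE output $C'_{i-1}$ together with the new Transformer feature $T_i$; the blocks themselves are passed in as callables, and the names `ccfm_layer` and `parallel_fusion` are hypothetical.

```python
def ccfm_layer(c_prev, t_prev, dfe, csf, sfi):
    """One collaborative cross-fusion step.

    c_prev, t_prev: CNN and Transformer inputs (C_{i-1}, T_{i-1}).
    dfe, csf, sfi: callables standing in for the DFE, CSF, and SFI blocks.
    Returns (C_i, T_i).
    """
    c_mid = dfe(c_prev)          # C'_{i-1}: local features from the CNN branch
    t_next = csf(c_mid, t_prev)  # T_i: CNN features fused into the Transformer
    c_next = sfi(c_mid, t_next)  # C_i: Transformer context injected into the CNN
    return c_next, t_next


def parallel_fusion(c0, t0, layers):
    """Apply a stack of CCFMs; `layers` is a list of (dfe, csf, sfi) triples."""
    c, t = c0, t0
    for dfe, csf, sfi in layers:
        c, t = ccfm_layer(c, t, dfe, csf, sfi)
    return c, t


# Toy usage with scalar stand-ins for the three blocks, stacked six times.
toy_blocks = (lambda c: c + 1, lambda c, t: c + t, lambda c, t: c + t)
c_out, t_out = parallel_fusion(1, 2, [toy_blocks] * 6)
```

The ordering matters: the Transformer output of a layer is computed first, so the CNN branch of the same layer already sees the updated global context.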
As shown in Figure 3b, the CSF block consists of ConvAttention and CMLP. Like the traditional attention mechanism, the basic module of ConvAttention is multi-head self-attention (MHSA). As shown in Figure 4a, the difference is that ConvAttention has two inputs: the feature map $F_i$, obtained by adding $T_{i-1}$ and $C'_{i-1}$, and $T_{i-1}$ itself. In addition, ConvAttention uses convolutional mapping. Specifically, $T_{i-1}$ generates $V_i$ through a 3 × 3 convolutional mapping, and $F_i$ generates $Q_i$ and $K_i$ through 3 × 3 convolutional mappings. Subsequently, we use the flatten operation to project the patches into the $d$-dimensional embedding space as the input of the underlying MHSA module in the ConvAttention block.
The MHSA is performed on the obtained $Q_i$, $K_i$, and $V_i$. An MHSA comprises $h$ parallel self-attention heads, and the calculation process is as follows:
$$\mathrm{MHSA}(Q_i, K_i, V_i) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W_{mhsa},$$
where $W_{mhsa} \in \mathbb{R}^{d \times d}$ represents the multi-headed trainable parameter weights. The self-attention of each head in MHSA is calculated as
$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{Q K^{T}}{\sqrt{d}} + B\right) V,$$
where $Q$, $K$, and $V \in \mathbb{R}^{N_T \times d}$ are the query, key, and value matrices, which are obtained by convolution projection; $N_T = W_h \times W_w$ denotes the number of patch tokens; $\{W_h, W_w\}$ is the spatial size of the features $F_i$ and $T_{i-1}$; and $d$ is the query/key dimension. We follow [28,44] by including a relative position bias $B \in \mathbb{R}^{N_T \times N_T}$. Since the relative position along each axis lies in the range $[-W_{h/w} + 1, W_{h/w} - 1]$, we parameterize a smaller bias matrix $\hat{B} \in \mathbb{R}^{(2W_h - 1) \times (2W_w - 1)}$; the values of $B$ are taken from $\hat{B}$.
As shown in Figure 4a, a CMLP is then carried out, which consists of two 1 × 1 convolution layers. The output $T_i$ obtained after the CMLP is used as the input of the Transformer branch in the next fusion module, and at the same time, it is fused with the feature map of the same resolution on the CNN branch.
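A single attention head with relative position bias can be sketched in plain Python as below. The sketch assumes the usual Swin-style lookup, in which the bias for a token pair is read from the small table at the offset between their grid positions; the convolutional projections that produce the queries, keys, and values are omitted, and the function names are hypothetical.

```python
import math


def relative_position_bias(bias_table, win_h, win_w):
    """Expand a (2*Wh - 1) x (2*Ww - 1) table into the N_T x N_T bias matrix.

    Entry (i, j) is looked up at the row/column offset between tokens i and j,
    shifted so that indices start at zero.
    """
    coords = [(y, x) for y in range(win_h) for x in range(win_w)]
    return [
        [bias_table[yi - yj + win_h - 1][xi - xj + win_w - 1] for (yj, xj) in coords]
        for (yi, xi) in coords
    ]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, bias):
    """SoftMax(Q K^T / sqrt(d) + B) V for one head, with lists of token vectors."""
    d = len(q[0])
    out = []
    for i, q_i in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d) + bias[i][j]
            for j, k_j in enumerate(k)
        ]
        weights = softmax(scores)
        out.append([sum(w * v_j[c] for w, v_j in zip(weights, v)) for c in range(len(v[0]))])
    return out
```

Storing only the $(2W_h - 1) \times (2W_w - 1)$ table keeps the number of bias parameters independent of the number of token pairs, since many pairs share the same offset.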
Given the varying receptive fields of the CNN and Transformer, the features they extract exhibit asymmetry. At the same time, the information reflected by these features differs greatly in space. As shown in Figure 4b, when the Transformer branch is fused into the CNN branch, the SFI block uses the spatial attention weight map of the feature obtained on the Transformer branch. The calculation formula is as follows:
$$M_i = \sigma\big(\mathrm{Conv}([T_{avg};\, T_{max}])\big),$$
where $\sigma$ represents the sigmoid function, and $T_{avg}$ and $T_{max}$ represent the cross-channel average-pooled features and max-pooled features, respectively. The attention map is multiplied by the feature map on the CNN branch to achieve spatial information feature enhancement. Then, it is concatenated with the feature map on the Transformer branch, and the features are further fused by a 1 × 1 convolution. The final output is used as the input of the CNN branch in the next fusion module. In the last layer of the parallel fusion layer, the two features are fused and used as the input of the decoder. Specifically, the outputs of the CNN branch and Transformer branch are added together and fused through a convolution.
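The spatial injection step can be illustrated with a minimal plain-Python sketch. It is a simplification under stated assumptions: the learned convolution that turns the two pooled maps into one attention map is replaced by a hypothetical per-pixel weighted sum (`w_avg`, `w_max`), and the final 1 × 1 convolution is left out, so the output is the concatenated feature before fusion.

```python
import math


def spatial_attention(t_feat, w_avg=1.0, w_max=1.0):
    """Compute an H x W attention map in (0, 1) from an H x W x D feature."""
    attention = []
    for row in t_feat:
        attention_row = []
        for channels in row:
            t_avg = sum(channels) / len(channels)  # cross-channel average pooling
            t_max = max(channels)                  # cross-channel max pooling
            logit = w_avg * t_avg + w_max * t_max  # stand-in for the learned convolution
            attention_row.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid
        attention.append(attention_row)
    return attention


def inject(c_feat, t_feat):
    """Reweight the CNN feature spatially, then concatenate the Transformer feature."""
    attention = spatial_attention(t_feat)
    fused = []
    for att_row, c_row, t_row in zip(attention, c_feat, t_feat):
        fused_row = []
        for weight, c_pixel, t_pixel in zip(att_row, c_row, t_row):
            enhanced = [weight * value for value in c_pixel]
            fused_row.append(enhanced + list(t_pixel))  # channel-wise concatenation
        fused.append(fused_row)
    return fused
```

Because the attention map has a single value per pixel, it changes where the CNN feature is emphasized without mixing its channels; channel mixing is left to the subsequent 1 × 1 convolution.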

Decoder
The decoder in CCFNet is a pure convolution module that consists of several upsampling steps to decode hidden features, with the ultimate output being the segmentation result. First, bilinear interpolation is applied to the input feature map. The following operations are then repeated until the resolution of the original input is restored: the feature maps whose resolution has been doubled are concatenated with the feature maps from the corresponding skip connections, passed through successive 3 × 3 convolution layers, and upsampled using bilinear interpolation. Finally, the feature maps with the restored original resolution are fed into a special convolution layer (segmentation head) to generate the pixel-level semantic prediction.
The decoder merges the semantic information of the encoder through skip connections and concatenation operations to obtain more contextual information. The outputs of the three layers of the CNN branch in the encoder are sequentially connected to the three layers of the decoder to regain local spatial information and improve finer details. The parallel fusion layer is a dual-stream fusion operation of the CNN and Transformer, which sends the fused output of the two features to the decoding layer.

Loss Function
In general segmentation tasks, Dice loss [29] and cross-entropy loss are both frequently used, with Dice loss being suitable for large-sized target objects and cross-entropy loss performing well for a uniform distribution of categories. Following TransUNet [16], the loss function used in CCFNet training combines Dice loss and cross-entropy, and is defined as
$$\mathcal{L}(G, Y) = 1 - \frac{2}{J} \sum_{j=1}^{J} \frac{\sum_{i=1}^{I} G_{i,j} Y_{i,j}}{\sum_{i=1}^{I} G_{i,j}^{2} + \sum_{i=1}^{I} Y_{i,j}^{2}} - \frac{1}{I} \sum_{i=1}^{I} \sum_{j=1}^{J} G_{i,j} \log Y_{i,j},$$
where $I$ and $J$ are the number of voxels and classes, respectively; $Y_{i,j}$ and $G_{i,j}$, respectively, represent the predicted value and true value of class $j$ at pixel $i$.
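A combined Dice and cross-entropy loss of this kind can be sketched in plain Python. The sketch assumes the commonly used form $1 - \frac{2}{J}\sum_j \frac{\sum_i G_{i,j} Y_{i,j}}{\sum_i G_{i,j}^2 + \sum_i Y_{i,j}^2} - \frac{1}{I}\sum_i \sum_j G_{i,j} \log Y_{i,j}$ with equal weighting of the two terms; the function name `dice_ce_loss` and the stabilizing constant `eps` are assumptions, not details from the paper.

```python
import math


def dice_ce_loss(pred, target, eps=1e-8):
    """Soft Dice loss plus cross-entropy.

    pred, target: lists of length I (voxels); each entry is a list of J class
    values (predicted probabilities and one-hot ground truth, respectively).
    """
    num_voxels = len(pred)
    num_classes = len(pred[0])

    dice_sum = 0.0
    for j in range(num_classes):
        overlap = sum(target[i][j] * pred[i][j] for i in range(num_voxels))
        denom = sum(target[i][j] ** 2 for i in range(num_voxels)) + sum(
            pred[i][j] ** 2 for i in range(num_voxels)
        )
        dice_sum += overlap / (denom + eps)  # eps guards against empty classes
    dice_term = 1.0 - 2.0 * dice_sum / num_classes

    ce_term = -sum(
        target[i][j] * math.log(max(pred[i][j], eps))  # clamp to avoid log(0)
        for i in range(num_voxels)
        for j in range(num_classes)
    ) / num_voxels

    return dice_term + ce_term
```

The Dice term is driven by region overlap per class, while the cross-entropy term penalizes every voxel individually, which is why the two are often combined.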

Dataset
The effectiveness of the CCFNet model is demonstrated by experiments on two public medical image datasets: the Synapse dataset [49] and the Automated Cardiac Diagnosis Challenge (ACDC) dataset [50].
The Synapse dataset, a public multi-organ segmentation dataset, contains 3779 axial contrast-enhanced abdominal clinical CT images from 30 abdominal CT scans. Following the data split used in TransUNet [16], the dataset is divided into 18 cases for training and 12 cases for validation. The annotations for each image include eight abdominal organs (stomach, spleen, pancreas, liver, left kidney, right kidney, gallbladder, and aorta). The average Hausdorff distance (HD) and the average Dice similarity coefficient (DSC) are used to evaluate CCFNet on this dataset.
ACDC is a publicly available cardiac magnetic resonance imaging (MRI) dataset that contains 150 3D MRI cases gathered from different individuals, with each case covering the heart from the bottom to the top of the left ventricle. Following the setup in [16], only 100 well-annotated cases are used in the experiment, and labels for three key parts of the heart are chosen: the left ventricle (LV), right ventricle (RV), and myocardium (MYO).
Following the data split used in TransUNet [16], the ACDC dataset is split into training (1930 axial slices), validation, and test sets at a ratio of 7:1:2, and the average DSC is used to evaluate CCFNet on this dataset.
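The case counts implied by these splits follow from simple arithmetic. The helper below is an illustrative sketch (the name `split_counts` is hypothetical); it only computes subset sizes and does not reproduce which cases the TransUNet split assigns to each subset.

```python
def split_counts(num_cases, ratio):
    """Return the number of cases per subset for an integer ratio such as (7, 1, 2)."""
    total = sum(ratio)
    counts = [num_cases * part // total for part in ratio]
    # Any remainder left by integer division goes to the first (training) subset.
    counts[0] += num_cases - sum(counts)
    return counts


if __name__ == "__main__":
    print("ACDC:", split_counts(100, (7, 1, 2)))
    print("Synapse:", split_counts(30, (18, 12)))
```

For the 100 ACDC cases, a 7:1:2 ratio corresponds to 70 training, 10 validation, and 20 test cases.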

Evaluation Metrics
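We use the Dice score and HD to evaluate the accuracy of segmentation in our experiments:

$$\mathrm{Dice}(G, P) = \frac{2\sum_{i=1}^{I} G_i P_i}{\sum_{i=1}^{I} G_i + \sum_{i=1}^{I} P_i},$$

$$\mathrm{HD}(G', P') = \max\left\{ \max_{g' \in G'} \min_{p' \in P'} \lVert g' - p' \rVert,\ \max_{p' \in P'} \min_{g' \in G'} \lVert p' - g' \rVert \right\},$$

where $P_i$ and $G_i$ represent the predicted and actual values for voxel $i$, and $P'$ and $G'$ signify the sets of surface points for the prediction and ground truth, respectively.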
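As a minimal sketch of these two metrics, the code below assumes binary masks flattened to lists and surface points given as coordinate tuples. The function names are illustrative, and the plain maximum Hausdorff distance is computed rather than a percentile variant, which some benchmarks report instead.

```python
import math


def dice_score(pred, gt):
    """Dice score of two binary masks given as flat lists of 0/1 values."""
    intersection = sum(p * g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2.0 * intersection / total


def directed_hausdorff(source, target):
    """Largest distance from a point in source to its nearest point in target."""
    return max(min(math.dist(s, t) for t in target) for s in source)


def hausdorff_distance(surface_a, surface_b):
    """Symmetric Hausdorff distance between two non-empty sets of surface points."""
    return max(directed_hausdorff(surface_a, surface_b),
               directed_hausdorff(surface_b, surface_a))


if __name__ == "__main__":
    print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
    print(hausdorff_distance([(0, 0), (1, 0)], [(0, 0), (4, 0)]))
```

The brute-force nearest-point search is quadratic in the number of surface points, which is acceptable for illustration but slow for full-resolution volumes.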

Implementation Details
The CCFNet model is implemented in PyTorch and trained on an Nvidia GeForce RTX 3090 GPU with 24 GB of memory. Unlike previous work (TransUNet [16], Swin-Unet [28]), in which models were initialized with weights pre-trained on ImageNet [22], the CCFNet model is randomly initialized and trained from scratch, so the maximum number of training epochs is increased to 1000. The other hyperparameters are kept the same: an initial learning rate of 0.01 with a multi-learning rate strategy, a batch size of 24, and the SGD optimizer with a momentum of 0.9 and a weight decay of 1e-4. For all experiments, we apply simple data augmentation, such as random rotations and flips. For 3D datasets, every volume is sliced and processed slice by slice. Finally, to reconstruct the 3D prediction for evaluation, we stack the 2D predictions of all slices.
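The stated hyperparameters and the slice-then-stack evaluation can be summarized in a short sketch. This is an illustration under assumptions, not the training code: `TrainConfig`, `predict_volume`, and `threshold_slice` are hypothetical names, and `threshold_slice` merely stands in for the trained network.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as stated in the text."""
    max_epochs: int = 1000
    base_lr: float = 0.01
    batch_size: int = 24
    momentum: float = 0.9
    weight_decay: float = 1e-4


def predict_volume(volume, predict_slice):
    """Apply a 2D predictor to every slice and stack the results in order.

    volume is a list of 2D slices (each a list of rows); the returned list
    has the same depth, so it can be scored as a 3D prediction.
    """
    return [predict_slice(slice_2d) for slice_2d in volume]


def threshold_slice(slice_2d, threshold=0.5):
    """Hypothetical stand-in for the network: binarize one 2D slice."""
    return [[1 if value > threshold else 0 for value in row] for row in slice_2d]


if __name__ == "__main__":
    config = TrainConfig()
    volume = [[[0.2, 0.9], [0.7, 0.1]],
              [[0.6, 0.4], [0.3, 0.8]]]
    print(config)
    print(predict_volume(volume, threshold_slice))
```

Keeping the slice order unchanged is what allows the stacked 2D predictions to be compared with the 3D ground truth.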

Results
We evaluate the performance of the CCFNet model on two different types of dataset (Synapse and ACDC) and compare it with various state-of-the-art models.
As shown in Table 1, experiments are performed on Synapse using the same image size and preprocessing, and CCFNet is compared with several mainstream Transformer- or CNN-based methods, such as TransUNet, LeVit-UNet-384, MT-UNet, UCTransNet, TransFuse, and Swin-Unet. Specifically, CCFNet outperforms Swin-Unet by more than 7.08 mm in average HD and 2.46% in average DSC. CCFNet also achieves the highest DSC on five organs: the stomach, liver, left kidney, right kidney, and gallbladder.
To visually demonstrate the performance of CCFNet, some qualitative results on Synapse are contrasted with those of other approaches, namely Swin-Unet, TransUNet, and U-Net. In Figure 5, the red boxes indicate areas where CCFNet outperforms the other methods. For organs that are hard to segment, CCFNet better captures long-range dependencies: in the first row of Figure 5, it segments the long, narrow pancreas better than the other models. For large organs, CCFNet recognizes and delineates the stomach contour more accurately, as shown in the second row, and its segmentation results are largely consistent with the ground truth labels. CCFNet also has advantages in identifying small organs: as shown in the third row, some of the other models do not fully identify the gallbladder. Finally, CCFNet identifies organ junctions more accurately; in the fourth row, at the junction of the liver and stomach, the other three models make some errors. The visualization demonstrates the high segmentation accuracy of CCFNet, especially on difficult-to-segment slices. This performance is attributed to the CCFM in CCFNet, which attends to small local organs while focusing on large organs, showing the strong ability of CCFNet to learn both the low-level details and the high-level semantic features that are critical in medical image segmentation.
To comprehensively evaluate CCFNet, Table 3 compares its number of parameters and amount of computation with those of several mainstream network structures. As the table shows, CCFNet has more parameters and more operations than Swin-Unet and TransUNet, because it makes full use of the self-attention mechanism to obtain a more refined feature representation. However, this modest increase in complexity yields significant improvements in segmentation accuracy, which we attribute to the model's ability to effectively capture global information and multi-scale features and thereby segment structures in medical images more accurately.

Ablation Studies
We perform ablation studies mainly on the Synapse dataset to assess the effectiveness of each basic component of CCFNet. All experiments use the same hyperparameters and are trained from scratch to keep the comparison fair. Table 4 shows that incorporating the SEConv module into the baseline model yields a consistent gain in segmentation accuracy. This is because the module compensates for the large amount of detail information lost during the upsampling process of the decoder. Adding the CCFM also brings large gains, because it integrates global and local features to improve segmentation performance. When only the CSF block is removed (a concatenation operation is used instead to fuse the features of the two branches), the DSC on Synapse drops by 2.02%. This shows that the CSF block can reduce the semantic difference between the two types of features, introduce convolution-specific inductive bias into the Transformer branch, and enrich the detailed features of the Transformer. When only the SFI block is removed (again replaced by concatenation), the DSC on Synapse drops by 1.22%, and the HD is far worse than that of the original model. This shows that the SFI block can reduce the spatial information gap caused by the asymmetry of the extracted features and introduce the global information of the Transformer branch into the CNN branch. Together, the results show that the CSF and SFI blocks facilitate the fusion of the two different feature maps and integrate the global encoding of the Transformer with the local encoding ability of the CNN. This is because cross-fusion in both directions, from the CNN to the Transformer and from the Transformer to the CNN, integrates the features of the two branches more thoroughly than a simple fusion.

3D Implementation
To further explore the performance of CCFNet, we implement a 3D version of CCFNet. The performance of 3D CCFNet is evaluated on the Synapse dataset and compared with a variety of state-of-the-art 3D models; the final experimental results are shown in Table 6. Among these models, nnFormer [52] is a Transformer model based on a cross structure that is mainly used for 3D image segmentation in medical image analysis. Following the 3D implementation in nnFormer, the input size is set to 64 × 128 × 128. The 3D CCFNet shows a 0.31% improvement over nnFormer in average DSC and achieves an HD of 8.78 mm, meaning that 3D CCFNet can better delineate object boundaries.

Analysis and Discussion
Feature Analysis. As shown in Figure 6, to verify our motivation and the effect of the design, the feature maps on Synapse are visualized using the Grad-CAM method [53]. Due to the limitations of convolution, the region of attention in Figure 6b tends to be locally informative and lacks long-range information. Because of the global features provided by the Transformer branch, Figure 6e learns to activate a larger region than the local region in Figure 6b, indicating that Figure 6e enhances long-range feature dependence compared to Figure 6b. As the Transformer focuses on extracting global feature dependencies, the region of attention in Figure 6c is insufficient in local feature details. Because of the progressively finer local features captured by the CNN branch, critical local features are retained in Figure 6f. At the same time, the background is significantly suppressed, indicating that the learned feature representation in Figure 6f has a higher local perception ability than that in Figure 6c.
Defect analysis. To analyze the current shortcomings of CCFNet and address them in future work, the ground truth of Synapse is compared with the predictions of CCFNet. As shown in Figure 7, the red boxes mark regions where CCFNet falls short. Although the organ masks predicted by CCFNet are very close to the ground truth labels, some problems remain. First, CCFNet can sometimes delineate an organ accurately but assign it the wrong label. For example, in the first row, the middle part of the left kidney is identified as the right kidney. Similarly, in the two left images of the second row, part of the spleen is identified as the stomach. In addition, some organs, such as the pancreas, are inherently challenging to segment. In the two right images of the second row, CCFNet locates the pancreas region but cannot accurately delineate its contour, which indicates that CCFNet still has room for improvement. Finally, processing 2D slices loses much of the spatial information in the original 3D volumes. The third row compares the results for two adjacent liver slices: the liver is segmented accurately in the former slice but not completely in the latter.

Conclusions
This paper first summarizes and discusses existing frameworks for medical image segmentation and then proposes a collaborative cross-fusion network to solve the existing problems. CCFNet utilizes the CNN's inductive bias in spatial correlation modeling and the Transformer's powerful capabilities in global relationship modeling to process the global features extracted by the Transformer and the local features extracted by convolutions in a parallel, interactive manner. The different features extracted from the two branches are merged and exchanged by the CCFM, which can not only encode the hierarchical local and global representations independently but also effectively aggregate them to narrow the semantic gap between different network features. Experiments show that CCFNet has a considerable advantage over previous Transformer-based models on various segmentation tasks, striking a balance between modeling long-range dependencies and preserving the details of underlying features, and exploiting the CNN and Transformer to the fullest extent possible. We recognize that the current CCFNet has some limitations, particularly in terms of model performance and efficiency when dealing with highly complex datasets. In future work, we plan to design a more lightweight and efficient parallel fusion network that can solve the problems currently found in CCFNet, and to test the model on more tasks. By developing a comprehensive fusion network, we expect to overcome these limitations and further optimize the overall performance of the model.

Figure 1. Comparison of different CNN and Transformer integration strategies. (a) Serial or embedded fusion strategy; (b) CNN and Transformer branch fusion strategy; (c) parallel fusion strategy of CNN and Transformer at each layer; (d) CCFNet fusion strategy. The arrows in the figure indicate the direction of data flow in the feature map.
The two branches preserve global contexts and local features through the parallel fusion layer composed of the CCFM. In the CCFM, the CSF block adaptively fuses local and global tokens according to their correlation, thus introducing convolution-specific inductive bias into the Transformer. The SFI block avoids the asymmetry of the extracted features and introduces the global representations of the Transformer into the CNN branch, which has extracted local semantic features through the detail feature extractor (DFE) block. Features from the two parallel branches are fused successively, so that the two types of features ultimately complement each other. The proposed parallel branching approach has three main benefits. First, the CNN branch gradually extracts low-level, high-resolution features to obtain detailed spatial information, which helps the Transformer obtain rich features and accelerates its convergence. Second, the Transformer branch can capture global information while remaining sensitive to low-level contexts without building a deep network. Finally, during feature extraction, the proposed CCFM leverages the different characteristics of the Transformer and the CNN to the full extent, continuously aggregating hierarchical representations from global and local features.

Figure 2. The overall architecture of the CCFNet model for medical image segmentation. The network follows a standard U-shaped structure and consists of four parts: CNN branch, Transformer branch, parallel fusion layer, and decoder. Among them, the CNN branch uses convolution to extract fine-grained features, the Transformer branch uses attention to capture global information, and the parallel fusion layer combines the CNN and Transformer in parallel to extract rich depth information.

Figure 3. The architecture of the proposed CCFNet. (a) The convolution process of SEConv; (b) the structure of the CCFM, which has two branches whose inputs are $C_{i-1}$ and $T_{i-1}$. The Transformer branch is composed of collaborative self-attention fusion blocks, and the CNN branch is composed of detail feature extractor blocks and spatial feature injector blocks.

Figure 4. The architecture of the CCFM. (a) Structure of the CSF block, which is composed of ConvAttention and CMLP; (b) the SFI block.

Figure 5. Qualitative visual comparison of CCFNet and other 2D methods on Synapse, and quantitative evaluation of images using SSIM, DSSIM, FSIM, MSE, and PSNR. (a) Ground truth; (b) CCFNet, our proposed method; (c) Swin-UNet, incorporating Swin Transformer blocks in both the encoder and decoder; (d) TransUNet, utilizing a ViT encoder on the ResNet-50 backbone and employing a UNet decoder; (e) UNet, featuring a U-shaped encoder-decoder architecture. The red boxes indicate areas where CCFNet outperforms the other methods.

Figure 6. Feature analysis. (a) Ground truth; (b,e) attention maps of shallow and deep layers in the parallel fusion layer's CNN branch; (c,f) attention maps of shallow and deep layers in the parallel fusion layer's Transformer branch; and (d) the attention map that finally integrates the CNN and Transformer (best viewed in color).
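Attention maps of this kind are obtained with Grad-CAM [53], which weights each feature channel by the spatial average of its gradient and keeps only the positive evidence. The sketch below shows that core computation on nested lists; the function name `grad_cam` is illustrative, and the activations and gradients are assumed to have already been extracted from a chosen layer.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM map from one layer's activations and gradients.

    Both arguments are nested lists indexed as [channel][row][col] with the
    same shape; gradients are taken with respect to the target class score.
    """
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weights: global average of each channel's gradient map.
    weights = [sum(sum(row) for row in grad) / (height * width)
               for grad in gradients]

    # Weighted sum of activation maps over channels.
    cam = [[0.0] * width for _ in range(height)]
    for weight, activation in zip(weights, activations):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * activation[y][x]

    # ReLU keeps only locations that support the target class.
    return [[max(0.0, value) for value in row] for row in cam]


if __name__ == "__main__":
    activations = [[[1, 2], [3, 4]], [[1, 1], [1, 1]]]
    gradients = [[[1, 1], [1, 1]], [[-2, -2], [-2, -2]]]
    print(grad_cam(activations, gradients))
```

In practice, the resulting map is usually normalized and upsampled to the input resolution before being overlaid on the image.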

Figure 7. Qualitative comparison of CCFNet predictions with the ground truth on Synapse. The images in each row are arranged in pairs (left: ground truth; right: prediction). The red boxes mark regions where CCFNet falls short.

Table 1. Comparison on Synapse (Dice score % for each organ, average Dice score %, and average Hausdorff distance in mm).

Table 2 shows the experimental results on ACDC. CCFNet exceeds TransUNet by 1.35% and Swin-Unet by 1.07% in average DSC. This performance is attributed to the SEConv and CCFM modules, which give CCFNet strong representation capabilities for learning both low-level details and high-level semantic features, a property that is crucial in medical image segmentation. This again confirms that CCFNet, with its CNN-Transformer collaborative cross-fusion, learns more effective representations for medical image segmentation than other advanced methods, indicating that CCFNet generalizes well and is robust.

Table 2. Comparison on the ACDC dataset in DSC (%).

Table 3. Comparison of number of parameters and FLOPs for 2D segmentation models in Synapse experiments.

Table 4. Ablation studies on effects of different components of CCFNet.

Table 5 shows the influence of different fusion techniques within the CCFM, obtained by ablating its components on the Synapse dataset. Using only the DFE module is equivalent to removing the Transformer branch and using only the CNN branch. Using only the CSF module is equivalent to removing the CNN branch and using only the Transformer branch. The experimental results reveal that the DSC scores on Synapse are almost the same whichever branch is removed. To some extent, this means that the global information represented by the Transformer and the local information represented by the CNN play equally important roles in visual representation: both global and local features are important for organ segmentation, and fusing the two helps the model achieve more precise segmentation.

Table 5. Ablation studies of effects of different fusion methods on the CCFM. A checkmark (✓) in a column indicates that the corresponding module is used; its absence indicates that the module is not used.

Table 6. Comparison with 3D models on Synapse (Dice score % for each organ, average Dice score %, and average Hausdorff distance in mm).