3.1. Overall Architecture
As backbone networks have evolved, new demands have been placed on the extraction and fusion of multi-scale features. The FPN effectively addresses multi-scale feature fusion by fusing adjacent-layer features, which has made FPN and its derivatives the most commonly used networks for this purpose. However, FPN and its derivatives still face the following difficulties:
Difficulty in acquiring cross-layer features: As shown in Figure 1, if layer S3 needs features from layer S5, it can only access the already fused features of layers S4 and S5. Cross-layer feature acquisition is therefore incomplete, and many features are discarded during the initial fusion.
Limited range of global fusion: From a global perspective, the effective fusion range of a single feature layer is restricted to its neighboring layers, which limits the effectiveness of global feature fusion.
Neglect of large-scale detail: These structures typically only upsample small-scale feature layers, decomposing or adding features while neglecting the detailed information of large-scale feature layers.
Therefore, commonly used necks lack the capability to integrate global information effectively and perceive feature details only weakly. In addition, a large amount of useful information is left unexploited when multi-scale features are aligned. To address these issues, this paper proposes a novel approach called GMDNet. GMDNet introduces a centralized processing branch into the fusion network that consolidates the processed information and distributes it to the various levels, effectively mitigating information loss in the neck. Furthermore, by integrating a three-dimensional feature encoder that enhances global fusion through attention to channel and positional information, GMDNet improves the perception of feature details and greatly enhances multi-scale feature fusion.
3.2. Global Information Sharing Module (GISM)
GISM is responsible for the re-collection and re-injection of multi-scale features in GMDNet. GISM consists of two branches, each comprising three components: feature alignment, feature fusion, and feature injection. In the first branch, the input to the neck consists of S2, S3, S4, and S5 from the backbone, which are fused together. In the second branch, the inputs are M3, M4, and M5, which are also fused together. The structure is illustrated in the diagram.
Feature Alignment Section:
To reconcile the trade-off between preserving bottom-level information and managing computational complexity across layers, this paper opts to use the S4 layer as the anchor layer for feature alignment in the first branch.
In F-FAS, the S2, S3, and S5 features are aligned to the anchor layer S4 through Avgpool and Bilinear operations, ensuring that the inputs are well suited to F-FFS. This method efficiently aggregates information across layers while also managing the computational complexity of the neck.
In line with the first branch, which balances preserving bottom-level information against managing computational complexity across layers, this paper uses the M5 layer as the reference layer for feature alignment in the second branch. Within S-FAS, average pooling (Avgpool) aligns M3 and M4 with M5. This alignment efficiently aggregates information across layers and reduces the computational load of the subsequent GAU modules.
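To make the alignment step concrete, the following is a minimal PyTorch sketch of anchor-based alignment, in which higher-resolution maps are average-pooled and lower-resolution maps are bilinearly upsampled before concatenation; the channel widths and spatial sizes are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def align_to_anchor(feats, anchor_idx):
    """Align multi-scale feature maps (B, C, H_i, W_i) to the anchor layer's size:
    average-pool higher-resolution maps, bilinearly upsample lower-resolution ones,
    then concatenate along the channel dimension."""
    h, w = feats[anchor_idx].shape[-2:]
    aligned = []
    for f in feats:
        if f.shape[-2:] == (h, w):
            aligned.append(f)
        elif f.shape[-2] > h:   # higher resolution than the anchor -> downsample
            aligned.append(F.adaptive_avg_pool2d(f, (h, w)))
        else:                   # lower resolution than the anchor -> upsample
            aligned.append(F.interpolate(f, size=(h, w), mode="bilinear", align_corners=False))
    return torch.cat(aligned, dim=1)

# First branch: S2, S3, S5 aligned to the S4 anchor (sizes and channels assumed).
s2, s3, s4, s5 = (torch.randn(1, 64, s, s) for s in (160, 80, 40, 20))
fused_input = align_to_anchor([s2, s3, s4, s5], anchor_idx=2)   # (1, 256, 40, 40)
# Second branch: with M5 (the smallest map) as reference, only Avgpool is needed.
```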
Feature Fusion Section:
F-FFS comprises multi-layer online reparameterized convolutional blocks (OREPA) [34] and separation operations. To enhance deep model performance without increasing inference-time costs, structural reparameterization is applied at the fusion endpoint. However, the accuracy of structural reparameterization relies on complex training procedures, inevitably escalating training costs [14,35,36,37].
OREPA largely eliminates this additional training time and saves considerable computing resources through two steps. First, OREPA removes all non-linear layers and replaces them with linear scaling layers; these scaling layers can then be merged into the convolutional layers, further saving computation. Second, a new method of converting linear layers into convolutional layers is introduced, which relieves the computational pressure of the many layers involved. OREPA therefore effectively reduces GPU memory pressure and improves inference efficiency. In summary, OREPA eliminates all nonlinear layers, replaces them with linear scaling layers, and appends a BN layer at the end of its structure, condensing the complex training-time structure into a single convolution and significantly reducing training expenses. The simplified formulas for the sequential and parallel structures of OREPA are as follows:
$$W_{\mathrm{seq}} = W_{J} \ast W_{J-1} \ast \cdots \ast W_{1}, \qquad W_{\mathrm{par}} = \sum_{m=1}^{M} W_{m},$$
where $W_{j}$ represents the weights of the $j$-th layer in a sequential structure, $W_{m}$ denotes the weights of the $m$-th branch in a parallel structure, and $W_{\mathrm{seq}}$ and $W_{\mathrm{par}}$ signify the unified weights obtained by merging all layers or branches.
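To illustrate what these merging rules mean in practice, the short PyTorch sketch below folds a per-channel linear scaling layer into a convolution kernel and merges parallel branches by summing their kernels. It is only a hedged illustration of the two simplification rules, not the full OREPA block, and the tensor shapes are assumed for the example.

```python
import torch
import torch.nn.functional as F

def fold_scaling(weight: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Fold a per-output-channel linear scaling layer into a conv kernel.
    weight: (C_out, C_in, k, k), scale: (C_out,)."""
    return weight * scale.view(-1, 1, 1, 1)

def merge_parallel(branch_weights) -> torch.Tensor:
    """Parallel linear branches of equal kernel size collapse into a single kernel,
    i.e. W_par = sum_m W_m, because convolution is linear in its kernel."""
    return torch.stack(list(branch_weights), dim=0).sum(dim=0)

# Sanity check: two scaled 3x3 branches behave like one merged convolution.
x = torch.randn(1, 16, 32, 32)
w1, w2 = torch.randn(32, 16, 3, 3), torch.randn(32, 16, 3, 3)
s1, s2 = torch.rand(32), torch.rand(32)
y_branches = (F.conv2d(x, fold_scaling(w1, s1), padding=1)
              + F.conv2d(x, fold_scaling(w2, s2), padding=1))
y_merged = F.conv2d(x, merge_parallel([fold_scaling(w1, s1), fold_scaling(w2, s2)]), padding=1)
assert torch.allclose(y_branches, y_merged, atol=1e-4)
```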
The input to OREPA is sourced from the output end of F-FAS. The output of OREPA then undergoes separation operations to produce the two final outputs of F-FFS, as depicted in Figure 3a.
S-FFS comprises stacked GAU modules [38] and separation modules. The core idea of GAU originates from the FFN layer of the Transformer. The GLU work showed that stacking convolutional layers with simple linear gating units allows tokens to be processed effectively in parallel, which reduces the complexity of the algorithm. Because the sigmoid gate can cause vanishing gradients, residual structures are added to the network, and Adaptive Softmax is adopted as the normalization function, effectively improving training efficiency. Building on GLU, GAU treats attention and GLU as a single unified layer and makes extensive use of parallel computation to achieve efficiency. In summary, these modules are built upon Transformer theory, combining self-attention and GLU [39] into a single layer that shares computations, thereby improving computational efficiency and reducing parameter count. The specific formula is as follows:
$$O = \left(U \odot A V\right) W_{o}, \qquad A = \mathrm{relu}^{2}\!\left(\mathcal{Q}(Z)\,\mathcal{K}(Z)^{\top} + b\right),$$
where $A$ encompasses the token–token attention weights, $Z$ denotes the shared intermediate representation $Z = \phi_{z}(XW_{z})$, with $\mathcal{Q}$ and $\mathcal{K}$ representing two simplified transformations applied to $Z$, and $b$ signifying the relative positional bias.
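As a rough illustration of this formulation, the following minimal PyTorch sketch implements a gated attention unit in the spirit of the formula above; the layer widths, activation choices, and the simple scaling of the attention logits are assumptions for illustration rather than the exact configuration used in GMDNet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GAU(nn.Module):
    """Minimal gated attention unit: self-attention and the GLU gate share one layer.
    d: model width, e: expanded width, s: small width of the shared representation Z."""
    def __init__(self, d=256, e=512, s=128):
        super().__init__()
        self.to_u = nn.Linear(d, e)                  # gate branch U
        self.to_v = nn.Linear(d, e)                  # value branch V
        self.to_z = nn.Linear(d, s)                  # shared representation Z
        # Cheap per-dimension affine transforms turning Z into Q and K.
        self.gamma = nn.Parameter(torch.ones(2, s))
        self.beta = nn.Parameter(torch.zeros(2, s))
        self.to_out = nn.Linear(e, d)

    def forward(self, x, rel_bias=0.0):              # x: (B, N, d)
        u = F.silu(self.to_u(x))
        v = F.silu(self.to_v(x))
        z = F.silu(self.to_z(x))
        q = z * self.gamma[0] + self.beta[0]
        k = z * self.gamma[1] + self.beta[1]
        # Squared-ReLU token-token attention with a relative positional bias b.
        a = F.relu(q @ k.transpose(-2, -1) / q.shape[-1] + rel_bias) ** 2   # (B, N, N)
        return x + self.to_out(u * (a @ v))          # gated attention plus residual
```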
S-FFS operates in three main steps:
Initially, it acquires the aligned high-level input from the S-FAS module.
Next, the aligned high-level input is fed into the stacked GAU modules to generate the fused high-level feature.
Finally, the fused high-level feature is segmented by the separation modules, which divide it along the channel dimension into two parts that are subsequently fused with their respective feature layers. This specific structure is depicted in Figure 3b.
Feature Injection Section:
To inject the output information of the F-FFS module into the various layers of the first branch, this study employs attention-based fusion between the local features and the injected information. At the input of this layer, the LAF module aligns the input features using Avgpool and Bilinear operations before passing them into the injection module, while the global information is aligned with a dual-convolution approach. Following the attention fusion, an OREPA module further extracts information.
The injection of the output information of the S-FFS module into the different layers likewise relies on attention-based fusion with the local features. As in the injection module of the first branch, global information integration is employed; however, at the input end of the LAF module in the second branch, only Avgpool is used for alignment. The specific structure is illustrated in Figure 4.
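A hedged sketch of one injection step is shown below: the globally fused features gate the local features through a sigmoid attention map, the injected content is added, and a plain convolution stands in for the OREPA refinement; the module and parameter names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Inject(nn.Module):
    """Sketch of attention-based feature injection for one pyramid level."""
    def __init__(self, c_local, c_global):
        super().__init__()
        self.gate = nn.Conv2d(c_global, c_local, 1)    # produces the attention map
        self.embed = nn.Conv2d(c_global, c_local, 1)   # projects the injected content
        self.refine = nn.Conv2d(c_local, c_local, 3, padding=1)  # stand-in for OREPA

    def forward(self, local_feat, global_feat):
        # Align the injected information to the local resolution (bilinear here;
        # in the second branch of the paper only Avgpool is used for alignment).
        g = F.interpolate(global_feat, size=local_feat.shape[-2:],
                          mode="bilinear", align_corners=False)
        attn = torch.sigmoid(self.gate(g))
        return self.refine(local_feat * attn + self.embed(g))
```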
3.3. Detailed Information Extraction Module (DIEM)
In existing multi-scale feature fusion methods, a large number of multi-scale features are lost during the transformation; in particular, the detailed information of upper-level features is largely ignored. DIEM is used to reuse this lost information. It addresses the problem of incomplete pyramid feature maps through stacking, 3D convolution, and BN/SiLU, seamlessly connecting the high-dimensional information of deep feature maps with the detailed information of shallow feature maps. The input to this module is formulated as follows:
$$P^{\prime}(x, y) = P(x, y) \ast G(x, y), \qquad G(x, y) = \frac{1}{2\pi\sigma^{2}}\, e^{-\frac{x^{2}+y^{2}}{2\sigma^{2}}},$$
where $P(x, y)$ represents the two-dimensional input of the image, and $P^{\prime}(x, y)$ is obtained by smoothing it with the two-dimensional Gaussian filter $G(x, y)$.
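For illustration, a minimal sketch of the two-dimensional Gaussian smoothing step is given below; the kernel size and standard deviation are assumed values and are not specified by the paper.

```python
import torch
import torch.nn.functional as F

def gaussian_smooth(img: torch.Tensor, ksize: int = 5, sigma: float = 1.0) -> torch.Tensor:
    """Smooth a (B, C, H, W) tensor with a 2D Gaussian kernel applied per channel."""
    ax = torch.arange(ksize, dtype=torch.float32) - (ksize - 1) / 2
    g1d = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    kernel2d = g1d[:, None] * g1d[None, :]
    kernel2d = kernel2d / kernel2d.sum()
    c = img.shape[1]
    weight = kernel2d.expand(c, 1, ksize, ksize).contiguous()   # one kernel per channel
    return F.conv2d(img, weight, padding=ksize // 2, groups=c)
```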
DIEM unifies the resolutions of feature maps of different scales, concatenates and stacks them, and applies 3D convolution to extract scale-sequence features, using the M3 layer as the reference layer to focus on detailed and key information. The specific structure includes the following steps (see the sketch after this list):
Employing a single convolution to standardize the number of channels in M4 and M5.
Utilizing nearest neighbor interpolation for aligning features.
Applying “unsqueeze” to convert the input 3D feature map into a 4D feature map, thereby adding depth information.
Concatenating the resulting four-dimensional feature maps along the depth dimension via 3D concatenation.
Employing 3D convolution, normalization, and SiLU to extract scale-sequence features.
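The sketch below walks through the listed steps under assumed channel counts: 1x1 convolutions standardize the channels of M4 and M5, nearest-neighbor interpolation aligns them to the M3 reference resolution, unsqueeze adds a depth axis along which the scales are stacked, and Conv3d + BN + SiLU extracts the scale-sequence features. It is an interpretation of the structure, not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleSequenceExtractor(nn.Module):
    """Sketch of the DIEM scale-sequence step with M3 as the reference layer."""
    def __init__(self, c3, c4, c5, c_out):
        super().__init__()
        self.align4 = nn.Conv2d(c4, c3, 1)           # standardize channel counts
        self.align5 = nn.Conv2d(c5, c3, 1)
        self.conv3d = nn.Conv3d(c3, c_out, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm3d(c_out)
        self.act = nn.SiLU()

    def forward(self, m3, m4, m5):
        size = m3.shape[-2:]
        m4 = F.interpolate(self.align4(m4), size=size, mode="nearest")
        m5 = F.interpolate(self.align5(m5), size=size, mode="nearest")
        # unsqueeze adds the depth dimension; concatenation stacks the scales along it.
        x = torch.cat([f.unsqueeze(2) for f in (m3, m4, m5)], dim=2)   # (B, C, 3, H, W)
        return self.act(self.bn(self.conv3d(x)))
```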
Regarding the fusion part of DIEM, the first step feeds the P4 and P3 layers into a TFE+CSP module, and its output is combined with the two parallel outputs obtained by feeding the P3 layer into a TFE+CSP module. The second step feeds the P5 layer and the two parallel outputs into a CCSP layer (Concat+CSP), a second CCSP layer, and the CPAM layer, respectively. Finally, the outputs of the two CCSP layers and the CPAM layer are passed to the head section.
Triple Feature Encoding Module (TFE):
The Triple Feature Encoding (TFE) module [40] addresses the information loss inherent in pyramid structures, where downsampling primarily affects the top feature layer. TFE divides features into large, medium, and small scales and employs feature scaling to preserve detailed information. The TFE structure is as follows: first, the number of channels is standardized by applying ConvBNSiLU operations uniformly to the three feature maps to ensure consistency. Second, large-scale feature maps are downsampled using Avg+Max pooling followed by ConvBNSiLU operations to maintain feature diversity; medium-scale feature maps undergo ConvBNSiLU operations to produce the mesoscale output; and small-scale feature maps are upsampled using the nearest-neighbor method and then subjected to ConvBNSiLU operations to prevent information loss. Finally, the three outputs are convolved and concatenated, as depicted in Figure 5a.
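A hedged PyTorch sketch of this pipeline is given below; the channel widths, kernel sizes, and the way the average and max pooling results are combined are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_silu(c_in, c_out, k=1):
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, padding=k // 2),
                         nn.BatchNorm2d(c_out), nn.SiLU())

class TFE(nn.Module):
    """Sketch of Triple Feature Encoding: standardize channels, downsample the large
    map with Avg+Max pooling, upsample the small map with nearest interpolation,
    then concatenate the three aligned outputs at the medium scale."""
    def __init__(self, c_large, c_mid, c_small, c):
        super().__init__()
        self.pre_l = conv_bn_silu(c_large, c)
        self.pre_m = conv_bn_silu(c_mid, c)
        self.pre_s = conv_bn_silu(c_small, c)
        self.post_l = conv_bn_silu(c, c, 3)
        self.post_s = conv_bn_silu(c, c, 3)

    def forward(self, f_large, f_mid, f_small):
        size = f_mid.shape[-2:]
        l = self.pre_l(f_large)
        l = 0.5 * (F.adaptive_avg_pool2d(l, size) + F.adaptive_max_pool2d(l, size))
        l = self.post_l(l)
        m = self.pre_m(f_mid)
        s = self.post_s(F.interpolate(self.pre_s(f_small), size=size, mode="nearest"))
        return torch.cat([l, m, s], dim=1)           # (B, 3c, H_mid, W_mid)
```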
Channel and Position Attention Mechanism (CPAM):
The Channel and Position Attention Mechanism (CPAM) integrates detailed and scale-specific information across multiple channels. CPAM takes as input the complementary position attention information and the channel attention information (the output of the P3-layer TFE+CSP [15] module). After pooling and convolution, the channel attention information serves as the input to the bottom layer, while the position information enters the next layer concurrently. Finally, the output of CPAM is passed to the head section, as illustrated in Figure 5b.
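As a rough sketch of how channel and positional attention can be chained in this way, the block below derives a channel attention vector from the detail branch with global average pooling and a 1x1 convolution, and then reweights spatial positions with a 7x7 convolutional map; these specific operator choices are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class CPAM(nn.Module):
    """Sketch of a channel-and-position attention block feeding the head."""
    def __init__(self, c):
        super().__init__()
        self.channel_att = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                         nn.Conv2d(c, c, 1), nn.Sigmoid())
        self.pos_att = nn.Sequential(nn.Conv2d(c, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x_channel, x_position):
        # Channel weights from the channel branch gate the positional branch,
        # then a spatial attention map reweights every location before the head.
        y = x_position * self.channel_att(x_channel)
        return y * self.pos_att(y)
```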