Article

A Cross-Scale Feature Fusion Method for Effectively Enhancing Small Object Detection Performance

1 School of Traffic and Transportation, Shijiazhuang Tiedao University, Shijiazhuang 050043, China
2 Institute of Applied Mathematics, Hebei Academy of Sciences, Shijiazhuang 050081, China
3 Information Security Authentication Technology Innovation Center of Hebei Province, Shijiazhuang 050081, China
4 School of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang 050043, China
* Authors to whom correspondence should be addressed.
Information 2026, 17(1), 25; https://doi.org/10.3390/info17010025
Submission received: 29 November 2025 / Revised: 14 December 2025 / Accepted: 26 December 2025 / Published: 31 December 2025
(This article belongs to the Special Issue Machine Learning in Image Processing and Computer Vision)

Abstract

Deep learning-based methods for industrial product surface defect detection are replacing manual inspection, but small object detection remains a key challenge in this field. Feature pyramid structures show great potential for improving small object detection performance and constitute an important current research direction. Nevertheless, traditional feature pyramid networks still suffer from imprecise focus on key features, insufficient feature discrimination, and weak correlations between features. To address these issues, this paper proposes a plug-and-play guided focus feature pyramid network, named GF-FPN. Built on the foundation of FPN, the network adds a bottom-up guided aggregation network (GFN): through a lightweight pyramidal attention module (LPAM), the star operation, and residual connections, it establishes correlations between objects and local contextual information, as well as between shallow-level details and deep-level semantic features. This enables the feature pyramid network to focus on key features and better distinguish objects from backgrounds, thereby improving the model’s small object detection performance. Experimental results on the self-built TinyIndus dataset and NEU-DET demonstrate that the detection model based on GF-FPN is more competitive in object detection than existing models.


1. Introduction

Industrial product surface defect detection refers to detecting defects such as scratches, blowholes, cracks, and color contamination. With the development of artificial intelligence technology, machine vision-based surface defect detection is replacing manual inspection, has become a core means of industrial quality control, and is widely used in automobile manufacturing, 3C electronics, semiconductors, new energy, packaging and printing, and other fields [1]. In recent years, a large body of research has addressed deep learning-based surface defect detection. For example, Liu et al. [2] proposed GC-Net for steel surface defect detection. Zhou et al. [3] proposed IFIFusion to address the challenge that surface defect detection models cannot simultaneously locate and classify defect regions in images with high accuracy. Liang et al. [4] proposed SmallNet for magnetic core surface defects. The literature shows that deep learning-based surface defect detection has made significant progress.
As enterprises’ requirements for detection precision and efficiency continue to rise, the small object issue has become one of the major challenges in industrial product surface defect detection. Some defects are only a few pixels in size; they contain limited information and have weak feature expression, resulting in low detection precision. For products with complex surface textures, similar textures may be confused with the detail features of small objects, further increasing the difficulty of detection. In recent years, a large number of research achievements have emerged in small object detection [5], among which multi-scale feature fusion is one of the important research directions.
The core idea of multi-scale feature fusion is that images or features at different scales contain different information. High-level feature maps possess a large receptive field and strong semantic representation capability but suffer from low resolution and weak geometric representation; in contrast, low-level feature maps have high resolution and rich geometric detail but weak semantic representation. By fusing multi-scale features, the advantages of different levels can be fully exploited and the overall performance of the model improved. Multi-scale feature fusion has been adopted by essentially all mainstream object detection models and has become an important means of improving multi-scale detection performance; the most representative example is the feature pyramid network (FPN) and its variants [6,7,8,9,10,11]. However, these traditional feature pyramid models all face prominent challenges in detecting small objects (i.e., objects smaller than 32 × 32 pixels). Small objects rely heavily on low-level detail features (e.g., edges, textures) for recognition. FPN’s top-down fusion prioritizes high-level semantic features, leading to insufficient supplementation of the low-level detail information critical for small object recognition. PANet adds a bottom-up path to supplement low-level features, but it fails to distinguish the contributions of features from different paths during fusion, making useful features susceptible to dilution. BiFPN’s weighted feature fusion relies on data-driven learning; when small objects account for a small proportion of the training data, the model cannot learn effective weight allocation for small object features, and its repeated cross-scale connections and weighted fusion incur considerable computational overhead. The attention mechanism provides a viable way to address these issues.
As a key technology in deep learning, the attention mechanism has made significant progress in fields such as computer vision and natural language processing and has become an important means of improving model performance and efficiency. In computer vision, spatial attention, channel attention, and self-attention have received extensive attention, with representative models including SE-Net, CBAM, and ViT [12,13,14]. Building on these works, a range of innovative attention-based models have been proposed to better meet the requirements of visual tasks. Tang et al. [15] designed a dual-domain attention mechanism that lets the model attend to important details so as to extract representative features of PCB defects. Li et al. [16] proposed a global and local attention mechanism (GAL), which combines fine-grained local feature analysis with global contextual information processing, providing a deep modeling method for input images. Zhang et al. [17] combined diffusion convolution and channel attention mechanisms to realize dynamic feature extraction. Li et al. [18] designed a detector based on a sparse Vision Transformer, which uses selective token utilization to extract fine-grained features and aggregates features across windows to extract coarse-grained features. Representative related works also include LKR-DETR [19] and SMamba [20], among others.
These works provide important references for the research of this paper. However, how to design an effective fusion mechanism to enable the model to have more accurate focusing ability, stronger feature discrimination ability, as well as how to balance detection precision and model complexity, remain significant challenges. Generally speaking, the existing feature pyramid structures still have the following shortcomings:
1. The focus on key features is not accurate enough, and the feature discrimination ability is insufficient, leading to the loss of small object features and the accumulation of redundant, invalid information during downsampling.
2. The correlation between inter-layer features is not fully explored, resulting in insufficient expression of object features.
To address the aforementioned issues and further enhance the model’s small object detection performance, this paper proposes a guided focus feature pyramid network (GF-FPN). The network employs a bidirectional feature fusion strategy: it aggregates high-level semantic features into low-level features via channel-wise concatenation, and propagates attention weights layer-by-layer from low to high levels through the star operation (i.e., element-wise multiplication), thereby achieving feature focusing and fully exploring inter-layer feature correlations. For scale alignment, the network utilizes interpolation for upsampling. Additionally, it generates an attention guidance map and performs downsampling via a lightweight pyramidal attention module (LPAM), which enhances feature discrimination capability and facilitates the extraction of small object features. The differences between our GF-FPN and typical models are presented in Table 1. The key contributions of this paper are summarized as follows:
1. This paper proposes GF-FPN to enhance small object detection performance. Built upon the original FPN, the network incorporates a bottom-up guided focus mechanism: via a pyramidal attention module, the star operation, and residual connections, it establishes correlations between objects and local contextual information, as well as between shallow detail features and deep semantic features. This allows the feature pyramid network to focus on critical features and enhance the ability to discriminate between objects and backgrounds, thereby boosting the model’s small object detection performance.
2. This paper designs LPAM, which establishes the correlation between objects and local contextual information through a local spatial attention module (LSAM) and a full-channel correlation fusion module (FCCFM), enhancing feature discrimination ability.
3. This paper designs FCCFM, which constructs a correlation matrix between channels through a global descriptor and a self-attention mechanism, dynamically captures inter-channel correlations, and realizes the fusion of full-channel spatial features.

2. Materials and Methods

2.1. Dataset

To verify the effectiveness of the proposed method, we tested the algorithm on the self-built TinyIndus dataset and NEU-DET dataset [21].
TinyIndus: The dataset mainly includes four types of products: medicinal glass bottles, medicinal glass tubes, glass balls, and disposable medical gloves, with nine labeled defect categories. The defect categories included in the dataset and the number of instances in each category are presented in Table 2. The defect images are shown in Figure 1. The dataset is split into three parts: 80% for training, 10% for validation, and 10% for testing. The training set is used for parameter training, the validation set for hyperparameter tuning, and the test set for evaluating the trained model.
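For reproducibility, the split can be sketched in a few lines of Python. This is a minimal illustration assuming the images are collected as a list of file paths; the function name and fixed seed are our own choices, not the authors’ released tooling.

```python
import random

def split_dataset(image_paths, seed=0):
    """Shuffle and split a list of image paths 80/10/10 into
    train/val/test, mirroring the protocol described above."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (paths[:n_train],                   # training set
            paths[n_train:n_train + n_val],    # validation set
            paths[n_train + n_val:])           # test set
```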
NEU-DET: The dataset is a well-recognized and widely used benchmark dataset in the field of steel surface defect detection, developed and released by the School of Information Science and Engineering at Northeastern University, China. Specifically designed to address the challenges of automatic defect identification in industrial steel production, this dataset provides a standardized and comprehensive resource for training, validating, and comparing the performance of computer vision and object detection models. The dataset comprises a total of 1800 high-quality grayscale images, all captured from actual cold-rolled steel plates, with each image having a uniform resolution of 200 × 200 pixels. These images cover six common and typical surface defects in steel products, including crazing, inclusion, patches, pitted surface, rolled-in scale, and scratches. Each defect category contains exactly 300 images, ensuring a balanced distribution that avoids bias in model training. NEU-DET has become an essential evaluation benchmark for various state-of-the-art object detection algorithms.

2.2. The Proposed Model

The GF-FPN proposed in this paper is shown in Figure 2. To enhance small object detection performance, a bottom-up guided aggregation network is integrated with the original FPN. Unlike PANet’s bottom-up path enhancement, the guided aggregation path employs a lightweight pyramidal attention module and residual connections to replace simple convolutional downsampling and concatenation operations. During layer-by-layer downsampling, it generates the attention map for the next layer from the features of the previous layer, then applies this attention map to the feature extraction process via the star operation. This enables a more abundant feature representation and forms a spatial-channel-scale three-dimensional attention mechanism. By capturing local contextual information, it strengthens the correlation between semantic and detail features, amplifies the weight of small object features, enhances the capability to discriminate between effective features and background interference, and achieves accurate focusing on critical features. This novel structure yields superior small object detection performance. Like FPN, this feature fusion network is a plug-and-play module that can be flexibly integrated into various backbone networks. Subsequently, we elaborate on the specific implementations of the lightweight pyramidal attention module and the bottom-up guided aggregation path.
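Since the paper describes the structure rather than code, the following PyTorch sketch reflects our reading of Figure 2 and is not the authors’ implementation; the LPAM stand-in (a strided 3 × 3 convolution), the channel sizes, and the fusion convolutions are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GFFPNSketch(nn.Module):
    """Structural sketch of GF-FPN: a top-down FPN pass with channel
    concatenation, followed by a bottom-up guided path in which an
    attention map gates the next level via the star operation
    (element-wise multiplication) plus a residual connection."""

    def __init__(self, chans=(256, 512, 1024), out=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out, 1) for c in chans)
        self.fuse = nn.ModuleList(
            nn.Conv2d(2 * out, out, 3, padding=1) for _ in range(2))
        # LPAM stand-in: a strided conv producing a 2x-downsampled map.
        self.lpam = nn.ModuleList(
            nn.Conv2d(out, out, 3, stride=2, padding=1) for _ in range(2))

    def forward(self, c3, c4, c5):
        # Top-down pass: upsample deeper features, concatenate, fuse.
        f5 = self.lateral[2](c5)
        f4 = self.fuse[1](torch.cat(
            [self.lateral[1](c4), F.interpolate(f5, scale_factor=2)], 1))
        f3 = self.fuse[0](torch.cat(
            [self.lateral[0](c3), F.interpolate(f4, scale_factor=2)], 1))
        # Bottom-up guided path: sigmoid(LPAM) gates the level above
        # (star operation); the LPAM output is added back as a residual.
        p3 = f3
        g4 = self.lpam[0](p3)
        p4 = torch.sigmoid(g4) * f4 + g4
        g5 = self.lpam[1](p4)
        p5 = torch.sigmoid(g5) * f5 + g5
        return p3, p4, p5
```

A backbone producing C3–C5 at strides 8/16/32 would call this module to obtain P3–P5 for the detection head, which is what makes the design plug-and-play.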

2.2.1. The Architecture of LPAM

The traditional FPN fuses multi-scale features so that the fused feature layers contain both detail and semantic features. This enables the model to understand the input data more comprehensively, effectively compensates for the limitations of single-scale features, and improves small object detection performance to a certain extent. However, FPN lacks an active screening mechanism and only fuses features through simple convolution, upsampling, and addition/concatenation. It can neither actively suppress background noise and redundant features, nor prevent the loss of detail features during downsampling.
The attention mechanism is introduced to improve the network’s ability to distinguish features by autonomously learning different weights. However, this autonomous learning still involves a degree of blindness. The key to improving the feature discrimination ability of attention in multi-scale fusion is to make the attention weights focus more accurately on critical features, suppress noise (such as background interference and redundant features), and adapt to the characteristics of features at different scales.
In small object detection tasks, small objects have low feature dimensions, sparse semantic information, and are easily confused with background noise, resulting in significantly lower detection precision than objects of normal scales. Local contextual information (that is, the associated features within a limited range around small objects, such as the spatial position relationship between small objects and adjacent objects, the semantic attributes of the local background, etc.) has become the key to breaking through this bottleneck by supplementing the missing semantic and spatial clues of small objects [17,22]. Local contextual information is the core support for making up for insufficient features, correcting positioning deviations, and suppressing background noise in small object detection, and the degree of its effective utilization directly determines the upper limit of small object detection precision. This serves as both the starting point and theoretical support for the design of the lightweight pyramidal attention module in this paper.
The overall structure of LPAM is shown in Figure 3. The module consists of two parts: LSAM and FCCFM. The local spatial attention branch extracts basic features and extensive local contextual information through different receptive fields. The specific implementation is as follows: first, the input feature map is split into three parts along the channel dimension, mainly to reduce computation; then, depthwise dilated convolutions (DW-D-Conv) with different dilation rates, followed by activation functions, expand the receptive field and extract extensive local contextual information; finally, the three parts are concatenated along the channel dimension. LSAM completes 2× downsampling and extracts the initial spatial attention map. FCCFM then fuses cross-channel spatial information based on the initial spatial attention map, establishes correlations between features of different receptive fields, and adjusts the channel dimension to match the star operation. This module calculates the correlation among all channels through global pooling and a self-attention mechanism, followed by feature fusion. Assuming the input feature is $X \in \mathbb{R}^{H \times W \times C}$ (where $H$ and $W$ are the spatial dimensions and $C$ is the number of channels), the goal is to generate the correlation matrix $A \in \mathbb{R}^{C \times C}$ among all channels through channel self-attention. The steps are as follows:
The core of channel self-attention is to model the relationship among channels. Therefore, it is first necessary to compress the spatial dimension and condense the spatial information of each channel into a vector containing two elements to obtain the global descriptor of the channel. Here, global average pooling and maximum pooling are used:
$$v_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{i,j,c}$$
$$m_c = \max_{i,j} X_{i,j,c}$$
where $X_{i,j,c}$ is the value of the c-th channel of the input feature at position $(i, j)$. Thus, the global descriptor vector $g_c = [v_c, m_c]^T$ of the c-th channel is obtained, and the global descriptors of all channels can be expressed as $G = [g_1, g_2, \ldots, g_C]$.
Next, two linear transformations are used to generate Q and K :
$$Q = W_q \cdot G, \quad W_q \in \mathbb{R}^{d \times 2}$$
$$K = W_k \cdot G, \quad W_k \in \mathbb{R}^{d \times 2}$$
In the formula, $W_q$ and $W_k$ are learnable linear transformation matrices. Then, the attention weight between channels is calculated to dynamically measure the correlation strength. Specifically, the correlation score between channels is computed from the similarity of $Q$ and $K$ and then normalized to obtain the weights:
$$A = \mathrm{softmax}\!\left(\frac{Q^T K}{\sqrt{d}}\right)$$
The input features are weighted and summed through the attention weight matrix to obtain the full-channel fusion output features:
$$O_c = \sum_{j=1}^{C} A_{c,j} \cdot X_j$$
In the formula, $O_c$ represents the output feature of the c-th channel, and $X_j$ denotes the j-th channel of the input feature.
The above is the implementation process of the channel self-attention module. Its core idea is to fuse full-channel spatial features by learning the self-attention correlation between channels. Different from the traditional correlation coefficient calculated by a fixed formula, the self-attention weight can be dynamically adjusted through learning and can capture non-linear and complex correlations. Thus, the correlation between features of different receptive fields is established.
Finally, a 1 × 1 convolution adjusts the channel dimension to generate the attention guidance map.
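To make the module concrete, here is a minimal PyTorch sketch of LPAM under our reading of Figure 3 and the equations above; the class names, the descriptor dimension d, the assumption that the channel count is divisible by three, and the use of strided depthwise convolutions for the 2× downsampling are ours, not the authors’.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSAMSketch(nn.Module):
    """Local spatial attention: split channels into three groups, apply
    depthwise dilated 3x3 convs (dilation 1/2/3) plus pointwise convs,
    concatenate, and downsample 2x (here via the strided convs)."""

    def __init__(self, channels):          # channels divisible by 3
        super().__init__()
        g = channels // 3
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(g, g, 3, stride=2, padding=r, dilation=r, groups=g),
                nn.ReLU(inplace=True),
                nn.Conv2d(g, g, 1))        # pointwise conv
            for r in (1, 2, 3))

    def forward(self, x):
        parts = torch.chunk(x, 3, dim=1)
        return torch.cat([b(p) for b, p in zip(self.branches, parts)], dim=1)

class FCCFMSketch(nn.Module):
    """Full-channel correlation fusion following the equations above:
    per-channel average and max pooling form 2-element descriptors,
    linear maps give Q and K, and softmax(Q^T K / sqrt(d)) yields the
    C x C correlation matrix A used to re-mix channels."""

    def __init__(self, d=8):
        super().__init__()
        self.d = d
        self.wq = nn.Linear(2, d, bias=False)   # W_q in R^{d x 2}
        self.wk = nn.Linear(2, d, bias=False)   # W_k in R^{d x 2}

    def forward(self, x):                       # x: (B, C, H, W)
        g = torch.stack([x.mean(dim=(2, 3)),    # v_c
                         x.amax(dim=(2, 3))],   # m_c
                        dim=-1)                 # G: (B, C, 2)
        q, k = self.wq(g), self.wk(g)           # (B, C, d)
        a = F.softmax(q @ k.transpose(1, 2) / self.d ** 0.5, dim=-1)
        return torch.einsum('bcj,bjhw->bchw', a, x)  # O_c = sum_j A[c,j] X_j

class LPAMSketch(nn.Module):
    """LSAM + FCCFM + final 1x1 conv producing the guidance map."""

    def __init__(self, channels, out_channels, d=8):
        super().__init__()
        self.lsam = LSAMSketch(channels)
        self.fccfm = FCCFMSketch(d)
        self.proj = nn.Conv2d(channels, out_channels, 1)

    def forward(self, x):
        return self.proj(self.fccfm(self.lsam(x)))
```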
Figure 3. Architecture of LPAM. The module consists of two parts: LSAM and FCCFM.

2.2.2. LPAM Parameter Analysis

To calculate the number of parameters of this structure, we need to analyze each layer of the LSAM module and the FCCFM module separately, and clarify the calculation method for the number of parameters of each type of operation:
1. LSAM
The input of LSAM is $H \times W \times C_{in}$, which is evenly split by Split into three branches (each with $C_{in}/3$ channels). Each branch contains DW-D-Conv and ReLU, and the branches are finally merged by Concat. DW-D-Conv consists of a depthwise dilated convolution and a pointwise convolution:
  • Depthwise dilated convolution (2D, kernel = 3 × 3, dilation rate r = 1, 2, and 3, respectively):
    parameters for a single branch: $\frac{C_{in}}{3} \times 3 \times 3 = 3C_{in}$;
    total for three branches: $3 \times 3C_{in} = 9C_{in}$.
  • Pointwise convolution (2D, kernel = 1 × 1):
    parameters for a single branch: $\frac{C_{in}}{3} \times \frac{C_{in}}{3} \times 1 \times 1 = \frac{C_{in}^2}{9}$;
    total for three branches: $3 \times \frac{C_{in}^2}{9} = \frac{C_{in}^2}{3}$.
2. FCCFM
The input of FCCFM is the output of LSAM ($H/2 \times W/2 \times C_{in}$). The module comprises GlobalPool, the Q/K fully connected layers, Softmax, and a 1 × 1 Conv; GlobalPool and Softmax have no parameters.
  • Q/K fully connected layers: map the $1 \times 1 \times C_{in}$ feature after GlobalPool to dimension d.
    parameters for a single fully connected layer: $C_{in} \times d$;
    total for Q and K: $2 \times C_{in} \times d$.
  • 1 × 1 Conv: input channels $C_{in}$, output channels $C_{out}$; parameters: $C_{in} \times C_{out}$.
3. Total Number of Parameters
Summing the parameters of each part gives the total:
$$(9 + 2d)\,C_{in} + \frac{C_{in}^2}{3} + C_{in} \times C_{out}$$
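As a quick numerical sanity check of this closed-form count, the expression can be evaluated directly; the channel sizes and descriptor dimension below are illustrative.

```python
def lpam_params(c_in, c_out, d):
    """Closed-form LPAM parameter count derived above: depthwise dilated
    convs (9*C_in) + pointwise convs (C_in^2/3) + Q/K projections
    (2*d*C_in) + final 1x1 conv (C_in*C_out)."""
    return (9 + 2 * d) * c_in + c_in ** 2 // 3 + c_in * c_out

# Example: C_in = C_out = 256, d = 8 gives 93,781 parameters, versus
# roughly 590k for a plain 3x3 convolution with the same channels.
print(lpam_params(256, 256, 8))
```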

2.2.3. Guided Aggregation Network

The feature pyramid structure provides multi-scale feature support for object recognition. The feature fusion of traditional FPN is top-down: a 1 × 1 convolution merely compresses the channel count of the high-level features, which are then directly added to, or concatenated along the channel dimension with, the features of the adjacent level after upsampling. On the basis of FPN’s top-down path, PANet adds a bottom-up path enhancement. However, lacking an attention mechanism, PANet cannot dynamically focus on target areas or perform targeted enhancement of small-target regions; its fusion remains whole-layer inter-layer fusion. Although low-level features contain details, by the time they reach the high levels they have passed through multiple resampling and convolution operations, so the local critical details of small objects are smoothed or diluted and are easily corrupted by background noise in complex scenes.
The GF-FPN incorporates a bidirectional path fusion structure similar to PANet. The top-down path transmits deep semantic features to the shallow layer through the traditional FPN to generate preliminary feature fusion. Here, channel dimension concatenation is used to retain the original features to avoid information loss or weak information being “covered” by strong information. However, at this time, the feature correlation between semantics and details has not been established, and the synergistic effect of “semantics + details” cannot be fully exerted. Therefore, this paper designs a guided aggregation network (GFN). The specific implementation of the network is shown in Figure 2. It can be seen from the figure that GFN aggregates features from bottom to top, and the downsampling part is mainly composed of LSAM and a residual connection. LSAM extracts local contextual features through depthwise dilated convolution with different dilation rates, and the full-channel correlation fusion module completes feature reconstruction. The design idea and specific implementation of LSAM have been discussed in detail in the previous section, so they will not be repeated here. It should be noted that LSAM acts on the feature fusion layers (F2, F3, F4) generated by FPN. In these initially fused feature layers, the shallow detail information and deep semantic features are both stored in the channel dimension. Therefore, LSAM not only establishes the correlation between objects and local contextual information, but also establishes the correlation between details and semantic information. LSAM reassigns each spatial value of the feature layer in a weighted manner. This process is adaptively learnable; through learning, highly correlated features are aggregated and enhanced, while irrelevant features are suppressed. Accordingly, the discriminative ability between targets and background regions is improved, and critical features are preserved during downsampling. Subsequently, to enable bottom-up fusion of cross-layer features, this paper abandons the traditional concatenation approach and proposes a residual connection structure.
The specific implementation of the residual connection is as follows: the long path first passes through the sigmoid activation function to obtain the attention guidance map, then employs the star operation to directly apply the attention weight to the feature map of the next layer. This thereby continues to enhance the discriminative ability between features and realizes layer-by-layer feature aggregation. The main reason for using the star operation is to increase network nonlinearity and improve feature focusing ability. Recent efforts have demonstrated that utilizing star operation can be a more effective choice than summation in network design for feature aggregation, as exemplified by FocalNet [23] and HorNet [24]. Furthermore, reference [25] explains the strong representative ability of star operation by explicitly demonstrating that the star operation possesses the capability to map inputs into an exceedingly high-dimensional, non-linear feature space.
The short path passes the output features of LSAM “as-is” to the output end of the main path, then performs element-wise addition with the output of the long path to obtain the output features of the next layer. Residual connections enable the fusion of cross-layer features. On one hand, they boost feature reuse and prevent the loss of critical features. On the other hand, they mitigate gradient issues (i.e., gradient vanishing or exploding) caused by the star operation. Via the guided aggregation network, the final output feature layers—denoted as P3, P4, and P5—are generated. These feature layers can be fed into the detection head for object recognition and localization.
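The long/short-path combination just described reduces to a few tensor operations; a sketch with assumed tensor names:

```python
import torch

def guided_aggregation_step(lsam_out, f_next):
    """One bottom-up GFN step as described above: the long path turns
    the LSAM output into an attention guidance map via sigmoid and
    applies it to the next level's features with the star operation;
    the short path adds the LSAM output back as a residual."""
    guidance = torch.sigmoid(lsam_out)   # attention guidance map
    starred = guidance * f_next          # star operation (long path)
    return starred + lsam_out            # residual add (short path)
```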

3. Results and Discussion

3.1. Experiment Setup

Implementation Details: The development environment is Python 3.9 and PyTorch 1.13.0, running on Windows 10 with an NVIDIA GeForce RTX 3060 GPU. The key hyperparameter settings are as follows: the batch size is 8, the SGD optimizer is used, the learning rate is 0.001, and the number of training epochs is 100. The loss function is the combined loss scheme proposed in [26].
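The training loop implied by these settings is standard; in the sketch below only the hyperparameters come from the text, while the model, data, and loss are stand-ins (in practice: the CSPDarknet + GF-FPN detector, the TinyIndus loader, and the combined loss of [26]).

```python
import torch
import torch.nn as nn

# Placeholder model/data so the loop runs end to end.
model = nn.Conv2d(3, 9, 3, padding=1)
criterion = nn.MSELoss()
train_loader = [(torch.randn(8, 3, 64, 64), torch.randn(8, 9, 64, 64))
                for _ in range(4)]

# SGD with lr = 0.001 for 100 epochs; batch size 8 is handled by the loader.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
for epoch in range(100):
    for images, targets in train_loader:
        loss = criterion(model(images), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```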
Figure 4 presents the loss curves of the CSPDarknet + GF-FPN model during training. As shown in the figure, both the training and validation loss curves decrease steadily, which indicates that the model exhibits good convergence and the hyperparameter configuration is reasonable.
Evaluation Metrics: To accurately quantify the detection performance of the proposed model, this paper uses MS COCO-style evaluation metrics. AP measures the performance of the trained model on each category, and mAP measures performance across all categories. mAP50 is the average precision over all categories at an IoU threshold of 0.5. mAP50-95 takes 10 IoU thresholds from 0.5 to 0.95 with a step of 0.05, calculates the AP at each threshold, and averages the results. To further evaluate computational complexity, the number of parameters, giga floating-point operations (GFLOPs), and frames per second (FPS) are reported.
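These MS COCO-style metrics can be computed with the standard pycocotools evaluator; the JSON file names below are placeholders for ground truth and detections in COCO format.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO('annotations.json')             # ground-truth annotations
coco_dt = coco_gt.loadRes('detections.json')   # model detections
evaluator = COCOeval(coco_gt, coco_dt, iouType='bbox')
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints AP@[0.50:0.95], AP@0.50, etc.
```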

3.2. Comparison with Different Feature Pyramid Networks

To evaluate the performance of the multi-scale fusion module GF-FPN designed in this paper, two backbone networks, Darknet53 and CSPDarknet, were adopted in the experiments. GF-FPN was compared with the original feature pyramid modules (FPN and PANet), where each of the three feature pyramid modules was paired with the two backbones, respectively. All configurations utilized the same detection head and were trained for an identical number of epochs, yielding the mAP50, mAP50-95, and AP50 values for each defect category across different models. The experimental results are recorded in Table 3 and Table 4.
Regarding mAP metrics, on both the TinyIndus and NEU-DET datasets, the two models equipped with the GF-FPN module attained the top and second-highest detection performance, respectively.
On the TinyIndus dataset, when the backbone network is Darknet53, the model equipped with GF-FPN achieves mAP50 increases of 4.7% and 7.4%, and mAP50-95 increases of 3.4% and 5.5%, compared with the models equipped with PANet and FPN, respectively. For the CSPDarknet backbone, mAP50 rises by 8.1% and 9.7%, and mAP50-95 rises by 2.6% and 5.3%, respectively. As indicated by the AP values of each defect category in Table 3, GF-FPN generally yields a noticeable improvement in the detection precision of different defect categories. However, both the dataset itself and the backbone network exert a significant impact on defect detection performance. Among all categories, the tube mouth defect is the most challenging to detect due to its small scale and susceptibility to texture interference, resulting in relatively low detection precision. Even so, it exhibits a significant improvement after integrating GF-FPN.
On the NEU-DET dataset, for the Darknet53 backbone, the model equipped with GF-FPN achieves increases of 1.3% and 2.2% in mAP50, and 1.3% and 4.1% in mAP50-95, compared with the models equipped with PANet and FPN, respectively. For the CSPDarknet backbone, mAP50 rises by 2.8% and 5.7%, and mAP50-95 rises by 1.6% and 7.4%, respectively.
The visualized detection results are shown in Figure 5 and Figure 6. A set of images is randomly selected from the test dataset. The backbone network is CSPDarknet. As can be seen from the visualization experiment results, GF-FPN has a higher detection rate and higher localization accuracy compared with PANet, and its improvement in small object detection is more significant.
In general, the experimental results show that the GF-FPN designed in this paper achieves better detection performance than FPN and PANet. Moreover, our model generalizes well across different datasets.

3.3. Learnable Parameters and Computational Cost

In Section 2.2.2, we derived the formula for the number of parameters of LPAM, which is significantly smaller than that of traditional convolutional modules. In this section, we provide a more intuitive comparison. Table 5 lists the number of learnable parameters, FPS, and GFLOPs of various feature pyramid networks. As shown in the table, our GF-FPN architecture has 41.7 million learnable parameters; at an input resolution of 640 × 640, it delivers 64.2 FPS and 166.9 GFLOPs. Compared with PANet, BiFPN, and AFPN, our model has fewer parameters and the lowest GFLOPs, balancing detection performance and computational efficiency. LPAM makes the main contribution to the reduction in parameter count.

3.4. Ablation Studies

To investigate the efficacy of star operation in our GF-FPN, we replaced it with element-wise add for ablation studies. Our experimentation utilized the CSPDarknet framework with GF-FPN as the backbone. As indicated in Table 6, we observed that star operation can improve the accuracy by approximately 1.7 points. To evaluate the effect of the residual connections, we removed them from the GF-FPN. The ablation results show that the residual connections improve the detection accuracy by approximately three points. Experimental results demonstrate the advantages of the GF-FPN structure design.

3.5. Limitations and Future Work

While the GF-FPN proposed in this paper delivers favorable performance, its overall architecture is built upon the original FPN with an additional new branch, thus fully retaining the original FPN structure. This results in GF-FPN having a higher parameter number than FPN and a slightly slower inference speed. We could further optimize the FPN branch itself; for instance, replacing conventional convolutions with dilated convolutions to reduce the model’s parameter number and enhance its performance. This constitutes one of our future research directions.
Additionally, during the experiments, we observed that the model exhibits varying detection performance across different categories. However, it is challenging for us to fully analyze the reasons for this discrepancy purely from the perspective of structural design. On one hand, this may be attributed to the inherent characteristics of the dataset; on the other hand, the poor interpretability of deep learning models has long been a persistent challenge in this field. We hope that with the continuous advancement of deep learning technology, we will be able to address this issue through more refined experimental designs and in-depth theoretical research, and this will also be a key part of our future work.

4. Conclusions

Aiming at the small object detection problem in industrial product surface defects, this paper proposes a plug-and-play guided focus feature pyramid network. Built upon FPN, the network adds a bottom-up guided aggregation network. Through a lightweight pyramidal attention module, the star operation, and residual connections, it establishes correlations between objects and local contextual information, as well as between shallow details and deep semantic features. This addresses the limitations of traditional feature pyramid structures, including imprecise focusing on critical features, inadequate feature discriminative power, and weak inter-feature correlations. In addition, the lightweight pyramidal attention module proposed in this paper uses depthwise separable dilated convolution to extract local features, establishes global channel correlation through a global spatial descriptor and self-attention mechanism, and then reconstructs features based on these correlations. Our structural design realizes feature correlation and critical feature enhancement across the channel and spatial dimensions without excessive computational burden. This structure provides a new idea for fusing the attention mechanism with the feature pyramid. Experimental results show that, compared with traditional feature pyramids, the detection model based on GF-FPN has better detection performance.

Author Contributions

Conceptualization, Y.K. and Y.Z.; methodology, Y.K.; software, Y.K.; validation, Y.K. and Y.Z.; formal analysis, Y.Z. and Y.R.; investigation, Y.R. and Y.C.; resources, Y.R.; data curation, Y.R. and Y.C.; writing—original draft preparation, Y.K.; writing—review and editing, Y.K. and Y.Z.; visualization, Y.K.; supervision, Y.Z., Y.R. and Y.C.; project administration, Y.R.; funding acquisition, Y.Z., Y.R. and Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Sciences Promotion Project of Hebei Academy of Sciences (No.25A03), the National Natural Science Foundation of China (No. 61702347), the major science and technology projects of Universities in Hebei Province, China (No.2512602307A), the Natural Science Foundation of Hebei Province, China (No. F2022210007), the Science and Technology Project of Hebei Education Department, China (No. CXZX2025049), and the Central Guidance on Local Science and Technology Development Fund, China (No. 226Z0501G).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The NEU-DET dataset used in this study is publicly accessible via http://faculty.neu.edu.cn/songkechen/zh_CN/zhym/263269/list/index.htm (accessed on 7 November 2025), while other supporting data will be made available on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Qiao, Q.; Hu, H.; Ahmad, A.; Wang, K. A review of metal surface defect detection technologies in industrial applications. IEEE Access 2025, 13, 48380–48400.
2. Liu, G.; Chu, M.; Gong, R.; Zheng, Z. Global attention module and cascade fusion network for steel surface defect detection. Pattern Recognit. 2025, 158, 110979.
3. Zhou, X.; Zhang, Y.; Liu, Z.; Jiang, Z.; Ren, Z.; Mi, T.; Zhou, S. IFIFusion: An independent feature information fusion model for surface defect detection. Inf. Fusion 2025, 120, 103039.
4. Liang, W.; Sun, Y.; Zhang, S.; Bai, L.; Yang, J. SmallNet: A small defects detection network for magnetic chips based on context-weighted aggregation and feature multiscale loop fusion. IEEE Trans. Autom. Sci. Eng. 2024, 22, 10095–10106.
5. Muzammul, M.; Li, X. Comprehensive review of deep learning-based tiny object detection: Challenges, strategies, and future directions. Knowl. Inf. Syst. 2025, 67, 3825–3913.
6. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
7. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
8. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790.
9. Shi, Z.; Hu, J.; Ren, J.; Ye, H.; Yuan, X.; Ouyang, Y.; Guo, J. HS-FPN: High frequency and spatial perception FPN for tiny object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; pp. 6896–6904.
10. Du, Z.; Hu, Z.; Zhao, G.; Jin, Y.; Ma, H. Cross-layer feature pyramid transformer for small object detection in aerial images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–14.
11. Chen, Z.; Ma, Y.; Gong, Z.A.; Cao, M.; Yang, Y.; Wang, Z.; Liu, Y. R-AFPN: A residual asymptotic feature pyramid network for UAV aerial photography of small targets. Sci. Rep. 2025, 15, 16233.
12. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
13. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
14. Dosovitskiy, A. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
15. Tang, J.; Wang, Z.; Zhang, H.; Li, H.; Wu, P.; Zeng, N. A lightweight surface defect detection framework combined with dual-domain attention mechanism. Expert Syst. Appl. 2024, 238, 121726.
16. Li, Y.; Zhou, Z.; Qi, G.; Hu, G.; Zhu, Z.; Huang, X. Remote sensing micro-object detection under global and local attention mechanism. Remote Sens. 2024, 16, 644.
17. Zhang, Y.; Liu, T.; Zhen, J.; Kang, Y.; Cheng, Y. Adaptive downsampling and scale enhanced detection head for tiny object detection in remote sensing image. IEEE Geosci. Remote Sens. Lett. 2025, 22, 1–5.
18. Li, W.; Guo, Y.; Zheng, J.; Lin, H.; Ma, C.; Fang, L.; Yang, X. SparseFormer: Detecting objects in HRW shots via sparse vision transformer. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, 28 October–1 November 2024; pp. 4851–4860.
19. Dong, Y.; Xu, F.; Guo, J. LKR-DETR: Small object detection in remote sensing images based on multi-large kernel convolution. J. Real Time Image Process. 2025, 22, 46.
20. Shi, N.; Yang, Z.; Yang, G.; Li, K.; Yang, Z.; An, J. Super Mamba feature enhancement framework for small object detection. Sci. Rep. 2025, 15, 37148.
21. He, Y.; Song, K.; Meng, Q.; Yan, Y. An end-to-end steel surface defect detection approach via fusing multiple hierarchical features. IEEE Trans. Instrum. Meas. 2019, 69, 1493–1504.
22. Wang, Z.; Li, Y.; Liu, Y.; Meng, F. Improved object detection via large kernel attention. Expert Syst. Appl. 2024, 240, 122507.
23. Yang, J.; Li, C.; Dai, X.; Gao, J. Focal modulation networks. Adv. Neural Inf. Process. Syst. 2022, 35, 4203–4217.
24. Rao, Y.; Zhao, W.; Tang, Y.; Zhou, J.; Lim, S.N.; Lu, J. HorNet: Efficient high-order spatial interactions with recursive gated convolutions. Adv. Neural Inf. Process. Syst. 2022, 35, 10353–10366.
25. Ma, X.; Dai, X.; Bai, Y.; Wang, Y.; Fu, Y. Rewrite the stars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 5694–5703.
26. Zhang, Y.; Wu, C.; Guo, W.; Zhang, T.; Li, W. CFANet: Efficient detection of UAV image based on cross-layer feature aggregation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–11.
27. Yang, G.; Lei, J.; Zhu, Z.; Cheng, S.; Feng, Z.; Liang, R. AFPN: Asymptotic feature pyramid network for object detection. In Proceedings of the 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Hawaii, USA, 3–7 October 2023; pp. 2184–2189.
Figure 1. Images of products and defects in the TinyIndus dataset: (a) glass balls: ball pock; (b) disposable medical gloves: glove flaw; (c) medicinal glass tubes: tube mouth flaw, tube body flaw, and tube body pock; (d) medicinal glass bottles: bottle bottom flaw, bottle body flaw, bottle body dirty, and bottle neck flaw.
Figure 2. Overall architecture of the proposed GF-FPN. A bottom-up guided aggregation network is added on the basis of FPN.
Figure 4. CSPDarknet + GF-FPN model training status on the TinyIndus dataset.
Figure 5. Visualization of the prediction results on the TinyIndus dataset. Different colors represent different defect categories.
Figure 6. Visualization of the prediction results on the NEU-DET dataset. Different colors represent different defect categories; to more intuitively show the positional differences between the predicted and ground-truth boxes, a separate color scheme is used for the ground-truth boxes.
Table 1. Novelty matrix.

Model | Attention Type | Key Contribution | Limitation
FPN [6] | no attention module | top-down fusion, lateral connections | insufficient transmission of semantic information
PANet [7] | no attention module | added bottom-up path aggregation | no weighting mechanism
BiFPN [8] | no attention module | weighted bidirectional fusion, pruned invalid fusion nodes | weighting mechanism relies on hyperparameter tuning
CFPT [10] | cross-layer channel-wise attention (CCA) and cross-layer spatial-wise attention (CSA) | upsampler-free, global contextual information | lacks attention to local contextual information
GF-FPN (ours) | local spatial attention module (LSAM) and full-channel correlation fusion module (FCCFM) | bottom-up guided focus network, attention to local contextual information | none observed
Table 2. Statistics on the number of instances in the TinyIndus dataset.

Defect | Code Name | Instances
bottle neck flaw | xl-j-flaw (xjf) | 134
bottle body flaw | xl-s-flaw (xsf) | 119
bottle body dirty | xl-s-dirty (xsd) | 172
bottle bottom flaw | xl-d-flaw (xdf) | 76
tube body flaw | gs-flaw (gsf) | 175
tube mouth flaw | gk-flaw (gkf) | 254
tube body pock | gs-pock (gsp) | 228
ball pock | q-pock (qp) | 362
glove flaw | st-flaw (stf) | 166
Table 3. Test results of different models on the TinyIndus dataset.

Model | mAP50 | mAP50-95 | xjf | xsf | xsd | xdf | gsf | gsp | gkf | qp | stf
Darknet53 + FPN | 58.9 | 19.8 | 67.6 | 94.6 | 79.2 | 38.0 | 76.0 | 48.6 | 14.6 | 49.3 | 62
CSPDarknet + FPN | 60.2 | 21.1 | 64 | 95.7 | 82.1 | 43.9 | 79.4 | 56.0 | 13.1 | 43.2 | 64.4
Darknet53 + PANet | 61.6 | 21.9 | 69.7 | 99.5 | 81.5 | 58.8 | 67.2 | 44.1 | 18.2 | 49.4 | 66.4
CSPDarknet + PANet | 61.8 | 23.8 | 67.3 | 98.8 | 80.0 | 44.9 | 78.6 | 61.6 | 14.8 | 41.3 | 69.1
Darknet53 + GF-FPN | 66.3 | 25.3 | 79.8 | 99.5 | 87.4 | 60.3 | 79.2 | 47.7 | 20.4 | 51.3 | 70.7
CSPDarknet + GF-FPN | 69.9 | 26.4 | 77.7 | 99.5 | 89 | 60.9 | 84.9 | 63.0 | 33.1 | 46.2 | 74.6
Table 4. Test results of different models on the NEU-DET dataset.

Model | mAP50 | mAP50-95 | Crazing | Inclusion | Patches | Pitted Surface | Rolled-in Scale | Scratches
Darknet53 + FPN | 72.7 | 39.3 | 39.6 | 78.9 | 91.4 | 76.6 | 61.9 | 87.6
CSPDarknet + FPN | 71.2 | 36.4 | 35.5 | 77.0 | 92.6 | 76.0 | 57.2 | 89.1
Darknet53 + PANet | 73.6 | 42.1 | 44.6 | 79.7 | 91.5 | 76.4 | 59.5 | 89.9
CSPDarknet + PANet | 74.1 | 42.2 | 41.9 | 79.0 | 93.0 | 78.4 | 61.9 | 90.4
Darknet53 + GF-FPN | 74.9 | 43.4 | 45.5 | 80.0 | 93.1 | 79.2 | 60.5 | 91.1
CSPDarknet + GF-FPN | 76.9 | 43.8 | 47.1 | 82.1 | 93.3 | 81.8 | 63.8 | 93.6
Table 5. Comparison of model performance on the TinyIndus dataset.

Model | Backbone | FPS | Params (M) | GFLOPs
FPN | CSPDarknet | 65.8 | 40.7 | 158.4
PANet | CSPDarknet | 62.9 | 46.2 | 182.6
BiFPN | CSPDarknet | - | 46.8 | 188.3
AFPN [27] | CSPDarknet | - | 58.5 | 224.8
GF-FPN | CSPDarknet | 64.2 | 41.7 | 166.9
Table 6. Results of ablation experiments.

Backbone | Element-Wise Mul (Star) | Element-Wise Add | Residual Connection | mAP50 | GFLOPs
CSPDarknet + GF-FPN | ✓ |  | ✓ | 69.9 | 166.9
CSPDarknet + GF-FPN |  | ✓ | ✓ | 68.2 | 166.9
CSPDarknet + GF-FPN | ✓ |  |  | 66.7 | 168.5
CSPDarknet + GF-FPN |  | ✓ |  | 65.1 | 168.5
Note: ✓ denotes that the corresponding operation is selected.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
