Article

A Cross-Scale Feature Fusion Method for Effectively Enhancing Small Object Detection Performance

1 School of Traffic and Transportation, Shijiazhuang Tiedao University, Shijiazhuang 050043, China
2 Institute of Applied Mathematics, Hebei Academy of Sciences, Shijiazhuang 050081, China
3 Information Security Authentication Technology Innovation Center of Hebei Province, Shijiazhuang 050081, China
4 School of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang 050043, China
* Authors to whom correspondence should be addressed.
Information 2026, 17(1), 25; https://doi.org/10.3390/info17010025
Submission received: 29 November 2025 / Revised: 14 December 2025 / Accepted: 26 December 2025 / Published: 31 December 2025
(This article belongs to the Special Issue Machine Learning in Image Processing and Computer Vision)

Abstract

Deep learning-based methods for industrial product surface defect detection are replacing manual inspection, but small object detection remains a key challenge in this field. Feature pyramid structures show great potential for improving small object detection performance and constitute an important current research direction. Nevertheless, traditional feature pyramid networks still suffer from imprecise focus on key features, insufficient feature discrimination, and weak correlations between features. To address these issues, this paper proposes a plug-and-play guided focus feature pyramid network, named GF-FPN. Built on the foundation of FPN, the network adds a bottom-up guided aggregation network (GFN): through a lightweight pyramidal attention module (LPAM), the star operation, and residual connections, it establishes correlations between objects and local contextual information, as well as between shallow-level details and deep-level semantic features. This enables the feature pyramid network to focus on key features and better distinguish objects from backgrounds, thereby improving the model’s small object detection performance. Experimental results on the self-built TinyIndus dataset and NEU-DET demonstrate that the detection model based on GF-FPN is more competitive in object detection than existing models.


1. Introduction

Industrial product surface defect detection refers to detecting defects such as scratches, blowholes, cracks, and color contamination. With the development of artificial intelligence technology, machine vision-based surface defect detection is replacing manual inspection, has become a core means of industrial quality control, and is widely used in automobile manufacturing, 3C electronics, semiconductors, new energy, packaging and printing, and other fields [1]. In recent years, a large body of research has addressed deep learning-based surface defect detection. For example, Liu et al. [2] proposed GC-Net for steel surface defect detection. Zhou et al. [3] proposed IFIFusion to address the challenge that surface defect detection models cannot simultaneously locate and classify defect regions in images with high accuracy. Liang et al. [4] proposed SmallNet for magnetic core surface defects. The literature shows that deep learning-based surface defect detection has made significant progress.
As enterprises’ requirements for detection precision and efficiency continue to rise, the small object issue has become one of the major challenges in industrial product surface defect detection. Some defects are only a few pixels in size; they contain limited information and have weak feature expression, resulting in low detection precision. For products with complex surface textures, similar textures may be confused with the detail features of small objects, further increasing the difficulty of detection. In recent years, a large number of research achievements have emerged in small object detection [5], among which multi-scale feature fusion is one of the important research directions.
The core idea of multi-scale feature fusion is that images or features at different scales contain different information. High-level feature maps possess a large receptive field and strong semantic representation capability but suffer from low resolution and weak geometric representation; in contrast, low-level feature maps have high resolution and rich geometric detail but weak semantic representation. By fusing multi-scale features, the advantages of different levels can be fully exploited and the overall performance of the model improved. Multi-scale feature fusion has been adopted by essentially all mainstream object detection models and has become an important means of improving multi-scale detection performance; the most representative example is the feature pyramid network (FPN) and its variants [6,7,8,9,10,11]. However, these traditional feature pyramid models all face prominent challenges in detecting small objects (i.e., objects smaller than 32 × 32 pixels). Small objects rely heavily on low-level detail features (e.g., edges, textures) for recognition. FPN’s top-down fusion prioritizes high-level semantic features, leading to insufficient supplementation of the low-level detail information critical for small object recognition. PANet adds a bottom-up path to supplement low-level features, but it fails to distinguish the contributions of features from different paths during fusion, making useful features susceptible to dilution. BiFPN’s weighted feature fusion relies on data-driven learning; when small objects account for a small proportion of the training data, the model cannot learn effective weight allocation for small object features, and its repeated cross-scale connections and weighted fusion incur considerable computational overhead. The attention mechanism provides a viable way to address these issues.
As a key technology in deep learning, the attention mechanism has made significant progress in fields such as computer vision and natural language processing and has become an important means of improving model performance and efficiency. In computer vision, spatial attention, channel attention, and self-attention have received extensive attention, with representative models including SE-Net, CBAM, and ViT [12,13,14]. Building on these works, a range of innovative attention-based models have been proposed to better meet the requirements of visual tasks. Tang et al. [15] designed a dual-domain attention mechanism that lets the model attend to important details so as to extract representative features of PCB defects. Li et al. [16] proposed a global and local attention mechanism (GAL), which combines fine-grained local feature analysis with global contextual information processing, providing a deep modeling method for input images. Zhang et al. [17] combined diffusion convolution and channel attention mechanisms to realize dynamic feature extraction. Li et al. [18] designed a detector based on a sparse Vision Transformer, which uses selective token utilization to extract fine-grained features and aggregates features across windows to extract coarse-grained features. Representative related works also include LKR-DETR [19] and SMamba [20], among others.
These works provide important references for the research of this paper. However, how to design an effective fusion mechanism to enable the model to have more accurate focusing ability, stronger feature discrimination ability, as well as how to balance detection precision and model complexity, remain significant challenges. Generally speaking, the existing feature pyramid structures still have the following shortcomings:
1. The focus on key features is not accurate enough, and the feature discrimination ability is insufficient, leading to the loss of small object features and the accumulation of redundant, invalid information during downsampling.
2. The correlation between inter-layer features is not fully explored, resulting in insufficient expression of object features.
To address the aforementioned issues and further enhance the model’s small object detection performance, this paper proposes a guided focus feature pyramid network (GF-FPN). The network employs a bidirectional feature fusion strategy: it aggregates high-level semantic features into low-level features via channel-wise concatenation, and propagates attention weights layer-by-layer from low to high levels through the star operation (i.e., element-wise multiplication), thereby achieving feature focusing and fully exploring inter-layer feature correlations. For scale alignment, the network utilizes interpolation for upsampling. Additionally, it generates an attention guidance map and performs downsampling via a lightweight pyramidal attention module (LPAM), which enhances feature discrimination capability and facilitates the extraction of small object features. The differences between our GF-FPN and typical models are presented in Table 1. The key contributions of this paper are summarized as follows:
1. This paper proposes GF-FPN to enhance small object detection performance. Built upon the original FPN, the network incorporates a bottom-up guided focus mechanism: via a pyramidal attention module, the star operation, and residual connections, it establishes correlations between objects and local contextual information, as well as between shallow detail features and deep semantic features. This allows the feature pyramid network to focus on critical features and enhance the ability to discriminate between objects and backgrounds, thereby boosting the model’s small object detection performance.
2. This paper designs LPAM, which establishes the correlation between objects and local contextual information through a local spatial attention module (LSAM) and a full-channel correlation fusion module (FCCFM), enhancing feature discrimination ability.
3. This paper designs FCCFM, which constructs a correlation matrix between channels through a global descriptor and a self-attention mechanism, dynamically captures inter-channel correlations, and realizes the fusion of full-channel spatial features.

2. Materials and Methods

2.1. Dataset

To verify the effectiveness of the proposed method, we tested the algorithm on the self-built TinyIndus dataset and NEU-DET dataset [21].
TinyIndus: The dataset mainly includes four types of products: medicinal glass bottles, medicinal glass tubes, glass balls, and disposable medical gloves, with nine labeled defect categories. The defect categories included in the dataset and the number of instances in each category are presented in Table 2. The defect images are shown in Figure 1. The dataset is split into three parts: 80% for training, 10% for validation, and 10% for testing. The training set is used for parameter training, the validation set for hyperparameter tuning, and the test set for evaluating the trained model.
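For reproducibility, the split can be sketched in a few lines of Python. This is a minimal illustration assuming the images are collected as a list of file paths; the function name and fixed seed are our own choices, not the authors’ released tooling.

```python
import random

def split_dataset(image_paths, seed=0):
    """Shuffle and split a list of image paths 80/10/10 into
    train/val/test, mirroring the protocol described above."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (paths[:n_train],                   # training set
            paths[n_train:n_train + n_val],    # validation set
            paths[n_train + n_val:])           # test set
```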
NEU-DET: The dataset is a well-recognized and widely used benchmark dataset in the field of steel surface defect detection, developed and released by the School of Information Science and Engineering at Northeastern University, China. Specifically designed to address the challenges of automatic defect identification in industrial steel production, this dataset provides a standardized and comprehensive resource for training, validating, and comparing the performance of computer vision and object detection models. The dataset comprises a total of 1800 high-quality grayscale images, all captured from actual cold-rolled steel plates, with each image having a uniform resolution of 200 × 200 pixels. These images cover six common and typical surface defects in steel products, including crazing, inclusion, patches, pitted surface, rolled-in scale, and scratches. Each defect category contains exactly 300 images, ensuring a balanced distribution that avoids bias in model training. NEU-DET has become an essential evaluation benchmark for various state-of-the-art object detection algorithms.

2.2. The Proposed Model

The GF-FPN proposed in this paper is shown in Figure 2. To enhance small object detection performance, a bottom-up guided aggregation network is integrated with the original FPN. Unlike PANet’s bottom-up path enhancement, the guided aggregation path employs a lightweight pyramidal attention module and residual connections to replace simple convolutional downsampling and concatenation operations. During layer-by-layer downsampling, it generates the attention map for the next layer from the features of the previous layer, then applies this attention map to the feature extraction process via the star operation. This enables a more abundant feature representation and forms a spatial-channel-scale three-dimensional attention mechanism. By capturing local contextual information, it strengthens the correlation between semantic and detail features, amplifies the weight of small object features, enhances the capability to discriminate between effective features and background interference, and achieves accurate focusing on critical features. This novel structure yields superior small object detection performance. Like FPN, this feature fusion network is a plug-and-play module that can be flexibly integrated into various backbone networks. Subsequently, we elaborate on the specific implementations of the lightweight pyramidal attention module and the bottom-up guided aggregation path.
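Since the paper describes the structure rather than code, the following PyTorch sketch reflects our reading of Figure 2 and is not the authors’ implementation; the LPAM stand-in (a strided 3 × 3 convolution), the channel sizes, and the fusion convolutions are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GFFPNSketch(nn.Module):
    """Structural sketch of GF-FPN: a top-down FPN pass with channel
    concatenation, followed by a bottom-up guided path in which an
    attention map gates the next level via the star operation
    (element-wise multiplication) plus a residual connection."""

    def __init__(self, chans=(256, 512, 1024), out=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out, 1) for c in chans)
        self.fuse = nn.ModuleList(
            nn.Conv2d(2 * out, out, 3, padding=1) for _ in range(2))
        # LPAM stand-in: a strided conv producing a 2x-downsampled map.
        self.lpam = nn.ModuleList(
            nn.Conv2d(out, out, 3, stride=2, padding=1) for _ in range(2))

    def forward(self, c3, c4, c5):
        # Top-down pass: upsample deeper features, concatenate, fuse.
        f5 = self.lateral[2](c5)
        f4 = self.fuse[1](torch.cat(
            [self.lateral[1](c4), F.interpolate(f5, scale_factor=2)], 1))
        f3 = self.fuse[0](torch.cat(
            [self.lateral[0](c3), F.interpolate(f4, scale_factor=2)], 1))
        # Bottom-up guided path: sigmoid(LPAM) gates the level above
        # (star operation); the LPAM output is added back as a residual.
        p3 = f3
        g4 = self.lpam[0](p3)
        p4 = torch.sigmoid(g4) * f4 + g4
        g5 = self.lpam[1](p4)
        p5 = torch.sigmoid(g5) * f5 + g5
        return p3, p4, p5
```

A backbone producing C3–C5 at strides 8/16/32 would call this module to obtain P3–P5 for the detection head, which is what makes the design plug-and-play.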

2.2.1. The Architecture of LPAM

The traditional FPN fuses multi-scale features so that the fused feature layers contain both detail and semantic features. This enables the model to understand the input data more comprehensively, effectively compensates for the limitations of single-scale features, and improves small object detection performance to a certain extent. However, FPN lacks an active screening mechanism and only fuses features through simple convolution, upsampling, and addition/concatenation. It can neither actively suppress background noise and redundant features, nor prevent the loss of detail features during downsampling.
The attention mechanism is introduced to improve the network’s ability to distinguish features by autonomously learning different weights. However, this autonomous learning still involves a degree of blindness. The key to improving the feature discrimination ability of attention in multi-scale fusion is to make the attention weights focus more accurately on critical features, suppress noise (such as background interference and redundant features), and adapt to the characteristics of features at different scales.
In small object detection tasks, small objects have low feature dimensions, sparse semantic information, and are easily confused with background noise, resulting in significantly lower detection precision than objects of normal scales. Local contextual information (that is, the associated features within a limited range around small objects, such as the spatial position relationship between small objects and adjacent objects, the semantic attributes of the local background, etc.) has become the key to breaking through this bottleneck by supplementing the missing semantic and spatial clues of small objects [17,22]. Local contextual information is the core support for making up for insufficient features, correcting positioning deviations, and suppressing background noise in small object detection, and the degree of its effective utilization directly determines the upper limit of small object detection precision. This serves as both the starting point and theoretical support for the design of the lightweight pyramidal attention module in this paper.
The overall structure of LPAM is shown in Figure 3. The module consists of two parts: LSAM and FCCFM. The local spatial attention branch extracts basic features and extensive local contextual information through different receptive fields. The specific implementation is as follows: first, the input feature map is split into three parts along the channel dimension, mainly to reduce computation; then, depthwise dilated convolutions (DW-D-Conv) with different dilation rates, followed by activation functions, expand the receptive field and extract extensive local contextual information; finally, the three parts are concatenated along the channel dimension. LSAM completes 2× downsampling and extracts the initial spatial attention map. FCCFM then fuses cross-channel spatial information based on the initial spatial attention map, establishes correlations between features of different receptive fields, and adjusts the channel dimension to match the star operation. This module calculates the correlation among all channels through global pooling and a self-attention mechanism, followed by feature fusion. Assuming the input feature is $X \in \mathbb{R}^{H \times W \times C}$ (where $H$ and $W$ are the spatial dimensions and $C$ is the number of channels), the goal is to generate the correlation matrix $A \in \mathbb{R}^{C \times C}$ among all channels through channel self-attention. The steps are as follows:
The core of channel self-attention is to model the relationship among channels. Therefore, it is first necessary to compress the spatial dimension and condense the spatial information of each channel into a vector containing two elements to obtain the global descriptor of the channel. Here, global average pooling and maximum pooling are used:
$$v_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{i,j,c}$$
$$m_c = \max_{i,j} X_{i,j,c}$$
where $X_{i,j,c}$ is the value of the c-th channel of the input feature at position $(i, j)$. Thus, the global descriptor vector $g_c = [v_c, m_c]^T$ of the c-th channel is obtained, and the global descriptors of all channels can be expressed as $G = [g_1, g_2, \ldots, g_C]$.
Next, two linear transformations are used to generate Q and K :
$$Q = W_q \cdot G, \quad W_q \in \mathbb{R}^{d \times 2}$$
$$K = W_k \cdot G, \quad W_k \in \mathbb{R}^{d \times 2}$$
In the formula, $W_q$ and $W_k$ are learnable linear transformation matrices. Then, the attention weight between channels is calculated to dynamically measure the correlation strength. Specifically, the correlation score between channels is computed from the similarity of $Q$ and $K$ and then normalized to obtain the weights:
$$A = \mathrm{softmax}\!\left(\frac{Q^T K}{\sqrt{d}}\right)$$
The input features are weighted and summed through the attention weight matrix to obtain the full-channel fusion output features:
$$O_c = \sum_{j=1}^{C} A_{c,j} \cdot X_j$$
In the formula, $O_c$ represents the output feature of the c-th channel, and $X_j$ denotes the j-th channel of the input feature.
The above is the implementation process of the channel self-attention module. Its core idea is to fuse full-channel spatial features by learning the self-attention correlation between channels. Different from the traditional correlation coefficient calculated by a fixed formula, the self-attention weight can be dynamically adjusted through learning and can capture non-linear and complex correlations. Thus, the correlation between features of different receptive fields is established.
Finally, a 1 × 1 convolution adjusts the channel dimension to generate the attention guidance map.
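To make the module concrete, here is a minimal PyTorch sketch of LPAM under our reading of Figure 3 and the equations above; the class names, the descriptor dimension d, the assumption that the channel count is divisible by three, and the use of strided depthwise convolutions for the 2× downsampling are ours, not the authors’.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSAMSketch(nn.Module):
    """Local spatial attention: split channels into three groups, apply
    depthwise dilated 3x3 convs (dilation 1/2/3) plus pointwise convs,
    concatenate, and downsample 2x (here via the strided convs)."""

    def __init__(self, channels):          # channels divisible by 3
        super().__init__()
        g = channels // 3
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(g, g, 3, stride=2, padding=r, dilation=r, groups=g),
                nn.ReLU(inplace=True),
                nn.Conv2d(g, g, 1))        # pointwise conv
            for r in (1, 2, 3))

    def forward(self, x):
        parts = torch.chunk(x, 3, dim=1)
        return torch.cat([b(p) for b, p in zip(self.branches, parts)], dim=1)

class FCCFMSketch(nn.Module):
    """Full-channel correlation fusion following the equations above:
    per-channel average and max pooling form 2-element descriptors,
    linear maps give Q and K, and softmax(Q^T K / sqrt(d)) yields the
    C x C correlation matrix A used to re-mix channels."""

    def __init__(self, d=8):
        super().__init__()
        self.d = d
        self.wq = nn.Linear(2, d, bias=False)   # W_q in R^{d x 2}
        self.wk = nn.Linear(2, d, bias=False)   # W_k in R^{d x 2}

    def forward(self, x):                       # x: (B, C, H, W)
        g = torch.stack([x.mean(dim=(2, 3)),    # v_c
                         x.amax(dim=(2, 3))],   # m_c
                        dim=-1)                 # G: (B, C, 2)
        q, k = self.wq(g), self.wk(g)           # (B, C, d)
        a = F.softmax(q @ k.transpose(1, 2) / self.d ** 0.5, dim=-1)
        return torch.einsum('bcj,bjhw->bchw', a, x)  # O_c = sum_j A[c,j] X_j

class LPAMSketch(nn.Module):
    """LSAM + FCCFM + final 1x1 conv producing the guidance map."""

    def __init__(self, channels, out_channels, d=8):
        super().__init__()
        self.lsam = LSAMSketch(channels)
        self.fccfm = FCCFMSketch(d)
        self.proj = nn.Conv2d(channels, out_channels, 1)

    def forward(self, x):
        return self.proj(self.fccfm(self.lsam(x)))
```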
Figure 3. Architecture of LPAM. The module consists of two parts: LSAM and FCCFM.

2.2.2. LPAM Parameter Analysis

To calculate the number of parameters of this structure, we need to analyze each layer of the LSAM module and the FCCFM module separately, and clarify the calculation method for the number of parameters of each type of operation:
1. LSAM
The input of LSAM is $H \times W \times C_{in}$, which is evenly split by Split into three branches (each with $C_{in}/3$ channels). Each branch contains DW-D-Conv and ReLU, and the branches are finally merged by Concat. DW-D-Conv consists of a depthwise dilated convolution and a pointwise convolution:
  • Depthwise dilated convolution (2D, kernel = 3 × 3, dilation rate r = 1, 2, and 3, respectively):
    parameters for a single branch: $\frac{C_{in}}{3} \times 3 \times 3 = 3C_{in}$;
    total for three branches: $3 \times 3C_{in} = 9C_{in}$.
  • Pointwise convolution (2D, kernel = 1 × 1):
    parameters for a single branch: $\frac{C_{in}}{3} \times \frac{C_{in}}{3} \times 1 \times 1 = \frac{C_{in}^2}{9}$;
    total for three branches: $3 \times \frac{C_{in}^2}{9} = \frac{C_{in}^2}{3}$.
2. FCCFM
The input of FCCFM is the output of LSAM ($H/2 \times W/2 \times C_{in}$). The module comprises GlobalPool, the Q/K fully connected layers, Softmax, and a 1 × 1 Conv; GlobalPool and Softmax have no parameters.
  • Q/K fully connected layers: map the $1 \times 1 \times C_{in}$ feature after GlobalPool to dimension d.
    parameters for a single fully connected layer: $C_{in} \times d$;
    total for Q and K: $2 \times C_{in} \times d$.
  • 1 × 1 Conv: input channels $C_{in}$, output channels $C_{out}$; parameters: $C_{in} \times C_{out}$.
3. Total Number of Parameters
Summing the parameters of each part gives the total:
$$(9 + 2d)\,C_{in} + \frac{C_{in}^2}{3} + C_{in} \times C_{out}$$
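As a quick numerical sanity check of this closed-form count, the expression can be evaluated directly; the channel sizes and descriptor dimension below are illustrative.

```python
def lpam_params(c_in, c_out, d):
    """Closed-form LPAM parameter count derived above: depthwise dilated
    convs (9*C_in) + pointwise convs (C_in^2/3) + Q/K projections
    (2*d*C_in) + final 1x1 conv (C_in*C_out)."""
    return (9 + 2 * d) * c_in + c_in ** 2 // 3 + c_in * c_out

# Example: C_in = C_out = 256, d = 8 gives 93,781 parameters, versus
# roughly 590k for a plain 3x3 convolution with the same channels.
print(lpam_params(256, 256, 8))
```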

2.2.3. Guided Aggregation Network

The feature pyramid structure provides multi-scale feature support for object recognition. The feature fusion of traditional FPN is top-down: a 1 × 1 convolution merely compresses the channel count of the high-level features, which are then directly added to, or concatenated along the channel dimension with, the features of the adjacent level after upsampling. On the basis of FPN’s top-down path, PANet adds a bottom-up path enhancement. However, lacking an attention mechanism, PANet cannot dynamically focus on target areas or perform targeted enhancement of small-target regions; its fusion remains whole-layer inter-layer fusion. Although low-level features contain details, by the time they reach the high levels they have passed through multiple resampling and convolution operations, so the local critical details of small objects are smoothed or diluted and are easily corrupted by background noise in complex scenes.
The GF-FPN incorporates a bidirectional path fusion structure similar to PANet. The top-down path transmits deep semantic features to the shallow layer through the traditional FPN to generate preliminary feature fusion. Here, channel dimension concatenation is used to retain the original features to avoid information loss or weak information being “covered” by strong information. However, at this time, the feature correlation between semantics and details has not been established, and the synergistic effect of “semantics + details” cannot be fully exerted. Therefore, this paper designs a guided aggregation network (GFN). The specific implementation of the network is shown in Figure 2. It can be seen from the figure that GFN aggregates features from bottom to top, and the downsampling part is mainly composed of LSAM and a residual connection. LSAM extracts local contextual features through depthwise dilated convolution with different dilation rates, and the full-channel correlation fusion module completes feature reconstruction. The design idea and specific implementation of LSAM have been discussed in detail in the previous section, so they will not be repeated here. It should be noted that LSAM acts on the feature fusion layers (F2, F3, F4) generated by FPN. In these initially fused feature layers, the shallow detail information and deep semantic features are both stored in the channel dimension. Therefore, LSAM not only establishes the correlation between objects and local contextual information, but also establishes the correlation between details and semantic information. LSAM reassigns each spatial value of the feature layer in a weighted manner. This process is adaptively learnable; through learning, highly correlated features are aggregated and enhanced, while irrelevant features are suppressed. Accordingly, the discriminative ability between targets and background regions is improved, and critical features are preserved during downsampling. Subsequently, to enable bottom-up fusion of cross-layer features, this paper abandons the traditional concatenation approach and proposes a residual connection structure.
The specific implementation of the residual connection is as follows: the long path first passes through the sigmoid activation function to obtain the attention guidance map, then employs the star operation to directly apply the attention weight to the feature map of the next layer. This thereby continues to enhance the discriminative ability between features and realizes layer-by-layer feature aggregation. The main reason for using the star operation is to increase network nonlinearity and improve feature focusing ability. Recent efforts have demonstrated that utilizing star operation can be a more effective choice than summation in network design for feature aggregation, as exemplified by FocalNet [23] and HorNet [24]. Furthermore, reference [25] explains the strong representative ability of star operation by explicitly demonstrating that the star operation possesses the capability to map inputs into an exceedingly high-dimensional, non-linear feature space.
The short path passes the output features of LSAM “as-is” to the output end of the main path, then performs element-wise addition with the output of the long path to obtain the output features of the next layer. Residual connections enable the fusion of cross-layer features. On one hand, they boost feature reuse and prevent the loss of critical features. On the other hand, they mitigate gradient issues (i.e., gradient vanishing or exploding) caused by the star operation. Via the guided aggregation network, the final output feature layers—denoted as P3, P4, and P5—are generated. These feature layers can be fed into the detection head for object recognition and localization.
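The long/short-path combination just described reduces to a few tensor operations; a sketch with assumed tensor names:

```python
import torch

def guided_aggregation_step(lsam_out, f_next):
    """One bottom-up GFN step as described above: the long path turns
    the LSAM output into an attention guidance map via sigmoid and
    applies it to the next level's features with the star operation;
    the short path adds the LSAM output back as a residual."""
    guidance = torch.sigmoid(lsam_out)   # attention guidance map
    starred = guidance * f_next          # star operation (long path)
    return starred + lsam_out            # residual add (short path)
```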

3. Results and Discussion

3.1. Experiment Setup

Implementation Details: The development environment is Python 3.9 and PyTorch 1.13.0, running on Windows 10 with an NVIDIA GeForce RTX 3060 GPU. The key hyperparameter settings are as follows: the batch size is 8, the SGD optimizer is used, the learning rate is 0.001, and the number of training epochs is 100. The loss function is the combined loss scheme proposed in [26].
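The training loop implied by these settings is standard; in the sketch below only the hyperparameters come from the text, while the model, data, and loss are stand-ins (in practice: the CSPDarknet + GF-FPN detector, the TinyIndus loader, and the combined loss of [26]).

```python
import torch
import torch.nn as nn

# Placeholder model/data so the loop runs end to end.
model = nn.Conv2d(3, 9, 3, padding=1)
criterion = nn.MSELoss()
train_loader = [(torch.randn(8, 3, 64, 64), torch.randn(8, 9, 64, 64))
                for _ in range(4)]

# SGD with lr = 0.001 for 100 epochs; batch size 8 is handled by the loader.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
for epoch in range(100):
    for images, targets in train_loader:
        loss = criterion(model(images), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```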
Figure 4 presents the loss curves of the CSPDarknet + GF-FPN model during training. As shown in the figure, both the training and validation loss curves decrease steadily, which indicates that the model exhibits good convergence and the hyperparameter configuration is reasonable.
Evaluation Metrics: To accurately quantify the detection performance of the proposed model, this paper uses MS COCO-style evaluation metrics. AP measures the performance of the trained model on each category, and mAP measures performance across all categories. mAP50 is the average precision over all categories at an IoU threshold of 0.5. mAP50-95 takes 10 IoU thresholds from 0.5 to 0.95 with a step of 0.05, calculates the AP at each threshold, and averages the results. To further evaluate computational complexity, the number of parameters, giga floating-point operations (GFLOPs), and frames per second (FPS) are reported.
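These MS COCO-style metrics can be computed with the standard pycocotools evaluator; the JSON file names below are placeholders for ground truth and detections in COCO format.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO('annotations.json')             # ground-truth annotations
coco_dt = coco_gt.loadRes('detections.json')   # model detections
evaluator = COCOeval(coco_gt, coco_dt, iouType='bbox')
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints AP@[0.50:0.95], AP@0.50, etc.
```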

3.2. Comparison with Different Feature Pyramid Networks

To evaluate the performance of the multi-scale fusion module GF-FPN designed in this paper, two backbone networks, Darknet53 and CSPDarknet, were adopted in the experiments. GF-FPN was compared with the original feature pyramid modules (FPN and PANet), where each of the three feature pyramid modules was paired with the two backbones, respectively. All configurations utilized the same detection head and were trained for an identical number of epochs, yielding the mAP50, mAP50-95, and AP50 values for each defect category across different models. The experimental results are recorded in Table 3 and Table 4.
Regarding mAP metrics, on both the TinyIndus and NEU-DET datasets, the two models equipped with the GF-FPN module attained the top and second-highest detection performance, respectively.
On the TinyIndus dataset, when the backbone network is Darknet53, the model equipped with GF-FPN achieves mAP50 increases of 4.7% and 7.4%, and mAP50-95 increases of 3.4% and 5.5%, compared with the models equipped with PANet and FPN, respectively. For the CSPDarknet backbone, mAP50 rises by 8.1% and 9.7%, and mAP50-95 rises by 2.6% and 5.3%, respectively. As indicated by the AP values of each defect category in Table 3, GF-FPN generally yields a noticeable improvement in the detection precision of different defect categories. However, both the dataset itself and the backbone network exert a significant impact on defect detection performance. Among all categories, the tube mouth defect is the most challenging to detect due to its small scale and susceptibility to texture interference, resulting in relatively low detection precision. Even so, it exhibits a significant improvement after integrating GF-FPN.
On the NEU-DET dataset, for the Darknet53 backbone, the model equipped with GF-FPN achieves increases of 1.3% and 2.2% in mAP50, and 1.3% and 4.1% in mAP50-95, compared with the models equipped with PANet and FPN, respectively. For the CSPDarknet backbone, mAP50 rises by 2.8% and 5.7%, and mAP50-95 rises by 1.6% and 7.4%, respectively.
The visualized detection results are shown in Figure 5 and Figure 6. A set of images is randomly selected from the test dataset. The backbone network is CSPDarknet. As can be seen from the visualization experiment results, GF-FPN has a higher detection rate and higher localization accuracy compared with PANet, and its improvement in small object detection is more significant.
In general, the experimental results show that the GF-FPN designed in this paper achieves better detection performance than FPN and PANet. Moreover, our model generalizes well across different datasets.

3.3. Learnable Parameters and Computational Cost

In Section 2.2.2, we derived the formula for the number of parameters of LPAM, which is significantly smaller than that of traditional convolutional modules. In this section, we provide a more intuitive comparison. Table 5 lists the number of learnable parameters, FPS, and GFLOPs of various feature pyramid networks. As shown in the table, our GF-FPN architecture has 41.7 million learnable parameters; at an input resolution of 640 × 640, it delivers 64.2 FPS and 166.9 GFLOPs. Compared with PANet, BiFPN, and AFPN, our model has fewer parameters and the lowest GFLOPs, balancing detection performance and computational efficiency. LPAM makes the main contribution to the reduction in parameter count.

3.4. Ablation Studies

To investigate the efficacy of star operation in our GF-FPN, we replaced it with element-wise add for ablation studies. Our experimentation utilized the CSPDarknet framework with GF-FPN as the backbone. As indicated in Table 6, we observed that star operation can improve the accuracy by approximately 1.7 points. To evaluate the effect of the residual connections, we removed them from the GF-FPN. The ablation results show that the residual connections improve the detection accuracy by approximately three points. Experimental results demonstrate the advantages of the GF-FPN structure design.

3.5. Limitations and Future Work

While the GF-FPN proposed in this paper delivers favorable performance, its overall architecture is built upon the original FPN with an additional new branch, thus fully retaining the original FPN structure. This results in GF-FPN having a higher parameter number than FPN and a slightly slower inference speed. We could further optimize the FPN branch itself; for instance, replacing conventional convolutions with dilated convolutions to reduce the model’s parameter number and enhance its performance. This constitutes one of our future research directions.
Additionally, during the experiments, we observed that the model exhibits varying detection performance across different categories. However, it is challenging for us to fully analyze the reasons for this discrepancy purely from the perspective of structural design. On one hand, this may be attributed to the inherent characteristics of the dataset; on the other hand, the poor interpretability of deep learning models has long been a persistent challenge in this field. We hope that with the continuous advancement of deep learning technology, we will be able to address this issue through more refined experimental designs and in-depth theoretical research, and this will also be a key part of our future work.

4. Conclusions

Aiming at the small object detection problem in industrial product surface defects, this paper proposes a plug-and-play guided focus feature pyramid network. Built upon FPN, the network adds a bottom-up guided aggregation network. Through a lightweight pyramidal attention module, the star operation, and residual connections, it establishes correlations between objects and local contextual information, as well as between shallow details and deep semantic features. This addresses the limitations of traditional feature pyramid structures, including imprecise focusing on critical features, inadequate feature discriminative power, and weak inter-feature correlations. In addition, the lightweight pyramidal attention module proposed in this paper uses depthwise separable dilated convolution to extract local features, establishes global channel correlation through a global spatial descriptor and self-attention mechanism, and then reconstructs features based on these correlations. Our structural design realizes feature correlation and critical feature enhancement across the channel and spatial dimensions without excessive computational burden. This structure provides a new idea for fusing the attention mechanism with the feature pyramid. Experimental results show that, compared with traditional feature pyramids, the detection model based on GF-FPN has better detection performance.

Author Contributions

Conceptualization, Y.K. and Y.Z.; methodology, Y.K.; software, Y.K.; validation, Y.K. and Y.Z.; formal analysis, Y.Z. and Y.R.; investigation, Y.R. and Y.C.; resources, Y.R.; data curation, Y.R. and Y.C.; writing—original draft preparation, Y.K.; writing—review and editing, Y.K. and Y.Z.; visualization, Y.K.; supervision, Y.Z., Y.R. and Y.C.; project administration, Y.R.; funding acquisition, Y.Z., Y.R. and Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Sciences Promotion Project of Hebei Academy of Sciences (No.25A03), the National Natural Science Foundation of China (No. 61702347), the major science and technology projects of Universities in Hebei Province, China (No.2512602307A), the Natural Science Foundation of Hebei Province, China (No. F2022210007), the Science and Technology Project of Hebei Education Department, China (No. CXZX2025049), and the Central Guidance on Local Science and Technology Development Fund, China (No. 226Z0501G).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The NEU-DET dataset used in this study is publicly accessible via http://faculty.neu.edu.cn/songkechen/zh_CN/zhym/263269/list/index.htm (accessed on 7 November 2025), while other supporting data will be made available on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Qiao, Q.; Hu, H.; Ahmad, A.; Wang, K. A review of metal surface defect detection technologies in industrial applications. IEEE Access 2025, 13, 48380–48400.
2. Liu, G.; Chu, M.; Gong, R.; Zheng, Z. Global attention module and cascade fusion network for steel surface defect detection. Pattern Recognit. 2025, 158, 110979.
3. Zhou, X.; Zhang, Y.; Liu, Z.; Jiang, Z.; Ren, Z.; Mi, T.; Zhou, S. IFIFusion: An independent feature information fusion model for surface defect detection. Inf. Fusion 2025, 120, 103039.
4. Liang, W.; Sun, Y.; Zhang, S.; Bai, L.; Yang, J. SmallNet: A small defects detection network for magnetic chips based on context-weighted aggregation and feature multiscale loop fusion. IEEE Trans. Autom. Sci. Eng. 2024, 22, 10095–10106.
5. Muzammul, M.; Li, X. Comprehensive review of deep learning-based tiny object detection: Challenges, strategies, and future directions. Knowl. Inf. Syst. 2025, 67, 3825–3913.
6. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
7. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
8. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790.
9. Shi, Z.; Hu, J.; Ren, J.; Ye, H.; Yuan, X.; Ouyang, Y.; Guo, J. HS-FPN: High frequency and spatial perception FPN for tiny object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; pp. 6896–6904.
10. Du, Z.; Hu, Z.; Zhao, G.; Jin, Y.; Ma, H. Cross-layer feature pyramid transformer for small object detection in aerial images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–14.
11. Chen, Z.; Ma, Y.; Gong, Z.A.; Cao, M.; Yang, Y.; Wang, Z.; Liu, Y. R-AFPN: A residual asymptotic feature pyramid network for UAV aerial photography of small targets. Sci. Rep. 2025, 15, 16233.
12. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
13. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
14. Dosovitskiy, A. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
15. Tang, J.; Wang, Z.; Zhang, H.; Li, H.; Wu, P.; Zeng, N. A lightweight surface defect detection framework combined with dual-domain attention mechanism. Expert Syst. Appl. 2024, 238, 121726.
16. Li, Y.; Zhou, Z.; Qi, G.; Hu, G.; Zhu, Z.; Huang, X. Remote sensing micro-object detection under global and local attention mechanism. Remote Sens. 2024, 16, 644.
17. Zhang, Y.; Liu, T.; Zhen, J.; Kang, Y.; Cheng, Y. Adaptive downsampling and scale enhanced detection head for tiny object detection in remote sensing image. IEEE Geosci. Remote Sens. Lett. 2025, 22, 1–5.
18. Li, W.; Guo, Y.; Zheng, J.; Lin, H.; Ma, C.; Fang, L.; Yang, X. SparseFormer: Detecting objects in HRW shots via sparse vision transformer. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, 28 October–1 November 2024; pp. 4851–4860.
19. Dong, Y.; Xu, F.; Guo, J. LKR-DETR: Small object detection in remote sensing images based on multi-large kernel convolution. J. Real Time Image Process. 2025, 22, 46.
20. Shi, N.; Yang, Z.; Yang, G.; Li, K.; Yang, Z.; An, J. Super Mamba feature enhancement framework for small object detection. Sci. Rep. 2025, 15, 37148.
21. He, Y.; Song, K.; Meng, Q.; Yan, Y. An end-to-end steel surface defect detection approach via fusing multiple hierarchical features. IEEE Trans. Instrum. Meas. 2019, 69, 1493–1504.
22. Wang, Z.; Li, Y.; Liu, Y.; Meng, F. Improved object detection via large kernel attention. Expert Syst. Appl. 2024, 240, 122507.
23. Yang, J.; Li, C.; Dai, X.; Gao, J. Focal modulation networks. Adv. Neural Inf. Process. Syst. 2022, 35, 4203–4217.
24. Rao, Y.; Zhao, W.; Tang, Y.; Zhou, J.; Lim, S.N.; Lu, J. HorNet: Efficient high-order spatial interactions with recursive gated convolutions. Adv. Neural Inf. Process. Syst. 2022, 35, 10353–10366.
25. Ma, X.; Dai, X.; Bai, Y.; Wang, Y.; Fu, Y. Rewrite the stars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 5694–5703.
26. Zhang, Y.; Wu, C.; Guo, W.; Zhang, T.; Li, W. CFANet: Efficient detection of UAV image based on cross-layer feature aggregation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–11.
27. Yang, G.; Lei, J.; Zhu, Z.; Cheng, S.; Feng, Z.; Liang, R. AFPN: Asymptotic feature pyramid network for object detection. In Proceedings of the 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Hawaii, USA, 3–7 October 2023; pp. 2184–2189.
Figure 1. Images of products and defects in the TinyIndus dataset: (a) glass balls: ball pock; (b) disposable medical gloves: glove flaw; (c) medicinal glass tubes: tube mouth flaw, tube body flaw, and tube body pock; (d) medicinal glass bottles: bottle bottom flaw, bottle body flaw, bottle body dirty, and bottle neck flaw.
Figure 2. Overall architecture of the proposed GF-FPN. A bottom-up guided aggregation network is added on the basis of FPN.
Figure 4. CSPDarknet + GF-FPN model training status on the TinyIndus dataset.
Figure 5. Visualization of the prediction results on the TinyIndus dataset. Different colors represent different defect categories.
Figure 6. Visualization of the prediction results on the NEU-DET dataset. Different colors represent different defect categories; to more intuitively show the positional differences between the predicted and ground-truth boxes, a separate color scheme is used for the ground-truth boxes.
Table 1. Novelty matrix.

Model | Attention Type | Key Contribution | Limitation
FPN [6] | no attention module | top-down fusion, lateral connections | insufficient transmission of semantic information
PANet [7] | no attention module | added bottom-up path aggregation | no weighting mechanism
BiFPN [8] | no attention module | weighted bidirectional fusion, pruned invalid fusion nodes | weighting mechanism relies on hyperparameter tuning
CFPT [10] | cross-layer channel-wise attention (CCA) and cross-layer spatial-wise attention (CSA) | upsampler-free, global contextual information | lacks attention to local contextual information
GF-FPN (ours) | local spatial attention module (LSAM) and full-channel correlation fusion module (FCCFM) | bottom-up guided focus network, attention to local contextual information | none observed
Table 2. Statistics on the number of instances in the TinyIndus dataset.

Defect | Code Name | Instances
bottle neck flaw | xl-j-flaw (xjf) | 134
bottle body flaw | xl-s-flaw (xsf) | 119
bottle body dirty | xl-s-dirty (xsd) | 172
bottle bottom flaw | xl-d-flaw (xdf) | 76
tube body flaw | gs-flaw (gsf) | 175
tube mouth flaw | gk-flaw (gkf) | 254
tube body pock | gs-pock (gsp) | 228
ball pock | q-pock (qp) | 362
glove flaw | st-flaw (stf) | 166
Table 3. Test results of different models on the TinyIndus dataset.

Model | mAP50 | mAP50-95 | xjf | xsf | xsd | xdf | gsf | gsp | gkf | qp | stf
Darknet53 + FPN | 58.9 | 19.8 | 67.6 | 94.6 | 79.2 | 38.0 | 76.0 | 48.6 | 14.6 | 49.3 | 62
CSPDarknet + FPN | 60.2 | 21.1 | 64 | 95.7 | 82.1 | 43.9 | 79.4 | 56.0 | 13.1 | 43.2 | 64.4
Darknet53 + PANet | 61.6 | 21.9 | 69.7 | 99.5 | 81.5 | 58.8 | 67.2 | 44.1 | 18.2 | 49.4 | 66.4
CSPDarknet + PANet | 61.8 | 23.8 | 67.3 | 98.8 | 80.0 | 44.9 | 78.6 | 61.6 | 14.8 | 41.3 | 69.1
Darknet53 + GF-FPN | 66.3 | 25.3 | 79.8 | 99.5 | 87.4 | 60.3 | 79.2 | 47.7 | 20.4 | 51.3 | 70.7
CSPDarknet + GF-FPN | 69.9 | 26.4 | 77.7 | 99.5 | 89 | 60.9 | 84.9 | 63.0 | 33.1 | 46.2 | 74.6
Table 4. Test results of different models on the NEU-DET dataset.

Model | mAP50 | mAP50-95 | Crazing | Inclusion | Patches | Pitted Surface | Rolled-in Scale | Scratches
Darknet53 + FPN | 72.7 | 39.3 | 39.6 | 78.9 | 91.4 | 76.6 | 61.9 | 87.6
CSPDarknet + FPN | 71.2 | 36.4 | 35.5 | 77.0 | 92.6 | 76.0 | 57.2 | 89.1
Darknet53 + PANet | 73.6 | 42.1 | 44.6 | 79.7 | 91.5 | 76.4 | 59.5 | 89.9
CSPDarknet + PANet | 74.1 | 42.2 | 41.9 | 79.0 | 93.0 | 78.4 | 61.9 | 90.4
Darknet53 + GF-FPN | 74.9 | 43.4 | 45.5 | 80.0 | 93.1 | 79.2 | 60.5 | 91.1
CSPDarknet + GF-FPN | 76.9 | 43.8 | 47.1 | 82.1 | 93.3 | 81.8 | 63.8 | 93.6
Table 5. Comparison of model performance on the TinyIndus dataset.

Model | Backbone | FPS | Params (M) | GFLOPs
FPN | CSPDarknet | 65.8 | 40.7 | 158.4
PANet | CSPDarknet | 62.9 | 46.2 | 182.6
BiFPN | CSPDarknet | - | 46.8 | 188.3
AFPN [27] | CSPDarknet | - | 58.5 | 224.8
GF-FPN | CSPDarknet | 64.2 | 41.7 | 166.9
Table 6. Results of ablation experiments.

Backbone | Element-Wise Mul (Star) | Element-Wise Add | Residual Connection | mAP50 | GFLOPs
CSPDarknet + GF-FPN | ✓ |  | ✓ | 69.9 | 166.9
CSPDarknet + GF-FPN |  | ✓ | ✓ | 68.2 | 166.9
CSPDarknet + GF-FPN | ✓ |  |  | 66.7 | 168.5
CSPDarknet + GF-FPN |  | ✓ |  | 65.1 | 168.5
Note: ✓ denotes that the corresponding operation is selected.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
