3.1. Frequency Domain Information Gathering-And-Allocation Mechanism
The backbone generates four feature maps at different resolutions (1/4, 1/8, 1/16, and 1/32 of the original image resolution). These feature maps are associated with defects of different scales: high-level maps (e.g., 1/16 and 1/32) are better suited for detecting larger defects and offer a more macroscopic distinction between defects and backgrounds, while low-level maps (e.g., 1/4 and 1/8) provide finer details to capture smaller defects and intricate texture information. The MFA module addresses the issue of insufficient cross-layer information interaction in traditional fusion networks. The FGPEM module utilizes a PDCT filter to extract high-frequency features (corresponding to boundary information) and low-frequency features (corresponding to texture information), thereby enhancing spatial feature representation and consequently improving detection performance.
Figure 3 shows the overall structure of the FIGA mechanism, while
Figure 4 provides a detailed view of the FGPEM module.
To achieve an initial fusion of the multi-scale features extracted from different levels $i$ of the backbone network, we first align these features to a common spatial resolution. Let $F_i \in \mathbb{R}^{H_i \times W_i \times C_i}$ denote the feature map from the $i$-th level, where $H_i$, $W_i$, and $C_i$ are its height, width, and channel depth, respectively. We designate one feature map $F_r$ as the reference, and the other feature maps are spatially resampled to match its spatial dimensions $(H_r, W_r)$. The spatially aligned feature map for level $i$, denoted as $\tilde{F}_i$, is obtained by applying a level-specific resampling operator $\phi_i$:

$$\tilde{F}_i = \phi_i(F_i), \qquad \phi_i \in \{\mathrm{Up}(\cdot), \mathrm{Identity}(\cdot), \mathrm{Down}(\cdot)\}$$
Here, $\mathrm{Up}(\cdot)$ denotes an upsampling operation to the target dimensions $(H_r, W_r)$ using bilinear interpolation. Conversely, $\mathrm{Down}(\cdot)$ represents a downsampling operation to $(H_r, W_r)$ using adaptive average pooling. The features $\{\tilde{F}_i\}$ are now aligned to the same spatial resolution. Subsequently, these resolution-aligned features are concatenated along the channel dimension to produce a unified multi-scale feature representation $F_{\mathrm{cat}}$ as follows:

$$F_{\mathrm{cat}} = \mathrm{Concat}\big(\tilde{F}_1, \tilde{F}_2, \ldots\big)$$
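For concreteness, a minimal PyTorch sketch of this alignment-and-concatenation step is given below; the choice of reference level and the tensor layout are illustrative assumptions, not the paper's fixed configuration.

```python
import torch
import torch.nn.functional as F

def align_and_concat(features, ref_idx=2):
    """Resample multi-scale maps to a reference resolution and concatenate.

    features: list of tensors [B, C_i, H_i, W_i] from backbone levels.
    ref_idx: index of the reference level (illustrative assumption).
    """
    ref_h, ref_w = features[ref_idx].shape[-2:]
    aligned = []
    for f in features:
        h, w = f.shape[-2:]
        if (h, w) == (ref_h, ref_w):
            aligned.append(f)                        # reference level: identity
        elif h > ref_h:                              # finer map -> downsample
            aligned.append(F.adaptive_avg_pool2d(f, (ref_h, ref_w)))
        else:                                        # coarser map -> upsample
            aligned.append(F.interpolate(f, size=(ref_h, ref_w),
                                         mode="bilinear", align_corners=False))
    return torch.cat(aligned, dim=1)                 # concat along channels
```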
To further refine the aligned features, the tensor $F_{\mathrm{cat}}$ is passed through the core enhancement stage of our Multi-scale Feature Alignment (MFA) module, which consists of a cascade of three RepVGG [40] blocks. Each block enhances features using a training-time multi-branch structure that sums the outputs of a $3 \times 3$ convolution branch, a $1 \times 1$ convolution branch, and an identity branch consisting of a Batch Normalization (BN) layer. Critically, this multi-branch setup is fused into a single $3 \times 3$ convolution during inference, thereby improving feature representation without increasing inference cost. The operations are defined as:

$$F_{\mathrm{MFA}} = \mathrm{Rep}_3\big(\mathrm{Rep}_2\big(\mathrm{Rep}_1(F_{\mathrm{cat}})\big)\big), \qquad \mathrm{Rep}(x) = \mathrm{BN}\big(\mathrm{Conv}_{3 \times 3}(x)\big) + \mathrm{BN}\big(\mathrm{Conv}_{1 \times 1}(x)\big) + \mathrm{BN}(x)$$
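A minimal sketch of such a training-time block, following the standard RepVGG design; the inference-time reparameterization into a single $3 \times 3$ convolution is omitted for brevity.

```python
import torch.nn as nn

class RepVGGBlock(nn.Module):
    """Training-time multi-branch block in the style of RepVGG [40]."""
    def __init__(self, channels):
        super().__init__()
        self.conv3x3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))
        self.conv1x1 = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels))
        self.identity = nn.BatchNorm2d(channels)     # identity branch (BN only)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # The three branches are summed; at inference they can be folded
        # into one 3x3 convolution (structural reparameterization).
        return self.act(self.conv3x3(x) + self.conv1x1(x) + self.identity(x))
```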
As shown in Figure 4, after MFA alignment, the features are further processed through FGPEM for frequency-domain feature extraction. The extracted frequency-domain features are embedded into the original spatial features, thereby enhancing the model's ability to perceive defect samples.
DCT-II transforms images from the spatial domain to the frequency domain. In JPEG image compression, the high-frequency components obtained through DCT-II are typically discarded to achieve efficient data compression. In the frequency coefficient matrix, the coefficients in the top-left corner represent low-frequency information (such as overall defect texture), while the coefficients in the bottom-right corner represent high-frequency information (such as defect edges). This paper filters the DCT-II basis functions based on preset positions, a scheme we name PDCT, to selectively extract either high-frequency or low-frequency coefficients. The specific formula for constructing the PDCT filter element $B_c(h, w)$ is as follows:

$$B_c(h, w) = \alpha_W(u_c)\,\alpha_H(v_c)\,\cos\!\left(\frac{\pi u_c}{W}\Big(w + \frac{1}{2}\Big)\right)\cos\!\left(\frac{\pi v_c}{H}\Big(h + \frac{1}{2}\Big)\right), \qquad \alpha_N(k) = \begin{cases} \sqrt{1/N}, & k = 0 \\ \sqrt{2/N}, & k > 0 \end{cases}$$
Here, $B_c(h, w)$ represents the intensity of the selected 2D DCT-II basis function at spatial position $(h, w)$ for a specific channel $c$. This 2D basis function is constructed as the product of two one-dimensional DCT cosine basis functions: one corresponding to the horizontal direction (parameterized by $w$ and $u_c$) and the other to the vertical direction (parameterized by $h$ and $v_c$). $W$ and $H$ are the width and height of the filter, respectively. The term $+\frac{1}{2}$ accounts for sampling at pixel centers, which is a characteristic feature of the DCT-II transform. The normalization factor $\sqrt{2/N}$ (with $N \in \{H, W\}$) ensures the orthogonality of the basis functions. The conditional multiplication by $1/\sqrt{2}$ distinguishes the DC component ($u_c = 0$ and $v_c = 0$) from the AC components ($u_c \neq 0$ or $v_c \neq 0$), which is the standard normalization for DCT-II.
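A short sketch of how one such PDCT filter could be constructed from the formula above; the helper name and tensor types are illustrative.

```python
import math
import torch

def pdct_filter(u, v, height, width):
    """Build one 2D DCT-II basis (a PDCT filter) for the frequency pair
    (u, v), with u the horizontal and v the vertical frequency index."""
    def basis(k, n, size):
        # "+0.5" samples at pixel centers (DCT-II); the sqrt factors keep
        # the basis orthonormal, with the DC component scaled by 1/sqrt(2).
        b = math.cos(math.pi * k * (n + 0.5) / size) * math.sqrt(2.0 / size)
        return b / math.sqrt(2.0) if k == 0 else b
    rows = torch.tensor([basis(v, h, height) for h in range(height)])  # vertical
    cols = torch.tensor([basis(u, w, width) for w in range(width)])    # horizontal
    return torch.outer(rows, cols)                  # [height, width] filter B_c
```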
After defining the filter elements, the frequency response $D_c$ for each channel is obtained by summing the element-wise product of the input feature map $X_c$ and the constructed filter $B_c$:

$$D_c = \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} X_c(h, w)\, B_c(h, w)$$
The frequency index pair $(u_c, v_c)$ for each channel $c$ is determined by a preset position-based frequency selection strategy. Specifically, $(u_c, v_c)$ is derived from initial fixed indices $\hat{u}_c$ and $\hat{v}_c$, which are the horizontal and vertical indices drawn from a specific, pre-defined set of 8 frequency components on a $10 \times 10$ grid. Our selection comprises 4 low-frequency components with coordinate pairs (0,0), (0,1), (1,0), and (1,1), and 4 high-frequency components with coordinate pairs (9,9), (9,8), (8,9), and (8,8). For any given frequency component selected, $\hat{u}_c$ takes its horizontal coordinate value while $\hat{v}_c$ takes its vertical coordinate value. These initial indices are then scaled by the filter dimensions:

$$u_c = \hat{u}_c \cdot \frac{W}{10}, \qquad v_c = \hat{v}_c \cdot \frac{H}{10}$$
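The selection and scaling can be sketched as follows, reusing the `pdct_filter` helper from the previous sketch; splitting the channels evenly across the eight frequency pairs mirrors common frequency-attention practice and is an assumption here.

```python
import torch

# Preset frequency coordinates (u, v) on a 10x10 grid: 4 low + 4 high.
LOW_FREQ = [(0, 0), (0, 1), (1, 0), (1, 1)]
HIGH_FREQ = [(9, 9), (9, 8), (8, 9), (8, 8)]

def frequency_response(x, freq_pairs, grid=10):
    """Per-channel scalar response D_c = sum(X_c * B_c) for x: [B, C, H, W].

    Channels are split evenly across the selected frequency pairs
    (an illustrative assumption; C must be divisible by len(freq_pairs)).
    """
    b, c, h, w = x.shape
    per_group = c // len(freq_pairs)
    responses = []
    for i, (u0, v0) in enumerate(freq_pairs):
        u, v = u0 * w // grid, v0 * h // grid       # scale indices to filter size
        filt = pdct_filter(u, v, h, w).to(x)        # filter from the sketch above
        part = x[:, i * per_group:(i + 1) * per_group]
        responses.append((part * filt).sum(dim=(-2, -1)))
    return torch.cat(responses, dim=1)              # [B, C] frequency response D
```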
After obtaining the frequency response $D_c$ for the corresponding channel, the channel attention weights are calculated to enhance the representation of the original spatial-domain features. This process involves a function $f(\cdot)$ that takes $D$ as input:

$$A = f(D) = \sigma\big(W_2\,\delta(W_1 D)\big)$$

Here, $A_{b,c}$ is the calculated channel attention weight for channel $c$ in batch $b$, which is derived from $D$ using two linear layers ($W_1$ and $W_2$) with ReLU ($\delta$) and sigmoid ($\sigma$) activations. The final output feature map $Y$ is then obtained by element-wise multiplication of the original input feature map $X$ at spatial position $(h, w)$ with these attention weights:

$$Y_{b,c}(h, w) = A_{b,c} \cdot X_{b,c}(h, w)$$
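A compact sketch of this attention step; the squeeze-style reduction ratio in the two linear layers is an assumption.

```python
import torch.nn as nn

class FrequencyChannelAttention(nn.Module):
    """Two linear layers with ReLU then sigmoid, applied to the frequency
    response D to produce per-channel weights (reduction ratio assumed)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid())

    def forward(self, x, d):
        # x: [B, C, H, W] spatial features; d: [B, C] frequency response.
        a = self.fc(d)                               # channel attention weights A
        return x * a.unsqueeze(-1).unsqueeze(-1)     # element-wise re-weighting
```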
3.2. AES-FPN
Recently, ASFF [
41] and PAFPN [
42] have improved the performance of multi-scale object detectors by using lateral connections to fuse features from different levels. These studies reveal the complementarity between shallow and deep features in the network. However, they focus on information interaction between adjacent levels, with less consideration for feature interaction between non-adjacent levels. Some studies, such as BFP [
11] and CR-FPN [
33], have already made improvements to address the issue of insufficient cross-layer information interaction.
Detectors can generally detect large defects effectively, while small defects often exhibit poorer detection performance due to their limited feature information. Inspired by previous work, we design a new fusion network called AES-FPN to improve the feature fusion process. As shown in
Figure 2, the AES-FPN module enhances multi-scale feature fusion through two key innovations: the FIGA mechanism and the CS module. The FIGA mechanism is designed to extract frequency-domain insights. It generates two types of enhanced spatial features: one enriched with high-frequency boundary information (corresponding to defect edges) and the other with low-frequency texture information (capturing defect patterns). These specialized features are then selectively injected into the fusion network to maximize their impact. The high-frequency features are fed into the upper layers of the model to strengthen defect recognition, while the low-frequency features are directed to the lower layers of the model to improve the distinction between defects and the background. The CS module is designed to prevent information loss during this fusion process.
The embedding of global frequency information into local features is accomplished by the Inject operation. This operation takes two sets of inputs: the high-frequency ($F_{\mathrm{high}}$) and low-frequency ($F_{\mathrm{low}}$) global features provided by the FGPEM module, and the local feature maps ($X_3$ and $X_4$) targeted for enhancement:

$$X_{\mathrm{out}} = X_{\mathrm{local}} \otimes \mathrm{Gate}(X_{\mathrm{global}}) \oplus X_{\mathrm{global}} \quad (12)$$

In Formula (12), $X_{\mathrm{local}}$ represents the target local features and $X_{\mathrm{global}}$ denotes the auxiliary global features, while $\mathrm{Gate}(\cdot)$ signifies a gating mechanism. In the output component, the output of the third feature layer $P_3$ is obtained by embedding the low-frequency auxiliary information $F_{\mathrm{low}}$ into $X_3$. The output of the fourth layer $P_4$ is obtained by injecting the high-frequency auxiliary information $F_{\mathrm{high}}$ into $X_4$ and then fusing the result with $P_3$. The output of the fifth feature layer $P_5$ is computed by first fusing $P_4$ with $X_5$, and then applying a CS operation, as shown in
Figure 5.
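A plausible sketch of the Inject operation under the generic form of Formula (12), assuming a sigmoid gate and $1 \times 1$ convolutions for the gating and embedding transforms; the exact transforms follow the paper's formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Inject(nn.Module):
    """Gated embedding of global frequency features into local features.

    Sketch of Formula (12): the local map is modulated by a gate computed
    from the global (high- or low-frequency) features, then the global
    information is added back. The conv layers are illustrative assumptions.
    """
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv2d(channels, channels, 1)   # gating mechanism
        self.embed = nn.Conv2d(channels, channels, 1)  # global embedding

    def forward(self, x_local, x_global):
        g = F.interpolate(x_global, size=x_local.shape[-2:],
                          mode="bilinear", align_corners=False)  # match sizes
        return x_local * torch.sigmoid(self.gate(g)) + self.embed(g)
```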
The CS module ensures synchronization between the information obtained from the fusion network and the feature extraction network, preventing the loss of channel information during feature fusion. A loss of feature information occurs in the feature fusion component when the backbone feature $X_5$ is fused with the downsampled $P_4$, as $X_5$ does not receive any injected auxiliary information. Since information loss degrades detection performance, this paper proposes the CS module shown in Figure 5 to solve this problem. As illustrated, the input first passes through a $1 \times 1$ convolution module that increases the channel dimension to 384, and is then split into three equal parts. At this stage, most of the feature information in the channels cannot be utilized directly and requires further processing. As shown in Figure 6, the Bottle Rep (BR) module is used to extract channel information from each part. Specifically, the BR module introduces a learnable parameter $\alpha$ as a metric of information significance, which is used to control the scaling of the channels. After three rounds of extraction through the BR module, the resulting features are concatenated along the channel dimension and then adjusted to match the dimensions of the backbone network using a final $1 \times 1$ convolution. After a single CS operation, $P_5$ successfully recovers the lost channel information. The formulas for the CS module are expressed as follows:

$$[S_1, S_2, S_3] = \mathrm{Split}\big(\mathrm{Conv}_{1 \times 1}^{\uparrow}(X)\big), \qquad \tilde{S}_j = \mathrm{BR}(S_j), \; j = 1, 2, 3 \quad (21)$$

$$Y = \mathrm{Conv}_{1 \times 1}^{\downarrow}\big(\mathrm{Concat}(\tilde{S}_1, \tilde{S}_2, \tilde{S}_3)\big) \quad (22)$$
Here, $\delta(\cdot)$ denotes the ReLU activation function, $\sigma(\cdot)$ is the sigmoid function, and $\alpha \in \mathbb{R}^{C}$ is a channel-wise learnable scaling factor (normalized to $[0, 1]$ via the sigmoid function), which controls the information retention ratio. Within each BR module, the channels are first compressed by a $1 \times 1$ convolution (dimensionality reduction), spatial features are then extracted via a $3 \times 3$ convolution, and finally the channel number is restored by a $1 \times 1$ convolution.
The Channel Split (CS) core consists of two steps: split–enhance–concatenate and dimensionality reduction. Equation (21) represents splitting along the channels and enhancing channel expressive power through the BR module. Equation (22) represents the concatenation and restoration operations. Here, ↑ denotes up-dimensioning to 384 channels, and ↓ denotes down-dimensioning to the backbone network's channel number. The BR module dynamically recovers lost channel information via the $\alpha$ parameter, and the CS module enhances feature expression through concatenation.
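A sketch of the BR and CS modules as described; the internal bottleneck width of the BR module is an illustrative assumption.

```python
import torch
import torch.nn as nn

class BottleRep(nn.Module):
    """BR module: a learnable per-channel parameter alpha (squashed to [0, 1]
    by a sigmoid) scales how much of the input channel information is kept.
    The internal bottleneck convolutions are illustrative assumptions."""
    def __init__(self, channels):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels // 2, 1), nn.ReLU(inplace=True),   # compress
            nn.Conv2d(channels // 2, channels // 2, 3, padding=1),          # spatial
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 2, channels, 1))                          # restore

    def forward(self, x):
        return torch.sigmoid(self.alpha) * x + self.body(x)

class ChannelSplit(nn.Module):
    """CS module: expand to 384 channels, split into three equal parts,
    enhance each with a BR block, concatenate, and project back."""
    def __init__(self, in_channels, out_channels, expand=384):
        super().__init__()
        self.up = nn.Conv2d(in_channels, expand, 1)      # up-dimension to 384
        self.brs = nn.ModuleList([BottleRep(expand // 3) for _ in range(3)])
        self.down = nn.Conv2d(expand, out_channels, 1)   # restore backbone dims

    def forward(self, x):
        parts = torch.chunk(self.up(x), 3, dim=1)        # three equal splits
        enhanced = [br(p) for br, p in zip(self.brs, parts)]
        return self.down(torch.cat(enhanced, dim=1))
```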
3.3. SD-GASNet Detector
SD-GASNet adopts the lightweight and efficient MLVT [4] to extract features and uses the proposed AES-FPN to fuse cross-layer multi-scale features. In the detector head, an enhanced KL divergence loss function is designed for self-distillation training, which further improves detector performance without an additional teacher model and achieves a balance between high performance and fast speed.
In the multi-scale feature fusion module of SD-GASNet, the AES-FPN module incorporates the FIGA mechanism along with the CS module. The FIGA mechanism employs MFA and FGPEM to align and fuse multi-scale frequency features, which are then embedded into the shallow feature layers as auxiliary information. The CS module addresses the issue of channel information loss by controlling the scaling of high-level feature channels through the learnable parameter $\alpha$.
In addition, SD-GASNet employs a decoupled self-distillation head, which introduces DFL [
43] to assist in the regression of target boxes and implements self-distillation based on an enhanced KL divergence loss. The features fused by the AES-FPN fusion network are also included in the distillation process to alleviate the knowledge loss encountered by the student network during training. Through this distillation method, the performance of the model is self-improved without any external model.
GASNet is initially trained using the same method without the distillation branch. The pre-trained model subsequently serves as the “teacher” model in the self-distillation framework. During this distillation, features from the outputs of both the AES-FPN fusion network and the self-distillation head are utilized to enhance training.
The features from the AES-FPN fusion network can be used to compute the distillation loss directly. For the head features, the classification and regression branches are decoupled. The classification branch distills the classification outputs of the teacher and student models directly. In the regression branch, DFL provides a more accurate expression of the bounding box distribution through a distance distribution; therefore, the distillation operation is applied only to the DFL component. The remaining components, which regress the bounding box by optimizing the Intersection over Union (IoU), are excluded from the distillation process. The distillation strategy is shown in
Figure 7.
Compared with a general distillation model, our distillation framework does not require an additional teacher model, and the student model is able to surpass the performance of the teacher model. Furthermore, we incorporate the feature output of the neck component into the distillation process, effectively preventing information loss during the learning phase. In this framework, the student model is jointly trained using the pseudo labels generated by the pre-trained teacher model and the true labels from the dataset.
3.4. Loss Function
The total loss of the SD-GASNet detector comprises the loss of the student model relative to the true labels of the dataset and the loss relative to the pseudo labels of the teacher model. In this paper, the teacher model is the pre-trained student model itself. By minimizing the distillation loss, the student model achieves a significant improvement in detection performance without compromising inference speed. In this subsection, the enhanced KL divergence loss function is introduced first, and then the total loss of the whole training process is described in detail.
3.4.1. Enhanced KL Divergence Loss
In machine learning, the KL divergence is often used to quantify the discrepancy between two probability distributions. For example, it serves as part of the loss function in generative models or as a similarity measure in clustering and classification problems. In the field of knowledge distillation, the KL divergence loss is likewise used to measure the difference between the output distributions of the teacher model and the student model, as illustrated in the following equation:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i=1}^{N} P(i) \log \frac{P(i)}{Q(i)} = \sum_{i=1}^{N} P(i) \log P(i) - \sum_{i=1}^{N} P(i) \log Q(i)$$

where $D_{\mathrm{KL}}$ is the KL divergence, and $P$ and $Q$ denote the original and predicted probability distributions of dimension $N$, respectively. From the above formula, we can see that the KL divergence consists of the self-entropy of $P$ and the cross-entropy between $P$ and $Q$.
Building on this decomposition, the enhanced KL divergence loss is defined as:

$$L_{\mathrm{EKL}} = \lambda \left( \sum_{i=1}^{N} Q(i) \log Q(i) - \sum_{i=1}^{N} P(i) \log Q(i) \right)$$

where $L_{\mathrm{EKL}}$ is the enhanced KL divergence loss, and $\lambda$ is a weight coefficient that decays according to a cosine annealing schedule throughout the training process. $L_{\mathrm{EKL}}$ simultaneously constrains the information entropy of the student model's distribution and the cross-entropy between the two distributions, facilitating the effective transfer of knowledge from the teacher model to the student model.
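Under this reading, the loss can be sketched as follows; applying softmax to raw logits here is an assumption, and the exact formulation follows the paper's equation.

```python
import torch.nn.functional as F

def enhanced_kl_loss(student_logits, teacher_logits, lam):
    """Sketch of the enhanced KL divergence loss as described: a weighted
    combination of the student distribution's (negative) entropy and the
    cross-entropy between the teacher and student distributions."""
    q = F.softmax(student_logits, dim=-1)            # student distribution Q
    log_q = F.log_softmax(student_logits, dim=-1)
    p = F.softmax(teacher_logits, dim=-1)            # teacher distribution P
    neg_entropy = (q * log_q).sum(dim=-1)            # constrains student entropy
    cross_entropy = -(p * log_q).sum(dim=-1)         # aligns student to teacher
    return lam * (neg_entropy + cross_entropy).mean()
```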
Moreover, the decay of the distillation weight benefits the performance improvement of the student model. Throughout the training process, $\lambda$ gradually decays from 1 to 0 following a cosine annealing schedule. In the early stages of training, $\lambda$ is close to 1, resulting in a significant contribution from the distillation loss, which helps the parameters of the student model stabilize quickly. In the later stages of training, when the parameters of the model have stabilized, $\lambda$ approaches 0, allowing the student model's own loss to dominate and thereby promoting performance breakthroughs.
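The schedule itself reduces to a single expression; a minimal sketch:

```python
import math

def distill_weight(step, total_steps):
    """Cosine-annealed distillation weight: decays from 1 to 0 over training."""
    return 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
```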
3.4.2. The Total Loss of SD-GASNet
The overall loss of SD-GASNet comprises three main supervised loss components: a classification loss ($L_{\mathrm{cls}}$), a regression loss ($L_{\mathrm{reg}}$), and a feature distillation loss ($L_{\mathrm{feat}}$). The regression loss itself is a combination of the Distribution Focal Loss ($L_{\mathrm{DFL}}$) and the GIoU loss ($L_{\mathrm{GIoU}}$), as represented in the following equation:

$$L_{\mathrm{total}} = L_{\mathrm{cls}} + L_{\mathrm{reg}} + L_{\mathrm{feat}}, \qquad L_{\mathrm{reg}} = L_{\mathrm{DFL}} + L_{\mathrm{GIoU}}$$
First, except for the IoU branch, the remaining three branches are distilled between the output of the teacher model and the predictions of the student model using the enhanced KL loss proposed in this paper. The formula is as follows:

$$L_{\mathrm{distill}} = \sum_{c} L_{\mathrm{EKL}}^{c}(P_c, Q_c), \qquad c \in \{\mathrm{cls}, \mathrm{DFL}, \mathrm{feat}\}$$

where $L_{\mathrm{EKL}}^{c}$ is the improved KL divergence loss, $c$ corresponds to the three different branches, and $P_c$ and $Q_c$ are the outputs of the teacher model and the predictions of the student model, respectively. When $c = \mathrm{feat}$, the loss corresponds to the distillation of the three feature layers output by AES-FPN.
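The per-branch accumulation can be sketched as follows, reusing `enhanced_kl_loss` from the earlier sketch; the branch keys are illustrative.

```python
def distillation_loss(student_out, teacher_out, lam):
    """Sum the enhanced KL loss over the classification, DFL, and AES-FPN
    feature branches (branch names are illustrative assumptions)."""
    total = 0.0
    for branch in ("cls", "dfl", "feat"):
        total = total + enhanced_kl_loss(student_out[branch],
                                         teacher_out[branch], lam)
    return total
```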
Next, we describe the computation of the loss in each of the three remaining branches of the total loss. In defect detection tasks, the number of defect objects is significantly smaller than that of background objects, leading to a substantial imbalance between positive and negative samples, and traditional classification losses can be difficult to optimize for complex defects. Therefore, we introduce the Varifocal Loss (VFL) to address this imbalance between positive and negative samples, as expressed in the following equation:

$$L_{\mathrm{cls}} = \frac{1}{N_{\mathrm{pos}}} \sum_{i} \mathrm{VFL}(p_i, q_i) + L_{\mathrm{EKL}}^{\mathrm{cls}}$$

where $L_{\mathrm{cls}}$ denotes the VFL-based classification loss, $N_{\mathrm{pos}}$ represents the number of positive samples, $p_i$ designates the classification score of pixel point $i$ in the student model, and $q_i$ denotes the target categorization score of pixel point $i$. The term $L_{\mathrm{EKL}}^{\mathrm{cls}}$ represents the self-distillation loss in the classification branch.
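A sketch of the VFL term, using the default $\alpha$ and $\gamma$ values from the original Varifocal Loss formulation (their use here is an assumption):

```python
import torch

def varifocal_loss(p, q, alpha=0.75, gamma=2.0):
    """Varifocal Loss sketch.

    p: predicted classification scores in (0, 1); q: target IoU-aware
    scores (q > 0 for positive samples, q = 0 for negatives).
    """
    pos = q > 0
    loss = torch.zeros_like(p)
    # Positives: binary cross-entropy weighted by the target score q.
    loss[pos] = -q[pos] * (q[pos] * torch.log(p[pos])
                           + (1 - q[pos]) * torch.log(1 - p[pos]))
    # Negatives: focal down-weighting of easy background predictions.
    neg = ~pos
    loss[neg] = -alpha * p[neg].pow(gamma) * torch.log(1 - p[neg])
    return loss.sum() / pos.sum().clamp(min=1)       # normalize by positives
```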
Due to the high similarity between defect images and backgrounds, defect detection is more challenging than traditional object detection. Therefore, for the regression branch, we adopt DFL as the primary loss to directly learn the distance distribution of bounding box locations. This allows for a more flexible and accurate representation of object boundaries. To further refine the geometric properties of the predicted boxes, we also incorporate an auxiliary GIoU loss as a regularization term. The enhanced KL divergence is then applied to the DFL component for knowledge distillation. The complete formula for the regression branch is represented as follows:
$$L_{\mathrm{reg}} = \frac{1}{N_{\mathrm{pos}}} \sum_{i} \mathbb{1}_{\{q_i > 0\}} \big( L_{\mathrm{DFL}}(i) + L_{\mathrm{GIoU}}(i) \big) + L_{\mathrm{EKL}}^{\mathrm{reg}}$$

where $N_{\mathrm{pos}}$ denotes the number of positive samples, and $\mathbb{1}_{\{q_i > 0\}}$ is the indicator function that equals 1 when $q_i > 0$ and 0 otherwise, meaning that only positive samples contribute to the regression loss calculation. $L_{\mathrm{DFL}}$ represents the primary Distribution Focal Loss used to learn the boundary distributions, and $L_{\mathrm{GIoU}}$ represents the auxiliary GIoU loss that provides geometric regularization. The term $L_{\mathrm{EKL}}^{\mathrm{reg}}$ denotes the knowledge distillation of the distance distributions between the student and teacher models, which further assists the bounding box regression.
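For reference, a minimal sketch of the DFL component [43], which spreads each continuous regression target over its two neighboring distribution bins; the number of bins is a configuration detail assumed here.

```python
import torch
import torch.nn.functional as F

def distribution_focal_loss(pred_logits, target):
    """DFL sketch: learn a discrete distribution over distances whose
    expectation regresses each box side.

    pred_logits: [N, bins] per-side distribution logits;
    target: continuous distances in [0, bins - 1].
    """
    left = target.long()                              # lower neighboring bin
    right = (left + 1).clamp(max=pred_logits.size(-1) - 1)
    w_left = right.float() - target                   # weight toward lower bin
    w_right = target - left.float()                   # weight toward upper bin
    return (F.cross_entropy(pred_logits, left, reduction="none") * w_left
            + F.cross_entropy(pred_logits, right, reduction="none") * w_right)
```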