A Novel Small Target Detection Strategy: Location Feature Extraction in the Case of Self-Knowledge Distillation

Abstract: Small target detection has always been a hot and difficult topic in the field of target detection. Existing detection networks perform well on conventional targets but poorly on small targets. The main challenge is that small targets occupy few pixels and are widely scattered across the image, so it is difficult to extract effective features, especially in the deeper layers of a neural network. A novel plug-in that extracts location features of small targets in the deep network is proposed. Because the deep network has a larger receptive field and richer global information, it is easier to establish a global spatial context mapping there. The plug-in, named location feature extraction, establishes this spatial context mapping in the deep network to obtain global information about scattered small targets in the deep feature map, and an attention mechanism is used to strengthen attention to spatial information. The combined effect of the two realizes location feature extraction in the deep network. To improve the generalization of the network, a new self-distillation algorithm that works under self-supervision was designed for pre-training. Experiments were conducted on public datasets (Pascal VOC and the Printed Circuit Board Defect dataset) and on a self-made dedicated small target detection dataset. According to the diagnosis of the false-positive error distribution, the location error was significantly reduced, which demonstrates the effectiveness of the proposed plug-in for location feature extraction. The mAP results show that a network applying the location feature extraction strategy detects much better than the original network.


Introduction
With the development of deep learning, the capability of neural networks has grown substantially. Because neural networks can automatically extract features and fit large amounts of data, many excellent target detection algorithms have emerged, such as Mask R-CNN [1], Cascade R-CNN [2], and Hybrid Task Cascade [3]. The technology is becoming more advanced, and both accuracy and speed have significantly improved. With the maturity of conventional target detection, researchers began to pay more attention to the detection of small targets. The most common definition at present comes from the general datasets in the field of target detection: small targets are defined as targets with a resolution of less than 32 × 32 pixels [4]. Chen et al. [5] proposed a dataset for small targets and gave a definition based on relative scale: the median ratio of the bounding box area to the image area lies between 0.08% and 0.58%. Because small targets are small in size, contain few pixels, are widely distributed, and appear in small numbers, the effective features that a network can extract from them are very few. Many of the networks mentioned above that perform well on conventional targets do not perform well on small targets [6][7][8][9].
The main method to solve these problems is multi-scale feature fusion. The deep network has rich semantic information but lacks the fine-grained information of the shallow network. By fusing features from multiple scales, semantic and fine-grained information can be used simultaneously, providing more comprehensive feature information for small target detection. The Feature Pyramid Network (FPN) [10] is applied in many small target detectors; it adopts multi-scale feature fusion and makes predictions from the fused results. In the Single Shot MultiBox Detector (SSD) [11] series, DSSD [12] deconvolves the deep feature maps of SSD and multiplies them element-wise with the shallow features to obtain a better multi-layer feature map, which is very beneficial to the detection of small objects. In the Faster R-CNN series, HyperNet [13] fused the feature maps obtained from the first, third, and fifth convolution groups, pooling the shallow features and deconvolving the deep ones, and finally fused them by channel concatenation so that they complement each other.
In general, multi-scale feature map fusion helps capture detailed information and rich semantic information, which facilitates object location and classification, respectively. However, the multi-scale representation method does not single out specific features; extracting features widely from multiple layers inevitably increases the computational burden. In addition, fusing redundant information may introduce background noise and degrade performance [14].
Because small objects occupy only a small part of the image, the information obtained directly from local areas is very limited. However, every object exists in a specific environment or coexists with other objects. Therefore, context-based detection methods have been proposed to exploit the relationships between small objects and other objects or the background. FASSD [15] uses additional features from different layers as context and proposes an attention mechanism to focus on the context information from the target layer.
Inspired by the above two lines of work, a plug-and-play module called the location feature extraction module (LFE module for short) was designed to extract location features in deep layers. The LFE module uses the larger receptive field of the deeper network to establish a global spatial context association and then uses the attention mechanism to strengthen attention to spatial information, thereby implementing the location feature extraction process.
In order to enhance the generalization performance of neural networks, regularization methods are proposed to reduce overfitting. Knowledge distillation is one of the most effective regularization strategies for improving the generalization ability of networks. Existing knowledge distillation methods [16][17][18] require training a complex teacher network first and then transferring its knowledge to the student network, which is both time- and cost-consuming. Therefore, self-distillation, i.e., learning what you teach yourself, was proposed. Traditional self-distillation is realized under supervision, and supervised learning requires a large amount of manually labeled information [19]. The rise of self-supervised learning has solved this problem; applying self-supervised contrastive learning to self-distillation may produce desirable results.
Our contributions are: (1) We found that location feature extraction has a great impact on small object detection, and improving location feature extraction can effectively improve detection accuracy.
(2) We proposed a location extraction structure that can extract location features at a deep level, using the larger receptive field to establish a global spatial context mapping and then using the attention mechanism to strengthen attention to spatial information. CSPNet [20] is integrated into this structure to reduce the use of repeated gradient information.
(3) Based on traditional self-knowledge distillation, a self-knowledge distillation model that can be learned in a self-supervised environment was proposed. The linear combination of the change information generated after data augmentation and the current prediction results is used to distill knowledge and soften the target. This paper is organized as follows: Section 2 surveys the related work on self-supervised learning, self-knowledge distillation, feature extraction, etc. Section 3 gives a detailed description of the design and principle of the specific structure of the model. Section 4 elaborates the experiments, including the experimental environment, the deployment of the experiments, and the analysis of the corresponding results. Section 5 draws the conclusions.

Small Target Detection
Multi-scale feature fusion. The Single Shot MultiBox Detector (SSD) [11] uses multi-scale feature maps for detection, using large feature maps to detect small targets and small feature maps to detect large targets. SSD runs detection on the feature map obtained from each convolution stage and completes target location and classification in one pass. It detects targets by convolving the feature maps, unlike YOLOv1 [21], whose fully connected layers lose much spatial information. Each point on these feature layers constructs six prior boxes at different scales; finally, the outputs obtained on all feature maps are combined, and the detection results are obtained through NMS (non-maximum suppression). MDSSD [22] adds high-level features with rich semantic information to low-level features through a deconvolution Fusion Block: several high-level features of different scales are up-sampled simultaneously and then skip-connected to form a feature map that is more descriptive of small objects, and these new fused features are used for prediction. Unlike the element-wise summation strategy adopted by MDSSD [22], the deconvolution-region-based convolutional neural network (DR-CNN) [23] adopts a concatenation strategy to fuse multi-scale feature maps for small traffic sign detection. The channel-aware deconvolution network (CADNet) [24] was proposed to study the relationships between deep feature maps in different channels and avoid the simple superposition of feature maps. By using the correlation between features at different scales, the recall rate of small objects can be improved at a lower computational cost.
Context-based detection method. The context-enhanced R-CNN proposed by Chen et al. [5] can be considered the first detector focusing on small target detection. In this work, a new region proposal network (RPN) was proposed to encode the context information of small object proposals. Inside-Outside Net (ION) [25] used spatial recurrent neural networks (RNNs) to search the context information outside the target area; the skip pooling method was then used to obtain internal multi-level feature maps. Leng et al. [26] integrated the U-V disparity algorithm with Faster R-CNN, combining internal and contextual information. The FASSD framework [15] started from the baseline SSD and proposed three components to improve the detection accuracy of small targets: first, SSD with feature fusion to obtain context information, named F-SSD; second, SSD with attention modules that enable the network to focus on important components, called A-SSD; third, the combination of feature fusion and attention modules, called FA-SSD.

Attention mechanisms. Inspired by the human visual system, the attention mechanism has been introduced into convolutional neural networks in recent years to improve the performance of target detection [27,28]. According to the form in which attention acts on the feature map, attention mechanisms are mainly divided into channel attention [29], spatial attention [30], and mixed channel-spatial attention [31]. The SE attention mechanism [29] is currently popular; it uses two-dimensional global pooling to compute channel attention. However, channel attention only encodes channel information and ignores location information [32], and location information is critical in visual tasks, so enhancing the extraction of location features is a key technology for improving detection accuracy.

Self-Distillation and Self-Supervised Contrastive Learning
Conventional knowledge distillation [33] methods 'distill' the knowledge contained in a larger, better-performing teacher model into a student model. Later, a new form of knowledge distillation was proposed in which the model at the current time step is regarded as the student and the model at the previous time step as the teacher. Since the teacher and student share the same model structure, it is called self-knowledge distillation (self-KD) [34]. Specifically, a self-distillation method was proposed [35] that distills knowledge from the deeper part of the network to the shallower part. The authors of [36] designed a self-distillation scheme that transfers knowledge from the early stages (teacher) to the later stages (student) to support supervised training within the same network. To further reduce inference time, a distillation-based training scheme [37] was developed in which the shallow exit layers attempt to mimic the output of the deep layers during training. Recently, self-distillation was theoretically analyzed in [38], and its improved performance was proven experimentally in [39].
Currently, contrastive learning is widely used in self-supervised learning (SSL) [40][41][42]: each image is considered a separate class, positive samples are pulled closer, and negative samples are pushed away. Its basic principle is to adopt a Siamese network structure [43] and compute the loss between the outputs of the two branches fed with positive and negative sample pairs, so that, using the InfoNCE loss [40], the network learns features that bring similar samples closer and push dissimilar samples farther apart. MoCo [41,42] establishes a dynamic dictionary that stores a large number of negative samples; MoCo emphasized that the number of sample pairs is very important for contrastive learning. SimCLR [44], however, argued that the way negative examples are constructed is also very important; it used more data augmentation and added a projector after the encoder, which greatly improves the effect. MoCo v2 [42] verified the effectiveness of SimCLR's design by implementing two of its improvements in the MoCo framework. Contrastive learning requires many negative examples for comparison, which is time- and memory-consuming; therefore, FAIR and INRIA launched SwAV [45], which clusters the samples and then discriminates between the clusters of each class. MoCo, SimCLR, and similar methods rely on negative samples. Without negative samples, BYOL [46] depends on two neural networks, an online network and a target network, which interact and learn from each other. Continuing the ideas of BYOL, Xinlei Chen and Kaiming He studied the Siamese network [43] and found that stopping the gradient was the key to avoiding collapse, so they proposed SimSiam [47].

Normalization
In order to make the input data of the neural network independent and identically distributed, normalization was introduced to limit the input to a certain range. BN (Batch Normalization) [48], GN (Group Normalization) [49], LN (Layer Normalization) [50], and IN (Instance Normalization) [51] are several classical normalization algorithms. However, BN was limited by the batch size, and its performance was poor when the batch size was small. Moreover, the above methods are all global normalization, which means that spatial information was not utilized and all features were normalized by the same mean and variance. Anthony Ortiz proposed LCN (local context normalization) [52], which utilized local context information and normalized according to the local neighborhood and corresponding statistical values to improve performance. It is also applicable to various batch sizes and transfer learning. In our experiment, BN and LCN, which were more suitable for this task, were selected as normalization methods, respectively.
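To make the distinction concrete, the following NumPy sketch (illustrative only, not from the paper) normalizes the same feature tensor with the statistics axes that distinguish BN, LN, and IN; GN and LCN further restrict the channel or spatial neighborhood:

```python
import numpy as np

# Global normalizations differ only in the axes over which the mean and
# variance are computed for a feature tensor of shape (N, C, H, W).
x = np.random.randn(4, 32, 16, 16)

def normalize(x, axes, eps=1e-5):
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

bn = normalize(x, axes=(0, 2, 3))  # BatchNorm: per channel, across the batch
ln = normalize(x, axes=(1, 2, 3))  # LayerNorm: per sample, across all channels
inorm = normalize(x, axes=(2, 3))  # InstanceNorm: per sample and channel
# GroupNorm splits the channel axis into groups; LCN additionally limits
# the statistics to a local spatial neighborhood.
```

This also makes BN's batch-size dependence visible: only BN includes the batch axis in its statistics, which is why its estimates degrade when the batch is small.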

Activation Function
The activation function is an important part of a deep learning network. It introduces nonlinear characteristics into the network so that the network can learn more complex and deeper nonlinear relationships, and it enhances the representational ability of the network. At present, the common activation functions are sigmoid, tanh, and ReLU [53], of which ReLU is the most widely used. Compared with sigmoid and tanh, ReLU does not saturate for positive inputs, which avoids the risk of gradient disappearance in backpropagation and improves network performance. However, ReLU zeroes out negative inputs: when the input value is negative, the gradient of the neuron becomes zero, causing some neurons to "die" and preventing their parameters from being updated. The Swish [54] activation function is a self-gated activation function; its simple structure and similarity to ReLU often make it better than ReLU in deeper networks. Its nonlinearity can further amplify the nonlinear characteristics of the data so that the network can continue to learn, and its non-saturating behavior helps avoid gradient disappearance, overcoming ReLU's shortcoming with negative inputs. The representational ability of the model is thus further improved, and the learning efficiency of the network is ensured. Because the sigmoid inside Swish is computationally expensive, this paper adopts its piecewise-linear variant, Hard Swish [54], as the activation function.
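Hard Swish replaces the sigmoid gate of Swish (x · sigmoid(x)) with a cheap piecewise-linear gate; a minimal NumPy sketch of the commonly published definition:

```python
import numpy as np

def hard_swish(x):
    # Hard Swish: x * ReLU6(x + 3) / 6 — a piecewise-linear
    # approximation of Swish that avoids computing a sigmoid.
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0

# For x >= 3 it behaves like the identity, and for x <= -3 it outputs 0;
# in between, the linear gate interpolates between the two regimes.
```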

Self-Supervised Learning for Pre-Training
In order to enhance the generalization of the network, an improved self-distillation method named Simdis2x was implemented as a regularization method in a self-supervised environment. It borrows from the self-distillation method in [55] and from SimSiam's [47] use of only positive samples for self-supervision. The biggest difference from [55] lies in the softened hard target, which in [55] was used in a supervised learning scenario: the predicted result Pre(x) and the target H(x) were linearly combined as a softened hard target to supervise the next epoch. The proposed method, in contrast, is self-supervised and, like SimSiam [47], uses only positive samples. In self-distillation, the target H(x) is not used; instead, the next epoch is distilled using the linear combination of the change information introduced by image augmentation [56] and the current prediction result. Like [55], a progressive schedule is used: at the beginning of training, the weight of the self-distillation loss is small; as the epochs increase, its proportion of the total loss grows, and it is finally combined with the cosine loss of SimSiam. The total loss function is given in Formula (1), where L_D denotes the symmetric self-knowledge distillation loss for each image, as shown in Formula (2) (α_t is a hyperparameter), and L_S refers to the loss function of the SimSiam network.
Here, t − 1 refers to the previous epoch, cosine refers to minimizing the negative cosine similarity of the two views, P is the current prediction result, and f(·) is the change information generated by the image augmentation of the previous epoch.
The feature information from the previous epoch, generated through data augmentation and the color changes, was processed by a multi-layer perceptron and then linearly combined with the projector output of the current epoch to generate the target features.
As shown in Formulas (3) and (4), where V(t − 1) refers to the projector result of the previous epoch, M refers to the MLP process, β is the augmentation difference after cropping, and δ is the augmentation difference caused by the color change.
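Since Formulas (1)-(4) themselves are not reproduced in this text, the following NumPy sketch is only one possible reading of the description above: a SimSiam-style symmetric negative-cosine loss L_S, a self-distillation loss L_D computed against a target mixed from the previous epoch's augmentation-change information f(·) and the current prediction, and a distillation weight α_t that grows linearly with the epoch. The schedule and mixing coefficients are assumptions, not the paper's exact formulas.

```python
import numpy as np

def neg_cosine(p, z):
    # Negative cosine similarity between a prediction p and a target z.
    p = p / np.linalg.norm(p)
    z = z / np.linalg.norm(z)
    return -float(np.dot(p, z))

def total_loss(p1, p2, z1, z2, f_prev, t, T, alpha_T=0.8):
    # alpha_t grows with the epoch, so the self-distillation term is
    # weak early in training and stronger later (assumed schedule).
    alpha_t = alpha_T * t / T
    # L_S: symmetric SimSiam loss between the two augmented views.
    l_s = 0.5 * (neg_cosine(p1, z2) + neg_cosine(p2, z1))
    # Soft target: linear mix of the previous epoch's change information
    # f_prev and the current prediction (assumed mixing rule).
    target = (1.0 - alpha_t) * p1 + alpha_t * f_prev
    # L_D: symmetric self-knowledge distillation loss.
    l_d = 0.5 * (neg_cosine(p1, target) + neg_cosine(p2, target))
    return l_s + alpha_t * l_d
```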

Location Feature Extraction Structure
The network structure of the location feature extraction module (referred to as the LFE module) is shown in Figure 1. Based on the idea of CSPNet, the module is divided into two parts. Part I: the input features first pass through the ConvNHS module three times. ConvNHS is a convolution module consisting of convolution, normalization [48], and the Hard Swish activation function [54]. Normalization accelerates convergence and prevents overfitting, while Hard Swish [54] improves the network's nonlinear representation ability without greatly increasing the computational cost. The global spatial context mapping is established through these three convolution modules of different dimensions. After that, the feature map enters the attention module, which includes channel attention and spatial attention. Because the location feature extraction ability of the channel attention mechanism alone is weak [32], spatial attention is used in the attention module to obtain more spatial information. In addition, the cross-channel information interaction of the ConvNHS module compensates, to a certain extent, for the lack of spatial information in channel attention. Part II: the second branch retains the initial fine-grained feature information to avoid the loss of global spatial information. The later experiments also proved that this structure does reduce location errors and achieves the location feature extraction process (Figure 2). The specific model architecture is as follows. The feature map (H × W × C1, where H and W are the height and width of the feature map and C1 is the number of channels) entering the module is processed in two branches. In branch 1, the ConvNHS module, consisting of convolution, normalization, and the Hard Swish activation function, uses convolution kernels of different sizes to change the number of channels, achieving cross-channel information interaction and establishing the spatial context mapping.
The normalization accelerated convergence, and Hard Swish completed the nonlinear modeling of input in a deeper network. Under the condition of keeping the original size of the feature map, the number of channels of the feature map was changed to C2, so the cross-channel information interaction was realized.
Without significantly increasing the number of parameters, the ConvNHS module enhances the learning of nonlinear features and of location relationships in the deeper network. This resolves the weak location feature extraction of the subsequent channel attention mechanism, and a feature map F_context is obtained.
Next, it was the channel attention mechanism: two one-dimensional feature maps were obtained through global max pooling and global average pooling, and a channel attention vector was obtained through a multi-layer perceptron (MLP) and activation function processing. With the previous ConvNHS module processing, the channel attention mechanism could interact with more cross-dimensional information and enhance the attention to spatial information.
Here, F_c,context differs from a general channel attention feature map in that it carries the cross-channel interactive information and the spatial context mapping. ⊗ denotes element-wise multiplication.
Spatial attention is deployed after channel attention to strengthen attention to spatial information. The feature map F_c,context output by the channel attention module is used as its input. First, global max pooling and average pooling are carried out along the channel dimension to obtain two feature maps (H × W × 1). Then the two feature maps are concatenated along the channel dimension, and a 7 × 7 convolution kernel performs channel dimension reduction. The weights of the spatial dimensions, F_s,context, are generated. At this point, the processing of branch 1 ends.
Spatial attention is applied to F_c,context, where V_s is the spatial attention vector obtained by the spatial attention mechanism. Branch 2 performs a simple convolution on the input feature map, adjusting the number of channels to the same number C2 as in branch 1. The feature maps of branch 1 and branch 2 are concatenated to obtain a feature map of H × W × 2C2. The concatenated feature map is sent through normalization and the LeakyReLU activation function. Finally, the ConvNHS module integrates the features, and the final feature map F is the output.
Here, f_concate refers to concatenation, and F_branch2 refers to the feature map obtained after convolution in branch 2. Take a practical example to illustrate the dimensional changes of this structure. Suppose the input feature map F (8 × 8 × 64, where 8 × 8 is the height and width of the feature map and 64 is the number of channels) enters the ConvNHS module of branch 1 for cross-dimensional information interaction. The number of channels is changed to 32 with the size unchanged, and a new feature map F_context (8 × 8 × 32) is obtained. The channel attention module then obtains two feature maps (1 × 1 × 32) through GMP (global max pooling) and GAP (global average pooling). After a series of convolutions, activation functions, additions, and multiplications, the two are integrated into the channel feature map F_c,context (8 × 8 × 32). In the spatial attention module, two feature maps are generated by GMP and GAP, and the spatial feature map F_s,context (8 × 8 × 32) is obtained after concatenation. The original feature map enters branch 2 and yields F_branch2 (8 × 8 × 32) after a simple convolution. After concatenating the feature maps of branches 1 and 2 and passing them through ConvNHS processing again, the final output feature map F (8 × 8 × 64) is generated.
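Putting this walkthrough together, a minimal PyTorch sketch of the LFE block might look as follows. The kernel sizes, the attention reduction ratio r, and the use of BN rather than LCN are assumptions made for illustration; with an 8 × 8 × 64 input, branch 1 and branch 2 each produce 32 channels and the fused output returns to 64 channels, matching the dimensional example above:

```python
import torch
import torch.nn as nn

class ConvNHS(nn.Sequential):
    """Convolution + normalization + Hard Swish (BN used here for simplicity)."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__(
            nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(c_out),
            nn.Hardswish(),
        )

class LFE(nn.Module):
    """Sketch of the location feature extraction block (two CSP-style branches)."""
    def __init__(self, c, r=4):
        super().__init__()
        c2 = c // 2
        # Branch 1: three ConvNHS modules establish the spatial context mapping.
        self.context = nn.Sequential(
            ConvNHS(c, c2, 1), ConvNHS(c2, c2, 3), ConvNHS(c2, c2, 1)
        )
        # Channel attention: GMP + GAP through a shared MLP, then sigmoid.
        self.mlp = nn.Sequential(
            nn.Conv2d(c2, c2 // r, 1), nn.ReLU(), nn.Conv2d(c2 // r, c2, 1)
        )
        # Spatial attention: 7x7 conv over concatenated channel-wise max/avg maps.
        self.spatial = nn.Conv2d(2, 1, 7, padding=3)
        # Branch 2: a simple convolution keeps the fine-grained information.
        self.branch2 = nn.Conv2d(c, c2, 1)
        # Fusion: normalization + LeakyReLU + a final ConvNHS.
        self.fuse = nn.Sequential(
            nn.BatchNorm2d(2 * c2), nn.LeakyReLU(0.1), ConvNHS(2 * c2, c, 1)
        )

    def forward(self, x):
        f = self.context(x)                                  # F_context
        ca = torch.sigmoid(
            self.mlp(f.amax((2, 3), keepdim=True))
            + self.mlp(f.mean((2, 3), keepdim=True))
        )
        f = f * ca                                           # F_c,context
        sa = torch.sigmoid(self.spatial(torch.cat(
            [f.amax(1, keepdim=True), f.mean(1, keepdim=True)], dim=1)))
        f = f * sa                                           # F_s,context applied
        return self.fuse(torch.cat([f, self.branch2(x)], dim=1))
```

With c = 64, an input tensor of shape (1, 64, 8, 8) produces an output of the same shape, so the block can be inserted without altering the surrounding network's dimensions.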

Instantiation
In the convolution block ConvNHS of the LFE block, we chose the normalization method through careful consideration and experimental verification. In the specific implementation, we chose the most classic BN as the normalization method of the CPU environment to verify the performance of global normalization. At the same time, in the GPU environment, the latest popular LCN was used to verify the performance of local normalization. In addition to the above deployment, the activation function used Hard Swish, which can have advantages in deeper networks.
The LFE block is plug-and-play and was deployed in YOLOv4 [57]. First, only one LFE block was inserted into the network for training and testing. Then, inspired by Arunabha M. Roy's team [58], inserting five location feature extraction modules was also tested. The insertion positions are shown in Figure 3. After the images enter the backbone network CSPDarknet53 [57] for feature extraction, the SPP [60] module is deployed to increase the receptive field so that a feature map of any size can be converted into a fixed-size feature vector. Then PANet [59] is used for feature fusion. Five insertion locations were selected, after the SPP module and after the concatenations, all of which are rich in information interaction. Finally, the three YOLO heads make predictions separately. The later experiments (see Section 4.4 for specific results) also proved that this structure does reduce location errors and accomplishes the location feature extraction process.

Dataset Pre-Processing
A self-made dedicated dataset for small target detection consists of images of the solder joints of circuit boards, comprising a total of 8600 pictures of solder joint defects. They were divided into training and test sets at a ratio of 9:1 and annotated in PASCAL VOC format using the labelImg tool. There were six types of solder joint defects, marked as shot out, oxidation, welding missed joint, excessive solder, solder projection, and inveracious solder. We randomly divided 7735 pictures into the training set and 860 pictures into the test set.
The PASCAL VOC dataset [61] is a common public data set used in target detection tasks, including 20 classes of targets such as train, cat, and sofa. The experiment used VOC2012 for training, including 23,080 pictures and 54,900 targets, and used VOC2007 for testing, including 9963 pictures and 24,640 targets.
Public PCB dataset: The PCB defect dataset has collected a total of 693 pictures. They were divided into training sets and test sets according to the ratio of 9:1.
In order to expand the training samples, various forms of data augmentation were applied to the training set. First, traditional data augmentation methods were adopted, for instance rotation, noise, brightness transformation, random cropping, and scaling, to simulate differences in shooting time, angle, and definition. This improves the robustness and generalization of the model and lets the network fully learn the detailed characteristics of solder joint defects. In addition, Mosaic data augmentation was also introduced.
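A minimal NumPy sketch (an assumed implementation, not the paper's code) of the traditional augmentations listed above, applied to an H × W × 3 uint8 image:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(img, sigma=10.0):
    # Additive Gaussian noise simulates differences in image definition.
    noisy = img.astype(np.float32) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def adjust_brightness(img, factor):
    # Brightness transformation simulates differences in shooting time.
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def random_crop(img, size):
    # Random cropping simulates differences in framing and scale.
    h, w = img.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return img[top:top + size, left:left + size]

img = rng.integers(0, 256, (64, 64, 3), dtype=np.uint8)
aug = random_crop(adjust_brightness(add_noise(img), 1.2), 48)
rotated = np.rot90(img)  # rotation simulates differences in shooting angle
```

In practice these transforms would be sampled randomly per image each epoch; bounding box annotations must of course be transformed along with the pixels.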

Experimental Environment Setting
To prove the effectiveness of our strategy in different environments, experimental verifications were carried out on both GPU and CPU. Under the CPU, the TensorFlow framework was used to build a YOLOv4 model with the CSPDarknet53 [57] backbone. Based on the GPU, the PyTorch framework was used to build a model for the same training and testing. Specific parameter settings may vary with the environment; for example, the normalization method was selected according to the specific task and environment configuration. Here, two methods were chosen for the experiments: batch normalization under the TensorFlow framework and local context normalization under the PyTorch framework. See Table 1 for the specific training parameters of the network.

Evaluation Metrics
The most important detection metric for object detection is mean average precision (mAP), the mean value of AP over all classes. Detection results are presented as a binary classification with four categories: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). Based on the counts of the four categories, precision and recall are defined as

Precision = TP / (TP + FP),  Recall = TP / (TP + FN).

AP is the area under the precision-recall (PR) curve:

AP = ∫₀¹ P(R) dR,

and mAP is defined as

mAP = (1/N) Σ_{i=1}^{N} AP_i,

where N represents the number of classes.
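The AP computation can be sketched numerically as follows. This Python example (not taken from the paper) uses the standard all-point interpolation of the PR curve; detections are assumed to be pre-sorted by descending confidence:

```python
import numpy as np

def average_precision(is_tp, n_gt):
    """AP as the area under the interpolated precision-recall curve.

    is_tp: per-detection true/false-positive flags, sorted by descending score.
    n_gt:  number of ground-truth objects of this class.
    """
    flags = np.asarray(is_tp, dtype=bool)
    tp = np.cumsum(flags)
    fp = np.cumsum(~flags)
    recall = tp / n_gt
    precision = tp / (tp + fp)
    # Pad, take the monotone envelope of precision, then integrate over recall.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))

def mean_average_precision(aps):
    # mAP is simply the mean of the per-class APs.
    return float(np.mean(aps))
```

For example, a detector whose two detections are both true positives on two ground-truth objects achieves AP = 1.0, while one whose single detection is a false positive achieves AP = 0.0.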

Experimental Result
In order to verify the performance of the LFE module, a heat map visualization, a distribution diagram of false positive errors, and a series of mAP comparisons between the proposed network and the original network were conducted.

Visual Experiment on the Effect of Location Feature Extraction
Heat map visualization is a tool commonly used in image analysis. It aggregates a large amount of data and uses a graduated color band to show saliency: the more the color tends toward red, the better the recognition effect. Under CPU conditions, a heat map was used to visualize the recognition effect of the improved network with five LFE blocks inserted, and the improved network was compared with the original YOLOv4. Figure 4a shows the results of the original YOLOv4, and Figure 4b shows the results of the improved YOLOv4. Compared with Figure 4a, the red area on the defect in Figure 4b is darker, larger, more concentrated, and clearer in shape, with a lower location error rate, indicating that the improved network can locate the defect target more accurately and delineate its contour in more detail. On the whole, the error rate of the improved network was lower, which shows that the LFE structure is effective for the extraction of location features.

False Positive Error Analysis Experiment of Characteristics
In order to further verify the effect of the LFE module, the methods for diagnosing false positive (FP) errors in object detection proposed by Hoiem et al. [62] were used. Figure 5 shows the evolution of the proportions of four kinds of false positive errors as the total number of false positives increases, comparing the network with five LFE blocks inserted (Figure 5b) with the original YOLOv4 (Figure 5a) on the PASCAL VOC 07+12 datasets. The x-axis represents the increasing number of false positive labels, and the y-axis represents the proportion of each error type, expressed as an area, at a given number of false positives. From the change in the areas occupied by the four error types, the reduction in the area occupied by location errors is the most obvious, proving the effectiveness of our method for location feature extraction. According to the change in the numbers of the various errors, the proportions of location errors, similarity errors, background errors, and other errors were analyzed (see Figure 6 for the results). We found that location errors accounted for the largest proportion in the original YOLOv4, decreased the most after the LFE modules were inserted, and also shrank the most as a proportion of total errors. This proves the effectiveness of our LFE module in reducing location errors and extracting location features. The mAP comparison results in the CPU environment are shown in Table 2. Compared with the original YOLOv4, the mAP of the improved YOLOv4 on all three datasets was improved. Because of the small number of samples in the PCB dataset itself, its mAP was relatively small; however, the results were still improved, indicating that the LFE module is also effective on other datasets and has good generalization ability. Meanwhile, in the GPU environment, one LFE block was integrated into YOLOv4 at the third location in Figure 3, and comparison experiments were conducted on the three datasets.
The experimental results are shown on the right side of Table 2. Verification in the two different experimental environments proves that the location feature extraction structure can improve network accuracy under both GPU and CPU conditions. In addition, the LFE module is universal across different datasets, which demonstrates its generality. Beyond the above standalone network exploration, a joint experiment combining self-knowledge distillation, pre-training, and downstream tasks was conducted. The improved self-distillation was used for pre-training on the Simdis2x network, which was then trained and tested on Pascal VOC, PCB, and our small target dataset. During downstream-task training, the backbone of the training network was changed to ResNet18. Table 3 shows the experimental results. Training conditions are in bold to highlight the changes relative to the initial conditions, and mAP results are in bold to highlight better training effects; the same convention applies to Table 4.
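For reference, the per-class average precision underlying the mAP figures in Tables 2 and 3 is computed from confidence-ranked detections. A minimal all-point-interpolation sketch (variable names are illustrative; VOC also has an older 11-point variant):

```python
def average_precision(detections, num_gt):
    """AP from (confidence, is_true_positive) pairs, all-point interpolation.
    num_gt is the total number of ground-truth boxes for the class."""
    dets = sorted(detections, key=lambda d: -d[0])  # rank by confidence
    tp = fp = 0
    rec, prec = [], []
    for _, is_tp in dets:
        if is_tp:
            tp += 1
        else:
            fp += 1
        rec.append(tp / num_gt)
        prec.append(tp / (tp + fp))
    # make precision monotonically non-increasing, right to left
    for i in range(len(prec) - 2, -1, -1):
        prec[i] = max(prec[i], prec[i + 1])
    # integrate precision over recall steps
    ap, prev_r = 0.0, 0.0
    for r, p in zip(rec, prec):
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```

The mAP is then the mean of this quantity over all classes in the dataset.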
Comparing the experimental results of the different groups, it can be concluded that the proposed self-distillation and LFE modules each improved the network's mAP, and that the combined use of the two worked best. The location feature extraction strategy is thus also suitable for self-supervised scenarios and benefits downstream tasks.
In addition, after the module was inserted into two recent baseline networks [15,63], detection accuracy improved compared with the original networks, as shown in Table 4, which further proves the effectiveness of the LFE block.

Ablation Study
A series of ablation experiments were conducted to explore how to build LFE blocks and how to insert them into a network to achieve the best detection accuracy.

Design of the ConvNHS Module
In the first branch of the LFE block, before the feature map enters the attention mechanism, it is preprocessed by the convolution block ConvNHS. Whether effective location information can be extracted by the subsequent attention mechanism depends largely on this module, so ConvNHS plays a key role. The ConvNHS module is mainly composed of convolution, normalization, and activation functions. Different normalization methods were tested in the ConvNHS module under the GPU environment; as shown in Figure 7a, LCN achieved the best effect, followed by BN. Therefore, LCN was used in the GPU environment, and BN, the best of the global normalization methods, was used in the CPU environment. Based on these experiments, LCN was selected as the default normalization method. Next, the number of ConvNHS modules was explored. As shown in Figure 7b, the detection effect was best when three ConvNHS modules were inserted into the network, so three ConvNHS modules were used in the subsequent experiments.
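To illustrate the local contrast normalization (LCN) used in ConvNHS: each value is centered by the mean of its spatial neighborhood and scaled by the neighborhood standard deviation, in contrast to BN's global per-channel statistics. A minimal single-channel sketch (the window radius and epsilon are our assumptions; real implementations operate on batched tensors per channel):

```python
def local_contrast_norm(fmap, k=1, eps=1e-5):
    """LCN of a 2D feature map (list of lists): subtract the mean and
    divide by the std of each value's (2k+1)x(2k+1) neighborhood."""
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [fmap[j][i]
                    for j in range(max(0, y - k), min(h, y + k + 1))
                    for i in range(max(0, x - k), min(w, x + k + 1))]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            out[y][x] = (fmap[y][x] - mean) / (var ** 0.5 + eps)
    return out
```

Because the statistics are local, flat regions map to zero while local structure is preserved, which is one plausible reason LCN suits spatially scattered small targets better than global normalization.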

Position and Quantity of LFE Block Inserted into YOLOv4
One LFE block was inserted at each of the five locations in Figure 3 in turn for experiments under GPU conditions, and then blocks were inserted at all five positions simultaneously to explore the impact of insertion position and number on the recognition effect. As shown in Figure 7c, for a single insertion, the second or third position was best. We attribute this to the third position being the connection point of the whole PANet, from bottom to top and from top to bottom, where interactive information is richest; location feature extraction at this position was therefore the most effective and abundant. From a global perspective, inserting at all five positions gave the best results, indicating that the effect of LFE structures inserted into the network accumulates.
Based on the above experiments, the best LFE deployment strategy was determined; that is, each LFE block contains three ConvNHS structures, using LCN or BN for normalization. The insertion location is the location with the most abundant interactive information.

Conclusions
A novel positioning strategy, location feature extraction, was proposed; it establishes spatial context mapping and improves the attention mechanism with the CSPNet idea to strengthen the network's attention to location features, obtain more spatial information, and increase detection accuracy. In addition, the proposed self-knowledge distillation was used for pre-training, which strengthened the generalization ability of the network. Self-made solder-joint defect data served as the object of small target detection, and public datasets (Pascal VOC, PCB) were used for general verification. The LFE module was integrated into the YOLOv4 network, and through heat-map visualization, diagnosis of false-positive location errors, and mAP comparison, the superiority of the improved network was verified. In the future, the location feature extraction structure can be further deepened and perfected, and it can be extended from location features to other features so as to design more targeted strategies.