4.1. Edge Extraction
The quality of the input image matters because it is the first step of the detection pipeline and directly affects every subsequent stage. Although strong noise immunity is one of the advantages of deep neural networks, every network benefits from high-quality input, which lets the trained model parameters attend more strongly to the target. We therefore use edge detection to strengthen the semantic information in the images as a form of image enhancement, as detailed in this subsection.
The Canny algorithm is used to extract edge information from UAV aerial images. It consists of four main steps: Gaussian smoothing of the image, calculation of the gradient magnitude and direction, nonmaximum suppression of the gradient magnitude, and double-threshold detection with edge connection.
Our images are captured by UAV aerial photography and are highly susceptible to light reflections, which produce overexposed points. To reduce the influence of these bright white points, a Gaussian kernel is used to smooth the image.
Compared with the median filter [44] and the mean filter [45], the Gaussian filter assigns different weights to the different neighbors of the current element, which achieves denoising while preserving the gray-level distribution of the image. Gaussian filtering is usually implemented by sliding a (2k + 1) × (2k + 1) convolution kernel over the image. The kernel generation equation is shown in Equation (7).
where k is an integer, (2k + 1) is the size of the convolution kernel, and (i, j) are the coordinates of a point in the kernel.
The size of the convolution kernel is usually set to an odd number for convenience of calculation, so that the kernel has a center element. The larger the kernel, the stronger its ability to suppress local noise. In our experiments, kernels of size 3 × 3, 5 × 5, and 9 × 9 were compared, and the 5 × 5 kernel gave the best result.
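As an illustration of the smoothing step, the following is a minimal pure-Python sketch, not the implementation used in our experiments. It assumes the standard Gaussian form exp(-((i - (k + 1))^2 + (j - (k + 1))^2) / (2σ^2)) / (2πσ^2) for 1 ≤ i, j ≤ 2k + 1; the standard deviation `sigma`, the final normalization, the border replication, and the function names are assumptions made for the example.

```python
import math

def gaussian_kernel(k, sigma):
    """Return a normalized (2k+1) x (2k+1) Gaussian kernel as nested lists."""
    size = 2 * k + 1
    kernel = []
    for i in range(1, size + 1):
        row = []
        for j in range(1, size + 1):
            # Squared distance from the kernel center (k+1, k+1), 1-based indices.
            d2 = (i - (k + 1)) ** 2 + (j - (k + 1)) ** 2
            row.append(math.exp(-d2 / (2.0 * sigma ** 2)) / (2.0 * math.pi * sigma ** 2))
        kernel.append(row)
    # Normalize so the weights sum to 1 (assumption: keeps overall brightness unchanged).
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]

def smooth(image, kernel):
    """Filter a 2-D grayscale image (nested lists) with the kernel, replicating borders."""
    h, w = len(image), len(image[0])
    k = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-k, k + 1):
                for dx in range(-k, k + 1):
                    # Clamp coordinates so border pixels reuse their nearest neighbor.
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += kernel[dy + k][dx + k] * image[yy][xx]
            out[y][x] = acc
    return out

# k = 2 gives the 5 x 5 kernel; sigma = 1.4 is a hypothetical value.
kernel_5x5 = gaussian_kernel(k=2, sigma=1.4)
```

Because the kernel is symmetric, correlation and convolution coincide here, so the kernel is not flipped.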
After Gaussian smoothing, the background still contains overexposed points. Their negative impact on the model is limited, because the network focuses on the ground-truth regions during training. What matters is whether the features of the vibration damper are enhanced, and edge detection is one of the important means of image enhancement. In the Canny algorithm, regions of the image with large gradient variation are more likely to be edges. Therefore, the next step is to extract the gradient information of the image.
Gradients reflect the intensity of local pixel changes: the larger the gradient, the greater the change in the corresponding region. A gradient has two components, magnitude and direction, and is usually represented by computing the gradients in the horizontal and vertical directions. The calculation formulas are shown in Equations (8) and (9).
The direction a and magnitude b of the gradient can then be obtained from the gradients in the horizontal and vertical directions, as shown in Equations (10) and (11).
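The gradient step can be sketched as follows. This is an illustrative example only: it assumes the common 3 × 3 Sobel operators for the horizontal and vertical gradients, which may differ from the operators in Equations (8) and (9), and it computes the magnitude as the Euclidean norm and the direction with `atan2`. The function name `gradients` is hypothetical.

```python
import math

# Sobel operators (an assumption; any pair of directional derivative filters fits here).
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def gradients(image):
    """Return (magnitude, direction) maps for a 2-D grayscale image (nested lists)."""
    h, w = len(image), len(image[0])
    mag = [[0.0] * w for _ in range(h)]
    ang = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = gy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    # Replicate border pixels by clamping coordinates.
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    gx += SOBEL_X[dy + 1][dx + 1] * image[yy][xx]
                    gy += SOBEL_Y[dy + 1][dx + 1] * image[yy][xx]
            mag[y][x] = math.hypot(gx, gy)   # gradient magnitude
            ang[y][x] = math.atan2(gy, gx)   # gradient direction in radians
    return mag, ang
```

On an image with a vertical step edge, the horizontal gradient dominates and the direction is close to zero, which is the case the next step treats as a "horizontal" gradient.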
Gradient images contain all grayscale variations. Therefore, the Canny algorithm uses the nonmaximum suppression method [41] to suppress the weaker gradient variations in each region.
The nonmaximum suppression algorithm examines the eight neighbors of each pixel and retains the pixels with the largest grayscale change along the horizontal, vertical, and diagonal directions while eliminating those with smaller changes, thinning the wide edges in the gradient map to a width of a single pixel.
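A minimal sketch of nonmaximum suppression is given below, under the assumption that the gradient direction is quantized to four orientations (0°, 45°, 90°, 135°) and each pixel is compared with its two neighbors along that orientation. The inputs `mag` and `ang` are a magnitude map and a direction map in radians; the function name and the choice to leave border pixels at zero are assumptions.

```python
import math

def non_max_suppression(mag, ang):
    """Keep a pixel only if it is a local maximum along its gradient direction."""
    h, w = len(mag), len(mag[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Fold the direction into [0, 180) degrees; opposite directions are equivalent.
            deg = math.degrees(ang[y][x]) % 180.0
            if deg < 22.5 or deg >= 157.5:
                n1, n2 = mag[y][x - 1], mag[y][x + 1]          # left and right
            elif deg < 67.5:
                n1, n2 = mag[y - 1][x - 1], mag[y + 1][x + 1]  # 45-degree diagonal
            elif deg < 112.5:
                n1, n2 = mag[y - 1][x], mag[y + 1][x]          # up and down
            else:
                n1, n2 = mag[y - 1][x + 1], mag[y + 1][x - 1]  # 135-degree diagonal
            # Retain the pixel only if neither neighbor along the gradient is larger.
            if mag[y][x] >= n1 and mag[y][x] >= n2:
                out[y][x] = mag[y][x]
    return out
```

For a ridge whose magnitude profile across the edge is 0, 1, 3, 1, 0, only the pixel with value 3 survives, which is the single-pixel thinning described above.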
Nonmaximum suppression can only sharpen the edge information; it cannot guarantee that what remains is foreground information. Therefore, the last step of the Canny algorithm uses the double-threshold algorithm to separate the foreground from the background based on our prior knowledge.
In the double-threshold algorithm, pixels above the strong edge threshold are treated as edge information, and pixels below the weak edge threshold are treated as background information. Pixels between the two thresholds are pending: if a pending pixel has a strong edge in its eight-neighborhood, it is also classified as an edge pixel. Comparison experiments with strong edge thresholds of 200, 300, and 400 showed that 300 works best, with the weak edge threshold set to 0.5 times the strong edge threshold. The formula for classifying gradient map pixels is shown in Equation (12).
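The classification rule can be sketched as follows, using the strong threshold of 300 and the weak-to-strong ratio of 0.5 reported above. This is a single-pass illustration of the rule as stated in the text; the handling of values exactly equal to a threshold and the function name are assumptions, since they depend on the precise form of Equation (12).

```python
def double_threshold(nms, high=300.0, low_ratio=0.5):
    """Classify pixels of a suppressed gradient map as edge (True) or background (False)."""
    low = low_ratio * high
    h, w = len(nms), len(nms[0])
    # Strong edges: strictly above the strong threshold.
    strong = [[nms[y][x] > high for x in range(w)] for y in range(h)]
    edges = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if strong[y][x]:
                edges[y][x] = True
            elif nms[y][x] >= low:
                # Pending pixel: accept it if any of its eight neighbors is a strong edge.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        if (dy or dx) and 0 <= yy < h and 0 <= xx < w and strong[yy][xx]:
                            edges[y][x] = True
            # Pixels below the weak threshold stay background.
    return edges
```

Many Canny implementations go one step further and propagate edge labels through chains of connected pending pixels; the sketch keeps only the eight-neighborhood test described here.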
To verify the effect of edge detection, we compared the performance of several classical edge detection operators on vibration dampers. As shown in Figure 1, the edges extracted by the Canny operator are the clearest.
4.2. Attention Mechanism
Once the edge information in the image has been obtained with the Canny algorithm, it can be used to guide the network. The attention mechanism [42] originated in the field of NLP and has been introduced into computer vision in recent years. As shown in Figure 2, by introducing additional convolution operations, the attention mechanism lets the network focus on the additional information being added.
The attention mechanism takes the edge information obtained by the Canny algorithm and performs a convolution operation to obtain the attention weight matrix a. The expression of the convolution operation is shown in Equation (13).
The terms of Equation (13) are the input image, the parameters of the convolution operation, and the SoftMax function used for normalization.
We multiplied the resulting attention weight matrix with the corresponding input image to obtain the final output of the attention mechanism, combining the input images through element-wise matrix multiplication and addition.
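The following sketch illustrates one plausible reading of this attention step; it is not the exact formulation of our network. It assumes that the convolution is a 1 × 1 convolution with a single scalar weight and bias, that SoftMax is taken over all spatial positions, and that the output is the element-wise product of the weights and the features plus the features themselves. The names `attention_weights` and `apply_attention` are hypothetical.

```python
import math

def attention_weights(edge_map, weight=1.0, bias=0.0):
    """Apply a 1x1 convolution (scalar weight and bias), then a spatial softmax."""
    logits = [[weight * v + bias for v in row] for row in edge_map]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(max(row) for row in logits)
    exps = [[math.exp(v - m) for v in row] for row in logits]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]

def apply_attention(features, weights):
    """Multiply features by the weights element-wise, then add the features back."""
    return [[f * a + f for f, a in zip(frow, arow)]
            for frow, arow in zip(features, weights)]

# Toy example: one edge pixel in a 2 x 2 edge map, uniform features.
edge = [[0.0, 1.0], [0.0, 0.0]]
feat = [[1.0, 1.0], [1.0, 1.0]]
out = apply_attention(feat, attention_weights(edge))
```

In this toy example the position that coincides with the edge pixel receives the largest weight, so its feature value is amplified more than the others.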
The attention mechanism is used in ResNet101 to feed the edge image output by the Canny algorithm into the network, enhancing the network’s ability to focus on the ground-truth region during feature extraction. We apply the attention mechanism in layers 1, 2, and 3 of ResNet because the network focuses on the low-level features of the input image in the early stages of feature extraction. At the fourth and fifth layers, the output is a feature map with highly abstract semantics, and introducing an attention mechanism that carries the edge map at this point interferes with the feature map. A follow-up sensitivity analysis of where the attention mechanism is introduced supports this choice.
4.3. Feature Fusion Network
After edge detection and the attention mechanism were introduced, our framework improved to a certain extent. However, in the inspection data of overhead transmission lines captured by UAVs, the vibration damper is a small target. When ResNet101 performs feature extraction, the deep layers respond readily to semantic features and the shallow layers respond readily to image features. This property leads to a problem: although the high-level layers respond to semantic features, their feature maps are small and contain little geometric information, which is not conducive to object detection. The problem is more pronounced for small objects: because the target is small, the vibration damper easily disappears from the feature map output by the fifth layer of ResNet.
The disappearance of the vibration damper feature leads to a decrease in detection accuracy.
It is natural to expect that a feature map combining deep and shallow features can meet the needs of small target detection. FPN [43] is a network structure that adopts this idea. FPN uses the idea of the image pyramid to address the difficulty of detecting small objects in object detection scenes. The traditional image pyramid method takes multiscale image inputs to construct multiscale features. The biggest problem with this approach is that the recognition time is k times that of a single image, where k is the number of scales.
To improve the detection speed, methods such as Faster R-CNN [46] use a single-scale feature map, but this limits the detection capability of the model, especially for samples with extremely low coverage in the training set (such as very large and very small samples). Unlike Faster R-CNN, which only uses the top-level feature map, SSD [47] exploits the hierarchical structure of convolutional networks: starting from conv4_3 of VGG [48], it obtains multiscale feature maps from different network layers. This method improves accuracy without increasing the test time, but it does not use the low-level feature maps, which are very helpful for detecting small objects. In response to these problems, FPN adopts the pyramidal feature-map hierarchy of SSD.
Unlike SSD, FPN uses not only the deep feature maps of the backbone but also the shallow ones. These feature maps are efficiently integrated through bottom-up, top-down, and lateral connections, which improves accuracy without greatly increasing the detection time. Therefore, as shown in Figure 3, this article follows these practices and introduces a structure composed of FPN and a bottom-up path after the third, fourth, and fifth layers of ResNet101, so that the final output feature maps at the three scales are richer in both semantics and line features.
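The fusion idea can be illustrated with a small single-channel sketch. It is a simplification under stated assumptions rather than the structure in Figure 3: the lateral 1 × 1 convolutions are omitted, top-down upsampling uses nearest-neighbor interpolation, the bottom-up path downsamples with 2 × 2 max pooling, and fusion is element-wise addition. The names `c3`, `c4`, `c5` stand for the outputs of the third, fourth, and fifth layers, and `fuse` is a hypothetical helper.

```python
def upsample2x(fm):
    """Nearest-neighbor upsampling of a 2-D map by a factor of 2."""
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def downsample2x(fm):
    """2x2 max pooling with stride 2 (assumes even height and width)."""
    return [[max(fm[y][x], fm[y][x + 1], fm[y + 1][x], fm[y + 1][x + 1])
             for x in range(0, len(fm[0]), 2)]
            for y in range(0, len(fm), 2)]

def add(a, b):
    """Element-wise addition of two maps of equal size."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def fuse(c3, c4, c5):
    """Top-down (FPN-style) pass followed by a bottom-up pass over three scales."""
    # Top-down: deep semantics flow into the shallower, larger maps.
    p5 = c5
    p4 = add(c4, upsample2x(p5))
    p3 = add(c3, upsample2x(p4))
    # Bottom-up: shallow detail flows back into the deeper, smaller maps.
    n3 = p3
    n4 = add(p4, downsample2x(n3))
    n5 = add(p5, downsample2x(n4))
    return n3, n4, n5

# Toy maps: each level halves the spatial size of the previous one.
c3 = [[1.0] * 4 for _ in range(4)]
c4 = [[10.0] * 2 for _ in range(2)]
c5 = [[100.0]]
n3, n4, n5 = fuse(c3, c4, c5)
```

After both passes, every output scale contains contributions from all three input scales, which is the property that helps small targets survive in the deeper feature maps.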
DamperYOLO was trained after all framework components were introduced. The training process is described in Algorithm 1. As shown in Figure 4, the Edge Detection module, the ResNet101 backbone, the Attention Mechanism, and the FPN and Bottom-up framework together make up the entire vibration damper detection process.
Algorithm 1: The Training Process of DamperYOLO.
Input: Original damper image set in which each image contains dampers.
Output: DamperYOLO after training.
1: Initialize DamperYOLO with random weights;
2: repeat
3:   for i in 1~epochs do
4:     for j in 1~N do
5:       Apply image augmentation to the current image;
6:       Extract feature map using ResNet101;
7:       Output detection results using YOLO;
8:       Calculate the penalty value via Formulas (2), (5), and (6);
9:       Minimize Formula (1) to update the parameters of DamperYOLO;
10:    end for
11:  end for
12: until DamperYOLO completes convergence
13: return the trained DamperYOLO