3.1. Faster R-CNN Networks
The Faster R-CNN network can be roughly divided into four parts. The first part resizes and preprocesses the original image and feeds it into the backbone for feature extraction, finally obtaining the shared feature map. The second part is candidate box generation: the RPN generates candidate boxes and maps them onto the feature map to obtain feature matrices of different sizes. The third part scales the feature matrices of different sizes to a uniform size via ROI Pooling. The fourth part flattens all the processed feature maps and predicts the target class and target bounding box with the fully connected layers.
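For readers who want to probe this four-stage flow directly, the snippet below runs torchvision's stock Faster R-CNN implementation, which exposes the same backbone, RPN, ROI pooling, and detection head stages; it is the library's reference model, not the modified network described in this paper, and the `weights` argument assumes torchvision 0.13 or later.

```python
import torch
import torchvision

# torchvision's reference Faster R-CNN: backbone -> RPN -> ROI pooling -> box head.
# This is the stock library model, not the network modified in this paper.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None)
model.eval()
with torch.no_grad():
    preds = model([torch.rand(3, 600, 800)])  # one RGB image, values in [0, 1]
print(preds[0].keys())  # dict_keys(['boxes', 'labels', 'scores'])
```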
Analysis showed that the original network had the following problems when detecting tiny foreign bodies: the feature extraction network had only 16 layers, so the extracted target features were insufficient, and only the feature map of the last layer was used; the loss function for the localization box was not expressive enough and could not reflect the gap between the predicted box and the ground-truth box well; the anchor box parameters were set empirically and did not suit the detection of tiny foreign bodies; and the size processing before the fully connected layer lost information because of rounding, which greatly affected the detection of the tiny foreign bodies studied in this paper.
Therefore, to address the above problems, the feature extraction network, IoU loss function, anchor box parameters, and ROI Pooling layer of the original network were optimized and improved, and a feature fusion module and an attention mechanism were added.
3.3. Faster R-CNN Networks with Feature Fusion and Attention Mechanism
In Faster R-CNN, image feature extraction during training directly affects the training result. The feature layers extracted at different levels differ: generally, low-level feature layers have higher resolution but less semantic information, while high-level feature layers have more semantic information but lower resolution. The original Faster R-CNN network directly used the feature map of the last layer of the feature extraction network, which has low resolution (though rich semantic information) because of its depth, making it less effective for detecting tiny foreign bodies.
In this paper, we added an FPN module after the shared feature layer, changing the original mechanism of predicting only on the last layer to predicting on several feature layers. Each high-level feature layer is upsampled by a factor of two and then fused with the next lower-level feature layer, so that the low-level feature layers also carry more semantic information.
The FPN module was added after the ResNet152 feature extraction network, and the features of the Conv2_x, Conv3_x, Conv4_x, and Conv5_x stages of ResNet152 were fused. The network structure after adding FPN is shown in Figure 7.
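As a rough illustration of the fusion just described, the PyTorch sketch below applies 1 × 1 lateral convolutions to the four ResNet152 stage outputs, upsamples each higher level by a factor of two, and adds it to the level below. The 256-channel output width and the 3 × 3 smoothing convolutions are the common FPN defaults and are assumptions here, not parameters quoted from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Top-down feature fusion over the ResNet152 stage outputs C2-C5."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convolutions align the channel counts of each stage
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        # 3x3 convolutions smooth the fused maps
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, c2, c3, c4, c5):
        p5 = self.lateral[3](c5)
        # upsample by a factor of two and add to the next lower level
        p4 = self.lateral[2](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[1](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        p2 = self.lateral[0](c2) + F.interpolate(p3, scale_factor=2, mode="nearest")
        return [s(p) for s, p in zip(self.smooth, (p2, p3, p4, p5))]
```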
After improving the feature extraction network and adding feature fusion, the feature layers involved in subsequent prediction all carry more semantic information. However, we preferred the network to focus on the features of the foreign bodies rather than on all the feature information. As shown in Figure 8, the acquired images contain, in addition to the foreign bodies, surface characters or partial characters on the liquid bags (red boxes), which are similar in shape to the foreign bodies but very different in color. Therefore, the network needed to focus on the channel features of the image and concentrate its attention on the foreign bodies rather than the characters. In addition, some hot joints (green boxes) are produced during the manufacturing of the liquid bags. To avoid recognizing a hot joint as a foreign body, we let the network focus on the spatial size features of the image rather than on the large hot joints, based on the fact that the foreign bodies are smaller than the hot joints.
In this paper, the CBAM attention module was added after the feature fusion module so that the output of the whole shared feature layer focuses more on foreign bodies. The CBAM module combines a channel attention module and a spatial attention module. The channel attention mechanism applies global average pooling and global max pooling to the feature layer in parallel along the channel dimension, passes both results through a shared fully connected network, sums them, and processes the sum with a sigmoid function to obtain a weight for each channel, which is multiplied back onto the input feature layer to complete the attention over the feature channels. The spatial attention mechanism, in turn, stacks the per-pixel maximum and average values along the channel direction to compress the feature to two channels, reduces these to a single-channel weight map with a convolution and a sigmoid function, and finally multiplies this map by the feature layer that has just passed through the channel attention mechanism. The two modules are applied sequentially in the network, completing the whole CBAM. The way the two attention modules are combined in the network is shown in Figure 9.
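A minimal sketch of the CBAM structure described above follows; the reduction ratio of 16 and the 7 × 7 spatial convolution are the usual defaults and are assumptions, not values reported in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, as described in the text."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        # shared MLP (implemented with 1x1 convolutions) for channel attention
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # convolution that turns the stacked mean/max maps into one spatial weight map
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        # channel attention: parallel global average/max pooling -> shared MLP -> sigmoid
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))
        x = x * torch.sigmoid(avg + mx)
        # spatial attention: per-pixel mean and max over channels -> conv -> sigmoid
        s = torch.cat([x.mean(dim=1, keepdim=True), x.max(dim=1, keepdim=True).values], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```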
3.4. Faster R-CNN Networks with Improved IoU Loss Function
The IoU function is often used in Faster R-CNN networks to measure the degree of overlap of two rectangular boxes, which in turn enables the division of positive and negative boxes in the RPN module. It can also be used as a loss function for bounding box regression. The overlap of rectangular boxes A and B is calculated as shown in Equation (4).
IoU is the ratio of the intersection to the union of two rectangular boxes and is scale invariant. The result is a non-negative value in the range 0 to 1; the larger the value, the higher the overlap of the two boxes and the more accurate the model prediction. However, the IoU function cannot always accurately reflect how the two rectangular boxes overlap, and it cannot reflect the distance between rectangular boxes A and B when they do not overlap. As shown in Figure 10, the IoU values of the three rectangular box overlap cases are equal, yet their overlap patterns differ and the regression quality decreases from one case to the next.
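Equation (4) is not reproduced here, but its standard form is IoU(A, B) = |A ∩ B| / |A ∪ B|; the short function below is a sketch of that computation for axis-aligned boxes given as (x1, y1, x2, y2).

```python
def iou(a, b):
    """IoU of two axis-aligned boxes a and b, each given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # intersection area (0 if disjoint)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)            # intersection over union

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))              # 25 / 175 ≈ 0.143
```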
To address the case where IoU cannot express the distance between two rectangular boxes, Rezatofighi et al. [17] proposed the GIoU approach, which uses the area C of the smallest enclosing rectangle of the two boxes to characterize their distance, as shown in Equation (5).
As seen from the above equation, when rectangular boxes A and B do not intersect, the farther apart the two boxes, the larger C becomes and the closer GIoU approaches −1. When the two boxes completely overlap, the term added to IoU in GIoU becomes 0, so GIoU and IoU are both equal to 1; hence GIoU takes values in (−1, 1]. When the two boxes completely overlap or one contains the other, the GIoU formula reduces to IoU, and, as with IoU, the relative positions of the two boxes cannot be distinguished.
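The following sketch illustrates the GIoU idea of Equation (5) under the same box convention as the IoU sketch above: the IoU is reduced by the fraction of the smallest enclosing box C covered by neither box, so widely separated boxes are pushed toward −1.

```python
def giou(a, b):
    """GIoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # intersection and union, as in the IoU sketch above
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # smallest enclosing rectangle C of the two boxes
    c = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (c - union) / c
```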
GIoU cannot reflect, in the loss function, the situation where the overlapping regions are the same but the directions and distances differ. To address this, Zheng et al. [18] proposed the DIoU method, as shown in Equation (6).
Because the DIoU method considers the degree of overlap between the target box and the prediction box, the distance between the boxes, and the scale information, it makes the target box more stable during the regression operation, which is more consistent with the mechanism of bounding box regression than GIoU. However, DIoU also has a limitation: it does not consider the important factor of aspect ratio. Therefore, Zheng et al. [18] also proposed the CIoU method, as shown in Equation (7).
CIoU considers the overlap area, the center point distance, and the aspect ratio, and focal loss is introduced to alleviate the sample imbalance problem during box regression. Finally, this paper uses the CIoU function instead of the IoU function in the original Faster R-CNN network to increase the convergence speed of training and the detection accuracy of the network.
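A sketch of the DIoU and CIoU penalties of Equations (6) and (7), following the formulation of Zheng et al. [18], is given below; the corresponding regression loss would be 1 − CIoU.

```python
import math

def ciou(a, b):
    """CIoU of two axis-aligned boxes given as (x1, y1, x2, y2); the loss is 1 - ciou."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    wa, ha = a[2] - a[0], a[3] - a[1]
    wb, hb = b[2] - b[0], b[3] - b[1]
    iou_val = inter / (wa * ha + wb * hb - inter)
    # DIoU penalty: squared center-point distance normalized by the squared
    # diagonal of the smallest enclosing box
    rho2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 \
         + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 \
       + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    diou_val = iou_val - rho2 / c2
    # extra CIoU penalty: consistency of the aspect ratios
    v = (4 / math.pi ** 2) * (math.atan(wa / ha) - math.atan(wb / hb)) ** 2
    alpha = v / (1 - iou_val + v)
    return diou_val - alpha * v
```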
3.5. Faster R-CNN Networks with Improved Anchor Box Parameters
When the Faster R-CNN network performs a target recognition task, it first needs to generate candidate boxes on the image. The tool Faster R-CNN uses to generate candidate boxes is the RPN, which produces candidate boxes of various sizes from anchor box templates: each position obtained by the RPN convolution serves as the center point of candidate boxes, whose widths and heights are then determined by the sizes and aspect ratios provided by the anchor boxes.
Usually, the anchor box parameters of Faster R-CNN are set empirically and contain three sizes and three ratios: sizes of 128 × 128, 256 × 256, and 512 × 512 and ratios of 1:1, 1:2, and 2:1, so that each center point generates nine different anchor boxes. The candidate boxes generated by these conventional parameters are all much larger than the foreign bodies studied in this paper, so background pixels occupy a large proportion of each candidate box, which adversely affects the loss function and the accuracy of the network. Setting the anchor box sizes close to the sizes of the foreign bodies in this paper increases the proportion of foreign-body pixels in the candidate boxes and makes target feature extraction more effective. At the same time, it helps balance the positive and negative samples, thus improving the training speed and recognition rate of the network.
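The nine conventional anchors can be reproduced as follows; the area-preserving convention used here (width and height scaled by the square root of the ratio so each anchor keeps roughly size × size pixels) is one common choice and is an assumption, not a detail stated in the paper.

```python
import math

sizes = [128, 256, 512]
ratios = [1.0, 0.5, 2.0]            # height/width of 1:1, 1:2, 2:1
anchors = []
for s in sizes:
    for r in ratios:
        w = s / math.sqrt(r)        # keep the anchor area close to s * s
        h = s * math.sqrt(r)
        anchors.append((round(w), round(h)))
print(anchors)                      # nine (width, height) templates per center point
```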
In this paper, we use the K-means clustering algorithm to cluster the target boxes of the liquid bag foreign body dataset and reset the anchor box sizes according to the clustering results, thereby improving the detection performance of the network and accelerating both training convergence and detection. Because the Euclidean distance between a sample and a cluster center used by the standard K-means algorithm cannot measure the degree of overlap between two rectangular boxes, this paper uses 1 − IoU in place of the Euclidean distance, as shown in Equation (8).
where d denotes the distance from the target box to the cluster center box, and IoU can be understood as the degree of overlap between the target box and the cluster center box, calculated as the ratio of the intersection of the two boxes to their union, so its value lies between 0 and 1 and approaches 1 as the two boxes become more similar. When the K-means algorithm measures object similarity, similarity is inversely related to distance, i.e., the higher the similarity, the smaller the distance, so the distance is defined as 1 − IoU. For box clustering, an evaluation criterion is also needed to judge the clustering result. Since the IoU introduced above indicates the degree of overlap between two boxes, the larger the IoU between rectangular boxes assigned to the same class, the better the clustering result. In this paper, we use the average IoU as the evaluation index; its expression is shown in Equation (9).
where a larger value of q indicates a better clustering result, and the max term denotes, for each of the m annotation boxes, the maximum of its IoU values with the k generated anchor boxes. According to the above principle, the steps for clustering the dimensions of all foreign body annotation boxes with K-means are as follows (a code sketch of the procedure is given after the list):
(1) Read all the annotation files from the dataset and obtain the width and height of each foreign body annotation box.
(2) Randomly select k of the read annotation boxes as the initial cluster centers.
(3) Calculate the distance from each annotation box to each cluster center box and assign the annotation box to the cluster whose center is nearest.
(4) Based on the newly generated clusters, calculate the new cluster centers using the median of the boxes in each cluster.
(5) Repeat steps (3) and (4) until the cluster centers calculated in step (4) no longer change, thus obtaining the final anchor boxes.
(6) Calculate the IoU values between all the annotation boxes and the final anchor boxes, take the highest value for each annotation box, sum these values, and divide by the number of boxes to obtain the average IoU when clustering into k classes.
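As referenced above, the following sketch implements steps (1) to (6), assuming the annotation boxes have already been read into an (m, 2) array of (width, height) pairs; the function names wh_iou and kmeans_anchors are illustrative only.

```python
import numpy as np

def wh_iou(boxes, centers):
    """Pairwise IoU of (width, height) boxes aligned at a common top-left corner."""
    inter = np.minimum(boxes[:, None, 0], centers[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centers[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + (centers[:, 0] * centers[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=300, seed=0):
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), k, replace=False)]          # step (2)
    for _ in range(iters):
        assign = np.argmin(1.0 - wh_iou(boxes, centers), axis=1)       # step (3): d = 1 - IoU
        new = np.array([np.median(boxes[assign == i], axis=0)          # step (4): median centers
                        if np.any(assign == i) else centers[i] for i in range(k)])
        if np.allclose(new, centers):                                  # step (5): converged
            break
        centers = new
    avg_iou = np.mean(np.max(wh_iou(boxes, centers), axis=1))          # step (6), Equation (9)
    return centers, avg_iou
```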
The original network uses parameters that generate nine anchor boxes. In this paper, we select k in the range of 2 to 10 for clustering, obtaining nine different groups of anchor boxes; their corresponding average IoU values are shown in Figure 11.
From the figure, the average IoU increases gradually as the number of clusters k increases; the growth is faster when k is in the range of 2 to 8 and slower at k = 9 and 10. Considering both the computational cost and the accuracy of the model, the eight anchor boxes generated at k = 8 were finally selected. K-means clustering of the labeled boxes with k = 8 was run several times, and the group of anchor box sizes (width, height) with the largest average IoU was selected: (9, 9), (12, 10), (13, 15), (18, 13), (13, 25), (19, 19), (25, 20), and (58, 47).
3.6. Improving Faster R-CNN Networks with ROI Pooling Module
After the improved shared feature layer is added, it is passed into the RPN to generate candidate boxes, and, combined with the shared feature layer, the feature regions mapped onto the feature layer are obtained after the candidate boxes are filtered. Since the candidate boxes generated by the RPN have different sizes to accommodate different targets, the resulting feature regions also differ in size, but the final fully connected layers of the network require input feature maps of the same size. The ROI Pooling layer solves this problem by dividing the mapped regions of different sizes into equally sized sections and performing a max pooling operation on each section, so that ROI regions of different sizes produce outputs of the same size. The ROI Pooling process is illustrated in Figure 12.
During the execution of ROI Pooling, the position coordinates may be floating point numbers when a candidate box is mapped onto the feature map, and the segmentation boundaries may also be floating point numbers when the ROI region is divided into equal sections; the original network rounds the coordinates in both cases.
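A small numeric example of the first rounding step is given below, assuming a feature stride of 16 (a typical backbone stride; the actual stride of the modified network may differ).

```python
stride = 16
x1_img = 133                          # candidate box edge in image coordinates
x1_feat_exact = x1_img / stride       # 8.3125 on the feature map
x1_feat_pooled = int(x1_feat_exact)   # ROI Pooling truncates this to 8
shift = (x1_feat_exact - x1_feat_pooled) * stride
print(shift)                          # 5.0 pixels of offset before the second rounding
```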
Because of these rounding operations, detection accuracy suffers greatly when detecting small targets, since this mechanism can introduce a large error between the position of the candidate box and the real image. To solve this problem, the ROI Align module from the Mask R-CNN [19] network is introduced to replace the ROI Pooling module of the original network, whose two rounding operations shift the positions and thereby lose information and accuracy. The ROI Align module uses bilinear interpolation to obtain the pixel values at floating point coordinates, thus avoiding the quantization process. The ROI Align module is executed as follows:
(1) Map the coordinates of the anchor box in the original image coordinate system to coordinates in the feature image coordinate system according to the ratio of the original image to the feature image; the resulting coordinates in the feature map coordinate system are floating point coordinates, i.e., the floating point coordinates of the ROI.
(2) Split each ROI into the same number of small cells, which keep floating point coordinates in the feature image coordinate system without quantization, so two kinds of floating point coordinates are obtained in the feature image coordinate system: those of the ROIs and those of the split cells.
(3) Calculate the pixel values corresponding to these floating point coordinates via bilinear interpolation, so that the pixel values at the floating point sampling coordinates inside each split cell are obtained. On this basis, a max pooling operation is performed on each split cell so that a feature map of uniform size is obtained. Bilinear interpolation extends single linear interpolation in one direction to both the horizontal and vertical directions.
The theory of linear interpolation can be roughly described as follows: given two known endpoints, find the value at a point on the line between them, where the horizontal coordinate is the floating point pixel coordinate and the vertical coordinate is the pixel value.
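The sketch below shows the bilinear sampling that ROI Align performs at each floating point coordinate: two linear interpolations along the horizontal direction followed by one along the vertical direction; the function name bilinear_sample is illustrative.

```python
import numpy as np

def bilinear_sample(feature, x, y):
    """Value of a 2D feature map at floating point coordinate (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, feature.shape[1] - 1)
    y1 = min(y0 + 1, feature.shape[0] - 1)
    dx, dy = x - x0, y - y0
    top = (1 - dx) * feature[y0, x0] + dx * feature[y0, x1]      # interpolate along x (top row)
    bottom = (1 - dx) * feature[y1, x0] + dx * feature[y1, x1]   # interpolate along x (bottom row)
    return (1 - dy) * top + dy * bottom                          # interpolate along y

feat = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear_sample(feat, 1.5, 2.25))   # pixel value at a floating point ROI coordinate
```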