MSB R-CNN: A Multi-Stage Balanced Defect Detection Network

Abstract: Deep learning networks are applied for defect detection, among which Cascade R-CNN is a multi-stage object detection network and is state of the art in terms of accuracy and efficiency. However, it is still a challenge for Cascade R-CNN to deal with complex and diverse defects, as the widely varied shapes of defects make it inefficient for the traditional convolution filter to extract features. Additionally, the imbalance in features, losses and samples causes lower accuracy. To address the above challenges, this paper proposes a multi-stage balanced R-CNN (MSB R-CNN) for defect detection based on Cascade R-CNN. Firstly, deformable convolution is adopted in different stages of the backbone network to improve its adaptability to the varying shapes of the defect. Then, the features obtained by the backbone network are refined and enhanced by the balanced feature pyramid. To overcome the imbalance of classification and regression loss, the balanced L1 loss is applied at different stages to correct it. Finally, for the sample selection, the intersection over union (IoU) balanced sampler and the online hard example mining (OHEM) sampler are combined at different stages to make the sampling more reasonable, which can bring better accuracy and convergence to the model. The results of our experiments on the DAGM2007 dataset show that our network (MSB R-CNN) can achieve a mean average precision (mAP) of 67.5%, an increase of 1.5% mAP compared to Cascade R-CNN.


Introduction
Defect detection [1], an important task in computer vision, has attracted widespread attention in recent years. Defect detection can be applied in a wide range of fields, such as parts manufacturing [2], book printing [3], medical health [4], traffic safety [5], and building maintenance [6]. In general, defect detection faces challenges of imbalances in different levels of features and multiple loss functions in an object detection network. A balanced network often encourages a better performance of computer vision tasks. For example, Cai et al. [7] proposed Cascade R-CNN to address the intersection over union (IoU) imbalance, and achieved state-of-the-art performance in object detection tasks [8] by designing a cascaded network structure and a gradually increased IoU threshold at each stage. However, the performance is very limited when Cascade R-CNN is directly applied in a defect detection task. For example, on the DAGM2007 [10] dataset, Cascade R-CNN achieved lower accuracy (66.0% mAP) than Grid R-CNN [9] (66.5% mAP), despite having more parameters and higher computational costs.
The main reason is that defect detection tasks also require balanced features to achieve higher accuracy. The imbalance problem [11] and the shape features are major factors limiting the accuracy of Cascade R-CNN. Specifically, the defect detection task needs to extract the shape features of the defect and requires feature balance, object balance, and loss balance. Deformable convolution [12] calculates the offsets on the standard convolution to extract the shape features of defects, which can improve the detection accuracy. However, it also increases the complexity and computational costs of the model. Moreover, too many deformable convolutions are not conducive to network learning and will lead to a decrease in accuracy. So, the deformable convolution module needs to be carefully integrated into the backbone network. Cascade R-CNN uses feature pyramid networks (FPN) [13] to integrate the features from the backbone network. This also implies that the effects of high-level and low-level features in the first and last layers are different. To solve this problem, Libra R-CNN [14] uses the same deep integration to balance semantic features and enhance multi-level features.
In view of the above challenges, we propose a multi-stage balanced R-CNN (MSB R-CNN) to introduce a feature balanced module to Cascade R-CNN. Firstly, motivated by the fact that the gradients of the outliers still have a negative effect on learning the inliers with smaller gradients in smooth L1 loss [15], we use balanced L1 loss [14] to increase the gradient contribution of the inliers to the total loss value. By applying balanced L1 loss in each stage, the imbalance of loss can be effectively alleviated. Next, we reasonably combine the advantages of OHEM and IoU balance sampling and create a sample screening strategy to address the sample class imbalance problem. According to the statistics in Focal loss [16], simple samples containing less information are usually negative, and the ratio of hard samples that contain useful information varies dramatically. OHEM [17] can sort the loss of the sample, but it is also susceptible to outliers. The IoU balanced sampling [14] takes into account the relationship between IoU and sample difficulty, and can perform balanced sampling more efficiently. The sample difficulty represents how difficult it is for the sample to be detected.
The main contributions of this paper can be summarized as follows:

1. We reasonably add deformable convolution modules to the backbone network to improve its ability of shape modeling. Therefore, the network can make more accurate predictions of defects that have large varying shapes.

2. We present a balanced network learning strategy for defect detection to improve the convergence effect of the network. For the feature imbalance, we adopt a balanced feature fusion pyramid to make high-level and low-level features more balanced. For the imbalance in regression loss, we apply balanced L1 loss in appropriate stages to better balance the learning benefits between different tasks. For sample class imbalance, we set the sampling method according to the stage to be more in line with the sample distribution characteristics.

3. Our MSB R-CNN network shows better performance on defect detection tasks compared to RetinaNet, Cascade R-CNN, and Libra R-CNN. MSB R-CNN can achieve a mean average precision (mAP) of 67.5% on the DAGM2007 dataset, an improvement of 1.5% mAP compared to Cascade R-CNN.
The remaining parts of this paper are organized as follows: related works are briefly reviewed in Section 2; the proposed method is introduced in Section 3; Section 4 provides the details of the experimental results and analysis; and finally, Section 5 concludes the paper with prospective future work.

Introductions of Related Works
In this section, we review the existing works for object detection and introduce two important methods for the accuracy improvement of defect detection: deformable convolution and feature balances.

Model Architectures for Object Detection
Recently, object detection models have become popularized by both two-stage and single-stage detectors. The two-stage detector was first proposed in R-CNN [18], which produced a significant performance improvement on VOC2007 [19]. SPPNet [20] introduces the spatial pyramid pooling (SPP) layer, which allows CNN [21] to generate fixed-length representations. Then, Fast R-CNN [15] allows simultaneous training of the detector and the bounding box regressor under the same network configuration, which successfully integrates the advantages of R-CNN and SPPNet. Faster R-CNN [22] proposes a region proposal network to improve the efficiency of detectors and allow the detectors to be trained end-to-end. Following Faster R-CNN, many methods have been proposed, such as FPN [13], Cascade R-CNN [7], HTC [23] and Mask R-CNN [24]. On the other hand, single-stage detectors are simpler and faster, and they are popularized by Single Shot MultiBox Detector (SSD) [25] and YOLO [26,27]. RetinaNet [28] introduces a loss function called Focal loss for the imbalance of foreground and background categories in the model, which is used to reduce the weight of a large number of easy negative samples in the standard cross-entropy, thereby making the model more focused on the hard negative samples. Other methods focus on cascade procedures [29], duplicate removal [30], multi-scale features [31], adversarial learning and more contextual information fusion [32].

Deformable Convolution
Dai et al. [12] first propose deformable convolution, in which additional offsets are learned to allow the network to obtain information further from its regular local neighborhood, to improve the capability of regular convolution. Zhu et al. [33] present an improved Deformable ConvNets, which gives the network the ability to focus on regions of interests in the image through increased modeling power and better training. Specifically, the modeling power is enhanced by integrating a modulation mechanism to expand the scope of the deformation, and a more comprehensive convolution mechanism into the network. The authors also guide the network training via a feature mimicking scheme that helps the network to learn features that reflect the object focus and classification power of R-CNN features, to effectively use the enhanced capability.

Imbalance Problems in Detection
Oksuz et al. [11] define the problem of imbalance as the occurrence of a distributional bias regarding an input property in the object detection training pipeline. They identify eight different imbalance problems, which can be grouped into four main categories: class imbalance, scale imbalance, spatial imbalance and objective imbalance. Class imbalance can occur in two different ways from the object detection perspective: foreground-background imbalance and foreground-foreground imbalance. OHEM [17] and prime sample attention (PISA) [34] are two representative methods for solving the class imbalance. OHEM considers the sample loss value to select positive samples and negative samples in a more balanced manner. PISA proposes importance-based sample reweighting, which assigns weights to positive and negative examples based on the IoU of the samples. The scale imbalance is caused by the unbalanced distribution between the object scale and the marked bounding box, and the general solution is to use a balanced feature pyramid. Feature pyramid networks [13], multi-scale contextual features (MSCF) [35], scale aware trident networks [36], and path aggregation network (PANet) [37] are all proposed for solving the scale imbalance. Spatial imbalance can be divided into three types: imbalance in regression loss, IoU distribution imbalance and object location imbalance. Smooth L1 loss, Balanced L1 loss, Kullback-Leibler loss (KL loss) [38], hierarchical shot detector (HSD) [39], and Cascade R-CNN are all proposed for tackling the spatial imbalance. Objective imbalance appears in the process of minimizing the objective loss function during training. Classification-aware regression loss (CARL) [35] and GIoU Loss [40] are proposed for solving the objective imbalance. CARL is a more prominent approach combining classification and regression tasks. GIoU Loss is in the [−1, 1] range and used together with cross-entropy loss.
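To make the [−1, 1] range mentioned above concrete, the following is a minimal sketch of the GIoU measure for two axis-aligned boxes. It is an illustration rather than code from the cited works; the function name `giou` and the `(x1, y1, x2, y2)` box format are assumptions, and boxes are assumed to have positive area. A loss of the form 1 − GIoU is a common choice.

```python
def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2) with positive area."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    # Overlap is clamped at zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull
```

Identical boxes give a value of 1, while boxes that are far apart approach −1, so the measure still provides a training signal when the boxes do not overlap.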

The Proposed MSB R-CNN
MSB R-CNN is an object detection network designed for defect detection. It can better balance the learning of the network and effectively improve the detection accuracy. The network includes five parts, which can be seen in Figure 1: backbone network, feature transformation pyramid, multi-stage detection head, the loss functions, and sampling strategies of the training process. The following subsections will focus on the deformable convolution in the backbone network, the balanced feature pyramid, the staged balanced L1 loss, and the sample selection strategy.

Deformable Convolution for Defect Detection
Convolutional neural networks have an inherent deficiency in the modeling of large and unknown shape transformations. This deficiency comes from the geometric structure of the convolution module: the convolution unit samples the input feature map at fixed positions, and the pooling layer operates at a fixed ratio. Even region of interest (RoI) pooling divides the region of interest into fixed bins. These characteristics matter because the shapes of defects can differ greatly from one another. Deformable convolution and deformable region of interest pooling can effectively improve the ability to model defect deformation. Figure 2 shows that the appropriate addition of deformable convolution to the backbone convolutional network can improve the adaptability of the network to different shapes of defects. Selectively integrating deformable convolution into the backbone network not only improves the extraction of defect shape features, but also keeps the number of added parameters in check.
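The sampling idea behind deformable convolution can be illustrated with a minimal pure-Python sketch. It is not the implementation used in MSB R-CNN: the function names, the nested-list feature map, and the hand-written offsets are assumptions for illustration, whereas in a real network the offsets are predicted by an additional convolution layer. Each of the nine taps of a 3 × 3 kernel is shifted by its own offset, and the feature map is read at the shifted (fractional) position with bilinear interpolation.

```python
import math

def bilinear_sample(feature, y, x):
    """Read a 2-D feature map (list of rows) at fractional (y, x); zero outside."""
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feature[yy][xx]
    return value

def deformable_conv_at(feature, kernel, offsets, cy, cx):
    """Response of a 3x3 deformable convolution centred at (cy, cx).

    offsets[k] = (dy, dx) shifts the k-th kernel tap (row-major order).
    All-zero offsets reproduce a regular 3x3 convolution.
    """
    out = 0.0
    k = 0
    for i in (-1, 0, 1):
        for j in (-1, 0, 1):
            dy, dx = offsets[k]
            out += kernel[i + 1][j + 1] * bilinear_sample(
                feature, cy + i + dy, cx + j + dx)
            k += 1
    return out

feature = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
regular = deformable_conv_at(feature, ones, [(0, 0)] * 9, 1, 1)
shifted = deformable_conv_at(feature, ones, [(0, 0.5)] * 9, 1, 1)
```

With zero offsets the result equals the regular convolution; non-zero offsets let the same kernel follow elongated or irregular defect shapes.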

Feature Balance Transformation
The high-level features extracted by the backbone network have more semantic meaning, while the low-level features have more descriptive content. Both level features have a huge impact on defect detection. Therefore, the method of integrating the high-level and low-level characteristics of defects in MSB R-CNN is particularly important. The feature integration through horizontal connections in FPN [13] and PANet [37] promotes the development of defect detection. However, the integrated feature maps are not balanced from each resolution. Different from using horizontal connections to integrate multi-level features, the key to feature balance is to use the same deep integration of each resolution to balance semantic features to enhance multi-level features [14]. It consists of four steps: scaling, integration, refinement and enhancement, as shown in Figure 3. The feature with a $l$-level resolution is denoted as $C_l$. In Figure 3, $C_2$ has the largest resolution. In order to integrate multi-level features and retain their semantic hierarchical structure, we first reshape the multi-level features $\{C_2, C_3, C_4, C_5\}$ to an intermediate size, i.e., the same size as $C_4$, with interpolation and maximum pooling. Once the features are rescaled, the balanced semantic features are obtained by the following average formula:

$$ C = \frac{1}{L} \sum_{l = l_{\min}}^{l_{\max}} C_l \tag{1} $$

where $L$ denotes the number of multi-level features, and $l_{\min}$ and $l_{\max}$ are denoted as the lowest and highest levels indicators involved. Then, we further refine the balanced semantic features by an embedded Gaussian non-local attention module [41], to make the features more discriminative. After refinement, the features are restored to the original feature map sizes through up-sampling or down-sampling. Then, each one passes through a 3 × 3 convolution for enhancement. Using this method, features from low-level to high-level are aggregated at the same time. The output $\{P_2, P_3, P_4, P_5\}$ is used for object detection in the same pipeline as in FPN.
Therefore, by feeding these balanced features to the multi-stage detector, the performance of defect detection can be improved.

Figure 2. Regular convolution (left) and deformable convolution (right) for defect images. Unlike regular convolution, which uses a fixed-shape convolution kernel, deformable convolution calculates offsets and the orientation for sampling points, which makes the shape of the convolution kernel variable, thereby improving the ability to extract shape features.

Figure 3. With multi-scale feature integration and refinement, we obtain the balanced feature pyramid. Finally, identity connect is performed, that is, adding the original features to the output.
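The scaling, integration, and restoring steps of the feature balance transformation can be sketched as follows. This is a simplified illustration under stated assumptions, not the network's implementation: feature maps are single-channel square nested lists, nearest-neighbour up-sampling stands in for interpolation, the non-local refinement and the 3 × 3 enhancement convolution are omitted, and all function names are hypothetical.

```python
def max_pool(fm, factor):
    """Down-sample a square map by taking the maximum of each factor x factor block."""
    size = len(fm) // factor
    return [[max(fm[i * factor + a][j * factor + b]
                 for a in range(factor) for b in range(factor))
             for j in range(size)] for i in range(size)]

def upsample_nearest(fm, factor):
    """Up-sample a square map by repeating each value factor x factor times."""
    size = len(fm) * factor
    return [[fm[i // factor][j // factor] for j in range(size)] for i in range(size)]

def resize(fm, size):
    """Bring a square map to size x size (sizes assumed to be integer multiples)."""
    current = len(fm)
    if current > size:
        return max_pool(fm, current // size)
    if current < size:
        return upsample_nearest(fm, size // current)
    return [row[:] for row in fm]

def balanced_features(levels, ref_index):
    """Average all levels at the reference size, then restore and add the originals."""
    size = len(levels[ref_index])
    rescaled = [resize(fm, size) for fm in levels]
    count = len(levels)
    balanced = [[sum(fm[i][j] for fm in rescaled) / count
                 for j in range(size)] for i in range(size)]
    outputs = []
    for fm in levels:
        restored = resize(balanced, len(fm))
        # Identity connection: add the original features to the restored ones.
        outputs.append([[restored[i][j] + fm[i][j] for j in range(len(fm))]
                        for i in range(len(fm))])
    return balanced, outputs

# Toy pyramid: C2 (8x8), C3 (4x4), C4 (2x2), C5 (1x1); C4 is the reference level.
levels = [[[1.0] * 8 for _ in range(8)],
          [[2.0] * 4 for _ in range(4)],
          [[3.0] * 2 for _ in range(2)],
          [[6.0]]]
balanced, outputs = balanced_features(levels, ref_index=2)
```

Every output level receives the same averaged semantic content, which is the sense in which the pyramid is "balanced".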

Staged Balanced Loss
A defect detector usually needs to perform both a classification task and a localization task; hence, the classification loss and the localization loss must be balanced during the training process. If the two losses are not balanced, the training effect will be affected. There are also imbalances between simple samples and hard samples. The difficulty of a sample describes how hard it is to detect, and small targets are usually more difficult than large targets. If they are not properly balanced, the small gradients produced by simple samples may be submerged by the large gradients produced by hard samples, which will limit the ability for further refinement. Therefore, the losses and samples both need to be rebalanced to achieve the best convergence.
Let us first review the commonly used smooth L1 loss. Smooth L1 loss is defined as follows:

$$ L1_{\text{smooth}}(x) = \begin{cases} 0.5\,x^{2}, & x < 1 \\ x - 0.5, & \text{otherwise} \end{cases} \tag{2} $$

where $x$ is the absolute difference between the predicted value and the true value of the target bounding box coordinate. However, the gradients of the outliers still have a negative effect on learning the inliers with smaller gradients in smooth L1 loss. To solve this problem, balanced L1 loss [14] considers the gradient balance across inliers and outliers, and clips the large gradients produced by outliers. After adding gradient restriction to the derivative equation of smooth L1 loss, the gradient formulation of balanced L1 loss can be defined as follows:

$$ \frac{\partial L1_{\text{balanced}}}{\partial x} = \begin{cases} \alpha \ln(b x + 1), & x < 1 \\ \gamma, & \text{otherwise} \end{cases} \tag{3} $$

where $\alpha$ represents the contribution of inliers, and $\gamma$ is the upper bound of the error of outliers to balance the tasks. According to Equation (3), $L1_{\text{balanced}}$ can be obtained as follows [14]:

$$ L1_{\text{balanced}}(x) = \begin{cases} \dfrac{\alpha}{b}(b x + 1)\ln(b x + 1) - \alpha x, & x < 1 \\ \gamma x + C, & \text{otherwise} \end{cases} \tag{4} $$

where $b$ is used to ensure $L1_{\text{balanced}}(x)$ is continuous at $x = 1$, $C$ is a constant, and the condition between the parameters is the following:

$$ \alpha \ln(b + 1) = \gamma \tag{5} $$

The effect of the loss function in different detection stages of MSB R-CNN is different. The experimental results seem to indicate that applying the balanced L1 loss to the first and second stages can achieve the best results.
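The following minimal sketch evaluates smooth L1 loss and balanced L1 loss for a scalar error, so the difference in inlier gradients can be checked numerically. The values α = 0.5 and γ = 1.5 are assumptions for illustration and are not stated in this paper; b is derived from the condition α ln(b + 1) = γ, and the constant C is chosen so that the two branches meet at x = 1.

```python
import math

ALPHA = 0.5   # assumed value: contribution of inliers
GAMMA = 1.5   # assumed value: upper bound of the outlier gradient
B = math.exp(GAMMA / ALPHA) - 1   # from alpha * ln(b + 1) = gamma
C = GAMMA / B - ALPHA             # makes both branches equal at x = 1

def smooth_l1(x):
    """Smooth L1 loss of a regression error x."""
    x = abs(x)
    return 0.5 * x * x if x < 1 else x - 0.5

def balanced_l1_grad(x):
    """Gradient of balanced L1 loss with respect to the absolute error."""
    x = abs(x)
    return ALPHA * math.log(B * x + 1) if x < 1 else GAMMA

def balanced_l1(x):
    """Balanced L1 loss of a regression error x."""
    x = abs(x)
    if x < 1:
        return ALPHA / B * (B * x + 1) * math.log(B * x + 1) - ALPHA * x
    return GAMMA * x + C
```

For a small error such as x = 0.1, the smooth L1 gradient is 0.1, whereas the balanced L1 gradient under these assumed parameters is roughly 0.53, so inliers contribute more to the total gradient while outlier gradients stay capped at γ.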

Sample Screening Strategy
In the process of model training, a large number of region proposals are generated, and positive and negative samples are distinguished according to their IoU with the ground-truth bounding box. Assuming that the threshold is set to 0.5, the samples with an IoU in the interval [0.5, 1] are marked as positive samples, and those with an IoU in the interval [0, 0.5) are marked as negative samples. Most of the region proposals are negative samples, so a large number of meaningless negative samples overwhelm the few meaningful positive samples, especially in the multi-stage process in MSB R-CNN. Therefore, the method of constructing the sampling mechanism has a great impact on the training and accuracy of the model.
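The labeling rule above can be written as a short sketch. The function names and the `(x1, y1, x2, y2)` box format are assumptions for illustration; each proposal is compared with every ground-truth box and labeled by its best IoU against the 0.5 threshold.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def label_proposals(proposals, gt_boxes, threshold=0.5):
    """Label each proposal 'positive' if its best IoU reaches the threshold."""
    labels = []
    for proposal in proposals:
        best = max((iou(proposal, gt) for gt in gt_boxes), default=0.0)
        labels.append("positive" if best >= threshold else "negative")
    return labels

gt_boxes = [(0, 0, 10, 10)]
proposals = [(0, 0, 10, 10), (0, 0, 10, 5), (5, 0, 15, 10), (20, 20, 30, 30)]
labels = label_proposals(proposals, gt_boxes)
```

In this toy case the third proposal overlaps the defect with an IoU of one third and is still labeled negative, which is the kind of sample the next paragraphs describe as hard.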
If no objects are identified in the region proposals, all these proposals are considered as background, and the classifier can easily and correctly classify them as such. A region proposal whose IoU with the ground-truth box lies in [0, 0.1] is likewise called a simple sample: it contains few features of the object and is easy to classify. If the IoU between the region proposal and the ground-truth box is close to but less than 0.5, such as 0.4, the region proposal is still considered a negative sample. However, this sample is closer to the ground-truth box and therefore becomes a hard sample. Another intuitive indicator to distinguish simple samples from hard samples is the loss value of the sample. The larger the loss value is, the more difficult the sample is to detect correctly.
In view of this, OHEM and Focal loss are the main methods to solve the sample imbalance problem. OHEM automatically selects hard samples according to their confidence. This process significantly increases memory use and computational costs. In addition, OHEM remains sensitive to noisy samples during the sampling process, so it does not work well in some cases. Focal loss uses an elegant loss function to address the foreground-background class imbalance in single-stage detectors. However, it brings little improvement on multi-stage detectors, due to the differences between multiple types of imbalances.
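A minimal sketch of the hard-example selection step is given below. It assumes that per-sample loss values are already available and ranks samples by loss, as described elsewhere in this paper; the function name and the toy loss values are hypothetical.

```python
def ohem_select(losses, num_keep):
    """Return the indices of the num_keep samples with the largest loss."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order[:num_keep]

# Toy per-proposal losses; index 3 could equally be a mislabeled (noisy) sample.
losses = [0.1, 2.3, 0.05, 9.0, 0.7]
hard_indices = ohem_select(losses, num_keep=2)
```

Because selection depends only on the loss value, a noisy or mislabeled proposal with a very large loss is always kept, which illustrates the sensitivity to outliers noted above.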
In order to overcome the disadvantages of OHEM and Focal loss, IoU balanced sampling [14] takes into account the relationship between IoU and sample difficulty. The public statistical data [14] show that more than 60% of the hard samples have IoU values greater than 0.05 with the ground-truth box. However, only 30% of the samples selected by the random sampler have IoU values greater than 0.05. This indicates that a random sampler can easily lead to unbalanced samples, with many hard samples buried in a large number of simple samples. Based on this observation, the IoU balanced sampling strategy is applied for mining hard samples.
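One way to realize the idea is to split the negative IoU range into equal-width bins and draw the same number of samples from each bin. The sketch below follows that scheme under stated assumptions: the number of bins, the 0.5 upper bound, the top-up rule for under-filled bins, and the function name are illustrative choices and are not specified in this paper.

```python
import random

def iou_balanced_sample(neg_ious, num_samples, num_bins=3, max_iou=0.5, seed=0):
    """Pick negative-sample indices so that each IoU bin is equally represented."""
    rng = random.Random(seed)
    width = max_iou / num_bins
    bins = [[] for _ in range(num_bins)]
    for index, value in enumerate(neg_ious):
        bins[min(int(value / width), num_bins - 1)].append(index)
    per_bin = num_samples // num_bins
    chosen = []
    for members in bins:
        chosen.extend(rng.sample(members, min(per_bin, len(members))))
    # If some bins were too small, fill the remainder from the unused samples.
    shortfall = num_samples - len(chosen)
    if shortfall > 0:
        used = set(chosen)
        remaining = [i for i in range(len(neg_ious)) if i not in used]
        chosen.extend(rng.sample(remaining, min(shortfall, len(remaining))))
    return chosen

# 90 easy negatives (IoU 0.01) and 10 harder ones (IoU 0.2 and 0.4).
neg_ious = [0.01] * 90 + [0.2] * 5 + [0.4] * 5
picked = iou_balanced_sample(neg_ious, num_samples=9)
hard_picked = [i for i in picked if neg_ious[i] > 0.05]
```

In this toy set only 10% of the negatives have an IoU above 0.05, yet six of the nine picked samples do, whereas uniform random sampling would be expected to pick about one.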

Experimental Results and Analysis
Data sets: This paper conducts training and testing on the DAGM2007 data set [10] and GC10 data set [42].

• The DAGM2007 data set is used to detect miscellaneous defects on various background textures. It contains 10 categories of different kinds of defects. Both the training set and the test set consist of 1000 images with one labeled defect each on the background texture. The class distribution of samples in the DAGM2007 training set is shown in Figure 4a.

• The GC10 data set contains 10 categories of different types of steel surface defects, for steel defect detection. It consists of 2000 images in the training set and 500 images in the test set. Each image has multiple labeled defects. The class distribution of samples in the GC10 training set is shown in Figure 4b.

Training settings: The optimizer used in training is Stochastic Gradient Descent (SGD); the basic learning rate is 0.02; the momentum factor is 0.9; and the weight decay factor is set to 0.0001. In the initial 500 iterations, a linear warm-up is used to increase the learning rate from 0.0001 to the basic learning rate. A total of 40 epochs are trained, and a multi-stage learning rate decay strategy is adopted, which reduces the learning rate to 10% at 16 and 38 epochs, respectively. Then, we save the model, test the results after each epoch, and calculate its mAP and the AP of each AR of targets. Training stops when the loss stops decreasing or when the validation accuracy reaches its peak, whichever comes first. The above settings are directly taken from mmdetection [43]. All models use the same training settings on both data sets.

Table 1 shows the overall defect detection performance of MSB R-CNN compared with the experimental results of the previous mainstream single-stage detection algorithms, SSD and RetinaNet, and the multi-stage detection algorithms, Faster R-CNN, Grid R-CNN, Cascade R-CNN and Libra R-CNN. Our network obtains the highest mAP of 67.5%, which is 1.5% higher than that of Cascade R-CNN. On the AP50 value, although we do not achieve the best results, the detection accuracy reaches 98.9%, which is above the expectation of the industrial application (usually above 95% is acceptable for industrial applications). On the AP75 metric, MSB R-CNN achieves the best accuracy of 79.8%.
In the detection of medium and large defects, the mAP of our network reaches the best accuracy of 65.8% and 69%, respectively, indicating that MSB R-CNN has a better detection effect for medium and large objects but less so for small objects.

AR is the average recall for objects, AR = S is AR for small objects (area < $32^2$), AR = M is AR for medium objects ($32^2$ < area < $96^2$), and AR = L is AR for large objects (area > $96^2$).

Next, we analyze the detection effect of each category. Table 2 compares the results of each class of defect detection with the state-of-the-art one-stage detection algorithms, i.e., SSD and RetinaNet, and the multi-stage detection algorithms, Faster R-CNN, Grid R-CNN, Cascade R-CNN and Libra R-CNN. MSB R-CNN achieves the best accuracy in the second, fifth, seventh, and tenth classes. Especially in the tenth category, the mAP of 76.7% achieved by our network significantly outperforms other algorithms. It can also be seen from Figure 5 that these are larger targets. The third type of object is relatively small, and the edges are more complex. The results achieved by our algorithm are much better than those of others.
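The learning rate schedule given in the training settings of this section (linear warm-up over the first 500 iterations from 0.0001 to 0.02, then a reduction to 10% at epochs 16 and 38) can be sketched as a small function. This is an illustration, not the mmdetection implementation: the exact warm-up interpolation, the zero-based epoch counting, and the function name are assumptions.

```python
def learning_rate(iteration, epoch, base_lr=0.02, warmup_start=0.0001,
                  warmup_iters=500, decay_epochs=(16, 38), decay_factor=0.1):
    """Learning rate for a global iteration index and a zero-based epoch index."""
    lr = base_lr
    # Step decay: multiply by decay_factor once for every milestone already reached.
    for milestone in decay_epochs:
        if epoch >= milestone:
            lr *= decay_factor
    # Linear warm-up during the first warmup_iters iterations.
    if iteration < warmup_iters:
        progress = iteration / warmup_iters
        lr = warmup_start + progress * (lr - warmup_start)
    return lr
```

Under these assumptions the rate starts at 0.0001, reaches 0.02 after 500 iterations, and drops to 0.002 and then 0.0002 at the two decay points.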

Experiments on GC10
In order to verify the performance of the model in data sets of varying complexity, we also evaluate our model and other state-of-the-art models on the GC10 data set, which has much greater complexity than DAGM2007. Since the number of samples with a small area is few in GC10 data set, we ignore the APS. As shown in Table 3, our model MSB R-CNN obtains the highest mAP of 34.0%, compared to Faster R-CNN, Grid R-CNN, RetinaNet, Cascade R-CNN, SSD and Libra R-CNN. Compared with Cascade R-CNN, the mAP of MSB R-CNN is 0.6% higher. MSB R-CNN achieves the best accuracy both in AP50 and AP75. However, there is no advantage for MSB R-CNN on APM. Table 4 shows the comparison of accuracy in each category. MSB R-CNN achieves the best results in the second, fourth, fifth, sixth, eighth, and tenth categories. The visualization of the MSB R-CNN prediction results is shown in Figure 6.

Ablation Study
All ablation experiments are based on the DAGM2007 data set. We train the models on the training subset and test on the test subset.

Effectiveness of Our Method
We perform ablation experiments to measure the influence of each module on the accuracy of MSB R-CNN. Table 5 summarizes the experimental results of multiple sets of ablation experiments, where the baseline is Cascade R-CNN, dcn represents a deformable convolutional network, bf represents feature balance, bl represents balanced loss, and sam represents a combination of OHEM and IoU balanced sampling. The baseline's mAP is 66.0%. After adding deformable convolution, the mAP of our network is increased to 66.5%, and the mAP for detection of small defects reaches the highest of 62.6%. With the deformable convolution, the feature balance is performed, and the mAP of the network reaches 66.7%. After the feature is balanced, the loss is also balanced at a specific stage, and the mAP of the network reaches 67.0%. Finally, on the basis of the previous network, the sampling is selected in stages for IoU balanced sampling and OHEM sampling, making the network's mAP reach a maximum of 67.5%, which is 1.5% higher than the baseline. Moreover, the ability to detect large-scale defects reaches the highest level.

Impact of Fusion Deformable Convolution Parameters
It can be seen from Table 6 that adding deformable convolution in the first and third stages can obtain the highest mAP of 66.6%; the number of parameters is also increased by 0.33 M, compared to the benchmark. The more deformable convolutions are added, the larger the number of the parameters. The feature map comparison between deformable convolution and standard convolution is given in Figure 7.

Impact of Feature Balance Transformation
As seen from Table 7, although the value of mAP does not increase much after feature balancing, the APS is increased from 60.7% to 63.0%. This shows that feature imbalance mainly occurs in small targets. At the same time, the detection accuracy of each size is improved to different degrees. Figure 8 shows that the balanced feature has a higher degree of recognition.

Impact of Staged Loss Balance Parameters
Balanced L1 loss balances the contribution of difficult and simple samples to make the network converge better. It can be seen from Table 8 that applying balanced L1 loss in all three stages does not promise better results, as mAP drops to 65.9%. When the balanced L1 loss is applied in the first and second stages, the detection accuracy is the highest, with the mAP reaching 66.7%. Therefore, we apply balanced L1 loss in the first and second stages of MSB R-CNN to achieve better results.

Table 9 shows that applying OHEM at every stage is not the most effective. That is because OHEM sorts the region proposals by loss and then learns from those with larger losses, so the influence of noise on the selected region proposals is still unavoidable. Therefore, adding OHEM in specific stages can improve detection performance, but adding it in all stages will result in a decrease in detection accuracy. Moreover, Table 9 also shows that setting OHEM in one stage is usually better than setting OHEM in multiple stages. So, we apply OHEM in the first stage of MSB R-CNN.

Next, we analyze the impact of IoU balanced sampling on defect detection. Table 10 shows that IoU balanced sampling can effectively improve the accuracy of the network. The mAP of IoU balanced sampling in the first and third stages and IoU balanced sampling in the first and second stages is 66.6%. Using IoU balanced sampling for all three stages, the mAP is 66.5%. Considering the accuracy and complexity, we apply IoU balanced sampling in the first and second stages.

Finally, we analyze the combined influence of OHEM sampling and IoU balanced sampling on defect detection. From the experimental results in Table 11, it can be concluded that the usage of the three different sampling methods has a considerable impact on the accuracy of the network. From the results, the optimal setting is to apply IoU balanced sampling in the first and second stages and use OHEM in the third stage. In this case, we can obtain the best results, and the mAP reaches 66.8%.

Conclusions
In the face of complex defect types, it is difficult for general object detection networks to achieve accurate detection. We optimized Cascade R-CNN for the defect detection task and proposed MSB R-CNN, which can better balance the learning of the network and effectively improve the detection accuracy. MSB R-CNN adopts deformable convolution in the backbone network to improve the detection accuracy of defects with different shapes and uses a balanced feature pyramid to make high-level and low-level features more balanced. During training, the balanced L1 loss is applied to better balance the learning benefits between different tasks, and IoU balanced sampling is used to balance the hard samples and simple samples. Based on the network architecture design and experiment results, MSB R-CNN shows more advantages in terms of accuracy and network balance than other popular detection networks. MSB R-CNN uses a multi-stage detector, which is suitable for high-precision detection, but it is relatively time-consuming. In the future, the proposed method can be further applied to a single-stage detector to meet the needs of real-time detection.
Author Contributions: Z.X., S.L. and Z.Y. implemented the proposed method, analyzed results and drafted the paper; Z.X., S.L. and J.C. conceived and designed the experiments; J.C. analyzed results and also revised the paper with Z.W. and Y.C. All authors have read and agreed to the published version of the manuscript.