1. Introduction
The main function of a shaft is to support transmission components, transmit torque, and bear loads. During processing stages such as the continuous casting of the steel billet, cutting, and grinding, the surface of a metal shaft is prone to cracks, scabs, roll marks, scratches, and other types of defects. Such defects have a great influence on the performance and life of the shaft and easily cause equipment failure. Therefore, the detection of shaft defects has always been a particular concern of shaft manufacturers. At the same time, in order to save costs, some defective shafts are recycled and remachined, and shafts that cannot be remachined are scrapped directly. It is therefore necessary to classify defective shafts according to defect type.
Traditional shaft surface defect inspection relies on manual operation. It is labor-intensive, workers' inspection experience varies, and long-term inspection affects their mental state, resulting in low detection efficiency, poor consistency of results, and false or missed detections. The metal shafts are produced by turning and then polished, and various defects can arise during machining. In this project, the lengths of the tested polished metal shafts range from 100 mm to 400 mm, and most of the defects can be classified into pits, breaches, abrasions and scratches; the remainder are classified as unknown defects. The common shapes of the four defect types are shown in
Figure 1.
Figure 1a shows a pit defect, characterized by a circular shape with a small diameter of about 0.3 mm;
Figure 1b shows a breach defect, characterized by a short length and a width of 0.5 mm, with non-circular grooves around it;
Figure 1c shows an abrasion defect, characterized by a large defect area, a fish-scale appearance, and a variety of shapes; and
Figure 1d shows a scratch defect, characterized by a long, fine shape with a very small width of about 0.3 mm.
Machine vision detection is a recognition technology, based on computer vision, that achieves object detection and replaces traditional manual inspection. Since the last century, machine vision technology has been widely used for defect detection and quality control in many fields [1,2,3,4,5,6,7,8], such as mechanics, chemistry, materials science, agriculture, tanning, textiles, printing, electronics, and so on. In recent years, the application of deep learning technology using neural networks in the field of machine vision has raised the recognition ability of machine vision to new heights.
The neural network [9,10,11] originated in the 1940s, evolving through the MP model that mimics the human brain, the single-layer perceptron that added a learning capability, and the enhanced BP neural network, followed by a period of stagnation after the 1990s. With the development of computer technology, neural network research has since risen rapidly, the fields and scopes of application have greatly expanded, and an even broader application space lies ahead [12,13,14]. Among these developments, deep learning was quick to build on neural network technology [15,16].
In fact, the application of machine vision to surface defect detection, whether based on classical image processing, computer vision, or neural-network-based machine learning, has been extensively studied.
At present, most detection objects are still planar. For example, Yi et al. [17] proposed end-to-end detection of steel strip surface defects based on deep convolutional neural networks; Ma et al. [18] proposed blister defect detection based on convolutional neural networks for polymer lithium-ion batteries; He et al. [19] proposed a new object detection framework, the classification priority network (CPN), and a new classification network, the multi-group convolutional neural network (MG-CNN), to detect steel surface defects using the You Only Look Once (YOLO) neural network, with which the detection accuracy for hot-rolled strip surface defects could reach more than 94% and the classification rate exceeds 96%; Liu et al. [20] proposed periodic surface defect detection of steel plates based on deep learning, improving the detection rate with an improved Long Short-Term Memory (LSTM) network; and Song et al. [21] proposed a deep convolutional neural network (DCNN) to detect micro-defects on metal screw end faces, with a detection accuracy of up to 98%. Image detection by directly photographing such planes is mature, and the detection accuracy is generally good.
For image acquisition and detection on irregular surfaces, there is no unified method; different methods are adopted for different objects. Xu et al. [22] proposed using vehicle-mounted ground-penetrating radar to obtain irregular railway subgrade images, using an improved Fast R-CNN to identify hazards and comparing it with traditional neural network methods; Santur et al. [23] used a 3D laser to acquire defect images of a rail and then applied deep learning to achieve high-precision, rapid detection of lateral defects such as fractures, scour and abrasion on railway surfaces; and Sun et al. [24] directly adopted fixed-position photography for the irregular surfaces of automobile hubs and achieved the identification of hub surface defects based on an improved Faster R-CNN, comparing it with the current state-of-the-art YOLOv3.
Image acquisition on regular curved surfaces mainly refers to rotating curved surfaces, which has often been performed with multiple cameras whose photographs are then synthesized. For example, Su et al. [25] proposed taking multiple photographs of a cylindrical surface and then synthesizing them to obtain a complete defect picture. However, with the advent of line-scan cameras and linear light sources, line scanning is now generally used to acquire images of rotating curved surfaces. For example, Shi et al. [26] used line scanning to obtain images of the circular curved surfaces of chemical fiber paper tubes; the defect images were detected using Faster R-CNN, and the accuracy rate was 98%. Xu et al. [27] proposed using a line-scan camera for image acquisition on the surface of a cylindrical workpiece and Faster R-CNN to detect defects. There is currently no mention of how to obtain high-quality surface images in the case of a highly reflective surface.
Many methods exist for detecting defects once a defect image has been obtained. Some are based on shallow neural networks. For example, Tao et al. [28] proposed cascaded autoencoder structures for the segmentation and localization of defects, with shallow convolutional neural networks automatically detecting and identifying metal surface defects. There are also deep learning methods based on traditional deep neural networks that identify defects directly. For example, Chun et al. [29] used traditional deep learning to detect defects on product surfaces: the image is segmented first, the segmented images are then fed into deep learning, a group of deep learners is used, and three deep learning methods are compared for detection effect. Some researchers believe that, because training samples are limited in practical applications, deep learning alone is not effective, and therefore propose feature extraction based on a convolutional neural network, using the similarity between images to classify defects; the accuracy rate reaches 97.25% in the method proposed by Qian wen et al. [30]. Poor recognition due to limited training samples does indeed occur, but there are many solutions to this problem. For example, Haselmann et al. [31] proposed an artificial defect synthesis algorithm based on a multi-step stochastic process to increase the number of training images and improve the detection rate of supervised machine learning; it directly creates a large number of training images and also increases the number of positive sample images. Additionally, Park et al. [32] proposed a surface inspection system for non-patterned welding defects based on a convolutional neural network using ordinary pictures, with the network detecting defect images in stages; a method for increasing positive samples is also proposed. Another approach increases the proportion of effective training images by directly discarding redundant images that contain no defects; for example, Li et al. [33] proposed a regional planning method that roughly crops out defective images in the preprocessing stage and removes a large number of redundant images, finally training on the images with Faster R-CNN.
Of course, as the study of deep learning deepens, in addition to creating more and better deep neural network structures and algorithms, good detection results can be obtained by improving the existing network structure, algorithm, and parameters for specific detection objects. For example, Cheon et al. [34] used scanning electron microscopy to acquire images of wafer surface defects and improved deep learning Automatic Defect Classification (ADC) methods to classify them; and Li et al. [35] proposed a surface defect detection method based on the MobileNet Single Shot MultiBox Detector (SSD) framework, with the goal of simplifying the detection model without sacrificing accuracy. This motivates optimizing the model structure and parameters from a practical point of view. These research objects also have certain specific characteristics.
Among all of the above deep learning methods, there is no detection method for small objects in large images. Currently, the network structure that can directly detect small objects with a deep neural network is YOLOv3, but it cannot detect fine and micro objects, so other auxiliary means are used. For example, Cha et al. [36] used 256 × 256 sub-images for training after image pre-processing and then detected 5888 × 3584 large images based on convolutional neural networks. Tang et al. [37] proposed a multi-view object detection method based on deep learning, motivated by the weak small-object detection ability of classical regression-based object detection methods, and experimented with multi-view YOLO, YOLO2, and SSD, improving both accuracy and speed in small object detection. Tayara et al. [38] proposed object detection in very high-resolution aerial images using a one-stage densely connected feature pyramid network, by which high-level multi-scale semantic feature maps with high-quality information are prepared for object detection; this work was evaluated on two publicly available datasets and outperformed the state-of-the-art results in both mean average precision (mAP) and computation time. However, these small objects have no clear definition, and the images involved are not large enough. That is, detecting fine or micro defects in a high-precision large image raises two problems: first, whether large images can participate in training and whether the detection speed can meet the actual needs of production; second, whether micro defects that occupy only a tiny proportion of the image can be detected. This paper gives partial solutions to these problems.
The detection objects in this paper are micro-fine defects on a highly reflective metal shaft surface; in an image with a resolution of 16,384 × 4096, a small group of pixels with a minimum of 80 points, or fine lines with a width of 4 pixels, must be detected. Classification at this defect-to-image proportion has not been studied before. According to previous research results, it remains challenging to achieve good classification results with existing deep learning technology for such features. The following solutions are adopted in this research.
Firstly, a data set of metal shaft surface defects was collected for the project, with images of defects on the highly reflective shaft surface obtained by line scanning; secondly, ResNet [39] convolutional neural network feature extraction is combined with the Faster R-CNN [40] object detection model for the detection of metal shaft surface defects, and the structure and parameters of the ResNet model are adjusted and optimized for defect detection; thirdly, we realize the detection and classification of small objects in a 16,384 × 4096 large image by screen capture; fourthly, the ResNet convolutional neural network is embedded in the Faster R-CNN program framework and can later be changed by replacing the convolutional neural network, realizing a scalable application of practical object detection and classification. In addition, due to the disadvantages of existing positive and negative sample screening methods, a method to increase the number of positive samples based on multiple IoU values is proposed; finally, methods of improving the recall, precision and accuracy rates are analyzed based on the experimental results. A limiting condition for using an existing deep learning network system in industry is also proposed.
The rest of the paper is organized as follows: the system construction of the industrial application and Faster R-CNN is described in
Section 2. Data collection is introduced in
Section 3. The system framework design is shown in
Section 4. The experimental parameter settings, operation, ablation experiments, and performance evaluation are discussed in
Section 5. Finally, the conclusions are presented in
Section 6.
2. System Overview
The defect detection in this project is part of an automated assembly line. The resolution of the image obtained by line scanning is 16,384 × 4096. Due to real-time requirements, the time for image processing and detection is limited to less than 1 s. There are two ways to achieve image detection: traditional defect detection and deep learning object detection. Certainly, throughput is the decisive factor. The hardware configuration is an Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40 GHz (two cores), RAM: 32 GB, and a single Graphics Processing Unit (GPU): GTX 1080 Ti. Comparison experiment results are shown in
Table 1.
Table 1 shows that the traditional defect detection method has to be selected, as its operation time is far less than 1 s. Obviously, the whole detection system and its construction have to be considered as follows.
The entire defect detection and classification task is split into two stages. The first stage quickly screens out qualified and unqualified products and is installed in the assembly line; it includes image preprocessing and the identification of defects. As this stage involves Compute Unified Device Architecture (CUDA)-based and other fast algorithms, it is beyond the scope of this article. The second stage is a defect classification system for the unqualified products that have been screened out. This system is not installed in the assembly line, and defect classification has no real-time requirement, because according to plant production data there are few unqualified products each day; it is this stage, based on deep convolutional neural networks, that is studied in this paper.
This project targets the detection of micro-fine defects of a minimum of about 0.3 mm. The resolution of the image obtained by line scanning is 16,384 × 4096. In the training stage, we do not need to train on images of this resolution as samples: excessive sample data would make training very slow, and there is a problem of positive and negative sample imbalance. In the prediction stage, if the whole scanned image is input into the object detection model, prediction is abnormally slow, or detection fails because the object is too small.
The 16,384 × 4096 defect image generated by line scanning and image preprocessing in the screening stage for non-conforming products is shown in
Figure 2. Its basic feature is that the proportion of the image occupied by defects is very small; the smallest recognizable defect is only about 80 pixels. Using convolutional neural networks and currently popular object detection directly is very difficult, and detection will fail to classify the defects. In addition, the defect images we need to classify are relatively simple and essentially consist of basic geometric shapes. Given these characteristics, the whole processing system is as follows: firstly, a large number of 500 × 500 images, each containing a single defect, are constructed manually according to the actual defects of the shaft, then labelled and input into the convolutional neural network for learning, finally yielding the model. In this scheme, the image generated by the screening stage is used for the defect search; a 500 × 500 image is extended around each defect point (any insufficient portion is filled with 0), the intercepted image is input into the model for recognition, and the marked results are output; meanwhile, we obtain the coordinate position of the detected point. The system overview is shown in
Figure 2.
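The 500 × 500 interception around a defect point described above can be sketched as follows (NumPy, single-channel image; the window is centered on the defect coordinate and zero-filled where it runs past the image border; function and parameter names are illustrative, not the project's own):

```python
import numpy as np

def crop_around_defect(image, cx, cy, size=500):
    """Extract a size x size window centered on defect point (cx, cy);
    regions falling outside the image are filled with 0."""
    h, w = image.shape
    half = size // 2
    out = np.zeros((size, size), dtype=image.dtype)
    # Source region clipped to the image bounds
    y0, y1 = max(0, cy - half), min(h, cy - half + size)
    x0, x1 = max(0, cx - half), min(w, cx - half + size)
    # Destination offsets inside the zero-filled window
    dy, dx = y0 - (cy - half), x0 - (cx - half)
    out[dy:dy + (y1 - y0), dx:dx + (x1 - x0)] = image[y0:y1, x0:x1]
    return out
```

For the full 16,384 × 4096 scan, this window would be extracted at each candidate defect coordinate found by the screening stage before being passed to the classification model.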
The internal structure system diagram of the Faster R-CNN is shown in
Figure 3. The scheme is divided into two modules: one is the Region Proposal Network (RPN) module, which generates candidate regions; the other is the Faster R-CNN classification network module, which detects and classifies the candidate regions generated by the RPN. Firstly, the pre-processed line-scan image of the metal shaft is input, and the convolution feature map is extracted by the shared convolutional neural network; then, candidate regions are quickly generated by the RPN, and redundant candidate bounding boxes are initially eliminated by non-maximum suppression. Next, the candidate bounding boxes are extracted by the Region of Interest (ROI) pooling layer, and the Softmax multi-class classification and bounding box regression are output directly through the fully connected layers of the convolutional neural network; finally, the output is obtained by fine screening the bounding boxes with non-maximum suppression.
5. Experiments
5.1. Parameter Setup
After the software structure design is completed, the selection and setting of the system parameters is very important. The parameters were chosen and optimized as follows.
5.1.1. Loss Function Setting
In Faster R-CNN training, the result is a multi-task loss; i.e., both the classification information and the bounding box position information need to be corrected. The total loss of the Faster R-CNN network consists of the loss of the RPN, the fine-tuning loss of the classification network, and the L2 regularization loss.
The Faster R-CNN loss function formula is shown in Equation (2), where L represents the total loss of Faster R-CNN, LP represents the loss of the RPN, LC represents the fine-tuning loss of the classification network, and LR represents the L2 regularization loss of the weight values:
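Written in terms of the losses just defined, Equation (2) is simply their sum (a reconstruction consistent with those definitions):

```latex
L = L_P + L_C + L_R \tag{2}
```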
(1) RPN loss function
The final output of the RPN involves two parts: one part is the binary classification output—i.e., it is the object or it is not the object—and the other part is the bounding box regression output including the center coordinates and size of the candidate bounding box. The total loss of the defined RPN is the sum of two loss functions, as shown in Equation (3):
where i denotes the i-th anchor, pi is the probability that the i-th anchor is predicted to be the object, and pi* represents the i-th anchor's real category: pi* = 1 represents the detection object, and pi* = 0 means it is not a detection object. ti is a 4-dimensional vector representing the position and size of the predicted bounding box for the i-th anchor, and ti* represents the real position and size. For the classification, the loss function Lcls uses the Softmax loss with two categories. For the bounding box regression, the loss function Lreg is the Smooth L1 loss, and pi*Lreg means that the bounding box regression loss is calculated only when pi* = 1; that is, only when the anchor detects the object is the loss of the bounding box regression computed. The classification and bounding box regression losses are normalized with the parameters Ncls and Nreg to speed up the convergence of the iterative calculation and prevent divergence; Ncls and Nreg represent the sample size of each random mini-batch and the number of anchors, respectively. λ is an equilibrium parameter whose function is to balance the weights of the classification loss and the bounding box regression loss; its value is adjusted based on Ncls and Nreg. For the bounding box regression, we use the four parameterized coordinates shown in Equation (4):
where x, y, w, and h represent the center coordinates and dimensions of the bounding box output by the RPN; xa, ya, wa, and ha represent the center coordinates and dimensions of the anchor; and x*, y*, w*, and h* represent the center coordinates and dimensions of the real bounding box.
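For reference, the standard Faster R-CNN forms of Equations (3) and (4), consistent with the definitions above, are:

```latex
L\left(\{p_i\},\{t_i\}\right) =
  \frac{1}{N_{cls}} \sum_i L_{cls}\left(p_i, p_i^*\right)
  + \lambda \, \frac{1}{N_{reg}} \sum_i p_i^* \, L_{reg}\left(t_i, t_i^*\right) \tag{3}

t_x = (x - x_a)/w_a, \quad t_y = (y - y_a)/h_a, \quad
t_w = \log(w/w_a), \quad t_h = \log(h/h_a) \tag{4}

t_x^* = (x^* - x_a)/w_a, \quad t_y^* = (y^* - y_a)/h_a, \quad
t_w^* = \log(w^*/w_a), \quad t_h^* = \log(h^*/h_a)
```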
(2) Classification network fine-tuning loss function
The classification network's fine-tuning loss function has the same form as the RPN loss. The difference lies in the two outputs of the classification network: the first is the object category (here, five categories), and the second is the center coordinates and size of the regressed bounding box. The number of candidate bounding boxes differs from that of the RPN, because a large number of object-free and overlapping boxes are removed during RPN anchor screening and non-maximum suppression.
(3) L2 regularization loss function
The main function of the regularization loss is to add a term describing model complexity to the loss function. By limiting the weight values, the model cannot arbitrarily fit random noise in the training data, which effectively prevents over-fitting after training. The two commonly-used regularization functions are L1 and L2. Here we use the L2 regularization loss, first because L1 regularization makes the parameters sparse while L2 does not, and second because the L2 function is differentiable everywhere while the L1 function is not. Since we train with a stochastic mini-batch gradient descent algorithm involving a large number of derivative computations, the L2 regularization loss is easier to compute. Its formula is shown in Equation (5):
The regularization coefficient λ can be used to adjust the fitting strength to prevent over-fitting and under-fitting; wi is the i-th weight parameter of the model.
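Consistent with these definitions, Equation (5) reads:

```latex
L_R = \lambda \sum_i w_i^2 \tag{5}
```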
5.1.2. Model Optimization Method Settings
The optimization algorithm is the key to neural network model training. Through optimization, the loss value is gradually reduced until convergence, so the choice of optimization method directly determines the quality of the model. Here, we use the gradient descent algorithm [43] as the model optimization algorithm, because gradient descent works well for large-scale data set optimization. The iterative formula is shown in Equation (6), where θ denotes the weight parameters and η is the learning rate:
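With θ denoting the weights and η the learning rate, Equation (6) is the standard gradient descent update:

```latex
\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} L(\theta_t) \tag{6}
```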
Commonly-used gradient descent variants include batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. Among the three, batch gradient descent suits small sample sizes; stochastic gradient descent suits online learning with large sample sizes; and mini-batch gradient descent suits the general case. The data to be trained in this metal shaft surface defect detection and recognition project is relatively large, and training is completed offline without real-time requirements; weighing the advantages and disadvantages of the three algorithms, we therefore select mini-batch gradient descent as the optimization algorithm.
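One mini-batch gradient descent pass can be sketched as follows (an illustration on a hypothetical least-squares objective, not the project's training code):

```python
import numpy as np

def minibatch_gd(X, y, theta, lr=0.01, batch_size=32, epochs=10, seed=0):
    """Mini-batch gradient descent for linear least squares (illustrative only)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)                  # shuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient of the mean squared error on this mini-batch
            grad = 2 * Xb.T @ (Xb @ theta - yb) / len(idx)
            theta = theta - lr * grad               # the Equation (6) update
    return theta
```

Each epoch shuffles the data and steps through it in small batches, which is the behavior the paragraph above attributes to the chosen optimizer.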
5.1.3. Learning Rate Settings
In the previous section, we chose the small batch gradient descent algorithm as the optimization method of the model. For the gradient descent algorithm, the learning rate setting is a very important factor. An overly high learning rate may cause the iteration to oscillate near the minimum value and not converge, while too small a learning rate may cause the iteration convergence to be too slow.
In order to avoid manual adjustment of the learning rate, here we use the adaptive learning rate optimization gradient descent algorithm to automatically adjust the learning rate. Currently, the commonly-used adaptive learning rate gradient descent algorithm optimizers are Adagrad, Adadelta, RMSprop and Adam. After comparing the advantages and disadvantages of each optimizer and the pre-project conditions, the Adam optimizer is selected as the gradient descent algorithm optimizer for this project.
5.1.4. Moving Average Parameter Setting
In order to strengthen the generalization ability of a model trained by stochastic gradient descent, a moving average model is adopted during training. The moving average model maintains a shadow variable for each parameter variable in the neural network. The initial value of a shadow variable is the initial value of the corresponding variable; each time a parameter variable is updated in an iteration, the value of its shadow variable is updated at the same time. The update formula is shown in Equation (7), where s is the shadow variable, d is the decay rate, which is generally close to 1 (such as 0.999), and v is the variable parameter being updated:
The decay rate d determines the update speed of the model: the larger the decay rate, the more stable the model. In order to better control the update speed and obtain a better model, we set the decay rate dynamically according to the number of iterations. The update formula for the decay rate is shown in Equation (8), where t represents the t-th round:
5.2. Experimental Parameters
According to the characteristics of the detected image and the above parameter setting analysis, the optimization of the parameter setting is divided into the following three parts:
- (1)
The first part is the setting of the model parameters, which includes the number of categories, the feature extraction network settings, the first candidate region generation network settings, the first prediction network hyperparameters, the non-maximum suppression settings, the loss function parameters, the pooling kernel parameters, the second prediction network hyperparameters, the second bounding box regression parameters, the positive and negative sample screening IoU threshold, and so on. The specific parameter settings are shown in
Table 2.
- (2)
The second part is the setting of the training parameters, which includes the training data, the optimization method, the moving average, the model file saving path, and so on. The specific parameter settings are shown in
Table 3.
- (3)
The third part is the setting of the evaluation parameters. The evaluation parameter settings include the number of evaluation samples, the evaluation data input path, the label path, and so on. The specific parameter settings are shown in
Table 4.
5.3. Experimental Operation
The hardware configuration for the object detection model training and prediction is given in
Section 2. The GPU model is the GeForce GTX 1080 Ti-11GD5X Extreme PLUS OC, with a core frequency of 1544~1657 MHz and 3584 stream processing units. The operating system is Windows Server 2012 R2, and the operating environment comprises Anaconda 3.5, TensorFlow 1.8.0 and Cudnn 9.0.
The training steps of the metal shaft surface defect object detection model in this project are shown in
Figure 12. Firstly, we set the training parameters of the model. Then, we import the previously prepared metal shaft surface defect data set into the model and convert it into the TFRecord file format; a TFRecord file is a binary file storing the image data together with the labels, which uses memory efficiently and allows fast copying, moving, reading, and storing in TensorFlow. For the stochastic mini-batch gradient descent iterations, the data in the TFRecord file are randomly shuffled, batched in small batches, and fed into the computation graph. The graph then performs forward propagation and returns its result. The model is saved once every 100 rounds: if the iteration count is a multiple of 100, the model file is saved by TensorFlow, and the model files generated during training can be tested and adjusted on the validation set; training stops once the iteration count reaches 30,000 rounds. If that count has not been reached, the weight parameters are updated by back propagation, the next iteration begins, and the TFRecord queue file is read again; if the iteration count is not a multiple of 100, back propagation directly updates the weight parameters and training proceeds to the next iteration. The final trained model is a ckpt file. At prediction time, the images to be detected are simply placed in the detection folder; the system automatically reads the files, imports the trained model, computes, and displays the results. To speed up training and prediction, CUDA parallel computing was used throughout.
The experiment then verified the model's predictions on test pictures.
Figure 13,
Figure 14 and
Figure 15 show the detection effects for the three types of defects in the single-image single-object, single-image double-object, and single-image triple-object cases.
5.4. Performance Evaluation
At present, the most commonly-used model evaluation methods in the field of deep learning are the error rate and accuracy rate evaluation, F1 evaluation, mAP, etc. Each evaluation method has its own advantages and disadvantages.
(1) Error rate and accuracy rate
Error rate and accuracy rate are the most commonly used evaluation methods in the classification field. The applicability is very strong. The error rate is the ratio of the number of samples with incorrect classification to the total number of samples. The accuracy rate is the ratio of the number of samples with the correct classification to the total number of samples.
Regarding classification, assume a data set D = {(x1, y1), (x2, y2), ..., (xn, yn)} containing n samples, with corresponding prediction results Y = {Y1, Y2, ..., Yn}, where x is a defect picture sample, y is the real defect label, and Y is the prediction of the object detection model. The error rate and accuracy rate formulas are shown in Equations (9) and (10):
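Equations (9) and (10) amount to the following (a sketch over paired label/prediction lists):

```python
def error_and_accuracy(labels, predictions):
    """Error rate: fraction misclassified (Eq. 9);
    accuracy rate: fraction correctly classified (Eq. 10)."""
    n = len(labels)
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return (n - correct) / n, correct / n
```

The two rates always sum to 1, so reporting either one determines the other.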
The accuracy rate assessment in the project can meet the project requirements.
(2) Recall rate, precision rate and F1 assessment
The recall ratio is the proportion of correctly predicted samples among all samples that actually belong to a given category. The precision ratio is the proportion of correctly predicted samples among all samples that the model predicts to belong to that category. For the classification problem, the combinations of the real category and the predicted category can be divided into true positive, false positive, true negative and false negative cases, whose numbers are denoted by
TP,
FP,
TN, and
FN, respectively. The calculation formula of the recall ratio is shown in Equation (11):

Recall = TP / (TP + FN) (11)

The formula of the precision ratio is shown in Equation (12):

Precision = TP / (TP + FP) (12)
The recall rate and the precision rate are a pair of contradictory measures: in general, the higher the recall rate, the lower the precision, and the lower the recall rate, the higher the precision. In a typical classification problem, to balance the two, it is necessary to find an equilibrium point between them. The
F1 assessment is based on the harmonic mean of the precision and recall; by calculating
F1, this equilibrium can be established. The formula for
F1 is shown in Equation (13):

F1 = 2 × Precision × Recall / (Precision + Recall) (13)
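A minimal sketch of the recall, precision and F1 computations of Equations (11)-(13), treating one defect class as the positive class (the function name is ours, not from the paper):

```python
def recall_precision_f1(y_true, y_pred, positive):
    # Count true positives, false positives and false negatives for the
    # chosen positive class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    recall = tp / (tp + fn) if tp + fn else 0.0     # Equation (11)
    precision = tp / (tp + fp) if tp + fp else 0.0  # Equation (12)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)           # Equation (13)
    return recall, precision, f1
```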
Mean average precision (mAP) is the most commonly used evaluation criterion in object detection. In common classification problems, recall and precision are the most commonly used statistics. However, in object detection, we also need to confirm the position of the object in the image, so the calculation of the recall ratio and precision ratio differs from that of normal classification.
The mAP calculation is based on the IoU (intersection over union) between the predicted bounding box and the real bounding box. We calculate the IoU of each detection bounding box, compare the calculated IoU value with a threshold (usually set to 0.5), and thereby obtain the number of correct detections in each image.
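The IoU of two axis-aligned bounding boxes and the threshold test above can be sketched as follows (boxes given as (x1, y1, x2, y2) corners; the function names are ours):

```python
def iou(box_a, box_b):
    # Intersection rectangle of the two boxes.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def is_correct_detection(pred_box, gt_box, threshold=0.5):
    # A detection counts as correct when its IoU reaches the threshold.
    return iou(pred_box, gt_box) >= threshold
```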
The object position in this project is not required to be precise. Because each defect is a small object in a large image, its approximate position has already been determined by the screenshot taken during the screening stage (not covered in this paper). Therefore, mAP is not used as an evaluation criterion. According to the production requirements, as long as the probability of a certain defect is greater than a certain value (for example, greater than 50%) and the accuracy rate, precision rate and recall rate reach certain values, the evaluation criterion of the classification model system meets the production requirements.
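The production acceptance rule just described could be sketched as follows (a hypothetical helper; the 0.5 default mirrors the "greater than 50%" example above):

```python
def accept_detections(detections, threshold=0.5):
    # detections: list of (defect_class, probability) pairs from the model;
    # keep only those confident enough to satisfy the production rule.
    return [(cls, p) for cls, p in detections if p > threshold]
```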
5.5. Experimental Evaluation and Discussion
According to the evaluation method for classified images, the following gives the experimental analysis and evaluation of the single-object situation. The multi-object experimental results are shown in
Figure 13,
Figure 14 and
Figure 15. Because multi-object situations are rare in practice, this case is given for reference only and is not discussed in the paper.
For the verification and evaluation of the deep learning results, ablation experiments are needed, as follows. First, the prediction results are verified and evaluated on training and non-training images; then, the factors influencing the detection results are discussed and evaluated through ablation experiments on the deep learning system; finally, the cases of unrecognized and incorrectly identified defects are analyzed.
5.5.1. Using Training Image Prediction
The performance evaluation using the training image prediction is shown in
Table 5.
It can be seen from
Table 5 that the recognition rate of the breach defect is relatively low, with some detections returning N/A, while the recognition rates of the other three defect types are relatively high.
5.5.2. Non-Training Image Prediction
The performance evaluation using non-training image prediction is shown in
Table 6.
It can be seen from
Table 6 that the accuracy rate of the breach defect is much lower, which is basically consistent with the predictions using the training images. Moreover, for geometrically simple defects, the detection rates of small pits and breaches are low, while geometrically complex defects are all detected, and all detections are correct.
In fact, when using a non-training image for detection, the correct ratio is determined by the similarity between the trained images and the image being detected; otherwise, the “N/A” condition occurs, e.g., Figure 21(a). However, there were almost no undetected cases.
5.5.3. Ablation Studies
An ablation study is usually used in relatively complex neural networks to verify and investigate network features by removing parts of the network structure or modules. However, this is an application project, and Faster R-CNN+ResNet101 is a mature and classic model used only as a module within the whole application system, so it is not necessary to perform ablation experiments on the network structure itself. As a whole defect detection application system, however, ablation studies are still necessary; i.e., the ablation research method in its original sense is transplanted to the study of the whole deep learning defect detection system.
The detection system involves five modules, i.e., image capture, image cutting, filtering and de-noising, image segmentation, and Faster R-CNN image recognition:
(1) Image capture module. This module involves shooting and lighting. If the original images of the project are replaced by images obtained under unstable shooting and lighting conditions, the resulting images have different contrast and brightness, uneven illumination and the like. When these images are tested by the defect detection system, the effect is shown in
Figure 16.
As can be seen from
Figure 16, under different shooting and illumination background conditions, as long as the defect can be visually identified in the image, it can be detected.
(2) Image cutting module. If the image is not cut but the directly captured image is used, with a different shape and the original resolution, the object detection effect is shown in
Figure 17.
As can be seen from
Figure 17, the resolutions of the original images (a) and (e) are 16,384 × 4096 and 4096 × 16,384, respectively; these two images cannot be detected because their size is too large. When the original image is small enough, the defect can be detected. In addition, when the shape of the detected image does not match the shape of the training images, the defect can still be detected.
(3) Filtering and de-noising module. If the image is not filtered, it contains various typical kinds of noise. The object detection effect on such unfiltered images is shown in
Figure 18.
As can be seen from
Figure 18, when the noise destroys the defect image, an identification error occurs; otherwise, conventional noise has no effect on image recognition.
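As an illustration of the kind of filtering this module performs, a plain 3 × 3 median filter (a common choice against salt-and-pepper noise; a generic sketch, not the project's exact filter) removes isolated noise pixels while leaving smooth regions intact:

```python
def median3x3(img):
    # img: 2-D list of gray values; border pixels are copied unchanged.
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = sorted(img[y + dy][x + dx]
                            for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = window[4]  # median of the 9 neighbors
    return out
```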
(4) Image segmentation module. If the 500 × 500 resolution image is not adopted in this module and images with other resolutions are used instead, the detection effect is as follows.
(a) The image detection of different resolution
Because of the actual needs of production, the image detection time sometimes needs to be considered. Detecting a large image takes a long time, and detecting a small image takes a short time. However, this also involves the proportion of the defect in the detected image. The impact of this situation on the prediction results is shown in
Table 7 and
Figure 19.
Table 7 shows that the prediction time for a 512 × 512 image is about 4 s in the whole system without CUDA participation, and about 2.29 s per picture with CUDA parallel computing. The smaller the image, the faster the detection speed.
Table 7 also shows the image detection speed for different image resolutions with CUDA participation. Training the model for 30,000 rounds takes about 1 day without CUDA participation and about 6 h with CUDA participation, although these training times are for reference only.
As can be seen from
Table 7, in some cases, the detection result is “indefinite”, meaning that some images can be detected, some cannot, and some can be detected but with display errors, N/A results or missing marks. In particular, there are significant differences between small objects such as pits and breaches and large objects such as abrasions and scratches, similar to
Figure 20 (2:1).
(b) Images detection of different scales.
In the case of an image with the same resolution (consistent with the training image), when the defect image accounts for a different proportion of the entire image, the detection effect is as shown in
Figure 20.
The subscript of each sub-figure is the enlargement or reduction ratio of the defect. It can be seen from the figure that when small objects such as pits are enlarged, they are recognized as N/A, and when they are enlarged further, they are misrecognized as abrasions or scratches; if they are reduced, they are not detected (no such columns appear in the legend). Abrasion defects cannot be recognized when enlarged more than 3 times. For scratches, regardless of zooming in or out, as long as the overall shape of the defect does not change, they can always be correctly identified. The above phenomena are very meaningful for the reliability of industrial applications.
(5) Faster R-CNN image recognition module. For the case where the deep learning network of this manuscript is not used, comparative studies using other deep learning models are described in
Section 5.6.
5.5.4. Unrecognized and Incorrectly Identified Analysis
Various cases of unrecognizable defects and misidentification are shown in
Figure 21. The analysis of and solution to various error phenomena are listed in
Table 8.
According to
Table 8, it can be seen that: (1) the defect images participating in training should be as comprehensive as possible; images with no object or with a wrong object should be added when the model is upgraded by retraining; (2) the proportion of the object to be detected in the image must be appropriate. On the one hand, we try to match the proportions of the defect objects involved in training; on the other hand, since the defect size is basically fixed, the proportion of the defect object in the image should be kept consistent, and the resolution of the image should not be changed for detection. This is an essential difference from the scale invariance of general object recognition. In this project, the image obtained according to the interception principle for the detected image is shown in
Figure 21(h). This is a typical normal case; the other images are for reference only.
According to the above analysis, in order to obtain a relatively high and reliable accuracy rate in practice, the following conditions must be met: (a) the proportion of the object to be inspected in the image should be as close as possible to that in the training images; (b) when the sizes of the defects being detected differ greatly, the proportions of small objects and large objects in the image should be considered comprehensively; (c) if the trained defects do not differ much in geometry and size, they should be considered as unclassified.
5.6. Comparative Study
For the evaluation of deep learning classification performance, the comparison between Faster R-CNN+ResNet101 and some classical neural networks has been shown in reference [
4],
Table 3. The performance comparison of the network system between the Optimized Faster R-CNN and other more advanced deep learning methods such as R-FCN [
45], YOLOv3 [
46] is shown in
Table 9. In the experiment, the same data sets were used for training, and the same 120 mixed defect images were then used for prediction. The experiments show that Faster R-CNN can achieve better performance than more advanced neural network structures if the framework structure is designed reasonably and the parameters are adjusted appropriately. Practice has shown that the most advanced deep learning system is not necessarily the best object detection system for specific detection objects in the field of industry. Of course, it should not be ruled out that advanced neural networks could be optimized and modified to achieve better performance.
YOLOv3 is a more advanced deep learning system than Faster R-CNN. When standard data sets are used, YOLOv3 works better, but it does not perform very well on the defect detection task of this paper. The experimental data in reference [
19] also prove this point.
Table 10 shows that more defects are detected on pit images, and the breach, abrasion and scratch images give better results, but the accuracy of abrasion detection, which involves a large number of small points, is not high. The R-FCN test results are exactly the opposite. Here we analyze some reasons for the difference between Faster R-CNN and YOLOv3: (a) the YOLOv3 input is a fixed 416 × 416 image, whereas the Faster R-CNN input size is variable and much smaller than 416 × 416; when features are extracted from large input images, small objects such as pits and small dots easily lose useful information, while small input images are much less likely to lose it; (b) YOLOv3 requires a large number of training samples, while Faster R-CNN does not, so under the same sample quantity, when Faster R-CNN achieves good results, YOLOv3 is not as good as Faster R-CNN.
Certainly, over-fitting must be considered in order to prove that the models used for comparison in the project are truly comparable. The causes of over-fitting in deep learning include the training sample quantity, the model, and the match between the two. Because the training sample quantity is fixed for comparability, a suitable model has to be selected. To prevent over-fitting, a suitable model can be obtained by adjusting the network architecture, early stopping, regularization, adding noise, etc. However, the images and YOLOv3 (YOLOv3 comes from another mature trademark recognition project) are fixed, which means that early stopping is a good choice.
In order to select a suitable model, the early stopping strategy was applied in YOLOv3 training.
Figure 22 shows the accuracy, precision and recall of YOLOv3 for different numbers of training rounds. It shows that as the number of training rounds increases, the accuracy, precision and recall gradually stabilize. In fact, in this project, which involves both simple geometric images (e.g., pits) and complex geometric images (e.g., abrasions), as the number of training rounds increases, fewer simple geometric images fail to be detected and complex geometric images can be detected, but the repetition rate and error rate of the detections are higher. These problems are outside the scope of this paper.
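The early stopping idea can be sketched as follows (a hypothetical helper, not the project's actual training code): keep the round with the best validation accuracy and stop once no improvement has been seen for a fixed number of evaluation rounds.

```python
def early_stop(val_accuracies, patience=3):
    # Returns (best_round_index, best_accuracy); stops scanning once
    # `patience` consecutive rounds fail to improve on the best so far.
    best_acc, best_round, waited = float("-inf"), -1, 0
    for rnd, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc, best_round, waited = acc, rnd, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_round, best_acc
```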
In addition, in order to determine whether the models used for comparison are significantly over-fitted, cross-validation is usually adopted, i.e., the training data set is used for testing. The results show that the accuracy, precision and recall tested with the training images are better than the test results with non-training data. However, because line-scan defect images are very different from everyday civil images, cross-validation cannot be used as a basis for judgment.