For the armatures in our paper, the defective regions are small, but the features needed to detect them are complex and diverse. Furthermore, because of the concentricity error between the armatures and the loading platform, the images of the armatures differ by a certain angle. Therefore, we need to extract semantically stronger and more abstract features so that surface inspection of the armatures is robust to this interference.
3.1. ResNet-101 + FPN Network
A convolutional neural network (CNN) is a kind of feedforward neural network with a deep structure that includes convolution computation, and it is one of the representative algorithms of deep learning. In the standard CNN structure, the output of the feature extractor, which consists of stacked convolutional and pooling layers, is fed directly into the classifier. This typical structure without any additional branch is called a plain network, as in AlexNet and VGG [13,14]. In a plain network, a high-level feature map of each image is obtained through a deep network structure, but small features become smaller or even disappear after repeated downsampling. Conversely, high-resolution feature maps taken before downsampling do not carry abundant enough semantic information. Neither is conducive to the extraction and characterization of the complex small features required for our dataset.
Inspired by the idea of feature pyramids in the feature pyramid network (FPN) [22], our network, shown in Figure 6, improves on the basic plain-network design and comprises two branches: a backbone branch for extracting feature maps, and an FPN branch that fuses feature maps of multiple scales.
In the process of forward propagation calculation, the backbone branch outputs feature maps of various resolutions. The FPN branch combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway, bottom-up pathway, and lateral connections.
Backbone branch: For the classification task, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations [23], so the network must be deep enough to extract features well. To ensure that the network has enough depth to extract the rich semantic information needed to represent complex features, we used ResNet as the backbone network: its shortcut connections let each unit learn a residual mapping, which transfers gradient information better and prevents gradient vanishing and degradation [24,25]. As a result, it can reach over a thousand layers, far deeper than a common plain network. Finally, allowing for the time and limited computing resources available, we chose ResNet-101, a stack of residual units 101 layers deep, as our backbone network.
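As a brief illustration of the shortcut idea, the sketch below shows a minimal residual unit in PyTorch; it is not the exact bottleneck block used in ResNet-101 (which stacks 1 × 1, 3 × 3, and 1 × 1 convolutions), only the structure that lets gradients flow through the identity path:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual unit: the shortcut adds the input to the
    convolutional branch, so the branch only learns a residual mapping."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # shortcut: gradients flow through the identity path
```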
FPN branch: Our goal is to build a feature pyramid with high-level semantics throughout. The construction of our pyramid involves a bottom-up pathway, a top-down pathway, and lateral connections, as introduced below.
The bottom-up pathway provides the basic feature maps for our FPN. It is the feedforward computation of the backbone network, from which we collected the intermediate feature maps. There are often many layers producing feature maps of the same resolution, and we say these layers are in the same network stage. We chose the output of the last layer of each stage as our reference set of feature maps, which we enriched to create our pyramid. This choice is natural, since the deepest layer of each stage should have the strongest features.
For our backbone network, ResNet-101, we used the feature activations output by each stage's last residual block. We denote the outputs of these last residual blocks as {C2, C3, C4, C5}; they have strides of {4, 8, 16, 32} pixels with respect to the input image. We did not include the output of conv1 in the pyramid due to its large memory footprint.
The top-down pathway produces higher-resolution features by upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels. These features are then enhanced with features from the bottom-up pathway via lateral connections. Each lateral connection merges feature maps of the same spatial size from the bottom-up pathway and the top-down pathway. To start the iteration, we simply attached a 1 × 1 convolutional layer on C5 to produce the coarsest-resolution map. Finally, we appended a 3 × 3 convolution to each merged map to generate the final feature map, which reduces the aliasing effect of upsampling. This final set of feature maps is called {P2, P3, P4, P5}, corresponding to {C2, C3, C4, C5}, which have the same spatial sizes, respectively.
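The merge step described above can be sketched as follows. The channel widths are the usual values for a ResNet backbone with 256-d pyramid outputs, which is an assumption, since the paper does not list them:

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNBranch(nn.Module):
    """Builds {P2, P3, P4, P5} from backbone outputs {C2, C3, C4, C5}
    (strides 4/8/16/32 with respect to the input image)."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1 x 1 lateral convolutions bring each Ci to a common channel width
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3 x 3 convolutions smooth each merged map to reduce upsampling aliasing
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1) for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        # start from the coarsest map (1 x 1 conv on C5), then upsample and add
        p5 = self.lateral[3](c5)
        p4 = self.lateral[2](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[1](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        p2 = self.lateral[0](c2) + F.interpolate(p3, scale_factor=2, mode="nearest")
        return tuple(s(p) for s, p in zip(self.smooth, (p2, p3, p4, p5)))
```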
The result is a feature pyramid that has rich semantics at all levels and is built quickly from a single input image [22], so P2~P5 all have rich semantics. Among them, P2 has the highest resolution and fuses the most features of different scales, so we took only the P2 feature map for prediction. We added two convolutional layers after P2 to reduce the size and dimension of the feature map, extracting further information while reducing computation. After the two convolutional layers, we added two fully connected layers to summarize the feature information. In the end, we connected a SoftMax layer for binary classification.
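A sketch of this prediction head is given below; apart from the 4096-dimensional first fully connected layer, which Section 3.3 relies on, the layer sizes here are illustrative assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class P2Head(nn.Module):
    """Two convolutions to shrink P2, then two fully connected layers and
    a 2-way SoftMax classifier. Only the 4096-d FC width comes from the text."""
    def __init__(self, in_channels=256, num_classes=2):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(8),  # fix the spatial size before flattening (assumption)
        )
        self.fc1 = nn.Linear(64 * 8 * 8, 4096)   # 4096-d vector reused in Section 3.3
        self.fc2 = nn.Linear(4096, num_classes)

    def forward(self, x):
        x = self.convs(x).flatten(1)
        feat = F.relu(self.fc1(x))               # the vector saved to the feature library
        return F.softmax(self.fc2(feat), dim=1), feat  # SoftMax for binary classification
```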
3.2. Focal Loss
In the case of unbalanced input, the network converges toward the classes with a large quantity of data and pays little attention to the small quantity of data that carries a large amount of information. For the data in this paper, negative samples have one or more of the above defects, and the number of each kind of defective armature is also unbalanced. If the traditional cross-entropy loss function is used, it will lead to poor learning. To solve this problem, the focal loss was proposed in RetinaNet as follows [26]:

\[ \mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \tag{2} \]

αt balances the importance of positive/negative examples. The focusing parameter γ smoothly adjusts the rate at which easy examples are down-weighted. p is the model's estimated probability for the class with true label y = 1, and pt is defined as follows:

\[ p_t = \begin{cases} p, & \text{if } y = 1 \\ 1 - p, & \text{otherwise} \end{cases} \]
We can see from Equation (2) that focal loss adds a modulating factor (1 − pt)^γ to the original cross-entropy loss function. When a sample is misclassified and pt is very small, the modulating factor is close to 1 and the loss is nearly unaffected. When pt approaches 1, the factor goes to 0, and the loss for well-classified examples is down-weighted. Therefore, focal loss can mine unbalanced data and weaken the contribution of easy-to-classify armatures to the loss. In this way, we can better mine the data distribution of negative samples and achieve a better learning effect.
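For concreteness, Equation (2) for binary classification could be implemented as below. This is a minimal sketch; α = 0.25 and γ = 2 are the defaults suggested in [26], not values tuned in this paper:

```python
import torch

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss of Equation (2).
    p: predicted probability of the positive class; y: labels in {0, 1}."""
    p_t = torch.where(y == 1, p, 1 - p)                       # definition of pt
    alpha_t = torch.where(y == 1, torch.full_like(p, alpha),  # weight for positives
                          torch.full_like(p, 1 - alpha))      # weight for negatives
    return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-8))).mean()
```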
3.3. Feature Library Matching
In the surface inspection of industrial products, the appearance of positive samples is relatively similar, while the appearance of defective workpieces varies considerably, so it is difficult to map all negative samples with different appearances into a single category. Therefore, we took advantage of the previously trained ResNet-101 + FPN network, which has a strong representational ability for positive and negative samples, to extract a feature vector from its end and establish an a priori feature library from the samples in the training set. We expected to obtain better results by measuring the distance between a sample and the a priori feature library to judge the category of the sample. The whole process is shown in Figure 7.
The flat quadrilaterals in the figure represent data, and the rectangles represent computations. We ran the trained network described above over the training set, collected the 4096-dimensional vectors extracted from the first fully connected layer, and saved them to form a feature library of good samples. Inference on a test sample likewise yields a 4096-dimensional feature vector, which we match against the feature library to calculate the similarity between the test sample and the good samples. As a vector similarity measure, cosine similarity is widely used in text similarity calculation [27] and face recognition [28]. Cosine similarity is defined as follows:

\[ \cos(\theta) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \tag{3} \]
xi and yi represent the components of the two vectors being compared. The result ranges from −1 to 1: a value of −1 means the two vectors point in opposite directions, a value of 1 means they point in the same direction, and 0 usually means they are independent; intermediate values indicate intermediate degrees of similarity or dissimilarity. The range of values is thus fixed and, unlike the Euclidean distance, does not vary with vector magnitude, which helps us find a suitable threshold. Equation (3) shows that cosine similarity attends to the similarity of the two vectors in direction rather than their absolute difference in value, which suits the comparison of feature vectors better. Therefore, we distinguished the samples by measuring the cosine similarity between feature vectors in the library and feature vectors of the samples.
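Equation (3) translates directly to code; a short sketch with NumPy:

```python
import numpy as np

def cosine_similarity(x, y):
    """Equation (3): cosine of the angle between feature vectors x and y."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
```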
It would be very time-consuming to compare the cosine similarity between the test workpiece and every vector in the feature library because of the inner-product calculations. Therefore, we applied K-means clustering to group the feature library of good parts into 100 clusters by cosine similarity. The 100 cluster-center vectors were taken as the new good-feature library, which only needs to be computed once. For a test armature, we calculated the cosine similarity between its feature vector and each vector in the good-feature library, obtaining a 100-dimensional vector g_dis. We then set a threshold thre_dis and counted how many entries of g_dis met the threshold condition, denoted count. If count is greater than 50 (i.e., more than half of the votes), the output is Y. Through this voting mechanism, we could infer whether a sample was good or not.
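The library construction and voting step might look like the sketch below. The names g_dis, thre_dis, and count follow the text; the scikit-learn call and the L2-normalization trick (Euclidean K-means on unit vectors approximates clustering by cosine similarity) are our assumptions, and the thre_dis value is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_library(good_feats, k=100):
    """Cluster the 4096-d good-sample vectors into k = 100 center vectors."""
    feats = good_feats / np.linalg.norm(good_feats, axis=1, keepdims=True)
    return KMeans(n_clusters=k, n_init=10).fit(feats).cluster_centers_

def is_good(feat, library, thre_dis=0.9):
    """Vote over the 100 centers: output Y (good) if count > 50."""
    feat = feat / np.linalg.norm(feat)
    lib = library / np.linalg.norm(library, axis=1, keepdims=True)
    g_dis = lib @ feat                       # 100 cosine similarities
    count = int((g_dis > thre_dis).sum())    # votes meeting the threshold
    return count > 50                        # more than half of the votes -> good
```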