An Improved Feature Pyramid Network and Metric Learning Approach for Rail Surface Defect Detection

Abstract: When deep learning methods are used to detect rail surface defects, the training accuracy declines because the defects are small and the number of samples is insufficient. This paper investigates rail surface defect detection using an improved feature pyramid network (FPN) and a metric learning approach. Firstly, the FPN is improved by adding deformable convolution and convolutional block attention modules to raise the accuracy of detecting defects of different scales, and it is pretrained on the MS COCO dataset. Secondly, a new model is established to extract network features based on the transfer learning model and the learned network parameters. Thirdly, a multimodal network structure is constructed, and the distance between each modal representative and the embedded feature vector is calculated to classify the defects. Finally, experiments are carried out on the miniImageNet dataset and the rail surface defect dataset. The results show that the mAP (five-way five-shot) of our method is 73.42% on the miniImageNet dataset and 63.29% on the rail defect dataset. Our experiments show the effectiveness of the proposed method, and the results of the rail surface defect detection are satisfactory. As there are few studies on few-sample classification of rail surface defects, this work provides a different approach and lays a foundation for further research.


Introduction
The railroad speedup strategy has greatly increased the speed and frequency of train operations, and rail surfaces develop defects under varying degrees of wear and tear [1,2]. Defects on the rail surface can lead to coupled vibrations when the train operates at high speed, which aggravates the wear and tear of its components and causes accidents [3,4]. Thus, rail surface defect detection becomes even more important. Thus far, many methods have been used to detect defects on a rail surface. Usually, detection is performed manually and visually, but such detection is time-consuming and subjective and offers low accuracy [5]. Automated detection methods are also used, such as ultrasonic detection [6,7], eddy current detection [8,9], magnetic flux leakage detection [10,11], and so on [12–18], but these methods are easily influenced by the hardware of the equipment. Deep learning detection methods [19–23] focus on image features and, compared to the abovementioned methods, detect rail defects more quickly and more accurately. Therefore, it is important to use deep learning to detect defects on a rail surface.
As deep learning techniques develop rapidly and computational power increases, deep convolutional neural networks (DCNNs) are increasingly used to extract and identify features. Researchers have proposed many approaches for applying DCNNs to target recognition, target detection, and other fields. Ming et al. [24] proposed a method to detect the surface defects of rails using 3D-range line-scan cameras combined with deep learning to effectively eliminate the false alarms caused by light, stains, and water stains. Elhanashi et al. [25] exploited different pretrained deep learning models, such as the residual network (ResNet)-50, ResNet-101, VGG-19, and U-Net architectures, to extract features from chest X-ray images and studied the use of three architectures for classification methods. Kang et al. [26] introduced a DCNN-based system to detect surface defects with complex types and only a few samples, which used the Faster RCNN to locate the defects and a deep multitask neural network to obtain classification scores and anomaly scores. Shang et al. [27] proposed a two-stage model for classifying defects: traditional image processing extracted the track part during the first stage, and a fine-tuned CNN classified the image during the second stage. Liu et al. [28] established a DCNN-based detection model and presented a new sample expansion method to solve the problem of too few samples; the sample generation method addressed sample imbalance, and a DCNN detected the defects. Gibert et al. [29] applied a DCNN to the automatic detection of fastener states, combining multiple detectors in a multitask learning framework to improve the detection performance, but the approach was complex and required many training samples. Feng et al. [30] improved You Only Look Once (YOLO) and feature pyramid networks (FPNs) to detect rail defects using the backbone network and the detection layer of MobileNet, which satisfied the requirements of defect localization and real-time processing. Yang et al. [31] presented a method for detecting and localizing the defects of a rail surface; two traditional image processing methods were used to extract the rail images, differential box-counting and the GrabCut algorithm were combined for defect segmentation, and YOLOv2 was used to precisely locate and detect the defects. Ni et al. [32] presented an attention network to detect rail surface defects by crossing over the consistency of the joint-guided centroid estimates, which solved the problems of complex background interference and data imbalance. Liu et al. [33] extracted multiscale features via an FPN; then, the FPN was trained by a lightweight network to detect the defects of a rail surface, which reduced the model complexity and increased the real-time performance. The aforementioned methods improved the detection performance by improving the network structure or loss function, but they did not deeply explore the problems of multiscale defect detection and small defect samples in complex environments.
Although many detection methods based on deep learning have been applied to detect rail defects, there are still many difficulties. The main difficulties are as follows: (1) the defect samples are too few to train a traditional convolutional neural network; and (2) the defect scales vary, and the identification rate is not high, especially for tiny defects.
For difficulty 1, researchers have conducted many studies on generative adversarial networks (GANs), transfer learning, and meta learning. Zhang et al. [34] considered how to use data expansion for few-sample learning and established a data enhancement method based on feature reconstruction and morphing information. Weiss et al. [35] summarized transfer learning and discussed how to use it in the case of few-sample learning. Snell et al. [36] introduced the prototype network for few-shot learning, which learns a metric space in which samples are classified by computing their distances to the prototype representative of each class. Gao et al. [37] presented a prototype network with mixed attention for noisy few-sample relation classification, which addressed the susceptibility to noisy instances and the sparse features in few-shot learning. Lv et al. [38] summarized a few-sample learning method that combined a CNN and an attention module to extract image features, calculated the similarity of images by a relational network, and predicted categories according to the similarities. However, GAN models are hard to control and difficult to train; transfer learning works poorly when the source tasks and target tasks are weakly correlated; and meta learning performs poorly when the test task is far from the training tasks.
For difficulty 2, researchers have studied defect detection via multiscale feature fusion. Xu et al. [39] introduced a bidirectional attention FPN to solve the issue of defect features disappearing as the network deepened. Li et al. [40] combined the Faster RCNN and the FPN, adding refined shallow features to better detect small targets. Based on ResNet-101 and the FPN, Li et al. [41] presented a method to detect printed circuit board defects using an extended FPN module. Dong et al. [42] combined a global contextual attention network and the FPN to detect complex defects on surfaces. Yang et al. [43] considered the low detection precision for small- and medium-sized objects in a single-shot-detector (SSD) network and proposed a detection method for pipeline flux leakage images. Wu et al. [44] designed an extended convolution module with multiscale convolutional kernels to accommodate defects of different sizes and to enhance the feature extraction ability of the network. These methods detected small target defects via the FPN, but they did not study defects with geometric deformations.
Although the abovementioned references propose many defect detection methods, most of them focus on only one of the two difficulties. This paper presents an improved feature pyramid network and metric learning approach for rail surface defect detection. The contributions of the paper are as follows:
• Considering the detection difficulties caused by different rail defect sizes and few samples, an improved feature pyramid network and metric learning approach are proposed to detect rail surface defects. Compared with the existing methods, our method is more effective at classifying defects.
• An improved FPN module is proposed to handle multiscale defects and increase the weight of defect features in the training network. The improved FPN detects small defects more accurately.
• A metric learning method is proposed to classify rail defects by calculating the distance between the multimodal networks and the feature vectors. This method addresses the problem of having few samples.
The rest of this article is organized as follows. The related works are introduced in Section 2. The proposed method is introduced in Section 3. The experimental setup and results are reported in Section 4. The conclusion is drawn in Section 5.

Related Works
Representative-based metric learning (RepMet) is a distance metric learning (DML) method that is useful for few-shot detection; see [45]. The structure of the RepMet model is given in Figure 1. In this paper, the metric learning method of RepMet is used to identify defects. In Figure 1, $R_{ij}$ is the center of the $j$-th mode of the $i$-th class, where $1 \le i \le N$, $N$ is the total number of classes, $1 \le j \le K$, and $K$ is the fixed number of modes; $E$ represents the embedded feature vectors. As shown in Figure 1, the pooled feature vectors are converted into $E$ by the DML embedded module, which consists of several fully connected (FC) layers with batch normalization (BN) and a rectified linear unit (ReLU). The representatives of the individual classes are produced through an FC layer. The representatives $R_{ij}$ describe the multimodal mixture distribution of each class learned in the embedding space. To classify and identify objects, the distance from $E$ to each $R_{ij}$ is calculated and converted into the probability of each class and the background probability of the region of interest (ROI).
Considering the detection difficulties caused by few samples and different defect scales, an improved FPN was established by adding deformable convolution (DC) and an attention module, and the FPN was combined with the Faster RCNN to locate defects. Then, the improved FPN was pretrained on the MS COCO dataset, the trained parameters and model of the FPN were transferred to the defect detection model, and the rail defect features were extracted and localized after finetuning. Finally, the multimodal network structure and the feature vectors of the DML were used to calculate the probabilities for classification and recognition. Figure 2 describes the structure for detecting defects on the surface of rails with few samples.

Defect Feature Extraction and Location
In this subsection, we first describe how the defect ROIs are detected and then how each ROI is pooled with its corresponding feature map to extract defect features.
In the process of detecting defects, it is difficult to extract defects because their sizes vary. The FPN is a useful tool for detecting defects of different sizes, especially small defects. Based on ResNet, the FPN extracts feature maps from different convolutional layers, upsamples the deeper feature map by a factor of two, and superimposes it on the feature map of the current layer, which fuses the information in shallow and deep feature maps. Therefore, the FPN is modified here to improve the accuracy of the model in detecting small defects.
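The top-down fusion described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the channel counts, map sizes, random weights, and the helper names `upsample2x`, `conv1x1`, and `top_down_fusion` are all hypothetical, nearest-neighbour interpolation is assumed for the twofold upsampling, and the 3 × 3 smoothing convolution is omitted.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv1x1(x, weight):
    """1x1 convolution: mixes channels only; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def top_down_fusion(backbone_maps, lateral_weights):
    """Fuse backbone maps (ordered shallow -> deep) into pyramid maps.

    Each level is the 1x1-projected backbone map plus the 2x-upsampled
    fused map of the next deeper level.
    """
    laterals = [conv1x1(c, w) for c, w in zip(backbone_maps, lateral_weights)]
    fused = [laterals[-1]]  # the deepest level starts the top-down path
    for lateral in reversed(laterals[:-1]):
        fused.append(lateral + upsample2x(fused[-1]))
    return fused[::-1]  # back to shallow -> deep order

# Toy backbone outputs (hypothetical sizes, far smaller than ResNet50's).
rng = np.random.default_rng(0)
in_channels = [8, 16, 32, 64]
sizes = [32, 16, 8, 4]
backbone_maps = [rng.standard_normal((c, s, s)) for c, s in zip(in_channels, sizes)]
lateral_weights = [rng.standard_normal((8, c)) for c in in_channels]

pyramid = top_down_fusion(backbone_maps, lateral_weights)
print([p.shape for p in pyramid])
```

Every pyramid level ends up with the same channel count, so a single detection head can be shared across scales; the shallow levels keep their high resolution while inheriting semantic information from the deep levels.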
Adding attention mechanisms to deep neural networks makes the network pay more attention to specific inputs; through weight allocation, it increases the attention paid to important features and reduces the influence of unimportant ones. The convolutional block attention module (CBAM) is a lightweight module that is trained end to end together with a CNN. To increase the weight of defect features during network training, the defect features are extracted with an added CBAM, whose structure is shown in Figure 3; see [46]. Figure 3 shows that the CBAM includes two separate submodules, i.e., the channel attention module (CAM) and the spatial attention module (SAM). The output of the CAM is obtained by passing the max-pooled and average-pooled features through a shared network, and the output of the SAM is obtained by pooling along the channel axis and passing the result through a convolution layer. The CAM and SAM focus on the important features of the images in the channel dimension and the spatial dimension, respectively, and the two modules are connected in series. To obtain the spatial attention map, the channel information of the feature map is summarized by the two pooling operations, and the results are concatenated and convolved by a standard convolution layer. The CBAM can be described by

$$F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F' \tag{1}$$

where $F \in \mathbb{R}^{C \times H \times W}$ is the input feature map, $M_c \in \mathbb{R}^{C \times 1 \times 1}$ is the 1D channel attention map, $M_s \in \mathbb{R}^{1 \times H \times W}$ is the 2D spatial attention map, $F''$ is the final refined output, $\otimes$ denotes element-wise multiplication, and $C$, $H$, and $W$ are the number of channels, the height, and the width of the feature map, respectively.

Next, the ROI extraction in the Faster RCNN was improved by adding the FPN to extract the defect features. ResNet50 serves as the basic network, and the traditional convolution operation used in the basic ResNet50 can be defined as

$$y(P_0) = \sum_{P_n \in R} w(P_n)\, x(P_0 + P_n) \tag{2}$$

where $y(P_0)$ is the convolution value at point $P_0$ in the image, $P_n$ enumerates all positions in $R$, $R$ is the feature space, $x$ is the input, and $w$ is the weight of the features. From Equation (2), it can be seen that the traditional convolution has the same receptive field at every position of the image and can only extract features at a fixed size. The generalization ability of traditional convolution is therefore limited when the scales of defects change. To handle the varying scales of rail defects, the DC is introduced into ResNet50, and the DC can be defined as

$$y(P_0) = \sum_{P_n \in R} w(P_n)\, x(P_0 + P_n + \Delta P_n) \tag{3}$$

where $\Delta P_n$ is the offset variable. With the learned offsets, the size and location of the DC kernel are adjusted according to the current object content: the convolution kernel shifts its sampling points according to the image context so as to fit the shape and size of different objects.

Next, based on ResNet50, an improved FPN was designed to obtain feature maps of rail surfaces at different scales. The specific improvements were as follows: the DC was added to detect rail defect features with variable dimensions, and the CBAM was added to increase the weight of the defect features and reduce the amount of calculation. The region proposal network (RPN) of the Faster RCNN was used to extract the ROIs from each feature map. Usually, the sizes of the anchors in the feature maps are 16 × 16, 32 × 32, 64 × 64, or 128 × 128; however, because defects are not necessarily square, the aspect ratios of the anchors were set to 1:2, 1:1, and 2:1. Figure 4 shows the structure of the improved FPN, which can be described as follows: the DC operation was applied to the image, and the CBAM focused on the defect features; then, the feature maps were successively processed by a 1 × 1 convolution, upsampling, and a 3 × 3 convolution to obtain feature maps of different sizes (denoted by P2, P3, P4, and P5); finally, the ROIs were obtained from the RPN and pooled to extract the defects.

In the RPN, the predicted frame of a defect must be kept consistent with the true frame. Therefore, the frames need to be optimized by border regression functions. The position of a box is determined by its center coordinates, width, and height. Let $(x, y, w, h)^T$, $(x^*, y^*, w^*, h^*)^T$, and $(x_a, y_a, w_a, h_a)^T$ denote the center coordinates, width, and height of the predicted boundary, the true boundary, and the anchor, respectively, where $T$ denotes the transpose. To calculate the deviations among the predicted frame, the true frame, and the anchors, the offsets are defined by

$$t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a}$$

$$t_x^* = \frac{x^* - x_a}{w_a}, \quad t_y^* = \frac{y^* - y_a}{h_a}, \quad t_w^* = \log\frac{w^*}{w_a}, \quad t_h^* = \log\frac{h^*}{h_a}$$

where $t_i = (t_x, t_y, t_w, t_h)^T$ is the offset predicted for the $i$-th anchor, and $t_i^* = (t_x^*, t_y^*, t_w^*, t_h^*)^T$ denotes the offset between the anchor and the true border. The regression loss function used in the RPN is given by

$$L_{reg} = \frac{1}{N_{anc}} \sum_i p_i^* \, L(t_i - t_i^*), \qquad L(z) = \begin{cases} 0.5\, z^2 / \beta, & |z| < \beta \\ |z| - 0.5\, \beta, & \text{otherwise} \end{cases}$$

where $N_{anc}$ is the number of anchors, $p_i^*$ is the category label of the $i$-th anchor, $L$ is the smooth L1 function, and $\beta$ is a parameter that controls where the function changes from its quadratic to its linear form. During training, the intersection over union (IoU) is defined as the overlap rate between the predicted frame and the true frame of the ROI: if IoU > 0.7, the ROI is a positive sample, and $p_i^* = 1$; if IoU < 0.3, the ROI is a negative sample, and $p_i^* = 0$.

Few samples are available for rail surface defect detection, which is not enough for network training and results in overfitting. To avoid overfitting and obtain better network parameters, the improved FPN was pretrained on the MS COCO dataset, and the learned parameters and model of the FPN were transferred to the rail surface detection model using transfer learning. In our work, the FPN was used to optimize the detection model; all the convolutional blocks of ResNet50 were used as the backbone for transfer learning, so that small-scale anchor boxes were generated from the feature maps extracted from the fifth convolutional block, which improved the detection of small objects. The neural network extracts features in order from the low level to the high level. Low-level features are highly general, such as texture and edge information, whereas high-level features carry information on the overall category of the target. Therefore, during transfer learning, the convolutional layers close to the input were frozen to retain the low-level feature recognition ability of the model, and the convolutional layers close to the output were finetuned to identify the rail defect feature information.
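The border regression and sample labelling steps described above can be illustrated with a short NumPy sketch. It assumes the standard Faster RCNN offset parameterization and a smooth L1 loss with parameter β; the function names, the example boxes, and the choice to mark ROIs with an IoU between the two thresholds as ignored (label -1) are assumptions made for illustration and are not stated in the text.

```python
import numpy as np

def encode_offsets(box, anchor):
    """Offsets (t_x, t_y, t_w, t_h) of a (cx, cy, w, h) box relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha, np.log(w / wa), np.log(h / ha)])

def decode_offsets(t, anchor):
    """Inverse of encode_offsets: recover (cx, cy, w, h) from offsets."""
    xa, ya, wa, ha = anchor
    return np.array([t[0] * wa + xa, t[1] * ha + ya, wa * np.exp(t[2]), ha * np.exp(t[3])])

def smooth_l1(z, beta=1.0):
    """Smooth L1: quadratic for |z| < beta, linear otherwise."""
    z = np.abs(np.asarray(z, dtype=float))
    return np.where(z < beta, 0.5 * z ** 2 / beta, z - 0.5 * beta)

def iou(box_a, box_b):
    """Intersection over union of two (cx, cy, w, h) boxes."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    return inter / (box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter)

def label_from_iou(value, pos_thr=0.7, neg_thr=0.3):
    """1 = positive, 0 = negative, -1 = ignored (assumption for the middle band)."""
    if value > pos_thr:
        return 1
    if value < neg_thr:
        return 0
    return -1

# Hypothetical anchor with a 1:2 aspect ratio and a nearby ground-truth box.
anchor = (50.0, 50.0, 32.0, 64.0)
truth = (54.0, 46.0, 40.0, 60.0)
t_star = encode_offsets(truth, anchor)
print(t_star, label_from_iou(iou(anchor, truth)))
```

Encoding offsets relative to the anchor size keeps the regression targets scale-invariant, which matters here because the defects range from tiny dots to long cracks.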

Defect Classification and Identification
Based on RepMet, a multimodal network was established to extract the feature information of the ROI, obtain different feature vectors, and distinguish the defects by measuring the distances between the feature vectors of the modes and the feature vectors of the DML.
The input of RepMet was a fixed-size feature map of the ROI, which was extracted by the improved FPN. Then, a convolution layer was used to further extract the ROI features to better distinguish different categories and ensure that the features extracted from the same category were uniform. The extraction process can be defined as

$$F_i = f(x_i), \quad x_i \in O$$

where $f$ is the convolution operation, $x_i$ is the feature map of the extracted ROI, $O$ is the set of ROIs, and $F_i$ is the feature map after convolution processing. Figure 2 shows the structure of the DML embedded module; the feature vector module $E$ is composed of three FC layers, each followed by a ReLU for nonlinear processing, where $FC_{v1}$ = 512d, $FC_{v2}$ = 256d, $FC_{v3}$ = 128d, and $FC_{vt}$, $t \in \{1, 2, 3\}$, denotes the $t$-th FC layer of the feature vector module. Finally, the ROI was processed by the feature vector module, and the feature vector $E_i = E(F_i)$ was obtained to extract the common characteristics of all feature information.

The multimodal network shown in Figure 2 extracts the shared information of samples from the same category and the distinguishing information of samples from different categories. Similarly, the feature map $F_i$ was input to a high-dimensional FC layer, denoted by $FC_m$, for nonlinear processing. To extract richer feature information, this layer was set to $FC_m$ = 1024d, which made it convenient to learn the feature information of each mode. In each modal network, three FC layers with ReLU were used, where $FC_{n1}$ = 512d, $FC_{n2}$ = 256d, $FC_{n3}$ = 128d, $n \in \{1, \ldots, N\}$, $N$ denotes the number of classes, and $FC_{ns}$, $s \in \{1, 2, 3\}$, denotes the $s$-th FC layer of each modal network. After the nonlinear processing, the feature vector $E_{ij}$ of the multimodal network was obtained, and the distance between $E_i$ and $E_{ij}$ is defined as

$$D_j(E_i) = \lVert E_i - E_{ij} \rVert_2, \quad 1 \le j \le K \tag{8}$$

where $K$ is the number of modes. All class distributions are assumed to be mixtures of isotropic multivariate Gaussian distributions; $D_j(E_i)$ is then used to calculate the probability of the $i$-th class in the $j$-th mode, which can be defined as

$$p_j(E_i) \propto \exp\left(-\frac{D_j^2(E_i)}{2\sigma^2}\right) \tag{9}$$

where $\sigma^2$ is the variance. From Equations (8) and (9), the posterior probability of the defect category in the ROI is obtained; in this posterior, $C$ denotes the $i$-th class, the minimum is taken over the distances calculated for all modes, $G$ is the mixture coefficient, and $X$ is the covariance of the modes. After the posterior probability of the defect category in the ROI was calculated, the posterior probability of the background category, denoted by $B$, was calculated from the foreground probability.
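A minimal sketch of how distances to modal representatives can be turned into class and background probabilities is given below. It follows the RepMet formulation in [45] as commonly described (an unnormalised isotropic Gaussian per mode, the closest mode deciding the class score, and the background score as the complement of the best foreground score); this is an assumption about the intended computation rather than the authors' code, and the function name, array shapes, and the value of `sigma` are hypothetical.

```python
import numpy as np

def class_posteriors(embedding, representatives, sigma=0.5):
    """Class and background scores for one embedded ROI vector.

    embedding:       (D,) embedded feature vector of the ROI.
    representatives: (N, K, D) array, K modal representatives for each of N classes.
    Returns (class_probs of shape (N,), background probability).
    """
    # Euclidean distance from the embedding to every mode of every class.
    dists = np.linalg.norm(representatives - embedding, axis=-1)   # (N, K)
    # Isotropic Gaussian likelihood of each mode (unnormalised).
    mode_probs = np.exp(-dists ** 2 / (2.0 * sigma ** 2))          # (N, K)
    # The closest mode (largest likelihood) decides the class score.
    class_probs = mode_probs.max(axis=1)                           # (N,)
    # Background score: complement of the best foreground score.
    background = 1.0 - class_probs.max()
    return class_probs, background

# Two classes with two modes each in a 2-D embedding space (toy values).
reps = np.array([[[0.0, 0.0], [1.0, 1.0]],
                 [[3.0, 3.0], [4.0, 0.0]]])
probs, bg = class_posteriors(np.array([1.0, 1.0]), reps)
print(probs, bg)
```

Because each class keeps several representatives, a class whose defects look quite different from one another (for example, irregular defects) can still be matched through whichever mode is closest.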
Next, the embedded loss function $L_{em}$ and the cross-entropy loss function $L_{CE}$ were used as the loss for classification and recognition. $L_{em}$ ensures that the distance between an embedded vector and the modes of its correct class is small and that the distance to the modes of incorrect classes is large. In $L_{em}$, $i^*$ is the label of the correct class, and $\alpha$ is the margin between the nearest distance from $E_i$ to the correct class and the nearest distance from $E_i$ to an incorrect class. $L_{CE}$ is computed from the posterior probabilities of the defect categories and the background. Finally, let $L_t = L_{em} + L_{CE}$, which was used to adjust the network parameters for classification and recognition through backpropagation in the case of few samples.
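The behaviour of the embedded loss can be sketched as a hinge on the gap between the nearest correct-class mode and the nearest incorrect-class mode. The form below is the margin loss of RepMet [45] as it is usually written; treating it as the definition used here is an assumption, and the function name and the margin value `alpha = 0.2` are hypothetical.

```python
import numpy as np

def embedding_loss(dists, correct_class, alpha=0.2):
    """Margin loss for one embedded vector.

    dists:         (N, K) distances from the vector to every mode of every class.
    correct_class: index i* of the ground-truth class.
    alpha:         required margin between correct and incorrect classes.
    """
    nearest = dists.min(axis=1)                          # nearest mode per class
    d_correct = nearest[correct_class]
    d_wrong = np.delete(nearest, correct_class).min()    # best incorrect class
    # Zero once the correct class is closer than any other by at least alpha.
    return max(0.0, float(d_correct - d_wrong + alpha))

toy_dists = np.array([[0.1, 0.9],
                      [0.5, 0.7]])
print(embedding_loss(toy_dists, correct_class=0))  # margin satisfied
print(embedding_loss(toy_dists, correct_class=1))  # margin violated
```

The loss is inactive for samples that are already well separated, so training effort concentrates on embeddings that sit too close to a mode of the wrong class.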

Experiment
In this section, the proposed method is verified on the miniImageNet dataset, and the constructed defect dataset is used to detect and classify defects.

Experiment Dataset
In this subsection, the miniImageNet dataset and the rail surface defect dataset used in the comparison and ablation experiments are introduced.

(1) MiniImageNet dataset
The miniImageNet dataset is a benchmark dataset for meta-learning and few-shot learning; it contains 100 categories, and each category includes 600 samples. All samples of the miniImageNet dataset are drawn from the ImageNet dataset.
(2) Defect dataset of rail surface
A commonly used dataset given by [47] is introduced in this subsection; it consists of images of rail surfaces with at least one defect. The dataset contains two types of images: one type from fast rails and the other from normal or heavy-duty rails. To facilitate the study of the rail surface dataset, the images were cropped, and the defect types are shown in Figure 5. The figure shows that the defect images were divided into five categories: crack, regular circle, irregular, small, and blur. Crack defects are long, narrow cracks across the rail surface; regular circle defects are round defects on the rail surface; irregular defects are surface defects that may consist of many fine-grained shapes; small dotted defects are very tiny rail surface defects that can be observed only when the image is enlarged; blurred defects are defects whose outline cannot be clearly seen by eye.

Evaluation Metrics
In this subsection, the evaluation metrics used in this paper are introduced. In object detection, the classification target is classified as a positive or negative sample, and the prediction result is classified as true or false. This gives four types of samples: true positive (TP), true negative (TN), false negative (FN), and false positive (FP).
Precision is the ratio of the TP to the number of samples predicted to be positive:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall is the ratio of the TP to the number of actual positive samples:

$$\text{Recall} = \frac{TP}{TP + FN}$$

In addition, the evaluation metrics are given by

$$AP = \int_0^1 p(r)\, dr, \qquad mAP = \frac{1}{Q} \sum_{q=1}^{Q} AP(q)$$

where the average precision (AP) is the area enclosed by the coordinate axes and the precision-recall curve, $p$ is the precision, $r$ is the recall, $q$ indexes the categories, and $Q$ is the number of categories. The mean average precision (mAP) is the sum of the per-class AP values divided by the number of classes. The mAP is used as the evaluation standard in this paper.
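As a small worked example of these metrics, the sketch below computes precision, recall, AP, and mAP with NumPy. It is illustrative only: the area under the precision-recall curve is approximated by a plain rectangle sum over the given operating points without interpolation, which is one of several conventions and is an assumption here, and all counts and curve points are made up.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """Rectangle-sum approximation of the area under the precision-recall curve.

    Points must be sorted by increasing recall; recall is taken to start at 0.
    """
    recalls = np.asarray(recalls, dtype=float)
    precisions = np.asarray(precisions, dtype=float)
    widths = np.diff(np.concatenate(([0.0], recalls)))
    return float(np.sum(widths * precisions))

def mean_average_precision(ap_per_class):
    """mAP: mean of the per-class AP values."""
    return float(np.mean(ap_per_class))

p, r = precision_recall(tp=8, fp=2, fn=8)
ap_a = average_precision([0.5, 1.0], [1.0, 0.5])   # hypothetical class A
ap_b = average_precision([0.5, 1.0], [0.5, 0.0])   # hypothetical class B
print(p, r, ap_a, ap_b, mean_average_precision([ap_a, ap_b]))
```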

Experimental Results of the miniImageNet Dataset
In the experiment, the 5-way 1-shot and 5-way 5-shot modes were used to train the network. That is, one sample (1-shot) or five samples (5-shot) were selected from each category of the dataset to train the network. The loss curves over the training iterations are given in Figure 6. The curves in Figure 6 show that the training loss of each method decreased gradually and converged during the training process, with the curves of RepMet and our method being more stable.
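The N-way K-shot protocol mentioned above can be sketched with the Python standard library. The sketch assumes the usual episodic setup, in which each episode draws N classes, K labelled support samples per class, and a separate set of query samples; the function name, the query size `n_query`, and the toy dataset are hypothetical and are not taken from the paper.

```python
import random

def sample_episode(items_by_class, n_way=5, k_shot=1, n_query=3, seed=None):
    """Draw one N-way K-shot episode.

    items_by_class: dict mapping a class label to a list of sample identifiers.
    Returns (support, query), each a list of (sample, label) pairs.
    """
    rng = random.Random(seed)
    classes = rng.sample(sorted(items_by_class), n_way)
    support, query = [], []
    for label in classes:
        # Draw support and query samples together so they never overlap.
        picked = rng.sample(items_by_class[label], k_shot + n_query)
        support += [(item, label) for item in picked[:k_shot]]
        query += [(item, label) for item in picked[k_shot:]]
    return support, query

# Toy dataset: 10 classes with 20 samples each.
toy = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
support_1, query_1 = sample_episode(toy, n_way=5, k_shot=1, seed=0)
support_5, query_5 = sample_episode(toy, n_way=5, k_shot=5, seed=0)
print(len(support_1), len(support_5), len(query_1))
```

In the 5-way 1-shot mode the support set holds 5 labelled samples in total, and in the 5-way 5-shot mode it holds 25, which is why the mAP is expected to rise from the first mode to the second.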
(1) Performance evaluation
To compare the proposed method with traditional deep learning, two schemes are usually used to test the performance of a network: training the network on the miniImageNet dataset and learning from scratch. Our experiment adopted training on the miniImageNet dataset, and the experimental results are listed in Table 1. Table 1 shows that, in both the 5-way 1-shot and 5-way 5-shot modes, the mAP of our method was better than that of the other methods except for the state-of-the-art MBSS method [48], which implies that our method is also satisfactory. Our method was 5.7% and 8.34% lower than MBSS in the 5-way 1-shot and 5-way 5-shot tasks, respectively. The backbone network used by MBSS is ResNet12, whose model complexity is lower than that of our method.

(2) Ablation experiments
To verify the effectiveness of the improvements proposed in this paper, ablation experiments were conducted with the same experimental data and hyperparameters as in the above experiments. An ablation experiment changes only the component under study and leaves everything else unchanged. The experiments were as follows: in the 'non-CBAM' experiment, the CBAM was not used; in the 'non-FPN' experiment, the FPN was not used, and only the last layer of features was used to extract the ROI; in the 'non-DC' experiment, deformable convolution was not used; in the 'non-FT' experiment, the pretrained model was used to extract the ROI, but the ROI extraction module was not finetuned; in the 'IOU-DML' experiment, the ROI extraction module was combined with the DML structure of RepMet. The results of the ablation experiments are given in Table 2. Comparing Tables 1 and 2, the results can be summarized as follows: the mAP of our method was low when the FPN or the DC was not used; when the CBAM was not used, when the ROI module was not finetuned and the loss function was not used in the module, or when the ROI module was replaced with the DML network in RepMet, the mAPs shown in Table 2 were slightly higher than that of RepMet in Table 1. In addition, it can be seen that the FPN and DC directly influenced the mAP of our method, and when the ROI finetuning, loss function, and CBAM were added, the performance of the network was enhanced.

Experimental Results of the Defect Dataset
To verify the effectiveness of our method on the rail surface defect dataset, related detection experiments were carried out.
The CBAM was tested to observe the effect of the attention module. Figure 7 compares the class activation maps obtained with and without the CBAM; it shows that the method including the CBAM responded more strongly to the defect features and was more effective at detecting defects. In the experiment on the rail dataset, the 5-way 1-shot and 5-way 5-shot modes were used to train and evaluate the performance of the network. Table 3 shows the mAP of the different methods on the rail surface defect dataset. Table 3 shows that, from the 5-way 1-shot to the 5-way 5-shot mode, the mAP of each method improved as the number of samples increased. Our method improved on the RepMet method: compared with RepMet, the mAP of our method increased by 6.5% in the 5-way 1-shot case and by 6.14% in the 5-way 5-shot case. Finally, the test results of the rail classification on the five types of defects are shown in Figure 8.

Conclusions
For complex rail surfaces, this paper proposed a rail surface defect detection method based on an improved FPN and metric learning. In the FPN, the traditional convolution was replaced by the deformable convolution to deal with the different sizes and easy deformation of defects. The CBAM was added to increase the weight of the defect features and reduce the amount of calculation. The RPN was added to extract the features and locate the bounding boxes of the small sample defects. To address the problem that only a few samples are available, which cannot meet the training requirements of the model, the improved FPN was pretrained on the MS COCO dataset, and the trained parameters were transferred to the rail defect detection model. Based on metric learning, a multimodal network was established to classify the defects by calculating the distance between each mode and the embedded feature vector. The effectiveness of our method was verified by comparison experiments and ablation experiments. The results showed that the mAP (5-way 5-shot) of our method was 73.42% on the miniImageNet dataset and 63.29% on the rail defect dataset. Due to the complex features and high interclass similarity of the rail surface defects, the mAP values of our method on the rail defect dataset were all lower than those on the miniImageNet dataset. In terms of limitations, our method was easily affected by the sample size and focused more on extracting the ROI; the classifier part needs to be further improved. In the future, we aim to develop and improve different methods to locate and classify rail surface defects. In addition, we will investigate data augmentation methods to enrich the current dataset and further improve the accuracy of classification.

Figure 1. The structure of the RepMet model.

Figure 2. The structure for detecting defects on the surface of rails with few samples.

Figure 4. The structure of the improved FPN.

Figure 5. The types of rail defects.

Figure 6. The training loss of the different methods.

Figure 8. Test results of the rail classification.

Table 1. The mAP of different methods on the miniImageNet dataset.

Table 2. The mAP results of the ablation experiments on the miniImageNet dataset.

Table 3. The mAP of the rail surface defect dataset using different methods.