Intelligent System for Railway Wheelset Press-Fit Inspection Using Deep Learning

Railway wheelsets are the key to ensuring the safe operation of trains. To achieve zero-defect production, railway equipment manufacturers must strictly control every link in the wheelset production process. The press-fit curve output by the wheelset assembly machine is an essential indicator of the wheelset’s assembly quality. The operators will still need to manually and individually recheck press-fit curves in our practical case. However, there are many uncertainties in the manual inspection. For example, subjective judgment can easily cause inconsistent judgment results between different inspectors, or the probability of human misinterpretation can increase as the working hours increase. Therefore, this study proposes an intelligent railway wheelset inspection system based on deep learning, which improves the reliability and efficiency of manual inspection of wheelset assembly quality. To solve the severe imbalance in the number of collected images, this study establishes a predicted model of press-fit quality based on a deep Siamese network. Our experimental results show that the precision measurement is outstanding for the testing dataset contained 3863 qualified images and 28 unqualified images of press-fit curves. The proposed system will serve as a successful case of a paradigm shift from traditional manufacturing to digital manufacturing.


Introduction
In recent years, many leading industrial countries have invested in national plans to support the domestic manufacturing industry's move towards smart manufacturing to achieve Industry 4.0's vision. Industry 4.0 includes many technologies and related paradigms, such as the industrial internet of things, cloud-based design, digital technologies, and innovative applications of artificial intelligence (AI) [1]. It has been recognized that AI technologies are impressive and bring many benefits to the enterprise.
The transportation sector, especially the railway sector, has adopted Industry 4.0 to a large extent to improve service quality, reduce cost, and increase resource utilization [2]. For railway transportation, the routine inspection and maintenance of train components and rails are crucial, because neglecting any link is likely to cause major harm. With the development of computer vision techniques, it has become possible for railway systems to use visual inspection technology to replace manual inspection. Liu et al. [3] provided a full review of visual inspection applications based on image processing in the railway industry for rail surface, track component wear, rail component identification, train body components, etc.
Wheelsets are key to ensuring the safe operation of trains because the failure of an axle, a wheel, or one of its bearings will inevitably lead to accidental derailment [4][5][6].
The primary source of damage to railway facilities and vehicles is wheel defects. Possible causes of damage include shaft misalignment, contaminated areas, excessive load, overheating, lack of lubrication, electrical damage, wheel damage, and manufacturing defects, etc. [6]. To develop wayside measurement systems for wheel defect recognition, Zhang et al. [7] utilized the optoelectronic measuring technique to develop a non-contact measurement system capable of measuring the geometric parameters of wheelsets online. Krummenacher et al. [8] proposed two methods to automatically detect wheel defects based on the wheel vertical force, measured by a permanently installed sensor system on the railway network. One method is based on novel wavelet features to classify time series data by support vector machine (SVM); the other method trains the convolutional neural networks (CNN) for different defect types to predict if a wheel has a defect during regular operation. Mosleh et al. [9] established a 3D numerical dynamic model of a vehicle-track coupling system and analyzed the sensitivity and reliability of different sensors and setups for wheel flat detection. In their method, the wheel flat was identified by the envelope spectrum approach with spectral kurtosis analysis by considering the evaluated shear and accelerations in 19 positions as inputs. Gao et al. [10] used a railway wheel flat measurement method based on a parallelogram mechanism to detect wheel flats dynamically and quantitatively. In addition, they established a three-dimensional simulation model based on the rigid-flexible coupled multibody dynamics theory to improve the speed threshold of the measuring mechanism vibration under wheel impact. Zhou et al. [11] proposed a new, long-term monitoring method for wheel flats based on multi-sensor arrays. In their method, the dynamic strain responses of rails are captured by sensor arrays mounted on the rail web to ensure that all the wheels are assessed during the train passage. The above research is mainly focused on detecting possible defects in train wheels due to long-term use or gradual deterioration.
The wheelset is a critical component of the traveling mechanism and must provide precise geometric and mechanical characteristics to minimize dynamic action and avoid derailment [12]. Therefore, whether the wheel and axle can be closely integrated is an essential key to safety inspection in the manufacturing procedure of wheelsets. In railway wheelset manufacturing and maintenance, numerical control (NC) wheelset assembly machines are commonly used to press-fit wheels, and continuously monitor and record pressure changes through the generated force-time or force-displacement curves [13]. The force-displacement curve, also known as the press-fit curve, is an important indicator of the quality of wheelset press-fitting. As insufficient or excessive press-fitting force will lead to safety risks, operators need to monitor and evaluate the assembly quality based on the characteristics of the press-fit curve change at any time [14]. In general, the press-fit quality can meet the standard by using NC wheelset assembly machines, but this is far from enough to achieve the goal of zero-defect production in railway equipment manufacturing. For safety reasons, operators need to recheck press-fit curves manually and individually, instead of judging by press-fit records, as in our practical case.
However, there are many uncertainties in manual inspection because the specific measurement criteria of curve judgment are not suitable for quantitative representation, such as the judgement rules of "flat part" or "evenly rise" [15]. Consequently, there are often ambiguous results in press-fit curve judgment, which lead to different interpretations of the results between inspectors. Figure 1 shows some of the press-fit curves used in this study. It can be seen that there is little difference between the qualified and unqualified images of press-fit curves. In addition, human misinterpretation errors are more likely to happen, due to the long working hours required. Therefore, an intelligent automatic system that can assist the inspector in judging the press-fit curves is urgently needed to improve the reliability and efficiency of wheelset press-fit inspection.
cessful application of the deep Siamese network in manufacturing and proposal of an intelligent railway wheelset inspection system, suitable for railway equipment manufacturers. The remainder of this paper is organized as follows. The review of deep Siamese neural network is described in Section 2. Section 3 presents the system architecture of the intelligent railway wheelset inspection system and the deep learning technique used. Section 4 describes experimental results and analysis, and the conclusions are presented in Section 5. In the figures, the y-axis is the press-mounting force, and the x-axis is its corresponding displacement.

Deep Siamese Neural Networks
Deep learning has recently become a trendy research topic in the AI field; it solves the core problems in representation learning by expressing simpler representations [19]. However, when there are almost no available data or a relatively small amount of data, these algorithms often fail to predict accurate results. Under this restriction, many researchers proposed various one-shot learning algorithms, which enable us to make the Compared with traditional machine learning technologies, deep learning has recently become a trendy research topic in AI [16][17][18]. This solves the central problems in representation learning by expressing more straightforward representations to enable computers to build complex patterns out of simpler concepts [19]. However, AI technologies still have limitations. Data are key to successful machine-learning algorithms. Insufficient data will lead to failed classification results. However, in the practical cases of the manufacturing industry, it is not easy to collect enough representative abnormal data because normally operating factories are unlikely to experience abnormal conditions frequently. For this reason, the number of abnormal samples that can be collected in this experiment is obviously much lower than for normal samples. The number of unqualified images of the press-fit curve only accounts for 2.3% of all images (109/4754). A description of the number of collected images can be found in Section 4.
At present, many scholars have successfully applied the Siamese network methods in various application fields to suppress the impact of class imbalance on classification performance [20][21][22]. Therefore, this study applies a deep Siamese network to establish the prediction model of press-fit quality, to solve the severe imbalance between positive and negative samples. The main contribution of this paper is a demonstration of the successful application of the deep Siamese network in manufacturing and proposal of an intelligent railway wheelset inspection system, suitable for railway equipment manufacturers. The remainder of this paper is organized as follows. The review of deep Siamese neural network is described in Section 2. Section 3 presents the system architecture of the intelligent railway wheelset inspection system and the deep learning technique used. Section 4 describes experimental results and analysis, and the conclusions are presented in Section 5.

Deep Siamese Neural Networks
Deep learning has recently become a trendy research topic in the AI field; it solves the core problems in representation learning by expressing simpler representations [19]. However, when there are almost no available data or a relatively small amount of data, these algorithms often fail to predict accurate results. Under this restriction, many researchers proposed various one-shot learning algorithms, which enable us to make the correct prediction using only one or a few training examples in each class [23][24][25]. At present, theoretical studies and technologies based on the Siamese neural networks (SNN) have been mature and were successfully applied in various fields [26], such as audio and speech signal processing [27,28], remote-sensing scenes [29], biology [30], medicine and health [31][32][33], robotics [34], smart surveillance [35], and text mining [36].
In the manufacturing industry fields, Jalonen et al. developed a visual product tracking system by using the Siamese network method to match the product images at both ends of the tracked process [37]. For tool wear recognition, Kurek et al. applied the Siamese network technique to classify the drill wear states based on images of drilled holes. The proposed automated solution can reduce the time required to manually evaluate the drill state [38]. It is necessary to check whether printed outputs or carved wares are missing or etched by comparing the drawings in the printing and carving industries. To reduce the workforce cost and working hours, Wang et al. presented an effective method of character verification to automatically compare the similarities between the drawing characters and scanned physical characters based on the Siamese network [39].
Koch et al. first applied deep learning based on a convolutional neural network to develop SNN for one-shot classification [24]. The general SNN architecture is shown in Figure 2. An SNN consists of twin networks that accept a pair of images as input and share the same weights. The weights guarantee that their respective networks could not map two extremely similar images to very different locations in feature space, because each network computes under the same function [24]. In this deep SNN architecture with L hidden layers and N l units, h l 1 represents the hidden vector in layer l for the first twin, and h l 2 denotes the same for the second twin. The notations x 1,i and x 2,i represent specific vector elements in two input images. In the distance layer, the difference is calculated between the twin feature vectors h L 1 and h L 2 in the last layer of hidden layers by distance formulas, such as Euclidean distance ( . After the distance layer, we adopt a fully connected method with a sigmoid activation function to predict the similarity of two input images. Rgardinge the choice of hidden layer network architecture, this study adopts deep residual networks (ResNets) because they have the advantage of efficiently and easily training substantially deeper networks [40].  To evaluate loss function, let S represent the training set size. Let y x i , x j be a length-S vector, which contains the binary labels for the training set, where y x i , x j = 1 if the samples x i and x j are from the same class, and zero otherwise. The cross-entropy loss function for binary classification is formulated as follows [24]:

System Overview and Description
where p x i , x j is a length-S vector, which contains the probabilities of predicting similarity for any pair of input samples, x i and x j in the training set.

System Overview and Description
Humans are the most valuable asset in the manufacturing industry. When developing the system, we should need to account for the human-in-the-loop in interaction [41]. To fully describe the main processes of the system, we roughly divide the system architecture into two parts: cyberspace and physical space. The proposed system architecture is presented in Figure 3.

System Overview and Description
Humans are the most valuable asset in the manufacturing industry. When developing the system, we should need to account for the human-in-the-loop in interaction [41]. To fully describe the main processes of the system, we roughly divide the system architecture into two parts: cyberspace and physical space. The proposed system architecture is presented in Figure 3. In cyberspace, the main task is to establish and evaluate the prediction model of press-fit quality. The purpose of the image-preprocessing stage is to automatically segment the press-fit curve region and remove unrelated areas from the original image of the recording press-fit information output using the NC wheelset assembly machine. Figure   Figure 3. The system architecture of the proposed intelligent railway wheelset inspection system.
In cyberspace, the main task is to establish and evaluate the prediction model of pressfit quality. The purpose of the image-preprocessing stage is to automatically segment the press-fit curve region and remove unrelated areas from the original image of the recording press-fit information output using the NC wheelset assembly machine. Figure 4 shows the image preprocessing steps. Figure 4a is an original image, containing the wheelset and press-fit information. Then, we crop the region of interest (ROI) for the press-fit curve at a fixed position in the original image to obtain Figure 4b. The image size of the ROI of the press-fit curve is 400 × 602 pixels. To convert the image of the press-fit curve into a binary image for subsequent modeling, we first convert the color press-fit image to grayscale and apply a Gaussian low-pass filter to image smoothing to suppress the high-frequency parts of the image. Then, we use grayscale 127 as the threshold value to binarize the smoothed image based on experimental experiences. Pixels with grayscale values below 127 in the image are regarded as candidate objects for the press-fit curve. We take half of the highest 255 grayscales as the binary threshold because, if the threshold value is set too high, over-segmentation will occur. This will allow more false candidate objects to be misjudged as the press-fit curve. However, if the threshold value is set too low, it will cause under-segmentation, which leads to some parts of the press-fit curve being discarded. The initial result after binarization is shown in Figure 4c. In the postprocessing stage, if the area of these objects is too small or the position of an object is obviously not the press-fit curve on image space, such as the straight line above the curve, we will treat these as noise and delete them. The final result is shown in Figure 4d. For a description of the deep Siamese modeling, please refer to the following subsection. Once evaluation and verification of the predictive model are completed, it will be deployed to the enterprise private cloud for remote access by field operators.
of the ROI of the press-fit curve is 400 × 602 pixels. To convert the image of the press curve into a binary image for subsequent modeling, we first convert the color press image to grayscale and apply a Gaussian low-pass filter to image smoothing to suppr the high-frequency parts of the image. Then, we use grayscale 127 as the threshold va to binarize the smoothed image based on experimental experiences. Pixels with graysc values below 127 in the image are regarded as candidate objects for the press-fit cur We take half of the highest 255 grayscales as the binary threshold because, if the thresho value is set too high, over-segmentation will occur. This will allow more false candid objects to be misjudged as the press-fit curve. However, if the threshold value is set t low, it will cause under-segmentation, which leads to some parts of the press-fit cur being discarded. The initial result after binarization is shown in Figure 4c. In the postp cessing stage, if the area of these objects is too small or the position of an object is obviou not the press-fit curve on image space, such as the straight line above the curve, we w treat these as noise and delete them. The final result is shown in Figure 4d. For a descr tion of the deep Siamese modeling, please refer to the following subsection. Once evalu tion and verification of the predictive model are completed, it will be deployed to t enterprise private cloud for remote access by field operators. In the physical space, the primary focus is on developing technologies and syst operational interfaces. The proposed system was implemented by the following te niques. The web front end of this system was developed by AngularJS and other standa technologies, such as HTML, CSS, and JavaScript, to provide an operational interface on-site operators. Node.js was chosen to build a web server, and RESTful APIs were c ated to connect the front-end applications with the backend services. Figure 5 shows In the physical space, the primary focus is on developing technologies and system operational interfaces. The proposed system was implemented by the following techniques. The web front end of this system was developed by AngularJS and other standard technologies, such as HTML, CSS, and JavaScript, to provide an operational interface for on-site operators. Node.js was chosen to build a web server, and RESTful APIs were created to connect the front-end applications with the backend services. Figure 5 shows the query page of the inspection results for the proposed intelligent system. Through this system, operators can upload the original image generated by the NC wheelset assembly machine. Then, the inspection result determined by the prediction model will be sent to the front-end webpage to assist on-site operators in decision-making. query page of the inspection results for the proposed intelligent system. Through this system, operators can upload the original image generated by the NC wheelset assembly machine. Then, the inspection result determined by the prediction model will be sent to the front-end webpage to assist on-site operators in decision-making.

Deep Learning Architecture Used in This Study
The layout of the used SNN architecture, which is mainly based on the well-known ResNet-50 [40], is shown in Figure 6. It receives a pair of images with an image size of 400 × 602 pixels as input in the first layer. Each image is then processed through ResNet-50 network architecture. While referring to the network architecture of ResNet-50, we will extract 2048 feature maps after applying 2048 filters at the last convolutional layer. To obtain the length-2048 difference vector, we first obtain 2048 × 1 feature vectors from each twin by applying the global-max pooling operation to the output feature maps from the previous layer, and then calculate Euclidian distance between the twin difference feature vectors. Finally, the neurons in the distance layer are fully connected with one unit and passed to the sigmoid function to measure the degree of similarity between the two input images of press-fit curves.
In general, the loss function may suffer from receiving lots of easily classified samples during the training stage. Whether positive or negative, a sample is called an easy sample when the model distinguishes it as successfully dominated by its high prior probability. This means that the model tends to be dominated by easy samples that contribute little to gradient computing, leading to the model finally being failed by the imbalance [42]. To suppress the impact of the above problem, Lin et al. [43] presented focal loss (FL). FS has proved effective in related research; therefore, this study adopted FS as a loss evaluation function.
To address the class imbalance, a modulating factor (1 − ( , )) was added to the cross-entropy loss function, as defined in Formula (1), with a tunable focusing parameter ≥ 0. Easily classified samples will be down-weighted due to the low values of the modulating factor [42]. The FL can be defined as [43]: where ∈ [0,1] is a weighting factor. To achieve the best experimental results, we set the parameters and to 0.25 and 2, respectively. In addition, we set the learning rate and batch size to 0.00006 and 8, respectively, in the hyperparameter tuning setting during the training stage.

Deep Learning Architecture Used in This Study
The layout of the used SNN architecture, which is mainly based on the well-known ResNet-50 [40], is shown in Figure 6. It receives a pair of images with an image size of 400 × 602 pixels as input in the first layer. Each image is then processed through ResNet-50 network architecture. While referring to the network architecture of ResNet-50, we will extract 2048 feature maps after applying 2048 filters at the last convolutional layer. To obtain the length-2048 difference vector, we first obtain 2048 × 1 feature vectors from each twin by applying the global-max pooling operation to the output feature maps from the previous layer, and then calculate Euclidian distance between the twin difference feature vectors. Finally, the neurons in the distance layer are fully connected with one unit and passed to the sigmoid function to measure the degree of similarity between the two input images of press-fit curves.
Appl. Sci. 2021, 11, x FOR PEER REVIEW 8 of 11 Figure 6. The architecture of the SNN is used in this study. The two images from the same class are called a positive pair, and those from the different classes are called a negative pair. This network is mainly based on the well-known ResNet-50.

Experimental Results and Analysis
In this experiment, this study collected 4754 images of the press-fit curves; among In general, the loss function may suffer from receiving lots of easily classified samples during the training stage. Whether positive or negative, a sample is called an easy sample when the model distinguishes it as successfully dominated by its high prior probability. This means that the model tends to be dominated by easy samples that contribute little to gradient computing, leading to the model finally being failed by the imbalance [42]. To suppress the impact of the above problem, Lin et al. [43] presented focal loss (FL). FS has proved effective in related research; therefore, this study adopted FS as a loss evaluation function.
To address the class imbalance, a modulating factor 1 − p x i , x j γ was added to the cross-entropy loss function, as defined in Formula (1), with a tunable focusing parameter γ ≥ 0. Easily classified samples will be down-weighted due to the low values of the modulating factor [42]. The FL can be defined as [43]: where α ∈ [0, 1] is a weighting factor. To achieve the best experimental results, we set the parameters α and γ to 0.25 and 2, respectively. In addition, we set the learning rate and batch size to 0.00006 and 8, respectively, in the hyperparameter tuning setting during the training stage.

Experimental Results and Analysis
In this experiment, this study collected 4754 images of the press-fit curves; among them, the proportion of unqualified images accounts for 2.3% of the total. All the images in this experiment are divided into two categories, qualified or unqualified, by senior on-site inspectors based on their experience. Table 1 shows the number of qualified and unqualified images for the press-fit curve used in training, validating, and testing, respectively. To quantitatively evaluate the overall performance of the proposed intelligent system to recognize press-fit curves in railway wheelset press-fit assembly, the following three measurements are adopted: accuracy, precision, and recall. Let TQ, TU, FQ, and FU represent "true qualified", "true unqualified", "false qualified", and "false unqualified", respectively, in the confusion matrix, as shown in Table 2. The test result of a qualified condition is either qualified (TQ) or unqualified (FQ), while the test result of an unqualified condition is either qualified (FU) or unqualified (TU). The accuracy is the proportion of both true qualified and true unqualified for pressfit curves in all test results. It is the overall correct classification rate of all test results. Precision, also known as precision rate, is the proportion of all qualified test results that are truly qualified press-fit curves. The recall represents the probability of classifying the press-fit curve as a qualified condition if it is truly qualified. The definitions for the above measurements are listed below [44]: Accuracy = (TQ + TU)/(TQ + TU + FQ + FU), Recall = TQ/(TQ + FU).
For the classification results of the 3891 press-fit curves in the testing dataset, the TQ, TU, FQ, and FU are 3255, 28, 0, and 608, respectively. The accuracy is 84.37% ((3255 + 28)/3891), and the results of precision and recall are 100% (3255/(3255 + 0)) and 84.26% (3255/(3255 + 608)), respectively. Among the three efficiency evaluation indicators, we can see that the precision performance is quite outstanding. The proposed intelligent system can successfully detect all unqualified images. This is a significant indicator of railway equipment manufacturers that strictly control abnormal events during the wheelset production process. The proposed method is rigorous in identifying whether the press-fit curve is unqualified, to avoid the possibility of false qualified (FQ). However, this will increase the probability of false unqualified (FU) results, leading to decreased accuracy and recall. In addition, for safety reasons, if operators want to recheck press-fit images manually, they only need to check the images classified as unqualified by the proposed system, thereby reducing the effort of manual inspection.

Conclusions
The press-fit curve is an important indicator of the quality of the railway wheelset press-fitting. However, effectively improving human misinterpretation errors during manual inspection has always been a problem, which railway equipment manufacturers urgently need to solve. To this end, this study developed an intelligent railway wheelset inspection system to assist the operators in objectively and effectively judging the press-fit curves.
In practice, the press-fit quality of most wheelsets in the wheelset production process can meet the standard using NC wheelset assembly machines. Abnormal events account for a minimal number of the total. Although the number of unqualified abnormal events is rare, they are likely to cause major traffic accidents when they occur. Therefore, the proposed system must be robust in detecting all unqualified samples. As abnormal samples is rare, the number of unqualified images of the press-fit curve that can be collected in this experiment is much lower than the number of qualified images. In order to suppress the impact of class imbalance on classification performance, this study applied a deep Siamese network with focal loss to establish a prediction model of press-fit quality. The experimental results show that the precision measurement of the proposed system can reach 100% for the testing dataset, which contains 3863 qualified and 28 unqualified images of the press-fit curves. The proposed intelligent system can successfully detect all unqualified cases. The currently proposed system was gradually launched and tested in the manufacturing site, which is sufficient to prove that the system architecture, method, and technologies used in the implementation process proposed by this study can be provided as an essential reference for relevant researches and applications. Our results can also provide a successful case of paradigm shift from traditional manufacturing to digital manufacturing. In future work, we will continue to collect more images of press-fit curves and improve the two performance indicators of accuracy and recall through more training and testing.

Conflicts of Interest:
The authors declare no conflict of interest.