Detection and Severity Evaluation of Combined Rail Defects Using Deep Learning

Abstract: Various techniques have been developed to detect railway defects. One of the popular techniques is machine learning. This unprecedented study applies deep learning, which is a branch of machine learning techniques, to detect and evaluate the severity of combined rail defects. The combined defects in the study are settlement and dipped joint. Features used to detect and evaluate the severity of combined defects are axle box accelerations simulated using a verified rolling stock dynamic behavior simulation called D-Track. A total of 1650 simulations are run to generate numerical data. Deep learning techniques used in the study are deep neural network (DNN), convolutional neural network (CNN), and recurrent neural network (RNN). Simulated data are used in two ways: simplified data and raw data. Simplified data are used to develop the DNN model, while raw data are used to develop the CNN and RNN models. For simplified data, features are extracted from raw data, which are the weight of rolling stock, the speed of rolling stock, and three peak and bottom accelerations from two wheels of rolling stock. In total, there are 14 features used as simplified data for developing the DNN model. For raw data, time-domain accelerations are used directly to develop the CNN and RNN models without processing and data extraction. Hyperparameter tuning is performed to ensure that the performance of each model is optimized. Grid search is used for performing hyperparameter tuning. To detect the combined defects, the study proposes two approaches. The first approach uses one model to detect settlement and dipped joint, and the second approach uses two models to detect settlement and dipped joint separately. The results show that the CNN models of both approaches provide the same accuracy of 99%, so one model is good enough to detect settlement and dipped joint. To evaluate the severity of the combined defects, the study applies classification and regression concepts. Classification is used to evaluate the severity by categorizing defects into light, medium, and severe classes, and regression is used to estimate the size of defects. From the study, the CNN model is suitable for evaluating dipped joint severity with an accuracy of 84% and mean absolute error (MAE) of 1.25 mm, and the RNN model is suitable for evaluating settlement severity with an accuracy of 99% and mean absolute error (MAE) of 1.58 mm.


Introduction
The railway is a transportation mode that plays an important role nowadays because it is environmentally friendly, energy-saving, and safe. Therefore, the demand for the railway is increasing. However, the investment in railway projects is high, so the load and speed of rolling stocks are increased to meet the increasing demand for railway transportation. The high load and speed of rolling stocks deteriorate the railway infrastructure, and railway defects take place when the deterioration reaches a certain level. Railway defects can emerge as a single defect or combined defects. Combined defects are more complicated and more difficult to detect and evaluate than a single defect. Therefore, a tool to detect and evaluate the severity of combined defects is necessary to improve the railway maintenance capability.
Railway defects can be inspected using a traditional technique such as visual inspection [1] or more advanced techniques such as ultrasonic [2], magnetic induction [3], acoustic emission [4][5][6], and eddy current [7], which are non-destructive testing (NDT). The benefits of NDT are less waste, less downtime, accident prevention, advanced identification, comprehensive testing, and increased reliability [8]. Machine learning is an NDT technique that is popular at present because it is fast and cost-saving, and its performance has been shown to be satisfactory. Many machine learning techniques can be used to develop models to detect and evaluate defects. This study applies deep learning techniques to develop models because deep learning techniques have been shown to tend to provide better performance if they are constructed properly [9].
This unprecedented study aims to apply deep learning techniques, namely, deep neural network (DNN), convolutional neural network (CNN), and recurrent neural network (RNN), to detect and evaluate the severity of combined defects consisting of settlement and dipped joint using axle box accelerations (ABA) as features. It is noted that the dipped joint and settlement in this study are simplified to geometrical irregularities. In fact, they can be related to the void irregularity, which is more complicated, and further study is needed to investigate their dynamic behavior. ABA is used to detect and evaluate the severity of combined defects because it is one of the NDTs that requires a low installation cost, and it can be measured continuously when the rolling stock is operated. ABA can be measured by installing an axle box acceleration sensor on the rolling stock. The measurement can be monitored in real time or at the end of the day and fed into the machine learning models to detect and evaluate the severity of defects. This process is an inverse analysis based on the fact that defects will affect the ABA differently depending on the type of defect. This approach is fast and cost-saving, and it monitors the track condition all the time. A verified simulation called D-Track is used to generate numerical data for machine learning model development.
The expected contributions of the study are that the developed models can detect and evaluate the severity of combined defects which will improve the railway maintenance capability in terms of cost, time, and reliability.

Literature Review
Machine learning is a branch of "the study and design of intelligent agents" to achieve a defined purpose [10]. Nowadays, machine learning is widely used in various areas such as computer science, psychology, medical, neuroscience, cognitive science, linguistics, engineering, etc. Machine learning can reduce human error, reduce human risk in some situations such as railway inspection, continue working for a long time especially repetitive tasks, work fast, and deal with complicated tasks [11].
Machine learning was adopted in the railway industry in different aspects. Huang et al. [12] used a random forest and support vector machine to control the speed profile and calculate the energy consumption of rolling stocks. They presented that the developed approaches had the error of energy consumption calculation of less than 0.1 kWh and could reduce the energy consumption by 2.84%. Alawad et al. [13] applied a decision tree to analyze fatal accidents. Sysyn et al. [14] applied the computer vision concept to predict contact fatigue on crossings. However, they faced a long processing time issue and claimed that deep learning could resolve this issue.
For railway defect detection, ABA was widely used. Núñez et al. [15] applied ABA to detect squats and corrugations. The case study was from the Dutch Railway. They achieved a detection accuracy of higher than 85%. Then, Li et al. [16] applied the same concept to detect light squats. They could detect defects up to 85% using ABA, and many studies supported this finding [17,18]. Their findings demonstrated that ABA has the potential to be used to detect railway surface defects. Song et al. [19] found the relationship between ABA and polygonized wheels under high-speed conditions. ABA was used to predict the degradation of railway crossings [20]. Other defects can be detected using the ABA as well, such as insulated rail joint [21], bolt tightness [22], and track geometry [23]. Machine learning techniques were also applied to detect railway defects. Using ABA as the input for machine learning could provide a satisfying outcome. Table 1 summarizes machine learning techniques used to detect railway defects and demonstrates the research gap in this area.
From Table 1, it can be seen that image processing is the popular technique that is used to detect defects. However, cameras need to be installed, and there are limitations about the light and quality of images. Combined defects have not been comprehensively studied because most studies considered each defect separately, and the same applies to the severity evaluation. This is the research gap that this study aims to fill by developing models to detect combined defects and evaluate the severity of defects using axle box accelerations (ABA). The outstanding benefits of using ABA are that it requires few additional installations, which saves cost, and that it offers continuity of data collection and speed of inspection.

Numerical Data Simulation and Characteristics
Machine learning models in this study are developed using numerical data simulated by D-Track. D-Track is a simulation used to simulate the dynamic behavior of wheel and rail in railway transportation. D-Track was developed by Cai [34] in 1996. Then, Steffens [35] developed the DARTS (Dynamic Analysis of Rail Track Structure) model and an interface for D-Track in 2005. He found that the accuracy of D-Track at that time was not satisfactory, because the simulated data and site data were significantly different. Then, D-Track was improved for more accuracy by Leong [36]. He found the causes for D-Track's accuracy issue, which included too-low calculated wheel-rail forces, unnecessary assumptions in D-Track, inaccurate sleeper pad reactions, and inaccurate calculation of the sleeper's bending moment. To address these issues, he improved both the interface of D-Track and its workflow to improve the performance of the simulation. After the improvement, the simulation's outcome was close to the site data with an error of less than 10%. He compared the simulated results with the field data collected between Melbourne and Geelong, Australia. The parameters used to compare were average wheel-rail contact force, shear force, average rail acceleration, and bending moment. He also compared results with other numerical simulations such as DARTS (Dynamic Analysis of Rail Track Structure), DIFF (Vehicle-Track Dynamic Analysis Model), NUCARS (New and Untried Car Analytic Regime Simulation), SUBTTI (Subgrade-Train-Track Interaction), and VIA (Vehicle Interaction with Track Analysis Model). He found that results from D-Track were correlated with those of other simulations. This study uses data simulated by D-Track as representatives of data for developing machine learning models to detect and evaluate the severity of combined defects, which are crucial to rail safety and predictive track maintenance [37][38][39][40][41][42][43].
To simulate the dynamic characteristics of the railway system using D-Track, various inputs need to be defined in the simulation such as track properties (stiffness, damping, sleepers, etc.), vehicle properties (speed, weight, wheel radius, etc.), defect properties, and defect locations. Detailed variables are also required to define each category. Different outputs can be reported using D-Track such as accelerations, forces, pressures, bending moments, shear forces, and displacements of each wheel and track component. As mentioned, this study uses ABA, or axle box acceleration, from the simulation to develop machine learning models because it can be measured easily in practice.
In terms of simulation inputs, a summary of parameters is shown in Table 2. Table 2 shows the 1650 simulations run to simulate data. Examples of ABA are shown in Figure 1. The speed and weight of the rolling stock are 20 km/h and 40 tons respectively. Figure 1a presents ABA when the rail is free from defect, and Figure 1b presents ABA when the rail has the 2.5 mm dipped joint and 20 mm short settlement, as shown in Figure 2. These two figures show that the ABAs from the defect-free rail and the rail with defects are significantly different and easy to categorize. However, when the sizes of combined defects vary and the defects are combined, it will be more complicated to categorize the type and size of defect; machine learning plays an important role for this purpose. From Figure 1b, the ABA has peak and bottom values, which will be used as simplified features. Figure 1 presents only one ABA from a wheel. From the simulation, ABAs from two wheels are extracted and used as features.

Parameters: Value
Sizes of dipped joint: 0-10 mm (the length of the dipped joint is 1000 mm)
Sizes of settlement: 0-100 mm (the lengths of the settlement are 3000 and 10,000 mm for short and long settlement, respectively)

The ABA is used in two ways as mentioned: simplified data and raw data. Simplified data are used to develop the DNN model, and raw data are used to develop the CNN and RNN models. For simplified data, 14 features are extracted from the simulations, which are the weight and speed of a rolling stock, three peak ABA from two wheels, and three bottom ABA from two wheels. In the case of simplified data, the ABA is the result from simulations, but the weight and speed of a rolling stock are extracted before the simulation. This procedure is done under the assumption that the weight and speed are known from on-board sensors. The reason for using simplified data for the DNN model is that it is more suitable than using raw data. The authors have tried feeding the raw data into the DNN model where the number of input nodes is equal to the number of values. However, the performance is not satisfying. For raw data, ABAs from two wheels are fed into the models without processing and other features.
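The 14 simplified features described above (weight, speed, and three peak plus three bottom accelerations for each of two wheels) can be sketched as follows. This is a minimal illustration, not the authors' VBA procedure: the source does not say how the three peaks and bottoms are chosen, so the sketch assumes they are the three largest local maxima and the three smallest local minima, and it pads with 0.0 when a signal has fewer than three extrema. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def local_extrema(signal: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return (local maxima, local minima) of a sampled acceleration signal."""
    peaks, bottoms = [], []
    for i in range(1, len(signal) - 1):
        prev, cur, nxt = signal[i - 1], signal[i], signal[i + 1]
        if cur > prev and cur >= nxt:
            peaks.append(cur)
        elif cur < prev and cur <= nxt:
            bottoms.append(cur)
    return peaks, bottoms


def three_extremes(values: List[float], largest: bool) -> List[float]:
    """Pick the three most extreme values; pad with 0.0 (assumption) if fewer exist."""
    ordered = sorted(values, reverse=largest)[:3]
    return ordered + [0.0] * (3 - len(ordered))


def simplified_features(weight_t: float, speed_kmh: float,
                        aba_wheel_1: Sequence[float],
                        aba_wheel_2: Sequence[float]) -> List[float]:
    """Build the 14-value feature vector: 2 vehicle values + 6 values per wheel."""
    features = [weight_t, speed_kmh]
    for aba in (aba_wheel_1, aba_wheel_2):
        peaks, bottoms = local_extrema(aba)
        features += three_extremes(peaks, largest=True)
        features += three_extremes(bottoms, largest=False)
    return features
```

The ordering of the values inside the vector is also an assumption; any fixed ordering works as long as it is the same for every sample.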
To process simplified data and arrange raw data, Visual Basic for Applications (VBA) is employed. Fourteen features are extracted from the simulation reports and combined to create the dataset alongside raw data from each simulation. In this study, the total number of simulations is 1650, so the number of samples is the same. Each sample is labeled in accordance with the classes of each model. For defect severity classification, the labels depend on the size of the defect, as shown in Table 3.

AI Model Development
DNN, CNN, and RNN are employed to develop machine learning models for detecting and evaluating the severity of combined defects. For dipped joint and settlement detection, this study proposes two approaches. The first approach uses a single model to detect both dipped joint and settlement. The second approach uses two independent models to detect dipped joint and settlement separately. This is to investigate whether a model has better performance for a more specific task. Therefore, the first approach categorizes samples into four classes, namely, class 0: defect-free, class 1: dipped joint, class 2: settlement, and class 3: dipped joint and settlement. For the second approach, two models are used to detect dipped joint and settlement separately, so the classes are binary: defect and no defect.
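The two labeling schemes can be illustrated with a short sketch. It assumes that a defect is present whenever its simulated size is greater than zero, which the source implies through the size ranges starting at 0 but does not state explicitly. The function names are hypothetical.

```python
from typing import Tuple


def combined_label(dipped_joint_mm: float, settlement_mm: float) -> int:
    """First approach: one label out of four classes.

    0: defect-free, 1: dipped joint, 2: settlement, 3: dipped joint and settlement.
    """
    has_dipped_joint = dipped_joint_mm > 0   # assumption: size 0 means no defect
    has_settlement = settlement_mm > 0
    return int(has_dipped_joint) + 2 * int(has_settlement)


def separate_labels(dipped_joint_mm: float, settlement_mm: float) -> Tuple[int, int]:
    """Second approach: one binary label per defect type (1: defect, 0: no defect)."""
    return int(dipped_joint_mm > 0), int(settlement_mm > 0)
```

For example, a sample with a 2.5 mm dipped joint and a 20 mm settlement gets class 3 in the first approach and the pair (1, 1) in the second.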
For defect severity classification, samples are labeled as shown in Table 3. Models for classifying the severity of dipped joint and settlement are developed independently. It is noted that the second approach applies two models to detect each defect separately, so the labels shown in Table 3 are dependent on the models. That means that label 0 in the dipped joint severity classification model is different from label 0 in the settlement severity classification model. For defect severity regression, the models are different because they are regression models in which the labels are real numbers. As with defect severity classification, two models are developed for dipped joint and settlement severity evaluation.
The workflow of the machine learning models for detecting and evaluating combined defects is shown in Figure 3. The features used to develop the DNN model consist of 14 features, as mentioned in the previous section. For CNN and RNN, two sets of raw data from two wheels' ABA are used as features. The total number of values is 6695 for each wheel.
All models are tuned by hyperparameter tuning to ensure that all models provide the best performance. The detail is presented in the following section. In the training, samples are split using the proportion of 70/30. The performance of developed models is evaluated using accuracy in the case of classification and mean absolute error (MAE) in the case of regression. The models with the highest accuracy and the lowest MAE will be selected for further application.
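The 70/30 split and the two evaluation metrics can be sketched in plain Python. The source does not say whether the split is random, stratified, or seeded, so the shuffled split with a fixed seed below is an assumption, and the function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_70_30(samples: Sequence, seed: int = 0) -> Tuple[List, List]:
    """Shuffle the samples (assumed random split) and cut them 70/30."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    cut = int(round(0.7 * len(samples)))
    train = [samples[i] for i in indices[:cut]]
    test = [samples[i] for i in indices[cut:]]
    return train, test


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted class equals the actual class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between actual and predicted defect size."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

With the 1650 samples of this study, such a split gives 1155 training samples and 495 testing samples.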

Hyperparameter Tuning
Some parameters of the models, the hyperparameters, are not learned during training and must be set beforehand. Hyperparameter tuning is conducted to improve the performance of the models and to ensure that they provide the best performance. In this study, a grid search is used to tune hyperparameters. The list of tuned hyperparameters of each model is shown in Table 4.
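A grid search evaluates every combination of candidate hyperparameter values and keeps the best one. The sketch below shows the idea only. The hyperparameter names and candidate values in HYPOTHETICAL_GRID are placeholders, not the contents of Table 4, and toy_evaluate is a stand-in for training a model and measuring its accuracy on held-out data.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Hypothetical search space; the actual tuned hyperparameters are listed in Table 4.
HYPOTHETICAL_GRID: Dict[str, List] = {
    "learning_rate": [0.001, 0.01],
    "batch_size": [32, 64],
    "hidden_units": [32, 64, 128],
}


def grid_search(param_grid: Dict[str, List],
                evaluate: Callable[[Dict], float]) -> Tuple[Dict, float]:
    """Try every combination in the grid and return the best (params, score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # higher is better
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params: Dict) -> float:
    """Stand-in for 'train the model, return its accuracy' (not a real model)."""
    return (1.0
            - abs(params["learning_rate"] - 0.01)
            - abs(params["hidden_units"] - 64) / 1000
            - abs(params["batch_size"] - 32) / 10000)
```

For a regression model, the same loop can be used by returning the negative MAE from the evaluation function, so that a lower error counts as a higher score.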

Results and Discussion
This section presents the results of model development and discusses them by separating them into two topics, combined defect detection and combined defect severity evaluation. For combined defect detection, two approaches are applied as mentioned in the previous section. The first approach is developing a model to detect dipped joint and settlement, and the second approach is developing two models to detect dipped joint and settlement separately. Two approaches are compared to test the hypothesis of whether two models perform better than a single model for detecting combined defects.
The combined defect severity evaluation is presented in two parts: severity classification and severity regression. The classification is used to classify the severity of combined defects into groups as shown in Table 3. The regression is used to predict the size of combined defects. The performance of models is evaluated using the accuracy or MAE depending on the models. Three deep learning techniques are used, which are DNN, CNN, and RNN. The detail is presented as follows.

One Model for Detecting Both Dipped Joint and Settlement
There are four classes in this case, namely, no defect, dipped joint, settlement, and dipped joint and settlement. The performance of each model is presented in Table 5.
From Table 5, the accuracy of the CNN model is the highest, followed by DNN and RNN, respectively. Surprisingly, the accuracy of the CNN model is almost 1.00; however, the accuracy of the RNN model is the worst, although both models use raw data as features. The DNN model performs quite well, although it does not perform as well as the CNN model and uses simplified data as features. From this, it can be concluded that using raw data does not guarantee higher accuracy than using simplified data. The RNN model has the lowest accuracy, from which it can be assumed that the technique is not suitable for classification in this condition. This is because the RNN model performs well when it deals with time-series data in which the sequence of the data is significant. In this situation, successive values in the data are not strongly related to each other. Therefore, the RNN model performs worse than the other models. Moreover, training the RNN model takes the longest time compared to the DNN and CNN models. From the results, the CNN model is the best model for detecting combined defects in this approach. The tuned hyperparameters of the CNN model are shown in Table 6.

Two Models for Detecting Dipped Joint and Settlement Separately
This approach is to test whether the model has better performance if there are fewer classes to predict. Models are developed to detect dipped joint and settlement separately. The performance of each model is presented in Table 7.
In Table 7, the overall accuracy is calculated by multiplying the accuracies of the best models on dipped joint and settlement detection. The CNN model also has the best accuracy of 0.99. The overall performance of the models accords with the first approach, in which the CNN model has the highest accuracy followed by the DNN and RNN models. However, it is worth noting that the RNN model performs better than the DNN model in settlement detection. Compared to the first approach, it can be seen that the performance of models is improved when the number of classes is lower. However, the CNN model's accuracy does not change. This might be because the accuracy of the CNN model is already high and there is no room for improvement. Although the performance of models can be improved by reducing the number of classes, the model developed in the first approach is good enough to detect combined defects. The CNN models from the two approaches perform the best and have the same accuracy of 0.99. The tuned hyperparameters of the CNN models are shown in Table 8.

Combined Defect Severity Evaluation
This section presents the results from model development to evaluate the combined defect severity after the defects are detected. To evaluate the severity, this study presents models to classify the severity and estimate the size of defects. In this part, dipped joint and settlement are considered separately, because the authors tried developing models to consider them together and found that the accuracy was unsatisfactory due to the high number of classes to predict. Therefore, considering them separately is the better option.

Severity Classification
There are three classes to classify the severity of dipped joint and settlement, which are shown in Table 3. The accuracy of the classification of each model is presented in Table 9, and the confusion matrices are presented in Tables 10 and 11, respectively.
[Confusion matrix fragment recovered from the source: Actual Class 2: 1, 9, 189]
Table 11. Confusion matrix of settlement severity classification from the recurrent neural network (RNN) model.
From Table 9, the CNN model is the best model for classifying the severity of dipped joint with an accuracy of 0.84, while the accuracy of the DNN and RNN models is not satisfactory. The RNN model still performs worst for classifying the dipped joint severity. However, it is surprising that the RNN model has the highest accuracy in classifying the settlement severity with an accuracy of 0.99. This finding conforms to the settlement detection results, in which the RNN model tends to perform well when dealing with the settlement. Therefore, the total accuracy of classifying combined defect severity is calculated from the accuracy of the CNN model on classifying the dipped joint severity and the accuracy of the RNN model on classifying the settlement severity, which equals 0.83. The tuned hyperparameters of each model are shown in Table 12.
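The total accuracy of 0.83 follows from multiplying the two reported accuracies, 0.84 for the CNN model on dipped joint severity and 0.99 for the RNN model on settlement severity. A minimal sketch of that arithmetic is given below; the function name is hypothetical. Note that the product equals the share of samples for which both models are correct only if their errors are independent, an assumption the source does not state.

```python
def combined_accuracy(*accuracies: float) -> float:
    """Multiply per-model accuracies into one overall accuracy."""
    result = 1.0
    for value in accuracies:
        result *= value
    return result


total = combined_accuracy(0.84, 0.99)  # CNN (dipped joint) x RNN (settlement)
print(round(total, 2))  # prints 0.83
```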

Severity Regression
Models developed in this section are different from others because they are regression models. The output layer does not predict the class of data but a continuous value. As mentioned, the performance of each model is evaluated using MAE, which is straightforward to interpret compared to other indicators. The size of the defect is not grouped into classes but is used directly as the label. The performance of the severity regression or estimation is shown in Table 13. The plots between actual data and prediction are shown in Figures 4 and 5 for dipped joint and settlement, respectively.
From Table 13 and Figures 4 and 5, the CNN model is the best model for estimating the size of the dipped joint with an MAE of 1.25 mm, while the MAEs of the DNN and RNN models are higher, although the difference is small relative to the maximum size of 10 mm. The RNN model has the highest MAE. From the previous models, it can be concluded that the RNN model is not suitable for detecting and evaluating the dipped joint, which can be seen from the lowest performance in every aspect. For settlement, however, the RNN model again performs best and has the lowest MAE on estimating the size of settlement, which equals 1.58 mm. Compared to the maximum size of settlement used in this study (100 mm), the RNN model can estimate the size of settlement with very low error. This emphasizes the performance of the RNN model on detecting and evaluating the settlement. It can be concluded that the CNN model is the best model for estimating the dipped joint size, and the RNN model is the best model for estimating the settlement size, which conforms to the model performance on the severity classification. The tuned hyperparameters of each model are shown in Table 14.
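To put the two MAE values on a common scale, each can be divided by the maximum defect size used in the simulations (10 mm for dipped joint, 100 mm for settlement). This normalization is an illustration added here, not a metric reported in the study, and the function name is hypothetical.

```python
def relative_mae(mae_mm: float, max_size_mm: float) -> float:
    """MAE expressed as a fraction of the largest simulated defect size."""
    return mae_mm / max_size_mm


dipped_joint = relative_mae(1.25, 10.0)   # CNN model, dipped joint
settlement = relative_mae(1.58, 100.0)    # RNN model, settlement
print(f"{dipped_joint:.1%}, {settlement:.2%}")
```

On this scale the settlement estimate is about 1.6% of the size range, while the dipped joint estimate is 12.5% of its range, which is consistent with the lower classification accuracy for dipped joint severity.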

Conclusions
This study is the first to apply deep learning techniques to detect and evaluate the severity of combined defects in the railway infrastructure. Dipped joint and settlement are used as the case study of combined defects. The numerical data are simulated using D-Track, which is a verified simulation for studying the dynamic behavior of wheel and rail. Various parameters are used to create the diversity of data. In total, 1650 simulations are run. The output from the simulations that is used as features to develop the machine learning models is ABA from two wheels. ABA is used in two ways: simplified data and raw data. The DNN model uses the simplified data that consist of 14 features, while the CNN and RNN models use raw data. The data are split with the proportion of 70/30 into training data and testing data.
The models for detecting combined defects are developed using two approaches: a single model and two models for detecting combined defects. The study shows that using a single model is good enough to detect combined defects, where the best model is the CNN model with an accuracy of 0.99. To evaluate the severity, models are developed to classify the severity and estimate the size of defects. It is found that the CNN models have the best performance in classifying and estimating the dipped joint with the accuracy and MAE of 0.84 and 1.25 mm, respectively. However, the RNN models perform better in classifying and estimating the settlement with the accuracy and MAE of 0.99 and 1.58 mm, respectively.
This unprecedented study can be improved in several ways. Validation with site data would strengthen the reliability of the findings. The main difference between the simulated data and site data is that site data contain noise. Including other types of defects would also improve the capability of the models by increasing the variety of data. Other information is also worth trying as features for model development, to complement other sensors and measurements.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to confidentiality.