Supervised Machine Learning Approach for Detecting Missing Clamps in Rail Fastening System from Differential Eddy Current Measurements

Abstract: The rail fastening system forms an integral part of rail tracks, as it maintains the rail in a fixed position, upholding the track stability and track gauge. Hence, it becomes necessary to monitor its condition periodically to ensure safe and reliable operation of the railway. Inspection is normally carried out manually by trained operators or by employing 2-D visual inspection methods. However, these methods have drawbacks when visibility is minimal and are found to be expensive and time consuming. In a previous study, the authors proposed a train-based differential eddy current sensor system that uses the principle of electromagnetic induction for inspecting the railway fastening system and can overcome the above-mentioned challenges. The sensor system includes two individual differential eddy current sensors with driving field frequencies of 18 kHz and 27 kHz, respectively. This study analyses the performance of machine learning algorithms for detecting and analysing missing clamps within the fastening system, measured using the train-based differential eddy current sensor. The data required for the study were collected from field measurements carried out along a heavy haul railway line in the north of Sweden, using the train-based differential eddy current sensor system. Six classification algorithms were tested in this study, and the best performing model achieved a precision of 96.64% and a recall of 95.52%. The results from the study show that the performance of the machine learning algorithms improved when features from both driving channels were used simultaneously to represent the fasteners. The best performing algorithm also maintained a good balance between the precision and recall scores during the test stage.


Introduction
Rail transport has emerged as a significant mode of transportation that can relieve heavy road and air congestion, rising energy costs and carbon emissions. It is an effective mode of transportation that supports the economic and industrial expansion of a nation through the mobilization and transportation of people and commodities [1]. Rail freight transport and passenger traffic increased rapidly in Europe between 1990 and 2007: in the EU15 countries, rail freight ton-kilometres grew by 15% and passenger-kilometres by 28% over this period [2]. In Sweden, traffic on the railway network grew by an average of 1.1% annually between 1960 and 2010, and a minimum annual increase of 1% in traffic tonnage is anticipated up to 2050 [3]. The need to shift huge volumes of passenger and freight traffic to railways and the current state of the existing railway infrastructure are issues that require significant attention in the field of transportation [4]. One possible solution to meet the growing demand and improve rail performance could be capital expansion of the infrastructure, but this is a time-consuming and cost-intensive approach. Hence, in order to improve the capacity, availability and service quality of the existing infrastructure, an ideal solution would be to improve the maintenance and renewal (M&R) process.
The quality of the infrastructure and the utilization methods play a crucial role in determining the operational capacity of a given railway infrastructure [5]. The dependence between operational capacity and the condition of the infrastructure is a crucial aspect of railway infrastructure maintenance. When the railway infrastructure is in a good state with high quality, a higher operational capacity with a higher quality of service is achieved. As the operational capacity increases, the infrastructure is subjected to more traffic and load, which leads to deterioration of the infrastructure and deformation of its components. Therefore, more frequent maintenance and renewal are required. These demand track possession, which in turn reduces the operational capacity. The downtime arising from maintenance and renewal of networks is responsible for nearly half of all delays to passengers. To investigate the actual delay within Sweden, the records of the infrastructure manager were analysed. Tables 1 and 2 show the delay in hours over 10 years due to failure of components in track and in switches and crossings (S&C), respectively. The delay time used for this study only includes the downtime of service arising from the corrective maintenance employed to fix the fault or failure. The downtime arising from corrective maintenance is unplanned and can occur at any given point of time, having a consequential effect on traffic. On the other hand, preventive maintenance actions are planned, i.e., the track occupation is guaranteed without traffic interference. However, in a few instances, preventive maintenance can occupy the track longer than the allocated time frame, thereby leading to traffic interference or delay [6]; this is not included in the present study. The components considered for this study include only those that exhibit magnetic properties.
Averages of 572.7 h and 670.3 h of delay are incurred in Sweden yearly due to failure of components in track and in S&C, respectively. To avoid such delays and to ensure safe and reliable operation, tracks and components need to be inspected periodically.
Track inspection is a crucial task that has to be carried out periodically to prevent catastrophic accidents and control the condition of the railway infrastructure [7]. Traditionally, rail inspections are carried out by trained inspectors who walk along the track to search for visible defects and technical deviations. Such manual inspections are labour-intensive, slow and may include human errors, especially in tough winter conditions. This method is time-consuming and expensive for railroad companies, especially for long-term and large-scale development projects. Recent technological developments have seen automated inspection systems based on machine vision being utilized for track inspections. Automated rail inspection systems are composed of various functions, including gauge measurement, rail-surface defect detection, rail profile measurement and fastener defect detection [8]. Rail fastening systems are a crucial component of the rail infrastructure as they clamp the rail to the sleepers, maintaining the gauge, preventing longitudinal and transverse deviation of the rails from the sleepers, and preserving the designed geometry of the track. Failures of fasteners reduce the safety of train operations, increase wheel flange wear and may lead to catastrophic accidents [9]. In the past two decades, the application of automated machine vision for fastener inspection has grown significantly; however, the detection methods applied to these rail images have varied over time.
[Table 1 and Table 2: delay hours per year (2010–2019) due to failures of track components and S&C components, respectively.]

In 2007, Marino et al. [10] detected missing hexagonal-headed bolts from rail images using a multilayer perceptron classifier. For hook-shaped fasteners, Stella et al. [11] used a neural classifier to detect missing fasteners, employing wavelet transform and principal component analysis to preprocess the railway images. Yang et al. [12] used the direction field as a template of the fastener in the rail images and used linear discriminant analysis (LDA) for matching, to obtain the weight coefficient matrix. To model two types of fasteners, Ruvo et al. [13] used an error back-propagation algorithm on the rail images and implemented it on a graphical processing unit to achieve real-time performance. Ruvo et al. [14] also introduced an FPGA-based architecture employing the same algorithm on rail images. Xia et al. [15] and Rubinsztejn [16] used the AdaBoost algorithm for detecting fasteners from rail images. Li et al. [17] adopted image-processing techniques to detect fasteners and their components from the images obtained during visual inspections. H. Feng et al. [8] adopted a structure topic model (STM) to model fasteners and learn from the probabilistic representation of different components within rail images. H. Fan et al. [18] used the line local binary pattern (LLBP) on rail images to distinguish between normal and failed fasteners. Support vector machines (SVM) [19], Gabor filters [20] and edge detection [21] are other commonly employed techniques to detect fasteners from rail images. Recent advancements in image processing have enabled the use of deep learning [22,23] and R-CNN [24] to detect fasteners from rail images collected during automated visual inspection.
At present, fastener detection is carried out on images acquired during automated visual inspection of the track and its components. The task becomes complicated when the rail and its components are concealed by rust and dust. The presence of snow, stones and other debris, or heavy rain, can also hinder visual inspection and reduce the efficiency of detecting the rail and its components. A reliable, high-quality automated visual inspection system is relatively expensive and difficult to mount and maintain on an in-service train; because such systems are integrated into operations, they are subjected to brightness fluctuations and motion blurring during high-speed travel, which can reduce detection accuracy. In Sweden, around 298,080 euros was spent in 2014 alone to inspect two lines with a total track length of ca. 300 km, of which more than 75% was used to inspect track components that exhibit magnetic characteristics (rail fastening, weld joints, rail surface, etc.) [25]. With the increasing demand for safety and cost-effectiveness, maintenance managers are striving to cut these operation and maintenance costs through effective condition-based maintenance while assuring the quality and capacity of the rail services.
In earlier research, the authors proposed a train-based differential eddy current sensor [25] for fastener inspection that can overcome the major challenges mentioned above. Eddy current sensors are not affected by the presence of nonconductive materials in the sensor-to-target gap and can be used in dirty environments, water, oil, etc., where other inspection systems fail. The proposed inspection technique using a differential eddy current sensor was able to detect the fastener signature from a distance of 65 mm above the railhead. The individual fastener signatures were easily distinguishable in the 1-D time signal plots. The sensor uses two driving fields with frequencies of 18 kHz and 27 kHz, respectively, and the fastener signatures obtained from the two channels exhibited very good correlation. The sensor concept and working principle are explained in Section 2 of the previous publication [25]. The results presented in the previous publication were based on a study carried out on a short track section. However, when considering other track sections with an increased likelihood of disturbances, it was found that relying on a single feature could reduce detection accuracy. Hence, the purpose of this study was to develop a measurement system using the previously presented hardware solution in combination with a machine learning algorithm for detecting missing clamps within a rail fastening system. This study also compares different algorithms and evaluates their performance. The remainder of the paper is structured as follows. Section 2 elaborates the research methodology used for the study. The results and analysis are explained in Section 3 and the conclusions are discussed in Section 4.

Research Methodology
One of the most commonly observed faults in a rail fastening system is missing clamps. A missing clamp reduces the clamping force holding the rail on the sleeper. The track integrity is called into question as soon as clamps are missing from the fastening systems of consecutive sleepers, as this may lead to slipping, excessive gauge widening and low lateral resistance, which can further lead to a risk of derailment. As stated in Section 1, the goal of this study is to develop an automated system for detecting missing clamps within a rail fastening system, using signals generated and measured by a differential eddy current sensor. The system makes use of features extracted from the differential eddy current signals as input for a multiclass classification algorithm to differentiate intact clamps from one or two missing clamps in a fastening system. An outline of the methodology used to develop the automated system is depicted in Figure 1. The data required for this study were collected using the differential eddy current sensor system. Features from the signals recorded by the two channels were extracted with the aid of signal processing techniques. The feature sets from the individual channels, as well as the combined features from both channels, were used as inputs for the classification algorithms. The idea of using three sets of input was to compare whether a single channel or both channels combined perform better in missing clamp detection. The data were further processed before being fed as input to the six classification algorithms used in this study. The algorithms were optimised using cross validation by splitting the input data into training, validation and test sets. The goal was to identify which algorithm performed best for the missing clamp detection purpose. The steps involved in this study are elaborated further below.
Only a missing clamp is considered as a failure in the fastening system, as this is the most common failure mode of the fastener system found along the section of track examined for this study. This study makes use of a standard laptop (Dell Ultrabook) running Python 3.6 (with necessary packages such as NumPy [26], pandas [27] and scikit-learn [28]) and Spyder.

Data Collection
The data for this study were collected along the northern loop of the heavy haul line at Katterjåkk and Stordalen, close to the Sweden-Norway border. The differential eddy current sensor was mounted 65 mm above the railhead on a trolley system and was made to run along the track using a motor (see Figure 2a). The speed of the trolley system varied between 1.3 m/s and 2 m/s. The sensor system, which is used to measure one side of the track, consists of two differential eddy current sensors with driving fields of 18 kHz and 27 kHz, respectively, placed 20 cm apart in the travel direction. The above-mentioned carrier frequencies were selected as they fall within the rail norms. Each sensor consists of two differentially coupled pickup coils (P1 and P2) that are enclosed by the driving coil. The direct cross talk between the driver and the pickup coils within an individual sensor is cancelled, though not completely, by the differentially coupled pickup coils. The inbuilt cross-talk cancellation (CTC) function within the sensor system cancels out the small cross talk between the two carrier frequencies, as they have a common factor of 9 kHz. The entire unit is vacuum potted with epoxy resin to stabilize the sensor against vibrations and to reduce temperature drift. The output voltage from an individual sensor is the result of the cross-talk residue between the driving coil and the pickup coils, and the induction of eddy currents along the rail, which are linearly superimposed. The differentially coupled pickup coils (P1-P2) are sensitive only to changes in the eddy currents in the rail and its vicinity. The resulting voltage will be zero if there is no change in conductivity (σ), magnetic permeability (µ) or the geometric form of the measured material, e.g., an ideal rail with no clamps or any surface defects.
The sensor was powered by a 12 V, 62 Ah battery and the measured raw data were recorded using a laptop. The considered track sections consisted of concrete sleepers with E-clip fasteners. Both relatively healthy track and track sections with damage were considered for this study. Some of the visible damage on the railhead included squats, rail corrugation, cracks and head checks. Figure 2b depicts a fastening system with intact clamps, a fastening system with one missing clamp and a fastening system with two missing clamps. A controlled measurement sequence was carried out along the track section to obtain the dataset for the E-clip fastening system. A pattern of missing clamps was created along a measurement sequence, where clamps were removed from the outer side, the inner side and both sides at the 20th, 25th and 30th sleeper, respectively, from the starting position of the measurement (see Figure 3). This measurement sequence was carried out along various sections of the track. A fastening system with intact clamps was considered a healthy system, and fastening systems with one clamp missing (stage 1 fault) or two clamps missing (stage 2 fault) were considered faulty systems for this study.
A total of 2967 fastener signatures were used for this study, of which 2700 samples (91%) correspond to a healthy state (stage 0) with intact clamps, 168 samples (5.67%) correspond to faulty fasteners with one clamp missing (stage 1) and the remaining 99 samples (3.33%) represent a faulty state with both clamps missing (stage 2).

Feature Extraction
A number of signal processing methods were implemented before features were extracted from the raw signal pertaining to the individual fastener signatures. The eddy current (EC) signal had to be demodulated, filtered and rotated in order to extract information corresponding to the fastening system [25].

Demodulation and Resampling
The sensor signal was multiplied by its respective carrier (for each of the two channels) and low-pass filtered at 2 kHz to demodulate the signal and extract the baseband. The signal was then resampled from 215.52 kHz to 35.92 kHz.
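The demodulation and resampling steps can be sketched as follows. This is an illustrative reconstruction on a synthetic signal, not the authors' implementation; only the carrier frequency, filter cut-off and sampling rates are taken from the text, and the fastener-like 2 Hz envelope is a hypothetical stand-in.

```python
import numpy as np
from scipy.signal import butter, filtfilt, decimate

fs = 215_520.0          # original sampling rate, Hz (from the text)
fc = 18_000.0           # carrier (driving field) frequency, Hz
t = np.arange(0, 0.5, 1.0 / fs)

# Synthetic measurement: a slow fastener-like envelope modulated onto the carrier.
envelope = 1.0 + 0.5 * np.sin(2 * np.pi * 2.0 * t)   # 2 Hz "fastener" content
raw = envelope * np.cos(2 * np.pi * fc * t)

# Demodulate: multiply by the carrier, then low-pass at 2 kHz to keep the baseband.
mixed = raw * np.cos(2 * np.pi * fc * t)
b, a = butter(4, 2000.0 / (fs / 2))
baseband = filtfilt(b, a, mixed)      # ~envelope / 2 after mixing

# Resample 215.52 kHz -> 35.92 kHz (a factor of 6).
demodulated = decimate(baseband, 6)
fs_new = fs / 6
```

Mixing with the carrier shifts the fastener content to DC (at half amplitude), and the 2 kHz low-pass removes the double-frequency term before decimation.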

Filtering
The EC signal was further filtered with a 3 Hz low-pass filter, as the periodicity of the fasteners in the signal was found to be below 3 Hz for both channels. This was carried out to retrieve the maximum information pertaining to the fastener system and to attenuate other frequency components corresponding to noise and other ferromagnetic components.

Rotation of EC Signal
The demodulated and filtered fastener signatures were found to be shifted from the in-phase direction (real part). In order to retrieve maximum information and obtain a better visualisation, the complex EC signal was rotated such that the fastener signatures were projected along the in-phase direction. This, to an extent, aids in suppressing responses at other demodulation angles not pertaining to the fastening system. The EC signal was rotated by an angle θ such that the peak amplitude of the fastener signatures was maximised. The signals were rotated by the optimal angles (found in the previous study [25]) of 83° and 222°, respectively, for the two carrier frequencies.
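The rotation step can be illustrated with a short numerical sketch: a synthetic fastener-like pulse shifted off the in-phase axis is rotated back by searching for the angle that maximises the real-part peak. The 83° value mirrors the optimal angle reported above; the pulse shape is hypothetical.

```python
import numpy as np

# Synthetic fastener-like pulse, shifted off the in-phase axis by 83 degrees.
true_angle = np.deg2rad(83.0)
t = np.linspace(-1, 1, 1000)
pulse = np.exp(-t**2 / 0.01)          # real-valued "fastener signature"
ec = pulse * np.exp(1j * true_angle)  # complex EC signal, rotated away

# Search over candidate angles for the rotation that maximises the
# peak amplitude of the in-phase (real) component.
angles = np.deg2rad(np.arange(0.0, 360.0, 1.0))
peaks = [(ec * np.exp(-1j * a)).real.max() for a in angles]
best = angles[int(np.argmax(peaks))]

rotated = ec * np.exp(-1j * best)     # signature now lies along the real axis
```

The same grid search recovers the rotation angle regardless of the pulse shape, because the real part is maximised exactly when the rotation cancels the phase offset.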
The EC signal is not affected by the presence of nonconductive or nonmagnetic materials in the sensor-to-target gap. Disturbances arising from the presence of conductive and magnetic material in the sensor-to-target gap can be suppressed to an extent by the above-mentioned low-pass filtering and rotation of the EC signal. The low-pass filter was set to extract the fastener signatures and to remove other high-frequency components which could add to the energy content of the fastener signatures. The cut-off frequency of the filter depends on the speed of the sensor and must be adjusted accordingly. Different components have different geometric shapes and different values of magnetic permeability and electrical conductivity. Hence, they will occur at different angles from the in-phase direction compared to the fasteners. Rotation of the EC signal based on the fastener signature will thus, to a major extent, suppress information pertaining to other disturbances.
Four features per channel are extracted for each individual fastener, namely the peak-to-peak value, the RMS value, the magnitude of the fastener signature at the clamp frequency and the arc length of the complex signal. Three separate feature matrices are used as input for the classification in this study. The first feature matrix uses the four features obtained from the fastener signatures acquired from the 18 kHz channel. The four features obtained from the 27 kHz channel are used for the second feature matrix. The first and second feature matrices each have a dimension of 2967 × 4 (2967 samples and four features). The third matrix contains the combination of the 18 kHz and 27 kHz features, eight features in total representing a fastener signature; hence, it has a dimension of 2967 × 8 (2967 samples and eight features). The three feature matrices are tested to identify whether a single channel or both channels combined perform better in missing clamp detection.
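A minimal sketch of the four per-channel features, computed here on a synthetic complex signal standing in for a fastener signature. The helper name `fastener_features`, the window length and the 2 Hz clamp frequency are our illustrative choices, not values from the paper.

```python
import numpy as np

fs = 35920.0                      # post-resampling rate, Hz (from the text)
t = np.arange(0, 0.5, 1.0 / fs)
clamp_freq = 2.0                  # hypothetical clamp frequency, Hz
sig = np.sin(2 * np.pi * clamp_freq * t) + 0.1j * np.cos(2 * np.pi * clamp_freq * t)

def fastener_features(x, fs, clamp_freq):
    """Peak-to-peak, RMS, spectral magnitude at the clamp frequency,
    and arc length of the complex signal in the I/Q plane."""
    real = x.real
    peak_to_peak = real.max() - real.min()
    rms = np.sqrt(np.mean(real ** 2))
    # Single-sided amplitude spectrum, evaluated at the bin nearest clamp_freq.
    spec = np.fft.rfft(real) / real.size
    freqs = np.fft.rfftfreq(real.size, 1.0 / fs)
    mag_at_clamp = 2 * np.abs(spec[np.argmin(np.abs(freqs - clamp_freq))])
    # Arc length: summed point-to-point distances of the complex trajectory.
    arc_length = np.sum(np.abs(np.diff(x)))
    return np.array([peak_to_peak, rms, mag_at_clamp, arc_length])

features = fastener_features(sig, fs, clamp_freq)
```

For a unit-amplitude sine the sketch recovers a peak-to-peak of 2, an RMS of 1/√2 and a spectral magnitude of 1 at the clamp frequency, which serves as a quick sanity check of the implementation.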

Preprocessing
From the feature extraction step, a feature matrix is obtained containing the various features computed from the signal processing for each observation. To improve the performance of the classification algorithms, the feature matrix can be rescaled using either normalization or standardization when the features have different scales. Normalization rescales the data by squeezing the values into the range [0, 1]. This technique can be useful in cases where all parameters need to have the same positive scale, but normalized data can be sensitive to outliers. Standardization (or z-score normalization) rescales the data to have a mean (µ) of zero and a standard deviation (σ) of one. In this study, standardization was adopted, since the data have a Gaussian distribution and there is no need for a positive range in the feature matrix.
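The standardization step can be sketched with scikit-learn's `StandardScaler`; the feature values below are synthetic stand-ins for the extracted fastener features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix: three features on very different scales.
rng = np.random.default_rng(42)
X = rng.normal(loc=[5.0, 120.0, 0.3], scale=[2.0, 30.0, 0.05], size=(200, 3))

# Z-score standardization: each column gets mean 0 and standard deviation 1.
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
```

In practice the scaler is fitted on the training set only and then applied to the validation and test sets, so that no information leaks from the held-out data.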

ML Algorithm
The fastener detection in this study is a multiclass classification problem with the objective of classifying fastener signatures from the differential EC sensor into healthy fasteners with no clamps missing, fasteners with one clamp missing and fasteners with both clamps missing. There is a wide range of machine learning algorithms that can solve multiclass classification problems. Classification algorithms are usually evaluated based on classification accuracy [29]. However, for practical implementations, other key factors, such as running time efficiency for training, ease of implementation, parameter tuning time, prediction time and the ease of continuously updating the algorithm online, need to be considered. A set of widely popular algorithms was selected for this study and each was compared to determine the best-suited algorithm for this application. Six established classifiers were selected: Gaussian naive Bayes (GNB), support vector machines (SVM), k-nearest neighbours (k-NN), gradient boosting decision trees (GBDT), random forest (RF) and AdaBoost (AB). The naive Bayes classifier was used to establish a lower bound on classification performance on the fastener dataset and as a means of comparison for the other algorithms. The choice of the remaining algorithms was based on the performance, ease of implementation and speed exhibited by the classifiers during training, parameter tuning and prediction observed in the previous literature [29][30][31][32]. Brief explanations of each algorithm are presented in the following subsections. Features extracted from both channels were tested using these algorithms, both individually and combined.

Gaussian Naive Bayes (GNB)
The naive Bayes classifier is a statistical classification technique based on Bayes' theorem. Every feature variable is treated as independent in naive Bayes. This probabilistic classifier can be trained very efficiently for supervised classification and can be used in complex real-world situations. GNB is a type of naive Bayes method that assumes a Gaussian (normal) distribution of the feature values within a given class. The probability density of an observation x_i given a class c_k is computed as

p(x_i \mid c_k) = \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\left( -\frac{(x_i - \mu_k)^2}{2\sigma_k^2} \right)

where \mu_k is the mean of the values in x associated with class c_k and \sigma_k^2 is the Bessel-corrected variance of the values in x associated with class c_k.
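As a sketch, a GNB baseline of this kind can be fitted in a few lines with scikit-learn. The synthetic three-class data below only stands in for the fastener feature matrix; the class weights loosely mimic the 91/5.67/3.33% split of the dataset but are not the study's data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic imbalanced three-class stand-in (stage 0, 1 and 2).
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, weights=[0.9, 0.06, 0.04],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

gnb = GaussianNB().fit(X_tr, y_tr)   # per-class Gaussian fit of each feature
acc = gnb.score(X_te, y_te)
```

Because GNB only estimates a mean and variance per feature and class, training is essentially a single pass over the data, which is why it serves well as a cheap baseline.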

Support Vector Machines (SVM)
SVM was initially developed for binary classification problems and, due to various complexities, its extension to multiclass problems is not straightforward. The SVM algorithm creates a line or hyperplane that separates the data points into different classes. This line or hyperplane is called the decision boundary, and the elements of the input data that define the boundary are called support vectors. For SVM, the best hyperplane is the one that maximises the margins to both classes. In general, the larger these margins are, the lower the generalisation error of the classifier. The SVM algorithm uses a set of mathematical functions called kernels to take the data as input and transform it into the required form. A kernel function returns the inner product between two points in a suitable feature space. In general, it is used to transform a nonlinear decision surface into a linear equation in a higher-dimensional space.
To solve a multiclass classification problem, several binary SVM classifiers are combined. The most common approaches for combining binary classifiers are one vs. one (OvO), one vs. rest (OvR), directed acyclic graph (DAG) and error-correcting output coding [32]. For this study, the one vs. one method was selected, as it requires fewer training data vectors for each classifier and the memory required to create the kernel matrix is much smaller. For M classes, the OvO approach constructs one binary classifier for every pair of distinct classes, giving a total of M × (M − 1)/2 binary classifiers. The training samples are input to each classifier and the output from each classifier is a class label. The max-wins algorithm is used to combine these classifiers, and the class label that occurs most often is assigned to that data vector. A comprehensive explanation of multiclass classification using SVM is provided in [33].

k-Nearest Neighbour (k-NN)

k-NN is a nonparametric method in which the training data are stored and compared with unclassified data points, rather than a generic classification function being constructed.
For this reason, k-NN algorithms are often called instance-based learning algorithms or lazy-learning algorithms. k-NN predicts the class of a test point by the majority rule, i.e., prediction with the majority class of its k most similar training data points in the feature space. Due to its simplicity, easy implementation and ability to handle complex problems, the k-NN algorithm is widely used in various applications [34]. A comprehensive description of k-NN is given by Cover and Hart [35].

Random Forest (RF)

RF is a type of ensemble classifier built on the philosophy that a multitude of classifiers performs better than a single classifier. Each classifier is generated using random vectors sampled independently from the input vector, and each classifier contributes a single vote for assigning the most frequent class to the input vector [36]. The class is assigned based on the majority vote received. The outputs of the individual classifiers are averaged to improve the predictive accuracy and reduce the risk of overfitting the training dataset. Breiman [36] provides a comprehensive explanation of RF.
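The classifiers described above are all available in scikit-learn. The sketch below fits an explicit one-vs-one SVM, a k-NN and an RF on synthetic stand-in data, and confirms that OvO builds M × (M − 1)/2 = 3 binary classifiers for M = 3 classes. Note that scikit-learn's `SVC` already uses OvO internally; the `OneVsOneClassifier` wrapper is used here only to expose the binary estimators.

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic three-class stand-in for the fastener features.
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

ovo_svm = OneVsOneClassifier(SVC(kernel="rbf", C=1.0, gamma="scale")).fit(X, y)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# For M = 3 classes, OvO trains 3 * (3 - 1) / 2 = 3 binary classifiers.
n_binary = len(ovo_svm.estimators_)
```

The hyperparameter values shown (C, gamma, k, number of trees) are defaults for illustration; in the study they were selected by the grid search described below.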

AdaBoost (AB)
AdaBoost, like RF, is an ensemble classifier that combines weak classifiers to form a strong classifier. AB starts by finding a weak classifier and subsequently fits it to a subset of the training data to generate a new classifier [37]. The AB algorithm retrains iteratively, choosing the training subset based on the performance of the previous training round. AB assigns a weight to each training sample after each round, and a misclassified item is assigned a higher weight so that it appears with a higher probability in the training subset of the next classifier. A weight is also assigned to each classifier: a better performing classifier is assigned a higher weight so that it has more impact on the final output. For an input vector x of n features, with h_t(x) the output of the t-th weak classifier, the combined classifier is expressed as

H(x) = \mathrm{sign}\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right), \qquad \alpha_t = \frac{1}{2} \ln\frac{1 - e_t}{e_t}

where e_t is the error of the t-th weak classifier. The importance of the weak classifier becomes greater as its error becomes smaller. A comprehensive description of AB can be found in [38].
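A sketch of AB with scikit-learn's `AdaBoostClassifier` on synthetic stand-in data, using the library's default decision-stump weak learners; the data and settings are illustrative, not those of the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic three-class stand-in for the fastener features.
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# Boost up to 100 weak learners; fitting may stop early if a round
# achieves zero (or worse-than-chance) weighted error.
ab = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)
train_acc = ab.score(X, y)
```

Each fitted weak learner in `ab.estimators_` corresponds to one h_t in the combined classifier above, with its weight stored in `ab.estimator_weights_`.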

Gradient Boosting Decision Tree (GBDT)
GBDT makes a prediction by training an ensemble of weak decision trees in a gradual, additive and sequential manner [37]. GBDT identifies the shortcomings of the weak classifiers by using gradients of the loss function, unlike AB, which uses highly weighted data points. The loss function indicates how well the model's coefficients fit the underlying data. GBDT improves the prediction of the ensemble via incremental minimisation of errors in successive iterations of new decision tree construction [39]. Friedman [40] provides a generic form of GBDT as

F(x) = \sum_{j=1}^{J} \beta_j h_j(x)

where \beta_j is the coefficient calculated by the gradient boosting algorithm and h_j(x) is the individual decision tree generated in each sequence. Friedman [39] and Schapire [37] give comprehensive descriptions of the GBDT algorithm.
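A corresponding GBDT sketch with scikit-learn's `GradientBoostingClassifier` on synthetic stand-in data; the tree depth and learning rate shown are common defaults for illustration, not the study's tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic three-class stand-in for the fastener features.
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# 100 shallow trees added sequentially, each fitted to the gradient
# of the loss of the current ensemble.
gbdt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                  max_depth=3, random_state=0).fit(X, y)
train_acc = gbdt.score(X, y)
```

The learning rate scales each tree's contribution (the β_j in the formula above); smaller values typically require more trees but generalise better.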

Cross Validation
To avoid the problems of underfitting or overfitting that can result from a simple holdout validation, the dataset should be split into a training set, a validation set and a test set to judge how the algorithm will perform on new external data. In this case, the training set is used to train the model, evaluation is performed on the validation set to minimize bias and variance, and a final evaluation is performed on the test set to determine whether the algorithm is underfitting or overfitting. However, partitioning the available data into three distinct sets drastically reduces the number of samples available for training the model. As a result, cross validation (CV) was used to evaluate our machine learning models with limited data samples. The test set still had to be held out, but a separate validation set was not needed, since CV handles validation by itself within the training set. The main principle behind CV techniques such as k-fold is to split the training set into k smaller sets and to train the model on k − 1 of them. The model is validated on the remaining part of the data to compute the performance measure. This process is repeated k times, so that each fold serves as the validation group once, and the performance measure is the average over the k folds. In our multiclass application, the f1-macro was adopted as the performance measure. It is defined as the average over all classes of the harmonic mean of precision and recall. The f1-macro was calculated for each fold and averaged to obtain the final performance measure for a fixed parameter set before tuning. The dataset was split into a training set (70%) and a test set (30%), and the cross validation was carried out within the training set.
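The 70/30 split with k-fold CV and the f1-macro score can be sketched as follows. The choice k = 5 and the classifier are our illustrative assumptions (the paper does not state k here), and the imbalanced synthetic data stand in for the fastener features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Imbalanced synthetic stand-in data.
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, weights=[0.9, 0.06, 0.04],
                           random_state=0)

# 70/30 train/test split; CV then runs only inside the training set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# 5-fold CV with f1-macro: each fold serves as the validation set once.
scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X_tr, y_tr, cv=5, scoring="f1_macro")
cv_f1_macro = scores.mean()
```

Using f1-macro rather than accuracy is important here: with 91% healthy fasteners, a trivial majority-class predictor scores 91% accuracy but near-zero recall on the fault classes, which f1-macro penalises heavily.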
The process of cross-validation can be looped over to optimize the parameters of each classification method. This strategy to optimize the parameters when training the model is called nested cross validation. In our case, the optimization was performed exhaustively by generating candidates from a grid of parameters relevant to each classifier. Hence, the goal of the grid search was to evaluate all the possible combinations of the parameters within suitable ranges to hold the best combination with regard to the scoring function.
SVM, k-NN, RF, AB and GBDT need parameter tuning to achieve their best performance. A limited number of values for each parameter was selected for optimisation, as it was infeasible to try the entire range of possible values. The parameters selected for optimisation were based on literature studies [29,40,41]. The parameters selected for optimisation and their value ranges are described below:

• SVM. Three main parameters were selected for optimising the SVM classifier: (i) the kernel, which maps the observations into a feature space; (ii) C, the regularization parameter that controls the punishment given to the model for each misclassified point for a given curve; and (iii) gamma, which defines how far the influence of a single training example reaches. Three kernel types were tested: sigmoid, polynomial and radial basis function (RBF). The ranges for C and gamma were set from 10^−12 to 10^12 on a logarithmic scale. Since a grid search algorithm was used for hyperparameter optimisation, the search was performed for all points predefined in the grid.
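The exhaustive grid search described above can be sketched with scikit-learn's `GridSearchCV` on synthetic data. The grid below is a deliberately coarse, assumed version of the paper's sweep (the full 10^−12 to 10^12 range would make a small example very slow):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic 3-class stand-in for the fastener feature set
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# Kernel type plus logarithmically spaced C and gamma, as in the study,
# but over a much narrower illustrative range
param_grid = {
    "kernel": ["rbf", "poly", "sigmoid"],
    "C": np.logspace(-2, 2, 5),
    "gamma": np.logspace(-2, 2, 5),
}

# Every combination in the grid is evaluated with 5-fold CV on f1_macro
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="f1_macro")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Nesting this search inside an outer CV loop (nested cross validation) gives an unbiased estimate of the tuned model's performance.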

Model Performance Evaluation
Evaluation metrics are used to understand and evaluate how effective the model is based on some score, performed on the test dataset. Different evaluation metrics emphasise different aspects of performance of the classifier algorithm. It is essential to define the objectives of evaluation in order to enable suitable selection of metrics. The classification approach for this study is a multiclass classification model and the evaluation of these models can be understood in the context of a binary classification model.
The major evaluation metrics for a binary classification model are derived from four categories:

• True positives (TP), where the actual label is positive and the predicted class is positive;
• False positives (FP), where the actual label is negative but is incorrectly predicted as positive;
• True negatives (TN), where the actual label is negative and is correctly predicted as negative;
• False negatives (FN), where the actual label is positive but is incorrectly predicted as negative.
A multiclass classification problem can be viewed as a set of many binary classification problems. The commonly used evaluation metrics for multiclass classification based on these elements are described in Table A1 (in Appendix A). Standard metrics such as average accuracy and error rate can provide significant insight into the overall performance of a model. However, the data set used for this study is imbalanced, with the majority class being the healthy system. Accuracy and error rate can become unreliable measures of model performance when the skew in the class distribution is severe. Accuracy and error rate also have limitations with respect to different misclassification costs [42]. For railway applications, the cost of misclassifying faulty systems as healthy is higher than that of misclassifying healthy systems as faulty. The former may lead to a lack of intervention, which can cause hazardous deterioration of track system integrity and increase the risk of derailment. The latter scenario, misclassifying healthy systems as faulty, does not threaten the safety and technical integrity of the system; however, the number of site visits might increase due to false alarms, increasing the cost of maintenance. For the above-mentioned reasons, accuracy and error rate were not considered for evaluating the models in this study.
The limitations of accuracy and error rate can be tackled using precision and recall, or a combination of both. Precision quantifies the fraction of positive predictions that are actually positive; however, precision alone cannot be used to evaluate the performance of a model, as it provides no insight into missed positive predictions. Recall complements precision, as it quantifies the fraction of all actual positives that are correctly predicted. Precision addresses the question "of all the positive predictions, what is the probability that they are correct?" and recall addresses the question "among all the items that belong to the positive class, how many does the classifier actually detect as positive?" Precision and recall aim at minimising false positives and false negatives respectively, and an increase in precision often leads to a reduction in recall. The F-score, or F1-score, which is the harmonic mean of precision and recall, combines the properties of both into a single metric. The F1-score is a strong metric for model evaluation when both precision and recall are important.
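These definitions can be checked on a toy binary example (the labels below are hypothetical and not from the study's data):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy binary labels: 1 = faulty fastener (positive class), 0 = healthy
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Here TP = 3, FP = 1, FN = 1
p = precision_score(y_true, y_pred)   # TP / (TP + FP) = 3/4
r = recall_score(y_true, y_pred)      # TP / (TP + FN) = 3/4
f1 = f1_score(y_true, y_pred)         # harmonic mean of p and r = 0.75 here
```

Since precision and recall are equal in this toy case, their harmonic mean coincides with both; in general the F1-score lies between the two, closer to the smaller value.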
For multiclass classification, precision, recall and F1-score are computed by microaveraging or macroaveraging. Microaveraging aggregates the contributions of all classes to calculate the average performance, whereas macroaveraging calculates the performance for each class individually and then averages the results, treating each class equally. Since the study involves an imbalanced data set, macroaveraging was utilised, as it is insensitive to class imbalance. The models were evaluated based on macro precision, macro recall and macro F1-score during the testing phase. The macro F1-score was used as the scoring metric during the training and validation phases for parameter tuning.

Figure 4a,b shows the time signal plots, after demodulating, resampling, filtering and rotating the raw signal, for both driving fields for a single measurement sequence. The measurement was carried out along a short section of the track having 47 sleepers. This section of the track was relatively healthy, without any damage to the railhead. The zero crossing in the signal from positive to negative induction represents the centre position of the fastening system and was used to segregate individual fastening systems. Individual fastening systems are easily distinguishable from the time signal plots for both driving channels. A drop in the amplitude of the fastener signature is visible at those positions where clamps were missing from the fasteners. Missing clamps reduce the metallic material and change the geometry of the fastening system, thus reducing the amplitude of the return field. Healthy fastening systems with intact clamps are depicted using black markers, while fastening systems with one clamp missing and with both clamps missing are represented using blue and red markers respectively.
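The positive-to-negative zero-crossing segmentation described above can be sketched as follows, using a synthetic sinusoid in place of the demodulated sensor signal; the function name and signal are illustrative assumptions, not the study's processing code:

```python
import numpy as np

def segment_fasteners(signal):
    """Split a demodulated time signal at positive-to-negative zero
    crossings, which mark the centre of each fastening system."""
    # indices i+1 where signal[i] > 0 and signal[i+1] <= 0
    crossings = np.where((signal[:-1] > 0) & (signal[1:] <= 0))[0] + 1
    return np.split(signal, crossings)

# Hypothetical signal: three oscillation periods -> three full signatures,
# sampled so that no sample lands exactly on a zero crossing
t = (np.arange(300) + 0.5) / 100
sig = np.sin(2 * np.pi * t)
segments = segment_fasteners(sig)   # 1 leading half-arch + 3 full signatures
```

Each resulting segment can then be reduced to per-fastener features (e.g., peak amplitude) for classification.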
All four features, for both channels, showed a drop in value when clamps were missing from the fastening system. The above plots were obtained for a short track section where the differences between feature levels are easily distinguishable. However, this clear difference between the three classes is not as evident for all track sections, where additional disturbances can be present, as observed in Figure 6. Figure 6 shows the histogram plots (as probability density functions) for all eight features over the entire data set of fasteners (from both the 18 kHz and 27 kHz channels). Healthy fasteners are indicated with black markers, and fasteners with one or both clamps missing are marked in blue and red respectively. It is evident from the plot that there are significant overlaps between the three classes for all the features. Disturbances in the track can affect the induced voltage in the eddy current sensor, causing fluctuations in the features. The overlap is seen in features extracted from both channels. It therefore becomes difficult to mark a boundary or threshold on individual features to differentiate the state of the fasteners for classification purposes. This calls for machine learning algorithms that can combine multiple features to create a classification boundary. Figure 7 depicts a 3-D representation of the overall data set used for classification, using three random features from the 18 kHz channel.

The training, validation and testing performance of all six algorithms for the 18 kHz channel, the 27 kHz channel and both channels combined is presented in Table 3. The scores presented in the table were obtained after cross validation, using the best hyperparameters. The best hyperparameter combinations for all algorithms are depicted in Figure 8. All the algorithms except Gaussian naïve Bayes exhibited an accuracy above 97% in all scenarios during training, validation and testing.
Since accuracy is not considered a strong metric for evaluating performance on imbalanced data, the model comparisons were carried out using the macro F1-score. The scores during training, validation and testing did not vary significantly, indicating that the models neither overfitted nor underfitted the data, i.e., they did not exhibit high bias or high variance. Figure 8 presents the learning curves for all the algorithms with their respective optimised hyperparameters. From the learning curve plots, it is evident that the training samples were sufficient for all the algorithms, as the training score converged to a value. All the validation scores also tended to converge to a value close to the training score; increasing the training sample size may slightly increase the validation score. Since the training score was high (low error), the training data was well fitted by all the estimated models, indicating low bias. The gap between the training and validation curves was minimal for all the algorithms, indicating low variance.

Gaussian naive Bayes exhibited the lowest score during training, validation and testing for all three feature sets, with an F1-score below 90%. All the other five algorithms exhibited a better macro F1-score than the baseline Gaussian naive Bayes classifier during training, validation and testing on the respective data sets. It was therefore straightforward to rule out the GNB classifier when considering the best performing algorithm for fastener detection, and the performances of the remaining five algorithms were evaluated more closely. For the features extracted from the 18 kHz channel, the F1-score was above 89% during the training, validation and testing stages. SVM, k-NN and random forest (RF) had an F1-score above 90% for training, validation and testing, with k-NN registering the highest score.
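The bias/variance diagnosis above can be reproduced in sketch form with scikit-learn's `learning_curve` utility on synthetic data (an illustration of the method, not the study's actual curves):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class stand-in for the fastener feature set
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# Training and validation macro F1-scores at increasing training-set sizes;
# convergence of the two curves (small gap, high score) indicates low
# variance and low bias, as argued for Figure 8
sizes, train_scores, val_scores = learning_curve(
    KNeighborsClassifier(n_neighbors=5), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring="f1_macro")

gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)
```

Plotting the row means of `train_scores` and `val_scores` against `sizes` yields a learning curve of the kind shown in Figure 8.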
k-NN had an F1-score of 93.86% during the validation phase and 92.29% during testing. AdaBoost and GBDT had test scores below 90%.

Results and Analysis
The F1-scores of all the algorithms during training and testing on the 27 kHz feature set were better than on the 18 kHz data set. The score was above 92% for all the algorithms tested on the 27 kHz data set at every stage. The AdaBoost algorithm exhibited the best performance on this data set, with a score of 96.01% during training, 95.63% during validation and 95.15% during testing. When the data set comprised the features of both the 18 kHz and 27 kHz channels representing the fastener, the F1-scores of all the algorithms improved further during training, validation and testing, compared to the data sets containing features from an individual channel. AdaBoost was the best performing algorithm for the combined data set, with an F1-score above 96% during all three stages. An F1-score above 94% was achieved by k-NN, RF and GBDT during the testing phase on the combined data set. It is evident from Table 3 that detection of the clamp state was better when features from both channels were used simultaneously to represent a single clamp.
In railway applications, it is essential to have both high precision and high recall. A higher recall minimises the false negative rate, contributing to better detection of faulty fastener states and thus ensuring safe and reliable operation of the railway. A higher precision, on the other hand, minimises the false positive rate and thus reduces the cost incurred by unwanted inspections. It is necessary to balance reducing the cost of inspection against upholding the safety and reliability of the railway asset. Hence, the best algorithm was selected based not on the F1-score alone, but also on the precision and recall scores and the balance between the two during the test stage. Table 4 presents the results obtained during the testing phase for all six algorithms. For the data set obtained from the 18 kHz channel, RF registered the highest precision (93.43%); however, its recall was below 90%. Recall was highest for the k-NN algorithm at 94.29%, while its precision was around 90%. No algorithm exhibited a good balance between precision and recall during the testing phase on the 18 kHz data set. For the data from the 27 kHz channel, the highest precision (94.16%) was achieved using random forest, with a recall of 90.46%. The highest recall (96.45%) on the 27 kHz data set was achieved with AB, with a precision of 93.92%. The better-performing algorithms on this data set, however, did not exhibit a good balance between precision and recall. Precision and recall were well balanced for all the algorithms when features from both channels were used simultaneously to represent the clamp. The highest precision and recall were achieved by the AdaBoost algorithm, at 96.64% and 95.52% respectively, with the highest F1-score of 96.02% during testing.

Conclusions and Future Work
At present, rails and fasteners are still inspected with the aid of automated visual inspection and manual inspection, despite the fact that these methods require huge investments of both capital and time. Moreover, automated visual inspection becomes a challenge when the rail and its components are obscured due to adverse environmental conditions such as snow, debris, etc. In a previous study [24], the authors proposed a train-based differential eddy current sensor that can overcome such challenges. This paper attempts to develop an intelligent system, with the aid of a machine learning approach, to facilitate effective and reliable health monitoring of the track system by reducing human biases and errors in detecting the state of the railway fastening system. The data set used for this study was obtained from measurements carried out along a heavy haul line in the north of Sweden using the differential eddy current measurement system. Three sets of data were used for analysis in this study, two of which were obtained from the individual channels of the sensor system (18 kHz and 27 kHz), with four features representing each fastener. The third data set comprised the combined features from both channels (eight features). Six machine-learning algorithms were selected and compared for all three data scenarios to determine the best performing model.
The results of the study show that all the algorithms perform better when the data set includes features from both channels rather than from either channel individually. Among the three data sources, all the algorithms performed comparatively worst on the 18 kHz data set. The performance of the algorithms based on F1-score was higher for the 27 kHz data set and for the data set with combined features. In the railway industry, it is essential to balance the risk of failure against the cost of inspection; therefore, a good balance between the precision and recall of the detection algorithm is necessary. Both channels are preferred for the detection of railway fastening systems, as the algorithms performed better and exhibited a good balance between precision and recall when the features of both channels were used to represent an individual clamp. Among the six algorithms tested, AdaBoost, a type of ensemble algorithm, slightly outperformed the other algorithms in all evaluation metrics.
Further research will be carried out to incorporate other classification techniques (e.g., artificial neural networks, XGBoost, etc.) that may have the capability of improving the predictive strength for fastener detection. The possibility of unsupervised clustering for anomaly detection will also be investigated in further studies. The current study incorporates only one type of fastener, namely the Pandrol e-clip. Future work will also focus on different types of fasteners, which can be distinguished by the rotation angle, as different fastening systems have different geometrical shapes. The future scope of this study also involves high-speed measurements, detection of other magnetic track components, quantification of rail defects and the development of efficient condition monitoring techniques with the aid of artificial intelligence to detect and predict faults from big data. The work in this study was based on using features obtained from the measurement signals as inputs to machine learning algorithms. These features are subject to change when the distance between the sensor and the object varies (i.e., the lift-off effect). In this application, lift-off can occur due to wheel wear. However, this is a slow process that can be handled by continuous automatic calibration of the system, in which healthy signatures are used as a reference. This will be addressed in future work.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.