Semi-Supervised Deep Learning in High-Speed Railway Track Detection Based on Distributed Fiber Acoustic Sensing

High deployment costs, safety risks, and time delays restrict traditional track detection methods in high-speed railways. Therefore, approaches based on optical sensors have become the most attractive strategy in terms of deployment cost and real-time performance. Owing to the large amount of data obtained by such sensors, deep learning, as a powerful data-driven approach, has been proven to perform effectively in track detection. However, it is difficult and expensive to obtain labeled data from railways during operation. In this study, we used a segment of a high-speed railway track as the experimental object and deployed a distributed fiber acoustic sensing (DAS) system. We propose a track detection method that innovatively leverages semi-supervised deep learning based on image recognition, with dedicated dataset pre-processing and a greedy algorithm for the selection of hyper-parameters. The superiority of the method was verified in both experiments and actual applications.


Introduction
Track defect detection: As a result of the high frequency and intensity of track operation in the open air, track defects occur constantly. There are four typical defects in track detection: crevice, beam gap, cracking, and bulge, as shown in Figure 1. It should be noted that Figure 1 is not drawn to scale in the horizontal direction; on the actual track, the length of a track slab is 6450 mm. Bulges can be divided into multiple states according to their severity, such as slight bulges, mortar layer outflow, and voids [1]. Because the track is a monolithic concrete structure with densely distributed reinforcement, it is difficult to identify the defects inside. In traditional detection, in addition to lacking real-time performance, the track detection vehicle has a long operating period and a high deployment cost, and manual detection is restricted by its high error rate [2,3]. Therefore, approaches based on optical sensors have become the most attractive methods in terms of deployment cost and real-time performance.

Optical sensors in track detection: Bao et al. [4] monitored the temperature and strain of joints by pulse-pre-pumped Brillouin optical time-domain analysis (PPP-BOTDA) with a single-mode fiber as a distributed sensor. Kang et al. [5] developed an FBG sensing system and a graphical user interface (GUI) to monitor wheel thickness changes in real time. Under actual conditions, where labeled data are scarce, the application of SSL in engineering has become a hot research field. Summarizing the previous work, current DAS-based track defect detection systems can be roughly divided into three categories: amplitude visualization, methods based on traditional machine learning, and methods based on deep learning. The existing deep learning-based methods mostly extract features from single-point vibration data and use MLP models to build classifiers. In this paper, we propose a track detection system that innovatively leverages semi-supervised deep learning based on image recognition. The innovations of this article are as follows: (1) to increase the sample information density, we use multi-point rather than single-point amplitudes; (2) to alleviate the lack of high-frequency components caused by an insufficient sampling rate, we train the model on amplitude rather than frequency features; (3) we convert the data into images and classify the samples with a CNN to achieve better convergence speed and capacity; (4) we use the deep network to adaptively extract sample features rather than extracting them manually; (5) we use semi-supervised learning to efficiently leverage unlabeled data and further improve the performance of the model.
Through the above-mentioned innovations, we successfully implemented a real-time track defect detection system and achieved superior performance, especially for multi-type minor defects. The structure of this paper is as follows. In Section 2, we describe the deployment of the DAS and the distribution of defects on the experimental track. In Section 3, we introduce the mechanism of SSL and several SSL models compared in the experiment. In Section 4, we present the dataset pre-processing particular to our semi-supervised learning model, and we validate it in Section 5. In Section 6, we summarize our work and future research.

Sensor Deployment
Due to the security policy, experimental deployment along the railway track is not allowed, so the existing backup fiber for video along the track is used as the sensor, meaning that no additional installation is needed. The fiber joint was connected to the DAS in the computer room. We deployed a commercial, intensity-based DAS capable only of detecting vibrations, with a spatial resolution of 5.8 m and a measurement frequency range below 5 kHz; the fiber is of type G652D. The vibration amplitude of the measurement points at each moment is buffered in the DAS as a row of data, which is then received and stored on a PC, as shown in Figure 2. The length of the optical path was measured to be 3000 m. By knocking near the fiber and observing the pulse position, we determined that the optical path corresponding to the experimental track segment is approximately 1500 m long, and the sampling rate is 2 kHz (2000 rows/s). The spectrograms of the measured points in the raw data are shown in Figure 3.
There are four typical defects to be recognized in our work: crevice, beam gap, cracking, and bulge. To facilitate the analysis, two events causing peculiar vibrations, track switches and a highway passing below, serve as additional 'defects' to expand the experimental objects. The distribution of defects is shown in Table 1. Under actual conditions, events are difficult to locate accurately, so we roughly estimate their positions at 10 m intervals, which has been approved by the maintainers. For ease of analysis, positions without events are labeled 'no-event' in the experiment, but they are not listed in Table 1.

Data Representation
Most of the existing methods are based on raw values, whereas our method focuses on the patterns behind them, which is called feature representation in deep learning. Fitting the actual data into deep learning models is one of the most important steps in deep learning. To make the data values comparable across instances, and to prevent neuron output saturation or small values being ignored due to inputs with excessive absolute values, Min-Max normalization is performed:

$$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$$

where $X_{min}$ and $X_{max}$ are the minimum and maximum of the entire dataset, respectively. An example of normalized data is shown in Figure 4.
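As a minimal sketch of this step (assuming the raw amplitudes are held in a NumPy array; shapes and variable names are illustrative, not from the paper):

```python
import numpy as np

def min_max_normalize(data: np.ndarray) -> np.ndarray:
    """Scale the dataset into [0, 1] using the global minimum and maximum,
    so that amplitudes from different rows remain comparable."""
    x_min, x_max = data.min(), data.max()
    return (data - x_min) / (x_max - x_min)

# Example: normalize a batch of raw amplitude rows (100 rows x 9000 points).
raw = np.random.default_rng(0).normal(size=(100, 9000))
normalized = min_max_normalize(raw)
print(normalized.min(), normalized.max())  # 0.0 1.0
```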
Note that not all of the vibration data were added to the dataset. We filter the data by amplitude and then concatenate fragments from different moments to ensure that the dataset is composed of vibrations caused by passing trains. The train can be considered a scanner: when it passes through the entire experimental track, the vibration it causes is recorded. To facilitate further descriptions, the $t_0$-th row of the dataset can be considered a snapshot of the track at moment $t_0$.
Since data in the same row actually come from different moments and represent the vibration amplitudes of the entire track, we divide the data into spatial fragments by rows. In this way, the Nyquist sampling-rate requirement for the DAS is avoided (the integrated information can be retained only if $f_{sampling} > 2 f_{signal}$), because it is unnecessary to analyze the vibration modes of the measurement points over continuous periods in the frequency domain. Instead of detecting all of the measurement points, we only need to detect the fragments, thus reducing the computational cost.
In our scheme, a single instance can only describe the amplitude of a fragment at a certain moment, which is not comprehensive enough to represent the condition of the fragment. Exploiting the mutual reasoning relationship among the instances at different moments, we merge the instances corresponding to the same fragment, separated by a time interval t, into a qualified instance in the form of an RGB image. An example of merging instances is shown in Figure 5. The time interval t is treated as a hyper-parameter and selected in the validation.
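The merging step can be sketched as follows. The snapshot array layout and the mapping of the three snapshots onto the image channels are our assumptions for illustration; the paper only specifies that three instances of the same fragment, spaced by t, form one RGB-like instance.

```python
import numpy as np

def merge_instances(snapshots: np.ndarray, fragment: int, start: int, t: int) -> np.ndarray:
    """Merge three snapshots of the same fragment, spaced t rows apart,
    into one three-channel instance.

    snapshots : shape (n_rows, 300, 30) -- one row per train passing,
                already split into 300 fragments of 30 measurement points.
    fragment  : index of the track fragment (0..299).
    start     : row index of the first snapshot.
    t         : time interval (in rows, i.e., train passings) between snapshots.
    """
    rows = [start, start + t, start + 2 * t]
    channels = [snapshots[r, fragment] for r in rows]  # three 30-point vectors
    return np.stack(channels, axis=0)                  # shape (3, 30), RGB-like

# Illustrative usage with t = 17, the best value found in the validation.
data = np.random.default_rng(1).random((60, 300, 30))
instance = merge_instances(data, fragment=42, start=0, t=17)
print(instance.shape)  # (3, 30)
```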

Semi-Supervised Deep Learning
Semi-supervised deep learning (SSL) leverages unlabeled data to assist training under four basic assumptions: the smoothness, low-density, manifold, and cluster assumptions [26,27]. SSL reinforces the decision boundary according to the distribution of the unlabeled data, as shown in Figure 6. Figure 6 is a two-dimensional schematic diagram that abstractly describes classification with a deep learning method. Solid and hollow circles represent the labeled samples of the two classes, smaller circles represent unlabeled samples, and the triangle is a sample belonging to the solid class. In supervised learning, the samples are represented as vectors, and the model learns a decision boundary for dividing the vector clusters (the dotted line in Figure 6). Under this boundary, the triangle sample is assigned to the hollow class. However, when the unlabeled samples are considered, even though they have no labels, their distribution yields a more reasonable decision boundary (the solid line in Figure 6), under which the triangle sample is correctly assigned to the solid class. Figure 6 thus shows how semi-supervised learning helps the model establish a more reasonable decision boundary by leveraging the unlabeled samples.

Figure 6. The decision boundary reinforcement by semi-supervised learning. The recognized instance will be class 'hollow' in supervised learning; however, according to the distribution of the dataset, it should belong to class 'solid', which can be achieved in semi-supervised learning.
SSL has been extensively studied, and various solid SSL models have been proposed. One commonly used class of SSL is based on consistency regularization [28], built on the intuition that a qualified classifier should produce consistent predictions for an instance and its perturbed versions. Therefore, to reinforce the decision boundary, we can minimize the divergence between the predictions of perturbed versions of the same unlabeled instance; examples include the temporal ensemble (Pi model) [29], mean teachers [30], and UDA [31]. Another main approach to leveraging unlabeled data is 'pseudo labeling', where unlabeled data are given 'guessed' labels and join the labeled dataset for further training, as in Co-training and Tri-training [32]. In addition, many holistic SSL strategies with superior performance have been proposed in recent years, such as Mix-match [33], Remix-match [34], and Fix-match [22]. Fix-match achieved state-of-the-art results in the standard experimental setting described by Odena et al. [35].


Approaches of Data Augmentation for SSL
Most SSL strategies build loss functions with the help of data augmentation based on consistency regularization. Data augmentation in SSL can be divided into two types: weak and strong. Weak augmentation includes flips, shifts, rotations, and scaling, which produce only slight distortion. By contrast, strong augmentation causes heavy distortion through crops, Gaussian blur, dropout, and so on. According to the research on UDA, an SSL model can be significantly improved by suitable data augmentation [31]. Enlightened by this, commonly used strong augmentation strategies, such as Auto-Augment, Rand-Augment, and CT-Augment, are dedicated to selecting a set of transformations that suits the task better. CT-Augment outperforms the others in terms of computational cost and efficiency [24].
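For illustration, a weak/strong pair could be assembled from standard torchvision transforms as below. The fixed strong pipeline only stands in for CT-Augment, which learns its transformation set during training; the specific transforms and magnitudes here are assumptions.

```python
import torch
from torchvision import transforms

# Weak augmentation: slight geometric distortion only (flip and small shift).
weak = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomAffine(degrees=0, translate=(0.125, 0.125)),
])

# Strong augmentation: heavier distortion (large crops, blur, erasing).
strong = transforms.Compose([
    transforms.RandomResizedCrop(size=32, scale=(0.5, 1.0)),
    transforms.GaussianBlur(kernel_size=3),
    transforms.RandomErasing(p=0.5),
])

x = torch.rand(3, 32, 32)  # a three-channel instance
x_weak, x_strong = weak(x), strong(x)
```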

Loss Function and Pipeline
Except for wrapper methods (such as Co-training, Tri-training, and Co-forest), the loss function in SSL always contains a loss term corresponding to the unlabeled data. Consistency regularization methods build this term from the predictions of two different perturbed versions of the same unlabeled data:

$$\mathcal{L}_{u} = w_{\tau} \frac{1}{N} \sum_{i=1}^{N} \left\| p_m(u_i') - p_m(u_i'') \right\|^2$$

where $N$ is the amount of unlabeled data, $u_i'$ and $u_i''$ are two different perturbed versions of unlabeled data $u_i$, $p_m$ is the prediction presented by the model, and $w_\tau$ is the weight of the unlabeled loss term. We need to change $w_\tau$ over the course of training because unlabeled data cause too much disturbance in the early stages. The pipeline of SSL based on consistency regularization is shown in Figure 7.

In pseudo-labeling methods, unlabeled data are given a pseudo label if the model is confident enough in the prediction. In each iteration, the cross-entropy between the predictions on unlabeled data and their pseudo labels (if any) is added to the global loss:

$$\mathcal{L}_{u} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left(\max p_m(\alpha(u_i)) \geq \beta\right) H\left(y_m(\alpha(u_i)),\; p_m(A(u_i))\right)$$

where $\beta$ is the threshold that determines whether the model is confident enough, $y_m(\alpha(u_i))$ is the pseudo label of unlabeled data $u_i$ under weak augmentation $\alpha$, and $A$ is a strong augmentation. The pipeline of Fix-match is shown in Figure 8. In order to select the most suitable SSL model for track detection, we treat the SSL model as a hyper-parameter in the validation, considering UDA, Tri-training, and Fix-match.
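A minimal PyTorch sketch of the pseudo-labeling term above; the model, the threshold value, and the augmentation callables are placeholders rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def pseudo_label_loss(model, u_batch, weak, strong, beta: float = 0.95):
    """Cross-entropy between the confident pseudo label obtained on a weakly
    augmented batch and the prediction on its strongly augmented version."""
    with torch.no_grad():
        p_weak = torch.softmax(model(weak(u_batch)), dim=1)
        conf, pseudo = p_weak.max(dim=1)   # confidence and guessed label
        mask = (conf >= beta).float()      # keep only confident guesses
    logits_strong = model(strong(u_batch))
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (mask * loss).mean()
```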


Experiment Setting
The raw dataset contains 1869 rows of vibration data, and each row contains 9000 measurement points. A row of data corresponds to the vibration caused by one train passing through the track. In addition to conventional data processing (de-noising and normalization), we perform the instance merging proposed in Section 4 and divide the entire experimental track into 300 fragments at 5 m intervals, so the spatial accuracy of our research is 1500 m / 300 = 5 m. Each row of data was divided into 300 instances, and each instance contains 9000/300 = 30 columns. We labeled each instance according to the defect distribution described in Table 1. The overall dataset is presented in Table 2.
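As a sketch (the array contents are random placeholders), the division into fragments is a simple reshape:

```python
import numpy as np

rows = np.random.default_rng(2).random((1869, 9000))  # raw dataset shape from the text

# Split each row into 300 spatial fragments of 9000 / 300 = 30 columns each,
# i.e., one instance per 5 m fragment of the 1500 m experimental track.
instances = rows.reshape(rows.shape[0], 300, 30)
print(instances.shape)  # (1869, 300, 30)
```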
It can be seen from Table 2 that the class imbalance of the overall dataset is quite serious; in particular, the class 'no-event' is ten times larger than the minority class. Therefore, we rebalance the dataset to the size described in Table 2 with one of several well-performing balance methods, which is selected in the validation as a hyper-parameter. Twenty percent of the overall dataset was randomly selected to constitute the testing set, and the remaining instances are used as the training set. To obtain unlabeled instances, we define a parameter $\mu$:

$$\mu = \frac{N_{unlabeled}}{N}$$

where $N$ is the total size of the training set and $N_{unlabeled}$ is the number of unlabeled instances that we need. In the validation, each label has a probability $\mu$ of being removed. For our experiment, $\mu$ is set to 50%. The training set is presented in Table 3.

Table 4 identifies a set of hyper-parameter ranges between the maximum and minimum to optimize the use of BO in deep learning networks, based on the survey [36]. In addition to these hyper-parameters, four more hyper-parameters are defined in our research: the structure of the deep network, the data balance method, the SSL model, and the time interval t defined in Section 4.
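A possible implementation of the label-removal step; the -1 sentinel and the function name are our conventions.

```python
import numpy as np

def remove_labels(labels: np.ndarray, mu: float, seed: int = 0) -> np.ndarray:
    """Drop each label independently with probability mu (marked as -1),
    so that on average N_unlabeled = mu * N instances become unlabeled."""
    rng = np.random.default_rng(seed)
    out = labels.copy()
    out[rng.random(len(out)) < mu] = -1  # -1 marks an unlabeled instance
    return out

y = np.random.default_rng(3).integers(0, 7, size=10_000)  # 7 classes incl. 'no-event'
y_semi = remove_labels(y, mu=0.5)
print((y_semi == -1).mean())  # ~0.5
```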

Validation
The four structural hyper-parameters for our research, shown in Table 5, are selected by the greedy algorithm. SMOTE-TL, S-ENN, Border-1, MWMOTE, and Safe-level are the most commonly used oversampling methods, in which artificial samples are generated by linear combinations of existing samples. The difference between them lies in the samples selected for linear combination, which depends on the dataset; therefore, we need to choose a data balance method suitable for our dataset through experiments. The time interval t is discrete and is a virtual concept representing the number of instances between the instances used for merging. Since the train is considered a scanner and the fragments from different rows are scanned by different trains, t represents the running interval of the trains. Approximately 54 trains pass through the experimental track every day, so we set the upper limit of the value range of t to 54, which means that an instance can contain amplitude data from at most three moments on three different days for the same fragment. When t is 0, an instance can only contain the amplitude data of three consecutively passing trains. There is a reasonable intuition that containing data from different moments on different days enables a qualitative description of the vibration mode of a fragment, because the vibration mode of a given fragment should differ across moments and days. Therefore, we empirically initialize t to 36.

Table 4. Optimized hyper-parameters for optimizers and their search ranges. SGDM, RMSprop, and Adam are gradient-based model optimizers. Since different models and datasets require different parameter update rules, it is necessary to select the optimal optimizer. The optimal parameter combination for each optimizer also needs to be selected by random search within a certain range, including the initial learning rate, momentum, SGDF, GDF, and L2 regularization.
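A sketch of the greedy (coordinate-wise) selection consistent with this description: one hyper-parameter is swept at a time while the others are held at their current best values. The toy evaluate function merely illustrates the interface; the real one would train and validate the model.

```python
def greedy_select(search_space: dict, evaluate, init: dict) -> dict:
    """Sweep one hyper-parameter at a time, keeping the best value found
    before moving on to the next hyper-parameter."""
    best = dict(init)
    best_score = evaluate(best)
    for name, candidates in search_space.items():
        for value in candidates:
            trial = {**best, name: value}
            score = evaluate(trial)
            if score > best_score:
                best, best_score = trial, score
    return best

space = {
    "network": ["VGG-16", "ResNet"],
    "balance": ["SMOTE-TL", "S-ENN", "Border-1", "MWMOTE", "Safe-level"],
    "ssl_model": ["UDA", "Tri-training", "Fix-match"],
    "t": list(range(0, 55)),
}
# Toy score that merely rewards the values reported as best in this paper.
toy_evaluate = lambda cfg: (cfg["t"] == 17) + (cfg["ssl_model"] == "Fix-match")
init = {"network": "VGG-16", "balance": "S-ENN", "ssl_model": "UDA", "t": 36}
print(greedy_select(space, toy_evaluate, init))
```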

Since both are strongly related to the data distribution, the network structures and balance methods are selected in combination. The evaluation metrics used here are AUC, F1, and g-mean, which are the most commonly used metrics for classifiers under data imbalance. We conducted experiments on all combinations of network structures and balance methods. The details of the results are presented in Table 6, where the results of the best combinations under AUC, F1, and g-mean are in bold. The two best combinations, VGG-16 with S-ENN and ResNet with MWMOTE, are used for further validation.

Validation on SSL Model
Since the SSL model is strongly related to the network structure, it is selected under the two combinations chosen above: VGG-16 with S-ENN and ResNet with MWMOTE. The error rates of the different combinations are shown in Figure 9. For our task, the combination of Fix-match, VGG-16, and S-ENN is the best.

Validation on t
We performed five-fold cross-validation on t, and the results are shown in Figure 10. There are two peaks, at 17 and 39, in the box plot of t, and the curve is approximately periodic with a period of 18. An instance is merged from three different train passings, and approximately 54 trains pass through per day. Therefore, the peak at 17 indicates that when the information contained in an instance is dispersed as widely as possible over a day, the contribution of that instance is the largest, because the different vibration modes of the fragment at different times of day are considered, as shown in Figure 11. In addition, considering the standard deviation, as the interval of the fragments contained in the instances becomes larger, the performance of the system becomes more stable.

Figure 10. Five-fold cross-validation on t.
Figure 11. A description of how hyper-parameter t works.
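The validation loop could look like the following sketch, assuming scikit-learn's KFold; build_dataset and train_and_score stand in for the paper's instance-merging and training code.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate_t(build_dataset, train_and_score, t_values, n_splits=5):
    """Five-fold cross-validation over the time-interval hyper-parameter t."""
    results = {}
    for t in t_values:
        X, y = build_dataset(t)  # instances merged at interval t
        scores = [
            train_and_score(X[tr], y[tr], X[va], y[va])
            for tr, va in KFold(n_splits, shuffle=True, random_state=0).split(X)
        ]
        results[t] = (np.mean(scores), np.std(scores))
    return results

# Toy usage with random placeholders for the dataset and the scorer.
demo = cross_validate_t(
    build_dataset=lambda t: (np.random.rand(100, 3, 30), np.random.randint(0, 7, 100)),
    train_and_score=lambda Xtr, ytr, Xva, yva: np.random.rand(),
    t_values=[0, 17, 36, 39],
)
```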

Comparison
According to the pipelines of the SSL models, when no unlabeled instances are used, the SSL model degenerates into a purely supervised deep learning model. Therefore, we compare the SSL and supervised deep learning methods trained on the same amount of labeled data by fixing the number of labeled instances and changing µ. The amount of labeled data in each class was fixed at 500, and the range of µ was [0, 0.84]. The results are presented in Figure 12.

Testing
The optimal hyper-parameters are selected according to the validation, as shown in Table 7. The result of testing is presented in the form of a confusion matrix, whose element $e_{ij}$ is the number of instances of class $i$ predicted as class $j$. The precision and recall for multi-classification are defined by

$$\mathrm{Precision}_i = \frac{e_{ii}}{\sum_{k} e_{ki}}, \qquad \mathrm{Recall}_i = \frac{e_{ii}}{\sum_{k} e_{ik}}$$

Precision and recall are the most commonly used evaluation metrics for classification. The superiority of our method is shown in Table 8, and a visualization of the detection is shown in Figure 13. It should be noted that in the real world, amplitude variation may be caused by a variety of situations, such as changes in the state of the train, different exposed states of the cable, and unexpected changes in the natural environment. The peaks in Figure 13 are mainly caused by such unexpected situations and have little correlation with the track defects in which we are interested.

For safety reasons, accuracy is not always the most important metric for real-time track detection. A high recall rate for defects and a high precision rate for no-defect are always pursued to avoid defect omissions. The results on a two-class actual dataset including 'defect' and 'no-defect' are shown in Table 9. Most of the defects are found (recall of defects = 0.9938), and omissions are rare (precision of no-defect = 0.9938), which meets the safety requirements well.
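These definitions can be computed directly from the confusion matrix. In the sketch below, the two-class counts are illustrative values chosen only to reproduce the reported 0.9938 figures, not the paper's actual confusion matrix.

```python
import numpy as np

def precision_recall(confusion: np.ndarray):
    """Per-class precision and recall from a confusion matrix where
    confusion[i, j] counts instances of class i predicted as class j."""
    tp = np.diag(confusion).astype(float)
    precision = tp / confusion.sum(axis=0)  # column sums: predicted as class i
    recall = tp / confusion.sum(axis=1)     # row sums: true class i
    return precision, recall

cm = np.array([[480, 3],    # 'defect' row: 480 found, 3 missed (illustrative)
               [3, 480]])   # 'no-defect' row (illustrative)
p, r = precision_recall(cm)
print(p, r)  # both ~[0.9938, 0.9938]
```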

Compared with the relevant work [19] (as shown in Table 10), our method is end-to-end, which allows more convenient model adjustment and faster operation. With the efficient use of unlabeled data and greater sample information density, our model achieves higher accuracy on more complex tasks.

Table 10. Comparison with relevant works.

Work | Use of Unlabeled Data | Sample Structure | Defect Types | Mode | Acc
[12] | No | Image (three-channel) | 1 | End-to-end | 93.00%
[16] | No | Image (three-channel) | 1 | End-to-end | 97.14%
[17] | No | Image (three-channel) | 1 | End-to-end | 92.00%
[19] | No | Three-point (channel) | 4 | Two-stage | 94.98%
Ours | Yes | Image (three-channel) | 6 | End-to-end | 97.91%

Discussion
In this paper, we propose a track detection method that innovatively leverages semi-supervised deep learning based on image recognition, with dedicated dataset pre-processing and a greedy algorithm for the selection of hyper-parameters. The accuracy reached 97.91%, which is satisfactory. In this section, we discuss some details of our research, point out its limitations, and propose ideas for further research.
Firstly, there is a trade-off between the traditional deep learning-based methods and our method. We achieved a low computational cost and a low sampling frequency requirement at the cost of spatial accuracy, as shown in Figure 14, whereas traditional methods perform better in terms of spatial accuracy. Moreover, according to the validation results on t, time and environment can change the vibration modes of the track, which is consistent with researchers' intuition. This may suggest new solutions for expanding (or augmenting) datasets when labeled data are limited in engineering situations: using data from different times and environments may be a type of data augmentation analogous to flip and shift in image recognition. In addition, a well-performing classifier is not always a good solution for a specific task; a classifier can be applied only when it meets the different biases required for each class in the actual project, which is an important factor in the application of deep learning. For DAS, higher spatial resolution may lead to higher prices. However, in our method, higher spatial resolution means that each device can cover a longer distance, so using expensive equipment is, counter-intuitively, the more economical approach. In addition, due to the bottleneck of the transmission rate, embedding the defect detection system into the DAS is a low-cost way to improve the actual sampling rate and detection accuracy of the system.

Figure 14. The difference between the traditional methods and the proposed method.

The main drawback of our method is that it cannot evaluate the severity of defects or their evolution. We used deep learning models, so we did not actually model the mechanism of the defects. To evaluate severity and evolution, we would need a dataset with severity labels, which is difficult to obtain because it requires professionals to manually label the severity of the defects at high labor cost. In addition, there are still some limitations to our research: (1) Since defects are rare on railways in operation, and the tracks that can be used for research are very limited due to the security policy, there are not enough defects available in our research to prove that all kinds of defects can be found by the proposed method; considering the basic assumptions of SSL mentioned in Section 3, other defects may not exhibit a special density area for classification. (2) Under special conditions, such as in a tunnel full of echoes, vibration may be distorted by complex environmental noise, leading to classification errors; moreover, it is almost impossible to detect defects under heavy environmental noise in a purely data-driven manner, so it is necessary to pre-process the data according to the environment. (3) The vibration modes of the defects may change with the environment, and automatic online learning may cause errors to be inherited and amplified; therefore, a traditional track detection method is still necessary for updating the proposed method. (4) According to the security requirements of high-speed railways, a detection system based on black-box models cannot be fully trusted; thus, it can only be used as a supplement in real-time detection. (5) In fact, our method performs a classification task rather than a recognition task, and the number of categories is set in advance; therefore, all types of defects need to be marked before training, which is difficult to guarantee in a general situation. To solve this problem, we can add an additional 'unknown' category and classify samples that are not similar to the existing training samples into this category.
There are two main research directions in this area. The first is the interpretability of black-box models. Although we explained the safety-related cost-sensitive classification in Section 5.4, black-box models based on deep learning still cannot be fully trusted in safety-related areas. We should try to combine deep learning with traditional machine learning, such as decision trees, and try to use white-box models to explain the intermediate steps of black-box models. Secondly, data representation is another direction of concern: transforming engineering problems into general deep learning problems is the key to applying deep learning to engineering tasks. Finding a more suitable data representation to fit actual data into existing deep learning models will be one of our main research topics in the future.

Conflicts of Interest:
The authors declare no conflict of interest.