Arching Detection Method of Slab Track in High-Speed Railway Based on Track Geometry Data

During the long-term service of slab track, various external factors (such as complicated temperature conditions) can result in a series of slab damages. Among them, slab arching changes the structural mechanical properties, deteriorates the track geometry conditions, and even threatens the operation of trains. Therefore, it is necessary to detect slab arching accurately to achieve effective maintenance. However, the current damage detection methods cannot satisfy high accuracy and low cost simultaneously, making it difficult to achieve large-scale and efficient arching detection. To this end, this paper proposes a vision-based arching detection method using track geometry data. The main work includes: (1) data nonlinear deviation correction and arching characteristics analysis; (2) data conversion and augmentation; (3) design and experiments of a convolutional neural network-based detection model. The results show that the proposed method can detect arching damages effectively, and the F1-score reaches 98.4%. By balancing the sample size of each pattern, the performance can be further improved. Moreover, the method outperforms a plain deep learning network. In practice, the proposed method can be employed to detect slab arching and help make maintenance plans. The method can also be applied to data-based detection of other structural damages and has broad prospects.


Introduction
It has been almost 60 years since the opening of the first high-speed railway (HSR) line. Countries around the world are developing and constructing HSR on a large scale because of its smoothness and comfort [1]. In the process of long-term service, the track structure of HSR inevitably deteriorates due to external influences, which directly affects the track quality. Therefore, it is essential to realize scientific and efficient maintenance of the track structure to ensure the reliability of HSR.
In all kinds of HSR track structures, slab track, with the advantages of high regularity, strong durability, and low maintenance costs, has been widely utilized around the world [2]. Especially in China, slab tracks are adopted in most HSR lines [3]. During the long-term operation of the track system, complicated temperature conditions are the main factor affecting the service state of the slab track [4,5]. Continuous high temperature and large temperature gradients can cause structural deterioration, weaken the interface constraints, and even induce a series of damages [6,7]. Moreover, repeated train impacts can also accelerate the development of the damages [8,9].
In recent years, deep learning has received extensive attention [39]. It does not require quantitative indexes to be set manually and can learn deep potential features autonomously. Popular deep learning models include the deep neural network (DNN) [40], recurrent neural network (RNN) [41], convolutional neural network (CNN) [42], generative adversarial network (GAN) [43], and hybrid models combining multiple deep algorithms [44]. Besides the CNN used to process 2D images, scholars have also established 1D CNNs to mine the features of series data [45]. These models have been applied in several fields of the railway domain, such as equipment fault diagnosis [46], train delay prediction [47], and railway object detection [48]. However, using deep learning to detect slab damages remains a research gap.
Due to the complexity of arching characteristics and the excellent feature extraction ability of deep learning, this paper proposes a vision-based deep framework for slab arching detection using track geometry data. Firstly, an alignment algorithm integrating correlation analysis (CA) and dynamic time warping (DTW) is established to correct mile-point errors among multiple inspections, and the arching characteristics are analyzed. Then, inspired by human vision, an arching detection method is designed based on the convolutional neural network (CNN). Before being input to the model, the original series are converted into images by sections, and the image dataset is expanded by data augmentation. On this basis, the architecture of the framework is optimized, and the performance is evaluated on various datasets. Moreover, the proposed method is compared with a plain deep neural network. The main contribution of the study is to build an automatic detection method that can quickly and accurately detect the arching of track slabs. Compared to conventional data processing methods, the proposed method can simulate human vision and learn useful features automatically. It avoids the complex process of manual feature extraction and has strong anti-interference ability against noise in the original data. In practical application, the proposed method can help engineers detect and locate slab arching, thus assisting effective track maintenance. Moreover, the method can also be applied to the detection of other structural damages.
The rest of the paper is organized as follows. The data source, mile-point alignment algorithm, and data characteristics of slab arching are described in Section 2. Section 3 introduces the proposed arching detection framework. The model architecture optimization and arching detection results are discussed in Section 4. Section 5 gives the conclusions of this research.

Data Description and Preprocessing
In this section, the data source of the study is introduced firstly. Then, a mile-point alignment method combining CA and DTW is established to preprocess the inspection data. On this basis, the arching characteristics hidden in the data are analyzed.

Data Source
The track geometry data utilized are collected from a high-speed railway line in China. The total length of the line is 1318 km, with China Railway Track System II (CRTSII) slab track. CRTSII slab track is one of the standard slab track systems in China, developed from the German Bögl slab track. The system consists of rails, fasteners, track slabs, a CAM layer, and a concrete base slab. The prefabricated track slab of the CRTSII system is shown in Figure 1. The length, width, and thickness of the track slab are 6.45 m, 2.55 m, and 0.2 m, respectively. The track is formed by the longitudinal connection of prefabricated track slabs, with wide and narrow joints cast in place between the slabs. As shown in Figure 2, because of this longitudinal connection, the temperature stress is difficult to release, so the CRTSII slab track is more prone to instability and arching in high-temperature seasons than other track systems. Field investigation found that the arching deformation of CRTSII slab track is mostly about 5 mm and can reach 20 mm in extreme cases. Therefore, this paper chooses the CRTSII slab track as the research object of arching detection.

Data Mile-Point Alignment
Because of the effects of GPS accuracy, uncertain vehicle operating conditions, and complex human interference, there is an absolute error between the measured and actual mile-point values of track inspection data. Field surveys reveal that measured mile-point positions can be off by up to 200 m [49]. Even if the same inspection vehicle is used to detect the same section repeatedly, the mile-point values of each inspection may also be different.
As for slab arching detection, multiple historical data will be compared to determine whether the damages have occurred and developed. Therefore, it is necessary to correct the mile-point error among multiple inspections firstly.
In the normal service without major maintenance, the track geometry data among different inspections have similar waveforms. However, the wheels may slip or slide randomly during operation, thereby distorting the data waveforms in space (i.e., the mile-point deviation changes nonlinearly and randomly with the mileage) [50]. On account of this, we establish a mile-point alignment algorithm combining CA and DTW (CA-DTW). The specific process is as follows.
(1) Sequence segmentation Firstly, an inspection sequence with high quality is selected as the reference sequence, and its absolute mile-point error is corrected according to infrastructure ledger information. Then, the reference sequence is divided into consecutive 100-m segments (X1, X2, …) by mileage. Through correlation-based and DTW-based alignment, each unaligned sequence can be matched with the reference sequence segment by segment.
(2) CA-based initial alignment As shown in Figure 4, the matching window and Xi are of equal length (i.e., 100 m). Within Xi ± 100 m, the window moves point by point along the unaligned sequence to obtain candidate segments (Yi1, …, Yip, …, Yin) corresponding to Xi. The correlation coefficient ρ between Xi and Yip can be calculated by Equation (1).

ρ = Cov(Xi, Yip)/(Var[Xi] · Var[Yip])^(1/2) (1)

where Cov(•) is the covariance between the sequences and Var[•] is the variance of a sequence. According to the maximum-ρ criterion, the initial aligned segment Yi* and the deviation distance d can be determined.
By shifting initial aligned segments, the initial aligned sequence can be obtained.
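The CA-based initial alignment above can be sketched as follows. This is a minimal numpy illustration (function and variable names are ours, not from the paper): the matching window slides point by point over the unaligned sequence, and the offset with maximum correlation is kept.

```python
import numpy as np

def ca_initial_alignment(x_ref, y_unaligned, center, search=100):
    """Slide a window of len(x_ref) points over y_unaligned within
    ±search points of `center`, and return the offset d with maximum
    correlation coefficient (the maximum-ρ criterion of Equation (1))."""
    n = len(x_ref)
    best_rho, best_d = -np.inf, 0
    for d in range(-search, search + 1):
        start = center + d
        if start < 0 or start + n > len(y_unaligned):
            continue  # window would fall outside the sequence
        y = y_unaligned[start:start + n]
        rho = np.corrcoef(x_ref, y)[0, 1]  # Cov / sqrt(Var · Var)
        if rho > best_rho:
            best_rho, best_d = rho, d
    return best_d, best_rho
```

Applying this per 100-m reference segment and shifting each matched segment by its offset d yields the initial aligned sequence.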

(3) DTW-based secondary alignment DTW is a matching method for similar but warped sequences, which is essentially a dynamic programming problem [51]. It can search the optimal alignment relations between sequences efficiently and has been widely used to process series data [52,53]. Considering the spatial distortion of inspection data, the initial aligned segment Yi* is locally scaled based on DTW so that the mileage offset can be further corrected. A dynamic program is established, taking the minimum sum of squared differences between Yi* and Xi as the goal. The objective function and constraints are as follows.
where j is the point of Yi*, P(j) is the point of Xi that matches j, and size(•) is the sample size of the segment. ζ(j) denotes the distance loss, and MA(•) denotes the mileage coordinates of the point on sequence A. The farther the mileage between P(j−1) and P(j), the higher the distance loss. k is the farthest distance between j and P(j) under consideration. Due to the initial alignment, the remaining deviation is small. After trial calculation, k is set to 4.
Taking the left longitudinal level irregularity (LLLI) of 8 January 2017 as the reference sequence, the CA-DTW method is performed on the inspections of low-temperature and high-temperature months, as shown in Figure 5. It can be seen intuitively that the mileage deviations of the different inspections are well corrected. To analyze the effect of the algorithm further, we use the sum of absolute deviations per kilometer E and the correlation coefficient ρ to quantify the deviations of the sequences, as shown in Table 1. After alignment, E of each inspection is reduced and ρ is improved, which shows the feasibility of the algorithm. Moreover, comparing the statistics of the aligned inspections, we find that the statistics are related to slab arching: there is a larger deviation and a smaller correlation between a sequence with arching and the reference sequence.
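The DTW-based secondary alignment is, at its core, the classic dynamic-programming recursion. The sketch below is simplified: it replaces the paper's distance-loss term ζ(j) with a plain band constraint |j − P(j)| ≤ k, and the function names are ours.

```python
import numpy as np

def dtw_align(y_seg, x_seg, k=4):
    """Minimal DTW recursion: match each point j of the initially aligned
    segment y_seg to a point P(j) of the reference x_seg, minimizing the
    sum of squared differences. A band constraint |j - p| <= k stands in
    for the paper's distance loss; the paper sets k = 4."""
    n, m = len(y_seg), len(x_seg)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for j in range(1, n + 1):
        for p in range(max(1, j - k), min(m, j + k) + 1):
            cost = (y_seg[j - 1] - x_seg[p - 1]) ** 2
            D[j, p] = cost + min(D[j - 1, p],      # stretch y
                                 D[j, p - 1],      # stretch x
                                 D[j - 1, p - 1])  # one-to-one match
    return D[n, m]  # total squared alignment cost
```

Backtracking through D recovers the matching P(j), i.e., the local scaling applied to Yi*.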

Arching Characteristics
The inspection data of arched and non-arched segments are contrastively analyzed in combination with the maintenance information. It can be found that the arching characteristics are mainly reflected in the longitudinal level irregularity (LLI). Figure 6 shows several arching segments of 15 July 2017, taking the data of 8 January 2017 as the reference. A is the semi-crest amplitude, δ is the amplitude difference between different months, and λ is the wavelength. Different arched segments present various waveforms. Although the waveforms of different months in a specific segment are roughly similar, δ in the arched segments increases compared to the non-arched segments. Statistically, in the arched segments, the maximum value of A exceeds 3 mm, and δ is generally greater than 1 mm. Most values of λ are close to the length of the track slab (6.45 m), and the maximum is less than 10 m.


Proposed Vision-Based Arching Detection Method
The process of arching detection in this paper includes five steps, as shown in Figure 7. The first step is to preprocess the inspection data with the CA-DTW mile-point alignment method introduced in Section 2.2. Then, inspired by human vision, the original sequences are converted into 2D images (step 2), and data augmentation is used to balance the numbers of "Arching" and "Normal" samples (step 3). The fourth step is to design the architecture of the CNN-based detection model and optimize the model parameters. The fifth step is to train the detection model repeatedly and test its performance on the established dataset. Through the above steps, effective arching detection can be achieved. This section mainly introduces the details of the dataset establishment and the CNN-based detection model.


Data Conversion and Augmentation
In our initial data-based arching detection, a basic technique is to plot and compare the LLI segments of different months manually, and then determine whether the slab arching has occurred or not. To imitate this process, we firstly convert original series data into images, rather than directly inputting the series to the detection model. The arching waveforms in the data have obvious graphical spatial characteristics, which can be more completely preserved in 2D images, thus facilitating clear detection.
The series-image conversion method is displayed in Figure 8. Combining the LLI sequence data of a high-temperature month (e.g., July in the northern hemisphere) and a low-temperature month (e.g., January in the northern hemisphere), a sliding window is utilized to construct a 2D-image dataset. To ensure that the arching characteristics can be fully captured, the window length is set to 20 m, and the overlapping length is 10 m. The size of each image is 250 × 100 pixels, the ordinate range is −4 mm to 4 mm, and the abscissa range is 0 m to 20 m. In the images, the blue line and orange line denote the LLI of the low-temperature month and the high-temperature month, respectively. Because the axis ranges are fixed, the coordinate labels are omitted; labeled values are not required for the CNN model to learn the arching features. Furthermore, the name of each image file contains arching detection information, including the inspection date, mileage section, and inspection item (LLL/RRL), which is convenient for locating damaged segments.
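The windowing arithmetic behind the series-to-image conversion can be sketched as follows (a minimal sketch with names of our choosing; rendering each window as a 250 × 100 image would additionally require a plotting library and is omitted here).

```python
def window_bounds(total_length_m, window_m=20.0, overlap_m=10.0):
    """Return (start, end) mileage pairs for the overlapping windows
    used to cut the aligned LLI series into 20 m sections with 10 m
    overlap, as described for the series-image conversion."""
    step = window_m - overlap_m          # 10 m advance per window
    bounds = []
    start = 0.0
    while start + window_m <= total_length_m:
        bounds.append((start, start + window_m))
        start += step
    return bounds
```

For a 100-m stretch this yields windows starting at 0, 10, …, 80 m, so every mileage point (away from the ends) is covered by two windows.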

Following the maintenance information and manual inspection, the images are labeled with "Normal" or "Arching". We can observe that part of the normal and arching images have some similar characteristics, which can readily be confused.
In actual HSR service, the probability of slab arching is much lower than that of normal conditions. The numbers of arching and normal samples are therefore extremely imbalanced, which may cause over-fitting to the normal pattern and under-fitting to the arching pattern. Thus, it is necessary to carry out data augmentation on the arching images to expand the dataset. Common augmentation operations include rotation, flipping, cropping, translation, and scaling. Due to the particularity of images formed from series data, the arching images are only flipped horizontally, avoiding the generation of unnecessary features or an increase in image categories, as shown in Figure 9. Moreover, the normal images are randomly downsampled to reduce the normal set. In this way, feature extraction can be better achieved.
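Horizontal flipping of an image array is a one-line operation; a minimal numpy sketch (the actual pipeline may operate on image files rather than arrays):

```python
import numpy as np

def augment_flip(image):
    """Horizontal flip of an H×W×C image array: the only augmentation
    applied to arching samples, since mirroring the mileage axis still
    yields a physically plausible arching waveform."""
    return image[:, ::-1, :]
```

Flipping twice recovers the original image, so each arching sample contributes exactly one extra sample.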

CNN for Slab Arching Detection
CNN is one of the most widely utilized deep learning networks, proposed by LeCun et al. [54]. It is good at detecting tiny but meaningful spatial features and has the characteristic of sparse weights [55]. In general, CNN is applied in computer vision areas, such as image classification [56] and object detection [57]. Therefore, it is also suitable for the vision-based slab arching detection problem in this paper.
CNN is a stacked architecture employing multiple layers, mainly including convolutional layers, pooling layers, and fully connected layers. The convolutional layer uses a series of kernels to perceive different local features, which can be integrated to form feature maps expressing global information. Through parameter sharing, the amount of computation can be greatly reduced. The pooling layer is applied to reduce the size of the feature maps and ignore minute variations by merging local information, thereby filtering out noise in the original data and preventing over-fitting. After several alternating convolutional-pooling layers, the fully connected layers are used for the final classification based on the features extracted by the previous layers.
Figure 10 shows the final architecture of the CNN detection model. Some parameters of the model are determined through the trial-and-error method, and the details are shown in Section 4.1. Firstly, the images are resized to 60 pixel × 60 pixel × 3 channel before being input to the first convolutional layer. Then, three sets of convolutional-pooling layers are employed for feature extraction. The kernel sizes of the convolutional and pooling layers are 5 × 5 and 2 × 2, respectively. The rectified linear unit (ReLU) [58] is selected as the activation function following each convolutional layer, thus performing a nonlinear transformation on the data space. Moreover, max pooling is adopted in the pooling layers. From the first convolutional layer to the last pooling layer, the sizes of the feature maps are reduced from 56 × 56 to 4 × 4.
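The quoted feature-map sizes (56 × 56 down to 4 × 4) follow directly from unpadded 5 × 5 convolutions and 2 × 2 max pooling; a small sketch traces the arithmetic (function name ours):

```python
def feature_map_sizes(input_size=60, conv_kernel=5, pool_kernel=2, n_blocks=3):
    """Trace the square feature-map side length through three
    convolution (5×5, no padding) + max-pooling (2×2) blocks,
    as in the described detection model."""
    sizes = []
    s = input_size
    for _ in range(n_blocks):
        s = s - conv_kernel + 1   # valid convolution: 60 -> 56, etc.
        sizes.append(s)
        s = s // pool_kernel      # 2×2 max pooling halves the size
        sizes.append(s)
    return sizes
```

Starting from 60 × 60 inputs, this yields 56, 28, 24, 12, 8, 4, matching the sizes stated above.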
Next, the feature maps of the last pooling layer are flattened into a 1D vector and fed into the fully connected layers. The softmax classifier is utilized to compute the predicted probability of each pattern and obtain the final output, as shown in Equation (5).

S_n^k = P(prediction = k|X_n) = exp(V_n^k)/∑_{k′=1}^{K} exp(V_n^{k′}) (5)
where S n k is the kth component for the output vector of sample X n ; P(prediction = k|X n ) represents the predicted probability that sample X n belongs to pattern k. V n k denotes the kth component for the feature vector of sample X n before softmax processing. K is the number of patterns (i.e., the number of components for the output vector).
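Equation (5) is the standard softmax; a minimal numpy version:

```python
import numpy as np

def softmax(v):
    """Softmax of the pre-activation feature vector V_n: converts the
    K components into predicted pattern probabilities S_n^k."""
    e = np.exp(v - np.max(v))  # shift by max for numerical stability
    return e / e.sum()
```

For the two-pattern case here (K = 2, "Arching" vs. "Normal"), the output is a pair of probabilities summing to one, and the predicted pattern is the component with the larger probability.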
During the training process, the "dropout" method is utilized to reduce overfitting in the fully connected layers, and the dropout rate is set to 0.5. The Adam optimizer and an exponentially decaying learning rate are chosen to train the model. The initial learning rate is set to 0.001, and the decay rate is 0.95. The loss function is the sum of the cross-entropy loss and the L2 regularization loss, which is expressed as follows.

Loss = −(1/N) ∑_{n=1}^{N} ∑_{k=1}^{K} y_n^k ln(S_n^k) + α ∑_{w∈w} w² (6)
where Loss represents the loss function, and N represents the number of samples in a batch (the batch size is 32). y n k is the kth component for the label of sample X n . w represents the collection of model parameters considering regularization. α is the regularization rate, which is set to 0.001.
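The training schedule and loss can be sketched numerically as follows; `decay_steps` in this sketch is our assumption (the text gives only the initial rate and decay rate), and the function names are ours.

```python
import numpy as np

def exp_decay_lr(step, initial_lr=0.001, decay_rate=0.95, decay_steps=100):
    """Exponentially decaying learning rate used with the Adam optimizer;
    decay_steps is assumed, not stated in the text."""
    return initial_lr * decay_rate ** (step / decay_steps)

def total_loss(y_true, y_prob, weights, alpha=0.001):
    """Batch-mean cross-entropy loss plus L2 regularization
    (regularization rate alpha = 0.001), as in the loss above."""
    ce = -np.mean(np.sum(y_true * np.log(y_prob), axis=1))
    l2 = alpha * sum(np.sum(w ** 2) for w in weights)
    return ce + l2
```

With a batch size of 32 (as stated below), `y_true` and `y_prob` would be 32 × K arrays of one-hot labels and softmax outputs, respectively.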

Results and Discussion
This paper uses the track geometry data from 2016 to 2018 to train and test the proposed detection framework. The inspections of low-temperature months (January to March) and high-temperature months (July to August) of each year are combined in pairs to construct an image dataset. Each inspection contains 315 km of track geometry data, which is much longer than the arching wavelength (less than 10 m). The inspection line passes through different substructures and external environments, so the samples cover waveform characteristics of various conditions. Moreover, the combinations of different inspections include different degrees of arching, from the initial stage to the later stage. Therefore, the dataset contains sufficient features and can meet the needs of model training and verification.
According to the data conversion and augmentation method in Section 3.1, a dataset including 11896 arching images is constructed. On this basis, five subsets are used for the research, as shown in Table 2. For the training and validation sets, each subset has a different pattern ratio and sample scale, and the proportion of training to validation is 5:1. Subset 2 is a balanced set with an equal proportion of each pattern, whereas the other subsets are imbalanced sets for comparison studies. Compared with Subset 2, all arching samples used for training and validation in Subset 1 are original images without the augmentation operation. Subset 3 and Subset 4 have the same training and validation sample size as Subset 2, but their ratios of "Arching" to "Normal" are 1:3 and 1:6, respectively. The total number of training and validation samples in Subset 5 is twice that of Subset 3, with the same pattern ratio. In addition, all subsets share the same testing set containing 11896 images, from which the samples in the training and validation sets are filtered out. The model is implemented on the Python TensorFlow (1.10.1) deep learning framework. Three error metrics, Precision, Recall, and F1-score, are mainly employed to optimize the model structure and assess the model performance. They are defined as follows.
Precision = TP/(TP + FP) × 100% (7)

Recall = TP/(TP + FN) × 100% (8)

F1-score = 2 × Precision × Recall/(Precision + Recall) (9)

where TP (i.e., True Positive) represents the number of arching samples that are correctly detected as "Arching"; FP (i.e., False Positive) represents the number of normal samples that are incorrectly detected as "Arching"; FN (i.e., False Negative) represents the number of arching samples that are incorrectly detected as "Normal". Precision reflects the proportion of actual arching samples among the detected "Arching" samples. Recall reflects the proportion of correctly detected samples among the actual arching samples. F1-score is the harmonic mean of Precision and Recall, which indicates the overall accuracy of the arching detection model.
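The three metrics can be computed directly from the confusion counts; the counts in the short sketch below are illustrative, not results from the paper:

```python
def detection_metrics(tp, fp, fn):
    # Precision, Recall and F1-score in percent, per Equations (7)-(9)
    precision = tp / (tp + fp) * 100
    recall = tp / (tp + fn) * 100
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# illustrative confusion counts (not taken from the paper)
p, r, f1 = detection_metrics(tp=980, fp=20, fn=20)
print(round(p, 1), round(r, 1), round(f1, 1))
```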

CNN Architecture Optimization
There are many hyperparameters in the CNN framework, including the number of convolutional-pooling layers, the convolutional kernel size, and the number of fully connected layers. To improve the model performance, we use the balanced set (Subset 2) to adjust and optimize the CNN architecture. According to the control variable method, eight cases are set for trial calculation. Each case is conducted in five runs, taking the detection ability on the validation set as the criterion. The mean values of the evaluation metrics are tabulated in Table 3. In the table, Time/Epoch denotes the training time required for each epoch. The convolutional layer parameters are represented as "kernel size @ number of kernels"; the fully connected layer parameters are represented as "FC number of hidden nodes in the layer". We can observe that as the number of convolution-pooling layers increases (Cases 1, 2, and 4), the values of the metrics improve, while the Time/Epoch changes only slightly. The model performance with different convolutional kernel sizes (Cases 3, 4, and 5) is also compared. It can be seen that as the kernel size increases, the error metrics first increase and then decrease, whereas the Time/Epoch grows considerably. Furthermore, different fully connected combinations are applied to the model (Cases 4, 6, 7, and 8). One fully connected layer with 128 nodes has the highest efficiency.
Comprehensively considering the detection accuracy and calculation time, we choose the architecture of Case 4, whose F1-score is 98.0%. In this model, the number of convolution-pooling layers is three, the convolutional kernel size is 5 × 5, and the numbers of kernels are 6, 12, and 24. Moreover, one fully connected layer with 128 nodes is adopted.
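As a concrete sketch, the layer sizes and trainable-parameter count of the Case 4 architecture can be walked through in plain Python. The 64 × 64 input size, "same" convolution padding, and 2 × 2 max pooling are assumptions made for illustration, since those details are not restated in this section:

```python
def case4_architecture(input_hw=64, in_channels=1, n_classes=2):
    # three 5x5 convolution stages with 6, 12 and 24 kernels, each followed
    # by 2x2 pooling, then one 128-node fully connected layer and a softmax
    # output; 'same' padding and the 64x64 input size are assumptions
    layers, params = [], 0
    hw, ch = input_hw, in_channels
    for n_kernels in (6, 12, 24):
        params += 5 * 5 * ch * n_kernels + n_kernels  # conv weights + biases
        hw //= 2                                      # pooling halves H and W
        ch = n_kernels
        layers.append((hw, hw, ch))
    flat = hw * hw * ch                               # flattened 1D vector
    params += flat * 128 + 128                        # fully connected layer
    params += 128 * n_classes + n_classes             # softmax output layer
    return layers, params

layers, params = case4_architecture()
print(layers)   # [(32, 32, 6), (16, 16, 12), (8, 8, 24)]
print(params)   # total trainable parameters under these assumptions
```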

Performance Evaluation with Various Datasets
Besides the optimization of the model architecture, constructing an excellent dataset is also key to improving the model performance. Accordingly, based on the determined detection framework, the model is run on each subset in Table 2 five times, and the error metrics are analyzed to evaluate the model performance. The quantity and proportion of each pattern in the testing result (one of five runs) are shown in Table 4. We can find that the proportions of arching and normal in the detection results are 49.8% and 50.2% (Subset 2), which is quite close to the ground truth. Compared with Subset 2, the detection results of the imbalanced sets deviate more from the ground truth, especially for Subset 1. To quantify the detection ability of the proposed method, the error metrics of each subset are calculated as depicted in Figure 11. For both the validation set and the testing set, the F1-score of each pattern can exceed 98%, showing the high detection accuracy of the proposed model. Specifically, the Precision, Recall, and F1-score of "Arching" on the testing set (Figure 11b) can reach 98.3%, 98.4%, and 98.4%, respectively. For the detection of the pattern "Arching", the cases using Subset 2 outperform the cases using other datasets, as shown in Figure 11a,b. Compared with the metrics of Subset 1, the metrics of Subset 2 are significantly improved, with the Recall on the testing set increasing by 6.2%.
This demonstrates that the data augmentation operation can effectively enhance arching feature extraction, thereby improving the model's fitting to the pattern "Arching".
In terms of the impact of data imbalance on model performance, Subset 2, Subset 3, Subset 4, and Subset 5 are used for comparative research. At the same training size, as the imbalance degree (i.e., the ratio of "Normal" to "Arching") increases, the overall detection performance on both the validation set and the testing set decreases. As the pattern ratio changes from 1:1 to 6:1, the F1-score of the validation set decreases from 98.0% to 92.5%, and the F1-score of the testing set decreases from 98.4% to 96.3%. This is mainly because the reduction of arching training samples leads to insufficient learning of arching features. At the same pattern ratio, the performance of the imbalanced set improves as the training size increases. However, its performance is still not as good as that of the balanced set, and a larger training size means that more time is required for model training.
For the detection of the pattern "Normal", since there are more normal samples in the imbalanced sets, the performance on the validation set is slightly improved from the balanced set to the imbalanced sets. Nevertheless, the performance on the testing set decreases to some extent, which may be due to the different pattern ratios of the validation and testing sets, as shown in Table 5. In the imbalanced sets, the number of arching samples in the validation set is far less than that in the testing set. Therefore, despite the high proportion of TN (i.e., True Negative, the number of normal samples correctly detected as "Normal") in the testing result, the FN increases sharply compared to the validation result, and its influence is non-negligible. Thus, the testing results and validation results of "Normal" present different rules.

Table 5. Detection results of the validation set and testing set (one of five runs).

Subset   Validation Set             Testing Set
         TP   TN    FP  FN          TP    TN    FP   FN
1        473  977   15  23          5349  5863  85   599
2        962  970   22  30          5823  5847  101  125
3        477  1472  16  19          5693  5884  64   255
4        262  1680  20  22          5467  5889  59   481
5        956  2948  26  36          5741  5904  44   207

Overall, the balanced set (Subset 2) has the best performance and can achieve accurate detection for both the "Arching" and "Normal" patterns. Therefore, in practice, we should construct balanced datasets to detect slab arching.
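As a cross-check, the per-subset testing Recall of "Arching" can be recomputed directly from the single-run counts in Table 5; these single-run values naturally differ slightly from the five-run means quoted in the text:

```python
def recall(tp, fn):
    # Recall = TP / (TP + FN) x 100%
    return tp / (tp + fn) * 100

# testing-set (TP, FN) counts for "Arching" from Table 5, keyed by subset
testing_counts = {1: (5349, 599), 2: (5823, 125), 3: (5693, 255),
                  4: (5467, 481), 5: (5741, 207)}
for subset, (tp, fn) in testing_counts.items():
    print(subset, round(recall(tp, fn), 1))
# Subset 2 (the balanced set) again shows the highest Recall
```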


Comparison Study with DNN
To verify the superiority of the proposed framework, we adopt a plain DNN for a comparison study. The DNN is a typical deep learning network and has been employed in data anomaly detection with high accuracy [59]. During the training process, the model parameters of the DNN are first adjusted to ensure the best performance. Then, the five subsets are adopted to train and test the model. In this part, we use the F1-score on the testing set and the Time/Epoch as indicators to compare the performance of the proposed model and the DNN, as shown in Figure 12. It can be observed that whether training with the balanced set or the imbalanced sets, the proposed CNN-based method achieves higher detection accuracy and shorter training time than the DNN model. To quantitatively analyze the superiority of the proposed model, a performance improvement index (PDI) is established as follows.

PDI = (I_proposed − I_DNN)/I_DNN (10)
where I_proposed and I_DNN represent the evaluation indicators of the proposed model and the DNN model, respectively. According to Equation (10), the improvements in accuracy and cost of the proposed model are calculated in Table 6. Compared with the DNN model, the F1-score of the proposed model increases by up to 3.4 times, and the training time is reduced by 0.6 times. This reflects the powerful spatial feature extraction capability of the CNN. By contrast, the performance of the DNN model is poor: it can hardly detect slab arching, especially in the cases using the imbalanced sets. This is because the arching features included in the track irregularities are very subtle and similar to normal waveforms, and the DNN changes the original positional relations between pixels by transforming the two-dimensional matrix (image) into a one-dimensional vector, which makes it difficult to learn sufficient characteristics. Moreover, all results of the balanced set are better than those of the imbalanced sets, again proving the advantages of a balanced training set.
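A minimal sketch of the PDI computation, assuming the relative-difference form (I_proposed − I_DNN)/I_DNN; the indicator values below are purely illustrative, not taken from Table 6:

```python
def pdi(i_proposed, i_dnn):
    # performance improvement index relative to the DNN baseline:
    # positive -> the indicator increased, negative -> it decreased
    return (i_proposed - i_dnn) / i_dnn

# illustrative values only (not results from the paper)
print(pdi(98.0, 49.0))   # F1-type indicator: improvement of 1.0 times
print(pdi(4.0, 10.0))    # Time/Epoch-type indicator: reduced by 0.6
```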

Detection Result Analysis
To further study the detection results of the proposed model, we analyze examples of detected and undetected arching in the testing result. As shown in Figure 13a, the proposed model can detect arching images with different shapes and degrees while avoiding the influence of data noise. In contrast, the DNN cannot detect these samples due to its poor ability to learn graphical features.
Some examples of undetected arching are shown in Figure 13b. Among the 5948 arching images in the testing set, 125 are not detected, accounting for 2.1% of the total. This error is within an acceptable range and can be further corrected through on-site rechecking. Analyzing the cause of the error shows that arching in the early stage is difficult to detect accurately: the arching features reflected in the LLI data are subtle because of the small deformation of the track slabs. Therefore, these samples are highly similar to normal samples, which increases the detection difficulty.

Conclusions
In this paper, a novel framework is proposed for slab arching detection using track geometry data. Based on human vision simulation, the method converts the original series data into 2D images and utilizes CNN for feature extraction and arching detection. The main conclusions are as follows.
(1) The alignment algorithm combining correlation analysis and DTW can correct the milepost deviation of different inspections effectively. Based on the aligned data, it is found that the longitudinal level irregularities can reflect slab arching, and the data wavelengths are close to the length of the track slab.
(2) The proposed detection framework can accurately detect arching damage, whose Precision, Recall, and F1-score can reach 98.3%, 98.4%, and 98.4%, respectively. Moreover, the data augmentation operation and balanced set establishment can help the model extract arching features more adequately and improve the performance. As the pattern ratio changes from 6:1 to 1:1, the F1-score can increase from 92.5% to 98.0%.
(3) The proposed framework outperforms the plain DNN model, showing excellent spatial characteristics learning ability. Compared to the plain DNN, the F1-score of the proposed model increases by up to 3.4 times, and the training time is reduced by 0.6 times.
In practice, engineers can use the proposed vision-based method to accurately detect slab arching, quickly locate damaged sections, and support intelligent maintenance decisions. Additionally, the method can be extended to the detection of other track structure damages based on series data.