A Review of Deep Learning Applications for Railway Safety

Abstract: Railways rapidly transport many people and goods nationwide, so railway accidents can cause immense damage. However, railway infrastructure is so complex that its maintenance is challenging and expensive. Therefore, the use of artificial intelligence for railway safety has attracted many researchers. This paper examines artificial intelligence applications for railway safety, focusing mainly on deep learning approaches. We first introduce deep learning methods widely used for railway safety. We then investigate and classify earlier studies into four representative application areas: (1) railway infrastructure (catenary, surface, components, and geometry), (2) train body and bogie (door, wheel, suspension, bearing, etc.), (3) operation (railway detection, railroad trespassing, wind risk, train running safety, etc.), and (4) station (air quality control, accident prevention, etc.). We present fundamental problems and popular approaches for each application area. Finally, based on the literature review, we discuss the opportunities and challenges of artificial intelligence for railway safety.


Introduction
Artificial intelligence, which began in the 1950s with the question "Can computers think?", has become a modern concept that means "a computing-based technology system that automates intelligent tasks generally performed by ordinary people." [1][2][3]. Artificial intelligence has been widely used in various fields and has advanced sensing and data processing technologies. This paper presents an overview of deep learning applications for railway safety. We analyzed earlier studies over four representative application areas: (1) railway infrastructure, (2) train body and bogie, (3) operation (train running), and (4) station.
There have been many studies monitoring railway infrastructure, such as catenary and rail surfaces. Railway defects can occur for diverse reasons, such as long-term accumulated operation, rain, sunlight, wind, etc. Regular inspections are essential since rail defects can cause significant accidents. Recently, many studies have been conducted to detect defects in railways and related parts with rapidly developed artificial intelligence technology to prevent railway accidents. Diverse data sources have been used for detecting railway defects, such as images [4,5], accelerometers [6,7], and ultrasonic sensors [8,9].
There have also been many studies on detecting train defects using artificial intelligence. A railway train has a complex structure that combines various parts, such as vehicle wheels, split pins, tram lines, and pantographs. The rate of corrosion or loss of durability differs from part to part because each part has a different function and operating environment. Fault detection and prediction are essential because even small or early-stage defects (cracks, cuts, aging, etc.) on trains can pose severe threats to passenger safety. These issues can be even more critical for high-speed trains, as their parts are exposed to harsher environments than those of other trains.
Additionally, safety accidents such as railroad trespassing or derailment can occur during train operation. Because such accidents can lead to many casualties, various artificial intelligence technologies for train operation safety have been studied. For example, segmentation methods based on deep learning models have been proposed to detect rail tracks during operation or to check for obstacles on railways (e.g., people, cars). In addition, there have been attempts to quantify wind risk or train running safety to control and manage further operation.
Finally, many studies have used artificial intelligence for railway station safety. A railway station is dynamic and complex due to the presence of many people, including passengers and station staff, and trains that stop and depart quickly. Therefore, it is necessary to prevent and deal with various safety accidents that may occur in stations. For example, diverse artificial intelligence models were developed to quickly identify three different types of safety incidents (fall, slip, and trip) [10] and to monitor air quality in a station. Furthermore, many studies have modeled the station as a dynamic and complex system.
This paper provides a comprehensive view of railway safety by covering the four representative areas. Table 1 shows the railway safety areas addressed by earlier reviews. While most review papers have focused on a specific problem, few studies covered various areas. For example, Tang et al. [11] and Liu et al. [12] covered the four application areas we covered. However, these studies aimed to give an overview of artificial intelligence for the railway, so some specific safety issues were rarely covered. This review includes studies about railway safety more widely. For example, it further covers safety issues related to the catenary in the railway infrastructure, the train's door and suspension, and wind risk during operation. In addition, although many studies have focused on visual inspection methods based on image data, this study includes diverse data types. For example, various sensor data (e.g., vibration, current, acoustic emission signals) and image-shaped data (e.g., 2D-camera images and laser ultrasound scanning data) are explained in this review.
Table 1. Railway safety areas addressed by earlier reviews. Reviews compared: Tang et al. [11] (2022), Liu et al. [12] (2019), Ghofrani et al. [13] (2018), Hu et al. [14] (2021), Sedghi et al. [15] (2021), Yin et al. [16] (2020), Wen et al. [17] (2019), Chenariyan et al. [18] (2019), and this study (2022). Note 1 (coverage of each application area): almost all subdomains listed in Sections 4-7, about half of the subdomains, or rarely covered (×). Note 2 (other data types covered): more than two other types, one type, or none (×).
The remainder of this review is organized as follows. Section 2 provides an overview of deep learning approaches that have been used for railway safety, classified according to their data source and task. Section 3 describes the methodology for searching and analyzing related studies. Sections 4 to 7 explain the four application domains in railway safety (i.e., railway infrastructure, train, operation, and station) and representative studies. Lastly, Section 8 concludes the paper by discussing the opportunities and challenges of artificial intelligence for railway safety.

Overview of Deep Learning Approaches
A deep neural network (DNN) is a machine learning model that emulates the neurons of the brain. A DNN can therefore extract patterns and features from large datasets in a manner loosely analogous to the human brain. A DNN is constructed of several layers (e.g., input layer, hidden layers, output layer). The input layer receives the input values, and the output layer produces the output values. There can be many hidden layers between the input and output layers; the greater the number of hidden layers, the deeper the neural network. In a hidden layer, the following operations are performed:
$$u = Wx + b, \qquad z = f(u)$$
where x is the input vector. W and b are the weight matrix and the bias term, respectively, and these are updated through training. f(u) is the activation function that makes the neural network nonlinear. z is the output vector of the hidden layer. Diverse deep learning architectures have been studied by extending the DNN structure. They can be distinguished according to their data types and tasks.
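The hidden-layer operation described above, an affine map of the input followed by an activation, can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `dense_layer`, the choice of tanh as the activation f, and the hand-picked weights are assumptions, not taken from any reviewed study.

```python
import math

def dense_layer(x, W, b, f=math.tanh):
    """Compute z = f(Wx + b) for one hidden layer.

    x: input vector (list of floats)
    W: weight matrix (list of rows, one row per hidden unit)
    b: bias vector (one entry per hidden unit)
    f: activation function (tanh is an illustrative choice)
    """
    # Pre-activation u = Wx + b, one entry per hidden unit.
    u = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    # Element-wise nonlinearity gives the layer output z.
    return [f(u_i) for u_i in u]

# Hypothetical layer with 2 inputs and 3 hidden units.
W = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
b = [0.0, 0.1, -0.1]
z = dense_layer([1.0, 2.0], W, b)
```

Stacking several such calls, each feeding its output z to the next layer, gives the deeper networks discussed in the rest of this section.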

Image Data
Image data are represented in a two-dimensional structure as a numerical matrix of points called pixels. In a grayscale image, each pixel stores an intensity value ranging from 0 to 255, where 0 is black and 255 is white. In a color image, color is expressed using the intensities of red, green, and blue in three RGB channels. Before processing images using a deep learning model, image preprocessing, such as image alignment, cropping, and adjustment (e.g., brightness and contrast), can be conducted.
A convolutional neural network (CNN) is a representative deep neural network used in image data processing. A CNN can extract image patterns with spatial structures because filters composed of multiple weight values move spatially over the input. The following equation represents the convolution in a CNN:
$$Z_{i,j} = (X * K)_{i,j} = \sum_{m}\sum_{n} X_{m,n}\, K_{i-m,\,j-n}$$
where i, j are the indices of the output matrix, and m, n are the indices of the input matrix. * denotes convolution. X is the input matrix, and X_{m,n} is the value in row m and column n of X. K is the kernel (filter) matrix. Z is the output matrix of the convolution layer. A CNN performs the convolution operations commonly used in image or signal processing. The CNN moves a mask, also called a kernel, filter, or window, over the input data and performs convolution operations to extract data features. Because this approach captures the relationship between one element and its neighboring elements, CNNs are suitable for processing data with grid structures, such as images. CNNs have shown performance superior to humans in some complex image processing problems and have also contributed significantly to image retrieval services, autonomous vehicles, and automatic image classification systems.
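To make the sliding-filter idea concrete, the following is a minimal sketch of a "valid" 2D convolution in plain Python. The function name `conv2d_valid`, the symbol K for the kernel, and the toy matrices are assumptions for illustration. The kernel is flipped to match the mathematical definition of convolution; many deep learning libraries omit the flip and compute cross-correlation instead, which makes no practical difference when the kernel weights are learned.

```python
def conv2d_valid(X, K):
    """'Valid' 2D convolution of input matrix X with kernel K.

    X and K are lists of rows. The output has size
    (H - kh + 1) x (W - kw + 1), i.e. no padding is applied.
    """
    kh, kw = len(K), len(K[0])
    # Flip the kernel in both axes (true convolution, not cross-correlation).
    flipped = [row[::-1] for row in K[::-1]]
    out_h, out_w = len(X) - kh + 1, len(X[0]) - kw + 1
    Z = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Weighted sum of the window whose top-left corner is (i, j).
            Z[i][j] = sum(X[i + m][j + n] * flipped[m][n]
                          for m in range(kh) for n in range(kw))
    return Z

# Toy 3 x 3 input and 2 x 2 kernel (hypothetical values).
X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
K = [[1, 0],
     [0, -1]]
Z = conv2d_valid(X, K)
```

Each output entry depends only on a small neighborhood of the input, which is why the same small set of weights can detect a local pattern anywhere in the image.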

Time-Series Data
Sequential data refer to data in which the elements of the data set have a certain order. Sequential data include numerous kinds of data with temporal or ordered properties, such as language, stock quotes, electrocardiogram (ECG) signals, seismic waves, and DNA sequences. In railway safety, sensor data (e.g., vibration) and video (e.g., CCTV) are prevalent time-series data types.
The recurrent neural network (RNN) was developed to deliver information from the previous time step to the next time step through the recurrent edge, which is the edge connecting the hidden nodes across time steps. In other words, an RNN hidden layer can remember important information about earlier inputs, which allows it to predict what will come next.
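The key operation in an RNN can be described by the following formula:
$$h_t = f(U x_t + W h_{t-1} + b), \qquad o_t = V h_t + c$$
where x_t, h_t, and o_t are the input, hidden state, and output at time step t, and f is the activation function. U, V, and W are the weight matrices for the input-to-hidden, hidden-to-output, and hidden-to-hidden connections, respectively, which are updated through training. b and c are bias vectors.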
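The recurrence can be sketched as a single step function applied repeatedly over a sequence. This is a minimal sketch under stated assumptions: tanh is assumed as the hidden activation, the names `matvec` and `rnn_step` are hypothetical, and the small weight matrices are hand-picked rather than trained.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_step(x_t, h_prev, U, W, V, b, c):
    """One RNN step: h_t = tanh(U x_t + W h_prev + b), o_t = V h_t + c."""
    a = [ux + wh + bi
         for ux, wh, bi in zip(matvec(U, x_t), matvec(W, h_prev), b)]
    h_t = [math.tanh(ai) for ai in a]
    o_t = [vh + ci for vh, ci in zip(matvec(V, h_t), c)]
    return h_t, o_t

# Hypothetical network: 1 input, 2 hidden units, 1 output.
U = [[0.5], [-0.5]]              # input-to-hidden
W = [[0.1, 0.0], [0.0, 0.1]]     # hidden-to-hidden
V = [[1.0, 0.0]]                 # hidden-to-output
b = [0.0, 0.0]
c = [0.0]

h = [0.0, 0.0]                   # initial hidden state
outputs = []
for x in [[1.0], [0.0], [-1.0]]:
    h, o = rnn_step(x, h, U, W, V, b, c)
    outputs.append(o[0])
```

At the second step the input is zero, yet the output is nonzero because the hidden state still carries information from the first step; this is the memory effect that the recurrent edge provides.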
Long short-term memory (LSTM) improves the original RNN structure by adding gates that select which inputs and outputs to pass at each time step, so that the contextual dependence of sequential data (e.g., long-term dependence) is properly captured. A gated recurrent unit (GRU) is also a variant of the RNN, like LSTM, but has fewer parameters.

Classification
Classification is a type of supervised learning in which a model learns the relationship between existing data and their categories and then determines the category of newly observed data. For images, it is used to assign an appropriate label (or class) to the object in a given input image. For example, a classification model can be trained to recognize the digit in a handwritten image.
There are various types of image classification models. Visual geometry group (VGG) is a relatively early classification model developed to determine how the depth (number of layers) of a neural network affects performance [19]. VGG has a structure that combines convolutional layers for feature representation and fully connected layers for classification. Small 3 × 3 filters are used to reduce the number of weights to be learned, so that the depth of the model can be increased efficiently.
Residual Net (ResNet) [20] is a deep learning network with up to 152 layers. ResNet introduced a new concept called the residual block. Unlike previous networks, in which each block aimed to produce the desired output directly, a ResNet block is designed to learn the residual (the difference between the desired output and the input), which is then added back to the input through a skip connection. This approach makes it possible to preserve previously learned information and to consider only additionally learned information. DenseNet [21] is similar to ResNet, but instead of adding, it concatenates the outputs of the preceding layers to form the input of the next layer.
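The difference between the two connection styles can be shown with a small sketch. The function names `residual_block` and `dense_concat`, and the use of an arbitrary callable F standing in for the learned layers, are assumptions for illustration; real blocks use convolutional layers and normalization.

```python
def residual_block(x, F):
    """ResNet-style block: return x + F(x).

    F models only the residual; the skip connection adds the
    unchanged input back, so F(x) = 0 reproduces the input exactly.
    """
    return [xi + fi for xi, fi in zip(x, F(x))]

def dense_concat(x, F):
    """DenseNet-style connection: concatenate x with the new features F(x).

    Earlier features are kept as-is and passed on alongside the new ones.
    """
    return list(x) + list(F(x))

# Hypothetical "learned" transformation that doubles each feature.
double = lambda v: [2.0 * vi for vi in v]
added = residual_block([1.0, 2.0], double)
stacked = dense_concat([1.0, 2.0], double)
```

With addition the feature dimension stays fixed, whereas with concatenation it grows with every layer, which is the main structural trade-off between the two designs.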
Res2Net [22] is a structure that combines ResNet with DenseNet and is a classification model that improves performance by configuring hierarchical residual-like connections within a bottleneck residual block. Res2Net represents multi-scale features at a granular level and increases the range of receptive fields in each network layer, rather than representing multi-scale features layer by layer.
Finally, Inception is a neural network structure designed to address problems that arise when classification models of deep and wide structures learn [23]. The number of channels was reduced while maintaining the input form using a convolutional layer of a 1 × 1 filter and matrix operations were densely performed to increase the computational efficiency. In addition, Inception uses an auxiliary layer that delivers backpropagation by calculating the intermediate learning error, to convey information to the deep layer during learning, and batch normalization, to prevent overfitting that frequently occurs in deep learning.

Object Detection
Object detection and localization have been popular tasks for railway safety. Localization is the task of indicating the location of a specific object in an image with a bounding box, and object detection performs both classification and localization on multiple objects. Object detection methods can be categorized into single-stage and two-stage methods. A single-stage method detects the potential locations of the target objects and classifies them with a single network. A two-stage method first performs region proposal, extracting candidate areas where a target object may be located, and then selects and classifies those regions.
Regions with CNN features (R-CNN) is a two-stage method that performs region proposal and object classification separately. R-CNN first extracts regions using the selective search algorithm and then uses a pre-trained CNN model to extract image features for a classifier that distinguishes the object and a regressor that localizes it. In R-CNN, learning and inference are slow during region proposal and image feature extraction because about 2000 regions are processed per image. Faster R-CNN resolves the bottleneck that occurs when proposing regions [24]. It first extracts image features from an input image and then performs the region proposal. Faster R-CNN uses a region proposal network, a deep learning network runnable on a GPU, to replace the selective search step in region proposal extraction.
You Only Look Once (YOLO) is a single-stage object detection method that uses a single neural network both to classify target objects and to detect their potential locations [25]. In YOLO, convolutional layers extract feature maps, and fully connected layers then predict the bounding boxes and class probabilities. YOLO divides the input image into an S × S grid, and bounding box coordinates and confidence scores are predicted for each grid cell. Since the whole detection pipeline is a single network, YOLO can be optimized end-to-end directly on detection performance. The single-shot detector (SSD) [26] is also a single-stage method for object detection. This method begins with the idea that a single feature map may be insufficient to detect objects of various sizes. SSD predicts bounding boxes using a pyramidal feature hierarchy instead of the image grid used in YOLO. The pyramidal feature hierarchy consists of feature maps extracted from various layers of a single deep neural network. Each convolutional layer has a different receptive field size and can provide image features at a different scale.
In the object detection problem, the number of objects in an image is generally small, so a class imbalance problem easily arises, with very few object areas compared to the background area. RetinaNet [27] is a model that applies focal loss, which is designed to focus training on hard samples by lowering the weights of easy samples. In addition, both local and global features are utilized by adding a spatial attention map block (SAMB) and a channel weight map block (CWMB) in the image feature extraction process. This allows RetinaNet to weaken the influence of the background in the object detection process and focus on important features.
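How focal loss down-weights easy samples can be seen by comparing it with ordinary cross-entropy for a single binary prediction. This is a minimal sketch: the function names are hypothetical, and gamma = 2 and alpha = 0.25 are commonly used illustrative values rather than settings taken from the reviewed studies.

```python
import math

def cross_entropy(p, y):
    """Binary cross-entropy for predicted probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p   # probability assigned to the true class
    return -math.log(p_t)

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).

    The factor (1 - p_t)^gamma shrinks the loss of well-classified
    (easy) samples, so hard samples dominate the training signal.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# Easy positive (p = 0.9) versus hard positive (p = 0.1).
easy_ratio = focal_loss(0.9, 1) / cross_entropy(0.9, 1)
hard_ratio = focal_loss(0.1, 1) / cross_entropy(0.1, 1)
```

With these values the easy sample keeps a much smaller fraction of its cross-entropy loss than the hard sample, and setting gamma to 0 recovers a class-weighted cross-entropy.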

Segmentation
Segmentation is a method of extracting an object of interest from an image in units of pixels. By giving each pixel a label, it is possible to know which pixel belongs to which object. Segmentation is necessary for identifying the shapes of target objects in detail, such as in traffic safety, autonomous driving services, and the reading of magnetic resonance imaging (MRI) scans. Depending on the purpose of use, segmentation can be divided into semantic segmentation and instance segmentation. Semantic segmentation assigns a class label, such as person or car, to every pixel in an image, so objects of the same class have the same label. In contrast, instance segmentation identifies each object separately, even if the objects belong to the same class.
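Because segmentation assigns a label to every pixel, predictions are commonly scored pixel by pixel, for example with the IoU and F1 scores reported by segmentation studies later in this review. The sketch below computes both for binary masks; the function name `iou_and_f1` and the toy masks are assumptions, and returning 1.0 when both masks are empty is one possible convention.

```python
def iou_and_f1(pred, truth):
    """Pixel-wise IoU and F1 score for two binary masks (lists of rows)."""
    tp = fp = fn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1          # defect pixel correctly predicted
            elif p and not t:
                fp += 1          # predicted defect, actually background
            elif t and not p:
                fn += 1          # missed defect pixel
    union = tp + fp + fn
    iou = tp / union if union else 1.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 1.0
    return iou, f1

# Toy 2 x 2 masks: 1 marks a defect pixel, 0 marks background.
pred = [[1, 1],
        [0, 0]]
truth = [[1, 0],
         [1, 0]]
iou, f1 = iou_and_f1(pred, truth)
```

The F1 score is never lower than the IoU for the same masks, so the two numbers should not be compared directly across studies.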
You Only Look At CoefficienTs (YOLACT) is a real-time model that improves the processing speed of instance segmentation by omitting the explicit localization step [28]. Instead, the model divides the segmentation process into two parallel, rather than sequential, subtasks. The first subtask generates a dictionary of non-local prototype masks over the entire image, and the second predicts a set of linear combination coefficients for each instance. Then, YOLACT produces instance masks by linearly combining the prototypes with the mask coefficients.
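The final assembly step, combining prototype masks with per-instance coefficients, can be sketched as follows. The function name `assemble_mask`, the toy prototypes, and the 0.5 threshold are assumptions for illustration; the sketch applies a sigmoid to the linear combination and omits the cropping to the predicted box that a full pipeline would perform.

```python
import math

def assemble_mask(prototypes, coeffs, threshold=0.5):
    """Build one instance mask from k prototype masks and k coefficients.

    prototypes: list of k masks, each a list of rows (H x W)
    coeffs:     list of k mask coefficients for one instance
    Returns a binary H x W mask.
    """
    h, w = len(prototypes[0]), len(prototypes[0][0])
    mask = []
    for i in range(h):
        row = []
        for j in range(w):
            # Linear combination of the prototypes at pixel (i, j).
            s = sum(c * p[i][j] for c, p in zip(coeffs, prototypes))
            prob = 1.0 / (1.0 + math.exp(-s))   # sigmoid
            row.append(1 if prob > threshold else 0)
        mask.append(row)
    return mask

# Two toy 2 x 2 prototypes; coefficients may be positive or negative.
prototypes = [[[1, 0], [0, 0]],
              [[0, 0], [0, 1]]]
mask = assemble_mask(prototypes, [4.0, -4.0])
```

Because the prototypes are shared by all instances and only the small coefficient vector differs, the per-instance cost is a single weighted sum, which is what makes the approach fast.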

Feature Extraction
Feature extraction transforms raw data into numerical features that are more useful for the main task (e.g., classification, object detection). This step often affects model performance, helping to reduce the dimensionality of the data and better represent latent patterns. Principal component analysis (PCA) is one of the traditional algorithms for feature extraction. PCA reduces multidimensional data by selecting the axis along which the projected data have the largest variance as the first principal component, then selecting the orthogonal axis with the next-largest variance as the second principal component, and linearly projecting the data onto these axes. Other known techniques include linear discriminant analysis (LDA) [29], canonical correlation analysis (CCA) [30], singular value decomposition (SVD) [31], isometric feature mapping (ISOMAP) [32], and locally linear embedding (LLE) [33].
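As an illustration of how PCA finds the axis of largest variance, the sketch below estimates the first principal component with power iteration on the covariance matrix. This is one simple way to compute it, chosen here to avoid external libraries; the function name and the toy data are assumptions, and practical implementations usually rely on an eigendecomposition or SVD routine instead.

```python
def first_principal_component(data, iters=100):
    """Estimate the first principal component of `data` (list of rows).

    Returns (v, projections): the unit-length direction of largest
    variance and the coordinate of each centered sample along it.
    Assumes the data are not all identical and that the starting
    vector is not orthogonal to the leading direction.
    """
    n, d = len(data), len(data[0])
    mean = [sum(row[k] for row in data) / n for k in range(d)]
    centered = [[row[k] - mean[k] for k in range(d)] for row in data]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1)
            for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        # Repeated multiplication by cov converges to its top eigenvector.
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    projections = [sum(r[k] * v[k] for k in range(d)) for r in centered]
    return v, projections

# Toy 2D data lying exactly on the line y = 2x.
data = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
v, projections = first_principal_component(data)
```

For this toy data, the single coordinate along v preserves all of the variance, so the two original features can be replaced by one without loss.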
An autoencoder (AE) is a deep neural network that can be used for feature extraction [34]. AEs are used for anomaly detection, which determines whether a sample is normal or abnormal, and for denoising, which recovers the original data by removing noise added to the data. An AE is trained in an unsupervised manner to reproduce its input at its output. However, since the dimension of the hidden layer is designed to be lower than that of the input and output layers, the AE learns a compact representation that effectively describes the input data. The restricted Boltzmann machine (RBM) is also a deep learning model for feature learning that works by finding better representations of the input values [35]. The RBM consists of a visible layer, which is the input layer, and a hidden layer in which feature values are learned. The deep belief network (DBN) is a probabilistic generative model built from layers of pre-trained RBMs [36].
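The use of an autoencoder for anomaly detection rests on its reconstruction error: samples resembling the training data are reconstructed well, while unusual samples are not. The sketch below uses the simplest possible case, a linear autoencoder with one hidden unit and tied weights. The function name and the hand-set weight vector w, which stands in for trained weights, are assumptions for illustration only.

```python
def reconstruction_error(x, w):
    """Squared reconstruction error of a one-unit linear autoencoder.

    Encoder: z = w . x          (2D input compressed to one number)
    Decoder: x_hat = z * w      (tied weights, w assumed unit-length)
    """
    z = sum(wi * xi for wi, xi in zip(w, x))
    x_hat = [z * wi for wi in w]
    return sum((xi - hi) ** 2 for xi, hi in zip(x, x_hat))

# Hand-set weights standing in for an autoencoder trained on samples
# that lie along the direction (0.6, 0.8).
w = [0.6, 0.8]
normal_score = reconstruction_error([3.0, 4.0], w)     # along the direction
abnormal_score = reconstruction_error([4.0, -3.0], w)  # perpendicular to it
```

A threshold on this score separates normal from abnormal samples; deep autoencoders follow the same principle with nonlinear encoders and decoders.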

Methodology
In this review, we aim to (1) identify problems in each railway safety category and their solutions using deep learning models, (2) evaluate the performance of the proposed deep learning models in comparison with previous models, and (3) summarize the remaining limitations of the proposed methods and additional issues to be dealt with afterward. To do this, we searched Google Scholar for papers including the following keywords: "railway" OR "deep learning" OR "defect" OR "railroad" OR "safety" OR "artificial intelligence". To investigate the safety issues of each category in depth, category-specific keywords were also considered (e.g., "catenary" OR "surface"). In addition, all papers cited by the key reference papers were examined for inclusion in the review.
We checked the abstracts of all selected papers and excluded papers that were not related to railway safety or did not apply deep learning techniques to solve problems. When a paper's relevance could not be determined from the abstract, the introduction and methodology sections were additionally examined. Cross-checking was performed three times by independent authors. Four of the authors each checked a part of the review paper. Next, two authors individually examined the whole paper without discussion. If two papers had overlapping content, the paper with the higher number of citations was selected. When the data source was not clearly stated in a paper, it was classified as Custom. For model performance, the metrics and values with the best results were selected. The list of papers included in the review was last updated in September 2022. The details of the reviewed papers and the performance metrics are described in Appendix A, Tables A1 and A2, respectively.

Catenary
The catenary, which is responsible for supplying electricity to trains, is a critical facility in the electric railway system. Therefore, defects in a catenary can pose a severe threat to railway safety. A human inspector usually needs to shut down the train power and climb onto the vehicle to examine the state of the catenary, and this procedure can cause many safety accidents. Prior studies have therefore applied computer vision technologies to detect catenary defects quickly and early.
Kang et al. [37] focused on detecting defects in the insulator, which is a catenary component. Figure 1 shows the proposed workflow of catenary defect detection. Their proposed framework captured images of areas where insulators are usually located using fixed-view cameras. Next, a Faster R-CNN model localized the insulator in the input image (i.e., object detection). Finally, two other deep learning models were implemented to examine the extracted images of the insulator. One model was a deep learning classifier with a CNN-DNN structure that outputs the classification score of the input image. The other model was an autoencoder that outputs the abnormal score of the insulator. The abnormal score determines whether and how severely the insulator is damaged. Actual data from the Hefei-Fuzhou high-speed railway line were used for evaluation. The results showed that the proposed framework effectively mitigates the small data problem and the complexity of processing catenary images, both of which can degrade diagnosis performance.
There have been many studies on defect detection in a dropper, which connects a catenary and a messenger wire. Guo et al. [38] proposed a method to detect defects in a dropper from image data using deep learning models based on Faster R-CNN and fully connected layers. They proposed a balanced attention feature pyramid network (BA-FPN) that integrates multiple-level features into the original Faster R-CNN structure. This enhances detection performance by extracting useful image features from the small areas of the entire catenary image where the dropper is placed. Experimental results on the VOC 2012 and MSCOCO 2014 datasets showed that the proposed models achieved higher performance than conventional detection models (86.8% at mAP@0.5 and 83.9% at mAP@0.7).
The clevis is another catenary component that is located between the registration arms and the cantilever. Faster R-CNN has also been widely used to detect clevis defects. Han et al. [39] proposed a deep learning model that focuses on image features from the areas surrounding the clevis, as shown in Figure 2. This idea is based on the heuristic insight that the catenary has a typical structure, so image features useful for clevis crack diagnosis are likely to be found in specific areas. The evaluation results reveal that the proposed model has higher crack detection performance than existing models, such as Faster R-CNN and YOLO. In addition, the proposed model was robust to the size, texture, and grayscale transformations that resulted from changes in shooting distance, angle, and illuminance.
The split pin combines and supports diverse components in the catenary. Wang et al. [40] studied a deep learning framework that determines three states of the split pin (missing, loosening, and normal) according to the location of the joint. First, the proposed framework performed an object detection task based on YOLO v3 to find split pins at five joints extracted from the entire catenary image. Next, semantic segmentation was performed in three parts (head, body, and tail) using DeepLab V3+ [41][42][43]. Finally, the classification model determined the state of the split pins. The evaluation was conducted with 2670 catenary images, including 21,472 split pins, and the split pin defects were detected with very high accuracy (98.72%).
Chen et al. [44] studied an image-based deep learning model to check for damage in the current-carrying ring of a catenary. RetinaNet [27] was used to detect and classify defects for fault diagnosis. RetinaNet was trained with the focal loss, which mitigates the imbalance between classes of training data, instead of the cross-entropy loss. Additionally, the model contains a spatial attention map (SAM) and a channel weight map (CWM) to harness the spatial characteristics of each feature map and consider patterns across channels. Performance tests were conducted with catenary images taken at various locations, and the proposed model achieved the best performance in diagnosis accuracy.

Rail Surface
Scouring, breaking, and deficient fastening in bolts and sleepers are typical defects on the rail surface. Figure 3 presents several types of rail surface defects. Santur et al. [45] proposed a machine learning model based on image features of defects extracted using PCA, kernel principal component analysis (KPCA), singular value decomposition (SVD), and histogram match (HM). Faghih-Roohi et al. [46] adopted deep convolutional neural networks to determine the defect types of surface images (normal, weld, L-squat, M-squat, S-squat, and joint). They designed and compared three CNN models (small, medium, and large), each with a different structure (number of layers, number of filters, sizes of filters, activation functions). The large model outperformed the small and medium models and showed about 93% accuracy in detecting surface defects.
Many studies have developed object detection methods for rail surface images. For example, Yanan et al. [47] developed a fast and accurate defect detection model for rail surfaces using YOLO v3, which has the strength of accurately and quickly detecting small-sized targets. The detection model receives 416 × 416 images and divides them into boxes of various sizes, calculates normalized coordinate values of defects depending on their location inside the box, predicts defect inclusion scores for each box, and evaluates reliability. This method achieved a high detection rate (97%) in 0.15 s. Similarly, Yuan et al. [48] developed a model that detects the location of defects in existing rail surface images. Their proposed model consisted of a MobileNetV2 for extracting image features and a YOLOv3 module for defect localization. Their performance test results confirmed that the model increased the mean average precision (mAP) by more than 4%. Shang et al. [49] presented a novel pipeline consisting of two stages. In the first stage, an input image is localized to extract rail areas. The second stage detects defect areas using a deep learning model, a fine-tuned Inception3.
Some studies proposed deep learning methods to extract defects in more detail using image segmentation. Kim et al. [5] adopted image segmentation to distinguish specific areas of defects on rail surfaces. The defective part was labeled pixel by pixel to train the segmentation model. The proposed model was implemented based on the VGG-19 structure and showed IoU and F1 scores exceeding 90%. Liang [50] proposed SegNet, a deep convolutional neural network, to detect defects on rail surfaces. As shown in Figure 4, SegNet comprises feature extraction (FE) and feature construction (FC). This structure can learn rail surface types and their distributions from a given training dataset.
Jiang et al. [8] proposed a technique for detecting rolling-contact fatigue (RCF), which is a failure or material removal driven by crack propagation caused by a near-surface alternating stress field. Specifically, this study used laser ultrasound scanning data to detect RCFs. To extract features from the ultrasonic signals, the wavelet packet transform (WPT), which decomposes signals into different frequency bands, and KPCA, which reduces the correlation between the defect features, were used. A support vector machine (SVM) model performed the final detection based on the features.
A squat is an RCF defect that often leads to rail breaks. Yuan et al. [51] proposed an algorithm to automatically detect the position of rail squats using vehicle axle box acceleration signals. The convolutional variational autoencoder (CVAE), trained in an unsupervised manner, extracts critical features from the signals, and the one-class SVM (OCSVM) detects rail squats as abnormal conditions. In their study, the proposed method was shown to be robust to signal noise and train speed variability.
Suwansin and Phasukkit [9] analyzed acoustic emission signals from fatigue cracks on rails and developed a non-destructive localization model that determines the presence and location of defects without damaging railways. A DNN structure consisting of three hidden layers used the hyperbolic tangent function to account for the transient nature of acoustic emission signals. The model processed the acoustic emission signals and classified them into breaks at the head, web, or foot of the steel rail.
Shebani and Iwnicki [52] developed an artificial neural network model that predicts wheel and rail wear. Specifically, nonlinear autoregressive models with exogenous input neural networks (NARXNN) were developed for wheel and rail wear prediction. Wheel and rail profiles, plus load, speed, yaw angle, and the first and second derivatives of the wheel and rail profiles, were used as inputs to the neural network, and the output was the wheel and rail wear. Their laboratory tests confirmed the feasibility of the proposed wear prediction methods for realistic wheel and rail profiles and materials.
Studies have also been conducted to facilitate the acquisition and utilization of the rail surface data necessary for artificial intelligence models. Wu et al. [53] attempted to develop a detection framework that is robust to variations in the quality and sampling rate of rail surface images. Unmanned aerial vehicles (UAVs), capable of moving at speeds ranging from 2 to 15 m/s, were used to collect rail images. In addition, the proposed model used enhanced residual blocks for time and memory optimization in defect detection. Two image datasets, from high-speed train sections between Beijing and Shanghai and from Class I freight lines in South Carolina, were used for training and testing the model.
Zhang et al. [54] proposed an efficient learning method based on line-level labels. The use of line-level labels can decrease the time and effort needed to collect data compared to pixel-level labels. In addition, this method can lower the model complexity and is more suitable for small data. The proposed model converted color information into numeric vectors using a 1D-CNN and LSTM, and detected rail surface defects line by line.
Hajizadeh et al. [55] focused on the data imbalance in detecting rail surface defects. Most rail image datasets contain a far larger proportion of normal-state data than abnormal data with defects. In addition, many captured images are not labeled to indicate whether they contain defects or not. To address this, Hajizadeh et al. [55] proposed semi-supervised learning methods to detect defects on rail surfaces. The proposed semi-supervised learning methods handled data imbalance better than other methods, such as undersampling and oversampling.
Santur et al. [56] addressed degraded image quality due to substances, such as dust or oil, which often cause false-positive cases. A high-resolution camera can also help deal with such substances but adds time and cost to the railway maintenance process. Santur et al. [56] presented hardware and software architectures to perform railway surface inspection using a three-dimensional (3D) laser camera and deep learning. The use of 3D laser cameras in the railway inspection process provided high accuracy rates in real time.
Falamarzi et al. [57] utilized train acceleration data to estimate the degradation of tram rails. Machine learning algorithms (Random Forest, SVM, and ANN) were trained and tested using Melbourne tram network data. The study results revealed that the proposed method allows for cost-effective maintenance strategies by reducing the time and effort in collecting data for evaluation.

Rail Components
Defect inspection of rail components (e.g., spikes that secure rails to ties and clips that press the bottom of the rail down onto concrete ties) commonly depends on the judgement of individual human inspectors. Many studies have used deep learning models to improve on manual rail component inspection. Guo et al. [58] proposed a framework that detects rail accessories pixel-wise in real time using CNN-based models that receive high-resolution rail images, as shown in Figure 5. Their framework processes high-resolution video at over 30 FPS. These results show that inspection video can be quickly converted into helpful information to aid rail maintenance. Similarly, Gibert et al. [59] proposed CNN models to perform defect detection in rail ties and fixtures. Sresakoolchai and Kaewunruen [60] developed a model that detects defects in rail dipped joints and track settlements and quantifies the degree of defects. Their proposed deep learning method receives 14 features, including weight, speed, and peak acceleration sensor data measured on wheels. The CNN and RNN modules in the model used time series acceleration values, and the DNN modules used train weight, speed, and wheel acceleration feature point values.
A train delivers high acceleration to wheelsets, axle boxes, the bogie, and the entire vehicle body as it passes over the rail. If defects occur in rail components, the acceleration data show different patterns. Yang et al. [7] proposed a deep learning-based approach for defect detection in rail joints through CNNs on acceleration sensor data. CNN-based models can work directly with raw data to reduce the heavy preprocessing of feature engineering and directly detect joints located on either the left or the right rail. Similarly, Sun et al. [61] used acceleration data to detect defects on rail joints. A single CNN model was designed to detect both left and right joints together. This can mitigate the interference that arises when a different model is used for each side, which raises the false-positive rate.
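The idea of applying convolution directly to raw acceleration samples can be sketched as below. This is a minimal single-filter illustration, not the architecture of [7] or [61]; the signal values and the hand-picked kernel are assumptions, whereas a trained CNN would learn its filters from data.

```python
def conv1d(signal, kernel):
    """Valid-mode 1D cross-correlation, the core operation of a 1D CNN layer."""
    width = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(width))
        for i in range(len(signal) - width + 1)
    ]


def relu(values):
    """Rectified linear activation applied element-wise."""
    return [max(0.0, v) for v in values]


def joint_response(acceleration, kernel):
    """Convolve, rectify, and global-max-pool into a single response score."""
    return max(relu(conv1d(acceleration, kernel)))


# Hypothetical axle-box acceleration with one impact-like spike at index 5.
acceleration = [0.0, 0.1, 0.0, -0.1, 0.0, 3.0, -2.5, 0.2, 0.0, 0.1]
# Hand-picked difference kernel that responds to sudden rises.
kernel = [-1.0, 1.0]
score = joint_response(acceleration, kernel)
```

A large score indicates a sharp rise somewhere in the window; no hand-crafted statistical features are computed beforehand.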
A clamp is a rail component that holds a rail so it does not move from side to side. The clamp contributes to railway safety by maintaining the spacing on the left and right sides of the rail. Inspecting clamps is time-consuming and expensive because it depends on visual inspections made by a human expert. Chandran et al. [62] attempted to check clamps using two differential eddy current signals. The current signals were collected using sensors capable of measuring eddy-phase current signals of 18 kHz and 27 kHz, and missing clamps in the fastening system were detected using machine learning algorithms.
Soares et al. [63] derived malfunction patterns of a rail switch machine. Mean, intermediate, maximum, and minimum values were extracted from current signals during the switch operation. Then, similar defects were grouped by using k-means clustering. The model was evaluated on current data generated during switch operation, provided by a railway company, and achieved a high silhouette score (0.860), a clustering performance index.
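The pipeline described above (summary statistics, k-means, silhouette score) can be sketched in a few lines. This is a minimal pure-Python illustration under assumed toy current traces; it uses only mean, maximum, and minimum as features and is not the implementation of [63].

```python
import math
import random


def summarize(trace):
    """Reduce a current trace to [mean, max, min] (a subset of the features in the text)."""
    return [sum(trace) / len(trace), max(trace), min(trace)]


def kmeans(points, k, iterations=50, seed=0):
    """Plain k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centers[c])  # keep an empty cluster's center
        if updated == centers:
            break
        centers = updated
    return labels


def silhouette_score(points, labels):
    """Mean silhouette coefficient; assumes at least two clusters."""
    scores = []
    for i, p in enumerate(points):
        same = [
            math.dist(p, q)
            for j, q in enumerate(points)
            if labels[j] == labels[i] and j != i
        ]
        if not same:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(same) / len(same)
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels)
            if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical switch-motor current traces: three normal, three abnormal.
traces = [
    [1.0, 2.0, 1.0], [1.0, 2.1, 1.0], [1.1, 2.0, 1.0],
    [5.0, 9.0, 5.0], [5.0, 9.2, 5.0], [5.1, 9.0, 5.0],
]
features = [summarize(t) for t in traces]
clusters = kmeans(features, k=2)
quality = silhouette_score(features, clusters)
```

A silhouette score close to 1 means that clusters are compact and well separated, which is how a value such as 0.860 should be read.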
Guo et al. [64] designed a real-time monitoring system to detect rail slab deformation of high-speed railways. This work combined fiber optic sensing methods and machine learning models to identify track slab deformation by using on-site track-side vibration acceleration data. The proposed method could identify the track slab deformation effectively and the detection rate could reach 96.09%.

Rail Geometry
Recent studies have utilized deep learning to analyze vibration data to evaluate railway track quality. Ma et al. [65] proposed a method to evaluate the quality of the rail track based on vehicle-body vibration. CNN and LSTM structures were integrated to process vehicle-body accelerations and predict vertical vehicle-body vibration. Such vehicle-body vibration prediction is beneficial for locating potential track geometry defects with lower costs than existing methods, such as using track inspection vehicles.
Hao et al. [66] further proposed a deep learning-based model applying an attention structure and a gated recurrent unit (GRU) structure. The CNN and GRU learn shape features and sequential features, respectively, and the attention structure receives the vertical and horizontal vibration and the speed of the train as inputs and outputs the degree of vertical rail irregularity.
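For readers unfamiliar with the GRU, its standard update equations are given below in one common convention. The symbols ($x_t$ for the input at step $t$, $h_t$ for the hidden state, $W$, $U$, $b$ for learned parameters) are generic notation assumed here and are not taken from [66].

```latex
\begin{aligned}
z_t &= \sigma\bigl(W_z x_t + U_z h_{t-1} + b_z\bigr) \\
r_t &= \sigma\bigl(W_r x_t + U_r h_{t-1} + b_r\bigr) \\
\tilde{h}_t &= \tanh\bigl(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\bigr) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

Here $z_t$ is the update gate, $r_t$ is the reset gate, $\sigma$ is the logistic sigmoid, and $\odot$ denotes element-wise multiplication. Some references swap the roles of $z_t$ and $1 - z_t$ in the last line.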

Train Door
Train door failures damage the train system and account for 40% of all train failure cases, leading to huge operation and maintenance expenditures. Ham et al. [67] studied a data-based approach to address train door failures. Eight failures were considered in four different scenarios. For each scenario, the change in the current of the electric motor operating the train entrance door was measured. Then, two techniques were used to analyze the current change data. First, 13 features were extracted from the time-series signal data using traditional feature engineering techniques based on pass filters (high and low), and a k-nearest neighbors (KNN) model detected door failures based on the extracted features. The second method was a deep learning model based on 1D convolution. Figure 6 shows the components of the train door test rig used for the experiments. The evaluation results showed that both methods achieved an accuracy of 98% or more, and the CNN models showed slightly higher performance, even though they used raw current signals without preprocessing.
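The feature-based route can be illustrated with a minimal k-nearest neighbors classifier. The two-dimensional feature vectors below (for example, mean and peak motor current) and the labels are hypothetical and stand in for the 13 engineered features used in [67].

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    nearest = sorted(
        zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical [mean current, peak current] features per door cycle.
train_x = [
    [1.0, 2.0], [1.1, 2.1], [0.9, 1.9],   # normal cycles
    [3.0, 6.0], [3.2, 6.1], [2.9, 5.8],   # faulty cycles
]
train_y = ["normal"] * 3 + ["fault"] * 3

prediction_a = knn_predict(train_x, train_y, [1.05, 2.0])
prediction_b = knn_predict(train_x, train_y, [3.1, 6.0])
```

The 1D-CNN alternative skips the feature extraction step and consumes the raw current trace directly.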

Wheel
Wheel defects in trains are also one of the main causes of damage to railway systems and railway-related facilities. Neglecting train wheel defects shortens the service life of railway infrastructure and may result in unnecessary maintenance costs. Furthermore, wheel defects generate ground vibration and noise, causing significant damage to the surrounding environment. To determine train wheel faults, Krummenacher et al. [68] focused on the vertical force of the train. They continuously measured the load of trains running at top speed using wheel load checkpoints (WLCs) placed on rails at regular intervals and studied two methods to detect wheel defects. The first method determined train wheel defects using an SVM model based on the train load data processed by the discrete wavelet transform (DWT), a time series data processing method. The second method was a CNN-based model developed to detect train wheel defects. They found that the proposed methods perform better than conventional defect detection methods. In particular, the CNN-based model was strong at identifying flat spots (wheel defects caused when a wheel stops rotating and is dragged along the rail) and non-roundness (wheel defects that cause vibration and noise generation).
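As a small illustration of how a DWT exposes impact-like events in a load signal, the sketch below performs one level of the Haar wavelet transform. The Haar wavelet and the toy signal are assumptions chosen for brevity; the wavelet family used in [68] is not stated here.

```python
import math


def haar_dwt(signal):
    """One level of the orthonormal Haar DWT.

    Returns (approximation, detail) coefficient lists, each half the input length.
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return approx, detail


# Hypothetical wheel-load samples: steady load with one impact at index 4.
load = [1.0, 1.0, 1.0, 1.0, 5.0, 1.0, 1.0, 1.0]
approx, detail = haar_dwt(load)
```

The detail coefficients are zero wherever the load is constant and non-zero at the impact, so they can serve as compact features for a classifier such as an SVM.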
In addition, acceleration sensors for inspecting the position of railway wheels have been widely studied to increase information utilization and support efficient maintenance decisions. However, acceleration sensors detect the longitudinal movement of train wheels relatively accurately but are less accurate for lateral movement. Shi et al. [69] attempted to solve this problem by combining an image-based point tracking method with acceleration sensor data. Their proposed model was based on YOLO; it generated a wheel reference point indicating the wheel position from the input image and compared it with the normal position. Furthermore, they adopted various filters and data acquisition methods to improve performance, even in adverse weather such as snow and fog.

Suspension
Wu et al. [70] detected defects in bogie suspension components (coil spring, air spring, vertical damper, and yaw damper) by considering the increased vibration and stability of a high-speed train during accelerated operation. They developed a Bayesian deep learning-based predictive model based on accelerometer (vertical and horizontal) data collected from a bogie and accelerometer sensors attached to trains, and data with each degree of deviation of each component (vertical and horizontal). Their predictive model imposed perturbation by the Monte Carlo algorithm to more clearly distinguish the difference between frequent and sudden faults. The fault class was diagnosed using dropout-based Bayesian deep learning. The proposed methods accurately detected rare but fatal defects, even with a small number of samples. Xie et al. [71] analyzed train vibration signals using a fast Fourier transform (FFT) that decomposed input signals by frequency band and automated feature extraction with a deep belief network (DBN). Under four different conditions (normal train, without anti-yaw shock absorber, air spring failure, and without transverse shock absorber), a total of 28,600 vibration samples were collected using vibration sensors installed at various locations on a train. DBN models consisting of four restricted Boltzmann machines (RBMs) showed significant improvement in diagnosis performance.
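Dropout-based Bayesian inference is often approximated by Monte Carlo dropout: dropout stays active at prediction time, and the spread over repeated forward passes is read as uncertainty. The sketch below shows this general idea with a tiny fixed-weight network; the weights, input, and function names are hypothetical, and this is not the model of [70].

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def forward(x, w_hidden, w_out, rng, p_drop):
    """One stochastic forward pass with inverted dropout on the hidden layer."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    hidden = [0.0 if rng.random() < p_drop else h / (1.0 - p_drop) for h in hidden]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return softmax(logits)


def mc_dropout_predict(x, w_hidden, w_out, passes=100, p_drop=0.5, seed=0):
    """Return per-class mean probability and variance over stochastic passes."""
    rng = random.Random(seed)
    runs = [forward(x, w_hidden, w_out, rng, p_drop) for _ in range(passes)]
    classes = range(len(runs[0]))
    mean = [sum(r[c] for r in runs) / passes for c in classes]
    var = [sum((r[c] - mean[c]) ** 2 for r in runs) / passes for c in classes]
    return mean, var


# Hypothetical weights: 2 inputs -> 3 hidden units -> 2 fault classes.
w_hidden = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.9]]
w_out = [[1.0, -1.0, 0.5], [-0.5, 1.0, 1.0]]
x = [1.0, 0.5]

mean, var = mc_dropout_predict(x, w_hidden, w_out, passes=50, p_drop=0.5, seed=1)
mean_det, var_det = mc_dropout_predict(x, w_hidden, w_out, passes=50, p_drop=0.0)
```

A high variance flags predictions the model is unsure about, which is useful when rare faults are represented by only a few training samples.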

Bearing
Bearings are principal components widely used in most modern mechanical equipment. Defect inspection of bearings takes a long time and the cost of repairs is generally high, which can significantly decrease train productivity. While there have been many attempts to detect bearing defects, conventional methods have two limitations. First, methods based on features that depend on expert rules or prior knowledge take too much time and human effort because experts must carry out a different process for each specific problem. Second, traditional machine learning methods with shallow structures have had difficulty estimating nonlinear functional relationships in complex systems. To overcome these limitations, studies have adopted deep learning to detect bearing defects.
Xu et al. [72] proposed a CNN-based model for bearing defect detection. Their proposed model used bearing vibration signals for defect detection. It converts the original signals into two-dimensional features by the continuous wavelet transform (CWT). Then, a CNN based on LeNet-5 processes the features and determines the bearing state. In addition, an ensemble method was adopted to utilize three Random Forest (RF) models with features of three specific layers as input values. He et al. [73] developed a deep learning model that diagnoses defects using the Large Memory Storage and Retrieval Neural Network (LAMSTAR). This multi-layer fast deep learning structure can use many filters simultaneously. In addition, the short-time Fourier transform (STFT) is used to process acoustic data generated from bearings to determine when the signals for each frequency band, separated from the composite signal, are generated. Tests performed in laboratory environments showed better performance than other conventional CNN models.
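To make the role of the STFT concrete, the sketch below computes magnitude spectra over sliding windows using a direct (slow) DFT. The window length, hop size, Hann taper, and test tone are assumptions for illustration; a practical system would use an FFT library.

```python
import cmath
import math


def stft_magnitudes(signal, window=8, hop=4):
    """Return a list of frames, each a list of magnitudes for bins 0..window//2."""
    frames = []
    for start in range(0, len(signal) - window + 1, hop):
        frame = signal[start:start + window]
        # Symmetric Hann taper to reduce spectral leakage.
        tapered = [
            v * (0.5 - 0.5 * math.cos(2.0 * math.pi * n / (window - 1)))
            for n, v in enumerate(frame)
        ]
        spectrum = []
        for k in range(window // 2 + 1):
            acc = sum(
                tapered[n] * cmath.exp(-2j * math.pi * k * n / window)
                for n in range(window)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Test tone that completes exactly 2 cycles per 8-sample window.
tone = [math.sin(2.0 * math.pi * 2 * n / 8) for n in range(32)]
frames = stft_magnitudes(tone)
peak_bins = [frame.index(max(frame)) for frame in frames]
```

Each frame tells which frequency bands are active in that time slice, which is the time-frequency information a downstream classifier consumes.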
The features of bearing vibration signals, such as high nonlinearity, non-stationarity, and background noise, make it hard to diagnose bearing faults effectively and accurately. Zou et al. [74] proposed a deep learning method based on the discrete wavelet transform (DWT) and an improved DBN. First, the vibration signals from faulty bearings were converted to a two-dimensional (2D) time-frequency map. Then, the time-frequency map was processed by an improved DBN model, aiming to identify the correlation between fault features and fault types. In this way, the fault state of the bearing in the traction motor was diagnosed and identified in a semi-supervised manner.
Figure 7 shows examples of railway equipment detection. Zhan et al. [75] proposed a model that utilizes Faster R-CNN to locate the target component against a complex background in a bogie image and determine whether it is defective. In particular, they improved the original Faster R-CNN by using two layers of different sizes for extracting defect regions and enabling region of interest (ROI) pooling. Experiments on 6499 test samples covering four parts (cut-out cock handle, dust collector, fastening bolts, and bogie block key) showed high detection accuracy with fast speed. Sun et al. [76] proposed a CNN model that detects defects in the side frame key (SFK) and shaft bolt (SB) among bogie components. The detection model accurately located the SFK and SB in image data from the Trouble of Running Freight Train Detection System (TFDS) and then cropped them to diagnose each defect. Xiao et al. [77] proposed a hierarchical feature-based instance detection (HID) model to detect lost or broken bogie components. Their proposed model consisted of three modules. The first module extracts hierarchical image features from train images through a CNN model. The second module delivers the extracted feature map to the region proposal network to generate candidate defect regions. The last module detects defects based on the generated regions and the feature maps. The proposed instance-level detection was evaluated on six train defects (lost pin, lost bolt, lost rivet, foreign object, broken chain, and broken wire).
Ye et al. [78] proposed a multi-feature fusion network (MFF-net) to address the loss of small-sized areas when the feature map size is reduced, which results in poor detection performance. To this end, three modules were devised. First, the feature fusion method (FFM) module incorporates deep and shallow features, such as spatial location and semantic information. Second, the multi-branch dilated convolution module (MDCM), inspired by the Inception model, simultaneously enhances feature extraction around objects of different sizes. The MDCM utilizes convolution networks and multi-branch networks to accommodate multi-scale features. Finally, the squeeze and excitation block (SE) module compresses and readjusts the features to improve model representation. The proposed model outperformed other conventional models in testing with the PASCAL VOC dataset. In addition, it showed excellent stability, even under complex environmental noise.
Figure 8 shows an example of railroad trespassing detection. Zaman et al. [79] proposed a deep learning framework based on Mask R-CNN that automatically detects railroad trespassing in real time. The model was trained on the COCO dataset; it detects trespassing events and classifies trespasser types (car, motorcycle, truck, pedestrian, etc.). In addition, Gao et al. [80] developed a railroad trespassing detection method based on one light detection and ranging (LiDAR) system and two cameras with different focal lengths. Cameras provide high-resolution images and rich semantic information, but their performance is easily affected by lighting or weather conditions, and their distance estimation accuracy is limited. LiDAR can measure the distance to an object accurately and provides a 3D image to work with. However, sparse point cloud data provide limited detection capability for small and dynamic obstacles. This work modifies an SSD network to incorporate multi-sensor data.

Railway Detection
Quickly detecting the front rail area can help prevent train accidents, such as derailments. However, railway detection in outdoor environments suffers from light-related issues, such as shadows, reflections, tunnels, and low contrast to the ground. In addition, railway detection becomes challenging in areas where multiple rails overlap. Wang et al. [81] proposed a CNN-based deep learning model trained on the BH-rail dataset, which contains railway images captured at various times. Wang et al. [82] proposed RailNet, a deep learning-based railway detection algorithm that processes video from front-view on-board cameras. RailNet consists of two networks: a network for feature extraction and another for railway segmentation. The feature extraction network has a pyramid structure to allow features to have top-to-bottom propagation. The railway segmentation network combines a ResNet50 backbone network with a fully convolutional network to generate the segmentation map.

Wind Risk
High-speed railways are susceptible to strong winds, which can pose a major threat to train safety. In order to ensure train safety, it is necessary to measure the wind speed of the preceding area in real time or to inform the train of the information in advance by short-term prediction. However, measured and predicted wind speed alone are not sufficient to explain wind conditions. For example, if the expected wind speed is slightly lower than the strong wind threshold, it is difficult to estimate whether a substantial wind accident can occur. Liu et al. [83] proposed a multiple attention layer based multi-instance learning (MAL-MIL) model to predict substantial wind risk alongside a high-speed railway (HSR). Based on attention mechanisms and LSTM networks, the model extracted features of the future wind status and identified the relationships between the current features and strong wind incidents.
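Attention-based multi-instance learning typically pools a bag of instance features into one vector using softmax-normalized attention weights. The sketch below shows only this pooling step; the instance features and attention scores are hypothetical inputs (in a trained model the scores come from a learned layer), and this is not the MAL-MIL architecture of [83].

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(instances, scores):
    """Weighted average of instance vectors using softmax attention weights."""
    weights = softmax(scores)
    dim = len(instances[0])
    pooled = [
        sum(w * inst[d] for w, inst in zip(weights, instances)) for d in range(dim)
    ]
    return pooled, weights


# Two hypothetical wind-feature instances within one bag (time window).
bag = [[1.0, 0.0], [0.0, 1.0]]
pooled_equal, weights_equal = attention_pool(bag, [0.0, 0.0])
pooled_peaked, weights_peaked = attention_pool(bag, [10.0, 0.0])
```

With equal scores the pooling reduces to a plain mean; a dominant score lets one instance, such as a gust shortly before an incident, determine the bag representation.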

Train Running Safety
There are many studies on monitoring the current state of train operation and quantifying train running safety [84][85][86][87]. However, these studies have mainly considered limited situations that can be monitored relatively simply, such as trains passing over bridges and through tunnels. Lee et al. [88] presented a model that combines deep neural networks and recurrent neural networks for efficient train running safety prediction. Their proposed model processed train vibration data, measured by an accelerometer, and predicted the wheel derailment coefficient, the wheel load reduction rate, and the wheel lateral pressure. Numerical analyses were conducted using the transit simulation and the actual train-railway model, and the results revealed that the proposed method has better prediction performance.
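For orientation, two of the predicted quantities are commonly defined in vehicle dynamics as ratios of wheel-rail forces. The notation below ($Q$ for the lateral wheel-rail force, $P$ for the vertical wheel load, $P_0$ for the static wheel load) is an assumption; the exact definitions used in [88] may differ.

```latex
\text{derailment coefficient} = \frac{Q}{P},
\qquad
\text{wheel load reduction rate} = \frac{\Delta P}{P_0} = \frac{P_0 - P}{P_0}
```

Larger values of either ratio indicate a higher risk of the wheel flange climbing the rail, which is why they serve as running safety indicators.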

Managing Accident Reports
Accident reports can help minimize risk factors to prevent future accidents. Accident reports mostly contain diverse input field entries, such as fixed field entries, which include the primary cause of accidents, or a narrative field, which is a short text description of the accident. The narratives can provide more information than a fixed field entry, but the terminology used in the reports is not easy for a non-expert reader to understand. Heidarysafa et al. [89] applied word embedding methods, such as Word2Vec and GloVe, to narrative texts in train accident reports. As shown in Figure 9, the proposed method classifies accident cause values for the primary cause field based on embedding vectors of the narrative text. This NLP approach can help label accidents more accurately and consistently.
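The idea of classifying a narrative from its word embeddings can be sketched as follows. The two-dimensional vectors, the cause labels, and the nearest-prototype rule are toy assumptions that stand in for pretrained Word2Vec or GloVe vectors and a trained classifier; this is not the pipeline of [89].

```python
import math


def average_embedding(text, vectors):
    """Average the vectors of all known words in the text."""
    tokens = [t for t in text.lower().split() if t in vectors]
    if not tokens:
        raise ValueError("no known words in text")
    dim = len(next(iter(vectors.values())))
    return [sum(vectors[t][d] for t in tokens) / len(tokens) for d in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(text, vectors, prototypes):
    """Return the cause label whose prototype vector is most similar to the text."""
    embedding = average_embedding(text, vectors)
    return max(prototypes, key=lambda label: cosine(embedding, prototypes[label]))


# Toy 2D word vectors (hypothetical; real embeddings have hundreds of dimensions).
vectors = {
    "rail": [1.0, 0.0],
    "broken": [0.9, 0.1],
    "signal": [0.0, 1.0],
    "passed": [0.1, 0.9],
}
# Hypothetical prototype vectors for two primary-cause categories.
prototypes = {"track": [1.0, 0.0], "human factor": [0.0, 1.0]}

cause_a = classify("Broken rail found", vectors, prototypes)
cause_b = classify("Crew passed signal", vectors, prototypes)
```

Words missing from the vocabulary, such as "found" and "crew" here, are simply skipped.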

Accident Prevention
A railway station is dynamic and complex due to the presence of many people, including passengers and station staff, and trains that stop and depart quickly. Therefore, it is necessary to prevent and deal with various safety accidents in stations. Alawad et al. [10] proposed a model that quickly identifies three safety incidents (fall, slip, and trip). It used diverse images of platforms, escalators, and tunnels captured by CCTV in the station. The CNN-based deep learning model classified input images into two classes (fall and not fall), and it achieved a high accuracy of 82.20% and an AUC value of 82.33%.
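Since both accuracy and AUC are reported for the binary fall classifier, a minimal sketch of how the two metrics are computed may be useful. The labels and scores below are made-up toy values, not data from [10].

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = fall, 0 = not fall; scores are predicted fall probabilities.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy depends on the chosen decision threshold (0.5 here), whereas AUC summarizes ranking quality over all thresholds, so the two values need not coincide.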

Air Quality Control
Air quality measurement sensors are installed in railway stations for air quality control. However, the measurement sensors often fail due to being in the wrong location for measurement, expired sensor equipment, malfunctioning electrical equipment, etc. Since air quality data are collected from several sensors, it is difficult to identify normal data with models having a linear or fixed structure because the variance of the data is significant and values that do not follow a normal distribution are included. Loy-Benitez et al. [90] proposed a machine-learning-based soft sensor verification technique for detecting, diagnosing, identifying, and reconstructing abnormal measurements of multivariate air quality data. Figure 10 presents a diagram of the air quality monitoring and supervisory control process. Normal and abnormal values were extracted from the collected air quality data. A memory-gated recurrent neural network autoencoder (MG-RNN-AE) algorithm was developed to process the air quality data. Furthermore, experimental results showed that the proposed method achieves a sustainable balance between power consumption and air quality levels, effectively performing air quality management within the station.

Simulation and Scheduling
Transportation modeling is difficult because it is a dynamic and complex system with interdependent factors, such as humans, equipment, and their temporal attributes. Recently, a deep learning approach that can extract complex high-level representations through hierarchical learning processes was applied to transportation modeling. Huang et al. [91] proposed CLF-Net, a deep learning model that combines 3D-CNN, LSTM, and fully connected neural networks to handle complex variables in dynamic systems. The proposed model separately processes data with different attributes for better predictive performance, uses spatio-temporal variables to capture space-time dependencies, and receives variables to learn the potential effects of static factors.
With the development of cities, short-term traffic prediction has become the core of the intelligent transportation system (ITS). Accurate short-term traffic forecasting can provide technical support to monitor train passenger flow and warn of excessive traffic congestion. Tang et al. [92] proposed a spatio-temporal long short-term memory network (ST-LSTM) that captures spatio-temporal features from railway traffic data. Their proposed model improved the original LSTM structure, which focuses on temporal rather than spatial features.
Predicting train delays can improve the quality of train operation by supporting more accurate and reasonable operational decisions. A train delay is affected by many factors, such as passenger flow, failure, extreme weather, and dispatch strategies. Considering such temporal and spatial factors between multiple trains and routes is challenging, which makes it difficult to accurately predict train delays. Zhang et al. [93] focused on predicting the cumulative effects of train delays over a certain period of time, represented by the total number of arrival delays in one station, rather than predicting each specific delay time of a single train. A deep learning framework based on the spatio-temporal attention mechanism and spatio-temporal convolution was proposed. Their model receives recent, daily, and weekly time series data as inputs, and each component includes a spatio-temporal attention mechanism and spatio-temporal convolution, which can effectively capture spatio-temporal characteristics. Experiments on train operation data in the railway passenger ticket system of China demonstrated that the proposed model clearly outperforms existing baselines in train delay prediction.
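Models of this kind are usually fed several aligned history segments for each prediction window. The sketch below slices a delay-count series into recent, daily-periodic, and weekly-periodic segments; the function name, window length, and sampling rate are assumptions, and the exact segment layout in [93] may differ.

```python
def build_windows(series, t, window, steps_per_day):
    """Slice history segments for predicting series[t : t + window].

    recent : the `window` steps immediately before t
    daily  : the same time-of-day window one day earlier
    weekly : the same time-of-day window one week earlier
    """
    week = 7 * steps_per_day
    if t - week < 0 or t + window > len(series):
        raise ValueError("not enough history or future data around t")
    recent = series[t - window:t]
    daily = series[t - steps_per_day:t - steps_per_day + window]
    weekly = series[t - week:t - week + window]
    target = series[t:t + window]
    return recent, daily, weekly, target


# Toy hourly series of arrival-delay counts at one station (values are indices).
delays = list(range(200))
recent, daily, weekly, target = build_windows(
    delays, t=180, window=3, steps_per_day=24
)
```

Each segment would then be processed by its own model component before the outputs are fused.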

Discussion and Conclusions
Our literature survey shows that artificial intelligence has been widely applied to various railway safety issues, such as railway infrastructure, trains, operations, and stations. This review details both opportunities and challenges for artificial intelligence in railway safety.
First of all, advances in data-driven artificial intelligence technologies can improve the performance of conventional railway safety methods. In addition, many studies have shown the feasibility of automating or supplementing conventional railway safety inspection procedures that depend on visual analysis or the domain knowledge of a human expert. The proposed model structures in the discussed studies were determined based on the input data types. Images and video are among the most common data types in artificial intelligence applications for railway safety, and many studies on defect detection (e.g., catenary and rail surface defects) developed CNN-based deep learning models. Train vibration is another popular data source for railway safety. Accelerometers can easily measure train vibrations, and LSTM-based models have been used to extract unique patterns from accelerometer data.
On the other hand, there are also challenging issues in utilizing artificial intelligence for railway safety that further studies should consider. We divided the addressed issues into two categories: (1) performance optimization and (2) generalization. First, many studies addressed the necessity of further performance improvement in artificial intelligence. For example, model accuracy needs to be improved to reach practical requirements, or the model structure should be further optimized to run in real time. Second, the generalization of the proposed methods was raised as an issue by many studies. Some studies used simulation data in a lab setting, so in-situ validation needs to be performed for practical application.
More details regarding research issues in artificial intelligence for railway safety, addressed by prior studies, are explained in subsequent subsections. Developing deep learning models for railway safety is challenged by practical limitations of data volume or quality, such as diverse noises in railway environments and insufficient labeled data. Therefore, it is necessary to deal with such data deficiencies when developing artificial intelligence for railway safety. For example, Xiao et al. [77] utilized a hierarchy of features for training a deep learning model with a small amount of labeled data. Ensemble methods that integrate different machine learning algorithms can help increase the efficiency of model learning. For example, Xu et al. [72] considered an ensemble method that integrates a CNN-based model and RF for bearing fault diagnosis. With a relatively small amount of data, the ensemble approach can be more efficient than the end-to-end deep learning approach. In addition, unsupervised models can help deal with a small amount of labeled data. Soares et al. [63] expected to improve system performance by analyzing other clustering algorithms or adjusting internal parameters.

Processing Time
Beyond model accuracy, processing time can be one of the essential requirements in artificial intelligence for railway safety. In particular, real-time processing can be required for high-speed train applications. For example, Wang et al. [81] suggested further studies to develop a real-time system that recognizes moving obstacles by combining railway area recognition and obstacle detection steps. In addition, Lee et al. [88] expected that their system could be utilized for real-time train control to reduce the risk of train derailment.

New Data Source
Most prior studies were conducted with a limited data source. For example, Wang et al. [40] emphasized the need to improve data quality. Their data were collected in limited circumstances, such as with a fixed camera angle when taking the image data. In addition, Jiang et al. [8] argued that experiments in various railway conditions, such as the angle or length of the rail, should be conducted to develop a fault detection system. Furthermore, Ma et al. [65], who developed a method for rail defect inspection based on vibration signals, commented that considering various train types and driving speeds can help improve performance.
While most studies about artificial intelligence for railway safety have utilized image or vibration data, some studies have explored other data sources to improve model performance. For example, Wolf et al. [94] proposed using LiDAR sensor data to understand situations and components in 3D railway images. Suwansin and Phasukkit [9] utilized acoustic emission signals from rails for rolling contact fatigue. Furthermore, artificial intelligence could be improved and optimized by harnessing various situational features in the railway domain. Krummenacher et al. [68] developed an efficient machine-learning-based model for detecting train wheel defects by additionally considering the exterior characteristics of the train.

Tasks
Many prior studies have proposed a deep learning framework for defect detection in railways. However, the proposed frameworks were developed and evaluated with certain types of defects and there is much room for improvement to satisfy practical requirements. For example, Chandran et al. [62] focused on one fastener type and addressed the need to study the feasibility of the proposed method for other types. Similarly, Akhila et al. [4] also noted that the proposed framework needs to be improved with other examples and under different contexts. Wu et al. [70] conducted a study to detect defects in truck joints and accessories, further noting that a partial defect detection study should also be conducted to eliminate potential risk factors for train operation.

Validation with In-Situ Data
Because in-situ data acquisition is challenging in railways, many studies have been conducted with artificial data acquired in lab experiments. Even though models trained on lab-setting datasets can demonstrate the feasibility of the proposed methods and provide initial insights, these studies have addressed the need for further research with actual train-running data for validation. Shebani and Iwnicki [52] performed laboratory testing under limited conditions and noted that validation of the developed method in the field is necessary. Similarly, Kim et al. [5] addressed a gap between an actual train situation and simulation data. Shi et al. [69] developed a model to monitor rail-track geometry defects but reported that the model performance decreased in harsher outdoor situations. Unexpected noises can also cause such decreases in field performance [60]. Additionally, actual data can contain more diverse and complex conditions that are rarely covered by lab experiments. Ham et al. [67] detected train entrance door failures using data generated by manipulating doors under several abnormal conditions, so their model should be further studied using actual train door failure data.