Accurate and Lightweight RailNet for Real-Time Rail Line Detection

Abstract: Railway transportation has always occupied an important position in daily life and social progress. In recent years, computer vision has made promising breakthroughs in intelligent transportation, providing new ideas for detecting rail lines. Yet most rail line detection algorithms rely on traditional image processing to extract features, and their detection accuracy and real-time performance remain to be improved. This paper addresses these limitations and proposes a rail line detection algorithm based on deep learning. First, an accurate and lightweight RailNet is designed, which exploits the semantic information extraction capabilities of deep convolutional neural networks to obtain high-level features of rail lines. The Segmentation Soul (SS) module is added to the RailNet structure, which improves segmentation performance without any additional inference time. Depth Wise Convolution (DWconv) is introduced in the RailNet to reduce the number of network parameters and ensure real-time detection. Afterward, based on the binary segmentation maps output by RailNet, we propose a rail line fitting algorithm based on sliding window detection and apply the inverse perspective transformation. The polynomial functions and curvature of the rail lines are then calculated, and the rail lines are identified in the original images. Furthermore, we collect a real-world rail lines dataset, named RAWRail. The proposed algorithm has been validated on the RAWRail dataset, running at 74 FPS with an accuracy of 98.6%, which is superior to current rail line detection algorithms and shows strong potential in real applications.


Introduction
As an essential national infrastructure, railway transportation has received significant attention from society for its safety [1]. With the rapid development and popularization of high-speed rail technology, higher requirements are placed on the speed and safety of trains running on rail lines. In addition to scheduling issues during train operation, it is also necessary to consider how to improve the detection of track conditions during train operation [2]. With the application of intelligent railway video monitoring systems and the development of a new generation of fully automatic driving signal systems, intelligent monitoring of rail lines has become a hot research topic [3,4], covering issues such as track obstacle recognition [5,6], rail crack detection [7,8], and foreign object intrusion on the track [9]. However, the factors that cause rail accidents are complex and changeable, such as bad weather, obsolete train tracks, malfunctions of electronic equipment, and the status of drivers.
In making railway transportation intelligent and automated, the primary task is to predict the railway tracks ahead during operation, so that trains receive basic information about the environment in front of them in time [10]. In this way, the train can sense the track's condition in advance and adjust its speed in time, avoiding rail traffic accidents such as speeding and derailment in curves. In addition, detecting the rail line area in advance helps guard against foreign object intrusion, because it frames the detection range and reduces the amount of processing. The operational safety of the train can thus be ensured in real time. At present, computer vision is the mainstream approach to rail line detection. Rail line detection algorithms based on computer vision can be divided into two directions. The first is based on image processing algorithms, using image edge detection and similar algorithms to search for rail line features and then fit curves. The second is based on deep convolutional neural networks, whose powerful semantic information extraction capabilities capture feature information such as the edge, color, and texture of the rail lines and segment the railway tracks from the background, even in more complex images.
Before the rise of deep learning, rail line detection mainly used traditional image processing technology, which relies on differences in a specific attribute between the target pixels and their neighborhood in the image. This type of algorithm uses the way the target pixels change relative to the surrounding environment to locate the railway lines in the image and thereby carry out line detection. As one of the early works, Zhong Ren et al. [11] proposed a rail recognition algorithm based on prior knowledge. The key techniques of the algorithm are rail modeling and template matching; the position of the railroad tracks in the current picture is determined by matching. Although this method has drawbacks, such as susceptibility to environmental interference and low accuracy, it set a precedent for rail line detection. Afterward, according to the characteristics of the rails in monitoring images, Q Wang et al. [12] proposed a rail line identification and detection method based on the Radon transform and the Bresenham straight line detection algorithm. However, the applicability of this method is limited, as it is only suitable for straight-line sections. Based on traditional image processing, Zhao Wu et al. [13] added post-processing methods such as segment merging, slope culling, single-frame comprehensive decision-making, segment rebuilding, and multi-frame recognition result fusion to improve the accuracy of rail recognition, but again only for straight-line detection. Lei Zhang et al. [1] studied the extraction of rail tracks from infrared images by obtaining the target area and edges of the rail tracks through image segmentation, refining the extracted target area, and finally obtaining the curve based on the shape and location of the railroad tracks. However, this method still needs considerable improvement in both detection speed and detection accuracy.
The chosen curve model directly influences the accuracy and computational complexity of a rail line detection algorithm. Kaleli [14] and Badino et al. [15] suggested extracting line features with a median filter and using dynamic programming to detect lines. Still, the model is susceptible to environmental interference, and its robustness needs to be improved. Although a complex curve model can fit a wider variety of boundary curves, it has weak anti-interference ability and is susceptible to noise. Recently, Yunze Wang et al. [16] used a curvature map-based track recognition algorithm to identify near-distance tracks, then obtained seed points from the near-distance recognition results and, based on local gradient information, recognized long-distance tracks with an improved seed region growing algorithm that introduces direction information. The algorithm overcomes the shortcomings of the previous methods, but it needs to be improved in identifying multi-rail lines. More generally, the accuracy and real-time performance of rail line detection algorithms based on traditional image processing still need further breakthroughs.
With the success of deep learning, researchers have gradually investigated its application to rail line detection. Ziguan Wang et al. [17] were among the first to use deep learning in railway track detection. Their model is based on Mask R-CNN, which scans the picture and produces candidate boxes containing the rails, calculates the position of the boxes containing the tracks, creates a mask covering the rails, and finally obtains the position of the rails in the picture. They obtained photos from the surveillance video of a subway company and compiled them into a dataset for training and evaluating their system. However, the presence of speculative predictions in their final output compromises the recognition quality. Moreover, they do not report the accuracy and detection speed of their method, which hinders further comparisons. Recently, Xiaoyong Guan et al. [5] used ResNet101 and Feature Pyramid Networks (FPN) as the backbone network. Input pictures generate feature maps of various sizes, forming pyramids of feature maps at different levels, which further strengthens feature extraction. By building railway datasets, constructing network models, and training network parameters, the recognition and segmentation of rail areas, metro trains, and signal lamps can be realized. The network can adapt to changes in the metro train operating environment. Nevertheless, although the complex network structure ensures detection accuracy, it hinders the real-time performance of rail line detection.
In this paper, an algorithm based on state-of-the-art deep convolutional neural networks is proposed to overcome the deficiencies of the aforementioned detection methods. The algorithm is mainly intended for local trains and city railways. First, the RailNet is designed to process the input images, extract the key information, and output binary segmentation maps that are robust to irrelevant noise. The rail lines are segmented from the background, and the features of the tracks are preserved without interference from other objects [18]. Afterward, the binary segmentation maps pass through the post-processing part of the RailNet, namely the rail line fitting algorithm based on sliding window detection. This algorithm is composed of three steps: Inverse Perspective Transformation (IPT), Feature Point Extraction (FPE), and Rail Lines Curve Fitting. Finally, the fitting results are mapped back to the original images, and the rail lines are marked on them. An overview of the entire process of the algorithm can be seen in Figure 1.
Figure 1. Overview of the proposed method. The RailNet part is responsible for extracting the rail line features and is trained to generate the binary segmentation maps of the rail lines. Afterward, the binary segmentation maps and the original images are processed by the rail line fitting algorithm based on sliding window detection. IPT and FPE stand for Inverse Perspective Transformation and Feature Point Extraction, respectively. A second-order polynomial is fitted for each rail line, and the rail lines are reprojected onto the original images.
The main contributions of our algorithm are four-fold:

1. A novel lightweight deep learning network, RailNet, is proposed. The encoder-decoder structure of the RailNet ensures the accuracy of detection. The Depth Wise Convolution (DWconv) is introduced in the RailNet, which reduces the number of network parameters and ensures real-time detection. Compared with the existing state-of-the-art methods of extracting features, the RailNet has solid detection speed and higher accuracy.

2. The Segmentation Soul (SS) module is added to the RailNet structure, which can enhance the feature representation in the training phase and can be discarded in the testing phase. The SS module improves segmentation performance without any additional inference time.

3. A rail lines fitting algorithm based on sliding window detection is proposed as the post-processing part of the RailNet. The algorithm further improves the accuracy of detection. In addition, the rail lines in the original image are accurately marked, and the mathematical expression and curvature of the tracks are calculated.

4. A dataset of rail lines, RAWRail, has been created for deep learning network training and testing. The dataset can be used for algorithm performance evaluation, which would help enrich the research and development of rail line detection.

Material and Methods
The main aim of the algorithm is to improve the accuracy and speed of rail line detection through the RailNet and its post-processing algorithm. We train the lightweight RailNet by treating rail line detection as a binary segmentation problem, in which the difference between rail line features and background features indicates whether a pixel belongs to the tracks. Since RailNet outputs only a set of rail line pixels, we still need to fit a curve through these pixels to improve detection accuracy. Therefore, this paper designs a rail lines fitting algorithm based on sliding window detection. It post-processes the binary images output by the neural network and finally marks the rail lines on the original images.

RailNet
The detection of train tracks is essentially an image segmentation problem: the tracks are segmented from the background while the characteristic information of the tracks is retained. In this way, the network has a stronger anti-interference ability when extracting rail characteristics and can cope with changes in the number of rails. The RailNet model structure is mainly divided into two parts, an encoder and a decoder. Figure 2 shows the specific structure of the RailNet.
Encoder-decoder architectures are widely used in dense prediction tasks such as semantic segmentation, and typically use convolutional layers and transposed convolution layers for feature encoding and decoding [19]. For higher efficiency, the RailNet adopts a lightweight encoder-decoder architecture. Table 1 shows details of the constituent layers. The encoder takes images of the front view of a rail as input and hierarchically extracts the features [19,20]. The decoder progressively recovers the resolution of the feature map and produces pixel-wise binary images.
Table 1. SS refers to the Segmentation Soul section.
The backbone network extracts image features and forms the encoding part of the network. Inspired by BiSeNetV2, RailNet is designed to extract the semantic features of images [20]. The encoder of the RailNet replaces standard convolution operations with Depth Wise Convolutions (DWconv) to significantly lower the computational cost [21]. Figure 3 shows in detail how the DWconv reduces the computational cost. More specifically, DWconv layers with a kernel size of 3 are stacked for progressive feature extraction. Each DWconv layer is followed by a 1 × 1 convolution layer, which aggregates information across channels. As noted above, many objects in the input images share a similar local appearance with rail lines. To improve detection accuracy, the context information should be properly extracted and preserved in the encoding stage. The network therefore uses two DWconv layers followed by a 1 × 1 convolution layer for feature extraction at each feature resolution [22]. The first DWconv layer has a dilation rate of 1, while the second uses a dilation rate of 2, which enlarges the receptive field. This encoder design balances the efficiency and accuracy of detection.
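The saving from replacing a standard convolution with a depthwise convolution plus a 1 × 1 convolution, and the receptive field gained from the dilation rates 1 and 2, can be checked with simple arithmetic. The sketch below is illustrative only: the channel widths are hypothetical (the layer widths of Table 1 are not reproduced here), biases are ignored, and all layers are assumed to have stride 1.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a k x k standard convolution (bias ignored)."""
    return k * k * c_in * c_out


def dw_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a k x k depthwise convolution followed by a 1 x 1 convolution."""
    return k * k * c_in + c_in * c_out


def effective_kernel(k: int, dilation: int) -> int:
    """Spatial extent covered by a k x k kernel with the given dilation rate."""
    return dilation * (k - 1) + 1


def stacked_receptive_field(layers) -> int:
    """Receptive field of stride-1 layers applied in sequence.

    `layers` is an iterable of (kernel_size, dilation) pairs.
    """
    rf = 1
    for k, d in layers:
        rf += effective_kernel(k, d) - 1
    return rf


if __name__ == "__main__":
    k, c_in, c_out = 3, 64, 64  # hypothetical channel widths
    std = standard_conv_params(k, c_in, c_out)
    sep = dw_separable_params(k, c_in, c_out)
    print(f"standard: {std}, depthwise + 1x1: {sep}, ratio: {std / sep:.2f}")
    # Two 3 x 3 depthwise layers with dilation rates 1 and 2, as in the encoder block.
    print("receptive field:", stacked_receptive_field([(3, 1), (3, 2)]))
```

With these assumed widths, the depthwise variant needs roughly one eighth of the weights, and the two stacked layers cover a 7 × 7 neighborhood instead of the 5 × 5 obtained without dilation.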

Decoder
The backbone network is followed by the binary segmentation part, which is the decoding part of the network. To recover the feature resolution and produce the rail line binary segmentation images, we design a decoder architecture that follows the encoder [19]. Although the transposed convolutional layer is commonly used to upsample the intermediate features in a neural network, it has the disadvantage of excessive computation. Since the sub-pixel convolution layer has the advantages of no parameters and no computational cost, we use it to gradually restore the feature resolution. The last layer of the decoder is a softmax layer, which is used to classify pixels. The decoder of RailNet is trained to output binary segmentation maps, indicating whether each pixel belongs to a rail line or not [23].
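The parameter-free rearrangement at the heart of a sub-pixel convolution layer (often called pixel shuffle) can be sketched in a few lines. This is a minimal NumPy illustration rather than the RailNet decoder itself; the function name and the channel-first layout are assumptions made for the example.

```python
import numpy as np


def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange a (C*r*r, H, W) tensor into (C, H*r, W*r) without any weights.

    Output element [c, h*r + i, w*r + j] is taken from input channel
    c*r*r + i*r + j at spatial position (h, w).
    """
    c_rr, h, w = x.shape
    if c_rr % (r * r) != 0:
        raise ValueError("channel count must be divisible by r * r")
    c = c_rr // (r * r)
    x = x.reshape(c, r, r, h, w)       # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)     # reorder to (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)  # fold i and j into the spatial axes


if __name__ == "__main__":
    features = np.arange(4 * 2 * 3).reshape(4, 2, 3)
    upsampled = pixel_shuffle(features, r=2)
    print(features.shape, "->", upsampled.shape)
```

Because the operation only reorders existing values, the upsampling itself adds no learnable parameters, which is the property the decoder relies on.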

Segmentation Soul (SS) Module
To further improve the segmentation accuracy, we propose a booster training strategy [24], called the Segmentation Soul (SS) module. This module consists of a 3 × 3 global average pooling layer, a 1 × 1 convolution layer, and a 3 × 3 convolution layer. The specific structure of the SS module is shown in Figure 4. It acts much like a catalyst in a chemical reaction: it enhances the feature representation in the training phase and can be discarded in the testing phase. Accordingly, it adds little computational complexity in the testing phase. The SS module can be inserted at different positions of the RailNet. In general, it improves the segmentation performance without any extra testing time. In Figure 4, ReLU denotes the ReLU activation function, 1 × 1 and 3 × 3 indicate the kernel sizes, and H × W × S represents the tensor shape (height, width, depth).
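The idea of a branch that contributes to the loss during training and is skipped at test time can be illustrated with a toy model. The sketch below does not reproduce the SS module's layers; the class `ToySegNet`, its two linear heads, and the `aux_weight` parameter are hypothetical stand-ins chosen only to show the train/test switch.

```python
import numpy as np


def bce_with_logits(logits: np.ndarray, target: np.ndarray) -> float:
    """Mean binary cross-entropy computed from raw logits (numerically stable)."""
    per_pixel = (np.maximum(logits, 0) - logits * target
                 + np.log1p(np.exp(-np.abs(logits))))
    return float(per_pixel.mean())


class ToySegNet:
    """Toy model with a main head and a training-only auxiliary head."""

    def __init__(self, channels: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w_main = rng.normal(size=channels)  # stands in for the decoder
        self.w_aux = rng.normal(size=channels)   # stands in for the booster branch
        self.training = True

    def forward(self, feat: np.ndarray):
        """`feat` has shape (C, H, W); heads are per-pixel linear maps."""
        main = np.tensordot(self.w_main, feat, axes=1)  # (H, W) logits
        if not self.training:
            return main  # the auxiliary branch is never evaluated at test time
        aux = np.tensordot(self.w_aux, feat, axes=1)
        return main, aux


def training_loss(model: ToySegNet, feat, mask, aux_weight: float = 1.0) -> float:
    """Main loss plus the weighted auxiliary loss, used only while training."""
    main, aux = model.forward(feat)
    return bce_with_logits(main, mask) + aux_weight * bce_with_logits(aux, mask)
```

In this pattern the auxiliary head shapes the shared features through its gradient during training, while the test-time forward pass costs exactly as much as a model that never had the extra branch.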

Loss Function
The RailNet model applies the classical cross-entropy as the loss function; the L1 loss, L2 loss, and cross-entropy loss are all widely used in rail line detection. In these losses, x_i is the input, y_i is the ground-truth value, that is, the known label, and y*_i is the predicted output. The cross-entropy loss uses an inter-class competition mechanism, and p_i is the probability that the sample belongs to class i among the C classes. When C = 2, the cross-entropy loss reduces to the binary classification case, where y is the label of the sample: one for the positive class and zero for the negative class. In railway line detection, the imbalance between rail line pixels and background pixels is considerable. To address this problem, each category is given a different weight w_i. However, because of the inter-class competition mechanism, the cross-entropy loss mainly reflects the accuracy of the predicted probability of the correct label and ignores the differences among the incorrect labels. To increase the intersection between predicted and actual rail line pixels, we propose an IoU-based loss function, L_IoU−Rail, where M_p is the set of predicted rail pixels, M_T is the set of real rail pixels, and M_C is the set of rail pixels in the overlap between the predicted and actual rail line areas.
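As a rough illustration of the two ideas in this section, the sketch below implements a class-weighted binary cross-entropy and an IoU-style loss. The exact expression of L_IoU−Rail is not reproduced in the text above, so the form 1 − M_C / (M_p + M_T − M_C) used here is an assumption based on the standard intersection-over-union definition, and soft pixel counts (sums of probabilities) are a further assumption. The weights `w_pos` and `w_neg` are hypothetical names for the per-class weights w_i.

```python
import numpy as np


def weighted_bce(prob, target, w_pos=1.0, w_neg=1.0, eps=1e-7):
    """Class-weighted binary cross-entropy averaged over all pixels."""
    p = np.clip(prob, eps, 1.0 - eps)  # avoid log(0)
    loss = -(w_pos * target * np.log(p)
             + w_neg * (1.0 - target) * np.log(1.0 - p))
    return float(loss.mean())


def iou_loss(prob, target, eps=1e-7):
    """Assumed IoU loss: 1 - M_C / (M_p + M_T - M_C), with soft pixel counts."""
    m_c = float((prob * target).sum())  # overlap between prediction and label
    m_p = float(prob.sum())             # predicted rail pixels
    m_t = float(target.sum())           # ground-truth rail pixels
    return 1.0 - m_c / (m_p + m_t - m_c + eps)


if __name__ == "__main__":
    label = np.array([[0.0, 1.0], [1.0, 0.0]])
    pred = np.array([[0.1, 0.8], [0.6, 0.2]])
    print("weighted BCE:", weighted_bce(pred, label, w_pos=2.0))
    print("IoU loss:", iou_loss(pred, label))
```

Giving the rail class a larger weight counteracts the pixel imbalance, while the IoU term rewards overlap directly and is therefore insensitive to the large number of easy background pixels.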

The Rail Line Fitting Algorithm Based on Sliding Window Detection
As mentioned in the previous section, the RailNet outputs a set of pixels for the rail lines. Fitting polynomials to these pixels in the original image space is not ideal, because higher-order polynomials are then needed to handle curved rail lines [23]. A generally accepted solution to this problem is to project the image into a "bird's eye" representation, where the rail lines are parallel to each other, so curved rail lines can be fitted with second- or third-order polynomials.
The algorithm mainly consists of three steps: Inverse Perspective Transformation (IPT), Feature Point Extraction (FPE), Rail Lines Curve Fitting. Figure 5 shows the specific algorithm flow in the form of a flowchart.

Inverse Perspective Transformation (IPT)
Inverse perspective transformation removes the perspective effect of the camera and restores the parallel rail lines as seen from a top view. The inverse perspective transformation is as follows [25]: where x_d and y_d satisfy: In Equation (2), O_C(A, B, C) is the coordinate of the optical center of the camera in the world coordinate system, and θ_1 and θ_2 are the pitch angle and yaw angle of the camera, respectively.
The point (x′, y′) in the inverse perspective image corresponds to the pixel (x, y) in the original image.
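A common practical way to obtain a top view, shown here only as an illustration, is to estimate a planar homography from four point correspondences instead of using camera pitch and yaw directly. The sketch below is therefore not the closed-form transformation of [25]; the function names and the point coordinates are hypothetical.

```python
import numpy as np


def homography_from_points(src, dst):
    """Solve for the 3 x 3 homography that maps four src points onto four dst points."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h0*x + h1*y + h2) / (h6*x + h7*y + 1), and likewise for v.
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    h = np.linalg.solve(np.array(rows, dtype=float), np.array(rhs, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)


def warp_points(H, pts):
    """Apply homography H to an (N, 2) array of points."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]  # divide by the projective coordinate


if __name__ == "__main__":
    # Hypothetical trapezoid around the rails in the camera view ...
    src = [(20, 36), (44, 36), (36, 20), (28, 20)]
    # ... mapped to a rectangle, so the two rails become parallel vertical lines.
    dst = [(20, 36), (44, 36), (44, 0), (20, 0)]
    H = homography_from_points(src, dst)
    print(warp_points(H, [(24, 28), (40, 28)]))
```

Any point on the left edge of the trapezoid is mapped onto the vertical line through the left corners of the rectangle, which is the property that makes the rails parallel in the transformed image.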

Feature Point Extraction (FPE)
After the inverse perspective transformation, the feature points of the target are detected and collected using a histogram, sliding windows, and related techniques. The first priority is to determine the coordinates of the initial base points of the sliding windows. With the bottom edge of the feature map after inverse perspective transformation taken as the x-axis, a histogram is used to count, for each abscissa on the x-axis, the feature pixels along the vertical direction of the image [26]. Because most feature pixels belong to the tracks, two clear peaks appear near the abscissas of the left and right rail lines [27]. The coordinates of these two peaks are the starting points of sliding window detection.
Sliding window detection is then applied to the feature map. First, the parameters are defined and initialized, including the number of sliding windows; the height and width of each window are derived from the image size and the number of windows [28]. Next, during feature point detection in a sliding window, the window pixels are traversed, and the coordinates of the non-zero pixels are recorded. When the number of effective pixels in the window is less than the threshold, the window is enlarged in height and width until the minimum number of pixels is met [29]. The average abscissa of the effective pixels in the window is then taken as the base point coordinate of the next sliding window, and detection is iterated until the total number of sliding windows is reached [30]. Finally, once feature point detection is complete, the resulting array contains the feature points of the detected target.
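The histogram and sliding window steps described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the parameter names and default values are hypothetical, and an under-filled window is only widened (not also made taller) until it contains enough pixels or spans the full image width.

```python
import numpy as np


def find_base_points(binary: np.ndarray):
    """Return the column of the strongest histogram peak in each image half."""
    hist = binary.sum(axis=0)  # rail pixels per column
    mid = hist.size // 2
    left = int(np.argmax(hist[:mid]))
    right = int(mid + np.argmax(hist[mid:]))
    return left, right


def sliding_window_points(binary, base_x, n_windows=9, half_width=10, min_pixels=5):
    """Collect (x, y) rail pixels by sliding a window from the bottom row upwards."""
    height, width = binary.shape
    win_h = height // n_windows
    xs, ys = [], []
    center = base_x
    for i in range(n_windows):
        y_hi = height - i * win_h
        y_lo = y_hi - win_h
        hw = half_width
        while True:
            x_lo, x_hi = max(center - hw, 0), min(center + hw + 1, width)
            rows, cols = np.nonzero(binary[y_lo:y_hi, x_lo:x_hi])
            if rows.size >= min_pixels or (x_lo == 0 and x_hi == width):
                break
            hw = 2 * hw + 1  # widen the window and look again
        if rows.size:
            xs.extend((cols + x_lo).tolist())
            ys.extend((rows + y_lo).tolist())
            # The mean abscissa becomes the base point of the next window.
            center = int(round(float(cols.mean()))) + x_lo
    return np.array(xs, dtype=int), np.array(ys, dtype=int)


if __name__ == "__main__":
    img = np.zeros((90, 60), dtype=np.uint8)
    for y in range(90):
        img[y, 15 + (89 - y) // 30] = 1  # left rail, drifting slightly to the right
        img[y, 45] = 1                   # right rail, perfectly vertical
    left, right = find_base_points(img)
    xs, ys = sliding_window_points(img, left)
    print(left, right, len(xs))
```

The arrays returned for each rail are the feature points that are passed on to curve fitting.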

Rail Lines Curve Fitting
The detected feature points are fitted by a curve fitting algorithm. Fitting a curve to the collected array of rail feature points gives good estimates of the rail line parameters, such as offset, inclination angle, and curvature radius, which makes it possible to predict the direction of the tracks and support the automatic train control system. The prevailing approach is to fit a quadratic or cubic curve directly with the least squares method. For uncomplicated rail line detection, a quadratic curve is usually sufficient:
$$f(x) = ax^2 + bx + c$$
where a, b, and c are the coefficients of the quadratic term, the linear term, and the constant term, respectively. The curvature of the rail lines is easily derived from the above polynomial:
$$K = \frac{|2a|}{\left(1 + (2ax + b)^2\right)^{3/2}}$$
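A least-squares quadratic fit and the resulting curvature can be computed in a few lines with NumPy. The sketch below assumes the standard curvature formula for a quadratic, |2a| / (1 + (2ax + b)^2)^(3/2); which image axis plays the role of the independent variable x is left to the caller, and the function names are hypothetical.

```python
import numpy as np


def fit_quadratic(x, f):
    """Least-squares fit of f = a*x^2 + b*x + c; returns (a, b, c)."""
    a, b, c = np.polyfit(x, f, deg=2)  # coefficients come highest power first
    return float(a), float(b), float(c)


def curvature(a: float, b: float, x: float) -> float:
    """Curvature of f(x) = a*x^2 + b*x + c at position x."""
    slope = 2.0 * a * x + b
    return abs(2.0 * a) / (1.0 + slope ** 2) ** 1.5


if __name__ == "__main__":
    x = np.linspace(0.0, 10.0, 21)
    f = 0.5 * x ** 2 - 2.0 * x + 3.0  # synthetic feature points on a known curve
    a, b, c = fit_quadratic(x, f)
    print(a, b, c)
    print("curvature at the vertex:", curvature(a, b, 2.0))
```

The radius of curvature is the reciprocal of this value, so a nearly straight rail (a close to zero) yields a very large radius.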

RAWRail
To realize the function of the network designed in this paper, we need to train and test it. To verify the feasibility of the algorithm and reduce the difficulty of feature extraction, we first collect video streams of the rail lines in front of the train in good daytime weather and then convert them into pictures. The images are collected with cameras installed at the front of local trains and city railways, which can capture objects about 350 m ahead during the daytime. However, due to the influence of the external environment and the inverse perspective transformation in the algorithm, the algorithm detects rail lines at distances of up to 280 m. The dataset is named RAWRail. A total of 3000 railroad track pictures with a size of 640 × 360 are prepared, and images with only 2 rail lines are considered first.
Secondly, we label the rail lines in all the rail images using LABELME to obtain the JSON files that serve as the ground-truth rail lines during training and are finally compared with the predicted rail lines [29]. In the actual training, the 3000 pictures are divided into a training set, a validation set, and a test set at a ratio of 0.9:0.05:0.05. The specific information of RAWRail is shown in Table 2.
Table 2. Specific distribution of the RAWRail.

Number of Rails    Left Curved Tracks    Right Curved Tracks    Straight Tracks    In All
2                  1000                  1000                   1000               3000
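The 0.9:0.05:0.05 split of the 3000 images described above corresponds to 2700 training, 150 validation, and 150 test images. The sketch below shows one way such a split could be produced; the random shuffling and the fixed seed are assumptions, since the text does not say how images were assigned to each subset.

```python
import random


def split_indices(n: int, ratios=(0.9, 0.05, 0.05), seed: int = 0):
    """Shuffle the indices 0..n-1 and split them into train, validation, and test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seeded for a reproducible split
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]      # the remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_indices(3000)
    print(len(train), len(val), len(test))
```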

Evaluation Metrics
At present, deep neural networks are rarely used for feature extraction of train tracks, so the existing evaluation indexes for rail line detection are not yet mature. TP (True Positive), TN (True Negative), FP (False Positive), and FN (False Negative) are commonly used in the field of image processing [31].
The accuracy is calculated as the average number of correct points per image: where X_i is the number of correct points and Y_i is the number of ground-truth points. A point is counted as correct when the difference between the ground truth and the predicted point is less than a certain threshold. In addition to accuracy, the FNR (False Negative Rate) and FPR (False Positive Rate) are also reported.
Here, F_pred is the number of rail lines that are actually positive but predicted to be negative, N_gt is the number of all ground-truth rail lines, N_pred is the number of rail lines that are actually negative but predicted to be positive, and F_all is the number of all negative tracks.
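The three metrics can be sketched directly from the definitions above. Because the formulas themselves are not reproduced in the text, the sketch assumes ACC = Σ X_i / Σ Y_i, FNR = F_pred / N_gt, and FPR = N_pred / F_all; the function names and the sample numbers are hypothetical.

```python
def count_correct(pred_points, gt_points, threshold: float) -> int:
    """Number of predicted points within `threshold` of their ground-truth point."""
    return sum(1 for p, g in zip(pred_points, gt_points) if abs(p - g) < threshold)


def accuracy(correct_per_image, gt_per_image) -> float:
    """Assumed form: total correct points divided by total ground-truth points."""
    return sum(correct_per_image) / sum(gt_per_image)


def false_negative_rate(f_pred: int, n_gt: int) -> float:
    """Assumed form: missed rail lines divided by all ground-truth rail lines."""
    return f_pred / n_gt


def false_positive_rate(n_pred: int, f_all: int) -> float:
    """Assumed form: wrongly predicted rail lines divided by all negative tracks."""
    return n_pred / f_all


if __name__ == "__main__":
    # Hypothetical x-coordinates of predicted and labeled points on one rail line.
    x_i = count_correct([10, 20, 35], [11, 20, 30], threshold=3)
    print("correct points:", x_i)
    print("ACC:", accuracy([x_i, 3], [3, 3]))
    print("FNR:", false_negative_rate(3, 100))
```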

Implementation Details
The hyperparameters are generally kept consistent across the experiments in this work. Although the dataset has 3000 images, this is far from enough for training. Data augmentation is therefore applied during training with a probability of 10/11. The transformations used are rotation by an angle in degrees θ ∼ U(−10, 10) and horizontal flip with a probability of 0.5 [22]. The Adam optimizer is used along with the Cosine Annealing learning rate scheduler, with a batch size of 16 and an initial learning rate of 5 × 10^−4, until convergence [26]. The training session runs for 1961 epochs, taking approximately 23 h on four GeForce RTX 2080Ti GPUs. In the post-processing curve fitting, a second-order polynomial is chosen as the default. TensorBoard is used for data visualization and analysis, and the RailNet training process is shown in Figure 6.
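The learning rate schedule and the augmentation sampling described above can be sketched with the standard library alone. The sketch assumes the usual cosine annealing formula with a minimum learning rate of zero and one scheduler step per epoch; neither detail is stated in the text, and the function and parameter names are hypothetical.

```python
import math
import random


def cosine_annealing_lr(step: int, total_steps: int,
                        lr_max: float = 5e-4, lr_min: float = 0.0) -> float:
    """Cosine annealing from lr_max at step 0 down to lr_min at total_steps."""
    cos_term = 1.0 + math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term


def sample_augmentation(rng: random.Random, p_apply: float = 10 / 11,
                        max_angle: float = 10.0, p_flip: float = 0.5) -> dict:
    """Draw augmentation parameters for one training image."""
    if rng.random() >= p_apply:
        return {"angle": 0.0, "flip": False}  # image is left unchanged
    return {
        "angle": rng.uniform(-max_angle, max_angle),  # rotation in degrees
        "flip": rng.random() < p_flip,                # horizontal flip
    }


if __name__ == "__main__":
    total_epochs = 1961
    for epoch in (0, total_epochs // 2, total_epochs):
        print(epoch, cosine_annealing_lr(epoch, total_epochs))
    print(sample_augmentation(random.Random(0)))
```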

Results
In order to verify the rationality and superiority of the algorithm, the following three sets of experiments are designed. The detailed experimental diagram of each step of the algorithm is shown in Figure 7.

State-of-the-Art Comparison
The results of comparing our algorithm with other recent algorithms are shown in Table 3. Because these works do not provide source code, some evaluation data can only be taken directly from the original papers. In terms of detection accuracy and detection speed, our algorithm is highly competitive. Figure 8 shows the results of the algorithm detecting the rail line area, which reflect its strong performance on both straight and curved rail lines. The studies in [1,11,25] all lack the FPR and FNR evaluation metrics, and [11] also lacks the FPS metric. Therefore, we apply the FPR and FNR metrics to these three algorithms and the FPS metric to [11]. To do so, we reproduce the algorithms and conduct supplementary experiments on the FPR, FNR, and FPS metrics. Moreover, we run the four algorithms on the same hardware and GPU. In this way, the ACC, FPR, FNR, and FPS of the four algorithms can be compared comprehensively and clearly. In practical application, our algorithm not only detects the rail lines in real time during train operation but is also robust in bad weather.

Multi-Rail Line Detection
Some situations in practical application involve more than one track, such as several trains running in parallel or a train switching rail lines at a rail fork. In these cases, it is vital to identify multiple rail lines. Table 4 shows that the algorithm still achieves high accuracy in identifying multi-rail lines, which indicates that the algorithm is robust.

Ablation Study
To investigate the impact of some of the decisions made for the proposed method [22], two ablation studies were carried out, using only RAWRail's training set for training and the validation set for testing. In the first study, polynomials of different orders are used to fit the rail lines in the curve fitting part of the rail lines fitting algorithm based on sliding window detection. The experimental results are shown in Table 5. Due to the camera's angle of view, the detected track lines are mainly near the camera, where the curvature is not particularly pronounced. This is why low-order polynomials of different orders give similar fitting results. Even so, the data in the table show that the detection accuracy is highest with a second-order polynomial, so the second-order polynomial is used in the design of this algorithm.

Rail Line Type      ACC       FPR        FNR
Multi-rail lines    94.16%    0.05958    0.02832

In the second ablation study, we find that the resolution of the train camera is also a key factor affecting the results. Images of different sizes are input into the algorithm, and the specific experimental results are shown in Table 6. The results show that reducing the image size reduces the accuracy of rail line prediction, while the detection speed increases significantly. In practical applications, the image size can be chosen to balance accuracy and detection speed according to the characteristics of this algorithm [32]. The image size in the RAWRail is 640 × 320.

Conclusions
In this paper, a novel rail line detection algorithm based on deep learning is proposed. First, we propose the lightweight RailNet, which extracts the features of the tracks by treating rail line detection as an image segmentation problem. The design of RailNet markedly improves the accuracy and real-time performance of detection. Afterward, we design the rail lines fitting algorithm based on sliding window detection, which makes full use of the segmentation feature maps output by RailNet and finally marks the rail lines on the original images. For the training and testing of the RailNet, we collect a real-world rail lines dataset called RAWRail. Compared with the state-of-the-art methods, the proposed method is effective and efficient, reaching an accuracy of 98.6% at a detection speed of 74 FPS. Furthermore, the proposed algorithm also works well with multi-rail lines, which gives it wide application prospects.

Conflicts of Interest:
The authors declare no conflict of interest.