A Deep Learning-Based Method for Overhead Contact System Component Recognition Using Mobile 2D LiDAR.

The overhead contact system (OCS) is a critical railway infrastructure for train power supply. Periodic inspections, which assess the operational condition of the OCS and detect problems, are necessary to guarantee the safety of railway operations. One means of OCS inspection is to analyze point cloud data collected by mobile 2D LiDAR, and recognizing OCS components in the collected point clouds is a critical task of this analysis. However, the complex composition of the OCS makes the task difficult. To solve the problem of recognizing multiple OCS components, we propose a new deep learning-based method that conducts semantic segmentation on the point cloud collected by mobile 2D LiDAR. Because our method classifies points into meaningful categories of objects scan line by scan line, both online data processing and batch data processing are supported. Local features are important for the success of point cloud semantic segmentation. We therefore design an iterative point partitioning algorithm and a module named Spatial Fusion Network, two critical components of our method for multi-scale local feature extraction. We evaluate our method on point clouds in which sixteen categories of common OCS components have been manually labeled. Experimental results show that our method is effective in multiple object recognition: the mean Intersection-over-Unions (mIoUs) of online data processing and batch data processing are, respectively, 96.12% and 97.17%.


Introduction
Rail transportation is a significant part of the transportation network. As of 2019, the mileage of the global railway network exceeds 1.3 million kilometers [1]. Billions of passengers and tons of cargo are transported by rail every year. The large-scale railway network with high traffic challenges the stability and safety of the railway system. The overhead contact system (OCS) is a critical railway infrastructure for train power supply [2]. The components of the OCS, such as contact wires, are easily deformed by the contact force from the pantograph and by various climatic conditions. These deformations might result in an unstable power supply and even accidents. To assess the operational condition of the OCS and discover potential problems, periodic OCS inspections are required [3]. However, most OCS inspections nowadays are still performed by workers. For instance, in China, workers with relevant expertise and experience manually measure the geometries of OCS components with the help of a portable laser rangefinder. With the continuously growing mileage of railways [4], this manual measurement, which is inefficient and costly, may be unable to meet demand in the future.
Mobile laser scanning (MLS) is an emerging technology for capturing dense three-dimensional (3D) point clouds, enabling quick and accurate data acquisition. In this study, we propose a new deep learning-based method to conduct semantic segmentation on the point cloud collected by mobile 2D LiDAR. Our method classifies points into meaningful categories of objects scan line by scan line. As local features are important for the success of point cloud semantic segmentation [12,13], an iterative point partitioning algorithm is developed to partition points into regions for local feature extraction at each scan line, and a Recurrent Neural Network (RNN)-based module, named Spatial Fusion Network (SFN), is developed to extract and fuse local features among scan lines. We leverage the iterative point partitioning algorithm and SFN to perform multi-scale feature extraction, which improves the segmentation performance. Experiments are carried out on the MLS point cloud collected from the OCS of Chinese high-speed railways. Experimental results show that our method is light and fast and classifies points with high accuracy. To verify the effectiveness of our design, we further analyze the influences of different settings of feature extraction and feature fusion. The contributions of our work are summarized as follows:
1. A new deep learning-based method is proposed to solve the problem of recognizing multiple OCS components from the point cloud collected by mobile 2D LiDAR. Experiments show that our method is effective in multiple object recognition.
2. To solve the issue of local feature extraction, we develop an iterative point partitioning algorithm and SFN to acquire local features among scan lines at multiple scales.
3. Our method is designed to classify points scan line by scan line, supporting both online data processing and batch data processing.
The rest of this paper is organized as follows. Section 2 presents the previous works of railway object recognition based on MLS data, some existing commercial solutions of OCS inspections, and some deep learning-based architectures of point cloud semantic segmentation. Section 3 illustrates the methodology. Section 4 describes the experiments and experimental results. Section 5 draws conclusions and the vision of the future.

Related Work
In previous studies, several methods have been proposed to recognize railway elements from MLS data. These methods can be mainly categorized into two types: (1) model-driven and data-driven methods; and (2) learning-based methods.
With model-driven and data-driven methods, points are grouped into non-overlapping regions based on manually designed features and handcrafted rules. For instance, Arastounia et al. [14][15][16] recognized railway infrastructure based on templates, local structures, object shapes, and topological relationships among objects. Lou et al. [17] extracted railway tracks from MLS data based on height differences calculated by a sliding window method. Stein et al. [18] detected rails and tracks in 2D LiDAR data based on extracted geometrical and topological features and a rail shape template. Zou et al. [19] leveraged vector shift estimation, K-means clustering, and reverse smoothing to extract rail tracks, resolving the difficulties of rail track extraction in scenarios of bends and turnouts. Pastucha [6] recognized objects of the OCS by using the RANSAC and DBSCAN algorithms. Gézero et al. [20] designed a simple algorithm to extract the linear elements of the railway based on the assumption that the extracted elements are roughly parallel to the sensor system trajectory. However, manually designed features and handcrafted rules are often tailored to recognizing specific objects and are difficult to extend to new tasks. Besides, a rule for recognizing various objects in a point cloud is difficult for human beings to design because it might be very complicated.
Compared to model-driven and data-driven methods, learning-based methods are more promising for fully understanding the point clouds of complex scenes because features are automatically learned from data rather than manually designed. The regular supervised machine learning approach has been introduced to recognize railway objects. For instance, Jung et al. recognized ten categories of railway electrification system objects from the MLS point cloud by using a multi-range CRF model [7] or a multi-scale hierarchical CRF model [8]. However, few studies have introduced the deep learning approach to recognize railway objects from MLS data. In contrast, some deep learning-based methods [21][22][23][24][25][26][27][28][29] have been successfully applied to image-based OCS inspections for component detection and defect detection.
As for commercial solutions of OCS inspections, the image processing technology and the MLS technology are commonly introduced. Due to the mature development of image processing technology, some methods, such as edge detection, model-matching, and deep learning-based image recognition, are used for defect detection. For instance, JX-300 [30] is a kind of inspection vehicle equipped with an image-based OCS inspection system. Compared with image data, MLS data contain 3D information, which is useful to measure geometric parameters of the OCS components. For instance, tCat [31] is a real-time mobile mapping solution using a portable rail trolley with laser sensors to measure geometries of overhead wires. Selectra Vision Company [32] developed LiDAR-based equipment for contact wire measurement. In addition, some solutions combined the image processing technology and the MLS technology to build up OCS inspection systems. For instance, Meidensha Corporation [33,34] combined laser scanners and line scan cameras to detect the position of contact wires. MERMEC Company [35,36] developed an inspection system that is made up of lasers and cameras to monitor the overhead wires and pick up defects.
Nowadays, deep learning-based methods of point cloud semantic segmentation provide novel means to recognize various objects in point cloud scenes. PointNet [37], proposed by Qi et al., is a pioneering deep learning-based architecture in the field of 3D point cloud classification and semantic segmentation. However, the lack of local information restricts the performance of PointNet in large and complicated scenarios. Qi et al. further proposed PointNet++ [12] to overcome the disadvantages of PointNet by using a hierarchical structure to capture local features of a point cloud, achieving strong robustness and good performance. Since local features are significant for improving the performance of 3D point cloud semantic segmentation, several studies have addressed local feature extraction. Ye et al. [38] proposed a model named 3P-RNN, in which local features are extracted by Pointwise Pyramid Pooling modules and RNNs. Other architectures extract local features based on convolutional operations, such as PointCNN [39], Pointwise CNN [40], DGCNN [41], and LDGCNN [13].

Methodology
As for the long-distance OCS inspection, the MLS system would collect millions of points. It is a tremendous task to compute point relations in the whole point cloud. To reduce computational complexity, we process data scan line by scan line, avoiding computing relations between points of different scan lines.
The framework of our method is shown in Figure 2. As for feature extraction, we are concerned about two issues: (1) we have to generate regions for local feature extraction; and (2) 3D contexts are in demand for classifying points at a scan line because the 2D features extracted at a scan line are incompetent for recognizing 3D objects. To solve these two issues, we leverage the iterative point partitioning algorithm to generate regions and SFN to acquire 3D contexts by feature extraction and feature fusion. Details of the proposed framework are described in the following sections.

Data Preprocessing
In the 2D polar coordinate system, the tth scan line can be represented as a set of 2D points:

A_t = {(α_{t,i}, θ_{t,i}) | i = 1, . . . , N},

where α_{t,i} is a range distance, θ_{t,i} is a bearing angle, and N is the number of points. Combined with traveled distances, we can transform points of scan lines into the 3D Cartesian coordinate system defined in Figure 3. Let d_t be the traveled distance of the tth scan line, and let B_t be the set of 3D points transformed from A_t. Let u and v be, respectively, the horizontal offset and vertical offset of the LiDAR relative to the rail track centerline. The point set B_t can be written as:

B_t = {(x_{t,i}, y_{t,i}, z_{t,i}) | i = 1, . . . , N},

where the coordinates x_{t,i}, y_{t,i}, and z_{t,i} are computed as follows:

x_{t,i} = α_{t,i} cos θ_{t,i} + u,   y_{t,i} = α_{t,i} sin θ_{t,i} + v,   z_{t,i} = d_t.

To reduce computation, we set up an inspection area to filter out distant points that belong to background objects such as trees and buildings. Let P_t be the set of points within the inspection area:

P_t = {(x, y, z) ∈ B_t | x_min ≤ x ≤ x_max, y_min ≤ y ≤ y_max},

where x_min, x_max, y_min, and y_max describe the inspection area. By integrating the points within the inspection area, we can form a point cloud for OCS inspection:

Pc = P_1 ∪ P_2 ∪ . . . ∪ P_T,

where T is the number of scan lines.
Figure 3. An MLS system carried by an inspection device at the starting point. The front view is shown on the left, and the left view is shown on the right. Since the device moves on the rail, the defined 3D Cartesian coordinate system is relative to the rail tracks.
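The preprocessing above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the axis convention (x lateral, y vertical, z along-track), the function name, and the default inspection-area bounds are our assumptions.

```python
import numpy as np

def scan_line_to_points(alpha, theta, d_t, u, v,
                        x_range=(-6.0, 6.0), y_range=(0.0, 10.0)):
    """Transform one scan line A_t from polar to track-relative Cartesian
    coordinates and keep only points inside the inspection area.
    x_range and y_range stand in for (x_min, x_max) and (y_min, y_max)."""
    x = alpha * np.cos(theta) + u      # lateral coordinate
    y = alpha * np.sin(theta) + v      # vertical coordinate
    z = np.full_like(x, d_t)           # along-track coordinate (traveled distance)
    b_t = np.stack([x, y, z], axis=1)  # the 3D point set B_t

    # Inspection-area filter: drop distant background points to obtain P_t.
    keep = ((b_t[:, 0] >= x_range[0]) & (b_t[:, 0] <= x_range[1]) &
            (b_t[:, 1] >= y_range[0]) & (b_t[:, 1] <= y_range[1]))
    return b_t[keep]
```

Here a point 100 m away (e.g., a building facade) would fall outside the default x-range and be discarded before any further processing.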

Iterative Point Partitioning
The proposed algorithm for partitioning points is conducted iteratively at each scan line to generate regions and establish region-to-region relationships. Note that a region is a set of points in a local neighborhood. In this algorithm, points are partitioned based on the region centroids acquired previously. As defined in Figure 4, region size and region span are the two concepts applied in the algorithm:

•
To extract local features at a specified scale, we group points into non-overlapping regions whose region size is limited by a region size threshold ε.

•
To build up the region-to-region relationship in a 3D local neighborhood for feature fusion, the region span between two regions must be less than the span threshold δ. The value of δ is chosen based on human experience with the shapes of OCS objects.
At the tth scan line, the points within the inspection area can be written as:

P_t = {p_{t,1}, . . . , p_{t,N_t}},

where N_t is the number of points and z_t is the common z-coordinate of the scan line. Let C^ε_{t−1} = [ĉ^ε_{t−1,1}, . . . , ĉ^ε_{t−1,K}] be the list of the K region centroids acquired previously with the region size threshold ε. The specific steps of our iterative point partitioning algorithm are as follows:
1. To restrict the region span within δ, we filter the region centroids in C^ε_{t−1} by discarding the ones that do not satisfy the condition |z_t − ẑ^ε_{t−1,n}| ≤ δ, where ẑ^ε_{t−1,n} denotes the z-coordinate value of ĉ^ε_{t−1,n}.
2. Points in P_t are matched to the remaining region centroids by the nearest matching strategy, which is based on the match function match(p, C) = argmin_{ĉ ∈ C} Dist3D(p, ĉ), where p is a point to be matched, C is a set of region centroids, and the Dist3D function computes the distance between p and ĉ in the 3D coordinate system.
3. To restrict the region size within ε, we discard the matching relations that do not satisfy the condition Dist2D(p_{t,i}, match(p_{t,i}, C)) ≤ ε, where the Dist2D function computes the distance between p_{t,i} and the matched centroid ĉ^ε_{t−1,n} in the xy-coordinate system. The points without a matching relation are grouped into an unmatched point set.
4. Assume that ĉ^ε_{t−1,n} is the region centroid of region G^ε_{t−1,n} (as described in Step 6, ĉ^ε_{t−1,n} might be the region centroid of another previous non-empty region). Points matched to ĉ^ε_{t−1,n} are positionally near G^ε_{t−1,n}, so they can be grouped together to form a new region G^ε_{t,n} for feature fusion; the region-to-region relationship between G^ε_{t−1,n} and G^ε_{t,n} is thereby built up. Note that the subscript n of symbols formatted as G^ε_{t,n} denotes region-to-region relationships. The new region G^ε_{t,n} is the set of points in P_t matched to ĉ^ε_{t−1,n} under the conditions of Steps 2 and 3.
5. The hierarchical clustering algorithm is utilized to partition the unmatched points acquired in Step 3. The furthest-neighbor strategy [42] with ε as the merging threshold is applied to this process.
6. The regions generated with ε can be represented as [G^ε_{t,1}, . . . , G^ε_{t,K+M}], where G^ε_{t,1}, . . . , G^ε_{t,K} are the regions acquired in Step 4 by matching, and G^ε_{t,K+1}, . . . , G^ε_{t,K+M} are the regions acquired in Step 5 by clustering. Based on these regions, we produce the region centroid list C^ε_t for the next iteration, where ĉ^ε_{t,n} is the centroid of G^ε_{t,n} if the region is not empty and is carried over from ĉ^ε_{t−1,n} otherwise.
7. To capture point-to-point relations and region-to-region relations, we transform points into the local 2D coordinate system relative to their region centroids, and we compute the region spans between regions whose features will be fused:

U^ε_{t,n} = {((x − x̂^ε_{t,n})/ε, (y − ŷ^ε_{t,n})/ε) | (x, y, z) ∈ G^ε_{t,n}},   s^ε_{t,n} = (z_t − ẑ^ε_{t−1,n})/δ,

where n = 1, . . . , K + M; U^ε_{t,n} is a set of 2D points in the local coordinate system; s^ε_{t,n} is a region span; x̂^ε_{t,n} and ŷ^ε_{t,n} are, respectively, the x-coordinate value and the y-coordinate value of ĉ^ε_{t,n}; and ε and δ are utilized to scale the values into [−1, 1]. In Figure 4, the distance between plane A and plane B denotes the region span between region A and region B.
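The matching pass of the algorithm (Steps 1-3) can be sketched as follows. This is a minimal NumPy illustration under our own naming: the brute-force pairwise distance computation stands in for whatever spatial indexing a production implementation would use, and points labeled -1 are the unmatched set handed to hierarchical clustering in Step 5.

```python
import numpy as np

def match_points(points, centroids, eps, delta, z_t):
    """One matching pass of the iterative point partitioning algorithm.
    points:    (N, 3) array, the scan-line point set P_t
    centroids: (K, 3) array, the previous region centroids C_{t-1}
    Returns a label per point: the index of its matched centroid, or -1."""
    # Step 1: keep only centroids whose span to this scan line is within delta.
    near = np.abs(z_t - centroids[:, 2]) <= delta
    cands = centroids[near]
    kept_idx = np.flatnonzero(near)

    labels = np.full(len(points), -1)  # -1 marks unmatched points
    if len(cands) > 0:
        # Step 2: nearest matching in the 3D coordinate system (Dist3D).
        d3 = np.linalg.norm(points[:, None, :] - cands[None, :, :], axis=2)
        nearest = np.argmin(d3, axis=1)
        # Step 3: discard matches whose 2D (xy) distance exceeds eps (Dist2D).
        d2 = np.linalg.norm(points[:, :2] - cands[nearest, :2], axis=1)
        ok = d2 <= eps
        labels[ok] = kept_idx[nearest[ok]]
    return labels
```

Points sharing a label form a new region with a region-to-region link to the previous region of that centroid; the unmatched remainder is clustered with the furthest-neighbor strategy.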

Spatial Fusion Network
As PointNet is an effective architecture for extracting features from unordered points [37], a PointNet layer is introduced into SFN for feature extraction. In addition, inspired by the success of RNNs in video recognition [43][44][45][46], we utilize RNNs to fuse features among different scan lines to acquire 3D contexts. Specifically, the RNNs in our SFN are implemented as two single-layer Long Short-Term Memory (LSTM) networks [47]. The architecture of SFN is shown in Figure 5. The inputs, represented as a set of 2D points in their local coordinate system together with a region span value, are produced by the iterative point partitioning algorithm (Step 7). At the tth iteration, the input points can be represented as {q_{t,1}, . . . , q_{t,M_t}}, where M_t is the number of input points, and the region span can be represented as s_t. The tth iterative process of SFN is described below in detail.

PointNet Layer for Feature Extraction
In the PointNet layer, the input points are firstly mapped to a high-dimensional feature space by a multilayer perceptron (MLP):

f^q_{t,i} = MLP(q_{t,i}),

where MLP : R^2 → R^128 and f^q_{t,i} is the point feature of q_{t,i}. Then, the point features are aggregated by a max-pooling layer to form a region feature:

f^r_t = MaxPool({f^q_{t,1}, . . . , f^q_{t,M_t}}),

where f^r_t is the region feature, which is aware of the relations among the input points.
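The PointNet layer can be sketched in PyTorch as below. Only the R^2 → R^128 mapping and the max-pooling are stated in the text; the intermediate hidden width of 64 and the activation placement are our assumptions.

```python
import torch
import torch.nn as nn

class PointNetLayer(nn.Module):
    """Shared MLP (R^2 -> R^128) followed by max-pooling over the points
    of one region, mirroring the PointNet layer of SFN."""
    def __init__(self, in_dim=2, out_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),   # hidden width is an assumption
            nn.Linear(64, out_dim), nn.ReLU(),
        )

    def forward(self, q):                 # q: (M_t, 2) points of one region
        f_q = self.mlp(q)                 # per-point features f^q_{t,i}
        f_r = f_q.max(dim=0).values       # region feature f^r_t, shape (128,)
        return f_q, f_r
```

Because max-pooling is symmetric, the region feature is invariant to the ordering of the input points, which is the property that makes PointNet suitable for unordered point sets.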

Recurrent Neural Networks for Feature Fusion
For feature fusion, the input region span is firstly used to produce the span feature, which represents the spatial relationship between regions. Then, the span feature is concatenated with the region feature for feature fusion. There are two optional kinds of LSTM:
• unidirectional LSTM (UDLSTM), which uses forward dependencies to fuse features; and
• bidirectional LSTM (BDLSTM), which makes use of both forward and backward dependencies to form complete 3D perspectives for feature fusion.
When using the UDLSTM, the process of feature fusion can be written as:

f^f_t, h^1_t = LSTM_1(concat(f^s_t, f^r_t), h^1_{t−1}),   (15)

and the process using the BDLSTM can be written as:

f^f_t = concat(o^1_t, o^2_t),  with  o^1_t, h^1_t = LSTM_1(concat(f^s_t, f^r_t), h^1_{t−1})  and  o^2_t, h^2_t = LSTM_2(concat(f^s_t, f^r_t), h^2_{t+1}),

where f^s_t is the span feature; f^r_t is the region feature acquired in Equation (14); f^f_t is the new region feature with a 3D context; h^1_{t−1}, h^1_t, and h^1_{t+1} are hidden states of LSTM_1; h^2_{t−1}, h^2_t, and h^2_{t+1} are hidden states of LSTM_2; and the concat function concatenates the input features.
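The fusion step can be sketched in PyTorch as below. This is an illustrative module, not the authors' code: the span-feature encoder is a hypothetical single linear layer, the feature widths are assumptions, and PyTorch's `bidirectional=True` flag plays the role of the paper's LSTM_1/LSTM_2 pair.

```python
import torch
import torch.nn as nn

class FusionRNN(nn.Module):
    """Fuse the region features of one region sequence across scan lines
    with a UDLSTM (bidirectional=False) or BDLSTM (bidirectional=True)."""
    def __init__(self, feat_dim=128, span_dim=16, bidirectional=False):
        super().__init__()
        self.span_enc = nn.Linear(1, span_dim)   # hypothetical span encoder
        self.lstm = nn.LSTM(feat_dim + span_dim, feat_dim,
                            num_layers=1, bidirectional=bidirectional)

    def forward(self, region_feats, spans):
        # region_feats: (T, feat_dim) region features f^r_t of one sequence
        # spans:        (T, 1) region spans s_t, already scaled into [-1, 1]
        f_s = self.span_enc(spans)                 # span features f^s_t
        x = torch.cat([f_s, region_feats], dim=1)  # concat(f^s_t, f^r_t)
        out, _ = self.lstm(x.unsqueeze(1))         # fuse along the t axis
        return out.squeeze(1)                      # fused features f^f_t
```

In the bidirectional case the output width doubles, since the forward and backward passes are concatenated, matching the "complete 3D perspective" described above.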
By combining the point feature and the region feature, we can capture relationships between points in a 2D region. By feature fusion, we can acquire contexts in 3D neighborhoods. Therefore, we concatenate f^r_t and f^f_t with each f^q_{t,i} to form the new point features with 3D contexts:

f^p_{t,i} = concat(f^q_{t,i}, f^r_t, f^f_t),

where f^p_{t,i} is the new point feature of q_{t,i}. Figure 6 shows an instance of single-scale feature extraction.

Multi-Scale Feature Extraction and Per-Point Classification

As for multi-scale feature extraction, we conduct the single-scale feature extraction repeatedly with different region size thresholds. At the tth scan line, the point features extracted at K scales can be written as:

{(f^{p,ε_1}_{t,i}, . . . , f^{p,ε_K}_{t,i}) | i = 1, . . . , N_t},

where f^{p,ε_k}_{t,i}, extracted with the region size threshold ε_k, is a point feature of p_{t,i}, and N_t is the number of points within the inspection area at the tth scan line. The point features extracted at multiple scales are concatenated together to form multi-scale point features. Then, the multi-scale point features are input to an MLP to yield point scores:

score_{t,i} = MLP(f^p_{t,i}),   (19)

where f^p_{t,i} denotes the multi-scale point feature of p_{t,i}, and score_{t,i} is the point score indicating which class p_{t,i} should belong to. In Equation (19), an MLP with (512, 128, 32, c) as its four hidden layer sizes is shared for yielding point scores at each scan line, where c is the number of categories of targets. Additionally, batch normalization with ReLU is applied to each layer of the MLP.

As depicted in Figure 6, symbols formatted as P_t, C^ε_t, U^ε_{t,n}, and s^ε_{t,n} are those defined in Section 3.2. At the beginning of the iterative process of point partitioning, C^ε_0 is initialized as ∅. Then, regions and region-to-region relationships are produced by the iterative point partitioning algorithm (IPPA). The region-to-region relationships are indicated by the superscript n of the region data, which are formatted as (U^ε_{t,n}, s^ε_{t,n}). Note that regions without points (e.g., region data (U^ε_{t,a} = ∅, s^ε_{t,a})) are discarded for feature extraction. Based on the region-to-region relationships, region data are reorganized into sequences (e.g., the sequence . . . , (U^ε_{t−1,a}, s^ε_{t−1,a}), (U^ε_{t+1,a}, s^ε_{t+1,a}), . . . ). SFNs extract and fuse features among the sequential region data to acquire the point features with 3D contexts. We share the weights of SFNs for processing data at a single scale. The output Fp^ε_t = {f^{p,ε}_{t,1}, . . . , f^{p,ε}_{t,N_t}} is the set of point features corresponding to the input points P_t = {p_{t,1}, . . . , p_{t,N_t}}.
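The multi-scale concatenation and the per-point classifier can be sketched as follows. The (512, 128, 32, c) hidden sizes and the batch normalization with ReLU follow the text; the per-scale feature width of 384 and the placement of normalization on the hidden layers only are our assumptions.

```python
import torch
import torch.nn as nn

def make_score_mlp(in_dim, num_classes):
    """Per-point classifier with hidden sizes (512, 128, 32, c);
    BatchNorm and ReLU follow each hidden layer."""
    sizes = [in_dim, 512, 128, 32]
    layers = []
    for a, b in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(a, b), nn.BatchNorm1d(b), nn.ReLU()]
    layers.append(nn.Linear(sizes[-1], num_classes))  # final layer yields scores
    return nn.Sequential(*layers)

# Concatenate the point features extracted at the three scales
# (per-scale width 384 is an assumption) before scoring.
f_small, f_mid, f_large = (torch.randn(32, 384) for _ in range(3))
f_multi = torch.cat([f_small, f_mid, f_large], dim=1)   # (32, 1152)
scores = make_score_mlp(f_multi.shape[1], 16)(f_multi)  # 16 OCS classes
```

Each row of `scores` is a per-point logit vector; the predicted class is its argmax.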

Data Description
The MLS data for training and testing were collected in 2019 from sections of the Chinese high-speed railway with a total length of 16.7 km. A SICK LMS511 LiDAR sensor is used in our MLS system for data acquisition. Table 1 shows the operating parameters of our MLS system, and Figure 7a shows the scene where the data were acquired. The labeled dataset is described in Table 2.

Implementation Details
For multi-scale feature extraction, we extract features with three region size thresholds, namely ε_1 = 0.5 m, ε_2 = 2 m, and ε_3 = 7 m, which capture object relations at small, middle, and large scales. In addition, the region span threshold δ is set to 1 m. Figure 8 shows the results of iterative point partitioning with these three region size thresholds. As for feature fusion, UDLSTMs and BDLSTMs are, respectively, used in the SFN to evaluate our method. Note that the mode with UDLSTMs is named the unidirectional mode, and the mode with BDLSTMs is named the bidirectional mode. In terms of optimizing the learning-based modules, we leverage ten-fold cross-validation on the training dataset to tune hyperparameters. Experimentally, we choose the cross-entropy loss function to estimate the loss. In addition, the Adam optimizer with a learning rate of 0.001 is utilized to optimize the parameters of the modules. The layer sizes and numbers of layers, described in Sections 3.3 and 3.4, are determined by considering both segmentation accuracy and computational efficiency. As the partitioning results differ among samples, the batch size is set to 1. Note that the batch size can alternatively be larger than 1 with padding, at the cost of extra computation.
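The stated optimization settings (cross-entropy loss, Adam with learning rate 0.001, batch size 1) amount to a training step like the sketch below. The model here is a stand-in linear classifier; the real model is the full multi-scale SFN pipeline.

```python
import torch
import torch.nn as nn

# Stand-in model: in practice this is the SFN pipeline producing per-point scores.
model = nn.Linear(1152, 16)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # lr as stated
criterion = nn.CrossEntropyLoss()                           # loss as stated

def train_step(point_feats, labels):
    """One optimization step on a single sample (batch size 1)."""
    optimizer.zero_grad()
    scores = model(point_feats)       # (N_t, num_classes) per-point logits
    loss = criterion(scores, labels)  # per-point cross-entropy
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because each sample is one sequence of scan-line data with its own partitioning, samples are fed one at a time rather than padded into larger batches.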

Semantic Segmentation Results
To evaluate our method, the following evaluation metrics are applied in our experiments:

precision = TP / (TP + FP),   recall = TP / (TP + FN),   IoU = TP / (TP + FP + FN),

where IoU denotes the Intersection-over-Union, and TP, FP, TN, and FN are, respectively, the numbers of true positives, false positives, true negatives, and false negatives. Table 3 shows the point cloud semantic segmentation results on the test dataset. In the unidirectional mode and the bidirectional mode, high rates of mean precision (97.81% and 98.69%), mean recall (98.20% and 98.41%), and mean IoU (96.12% and 97.17%) are acquired by our method. By comparing the overall results of the two modes, we find that the more comprehensive 3D contexts provided in the bidirectional mode lead to better segmentation performance. Major wires (i.e., contact wire, catenary wire, return wire, and feeder wire) are critical components for power transmission. Except for the catenary wire, the major wires are well recognized, with IoU over 99% in both modes. Except for the dropper, stitch wire, and electric connector, the IoU of each class is over 94% in the unidirectional mode and over 96% in the bidirectional mode. The precisions of stitch wire (90.37%) and electric connector (96.77%) are, respectively, the lowest in the unidirectional mode and the bidirectional mode. Besides, the electric connector class has the lowest recall (90.59% and 90.90%) and the lowest IoU (86.75% and 88.23%) in both modes. The relatively low performance in recognizing electric connector objects might be caused by the small quantity of training samples. As shown in Tables 4 and 5, most of the errors for dropper are misclassifications of dropper points as contact wire, catenary wire, and stitch wire. A few points of stitch wire and catenary wire are confused. In the unidirectional mode, a few electric connector points are incorrectly recognized as points of dropper and cantilever, whereas, in the bidirectional mode, points of electric connector are misclassified as dropper but not cantilever.
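The per-class metrics above follow directly from the confusion counts; a small helper makes the formulas concrete (division-by-zero handling is our addition).

```python
def iou_metrics(tp, fp, fn):
    """Per-class precision, recall, and IoU from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, iou
```

Note that IoU is always the strictest of the three: it penalizes both false positives and false negatives at once, which is why mIoU is lower than mean precision and mean recall in Table 3.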
In general, the errors shown in these two confusion matrices lead to the relatively low IoU of dropper, stitch wire, and electric connector. Segmentation results are visualized in Figures 9 and 10. Overall, few errors occur in the non-insulated overlap section. Compared with the non-insulated overlap section, the scene of the insulated overlap section is more complex because there are more relationships of interconnected objects. Nevertheless, objects are also clearly distinguished in the insulated overlap section. By highlighting errors, we can find that most of the errors occur at the intersections between objects.

Space and Time Complexity
We recorded the space and time costs using PyTorch 1.2 on a server with an Intel Xeon E5-2623 v4 CPU running at 2.6 GHz and an Nvidia Quadro M4000 GPU. The results are shown in Table 6. Compared with the UDLSTM, using the BDLSTM leads to more model parameters. As a result, our method in the bidirectional mode is more expensive than in the unidirectional mode. The SLSL unidirectional mode consumes the most time because its iterative process cannot make full use of the GPU. Without the GPU, the time cost of the SLSL unidirectional mode increases by only approximately 3.67%, to 9.32 ms per scan line. This result indicates that our method can process scan lines at over 100 Hz with the Intel Xeon E5-2623 v4 CPU alone. Compared with the scanning frequency (25 Hz) of the LiDAR sensor used in the experiments, our method supports online data processing, where data are processed in real time.

Ablation Study
Further experiments were conducted to explore the contributions of the multi-scale feature extraction and feature fusion.
To verify the effectiveness of multi-scale feature extraction, we compared the performance of different settings of the region size threshold ε on the test dataset. As shown in Table 7, multi-scale feature extraction contributes remarkable improvements in overall accuracy (OA) and mIoU. Compared to double-scale feature extraction (ε = 0.5, 7) in the unidirectional mode, triple-scale feature extraction (ε = 0.5, 2, 7) increases mIoU by 1.02%, while OA rises by only 0.19%. The greater improvement in mIoU indicates that multi-scale feature extraction can reduce the errors of small-quantity classes. To verify the effectiveness of feature fusion, we conducted an experiment in the non-feature fusion mode, where we removed the feature fusion part of SFN so that point features were only combined with region features. As depicted in Table 8, the rates of mIoU and OA increase with the help of feature fusion. In the non-feature fusion mode, features at a scan line only contain 2D information, which leads to confusion when recognizing 3D objects. For instance, points of a spreader are always misclassified as points of a cantilever at pole-spreader junctions in the non-feature fusion mode (Figure 11a), whereas few errors occur at that location in the unidirectional mode and the bidirectional mode (Figure 11b,c). This demonstrates that feature fusion contributes to the reduction of errors.
(a) (b) (c) Figure 11. Segmentation results of a scene with spreaders: (a) non-feature fusion mode; (b) unidirectional mode; and (c) bidirectional mode. Table 8. Comparison of the results with multi-scale feature extraction (ε = 0.5, 2, 7) in the non-feature fusion mode, the unidirectional mode, and the bidirectional mode.

Conclusions
This study focuses on multiple OCS component recognition with mobile 2D LiDAR. A deep learning-based method is developed to conduct point cloud semantic segmentation scan line by scan line. We propose the iterative point partitioning algorithm and SFN as two critical components of our method for local feature extraction. As shown in the experiments, sixteen categories of common OCS components are recognized with high precision, recall, and IoU on the test dataset. In the design of SFN, UDLSTM and BDLSTM are two optional kinds of LSTM used for feature fusion, resulting in 96.12% and 97.17% mIoU, respectively. In particular, UDLSTMs, which use only forward dependencies, allow the SLSL unidirectional mode to support online data processing. Additionally, the ablation study shows that the design of multi-scale local feature extraction remarkably improves the segmentation performance. Compared with the model-driven and data-driven methods mentioned in Section 2, our method, with automatic feature learning rather than manually designed rules, is more effective in multiple object recognition and easier to extend to new recognition tasks.
In future research, we intend to improve the iterative point partitioning algorithm and SFN to adapt them to 3D input, so that our method can be applied to higher-performance MLS systems equipped with 3D LiDAR. Furthermore, owing to the ability to automatically learn features, it is possible to extend our method to other inspection tasks such as rail track inspection and tunnel inspection.

Conflicts of Interest:
The authors declare no conflict of interest.