Aerial Footage Analysis Using Computer Vision for Efficient Detection of Points of Interest Near Railway Tracks

Abstract: Object detection is a fundamental part of computer vision, with a wide range of real-world applications. It involves detecting various objects in digital images or video. In this paper, we propose a proof-of-concept use of computer vision algorithms to improve the maintenance of railway tracks operated by Via Rail Canada. Via Rail operates about 500 trains running on 12,500 km of tracks. These tracks pass through long stretches of sparsely populated land. Maintaining them is challenging because of the resources required to identify points of interest (POI), such as growing vegetation, missing or broken ties, and water pooling around the tracks. We use the YOLO algorithm to identify these POI in aerial footage. The solution shows promising results in detecting the POI in unmanned aerial vehicle (UAV) images. Overall, we achieved a precision of 74% across all POI and a mean average precision @ 0.5 (mAP @ 0.5) of 70.7%. Detection was most successful for missing ties, vegetation, and water pooling, with an average accuracy of 85% across these three POI.


Introduction
Railways are a crucial part of the modern world, as they help provide affordable public transport and facilitate the supply chain of goods essential for the current lifestyle. Railway tracks span thousands of kilometres, and maintaining these tracks is not easy for several reasons [1]. The first and most obvious reason is that these tracks pass through some of the most remote areas. This is beneficial in terms of connectivity, but it also makes surveillance and monitoring difficult, as some of these areas are not easily accessible and require additional resources. Second, railway companies must employ a large workforce to ensure smooth operations and the general public's safety [1].
One of the most crucial parts of railway track maintenance is the identification of sections that need to be repaired. Via Rail operates more than 500 trains per week in several Canadian provinces on 12,500 kilometres (7800 mi) of track [2]. The current approach requires regular manual monitoring of the tracks, which is resource-intensive. One crucial point of interest (POI) is vegetation growing on the sides of the tracks, which could be fatal if not treated in a timely manner [3]. Identifying missing and broken wooden ties is important for extending the lifetime and operability of the tracks. Rainy weather can cause a sudden accumulation of water near the tracks that is extremely difficult to identify quickly [4]. Advances in certain technologies could revolutionize the identification of these points of interest.
Unmanned aerial vehicle (UAV), or remotely piloted aircraft system (RPAS), technology has opened new opportunities for railway operators, offering safety, dependability, and reliable support [5]. In particular, UAVs improve worker safety in the rail sector: railway activities that previously required personnel to take part in dangerous inspection scenarios can now be carried out remotely. The integration of premium UAVs, advanced sensors, high-resolution cameras, and artificial intelligence (AI) allows the industry to capture real-time data and perform intelligent data analysis.
The following are the main contributions of this paper.
• Approach: We study and apply the YOLOv5 object detection algorithm to detect various POI, that is, broken ties, missing ties, vegetation, and water pooling around railway tracks.
• Experimentation: We collected aerial image footage in 4K resolution from a top-down angle for training and testing purposes, using a DJI Inspire 2 with a Zenmuse X7 camera in the Ottawa, ON, Canada region. Our research team performed annotations with the help of an expert from Transport Canada.
• Evaluation: We evaluate the algorithm's accuracy by measuring how accurately it determines a particular POI. We consider precision, recall, and mean average precision to determine overall efficiency. Overall, we achieved a precision of 74% and a mean average precision @ 0.5 (mAP @ 0.5) of 70.7%.
To the best of our knowledge, this is the first paper that applies the newest YOLOv5 model to detect various POI along the train tracks. It is also the first paper that proposes the detection of water pooling as one of the POI around train tracks from the aerial perspective.
The remainder of the paper is divided in the following way. In the next section, we discuss related works and compare them with our approach. We then provide an overview of the chosen approach in Sections 3 and 4. Finally, we present the results and the detection examples in Section 5, followed by conclusions and plans for future work.

Related Work
The evolution of computer vision capabilities has made it possible to identify particular objects from various media sources, such as images and live video streams [6][7][8][9][10][11]. One of the first papers on railway track object detection was proposed by Yuan et al. They used an enhanced Otsu algorithm to segment the rail image and calculate the rail surface area, although the selection of thresholds for this method was not uniform [12]. Others offered a way to directly extract points of interest on the rail surface by improving the value by reducing the maximum variance between classes, although the precision of the recommended approach was not perfect [13]. In this section, we discuss the related works and compare them with our approach.
The authors of [14] aim to provide a low-cost, practical framework for autonomous railway track detection by utilizing video cameras and computer vision algorithms. A computer vision technique is employed for the automatic detection of railroad lines. As part of the camera calibration operation, interior orientation characteristics are extracted. The external orientation parameters are computed by using the bundle adjustment calibration approach based on chosen ground control points. As a result, the fisheye effect in the photos is eliminated, and matching and feature extraction methods are used to create orthogonal images of the railway lines. The study provides a unique perspective on image pre-processing, which motivated a similar implementation in this paper for correcting the orientation of images before the annotation process.
An algorithm for the detection of railway infrastructure objects, namely, track and signals, is proposed by [15] to detect signals relevant to the track and check if the train is moving along. The convolutional neural network based on the You Only Look Once technique, Hough transform, and other well-known computer vision algorithms are included in the algorithm. The principles of CV and CNNs each deal with a distinct object of detection, and when combined, they create a singular system that tries to identify both the rails and the relevant signals. By using this strategy, the artificial intelligence (AI) system is "aware" of the signal's course. According to the experiments that were run, the proposed algorithm has an up to 99.7% accuracy rate for signal detection. Although this paper uses an algorithm similar to that used in [15], the target POI are different, and the application requires a more lightweight model to perform high-speed detection, which would not be possible with the same model as in [15].
Rails, fasteners, and other elements of railway track lines invariably develop flaws due to continuous pressure from train operations and direct exposure to the environment, which directly affects the safety of train operations. The study from [16] developed a multiobject detection approach to detect fastener and rail surface faults without causing damage. It is based on deep convolutional neural networks. The enhanced YOLOv5 framework first localizes the rails and fasteners on the train track image. The rail's surface defects are then detected by using the defect detection model based on Mask R-CNN, and the defect area is segmented. The fasteners' status is finally classified by using a model built on the ResNet framework. The application of the YOLOv5 model for fastener and fault detection inspired and confirmed the validity of the application of YOLOv5 to detect the POI in our study. Again, the POI that we refer to are different.
A review study conducted by Lei Kou [17] further explores the recent research done to address defect detection on the railway lines. Computer vision and deep learning methods, along with traditional ultrasonic and acceleration detection methods, are proposed to evaluate the damage to the rail surface that can drastically improve the efficiency of the detection system while reducing inspection costs. A similar approach is explored by M. Hashmi et al. in [18] where they combined traditional acoustic-based systems with deep learning models to improve performance and prevent railway accidents. In this regard, two CNN models, convolutional 1D and convolutional 2D, and one recurrent neural network (RNN) model are utilized. Initially, three fault types are considered: superelevation, wheel burned, and standard tracks. On-the-fly feature extraction is performed by creating spectrograms as a deep learning model's layer. Because visual inspection is employed to identify POI in the current study, no traditional techniques were utilized.
In the past, methods based on computer vision have been investigated to detect flaws in railroad tracks. Still, full automation has always been challenging because neither conventional image-processing techniques nor deep learning classifiers trained from scratch perform well on the practically unlimited number of novel scenarios encountered in the real world when only a small amount of labeled data is available. Recent developments have made it possible for machine learning models to use data from an unrelated but relevant domain. The authors of [19] demonstrate that despite the lack of comparable domain data, transfer learning still allows for training large-scale deep learning classifiers on uncontrolled, real-world data and helps models understand other real-world objects. Similar transfer learning was employed in this study by using weights from previous training to improve accuracy and deal with a smaller dataset.
Another application of computer vision in the railway domain is detecting static objects. The study by Ye et al. [20] suggests a line-side condition monitoring strategy for the switch rail. The state of the switch rail, including movement, position, toe gap, and toe edge, can be remotely monitored in real-time by using specialized identification algorithms. Similarly, an intrusion-detecting algorithm for a railway crossing was suggested in [21]. The following three steps make up their procedure. First, three nonparametric background models for detecting moving and stationary objects were created, each with a different learning rate. An object-tracking approach is introduced to decrease false positives in the detection. The removal of static objects is followed by introducing a feedback mechanism to update the background models selectively. Although this approach was proven to be effective, our POI differ, and no correlation is found.
Tiange Wang, Fangfang Yang, and Kwok-Leung Tsui in [22] used the YOLO one-stage detection model to detect rails, clips, and bolts on railway tracks. Bolts connect the ends of the rails at the joints, and elastic clips are used with the rail sleeper to connect the rails. Their approach used the k-means clustering method to discover the best aspect ratios of the prior boxes, with the distance metric changed to 1 - IoU. In [23] the authors used computer vision algorithms to detect cracks in the rail track plate. This was the only POI similar to the work done in this paper. Moreover, many other authors focused on a similar problem of crack detection [24]. To the best of our knowledge, this is the first paper that proposes the detection of water pooling as one of the POI around the train tracks.
Although the research in the area of railway track maintenance has been done recently, we want to reiterate that to the best of our knowledge, this is the first paper that applies the newest YOLOv5 model to detect broken and missing ties, as well as vegetation. It is also the first paper that proposes the detection of water pooling as one of the POI around the train tracks from the aerial perspective.

Algorithm Overview
YOLO (You Only Look Once) reframes object detection as a single regression problem rather than a classification problem. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes, as shown in Figure 1. YOLO reasons globally about an image when making predictions. It sees the entire image during training and testing, so it implicitly encodes contextual information about classes and their appearance [25]. We chose the YOLO algorithm over single-shot multi-box detection (SSD) [26] for two reasons. The first was detection speed: our preliminary experiments showed that YOLO offered a much higher detection speed with precision that was not much lower, whereas SSD was slower because our model was large and complex. The second was that SSD was slightly less accurate in detecting smaller objects, such as tie cracks. We discuss the YOLO architecture in more detail in the following subsections.

Detection
The entire process is shown in Figure 2. The input image is divided into an $S \times S$ grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting the object. Each cell in the grid predicts $B$ bounding boxes and confidence scores for those boxes. Confidence is defined as $\Pr(\text{Object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}}$. The confidence score is zero if there is no object in that cell. Otherwise, the confidence score equals the intersection over union (IOU) between the predicted box and the ground truth.
Each bounding box consists of five predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the box's center relative to the grid cell's bounds. The (w,h) coordinates represent the width and height of the bounding box and are predicted relative to the entire image. Finally, confidence prediction represents the intersection over union (IOU) between the predicted and ground-truth boxes.
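The IOU term in the confidence definition can be computed directly from two boxes given in the (x, y, w, h) center format described above. The following is a minimal Python sketch of that computation; the function name `iou` and the example boxes are illustrative assumptions and are not taken from the implementation used in this study.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_center, y_center, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Convert the center format to corner coordinates.
    a_x1, a_y1, a_x2, a_y2 = ax - aw / 2, ay - ah / 2, ax + aw / 2, ay + ah / 2
    b_x1, b_y1, b_x2, b_y2 = bx - bw / 2, by - bh / 2, bx + bw / 2, by + bh / 2
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(a_x2, b_x2) - max(a_x1, b_x1))
    inter_h = max(0.0, min(a_y2, b_y2) - max(a_y1, b_y1))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


# Hypothetical ground-truth and predicted boxes in normalized image coordinates.
truth = (0.5, 0.5, 0.4, 0.2)
pred = (0.6, 0.5, 0.4, 0.2)

# Pr(Object) = 1 when the cell contains an object, so confidence equals the IOU.
box_confidence = 1.0 * iou(truth, pred)
```

In this example the boxes overlap over three quarters of their width, so the confidence target is well below 1 even though the predicted box has the correct size.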
Each grid cell also predicts $C$ conditional class probabilities, $\Pr(\text{Class}_i \mid \text{Object})$. These probabilities are conditioned on the grid cell containing an object. Only one set of class probabilities is predicted per grid cell, regardless of the number of boxes $B$. At test time, the conditional class probabilities are multiplied by the individual box confidence predictions, which gives a class-specific confidence score for each box:
$$\Pr(\text{Class}_i \mid \text{Object}) \cdot \Pr(\text{Object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \cdot \text{IOU}^{\text{truth}}_{\text{pred}} \qquad (1)$$
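As a small worked illustration of Equation (1), the sketch below multiplies one cell's conditional class probabilities by the confidence of each of its boxes. All numeric values are hypothetical, and the class order is an assumption made only for this example.

```python
# Assumed class order for illustration; the study's actual class indices are not stated.
CLASS_NAMES = ["broken tie", "missing tie", "vegetation", "water pooling"]


def class_scores(class_probs, box_confidences):
    """Class-specific score for every box: Pr(Class_i | Object) * box confidence."""
    return [[p * conf for p in class_probs] for conf in box_confidences]


# Hypothetical predictions for one grid cell with C = 4 classes and B = 2 boxes.
class_probs = [0.1, 0.6, 0.2, 0.1]
box_confidences = [0.8, 0.3]

scores = class_scores(class_probs, box_confidences)

# Pick the (box, class) pair with the highest class-specific confidence.
best_box, best_class = max(
    ((b, c) for b in range(len(scores)) for c in range(len(class_probs))),
    key=lambda bc: scores[bc[0]][bc[1]],
)
best_label = CLASS_NAMES[best_class]
```

Because the class probabilities are shared by all boxes in a cell, the ranking of classes is the same for every box; only the box confidence scales the scores.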

Architecture
The YOLO model is implemented as a convolutional neural network. The initial convolutional layers of the network extract features from the image, whereas the fully connected layers predict the output probabilities and coordinates. The detection network architecture has 24 convolutional layers followed by two fully connected layers. Alternating 1 × 1 convolutional layers reduce the feature space from the preceding layers. The network is shown in Figure 3.

Activation and Loss Function
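The final layer predicts both class probabilities and bounding box coordinates. The coordinates of the bounding box are normalized to fall between 0 and 1. The final layer uses a linear activation function, and all other layers use the following leaky rectified linear activation [25]:
$$\phi(x) = \begin{cases} x, & \text{if } x > 0 \\ 0.1x, & \text{otherwise} \end{cases} \qquad (2)$$
The sum-squared error in the output is optimized by using the loss function in Equation (3) [25]:
$$\begin{aligned} & \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\ & + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\ & + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij} \left( C_i - \hat{C}_i \right)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{noobj}}_{ij} \left( C_i - \hat{C}_i \right)^2 \\ & + \sum_{i=0}^{S^2} \mathbb{1}^{\text{obj}}_{i} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2 \end{aligned} \qquad (3)$$
Here, $S^2$ is the number of cells in the $S \times S$ grid created by YOLO for detection, $B$ is the number of bounding boxes predicted per cell, $\mathbb{1}^{\text{obj}}_{i}$ denotes whether an object is present in cell $i$, $\mathbb{1}^{\text{obj}}_{ij}$ denotes that the $j$th bounding box in cell $i$ is responsible for the prediction of the object ($\mathbb{1}^{\text{noobj}}_{ij}$ is its complement), and $\lambda_{\text{coord}}$ and $\lambda_{\text{noobj}}$ are weighting parameters required to balance the loss function. $x_i, y_i$ denote the location of the center of the bounding box, $w_i, h_i$ denote the width and height of the bounding box, $C_i$ denotes the confidence score of whether there is an object or not, and $p_i(c)$ denotes the class probability for class $c$. Hatted symbols denote the corresponding predicted values.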
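To make the individual terms of the activation and loss more concrete, the sketch below evaluates them for a single bounding box. It is a simplified illustration only: the slope of 0.1 and the weights 5 and 0.5 are the values given in the original YOLO paper [25], whereas the function names, the dictionary layout, and the example numbers are assumptions introduced here.

```python
import math


def leaky_relu(x, slope=0.1):
    """Leaky rectified linear activation: x for x > 0, slope * x otherwise."""
    return x if x > 0 else slope * x


def responsible_box_loss(pred, truth, lambda_coord=5.0):
    """Localization and confidence terms for one box that is responsible for an object.

    pred and truth are dicts with keys x, y, w, h, C.
    """
    center = (pred["x"] - truth["x"]) ** 2 + (pred["y"] - truth["y"]) ** 2
    # Square roots reduce the penalty for size errors on large boxes.
    size = (math.sqrt(pred["w"]) - math.sqrt(truth["w"])) ** 2 + (
        math.sqrt(pred["h"]) - math.sqrt(truth["h"])
    ) ** 2
    confidence = (pred["C"] - truth["C"]) ** 2
    return lambda_coord * (center + size) + confidence


def empty_box_loss(pred_confidence, lambda_noobj=0.5):
    """Confidence term for a box in a cell that contains no object (target is 0)."""
    return lambda_noobj * pred_confidence ** 2


# Hypothetical values: the prediction is shifted by 0.1 along x, everything else matches.
truth_box = {"x": 0.4, "y": 0.5, "w": 0.25, "h": 0.16, "C": 1.0}
pred_box = {"x": 0.5, "y": 0.5, "w": 0.25, "h": 0.16, "C": 1.0}
shift_loss = responsible_box_loss(pred_box, truth_box)
```

The full loss in Equation (3) sums these terms over all grid cells and boxes and adds the per-cell classification term, which is omitted from this sketch.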

YOLOv5
The architecture of YOLOv5 consists of three parts (as shown in Figure 4). These are the backbone (CSPDarknet), the neck (PANet), and the head (YOLO layer). The input is first passed to CSPDarknet for feature extraction and then fed to PANet for feature fusion. Finally, the YOLO layer outputs the detection results (class, score, location, size) [27]. Joseph Redmon (inventor of YOLO) introduced the anchor box structure in YOLOv2, together with a procedure for selecting anchor boxes whose size and shape closely resemble the ground-truth bounding boxes in the training set. By using the k-means clustering algorithm with different k values, the authors picked the five best-fit anchor boxes for the COCO dataset (containing 80 classes) and used them as a default. This reduces training time and increases network accuracy.
However, when these five anchor boxes are applied to a unique dataset (containing a class not present in the original COCO dataset), they cannot quickly adapt to the ground-truth bounding boxes of that dataset. For example, a giraffe dataset favors tall, narrow anchor boxes rather than square ones. To address this problem, computer vision engineers usually first run the k-means clustering algorithm on the unique dataset to obtain the best-fit anchor boxes for the data. These parameters are then manually configured in the YOLO architecture.
Glenn Jocher (inventor of YOLOv5) proposed integrating the anchor box selection process into YOLOv5. As a result, the anchor boxes do not have to be configured manually for each input dataset; the network automatically "learns" the best anchor boxes for that dataset and uses them during training [27].
Figure 5 shows the model pipeline used in our research study, with each step from data collection to image inference. Data was collected by using a UAV during flights over the Ottawa region in the fall of 2021. High-resolution images were captured from various heights (25 m, 50 m, and 75 m). After data pre-processing, the Roboflow tool was used to annotate the POI. Roboflow also enabled augmentation (vegetation only) and versioning of the dataset. After the dataset package was created in Roboflow, a Google Colab notebook connected to an external Amazon Web Services (AWS) instance was used to train the model. Once the model was trained, the trained weights were extracted and used for inference on the images.
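The manual anchor selection step described above can be sketched as a plain k-means over box widths and heights with the distance 1 - IoU. This is a generic illustration under stated assumptions: the function names, the deterministic initialization, and the example boxes are hypothetical, and the sketch is not the routine built into YOLOv5.

```python
def wh_iou(a, b):
    """IoU of two (w, h) boxes aligned at a common corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k, iterations=50):
    """Cluster (w, h) pairs into k anchors using the distance d = 1 - IoU."""
    # Deterministic initialization: take k boxes spread evenly over the area-sorted list.
    ordered = sorted(boxes, key=lambda wh: wh[0] * wh[1])
    step = len(ordered) / k
    anchors = [ordered[int(step * i + step / 2)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Assign each box to the anchor with the smallest 1 - IoU distance.
            nearest = min(range(k), key=lambda i: 1.0 - wh_iou(box, anchors[i]))
            clusters[nearest].append(box)
        # Move each anchor to the mean width and height of its cluster.
        updated = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c)) if c else anchors[i]
            for i, c in enumerate(clusters)
        ]
        if updated == anchors:
            break
        anchors = updated
    return anchors


# Hypothetical box sizes in pixels: three long, thin boxes and three large, squarish ones.
example_boxes = [(120, 20), (110, 22), (130, 18), (200, 180), (220, 210), (190, 200)]
anchors = sorted(kmeans_anchors(example_boxes, k=2), key=lambda wh: wh[0] * wh[1])
```

With these example boxes, one anchor settles on the thin shape and the other on the squarish shape, which is the behavior the giraffe example above describes for a dataset with unusual aspect ratios.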

Dataset Pre-Processing
Dataset pre-processing is performed to ensure that the data are in a suitable format for training the algorithm. After an initial assessment, the following pre-processing steps were performed on the dataset to enable efficient annotation, which is critical for machine learning model performance.
Image Orientation Correction. To make the annotation process more manageable and efficient, the orientation of the images was corrected so that the tracks ran in a vertical or horizontal direction. Moreover, all images were taken by using the top-down approach so that differences in viewing angle would not affect detection quality.
Artificial Vegetation. As the rail tracks are well maintained, there were very few images in which vegetation was visible. To achieve higher accuracy, we needed to train the model by using a well-balanced set of images with points of interest; thus, artificial vegetation was added on and beside the tracks.

Annotation
After pre-processing the data, the next step was to annotate the images to train the algorithm. We identified four points of interest:
• Vegetation: Each image was closely reviewed to check for vegetation near the track. If vegetation was present at a distance of less than 3 m from the track, it was annotated accordingly. An example is shown in Figure 6.
• Missing Tie: The track portion was annotated when there was a missing tie. An example is shown in Figure 7.
• Broken Tie: If there was a crack spanning at least 50% of the length of the tie (calculated horizontally), the tie was annotated as a broken one. An example is shown in Figure 8.
• Water Pooling: Any water body captured in the images was annotated as water pooling to identify the sites affected by water pooling near the tracks. An example is shown in Figure 9.
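For reference, YOLOv5 reads each annotation as a text line of the form "class x_center y_center width height", with all coordinates normalized by the image size. The sketch below converts a pixel-space box into that form. The class indices, the helper name `to_yolo_label`, and the example box are assumptions for illustration; the annotation tool normally produces these lines on export.

```python
# Assumed class order; the actual indices depend on how the dataset was exported.
CLASS_IDS = {"broken tie": 0, "missing tie": 1, "vegetation": 2, "water pooling": 3}


def to_yolo_label(class_name, x_min, y_min, x_max, y_max, image_w, image_h):
    """Convert a pixel-space corner box into a normalized YOLO label line."""
    x_center = (x_min + x_max) / 2 / image_w
    y_center = (y_min + y_max) / 2 / image_h
    width = (x_max - x_min) / image_w
    height = (y_max - y_min) / image_h
    return f"{CLASS_IDS[class_name]} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical annotation centered in a 5472 x 3648 image and covering half of each side.
label = to_yolo_label("missing tie", 1368, 912, 4104, 2736, 5472, 3648)
```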

Simulation Setup
In this section, we discuss the simulation parameters and share additional details on the collected dataset. The total number of images used in the simulations was 5465, as shown in Figure 10. On average, we had two annotations per image, and the median image size was 5472 × 3648 pixels, with images collected from multiple UAVs. Moreover, we present the class balance in Figure 11. We performed an analysis in Roboflow, and none of the classes were under- or over-represented. The largest collection of images was marked with the water pooling class annotation. Finally, for each class, we present the distribution of annotations per image for water pooling, vegetation, missing ties, and broken ties in Figures 12-15, respectively. As we can see, water pooling objects appeared as multiple instances per image, whereas the other classes mostly appeared as a single object per image.
We used 70% of the images for training, 20% for validation, and 10% for testing. We trained the YOLOv5m6 model from scratch with images resized to 640 pixels, in batches of 32, for 300 epochs, in an offline manner (images were taken by UAVs, then transferred to the server and used for training and testing). The "patience" argument for training was set to 100 epochs, which caused the training procedure to stop at around 180 epochs because the model showed no further improvement. We used the hyperparameter settings shown in Table 1. The number of warm-up epochs was set to 5. The detection threshold was set to 0.7. In the next section, we discuss the results of the simulations.
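The 70/20/10 split and the effect of the "patience" setting can be illustrated with a short sketch. It is not the training code used in the study: the helper names, the random seed, and the fitness curve are hypothetical, and the early-stopping rule shown is a generic one (stop once no improvement has been seen for `patience` epochs).

```python
import random


def split_dataset(items, train_pct=70, val_pct=20, seed=0):
    """Shuffle items and split them into train, validation, and test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * train_pct // 100
    n_val = len(items) * val_pct // 100
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


def early_stop_epoch(fitness_per_epoch, patience=100):
    """Return the epoch at which training would stop, or None if it runs to the end."""
    best, best_epoch = float("-inf"), 0
    for epoch, fitness in enumerate(fitness_per_epoch):
        if fitness > best:
            best, best_epoch = fitness, epoch
        if epoch - best_epoch >= patience:
            return epoch
    return None


# Image identifiers stand in for the 5465 images in the dataset.
train_ids, val_ids, test_ids = split_dataset(range(5465))

# Hypothetical fitness curve that stops improving after epoch 80 of a 300-epoch run.
hypothetical_fitness = [min(epoch, 80) for epoch in range(300)]
stop_epoch = early_stop_epoch(hypothetical_fitness, patience=100)
```

Under this hypothetical curve, a patience of 100 epochs ends the run at epoch 180, which shows how a 300-epoch schedule can stop at around 180 epochs when the best result is reached early.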

Results
During model training, many different experiments were performed to optimize the performance for each point of interest by analyzing the initial results. Overall, we achieved a precision of 74.1% and an mAP of 70.7% across all classes, as shown in Table 2. One of our model's tasks was detecting broken ties on railway tracks. For the broken tie class, we achieved the lowest results among all classes: 50.3% precision and 59.5% mAP.
During the initial phases of model training, the dataset consisted of images taken from a greater height (50-70 m), which contributed to lower performance, as identifying cracks in the ties was difficult due to the lower resolution. Therefore, it was recommended that UAV images be taken from a lower height (25 m) to achieve optimal results. Pictures taken from greater heights were removed from the dataset. As a result of these changes, performance improved by 7% but remained lower than expected. Overall, the detection performance for broken ties remained average due to various factors affecting the algorithm's ability to detect them. These factors were mainly a high variation in cracks (due to the type of material, wood) and a low number of repeated samples for each type of crack. All of these factors affect the appearance of cracks in the tie. We are confident that with a larger dataset of more standardized cracks, we could introduce training classes for different levels of damage severity to improve overall results in this category.
Another task was to detect missing ties. In this category, we achieved the highest precision (89%) and the highest mAP (88.2%) among all classes. This class was the easiest to detect because of an evident visual change in the tracks. By tracking the distance between two ties, we could quickly identify missing ones, which can also help prioritize the maintenance of railway tracks.
A further task was to detect vegetation. For this class, we achieved 84% detection precision and 80.7% mAP. Further research in the vegetation category could involve mapping the size of vegetation or tracking the level or amount of ballast to implement effective deterrence against vegetation growth.
Finally, we focus on the most significant issue, water pooling. Our model achieved 80.9% detection precision and 74.2% mAP. Achieving high precision in the case of water pooling is extremely difficult, but we overcame the issue by using appropriate pre-processing methods, such as adaptive equalization of contrast and correction of white balance (including tint adjustment). In some cases, we observed that the exact area of water pooling did not match the bounding boxes perfectly, but the overall detection was still accurate.
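One way to see why flight height matters for thin cracks is the ground sampling distance (GSD), the ground footprint of a single pixel in a top-down image, which grows linearly with altitude. The sketch below uses the standard relation GSD = (sensor width × altitude) / (focal length × image width). The sensor width and focal length are hypothetical placeholders, since the camera parameters are not stated in this section; only the 5472-pixel image width is taken from the dataset description.

```python
def ground_sampling_distance(altitude_m, sensor_width_mm, focal_length_mm, image_width_px):
    """Ground footprint of one pixel, in centimetres, for a top-down image."""
    return (sensor_width_mm * altitude_m * 100.0) / (focal_length_mm * image_width_px)


# Hypothetical camera parameters, used for illustration only.
SENSOR_WIDTH_MM = 13.2
FOCAL_LENGTH_MM = 8.8
IMAGE_WIDTH_PX = 5472

gsd_by_altitude = {
    altitude: ground_sampling_distance(altitude, SENSOR_WIDTH_MM, FOCAL_LENGTH_MM, IMAGE_WIDTH_PX)
    for altitude in (25, 50, 75)
}
```

Whatever the actual camera parameters are, tripling the altitude from 25 m to 75 m triples the ground distance covered by each pixel, so a narrow crack occupies roughly one third as many pixels across its width.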
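The idea of using the distance between ties could be sketched as a simple spacing check applied to detected tie positions along the track. This is a hypothetical post-processing step, not part of the trained detector: the function name, the 1.5 tolerance factor, and the example positions are all assumptions.

```python
from statistics import median


def find_missing_ties(tie_positions, tolerance=1.5):
    """Return (gap_start, gap_end, estimated_missing) for unusually large gaps.

    A gap is flagged when it exceeds tolerance times the median tie spacing.
    """
    positions = sorted(tie_positions)
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if not gaps:
        return []
    typical = median(gaps)
    flagged = []
    for start, gap in zip(positions, gaps):
        if gap > tolerance * typical:
            # A gap of about n typical spacings suggests n - 1 missing ties.
            flagged.append((start, start + gap, round(gap / typical) - 1))
    return flagged


# Hypothetical tie centers along the track, in metres; one tie is absent near 1.5 m.
tie_positions = [0.0, 0.5, 1.0, 2.0, 2.5, 3.0]
missing = find_missing_ties(tie_positions)
```

A check of this kind would also give a rough count of missing ties per gap, which could feed into the maintenance prioritization mentioned above.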
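The precision and recall figures in this section follow the usual definitions based on true positives, false positives, and false negatives. The sketch below states those definitions and averages the three per-class precision values reported above. The detection counts in the example are hypothetical and serve only to exercise the function.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


# Per-class precision values (%) reported in this section.
reported_precision = {"missing tie": 89.0, "vegetation": 84.0, "water pooling": 80.9}
mean_top_three = sum(reported_precision.values()) / len(reported_precision)

# Hypothetical counts, for illustration only.
example_precision, example_recall = precision_recall(
    true_positives=8, false_positives=2, false_negatives=4
)
```

The mean of the three reported values is about 84.6%, which corresponds to the figure of roughly 85% quoted for the three best-performing classes.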

Example Results of Detection
One of our model's tasks was detecting vegetation near railway tracks. Our model detects vegetation near tracks with high accuracy, as shown in Figure 16. The model does not detect vegetation when it is sparse across the entire area near the railway track, as shown in Figure 17. Missing ties are detected accurately by our model, as shown in Figure 18. Our model does not detect missing ties when the distance between the missing ties is uneven (see Figure 19). The model detects broken ties correctly, as shown in Figure 20. It does not detect broken ties when the broken part contains gravel, as shown in Figure 21. Water pooling is detected with high accuracy, as shown in Figure 22. Figure 23 shows an example of an imperfect detection, where the water has a dark reflection.

Conclusions
In this article, we tackled the problem of identifying various points of interest around railway tracks by using aerial image footage. We processed them by using the YOLO object detection algorithm. The algorithm's efficiency was evaluated by measuring how accurately it determines the specific points of interest.
The accuracy of detecting the various points of interest depended directly on the height at which the aerial footage was taken. For example, broken ties were challenging to identify in images taken from a greater height. In contrast, the algorithm readily identified vegetation and missing ties in images taken from various distances because those classes have more prominent visual features. Finally, water pooling was detected efficiently. As a next step, we are going to include images from different seasons of the year and more locations to improve the accuracy of the solution. As an end product, the algorithms developed in this paper have been deployed in production-ready software called Spexi Geospatial.

• The model achieved more than 80% accuracy in categories such as vegetation, missing ties, and water pooling.
• YOLOv5 offers a faster inference time than YOLOv4, allowing for deployment on embedded devices.
• The deployment of the model by the railway maintenance department could significantly reduce the time and resources required for detecting major POI.

Limitations
• The model does not perform well for the broken tie class due to the lack of distinct, standardized visual features and false identifications caused by ballast on ties.
• The training dataset lacks terrain diversity, which could result in lower accuracy for different geographic locations.

Future Work
In future work, we plan to conduct more robust training that includes images from diverse landscapes and seasons of the year to improve accuracy and make the model production-ready. Moreover, more research is required to handle the identification of visual features for specific categories, such as a broken tie. Visual feature enhancement can be performed during the pre-processing step to improve accuracy. Finally, the augmentation technique could be customized to reduce noise from the image by colour coding the unwanted areas.