Automatic Defect Description of Railway Track Line Image Based on Dense Captioning

The state monitoring of the railway track line is one of the important tasks to ensure the safety of the railway transportation system, and the defect recognition result, that is, the inspection report, is the main basis for the maintenance decision. Most previous attempts have proposed intelligent detection methods to achieve rapid and accurate inspection of the safety state of the railway track line. However, there are few investigations on the automatic generation of inspection reports. Inspired by the recent advances and successes in dense captioning, such technologies can be investigated and used to generate textual information on the type, position, status, and interrelationship of the key components from the field images. To this end, based on the work of DenseCap, a railway track line image captioning model (RTLCap for short) is proposed, which replaces VGG16 with ResNet-50-FPN as the backbone of the model to extract more powerful image features. In addition, to address the problems of object occlusion and category imbalance in the field images, Soft-NMS and Focal Loss are applied in RTLCap to improve defect description performance. After that, to improve the image processing speed of RTLCap and reduce the complexity of the model, a reconstructed RTLCap model named Faster RTLCap is presented with the help of YOLOv3. In the encoder part, a multi-level regional feature localization, mapping, and fusion module (MFLMF) is proposed to extract regional features, and an SPP (Spatial Pyramid Pooling) layer is employed after MFLMF to reduce model parameters. As for the decoder part, a stacked LSTM is adopted as the language model for better language representation learning. Both quantitative and qualitative experimental results demonstrate the effectiveness of the proposed methods.


Introduction
In recent years, the rapid development of rail transit and the fast growth of operating mileage have put forward stricter requirements on transportation safety and maintenance. The health status of the railway track line is the basis for guaranteeing the normal operation of trains and plays a critical role in ensuring the effective, safe, and stable operation of the entire rail transit system [1]. A railway track line is mainly composed of tracks, fasteners, backing plates, and so on. Due to the impact of contact friction and vibration between the train wheels and track, coupled with the influence of the on-site operating environment, defects such as rail corrugation, spalling, and broken fasteners may occur on the railway track line. In practice, railway track line defects have a vital effect on the safety of vehicle operation and passenger comfort and may even further lead to major safety accidents. Furthermore, with the occurrence, evolution, and even deterioration of railway track line defects, the maintenance costs and the difficulty of maintenance decision-making are also increased to some extent. Up till now, the condition monitoring of the railway track line is mainly carried out through manual inspection or the use of track inspection cars, and then the inspection report is manually sorted out and generated according to the inspection results. Nevertheless, manual inspection and generation of inspection reports are of poor efficiency, high cost, and low level of automation and intelligence. In addition, the manufacturing cost of the track inspection car is high, and the track line is occupied during the inspection process. Hence, there is an urgent need to develop a comprehensive railway track line inspection platform that uses advanced technologies such as computer vision, deep learning, and natural language processing to improve the safety and stability of rail transit [2]. 
Not only can it inspect the railway track lines intelligently, but it can also further generate inspection reports automatically. Moreover, this non-contact sensing monitoring method is more timely, economical, and convenient and can achieve better results and reduce labor.
Over recent decades, researchers in this field have made significant progress mainly in the detection technology of rail surface and fastener defects, but the automatic generation of inspection reports has not been investigated further. In the entire condition monitoring system of the railway track line, the inspection report is the main basis for maintenance decision-making. For practical applications, it is necessary to investigate the automatic generation of railway track line inspection reports in text format. At present, with the development of computer vision and natural language processing technology, as well as continually improving computer technology, it is possible to obtain useful information about multiple objects from images automatically. In particular, dense captioning, a technology based on computer vision, is gaining traction in this field [3–8]. Dense captioning is a subset of image captioning technology [9] that understands the characteristics of objects, their activities, and relationships and expresses them in natural language. In comparison with other types of image captioning methods [10], this technology focuses attention on image regions containing objects and generates more objective and detailed region-wise captions. Therefore, it is suitable for analyzing railway track line images with multiple key components and clear positional relationships and generating text information about the type, position, status, and mutual relationship of the key components. However, some problems still need to be solved when developing an automatic defect description method for railway track line images based on dense captioning. First of all, it is difficult to use expressions generated by existing methods (e.g., "black office telephone and pen" or "yellow fire hydrant") to describe the safety status information of the key components from the field images, so a dedicated dataset must be constructed. Secondly, improved regional feature extraction needs to be developed to achieve more accurate regional localization and better defect description performance. Finally, the processing time of each image should be shortened, and the complexity of the processing procedure should be reduced so that the improved method can be applied in practice.
To solve these problems, the railway track line image (RTL-I) dataset collected from Beijing Metro Line 6 is constructed firstly, and the images and captions in the dataset are labeled manually in terms of the requirements of maintenance decision-making. In this study, four types of information, including object, type, position, and status, are defined to generate inspection information for the safety status of the railway track line. After that, to improve the defect description accuracy and be more suitable for the scenario of railway track lines, an improved DenseCap model named railway track line image captioning model (RTLCap) is proposed. Specifically, in RTLCap, ResNet-50-FPN [11] is used instead of VGG16 [12] as the backbone of the model to extract better image features. In addition, in response to the issues of object occlusion and category imbalance in the field images, Soft-NMS [13] and Focal Loss [14] are applied in RTLCap to promote the performance of defect description. Finally, to reduce the complexity of RTLCap and speed up the image processing procedure, while maintaining a desirable defect description performance, a reconstructed RTLCap model named Faster RTLCap is presented with the help of YOLOv3 [15]. The proposed Faster RTLCap model follows a similar path to the RTLCap model. More precisely, Darknet53 is used as the basic feature extractor in the encoder part, and a multi-level regional feature localization, mapping, and fusion module (MFLMF) is proposed to extract regional features. Besides, followed by MFLMF, an SPP (Spatial Pyramid Pooling) [16] layer is employed to reduce model parameters. For the decoder part, a stacked LSTM named CM-LSTM is adopted as the language model to improve the defect description performance further. 
Based on the experimental results, the proposed methods can generate the safety status information of the key components in the railway track line image in a text format effectively and automatically, which are better alternatives for practical applications.
To summarize, the core contributions of this paper are threefold:
1. Based on advanced deep learning networks and natural language processing technologies, the problem of automatic defect description of railway track line images is investigated and solved for the first time, and the proposed methods meet the demand for automatic generation of inspection reports on railway track line safety status.
2. A railway track line image captioning model (RTLCap for short) is proposed based on an improved DenseCap, which achieves better defect description accuracy than the original DenseCap and is more suitable for the railway track line scenario. To the best of our knowledge, this is the first research that introduces dense captioning technology into the field of railway track line safety status detection to investigate the automatic generation of inspection reports.
3. Motivated by the work of YOLOv3, a reconstructed RTLCap model named Faster RTLCap is presented. Faster RTLCap reduces the image processing time effectively while maintaining a sound defect description performance. More precisely, the image processing of Faster RTLCap is about 97.7% faster, and the defect description accuracy is improved by 1.12%.
The remainder of this paper is arranged as follows. In Section 2, relevant work on detection methods for rail surface and fastener defects is reviewed. In Section 3, the proposed railway track line image captioning model (RTLCap) is presented in detail. After that, the redesigned RTLCap model, Faster RTLCap, is investigated in Section 4. Finally, some conclusions of this paper are given in Section 5.

Related Work
As the lifeline of the whole rail transit system, the safety status of the railway track line directly affects the stability and safety of the operating vehicles during normal driving, as well as the comfort of passengers. Specifically, as shown in Figure 1, the railway track line is mainly composed of key components such as tracks, fasteners, and backing plates. When these key components develop defects that continue to evolve and deteriorate, the cost of line maintenance and the difficulty of maintenance decision-making also increase. To ensure the operational safety and efficiency of rail transit, it is necessary to detect the safety status of key components of railway track lines and generate inspection reports.
In the last decade, a large body of work has been concerned with the development of automated railway inspection methods and systems based on advanced technologies. Generally, the detection methods for rail surface and fastener defects are mainly image processing-based methods and deep learning-based methods. Here, some representative investigations on detection methods based on deep learning are reviewed. The fastener is the key component used to connect the track and backing plate on the railway track line. It keeps the rail and backing plate relatively fixed and is usually in a normal, partially broken, or completely missing state. In [17,18], a multi-layer perceptron neural classifier is put forward for the detection of missing fasteners and bolts, and an online fastener detection algorithm is implemented with the help of FPGA and GPU. In [19,20], a multitask learning (MTL) framework combined with multiple detectors is proposed to detect railway ties and fasteners, which improves detection performance through intermediate feature sharing and coarse-to-fine detection strategies. In [21], the identification and detection of fastener defects utilizing image processing technologies and deep learning networks are comprehensively investigated, and sound recognition accuracy and recall rate are obtained. In [22], a two-stage framework for detecting defective fasteners is introduced, which is composed of a CenterNet-based fastener positioning module and a VGG-based defect classification module. In [23], a new track fastener detection network architecture called MYOLOv3-Tiny is proposed to enable the deployment of the detection algorithm on lightweight processor devices. In [24], a two-stage classification model based on the modified Faster R-CNN and the support vector data description (SVDD) algorithms is presented for fastener detection, which realizes fast and accurate detection of four fastener states.
As for rail surface defects, the main detection methods based on deep learning are as follows. In [25,26], an application of deep convolutional neural networks (DCNNs) for automatic detection of rail surface defects is presented, which achieves non-interference detection. In [27,28], a multiphase deep learning technique is introduced to detect rail surface defects in a vision-based railway track inspection system. Firstly, the track is extracted by image segmentation, and then the extracted track is fed into a fine-tuned convolutional neural network (CNN) for further classification. In [29], to detect and locate rail surface defects in real time, two novel rail surface defect detection models with different deep convolutional networks are investigated with the help of MobileNet and YOLOv3. In [30], a track line multi-target defect detection network (TLMDDNet) and DC-TLMDDNet, further optimized in the light of DenseNet, are proposed, which can detect the defects of the track and different types of fasteners simultaneously and comprehensively. In [31], a deep extractor (DE) integrating fully convolutional networks and conditional random fields (CRFs) is put forward for the detection of rail surface discrete defects (RSDDs), which provides new insights in the field. In [32], an attention neural network for rail surface defect detection via CASIoU-guided center-point estimation (CCEANN) is presented to solve the problem of data imbalance and complex situations in actual detection. In [33], an automatic railroad track component inspection method based on instance segmentation is proposed, realizing real-time detection performance in an experimental environment. In [34], an integrated inspection system based on a newly developed rail boundary guidance network (RBGNet) and image processing technologies is constructed, realizing an advanced rail surface segmentation performance. In [35], by using MobileNetv3 and depthwise separable convolution, a lightweight improved YOLOv4 model is proposed for railway surface defect detection, enabling real-time detection speed.
It can be concluded that previous studies have made remarkable progress in the inspection techniques for rail surface and fastener defects, whereas the automatic generation of inspection reports still leaves room for further research.

Automatic Defect Description of Railway Track Line Image
In this section, the dense captioning model proposed in [3] is reviewed first, as it forms the foundation for our work. After that, the proposed railway track line image captioning model (RTLCap) is introduced. Finally, the experiments and results are described in detail.

Dense Captioning Model
As an extension of image captioning, dense captioning is developed to discover abundant sets of visual contents and to generate captions of wider diversity and more details. The first dense captioning model, named DenseCap, is introduced by the groundbreaking work of Johnson et al. [3], which is the most relevant to our method. DenseCap locates regions of interest and describes them in natural language by performing object detection, soft spatial attention, and image captioning tasks simultaneously in the model.
A brief schema of DenseCap is shown in Figure 2. Internally, the input image is first processed by Faster R-CNN [36] with VGG16 [12] as the backbone to obtain object-candidate region features, which are generated by the soft attention mechanism implemented in the localization layer. Afterward, the region features are passed to a fully-connected neural network, called the recognition network, to obtain region codes. In the end, the region codes are fed into a Long Short-Term Memory (LSTM) network to produce their corresponding sentences. A much more detailed discussion of DenseCap can be found in [3].

Figure 2. A brief schema of DenseCap. An input image is first processed by the VGG16 [12]. The Localization Layer proposes regions and uses bilinear interpolation to smoothly extract a batch of corresponding activations. After that, these regions are processed using a fully-connected recognition network and described with an LSTM model.

Railway Track Line Image Captioning Model (RTLCap)
Inspired by the work of DenseCap, the railway track line image captioning model (RTLCap for short) is proposed in this paper to realize the automatic defect description of the railway track line image. The architecture of RTLCap is shown in Figure 3. To improve the defect description accuracy and be more suitable for the scene of railway track line, RTLCap has made three improvements based on DenseCap, which are discussed in the following.

Backbone and Anchors
As demonstrated in [7,8,37], better feature extraction can greatly improve the performance of description generation tasks. More specifically, accurate feature extraction is conducive to more accurate regional localization and better regional description. Therefore, ResNet-50 [38] followed by FPN (Feature Pyramid Network) [11] is used as the backbone of RTLCap to extract image features more accurately and comprehensively.
ResNet-50 is built on the residual learning framework, which was proposed to ease the training of deep networks and has the advantages of easy optimization and low computational burden. Moreover, the residual unit is designed to solve the degradation and gradient problems so that the performance of the network can be improved as the depth increases. The Feature Pyramid Network (FPN), proposed by Lin et al. in 2017, was mainly introduced to solve the multi-scale problem in automatic object detection. More precisely, the FPN algorithm strengthens feature expression through the fusion of high-level and low-level feature maps and completes the target detection task on multi-scale feature maps, which improves the detection performance of the model for small targets.
The basic structure of ResNet-50-FPN is shown in Figure 4, which takes a single-scale image of any size as input, and outputs proportionally sized feature maps at multiple levels in a fully convolutional mode. The construction of FPN mainly contains a bottom-up pathway, a top-down pathway, and lateral connections. ResNet-50-FPN has been shown to perform well in [11]. Additionally, anchors are redesigned to be consistent with the scheme used in [11]. In particular, the anchors are defined to have areas of $\{32^2, 64^2, 128^2, 256^2, 512^2\}$ pixels on $\{P_2, P_3, P_4, P_5, P_6\}$ (as shown in Figure 4), respectively. Furthermore, similar to [36], anchors with multiple aspect ratios {1:2, 1:1, 2:1} are employed at each level. To summarize, there are a total of 15 anchors over the pyramid.
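The anchor count stated above can be checked with a short sketch. The following Python snippet is illustrative only and is not the authors' implementation: the function name `make_anchor_shapes` is hypothetical, and the interpretation of the aspect ratio as height divided by width is an assumption. It enumerates one anchor area per pyramid level and three aspect ratios per level, which yields 5 × 3 = 15 anchor shapes.

```python
import math

# Anchor side lengths (square root of the area in pixels) per pyramid level,
# taken from the text: areas of 32^2 ... 512^2 on P2 ... P6.
LEVEL_SIDES = {"P2": 32, "P3": 64, "P4": 128, "P5": 256, "P6": 512}

# Aspect ratios 1:2, 1:1, 2:1. Reading them as height / width is an assumption.
ASPECT_RATIOS = [0.5, 1.0, 2.0]


def make_anchor_shapes():
    """Return {level: [(width, height), ...]} with a constant area per level."""
    shapes = {}
    for level, side in LEVEL_SIDES.items():
        area = side * side
        level_shapes = []
        for ratio in ASPECT_RATIOS:
            # Solve width * height = area subject to height / width = ratio.
            width = math.sqrt(area / ratio)
            height = width * ratio
            level_shapes.append((width, height))
        shapes[level] = level_shapes
    return shapes


anchors = make_anchor_shapes()
total_anchors = sum(len(level_shapes) for level_shapes in anchors.values())
```

Each level keeps its area fixed while the ratio changes the box shape, so small components are matched on fine levels and large components on coarse levels.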

Soft-NMS
The non-maximum suppression (NMS) algorithm is an important part of the object detection pipeline. In short, the detection boxes are first sorted from high to low according to their scores. Then, the detection box M with the maximum score is selected, and all other detection boxes that significantly overlap with M are suppressed. At last, this process is applied to the remaining boxes recursively. According to the design of the algorithm, when two target boxes overlap greatly, the box with the lower score is discarded due to its large overlap with the higher one, resulting in missed detection. In practice, the railway track line images collected from the field face such a problem during the detection process. As shown in Figure 5, in some images, the overlap between the fastener area and the backing plate area is greater than 0.7, which causes the final description result to be incomplete. To solve this problem, the Soft-NMS [13] method is adopted to replace the NMS algorithm used in the original model. The main idea of Soft-NMS is to reduce the confidence coefficient of the detection boxes that have significant overlap with M, instead of discarding them directly. There are two typical rescoring functions for Soft-NMS: the linear penalty function and the Gaussian penalty function. Taking into account the continuity of the function, the Gaussian penalty function is employed in our work, which is formulated as follows

$$s_i = s_i \, e^{-\frac{\mathrm{IoU}(M, b_i)^2}{\sigma}}, \quad \forall b_i \notin D$$

where D stands for the set of final detections, $b_i$ denotes one of the detection boxes, $s_i$ is the corresponding confidence coefficient, and $\sigma$ controls the strength of the Gaussian penalty. More details about Soft-NMS can be found in Reference [13].
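To make the rescoring step concrete, a minimal pure-Python sketch of Soft-NMS with the Gaussian penalty of [13], $s_i \leftarrow s_i \exp(-\mathrm{IoU}(M, b_i)^2 / \sigma)$, is given below. It is an illustration under stated assumptions rather than the implementation used in RTLCap: the function names are hypothetical, boxes are assumed to be given as (x1, y1, x2, y2), and the values `sigma=0.5` and `score_threshold=0.001` are assumed defaults that are not specified in this paper.

```python
import math


def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms_gaussian(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Return (box, score) pairs in selection order after Gaussian Soft-NMS.

    sigma and score_threshold are assumed values, not taken from the paper.
    """
    candidates = list(zip(boxes, scores))
    kept = []
    while candidates:
        # Select the box M with the highest current score.
        best = max(range(len(candidates)), key=lambda i: candidates[i][1])
        m_box, m_score = candidates.pop(best)
        kept.append((m_box, m_score))
        # Decay the scores of the remaining boxes by their overlap with M
        # instead of discarding them outright.
        rescored = []
        for box, score in candidates:
            score *= math.exp(-(iou(m_box, box) ** 2) / sigma)
            if score >= score_threshold:
                rescored.append((box, score))
        candidates = rescored
    return kept


# Two heavily overlapping boxes (e.g., fastener and backing plate) and one
# isolated box; the overlapped box is kept with a reduced score.
example = soft_nms_gaussian(
    [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)],
    [0.9, 0.8, 0.7],
)
```

In this example, the second box overlaps the first with an IoU of about 0.68, so hard NMS with a lower threshold would discard it, whereas Soft-NMS retains it with a decayed confidence.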

Focal Loss
In addition to the occlusion problem, the field track line images also meet the challenge of category imbalance. Concretely, in the field railway track line, there are far more key components such as fasteners in the normal state than in the abnormal state, which causes the problem of category imbalance, and affects the precision of the final description results.
To address this problem, the Focal Loss (FL) function discussed in [14] is utilized in our work. The FL is a modification of the standard Cross Entropy (CE) loss, which reduces the weight of easy-to-classify samples so that the model can focus more on difficult-to-classify samples during training. More formally, the CE and FL are given by

$$\mathrm{CE}(p_t) = -\log(p_t)$$

$$\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$$

where $p_t$ represents the predicted probability of the ground truth class and $\gamma \geq 0$ is a tunable focusing parameter. As shown in Figure 6, when setting $\gamma > 0$, FL greatly reduces the loss of well-classified samples and pays more attention to misclassified samples.
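The down-weighting effect can be illustrated numerically. The sketch below evaluates $\mathrm{CE}(p_t) = -\log(p_t)$ and $\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma}\log(p_t)$ for a single sample; the function names are hypothetical, and the value $\gamma = 2$ is an assumed setting for illustration, since the value used in RTLCap is not stated here.

```python
import math


def cross_entropy(p_t):
    """Standard cross-entropy for the ground-truth class probability p_t."""
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0):
    """Focal loss without class weighting; gamma=2.0 is an assumed value."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Ratio FL / CE equals the modulating factor (1 - p_t) ** gamma.
easy_ratio = focal_loss(0.9) / cross_entropy(0.9)   # well-classified sample
hard_ratio = focal_loss(0.1) / cross_entropy(0.1)   # misclassified sample
```

With these assumed settings, the loss of a well-classified sample ($p_t = 0.9$) is scaled by about 0.01, while that of a misclassified sample ($p_t = 0.1$) is scaled by about 0.81, so abundant normal-state components contribute far less to training than rare defective ones.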

Experimental Environment and Datasets
The proposed methodology is implemented in the Torch and PyTorch frameworks on an Ubuntu 18.04 operating system with an NVIDIA Titan X [39] GPU, using the Lua and Python programming languages. The RTLCap model is trained with an initial learning rate of $1 \times 10^{-4}$ for the detection task and $1 \times 10^{-3}$ for the caption task, respectively. Moreover, adaptive moment estimation (Adam) [40] with exponential decay rates for the first and second moments of 0.9 and 0.999 is adopted to update the weights of the networks. Note that the parameter 'epsilon' is set to $5 \times 10^{-4}$ in this work. Besides, the idea of transfer learning [41] is utilized in that weights trained on different datasets are used for weight initialization.
In addition, the Visual Genome (VG) [42] dataset and railway track line image (RTL-I) dataset are used as the evaluation benchmarks in our experiments. Similar to [3,43], the images and captions in the datasets are manually annotated using VGG image annotator (VIA) [44] (as shown in Figure 7). Furthermore, by using post-processing operations, the original data exported by VIA is converted into the data format required by the dense captioning model. The details of the Visual Genome (VG) dataset and railway track line image (RTL-I) dataset are as follows.
VG. Visual Genome (VG) is the largest dense caption dataset, with three available versions now: V1.0, V1.2, and V1.4. Besides, VG has been applied to a variety of vision-language tasks such as dense captioning and Visual Question Answering (VQA) [45]. For the purpose of a fair comparison, the V1.0 dataset and the same data splits as in [3,4] are used. In more detail, 77,398 images are used for training and 5000 images each for validation and test.
RTL-I. The railway track line image (RTL-I) dataset is made up of images taken from Beijing Metro Line 6 and is not publicly available in view of its specificity. According to the on-site survey of Beijing Metro Line 6 and the information provided by the maintenance engineers, the defects of the railway track lines mainly fall into three categories: broken fastener, missing fastener, and rail corrugation, as shown in Figure 8a,b. Rail corrugation is a periodic irregular wear phenomenon on the rail surface. The broken fastener is defined as the complete or partial fracture of the spring bar of the fastener, and the missing fastener is defined as the major or complete absence of fasteners. In detail, these images are captured by a handheld DSLR camera with the camera angle perpendicular to the roadbed, and the distance between the camera and roadbed is kept constant. The collected images are mainly composed of track, fasteners, backing plates, and roadbed. Moreover, due to the limited number of key components in defect status, image data augmentation methods such as rotation, mirroring, noise addition, and color perturbation are probabilistically applied to enhance the RTL-I dataset. Ultimately, RTL-I consists of 1019 images including 4690 captions, each of which corresponds to a region in a given image.

Evaluation Metrics
For evaluation, the prediction results are measured in terms of the mean Average Precision (mAP), which has been used in previous dense captioning works to assess the accuracy of localization and description comprehensively. In more detail, localization accuracy is determined using Intersection over Union (IoU) thresholds, {0.3, 0.4, 0.5, 0.6, 0.7}, while description accuracy is determined by METEOR [46] score thresholds, {0, 0.05, 0.1, 0.15, 0.2, 0.25}. METEOR is employed here not only because it produces the harmonic mean of precision and recall, but because it is considered the most relevant indicator in image description evaluation. The average precision is calculated across all paired settings of the above thresholds and the mAP is reported.
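The averaging over paired thresholds can be sketched as follows. The snippet only illustrates the final averaging step: `ap_fn` is a hypothetical callback that is assumed to return the average precision for one (IoU threshold, METEOR threshold) pair, and the computation of the per-pair AP itself is not shown.

```python
from itertools import product

# Threshold sets as listed in the text.
IOU_THRESHOLDS = [0.3, 0.4, 0.5, 0.6, 0.7]
METEOR_THRESHOLDS = [0.0, 0.05, 0.1, 0.15, 0.2, 0.25]

# Every pairing of a localization threshold with a description threshold.
THRESHOLD_PAIRS = list(product(IOU_THRESHOLDS, METEOR_THRESHOLDS))


def mean_average_precision(ap_fn):
    """Average ap_fn(iou_t, meteor_t) over all threshold pairs.

    ap_fn is a hypothetical function supplied by the evaluation code.
    """
    total = sum(ap_fn(iou_t, meteor_t) for iou_t, meteor_t in THRESHOLD_PAIRS)
    return total / len(THRESHOLD_PAIRS)
```

With five IoU thresholds and six METEOR thresholds, the reported mAP is therefore a mean over 30 average-precision values, so a prediction must be both well localized and well described to score highly.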

Loss Function
Generally, the loss function is a criterion for evaluating the performance of a model. Following the definition of the loss function in [3], the loss of our framework is also mainly composed of three parts: detection loss ($L_{det}$), bounding box regression loss ($L_{bbox}$), and caption loss ($L_{cap}$). The total loss function of training is as follows

$$L = \alpha L_{det} + \beta L_{bbox} + \gamma L_{cap} = \alpha \left(L_{det\_rpn} + L_{det\_cls}\right) + \beta \left(L_{bbox\_rpn} + L_{bbox\_cls}\right) + \gamma L_{cap}$$

where $L_{det\_rpn}$ and $L_{det\_cls}$ denote the detection loss in the region proposal network and recognition network, respectively. In the same way, $L_{bbox\_rpn}$ and $L_{bbox\_cls}$ represent the bounding box regression loss in the two networks. More concretely, $L_{det\_rpn}$ is a two-class cross-entropy loss for foreground/background regions, and $L_{det\_cls}$ is the aforementioned focal loss. Both $L_{bbox\_rpn}$ and $L_{bbox\_cls}$ are smoothed-L1 losses. Meanwhile, for the caption loss $L_{cap}$, a cross-entropy loss of sentences for description generation is used. Referring to [3], the weighting coefficients $\alpha$, $\beta$, and $\gamma$ are set to 1.0 in our experiments.

Availability of the ResNet-50-FPN
In this subsection, the availability of using ResNet-50-FPN as the basic feature extractor in the DenseCap model (denoted as DenseCap_RF) is evaluated on the Visual Genome (VG) dataset. The mAP results are shown in Table 1, in which the larger the mAP value, the better the performance of the model. The DenseCap model introduced in [3] is used as our baseline model which forms the basis of our work. Additionally, the performance of the dense captioning model proposed in [4] is also presented, which incorporates joint inference and visual context fusion (JIVC for short). The best result is highlighted in bold. As shown in Table 1, in comparison with the most basic two dense captioning models, the ResNet-50-FPN helps to improve mAP scores with gains of 4.38 and 0.46, respectively, indicating the superiority of the DenseCap_RF. On the other hand, it is illustrated again that using a more powerful feature extraction network like ResNet-50-FPN can facilitate the dense captioning task.

Performance Evaluation for RTLCap
In this experiment, the validity of our method on the RTL-I dataset is evaluated by comparison with DenseCap and DenseCap_RF. The results are shown in Table 2. From these experimental results, it can be concluded that the proposed RTLCap attains the best defect description performance among these models. Specifically, for the RTL-I dataset, the mAP of RTLCap is 0.980, which achieves gains of 0.837 and 0.022 compared to DenseCap and DenseCap_RF, respectively. Thus, it is a better alternative for practical applications on railway track lines. Figure 9 shows two examples of comparison results between DenseCap_RF and the proposed RTLCap, which qualitatively demonstrate the effectiveness of RTLCap in solving the problem of missing descriptions caused by object occlusion and misdescription caused by category imbalance.

Faster Railway Track Line Image Captioning Model
The RTLCap proposed in Section 3 achieves sound performance and is suitable for the automatic defect description of the railway track line image. Nonetheless, compared to the traditional NMS, the application of Soft-NMS brings a greater time cost to RTLCap, thus slowing down the image processing speed. Moreover, in the light of RPN (Region Proposal Network), the location and encoding of the region features in RTLCap are cumbersome, and there is redundancy in the generation of candidate boxes. Hence, the structure of RTLCap can be further simplified and optimized.
Nowadays, with the continuous in-depth research and development of one-stage object detection algorithms, many tasks completed by the two-stage object detection algorithm can be well fulfilled by the one-stage method. Moreover, recent results show that one-stage methods can achieve a balance of speed and accuracy [47], which presents an opportunity for reconstructing RTLCap to further improve performance.

One-Stage Detection Algorithm
Overall, the object detection algorithm based on deep learning is mainly divided into two streams: two-stage and one-stage algorithms [48]. In detail, the two-stage detection algorithm generates candidate regions on the image firstly, and then performs classification and boundary regression on each candidate region in turn, while the one-stage detection algorithm locates and classifies all targets on the entire image directly, omitting the step of generating candidate regions.
Compared with the two-stage algorithm represented by the R-CNN series [36,49–51], the regression-based one-stage algorithm has a simpler detection process and faster inference speed, and meets real-time requirements. More recently, encouraged by advances in computer vision and deep learning technologies, many one-stage detection algorithms with sound performance have been proposed, such as the SSD series [52–54], the YOLO series [15,55–57], RetinaNet [14], etc. Among them, the YOLOv3 proposed by Redmon et al. in 2018 obtained a better speed and accuracy trade-off at that time, and it is also one of the preferred algorithms for object detection in the industry. Therefore, we can borrow recipes from YOLOv3 in redesigning the RTLCap model.

Faster RTLCap
To decrease the complexity of RTLCap and further speed up the processing of railway track line images, a reconstructed RTLCap model, named Faster RTLCap, is introduced in this paper. More specifically, Faster RTLCap follows a similar path to RTLCap and is mainly composed of two parts: a feature bifurcation-fusion-based encoder part and a stacked LSTM-based decoder part. The architecture of the Faster RTLCap model is shown in Figure 10, and the details of Faster RTLCap are described below.

Feature Bifurcation-Fusion-Based Encoder Part
As shown in Figure 11, the encoder part based on bifurcation-fusion (named YOLO-MFLMF) is divided into three stages: the image feature extraction stage, the bounding box and class prediction stage, and the regional feature construction and encoding stage.

Image Feature Extraction Stage
In the image feature extraction stage, Darknet53 [15], whose accuracy is comparable to that of ResNet-101 and ResNet-152, is used as the feature extractor to obtain the basic feature maps of the input image. It consists of 52 convolutional layers and 1 maximum pooling layer. Similar to [15], the final pooling layer of Darknet53 is removed, and the feature maps obtained by the last three Resn modules are fed to the corresponding DBL5 modules to further extract and abstract features. Hence, an input image of shape $3 \times W \times H$ gives rise to feature tensors of shape $C \times W' \times H'$, where $C \in \{128, 256, 512\}$, $W' = W/k$, $H' = H/k$, and $k \in \{8, 16, 32\}$. Note that C and k are in one-to-one correspondence. The acquired feature maps are the global features of the image with multi-level receptive fields.

Bounding Box and Class Prediction Stage
Following [15], dimension clusters are applied as anchor boxes to predict bounding boxes. The generation of the anchor boxes is outlined in detail later. For each bounding box, 4 coordinates $(t_x, t_y, t_w, t_h)$ are predicted in this stage. Based on these, as shown in Figure 12, the center coordinates $(b_x, b_y)$, width $b_w$, and height $b_h$ of the box are calculated as follows:

$$b_x = \sigma(t_x) + c_x, \quad b_y = \sigma(t_y) + c_y, \quad b_w = p_w e^{t_w}, \quad b_h = p_h e^{t_h}$$

where $\sigma(\cdot)$ is the sigmoid function, and $c_x$ and $c_y$ are the offsets of the grid cell in which the center of the target object is located, measured from the first grid coordinate of the detection map. $p_w$ and $p_h$ represent the width and height of the preset anchor box, respectively. As a result, the predicted positive region proposals with 4 coordinates are input to the next stage as the foundation for regional feature positioning. In addition, as in [15], each box uses multi-label classification in this stage to predict the classes that the bounding box may contain.
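The decoding of a predicted box can be illustrated with a short sketch. It assumes the YOLOv3 convention of [15] (sigmoid offsets for the center, exponential scaling of the anchor prior for the size); the function name `decode_box` and the example values are illustrative only and are not taken from the authors' implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def decode_box(t, cell, anchor):
    """Map raw outputs (t_x, t_y, t_w, t_h) to a box in grid-cell units.

    cell   -- (c_x, c_y), offset of the grid cell responsible for the object
    anchor -- (p_w, p_h), width and height of the anchor prior
    """
    t_x, t_y, t_w, t_h = t
    c_x, c_y = cell
    p_w, p_h = anchor
    b_x = sigmoid(t_x) + c_x      # center stays inside the responsible cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)     # anchor prior rescaled by exp(t_w)
    b_h = p_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h

# All-zero outputs place the center in the middle of the cell and
# reproduce the anchor size unchanged.
box = decode_box((0.0, 0.0, 0.0, 0.0), cell=(3, 5), anchor=(2.0, 4.0))
print(box)
```

Multiplying the result by the stride of the detection map converts it to pixel coordinates of the network input.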

Regional Feature Construction and Encoding Stage
First, the basic image feature maps obtained from the image feature extraction stage and the proposals predicted in the bounding box and class prediction stage are fed together to the MFLMF module to build regional features. The specific structure of MFLMF is shown in Figure 13. The MFLMF module includes three main parts: feature localization, feature mapping, and feature fusion. It plays a key role in the Faster RTLCap model and is discussed in detail in Appendix A. Second, the SPP_FC module encodes the regional features from the MFLMF module. In detail, the features of each region are processed and flattened into a vector by an SPP (Spatial Pyramid Pooling) layer with three pooling window scales [16] and then passed through two fully connected layers. In this way, the increase in model parameters caused by MFLMF is reduced, and a code of dimension D = 4096 is generated for each region, so that its visual appearance is comprehensively and compactly encoded.
In the end, the codes for all predicted regions are expressed by a matrix of shape B × D and passed to the language model.
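How an SPP layer turns regions of different spatial sizes into vectors of identical length can be sketched as follows. The pyramid levels (1, 2, 4) and the use of max pooling are assumptions made for illustration; the paper only states that three pooling window scales are used, and the two fully connected layers that follow are omitted here.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)

def spp(region, levels=(1, 2, 4)):
    """Max-pool a C x H x W region (nested lists) into a fixed-length vector.

    Each level n splits every channel into an n x n grid of bins and keeps
    the maximum of each bin, so the output length is C * sum(n * n).
    """
    vector = []
    for n in levels:
        for channel in region:
            h, w = len(channel), len(channel[0])
            for i in range(n):
                r0, r1 = (i * h) // n, ceil_div((i + 1) * h, n)
                for j in range(n):
                    c0, c1 = (j * w) // n, ceil_div((j + 1) * w, n)
                    vector.append(max(channel[r][c]
                                      for r in range(r0, r1)
                                      for c in range(c0, c1)))
    return vector

# Two regions with 2 channels each but different spatial sizes (3x5 and 6x4).
small = [[[float(r * 5 + c) for c in range(5)] for r in range(3)] for _ in range(2)]
large = [[[float(r * 4 + c) for c in range(4)] for r in range(6)] for _ in range(2)]
print(len(spp(small)), len(spp(large)))
```

Because the output length depends only on the number of channels and the pyramid levels, the first fully connected layer sees a fixed input size regardless of the size of the region.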

Convolutional Anchors
Following the idea in [30], the prior anchor boxes are reconstructed by using K-means clustering, in which the Intersection over Union (IOU) of the rectangular boxes (represented by $R_{IOU}$) is adopted as the similarity measure, and the distance function of the clustering is given by

$$d(B, C) = 1 - R_{IOU}(B, C)$$

where B and C denote the size of a rectangular box and the cluster center, respectively, and $R_{IOU}(B, C)$ stands for the IOU between the two rectangular boxes. The relationship between the average IOU and the number of anchor boxes is shown in Figure 14.
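The anchor clustering can be sketched as below. This is a minimal illustration, assuming the usual distance d = 1 - IOU between a box and a cluster center, with boxes represented only by (width, height) and aligned at a common corner; the deterministic initialisation, the mean update, and the toy box sizes are choices made for this example and are not taken from the paper.

```python
def iou_wh(box, centroid):
    """IoU of two boxes given as (width, height), aligned at a common corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union

def kmeans_anchors(boxes, k, max_iter=100):
    """Cluster (w, h) pairs with distance 1 - IoU; returns (anchors, avg_iou)."""
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    n = len(ordered)
    # Deterministic initialisation: k boxes spread evenly over the area ranking.
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: 1.0 - iou_wh(b, centroids[c]))
            for b in boxes
        ]
        if new_assignment == assignment:
            break  # converged: no box changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [b for b, a in zip(boxes, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    avg_iou = sum(max(iou_wh(b, c) for c in centroids) for b in boxes) / len(boxes)
    return sorted(centroids), avg_iou

# Toy ground-truth box sizes forming two obvious groups.
boxes = [(10, 10), (11, 10), (10, 11), (50, 100), (52, 98), (49, 101)]
anchors, avg_iou = kmeans_anchors(boxes, k=2)
print(anchors, round(avg_iou, 3))
```

Running the function for several values of k and recording `avg_iou` gives the kind of curve of average IOU against the number of anchors that is used to choose the number of clusters.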

Stacked LSTM-Based Decoder Part
For the decoder part, a sentence generator based on the Long Short-Term Memory (LSTM) cell is considered because it has shown sound performance on sequential tasks such as machine translation and sequence generation [58]. LSTM is a recurrent neural network that incorporates a built-in memory cell to store information and exploit long-range context. In detail, as shown in Figure 15, the LSTM memory cell is surrounded by three gating units, which control whether to forget the current cell value (forget gate f), whether to read its input (input gate i), and whether to output the new cell value (output gate o), respectively. The gates, the cell update, and the output are defined as follows:

$$i_t = \sigma(W_{ix} x_t + W_{ih} h_{t-1})$$
$$f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1})$$
$$o_t = \sigma(W_{ox} x_t + W_{oh} h_{t-1})$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{cx} x_t + W_{ch} h_{t-1})$$
$$h_t = o_t \odot c_t$$

where $\odot$ denotes the product with a gate value, and the various W matrices are the weight parameters to be learned. The nonlinearities are the sigmoid $\sigma(\cdot)$ and the hyperbolic tangent $\tanh(\cdot)$. The last equation, $h_t$, is fed to the Softmax function, which produces a probability distribution $p_t$ over all words. More details about LSTM can be found in [59].
Drawing on the work in [60], a stacked LSTM named CM-LSTM is adopted as the language model in the Faster RTLCap model, as deep architectures have powerful capabilities in feature self-learning [61]. Precisely, as shown in Figure 16, CM-LSTM comprises two modules: a Caption-LSTM (C-LSTM) for encoding caption inputs and a Multimodal LSTM (M-LSTM) for embedding visual and textual vectors into a common semantic space and decoding them into a sentence. Formally, for a raw image input $\tilde{I}$ and a caption sentence S, the encoding is performed as follows, where $\varphi$ and C denote the feature encoding network and the C-LSTM model, respectively, and $W_{\tilde{I}}$ and $W_c$ are their corresponding weights. E is the embedding matrix learned by the network.
Then, the encoded visual and textual representations are embedded into the M-LSTM, and the hidden state output of the M-LSTM can be formulated as follows, where M denotes the M-LSTM and $W_m$ its weights. The visual vector I is fed to the model only once, at t = −1. Finally, a Softmax layer on top of the M-LSTM computes the probability distribution p of the next predicted word, where $p \in \mathbb{R}^n$ and n is the vocabulary size.
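A single step of the LSTM cell that underlies both C-LSTM and M-LSTM can be sketched in plain Python. The sketch assumes bias-free gates of the same form as the cell update given in this section and an output h_t = o_t ⊙ c_t; many implementations instead use o_t ⊙ tanh(c_t). The weight dictionary keys and the toy dimensions are hypothetical, and the Softmax is applied directly to h_t without a vocabulary projection for brevity.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W_x, x, W_h, h):
    """Compute W_x x + W_h h for matrices stored as lists of rows."""
    return [sum(a * b for a, b in zip(row_x, x)) +
            sum(a * b for a, b in zip(row_h, h))
            for row_x, row_h in zip(W_x, W_h)]

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM step; W maps names such as 'ix' or 'ch' to weight matrices."""
    i = [sigmoid(z) for z in affine(W["ix"], x, W["ih"], h_prev)]   # input gate
    f = [sigmoid(z) for z in affine(W["fx"], x, W["fh"], h_prev)]   # forget gate
    o = [sigmoid(z) for z in affine(W["ox"], x, W["oh"], h_prev)]   # output gate
    g = [math.tanh(z) for z in affine(W["cx"], x, W["ch"], h_prev)] # candidate
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * ct for ot, ct in zip(o, c)]  # assumed output form, see lead-in
    return h, c

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

# Toy sizes: 3-dimensional input, 2-dimensional hidden state, zero weights.
W = {name: zeros(2, 3) for name in ("ix", "fx", "ox", "cx")}
W.update({name: zeros(2, 2) for name in ("ih", "fh", "oh", "ch")})
h, c = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [2.0, -4.0], W)
print(h, c, softmax(h))
```

With zero weights every gate equals 0.5, so the cell state is halved and the output is half of the new cell state, which makes the role of each gate easy to check by hand.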

Experiments and Results
In this experiment, the proposed model is compared with RTLCap, Faster RTLCap without the SPP layer, and Faster RTLCap with LSTM to illustrate the feasibility and effectiveness of Faster RTLCap. More specifically, the RTL-I dataset is used as the evaluation benchmark. The experimental environment and evaluation metrics are the same as those discussed in the previous section. Furthermore, the weights of the feature extraction network in Faster RTLCap are initialized with YOLOv3 weights pre-trained on MS-COCO [62].

Loss Function
The proposed Faster RTLCap model can be trained end-to-end by optimizing a joint loss derived from YOLOv3 and RTLCap. Formally, the joint loss L is stated as follows:

$$L = \alpha L_{coord} + \beta L_{iou} + \gamma L_{class} + \lambda L_{cap}$$

where $L_{coord}$ denotes the coordinate prediction error, $L_{iou}$ represents the IoU (Intersection over Union) error, and $L_{class}$ stands for the classification error. A specific description of these three losses can be found in our preceding work [30]. $L_{cap}$ is a cross-entropy loss, which is the same as the caption loss explained in the previous section. The values of α, β, γ, and λ are all set to 1.0 in our experiments.

Performance Evaluation for Faster RTLCap
The performance comparison results are shown in Table 3, and the best result is highlighted in bold. Faster RTLCap (no SPP) denotes the Faster RTLCap model without the SPP layer, while Faster RTLCap (with LSTM) denotes the Faster RTLCap model whose decoder part uses a plain LSTM. Based on these results, it is evident that Faster RTLCap has a better defect description performance than RTLCap, despite the small increase in model parameters. More concretely, compared with RTLCap, the image processing time of Faster RTLCap is reduced by about 97.7%, and the defect description accuracy is improved by about 1.12%. These results demonstrate the effectiveness of Faster RTLCap, which reduces the image processing time of RTLCap significantly while maintaining an ideal defect description performance. Furthermore, it can also be concluded that the SPP layer used in Faster RTLCap decreases the model parameters effectively and further improves the defect description accuracy. Besides, the experimental results also show that using CM-LSTM instead of LSTM as the language model improves the defect description performance with a slight increase in model complexity.
To further verify the effect of the number of anchors, an experiment is also carried out to evaluate the performance of Faster RTLCap models with different numbers of anchors. The mAP scores and image processing times for Faster RTLCap with different numbers of anchors are shown in Table 4, in which the image processing time is obtained by averaging the times of multiple experiments under the same conditions. Based on the experimental results, it can be observed that Faster RTLCap achieves the highest defect description accuracy when the number of anchors is set to 9. Moreover, the image processing time of Faster RTLCap changes only slightly with the number of anchors.
Therefore, combined with the analysis of the relationship between the number of anchor boxes and the average IOU (as shown in Figure 14), it can be seen that selecting 9 clusters not only attains a better defect description accuracy, but also achieves an ideal image processing speed.

Conclusions
In this paper, the automatic defect description of railway track line images is addressed for the first time. First, encouraged by recent advances in dense image captioning, a railway track line image captioning model (RTLCap) based on DenseCap [3] is proposed. The experiment on the VG dataset illustrates that using ResNet-50-FPN as the basic feature extractor benefits dense captioning tasks, while the quantitative and qualitative experiments on the RTL-I dataset demonstrate the performance of RTLCap and show that the use of Soft-NMS and Focal Loss effectively alleviates the problems of object occlusion and category imbalance. Second, to improve the image processing speed and further optimize the structure of RTLCap, a redesigned RTLCap model, named Faster RTLCap, is constructed using YOLOv3. The experimental results indicate that, compared with RTLCap, the method based on Faster RTLCap has better defect description performance, notably reducing the image processing time by about 97.7% and improving the defect description accuracy by 1.12%. Furthermore, the structure of Faster RTLCap is simpler than that of RTLCap. All findings are in line with our expectations, and the proposed models can automatically generate information about the type, position, status, and interrelationship of key components from railway track line images collected in the field, providing better alternatives for practical applications.
In future work, we will mainly focus on the following aspects. On the one hand, the RTL-I dataset will be further expanded, and the methods proposed in this work will be further validated in a field test and applied to other different railway scenarios to improve the versatility of these models. On the other hand, the advanced speech recognition and machine translation technologies will be carefully investigated and integrated to develop a comprehensive railway image captioning system, which can not only generate text descriptions of the image content automatically but also generate the corresponding humanized voice descriptions.
Author Contributions: D.W. collected and analyzed the data, made charts and diagrams, conceived and performed the experiments and wrote the paper; X.W. conceived the structure and provided guidance; L.J. modified the manuscript. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest: The authors declare no conflict of interest.

Appendix A. Design Details of MFLMF
As shown in Figure 13, the MFLMF module consists of three parts: feature localization, feature mapping, and feature fusion. Here, the design details of MFLMF are described as follows. Note that for better discussion, the blocks corresponding to the last three Resn modules are named level3, level4, and level5 from top to bottom, respectively.

Appendix A.1. Feature Localization
Similar to the discussion in [15], the basic feature extractor applied in the Faster RTLCap model is a fully convolutional network. Hence, the relative position of the ROI (region of interest), that is, the target object region, in the feature map at each level is invariant, while its size changes accordingly with the depth of the network.
More formally, let the original input image size and the feature extraction network input size be (img_w, img_h) and (w, h), respectively. In the first step, the original input image is resized to the size required by the feature extraction network, with an adjustment factor of $\alpha = \min\left(\frac{w}{\mathrm{img\_w}}, \frac{h}{\mathrm{img\_h}}\right)$. As a result, the size of the adjusted input image (nw, nh) and the offset parameter (dx, dy) are calculated as follows:

$$nw = \alpha \cdot \mathrm{img\_w}, \quad nh = \alpha \cdot \mathrm{img\_h}, \quad dx = \frac{w - nw}{2}, \quad dy = \frac{h - nh}{2}$$

Based on these formulas, assuming that the coordinates of a certain point in the original image are (x, y), the position coordinates (x', y') of this point mapped onto the feature maps of a certain level can be computed as follows:

$$x' = \frac{\alpha x + dx}{k}, \quad y' = \frac{\alpha y + dy}{k}$$

where k is the downsampling factor of that level. The reverse is also true: once the coordinates of the predicted proposals are obtained, the regional position coordinates corresponding to the proposals in the feature maps of a certain level of the network can be derived from bottom to top. Therefore, after the proposals are generated by the bounding box and class prediction stage, the feature regions corresponding to the proposals are localized in the feature maps of level3, level4, and level5 as the basis for further constructing regional features (as shown in Figure 13). In this way, the finally obtained regional features are enhanced by fusing features with multi-level receptive fields.
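The localization step can be made concrete with a small sketch. It assumes the common aspect-preserving (letterbox) resize, in which the scaled image is centered in the network input and the leftover border is split evenly, and it assumes that a feature map at a given level is downsampled by a fixed stride (8, 16, or 32). The function names and the integer rounding are illustrative choices, not the authors' code.

```python
def letterbox_params(img_w, img_h, w, h):
    """Scale factor, resized size (nw, nh), and padding offset (dx, dy)."""
    alpha = min(w / img_w, h / img_h)
    nw, nh = int(img_w * alpha), int(img_h * alpha)
    dx, dy = (w - nw) // 2, (h - nh) // 2
    return alpha, (nw, nh), (dx, dy)

def to_feature_coords(x, y, alpha, offset, stride):
    """Map a point of the original image onto a feature map with this stride."""
    dx, dy = offset
    return (x * alpha + dx) / stride, (y * alpha + dy) / stride

def box_on_levels(box, alpha, offset, strides=(8, 16, 32)):
    """Project a box (x1, y1, x2, y2) in image pixels onto every level."""
    x1, y1, x2, y2 = box
    return [to_feature_coords(x1, y1, alpha, offset, s) +
            to_feature_coords(x2, y2, alpha, offset, s)
            for s in strides]

# An 832 x 624 image resized for a 416 x 416 network input.
alpha, size, offset = letterbox_params(832, 624, 416, 416)
print(alpha, size, offset)
print(box_on_levels((0, 0, 832, 624), alpha, offset))
```

The same box occupies the same relative position on every level, while its extent shrinks with the stride, which is the property the MFLMF module relies on.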

Appendix A.2. Feature Mapping and Fusion
After feature localization, to unify the feature sizes, the three-level raw regional features of different sizes are fed to the corresponding RoIAlign layers [51]. More precisely, using four sampling positions, the RoIAlign layer applies bilinear interpolation to normalize the three-level raw regional features to a certain size and then pools them to a uniform size.
Finally, the regional features from the three levels are concatenated together along the channel as the final regional features.
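The mapping and fusion steps can be sketched as follows. This is a simplified illustration rather than the RoIAlign implementation of [51]: each output bin is sampled at four regularly spaced points (2 x 2) with bilinear interpolation and the samples are averaged, sample coordinates are taken directly as array indices, and the fusion is a plain channel-wise concatenation. The names `roi_align` and `fuse_levels`, the output size, and the toy feature maps are assumptions for the example.

```python
def bilinear(fmap, y, x):
    """Bilinearly interpolate a 2-D map (list of rows) at a real-valued point."""
    h, w = len(fmap), len(fmap[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    ly, lx = y - y0, x - x0
    return (fmap[y0][x0] * (1 - ly) * (1 - lx) + fmap[y0][x1] * (1 - ly) * lx +
            fmap[y1][x0] * ly * (1 - lx) + fmap[y1][x1] * ly * lx)

def roi_align(fmap, box, out_size=2, samples=2):
    """Pool a box (x1, y1, x2, y2) in feature-map coordinates to out_size^2."""
    x1, y1, x2, y2 = box
    bin_w, bin_h = (x2 - x1) / out_size, (y2 - y1) / out_size
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            vals = [bilinear(fmap,
                             y1 + (i + (sy + 0.5) / samples) * bin_h,
                             x1 + (j + (sx + 0.5) / samples) * bin_w)
                    for sy in range(samples) for sx in range(samples)]
            row.append(sum(vals) / len(vals))  # average of the sample points
        out.append(row)
    return out

def fuse_levels(levels, boxes, out_size=2):
    """Align the region on every level, then concatenate along channels."""
    fused = []
    for channels, box in zip(levels, boxes):
        fused.extend(roi_align(ch, box, out_size) for ch in channels)
    return fused

def const_map(value, size, channels):
    return [[[value] * size for _ in range(size)] for _ in range(channels)]

# Three toy levels with 2, 3, and 4 channels and decreasing resolution.
levels = [const_map(1.0, 8, 2), const_map(2.0, 4, 3), const_map(3.0, 2, 4)]
boxes = [(1.0, 1.0, 5.0, 5.0), (0.5, 0.5, 2.5, 2.5), (0.25, 0.25, 1.25, 1.25)]
fused = fuse_levels(levels, boxes)
print(len(fused), len(fused[0]), len(fused[0][0]))
```

After alignment every level contributes maps of the same spatial size, so the concatenated regional feature has as many channels as the three levels combined.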