4.1. Experimental Setting
Datasets. A total of 3235 overhead-view images of railways were collected in Ma’anshan, Nanjing, Qinghai–Tibet Railway and our laboratory. These images were divided into 2277 images for the training dataset, 655 images for the validation dataset, and 303 images for the test dataset at a ratio of 7:2:1.
Figure 5 shows some examples.
Implementation details. Our network was implemented in PyTorch and run on three NVIDIA TITAN RTX GPUs (24 GB of memory each). For training, this paper used the stochastic gradient descent (SGD) optimizer with an initial learning rate of 0.01, a momentum of 0.9, and a weight decay of 0.0001. The final model was trained for 360 epochs, and the models for the ablation experiments were trained for 36 epochs. Pre-processing operations included resizing, random flipping, normalization and padding. After inference, erosion was applied as post-processing to remove noise points and lines from the skeleton-detection output.
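The erosion post-processing step can be illustrated with a minimal pure-Python sketch. The 3×3 square structuring element, the single iteration, and the function name `erode` are assumptions made for illustration; the kernel and iteration count used in the experiments are not stated here.

```python
def erode(mask, iterations=1):
    """Binary erosion with an assumed 3x3 square structuring element.

    `mask` is a list of rows containing 0/1 values. Pixels outside the
    image are treated as background.
    """
    h, w = len(mask), len(mask[0])
    for _ in range(iterations):
        out = [[0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                # Keep a pixel only if its whole 3x3 neighbourhood is foreground.
                keep = all(
                    0 <= y + dy < h and 0 <= x + dx < w and mask[y + dy][x + dx]
                    for dy in (-1, 0, 1)
                    for dx in (-1, 0, 1)
                )
                out[y][x] = 1 if keep else 0
        mask = out
    return mask


# Hypothetical prediction: a thick band plus one isolated noise pixel at (2, 6).
mask = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
cleaned = erode(mask)
```

In this toy example the isolated pixel disappears while the core of the band survives, which is the noise-removal effect the post-processing relies on.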
Evaluation protocol. Average precision (AP), average recall (AR) and mean average precision (mAP) were used as evaluation metrics for instance segmentation. For skeleton detection, the F-measure score (F-score) was used as the evaluation metric. A weighted-mean metric $M_w$ was designed for our dynamic-weight parallel instance and skeleton network, combining the mAP of the instance-segmentation branch and the F-score of the skeleton-detection branch, defined as:
$$M_w = \lambda \cdot \mathrm{mAP} + (1 - \lambda) \cdot F\text{-score}$$
where $\lambda$ is the weight for mAP, set to 0.2.
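As a worked illustration of the weighted-mean metric, the sketch below assumes that the weight not given to mAP (that is, 0.8) goes to the F-score, which is what a two-term weighted mean implies. The function name and the input values are hypothetical and are not results from the experiments.

```python
def weighted_mean_metric(m_ap, f_score, map_weight=0.2):
    """Weighted mean of mAP and F-score.

    Assumption: the two weights sum to one, so the F-score receives
    1 - map_weight.
    """
    if not 0.0 <= map_weight <= 1.0:
        raise ValueError("map_weight must lie in [0, 1]")
    return map_weight * m_ap + (1.0 - map_weight) * f_score


# Hypothetical inputs, chosen only to make the arithmetic easy to follow:
# 0.2 * 0.5 + 0.8 * 0.75 = 0.1 + 0.6 = 0.7
example = weighted_mean_metric(0.5, 0.75)
```

With a weight of 0.2 for mAP, the metric is dominated by the skeleton-detection branch, which matches the final goal of the task.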
4.2. Main Result
The mAP for instance segmentation, the F-score for skeleton detection and the weighted-mean metric for the whole task achieved by our network are reported in Table 1, and some typical examples of detection masks are shown in Figure 6.
The model performs well on images with a single railway or multiple railways, and on both straight and curved railways, as indicated by Figure 6(A1,A2). The UAVs need to fly at different heights to carry out their tasks, so the model must handle railways at various scales in the images. As shown in Figure 6(B1–B3), the model accurately identifies the railways and skeletons as the flying height changes from low to high. In certain environments, the railway may be shaded by trees or other objects, which increases the difficulty of the task. As shown in Figure 6(C1,C2), the railways and skeletons are fully detected even though the railway is partly shaded by sparse trees. However, in Figure 6(C3), the detected results are truncated by a tree that is dense enough to obscure the railway completely. To operate at different times of the day and in various weather conditions, the UAVs must be able to detect the skeleton of a railway under different lighting conditions. As demonstrated in Figure 6(D1–D4), the instance-segmentation and skeleton-detection results remain correct whether the light is strong or dim.
Compared to the original SOLOv2 with a ResNet-50 backbone, our changes to the backbone and attention module result in a higher mAP for instance segmentation. Compared to the original AdaLSN with a fixed architecture and an Inception-v3 backbone, our novel network achieves better skeleton detection.
4.3. Ablation Experiments
This paper investigates and compares the following five aspects of our method:
Threshold function. To improve inference with a model that requires only a few epochs of training, a threshold function is added after the original sigmoid function to simplify the features. We compare the inference results without a threshold function and with four thresholds close to one, each of the form $1 - 10^{-k}$. As shown in Figure 7, adding a threshold function significantly improves inference. The results become good enough once the threshold is raised to the value that is ultimately chosen as our threshold. As expected, the skeleton of each railway becomes more distinct and the F-score is much higher than that obtained without the threshold function, as shown in Table 2.
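A minimal sketch of a threshold applied after the sigmoid is given below. It assumes that the threshold function binarizes the sigmoid output (1 at or above the threshold, 0 otherwise). The value `1 - 1e-3` is purely illustrative and is not the threshold selected in the experiments, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def threshold_sigmoid(logits, threshold):
    """Apply the sigmoid, then keep only responses at or above `threshold`.

    Assumption: the output is binarized to 0/1.
    """
    return [[1 if sigmoid(v) >= threshold else 0 for v in row] for row in logits]


# Hypothetical skeleton logits; only very confident pixels survive.
logits = [
    [8.0, 5.0, -2.0],
    [10.0, 6.0, 7.5],
]
skeleton = threshold_sigmoid(logits, threshold=1 - 1e-3)
```

Because the threshold is so close to one, a logit must exceed roughly 6.9 to pass, so weak responses from an under-trained model are suppressed.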
Backbone. According to previous research, the best backbone for SOLOv2 is ResNet, and Inception-v3 is the best for AdaLSN. To further compare the impact of the backbone, this paper trains SOLOv2 (only for the instance-segmentation branch), AdaLSN with a fixed architecture (only for the skeleton-detection branch), and our dynamic-weight parallel instance and skeleton network with different backbones, including an ELAN-based backbone, Inception-v3, ResNet-50 and ResNet-101.
Table 3 shows the evaluation metrics for the instance-segmentation branch alone. The network using the ELAN-based backbone achieves a higher mAP than the networks using the other backbones, including ResNet-50, indicating that the ELAN-based architecture is the best choice among them for the instance-segmentation branch.
As shown in Table 4, for the skeleton-detection branch alone, the ELAN-based backbone does not achieve the best F-score among the backbones compared. However, the main purpose of changing the backbone is to improve the detection results of the instance-segmentation branch. Therefore, this paper pays more attention to the metrics of the instance-segmentation network and our parallel network.
The evaluation metrics for our novel network are shown in Table 5. The weighted-mean metric of the ELAN-based backbone network is 0.06 smaller than that of the Inception-v3 backbone network, which is mainly due to the smaller AP. However, as shown in Table 3, the mAP of the ELAN-based backbone network is larger. The different results may arise because the weight of the instance-segmentation branch is always less than one in our network. Therefore, this paper still uses the ELAN-based backbone when training for more epochs.
Attention module. This paper compares the effect of adding SimAM to the network using an ELAN-based backbone. For the instance-segmentation-only network, the mAP is slightly larger when SimAM is added, as shown in Table 3. As expected, adding SimAM also improves our novel network with both the instance-segmentation and skeleton-detection branches, as shown in Table 5. However, it performs poorly for the skeleton-detection-only network, as shown in Table 4.
Loss function parameter. This paper designs a novel fused loss function to adjust the dynamic weight of the two branches during training. To increase the weight of the skeleton-detection branch as the loss of the instance-segmentation branch gradually stabilizes, the sigmoid function is chosen as the base. The sum of the weights for the two branches is always kept at one. This paper adds two parameters to adjust the function: alpha, which determines when the weight reaches 0.5, and beta, which determines the rate at which the weight increases.
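The dynamic weighting described above can be sketched as follows. The exact parameterization is an assumption: here the skeleton weight is a logistic function of `beta * (epoch - alpha)`, so `alpha` is the epoch at which the weight equals 0.5 and `beta` sets the steepness. The function names and example values are hypothetical, and any rescaling of the parameters by the number of training epochs is not reproduced.

```python
import math


def skeleton_weight(epoch, alpha, beta):
    """Assumed sigmoid schedule for the skeleton-detection weight.

    The weight is 0.5 when epoch == alpha and rises faster for larger beta.
    """
    return 1.0 / (1.0 + math.exp(-beta * (epoch - alpha)))


def fused_loss(loss_instance, loss_skeleton, epoch, alpha, beta):
    """Combine the two branch losses with weights that always sum to one."""
    w_skeleton = skeleton_weight(epoch, alpha, beta)
    w_instance = 1.0 - w_skeleton
    return w_instance * loss_instance + w_skeleton * loss_skeleton


# Hypothetical schedule over 36 epochs with alpha=18 and beta=0.5.
schedule = [skeleton_weight(e, alpha=18, beta=0.5) for e in range(37)]
```

Early in training the instance-segmentation loss dominates, and the skeleton-detection loss takes over once the epoch passes `alpha`.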
Experiments show that the segmentation results are already satisfactory after training for 36 epochs. Therefore, this paper redefines both parameters relative to this epoch number. The results of the experiments for finding the best parameters are shown in Figure 8.
One of the two parameters is varied from 1 to 18 while the other is kept fixed at 1, and the curves for each value are shown in Figure 8c. When the varied parameter is equal to one, the weight increases most slowly overall during training. A new mean metric was calculated by assigning weights of 0.35 to AP, 0.35 to AR, and 0.3 to F-score. This is mainly because the instance-segmentation branch is more important at the beginning of training. As shown in Figure 8d, the best result is obtained when this parameter is set to one.
The other parameter was then varied from 0.5 to 1 while the first was kept fixed at its best value, and the curves for each value are shown in Figure 8a. Decreasing this parameter means that the weight of the skeleton-detection branch is lower at the same training epoch. The mean metric is calculated in the same way as in the previous experiment. As shown in Figure 8b, the best value for this parameter is 0.9.
Parallel network. The main idea behind our network is to add a parallel instance-segmentation branch to the skeleton-detection network in order to remove skeletons of the wrong targets. This paper compares the results of our novel network with those of the skeleton-detection network. As shown in Table 6, our network improves the F-score over the skeleton-detection network with the same backbone and attention module, and also over the fixed AdaLSN with the original backbone. As expected, the results of the skeleton-detection network include skeletons of objects that are not railways, which are absent from the results of our novel parallel network, as shown in Figure 9.
4.4. Analysis
Experimental results show that our DWPIS with two branches obtains better railway skeletons than the base skeleton-detection network. Careful analysis shows that these improvements come from three main sources: the instance-segmentation branch, the fused loss function, and the inference function.
Firstly, the instance segmentation locates the railway target. Through training the instance-segmentation branch, the network extracts the useful features of the railways containing the skeleton and the parameters of the shared backbone are optimized.
Secondly, the fused loss function with a dynamic weight uses the training epoch to decide which branch dominates training. By controlling the dominant branch, the network is trained mainly for instance segmentation first and then mainly for skeleton detection for a long time. For the skeleton-detection branch, the Dice loss function is better suited to the severe imbalance between positive and negative samples and speeds up the convergence of training.
Finally, the threshold function, which is applied to the skeleton-detection branch during inference, refines the skeleton results. Adding a threshold function after the original sigmoid function filters out noise, so better results can be obtained after fewer training epochs.
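To illustrate why the Dice loss mentioned above suits the imbalance between skeleton and background pixels, the sketch below uses one common form of the Dice loss, 1 − 2|P∩G| / (|P| + |G|), with a small smoothing constant. This form, the function name, and the toy data are assumptions; the exact variant used for the skeleton-detection branch may differ.

```python
def dice_loss(pred, target, eps=1e-6):
    """Common Dice loss form over flat lists of probabilities and 0/1 labels."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


# Hypothetical 100-pixel image in which only 2 pixels belong to the skeleton.
target = [0] * 100
target[40] = target[41] = 1

all_background = [0] * 100  # predicts no skeleton at all
pixel_accuracy = sum(p == t for p, t in zip(all_background, target)) / len(target)

loss_trivial = dice_loss(all_background, target)
loss_perfect = dice_loss(target, target)
```

The all-background prediction reaches 98% pixel accuracy, yet its Dice loss stays close to one, so the loss keeps pushing the network to find the few skeleton pixels.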