1. Introduction
Machine learning (ML) algorithms are suitable for control tasks that are difficult to formalize, and therefore difficult to master with conventional control approaches. ML approaches are increasingly applied to the control of Cyber–Physical Systems (CPSs) and used in safety-critical contexts in mobility, healthcare, industrial automation, and other fields. One typical application is the use of neural networks for image processing in autonomous driving in the automotive domain, for functions like lane detection and collision avoidance [1]. Nevertheless, the black-box nature of machine learning algorithms hinders the demonstration of safety-related properties of ML-based CPSs, making it difficult to apply the standard processes and methods usually adopted by the safety engineering community [2,3]. Key problems with the use of neural networks for safety-critical tasks relate to traceability issues between requirements and solution elements and characteristics, coverage demonstration issues, and verification issues [4].
In this paper, we apply the well-known N-version programming method from traditional safety engineering to a machine learning-based rail track detection system, in order to analyze the effectiveness of this method on a real-world ML application and to address the difficulties described above. Rail track detection is one of the core perception functions needed for autonomous driving on rails. With this function, all further perception tasks, such as obstacle detection, are able to distinguish the train’s path, i.e., the ego-track, from the rest of the image. The ego-track is the critical area that needs supervision for obstacles of any kind, in order to ensure a safe ride of the train without collisions. Hence, the rail track detection itself is a safety-critical perception task.
The aim of this paper is to investigate the use of an N-version architecture with multiple neural networks for safe rail track detection. Different network architectures are combined in order to detect and, if necessary, correct perception errors. While existing work discusses N-version approaches conceptually, we implement a complete architecture based on real data with quantitative evaluation to detect errors and increase prediction quality. In our work, we create a novel domain-specific algorithm for prediction combination (Pixel Majority Voting) in N-Version architectures, which is analyzed and compared to a classical voting algorithm (Maximum Confidence Voting). In contrast to other approaches, we also develop a novel method for confidence evaluation of DNN predictions, as well as an algorithm to utilize those confidence values for detection of errors in the N-version architecture.
After a brief discussion of related literature in Section 2, we present an overview of the N-version approach in Section 3, explaining all components of the N-version system, including the Deep Neural Networks used for rail track detection and the dataset. In Section 4, we present the results of our work, followed by a brief discussion and conclusion in Section 5 and Section 6.
2. Related Work
Similar to other industries, the rail sector has also been conducting intensive research into the development of autonomous systems in recent years. The goal is the fully autonomous operation of trains, also known as Grade of Automation (GoA) 4 [5].
Such autonomous trains have the potential to increase both the efficiency and safety of rail transportation. For example, the frequency of human error as a cause of failure can be drastically reduced through the software-based realization of control tasks and the safeguarding of this software through existing or newly created standards and norms [6].
Highly automated or even autonomous trains also enable increased route capacities and support the flexibility of train traffic, especially for regional and freight traffic [7]. These advantages might help to shift traffic from other means of transport to the rails and thus lead to positive effects with regard to climate change.
However, Singh et al. [6] state that there are also considerable technical and regulatory challenges for the development and use of autonomous trains. These include the integration of reliable sensor technologies and the detection and management of unforeseen events.
Precise detection and assessment of obstacles, as well as reliable interaction with other trains and road users, play a decisive role. According to Trentesaux et al., this requires the development of robust sensor technologies and machine learning algorithms, as well as robust safety architectures [8].
Obstacle detection is necessary for highly automated or autonomous trains and is one of many different subjects that are currently being researched in the context of autonomous trains using machine learning methods. In order to reliably classify objects in the vicinity of the train as obstacles, it is necessary to recognize the path of the train. This train path is also known as ego-track [9].
For the reasons mentioned above, rail track detection, the subject of this work, can be regarded as a basic safety-critical function for autonomous driving on rails. A widely used method for rail track detection is semantic segmentation using convolutional neural networks (CNNs) [10,11,12]. Li et al. [10] and Ristić-Durrant et al. [13] confirm that the use of CNNs offers advantages over traditional methods that use line or corner features. Traditional methods can be used effectively for individual scenes and, unlike CNNs, do not pose a major hurdle in terms of approval and proof of reliability and safety. On the other hand, they have problems with generalization, especially in the case of dynamic scene changes while the train is moving.
In addition to semantic segmentation, DNN architectures for alternative detection methods, such as key point estimation, have been proposed in the literature [14,15,16] with promising results.
Other work [17] shows that the combined use of camera and LiDAR sensors for obstacle detection offers advantages. It shows that the use of deep learning techniques in this area enables reliable and fast obstacle detection that even surpasses traditional detection methods in terms of precision and reaction time.
One problem highlighted in the literature is the lack of availability of training data in the railroad sector [13]. Drizi and Boukadoum [18] therefore propose the use of transfer learning and augmentations in the railroad sector as well. Both are widely used techniques in dealing with limited datasets and are used in this work as well.
In addition to all the positive effects that the use of deep learning in the rail sector brings, the use of such algorithms for a safety-critical area such as rail traffic represents a major challenge. This includes, above all, the management of the training and test data used. Ensuring the consistency, accuracy, and completeness of the datasets used is of crucial importance for the generation of reliable neural networks [19].
In addition to the technical challenges mentioned above, there exist a number of regulatory challenges for the use of machine learning methods for safety-critical control tasks. For example, EN 50716:2023 [20] is a core standard for the use of software in the rail sector that defines four safety integrity levels (SILs) (i.e., SIL 1–4), as well as basic integrity (i.e., SIL 0), depending on the criticality of the application area and specifies corresponding safety requirements for the software. However, it currently only highlights the challenges regarding the use of software based on machine learning and does not yet contain any detailed and prescriptive safety requirements for such software. There are currently several standardization initiatives addressing this standardization gap that deal with the definition of standards for ML components, e.g., the standardization roadmap AI in Germany [21].
In order to meet the regulatory challenges for machine learning-based systems, Machida [22] proposes the use of the N-version programming approach. This approach is known in the safety sector. The basic idea is to have a function executed by several instances that are as diverse as possible. By comparing the results, errors in the individual instances can be detected and sometimes even masked. Machida proposes transferring this concept to machine learning-based systems, using multiple neural networks to implement a function instead of a single one. Other authors also support this theory and are researching the positive effects of N-version machine learning [23,24,25]. As previous results have shown, the increase in reliability that is possible with such an approach depends heavily on the diversity of the individual networks. There are different types of diversity [22], including using different sensors as model inputs, different datasets for model training, or different architectures of the neural networks.
Based on Machida’s concepts, we have already proposed a generic architecture for such ML-based N-version systems in an earlier work [26]. Building on this, this paper aims to implement a real N-version system with multiple neural networks for rail track detection using semantic segmentation as an instantiation of this generic architecture and to analyze the results in terms of detection quality and reliability.
Table 1 highlights the limitation of the existing N-version approaches presented, as well as the differences in our own work.
3. N-Version Approach for Semantic Segmentation
Semantic segmentation is a task in computer vision that involves assigning a semantic label to each pixel in an image. This can be useful for applications such as autonomous driving, where it is important to understand the scene and identify objects such as cars, pedestrians, and buildings [27]. In the following section, the concept for an N-version architecture for rail track detection using semantic segmentation will be explained.
3.1. N-Version Architecture
The basic principle of N-version architectures is to use multiple different implementations for the same functionality or task. This allows for error detection and thus increases the reliability and robustness of the overall system. Machida [22] proposes the use of such architectures for systems using neural networks. He proposes different N-version configurations, e.g., double-model single-input (DMSI) or triple-model triple-input (TMTI), and states that a high diversity in both models and sensor inputs allows for higher reliability. In one of our previous works [26], we proposed a generic NMXI architecture based on these variants, which allows almost arbitrary configurations of models and sensor inputs. In addition to the individual neural networks, we also consider different algorithms for combining the individual predictions. Alongside diversity, these algorithms also have a major impact on the performance and potential for error detection in the overall system.
For this work, we implemented a total of three neural networks for rail track detection and integrated them into an N-version architecture. The resulting architecture is shown in Figure 1 (architecture diagram) and is an instantiation of the generic NMXI architecture, which can be described as 3M1I, corresponding to the TMSI (triple-model single-input) from Machida’s paper, as all neural networks work on the same input images.
In addition, a major focus of the N-version architecture presented in this paper is the Confidence Score. This score is calculated for the individual neural networks as well as for the entire system and indicates how well the system estimates its own predictions. This can be seen in the figure by the confidence evaluation blocks that are present for each neural network, as well as the combination block.
Each element of the 3M1I architecture is described in detail in the following sections: the neural networks, the confidence calculation and the combination algorithms, as well as the dataset for training and evaluation.
3.2. Dataset
The basis for the dataset used in this paper is the RailSem19 [28] dataset, which is widely used for railway image segmentation in the literature. This dataset consists of 8,500 RGB images recorded from the train driver’s perspective. We included this dataset in our study for the following main reasons:
It includes scenes from various environments;
It contains relatively complex scenes (multiple tracks and switches per image);
It is a free dataset for public use.
The segmentation masks in this dataset contain labels for 21 classes. Of these 21 classes, only 2 are relevant for the rail track detection task. These two classes describe the metal rails (i.e., rail-raised) and the track bed (i.e., rail track). For our training, we combine these two classes into one class, the Track class, and all other classes into the Background class.
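The class merge described above can be sketched as a small relabeling step. This is only an illustration under stated assumptions: the function name `to_binary_mask`, the class-name spellings `rail-raised` and `rail-track`, and the example id-to-name mapping are hypothetical and must be replaced by the values from the dataset's own configuration.

```python
# Names of the source classes that are merged into the Track class.
# Assumption: the spellings must match the dataset's label configuration.
TRACK_CLASS_NAMES = {"rail-raised", "rail-track"}


def to_binary_mask(label_mask, id_to_name):
    """Map a multi-class label mask to Track (1) / Background (0).

    label_mask: 2D list of integer class ids (one id per pixel).
    id_to_name: dict from class id to class name (hypothetical here).
    """
    track_ids = {cid for cid, name in id_to_name.items()
                 if name in TRACK_CLASS_NAMES}
    return [[1 if px in track_ids else 0 for px in row]
            for row in label_mask]


# Hypothetical three-class excerpt, not the real 21-class list.
example_classes = {0: "road", 1: "rail-track", 2: "rail-raised"}
binary = to_binary_mask([[0, 1], [2, 0]], example_classes)
```

Note that this sketch only covers the merge of the two rail classes; restricting the masks to the ego-track, as done for the final dataset, is a separate labeling step.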
The dataset used in this paper is extended by our own dataset, in order to increase the number of images for training, validation and testing. This dataset is based on videos [29] from the Nordlandsbanen (NLB) track in Norway. These videos were recorded for documentation purposes in all four seasons and hence contain various weather and lighting conditions, which makes them a good source for diverse training data. From these recordings, single frames were extracted, for which the segmentation masks were manually created. In this way, 7,019 labeled mask–image pairs were generated. In addition to the track masks, information tags were labeled for each image. The tags describe key properties of the image, e.g., weather conditions, lighting conditions, time of day, etc., that can be used to study the influence of such situational parameters on the perception performance.
With both previously described datasets combined, a total of 15,519 images were used for training and evaluation. For all images, only the ego-track was considered for the segmentation masks. Thus, the neural networks were directly trained to predict the train’s path. The whole dataset was divided into training, validation, and test sets. Table 2 shows the corresponding distribution of the images across these datasets. For the test set, only images from our own dataset were considered. This decision was made because no information tag labels are available for the RailSem19 dataset. These information tags allow for filtering of the predictions on the test set, in order to determine weaknesses and strengths of the neural networks and of the combination algorithms, which is a feature we used in our experiments.
With 15,519 images, the dataset is comparatively small for neural network training on a semantic segmentation task. Because of this limited size, the models are prone to overfitting during training. For this reason, we use dropout layers as well as batch normalization layers in the models and train the models for a limited number of epochs.
3.3. Rail Track Detection Neural Networks
As mentioned earlier in this paper, a total of three neural networks were implemented for rail track detection. To comply with the N-version approach, all the networks are implemented for the same task: semantic segmentation of RGB camera images. All networks are trained using the dataset described in the last section, with images resized to 1920 × 1056 pixels.
In the following, all three neural networks are described in detail.
3.3.1. WCID
The first neural network is a version of a fully convolutional network (FCN) for semantic segmentation. This network was developed for lane detection in the automotive domain. It was presented in a paper by Michalke et al. entitled “Where can I drive? A Systems Approach: Deep Ego Corridor Estimation for Robust Automated Driving” [30]. Since the authors do not give an explicit name for their network architecture, it will be referred to as WCID (short for “where can I drive?”) in this paper.
This network is, compared to the other segmentation networks, a small and lightweight architecture. It has approximately 500,000 trainable weights. We use this network in our experiments to show the influence of such small and limited architectures on the prediction combinations and the overall prediction, as well as the confidence scores. Like most segmentation networks, the WCID network consists of an encoder for feature extraction and a decoder for segmentation mask generation. In this network, the encoder has nine convolutional layers, which are divided into a total of four blocks. Each block consists of two consecutive convolution layers, except for the second block, which consists of three convolution layers. Each convolution layer is followed by a ReLU activation layer. For spatial reduction of the image, a Max Pooling layer follows at the end of each block. The overall network is a comparatively small network with few parameters, as intended by the authors. To prevent overfitting, blocks two through four have a dropout layer with a dropout rate of 0.2 after each ReLU activation layer. The dropout rate was determined by empirical validation. There are no bottleneck layers between the encoder and decoder in this network. The layer structure of the decoder mirrors that of the encoder. However, instead of convolutional layers, transposed convolutional layers are used, and instead of the Max Pooling layer at the end of the blocks, there is an up-sampling layer at the beginning of the blocks.
We used binary cross-entropy (BCE) as the loss function for training. To optimize the weights during training, the Adam optimizer was used with a learning rate of 0.0001. Training was performed with a batch size of 2 for 100 epochs.
3.3.2. VGG16-UNet
The second neural network for rail track detection is a VGG16-UNet. This network is based on a UNet architecture [31]. This is a widely used and efficient segmentation architecture. This network also uses an encoder and a decoder. The UNet differs from other segmentation networks in that there are skip connections between convolutional layers in the encoder and the corresponding transposed convolutional layers in the decoder. These are used to better incorporate the spatial information of the learned features into the reconstruction of the segmentation map and to obtain a more accurate segmentation result [31]. Due to this feature of the network architecture, we selected this DNN for our experiments. For correct rail track detection, an accurate segmentation is necessary, especially in the more distant parts of the image. In those regions, the rail track area is usually small and hence harder to segment. The UNet architecture is well suited for this task. In this paper, however, we do not use the original UNet architecture of Ronneberger et al., but a modified variant. In the VGG16-UNet, the encoder of the UNet has been replaced by a VGG16 network. This is an efficient architecture for feature extraction that was originally developed for object classification [32]. This network consists of 16 layers, of which 13 are convolutional layers and 3 are fully connected layers. The UNet architecture is retained for the decoder. ReLU activation layers are also used for this network. For the implementation of this network, we used the code of Gupta [33].
To train the VGG16-UNet, binary cross-entropy is also used as the loss function. Stochastic Gradient Descent (SGD) was used as the optimizer for this training, with a learning rate of 0.0001. Training was performed with a batch size of 2 for 100 epochs.
3.3.3. MobileNet–SegNet
As a third neural network for rail track detection, we used a MobileNet–SegNet, which, like the VGG16-UNet, consists of a backbone network for feature extraction, i.e., the encoder, and a decoder network. The MobileNet architecture is used for the encoder part. This was described in a paper by Howard et al. [34] and was developed for computationally efficient image processing in mobile applications. It uses depthwise convolutions instead of conventional convolutional layers. The architecture consists of a convolutional layer followed by 13 depthwise convolutions, each of which is paired with another convolutional layer. Each of these layers is followed by a batch normalization layer and a ReLU activation layer.
The MobileNet–SegNet uses a SegNet architecture [35] as the decoder. Only the decoder part of this architecture is used, which consists of five blocks. Each block starts with an up-sampling layer, followed by a convolution layer and a batch normalization layer. We also used the Gupta implementation for the MobileNet–SegNet [33].
Since the computational resources available for rail track detection in a train are limited, we chose the presented MobileNet architecture for our experiments. This network architecture has been developed specifically for such mobile applications and is therefore well suited for our use case. The SegNet decoder part is intended to support the generation of more accurate segmentation masks. This network has been developed for segmentation tasks in road traffic and is therefore well suited for our use case.
The MobileNet–SegNet uses a similar training configuration to the VGG16-UNet. We trained the MobileNet–SegNet using the SGD optimizer and a learning rate of 0.0001. The loss function is also binary cross-entropy. The batch size is set to 8 for this neural network. In this configuration, training was performed for 100 epochs.
3.4. Confidence Evaluation
A crucial part of the N-version architecture is the confidence evaluation. As can be seen in Figure 1 (architecture diagram), there are two different confidence evaluations throughout the architecture. The first is the NN confidence evaluation, which represents the confidence calculation for each individual rail track detection prediction of the neural networks presented earlier. The other is the confidence combination, which uses the combined prediction and the individual predictions to compute an adjusted overall confidence for the system output.
Both confidence calculation methods are explained in more detail below.
3.4.1. Neural Network Confidence Evaluation
In the Neural Network Confidence Evaluation block of the N-version architecture, the confidence score for the prediction of a neural network is calculated. This confidence score is a form of self-assessment of the respective neural network and fundamentally describes how confident the network is in the respective prediction, i.e., how well the input–output pair could be represented by the learned training knowledge.
In contrast to object recognition in the form of classification, networks for semantic segmentation do not output this confidence score directly. Instead, the segmentation network calculates a probability that each pixel in the image belongs to a certain class. Hence, this output can also be referred to as a probability map.
For the confidence calculation, the pixels of the predicted probability map are assigned to either the Track class or the Background class, depending on their probability value. For this classification, the threshold value $\theta$ is used in the equations. This value is later determined experimentally to provide the best possible class division.
With $p_i$ as the probability of pixel $i$ belonging to the Track class, $N$ as the total number of pixels in an image, and $N_{T}$ and $N_{B}$ as the total number of pixels classified in the respective class, the confidence scores ($C_{T}$, $C_{B}$) for the Track and Background classes are calculated by the following equations:

$$C_{T} = \frac{1}{N_{T}} \sum_{i=1}^{N} \mathbb{1}(p_i > \theta) \, \frac{p_i - \theta}{1 - \theta} \qquad (1)$$

$$C_{B} = \frac{1}{N_{B}} \sum_{i=1}^{N} \mathbb{1}(p_i \le \theta) \, \frac{\theta - p_i}{\theta} \qquad (2)$$

In this calculation, the indicator function $\mathbb{1}(\cdot)$ is used to determine whether the probability of a pixel belonging to the Track class exceeds the specified threshold $\theta$. The indicator function returns a value of 1 if the condition is satisfied, and a value of 0 if it is not. Equation (1) calculates the confidence score of the Track class. This score is calculated using the average of the pixel probabilities of all pixels classified into the Track class based on the thresholding. Before calculating the average, every pixel probability is normalized to the range of probabilities for the Track class (i.e., the range above $\theta$). The higher this value, the more "confident" the neural network is in its prediction of the rail areas in the image.
However, a good prediction of such a segmentation network for rail track detection also requires that it can correctly distinguish the rails from the rest of the image. Therefore, in addition to a high probability for the rail pixels, it is also crucial that the background pixels have a low probability for the Track class. For this reason, Equation (2) calculates the confidence score for the Background class. This calculation also averages the pixel probabilities, which are normalized to the range of probabilities for the Background class. This time, however, this is applied for all pixels whose probability for the Track class is below the threshold $\theta$.
The overall confidence $C$ of the neural network’s prediction is calculated as the average of the two previously calculated confidence values:

$$C = \frac{C_{T} + C_{B}}{2} \qquad (3)$$
Only if the network can distinguish between the Background and Track classes with a high probability can it be assumed that the corresponding image is well represented by the training, and the network is therefore confident in its prediction.
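The following minimal sketch shows one reading of the confidence calculation described in this section. It assumes a linear normalization of each pixel probability to its class range (above the threshold for Track, at or below it for Background) and returns 0.0 for a class that receives no pixels. Both choices, like the function name `confidence_scores` and the default threshold of 0.5, are assumptions made for illustration and are not stated in the text.

```python
def confidence_scores(probs, theta=0.5):
    """Return (track, background, overall) confidence for one prediction.

    probs: flat iterable of per-pixel Track probabilities in [0, 1].
    theta: classification threshold, assumed to satisfy 0 < theta < 1.
    """
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")

    # Track pixels: rescale (theta, 1] to (0, 1].
    track = [(p - theta) / (1.0 - theta) for p in probs if p > theta]
    # Background pixels: rescale [0, theta] to [0, 1], where a Track
    # probability of 0 gives the highest Background confidence.
    background = [(theta - p) / theta for p in probs if p <= theta]

    # Assumption: a class without any pixels contributes 0.0.
    c_track = sum(track) / len(track) if track else 0.0
    c_background = sum(background) / len(background) if background else 0.0
    return c_track, c_background, (c_track + c_background) / 2.0


# Two pixels, one clearly Track and one clearly Background.
scores = confidence_scores([0.75, 0.25], theta=0.5)
```

With this reading, a prediction whose probabilities cluster near 0 and 1 scores close to 1, while a prediction whose probabilities hover around the threshold scores close to 0.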
With regard to the safety of the overall system, this value can be utilized to detect erroneous or unsafe predictions. The assumption behind this is that the neural networks are not confident in their predictions for situations that do not fit their learned patterns from training. This means that another threshold can be defined to distinguish between predictions that appear uncertain based on the confidence score and those that appear good enough. This confidence threshold can be used to implement a fail-safe behavior for the overall rail track detection system. For predictions with a confidence score below this threshold, the train needs to slow down or come to a stop, in order to implement a safe reaction to the uncertainties in the environment perception. The threshold will be determined in a later section.
3.4.2. Confidence Combination
Since the combined prediction of the N-version system is not structurally different from the individual predictions of the neural networks, it is natural to apply the previously presented confidence calculation to the combined prediction as well, resulting in the combination confidence score $C_{\mathrm{comb}}$. Thus, a rough measure of how confidently the rails were detected by the N-version system can also be found for the combined prediction. However, this confidence score by itself is not significantly more reliable than that of a single neural network, nor does it exploit the strength of such an N-version approach. Due to their structure and the enormous number of trainable parameters, such neural networks have a very high complexity compared to conventional algorithms. The basic idea of the N-version approach is to effectively exploit the complexity of the individual networks. If the predictions of several such very complex systems for complex scenes, such as in the railroad domain, agree, it can be assumed with a high probability that this result is correct.
To exploit this fact, we must find a way to measure the extent to which the individual networks agree on their predictions. The Intersection over Union (IoU) metric is suitable for this purpose. This metric is calculated according to Formula (4), where $P$ represents the predicted track pixels and $G$ represents the track pixels in the ground-truth annotation:

$$\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|} \qquad (4)$$

In relation to the terms of object detection and classification, this metric can also be defined through the true positives (TP), false positives (FP) and false negatives (FN):

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN} \qquad (5)$$

For the rest of this paper, $\mathrm{IoU}_{\mathrm{GT}}$ represents the Intersection over Union calculated with a prediction and the respective ground-truth annotation. $\mathrm{IoU}_{\mathrm{P}}$ will be used for the calculation of the IoU between two neural network predictions.
For the adjusted combined confidence score of the N-version system, the pairwise IoUs of all three network predictions are calculated first. Since the IoU calculation is symmetric, three calculations are sufficient in the 3M1I configuration. Then, the average of these three values is calculated to provide a measure of the agreement of the predictions. As a final step, the previously calculated confidence score for the combined prediction ($C_{\mathrm{comb}}$) is multiplied by this value. The calculation is shown in Equation (6), where $P_1$, $P_2$ and $P_3$ denote the predictions of the three networks:

$$C_{\mathrm{adj}} = C_{\mathrm{comb}} \cdot \frac{1}{3} \left( \mathrm{IoU}(P_1, P_2) + \mathrm{IoU}(P_1, P_3) + \mathrm{IoU}(P_2, P_3) \right) \qquad (6)$$

With this combined confidence score ($C_{\mathrm{adj}}$), a more accurate statement can be made about the quality of the overall result of the N-version system. Finally, this score can be used to detect situations where the neural networks do not provide sufficient recognition quality. Such detection allows the implementation of a fail-safe behavior for the operation of the ML-based system. For this purpose, another threshold must be defined to distinguish between sufficiently reliable and insufficient predictions. In the case of insufficient predictions, speed reduction of the autonomous train can be requested in order to allow for more reaction time in situations that are not sufficiently well perceived by the networks. This threshold value was introduced in Section 3.4 and is further discussed in Section 3.6.2.
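The agreement-weighted confidence described above can be sketched as follows. The helper names `iou`, `adjusted_confidence` and `needs_fail_safe`, the treatment of two empty masks as full agreement, and the example threshold of 0.7 are assumptions for illustration only; the text does not fix any of them.

```python
from itertools import combinations


def iou(mask_a, mask_b):
    """Intersection over Union of two flat binary masks (1 = Track)."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    # Assumption: two masks without any Track pixel agree completely.
    return inter / union if union else 1.0


def adjusted_confidence(combined_conf, masks):
    """Scale the combined confidence by the mean pairwise IoU.

    For three masks this evaluates the three unordered pairs, which is
    sufficient because IoU is symmetric.
    """
    pairs = list(combinations(masks, 2))
    agreement = sum(iou(a, b) for a, b in pairs) / len(pairs)
    return combined_conf * agreement


def needs_fail_safe(score, threshold):
    """True if the score is below the (hypothetical) confidence threshold."""
    return score < threshold


# Three thresholded predictions for a four-pixel image.
masks = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0]]
score = adjusted_confidence(0.9, masks)   # 0.9 * (1 + 0.5 + 0.5) / 3
slow_down = needs_fail_safe(score, threshold=0.7)
```

Multiplying by the mean pairwise IoU means that disagreement between the networks lowers the system confidence even when the combined prediction itself looks confident.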
3.5. Combination Algorithms
For our railway segmentation N-version system, we implemented two different basic combination algorithms. This allows us to gain insights into the significance and potential of such combination algorithms for system reliability. The following section explains both approaches.
3.5.1. Maximum Confidence Voting (MCV)
The main idea for this method is to ensure that the best of the neural network predictions is always forwarded as the system’s output. In order to achieve this, the combination algorithm utilizes a form of voting to generate the overall system output. A weight is calculated for each of the predictions, which should describe the quality of the prediction as well as possible. Based on these weights, the algorithm selects the prediction with the highest weight, i.e., the best quality. In our work, we use the Confidence Score (explained in Section 3.4) as a simple way of assessing this quality. Hence, the MCV algorithm always selects the individual neural network prediction with the highest confidence as the combined output.
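As a minimal sketch, the MCV selection reduces to an argmax over the per-network confidence scores. The function name `maximum_confidence_voting` and the tie-breaking rule (the first network with the highest score wins) are assumptions; the text does not specify how ties are resolved.

```python
def maximum_confidence_voting(predictions, confidences):
    """Return (index, prediction) of the most confident network.

    predictions: list of per-network predictions (any type).
    confidences: list of per-network confidence scores, same order.
    """
    if not predictions or len(predictions) != len(confidences):
        raise ValueError("need one confidence score per prediction")
    # max() keeps the first maximal index, so ties go to the earlier network.
    best = max(range(len(predictions)), key=lambda k: confidences[k])
    return best, predictions[best]


# Placeholder predictions; in practice these would be probability maps.
index, output = maximum_confidence_voting(
    ["pred_wcid", "pred_unet", "pred_segnet"], [0.62, 0.91, 0.88]
)
```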
3.5.2. Pixel Majority Voting (PMV)
As a second combination algorithm, we implemented a different form of voting. In contrast to the first algorithm, this method, which will be called Pixel Majority Voting (PMV), is not about selecting the best of the three predictions. Instead, the goal is to achieve the best possible combination of all three predictions. For this purpose, voting is also performed in this method, but this time at the pixel level. Since there are three predictions for the same scene, majority voting can be implemented. This means that for each pixel in all three predictions, it is evaluated whether this pixel belongs to the Track or Background class. With
as the probability of pixel
i in prediction
j belonging to the Track class and
as the probability threshold for classification as part of the Track (>
) or Background (
) class, the voting is implemented by the following equation:
Based on the indicator function, pixel i in the combined prediction is assigned the respective predicted by at least two of the networks.
The combined prediction should have the same form as the individual network predictions, so that the same confidence calculation methods can be applied to it. Assigning only a class label to each pixel is therefore not sufficient; a combined probability value also needs to be calculated for every pixel. This value is the average probability of all predictions that voted for the selected (“winner”) class. Equations (8) and (9) show these calculations for the respective classes.
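The following sketch illustrates the pixel-level voting and the averaging of the winning probabilities. It is not the original implementation: images are represented as flattened lists of Track probabilities, the function name `pixel_majority_voting` is hypothetical, and it is assumed that the combined value for a Background pixel is the average Track probability of the networks that voted for Background.

```python
def pixel_majority_voting(predictions, theta=0.46):
    """Combine per-pixel Track probabilities of several networks.

    predictions: list of equally long probability lists (flattened images),
                 one list per network.
    theta:       class threshold; a pixel votes Track if its probability > theta.
    Returns (labels, probabilities) of the combined prediction.
    """
    n_networks = len(predictions)
    labels, combined = [], []
    for pixel_probs in zip(*predictions):
        track_votes = [p for p in pixel_probs if p > theta]
        background_votes = [p for p in pixel_probs if p <= theta]
        # Majority decision: with three networks, at least two Track votes.
        is_track = len(track_votes) > n_networks / 2
        winners = track_votes if is_track else background_votes
        labels.append("Track" if is_track else "Background")
        # Average only over the predictions that voted for the winner class.
        combined.append(sum(winners) / len(winners))
    return labels, combined


# Two pixels, three networks: the first pixel is voted Track (2 of 3),
# the second pixel is voted Background (2 of 3).
labels, probs = pixel_majority_voting([[0.9, 0.2], [0.8, 0.7], [0.1, 0.3]])
```

Because the outvoted network is excluded from the average, a single erroneous prediction neither changes the class of a pixel nor pulls its combined probability towards the threshold.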
3.6. Threshold Optimization
The previously described N-version architecture for rail track detection utilizes two threshold values, $\theta$ and $\gamma$. The relation between both thresholds and the introduced terminology is illustrated in Figure 2. Both values need to be optimized for optimal system performance and failure detection. The optimization process for each of the threshold values is described in the following subsections.
3.6.1. Theta ($\theta$)
The threshold $\theta$ is an important property of the implemented N-version architecture. It is the basis for the correct classification of the classes in the predictions as well as for the calculation of the confidence. The goal of this section is to determine an optimal value for this threshold for the N-version architecture presented in the previous sections.
An obvious assumption for the $\theta$ threshold value would be a limit of 0.5, i.e., 50% confidence. Nevertheless, an investigation was first carried out in this paper to determine the actual optimal threshold value. For this purpose, thresholds between 0 and 1 (i.e., 0% to 100%) were iterated in steps of 0.01 for the used test set (2,328 images), and the average IoU for the predictions of each network was calculated based on the threshold. The result of this analysis is shown graphically in Figure 3 ($\theta$ optimization graph).
In this analysis, it can be seen that the graphs of the three neural networks (shown by the solid lines) have a very similar shape. In the middle value range of thresholds (between 0.2 and 0.8), there is a flat curve with high IoU values. At the edges, on the other hand, there is a significant drop in the average IoU values. It is also noticeable that the curves for the VGG16-UNet (black) and the MobileNet–SegNet (gray) are almost identical, with very high IoU values of about 0.95. Even though the WCID network shows a similar curvature at the edges of the graph, the middle part of its curve lies below the other two and, unlike the other networks, shows no distinctive plateau in the center of the graph. Therefore, it can be concluded that the WCID network provides slightly worse detection results than the VGG16-UNet and the MobileNet–SegNet, but overall, all three networks show good detection performance. The drop in the IoU curves for very high and very low thresholds also shows that the neural networks have learned a robust distinction between the Track and Background classes through training. Most pixels seem to have either a very high or a very low probability for the Track class in the predictions, because only at these extreme thresholds does a small change in the threshold cause a large change in the IoU values, i.e., in the correct classification of the pixels.
Because of the large flat areas in the middle of the graph, it is difficult to determine an optimal value that will produce the highest average IoU values. In order to determine an optimal threshold, the average of the three neural networks is calculated and shown as the dashed line in the figure. The maximum of this curve is at a threshold of 0.46. At this point, all three individual networks together provide the best predictions. Therefore, for analyses in this paper, a threshold of 0.46 is used to distinguish between the Track and Background classes.
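The threshold sweep described above can be sketched as follows. The sketch makes several assumptions that are not taken from the original implementation: the IoU is computed for the Track class only, images are flattened lists of probabilities, an empty union counts as an IoU of 1.0, and the function names `track_iou` and `sweep_theta` are hypothetical.

```python
def track_iou(probs, truth, theta):
    """IoU of the Track class for one image at class threshold theta."""
    predicted = [p > theta for p in probs]
    intersection = sum(1 for p, t in zip(predicted, truth) if p and t)
    union = sum(1 for p, t in zip(predicted, truth) if p or t)
    return intersection / union if union else 1.0  # assumption for empty masks


def sweep_theta(network_outputs, truths, steps=100):
    """Return (theta, score) maximizing the IoU averaged over all networks.

    network_outputs[n][k]: probability list of network n for image k.
    truths[k]:             ground-truth mask (1 = Track, 0 = Background).
    """
    best_theta, best_score = None, -1.0
    for step in range(steps + 1):
        theta = step / steps  # 0.00, 0.01, ..., 1.00 for steps=100
        per_network = []
        for outputs in network_outputs:
            ious = [track_iou(p, t, theta) for p, t in zip(outputs, truths)]
            per_network.append(sum(ious) / len(ious))
        score = sum(per_network) / len(per_network)  # the averaged curve
        if score > best_score:  # ties keep the lowest threshold
            best_theta, best_score = theta, score
    return best_theta, best_score


# Toy data: one image with four pixels and two networks.
truths = [[1, 1, 0, 0]]
outputs = [[[0.9, 0.6, 0.4, 0.1]], [[0.8, 0.7, 0.3, 0.2]]]
theta, score = sweep_theta(outputs, truths)
```

On a plateau such as the one in Figure 3, many thresholds yield nearly the same score, so the selected maximum is sensitive to small changes in the data; the tie-breaking rule in this sketch is an arbitrary choice.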
3.6.2. Gamma ($\gamma$)
As introduced earlier, the detection of the train’s path is a safety-critical function in an autonomous train. With the computed confidence of the combined prediction, the confidence of the N-version architecture in each of its predictions can be analyzed. This confidence value is the only way to draw conclusions about the quality of the predictions while the train is moving, without the availability of ground-truth data to verify the predictions. In order to effectively identify unsafe predictions during operation with the N-version architecture presented earlier, it is therefore necessary to distinguish between supposedly safe and unsafe predictions based on this combined confidence value. The threshold $\gamma$ is used for this purpose and is hence an essential parameter of the N-version architecture. The goal of this section is to find the best possible value for this threshold.
As a parameter of the N-version architecture, the threshold $\gamma$ has a direct impact on two system attributes that are directly related to the dependability of the system. First, by detecting and filtering out low-confidence predictions based on this threshold, the safety of the predictions, and thus of the system itself, can potentially be increased. The predictions that are above the threshold, and are therefore marked as valid, have a higher probability of not containing safety-critical errors, so the safety of the system is expected to improve. On the other hand, the second system property, availability, is reduced by filtering out unsafe predictions. The higher the confidence threshold is set, the more predictions are classified as invalid, and the system cannot perform its specified function in these situations. The higher the rate of excluded predictions, the lower the overall availability of the system.
Therefore, when optimizing the $\gamma$ threshold as a system parameter, it is necessary to achieve the best possible trade-off between increasing the safety of the predictions and reducing the availability of the system. Such an investigation has been carried out for the presented N-version architecture. The influence of the confidence threshold was analyzed separately for both combination algorithms (i.e., MCV and PMV). The results are shown in Figure 4 ($\gamma$ optimization graphs).
In the two graphs, confidence thresholds in the range of 0.0 to 1.0 were examined separately for each combination algorithm (MCV on the left and PMV on the right) in increments of 0.01 (i.e., the x-axis). Since each graph shows both the safety and the availability of the system, two y-axes are used. The average IoU of all predictions in the test set whose confidence is above the respective threshold is shown on the left y-axis as the solid curve. This metric describes the quality of the predictions, but does not allow direct conclusions about safety-critical errors in the predictions. A suitable metric to describe such safety-critical errors is, for example, the Track Center Offset Metric (TCOM) [36]. For this metric, however, the distance of the predicted ego-track to the ground-truth ego-track must be calculated. For single RGB images, such distance calculations are not easily possible. Therefore, the analysis performed is based on the assumption that if the quality of the predictions is high (i.e., a high IoU), there will be fewer safety-critical errors in the predictions. As a measure of availability, we use the ratio of the number of images that are not filtered out at the respective threshold to the total number of images in the test set used. These values are shown on the right y-axis as the dashed curve.
The two graphs are almost identical. As expected, more and more predictions are filtered out as invalid as the confidence threshold increases. Especially for high thresholds above 0.8, the availability drops rapidly to 0.0, which indicates that a large part of the predictions lies in this confidence range. It can also be seen that the $\gamma$ threshold has a positive effect on the quality of the predictions: the higher the threshold value, the more predictions are classified as invalid, which increases the average quality of the remaining predictions and thus presumably the safety. This relationship is almost linear for both combination algorithms, with a positive but very small slope. To find the optimal value for the confidence threshold, the intersection of the two curves is determined. At this point, safety and availability are simultaneously as high as possible, which leads to the best system performance. This intersection is at 0.56 for both MCV and PMV. This value for $\gamma$ is used in Section 4.
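The trade-off analysis can be sketched as follows. This is an illustrative reconstruction under stated assumptions: each test image is represented by a pair of combined confidence and IoU, availability is the fraction of images whose confidence is above the threshold, both curves are compared on the same 0-to-1 scale, and the function names `sweep_gamma` and `crossing_point` are hypothetical.

```python
def sweep_gamma(samples, steps=100):
    """Compute (gamma, mean IoU of kept images, availability) per threshold.

    samples: list of (combined_confidence, iou) pairs, one per test image.
    """
    curve = []
    for step in range(steps + 1):
        gamma = step / steps
        kept = [iou for confidence, iou in samples if confidence > gamma]
        if not kept:
            break  # no prediction passes; the mean IoU is undefined from here on
        availability = len(kept) / len(samples)
        mean_iou = sum(kept) / len(kept)
        curve.append((gamma, mean_iou, availability))
    return curve


def crossing_point(curve):
    """Return the curve entry where mean IoU and availability are closest."""
    return min(curve, key=lambda row: abs(row[1] - row[2]))


# Toy data: (confidence, IoU) for four images.
samples = [(0.9, 0.95), (0.8, 0.90), (0.6, 0.70), (0.3, 0.40)]
gamma, mean_iou, availability = crossing_point(sweep_gamma(samples))
```

Note that the position of the crossing depends on how the two quantities are scaled against each other, so the result of such a sketch is only comparable to Figure 4 if the same axis scaling is used.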
5. Discussion
In this work, an N-version architecture for rail track detection using neural networks was investigated. The results show that the three networks used achieved different but overall good training performances. In particular, the WCID network, which was deliberately chosen as a small model, showed a lower performance compared to the other two architectures, as expected. Nevertheless, all models achieved very good results in terms of rail track detection, measured by the IoU.
Despite the high overall detection performance, it was also possible to identify some test images where the quality of the predictions was insufficient. This underscores the weaknesses of neural networks discussed in
Section 2, especially in safety-critical applications. It was also noticeable that high confidence values often occurred even with inaccurate predictions. This shows that the confidence values of individual networks alone are not sufficient to reliably assess the safety of the segmentation results.
As the experimental results show, the used N-version architecture helps to improve the safety of the predictions. While the MCV algorithm reliably selected the prediction of the best single network in the majority of cases, the PMV algorithm achieved a higher prediction quality than the best single network for many images. A detailed analysis of the test results also revealed several cases where the PMV algorithm successfully corrected the errors of the individual DNNs. This was demonstrated in
Section 4.2 using a test image as an example. This showed that both the MCV and PMV algorithms are suitable methods for combining the individual networks in the N-version architecture.
In contrast to related N-version approaches from the literature [
23,
24,
25], in our work, we implemented a complete concrete N-version approach for a real and safety-critical use case. Furthermore, we implemented two different combination algorithms and were able to investigate the influence of such algorithms. In addition, we developed a confidence computation method adapted to the use case, which can be used as a safety indicator at runtime.
Applying N-version architectures in ML-enabled operational rail systems is a promising path towards certification of these systems. Current railway safety standards such as EN50716:2023 [
20] acknowledge the importance of machine learning technology for future railway applications, but also highlight the challenges associated with this technology and preconditions for their use in safety-critical contexts. As shown in safety-critical applications in other sectors, N-version architectures can help to address some of these points, similar to the application of N-version architectures when implementing traditional software-enabled systems satisfying very high safety integrity levels [
38]. To pave the way towards certifying ML-enabled N-version systems, further work is required to achieve and demonstrate basic safety integrity levels for the individual neural networks, and to better understand the behavioral diversity of the networks that is key for fault detection and fault masking.
Despite the promising results, there are some limitations that should be addressed in future work. A key point is the composition of the test dataset, which mainly contains images of low complexity. To further improve the validity of the results, future studies should include more complex scenarios.
So far, the diversity of the implemented N-version architecture is based solely on model diversity. The results show that this measure can already lead to first improvements regarding the prediction safety. However, to realize the full potential of the approach, other sources of diversity should be included. These include training on different datasets to ensure greater generalization capability, using different cameras or sensor technologies to achieve robustness to different environmental conditions, testing additional combination algorithms that may enable better prediction fusions, as well as using alternative methods for confidence calculation and plausibility checks to better quantify the quality and safety of the predictions.
Furthermore, it can be seen that the Intersection over Union (IoU) is not sufficient as a sole evaluation metric for rail track detection. Especially in real-world applications, a more accurate evaluation of the centerline error is important. Only such metrics allow the evaluation of safety-critical failures in the predictions, in contrast to a pure prediction quality evaluation using the IoU. In addition, statistical significance testing for the IoU improvements is important and needs to be performed in the future.
For use in an operational environment, it would also be necessary to pay attention to the real-time performance of the ML-enabled system, especially in light of the typically limited hardware resources of on-board computing systems. Whilst the adaptation of ML network architectures and algorithms to limited hardware resources may lead to degraded inference performance, N-version architectures can help to counterbalance these drawbacks by combining the results of the real-time-capable networks. Assuming that these challenges are met, the approach can be transferred to other domains with similar technical challenges and certification environments, such as aerospace, automotive, and industrial robotics.
When developing an ML-based N-version architecture, the increased effort for selecting and training the neural networks must also be taken into account. Compared to a single network, the time and computational resources required increase with each additional network in the N-version architecture. Furthermore, the achieved model diversity is strongly dependent on the type of training data and the selected model architectures. It must be further examined how stable the advantages of the approach are under other conditions or on other datasets.
Our results raise several interesting topics for future work. A more detailed investigation of the influence of specific diversity criteria in the N-version architecture may provide further clarity on the advantages of this approach. An integration of explainable ML components that make the differences between the models interpretable also seems promising. Finally, an investigation of how N-version learning can be integrated into formal safety cases is also necessary in the future.