1. Introduction
Railways are among the busiest modes of global transportation and are essential for efficiently moving people, freight, and containers across regions. They frequently require rapid and consistent maintenance to ensure uninterrupted operations, resulting in substantial maintenance costs, time investment, and reliance on manual labor [1,2,3]. Automated monitoring systems can address these challenges by offering a viable and rapid solution while also reducing the reliance on manual labor and the associated costs. Such systems necessitate an accurate understanding of 3D scenes, which is typically achieved by utilizing deep learning models for point cloud segmentation [4,5]—a task that involves classifying each point into its corresponding category, thereby dividing the point cloud into semantically meaningful regions. Most of these models are constrained by the requirement of large-scale annotated data for training. Creating such annotations for railway point cloud data is particularly challenging as it involves cumbersome manual labor and is time-consuming, expensive, and error-prone due to the intricate geometrical structures of the objects involved. Furthermore, their performance is typically evaluated on a test set that is identically and independently distributed (i.i.d.) relative to the training set, often causing overestimation of their actual effectiveness.
The i.i.d. assumption is often violated in real-world applications, potentially leading to catastrophic outcomes in safety-critical systems such as railway monitoring, where accuracy in decision-making is of paramount importance. Gawlikowski et al. [6] attributed the failure of deep learning models in real-world deployment for safety-critical systems to their inability to distinguish between in-domain (ID) and out-of-distribution (OOD) samples, their sensitivity to distributional shifts, and their lack of reliable uncertainty estimation. Railway monitoring presents a complex and dynamic landscape, often resulting in distributional shifts. For example, maintenance activities can introduce new types of infrastructure as part of an upgrade process, leading to the emergence of novel classes not encountered during the model’s training phase. Additionally, timely detection and removal of vegetation are crucial for ensuring smooth operations, but vegetation characteristics vary significantly across regions. Furthermore, railway environments differ substantially based on geographical location, encompassing variations in infrastructure types, weather conditions, and environmental settings, such as rural versus urban areas. These scenarios, characterized by significant variations between training and test conditions [7,8,9], are referred to as OOD scenarios. A model designed for railway monitoring is expected to generalize effectively under such OOD conditions. However, in practice, this often results in unreliable predictions [6,10]. Consequently, assessing a model’s generalization capability—both in terms of performance and uncertainty in its decisions—becomes a cornerstone for model selection and reliable deployment.
The scarcity of large-scale point-wise annotations for 3D point clouds and the critical need for effective generalization motivate this paper's exploration of few-shot learning (FSL) [11,12,13]—a relatively new deep learning paradigm that aims to develop models capable of generalizing to unseen novel classes using only a minimal amount of labeled data. In recent years, FSL has demonstrated significant potential in the image and text domains [14,15]. However, its application to 3D point clouds remains relatively limited. Although some works have explored FSL for point cloud segmentation in indoor environments [16,17,18,19,20,21,22,23,24,25,26,27], its use in railway environments is still largely under-explored. In contrast to controlled indoor environments with generally dense point clouds, railway environments present far more challenging scenarios: inherent environmental and sensor noise, unintentional occlusions, variations in lighting conditions, and significant variations in geometric shapes within the same class (e.g., vegetation). Long-range LiDAR scanners used for railway monitoring often collect data over extended distances, resulting in sparse points in many regions of the point cloud compared to the shorter-range, high-resolution scanning typical of indoor environments. Additionally, the railway environment contains fewer features, but these features are much larger and more spread out than the smaller, more numerous objects found indoors.
This paper investigates the generalization capability of FSL under real-world distributional shifts encountered in railway monitoring systems, contributing to a deeper understanding of model effectiveness in decision-making for safety-critical applications. To achieve this, we formalize three types of distributional shifts: (a) ID shift, (b) in-domain OOD shift, and (c) cross-domain OOD shift, further detailed in Section 2.2. A systematic evaluation is conducted using three performance metrics, along with a predictive uncertainty estimation metric to assess model sensitivity by quantifying prediction uncertainties. As demonstrated through experimental validation, this evaluation provides valuable insights into model development, selection, and deployment for reliable railway monitoring systems. Although prior studies have separately explored distributional shifts and uncertainty estimation in point cloud segmentation, to the best of our knowledge, this is the first work to comprehensively evaluate few-shot segmentation for railroad monitoring under real-world distributional shifts.
The structure of the paper is as follows: Section 2 presents a comprehensive review of the related works, providing background, context, and the problem setup. Section 3 outlines the materials and methods used in the study. Section 4 details the experimental setup. Section 5 delves into an in-depth analysis of the results, examining the key findings. Section 6 offers a discussion of the implications and relevance of the study’s outcomes. Section 8 presents the conclusion.
4. Experiments
Our experiments evaluated the generalization capability of the FSL model under the distributional shifts formulated in Section 2.2 for 3D point cloud segmentation in railroads. The experimental setup for distributional shifts, including data splits and class configurations for training and evaluation, is presented in Table 1.
First, we assessed the model under a no-shift scenario, which replicates the ID shift setup without introducing additional noise to the evaluation set. This scenario serves as the baseline for comparison with distributional shifts. For ID shifts, the trained model is evaluated on the evaluation split of both datasets, incorporating jitter, mirroring, and rotation (Equation (1)).
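The jitter, mirroring, and rotation perturbations used for the ID shift can be sketched as follows; the noise scale, clip value, and axis conventions are illustrative assumptions, not the paper's exact parameters:

```python
import numpy as np

def apply_id_shift(points, sigma=0.01, clip=0.05, rng=None):
    """Apply jitter, mirroring, and rotation to an (N, 3) point block.

    sigma, clip, and the axis choices are illustrative assumptions.
    """
    if rng is None:
        rng = np.random.default_rng()
    pts = points.copy()
    # Jitter: clipped Gaussian noise added per point.
    pts += np.clip(rng.normal(0.0, sigma, pts.shape), -clip, clip)
    # Mirroring: randomly flip the x axis.
    if rng.random() < 0.5:
        pts[:, 0] = -pts[:, 0]
    # Rotation: random rotation about the vertical (z) axis.
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return pts @ rot.T
```

Mirroring and z-rotation preserve point heights, so the perturbed blocks remain geometrically plausible railway scenes.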
For the in-domain OOD shift, the model was evaluated on two class-based configurations, {pole, vegetation} and {cable, support device}, for both datasets, while the remaining classes were used during the training phase. For the cross-domain OOD shift, the model was trained on the WHU-Railway3D dataset and evaluated on the Infrabel-5 Railroad Segmentation dataset and vice versa. Although both datasets share the same classes—pole, vegetation, ground, cable, and support device—in our experimental setup, they differ significantly in terms of distributions. We do not include the clutter class from either dataset in FSL training.
4.1. Evaluation Metrics
Performance metrics. Three metrics, mean Intersection over Union (mIoU), Overall Accuracy (OA), and Matthews Correlation Coefficient (MCC), are employed to assess the model’s performance. Given the four categories of the confusion matrix—true positive (TP), false positive (FP), true negative (TN), and false negative (FN)—these metrics are defined as follows:
mIoU: mIoU measures how well each class is segmented by measuring the overlap between predicted and ground-truth points for each class:

$$\mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^{C}\frac{TP_c}{TP_c + FP_c + FN_c},$$

where $C$ is the number of classes. Note that we ignore the background class in our evaluation.
OA: OA measures the ratio of correctly classified points to the total number of points, regardless of class, thereby reflecting the overall correctness of the model’s predictions:

$$\mathrm{OA} = \frac{TP + TN}{TP + FP + TN + FN}.$$
MCC: In a class-imbalance scenario, OA can present overly optimistic model performance evaluations. In contrast, MCC provides a more equitable assessment of classification models. It yields a high score only when the predictions are accurate across all four categories of the confusion matrix—TP, FP, TN, and FN—while also considering the relative sizes of both positive and negative classes within the dataset:

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}.$$

This makes MCC a robust metric for performance evaluation, particularly on imbalanced datasets. It ranges from −1, indicating total disagreement, to 1, representing perfect prediction.
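As an illustration, all three performance metrics can be computed from a per-class confusion matrix as sketched below. For multi-class tasks we use the generalized (Gorodkin) form of MCC, which reduces to the usual binary formula when there are two classes; this is a generic sketch, not the paper's evaluation code:

```python
import numpy as np

def segmentation_metrics(conf):
    """Compute (mIoU, OA, MCC) from a (C, C) confusion matrix
    (rows: ground truth, columns: predictions)."""
    conf = np.asarray(conf, dtype=np.float64)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp            # predicted as c but not c
    fn = conf.sum(axis=1) - tp            # truly c but missed
    iou = tp / np.maximum(tp + fp + fn, 1e-12)
    miou = iou.mean()
    oa = tp.sum() / conf.sum()
    # Multi-class MCC (Gorodkin's R_K); equals the binary MCC for C = 2.
    t = conf.sum(axis=1)                  # ground-truth counts per class
    p = conf.sum(axis=0)                  # predicted counts per class
    n = conf.sum()
    cov = tp.sum() * n - t @ p
    denom = np.sqrt(n**2 - p @ p) * np.sqrt(n**2 - t @ t)
    mcc = cov / denom if denom > 0 else 0.0
    return miou, oa, mcc
```

A trivial majority classifier scores MCC = 0 here even when its OA is high, which is exactly the failure mode discussed in Section 6.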
Uncertainty metric. Entropy quantifies the model’s performance in terms of probabilistic predictions, providing insights into the model’s uncertainty in its predictions. Given a point, entropy is defined as

$$H = -\sum_{c=1}^{C} p_c \log p_c,$$

where $p_c$ denotes the probability that the point belongs to class $c$. We report the mean entropy (mH), the average entropy across all points in the evaluation set. A lower mean entropy value indicates higher model confidence in its predictions.
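A minimal sketch of the mean-entropy metric, assuming per-point softmax probabilities are available as an (N, C) array:

```python
import numpy as np

def mean_entropy(probs, eps=1e-12):
    """Mean predictive entropy (in nats) over all points.

    probs: (N, C) array of per-point class probabilities (rows sum to 1).
    """
    p = np.clip(np.asarray(probs, dtype=np.float64), eps, 1.0)
    h = -(p * np.log(p)).sum(axis=1)      # per-point entropy
    return h.mean()
```

Uniform predictions give the maximum value log C, while confident one-hot predictions give values near zero.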
4.2. Implementation Details
During the pretraining phase, the model is trained for 200 epochs on the WHU-Railway3D dataset using the Adam optimizer with a learning rate of
and L2 regularization of
. The learning rate is reduced by a factor of 0.5 after 100 epochs. A batch size of 16 is used. For episodic training in FSL, 40K episodic tasks were sampled from
. The Adam optimizer was employed with a learning rate of
, which was halved every 5000 episodes. The trained model was evaluated on 1500 test tasks sampled from
. An NVIDIA Tesla V100 GPU with 32 GB of memory was used for training. The code was implemented in PyTorch and, along with the pretrained model, is publicly available at
https://gitlab.kuleuven.be/eavise/point-clouds/dist_shift_railroads_fsl.git (accessed on 1 February 2025).
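The step decay used for episodic training can be expressed as a small helper; the base learning rate here is an illustrative assumption, since only the halving schedule is being sketched:

```python
def stepped_lr(episode, base_lr=1e-3, step_every=5000, gamma=0.5):
    """Learning rate after `episode` episodic updates, halved every
    5000 episodes as described above. base_lr = 1e-3 is an assumed
    placeholder, not the paper's exact value. Equivalent in effect to
    torch.optim.lr_scheduler.StepLR(optimizer, step_size=5000, gamma=0.5).
    """
    return base_lr * gamma ** (episode // step_every)
```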
To efficiently utilize limited computational resources, we adopted the data preprocessing strategy from PointNet, which employs a 3D sliding window of size $x \times y \times z$ to subdivide a large point cloud into smaller cubic blocks. This sampling approach has also been utilized in PointNet++, DGCNN, and many others. In our case, the x and y dimensions are fixed, and z equals the maximum height (in meters) recorded in the point cloud. This process captures points within cubic blocks, from which we randomly sample 2048 points as input to the network. We also conducted preliminary experiments using 1024 and 4096 points. Using 2048 points enabled the model to capture more representative features compared to 1024. However, increasing the number of points to 4096 did not yield significant performance benefits. Thus, 2048 points provided a favorable trade-off between computational efficiency and model performance.
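A simplified sketch of this block-based preprocessing, assuming a 2 m × 2 m window footprint (an illustrative value, not the paper's exact dimensions) and sampling with replacement for sparse blocks:

```python
import numpy as np

def sample_blocks(points, block_xy=(2.0, 2.0), n_points=2048, rng=None):
    """Slide a window over the x-y plane and randomly sample a fixed
    number of points per block (PointNet-style preprocessing).

    points: (N, 3) array; z always spans the full height of the cloud.
    """
    if rng is None:
        rng = np.random.default_rng()
    bx, by = block_xy
    mins, maxs = points[:, :2].min(0), points[:, :2].max(0)
    blocks = []
    x = mins[0]
    while x < maxs[0]:
        y = mins[1]
        while y < maxs[1]:
            mask = ((points[:, 0] >= x) & (points[:, 0] < x + bx) &
                    (points[:, 1] >= y) & (points[:, 1] < y + by))
            idx = np.flatnonzero(mask)
            if idx.size:
                # Sample with replacement when a block holds fewer points.
                choice = rng.choice(idx, n_points, replace=idx.size < n_points)
                blocks.append(points[choice])
            y += by
        x += bx
    return blocks  # list of (n_points, 3) arrays
```

Sampling with replacement in sparse regions keeps every block at a fixed input size, which long-range railway scans make far more common than in dense indoor scans.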
5. Experimental Results and Analysis
We conducted a comprehensive evaluation of our model across four scenarios: no-shift, ID shift, in-domain OOD shift, and cross-domain OOD shift, with no-shift serving as the baseline. The model performance is quantified using the following performance metrics: mIoU, OA, and MCC. Additionally, the model’s predictive uncertainty is assessed using mH. The average performance over 1500 test tasks is reported as the experimental results, implementing one-way twenty-shot tasks (i.e., $N=1$, $K=20$) for no-shift and ID shift. For the in-domain OOD shift and cross-domain OOD shift scenarios, the results are reported for one-way (i.e., $N=1$) tasks with five, ten, and twenty shots (i.e., $K \in \{5, 10, 20\}$). Additionally, in the cross-domain OOD shift scenario, the results are also reported for a five-way task setting (i.e., $N=5$) with five, ten, and twenty shots. A one-way task corresponds to a binary segmentation problem involving a single foreground class, while a five-way task represents a multi-class segmentation problem involving five foreground classes.
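Episodic evaluation over N-way K-shot tasks can be sketched as follows; the block representation, query count, and function names are illustrative assumptions, not the paper's implementation:

```python
import random

def sample_episode(blocks_by_class, n_way, k_shot, n_query=1, rng=None):
    """Build one N-way K-shot task: for each of n_way sampled classes,
    draw k_shot labelled support blocks and n_query query blocks.

    blocks_by_class: dict mapping class name -> list of point blocks.
    """
    if rng is None:
        rng = random.Random()
    classes = rng.sample(sorted(blocks_by_class), n_way)
    support, query = {}, {}
    for c in classes:
        picks = rng.sample(blocks_by_class[c], k_shot + n_query)
        support[c], query[c] = picks[:k_shot], picks[k_shot:]
    return support, query
```

Averaging a metric over many such sampled tasks (1500 at test time in our setup) yields the per-scenario numbers reported in the tables.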
5.1. Quantitative Results
Table 2 presents the experimental results for the ID shift using the Infrabel-5 Railroad Segmentation dataset and the WHU-Railway3D dataset compared against the baselines under the no-shift scenario. The results indicate that the baselines achieved the highest performance and the lowest predictive uncertainty, with mIoU values of 89.1% and 83% and OA values of 92.4% and 89% for the Infrabel-5 Railroad Segmentation dataset and the WHU-Railway3D dataset, respectively. Under the ID shift scenario, with the application of jitter, mirroring, and rotation, the model’s performance remained comparable to the baselines, deviating by ~1% across all the metrics. These findings emphasize the strong generalization capability of FSL under minimal variations in evaluation conditions.
Table 3 and Table 4 summarize the experimental results for the in-domain OOD shift using one-way tasks. Specifically, Table 3 reports the results for two evaluation settings with previously unseen classes from the Infrabel-5 Railroad Segmentation dataset: {pole, vegetation} and {cable, support device}. In the {pole, vegetation} setting, the model achieved its best performance on twenty-shot tasks with 64.2% mIoU and 74.4% OA, representing ~7.5% mIoU and ~5.5% OA improvements over the second-best performance in ten-shot tasks. Conversely, in the {cable, support device} setting, the model attained comparable performance across five- and twenty-shot tasks, with the highest performance observed in ten-shot tasks, ~1% better than in both other settings. The lowest predictive uncertainty was recorded in twenty-shot tasks for {pole, vegetation} and in five-shot tasks for {cable, support device}.
Table 4 presents the results for the same evaluation settings as Table 3 but applied to the WHU-Railway3D dataset. In the {pole, vegetation} setting, the model achieved its best performance on twenty-shot tasks with 76% mIoU and 78.4% OA, comparable to the results from five- and ten-shot tasks. For {cable, support device}, the highest performance was achieved in twenty-shot tasks with 59.1% mIoU and 83.9% OA, showing a ~2.5% mIoU improvement over the second-best result in ten-shot tasks. The lowest predictive uncertainty occurred in five-shot tasks for the first setting and in ten-shot tasks for the second.
Table 5 and Table 6 provide the experimental results for the cross-domain OOD shift, evaluated on the Infrabel-5 Railroad Segmentation dataset and the WHU-Railway3D dataset, respectively. In Table 5, the model achieved its best performance for the one-way setting in twenty-shot tasks, with 77.5% mIoU and 84.8% OA, comparable to the second-best performance in ten-shot tasks. The lowest predictive uncertainty is observed in the ten-shot tasks. For the five-way setting, the model attained its highest performance in twenty-shot tasks, achieving 50% mIoU and 70.6% OA, with a ~3.2% improvement in mIoU over ten-shot tasks (second best) and a ~4% OA improvement compared to five-shot tasks. The twenty-shot setting yields the lowest predictive uncertainty.
In Table 6, the model achieved its best performance for the one-way setting in ten-shot tasks, with 71.7% mIoU and 82.4% OA, comparable to the results (<1%) for five-shot and twenty-shot tasks. The lowest predictive uncertainty is observed under the same shot setting. For the five-way setting, the model performed best in twenty-shot tasks, with 48% mIoU and 62.4% OA, showing ~2.5% mIoU and ~1.6% OA improvements compared to the second-best results achieved in ten-shot tasks. The lowest predictive uncertainty was observed in the twenty-shot tasks.
5.2. Qualitative Results
The qualitative results for the OOD shifts are illustrated in Figure 3, Figure 4 and Figure 5. In all the figures, the model was evaluated via a one-way five-shot setting, with smaller regions extracted from larger railroad areas for clear visualization. The ground truth is displayed in the ‘GT’ columns, while the model predictions are shown in the corresponding ‘Pred’ columns.
Figure 3 illustrates the results for the in-domain OOD shift, showing the results for the Infrabel-5 Segmentation dataset in the first row and the WHU-Railway3D dataset in the second row. In the first two columns of each row, vegetation is segmented by the model that was trained on ground, cable, and support devices. In the last two columns, support devices are segmented using the model that was trained on ground, vegetation, and poles. The results indicate better segmentation of vegetation from poles in the WHU-Railway3D dataset compared to the Infrabel-5 dataset. Conversely, the model performed better at segmenting support devices from cables in the Infrabel-5 dataset, while some cables were misclassified as support devices in the WHU-Railway3D dataset. Overall, the figure highlights the model’s strong generalization to novel railway infrastructure classes despite receiving minimal supervision, with only five labeled support examples.
Figure 4 and Figure 5 present the qualitative results for the cross-domain OOD shift, evaluated on the Infrabel-5 Segmentation dataset and the WHU-Railway3D dataset, respectively, with the other dataset used for training. These datasets exhibit notable structural differences: the vegetation and poles are taller in the WHU-Railway3D dataset, while the support devices vary significantly in form. Both figures demonstrate that the model generalizes well to previously unseen distributions, although some challenges remain in segmenting vegetation (row 1, columns 1–2) and distinguishing support devices from cables (row 2, columns 3–4). Nonetheless, the model accurately segments poles and cables when presented against distinct backgrounds.
5.3. Ablation Study
Fine-tuning. Fine-tuning is often used to adapt a model to the test distribution when it significantly differs from the training distribution. We fine-tuned the pretrained model on the Infrabel-5 Railroad Segmentation dataset. The last two layers of the pretrained model were fine-tuned on the WHU-Railway3D dataset to better align the model with our selected evaluation classes. We employed the Adam optimizer with a learning rate of
and weight decay of
. Training and evaluation follow the same data splits as the ID shift (Table 1), with a batch size of 16. Training is conducted for 30 epochs.
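A minimal PyTorch sketch of this partial fine-tuning, freezing everything except the last two layers as described for the WHU-Railway3D case; the architecture, layer shapes, and optimizer hyperparameters below are stand-ins, not the paper's actual model:

```python
import torch
import torch.nn as nn

# Stand-in for the pretrained segmentation network; the real architecture
# and layer names differ. This only illustrates freezing all but the
# last two layers before fine-tuning.
model = nn.Sequential(
    nn.Linear(3, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 32),            # second-to-last trainable layer
    nn.Linear(32, 5),             # output layer (5 evaluation classes)
)

# Freeze everything, then unfreeze the last two layers.
for p in model.parameters():
    p.requires_grad = False
for layer in list(model.children())[-2:]:
    for p in layer.parameters():
        p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
# Learning rate and weight decay are assumed placeholder values.
opt = torch.optim.Adam(trainable, lr=1e-4, weight_decay=1e-4)
```

Passing only the unfrozen parameters to the optimizer keeps the pretrained backbone fixed while the head adapts to the evaluation classes.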
Table 7 presents the fine-tuning results for our segmentation task. The results show that fine-tuning achieves more than 60% mIoU in both datasets, achieving 92.6% OA for the WHU-Railway3D dataset, which is 10% higher than the OA value for the Infrabel-5 Railroad Segmentation dataset.
6. Discussion
This work discusses the generalization of FSL under inevitable distributional shifts encountered in real-world railroad monitoring, arising from sensor noise (ID shift), infrastructure upgrades (in-domain OOD shift), and environmental variations across geographical regions (cross-domain OOD shift). The baseline performances achieved under the no-shift condition (Table 2), representing optimal similarity to the training conditions, provide an upper bound for comparison. The results for the ID shift (Table 2), compared to the baselines, demonstrate minimal impact on model performance, with deviations of ~1%, highlighting FSL’s strong generalization ability under minor variations in evaluation conditions.
The in-domain OOD shift, which evaluated the model on novel, previously unseen classes from the same dataset as the training set, indicates a performance drop. For the Infrabel-5 Railroad Segmentation dataset, the {pole, vegetation} setting experienced a ~24.9% decrease in mIoU and an 18.1% reduction in OA compared to the baseline (Table 3 vs. Table 2). In contrast, the {cable, support device} setting exhibited a 1.45% decline in mIoU with a comparable OA. For the WHU-Railway3D dataset (Table 4 vs. Table 2), the {pole, vegetation} setting demonstrates a ~7.4% mIoU and ~10.7% OA decrease, while the {cable, support device} setting shows a ~23.9% mIoU and ~5.1% OA decline. These performance differences across the datasets (Table 3 vs. Table 4) for the two evaluation settings are likely attributable to class imbalances between the datasets. Vegetation has higher representation in the WHU-Railway3D dataset, whereas the cable and support device classes are more prevalent in the Infrabel-5 Segmentation dataset (Figure 6).
The cross-domain OOD shift with one-way tasks resulted in performance decreases of ~11.3–11.6% in mIoU and ~6.6–7.6% in OA when compared to the baselines (Table 5 and Table 6 vs. Table 2) for both datasets. Specifically, the mIoU and OA decreased by 11.6% and 7.65% on the Infrabel-5 Railroad Segmentation dataset and by 11.3% and 6.6% on the WHU-Railway3D dataset.
N-way K-shot tasks were examined for different N and K configurations. Under cross-domain OOD shifts, increasing N from one to five led to performance drops of ~27.5% mIoU and ~14.2% OA on the Infrabel-5 Railroad Segmentation dataset, and ~23.7% mIoU and ~19.9% OA on the WHU-Railway3D dataset. These results align with existing FSL research, where the increased complexity of multi-class segmentation (N > 1) tasks contributes to the observed performance disparity. Increasing K to 20 notably enhanced the model performance in some settings. For instance, in the in-domain OOD shift for the {pole, vegetation} setting on the Infrabel-5 Railroad Segmentation dataset, the mIoU and OA improved by ~12.1% and ~10%, respectively, compared to smaller values of K (Table 3). Similarly, for the {cable, support device} setting on the WHU-Railway3D dataset, the performance increased by ~2.5% in mIoU and ~1.6% in OA (Table 4). For the cross-domain OOD shift using one-way tasks, improvements of over 4% in both mIoU (4.43) and OA (4.16) were observed compared to smaller values of K for the Infrabel-5 Segmentation dataset. Similar behavior was observed in the five-way tasks, where increasing K resulted in ~4–4.4% gains in both mIoU and OA across both the Infrabel-5 Segmentation dataset and the WHU-Railway3D dataset (Table 5 and Table 6). However, in other scenarios, the number of support examples had a minimal effect on performance. While increasing K generally boosted the results, it occasionally introduced noise, leading to slight reductions in model performance.
This study underscores the importance of evaluating model performance using the MCC metric, which is particularly useful for identifying trivial majority classifiers. An MCC value of 0 indicates a model that predicts the majority class irrespective of input features [74]. Given the highly class-imbalanced nature of our datasets, MCC offers a more realistic assessment of model performance as it penalizes errors in minority classes more effectively than OA. The baseline results (Table 2) exhibit 82% agreement with the ground truth for the Infrabel-5 Segmentation dataset and 77% for the WHU-Railway3D dataset, while the corresponding OA values are higher, i.e., 92% and 89%, respectively. Similar trends are observed across all the experiments under distributional shifts. In the experiments involving one-way tasks, MCC consistently exceeded 60%, except in the in-domain OOD shift for the {pole, vegetation} setting, where it decreased to 47% for the Infrabel-5 Segmentation dataset and 39% for the WHU-Railway3D dataset.
The comparison between the FSL baselines under no-shift (Table 2) and fine-tuning (Table 7) shows that FSL achieved an 18.1% improvement in mIoU on the Infrabel-5 Railroad Segmentation dataset and a 21.3% improvement on the WHU-Railway3D dataset. Under the cross-domain OOD shift with the one-way task setting (Table 5 and Table 6 vs. Table 7), FSL achieved 17.5% and 9.9% mIoU improvements on the Infrabel-5 Railroad Segmentation dataset and the WHU-Railway3D dataset, respectively. However, such cross-domain OOD generalization with limited data is not inherent to fine-tuning, as fine-tuning relies on the i.i.d. assumption and is evaluated on a test set drawn from the same distribution as the training data. Compared to fine-tuning, the FSL baselines (under no-shift) yielded a higher mH value on the WHU-Railway3D dataset. However, on the Infrabel-5 Railroad Segmentation dataset, mH was lower, suggesting that the model exhibits greater confidence when trained on a smaller dataset with FSL. Under the cross-domain OOD shift, FSL resulted in higher mH values, reflecting increased uncertainty in unseen environments. Nevertheless, the model remains capable of generalizing beyond the i.i.d. assumption—an aspect where fine-tuning falls short. These findings further highlight the importance of incorporating predictive uncertainty measures, such as entropy, in the evaluation of FSL models.
The entropy measure provides insights into the reliability of our model’s predictions by quantifying uncertainty, which is crucial for real-world applications such as railroad monitoring, where distributional shifts are inevitable. Since test data often lack ground truth for verification, quantifying predictive outcomes through entropy can support decision-making. When the model exhibits low confidence in its predictions, highly uncertain samples can be referred to human experts for further review.
Figure 7 illustrates this for the Infrabel-5 Railroad Segmentation and WHU-Railway3D datasets under cross-domain OOD settings, highlighting highly uncertain samples for poles (red), ground (gray), vegetation (green), cables (purple), and support devices (yellow). For instance, three highly uncertain pole samples (red) in the Infrabel-5 Railroad Segmentation dataset could be flagged for human evaluation in critical decision-making scenarios. Moreover, in domain adaptation with FSL, mH provides essential insights: a high mH value indicates low model confidence under a new distribution, aiding in decisions such as model rejection, recalibration, or further fine-tuning.
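Such expert referral can be implemented by thresholding per-point entropy; the threshold below (a fraction of the maximum entropy log C) is an illustrative assumption, not a value used in the paper:

```python
import numpy as np

def flag_uncertain(probs, threshold=0.8):
    """Return indices of points whose predictive entropy exceeds
    `threshold` * log(C), so they can be referred to a human expert.

    probs: (N, C) array of per-point class probabilities.
    """
    p = np.clip(np.asarray(probs, dtype=np.float64), 1e-12, 1.0)
    h = -(p * np.log(p)).sum(axis=1)       # per-point entropy (nats)
    return np.flatnonzero(h > threshold * np.log(p.shape[1]))
```

In a deployed monitoring pipeline, only the flagged points (or the blocks containing them) would be queued for human review, keeping the manual workload bounded.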