Multi-Modal Contrastive Learning for LiDAR Point Cloud Rail-Obstacle Detection in Complex Weather

Abstract: Obstacle intrusion is a serious threat to the safety of railway traffic. LiDAR point cloud 3D semantic segmentation (3DSS) provides a new method for unmanned rail-obstacle detection. However, an inevitable degradation of model performance occurs in complex weather, hindering practical application. In this paper, a multi-modal contrastive learning (CL) strategy, named DHT-CL, is proposed to improve point cloud 3DSS in complex weather for rail-obstacle detection. DHT-CL is a camera and LiDAR sensor fusion strategy designed specifically for complex weather and obstacle detection tasks, and it requires no image input during the inference stage. We first demonstrate how the sensor fusion method is more robust under rainy and snowy conditions, and then we design a Dual-Helix Transformer (DHT) to extract deeper cross-modal information through a neighborhood attention mechanism. An obstacle anomaly-aware cross-modal discrimination loss is then constructed for collaborative optimization adapted to the anomaly identification task. Experimental results on a complex weather railway dataset show that, with an mIoU of 87.38%, the proposed DHT-CL strategy achieves better performance than other high-performance models developed on the autonomous driving dataset SemanticKITTI. The qualitative results show that DHT-CL achieves higher accuracy in clear weather and reduces false alarms in rainy and snowy weather.


Introduction
Railways play an important economic and social role in transport. Obstacle intrusions, e.g., caused by geological hazards within the rail track area, animals, vehicles, or objects falling from bridges above the tracks, can seriously jeopardize the safety of rail traffic. With the development of light detection and ranging (LiDAR) technology and deep learning-based environmental perception methods, rail transport has become more intelligent in recent years. Point cloud 3D semantic segmentation (3DSS) provides a new method for unmanned rail-obstacle detection. Based on 3DSS, Wang [1], Soilán [2], Manier [3], and Dibari [4] achieved rail-obstacle detection through a point-by-point analysis of railway point clouds and established intelligent railway monitoring systems. However, railways are mostly located in the wilderness and are exposed to complex weather conditions, which inevitably degrades the performance of models and hinders their real-world application.
To cope with the effects of complex weather, filter-based methods [5][6][7] pre-process the input data and remove the noise caused by rain or snow shading to maintain performance. Following the same principle, simulation-based methods [8,9] synthesize scattered noise representing rain, snow, or fog into clear-weather data as a form of data augmentation to improve the adaptability of recognition neural networks. In addition, in our practical application, we found that the main factor leading to classification confusion in rain and snow is the change in the reflectivity of object surfaces. As shown in Figure 1, because specular reflection increases when the metal rail surface is wet, the contrast of the positive incidence region increases, and no return light is detected in the grazing region, resulting in partially missing imaging. Snow likewise leads to increased diffuse reflection. Overall, dramatic changes in the light-intensity distribution of point clouds occur in rain and snow, as shown in Figure 2. Both figures come from our real-world data collection. In short, the data distribution shifts drastically and irregularly with respect to clear-weather data, degrading model performance, mainly in the form of an increase in false alarms, which cannot be resolved by a filter-based approach. Sensor fusion methods [10][11][12][13][14][15][16][17][18][19][20] have been employed to improve adaptability to complex environments, using the camera to obtain high-resolution object texture information to compensate for the shortcomings of sparse and coarse LiDAR point clouds. However, RGB images undergo dramatic fluctuations in brightness and contrast in complex weather due to the light-sensitive nature of the camera, sharing the same characteristics as point cloud reflectivity and making model fitting more difficult. In contrast, we demonstrate that local geometric structure is a more reliable reference for a deep learning-based method. The neighborhood relationships captured from the different perspectives of the camera and LiDAR sensors are key to the higher robustness of multi-modal methods in complex weather. As shown in Figure 3, there is a difference in the receptive fields of 3D and 2D networks, i.e., the anchor point has a different influence on the neighborhood points during the back-propagation training stage.
In this paper, we propose a neighborhood attention-driven multi-modal contrastive learning strategy, named DHT-CL, to improve the performance of rail-obstacle detection based on 3DSS in complex weather. Our approach has the following advantages: (1) It makes full use of the neighborhood information from the different perspectives of the camera and LiDAR sensors, which is robust to object reflectance variations in complex weather. (2) Only point cloud input is required during the inference stage, without the need for image input, making it efficient for deployment. (3) It improves the general rain and snow resistance of deep learning-based methods, which cannot be achieved by filter-based data pre-processing alone. The framework can be explained as follows: First, using the point cloud and image as input, 2D and 3D feature maps are independently extracted by the 2D and 3D backbones, respectively. Second, a Dual-Helix Transformer (DHT) module reassigns weights to the 2D and 3D features based on a neighborhood attention mechanism, which allows for the selective preservation or filtering of homogeneity or heterogeneity in neighborhood cross-modal information. Third, an adaptive cross-modal discrimination loss is constructed for collaborative optimization, which softens or sharpens the output logit distribution depending on the presence or absence of obstacle anomalies. This structure adapts the learning priorities. It also allows the 2D branch to be discarded during the inference stage and achieves performance improvements that are not limited to the overlapping field-of-view (FOV) region. Finally, based on a complex weather railway dataset with point-wise annotation, we compare our method with state-of-the-art models [19,21–24] from the autonomous driving dataset SemanticKITTI [25]. The experimental results show that our DHT-CL method achieves a higher mIoU of 87.38% compared to the other models. The qualitative results show that DHT-CL improves accuracy in clear weather and reduces false alarms in rain and snow.
The contributions of this paper can be summarized as follows:
• A Dual-Helix Transformer (DHT) module is proposed to extract deeper information for robust sensor fusion in complex weather through local cross-attention mechanisms.
• An adaptive contrastive learning strategy is achieved through an obstacle anomaly-aware cross-modal discrimination loss, which adjusts the learning priorities according to the presence or absence of obstacles.
• Based on the proposed semantic segmentation network, a rail-obstacle detection method is implemented to identify and locate multi-class unknown obstacles (minimum size 7 × 7 × 7 cm³) in complex weather, exhibiting high accuracy and robustness.
The remainder of this paper is organized as follows. We review the related works in Section 2 and describe our proposed method in Section 3. Section 4 presents the experimental setup and results, and we conclude this paper in Section 5.

Related Works
This section briefly reviews the related works, which are divided into four categories: rail-obstacle detection, environmental perception in adverse weather (2D and 3D), sensor fusion, and multi-modal contrastive learning.

Rail-Obstacle Detection
Extensive studies have used RGB cameras due to their high resolution and low cost. In addition, some studies [30] have combined RGB and infrared cameras to work in low-light conditions. For 2D input data, conventional image processing methods detect the rail track lines and nearby obstacles using the Hough transform [31][32][33] or the Canny or Sobel operators [34]. Optical flow-based [41] and Kalman filtering-based [29] methods have been used for moving-obstacle detection. Moreover, deep learning-based methods have shown advantages in terms of accuracy and robustness, and many studies have followed the framework of single-stage YOLO [35][36][37] or two-stage R-CNN [38,39] for object detection. Other studies have relied on semantic segmentation, focusing on rail track lines [40] or railway track area segmentation [42]. In [40], the authors incorporated vanishing point detection and identified shaded rail track points as obstacles. In [30], the authors detected anomalies using GAN [43] reconstruction and comparison to issue obstacle alarms.
Overall, both IR and RGB camera-based methods face difficulties in measuring distance and are prone to false alarms for objects in safe positions due to the perspective relationship. RGB cameras deteriorate dramatically in low-light conditions, whereas infrared cameras can work at night but have inferior image quality. Additionally, methods based on object detection algorithms mostly simulate obstacles using specific known classes, such as pedestrians, vehicles, and specific target shapes. In practice, object detection algorithms are only effective for objects with specific and consistent features and cannot be used to detect multi-class novel obstacles. These methods lack generalized evaluation criteria, whereas our segmentation-based framework enables multi-class obstacle detection in more complex scenarios.
LiDAR has proven its effectiveness in 3D perception in recent years. LiDAR-based obstacle detection methods start with railway track localization and can be divided into geometric, machine learning-based, and deep learning-based methods. Geometric methods filter track points by searching for height and intensity jumps [44,45], implementing Hough transformations on BEV projections [46], or judging based on gauge corner characteristics [47]. The track curve is then fitted using RANSAC [45], multi-segment folding lines [44], or an eigenvector-based neighborhood growth algorithm [46]. In machine learning methods [48,49], rail track lines are extracted through classification. Feature mapping is performed using principal component analysis (PCA), whereas classification is performed using linear discriminant analysis (LDA) [48] or support vector machines (SVMs) [49]. Yu et al. [50] adopted cylindrical voxel partitioning and implemented 3D convolutions. Wang et al. [51] conducted 2D convolutions on the spherical projection of railway point clouds and incorporated an attention mechanism with decoupled spatial and channel-wise aspects. Based on rail track localization, Qu et al. [52] removed background points in the region of interest (RoI) and obtained outlier point clusters through Euclidean clustering for obstacle alarms. Another branch of research is devoted to full-scene railway perception based on semantic segmentation. Manier [3] designed a point-based axisymmetric convolution operator that projects the point cloud into 2D along the axis of symmetry in a columnar neighborhood; the subsequent 2D CNN can accommodate railway scenes with significant vertical differences and minor horizontal differences. Dibari et al. [4] first applied the point cloud semantic segmentation networks PointNet [53] and PointNet++ [54] for railway scene parsing, with the former achieving the higher mIoU of 62.6%. Soilán et al. [2] ported PointNet [53] and KPConv [55] to railway scenes, but they achieved satisfactory results only on regular concrete pavements in railway tunnels.
Methods based on clustering and outlier removal work well in experiments but are unstable in the real world; they require careful thresholding for diverse scenarios, otherwise fitting errors can cause a large number of false alarms. Existing semantic segmentation models in railway scenes generally show unsatisfactory performance and suffer from further degradation in rainy and snowy conditions. Our method improves accuracy and robustness in both clear and adverse weather conditions through multi-modal contrastive learning.

Two-Dimensional Environmental Perception in Adverse Weather
Hussain et al. [56] identified the occurrence of extreme weather-induced anomalies in autonomous driving systems based on GAN-reconstructed pixel errors. Their experiments showed that the performance of camera-based perception systems decreases drastically in complex weather conditions (rain, fog, snow, etc.). Liu et al. [57] proposed a camera and millimeter-wave radar fusion approach to enhance the performance of vehicle detection and trajectory prediction in complex weather. Their experimental results showed that the single-sensor approach is prone to producing missed alarms in extreme weather. Similarly, the authors of [58] explored the failure modes of single-camera methods and proposed a method based on prediction variance and trajectory deviations to eliminate false alarms.
Previous studies have shown that end-to-end neural network models that rely solely on camera sensors suffer from dramatic performance degradation in adverse weather conditions.

Three-Dimensional Environmental Perception in Adverse Weather
Point cloud distortion due to rain and snow is a direct cause of the performance degradation of LiDAR-based environmental perception methods. Full-reference metrics for evaluating point cloud distortion rely on assessing the similarity between the distorted point cloud and the original point cloud based on topology, geometry, color features [59,60], local curvature statistics [61], or 3D edge features [62]. Since the original point cloud is not always accessible, no-reference metrics have been proposed based on 3D natural scene statistics and entropy [63] or neural networks [64]. Furthermore, Viola et al. [65] extracted a subset of statistical features from the original point cloud for reduced-reference evaluation, and Zhou et al. [66] proposed a reduced-reference metric based on content-oriented similarity and statistical correlation measurements.
To address point cloud distortion, Le et al. proposed an adaptive noise-removal filter [5] for the range image projected from LiDAR point clouds and an adaptive group of density outlier-removal filters [6] for LiDAR point clouds. In addition, Wang et al. [7] proposed a dynamic distance-intensity outlier-removal filter for snow denoising to pre-process point clouds and remove noise caused by adverse weather. Mai et al. [8] synthesized fog on the KITTI dataset to generate images and point clouds with reduced visibility, while Shih et al. [9] proposed a multi-mechanism spray synthesis model to improve the performance of recognition models. Kim et al. [67] analyzed the reasons for imaging performance degradation in adverse weather by testing LiDAR's imaging capability on a 0.6 × 0.6 m² square target under varying degrees of rain and fog. The performance of multiple sets of LiDAR sensors under different artificial rainfall conditions was also evaluated in [68], using the number of imaging points on the target as a criterion. Piroli et al. [69] detected the presence of rain and snow using an energy-based anomaly detection framework. Li et al. [70] evaluated and modeled LiDAR visibility under different artificial fog conditions, while Delecki et al. [71] stress-tested recognition models by gradually adding computer-synthesized rain, snow, and fog to analyze the causes of recognition failures.
Based on the aforementioned analyses, we designed a multi-modal contrastive learning strategy specifically for rain and snow to alleviate performance degradation due to changes in object reflectivity in adverse weather, which cannot be resolved using filter-based approaches.

Sensor Fusion Methods
Sensor fusion methods attempt to fuse camera and LiDAR information and exploit their complementarity. Typically, 3D point clouds are first converted to 2D through perspective projection, spherical projection, cylindrical projection, or bird's-eye-view (BEV) projection. SnapNet [10] takes both 2D and 3D data as input to the network, which is data-level fusion. FuseSeg [11,12] concatenates the embedding-layer features from the 2D and 3D modalities, which is feature-level fusion. Genova et al. [13] designed a sparse filter and re-projected 2D images onto 3D point clouds to facilitate the training of 3D models with 2D labels. PointPainting [14] paints the point clouds according to the images, adding RGB channels to the 3D data. SAT [15] is a 2D-assisted training strategy that uses 2D images to build an attention map during the training stage while skipping the 2D branch via an attention mask during the inference stage. In MSeg3D [18], the non-overlapping region is complemented with predicted pseudo-camera features, and self- and cross-attention between the camera, LiDAR, and fusion features is performed to generate the final prediction scores.
However, these methods require precise alignment of the point cloud and image, and performance enhancements occur only in the overlapping FOV region.

Multi-Modal Contrastive Learning
Contrastive learning across modalities promotes multi-modal information transformation and allows image branches to be discarded during the inference phase. In PMF [16], the embedding-layer features are aggregated, and a modality perception loss is constructed using the KL divergence for collaborative optimization. Liu et al. [17] generated fusion features using a linear module containing element-wise multiplication and addition and then aligned the fusion features and 2D features using the L2 loss function for training. In 2DPASS [19], knowledge distillation [72] is implemented on a multi-scale feature map in a multilayer perceptron (MLP) manner. In SSLp2i [20], the embedding features in the FOVs are used to construct a contrastive loss using the L2 norm for semi-supervised learning.
Following this line of work on contrastive losses for collaborative optimization, we developed a more effective method for information transfer in complex weather conditions that simultaneously optimizes the obstacle anomaly detection task.

Methods
The objective of this study is to improve point cloud 3DSS performance in complex weather for rail-obstacle detection. We propose a multi-modal contrastive learning strategy, named DHT-CL, to handle the data distribution shifts caused by changes in object surface reflectivity in complex weather.
An overview of the framework of DHT-CL is shown in Figure 4. Specifically, during the training stage, both the point cloud and image branches are activated, and the 2D and 3D features are first extracted independently by the 2D and 3D encoding networks, respectively. The 3D features are then projected into 2D to generate pseudo-2D features. The pseudo-2D and 2D features are fed into the DHT module simultaneously to obtain the fusion features. Then, the 3D and fusion features are decoded by two independent classifiers to output the prediction scores, between which an obstacle anomaly-aware modality discrimination loss is constructed for collaborative optimization. All of the above are supervised by pure 3D labels. During the inference stage, the 2D branch is masked, which reduces the computational burden. The 2D backbone is a U-Net for semantic segmentation; it contains a downsampling layer from a pre-trained ResNet34 [73], an upsampling layer based on transpose convolutions, and a skip-connected structure with a hidden size of 64. The 3D extractor, SPVCNN [23], is also a U-Net, with a voxel size of 0.05 m and a hidden size of 64.
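To make the data flow concrete, the following minimal PyTorch sketch mirrors the control flow described above: both branches run during training, while inference uses only the 3D branch. The module interfaces (backbone3d, backbone2d, dht) and the use of an index tensor to produce pseudo-2D features are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class DHTCL(nn.Module):
    """Sketch of the DHT-CL control flow; the backbones and DHT module are
    injected placeholders, not the authors' implementation."""
    def __init__(self, backbone3d, backbone2d, dht, hidden=64, num_classes=8):
        super().__init__()
        self.backbone3d = backbone3d      # e.g., SPVCNN-style, voxel size 0.05 m
        self.backbone2d = backbone2d      # e.g., ResNet34-based U-Net
        self.dht = dht                    # Dual-Helix Transformer (see below)
        self.cls3d = nn.Linear(hidden, num_classes)
        self.cls_fuse = nn.Linear(hidden, num_classes)

    def forward(self, point_feat, image_feat=None, pix_index=None):
        f3d = self.backbone3d(point_feat)          # (N, hidden) per-point features
        logits3d = self.cls3d(f3d)
        if not self.training:
            return logits3d                        # inference: 2D branch masked
        f2d = self.backbone2d(image_feat)          # (M, hidden) pixel features
        f3d_proj = f3d[pix_index]                  # pseudo-2D features via point-pixel index
        f_fuse = self.dht(f2d, f3d_proj)
        return logits3d, self.cls_fuse(f_fuse)     # two prediction heads for the CL loss
```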

Point-Pixel Mapping
The point-pixel correspondence serves as crucial prior knowledge in multi-modal methods and has a significant influence on the subsequent predictions. Instead of directly mapping the input 3D point clouds into 2D space, we first establish a pairwise index of the point cloud and the image within the overlapping FOV region. Then, we map the 3D features to 2D space based on this index, thereby generating pseudo-2D features.
Perspective projection is adopted in this paper to create the 2D-3D mappings. Specifically, given a LiDAR point cloud $P = \{p_i\}_{i=1}^{N} \in \mathbb{R}^{N \times 4}$, a single point is denoted as $p_i = (x_i, y_i, z_i, I_i)$, where $I_i$ is the reflected intensity, and its projection onto the image plane is

$$z_i^{c}\,[m_i, n_i, 1]^{T} = K\,T\,[x_i, y_i, z_i, 1]^{T},$$

where $K$ and $T$ are the pre-calibrated camera internal matrix $K \in \mathbb{R}^{3 \times 4}$ and external matrix $T \in \mathbb{R}^{4 \times 4}$, and $z_i^{c}$ is the depth in the camera frame. In this work, $K$ is obtained using the calibration method proposed by Zhang [74], and $T$ is obtained using the method proposed by Yuan [75]. The 2D point cloud derived from the 3D projection is denoted as $P_{2D} = \{(m_i, n_i)\}_{i=1}^{N}$. It is subsequently discretized based on the camera's resolution $r$, and the points within the camera picture $(H, W)$ are filtered as follows:

$$(u_i, v_i) = \left(\left\lfloor m_i / r \right\rfloor, \left\lfloor n_i / r \right\rfloor\right), \qquad 0 \le u_i < W, \quad 0 \le v_i < H.$$

Finally, the point-pixel correspondence index $\mathrm{Index}\{(u, v), i\}$ is established based on whether the pixel coordinates $\{u, v\}_{u,v=1}^{U,V}$ overlap with the projected point cloud coordinates $(u_i, v_i)$.
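As an illustration, a minimal NumPy sketch of this mapping follows. It assumes the resolution-based discretization reduces to integer pixel rounding (i.e., r is folded into K); the function and variable names are ours.

```python
import numpy as np

def point_pixel_index(points_xyz, K, T, H, W):
    """Build the point-pixel correspondence index via perspective projection.
    K: 3x4 camera internal matrix, T: 4x4 external matrix (pre-calibrated)."""
    N = points_xyz.shape[0]
    homo = np.hstack([points_xyz, np.ones((N, 1))])   # homogeneous coordinates (N, 4)
    cam = (K @ T @ homo.T).T                          # image-plane coordinates (N, 3)
    depth = cam[:, 2]
    m, n = cam[:, 0] / depth, cam[:, 1] / depth       # perspective division
    u, v = np.floor(m).astype(int), np.floor(n).astype(int)  # discretize to pixels
    keep = (depth > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)  # inside the image
    return np.flatnonzero(keep), u[keep], v[keep]     # paired point indices and pixels
```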

Dual-Helix Transformer
The DHT module is key to the proposed contrastive learning strategy. Previous methods for transferring information between different modalities or representations, such as knowledge distillation [72], 2DPASS [19], PVKD [21], and xMUDA [76], incorporate a learnable layer as a buffer, achieving better performance than naive direct fusion because the learnable module can compensate for modal heterogeneity differences. However, under the pull of the loss function, the learnable module tends to make the transformed data distribution too similar to the other modality, thereby compromising heterogeneous information transfer. Differences between modalities can hinder information transfer, but they are also the real cause of performance improvements. Balancing the trade-off between homogeneity and heterogeneity is crucial for multi-modal methods. We observe that neighborhood relationships play a crucial role in cross-modal information transfer. For instance, the edges of an object are the same regardless of whether it is described through an RGB image or XYZ coordinates, and they are less affected by changes in intensity or color in rainy or snowy weather. The DHT module pre-processes the features to be fused based on a neighborhood attention mechanism. It searches for the neighboring points of a pixel in the pseudo-2D point cloud space, calculates a Gram matrix describing similarity, and adjusts the central element based on these weights. The same operation is performed for the pseudo-2D point cloud features. In this way, objects with weaker neighborhood relationships are assigned smaller weights, thereby eliminating some of the confusing information generated by perspective relationships, such as railway tracks that appear connected to a distant railway signal pole in a 2D image. At the same time, information with specific characteristics is extracted and encoded into the data. Although the feature maps from the two imaging perspectives are completely different, the local geometric structures are similar, resulting in higher relevant weights.
Figure 5 shows a schematic diagram of the DHT module. Specifically, the input of the DHT module is the 2D features, $F_{2D}$, extracted by the 2D network, and the pseudo-2D features, $F_{3D\_proj}$, projected from the 3D features. The output of the DHT module is the fusion features, $F_{fuse}$. The formulas are as follows:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$
$$F'_{2D} = \mathrm{LN}\!\left(F_{2D} + \mathrm{Attn}\big(L(F_{2D}),\, L(F_{3D\_proj}),\, L(F_{3D\_proj})\big)\right),$$
$$F_{fuse} = \mathrm{LN}\!\left(F_{3D\_proj} + \mathrm{Attn}\big(L(F_{3D\_proj}),\, L(F'_{2D}),\, L(F'_{2D})\big)\right),$$

where $L$ denotes the linear layer; $Q$, $K$, and $V$ denote the query, key, and value, respectively; $\mathrm{LN}$ denotes layer normalization; and $d_k$ denotes the dimension size of the value features. As shown in the formulas, a post-layer norm is adopted [77], which enables better model performance.
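A compact PyTorch sketch of this dual cross-attention follows. The exact wiring (one cross-attention in each direction, followed by a linear fusion of the two outputs) is our reading of Figure 5 rather than a verified reproduction.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """One helix: features X attend to features Y, with a post-layer-norm residual."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, y):
        scores = self.q(x) @ self.k(y).transpose(-2, -1) / (y.shape[-1] ** 0.5)
        attn = torch.softmax(scores, dim=-1)          # Gram-matrix similarity weights
        return self.norm(x + attn @ self.v(y))        # post-LN, as in [77]

class DHT(nn.Module):
    """Dual-Helix Transformer: cross-attention applied twice (cf. Figure 5)."""
    def __init__(self, dim):
        super().__init__()
        self.helix_a = CrossAttention(dim)   # 2D features attend to pseudo-2D features
        self.helix_b = CrossAttention(dim)   # pseudo-2D features attend to 2D features
        self.fuse = nn.Linear(2 * dim, dim)  # final fusion layer (our assumption)

    def forward(self, f2d, f3d_proj):
        a = self.helix_a(f2d, f3d_proj)
        b = self.helix_b(f3d_proj, f2d)
        return self.fuse(torch.cat([a, b], dim=-1))   # fusion features F_fuse
```

This sketch attends globally over the paired features; the sliding-kernel operator described next restricts attention to a spatial neighborhood.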

In addition, to deal with irregular 3D data, we designed a sparse sliding kernel-based 3D Transformer operator. The center element of the sliding kernel is represented as the query Q, and the other elements within the kernel are represented as the key K and value V. Then, the inner product matrix, QK^T, is computed and used as weights to sum V, updating the center element. A schematic diagram detailing this process is shown in Figure 6. Self-attention or cross-attention of sparse 3D data can be efficiently realized using this operator. Fast neighborhood address queries, based on a GPU Hash table and matrix parallel operations, are performed for memory and speed optimization. Also, the problem of sparsity arises when dealing with pseudo-2D features, as the neighborhood of the projected 3D features may be missing, in which case the standard convolution operator is no longer applicable. A 2D version of the sparse manifold convolution is utilized, with further details available in [78,79].
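For concreteness, the following Python sketch mimics the operator's logic with a hash-map address query and an explicit loop; the kernel size default and function name are ours, and the real implementation runs as parallel matrix operations on the GPU.

```python
import torch

def sparse_window_attention(coords, q_feat, kv_feat, kernel=5):
    """Sliding-kernel attention over sparse integer 2D coordinates (cf. Figure 6).
    A Python dict stands in for the GPU Hash table; missing neighbors (marked -1)
    are dropped rather than padded."""
    table = {tuple(c): i for i, c in enumerate(coords.tolist())}   # address query table
    half = kernel // 2
    offsets = [(du, dv) for du in range(-half, half + 1)
                        for dv in range(-half, half + 1)]
    out = torch.empty_like(q_feat)
    d_k = q_feat.shape[-1]
    for i, (u, v) in enumerate(coords.tolist()):
        nbr = [table.get((u + du, v + dv), -1) for du, dv in offsets]
        idx = torch.tensor([j for j in nbr if j >= 0])             # omit missing points
        k = kv_feat[idx]                                           # kernel elements act as K and V
        w = torch.softmax(q_feat[i] @ k.T / d_k ** 0.5, dim=-1)    # QK^T weights
        out[i] = w @ k                                             # weighted sum updates the center
    return out
```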

Adaptive Contrastive Learning
Adaptive contrastive learning is accomplished through an obstacle anomaly-aware cross-modal discrimination loss.
First, consider the situation where a point is classified into the "unknown obstacle" class. This may be a real obstacle, or it may be a normal object misclassified after being washed by rain or snow. In such cases, the output of the "unknown obstacle" classifier and the output of a normal-class classifier may be high at the same time, i.e., two or more classes cannot be well distinguished.
In this paper, we introduce a binary classifier that determines whether a point belongs to the "unknown obstacle" class. This classifier guides contrastive learning and distinguishes between similar situations, one of which is described above. According to the analysis in [80], when constructing a contrastive loss using the Kullback-Leibler (KL) divergence, a smaller scaling factor T makes the distribution of logits sharper, or in other words, more discriminating; the penalty of the loss function then acts mainly on regions with high similarity to the positive sample. In contrast, a larger T makes the distribution more uniform, with the penalty acting over a wide range of negative samples. In our method, if a point belongs to the "unknown obstacle" class, the output logit of the binary classifier is high and is used as the reciprocal of the scaling factor T. Consequently, the smaller T sharpens the distribution, concentrating the loss penalty on the one or two normal classes that are easily confused with the "unknown obstacle" class. On the other hand, if a point belongs to a normal class, T is larger, softening the distribution; the loss penalty is then spread over multiple classes, emphasizing the differences between multiple negative samples belonging to normal classes.
A schematic diagram of this adaptive contrastive learning strategy is shown in Figure 7. Specifically, the adaptive scaling factor $T$ is obtained through a binary classifier $\Phi_b$, as follows:

$$T = \frac{1}{\sigma\big(\Phi_b(F_{fuse})\big) + \lambda_T},$$

where $\sigma$ denotes the Sigmoid activation function, $F_{fuse}$ denotes the fusion features extracted through the DHT module, $\Phi_b$ denotes the binary classifier corresponding to the "unknown obstacle" class, and $\lambda_T$ is a constant parameter that is set to 0.4. The contrastive learning loss, $\mathcal{L}_{CL}$, is built in the form of the Kullback-Leibler (KL) divergence:

$$\mathcal{L}_{CL} = D_{KL}\!\left(\mathrm{softmax}\!\left(\frac{\hat{Y}_{fuse}}{T}\right) \,\Big\|\, \mathrm{softmax}\!\left(\frac{\hat{Y}}{T}\right)\right).$$

The total loss function comprises three parts: the 3D network supervised loss, $\mathcal{L}_{3D}$; the 2D network supervised loss, $\mathcal{L}_{2D}$; and the contrastive learning loss, $\mathcal{L}_{CL}$. It is expressed as follows:

$$\mathcal{L}_{total} = \mathcal{L}_{3D} + \lambda_{2D}\,\mathcal{L}_{2D} + \lambda_{CL}\,\mathcal{L}_{CL},$$

where $\lambda_{2D}$ and $\lambda_{CL}$ are constant parameters that are set to 0.1 and 0.05, respectively. The 3D network supervised loss, $\mathcal{L}_{3D}$, is expressed as:

$$\mathcal{L}_{3D} = L_{ce}(\hat{Y}, Y) + \lambda_{lov}\,L_{lovasz}(\hat{Y}, Y),$$

where $L_{ce}$ denotes the cross-entropy loss function, $L_{lovasz}$ denotes the Lovasz softmax loss function [81], $\hat{Y}$ denotes the prediction from the 3D network, $Y$ denotes the 3D ground-truth labels, and $\lambda_{lov}$ is set to 0.1. The 2D network supervised loss, $\mathcal{L}_{2D}$, is expressed as:

$$\mathcal{L}_{2D} = L_{ce}(\hat{Y}_{fuse}, Y_{2D}) + L_{ce}(\hat{Y}_{3D\_proj}, Y_{2D}) + L_{bce}(\hat{Y}_K, Y_K),$$

where $\hat{Y}_{fuse}$ denotes the fusion prediction, $\hat{Y}_{3D\_proj}$ denotes the prediction from the projected 3D features, $\hat{Y}_K$ denotes the binary prediction of whether a point is an "unknown obstacle", $Y_{2D}$ denotes the 2D ground-truth labels projected from the 3D ground-truth labels, $Y_K$ denotes the binary label representing whether a point is an "unknown obstacle", and $L_{bce}$ denotes the binary cross-entropy loss function.
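A minimal PyTorch sketch of this loss follows, assuming the reciprocal form of T given above and treating the fusion prediction as the (detached) teacher; `phi_b` is the binary classifier head, and the helper name is ours.

```python
import torch
import torch.nn.functional as F

def adaptive_cl_loss(logits_3d, logits_fuse, f_fuse, phi_b, lambda_t=0.4):
    """Obstacle anomaly-aware contrastive loss: a high binary-classifier output
    yields a small T, sharpening both logit distributions."""
    t = 1.0 / (torch.sigmoid(phi_b(f_fuse)) + lambda_t)   # per-point adaptive scaling factor
    target = F.softmax(logits_fuse / t, dim=-1).detach()  # fusion prediction as teacher
    log_pred = F.log_softmax(logits_3d / t, dim=-1)       # 3D prediction as student
    return F.kl_div(log_pred, target, reduction="batchmean")
```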

Experiments
In this section, experiments are conducted to evaluate the performance of the proposed method.

Experimental Setup

Dataset
We built a railway point cloud dataset with per-point annotation, covering clear, rainy, and snowy weather. The imaging device was a 905 nm self-developed LiDAR with an angular resolution of 0.065° (Y) and 0.35° (X). The data were collected from the Changping section of the Beijing-Baotou High-Speed Railway. Multiple sets of LiDAR sensors were mounted on trackside signal poles and scanned in a top view, as shown in Figure 8. The average number of points per frame was 484 k, with 2985 frames in total, labeled as rail, sleeper, gravel bed, plant, person, building, signal pole, and unknown obstacle, i.e., eight classes in total. The label distribution is shown in Figure 9. The training set consisted of 1885 frames, with paired images at a 1280 × 720 resolution. A total of 195 frames were used for validation, and 905 frames were used for testing.

Evaluation Metrics
We evaluated model performance mainly using the mean Intersection over Union (mIoU), formulated as follows:

$$\mathrm{mIoU} = \frac{1}{K+1}\sum_{k=0}^{K}\frac{TP_k}{TP_k + FP_k + FN_k},$$

where $K$ is the number of classes, $k$ is the current class, and the "1" in "$K+1$" denotes the outlier point and is generally ignored. $TP$, $FP$, and $FN$ represent true positive, false positive, and false negative, respectively. In the ablation studies, the mean point accuracy (mAcc) is additionally reported, formulated as follows:

$$\mathrm{mAcc} = \frac{1}{K+1}\sum_{k=0}^{K}\frac{TP_k + TN_k}{TP_k + TN_k + FP_k + FN_k},$$

where $TN$ represents true negative.
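For reference, a small NumPy helper computing both metrics from a confusion matrix is sketched below; treating mAcc as the per-class accuracy including true negatives follows the definition above.

```python
import numpy as np

def miou_macc(conf):
    """conf: (K+1)x(K+1) confusion matrix, rows = ground truth, cols = prediction."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp                         # predicted as class k but wrong
    fn = conf.sum(axis=1) - tp                         # class k missed
    tn = conf.sum() - tp - fp - fn
    iou = tp / np.maximum(tp + fp + fn, 1)             # per-class IoU
    acc = (tp + tn) / np.maximum(tp + tn + fp + fn, 1) # per-class accuracy with TN
    return iou.mean(), acc.mean()
```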

Training and Inference Details
The optimizer used was stochastic gradient descent (SGD), and the learning rate scheduler was cosine annealing with warm restarts. The learning rate was set to 0.01, the momentum to 0.9, and the weight decay to 10⁻⁴. Random scaling, random positional shift, random rotation, and dropout were applied for data augmentation. Test-time augmentation (TTA) and model ensembles were not utilized. All models were trained to convergence. All experiments were implemented on an RTX 4090 GPU.
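In PyTorch, this configuration corresponds roughly to the following sketch; the warm-restart periods (T_0 = 9, T_mult = 2) follow Figure A4 in the Appendix, and `model` is a placeholder.

```python
import torch

model = torch.nn.Linear(4, 8)  # placeholder standing in for the full network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
# Cosine annealing with warm restarts: first restart after 9 epochs, period doubled thereafter.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=9, T_mult=2)
```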

Benchmark Results
We compared the proposed method with PVKD [21], Cylinder3D [22], and SPVCNN [23], which were the top three published open-source models (before 1 October 2023) on the large-scale autonomous driving dataset, SemanticKITTI [25]. Additionally, we included MinkowskiNet [24]. All of the models were operated in single-scan mode. The performance of these models on the proposed complex weather railway dataset is shown in Tables 1 and 2.

Table 2. Memory and speed performance on the proposed complex weather railway dataset.
It is worth noting that these models exhibited different performance rankings on the autonomous driving dataset, SemanticKITTI [25], and the proposed railway dataset. This is mainly because the railway dataset focuses more on fine-grained instances: the railway scene contains railway tracks and sleepers, as well as cluttered shrubs, rubble, and other objects that require centimeter-level segmentation boundaries, which leads to the difference in performance compared to the standard road dataset. SPVCNN [23] is a model optimized for the efficiency of sparse matrix operations on large-scale point clouds, and it achieved higher performance than the other models, along with a smaller memory footprint and faster inference. Its traditional U-Net architecture and skip-connected structure proved better suited to railway scenarios that require recognizing small-scale objects. Compared to SPVCNN [23], our model (DHT-CL) benefited from more robust cross-modal neighborhood information extraction in rain and snow and from a training strategy that adaptively adjusts the focus of contrastive learning in response to obstacle anomalies, achieving an mIoU improvement of 3.8% without introducing additional memory or computational load during the inference stage.

Threats to Validity
In this section, we analyze the threats to the validity of the proposed method. On the one hand, threats to internal validity can arise from the poor interpretability of neural network methods. That is, a complex model contains many modules whose individual causal contributions to the system's validity are not obvious, and a number of extraneous variables, i.e., hyperparameters and training settings, can affect the final performance. For example, finer voxel partitioning and larger hidden layers can lead to inconsistent model performance. In relation to the first point, we perform ablation studies on each module in the proposed DHT-CL method, keeping all extraneous variables, such as hyperparameters and training settings, consistent to demonstrate that each module consistently enhances the overall model. In relation to the second point, in the benchmark experiments, the hidden layer size (64) and the voxel size (0.05 m) remained consistent for a fair comparison; all models were also trained using the same number of epochs and training settings and were implemented under the same software and hardware environments. On the other hand, threats to external validity can arise from the complex railway environment, e.g., cluttered wilderness objects, complex weather conditions, and various unknown obstacle intrusion contingencies, which make real-world applications difficult. To improve the generalizability of the proposed method, the data used in the experiments covered a wide range of outdoor railway scenarios, including natural rainy and snowy conditions. In addition, the test data included multi-class obstacles never seen in the training data, demonstrating the practicality of the proposed method in general scenarios.

Comparison with Other Multi-Modal Methods
To further demonstrate the performance of the proposed method, we compared different multi-modal methods. For a fair comparison, all methods were based on the same independent 2D and 3D backbones. xMUDA [76] utilizes a linear layer to align the features of daytime and nighttime cameras and LiDAR sensors to accommodate domain shifts, and 2DPASS [19] fuses 2D and 3D features in an MLP manner. Both approaches lead to over-similarity during multi-modal information transfer. As shown in Table 3, the proposed DHT-CL method achieved a 1.9% improvement in mIoU over the second-best method.

Ablation Study
Table 4 presents the results of the ablation study conducted on the proposed complex weather railway dataset. The baseline, SPVCNN [23], used point cloud input only, achieving an mIoU of 83.57%. After introducing naive contrastive learning between the fusion and 3D modalities, the mIoU increased to 85.07%, positioning it between xMUDA [76] and 2DPASS [19], as discussed in the previous section. The DHT module extracted deeper information and improved the mIoU by 1.2%. The adaptive scaling factor made contrastive learning more suitable for obstacle detection tasks, resulting in an mIoU improvement of 1.1%.

We also investigated how segmentation performance is affected by distance and point cloud density. The distance is defined as the distance to the detector along the direction of the railway track, i.e., the Y-axis. The railway track is divided into segments every 3 m (containing both positive and negative segments), and the mIoU and mAcc at different distances are shown in Figure 10. As the distance increases, the mAcc continuously decreases, whereas the mIoU peaks at 21 m, reaching 88.18%. Points at long distances are mostly labeled as ground, so the metrics do not drop to 0. The visualized results are shown in Figure A3.

Visualization Results
Figure 11 shows the visualized segmentation results in clear weather. Enhanced by DHT-CL, the model acquired the color and geometric structure of the "plant" class and was able to distinguish it from the "unknown obstacle" class. Eight stone obstacles, ranging from large to small, are visible in the bottom right of the image; the baseline model identified three, whereas the enhanced model identified seven. Figure 12 shows the visualized segmentation results in rainy weather. The contour around the stone obstacle is segmented more completely after enhancement by DHT-CL. In addition, the reflectivity contrast between the rail tracks, sleepers, and gravel increased due to rain, and some of the sleepers were misclassified, resulting in false alarms, which were eliminated after the acquisition of color and neighborhood information through DHT-CL.
Figure 13 shows the segmentation results outside the overlapping FOV region of the LiDAR sensors and camera in rainy and snowy weather. The image on the left shows an irregular missing point cloud on a rainy day due to raindrop occlusion, with misidentification occurring near the missing portion. The image on the right shows a snowy day on which objects around the tracks were incorrectly identified as the "plant" class due to snow accumulation. This classification confusion was eliminated through the use of DHT-CL. In short, the main effect of DHT-CL in rain and snow was to reduce the number of false-positive obstacle cases. Moreover, the point clouds in the figure lie outside the overlapping FOV region, and the method still yielded a performance enhancement, suggesting that the camera's access to color information and the advantages of the two imaging perspectives have been internalized into the pure point cloud model.

Rail-Obstacle Detection in Complex Weather
We further evaluated the performance of the rail-obstacle detection task in application, selecting 1000 frames of data for testing. To assess performance as comprehensively as possible, we compensated for the lower probability of rail obstacles in real-world environments by increasing the percentage of obstacle-containing frames in our tests: 411 of the 1000 frames contained rail obstacles. The test data were collected under natural rainy and snowy conditions and included multi-class rail obstacles, such as fallen trees, pedestrians, and irregular stones, with a minimum size of 7 × 7 × 7 cm³ (which meets the Chinese rail-obstacle detection standard). Two metrics, the missed alarm (MA) rate and the false alarm (FA) rate, were utilized for the performance evaluation. The formulations are as follows:

$$\mathrm{MA} = \frac{FN}{TP + FN}, \qquad \mathrm{FA} = \frac{FP}{TP + FP},$$

where TP denotes true positive, i.e., correctly predicted obstacle frames; TN denotes true negative, i.e., correctly predicted non-obstacle frames; FP denotes false positive, i.e., falsely reported obstacle frames; and FN denotes false negative, i.e., missed obstacle frames.
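The frame-level computation is simple but worth pinning down; reading FA as the share of raised alarms that are spurious follows the note below about alarm confidence. A sketch, with hypothetical counts chosen only to be consistent with the rates reported in Table 5:

```python
def alarm_rates(tp: int, fp: int, fn: int):
    """Frame-level missed alarm (MA) and false alarm (FA) rates as defined above."""
    ma = fn / (tp + fn)   # fraction of obstacle frames that raised no alarm
    fa = fp / (tp + fp)   # fraction of raised alarms that were spurious
    return ma, fa

# Hypothetical example: 411 obstacle frames all detected (MA = 0.00%) and
# 2 spurious alarms out of 413 raised (FA approximately 0.48%).
print(alarm_rates(tp=411, fp=2, fn=0))
```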
The results of rail-obstacle detection in complex weather are shown in Table 5. Note that the FA metric does not mean that 0.48% of the total number of tests will be false alarms but rather that the confidence level of the reported alarms is 99.52%. Overall, the experimental results show that the LiDAR + camera method, i.e., the DHT-CL model, significantly reduced the missed alarm and false alarm rates compared to the LiDAR-only method, i.e., the baseline model with point clouds only. The missed alarms in the LiDAR-only method were mostly caused by the inability to recognize small-sized obstacles, and its false alarms were mostly caused by segmentation confusion due to changes in reflectivity on rainy and snowy days. The false alarms in the LiDAR + camera method were caused by extremely heavy rainfall, during which the sensor was out of action, as shown in Figure A2. In addition, the detailed workflow of the algorithm for generating obstacle alarms is shown in Appendix B.

Conclusions
In this paper, a multi-modal contrastive learning strategy, named DHT-CL, is proposed to enhance the performance of point cloud 3DSS in complex weather, improving the robustness of rail-obstacle detection in real-world scenarios. The contrastive learning strategy is guided by a sliding kernel-based attention mechanism, which extracts neighborhood cross-modal information and exhibits superior performance in rainy and snowy conditions. An adaptive contrastive learning strategy is developed for rail-obstacle detection, which can also be applied to more general scenarios given certain prior knowledge. The proposed DHT-CL strategy achieved an mIoU of 87.38% in the full-scene segmentation of railway point clouds under complex weather conditions. In addition, it achieved a missed alarm rate of 0.00% and a false alarm rate of 0.48% in the rail-obstacle detection task under adverse weather conditions. Our future work will refine the functionality of the execution module, e.g., excluding false alarms through multi-frame voting and addressing dynamic scenes, and implement model compression suitable for the computational resources of mobile devices.

Figure 1. Railway point clouds in sunny and rainy weather. The top is sunny and the bottom is rainy. The point clouds are coloured by light intensity (strong to weak corresponds to red to blue).

Figure 2. Distribution of railway point cloud intensity under different weather conditions.

Figure 3. Receptive fields of 2D and 3D networks: (a) 2D network receptive field, (b) projecting the point cloud onto the image, (c) re-projecting the 2D network receptive field into 3D, (d) 3D network receptive field. The re-projected 2D receptive field does not coincide with the 3D receptive field. Orange indicates receptive fields and blue indicates background.

Figure 4. Overview of DHT-CL. The point clouds and the images are processed independently by 2D and 3D encoding networks to generate the corresponding 2D and 3D features. Then, the DHT module extracts deeper information from these features, delivering the fusion features. Modality-independent classifiers generate two prediction scores, upon which the obstacle anomaly-aware modality discrimination loss is constructed. All processes are supervised by 3D labels, with only the 3D branch activated during the inference stage. Raw point clouds are coloured by intensity and labels are coloured by different object classes.

Figure 5. Framework of the DHT module. Cross-attention is applied twice to the 2D and pseudo-2D features.

Figure 6. Schematic diagram of the local attention mechanism within the DHT module. (a) Selection of an anchor point, represented as a query vector Q (in red). (b) Search for neighborhood points (in blue) around this anchor point within a sliding window (a 5 × 5 kernel is shown in this diagram). Note that neighborhood points may be missing due to the sparsity of the point cloud. The missing points are indicated in gray. (c) Omission of the missing points by marking them as −1 in the GPU Hash table-based neighborhood address query operation. (d) Flattening of the irregular matrix for use as a key vector, then computing the inner product between the query vector Q (in red) and the key vector K (in blue) to derive the attention weights. (e) Weighting of the value vector V by the weights derived from QK^T. (f) Updating the center element of the sliding window to produce the final output.

Figure 7. Schematic diagram of the adaptive contrastive learning strategy.

Figure 10. mIoU and mAcc at different distances and point cloud densities.

Figure 11. The segmentation results of DHT-CL in clear weather. Colour meanings are as follows: purple: rail track; light blue: sleeper; cyan: gravel bed; green: plant; salmon red: unknown obstacle.

Figure 12. The segmentation results of DHT-CL in rainy weather: (a) Image in rain. (b) Pure 3D net baseline without DHT-CL. (c) Enhanced by DHT-CL. (d) Ground-truth labels.

Figure 13. The segmentation results of DHT-CL outside the FOVs in rainy and snowy weather: (a) Pure 3D net baseline without DHT-CL. (b) Enhanced by DHT-CL. (c) Ground-truth labels. Left is raining and right is snowing.

Model Convergence
The mIoU and mAcc values obtained on the validation set, varying with the epoch, are shown in Figure 14. The curve shows that the model converged well during training up to 63 epochs. The two concave valleys in the mIoU-epoch curve (around the 9th and 30th epochs) come from restarting the learning rate. Our learning rate strategy is shown in Figure A4. The total losses, varying with the epoch on the validation set and the step on the training set, are shown in Figures 15 and 16, respectively. Due to the large-scale absence of point clouds under rainy and snowy conditions, as shown in Figure A5, i.e., the presence of off-distribution noise in the training data, anomalous gradients were generated at certain steps, resulting in spiky noise in the training losses, but the overall tendency was to stabilize.
The segmentation results for multi-class obstacles are shown in Figure A1. Figure A2 shows sensor failures under extreme, heavy-rain conditions, causing false alarms. The segmentation results for the full-scale point cloud are shown in Figure A3. Note that approximately 51.2 × 5.7 m² of the high-quality point cloud area was taken into account to provide the obstacle intrusion alarm. The learning rate, varying with the epoch, is shown in Figure A4. Cosine annealing with the warm-restart strategy was utilized, with the initial period set to 9 epochs and multiplied by 2 at each restart. Figure A5 shows noise that was severely off-distribution in the training set, leading to spiky noise in the loss, but the model eventually converged well.

Figure A5. Off-distribution noise in the training data. The point clouds are coloured by light intensity (strong to weak corresponds to red to blue).

Figure A7. Step 2. Per-point labels of the original point clouds are generated by the recognition network.

Figure A8. Step 3. The RoI (between the two red lines), i.e., the surveillance area, is delineated according to the location of the railway tracks.

Figure A9. Step 4. The targets within the surveillance area are filtered and identified as potential threats.

Figure A10. Step 5. The volume and location of each obstacle are calculated to produce the final alarms.
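Taken together, Figures A7-A10 describe the per-frame alarm workflow. A minimal sketch of Steps 3-5 follows; the use of DBSCAN for instance grouping and all parameter values are our illustrative assumptions, as the text does not specify them.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def obstacle_alarms(points, labels, roi_mask, obstacle_id=7, min_size=0.07):
    """Sketch of Steps 3-5: keep 'unknown obstacle' points inside the RoI,
    group them into instances, and report the volume and location of each.
    obstacle_id, DBSCAN, and all thresholds are illustrative stand-ins."""
    pts = points[(labels == obstacle_id) & roi_mask]        # Steps 3-4: RoI + threat filter
    if len(pts) == 0:
        return []
    ids = DBSCAN(eps=0.1, min_samples=5).fit_predict(pts)   # instance grouping
    alarms = []
    for c in set(ids) - {-1}:                               # -1 marks DBSCAN noise
        cluster = pts[ids == c]
        extent = cluster.max(axis=0) - cluster.min(axis=0)  # bounding-box size
        if (extent >= min_size).all():                      # 7 x 7 x 7 cm standard
            alarms.append({"location": cluster.mean(axis=0),
                           "volume": float(np.prod(extent))})
    return alarms
```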

Table 1. Semantic segmentation results on the proposed complex weather railway dataset. Only methods published and open source before 1 October 2023 on SemanticKITTI were compared, without utilizing test-time augmentation (TTA) or model ensembles. All experiments were conducted under the same software and hardware environments. Data are expressed as %. * denotes our method.

Table 3. Comparison with other multi-modal methods on the proposed complex weather railway dataset. All methods were based on the same 2D and 3D frameworks. Data are presented as %. * denotes our method.

Table 4. Ablation study on the proposed complex weather railway dataset.

Table 5. Rail-obstacle detection results in complex weather.