1. Introduction
Reliable long-range onboard perception is a fundamental prerequisite for future railway safety systems. Unlike road vehicles, trains operate with significantly longer braking distances, limited maneuverability, and a tightly constrained rail corridor geometry. Consequently, potential obstacles must be detected well before they pose immediate intrusion risks. This requirement is particularly demanding in mainline railway environments, where safety-critical objects often appear at long ranges, within narrow clearance regions, and against complex trackside backgrounds. Therefore, railway obstacle perception should be treated not as a direct extension of open-road autonomous-driving detection, but as a structured sensing problem governed by railway geometry, onboard sensor topology, and application-specific evaluation metrics.
LiDAR is well suited for railway perception. Its active metric ranging and ability to map 3D structure are key advantages [1,2,3]. However, long-range LiDAR sensing is intrinsically challenging. As distance increases, effective point density degrades. Far-field targets become sparse and fragmented in returns. In railway scenes, this sparsity is made worse by persistent structural clutter. Examples include catenary poles, signal equipment, vegetation, platform edges, and elongated trackside infrastructure. Small but safety-critical objects near the rail corridor may have weaker geometric signatures than visually prominent but irrelevant background structures. This imbalance between geometric salience and operational relevance is a core challenge for long-range railway obstacle perception [4].
Most existing 3D object detection frameworks have been developed for automotive road scenes, where object localization is typically performed over a broad, relatively isotropic bird’s-eye-view (BEV) space. Although voxel-based, pillar-based, and center-based LiDAR detectors have established strong baselines for outdoor 3D perception [5,6,7,8], their underlying spatial assumptions do not fully align with the constraints of railway sensing. In railway environments, the operational significance of a detected object is strictly conditioned by its spatial relationship to the rail corridor and the clearance envelope. An object located within the rail corridor carries fundamentally different safety implications than a similar target outside it. Thus, uniformly processing the entire BEV plane risks diluting rail-relevant features and increasing vulnerability to off-corridor clutter.
A unified BEV representation provides a natural spatial framework for multi-sensor railway perception, as it maps geometric structure, visual cues, rail geometry, and object locations into a common coordinate system [9,10]. For railway scenes, this property is invaluable: the BEV plane aligns seamlessly with the track-oriented operating space, thereby facilitating corridor-aware reasoning. However, generic BEV fusion alone is insufficient. A dedicated railway perception framework must also encode rail geometry, account for operational domain variations, and distinguish standard reproducible detection benchmarks from stricter, railway-specific localization diagnostics. This distinction is vital because overlap-based 3D metrics can become unstable for small, distant objects, whereas strict rail-corridor localization remains exceptionally challenging under sparse far-field returns [4,11].
The organization of sensor data is equally critical. Modern railway datasets provide calibrated and synchronized onboard sensing streams, enabling the investigation of multimodal perception under realistic operating conditions [12]. However, directly deploying heavy surround-view fusion architectures designed for road vehicles is sub-optimal for railway applications. The forward rail corridor represents the dominant safety-critical field of view, where LiDAR remains the most reliable source of metric geometry. A more effective strategy is to maintain LiDAR as the primary sensing modality while utilizing a front-center RGB camera to provide lightweight, auxiliary visual cues [13,14,15,16]. In this configuration, calibrated LiDAR-to-image projection serves as the geometric interface between modalities, allowing front-view visual features to be explicitly associated with LiDAR structures prior to BEV projection.
To address these challenges, this paper presents Rail-BEV, a LiDAR-centric, sensor-aware BEV perception framework for long-range railway obstacle detection. The framework adopts a forward-facing sensing configuration in which LiDAR provides the primary geometric representation, and a single front-center RGB camera supplies aligned auxiliary visual features via calibrated LiDAR-to-image projection. The aligned geometric and visual cues are integrated within a railway-oriented BEV backend that combines geometry-aware fusion, rail-geometry prediction, and lightweight inference-time structural refinement. Rather than treating Rail-BEV as a generic open-road detector, this study frames it as a dedicated railway sensing pipeline that jointly optimizes onboard sensor configurations, cross-modal geometric alignment, rail-corridor topology, operational domain behaviors, and protocol-aware evaluation metrics.
The main contributions of this study are summarized as follows:
First, Rail-BEV is formulated as a LiDAR-centric and sensor-aware railway perception framework. Instead of adopting a heavy surround-view fusion design, the proposed framework preserves LiDAR as the dominant geometric modality and uses only the front-center RGB camera as lightweight auxiliary visual evidence. This asymmetric sensing hierarchy reflects the forward-looking nature of railway safety perception and the central role of metric geometry in long-range obstacle recognition.
Second, calibrated LiDAR-to-image projection and geometry-aware BEV fusion are incorporated to associate front-view visual cues with LiDAR structure. The resulting representation organizes sparse geometric evidence and dense visual information in a unified railway-oriented BEV space, allowing visual assistance to complement far-range point-cloud responses without weakening the LiDAR-centered sensing hierarchy.
Third, a rail-geometry branch and inference-time rail-aware refinement are introduced to encode corridor-level railway structure. The rail-geometry branch is not designed as a standalone high-fidelity rail segmentation module; rather, it provides a coarse structural prior that encourages the BEV representation to preserve the spatial relationship between candidate obstacles and the rail corridor.
Finally, the experimental evaluation is reorganized under a unified sequence-level benchmark with fixed data partitions, identical training settings, and independent test-set assessment. The revised ablation study isolates the effects of front-view RGB assistance, ROI-based rail-corridor refinement, and track-consistency refinement without relying on validation-set model selection. Representative comparisons with LiDAR-only and multimodal BEV baselines are further included to place Rail-BEV within a reproducible railway perception benchmark.
3. Methodology
3.1. Sensor Configuration and Overall Framework
Rail-BEV is formulated as a LiDAR-centric and sensor-aware bird’s-eye-view (BEV) perception framework for long-range railway obstacle detection. The framework follows a frontal-clean onboard sensing configuration, in which LiDAR is retained as the primary geometric sensing modality and only the front-center RGB camera is used as lightweight auxiliary visual evidence. This design reflects the forward-looking nature of railway safety perception: the most safety-critical region is concentrated along the rail corridor, while LiDAR remains the most reliable source of metric geometry for long-range spatial reasoning [12,23,26,27]. The detailed sensor layout and configuration parameters are provided in Figure S1 and Table S1 of the Supplementary Materials.
As illustrated in Figure 1, Rail-BEV organizes heterogeneous onboard sensing signals into a unified railway-oriented BEV pipeline. The LiDAR stream first establishes the dominant BEV geometric representation. The front-center RGB stream provides auxiliary appearance evidence, which is associated with LiDAR geometry through calibrated LiDAR-to-image projection. The aligned visual and geometric cues are then integrated by a geometry-aware BEV fusion module. The fused representation is forwarded to two parallel task branches: a 3D object-detection head for railway obstacle perception, and a rail-geometry branch for structural corridor modeling. Finally, lightweight inference-time structural refinement is applied to improve the practical use of rail-consistent spatial information. The detailed stage-wise data flow of the framework is presented in Figure 2.
The proposed framework is therefore not intended as a heavy multimodal detector based on full surround-view camera fusion. Instead, it is designed as a LiDAR-dominant railway sensing backend in which front-view visual evidence, rail geometry, and BEV reasoning are coupled under a compact and deployment-oriented sensing hierarchy [9,13,14].
3.2. Frontal-Clean Sensor Data Organization
To stabilize the sensor interface for railway perception, Rail-BEV adopts a frontal-clean data organization strategy. Instead of using all available image streams, the data pipeline retains only the front-center RGB image and pairs it with the corresponding LiDAR frame and calibrated LiDAR-to-image projection matrix. This organization reduces cross-view fusion complexity while preserving the forward visual evidence that is most relevant to railway obstacle perception [23].
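For each time step $t$, the sensor input is represented as a triplet consisting of the LiDAR point cloud, the front-center RGB image, and the calibrated projection matrix:
$$\mathcal{X}_t = \{P_t,\ I_t,\ T\},$$
where $P_t$ denotes the current LiDAR point cloud, $I_t$ denotes the aligned front-center image, and $T$ denotes the LiDAR-to-image projection matrix. For a LiDAR point represented in homogeneous coordinates as $\mathbf{p} = [x, y, z, 1]^{\top}$, its image-plane correspondence is obtained by
$$d\,[u, v, 1]^{\top} = T\,\mathbf{p},$$
where $(u, v)$ are the pixel coordinates and $d$ is the projective depth.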
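The projection step described in this subsection can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `project_lidar_to_image`, the toy pinhole intrinsics, and the identity extrinsics (points already expressed with +z pointing forward) are all hypothetical, and a real calibration would fold the LiDAR-to-camera rotation and translation into the 3 x 4 matrix.

```python
import numpy as np

def project_lidar_to_image(points_xyz, proj, image_size):
    """Project N x 3 points with a 3 x 4 matrix; return pixel coords and a validity mask."""
    n = points_xyz.shape[0]
    homo = np.hstack([points_xyz, np.ones((n, 1))])   # N x 4 homogeneous points
    cam = homo @ proj.T                                # N x 3, last column is projective depth
    depth = cam[:, 2]
    in_front = depth > 1e-6                            # points behind the camera are invalid
    safe_depth = np.where(in_front, depth, 1.0)        # avoid dividing by zero or negatives
    uv = cam[:, :2] / safe_depth[:, None]
    width, height = image_size
    inside = (
        (uv[:, 0] >= 0) & (uv[:, 0] < width) &
        (uv[:, 1] >= 0) & (uv[:, 1] < height)
    )
    return uv, in_front & inside

# Hypothetical pinhole camera: focal length 1000 px, principal point (960, 600).
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 600.0],
              [0.0, 0.0, 1.0]])
proj = K @ np.hstack([np.eye(3), np.zeros((3, 1))])   # assumed identity extrinsics

points = np.array([[0.0, 0.0, 50.0],     # on the optical axis, 50 m ahead
                   [1.0, 0.0, 100.0],    # 1 m to the side, 100 m ahead
                   [0.0, 0.0, -5.0]])    # behind the sensor
uv, valid = project_lidar_to_image(points, proj, (1920, 1200))
```

In this toy setup, a 1 m lateral offset at 100 m moves the projection by only 10 pixels, which illustrates why far-field image evidence has to be associated with LiDAR geometry carefully.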
3.3. LiDAR-Centric BEV Perception Backbone
The perception backbone follows a LiDAR-centered BEV design. The LiDAR branch converts the point cloud into a structured BEV feature representation through voxel- or pillar-style encoding and BEV feature extraction [5,6]. This branch provides the primary geometric representation for downstream railway perception, because LiDAR directly preserves metric depth and spatial structure even when visual appearance is ambiguous. The LiDAR BEV feature at time $t$ is expressed as
$$F^{L}_t = \mathcal{E}_{L}(P_t),$$
where $P_t$ is the LiDAR point cloud at time $t$, $\mathcal{E}_{L}$ denotes the LiDAR BEV encoder, and $F^{L}_t$ is the resulting BEV feature tensor. This representation serves as the dominant feature space of the framework. The detection and rail-geometry heads operate after sensor-aware fusion, but the geometric backbone remains LiDAR-centered throughout the pipeline.
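To make the idea of a pillar-style BEV grid concrete, the sketch below rasterizes a point cloud into per-cell point counts. It only illustrates the discretization step; it is not the learned encoder used in Rail-BEV, and the function name `bev_point_count`, the grid extents, and the 0.5 m cell size are assumptions chosen for the example.

```python
import numpy as np

def bev_point_count(points_xyz, x_range, y_range, cell):
    """Count LiDAR points per BEV cell (x forward, y lateral); z is ignored."""
    nx = int(round((x_range[1] - x_range[0]) / cell))
    ny = int(round((y_range[1] - y_range[0]) / cell))
    ix = np.floor((points_xyz[:, 0] - x_range[0]) / cell).astype(int)
    iy = np.floor((points_xyz[:, 1] - y_range[0]) / cell).astype(int)
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)   # drop points outside the grid
    grid = np.zeros((nx, ny), dtype=np.int32)
    np.add.at(grid, (ix[keep], iy[keep]), 1)               # unbuffered add handles repeated cells
    return grid

# Hypothetical forward-looking grid: 0 to 200 m ahead, 20 m to each side.
points = np.array([[10.2, 0.1, 0.0],
                   [10.4, 0.3, 0.0],
                   [150.0, -3.0, 1.0],
                   [300.0, 0.0, 0.0]])    # beyond the grid, discarded
grid = bev_point_count(points, (0.0, 200.0), (-20.0, 20.0), 0.5)
```

A learned pillar or voxel encoder replaces the raw count with a feature vector per cell, but the indexing logic is the same.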
3.4. Geometry-Aware BEV Fusion
The front-center RGB camera is used to provide auxiliary visual cues, rather than to replace LiDAR geometry. The image branch extracts features from the front-view image, and the calibrated projection matrix is used to associate image-space information with LiDAR geometry [9,13,14,15,16]. This design introduces visual evidence into the BEV domain while maintaining a physically interpretable sensing hierarchy.
The front-view image feature is defined as
$$F^{I}_t = \mathcal{E}_{I}(I_t),$$
where $I_t$ is the front-center image and $\mathcal{E}_{I}$ denotes the image feature extractor. The geometry-aware BEV fusion process is then formulated as
$$F^{\mathrm{fuse}}_t = \Phi\big(F^{L}_t,\ F^{I}_t,\ T\big),$$
where $\Phi$ denotes the fusion operator that incorporates image evidence into the LiDAR-centered BEV representation $F^{L}_t$ using the calibrated projection relationship $T$. The fused representation $F^{\mathrm{fuse}}_t$ preserves the geometric dominance of LiDAR while allowing front-view visual cues to complement sparse far-field point-cloud evidence. This formulation follows the frontal-clean sensing design: the system benefits from lightweight camera assistance without introducing the complexity of full multi-camera BEV fusion.
3.5. Rail-Geometry Branch as a Coarse Structural Prior
Railway obstacle relevance is strongly coupled with rail-corridor geometry. To encode this structure, Rail-BEV introduces a rail-geometry branch that predicts rail-related control points from BEV features. The branch acts as an auxiliary structural head and is optimized jointly with the object-detection branch. Its role is not to provide high-fidelity dense rail segmentation but to supply a coarse corridor-level structural prior that encourages the shared BEV representation to preserve rail-related spatial organization useful for railway obstacle perception [24,25].
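Let the predicted and reference rail-control-point sets be defined as
$$\hat{Q} = \{\hat{q}_i\}_{i=1}^{N}, \qquad Q = \{q_j\}_{j=1}^{M}.$$
The geometric discrepancy between the predicted and reference rail structures is measured by the Chamfer distance [28]:
$$d_{\mathrm{CD}}(\hat{Q}, Q) = \frac{1}{|\hat{Q}|}\sum_{\hat{q}\in\hat{Q}}\min_{q\in Q}\lVert \hat{q}-q\rVert_2^2 + \frac{1}{|Q|}\sum_{q\in Q}\min_{\hat{q}\in\hat{Q}}\lVert q-\hat{q}\rVert_2^2.$$
The overall training objective combines the detection loss and the rail-geometry loss:
$$\mathcal{L} = \mathcal{L}_{\mathrm{det}} + \lambda\,\mathcal{L}_{\mathrm{rail}},$$
where $\mathcal{L}_{\mathrm{det}}$ denotes the object-detection loss, $\mathcal{L}_{\mathrm{rail}}$ denotes the rail-geometry loss derived from the predicted rail structure, and $\lambda$ controls the contribution of the structural branch. This formulation keeps the rail branch as an auxiliary geometric constraint and avoids reducing the detection objective to a single bounding-box regression term.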
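A symmetric Chamfer distance between two control-point sets, and the weighted sum of the two loss terms, can be sketched as follows. The squared-distance, mean-normalized variant shown here is one common convention and is an assumption; the exact variant used for the rail branch may differ. The names `chamfer_distance` and `total_loss` and the weight 0.5 are hypothetical.

```python
import numpy as np

def chamfer_distance(pred, ref):
    """Symmetric Chamfer distance between point sets of shape (N, D) and (M, D)."""
    diff = pred[:, None, :] - ref[None, :, :]      # N x M x D pairwise differences
    sq = np.sum(diff ** 2, axis=-1)                # N x M squared Euclidean distances
    pred_to_ref = sq.min(axis=1).mean()            # each predicted point to its nearest reference
    ref_to_pred = sq.min(axis=0).mean()            # each reference point to its nearest prediction
    return float(pred_to_ref + ref_to_pred)

def total_loss(det_loss, rail_loss, lam):
    """Detection loss plus a weighted auxiliary rail-geometry term."""
    return det_loss + lam * rail_loss

pred_pts = np.array([[0.0, 0.0], [10.0, 0.0]])
ref_pts = np.array([[0.0, 1.0], [10.0, 0.0]])
cd = chamfer_distance(pred_pts, ref_pts)
loss = total_loss(2.0, cd, 0.5)                    # 0.5 is an illustrative weight only
```

Because the distance is computed between unordered sets, the predicted and reference control points do not need to be in one-to-one correspondence, which suits a coarse structural prior.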
3.6. Inference-Time Structural Refinement
In addition to the neural BEV backend, Rail-BEV uses lightweight inference-time structural refinement to improve the practical use of railway-consistent spatial information. The refinement stage operates after raw predictions are generated and uses the predicted rail geometry to adjust detection behavior in a structurally informed manner. Its purpose is to suppress rail-inconsistent responses and recover rail-consistent candidates without imposing a fixed universal confidence-decay rule.
Let $D_{\mathrm{raw}}$ denote the raw detection output from the BEV detector, and let $\hat{G}$ denote the predicted rail structure. The refined output is formulated as
$$D_{\mathrm{ref}} = \mathcal{R}_{\mathrm{TC}}\big(\mathcal{R}_{\mathrm{ROI}}(D_{\mathrm{raw}}, \hat{G}),\ \hat{G}\big),$$
where $\mathcal{R}_{\mathrm{ROI}}$ denotes rail-corridor geometry refinement and $\mathcal{R}_{\mathrm{TC}}$ denotes track-consistency recovery. This composition describes structural refinement at the system level. It replaces the older fixed off-rail confidence-penalty formulation and avoids treating rail consistency as a hard binary mask. Under the revised independent-test evaluation, ROI-based rail-corridor refinement is interpreted as the main stable structural component, whereas the additional track-consistency step is treated as a supplementary refinement variant with limited incremental benefit in the current experiments.
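The text does not spell out the refinement rules, so the sketch below covers only a geometric primitive that a corridor-aware refinement step would plausibly need: the shortest BEV distance from each detection center to the predicted rail polyline. The function name `distance_to_polyline` and the example coordinates are hypothetical, and how the distance feeds into score adjustment or candidate recovery is deliberately left open.

```python
import numpy as np

def distance_to_polyline(points, polyline):
    """Shortest 2D distance from each point (N x 2) to a polyline (K x 2, K >= 2)."""
    a = polyline[:-1]                                   # segment start points, (K-1) x 2
    ab = polyline[1:] - a                               # segment direction vectors
    ap = points[:, None, :] - a[None, :, :]             # N x (K-1) x 2
    denom = np.sum(ab * ab, axis=1)
    denom = np.where(denom > 0, denom, 1.0)             # guard against zero-length segments
    t = np.clip(np.sum(ap * ab[None], axis=2) / denom[None], 0.0, 1.0)
    closest = a[None] + t[..., None] * ab[None]         # closest point on every segment
    dist = np.linalg.norm(points[:, None, :] - closest, axis=2)
    return dist.min(axis=1)

# Hypothetical predicted rail centerline (x forward, y lateral, meters).
rail = np.array([[0.0, 0.0], [100.0, 0.0], [200.0, 10.0]])
centers = np.array([[50.0, 1.5],      # close to the track
                    [150.0, 20.0],    # well off the curved section
                    [-10.0, 0.0]])    # behind the first control point
lateral = distance_to_polyline(centers, rail)
```

Using a continuous distance rather than a thresholded mask is consistent with the stated aim of not treating rail consistency as a hard binary mask.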
3.7. Offline Class-Balanced Resampling
Railway obstacle datasets often exhibit long-tailed category distributions. Frequent or background-associated categories may dominate the optimization process, while rare but safety-critical obstacle instances remain underrepresented. To reduce this imbalance without increasing inference-time complexity, Rail-BEV adopts offline class-balanced resampling during data preparation.
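Let $n_c$ denote the number of training samples associated with category $c$. The reweighted sampling probability is defined as
$$p_c = \frac{n_c^{-\gamma}}{\sum_{k} n_k^{-\gamma}},$$
where $\gamma$ controls the strength of class rebalancing. The corresponding learning objective can be expressed as
$$\min_{\theta}\ \mathbb{E}_{(x,\,y)\sim\tilde{\mathcal{D}}}\big[\mathcal{L}(\theta;\, x,\, y)\big],$$
where $\tilde{\mathcal{D}}$ denotes the resampled training distribution and $\mathcal{L}$ is the training loss.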
Because this rebalancing is performed before training, it does not introduce additional computational cost during inference. In the present framework, class-balanced resampling is treated as a supporting training strategy rather than as the sole driver of the final benchmark gain.
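One common way to realize offline class-balanced resampling is to weight each training sample by an inverse power of its class count. The sketch below assumes that form; the exponent name `gamma`, the function `resampling_probs`, and the toy label list are illustrative and are not taken from the released pipeline.

```python
import numpy as np

def resampling_probs(labels, gamma):
    """Per-sample sampling probabilities with weight n_c ** (-gamma) for class c."""
    labels = np.asarray(labels)
    _, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
    weights = counts[inverse].astype(float) ** (-gamma)   # weight of each sample's class
    return weights / weights.sum()

# Toy long-tailed label list: 8 frequent samples, 2 rare ones.
labels = ["pedestrian"] * 8 + ["obstacle"] * 2
p_none = resampling_probs(labels, gamma=0.0)   # no rebalancing: uniform over samples
p_full = resampling_probs(labels, gamma=1.0)   # full rebalancing: equal mass per class

# Drawing an offline resampled index list with a fixed seed for reproducibility.
rng = np.random.default_rng(0)
resampled_idx = rng.choice(len(labels), size=len(labels), replace=True, p=p_full)
```

With `gamma=0` the original distribution is kept, and with `gamma=1` each class receives the same total probability mass; intermediate values trade off between the two.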
3.8. Stage-Wise Training Strategy
Rail-BEV adopts a stage-wise training strategy to improve optimization stability and separate different aspects of the sensing pipeline. The first stage establishes a geometry-reliable LiDAR backbone. The second stage adapts the LiDAR-centered detector to real railway sensing conditions. The third stage introduces the frontal-clean visual configuration and fine-tunes the sensor-aware BEV fusion pipeline using aligned LiDAR-RGB assets.
This process can be expressed as
$$\theta_{1}\ \rightarrow\ \theta_{2}\ \rightarrow\ \theta_{3}.$$
Here, $\theta_{1}$, $\theta_{2}$, and $\theta_{3}$ denote the parameters obtained after LiDAR-centered pretraining, real-domain adaptation, and sensor-aware fine-tuning, respectively. This staged design follows the sensing hierarchy of the framework: stable LiDAR geometry is learned first, and lightweight visual assistance is introduced only after the core BEV detector has become sufficiently reliable.
3.9. Main Benchmark and Supplementary Diagnostics
The main benchmark of Rail-BEV is based on range-stratified center-distance AP/mAP together with rail BEV mIoU. This choice is motivated by the difficulty of evaluating small and distant railway objects using strict overlap-based criteria [11,29]. A prediction is matched to a reference object when the Euclidean center-distance criterion is satisfied:
$$\lVert \hat{\mathbf{c}} - \mathbf{c} \rVert_2 \le \tau_{k},$$
where $\hat{\mathbf{c}}$ and $\mathbf{c}$ denote the predicted and reference object centers, respectively, and $\tau_{k}$ is the class-specific center-distance threshold for class $k$.
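The center-distance criterion can be turned into true-positive flags with a greedy, score-ordered assignment. The sketch below assumes that each reference object may be matched at most once and that higher-scoring predictions are matched first, which is a common convention for center-distance AP but is not stated in the text; the function name and the 2 m threshold are hypothetical.

```python
import numpy as np

def match_by_center_distance(pred_centers, pred_scores, ref_centers, tau):
    """Return a boolean true-positive flag per prediction under a center-distance threshold."""
    pred_centers = np.asarray(pred_centers, dtype=float)
    ref_centers = np.asarray(ref_centers, dtype=float)
    is_tp = np.zeros(len(pred_centers), dtype=bool)
    if len(ref_centers) == 0:
        return is_tp
    taken = np.zeros(len(ref_centers), dtype=bool)
    for i in np.argsort(-np.asarray(pred_scores)):        # highest score first
        d = np.linalg.norm(ref_centers - pred_centers[i], axis=1)
        d[taken] = np.inf                                 # each reference is used once
        j = int(np.argmin(d))
        if d[j] <= tau:
            taken[j] = True
            is_tp[i] = True
    return is_tp

preds = [[100.0, 0.0], [100.5, 0.0], [60.0, 3.0]]
scores = [0.6, 0.9, 0.8]
refs = [[100.2, 0.0], [80.0, 0.0]]
flags = match_by_center_distance(preds, scores, refs, tau=2.0)
```

Here the two predictions near 100 m compete for the same reference object, and only the higher-scoring one is counted as a true positive; the lower-scoring duplicate becomes a false positive.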
Rail-geometry consistency is evaluated by rasterizing predicted and reference rail structures into BEV occupancy masks. The rail BEV mIoU is calculated as follows:
$$\mathrm{mIoU}_{\mathrm{rail}} = \frac{|\hat{M} \cap M|}{|\hat{M} \cup M| + \epsilon},$$
where $\hat{M}$ and $M$ denote the predicted and reference BEV rail masks, and $\epsilon$ is a small constant used for numerical stability.
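The mask-overlap computation can be sketched for a single pair of binary BEV masks. The function name `rail_bev_iou` and the default value of the stabilizing constant are assumptions; averaging over frames or sequences to obtain the reported mean is not shown.

```python
import numpy as np

def rail_bev_iou(pred_mask, ref_mask, eps=1e-6):
    """Intersection over union of two binary BEV occupancy masks of equal shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    ref = np.asarray(ref_mask, dtype=bool)
    inter = np.logical_and(pred, ref).sum()
    union = np.logical_or(pred, ref).sum()
    return float(inter / (union + eps))        # eps keeps the empty-mask case finite

pred_mask = np.array([[1, 1, 0],
                      [0, 1, 0]])
ref_mask = np.array([[1, 0, 0],
                     [0, 1, 1]])
iou = rail_bev_iou(pred_mask, ref_mask)        # 2 shared cells out of 4 occupied cells
empty = rail_bev_iou(np.zeros((2, 3)), np.zeros((2, 3)))
```

Because rails are thin structures, a lateral offset of one or two cells removes most of the intersection, which is one reason coarse rail predictions score low on this metric.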
In addition to the main benchmark, RA-AP and RA-mAP are retained as supplementary railway-oriented localization diagnostics. These diagnostics are not used as the primary headline metric, but they help expose the remaining difficulty of strict rail-corridor localization [11]. For a predicted box and a reference box, the decoupled elliptical localization distance is defined as follows:
$$d_{\mathrm{RA}} = \sqrt{\left(\frac{\Delta_{\mathrm{lon}}}{\tau_{\mathrm{lon}}}\right)^{2} + \left(\frac{\Delta_{\mathrm{lat}}}{\tau_{\mathrm{lat}}}\right)^{2}}.$$
Here, $\Delta_{\mathrm{lon}}$ and $\Delta_{\mathrm{lat}}$ denote the longitudinal and lateral center deviations in the rail-oriented coordinate frame, while $\tau_{\mathrm{lon}}$ and $\tau_{\mathrm{lat}}$ denote class-specific longitudinal and lateral tolerances. A prediction is considered matched under this diagnostic rule when
$$d_{\mathrm{RA}} \le 1.$$
This evaluation design separates reproducible center-distance detection performance from stricter railway-oriented localization analysis. Consequently, center-distance AP/mAP and rail BEV mIoU are treated as the main benchmark metrics, while RA-AP and RA-mAP are interpreted only as supplementary diagnostics for safety-relevant localization behavior.
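The decoupled elliptical rule from this subsection can be illustrated with a short sketch. It assumes the distance is the root of the summed squared, tolerance-normalized deviations and that a value of at most 1 counts as a match; the tolerances of 5 m (longitudinal) and 0.5 m (lateral), the rail heading of 0 rad, and both function names are hypothetical.

```python
import numpy as np

def rail_frame_deviation(pred_center, ref_center, rail_heading):
    """Rotate the BEV center offset into longitudinal and lateral rail-frame components."""
    dx, dy = np.asarray(pred_center, dtype=float) - np.asarray(ref_center, dtype=float)
    c, s = np.cos(rail_heading), np.sin(rail_heading)
    return c * dx + s * dy, -s * dx + c * dy

def elliptical_distance(d_lon, d_lat, tol_lon, tol_lat):
    """Tolerance-normalized elliptical distance; values <= 1 fall inside the ellipse."""
    return float(np.sqrt((d_lon / tol_lon) ** 2 + (d_lat / tol_lat) ** 2))

# Case 1: mostly longitudinal error (2.0 m along the track, 0.2 m across it).
lon1, lat1 = rail_frame_deviation([102.0, 0.2], [100.0, 0.0], rail_heading=0.0)
along = elliptical_distance(lon1, lat1, tol_lon=5.0, tol_lat=0.5)

# Case 2: purely lateral error of 0.6 m, smaller in Euclidean terms than case 1.
lon2, lat2 = rail_frame_deviation([100.0, 0.6], [100.0, 0.0], rail_heading=0.0)
across = elliptical_distance(lon2, lat2, tol_lon=5.0, tol_lat=0.5)
```

Under these illustrative tolerances the first prediction matches and the second does not, even though its center is closer to the reference. This asymmetry is why the railway-oriented diagnostic can be much stricter than center-distance matching for corridor-intrusion analysis.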
4. Results
This section reports the reproducible performance of Rail-BEV under a scene-isolated OSDaR23 evaluation setting [12,23]. The results are organized to separate representative baseline comparison, controlled ablation analysis, operating-domain diagnostics, and supplementary railway-oriented localization analysis. Center-distance AP/mAP and rail BEV mIoU are treated as the primary metrics, whereas RA-AP and RA-mAP are retained only as supplementary railway-oriented localization diagnostics.
4.1. Experimental Protocol
The benchmark was conducted on public OSDaR23 sequences using a fixed scene-level split. Instead of random frame-level sampling, the data were partitioned at complete sequence boundaries to reduce temporal correlation between training and evaluation frames. The revised benchmark contains 23 training sequences with 583 frames, 5 validation sequences with 50 frames reserved for internal hyperparameter verification, and 5 independent test sequences with 230 frames for final reporting. LiDAR was retained as the primary geometric sensing modality, while only the front-center RGB camera was preserved as the auxiliary visual sensor. This configuration maintains a LiDAR-centric sensing hierarchy while retaining the most safety-relevant forward-view visual evidence for calibrated RGB–LiDAR association.
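A sequence-level split and a simple leakage check can be sketched as follows. The frame and sequence identifiers are made up for the example and do not correspond to actual OSDaR23 names; the two helper functions are hypothetical.

```python
from collections import defaultdict

def split_frames_by_sequence(frame_to_sequence, split_of_sequence):
    """Assign every frame to the split of the sequence it belongs to."""
    splits = defaultdict(list)
    for frame_id, seq in frame_to_sequence.items():
        splits[split_of_sequence[seq]].append(frame_id)
    return dict(splits)

def sequences_are_isolated(frame_to_sequence, splits):
    """Return True if no sequence contributes frames to more than one split."""
    owner = {}
    for split_name, frames in splits.items():
        for frame_id in frames:
            seq = frame_to_sequence[frame_id]
            if owner.setdefault(seq, split_name) != split_name:
                return False
    return True

# Hypothetical identifiers for illustration only.
frame_to_sequence = {"a_000": "seq_a", "a_001": "seq_a", "b_000": "seq_b", "c_000": "seq_c"}
split_of_sequence = {"seq_a": "train", "seq_b": "val", "seq_c": "test"}

splits = split_frames_by_sequence(frame_to_sequence, split_of_sequence)
leaky = {"train": ["a_000"], "test": ["a_001"]}     # a frame-level split of one sequence
```

Running the check on the frame-level split flags it as leaky, because temporally adjacent frames of the same sequence end up on both sides of the train/test boundary.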
The primary detection metric was range-stratified center-distance AP/mAP, which was used as the headline benchmark for obstacle detection under sparse long-range railway sensing. Rail-geometry consistency was evaluated using rail BEV mIoU, which served as a companion metric for assessing whether the learned rail structure remained consistent in the BEV space. In addition, RA-AP and RA-mAP were reported separately as supplementary railway-oriented localization diagnostics. These diagnostic metrics were not used as the primary benchmark headline; rather, they were retained to expose strict rail-corridor localization bottlenecks that may not be fully reflected by center-distance matching alone.
4.2. Baseline Realignment and Distance-Stratified Evaluation
Table 1 compares Rail-BEV with representative CenterPoint and BEVFusion baselines under the same scene-isolated evaluation setting. The comparison is reported across three physical distance bands: 0–50 m, 50–100 m, and 100+ m. This range-stratified design translates the numerical AP values into a more interpretable physical performance profile, reflecting the fact that object size, LiDAR point density, and localization uncertainty change substantially with distance (see Figure 3).
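Assigning objects to the three distance bands can be sketched with `np.digitize`. The sketch assumes range is measured as planar distance from the sensor origin in the BEV frame and that band boundaries belong to the farther band (an object at exactly 50 m falls in the 50-100 m band); neither convention is stated in the text.

```python
import numpy as np

BAND_EDGES_M = np.array([50.0, 100.0])             # boundaries between the three bands
BAND_NAMES = ("0-50 m", "50-100 m", "100+ m")

def range_band(centers_xy):
    """Return the band index (0, 1, or 2) for each BEV object center."""
    centers_xy = np.asarray(centers_xy, dtype=float)
    planar_range = np.hypot(centers_xy[:, 0], centers_xy[:, 1])
    return np.digitize(planar_range, BAND_EDGES_M)  # bins[i-1] <= x < bins[i]

centers = [[30.0, 0.0], [30.0, 40.0], [0.0, 100.0], [120.0, 5.0]]
bands = range_band(centers)
names = [BAND_NAMES[i] for i in bands]
```

Per-band AP is then obtained by evaluating predictions and reference objects separately within each band.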
Rail-BEV achieves an overall mAP of 66.69%, outperforming CenterPoint by 14.45 percentage points and BEVFusion by 16.51 percentage points under the aligned setting. The improvement is most evident for pedestrians at long range, where Rail-BEV maintains 36.81% AP in the 100+ m band, compared with 2.86% for CenterPoint and 0.34% for BEVFusion. This result suggests that railway-oriented structural reasoning helps constrain the search space along the rail corridor when far-field LiDAR returns become sparse.
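For reference, the baseline values implied by the stated margins follow from simple subtraction. The figures below are derived only from the numbers quoted in the paragraph above; Table 1 remains the authoritative source.

```latex
\begin{aligned}
\text{mAP}_{\text{CenterPoint}} &= 66.69 - 14.45 = 52.24\,\% \\
\text{mAP}_{\text{BEVFusion}}   &= 66.69 - 16.51 = 50.18\,\% \\
\Delta\text{AP}^{\text{ped}}_{100+\,\text{m}}\ (\text{vs. CenterPoint}) &= 36.81 - 2.86 = 33.95 \text{ points} \\
\Delta\text{AP}^{\text{ped}}_{100+\,\text{m}}\ (\text{vs. BEVFusion})   &= 36.81 - 0.34 = 36.47 \text{ points}
\end{aligned}
```

The long-range pedestrian margin is therefore more than twice the overall mAP margin, which is consistent with the claim that the improvement is concentrated in the far field.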
4.3. Clean Ablation Under the Unified Evaluation Setting
Table 2 presents the controlled ablation results obtained under the same scene-isolated evaluation setting. All variants were trained using identical optimization settings, random seed, epoch budget, sensor configuration, checkpoint rule, and evaluation metrics. Validation-set checkpoint selection was not used; each variant was evaluated once on the independent test sequences using the final training checkpoint. The LiDAR-only baseline achieved 0.5612 mAP and 0.0533 rail BEV mIoU.
Adding the front-center RGB stream improved mAP to 0.5750 and increased the obstacle AP from 0.2270 to 0.3398, indicating that lightweight visual evidence can complement sparse LiDAR responses. ROI-based rail-corridor refinement further improved mAP to 0.5916 and increased the rail BEV mIoU to 0.1193. In contrast, the additional track-consistency refinement reached 0.5793 mAP and 0.0665 rail BEV mIoU, showing limited incremental benefit under the current fixed protocol (see Figure 4).
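The incremental effects in the ablation can be read off directly from the reported values. The differences below are computed only from the numbers stated in this subsection.

```latex
\begin{aligned}
\Delta\text{mAP}_{\text{RGB}} &= 0.5750 - 0.5612 = +0.0138 \\
\Delta\text{mAP}_{\text{ROI}} &= 0.5916 - 0.5750 = +0.0166 \\
\Delta\text{mAP}_{\text{RGB+ROI}} &= 0.5916 - 0.5612 = +0.0304 \\
\Delta\text{mAP}_{\text{TC vs. ROI}} &= 0.5793 - 0.5916 = -0.0123 \\
\Delta\text{AP}^{\text{obstacle}}_{\text{RGB}} &= 0.3398 - 0.2270 = +0.1128
\end{aligned}
```

The front-view stream and the ROI refinement each add between one and two mAP points, whereas the track-consistency step gives back about 1.2 points relative to the ROI-only variant.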
4.4. Evaluation Scope and Operating-Domain Diagnostics
Table 3 summarizes the scene-level data partition used in the revised evaluation. The benchmark is intentionally split at the sequence boundary rather than at the frame level, so that temporally adjacent frames from the same scene are not shared between training and testing. This design provides a stronger test of cross-scene reproducibility than the previous small validation subset, although it should still be interpreted as a controlled public-sequence benchmark rather than a complete validation of all railway operating domains.
Additional diagnostic slices are used only to interpret operating-domain variation and are not treated as replacements for the independent test benchmark. In particular, detection AP and rail BEV mIoU may not improve monotonically across all slices, because center-distance matching and rail-mask overlap quantify different aspects of railway perception. Center-distance AP reflects approximate object localization under sparse long-range returns, whereas rail BEV mIoU reflects the consistency of the predicted rail structure in the BEV space.
4.5. Qualitative Multimodal Evidence
Figure 5 provides a representative qualitative visualization of the multimodal sensing pathway in a curved railway scene. The figure follows the evidence chain from the front-center RGB image and LiDAR BEV representation to LiDAR-camera projection, rail-aware BEV interpretation, and a zoomed diagnostic view. This visualization complements the quantitative results by showing how lightweight visual evidence is geometrically associated with LiDAR structure before being organized in the BEV space (see Figure 6).
4.6. Supplementary Railway-Oriented Localization Diagnostics
Although center-distance AP/mAP and rail BEV mIoU are the main reported metrics, strict railway-oriented localization diagnostics are useful for identifying remaining safety-relevant bottlenecks.
Table 4 reports RA-AP and RA-mAP under strict aligned evaluation and oracle-center diagnosis. The strict aligned evaluation obtains 0.1865 obstacle RA-AP and 0.0747 overall RA-mAP. The oracle-center diagnostic yields 0.0853 overall RA-mAP, indicating that improved center handling alone does not fully resolve strict railway-oriented localization.
These diagnostic values should not be promoted to the primary benchmark headline. Instead, they reveal that strict rail-corridor localization remains more challenging than the center-distance benchmark suggests. This is expected in long-range railway perception, where sparse far-field returns and small lateral deviations can substantially affect corridor-intrusion interpretation.
The conceptual distinction between center-distance matching and stricter railway-oriented localization diagnostics is further illustrated in Figure S3.
4.7. Summary of Results
Overall, the revised results support three main findings: First, the distance-stratified baseline comparison shows that Rail-BEV achieves the strongest overall mAP among the compared methods and provides a marked improvement for long-range pedestrian detection. Second, the controlled ablation study shows that front-view RGB assistance improves the LiDAR-only baseline from 0.5612 to 0.5750 mAP, while ROI-based rail-corridor refinement further increases mAP to 0.5916 and rail BEV mIoU to 0.1193. Third, the additional track-consistency refinement does not provide a monotonic gain under the current fixed protocol, indicating that ROI-based structural refinement is the more stable component in the present framework.
At the same time, the results define a clear scope for interpretation. The scene-isolated public benchmark reduces temporal leakage relative to random frame sampling, but it remains limited by the scale and diversity of available public railway sequences. The most defensible interpretation is therefore that Rail-BEV provides an initial reproducible LiDAR-centric, front-view-assisted, rail-aware BEV baseline for long-range railway obstacle perception, while broader operating-domain coverage and stricter localization robustness remain important future directions (see Figure 7).
5. Discussion
5.1. Sensor-Hierarchy Interpretation
The results support the central design premise of Rail-BEV: long-range railway obstacle perception benefits from a LiDAR-centric sensing hierarchy rather than from treating all modalities as equally weighted streams. LiDAR remains the primary source of metric geometry, which is essential for reasoning about obstacle position in a structured rail corridor [1,2,3,12,23,26,27]. The front-center RGB camera contributes as a lightweight auxiliary sensor by providing appearance cues that are geometrically associated with LiDAR through calibrated projection before being organized in the BEV space [9,13,14]. This configuration is particularly suitable for railway operation, where the safety-critical field of view is concentrated along the forward rail corridor, and where full surround-view camera fusion would introduce additional calibration and runtime complexity.
The revised sensor-configuration results support this interpretation. Under the unified scene-isolated evaluation setting, the LiDAR-only baseline reaches 0.5612 mAP and 0.0533 rail BEV mIoU, whereas adding the front-center RGB stream increases the corresponding values to 0.5750 and 0.0583, respectively. The improvement is especially meaningful because the visual stream is not used as a heavy independent perception branch; it is introduced as aligned auxiliary evidence within a LiDAR-centered BEV backend. Therefore, the gain should be interpreted as evidence for a practical sensor-organization strategy: forward visual assistance can strengthen sparse long-range LiDAR perception when it is constrained by calibrated geometry and integrated in a common BEV representation.
The distance-stratified comparison further indicates that the advantage of Rail-BEV is most visible in long-range pedestrian perception. In the 100+ m band, Rail-BEV achieves 36.81% pedestrian AP, whereas the corresponding values for CenterPoint and BEVFusion are 2.86% and 0.34%, respectively. This result should not be over-interpreted as full operational-domain generalization; rather, it shows that the proposed LiDAR-centered, front-view-assisted, rail-aware design can improve reproducible long-range perception under the current scene-isolated public benchmark.
5.2. Effect of Rail-Aware Structural Reasoning
The rail-aware branch and inference-time refinement further indicate that railway perception should not be evaluated solely as generic object detection in an unconstrained BEV plane. In railway scenes, the operational meaning of a detection is strongly conditioned by its relationship to the rail corridor and clearance envelope [
24,
25]. The controlled ablation shows that ROI-based rail-corridor refinement increases the overall mAP from 0.5750 to 0.5916 and improves the rail BEV mIoU from 0.0583 to 0.1193. This supports the interpretation that coarse corridor-level geometry helps organize railway-relevant BEV evidence.
At the same time, the structural-refinement results should be interpreted with caution. The improvement is not uniformly positive across all object categories: obstacle AP decreases from 0.3398 to 0.3109 after ROI refinement, while pedestrian AP also decreases slightly from 0.4349 to 0.4242. The additional track-consistency refinement reaches 0.5793 mAP and 0.0665 rail BEV mIoU, which is lower than the ROI-only variant. Thus, the revised evidence supports ROI refinement as the more stable structural component, while TC remains a supplementary refinement variant requiring further validation.
The rail BEV mIoU results also require balanced interpretation. The ROI-refined value of 0.1193 is useful as a companion geometry-consistency indicator, but it should not be presented as evidence of high-quality dense rail segmentation. In this work, the rail branch primarily serves as a coarse corridor-level structural cue for BEV perception and as a means of evaluating rail-geometry consistency. Its value lies in improving the organization of railway-relevant evidence, rather than in replacing specialized rail parsing or track reconstruction systems.
5.3. Operating-Domain Variation and Metric Scope
The scene-level analysis shows that the benchmark behavior is sensitive to operating-domain composition. Although the revised split separates training, validation, and testing at the sequence level, the public test sequences remain biased toward specific station and signal scenes. Therefore, the results should be interpreted as reproducible evidence under a controlled cross-scene public split, rather than as a final validation across the full railway operational design domain.
The operating-domain diagnostics further illustrate why selected slices should not be treated as global benchmark replacements. Detection AP and rail BEV mIoU quantify different aspects of performance: a model may approximately localize object centers while still producing coarse or fragmented rail masks. This discrepancy justifies the use of rail BEV mIoU as a companion geometry-consistency metric alongside center-distance mAP, rather than as the sole evidence for rail-geometry quality.
The supplementary railway-oriented localization diagnostics reveal an additional limitation. Under strict aligned evaluation, the obstacle RA-AP is 0.1865 and the overall RA-mAP is 0.0747. The oracle-center diagnostic slightly increases the overall RA-mAP to 0.0853 but does not consistently improve obstacle-specific RA-AP. This pattern suggests that strict railway-oriented localization is affected by more than center placement alone [11]. Box geometry, lateral alignment, rail-corridor relation, and sparse far-field returns all contribute to the remaining gap between center-distance detection performance and strict safety-oriented localization behavior.
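The difference between a center-distance criterion and a stricter lateral criterion can be illustrated with a small sketch. The second criterion below is a hypothetical stand-in chosen for illustration; it is not the RA-AP definition used in this work. The axis convention (x along the track, y lateral) and both thresholds are likewise assumptions.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # (x, y) in metres; x along the track, y lateral (assumed)


def center_distance_match(pred: Point, gt: Point, max_dist: float) -> bool:
    """True if the Euclidean distance between BEV centers is within max_dist."""
    return math.hypot(pred[0] - gt[0], pred[1] - gt[1]) <= max_dist


def lateral_match(pred: Point, gt: Point, max_lateral: float) -> bool:
    """Hypothetical stricter test: only the lateral (cross-track) error counts."""
    return abs(pred[1] - gt[1]) <= max_lateral


# Hypothetical far-field object about 80 m ahead, close to the rail corridor.
gt_center = (80.0, 0.5)
pred_center = (80.3, 1.6)

print(center_distance_match(pred_center, gt_center, max_dist=2.0))
print(lateral_match(pred_center, gt_center, max_lateral=0.5))
```

Here the predicted center lies roughly 1.14 m from the ground truth and passes a 2 m center-distance test, yet its 1.1 m lateral error fails a 0.5 m cross-track tolerance. A detection can therefore count as correct for center-distance mAP while still being ambiguous with respect to the rail corridor.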
5.4. Limitations
The first limitation is the scope of the scene-isolated public benchmark. Although the revised protocol uses 23 training sequences, 5 validation sequences, and 5 independent test sequences, the available public data remain limited relative to the diversity of real railway operating domains [12]. Consequently, the results should be presented as evidence for a controlled LiDAR-centric railway sensing baseline, rather than as a final generalization claim across severe curves, tunnels, open mainline sections, night scenes, active adverse weather, or rare operational anomalies [26,27,30,31].
The second limitation concerns multimodal robustness. The proposed configuration intentionally retains only the front-center RGB camera as auxiliary visual evidence. This choice is aligned with forward-looking railway safety perception and reduces sensing complexity, but it does not exhaust the full design space of multi-view camera assistance, radar complementarity, adverse-weather sensing, or redundancy under sensor degradation [29]. Future evaluations should therefore include controlled sensor-dropout tests, calibration perturbation studies, and broader weather and illumination conditions to determine how robust the frontal-clean design remains outside nominal scenarios [16,32].
The third limitation is that temporal and ego-motion effects are not used as headline evidence in the present paper. Claims about multi-frame fusion, GNSS/IMU-supported motion compensation, or temporal consistency should therefore be reserved for future studies unless they are supported by unified reruns under the same evaluation protocol.
5.5. Deployment-Oriented Efficiency Considerations
Although Rail-BEV is not claimed as a fully optimized real-time onboard system, the current design follows a deployment-conscious sensing hierarchy. The framework avoids full surround-view camera fusion and retains only the front-center RGB camera as auxiliary visual evidence, thereby limiting the multimodal interface to the safety-critical forward rail corridor [1,2,3,12,31,33]. This design reduces cross-view calibration complexity and keeps LiDAR as the dominant geometric backbone. The present results should therefore be interpreted as a compact sensor-aware railway perception baseline rather than as a computationally minimized embedded product. Future onboard implementation should report platform-specific latency, memory footprint, calibration maintenance cost, and failure behavior under sensor degradation before deployment-level claims are made.
5.6. Future Directions
Future work should first extend the evaluation to broader scene coverage and more diverse operating domains, including curved tracks, tunnels, open mainline sections, night scenes, adverse weather, and stations with different platform layouts [26,27,30,31]. A larger and more diverse evaluation protocol would make it possible to separate model behavior caused by the sensing architecture from behavior caused by scene composition. Second, calibration robustness should be tested explicitly because the proposed RGB–LiDAR fusion relies on LiDAR-to-image projection. Even small projection errors may reduce the benefit of front-view visual assistance or weaken rail-aware refinement [16,32]. Third, the main center-distance benchmark should be unified with the stricter railway-oriented localization diagnostics. Finally, temporal and ego-motion modeling should be revisited under a controlled experimental design so that any claimed benefit can be separated from checkpoint selection, structural refinement, and sensor-configuration effects.
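As a starting point for such a calibration perturbation study, the following sketch estimates how far a projected LiDAR point moves in the image when the extrinsic calibration carries a small yaw error. It assumes an ideal pinhole camera without lens distortion, a point already expressed in the camera frame (x right, y down, z forward), and a rotation error about the vertical axis only. The intrinsic values and function names are hypothetical and do not correspond to the sensors or code used in this work.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]  # point in the camera frame: x right, y down, z forward


def project(point_cam: Vec3, fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    x, y, z = point_cam
    if z <= 0.0:
        raise ValueError("point must lie in front of the camera")
    return fx * x / z + cx, fy * y / z + cy


def rotate_yaw(point_cam: Vec3, yaw_deg: float) -> Vec3:
    """Rotate a point about the camera's vertical (y) axis by yaw_deg degrees."""
    x, y, z = point_cam
    a = math.radians(yaw_deg)
    return math.cos(a) * x + math.sin(a) * z, y, -math.sin(a) * x + math.cos(a) * z


def pixel_shift(point_cam: Vec3, yaw_err_deg: float,
                fx: float, fy: float, cx: float, cy: float) -> float:
    """Pixel displacement of a projected point caused by a yaw calibration error."""
    u0, v0 = project(point_cam, fx, fy, cx, cy)
    u1, v1 = project(rotate_yaw(point_cam, yaw_err_deg), fx, fy, cx, cy)
    return math.hypot(u1 - u0, v1 - v0)


# Hypothetical intrinsics and a point 100 m ahead on the optical axis.
FX = FY = 1000.0
CX, CY = 960.0, 540.0
far_point = (0.0, 0.0, 100.0)

for yaw_err in (0.05, 0.1, 0.3):
    shift = pixel_shift(far_point, yaw_err, FX, FY, CX, CY)
    print(f"yaw error {yaw_err:.2f} deg -> shift {shift:.2f} px")

# Apparent width in pixels of a 0.5 m wide object at the same range.
print(FX * 0.5 / far_point[2])
```

For a point on the optical axis, the shift equals the focal length in pixels multiplied by the tangent of the yaw error, so it does not depend on range. The apparent size of an object, however, shrinks in proportion to range, which means that the same angular error displaces image features by a larger fraction of the object width in the far field. A full study would additionally perturb pitch, roll, and translation and would measure the effect on detection and rail metrics rather than on pixel displacement alone.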
6. Conclusions
This study presents Rail-BEV as a LiDAR-centric and sensor-aware bird’s-eye-view (BEV) perception framework for long-range railway obstacle detection. The framework retains LiDAR as the primary geometric sensing modality and uses only the front-center RGB camera as lightweight auxiliary visual evidence through calibrated LiDAR-to-image projection [12,13,14]. By organizing the aligned geometric and visual cues within a railway-oriented BEV backend, Rail-BEV integrates geometry-aware fusion, rail-geometry prediction, and lightweight structural refinement into a compact sensing pipeline tailored to the forward rail corridor.
Under a unified scene-isolated railway benchmark, Rail-BEV achieves the strongest overall mAP among the compared methods, reaching 0.6669 and showing a particularly large improvement for long-range pedestrian perception. The controlled ablation shows that front-view RGB assistance improves the LiDAR-only baseline from 0.5612 to 0.5750 mAP, and that ROI-based rail-corridor refinement raises the overall mAP to 0.5916 and the rail BEV mIoU to 0.1193. The additional track-consistency refinement produces limited incremental benefit under the current setting, indicating that ROI-based structural refinement is the more reliable component in the present framework.
These findings suggest that railway obstacle perception benefits from a LiDAR-centered sensing hierarchy supplemented by lightweight visual and structural cues. At the same time, the results should be interpreted within clear boundaries. The rail-geometry branch should be regarded as a coarse corridor prior rather than as a high-fidelity rail reconstruction module, and strict railway-oriented localization remains more difficult than center-distance matching. Future work should extend the evaluation to more diverse railway scenes, investigate calibration drift and sensor-degradation robustness, and assess real-time performance on onboard railway computing platforms.