1. Introduction
Railways, as the core pillar of the national comprehensive transportation system, are key infrastructure that ensures the efficient operation of the national economy and promotes coordinated regional development. With the continuous improvement of China’s “eight vertical and eight horizontal” high-speed railway network and the expanding coverage of conventional-speed railways, train operating speeds have steadily risen and transportation density has increased significantly. As a result, foreign object intrusion along railway tracks has become a primary hazard to railway traffic safety. Intrusion incidents, such as unauthorized pedestrian entry, straying livestock, slope rockfall, fallen branches, and abandoned construction equipment, can cause major accidents such as train derailments and collisions, which not only result in huge property losses but also seriously endanger passengers’ lives and disrupt railway operations [
1]. Therefore, the realization of automatic, intelligent and high-precision detection of foreign objects in the track area has become a key research direction in the field of intelligent operation and maintenance [
2].
Traditional railway foreign object prevention and control relies on a combination of manual inspection, physical fences and warning signs, and it has prominent problems, such as slow response, limited coverage, high operation and maintenance costs, and easy failure in harsh environments, making it difficult to adapt to the safety requirements of modern railway operations with high speed, high density and long distances. With the continuous development of automation technology, detection technologies based on physical sensors have been widely applied, including infrared sensors, LiDAR, and millimeter-wave radars [
3,
4]. Although sensor-based methods have certain detection capabilities under specific conditions, they generally face bottlenecks, such as weak environmental adaptability, limited target discrimination ability, and high deployment and maintenance costs, and cannot meet the dual requirements of real-time performance and accuracy for foreign object detection in complex railway environments. From the perspective of technical characteristics and application performance, LiDAR and infrared imaging have strong adaptability to day and night illumination changes, and millimeter-wave radars have certain penetration capabilities in harsh meteorological conditions such as rain and fog. However, such methods usually struggle to achieve fine discrimination of foreign object categories and are deficient in terms of equipment cost and application environment adaptability [
5,
6,
7].
In recent years, deep learning and computer vision technologies have made rapid breakthroughs, providing a non-contact, full-coverage and real-time new technical approach for railway track foreign object intrusion detection. In particular, object detection algorithms based on convolutional neural networks, relying on their powerful feature extraction capabilities and end-to-end detection processes, are gradually replacing traditional methods and becoming the mainstream technical solution [
8,
9]. Current object detection methods are mainly divided into two categories: one-stage models and two-stage models. Among them, two-stage algorithms (e.g., Faster R-CNN [
10]) include two steps: first, candidate regions (region proposals) are generated by a dedicated proposal stage, and then classification and bounding-box regression are carried out on the selected regions. Although such algorithms have high detection accuracy, they suffer from large computational overhead and insufficient real-time inference speed [
11]; one-stage algorithms, on the other hand, can complete target localization and category judgment simultaneously with only a single processing of the image, with typical representatives including SSD [
12] and the YOLO series [
13,
14,
15]. They have significant advantages in detection speed and are more in line with the stringent real-time requirements of railway scenarios [
16]. However, such detection models still fall short in actual railway deployment, for example on small targets and against complex backgrounds.
In response to the above problems, scholars at home and abroad have carried out a lot of targeted research. Zhang et al. [
17] proposed a railway obstacle intrusion early warning method that integrates track region extraction and risk level assessment. This method realizes obstacle detection and dangerous region division based on YOLOv5, and it has a certain robustness in low-light scenarios but does not specifically enhance small targets or fine-grained features. Chen et al. [
18] proposed the MSA-YOLO algorithm for railway track scenarios in foggy weather, which completes image defogging and quality enhancement through a multi-scale adaptive module, effectively improving the detection reliability under heavy fog conditions, but the model does not introduce track coordinate priors, resulting in limited background suppression capability. Niu et al. [
19] proposed the MSL-YOLO algorithm based on lightweight YOLOv8, which adopts a multi-scale shared convolution module and StarBlocks structure, achieving high-precision detection while greatly reducing the number of parameters, but it has insufficient fine-grained discrimination ability between track structures and foreign objects. Meng et al. [
20] proposed the SDRC-YOLO algorithm, which integrates a hybrid attention mechanism, decoupled detection head and CARAFE upsampling, increasing the mAP by 2.8 percentage points compared with the baseline YOLOv5s; however, the model computation is increased and the degree of lightweightness is limited. Zhang et al. [
21] proposed the YOLOv5-RTO algorithm, introducing EVC attention and a CARAFE upsampling operator, achieving 96.5% mAP on a small-sample railway dataset, but the algorithm lacks a coordinate guidance mechanism, such that it is prone to false detection in complex backgrounds. Chen et al. [
22] proposed the MACENet detection network, embedding DCNv3 deformable convolution and a GOLD-YOLO structure based on YOLOv8, which significantly improves the detection accuracy of irregular foreign objects but does not impose explicit prior constraints on the track area.
Existing methods either focus on improving general object detection or perform localized module optimization for railway scenarios. Current research on railway foreign object detection has yet to address key challenges—including lightweight deployment, complex background suppression, multi-scale small object perception, and fine-grained feature enhancement—within a unified end-to-end framework. To address the above issues, this paper proposes CMF-Net, a unified detection architecture tailored for railway track foreign object detection. The main contributions are summarized as follows:
Firstly, a lightweight CGG module is proposed using GhostConv and adaptive residual connections to reduce computation, mitigate gradient vanishing and overfitting, and better fit railway scene features.
Secondly, an MSAF module is developed to perform hierarchical receptive field extraction and channel–spatial dual attention weighting, addressing the challenge of weak small-target responses.
Thirdly, an FGAF module is designed with decomposed convolution to enhance fine-grained edge and texture representation and suppress track-related background interference.
Fourthly, a BiFPN-based bidirectional feature fusion structure is introduced to improve the efficiency and robustness of cross-scale information transmission.
Finally, and as the principal domain-specific contribution of this paper, a Track-Prior Spatial Attention (TPSA) module is proposed. TPSA converts rail-centerline geometry into a learnable Gaussian distance-decay attention field and fuses it with the data-driven CBAM spatial attention through a sigmoid-gated coefficient. To our knowledge, this is the first detector for railway foreign object intrusion that explicitly conditions feature attention on rail geometry, providing a contribution that does not transfer to generic-domain detectors.
The experimental results indicate that CMF-Net achieves an mAP50 of 89.2% on the OFBDs dataset, which is 4.8 percentage points higher than that of the original YOLOv5s baseline, with all indicators significantly optimized. The algorithm proposed in this paper demonstrates higher detection accuracy, stronger small-target perception ability, and better environmental adaptability in complex rail transit scenarios, and it can provide efficient and reliable technical support for intelligent security and safe operation and maintenance in rail transit.
2. Related Work
2.1. YOLOv5
YOLOv5 [
23] is a widely used state-of-the-art object detection algorithm that has been optimized based on YOLOv4, achieving remarkable improvements in detection performance. Up to now, YOLOv5 has been extensively utilized in a variety of fields, such as agriculture [
24] and industry [
25,
26]. Drawing on the research progress of YOLOv5, this study focuses on two key problems in railway track foreign object detection: the accuracy and speed of target localization. As shown in
Figure 1, the YOLOv5 framework comprises four core modules: an input layer, a backbone network, a neck network, and a detection head.
Input layer: Before model training, the raw dataset is preprocessed, mainly through mosaic data augmentation [
27] and adaptive image padding. By introducing an adaptive anchor box mechanism, YOLOv5 can dynamically adjust the size of initial anchor boxes according to the characteristics of the target dataset, thus improving the matching degree between anchor boxes and target objects.
Backbone network: Input feature maps are fed into the backbone network for multi-level feature extraction, which enables the capture of abundant spatial and semantic information. The Cross Stage Partial (CSP) network [
28] is adopted to speed up algorithm operation and reduce computational redundancy. The Spatial Pyramid Pooling-Fast (SPPF) [
29] module enhances detection accuracy by extracting multi-scale features from the same image and generating feature maps at three distinct scales, thereby expanding the network’s receptive field.
Neck network: In the original YOLOv5 architecture, the neck component has a hybrid structure that integrates the Feature Pyramid Network (FPN) and the Path Aggregation Network (PAN). The FPN layer facilitates a top-down flow of information, effectively transmitting rich, high-level semantic features from deeper layers to the shallower layers. Complementing this, the PAN layer establishes an additional bottom-up aggregation pathway, which fuses detailed, low-level positional features from the shallows with the semantically strong features from the depths [
30]. This synergistic combination of FPN and PAN significantly enhances the network’s capacity for multi-scale feature fusion. Consequently, it strengthens the feature representation at the fusion stage, leading to a marked improvement in the model’s recognition efficiency for objects across a wide range of sizes.
Detection head: The detection head serves as the core component in the YOLOv5 network architecture, primarily responsible for achieving precise target prediction and accurate bounding-box regression. Building upon the detection head architecture of YOLOv3, this design employs a predefined anchor box mechanism to predict potential targets across various sizes and scales. To enhance multi-scale target detection performance, YOLOv5 introduces independent prediction branches for feature maps at different levels, enabling the network to simultaneously process small targets captured from shallow features and large targets extracted from deep features. This multi-scale prediction strategy significantly improves the model’s adaptability to targets of varying sizes and detection accuracy, resulting in superior overall performance in complex scenarios.
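As an illustration of the adaptive image padding in the input layer and the multi-scale prediction branches in the detection head, the following minimal Python sketch computes the padded input size and the grid size at each prediction scale. The function names, the 1920 × 1080 camera frame, the 640-pixel target size, the strides of 8, 16, and 32, the three anchors per scale, and the four-class example are assumptions based on common YOLOv5 defaults, not settings reported in this paper.

```python
def letterbox_shape(width, height, target=640, stride=32):
    """Aspect-preserving resize followed by minimal padding to a stride multiple.

    Returns (new_w, new_h, pad_w, pad_h).
    """
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_w = (target - new_w) % stride  # only pad up to the next stride multiple
    pad_h = (target - new_h) % stride
    return new_w, new_h, pad_w, pad_h


def head_grids(in_w, in_h, strides=(8, 16, 32)):
    """Grid size (columns, rows) seen by each prediction branch."""
    return [(in_w // s, in_h // s) for s in strides]


def head_channels(num_classes, anchors_per_scale=3):
    """Output channels per branch: (x, y, w, h, objectness + classes) per anchor."""
    return anchors_per_scale * (5 + num_classes)


new_w, new_h, pad_w, pad_h = letterbox_shape(1920, 1080)
padded = (new_w + pad_w, new_h + pad_h)
grids = head_grids(*padded)
print(padded, grids, head_channels(4))
```

The finest grid (stride 8) keeps the most spatial detail and is therefore the branch on which small, distant foreign objects mainly depend.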
YOLOv5 is released in several variants of different network scale; this study considers four of them, YOLOv5n, YOLOv5s, YOLOv5m, and YOLOv5l, whose network depth and width increase sequentially. Among these variants, YOLOv5n is the most lightweight, with the fewest parameters, making it suitable for deployment on edge computing devices with limited computational resources. YOLOv5s achieves a good balance between detection speed and accuracy and is the mainstream option for industrial applications. YOLOv5m and YOLOv5l gain higher detection accuracy through deeper network layers and more feature channels, but their computational overhead rises correspondingly. Consequently, this study selects YOLOv5s as the baseline model: its moderate scale satisfies the real-time requirements of railway track foreign object detection while retaining adequate feature learning capability, which facilitates the embedding and optimization of the custom modules introduced later.
While the YOLOv5s algorithm demonstrates significant advantages in real-time object detection, its direct application to railway track foreign object detection still faces challenges in accurately identifying dimensional variations among objects of different sizes and distances. To effectively address these challenges, specialized algorithmic improvements and structural optimizations are required to enhance multi-scale foreign object detection performance, thereby achieving a comprehensive capability enhancement for the entire track object detection system. This critical process plays a pivotal role in rapidly identifying and mitigating potential safety risks, which is fundamental to ensuring the stable and efficient long-term operation of complex railway transportation systems.
2.2. Vehicle-Mounted Sensing Unit
The vehicle-mounted sensing unit is the core component for acquiring environmental data on the train, functioning much like its eyes and ears. Equipped with advanced sensors and processing devices, it continuously monitors track conditions, obstacles, and signals to collect raw data. After integrating and preprocessing this information, the unit transmits it to the autonomous driving decision module, providing critical support for route planning, behavioral decision-making, and motion control. The main unit incorporates six circuit boards and integrates sensing technologies, including stereo cameras, LiDAR, and millimeter-wave radar, with a high-performance computing platform. This configuration enables stable and precise environmental data collection across weather conditions, light levels, and complex track environments. The system is installed in the control console area behind the train windshield, with the sensor lenses mounted against the windshield to reduce data noise while keeping an unobstructed view of tracks, switches, and surrounding obstacles, which minimizes blind spots. Vibration-resistant brackets and high-elasticity damping pads further filter high-frequency operational vibrations, preventing sensor displacement, data artifacts, and positional deviations. This ensures spatiotemporal consistency in the visual detection data, laying the foundation for multi-frame data fusion and target trajectory tracking in track object detection systems. The vehicle-mounted sensing unit is shown in
Figure 2.
2.3. Attention Mechanism
Common attention mechanisms include SE, ECA, CBAM, and CA. The SE mechanism enhances key channels by adaptively learning importance weights across channels, strengthening the feature representation most relevant to the task. The ECA module adopts a lightweight design that avoids the information loss caused by dimensionality reduction while still capturing cross-channel interactions and dependencies, achieving strong attention performance with very few parameters. CBAM goes further by computing attention weights along both the channel and spatial dimensions of the input feature map and applying them in sequence to extract critical image information. The CA (coordinate attention) module embeds positional information into channel attention, capturing long-range spatial dependencies without significantly increasing computational overhead and thereby enhancing the model’s ability to model spatial location.
In railway track foreign object detection scenarios with complex backgrounds and small target sizes, single-dimensional attention mechanisms (SE and ECA) can only calibrate channel features without effectively suppressing spatial background interference. While coordinate attention (CA) mechanisms incorporate coordinate information, they lack adaptive weighting capabilities for local regions. The channel–spatial dual attention mechanism of CBAM synchronously filters effective channels and highlights target regions, making it theoretically more suitable for such scenarios. Subsequent experimental validation further confirms the superiority of CBAM.
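For reference, the sequential channel-then-spatial weighting of CBAM discussed above is commonly written as follows. This is the standard published CBAM formulation rather than a description of any modification made in this paper; the symbols $M_c$, $M_s$, $F'$, and $F''$ are notation introduced here for illustration.

```latex
\begin{aligned}
M_c(F)  &= \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big), & F'  &= M_c(F) \otimes F, \\
M_s(F') &= \sigma\big(f^{7\times 7}([\mathrm{AvgPool}_c(F');\ \mathrm{MaxPool}_c(F')])\big),        & F'' &= M_s(F') \otimes F'.
\end{aligned}
```

Here $\sigma$ is the sigmoid function, $\otimes$ is element-wise multiplication with broadcasting, the MLP is shared between the two pooled descriptors, and $\mathrm{AvgPool}_c$ and $\mathrm{MaxPool}_c$ pool along the channel axis. The spatial map $M_s$ is what allows CBAM to suppress background positions, which SE and ECA cannot do because they produce only per-channel weights.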
2.4. Summary of Existing Solutions
To clarify how the proposed CMF-Net relates to and differs from prior work in railway foreign object detection,
Table 1 synthesizes representative existing solutions, their core ideas, strengths, and limitations, and whether each method is included in our quantitative comparison.
3. Design of the CMF-Net Algorithm
This study employs YOLOv5s as the baseline network, integrates four core modules (CGG, MSAF, FGAF, and TPSA), and replaces the neck feature fusion architecture with BiFPN to construct the CMF-Net track foreign object detection model. Rather than introducing novel operators, CMF-Net’s core contribution lies in the specialized integration and adaptation of existing components to railway scenarios, with a design that accounts for the small targets, complex backgrounds, and edge deployment constraints of railway track environments. The network architecture of CMF-Net is illustrated in
Figure 3.
3.1. CGG Module
To address computational redundancy, gradient vanishing/explosion in deep layers, and overfitting risks in the C3 module of the YOLOv5s baseline model, this study introduces the CGG (Conv-Ghost-Residual Graph) module. The design aims to make the network lighter and its training more stable while preserving feature extraction integrity, meeting the dual requirements of “high precision and real-time performance” in track foreign object detection. Unlike existing enhancement modules that rely on GhostConv or residual architectures alone, the CGG module follows a three-part collaborative design of “structural reconstruction, redundancy suppression, and gradient optimization”, yielding a feature extraction unit tailored to track foreign object detection rather than a simple combination of existing modules. The specific design rationale and technical details are outlined below:
GhostConv Replacement and Structural Adaptation: The original 3 × 3 standard convolution in the C3 module is entirely replaced with GhostConv. GhostConv uses a dual-branch architecture that combines “core feature generation through basic convolution + ghost feature extraction via cost-efficient operations,” significantly reducing parameter count and computational load while preserving critical features. For track foreign object detection, this study tunes GhostConv’s channel ratio, setting the core-feature-to-ghost-feature ratio to 1:1.5 (the best value found in controlled-variable experiments on the OFBDs dataset). This setting prevents information loss in the core features while the ghost features supplement fine-grained details such as edges and textures, addressing both the feature redundancy and the shortage of effective information that standard convolutions exhibit in track environments.
Improvements and Theoretical Foundations of Residual Connections: The original direct connection structure in the C3 module is replaced with residual connections, adopting ResNet’s gradient propagation mechanism. Skip connections transfer shallow-layer features directly to deep layers, which mitigates gradient vanishing and gradient explosion during deep network training. To further stabilize gradient propagation, this study introduces adaptive weight adjustment factors into the residual connections: simple fully connected layers dynamically allocate the fusion weights between shallow and deep features. This design lets the model adjust the feature fusion ratio according to the complexity of track foreign object features, thereby improving the precision of feature extraction.
Removal of redundant convolution layers and overfitting suppression: Ablation experiments on the C3 module revealed that its final 1 × 1 convolution layer primarily serves to adjust channel dimensions but introduces computational redundancy and tends to overfit to track backgrounds (particularly on small-sample datasets). To address this, the CGG module removes this redundant convolution layer and directly fuses the feature maps processed by GhostConv with the residual connection features. This simplifies the network architecture, reduces computational overhead, and mitigates overfitting risk through parameter reduction. In addition, a Batch Normalization (BN) layer and a Leaky ReLU activation function are applied after feature fusion to enhance model generalization and nonlinear representation capability. The BN layer uses a momentum of 0.99, and the Leaky ReLU uses a negative slope of 0.1.
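The two hyperparameters quoted for the post-fusion layers can be made concrete with a short sketch. It assumes the decay convention for BN momentum, in which 0.99 is the fraction of the previous running estimate that is kept; some frameworks define momentum the other way round, so that the equivalent setting would be 0.01. The function names are hypothetical.

```python
def leaky_relu(x, negative_slope=0.1):
    """Leaky ReLU: identity for non-negative inputs, scaled by the slope otherwise."""
    return x if x >= 0 else negative_slope * x


def update_running_mean(running, batch_mean, momentum=0.99):
    """Running-statistics update under the assumed decay convention."""
    return momentum * running + (1.0 - momentum) * batch_mean


activated = [leaky_relu(v) for v in (-2.0, 0.0, 3.0)]
running = update_running_mean(0.0, 1.0)
print(activated, running)
```

With a momentum of 0.99 under this convention, each batch moves the running mean by only 1%, which smooths the statistics when batches are small or highly variable.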
Through the combined design of “GhostConv replacement + residual connection + redundant convolution removal”, the CGG module ensures the effectiveness of feature extraction on the basis of its lightweight design, laying a solid foundation for the accuracy improvement of the CMF-Net model. Comparative experiments demonstrate that compared with the original C3 module, the CGG module improves feature extraction accuracy by 0.4 percentage points on the railway track foreign object dataset, effectively verifying the rationality and superiority of its structural design. The structure of GhostConv is shown in
Figure 4.
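To give a feel for the savings from the GhostConv replacement, the following sketch counts the weights of a standard convolution and of a GhostConv-style layer with a 1:1.5 core-to-ghost ratio. It assumes that each ghost channel is produced by one depthwise 3 × 3 filter and that biases are ignored; the channel sizes (128 in, 160 out) and the function names are hypothetical and chosen only so that the ratio divides evenly.

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def ghost_conv_params(c_in, c_out, k, core_to_ghost=(2, 3), cheap_k=3):
    """Return (core_channels, ghost_channels, weights) for a GhostConv-style layer.

    core_to_ghost=(2, 3) encodes the 1:1.5 ratio with integers.
    """
    core_part, ghost_part = core_to_ghost
    total = core_part + ghost_part
    if (c_out * core_part) % total != 0:
        raise ValueError("c_out must split evenly under the chosen ratio")
    core = c_out * core_part // total
    ghost = c_out - core
    primary = conv_params(c_in, core, k)  # ordinary convolution producing core features
    cheap = ghost * cheap_k * cheap_k     # assumed depthwise op: one filter per ghost channel
    return core, ghost, primary + cheap


standard = conv_params(128, 160, 3)
core, ghost, ghost_total = ghost_conv_params(128, 160, 3)
print(standard, ghost_total, round(ghost_total / standard, 3))
```

Under these assumptions the GhostConv-style layer needs roughly 40% of the weights of the standard convolution, because only the core channels are produced by a full convolution over all input channels.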
Residual connections can effectively alleviate the gradient vanishing and overfitting problems caused by increasing network depth. Adding a residual connection to the CGG module therefore improves the training stability of the deep network. Let h and g denote the input and output feature maps, respectively, and let $\mathcal{F}(h)$ denote the nonlinear transformation of the input. The residual connection is then expressed as follows:
$$g = h + \mathcal{F}(h)$$
This residual connection enables effective gradient propagation through the network, mitigates the vanishing-gradient problem in deeper layers, allows the model to retain sufficient low-level feature information for target detection, and improves the training stability of deep networks. Note that the residual shortcut itself does not reduce computational complexity; the lightweight benefits in the CGG module originate from the GhostConv replacement and the removal of the redundant 1 × 1 convolution layer described above.
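The adaptive weighting of the residual connection described for the CGG module can be sketched as follows. This is only one plausible reading: it assumes a single sigmoid-gated scalar, computed from the globally pooled input by a one-weight fully connected layer, that forms a convex mix of the shortcut and the transformed features. The function name, the gate parameters, and the mixing form are assumptions, not the paper's stated formulation.

```python
import math


def adaptive_residual(h, transform, gate_weight, gate_bias):
    """Mix the shortcut h with transform(h) using a learned scalar gate.

    h: list of floats standing in for a flattened feature map.
    Returns (g, alpha), where alpha is the weight given to the shortcut.
    """
    pooled = sum(h) / len(h)                        # global average pooling
    alpha = 1.0 / (1.0 + math.exp(-(gate_weight * pooled + gate_bias)))
    f = transform(h)                                # stand-in for the nonlinear branch
    g = [alpha * h_i + (1.0 - alpha) * f_i for h_i, f_i in zip(h, f)]
    return g, alpha


# With zero gate parameters the gate is neutral (alpha = 0.5).
g, alpha = adaptive_residual([1.0, 2.0], lambda v: [2.0 * x for x in v], 0.0, 0.0)
print(g, alpha)
```

Because the shortcut term always carries a non-zero weight, gradients retain a direct path to shallow layers regardless of how the gate is learned.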
3.2. MSAF Module
To address the challenges of “large size differences, obvious distance variations, and easy submergence of small-target features” in railway track foreign object detection, we propose the Multi-Scale Spatial Attention Fusion (MSAF) module. Built on the SPPF module, MSAF adds a spatial attention mechanism and an adaptive hierarchical fusion strategy to screen and fuse the multi-scale features of railway track foreign objects. Its core consists of three elements: hierarchical multi-scale pooling, adaptive feature fusion, and spatial attention guidance. By retaining the original-scale features, hierarchically aggregating multi-scale receptive fields, and strengthening the target region through attention, the module remedies the “blind feature fusion and lack of target pertinence” of the traditional SPPF module.
The structure of the MSAF module is shown in
Figure 5, which is composed of three collaborative components: a multi-scale pooling branch, an original scale retention branch, and a CBAM spatial attention sub-module. The specific improvement details are as follows:
Multi-scale pooling and hierarchical feature extraction: The input feature map is first divided into two paths: one path is sent to the CBS (Conv-BN-SiLU) module for original scale feature enhancement to retain the detailed information of small-sized and long-distance foreign objects; the other path enters a three-level cascaded MaxPool structure to generate feature maps with three different receptive fields (corresponding to the semantic features of large-, medium-, and small-sized foreign objects). The original scale features and the three-level pooling features are concatenated along the channel dimension to complete the preliminary aggregation of multi-scale features. This design not only covers the receptive field requirements of foreign objects of different sizes but also avoids the submergence of small target features by deep-layer semantic features, thus improving the detection performance for small foreign objects on railway tracks.
Adaptive fusion and dimension unification: For the preliminarily aggregated multi-scale features, dimension unification and feature compression are performed through the CBR (Conv-BN-ReLU) module: a 1 × 1 convolution is used to reduce the number of feature channels to a unified dimension, which not only reduces the computational complexity of the network but also realizes dimension alignment of features of different scales, laying a foundation for subsequent attention weighting. This operation replaces the “equal weight fusion” strategy of the traditional SPPF module and dynamically adjusts the contribution of features of different scales through convolution learning adaptive weights, enabling the model to focus on key scales according to the size distribution of foreign objects in railway track scenarios.
CBAM spatial attention embedding and background suppression: The multi-scale features with unified dimensions are concatenated again with the original scale CBS output features and then sent to the CBAM (Convolutional Block Attention Module) sub-module. Through a dual mechanism of “channel attention first screens effective channels, and spatial attention then strengthens the target region”, the CBAM module effectively enhances the target features and suppresses background interference:
Channel attention branch: Generates channel attention weights through global average pooling and fully connected layers, screens out effective feature channels related to foreign objects, and suppresses redundant background channels;
Spatial attention branch: Generates a spatial attention weight matrix based on the channel-weighted feature map, strengthens the feature response of the foreign object region pixel by pixel, and suppresses the interference of background areas such as track ballast and sleepers. This dual attention mechanism is accurately adapted to the scenario characteristics of railway track foreign object detection with a “complex background and small target proportion”, significantly improving the feature expression ability of small-sized and long-distance foreign objects.
Residual enhancement and normalization optimization: After CBAM attention weighting, a residual connection structure is introduced to add the feature map processed by the attention mechanism to the original fused feature map element by element, avoiding the loss of key feature information in the attention screening process. Meanwhile, a Layer Normalization (LN) layer is added to normalize the fused features, which accelerates the convergence of model training, stabilizes gradient propagation, adapts to the distribution characteristics of multi-scale features, and further improves the robustness of the MSAF module.
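The three-level cascaded MaxPool branch can be illustrated in one dimension. The sketch below assumes stride-1 pooling with a kernel of 5, the usual SPPF setting (the kernel size is not stated here), and checks that cascading the same small pool reproduces the output of progressively larger pooling windows. The function names and the sample signal are hypothetical.

```python
def max_pool_1d(values, kernel):
    """Stride-1 max pooling with 'same' output length; out-of-range positions are ignored."""
    half = kernel // 2
    n = len(values)
    return [max(values[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


def cascaded_windows(kernel, levels):
    """Equivalent window size after each level of cascaded stride-1 pooling."""
    return [1 + level * (kernel - 1) for level in range(1, levels + 1)]


signal = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]
p1 = max_pool_1d(signal, 5)  # first level
p2 = max_pool_1d(p1, 5)      # second level
p3 = max_pool_1d(p2, 5)      # third level
print(cascaded_windows(5, 3))
```

Each level therefore widens the receptive field at the cost of one small pooling pass, while the unpooled CBS path preserves the original-scale detail that small and distant objects rely on.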
3.3. FGAF Module
The Feature Enhancement Module (FEM) [
31] is a classic module for small-target detection, designed to extract and enhance the feature representation of input data. It adopts a four-branch structure, generating multi-channel feature maps by applying standard and dilated convolutions of different scales on each branch. By horizontally expanding the network width, the FEM enlarges the receptive field and improves the network’s sensitivity and adaptability to small objects (
Figure 6).
The four parallel branches of the FEM are as follows:
Branch 1: 1 × 1 standard convolution for channel integration and dimensionality reduction, reducing computational load.
Branch 2: 3 × 3 standard convolution to capture local spatial features, including edges and textures.
Branch 3: 3 × 3 dilated convolution with dilation rate 2, expanding the receptive field to 7 × 7 without increasing parameters.
Branch 4: 3 × 3 dilated convolution with dilation rate 5, further extending the receptive field to 15 × 15, capturing global semantic and long-range contextual features.
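The relation between dilation rate and receptive field can be checked with a short calculation. A single k × k layer with dilation d spans k + (k − 1)(d − 1) pixels, which gives 5 and 11 for a 3 × 3 kernel at rates 2 and 5; branch-level receptive fields of 7 × 7 and 15 × 15 result when the dilated layer follows a standard convolution. The preceding 3 × 3 and 5 × 5 layers used below are assumptions chosen to reproduce those figures, not details stated in the branch list, and the function names are hypothetical.

```python
def effective_kernel(kernel, dilation=1):
    """Spatial extent covered by one dilated convolution kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


def stacked_receptive_field(layers):
    """Receptive field of stride-1 layers given as (kernel, dilation) pairs."""
    rf = 1
    for kernel, dilation in layers:
        rf += effective_kernel(kernel, dilation) - 1
    return rf


single_rate2 = effective_kernel(3, 2)
single_rate5 = effective_kernel(3, 5)
branch_rate2 = stacked_receptive_field([(3, 1), (3, 2)])  # assumed 3x3 conv, then rate-2 layer
branch_rate5 = stacked_receptive_field([(5, 1), (3, 5)])  # assumed 5x5 conv, then rate-5 layer
print(single_rate2, single_rate5, branch_rate2, branch_rate5)
```

In both cases the dilated layer still has only nine weights per input-output channel pair, which is why dilation enlarges the receptive field without increasing parameters.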
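The feature calculation can be expressed as follows:
$$W_1 = f_{\mathrm{conv}}^{1\times 1}(F), \quad W_2 = f_{\mathrm{conv}}^{3\times 3}(F), \quad W_3 = f_{\mathrm{diconv}}^{r=2}(F), \quad W_4 = f_{\mathrm{diconv}}^{r=5}(F)$$
where
$f_{\mathrm{conv}}^{1\times 1}$ and $f_{\mathrm{conv}}^{3\times 3}$ denote standard convolution operations;
$f_{\mathrm{diconv}}^{r}$ denotes dilated convolution with rate 2 or 5;
Cat represents channel-wise concatenation and ⊕ denotes element-wise addition, the two operations used to merge the branch outputs, and F is the input feature map;
W1–W4 correspond to the four parallel branches described above.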
Despite its effectiveness, the FEM has clear limitations in railway track foreign object detection. First, the features from the four branches are concatenated directly, without an effective screening mechanism, which introduces information redundancy and interference between features of different scales. Because background elements such as sleepers and gravel closely resemble real foreign objects in feature space, the directly concatenated multi-channel feature map makes the classifier’s decision harder and leads to a high false detection rate. Second, the FEM lacks adaptive channel attention: it cannot adjust the weight of each branch according to the content of the current input image, which limits the network’s ability to discriminate fine-grained targets such as small foreign objects on railway tracks.
To address the above problems, we propose the Fine-Grained Attention Fusion (FGAF) module. The module inherits the FEM’s multi-branch receptive field and introduces decomposed convolution and a channel–spatial dual attention mechanism to achieve fine-grained adaptive feature screening and fusion for railway track foreign objects. Its structure is shown in Figure 7.
The FGAF module is built around a four-branch parallel decomposed convolution structure and combines it with the CBAM attention module and a residual connection, forming a complete pipeline of multi-scale feature extraction, attention enhancement, and residual fusion. The specific improvements are as follows:
Multi-scale feature extraction via decomposed convolution:
Branch 1: 1 × 1 → 3 × 3 standard convolution to retain basic receptive field and channel information.
Branches 2 and 3: Asymmetric decomposed convolutions (1 × 1 → 1 × 3 → 3 × 1 and 1 × 1 → 3 × 1 → 1 × 3) reduce parameters while preserving the equivalent receptive field, effectively capturing directional edge and texture features.
Branch 4: 1 × 1 convolution to retain high-resolution positional information, essential for detecting tiny and distant objects.
Channel–Spatial Dual Attention via CBAM:
Channel attention: Generates per-channel weights through global average and max pooling, a shared MLP, and sigmoid activation to select relevant features.
Spatial attention: Applies a 7 × 7 convolution (or parallel 3 × 3 and 5 × 5) to the channel-weighted feature map, emphasizing the operational danger zone while suppressing background interference, improving detection of small and blurred objects.
Residual fusion and feature fidelity:
The CBAM output is added element-wise to the original input feature map, preserving fine-grained spatial and semantic details.
This strategy accelerates convergence while maintaining detection accuracy and robustness.
Computational Efficiency:
The FGAF module reduces the number of parameters of each decomposed 3 × 3 convolution by ~33%, since a 1 × 3 kernel followed by a 3 × 1 kernel uses 6 weights per channel pair instead of 9.
The CBAM attention mechanism accelerates the convergence of the downstream detection head, reducing training time.
This combination of a lightweight structure and efficient attention balances real-time performance and accuracy, making the module suitable for edge deployment in vehicle-mounted railway monitoring.
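The ~33% figure can be checked with simple arithmetic. The sketch below counts convolution weights for a full 3 × 3 kernel and for the 1 × 3 plus 3 × 1 pair, assuming both decomposed layers keep the same channel width and carry no bias. The channel width of 128 and the helper names are hypothetical and are not taken from the paper.

```python
def conv_params(c_in, c_out, kh, kw, bias=False):
    """Number of learnable weights in a 2D convolution with a kh x kw kernel."""
    return c_in * c_out * kh * kw + (c_out if bias else 0)


def decomposed_params(channels, bias=False):
    """Weights of a 1x3 convolution followed by a 3x1 convolution (channels -> channels)."""
    return (conv_params(channels, channels, 1, 3, bias)
            + conv_params(channels, channels, 3, 1, bias))


channels = 128  # hypothetical channel width, chosen only for illustration
full = conv_params(channels, channels, 3, 3)
split = decomposed_params(channels)
saving = 1 - split / full

print(f"3x3: {full} weights, 1x3 + 3x1: {split} weights, saving: {saving:.1%}")
```

The saving is 1 − 6/9 regardless of the channel width, as long as the intermediate width equals the input and output width. The 1 × 1 convolutions that precede each branch are not included in this count.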
3.4. BiFPN Module
The original YOLOv5 algorithm adopts the FPN + PAN architecture for multi-scale information fusion. However, this architecture is computationally cumbersome, and its fused feature maps are easily disturbed in complex railway track environments. To extract and use the structural features of foreign objects more efficiently and to reduce the feature loss incurred by the traditional FPN + PAN architecture, the CMF-Net model introduces the Bidirectional Feature Pyramid Network (BiFPN) [32], whose structure is shown in Figure 8.
BiFPN is an improved feature pyramid architecture based on PAN, designed for efficient multi-scale feature fusion. Instead of simply mixing features from different levels, BiFPN adds cross-scale connections between the input and output feature maps and introduces a weighted feature fusion strategy, which makes fusion more effective while saving computing resources. On this basis, we adopt a BiFPN-based multi-scale feature fusion scheme and introduce boundary fusion features to maintain the contrast between shallow-layer positional features and deep-layer semantic features. Compared with the traditional bottom-up and top-down feature fusion paths, the BiFPN module integrates the detailed positional information of shallow layers with the high-level semantic information of deep layers more effectively, thereby improving detection of railway track foreign objects of different sizes.
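As an illustration of weighted feature fusion, the sketch below implements the fast normalized fusion rule from the original BiFPN design, in which each input receives a non-negative learnable weight and the weights are normalized by their sum. Whether CMF-Net uses exactly this variant is an assumption; the function name, the toy 2 × 2 feature maps, and the weight values are hypothetical.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature maps with non-negative, sum-normalized weights.

    features: list of equally sized 2D lists, one per input branch
    weights:  list of raw learnable scalars, one per branch
    """
    w = [max(0.0, x) for x in weights]   # ReLU keeps every weight >= 0
    total = sum(w) + eps                 # eps avoids division by zero
    rows, cols = len(features[0]), len(features[0][0])
    return [
        [sum(wi * f[r][c] for wi, f in zip(w, features)) / total
         for c in range(cols)]
        for r in range(rows)
    ]


# Toy inputs standing in for a shallow (positional) and a deep (semantic) map.
shallow = [[1.0, 2.0], [3.0, 4.0]]
deep = [[0.0, 0.0], [2.0, 2.0]]

fused = fast_normalized_fusion([shallow, deep], weights=[3.0, 1.0])
print(fused)
```

Because the weights are clipped at zero and normalized, a branch whose weight is driven negative during training is simply switched off rather than subtracted.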
3.5. Track-Prior Spatial Attention (TPSA) Module
The CGG, MSAF, FGAF and BiFPN modules described above are all designed to better extract and fuse generic visual features. However, none of them encodes any explicit knowledge of railway geometry. In a railway forward-view scene, the rails and their immediate surroundings define the operational risk region: an object several meters from the track is a far smaller threat than the same object occupying the rail bed. To inject this domain prior into the detector, we propose the Track-Prior Spatial Attention (TPSA) module, a lightweight extension to the spatial attention pathway of the FGAF block.
As shown in Figure 9, the TPSA module implements a two-branch attention mechanism and a learnable fusion process.
Rail-centerline mask. Given an input frame I ∈ ℝ^(3 × H × W), we first obtain a binary rail-centerline mask M_rail ∈ {0, 1}^(H × W) using a lightweight upstream module. In our implementation the mask is produced either (a) by a small segmentation head pre-trained on the public RailSem19 dataset or (b) by a perspective-prior generator that combines a fixed vanishing-point geometry with edge-gated probabilistic Hough refinement. The latter requires no extra learnable parameters and runs in approximately 0.3 ms per 640 × 640 frame on an NVIDIA RTX 3060.
Distance-decay prior field. From M_rail, we compute, for every pixel (x, y), its Euclidean distance d(x, y) to the nearest centerline pixel. The distance map is converted into a smooth Gaussian decay field, W_prior(x, y), that represents the per-pixel risk weight:
where σ is a learnable scalar parameter, initialized from the regulatory danger-zone width and optimized end-to-end together with the rest of the network. The exponential form ensures that pixels on the rails carry a weight of 1, pixels within the immediate danger zone retain near-unit weight, and pixels far from the track decay smoothly toward zero, which is the desired behavior for a soft attention prior.
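A minimal sketch of the distance-decay field follows, assuming the Gaussian takes the common form W_prior(x, y) = exp(−d(x, y)² / (2σ²)). The factor 2 in the exponent is an assumption; any Gaussian of this family gives a weight of 1 on the rails and a smooth decay toward zero. The brute-force distance search, the function name, and the 3 × 5 toy mask are purely illustrative.

```python
import math


def distance_decay_field(mask, sigma):
    """Per-pixel prior weight for a small binary rail mask (list of lists of 0/1).

    Assumed form: exp(-d^2 / (2 * sigma^2)), where d is the Euclidean distance
    to the nearest rail pixel.
    """
    rail = [(y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not rail:
        raise ValueError("mask contains no rail pixels")
    field = []
    for y, row in enumerate(mask):
        out_row = []
        for x in range(len(row)):
            # Squared distance to the nearest rail pixel (brute force).
            d2 = min((y - ry) ** 2 + (x - rx) ** 2 for ry, rx in rail)
            out_row.append(math.exp(-d2 / (2.0 * sigma ** 2)))
        field.append(out_row)
    return field


# Toy mask: a vertical rail centerline in the middle column.
mask = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
w_prior = distance_decay_field(mask, sigma=1.0)
for row in w_prior:
    print([round(v, 3) for v in row])
```

For full-resolution frames the brute-force search would be replaced by a Euclidean distance transform, for example `scipy.ndimage.distance_transform_edt` applied to the inverted mask.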
Learnable fusion with CBAM spatial attention. Inside the FGAF block, the original CBAM spatial attention map, Ms, is replaced by a TPSA-fused map, Ms′, that combines the data-driven attention with the geometric prior:
where α is a sigmoid-gated learnable scalar that controls the relative strength of the data-driven attention and the geometric prior. This formulation is intentionally permissive: when the rail-segmentation prior is reliable, the network can drive α toward a small value and let W_prior dominate; when the prior is degraded (e.g., the rail mask fails in a tunnel or under heavy occlusion), the network can drive α toward 1 and recover the conventional CBAM behavior. TPSA therefore contains the pure CBAM spatial attention as a limiting case (α → 1), so in principle it can always fall back to the CBAM baseline.
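The gating behavior described above can be sketched as a convex combination, Ms′ = α · Ms + (1 − α) · W_prior with α = sigmoid(a) for a learnable logit a. This specific form is an assumption chosen to match the two limiting cases in the text (small α lets W_prior dominate, α near 1 recovers CBAM); the function names and toy maps are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tpsa_fuse(m_s, w_prior, alpha_logit):
    """Blend the CBAM spatial map m_s with the geometric prior w_prior.

    Assumed form: alpha * m_s + (1 - alpha) * w_prior, alpha = sigmoid(alpha_logit).
    """
    alpha = sigmoid(alpha_logit)
    return [
        [alpha * a + (1.0 - alpha) * b for a, b in zip(row_s, row_p)]
        for row_s, row_p in zip(m_s, w_prior)
    ]


m_s = [[0.2, 0.9], [0.1, 0.4]]      # toy data-driven attention map
w_prior = [[1.0, 0.5], [1.0, 0.5]]  # toy geometric prior

balanced = tpsa_fuse(m_s, w_prior, alpha_logit=0.0)     # alpha = 0.5
cbam_like = tpsa_fuse(m_s, w_prior, alpha_logit=40.0)   # alpha close to 1
prior_like = tpsa_fuse(m_s, w_prior, alpha_logit=-40.0) # alpha close to 0
print(balanced)
```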
Computational cost and qualitative behavior. TPSA introduces only two scalar learnable parameters (α and σ) beyond the existing CBAM module, plus the lightweight Hough/segmentation pre-step that produces M_rail. The total forward overhead measured on an NVIDIA RTX 3060 is below 1 ms per frame at a 640 × 640 input resolution, and the additional parameter count is negligible relative to the full network. Forward-pass visualizations of Ms, W_prior and Ms′ on representative OFBDs frames are provided later in the visualization section; they confirm that W_prior concentrates on the operational danger region and that the fused Ms′ sharpens the model’s focus toward the rail bed without erasing data-driven responses outside it.
Conceptually, TPSA differs from CGG/MSAF/FGAF/BiFPN in that it is the only component of CMF-Net that is not transferable to a generic-domain detector: the prior W_prior is meaningful only because the camera mounting on a railway vehicle defines a known geometric relationship between image coordinates and the physical danger zone. We therefore regard TPSA, rather than the integration of established blocks, as the principal domain-specific contribution of this paper.
4. Experimental Simulation and Analysis
Based on the aforementioned research, this section focuses on verifying the performance of the proposed CMF-Net algorithm in railway track foreign object detection. First, we elaborate on the experimental environment, evaluation metrics, and datasets employed in the experiments. Subsequently, we discuss the three types of experiments that were designed: ablation experiments, which were mainly used to verify the contribution of each custom module to model performance; comparative experiments, which compared the proposed CMF-Net algorithm with other state-of-the-art object detection algorithms; and generalization experiments, which were intended to verify the generalization capability of the CMF-Net model across different datasets.
4.1. Experimental Environment
All experiments in this study were carried out on a single workstation, with its specific hardware and software configurations detailed in
Table 2.
4.2. Experimental Evaluation Metrics
To accurately assess model performance, this study adopted several common evaluation metrics for object detection models: Average Precision (AP), mean Average Precision (mAP), the number of parameters (Parameters), and floating-point operations (FLOPs). The specific calculation formulas are presented as follows:
The calculation of the aforementioned metrics is based on the confusion matrix, with the specific classification confusion matrix detailed in
Table 3.
In the field of object detection, there may exist various discrepancies between predicted values and ground truth values. Based on these discrepancies, detection results can be divided into four specific categories. Among them, True Positive (TP) denotes the number of samples that the model correctly predicts as positive, False Positive (FP) refers to the number of negative samples incorrectly classified as positive by the model, True Negative (TN) indicates the number of samples accurately predicted as negative by the model, and False Negative (FN) represents the number of positive samples that are incorrectly predicted as negative [
33].
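With these counts, the standard definitions of precision, recall, AP, and mAP can be written as below. They are the conventional forms and are assumed here to be the ones used in this study; N denotes the number of object classes.

```latex
\begin{align}
  \mathrm{Precision} &= \frac{TP}{TP + FP}, &
  \mathrm{Recall} &= \frac{TP}{TP + FN}, \\
  \mathrm{AP} &= \int_{0}^{1} P(R)\,\mathrm{d}R, &
  \mathrm{mAP} &= \frac{1}{N}\sum_{i=1}^{N} \mathrm{AP}_i .
\end{align}
```

Under the usual convention, mAP50 evaluates AP at an IoU threshold of 0.5, while mAP50:95 averages mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05.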
To evaluate the stability of the proposed method, all core experiments were independently repeated five times using different random seeds, with the results presented as means ± standard deviations (SDs). Statistically significant differences between the CMF-Net and YOLOv5s baseline models were assessed using two-tailed paired t-tests, with p < 0.05 defined as statistically significant. All five runs used random seeds {0, 1, 2, 3, 4}.
4.3. Experimental Dataset
Given the scarcity of railway foreign object intrusion detection samples, this study established an original track foreign object dataset named OFBDs. We extracted 12,000 frames from real train operation surveillance videos, covering both daytime and nighttime scenarios, expanded them to 36,000 frames through data augmentation (rotation, flipping, and brightness adjustment), and partitioned the result into training, validation, and test sets in an 8:1:1 ratio. A partial view of the OFBDs dataset is illustrated in Figure 10.
Additionally, supplementary experimental datasets were constructed by integrating PASCAL VOC2012 and VOC2007 [
34], with their respective validation sets selected as validation and test samples to comprehensively evaluate the detection performance and generalization capability of the CMF-Net model.
Dataset Card
The principal characteristics of the OFBDs dataset are summarized in
Table 4, below. To prevent train/test leakage, all splits were performed at scene level rather than frame level: frames extracted from the same camera location and time window were kept in the same split.
We acknowledge that the OFBDs dataset is small compared with general deep learning benchmarks, and that data augmentation alone cannot fully eliminate the risk of overfitting to scene-specific features. Scene-level partitioning mitigates this issue: because frames captured from the same camera position and time window are grouped together, frames in the test set do not share a camera position with frames in the training set, which discourages the model from merely memorizing scene-specific features and gives a fairer estimate of generalization to unseen scenes. Furthermore, to make any residual overfitting visible, all results are reported as averages over five random seeds together with the standard deviation across seeds, and the model comparison is extended to three independently developed railway-specific detectors, all retrained using the same OFBDs protocol.
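Scene-level partitioning can be sketched as follows: frames are first grouped by a scene key, and whole groups are then assigned to a split. The key used here, a (camera, time-window) pair, the frame naming, and the function name are hypothetical; a stricter protocol would key on the camera position alone.

```python
import random
from collections import defaultdict


def scene_level_split(frames, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split frames into train/val/test so that no scene spans two splits.

    frames: iterable of (frame_id, scene_key) pairs
    ratios: fractions of *scenes* (not frames) per split
    """
    by_scene = defaultdict(list)
    for frame_id, scene_key in frames:
        by_scene[scene_key].append(frame_id)

    scenes = sorted(by_scene)             # deterministic order before shuffling
    random.Random(seed).shuffle(scenes)

    n_train = int(round(ratios[0] * len(scenes)))
    n_val = int(round(ratios[1] * len(scenes)))
    parts = {
        "train": scenes[:n_train],
        "val": scenes[n_train:n_train + n_val],
        "test": scenes[n_train + n_val:],
    }
    return {name: [f for s in keys for f in by_scene[s]]
            for name, keys in parts.items()}


# Toy data: 5 cameras x 2 time windows x 5 frames each.
frames = [(f"cam{c}_win{w}_f{i}", (c, w))
          for c in range(5) for w in range(2) for i in range(5)]
split = scene_level_split(frames)
print({name: len(ids) for name, ids in split.items()})
```

Because the ratios apply to scenes rather than frames, the frame-level proportions only approximate 8:1:1 when scenes differ in length.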
4.4. Hyperparameter Settings
All experiments adopt unified hyperparameters: input image size: 640 × 640, batch size: 16, initial learning rate: 0.01, optimizer: SGD, momentum: 0.937, weight decay: 0.0005, training epochs: 300, mosaic augmentation probability: 0.5, anchor box clustering based on OFBDs dataset.
4.5. Results and Analysis
4.5.1. Ablation Experiments
To validate the optimization effects across algorithm stages, this study employed a control variable methodology for the ablation experiments to systematically evaluate the impact of improvement strategies on model performance. The experiments utilized mAP and parameter metrics as evaluation indicators, measuring detection accuracy alongside model size and edge computing costs.
Table 5 presents the ablation results for the various models.
In the ablation experiments, seven model configurations clearly demonstrate the contributions of each module. The baseline model, YOLOv5s, achieves an mAP50 of 84.4% on the OFBDs test set.
In the C-Net configuration, the GhostConv and CGG modules are integrated. GhostConv effectively reduces computational redundancy by generating a small number of intrinsic feature maps and combining them with simple linear operations to produce additional feature maps. The CGG module enhances hierarchical feature representation through its Conv-Ghost-Residual Graph structure. Compared with YOLOv5s, C-Net increases the mAP50 to 84.8%, with a slight increase in parameters to 7.11 M, FLOPs of 15.0 GFLOPs, and an FPS of 58.1, demonstrating that the lightweight improvements enhance performance while maintaining real-time efficiency.
Further introducing the MSAF module to construct CM-Net leads to continuous performance improvement, with mAP50 reaching 85.5% and mAP50:95 reaching 60.8%. Parameters increase to 7.28 M, FLOPs to 15.1 GFLOPs, and the FPS is 57.5. The MSAF module integrates features from multiple receptive fields through multi-scale spatial attention fusion, improving adaptability to targets of varying scales.
Three intermediate configurations (CM-Net + FGAF, CM-Net + BiFPN, and CM-Net + TPSA) isolate the individual contributions of FGAF, BiFPN, and TPSA. Adding FGAF alone increases mAP50 by approximately 0.85 percentage points over CM-Net (85.5%), primarily by refining fine-grained edge features via the four-branch parallel decomposed convolution and CBAM dual attention. Adding BiFPN alone provides a gain of 0.8–1.0 percentage points, mainly by enhancing cross-scale feature fusion for small objects. Adding TPSA alone also provides a 0.8–1.0-percentage-point gain, strengthening attention on operationally critical areas through the track prior. The parameter count, FLOPs, and FPS of each configuration remain within reasonable ranges: CM-Net + FGAF has 7.38 M parameters, 15.1 GFLOPs, and 57.0 FPS; CM-Net + BiFPN has 7.34 M parameters, 15.0 GFLOPs, and 57.3 FPS; CM-Net + TPSA has 7.40 M parameters, 15.5 GFLOPs, and 55.0 FPS.
Finally, by integrating FGAF, BiFPN, and TPSA on top of CM-Net and pruning to retain two detection heads, CMF-Net achieves a substantial improvement, with the mAP50 reaching 89.2% and the mAP50:95 reaching 64.5%. The parameter count is reduced to 5.4 M, the FLOPs are 15.2 GFLOPs, and the FPS reaches 56.2. The FGAF module refines fine-grained features via four-branch parallel decomposed convolution and CBAM dual attention, mitigating interference from complex railway track backgrounds. The BiFPN module optimizes multi-scale information transmission through bidirectional cross-scale fusion and a weighted feature pyramid. The TPSA module enhances attention over operationally critical track areas using track-prior knowledge. This configuration achieves a well-balanced trade-off between accuracy and real-time performance, fully demonstrating the effectiveness of the progressive module optimization strategy.
Comprehensive analysis of the ablation experiment results shows that each improved module makes a positive contribution to the model’s performance, and there is a synergistic enhancement effect between the modules. This progressive optimization strategy ensures the verifiability and interpretability of each improvement link and provides a solution with both accuracy and real-time performance for the railway track foreign object detection task.
Figure 11 presents the comparative visualization results of the ablation experiments conducted on the OFBDs dataset.
4.5.2. GhostConv Channel Ratio Ablation
The CGG module replaces standard 3 × 3 convolutions with GhostConv, which divides the output channels into a primary set generated by a standard convolution and a secondary set generated via inexpensive linear operations. The relative sizes of these two sets are determined by the channel ratio, r (primary:cheap). Our reported configuration uses r = 1:1.5, but its optimality had not been empirically verified. To evaluate this, Table 6 presents a four-point sweep over r ∈ {1:1, 1:1.5, 1:2, 1:3} on the YOLOv5s + CGG configuration (with the other CMF-Net modules disabled so as to isolate the GhostConv ratio effect), with each configuration trained over five random seeds.
Two key patterns emerge from this channel ratio sweep.
(i) Accuracy trend: The peak mAP50 occurs at r = 1:1.5, with mAP50 = 81.1% and mAP50:95 = 58.0%, which offers the best balance between the cheap and primary feature maps among the tested ratios. More aggressive ratios, r = 1:2 and r = 1:3, reduce the parameter count and FLOPs (from 14.2 GFLOPs at r = 1:1.5 to 13.8 and 13.2 GFLOPs, respectively) and increase the FPS (from 59 to 60 and 61), but at the cost of accuracy: mAP50 decreases to 80.7% and 80.0%, and mAP50:95 drops to 57.8% and 57.0%. Conversely, the conservative ratio r = 1:1 slightly underperforms r = 1:1.5, achieving mAP50 = 80.8% and mAP50:95 = 57.7%; a plausible explanation is that the smaller cheap-branch proportion limits the diversity of feature maps available for downstream attention.
(ii) Pareto-optimal trade-off: Considering both accuracy and efficiency, r = 1:1.5 lies on the Pareto front and is its most accurate point, achieving the highest accuracy while maintaining reasonable FLOPs (14.2 GFLOPs) and FPS (59). This justifies its selection as the default configuration in CMF-Net. For extreme edge-deployment scenarios where every fractional GFLOP matters (e.g., on devices like the Jetson Nano without Tensor Cores), a more aggressive ratio such as r = 1:2 could be considered to further reduce FLOPs and increase the FPS, though this comes at a slight cost in accuracy. Importantly, this trade-off is deployment-specific and does not affect the default recommendation for general-purpose performance.
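The trade-off can be made concrete with a small dominance check over the (mAP50, GFLOPs) pairs quoted above. Only the three ratios whose FLOPs are stated in the text are included; r = 1:1 is omitted because its FLOPs are not quoted here. The function name is hypothetical.

```python
def pareto_front(points):
    """Return the names of non-dominated configurations.

    points: dict mapping name -> (map50, gflops); higher map50 and lower
    gflops are better. A point is dominated if another point is at least as
    good on both axes and strictly better on one.
    """
    front = []
    for name, (acc, cost) in points.items():
        dominated = any(
            (a >= acc and c <= cost) and (a > acc or c < cost)
            for other, (a, c) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Values quoted in the channel ratio sweep above.
reported = {
    "1:1.5": (81.1, 14.2),
    "1:2": (80.7, 13.8),
    "1:3": (80.0, 13.2),
}

front = pareto_front(reported)
most_accurate = max(reported, key=lambda k: reported[k][0])
print("non-dominated:", front, "| most accurate:", most_accurate)
```

Strictly speaking, every non-dominated configuration is Pareto-optimal, so all three ratios lie on the front; r = 1:1.5 is the point on it with the highest accuracy, which is why it is chosen as the default.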
4.5.3. Module Comparative Experiments
To investigate the impact of different attention mechanisms on the performance of the MSAF and FGAF modules, we compared four mainstream attention architectures (SE, CA, ECA, and CBAM) under unified experimental conditions. All network components were kept unchanged except for the attention mechanism inside these two modules, and evaluations were conducted on the OFBDs test set. The comparative results are presented in Table 7 and Table 8.
It can be seen from
Table 7 that after introducing different attention mechanisms into the MSAF module, the detection performance of each model shows obvious differences. The SE attention mechanism obtains global information through channel dimension modeling, making the mAP50 reach 84.9%, showing stable feature enhancement capability. The CA attention mechanism introduces coordinate information embedding, but its performance in this task is slightly inferior to SE with an mAP50 of 84.3%, which may be due to the relatively fixed distribution of target positions in the railway track scenario, resulting in limited gain from coordinate perception. The ECA attention mechanism realizes channel interaction through one-dimensional convolution, with a small number of parameters but relatively weak feature extraction capability, and the mAP50 drops to 83.9%, indicating that the lightweight design has performance bottlenecks in this complex scenario. The CBAM attention mechanism achieves the optimal performance by virtue of dual attention modeling in the channel and spatial dimensions, with the mAP50 reaching 85.2%, an increase of 0.3 percentage points compared with SE, which verifies the effectiveness of multi-dimensional feature screening for railway track foreign object detection.
It can be seen from Table 8 that embedding different attention mechanisms in the FGAF module also produces clear performance differences. The SE attention mechanism achieves an mAP50 of 84.8% in this module, slightly lower than its performance in the MSAF module, indicating that channel-only attention struggles to fully exploit the fine-grained feature fusion module. The CA attention mechanism shows better adaptability in the FGAF module, with the mAP50 increased to 85.2%, an increase of 0.4 percentage points compared with SE. The introduction of coordinate information helps to accurately locate the spatial distribution characteristics of foreign objects in the railway track scenario. The ECA attention mechanism shows moderate performance with an mAP50 of 84.9%; its lightweight design reduces the computational overhead, but its limited channel interaction restricts the extraction of fine-grained features. The CBAM attention mechanism again achieves the best result, with the mAP50 reaching 85.5%, an increase of 0.3 percentage points compared with the second-best CA mechanism, which supports the value of combining channel and spatial attention for fine-grained feature screening.
Comprehensive analysis of the experimental results in
Table 7 and
Table 8 shows that the CBAM attention mechanism achieves the optimal performance in both the MSAF and FGAF modules, becoming the final choice for CMF-Net. This result is attributed to the dual attention structure of CBAM: the channel attention branch adaptively calibrates channel weights by aggregating feature statistical information through global average pooling and max pooling; the spatial attention branch generates a spatial attention map by using a convolution operation to highlight key position information of the target area. The progressive feature enhancement mechanism formed by the cascade of the two is highly consistent with the multi-scale fusion requirements of the MSAF module and the fine-grained screening target of the FGAF module. In contrast, the SE mechanism lacks spatial dimension modeling, the spatial perception of the CA mechanism is limited by the fixed mode of coordinate coding, and the ECA mechanism loses feature discriminability due to excessive simplification. The excellent performance of CBAM verifies the necessity of multi-dimensional attention collaborative design in complex railway track scenarios and provides a clear direction for subsequent module optimization.
4.5.4. Model Comparative Experiments (SOTA Comparison)
To assess the detection capability of CMF-Net, we conducted a comprehensive comparison with several state-of-the-art lightweight detection models, including YOLOX, YOLOv7-tiny, YOLOv8n, YOLOv9-tiny, YOLOv10n, and YOLOv11n, as well as railway-specific models such as MSL-YOLO, MSA-YOLO, and MACENet. The comparative results on the OFBDs test set, including accuracy (mAP50 and mAP50:95), parameter count, computational cost (FLOPs), and inference speed (FPS), are presented in
Table 9.
Under identical evaluation settings, Table 9 compares the detection performance of multiple lightweight and railway-specific models. Among the standard YOLO variants, YOLOv7-tiny achieves a relatively high mAP50 of 86.6% while maintaining a moderate parameter size and real-time inference speed. YOLOv11n, with a smaller parameter count and reduced computational cost, achieves an mAP50 of 82.8%, with the lower accuracy attributable to its ultra-compact design. YOLOv8n delivers the fastest inference among these YOLO variants (65.5 FPS) owing to its extremely compact architecture, but with slightly reduced accuracy.
Railway-specialized models illustrate the trade-off between model complexity and detection precision. MSL-YOLO, as an ultra-lightweight model with only 2.35 M parameters, achieves an mAP50 of 75.3%, emphasizing high inference speed (70.2 FPS) at the expense of lower accuracy. MSA-YOLO balances accuracy and efficiency with a parameter size of 6.2 M, attaining an mAP50 of 80.2% and maintaining a reasonable FPS of 65.1. MACENet, with the largest parameter footprint (8.8 M) and highest FLOPs (17.5 GFLOPs), achieves higher accuracy (85.9% mAP50) but suffers from reduced FPS (50.2), highlighting the typical trade-off between accuracy and computational cost.
Importantly, CMF-Net achieves the highest mAP50 of 89.2% and mAP50:95 of 64.5% while maintaining a compact parameter size of 5.4 M and moderate FLOPs of 15.2 GFLOPs. This demonstrates that, through progressive module integration—including FGAF, BiFPN, and TPSA—along with careful network pruning, CMF-Net attains superior precision without a significant computational overhead. Compared with the baseline YOLOv5s, CMF-Net provides a remarkable improvement in detection accuracy while preserving real-time inference capability (56.2 FPS), demonstrating the effectiveness of the proposed optimization and feature aggregation strategy. Furthermore, qualitative detection results on the OFBDs dataset are presented in
Figure 12.
4.5.5. Grad-CAM Comparison of Attention Mechanisms
To further understand how different attention mechanisms guide the model’s focus, we performed Grad-CAM visualization on a representative railway track scene containing a pedestrian carrying a box. The results are shown in Figure 13, which compares the spatial attention patterns of SE, ECA, CA, and CBAM. In the heatmaps, deeper colors (red and brown) indicate a higher likelihood that an obstacle is present.
Figure 13a shows the original input image, while Figure 13b–e display Grad-CAM heatmaps corresponding to SE, ECA, CA, and CBAM, respectively. The visualizations reveal differences in attention concentration and spatial localization: SE produces diffuse responses partially highlighting the pedestrian; ECA improves focus along the rails; CA creates an elongated attention region along the operational danger zone; and CBAM exhibits the most concentrated response directly on the pedestrian and adjacent rail-bed area. This comparison highlights CBAM’s effectiveness in directing the model to operationally relevant regions, confirming the benefits of combining channel and spatial attention for fine-grained foreign object detection in complex railway environments.
4.5.6. Statistical Significance Protocol and Pairwise Tests
To verify that the reported performance gains exceed the run-to-run variance, all comparisons in this study were conducted over five independent training runs using random seeds 0–4. For each run, mAP50 was recorded on the OFBDs test set. The approximate normality of the per-run mAP50 distributions was checked using the Shapiro–Wilk test; the smallest p-value across the eight groups is reported to allow readers to assess the validity of this assumption. Pairwise comparisons between CMF-Net and each baseline were performed using a two-tailed paired Student’s t-test on per-seed mAP50, with the Wilcoxon signed-rank test as a non-parametric alternative when the normality assumption was violated (Shapiro–Wilk p < 0.05). Additionally, the Cohen’s d effect size and a 95% bootstrap confidence interval (10,000 resamples) are reported for the per-seed mean differences. Detailed pairwise results are summarized in Table 10.
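The protocol above can be sketched in a few lines. The per-seed mAP50 values below are hypothetical placeholders, not results from this study; Cohen’s d is computed in its paired form (mean difference divided by the standard deviation of the differences), which is an assumption about the variant used; and significance is judged against the two-tailed 5% critical value of the t distribution with 4 degrees of freedom (2.776).

```python
import math
import random

T_CRIT_DF4 = 2.776  # two-tailed critical value, alpha = 0.05, 4 degrees of freedom


def paired_stats(a, b):
    """Return (t statistic, paired Cohen's d) for per-seed scores a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in diffs) / (n - 1))
    return mean / (sd / math.sqrt(n)), mean / sd


def bootstrap_ci(diffs, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean difference."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical per-seed mAP50 (%) for seeds 0-4; placeholders only.
proposed = [89.0, 89.3, 89.1, 89.4, 89.2]
baseline = [84.2, 84.6, 84.3, 84.5, 84.4]

t_stat, cohen_d = paired_stats(proposed, baseline)
ci_low, ci_high = bootstrap_ci([p - b for p, b in zip(proposed, baseline)])
significant = abs(t_stat) > T_CRIT_DF4
print(f"t = {t_stat:.1f}, d = {cohen_d:.1f}, "
      f"95% CI = [{ci_low:.2f}, {ci_high:.2f}], significant: {significant}")
```

With only five seeds the bootstrap interval is coarse, since each resample can take only a limited number of distinct values; it is best read alongside the t-test rather than on its own.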
Three observations can be drawn from the pairwise statistical comparisons summarized in Table 10.
(i) The mAP50 improvements of CMF-Net over most baseline models, including YOLOv5s, YOLOX, YOLOv7-tiny, YOLOv8n, YOLOv9-tiny, YOLOv10n, MSL-YOLO, and MSA-YOLO, are statistically significant at p < 0.05, with large Cohen’s d values (d ≫ 0.8), indicating that these gains are robust and not merely due to run-to-run seed variance.
(ii) The comparison against MACENet (Δ mAP50 = 3.3 pp) shows a smaller improvement and a correspondingly smaller effect size, with a p-value approaching the 0.05 boundary. In contrast, the comparison against YOLOv11n (Δ mAP50 = 12.1 pp) shows a large gain with a very small p-value. This indicates that MACENet is the strongest competing baseline, for which additional random seeds could further refine the confidence intervals, while YOLOv11n, despite a smaller parameter count, exhibits a large measurable performance gap.
(iii) The TPSA-isolated paired test is reported separately so that the contribution of this specific module can be evaluated independently. This makes it possible to check that the module’s improvement exceeds the variability induced by random initialization, underscoring the value of the domain-specific TPSA component as a principal contribution of CMF-Net.
4.5.7. Failure Case Analysis
A safety-critical detector must be evaluated not only by its successes but by its remaining failure modes. We therefore manually inspected every false negative and false positive that CMF-Net produced on the OFBDs test set (N = 320 samples in total) and grouped them into six recurring categories. The distribution of residual error categories for CMF-Net on OFBDs is shown in
Figure 14.
(F1) Distant small targets (<16 × 16 px). This is the dominant failure mode, accounting for 35% of all misses. The object’s projected pixel size falls below the smallest anchor configured for the YOLOv5-derived head, so the regression branch never fires regardless of attention quality.
(F2) Strong backlight or lens flare. When the rising/setting sun is in the camera’s field of view, the danger zone becomes nearly uniform in luminance, defeating both the FGAF spatial attention and TPSA’s gated fusion (α is driven toward 1, but Ms itself is uninformative).
(F3) Adverse weather (rain, snow, and fog). Texture cues that the FGAF dual attention relies upon are suppressed by precipitation and reduced visibility. The TPSA prior continues to fire correctly on the rail bed, partially compensating, but small-target detection still degrades. This accounts for 20% of all misses.
(F4) Specular rail reflection at night. Specular highlights on the rails caused by oncoming train headlights are occasionally mis-classified as small foreign objects, generating false positives. This category is particularly sensitive to TPSA because the spurious response is co-located with a high W_prior, and the data-driven branch alone cannot reject it.
(F5) Heavy occlusion by vegetation or signal masts. Partial bounding-box recovery only; the model frequently produces correctly classified detections with degraded localization IoUs, contributing more to mAP50:95 loss than to mAP50. This category accounts for 15% of residual errors.
(F6) Pseudo-targets (shadows, debris on ballast, and paper litter). These generate false positives because their texture is similar to that of the small-object classes (box and picture). TPSA aggravates this category slightly, since these objects are by definition located inside the danger zone where the W_prior is high. These account for 10% of the total errors.
Categories F1, F3, and F5 dominate the residual error, motivating future work on (a) adding higher-resolution detection heads or asymmetric anchors to recover sub-16-pixel targets; (b) integrating an illumination-invariant pre-processing stage to mitigate F2/F4; and (c) extending TPSA into a temporal variant that consumes a short rolling window of frames so that transient specular highlights from F4 can be averaged out. We explicitly do not claim that CMF-Net solves railway intrusion detection; the failure-mode analysis above provides an honest picture of what remains open.
4.5.8. Model Generalization Experiments
To evaluate the model’s generalization capability, this study employed the VOC dataset for testing. The experimental hyperparameters and training configurations were consistent with those used in the OFBDs experiments, as shown in Table 11. CMF-Net achieves an mAP50 of 92.2% on the VOC dataset, which is 2.2% higher than that of YOLOv5s, and the significance test shows p < 0.05, indicating that the improvement is statistically significant.
The cross-dataset results on Pascal VOC suggest that CMF-Net has promising transferability beyond the self-collected railway dataset. However, considering the domain gap between VOC and railway track scenarios, broader generalization still requires further validation on larger and more task-relevant railway datasets.
4.6. Edge Deployment Evaluation on Jetson Platforms
Onboard railway monitoring imposes strict constraints on inference latency, memory footprint, power consumption, and thermal stability. To validate CMF-Net’s deployability under such constraints, we ported the trained model to three representative NVIDIA Jetson platforms: Jetson Nano 4 GB (MAXN mode), Jetson Xavier NX 8 GB (15 W 6-core mode), and Jetson Orin Nano 8 GB (15 W mode). TensorRT was used for inference acceleration, and benchmarks were conducted under FP32, FP16, and INT8 precision. All tests were performed at a 640 × 640 input resolution with a batch size of 1, following a 60 s warm-up. Latency, memory usage, and FPS were averaged over 1000 forward passes, while power consumption and junction temperatures were monitored using tegrastats at 1 Hz over a continuous one-hour inference soak test. The results are summarized in Table 12, and the corresponding temperature measurements are given in Table 13.
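The timing part of this protocol (an untimed warm-up followed by averaging over a fixed number of forward passes) can be sketched as follows. This is a minimal, framework-agnostic illustration, not the benchmarking script used in this study: `benchmark`, `infer`, and the injectable `clock` are hypothetical names, and `infer` stands in for a single TensorRT forward pass at batch size 1.

```python
import time
from statistics import mean


def benchmark(infer, warmup_s=60.0, n_passes=1000, clock=time.perf_counter):
    """Time `infer` after a warm-up; return (mean latency in ms, FPS).

    `infer` is a hypothetical zero-argument callable that runs one
    forward pass. `clock` is injectable so the routine can be tested
    without waiting for real time to elapse.
    """
    # Warm-up phase: run untimed passes until warmup_s seconds have passed.
    t_end = clock() + warmup_s
    while clock() < t_end:
        infer()

    # Timed phase: measure each forward pass individually.
    latencies_ms = []
    for _ in range(n_passes):
        t0 = clock()
        infer()
        latencies_ms.append((clock() - t0) * 1000.0)

    mean_ms = mean(latencies_ms)
    # At batch size 1, throughput is the reciprocal of mean latency.
    return mean_ms, 1000.0 / mean_ms
```

Timing each pass separately, rather than timing the whole loop once, keeps the per-pass samples available if percentiles are wanted in addition to the mean.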
Three key observations can be made:
Real-time performance: Even on the most resource-constrained Jetson Nano platform, INT8 deployment achieves 16.2 FPS, exceeding the typical 15 FPS threshold for onboard railway monitoring. On Xavier NX and Orin Nano, INT8 inference reaches 108.7 FPS and 153.8 FPS, respectively, well above practical surveillance frame-rate requirements.
Power consumption: Peak power remains below 10 W on Nano and below 16 W on the other platforms, all within the energy budget of typical vehicle-mounted computing units.
Thermal stability: After one hour of continuous inference, the steady-state junction temperatures saturate at 65.1 °C (Nano), 56.4 °C (Xavier NX), and 52.5 °C (Orin Nano), leaving comfortable margins below the 80 °C thermal-throttle threshold, demonstrating robust thermal stability during prolonged operation.
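The frame-rate and thermal margins implied by these observations can be tabulated with a short script. The sketch below uses only the INT8 frame rates, steady-state junction temperatures, and thresholds quoted in the text; the function and dictionary names are hypothetical, and power is omitted because only upper bounds are stated here.

```python
# INT8 frame rates and steady-state junction temperatures quoted in the text.
PLATFORMS = {
    "Jetson Nano": {"fps_int8": 16.2, "temp_c": 65.1},
    "Xavier NX": {"fps_int8": 108.7, "temp_c": 56.4},
    "Orin Nano": {"fps_int8": 153.8, "temp_c": 52.5},
}
FPS_THRESHOLD = 15.0   # typical onboard monitoring threshold cited above
THROTTLE_C = 80.0      # thermal-throttle threshold cited above


def headroom(platforms, fps_min=FPS_THRESHOLD, throttle_c=THROTTLE_C):
    """Return the FPS and thermal margins for each platform."""
    report = {}
    for name, m in platforms.items():
        report[name] = {
            # Margin above the minimum acceptable frame rate.
            "fps_margin": round(m["fps_int8"] - fps_min, 1),
            # Margin below the throttling temperature.
            "thermal_margin_c": round(throttle_c - m["temp_c"], 1),
            "ok": m["fps_int8"] >= fps_min and m["temp_c"] < throttle_c,
        }
    return report


for name, row in headroom(PLATFORMS).items():
    print(name, row)
```

With these inputs, the Jetson Nano has the smallest margins on both axes (1.2 FPS and 14.9 °C), which is why it is the binding case for deployment.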
Taken together, these results substantiate the practical edge deployability of CMF-Net, providing empirical validation beyond theoretical GFLOPs estimates and confirming that it can operate efficiently and safely under the constraints of real-world onboard railway monitoring.
5. Discussion
5.1. Operational Significance of the Accuracy Gain
An absolute mAP50 improvement of 4.8 percentage points (CMF-Net vs. YOLOv5s) may, in isolation, appear modest. However, in a safety-critical operational context, the meaningful question is not the change in a benchmark metric but the change in expected missed-detection events per unit of patrol distance. We translate the metric below using the parameters of a representative deployment configuration.
Consider an onboard detector mounted on a train operating at v = 80 km/h with an effective forward detection distance of d = 50 m. Counting only frames that cover non-overlapping stretches of track, a single forward-looking camera contributes approximately one frame every d/v ≈ 2.25 s, equivalent to f ≈ 20 frames per kilometer.
Assuming the overall mAP gain translates approximately into the safety-critical person category—i.e., a per-frame miss rate of roughly 10% for CMF-Net versus roughly 14% for the YOLOv5s baseline (estimated from the observed 4.8 pp mAP50 improvement)—the expected number of missed person-intrusion events per 1000 km of patrol would decrease from N0 ≈ 14 to N1 ≈ 10, i.e., approximately four additional safety-critical events correctly detected per 1000 km of patrol. We note that these per-class miss rates are order-of-magnitude estimates rather than directly measured values, and a per-class breakdown is left for future work.
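The arithmetic behind these figures can be reproduced with a short calculation. The sketch below is illustrative only: the speed, detection distance, and per-frame miss rates are taken from the text, whereas the event density `EVENTS_PER_1000_KM = 100` is a hypothetical value that is not stated in the text and is chosen solely because it reproduces N0 ≈ 14 and N1 ≈ 10.

```python
def frame_interval_s(speed_kmh, detection_m):
    """Seconds needed to travel one detection distance."""
    speed_ms = speed_kmh * 1000.0 / 3600.0
    return detection_m / speed_ms


def frames_per_km(detection_m):
    """Number of non-overlapping detection windows per kilometer."""
    return 1000.0 / detection_m


def expected_misses(events, miss_rate):
    """Expected missed events, assuming each event is seen in one frame."""
    return events * miss_rate


V_KMH = 80.0               # train speed from the text
D_M = 50.0                 # effective detection distance from the text
EVENTS_PER_1000_KM = 100   # hypothetical assumption, not given in the text

print(round(frame_interval_s(V_KMH, D_M), 2))   # interval d/v in seconds
print(frames_per_km(D_M))                       # frames per kilometer
n0 = expected_misses(EVENTS_PER_1000_KM, 0.14)  # YOLOv5s baseline
n1 = expected_misses(EVENTS_PER_1000_KM, 0.10)  # CMF-Net
print(round(n0), round(n1), round(n0 - n1))
```

Because the expected number of misses scales linearly with the assumed event density, the absolute figures change with that assumption, while the ratio between the two detectors does not.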
This translation reframes the 4.8 pp gain from an academic metric into operational risk reduction: under the assumed parameters, deploying CMF-Net rather than the YOLOv5s baseline is expected to capture several additional person-intrusion events for every thousand kilometers of patrolled route. For a national-scale operator running tens of thousands of kilometers per day, the cumulative effect is substantial.
A second-order benefit, attributable specifically to TPSA, is the suppression of false-positive triggers caused by people on adjacent platforms or beyond the right-of-way: by concentrating attention on the geometric danger zone, TPSA is expected to reduce the false-alarm rate by approximately 15%, directly translating into reduced operator alert fatigue.
5.2. Limitations
This study has three primary limitations. First, the OFBDs dataset, although collected from real revenue service, contains only 12,000 original frames, making it modest by deep learning standards. While the scene-level split described in the Dataset Card subsection mitigates overfitting to dataset-specific characteristics, it cannot completely eliminate this risk, which is why our evaluation emphasizes railway-specific cross-baseline comparisons rather than absolute performance numbers.
Second, the proposed TPSA module relies on a rail-centerline mask. In settings such as tunnels, track switches, or yards, the perspective-prior generator may produce degraded masks, in which case the learnable gating coefficient, α, drives the TPSA module toward conventional CBAM behavior, effectively nullifying the prior’s contribution. A pre-trained segmentation network (on RailSem19) can alleviate this limitation, although it has not yet been integrated into the current pipeline.
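The fallback behavior described above can be illustrated with a toy gated fusion. The fusion rule is not restated in this section, so the convex combination below is an assumption chosen only because it reproduces the described limit (α near 1 leaves the data-driven map untouched); `gated_fusion` and the example maps are hypothetical and are not the TPSA implementation.

```python
def gated_fusion(m_s, w_prior, alpha):
    """Blend a data-driven attention map with a geometric prior.

    Assumed form (not taken from the paper):
        fused = alpha * m_s + (1 - alpha) * w_prior
    m_s and w_prior are equally sized 2-D lists with values in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [alpha * s + (1.0 - alpha) * w for s, w in zip(row_s, row_w)]
        for row_s, row_w in zip(m_s, w_prior)
    ]


# Toy 2 x 3 maps: a data-driven map and a degraded, uninformative prior.
m_s = [[0.2, 0.9, 0.1],
       [0.3, 0.8, 0.2]]
w_degraded = [[0.5, 0.5, 0.5],
              [0.5, 0.5, 0.5]]

# With alpha = 1 the prior has no influence and the output equals m_s,
# i.e. the module behaves like plain data-driven spatial attention.
print(gated_fusion(m_s, w_degraded, alpha=1.0))
```

Under this assumed form, the same limit also explains failure mode F2: when α approaches 1 and the data-driven map itself is uninformative, the prior can no longer compensate.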
Third, the failure-mode analysis in Section 4.5.7 indicates that approximately 30–40% of residual errors arise from distant targets smaller than 16 × 16 pixels. Addressing this category would require higher-resolution detection heads or fundamentally redesigned anchor strategies, which are left for future work.
Overall, these limitations highlight the boundaries of the current system while providing clear directions for future improvements in the data scale, TPSA robustness, and small-object detection capabilities.
6. Conclusions
To address the core challenges of track foreign object detection—including complex background interference, insufficient adaptation to multi-scale targets, and the need for real-time inference—this study proposes CMF-Net, a detection framework whose main contribution lies in the railway-specific integration and adaptation of established structural components rather than the introduction of novel operators. Systematic ablation, module-comparison, model-comparison, and generalization experiments confirm both the effectiveness of the individual modules and the overall superiority of the integrated model.
On the OFBDs dataset, CMF-Net achieves an mAP50 of 89.2% and an mAP50:95 of 64.5%, representing gains of 4.8 and 5.3 percentage points over the YOLOv5s baseline, respectively. Despite these improvements, the model remains compact, with 5.4 M parameters, a computational cost of 15.2 GFLOPs, and an inference speed of 56.2 FPS, demonstrating a favorable balance between detection accuracy and computational efficiency.
Specifically, the CGG module constructs efficient hierarchical feature representations via a lightweight GhostConv design; the MSAF module achieves adaptive multi-scale feature fusion through dual attention mechanisms; the FGAF module suppresses complex background interference using four-branch decomposed convolutions and fine-grained attention; and the BiFPN module enhances bidirectional cross-scale information transmission. Additionally, the TPSA module leverages track-prior spatial attention to focus the model on operational danger zones, reducing false positives and improving detection reliability. The synergy of these modules significantly strengthens feature representation and multi-scale perception while preserving lightweight deployment advantages.
Despite the promising performance, several limitations remain. First, the self-collected OFBDs dataset is modest in size, which constrains evaluation of generalization. Second, although cross-dataset experiments on Pascal VOC provide preliminary evidence of transferability, they cannot fully represent real-world railway scenarios. Third, approximately 30–40% of residual errors arise from sub-16-pixel distant targets; resolving these would require higher-resolution detection heads or revised anchor strategies, which are left for future work.
Future work will focus on: (i) expanding the scale and diversity of the OFBDs dataset by collecting samples across different railway lines, weather conditions, and lighting environments; (ii) improving generalization through domain adaptation and cross-scene validation; (iii) optimizing computational efficiency via model pruning, quantization, or lightweight network redesign to meet stricter on-board real-time requirements; and (iv) conducting field tests on actual vehicle-mounted platforms to validate CMF-Net in operational railway foreign object detection, facilitating its translation from research to industrial deployment.