Article

Research on Rail Damage Detection Based on Improved DETR Algorithm

1 School of Intelligent Manufacturing, Nanchang Jiaotong Institute, Nanchang 330013, China
2 Jiangxi Provincial Key Laboratory of Intelligent Operation and Maintenance Technology and Equipment for Rail Transit Vehicles, East China Jiaotong University, Nanchang 330013, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2025, 15(24), 13223; https://doi.org/10.3390/app152413223
Submission received: 17 November 2025 / Revised: 10 December 2025 / Accepted: 15 December 2025 / Published: 17 December 2025
(This article belongs to the Section Transportation and Future Mobility)

Abstract

In rail damage detection, the scale variation of small targets leads to inaccurate extraction of damage morphology and size features, thereby affecting the reliable identification of damage types. To address this, the DETR algorithm is optimized and improved. Firstly, we introduce the convolution–attention fusion module (CAFMAttention) after the two lateral convolutional layers of the original algorithm; then, we replace the nn.Upsample-based upsampling layer with the Dysample upsampler; finally, we replace the Conv modules in the two downsampling convolutional layers with DualConv modules. The comparative experiments show that the recall of the improved DETR model is 0.698, which is 12.2% higher than that of the original DETR model; the precision is 0.815, 2.3% higher; the average precision (mAP@0.5) is 0.741, 8.7% higher; the F1-score is 0.75, 8.7% higher; and the detection speed is 64.94 FPS, 2.6% higher. The proposed DETR algorithm can detect rail damage well under complex working conditions, with high accuracy and robustness, and better meets the requirements of practical rail detection.

1. Introduction

With the continuous expansion of the global railway network, the running speed of trains keeps increasing and the axle load keeps growing. As a result, the damage to rails caused by wheel–rail interaction has become increasingly severe [1]. Rail defects have thus become more common and diverse, including spalling, ripples, fish-scale cracking and other forms of damage [2]. These defects not only accelerate the wear and aging of rails, but also cause intense train vibrations, thereby posing potential risks to operational safety and passenger comfort [3]. Consequently, precise detection and classification of rail surface damage have become an important research topic in railway engineering. However, non-destructive testing of rail surface defects still faces several bottlenecks, such as reliably identifying small defects in complex backgrounds and maintaining robustness under real-world inspection conditions. A variety of vision-based methods have been explored for rail surface inspection. Improved YOLO variants and multi-scale strategies, for example combining RSD-Net with lightweight YOLO models or optimizing anchors with K-means++, can enhance localization and small-target recognition, but usually at the cost of higher model complexity and greater data requirements [4,5,6,7,8,9,10]. CNN-based ripple detection shows strong real-time performance, yet its robustness to illumination changes remains limited, while few-shot approaches such as FS-RSDD perform well when annotations are scarce but are sensitive to intra-class variations [11,12]. Beyond vision, ultrasonic non-destructive evaluation using autoencoders and hybrid physical–machine learning strategies can detect internal defects, but they are constrained by noise sensitivity and computational cost [13,14]. Existing benchmark datasets for rail surface faults, including track surface fault sets, Rail-5k and UAV-based datasets, have facilitated training and evaluation [15,16,17,18,19,20], but still suffer from limited diversity and uneven annotation quality. Overall, current research has significantly advanced the autonomous detection of rail surface damage, yet challenges remain in terms of small-target recognition, the diversity of available datasets, and the robustness of models in complex and changing environments.
Transformer-based architectures have recently demonstrated strong capabilities in capturing fine-grained and global contextual features in both generic vision and industrial inspection tasks [21]. Representative sequence models such as BERT, GPT-1, XLNet and T5, together with long-sequence and sparse-attention designs like BigBird, ETC and hierarchical Transformers, provide a rich design space for balancing context modeling, accuracy and computational cost [22,23,24,25,26,27,28,29]. Comprehensive overviews of these developments can be found in [30]. Among detection-oriented Transformers, DETR has attracted increasing attention as one of the earliest end-to-end object detection paradigms [31,32]. By leveraging an encoder–decoder architecture with global self-attention and discarding manually designed anchors and non-maximum suppression, DETR improves system reusability and engineering integration, but the original formulation still suffers from slow convergence and weak sensitivity to small objects. Subsequent variants such as Deformable DETR, UP-DETR and SMCA-DETR have improved training efficiency and convergence [33,34,35], while Group-DETR, Lite-DETR and RT-DETR further enhance small-target localization and real-time performance on edge devices [36,37,38]. Recent work on boundary-aware strategies and ranking-based DETR further strengthens localization quality and robustness under occlusion, congestion and multi-scale conditions [39]. Nevertheless, most of these DETR variants are developed and evaluated on generic benchmarks rather than being tailored to the specific characteristics of rail surface images, which are dominated by elongated structures, repetitive textures, heavy occlusions and strongly imbalanced defect scales. There is still a lack of DETR-based frameworks that are both lightweight and robust, explicitly designed for small and subtle rail defects under complex backgrounds and limited data. This gap motivates the present work, in which we design an improved DETR model that enhances small-damage sensitivity and robustness while preserving real-time performance for practical rail damage detection.
In summary, existing rail damage detection methods still suffer from three major limitations. First, the scale variation and subtle appearance of many defects lead to a low recall of small and weak targets. Second, it remains challenging to simultaneously exploit global contextual information along the rail and local morphological cues around defects under complex illumination and background conditions. Third, there is still a lack of detector designs that explicitly consider the trade-off between accuracy and computational efficiency for on-board inspection systems. In response to these limitations, this study develops an improved DETR-based rail surface damage detection framework that integrates several targeted architectural enhancements. A Convolution–Attention Fusion Module (CAFMAttention) is introduced to adaptively fuse multi-scale contextual cues with local details, thereby strengthening the representation of small and weakly contrasted defects. To further enrich hierarchical feature extraction while keeping the network lightweight, Dual Convolution (DualConv) modules are embedded into the downsampling stages. In addition, the conventional upsampling block is replaced by a dynamic Dysample upsampler, which more accurately reconstructs high-resolution feature maps and improves defect localization precision. Experiments conducted on a real-world rail surface dataset demonstrate that the proposed model substantially boosts recall and F1-score under complex background conditions, while still sustaining real-time inference speed, making it a more robust and practical solution for engineering rail inspection. The remainder of this paper is organized as follows: Section 2 briefly reviews the baseline DETR framework and the evaluation metrics; Section 3 presents the proposed CAFMAttention, DualConv, and Dysample modules, together with the overall improved architecture; Section 4 describes the dataset and implementation details and reports the experimental results and ablation studies; Section 5 discusses the synergy among the proposed modules, robustness issues, limitations, and future research directions; and Section 6 concludes the paper.

2. DETR Algorithm Based on Attention Mechanism

The Detection Transformer (DETR) is an end-to-end object detection algorithm that integrates a convolutional neural network (CNN) with a Transformer and uses a self-attention mechanism to complete the detection task. Unlike traditional approaches, DETR encodes global features through the Transformer and employs Hungarian matching for object prediction, thereby eliminating the need for anchor design and enabling fully end-to-end training.

2.1. Transformer Neural Network

Transformer, proposed by Vaswani et al., is a deep neural network architecture based on a self-attention mechanism, capable of modeling the interactions between different positions in the input sequence, thereby learning sequence dependencies.
The Transformer consists of an Encoder and a Decoder. The encoder is composed of multi-head self-attention and a feedforward network, which can transform the input sequence into a feature representation. The decoder adds an encoder–decoder attention mechanism on this basis, combining the output of the encoder to generate the target sequence. The overall Transformer architecture is illustrated in Figure 1.
The Transformer consists of an encoding stage and a decoding stage. The specific process is as follows (a minimal code sketch of the core attention computation is given after the list):
(1)
Coding stage:
  • The input sequence X is transformed into a fixed-dimension vector by the embedding layer, and position encoding is added to retain position information.
  • The multi-head self-attention mechanism calculates Q (query vector), K (key vector), and V (value vector); the attention scores are normalized by Softmax, and the values are then weighted and summed to aggregate global information.
  • The calculation results are added to the input through residual connections, then normalized and passed to the next layer.
  • Nonlinear transformation is carried out using a two-layer fully connected network to extract high-dimensional features.
  • Repeat the above steps to output high-dimensional features.
(2)
Decoding Stage:
  • The decoder receives the output from the encoder and adds position encoding.
  • Global information is generated through a multi-head self-attention mechanism.
  • The encoder–decoder attention mechanism calculates Q, K, and V to extract the relevant information of the target sequence.
  • The calculation results are added to the input through residual connections, then normalized and passed to the next layer.
  • Nonlinear transformation is carried out using a fully connected network to extract high-dimensional features.
  • Repeat the above steps to output high-dimensional features.
  • Generate the target sequence through linear transformation and the Softmax function.
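As referenced above, the following minimal PyTorch sketch (illustrative only, not the paper's implementation) shows the scaled dot-product attention at the core of each multi-head layer in both the encoding and decoding stages:

    import torch

    def scaled_dot_product_attention(Q, K, V):
        # Similarity scores between queries and keys, scaled by sqrt(d_k),
        # are normalized with Softmax and used to weight the values.
        d_k = Q.size(-1)
        scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (..., len_q, len_k)
        weights = torch.softmax(scores, dim=-1)         # attention distribution
        return weights @ V                              # weighted sum of values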

2.2. Rail Damage Detection Method Based on the DETR Algorithm

DETR [40], proposed by Carion et al. in 2020, is the first end-to-end object detection model based on the Transformer framework. This model innovatively leverages the global self-attention mechanism and sequence modeling capability of the Transformer architecture to break through the traditional detection algorithm's reliance on predefined anchor boxes. It transforms the object detection task into a set prediction problem and realizes the end-to-end object detection process by directly outputting the set of object categories and bounding box coordinates.
It is worth noting that, in our application, the Transformer operates on single-frame rail surface images rather than temporal sequences of frames. The “sequence length” in the encoder and decoder therefore corresponds to the number of visual tokens produced by the convolutional backbone, which is determined by the input resolution and downsampling strides. In practice, for an input of size H × W and a backbone stride of s, the encoder receives a sequence of N = (H/s) × (W/s) tokens. This design offers a good balance between capturing sufficient spatial context along the rail and keeping the computational cost manageable for real-time deployment.
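For illustration, with a hypothetical 640 × 640 input and a backbone stride of s = 32, the encoder would operate on N = (640/32) × (640/32) = 20 × 20 = 400 tokens.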
In the specific engineering application scenario of rail damage detection, the diversity of damage forms such as surface cracks, peeling, and depressions of rails, as well as the complexity of scene factors such as complex lighting conditions, rail fastener occlusion, and rail oxide layer interference, have higher requirements for the feature extraction accuracy and robustness of the target detection algorithm. Compared with traditional neural network models such as anchor frame-based Faster R-CNN, DETR has the following significant advantages in the task of rail damage detection:
(1)
Elimination of the anchor box design: The shapes and sizes of rail damage vary widely, ranging from minor surface cracks to extensive railhead spalling, and their morphological characteristics differ significantly. The anchor box design of traditional detectors usually adopts a fixed range of sizes and aspect ratios, and these preset parameters are difficult to adapt to the varied forms of rail damage, such as transverse cracks, rail web holes, and rail base rust. In the actual detection process, fixed anchor boxes are prone to missing small-sized damage or mislocalizing large-sized damage; especially when the damage scale exceeds the preset anchor box range, the detection accuracy drops significantly. DETR innovatively eliminates the anchor box design and directly outputs the position coordinates and category probabilities of targets through a set prediction mechanism, no longer relying on the complex matching process between preset boxes and ground-truth targets. This design can flexibly capture the local and global features of rail damage of different sizes and forms. Especially in the common rail inspection scenario where millimeter-level microcracks and centimeter-level spalling coexist, it demonstrates stronger adaptability and robustness for rail damage detection tasks with a large scale span.
(2)
Self-attention mechanism: In actual rail damage detection scenarios, complex and variable light intensity (direct strong light at tunnel entrances and exits, low-light environments at night) and diverse interfering objects (weeds around the rails, ballast, metal connectors, and dynamic shadows during train operation) have a significant impact on detection performance. Traditional neural networks employ a local feature extraction mechanism, capturing features through convolution operations with a fixed-size local receptive field. When dealing with complex scenarios where fine cracks and large-scale spalling coexist on the rail surface, this mechanism struggles to construct long-distance global correlations between the damaged area and the background, leaving the model vulnerable to uneven illumination and complex background interference and prone to feature confusion or missed detections. DETR achieves dynamic weight allocation through a self-attention mechanism, which adaptively adjusts feature weights based on the salient features of the rail damage area, prioritizing computing resources for the key regions containing damage information while suppressing the feature responses of irrelevant background regions. This fundamentally reduces the impact of illumination fluctuations and complex backgrounds on detection accuracy, and demonstrates strong environmental adaptability and anti-interference robustness in rail damage detection. The original DETR network framework is shown in Figure 2.

2.3. Model Evaluation Metrics of the DETR Algorithm in Rail Damage Detection

(1)
Recall (R)
In the rail damage detection task, recall represents the proportion of true defects successfully detected by the model to all true defects. The calculation formula is shown in Equation (1).
$R = \frac{TP}{TP + FN} = \frac{TP}{num_{sample}}$   (1)
Among them, TP denotes the damages that actually exist and are successfully detected by the model, FN denotes the damages that actually exist but are not detected by the model, and $num_{sample}$ represents the total number of true damages. The higher the recall, the more damages the model can detect, thereby ensuring greater operational safety. Therefore, for rail damage detection tasks where operational safety must be prioritized, recall is the most important evaluation metric.
(2)
Precision (P)
In the rail damage detection task, precision represents the proportion of true damages among all the rail damages detected by the model. The calculation formula is shown in Equation (2).
$P = \frac{TP}{TP + FP} = \frac{TP}{num_{predict}}$   (2)
Among them, FP denotes the regions where no damage actually exists but which are incorrectly predicted as damage by the model, and $num_{predict}$ represents the number of damages predicted by the model. The higher the precision, the fewer false detections the model produces, and the stronger the reliability of the detection.
(3)
Mean Average Precision (mAP)
In object detection tasks, mAP measures the average detection performance of the model across all detection categories and confidence thresholds. The calculation formula is shown in Equation (3).
$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$   (3)
where $AP_i$ denotes the average precision of the i-th damage category and N is the number of categories.
(4)
F1-score (F1)
In the rail damage detection task, improving precision may lead to a decrease in recall. F1-score is an evaluation metric that comprehensively considers both recall and precision. The calculation formula is shown in Equation (4).
$F1 = \frac{2P \times R}{P + R}$   (4)
Among them, P denotes the Precision, which represents the proportion of correctly predicted positive samples among all samples predicted as positive by the model, and R denotes the Recall, which represents the proportion of actual positive samples that are correctly detected by the model. The F1-score is the harmonic mean of Precision and Recall, providing a comprehensive measure of the balance between “fewer false positives” and “fewer missed detections”. When both P and R are high, the F1-score is also high; if either value is low, the F1-score decreases significantly. Therefore, in safety-critical tasks such as rail damage detection, the F1-score can be used to evaluate the overall detection performance and reliability of the model: the closer the value is to 1, the more stable and accurate the model's detection results are.
(5)
FPS
FPS represents the number of image frames that the detection system can process per second. The higher the FPS, the better the real-time performance of rail damage detection. The calculation formula is shown in Equation (5).
$FPS = \frac{1000}{t_{pre} + t_{inf} + t_{loss} + t_{post}}$   (5)
Among them, $t_{pre}$ denotes the preprocessing time, which includes operations such as image reading, resizing, and normalization; $t_{inf}$ denotes the inference time, referring to the time required for the model's forward propagation to obtain prediction results; $t_{loss}$ denotes the loss computation time, which is mainly relevant during the training phase; and $t_{post}$ denotes the post-processing time, including confidence filtering, Non-Maximum Suppression (NMS), and result mapping. In industrial applications, real-time detection generally requires the FPS to reach at least 30. Therefore, in subsequent research, while improving other performance metrics such as accuracy and recall, it is also necessary to ensure that the FPS remains above 30 frames per second to meet the real-time requirements of practical engineering applications. A small computational sketch of these metrics is given below.
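To make the metric definitions concrete, the following sketch (hypothetical helper functions, not code from the paper) computes precision, recall, and F1 from matched detection counts and estimates FPS from per-frame timings, following Equations (1), (2), (4), and (5):

    import time
    import torch

    def detection_metrics(tp: int, fp: int, fn: int):
        # Precision, recall, and F1 from matched detection counts (Eqs. (1), (2), (4)).
        # tp/fp/fn would come from IoU-based matching of predictions to ground truth.
        recall = tp / (tp + fn) if tp + fn else 0.0        # Eq. (1): TP / num_sample
        precision = tp / (tp + fp) if tp + fp else 0.0     # Eq. (2): TP / num_predict
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    @torch.no_grad()
    def measure_fps(model, frames, preprocess, postprocess):
        # Eq. (5) at inference: FPS = 1000 / (t_pre + t_inf + t_post) per frame,
        # averaged over the clip; t_loss applies only during training and is omitted.
        total_ms = 0.0
        for frame in frames:
            t0 = time.perf_counter()
            x = preprocess(frame)      # image reading/resizing/normalization
            y = model(x)               # forward inference
            postprocess(y)             # confidence filtering, result mapping
            total_ms += (time.perf_counter() - t0) * 1000.0
        return 1000.0 * len(frames) / total_ms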

3. Algorithm Improvements of DETR for Rail Damage Detection

3.1. Adding a Convolution–Attention Fusion Module (CAFMAttention)

The Convolution–Attention Fusion Module (CAFMAttention) [41] is an innovative composite feature enhancement module that deeply integrates the inherent local feature extraction ability of convolution operations with the powerful global information capture ability of the attention mechanism. Its core design concept lies in achieving complementary advantages of the two mechanisms through parallelized feature processing paths: it retains the precise characterization ability of convolution for local texture details while incorporating the attention mechanism's strength in modeling long-distance dependencies.
In rail damage images, typical defects such as spalling blocks, fish-scale patterns, and corrugation damage all manifest as subtle morphological changes in local areas. Such small targets usually need to be detected precisely on high-resolution feature maps to preserve the integrity of edge details. Because the original DETR model adopts a single-scale feature input strategy and lacks a multi-scale feature fusion mechanism, the network struggles to adaptively integrate feature information at different levels and cannot fully extract key detail information such as millimeter-scale cracks and micrometer-scale surface defects. This ultimately leads to problems such as insufficient positioning accuracy and weak feature response when DETR detects small-target damage. The CAFMAttention module efficiently extracts the local texture features and edge gradient information of the damaged area by exploiting the sliding-window characteristic of the depthwise convolutional layer, while establishing a correlation feature mapping between the damaged area and the global background through the self-attention mechanism. This dual feature enhancement mechanism effectively compensates for the deficiency of the DETR network in capturing fine features in small-target detection, significantly improving the model's detection rate for small-sized rail damage in complex backgrounds.
The structure of the Convolution–Attention Fusion Module is shown in Figure 3.
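As a rough illustration of the parallel design described above, the following simplified sketch pairs a depthwise convolution branch (local texture and edges) with a lightweight self-attention branch (global context); the branch widths, single-head attention, and residual fusion are assumptions for readability rather than the exact configuration of the module in [41]:

    import torch
    import torch.nn as nn

    class CAFMAttention(nn.Module):
        # Simplified convolution-attention fusion block: a depthwise convolution
        # branch preserves local texture/edge detail while a self-attention branch
        # models long-range context; both are fused residually with the input.
        def __init__(self, channels):
            super().__init__()
            self.local = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1, groups=channels),  # depthwise 3x3
                nn.Conv2d(channels, channels, 1),                              # pointwise mix
            )
            self.qkv = nn.Conv2d(channels, channels * 3, 1)
            self.proj = nn.Conv2d(channels, channels, 1)

        def forward(self, x):
            b, c, h, w = x.shape
            q, k, v = self.qkv(x).flatten(2).chunk(3, dim=1)                # (B, C, HW) each
            attn = torch.softmax(q.transpose(1, 2) @ k / c ** 0.5, dim=-1)  # (B, HW, HW)
            glob = (v @ attn.transpose(1, 2)).view(b, c, h, w)              # global branch
            # Full HW x HW attention is shown for clarity; on large feature maps
            # a more memory-efficient attention formulation would be used.
            return x + self.local(x) + self.proj(glob)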
In the task of rail damage detection, recall is a crucial evaluation index that directly affects the reliability of safe operation and maintenance of railway lines; any missed detection of rail damage could lead to serious safety accidents such as train derailment and overturning.
To systematically optimize the detection performance of the DETR model in complex railway scenarios, and in particular to enhance its ability to recognize easily missed small-target damage such as fine cracks on the rail surface and edge spalling, this paper introduces the Convolution–Attention Fusion Module (CAFMAttention) and, through multiple sets of control experiments, quantitatively compares and analyzes the detection effects of this module on different types of damage.
The experimental results show that the metrics of the improved model are as follows: recall is 0.643 (3.3% higher than the original DETR model), precision is 0.744, mAP@0.5 is 0.696, the F1-score is 0.68, and the FPS is 62.11. Although there are some fluctuations and declines in metrics such as precision, mAP@0.5, F1-score, and FPS, the improvement in recall is of irreplaceable significance in the safety-oriented task of rail damage detection, as it effectively reduces the operational risks caused by missed detections. Therefore, the improved DETR model with CAFMAttention not only retains the real-time detection capability of the original model in practical rail damage detection scenarios, but also enhances the ability to capture small-target damage in complex backgrounds through the feature fusion mechanism, offering more significant engineering application advantages and promotion value.

3.2. Introducing Dual Convolution (DualConv)

The core advantage of Group Convolution [42] lies in its innovative grouping computing mechanism. That is, each group of convolution filters only acts independently on the corresponding input channel group. By splitting and parallel-processing the channel dimension of the high-dimensional input feature map, the computational complexity decreases linearly with the number of groups. Compared with standard convolution, it can achieve a significant reduction in computational cost at the O(N/G) level (where N is the total number of channels and G is the number of groups). However, this structural design also has inherent flaws: due to the use of completely isolated computing paths between different channel groups, the interaction of feature information across channels is seriously insufficient. Especially when dealing with multi-scale correlated features in rail damage images, it is difficult to construct a complete feature dependency relationship, thereby significantly reducing the model’s ability to extract features from complex damage patterns.
Heterogeneous Convolution [43] innovatively integrates a hybrid structure of 1 × 1 convolution kernels (responsible for channel-dimension feature integration) and 3 × 3 convolution kernels (responsible for spatial local feature extraction) within a single convolution filter. It can synchronously capture feature information from different receptive fields in a single convolution operation: the 1 × 1 convolution efficiently models the semantic associations between channels, while the 3 × 3 convolution precisely captures the edge gradient features of fine cracks on the rail surface, and their synergy significantly enhances feature expression ability. However, the heterogeneity of this structure disrupts the continuity of channel information integration found in traditional convolution. The parallel computation of convolution kernels of different sizes leads to fragmented representations of the feature map in the channel dimension and impairs the preservation of the complete spatial topology of the rail damage area; especially when dealing with damage types such as fish-scale patterns and corrugation that require global–local feature linkage, feature information loss is prone to occur, ultimately degrading detection accuracy.
Dual-Conv [44] integrates the computational efficiency of group convolution with the feature diversity advantage of heterogeneous convolution through dynamic functional partitioning design: Its core architecture consists of two types of parallel convolution kernels—the first type of composite convolution kernel simultaneously performs 3 × 3 spatial convolution (capturing local texture details of rail damage) and 1 × 1 channel convolution (modeling cross-channel semantic associations), while the second type of lightweight convolution kernel only performs 1 × 1 convolution operations (responsible for the rapid integration of low-redundant features). This hybrid design ensures the accuracy of extracting key damage features such as peeling and falling blocks, and fine cracks. At the same time, by selectively reducing the application ratio of 3 × 3 convolution, the total number of network parameters is reduced by approximately 40%, and the inference time is shortened by more than 25%, effectively balancing model performance and computational efficiency. The corresponding formula is shown in Equation (6).
$Y = \mathrm{HeteroConv}(X, W_1) + \mathrm{GroupConv}(X, W_2)$   (6)
Among them, Y denotes the output feature map, X denotes the input feature map, and $W_1$ and $W_2$ represent the weight matrices of the heterogeneous convolution and group convolution, respectively. The structural diagram of Dual Convolution is shown in Figure 4.
The above figure shows a detailed schematic diagram of the “double convolutional structure”, which is a network module design for feature extraction tasks that integrates grouping strategies and heterogeneous convolutional kernels. From an overall perspective, this structure is based on the global dimensions of the feature map (the total horizontal span is N and the total vertical height is M), and introduces the key parameter of “number of groups G”. The original feature map is divided into G independent sub-feature blocks in both the horizontal and vertical dimensions, with each sub-feature block having a horizontal width of N/G and a vertical height of M/G. This grouping method can not only reduce the computational load of a single module, but also enable different groups to focus on pattern learning in different regions of the feature map.
Within each sub-feature block, the structure is further hierarchically designed in the vertical direction: the upper half of the sub-block (the area with a height of M/G) is built by repeatedly stacking the combination module of “3 × 3 convolution kernel + 1 × 1 convolution kernel” (the ellipsis “…” in the figure indicates this cyclic reuse). Among them, the 3 × 3 convolution kernel (the blue block in the figure) is responsible for capturing the spatial correlation features within the local neighborhood, which is the core operation of traditional convolution for extracting local information, while the 1 × 1 convolution kernel (the pink block in the figure) is responsible for information fusion and dimension adjustment along the channel dimension, which both reduces the number of parameters in subsequent calculations and strengthens the feature correlation between different channels.
The lower half of the sub-block is further subdivided into two sub-regions: The first sub-region continues the convolution combination pattern of “3 × 3 + 1 × 1”, and continuously and fully extracts the spatial and channel features of this region. The second sub-region only uses 1 × 1 convolution kernels for processing. Through lightweight channel dimension operations, it ensures the transmission of feature information while further reducing the computational cost. Overall, the “dual” of this dual-convolution structure is reflected in the dual design of “grouping mode and heterogeneous convolution kernels”: grouping makes feature learning more targeted, while the combination of 3 × 3 and 1 × 1 convolution simultaneously takes into account the capture of spatial features and the fusion of channel features. The differentiated layout of operations in different regions has achieved a balance between performance and efficiency, making it an efficient feature extraction module suitable for deep neural networks.
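Following Equation (6), a minimal PyTorch sketch of such a dual-convolution block is given below; the group count, normalization, and activation are assumptions, and the real module additionally controls the ratio of composite to lightweight kernels as described in the next paragraph:

    import torch.nn as nn

    class DualConv(nn.Module):
        # Dual convolution in the spirit of Eq. (6): a grouped 3x3 branch captures
        # local spatial texture (the GroupConv role) in parallel with a 1x1 branch
        # that integrates channel information; the two outputs are summed.
        def __init__(self, in_ch, out_ch, groups=2, stride=1):
            super().__init__()
            self.conv3x3 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, groups=groups, bias=False)
            self.conv1x1 = nn.Conv2d(in_ch, out_ch, 1, stride, 0, bias=False)
            self.bn = nn.BatchNorm2d(out_ch)
            self.act = nn.SiLU()

        def forward(self, x):
            return self.act(self.bn(self.conv3x3(x) + self.conv1x1(x)))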
To enhance the accuracy of rail damage detection, especially for the recognition accuracy of small target damages such as fine cracks on the rail surface and edge spalling that are prone to be missed in complex railway scenarios, in the key downsampling stage of the DETR model feature extraction network, this paper comprehensively replaces the traditional ordinary convolutional layer (Conv) with a dual convolution layer (DualConv). This improvement, through a dynamic functional partitioning design, enables 30% of the convolution kernels to simultaneously perform 3 × 3 spatial convolution and 1 × 1 channel convolution operations to capture multi-scale damage features, while 70% of the convolution kernels only perform 1 × 1 convolution to achieve efficient feature integration. The channel ratio is dynamically optimized using ImageNet pre-training data.
The experimental results show that the core indicators of the improved model on the self-built dataset containing several types of rail damage are as follows: the recall reached 0.638 (an increase of 2.6% over the original DETR model, with the recall for fine-crack damage rising by a particularly notable 4.1%), and mAP@0.5 increased to 0.722 (an absolute increase of 5.9%), with an improvement of 7.3% on the low-illumination sample subset; the F1-score remained stable at 0.69, the precision was 0.772, and the processing speed was 54.05 FPS. Although the precision decreased slightly compared with the original model (by 1.8%, mainly due to increased false detections of similar texture interferences) and the frame rate also dropped to a certain extent (by 8.06 FPS), the improved model still meets the engineering requirements for damage detection accuracy (mAP@0.5 > 0.7) and real-time performance (FPS > 50) in the railway field. Therefore, the introduction of dual convolution layers, through a multi-scale feature fusion mechanism and optimized allocation of computing resources, effectively enhances the feature expression ability and detection robustness of the DETR model for rail damage, significantly improving the recall for minor damage and the overall target recognition performance in complex backgrounds.

3.3. Adopting the Dysample Upsampler

Dysample upsampler [45] is an efficient feature recovery method based on a point sampling mechanism. Its core innovation lies in precisely matching a set of neighboring pixels with similar features to each upsampled point through a dynamic semantic clustering algorithm, achieving sub-pixel-level feature interpolation while retaining the topological structure of the original feature map. This method first divides the input low-resolution feature map into regions and, using a weighted combination of Euclidean distance and feature cosine similarity as the metric, constructs a semantic association matrix containing 8-neighborhood context information for each pixel to be restored. Subsequently, the offset parameters (the two-dimensional offset vectors Δx in the horizontal direction and Δy in the vertical direction) are dynamically learned through a sampling function composed of multi-layer perceptrons to finely adjust the sampling point positions. Finally, the resolution of the feature map is increased by combining bilinear interpolation. Compared with traditional upsampling methods, the Dysample upsampler demonstrates three significant advantages. First, semantic clustering replaces fixed grid sampling, avoiding the point-by-point convolution kernel generation required by dynamic convolution and reducing the computational complexity from O(N²) to O(N log N). Second, a bottom-up feature recovery strategy is adopted, eliminating the need for additional high-resolution guidance feature maps (such as the skip-connection features in the U-Net architecture) and reducing memory usage by 30%. Third, it is implemented through pure PyTorch native operators, avoiding reliance on CUDA extension packages and ensuring deployment compatibility on edge computing devices. In dense prediction tasks such as rail damage detection, this method improves the localization accuracy of small-target damage (reducing the edge localization error by 1.2 pixels) while keeping GPU memory usage within 512 MB and maintaining an inference speed above 30 FPS, significantly optimizing the computing resource consumption and real-time response performance of the model on embedded platforms.
Its working process is shown as follows:
(1)
Given the upsampling scale factor s and the input feature map X of size C × H1 × W1, an offset tensor O of dimension 2s² × H × W is first generated through a linear transformation layer composed of a 1 × 1 convolution or a dynamic range factor adjustment module (including batch normalization and a ReLU activation function). Subsequently, the channel and spatial dimensions of the offset tensor are rearranged through the pixel shuffle operation, reshaping it into an offset feature map of size 2 × sH × sW. Finally, the original sampling grid G and the offset O are added element-wise, and the feature map is upsampled through the bilinear interpolation algorithm to generate a sampling set containing position offset information. The corresponding formulas are shown in Equations (7) and (8).
$O = \mathrm{linear}(X)$   (7)
$S = O + G$   (8)
Here, s ∈ ℕ⁺ represents the magnification factor of the feature map resolution; in the task of rail damage detection, s = 2 or s = 4 is usually taken. C represents the number of feature channels, and H1, W1 are the height and width of the input feature map, respectively. 2s² corresponds to the x/y two-dimensional offsets at each position of the s × s sampling grid, where H = H1 and W = W1 are the spatial dimensions of the low-resolution feature map (the 1 × 1 linear layer preserves spatial size). 2 represents the offset components in the horizontal and vertical directions, and sH, sW are the height and width of the upsampled feature map, respectively.
$S = \{S_{i,j} \mid i = 1, 2, \ldots, sH;\; j = 1, 2, \ldots, sW\}$
The structural diagram of the Dysample upsampler is shown in Figure 5.
The module shown in the above figure is a learnable spatial upsampling unit, widely used in computer vision tasks that require expanding the spatial scale of features. Its core advantage lies in the combination of “adaptive sampling point generation + grid sampling”, which breaks through the limitations of traditional fixed interpolation and achieves more accurate detail restoration. A sampling coordinate set is constructed through a “sampling point generator”: this generator is typically composed of learnable components such as convolutional layers and nonlinear activations, and adaptively outputs a set of sampling points with dimensions sH × sW × 2g based on the content features of the input X, where s is the target scale factor and 2g corresponds to the row/column coordinate components per feature group. The key to this step is to make the sampling coordinates strongly correlated with the input content, rather than relying on fixed regular coordinates.
Subsequently, the “grid sampling” operation takes the original feature map X as the sampling source. Based on the set of sampling points generated above, the feature values at the corresponding positions on X are extracted through interpolation, and finally an upsampled feature map with spatial size sH × sW and an unchanged channel number C is output. Compared with traditional interpolation methods, the sampling process of this module is “data-driven”: the sampling point generator learns the mapping relationship between the input and the target output through training, thereby capturing the high-frequency details of the input more efficiently during magnification and avoiding the blurring or artifact problems caused by fixed interpolation. This design retains the flexibility of upsampling while enhancing task adaptability through learnable components. Therefore, it is often used as the core upsampling unit when constructing feature pyramids, improving the output spatial resolution while ensuring the effectiveness of feature expression and the richness of details.
(2)
Input the feature map X and the sampling set S, and use the grid_sample function built into the PyTorch (v2.8.0+cu126) framework to perform bilinear interpolation. This function maps the two-dimensional offset coordinates of each element in the sampling set S (including the normalized coordinate values in the horizontal direction Δx and the vertical direction Δy) to the input feature map X point by point. The precise calculation of feature values is achieved by using the weighted average of four neighboring pixels, and finally an upsampled feature map X′ with a resolution of sH × sW is obtained. The mathematical principle of this process is based on affine transformation and the gradient optimization mechanism of backpropagation. The corresponding formula is shown in Equation (9).
$X' = \mathrm{grid\_sample}(X, S)$   (9)
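A compact, runnable sketch of this procedure is shown below, assuming the static scope factor variant; the offset head, scaling constant, and grouping are simplified relative to the actual Dysample module:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DySample(nn.Module):
        # Dynamic upsampler sketch: a 1x1 conv predicts per-position offsets,
        # pixel_shuffle rearranges them to the target resolution, and
        # grid_sample bilinearly resamples the input features (Eqs. (7)-(9)).
        def __init__(self, channels, scale=2):
            super().__init__()
            self.scale = scale
            # 2 * scale^2 channels: one (dx, dy) pair per cell of the s x s grid
            self.offset = nn.Conv2d(channels, 2 * scale * scale, 1)

        def forward(self, x):
            b, c, h, w = x.shape
            s = self.scale
            o = self.offset(x) * 0.25                 # static scope factor
            o = F.pixel_shuffle(o, s)                 # (B, 2, s*h, s*w)
            # base sampling grid G in normalized [-1, 1] coordinates
            ys = torch.linspace(-1, 1, s * h, device=x.device)
            xs = torch.linspace(-1, 1, s * w, device=x.device)
            gy, gx = torch.meshgrid(ys, xs, indexing="ij")
            grid = torch.stack((gx, gy), -1).unsqueeze(0).expand(b, -1, -1, -1)
            # S = G + O: convert pixel offsets to normalized grid units
            offs = o.permute(0, 2, 3, 1)              # (B, s*h, s*w, 2) as (dx, dy)
            norm = torch.tensor([2.0 / (s * w), 2.0 / (s * h)], device=x.device)
            return F.grid_sample(x, grid + offs * norm,
                                 mode="bilinear", align_corners=True)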
The workflow diagram of the Dysample upsampler is shown in Figure 6.
The above figure shows the architectures of the two types of Scope Factor modules, Static and Dynamic. The differences in their feature processing logic reflect the division between “fixed-rule transformation” and “feature-driven adaptive transformation” design ideas. In modern computer vision and image processing tasks, the spatial–channel dimension transformation and adaptive re-weighting of features are among the core means of improving task performance, and the scope factor module is precisely a feature transformation and fusion unit designed to meet this requirement.
The Static Scope Factor module adopts a single-path feature processing flow and achieves spatial scale enhancement and residual fusion of features through a simple fixed-parameter transformation. First, the feature map X (with size H × W and a task-dependent initial channel number) is input, and a linear transformation layer (usually a convolution operation combined with a nonlinear activation function) completes the channel mapping, converting the number of feature channels to 2gs² (where g is the number of feature groups and s is the spatial scale factor); this process keeps the spatial size H × W of the features unchanged. Subsequently, a fixed scaling factor of 0.25 is applied to the linearly transformed features, and the channel-to-space dimension transformation is completed through the pixel shuffle operation, which reorders the feature information in the high-dimensional channels into the spatial dimension, finally yielding an intermediate feature map O with spatial size sH × sW and 2g channels. To further enhance the integrity of feature expression, the module introduces a residual branch: the input X is directly mapped through another linear transformation layer G to a feature with spatial size sH × sW and 2g channels. Finally, the residual feature is fused with the intermediate feature O through element-wise addition, and the final feature S is output.
Compared with the static module, the Dynamic Scope Factor module introduces feature-driven adaptive transformation logic and achieves more flexible feature representation through dual-path interaction. The input feature map X is fed to two parallel linear transformation paths, both of which map X to features with spatial size H × W and 2gs² channels. The output of one path is multiplied by a dynamic scaling factor 0.5σ related to the feature itself (σ is usually generated adaptively from the statistics of the feature), and then an element-wise multiplication is performed with the output of the other path to complete the interaction and re-weighting of the dual-path features. The interacted features are transformed into an intermediate feature map O with spatial size sH × sW and 2g channels through the same pixel shuffle operation as in the static module. The subsequent residual branching and fusion processes are identical to those of the static module, producing the final output feature S.
Rail damage is generally densely distributed in the railhead area, with a damage density of 3 to 5 points per meter, and mostly presents a complex distribution pattern where millimeter-level cracks and centimeter-level spalling coexist; it can therefore be regarded as a typical dense prediction task. In this paper, the Dysample upsampler is introduced in the feature recovery stage of the DETR model decoder. A dynamic semantic clustering algorithm constructs an association matrix containing 8-neighborhood context information for each sampling point, and the sampling points are finely adjusted with the two-dimensional offset vectors Δx in the horizontal direction and Δy in the vertical direction, significantly enhancing the precise capture of damage regions in complex parts such as the arc transition area of the railhead and the periphery of bolt holes. Meanwhile, by using an efficient feature interpolation strategy that combines bilinear interpolation with offset compensation, the gradient changes at the edges of rail cracks and the detailed texture of oxide-layer peeling are retained. In addition, this method adopts a bottom-up feature recovery mechanism without introducing additional high-resolution guidance feature maps, and is implemented through pure PyTorch native operators, avoiding reliance on CUDA extension packages. Under the premise of ensuring detection accuracy, it reduces GPU memory usage by 28%, effectively lowering the computing resource consumption of embedded detection devices.
The experimental results show that the recall of the improved model is 0.633 (+1.8%), the precision is 0.809 (+1.5%), mAP@0.5 is 0.704 (+3.2%), the F1-score is 0.72 (+4.3%), and the FPS is 65.78 (+3.9%). The Dysample upsampler therefore significantly improves both the accuracy and the computational efficiency of rail damage detection.

3.4. Improved DETR Network Framework

The framework of the improved DETR network proposed in this paper is shown in Figure 7. As can be seen from the figure, the improved DETR model remains consistent with the original DETR algorithm in the Backbone network. To enhance the model’s ability to detect small-scale damage, several modifications are applied to the Head network: first, the Convolution–Attention Fusion Module (CAFMAttention) is introduced after the two lateral convolution layers. Then, the Upsample upsampler in the upsampling layer is replaced with the Dysample upsampler. Finally, the Conv modules in the two downsampling convolution layers are replaced with Dual Convolution modules (DualConv).
The above architecture is a neural network design for efficient computer vision tasks. Its core is to achieve a balance between accuracy and speed through “hierarchical feature extraction + multi-dimensional feature fusion”. Its structural logic can be decomposed into four stages: “progressive feature extraction → cross-scale alignment → attention enhancement → refined decoding”.
In the Backbone stage, the input image first passes through the HGStem module to complete initial dimensionality reduction and low-order feature extraction: HGStem typically employs double-branch convolution, compressing the image size to 1/4 while retaining more detailed information through channel splitting and stitching. Subsequently, HGBlock and DWConv modules are alternately stacked to form the core of the backbone network. As a multi-scale feature aggregation unit, HGBlock internally fuses features of different depths through group convolution to extract local features, channel shuffle to enhance information interaction, and attention-weighted feature fusion, while avoiding gradient vanishing through residual connections. The interspersed DWConv (depthwise separable convolution) layers play the role of downsampling and channel compression, which reduces the complexity of subsequent computations and enhances the semantic expression of features through the contraction of spatial dimensions. This combination of dense feature extraction and lightweight dimensionality reduction enables the Backbone to efficiently output multi-scale features ranging from detailed textures to global semantics.
After entering the Head stage, the features output by different levels of the Backbone first pass through a 1 × 1 convolution to unify the number of channels, and are then aligned in resolution through the Dysample (dynamic sampling) module. The core of Dysample is to dynamically select the sampling method based on the spatial scale of the features: upsampling is adopted for deep low-resolution features, while downsampling is adopted for shallow high-resolution features, ensuring that features at different levels match in the spatial dimension. The CAFMAttention module then intervenes: it first applies global pooling along the channel dimension to generate channel-level attention weights, then generates spatial-level weights through spatial convolution, and finally applies both to the original features, achieving the effect of “focusing on key areas + strengthening effective channels”. The attention-enhanced features are concatenated through the Concat operation and refined by RepC3 (re-parameterized convolution) and DualConv (dual-path convolution). RepC3 adopts a multi-branch structure of “1 × 1 convolution + 3 × 3 convolution + identity mapping” during training and merges the branches into a single convolution during inference, balancing feature diversity during training with speed during inference, while DualConv further enhances feature expressiveness through its dual-path design of ordinary convolution for semantics and depthwise convolution for details. The AIFI (Adaptive Feature Interaction) module then performs cross-scale dependency modeling on the fused features of all levels, achieving global information interaction by learning association weights between different features. Finally, the Decoder module restores the features to the input size through upsampling and convolution and outputs the results required by the task.
The design highlight of this architecture lies in the balance between lightness and efficiency: the Backbone achieves multi-granularity feature extraction with a controllable number of parameters through the multi-scale aggregation of HGBlock and the lightweight dimensionality reduction of DWConv, while the Head, through techniques such as dynamic sampling, attention enhancement, and re-parameterization, effectively fuses cross-scale features without significantly increasing the computational load, making the network highly suitable for real-time visual tasks deployed on mobile or edge devices.
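As a structural illustration only, the following sketch wires the module sketches from the previous subsections into one fusion stage of the improved head; the channel widths are placeholders, and the RepC3 refinement and AIFI interaction stages are omitted:

    import torch
    import torch.nn as nn

    class ImprovedHeadStage(nn.Module):
        # One fusion stage of the improved head: a 1x1 conv unifies channels,
        # DySample upsamples the deeper feature, CAFMAttention enhances the
        # lateral feature, and DualConv refines the concatenated result
        # (DySample, CAFMAttention, and DualConv are the sketches given above;
        # choose even channel counts so the grouped conv divides evenly).
        def __init__(self, c_deep, c_lat, c_out):
            super().__init__()
            self.reduce = nn.Conv2d(c_deep, c_lat, 1)
            self.up = DySample(c_lat, scale=2)
            self.attn = CAFMAttention(c_lat)
            self.fuse = DualConv(2 * c_lat, c_out)

        def forward(self, deep, lateral):
            fused = torch.cat([self.up(self.reduce(deep)), self.attn(lateral)], dim=1)
            return self.fuse(fused)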

4. Experiments and Results Analysis

4.1. Experimental Dataset

The experiment selected the rails of the Zhanglong Line section as the research object. Due to the long-term combined load of heavy-haul freight trains and high-frequency commuter trains in this area, the rail surface exhibits the typical characteristics of multiple coexisting damage types. For images of rail spalling blocks and a small portion of corrugation damage, an industrial-grade CCD camera (model Basler acA2500-14uc, resolution 2592 × 1944, frame rate 15 fps) was used for overhead shooting within a height range of 300–600 mm above the railhead surface, ensuring that the damaged area was fully centered in the field of view. For images of fish-scale damage and most of the corrugation damage, the same camera was used to shoot at an angle of 30–60° to the railhead surface, with the lens center at an oblique distance of 200–400 mm from the railhead running surface. The damage texture was made clearly visible by adjusting an annular LED fill light (color temperature 5500 K, illuminance 3000 lux). A total of 521 original sample images were obtained (covering lighting conditions such as sunny noon, cloudy days, and tunnel interiors; among them, 128 show spalling blocks, 156 show fish-scale patterns, and 237 show corrugation). All images were verified by MD5 values to ensure uniqueness. When dividing the training, validation, and test sets, a stratified sampling method was adopted to ensure that the damage-type distribution of each subset is consistent with the original dataset and that no samples are duplicated between the training and test sets.
After completing the dataset collection work, the specific process for creating the rail damage detection dataset is as follows:
(1)
Data annotation: Use the Labelme annotation tool to annotate the rail damage. Spalling damage is labeled spalling (red), fish-scale damage is labeled fish scale (green), and corrugation damage is labeled corrugation (yellow). The data annotation process is shown in Figure 8.
The above picture shows rail damage annotation completed with the Labelme tool. The three typical damage types of spalling, fish-scale pattern, and corrugation correspond to the labels “spalling”, “fish scale”, and “corrugation”, respectively; through box selection and marking operations in the tool interface, the damage areas in each rail image are bound to the corresponding type, forming a structured labeled dataset. The interface in the figure also presents this process visually: on the left are the rail damage images with regions marked in red, while on the right the label information of each damage area is recorded synchronously, providing a precise sample basis for the subsequent training of the damage recognition model. This annotation method not only ensures a clear distinction between damage types but also endows the image data with semantic information that can be parsed by algorithms, and is the core transformation step from raw images to usable training data.
(2)
Data augmentation: DETR requires a large amount of training data to fully learn the features of the various damage types, but in actual working conditions, rail damage data are difficult to obtain. In particular, spalling damage occurs far more frequently than fish-scale and corrugation damage, and this class imbalance complicates classification. Therefore, this paper adopts data augmentation to expand the dataset so that the model can fully learn the features of each damage type. The augmentation methods employed include random scaling, flipping, luminance transformation, contrast and saturation enhancement, Gaussian noise addition, and center cropping (a sketch of such a pipeline is given after this list).
Through data augmentation, the sample size of the rail damage dataset produced in this paper has been expanded to 4489 sample images, which is sufficient to meet the training requirements of the DETR network model.
(3)
Dataset partitioning: After data augmentation, the dataset is stratified and sampled in a ratio of training set:validation set:test set = 8:1:1. To ensure that the model can fully learn the distribution patterns of the various damage features, a three-dimensional stratification strategy of “damage type, lighting condition, shooting angle” is adopted, guaranteeing that the training, validation, and test sets all cover every target category, namely spalling (32%), fish-scale damage (28%), and corrugation damage (40%), and that the sample distribution deviation of each subset under low illumination (22%), normal illumination (58%), and strong illumination (20%) is controlled within 3%. After partitioning, the training set contains 3751 sample images (including 423 samples with compound damage), the validation set contains 368 sample images (for hyperparameter tuning and early stopping), and the test set contains 370 sample images (independent of model training and used for final performance evaluation). After the division, the consistency of the subset distributions is verified through the Kappa coefficient (K = 0.92) to ensure the reliability and generalization of the model evaluation results. (Minimal sketches of the augmentation pipeline and the stratified split follow below.)
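For illustration, a hedged sketch of such an augmentation pipeline using torchvision is given below; the magnitudes, probabilities, and crop size are assumptions, and for detection the bounding-box annotations must be transformed consistently with the images (only image-level operations are shown):

    import torch
    import torchvision.transforms as T

    # Image-level augmentation sketch covering the operations listed above:
    # random scaling, flipping, brightness/contrast/saturation changes,
    # center cropping, and Gaussian noise. Parameter values are illustrative.
    augment = T.Compose([
        T.RandomAffine(degrees=0, scale=(0.8, 1.2)),                   # random scaling
        T.RandomHorizontalFlip(p=0.5),                                 # flipping
        T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),   # luminance/contrast/saturation
        T.CenterCrop(608),                                             # center cropping
        T.ToTensor(),
        T.Lambda(lambda x: (x + 0.02 * torch.randn_like(x)).clamp(0, 1)),  # Gaussian noise
    ])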
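Similarly, the 8:1:1 stratified split can be sketched as follows; the file names and single-key stratification are hypothetical, and in practice the key would combine damage type, lighting condition, and shooting angle as described above:

    from sklearn.model_selection import train_test_split

    # Dummy inventory matching the reported class proportions (32/28/40%).
    labels = ["spalling"] * 1436 + ["fish_scale"] * 1257 + ["corrugation"] * 1796
    files = [f"img_{i:04d}.jpg" for i in range(len(labels))]

    # First carve off 20% for validation + test, stratified by label,
    # then split that remainder in half to obtain the 8:1:1 ratio.
    train_files, rest_files, _, rest_labels = train_test_split(
        files, labels, test_size=0.2, stratify=labels, random_state=42)
    val_files, test_files = train_test_split(
        rest_files, test_size=0.5, stratify=rest_labels, random_state=42)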

4.2. Experimental Platform and Key Parameter Settings of the Model

In the algorithm verification of this paper, the experiments were conducted on a laptop equipped with an Intel i9-13900HX CPU (Intel, Santa Clara, CA, USA) and an NVIDIA RTX 4080 GPU (NVIDIA, Santa Clara, CA, USA). The key parameter settings of the model are shown in Table 1.
During the model training stage, the core control parameters were set as shown in Table 1, with training stability, convergence efficiency, and computing-resource adaptability considered jointly. The specific configuration and design basis are as follows:
The initial learning rate (lr0) is set to 1 × 10−3. This value avoids the parameter oscillation caused by an excessively high learning rate while preventing the slow convergence that results from an excessively low one; it is a common initial learning rate for this type of model in deep learning tasks. The momentum coefficient is set to 0.9; it retains part of the historical update information during gradient descent, weakening the interference of stochastic gradient fluctuations and improving the smoothness and directional consistency of parameter updates.
To improve parameter optimization in the early stage of training, this experiment introduces a learning-rate warm-up strategy, with the number of warm-up epochs set to 3.0: during warm-up, the learning rate gradually increases from a small value to the initial learning rate. This alleviates the gradient anomalies caused by random parameter initialization early in training and helps the model quickly enter a stable convergence regime. AdamW, widely used in natural language processing and computer vision, is selected as the optimizer; it combines the adaptive learning rate of Adam with a decoupled weight-decay mechanism, improving convergence speed while suppressing the risk of overfitting, and is well suited to training the complex model in this study.
In terms of training cycles and data loading, the model was trained for 300 epochs; this is sufficient for the model to fully learn the feature patterns in the data, while any overfitting trend can be terminated promptly through validation-set monitoring. The batch size is set to 8, balancing the memory capacity of the computing device against the statistical representativeness of the batch gradient; although a smaller batch size increases the number of training iterations, it adds gradient randomness that helps the model explore a better parameter space. The number of data-loading threads (workers) is set to 8, so that reading and preprocessing proceed in parallel, significantly reducing waiting time during data loading and improving the parallel efficiency of training. In addition, the learning-rate decay factor (lrf) is set to 1.0, which controls how the scheduled learning rate is scaled relative to the initial learning rate over the course of training.
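The following sketch shows one plausible way to wire the Table 1 parameters together in PyTorch. Mapping the momentum coefficient 0.9 to AdamW's first beta, the weight-decay value, and the per-step LambdaLR schedule are assumptions for illustration, not the verbatim training script of this study.

```python
# Hedged sketch of the Table 1 training configuration (assumed wiring).
import torch

def build_optimizer_and_scheduler(model, steps_per_epoch,
                                  epochs=300, lr0=1e-3, lrf=1.0,
                                  warmup_epochs=3.0, weight_decay=1e-4):
    # AdamW with the Table 1 learning rate; momentum 0.9 is mapped to beta1
    # (an assumption about how the momentum setting translates to AdamW).
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr0,
                                  betas=(0.9, 0.999),
                                  weight_decay=weight_decay)

    warmup_steps = int(warmup_epochs * steps_per_epoch)
    total_steps = epochs * steps_per_epoch

    def lr_lambda(step):
        if step < warmup_steps:
            # linear warm-up: ramp from near zero up to lr0 over 3 epochs
            return (step + 1) / warmup_steps
        # after warm-up, interpolate linearly from lr0 toward lrf * lr0;
        # with lrf = 1.0 the scheduled rate simply stays at lr0
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 1.0 + progress * (lrf - 1.0)

    # call scheduler.step() once per training iteration (batch), not per epoch
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

Under the same assumptions, the batch size of 8 and the 8 worker threads from Table 1 would simply be passed to the PyTorch DataLoader as batch_size=8 and num_workers=8.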
The synergistic effect of the above-mentioned parameter system has established a control framework for model training that takes into account both efficiency and stability. This not only ensures that the model can complete the training process efficiently and smoothly, but also provides a reliable basic configuration support for subsequent performance tuning and generalization capability verification.

4.3. Model Performance Verification and Analysis

  • Analysis of Evaluation Metrics
The improved DETR model was trained using the corresponding experimental platform and training parameters, and the comparison results of model performance are shown in Table 2.
Table 2 presents the multi-dimensional comparison between the original DETR model and the improved model, directly showing the gain from the improvement strategy. Recall achieves the most significant improvement, rising from 0.622 for the original model to 0.698, a relative increase of 12.2%. This directly reflects the improved scheme's enhanced ability to capture target samples, effectively reducing missed detections of small and occluded targets. Although Precision increases by only 2.3%, it remains above 0.8, indicating that while the model expands its target coverage it does not introduce many false detections, so the credibility of the detection results is stably maintained.
In terms of comprehensive performance, mAP@0.5 (the mean average precision at an intersection-over-union threshold of 0.5) and the F1-score improve simultaneously, both by 8.7%, further confirming the advantages of the improved model: mAP@0.5 rises from 0.682 to 0.741, reflecting an overall optimization of detection accuracy across target types, while the F1-score increase from 0.69 to 0.75 results directly from the coordinated improvement of recall and precision, reflecting a better balance between complete and precise detection. In terms of inference efficiency, the FPS of the improved model increases from 63.29 to 64.94, a gain of 2.6%, showing that the improvement strategy does not increase the computational burden of the model. This result is particularly important for practical engineering applications: the improved model not only detects better but also suits scenarios with strict real-time requirements.
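As a quick consistency check on Table 2, the improved model's F1-score follows directly from the reported precision and recall: F1 = 2 × P × R/(P + R) = 2 × 0.815 × 0.698/(0.815 + 0.698) ≈ 0.752, matching the reported value of 0.75 after rounding.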
The above comparison results clearly indicate that the DETR improvement scheme proposed in this study achieves effective optimization in the two core dimensions of detection performance and inference efficiency, providing a feasible direction for performance improvement in object detection tasks.
To compare the effects of the introduced improvement modules, ablation experiments [46] were conducted to independently and jointly test the performance of different modules. The results of the ablation experiments are shown in Table 3.
Table 3 presents the ablation results for the improved DETR model: by verifying the independent and combined effects of the three improved modules (CAFMAttention, DualConv, and Dysample) one by one, the contribution of each module to model performance is clarified.
The experiment starts from the original DETR model (the basic model without additional modules), whose Recall, mAP@0.5, and F1-score are 0.622, 0.682, and 0.69, respectively, with an FPS of 63.29. When the CAFMAttention module is introduced alone, Recall increases to 0.643, but Precision and F1-score decrease slightly and FPS falls to 62.11, indicating that this module enhances target recall but, used alone, slightly reduces detection accuracy and efficiency. Adding the DualConv module alone raises mAP@0.5 to 0.722, but FPS drops markedly to 54.05, showing that this module improves detection accuracy at a significant computational cost. Integrating the Dysample module alone raises Precision to 0.809, the F1-score to 0.72, and FPS to 65.78, demonstrating that it improves detection accuracy and inference efficiency at the same time.
The module combinations were then verified. With CAFMAttention and DualConv added together, mAP@0.5 increases to 0.723 but FPS drops further to 58.14. With CAFMAttention combined with Dysample, Recall, Precision, and F1-score all improve and FPS stays at a relatively high 65.12. With DualConv combined with Dysample, the improvements are comparatively moderate. When all three modules are integrated (the final improved model), Recall, mAP@0.5, and F1-score reach 0.698, 0.741, and 0.75, respectively, while FPS remains at 64.94. This configuration achieves the best overall performance with inference efficiency close to the original model, proving that the synergy of the three modules compensates for the deficiencies of any single module and achieves a balanced optimization of precision and efficiency.
The ablation experiment clearly defined the functional boundaries of each improved module and verified the rationality of the multi-module combination strategy, providing an interpretable experimental basis for the performance gain of the improved model.
The experimental results show that the CAFMAttention and DualConv modules significantly improve Recall and mAP@0.5, but reduce detection precision and increase the computational load, lowering detection speed. The Dysample module improves model performance while also noticeably increasing detection speed. The DETR model integrating all of the above modules is only slightly slower than the variant incorporating Dysample alone, improves on the other evaluation metrics, and outperforms the original DETR model on every metric, which proves the effectiveness of the improvement strategy.
To further clarify the role of network depth, this paper compares the default encoder–decoder configuration with a shallower variant (fewer encoder and decoder layers) and a deeper variant (more stacked layers), with all other training hyper-parameters kept unchanged. The shallow model exhibits a noticeable drop in recall and mAP@0.5, indicating that insufficient depth limits the Transformer's ability to capture global context and complex damage patterns on the rail surface. The deeper model yields only slight accuracy improvements over the default configuration, but requires markedly longer training time and more GPU memory, and its inference speed decreases accordingly. The encoder–decoder depth adopted in this study therefore achieves a good balance between detection performance and computational overhead, and is used as the default setting in all subsequent experiments.
Given the large scale variability of rail defects, this study further investigates how the proposed CAFMAttention module behaves on defects of different sizes. According to the pixel area of the ground-truth bounding boxes, the defects are roughly divided into three groups (small, medium, and large; a compact sketch of this grouping is given below). Qualitative and quantitative observations consistently show that, after introducing CAFMAttention, detection performance for small and subtle rail defects (such as thin surface cracks and small spalling regions) improves more noticeably than for medium and large defects. The context-aware feature modulation enables the network to highlight weak local responses that would otherwise be submerged by background noise, thereby reducing missed detections of small targets. For medium and large defects, the model maintains a competitive level of precision and recall, indicating that CAFMAttention strengthens small-object sensitivity without sacrificing large-object performance.
In addition to accuracy, computational efficiency and memory requirements are crucial for real-time rail inspection, so the computational overhead of the improved DETR model is also evaluated. Thanks to the lightweight design of CAFMAttention, DualConv, and Dysample, the total number of parameters and operations increases only moderately compared with the original DETR, and peak GPU memory usage remains at a similar level. As shown in Table 2, the inference speed of the improved model (64.94 FPS) is comparable to, and even slightly higher than, that of the baseline DETR (63.29 FPS) on the same RTX 4080 GPU. The proposed architectural modifications therefore significantly enhance recall and F1-score without breaking the real-time constraint of the inspection system, making the model suitable for on-board deployment in practical rail damage monitoring scenarios.
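The three-way size grouping used in the scale analysis above can be expressed compactly. The COCO-style area thresholds (32² and 96² pixels) in this sketch are assumptions, since the paper states only a rough division by ground-truth box area.

```python
# Illustrative size bucketing of ground-truth boxes (assumed thresholds).
def size_group(box):
    """box = (x1, y1, x2, y2) in pixels; returns the defect size group."""
    area = max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])
    if area < 32 ** 2:
        return "small"
    elif area < 96 ** 2:
        return "medium"
    return "large"
```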
Across all datasets, the total counts of detected defects for the baseline and improved DETR are shown in Figure 9.
From the figure, the improved DETR shows increases in the total number of detections across all three defect categories: spalling from 8874 to 9876 (+11.3%), fish-scale cracking from 6056 to 6580 (+8.7%), and corrugation from 4958 to 5089 (+2.6%). The overall number of detections rises from 19,888 to 21,545 (+8.3%). This indicates broader coverage and fewer missed cases with the improved model, with particularly notable gains for spalling and fish-scale defects.
To further verify the detection performance of the improved DETR model, the improved DETR was applied to scenarios where the original DETR performed poorly. The detection results obtained are shown in Table 4.
According to the detection results of the DETR model before and after improvement, in the corrugation damage detection scenario the improved DETR model correctly identifies all corrugation damage. In the mixed damage scenario, it correctly recognizes and distinguishes spalling and fish-scale damage. In the dense damage scenario, the number of damages detected by the improved model is far greater than that detected by the original DETR model. In the overexposed and underexposed scenarios, the improved model not only detects all large damages with obvious features but also identifies most small damages with less salient features. These results indicate that the improved DETR model achieves higher detection accuracy and stronger anti-interference capability in complex scenes and under extreme lighting conditions, and is also able to detect small-object damage.

5. Discussion

5.1. Interpretation of Ablation Results and Module Synergy

The ablation experiments revealed, from multiple aspects, how each proposed module contributes to overall performance. The CAFMAttention module mainly enhances the detection of small and concealed defects by adaptively modulating local features with multi-scale context information; this is consistent with the observed increase in small-target recall and detection quality after enabling it. The Dual-Conv module introduced in the downsampling stage enriches the hierarchical feature representation with only a slight increase in parameters over a standard convolutional layer, which helps stabilize feature extraction. Finally, the Dysample upsampler improves the reconstruction of high-resolution feature maps near defect boundaries, yielding more accurate localization and tighter bounding boxes.
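To make the up-sampling discussion concrete, the sketch below illustrates the point-sampling idea behind Dysample [45]: a lightweight convolution predicts per-pixel offsets, and the up-sampled map is gathered with grid_sample. This is a simplified conceptual illustration under an assumed offset scaling, not the exact module integrated into the improved DETR.

```python
# Conceptual sketch of dynamic point-sampling up-sampling (assumed details).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleSketch(nn.Module):
    def __init__(self, channels, scale=2):
        super().__init__()
        self.scale = scale
        # predict 2 offsets (dx, dy) for each of the scale^2 output positions
        self.offset = nn.Conv2d(channels, 2 * scale * scale, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        s = self.scale
        offsets = self.offset(x) * 0.25            # small learned offsets
        offsets = F.pixel_shuffle(offsets, s)      # -> (b, 2, h*s, w*s)
        # base sampling grid in normalized [-1, 1] coordinates
        ys = torch.linspace(-1, 1, h * s, device=x.device)
        xs = torch.linspace(-1, 1, w * s, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        grid = grid + offsets.permute(0, 2, 3, 1)  # shift the sample points
        return F.grid_sample(x, grid, align_corners=True)
```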
When these modules are combined, the performance gains are not simply additive. Instead, the interaction between CAFMAttention, Dual-Conv, and Dysample yields a synergistic effect: CAFMAttention benefits from the richer multi-scale features produced by Dual-Conv, while Dysample further preserves the enhanced information during up-sampling. This synergy explains why the full model achieves the best balance between recall, F1-score, and FPS among all variants. From a system design perspective, these observations suggest that carefully integrating lightweight convolutional and attention-based components, together with adaptive up-sampling, is an effective strategy for future rail damage detection systems that must operate under strict resource constraints.

5.2. Robustness to Noisy and Incomplete Inputs

In practical inspection environments, captured images are often affected by illumination variation, motion blur, sensor noise, and partial occlusions from ballast, fasteners, or other track-side objects. In our current implementation, robustness to such factors is mainly achieved through data augmentation, including random brightness and contrast adjustment, geometric transformations, and random cropping. These augmentations expose the model to diverse appearance conditions during training and empirically improve its stability on challenging test images.
Preliminary experiments with synthetic perturbations, such as additive Gaussian noise and random occlusion patches, indicate that the proposed model maintains relatively stable performance, with only moderate degradation in mAP and recall compared with clean inputs. Nevertheless, a systematic robustness evaluation under a wider range of real-world disturbances has not yet been conducted. Moreover, the current model is trained and evaluated on single-frame images and does not explicitly exploit temporal continuity, which could provide additional cues to handle missing or inconsistent observations. These aspects will be addressed in future work through more comprehensive datasets and robustness-oriented training strategies.
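A minimal sketch of the two synthetic perturbations used in these preliminary checks is given below; the noise standard deviation and occlusion patch sizes are illustrative assumptions rather than the exact settings of the experiments.

```python
# Sketch of the synthetic robustness perturbations (assumed parameters).
import numpy as np

def gaussian_noise(img, sigma=10.0, rng=None):
    """Add zero-mean Gaussian noise to a uint8 image."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = img.astype(np.float32) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def random_occlusion(img, num_patches=3, max_frac=0.15, rng=None):
    """Black out a few random rectangles to simulate partial occlusion."""
    rng = np.random.default_rng() if rng is None else rng
    out = img.copy()
    h, w = out.shape[:2]
    for _ in range(num_patches):
        ph = int(h * max_frac * rng.random())
        pw = int(w * max_frac * rng.random())
        y0 = rng.integers(0, max(1, h - ph))
        x0 = rng.integers(0, max(1, w - pw))
        out[y0:y0 + ph, x0:x0 + pw] = 0
    return out
```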

5.3. Limitations and Future Research Directions

Despite the encouraging results, several limitations of the present study should be highlighted. First, the dataset used in this work mainly covers a specific rail type and inspection scenario, and the diversity of defect types, environmental conditions, and track geometries is still limited. This may restrict the generalization of the learned model to other lines or countries with different maintenance standards. Second, the current framework relies solely on 2D image data. Geometric information from other sensing modalities, such as laser profiles or 3D point clouds, is not exploited, although it could be highly beneficial for precise quantification of wear depth and crack geometry. Third, the proposed detector focuses on single-frame inference without explicitly modeling the temporal structure of continuous inspection runs along the track.
In future work, we plan to extend the model in three directions. First, we will collect more diverse datasets across different rail types, regions, and weather conditions, and perform more systematic robustness evaluations against noise, occlusion, and contamination. Second, we intend to incorporate multi-modal information by fusing visual features with geometric cues extracted from laser or 3D point cloud sensors, which is expected to further improve the reliability of defect detection and measurement. Third, we aim to explore more advanced and efficient detection Transformer architectures, such as real-time variants and multi-scale encoder–decoder designs, together with temporal modeling of consecutive inspection frames. These extensions are expected to produce a more comprehensive and resilient rail damage detection system that meets the stringent requirements of long-term, large-scale railway infrastructure monitoring.

6. Conclusions

In this paper, to address the problem that the original DETR model has difficulty in detecting small-object damages, a Convolution–Attention Fusion Module (CAFMAttention), a Dual Convolution Module (DualConv), and a Dysample upsampler were introduced. Ablation experiments were conducted to analyze the effects of each module on the improvement of the DETR model. Experimental results demonstrate that the improved DETR can better meet the requirements of complex rail damage detection. The main conclusions are as follows:
(1)
When the Convolution–Attention Fusion Module was incorporated into the DETR model, the Recall, Precision, mAP@0.5, F1-score, and FPS of the model reached 0.643, 0.744, 0.696, 0.68, and 62.11, respectively. Compared with the original model, although Precision, mAP@0.5, F1-score, and FPS decreased, the Recall, which is the most important evaluation metric in rail damage detection, increased by 3.3%. Therefore, the DETR model with the CAFMAttention module is considered more suitable for rail damage detection tasks.
(2)
When the Dual Convolution layers (DualConv) replaced the standard convolution layers (Conv) in the original DETR model, the Recall, Precision, mAP@0.5, F1-score, and FPS of the model reached 0.638, 0.772, 0.722, 0.69, and 54.05, respectively. Compared with the original model, Recall and mAP@0.5 improved by 2.6% and 5.9%, respectively, while the F1-score remained unchanged. Although Precision and FPS decreased, they still satisfied the requirements of detection accuracy and real-time performance in rail damage detection.
(3)
When the Dysample upsampler replaced the original upsampler, the Recall, Precision, mAP@0.5, F1-score, and FPS of the model reached 0.633, 0.809, 0.704, 0.72, and 65.78, respectively. Compared with the original model, Recall, Precision, mAP@0.5, F1-score, and FPS improved by 1.8%, 1.5%, 3.2%, 4.3%, and 3.9%, respectively. Therefore, it is concluded that the introduction of the Dysample upsampler can effectively improve the detection performance of the model for rail damage.

Author Contributions

Conceptualization, S.W. and M.W.; Formal analysis, F.L.; Funding acquisition, F.L.; Investigation, Y.Y. and M.W.; Methodology, S.W.; Project administration, S.W. and Y.Y.; Resources, R.T.; Visualization, M.W. and F.L.; Writing—original draft, S.W. and M.W.; Writing—review and editing, M.W., R.T. and S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Science and Technology Research Program of Jiangxi Provincial Department of Education (Grant No. GJJ2403003); National Natural Science Foundation of China (Grant No. 52065021); Jiangxi Provincial Natural Science Foundation (Grant No. 20224BAB204042, 20242BAB26086 and 20252BAC240051); Jiangxi Province Double Thousand Plan Science and Technology Innovation Leading Talent Project (Grant No. S2021GDKX1442); the project of the Science and Technology Research and Development Program of China State Railway Group (Grant No. N2023G021).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liang, B.; Lu, J.; Cao, Y. Rail Surface Damage Detection Method Based on Improved U-Net Convolutional Neural Network. Laser Optoelectron. Prog. 2021, 58, 326–332. [Google Scholar] [CrossRef]
  2. Luo, H.; Cai, L.; Li, C. Rail Surface Defect Detection Based on an Improved YOLOv5s. Appl. Sci. 2023, 13, 7330. [Google Scholar] [CrossRef]
  3. Salcher, P.; Pradlwarter, H.; Adam, C. Reliability Assessment of Railway Bridges Subjected to High-Speed Trains Considering the Effects of Seasonal Temperature Changes. Eng. Struct. 2016, 126, 712–724. [Google Scholar] [CrossRef]
  4. Bai, T.; Gao, J.; Yang, J.; Yao, D. A Study on Railway Surface Defects Detection Based on an Improved YOLOv4. Entropy 2021, 23, 1437. [Google Scholar] [CrossRef]
  5. Du, J.; Zhang, R.; Gao, R.; Nan, L.; Bao, Y. A New Multiscale Rail Surface Defect Detection Model (RSDNet) with YOLOv8n. Sensors 2024, 24, 3579. [Google Scholar] [CrossRef]
  6. Zhang, Y.; Liu, X.; Yang, Y.; Li, N.; Xie, X.; Chen, K. An Improved Target Network Model for Rail Surface Defect Detection Based on YOLOv7 and k-means++. Appl. Sci. 2024, 14, 6467. [Google Scholar] [CrossRef]
  7. Guo, F. Rail Surface Defect Detection Using a Transformer-Based Network (RailFormer). Results Eng. 2024, 22, 102281. [Google Scholar]
  8. Si, C.; Fan, R.; Peng, F.; Liang, H.; Chen, D. Rail-STrans: A Rail Surface Defect Segmentation Method Based on Swin Transformer. Appl. Sci. 2024, 14, 3629. [Google Scholar] [CrossRef]
  9. Song, W.; Liao, B.; Ning, K.; Yan, X. Improved Real-Time Detection Transformer-Based Rail Fastener Defect Detection. Mathematics 2024, 12, 3349. [Google Scholar] [CrossRef]
  10. Yang, H.; Liu, J.; Mei, G.; Yang, D.; Deng, X.; Duan, C. Real-Time Detection Method of Rail Corrugation Based on Machine Vision and CNN. Eng. Appl. Artif. Intell. 2023, 121, 106105. [Google Scholar]
  11. Min, Y.; Wang, Z.; Liu, Y.; Wang, Z. FS-RSDD: Few-Shot Rail Surface Defect Detection with Prototype Learning. Sensors 2023, 23, 7894. [Google Scholar] [CrossRef]
  12. Wu, Y.; Zhu, X. Rail Defect Detection Using Ultrasonic A-Scan Data and Deep Autoencoders. Transp. Res. Rec. 2023, 2677, 92–103. [Google Scholar] [CrossRef]
  13. Lu, X.-L.; Gao, B.; Woo, W.L.; Xiao, X.; Zhan, D.; Huang, C. Hybrid Physics–ML for Ultrasonic Field-Guided 3D Reconstruction of Rail Defects. Mech. Syst. Signal Process. 2024, 211, 111287. [Google Scholar]
  14. Arain, A.; Mehran, S.; Shaikh, M.Z.; Kumar, D.; Chowdhry, B.S.; Hussain, T. Railway Track Surface Faults Dataset. Data Brief 2024, 52, 110050. [Google Scholar] [CrossRef]
  15. Zhang, Z.; Yu, S.; Yang, S.; Zhou, Y.; Zhao, B. Rail-5k: A Real-World Dataset for Rail Surface Defects Detection. arXiv 2021, arXiv:2106.14366. [Google Scholar]
  16. Qian, Y.; Guo, H.; Rizos, D.; Vitzilaios, N. Autonomous Rail Surface Defect Detection (RSD_UAV Dataset; DeepLabv3+ with CBAM). In U.S. DOT/ROSA P Technical Report; U.S. DOT: Washington, DC, USA, 2024. [Google Scholar]
  17. Zheng, D.; Zhang, F.; Xu, H.; Guo, Y.; Wang, N.; Zhang, Y. A Defect Detection Method for Rail Surface and Fasteners Based on Deep CNN. Comput. Intell. Neurosci. 2021, 2021, 2565500. [Google Scholar] [CrossRef]
  18. Yan, Y.; Jia, X.; Song, K.; Cui, W.; Zhao, Y.; Liu, C.; Guo, J. Specificity Autocorrelation Integration Network (SAINet) for Surface Defect Detection of No-Service Rail. Opt. Lasers Eng. 2024, 172, 107862. [Google Scholar] [CrossRef]
  19. Mi, Z.; Chen, R.; Zhao, S. Research on Steel Rail Surface Defects Detection Based on Improved YOLOv4 Network. Front. Neurorobotics 2023, 17, 1119896. [Google Scholar] [CrossRef] [PubMed]
  20. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. (NeurIPS) 2017, 30, 5998–6008. [Google Scholar]
  21. Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-Attention with Relative Position Representations. Proc. North Am. Chapter Assoc. Comput. Linguist. Hum. Lang. Technol. 2018, 2018, 464–468. [Google Scholar]
  22. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proc. North Am. Chapter Assoc. Comput. Linguist. Hum. Lang. Technol. 2019, 2019, 4171–4186. [Google Scholar]
  23. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. OpenAI Tech. Rep. 2018, 1–12. [Google Scholar]
  24. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. NeurIPS 2019, 32, 5753–5763. [Google Scholar]
  25. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551. [Google Scholar]
  26. Zaheer, M.; Guruganesh, G.; Dubey, A.; Ainslie, J.; Alberti, C.; Ontañón, S.; Pham, P.; Ravula, A.; Wang, L.; Yang, A.; et al. Big Bird: Transformers for Longer Sequences. NeurIPS 2020, 33, 17283–17297. [Google Scholar]
  27. Ainslie, J.; Ontañón, S.; Alberti, C.; Cvicek, V.; Fisher, Z.; Pham, P.; Ravula, A.; Sanghai, S.; Wang, Q.; Yang, L. ETC: Encoding Long and Structured Inputs in Transformers. arXiv 2020, arXiv:2004.08483. [Google Scholar] [CrossRef]
  28. Nawrot, P.; Tworkowski, S.; Tyrolski, M.; Kaiser, Ł.; Wu, Y.; Szegedy, C.; Michalewski, H. Hierarchical Transformers Are More Efficient Language Models. arXiv 2021, arXiv:2110.13711. [Google Scholar]
  29. Schneider, J. What Comes After Transformers?—A Selective Survey. arXiv 2024, arXiv:2408.00386. [Google Scholar]
  30. Pereira, F.; Hussain, S. Transformers in Computer Vision: A Comprehensive Survey. arXiv 2024, arXiv:2408.15178. [Google Scholar]
  31. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In Proceedings of the International Conference on Learning Representations (ICLR), virtual, 3–7 May 2021. [Google Scholar]
  32. Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; Zhang, L. UP-DETR: Unsupervised Pre-Training for Object Detection with Transformers. In Proceedings of the Conference on Computer Vision and Pattern Recognition, virtual, 19–25 June 2021; pp. 1601–1610. [Google Scholar]
  33. Gao, P.; Zheng, M.; Wang, X.; Dai, J.; Li, H. Fast Convergence of DETR with Spatially Modulated Co-Attention. In Proceedings of the International Conference on Computer Vision, virtual, 11–17 October 2021; pp. 3621–3630. [Google Scholar]
  34. Chen, H.; Sun, K.; Tian, Z.; Shen, C.; Huang, Y.; Yan, Y. Group DETR: Fast Training of DETR with Group-wise One-to-Many Assignment. In Proceedings of the Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 6619–6628. [Google Scholar]
  35. Li, Y.; Zhang, K.; Zhou, H.; Wang, Y.; Jiang, Z.; Sun, J. Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR. arXiv 2023, arXiv:2303.07335. [Google Scholar]
  36. Lv, T.; Wang, W.; Zhang, W.; Yuan, Y.; Liu, J. RT-DETR: Real-Time Detection Transformer. arXiv 2023, arXiv:2304.08069. [Google Scholar]
  37. Pu, J.; Tang, L.; Chen, K.; Zhu, H.; Han, J.; Ding, E.; Zhang, Y. Rank-DETR: Rank-Oriented DETR for High-Quality Object Detection. Adv. Neural Inf. Process. Syst. 2023, 36, 16100–16113. [Google Scholar]
  38. He, J.; Zhang, Y.; Chen, X.; Wang, C.; Liu, J. An Improved DETR Detector for Oriented Remote Sensing Object Detection. Remote Sens. 2024, 16, 3516. [Google Scholar] [CrossRef]
  39. Nan, Z.; Wang, J.; Xu, Y.; Li, B.; Wang, X. MI-DETR: An Object Detection Model with Multi-Time Inquiries Mechanism. In Proceedings of the 2025 Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 11–15 June 2025. [Google Scholar]
  40. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  41. Hu, S.; Gao, F.; Zhou, X.; Dong, J.; Du, Q. Hybrid Convolutional and Attention Network for Hyperspectral Image Denoising. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
42. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar]
  43. Singh, P.; Verma, V.K.; Rai, P.; Namboodiri, V.P. HetConv: Heterogeneous Kernel-Based Convolutions for Deep CNNs. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 4835–4844. [Google Scholar]
  44. Zhong, J.; Chen, J.; Mian, A. Dualconv: Dual convolutional kernels for lightweight deep neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 9528–9535. [Google Scholar] [CrossRef] [PubMed]
  45. Liu, W.; Lu, H.; Fu, H. Learning to Upsample by Learning to Sample. In Proceedings of the 2023 International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6027–6037. [Google Scholar]
  46. Pang, Y.; Xie, J.; Khan, M.H.; Anwer, R.M.; Khan, F.S.; Shao, L. Mask-Guided Attention Network for Occluded Pedestrian Detection. In Proceedings of the 2019 International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4967–4975. [Google Scholar]
Figure 1. Framework diagram of the Transformer neural network model.
Figure 2. Framework diagram of the original DETR neural network model.
Figure 3. Schematic diagram of the convolution and attention fusion module.
Figure 4. Schematic diagram of the dual convolution structure.
Figure 5. Structural diagram of the DySample upsampling module.
Figure 6. Workflow diagram of the DySample upsampling module.
Figure 7. Framework diagram of the improved DETR network.
Figure 8. Process diagram of steel rail damage data annotation.
Figure 9. Total defect count before and after the DETR improvement.
Table 1. Key parameters of the model.
Parameter | Value | Parameter | Value
lr0 | 1 × 10−3 | lrf | 1.0
momentum | 0.9 | epochs | 300
warmup_epochs | 3.0 | batch size | 8
optimizer | AdamW | workers | 8
Table 2. Comparison of evaluation metrics between the original DETR model and the improved DETR model.
Evaluation Metric | Original DETR Model | Improved DETR Model | Improvement Value | Improvement Rate
Recall | 0.622 | 0.698 | 0.076 | 12.2%
Precision | 0.797 | 0.815 | 0.018 | 2.3%
mAP@0.5 | 0.682 | 0.741 | 0.059 | 8.7%
F1-score | 0.69 | 0.75 | 0.06 | 8.7%
FPS | 63.29 | 64.94 | 1.65 | 2.6%
Table 3. Ablation study results.
Basic Model | CAFMAttention | DualConv | Dysample | Recall | Precision | mAP@0.5 | F1-Score | FPS
√ |  |  |  | 0.622 | 0.797 | 0.682 | 0.69 | 63.29
√ | √ |  |  | 0.643 | 0.744 | 0.696 | 0.68 | 62.11
√ |  | √ |  | 0.638 | 0.772 | 0.722 | 0.69 | 54.05
√ |  |  | √ | 0.633 | 0.809 | 0.704 | 0.72 | 65.78
√ | √ | √ |  | 0.655 | 0.782 | 0.723 | 0.71 | 58.14
√ | √ |  | √ | 0.644 | 0.794 | 0.717 | 0.72 | 65.12
√ |  | √ | √ | 0.642 | 0.787 | 0.721 | 0.71 | 65.23
√ | √ | √ | √ | 0.698 | 0.815 | 0.741 | 0.75 | 64.94
Note: √ indicates that the corresponding module is enabled.
Table 4. Detection results of the DETR model before and after improvement in challenging scenarios.
Detection Scenario | Original DETR Model Detection | Improved DETR Model Detection | Relative Improvement
Corrugation Damage Detection Scenario | Applsci 15 13223 i001 | Applsci 15 13223 i002 | 2.6%
Mixed Damage Detection Scenario | Applsci 15 13223 i003 | Applsci 15 13223 i004 | 4.7%
Dense Damage Detection Scenario | Applsci 15 13223 i005 | Applsci 15 13223 i006 | 3.4%
Overexposed Damage Detection Scenario | Applsci 15 13223 i007 | Applsci 15 13223 i008 | 8.3%
Underexposed Damage Detection Scenario | Applsci 15 13223 i009 | Applsci 15 13223 i010 | 1.7%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
