Article

Rail Surface Defect Detection Based on Dual-Path Feature Fusion

Department of Intelligent Technology and Engineering, Chongqing University of Science and Technology, Chongqing 401331, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(13), 2564; https://doi.org/10.3390/electronics13132564
Submission received: 18 April 2024 / Revised: 15 June 2024 / Accepted: 28 June 2024 / Published: 29 June 2024

Abstract

With the rapid development of rail transit, the workload of track maintenance has increased, making the intelligent identification of rail surface defects crucial for improving detection efficiency. To address issues such as low defect detection accuracy, the loss of feature information due to single-path architecture backbones, and insufficient information interaction in existing rail defect detection methods, we propose a rail surface defect detection method based on dual-path feature fusion (DPF). This method initially employs a dual-path structure to separately extract low-level and high-level features. It then utilizes a combination of attention mechanisms and feature fusion techniques to integrate these features. By doing so, it preserves richer information and enhances detection accuracy and robustness. The experimental results demonstrate that the comprehensive performance of the proposed model is superior to mainstream algorithms.

1. Introduction

Railways occupy a pivotal position in the economic development of a country. They are not only the main mode of passenger transportation but also a crucial channel for cargo, representing an economic lifeline that supports the daily operations of society [1]. Rails, as the key infrastructure that ensures the safe and smooth operation of trains, are of great significance [2]. With the increasing duration and frequency of railway operations, wear, cracks, and other defects inevitably appear on rail surfaces [3]. If these defects are not detected and handled in a timely manner, they pose a serious threat to the safe operation of trains [4]. At present, rail inspection for China’s railways and urban rail transit systems relies primarily on flaw detection trolleys operating at a speed of 2 km per hour. High-speed railways are equipped with comprehensive maintenance vehicles; however, owing to their high cost, such vehicles are not feasible for regular railways and urban rail transit systems [5]. Moreover, this traditional approach to track fault detection suffers from low reliability and high labor costs and cannot keep pace with the rapid advancement of modern railways [6]. Therefore, intelligent and efficient rail surface defect detection technology is crucial for preventing and reducing railway accidents and ensuring the safety of railway transportation [7].
In recent years, with technological advancements, particularly in computer vision, deep learning-based techniques have found widespread application in rail inspection [8]. Reference [9] approaches the problem from the signal perspective, introducing a detection method that decomposes surface defect signals of different frequency bands using the wavelet packet transform, kernel principal component analysis, and support vector machines to derive detection results. Reference [10] proposes a classification and detection method focused on internal rail defects: defect features are first extracted and categorized based on the distribution and contour morphology of rail defects; then, to improve the adaptability of detection parameters across diverse scenarios, a threshold adjustment approach based on data distribution patterns is introduced to raise detection accuracy. Reference [11] introduces an effective multi-scale residual convolutional network for classifying various types of rail defects, employing skip connections paired with residual learning blocks to strengthen the network. Reference [12] proposes a high-precision fusion model that detects rail surface defects by fusing features from two deep learning models: contrast adjustment is first applied to the original track image to locate the rail, the most heavily weighted features are then selected from each network, and defects are identified by feeding the refined features to a support vector machine. Reference [13] uses two distinct MobileNet architectures to evaluate defect detection performance; the architecture combines the MobileNet backbone with several detection layers inspired by YOLO and feature pyramid networks for multi-scale feature maps. Reference [14] introduces a coarse-to-fine model for identifying defects across different scales, segmented into three levels: sub-image, region, and pixel. While these deep learning-driven methods detect their corresponding defects well, they overlook the significance of defect size: varying defect sizes influence the final recognition rate and ultimately reduce recognition accuracy. To tackle these issues, this paper introduces a dual-path feature fusion-based model tailored for detecting surface defects on steel rails.
The primary contributions of this study can be summarized as follows:
  • This study proposes a steel rail surface defect detection model based on dual-path feature fusion (DPF). The model is designed with two distinct paths to separately extract low-level and high-level features. By utilizing an attention mechanism and feature fusion, these features are integrated, preserving richer information and enhancing the accuracy and robustness of detection.
  • The Dy-Bottleneck module is proposed in this paper, which incorporates a dual-path structure combining two parallel and interactive dynamic convolutions. This module dynamically adjusts based on the characteristics of the input data, allowing it to adapt to diverse datasets and complex scenarios.
  • A symmetric feature attention fusion module is introduced in this study. This module combines the lightweight Convolutional Block Attention Module (CBAM) with the symmetric design of the feature pyramid network (FPN). Specifically, CBAM attention and FPN structures are employed in both the feature extraction and feature fusion stages. This symmetric design makes the module more compact and consistent, enhancing the model’s understanding and recognition ability of images for better performance.
The structure of this paper is organized as follows: in Section 2, we introduce in detail the basic theoretical knowledge that the research relies on. In Section 3, we comprehensively elaborate on the proposed new method. In Section 4, we present the experimental results and conduct comparative experiments with other methods to further verify the superiority of our proposed method. Finally, in Section 5, we provide a concise summary of the entire paper.

2. Related Work

2.1. Dynamic Convolution

The dynamic convolution described in [15] and employed in this paper increases model capacity without increasing network depth or width. Rather than using a single convolution kernel per layer, dynamic convolution adaptively combines several parallel convolution kernels based on input-dependent attention. Assembling multiple kernels in this way is computationally efficient thanks to their small size, yet offers superior representational power, because the kernels are aggregated nonlinearly via attention.
Figure 1 illustrates the structure of dynamic convolution, which is divided into multiple components, including pooling [16], fully connected layers [17], and activation functions [18]. The model’s capability is enhanced by aggregating multiple convolution kernels through attention [19]. For different input images, these kernels are assembled in various ways.
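As a concrete illustration of this mechanism, the following PyTorch sketch aggregates K parallel kernels with input-dependent attention computed by pooling and fully connected layers. The module name `DynamicConv2d`, the choice of K = 4, and the exact attention branch are our assumptions for illustration, not the configuration of [15].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Illustrative dynamic convolution: K parallel kernels mixed by
    input-dependent attention (global pooling -> FC -> softmax)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, K=4):
        super().__init__()
        self.K, self.out_ch, self.ks = K, out_ch, kernel_size
        # K parallel convolution kernels and biases
        self.weight = nn.Parameter(
            torch.randn(K, out_ch, in_ch, kernel_size, kernel_size) * 0.02)
        self.bias = nn.Parameter(torch.zeros(K, out_ch))
        # Attention branch: pooling, fully connected layers, activation
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, max(in_ch // 4, 1)), nn.ReLU(inplace=True),
            nn.Linear(max(in_ch // 4, 1), K))

    def forward(self, x):
        B, C, H, W = x.shape
        pi = F.softmax(self.attn(x), dim=1)  # (B, K) per-image mixing weights
        # Aggregate the K kernels per sample, then apply as a grouped convolution
        w = torch.einsum('bk,koihw->boihw', pi, self.weight)
        w = w.reshape(B * self.out_ch, C, self.ks, self.ks)
        b = torch.einsum('bk,ko->bo', pi, self.bias).reshape(-1)
        y = F.conv2d(x.reshape(1, B * C, H, W), w, b,
                     padding=self.ks // 2, groups=B)
        return y.reshape(B, self.out_ch, H, W)
```

Because the mixing weights depend on the input, each image effectively sees its own assembled kernel, which is what gives the layer its extra capacity at modest cost.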

2.2. Convolutional Block Attention Module

The Convolutional Block Attention Module (CBAM) [20] consists of channel attention and spatial attention. The channel attention module focuses on the relationships between different channels in the feature map, enhancing the model’s ability to express features by learning the importance of each channel [21]. The spatial attention module captures the relationships between different spatial locations in the feature map, allowing the model to better understand the spatial structure of the target [22]. By combining the two, CBAM effectively improves the network’s representational capability, enabling it to better capture important feature information in the input data. This attention mechanism helps improve the model’s performance and achieve better results in object detection.
Figure 2 illustrates the structure of the CBAM. Module a in Figure 2, the channel attention module, includes average pooling, max pooling, and fully connected layers, which weight the features of each channel. Module b in Figure 2, the spatial attention module, takes the channel-refined feature map and weights it along the spatial dimensions, enhancing the network’s focus on informative locations.
The channel attention and spatial attention in the CBAM structure cooperate with each other. By regulating the channel and spatial dimensions of the feature map, the network’s ability to represent features is improved. This structural design helps to improve the performance of convolutional neural networks in various computer vision tasks.
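To make the two-stage design concrete, here is a minimal PyTorch sketch of a CBAM block as described above; the reduction ratio of 16 and the 7 × 7 spatial kernel are common defaults assumed for illustration rather than taken from this paper.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Illustrative CBAM: channel attention (avg + max pooling through a
    shared MLP) followed by spatial attention (conv over pooled channel maps)."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False))
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        # Channel attention: weight each channel by its learned importance
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention: weight each location using channel-pooled maps
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```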

2.3. Feature Pyramid

The feature pyramid network (FPN) module [23] integrates features extracted at different scales in a feature pyramid for subsequent feature matching and description. This enhances the ability to recognize and locate target objects or scenes in an image. The feature pyramid allows us to analyze images at different scales, which is extremely useful for processing target objects of varying sizes or images with different resolutions. Through the feature pyramid, we can conduct the global and local feature analysis of images at various scales [24].
Figure 3 illustrates the structure of the feature pyramid used in this paper. It comprises a bottom–up pathway, a top–down pathway, and 1 × 1 convolutional kernels that adjust the output channel counts of different feature maps for fusion [25]. The aim is to ensure that feature maps from different levels have the same channel dimension, allowing upsampled features from higher layers to be added without altering the feature map size [26]. By combining the bottom–up and top–down pathways with lateral connections, multi-scale feature extraction and fusion can be achieved, improving the performance of convolutional neural networks in computer vision tasks [27].
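A minimal sketch of this top–down fusion is given below; the channel counts are placeholders, and the 3 × 3 smoothing convolutions after fusion are a standard FPN detail assumed for illustration.

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Illustrative FPN: 1x1 lateral convs unify channel counts so that
    upsampled higher-level maps can be added to lower-level ones."""
    def __init__(self, in_channels=(128, 256, 512), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1)
            for _ in in_channels)

    def forward(self, feats):  # feats ordered from high to low resolution
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # Top-down pathway: upsample each level and add it to the one below
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], scale_factor=2, mode='nearest')
        return [s(l) for s, l in zip(self.smooth, laterals)]
```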

3. Methods

3.1. Overall Structure

To address the issues of low detection accuracy for small defects, the loss of feature information due to single-path architecture backbones, and inadequate information interaction in existing rail image target detection methods, this paper proposes a rail surface defect target detection model based on dual-path feature fusion. The model consists of three main components: a dual-path backbone network, a symmetrical feature attention fusion module (Neck), and a detection head (Head). The overall architecture of the proposed dual-path feature fusion (DPF) is shown in Figure 4.
The dual-path backbone network is built upon the proposed Dy-Bottleneck module, which allows for the simultaneous extraction and fusion of features at different levels. This contributes to a more comprehensive understanding of the input data by the model and enables the capture of multi-scale and multi-level feature information.
The symmetrical feature attention fusion module employs CBAM attention, enabling the model to dynamically learn the importance of different channels and spatial locations. This allows the model to focus on critical regions within the image, enhancing its ability to perceive important features. Additionally, the integration of two FPN structures aids in capturing rich semantic information from the image, improving the model’s ability to recognize objects of different scales. The symmetrical design facilitates smooth information transfer and consistency within the module, enhancing its effectiveness and stability.
The detection head (Head) is primarily responsible for converting feature maps into target predictions. It also includes a loss function that calculates the discrepancy between the model’s predictions and the ground truth labels. The model parameters are then updated through backpropagation.

3.2. Dy-Bottleneck Module

The feature extraction section of the dual-path backbone network is built upon the Dy-Bottleneck module, which embodies a dual-path structural design. The Dy-Bottleneck module accepts two input feature maps, denoted as $X^H$ and $X^L$, and produces outputs of corresponding dimensions. Its architecture comes in three distinct variants.
Figure 5 illustrates the structure of Dy-Bottleneck (F). It processes the input $X^H$ twice, as detailed in the following formulas:
$$Y_1^H = \text{Dy-Conv}(X^L) + X^H$$
$$Y_1^L = \text{Dy-Conv}_1(X^H) + \text{Dy-Conv}_2(X^H)$$
In the formula, Dy-Conv represents dynamic convolution.
Figure 6 depicts the structure of Dy-Bottleneck (M). The specific formula is as follows:
$$Y_2^H = \text{Dy-Conv}(X^L) + Y_1^H$$
$$Y_2^L = \text{Dy-Conv}(X^H) + Y_1^L$$
In the formula, Dy-Conv represents dynamic convolution.
Figure 7 depicts the structure of Dy-Bottleneck (L). The specific formula is as follows:
$$Y_3^H = \text{Dy-Conv}_1(Y_2^L) + Y_2^H + \text{Dy-Conv}_3\!\left(\text{Dy-Conv}_2(Y_2^L) + Y_2^H\right)$$
In the formula, Dy-Conv represents dynamic convolution.
The Dy-Bottleneck module is capable of more effectively integrating high-frequency and low-frequency features, effectively enhancing the model’s ability to represent input data, thereby improving the model’s performance and generalization capabilities.
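For illustration, the sketch below implements the Dy-Bottleneck (F) variant following the two formulas above, reusing the DynamicConv2d sketch from Section 2.1. It assumes, for simplicity, that both inputs share the same resolution and channel count; the actual module also handles the resolution split listed in Table 1.

```python
import torch.nn as nn

class DyBottleneckF(nn.Module):
    """Illustrative Dy-Bottleneck (F): the high-path output adds a dynamic
    convolution of X^L to X^H, while the low-path output sums two dynamic
    convolutions of X^H. Uses the DynamicConv2d sketch from Section 2.1."""
    def __init__(self, channels):
        super().__init__()
        self.dy = DynamicConv2d(channels, channels)   # high-path branch
        self.dy1 = DynamicConv2d(channels, channels)  # low-path branch 1
        self.dy2 = DynamicConv2d(channels, channels)  # low-path branch 2

    def forward(self, x_h, x_l):
        y1_h = self.dy(x_l) + x_h             # first formula above
        y1_l = self.dy1(x_h) + self.dy2(x_h)  # second formula above
        return y1_h, y1_l
```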
As can be seen from Figure 4, the proposed structure consists of 1 Dy-Bottleneck (F); 12 Dy-Bottleneck (M) arranged in the configuration of 2, 4, 4, and 2; and 1 Dy-Bottleneck (L), forming the dual-path backbone. One path focuses on high-frequency features, while the other path handles low-frequency features. In this process, high-frequency features often take precedence over low-frequency features as they are more representative of the data distribution. However, in certain scenarios, low-frequency features can also be crucial. Table 1 illustrates the structure of the backbone.
After carefully evaluating both time and computational costs, we chose a dual-path methodology that produces six unique feature maps. This approach achieves a harmonious balance between efficiency and comprehensiveness, thereby guaranteeing a wider capture of information without compromising resource utilization.

3.3. Symmetric Feature Attention Fusion Module

As shown in Figure 4, the overall structure of the symmetric feature attention fusion module can be observed. The feature maps output by the dual-path backbone network, denoted as L1, L2, L3 and H1, H2, H3, serve as its inputs. Each input first passes through CBAM, which effectively enhances the network’s representational capability and enables it to better capture crucial feature information. The attention-processed maps are then grouped by path and fed into two sets of FPNs. The fused features carry multi-scale information from different levels, helping the model better understand contextual information and the semantic relationships of the target; this in turn strengthens the model’s perception of the target and improves the accuracy and robustness of object detection. The FPN outputs are grouped by size and fused via element-wise addition, yielding three sets of output feature maps. The process can be expressed mathematically as follows:
$$D_1 = \text{FPN}(\text{CBAM}(L_1)) \oplus \text{FPN}(\text{CBAM}(H_1))$$
$$D_2 = \text{FPN}(\text{CBAM}(L_2)) \oplus \text{FPN}(\text{CBAM}(H_2))$$
$$D_3 = \text{FPN}(\text{CBAM}(L_3)) \oplus \text{FPN}(\text{CBAM}(H_3))$$
In the formulas, FPN represents the feature pyramid network, CBAM denotes the Convolutional Block Attention Module, and ⊕ signifies element-wise addition.
Finally, the symmetric feature attention fusion module outputs D1, D2, and D3.
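A compact sketch of this neck, following the three formulas above, is shown below. It reuses the CBAM and SimpleFPN sketches from Section 2 and assumes the two path groups yield maps of matching sizes at each level, so that element-wise addition is well defined.

```python
import torch.nn as nn

class SymmetricFeatureAttentionFusion(nn.Module):
    """Illustrative neck: each backbone map passes through CBAM, the two path
    groups go through separate FPNs, and same-scale outputs are added."""
    def __init__(self, channels=(128, 256, 512)):
        super().__init__()
        self.cbam_l = nn.ModuleList(CBAM(c) for c in channels)
        self.cbam_h = nn.ModuleList(CBAM(c) for c in channels)
        self.fpn_l = SimpleFPN(channels)
        self.fpn_h = SimpleFPN(channels)

    def forward(self, lows, highs):  # (L1, L2, L3) and (H1, H2, H3)
        lows = [cb(f) for cb, f in zip(self.cbam_l, lows)]
        highs = [cb(f) for cb, f in zip(self.cbam_h, highs)]
        fused_l, fused_h = self.fpn_l(lows), self.fpn_h(highs)
        # Element-wise addition of same-scale maps yields D1, D2, D3
        return [a + b for a, b in zip(fused_l, fused_h)]
```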

3.4. Detection Head

The detection head module in this paper draws inspiration from the YOLO series, renowned for its simplicity and efficiency [28]. It employs three 1 × 1 convolutional layers to adjust feature channels and generate the detection feature layers. The primary function of 1 × 1 convolution is dimensionality reduction or expansion; importantly, it operates exclusively on the channel dimension, preserving the spatial size of the feature map while altering the number of channels to shape the feature representation. This design maintains high performance in object detection, supporting feature extraction and transformation through convolutional operations and paving the way for subsequent object localization and classification. The lightweight design of the detection head also speeds up model training and inference, enhancing the model’s practical applicability. The network predicts objects of varying sizes through multi-scale feature maps, with each map corresponding to a grid; these feature maps undergo 1 × 1 convolution to generate detection results, as illustrated in Figure 8, which depicts the schematic of the detection head.
In the figure, the three feature maps of different scales essentially represent three grids, with sizes of 80 × 80, 40 × 40, and 20 × 20, respectively. Through 1 × 1 convolution, these feature maps are transformed to 80 × 80 × 3 × (5 + 3), 40 × 40 × 3 × (5 + 3), and 20 × 20 × 3 × (5 + 3). In the notation 3 × (5 + 3), the leading ‘3’ represents the three anchors embedded in each grid cell; ‘5’ denotes the four positional coordinates (x, y, w, h) plus a confidence score, which indicates the probability that an object is present in the grid cell; and the ‘3’ in parentheses corresponds to the three categories in the steel rail dataset.
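A minimal sketch of such a head is shown below: with 3 anchors, 5 box-plus-confidence values, and 3 classes, each 1 × 1 convolution outputs 3 × (5 + 3) = 24 channels per grid cell. The input channel counts are placeholders.

```python
import torch.nn as nn

class DetectionHead(nn.Module):
    """Illustrative YOLO-style head: one 1x1 conv per scale maps each feature
    map to num_anchors * (4 box coords + 1 confidence + num_classes) channels."""
    def __init__(self, in_channels=(256, 256, 256), num_anchors=3, num_classes=3):
        super().__init__()
        out_ch = num_anchors * (5 + num_classes)  # 3 * (5 + 3) = 24
        self.preds = nn.ModuleList(
            nn.Conv2d(c, out_ch, 1) for c in in_channels)

    def forward(self, feats):  # e.g. 80x80, 40x40, and 20x20 feature maps
        # Each output keeps its grid size; only the channel count changes
        return [p(f) for p, f in zip(self.preds, feats)]
```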

3.5. Loss Function

The loss function used in this paper comprises three terms: classification loss [29], confidence loss [30], and location loss [31]. The classification loss measures the accuracy of the model’s category predictions and trains the model to correctly classify detected objects. The confidence loss trains the model to predict whether an object is present, bringing the confidence scores closer to the true situation. The location loss measures the discrepancy between the predicted bounding box and the ground truth bounding box, helping the model adjust the predicted box positions to better fit real targets. The formulas for the loss terms are as follows:
(1)
Classification Loss
To measure the model’s classification accuracy for target classes, the cross-entropy loss function is adopted for classification. Its formula is as follows:
$$L_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} t_{i,c}\,\log p_{i,c}$$
In the formula, $N$ is the number of samples, $C$ is the number of categories, $p_{i,c}$ is the model’s predicted probability that sample $i$ belongs to category $c$, and $t_{i,c}$ is the corresponding one-hot ground truth label.
(2)
Confidence Loss
To measure the model’s prediction accuracy of target existence, we use the binary cross-entropy loss function for confidence loss, as shown in the following formula:
$$L_{conf} = \frac{1}{NB}\sum_{i=1}^{N}\sum_{j=1}^{B}\left[t_{i,j}^{obj}\sum_{n\in\{conf\}}\left(t_{i,j}^{n} - p_{i,j}^{n}\right)^2 + \lambda_{noobj}\,t_{i,j}^{noobj}\sum_{n\in\{conf\}}\left(t_{i,j}^{n} - p_{i,j}^{n}\right)^2\right]$$
In the formula, $N$ is the number of samples, $B$ is the number of anchor boxes per sample, $p_{i,j}^{n}$ is the model’s predicted confidence for the $j$-th anchor box of the $i$-th sample, $t_{i,j}^{obj}$ indicates whether that anchor box contains a target, $t_{i,j}^{noobj}$ indicates that it does not, and $\lambda_{noobj}$ is the weight coefficient for the confidence loss of anchor boxes containing no target.
(3)
Location Loss
The location loss used is the mean squared error loss, which measures the prediction accuracy of the model for target positions, as shown in the following formula:
$$L_{loc} = \frac{1}{N_B}\sum_{i=1}^{N}\sum_{j=1}^{B}\mathbb{1}_{i,j}^{obj}\,\lambda_{coord}\sum_{n\in\{x,y,w,h\}}\left(t_{i,j}^{n} - p_{i,j}^{n}\right)^2$$
In the formula, $N$ is the number of samples, $B$ is the number of anchor boxes per sample, $N_B$ is the number of anchor boxes containing targets, $\mathbb{1}_{i,j}^{obj}$ indicates that the $j$-th anchor box of the $i$-th sample contains a target, $p_{i,j}^{n}$ and $t_{i,j}^{n}$ are the predicted and ground truth values of positional component $n$ for that anchor box, and $\lambda_{coord}$ is the weight coefficient for the location loss.
So, the total loss formula is as follows:
$$L = L_{cls} + \lambda_{coord}\,L_{loc} + \lambda_{conf}\,L_{conf}$$
In the formula, $\lambda_{coord}$ and $\lambda_{conf}$ represent the weight coefficients for the location loss and confidence loss, respectively.
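The sketch below assembles the three terms in PyTorch under simplifying assumptions: predictions and targets are already matched per anchor, the weight values are placeholders rather than the paper’s tuned coefficients, and the classification and location terms are computed over positive anchors only.

```python
import torch

def total_loss(p_cls, t_cls, p_conf, t_conf, p_box, t_box, obj_mask,
               lam_coord=5.0, lam_noobj=0.5, lam_conf=1.0):
    """Illustrative composite loss: cross-entropy classification, squared-error
    confidence with down-weighted background anchors, squared-error location."""
    n_pos = obj_mask.sum().clamp_min(1)
    # Classification: cross-entropy over class probabilities, positives only
    l_cls = -(t_cls * torch.log(p_cls.clamp_min(1e-9)))[obj_mask].sum() / n_pos
    # Confidence: squared error, no-object anchors down-weighted by lam_noobj
    se = (t_conf - p_conf) ** 2
    l_conf = (se[obj_mask].sum() + lam_noobj * se[~obj_mask].sum()) / obj_mask.numel()
    # Location: squared error on (x, y, w, h), positives only
    l_loc = ((t_box - p_box) ** 2).sum(dim=-1)[obj_mask].sum() / n_pos
    return l_cls + lam_coord * l_loc + lam_conf * l_conf

# Toy shapes: N samples, B anchors per sample, C classes
N, B, C = 4, 100, 3
loss = total_loss(torch.rand(N, B, C), torch.rand(N, B, C),
                  torch.rand(N, B), torch.rand(N, B),
                  torch.rand(N, B, 4), torch.rand(N, B, 4),
                  torch.rand(N, B) > 0.9)
```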

4. Experimental Results

4.1. Datasets

The public RailDefect rail surface defect dataset [32] used in this paper was collected on the railway test loop of the National Academy of Railway Sciences using a linear array camera installed on a high-speed train. The dataset contains over 10,000 images, of which 400 exhibit distinct defect features. These defects fall mainly into five types: peeling, scratches, crushing, indentations, and cracks, covering the major issues that may arise on rail surfaces. Additionally, the dataset includes images of dirt, gaps, and unknown categories, and is divided into eight fine-grained versions and three coarse classification versions. In our experiments, we adopted three common and practical categories for detection: defects, gaps, and dirt, and selected 80% of the images as the training set and 20% as the validation set. Figure 9 illustrates representative defects in the dataset.

4.2. Evaluation Metrics

In the process of object detection, the comparison between predicted labels and actual labels in the test set yields four different outcomes: True Positives (TPs), True Negatives (TNs), False Positives (FPs), and False Negatives (FNs) [33]. This paper employs Precision (P) [34], Recall (R) [35], and Average Precision (AP) [36] as evaluation metrics to assess the effectiveness of all models. The mean Average Precision (mAP) is calculated by averaging the AP values across all categories [37].
The calculation method for P is as follows:
$$P = \frac{TP}{TP + FP}$$
The calculation method for R is as follows:
$$R = \frac{TP}{TP + FN}$$
The calculation method for AP is as follows:
$$AP = \int_{0}^{1} P(R)\,\mathrm{d}R$$
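As an illustration of how AP is obtained in practice, the sketch below accumulates TP/FP counts over confidence-sorted detections and integrates the resulting precision–recall curve; the trapezoidal integration is an assumption, as the paper does not state its interpolation scheme.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Illustrative AP: sort detections by confidence, accumulate TP/FP,
    then numerically integrate precision over recall."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(1.0 - tp)
    precision = cum_tp / (cum_tp + cum_fp)  # P = TP / (TP + FP)
    recall = cum_tp / max(num_gt, 1)        # R = TP / (TP + FN)
    return float(np.trapz(precision, recall))

# Toy example: five detections sorted by score, three ground truth boxes
print(average_precision([0.9, 0.8, 0.7, 0.6, 0.5], [1, 1, 0, 1, 0], num_gt=3))
```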

4.3. Experimental Parameter Settings

The proposed method is implemented based on the PyTorch framework. During the training process, an NVIDIA RTX 3060 GPU (NVIDIA, Santa Clara, CA, USA) was used as the training device. The SGD optimizer was selected to minimize cross-entropy loss. To ensure a stable training process and optimal performance, the learning rate and batch size were set to 0.001 and 8, respectively. The SGD optimizer was chosen because it is a commonly used gradient descent algorithm that demonstrates fast convergence and good generalization capabilities. The selection of learning rate, number of epochs, and batch size took into account multiple factors. An excessively high learning rate might lead to unstable training dynamics, while a too-small learning rate could result in slow convergence during training. The constructed deep learning model was run for 300 iterations. The model’s performance was monitored in real time during training, and model parameters were saved at the moment of optimal performance. Finally, the model demonstrating the highest accuracy level was selected as the optimal model.

4.4. Experimental Results

(1)
Ablation Experiment
To verify whether the hypotheses in the model design are valid, ablation experiments can be conducted. By changing the structure or components of the model step by step, we can examine its performance under different conditions and confirm whether the design meets expectations. Accordingly, ablation experiments were performed on the proposed dual-path feature fusion model for steel rail surface defect detection, with results shown in Table 2. Here, “dual-path backbone” indicates whether the dual-path structure is used (otherwise, a single path built from regular convolutional blocks without dynamic convolution); “dynamic convolution” indicates whether dynamic convolution is used in the Dy-Bottleneck module; “symmetric attention” indicates whether two FPNs are used instead of one; and “CBAM” indicates whether the attention mechanism is incorporated into the model.
In Table 2, √ indicates that a component is used and × that it is not. The table reveals that the model performs poorly without the dual-path backbone, dynamic convolution, symmetric attention, and CBAM, achieving roughly 20% lower performance than Experiment 7, in which all components are present; this underscores their importance. The introduction of symmetric attention raises precision by roughly 5%, indicating its role in stabilizing the model’s performance, and the addition of CBAM steadily improves all metrics, suggesting that the attention mechanism focuses the model on defective areas and significantly improves predictions. In Experiments 4, 5, and 6, removing dynamic convolution, symmetric attention, or CBAM individually degrades performance to varying degrees relative to the complete model, with dynamic convolution having the largest impact. In conclusion, the ablation study demonstrates that every component of the model plays a vital role and is indispensable.
(2)
Visualization Result Analysis
To gain insights into the model’s evolution during training, we present several visualizations of the training process. Figure 10 depicts the loss variation graph, illustrating how the loss function value changes over the course of training. By examining the loss curve, we can assess whether the model’s training is converging, detect issues such as overfitting or underfitting, and select appropriate learning rates and optimization algorithms. The loss curve reveals that the model stabilizes within 300 iterations, indicating rapid convergence.
Figure 11 depicts the Precision Curve (P Curve), Precision–Recall Curve (PR Curve), and Recall Curve (R Curve). The P Curve illustrates how precision varies with different recall rates along the PR Curve. By examining this curve, one can select an appropriate threshold to maximize precision and evaluate the model’s performance across various recall levels. The PR Curve assesses the performance of a classification model under different thresholds. It showcases the relationship between a classifier’s precision and recall. Observing the PR Curve allows us to determine a suitable threshold that balances precision and recall, enabling an evaluation of the model’s performance across different categories. The R Curve demonstrates how recall changes with varying precision levels along the PR Curve. This curve aids in selecting a threshold to maximize recall and appraise the model’s performance at different precision levels. The distinct upward convexity of the P Curve indicates high precision across various thresholds. The substantial area under the PR Curve signifies that the model maintains high precision and recall in diverse scenarios. The prominent upward convexity of the R Curve suggests high recall at different precision levels, and its extensive area underscores the model’s excellent recall performance across various precision points.
Figure 12 presents the object detection results obtained using the proposed model in this paper. The upper section displays the labels, while the lower section showcases the detection outcomes. As is evident from the figure, the model effectively detects most defects accurately, demonstrating its remarkably high performance level.
(3)
Comparative Experiments with Mainstream Models
To objectively evaluate the performance of our model and demonstrate its value and advantages, we conducted a comparative analysis with mainstream models. This comparison helped to determine whether our model performs better than existing mainstream models on specific tasks, providing a basis for practical applications. Table 3 presents the comparison results with mainstream models.
From Table 3, we can observe that the proposed algorithm demonstrates superior performance in most categories and metrics. Compared with the strongest baseline, precision improves by 0.5%, 3.1%, and 4.2% on the defect, dirt, and gap categories, respectively, and the defect recall rate increases by 0.1%. The remaining metrics are either consistent with the baselines or deviate only minimally.
Figure 13 illustrates the visualization results, where (a) represents the labeled image, (b) YOLOv4, (c) YOLOv5s, (d) YOLOv8n, and (e) the algorithm proposed in this paper. As seen in the figure, our proposed algorithm accurately detects all defects, indicating that the model introduced in this paper is more precise and stable in recognizing surface defects on steel rails, thus meeting the requirements for rail defect identification.

5. Conclusions

In this paper, we introduced a novel DPF architecture for detecting defects on the surface of steel rails. Specifically, (1) we obtained depth feature maps of different levels through a dual-path structure. (2) We used a symmetric feature attention fusion module to fuse feature maps of the same depth and put feature maps of different levels into the detection head for detection, thereby improving detection accuracy. Experiments were conducted on a public dataset with three defect categories. The accuracy of the proposed model reached 98.6% for defects, 94.1% for dirt, and 100% for gaps. Experimental data demonstrate that the proposed model exhibits strong competitiveness in performance compared to currently prevalent algorithms, satisfying the demand for the high-precision identification of rail defects.

Author Contributions

Conceptualization, G.C. and Y.Z.; methodology, G.C. and Y.Z.; software, G.C. and Y.Z.; validation, G.C. and Y.Z.; formal analysis, G.C. and Y.Z.; investigation, G.C. and Y.Z.; resources, G.C. and Y.Z.; data curation, G.C. and Y.Z.; writing—original draft preparation, G.C. and Y.Z.; writing—review and editing, G.C. and Y.Z.; visualization, G.C. and Y.Z.; supervision, G.C.; project administration, G.C.; funding acquisition, G.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Technology Innovation and Application Development Project of Chongqing Science and Technology Bureau, grant number CSTB2023TIAD-GPX0049; cooperative projects between universities in Chongqing and the Chinese Academy of Sciences, grant number HZ2021015; the General Program of Chongqing Science and Technology Commission, grant number cstc2021jcyj-msxm3332; the General Project of Chongqing Municipal Science and Technology Commission, grant number cstc2021jcyjmsxm3332; the Sichuan Science and Technology Program, grant number 2023JDRC0033; Young Projects of the Science and Technology Research Program of Chongqing Education Commission of China, grant numbers KJQN202001513 and KJQN202101501; the Luzhou Science and Technology Program, grant number 2021-JYJ-92; the Chongqing Postgraduate Scientific Research Innovation Project, grant number CYS23752; the Science and Technology Research Program of Chongqing Municipal Education Commission, grant numbers KJZD-K202100104 and KJQN202301543; the Natural Science Foundation of Chongqing, grant number cstc2021jcyjmsxmX1212; the Oil and Gas Production Safety and Risk Control Key Laboratory of Chongqing open fund, grant number cqsrc202110; and the Chongqing University of Science and Technology master’s and doctoral student innovation project, grant number ZNYKC2314.

Data Availability Statement

Data are contained within the article.

Acknowledgments

We would like to thank Xiao Huang of the Hong Kong Polytechnic University for his guidance and support in the methods and experiments of this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Anderson, R.T.; Barkan, C.P.L. Railroad accident rates for use in transportation risk analysis. Transp. Res. Rec. 2004, 1863, 88–98. [Google Scholar] [CrossRef]
  2. Shang, L.; Yang, Q.; Wang, J.; Li, S.; Lei, W. Detection of rail surface defects based on CNN image recognition and classification. In Proceedings of the 2018 20th International Conference on Advanced Communication Technology (ICACT), Chuncheon, Republic of Korea, 11–14 February 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 45–51. [Google Scholar]
  3. Gasparini, R.; D’Eusanio, A.; Borghi, G.; Pini, S.; Scaglione, G.; Calderara, S.; Fedeli, E.; Cucchiara, R. Anomaly detection, localization and classification for railway inspection. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 3419–3426. [Google Scholar]
  4. Soleimanmeigouni, I.; Ahmadi, A.; Nissen, A.; Xiao, X. Prediction of railway track geometry defects: A case study. Struct. Infrastruct. Eng. 2020, 16, 987–1001. [Google Scholar] [CrossRef]
  5. Feng, J.H.; Yuan, H.; Hu, Y.Q.; Lin, J.; Liu, S.W.; Luo, X. Research on deep learning method for rail surface defect detection. IET Electr. Syst. Transp. 2020, 10, 436–442. [Google Scholar] [CrossRef]
  6. Yu, H.; Li, Q.; Tan, Y.; Gan, J.; Wang, J.; Geng, Y.A.; Jia, L. A coarse-to-fine model for rail surface defect detection. IEEE Trans. Instrum. Meas. 2018, 68, 656–666. [Google Scholar] [CrossRef]
  7. Chen, Z.; Wang, Q.; He, Q.; Yu, T.; Zhang, M.; Wang, P. CUFuse: Camera and ultrasound data fusion for rail defect detection. IEEE Trans. Intell. Transp. Syst. 2022, 23, 21971–21983. [Google Scholar] [CrossRef]
  8. Zhou, W.; Hong, J. FHENet: Lightweight feature hierarchical exploration network for real-time rail surface defect inspection in RGB-D images. IEEE Trans. Instrum. Meas. 2023, 72, 5005008. [Google Scholar] [CrossRef]
  9. Papaelias, M.P.; Roberts, C.; Davis, C.L. A review on non-destructive evaluation of rails: State-of-the-art and future development. Proc. Inst. Mech. Eng. Part F J. Rail Rapid Transit 2008, 222, 367–384. [Google Scholar] [CrossRef]
  10. Zhang, J.; Zhang, J.; Chen, J.; Wang, S.; Wang, L. Rail Surface Defect Detection Through Bimodal RSDINet and Three-Branched Evidential Fusion. IEEE Trans. Instrum. Meas. 2023, 72, 2508714. [Google Scholar] [CrossRef]
  11. Feng, H.; Jiang, Z.; Xie, F.; Yang, P.; Shi, J.; Chen, L. Automatic fastener classification and defect detection in vision-based railway inspection systems. IEEE Trans. Instrum. Meas. 2013, 63, 877–888. [Google Scholar] [CrossRef]
  12. Alemi, A.; Corman, F.; Lodewijks, G. Condition monitoring approaches for the detection of railway wheel defects. Proc. Inst. Mech. Eng. Part F J. Rail Rapid Transit 2017, 231, 961–981. [Google Scholar] [CrossRef]
  13. Wei, X.; Yang, Z.; Liu, Y.; Wei, D.; Jia, L.; Li, Y. Railway track fastener defect detection based on image processing and deep learning techniques: A comparative study. Eng. Appl. Artif. Intell. 2019, 80, 66–81. [Google Scholar] [CrossRef]
  14. Ge, H.; Huat, D.C.K.; Koh, C.G.; Dai, G.; Yu, Y. Guided wave–based rail flaw detection technologies: State-of-the-art review. Struct. Health Monit. 2022, 21, 1287–1308. [Google Scholar] [CrossRef]
  15. Soukup, D.; Huber-Mörk, R. Convolutional neural networks for steel surface defect detection from photometric stereo images. In Proceedings of the International Symposium on Visual Computing, Las Vegas, NV, USA, 8–10 December 2014; Springer International Publishing: Cham, Switzerland, 2014; pp. 668–677. [Google Scholar]
  16. Li, Y.; Trinh, H.; Haas, N.; Otto, C.; Pankanti, S. Rail component detection, optimization, and assessment for automatic rail track inspection. IEEE Trans. Intell. Transp. Syst. 2013, 15, 760–770. [Google Scholar]
  17. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar]
  18. Zhang, Q.; Yang, Y.B. Rest: An efficient transformer for visual recognition. Adv. Neural Inf. Process. Syst. 2021, 34, 15475–15485. [Google Scholar]
  19. Gullers, P.; Dreik, P.; O Nielsen, J.C.; Ekberg, A.; Andersson, L. Track condition analyser: Identification of rail rolling surface defects, likely to generate fatigue damage in wheels, using instrumented wheelset measurements. Proc. Inst. Mech. Eng. Part F J. Rail Rapid Transit 2011, 225, 1–13. [Google Scholar] [CrossRef]
  20. Dong, H.; Song, K.; He, Y.; Xu, J.; Yan, Y.; Meng, Q. PGA-Net: Pyramid feature fusion and global context attention network for automated surface defect detection. IEEE Trans. Ind. Inform. 2019, 16, 7448–7458. [Google Scholar] [CrossRef]
  21. Yunjie, Z.; Xiaorong, G.; Lin, L.; Yongdong, P.; Chunrong, Q. Simulation of laser ultrasonics for detection of surface-connected rail defects. J. Nondestruct. Eval. 2017, 36, 70. [Google Scholar] [CrossRef]
  22. Vincent, O.R.; Babalola, Y.E.; Sodiya, A.S.; Adeniran, O.J. A Cognitive Rail Track Breakage Detection System Using Artificial Neural Network. Appl. Comput. Syst. 2021, 26, 80–86. [Google Scholar] [CrossRef]
  23. Cheng, X.; Yu, J. RetinaNet with difference channel attention and adaptively spatial feature fusion for steel surface defect detection. IEEE Trans. Instrum. Meas. 2020, 70, 2503911. [Google Scholar] [CrossRef]
  24. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  25. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  26. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  27. Wu, W.C.; Yin, C.C. Generation and directional decomposition of guided waves for finite-range defect detection in rail tracks. J. Mech. 2023, 39, 540–553. [Google Scholar] [CrossRef]
  28. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  29. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  30. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  31. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  32. Hopfield, J.J. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA 1984, 81, 3088–3092. [Google Scholar] [CrossRef] [PubMed]
  33. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 13713–13722. [Google Scholar]
  34. Gibert, X.; Patel, V.M.; Chellappa, R. Deep multitask learning for railway track inspection. IEEE Trans. Intell. Transp. Syst. 2016, 18, 153–164. [Google Scholar] [CrossRef]
  35. Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. Simam: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, PMLR 2021, Virtual, 18–24 July 2021; pp. 11863–11874. [Google Scholar]
  36. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11534–11542. [Google Scholar]
  37. Song, X.; Wang, Y.; Li, C.; Song, L. WDC-YOLO: An improved YOLO model for small objects oriented printed circuit board defect detection. J. Electron. Imaging 2024, 33, 013051. [Google Scholar] [CrossRef]
  38. Dong, J.Y.; Lv, W.T.; Bao, X.M. Research progress of the PCB surface defect detection method based on machine vision. J. Zhejiang Sci-Tech Univ. 2021, 45, 379–389. [Google Scholar]
  39. Akram, M.W.; Li, G.; Jin, Y.; Chen, X.; Zhu, C.; Ahmad, A. Automatic detection of photovoltaic module defects in infrared images with isolated and develop-model transfer deep learning. Sol. Energy 2020, 198, 175–186. [Google Scholar] [CrossRef]
  40. Silva, L.H.D.S.; Azevedo, G.O.D.A.; Fernandes, B.J.; Bezerra, B.L.; Lima, E.B.; Oliveira, S.C. Automatic optical inspection for defective PCB detection using transfer learning. In Proceedings of the 2019 IEEE Latin American Conference on Computational Intelligence (LA-CCI), Guayaquil, Ecuador, 11–15 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–6. [Google Scholar]
  41. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, PMLR 2019, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  42. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
  43. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10781–10790. [Google Scholar]
Figure 1. Structure of dynamic convolution.
Figure 2. Structure of CBAM.
Figure 3. Structure of the feature pyramid.
Figure 4. Overall architecture of the proposed dual-path feature fusion (DPF).
Figure 5. Structure of Dy-Bottleneck (F).
Figure 6. Structure of Dy-Bottleneck (M).
Figure 7. Structure of Dy-Bottleneck (L).
Figure 8. Schematic diagram of the detection head.
Figure 9. Example images from the RailDefect steel defect dataset.
Figure 10. Diagrams of changes during the training process.
Figure 11. The graphs of the P Curve, PR Curve, and R Curve.
Figure 12. Test results; the first row represents the labels, while the second row displays the model detection results.
Figure 13. Data sample display diagram. (a): The labeled image; (b): YOLOv4; (c): YOLOv5s; (d): YOLOv8n; (e): the algorithm proposed in this paper.
Table 1. Diagram of the main network backbone structure.

| Main Modules | Input (Dual-Path) Size Dimensions | Output (Dual-Path) Size Dimensions | Quantity |
|---|---|---|---|
| Max pooling | 640 × 640 × 3 | 320 × 320 × 64 | 1 |
| Conv 2D | 320 × 320 × 64 | 160 × 160 × 128 | 1 |
| Dy-Bottleneck (F) | 160 × 160 × 128 | 160 × 160 × 128 and 80 × 80 × 128 | 1 |
| Dy-Bottleneck (M) 1st | 160 × 160 × 128 and 80 × 80 × 128 | 160 × 160 × 128 and 80 × 80 × 128 | 2 |
| Dy-Bottleneck (M) 2nd | 160 × 160 × 128 and 80 × 80 × 128 | 80 × 80 × 256 and 40 × 40 × 256 | 4 |
| Dy-Bottleneck (M) 3rd | 80 × 80 × 256 and 40 × 40 × 256 | 40 × 40 × 512 and 20 × 20 × 512 | 4 |
| Dy-Bottleneck (M) 4th | 40 × 40 × 512 and 20 × 20 × 512 | 40 × 40 × 512 and 20 × 20 × 512 | 2 |
| Dy-Bottleneck (L) | 40 × 40 × 512 and 20 × 20 × 512 | 20 × 20 × 1024 | 1 |
Table 2. Diagram of the ablation experiment.

| No. | Dual-Path Backbone | Dynamic Convolution | Symmetric Attention | CBAM | P (%) | R (%) | mAP@0.5 (%) |
|---|---|---|---|---|---|---|---|
| 1 | × | × | × | × | 68.4 | 70.2 | 71.5 |
| 2 | √ | × | × | × | 73.2 | 65.1 | 71.4 |
| 3 | √ | √ | × | × | 87.4 | 68.0 | 72.5 |
| 4 | √ | × | √ | √ | 91.5 | 93.6 | 97.2 |
| 5 | √ | √ | × | √ | 93.8 | 95.8 | 97.3 |
| 6 | √ | √ | √ | × | 94.1 | 94.1 | 97.3 |
| 7 | √ | √ | √ | √ | 97.5 | 93.8 | 98.3 |
Table 3. Comparison test results.

| Model | P (%) Defect / Dirt / Gap | R (%) Defect / Dirt / Gap | mAP@0.5 (%) Defect / Dirt / Gap |
|---|---|---|---|
| SSD [37] | 68.4 / 63.3 / 87.4 | 70.2 / 67.1 / 58.0 | 71.5 / 69.9 / 72.8 |
| Faster R-CNN [38] | 85.7 / 87.1 / 89.7 | 80.8 / 78.7 / 79.1 | 86.2 / 86.5 / 87.2 |
| YOLOv3-tiny [39] | 90.1 / 90.6 / 91.3 | 82.9 / 83.2 / 83.0 | 88.8 / 90.1 / 89.7 |
| YOLOv4 [40] | 92.3 / 91.6 / 93.2 | 87.89 / 89.86 / 90.89 | 90.55 / 90.34 / 90.89 |
| YOLOv5s [41] | 91.4 / 83.3 / 87.3 | 91.6 / 90.7 / 92.75 | 91.3 / 91.6 / 94.4 |
| YOLOv8n [42] | 91.5 / 89.8 / 100 | 93.6 / 94.1 / 99.8 | 97.2 / 95.3 / 99.5 |
| DETR [43] | 98.1 / 91.0 / 95.8 | 94.4 / 94.1 / 95.8 | 98.3 / 97.3 / 99.0 |
| DPF | 98.6 / 94.1 / 100 | 94.5 / 94.1 / 92.7 | 98.3 / 97.3 / 99.2 |