1. Introduction
Rail transportation, as critical national infrastructure and an economic artery, demands the utmost safety and reliability. Track fasteners, the key connecting components that secure rails to sleepers, perform vital functions including maintaining gauge, transmitting loads, and absorbing vibrations [1]. Under the high-frequency forces exerted by passing trains and complex environmental loads, fastener defects occur frequently [2].
Figure 1 displays typical fastener defects identified during field investigations and provides a detailed classification of common fastener defect types. Once defects such as missing, broken, deformed, or severely corroded fasteners occur, they significantly compromise the integrity and stability of the track structure, potentially leading to rail displacement, gauge widening, and even catastrophic derailments. Therefore, conducting efficient and accurate periodic inspections of track fastener conditions constitutes a crucial link in ensuring railway operational safety.
Traditional fastener inspection relies primarily on manual patrols. This method is not only inefficient and labor-intensive but also struggles to guarantee accurate and consistent results, which are influenced by inspectors' subjective experience, physical condition, and environmental factors; it can no longer meet the demands of modern high-density, high-speed railway networks. To address these challenges, non-destructive testing (NDT) technologies have been developed for detecting defects in railway components, including methods based on acoustic and vibration signals as well as machine vision [3].
Early non-destructive testing methodologies were predominantly based on acoustic and vibration signal analysis [4]. For example, Zhang et al. [5] employed acoustic emission monitoring data to perform probabilistic assessment of railway damage, while Chellaswamy et al. [6] combined a dynamic differential algorithm with vibration signals acquired from accelerometers mounted on bogies and axle boxes to identify track defects. Although these approaches are effective for certain macroscopic damage, they exhibit substantial limitations in identifying "latent" failures, such as fastener loosening and minor misalignments that generate only weak vibrational and acoustic signatures, and thus fail to satisfy the rigorous requirements of modern inspection standards.
In recent years, with the rapid advancement of computer vision technology, machine vision-based inspection methods have emerged as a research focus due to their non-contact nature, high efficiency, and precision. The methodologies in this field can be broadly categorized into two evolutionary stages: traditional image processing-based approaches and deep learning-based techniques.
The first stage encompasses inspection methods grounded in traditional image processing. These approaches fundamentally rely on handcrafted feature extractors designed by researchers. Ma et al. [7] introduced a mechanism combining curve projection and template matching to achieve real-time detection of missing fasteners. Inspired by few-shot learning, Liu et al. [8] devised a novel visual inspection system to localize fastener regions. Furthermore, Gibert et al. [9] proposed a multi-task learning framework that combined Histogram of Oriented Gradients (HOG) features with a Support Vector Machine (SVM) classifier, thereby enhancing classification accuracy for damaged and missing fasteners. Wang et al. [10] employed background subtraction, an improved Canny operator, the Hough transform, and Local Binary Pattern and HOG features, with final classification performed by an SVM, demonstrating notable real-time performance and accuracy. Although these methods advanced automation to some extent, their performance depends heavily on the quality of handcrafted features. Under varying illumination, complex backgrounds, and diverse fastener types, they exhibit limited generalization and poor robustness. Moreover, their intricate preprocessing and feature extraction pipelines constrain further gains in detection efficiency.
The second phase is characterized by the adoption of deep learning-based detection methodologies. The emergence of deep learning, particularly Convolutional Neural Networks (CNNs), has enabled algorithms to autonomously learn hierarchical feature representations directly from data, thereby eliminating the dependency on manually engineered features. This paradigm shift has led to a qualitative leap in detection accuracy and robustness. Predominant object detection algorithms within this domain can be categorized into two-stage models and single-stage models.
The two-stage models, represented by the Region-based Convolutional Neural Network (R-CNN) series such as Fast R-CNN [11] and Faster R-CNN [12], first generate region proposals and then perform classification and regression on each proposed region, generally achieving superior detection accuracy. Wei et al. [13] applied Faster R-CNN to fastener detection, validating its strong classification performance. However, the multi-step detection pipeline inherent in these models results in high computational complexity and slower detection speeds, making it difficult to meet the real-time requirements of online railway inspection.
In pursuit of real-time performance, researchers have shifted their focus to single-stage detection models, with the You Only Look Once (YOLO) series gaining prominence for its balance between speed and accuracy. Substantial improvements and applications have been developed around the YOLO architecture. Qi et al. [14] designed an MYOLOv3-Tiny network based on YOLOv3, incorporating depthwise separable convolutions and reconstructing the backbone, which significantly improved detection speed while maintaining accuracy, making it well suited to embedded deployment. As the YOLO family evolved, enhancement efforts deepened. Guo et al. [15] optimized the YOLOv4 framework into a cost-effective real-time detection solution for railway track components, while Fu et al. [16] replaced YOLOv4's original backbone with the lightweight MobileNet, capturing subtle fastener features while reducing parameters and computation, thereby further accelerating detection.
In recent years, YOLOv5 and its subsequent variants have become predominant research platforms. To improve the discrimination of small fasteners against complex backgrounds, attention mechanisms and more efficient feature fusion networks have been introduced. Li et al. [17] embedded a Convolutional Block Attention Module (CBAM) into the neck of YOLOv5s to strengthen the extraction of critical features while suppressing irrelevant information, and incorporated a weighted Bidirectional Feature Pyramid Network (BiFPN) for superior multi-scale feature fusion. Wang et al. [18] took a more systematic approach: they integrated CBAM into YOLOv5's backbone, employed Ghost Shuffle Convolution (GSConv) modules in the neck to reduce complexity, and used BiFPN to enhance feature fusion. With a lightweight decoupled head added, their YOLOv5 with CBAM, GSConv, BiFPN, and Decoupled head (YOLOv5-CGBD) model surpassed the original across multiple evaluation metrics.
Furthermore, several studies have explored alternative technical pathways. Liu et al. [19] enhanced the Faster R-CNN framework by employing K-means clustering to automatically determine anchor box sizes, improving detection performance on imbalanced fastener samples. In an innovative approach, Zhan et al. [20] leveraged prior information to verify fastener locations in three-dimensional ballastless track images and then detected defects with their RailNet model, introducing a novel dimension to inspection methodology. Cha et al. [21] successfully applied Faster R-CNN to multi-class damage detection in bridges and buildings, demonstrating the broad applicability of deep learning in infrastructure visual inspection. Roy et al. [22] integrated Densely Connected Convolutional Networks (DenseNet) and Swin Transformer prediction heads into YOLOv5, proposing the DenseNet and Swin Transformer Prediction Head YOLOv5 (DenseSPH-YOLOv5) model for effective multi-scale road damage detection, a conceptual framework of significant reference value for fastener inspection.
However, current research and applications continue to face a prominent bottleneck: most existing approaches treat “fastener detection” as a single, homogeneous task, failing to adequately account for its inherent complexity and heterogeneity in real-world scenarios. This limitation manifests concretely at the following two levels:
Firstly, there exists a diversity in fastener types and heterogeneity in their defect manifestations. After decades of development in China’s railway system, multiple fastener types are in operation—ranging from traditional Type I and Type II elastic strips to Type III, WJ-7, and WJ-8 fasteners designed for high-speed and heavy-haul lines. Significant differences in structure, material, and fastening mechanisms among these variants lead to distinct defect patterns. A generic detection model optimized for Type II elastic strip fasteners often experiences severe performance degradation when confronted with structurally distinct WJ-8 fasteners. Current approaches for multi-type fastener detection typically involve training a single, all-encompassing model to recognize all types. Such models are compelled to learn generic features across all fastener types, thereby struggling to capture type-specific, subtle defect characteristics. Consequently, in complex mixed-type railway lines, the overall detection accuracy—particularly the recognition rate for specific defects on particular fastener types—remains unsatisfactory.
Secondly, the challenge lies in the co-occurrence of multiple defect types and the complexities of multi-label classification. A single fastener may simultaneously exhibit various defects; for instance, an elastic strip might present both “fracture” and “corrosion,” while a bolt could be “loose” and accompanied by “strip displacement.” This necessitates a detection system capable of multi-label classification. However, employing a single classifier to process all fastener types and structures leads to an excessively complex and high-dimensional feature space that the model must learn. The classifier is required to disentangle both “type-specific” characteristics and multiple “defect” features from highly diverse fastener images—an exceptionally difficult task that often results in insufficient model learning, ultimately causing misclassification and missed detections.
In summary, the core scientific challenge in the field of automated railway fastener inspection can be formulated as follows: determining how to construct an intelligent detection system capable of effectively accommodating the diversity of fastener types while achieving accurate and robust multi-label recognition of various defects.
To address the aforementioned challenges, this study proposes TGEM-FDD, a Type-Guided Expert Model-based Fastener Detection and Diagnosis framework built on YOLOv8. The core concept follows a "type identification first, defect diagnosis second" paradigm, with the specific contributions outlined as follows:
- (1)
Framework Design: We construct a novel two-stage detection framework that decouples the complex task into localization/type identification and specialized defect diagnosis, effectively overcoming the limitations of single-model approaches.
- (2)
Model Enhancement: We introduce three key enhancements to the first-stage detector—the Deepstar Block, Spatial Pyramid Pooling Fast with Attention (SPPF-Attention), and Dynamic Sample (DySample)—significantly boosting its feature representation, multi-scale fusion, and localization precision.
- (3)
Expert Model Pool: We innovatively adopt an expert model pool, where a dedicated multi-label classifier is trained for each fastener type, enabling specialized and accurate defect diagnosis.
- (4)
Dataset: We design and construct a multi-label fastener image dataset, providing crucial data support for fine-grained component recognition in complex industrial scenarios.
2. Materials and Methods
2.1. Dataset Construction
The image dataset of railway fasteners was acquired from an industrial railway line located in Hengshui City, Hebei Province, China, during November 2024. The collected fastener models are the most representative Type II clip fastener and WJ-8 fastener, which are the mainstream types in China’s conventional-speed railways and high-speed railways, respectively. Their samples can effectively reflect the typical conditions encountered in actual operation. To ensure dataset diversity and robustness, the collection procedure systematically incorporated real-world variations in illumination conditions, background complexity, and shooting distances.
Image acquisition was performed using a line-scan camera mounted on a track inspection vehicle, yielding 160 original images capturing two fastener types under various damage conditions. Each image covered a 12 m track segment with an original resolution of 512 × 9000 pixels.
Given the excessive aspect ratio of the original images—which proves suboptimal for effective model detection—we performed initial preprocessing through cropping and padding operations. This standardized each image to contain four fastener sets. After processing and quality screening, the final curated dataset comprised 1200 validated images containing 4800 individual fastener samples.
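As a rough illustration of this preprocessing step, the sketch below crops a tall line-scan strip into fixed-height windows and pads each window to a square canvas; the window height, output size, and padding scheme are our assumptions for illustration, not the exact settings used in this study.

```python
import numpy as np
from PIL import Image

def preprocess_strip(path, window=1800, out_size=640):
    """Crop a 512 x 9000 line-scan strip into fixed-height windows, pad each
    window to a square canvas, and resize. All sizes here are illustrative."""
    strip = np.asarray(Image.open(path))                 # (9000, 512) or (9000, 512, 3)
    crops = []
    for top in range(0, strip.shape[0] - window + 1, window):
        seg = strip[top:top + window]
        canvas = np.zeros((window, window) + seg.shape[2:], dtype=seg.dtype)
        left = (window - seg.shape[1]) // 2              # center the narrow axis
        canvas[:, left:left + seg.shape[1]] = seg        # zero-pad to square
        crops.append(Image.fromarray(canvas).resize((out_size, out_size)))
    return crops
```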
Table 1 summarizes the characteristics of the constructed dataset, providing comprehensive statistics on sample distribution across fastener types and detailed occurrence frequencies of six common defect categories.
Figure 1 presents representative examples of typical defects observed across different fastener types.
2.2. Overall Architecture
The overall architecture of the proposed TGEM-FDD framework is illustrated in Figure 2. The fundamental motivation for this decoupled design stems from the inherent challenges posed by the heterogeneous nature of diverse fastener types and the multi-label characteristic of their defects, which together force a single model to learn an excessively complex and conflicting feature representation. Our core concept therefore follows a "type identification first, defect diagnosis second" paradigm, decomposing the complex multi-type, multi-label detection task into two relatively independent, focused sub-tasks. This design effectively overcomes the performance limitations of single-model approaches in such heterogeneous scenarios.
Stage I: Fastener Localization and Type Identification. This stage takes raw track images as the input, with the primary objective of accurately localizing all fasteners within the image and identifying the specific type of each fastener. In this study, an enhanced high-performance detector (YOLOv8s with Deepstar, SPPF-attention, and DySample (YOLOv8s-DSD)) is constructed based on the YOLOv8 architecture, incorporating modules such as Deepstar Block, Spatial Pyramid Pooling Fast with Attention (SPPF-Attention), and Dynamic Sample (DySample). The output of this stage consists of the bounding box coordinates and the type classification label for each fastener.
Stage II: Multi-Label Defect Diagnosis. This stage executes the type-guided mechanism for defect analysis. Initially, based on the fastener type identified in the first stage, the system dynamically retrieves the corresponding specialized classification model from a pre-constructed expert model pool. Subsequently, the cropped fastener region images from the first stage are fed into this expert model for refined analysis. Each expert model comprises a targeted multi-label classifier, generating independent confidence scores for potential multiple defects present in the specific fastener type.
Finally, a result fusion module integrates the localization and type information from the first phase with the defect diagnosis results from the second phase, yielding comprehensive detection results with complete informational integrity.
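The following condensed sketch illustrates this two-stage inference flow, assuming Ultralytics-style APIs; the weight file names, the type keys, and the sigmoid-output convention of the modified expert classifiers are hypothetical placeholders rather than this paper's released artifacts.

```python
from ultralytics import YOLO

detector = YOLO("yolov8s_dsd.pt")                  # Stage I: localization + type identification
expert_pool = {                                    # Stage II: one multi-label expert per type
    "type_II": YOLO("expert_type2.pt"),
    "wj8": YOLO("expert_wj8.pt"),
}

def detect_and_diagnose(image, conf_thr=0.5):
    """Run Stage I, dispatch each crop to its type's expert, and fuse results."""
    fused = []
    det = detector(image)[0]
    for box in det.boxes:
        x1, y1, x2, y2 = map(int, box.xyxy[0])
        f_type = det.names[int(box.cls)]
        crop = image[y1:y2, x1:x2]                 # cropped fastener region
        probs = expert_pool[f_type](crop)[0].probs.data  # per-defect scores (sigmoid head)
        defects = [i for i, p in enumerate(probs) if float(p) > conf_thr]
        fused.append({"box": (x1, y1, x2, y2), "type": f_type, "defects": defects})
    return fused
```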
2.3. YOLOv8s-DSD Model
As the foundational and prerequisite component of the proposed two-stage detection framework, the performance of the first stage directly determines the upper limit of the entire system. The core objective of this stage is to accurately localize all fasteners within the image and correctly identify their specific types. The output of this stage provides a reliable control signal for the type-guided mechanism in the second phase.
The YOLOv8 [23] architecture is offered in five variants (n, s, m, l, and x) for different application scenarios, with sequentially increasing network depth and width that raise detection accuracy at the cost of computational complexity. Among these, YOLOv8s achieves an optimal balance between accuracy and efficiency: its precision and mean average precision (mAP) significantly surpass those of YOLOv8n, while its model size, computational load, and parameter count remain substantially lower than those of YOLOv8m. Given the stringent requirements for real-time performance and deployment on resource-constrained edge devices in railway fastener detection, this study selects YOLOv8s as the baseline model for enhancement.
To improve the model's sensitivity to small fastener targets and subtle inter-type differences, several modifications are implemented: the Cross Stage Partial network with 2 convolutions (C2f) module in the backbone is replaced with the Deepstar Block to enhance feature selection and semantic extraction; an attention mechanism is integrated into the Spatial Pyramid Pooling Fast (SPPF) module in the neck; and the original interpolation-based upsampling is substituted with the DySample dynamic upsampling operator. The resulting model, YOLOv8s with Deepstar, SPPF-attention, and DySample (YOLOv8s-DSD), is structured as illustrated in Figure 3.
- (1)
Deepstar Block Feature Extraction Module
In the YOLOv8 algorithm, the C2f module is responsible for feature extraction [24]. It consists of convolutional layers, split operations, and bottleneck modules, incorporating multiple gradient branches and cross-stage feature connections to capture more effective information. However, the extensive convolutional layers and gradient branches demand more computation and memory, leading to inefficient feature extraction. To address this limitation, a novel feature extraction module named the Deepstar Block is constructed, integrating multi-branch modeling, residual connections, star operations [25], and regularization strategies. Its structure is illustrated in Figure 4.
The Deepstar Block adopts a dual-branch design, using depthwise separable convolutions in place of standard convolutions to reduce computational load. In the first branch, initial features are formed through a convolutional layer, followed by a residual structure containing two parallel convolutional layers (without BatchNorm), after which a star operation is applied; two sequential convolutional layers and a Dropout operation follow. The second branch begins with a convolutional layer for channel compression, followed by two bottleneck layers that extract deep semantic features. The outputs of both branches are concatenated, and a final convolutional layer controls the number of output channels.
In the first branch, the star operation enables information interaction between two feature maps of identical shape; combined with residual connections, it preserves information integrity while enhancing gradient stability, and the Dropout operation suppresses overfitting and improves generalization. Defined as elementwise multiplication, the star operation non-linearly combines different channels of the input features, creating a large number of interaction terms and significantly expanding the effective dimensionality of the feature space. Through the star operation and residual fusion, this branch finely adjusts feature distributions, strengthens the expression of key information, and enhances feature diversity, thereby improving the model's sensitivity to small targets and its ability to model complex semantic relationships.
The second branch, composed of convolutional and Bottleneck layers, is primarily responsible for extracting high-level semantic information and constructing stable feature representations. It reinforces the depth of information flow, enabling the extraction of richer features while maintaining gradient stability.
The Deepstar Block achieves a structurally stable and efficient design: one branch emphasizes dynamic regulation and detail refinement, exhibiting sensitivity to subtle features of small targets; the other focuses on global modeling and deep feature extraction, with stable gradients and strong representational capacity. By employing depthwise separable convolutions, the module reduces computational overheads while maintaining structural depth. Thus, the Deepstar Block module realizes an organic integration of structural stability, rich expressiveness, and computational efficiency, accomplishing a feature representation strategy that balances stability and sensitivity while complementing semantic information with local details.
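The textual description above can be condensed into the following PyTorch sketch; the channel split, kernel sizes, bottleneck stand-ins, and dropout rate are assumptions filled in where the text leaves them open, so this should be read as a structural sketch rather than the exact implementation.

```python
import torch
import torch.nn as nn

def dwconv(c_in, c_out, k=3):
    """Depthwise separable convolution: depthwise k x k, then pointwise 1 x 1."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False),
        nn.Conv2d(c_in, c_out, 1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class Bottleneck(nn.Module):
    """Residual bottleneck stand-in for the second branch's deep layers."""
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(dwconv(c, c), dwconv(c, c))
    def forward(self, x):
        return x + self.block(x)

class DeepstarBlock(nn.Module):
    def __init__(self, c_in, c_out, p_drop=0.1):
        super().__init__()
        c = c_out // 2
        # Branch 1: initial features -> parallel convs (no BN) -> star -> residual
        self.b1_in = dwconv(c_in, c)
        self.f = nn.Conv2d(c, c, 3, padding=1)
        self.g = nn.Conv2d(c, c, 3, padding=1)
        self.b1_out = nn.Sequential(dwconv(c, c), dwconv(c, c), nn.Dropout2d(p_drop))
        # Branch 2: channel compression, then two bottleneck layers
        self.b2 = nn.Sequential(dwconv(c_in, c), Bottleneck(c), Bottleneck(c))
        # Final conv controls the number of output channels
        self.fuse = nn.Conv2d(2 * c, c_out, 1)

    def forward(self, x):
        h = self.b1_in(x)
        star = self.f(h) * self.g(h)          # star op: elementwise product of two maps
        b1 = self.b1_out(star + h)            # residual connection keeps information intact
        b2 = self.b2(x)
        return self.fuse(torch.cat([b1, b2], dim=1))
```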
- (2)
Spatial Pyramid Pooling Fast with Attention (SPPF-attention)
To address the issue of weak focus on key regions in fastener detection, this study improves the SPPF module [26]. SPPF is a spatial pyramid pooling structure designed for feature extraction and fusion; however, when processing fastener features, it struggles to concentrate on critical information, limiting its discriminative capability for fastener types. To overcome this limitation, an attention mechanism is introduced into the SPPF module, yielding the proposed SPPF-attention module [27].
The structure of SPPF-attention is illustrated in Figure 5. The module embeds a spatial attention mechanism before the final convolutional layer, generating a spatial attention weight map over the spatial dimensions of the feature map to highlight key regions. Specifically, the input feature map first undergoes global average pooling and global max pooling along the channel dimension, producing two feature descriptors of size $H \times W \times 1$. These two descriptors are concatenated along the channel dimension and passed through a convolutional layer for feature fusion and dimensionality reduction. A Sigmoid activation function is subsequently applied to generate spatial attention weights ranging from 0 to 1. Finally, these weights are multiplied elementwise with the original feature map, enhancing the feature response in important regions while suppressing irrelevant areas. Through this attention mechanism, the network automatically focuses on critical fastener regions, extracting more discriminative features and effectively improving the recognition of defect characteristics.
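A minimal PyTorch sketch of this spatial attention step follows; the 7 x 7 fusion kernel is an assumption (a common CBAM-style default), since the kernel size is not specified above.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention inserted before SPPF's final convolution, as described
    in the text. The 7 x 7 fusion kernel is an assumed default."""
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2, bias=False)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)      # channel-wise average pool: B x 1 x H x W
        mx = x.amax(dim=1, keepdim=True)       # channel-wise max pool
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # weights in (0, 1)
        return x * w                           # reweight: emphasize key fastener regions
```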
- (3)
DySample Upsampler
The original YOLOv8 employs nearest-neighbor interpolation for upsampling, which simply replicates neighboring pixel values to fill newly generated locations. This approach lacks smoothness, tends to lose fine details and edge information, and can introduce positional deviations of feature points. In this study, DySample [28] is adopted as the upsampler. DySample is a lightweight, efficient, content-aware dynamic upsampling method that adaptively adjusts sampling positions based on local image characteristics to more accurately reconstruct the locations and details of feature points.
Compared to kernel-based dynamic upsamplers such as Content-Aware ReAssembly of Features (CARAFE) [29], Feature Agnostic Dynamic Upsampling (FADE) [30], and Spatially Adaptive Pooling (SAPA) [31], which incur significant overheads from time-consuming dynamic convolutions and additional sub-networks for generating dynamic kernels, DySample avoids dynamic convolution by formulating upsampling from a point-sampling perspective. This results in fewer parameters, fewer floating-point operations, lower GPU memory usage, and reduced latency, making it better suited to edge deployment.
The design principle of DySample is illustrated in Figure 6. In Figure 6a, the input feature $\mathcal{X}$ is resampled using a sampling set $\mathcal{S}$ generated by a sampling point generator. Figure 6b depicts the structure of the sampling point generator, which consists of two components: the generated offset $\mathcal{O}$ and the original sampling grid $\mathcal{G}$. The offset generation mechanism combines a linear transformation with pixel shuffling. Taking the static factor sampling method shown in the upper part of Figure 6b as an example, given an upsampling factor $s$ and a feature map of size $C \times H \times W$, the feature map is passed through a linear layer with input and output channels of $C$ and $2s^2$, respectively, to generate an offset of size $2s^2 \times H \times W$. This offset is then transformed into a $2 \times sH \times sW$ representation via pixel shuffling, where 2 denotes the $x$ and $y$ coordinates. Finally, an upsampled feature map of size $C \times sH \times sW$ is generated.
By introducing a neighboring pixel sampling mechanism, DySample effectively enhances sample diversity, thereby mitigating class imbalance issues and suppressing false predictions.
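A simplified sketch of this static-factor variant is given below; the offset scaling constant and the grid normalization are approximations of the published design rather than an exact reimplementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleLP(nn.Module):
    """Sketch of DySample's static-factor (linear + pixel shuffle) variant:
    a 1x1 conv generates 2*s^2 offset channels, pixel shuffle rearranges them
    to per-pixel x/y offsets, and grid_sample resamples the input feature."""
    def __init__(self, c, scale=2, offset_range=0.25):
        super().__init__()
        self.scale = scale
        self.offset = nn.Conv2d(c, 2 * scale * scale, 1)    # C -> 2s^2 channels
        self.range = offset_range                            # assumed static scope factor

    def forward(self, x):
        b, c, h, w = x.shape
        o = self.offset(x) * self.range                      # (B, 2s^2, H, W)
        o = F.pixel_shuffle(o, self.scale)                   # (B, 2, sH, sW): x/y offsets
        # Base sampling grid G in normalized [-1, 1] coordinates
        ys = torch.linspace(-1, 1, self.scale * h, device=x.device)
        xs = torch.linspace(-1, 1, self.scale * w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack([gx, gy], dim=-1).expand(b, -1, -1, -1)
        # Convert pixel-unit offsets to normalized units, then S = G + O
        norm = torch.tensor([2.0 / w, 2.0 / h], device=x.device)
        grid = grid + o.permute(0, 2, 3, 1) * norm
        return F.grid_sample(x, grid, mode="bilinear", align_corners=True)
```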
2.4. YOLOv8-Cls-Based Multi-Label Classification Network
This stage serves as the implementation of the “type-guided” mechanism, aiming to perform detailed analysis on the fastener region images localized and cropped in the first stage, while simultaneously outputting multiple potential defect labels for each fastener. To achieve this objective, we developed an “expert model pool” with multi-label classification capability by modifying the YOLOv8 pre-trained classification model through several key adaptations.
The YOLOv8-Cls model was selected as the foundational backbone for each expert model. Pre-trained on large-scale datasets such as ImageNet, this model possesses strong general feature extraction capabilities, providing an excellent starting point for subsequent defect feature learning.
Since the standard YOLOv8-Cls model is designed for single-label classification (where each image belongs to only one category), while fastener defects (such as concurrent “Fractured spring clip” and “Displaced spring clip”) often coexist, the following adaptations were implemented:
- (1)
Detection Head Reconstruction
The original single-label classification head at the network's output (using a Softmax activation function) has been replaced with a multi-label classification head. The reconstructed head employs a Sigmoid activation function as its output layer, with channel dimensionality matching the number of defect categories. Specifically, for a task containing K defect types, the model outputs a K-dimensional vector $\hat{y} = (\hat{y}_1, \dots, \hat{y}_K)$, where each element $\hat{y}_k$ represents the independent probability that the fastener contains the kth defect type. By setting an appropriate threshold (typically 0.5), the coexistence of multiple defects can be effectively determined.
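A minimal sketch of the reconstructed head, assuming a hypothetical pooled feature width of 1280 from the YOLOv8-Cls backbone:

```python
import torch
import torch.nn as nn

K = 4                                     # number of defect types (illustrative)
head = nn.Linear(1280, K)                 # replaces the Softmax classification head

features = torch.randn(1, 1280)           # pooled backbone features (placeholder)
probs = torch.sigmoid(head(features))     # independent per-defect probabilities
pred = (probs > 0.5).int()                # threshold at 0.5: multiple 1s may coexist
```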
- (2)
Loss Function Adaptation
To accommodate the multi-label classification task, the original cross-entropy loss has been replaced with binary cross-entropy loss. This loss function decomposes the multi-label classification task into K independent binary classification problems, computing the loss between the predicted and ground-truth value for each label separately and then averaging these losses:

$$L_{\mathrm{BCE}} = -\frac{1}{K} \sum_{k=1}^{K} \left[ y_k \log \hat{y}_k + (1 - y_k) \log (1 - \hat{y}_k) \right]$$

Here, $y_k$ is the true value of the kth defect label (1 indicates presence, 0 indicates absence), and $\hat{y}_k$ is the Sigmoid probability predicted by the model for the kth defect. This loss function effectively optimizes both positive labels (i.e., existing defects) and negative labels (i.e., non-existing defects) simultaneously.
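In PyTorch this corresponds to binary cross-entropy over the K labels; `BCEWithLogitsLoss` folds the Sigmoid into the loss for numerical stability, a standard substitution shown here as a sketch with illustrative values.

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()                 # Sigmoid + BCE, averaged over labels

logits = torch.tensor([[2.1, -1.3, 0.4, -2.7]])    # raw head outputs for K = 4 defects
target = torch.tensor([[1.0, 0.0, 1.0, 0.0]])      # e.g., fracture and displacement present
loss = criterion(logits, target)                   # optimizes positive and negative labels jointly
```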
- (3)
Specialized Training
We independently train a modified YOLOv8-Cls (Multi-label) model for each primary fastener type (e.g., Type II Clip Fastener, WJ-8). Each model is trained exclusively using images of the corresponding fastener type along with their multi-label defect data. This “divide and conquer” strategy enables each expert model to deeply focus on learning the unique defect patterns and characteristics specific to its assigned fastener structure, thereby transforming it into a specialized “diagnostic expert” for that particular fastener type.
- (4)
Type-Guided Inference
During the inference phase, the fastener type (class_type) identified in the first stage serves as a control signal to dynamically retrieve the corresponding pre-trained model from the “expert model pool.” The selected expert model then analyzes the cropped fastener region images from the first stage and outputs a multi-label defect vector. Finally, a result fusion module integrates the localization, type, and defect information to generate comprehensive detection results.
3. Experiments and Results Analysis
3.1. Evaluation Metrics
This study employs standard evaluation metrics from the field of object detection. The core detection performance indicators include: mean average precision at an IoU threshold of 0.5 (mAP@0.5), average mAP over IoU thresholds from 0.5 to 0.95 (mAP@0.5:0.95), Precision (P), and Recall (R). Additionally, the number of parameters and Giga Floating Point Operations (GFLOPs) are selected to assess model complexity, ensuring practical deployment feasibility while maintaining performance.
To comprehensively evaluate the end-to-end performance of the two-stage framework, this paper defines “Task mAP” as a comprehensive evaluation metric. This metric requires the detection results to simultaneously satisfy three conditions:
- (1)
Accurate Localization: IoU between the detected bounding box and the ground truth box ≥ 0.5;
- (2)
Correct Type Identification: Accurate classification of the fastener type;
- (3)
Correct Defect Diagnosis: Accurate prediction of all defect labels (confidence threshold of 0.5).
The calculation method is as follows: the average precision (AP) is computed separately for each “type-defect” combination, and then the AP values across all combinations are averaged to obtain the final Task mAP@0.5.
This metric is intentionally stringent: a prediction is considered correct only if all defect labels are exactly matched, and partial success (e.g., identifying only two of three true defects) is treated as a failure. This design drives the system toward comprehensive, faultless detection and rigorously reflects its combined capability in localization, type identification, and multi-label defect diagnosis, in line with the high-reliability requirements of practical railway inspection. To complement this strict overall measure, the Macro-F1 score is also reported to provide a more nuanced assessment of the defect recognition module's performance across all categories.
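The three conditions can be expressed as a single true-positive predicate; the sketch below assumes hypothetical dictionary-based prediction and ground-truth records.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def is_task_true_positive(pred, gt, iou_thr=0.5, conf_thr=0.5):
    """True positive for Task mAP: accurate box (IoU >= 0.5), correct type,
    and an exact match of the full defect set (partial matches fail)."""
    loc_ok = iou(pred["box"], gt["box"]) >= iou_thr
    type_ok = pred["type"] == gt["type"]
    pred_defects = {k for k, p in pred["defect_probs"].items() if p >= conf_thr}
    return loc_ok and type_ok and pred_defects == set(gt["defects"])
```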
3.2. Experimental Environment and Implementation Details
The experiments were conducted in a software environment based on Windows 10, CUDA 12.1, and PyTorch 2.4.1. The hardware configuration consisted of an Intel Core i7-12700K CPU and an NVIDIA RTX 3090 GPU.
For the first-stage model training, the core hyperparameters were set as follows: a total number of training epochs of 200, an early stopping patience of 50 epochs to prevent overfitting, a batch size of 16, an initial learning rate of 0.01, and a weight decay coefficient of 0.0005.
In the second stage, each expert model was trained on its corresponding fastener-type subset using a strategy similar to the first stage: 100 training epochs, an early stopping patience of 50 epochs, a batch size of 16, an initial learning rate of 0.01, and a weight decay coefficient of 0.0005. Through independent training processes, specialized classification models were obtained for each fastener type.
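For reference, these settings map onto Ultralytics-style training calls roughly as follows; the model config and dataset names are placeholders, and the custom YOLOv8s-DSD definition is assumed to be registered separately.

```python
from ultralytics import YOLO

# Stage I detector (Section 3.2 hyperparameters)
detector = YOLO("yolov8s_dsd.yaml")            # hypothetical model definition
detector.train(data="fasteners_det.yaml", epochs=200, patience=50,
               batch=16, lr0=0.01, weight_decay=0.0005)

# Stage II experts: one classification run per fastener-type subset
for subset in ["type2_defects/", "wj8_defects/"]:
    expert = YOLO("yolov8s-cls.pt")            # ImageNet-pretrained starting point
    expert.train(data=subset, epochs=100, patience=50,
                 batch=16, lr0=0.01, weight_decay=0.0005)
```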
Figure 7 illustrates the mAP progression curves during training for both YOLOv8s-DSD and the baseline model.
3.3. Module Effectiveness Analysis
- (1)
Placement Effectiveness Experiment of the Deepstar Block
To determine the optimal placement of the Deepstar Block, we systematically replaced the C2f modules in different network components while keeping all other conditions identical. Specifically, the Deepstar Block was substituted into the backbone network, the neck network, and both simultaneously. The experimental results are summarized in Table 2.
As shown in Table 2, the Deepstar Block exhibits varying effects in different network components. Its integration into the neck yields limited performance improvement while significantly increasing both parameters and computational costs, so we decided against replacing the neck's C2f modules. Compared with the neck or combined backbone-neck configurations, replacing only the backbone's C2f modules achieves substantial accuracy gains with only marginal increases in parameters and computation. This aligns with the module's design purpose: in the backbone, the dual-branch structure collaboratively models global semantic features (e.g., overall fastener shape and type prototype) and fine-grained local details (e.g., edges and micro-defects), providing a richer and more discriminative feature foundation for subsequent detection. Consequently, applying the Deepstar Block exclusively to the backbone delivers the optimal accuracy enhancement.
- (2)
Experimental Validation of SPPF-Attention Module
This investigation systematically optimizes YOLOv8's feature extraction architecture, with particular emphasis on augmenting the multi-scale feature fusion capability of the SPPF module. To rigorously evaluate the efficacy of the proposed SPPF-Attention mechanism in dense railway fastener detection, we conducted comprehensive comparisons against prevailing enhancement methods (Table 3). Integrating an efficient attention mechanism with hierarchical pooling operations yields a composite SPPF-Attention structure that achieves a 1.9% improvement in detection precision while maintaining parametric efficiency.
The implemented methodology employs feature selection mechanisms to optimize fusion weights across scale-specific representations, substantially improving small-scale fastener discriminability with minimal parametric overheads. Under challenging operational scenarios characterized by complex track environments and high fastener density, the proposed architecture demonstrates significant improvements in localization accuracy while effectively suppressing both false positive and false negative rates. Comparative assessment reveals that while dilated convolution-based Atrous Spatial Pyramid Pooling (ASPP) modules achieve receptive field expansion, they exhibit susceptibility to feature confusion among morphologically similar fastener instances. The lightweight Simplified Spatial Pyramid Pooling Fast (SimSPPF) variant, though computationally efficient, demonstrates inadequate characterization of critical fastener minutiae. The Spatial Pyramid Pooling Cross Stage Partial Channel (SPPCSPC) architecture introduces substantial computational redundancy through multi-branch pooling operations, resulting in suboptimal operational efficiency.
Our framework innovatively incorporates structural prior learning through attention reweighting, enabling selective emphasis on mechanically salient fastener attributes over superficial textural patterns. This design is driven by the task objective of fastener localization and type identification: during training, the network is intrinsically guided to allocate attention weights to the features most relevant to this task. Although direct visualization of attention maps is not presented in this study, the significant and consistent improvement in detection accuracy (e.g., +1.9% mAP@0.5, as shown in Table 3) provides compelling empirical evidence that the attention mechanism learns to suppress irrelevant, variable background noise (e.g., randomly shaped ballast) and to enhance stable structural features of fasteners, rather than being misled by high-contrast but semantically irrelevant textures. This architectural refinement significantly enhances inter-class discriminability among heterogeneous fastener types. Empirical validation confirms that the optimized backbone maintains real-time inference while exhibiting superior robustness and generalization in complex railway operational environments.
3.4. Ablation Studies
To validate the effectiveness of the proposed Deepstar Block, SPPF-attention module, and DySample upsampler, we conducted comprehensive ablation experiments using YOLOv8s as the baseline model. The evaluation metrics included precision, recall, mAP@0.5, model size, GFLOPs, and parameter count. Eight module combinations were systematically tested, with the results on the test set summarized in Table 4.
The experiments first evaluated the individual contributions of each improved module through ablation studies. Compared to the baseline model (YOLOv8s), the standalone introduction of the Deepstar Block demonstrated significant lightweight advantages while improving detection accuracy: mAP@0.5 increased from 95.7% to 97.5% (+1.8%), mAP@0.5:0.95 improved from 77.2% to 78.7% (+1.5%), while GFLOPs decreased from 28.9 to 26.1 and parameters reduced from 11.2 M to 10.5 M. This module enhances feature representation capability while reducing computational resource consumption, achieving the improvement goal of being “both stronger and smaller.”
The independent incorporation of the SPPF-Attention module yielded the most substantial enhancement in comprehensive detection capability, achieving 97.6% mAP@0.5 (+1.9%) and 79.1% mAP@0.5:0.95 (+1.9%), though it introduced noticeable computational overheads, with GFLOPs increasing to 33.9 and parameters growing to 12.5 M. This validates the effectiveness of attention-enhanced multi-scale feature fusion for fastener detection tasks, while acknowledging the attendant computational cost.
The standalone implementation of the DySample upsampling operator achieved a stable improvement in mAP@0.5 from 95.7% to 96.2% (+0.5%) without increasing parameter count, verifying the effectiveness of its dynamic high-frequency detail preservation strategy.
Module combination experiments further revealed synergistic effects among components. This synergy can be attributed to the complementary roles of each module: the Deepstar Block serves as a powerful backbone, providing a rich and stable foundation of features; the SPPF-Attention module then operates on this enhanced foundation, performing more precise multi-scale feature fusion by adaptively focusing on salient regions across scales. Consequently, the Deepstar Block and SPPF-Attention combination achieved 98.1% mAP@0.5 and 78.9% mAP@0.5:0.95 with 31.3 GFLOPs and 11.8 M parameters, striking an optimal balance between computational efficiency and detection accuracy. Similarly, the integration of DySample with the Deepstar Block ensures that the high-quality features learned by the backbone are preserved and refined during upsampling, mitigating detail loss. This explains how the Deepstar Block with DySample combination achieved 97.8% mAP@0.5 with only 27.0 GFLOPs, constituting a “high-performance, low-complexity” solution with substantial potential for edge deployment.
The final model integrating all three modules achieved optimal performance: 98.5% mAP@0.5 (+2.8%) and 79.5% mAP@0.5:0.95 (+2.3%), with 11.8 M parameters and 31.9 GFLOPs. Compared to the baseline, it achieved significant detection accuracy improvement while maintaining an essentially equivalent parameter count, fully demonstrating the effectiveness of the proposed enhancement scheme.
These ablation studies demonstrate that the proposed improvement strategy effectively optimizes a lightweight baseline model (YOLOv8s) into a high-performance, efficiency-conscious detector (YOLOv8s-DSD). This not only validates the effectiveness of individual modules but also highlights the substantial application value of our solution in resource-constrained edge computing scenarios.
3.5. Performance Comparison of Different Detection Models
To objectively evaluate the performance advantages of the proposed method, comparative experiments were conducted between the improved model and current mainstream object detection models: Faster R-CNN, SSD, YOLOv5s, YOLOv7-Tiny, and YOLOv8s. Detailed comparative results are presented in Table 5.
As shown in Table 5, the proposed YOLOv8s-DSD model demonstrates significant advantages in detection accuracy. It achieves an mAP@0.5 of 98.5%, surpassing Faster R-CNN, SSD, YOLOv5s, YOLOv7-Tiny, and YOLOv8s by 22.9, 21.6, 10.2, 10.0, and 2.8 percentage points, respectively. Its precision reaches 95.0%, improvements of 22.7, 21.3, 10.5, 10.1, and 2.9 percentage points over the comparison models, and its recall stands at 97.1%, 4.2 to 27.0 percentage points higher than the other models.
Regarding model complexity, the proposed model maintains excellent performance while demonstrating notable efficiency advantages. Compared with the two-stage detector Faster R-CNN, its parameter count is only 8.6% as large and its computational load less than one-tenth. Even against lightweight models in the same series, it exhibits a favorable balance: relative to the YOLOv8s baseline, the parameter count increases by only 5.4% and computational load by 10.4%, while the key metric mAP@0.5:0.95 improves by 2.3 percentage points, a significant performance gain at a relatively small computational cost.
In summary, the experimental results indicate that the proposed improvement scheme effectively enhances the accuracy of fastener detection while maintaining well-controlled model complexity. It achieves an optimal balance between precision and efficiency, fulfilling the dual requirements of detection accuracy and real-time performance in practical railway maintenance scenarios.
3.6. Comparative Experiment of the Overall Two-Stage Framework
To validate the comprehensive advantages of the proposed “type-guided expert model” two-stage framework in addressing the multi-task detection problem for fasteners, this section systematically compares it with various advanced single-stage end-to-end detection methods. All comparative methods were trained and tested on the same multi-label fastener dataset to ensure experimental fairness.
- (1)
Configuration of Comparative Methods.
This experiment establishes the following three categories of comparative approaches:
Category 1: Single-model Unified Detection Method.
YOLOv8s (Category Combination): This approach simplifies the task into a single multi-category detection problem, where the output categories represent all possible combinations of fastener types and defects.
Category 2: Multi-task Separate Head Detection Method.
YOLOv8s (Separate Head): This method modifies the detection head structure to simultaneously output bounding boxes, fastener type categories, and multiple independent defect confidence scores.
RT-DETR-R50 (Separate Head): A Transformer-based detector built on the RT-DETR architecture, adapted for multi-task output.
Category 3: Proposed Method.
TGEM-FDD Framework: The complete Type-Guided Expert Model Framework.
- (2)
Results and Analysis
The comparative experimental results presented in Table 6 lead to the following conclusions:
The single-model unified detection approach demonstrates the poorest performance, achieving a comprehensive task mAP of only 65.2%. This indicates that forcibly merging multi-type and multi-defect tasks into a single multi-category detection problem leads to significant performance degradation, as the model struggles to learn complex category combination relationships. The multi-task separate head method shows some improvement, with RT-DETR-R50 (separate head) reaching 76.5%, proving that the separate head design can partially alleviate multi-task learning conflicts. However, the single model still needs to simultaneously learn localization, type classification, and multi-defect recognition, exhibiting clear performance bottlenecks. The proposed TGEM-FDD framework achieves optimal performance with a comprehensive task mAP of 88.1%, significantly outperforming the previous two categories of methods, thus validating the effectiveness of the two-stage design and expert model strategy.
In terms of type recognition accuracy, the proposed method reaches 89.0%, maintaining a stable advantage compared to other approaches. For defect recognition macro-average F1-score, our method achieves 86.5%, representing a 7.2 percentage point improvement over RT-DETR-R50 (separate head). This demonstrates that the two-stage framework, through task decoupling, enables each component to focus on specific subtasks, thereby achieving more professional performance in the defect recognition phase.
Regarding parameter count, the proposed framework (13.5 M) is substantially lower than RT-DETR-R50 (19.8 M), reflecting better parameter efficiency. In inference speed, our method (121 FPS) maintains real-time performance while achieving the optimal accuracy-speed balance.
The experimental results clearly illustrate the capability differences among the three approaches in addressing complex multi-task detection problems. The single-model unified detection method performs poorly due to high task complexity; the multi-task separate head method achieves some improvement through structural optimization but remains limited by the representation capacity of a single model; the proposed two-stage framework achieves optimal performance through type-guided task decomposition and specialized diagnosis by expert models. The overall two-stage framework comparison experiment proves that the proposed TGEM-FDD framework significantly outperforms both single-model unified detection and multi-task separate head solutions in end-to-end fastener detection tasks, validating the advanced nature and practical utility of the two-stage design in complex industrial vision applications.
3.7. Visualization and Analysis of Results
To intuitively demonstrate the performance of the proposed TGEM-FDD framework in practical applications, we selected representative scenarios from the test set for visual analysis. Figure 8 presents the detection results of our method under various complex working conditions. Subfigures (a) and (b) demonstrate the framework's precise detection capability under normal lighting, successfully identifying fastener types and diagnosing multiple defects. Subfigure (c), however, reveals instances of missed and false detections. Preliminary analysis indicates that these errors primarily stem from severe lighting variations (such as overexposure and extreme shadows), which significantly alter the apparent characteristics of the fasteners. This suggests that although the proposed method performs robustly under standard conditions, its generalization capability and robustness against intense, non-uniform lighting require further enhancement; subsequent work will focus on improving the model's adaptability to such challenging environmental factors.