1. Introduction
With the advancement of the manufacturing industry, high-precision linear guides have become essential components for linear motion control, and they are widely used in machine tools, industrial robots, and semiconductor manufacturing equipment. However, during the production and handling of high-precision linear guides, surface defects such as scratches and dents may occur due to improper human operations or external environmental factors. The surface integrity of high-precision linear guides is critical to their functional performance [
1]. Surface defects can directly degrade the positioning accuracy of guide systems, resulting in reduced motion stability [
2]. Moreover, such defects accelerate surface wear, promote component aging, and ultimately shorten the service life of the entire mechanical system [
3].
At present, surface defect inspection of linear guides still largely relies on manual examination combined with specialized equipment and operator experience. This inspection process is inefficient and highly dependent on subjective human judgment, which can lead to inconsistent classification results [
4]. In this context, machine vision-based inspection has attracted increasing attention. Existing research on machine vision for defect detection can be broadly categorized into traditional image processing methods and deep learning-based approaches. Traditional methods rely on handcrafted rules and predefined algorithms, such as analyzing defect size, shape, color characteristics, and grayscale gradients, followed by edge detection, threshold segmentation, and morphological operations. These methods can achieve high measurement accuracy under controlled conditions [
5,
6,
7]; however, they are sensitive to environmental variations, computationally inefficient, and limited when it comes to capturing spatial and contextual information. In contrast, deep learning-based defect detection methods, which currently dominate this research field, employ convolutional neural networks to automatically extract discriminative defect features and perform classification, enabling efficient and robust automated defect detection [
4].
The YOLO-based instance segmentation models proposed in recent years adopt a one-stage paradigm inspired by the YOLACT framework. These models typically employ a backbone network, together with a Feature Pyramid Network (FPN), to extract and fuse multi-scale features from input images [
8,
9]. The detection branch is responsible for predicting object categories and bounding boxes, while the segmentation branch generates k prototype masks, along with corresponding mask coefficients. Both detection and segmentation tasks are performed in parallel. In the segmentation branch, high-resolution feature maps are processed through a series of convolutional layers to produce prototype masks. The final instance segmentation results are obtained by linearly combining the prototype masks with their corresponding mask coefficients [
10,
11].
YOLO-based models have been widely applied across various industrial defect inspection scenarios. For example, Xie et al. developed a lightweight multi-scale feature fusion model, LMS-YOLO, for efficient steel surface defect detection, demonstrating the adaptability of YOLOv8 to industrial defect recognition tasks [
12]. Zhao et al. proposed RDD-YOLO, a customized YOLO-based framework specifically designed for steel surface defect detection, highlighting the effectiveness of task-oriented model customization [
13]. Chen et al. introduced a robust defect detection method by integrating an improved YOLOv5 architecture with a Transformer-based attention mechanism for petrochemical pipeline inspection, achieving high detection accuracy and recall [
14]. Mao et al. combined a deep learning-based vision model with a robotic arm system for rim defect inspection, resulting in faster and more accurate performance when compared with traditional approaches [
15]. In addition, Wu et al. employed an improved YOLO model for defect detection on aluminum profiles, further demonstrating the versatility of YOLO-based methods across different materials and defect types [
16]. Collectively, these studies confirm the effectiveness and general applicability of YOLO-based models in industrial defect detection, while also emphasizing the importance of tailoring network architectures to specific inspection tasks and industrial scenarios.
In this paper, surface defects on linear guide rails are first systematically categorized, and a dedicated guide rail surface defect (GSD) dataset is constructed. Based on an analysis of the challenges associated with detecting such defects, a one-stage instance segmentation model tailored for guide rail surface defect images is proposed. The proposed model is developed within the YOLOv8 framework and incorporates deformable convolution and an attention-based scale sequence mechanism.
During feature extraction, deformable convolution enables the network to adaptively capture features of guide rail surface defects with varying scales and irregular geometries. Subsequently, multi-scale feature fusion and encoder modules are employed to effectively integrate features across different resolutions. Furthermore, spatial and channel attention mechanisms are introduced to guide the network to focus on fine-grained details, particularly for small and densely distributed defect targets.
Rather than proposing entirely new network components, this work presents a task-driven architectural design that explicitly maps characteristic failure modes of linear guide rail surface defects to targeted modifications within a real-time instance segmentation framework:
- (1)
To minimize the impact of specular noise on the accuracy of defect recognition and segmentation, this study utilized an adjustable dome light source to illuminate the surface defects on the guide rails during dataset collection. This approach captured the color and texture features of the guide rail surface defects under various lighting conditions, resulting in the creation of the guide rail surface defect (GSD) dataset.
- (2)
To enhance the capability of the backbone network in extracting features from high-precision linear guide surface defect images, this paper introduces a deformable convolution network, DCNv3, into the backbone of YOLOv8, forming the C2F-DCNv3 deformable convolution network. This module can flexibly address the issue of insufficient receptive fields at detection points corresponding to defects of different scales, focusing more on crack defects with significant aspect ratio differences, thereby effectively reducing the instances of missed and false detections.
- (3)
To improve the feature extraction network’s ability to extract features of small-target defects in guide rail surface defect images, we propose a feature extraction network named MTC-FPN. This network utilizes a multi-scale feature fusion module and a triple-feature encoder module to extract multi-scale and fine-detail features of guide rail surface defects. By further integrating the features extracted by these two modules through Channel and Position Attention Mechanisms, the Attention Mechanism focuses on more crucial features, enhancing the focus on small-target defects and thereby improving the model’s detection accuracy.
2. Dataset Construction
Defect image data constitute the foundation for constructing deep learning-based defect detection models, as the quality of a dataset directly determines the richness and accuracy of the discriminative features that a model can learn. Accordingly, the development and optimization of deep learning detection models require dataset designs that are tailored to the characteristics of defect images. A high-quality dataset should not only contain diverse defect instances but also account for practical factors encountered in real industrial scenarios, such as complex backgrounds, varying illumination conditions, target pose variations, and occlusions, to ensure strong generalization ability and robustness of the trained models. However, to the best of our knowledge, there is currently no publicly available dataset specifically designed for surface defects on rolling linear guide pairs [
17,
18,
19]. Therefore, a dedicated data acquisition system for guide rail surface defects was designed in this study, and a guide rail surface defect dataset was constructed using data augmentation techniques, with all images captured at a resolution of 1920 × 1280 pixels.
2.1. Defect Classification
During production, processing, and transportation, defects that affect the usability of high-precision rolling linear guide pairs inevitably occur due to environmental, human, and mechanical factors. Based on the results of on-site production surveys and information obtained from discussions with defect detection personnel, the main types of defects on guide pair surfaces include scratches, rust spots (i.e., areas not treated by grinding), indentations, and depressions, as illustrated in
Figure 1. This paper summarizes and collates information on the causes and repair methods of these defects and reclassifies the guide rail defects accordingly.
The first category of defects includes burns and rust spots, both of which arise during the grinding of the guide rail surfaces. Burns are caused by high temperatures acting on the guide surface during grinding, leading to discoloration and hardening. Rust spots result from an improper entry angle of the guide rail on the machining tool, which leaves parts of the surface with the original material's roughness and irregularities. These defects can be repaired by secondary grinding if minor, but severe cases may require disposal.
The second category is scratches, usually caused by the guide surface coming into contact with hard objects during machining, transportation, or installation. Inappropriate cleaning methods or the use of unclean cloths or tools can also lead to scratches. Based on their size, scratches are further classified into large and small scratches. To remove large scratches and restore the geometric accuracy of the guide, professional grinding equipment and localized grinding techniques are required. Small scratches can be smoothed out manually using fine sandpaper or grinding paste.
Indentations and depressions, comprising the final category, are both results of impacts and collisions with objects. The distinction between the two lies in the fact that depressions are usually caused by heavy or sharp objects, resulting in deep holes, whereas indentations often occur during transportation from being squeezed or hit by heavy objects, forming large uneven areas. Depressions are generally difficult to repair, while indentations may need grinding and polishing depending on the area and degree of unevenness.
This paper also categorizes defects in this manner and proceeds with descriptions of model training and defect segmentation experiments.
2.2. Image Acquisition of Rail Surface Defects
Due to their polished finish, linear guide rail surfaces exhibit high smoothness. When light strikes a metal surface obliquely, part of it is reflected while the remainder is absorbed or transmitted, as described by the Fresnel equations. For smooth metal surfaces with high reflectivity, the reflected light is highly concentrated, forming bright specular highlight regions. As the guide rail moves along the production line, these highlight regions show noticeable positional changes.
In addition, minor surface undulations or defects can scatter and diffuse the incident light, introducing subtle details and textures into the overall appearance. As shown in
Figure 2, an adjustable dome light source is employed in this study to construct an image acquisition system for guide rail surface defect inspection. As illustrated in
Figure 3, because the top and bottom surfaces of the guide rail are flat while the side surfaces contain grooves, different illumination modes are configured to suppress specular reflection noise, and defects with unclear edge imaging are collected separately.
2.3. Analysis of Detection Challenges
As illustrated in
Figure 3, the image shows defects on the top and side surfaces of the guide rail, with the defects highlighted in red. The guide rail surface defects exhibit the following characteristics:
- (1)
Some defects, such as scratches, have an extremely high aspect ratio.
- (2)
The defect types are varied, with diverse shapes and sizes.
- (3)
Small-target defects are prevalent and are highly susceptible to interference from specular noise.
These characteristics pose the following challenges for deep learning algorithms:
- (1)
Guide rails often operate in mechanically complex industrial environments, where their surfaces may be affected by various external factors, such as dust, reflections, or shadows. These factors can significantly degrade image quality and the reliability of detection algorithms. Algorithms therefore need strong robustness and must be capable of distinguishing actual defects from background noise.
- (2)
Guide rail defects vary greatly in size, from tiny cracks to larger pits or scratches, meaning that detection algorithms need to function effectively across different scales. Although multi-scale feature fusion strategies can help improve model adaptability, designing this fusion process to optimize performance and accuracy remains a technical challenge.
- (3)
Standard convolutional neural networks typically struggle to recognize elongated scratches and small-target defects effectively, as the features of these defects are subtle and inconspicuous relative to the entire image. The linear characteristics of elongated scratches require the network to capture and correctly interpret their direction and length, whereas traditional convolution kernels may struggle to adapt to such highly oriented structures.
2.4. Dataset Details and Annotation
To ensure the reliability and reproducibility of the experiments, we constructed the guide rail surface defect dataset. The original dataset consists of 3075 images. To increase data diversity and improve model generalization, data augmentation techniques, including random cropping, horizontal flipping, and noise injection, were applied, expanding the dataset to a total of 6000 images. The augmented dataset was then divided into training, validation, and test sets with a ratio of 7:2:1.
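For reproducibility, the 7:2:1 partition of the 6000 augmented images can be sketched as follows. This is a minimal illustration with placeholder file names and an assumed fixed random seed, not the exact procedure used to build the GSD dataset:

```python
import random

def split_dataset(items, ratios=(0.7, 0.2, 0.1), seed=42):
    """Deterministically shuffle and split items into train/val/test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * ratios[0])
    n_val = round(len(items) * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# 6000 augmented images -> 4200 / 1200 / 600 (placeholder names)
images = [f"img_{i:04d}.png" for i in range(6000)]
train, val, test = split_dataset(images)
print(len(train), len(val), len(test))  # 4200 1200 600
```

Fixing the seed ensures the same partition is obtained on every run, which is what makes reported results comparable across experiments.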
All images were annotated using the labelme tool. To ensure annotation quality, a “cross-check” mechanism was employed: three researchers performed the initial annotation, followed by a review from a senior expert. The Intersection over Union (IoU) threshold for inter-annotator agreement was set to 0.85. The statistical distribution of defect categories is detailed in
Table 1.
3. YOLOv8 Instance Segmentation Model Based on Deformable Convolution Networks and Multi-Scale Feature Fusion
As discussed in
Section 2.3, guide rail surface defect inspection involves multiple challenges related to imaging conditions, defect scale variation, and geometric complexity. In this work, the first challenge—robustness against environmental interference such as reflections, shadows, and background noise—is partially alleviated at the data acquisition level through the use of a dome light source and multi-illumination imaging strategies during dataset construction. This data-driven design improves image consistency and reduces illumination-induced artifacts, providing a more reliable foundation for subsequent model learning.
However, the remaining challenges, namely large-scale variation among defects and the difficulty of accurately capturing elongated scratches and small-target defects, cannot be sufficiently addressed through data design alone. These issues are inherently related to the feature extraction and representation capability of the segmentation model. In particular, guide rail defects range from tiny cracks to large pits and scratches, requiring the model to maintain both fine-grained detail sensitivity and cross-scale semantic consistency. Meanwhile, elongated scratches exhibit strong directional characteristics and high aspect ratios, which are difficult to model effectively using standard convolutional neural networks with fixed sampling grids.
As a representative one-stage instance segmentation framework, YOLOv8-seg achieves a favorable balance between accuracy and efficiency. Nevertheless, its backbone and feature fusion mechanisms are primarily designed for generic object instances and are not explicitly optimized for small, subtle, or highly elongated defect patterns commonly observed on guide rail surfaces. Consequently, directly applying YOLOv8-seg may lead to incomplete feature extraction for linear defects and insufficient representation of small-scale defect details. To address these limitations, we redesign the YOLOv8-seg architecture in a task-driven manner, focusing on enhanced geometric adaptability and multi-scale feature representation, as described in the following subsections.
3.1. Design Motivation of the Proposed Architecture
The overall architecture of this network model is illustrated in
Figure 4. To enhance the ability of the YOLOv8 backbone network to extract semantic features of irregular and elongated defects, the concept of deformable convolution networks is applied to modify the C2f module in the backbone, thereby strengthening the model's capability for geometric transformation modeling. To prevent small-target defect features from being overlooked during feature fusion, a new feature fusion network architecture, MTC-FPN, is proposed. This architecture consists of two parts. (1) The MSF (Multi-scale Feature Fusion) and TFE (triple-feature encoder) modules: the MSF module fuses the semantic information of the multiple feature maps of different sizes output by the backbone network and outputs multi-scale feature information, while the TFE module captures fine local information within the feature maps. Together, these modules enhance the feature extraction network's ability to obtain multi-scale feature information and fine feature details. (2) The CPAM (Channel and Position Attention Mechanism): it captures channel information from the feature maps output by the TFE module through a channel attention mechanism to calibrate channel weights, thereby enhancing important channel features and suppressing unimportant ones; its position attention mechanism then integrates global feature information with local fine details, enhancing the model's focus on small-target defects and minor scratches. The feature maps output by MTC-FPN are processed by the YOLOv8 decoupled segmentation head to generate guide rail surface defect classification boxes and segmentation masks. The post-processing includes (1) box decoding, which converts the encoded positional information output by the network into actual bounding box coordinates corresponding to the original image [
20]; (2) Non-Maximum Suppression (NMS), which filters overlapping bounding boxes to ensure that only the single best detection result is retained per object; and (3) mask generation, where the network predicts a mask for each detected object region, indicating which pixels belong to the object [
21].
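The NMS step described above can be illustrated with a minimal greedy sketch in NumPy. This is a simplified, class-agnostic version for intuition only; the actual YOLOv8 implementation additionally handles score thresholds, class-aware suppression, and batched inputs:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first and is suppressed
```

The second box (IoU ≈ 0.68 with the first) exceeds the threshold and is discarded, so only one detection per defect survives.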
3.2. Adaptive Deformable Feature Extraction for Irregular Industrial Defects
To enhance the geometric adaptability of the backbone for elongated and direction-sensitive defects, we introduce the DCNv3 deformable convolution into the YOLOv8 backbone, as illustrated in
Figure 4. Specifically, the standard 3 × 3 convolution in the Bottleneck of the C2f module is replaced with a deformable convolution layer. This design allows the convolution kernel to dynamically adjust its sampling locations according to the spatial distribution of features, thereby enhancing the alignment between convolutional receptive fields and irregular defect structures.
Deformable convolution networks (DCNs), originally proposed by Dai et al. [
22], extend standard convolution by learning spatial offsets for each sampling point. This mechanism enables the convolution kernel to perform adaptive, non-uniform sampling, which significantly improves the modeling of geometric deformations. Compared with conventional CNNs, deformable convolutions can achieve a more flexible receptive field with fewer layers, making them particularly suitable for capturing thin cracks, scratches, and other irregular industrial defects.
Among existing deformable convolution variants, this work adopts DCNv3 rather than earlier versions (DCNv1 or DCNv2) for two key reasons. First, DCNv3 introduces group-wise weight sharing and softmax-normalized modulation scalars, which enable more stable and efficient spatial aggregation across long-range dependencies. This property is crucial for guide rail defects, where cracks may extend over large spatial regions and require consistent feature modeling along their entire length. Second, DCNv3 significantly reduces computational redundancy compared to previous deformable convolutions, allowing its integration into the backbone without compromising real-time detection performance. These advantages make DCNv3 more suitable for industrial inspection scenarios that demand both accuracy and efficiency [
23].
DCNv3 can be mathematically expressed as follows:

$$y(p_0) = \sum_{g=1}^{G} \sum_{k=1}^{K} w_g \, m_{gk} \, x_g(p_0 + p_k + \Delta p_{gk}) \tag{1}$$

where $G$ represents the total number of groups in the spatial aggregation process; $K$ represents the total number of sampling points; $w_g$ represents the position-independent projection weight of the $g$-th group; $m_{gk}$ represents the modulation scalar for the $k$-th sampling point in the $g$-th group, normalized through the softmax function along the dimension $K$; $x_g$ represents the sliced input feature map of the $g$-th group; $p_k$ represents the $k$-th position of the predefined grid sampling; and $\Delta p_{gk}$ represents the offset relative to the grid sampling position $p_k$ in the $g$-th group.
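The DCNv3 aggregation can be sketched numerically for a single output position. This is a simplified illustration, not the production kernel: the per-group projection weights are reduced to scalars, only one group is shown, and bilinear sampling implements the fractional offsets:

```python
import numpy as np

def bilinear_sample(x, py, px):
    """Bilinearly sample a (C, H, W) map at fractional location (py, px)."""
    C, H, W = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    wy, wx = py - y0, px - x0
    def pixel(yy, xx):
        return x[:, yy, xx] if 0 <= yy < H and 0 <= xx < W else np.zeros(C)
    return ((1 - wy) * (1 - wx) * pixel(y0, x0)
            + (1 - wy) * wx * pixel(y0, x0 + 1)
            + wy * (1 - wx) * pixel(y0 + 1, x0)
            + wy * wx * pixel(y0 + 1, x0 + 1))

def dcnv3_point(x_groups, w, offsets, logits, p0, grid):
    """y(p0) = sum_g sum_k w_g * m_gk * x_g(p0 + p_k + dp_gk),
    where m_gk is a softmax over the K sampling points of each group."""
    G, K = logits.shape
    m = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # modulation scalars
    y = 0.0
    for g in range(G):
        for k in range(K):
            py = p0[0] + grid[k][0] + offsets[g, k, 0]  # grid point + learned offset
            px = p0[1] + grid[k][1] + offsets[g, k, 1]
            y = y + w[g] * m[g, k] * bilinear_sample(x_groups[g], py, px)
    return y

# One group, a 3x3 grid, zero offsets, uniform modulation, constant input map:
grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
y = dcnv3_point(x_groups=[np.ones((1, 3, 3))], w=np.array([2.0]),
                offsets=np.zeros((1, 9, 2)), logits=np.zeros((1, 9)),
                p0=(1, 1), grid=grid)
print(y)  # -> [2.]
```

With zero offsets this reduces to an ordinary 3 × 3 convolution; nonzero offsets let the kernel follow an elongated crack instead of a fixed square neighborhood.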
3.3. Task-Driven Multi-Scale Feature Enhancement for Linear Defect Segmentation
The feature extraction network plays a critical role in instance segmentation for linear guide rail surface defects, as it directly determines the model’s ability to perceive fine-grained, elongated, and densely distributed defect patterns. Unlike generic object detection tasks, guide rail defects—such as cracks, scratches, and wear marks—exhibit strong anisotropy, irregular geometry, and significant scale variation, with small defects often appearing densely clustered or partially overlapping. These characteristics impose higher demands on spatial adaptability, multi-scale representation, and detail preservation during feature extraction.
The original YOLOv8 employs an FPN-PAN structure [
24] to aggregate multi-scale features through simple summation or concatenation. While effective for common object detection scenarios, this strategy primarily emphasizes hierarchical feature propagation and lacks explicit modeling of cross-scale dependencies. Consequently, it struggles to preserve fine spatial details and discriminative cues for small and densely distributed defect targets, which are critical in industrial inspection scenarios.
To address these limitations, this paper proposes a task-driven MTC-FPN (MSF–TFE–CPAM) feature extraction network tailored for guide rail surface defect segmentation, as illustrated in
Figure 5. Rather than simply integrating existing modules, each component is designed to address a specific challenge inherent to linear industrial defects.
First, the MSF module is designed to explicitly model cross-scale feature interactions. Unlike traditional pyramid fusion strategies that perform pairwise or hierarchical fusion, MSF employs three-dimensional convolution to jointly process feature maps from multiple pyramid levels (P2–P5). This design enables simultaneous aggregation of semantic information from deep layers and fine-grained spatial details from shallow layers, which is crucial for maintaining the structural continuity of linear defects across scales.
Second, the TFE module focuses on selectively enhancing detailed spatial features in large-resolution feature maps. In industrial defect images, small defects are often embedded in complex backgrounds and may be suppressed during downsampling. The TFE module compensates for this effect by reinforcing local detail representations, thereby improving the detection and segmentation of densely overlapping small defect targets. The specific configuration of the TFE module is optimized to balance detail enhancement and noise suppression, ensuring that fine defect features are preserved without introducing excessive background interference.
Finally, the CPAM adaptively integrates the multi-scale features generated by the MSF module with the refined detail features from the TFE module. By jointly modeling channel-wise importance and spatial positional relevance, the CPAM enhances feature discrimination and promotes more effective feature fusion.
Through the coordinated design of DCNv3-based adaptive convolution, task-oriented multi-scale fusion, and targeted detail enhancement, the proposed MTC-FPN forms a unified and complementary feature extraction framework. This design goes beyond simple engineering integration and provides a principled solution for capturing the complex geometric structures and scale diversity characteristic of linear guide rail surface defects.
3.3.1. MSF
The Multi-scale Feature Fusion (MSF) module is primarily responsible for integrating global or high-level semantic information from multi-scale images [
25]. In processing images of small-target defects, global and high-level semantic information is crucial for accurate identification and segmentation. The MSF module enhances the model’s ability to capture and utilize this information by fusing features from different scales, thereby improving segmentation accuracy. As illustrated in
Figure 6, feature maps of large, medium, and small scales output by the YOLOv8 backbone network are processed through convolution operations to standardize the number of channels across multi-scale feature maps; this is followed by bilinear interpolation upsampling to reshape the feature maps to the same size. After upsampling, the feature map information is stacked and sent together into a 3D convolution to combine multi-scale features, ultimately outputting richer semantic information of defects.
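A shape-level sketch of this fusion is given below. Assumptions for brevity: channel counts are already standardized, nearest-neighbor upsampling stands in for bilinear interpolation, and the 3D convolution is reduced to a 1 × 1 × 1 kernel written as an einsum:

```python
import numpy as np

def upsample_nearest(x, size):
    """Nearest-neighbor upsampling of a (C, H, W) map to (C, size, size)."""
    _, H, W = x.shape
    ys = np.arange(size) * H // size
    xs = np.arange(size) * W // size
    return x[:, ys][:, :, xs]

def msf_fuse(p_small, p_mid, p_large, kernel):
    """Rescale three pyramid levels to the largest resolution, stack them along
    a scale axis, and fuse with a 1x1x1 3D convolution expressed as an einsum."""
    size = p_large.shape[1]
    stack = np.stack([upsample_nearest(p_small, size),
                      upsample_nearest(p_mid, size),
                      p_large])                          # (S=3, C, H, W)
    return np.einsum('osc,schw->ohw', kernel, stack)     # (C_out, H, W)

p_small, p_mid, p_large = np.ones((2, 4, 4)), np.ones((2, 8, 8)), np.ones((2, 16, 16))
fused = msf_fuse(p_small, p_mid, p_large, kernel=np.ones((3, 3, 2)))
print(fused.shape)  # (3, 16, 16)
```

The key point is that the 3D convolution mixes information across the scale axis jointly, rather than fusing pyramid levels pairwise.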
3.3.2. TFE
The triple-feature encoder (TFE) module is a feature fusion mechanism proposed to improve visual recognition of densely overlapping small objects [
26]. This module magnifies images to compare changes in shape or appearance at different scales, focusing on capturing the local fine details of small targets. In the feature network, the TFE module divides features into three sizes: large, medium, and small. The original feature pyramid network in YOLOv8 does not perform additional processing on large-size feature maps but upsamples small-size feature maps and combines them with the previous layer’s features. The TFE module, on the other hand, utilizes the rich detail information from large-size feature maps. As shown in
Figure 7, before feature map fusion, the feature channel numbers are adjusted to ensure consistency among large, medium, and small feature maps, achieved through a convolution network; then, the large-size feature maps undergo downsampling through a mixed structure of maximum pooling and average pooling. This downsampling method preserves the most critical feature information while maintaining the integrity of the features, increasing the effectiveness and diversity of small-target defect images; subsequently, the small-size feature maps are upscaled using bilinear interpolation, with the final output being the merged feature map.
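The TFE resizing scheme can be sketched as follows. This is a simplified illustration (channel counts assumed already consistent, nearest-neighbor upsampling in place of bilinear interpolation), showing the mixed max/average pooling downsample and the channel-wise merge:

```python
import numpy as np

def mixed_pool_down(x, k):
    """Downsample a (C, H, W) map by factor k using a max/avg pooling mixture,
    preserving peak responses while keeping overall feature statistics."""
    C, H, W = x.shape
    blocks = x.reshape(C, H // k, k, W // k, k)
    return 0.5 * blocks.max(axis=(2, 4)) + 0.5 * blocks.mean(axis=(2, 4))

def upsample_nearest(x, size):
    """Nearest-neighbor upsampling of a (C, H, W) map to (C, size, size)."""
    _, H, W = x.shape
    return x[:, np.arange(size) * H // size][:, :, np.arange(size) * W // size]

def tfe_merge(p_large, p_mid, p_small):
    """Bring all three maps to the middle resolution, then concatenate channels."""
    size = p_mid.shape[1]
    down = mixed_pool_down(p_large, p_large.shape[1] // size)
    up = upsample_nearest(p_small, size)
    return np.concatenate([down, p_mid, up], axis=0)

merged = tfe_merge(np.ones((2, 8, 8)), np.ones((2, 4, 4)), np.ones((2, 2, 2)))
print(merged.shape)  # (6, 4, 4)
```

Unlike a plain FPN pathway, the large-resolution map contributes explicitly here, so fine detail is carried into the merged representation rather than discarded.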
3.3.3. CPAM
The Channel and Position Attention Module (CPAM) is a network that integrates both the Channel and Position Attention Mechanisms, enhancing the model’s focus on both the feature and positional information of defects.
Regarding the feature network of this study, the overall network architecture of the CPAM is shown in
Figure 8. In the CPAM, the Channel Attention Module first receives inputs from the TFE module and processes each channel of the input feature map TFE_OUT with global average pooling. This step compresses each feature map of dimensions C × H × W (where C is the number of channels, and H and W are the height and width, respectively) into a vector of length C, generating a global feature descriptor for each channel. These channel descriptors are then processed through a one-dimensional convolution layer whose kernel size dynamically adjusts based on the number of channels. This step captures dependencies between local channels while maintaining their individuality. The outputs of the one-dimensional convolution are then transformed into channel weights through two fully connected layers and a nonlinear Sigmoid function. The two fully connected layers are designed to capture nonlinear cross-channel interactions, with the channel weight formula shown in Equation (2), where $x_{ji}$ represents the influence of the $i$-th channel on the $j$-th channel, and $A_i$ and $A_j$ represent the $i$-th and $j$-th channels in the TFE_OUT feature map, respectively. Each element in CAM_OUT, denoted as $E_j$, is derived from Equation (3), with $\beta$ representing the channel attention weight coefficient, which is initially set to zero and adjusted through neural network learning. Finally, these channel weights are multiplied by the original input feature map for channel recalibration, enhancing important channel features and suppressing unimportant ones.
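The channel recalibration pipeline can be sketched compactly. As a deliberate simplification, the adaptive 1D-convolution kernel size and the two fully connected layers are collapsed into a single fixed local 1D convolution; all names are illustrative:

```python
import numpy as np

def channel_recalibrate(feat, kernel):
    """Simplified channel attention: global average pooling per channel, a local
    1D convolution across neighboring channels, a sigmoid gate, then rescaling."""
    desc = feat.mean(axis=(1, 2))                    # (C,) global channel descriptors
    mixed = np.convolve(desc, kernel, mode='same')   # local cross-channel interaction
    weights = 1.0 / (1.0 + np.exp(-mixed))           # sigmoid channel gates in (0, 1)
    return feat * weights[:, None, None], weights

feat = np.stack([np.zeros((2, 2)), 4.0 * np.ones((2, 2))])  # C=2, H=W=2 toy input
out, w = channel_recalibrate(feat, kernel=np.array([1.0]))
print(w)  # the gate for the all-zero channel is exactly 0.5
```

Channels with strong average responses receive gates near 1 and are preserved, while weakly responding channels are attenuated, which is the intended recalibration effect.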
The feature map outputs from the Channel Attention Module are combined with the multi-scale feature maps from the MSF module and then fed into the Position Attention Mechanism. Initially, convolution operations are applied to the input multi-scale and detailed feature maps to generate three feature matrices: Query (Q), Key (K), and Value (V). The Q and K matrices are used to compute relationships between different positions and generate an attention map (with dimensionality reduction to decrease computational load), while the third matrix (V) retains the original number of channels to ensure consistency in the output dimensions. Subsequently, a softmax activation function calculates attention scores between different positions to adjust feature representations at each location and generates a position attention weight matrix X. In X, each element’s value represents the influence of one position on another. In the feature map, the more similar the feature representations of two positions, the greater their correlation. The calculation formula is shown in Equation (4).
The resulting position attention weight map is multiplied by the feature map B, reshaped to match the dimensions of A, and then added to A to produce the final output feature map PAM_OUT. Each element in PAM_OUT, denoted as $E_j$, is calculated as shown in Equation (5), where $\alpha$ represents the position attention weight coefficient, which is initially set to zero and adjusted through feature self-regulation learned by the neural network.
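The Q/K/V position attention described above can be sketched in NumPy. This is a minimal illustration under stated assumptions (1 × 1 projections given as plain matrices, a fixed residual coefficient instead of a learned one, and no dimensionality reduction):

```python
import numpy as np

def position_attention(feat, Wq, Wk, Wv, alpha=0.5):
    """Position attention over a (C, H, W) map: projections give Q, K, V; a
    softmax over pairwise position similarities re-weights V; the result is
    scaled by a coefficient (learnable in the real network) and added back."""
    C, H, W = feat.shape
    flat = feat.reshape(C, H * W)                  # N = H*W spatial positions
    Q, K, V = Wq @ flat, Wk @ flat, Wv @ flat
    energy = Q.T @ K                               # (N, N) position-to-position similarity
    energy -= energy.max(axis=1, keepdims=True)    # numerical stability
    attn = np.exp(energy)
    attn /= attn.sum(axis=1, keepdims=True)        # softmax over source positions
    out = alpha * (V @ attn.T) + flat              # residual re-weighting, Eq. (5)-style
    return out.reshape(C, H, W), attn

rng = np.random.default_rng(0)
feat = rng.normal(size=(2, 3, 3))
Wq, Wk = rng.normal(size=(1, 2)), rng.normal(size=(1, 2))
out, attn = position_attention(feat, Wq, Wk, Wv=np.eye(2))
print(attn.shape)  # (9, 9); every row sums to 1
```

With the coefficient at zero (its initial value in the paper's formulation), the module reduces to an identity mapping, so attention is introduced gradually during training.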
In summary, the CPAM employs a channel attention mechanism that avoids dimensionality reduction, capturing local cross-channel interactions by considering each channel and its k nearest neighbors. This is achieved through a 1D convolution of size k, where k defines the scope of local cross-channel interactions, assigning an appropriate weight to each channel. The Position Attention Module then receives the channel attention output combined with the MSF output as a single input. Horizontal and vertical pooling operations on the input feature map, along its width and height dimensions, yield feature maps that focus on horizontal and vertical spatial structures. These directional feature maps are then encoded separately and finally merged to produce the output, facilitating the extraction of key positional information for each defect.
4. Segmentation Experiments and Experimental Analysis
4.1. Evaluation Metrics
In object segmentation tasks, it is often necessary to consider the degree of overlap between the predicted results and the true annotations, namely the Intersection over Union (IoU). IoU calculates the ratio of the intersection area to the union area between the predicted and true bounding boxes and is a critical metric for measuring prediction accuracy. Depending on different IoU threshold values, we can calculate AP values at various levels of strictness, such as AP50 (AP at an IoU threshold of 0.5). A more comprehensive mAP is the average of AP values calculated at IoU thresholds ranging from 0.5 to 0.95 in steps of 0.05 [
27,
28,
29].
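The IoU computation and the threshold sweep behind mAP can be sketched as follows; this is a minimal illustration for axis-aligned boxes (segmentation masks use the same intersection-over-union ratio computed over pixel sets).

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# IoU thresholds averaged for mAP50:95 -> 0.50, 0.55, ..., 0.95
THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]
```

AP50 counts a prediction as correct when its IoU with a ground-truth instance is at least 0.5; the stricter averaged metric repeats the matching at each of the ten thresholds above.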
In instance segmentation tasks, FPS (Frames Per Second) and Param (the number of model parameters) are crucial indicators of an algorithm’s real-time performance and resource footprint, especially in the field of industrial defect detection [
30]. An algorithm’s ability to rapidly process images or video streams and provide accurate real-time defect detection results significantly affects production efficiency. As detection speed is closely related to computer hardware, the speed metrics mentioned in this paper are all measured under the hardware environment shown in
Table 1.
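As a concrete illustration of how such speed figures are obtained, the sketch below times repeated single-image inference calls. The function name and warm-up scheme are our illustrative assumptions, not the paper's exact benchmarking code.

```python
import time

def measure_fps(infer, n_warmup=5, n_runs=50):
    """Average frames per second of a single-image inference callable.

    infer is any zero-argument function that runs one forward pass; warm-up
    iterations are excluded so one-off initialisation cost is not counted.
    """
    for _ in range(n_warmup):
        infer()
    start = time.perf_counter()
    for _ in range(n_runs):
        infer()
    elapsed = time.perf_counter() - start
    return n_runs / elapsed
```

Since the result depends directly on the hardware executing `infer`, FPS values are only comparable when measured in the same environment, which is why the hardware configuration is fixed and reported.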
In summary, this paper evaluates the training results using metrics such as Box mAP50, Mask mAP50, F1-score, and FPS. Box mAP reflects the precision of defect classification and localization, while Mask mAP indicates segmentation accuracy. Together, these metrics characterize the model’s performance in object detection and instance segmentation from several dimensions, providing a sound basis for optimizing model performance. In addition, we examine the performance differences exhibited by the model across defect categories to analyze its strengths and weaknesses in finer detail. This multi-dimensional evaluation gives a more complete picture of the model’s behavior and lays a solid foundation for subsequent research and improvement.
4.2. Experimental Environment
In this study, all experiments, including model training, ablation studies, and defect segmentation experiments, were conducted on the same server. The programming language used for the experiments was Python, and the main development framework for instance segmentation was PyTorch. Detailed hardware specifications of the experimental environment are provided in
Table 2, while software specifications can be found in
Table 3.
In addition, the training-related hyperparameters, including the optimizer type, learning rate, batch size, and training epochs, were carefully configured to ensure stable model convergence. The specific parameter settings used in all experiments are reported in
Table 4.
4.3. Comparative Experiments Before and After Improvements
As shown in
Figure 9, this paper compares the segmentation effects of the original YOLOv8 instance segmentation model and the improved YOLOv8 instance segmentation model presented in this study.
Compared to the original YOLOv8 instance segmentation model, the improved YOLOv8 model presented in this paper fits defect boundaries more closely, producing smoother and more accurate segmentation contours (
Figure 9c,d). The classification accuracy of the improved model is also more precise compared to the original YOLOv8 model (
Figure 9g,h), and there is an improvement in the detection of small-target defects (
Figure 9k,l,o,p). However, there are still some shortcomings in the improved instance segmentation model when applied to the segmentation of guide rail surface defects. The first shortcoming is the tendency for segmentation overlap errors when two defects are too close together (
Figure 9h,k); an example of this is when two scratches on the right side of the guide rail surface are close together, resulting in the two defects being segmented as one. The second shortcoming is the segmentation precision; when the defects are elongated or irregular, the segmented defect contours are larger than the defects themselves (
Figure 9h,p). The third shortcoming is the insufficient segmentation of areas with severe specular noise, such as deep indentations (
Figure 9t).
4.4. Ablation Studies
To verify the improvements of the two modifications to the original YOLOv8 instance segmentation model for segmenting guide rail surface defects, ablation experiments were conducted. These experiments involved training modified models with only changes in the backbone network and only changes in the feature extraction network, under the same training conditions, to validate their effectiveness. The experimental results are shown in
Table 5.
Furthermore, to verify the impact of the improvements discussed in this paper on the segmentation accuracy of the three types of defects, the segmentation results of the three defect types during the ablation experiments were recorded and compared with those from the unmodified original YOLOv8 instance segmentation model. The results are presented in
Table 6,
Table 7 and
Table 8.
From the experimental data in
Table 5, we observed the impacts of the two enhancements, C2f_DCNv3 and MTC-FPN, on model performance. Firstly, when C2f_DCNv3 was applied alone, Box and Mask mAP50 improved over the original YOLOv8 model, while GFLOPs slightly decreased and FPS slightly increased. This indicates that C2f_DCNv3 not only enhances segmentation accuracy but also slightly improves computational efficiency. Similarly, when MTC-FPN was applied alone, Box and Mask mAP50 improved significantly; however, the increase in Param led to a notable decrease in FPS, suggesting that MTC-FPN gains its segmentation accuracy at the expense of some processing speed.
When C2f_DCNv3 and MTC-FPN were applied together, the model achieved the best performance in terms of mAP50. Although Param increased, the decrease in FPS was not substantial, indicating that the model maintains a relatively acceptable processing speed while retaining high segmentation accuracy. This result indicates that the proposed design achieves a favorable trade-off between accuracy gains and computational overhead, rather than improving performance through model scaling alone.
Analysis of precision and recall for different types of defects from
Table 6,
Table 7 and
Table 8 reveals valuable insights. The combined application of C2f_DCNv3 and MTC-FPN achieved the best results in almost all cases. Notably, for the second type of defect (scratches) and the third type (dents, indentations, etc.), the gains in precision and recall were particularly pronounced, indicating that the enhanced model segments these defect types substantially better.
This experiment verified that the enhancements C2f_DCNv3 and MTC-FPN are effective in improving the performance of the YOLOv8 instance segmentation model for the task of detecting surface defects on rolling linear guide accessories. These enhancements not only improve the overall segmentation accuracy (mAP50) but also achieve more precise segmentation for different types of defects (improving precision and recall). Although the introduction of MTC-FPN somewhat reduces the processing speed of the model, through reasonable configuration and optimization, we can still maintain high segmentation accuracy while achieving satisfactory real-time performance. These experimental results provide valuable references for deploying and improving the model in practical applications.
4.5. Comparative Experiments and Versatility Studies
To assess the effectiveness of the proposed method for high-precision segmentation of surface defects on rolling linear guide accessories, comparative experiments were conducted using the dataset developed in this study. The results in
Table 9 demonstrate clear advantages of the proposed approach over existing one-stage instance segmentation models.
According to the data in
Table 9, in terms of mAP50, all enhanced YOLO variants outperform the original YOLOv8n-seg, indicating improved feature extraction capabilities in later architectures. The proposed method achieves an mAP50 of 82.1%, comparable to YOLOv11s-seg and approaching the performance of Mask-RCNN [
31]. Under the more stringent mAP0.5:0.95 metric, the proposed model reaches 50.1%, surpassing the YOLOv8 and YOLOv10 [
32] models and remaining close to the best-performing model, YOLOv11s-seg. These results confirm that the proposed method maintains robust segmentation quality under stricter IoU thresholds.
The F1-score further highlights model reliability. Although Mask-RCNN and Mask2Former [
33] achieve the highest scores due to their large-capacity multi-stage or transformer-based designs, the proposed method attains a competitive 74.6%, outperforming all YOLO-based one-stage models. This demonstrates the strong balance between precision and recall enabled by the introduced architectural enhancements.
In terms of inference efficiency, the proposed method achieves an FPS value of 148, substantially faster than Mask-RCNN and Mask2Former, and sufficiently rapid for real-time industrial applications. While the lightweight YOLO variants achieve even higher speeds, they do so at the cost of reduced accuracy, particularly under high-IoU conditions. The proposed method therefore provides a more favorable accuracy–speed trade-off.
The performance gains primarily stem from the architectural modifications tailored to the characteristics of rolling-guide surface defects, which are complex, highly textured, and dominated by small targets. The integration of deformable convolutions enhances adaptability to irregular textures and expands the effective receptive field. A dedicated small-target feature extraction layer mitigates fine-detail loss during downsampling. Furthermore, the multi-scale feature fusion module and triple-feature encoder improve semantic consistency and strengthen both global and local feature representations. The incorporation of channel and spatial attention mechanisms further refines defect localization by emphasizing salient feature and positional information.
To evaluate the generalization capability of the proposed model, additional experiments were conducted on the MSD dataset, publicly available from Peking University [
34]. This dataset includes oil stains, scratches, and spot defects, with the latter two categories containing numerous small targets. The quantitative results are presented in
Table 10 and
Table 11, and visual examples are shown in
Figure 10.
Based on the MSD dataset results presented in
Table 10 and
Table 11, a comparative analysis of various instance segmentation methods was conducted. Across metrics such as mAP50, mAP75, F1-score, and FPS, our method demonstrated significant improvements over YOLOv8n-seg, YOLOv10n-seg, Fast-SCNN [
35], and FDSNet [
35]. Notably, our method excelled in mAP50 and mAP75, achieving 85.8% and 73.4%, respectively, a notable improvement over the other methods. Additionally, in addressing the three categories of defects present in the MSD dataset (oil stains, spots, and scratches), our method achieved good precision and recall, particularly excelling in identifying oil stains and scratches. Overall, our method shows high accuracy and efficiency in the domain of industrial defect detection, demonstrating strong versatility and robustness.
5. Discussion
5.1. Deployment and Feasibility
From an industrial deployment perspective, the proposed method is designed to balance accuracy and efficiency. Built upon a lightweight one-stage segmentation framework, the model introduces task-specific enhancements while maintaining acceptable computational complexity and GPU memory consumption for commonly used industrial GPUs. Although frame-per-second metrics do not fully capture end-to-end system latency, the achieved inference speed is sufficient to support online inspection requirements in typical production lines.
In terms of system integration, the proposed approach can be readily incorporated into existing machine vision pipelines, operating with standard industrial cameras and controlled illumination setups. While fixed single-light configurations may limit robustness in extreme cases, practical deployments can adopt multi-illumination or adjustable lighting strategies to enhance image quality without increasing algorithmic complexity. Overall, the method demonstrates strong feasibility for real-world inspection scenarios where stable performance, moderate resource usage, and ease of integration are critical considerations.
5.2. Limitations and Challenge Scenario Analysis
Although the proposed segmentation model demonstrates strong overall performance, several limitations remain when applied to complex industrial inspection scenarios. These issues outline important directions for subsequent improvements.
5.2.1. Overlapping and Visually Similar Defects
Surface defects on precision linear guides often occur in close proximity, overlap spatially, or share highly similar visual characteristics. In such cases, the proposed model may struggle to distinguish individual defect instances or accurately classify defect types, especially when scratches, pits, and wear marks exhibit comparable textures or when overlapping defects form continuous irregular patterns. This leads to ambiguities in instance separation and category assignment, which can compromise downstream analysis. Enhancing fine-grained feature representation, integrating structural priors, or introducing contrastive learning strategies may help improve discrimination under these challenging conditions.
5.2.2. Necessity of Segmentation for Downstream Measurement
Unlike conventional defect detection tasks, linear guide inspection requires precise morphological analysis of defects, including their length, area, and geometric characteristics. Bounding box-based detection methods cannot provide the pixel-level boundaries needed for such quantitative assessment. Segmentation is therefore essential to support accurate dimensional measurement and subsequent quality evaluation. While the proposed model offers reliable masks, further refinement is needed to ensure consistent contour accuracy for irregular or thin defect shapes.
5.2.3. Limited Robustness Under Complex Surface and Lighting Conditions
The detection accuracy of the proposed model remains influenced by the highly reflective, heterogeneous, and light-sensitive nature of precision guide surfaces, particularly under fixed single-illumination conditions. Variations such as specular highlights, shadows, or low-contrast regions may distort defect appearance, leading to misclassification, missed detections, or imprecise segmentation boundaries.
It should be noted that this limitation primarily manifests under constrained illumination setups. In practical industrial deployments, controlled illumination environments—such as multi-angle or multi-intensity lighting configurations—can be introduced to effectively alleviate these issues without modifying the model architecture. Furthermore, future work may explore illumination-invariant feature modeling and adaptive lighting strategies to enhance robustness under more extreme and dynamically changing conditions.
5.3. Future Research and Practical Applications
Beyond defect localization and segmentation, the proposed method provides a foundation for downstream industrial analysis and decision-making. Pixel-level segmentation enables automatic geometric measurement of surface defects, such as scratch length, defect area, aspect ratio, and spatial distribution, which are critical for quantitative quality assessment and grading in precision guide rail manufacturing. These measurements can be directly integrated into inspection systems to support rule-based or data-driven acceptance criteria.
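The geometric measurements listed above can be derived directly from a predicted binary mask. The sketch below is illustrative (the function name and returned fields are our assumptions, not part of the paper's pipeline), using the bounding box of the mask's foreground pixels.

```python
import numpy as np

def defect_geometry(mask):
    """Basic geometric descriptors of a single defect from its binary mask.

    mask: 2D boolean/0-1 array for one defect instance. Returns the area in
    pixels, bounding-box height and width, and an aspect ratio (>= 1, longer
    side over shorter side) that can help flag elongated defects such as
    scratches.
    """
    ys, xs = np.nonzero(mask)
    area = int(len(ys))                      # pixel count of the defect
    h = int(ys.max() - ys.min() + 1)         # bounding-box height
    w = int(xs.max() - xs.min() + 1)         # bounding-box width
    aspect = max(h, w) / min(h, w)
    return {"area": area, "height": h, "width": w, "aspect_ratio": aspect}
```

Pixel quantities would then be converted to physical units via the camera calibration, and the resulting measurements fed into rule-based or data-driven acceptance criteria.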
From a practical perspective, future work may focus on coupling the segmentation results with automated measurement and reporting modules, enabling closed-loop quality control and traceability. In addition, further research may explore improved robustness under extreme illumination variations, adaptive thresholding for defect severity evaluation, and integration with multi-sensor inspection systems to enhance reliability in complex industrial environments.