1. Introduction
With the advancement of the manufacturing industry, high-precision linear guides have become essential components for linear motion control, and they are widely used in machine tools, industrial robots, and semiconductor manufacturing equipment. However, during the production and handling of high-precision linear guides, surface defects such as scratches and dents may occur due to improper human operations or external environmental factors. The surface integrity of high-precision linear guides is critical to their functional performance [
1]. Surface defects can directly degrade the positioning accuracy of guide systems, resulting in reduced motion stability [
2]. Moreover, such defects accelerate surface wear, promote component aging, and ultimately shorten the service life of the entire mechanical system [
3].
At present, surface defect inspection of linear guides still largely relies on manual examination combined with specialized equipment and operator experience. This inspection process is inefficient and highly dependent on subjective human judgment, which can lead to inconsistent classification results [
4]. In this context, machine vision-based inspection has attracted increasing attention. Existing research on machine vision for defect detection can be broadly categorized into traditional image processing methods and deep learning-based approaches. Traditional methods rely on handcrafted rules and predefined algorithms, such as analyzing defect size, shape, color characteristics, and grayscale gradients, followed by edge detection, threshold segmentation, and morphological operations. These methods can achieve high measurement accuracy under controlled conditions [
5,
6,
7]; however, they are sensitive to environmental variations, computationally inefficient, and limited when it comes to capturing spatial and contextual information. In contrast, deep learning-based defect detection methods, which currently dominate this research field, employ convolutional neural networks to automatically extract discriminative defect features and perform classification, enabling efficient and robust automated defect detection [
4].
The YOLO-based instance segmentation models proposed in recent years adopt a one-stage paradigm inspired by the YOLACT framework. These models typically employ a backbone network, together with a Feature Pyramid Network (FPN), to extract and fuse multi-scale features from input images [
8,
9]. The detection branch is responsible for predicting object categories and bounding boxes, while the segmentation branch generates k prototype masks, along with corresponding mask coefficients. Both detection and segmentation tasks are performed in parallel. In the segmentation branch, high-resolution feature maps are processed through a series of convolutional layers to produce prototype masks. The final instance segmentation results are obtained by linearly combining the prototype masks with their corresponding mask coefficients [
10,
11].
YOLO-based models have been widely applied across various industrial defect inspection scenarios. For example, Xie et al. developed a lightweight multi-scale feature fusion model, LMS-YOLO, for efficient steel surface defect detection, demonstrating the adaptability of YOLOv8 to industrial defect recognition tasks [
12]. Zhao et al. proposed RDD-YOLO, a customized YOLO-based framework specifically designed for steel surface defect detection, highlighting the effectiveness of task-oriented model customization [
13]. Chen et al. introduced a robust defect detection method by integrating an improved YOLOv5 architecture with a Transformer-based attention mechanism for petrochemical pipeline inspection, achieving high detection accuracy and recall [
14]. Mao et al. combined a deep learning-based vision model with a robotic arm system for rim defect inspection, resulting in faster and more accurate performance when compared with traditional approaches [
15]. In addition, Wu et al. employed an improved YOLO model for defect detection on aluminum profiles, further demonstrating the versatility of YOLO-based methods across different materials and defect types [
16]. Collectively, these studies confirm the effectiveness and general applicability of YOLO-based models in industrial defect detection, while also emphasizing the importance of tailoring network architectures to specific inspection tasks and industrial scenarios.
In this paper, surface defects on linear guide rails are first systematically categorized, and a dedicated guide rail surface defect (GSD) dataset is constructed. Based on an analysis of the challenges associated with detecting such defects, a one-stage instance segmentation model tailored for guide rail surface defect images is proposed. The proposed model is developed within the YOLOv8 framework and incorporates deformable convolution and an attention-based scale sequence mechanism.
During feature extraction, deformable convolution enables the network to adaptively capture features of guide rail surface defects with varying scales and irregular geometries. Subsequently, multi-scale feature fusion and encoder modules are employed to effectively integrate features across different resolutions. Furthermore, spatial and channel attention mechanisms are introduced to guide the network to focus on fine-grained details, particularly for small and densely distributed defect targets.
Rather than proposing entirely new network components, this work presents a task-driven architectural design that explicitly maps characteristic failure modes of linear guide rail surface defects to targeted modifications within a real-time instance segmentation framework:
- (1)
To minimize the impact of specular noise on the accuracy of defect recognition and segmentation, this study utilized an adjustable dome light source to illuminate the surface defects on the guide rails during dataset collection. This approach captured the color and texture features of the guide rail surface defects under various lighting conditions, resulting in the creation of the guide rail surface defect (GSD) dataset.
- (2)
To enhance the capability of the backbone network in extracting features from high-precision linear guide surface defect images, this paper introduces a deformable convolution network, DCNv3, into the backbone of YOLOv8, forming the C2F-DCNv3 deformable convolution network. This module can flexibly address the issue of insufficient receptive fields at detection points corresponding to defects of different scales, focusing more on crack defects with significant aspect ratio differences, thereby effectively reducing the instances of missed and false detections.
- (3)
To improve the feature extraction network’s ability to extract features of small-target defects in guide rail surface defect images, we propose a feature extraction network named MTC-FPN. This network utilizes a multi-scale feature fusion module and a triple-feature encoder module to extract multi-scale and fine-detail features of guide rail surface defects. By further integrating the features extracted by these two modules through Channel and Position Attention Mechanisms, the Attention Mechanism focuses on more crucial features, enhancing the focus on small-target defects and thereby improving the model’s detection accuracy.
2. Dataset Construction
Defect image data constitute the foundation for constructing deep learning-based defect detection models, as the quality of a dataset directly determines the richness and accuracy of the discriminative features that a model can learn. Accordingly, the development and optimization of deep learning detection models require dataset designs that are tailored to the characteristics of defect images. A high-quality dataset should not only contain diverse defect instances but also account for practical factors encountered in real industrial scenarios, such as complex backgrounds, varying illumination conditions, target pose variations, and occlusions, to ensure strong generalization ability and robustness of the trained models. However, to the best of our knowledge, there is currently no publicly available dataset specifically designed for surface defects on rolling linear guide pairs [
17,
18,
19]. Therefore, a dedicated data acquisition system for guide rail surface defects was designed in this study, and a guide rail surface defect dataset was constructed using data augmentation techniques, with all images captured at a resolution of 1920 × 1280 pixels.
2.1. Defect Classification
During production, processing, and transportation, defects that affect the usability of high-precision rolling linear guide pairs inevitably occur due to environmental, human, and mechanical factors. Based on the results of on-site production surveys and information obtained from discussions with defect detection personnel, the main types of defects on guide pair surfaces include scratches, rust spots (i.e., areas not treated by grinding), indentations, and depressions, as illustrated in
Figure 1. This paper summarizes and collates information on the causes and repair methods of these defects and reclassifies the guide rail defects accordingly.
The first category of defects includes burns and rust spots, both of which arise during the grinding of the guide rail surfaces. Burns are caused by high temperatures acting on the guide surface during grinding, leading to discoloration and hardening. Rust spots result from an improper entry angle of the guide rail on the machining tool, which leaves parts of the surface with the original material's roughness and irregularities. These defects can be repaired by secondary grinding if minor, but severe cases may require disposal.
The second category is scratches, usually caused by the guide surface coming into contact with hard objects during machining, transportation, or installation. Inappropriate cleaning methods or the use of unclean cloths or tools can also lead to scratches. Based on their size, scratches are further classified into large and small scratches. To remove large scratches and restore the geometric accuracy of the guide, professional grinding equipment and localized grinding techniques are required. Small scratches can be smoothed out manually using fine sandpaper or grinding paste.
Indentations and depressions, comprising the final category, are both results of impacts and collisions with objects. The distinction between the two lies in the fact that depressions are usually caused by heavy or sharp objects, resulting in deep holes, whereas indentations often occur during transportation from being squeezed or hit by heavy objects, forming large uneven areas. Depressions are generally difficult to repair, while indentations may need grinding and polishing depending on the area and degree of unevenness.
This paper also categorizes defects in this manner and proceeds with descriptions of model training and defect segmentation experiments.
2.2. Image Acquisition of Rail Surface Defects
Due to their polished finish, linear guide rail surfaces exhibit high smoothness. When light strikes a metal surface obliquely, part of it is reflected while the remainder is absorbed or transmitted, as described by the Fresnel equations. For smooth metal surfaces with high reflectivity, the reflected light is highly concentrated, forming bright specular highlight regions. As the guide rail moves along the production line, these highlight regions show noticeable positional changes.
In addition, minor surface undulations or defects can scatter and diffuse the incident light, introducing subtle details and textures into the overall appearance. As shown in
Figure 2, an adjustable dome light source is employed in this study to construct an image acquisition system for guide rail surface defect inspection. As illustrated in
Figure 3, because the top and bottom surfaces of the guide rail are flat while the side surfaces contain grooves, different illumination modes are configured to suppress specular reflection noise, and defects with unclear edge imaging are collected separately.
2.3. Analysis of Detection Challenges
As illustrated in
Figure 3, the image shows defects on the top and side surfaces of the guide rail, with the defects highlighted in red. The guide rail surface defects exhibit the following characteristics:
- (1)
Some defects, such as scratches, have an extremely high aspect ratio.
- (2)
The defect types are varied, with diverse shapes and sizes.
- (3)
Small-target defects are prevalent and are highly susceptible to interference from specular noise.
These characteristics pose the following challenges for deep learning algorithms:
- (1)
Guide rails often operate in mechanically complex industrial environments, where their surfaces may be affected by various external factors, such as dust, reflections, or shadows. These factors can significantly degrade image quality and the reliability of detection algorithms. Algorithms therefore need strong robustness and must be capable of distinguishing actual defects from background noise.
- (2)
Guide rail defects vary greatly in size, from tiny cracks to larger pits or scratches, meaning that detection algorithms need to function effectively across different scales. Although multi-scale feature fusion strategies can help improve model adaptability, designing this fusion process to optimize performance and accuracy remains a technical challenge.
- (3)
Standard convolutional neural networks typically struggle to recognize elongated scratches and small-target defects effectively, as the features of these defects are subtle and inconspicuous relative to the entire image. The linear characteristics of elongated scratches require the network to capture and correctly interpret their direction and length, whereas traditional convolution kernels may struggle to adapt to such highly oriented structures.
2.4. Dataset Details and Annotation
To ensure the reliability and reproducibility of the experiments, we constructed the guide rail surface defect dataset. The original dataset consists of 3075 images. To increase data diversity and improve model generalization, data augmentation techniques, including random cropping, horizontal flipping, and noise injection, were applied, expanding the dataset to a total of 6000 images. The augmented dataset was then divided into training, validation, and test sets with a ratio of 7:2:1.
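For reproducibility, the 7:2:1 partition of the 6000 augmented images can be sketched as follows. This is a minimal illustration with placeholder file names and an assumed fixed random seed, not the exact procedure used to build the GSD dataset:

```python
import random

def split_dataset(items, ratios=(0.7, 0.2, 0.1), seed=42):
    """Deterministically shuffle and split items into train/val/test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * ratios[0])
    n_val = round(len(items) * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# 6000 augmented images -> 4200 / 1200 / 600 (placeholder names)
images = [f"img_{i:04d}.png" for i in range(6000)]
train, val, test = split_dataset(images)
print(len(train), len(val), len(test))  # 4200 1200 600
```

Fixing the seed ensures the same partition is obtained on every run, which is what makes reported results comparable across experiments.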
All images were annotated using the labelme tool. To ensure annotation quality, a “cross-check” mechanism was employed: three researchers performed the initial annotation, followed by a review from a senior expert. The Intersection over Union (IoU) threshold for inter-annotator agreement was set to 0.85. The statistical distribution of defect categories is detailed in
Table 1.
3. YOLOv8 Instance Segmentation Model Based on Deformable Convolution Networks and Multi-Scale Feature Fusion
As discussed in
Section 2.3, guide rail surface defect inspection involves multiple challenges related to imaging conditions, defect scale variation, and geometric complexity. In this work, the first challenge—robustness against environmental interference such as reflections, shadows, and background noise—is partially alleviated at the data acquisition level through the use of a dome light source and multi-illumination imaging strategies during dataset construction. This data-driven design improves image consistency and reduces illumination-induced artifacts, providing a more reliable foundation for subsequent model learning.
However, the remaining challenges, namely large-scale variation among defects and the difficulty of accurately capturing elongated scratches and small-target defects, cannot be sufficiently addressed through data design alone. These issues are inherently related to the feature extraction and representation capability of the segmentation model. In particular, guide rail defects range from tiny cracks to large pits and scratches, requiring the model to maintain both fine-grained detail sensitivity and cross-scale semantic consistency. Meanwhile, elongated scratches exhibit strong directional characteristics and high aspect ratios, which are difficult to model effectively using standard convolutional neural networks with fixed sampling grids.
As a representative one-stage instance segmentation framework, YOLOv8-seg achieves a favorable balance between accuracy and efficiency. Nevertheless, its backbone and feature fusion mechanisms are primarily designed for generic object instances and are not explicitly optimized for small, subtle, or highly elongated defect patterns commonly observed on guide rail surfaces. Consequently, directly applying YOLOv8-seg may lead to incomplete feature extraction for linear defects and insufficient representation of small-scale defect details. To address these limitations, we redesign the YOLOv8-seg architecture in a task-driven manner, focusing on enhanced geometric adaptability and multi-scale feature representation, as described in the following subsections.
3.1. Design Motivation of the Proposed Architecture
The overall architecture of this network model is illustrated in
Figure 4. To enhance the ability of the YOLOv8 backbone network to extract semantic features of irregular and elongated defects, the concept of deformable convolution networks is applied to modify the C2f module in the backbone, thereby strengthening the model's capability for geometric transformation modeling. To prevent small-target defect features from being overlooked during feature fusion, a new feature fusion network architecture, MTC-FPN, is proposed. This architecture consists of two parts. (1) The MSF (Multi-scale Feature Fusion) and TFE (triple-feature encoder) modules: the MSF module fuses the semantic information of the multiple feature maps of different sizes output by the backbone network and outputs multi-scale feature information, while the TFE module captures fine local information within the feature maps. Together, these modules enhance the feature extraction network's ability to obtain multi-scale feature information and fine feature details. (2) The CPAM (Channel and Position Attention Mechanism): it captures channel information from the feature maps output by the TFE module through a channel attention mechanism to calibrate channel weights, thereby enhancing important channel features and suppressing unimportant ones; its position attention mechanism then integrates global feature information with local fine details, enhancing the model's focus on small-target defects and minor scratches. The feature maps output by MTC-FPN are processed by the YOLOv8 decoupled segmentation head to generate guide rail surface defect classification boxes and segmentation masks. The post-processing includes (1) box decoding, which converts the encoded positional information output by the network into actual bounding box coordinates corresponding to the original image [
20]; (2) Non-Maximum Suppression (NMS), which filters overlapping bounding boxes to ensure that only the single best detection result is retained per object; and (3) mask generation, where the network predicts a mask for each detected object region, indicating which pixels belong to the object [
21].
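The NMS step described above can be illustrated with a minimal greedy sketch in NumPy. This is a simplified, class-agnostic version for intuition only; the actual YOLOv8 implementation additionally handles score thresholds, class-aware suppression, and batched inputs:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first and is suppressed
```

The second box (IoU ≈ 0.68 with the first) exceeds the threshold and is discarded, so only one detection per defect survives.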
3.2. Adaptive Deformable Feature Extraction for Irregular Industrial Defects
To enhance the geometric adaptability of the backbone for elongated and direction-sensitive defects, we introduce the DCNv3 deformable convolution into the YOLOv8 backbone, as illustrated in
Figure 4. Specifically, the standard 3 × 3 convolution in the Bottleneck of the C2f module is replaced with a deformable convolution layer. This design allows the convolution kernel to dynamically adjust its sampling locations according to the spatial distribution of features, thereby enhancing the alignment between convolutional receptive fields and irregular defect structures.
Deformable convolution networks (DCNs), originally proposed by Dai et al. [
22], extend standard convolution by learning spatial offsets for each sampling point. This mechanism enables the convolution kernel to perform adaptive, non-uniform sampling, which significantly improves the modeling of geometric deformations. Compared with conventional CNNs, deformable convolutions can achieve a more flexible receptive field with fewer layers, making them particularly suitable for capturing thin cracks, scratches, and other irregular industrial defects.
Among existing deformable convolution variants, this work adopts DCNv3 rather than earlier versions (DCNv1 or DCNv2) for two key reasons. First, DCNv3 introduces group-wise weight sharing and softmax-normalized modulation scalars, which enable more stable and efficient spatial aggregation across long-range dependencies. This property is crucial for guide rail defects, where cracks may extend over large spatial regions and require consistent feature modeling along their entire length. Second, DCNv3 significantly reduces computational redundancy compared to previous deformable convolutions, allowing its integration into the backbone without compromising real-time detection performance. These advantages make DCNv3 more suitable for industrial inspection scenarios that demand both accuracy and efficiency [
23].
DCNv3 can be mathematically expressed as follows:

$$y(p_0) = \sum_{g=1}^{G} \sum_{k=1}^{K} w_g \, m_{gk} \, x_g(p_0 + p_k + \Delta p_{gk}) \tag{1}$$

where $G$ represents the total number of groups in the spatial aggregation process; $K$ represents the total number of sampling points; $w_g$ represents the position-independent projection weight of the $g$-th group; $m_{gk}$ represents the modulation scalar for the $k$-th sampling point in the $g$-th group, normalized through the softmax function along the dimension $K$; $x_g$ represents the sliced input feature map of the $g$-th group; $p_k$ represents the $k$-th position of the predefined grid sampling; and $\Delta p_{gk}$ represents the offset relative to the grid sampling position $p_k$ in the $g$-th group.
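The DCNv3 aggregation can be sketched numerically for a single output position. This is a simplified illustration, not the production kernel: the per-group projection weights are reduced to scalars, only one group is shown, and bilinear sampling implements the fractional offsets:

```python
import numpy as np

def bilinear_sample(x, py, px):
    """Bilinearly sample a (C, H, W) map at fractional location (py, px)."""
    C, H, W = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    wy, wx = py - y0, px - x0
    def pixel(yy, xx):
        return x[:, yy, xx] if 0 <= yy < H and 0 <= xx < W else np.zeros(C)
    return ((1 - wy) * (1 - wx) * pixel(y0, x0)
            + (1 - wy) * wx * pixel(y0, x0 + 1)
            + wy * (1 - wx) * pixel(y0 + 1, x0)
            + wy * wx * pixel(y0 + 1, x0 + 1))

def dcnv3_point(x_groups, w, offsets, logits, p0, grid):
    """y(p0) = sum_g sum_k w_g * m_gk * x_g(p0 + p_k + dp_gk),
    where m_gk is a softmax over the K sampling points of each group."""
    G, K = logits.shape
    m = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # modulation scalars
    y = 0.0
    for g in range(G):
        for k in range(K):
            py = p0[0] + grid[k][0] + offsets[g, k, 0]  # grid point + learned offset
            px = p0[1] + grid[k][1] + offsets[g, k, 1]
            y = y + w[g] * m[g, k] * bilinear_sample(x_groups[g], py, px)
    return y

# One group, a 3x3 grid, zero offsets, uniform modulation, constant input map:
grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
y = dcnv3_point(x_groups=[np.ones((1, 3, 3))], w=np.array([2.0]),
                offsets=np.zeros((1, 9, 2)), logits=np.zeros((1, 9)),
                p0=(1, 1), grid=grid)
print(y)  # -> [2.]
```

With zero offsets this reduces to an ordinary 3 × 3 convolution; nonzero offsets let the kernel follow an elongated crack instead of a fixed square neighborhood.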
3.3. Task-Driven Multi-Scale Feature Enhancement for Linear Defect Segmentation
The feature extraction network plays a critical role in instance segmentation for linear guide rail surface defects, as it directly determines the model’s ability to perceive fine-grained, elongated, and densely distributed defect patterns. Unlike generic object detection tasks, guide rail defects—such as cracks, scratches, and wear marks—exhibit strong anisotropy, irregular geometry, and significant scale variation, with small defects often appearing densely clustered or partially overlapping. These characteristics impose higher demands on spatial adaptability, multi-scale representation, and detail preservation during feature extraction.
The original YOLOv8 employs an FPN-PAN structure [
24] to aggregate multi-scale features through simple summation or concatenation. While effective for common object detection scenarios, this strategy primarily emphasizes hierarchical feature propagation and lacks explicit modeling of cross-scale dependencies. Consequently, it struggles to preserve fine spatial details and discriminative cues for small and densely distributed defect targets, which are critical in industrial inspection scenarios.
To address these limitations, this paper proposes a task-driven MTC-FPN (MSF–TFE–CPAM) feature extraction network tailored for guide rail surface defect segmentation, as illustrated in
Figure 5. Rather than simply integrating existing modules, each component is designed to address a specific challenge inherent to linear industrial defects.
First, the MSF module is designed to explicitly model cross-scale feature interactions. Unlike traditional pyramid fusion strategies that perform pairwise or hierarchical fusion, MSF employs three-dimensional convolution to jointly process feature maps from multiple pyramid levels (P2–P5). This design enables simultaneous aggregation of semantic information from deep layers and fine-grained spatial details from shallow layers, which is crucial for maintaining the structural continuity of linear defects across scales.
Second, the TFE module focuses on selectively enhancing detailed spatial features in large-resolution feature maps. In industrial defect images, small defects are often embedded in complex backgrounds and may be suppressed during downsampling. The TFE module compensates for this effect by reinforcing local detail representations, thereby improving the detection and segmentation of densely overlapping small defect targets. The specific configuration of the TFE module is optimized to balance detail enhancement and noise suppression, ensuring that fine defect features are preserved without introducing excessive background interference.
Finally, the CPAM adaptively integrates the multi-scale features generated by the MSF module with the refined detail features from the TFE module. By jointly modeling channel-wise importance and spatial positional relevance, the CPAM enhances feature discrimination and promotes more effective feature fusion.
Through the coordinated design of DCNv3-based adaptive convolution, task-oriented multi-scale fusion, and targeted detail enhancement, the proposed MTC-FPN forms a unified and complementary feature extraction framework. This design goes beyond simple engineering integration and provides a principled solution for capturing the complex geometric structures and scale diversity characteristic of linear guide rail surface defects.
3.3.1. MSF
The Multi-scale Feature Fusion (MSF) module is primarily responsible for integrating global or high-level semantic information from multi-scale images [
25]. In processing images of small-target defects, global and high-level semantic information is crucial for accurate identification and segmentation. The MSF module enhances the model’s ability to capture and utilize this information by fusing features from different scales, thereby improving segmentation accuracy. As illustrated in
Figure 6, feature maps of large, medium, and small scales output by the YOLOv8 backbone network are processed through convolution operations to standardize the number of channels across multi-scale feature maps; this is followed by bilinear interpolation upsampling to reshape the feature maps to the same size. After upsampling, the feature map information is stacked and sent together into a 3D convolution to combine multi-scale features, ultimately outputting richer semantic information of defects.
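A shape-level sketch of this fusion is given below. Assumptions for brevity: channel counts are already standardized, nearest-neighbor upsampling stands in for bilinear interpolation, and the 3D convolution is reduced to a 1 × 1 × 1 kernel written as an einsum:

```python
import numpy as np

def upsample_nearest(x, size):
    """Nearest-neighbor upsampling of a (C, H, W) map to (C, size, size)."""
    _, H, W = x.shape
    ys = np.arange(size) * H // size
    xs = np.arange(size) * W // size
    return x[:, ys][:, :, xs]

def msf_fuse(p_small, p_mid, p_large, kernel):
    """Rescale three pyramid levels to the largest resolution, stack them along
    a scale axis, and fuse with a 1x1x1 3D convolution expressed as an einsum."""
    size = p_large.shape[1]
    stack = np.stack([upsample_nearest(p_small, size),
                      upsample_nearest(p_mid, size),
                      p_large])                          # (S=3, C, H, W)
    return np.einsum('osc,schw->ohw', kernel, stack)     # (C_out, H, W)

p_small, p_mid, p_large = np.ones((2, 4, 4)), np.ones((2, 8, 8)), np.ones((2, 16, 16))
fused = msf_fuse(p_small, p_mid, p_large, kernel=np.ones((3, 3, 2)))
print(fused.shape)  # (3, 16, 16)
```

The key point is that the 3D convolution mixes information across the scale axis jointly, rather than fusing pyramid levels pairwise.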
3.3.2. TFE
The triple-feature encoder (TFE) module is a feature fusion mechanism proposed to improve visual recognition of densely overlapping small objects [
26]. This module magnifies images to compare changes in shape or appearance at different scales, focusing on capturing the local fine details of small targets. In the feature network, the TFE module divides features into three sizes: large, medium, and small. The original feature pyramid network in YOLOv8 does not perform additional processing on large-size feature maps but upsamples small-size feature maps and combines them with the previous layer’s features. The TFE module, on the other hand, utilizes the rich detail information from large-size feature maps. As shown in
Figure 7, before feature map fusion, the feature channel numbers are adjusted to ensure consistency among large, medium, and small feature maps, achieved through a convolution network; then, the large-size feature maps undergo downsampling through a mixed structure of maximum pooling and average pooling. This downsampling method preserves the most critical feature information while maintaining the integrity of the features, increasing the effectiveness and diversity of small-target defect images; subsequently, the small-size feature maps are upscaled using bilinear interpolation, with the final output being the merged feature map.
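The TFE resizing scheme can be sketched as follows. This is a simplified illustration (channel counts assumed already consistent, nearest-neighbor upsampling in place of bilinear interpolation), showing the mixed max/average pooling downsample and the channel-wise merge:

```python
import numpy as np

def mixed_pool_down(x, k):
    """Downsample a (C, H, W) map by factor k using a max/avg pooling mixture,
    preserving peak responses while keeping overall feature statistics."""
    C, H, W = x.shape
    blocks = x.reshape(C, H // k, k, W // k, k)
    return 0.5 * blocks.max(axis=(2, 4)) + 0.5 * blocks.mean(axis=(2, 4))

def upsample_nearest(x, size):
    """Nearest-neighbor upsampling of a (C, H, W) map to (C, size, size)."""
    _, H, W = x.shape
    return x[:, np.arange(size) * H // size][:, :, np.arange(size) * W // size]

def tfe_merge(p_large, p_mid, p_small):
    """Bring all three maps to the middle resolution, then concatenate channels."""
    size = p_mid.shape[1]
    down = mixed_pool_down(p_large, p_large.shape[1] // size)
    up = upsample_nearest(p_small, size)
    return np.concatenate([down, p_mid, up], axis=0)

merged = tfe_merge(np.ones((2, 8, 8)), np.ones((2, 4, 4)), np.ones((2, 2, 2)))
print(merged.shape)  # (6, 4, 4)
```

Unlike a plain FPN pathway, the large-resolution map contributes explicitly here, so fine detail is carried into the merged representation rather than discarded.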
3.3.3. CPAM
The Channel and Position Attention Module (CPAM) is a network that integrates both the Channel and Position Attention Mechanisms, enhancing the model’s focus on both the feature and positional information of defects.
Regarding the feature network of this study, the overall network architecture of the CPAM is shown in
Figure 8. In the CPAM, the Channel Attention Module first receives inputs from the TFE module and processes each channel of the input feature map TFE_OUT with global average pooling. This step compresses each feature map of dimensions C × H × W (where C is the number of channels, and H and W are the height and width, respectively) into a vector of length C, generating a global feature descriptor for each channel. These channel descriptors are then processed through a one-dimensional convolution layer whose kernel size dynamically adjusts based on the number of channels. This step captures dependencies between local channels while maintaining their individuality. The outputs of the one-dimensional convolution are then transformed into channel weights through two fully connected layers and a nonlinear Sigmoid function. The two fully connected layers are designed to capture nonlinear cross-channel interactions, with the channel weight formula shown in Equation (2), where $x_{ji}$ represents the influence of the $i$-th channel on the $j$-th channel, and $A_i$ and $A_j$ represent the $i$-th and $j$-th channels in the TFE_OUT feature map, respectively. Each element in CAM_OUT, denoted as $E_j$, is derived from Equation (3), with $\beta$ representing the channel attention weight coefficient, which is initially set to zero and adjusted through neural network learning. Finally, these channel weights are multiplied by the original input feature map for channel recalibration, enhancing important channel features and suppressing unimportant ones.
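The channel recalibration pipeline can be sketched compactly. As a deliberate simplification, the adaptive 1D-convolution kernel size and the two fully connected layers are collapsed into a single fixed local 1D convolution; all names are illustrative:

```python
import numpy as np

def channel_recalibrate(feat, kernel):
    """Simplified channel attention: global average pooling per channel, a local
    1D convolution across neighboring channels, a sigmoid gate, then rescaling."""
    desc = feat.mean(axis=(1, 2))                    # (C,) global channel descriptors
    mixed = np.convolve(desc, kernel, mode='same')   # local cross-channel interaction
    weights = 1.0 / (1.0 + np.exp(-mixed))           # sigmoid channel gates in (0, 1)
    return feat * weights[:, None, None], weights

feat = np.stack([np.zeros((2, 2)), 4.0 * np.ones((2, 2))])  # C=2, H=W=2 toy input
out, w = channel_recalibrate(feat, kernel=np.array([1.0]))
print(w)  # the gate for the all-zero channel is exactly 0.5
```

Channels with strong average responses receive gates near 1 and are preserved, while weakly responding channels are attenuated, which is the intended recalibration effect.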
The feature map outputs from the Channel Attention Module are combined with the multi-scale feature maps from the MSF module and then fed into the Position Attention Mechanism. Initially, convolution operations are applied to the input multi-scale and detailed feature maps to generate three feature matrices: Query (Q), Key (K), and Value (V). The Q and K matrices are used to compute relationships between different positions and generate an attention map (with dimensionality reduction to decrease computational load), while the third matrix (V) retains the original number of channels to ensure consistency in the output dimensions. Subsequently, a softmax activation function calculates attention scores between different positions to adjust feature representations at each location and generates a position attention weight matrix X. In X, each element’s value represents the influence of one position on another. In the feature map, the more similar the feature representations of two positions, the greater their correlation. The calculation formula is shown in Equation (4).
The resulting position attention weight map is multiplied by the feature map B, reshaped to match the dimensions of A, and then added to A to produce the final output feature map PAM_OUT. Each element in PAM_OUT, denoted as $E_j$, is calculated as shown in Equation (5), where $\alpha$ represents the position attention weight coefficient, which is initially set to zero and adjusted through feature self-regulation learned by the neural network.
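The Q/K/V position attention described above can be sketched in NumPy. This is a minimal illustration under stated assumptions (1 × 1 projections given as plain matrices, a fixed residual coefficient instead of a learned one, and no dimensionality reduction):

```python
import numpy as np

def position_attention(feat, Wq, Wk, Wv, alpha=0.5):
    """Position attention over a (C, H, W) map: projections give Q, K, V; a
    softmax over pairwise position similarities re-weights V; the result is
    scaled by a coefficient (learnable in the real network) and added back."""
    C, H, W = feat.shape
    flat = feat.reshape(C, H * W)                  # N = H*W spatial positions
    Q, K, V = Wq @ flat, Wk @ flat, Wv @ flat
    energy = Q.T @ K                               # (N, N) position-to-position similarity
    energy -= energy.max(axis=1, keepdims=True)    # numerical stability
    attn = np.exp(energy)
    attn /= attn.sum(axis=1, keepdims=True)        # softmax over source positions
    out = alpha * (V @ attn.T) + flat              # residual re-weighting, Eq. (5)-style
    return out.reshape(C, H, W), attn

rng = np.random.default_rng(0)
feat = rng.normal(size=(2, 3, 3))
Wq, Wk = rng.normal(size=(1, 2)), rng.normal(size=(1, 2))
out, attn = position_attention(feat, Wq, Wk, Wv=np.eye(2))
print(attn.shape)  # (9, 9); every row sums to 1
```

With the coefficient at zero (its initial value in the paper's formulation), the module reduces to an identity mapping, so attention is introduced gradually during training.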
In summary, the CPAM employs a channel attention mechanism that avoids dimensionality reduction, capturing local cross-channel interactions by considering each channel and its k nearest neighbors. This is achieved through a 1D convolution of size k, where k defines the scope of local cross-channel interactions, assigning an appropriate weight to each channel. The Position Attention Module then receives the channel attention output combined with the MSF output as a single input. Horizontal and vertical pooling operations on the input feature map, along its width and height dimensions, yield feature maps that focus on horizontal and vertical spatial structures. These directional feature maps are then encoded separately and finally merged to produce the output, facilitating the extraction of key positional information for each defect.
4. Segmentation Experiments and Experimental Analysis
4.1. Evaluation Metrics
In object segmentation tasks, it is often necessary to consider the degree of overlap between the predicted results and the true annotations, namely the Intersection over Union (IoU). IoU calculates the ratio of the intersection area to the union area between the predicted and true bounding boxes and is a critical metric for measuring prediction accuracy. Depending on different IoU threshold values, we can calculate AP values at various levels of strictness, such as AP50 (AP at an IoU threshold of 0.5). A more comprehensive mAP is the average of AP values calculated at IoU thresholds ranging from 0.5 to 0.95 in steps of 0.05 [
27,
28,
29].
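The IoU computation and the threshold sweep behind mAP can be sketched as follows; this is a minimal illustration for axis-aligned boxes (segmentation masks use the same intersection-over-union ratio computed over pixel sets).

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# IoU thresholds averaged for mAP50:95 -> 0.50, 0.55, ..., 0.95
THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]
```

AP50 counts a prediction as correct when its IoU with a ground-truth instance is at least 0.5; the stricter averaged metric repeats the matching at each of the ten thresholds above.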
In instance segmentation tasks, FPS (Frames Per Second) and Param (the number of model parameters) are crucial indicators of an algorithm’s real-time performance and resource footprint, especially in the field of industrial defect detection [
30]. An algorithm’s ability to rapidly process images or video streams and provide accurate real-time defect detection results significantly affects production efficiency. As detection speed is closely related to computer hardware, the speed metrics mentioned in this paper are all measured under the hardware environment shown in
Table 1.
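As a concrete illustration of how such speed figures are obtained, the sketch below times repeated single-image inference calls. The function name and warm-up scheme are our illustrative assumptions, not the paper's exact benchmarking code.

```python
import time

def measure_fps(infer, n_warmup=5, n_runs=50):
    """Average frames per second of a single-image inference callable.

    infer is any zero-argument function that runs one forward pass; warm-up
    iterations are excluded so one-off initialisation cost is not counted.
    """
    for _ in range(n_warmup):
        infer()
    start = time.perf_counter()
    for _ in range(n_runs):
        infer()
    elapsed = time.perf_counter() - start
    return n_runs / elapsed
```

Since the result depends directly on the hardware executing `infer`, FPS values are only comparable when measured in the same environment, which is why the hardware configuration is fixed and reported.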
In summary, this paper evaluates the training results using metrics such as Box mAP50, Mask mAP50, F1-score, and FPS. Box mAP reflects the precision of defect classification and localization, while Mask mAP indicates segmentation accuracy. Together, these metrics characterize the model’s performance in object detection and instance segmentation from several dimensions, providing a sound basis for optimizing model performance. In addition, we examine the performance differences exhibited by the model across defect categories to analyze its strengths and weaknesses in finer detail. This multi-dimensional evaluation gives a more complete picture of the model’s behavior and lays a solid foundation for subsequent research and improvement.
4.2. Experimental Environment
In this study, all experiments, including model training, ablation studies, and defect segmentation experiments, were conducted on the same server. The programming language used for the experiments was Python, and the main development framework for instance segmentation was PyTorch. Detailed hardware specifications of the experimental environment are provided in
Table 2, while software specifications can be found in
Table 3.
In addition, the training-related hyperparameters, including the optimizer type, learning rate, batch size, and training epochs, were carefully configured to ensure stable model convergence. The specific parameter settings used in all experiments are reported in
Table 4.
4.3. Comparative Experiments Before and After Improvements
As shown in
Figure 9, this paper compares the segmentation effects of the original YOLOv8 instance segmentation model and the improved YOLOv8 instance segmentation model presented in this study.
Compared to the original YOLOv8 instance segmentation model, the improved YOLOv8 model presented in this paper fits defect boundaries more closely, producing smoother and more accurate segmentation contours (
Figure 9c,d). The classification accuracy of the improved model is also more precise compared to the original YOLOv8 model (
Figure 9g,h), and there is an improvement in the detection of small-target defects (
Figure 9k,l,o,p). However, there are still some shortcomings in the improved instance segmentation model when applied to the segmentation of guide rail surface defects. The first shortcoming is the tendency for segmentation overlap errors when two defects are too close together (
Figure 9h,k); an example of this is when two scratches on the right side of the guide rail surface are close together, resulting in the two defects being segmented as one. The second shortcoming is the segmentation precision; when the defects are elongated or irregular, the segmented defect contours are larger than the defects themselves (
Figure 9h,p). The third shortcoming is the insufficient segmentation of areas with severe specular noise, such as deep indentations (
Figure 9t).
4.4. Ablation Studies
To verify the improvements of the two modifications to the original YOLOv8 instance segmentation model for segmenting guide rail surface defects, ablation experiments were conducted. These experiments involved training modified models with only changes in the backbone network and only changes in the feature extraction network, under the same training conditions, to validate their effectiveness. The experimental results are shown in
Table 5.
Furthermore, to verify the impact of the improvements discussed in this paper on the segmentation accuracy of the three types of defects, the segmentation results of the three defect types during the ablation experiments were recorded and compared with those from the unmodified original YOLOv8 instance segmentation model. The results are presented in
Table 6,
Table 7 and
Table 8.
From the experimental data in
Table 5, we observed the impacts of the two enhancements, C2f_DCNv3 and MTC-FPN, on model performance. Firstly, when C2f_DCNv3 was applied alone, Box and Mask mAP50 improved over the original YOLOv8 model, while GFLOPs slightly decreased and FPS slightly increased. This indicates that C2f_DCNv3 not only enhances segmentation accuracy but also slightly improves computational efficiency. Similarly, when MTC-FPN was applied alone, Box and Mask mAP50 improved significantly; however, the increase in Param led to a notable decrease in FPS, suggesting that MTC-FPN gains its segmentation accuracy at the expense of some processing speed.
When C2f_DCNv3 and MTC-FPN were applied together, the model achieved the best performance in terms of mAP50. Although Param increased, the decrease in FPS was not substantial, indicating that the model maintains a relatively acceptable processing speed while retaining high segmentation accuracy. This result indicates that the proposed design achieves a favorable trade-off between accuracy gains and computational overhead, rather than improving performance through model scaling alone.
Analysis of precision and recall for different types of defects from
Table 6,
Table 7 and
Table 8 reveals valuable insights. The combined application of C2f_DCNv3 and MTC-FPN achieved the best results in almost all cases. Notably, for the second type of defect (scratches) and the third type (dents, indentations, etc.), the gains in precision and recall were particularly pronounced, indicating that the enhanced model segments these defect types substantially better.
This experiment verified that the enhancements C2f_DCNv3 and MTC-FPN are effective in improving the performance of the YOLOv8 instance segmentation model for the task of detecting surface defects on rolling linear guide accessories. These enhancements not only improve the overall segmentation accuracy (mAP50) but also achieve more precise segmentation for different types of defects (improving precision and recall). Although the introduction of MTC-FPN somewhat reduces the processing speed of the model, through reasonable configuration and optimization, we can still maintain high segmentation accuracy while achieving satisfactory real-time performance. These experimental results provide valuable references for deploying and improving the model in practical applications.
4.5. Comparative Experiments and Versatility Studies
To assess the effectiveness of the proposed method for high-precision segmentation of surface defects on rolling linear guide accessories, comparative experiments were conducted using the dataset developed in this study. The results in
Table 9 demonstrate clear advantages of the proposed approach over existing one-stage instance segmentation models.
According to the data in
Table 9, in terms of mAP50, all enhanced YOLO variants outperform the original YOLOv8n-seg, indicating improved feature extraction capabilities in later architectures. The proposed method achieves an mAP50 of 82.1%, comparable to YOLOv11s-seg and approaching the performance of Mask-RCNN [
31]. Under the more stringent mAP0.5:0.95 metric, the proposed model reaches 50.1%, surpassing the YOLOv8 and YOLOv10 [
32] models and remaining close to the best-performing model, YOLOv11s-seg. These results confirm that the proposed method maintains robust segmentation quality under stricter IoU thresholds.
The F1-score further highlights model reliability. Although Mask-RCNN and Mask2Former [
33] achieve the highest scores due to their large-capacity multi-stage or transformer-based designs, the proposed method attains a competitive 74.6%, outperforming all YOLO-based one-stage models. This demonstrates the strong balance between precision and recall enabled by the introduced architectural enhancements.
In terms of inference efficiency, the proposed method achieves an FPS value of 148, substantially faster than Mask-RCNN and Mask2Former, and sufficiently rapid for real-time industrial applications. While the lightweight YOLO variants achieve even higher speeds, they do so at the cost of reduced accuracy, particularly under high-IoU conditions. The proposed method therefore provides a more favorable accuracy–speed trade-off.
The performance gains primarily stem from the architectural modifications tailored to the characteristics of rolling-guide surface defects, which are complex, highly textured, and dominated by small targets. The integration of deformable convolutions enhances adaptability to irregular textures and expands the effective receptive field. A dedicated small-target feature extraction layer mitigates fine-detail loss during downsampling. Furthermore, the multi-scale feature fusion module and triple-feature encoder improve semantic consistency and strengthen both global and local feature representations. The incorporation of channel and spatial attention mechanisms further refines defect localization by emphasizing salient feature and positional information.
To evaluate the generalization capability of the proposed model, additional experiments were conducted on the MSD dataset, publicly available from Peking University [
34]. This dataset includes oil stains, scratches, and spot defects, with the latter two categories containing numerous small targets. The quantitative results are presented in
Table 10 and
Table 11, and visual examples are shown in
Figure 10.
Based on the MSD dataset results presented in
Table 10 and
Table 11, a comparative analysis of various instance segmentation methods was conducted. Across metrics such as mAP50, mAP75, F1-score, and FPS, our method demonstrated significant improvements over YOLOv8n-seg, YOLOv10n-seg, Fast-SCNN [
35], and FDSNet [
35]. Notably, our method excelled in mAP50 and mAP75, achieving 85.8% and 73.4%, respectively, a notable improvement over the other methods. Additionally, in addressing the three categories of defects present in the MSD dataset (oil stains, spots, and scratches), our method achieved good precision and recall, particularly excelling in identifying oil stains and scratches. Overall, our method shows high accuracy and efficiency in the domain of industrial defect detection, demonstrating strong versatility and robustness.
5. Discussion
5.1. Deployment and Feasibility
From an industrial deployment perspective, the proposed method is designed to balance accuracy and efficiency. Built upon a lightweight one-stage segmentation framework, the model introduces task-specific enhancements while maintaining acceptable computational complexity and GPU memory consumption for commonly used industrial GPUs. Although frame-per-second metrics do not fully capture end-to-end system latency, the achieved inference speed is sufficient to support online inspection requirements in typical production lines.
In terms of system integration, the proposed approach can be readily incorporated into existing machine vision pipelines, operating with standard industrial cameras and controlled illumination setups. While fixed single-light configurations may limit robustness in extreme cases, practical deployments can adopt multi-illumination or adjustable lighting strategies to enhance image quality without increasing algorithmic complexity. Overall, the method demonstrates strong feasibility for real-world inspection scenarios where stable performance, moderate resource usage, and ease of integration are critical considerations.
5.2. Limitations and Challenge Scenario Analysis
Although the proposed segmentation model demonstrates strong overall performance, several limitations remain when applied to complex industrial inspection scenarios. These issues outline important directions for subsequent improvements.
5.2.1. Overlapping and Visually Similar Defects
Surface defects on precision linear guides often occur in close proximity, overlap spatially, or share highly similar visual characteristics. In such cases, the proposed model may struggle to distinguish individual defect instances or accurately classify defect types, especially when scratches, pits, and wear marks exhibit comparable textures or when overlapping defects form continuous irregular patterns. This leads to ambiguities in instance separation and category assignment, which can compromise downstream analysis. Enhancing fine-grained feature representation, integrating structural priors, or introducing contrastive learning strategies may help improve discrimination under these challenging conditions.
5.2.2. Necessity of Segmentation for Downstream Measurement
Unlike conventional defect detection tasks, linear guide inspection requires precise morphological analysis of defects, including their length, area, and geometric characteristics. Bounding box-based detection methods cannot provide the pixel-level boundaries needed for such quantitative assessment. Segmentation is therefore essential to support accurate dimensional measurement and subsequent quality evaluation. While the proposed model offers reliable masks, further refinement is needed to ensure consistent contour accuracy for irregular or thin defect shapes.
5.2.3. Limited Robustness Under Complex Surface and Lighting Conditions
The detection accuracy of the proposed model remains influenced by the highly reflective, heterogeneous, and light-sensitive nature of precision guide surfaces, particularly under fixed single-illumination conditions. Variations such as specular highlights, shadows, or low-contrast regions may distort defect appearance, leading to misclassification, missed detections, or imprecise segmentation boundaries.
It should be noted that this limitation primarily manifests under constrained illumination setups. In practical industrial deployments, controlled illumination environments—such as multi-angle or multi-intensity lighting configurations—can be introduced to effectively alleviate these issues without modifying the model architecture. Furthermore, future work may explore illumination-invariant feature modeling and adaptive lighting strategies to enhance robustness under more extreme and dynamically changing conditions.
5.3. Future Research and Practical Applications
Beyond defect localization and segmentation, the proposed method provides a foundation for downstream industrial analysis and decision-making. Pixel-level segmentation enables automatic geometric measurement of surface defects, such as scratch length, defect area, aspect ratio, and spatial distribution, which are critical for quantitative quality assessment and grading in precision guide rail manufacturing. These measurements can be directly integrated into inspection systems to support rule-based or data-driven acceptance criteria.
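The geometric measurements listed above can be derived directly from a predicted binary mask. The sketch below is illustrative (the function name and returned fields are our assumptions, not part of the paper's pipeline), using the bounding box of the mask's foreground pixels.

```python
import numpy as np

def defect_geometry(mask):
    """Basic geometric descriptors of a single defect from its binary mask.

    mask: 2D boolean/0-1 array for one defect instance. Returns the area in
    pixels, bounding-box height and width, and an aspect ratio (>= 1, longer
    side over shorter side) that can help flag elongated defects such as
    scratches.
    """
    ys, xs = np.nonzero(mask)
    area = int(len(ys))                      # pixel count of the defect
    h = int(ys.max() - ys.min() + 1)         # bounding-box height
    w = int(xs.max() - xs.min() + 1)         # bounding-box width
    aspect = max(h, w) / min(h, w)
    return {"area": area, "height": h, "width": w, "aspect_ratio": aspect}
```

Pixel quantities would then be converted to physical units via the camera calibration, and the resulting measurements fed into rule-based or data-driven acceptance criteria.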
From a practical perspective, future work may focus on coupling the segmentation results with automated measurement and reporting modules, enabling closed-loop quality control and traceability. In addition, further research may explore improved robustness under extreme illumination variations, adaptive thresholding for defect severity evaluation, and integration with multi-sensor inspection systems to enhance reliability in complex industrial environments.