Article

An Improved Cascade R-CNN-Based Fastener Detection Method for Coating Workshop Inspection

1 Faculty of Printing, Packaging Engineering and Digital Media Technology, Xi’an University of Technology, Xi’an 710048, China
2 Foxconn Precision Electronics (Zhengzhou) Co., Ltd., Zhengzhou 450000, China
* Author to whom correspondence should be addressed.
Coatings 2026, 16(1), 37; https://doi.org/10.3390/coatings16010037
Submission received: 29 October 2025 / Revised: 12 December 2025 / Accepted: 25 December 2025 / Published: 30 December 2025

Abstract

To address the challenges of small fastener targets, complex backgrounds, and the low efficiency of traditional manual inspection in coating workshop scenarios, this paper proposes an improved Cascade R-CNN-based fastener detection method. A VOC-format dataset was constructed covering three target categories—Marking-painted fastener, Fastener, and Fallen off—which represents typical inspection scenarios of coating equipment under diverse operating conditions and enhances the adaptability of the model. Within the Cascade R-CNN framework, three improvements were introduced: the Convolutional Block Attention Module (CBAM) was integrated into the ResNet-101 backbone to enhance feature representation of small objects; anchor scales were reduced to better align with the actual size distribution of fasteners; and Soft-NMS was adopted in place of conventional NMS to effectively reduce missed detections in overlapping regions. Experimental results demonstrate that the proposed method achieves a mean Average Precision (mAP) of 96.60% on the self-constructed dataset, with both Precision and Recall exceeding 95%, significantly outperforming Faster R-CNN and the original Cascade R-CNN. The method enables accurate detection and missing-state recognition of fasteners in complex backgrounds and small-object scenarios, providing reliable technical support for the automation and intelligence of printing equipment inspection.

1. Introduction

In modern manufacturing, coating equipment serves as a core component in precision production processes such as lithium battery electrode fabrication, optical film manufacturing, and printing [1]. Its operational condition directly affects the stability of production processes and the quality of final products. Screw-type fasteners (hereafter referred to simply as fasteners), as critical mechanical components of coating equipment, play a crucial role in maintaining structural integrity. Their loosening, absence, or improper installation can lead to mechanical vibration, equipment malfunction, and even unexpected shutdowns, severely impacting production efficiency and operational safety [2]. Therefore, achieving efficient and accurate detection and identification of fasteners during inspection has become a key research focus in advancing the intelligent inspection of coating workshops. Traditionally, inspection of coating equipment has relied primarily on manual operation. However, manual inspection suffers from low efficiency, high labor intensity, and limited accuracy—issues that become more pronounced when dealing with large-scale equipment and complex working environments, where fatigue and human error are inevitable [3]. Moreover, with the continuous demand for intelligent and digital transformation in modern manufacturing, traditional manual inspection modes can no longer meet the requirements for real-time monitoring and high reliability in production workshops [4]. Consequently, machine vision-based automatic screw-type fastener detection methods (hereafter referred to simply as fastener detection) have gradually emerged as a research hotspot. These approaches demonstrate significant advantages in detecting fastener absence and abnormalities automatically, providing an innovative solution to enable intelligent and unmanned inspection in coating workshops.
Nevertheless, unlike conventional industrial vision setups, the coating workshop imposes stringent engineering constraints on illumination, making traditional optical enhancement strategies difficult to apply. Conventional industrial illumination schemes—such as coaxial lighting, side lighting, dome illumination, and polarized lighting—typically rely on a fixed light–camera–object geometry and stable environmental conditions. These assumptions do not hold in the robotic inspection scenario: the camera pose varies continuously with the robot trajectory, inspection viewpoints are widely dispersed, and the interior of coating equipment lacks sufficient space for mounting standard illumination modules. In addition, strong metallic reflections, ink mist, surface contamination, and vibration-induced instability lead to highly inconsistent illumination. These constraints fundamentally limit the reliability of hardware-based illumination control and necessitate a detection framework that is intrinsically robust to illumination variation, viewpoint changes, and environmental disturbances.
To address these challenges, existing studies have explored both traditional image-processing methods and modern deep-learning-based detection approaches. Early studies primarily relied on manual inspection or traditional image processing-based detection methods. For instance, approaches based on edge detection [5], color segmentation [6], and geometric feature analysis [7] could identify fasteners under specific conditions, while industrial screw inspection systems also widely adopted classical machine vision techniques, such as reference-based inspection, ROI-based presence checking, template matching, and background-difference detection, as these methods are simple and highly efficient for fixed-position components. However, these methods are often susceptible to noise interference and exhibit limited robustness and accuracy in complex backgrounds or low-light environments, and remain highly sensitive to illumination variation, metallic reflections, background clutter, and partial occlusion, which frequently occur in coating workshop environments. Traditional object detection algorithms generally depend on handcrafted feature extraction and classifier-based learning. Dou et al. [8] proposed a Fast Template Matching (FTM)-based method for bolt localization; Li et al. [9] combined Histogram of Oriented Gradients (HOG) features with Support Vector Machines (SVM) to detect missing and loosened bolts; Lu et al. [10] improved the robustness of HOG features under varying illumination conditions; and Ramana et al. [11] employed the Viola–Jones algorithm to identify bolts in different loosened states, though the method exhibited limited resistance to interference. Min et al. [12] reported that traditional machine vision methods, such as background-difference and template matching, are sensitive to small targets and occlusion in industrial screw inspection. Peralta et al. [13] showed that in robotic screwing tasks using vision and force sensors, traditional methods have reduced reliability for fine-grained status recognition, such as loosening, corrosion, or paint-mark defects. Overall, while these traditional methods perform effectively under constrained conditions, they suffer from poor adaptability and limited generalization in complex industrial environments.
With the rapid progress of convolutional neural networks (CNNs), deep-learning–based object detectors have become the dominant solution for fastener inspection. Representative detection frameworks such as Faster R-CNN [14], the YOLO series [15], SSD [16], and Cascade R-CNN [17] have been widely applied in natural image tasks as well as industrial inspection scenarios, demonstrating substantial improvements in detection accuracy and robustness. Existing detection models can be broadly categorized into one-stage and two-stage approaches. One-stage detectors—including SSD and the YOLO family—offer high efficiency and real-time performance but often suffer from reduced accuracy when dealing with small or densely distributed objects. In contrast, two-stage models such as Faster R-CNN, Mask R-CNN [18], and SPP-Net [19] incur higher computational cost but provide stronger localization accuracy and feature representation, making them more suitable for small-object detection tasks. In the field of fastener inspection, numerous studies have attempted to enhance the performance of deep-learning–based detectors. Zhang et al. [20] improved SSD by incorporating multi-scale feature fusion to strengthen the representation of small targets. Ge et al. [21] adopted ResNet-50 as the backbone for bolt detection on angle steel towers, achieving higher mAP at the cost of slower inference. Luo et al. [22] integrated the TDM module into SSD, significantly improving the detection precision of small bolts in catenary systems. More recent works have focused on YOLO and Faster R-CNN adaptations: Li et al. [23] enhanced missing-bolt recall by optimizing YOLOv5; Wang et al. [24] combined a lightweight YOLOv5s-T with an RGB-D sensor to achieve real-time 3D bolt localization; Zhang et al. [25] embedded image enhancement techniques into Faster R-CNN for defect recognition; and Zhao et al. [26] introduced deep residual and Inception modules into Faster R-CNN while optimizing the region proposal network (RPN) via K-means++, achieving higher detection accuracy.
As an important evolution of the Faster R-CNN framework, Cascade R-CNN improves localization precision through multi-stage cascaded regression and has demonstrated strong performance in small-object detection. However, detecting fasteners in coating workshop environments remains highly challenging. Fasteners are typically small in size and embedded in cluttered background textures, often accompanied by strong metallic reflections that weaken feature discrimination. In addition, partial occlusion and densely distributed targets introduce mutual interference during detection, while standard Non-Maximum Suppression (NMS) further exacerbates missed detections by suppressing closely overlapping predictions. Moreover, the anchor scales used in generic detection frameworks are primarily designed for medium-to-large objects and do not correspond well to the true size distribution of fasteners, resulting in suboptimal localization accuracy. These limitations collectively restrict the robustness and practical deployment of existing methods in real industrial inspection scenarios, underscoring the need for an improved detection approach tailored to the characteristics of fasteners in coating workshops.
Grounded on these research gaps, this study proposes the following hypotheses:
H1. 
Incorporating a CBAM attention mechanism into the backbone network strengthens feature representation for small fasteners and improves detection accuracy.
H2. 
Optimizing anchor scales to align with the true size distribution of fasteners enhances localization precision and reduces missed detections.
H3. 
Replacing standard NMS with Soft-NMS effectively mitigates suppression errors in overlapping or densely distributed fastener regions.
The remainder of this manuscript is organized as follows. Section 2 introduces the proposed fastener detection method tailored for coating workshop inspection, including the overall framework design, the integration of CBAM into the Cascade R-CNN backbone, the optimization of anchor scales, and the adoption of Soft-NMS to enhance detection robustness in densely distributed target regions. Section 3 presents the experimental setup and provides comprehensive evaluations of the proposed approach, including comparative analyses against baseline models and ablation studies that validate the effectiveness of each improvement. Section 4 concludes the paper by summarizing the main findings and discussing the practical implications of the proposed method for intelligent industrial inspection.

2. Fastener Detection Method for Coating Workshop Inspection

2.1. Overall System Architecture

The overall framework of the proposed improved Cascade R-CNN-based fastener detection method is illustrated in Figure 1. It primarily consists of three stages:
(1) Image acquisition and dataset construction: Industrial cameras are used to capture fastener images in real coating workshop environments, covering typical working conditions such as low illumination, complex backgrounds, and partial occlusions. The collected images are manually annotated and categorized into three classes—Marking-painted fastener (with marking paint), Fastener (without marking paint), and Fallen off (missing)—and are organized into a VOC-format dataset for model training and validation.
(2) Model improvement module: Based on the multi-stage detection structure of Cascade R-CNN, the proposed method progressively refines the classification and regression of candidate boxes. To enhance detection accuracy and robustness, a Convolutional Block Attention Module (CBAM) is incorporated to strengthen key region feature representation; the anchor scales are adjusted to better match the actual size distribution of fasteners; and Soft-NMS is employed to replace the conventional NMS, effectively reducing missed detections caused by dense target distribution and local occlusion.
(3) Result output and recognition stage: The trained improved model is used to process the input inspection images, outputting the precise spatial locations (bounding boxes) and status categories of each detected fastener—namely Marking-painted fastener, Fastener, and Fallen off. This enables accurate detection and recognition of fasteners in coating equipment.

2.2. Original Cascade R-CNN Framework

Cascade R-CNN is a two-stage object detection algorithm based on deep learning, proposed by Cai and Vasconcelos in 2018. Its overall architecture is built upon Faster R-CNN, consisting of a backbone network (typically ResNet-50 or ResNet-101) combined with a Feature Pyramid Network (FPN) for multi-scale feature extraction. The Region Proposal Network (RPN) is used to generate candidate bounding boxes, which are then processed through ROI Align to obtain fixed-size feature maps that are fed into the detection heads. The network architecture of Cascade R-CNN is shown in Figure 2. Unlike Faster R-CNN, Cascade R-CNN introduces a three-stage cascaded detection head, where each stage contains both classification and bounding box regression branches. The Intersection-over-Union (IoU) thresholds are progressively increased across stages (e.g., 0.5, 0.6, and 0.7), enabling iterative refinement of candidate boxes. This “cascade with increasing IoU” design effectively mitigates the issues of sample imbalance and low-quality proposals, allowing the predicted bounding boxes to progressively approach the ground-truth object boundaries. Consequently, Cascade R-CNN achieves higher localization accuracy, making it particularly well-suited for small object detection and complex background scenarios.
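As a toy illustration of the "cascade with increasing IoU" design (not the authors' code), the sketch below relabels two hypothetical proposals at the three stage thresholds: the looser proposal counts as positive at 0.5 and 0.6 but is rejected at 0.7, which is exactly the progressively stricter sample selection the cascade relies on.

```python
# Toy illustration (not the authors' code) of the "cascade with
# increasing IoU" idea: each stage relabels proposals as positive or
# negative under a stricter IoU threshold. Box format: (x1, y1, x2, y2).

def iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def assign_labels(proposals, gt, thresholds=(0.5, 0.6, 0.7)):
    """For each cascade stage, a proposal is positive iff its best IoU
    with any ground-truth box reaches that stage's threshold."""
    return [[max(iou(p, g) for g in gt) >= t for p in proposals]
            for t in thresholds]

gt = [(10, 10, 50, 50)]
proposals = [(12, 12, 52, 52),   # tight proposal, IoU ~0.82
             (15, 15, 55, 55)]   # looser proposal, IoU ~0.62
print(assign_labels(proposals, gt))
# [[True, True], [True, True], [True, False]]
```

In the real cascade, each stage also regresses the boxes before the next stage relabels them, so proposals tend to tighten and survive the stricter thresholds.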
However, in the practical application of coating workshop inspection, the original Cascade R-CNN still exhibits certain limitations. First, the feature extraction process lacks an explicit attention mechanism, resulting in insufficient modeling of subtle features under low-illumination conditions. Consequently, the edge features of fasteners are easily obscured by background noise in complex environments. Second, the anchor scales are fixed and cannot be effectively aligned with the actual size distribution of small fastener targets, leading to a lower recall rate for small objects. Finally, the traditional NMS algorithm tends to suppress true positives in cases where fasteners are densely distributed or partially occluded, thereby causing missed detections. These issues constrain the detection accuracy and robustness of the original model in industrial field applications, making it difficult to fully satisfy the high reliability and stability requirements of printing equipment inspection. Therefore, it is essential to optimize the Cascade R-CNN framework in a targeted manner to enhance its detection performance under complex workshop conditions.

2.3. Structure of the Improved Cascade R-CNN

To address the multiple environmental constraints encountered during coating-equipment inspection—including uncontrolled illumination, strong metallic reflections, confined imaging space, variations in robot viewpoints, and the extremely small size of fastener targets—this study proposes a domain-specific enhancement of the Cascade R-CNN architecture. All optimizations are designed based on the physical characteristics of coating equipment and the statistical properties of the collected inspection data, and the overall model structure is illustrated in Figure 3. On this basis, considering the key requirements of small-scale target perception, illumination inconsistency, and densely distributed fasteners in real workshop environments, this study constructs an improved detection framework built upon a ResNet-101 + FPN backbone and incorporates several task-oriented structural refinements. Specifically, a Convolutional Block Attention Module (CBAM) is introduced to enhance critical feature representations under complex backgrounds and reflective surfaces; the anchor scale range is redesigned to better align with the true size distribution of fasteners; and Soft-NMS is applied to the multi-stage regression–classification pipeline to reduce missed detections caused by densely arranged fasteners and partial occlusions. Detailed descriptions of each optimization strategy are provided in the subsequent sections.

2.3.1. Integration of the CBAM Attention Mechanism

Considering that fasteners are typically small-sized targets and prone to being overwhelmed by complex backgrounds, this study incorporates CBAM at selected stages of the ResNet-101 backbone. As illustrated in Figure 4, the CBAMs are specifically inserted after the last Bottleneck Block of Layer2, Layer3, and Layer4, rather than uniformly across all residual blocks, enhancing the mid-to-high-level feature representations for better small target detection. This selective insertion strategy is motivated by empirical observations of the inspection data: mid- and high-level features are more informative for discriminating small reflective components from cluttered industrial backgrounds, while low-level attention incurs disproportionate computational cost for limited benefit.
The CBAM consists of two complementary submodules—the Channel Attention Module (CAM) and the Spatial Attention Module (SAM). The channel attention enhances the network’s response to semantically significant features, while the spatial attention strengthens the perception of local object edges and structural cues, thereby improving detection performance for targets with blurred or indistinct boundaries. The detailed structures of the channel and spatial attention submodules are described as follows, as shown in Figure 5.
(1) Channel Attention Module (CAM)
Given an input feature map $F \in \mathbb{R}^{C \times H \times W}$, two channel descriptors are generated by applying global average pooling and global max pooling, respectively:

$F^{c}_{avg} = \mathrm{AvgPool}(F), \quad F^{c}_{max} = \mathrm{MaxPool}(F)$

where $F \in \mathbb{R}^{C \times H \times W}$ denotes the input three-dimensional feature map, $C$ represents the number of channels, $H$ and $W$ denote the height and width of the feature map, respectively, $F^{c}_{avg}$ refers to the result obtained by performing global average pooling over all spatial positions within each channel, and $F^{c}_{max}$ represents the result obtained through global max pooling across the same spatial dimensions.

The two descriptors are then fed into a shared two-layer fully connected (FC) network. After applying nonlinear activation, their outputs are element-wise added and passed through a sigmoid function to generate the channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$:

$M_c(F) = \sigma\big(W_1(\mathrm{ReLU}(W_0(F^{c}_{avg}))) + W_1(\mathrm{ReLU}(W_0(F^{c}_{max})))\big)$

where $W_0$ and $W_1$ denote the weight parameters of the first and second fully connected layers, respectively, $\sigma(\cdot)$ represents the sigmoid activation function, which constrains the output values to the range $[0, 1]$, and $M_c$ denotes the channel attention map, where each value corresponds to the importance of a specific channel.

Finally, the input feature map is weighted by element-wise multiplication with $M_c$ across all channels to produce the refined feature representation:

$F' = M_c(F) \otimes F$

where $F$ denotes the input feature map, $M_c(F)$ represents the channel attention map, and $F'$ is the intermediate feature map obtained after channel-wise weighting.
(2) Spatial Attention Module (SAM)
The SAM receives the channel-refined feature map $F'$ and performs max pooling and average pooling operations along the channel dimension. The resulting two 2D feature maps are then concatenated to form a unified spatial feature representation:

$F^{s}_{avg} = \mathrm{Mean}(F', \mathrm{dim} = C), \quad F^{s}_{max} = \mathrm{Max}(F', \mathrm{dim} = C), \quad F^{s} = \mathrm{Concat}[F^{s}_{avg}; F^{s}_{max}]$

where $F^{s}_{avg}$ denotes the per-pixel mean across all channels, $F^{s}_{max}$ denotes the per-pixel maximum across all channels, and $\mathrm{Concat}[\cdot]$ indicates concatenation along the channel dimension.

The feature map $F^{s} \in \mathbb{R}^{2 \times H \times W}$ is passed through a $7 \times 7$ convolution operation, followed by a sigmoid activation, to generate the spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$:

$M_s(F') = \sigma(\mathrm{Conv}_{7 \times 7}(F^{s}))$

where $M_s(F')$ represents the importance of each spatial position.

The final output feature map is obtained as:

$F'' = M_s(F') \otimes F'$

where $M_s(F')$ denotes the spatial attention map, $F'$ represents the intermediate feature map after channel weighting, and $F''$ is the final output feature map enhanced jointly by channel and spatial attention.
The above CBAM attention mechanism strengthens the model’s focus on salient regions such as fastener edges and texture details, thereby improving its discriminative capability for small targets under complex background conditions.
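As a rough numerical sketch of this channel-then-spatial computation (illustrative only; the model itself uses a trained module inside the ResNet-101 backbone), the NumPy code below runs both attention stages with random placeholder weights and confirms that the attended output preserves the input shape:

```python
# Minimal NumPy sketch of the CBAM computation described above
# (illustrative only; the paper's module is a trained PyTorch block).
# W0, W1 and the 7x7 kernel are random placeholders, not learned weights.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam(F, reduction=4, seed=0):
    rng = np.random.default_rng(seed)
    C, H, W = F.shape
    # Channel attention: shared FC-ReLU-FC on avg/max channel descriptors
    W0 = rng.standard_normal((C // reduction, C)) * 0.1
    W1 = rng.standard_normal((C, C // reduction)) * 0.1
    mlp = lambda v: W1 @ np.maximum(W0 @ v, 0.0)
    Mc = sigmoid(mlp(F.mean(axis=(1, 2))) + mlp(F.max(axis=(1, 2))))
    F1 = F * Mc[:, None, None]            # F' = Mc (x) F
    # Spatial attention: channel-wise mean/max, 7x7 conv, sigmoid
    s = np.stack([F1.mean(axis=0), F1.max(axis=0)])      # (2, H, W)
    k = rng.standard_normal((2, 7, 7)) * 0.1
    pad = np.pad(s, ((0, 0), (3, 3), (3, 3)))            # same-size conv
    Ms = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            Ms[i, j] = sigmoid(np.sum(pad[:, i:i + 7, j:j + 7] * k))
    return F1 * Ms[None, :, :]            # F'' = Ms (x) F', shape (C, H, W)

out = cbam(np.random.default_rng(1).standard_normal((8, 16, 16)))
print(out.shape)  # (8, 16, 16)
```

Because both attention maps are sigmoid-gated multiplicative weights, CBAM reweights features without changing their dimensions, which is what allows it to be dropped after existing Bottleneck Blocks.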

2.3.2. Anchor Size Optimization

In the Cascade R-CNN framework, detection performance is highly dependent on the quality of the generated candidate boxes. The conventional anchor mechanism is typically biased toward medium- and large-scale targets. However, fasteners in coating equipment are generally small in size and occupy a limited image area. Using the default anchor configuration often leads to poor initial matching between anchors and real targets, which in turn affects the performance of subsequent multi-stage regression and classification. Accordingly, the redesigned anchor configuration is not a simple parameter adjustment but a data-driven reconstruction of the anchor space based on the statistical analysis of fastener sizes in the inspection dataset.
To address this issue, this study optimizes the anchor generation mechanism based on the size distribution characteristics of fasteners in coating workshop inspection scenarios. The anchor scale range is reduced to (8, 16, 32), and the aspect ratios are adaptively adjusted according to the geometric characteristics of the fastener targets. This domain-specific anchor redesign ensures that the receptive-field coverage more accurately matches the physical dimensions of fasteners, which not only increases the coverage of small targets by the initial anchors but also improves the positive–negative sample matching quality under progressively increasing IoU thresholds during the multi-stage regression process. Consequently, the enhanced Cascade R-CNN achieves higher recall rates and localization accuracy for small-object detection tasks.
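A minimal sketch of how a reduced scale set like (8, 16, 32) expands into base anchors; the aspect ratios (0.5, 1.0, 2.0) and the feature-map stride used here are illustrative defaults, not values reported in the paper:

```python
# Sketch of base-anchor generation under the reduced scales (8, 16, 32)
# described above. The aspect ratios (0.5, 1.0, 2.0) and the stride are
# illustrative defaults, not values reported in the paper.

def base_anchors(scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0), stride=4):
    """Return (w, h) anchor shapes; each keeps area = (scale * stride)^2."""
    anchors = []
    for s in scales:
        side = float(s * stride)          # anchor edge length in pixels
        for r in ratios:                  # r interpreted as w / h
            w = side * (r ** 0.5)
            h = side / (r ** 0.5)
            anchors.append((round(w, 1), round(h, 1)))
    return anchors

anchors = base_anchors()
print(len(anchors))  # 9 base anchors: 3 scales x 3 aspect ratios
```

Shrinking the scale set shifts the whole anchor family toward the small boxes that fasteners actually occupy, which is what improves the initial anchor-to-target matching before the cascade refines them.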

2.3.3. Replacing Traditional NMS with Soft-NMS

In fastener detection tasks, the targets are typically small, densely distributed, and partially occluded, with multiple fasteners often appearing in adjacent or highly overlapping positions. Under such conditions, the traditional Non-Maximum Suppression (NMS) algorithm—which relies on a hard-threshold elimination strategy—tends to remove true positive boxes mistakenly. Specifically, when the IoU between a candidate box and the current highest-scoring box exceeds a preset threshold, the candidate is directly discarded. This mechanism frequently leads to missed detections of small or densely clustered targets. Unlike generic applications of Soft-NMS, its integration into every refinement stage of the cascade is specifically motivated by the densely clustered fastener layouts observed in coating-equipment assemblies.
To overcome this limitation, this study introduces Soft-NMS into the post-processing stage of the Cascade R-CNN framework. Unlike the hard binary suppression mechanism of traditional NMS, Soft-NMS adopts a continuous score-decay strategy, where overlapping boxes are not discarded outright but instead have their confidence scores gradually decayed according to their IoU with the top-scoring box. Specifically, candidate boxes with high overlap experience a significant reduction in confidence, while those with low overlap remain largely unaffected. This flexible suppression mechanism allows the network to retain true positives in dense regions while effectively distinguishing between overlapping detections. Consequently, the proposed method improves both recall rate and robustness under complex industrial conditions. The Soft-NMS processing procedure can be summarized as follows:
(1) Rank all candidate boxes in descending order according to their confidence scores, and select the box with the highest score as the current reference box.
(2) For all remaining boxes that overlap with the reference box, update their confidence scores based on the IoU value using a Gaussian decay function defined as:
$s_i = s_i \cdot e^{-\frac{\mathrm{IoU}(M,\, b_i)^2}{\sigma}}$

where $s_i$ denotes the confidence score of the candidate box $b_i$, $\mathrm{IoU}(M, b_i)$ represents the Intersection over Union between the candidate box and the currently retained box $M$, and $\sigma$ is a hyperparameter that controls the rate of score decay, set to 0.5 in this study.
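The two-step procedure above can be sketched in plain Python as follows (illustrative only; $\sigma = 0.5$ as in the paper, while the low score threshold for discarding fully decayed boxes is an assumed implementation detail):

```python
# Pure-Python sketch of Soft-NMS with Gaussian decay, following the
# procedure above (sigma = 0.5 as in the paper). The score threshold
# for dropping near-zero boxes is an assumed implementation detail.
# Box format: (x1, y1, x2, y2).
import math

def iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Decay overlapping scores instead of discarding boxes outright."""
    dets = sorted(zip(boxes, scores), key=lambda d: -d[1])
    kept = []
    while dets:
        M, s_M = dets.pop(0)              # current highest-scoring box
        kept.append((M, s_M))
        decayed = []
        for b, s in dets:                 # Gaussian score decay
            s *= math.exp(-iou(M, b) ** 2 / sigma)
            if s > score_thresh:          # drop only near-zero scores
                decayed.append((b, s))
        dets = sorted(decayed, key=lambda d: -d[1])
    return kept

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
result = soft_nms(boxes, scores)
# Hard NMS at IoU 0.5 would discard the second (overlapping) box;
# Soft-NMS keeps it with a decayed score.
print([round(s, 3) for _, s in result])  # [0.9, 0.7, 0.317]
```

The key difference from hard NMS is visible in the example: the heavily overlapping second box survives with a reduced score instead of being suppressed, which is what preserves true positives among densely packed fasteners.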

3. Experiments and Results Analysis

3.1. Experimental Setup and Dataset Construction

To validate the effectiveness of the proposed method, an experimental platform was established, and a fastener detection dataset tailored for coating workshop inspection scenarios was constructed.

3.1.1. Experimental Environment and Parameter Settings

The model was trained under the PyTorch 1.12.1 framework, utilizing an NVIDIA GeForce RTX 2060 GPU (NVIDIA, Santa Clara, CA, USA) for acceleration. Stochastic Gradient Descent (SGD) was employed as the optimizer, with an initial learning rate set to 0.001 and a 3× learning rate schedule applied. The training batch size was set to 2, and the total number of training epochs was 36. A checkpoint resume mechanism was enabled during training to allow continuation in case of interruptions. The experimental platform in this study is used for offline image acquisition to support algorithm evaluation. As the focus of this work is on algorithmic improvements, system-level engineering integration is not included.

3.1.2. Dataset Construction

To meet the requirements for automated detection of fastener states in coating workshop inspection scenarios, the detection targets were categorized into three classes: Marking-painted fastener (fasteners with marking paint), Fastener (fasteners without marking paint), and Fallen off (missing fasteners). In practical coating-workshop inspection, the current manual maintenance procedure also relies primarily on assessing the state of marking-painted fasteners. Each piece of coating equipment contains an extremely large number of fasteners—often 7000–10,000 on a single machine—which makes full inspection impractical. Therefore, only fasteners with marking paint are used as state indicators, since the paint line provides a reliable visual reference for judging rotation or loosening. Fasteners without marking paint, although still important structural components, cannot provide such cues and are thus only detected as objects rather than being subjected to state analysis. This practical requirement motivated the three-class detection scheme described above.
During data acquisition, a Hikvision MV-CH120-10UC industrial camera (Hikvision, Hangzhou, China) was used to capture the operational process of coating equipment under various working conditions, covering typical scenarios such as low illumination, ink contamination, complex backgrounds, and partial occlusion, ensuring both diversity and representativeness of the dataset. Specifically, among the 4500 collected images, approximately 22% exhibit low illumination, 35% contain complex backgrounds including ink contamination, reflective metallic surfaces, dust, and paper textures, and about 18% contain occlusions. The low-illumination images were pre-processed with brightness and contrast enhancement before detection and recognition, ensuring consistent input for the model while retaining real-world lighting diversity. Subsequently, LabelImg version 1.8.6 was used to manually annotate each collected image, employing rectangular bounding boxes to precisely mark the positions and categories of the fasteners across the three defined classes. To ensure annotation accuracy, all images were independently labeled by two experienced annotators and cross-checked, with multiple rounds of review and correction conducted for small targets. Examples of fastener annotations are shown in Figure 6.
After annotation, the dataset was organized in PASCAL VOC format and split into training, validation, and test sets with a ratio of 70%:15%:15%. The final dataset comprised approximately 4500 images, providing high-quality data support for subsequent model training and performance evaluation.
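A reproducible 70%/15%/15% split of the kind described above can be sketched as follows (the file names and the random seed are hypothetical, not taken from the paper):

```python
# Sketch of the 70%/15%/15% train/validation/test split described above,
# with a fixed seed for reproducibility. File names and the seed are
# hypothetical, not taken from the paper.
import random

def split_dataset(items, ratios=(0.70, 0.15, 0.15), seed=42):
    items = list(items)
    random.Random(seed).shuffle(items)    # deterministic shuffle
    n_train = int(len(items) * ratios[0])
    n_val = int(len(items) * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

images = [f"img_{i:04d}.jpg" for i in range(4500)]
train, val, test = split_dataset(images)
print(len(train), len(val), len(test))  # 3150 675 675
```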

3.2. Evaluation Metrics and Comparative Methods

To comprehensively assess the performance of the proposed improved Cascade R-CNN method, Faster R-CNN (ResNet-101-FPN, baseline) and the original Cascade R-CNN (ResNet-101-FPN) were selected as comparison models for the fastener detection task in the coating workshop.

3.2.1. Evaluation Metrics

(1) Precision
Precision measures the proportion of correctly identified positive samples among all detected positive predictions. It is defined as:
$\mathrm{Precision} = \frac{TP}{TP + FP}$

where $TP$ denotes the number of true positive detections, and $FP$ denotes the number of false positive detections. A higher Precision indicates a lower false detection rate.
(2) Recall
Recall measures the proportion of actual targets that are correctly detected, and is defined as:
$\mathrm{Recall} = \frac{TP}{TP + FN}$
where FN denotes the number of false negatives. A higher Recall indicates a lower missed detection rate.
(3) mean Average Precision (mAP)
mAP represents the mean of the detection performance across all categories and serves as a comprehensive metric to evaluate the overall capability of the model in multi-class detection tasks. It is defined as:
$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i$
where $N$ denotes the number of categories (in this study, $N = 3$), and $AP_i$ represents the Average Precision of the $i$-th category. A higher mAP value indicates a stronger overall adaptability of the model to multi-class detection tasks. In our experiments, mAP was calculated at an IoU threshold of 0.5, where a prediction is considered correct if it overlaps the ground truth by at least 50%.
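The three metric definitions above reduce to simple arithmetic on per-class counts and AP values; the numbers used below are made-up illustrative values, not results reported in the paper:

```python
# The metric definitions above, computed from per-class counts and AP
# values. The numbers used here are made-up illustrative values, not
# results reported in the paper.

def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Fraction of actual targets that are detected."""
    return tp / (tp + fn) if tp + fn else 0.0

def mean_ap(ap_per_class):
    """mAP: mean of the per-class Average Precision values."""
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical per-class AP values for the three fastener categories
ap = {"Marking-painted fastener": 0.97, "Fastener": 0.96, "Fallen off": 0.95}
print(round(mean_ap(list(ap.values())), 4))  # 0.96
```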

3.2.2. Comparative Experimental Results

To comprehensively evaluate the performance of the proposed improved Cascade R-CNN model in fastener detection tasks, two classical detection models, Faster R-CNN and the original Cascade R-CNN, were selected as baseline methods for comparison. The quantitative evaluation results of each model on the same test set are presented in Table 1.
Table 1 shows that the proposed improved Cascade R-CNN model achieves the best performance across all key evaluation metrics. Specifically, the model attains a mean Average Precision (mAP) of 96.60%, representing improvements of 5.18 and 2.76 percentage points over Faster R-CNN (91.42%) and the original Cascade R-CNN (93.84%), respectively. In terms of Precision, the improved model reaches 95.85%, exceeding the two baseline models by 3.75 and 1.93 percentage points, respectively; for Recall, it achieves 95.72%, representing gains of 6.37 and 3.35 percentage points over the baselines. These quantitative results demonstrate that the three proposed enhancements—CBAM attention mechanism, anchor scale optimization, and the Soft-NMS algorithm—effectively improve the model’s detection performance in complex industrial scenarios.
To further analyze the detection performance of the improved Cascade R-CNN across different fastener categories, the Precision and Recall for the three classes were calculated, as presented in Table 2.
As shown in Table 2, the Precision and Recall for all three fastener categories remain above 95%, indicating that the improved model detects each fastener type robustly. Marking-painted fasteners achieve the highest Precision (96.67%), reflecting the distinctive visual cue that the paint marking provides; Fallen-off fasteners attain the highest Recall (96.85%), demonstrating the model’s heightened sensitivity to missing states and its effectiveness in reducing false negatives; and unmarked fasteners show well-balanced Precision and Recall, confirming the model’s robustness under complex background conditions.

3.3. Ablation Study

To evaluate the contribution of each improvement module to detection performance, an ablation study was conducted. In this experiment, different enhancement modules—including Anchor size optimization, the CBAM attention mechanism, and the Soft-NMS algorithm—were incrementally incorporated into the baseline Cascade R-CNN model, and the mAP performance of each model variant on the test set was recorded. The experimental results are presented in Table 3.
The experimental results indicate that the Anchor size optimization first strengthens the model’s perception and coverage of small targets. Building upon this, the CBAM attention mechanism enhances the model’s ability to extract critical features, further improving detection accuracy. Finally, incorporating the Soft-NMS algorithm improves detection performance in scenarios with overlapping fasteners, significantly reducing the miss rate and yielding the best overall detection performance. In summary, each improvement module contributes positively to the model’s performance, and their combined application produces a complementary effect, substantially enhancing the accuracy and robustness of Cascade R-CNN in fastener detection tasks for printing equipment.
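The Soft-NMS behavior evaluated in the ablation can be sketched with the Gaussian-decay variant: instead of discarding boxes that overlap the current top-scoring box, their scores are decayed by exp(-IoU^2 / sigma). The sigma value, score threshold, and box format below are assumptions for illustration, not the paper's exact settings.

```python
# Sketch of Gaussian Soft-NMS: overlapping boxes are down-weighted
# rather than suppressed outright. sigma and the score threshold are
# hypothetical; the IoU helper follows the metric defined in Section 3.2.1.
import math

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Return indices kept by Gaussian Soft-NMS, ordered by decayed score."""
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union
    idxs = list(range(len(boxes)))
    scores = list(scores)
    keep = []
    while idxs:
        best = max(idxs, key=lambda i: scores[i])
        if scores[best] < score_thresh:
            break
        keep.append(best)
        idxs.remove(best)
        for i in idxs:  # decay, rather than delete, overlapping candidates
            scores[i] *= math.exp(-iou(boxes[best], boxes[i]) ** 2 / sigma)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
kept = soft_nms(boxes, [0.9, 0.8, 0.7])
print(kept)  # [0, 2, 1]: the heavily overlapping box survives with a reduced score
```

With hard NMS at a 0.5 IoU threshold, the second box (IoU ≈ 0.68 with the first) would be removed entirely; under Soft-NMS it is merely demoted, which is what reduces missed detections among overlapping fasteners.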

3.4. Visualization of Detection Results

To validate the effectiveness of the proposed method, the fastener detection results in coating workshop inspection scenarios were visualized. As shown in Figure 7, the improved Cascade R-CNN can accurately identify the three target categories—Marking-painted fastener, Fastener, and Fallen off—under various operating conditions. The model maintains high detection accuracy even for small targets and densely distributed areas, with outputs closely matching manual annotations. Figure 7 also provides enlarged views of representative detection regions, clearly demonstrating the model’s capability to capture fine details of small fasteners. Overall, the results indicate that the proposed method possesses strong adaptability and robustness, effectively meeting the detection requirements in complex industrial scenarios.
To further highlight the improvements introduced in this study, Figure 8 provides a comparative visualization of detection results obtained from different models on the same challenging samples. Representative local regions are enlarged to more clearly illustrate model behavior on small or visually ambiguous fasteners. As shown, Faster R-CNN and the original Cascade R-CNN frequently exhibit misdetections, missed detections, or inaccurate localization when processing small fasteners, whereas the improved Cascade R-CNN achieves substantially more accurate localization and higher confidence scores across these cases. Figure 8 includes three representative groups of samples (a–c), each demonstrating the limitations of baseline detectors and the enhancements achieved by the proposed approach. Notably, in subfigures (b) and (c), the baseline models either incorrectly classify background structures as fasteners or fail to localize small targets. In contrast, the improved Cascade R-CNN consistently produces precise bounding boxes and reliable classification results, effectively avoiding the false positives and missed detections observed in the other models. These comparative visualizations further confirm the effectiveness and robustness of the proposed method under complex inspection conditions.

4. Summary

4.1. Discussion

Extensive experiments conducted on the custom VOC-format fastener dataset demonstrate that the proposed improved Cascade R-CNN achieves substantial performance gains in the challenging environment of coating workshop inspection. The incorporation of CBAM effectively enhances the backbone’s capacity to extract fine-grained and discriminative features, particularly in scenarios affected by weak contrast, complex background textures, and metallic reflections. Meanwhile, the optimized anchor configuration ensures better alignment with the actual scale distribution of small fasteners, and the adoption of Soft-NMS mitigates the excessive suppression of overlapping targets inherent in standard NMS. Together, these modifications result in significant improvements in mAP, Precision, and Recall, all of which exceed 95% and outperform Faster R-CNN and the baseline Cascade R-CNN by a meaningful margin.
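As a rough illustration of the anchor-scale adjustment discussed above, the following sketch generates anchor shapes for one feature-map location under a standard area-and-aspect-ratio parameterization. The base size, scales, and aspect ratios are hypothetical values chosen to show the effect, not the configuration used in the paper.

```python
# Illustrative anchor generation: reducing the scale factor shifts anchor
# areas toward small objects. All numeric values here are hypothetical.

def make_anchors(base_size, scales, ratios):
    """Return (w, h) anchor shapes for one feature-map location.

    Each anchor has area (base_size * scale)^2 and aspect ratio r = h / w.
    """
    anchors = []
    for s in scales:
        area = (base_size * s) ** 2
        for r in ratios:
            w = (area / r) ** 0.5
            h = w * r
            anchors.append((round(w, 1), round(h, 1)))
    return anchors

ratios = [0.5, 1.0, 2.0]
default = make_anchors(base_size=4, scales=[8], ratios=ratios)
small = make_anchors(base_size=4, scales=[4], ratios=ratios)
print(default)  # e.g. the square anchor is 32 x 32 pixels
print(small)    # halving the scale halves each side: 16 x 16 for the square anchor
```

Halving the scale factor quarters the anchor area, so candidate boxes match the ground-truth distribution of small fasteners more closely and the RPN assigns them positive labels more often.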
Despite these encouraging results, several limitations were observed during the experimental evaluation. The introduction of attention modules and multi-stage regression slightly increased the computational load and reduced inference speed, which may constrain real-time deployment in high-throughput industrial inspection systems. These findings highlight the need for further optimization in both robustness and efficiency.

4.2. Conclusions

This study proposes an improved Cascade R-CNN–based fastener detection method to address the challenges of dense distribution, small target proportions, and complex background interference in coating workshop inspections. The method integrates CBAM attention modules into the ResNet-101 backbone, optimizes anchor configurations, and replaces traditional NMS with Soft-NMS. A customized VOC-format dataset was constructed to support model training and evaluation, and extensive experiments validate the effectiveness and robustness of the proposed improvements.
Based on the hypotheses defined in the Introduction, the experimental results provide clear validation:
(1) Integrating CBAM enhances feature representation, particularly for small and fine-grained fasteners—confirmed by performance improvements over the baseline.
(2) Optimizing anchor sizes improves small-object detection accuracy, which is reflected in the significant gains in Precision and Recall.
(3) Soft-NMS increases detection stability in dense regions, as evidenced by fewer false detections and fewer missed detections in overlapping scenarios.
The advantages of this work include improved accuracy for small and dense targets, enhanced robustness across complex lighting and background conditions, and strong applicability for real industrial inspection systems. Although the proposed method demonstrates strong detection performance, further optimization is still possible. Future research will focus on improving computational efficiency to facilitate broader deployment in time-critical inspection scenarios.

Author Contributions

Conceptualization, S.L.; Methodology, S.L.; Software, J.L.; Validation, J.L.; Formal analysis, Y.C.; Investigation, J.F.; Resources, Y.C.; Writing—original draft, J.L.; Writing—review & editing, J.Z.; Visualization, J.Z.; Supervision, J.F.; Project administration, S.L.; Funding acquisition, S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Research and Development Program of Shaanxi Province (Program No.2024GX-ZDCYL-02-02).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Yuhong Chen was employed by the company Foxconn Precision Electronics (Zhengzhou) Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Wang, X.; Liu, S.; Zhang, H.; Li, Y.; Ren, H. Defects Detection of Lithium-Ion Battery Electrode Coatings Based on Background Reconstruction and Improved Canny Algorithm. Coatings 2024, 14, 392.
2. Kowalski, S. The Use of PVD Coatings for Anti-Wear Protection of the Press-In Connection Elements. Coatings 2024, 14, 432.
3. Hütten, N.; Alves Gomes, M.; Hölken, F.; Andricevic, K.; Meyes, R.; Meisen, T. Deep Learning for Automated Visual Inspection in Manufacturing and Maintenance: A Survey of Open-Access Papers. Appl. Syst. Innov. 2024, 7, 11.
4. Aldoseri, A.; Al-Khalifa, K.N.; Hamouda, A.M. AI-Powered Innovation in Digital Transformation: Key Pillars and Industry Impact. Sustainability 2024, 16, 1790.
5. Xiao, Y.; Zhou, J. A Review of Image Edge Detection Methods. Comput. Eng. Appl. 2023, 59, 40–54.
6. Kang, J.; Zhang, L.; Sun, Y.; Yang, X.; Wang, R.; Zhao, T. A Bolt Loosening Angle Detection Method Based on Color Segmentation. J. Mech. Strength 2025, 47, 102–109.
7. Li, Y.; Wu, X. Survey of Multilevel Feature Extraction Methods for RGB-D Images. J. Image Graph. 2024, 29, 1346–1363.
8. Dou, Y.; Huang, Y.; Li, Q.; Luo, S. A Fast Template Matching-Based Algorithm for Railway Bolts Detection. Int. J. Mach. Learn. Cybern. 2014, 5, 835–844.
9. Li, J.; Gao, X.; Yang, K. Locomotive Vehicle Bolt Detection Based on HOG Feature and Support Vector Machine. Inf. Technol. 2016, 3, 125–127+135.
10. Lu, S.; Liu, Z. Fast Localization Method of Bolt Under China Railway High-Speed. Comput. Eng. Appl. 2017, 53, 31–35.
11. Ramana, L.; Choi, W.; Cha, Y.J. Fully Automated Vision-Based Loosened Bolt Detection Using the Viola–Jones Algorithm. Struct. Health Monit. 2019, 18, 422–434.
12. Min, Y.; Xiao, B.; Dang, J.; Yin, C.; Yue, B.; Ma, H. Machine Vision Rapid Detection Method of the Track Fasteners Missing. J. Shanghai Jiao Tong Univ. 2017, 51, 1268–1272.
13. Peralta, P.E.; Ferre, M.; Sánchez-Urán, M.Á. Robust Fastener Detection Based on Force and Vision Algorithms in Robotic (Un)Screwing Applications. Sensors 2023, 23, 4527.
14. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
15. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016; pp. 779–788.
16. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot Multibox Detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Part I, pp. 21–37.
17. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 2018; pp. 6154–6162.
18. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969.
19. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
20. Zhang, J.; Su, Z.; Xing, Z. An Improved SSD and Its Application in Train Bolt Detection. In Proceedings of the 4th International Conference on Electrical and Information Technologies for Rail Transportation (EITRT) 2019: Rail Transportation Information Processing and Operational Management Technologies; Springer: Singapore, 2020; pp. 97–104.
21. Ge, L.; Xu, J.; Wu, X.L.; Hou, R.J.; Shi, C.J.; Jia, G. Research on Tower Bolt Identification Technology Based on Convolution Network. J. Phys. Conf. Ser. 2021, 1852, 022055.
22. Luo, L.; Ye, W.; Wang, J. Defect Detection of the Puller Bolt in High-Speed Railway Catenary Based on Deep Learning. J. Railw. Sci. Eng. 2021, 18, 605–614.
23. Li, Y.; Shi, X.; Xu, X.; Zhang, H.; Yang, F. YOLOv5s-PSG: Improved YOLOv5s-Based Helmet Recognition in Complex Scenes. IEEE Access 2025, 13, 34915–34924.
24. Wang, X.; Yang, M.; Zheng, S.; Mei, Y. Bolt Detection and Positioning System Based on YOLOv5s-T and RGB-D Camera. J. Beijing Inst. Technol. 2022, 42, 1159–1166.
25. Zhang, H.; Shao, F.; Chu, W.; Dai, J.; Li, X.; Zhang, X.; Gong, C. Faster R-CNN Based on Frame Difference and Spatiotemporal Context for Vehicle Detection. Signal Image Video Process. 2024, 18, 7013–7027.
26. Zhao, J.; Xu, H.; Dang, Y. Research on Bolt Detection of Railway Passenger Cars Based on Improved Faster R-CNN. J. China Saf. Sci. 2021, 31, 82–89.
Figure 1. Overall framework of the proposed fastener target detection method.
Figure 2. Framework of the original Cascade R-CNN. The blue arrows illustrate the data transmission flow between sequential stages. B0 denotes the initial bounding box proposals generated by the RPN. B1–B3 and C1–C3 represent the refined bounding boxes and classification scores at each of the three cascaded stages, respectively. H1–H3 correspond to the network heads for classification and regression, and "Pool" refers to the RoI pooling layer.
Figure 3. Framework of the improved Cascade R-CNN model for fastener detection. The blue arrows—B0–B3, C1–C3, H1–H3, and Pool—are defined similarly to those in Figure 2.
Figure 4. Integration of CBAMs into the ResNet-101 backbone. The blue arrows indicate the sequential flow of feature data through the network layers. The symbol "*" denotes the number of repeated bottleneck blocks within each respective layer.
Figure 5. Architecture of the CBAM attention mechanism.
Figure 6. Examples of fastener dataset annotations.
Figure 7. Fastener detection results.
Figure 8. Comparative visualization of detection results among different detection models. (a–c) Fastener detection results in three different industrial scenes. The columns from left to right represent the results of Faster R-CNN, Cascade R-CNN, and the Improved Cascade R-CNN, respectively.
Table 1. Quantitative comparison of different models on the fastener detection task.
Method | mAP (%) | Precision (%) | Recall (%)
Faster R-CNN | 91.42 | 92.10 | 89.35
Cascade R-CNN | 93.84 | 93.92 | 92.37
Improved Cascade R-CNN (Ours) | 96.60 | 95.85 | 95.72
Table 2. Performance comparison of different fastener categories.
Category | Precision (%) | Recall (%)
Fallen-off | 95.23 | 96.85
Fastener | 96.28 | 95.41
Marking-painted fastener | 96.67 | 95.72
Table 3. Results of the ablation study.
Model Variant | mAP (%) | Description
Cascade R-CNN | 93.84 | Original structure
+Anchor Size Optimization | 94.72 | Improves coverage of small targets
+Anchor Size Optimization + CBAM | 95.53 | Enhances feature extraction and representation
+Anchor Size Optimization + CBAM + Soft-NMS | 96.60 | Reduces missed detections among overlapping targets
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

