Article

Improved YOLOv5s-Based Crack Detection Method for Sealant-Spraying Devices

Weiyi Kong, Hua Ding, Qingzhang Cheng, Ning Li, Xiaochun Sun and Xiaoxin Dong
1 Shanxi Zhexing Security Technology Development Co., Ltd., No. 5 Changzhi West Lane, Taiyuan Xuefu Park, Shanxi Comprehensive Reform Demonstration Zone, Taiyuan 030021, China
2 School of Mechanical Engineering, Taiyuan University of Technology, Taiyuan 030024, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(12), 2089; https://doi.org/10.3390/sym17122089
Submission received: 27 October 2025 / Revised: 22 November 2025 / Accepted: 3 December 2025 / Published: 5 December 2025

Abstract

The manual spraying of sealant on train side doors is associated with high costs and significant safety risks. To address this challenge, this study proposes an automated crack localization method for sealant-spraying devices by enhancing the YOLOv5s network, with a specific focus on leveraging principles of symmetry. First, an automated sealant-spraying device is designed for operation and data acquisition. Geometric symmetry is then exploited through Zhang's camera calibration method to accurately establish the mapping between three-dimensional spatial coordinates and the two-dimensional image plane, a process fundamental to spatial reasoning. The core of our approach lies in introducing structural and computational symmetry into the deep learning model. The original YOLOv5s network is improved by integrating the spatial and channel reconstruction convolution (SCConv) module and the SIoU loss function, which streamline computation and boost detection accuracy. Furthermore, we replace the standard C3 module with an improved version that incorporates a re-parameterized Vision Transformer (RepViT) block, enhancing feature representation through structural re-parameterization symmetry between the training and inference phases. Validation using data from a coal handling facility demonstrates that the improved YOLOv5s model achieves superior performance in precision, mAP@0.5, and recall compared to the original. The results underscore the critical role of geometric and architectural symmetry in developing robust and efficient vision systems for industrial automation.

1. Introduction

Coal remains the primary energy source in China, and its transportation by rail plays a critical role in meeting industrial and domestic demand. However, during long-distance transportation, coal powder frequently leaks through side door cracks in train carriages. This not only leads to significant resource loss but also contributes to severe environmental pollution along railway routes. Currently, a common mitigation approach is to reduce train speed upon entering the coal yard so that workers can manually spray sealant onto door cracks using canned foaming agents. However, this method incurs high material and labor costs, and exposure to sealant particles poses health risks to workers. Given these challenges, there is an urgent need for an automated, machine-vision-based detection method to solve this problem.
Machine vision systems for train side door seam detection require high resolution and high detection accuracy in order to recognize complex, irregular door seam shapes. To this end, the system must perform high-precision object detection under complex environmental conditions and have the real-time capability to respond quickly to unsealed door seam positions.
In recent years, advances in deep learning have significantly improved both detection precision and real-time performance, leading to its widespread adoption in industrial applications [1,2,3,4,5,6]. For instance, Liang et al. [7] developed a traffic modeling system capable of detecting human travel patterns, which was successfully deployed on smartphones. Another study integrated a convolutional neural network with principal component analysis and employed a divide-and-conquer strategy to segment smaller images, achieving high accuracy [8].
Object detection algorithms can be roughly divided into single-stage and two-stage methods. Two-stage methods (such as R-CNN [9], Fast R-CNN [10], Faster R-CNN [11], and SPPNet [12]) first generate candidate regions and then process them, yielding high accuracy but slow detection speed, which makes it difficult to meet real-time requirements. Single-stage methods (such as YOLO [13], SSD [14], RetinaNet [15], and FCOS [16]) generate candidate regions and classify them simultaneously, giving faster detection speed suitable for applications that require fast response, though some accuracy may be sacrificed.
To improve the adaptability of object detection algorithms in complex environments, this paper draws on two ideas: the principle of symmetry and re-parameterization symmetry. The principle of symmetry refers to the property of an object remaining unchanged under specific transformations, such as rotation and translation. In object detection, exploiting symmetry helps the model identify and process objects with symmetrical features, reducing false detections of symmetrical objects and improving detection accuracy. Re-parameterization symmetry, in turn, maintains the consistency of the model under different input conditions by changing its parameterized representation; for example, re-parameterization methods can help the model recognize targets across different perspectives or scales, enhancing robustness. Train side door gaps and coal powder leaks often exhibit irregular symmetry, so utilizing these symmetry principles can significantly improve detection accuracy and real-time performance.
In this study, we chose YOLOv5 as the object detection algorithm. YOLOv5 [17] belongs to the YOLO series and has quickly gained attention in the field of machine vision thanks to various improvements in accuracy, real-time detection, and speed [18]. Its high detection speed and accuracy make it particularly suitable for scenarios that require real-time processing, such as surveillance [19], robotics [20], biology [21], and autonomous driving [22]. In addition, YOLOv5 performs well on small objects and complex backgrounds, making it an ideal choice for handling complex scenes such as train side door gaps.
Although YOLOv5 performs excellently in many industrial applications, its performance often declines on small-scale or irregularly shaped objects, such as train side door gaps, because it struggles to capture the fine details in complex, cluttered environments that are crucial for high-precision detection. In tasks like train door gap detection, the algorithm must maintain high accuracy to prevent resource wastage and ensure sealing effectiveness, and directly applying YOLOv5 in such environments can lead to erroneous detections that disrupt the operation of sealing robots. How to build a door gap detection model that combines high precision, strong robustness, and real-time performance in complex industrial environments therefore remains a key open problem for implementing automatic glue-spraying technology for train side doors.
Given the strengths and widespread adoption of YOLOv5, we put forward an improved YOLOv5s method to increase the sensitivity and localization accuracy of a sealing robot for identifying cracks in train doors.
The main contributions of this study are as follows:
(1) A sealant-spraying device is designed to automate the sealing process for train door cracks.
(2) An enhanced YOLOv5s object detection algorithm is developed, demonstrating improved performance in detecting sealing targets.
(3) The integration of re-parametrized Vision Transformer (RepViT) blocks into the C3 modules of the YOLOv5s network achieves model lightweighting and inference acceleration while maintaining detection precision.
This article proposes an improved YOLOv5s object detection method. We enhance the model's performance in complex scenes by introducing the principles of symmetry and re-parameterization symmetry, while adopting the SCConv module and the SIoU loss function to reduce computational cost and improve detection accuracy. We also replace the C3 modules in the YOLOv5s network with C3-Rep modules built on RepViT blocks to improve detection accuracy and computational efficiency. Our method improves the adaptability and robustness of the system in real-time detection while maintaining high accuracy, thereby achieving precise positioning of train door gaps.
The remainder of this article is organized as follows: Section 2 describes the door seam positioning and spraying scheme for open-top coal cars and the overall experimental process. Section 3 introduces camera calibration. Section 4 proposes the improved YOLOv5s network model for identifying and locating door seams. Section 5 presents the experiments and results. Section 6 concludes the article.

2. Positioning and Spraying Scheme for Train Door Cracks

The proposed method for locating and sealing cracks in train doors consists of both hardware and software components. The hardware includes a sealant-spraying device equipped with an industrial camera, while the software comprises an image-based crack identification and localization algorithm. The overall processing flow is as follows. First, an industrial camera collects images of the train side doors; the images are stored directly in the MinIO image library, while their metadata are stored in MongoDB. Second, Kafka components transmit the image paths and metadata. Third, the Spark Streaming module receives the real-time messages from Kafka, reads the images from the MinIO image library, and preprocesses them for subsequent recognition and localization. Fourth, the image data are fed into the improved train door seam recognition and positioning algorithm. Fifth, the results are processed: if a door gap is detected, its coordinates are automatically marked and saved, and control instructions are generated to drive the device to perform the glue-spraying operation and complete the sealing task; if no door crack is detected, the image is added to the dataset for manual review and annotation, and a new model is trained offline to update and optimize the improved algorithm. The overall flowchart is shown in Figure 1.
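To make the first two steps concrete, the sketch below shows an acquisition-side producer that uploads a captured frame to MinIO and publishes its path and metadata to a Kafka topic. The endpoints, bucket and topic names, and credentials are illustrative placeholders, not the deployed system's configuration.

```python
import json
import uuid
from datetime import datetime, timezone

from kafka import KafkaProducer  # kafka-python
from minio import Minio          # minio Python client

# Illustrative endpoints, credentials, and names; not the deployed configuration.
minio_client = Minio("minio:9000", access_key="ACCESS", secret_key="SECRET", secure=False)
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_frame(local_path):
    """Upload one captured frame to MinIO and announce its path on Kafka."""
    object_name = "side-doors/{}.png".format(uuid.uuid4().hex)
    minio_client.fput_object("train-images", object_name, local_path)
    producer.send("door-frames", {
        "bucket": "train-images",
        "object": object_name,
        "captured_at": datetime.now(timezone.utc).isoformat(),
    })
    producer.flush()
```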

2.1. Sealant-Spraying Device Overview

The sealant-spraying device consists of a control terminal, a three-dimensional sliding module, a slider movement unit, a six-axis robotic arm [23], a robotic arm control unit, and a sealant-spraying unit. The sealant-spraying unit comprises an industrial camera, a sealant-spraying hose, a mounting plate, and a support frame. The three-dimensional sliding module includes a slider module and a corresponding control unit. The six-axis robotic arm is composed of multiple joints and connecting rods, and is operated by a dedicated control unit, which, along with the slider motion control unit, integrates a driver unit and a programmable logic controller. The slider motion control unit itself consists of a guide rail, slider, and connector. The overall structure of the sealant-spraying device used for coal transport applications is illustrated in Figure 2.

2.2. Train Door Sealing Process

Figure 3 illustrates the workflow of the train door sealing process. Initially, the robotic arm control unit activates the robotic arm to perform rotation and translation, positioning the sealant-spraying unit at a predetermined distance from the gondola car. Once in position, the industrial camera embedded in the sealing unit captures images of the side door and transmits them to the processing terminal. The image processing algorithm then identifies the location of the door crack. Based on this positional data, the slider motion control unit drives the slider module to move the sealant-spraying unit along the crack. The sealant-spraying mechanism is then activated to apply sealant along the detected crack, thereby completing the sealing process.

2.3. Technical Workflow of the Improved Algorithm

The industrial camera first captures images of gondola car side doors under various conditions. These images are then preprocessed and annotated with the side door hinge shafts to construct the dataset, which is divided into training, validation, and test sets at a ratio of 8:1:1. Using this dataset, a door crack detection model is developed based on the improved algorithm. During training, model parameters are tuned, including the input data dimensions, the number of object classes, and the number of training iterations. Once trained, the model is used to detect door hinges in the input images. By applying the camera calibration results, the spatial relationship between the door hinges and the door cracks is used to infer the position of the cracks. The final output includes the calculated positions of the door cracks, which are then marked accordingly. Figure 4 presents the technical flowchart of the improved algorithm.

3. Camera Calibration

Due to the similar color between the gondola car’s side door and the surrounding side plate, directly detecting the door crack using the YOLOv5s algorithm is challenging. To address this, the algorithm is instead used to detect the round hinge pins on the gondola car and extract their pixel coordinates. The position of the door crack is then inferred based on the known spatial relationship between the hinge and the crack, along with actual dimensional measurements.
To establish the correspondence between pixel coordinates and real-world coordinates, this study employs Zhang’s camera calibration method [24]. This method involves capturing multiple images of a black-and-white checkerboard pattern from different angles. It is widely adopted for its flexibility, high accuracy, and robustness. A minimum of three checkerboard images is required to solve for the intrinsic parameters of the industrial camera. Table 1 shows the technical parameters of the industrial camera used in this article. Figure 5 shows an example image of the calibration checkerboard used in the process.
The camera calibration process, following Zhang’s method, fundamentally relies on the known geometric symmetry of the calibration pattern. The regular, repeating structure of the pattern allows for the precise computation of the camera’s intrinsic and extrinsic parameters. This step establishes a symmetric relationship between the 3D world coordinates and the 2D image pixels, which is essential for accurate spatial localization of cracks.
The conversion relation of the world coordinate system to the camera coordinate system involves a series of coordinate conversions as described by Equation (1):
$$\begin{bmatrix} X_C \\ Y_C \\ Z_C \end{bmatrix} = R \begin{bmatrix} X_W \\ Y_W \\ Z_W \end{bmatrix} + T, \qquad \begin{bmatrix} X_C \\ Y_C \\ Z_C \\ 1 \end{bmatrix} = \begin{bmatrix} R & T \\ \mathbf{0}^{\mathsf{T}} & 1 \end{bmatrix} \begin{bmatrix} X_W \\ Y_W \\ Z_W \\ 1 \end{bmatrix} \tag{1}$$
where T represents a 3 × 1 translation matrix and R denotes a 3 × 3 rotation matrix.
Equation (2) shows the conversion relation between the camera coordinate system and the image coordinate system:
$$Z_C \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X_C \\ Y_C \\ Z_C \\ 1 \end{bmatrix} \tag{2}$$
where f represents the camera focal length (obtained through calibration).
Equation (3) shows the conversion relation between the image coordinate system and the pixel coordinate system:
$$u = \frac{x}{d_x} + u_0, \quad v = \frac{y}{d_y} + v_0, \qquad \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} 1/d_x & 0 & u_0 \\ 0 & 1/d_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \tag{3}$$
where $(u_0, v_0)$ is the principal point (the origin of the image coordinate system expressed in pixels), and $d_x$ and $d_y$ are the physical dimensions of a single pixel along the x and y axes.
By combining Equations (1)–(3), the transformation from world coordinates to pixel coordinates is given as
$$Z_C \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & u_0 & 0 \\ 0 & f_y & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} R & T \\ \mathbf{0}^{\mathsf{T}} & 1 \end{bmatrix} \begin{bmatrix} X_W \\ Y_W \\ Z_W \\ 1 \end{bmatrix} \tag{4}$$
where $f_x = f/d_x$ and $f_y = f/d_y$ denote the focal length expressed in pixel units along the two axes.
Once the intrinsic and extrinsic parameter matrices of the camera are obtained, the relationship between real-world coordinates and pixel coordinates can be accurately established.
This article uses Zhang's calibration method to calibrate the camera and evaluates the calibration accuracy through the reprojection error. The maximum reprojection error of this calibration is 0.049 pixels, and the average error is 0.036 pixels, which meets the requirements of industrial applications.
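As an illustration of this procedure, the sketch below runs Zhang's method with OpenCV's standard calibration routines and reports the mean reprojection error. The board geometry (9 × 6 inner corners, 25 mm squares) and the image folder are assumed placeholders, not the exact settings used in this study.

```python
import glob
import cv2
import numpy as np

# Assumed board geometry: 9 x 6 inner corners, 25 mm squares (placeholders).
PATTERN = (9, 6)
SQUARE_MM = 25.0
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_MM

obj_pts, img_pts = [], []
for path in glob.glob("calib/*.png"):  # placeholder image folder
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_pts.append(objp)
        img_pts.append(corners)

# Solve for the intrinsics K, distortion coefficients, and per-view extrinsics (R, T).
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_pts, img_pts, gray.shape[::-1], None, None)

# Mean reprojection error, the accuracy measure quoted above.
errors = []
for op, ip, rv, tv in zip(obj_pts, img_pts, rvecs, tvecs):
    proj, _ = cv2.projectPoints(op, rv, tv, K, dist)
    errors.append(cv2.norm(ip, proj, cv2.NORM_L2) / len(proj))
print("mean reprojection error (px):", float(np.mean(errors)))
```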

4. Improved YOLOv5s Network

4.1. Network Structure

The YOLOv5s architecture is composed of the backbone, neck, and head. However, due to the small size and rapid movement of the train door hinge pins, and given that the location of the door crack must be inferred from the detected hinge positions and their known spatial relationship, the standard YOLOv5s model does not meet the precision and robustness required for this application. To address these limitations, an improved version of the YOLOv5s network is proposed. This enhanced model is built upon the original YOLOv5s architecture but incorporates structural optimizations that reduce computational cost while improving detection accuracy.
Specifically, several original convolutional (Conv) modules are replaced with spatial and channel reconstruction convolution (SCConv) modules [25], which help reduce redundancy in intermediate feature representations. Additionally, the original C3 modules are replaced with C3-Rep modules that integrate RepViT blocks [26], enhancing the network's capacity to extract and interpret complex image features. To further improve training performance and accelerate convergence, the original CIoU loss function is replaced with the SIoU loss function [27]. Figure 6 shows the structure of the improved YOLOv5s network.

4.2. SCConv Module

To reduce the computational cost of the YOLOv5s network while enhancing its ability to learn discriminative features, the SCConv module is integrated into the model. SCConv is designed to suppress redundant features and improve both classification accuracy and computational efficiency, without significantly increasing the model's complexity. The SCConv module consists of two key components executed sequentially: the spatial reconstruction unit (SRU) and the channel reconstruction unit (CRU).
Figure 7 illustrates the structure of the SRU. The SRU suppresses spatial redundancy and enhances the network's feature representation by reweighting and reconstructing intermediate spatial features. It first applies group normalization (GN) and uses the resulting scaling factors to assess the information content of the input features. The features are then separated into two components carrying different amounts of spatial information, which are reconstructed into a new feature space through a series of cross operations. Finally, the two parts are concatenated to produce the output feature. By reconfiguring spatial information in this manner, the SRU significantly improves the utilization of spatial features within the image.
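To make the data flow concrete, the following is a simplified PyTorch sketch of the SRU written from the description above; it is not the reference implementation of [25], and the group count and gate threshold are illustrative choices.

```python
import torch
import torch.nn as nn

class SRU(nn.Module):
    """Simplified spatial reconstruction unit; a sketch of the idea in [25],
    not the reference implementation. Assumes an even channel count."""

    def __init__(self, channels, groups=4, gate_threshold=0.5):
        super().__init__()
        self.gn = nn.GroupNorm(groups, channels)
        self.gate_threshold = gate_threshold

    def forward(self, x):
        gn_x = self.gn(x)
        # GN scaling factors measure how much spatial information each channel carries.
        w = (self.gn.weight / self.gn.weight.sum()).view(1, -1, 1, 1)
        gate = torch.sigmoid(w * gn_x)
        # Threshold the gate to separate informative from redundant features.
        informative = torch.where(gate > self.gate_threshold, gate, torch.zeros_like(gate))
        redundant = torch.where(gate <= self.gate_threshold, gate, torch.zeros_like(gate))
        x1, x2 = informative * x, redundant * x
        # Cross-reconstruct the two parts, then concatenate along the channel axis.
        a1, a2 = torch.chunk(x1, 2, dim=1)
        b1, b2 = torch.chunk(x2, 2, dim=1)
        return torch.cat([a1 + b2, a2 + b1], dim=1)
```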
Figure 8 presents the structural diagram of the CRU, which focuses on enhancing feature representation in the channel dimension of the feature map. The CRU begins by dividing the input channels into two parts based on a predefined division ratio. Each part undergoes separate processing using pointwise convolution (PWC) and grouped convolution (GWC). The resulting low-dimensional features are concatenated to form two intermediate representations, denoted as Y1 and Y2. These are then passed through a pooling operation to generate corresponding scalars S1 and S2. A SoftMax function is applied to the concatenated pooled outputs to obtain channel attention weights β1 and β2. Finally, these weights are used to reweight the features, and the weighted features are summed to produce the final output.
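A correspondingly simplified CRU sketch is given below, again written from the description above rather than taken from [25]; the split ratio and group count are illustrative, and the channel counts are assumed to divide evenly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CRU(nn.Module):
    """Simplified channel reconstruction unit; a sketch of [25] following the
    description above. The split ratio alpha and group count are illustrative."""

    def __init__(self, channels, alpha=0.5, groups=2):
        super().__init__()
        self.up_c = int(alpha * channels)   # "rich" channels
        self.low_c = channels - self.up_c   # complementary channels
        # Upper path: grouped 3x3 conv (GWC) plus pointwise conv (PWC).
        self.gwc = nn.Conv2d(self.up_c, channels, 3, padding=1, groups=groups)
        self.pwc_up = nn.Conv2d(self.up_c, channels, 1)
        # Lower path: pointwise conv generating the missing channels, plus a skip.
        self.pwc_low = nn.Conv2d(self.low_c, channels - self.low_c, 1)

    def forward(self, x):
        x_up, x_low = torch.split(x, [self.up_c, self.low_c], dim=1)
        y1 = self.gwc(x_up) + self.pwc_up(x_up)              # intermediate Y1
        y2 = torch.cat([self.pwc_low(x_low), x_low], dim=1)  # intermediate Y2
        # Global average pooling -> SoftMax attention weights beta1, beta2.
        s = torch.stack([y1.mean(dim=(2, 3)), y2.mean(dim=(2, 3))], dim=0)
        beta = F.softmax(s, dim=0)
        return beta[0, :, :, None, None] * y1 + beta[1, :, :, None, None] * y2
```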

4.3. C3 Module

In YOLOv5s, the C3 module is composed of Conv, BottleNeck, and Concat operations, which extract and fuse features from the input feature maps. While the C3 module helps reduce computational load and captures rich features, it can underperform in complex scenes due to redundant computations and inefficiencies in transmitting low-level features. To address these limitations, RepViT blocks are introduced into the C3 structure, yielding the improved C3-Rep module. Figure 9 illustrates the comparison between the original C3 module and the improved version incorporating the RepViT block.
The RepViT block is a convolutional module derived from the RepViT architecture and designed for use in CNNs. It integrates the advantages of the ViT [28] and employs re-parameterization techniques [29] to significantly accelerate inference speed while maintaining high accuracy. The original MobileNetV3 module [30] combines extended convolution and projection layers to form a channel mixer. After a 1 × 1 extended convolution, a 3 × 3 depthwise separable convolution (DW) is used to fuse spatial information, serving as a spatial or “token” mixer. In RepViT block, to decouple the spatial mixer from the channel mixer, the DW convolution and the squeeze-and-excitation (SE) layer are moved earlier in the pipeline. During inference, the multi-branch deep convolutions are re-parameterized into a single-branch structure, reducing computational overhead and GPU memory usage. Additionally, the DW convolution layer—enabled by structural re-parameterization—eliminates the overhead associated with residual connections during inference. Figure 10 presents the architecture of the RepViT block.
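The structural re-parameterization underlying this step can be illustrated with the generic RepVGG-style fusion below, which folds a 3 × 3 depthwise branch, a 1 × 1 depthwise branch, and a BN-only identity branch into a single 3 × 3 depthwise convolution for inference. This is a sketch of the general technique, not code taken from RepViT [26].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(conv, bn):
    """Fold a BatchNorm (in eval mode, using running statistics) into the
    preceding bias-free conv, returning an equivalent weight and bias."""
    std = (bn.running_var + bn.eps).sqrt()
    w = conv.weight * (bn.weight / std).reshape(-1, 1, 1, 1)
    b = bn.bias - bn.running_mean * bn.weight / std
    return w, b

def reparameterize_dw(dw3, bn3, dw1, bn1, bn_id):
    """Merge 3x3-DW+BN, 1x1-DW+BN, and identity+BN branches into a single
    3x3 depthwise conv. Generic RepVGG-style fusion, valid at inference."""
    c = dw3.in_channels
    w3, b3 = fuse_conv_bn(dw3, bn3)
    w1, b1 = fuse_conv_bn(dw1, bn1)
    w1 = F.pad(w1, [1, 1, 1, 1])  # pad the 1x1 kernel to 3x3
    # The identity branch equals a depthwise 3x3 kernel with 1 at the center.
    w_id = torch.zeros(c, 1, 3, 3)
    w_id[:, 0, 1, 1] = 1.0
    std = (bn_id.running_var + bn_id.eps).sqrt()
    w_id = w_id * (bn_id.weight / std).reshape(-1, 1, 1, 1)
    b_id = bn_id.bias - bn_id.running_mean * bn_id.weight / std
    fused = nn.Conv2d(c, c, 3, padding=1, groups=c, bias=True)
    fused.weight.data = w3 + w1 + w_id
    fused.bias.data = b3 + b1 + b_id
    return fused
```

With the BatchNorm layers in eval mode, the fused convolution reproduces the sum of the three branch outputs up to floating-point error, which is what removes the multi-branch and residual-connection overhead at inference.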

4.4. Loss Function

YOLOv5s originally adopts the CIoU loss function, which extends the Distance IoU (DIoU) loss [31]. CIoU enhances bounding box regression by incorporating an aspect ratio term that encourages alignment between the predicted and ground truth box shapes through an additional penalty factor. In this study, the SIoU loss function is adopted to further improve performance. Unlike CIoU, SIoU introduces the angle between the predicted and ground truth boxes as a constraint, capturing both positional and directional discrepancies. By considering this angular deviation, SIoU provides a more comprehensive regression criterion, leading to faster convergence and improved training stability in the proposed network. The angular term also addresses the orientation of slender, irregular cracks: it makes the loss far less sensitive to the in-plane rotation of a crack, a key symmetry characteristic that enables more stable training and better detection of defects with arbitrary orientations.
SIoU enhances convergence speed by introducing an angle-based penalty term. Figure 11 illustrates the schematic of the SIoU loss function.
During bounding box regression, when the angle $\alpha$ between the line connecting the centers of the predicted and ground truth boxes and the x axis is less than $\pi/4$, the prediction converges first along the x axis; otherwise, when the complementary angle $\beta = \pi/2 - \alpha$ is smaller, it converges first along the y axis. The angle-based penalty cost is computed as follows:
$$\varphi = 1 - 2\sin^2\!\left(\arcsin x - \frac{\pi}{4}\right) \tag{5}$$
$$x = \frac{c_h}{\sigma} = \sin\alpha \tag{6}$$
where $c_h$ is the height difference between the center points of the ground truth and predicted boxes and $\sigma$ is the distance between these center points. With the introduction of the angle-based cost, the number of distance-related variables can be reduced.
The redefined distance cost is expressed as
$$\Delta = \sum_{t=x,y}\left(1 - e^{-\gamma\rho_t}\right) \tag{7}$$
where $\gamma = 2 - \varphi$, $\rho_x = \left(\frac{b_{c_x}^{gt} - b_{c_x}}{c_w}\right)^2$, and $\rho_y = \left(\frac{b_{c_y}^{gt} - b_{c_y}}{c_h}\right)^2$, with $c_w$ and $c_h$ denoting the width and height of the smallest box enclosing both boxes. The contribution of the distance cost grows as $\alpha$ approaches $\pi/4$.
The shape cost is calculated as
$$\Omega = \sum_{t=w,h}\left(1 - e^{-\omega_t}\right)^{\theta} \tag{8}$$
where
$$\omega_w = \frac{|w - w^{gt}|}{\max(w, w^{gt})}, \qquad \omega_h = \frac{|h - h^{gt}|}{\max(h, h^{gt})} \tag{9}$$
Here, $w$ and $w^{gt}$ indicate the widths of the predicted and ground truth boxes, $h$ and $h^{gt}$ denote their corresponding heights, and $\theta$ controls the weight of the shape cost.
The IoU term is computed as follows:
$$IoU = \frac{|B \cap B^{gt}|}{|B \cup B^{gt}|} \tag{10}$$
Finally, the complete expression for the SIoU loss function is
$$L = 1 - IoU + \frac{\Delta + \Omega}{2} \tag{11}$$
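Assembling Equations (5)–(11), a direct PyTorch rendering of the loss for batches of (x1, y1, x2, y2) boxes might look like the sketch below. The value θ = 4 is an assumed shape-cost weight rather than a setting reported in this paper, and the official YOLOv5 code organizes this computation differently.

```python
import math
import torch

def siou_loss(pred, gt, theta=4.0, eps=1e-7):
    """SIoU loss for (x1, y1, x2, y2) boxes, following Eqs. (5)-(11); a sketch."""
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_g, h_g = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_g, cy_g = (gt[:, 0] + gt[:, 2]) / 2, (gt[:, 1] + gt[:, 3]) / 2

    # IoU term, Eq. (10).
    iw = (torch.min(pred[:, 2], gt[:, 2]) - torch.max(pred[:, 0], gt[:, 0])).clamp(0)
    ih = (torch.min(pred[:, 3], gt[:, 3]) - torch.max(pred[:, 1], gt[:, 1])).clamp(0)
    inter = iw * ih
    iou = inter / (w_p * h_p + w_g * h_g - inter + eps)

    # Angle cost, Eqs. (5)-(6).
    sigma = torch.sqrt((cx_g - cx_p) ** 2 + (cy_g - cy_p) ** 2) + eps
    sin_alpha = (cy_g - cy_p).abs() / sigma
    angle = 1 - 2 * torch.sin(torch.arcsin(sin_alpha.clamp(0, 1)) - math.pi / 4) ** 2

    # Distance cost over the smallest enclosing box, Eq. (7), with gamma = 2 - angle.
    cw = torch.max(pred[:, 2], gt[:, 2]) - torch.min(pred[:, 0], gt[:, 0])
    ch = torch.max(pred[:, 3], gt[:, 3]) - torch.min(pred[:, 1], gt[:, 1])
    gamma = 2 - angle
    rho_x = ((cx_g - cx_p) / (cw + eps)) ** 2
    rho_y = ((cy_g - cy_p) / (ch + eps)) ** 2
    dist = (1 - torch.exp(-gamma * rho_x)) + (1 - torch.exp(-gamma * rho_y))

    # Shape cost, Eqs. (8)-(9).
    omega_w = (w_p - w_g).abs() / torch.max(w_p, w_g).clamp(min=eps)
    omega_h = (h_p - h_g).abs() / torch.max(h_p, h_g).clamp(min=eps)
    shape = (1 - torch.exp(-omega_w)) ** theta + (1 - torch.exp(-omega_h)) ** theta

    return 1 - iou + (dist + shape) / 2  # Eq. (11)
```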

5. Experiments and Results

5.1. Configuration of the Running Environment

The experiments were conducted using PyTorch 1.10.1 and Python 3.7. An NVIDIA GeForce RTX 3060 GPU with 16 GB of system memory and 8 GB of video memory was used for training. The CUDA version employed was 11.3.
Before training, the configuration parameters of the improved YOLOv5s were set as shown in Table 2.

5.2. Image Acquisition and Dataset Production

At present, there is no publicly available dataset for detecting sealant targets on train carriages. The experimental data used in this article were therefore collected on site to create a self-made train carriage sealing dataset. A total of 1800 eligible images were obtained and saved in PNG format; sample images are shown in Figure 12. The original dataset was expanded to 4800 images through horizontal flipping, translation, random cropping, and padding. The YOLOv5 model additionally uses Mosaic data augmentation, which randomly crops and arranges four input images and stitches them together. The stitched images are fed into the YOLOv5 network for training, improving the model's generalization ability and reducing the risk of overfitting degrading detection performance.
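As one way to express the offline expansion described above, the sketch below uses the Albumentations library; the probabilities, crop size, and bounding-box format are illustrative assumptions rather than the exact settings used to build this dataset.

```python
import numpy as np
import albumentations as A

# Offline expansion matching the operations named above; the probabilities,
# crop size, and box format are illustrative assumptions.
augment = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.0, rotate_limit=0, p=0.5),
        A.RandomCrop(height=576, width=576, p=0.3),
        A.PadIfNeeded(min_height=640, min_width=640, border_mode=0, value=0),
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)

image = np.zeros((640, 640, 3), dtype=np.uint8)  # placeholder frame
bboxes = [(100, 120, 220, 400)]                  # placeholder box (x1, y1, x2, y2)
labels = [0]
out = augment(image=image, bboxes=bboxes, labels=labels)
```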
The annotation software used in this article is LabelImg, which marks each image according to the position of the sealant target and adds a label box. Marking the target samples in the images generates XML files containing the sealant target type and coordinate information, completing the construction of the dataset. The dataset is divided into training, validation, and test sets in an 8:1:1 ratio.

5.3. Evaluation Indicators

Following model training and deployment, performance evaluation is essential for assessing detection effectiveness and guiding further optimization. Based on the evaluation results, model parameters can be fine-tuned, and comparative analysis among different models can be conducted. In this study, the following evaluation metrics are used to assess real-time detection performance: precision (P), recall (R), mAP@0.5, and FPS.
Precision (P) measures the proportion of predicted positive samples that are truly positive. The calculation formula is
$$P = \frac{TP}{TP + FP} \tag{12}$$
mAP@0.5 reflects the detection performance of the model across different categories; the higher the mAP, the better the detection effect of the model. The calculation formula is
$$mAP = \frac{1}{K}\sum_{i=1}^{K} AP_i \tag{13}$$
Recall (R) measures the proportion of actual positive samples that are correctly detected. The calculation formula is
$$R = \frac{TP}{TP + FN} \tag{14}$$
where $K$ denotes the number of categories; $AP_i$ denotes the average precision of the $i$-th category; FN denotes false negatives; FP denotes false positives; and TP denotes true positives.
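As a minimal worked example of Equations (12)–(14), the snippet below computes the three metrics from illustrative detection counts (not values from this study).

```python
import numpy as np

def precision(tp, fp):
    # Eq. (12): fraction of predicted positives that are correct.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # Eq. (14): fraction of actual positives that are detected.
    return tp / (tp + fn) if (tp + fn) else 0.0

def mean_average_precision(ap_per_class):
    # Eq. (13): mAP is the mean of the per-class average precisions.
    return float(np.mean(ap_per_class))

# Illustrative counts only (not results from this study).
print(precision(tp=97, fp=3), recall(tp=97, fn=1))
print(mean_average_precision([0.993]))  # a single detection class here
```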

5.4. Experimental Results and Analysis

Figure 13 illustrates the model's learning dynamics over training, highlighting performance trends and convergence behavior. In terms of precision, the improved YOLOv5s reaches 0.975, an improvement of 0.057 over the original model, indicating a greater capability to correctly identify actual door cracks. The recall improved to 0.99, a 0.07 increase, suggesting enhanced sensitivity to true positive cases. Regarding mAP@0.5, the improved model achieved 0.993, significantly outperforming the baseline YOLOv5s (0.934) and demonstrating higher overall detection accuracy for door cracks. On the more stringent mAP@0.5:0.9 metric, the original YOLOv5s scored 0.678, whereas the improved version reached 0.737, indicating superior robustness and accuracy across varying IoU thresholds, particularly at higher levels of overlap.
Figure 14 presents a comparative analysis of model accuracy. The results demonstrate that the improved YOLOv5s achieves notable performance gains, with increases of 5.7% in precision, 7.0% in recall, and 5.9% in mAP@0.5 compared to the original model. These improvements reflect enhanced detection capability and robustness. However, a slight decrease in FPS is observed, attributable to the increase in GFLOPs caused by the architectural enhancements introduced in the improved model.

5.4.1. Ablation Experiment

The ablation experiments summarized in Table 3 evaluate the contribution of each improvement to the overall model performance. Taking the YOLOv5s model as the baseline: adding the SCConv module alone yields a slight improvement in accuracy, FPS, and GFLOPs; adding the SIoU loss alone improves accuracy significantly, but the increase in GFLOPs leads to a decrease in FPS; and adding the RepViT block alone substantially optimizes the model's GFLOPs and FPS with minimal impact on accuracy. When all three improvement strategies are applied simultaneously, the complete model achieves the best balance between accuracy and efficiency, which fully demonstrates the necessity and synergistic effects of each module improvement.

5.4.2. Comparative Experiments of Different Models

To evaluate the effectiveness of the proposed improvements, a comparative experiment was conducted using several mainstream object detection models under identical training and validation conditions on the door crack dataset. Table 4 presents the results, comparing models based on precision, recall, mAP@0.5, and FPS. The improved YOLOv5s demonstrates a strong balance across all evaluation metrics. Although its inference speed (FPS) ranks mid-level among the compared models, it achieves the highest values in precision, recall, and mAP@0.5, indicating superior detection precision and robustness. These improvements contribute to more reliable and accurate identification of door crack positions. Compared to its baseline and other benchmark models, the enhanced YOLOv5s shows notable performance gains, making it the most effective solution for this application scenario.

5.5. Experimental Results

Figure 15 presents the detection outcomes of the proposed method applied to train door crack inspection using a purpose-built experimental setup. The results demonstrate that the improved YOLOv5s model accurately identifies and localizes door cracks, confirming its effectiveness in practical scenarios. The model exhibits strong object detection capability, supporting its applicability in real-world industrial inspection tasks. Overall, the method shows high feasibility and practical value for deployment in automated crack detection systems.

6. Conclusions

This study proposes an enhanced door crack detection method based on the YOLOv5s architecture, aimed at enabling more accurate identification and localization of cracks in train doors for integration with sealant-spraying devices. To optimize the model's performance, several improvements were introduced. The SCConv module replaces part of the original convolution operations, effectively reducing computational complexity and network redundancy. The C3 module is substituted with a C3-Rep module incorporating RepViT blocks, enhancing the network's capacity for feature extraction. Additionally, the original CIoU loss function is replaced with the SIoU loss, which incorporates an angle-based constraint to accelerate convergence and improve localization accuracy. Experimental results confirm the effectiveness of these modifications: the improved YOLOv5s achieves a 5.7% increase in precision, a 7.0% improvement in recall, and a 5.9% gain in mAP@0.5 compared to the baseline YOLOv5s. These results indicate that the proposed method not only surpasses the original YOLOv5s model but also demonstrates strong applicability for real-time detection and localization of door cracks in industrial environments.

Author Contributions

The work presented here was carried out in collaboration among all authors. Methodology, W.K. and Q.C.; software, Q.C.; investigation, W.K., H.D. and N.L.; original draft, Q.C.; Writing—review and editing and formal analysis, X.S. and X.D.; supervision, X.S. and X.D.; project administration, X.S. and X.D. All authors have read and agreed to the published version of the manuscript.

Funding

Funding was provided by the National Natural Science Foundation of China General Project, Research on Interpretable Fault Diagnosis Method for Coal Mining Machine with Knowledge Graph Enhanced Large Language Model (No. 52574201); the National Natural Science Foundation of China General Project, Research on Intelligent Fusion Prediction Method for Health Status of Coal Mining Machinery Driven by Big Data (No. 52174148); and the Key R&D Program Project of Shanxi Province, Key Technology Research on Intelligent Monitoring and Fault Diagnosis System for Right Angle Turning Scraper Conveyor Integrated Machine (No. 202202100401013).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors upon request.

Conflicts of Interest

Author Weiyi Kong was employed by the Shanxi Zhexing Security Technology Development Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Vijayakumar, A.; Vairavasundaram, S. YOLO-Based Object Detection Models: A Review and Its Applications. Multimed. Tools Appl. 2024, 83, 83535–83574.
2. Hoeser, T.; Bachofer, F.; Kuenzer, C. Object Detection and Image Segmentation with Deep Learning on Earth Observation Data: A Review—Part II: Applications. Remote Sens. 2020, 12, 3053.
3. Lee, J.; Hwang, K.I. YOLO with Adaptive Frame Control for Real-Time Object Detection Applications. Multimed. Tools Appl. 2021, 81, 36375–36396.
4. Chen, Y.; Yuan, X.; Wang, J.; Wu, R.; Li, X.; Hou, Q.; Cheng, M. YOLO-MS: Rethinking Multi-Scale Representation Learning for Real-Time Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 4240–4252.
5. Dong, Z.; Wang, M.; Wang, Y.; Liu, Y.; Feng, Y.; Xu, W. Multi-Oriented Object Detection in High-Resolution Remote Sensing Imagery Based on Convolutional Neural Networks with Adaptive Object Orientation Features. Remote Sens. 2022, 14, 950.
6. Tian, D.; Han, Y.; Wang, S. Object Feedback and Feature Information Retention for Small Object Detection in Intelligent Transportation Scenes. Expert Syst. Appl. 2024, 238, 121811–121825.
7. Liang, X.; Zhang, Y.; Wang, G.; Xu, S. A Deep Learning Model for Transportation Mode Detection Based on Smartphone Sensing Data. IEEE Trans. Intell. Transp. Syst. 2020, 21, 5223–5235.
8. Andrade, I.E.C.; Padilha, R.M.S.; Paz, R.F.; Zuanetti, D.A. Automated High-Resolution Microscopy for Clinker Analysis: A Divide-and-Conquer Deep Learning Approach with Mask R-CNN and PCA for Alite Measurement. Expert Syst. Appl. 2025, 293, 128552.
9. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
10. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
11. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
12. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
13. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
14. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.E.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
15. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327.
16. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636.
17. Khanam, R.; Hussain, M. What Is YOLOv5: A Deep Look into the Internal Features of the Popular Object Detector. arXiv 2024, arXiv:2407.20892.
18. Dong, X.; Yan, S.; Duan, C. A Lightweight Vehicles Detection Network Model Based on YOLOv5. Eng. Appl. Artif. Intell. 2022, 113, 104914–104928.
19. Bushra, S.N.; Shobana, G.; Maheswari, K.U.; Subramanian, N. Smart Video Surveillance Based Weapon Identification Using YOLOv5. In Proceedings of the International Conference on Electronic Systems and Intelligent Computing (ICESIC), Chennai, India, 22–23 April 2022; pp. 351–357.
20. Guan, Z.; Li, H.; Zuo, Z.; Pan, L. Design a Robot System for Tomato Picking Based on YOLOv5. IFAC-PapersOnLine 2022, 55, 166–171.
21. Fu, L.; Chen, J.; Zhang, Y.; Huang, X.; Sun, L. CNN and Transformer-Based Deep Learning Models for Automated White Blood Cell Detection. Image Vis. Comput. 2025, 161, 105631.
22. Mahaur, B.; Mishra, K.K. Small-Object Detection Based on YOLOv5 in Autonomous Driving Systems. Pattern Recognit. Lett. 2023, 168, 115–122.
23. Srinivasamurthy, C.; SivaVenkatesh, R.; Gunasundari, R. Six-Axis Robotic Arm Integration with Computer Vision for Autonomous Object Detection Using TensorFlow. In Proceedings of the Second International Conference on Advances in Computational Intelligence and Communication (ICACIC), Kakinada, India, 21–22 December 2023.
24. Zhang, Z. Flexible Camera Calibration by Viewing a Plane from Unknown Orientations. In Proceedings of the IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; pp. 666–673.
25. Li, J.; Wen, Y.; He, L. SCConv: Spatial and Channel Reconstruction Convolution for Feature Redundancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 6153–6162.
26. Wang, A.; Chen, H.; Lin, Z.; Han, J.; Ding, G. RepViT: Revisiting Mobile CNN from ViT Perspective. arXiv 2023, arXiv:2307.09283.
27. Gevorgyan, Z. SIoU Loss: More Powerful Learning for Bounding Box Regression. arXiv 2022, arXiv:2205.12740.
28. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
29. Ding, X.; Chen, H.; Zhang, X.; Huang, K.; Han, J.; Ding, G. Re-Parameterizing Your Optimizers Rather than Architectures. arXiv 2022, arXiv:2205.15242.
30. Wadekar, S.N.; Chaurasia, A. MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features. arXiv 2022, arXiv:2209.15159.
31. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000.
Figure 1. Overall processing flowchart.
Figure 2. Structural diagram of the sealant-spraying device: (1) three-dimensional sliding module, (2) robotic arm, and (3) sealant-spraying unit.
Figure 3. Process flow of the train door sealing process.
Figure 4. Technical flowchart of the improved detection algorithm.
Figure 5. Checkerboard pattern.
Figure 6. Improved YOLOv5s network architecture diagram.
Figure 7. SRU structure diagram.
Figure 8. CRU structure diagram.
Figure 9. (a) Original C3 structure; (b) improved C3 structure.
Figure 10. RepViT block structure.
Figure 11. Schematic of the SIoU loss function.
Figure 12. Dataset examples: (a) new carriage door gap; (b) old carriage door gap.
Figure 13. Model line chart: (a) precision, (b) recall, (c) mAP@0.5, (d) mAP@0.5:0.9, (e) train/box_loss, (f) train/obj_loss, (g) val/box_loss, and (h) val/obj_loss.
Figure 14. Model comparison table.
Figure 15. Door crack inspection results.
Table 1. Technical parameters of the industrial camera.

Technical Name         Technical Specification
Name                   CMOS industrial camera
Pixel size             3.45 μm × 3.45 μm
Sensor (target) size   2/3″
Resolution             2448 × 2048
Maximum frame rate     24.1 fps
Table 2. Parameter configuration.

Parameter Name          Parameter Value
Image size              640 × 640
Confidence threshold    0.25
IoU threshold           0.45
Initial learning rate   0.01
Weight decay            0.0005
Epochs                  200
Table 3. Ablation experiment results.

Methods                             Precision/%   Recall/%   mAP@0.5/%   GFLOPs   FPS
YOLOv5s                             91.8          92.0       93.4        15.8     38.6
YOLOv5s + SCConv                    93.9          98.3       96.7        13.7     40.3
YOLOv5s + RepViT                    92.4          96.3       98.1        12.3     41.3
YOLOv5s + SIoU                      94.9          94.6       94.2        16.2     34.1
YOLOv5s + SCConv + RepViT           94.2          95.5       98.7        15.9     39.7
YOLOv5s + SCConv + RepViT + SIoU    97.5          99.0       99.3        16.8     36.4
Table 4. Comparative experiment results.

Methods             Precision/%   Recall/%   mAP@0.5/%   GFLOPs   FPS
YOLOv5s             91.8          92.0       93.4        15.8     38.6
YOLOv5m             91.1          91.2       91.3        47.9     41.1
YOLOv5l             90.4          90.4       89.9        107.7    43.7
YOLOv7-tiny         92.3          89.7       94.1        13.2     47.2
YOLOv8s             94.1          91.5       95.6        29.3     51.7
YOLOv9-c            95.6          93.2       95.9        102.1    54.5
Improved YOLOv5s    97.5          99.0       99.3        16.8     36.4