Deep Learning-Based Understanding of Defects in Continuous Casting Product

Abstract: A novel YOLOv5 network is presented in this paper to quantify the degree of defects in continuously cast billets. The proposed network addresses the challenges posed by noise, dirty spots, and varying defect sizes in the images of these billets. The CBAM-YOLOv5 network integrates the channel and spatial attention of the Convolutional Block Attention Module (CBAM) with the C3 layer of the YOLOv5 network structure to better fuse channel and spatial information, focus on the defect target, and improve the network's detection capability, particularly for different levels of segregation. The feature pyramid is also improved: the feature map obtained after the fourth down-sampling of the backbone network is fed into the feature pyramid through CBAM to enlarge the receptive field of the target and reduce information loss during the fusion process. Finally, a self-built dataset of continuously cast billets collected from different sources is used for several experiments. The experimental results show that the mean average precision (mAP) of the network is 93.7%, which makes automated rating feasible.


Introduction
Since its inception in the 1950s, continuous casting technology has experienced significant development in the steel industry, gradually becoming a mainstream steel production process. Continuous casting transforms liquid steel into solid continuous cast products through cooling and solidification within a continuous casting machine [1]. In the course of continuous casting production, various factors can introduce different types of defects into the products, such as segregation, cracks, central porosity, shrinkage, and bubbles. These defects can have a substantial impact on the performance of steel products, such as tensile properties, toughness, corrosion resistance, wear resistance, and fatigue strength [2][3][4][5][6][7][8][9][10][11]. Therefore, characterization of the defects is of great importance.
Currently, the macroscopic inspection of continuous casting products is mostly performed using the acid etching method. Ratings of the slabs are usually based on a manual comparison between the cross-sectional images of the etched slabs and standard charts, as is the case for the Chinese National Standard YB/T 4003-2016 [12] and the Mannesmann rating method [13]. Such approaches depend strongly on the operator who makes the comparison. When the SSAB steel plant [14] had two operators evaluate the same group of cast slabs using the Mannesmann standard, the two operators gave very different results. In contrast, the Rapp [15] standard divides the segregation in the slabs into dispersed dots and a continuous line, and the quality of the slab is evaluated based on the number of dots within a given length. The Mannesmann and Rapp standards are limited in that they can only be used to grade slabs where the segregation tends to form a line, whereas the Chinese National Standard also gives standard charts for billets in which the defects are widely scattered.
The use of automated digital identification and rating is a new development trend that can save the effort needed for manual inspection. However, such an approach places stringent demands on image quality, as the captured images can be influenced by factors such as lighting, noise, and other environmental variables. Sometimes, defects lack distinct features, making them prone to confusion with background elements or stains. To reduce the impact of uneven lighting, Xi et al. [16] proposed a new framework for surface inspection. Zhao et al. [17] proposed a discriminant manifold regularized local descriptor to classify defects on steel surfaces. To detect pinholes in billets, Choi [18] proposed a Gabor filter combination to extract defect candidates and define morphological features. However, it is difficult to determine which feature is the most important in different images, and the result depends strongly on error handling and on the difference between the target and the background.
The use of artificial intelligence methods instead of manual inspection is a recent trend. Computer vision technology extracts information from images for processing and analysis, and its advantages are high efficiency, high accuracy, and a low dependence on human factors. Traditional image processing methods, whose recognition performance depends on the difference between the target and the background, are only suitable for simple backgrounds.
To obtain the location and types of defects directly, it is more effective to combine production with object detection algorithms. Tao et al. [19] designed a novel cascaded autoencoder architecture to segment and localize defects on a metallic surface. However, this method cannot distinguish defects against complex backgrounds. Lin et al. [20] adopted the faster region-based convolutional neural network (Faster R-CNN) [21] for defect detection on steel surfaces. However, its processing speed significantly restricts its practical application in real-time industrial inspection. In contrast to Faster R-CNN, YOLO is a one-stage method that has a faster processing speed while maintaining similar detection capabilities. Yang et al. [22] applied YOLOv5 to steel pipe weld defect detection and compared it with Faster R-CNN; YOLOv5 was much faster than Faster R-CNN and had a similar detection accuracy. Nevertheless, there are difficulties in applying YOLOv5 to the detection of defects. To improve the detection capabilities of the model, Li [23], Yao [24], Chen [25], and Zheng [26] modified the structure of the model, including changing the network structure, incorporating another model, and adding self-attention, and achieved better results.
This article extends the evaluation method for billets based on image detection techniques. By comparing real images with the standard charts in the Chinese national standard YB/T 4002-2013 [27], we first graded the defects of the corresponding billets manually and then used these data to train a YOLOv5-based model for automatic grading. In addition, we added CBAM attention to improve the accuracy and robustness of the method. This module improves the feature extraction capability of the YOLOv5 backbone network, which helps to minimize information loss during transmission and enhances the ability to detect defects.

Evaluation Method
In this study, hot acid was used to etch the billets of rebar steels and bainitic rail steels to reveal the defect area. The Chinese standard YB/T 4002-2013 [27] was used to determine the defect level of the billets. When rating billets of other sizes, the ratio of the defect area to the billet area can be used as the rating criterion, since area and pixel count scale by the same proportion. Therefore, the ratio of the number of pixels contained in the shrinkage cavity to the number of pixels contained in the billet was used as the rating parameter for the defect grade. In actual production, because of the hot acid etching treatment, the segregation after corrosion is often masked by the spot region, which appears as a porous area in the image. The spot itself is, however, caused by segregation, which is why we selected the porous region for grading.
In situations where the size of the billet is not provided in the images, a standard reference image is used as a basis for comparison in order to rate defects. The aspect ratios of the generated defect area anchor box and the billet detection anchor box are compared, as illustrated in Figure 1. To achieve precise grading for industrial production, the Lagrange interpolation method is used to determine half-grades. If the defect area ratio falls between two adjacent grades, it is rated as a half-grade, so the possible grades are 0, 0.5, 1, 1.5, ..., 4. The numerical results are listed in Table 1.
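The rating rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure of the standard: boxes are assumed to be axis-aligned `(x1, y1, x2, y2)` tuples in pixels, the threshold list is a hypothetical placeholder (it does not reproduce Table 1 or YB/T 4002-2013), and linear interpolation between adjacent thresholds, which is the two-point case of Lagrange interpolation, followed by snapping to the nearest 0.5 stands in for the half-grade step.

```python
from bisect import bisect_right

# Hypothetical area-ratio thresholds for grades 0, 1, 2, 3, 4.
# Placeholders for illustration only, NOT the values of Table 1.
HYPOTHETICAL_THRESHOLDS = [0.00, 0.02, 0.05, 0.10, 0.20]


def box_area(box):
    """Pixel area of an axis-aligned box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def defect_ratio(defect_box, billet_box):
    """Defect area divided by billet area; scale-free because both are in pixels."""
    billet = box_area(billet_box)
    if billet <= 0:
        raise ValueError("billet box has zero area")
    return box_area(defect_box) / billet


def grade_from_ratio(ratio, thresholds=HYPOTHETICAL_THRESHOLDS):
    """Interpolate linearly between integer-grade thresholds, then snap to 0.5 steps."""
    if ratio <= thresholds[0]:
        return 0.0
    if ratio >= thresholds[-1]:
        return float(len(thresholds) - 1)
    hi = bisect_right(thresholds, ratio)   # index of first threshold above ratio
    lo = hi - 1
    frac = (ratio - thresholds[lo]) / (thresholds[hi] - thresholds[lo])
    return round((lo + frac) * 2) / 2      # nearest half-grade


ratio = defect_ratio((400, 300, 600, 500), (0, 0, 1000, 1000))
grade = grade_from_ratio(ratio)
```

Because both areas are measured in pixels of the same image, no physical scale is needed, which matches the motivation given in the text for using the area ratio.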

Data Set of Continuous Cast Billets
The samples were taken from industrial production and were then polished smooth. For clarity, we use "billet" to refer to the material used in the experiments. The surface defects of the billets were revealed by hot acid etching: the billets were immersed in a solution of water and hydrochloric acid AR (about 36%) with a volume ratio of 1:1 and etched at 80 °C for 30 min. Afterwards, they were immediately cleaned with water and dried. A total of 267 images of different billet samples were collected with a resolution of 4096 × 3048. The images were then labeled with the software LabelImg (https://pypi.org/project/labelImg/, accessed on 17 October 2023); the labels include the defect area as a whole and the location of the billet, since the images carry no scale. In order to enrich the background information of the detected target and improve network robustness, the collected images were expanded by data augmentation techniques. Common data augmentation methods include image cropping, image scaling, color transformation, flipping, and rotation [28]. CutMix data augmentation [29] cuts out a portion of a training image and fills the cut-out area with pixel values from other areas in a random manner. Similar in principle to CutMix, Mosaic data augmentation [30] selects four images from the training dataset, flips or scales them, changes the brightness, and stitches them together into one image according to randomly selected stitching points. This process is shown in Figure 2. The method feeds four images into the network at one time, which enriches the background information and improves the robustness of the network. The random scaling also produces many small targets, which enriches the dataset with small targets.
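The stitching step of Mosaic augmentation can be illustrated with a short NumPy sketch. It is a simplified illustration under stated assumptions rather than the YOLOv5 implementation: the function names are hypothetical, each image is simply scaled to fill one quadrant around a random stitching point, flipping is the only per-image transform, and brightness changes and the remapping of bounding-box labels are omitted.

```python
import numpy as np


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize, used to avoid an image-library dependency."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows][:, cols]


def mosaic(images, out_size=640, rng=None):
    """Stitch four HxWx3 images into one out_size x out_size canvas."""
    if len(images) != 4:
        raise ValueError("mosaic needs exactly four images")
    if rng is None:
        rng = np.random.default_rng()
    # Random stitching point, kept away from the canvas borders.
    cx = int(rng.integers(out_size // 4, 3 * out_size // 4))
    cy = int(rng.integers(out_size // 4, 3 * out_size // 4))
    canvas = np.zeros((out_size, out_size, 3), dtype=images[0].dtype)
    # (row slice, column slice) of the four quadrants around (cx, cy).
    quadrants = [
        (slice(0, cy), slice(0, cx)),
        (slice(0, cy), slice(cx, out_size)),
        (slice(cy, out_size), slice(0, cx)),
        (slice(cy, out_size), slice(cx, out_size)),
    ]
    for img, (rs, cs) in zip(images, quadrants):
        if rng.random() < 0.5:
            img = img[:, ::-1]                       # random horizontal flip
        h, w = rs.stop - rs.start, cs.stop - cs.start
        canvas[rs, cs] = resize_nearest(img, h, w)   # scale to fill the quadrant
    return canvas, (cx, cy)
```

Since the stitching point is random, each quadrant is scaled differently, which is how the method produces the additional small targets mentioned above. In a real pipeline the bounding boxes would have to be flipped, scaled, and shifted with the same parameters.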

The YOLOv5 Algorithm
The schematic diagram of the YOLOv5 network architecture is shown in Figure 3. The network consists of four parts: the network input, the feature extraction backbone network, the feature fusion neck network, and the network output [31]. The backbone serves as the foundation for understanding the content of the input image, the neck network further processes and refines the feature maps before they are used for object detection, and the output network identifies and locates objects within the image.

Convolutional Block Attention Module
Clausen et al. [32] proposed the Convolutional Block Attention Module (CBAM), a simple and effective attention module for feed-forward convolutional neural networks. It is a lightweight, general-purpose module that can be seamlessly integrated into any CNN architecture and trained end-to-end with the underlying CNN; its structure is shown in Figure 4. In this work, the model structure is modified by adding CBAM after each C3 module of the backbone network to better fuse channel and spatial information, focus on defect targets, and improve network detection, especially for different levels of segregation. The feature pyramid is also improved: the feature map after the fourth down-sampling of the backbone network is injected into the feature pyramid through the CBAM module to enlarge the receptive field of the target and reduce the information loss in the fusion process, as Figure 5 shows.
A convolutional neural network performs down-sampling operations that increase the number of channels while reducing the width and height of the feature maps. To further enhance the ability to capture and utilize important information, channel attention can be applied. Channel attention allows the network to determine the importance of different channels and to combine them in a weighted manner that emphasizes the most informative channels. Channel attention calculates the weight of each channel, $M_c \in \mathbb{R}^{c \times 1 \times 1}$, according to Equation (1):
$$M_c(F) = \sigma\left(W_2\left(W_1\left(F^c_{avg}\right)\right) + W_2\left(W_1\left(F^c_{max}\right)\right)\right) \tag{1}$$
where the input feature map $F \in \mathbb{R}^{c \times H \times W}$, $c$ is the number of channels, and $H$ and $W$ are the height and width of the feature map; $F^c_{avg}$ and $F^c_{max}$ are the feature maps after average pooling and maximum pooling, respectively; $W_1$ and $W_2$ are the weights of the two layers of the multilayer perceptron; and $\sigma$ is the sigmoid activation function.
Then, the feature map after channel attention is passed through a spatial attention mechanism. Average pooling and maximum pooling are applied to the channel-attended feature map, resulting in two new feature maps. These two feature maps are concatenated and passed through a 7 × 7 convolutional layer, and the result is passed through the sigmoid activation function to give the final attention map. The spatial attention is given by Equation (2):
$$M_s(F) = \sigma\left(f^{7 \times 7}\left(\left[\mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F)\right]\right)\right) \tag{2}$$
where $f^{7 \times 7}$ denotes the convolution operation with a filter size of seven.
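To make the two attention steps concrete, the following NumPy sketch evaluates Equations (1) and (2) on a single feature map. It is an illustration under stated assumptions rather than the network code used in the paper: the function names are hypothetical, a ReLU is assumed between the two layers of the shared perceptron, the convolution uses zero padding so that the output keeps the input size, and all weights are supplied by the caller.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(F, W1, W2):
    """Equation (1): weights of shape (c, 1, 1) from a shared two-layer perceptron."""
    f_avg = F.mean(axis=(1, 2))                  # average pooling over H, W -> (c,)
    f_max = F.max(axis=(1, 2))                   # maximum pooling over H, W -> (c,)

    def mlp(v):
        return W2 @ np.maximum(W1 @ v, 0.0)      # assumed ReLU between the layers

    return sigmoid(mlp(f_avg) + mlp(f_max))[:, None, None]


def spatial_attention(F, kernel):
    """Equation (2): weights of shape (1, H, W) from a k x k convolution (k = 7 in the text)."""
    pooled = np.stack([F.mean(axis=0), F.max(axis=0)])   # pooling over channels -> (2, H, W)
    pad = kernel.shape[-1] // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))
    windows = sliding_window_view(padded, kernel.shape[-2:], axis=(1, 2))  # (2, H, W, k, k)
    conv = np.einsum("chwij,cij->hw", windows, kernel)   # cross-correlation over both maps
    return sigmoid(conv)[None]


def cbam(F, W1, W2, kernel):
    """Apply channel attention first, then spatial attention, as described above."""
    F = channel_attention(F, W1, W2) * F
    return spatial_attention(F, kernel) * F
```

For a feature map with `c` channels and an assumed reduction ratio `r`, `W1` has shape `(c // r, c)`, `W2` has shape `(c, c // r)`, and `kernel` has shape `(2, 7, 7)`. Both attention maps lie in (0, 1), so the module rescales features without changing their shape.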

Loss Function
The purpose of the loss function is to measure how far the predicted output of a neural network is from the intended output. The smaller the value of the loss function, the closer the predicted output is to the desired output. In this study, cross-entropy loss is used as the classification loss. The overall loss function is a weighted sum of the position loss, confidence loss, and classification loss, which is used to guide the network optimization process and update the network parameters. The optimization process continues until the value of the loss function reaches a minimum. At this point, the network has learned the mapping between the input and output and can accurately detect the defects in the cast billet images. The classification loss represents the probability of belonging to a certain category, and the location loss $L_{box}$ is defined as [33]:
$$L_{box} = 1 - IoU + \frac{\rho^2(A, B)}{c^2} + \alpha\upsilon \tag{3}$$
where IoU (Intersection over Union) is the ratio of the intersection to the union of the prediction box and the real box; the larger the IoU, the closer the prediction is to the real box. $\rho$ is the Euclidean distance between the center points of the real box $A$ and the prediction box $B$; $c$ is the diagonal length of the smallest box enclosing both; $\alpha$ is a weight; and $\upsilon$ measures the consistency of the aspect ratio between $A$ and $B$. IoU is defined in Equation (4) and illustrated in Figure 6.
$$IoU = \frac{|A \cap B|}{|A \cup B|} \tag{4}$$
where $A$ is the real box, $B$ is the prediction box, $A \cap B$ is their intersection, and $A \cup B$ is their union; $\alpha$ and $\upsilon$ are defined as:
$$\alpha = \frac{\upsilon}{(1 - IoU) + \upsilon}, \qquad \upsilon = \frac{4}{\pi^2}\left(\arctan\frac{w_A}{h_A} - \arctan\frac{w_B}{h_B}\right)^2$$
where $w$ and $h$ denote the width and height of the corresponding box. The classification loss and confidence loss in this study both use the binary cross-entropy loss function, which is defined as follows:
$$L_{BCE} = -\frac{1}{N}\sum_{n=1}^{N}\left[y_n \log x_n + (1 - y_n)\log(1 - x_n)\right]$$
where $N$ is the total number of bounding boxes or detection anchors, $y_n$ is the target value, and $x_n$ is the predicted confidence score for the $n$th bounding box. This score represents the model's confidence in detecting an object within that box. The classification loss and confidence loss guide the training process to improve the model's ability to classify objects correctly and to predict confidence scores accurately for each bounding box.
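The location loss and the binary cross-entropy can be checked numerically with a few lines of plain Python. The sketch below assumes boxes given as `(x1, y1, x2, y2)` with positive width and height and predicted scores that are already probabilities; the function names and the small `eps` terms, which only guard against division by zero and `log(0)`, are additions for illustration and are not part of the definitions above.

```python
import math


def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def ciou_loss(a, b, eps=1e-9):
    """L_box = 1 - IoU + rho^2 / c^2 + alpha * v for real box a and predicted box b."""
    i = iou(a, b)
    # Squared distance between the two box centres.
    rho2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 \
         + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both boxes.
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 \
       + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    # Aspect-ratio consistency term and its weight.
    v = (4 / math.pi ** 2) * (
        math.atan((a[2] - a[0]) / (a[3] - a[1]))
        - math.atan((b[2] - b[0]) / (b[3] - b[1]))
    ) ** 2
    alpha = v / ((1 - i) + v + eps)
    return 1 - i + rho2 / (c2 + eps) + alpha * v


def bce_loss(targets, probs, eps=1e-12):
    """Mean binary cross-entropy between 0/1 targets and predicted probabilities."""
    total = 0.0
    for y, x in zip(targets, probs):
        x = min(max(x, eps), 1 - eps)   # clamp so that log() stays finite
        total += y * math.log(x) + (1 - y) * math.log(1 - x)
    return -total / len(targets)
```

For two identical boxes every penalty term vanishes and the loss is zero; for two equally shaped boxes that only partly overlap, the aspect-ratio term is zero and the loss reduces to `1 - IoU` plus the centre-distance penalty.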
To validate the performance of the CBAM-YOLOv5 detection model, the mean average precision (mAP), precision (P), and recall (R) were used. Precision is the ratio of correctly identified positives to all predicted positives, and recall is the ratio of correctly identified positives to all actual positives. They are defined as
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$
where TP is the number of correctly identified defects, FP is the number of background regions identified as defects, and FN is the number of defects that were not identified.
The mAP is defined as
$$AP = \int_0^1 P(R)\,dR, \qquad mAP = \frac{1}{K}\sum_{i=1}^{K} AP_i$$
where AP is the area under the P-R curve formed by precision and recall, $K$ is the number of categories, and mAP is the average of the AP values over all categories, which measures the detection performance of the network model.
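The metric definitions above translate directly into code. The sketch below is a simplified illustration with hypothetical function names: it integrates the P-R curve with a plain rectangle rule over detections sorted by confidence, whereas detection toolkits usually smooth the curve before integrating, and it assumes the matching of detections to real defects has already been done.

```python
def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN); 0.0 when a denominator is zero."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r


def average_precision(is_tp, n_ground_truth):
    """Area under the P-R curve for one class.

    is_tp lists the detections in order of descending confidence; True marks a
    detection matched to a real defect, False a false positive.
    """
    tp = fp = 0
    ap = prev_recall = 0.0
    for hit in is_tp:
        tp += hit
        fp += not hit
        p, r = precision_recall(tp, fp, n_ground_truth - tp)
        ap += p * (r - prev_recall)   # rectangle rule along the recall axis
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """Average of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

For example, three detections ranked as hit, miss, hit against two real defects give precision 1 at recall 0.5 and precision 2/3 at recall 1, so the AP is 0.5 × 1 + 0.5 × 2/3 = 5/6.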

Comparison Results of Different Attention Modules
To further explore the effect of introducing CBAM in YOLOv5, we introduced different attention modules in the neck of YOLOv5, such as Efficient-Channel-Attention (ECA), Coordinate-Attention (CA), and Squeeze-and-Excitation (SE). The experimental results are shown in Table 2.
Defects appear against complex, gradually varying backgrounds and differ in size and type, and in fast-paced industrial production some scratches and dirty spots are easily mistaken for defect features. Therefore, a CBAM-YOLOv5 defect detection network based on YOLOv5 is proposed by combining the optical properties of cast images, defect imaging characteristics, and detection requirements to achieve more efficient identification. Two improvements were made to YOLOv5, and ablation experiments were conducted to evaluate the impact of each improvement and of their combination. The experimental results are presented in Table 3. The mAP of the original YOLOv5 model is the lowest (84.1%), which does not meet detection requirements. After incorporating the CBAM attention module, the mAP increased to 91.0%, indicating that the CBAM attention improved the detection capability for defects. When the CBAMC3 module alone was added after SPP, the mAP was 92.3%, while the combination of SPP channels and the CBAM spatial attention produced a mAP of 93.7%, suggesting that combining the two enhancements significantly improved the detection of defects. This improvement was achieved through better feature representation and feature fusion: the combination improved feature extraction by the backbone network, incorporated more semantic information into the pyramid layer during feature fusion, and reduced information loss during transmission. Compared to the other attention modules, the best performance was achieved by introducing the CBAM module in YOLOv5. Specifically, the two CBAM-YOLOv5 variants reached 91.06% and 90.02%, respectively, which were 1.32% and 1.19% higher than the weakest model (YOLOv5-ECA). Thus, introducing the CBAM attention mechanism improved the performance of the model.
Figure 7 shows the loss of the model during the training process: Figure 7a-c shows the training loss of the original YOLOv5s model, and Figure 7d-f shows the training loss of the improved CBAM-YOLOv5 model. The loss curve represents the model's performance over the training iterations and should decrease gradually over time. When the loss plateaus, the model has reached a stable state. Models that exhibit a rapid descent and achieve lower final loss values are generally preferred. In the initial stage of training, the loss decreases rapidly due to the high learning rate of the model. After more than 50 training rounds, the confidence loss of the original model starts to increase on the test set, which indicates overfitting, while the other losses fluctuate less. The loss curve of the improved model gradually stabilizes after 80 training rounds, and the model works better.
Figure 8 shows the accuracy and recall as the number of training rounds increases. At the time of stabilization, the mAP of the original YOLOv5s model is 84.1%, while the mAP of the improved CBAM-YOLOv5 model is 93.7%, an improvement of 9.6 percentage points.
The performance of the deep learning model is affected by different training parameters, including the input image size, number of epochs, batch size, learning rate, and the optimizer used. In addition, different confidence thresholds were set and tested, and the learning rate was found to perform better at 0.01. A confidence threshold of 0.5 was chosen as the parameter for the model detection experiments. In order to verify whether these parameters are optimal, the improved CBAM-YOLOv5 was tested several times on the self-built continuous casting billet defect dataset, and the parameters were adjusted to observe the changes in performance. The experimental results are shown in Table 4. In our experiments, we found that the variation in the loss function tends to stabilize as the number of epochs approaches 100. Thus, the number of epochs is set to 100 in this paper. Table 4 shows that Exp5 has the highest mAP, which also validates the settings of the experimental parameters in this paper.
In industrial production, the surface of the billets is often not ideal and may contain dirty spots or complex backgrounds. Figure 9 demonstrates the recognition performance of the YOLOv5 model and the CBAM-YOLOv5 model on complex backgrounds and dirty spots. The CBAM incorporates channel attention, which dynamically re-weights the importance of different channels in the network, allowing the model to focus on defect features. In addition, the CBAM integrates spatial attention, which emphasizes important spatial regions within an image. This enables the model to suppress irrelevant background information and concentrate on the defect regions. By attending to specific spatial regions, the model becomes more robust in the presence of complex backgrounds.
In Figure 8a,b, the original YOLOv5 model classifies dirty spots as defects and marks them, whereas the CBAM-YOLOv5 model extracts defect features and distinguishes them from dirty spots. Figure 8c,d compares the performance of the models in complex environments, where the CBAM-YOLOv5 model exhibits superior detection capabilities compared to the original model.

Visualizing Knowledge of CNNs via Grad-CAM
A common issue with neural networks is that it is difficult to explain the decisions they make, which is why they are often labeled "black boxes". Recent efforts to address this issue and explain model outputs include the CAM (Class Activation Mapping) [34] and Grad-CAM (Gradient-weighted Class Activation Mapping) [35] techniques.
Finally, to evaluate the effect of the model, a Grad-CAM visualization is used to analyze the attention mechanism and improve interpretability, as shown in Figure 10. Figure 10a displays the heat map without the attention mechanism, indicating difficulties in recognizing defects due to their similarity to the background. In contrast, Figure 10b shows the image after adding CBAM, revealing that the attention mechanism prioritizes the center of the image and improves the accuracy of defect recognition while reducing mislabeling.
Grad-CAM generates a heatmap that highlights the regions of the input image where the model's attention is concentrated. This heatmap is superimposed on the original image, making it visually evident where the model is making its predictions. By examining the heatmap, one can easily identify the areas of the image that are most relevant to the model's defect detection decision. This localization is crucial for understanding which parts of the image the model considers when identifying defects. The visualization supports the model's effectiveness and highlights the importance of attention mechanisms in improving recognition accuracy and robustness.
Typically, the area in Figure 10c would be measured manually, and the longest edge would be compared to the dimensions of the cast billet for calculation. In contrast, our method in Figure 10b directly frames the largest area and allows the computer to calculate the rating. This method significantly speeds up the rating process while maintaining the same level of accuracy as manual measurement. Moreover, it produces numerical data, avoiding human subjectivity in the rating process.
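The heatmap computation that Grad-CAM performs in this section can be summarized in a short NumPy sketch. It covers only the weighting step and assumes that the activations of the chosen convolutional layer and the gradients of the target score with respect to them have already been extracted from the network; the function name and the final scaling to [0, 1] for display are choices made for this illustration.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Grad-CAM map from one layer's activations and the gradients of the
    target score with respect to them, both shaped (K, H, W)."""
    weights = gradients.mean(axis=(1, 2))              # global-average-pooled gradients, (K,)
    cam = np.tensordot(weights, activations, axes=1)   # weighted sum over channels -> (H, W)
    cam = np.maximum(cam, 0.0)                         # ReLU keeps positive evidence only
    peak = cam.max()
    return cam / peak if peak > 0 else cam             # scale to [0, 1] for display
```

The resulting map has the spatial size of the chosen layer, so it is usually upsampled to the image resolution before being overlaid on the billet image as in Figure 10.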
Figure 11 shows the inspection results of CBAM-YOLOv5: in the cast billet images, both the defect area and the billet as a whole are marked in order to meet the rating requirements. Since no scale information is given in the images, the pixel scale is used as the basis for sizing, and the defect area is compared with the overall size of the billet according to the values given in Table 1, with the aim of obtaining the defect grade automatically.

Conclusions
This study proposes a detection and evaluation method for the analysis of segregation in billets that addresses challenges such as varying defect sizes and complex backgrounds. A CBAM-YOLOv5 network was used, which incorporates SPP and CBAM mechanisms to extract important features and reduce interference from background information. The modified network achieves a high detection accuracy for cast billet defects, with a mAP of 93.7%, an improvement of 9.6 percentage points over the original model. The study also improves the interpretability of the model by using Grad-CAM to demonstrate the role of the attention mechanism in defect detection. In addition, the identified defect areas are compared with the cast billet area to obtain the defect score, which reduces subjectivity and randomness compared with manual measurement and better meets industry needs.

Figure 1. Recognition of defective images in the corresponding standard. The red area represents the detected defect region.

Figure 2. Schematic of mosaic data enhancement.

After preprocessing, the images were divided into training data and validation data at a ratio of 7:3. The training data were used to train the deep neural network, and the validation data were used to evaluate the performance of the model during the training process.
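A 7:3 division of this kind can be sketched as below. The source does not state how the images were assigned to each subset, so the seeded random shuffle, the function name `split_dataset`, and the file names in the usage example are assumptions made only for illustration.

```python
import random


def split_dataset(image_paths, train_fraction=0.7, seed=0):
    """Shuffle image paths reproducibly and split them into train/validation."""
    paths = sorted(image_paths)          # fixed starting order for reproducibility
    random.Random(seed).shuffle(paths)   # seeded shuffle (an assumption)
    cut = round(len(paths) * train_fraction)
    return paths[:cut], paths[cut:]


if __name__ == "__main__":
    # Hypothetical file names, used only to demonstrate the call.
    images = [f"billet_{i:03d}.jpg" for i in range(10)]
    train, val = split_dataset(images)
    print(len(train), len(val))
```

Fixing the seed keeps the validation set identical across training runs, so that the baseline and the modified network are evaluated on the same images.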

et al. proposed the Convolutional Block Attention Module (CBAM): a simple and effective attention module for feed-forward convolutional neural networks. It is a lightweight and general-purpose module. It can be seamlessly integrated into any CNN architecture and can be trained end-to-end with the underlying CNN with the struc-

Figure 7. Variation of box loss, object loss, and classification loss of the model during the training process. (a-c) Loss curves of the original YOLOv5 model. (d-f) Loss curves of the CBAM-YOLOv5 model.

Figure 8 shows the accuracy metrics. Recall increases as the number of training rounds increases. At the time of stabilization, the mAP of the original YOLOv5s model is 84.1%, while the mAP of the improved CBAM-YOLOv5 model is 93.7%, an improvement of 9.6 percentage points.
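The reported gain is a difference between two percentages, so it can be read either as an absolute gain (in percentage points) or as a gain relative to the baseline. The short calculation below shows both readings using only the two mAP values stated above; the variable names are illustrative.

```python
baseline_map = 84.1   # mAP of the original YOLOv5s model, in percent
improved_map = 93.7   # mAP of the CBAM-YOLOv5 model, in percent

# Absolute gain: difference in percentage points.
absolute_gain = improved_map - baseline_map

# Relative gain: the same difference expressed as a share of the baseline.
relative_gain = absolute_gain / baseline_map * 100

print(f"absolute gain: {absolute_gain:.1f} percentage points")
print(f"relative gain: {relative_gain:.1f}% of the baseline mAP")
```

The 9.6 figure quoted in the text corresponds to the absolute gain.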

Figure 8. Precision (P), recall (R), and mAP variation of the model during the training process. The red line represents the CBAM-YOLOv5 model, and the blue line represents the original YOLOv5 model: (a) precision, (b) recall, (c,d) mAP at different IoU thresholds.

Figure 10. Grad-CAM heat map for visual validation of both models. (a) Visualization results of the original YOLOv5 model, (b) visualization results of the CBAM-YOLOv5 model, and (c) defect area marked manually.

Table 2. Experimental results of inserting different attention mechanisms.

Table 3. Results of ablation experiments.