Convolutional Network Research for Defect Identification of Productor Appearance Surface

Abstract: The accurate and rapid identification of surface defects is an important element of product appearance quality evaluation, and the application of deep learning to surface defect recognition is an active research topic. In this paper, a lightweight KD-EG-RepVGG network based on structural reparameterization is designed, taking the identification of surface defects on strip steel as an example. To improve the stability and accuracy of strip steel surface defect recognition, an efficient channel attention network is introduced into the network, and the Gaussian error linear unit is adopted as the activation function to prevent neurons from being set to zero during training, which would leave their parameters without updates. Finally, knowledge distillation is used to transfer the knowledge of the RepVGG-A0 network to the lightweight model, giving it better accuracy and generalization capability. The experimental results indicate that, in the inference phase, the model has a computational cost of 22.3 M FLOPs and 0.14 M parameters, a defect recognition accuracy of 99.44% on the test set, and a single-image detection time of 2.4 ms, making it more suitable for deployment in real engineering environments.


Introduction
The detection of defects on a product's surface is important underlying research in the area of intelligent production, and this paper investigates the detection of surface defects in strip steel during industrial production. The surface quality of strip steel is one of the most important indicators of strip steel quality and is linked to the quality of products downstream in areas such as automotive, household appliances and construction. The detection of surface defects in steel has therefore become an extremely significant task in the steel production sector.
The identification of product surface defects is an important task on enterprise production lines. In the early days, the task was carried out by manual visual inspection and was therefore constrained by the limits of human vision. After the emergence of image processing technology, defects were identified from hand-crafted features of the defect image. Zhou [1] et al. applied the SIFT algorithm to the identification of defects on the surface of medium-thick plates and achieved a good accuracy of 95% for defects that occur continuously. Hu [2] et al. extracted four visual features of the target image (geometry, shape, texture and greyscale) and used a genetic algorithm to optimize a hybrid chromosome-based classification model for effective identification of image defects. However, feature-based methods have difficulty detecting tiny defects and other subtle imperfections. In recent years, deep learning methods, such as convolutional neural networks, have been proposed and applied in many fields.
Since the introduction of the AlexNet [3] convolutional neural network in 2012, convolutional neural networks have demonstrated high efficiency and accuracy in object recognition. They have gradually become an important research direction in detection and recognition, and accurate, fast and contact-free recognition techniques are continuously being investigated. Manzo [4] et al. used pre-trained convolutional neural networks to detect COVID-19 in CT images and obtained an accuracy of 96.5%. Jiang [5] et al. used an improved VGG network to identify rice and wheat leaf diseases simultaneously. Tao [6] et al. accurately identified smaller flames using an improved GoogLeNet network. As a research hotspot, deep convolutional neural networks have been used in a wide range of industries.
Convolutional neural networks have been extensively applied to product surface defect recognition. Vonnocc [7] et al. used traditional machine learning methods and deep learning methods to classify surface defects in hot-rolled strip steel and found that the deep learning approach worked better. Konovalenko [8] et al. detected surface defects in strip steel based on the ResNet50 framework, with a recognition precision of 96.91%. Xiang [9] et al. used a small-sample dataset to achieve a recognition rate of 97.8% on an improved VGG-19 network. Feng [10] et al. added FcaNet and CMAM modules to ResNet, achieving an accuracy of 94.11% for defect identification in hot-rolled strip steel. Tang [11] et al. used multi-scale maximum pooling and an attention mechanism to detect surface defects, reaching a classification accuracy of 94.73%. Xing [12] proposed a convolutional classification model with a symmetric structure to achieve accurate recognition of surface defects. These studies focus on accuracy and neglect the computational cost, complexity and real-time requirements of the models in real-world applications. Wang [13] et al. designed the VGG-ADB model for defect recognition, which achieved 99.63% classification accuracy and an inference speed of 333 frame/s. The VGG-ADB model takes inference speed into account but not the parameter count: the model size reaches 72.15 M, which constrains the application of the model on edge devices. In actual production, the network requires not only extremely high detection accuracy but also a small model size, high detection speed and real-time detection.
The KD-EG-RepVGG surface defect detection algorithm is designed using structural reparameterization, GELU, the ECA network and knowledge distillation to meet the requirements of surface defect identification. Experimental comparison shows that the KD-EG-RepVGG network has few parameters, low computational cost, high speed and high accuracy. The general idea of the method is illustrated in Figure 1. The teacher network RepVGG-A0 guides the training of the KD-EG-RepVGG network. The structural reparameterization technique then loads the trained weights into the KD-EG-RepVGG inference network to obtain the prediction results. This paper is structured as follows. Section 2 describes the KD-EG-RepVGG network framework in detail. Section 3 verifies the validity of the network from several perspectives, and Section 4 concludes the paper.

The KD-EG-RepVGG Network
The EG-RepVGG network is based on structural reparameterization. It incorporates a lightweight attention network, uses GELU as the activation function, and stacks S-RepVGG block and D-RepVGG block modules, both of which are derived from the RepVGG Block. The model structure is shown in Figure 2. The main function of the D-RepVGG block is to extract features and adjust the spatial size and channel number of the feature map, whereas the main purpose of the S-RepVGG block is feature extraction. Compared with the D-RepVGG block, the S-RepVGG block has an additional direct connection, which mimics the residual connection in ResNet [14] and improves the model's ability to extract features. The output of D-RepVGG Block5 is passed through a global average pooling layer, followed by a softmax classifier. The global average pooling layer downsamples the spatial resolution of the feature map to 1 × 1, and the softmax layer outputs the predicted categories; together they form the classification layer. To further improve the accuracy and generalization performance of the model, RepVGG-A0 is used as a teacher model to guide the training of the EG-RepVGG model through knowledge distillation. The final result is a lightweight, fast and highly accurate strip steel surface defect recognition model, the KD-EG-RepVGG model. The detailed structural information of the KD-EG-RepVGG model is shown in Table 1.

Structural Re-Parameterisation
Structural reparameterization was first proposed in RepVGG networks by Ding XiaoHan [15] et al. The technique decouples the inference network from the training network. This decoupling retains the feature extraction advantages of multi-branch training while providing the high speed and low memory consumption of a single-path model at inference and deployment. The core component of the RepVGG network is the RepVGG Block. Its structure is shown in Figure 3.
The structure of the network under training is illustrated in Figure 3a. In the training phase, the RepVGG Block consists mainly of 3 × 3 convolutional kernels, 1 × 1 convolutional kernels and Identity branches. By adding Identity branches and 1 × 1 convolutional branches in parallel, information at different scales of the image can be extracted and fused, increasing the representational power of the model.
In the inference stage, the 1 × 1 convolution and Identity branch from the training are fused into the 3 × 3 convolution, and the inference structure is shown in Figure 3b. RepVGG Block takes the training network and re-parameterizes it structurally, turning the network into a single linear structure consisting mainly of 3 × 3 convolutions without any branches. The inference structure both gains the parameter weights obtained from multi-branch training and allows the use of the single linear structure to speed up the inference of the model during the deployment inference phase. At the same time, deep optimization of the 3 × 3 convolution based on NVIDIA cuDNN's computational library accelerates the model's detection speed in the inference phase.
The structural reparameterization in the inference phase mainly consists of the fusion of the convolution kernel and the Batch Normalization (BN) layer [16], the integration of the 1 × 1 convolution into the 3 × 3 convolution and the integration of the Identity branch into the 3 × 3 convolution. The BN layer involved in the fusion of the convolution and BN layers is computed as follows:

$$\mathrm{BN}(x) = \gamma \frac{x - \mu}{\sqrt{\sigma^2 + \varepsilon}} + \beta \tag{1}$$

where $\mu$ denotes the mean of the BN layer and $\sigma^2$ denotes the BN layer variance; $\mu$ and $\sigma^2$ are obtained statistically from the training dataset; $\varepsilon$ is a constant that prevents the denominator from being zero; $\gamma$ is the scale factor of the BN layer; $\beta$ is the offset of the BN layer; and the values of both $\gamma$ and $\beta$ are obtained in training. For convolution, the formula is given in Equation (2):

$$\mathrm{Conv}(x) = W x + b \tag{2}$$

where $x$ and $\mathrm{Conv}(x)$ are the input and output of the convolution; $W$ denotes the weight matrix of the convolution; and $b$ is the bias of the convolution layer. The input to the BN layer is the output of the convolution that precedes it. This is equivalent to substituting Equation (2) into Equation (1), which gives Equation (3):

$$\mathrm{BN}(\mathrm{Conv}(x)) = \gamma \frac{W x + b - \mu}{\sqrt{\sigma^2 + \varepsilon}} + \beta \tag{3}$$

Rearranging and simplifying gives:

$$\mathrm{BN}(\mathrm{Conv}(x)) = \frac{\gamma W}{\sqrt{\sigma^2 + \varepsilon}} x + \left( \frac{\gamma (b - \mu)}{\sqrt{\sigma^2 + \varepsilon}} + \beta \right) \tag{4}$$

From this result, a new convolution is obtained by incorporating the parameters of the Batch Normalization layer into the convolution layer, where the convolution weight is $\frac{\gamma}{\sqrt{\sigma^2 + \varepsilon}} W$ and the bias of the convolution is $\frac{\gamma (b - \mu)}{\sqrt{\sigma^2 + \varepsilon}} + \beta$.
For the Identity branch in the RepVGG Block, the identity mapping is first expressed as a 1 × 1 convolution whose kernel has a weight of 1, and this kernel is then extended to a 3 × 3 convolution kernel that performs identity mapping on the input features, so that the output of the Identity branch is unchanged before and after the transformation. For the 1 × 1 convolution branch, zeros are padded around the 1 × 1 convolution kernel so that it becomes a 3 × 3 convolution. At this point, both the 1 × 1 convolution and the Identity branch have been converted into 3 × 3 convolutions, and based on the additivity of the convolution operation, the three branches can be merged into a single 3 × 3 convolution. The process is shown in Figure 4.
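The fusion described in this section can be checked numerically. The following is a minimal sketch in plain Python (not the PyTorch code used in the paper) for a single input channel and a single output channel. The helper names `fuse_conv_bn`, `conv2d_same`, `pad_to_3x3` and `merge_branches` are illustrative assumptions, the numbers are toy values, and the branches passed to `merge_branches` are assumed to have had their BN layers folded in already.

```python
import math

def fuse_conv_bn(kernel, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold BN statistics into a single-channel conv kernel and bias (Equation (4))."""
    scale = gamma / math.sqrt(var + eps)
    fused_kernel = [[scale * w for w in row] for row in kernel]
    fused_bias = scale * (bias - mean) + beta
    return fused_kernel, fused_bias

def conv2d_same(image, kernel, bias=0.0):
    """Single-channel 2D convolution (cross-correlation) with zero padding."""
    k = len(kernel)
    pad = k // 2
    h, w = len(image), len(image[0])
    out = []
    for r in range(h):
        row_out = []
        for c in range(w):
            acc = bias
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad, c + j - pad
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[i][j] * image[rr][cc]
            row_out.append(acc)
        out.append(row_out)
    return out

def pad_to_3x3(center_weight):
    """Zero-pad a 1x1 kernel (a single weight) to a 3x3 kernel."""
    return [[0.0, 0.0, 0.0], [0.0, center_weight, 0.0], [0.0, 0.0, 0.0]]

def merge_branches(k3, b3, w1, b1):
    """Sum the 3x3 kernel, the padded 1x1 kernel and the identity kernel."""
    k1 = pad_to_3x3(w1)
    ident = pad_to_3x3(1.0)  # identity expressed as a 1x1 conv with weight 1
    merged = [[k3[i][j] + k1[i][j] + ident[i][j] for j in range(3)] for i in range(3)]
    return merged, b3 + b1

# Toy values, chosen only for illustration.
image = [[1.0, 2.0, 0.5], [0.0, -1.0, 3.0], [2.0, 1.0, 1.5]]
k3, b3 = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.2, 0.1, -0.3]], 0.05
w1, b1 = 0.7, -0.02

# Training-time output: the three branches are evaluated separately and summed.
out3 = conv2d_same(image, k3, b3)
out1 = conv2d_same(image, [[w1]], b1)
branch_sum = [[a + b + c for a, b, c in zip(r3, r1, rid)]
              for r3, r1, rid in zip(out3, out1, image)]

# Inference-time output: a single merged 3x3 convolution.
merged_kernel, merged_bias = merge_branches(k3, b3, w1, b1)
single_path = conv2d_same(image, merged_kernel, merged_bias)

max_diff = max(abs(a - b) for ra, rb in zip(branch_sum, single_path)
               for a, b in zip(ra, rb))
print(f"max difference between multi-branch and merged outputs: {max_diff:.2e}")
```

In a full RepVGG Block the same folding is applied per output channel and across input channels, and each branch carries its own BN layer; the single-channel case is used here only to keep the arithmetic visible.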

Efficient Channel Attention Network
The Efficient Channel Attention (ECA) network [17] is added to the RepVGG Block to form the E-RepVGG network, so that feature information can be captured efficiently while adding almost no parameters to the model. The structure of ECA is shown in Figure 5. The feature map $x \in \mathbb{R}^{L \times S \times T}$ output from the convolution is globally average pooled over the spatial dimensions to produce a feature vector $y$ of size $1 \times 1 \times T$, as shown in Equation (5):

$$y = \frac{1}{L \times S} \sum_{i=1}^{L} \sum_{j=1}^{S} x_{ij} \tag{5}$$

where $L$ and $S$ are the width and height of the feature map, respectively, and $T$ is the number of channels in the feature map. The channel weighting coefficients produced by the ECA network are calculated by the following equation:

$$\Psi = \mathrm{sigmoid}(\Omega y) \tag{6}$$

where sigmoid is the sigmoid activation function; $\Psi$ is the weight assigned by the ECA network to each channel; and $\Omega$ is the parameter matrix for calculating the channel attention in the ECA network. The parameter matrix has the following band structure:

$$\Omega = \begin{bmatrix} \omega^{1,1} & \cdots & \omega^{1,k} & 0 & 0 & \cdots & 0 \\ 0 & \omega^{2,2} & \cdots & \omega^{2,k+1} & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & \cdots & 0 & 0 & \omega^{T,T-k+1} & \cdots & \omega^{T,T} \end{bmatrix} \tag{7}$$

It is clear from $\Omega$ that each weight value in $\Psi$ is determined only by the $k$ channels in the immediate vicinity of the corresponding entry of $y$. This can be expressed as a 1-dimensional convolution (Conv1d) with a kernel of size $k$. Substituting this simplification yields:

$$\Psi = \mathrm{sigmoid}(\mathrm{Conv1d}(y)) \tag{8}$$

where Conv1d denotes a 1-dimensional convolution with kernel size $k$. In this paper, considering the model parameters and inference speed, the size of all 1-dimensional convolution kernels is set to 3.
The weight coefficient of each channel calculated by the efficient attention network is multiplied by the corresponding channel of the input feature map $x \in \mathbb{R}^{L \times S \times T}$ to obtain the output:

$$\tilde{x} = \Psi \otimes x \tag{9}$$

where $\tilde{x} \in \mathbb{R}^{L \times S \times T}$ is the output of the ECA network.
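The ECA computation (global average pooling, a 1-dimensional convolution across channels, a sigmoid, and channel-wise reweighting) can be sketched as follows. This is an illustrative plain-Python version, not the authors' PyTorch implementation; the function names, the zero padding at the channel boundaries and the toy kernel values are assumptions made for the example.

```python
import math

def global_avg_pool(x):
    """x is a list of T channels, each an L x S grid; returns one mean per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]

def conv1d_same(y, kernel):
    """1D convolution over the channel axis with zero padding (output length = input length)."""
    k = len(kernel)
    pad = k // 2
    out = []
    for i in range(len(y)):
        acc = 0.0
        for j in range(k):
            idx = i + j - pad
            if 0 <= idx < len(y):
                acc += kernel[j] * y[idx]
        out.append(acc)
    return out

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def eca(x, kernel):
    """Return the reweighted feature map and the per-channel weights."""
    y = global_avg_pool(x)                                   # squeeze spatial dimensions
    psi = [sigmoid(v) for v in conv1d_same(y, kernel)]       # local cross-channel interaction
    out = [[[psi[t] * val for val in row] for row in ch] for t, ch in enumerate(x)]
    return out, psi

# Toy feature map with T = 3 channels of size 2 x 2 and a kernel of size k = 3.
x = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[0.0, 2.0], [4.0, 2.0]],
    [[3.0, 3.0], [3.0, 3.0]],
]
reweighted, weights = eca(x, kernel=[0.2, 0.5, 0.2])
print("channel weights:", [round(w, 4) for w in weights])
```

With a kernel of size 3, the module adds only three learnable weights regardless of the number of channels, which is why it suits a lightweight network.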

Gaussian Linear Units
The rectified linear unit (ReLU) activation function is used in the RepVGG Block, and it effectively alleviates the vanishing and exploding gradient problems that arise as the neural network deepens. However, the ReLU activation function also has drawbacks. When the input is less than zero, the ReLU output is set directly to zero and its gradient is zero as well, so a neuron whose inputs stay negative stops being updated and can remain permanently inactive, which is detrimental to the convergence of the network model and to feature extraction. Therefore, Gaussian Error Linear Units [18] (GELU) are selected as the activation function in this paper to form the EG-RepVGG network. The GELU activation function is applied as a non-linear unit after the ECA network. The GELU activation function is differentiable at the origin and incorporates the idea of stochastic regularization. The activation establishes a stochastic connection between the input and output, effectively avoiding the situation where neurons are set to zero and improving the learning speed and stability of the network.
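The difference between the two activations for negative inputs can be made concrete with a short sketch. It uses the standard exact definition GELU(x) = x·Φ(x), where Φ is the standard normal cumulative distribution function; whether the authors used this exact form or an approximation is not stated in the text, so that choice is an assumption here, as are the function names.

```python
import math

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Derivative of ReLU: zero for every negative input.
    return 1.0 if x > 0 else 0.0

def gelu(x):
    # Exact GELU: x times the standard normal CDF evaluated at x.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_grad(x):
    # d/dx [x * Phi(x)] = Phi(x) + x * phi(x), with phi the standard normal density.
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

for value in (-2.0, -1.0, 0.0, 1.0):
    print(f"x={value:+.1f}  relu'={relu_grad(value):.4f}  gelu'={gelu_grad(value):.4f}")
```

For a negative input such as x = -1, the ReLU gradient is exactly zero while the GELU gradient is small but non-zero, so the corresponding weights can still receive updates.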

Knowledge Distillation
Knowledge distillation is a model compression technique proposed by Geoffrey Hinton [19] et al. A complex, highly generalizable large model is used to guide the training of a lightweight small model, allowing the small model to achieve accuracy comparable to that of the large model at a smaller cost. At the heart of knowledge distillation is the fact that the confidence the teacher network assigns to the different classes defines a rich similarity structure over the data and provides additional inter-class knowledge to guide the training of the small network. The softened class probabilities used for distillation are calculated by:

$$q_i(z, T) = \frac{\exp(z_i / T)}{\sum_{j=1}^{N} \exp(z_j / T)} \tag{10}$$

where $z_i$ is the output of the network for category $i$ before the softmax. The hyperparameter $T$ softens the output categories of the large and small networks. The distillation loss between the softened outputs of the two networks and the direct training loss of the small network are then computed, and the two losses are weighted and summed to obtain the training loss of the network. The entire knowledge distillation training process is shown in Figure 6. In this paper, the KD-EG-RepVGG network was obtained by using RepVGG-A0 as the teacher network to guide the training of the EG-RepVGG network.
The loss used in the training phase is a weighted sum of the KL divergence loss and the cross-entropy loss, as given in Equation (11):

$$\mathrm{Loss} = \alpha \cdot L_{kd}(q(u, T), q(z, T)) + (1 - \alpha) \cdot L_{s}(y, q(z, 1)) \tag{11}$$

where $N$ is the number of defect categories; $q(u, T)$ is the output of the teacher network softened at distillation temperature $T$; $q(z, T)$ is the output of the student network softened at distillation temperature $T$; and $L_{kd}$ is the KL divergence loss, an asymmetric measure of the difference between the probability distributions $q(u, T)$ and $q(z, T)$, as shown in Equation (12):

$$L_{kd} = \sum_{i=1}^{N} q_i(u, T) \log \frac{q_i(u, T)}{q_i(z, T)} \tag{12}$$

$L_{s}$ is the cross-entropy loss, which indicates how close the predicted output is to the true sample label $y$, as shown in Equation (13):

$$L_{s} = -\sum_{i=1}^{N} y_i \log q_i(z, 1) \tag{13}$$

Here the subscript $i$ denotes the component for the $i$-th category. In this paper, the distillation temperature is $T = 7$ and the weighting coefficient $\alpha$ is set to 0.3.
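The weighted loss can be sketched in plain Python for a single sample. The function names and the toy logits are assumptions made for illustration; the temperature T = 7 and weight α = 0.3 are the values stated in the text. Some implementations also scale the distillation term by T², which the loss in Equation (11) does not show, so it is omitted here.

```python
import math

def softmax_t(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (numerically stabilised)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions; asymmetric in p and q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(one_hot, q):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(yi * math.log(qi) for yi, qi in zip(one_hot, q) if yi > 0)

def distillation_loss(teacher_logits, student_logits, one_hot,
                      temperature=7.0, alpha=0.3):
    """alpha * KL(teacher_T || student_T) + (1 - alpha) * CE(label, student_1)."""
    l_kd = kl_divergence(softmax_t(teacher_logits, temperature),
                         softmax_t(student_logits, temperature))
    l_s = cross_entropy(one_hot, softmax_t(student_logits, 1.0))
    return alpha * l_kd + (1.0 - alpha) * l_s

# Toy logits for a 6-class problem (hypothetical values, not from the paper).
teacher = [6.0, 1.0, 0.5, 0.2, 2.5, 0.1]
student = [3.0, 1.5, 0.3, 0.4, 1.0, 0.2]
label = [1, 0, 0, 0, 0, 0]

print("teacher probs at T=1:", [round(p, 3) for p in softmax_t(teacher, 1.0)])
print("teacher probs at T=7:", [round(p, 3) for p in softmax_t(teacher, 7.0)])
print("loss:", round(distillation_loss(teacher, student, label), 4))
```

Comparing the two printed teacher distributions shows the effect of the temperature: at T = 7 the probabilities of the non-target classes are no longer negligible, which is the inter-class information passed to the student.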

Experimental Platform
The experimental platform consists of an Intel Core i7-11700F processor, an NVIDIA GeForce RTX 3060 12 GB graphics card and 32 GB of memory. The software environment is the Windows 10 operating system with Python 3.8, and the deep learning framework is PyTorch.

Experimental Data Sets
This paper uses the NEU-CLS dataset [20] of strip steel surface defects, produced and published by Northeastern University, for the experiments. As shown in Figure 7, the surface defects of the strip are divided into six categories: Crack (Cr), Inclusion (In), Patch (Pa), Pitted Surface (Ps), Rolled-in Scale (Rs) and Scratch (Sc). Table 2 shows the details of each defect category. The 1800 images in the table are divided into a training set, a validation set and a test set at a ratio of 8:1:1. The training set has 1440 images, and the validation set and test set have 180 images each.
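The 8:1:1 split can be reproduced with a few lines of Python. This is only a sketch: the text does not say whether the split was random, stratified by class or seeded, so the shuffled split, the seed and the function name `split_indices` are assumptions.

```python
import random

def split_indices(n_samples, ratios=(8, 1, 1), seed=0):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for repeatability
    total = sum(ratios)
    n_train = n_samples * ratios[0] // total
    n_val = n_samples * ratios[1] // total
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # remainder goes to the test set
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1800)
print(len(train_idx), len(val_idx), len(test_idx))
```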

Experimental Results and Analysis
To analyze and measure the comprehensive performance of the network model in the task of identifying strip surface defects, the accuracy, the Matthews correlation coefficient, FPS, single-image detection time, number of model parameters and FLOPs are used to evaluate the model.
The accuracy (ACC) is the proportion of correctly classified samples among all samples. The higher the accuracy, the better the classification performance of the model. The formula is shown in Equation (14):

$$\mathrm{ACC} = \frac{TP + TN}{ALL} \tag{14}$$

where TP is the number of positive samples correctly predicted, TN is the number of negative samples correctly predicted, and ALL is the number of all samples. The Matthews correlation coefficient (MCC) measures the correlation between the actual classification and the predicted classification, and it is a balanced evaluation index. The value of MCC lies between −1 and 1. The closer the value of MCC is to 1, the more reliable the result predicted by the classifier.
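Both metrics can be computed from a confusion matrix. The sketch below uses the standard multiclass generalisation of the MCC; the text does not give the MCC formula the authors used, so this form is an assumption, as are the function names and the toy matrix. Rows are taken as true classes and columns as predicted classes.

```python
import math

def accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def multiclass_mcc(cm):
    """Multiclass Matthews correlation coefficient from a confusion matrix."""
    n = len(cm)
    s = sum(sum(row) for row in cm)                           # all samples
    c = sum(cm[i][i] for i in range(n))                       # correctly predicted
    t = [sum(cm[i]) for i in range(n)]                        # true count per class
    p = [sum(cm[i][j] for i in range(n)) for j in range(n)]   # predicted count per class
    numerator = c * s - sum(pk * tk for pk, tk in zip(p, t))
    denominator = math.sqrt((s * s - sum(pk * pk for pk in p)) *
                            (s * s - sum(tk * tk for tk in t)))
    return numerator / denominator if denominator else 0.0

# Toy two-class confusion matrix (hypothetical values).
toy_cm = [[3, 1],
          [2, 4]]
print("ACC:", accuracy(toy_cm))
print("MCC:", round(multiclass_mcc(toy_cm), 4))
```

For two classes this expression reduces to the familiar binary MCC, which makes it easy to sanity-check against a hand calculation.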

Ablation Experiments
The comprehensive performance of KD-EG-RepVGG was evaluated on the NEU-CLS test set and the results are shown in Table 3. The hyperparameter settings of the teacher network RepVGG-A0 are also applied to the KD-EG-RepVGG network. The network is trained using the stochastic gradient descent (SGD) optimizer with a momentum coefficient of 0.9 and a weight decay of 0.0001. The learning rate is set to 0.1, the batch size is 64 and the network is trained for 100 epochs. The comparison shows that the lightweight KD-EG-RepVGG model obtained after knowledge distillation improves accuracy by more than two percentage points over the EG-RepVGG model. Furthermore, the accuracy of the lightweight KD-EG-RepVGG network is 0.6 percentage points higher than that of the teacher network RepVGG-A0. The Matthews correlation coefficient of KD-EG-RepVGG on the test set is 99.02%, which further indicates that the model is very accurate in identifying the surface defects of the strip. Figure 8 shows the validation accuracy and loss curves of the network. The curves show that the KD-EG-RepVGG network converges faster and reaches a higher accuracy. The aim of transferring the knowledge of a large model to a small network and improving the accuracy and generalizability of the network is therefore achieved.
Furthermore, to analyze the capabilities of the model more clearly, the confusion matrix of the model on the test set was calculated, and the results are shown in Figure 9. The confusion matrix shows that the model has a high recognition rate for defects. The recall calculated from the confusion matrix is 97.30% for the "In" defect and 100% for all other defects. The precision is 97.13% for the "Sc" defect and 100% for all other defects.
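Per-class recall and precision follow directly from the rows and columns of a confusion matrix. The sketch below uses a small hypothetical three-class matrix, not the matrix in Figure 9, and the function names are illustrative. Rows are true classes and columns are predicted classes, and every class is assumed to have at least one true and one predicted sample.

```python
def per_class_recall(cm):
    """Recall of class i: correct predictions divided by the row sum (true samples)."""
    return [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

def per_class_precision(cm):
    """Precision of class j: correct predictions divided by the column sum (predictions)."""
    n = len(cm)
    return [cm[j][j] / sum(cm[i][j] for i in range(n)) for j in range(n)]

# Hypothetical matrix: one sample of class 1 is predicted as class 2.
toy_cm = [
    [10, 0, 0],
    [0, 9, 1],
    [0, 0, 10],
]
print("recall:   ", [round(r, 4) for r in per_class_recall(toy_cm)])
print("precision:", [round(p, 4) for p in per_class_precision(toy_cm)])
```

A single misclassified sample lowers the recall of the class it belongs to and the precision of the class it was assigned to, which is the pattern reported for the "In" and "Sc" defects.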

Comparative Experimental Analyses
The KD-EG-RepVGG algorithm was compared with current mainstream algorithms on the same test set. To demonstrate the validity of the model, KD-EG-RepVGG is compared with the ResNet50, VGG16, ShuffleNetV2 and MobileNetV2 models in the same software and hardware environment. The accuracy, FPS, single-image detection time, computation amount, parameter count and other detection indicators of the algorithms are compared and analyzed. The results are recorded in Table 4. The KD-EG-RepVGG network achieves better classification accuracy than the larger models VGG16 and ResNet50, outperforming ResNet50 by almost three percentage points. Compared with the lightweight networks ShuffleNetV2 and MobileNetV2, the KD-EG-RepVGG network has clear advantages in inference speed, parameter count and computation amount. The KD-EG-RepVGG network is therefore better suited to industrial applications: it improves detection efficiency, accuracy and speed while consuming very little memory and few computing resources.
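The single-image detection time and the FPS compared above are two views of the same measurement. A minimal sketch of how they might be measured is given below. The helper `measure_latency` and the stand-in `dummy_infer` are hypothetical names introduced only for illustration; the timing procedure used for Table 4 is not described in this section.

```python
import time

def measure_latency(infer, inputs, warmup=2):
    """Return (mean latency in ms per input, frames per second)."""
    for x in inputs[:warmup]:
        infer(x)                       # warm-up calls are not timed
    start = time.perf_counter()
    for x in inputs:
        infer(x)
    elapsed = time.perf_counter() - start
    latency_ms = 1000.0 * elapsed / len(inputs)
    fps = len(inputs) / elapsed
    return latency_ms, fps

def dummy_infer(image):
    # Hypothetical stand-in for a model forward pass.
    return sum(image)

images = [[0.0] * 1024 for _ in range(50)]
latency_ms, fps = measure_latency(dummy_infer, images)
```

When images are processed one at a time, FPS is simply 1000 divided by the mean latency in milliseconds, so the two indicators always rank the models in the same order.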

Model Visualisation
The features of the intermediate layers of the convolutional neural network are visualized in order to gain a clearer understanding of what the network has learned [21]. A random selection of defect images is fed into the KD-EG-RepVGG inference network, and the outputs of its convolutional layers are visualized. The visualization results are shown in Figure 10. In the KD-EG-RepVGG network, the shallow convolutional layers retain the image information relatively intact and mainly respond to contour information. The deeper convolutional layers focus more on the location features of the target and on more abstract information. The visualization results show that important regional features of the image are encoded by the network, indicating that the network learns features effectively.

The gradient-weighted class activation mapping (Grad-CAM) algorithm [22] is used to demonstrate the ability of the KD-EG-RepVGG network to extract defect features. A heat map shows the activated regions in the images, which is consistent with human visual perception. The final layer of the KD-EG-RepVGG network was chosen for visualization because it is a generalized representation of the features extracted by the preceding layers. Images of the six defect types were randomly selected for visualization, with darker colors indicating that the network pays more attention to that region. The results are shown in Figure 11.

Figure 11. KD-EG-RepVGG network heat map. Panels: Cr, In, Pa, Ps, Rs, Sc.
As can be seen in Figure 11, the defect features extracted by KD-EG-RepVGG are concentrated on salient feature points. For impurity, spot-crack and pockmark defects, the network is efficient in that it focuses on only one feature point for the same feature. The KD-EG-RepVGG network can also recognize features from multiple angles: the features of cracks at different locations and angles are fully extracted, demonstrating a strong extraction capability.
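The Grad-CAM heat maps discussed above follow a simple computation: each feature map of the chosen layer is weighted by the spatial average of the gradient of the class score with respect to that map, the weighted maps are summed, and negative values are clipped. A minimal pure-Python sketch is given below. The function name `grad_cam` and the 2-channel toy inputs are hypothetical, and a real pipeline would obtain the activations and gradients from the trained network rather than from hand-written lists.

```python
def grad_cam(activations, gradients):
    """Grad-CAM map from feature maps and their gradients.

    activations, gradients: nested lists of shape [K][H][W] for the chosen
    layer (K feature maps and d(class score)/d(feature map)).
    Returns an [H][W] map scaled to the range [0, 1].
    """
    K = len(activations)
    H = len(activations[0])
    W = len(activations[0][0])
    # Channel weights: global average of the gradients over spatial positions.
    weights = [sum(sum(row) for row in gradients[k]) / (H * W) for k in range(K)]
    # Weighted sum of the feature maps, followed by ReLU.
    cam = [[max(0.0, sum(weights[k] * activations[k][i][j] for k in range(K)))
            for j in range(W)] for i in range(H)]
    # Scale by the peak value so the map can be rendered as a heat map.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam

# Hypothetical toy input: two 2x2 feature maps with channel weights +1 and -1.
acts = [[[1.0, 0.0], [0.0, 2.0]],
        [[0.0, 3.0], [1.0, 0.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]],
         [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(acts, grads)
```

In the toy input, positions dominated by the negatively weighted channel are clipped to zero, so only regions that raise the class score remain visible in the map.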

Conclusions
To meet the requirements of strip surface defect detection in actual production, a strip defect recognition method based on the structurally re-parameterized KD-EG-RepVGG network is proposed. The ECA network and the GELU activation function are added to the RepVGG block. The ECA network improves the accuracy of the KD-EG-RepVGG network while increasing its convergence speed. The GELU activation function avoids the neuron necrosis caused by outputs being set to zero. Through knowledge distillation, the KD-EG-RepVGG model acquires the knowledge of RepVGG-A0, which improves the accuracy and robustness of the model. The ablation experiments and the comparative analysis with other models show that the lightweight KD-EG-RepVGG network uses very little memory and few computing resources without sacrificing accuracy, and has a faster detection speed. It is therefore more suitable for deployment and use in real production.
Future work will proceed in two directions. First, the research in this paper will serve as a basis for studying the accurate localization of defects and the accurate measurement of defect size. Second, the model will be deployed on edge equipment and applied in production environments within plants.
Author Contributions: Methodology, X.X.; software, X.X.; writing-original draft preparation, X.X.; writing-review and editing, X.S. All authors have read and agreed to the published version of the manuscript.