CASI-Net: A Novel and Effective Steel Surface Defect Classification Method Based on Coordinate Attention and Self-Interaction Mechanism

The surface defects of a hot-rolled strip will adversely affect the appearance and quality of industrial products. Therefore, the timely identification of hot-rolled strip surface defects is of great significance. In order to improve the efficiency and accuracy of surface defect detection, a lightweight network based on coordinate attention and self-interaction (CASI-Net), which integrates channel domain, spatial information, and a self-interaction module, is proposed to automatically identify six kinds of hot-rolled steel strip surface defects. In this paper, we use coordinate attention to embed location information into channel attention, which enables the CASI-Net to locate the region of defects more accurately, thus contributing to better recognition and classification. In addition, features are converted into aggregation features from the horizontal and vertical direction attention. Furthermore, a self-interaction module is proposed to interactively fuse the extracted feature information to improve the classification accuracy. The experimental results show that CASI-Net can achieve accurate defect classification with reduced parameters and computation.


Introduction
As the most important product in iron and steel enterprises, the steel strip has become an irreplaceable raw material in automobile manufacturing, aerospace, mechanical processing, and other fields [1]. However, in the actual production process of the hot-rolled strip, due to the imperfect manufacturing process, the surface of the strip usually contains different types of defects, such as scratches, surface cracks, and rolling marks [2][3][4]. These defects not only affect the appearance of the product but also reduce the quality of the finished product [1,2]. Traditionally, the classification of steel surface defects is checked manually by experts [2][3][4].
However, manual inspection is subjective, prone to operator fatigue, and slow, which makes it unsuitable for real-time detection tasks [4]. Therefore, in order to improve the recognition efficiency and accuracy, it is essential to develop an accurate automatic detection solution. In the past decades, machine vision technology, as a safe, non-contact, and automatic solution, has been widely used in material surface detection [5]. Machine vision detection is mainly composed of image acquisition and defect detection [6]. With the increasingly complex industrial environment, machine vision detection technology faces many challenges, such as low universality of equipment, high requirements for the light-source environment, and the high cost of producing and maintaining the detection device [7]. In this case, machine vision detection technology is often inefficient and struggles to achieve good detection results. To overcome these shortcomings, researchers applied deep learning, which performs well compared with traditional machine vision, to defect detection and achieved great improvement.
In recent years, due to the outstanding performance of deep learning in comparison to machine vision, deep learning has developed rapidly in computer vision applications [8][9][10][11][12]. Deep learning can solve the problem whereby different tasks need different image-processing algorithms in traditional machine vision. AlexNet [8] was proposed in 2012 and made a huge impact on the development of deep learning. Compared with traditional machine learning methods, deep learning uses convolution, pooling, and other operations for feature extraction to obtain the abstract feature information of the image. The convolution neural network (CNN) uses convolution operation for feature extraction of input images, which can learn local features and capture different degrees of semantic information so as to effectively learn feature expression from a large number of samples, and the model has a stronger generalization ability. Compared with traditional machine vision methods [7], CNN adopts a pooling layer and sparse connection to reduce model parameters while ensuring the efficiency of computing resources and network performance [13,14]. Deep learning combines the full connection layer to achieve high-precision detection and classification, which promotes the further development of deep learning in the field of image processing. Therefore, in order to achieve better classification accuracy, a deeper learning architecture is needed. However, deeper learning architectures [15,16] contain a large number of parameters and require a large amount of computation load.
In order to overcome the above problems, we propose a lightweight convolutional neural network called CASI-Net, which combines channel attention, location information, and a self-interaction module based on biological vision to achieve fast and accurate classification of steel surface defects. In the feature extraction stage, inspired by [17], we use 3 × 1 convolution kernels and 1 × 3 convolution kernels to replace 3 × 3 convolution kernels, aiming at reducing network parameters. Then, in order to help CASI-Net more accurately locate and identify the region of interest, a coordinate attention (CA) block [18] is introduced and a self-interaction module based on biological vision is constructed. The self-interaction module can improve the richness of the extracted features. Compared with typical surface defect identification methods, CASI-Net achieves more accurate identification results with a small number of parameters. Overall, the contributions of this paper are summarized as follows:

• An end-to-end CASI-Net model is proposed, which combines location information and channel attention to locate defects more accurately. In addition, we construct a self-interaction module based on the biological visual interaction mechanism to learn more detailed feature information. Finally, CASI-Net can use very few parameters to achieve accurate classification.
• We introduce the CA block to CASI-Net. The CA block can capture not only cross-channel information but also location information, which helps CASI-Net locate and identify targets of interest more accurately.
• The self-interaction module based on biological mechanisms is constructed to enrich the representation of feature maps, which is helpful for better recognition and classification.
• To evaluate the performance of CASI-Net on real industrial data, we use the NEU dataset provided by Northeastern University. The classification results on NEU verify the effectiveness of our proposed network.
The remainder of our paper is organized as follows. Section 2 introduces the related work, and the two improved techniques are introduced in Section 3. Section 4 provides an evaluation of our method and an experimental comparison with state-of-the-art methods. In Section 5, the conclusion is provided.

Convolutional Neural Networks
CNN was proposed as early as 1989 [19]. LeCun et al. proposed the first classical CNN architecture, LeNet-5 [20], in 1998. LeNet-5 contains six hidden layers and extracts image features mainly by stacking convolution and pooling operations, which achieves good results on the MNIST dataset [20]. Krizhevsky et al. proposed AlexNet in 2012, which has five convolution layers and three fully connected layers, not counting the pooling layers [8]. After that, Simonyan and Zisserman proposed to use 3 × 3 filters to construct a deeper network, VGG [15], which promoted the development of computer vision tasks in 2015. However, when convolution layers are simply stacked, increasing the depth beyond a certain point no longer improves the results and instead degrades them. Therefore, He et al. designed the residual learning method and proposed ResNet [16] to solve the degradation problem of deep networks, which realized a significant improvement. However, the above studies placed too much emphasis on deepening the network to improve accuracy, without considering the computational cost of the model. In order to use fewer parameters with little loss of network performance, Iandola et al. proposed an architecture called SqueezeNet [21] that achieves high-precision identification with significantly fewer parameters. Later, researchers proposed other representative lightweight networks such as ShuffleNet [22], MobileNet [23], and MobileNet V2 [24].

Attention Mechanisms
In recent years, the attention mechanism [25] has been widely used in various computer vision tasks, such as image classification [26][27][28][29] and image segmentation [30,31]. One successful example is SE [26], which squeezes each two-dimensional feature map to efficiently construct the interdependence between channels. However, SE [26] only considers channel information and ignores location information, although the spatial information of the object is also important in computer vision [28]. BAM [32] and CBAM [28] tried to use both the channel domain and the spatial domain for feature extraction, but they only captured local information and could not obtain long-range dependencies [15]. In order to solve the above problems, we introduce the CA block [18] into CASI-Net. In coordinate attention [18], the channel attention is decomposed into two one-dimensional feature encoding processes, in which information along different directions is aggregated. Different from SE [26], the CA block [18] can not only capture the dependencies between feature maps but also retain accurate location information along the spatial direction.

Biological Visual Interaction Mechanism
The interaction mechanism of biological vision refers to the fact that, in visual information processing, visual information interacts to a certain extent and, finally, completes the storage and recovery of the dorsal stream and ventral stream [33][34][35][36]. In addition, when visual information is transmitted in the dorsal or ventral stream, the self-interaction behavior is triggered [35]. This form of feature interaction can enrich the information of features and enhance the expression ability of information in the cerebral cortex [37,38]. For deep learning, when the feature map contains less effective information than the original map, the classification accuracy of the deep neural network suffers. On the contrary, when the feature map contains more effective information than the original map, the representation of the feature map can be expanded, thereby enhancing the expression ability of the CNN model. Based on the above research, a self-interaction module is constructed, inspired by the biological visual interaction mechanism, which enables CASI-Net to obtain more abundant original image information and enhance the characterization of CASI-Net features, thus further improving the classification accuracy of CASI-Net.

Proposed Method
The proposed CASI-Net architecture consists of a lightweight basic feature extractor (BLFE), a CA block [18], and a self-interaction module. The CASI-Net architecture is shown in Figure 1.
In CASI-Net, the input image is a W × H × C steel surface image with defects, and the output of CASI-Net is the defect category confidence. Here, W, H, and C denote the width, height, and number of channels of the input image, respectively. In the basic lightweight feature extractor, the output dimension of block i is $W_i \times H_i \times C_i$, where $W_i$, $H_i$, and $C_i$ (i = 1, 2, 3) denote the width, height, and number of channels of the output feature maps of block i, respectively. In order to ensure CASI-Net focuses on the defect area, we introduce the CA block [18] into our constructed network to obtain refined feature maps through attention along the $W_d$ and $H_d$ directions. Then we construct a self-interaction module based on the interaction mechanism of biological vision to enrich the feature map information and enhance the characterization of CASI-Net. Finally, CASI-Net connects to a multilayer perceptron (MLP) to obtain the category of an input defect image.

Basic Lightweight Feature Extractor
BLFE consists of three depth-wise separable convolution modules shown in Figure 1. Each module consists of four convolution layers, four ReLU layers, four batch normalization (BN) layers, and a max pooling layer, as shown in Figure 2. A convolution layer is the basis of the image feature extraction process, and its core is the convolution operation. The convolution layer at the lowest level extracts low-level features such as edges and lines, and the higher convolution layers extract more complex features such as object color and contour. The BN [39] layer can speed up the training process, largely alleviate the vanishing-gradient problem, and improve the performance of the CNN [40]. Max pooling is used to reduce the dimension of features, compress the amount of data and the number of parameters, and effectively reduce overfitting [40]. ReLU [41], as the nonlinear activation layer, helps alleviate the overfitting problem. In block i, the traditional 3 × 3 depth-wise convolution is decomposed into a 1 × 3 convolution kernel Conv2 and a 3 × 1 convolution kernel Conv3, and finally, the feature map X is obtained by the 1 × 2 convolution kernel Conv4.
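To make the parameter saving from the kernel decomposition concrete, the following minimal sketch counts the weights of a 3 × 3 depth-wise convolution against a 1 × 3 followed by a 3 × 1 depth-wise convolution. The channel width of 64 and the function names are assumptions chosen for illustration, and biases are ignored; the sketch does not reproduce the exact layer configuration of BLFE.

```python
def depthwise_params(kernel_h, kernel_w, channels):
    """Weights in a depth-wise convolution: one kh x kw filter per channel (bias omitted)."""
    return kernel_h * kernel_w * channels


def compare_factorization(channels):
    """Return (weights of a 3x3 depth-wise conv, weights of 1x3 followed by 3x1)."""
    full = depthwise_params(3, 3, channels)
    factorized = depthwise_params(1, 3, channels) + depthwise_params(3, 1, channels)
    return full, factorized


# channels=64 is a hypothetical width used only for this illustration.
full, factorized = compare_factorization(64)
saving = 1 - factorized / full
print(f"3x3: {full} weights, 1x3 + 3x1: {factorized} weights, saving {saving:.0%}")
```

Under these assumptions the decomposed pair needs 6 weights per channel instead of 9, so the ratio is independent of the channel width.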

Coordinate Attention
In order to focus on the defect areas and suppress the unimportant areas to achieve more accurate identification, CASI-Net combines the channel attention mechanism and the location information to obtain more accurate defect areas. Attention modules such as SE [26] and CBAM [28] can improve network performance in image classification. Traditional attention modules such as SE [26] only considered the channel information of the image and ignored the spatial information. In addition, SE [26] lost too much primitive information via global pooling. To solve these problems, we integrate the CA block [18] into CASI-Net to improve the accuracy of classification. In the CA block, the feature tensor $X = [x_1, x_2, \ldots, x_n]$ obtained after BLFE is the input, and CA outputs the re-weighted tensor $Y = [y_1, y_2, \ldots, y_n]$ [18]. The architecture of the CA block is shown in Figure 3, where "$W_d$ Avg Pool" and "$H_d$ Avg Pool" refer to 1D average pooling along the $W_d$ and $H_d$ directions, respectively [18].

For the input feature tensor $X = [x_1, x_2, \ldots, x_n]$, one-dimensional pooling operations in the first step generate feature descriptors in the $W_d$ and $H_d$ directions in the CA block. Specifically, CA uses two different pooling kernels to encode each channel along the $W_d$ direction and the $H_d$ direction. The sizes of the two pooling kernels are $(H_3, 1)$ and $(1, W_3)$, respectively [18]. Then the output of channel $c$, $c \in \{1, 2, \ldots, n\}$, at height $h$, $h \in \{1, 2, \ldots, H\}$, is expressed as follows:

$$z_c^h(h) = \frac{1}{W} \sum_{i=1}^{W} x_c(h, i),$$

where $x_c$ is the $c$-th channel feature map of the feature tensor $X$. The output of channel $c$, $c \in \{1, 2, \ldots, n\}$, at width $w$, $w \in \{1, 2, \ldots, W\}$, is expressed as follows:

$$z_c^w(w) = \frac{1}{H} \sum_{j=1}^{H} x_c(j, w),$$

where $x_c$ is again the $c$-th channel feature map of the feature tensor $X$.
$z_c^h(h)$ and $z_c^w(w)$ combine information from two different directions, the $W_d$ direction and the $H_d$ direction, which allows CASI-Net to capture long-range dependencies along one spatial direction and preserve precise positional information along the other spatial direction. This helps CASI-Net locate the region of interest more accurately [18].
Next, $z^h$ and $z^w$ are concatenated and passed through a convolution transform and a nonlinear activation to obtain the feature maps $f$ [18]. The expression of $f$ is as follows:

$$f = \delta\left(F_1\left([z^h, z^w]\right)\right),$$

where $f$ is the feature maps containing the $W_d$ and $H_d$ directions, $\delta$ is the ReLU function, and $F_1$ is the 1 × 1 convolution operation. Next, $f$ is split along the spatial dimension into two feature tensors $f^h$ and $f^w$, and then the feature maps are convolved by two 1 × 1 convolution layers to form the attention weights $g^h$ and $g^w$ in the $W_d$ and $H_d$ directions [18], which are described as follows:

$$g^h = \sigma\left(F_h(f^h)\right), \qquad g^w = \sigma\left(F_w(f^w)\right),$$

where $F_h$ and $F_w$ are 1 × 1 convolution operations, and $\sigma$ is the sigmoid function. Finally, the attention weights in the $W_d$ and $H_d$ directions are applied to the input of CA, and the final output $y_c(i, j)$ of the re-weighted features $Y$ is as follows:

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j),$$

where $g_c^h(i)$ and $g_c^w(j)$ are the attention weights of the $c$-th channel of $X$ in the $W_d$ and $H_d$ directions, respectively.
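The directional pooling and the final re-weighting can be illustrated on a single channel with a few lines of plain Python. This is a simplified sketch, not the CA block of [18]: as a loudly stated assumption, the learned 1 × 1 convolutions and the ReLU are replaced by the identity, so each gate is just the sigmoid of the pooled value. The function names and the toy feature map are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def pool_along_width(x):
    """z^h: average each row over the width, giving one value per height index."""
    return [sum(row) / len(row) for row in x]


def pool_along_height(x):
    """z^w: average each column over the height, giving one value per width index."""
    height = len(x)
    return [sum(x[j][w] for j in range(height)) / height for w in range(len(x[0]))]


def coordinate_reweight(x):
    """Re-weight one channel: y(i, j) = x(i, j) * g^h(i) * g^w(j).

    Assumption: the learned 1x1 convolutions and the ReLU are replaced by the
    identity, so the gates are simply sigmoid(pooled value).
    """
    g_h = [sigmoid(v) for v in pool_along_width(x)]
    g_w = [sigmoid(v) for v in pool_along_height(x)]
    return [[x[i][j] * g_h[i] * g_w[j] for j in range(len(x[0]))] for i in range(len(x))]


# Hypothetical 3 x 4 single-channel feature map with a bright region in row 1.
x = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 4.0, 4.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
z_h = pool_along_width(x)   # one descriptor per row
z_w = pool_along_height(x)  # one descriptor per column
y = coordinate_reweight(x)
```

Because each gate depends on a whole row or a whole column, the row and column containing the bright region receive the largest weights, which is the position-sensitive behavior the text describes.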

Self-Interaction Based on Biological Vision
In Section 3.2, through the CA block, CASI-Net obtains an enhanced feature map Y. Then we input the feature maps Y into the self-interaction module to enrich the effective information of the feature maps. Inspired by the biological visual interaction mechanism, we design a novel feature augmentation structure named self-interaction (SI). This interactive mechanism can enrich visual information and extract more discriminative feature information in deep learning, which can improve the results of defect classification. The specific structure of SI is shown in Figure 4.
In SI, the output Y of CA is used as the input. After transposing the feature maps Y to obtain $Y^T$, the new feature map Z is obtained through the interactive operation in the SI module. In this process, M represents the Hadamard product of Y and $Y^T$:

$$M_c(i, j) = y_c^T(i, j) \times y_c(i, j),$$

where $y_c^T$ and $y_c$ represent the $c$-th channel feature maps of the refined feature tensors $Y^T$ and Y, respectively, and $(i, j)$ represents the pixel coordinates in the feature map. Z is the final feature map obtained after the interaction. The richness of the deep network in feature information processing is extended by the SI module. The interactive feature maps Z pay more attention to the regions to be identified and can obtain more detailed feature information from the original feature map, and Z is used for the final classification.
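As a small illustration of the Hadamard product between a feature map and its transpose, the sketch below computes M for one square channel in plain Python. It covers only the product M; how Z is formed from M is not modeled here. The function names and the toy values are hypothetical.

```python
def transpose(y):
    """Swap rows and columns of a 2D list."""
    return [list(col) for col in zip(*y)]


def self_interaction_product(y):
    """M(i, j) = y^T(i, j) * y(i, j) for one square channel."""
    if any(len(row) != len(y) for row in y):
        raise ValueError("the Hadamard product with the transpose needs a square map")
    y_t = transpose(y)
    n = len(y)
    return [[y_t[i][j] * y[i][j] for j in range(n)] for i in range(n)]


# Hypothetical 3 x 3 single-channel feature map.
y = [
    [1.0, 2.0, 0.0],
    [3.0, 1.0, 4.0],
    [0.0, 5.0, 2.0],
]
m = self_interaction_product(y)
```

Two properties follow directly from the definition: M is symmetric, and an entry of M is non-zero only where both y(i, j) and y(j, i) are non-zero. The product also requires equal height and width, which holds for the square inputs used in this paper only if the feature maps remain square after BLFE.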

Dataset
The dataset used in our work is NEU-CLS, which contains six types of surface defects of a hot-rolled steel strip, which are Crazing (Cr), Inclusion (In), Patches (Pa), Pitted Surface (PS), Rolled-in Scale (RS), and Scratches (Sc) [1]. Each type of sample has 300 grayscale images of which the size is 200 × 200. NEU-CLS has 1800 images. In our experiment, we resize the input images to 300 × 300 × 3 (width, height, channel). Figure 5 shows the samples of six types of typical surface defects images of steel strips.
Each type gives four sample images, and it can be clearly observed that there are great differences in the appearance of the same type of defects. In short, the challenges of the NEU-CLS dataset are the inter-class similarity, intra-class difference, and complex background interference [4].

Enhanced Dataset
The steel defect dataset is inevitably subjected to non-uniform illumination, noise, and motion blur in the process of industrial acquisition, which poses a certain challenge to defect recognition. In order to evaluate the robustness of CASI-Net, we adopt the enhanced dataset, which includes severe non-uniform illumination, camera noise, and motion blur [2]. The values 2 and 5 denote the length of the camera motion used for the motion blur. The enhanced dataset is shown in Figure 6.
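The kind of degradation described above can be sketched in plain Python. This is an assumption-laden illustration, not the procedure used to build the enhanced dataset of [2]: the blur is modeled as a horizontal box average over `length` pixels, the noise as clipped Gaussian noise, and all names and parameter values other than the motion lengths 2 and 5 are hypothetical.

```python
import random


def horizontal_motion_blur(image, length):
    """Average each pixel with the next length-1 pixels to its right (indices clamped at the edge)."""
    width = len(image[0])
    blurred = []
    for row in image:
        blurred.append([
            sum(row[min(x + k, width - 1)] for k in range(length)) / length
            for x in range(width)
        ])
    return blurred


def add_gaussian_noise(image, sigma, seed=0):
    """Add zero-mean Gaussian noise and clip the result to the 8-bit range [0, 255]."""
    rng = random.Random(seed)
    return [[min(255.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row] for row in image]


# Hypothetical one-row grayscale image with a single bright pixel.
image = [[0.0, 0.0, 255.0, 0.0, 0.0, 0.0]]
blur2 = horizontal_motion_blur(image, 2)
blur5 = horizontal_motion_blur(image, 5)
noisy = add_gaussian_noise(image, sigma=10.0)
```

A longer motion length spreads the bright pixel over more neighbors at lower intensity, which is why thin defects such as scratches become harder to recognize under stronger blur.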

Implementation Details
All experiments are implemented in PyTorch. We use 70% of the images as the training dataset and 30% of the images as the test dataset. Training is performed on a GTX 1060 GPU, and we use SGD with a weight decay of 0.001, a momentum of 0.9, and a batch size of 16. In order to verify CASI-Net, we conduct experiments on the public surface defect database NEU released by Northeastern University of China [1].
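The data partition can be sketched as follows. The per-class (stratified) split, the random seed, and the sample identifiers are assumptions made for illustration; the text only states the 70%/30% ratio, the six classes with 300 images each, and the batch size of 16.

```python
import math
import random

CLASSES = ["Cr", "In", "Pa", "PS", "RS", "Sc"]
IMAGES_PER_CLASS = 300
BATCH_SIZE = 16


def stratified_split(train_fraction=0.7, seed=0):
    """Split every class separately so each keeps the same train/test ratio (an assumption)."""
    rng = random.Random(seed)
    train, test = [], []
    for label in CLASSES:
        # Hypothetical sample identifiers; real code would list image files instead.
        samples = [(f"{label}_{k}", label) for k in range(1, IMAGES_PER_CLASS + 1)]
        rng.shuffle(samples)
        cut = round(train_fraction * IMAGES_PER_CLASS)
        train.extend(samples[:cut])
        test.extend(samples[cut:])
    return train, test


train, test = stratified_split()
iterations_per_epoch = math.ceil(len(train) / BATCH_SIZE)
print(len(train), len(test), iterations_per_epoch)
```

With these numbers the split yields 1260 training and 540 test images (210 and 90 per class), and one epoch at batch size 16 takes 79 iterations if the last, smaller batch is kept.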
Specifically, the input image size W × H × C is 300 × 300 × 3 (width, height, and channel).

Performance Analysis
In this section, we conduct ablation experiments to evaluate the effectiveness of CASI-Net. The comparison results are shown in Table 1. Firstly, we use the baseline to classify the surface defects of NEU, and the accuracy reaches 94.79%. Then, we add the self-interaction module based on the biological visual interaction mechanism to the baseline. The baseline combined with the self-interaction module reaches 95.22% on NEU-CLS. Next, we add the CA block to the baseline without the self-interaction module. The recognition accuracy of the baseline after adding the CA block reaches 95.47% on the NEU steel surface defect dataset. Finally, we add the self-interaction module to the baseline with the CA block, which gives the full CASI-Net and reaches 95.83% on NEU-CLS. After adding the CA module to the baseline, we visualize the sample data of NEU steel surface defects in Figure 7.

From Figure 7, we know that CASI-Net can concentrate more on the location of the defect and suppress the non-defect part.
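The ablation accuracies reported in the Performance Analysis section can be compared by their gain over the baseline. The short sketch below only restates those four reported numbers; the dictionary keys are descriptive labels chosen here, and reading the final configuration as baseline plus CA plus SI is an interpretation of the text.

```python
# Accuracies (%) as reported in the ablation study on NEU-CLS.
ABLATION = {
    "baseline": 94.79,
    "baseline + SI": 95.22,
    "baseline + CA": 95.47,
    "baseline + CA + SI (CASI-Net)": 95.83,
}


def gains_over_baseline(results, reference="baseline"):
    """Accuracy gain in percentage points of every configuration over the reference."""
    base = results[reference]
    return {name: round(acc - base, 2) for name, acc in results.items() if name != reference}


gains = gains_over_baseline(ABLATION)
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
```

The gains are 0.43, 0.68, and 1.04 percentage points, so the combined gain is slightly smaller than the sum of the two individual gains (1.11 points).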

Comparison with State-of-the-Art Methods
In addition, in order to verify the performance of CASI-Net, we compare its classification accuracy with that of various advanced steel surface defect classification models. The experimental results in Table 2 show that our proposed CASI-Net can achieve a higher classification accuracy of steel surface defects with fewer parameters. On the NEU public dataset, we evaluate ResNet [16], MobileNet [23], EffNet [17], and CASI-Net. The experimental results show that, whereas ResNet with 25.56 M parameters reaches 95.09%, CASI-Net achieves 95.83% accuracy with far fewer parameters. In addition, compared with MobileNet [23] and EffNet [17], CASI-Net achieves a higher classification accuracy with little additional overhead. Compared with the most advanced steel surface defect classification methods, CASI-Net can classify steel surface defects more accurately.

Discussion
In this study, we demonstrated that, compared with traditional machine vision, the steel defect classification method based on deep learning can achieve higher classification accuracy. We use the coordinate attention mechanism and the self-interaction module based on biological vision to construct a lightweight convolutional neural network. By introducing the CA block, our network can concentrate more on defect areas. By constructing the SI module based on biological vision, the representation of the feature map is improved, which increases the recognition accuracy. In addition, compared with deeper networks, our model achieves comparable classification accuracy with a reduced number of parameters. We also examined the impact of a different dataset partition on our method: with an 8:2 split between training and testing, CASI-Net achieves 98.19% accuracy. We plotted the experimental results in Figure 8; the results and the AUROC show that CASI-Net can accurately identify surface defects. Collectively, our data demonstrate the applicability of our model to the task of surface defect recognition of a hot-rolled strip. However, there are some problems we have not taken into account. For example, some defects in the dataset exhibit a high degree of inter-class similarity and intra-class diversity, which makes them difficult for convolutional neural networks to distinguish accurately. Therefore, in the next step, we will consider introducing fine-grained classification methods, such as bilinear pooling or high-order statistical features that model the channels, to improve the extracted feature maps and capture the representative defect-recognition area, so as to improve the classification accuracy.

Conclusions and Future Work
This paper presents a light and effective classification network for steel surface defects called CASI-Net which adopts a new convolution block, which greatly reduces the computational burden and achieves high recognition accuracy. The proposed backbone network can achieve accurate identification results of steel surface defects. We incorporate the attention mechanism and the self-interaction mechanism based on biological vision into CASI-Net to improve the defect recognition accuracy. Our experiments show that CASI-Net can achieve better performance than other models with fewer parameters. In Section 3, we considered using two different technologies to improve the defect recognition accuracy of the CASI-Net, including the CA block and a self-interaction module. In the CA block [18], the location information of feature maps is embedded into the channel attention and decomposed into two 1D feature encoding processes. Then the two 1D features are coded to form a pair of direction-aware and position-sensitive feature maps, which can be complementarily applied to the input feature maps to enhance the representation of the region of interest. Through the CA block, CASI-Net can capture correlation dependencies along the horizontal direction and retain accurate location information along the vertical direction. Inspired by the biological visual interaction mechanism, the self-interaction module is constructed. Through the self-interaction operation, the feature map contains more effective information from the original image, and the representation ability of features in the CNN model is further enhanced to improve the accuracy of defect classification. Overall, the recognition accuracy of CASI-Net is more than 95%, which verifies the applicability of our model in the task of surface defect recognition of a hot-rolled strip. 
In future work, we will further verify the generalization performance of the model and utilize optimization algorithms and adaptation equipment, so as to develop a complete steel surface defect diagnosis framework. Based on the needs of iron and steel enterprises, we aim to add more practical functions, such as online help. In addition, the system can also provide users with more dynamic and attractive interfaces.