High-Frequency Workpiece Image Recognition Model Integrating Multi-Level Network Structure

High-frequency workpieces have the characteristics of complex intra-class textures and small differences between classes, leading to the problem of low recognition rates when existing models are applied to the recognition of high-frequency workpiece images. We propose in this paper a novel high-frequency workpiece image recognition model that uses EfficientNet-B1 as the basic network and integrates multi-level network structures, designated as ML-EfficientNet-B1. Specifically, a lightweight mixed attention module is first introduced to extract global workpiece image features with strong illumination robustness, and the global recognition results are obtained through the backbone network. Then, the weakly supervised area detection module is used to locate the locally important areas of the workpiece and is introduced into the branch network to obtain local recognition results. Finally, the global and local recognition results are combined in the branch fusion module to achieve the final recognition of high-frequency workpiece images. Experimental results show that compared with various image recognition models, the proposed ML-EfficientNet-B1 model has stronger adaptability to illumination changes, significantly improves the performance of high-frequency workpiece recognition, and the recognition accuracy reaches 98.3%.


Introduction
With the advancement of science and technology, the global manufacturing industry is developing rapidly towards high quality, high efficiency, and high intelligence [1][2][3]. High-frequency workpieces are among the most important components in aerospace equipment, and the quality, timeliness, and intelligence of their processing are important factors affecting the development of the aerospace industry. During processing, the workpiece is bound to a pallet carrying an RFID tag, and the processing content to be completed in each step is automatically loaded when the tag is scanned by an RFID reading and writing device. However, during the heat treatment stage, the workpiece is separated from the pallet, which invalidates its original radio frequency chip and hinders the automation and intelligence of subsequent processing. Therefore, image recognition technology is used to match the workpiece image after heat treatment with the workpiece image before heat treatment, re-associate the processing content corresponding to the image number, and attach a new RFID tag, carrying the same content as the tag before heat treatment, to the workpiece so that subsequent finishing and other procedures can proceed intelligently. To improve the intelligence level of high-frequency workpiece processing, image recognition technology can thus be introduced into the process: the recognition result is associated with the drawing number of the currently processed workpiece, and the next processing step can be loaded automatically through the manufacturing execution system (MES). At present, image recognition of high-frequency workpieces faces the following challenges: (1) workpieces of the same type have complex internal textures; (2) workpieces of different types have small differences in characteristics; (3) the quality of the workpiece image is strongly affected by changes in acquisition posture and lighting.
Image recognition technology has been widely used in different fields, and many researchers have proposed a variety of recognition algorithms [4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19]. In [13], the authors built a five-layer memristor-based CNN to perform MNIST10 image recognition and achieved a high accuracy of more than 96%. In [15], the authors proposed a fastener classification model that divides fasteners into four types: normal, partially worn, missing, and covered. In the industrial field, researchers have proposed some effective algorithms for image recognition of mechanical parts, which can be divided into traditional algorithms based on mathematical models and deep learning algorithms based on convolutional neural networks. In the first category, Xu et al. proposed RTMM, a template matching algorithm using ring feature templates, to solve the low accuracy of traditional matching methods caused by the presence of different sides of parts in the image [20]. Yin et al. proposed a fast positioning and recognition algorithm based on equal-area ring segmentation to address the complex textures and high similarity of high-frequency components after precision machining, which make them difficult to distinguish [21]. Wang et al. proposed a part-sorting method based on fast template matching, which accelerates part recognition by improving the template matching method and thereby improves the efficiency of a machine-vision-based part sorting system [22]. In the second category, Yang et al. proposed a joint loss-supervised deep learning recognition algorithm. This algorithm first builds an image feature vector encoding model based on a convolutional neural network, uses angle margin loss in place of SoftMax loss to reduce the distance between features within a workpiece class, and then introduces isolation loss to increase the distance between features of heterogeneous workpieces [23]. Zhang et al. proposed a multi-branch feature fusion convolutional neural network (MFF-CNN) to address the multi-surface distribution of features and light sensitivity encountered when automatically classifying automobile engine main bearing cover parts [24]. In addition, researchers have proposed improved Inception V3 [25] and Xception [26] models for the identification of threaded connector parts.
The above algorithms can overcome the impact of illumination changes on recognition results to a certain extent; however, their research objects are relatively simple, and they cannot be effectively applied to the recognition of high-frequency workpieces, which are complex within classes, have small gaps between classes, and appear in changeable postures. In order to effectively solve the problem of the low recognition rate of high-frequency workpieces, with their complex intra-class textures and small inter-class gaps, under complex illumination changes, we propose in this paper a novel high-frequency workpiece image recognition model that uses EfficientNet-B1 [27] as the basic network and integrates multi-level network structures. The main contributions of this paper are threefold: a lightweight mixed attention module for illumination-robust feature extraction, a weakly supervised region detection module for locating locally important areas, and a branch fusion module that combines global and local recognition results. Experimental results on a high-frequency workpiece dataset made in the laboratory show that, compared with various image recognition algorithms, the proposed algorithm has stronger adaptability to illumination changes and significantly improves the accuracy of high-frequency workpiece recognition. The remainder of this paper is organized as follows: Section 2 outlines our method, Section 3 presents experimental results on a high-frequency workpiece dataset made in the laboratory, and Section 4 concludes this paper.

Overall Framework
The framework of the high-frequency workpiece recognition model proposed in this paper, which integrates global attention and local attention, is shown in Figure 1. It mainly consists of three modules: the lightweight mixed attention module (LMAM), the weakly supervised region detection module, and the branch fusion module. First, the whole workpiece image I1 is input into the LMAM to obtain enhanced features, and then the multi-layer feature map Mg and recognition result P1 are generated through the lightweight network EfficientNet-B1. Second, in the weakly supervised region detection module, the local workpiece image I2 is cropped from the whole workpiece image I1 based on the multi-layer feature map Mg, and it is fed into the LMAM and EfficientNet-B1 of another branch to obtain the recognition result P2. Finally, the branch fusion module fuses the results of the two branches to obtain the final recognition result P.
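The two-branch flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `tiny_backbone` is a stand-in for EfficientNet-B1, the class name `MLRecognizer` is invented, and in the real model the local image I2 is produced by the weakly supervised region detection module rather than supplied directly.

```python
import torch
import torch.nn as nn

def tiny_backbone(num_classes):
    # Stand-in for EfficientNet-B1, used only to keep the sketch
    # self-contained; the paper uses EfficientNet-B1 in both branches.
    return nn.Sequential(
        nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, num_classes))

class MLRecognizer(nn.Module):
    """Two-branch structure from Figure 1: a global branch over the whole
    image I1, a local branch over the cropped image I2, and a fusion of
    the two recognition results with balance factor mu."""
    def __init__(self, num_classes=20, mu=0.6):
        super().__init__()
        self.mu = mu
        self.global_branch = tiny_backbone(num_classes)
        self.local_branch = tiny_backbone(num_classes)

    def forward(self, i1, i2):
        p1 = self.global_branch(i1)   # global recognition result P1
        p2 = self.local_branch(i2)    # local recognition result P2
        return self.mu * p1 + (1.0 - self.mu) * p2  # fused result P
```

In the full model, each branch is additionally preceded by an LMAM, described in the next section.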


The Lightweight Mixed Attention Module (LMAM)
In actual industrial production environments, the surface of a high-frequency workpiece is easily affected by factors such as uneven and strongly varying illumination, resulting in spots, shadows, insufficient light, and other phenomena in the collected workpiece images. If low-quality workpiece images are directly input into the network, effective image features cannot be extracted, and it is difficult to accurately identify the categories of high-frequency workpieces. In order to overcome the impact of such interference on workpiece recognition, this paper uses the LMAM to enhance the features of the workpiece image, as shown in Figure 1.
Inspired by the convolutional block attention module (CBAM) [28] and efficient channel attention (ECA) [29], this paper proposes the LMAM shown in Figure 2. It uses a lightweight channel attention module (LCAM) and a lightweight spatial attention module (LSAM) to replace the channel attention module (CAM) and spatial attention module (SAM) in CBAM, respectively.

The Lightweight Channel Attention Module (LCAM)
Since the maximum pooling in CAM loses too much information, the LCAM in this paper uses global standard deviation pooling to replace the global maximum pooling. One-dimensional convolution then allows the network to focus on learning effective channels with less computation; its structure is shown in Figure 3.
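A minimal PyTorch sketch of the LCAM as described: global average and standard-deviation pooling, a 1-D convolution over each channel descriptor, summation, and a sigmoid. Whether the two 1-D convolutions share weights and the exact combination order are assumptions; the kernel-size rule follows the ECA-style formula given below.

```python
import math
import torch
import torch.nn as nn

def adaptive_kernel_size(channels, gamma=2, b=1):
    # nearest odd number to log2(C)/gamma + b/gamma (ECA-style rule)
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 == 1 else t + 1

class LCAM(nn.Module):
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        k = adaptive_kernel_size(channels, gamma, b)
        self.conv_m = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.conv_sd = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                         # x: (N, C, H, W)
        z_m = x.mean(dim=(2, 3))                  # global average pooling
        z_sd = x.std(dim=(2, 3), unbiased=False)  # global std-dev pooling
        # 1-D convolution across the channel dimension of each descriptor
        a = self.conv_m(z_m.unsqueeze(1)) + self.conv_sd(z_sd.unsqueeze(1))
        w = torch.sigmoid(a).squeeze(1)           # channel attention weights
        return x * w[:, :, None, None]
```

For example, `adaptive_kernel_size(64)` gives k = 3, and the module preserves the input shape while reweighting its channels.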

Using global average pooling and global standard deviation pooling on the input feature map X1 ∈ R^(C×H×W), the channel information description maps are obtained using the following two formulas, respectively:

z_m(c) = (1 / (H × W)) Σ_{h=1}^{H} Σ_{w=1}^{W} X1(c, h, w)

z_sd(c) = sqrt( (1 / (H × W)) Σ_{h=1}^{H} Σ_{w=1}^{W} (X1(c, h, w) − z_m(c))^2 )
where h and w are the coordinates in the height and width directions, respectively, and H and W are the height and width of the image. z_m and z_sd accumulate global information in different ways, and one-dimensional convolution is then performed on each of them. Note that z_m can effectively extract salient information, and z_sd can extract differential information. After convolution, combination, and activation, the channel attention weight is obtained:

W_c = σ( F_m^k(z_m) + F_sd^k(z_sd) )
where F_m^k(·) and F_sd^k(·) denote one-dimensional convolutions with kernel size k, and σ(·) is the sigmoid function.
The convolution kernel size k can be adaptively calculated by the formula in the literature [18]:

k = | log2(C)/γ + b/γ |_odd

where C is the number of channels, γ and b are set to 2 and 1, respectively, and |·|_odd denotes the nearest odd number.

The Lightweight Spatial Attention Module (LSAM)
The LSAM proposed in this paper is shown in Figure 4. In view of the small inter-class target differences in the high-frequency workpiece dataset, a convolution kernel of size 3 × 3 is used to replace the 7 × 7 kernel in SAM. At the same time, in order to obtain multi-scale context information, a dilated convolution with a kernel size of 3 × 3 is added in parallel. Since dilated convolution produces a gridding effect, the outputs of the different receptive fields are summed and combined. There are only 18 parameters in the entire convolution process, which greatly reduces the parameter count compared with the 49 parameters of SAM.

Maximum pooling and average pooling are performed on each pixel of the input feature map X2 ∈ R^(C×H×W) along the channel dimension, and the two resulting feature maps are concatenated by channel. Next, a convolution with a kernel size of 3 × 3 and a dilated convolution with a kernel size of 3 × 3 are applied, with dilation rates of 1 and 2, respectively. In order to keep the size of the feature map unchanged, the padding parameters must be set accordingly. Finally, the spatial attention weight is obtained through activation and addition operations:

W_s = σ( F_C^(3×3)([P_max(X2); P_avg(X2)]) + F_D^(3×3)([P_max(X2); P_avg(X2)]) )

where F_C^(3×3)(·) and F_D^(3×3)(·) denote the convolution and the dilated convolution with a kernel size of 3 × 3, respectively, and P_max(·) and P_avg(·) denote the maximum pooling and average pooling, respectively.
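The LSAM procedure above can be sketched in PyTorch as follows. Note that the text's count of 18 parameters suggests a more parameter-sharing arrangement than shown here (each 2-input-channel 3 × 3 convolution below has 18 weights on its own), so the exact wiring should be treated as an assumption.

```python
import torch
import torch.nn as nn

class LSAM(nn.Module):
    """Lightweight spatial attention: channel-wise max and average
    pooling, concatenation, then a parallel 3x3 convolution and a 3x3
    dilated convolution (dilation 2) whose outputs are summed before
    the sigmoid. Padding is chosen so the spatial size is preserved."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=3, padding=1, bias=False)
        self.dconv = nn.Conv2d(2, 1, kernel_size=3, padding=2,
                               dilation=2, bias=False)

    def forward(self, x):                          # x: (N, C, H, W)
        p_max, _ = x.max(dim=1, keepdim=True)      # channel-wise max pooling
        p_avg = x.mean(dim=1, keepdim=True)        # channel-wise average pooling
        z = torch.cat([p_max, p_avg], dim=1)       # (N, 2, H, W)
        w = torch.sigmoid(self.conv(z) + self.dconv(z))
        return x * w
```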

The Weakly Supervised Region Detection Module
High-frequency workpieces come in many types with small differences between classes, and the small differences between workpieces often appear in specific local areas. Based on these characteristics, this paper proposes a weakly supervised region detection module, which includes two mechanisms, boundary search and cropping, to locate areas with significant differences in the workpiece image, as shown in Figure 1.

The global multi-layer feature map M_g of the input image is generated through the backbone network EfficientNet-B1, and the energy map M_E is obtained by superimposing the feature maps of all channels. In order to eliminate the interference of negative elements, all elements of M_E are normalized to [0, 1] to obtain the scaled energy map:

M_E(i) = (M_E(i) − min(M_E)) / (max(M_E) − min(M_E))
where max(M_E) and min(M_E) denote the maximum and minimum values in M_E, respectively. In order to further improve the positioning accuracy, bilinear interpolation is used to upsample M_E to a size of 25 × 25.
Then, the scaled energy map M_E is aggregated into two one-dimensional structured energy vectors:

V_w(w) = Σ_{h=1}^{H} M_E(h, w),  V_h(h) = Σ_{w=1}^{W} M_E(h, w)

where V_w and V_h denote the one-dimensional structured energy vectors along the width and height directions of space, respectively. Taking V_w as an example, the energy of different elements is extracted as follows.
where E[0 : W] denotes the energy sum of all elements in the width vector, and E[x_1 : x_2] denotes the energy of the area with width from x_1 to x_2. The key area in the global image is defined as the area occupying the smallest extent while meeting the following condition:

E[x_1 : x_2] / E[0 : W] ≥ γ

where γ denotes the preset threshold.
The width boundary coordinates [x_1 : x_2] and the height boundary coordinates [y_1 : y_2] of the area can be found automatically using the boundary search mechanism. Then, a cropping mechanism is used to extract effective workpiece information and valuable background information from the original image according to the full boundary coordinates [x_1 : x_2, y_1 : y_2], finally yielding the local workpiece image I2.
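The boundary search and cropping mechanisms can be sketched in NumPy as follows. The shortest-interval search and the proportional mapping from feature-map coordinates back to image pixels are assumptions for illustration; the paper additionally upsamples the energy map to 25 × 25 with bilinear interpolation before aggregation.

```python
import numpy as np

def scale_energy(m_e):
    """Normalize an energy map to [0, 1]."""
    return (m_e - m_e.min()) / (m_e.max() - m_e.min() + 1e-12)

def search_boundary(v, gamma=0.70):
    """Find the shortest interval [a, b) of a 1-D energy vector whose
    energy is at least a fraction gamma of the total energy."""
    total = v.sum()
    n = len(v)
    best = (0, n)
    for a in range(n):
        acc = 0.0
        for b in range(a + 1, n + 1):
            acc += v[b - 1]
            if acc >= gamma * total:
                if b - a < best[1] - best[0]:
                    best = (a, b)
                break
    return best

def crop_local_region(image, feat_maps, gamma=0.70):
    """Sum the channel feature maps into an energy map, find the key
    region, and crop the corresponding area of the original image.
    `feat_maps` has shape (C, h, w)."""
    m_e = scale_energy(feat_maps.sum(axis=0))
    v_h = m_e.sum(axis=1)                 # energy along the height direction
    v_w = m_e.sum(axis=0)                 # energy along the width direction
    y1, y2 = search_boundary(v_h, gamma)
    x1, x2 = search_boundary(v_w, gamma)
    H, W = image.shape[:2]
    h, w = m_e.shape
    return image[y1 * H // h : y2 * H // h, x1 * W // w : x2 * W // w]
```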

The Branch Fusion Module
In order to simultaneously utilize the global and local information of the image and weigh the roles of the two types of information in workpiece image recognition, a branch fusion module is proposed that takes into account the recognition results of the two branches to further improve recognition accuracy, as shown in Figure 1. It should be noted that in the proposed model, the two channel attention modules do not share parameters, so that workpiece features are extracted at different scales. The fused recognition score is calculated as follows:

P = µ P1 + (1 − µ) P2

where P1 and P2 denote the recognition results of the global image and the local image, respectively, and µ denotes the balancing factor that weighs the recognition results of the different branches.
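The fusion can be computed directly; for example, with µ = 0.6, global scores [0.7, 0.2, 0.1] and local scores [0.5, 0.4, 0.1] fuse to [0.62, 0.28, 0.10]. A one-line sketch (the exact linear weighting is our reading of the balance-factor description):

```python
import numpy as np

def fuse(p1, p2, mu=0.6):
    # Branch fusion: weight the global result P1 by mu and the
    # local result P2 by (1 - mu).
    return mu * np.asarray(p1, dtype=float) + (1.0 - mu) * np.asarray(p2, dtype=float)

# fuse([0.7, 0.2, 0.1], [0.5, 0.4, 0.1]) -> [0.62, 0.28, 0.10]
```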

Dataset
The high-frequency workpiece image dataset used in the experiments comes from a research institute in China. The dataset contains 20 different categories of workpieces, with 1000 images per category, totaling 20,000 images. The image size is 3822 × 2702. The dataset is randomly divided into a training set and a validation set in a ratio of 7:3. Figure 5 shows examples of six different categories of high-frequency workpieces, where the red circles indicate the small differences between the six categories. As can be seen from Figure 5, each type of high-frequency workpiece image has a complex internal texture, while the differences between classes are small. In addition, there are local bright spots caused by uneven lighting in the images. Therefore, classifying multi-category high-frequency workpiece images is a challenging task.



Experimental Settings
The training of ML-EfficientNet-B1 is carried out with the Adam optimizer [30], where the initial learning rate is set to 1 × 10^−4. The batch size and image size are set to 8 and 224 × 224, respectively. Training is performed for 50 epochs on a machine with an NVIDIA GeForce GTX 1660 SUPER GPU (NVIDIA, Santa Clara, CA, USA). The implementation uses the PyTorch 1.8 package [31]. Note that the network's weight parameters are initialized using pre-training on ImageNet.
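The stated settings translate into a standard training loop along these lines; `model` and `train_loader` are placeholders, and the cross-entropy loss is an assumption, as the loss function is not specified in this section.

```python
import torch
from torch import nn, optim

def train(model, train_loader, epochs=50, device="cpu"):
    # Settings from the text: Adam, initial learning rate 1e-4, 50
    # epochs; batch size 8 and 224x224 inputs are configured in the
    # DataLoader. The model is assumed to start from ImageNet
    # pre-trained weights.
    model = model.to(device)
    opt = optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()  # assumed loss; not stated here
    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            opt.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            opt.step()
    return model
```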

Model Parameter Selection
The cropping range threshold γ determines the size of the effective area extracted from the global image, which in turn affects the accuracy of workpiece recognition. If γ is too small, workpiece features will be lost excessively; conversely, if γ is too large, the network will not be able to focus on important local features. Therefore, the cropped area should be limited to a reasonable range, which this paper sets to [0.60, 0.80]. In addition, the parameter µ in Equation (11) also greatly affects the network's emphasis on the different branches.
In order to determine the optimal threshold, the balance factor was first set to 0.6, and the recognition accuracy of high-frequency workpieces was then tested with γ values of 0.60, 0.65, 0.70, 0.75, and 0.80. The experimental results are shown in Table 1. As can be seen from Table 1, as the cropping range threshold increases, the accuracy first increases and then decreases. When the threshold is 0.70, the proposed algorithm achieves the best performance; therefore, 0.70 is selected as the final threshold.
In order to evaluate the impact of the balance factor µ on the recognition results, γ was fixed at 0.70, and the high-frequency workpiece recognition accuracy was tested for different values of µ. The experimental results are shown in Table 2. As can be seen from Table 2, when µ is 0.6, the highest recognition accuracy is obtained. As µ decreases, the recognition accuracy of high-frequency workpieces gradually decreases. This is because the cropped local image contains less information, and relying too heavily on it prevents the network from obtaining better results. Therefore, we take µ = 0.6 as the balance parameter.

Comparison of Recognition Performance
In order to verify the effectiveness of the proposed high-frequency workpiece image recognition model, a comparative experiment was conducted against a variety of image recognition algorithms: RTMM [20], JLS-DL [23], EfficientNet [27], a mechanical parts identification algorithm based on a convolutional neural network [32] (denoted as WorkNet-2), main bearing cover parts identification based on deep learning [24] (denoted as MFF-CNN), a part recognition algorithm based on an improved convolutional neural network [33] (denoted as Xception-P), NOAH [34], and RAFIC [35]. The experimental results are shown in Table 3. Table 3 shows that the recognition performance of the proposed ML-EfficientNet-B1 on high-frequency workpiece images is significantly better than that of the other methods. Specifically, the improvements over EfficientNet, WorkNet-2, MFF-CNN, Xception-P, RTMM, JLS-DL, NOAH, and RAFIC on the high-frequency workpiece image dataset are 12.1%, 8.3%, 7.4%, 6.2%, 4.7%, 3.4%, 2.2%, and 1.0%, respectively.
In addition, Figure 6 shows the confusion matrix of the recognition results of the proposed model. As can be seen from Figure 6, the recognition accuracy of ML-EfficientNet-B1 for every type of high-frequency workpiece is above 90%.


Ablation Study
In this sub-section, we carry out an ablation study of the proposed ML-EfficientNet-B1 in order to show the effect of each new module on network performance. Specifically, while retaining the EfficientNet-B1 network, the effects of the lightweight mixed attention module (LMAM), the weakly supervised region detection module (WSRDM), and the branch fusion module (BFM) on the entire model are verified by controlling variables. The experimental results are shown in Table 4. The following can be seen from Table 4: (1) When directly using the EfficientNet-B1 network to classify high-frequency workpiece images, the recognition accuracy is 86.2%.
(2) After adding the LMAM to the backbone network EfficientNet-B1, the accuracy increases by 10.5%. This shows that the LMAM can perceive feature information from different color channels, overcome the impact of illumination changes, and more effectively extract the features of high-frequency workpiece images.
(3) On the basis of the LMAM, further adding the WSRDM increases the accuracy by another 0.7%, which shows that the WSRDM focuses feature learning on differentiated effective areas and can improve recognition accuracy. (4) When LMAM, WSRDM, and BFM are used simultaneously, the accuracy improves by a further 0.9%, indicating that combining global and local information can improve the recognition performance of high-frequency workpieces.


Visualization Results
In order to verify the effectiveness of the proposed algorithm, some images are randomly selected from the same test set for visual display. Figure 7 shows the subtle differences between different categories of high-frequency workpieces and visualizes the attention features before and after the algorithm improvement. As can be seen from Figure 7, compared with the original EfficientNet network, the improved ML-EfficientNet-B1 model makes the network focus more on the area where the boss is located. Therefore, the proposed model can extract more discriminative workpiece image features, significantly improving the recognition performance of the network.
Table 5 reports the inference time of the proposed ML-EfficientNet-B1 and the other algorithms on a machine with an NVIDIA GeForce RTX 3090 GPU, for an image with a spatial resolution of 224 × 224. As seen from the table, the inference time of the proposed network is 0.133 s, which is acceptable given its high performance.


Discussion and Limitation
In summary, the improved model proposed in this paper can effectively improve the recognition accuracy of multi-category high-frequency workpiece images, mitigate the impact of illumination changes on recognition results, and make full use of the global and local features of workpiece images. However, when classifying high-frequency workpiece images, the model uses only the top-surface image and does not consider the thickness of the workpiece, so it cannot be applied to workpieces that differ in thickness but share the same top-surface appearance. In subsequent work, combining multi-view images can be considered to improve classification accuracy and increase the number of recognizable categories.

Conclusions
This paper proposes a high-frequency workpiece image recognition model that integrates multi-level network structures. First, the LMAM is introduced to enhance the feature extraction capability of the network and reduce the impact of illumination changes on high-frequency workpiece recognition results. Then, a weakly supervised area detection module is used to search for and locate differentiated local areas. Finally, the branch fusion module is used to balance the network's ability to capture the global and local features of workpiece images. Experimental results show that, compared with the original EfficientNet network, the proposed model improves the recognition accuracy of high-frequency workpieces by 12.1%. In addition, compared with various image recognition methods, the proposed model achieves a measurable improvement in the recognition accuracy of high-frequency workpieces. In the future, in order to better deploy the model on the production line, we will study a more lightweight high-frequency workpiece classification model.

(1) We introduce a lightweight mixed attention module (LMAM) to extract global workpiece image features with strong illumination robustness, and the global recognition results are obtained through the backbone network. (2) We use a weakly supervised area detection module to locate the locally important areas of the workpiece, which is then introduced into the branch network to obtain local recognition results. (3) We combine the global and local recognition results in the branch fusion module to achieve the final recognition of high-frequency workpiece images.
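The three stages above can be sketched as a single pipeline. All callables below are hypothetical stand-ins for the actual networks: `backbone` returns global class scores, `wsrdm` returns a bounding box for the discriminative region, `branch_net` scores the cropped region, and `fuse` combines the two score vectors.

```python
import numpy as np

def recognize(image, backbone, wsrdm, branch_net, fuse):
    """Three-stage sketch of the ML-EfficientNet-B1 pipeline.

    image: (H, W) or (H, W, C) array.
    wsrdm returns a box (x0, y0, x1, y1) in pixel coordinates.
    """
    global_scores = backbone(image)                  # (1) global branch
    x0, y0, x1, y1 = wsrdm(image)                    # (2) weakly supervised region
    local_scores = branch_net(image[y0:y1, x0:x1])   # local branch on the crop
    return fuse(global_scores, local_scores)         # (3) branch fusion
```

Keeping the three stages as separate callables mirrors the modular structure of the model: each branch can be ablated independently, as in Table 4.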

Figure 1 .
Figure 1. The framework of the proposed model.


Sensors 2024, 24
In this paper, a lightweight channel attention module (LCAM) and a lightweight spatial attention module (LSAM) are proposed to replace the channel attention module (CAM) and spatial attention module (SAM) in CBAM, respectively.

Figure 3.
Figure 3. Structure diagram of LCAM. Global average pooling and global standard deviation pooling are applied to the input feature map F1; the pooled vectors are each processed by a one-dimensional convolution, where k is the convolution kernel size and σ(·) is the sigmoid function.
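Assuming LCAM follows the ECA-style pattern suggested by the caption (dual global pooling, a one-dimensional convolution of size k, and a sigmoid), a minimal numerical sketch looks as follows. The `kernel` argument is a stand-in for the learned convolution weights, and the summing of the two pooled branches is an assumption, not the paper's exact design.

```python
import numpy as np

def lcam_channel_weights(feature_map, kernel, k=3):
    """ECA-style channel attention sketch in the spirit of LCAM.

    feature_map: (C, H, W) array. Global average pooling and global
    standard-deviation pooling each yield a C-vector; each vector is
    passed through a 1-D convolution of size k, the results are
    summed, and a sigmoid gives one weight per channel.
    """
    c = feature_map.shape[0]
    avg = feature_map.mean(axis=(1, 2))   # global average pooling
    std = feature_map.std(axis=(1, 2))    # global std-deviation pooling
    pad = k // 2

    def conv1d(v):
        vp = np.pad(v, pad, mode="edge")
        return np.array([vp[i:i + k] @ kernel for i in range(c)])

    z = conv1d(avg) + conv1d(std)
    return 1.0 / (1.0 + np.exp(-z))       # sigmoid -> per-channel weights
```

The 1-D convolution keeps the module lightweight: its parameter count is k, independent of the number of channels.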


P_max and P_avg denote maximum pooling and average pooling, respectively.
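A minimal sketch of the spatial-attention computation implied here, with the module's learned combination step simplified to a plain sum of the two pooled maps (an assumption for illustration, not the paper's exact design):

```python
import numpy as np

def lsam_spatial_map(feature_map):
    """Spatial-attention sketch: P_max and P_avg pooled across channels.

    feature_map: (C, H, W). The channel-wise maximum map and average
    map are combined and squashed with a sigmoid, giving one attention
    weight per spatial position.
    """
    p_max = feature_map.max(axis=0)    # maximum pooling over channels
    p_avg = feature_map.mean(axis=0)   # average pooling over channels
    return 1.0 / (1.0 + np.exp(-(p_max + p_avg)))
```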

The energy map E_M is obtained by superimposing the feature maps of all channels. In order to eliminate the interference of negative elements, all elements of E_M are normalized to [0, 1] to obtain the scaled energy map.
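The superposition and [0, 1] normalization of the energy map can be sketched as follows, assuming min-max scaling (the text does not specify the exact normalization formula):

```python
import numpy as np

def scaled_energy_map(channel_maps):
    """Superimpose per-channel feature maps and rescale to [0, 1].

    channel_maps: (C, H, W). Summing over channels gives the energy
    map E_M; min-max normalization removes negative elements, giving
    the scaled energy map.
    """
    e = channel_maps.sum(axis=0)
    lo, hi = e.min(), e.max()
    if hi == lo:                       # constant map: avoid divide-by-zero
        return np.zeros_like(e)
    return (e - lo) / (hi - lo)
```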

The batch size and image size are set to 8 and 224 × 224, respectively.

Figure 6 .
Figure 6. Confusion matrix on the high-frequency workpiece image dataset.


Figure 7 .
Figure 7. Attention feature visualization. Note that the first row denotes the original images, the second row denotes the results obtained by EfficientNet, and the third row denotes the results obtained by the proposed ML-EfficientNet-B1.


Table 1 .
The impact of the size of threshold γ on the recognition results.

Table 2 .
The impact of balancing factors on recognition results.

Table 3 .
Recognition results of different algorithms.

Table 4 .
The impact of different modules on network performance.

Table 5 .
Complexity of different algorithms.
