Imperfect Wheat Grain Recognition Combined with an Attention Mechanism and Residual Network

Abstract: Intelligent detection of imperfect wheat grains based on machine vision is of great significance for evaluating wheat quality correctly and rapidly. Imperfect and perfect wheat grains differ only slightly in some of their characteristics, which is a key factor limiting the classification and recognition accuracy of deep learning network models for imperfect wheat. In this paper, we propose a method for imperfect wheat grain recognition that combines an attention mechanism with a residual network (ResNet), and we verify its recognition accuracy by adding an attention mechanism module to residual networks of different depths. Five residual networks of different depths (18, 34, 50, 101, and 152 layers) were selected for the experiment. The recognition accuracy of every network model improved with the attention mechanism, and the average recognition rate of ResNet-50 with the attention mechanism reached 96.5%. For ResNet-50 with the attention mechanism, an optimal learning rate of 0.0003 was further selected, at which the average recognition accuracy reached 97.5%; the recognition rates of scab, insect-damaged, sprouted, mildew, broken, and perfect wheat grains reached 97%, 99%, 99%, 95%, 96%, and 99%, respectively. This work can provide guidance for the detection and recognition of imperfect wheat grains using machine vision.


Introduction
Wheat is a major grain crop and an important commodity and strategic reserve grain variety in China, playing an important role in grain production, circulation, and consumption. Imperfect wheat grains are damaged but still usable grains, including scab, insect-damaged, sprouted, mildew, and broken wheat grains. In the process of wheat circulation, the content of imperfect grains is the limiting index for measuring wheat quality [1]. At present, the detection of imperfect grains mainly relies on artificial sensory detection methods and recognition methods based on machine vision [2]. Because manual detection is time consuming and laborious and has low reproducibility and strong subjectivity, it can no longer meet the requirements of rapid and accurate detection of large-scale wheat. In recent years, automatic wheat grain recognition based on machine vision has attracted widespread attention. This detection technology has made progress in wheat species and variety recognition [3,4], classification of wheat and similar grains [5], wheat quality detection and grading [6,7], imperfect grain detection [8], and cuticle rate and hardness detection [9]. Classification based on machine vision uses a camera to capture a wheat image; through analysis and processing, the shape, color, texture, and other characteristic parameters of the wheat are calculated, so that quality inspection of wheat size, color, surface smoothness, surface defects, and damage can be completed in one pass. The corresponding grading mechanisms are then controlled to carry out wheat quality grading. This method not only overcomes the disadvantages of manual classification but also does not damage the wheat during recognition.
At present, most identification methods adopt feature extraction algorithms, but artificial feature extraction must be continuously optimized through testing, and the process is quite complex. In practice, wheat varieties are mixed, defect types can overlap on a single grain (e.g., diseased spots on a broken grain), and image collection inevitably introduces shift and uneven illumination [10]. As a result, it is difficult to find accurate and stable features in practical applications, and this approach can no longer meet the need for rapid identification of imperfect grains.
With the rapid development of deep learning in the field of image recognition, the convolutional neural network (CNN) has also received extensive attention in agriculture [11], achieving excellent performance in, for example, the identification of plant diseases and insect pests [12][13][14], weed identification [15], crop species identification [16,17], and crop yield estimation [18]. There are also many imperfect wheat grain identification methods based on CNNs. In 2010, Cheng et al. [19] used a two-layer back propagation (BP) neural network to identify perfect and broken wheat grains, and the recognition accuracy reached 97.5%. In 2017, Cao et al. [20] added spatial pyramid pooling to a conventional CNN and used this model to identify perfect wheat grains and two types of imperfect wheat grains; the average recognition rate in testing reached 93.36%. In 2017, Le et al. [21] realized rapid identification of perfect and imperfect wheat grains by combining hyperspectral data with a CNN. In 2020, Zhu et al. [22] used four CNNs (LeNet-5, AlexNet, VGG-16, and ResNet-34) to identify perfect and broken wheat grains and compared them with a traditional support vector machine (SVM) and a BP neural network; the results showed that the identification accuracy was greatly improved by using CNNs. In 2021, He et al. [23] used the LeNet-5, ResNet-34, and VGG-16 models combined with an image enhancement method to highlight the characteristics of imperfect grains, and the test accuracy improved by 1% compared with the models without image enhancement.
The image features of imperfect wheat grains are not clearly distinguished and their overall similarity is high; their classification and recognition can therefore be treated as a fine-grained image classification problem. In 2017, Luo et al. [24] pointed out that, unlike ordinary image classification tasks, fine-grained images have a very small signal-to-noise ratio (SNR), and sufficiently discriminative information usually exists only in very small, local regions. How to effectively extract and utilize the useful information in these local regions is the key to the success of a fine-grained image classification algorithm. In 2017, the Google team [25] proposed a simple network structure based on an attention mechanism and applied it to machine translation; the attention mechanism was subsequently applied widely throughout deep learning. In 2018, Woo [26] proposed an attention module, the convolutional block attention module (CBAM), and added it to CNN models; testing showed that models with the attention module outperformed models without it in image recognition. In 2019, Xu et al. [27] added a channel attention mechanism to VGG-16 and compared results on three fine-grained image datasets; the network with the attention mechanism not only improved classification accuracy but also showed good generalization ability. In 2020, Peng et al. [28] applied a CNN with an attention mechanism to soybean aphid identification, which resulted in higher accuracy.
Based on the above, this study will attempt to combine the attention mechanism and residual network to classify and recognize six kinds of wheat grains: scab wheat grains, insect-damaged wheat grains, sprouted wheat grains, mildew wheat grains, broken wheat grains, and perfect grains. The aim is to explore a more accurate deep learning model that is suitable for imperfect wheat grain recognition, and to provide guidance for intelligent detection and recognition methods of wheat.
The structure of this paper is as follows: the second section introduces the methods used in this paper, the third section shows the experimental results, tables, and discussions, and the conclusions are presented in the fourth section.

Attention Mechanism
The attention module used in this paper is CBAM [26], which consists of two parts: channel attention and spatial attention. Channel attention focuses on 'what' is meaningful in an input image, while spatial attention focuses on 'where' the informative parts are.

Channel Attention
The channel attention module passes the input feature map F ∈ R^(C×H×W) through both an average-pooling and a max-pooling layer. The resulting feature vectors are denoted F^c_avg and F^c_max, respectively. Both vectors then pass through a shared multi-layer perceptron (MLP), and the channel attention map M_c ∈ R^(C×1×1) is obtained as [26]:

M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F^c_avg)) + W_1(W_0(F^c_max)))

where σ is the sigmoid function. The MLP contains one hidden layer; to reduce parameter overhead, the hidden activation size is set to R^(C/r×1×1), where r is the reduction ratio, set to 16. W_0 ∈ R^(C/r×C) and W_1 ∈ R^(C×C/r) are the weights shared by the MLP. Note that a ReLU activation function follows W_0.
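The channel attention computation above can be sketched in a few lines of NumPy; the weights and dimensions below are placeholder assumptions, not values from this paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W0, W1):
    """Minimal NumPy sketch of CBAM channel attention.

    F  : feature map of shape (C, H, W)
    W0 : (C/r, C) weight of the shared MLP's hidden layer
    W1 : (C, C/r) weight of the shared MLP's output layer
    """
    # Squeeze the spatial dimensions with average- and max-pooling.
    f_avg = F.mean(axis=(1, 2))          # F^c_avg, shape (C,)
    f_max = F.max(axis=(1, 2))           # F^c_max, shape (C,)
    # Shared MLP: ReLU after W0, then W1; the two branches are summed.
    mlp = lambda v: W1 @ np.maximum(W0 @ v, 0.0)
    return sigmoid(mlp(f_avg) + mlp(f_max))   # M_c, shape (C,)

# Toy usage with C = 8 and reduction ratio r = 4 (random placeholder weights).
rng = np.random.default_rng(0)
C, r = 8, 4
F = rng.standard_normal((C, 16, 16))
W0 = rng.standard_normal((C // r, C))
W1 = rng.standard_normal((C, C // r))
Mc = channel_attention(F, W0, W1)
```

The sigmoid keeps every channel weight in (0, 1), so M_c acts as a soft per-channel gate on the feature map.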

Spatial Attention Module
The spatial attention module applies average-pooling and max-pooling along the channel axis of the feature map, generating two 2D maps, F^s_avg ∈ R^(1×H×W) and F^s_max ∈ R^(1×H×W), which denote the average-pooled and max-pooled features across the channel dimension. These maps are concatenated and convolved by a standard convolution layer to generate the spatial attention map M_s ∈ R^(H×W). Spatial attention is calculated as [26]:

M_s(F) = σ(f^(7×7)([AvgPool(F); MaxPool(F)])) = σ(f^(7×7)([F^s_avg; F^s_max]))

where σ is the sigmoid function and f^(7×7) denotes a convolution operation with a 7 × 7 kernel.
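The spatial branch can be sketched the same way; the 7 × 7 kernel below holds random placeholder values, and the convolution is written out explicitly with padding 3 so the output keeps the input's spatial size:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(F, kernel):
    """Minimal NumPy sketch of CBAM spatial attention.

    F      : feature map of shape (C, H, W)
    kernel : (2, 7, 7) convolution weights (placeholder values here)
    """
    # Pool along the channel axis to get two (H, W) descriptors.
    f_avg = F.mean(axis=0)               # F^s_avg
    f_max = F.max(axis=0)                # F^s_max
    stacked = np.stack([f_avg, f_max])   # channel-wise concat, (2, H, W)
    # 7x7 convolution with padding 3 keeps the output at H x W.
    pad = np.pad(stacked, ((0, 0), (3, 3), (3, 3)))
    H, W = f_avg.shape
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(pad[:, i:i + 7, j:j + 7] * kernel)
    return sigmoid(out)                  # M_s, shape (H, W)

rng = np.random.default_rng(1)
F = rng.standard_normal((8, 12, 12))
kernel = rng.standard_normal((2, 7, 7)) * 0.1
Ms = spatial_attention(F, kernel)
```

Each entry of M_s lies in (0, 1) and weights the corresponding spatial location of the feature map.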

ResNet Model
ResNet [29] is built by stacking a large number of residual blocks. Its core idea is to give each residual block an identity mapping ability, which can make the input of a building block equal to its output. This identity mapping is achieved through shortcut connections, which add the input of a block to its output. Two types of basic residual blocks are used: those of ResNet-18/34 and those of ResNet-50/101/152, shown in Figure 1a,b, respectively. A residual block can be expressed as [29]:

X_(i+1) = X_i + F(X_i)

where X_i is the direct (identity) mapping part of the residual block and F(X_i) is the residual mapping part, obtained after two or three convolution operations.
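The identity-mapping property is easy to demonstrate: if the residual branch F outputs zero, the block passes its input through unchanged. A minimal sketch, with the two or three convolutions of F replaced by an arbitrary shape-preserving function:

```python
import numpy as np

def residual_block(x, transform):
    """Identity-mapping residual block: X_{i+1} = X_i + F(X_i).

    `transform` stands in for the convolutions of the residual
    branch F; it can be any shape-preserving function.
    """
    return x + transform(x)

# With a zero residual branch, the block is an exact identity mapping.
x = np.arange(6.0).reshape(2, 3)
y = residual_block(x, lambda t: np.zeros_like(t))
```

This is why deeper ResNets do not degrade the way plain stacked networks do: each block only has to learn a correction on top of the identity.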

The Residual Block Integrates with the Attention Mechanism
Integrating the residual block with the attention mechanism requires adding the attention mechanism inside the ResNet residual block. This is done by appending channel attention and spatial attention at the end of each basic residual block of the ResNet model and adding the attention-refined residual to the input to generate the new feature map. Because the basic residual blocks used by ResNet models of different depths differ, the basic residual blocks with the attention mechanism also differ. Figure 2a,b shows structural diagrams of the residual blocks of ResNet-18/34 and ResNet-50/101/152 with the attention mechanism. A residual block with attention can be expressed as [26]:

F' = M_c(F) ⊗ F,  F'' = M_s(F') ⊗ F'

where F is the residual map obtained after two or three convolution operations, F'' is the final feature map after channel attention and spatial attention (equivalent to the residual map with added attention), and ⊗ denotes element-wise multiplication.
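The refinement order (channel first, then spatial, then the shortcut addition) can be sketched as follows; the residual branch and both attention maps are stand-in callables here, not the paper's trained layers:

```python
import numpy as np

def cbam_residual_block(x, residual, channel_att, spatial_att):
    """Sketch of a residual block with CBAM attention appended.

    residual(x)     -> F,   shape (C, H, W)
    channel_att(F)  -> M_c, shape (C, 1, 1)
    spatial_att(F') -> M_s, shape (1, H, W)
    The attention-refined residual F'' is added back to the input x.
    """
    F = residual(x)
    F1 = channel_att(F) * F      # F'  = M_c(F) ⊗ F  (broadcast multiply)
    F2 = spatial_att(F1) * F1    # F'' = M_s(F') ⊗ F'
    return x + F2                # shortcut connection

# Toy run: constant attention maps of 0.5 scale the residual by 0.25,
# so an all-ones input with F(x) = x yields 1 + 0.25 = 1.25 everywhere.
x = np.ones((2, 4, 4))
out = cbam_residual_block(
    x,
    residual=lambda t: t,
    channel_att=lambda F: np.full((2, 1, 1), 0.5),
    spatial_att=lambda F: np.full((1, 4, 4), 0.5),
)
```

NumPy broadcasting handles the shape mismatch between the (C, 1, 1) and (1, H, W) attention maps and the (C, H, W) feature map, just as the ⊗ in the formula implies.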

Data Acquisition
The wheat samples used in this study were collected by the relevant personnel of the Anhui Grains and Oil Research Institute and included perfect wheat grains and five kinds of imperfect wheat grains: scab, insect-damaged, sprouted, mildew, and broken wheat grains. Sample pictures are shown in Figure 3. The data were acquired by professional wheat quality inspection personnel of the research institute. The dataset used for wheat grain identification consisted of 1800 manually taken wheat grain images in six categories; each category had 300 images, and each image was a 100 × 100 pixel three-channel RGB image. The original dataset was divided into a training set and a test set, with 200 images per category used for training and 100 for testing.
Because the wheat images were captured in a single fixed scene, the training set was enriched by rotating each sample by an arbitrary angle, adjusting image saturation, brightness, contrast, and sharpness, and adding image noise. This diversity gives the model a stronger generalization ability. The training set of each type of wheat (200 pictures) was expanded to 3000, so that the total number of training pictures reached 18,000. These 18,000 pictures were split 9:1 during training: 16,200 pictures actually participated in network training, and 1800 pictures served as the validation set.
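An augmentation pass of this kind can be sketched with simple array operations; only brightness/contrast jitter and additive noise are shown here, since arbitrary-angle rotation and sharpness adjustment would normally come from an image library such as PIL. The jitter ranges are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(img):
    """Toy augmentation of one (H, W, 3) uint8 image."""
    x = img.astype(np.float64)
    x = x * rng.uniform(0.8, 1.2)                            # brightness
    x = (x - x.mean()) * rng.uniform(0.8, 1.2) + x.mean()    # contrast
    x = x + rng.normal(0.0, 5.0, x.shape)                    # Gaussian noise
    return np.clip(x, 0, 255).astype(np.uint8)

# Expanding each image into 15 variants takes a class of 200 training
# pictures to the 3000 described above.
img = rng.integers(0, 256, (100, 100, 3), dtype=np.uint8)
variants = [augment(img) for _ in range(15)]
```

Clipping back to [0, 255] and restoring uint8 keeps every augmented image a valid RGB input of the same 100 × 100 size.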

Hardware and Software Preparation
The main hardware parameters used in this test were an Intel(R) Core(TM) i5-7400 CPU @ 3.00 GHz and an NVIDIA RTX 1050Ti GPU with 4 GB of memory for GPU acceleration. Python 3.8 was the programming language, and the TensorFlow 2.3 framework was used to build the model on the PyCharm development platform.

Model Training and Test Results
To verify the feasibility of the attention mechanism, ten groups of models were tested: ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152, each without and with an attention mechanism. Each network model was trained for 100 epochs, the initial learning rate was set to 0.0001, and the learning rate decayed by 1% per iteration. Figure 4 shows the loss-function iteration curves of the ResNet training process without and with an attention mechanism, along with the accuracy iteration curves of the training and testing processes. As seen in Figure 4a, the convergence of the training loss curve of the ResNet model without the attention mechanism slows as the number of network layers increases; Figure 4c,e shows that the growth of training and test accuracy slows correspondingly. Figure 4b shows that the ResNet model with the attention mechanism greatly reduces this phenomenon: the training loss curves of networks with different numbers of layers converge almost simultaneously. Compared with the results without an attention mechanism, Figure 4d,f shows that training of the network models with an attention mechanism is significantly more stable and achieves higher accuracy within only a small number of iterations.
Tables 1 and 2 compare the number of parameters, training time, and optimal accuracy of the ResNet training process without and with an attention mechanism. According to Table 1, the classification accuracy of imperfect wheat grains did not increase with the number of layers in the ResNet model: ResNet-50 achieved the highest classification accuracy in a relatively short time, while ResNet-101 and ResNet-152 needed more training time for the same number of iterations without achieving higher accuracy. Tables 1 and 2 also show that adding the attention mechanism increased the number of trainable parameters; for the same training batch, the training time for 100 iterations increased by 1-2 h.
By comparing the accuracies, it can be seen that the classification of imperfect wheat grains improved significantly after the addition of the attention mechanism, with ResNet-50 giving the best result. The only decrease occurred for ResNet-152, presumably because its number of parameters was too large: to keep training feasible without changing the computer equipment, the training batch had to be reduced, which significantly increased the training time of ResNet-152 without improving its recognition accuracy.

Comparison of Identification Results
The dataset used for the test included perfect wheat grains and the five types of imperfect wheat grains, with 100 images (100 × 100 pixels) per category, for a total of 600 images. Precision, recall, and F-measure were used to evaluate the performance of the model and were calculated as follows:

Precision = TP / (TP + FP),  Recall = TP / (TP + FN),  F-measure = 2 × Precision × Recall / (Precision + Recall)

Here, TP is the number of positive samples labeled as positive, i.e., the number of grains of a given class correctly identified as that class. FP is the number of negative samples incorrectly labeled as positive, i.e., grains classified as a given class that actually belong to another class of imperfect wheat. FN is the number of positive samples incorrectly labeled as negative, i.e., grains of a given class incorrectly classified as another class of imperfect wheat.
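The three metrics follow directly from the TP/FP/FN counts. A small sketch with hypothetical counts (not figures from the paper's tables):

```python
def precision_recall_f(tp, fp, fn):
    """Per-class metrics from TP/FP/FN counts:
    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    F-measure = 2 * P * R / (P + R)
    """
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = 2 * p * r / (p + r)
    return p, r, f

# Hypothetical example: 95 of 100 mildew grains recognized correctly,
# with 5 grains of other classes mislabeled as mildew.
p, r, f = precision_recall_f(tp=95, fp=5, fn=5)   # each = 0.95
```

Because each test category has exactly 100 images, recall for a class is simply the number of correctly identified grains divided by 100.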
According to the above experimental comparison, the optimal network model both without and with an attention mechanism is ResNet-50. Therefore, confusion matrices were drawn and the precision, recall, and weighted F-measure were calculated to further compare the recognition accuracy for the various types of perfect and imperfect wheat grains using ResNet-50. Tables 3 and 4, respectively, present the ResNet-50 confusion matrix and performance analysis with and without an attention mechanism. According to the F-measure values in Table 3, the classification results for the various types of imperfect wheat grains in the ResNet-50 model were similar, all exceeding 94%. Comparing the values in Tables 3 and 4 shows that the ResNet-50 model with an attention mechanism is superior to the model without one in terms of the recognition rate for all kinds of imperfect wheat grains, although the F-measure of the mildew category increased by only 0.09% compared with the traditional ResNet-50. According to the confusion matrix, the model with an attention mechanism mistakenly classified six mildew wheat grains as sprouted wheat grains; as a result, the precision of the sprouted category and the recall of the mildew category decreased, and the F-measure of these two categories was significantly lower than that of the other categories. We also tested the inference time of the traditional ResNet-50 and the ResNet-50 with an attention mechanism on the 600 test-set pictures: the total inference times were 1.09 s and 1.63 s, respectively. The model with the attention mechanism took about 50% more time than the traditional model, which is reasonable given its additional parameters.
The model can thus produce a large number of image predictions in a very short time; the inference time for 600 images using ResNet-50 with an attention mechanism increased by only 0.49 s, which is acceptable.

Network Visualization with Grad-CAM
For further analysis, Grad-CAM [30] was used to visualize the attention of the network model. Grad-CAM is a recently proposed visualization method that uses gradients to calculate importance. As shown in Figure 5, compared to the network without an attention mechanism, the network model with an attention mechanism can focus attention more on the imperfect features of grains, which is obviously more beneficial to the recognition of the network model. It can be seen that the attention mechanism can obtain the feature information in pictures more effectively.

Optimize Learning Rate Results
To further improve the identification accuracy of imperfect wheat grains, the learning rate of ResNet-50 with the attention mechanism, the best-performing model, was optimized; seven initial learning rates were tested: 0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.001, and 0.01. Learning rate decay was also applied, with the rate decreasing by 1% per iteration. Iteration curves for the different learning rates are shown in Figure 6. As can be seen from Figure 6a, when the learning rate was set to 0.01, the loss function of the network model decreased slowly, which led to a slow increase in training and testing accuracy (Figure 6b,c); it took nearly 50 iterations to reach 90% accuracy. When the learning rate was small, the loss function converged quickly, the training and testing accuracy reached 90% within 10 iterations, and the curves were very stable. The network model therefore performs better with a small learning rate. Table 5 shows the model training parameters and training results at the different learning rates; the best test accuracy, 97.5%, was achieved when the learning rate was set to 0.0003. Table 6 shows the classification performance of ResNet-50 with an attention mechanism at this learning rate: the F-measure of every class of wheat grain exceeded 96%, and the average F-measure was 97.51%.
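The decay schedule described above (1% per iteration) can be written as a one-line formula; the sketch below uses the selected initial rate of 0.0003 purely for illustration:

```python
def decayed_lr(lr0, epoch, decay=0.01):
    """Learning rate after `epoch` iterations when the rate decays
    by 1% per iteration: lr_e = lr0 * (1 - decay) ** epoch."""
    return lr0 * (1.0 - decay) ** epoch

# With the selected initial rate of 0.0003:
lr_start = decayed_lr(0.0003, 0)     # 0.0003 at epoch 0
lr_end = decayed_lr(0.0003, 100)     # after the 100 training epochs
```

After 100 epochs the rate has dropped to roughly 37% of its initial value (0.99^100 ≈ 0.366), which gradually shrinks the update steps as training converges.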

Conclusions
In this paper, we combined an attention mechanism with the ResNet model to classify imperfect wheat grains and proposed an imperfect wheat grain recognition method based on the attention mechanism. The study compared the recognition performance of the ResNet models with and without an attention mechanism; the results showed that the recognition accuracy of all five ResNet models improved when the attention mechanism was added, and the recognition accuracy of ResNet-50 reached 97.5%. The confusion matrices and classification performance of the ResNet-50 model with and without an attention mechanism were calculated, and the results showed that the model with the attention mechanism was better at classifying perfect wheat grains and the five types of imperfect wheat grains. The attention mechanism increased the number of training parameters and therefore the training time, but it also enhanced the stability of the model, so introducing an attention mechanism into the classification of imperfect wheat grains is feasible.
The method proposed in this study provides a new idea for the automatic identification of imperfect wheat grains in practical applications. Effectively improving the automatic recognition accuracy of imperfect wheat grains is of great significance for the intelligent and automatic classification of grain quality. However, the classification in this study was limited to single wheat grains, which is inefficient in practical applications. In future work, we will identify large numbers of tiled wheat grains in one image and consider the problems of grain segmentation and target detection.

Data Availability Statement: Data sharing is not applicable to this article.

Conflicts of Interest:
The authors declare no conflict of interest.