HLNet Model and Application in Crop Leaf Diseases Identiﬁcation

Abstract: Crop disease has long been a severe issue for agriculture, causing economic losses for growers. Thus, disease identification urgently needs to be addressed, especially for precision agriculture. Today, deep learning combined with optical imaging sensors is widely used for crop disease identification. In this study, a lightweight convolutional neural network model was designed and validated on two publicly available imaging datasets and one self-built dataset covering 28 types of leaf and leaf disease images across 6 crops. The model improves on an existing convolutional neural network, reducing floating-point operations by 65%. In addition, dilated depth-wise convolutions were used to enlarge the network receptive field and improve recognition accuracy without affecting computational speed. Meanwhile, two attention mechanisms were optimized to reduce the computation of the attention modules and improve the model's ability to select the correct regions of interest. After training, the model achieved an average accuracy of 99.86% with a per-image computation time of 0.173 s. Compared with 11 backbone models and 5 recent crop leaf disease identification studies, the proposed model achieved the highest accuracy. The model therefore offers a favorable balance between computation speed and recognition accuracy, and it provides a theoretical basis and technical support for practical and mobile-terminal applications of crop disease recognition in precision agriculture.


Introduction
Currently, the growth rate of global food production is much lower than the growth rate of the population. Food production and security are of great significance for ensuring people's living standards and the development of the national economy [1]. Crop disease is one of the main disasters affecting agricultural production, and the scope and extent of disease seriously restrict the productivity, quality, and sustainable development of agriculture. One of the main bases for discovering diseases and identifying their types is the symptoms of diseased leaves [2]. Timely knowledge of the type, severity, and development of plant diseases can effectively reduce economic losses to agricultural and forestry production, reduce the environmental pollution caused by the abuse of pesticides, and provide a basis for disease prevention and control strategies [3].
Traditional artificial identification of diseases mainly relied on experts evaluating visible features such as symptoms [4]. Artificial identification is easily misled by subjective factors and can hardly meet the requirements of efficient disease identification [5]. In this study, a computer-vision-based identification method was adopted. In the data preparation part, the data volume was expanded and the data size unified through data pre-processing and data expansion. The data were labelled with the corresponding disease and then divided into training data and test data. In the model improvement part, we improved the structure of the convolutional neural network and added dilated depth-wise convolutions. The identification accuracy of each model was compared, and the optimal model was selected. Then, we improved the attention module based on the previously selected optimal model and designed two models incorporating the optimized and unoptimized attention modules, respectively. Comparisons with other CNN models (backbone models and crop leaf disease identification models) verified the effectiveness of our model. The flowchart of this study is shown in Figure 1.

Dataset and Expansion
In this study, public datasets and self-collected data were combined to establish the research datasets; the public datasets were from PlantVillage [26] and the UC Irvine Machine Learning Repository [27]. The self-collected data include 90 images of northern leaf blight on corn and 300 images of black spot on tomatoes from the experimental field of Jilin Agricultural University. The capture device was a OnePlus 8P smartphone with a 48-megapixel main camera, and the captured images were 3000 × 3000 pixels. The experimental data, 20,490 images in total, covered 6 crops and 28 types of leaf diseases. All images used in this study are digital RGB images in JPG format.
Using multiple image expansion methods can effectively improve model robustness and avoid over-fitting; this paper expanded the original sample data to three times its size using two random 40-degree rotations and horizontal flips. In the expansion process, the expanded copies of each original image were named in the format "image name a, b, c". When separating the validation data from the training data, each original image and its corresponding expanded images were treated as a whole and moved into the training or validation data together; they were never split across the two sets, so that no expanded copy of a training image could be confused into the validation data. The sample data were then divided into a training set and a validation set at a ratio of 8:2. To obtain a higher training speed while maintaining a good identification rate, an interpolation algorithm was used to uniformly compress the input images to 224 × 224 pixels. The information for each crop leaf disease can be found in Table A1 in Appendix A. Example images of crop diseases are shown in Figure 2.
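The group-aware split described above can be sketched in plain Python (a minimal illustration; the naming scheme and the 8:2 ratio follow the text, while the function name and suffix separator are our assumptions):

```python
import random

def group_aware_split(original_names, val_ratio=0.2, seed=42):
    """Split 8:2 so an original image and its expanded copies
    ("a", "b", "c") always land in the same subset."""
    rng = random.Random(seed)
    names = list(original_names)
    rng.shuffle(names)
    n_val = int(len(names) * val_ratio)
    val_originals = set(names[:n_val])
    train, val = [], []
    for name in names:
        # Each group: the original plus its three expanded copies.
        group = [name] + [f"{name}_{suffix}" for suffix in "abc"]
        (val if name in val_originals else train).extend(group)
    return train, val
```

Splitting after expansion without this grouping would let near-duplicates of a training image appear in the validation set and inflate the measured accuracy.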

HLNet
HLNet is a disease identification model with high computational speed and a lightweight structure, improved from the backbone model ShuffleNetV1. Firstly, HLNet has a more reasonable structure than ShuffleNetV1, which allows it to use more blocks while computing faster. Secondly, this study added dilated convolution layers with a larger receptive field, which are less computationally intensive than traditional convolution layers covering the same field and allow the model to achieve higher recognition accuracy. Finally, the attention mechanisms acting on the spatial and channel domains were improved: the range of the regions of interest and the number of extracted features were increased, while the computational cost of the attention mechanism was reduced. The improved attention mechanism was added to HLNet, significantly improving the model's ability to distinguish similar crop diseases. The structure of the HLNet model is shown in Figure 3.



ShuffleNetV1
CNN models have achieved excellent results in various fields due to their identification accuracy [28][29][30], and CNNs have gradually developed from high precision with slow calculation speed to high precision with fast calculation speed [31]. ShuffleNetV1, proposed in 2018, attracted great attention for its low computational load: it reduces computational complexity using depth-wise convolution (the number of groups in a convolutional layer equals the number of input channels) and group pointwise convolution. Group pointwise convolution, however, induces inefficient information interaction between channels, weakening the model's generalization ability. To overcome this side effect, ShuffleNetV1 introduced the channel shuffle operation to help information flow between feature channels; the specific operation is shown in Figure 4. This substantially reduces model parameters while maintaining high-precision performance. However, ShuffleNetV1 still has the following problems: (1) when the number of groups is 3 or higher, identification accuracy is highest but the amount of calculation is large; (2) group pointwise convolution is used frequently within each block, and a large number of group pointwise convolutions performing feature fusion causes computational redundancy.
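The channel shuffle operation can be sketched with NumPy (a minimal illustration of the reshape-transpose-flatten trick; real implementations operate on framework tensors):

```python
import numpy as np

def channel_shuffle(x, groups):
    """Interleave channels across groups: reshape the channel axis to
    (groups, channels_per_group), swap those two axes, flatten back."""
    n, c, h, w = x.shape
    assert c % groups == 0, "channel count must be divisible by groups"
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)  # swap group and per-group axes
    return x.reshape(n, c, h, w)
```

With 6 channels and 3 groups, channel order [0, 1, 2, 3, 4, 5] becomes [0, 2, 4, 1, 3, 5], so each output group mixes channels coming from every input group.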



Improvements of ShuffleNetV1 Model
In view of the shortcomings of ShuffleNetV1, this study improved it in several ways. Firstly, two parallel convolutional layers with 3 × 3 and 5 × 5 kernels were added as the first layer of the model; such layers provide multi-scale feature information and improve the extraction of lesions of different sizes. Secondly, the block structure of ShuffleNetV1 was optimized to reduce the amount of calculation in each block. Thirdly, 5 × 5 dilated depth-wise convolutions replaced the 3 × 3 depth-wise convolutions, improving identification accuracy without increasing the amount of calculation. Fourthly, the attention module was improved, reducing its computation and improving the model's identification accuracy for similar leaf diseases. Finally, the feature map was compressed to a size of 1 × 1 by a global average pooling layer and fed to the fully connected layer, which outputs the identified category. The model structure of this paper is shown in Figure 5.


Block Improvement
There were two block types in ShuffleNetV1: block (A), used for feature extraction, in which the output feature map has the same size as the input feature map, shown in Figure 6a; and block (B), which performs down-sampling, shown in Figure 6b. This study first improved block (A) by deleting the first group pointwise convolution and moving the channel shuffle module to the end, constructing the block in the order depth-wise convolution (feature extraction), group pointwise convolution (information circulation), channel shuffle (circulation compensation). Compared with the original block, this structure is more reasonable and the information circulation more comprehensive; the improved structure is shown in Figure 7a. Block (B) was also improved by removing group pointwise convolutions, reordering the layers, and replacing the 3 × 3 average pooling layer in the shortcut connection (which matches sizes between different layers) with a 1 × 1 pointwise convolution; the improved block (B) is shown in Figure 7b. After these optimizations, the problems of excessive calculation and computational redundancy in the model are resolved.
In addition to the above improvements, this paper replaced the ReLU activation function in ShuffleNetV1 with ReLU6, which limits the maximum output of ReLU to 6. This activation function speeds up the activation calculation and makes the model easy to transplant to small mobile devices.
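ReLU6 simply clips the ReLU output at 6; a one-line NumPy sketch:

```python
import numpy as np

def relu6(x):
    """ReLU capped at 6: min(max(x, 0), 6), applied elementwise."""
    return np.minimum(np.maximum(x, 0.0), 6.0)
```

The fixed output range keeps activations representable in low-precision fixed-point arithmetic, which is why ReLU6 is favored for mobile deployment.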

Dilated Convolution
The key to improving model convergence speed is to speed up the aggregation of feature information. When a convolution kernel slides over a feature map, the output feature map shrinks, which promotes information aggregation; with a larger kernel the feature map shrinks more, and information aggregates faster. Therefore, to increase the speed of information aggregation while keeping the amount of calculation unchanged, we dilated the 3 × 3 depth-wise convolutions (dilation rate 2), improving identification accuracy and speeding up model training. Figure 8 shows a normal convolution and a dilated convolution with dilation rate 2.
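The trade-off can be checked arithmetically: a k × k kernel with dilation rate d covers an effective field of k + (k − 1)(d − 1) while keeping k² weights per channel (a small sketch; the helper names are ours):

```python
def effective_kernel(k, d):
    """Field of view of a k x k kernel with dilation rate d."""
    return k + (k - 1) * (d - 1)

def depthwise_weights(k):
    """Weights per channel of a k x k depth-wise kernel."""
    return k * k
```

A 3 × 3 kernel with dilation 2 thus covers a 5 × 5 field with only 9 weights per channel, versus 25 for a dense 5 × 5 kernel, which is why the replacement can raise accuracy without raising computation.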

The holes introduced by a dilated depth-wise convolution kernel can lead to incoherent information in the extracted features, and once the feature map is reduced beyond a certain size, identification accuracy may decrease when dilated depth-wise convolutions are used. To analyze the influence of dilated depth-wise convolutions on the model, this paper designed three models with different numbers of dilated depth-wise convolutions: HLNet (A), HLNet (B), and HLNet (C). HLNet (A) is the original model with no replacement, HLNet (B) replaces half of the depth-wise convolutions with dilated depth-wise convolutions, and HLNet (C) replaces all of them. (HLNet: high-speed and lightweight network.)

Attention Module Improvements
In actual production, many leaf diseases are very similar, and dilated depth-wise convolutions alone cannot improve the model's accuracy in identifying them. Thus, to improve the accuracy of similar-disease identification, this study added a channel attention module and a spatial attention module to the model. An attention module obtains the weight of feature information through the neural network, so that the model allocates computing resources reasonably and learns the key features; the channel attention of SENet [32] and the spatial attention of CBAM [33] are excellent examples.

Channel Attention and Spatial Attention
The channel attention in the SENet model reduces the feature map F_i of the i-th channel to a size of 1 × 1 through global pooling φ and then passes it through two fully connected layers σ. ReLU activation is performed after the first fully connected layer, and the result is finally fed into the sigmoid activation function, which compresses each group of parameters to between 0 and 1 to obtain the channel attention weight CW_i. The calculation of the channel attention weight is displayed in Equation (1).
After obtaining the channel attention weight, CW_i multiplies the original feature map F_i to obtain the feature map CF_i with channel attention. The calculation of the feature map with channel attention is displayed in Equation (2).
The spatial attention in the CBAM model compresses the same set of feature maps F_i into single maps using maximum pooling and average pooling, respectively, and fuses these two maps by a concatenate operation. A convolutional kernel of size 7 × 7 then convolves the pair so that the two maps are fused into one, and the result is finally fed into the sigmoid function to obtain the spatial attention weight SW_i. The calculation of the spatial attention weight is displayed in Equation (3).
After obtaining the spatial attention weight, SW_i multiplies the original feature map F_i to obtain the feature map with spatial attention. The calculation of the feature map with spatial attention is displayed in Equation (4).
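In the standard SENet/CBAM formulation, the four quantities above can be written as follows (our reconstruction from the prose, with σ₁, σ₂ the two fully connected layers, φ global pooling, and f^{7×7} the 7 × 7 convolution):

```latex
CW_i = \mathrm{sigmoid}\big(\sigma_2\,\mathrm{ReLU}(\sigma_1\,\varphi(F_i))\big) \tag{1}
CF_i = CW_i \otimes F_i \tag{2}
SW_i = \mathrm{sigmoid}\Big(f^{7\times 7}\big[\mathrm{MaxPool}(F_i);\ \mathrm{AvgPool}(F_i)\big]\Big) \tag{3}
SF_i = SW_i \otimes F_i \tag{4}
```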

Model Based on Improved Attention
A large number of attention modules sharply increases the amount of calculation and slows computation. To balance identification accuracy and calculation speed, we improved both kinds of attention.
For channel attention (CA), the fully connected layers consume a significant amount of computing power, whereas convolutional layers have fewer computational parameters and can replace them. Therefore, the two fully connected layers in the CA module were replaced with two grouped pointwise convolutional layers (with 5 and 3 groups) and pointwise convolutional layers. After the replacement, the CA module has a deeper structure and a lower amount of calculation, and the channel weights can be extracted more accurately and quickly.
For spatial attention (SA), since the feature map size is reduced to 7 × 7 at the end of the model, the leaf disease feature information is basically concentrated within one pixel, and using a large convolution kernel causes computational redundancy. Therefore, the 7 × 7 convolution in SA was replaced with two 3 × 3 convolutions. Through this improvement, the SA module structure was deepened, the accuracy was improved, and the number of calculation parameters was reduced to about 50% of the original.
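The roughly 50% parameter saving can be verified by counting weights: a convolution with kernel size k, C_in input channels, and C_out output channels has k²·C_in·C_out weights (bias ignored). SA convolves the 2-channel [max; avg] stack down to 1 channel; the intermediate width of 2 for the first 3 × 3 layer is our assumption:

```python
def conv_params(k, c_in, c_out):
    """Weight count of a k x k convolution, bias ignored."""
    return k * k * c_in * c_out

# Original SA: one 7x7 conv fusing the 2-channel [max; avg] stack to 1 map.
original = conv_params(7, 2, 1)                       # 98 weights
# Improved SA: two stacked 3x3 convs (intermediate width of 2 assumed).
improved = conv_params(3, 2, 2) + conv_params(3, 2, 1)  # 36 + 18 = 54
```

Under these assumptions the improved path uses 54 of the original 98 weights, i.e. roughly half, while the stacked pair of 3 × 3 convolutions still covers a 5 × 5 receptive field.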
In addition, we added two attention modules based on the HLNet (B) model with the highest identification accuracy in Section 2.4.2 (the comparison results of the identification accuracy experiment were included in Section 3.1). Since multiple attention modules can increase the number of calculations, this study only added attention modules to the last two stages of the model. The block arrangement method of adding attention module is shown in Figure 9.
For evaluating the effectiveness of the improved attention module, this paper constructed HLNet (BA) model using the unimproved attention module, and the HLNet (BB) model using the improved attention module. The model was obtained by stacking the above block 17 times, and the parameters of each layer of the model are shown in Table 1.

Experimental Environment Parameter Settings
In this experiment, the computer operating system was Windows 10, equipped with an NVIDIA Titan X GPU and an Intel Xeon E5-2696 v3 CPU, and the Python 3.6 programming language was adopted. The deep learning framework used was PyTorch, and the tools used for experimental validation are integrated in PyTorch. We used the Adam training optimizer [34] and the cross-entropy function to calculate the loss values. Table 2 shows the hyperparameters used in model training: the training was divided into 40 epochs; the learning rate of the optimizer was 1 × 10⁻³ and was reduced by 1/10 every 10 epochs; the weight decay was 0.001; the batch size was 10; and the input image size was 224 × 224. The pointwise convolutions used 3 groups, and there were 28 categories in total.

Experimental Results and Analysis
This section analyzed the effectiveness of the improved model with the dilated depth-wise convolutions and the attention module. In Section 3.1, ShuffleNetV1 was compared with the initially improved model, and the impact of adding dilated depth-wise convolutions on the model was analyzed. In Section 3.2, the networks incorporating the improved attention module were compared, and the effectiveness of the attention module was demonstrated by analyzing the confusion matrix. In Section 3.3, the model with the highest identification accuracy was compared with backbone CNN models. In Section 3.4, we compared our study with the latest lightweight crop leaf disease identification studies.

Improved ShuffleNetV1 Experimental Results and Analysis
Through training ShuffleNetV1, HLNet (A), HLNet (B), and HLNet (C), the evaluation indicators of each model shown in Table 3 were obtained. The evaluation indicators include:
1. Best Acc (%): the best top-1 identification accuracy obtained during model testing. This represents the accuracy of model identification and is the most important evaluation metric; the formula for calculating the accuracy rate is displayed in Equation (5).
The testing results in Table 3 indicate that after adding multi-scale feature extraction and deleting redundant pointwise convolutions, the FLOPs (200.59 M vs. 579.5 M), model size (6375 K vs. 6891 K), and computation time (0.166 s vs. 0.236 s) of HLNet (A) were all lower than those of ShuffleNetV1. This proves that most pointwise convolutions were redundant, and that removing pointwise convolutions and then deepening the network is an effective way to maintain identification accuracy while reducing model computation and calculation time.
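Top-1 accuracy, as Equation (5) conventionally defines it, is the share of samples whose highest-scoring prediction matches the label (a minimal sketch):

```python
def top1_accuracy(predictions, labels):
    """Percentage of samples whose predicted class equals the true class."""
    assert len(predictions) == len(labels) and labels
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)
```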
Due to the addition of dilated depth-wise convolutions, HLNet (B), HLNet (C) maintained a small computational time and model size, the identification accuracy was greatly improved compared with HLNet (A). It proved that the dilated depth-wise convolution can improve the identification accuracy without affecting the calculation time, which was an effective method to improve the efficiency of the model. Figure 10 shows the HLNet (A), HLNet (B), HLNet (C), ShuffleNetV1 accuracy of the model in the test dataset at each epoch during the training process. As shown in the figure, HLNet (B) and HLNet (C) quickly achieved the highest accuracy due to the addition of dilated depth-wise convolutions. However, the model HLNet (C) with more dilated depth-wise convolutions kept oscillating in the follow-up training, and the best accuracy was not as high as HLNet (B). This proved that the holes formed by the expansion would lead to the learning bias in the extraction of small-sized feature maps, which reduced the accuracy to a certain extent. To further observe and evaluate the impact of dilated depth-wise convolutions on the model, we used the CAM heat map to observe the regions learned by the model [35]. The color close to red in the heat map was the area where the model extracted more features, and the color close to blue in the heat map was the area where the model extracted less features. Figure 11 shows the CAM heat map of HLNet (A), HLNet (B) and HLNet (C) models for three leaf diseases. As shown in the figure, HLNet (B) (Figure 11b) with a proper number of dilated depth-wise convolutions had more effective feature information than HLNet (A) (Figure 11a) with no dilated convolution. In Figure 11c, HLNet (C), which replaced all depth-wise convolution with dilated depth-wise convolutions, had a better information extraction effect on potato bacterial spot, but it was worse for the other two leaf diseases. 
In summary, by improving ShuffleNetV1 and adding an appropriate number of dilated depth-wise convolutions to build HLNet (B), the best accuracy was improved from 98.98% to 99.56%, while the model's floating-point operations were reduced from 579.5 M by 65%.
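The efficiency argument above can be made concrete: a 3 × 3 kernel with dilation 2 covers a 5 × 5 receptive field while still performing only nine multiplications per output pixel. A minimal numpy sketch of one channel of a dilated depth-wise convolution (an illustration under "valid" padding, not the authors' implementation):

```python
import numpy as np

def dilated_depthwise_conv(x, kernel, dilation=1):
    """One channel of a 2-D depth-wise convolution with dilation.
    x: (H, W) feature map; kernel: (k, k); 'valid' padding for brevity."""
    k = kernel.shape[0]
    span = dilation * (k - 1) + 1            # effective receptive field
    H, W = x.shape
    out = np.zeros((H - span + 1, W - span + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # strided slice picks the dilated taps: still only k*k products
            patch = x[i:i + span:dilation, j:j + span:dilation]
            out[i, j] = np.sum(patch * kernel)
    return out

x = np.arange(49, dtype=float).reshape(7, 7)
k = np.ones((3, 3))
y1 = dilated_depthwise_conv(x, k, dilation=1)   # 3x3 field -> (5, 5) output
y2 = dilated_depthwise_conv(x, k, dilation=2)   # 5x5 field -> (3, 3) output
```

Both calls do nine multiplications per output pixel; only the spacing of the taps differs, which is why dilation enlarges the receptive field at essentially unchanged cost.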

Attention Module Experimental Results and Analysis
To further improve the ability of the model to identify similar leaf diseases, we improved the attention module and added it to the HLNet (B) model from Section 3.1. To analyze the effectiveness of the attention module, we defined, based on HLNet (B), an HLNet (BA) model with the unimproved attention module and an HLNet (BB) model with the improved attention module, and performed the following comparison experiments. After training the HLNet (B), HLNet (BA), and HLNet (BB) models, their evaluation metrics are shown in Table 4. Figure 12 shows the training curves of the three models. As shown in the figure, although HLNet (BA) fitted faster than HLNet (B), its identification accuracy was not improved. The optimized HLNet (BB) obtained a smoother training curve and achieved a faster fit.
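The exact module is not reproduced here, but the general mechanism of a lightweight channel-attention block, i.e., the squeeze-and-excitation pattern on which such modules are commonly built, can be sketched as follows. All shapes and the reduction ratio are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def channel_attention(feat, w1, w2):
    """SE-style channel attention: squeeze (global average pool),
    excite (two small FC layers), then rescale each channel.
    feat: (C, H, W); w1: (C//r, C); w2: (C, C//r)."""
    s = feat.mean(axis=(1, 2))                 # squeeze: (C,)
    h = np.maximum(w1 @ s, 0.0)                # channel reduction + ReLU
    a = 1.0 / (1.0 + np.exp(-(w2 @ h)))        # sigmoid gate in (0, 1): (C,)
    return feat * a[:, None, None]             # channel-wise rescaling

rng = np.random.default_rng(0)
C, r = 8, 4                                    # toy channel count and ratio
feat = rng.standard_normal((C, 6, 6))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
out = channel_attention(feat, w1, w2)
```

Because the gating branch works on pooled per-channel statistics, its FLOPs are negligible next to the convolutions, which is what makes shrinking the attention module's computation (as this paper does) feasible without touching the backbone.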

Similarly, we used CAM heat maps to observe the regions from which each model extracted features. As shown in Figure 13, the HLNet (BA) (Figure 13b) and HLNet (BB) (Figure 13c) models with the attention module clearly learned more leaf features in maize and grape disease identification. For potatoes and grapes, however, HLNet (BA), with the unimproved attention module, also learned many wrong features.
By improving the attention module, the region of interest of HLNet (BB) was focused more precisely on the leaf. Therefore, the improved attention module enabled the model to learn the correct feature information quickly and accurately, which increased its accuracy in classifying similar diseases.

The purpose of adding the attention module was to improve the accuracy of the model in classifying similar crop leaf diseases, so the accuracy of single-class disease identification was very important. Therefore, this paper adopted the confusion matrix to intuitively show and analyze the number of misidentifications. To observe the effect of each improvement step, we chose HLNet (A), HLNet (B), HLNet (BA), and HLNet (BB) to generate the confusion matrices shown in Figure 14, produced by selecting 500 images from each category for identification. In each confusion matrix, each column represents the predicted category, each row represents the true category of the data, and the shade of each square represents the number of samples assigned to that combination; the closer a square's color is to red, the more often that misidentification occurred. As shown in Figure 14a, HLNet (A), without the dilated depth-wise convolutions and the attention module, had the highest number of misidentifications. Errors were more frequent among different diseases of the same crop, such as apple, corn, and rice, and HLNet (A) also made large errors on similar diseases of different crops, such as apples and potatoes. In Figure 14b, we can see that HLNet (B) improved the accuracy of identifying similar diseases across different crops but was less effective at distinguishing diseases within the same crop.
In Figure 14c, we can see that the misidentifications of HLNet (BA) were not obviously reduced by adding the unimproved attention module, which demonstrated that its ability to identify similar diseases did not improve. In Figure 14d, HLNet (BB), with the optimized attention module, achieved a significant improvement in identifying similar diseases. Therefore, it can be concluded that: (1) the addition of large-sized convolution kernels, such as dilated depth-wise convolutions, only marginally improves the accuracy of single-class identification; (2) deepening the attention module helps the model effectively distinguish similar diseases and improves identification accuracy.
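The confusion matrices in Figure 14 follow the standard construction; a small sketch with hypothetical labels (not the paper's data) shows the row/column convention used above:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows = true category, columns = predicted category, matching the
    layout described for Figure 14; the cell counts drive the colors."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# toy example with 3 classes (the paper uses 28 classes x 500 images)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, 3)
per_class_acc = cm.diagonal() / cm.sum(axis=1)   # diagonal / row totals
```

The per-class accuracy derived from the diagonal is exactly the single-class identification accuracy that the attention module is meant to raise.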

Comparison of Various Backbone Models
From the analysis above, HLNet (BB) is the model with the highest identification accuracy. To demonstrate that the HLNet (BB) model can effectively balance computing power and identification accuracy, we compared it with backbone CNN models: VGG16 [36], ResNet101 [37], DenseNet161 [38], InceptionV3 [39], AlexNet [40], MobileNetV1, MobileNetV2 [41], MobileNetV3 [42], ShuffleNetV1, ShuffleNetV2 [43], SqueezeNet, and Xception. Since training in this paper lasted 40 epochs, some models with too many parameters could not be trained to their best identification performance under this setting. To demonstrate the advantage of the HLNet (BB) model in identification accuracy fairly, this paper used transfer learning to initialize the parameters of the models with more than 1 G floating-point operations (VGG16, ResNet101, DenseNet161, InceptionV3, Xception), which ensured that these models reached their highest identification accuracy within 40 epochs of training. The ImageNet [44] dataset was used for transfer learning.
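Transfer-learning initialization amounts to copying every pretrained parameter whose name and shape match into the new model before fine-tuning; the mismatched layers (typically the classifier head, resized from ImageNet's 1000 classes to this study's 28) keep their random initialization. A framework-agnostic sketch, with parameter dicts standing in for real checkpoints:

```python
import numpy as np

def transfer_weights(target, source):
    """Copy each pretrained tensor whose name and shape match the
    target model; leave the rest at their random initialization.
    Returns the names of the copied tensors."""
    copied = []
    for name, value in source.items():
        if name in target and np.shape(target[name]) == np.shape(value):
            target[name] = value
            copied.append(name)
    return copied

# ImageNet-style source (1000-class head) vs. a 28-class target head
source = {"conv1": np.ones((3, 3)), "fc": np.zeros((1000, 64))}
target = {"conv1": np.zeros((3, 3)), "fc": np.zeros((28, 64))}
copied = transfer_weights(target, source)
```

Here only `conv1` is transferred; the 28-class `fc` head is trained from scratch, which mirrors how the large backbones were initialized in this comparison.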
For training HLNet (BB) and the 11 backbone models above, the same hyperparameters were used; the optimizer and loss function are described in Section 2.5. The evaluation metrics of each model are shown in Table 5, which clearly shows that the HLNet (BB) model had the highest accuracy. Although VGG16, ResNet101, DenseNet161, InceptionV3, and Xception used transfer learning, their accuracy was still lower than that of HLNet (BB), which did not. HLNet (BB) had fewer floating-point operations than all models except ShuffleNetV2 (238.44 M vs. 198.73 M); however, the computation time of HLNet (BB) was shorter than ShuffleNetV2's, and in practical applications computation time matters more than floating-point operations. The model size of HLNet (BB) was smaller than that of all models except MobileNetV2, ShuffleNetV1, and SqueezeNet (8235 K vs. 7353 K, 6891 K, and 4850 K), while the floating-point operations of those three models were much higher than HLNet (BB)'s, and the computation time of HLNet (BB) was shorter than that of all models except SqueezeNet. However, SqueezeNet had a low accuracy of less than 50%. Thus, considering accuracy, FLOPs, model size, and calculation time together, the HLNet (BB) model outperformed the other models. For a direct comparison, we plotted the training curves of the HLNet (BB) model against the models that used transfer learning (VGG16, ResNet101, DenseNet161, InceptionV3, Xception), as depicted in Figure 15, and against the lightweight models that did not use transfer learning (AlexNet, MobileNetV1, MobileNetV2, MobileNetV3, ShuffleNetV1, ShuffleNetV2, SqueezeNet), as depicted in Figure 16. Note that the HLNet (BB) model without transfer learning was used in both comparisons. Figure 15 shows that all large models reached a fit by 20 epochs and that the training curve of HLNet (BB) was as smooth as those of most models using transfer learning.
Even without transfer learning, the lightweight design and the added attention modules enabled HLNet (BB) to reach a fit at a very fast rate. Figure 16 shows the training curves of the HLNet (BB) model and the lightweight models. In Figure 16, HLNet (BB) achieved the highest accuracy, and its training curve was more stable than those of the other models after the first epoch. The comparison results demonstrate that the weak pooling effect of large convolutional kernels accelerates learning and that the improved attention module allows the model to learn effective features quickly. In summary, HLNet (BB) combined high classification accuracy (99.86%), a lightweight computation amount (238.44 M FLOPs), small memory requirements (8235 K), and a short calculation time (0.173 s). It successfully reduced the computational cost of the attention module while enhancing the identification of similar diseases. HLNet (BB) achieved better identification results and a lighter model than the large models, and faster fitting and identification than the lighter models, indicating that our improvements to the model were successful and that it surpasses most common backbone CNN models.
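Per-image computation time, such as the 0.173 s reported here, is typically measured by excluding warm-up runs and averaging over many repeats; a minimal timing harness of that kind is sketched below (the dummy `model_fn` is a placeholder, not the paper's network):

```python
import time

def mean_inference_time(model_fn, inputs, warmup=2, repeats=5):
    """Average per-input forward time; warm-up calls are excluded so
    one-off costs (allocation, caching) do not inflate the figure."""
    for x in inputs[:warmup]:
        model_fn(x)                          # warm-up, not timed
    t0 = time.perf_counter()
    for _ in range(repeats):
        for x in inputs:
            model_fn(x)
    elapsed = time.perf_counter() - t0
    return elapsed / (repeats * len(inputs))

# placeholder "model": a cheap function standing in for a forward pass
t = mean_inference_time(lambda x: sum(x), [[1, 2, 3]] * 4)
```

Measured this way, wall-clock time captures memory traffic and layer overheads that raw FLOP counts miss, which is why the text argues computation time matters more than FLOPs in practice.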

Comparison of Latest Lightweight Models
To further demonstrate the advanced performance of HLNet, this study also compared HLNet with various latest lightweight leaf disease identification models. The results of each model are shown in Table 6, and the results of these models are taken from the original papers.
The HLNet (BB) model, with its novel lightweight structure, dilated convolutions, and improved attention mechanism, achieved the highest accuracy of 99.86% among studies that identified a variety of crops. Although the study by Wagle, S. A. et al. [21] identified nine crops, more than our study, their model size was much larger than ours and their accuracy lower. Other recent lightweight models in Table 6 reported as few as 6.67 M floating-point operations with accuracies of 99.40% and 97.01%; compared with these models, HLNet (BB) covered more crop types and achieved higher accuracy, and although its floating-point operations were slightly higher, it remains suitable for practical application. Compared with the model proposed by Bhujel, A. et al. [24], our model was lighter and more accurate.
In summary, our model achieved higher accuracy than the crop leaf disease identification models of the last two years. Compared with models that identified only one crop, ours is more universal; compared with models that identified a variety of crops, ours is lighter. In addition, our study used abundant and objective lightweight evaluation indicators, which further showed that our model not only requires fewer parameters and less memory but also has faster processing speed.

Conclusions
Fast and accurate identification of crop leaf diseases is the key advantage of automatic disease identification. Compared with traditional CNN models, lightweight CNN models offer faster identification and lower memory requirements, making them more suitable for practical applications. In this paper, we developed HLNet (BB), a deep learning model based on a lightweight convolutional neural network, specifically for fast and efficient identification of crop leaf disease images. First, we improved the ShuffleNetV1 structure by deleting some of the point-wise convolutions and rearranging the positions of the layers in the block; this maintained the computational accuracy of the model while reducing the number of floating-point operations by 65% and the computation time by 30%. Second, we replaced the depth-wise convolutions in the model with dilated depth-wise convolutions, which improved the identification accuracy without affecting the computational speed or memory requirements (99.56% vs. 98.98%). Finally, we improved the existing attention module to reduce its floating-point operations and enhance its ability to correctly extract regions of interest. The addition of the improved attention module changed the model's computation speed by only about 4% while significantly increasing its ability to identify similar diseases.
We constructed the dataset from several publicly available datasets and a self-built dataset, totaling 20,490 images covering 28 types of leaf and leaf disease images of 6 crops. On this dataset, we performed a series of validation experiments, and the results showed that the HLNet (BB) model achieved high accuracy, low computational effort, and short computation time compared with many backbone CNN models. Meanwhile, compared with the latest crop disease identification studies, our study evaluated the lightness of the model from a wider range of perspectives. This study provides a theoretical basis and technical support for the practical application of automatic identification of crop diseases.

Data Availability Statement: Publicly available datasets were analyzed in this study. This data can be found here: https://archive.ics.uci.edu/ml/datasets/Rice+Leaf+Diseases.

Conflicts of Interest:
The authors declare no conflict of interest.