A Lightweight Attention-Based Convolutional Neural Network for Tomato Leaf Disease Classification

Abstract: Plant diseases pose a significant challenge to food production and safety. Therefore, it is indispensable to correctly identify plant diseases for timely intervention to protect crops from massive losses. The application of computer vision technology in phytopathology has increased exponentially due to its automatic and accurate disease detection capability. However, a deep convolutional neural network (CNN) requires high computational resources, limiting its portability. In this study, a lightweight convolutional neural network was designed by incorporating different attention modules to improve the performance of the models. The models were trained, validated, and tested using tomato leaf disease datasets split into an 8:1:1 ratio. The efficacy of the various attention modules in plant disease classification was compared in terms of the performance and computational complexity of the models. The performance of the models was evaluated using the standard classification accuracy metrics (precision, recall, and F1 score). The results showed that the CNN with attention mechanisms improved interclass precision and recall, thus increasing the overall accuracy (>1.1%). Moreover, the lightweight model significantly reduced the network parameters (~16 times) and complexity (~23 times) compared to the standard ResNet50 model. Amongst the proposed lightweight models, the models with attention mechanisms only nominally increased the network complexity and parameters compared to the model without attention modules, while producing better detection accuracy. Although all the attention modules enhanced the performance of the CNN, the convolutional block attention module (CBAM) was the best (average accuracy 99.69%), followed by the self-attention (SA) mechanism (average accuracy 99.34%).


Introduction
Tomato is a ubiquitous crop with high nutritional value worldwide. More than 180 million tons of tomatoes were produced worldwide in 2018, and Asia is the biggest market and producer of tomatoes [1]. However, the crop is affected by many diseases and pests, and the precise identification of those diseases is a challenging task for agronomists [2]. Traditionally, farmers have relied on experience and visual inspection to identify plant diseases, but this approach suffers from serious cost, efficiency, and reliability issues [3]. Sometimes, even an experienced farmer or agronomist may fail to correctly identify a plant disease because of the large variety of species and similar disease symptoms. Furthermore, the rise in global temperature due to climate change has increased the chances of diseases occurring and spreading quickly [4]. Therefore, automatic detection of plant diseases is of utmost necessity for timely intervention in order to prevent massive losses. The convolutional neural network (CNN) is a powerful deep learning algorithm for image detection and classification that automatically extracts and analyzes image features, and its application is therefore soaring in most domains. This motivated the present work to study the performance of lightweight CNNs with improved image recognition algorithms (attention modules) for the detection of a few classes of plant diseases.
In this study, a lightweight CNN with 20 layers and reduced trainable parameters was designed using the ResNet topology [24]. Then, the commonly used attention modules, namely the convolutional block attention module (CBAM) [15], self-attention module [16], squeeze-and-excitation module [25], and the dual attention module [26], were integrated into the base network to observe the impact of different attention mechanisms on a conventional CNN. Moreover, the performance of the models with and without attention mechanisms was assessed by employing well-known classification metrics (accuracy, precision, recall, and F1 score). All the models were trained, validated, and tested using tomato disease datasets split at a ratio of 8:1:1 for training, validation, and testing. Furthermore, the optimal number of attention modules and their locations in the base network were comprehensively assessed through an ablation study. Finally, the computational complexity of the models, the training and testing time per image, network parameters, and sizes were calculated and compared parametrically. Therefore, the main objectives of this study were to design a lightweight and computationally efficient network for classification of a few classes of plant diseases, improve the performance of a conventional CNN by augmenting it with attention mechanisms, and identify an effective and efficient attention module for plant disease detection.

Data Collection and Preprocessing
Ten classes of tomato leaf images (nine diseased and one healthy) were collected from the public PlantVillage dataset [27]. The Fusarium wilt diseased images were captured from the greenhouse located at Gyeongsang National University, South Korea. In this way, a total of 19,510 images from 10 distinct disease classes and one healthy class were used to train, validate, and test the models. Most of the field-captured images were taken nondestructively, but a few leaves were detached from the plant and photographed on a white background. A sample image from each class is presented in Figure 1. Similarly, Table 1 lists various information about the dataset, such as the class assignment, the common and scientific names of the tomato diseases [13], the number of images per class, and the source of data collection. As image data preparation is crucial for a deep learning model, different image preprocessing steps were carried out before the images were fed to the model. The main preprocessing steps were labeling, resizing, rescaling, and augmentation of the raw images. Then, the images were split into training, validation, and testing sets at a ratio of 8:1:1 [12]. The larger the number of input images, the better the learning of a deep model; thus, image augmentation was applied to the training set, as sketched below.
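For concreteness, a minimal sketch of such a preprocessing pipeline in TensorFlow/Keras is given below. The directory layout, image size, batch size, and augmentation parameters are illustrative assumptions, not the exact settings used in this study.

```python
# Minimal sketch: resize, rescale, 8:1:1 split, and training-set-only augmentation.
import tensorflow as tf

IMG_SIZE = (224, 224)   # assumed input resolution
BATCH = 32              # assumed batch size

# Assumed layout: data/<class_name>/*.jpg, one folder per class.
full_ds = tf.keras.utils.image_dataset_from_directory(
    "data", image_size=IMG_SIZE, batch_size=BATCH, shuffle=True, seed=42)

n_batches = tf.data.experimental.cardinality(full_ds).numpy()
train_ds = full_ds.take(int(0.8 * n_batches))
rest_ds = full_ds.skip(int(0.8 * n_batches))
val_ds = rest_ds.take(int(0.1 * n_batches))
test_ds = rest_ds.skip(int(0.1 * n_batches))

# Rescaling is applied to every split; augmentation only to the training split.
rescale = tf.keras.layers.Rescaling(1.0 / 255)
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])

train_ds = train_ds.map(lambda x, y: (augment(rescale(x), training=True), y))
val_ds = val_ds.map(lambda x, y: (rescale(x), y))
test_ds = test_ds.map(lambda x, y: (rescale(x), y))
```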

Lightweight Attention-Based Network Design
A lightweight attention-based CNN model was designed using the ResNet topology. It consisted of 20 layers, and the attention modules were embedded between residual blocks 3 and 4 (after the 16th layer) [16]. Figure 2 shows the block diagram of the proposed model, and the detailed parameters of the base model are given in Table 2. The number of kernel filters in each layer of the base network was four times lower than in the standard ResNet architecture, lowering the total number of network parameters to make the model lighter and more portable. In addition, the number of convolutional layers was limited to 20 to decrease the network complexity [28]. The conv1 layer had 16 kernel filters of large patch size (7 × 7), followed by a batch normalization layer, a rectified linear unit (ReLU) activation layer, and a maximum pooling layer (max. pooling), which reduced the feature maps to half of the input image size. Residual blocks 1 and 4 comprise a convolutional block containing three convolutional layers (conv.), each accompanied by a batch normalization (BN) layer and an activation (ReLU) layer, whereas residual blocks 2 and 3 have a convolutional block followed by an identity block. The structure of the identity block was similar to the convolutional block except for the shortcut path. Finally, a global average pooling (global avg. pooling) layer converted the 2D feature maps to 1D before a dense output layer. The various attention modules were inserted into the base network at the same location, as shown in Figure 2. The necessary zero-padding and maximum pooling layers were added to adjust the spatial dimensions of the input and output feature maps.
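A minimal Keras sketch of this topology is given below. The exact filter counts, strides, and block composition of Table 2 are only approximated, so the snippet should be read as an illustration of the structure rather than the authors' exact implementation.

```python
# Sketch of the 20-layer lightweight ResNet-style base network with an optional
# attention hook between residual blocks 3 and 4.
from tensorflow.keras import layers, models

def conv_bn_relu(x, filters, kernel, strides=1):
    x = layers.Conv2D(filters, kernel, strides=strides, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def convolutional_block(x, filters, strides=2):
    # Three conv layers with a projection shortcut (strided 1x1 conv).
    shortcut = layers.Conv2D(4 * filters, 1, strides=strides, padding="same")(x)
    shortcut = layers.BatchNormalization()(shortcut)
    y = conv_bn_relu(x, filters, 1, strides)
    y = conv_bn_relu(y, filters, 3)
    y = layers.Conv2D(4 * filters, 1, padding="same")(y)
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(layers.Add()([y, shortcut]))

def identity_block(x, filters):
    # Same as the convolutional block but with an identity shortcut.
    y = conv_bn_relu(x, filters, 1)
    y = conv_bn_relu(y, filters, 3)
    y = layers.Conv2D(4 * filters, 1, padding="same")(y)
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(layers.Add()([y, x]))

def lw_resnet20(input_shape=(224, 224, 3), n_classes=11, attention=None):
    inputs = layers.Input(input_shape)
    x = conv_bn_relu(inputs, 16, 7, strides=2)                    # conv1: 16 filters, 7x7
    x = layers.MaxPooling2D(3, strides=2, padding="same")(x)
    x = convolutional_block(x, 16, strides=1)                     # residual block 1
    x = convolutional_block(x, 32); x = identity_block(x, 32)     # residual block 2
    x = convolutional_block(x, 64); x = identity_block(x, 64)     # residual block 3
    if attention is not None:                                     # attention after block 3
        x = attention(x)
    x = convolutional_block(x, 128)                               # residual block 4
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```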

Convolutional Block Attention Module (CBAM)
CBAM applies two attention modules in series, with the channel attention module followed by the spatial attention module [15], as shown in Figure 3. The channel attention module generates two feature descriptors from the intermediate feature maps using average and maximum pooling layers. Both descriptors are then fed to a shared multilayer perceptron (MLP), and the resulting outputs are added before being normalized with the sigmoid function. The features obtained by multiplying the channel attention output with the convolutional feature maps are then passed to the spatial attention module to determine the position of the important features in the image. The final feature maps from the channel and spatial attention modules are given in Equations (1) and (2).
CA(x) = σ(MLP(AvgPool(x)) + MLP(MaxPool(x)))  (1)
SA(x) = σ(f^{7×7}([F^s_avg; F^s_max]))  (2)
where CA(x) represents the channel attention feature maps, SA(x) is the spatial attention feature maps, σ represents the sigmoid function, f^{7×7} represents the 7 × 7 convolutional operation, MLP is the multilayer perceptron, AvgPool(x) is the average pooling of input x, MaxPool(x) is the maximum pooling of input x, F^s_max is the feature map obtained from the maximum pooling operation, and F^s_avg is the feature map obtained from the average pooling operation.
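The sketch below illustrates how the channel and spatial branches of Equations (1) and (2) can be composed in Keras. The reduction ratio of the shared MLP is an assumed hyperparameter, not taken from the paper.

```python
# Illustrative CBAM block: channel attention (shared MLP over average- and
# max-pooled descriptors) followed in series by spatial attention (7x7 conv).
import tensorflow as tf
from tensorflow.keras import layers

def cbam_block(x, reduction=8):
    channels = x.shape[-1]

    # --- Channel attention, Eq. (1) ---
    shared_mlp = tf.keras.Sequential([
        layers.Dense(channels // reduction, activation="relu"),
        layers.Dense(channels),
    ])
    avg_desc = layers.GlobalAveragePooling2D()(x)            # (B, C)
    max_desc = layers.GlobalMaxPooling2D()(x)                # (B, C)
    ca = tf.sigmoid(shared_mlp(avg_desc) + shared_mlp(max_desc))
    ca = layers.Reshape((1, 1, channels))(ca)
    x = x * ca                                               # channel-refined features

    # --- Spatial attention, Eq. (2) ---
    avg_map = tf.reduce_mean(x, axis=-1, keepdims=True)      # F^s_avg
    max_map = tf.reduce_max(x, axis=-1, keepdims=True)       # F^s_max
    sa = layers.Conv2D(1, 7, padding="same", activation="sigmoid")(
        tf.concat([avg_map, max_map], axis=-1))
    return x * sa                                            # spatially refined features
```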

Squeeze-and-Excitation (SE) Attention Module
The dimension of the input feature map was squeezed to 1 × 1 × C by a global pooling operation, and two fully connected (FC) layers followed by rectified linear unit (ReLU) and sigmoid activation layers were attached to build the excitation block [25], as shown in Figure 4. The squeeze-and-excitation (SE) feature maps were then multiplied element-wise with the input feature maps before being forwarded to the next layer, thereby incorporating the SE features into the main network's feature maps. The computational operation of the SE module is expressed mathematically in Equation (3).
where SE(x) represents the squeeze-and-excitation feature maps, F_ex is the squeezed (globally pooled) features, x is the input feature maps, W is the weight of the SE network, σ is the sigmoid operation, δ is the ReLU operation, and W_1 and W_2 are the weights of the first and second dense layers, respectively.

Self-Attention (SA) Module
Figure 5 represents the embedding of the self-attention module into the network and its architecture [16]. It consisted of three parallel convolutional and ReLU activation layers to extract the discriminating features from the input images. The outputs of two of the convolutional layers were multiplied and fed to a softmax layer to generate an attention map. Then, the attention maps were multiplied by the transpose of the feature maps generated from the third convolutional branch to obtain the self-attention feature maps. Finally, the scaled attention maps were added to the input feature maps to generate the output feature maps, as shown in Equation (4).
where out(x) is the output feature maps after the self-attention (SA) module, SA(x) is the feature maps after the self-attention module, In(x) is the input feature maps, Soft is the softmax operation, µ is a scaling factor, P(x), Q(x), and R(x) are the feature maps generated from the three parallel convolutional paths of the SA module, S(x) is the feature maps after the softmax operation, and T(x) is the transposed feature maps of P(x)·Q(x).
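The following is a hedged sketch of such a self-attention block: three parallel 1 × 1 convolutions produce the maps P, Q, and R, a softmax over their product yields the attention map S, which reweights R, and the result is scaled by a learnable factor and added back to the input. It follows the common formulation of [16] and may differ in detail from the authors' implementation; the reduction ratio is an assumption.

```python
# Sketch of a self-attention (SA) layer: out(x) = In(x) + mu * SA(x).
import tensorflow as tf
from tensorflow.keras import layers

class SelfAttention(layers.Layer):
    def __init__(self, reduction=8, **kwargs):
        super().__init__(**kwargs)
        self.reduction = reduction

    def build(self, shape):
        c = shape[-1]
        self.p = layers.Conv2D(c // self.reduction, 1, activation="relu")    # P(x)
        self.q = layers.Conv2D(c // self.reduction, 1, activation="relu")    # Q(x)
        self.r = layers.Conv2D(c, 1, activation="relu")                      # R(x)
        self.mu = self.add_weight(name="mu", shape=(), initializer="zeros")  # scaling factor

    def call(self, x):
        b = tf.shape(x)[0]
        h, w, c = x.shape[1], x.shape[2], x.shape[3]
        p = tf.reshape(self.p(x), (b, h * w, -1))
        q = tf.reshape(self.q(x), (b, h * w, -1))
        r = tf.reshape(self.r(x), (b, h * w, c))
        s = tf.nn.softmax(tf.matmul(p, q, transpose_b=True), axis=-1)  # S(x), shape (HW, HW)
        sa = tf.reshape(tf.matmul(s, r), (b, h, w, c))                 # SA(x)
        return x + self.mu * sa                                        # out(x)
```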

Dual Attention (DA) Module
The authors of [26] proposed a dual attention mechanism with two attention networks, namely the position attention (PA) and channel attention (CA) networks, for scene segmentation. The position attention network is similar to the self-attention module except for the activation layers and some differences in how the attention map is generated, as shown in Figure 6. The DA module also contains a channel attention network that performs two matrix multiplications, a softmax, and an addition operation. Equation (5) shows the overall mathematical operation of the dual attention module, and Equations (6) and (7) give the operations performed in the PA and CA networks.
where DA(x) is the dual attention feature maps, PA(x) is the position attention feature maps, and CA(x) is the channel attention feature maps.
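As a rough illustration of the channel attention branch of the DA module, the sketch below computes channel-wise affinities by a matrix product of the reshaped input with its own transpose, normalizes them with a softmax, reweights the input, and adds a scaled residual. Layer sizes and the scaling variable are assumptions; the position attention branch is analogous to the self-attention sketch above.

```python
# Hedged sketch of the channel attention branch of the dual attention module.
import tensorflow as tf
from tensorflow.keras import layers

class ChannelAttentionDA(layers.Layer):
    def build(self, shape):
        self.beta = self.add_weight(name="beta", shape=(), initializer="zeros")

    def call(self, x):
        b = tf.shape(x)[0]
        h, w, c = x.shape[1], x.shape[2], x.shape[3]
        flat = tf.reshape(x, (b, h * w, c))                    # (B, HW, C)
        energy = tf.matmul(flat, flat, transpose_a=True)       # (B, C, C) channel affinities
        attn = tf.nn.softmax(energy, axis=-1)
        out = tf.matmul(flat, attn)                            # channel-reweighted features
        out = tf.reshape(out, (b, h, w, c))
        return x + self.beta * out                             # scaled residual addition
```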

Network Training and Evaluation
The lightweight base network and all the models with the various attention modules were trained, validated, and tested using the same image datasets. Moreover, the same training hyperparameters and evaluation strategies were applied to fairly compare their performance. As deep CNN performance improves with a larger number of training images, several data augmentation algorithms were used, as shown in Table 3. The data augmentation was executed only on the training set after splitting the whole image set into training, validation, and testing sets. The increase in the number of training images due to the data augmentation process is provided in Table 4; the training set grew roughly eightfold after augmentation. Furthermore, the Adam optimizer with the default learning rate was chosen to effectively converge the network [29]. Although the models were trained for a fixed 100 epochs, the model weights with the minimum validation loss were saved at each epoch so that the optimally trained model was used for testing. Then, all the trained models were evaluated using the same testing datasets. The performance of the models was quantified by adopting the standard classification metrics, as shown in Equations (8)-(11) [30]. In addition, the size of the models was determined by counting the total number of network parameters and the memory space usage. On the other hand, the computational complexity was determined using floating point operations (FLOPs), the total number of mathematical operations required to complete a forward and backward pass for an input image.
Accuracy = (TP + TN) / (TP + TN + FP + FN)  (8)
Precision = TP / (TP + FP)  (9)
Recall = TP / (TP + FN)  (10)
F1 score = 2 × (Precision × Recall) / (Precision + Recall)  (11)
where TP stands for true positive, TN for true negative, FP for false positive, and FN for false negative counts of the predicted class.
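A minimal sketch of this training and evaluation setup is given below (Adam with its default learning rate, 100 epochs, checkpointing on minimum validation loss, and per-class metrics as in Equations (8)-(11)). The checkpoint file name and the model constructor refer to the earlier illustrative sketches and are assumptions, not the authors' exact scripts.

```python
# Sketch: train with Adam, keep the best checkpoint by validation loss, evaluate.
import numpy as np
import tensorflow as tf
from sklearn.metrics import classification_report

model = lw_resnet20(attention=cbam_block)   # assumed constructor from the earlier sketch
model.compile(optimizer=tf.keras.optimizers.Adam(),   # default learning rate (0.001)
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_model.h5", monitor="val_loss", save_best_only=True)

model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=[checkpoint])

# Evaluate the best checkpoint on the held-out test split, batch by batch.
model.load_weights("best_model.h5")
y_true, y_pred = [], []
for batch_x, batch_y in test_ds:
    preds = model.predict_on_batch(batch_x)
    y_true.append(batch_y.numpy())
    y_pred.append(np.argmax(preds, axis=-1))
print(classification_report(np.concatenate(y_true), np.concatenate(y_pred), digits=4))
```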

Training, Validation, and Testing Accuracy of the Models
All the models were trained and validated with the same dataset, training, and validation parameters. Figure 7 shows the training and validation accuracy and loss plots of the different models. The base model without an attention module (lw_resnet20) trained relatively more slowly (indicated by a black line) than the models with attention modules. The model with the SE attention module (lw_resnet20_se, represented by a blue line) showed quick training ability, as shown in Figure 7a,b. The validation accuracy and loss of all the models fluctuated significantly from epoch to epoch. The best training accuracy and loss were obtained from the base model, followed by the lw_resnet20_se model. However, the best validation accuracy and loss were achieved by the lw_resnet20_cbam model, followed by the lw_resnet20_da model, as shown in Table 5.

Network Parameters and Efficiency
The deeper the network, the more network parameters there are, thus increasing the size and computational complexity of the network [31]. The network parameters, size, training and testing efficiency, FLOPs, and the average accuracy on the test dataset are presented in Table 7. The proposed models had almost 16 times fewer network parameters and were 23 times less complex than the standard ResNet50 model [15]. The base model was found to be comparatively efficient and lightweight due to fewer network parameters but showed poor performance on the test dataset. The SE and CBAM modules are the lightest attention modules compared to SA and DA. Moreover, the channel attention of CBAM and the module structure of SE are somewhat similar except for the maximum pooling layers. The training time of the models was not significantly different amongst the various attention modules. The test time per image of the model was calculated by averaging the time taken to detect 1960 test images. The SA and DA modules are heavier than CBAM and SE, increasing the computational complexity.
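As a hedged sketch of how the size and efficiency figures reported above can be obtained for a Keras model, the snippet below collects the parameter count, the saved-model size on disk, and the average per-image inference time over the test set. FLOP counting is omitted because it depends on the profiling tool used; the file name is illustrative.

```python
# Sketch: parameter count, on-disk size, and average per-image test time.
import os
import time

def model_footprint(model, test_ds, path="model.h5"):
    model.save(path)
    n_params = model.count_params()               # trainable + non-trainable parameters
    size_mb = os.path.getsize(path) / 1e6         # saved model size in MB
    total_time, n_images = 0.0, 0
    for batch_x, _ in test_ds:
        start = time.perf_counter()
        model.predict_on_batch(batch_x)
        total_time += time.perf_counter() - start
        n_images += batch_x.shape[0]
    ms_per_image = 1000 * total_time / n_images   # average test time per image (ms)
    return n_params, size_mb, ms_per_image
```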

Tomato Disease Detection
The low interclass variance, high intra-class variance, and mixed symptoms of two or more diseases on the same leaf are some of the serious challenges for plant disease detection using computer vision techniques [9]. As all the images were of tomato leaves, the chances of producing false positives (FPs) and false negatives (FNs) were higher due to the lower interclass variance. Moreover, some of the late blight and target spot images showed the disease at a preliminary stage, so there was only a marginal difference between diseased and healthy images, or they were mistakenly labeled, as shown in Figure 9. Therefore, most of the models poorly detected early blight, healthy, late blight, and target spot leaf images. In contrast, all the models perfectly identified the Fusarium wilt diseased images because of the distinctness of that dataset: the majority of the Fusarium wilt images were captured directly on the plant (nondestructively), which made them unique in background and leaf position (Figure 1). The precision, recall, and F1 score of the target spot class were the lowest for all the models due to the higher FPs and FNs. All models wrongly detected some healthy images as late blight, bacterial spot, and target spot diseases, except the lw_resnet20_cbam model, which falsely identified 1% of target spot images as healthy leaves because some diseased images at a very early stage were almost visually indistinguishable from healthy leaves. Therefore, none of the models achieved 100% correct classification of healthy and diseased images. However, the lw_resnet20_cbam and lw_resnet20_sa models performed well except for producing 1% FPs and FNs, respectively. On the other hand, almost all the models precisely identified bacterial spot, leaf mold, mosaic virus, and yellow leaf curl virus diseased images.

Performance Evaluation of the Models
The attention modules allow the network to identify the discriminative features and their location in the input images to emphasize key features during training. The channel attention module determines the salient features present in the input images, while the spatial or position attention reveals the spatial location of those key features. The number of attention modules and their position in the network were the same for all models. We fixed the position of the attention module between blocks 3 and 4 to permit the network to focus on specific high-level features, because the datasets were of the same plant (tomato). The lw_resnet20_cbam model was superior in terms of classification accuracy and model lightness. The additional maximum pooling layer in the channel attention module of CBAM provides even minute details of the salient features to the network, boosting the network's performance. DA also uses two attention modules (channel and position), but it failed to perform as well as the CBAM model. One reason for the lower performance might be the parallel combination of channel and position attention; as suggested in [15], a series combination performs better than a parallel one. Moreover, the DA module structure is bulkier than the other attention modules due to the three parallel convolutional layers for position attention and the three branching matrix operations on the input feature maps for channel attention. In contrast, CBAM uses maximum pooling, average pooling, and convolutional layers in the spatial attention module, which is computationally more efficient than matrix operations.
The performance of our proposed model (lw_resnet20_cbam) was compared with models previously studied by various researchers. Some studies utilized the same tomato disease datasets with different deep CNN architectures, and some also implemented attention-based CNNs to improve detection accuracy. Table 8 presents the performance comparison of various CNN architectures used on the same tomato disease datasets. Only [2] used more tomato disease classes (12) than ours (11). In addition, most researchers applied a generic model designed for large-scale image classification datasets, which is computationally inefficient for a small number of plant disease classes; moreover, the majority of these generic models were used with transfer learning. From the table, it can be seen that none of the previous studies achieved better detection results than ours on such a large number of tomato leaf images with such a lightweight model. Therefore, this study will be helpful for future researchers designing efficient and effective networks for portable devices.
Amongst the various attention modules, the SA module also showed competitive results, although it came at the cost of greater network complexity and size. Its architecture is almost identical to the position attention module of the DA network except for an additional ReLU activation layer in each convolutional branch. In addition, the SA model's performance surpassed that of the DA model but could not match the CBAM model. The SE network utilizes a similar principle to CBAM's channel attention module, although it uses only a global average pooling operation in contrast to the average and maximum pooling operations in CBAM. Nevertheless, its performance was similar to that of the DA model. However, the SE module is the lightest and most efficient attention module. Thus, it is equally important to identify key features and their locations in the input images. Furthermore, the channel and spatial attention modules should be arranged in series so that the model can detect dominant features and then their place in the input images.

Conclusions
This study experimented with various attention modules and analyzed their performance in tomato disease classification. Attention modules originally developed for different purposes were employed, and their network architecture, computational complexity, and performance were comprehensively compared. From the results, it can be concluded that determining the key features and their location in the input images is crucial to enhancing classification performance. Moreover, identifying the regions of key features proved more beneficial than merely finding the essential features. The determination of critical features and their position should be performed sequentially, because merging these features in parallel leads to loss of crucial information. Our proposed model outperformed the prevailing generic models used for plant disease detection in terms of accuracy and efficiency.