IBSA_Net: A Network for Tomato Leaf Disease Identification Based on Transfer Learning with Small Samples

Abstract: Tomatoes are a crop of significant economic importance, and disease during growth poses a substantial threat to yield and quality. In this paper, we propose IBSA_Net, a tomato leaf disease recognition network that employs transfer learning and small-sample data, while introducing the Shuffle Attention mechanism to enhance feature representation. The model is optimized by employing the IBMax module to increase the receptive field and adding the HardSwish function to the ConvBN layer to improve stability and speed. To address the challenge of poor generalization of models trained on public datasets to real-environment datasets, we developed an improved PlantDoc++ dataset and utilized transfer learning to pre-train the model on the PDDA and PlantVillage datasets. The results indicate that after pre-training on the PDDA dataset, IBSA_Net achieved a test accuracy of 0.946 on a real-environment dataset, with an average precision, recall, and F1-score of 0.942, 0.944, and 0.943, respectively. Additionally, the effectiveness of IBSA_Net on other crops is verified. This study provides a dependable and effective method for recognizing tomato leaf diseases in real agricultural production environments, with the potential for application to other crops.


Introduction
In agriculture, the identification of tomato leaf diseases has traditionally relied on manual inspection, whose accuracy fluctuates with the experience and expertise of the human identifiers [1]. Additionally, the process is laborious and time-consuming, and it is difficult to find a large number of experts with sufficient experience in identifying tomato leaf diseases [2].
Recent advances in machine learning and deep learning techniques have revolutionized crop pest and disease identification by enabling the automated detection of leaf diseases. However, traditional machine learning methods face limitations, such as the need for manual feature screening and the inability to perform fully automated disease identification, including disease classification [3][4][5][6]. Conventional convolutional networks are also not optimal for recognizing single diseases in crop disease identification and may not accurately identify diseased regions of the crop [7][8][9].
Annabel and Muthulakshmi addressed these limitations by employing the random forest (RF) algorithm to detect three different types of tomato leaf diseases (bacterial spot, tomato mosaic virus, and late blight) as well as healthy leaves. The algorithm achieved an impressive accuracy of 94.10%, surpassing the results obtained using the SVM and MDC algorithms, which recorded accuracies of 82.60% and 87.60%, respectively, when applied to the same dataset [10]. In a similar vein, Das et al. (2020) developed a system that could detect seven distinct types of tomato leaf diseases using SVM, logistic regression (LR), and RF. The Haralick algorithm was employed to extract the texture features of the leaves, and the classifiers were used to classify the extracted features. The results of the study revealed that SVM outperformed the other two algorithms, recording an accuracy of 87.60%, followed by RF and LR, which recorded accuracies of 70.05% and 67.30%, respectively [11].
Deep learning models have been widely applied in plant leaf disease detection and classification. Thangaraj et al. proposed a TL-DCNN model to identify tomato leaf diseases and compared the performance of the Adam, RMSprop, and SGD optimizers, finding that the modified-Xception model with the Adam optimizer achieved the best results [12]. Pan Zhang et al. used seven pre-trained EfficientNet-B0-B7 models to train and test on four datasets of cucumber powdery mildew, downy mildew, healthy leaves, and a combination of powdery mildew and downy mildew, finding that EfficientNet-B4 achieved the highest accuracy (97%) for cucumber disease classification across the four tasks [13]. Zhe Tang et al. proposed a lightweight convolutional neural network model to diagnose grape diseases, including black rot, black measles, and leaf blight. The model incorporated the squeeze-and-excitation (SE) mechanism into ShuffleNet and achieved 99.14% accuracy on the PlantVillage dataset [14]. Utpal Barman proposed an SSCNN network for citrus leaf disease classification, which can be deployed on smartphones and has higher accuracy and less computation time than MobileNet [15]. Victor Gonzalez-Huitron et al. trained four typical convolutional network models on PlantVillage to find the optimal model and presented a user interface on a Raspberry Pi [16]. Amreen Abbas et al. generated synthetic images of tomato leaves by constructing a C-GAN network for network training [17]. Brahimi et al. (2017) used GoogLeNet and AlexNet models to detect tomato plant diseases and found that GoogLeNet had an accuracy of 99.18%, higher than AlexNet's 98.60% [18]. Jiang et al. (2020) proposed an improved ResNet50 model to classify three different types of tomato leaf diseases (spot disease, yellow leaf curl, and late blight), which used the leaky ReLU activation function and modified the filter size in each convolutional layer to 11 × 11 to improve accuracy. The experimental results showed that ResNet50 achieved 98% accuracy on the test dataset [19]. Rubanga et al. (2020) used pre-trained CNN architectures (such as Inception V3, VGG16, VGG19, and ResNet) to detect Tuta absoluta infestation in tomato leaves, finding that Inception V3 achieved a detection accuracy of 87.20%, better than the other architectures [20].
Ullah, Z. used a hybrid deep learning approach to detect tomato plant diseases from leaf images, integrating pre-trained EfficientNetB3 and MobileNet models to achieve 99.92% accuracy [21]. Waheed, H. developed an Android system for diagnosing ginger plant diseases by training VGG and MobileNetV2 models on a real dataset of ginger leaf images, achieving real-time detection with high performance [22]. Ulutaş proposed two CNN models for tomato leaf disease detection, optimized via fine-tuning, particle swarm optimization, and grid search; their ensemble model achieved 99.60% accuracy with fast training and testing times [23]. Luo, Y. created "LiteCNN," a lightweight neural network that achieved 95.24% accuracy through knowledge distillation. The network was then deployed on an FPGA as a low-power, high-precision, and fast plant disease recognition terminal, suitable for real-time recognition in the field [24]. Fan, X. proposed a transfer learning-based deep feature descriptor that combined deep and hand-crafted features through feature fusion to capture local texture information in plant leaf images. The approach achieved accuracies of 99.79%, 92.59%, and 97.12% on three different datasets [25]. Janarthan, S. proposed a mobile lightweight deep learning model that achieved accuracies of 97%, 97.1%, and 96.4% on apple, citrus, and tomato leaf datasets, respectively. The model had approximately 88 million multiply-accumulate operations, 260,000 parameters, and 1 MB of storage space [26].
Image-based methods for tomato leaf disease recognition have limitations. They require large-scale annotated datasets for training, which are difficult to obtain and may lack representativeness. Additionally, they do not fully consider the diversity and complexity of image acquisition conditions in real environments, such as illumination, angle, occlusion, and noise. Transfer learning techniques are also not effectively utilized to improve models' generalization and adaptation abilities across different datasets. To overcome these limitations, this paper proposes a novel convolutional neural network model, IBSA_Net, which combines the inverted bottleneck structure and the Shuffle Attention mechanism. The paper also explores the impact of activation functions, such as HardSwish, and the MaxPool layer on the network's performance. The proposed model achieves high accuracy, precision, recall, and F1-score on tomato leaf disease classification tasks and quickly identifies tomato leaf disease regions in real environments. Furthermore, this paper verifies the validity of IBSA_Net on other crops. The main contributions of this paper are as follows:

• The IBMax module is proposed to enhance model recognition accuracy by downsizing feature maps and expanding the receptive field.

• The Shuffle Attention mechanism is introduced and alternately embedded into the network modules to enable the model to focus accurately on tomato leaf disease regions.

• The PlantDoc++ dataset is constructed, and an innovative transfer learning approach is utilized to address the problem of models trained on a single-background dataset being unable to generalize to real-world agricultural production environments.

Data Set Structure
The three datasets used in this experiment are the tomato leaf disease dataset in PlantVillage, the PlantVillage Dataset with Data Augmentation (PDDA) from the Kaggle data site, and PlantDoc++, an extended dataset based on the PlantDoc dataset [27]. There are 39 different types of leaves in PlantVillage, 10 of which belong to tomatoes. The tomato leaf disease dataset contains images of nine diseases and one healthy leaf class. However, since the background of tomato leaves in PlantVillage is relatively homogeneous, there remains a need to identify tomato leaf diseases against the complex backgrounds found in real situations. The PlantVillage Dataset with Data Augmentation (PDDA), created from the PlantVillage dataset, was used to simulate tomato leaves photographed in a natural environment by fusing tomato leaves with complex backgrounds. The PlantDoc dataset is a dataset of tomato leaf diseases photographed in natural environments, with eight categories containing from a few dozen to two hundred images each. The original PlantDoc dataset has different disease categories than PlantVillage, and it contains unclear and watermarked images. Therefore, this paper expands PlantDoc by crawling relevant images on the web with ImageAssistant. After filtering and deletion, the expanded dataset has the same disease categories as PlantVillage, and the clarity of each image is ensured. The modified dataset is called PlantDoc++.
In this paper, all models are trained on PlantVillage and PDDA, respectively, with each dataset divided into a training set and a test set at a ratio of 8:2. The training-set and test-set sizes are the same for PlantVillage and PDDA; please refer to Table 1 for detailed information. The series of models trained on each dataset is called a model cluster. The PlantVillage model cluster and the PDDA model cluster are each transferred to the PlantDoc++ training set for further learning and finally tested on the PlantDoc++ test set. For PlantDoc++, the ratio of the training set to the test set is 5:5. The data augmentation methods used include 30°, 45°, and 60° rotations of the samples; random cropping of the images with a 150 × 150 crop frame, followed by scaling back to the original size; random flipping of the images; and other operations.

Introduce Inverted Bottleneck Structure to Improve Network Accuracy
The bottleneck structure was first proposed in the ResNet model [28]. That paper pointed out that the model's training time and parameter count can be reduced by changing the original structure of two convolutional layers with a kernel size of 3 × 3 into a structure with one 3 × 3 convolution between two 1 × 1 convolutions. The diagram of the convolutional structure before and after the improvement is shown as follows.
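As a back-of-the-envelope check of the parameter savings, the weights of the two designs can be counted directly, assuming 256 channels at both ends and 64 in the middle as in Figure 1 (biases ignored; this is a sketch, not the paper's measurement):

```python
# Parameter counts (weights only, biases ignored) for the two designs
# sketched in Figure 1, assuming 256 channels in/out and 64 in the middle.

def conv_params(c_in, c_out, k):
    """Weights of a k x k convolution with c_in inputs and c_out outputs."""
    return c_in * c_out * k * k

# Original design: two 3x3 convolutions at 256 channels.
plain = conv_params(256, 256, 3) + conv_params(256, 256, 3)

# Bottleneck: 1x1 down to 64, 3x3 at 64, 1x1 back up to 256.
bottleneck = (conv_params(256, 64, 1)
              + conv_params(64, 64, 3)
              + conv_params(64, 256, 1))

print(plain, bottleneck)  # 1179648 vs 69632: far fewer weights
```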
As shown in Figure 1, after the improvement, the numbers of input and output channels are both 256, while the number of channels in the middle convolution is 64; this "thick at both ends and thin in the middle" structure is called the bottleneck module.

The idea of an inverted bottleneck module was proposed in the Transformer network and promoted in MobileNetV2 [29,30]. Additionally, the ConvNeXt network adopts this structure as one of the basic components of its ConvNeXt block. By using the inverted bottleneck structure, the parameter count of the whole ConvNeXt network is reduced, and the recognition accuracy is also improved [31]. The inverted bottleneck structure is the opposite of the bottleneck structure: its input and output channel counts are smaller than the channel count of the intermediate convolutional layer. The following figure shows the ConvBN convolutional block that constitutes the IBSA module.

As shown in Figure 2, the ConvBN convolution block consists of a convolutional layer with a stride of 2 and a kernel size of 7 × 7, plus a BatchNorm layer with the HardSwish activation function. The ConvBN 1 × 1 convolution is similar to ConvBN, except that its convolutional layer uses 1 × 1 convolutions for channel up-dimensioning and down-dimensioning. The convolutional kernel in ConvBN has a larger receptive field to extract more image information. BatchNorm is a batch normalization operation that reduces the model's dependence on the initial parameters, accelerates network training, and improves generalization.

In the design of the inverted bottleneck module, two variants are used: one with a maximum pooling layer (called IBMax) and one without (called IB). The following figure shows the structure of the two modules. In both inverted bottleneck modules, the number of input and output channels of ConvBN is c. The middle ConvBN 1 × 1 and the third-layer ConvBN 1 × 1 perform channel up-dimensioning and down-dimensioning, respectively: the up-dimensioning convolution changes the number of channels to four times the original, and the down-dimensioning convolution restores it to c. Please refer to Figure 3 for details. In the inverted bottleneck module with MaxPool, the pattern of channel changes is the same as in Figure 4a.
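A minimal PyTorch sketch (not the authors' implementation) of the ConvBN block and the two inverted bottleneck variants might look as follows; the padding, input size, and placement of the stride are assumptions for illustration:

```python
# Sketch of ConvBN (7x7, stride 2, BatchNorm + HardSwish) and the IB /
# IBMax modules: 1x1 expansion to 4c, 1x1 reduction back to c, and an
# optional MaxPool. Padding and shapes are assumptions, not paper values.
import torch
import torch.nn as nn

def conv_bn(c_in, c_out, k, stride=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, stride=stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.Hardswish(),
    )

class InvertedBottleneck(nn.Module):
    def __init__(self, c, use_maxpool=False):
        super().__init__()
        self.body = nn.Sequential(
            conv_bn(c, c, 7, stride=2),  # ConvBN: 7x7 kernel, stride 2
            conv_bn(c, 4 * c, 1),        # 1x1 up-dimensioning to 4c
            conv_bn(4 * c, c, 1),        # 1x1 down-dimensioning back to c
        )
        self.pool = nn.MaxPool2d(2) if use_maxpool else nn.Identity()

    def forward(self, x):
        return self.pool(self.body(x))

x = torch.rand(1, 64, 56, 56)
ib = InvertedBottleneck(64)           # IB: no pooling
ibmax = InvertedBottleneck(64, True)  # IBMax: with MaxPool
print(ib(x).shape, ibmax(x).shape)
```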

Adding Shuffle Attention Module to Improve the Spatial Localization Accuracy of the Network for Sick Regions
For traditional convolutional networks, spatial localization ability is poor, i.e., the network cannot locate the leaf disease region well or learn the disease features during recognition. Shuffle Attention is an attention mechanism whose core idea is to focus on the critical information in the picture [32]. Figure 5 shows the structure of the Shuffle Attention module. The Shuffle Attention (hereafter referred to as SA) module takes a given input X ∈ R^(C×H×W), where C, H, and W denote the number of input channels and the feature map height and width, respectively. SA divides the channels into g groups along the channel direction, i.e., X = [X_1, X_2, ..., X_g], and for each X_i ∈ R^(C/g×H×W), it generates two sub-feature maps by channel separation, X_i1, X_i2 ∈ R^(C/2g×H×W). As shown in Figure 4, after the channel separation, the upper branch outputs the channel attention map using the interrelationship between channels; the lower branch generates the spatial attention map using the feature-space relationship.


(1) Channel attention branch. First, feature compression along the spatial dimension is performed by global average pooling F_gp to generate the channel statistic s ∈ R^(C/2g×1×1), which embeds the global information of the feature subgraph:

s = F_gp(X_i1) = (1/(H × W)) Σ_h Σ_w X_i1(h, w)

Then s is passed through the F_c(·) operation and the sigmoid activation function σ and multiplied with the initial feature subgraph X_i1, and the final output is X'_i1:

X'_i1 = σ(F_c(s)) · X_i1 = σ(W_1 s + b_1) · X_i1

where W_1 and b_1, which have the same dimensions as s, scale and shift the feature subgraph.
(2) Spatial attention branch. The spatial attention mechanism is more concerned with where the regions of interest lie, complementing the channel attention mechanism. The separated feature subgraph is first subjected to the Group Normalization (GN) operation to obtain its spatial information, the feature representation is then enhanced using the F_c(·) operation, and the final output X'_i2 is

X'_i2 = σ(W_2 · GN(X_i2) + b_2) · X_i2
where W_2 and b_2 are shaped as R^(C/2g×1×1). The channel attention and spatial attention branch outputs are finally concatenated along the channel dimension, so that the two feature subgraphs with C/2g channels each are stitched into one C/g-channel subgraph. After the channel and spatial attention mechanisms are applied to the g groups of feature subgraphs X_1, X_2, ..., X_g, the g feature subgraphs are recovered through an aggregation operation into an output X' ∈ R^(C×H×W) with the same dimensions as the initial input X. Finally, the aggregated output is subjected to the channel shuffle operation, which enables cross-group communication along the channel dimension, enhances mutual learning between different pieces of information, and improves the generalization of the model.
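The SA computation described above can be sketched in PyTorch; this follows the published Shuffle Attention formulation [32] rather than the authors' exact code, and the group count g is an assumed default:

```python
# Minimal sketch of Shuffle Attention: group split, channel separation,
# channel branch (global pooling + scale/shift + sigmoid), spatial branch
# (GroupNorm + scale/shift + sigmoid), concat, aggregate, channel shuffle.
import torch
import torch.nn as nn

class ShuffleAttention(nn.Module):
    def __init__(self, channels, g=8):
        super().__init__()
        self.g = g
        c = channels // (2 * g)  # channels of each half-split sub-feature
        # Learnable (W, b) pairs for each branch, shaped C/2g x 1 x 1.
        self.cw = nn.Parameter(torch.zeros(1, c, 1, 1))
        self.cb = nn.Parameter(torch.ones(1, c, 1, 1))
        self.sw = nn.Parameter(torch.zeros(1, c, 1, 1))
        self.sb = nn.Parameter(torch.ones(1, c, 1, 1))
        self.gn = nn.GroupNorm(c, c)  # GN for the spatial branch

    def forward(self, x):
        b, c, h, w = x.shape
        x = x.view(b * self.g, c // self.g, h, w)
        x1, x2 = x.chunk(2, dim=1)  # channel separation into X_i1, X_i2
        # Channel attention: global average pooling, then a sigmoid gate.
        s = x1.mean((2, 3), keepdim=True)
        x1 = x1 * torch.sigmoid(self.cw * s + self.cb)
        # Spatial attention: GroupNorm, then a sigmoid gate.
        x2 = x2 * torch.sigmoid(self.sw * self.gn(x2) + self.sb)
        out = torch.cat([x1, x2], dim=1).view(b, c, h, w)
        # Channel shuffle for cross-group information flow.
        out = out.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)
        return out

x = torch.rand(2, 64, 32, 32)
y = ShuffleAttention(64)(x)
print(y.shape)  # SA preserves the input dimensions
```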

Enhancing Inter-Network Information Flow and Extraction by Building IBSA_Block
To better describe the model, the two inverted bottleneck modules in Figure 4 are referred to as IB_block (a) and IB_Maxpool_block (b). For the IBSA_block composed of IB_block and IB_Maxpool_block, there is no difference in the overall structure: the input first passes through IBMax_block, the disease-area features are then extracted by the SA attention mechanism module, and the output is summed with the shortcut branch to form the residual module. The final output goes through the activation function. The activation function between IBSA_block blocks is the GELU activation function.
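The residual wiring of IBSA_block can be illustrated schematically; the body below is an identity placeholder standing in for the IBMax and SA modules defined elsewhere, so only the shortcut-plus-GELU pattern is shown:

```python
# Schematic of the IBSA_block residual pattern: output = GELU(body(x) +
# shortcut(x)). The body is a placeholder; in the paper it would be the
# IBMax module followed by SA, with a shortcut matching its output shape.
import torch
import torch.nn as nn

class IBSABlock(nn.Module):
    def __init__(self, body: nn.Module, shortcut: nn.Module = None):
        super().__init__()
        self.body = body                       # stand-in for IBMax + SA
        self.shortcut = shortcut or nn.Identity()
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(self.body(x) + self.shortcut(x))

x = torch.rand(1, 8, 4, 4)
blk = IBSABlock(nn.Identity())  # identity body, for illustration only
y = blk(x)
print(y.shape)
```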

IBSA_Net Overall Structure
The entire IBSA_Net forms its main structure by stacking IBSA_block blocks three times, with UP_convolution used for channel up-dimensioning between IBSA_block blocks. The channel up-convolution uses a kernel size of 1 × 1, and the number of output channels is twice the number of input channels. The final output uses global average pooling to extract the information, followed by a fully connected layer. Figure 7 shows the structure of the entire IBSA_Net network, and Table 2 shows the output dimensions of each module in IBSA_Net.


Experimental Configuration and Parameter Settings
This experiment uses Ubuntu 20.04.4 LTS 64-bit as the operating system (Canonical Ltd., London, UK) and an Intel(R) Xeon(R) Silver 4214 CPU @ 2.20 GHz as the processor, with 32 GB of RAM (Intel, Santa Clara, CA, USA). The GPU is an NVIDIA Tesla T4 with 16 GB of video memory (Nvidia, Santa Clara, CA, USA).
In this paper, all experiments are conducted using mini-batch stochastic gradient descent (SGD). For the IBSA_Net network, the batch size is set to 32 and the number of training epochs to 50. The SGD optimizer is used with an initial learning rate of 0.01 and step learning-rate decay with a step size (step_size) of 2 and a decay factor (lr_decay) of 0.9; that is, the learning rate is multiplied by the decay coefficient every two training epochs. The cross-entropy loss function is used as the loss function.
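These hyperparameters can be wired up in PyTorch as follows; the model is a trivial placeholder, and only the optimizer and scheduler settings mirror the text:

```python
# SGD with lr 0.01, step decay by 0.9 every 2 epochs, cross-entropy loss.
# The model here is a stand-in; only the training wiring is of interest.
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.9)
criterion = nn.CrossEntropyLoss()

lrs = []
for epoch in range(6):
    lrs.append(optimizer.param_groups[0]["lr"])
    # ... one epoch of mini-batch training would go here ...
    scheduler.step()

print(lrs)  # the learning rate is multiplied by 0.9 every two epochs
```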

Evaluation Indicators
To better evaluate the performance of different models on the same dataset, the following metrics are introduced: accuracy, the proportion of correctly predicted samples among all samples, as shown in Equation (4); precision, the ratio of correctly predicted positive samples to the total number of predicted positive samples, as shown in Equation (5); recall, the ratio of correctly predicted positive samples to the total number of actual positive samples, as shown in Equation (6); and the F1-score, which integrates precision and recall to reconcile both, as shown in Equation (7):

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4)
Precision = TP / (TP + FP)    (5)
Recall = TP / (TP + FN)    (6)
F1 = 2 × Precision × Recall / (Precision + Recall)    (7)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
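A small worked example of these four standard metrics on hypothetical confusion-matrix counts:

```python
# Worked example of accuracy, precision, recall, and F1 on toy counts
# (the numbers are made up, purely for illustration).
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, p, r, f1 = metrics(tp=90, tn=85, fp=10, fn=15)
print(acc, p, r, f1)  # 0.875, 0.9, ~0.857, ~0.878
```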
In the network training process of deep learning, the loss function measures the gap between the actual value and the predicted value. The loss function used in this paper is the cross-entropy loss:

L = −Σ_i p(x_i) log q(x_i)
where p(x_i) is the true (expected) probability distribution and q(x_i) is the predicted probability distribution. The smaller the L obtained, the smaller the difference between the actual value and the predicted value, and the more accurate the prediction.
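A tiny numeric check of the cross-entropy definition, using made-up distributions:

```python
# Cross-entropy L = -sum_i p(x_i) * log(q(x_i)) for a one-hot target:
# a more confident correct prediction yields a smaller loss.
import math

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.0, 1.0, 0.0]          # true (one-hot) distribution
q_good = [0.05, 0.90, 0.05]  # confident, correct prediction
q_bad = [0.40, 0.20, 0.40]   # uncertain prediction

print(cross_entropy(p, q_good), cross_entropy(p, q_bad))
```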

Ablation Experiments of Model Lifting by Key Modules
In this subsection, the key modules examined are the HardSwish activation function, the IBMax module, and the Shuffle Attention module. We refer to the model with all key modules removed as OriNet.
Activation functions add nonlinear factors to the network and increase the expressive power of the neural network. The HardSwish (HS) function was first proposed in the MobileNetV3 model [33]. Compared with the traditional Swish activation function, the HardSwish activation function is faster to compute and has better numerical stability. The HardSwish function is defined as follows:

HardSwish(x) = 0 if x ≤ −3; x if x ≥ +3; x(x + 3)/6 otherwise.
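In plain Python, the standard MobileNetV3 definition of HardSwish (0 for x ≤ −3, x for x ≥ 3, and x(x + 3)/6 in between) reads:

```python
# Direct transcription of the piecewise HardSwish definition.
def hardswish(x):
    if x <= -3:
        return 0.0
    if x >= 3:
        return float(x)
    return x * (x + 3) / 6

print([hardswish(v) for v in (-4, -3, 0, 1, 3, 4)])
```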
The maximum pooling layer extracts the primary information in the image. It shrinks the feature maps flowing through the network, making the information denser while reducing the number of operations.
The following table shows the increases and decreases in overall training and testing accuracy caused by adding and removing the three key modules, together with a comparison of recognition accuracy.
As seen in Table 3, each of the three key modules contributes to the network, and adding any of the three improves network performance. Among the individual module additions, the IBMax module yields the most significant improvement. The performance improvement from adding two key modules is greater than that from adding only one. These experimental results show that the proposed key modules effectively improve network performance, ultimately bringing a 9.2% improvement.

Comparison Experiments of Model Clusters with Different Datasets
In this study, the performance of the PlantVillage model cluster was analyzed in terms of training and testing. The recognition accuracy of six models was evaluated using the accuracy curves shown in Figure 8a,b. The findings revealed that the IBSA_Net model outperformed the other models, exhibiting significantly higher recognition accuracy in the early stages of training. As training progressed, the accuracy curve of IBSA_Net tended to flatten after the 10th round, indicating convergence. After 50 rounds of training, the IBSA_Net model achieved a training accuracy of 99.7% and a testing accuracy of 99.4%. While the EfficientNet and MobileNetV3 models showed training accuracy similar to IBSA_Net at the 50th round, at 98.4% and 97.6%, respectively, the gaps between training and testing accuracy for models other than IBSA_Net were over 2%. These findings suggest that those models suffer from overfitting and display weaker generalization ability than the IBSA_Net model.
In Figure 9a,b, the performance of the six models on the PDDA dataset is evaluated. The results demonstrate that the training and testing accuracies of all models decreased compared to those on PlantVillage. This decline in accuracy may be due to the greater complexity of the backgrounds in the PDDA dataset. However, IBSA_Net remains the best-performing model, indicating that it is less influenced by complex backgrounds and displays stronger robustness than the other models.
Tables 4 and 5 illustrate that the proposed IBSA_Net model exhibits superior performance on both the PlantVillage and PDDA datasets. Compared to advanced models, such as EfficientNet and MobileNetV3, along with classical convolutional neural networks, IBSA_Net achieves higher accuracy, precision, and recall scores. Moreover, IBSA_Net shows excellent performance in leaf recognition tasks under complex backgrounds. In Figure 10, the loss curves reveal that IBSA_Net has the smallest loss and converges the fastest. After the 10th round of training, the loss curve of IBSA_Net begins to flatten, while the training losses of the other five models in the first five rounds are significantly larger than those of IBSA_Net.

Transfer Learning with the Small-Sample PlantDoc++ Dataset
Generally speaking, it is difficult to obtain large crop disease leaf datasets in natural environments, especially for the nine tomato leaf disease categories covered in this paper. Although several large public datasets are available, the leaves in public datasets such as PlantVillage are primarily photographed against a single background, which does not reflect the pose of leaves in the natural environment and yields almost noiseless data. Models trained on such datasets cannot be extended to natural agricultural production environments. This paper therefore uses the PDDA dataset to simulate the natural picking environment. The PDDA dataset is built by segmenting the leaves in the public dataset and moving them onto background images of natural tomato planting and picking environments, so as to approximate tomato leaves in the natural environment, increase the noise of the dataset, and make the model more robust when recognizing leaves against complex backgrounds.
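The compositing step behind PDDA (cut a segmented leaf out of a single-background image and paste it onto a field background) can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the crude green-dominance mask stands in for whatever segmentation method was really used.

```python
import numpy as np

def composite_leaf(leaf_rgb: np.ndarray, background_rgb: np.ndarray,
                   top: int, left: int) -> np.ndarray:
    """Paste a segmented leaf onto a field background at (top, left).

    A crude green-dominance mask stands in for the real segmentation:
    a pixel counts as 'leaf' if its green channel dominates red and blue.
    """
    h, w, _ = leaf_rgb.shape
    g = leaf_rgb[..., 1].astype(np.int16)
    mask = (g > leaf_rgb[..., 0]) & (g > leaf_rgb[..., 2])

    out = background_rgb.copy()
    region = out[top:top + h, left:left + w]
    region[mask] = leaf_rgb[mask]          # copy only the leaf pixels
    return out
```

Varying the paste position (and, in a fuller version, scale and rotation) over many background photographs yields the noisier, more realistic training images described above.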
Afterward, following the idea of transfer learning, the model clusters trained on the PlantVillage and PDDA datasets, respectively, were fine-tuned (with their pre-trained parameters retained) on the small-sample PlantDoc++ dataset, and the accuracy of the two model clusters in identifying tomato leaf diseases in natural environments was then tested.
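The pre-train-then-fine-tune procedure can be sketched in PyTorch. The `TinyNet` model, the `"fc"` head name, and the choice to freeze the feature extractor are all illustrative assumptions; the paper's actual recipe may fine-tune more layers.

```python
import torch
import torch.nn as nn

def prepare_for_finetuning(model: nn.Module, head_name: str,
                           num_target_classes: int) -> nn.Module:
    """Keep pre-trained feature weights; replace and train only the head."""
    for p in model.parameters():
        p.requires_grad = False               # freeze pre-trained features
    in_features = getattr(model, head_name).in_features
    setattr(model, head_name, nn.Linear(in_features, num_target_classes))
    return model                              # new head requires grad by default

class TinyNet(nn.Module):
    """Placeholder backbone standing in for a network pre-trained on PDDA."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1),
                                      nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(8, num_classes)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))

model = TinyNet(num_classes=38)               # e.g. pre-trained class count
# model.load_state_dict(torch.load("pdda_pretrained.pt"))  # hypothetical file
model = prepare_for_finetuning(model, "fc", 10)  # 10 PlantDoc++ tomato classes
```

Only the replaced head receives gradient updates during fine-tuning on PlantDoc++, which is one common way to transfer knowledge from a large source domain to a small target dataset.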
The tomato leaf categories represented by the 0 to 9 markers in Figure 11 correspond to the leaf categories from Bacterial_spot to healthy in Table 6. Figure 11 and Table 6 show that the PlantVillage model cluster suffers a significant drop in its ability to identify the disease dataset in natural environments compared with its performance on PlantVillage, with generally low recall. However, the IBSA_Net network still performs best among all models, indicating that after transfer learning IBSA_Net pays more attention to leaf features and is more robust. In the PDDA model group, all models exhibited higher evaluation metrics than in the PlantVillage model group, confirming the earlier hypothesis that pre-training on datasets that simulate complex backgrounds can enhance a model's ability to distinguish leaves from backgrounds and reduce interference from background factors. Although EfficientNet achieved precision and recall scores of 1 for certain diseases, its average values across all evaluation metrics were lower than those of IBSA_Net. Specifically, its evaluation metrics were lower for certain diseases, indicating that the model may achieve high recognition rates for specific disease categories while performing poorly on others, which is unsuitable for actual agricultural production environments. By contrast, the IBSA_Net model proposed in this study shows almost no imbalance in its evaluation metrics, displaying high values for every disease category. Moreover, IBSA_Net exhibits excellent transfer learning ability, demonstrating strong knowledge transfer from the source domain to the target domain. After pre-training on PDDA and testing on PlantDoc++, IBSA_Net achieved average precision, recall, and F1-score values of 0.942, 0.944, and 0.943, respectively, higher than the average evaluation metric values of the other five networks. Please refer to Figure 12 and Table 7 for details.

Attentional Visualization Experiment
In order to better understand how convolutional neural networks identify diseases in different parts of the images during training, this paper uses the Grad-CAM attention visualization tool to generate attention heat maps of the network's predictions [34,35]. The heat map highlights important regions by color-coding them: redder, warmer colors correspond to regions of greater importance that the network attends to most, while cooler (bluish) colors denote regions of less interest. Attention visualizations were obtained for some diseased leaves, and the corresponding heat maps are shown in Figure 13. From the heat maps, we can observe that within the PlantVillage model cluster, the red areas of the three models other than IBSA_Net are small and scattered. In contrast, the warm areas of the IBSA_Net network are mainly concentrated around the disease spots, although the attention is not tightly focused on the spot area itself. This indicates that disease information learned from a single-background dataset may not generalize well to datasets with more complex backgrounds. It is important to note that in the first attention heat map of early blight, the GoogLeNet and VGG networks identified non-disease areas, such as the hole formed by the branch in the upper left corner and the overlap between the leaf edge and the background, as diseased spots. As depicted in the figure, these networks tend to show such "false spots" as red regions in their attention heat maps. In contrast, after pre-training on the PDDA dataset, IBSA_Net demonstrated a superior ability to distinguish between "false spots" and actual diseased regions. The attention heat map of IBSA_Net mostly identifies diseased spots, assigning higher attention weights to these regions. The "false spots" related to the holes formed by the branch are colored light yellow and cyan, indicating lower network attention to these areas. Similarly, on the Target_Spot dataset, IBSA_Net effectively avoided predicting shadows or non-disease black spots as diseased regions, whereas the other networks made such errors. These findings suggest that IBSA_Net can learn the differences between diseased regions and environmental illusions on leaves and better distinguish between them.
Conversely, in the PDDA model cluster, there are significantly more warm regions in the attention heat maps, with greater overlap with the disease spot regions. Specifically, the warm-color regions of IBSA_Net largely overlap with the disease spot regions and cover most of the spots. This suggests that the ability to distinguish complex backgrounds from leaf spot regions was learned from the PDDA dataset and that IBSA_Net had the strongest ability to generalize from the source domain to the target domain.
To further quantify the level of attention each model pays to the diseased regions of tomato leaves, we calculated the percentage of warm regions versus diseased regions in each model's attention visualization when predicting the PlantDoc++ test set after training on the different datasets. The attention matrix was converted to a NumPy array, reshaped to the same size as the original image, and then superimposed on the original image to generate a visualized attention map. The number of pixels of the object to be recognized was calculated from the locations of the tomato leaf spots or their bounding boxes [36]. The number of pixels in the warm area of the attention map was determined by applying a set threshold. Finally, the ratio of the number of pixels of the object to be recognized to the number of pixels in the warm area was obtained.
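The ratio computation described above can be sketched with NumPy. The 0.6 warm-color threshold and the rectangular lesion box are illustrative assumptions; the paper does not state its exact threshold.

```python
import numpy as np

def warm_area_ratio(heatmap: np.ndarray, lesion_box: tuple,
                    warm_threshold: float = 0.6) -> float:
    """Ratio of annotated-lesion pixels to warm pixels in a [0, 1] heat map.

    lesion_box is (top, left, bottom, right) for the annotated spot region.
    """
    warm = heatmap >= warm_threshold                  # thresholded warm area
    n_warm = int(warm.sum())
    top, left, bottom, right = lesion_box
    n_lesion = (bottom - top) * (right - left)        # bounding-box pixel count
    return n_lesion / n_warm if n_warm else 0.0
```

A ratio near 1 with warm pixels concentrated inside the box indicates attention well aligned with the lesion, which is the behavior Table 8 reports for IBSA_Net.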
Table 8 displays the attention level of each model for the two tomato leaf diseases, Early_blight and Target_Spot, depicted in Figure 13. For both diseases, the PDDA model group has a higher proportion of warm-colored regions covered in the attention heat map than the PlantVillage model group. Furthermore, the area percentage of IBSA_Net, proposed in this paper, is the highest among all the results.

Performance of IBSA_Net on Other Crops
While the experiments conducted in this study primarily focused on tomato disease leaves, the feature extraction and identification capabilities of IBSA_Net can be extended to other crops. To test this, we evaluated the model's performance on apple and chili pepper leaves; example images from the datasets are shown in Figures 14 and 15. The apple leaf dataset comprises five categories (Alternaria leaf spot, Brown spot, Grey spot, Mosaic, and Healthy), while the chili pepper leaf dataset consists of two categories (Bacterial spot and Healthy). These datasets were obtained from Yang's paper and PlantVillage, respectively [37]. Each dataset was split into an 8:2 ratio for training and testing, and the results are presented below. From Tables 9 and 10, it is evident that both IBSA_Net and EfficientNet outperformed the other four models. Although EfficientNet achieved 0.7% higher accuracy than IBSA_Net on the apple leaf disease dataset, IBSA_Net exhibited superior performance in terms of the precision, recall, and F1-score evaluation metrics. Furthermore, on the pepper leaf disease dataset, IBSA_Net outperformed the other five networks in all evaluation metrics. The results obtained on the apple and pepper leaf datasets confirm the efficacy of IBSA_Net in recognizing diseases in different crops. Remarkably, IBSA_Net demonstrated strong disease recognition and generalization abilities even in the presence of complex image backgrounds.
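The 8:2 split can be sketched as a simple shuffled index split; the fixed seed is an illustrative choice for reproducibility, not something the paper specifies.

```python
import random

def train_test_split(items, train_frac=0.8, seed=42):
    """Shuffle a list of image paths and split it 8:2 (reproducible via seed)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]
```

Shuffling before splitting matters when images of the same class are stored consecutively, as in a typical class-per-folder dataset layout.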
As shown in Figure 2, the ConvBN convolution block consists of a convolutional layer with a stride of 2 and a convolutional kernel size of 7 × 7, followed by a BatchNorm layer and the HardSwish activation function. The ConvBN 1 × 1 convolution is similar to ConvBN, except that its convolutional layer uses 1 × 1 convolution for channel up-dimensioning and down-dimensioning. The larger convolutional kernel in ConvBN provides a larger receptive field to extract more image information. BatchNorm is a batch normalization operation that reduces the model's dependence on its initial parameters and improves the network's training speed and generalization ability. In the design of the inverted bottleneck module, two variants are used: one with a maximum pooling layer (called IBMax) and one without (called IB). The structure of the two modules is shown in the figure below.
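Under the description above, the stem block and the two inverted-bottleneck variants can be sketched in PyTorch. The channel widths, expansion factor, and pooling parameters are illustrative assumptions; the paper's exact dimensions are given in Table 2.

```python
import torch
import torch.nn as nn

class ConvBN(nn.Sequential):
    """Conv -> BatchNorm -> HardSwish, as in the stem (7x7 kernel, stride 2)."""
    def __init__(self, c_in, c_out, kernel_size=7, stride=2):
        super().__init__(
            nn.Conv2d(c_in, c_out, kernel_size, stride,
                      padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(c_out),
            nn.Hardswish())

class IB(nn.Module):
    """Inverted bottleneck: 1x1 expand, 3x3 conv, 1x1 project, residual add."""
    def __init__(self, channels, expansion=4):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            ConvBN(channels, hidden, kernel_size=1, stride=1),   # up-dimension
            ConvBN(hidden, hidden, kernel_size=3, stride=1),
            nn.Conv2d(hidden, channels, 1, bias=False),          # down-dimension
            nn.BatchNorm2d(channels))

    def forward(self, x):
        return x + self.block(x)

class IBMax(IB):
    """IB variant with a MaxPool layer added to enlarge the receptive field."""
    def __init__(self, channels, expansion=4):
        super().__init__(channels, expansion)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        x = self.pool(x)
        return x + self.block(x)
```

With stride-1, padding-1 pooling, IBMax preserves spatial dimensions while each output pixel aggregates a larger neighborhood, matching the stated goal of increasing the receptive field.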

Figure 2. Comparison of the ResNet module before and after improvement: (a) the module before improvement; (b) the module after improvement.

Figure 4. The two inverted bottleneck modules: (a) inverted bottleneck module; (b) inverted bottleneck module with a MaxPool layer added.

Figure 7. IBSA_Net network structure.
Table 2. Output dimensions of the different layers of IBSA_Net (kernel size given for convolutional and pooling layers only).

Table 3. Effect of different key modules on model accuracy.

Table 6. Comparison of evaluation metrics of PlantVillage model clusters on PlantDoc++.

Table 7. Comparison of evaluation metrics of PDDA model clusters on PlantDoc++.

Table 8. Percentage of warm and diseased areas in the attention heat map.