Classification Accuracy Improvement for Small-Size Citrus Pests and Diseases Using Bridge Connections in Deep Neural Networks

Because of its rich vitamin content, citrus is an important crop around the world. However, citrus yields are often reduced by damage from various pests and diseases. To mitigate these problems, several convolutional neural networks were applied to detect them. Notably, the performance of the selected models degraded as the size of the target object in the image decreased. To adapt to scale changes, a new feature reuse method named bridge connection was developed. With the help of bridge connections, the accuracy of the baseline networks was improved at little additional computational cost. The proposed BridgeNet-19 achieved the highest classification accuracy (95.47%), followed by the pre-trained VGG-19 (95.01%) and VGG-19 with bridge connections (94.73%). The use of bridge connections also increases the flexibility of sensors for image acquisition: there is no need to carefully adjust the distance between the camera and the pests or diseases.


Introduction
Pests and diseases are major causes of economic loss in agricultural production. Timely detection and identification of these pests and diseases is essential to controlling their impact. Farmers used to rely on experienced experts to perform these tasks; unfortunately, this is a time-consuming, error-prone, and costly process. To improve recognition efficiency, image processing and computer vision techniques have been widely applied. Bashish et al. [1] proposed a solution for the automatic detection and classification of plant leaf diseases. Their method was based on image processing: a K-means clustering technique segmented the RGB images, and a single-layer artificial neural network performed the classification. Ali et al. [2] presented a citrus disease recognition system in which a ∆E color difference algorithm was adopted to separate the disease-affected areas, and a color histogram and texture features were extracted for classification. The features needed to recognize agricultural pests are more complex than those needed for diseases; they are sensitive to affine transformation, illumination, and viewpoint change. Wen et al. [3] introduced a local feature-based identification method to account for variations in insect appearance. Xie et al. [4] fused multiple features of insect species to enhance recognition performance. Compared with other feature-combination methods, their approach produced higher accuracy but needed more time to train.
The above research all required pretreatment to select features for the classifiers. In contrast, deep convolutional neural networks (CNNs) are representation-learning models [5]. They can receive raw image data as input and automatically discover the useful features for classification and

Image Dataset Description
Our dataset contains 12,561 images covering 17 species of citrus pests and seven types of citrus disease, with over 350 images per class. Figure 1 shows the number of samples in each category. The image collection methods for citrus pests and diseases were described in our previous research [18]. To balance the data distribution in the dataset and improve the model's generalization ability, several data augmentation approaches were adopted. Table 1 lists the parameter settings for each augmentation. Instead of using only one operation, we randomly selected three operations and performed them sequentially to produce a new image. This method considerably increases the diversity of the generated images (Figure 2). Image samples used in earlier studies were gathered under laboratory conditions [6,31], which reduced the robustness of the trained models to realistic conditions [32]. In contrast, for this study, we collected images with variable, realistic backgrounds. In addition, to further adapt to real-world scenes, the original images were not excessively cropped to keep only the target object region [33]. Figure 3 shows images with varying distances between the camera sensor and the citrus pest or disease.
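As a concrete illustration, the sequential augmentation scheme described above can be sketched as follows. The operations and parameter values here are hypothetical stand-ins for those listed in Table 1:

```python
import random
import numpy as np

# Hypothetical augmentation operations; the paper's actual operations and
# parameter ranges are the ones listed in Table 1.
def flip_horizontal(img):
    return np.fliplr(img)

def rotate_90(img):
    return np.rot90(img)

def add_gaussian_noise(img, sigma=5.0):
    noisy = img.astype(np.float64) + np.random.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(img.dtype)

def adjust_brightness(img, factor=1.2):
    return np.clip(img.astype(np.float64) * factor, 0, 255).astype(img.dtype)

OPERATIONS = [flip_horizontal, rotate_90, add_gaussian_noise, adjust_brightness]

def augment(img, n_ops=3):
    """Randomly select n_ops distinct operations and apply them sequentially
    to produce one new training image."""
    for op in random.sample(OPERATIONS, n_ops):
        img = op(img)
    return img
```

Chaining several randomly chosen operations yields many more distinct outputs than applying any single operation alone, which is the source of the increased diversity noted above.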

Network Architecture
Computer vision competitions have greatly promoted the development of deep learning. Many advanced design methods were proposed to improve network performance, moving beyond the simple stacking of convolutional layers. We followed the strategy of SqueezeNet [34] and designed the network structure from micro to macro. To save computational cost, the network depth was increased gradually until accuracy no longer improved significantly.

Microstructure of Building Unit
Attention mechanisms usually produce an attention map to highlight the important features, which brings additional computational overhead and increases optimization difficulty. We instead followed the micro-construction of the Network in Network [35] to enhance the features generated by each 3 × 3 convolution (Figure 4). This structure is compatible with the main body of a network and requires no extra branches. Furthermore, the Mlpconv layer receives each whole feature map as input, avoiding the over-compression of information seen in SE (Squeeze-and-Excitation) blocks [27].
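A minimal sketch of such a building unit, assuming a PyTorch implementation with illustrative channel widths (not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class MlpconvBlock(nn.Module):
    """NiN-style building unit (a sketch): a 3x3 convolution whose output is
    enhanced by two stacked 1x1 convolutions, with no extra branches."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # Two 1x1 convolutions act per-pixel on whole feature maps,
            # enhancing features without squeezing them to channel scalars.
            nn.Conv2d(out_ch, out_ch, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```

Because the 1 × 1 convolutions sit in the main path rather than in a side branch, the unit keeps the plain feed-forward topology described above.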

Macro Connection between Building Blocks
In order to address the degradation problem, He et al. [14] developed a residual learning framework that adds input features to output features. Huang et al. [15] adopted a concatenation operation to increase the frequency of feature reuse. Compared with the add operation, concatenation is easier to use because it does not require a 1 × 1 convolution to align input and output channels [34]. In addition, concatenation takes less computation time than element-wise addition [17]. For these reasons, we reused features from previous layers using concatenation. ShuffleNet V2 [17] showed that the amount of feature reuse decays exponentially with the distance between two blocks. To avoid introducing redundancy, we only established connections between adjacent layers (Figure 5).
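The adjacent-layer concatenation can be sketched as follows; the channel counts and depth are illustrative assumptions, not the actual Weakly DenseNet-19 configuration:

```python
import torch
import torch.nn as nn

class AdjacentConcat(nn.Module):
    """Sketch of adjacent-layer feature reuse: each block's input is the
    concatenation of the previous block's input and output. Unlike DenseNet,
    no long-range connections are created, limiting redundancy."""
    def __init__(self):
        super().__init__()
        self.block1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        # Input channels = 3 (original input) + 16 (block1 output), because
        # only the two adjacent feature sets are concatenated.
        self.block2 = nn.Conv2d(3 + 16, 32, kernel_size=3, padding=1)

    def forward(self, x):
        f1 = torch.relu(self.block1(x))
        f2 = torch.relu(self.block2(torch.cat([x, f1], dim=1)))
        return f2
```

Note that no 1 × 1 projection is needed to align channels before the concatenation, which is the ease-of-use advantage over element-wise addition mentioned above.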

Adaptation to Object Scale in the Image
It is well known that CNN models are very sensitive to translations and rotations [36]. We observed that changes in the scale of an object in an image can also affect neural network performance (Figure 6). To find the reason, we borrowed SE blocks [27] to monitor the contribution of features from different building blocks to the classification (Figure 7). After training, we selected several groups of images to examine the feature importance distribution in each SE block (Figure 8). For easy comparison, we divided features into three levels based on their importance values. Table 2 presents the number of features at each level. As the object scale is reduced, the number of high-level features decreases while the number of mid-level features increases. This means that a network has to rely on more mid-level features to identify the class of smaller objects in images.
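The probing procedure can be sketched as an SE block that exposes its channel weights, followed by a simple binning of those weights into three importance levels. The thresholds, reduction ratio, and channel counts below are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block used here only as a probe: its learned
    channel weights indicate how much each feature contributes."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(ch, ch // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(ch // reduction, ch),
            nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))    # squeeze: global average pooling
        return x * w[:, :, None, None], w  # excite, and expose the weights

def importance_levels(weights, low=0.33, high=0.66):
    """Bin per-channel importance values into three levels
    (threshold boundaries are illustrative)."""
    return {
        "low": int((weights < low).sum()),
        "mid": int(((weights >= low) & (weights < high)).sum()),
        "high": int((weights >= high).sum()),
    }
```

Counting channels per bin for images at different object scales reproduces the kind of comparison summarized in Table 2.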

There are many mid-level, high-quality features in the intermediate layers. However, reusing them directly would increase the computational complexity of the classification layer. In addition, the proportion of low-level features is greater in shallower layers than in deeper layers (Table 2). Based on these facts, we proposed a new feature reuse method to improve parameter efficiency. The 1 × 1 convolution in Figure 9 has two purposes.

•
Channel compression: To reduce the number of useless features, the number of output channels of the 1 × 1 convolution is smaller than the number of input channels. This function is similar to the transition layer of DenseNet.

•
Feature retention: Unlike a 3 × 3 convolution, a 1 × 1 convolution performs a simple linear transformation, which largely preserves the input feature information. To further strengthen output feature quality, two 1 × 1 convolutions were stacked after the concatenation operation.
Conventional feature reuse strategies (addition and concatenation) do not consider the discrepancy between features from different layers. More specifically, shallow and deep features differ not only in their distribution characteristics but also in their representation complexity. This complexity difference between adjacent layers can be measured by Equation (1); as the distance between layers increases, the difference becomes more significant. The progressive feature reuse method shown in Figure 9 ensures a strong correlation between concatenated features. In addition, the two 1 × 1 convolutions in each feature reuse block reduce the complexity difference between features from distant layers.
F_i^n = Σ_{j=1}^{k} (F_j^{n−1} * W_j + b_j)    (1)

where F_i^n represents the i-th feature map of the n-th layer, W_j and b_j denote the convolutional kernel and bias corresponding to the feature map F_j^{n−1}, and k is the number of feature maps in the (n−1)-th layer.
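Putting the two purposes together, a bridge connection can be sketched as a concatenation followed by two stacked 1 × 1 convolutions. The module below is a PyTorch sketch with illustrative channel counts, not the exact BridgeNet-19 block:

```python
import torch
import torch.nn as nn

class BridgeConnection(nn.Module):
    """Sketch of a bridge connection: intermediate-layer features are
    concatenated with deeper features, then passed through two 1x1
    convolutions that compress channels (out < in) while largely
    preserving the input information."""
    def __init__(self, shallow_ch, deep_ch, out_ch):
        super().__init__()
        mid_ch = (shallow_ch + deep_ch) // 2  # channel compression
        self.fuse = nn.Sequential(
            nn.Conv2d(shallow_ch + deep_ch, mid_ch, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, shallow, deep):
        # Spatial sizes are assumed to match; in practice the shallow
        # features would be downsampled first.
        return self.fuse(torch.cat([shallow, deep], dim=1))
```

Because each 1 × 1 convolution is a cheap per-pixel linear map, the two stacked convolutions add little computation while smoothing the complexity gap between the concatenated features.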

Experiment Preparation
The overall architecture of BridgeNet for citrus pest and disease recognition is shown in Table 3. Several baseline networks and their variants with a depth similar to BridgeNet were selected for comparison. We replaced the two 1 × 1 convolutions of the Mlpconv block with CBAM (Convolutional Block Attention Module) [28] to compare their performance. In addition, the bridge connection was compared with deformable convolution. We followed the suggestion of Dai et al. [29] and applied deformable convolutions in the last three convolutional layers (with kernel size > 1). All these networks shared the same classification block (Figure 10) and were trained with identical optimization schemes. Before training, the original image dataset was divided into a training set, a validation set, and a test set in the ratio of 4:1:1; the three parts did not contain the same samples, and data augmentation was performed for each model on the training set only. We saved the models that had the highest validation accuracy and examined their generalization ability on the test set. The model hyper-parameters presented in Table 4 were determined by trial and error.

Classification Performance
Table 5 displays the classification accuracy of each model. BridgeNet-19 achieved the highest validation accuracy, followed by VGG-19 with bridge connections and then pre-trained VGG-19. The test accuracy of the models followed the same trend as the validation accuracy, except that the pre-trained VGG-19 ranked second, followed by VGG-19 with bridge connections. As expected, the models trained from scratch produced lower accuracy than their ImageNet pre-trained counterparts.
However, the pre-trained models consumed far more computing resources than their competitors. Notably, the use of deformable convolutions did not improve VGG-16 performance, whereas the application of bridge connections considerably increased both validation and test accuracy. As for additional computational cost, bridge connections created a much smaller burden than deformable convolutions: bridge connections added 5.8 MB to the model size, while deformable convolutions added 40.8 MB. Weakly DenseNet-19 performed better than CBAMNet, which indicates that the two 1 × 1 convolutional layers used for feature enhancement are more effective than the attention mechanism. The accuracy of Weakly DenseNet-19 was further increased by using features from the middle layers for classification. In terms of additional computational cost, BridgeNet-19 spent 12.3 MB and MSN-19 consumed 10.3 MB; however, BridgeNet-19 performed better than MSN-19, demonstrating the higher parameter efficiency of bridge connections. The smaller models took less training time per batch, except for MSN-19, which trained more slowly than BridgeNet-19. Figure 11 depicts the training details of each model. Models with ImageNet pre-training displayed the fastest convergence, while models with branch structures required more epochs to reach their final convergence state, indicating that structurally simpler models are easier to train. To validate the effectiveness of the new feature reuse method in adapting to object scale changes, the confusion matrices of Weakly DenseNet-19 and BridgeNet-19 were compared (Figure 12). Based on this comparison, Figure 13 presents the images that were correctly identified by BridgeNet-19 but misclassified by Weakly DenseNet-19. With the help of bridge connections, BridgeNet-19 has an enhanced ability to correctly classify images with small target objects. The use of bridge connections also improves the discrimination of similar categories, for example, citrus anthracnose and canker (the difference between them is not obvious without close observation).

Ablation Study
We treated the number of bridge connections as a hyper-parameter and explored its impact on network performance. Bridge connections were introduced from top to bottom as the overall number of connections increased, with Weakly DenseNet-19 used as the backbone architecture. Table 6 reports the comparison results. As the number of bridge connections increased, model performance initially improved; when the number was increased to four, accuracy decreased. This indicates that excessive use of shallow information brings more redundancy to the classification layer.

Figure 13. Examples of images misclassified by Weakly DenseNet-19: only images that were misclassified due to the small subject size are presented.

Conclusion and Future Work
In this study, a new CNN model was developed to identify common pests and diseases in citrus plantations. Each building block of the network contained two 1 × 1 convolutions that were used to enhance the features generated by 3 × 3 convolutional layers. Concatenation operations were used for feature reuse. To reduce redundancy, only features from adjacent layers were concatenated. We observed that, as the size of the target object in the image decreased, the use of mid-level features by the classification layer increased. Using this insight to adapt the model to scale changes, a new feature reuse method called bridge connection was designed. Experimental results show that the proposed BridgeNet-19 achieved the highest classification accuracy (95.47%). Compared with pre-trained models, our network also presented higher parameter efficiency; its model size (68.9 MB) was half of the pre-trained VGG-16 and VGG-19 networks.
Training deep CNN models usually requires a large-scale image dataset. However, it is very difficult and expensive to collect so many high-quality, close-up sample images in some fields, such as medicine and biology. Although ImageNet pre-trained models allow researchers to achieve satisfactory results on different types of datasets quickly and easily, they are too bulky and ill-suited to small datasets. We hope to find a better solution to this problem in the future.