Fine-Grained Butterfly Classification in Ecological Images Using Squeeze-And-Excitation and Spatial Attention Modules

Abstract: Most butterfly larvae are agricultural and forest pests, yet butterflies have important ornamental value and the ability to sense and respond to changes in the ecological environment. There are many butterfly species, and research on butterfly species classification is of great significance in practical work such as environmental protection and the control of agricultural and forest pests. Butterfly classification is a fine-grained image classification problem that is more difficult than generic image classification. Common butterfly photographs are mostly specimen photos (indoor photos) and ecological photos (outdoor photos/natural images). At present, research on butterfly classification is mostly based on specimen photos; classification based on ecological photos is comparatively difficult. This paper takes ecological photos as the research object and presents a new classification network that combines a dilated residual network with squeeze-and-excitation (SE) and spatial attention (SA) modules. The SA module makes better use of the long-range dependencies in the images, while the SE module takes advantage of global information to enhance useful features and suppress less useful ones. The results show that the integrated model achieves higher recall, precision, accuracy, and F1-score than state-of-the-art methods on the introduced butterfly dataset.


Introduction
There are many kinds of butterflies. "Systematic butterfly names of the world" [1] records 17 families, 47 subfamilies, 1690 genera, and 15,141 species of butterflies in the world. Among them, there are 12 families, 33 subfamilies, 434 genera, and 2153 species of Chinese butterflies. Not only do butterflies have great ornamental value, they also play an important role in the overall stability of the ecosystem. Butterflies are particularly effective at indicating subtle ecosystem changes, as they have a short lifespan and can respond quickly to these changes, which makes them very helpful for scientists monitoring global climate change. In addition, the larvae of most butterfly species are agricultural and forestry pests, which directly affect the living environment and food sources of humans and animals. Therefore, research on the automatic identification of butterfly species is of great significance not only for species identification itself, but also for practical work such as environmental protection, control of agricultural and forest pests, and border quarantine. Butterfly classification is a fine-grained image classification problem, which is more difficult than generic image classification [2]. This paper presents a new classification network, SESADRN, that combines a dilated residual network with squeeze-and-excitation (SE) and spatial attention (SA) modules: the SA module makes better use of the long-range dependencies in the images, while the SE module uses global information to enhance useful features and suppress less useful features. Experimental results show that the proposed method is superior to other state-of-the-art methods.
The remainder of this paper is organized as follows. Section 2 provides a detailed description of our dataset and the proposed method. The experimental results are reported in Section 3. Section 4 provides the conclusions.

Datasets
This study uses the Chinese butterfly image dataset [22] and the Leeds Butterfly Dataset [24]. Table 1 lists the scientific names of 30 species of butterflies and the number of images per butterfly species. The Chinese butterfly dataset includes specimen images of butterflies (indoor photos) and images of butterflies in the ecological environment (outdoor photos/natural images). Ecological images of butterflies come from field photography and donations from butterfly lovers. The sizes of many of the images reach 5184 × 3456, and the butterfly is not segmented from the background. The mimicry of butterflies also brings great challenges to the detection and recognition of butterflies in ecological images. This paper therefore studies only the problem of butterfly classification in ecological images. This study uses 20 species of butterfly images from the Chinese butterfly dataset. There are 10 images in the smallest category and 61 images in the largest category, for a total of 467 images. Figure 1 shows sample images of some Chinese butterfly species. As can be seen from the figure, butterfly postures and image backgrounds vary widely, and some butterflies are similar in color to the background. All of these pose challenges for classification.
The Leeds Butterfly Dataset contains 10 categories of butterfly images. There are 55 images in the smallest category and 100 images in the largest category, for a total of 832 images, as shown in Figure 2 and Table 1.

Squeeze-And-Excitation Modules
Convolutional neural networks use local receptive fields to fuse channel information and spatial information when extracting informative features. The squeeze-and-excitation (SE) module proposed by Jie Hu et al. [25] can capture channel information and has been applied successfully in a variety of classification tasks [26][27][28][29]. The SE module considers the relationships between feature channels and adds an attention mechanism over them. This mechanism is particularly helpful for improving the recognition ability of the proposed model [19]. For fine-grained classification problems such as butterfly classification, the SE module can weaken the inter-class similarity problem: by assigning different weights to different channels, it enables the network to selectively enhance informative features and suppress useless ones. The structure of the SE module is shown in Figure 3. To achieve global information embedding, global information is squeezed into a channel descriptor by global average pooling. The squeeze operation compresses each two-dimensional feature channel into a real number. This real number has a global receptive field, and the size of the output matches the number of input feature channels. For feature maps $U = [u_1, u_2, \cdots, u_C] \in \mathbb{R}^{C \times H \times W}$, where $u_k \in \mathbb{R}^{H \times W}$ is the feature map at the k-th channel, the k-th element of $z$ ($z \in \mathbb{R}^C$) can be obtained using Equation (1):

$$z_k = F_{sq}(u_k) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_k(i, j) \qquad (1)$$

After the squeeze operation, the excitation operation is performed. The excitation operation relies on the information gathered by the squeeze operation to capture channel-wise dependencies. Weights are generated for each feature channel by parameters W, which explicitly model the correlations between feature channels. The weights output by the excitation after feature selection represent the importance of each feature channel. Multiplication is then used to achieve channel-by-channel weighting of the previous features, completing the rescaling of the original features in the channel dimension.

The excitation operation consists of a 1 × 1 convolution layer, a ReLU layer, a 1 × 1 convolution layer, and a sigmoid activation layer, in order. Its result can be obtained using Equation (2):

$$s = F_{ex}(z, W) = \sigma\big(W_2\,\delta(W_1 z)\big) \qquad (2)$$

where $\delta$ denotes the ReLU function and $\sigma$ denotes the sigmoid function.

Here, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$, with reduction ratio r = 16 [30]. After the squeeze and excitation operations, according to Equation (3), the final output $\tilde{x}_k$ at the k-th channel is obtained by rescaling the feature map $u_k$ with the activation $s_k$:

$$\tilde{x}_k = F_{scale}(u_k, s_k) = s_k \cdot u_k \qquad (3)$$
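To make the module concrete, the following is a minimal PyTorch sketch of an SE block implementing Equations (1)–(3). The class name and interface are our own; only the reduction ratio r = 16 and the 1 × 1 convolution/ReLU/1 × 1 convolution/sigmoid layout come from the text.

```python
import torch
import torch.nn as nn

class SEModule(nn.Module):
    """Squeeze-and-excitation block: a minimal sketch of Equations (1)-(3)."""

    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)  # Eq. (1): global average pooling
        self.excite = nn.Sequential(            # Eq. (2): W1 -> ReLU -> W2 -> sigmoid
            nn.Conv2d(channels, channels // r, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (N, C, H, W) feature maps
        z = self.squeeze(u)  # (N, C, 1, 1) channel descriptor z
        s = self.excite(z)   # (N, C, 1, 1) channel weights s
        return u * s         # Eq. (3): channel-wise rescaling
```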

Spatial Attention Modules
Local features generated by a fully convolutional network (FCN) may cause object classification errors [31]. Xiao Chen et al. [32] argue that the spatial attention (SA) module can capture long-range contextual information. The SA module enhances the representation ability of local features by encoding wide-range contextual information into them. The structure of the spatial attention module is shown in Figure 4.

The global feature map $U \in \mathbb{R}^{C \times H \times W}$ is fed into three 1 × 1 convolutional layers (conv1, conv2, and conv3). The outputs of conv1 and conv2 are reshaped into feature spaces $B, C \in \mathbb{R}^{C_1 \times (H \times W)}$. The spatial attention map $S \in \mathbb{R}^{(H \times W) \times (H \times W)}$ can then be calculated using a softmax layer, as shown in Equation (4):

$$s_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N} \exp(B_i \cdot C_j)}, \quad N = H \times W \qquad (4)$$

The output of conv3 is transformed into a feature map $D \in \mathbb{R}^{C \times H \times W}$. The value of $s_{ji}$ indicates the degree of influence of the i-th position on the j-th position. A new feature map $o = \{o_1, o_2, \ldots, o_N\}$ is obtained by matrix multiplication of the spatial attention map S with the feature map D, as shown in Equation (5):

$$o_j = \sum_{i=1}^{N} s_{ji} D_i \qquad (5)$$

The element-wise summation of the attention feature map o and the feature map U is the final output y of the spatial attention module, as shown in Equation (6), where β is a weight obtained through learning:

$$y = \beta o + U \qquad (6)$$
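The following is a minimal PyTorch sketch of the spatial attention computation in Equations (4)–(6). The class name is our own, and the reduced channel count C1 is not fixed in the text, so channels // 8 is an assumption.

```python
import torch
import torch.nn as nn

class SAModule(nn.Module):
    """Spatial attention block: a sketch of Equations (4)-(6)."""

    def __init__(self, channels: int, c1: int = 0):
        super().__init__()
        c1 = c1 or max(channels // 8, 1)  # C1 is an assumption (not fixed in the text)
        self.conv1 = nn.Conv2d(channels, c1, kernel_size=1)        # produces B
        self.conv2 = nn.Conv2d(channels, c1, kernel_size=1)        # produces C
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=1)  # produces D
        self.beta = nn.Parameter(torch.zeros(1))                   # learned weight in Eq. (6)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        n, ch, h, w = u.shape
        N = h * w
        b = self.conv1(u).view(n, -1, N).permute(0, 2, 1)  # (n, N, C1)
        c = self.conv2(u).view(n, -1, N)                   # (n, C1, N)
        s = self.softmax(torch.bmm(b, c))                  # Eq. (4): attention map (n, N, N)
        d = self.conv3(u).view(n, -1, N)                   # (n, C, N)
        o = torch.bmm(d, s.permute(0, 2, 1))               # Eq. (5): weighted sum over positions
        o = o.view(n, ch, h, w)
        return self.beta * o + u                           # Eq. (6): residual fusion
```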

Overall SESADRN Architecture
The dilated residual network (DRN) [33] has good classification performance. In this paper, we used DRN-D-54 as the backbone to build a classification network with the squeeze-and-excitation (SE) and spatial attention (SA) modules, named SESADRN. The overall SESADRN architecture is shown in Figure 5. The original image is resized to 224 × 224 × 3 as the input image of the model. The outputs of the SE and SA modules (feature maps of size C × H × W) are concatenated and fed into a 1 × 1 convolution layer to obtain a global feature map $A \in \mathbb{R}^{C \times H \times W}$. Finally, global average pooling and a fully connected layer implement the butterfly classification.

Gradient-weighted class activation mapping (Grad-CAM) [34] can provide visual explanations for classification decisions. For the global feature map $A \in \mathbb{R}^{C \times H \times W}$ and the predicted score $y^s$ of the target category s, the weight $w_t^s$ of the target category s corresponding to the t-th feature map $A^t$ ($t \in [1, C]$) can be calculated by Equation (7). After performing the ReLU operation on the weighted combination of the feature maps, a Grad-CAM heatmap is obtained:

$$w_t^s = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \frac{\partial y^s}{\partial A^t_{ij}}, \qquad L^s_{\mathrm{Grad\text{-}CAM}} = \mathrm{ReLU}\!\left(\sum_{t=1}^{C} w_t^s A^t\right) \qquad (7)$$
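As an illustration of how the pieces fit together, here is a hedged PyTorch sketch of the SESADRN head using the SEModule and SAModule sketches above. The backbone is treated as a black-box DRN-D-54 feature extractor; the class name and the fused channel width are our assumptions.

```python
import torch
import torch.nn as nn

class SESADRNHead(nn.Module):
    """Sketch of the SESADRN head on top of a DRN-D-54 backbone (assumed interface)."""

    def __init__(self, backbone: nn.Module, channels: int, num_classes: int):
        super().__init__()
        self.backbone = backbone             # e.g., DRN-D-54 feature extractor
        self.se = SEModule(channels)         # sketched above
        self.sa = SAModule(channels)         # sketched above
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)  # 1x1 fusion conv
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pooling
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u = self.backbone(x)  # (N, C, H, W) feature maps from the 224x224x3 input
        # Concatenate SE and SA outputs and fuse into the global feature map A.
        a = self.fuse(torch.cat([self.se(u), self.sa(u)], dim=1))
        return self.fc(self.pool(a).flatten(1))  # class scores
```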

Experiments and Results
This section evaluates and analyzes our proposed model (SESADRN, using a DRN-D-54 pre-trained model) against other state-of-the-art models for butterfly classification. The comparison models are the squeeze-and-excitation ResNet architecture (SE-Resnet50 [30]), the residual attention network (RAN56 [35], RAN92 [35]), and the dual-attention dilated residual network (DADRN [32], using a DRN-D-54 pre-trained model). All models were pre-trained on ImageNet. All experiments were performed on the PyTorch platform. The models were trained for 300 epochs with a batch size of 16. The parameters were optimized with the Adam optimizer. The initial learning rate was set to 0.0001 and was decayed by a factor of 0.1 every 50 epochs. To avoid overfitting, we used weight decay (1 × 10⁻⁴) and an online data augmentation strategy in training. Four online data augmentation methods were used: random mirroring, random rotation, adding Gaussian noise (0, 0.01), and image resizing. The models were trained on a computer with 32 GB of memory and two NVIDIA GeForce GTX 980 Ti GPUs. Since the number of images per category differs, accuracy, weighted precision, weighted recall, and weighted F1-score were used as evaluation metrics. All models employed the same parameters and configurations. In addition, Grad-CAM [34] is used to visualize the classification results.
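For reproducibility, the training setup described above can be sketched in PyTorch as follows. The rotation angle, the noise standard deviation (reading Gaussian noise (0, 0.01) as variance 0.01, i.e., std 0.1), and the clamping of noisy pixels are our assumptions.

```python
import torch
from torch import nn, optim
from torchvision import transforms

# Online augmentation pipeline from the text (parameters partly assumed).
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),          # resize to the 224 x 224 x 3 model input
    transforms.RandomHorizontalFlip(),      # random mirroring
    transforms.RandomRotation(degrees=30),  # random rotation (angle assumed)
    transforms.ToTensor(),
    # Gaussian noise (0, 0.01): variance 0.01 -> std 0.1 (assumed reading)
    transforms.Lambda(lambda t: (t + 0.1 * torch.randn_like(t)).clamp(0.0, 1.0)),
])

def train(model: nn.Module, train_loader, epochs: int = 300) -> None:
    """Training loop with the hyperparameters reported in the text."""
    optimizer = optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()  # decay the learning rate by 0.1 every 50 epochs
```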


Performance Comparison with Other Methods
We compared the performance of the different models; the results are shown in Figure 6 and Tables 2 and 3. Figure 6 shows the accuracy and loss of training and validation for each epoch. As shown in Figure 6a,c, the training results of DADRN and SESADRN are similar, and they are significantly better than those of SE-Resnet50, RAN56, and RAN92. The loss and accuracy of the DADRN and SESADRN models stabilize more quickly, that is, they converge faster. For validation, the validation loss and accuracy curves show even more clearly that the SESADRN model has lower loss and higher accuracy.

Table 2 lists the accuracy, weighted precision, weighted recall, and weighted F1-score of each model on the test set. As shown in Table 2, the proposed method is the best of the five models on all four evaluation metrics. The accuracy, weighted precision, weighted recall, and weighted F1-score of the proposed method are 0.956, 0.954, 0.948, and 0.960, respectively. We can also see that the DADRN model achieved better results than the other three models (SE-Resnet50, RAN56, RAN92), which is consistent with the results in Figure 6. The class-wise classification accuracies of the models are compared in Table 3.

Next, we compared the proposed method (SESADRN) with the same method without a pre-trained model (SESADRNNoPretrained). The comparison results, shown in Figure 7, demonstrate that the use of pre-trained models can significantly improve recognition accuracy.
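As a side note, the weighted metrics reported in Table 2 can be computed with scikit-learn as in the following sketch; the function name and return format are our own.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred) -> dict:
    """Accuracy plus class-size-weighted precision/recall/F1, as in Table 2."""
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}
```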

Classification Analysis Based on Grad-CAM
Class activation mapping (CAM) can give a good visual explanation for classification results and can achieve weakly supervised localization of the target object (the butterfly in this study) [25,36]. For different network models, even if all of them make identical predictions, Grad-CAM can tell us which network is the "stronger" classification network. We used Grad-CAM to provide visual explanations of the classification performance of our proposed SESADRN. The butterfly images of the Chinese butterfly dataset and their Grad-CAMs are shown in Figure 8a,b, respectively; the butterfly images of the Leeds Butterfly Dataset and their Grad-CAMs are shown in Figure 8c,d, respectively. By visualizing the prediction-specific regions of the images with Grad-CAM, we can see that the SESADRN classification model locates the butterfly region in the image well.

We also compared the Grad-CAMs of the proposed SESADRN with those of DADRN [32], which achieved the second-best performance, as shown in Tables 2 and 3. The original sample images are shown in Figure 9a, and their Grad-CAMs obtained by SESADRN and DADRN are shown in Figure 9b,c, respectively. Consistent with the results shown in Figure 8, the proposed SESADRN localizes the target object (butterfly) well and enhances the meaningful features (Figure 9b), while DADRN does not localize the target object well (Figure 9c). The visualization of the Grad-CAMs provides a visual explanation of why the proposed SESADRN is superior to DADRN.
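Below is a minimal sketch of the Grad-CAM computation in Equation (7), assuming a feature layer that outputs the global feature map A; the hook-based implementation and function signature are our own.

```python
import torch
import torch.nn.functional as F

def grad_cam(model: torch.nn.Module, image: torch.Tensor,
             target_class: int, feature_layer: torch.nn.Module) -> torch.Tensor:
    """Grad-CAM heatmap for one (C, H, W) image, following Equation (7)."""
    feats, grads = {}, {}
    # Hooks capture the feature map A and its gradient during backprop.
    h1 = feature_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = feature_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    score = model(image.unsqueeze(0))[0, target_class]  # predicted score y^s
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    w = grads["g"].mean(dim=(2, 3), keepdim=True)  # Eq. (7): w_t^s = GAP of dy^s/dA^t
    cam = F.relu((w * feats["a"]).sum(dim=1))      # ReLU of the weighted combination
    return cam / cam.max().clamp(min=1e-8)         # normalized (1, H, W) heatmap
```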


Conclusions and Future Work
In this work, we proposed the SESADRN model for fine-grained butterfly classification in ecological images. In the SESADRN model, the SA module makes better use of the long-range dependencies in the images, while the SE module takes advantage of global information to enhance useful features and suppress less useful ones. The results show that the SESADRN model achieves better classification performance on the given butterfly datasets than other state-of-the-art models.
However, some issues remain in our research: (1) the number of butterfly images in most categories is small, and (2) some categories consist essentially of multiple images of the same butterfly taken from different angles, while some butterflies have only one image whose background is completely different from those of the other images. These factors can cause classification errors. Therefore, the following ideas could further improve the performance of the proposed method: (1) augment the butterfly database by collecting more butterfly photos;
(2) use more appropriate models to classify butterflies, such as introducing a few-shot learning algorithm into the model. We also plan to apply the proposed method to other applications.