MRDA-MGFSNet: Network Based on a Multi-Rate Dilated Attention Mechanism and Multi-Granularity Feature Sharer for Image-Based Butterflies Fine-Grained Classification

Abstract: To address the high background complexity of some butterfly images and the difficulty of identifying butterflies caused by their small inter-class variance, we propose a new fine-grained butterfly classification architecture called Network based on Multi-rate Dilated Attention Mechanism and Multi-granularity Feature Sharer (MRDA-MGFSNet). First, to effectively identify similar patterns between butterflies and to suppress background information that resembles butterfly features but is invalid, we design a Multi-rate Dilated Attention Mechanism (MRDA) with a symmetrical structure, which assigns different weights to channel and spatial features. Second, by fusing a multi-scale receptive field module with a depthwise separable convolution module, we propose a Multi-granularity Feature Sharer (MGFS), which better handles the recognition problem caused by small inter-class variance and limits the increase in parameters caused by multi-scale receptive fields. To verify the feasibility and effectiveness of the model in a complex environment, we compared it with existing methods; the proposed method obtained an mAP of 96.64% and an F1 value of 95.44%, which shows that it performs well on the fine-grained classification of butterflies.


Introduction
Butterflies are an important part of the ecosystem. In recent years, people have gradually realized that certain species of butterflies can provide antibiotics that may be vital to saving lives, which shows that they have important medicinal value. Certain habits of butterflies are also very helpful to scientists who monitor global climate change. Studying the categories of butterflies helps to record and study their habits and to protect them from extinction, so the recognition and classification of butterfly images is an important part of butterfly protection. However, the natural environment is complex and harsh, and the backgrounds of images on the internet are also highly complex. Moreover, in practice, butterfly subcategories must be separated by slight local differences, because their inter-class differences are small. For example, the patterns of Danaus genutia and the Monarch butterfly are highly similar; they differ mainly in the shape of the lateral band of the front wing and the shape of the hind wing, and these differences are relatively small. To distinguish butterfly categories clearly even when the feature differences between subcategories are very small, it is necessary to perform fine-grained classification of butterfly images. Fine-grained image classification, also called subcategory recognition, usually involves large intra-class differences (in object posture, brightness, scale, occlusion, background and angle) compared with the inter-class differences. Achieving fine-grained image classification with a weakly supervised method is not easy, especially when only a limited amount of data is available in each category and there are no additional manual labels for butterfly parts.
Moreover, butterfly images are easily contaminated by noise during collection and transmission, which lowers their quality or blurs them. Compared with coarse-grained image classification, fine-grained image classification focuses on smaller but significant local features, which are harder to attend to because of the slight inter-class differences between subcategories and the large intra-class differences. Early fine-grained image classification algorithms based on hand-crafted features usually extracted SIFT (Scale Invariant Feature Transform) [1], HOG (Histogram of Oriented Gradients) [2] or other local image features, and then used the Vector of Local Aggregation Descriptor (VLAD) [3], the Fisher vector [4] or other coding models for feature coding. Because selecting hand-crafted features is complex and their expressive ability is limited, the classification performance of these algorithms is poor. With the rapid progress of deep learning, the features obtained from a Convolutional Neural Network (CNN) [5] have more powerful expressive capabilities than hand-crafted ones. Researchers have therefore proposed a large number of CNN algorithms, which have advanced fine-grained image classification. In recent years, the CNN has achieved remarkable results in general image classification and has opened new directions for fine-grained image classification, and researchers have begun to choose CNN features for classification. However, because different CNN models have different structures, their recognition capabilities and results on butterfly images differ greatly. Therefore, in this paper, we mainly study the following problems:
(1) High background complexity of butterfly images. When most algorithms are applied to actual butterfly classification, they are easily influenced by the natural environment. When the CNN extracts features from butterfly images, it easily extracts "interference features" that are similar to butterfly features, such as dead trees, flowers, plant leaves and rocks. These features seriously interfere with normal butterfly feature recognition and extraction.
(2) Similarities between butterfly subcategories. The shape and area of butterfly patterns of the same category are not exactly the same, which makes feature extraction by neural networks very difficult, so it is often hard to obtain satisfactory precision. Thus, the neural network needs to perform a more detailed feature extraction, and deeper network layers are needed for it to learn more detailed features.
To address the two problems mentioned above, Xin et al. [6] used SESADRN to focus on the image features of butterflies, but some errors remained in focusing on butterfly features. Tan A et al. (2020) [7] used ResNet-101 and the Feature Pyramid Network (FPN) as a feature extraction network to extract butterfly features, and used Mask R-CNN to automatically perform butterfly identification and fine-grained classification on images captured in the natural environment, which achieved good results. However, because of the high complexity of the background, the details extracted by the Region Proposal Network (RPN) are not comprehensive enough, and some extracted regions do not even match, so the classification accuracy is still insufficient. Therefore, we propose a fine-grained butterfly classification method based on MRDA-MGFSNet, which can identify different categories of butterflies more effectively.
The main contributions of this paper are as follows:
(1) To address the high background complexity of some butterfly images and the difficulty of identifying butterflies caused by their small inter-class variance, we propose MRDA-MGFSNet, designed as follows:
a. A Multi-rate Dilated Attention Mechanism with a symmetrical structure suitable for fine-grained butterfly classification is proposed. This module assigns different weights to channel and spatial features in order to keep the more important butterfly features and discard redundant information such as complex natural background information. At the same time, the module integrates dilated convolutions of different rates to expand the visual field of the network and obtain rich context information. It is effective when butterfly images are difficult to recognize under the interference of a complex background.
b. A Multi-granularity Feature Sharer is designed. This module effectively integrates the overall features of butterflies and preserves and extracts similar spots, patterns and other feature information. While effectively handling the small inter-class variance of butterfly spots, it connects a 2-dimensional channel-by-channel (depthwise) convolution with a 3-dimensional point-by-point (pointwise) convolution, which compensates for the increase in parameters caused by the multi-scale structure, saves training time and improves the efficiency of the network.
(2) The method of this paper obtains an mAP of 96.64% for the recognition of five categories of butterflies, and the F1 value reaches 95.44%. It is effective at distinguishing butterflies with similar patterns, spots and other features, and it performs well for butterfly classification in complex natural environments. This enables butterfly experts and scholars to better use this technology in butterfly identification to record and study butterfly habits and protect butterflies from extinction, which ultimately helps to protect the ecosystem from damage.
The rest of this paper is organized as follows: Section 2 introduces the related work. Section 3 introduces the materials and methods. Section 4 presents the experiments and result analysis. Finally, Section 5 concludes the paper.

Related Work
In recent years, many researchers have made great contributions to the fine-grained classification of butterflies. For example, Zhang J W [8] used 26 morphological features of the color features of the front and the front and rear wings to identify 43 butterfly specimens, thereby obtaining a relatively good precision. Liu F [9] established a radial basis neural network model by using the color features of the front and back of the butterfly, which also achieved a good result. Kaya Y et al. [10] proposed an image feature extraction method based on Gabor filtering and an extreme learning machine (ELM) to recognize five categories of butterflies with higher precision. Combined with an artificial neural network classifier, Kaya Y and Kayci L [11] used the RGB color features of the butterfly wing surface and gray-level co-occurrence matrix features to identify 14 butterfly species in Turkey. This method was improved in 2014: combining the Gray-Level Co-occurrence Matrix (GLCM) with multinomial logistic regression enabled the automatic identification of 19 categories of butterflies. Kang S H et al. [12] proposed a butterfly recognition method that extends the training set by observing butterfly images from multiple angles. Hernández-Serna et al. [13] devised 15 special features of plants, butterflies and fish from three aspects (image morphology, geometric construction and pattern features) and used neural networks for training and species identification. Zhou A M et al. [14] used CaffeNet to identify butterfly pattern images; the identification result was not significantly different from that of traditional SVM methods, but the identification accuracy on butterfly ecological images was much higher. Juan-Ying X et al. [15] used the Faster R-CNN algorithm to identify and classify butterfly images captured in the natural environment, achieving good results.
Fine-grained classification also has many applications in agriculture. For example, Aijiao Tan et al. [16] proposed a surface defect identification method for citrus based on the KF-2D-Renyi and ABC-SVM algorithms to better detect and classify citrus surface defects, reaching an average accuracy of about 98%. Xiao Chen et al. [17] used a new Both-channel Residual Attention Network model (B-ARNet) to identify tomato leaf diseases and achieved an accuracy of about 88%. S. Huang et al. [18] proposed a Non-Local Progressive Average Denoising algorithm combined with a new parallel convolutional neural network to identify peach diseases, and achieved an average accuracy of 88%.
The above results show that CNN features perform better in fine-grained image classification than traditional methods, although butterfly recognition needs more research and its classification accuracy can still be improved. Furthermore, existing fine-grained butterfly classification algorithms are generally based on butterfly specimens and tend to be simple image classification tasks that extend poorly to ecological images, which needs further research. Therefore, in this paper we propose a model suitable for butterfly image identification and classification in the natural environment, namely fine-grained butterfly classification based on MRDA-MGFSNet.

Data Acquisition
The first part of the dataset came from sources such as Baidu Images, Google Images, personal photography collections, blogs and social media (5176 images). Images with poor quality or unclear target objects were removed, leaving 4535 images. Almost all butterfly images in the dataset were captured in the natural environment; only a small part were butterfly specimen images. Based on the classification labels on the original websites, the authors re-filtered and categorized the images with reference to professional books. The second part of the dataset came from Kaggle, GitHub, Google dataset search engines, websites provided by research reports, etc.; a total of 11,249 images were collected. Although these images come from authoritative websites and have labels, some classification errors were inevitable in the dataset, so the authors checked and reclassified them. In total, 15,784 images were obtained from these two parts of the dataset retrieval.
The quantitative distribution of the five categories of butterflies we collected is shown in Table 1. As can be seen from Table 1, the butterflies in the dataset are almost all set against a complex natural background. We chose these five categories as our research object because they share many similar features, including spots, patterns, shapes and edges. Studying the classification of these five categories also provides a reference for the study of other butterfly categories. Within the same category, the image backgrounds differ greatly, while the shape and color of the target (butterfly) are very similar across categories.

MRDA-MGFSNet
In the collected butterfly images, we found that, as shown in Figure 1a-d,f, the butterflies have features that are very similar to the backgrounds. For example, the shape and color of the butterfly in Figure 1a are very close to those of the stone in the background, and the color and shape of the butterflies in Figure 1b-d,f almost blend with the flowers in the backgrounds. Such complex backgrounds are easily recognized as part of the butterfly by the neural network, causing recognition errors and reducing the recognition accuracy. In addition, the patterns of some petals are similar to butterfly patterns. For example, the horizontal and vertical spider-web-like patterns on the left and right wings of the Monarch butterfly resemble the edges of the petals. It is difficult to distinguish these features by using only spatial attention or only channel attention, but if the two are fused, the neural network is more likely to succeed. For another example, as shown in Figure 1c [...]. Therefore, the recognition difficulty caused by the complexity of the backgrounds and the small inter-class variance of butterflies needs to be solved urgently.
To solve the above problems, MRDA-MGFSNet was designed. Its core idea is to use the MGFS structure to give the main network multi-scale receptive fields, which prevents the loss of subtle features. At the same time, the MRDA module is used to focus on the important features of butterflies and discard invalid background information, which effectively reduces the amount of parameter calculation and the training time of the network.
MRDA-MGFSNet has three parts, and the model is defined as follows:
1. The first part was used to extract features. It consisted of 64 convolution kernels of size 7 × 7 with a stride of 2, whose purpose was to quickly extract various edge features and reduce the size of the image to half of the original size. A max pooling layer of size 3 × 3 then retained the main features while reducing the number of parameters and calculations, preventing over-fitting and improving the generalization ability of the model.

2. The second part was composed of 16 MRDA modules and MGFS modules (explained in detail below). The MGFS module was composed of two 1 × 1 convolutions and four 3 × 3 convolutions at different scales, which were used to attend to similar spots, patterns and other small feature information of butterflies. The 3 × 3 convolution used a two-dimensional channel-by-channel convolution and a three-dimensional point-by-point convolution, whose purpose was to reduce the amount of parameter calculation and speed up network training. The MRDA module first passed the input through three dilated convolutions with rate = 1, 2, 4, and then applied the channel attention mechanism, which consists of a max pooling layer and two 1 × 1 convolutional layers. Next, it applied the spatial attention mechanism, which consists of an average pooling layer, a max pooling layer and two 3 × 3 convolutions. The module assigned different weights to channel and spatial features in order to distinguish similar patterns in the butterfly's feature maps, suppress background information that resembles butterfly features but is invalid, and enhance the expressive ability of the network. Finally, the feature map obtained in the first layer was added to the output of the attention mechanism, and the PReLU activation function was used to enhance the nonlinear expression ability of the network.

3. In the last part, an average pooling down-sampling layer was connected to a fully connected layer, and finally the output was converted into a probability distribution through softmax to obtain the classification result of the butterfly image.
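The halving of the image size in the first part follows from ordinary output-size arithmetic for convolution and pooling layers. The sketch below illustrates this under assumptions that the text does not state: a 224-pixel input, a padding of 3 for the 7 × 7 convolution, and a stride of 2 with a padding of 1 for the 3 × 3 max pooling. The function name `conv_output_size` is hypothetical.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution or pooling layer (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1


# Assumed values (not given in the text): input size 224, padding 3 for the
# 7x7 convolution, and stride 2 / padding 1 for the 3x3 max pooling.
input_size = 224
after_conv = conv_output_size(input_size, kernel=7, stride=2, padding=3)
after_pool = conv_output_size(after_conv, kernel=3, stride=2, padding=1)

print(after_conv)  # 112, half of the assumed input size
print(after_pool)  # 56
```

With these assumed settings, the stride-2 convolution alone accounts for the halving; the pooling stride only matters for the subsequent size.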
The overall structure of the MRDA-MGFSNet is shown in Figure 2.
The spatial attention mechanism [19] attends to the importance of the spatial location of a feature (spatial feature): it generates spatial attention weights for the output feature map and strengthens or suppresses features at different spatial locations according to these weights.
The channel attention mechanism [20] focuses on the importance of different feature channels (for example, edge features, since each of the C channels is a complete feature image produced by a convolution kernel). In a convolutional neural network, an image feature matrix (H, W, C) is generated after a two-dimensional image is passed through the convolution kernels, where H and W represent the spatial scale of the image, that is, its height and width, and C represents the number of feature channels. The mechanism models the importance of each feature channel, assigns a weight to each channel, and strengthens or suppresses different channels according to the task requirements.
Spatial attention ignores the information in the channel domain and treats the image features in every channel equally. This limits the spatial transformation to the original image feature extraction stage, and it is hard to interpret when applied to other layers of the neural network. The channel attention mechanism, in turn, directly applies global average pooling to all the information in a channel and ignores the local information within each individual channel.
Because butterfly features are fine-grained, a complex background strongly affects their recognition. Therefore, attention must be paid to the spatial position, and the images in each channel should not be treated equally but given different weights. Combining the two ideas, MRDA was designed. Its spatial attention uses a symmetrical multi-scale structure with dilated convolutions of different rates. This structure transmits the data stream of butterfly features in a symmetrical manner, which gives MRDA a more comprehensive ability to retain the complete butterfly features. Compared with standard convolution, dilated convolution expands the receptive field and captures multi-scale information without introducing additional parameters. In this way, the network's visual perception domain is expanded and rich contextual information is obtained. The PReLU activation function was used to improve the learning convergence of the network, and the channel attention module used max pooling, which retains more butterfly texture features.
The structure of MRDA is shown in Figure 3. MRDA first applies multi-scale dilated convolutions with rates of 1, 2 and 4 to the input feature maps to expand the receptive field and obtain richer butterfly detail information. The attention mechanism we designed is then applied to obtain three different output feature maps.
The upper part of Figure 3 shows the channel attention module designed in this paper. The max pooling layer used in this paper retains more butterfly pattern features. After the pooling layer, a 1 × 1 convolutional layer performs dimensionality reduction to reduce the number of channels. Another 1 × 1 convolution kernel at the output increases the dimensionality again; the reduction and the subsequent increase are used to exchange information between channels. Then, the PReLU activation function is applied to obtain the result of the channel attention. The definition of the channel attention module is given in Equation (1).
The lower part of Figure 3 shows the spatial attention module designed in this paper. The feature maps output by the channel attention module are used as the input of the spatial attention module. The input feature map is compressed along the channel dimension using average pooling and max pooling, the two results are concatenated, and two 3 × 3 convolutions are applied to extract receptive fields. Finally, the spatial attention feature maps are generated through the PReLU activation function.
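The two attention steps described above can be sketched in plain Python on nested lists. This is a minimal illustration and not the authors' implementation. The following are assumptions: the function names, the PReLU slope of 0.25, the channel counts of the two 3 × 3 convolutions (2 to 1 to 1), the toy weights, and the choice to apply the attention values to the features by multiplication. The dilated convolution branches are omitted.

```python
from typing import List

Matrix = List[List[float]]   # H x W
Tensor = List[Matrix]        # C x H x W


def prelu(x: float, slope: float = 0.25) -> float:
    """PReLU: identity for positive inputs, a (normally learned) slope otherwise."""
    return x if x > 0 else slope * x


def matvec(weights: Matrix, vec: List[float]) -> List[float]:
    """A 1x1 convolution on a pooled C-vector is a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def channel_attention(x: Tensor, w_reduce: Matrix, w_expand: Matrix) -> Tensor:
    # Global max pooling: one value per channel.
    pooled = [max(max(row) for row in ch) for ch in x]
    # 1x1 conv reducing the channels, 1x1 conv restoring them, then PReLU.
    weights = [prelu(v) for v in matvec(w_expand, matvec(w_reduce, pooled))]
    # Assumption: each channel is rescaled by its weight.
    return [[[w * v for v in row] for row in ch] for w, ch in zip(weights, x)]


def conv3x3(maps: Tensor, kernels: Tensor) -> Matrix:
    """Zero-padded 3x3 convolution from len(maps) input channels to one output."""
    h, w = len(maps[0]), len(maps[0][0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for ch, ker in zip(maps, kernels):
                for di in range(3):
                    for dj in range(3):
                        yy, xx = i + di - 1, j + dj - 1
                        if 0 <= yy < h and 0 <= xx < w:
                            out[i][j] += ker[di][dj] * ch[yy][xx]
    return out


def spatial_attention(x: Tensor, k1: Tensor, k2: Tensor) -> Tensor:
    c, h, w = len(x), len(x[0]), len(x[0][0])
    # Compress the channel dimension with average and max pooling.
    avg = [[sum(x[n][i][j] for n in range(c)) / c for j in range(w)] for i in range(h)]
    mx = [[max(x[n][i][j] for n in range(c)) for j in range(w)] for i in range(h)]
    # Concatenate the two maps, apply two 3x3 convolutions, then PReLU.
    hidden = conv3x3([avg, mx], k1)
    attn = [[prelu(v) for v in row] for row in conv3x3([hidden], k2)]
    # Assumption: every channel is rescaled position-wise by the attention map.
    return [[[attn[i][j] * x[n][i][j] for j in range(w)] for i in range(h)]
            for n in range(c)]


def center_kernel(value: float) -> Matrix:
    """Toy 3x3 kernel whose only non-zero entry is the center."""
    return [[0.0, 0.0, 0.0], [0.0, value, 0.0], [0.0, 0.0, 0.0]]


# Toy input with C = 2 channels and H = W = 2; all weights are illustrative.
features = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, -1.0], [2.0, 0.0]]]
after_channel = channel_attention(features, [[0.5, 0.5]], [[1.0], [0.5]])
out = spatial_attention(after_channel,
                        [center_kernel(0.5), center_kernel(0.5)],
                        [center_kernel(1.0)])
print(out[0][0][0])
```

The channel attention output feeds the spatial attention, matching the order described in the text. In a real network all weights would be learned.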
The definition of spatial attention module is shown in Equation (2): The dilated convolution used in the MRDA module in this paper is a method of data sampling on feature maps. It can increase the receptive fields without affecting the resolution to make up for the loss of information. Receptive fields refer to the area size mapped on the original image by the pixels on the feature map output by each layer of the network. The calculation method of the receiving field is shown in Equation (3): In Equation (3), r i represents the side length of the receptive field of the i-th layer, and l represents the coefficient of the dilated convolution.
As shown in Figure 4, for the same kernel size, different dilation coefficients lead to different receptive fields. In Figure 4a, the coefficient was 1, which is no different from traditional convolution. In Figure 4b, the coefficient was 2, and the receptive field was expanded to 7 × 7. In Figure 4c, the coefficient was 4, and the receptive field was expanded to 15 × 15. Dilated convolution gives the convolution a wider view and can capture longer-range dependencies at the same computational cost. It is suitable for situations that require a wider view but cannot afford multiple convolutions or larger convolution kernels. Therefore, in the feature fusion and down-sampling part of the network, we chose a dilated convolution with a kernel size of 3 × 3 to increase the receptive field without changing the size of the feature map, which improves the efficiency of feature extraction.
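The receptive field sizes quoted for Figure 4 can be checked with a short sketch. It assumes, without confirmation from the text, that the 3 × 3 layers with coefficients 1, 2 and 4 are stacked with stride 1, and it uses the commonly cited recurrence r_i = r_(i-1) + (k - 1) × l_i with r_0 = 1. That recurrence is an assumption and may differ in form from Equation (3); the function name is hypothetical.

```python
def stacked_receptive_fields(kernel_size, rates):
    """Side length of the receptive field after each stacked stride-1 layer."""
    fields = []
    r = 1  # a single input pixel sees only itself
    for rate in rates:
        r += (kernel_size - 1) * rate  # each layer adds (k - 1) * rate pixels
        fields.append(r)
    return fields


fields = stacked_receptive_fields(3, [1, 2, 4])
print(fields)  # [3, 7, 15]
```

Under these assumptions the three layers yield 3 × 3, 7 × 7 and 15 × 15 receptive fields, matching the values given for Figure 4a to 4c.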

Multi-Granularity Feature Sharer (MGFS)
MGFS was used to overcome the limitation of identifying butterflies at a single scale, improve the adaptability of the network, effectively integrate more comprehensive butterfly features, and preserve and extract information such as similar spots between butterflies. The structure of the Multi-granularity Feature Sharer is shown in Figure 5 and is described as follows:
(1) Generally, larger convolution kernels have a stronger ability to perceive large target objects, and small convolution kernels are better at extracting features of small targets. However, the quality of butterfly images varies. Some were butterfly specimens with little background information, and some had complex backgrounds in which the targets were not easy to find. Therefore, we added branches with receptive fields of different sizes and used convolution kernels with sizes of 3 × 3, 5 × 5 and 7 × 7 to improve the recognition accuracy.
(2) The MGFS structure evenly divided the feature maps obtained after the 1 × 1 convolution into 4 scales, among which the 3 × 3 convolutions used depthwise separable convolution to reduce the number of parameters and the amount of computation.
(3) The PReLU activation function replaced the ReLU or Sigmoid activation function to improve the convergence of the network.
(4) As the number of butterfly images was relatively small, group normalization (GN), which is not affected by the batch size, was used to replace the batch normalization (BN) layer to improve the convergence of the network, and the batch size was set to 10.
This structure enabled the neural network to learn more detailed features of butterflies, such as spots of similar color and size and their spatial distribution, and greatly improved the accuracy of recognizing subtle features among butterflies. Figure 6 is a schematic diagram of the detailed design of the MGFS structure.
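Item (4) rests on the fact that group normalization computes its statistics within a single image, so the result does not depend on the batch size. The sketch below illustrates this in plain Python. It is a simplified illustration, not the layer used in the paper: the learnable scale and shift are omitted, and the function name and toy values are assumptions.

```python
import math


def group_norm(sample, num_groups, eps=1e-5):
    """Normalize ONE image: `sample` is a list of channels, each a flat list.

    Mean and variance are computed per group of channels inside this single
    image, so no other image in the batch influences the result.
    The learnable per-channel scale and shift are omitted for brevity.
    """
    channels = len(sample)
    if channels % num_groups != 0:
        raise ValueError("channels must be divisible by num_groups")
    per_group = channels // num_groups
    out = []
    for g in range(num_groups):
        group = sample[g * per_group:(g + 1) * per_group]
        values = [v for ch in group for v in ch]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * scale for v in ch] for ch in group])
    return out


# Toy image with 4 channels of 2 activations each (hypothetical values).
image = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
normed = group_norm(image, num_groups=2)
print(normed[0], normed[1])
```

Because the function takes only one image as input, a batch of 10 and a batch of 1 produce the same normalized output for that image, which is the property the text relies on.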
The multi-scale receptive field also increased the number of parameters and the amount of computation. In addition, this paper used the multi-scale structure together with multiple small 1 × 1 and 3 × 3 convolution kernels, so the network was also deeper. Therefore, we introduced the depthwise separable convolution used by F. Chollet [21], whose work enables large and complex neural networks to run more efficiently. As shown in Figure 7, the idea of depthwise separable convolution is to separate the traditional convolution operation into two steps. First, depthwise convolution is performed; that is, a one-to-one 2-D convolution is applied to each channel of the input feature map to reduce the parameter calculations. Then, a 1 × 1 convolution kernel performs a traditional (3-D) convolution to combine the features of each channel, which is also known as point-wise convolution. The structure of the depthwise separable convolution is shown in Figure 6.

Suppose that the size of the input feature map is $S_{IN} \times S_{IN}$, the number of channels is $C$, the size of the convolution kernel is $S_{Kernel} \times S_{Kernel}$, and there are $N$ kernels in total. The calculation amounts of traditional convolution and depthwise separable convolution are shown in Equations (4) and (5):

$$S_{IN} \times S_{IN} \times C \times N \times S_{Kernel} \times S_{Kernel} \tag{4}$$

$$S_{IN} \times S_{IN} \times C \times S_{Kernel} \times S_{Kernel} + S_{IN} \times S_{IN} \times C \times N \tag{5}$$

Therefore, the calculation ratio of depthwise separable convolution to traditional convolution is:

$$ratio = \frac{S_{IN} \times S_{IN} \times C \times S_{Kernel} \times S_{Kernel} + S_{IN} \times S_{IN} \times C \times N}{S_{IN} \times S_{IN} \times C \times N \times S_{Kernel} \times S_{Kernel}} = \frac{1}{N} + \frac{1}{S_{Kernel} \times S_{Kernel}}$$

It can be seen that the reduction in the calculation amount of the depthwise separable convolution is related to the size $S_{Kernel} \times S_{Kernel}$ of the two-dimensional convolution kernel and the number $N$ of three-dimensional convolution kernels. In practice, the depthwise separable convolution generally uses a 3 × 3 convolution kernel.
If the number of output channels is 64, the calculation amount of the depthwise separable convolution can be calculated by Equation (6), which is only about 1/10 of the calculation amount of the traditional convolution.
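The ratio can be evaluated numerically with a short sketch based on the two calculation amounts that form its numerator and denominator. The function names are hypothetical, and the feature-map size and input channel count below are assumed values; they cancel in the ratio, which simplifies to 1/N + 1/(S_Kernel × S_Kernel).

```python
def standard_conv_cost(s_in, c, n, k):
    """Multiplications for a traditional convolution (denominator of the ratio)."""
    return s_in * s_in * c * n * k * k


def depthwise_separable_cost(s_in, c, n, k):
    """Depthwise step plus 1 x 1 point-wise step (numerator of the ratio)."""
    depthwise = s_in * s_in * c * k * k
    pointwise = s_in * s_in * c * n
    return depthwise + pointwise


# s_in and c are hypothetical; n = 64 output channels and k = 3 follow the text.
s_in, c, n, k = 56, 64, 64, 3
ratio = depthwise_separable_cost(s_in, c, n, k) / standard_conv_cost(s_in, c, n, k)
closed_form = 1 / n + 1 / (k * k)
print(round(ratio, 4), round(closed_form, 4))
```

For N = 64 and a 3 × 3 kernel this gives 1/64 + 1/9 ≈ 0.127, which is of the same order as the roughly 1/10 quoted in the text.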

Experimental Environment and Preparation
The hardware was as follows: the processor was an AMD 4800H, the GPU was an RTX 2060, and the video memory was 6 GB.
All images were resized to a unified input size of 256 × 256, and a total of 15,784 images were obtained. Because we used 10-fold cross-validation for training, the images were divided into 11,364 for the training set and 1263 for the validation set according to a 9:1 ratio. In addition, the test set contained 3157 images.

Results and Analysis
Because our dataset did not contain enough images, we used 10-fold cross-validation for training to improve the accuracy and reliability of the model. Cross-validation is also called rotation estimation. Most of the samples in a given modeling set are taken out to build a model, and a small part is left for prediction with the newly built model; the prediction error on this small part is calculated and its sum of squares is recorded. This process continues until every sample has been predicted once and only once. For example, 10-fold cross-validation divides the butterfly dataset into ten parts and takes turns using nine parts for training and one part for validation; the average of the ten results is used as an estimate of the accuracy of the algorithm. We repeated the 10-fold cross-validation 10 times to obtain a more accurate and reliable estimate. The advantage of this method is that it repeatedly uses randomly generated sub-samples for both training and validation, and each sample is used for validation exactly once per round. In order to verify the recognition effect of MGFS on subtle butterfly differences such as similar spots, we used 10-fold cross-validation with 90% of the samples for training and 10% for validation. The dataset included all the images of five categories of butterflies: Argynnis hyperbius, Monarch butterfly, Polygonia caureum, Danaus genutia and Papilio machaon. All other experimental settings were the same.
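The repeated 10-fold procedure can be sketched as an index generator. This is a minimal illustration, not the authors' code: the shuffling scheme, the seed and the function name are assumptions, the model training step is omitted, and 12,627 is the sum of the training and validation counts reported earlier.

```python
import random


def repeated_kfold(num_samples, k=10, repeats=10, seed=0):
    """Yield (repeat, fold, train_indices, val_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for rep in range(repeats):
        indices = list(range(num_samples))
        rng.shuffle(indices)  # a fresh random partition for every repeat
        folds = [indices[i::k] for i in range(k)]
        for held_out in range(k):
            val = folds[held_out]
            train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
            yield rep, held_out, train, val


# 11,364 training + 1263 validation images = 12,627 samples per split.
splits = list(repeated_kfold(num_samples=12627))
rep, fold, train, val = splits[0]
print(len(splits), len(train), len(val))  # 100 11364 1263
```

Each of the 100 splits would be used to train and validate one model, and the reported accuracy would be the average over all of them. Because 12,627 is not divisible by 10, some folds hold 1262 validation images instead of 1263.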

a. Ablation experiment
First, the experiment adopted 10-fold cross-validation; the accuracy reported for every ablation experiment is the average over ten runs of 10-fold cross-validation.
We used the MRDA-MGFSNet (Basic + MGFS + MRDA) network for fine-grained butterfly classification, and then tested and recognized the butterfly categories. Under the same experimental environment, we also used CNN [5], AlexNet-fc6 [22], VGG16 [23], DenseNet-161 [24], ResNet-50 [25] and other models to train on our butterfly dataset. In addition, we refer to the network built from ordinary convolutional layers and the residual structure, without MGFS and MRDA, as the Basic network; the multi-scale network adds to the Basic network the same multi-scale structure as MGFS and MRDA. In Figure 8, the training loss of CNN is shown by the solid black line, that of ResNet-50 by the solid blue line, that of AlexNet-fc6 by the solid green line, that of VGG16 by the solid yellow line, and that of the method used in this paper by the solid gray line. The horizontal axis represents the number of training epochs, and the vertical axis represents the loss. It can be seen that, compared with the CNN, AlexNet, VGG and ResNet models, the MRDA-MGFSNet-based model had a lower loss value and a better training effect.

Table 2 shows the test accuracy of each butterfly category for each method and the overall accuracy (based on the maximum category score being higher than 0.5). From the ablation experiment in Table 2, it can be clearly seen that the MGFS and MRDA proposed in this paper improved the recognition accuracy of the various butterflies to a certain extent. The accuracy for Argynnis hyperbius, Danaus genutia and Papilio machaon was basically above 95%, while the classification accuracy for the Monarch butterfly and Polygonia caureum was relatively low.
This is because simple specimen images accounted for a larger proportion of the images of the first three categories than of the other two, and their image quality was relatively good. In contrast, images with complex backgrounds accounted for a larger proportion of the Monarch butterfly images, and the image backgrounds of Polygonia caureum were also more complicated, so the training results of these two categories were relatively poor. Compared with the Basic network, MRDA gave better results (+5.19%, +5.45%) for the Monarch butterfly and Polygonia caureum, whose backgrounds were more complex and similar to the butterflies' own features. MGFS had a good recognition effect (+4.84%, +4.43%, +3.64%) for the three categories Polygonia caureum, Papilio machaon and Argynnis hyperbius, for which spots are important information and patterns are few. Combining MGFS and MRDA slightly improved the overall recognition accuracy on the two problems, which also shows that the two structures address different problems. The individual effect was slightly reduced after fusion, but the overall performance improved. We also designed an ablation experiment comparing a separate multi-scale network with the complete MGFS network. To obtain a richer receptive field, a separate multi-scale computation increases the number of parameters, which slows down the training of the network; adding depthwise separable convolutions to it forms the MGFS module. Under the same experimental environment, the training time of the network with the MGFS module was 1 h 38 min 29 s shorter than that of the multi-scale network.
Experiments show that the MGFS module built with depthwise separable convolution could effectively reduce the training time of the network and save experimental resources.

Class Activation Mapping (CAM) can give a good visual interpretation of the classification results and can achieve weakly supervised localization of the target object.
As shown in Figure 9, we used Gradient-weighted Class Activation Mapping (Grad-CAM) to visually explain the performance of MRDA-MGFSNet on butterfly classification. It can be seen that our MRDA-MGFSNet model could locate the area of the butterfly in the image very well.

b. Comparison experiment with the latest methods
Relying only on classification accuracy to determine whether a model is truly effective is one-sided and unconvincing. Therefore, we calculated the F 1 value of each network in this paper. To verify the effectiveness of the butterfly recognition model, we used two indexes, recall and precision, and selected F 1 as one of the evaluation indicators of the butterfly recognition results. F 1 is a function of precision and recall, defined by the following formulas:

$$F_1 = \frac{2 \times P \times R}{P + R} \tag{7}$$

$$P = \frac{TP}{TP + FP} \tag{8}$$

$$R = \frac{TP}{TP + FN} \tag{9}$$

In Equation (7), P represents precision and R represents recall. TP represents the number of samples that are actually butterflies and that the model predicts to be butterflies (positive samples detected as positive). In Equation (8), FP represents the number of samples that are not actually butterflies but that the model predicts to be butterflies (negative samples detected as positive). In Equation (9), FN represents the number of samples that are actually butterflies but that the model did not predict to be butterflies (positive samples not detected as positive). As shown in Table 3, the F 1 value of the classification model in this paper reached the expected level, which supports the analysis above.

In the field of machine learning, the confusion matrix is a kind of contingency table and is also called an error matrix. It is a specific matrix used to visualize the performance of an algorithm, usually a supervised learning algorithm (for unsupervised learning it is usually called a matching matrix). Each column represents the predicted category, and each row represents the actual category.
This is very important, because in the real-world classification, the TP value and the FP value are the most direct indicators that ultimately determine whether the classification is correct, and the F 1 value is a comprehensive manifestation of these two indicators.
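To make the link between the confusion matrix and the F 1 value concrete, the sketch below derives per-class precision, recall and F 1 from a matrix whose rows are actual categories and whose columns are predicted categories, using the standard definitions P = TP / (TP + FP), R = TP / (TP + FN) and F 1 = 2PR / (P + R). The counts are made up for illustration and are not results from this paper; the function name is hypothetical.

```python
def per_class_scores(matrix):
    """Return (precision, recall, f1) per class; rows = actual, columns = predicted."""
    n = len(matrix)
    scores = []
    for c in range(n):
        tp = matrix[c][c]
        fp = sum(matrix[r][c] for r in range(n)) - tp  # same column, other rows
        fn = sum(matrix[c]) - tp                       # same row, other columns
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores.append((precision, recall, f1))
    return scores


# Made-up 3-class counts for illustration only; not results from the paper.
toy = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
scores = per_class_scores(toy)
for name, (p, r, f1) in zip("ABC", scores):
    print(f"class {name}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Off-diagonal entries in a class's column lower its precision, and off-diagonal entries in its row lower its recall; either one lowers F 1.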
As shown in Figure 10, we compared the experimental results of the MRDA-MGFSNet-based model for each category with those of some state-of-the-art models such as NTS-Net [26], DFL-Net [27] and BSNet [28].
Argynnis hyperbius, the Monarch butterfly and Polygonia caureum have very similar patterns, spots and shapes. Similarly, Danaus genutia and Argynnis hyperbius, Papilio machaon and the Monarch butterfly also have similar patterns, shapes, colors and other features. Therefore, it can be clearly seen from the confusion matrix of each method that the network still had a great chance of misclassifying them.
NTS-Net proposes a novel training paradigm that enables the Navigator to detect the most informative regions under the guidance of the Teacher. However, this self-supervised learning mode is prone to missing or wrongly extracting more subtle features such as butterfly spots. Therefore, it has higher FN values when recognizing Argynnis hyperbius, the Monarch butterfly and Polygonia caureum, three categories with very similar patterns, spots and shapes.
DFL-Net proposes discriminative mid-level patches: it uses 1 × 1 convolution kernels as small "patch detectors" and designs an asymmetric, multi-branch structure to utilize patch-level information. Although this approach avoids the trade-off between recognition and localization, it focuses more on the classification itself and ignores the recognition and removal of background information. As a result, it can better identify butterflies with similar features: the FP values of Danaus genutia and Papilio machaon are low, but the recognition of the Monarch butterfly and Polygonia caureum, which have complex backgrounds, is not good enough.

BSNet is composed of a band attention module (BAM), band re-weighting (BRW) and a reconstruction network (RecNet). BS-Net-Conv improves the utilization of spectral-spatial information in hyperspectral images (HSI). After these three modules, the background information of the image can be better filtered out to obtain a good classification effect. This network can also achieve good results in the classification of butterfly images. It can be seen from the confusion matrix that, except for Argynnis hyperbius, it has low FN values for the other species of butterflies. However, this is still not enough for real butterfly recognition.
It can be seen from the confusion matrix that the MRDA-MGFSNet proposed in this paper had a better butterfly classification effect. Compared with NTS-Net, DFL-Net and BSNet, the FP value of the MRDA network (the off-diagonal entries in the corresponding column, each of which is an FN for another category) was significantly reduced for the Monarch butterfly and Polygonia caureum, whose images contain a large proportion of complex and similar backgrounds. This is because the MRDA algorithm is more inclined to retain and attend to the butterfly characteristics and discard useless background information. The decrease in the FP value increases the precision of these two categories of butterflies. According to Equation (7), at the same recall, the increase in precision eventually increases the F 1 value. Similarly, compared with NTS-Net, DFL-Net and BSNet, the MGFS network effectively suppressed the FP values of Argynnis hyperbius, Polygonia caureum and Papilio machaon, for which spots are important information and patterns are few. This is because the MGFS algorithm has a multi-scale feature-sharing mechanism that can retain more subtle features such as butterfly patterns and spots. In the end, the suppression of the FP value increases the F 1 value. The network that combines MRDA and MGFS is more powerful in terms of overall performance and improves the F 1 value. In addition, as shown in Table 4, our model achieved an mAP0.5 value of 96.64%, which again shows that our model is effective for butterfly classification. The experiments show that, under the same experimental environment, our algorithm is more suitable for butterfly classification than some of the other latest fine-grained classification models.

Of course, in most cases, butterfly images taken in nature are affected by many factors, such as noise. A stable model should also achieve good precision and a good F 1 value for images with noise.
Therefore, we designed a noise processing capability experiment.
First, as shown in Figure 11, we added different degrees of noise to the 3157 butterfly images in the test set to obtain a new dataset contaminated by noise.

Figure 11. Noise pollution dataset.
Then, we used the trained model to perform classification tests on the above two datasets, respectively, and obtained Table 5. It can be seen from the experimental results that the model had a certain decrease in accuracy and F 1 value after adding noise. The overall recognition accuracy dropped by 2.03%, and the average F 1 value dropped by 2.17%, which shows that noise can indeed have a certain impact on the recognition of the model, but both the overall recognition accuracy and F 1 value still exceeded 93%. It can be seen that the model has good robustness, and has the potential to deal with noisy images to a certain extent.
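A noisy test set of this kind could be produced along the following lines. The text does not state which noise model or noise levels were used, so the zero-mean Gaussian noise, the sigma values, the seed and the function name below are all assumptions chosen purely for illustration.

```python
import random


def add_gaussian_noise(pixels, sigma, seed=0):
    """Add zero-mean Gaussian noise to 8-bit pixel values and clip to [0, 255].

    Gaussian noise is an ASSUMPTION; the paper does not name its noise model.
    """
    rng = random.Random(seed)
    noisy = []
    for p in pixels:
        v = p + rng.gauss(0.0, sigma)
        noisy.append(min(255, max(0, int(round(v)))))
    return noisy


pixels = [0, 64, 128, 192, 255] * 4  # toy stand-in for one image's pixel values
for sigma in (5, 15, 30):            # hypothetical "degrees" of noise
    noisy = add_gaussian_noise(pixels, sigma)
    mean_abs_change = sum(abs(a - b) for a, b in zip(pixels, noisy)) / len(pixels)
    print(f"sigma={sigma}: mean |change| = {mean_abs_change:.1f}")
```

The trained model would then be evaluated once on the clean test images and once on the noisy copies, and the two sets of accuracy and F 1 values compared.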

Conclusions
To address the high background complexity of some butterfly images and the difficulty of identifying butterflies caused by their small inter-class variance, we proposed a new fine-grained butterfly classification architecture in this paper, which achieved good performance in identifying butterfly species. The discussion is as follows:

a. Ablation experiments showed that the MRDA had better results (+5.19%, +5.45%) for butterflies which have more complex backgrounds that are similar to their own features; the MGFS had a good recognition effect (+4.84%, +4.43%, +3.64%) for the three categories of butterflies whose spots are important information and patterns are few; under the same experimental conditions, compared with the multi-scale network, the training time of the MGFS module (with depthwise separable convolution module) was reduced by 1 h 38 min 29 s. The above results show that the two architectures proposed in this paper achieved the expected experimental results, and can effectively solve the problems of complex backgrounds and small inter-class variance between butterflies.
b. Compared with some of the current state-of-the-art fine-grained classification methods, our mAP reached 96.64%, and the average F 1 value reached 95.44%. The designed butterfly fine-grained classification method can achieve better performance. This method had good effects and obvious advantages in identifying different patterns and spots in different butterfly images and removing complex interference information in the background. After the noise processing capability experiment, our model had an accuracy of 93.57% and an F 1 value of 93.64%, which is only 2.03% lower than the accuracy before noise was added, and the F 1 value was 2.17% lower, showing that our model has good potential to deal with noisy images. It can be well applied to the butterfly recognition to better protect the important butterflies for ecological protection in the future.
The butterfly recognition model proposed in this paper greatly improved the effect of fine-grained butterfly classification in a complex background. However, considering that the dataset contained few butterfly categories, it will be expanded in the future to improve the generalization ability of the recognition model. In addition, our model was still an early research prototype, and the number of butterfly images in the dataset was still insufficient. In the future, we need to collect more datasets to improve the recognition accuracy and further improve the performance of the model, so that our model can play a more important role in the field of butterfly protection and ecosystem protection.