A New Multi-Scale Convolutional Model Based on Multiple Attention for Image Classification

Abstract: Computer vision systems are insensitive to the scale of objects in natural scenes, so it is important to study the multi-scale representation of features. Res2Net implements hierarchical multi-scale convolution in residual blocks, but its random grouping method affects the robustness and intuitive interpretability of the network. We propose a new multi-scale convolution model based on multiple attention. It introduces the attention mechanism into the structure of a Res2-block to better guide feature expression. First, we adopt channel attention to score channels and sort them in descending order of feature importance (Channels-Sort). The sorted residual blocks are grouped and hierarchically convolved within each block to form a single-attention multi-scale block (AMS-block). Then, we apply channel attention to the residual small blocks to constitute a dual-attention multi-scale block (DAMS-block). Finally, we introduce spatial attention before sorting the channels to form a multi-attention multi-scale block (MAMS-block). A MAMS convolutional neural network (MAMS-CNN) is a stack of multiple MAMS-blocks. It enables significant information to be expressed at more levels, and it can also be easily grafted into different convolutional structures. Limited by hardware conditions, we only prove the validity of the proposed ideas with convolutional networks of the same magnitude. The experimental results show that a convolution model with an attention mechanism and multi-scale features is superior in image classification.


Introduction
In recent years, deep learning has made a number of breakthroughs in the fields of computer vision [1,2], natural language processing [3,4], and speech recognition [5,6]. As one of the most typical deep learning models, convolutional neural networks (CNNs) have made considerable progress in image classification [7,8], object detection [9,10], image retrieval [11,12], and other applications. With the growing richness of image datasets and the improvement of machine performance, the powerful feature extraction and generalization capabilities of CNNs are increasingly favored by industry. Several typical CNN models (including AlexNet [13], VGG [14], ResNet [15], etc.) were originally used for image classification and have further demonstrated their versatility in other image processing tasks. This article discusses a different convolution model and applies it to image classification.
The visual attention mechanism has proven to be one of the most fascinating areas of cognitive neuroscience research. Human vision can quickly scan global input information and screen out specific targets; that is to say, it has the ability to attend to certain things while ignoring others. By mimicking human visual attention, early attention models are often divided into data-driven (bottom-up) and task-driven (top-down) categories. Other convolutional network models adapt more easily to complex geometric transformations of the image by improving the form of the convolution kernel.
Here we review several representative convolution models. Selective Kernel Networks [37] allow the network to adaptively adjust its receptive field sizes based on multi-scale input information. Highway networks [38] allow unimpeded information flow across several layers on information highways. WideResNet [39] is an uncharacteristic example: it improves network performance by expanding the width of the network instead of its depth. Feature recalibration convolution [22] automatically learns the importance of each feature channel, then enhances useful features and suppresses unhelpful ones according to that importance. Res2Net [25] proposes a new convolution backbone that groups single residual blocks and integrates more granular convolution operations. We try to give this convolution backbone a whole new meaning. First, the importance of each channel in a residual block is sorted by feature recalibration convolution. Then, the sorted residual blocks are grouped and layer-wise convolutions are performed. Finally, the results of the layer-wise convolutions are concatenated and channel fusion is performed using a "1 × 1" convolution.

Multi-Scale Features
Here we only discuss multi-scale representation of images based on multi-resolution pyramids, which can be divided into two categories: image pyramids and feature pyramids. Classic pedestrian detection algorithms, such as "Haar + Adaboost" and "HOG + SVM", use image pyramids to handle multi-scale targets. The image pyramid is a multi-scale representation of a single image: it consists of a series of different-resolution versions of the original image. The Gaussian pyramid and the Laplacian pyramid are two common image pyramids, the former for downsampling images and the latter for upsampling images. Image pyramids can achieve excellent results for multi-scale representation, but their time and space complexity is too high. Therefore, the feature pyramid has become the protagonist of multi-scale representation.
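To make the downsampling idea concrete, here is a minimal NumPy sketch of an image pyramid. For brevity, 2 × 2 average pooling stands in for the Gaussian blur-and-subsample step; the function names are illustrative and not from any particular library.

```python
import numpy as np

def downsample(img):
    """Halve spatial resolution via 2x2 average pooling (a simple
    stand-in for Gaussian blur followed by subsampling)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]  # crop to even dimensions
    return (img[0::2, 0::2] + img[1::2, 0::2] +
            img[0::2, 1::2] + img[1::2, 1::2]) / 4.0

def image_pyramid(img, levels):
    """Build a multi-scale pyramid; level 0 is the original image."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid

pyr = image_pyramid(np.random.rand(32, 32), levels=3)
print([p.shape for p in pyr])  # [(32, 32), (16, 16), (8, 8)]
```

Each level halves the resolution, which is exactly why processing every level separately (as classic detectors do) multiplies the time and space cost.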
U-Net [40] adopts a symmetric encoder-decoder structure to combine high-level features with low-level features to obtain richer multi-scale information. SSD [41] uses feature maps from different convolutional layers for object detection at different scales. FPN [42] employs the feature pyramid to scale the features of different layers, and then performs information fusion. Built on the FPN, PANet [43] adds a bottom-up pyramid to effectively transfer positioning information to other layers. ZigZagNet [44] further improved PANet, allowing different layers to interact to enhance bidirectional multi-scale context information. MSPN [45] and M2Det [46] have each completed two or even multiple FPN cascades, which proves that a simple superimposed FPN structure can obtain multi-scale features more effectively. UPerNet [47] and Parsing R-CNN [48], respectively, concatenate PPM modules and RoIAlign operations with FPN to obtain rich multi-scale information.
Most feature pyramid methods represent multi-scale features in a layer-wise manner. Res2Net [25] learns multi-scale representations from a novel and more granular perspective: it builds a multi-scale representation into each residual block, allowing a single network layer to have receptive fields of different sizes. However, Res2Net's grouping of residual blocks is random. We attempt to introduce attention mechanisms into the grouping task of Res2Net, so that more important information receives the most hierarchical feature representation.

Multi-Scale Convolutional Model Based on Single Attention
ResNet proved that using residual blocks to learn the residuals between inputs and outputs is simpler and more efficient than learning the mapping between them directly. Res2Net further adopts grouped convolution to construct finer quadratic residual connections inside the residual block, allowing the network to have richer receptive fields to learn multi-scale features. However, the random grouping of Res2Net in the residual block affects the robustness and intuitive interpretability of deep neural networks. This section introduces a multi-scale convolutional model based on single attention (AMS-CNN). AMS-CNN is a stack of multiple AMS-blocks. Its core contribution is the addition of a "Channels-Sort" module to the Res2-block. The "Channels-Sort" module ranks each channel in the residual block in descending order of feature importance. The AMS-block then groups the sorted residual block and performs secondary residual learning. Finally, a "1 × 1" convolution is used to adjust the number of channels and fuse all channel features. The AMS-block subtly combines the attention mechanism with multi-scale features through the Channels-Sort module, which can focus on specific targets and achieve multi-scale observation of objects. The AMS-block structure is shown in Figure 1. Figure 1. The structure schematic of three blocks: the first is the Res-block, the middle represents the Res2-block, and the last describes the AMS-block.
As shown in Figure 1, the Res2-block is a decomposition of the traditional residual block that lets the model learn multi-scale features within a single residual block. The AMS-block introduces the Channels-Sort module on top of the Res2-block structure. The other difference from the Res2-block is that we perform secondary residual learning for each group of the residual block, as shown on the right of Figure 1, where X1 is transformed into Y1. This further enriches the receptive fields of the convolution operations and increases the multi-scale information. The structure of the Channels-Sort module is illustrated in Figure 2. As shown in Figure 2, X0 is transformed into X via a normal convolution operation or a convolution operation that introduces channel attention [22,24] or spatial attention [23,24]. The dotted box indicates the Channels-Sort module. The first half performs the Squeeze-and-Excitation operation, and the second half sorts the channels of the feature maps in descending order according to the feature importance learned by Squeeze-and-Excitation. The transformation of X into V uses two methods, global average-pooling and depth-wise convolution, where V = [V1, V2].
where "DWConv" denotes depth-wise convolution [31], and Ka is the convolution kernel mapping X to V, Ka = [Ka1, Ka2, ..., Kac2], where each kernel has a single channel. "GlobalAvgPool" represents the global average-pooling operation. Finally, V1 and V2 are concatenated along the channel dimension and activated by the ReLU function to obtain V.
After a further convolution and activation operation, V yields a vector that characterizes the importance of the feature channels. The convolution kernel is Kb (Kb = [Kb1]) and the activation function is ReLU.
where V now represents the importance factors of the feature channels. X is first multiplied by V element by element, and the result is then added to X element by element. The ⊗ and ⊕ represent element-wise multiplication and addition, respectively.
The function Top_K(data, k) extracts the first k data items in order. As in Equation (6), when k equals the channel number c2, this amounts to sorting the vector V in descending order. The sorting operator in Figure 2 represents the channel sorting of X based on the index values of V, which can be implemented by the TensorFlow built-in function batch_gather(*). The implementation of the AMS-block is given in Algorithm A1 (Appendix A).
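As a rough illustration, the Channels-Sort pipeline (squeeze, score, reweight, sort, gather) can be sketched in NumPy as follows. The random linear layer standing in for the learned Squeeze-and-Excitation weights is a hypothetical placeholder, not the paper's trained parameters.

```python
import numpy as np

def channels_sort(X, seed=0):
    """Sketch of Channels-Sort on a single (H, W, C) feature map."""
    H, W, C = X.shape
    rng = np.random.default_rng(seed)
    squeeze = X.mean(axis=(0, 1))                    # global average pooling -> (C,)
    Wfc = rng.standard_normal((C, C)) / np.sqrt(C)   # stand-in for learned SE weights
    V = 1.0 / (1.0 + np.exp(-squeeze @ Wfc))         # per-channel importance scores
    X = X * V + X                                    # reweight: (X times V) plus X
    order = np.argsort(-V)                           # descending sort (Top_K with k = C)
    return X[:, :, order], order                     # gather channels (cf. batch_gather)

X_sorted, order = channels_sort(np.random.rand(8, 8, 16))
```

With k equal to the channel count, Top_K reduces to a full descending sort, and the final gather along the channel axis corresponds to the batch_gather call mentioned above.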

Multi-Scale Convolutional Model Based on Multiple Attention
The residual block, with feature importance ordered by the Channels-Sort module, is divided into n small blocks by an equal grouping operation, where X = [X1, X2, ..., Xk, ..., Xn]. From front to back, the importance of each group to the classification result gradually decreases. Therefore, more important features are expressed at more scales. Next, we introduce a spatial attention block before the Channels-Sort module to capture more discriminative local features, forming a multi-scale convolutional block with dual attention (DAMS-block). This process corresponds to the X to Y stages of Figure 2. For more fine-grained image classification, we further construct the multi-scale convolutional block with multiple attention (MAMS-block). On the basis of the DAMS-block, we apply attention to each residual small block after grouping. Channel attention is performed on the small blocks within the same residual block in order to better realize feature fusion between different small blocks. A network in which multiple MAMS-blocks are stacked is called MAMS-CNN. The DAMS-block and the MAMS-block are shown in Figure 3. As shown in Figure 3, channel attention is performed in each residual small block, so that more detailed features can be observed in receptive fields of different scales. For the operation between small blocks, we use "1 × 1" convolution for channel fusion instead of element-wise addition. The main reason is that different small groups focus on different features: simple addition would blur the attention information, whereas channel fusion retains more attention features. The specific operations are given in Equations (8) and (9).
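The grouping and per-group attention described above can be sketched as follows. This is a toy NumPy sketch: identity maps stand in for the 3 × 3 convolutions, and a sigmoid-squashed global average stands in for the learned SE-block inside each group.

```python
import numpy as np

def se_reweight(x):
    """Toy per-group channel attention: scale each channel by a
    sigmoid of its global average activation (stand-in for SE)."""
    v = 1.0 / (1.0 + np.exp(-x.mean(axis=(0, 1))))
    return x * v

def mams_groups(X, n=4):
    """Split a sorted residual block into n groups and process them
    hierarchically (Res2Net style): each group receives the output
    of the previous group before its own 'convolution' (identity here)."""
    groups = np.split(X, n, axis=2)      # X = [X1, ..., Xn] along channels
    outs, prev = [], 0.0
    for g in groups:
        y = se_reweight(g + prev)        # per-group channel attention
        outs.append(y)
        prev = y                         # hierarchical residual path
    return np.concatenate(outs, axis=2)  # fused afterwards by a 1x1 convolution

Y = mams_groups(np.random.rand(8, 8, 16), n=4)
```

Because the channels were sorted first, the earliest groups, which carry the most important features, flow through the most hierarchical stages and thus see the most scales.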

The "SE-block" in Figure 3 performs the Fse(*) operation, where Fse(*) denotes the Squeeze-and-Excitation operation, that is, channel attention. Fse(*) is divided into three phases: the squeeze operation, the excitation operation, and the reweight operation. See reference [22] for details.
The parameter n represents the number of groups per residual block (e.g., n = 4 in Figure 3). Conv(*, {1 × 1}) fuses channel information and adjusts the channel number using a "1 × 1" convolution. Concat(*) means that two convolutional blocks are concatenated along the feature-channel dimension.
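Since a "1 × 1" convolution is just a per-pixel linear map across channels, the concatenation-plus-fusion step of Equations (8) and (9) can be sketched as below; all shapes here are illustrative.

```python
import numpy as np

def conv1x1(X, W):
    """A 1x1 convolution is a per-pixel linear map over channels:
    (H, W, C_in) @ (C_in, C_out) -> (H, W, C_out)."""
    return X @ W

rng = np.random.default_rng(0)
Y1, Y2 = rng.random((8, 8, 4)), rng.random((8, 8, 4))  # two attention-weighted groups
fused = conv1x1(np.concatenate([Y1, Y2], axis=2),      # Concat(*): join along channels
                rng.random((8, 8)))                    # 1x1 kernel mixes all 8 channels
```

Unlike element-wise addition, every output channel here is a learned mixture of all groups' channels, which is why the fusion preserves the distinct attention information of each group.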

Datasets and Experimental Details
In this section, we perform experiments on four image datasets: CIFAR-10, CIFAR-100, FGVC-Aircraft [49], and Stanford Cars [50]. CIFAR-10 consists of 60,000 32 × 32 images in 10 classes, with 6000 images per class; there are 50,000 training images and 10,000 test images. CIFAR-100 is just like CIFAR-10, except that it has 100 classes containing 600 images each, with 500 training images and 100 test images per class. FGVC-Aircraft consists of 10,000 images divided into 100 categories, with 6667 training images and 3333 test images. Stanford Cars consists of 16,185 images in 196 categories, split into 8144 training images and 8041 test images. Figure 4 shows some examples from these datasets. The algorithm in this manuscript is implemented on the Keras 2.2.4 framework backed by TensorFlow 1.11.0. The graphics card is a GeForce GTX 1080, and the programming language is Python 3.5.2. The first half of the epochs uses a learning rate of 0.1, the third quarter uses 0.01, and the last quarter uses 0.001. For example, with 300 epochs, the learning rates are 0.1, 0.01, and 0.001 at epochs 1-150, 151-225, and 226-300, respectively. The experiments adopt the SGD algorithm with momentum, where the momentum value is 0.9. Convolutional network structures such as ResNet-50 and ResNet-101 are not used, mainly because their parameter counts are too large; improved models based on these mainstream convolutional structures are difficult to train under our existing experimental conditions. We only verify the validity and feasibility of the proposed methods with convolution models of the same magnitude.
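The step schedule described above can be written as a small function (a sketch; epoch indices here are zero-based):

```python
def lr_schedule(epoch, total_epochs):
    """Step learning-rate schedule: 0.1 for the first half of training,
    0.01 for the third quarter, and 0.001 for the last quarter."""
    if epoch < total_epochs // 2:
        return 0.1
    if epoch < total_epochs * 3 // 4:
        return 0.01
    return 0.001

# With 300 epochs: 0.1 for epochs 0-149, 0.01 for 150-224, 0.001 for 225-299.
```

In Keras, a function like this (wrapped to take only the epoch argument) could be passed to keras.callbacks.LearningRateScheduler.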

Experiments on CIFAR-10 and CIFAR-100
In this section, we adopt ResNet, SE-CNN, and Res2-CNN as references to verify the effectiveness of our proposed models from three aspects: test accuracy, time complexity, and computational complexity. SE-CNN and Res2-CNN, respectively, add SE-blocks and Res2-blocks to the ResNet-32 structure. The structures of ResNet-32 and its various improved models are shown in Table 1. ResNet-32 consists of four convolutional blocks, a global average pooling layer, and a dense layer. Each residual block, shown in square brackets, contains two convolution operations. The three parameters of the convolution kernel are the number of feature channels after convolution, the size of the convolution kernel, and the stride. When dimension reduction is required, the stride is 2. The number of repetitions of each residual block is given by the parameter "stack_n", which is 5, 4, or 1 in ResNet-32. Table 2 shows the test accuracy and time complexity on CIFAR-10 under different models. The parameter "Groups" indicates the number of feature-channel groups; the values in the experiments are 2, 4, and 8. ResNet and SE-CNN do not perform grouped convolution, which corresponds to a value of 1. The parameter "SPE" means "seconds per epoch", i.e., the number of seconds required to execute each epoch. Owing to the different grouping methods, the numbers of trainable parameters of the networks also differ considerably. For comparison, ResNet and SE-CNN adopt a 32-layer network structure, while the other models adopt different depths to keep the overall parameter count close to that of ResNet-32.
For ResNet-32, the classification accuracy on CIFAR-10 is as high as 92.49% with only 0.46M parameters. SE-CNN introduces the SE-block into the ResNet-32 architecture, which increases the test accuracy by 0.26%. Res2-CNN_A and Res2-CNN_B improve the test accuracy by 0.28% and 0.39%, respectively, while their time complexity gradually increases. Comparing the different grouping forms of Res2-CNN, the best result is achieved when the number of groups is 4. It can be seen from Table 2 that, while adding only a small number of parameters, the classification accuracy of our proposed models exceeds 93%. Compared to ResNet, the test accuracy of AMS-CNN_A and AMS-CNN_B increases by 0.82% and 0.86%. Unlike Res2-CNN, the more channel groups AMS-CNN has, the better its performance, which mainly reflects the advantages of multi-scale features. DAMS-CNN adds spatial attention to AMS-CNN; the test accuracy of its two structures increases by 1.09% and 1.11%, and the highest result is 93.60%. Overall, the time complexity of the models with structure A is lower, while structure B takes more time; the main time cost lies in sorting the channels by feature importance. As can be seen from the FLOPs values, the computational complexities of the algorithms are on the same order of magnitude, because we control the parameter sizes of the networks. Figure 5 shows the trend of the test loss on CIFAR-10 for the various network models. The downward tendency of structure B is more obvious than that of structure A, indicating that the more AMS-blocks or DAMS-blocks are used, the more powerful the network is. The overall performance of DAMS-CNN is better than that of AMS-CNN. We adopt ResNet-32 as the foundation model and limit the network parameters to around 1.9M.
Table 3 shows the test accuracies and time consumption of ResNet-32 and its various deformation models on CIFAR-100. As before, we add an SE-block to each residual block of ResNet-32 to form SE-CNN. In addition, Res2-CNN, AMS-CNN, and DAMS-CNN employ both the A and B structures, and we test different groupings for the proposed models. We did not use the MAMS-CNN model on the CIFAR-10 and CIFAR-100 datasets, because the CIFAR datasets are relatively simple and an overly complex feature learning model can lead to overfitting. It is worth noting that when training AMS-CNN and DAMS-CNN, we increase the network weight decay coefficient "weight_decay" from 0.0001 to 0.0005 in order to avoid over-fitting caused by the complex models. As can be seen from Table 3, when training the various networks with mode_A, the number of seconds needed to process the entire dataset is almost the same; with mode_B, the time complexity increases considerably. For example, the SPE value of DAMS-CNN_B is more than six times that of ResNet-32. Under the same conditions, the performance of our three proposed models exceeds that of Res2-CNN, SE-CNN, and ResNet. On CIFAR-100, we achieve a maximum test accuracy of 75.87%, which is 2.78%, 1.68%, and 2.39% higher than the three comparison models, respectively. In addition, AMS-CNN_B performs better than AMS-CNN_A, which shows that adopting more AMS-blocks benefits network performance. Moreover, DAMS-CNN_A performs better than AMS-CNN_A, indicating that introducing the spatial attention mechanism on top of the latter benefits feature learning. However, the overall performance of DAMS-CNN_B is not higher than that of DAMS-CNN_A; our analysis suggests the network has over-fitted. Therefore, we do not discuss the more complex model (MAMS-CNN) on the basic CIFAR datasets.
Similar to the previous experiments, the computational complexities of the algorithms are on the same order of magnitude. As in Figure 5, structure B in Figure 6 is more effective than structure A, and DAMS-CNN is better than AMS-CNN. The overall loss on CIFAR-100 is larger and fluctuates more during training, because image classification on CIFAR-100 is more complex.

Experiments on Fine-Grained Image Datasets
In this section, we apply the proposed models to fine-grained image classification on FGVC-Aircraft and Stanford Cars. Fine-grained images are characterized by small differences between classes and large differences within classes. Compared to the CIFAR datasets, we experiment with more complex network structures. All of the models in Table 4 are based on the ResNet structure. The ResNet structure here includes six convolution blocks, a global average pooling layer, and a dense layer. The number of output nodes of the networks is 100 or 196, depending on the dataset. The three parameters of each convolutional layer represent the number of channels after convolution, the size of the convolution kernel, and the stride. The default stride is 1, in which case it is omitted. "Conv1" uses three different scales of receptive fields to capture richer features. The convolution kernels in all remaining residual blocks are 3 × 3, 3 × 1, and 1 × 3, in order to obtain rich features while reducing parameters. SE-CNN adds an SE-block to each residual block in ResNet, for a total of 21 SE-blocks. For Res2-CNN, AMS-CNN, DAMS-CNN, and MAMS-CNN, we only experiment with structure A from the previous section, to reduce model complexity and complete the experiments under limited equipment conditions. We add a Res2-block, AMS-block, DAMS-block, or MAMS-block after each large convolution block, so four corresponding special modules are added to each convolution model.
Table 5 gives the experimental results for the FGVC-Aircraft dataset, and Table 6 shows the experimental data for the Stanford Cars dataset. In the experiments, we resized the images of both datasets to 224 × 224; due to the limitations of our experimental equipment, we did not use the commonly used image size of 448 × 448. The learning rates of the models were 0.1, 0.01, and 0.001 at epochs 1-190, 191-270, and 271-360, respectively. The network weight decay coefficient "weight_decay" is set to 0.0005 to avoid over-fitting caused by the complex models. It is worth noting that we did not use pre-trained networks or fine-tuning techniques in any of the experiments. The parameter counts of the networks are shown in Tables 5 and 6, with values between 10M and 11M. Since FGVC-Aircraft and Stanford Cars contain images of 100 and 196 categories, respectively, the final output nodes of the networks differ; therefore, the parameter counts in Table 6 are slightly higher overall than those in Table 5. For FGVC-Aircraft, the test accuracy of MAMS-CNN reached 86.56%, which is 3.51%, 2.34%, and 4.29% higher than ResNet, SE-CNN, and Res2-CNN, respectively. At the same time, the performance of the multi-scale networks with single, dual, and multiple attention improves successively, indicating that attention mechanisms have a natural advantage in image classification. It is worth noting that Res2-CNN does not perform well on FGVC-Aircraft. For Stanford Cars, MAMS-CNN also achieves the best result, with a test accuracy of 89.15%, which is 6.13%, 6.06%, and 5.53% higher than ResNet, SE-CNN, and Res2-CNN, respectively. Even compared to the AMS-CNN and DAMS-CNN models, its performance improves by 0.50% and 0.13%, respectively. On the other hand, the more complicated the network structure is, the more time is consumed.
However, for datasets with more than 10,000 images overall, the time complexity is within a tolerable range. Comparing the two fine-grained datasets, our proposed models yield a larger improvement on Stanford Cars than on FGVC-Aircraft, because the former has more local features for fine-grained image recognition. As can be seen from the FLOPs and SPE values, under similar computational complexity, the time complexities differ significantly because the Channels-Sort operation consumes a lot of time. As can be seen in Figure 7, from AMS-CNN to MAMS-CNN, the test loss becomes more stable. Comparing the three structures on the two datasets, the test loss on Stanford Cars fluctuates much less and its final values are smaller, which is consistent with the performance of the networks on both datasets.

Comparisons with Prior Methods
In the previous sections, our focus was to demonstrate the effectiveness of the proposed modules at the same scale. Here we add comparative experiments on all four datasets. For CIFAR-10 and CIFAR-100, we compare against WideResNet; for the fine-grained datasets, we compare our results with several other fine-grained classification models, including FV-CNN, DVAN, Random Maclaurin, Tensor Sketch, and LRBP. None of these models uses bounding box information or part annotations. The classification accuracies of the mentioned methods are shown in Tables 7 and 8. For the CIFAR datasets, our model is built on the WideResNet structure with three added MAMS-block modules. For fairness, we ran WideResNet and our model on the same platform. The results demonstrate that our model improves accuracy by 0.48% and 0.16% on CIFAR-10 and CIFAR-100, respectively. For the FGVC-Aircraft and Stanford Cars datasets, we connect a MAMS-block after each residual block and raise the number of convolution kernels of Conv5 and Conv6 in Table 4 to 256 and 512. The parameter counts of the two networks reach 31.59M and 31.64M, respectively. As can be seen from the test results, although we did not specifically study fine-grained images, we still achieve good results on the two fine-grained datasets.

Conclusions
Attention mechanisms and multi-scale features are two important measures for dealing with computer vision tasks. The AMS-block first uses channel attention to arrange the feature maps in descending order of importance, and then performs two residual convolutions on the feature maps. It subtly integrates the attention mechanism and multi-scale features into a convolution model for image classification. At a more detailed level, we propose the DAMS-block and the MAMS-block. On this basis, the novel convolution model we construct can not only focus on important features and ignore distracting information, but also give significant features a multi-scale expression through feature sorting. The classification results of MAMS-CNN on multiple image sets, including standard and fine-grained image sets, demonstrate its strong classification performance. In addition, the MAMS-block is easy to graft into deep learning models for end-to-end training, which is one of its advantages. However, more intuitive applications of attention mechanisms and multi-scale features lie in image tasks beyond classification, such as object detection and saliency analysis. Next, we will try to introduce the MAMS-block into further image processing models to maximize its value.