Multiscale Hybrid Convolutional Deep Neural Networks with Channel Attention

Attention mechanisms can improve the performance of neural networks, but recent attention networks incur greater computational overhead in exchange for that improvement. How to maintain model performance while reducing complexity is an active research topic. In this paper, a lightweight Mixture Attention (MA) module is proposed to improve network performance and reduce model complexity. Firstly, the MA module uses a multi-branch architecture to process the input feature map in order to extract multi-scale feature information from the input image. Secondly, in order to reduce the number of parameters, each branch uses group convolution independently, and the feature maps extracted by the different branches are fused along the channel dimension. Finally, the fused feature maps are processed by the channel attention module to extract statistical information on the channels. The proposed method is efficient yet effective: compared with ResNet50, the network parameters and computational cost are reduced by 9.86% and 7.83%, respectively, and the Top-1 accuracy is improved by 1.99%. Experimental results on commonly used benchmarks, including CIFAR-10 for classification and PASCAL-VOC for object detection, demonstrate that the proposed MA outperforms the current SOTA methods significantly, achieving higher accuracy with lower model complexity.


Introduction
The Convolutional Neural Network (CNN) has excellent feature learning ability and has developed rapidly [1][2][3][4] in computer vision fields such as image classification [5,6], object recognition [7][8][9], and semantic segmentation [10][11][12]. Since AlexNet [1] was proposed, researchers have proposed many other methods to improve network performance. For example, the attention mechanism from natural language processing has been introduced into computer vision, where it improves network performance [13][14][15][16][17][18]. The most representative example is SENet, which obtains a channel attention weight vector by learning the interaction between channels. This weight vector is then used to scale each channel of the input feature map, highlighting useful features and suppressing useless ones.
Many researchers have improved on SENet to obtain performance gains, but these methods tend to incur greater computational overhead. Qin et al. [14] introduced the discrete cosine transform into the CNN and proposed a new multi-spectral channel attention mechanism. The frequency-domain component indices must be selected by three criteria, so the model is complex. Wang et al. [15] proposed a local cross-channel interaction strategy without dimensionality reduction, which can be efficiently implemented via 1-D convolution. However, it is difficult for the 1-D convolution layer to model the channel information, so the gain in network effectiveness is small. According to the input multi-scale information, Li et al. [16] used a channel attention mechanism to adaptively adjust the receptive field of each neuron in order to improve performance.
The main contributions of this paper are as follows:
1. An effective MA module is proposed, which can extract multi-scale spatial information and establish long-range channel dependence. MA is a plug-and-play module that can be applied to various computer vision architectures to improve model performance.
2. An effective backbone network, EMANet, is obtained by using the MA module in place of the 3 × 3 convolution in ResNet, which allows it to capture rich feature information.
3. Experimental results on the mini-ImageNet, CIFAR-10 and PASCAL-VOC2007 datasets indicate that the proposed EMANet achieves better performance than other attention networks while maintaining low complexity.
The rest of the paper is organized as follows: Section 2 introduces the channel attention mechanism and presents a pyramid compression hybrid module method. Section 3 quantitatively and qualitatively evaluates the performance of the proposed method and compares it with the baseline and existing state-of-the-art methods. Finally, Section 4 summarizes the work of this paper.

Channel Attention Module
The channel attention module has been widely used in various computer vision tasks since it was proposed by Jie Hu et al. By learning correlations between channels in the input feature map, it dynamically weights each channel to enhance useful features and suppress noise. For a given feature map $X \in \mathbb{R}^{C \times H \times W}$, where C, H and W indicate the channel number, spatial height and width, respectively, an SE block consists of two parts: squeeze and excitation, which are used to encode the global information and calibrate the channel correlation, respectively.
Entropy 2022, 24, 1180
Generally, global average pooling is used to compress each two-dimensional channel feature map into a single real number with a global receptive field; it is followed by two fully connected layers. The first fully connected layer is followed by a ReLU (Rectified Linear Unit) activation and the second by a Sigmoid. Using two fully connected layers combines the linear information between channels more effectively. The average-pooling function is defined as:
$$g_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j)$$
where H and W indicate the height and width of the feature map, and $x_c(i, j)$ represents a pixel in the c-th channel of the feature map. The c-th channel attention weight can be written as:
$$w_c = \sigma\left(W_1\, \delta\left(W_0\left(g_c\right)\right)\right)$$
where δ represents the ReLU operation, $W_0 \in \mathbb{R}^{n \times (n/r)}$ and $W_1 \in \mathbb{R}^{(n/r) \times n}$ represent the weights of the fully connected layers, σ represents the excitation function (usually the Sigmoid function, which yields the channel weight vector), and n and r represent the number of channels and the channel decay rate, respectively. The excitation function allocates the channel weights so that information is extracted more effectively. The weight generation process described above is named the squeeze and excitation weight (SEW) module; its schematic diagram is shown in Figure 1.
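The squeeze and excitation steps described above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the function name `sew_weights`, the toy weights, and the row-vector convention (chosen so that the stated shapes n × (n/r) and (n/r) × n compose) are assumptions made here.

```python
import math

def sew_weights(x, w0, w1):
    """Sketch of squeeze-and-excitation channel weights.

    x  : feature map as nested lists with shape [C][H][W]
    w0 : first FC weight, shape [n][n // r]   (hypothetical toy values)
    w1 : second FC weight, shape [n // r][n]
    Returns one weight in (0, 1) per channel.
    """
    # Squeeze: global average pooling reduces each H x W map to one number.
    g = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]

    # Excitation, first FC layer followed by ReLU (row vector times matrix).
    hidden = [max(0.0, sum(g[c] * w0[c][k] for c in range(len(g))))
              for k in range(len(w0[0]))]

    # Second FC layer followed by Sigmoid.
    logits = [sum(hidden[k] * w1[k][c] for k in range(len(hidden)))
              for c in range(len(w1[0]))]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


# Toy example: n = 4 channels, reduction r = 2, 2 x 2 spatial maps.
x = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.0, 0.0], [0.0, 0.0]],
     [[2.0, 2.0], [2.0, 2.0]],
     [[-1.0, 1.0], [-1.0, 1.0]]]
w0 = [[0.5, -0.5], [0.1, 0.2], [0.3, 0.3], [-0.2, 0.4]]
w1 = [[1.0, 0.0, -1.0, 0.5], [0.0, 1.0, 0.5, -0.5]]
weights = sew_weights(x, w0, w1)

# Scale every channel of the input by its weight.
scaled = [[[v * weights[c] for v in row] for row in x[c]] for c in range(len(x))]
```

Each channel receives a single scalar in (0, 1), so the rescaling changes the relative emphasis of channels without altering the spatial layout of any one channel.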

Hybrid Attention Module
This paper combines the mixing idea of ConvMixer [24] with the advantages of the multi-branch architecture of EPSANet [25]. Firstly, the input feature map is processed by a multi-branch architecture, and each branch uses depthwise convolution to mix spatial locations. Afterward, pointwise convolution is used to mix channel locations. Large-kernel convolution is used in the depthwise convolution to mix distant spatial location information, so as to construct long-range dependence while obtaining larger receptive fields. Building on these ideas, a mixed attention (MA) module is proposed, which is composed of four parts, as shown in Figure 2. Firstly, the Mixer and Concat (MC) module is executed to obtain the multi-scale mixed feature map. Secondly, the SEW module is executed on the multi-scale mixed feature map to obtain the channel weight vector. Thirdly, the Softmax function recalibrates the channel weight vector to obtain the calibrated multi-scale channel weight vector. Fourthly, the calibrated weight vector is multiplied by the corresponding channels of the multi-scale mixed feature map. The result is a refined feature map that is richer in multi-scale feature information and is used as the output.
As shown in Figure 2, in the MA module, the main operation for multi-scale mixed feature extraction is the MC module, whose overall structure is shown in Figure 3. In order to extract multi-scale spatial information, the input feature map is processed in a multi-branch way: the channel dimension of the input tensor of each branch is C, and the output channel dimension is $C' = C/S$, where S represents the number of branches. By doing this, more abundant spatial location information can be obtained. Different spatial resolutions and depths can be generated by using multi-scale convolutional kernels in a pyramid structure, and the spatial information at different scales on each channel-wise feature map can be effectively extracted by squeezing the channel dimension of the input tensor. Each branch learns multi-scale mixed spatial information independently and establishes cross-dimensional interaction over a wide range. However, as the size of the convolution kernel increases, the number of parameters also increases. Therefore, in order to perform multi-scale convolution on the input tensors without increasing the computational cost, grouped convolutions are applied in the convolutional layers. At the same time, to select different group sizes without increasing the number of parameters, following the design rules of the EPSANet architecture, the relationship between the multi-scale kernel size and the group size is defined as:
$$G = 2^{\frac{K-1}{2}}$$
where K represents the size of the convolution kernel and G is the group size; the effectiveness of this formula is verified in the ablation study.
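To make the effect of grouping concrete, the sketch below tabulates the kernel and group sizes per branch and the resulting weight counts. It assumes S = 4 branches, the kernel-size rule k_i = 2 × (i + 1) + 1 given for the branches in this section, and a hypothetical channel width C = 64; the helper names are illustrative. The weight count of a grouped convolution, (C_in / G) × C_out × K² with bias ignored, is the standard formula.

```python
def kernel_size(i):
    # Kernel size of the i-th branch: k_i = 2 * (i + 1) + 1 -> 3, 5, 7, 9, ...
    return 2 * (i + 1) + 1

def group_size(k):
    # Group-size rule G = 2 ** ((K - 1) / 2); K is odd, so the exponent is an integer.
    return 2 ** ((k - 1) // 2)

def conv_weights(c_in, c_out, k, groups=1):
    # Weight count of a 2-D convolution with a square k x k kernel (bias ignored).
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * k * k

C, S = 64, 4                # hypothetical channel width and branch count
c_out = C // S              # each branch outputs C' = C / S channels
rows = []
for i in range(S):
    k = kernel_size(i)
    g = group_size(k)
    rows.append((k, g, conv_weights(C, c_out, k), conv_weights(C, c_out, k, g)))

for k, g, dense, grouped in rows:
    print(f"k={k} G={g} dense={dense} grouped={grouped}")
```

Under these assumed sizes, the ungrouped weight count grows with K², while the grouped count stays in the same range across branches, which is the motivation for tying the group size to the kernel size.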
For each branch, the spatial dimension of the input tensor is first compressed to extract local information, and the feature map generation function is defined as:
$$z_i = \mathrm{BN}\left(\sigma\left(\mathrm{Conv}(k_i \times k_i, G_i)(X)\right)\right), \quad i = 0, 1, \ldots, S-1$$
where the size of the i-th convolution kernel is $k_i = 2 \times (i + 1) + 1$, the size of the i-th group is $G_i = 2^{(k_i - 1)/2}$, σ represents the GELU activation function, and BN is Batch-Norm [26], which normalizes the tensors after activation to speed up the training of the model; $z_i \in \mathbb{R}^{C' \times H \times W}$ represents the feature maps at different scales, which are then passed to the mixing module. In order to mix distant spatial location information, we increase the size of the convolution kernel to 9. Meanwhile, to prevent the larger kernel from causing more computational overhead and more parameters, we use depthwise convolution in this paper. According to [27], large-kernel depthwise convolution is difficult to make work without an identity shortcut. Therefore, a parallel shortcut branch is added in this paper. Referring to the Feed-Forward Network (FFN) design of the ViT architecture, we use a similar CNN-style block composed of a shortcut, SoftBAN, one 1 × 1 layer and GELU to mix channel location information. Hence, each branch in the MC module is very similar to the Transformer structure. By doing this, a larger combined receptive field can be obtained, and the cross-dimensional interaction of channels is established. The mixing operation changes neither the spatial dimension nor the channel dimension of the tensor. The mixing operation function is defined in terms of SoftBAN, which is an improvement on IEBN [28]; see Appendix A for details.
By extracting the channel attention weight information from the multi-scale preprocessed feature maps, channel weight vectors at different scales are obtained.
The channel attention weight vector can be expressed as:
$$\phi_i = \mathrm{SEW}(F_i), \quad i = 0, 1, \ldots, S-1$$
where $\phi_i \in \mathbb{R}^{C' \times 1 \times 1}$ is the attention weight, and the SEW(·) function obtains the attention weight from the input feature map $F_i$ at each scale. Due to the multi-branch architecture and the different convolution kernel size allocated to each branch, the MA module can fuse context information at different scales, and, supported by the large-kernel residual convolution, it can generate better pixel-level attention for high-level semantic feature maps.
In addition, in order to achieve the interaction of attention information and the fusion of cross-dimensional vectors without destroying the original channel attention weight vector, the whole channel attention weight vector is obtained by concatenation, as shown in Equation (8):
$$\phi = \phi_0 \oplus \phi_1 \oplus \cdots \oplus \phi_{S-1} \tag{8}$$
where ϕ is the multi-scale attention weight vector and ⊕ denotes concatenation. Soft attention is used across the channels to adaptively select different spatial scales, guided by the channel weight vectors $\phi_i$. The soft weight assignment is given by:
$$att_i = \mathrm{Softmax}(\phi_i) = \frac{\exp(\phi_i)}{\sum_{j=0}^{S-1} \exp(\phi_j)}$$
Softmax is used to obtain the multi-scale channel recalibration weights $att_i$, which contain all the local information in space and the attention weights in the channels. By doing this, the interaction between local and global attention is realized. Next, the recalibrated channel attention vectors are fused by concatenation, so the entire channel attention vector can be expressed as:
$$att = att_0 \oplus att_1 \oplus \cdots \oplus att_{S-1}$$
where att represents the multi-scale channel attention weight vector after attention interaction. We multiply the recalibrated weight $att_i$ of the multi-scale channel attention with the feature map $F_i$ of the corresponding scale:
$$Y_i = F_i \otimes att_i, \quad i = 0, 1, \ldots, S-1$$
where ⊗ denotes channel-wise multiplication, and $Y_i$ refers to the feature map weighted by the multi-scale channel attention weight vector, which has stronger feature representation and modeling capability. The concatenation operator is more effective than the summation operator because it keeps the feature representation intact without destroying the information of the original feature map. In summary, the procedure to obtain the refined output can be written as:
$$Out = \mathrm{Cat}\left([Y_0, Y_1, \ldots, Y_{S-1}]\right)$$
From the above analysis, the MA module proposed in this paper can integrate multi-scale spatial information and cross-channel attention into the blocks of each feature group. Therefore, the MA module can obtain better information interaction between local and global channel attention.
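The recalibration and reweighting steps can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes the Softmax is taken across the S branches for each channel position, and it uses one scalar per channel in place of a full H × W map to stay short. All names and values are hypothetical.

```python
import math

def recalibrate(phi):
    """phi[i][c]: SEW weight of channel c in branch i. Returns att with the same shape."""
    S, C = len(phi), len(phi[0])
    att = [[0.0] * C for _ in range(S)]
    for c in range(C):
        m = max(phi[i][c] for i in range(S))           # subtract max for numerical stability
        exps = [math.exp(phi[i][c] - m) for i in range(S)]
        total = sum(exps)
        for i in range(S):
            att[i][c] = exps[i] / total                # softmax over the branch axis
    return att

def reweight_and_concat(features, att):
    """features[i][c]: stand-in value for channel c of branch i.
    Scales each branch channel by its weight, then concatenates branches along channels."""
    out = []
    for f_i, a_i in zip(features, att):
        out.extend(f * a for f, a in zip(f_i, a_i))
    return out

phi = [[0.2, 0.9], [0.5, 0.1], [0.7, 0.4], [0.3, 0.6]]   # S = 4 branches, 2 channels each
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
att = recalibrate(phi)
out = reweight_and_concat(features, att)
```

Because the branches are concatenated rather than summed, the output keeps S × C' channels, one per branch channel, each scaled by its own recalibrated weight.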

Network Design
The network architecture refers to the design of ResNet, as shown in Figure 4. There are two main factors to consider in choosing the residual network architecture. First, the residual network is the best performing convolutional neural network architecture in various computer vision tasks. It is meaningful to use the residual network as the backbone network to verify whether the MA structure is conducive to the mainstream CNN. Second, the residual network is conducive to the training of the network, so that the potential performance of the network is released. The overall architecture of the network is shown in Table 1. The MA module is used to replace the 3 × 3 convolutional layer in the residual network architecture, and the rest of the architecture remains unchanged. We name this network architecture EMANet.

Experimental Verification and Results Analysis
In order to verify the effectiveness of the model proposed in this paper, performance tests were performed based on mini-ImageNet, CIFAR-10 and PASCAL-VOC2007 datasets. All models were trained on NVIDIA RTX 3060Ti GPUs with 8 GB of VRAM and 16 GB of RAM, and the system was Ubuntu 20.04.4 LTS. The code and models are available at https://github.com/Xsmile-love/pytorch-emanet-master (accessed on 12 June 2022).

Dataset
For classification tasks, this paper uses the mini-ImageNet and CIFAR-10 datasets to verify the effectiveness of the proposed model. The mini-ImageNet dataset contains 100 categories with 600 images each, for a total of 60,000 images of varying size; the training set contains 48,000 images, and the validation set contains 12,000 images. The CIFAR-10 dataset is a small dataset of 60,000 color images of size 32 × 32 in 10 categories, with 6000 images per category; 50,000 images are used as the training set and the rest are used as the validation set. For the object detection task, the PASCAL-VOC2007 dataset, which is commonly used to verify model effectiveness, is adopted; it contains a total of 21,504 images in 20 categories, with 16,552 images in the training set and 4952 images in the validation set.

Experimental Parameter Settings
For the mini-ImageNet image classification task, the data is first augmented with random cropping and random horizontal flipping, and then normalized. Optimization uses stochastic gradient descent (SGD) with a weight decay of $1 \times 10^{-4}$ and a momentum of 0.9; cross-entropy loss is used as the loss function, and training runs for 120 epochs. The initial learning rate is set to 0.1 and is scaled by the factor $0.1^{\mathrm{int}(epoch/30)}$, and the batch size is set to 16. For the CIFAR-10 dataset, random cropping, random horizontal flipping and normalization are likewise applied. SGD is used with a weight decay of 0.0005 and a momentum of 0.9, cross-entropy loss is adopted to train all models, and the learning rate is initially set to 0.1 and adjusted by CosineAnnealingLR; T_max and the number of epochs are both set to 200. For the object detection task, Adam is used with a weight decay of 0.0005, StepLR is used as the learning rate schedule with a step size of 1 and a gamma of 0.96, and the backbone network uses weights pre-trained on the ImageNet-1k dataset. At the beginning of training, the backbone network is frozen for 50 epochs while the region proposal network is trained; the learning rate in this freezing phase is 0.0001, and the batch size is set to four. In the unfreezing stage, all parameters are trained and the epoch is set to 100; since memory usage is relatively large at this stage, the batch size is set to two, and the learning rate is 0.00001.
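The three learning rate schedules mentioned above can be written out in closed form. The sketch below is a plain-Python illustration with hypothetical function names; the cosine formula is the standard cosine annealing without restarts, and a minimum learning rate of 0 is an assumption, since the text does not state one.

```python
import math

def step_decay_lr(epoch, base_lr=0.1, factor=0.1, every=30):
    # mini-ImageNet setting: lr = 0.1 * 0.1 ** int(epoch / 30)
    return base_lr * factor ** (epoch // every)

def cosine_annealing_lr(epoch, base_lr=0.1, t_max=200, eta_min=0.0):
    # CIFAR-10 setting: closed form of cosine annealing (eta_min = 0 is assumed).
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2

def exponential_step_lr(epoch, base_lr, gamma=0.96, step_size=1):
    # Detection setting: StepLR with step size 1 multiplies by gamma after every epoch.
    return base_lr * gamma ** (epoch // step_size)

for e in (0, 29, 30, 60, 119):
    print(e, step_decay_lr(e))
```

With 120 epochs and a decay every 30 epochs, the step schedule passes through four plateaus, ending three orders of magnitude below the initial learning rate.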

Image Classification Results
We compared EMANet with the current SOTA attention methods. The evaluation metrics cover both efficiency (i.e., number of network parameters and GFLOPs) and effectiveness (i.e., Top-1 or Top-5 accuracy). As shown in Table 2, the proposed EMANet achieves the best Top-1 accuracy, outperforming ResNet [4] by more than 1.99% in absolute terms, even though ResNet [4] is 10.9% larger in parameters and 8.5% larger in computation. Compared with EPSANet [25], the number of parameters and the floating-point operations increase by 0.62 and 0.11, respectively, while the Top-1 accuracy increases by 0.83%, so the added cost is worthwhile. Furthermore, with comparable or lower complexity than ECANet [15], EMANet achieves an absolute gain of more than 1.08% in Top-5 accuracy, which demonstrates the superiority of adaptive aggregation across multiple branches. In order to verify the generalization ability of the model, experiments were also carried out on the CIFAR-10 dataset; the results are shown in Table 3. EMANet achieves the best accuracy of 95.61%, which verifies the generalization ability of the MA module. Its number of parameters and floating-point operations are lower than those of all other methods except EPSANet [25], relative to which they are 0.62 and 0.05 higher, respectively. For example, compared with SENet [13], the number of parameters is reduced by 18.6% and the computational cost is reduced by 8.20%. Figure 5 visually shows that the proposed model significantly outperforms the other networks. These results show that the MA module improves network performance to a certain extent while keeping the number of parameters low, which supports the effectiveness of the MA module.
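The two ways of quoting the complexity gap relative to ResNet50 (EMANet being 9.86% and 7.83% smaller in the abstract, ResNet being 10.9% and 8.5% larger in this section) are consistent with each other, as a short calculation shows. The helper name is illustrative.

```python
def larger_by(reduction_pct):
    # If A is reduction_pct % smaller than B, then B is this many % larger than A.
    return (1 / (1 - reduction_pct / 100) - 1) * 100

params_gap = larger_by(9.86)   # parameters
flops_gap = larger_by(7.83)    # computational cost
print(round(params_gap, 1), round(flops_gap, 1))  # prints 10.9 8.5
```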

Network Visualization Results
In order to validate the effectiveness of the MA module more intuitively, nine images were sampled from the ImageNet-1k validation set, and Grad-CAM [29] was used to visualize the heatmap of layer4.2 feature maps in the EMANet network. Grad-CAM is a recently proposed visualization method, which uses the gradient to calculate the importance of spatial position in the convolution layer. Since the gradients are computed for unique classes, the Grad-CAM results can clearly demonstrate the regions that the network focuses on. By observing the regions that are considered to be very important for the prediction category, it can be seen how the network makes good use of features. For a fair comparison, heatmaps of layer4.2 feature maps in the ResNet50 network are also drawn. Figure 6 visualizes the Grad-CAM results. It can be clearly seen from Figure 6 that the Grad-CAM mask of the network with the MA module can cover the target object region better than other methods. In other words, the network integrated with the MA module learns to take advantage of information in the target object region and aggregate features from it. Therefore, the MA module proposed in this paper can indeed enhance the expression ability of the network.
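For reference, the Grad-CAM computation for a single class can be sketched in plain Python: each channel of the chosen layer is weighted by the spatial mean of the class-score gradient, the weighted channels are summed, and a ReLU keeps the positively contributing regions. The toy activations and gradients below are made up for illustration; in practice they would come from a forward and backward pass through the chosen layer.

```python
def grad_cam(activations, gradients):
    """activations[k][i][j]: feature map k of the chosen layer.
    gradients[k][i][j]: derivative of the class score w.r.t. activations[k][i][j].
    Returns the H x W Grad-CAM map (before upsampling and normalization)."""
    K, H, W = len(activations), len(activations[0]), len(activations[0][0])
    # Channel importance: spatial average of the gradients.
    alpha = [sum(sum(row) for row in gradients[k]) / (H * W) for k in range(K)]
    # Weighted sum over channels followed by ReLU.
    return [[max(0.0, sum(alpha[k] * activations[k][i][j] for k in range(K)))
             for j in range(W)] for i in range(H)]

acts = [[[1.0, 0.0], [0.0, 2.0]],
        [[0.0, 3.0], [1.0, 0.0]]]
grads = [[[0.5, 0.5], [0.5, 0.5]],       # channel 0 supports the class
         [[-1.0, -1.0], [-1.0, -1.0]]]   # channel 1 counts against it
cam = grad_cam(acts, grads)
```

In this toy case only the positions where the supporting channel is active survive the ReLU, which is the behavior the heatmaps visualize.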

Object Detection Results
In order to validate the ability of EMANet to handle downstream tasks, it was pre-trained on the ImageNet-1k dataset. Due to limited computing power, the remaining backbone networks listed in Table 4 were not pre-trained by us; instead, the pre-trained weights provided by the original authors were used. Each backbone was then used to train Faster-RCNN [30] on the PASCAL-VOC2007 dataset, and the bounding box Average Precision (AP) was evaluated for object detection. We implemented Faster-RCNN using the MMDetection toolkit. As shown in Table 4, EMANet achieved the best performance in the object detection task. Similar to image classification, its bounding box AP is 8.20% higher than that of ResNet [4], while its number of parameters and floating-point operations are 8.20% and 15.50% lower than those of ResNet50, respectively. Compared with the other attention networks, EMANet achieved the best performance in all metrics. It is worth noting that EMANet achieved 84.80% on $AP_{50}$, which is 4.30%, 2.50% and 3.8% higher than SENet [13], FcaNet [14] and ECANet [15], respectively. The experimental results demonstrate that performance improves consistently even as network complexity decreases, which shows the strong feature expression ability of EMANet.

Ablation Study
In the pyramid architecture, increasing the convolution kernel size causes a large increase in the number of parameters. In order to extract multi-scale information from the input feature map without increasing the computational cost, this paper balances model accuracy and complexity by adjusting the convolution group size, and improves model performance by adjusting the kernel size of the depthwise convolution used to mix long-distance spatial information.

1. Convolution group size
As shown in Table 5, the number of parameters and floating-point operations is decreased by adjusting the group size. In the multi-branch architecture, as the size of the convolution kernel increases, the number of parameters increases significantly. In order to extract multi-scale spatial information, the complexity is decreased by adjusting the group size of the different branches. The experimental results in Table 5, obtained on the mini-ImageNet dataset, show that a group size of (1, 4, 8, 16) achieves a good balance between accuracy and complexity.

2. Mixed operation kernel size
Mixing large-range spatial location information is achieved by adjusting the kernel size of the depthwise convolution. It can be seen from Table 5 that the Top-1 accuracy increases gradually as the kernel size increases, but when the kernel size reaches 13, the performance drops significantly. Therefore, a kernel size of nine is selected to mix spatial location information in this paper.

Conclusions
The purpose of the research in this paper was to improve model performance while reducing complexity. To achieve this goal, we proposed a plug-and-play module, MA, which can effectively extract multi-scale spatial information and important cross-dimensional features, and can therefore enhance the expressiveness of the network. By leveraging an improved multi-branch architecture and a channel attention mechanism, the MA module can efficiently aggregate multi-scale contextual features and image-level category information. Extensive qualitative and quantitative experiments demonstrate that the proposed EMANet achieves the best performance across image classification and object detection tasks compared with other attention methods.
In the future, we will focus on the following tasks:
• The MA module will be further improved to become a lightweight plug-and-play module.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Inspired by the Feed-Forward Network (FFN), which has been widely used in transformers [31] and MLPs [32], we use a similar CNN-style block composed of a shortcut, SoftBAN, one 1 × 1 layer, GELU and BN; the structure is shown in Figure 3. We take the network without the normalization layer in the CNN-style block as the baseline. SoftBAN is an improvement on IEBN, as shown in Figure A1. We use the Softmax function to recalibrate the weight vector in IEBN, and the performance is further improved; this architecture is referred to as SoftBAN. The experimental results are shown in Table A1, where the symbol + represents the combination of the baseline and the corresponding component.
Figure A1. Overall structure of SoftBAN.
It can be seen from Table A1 that the SoftBAN proposed in this paper is 0.17% and 0.42% higher than BN and IEBN, respectively, and 0.16% higher than the baseline. These results indicate the effectiveness of SoftBAN. Therefore, SoftBAN is used to normalize the spatially mixed feature map in this paper.