Spatial Channel Attention for Deep Convolutional Neural Networks

Abstract: Recently, attention mechanisms that combine spatial and channel information have been widely used in various deep convolutional neural networks (CNNs), proving their great potential in improving model performance. However, they usually use 2D global pooling operations to compress spatial information, or scaling methods to reduce the computational overhead in channel attention. These operations result in severe information loss. Therefore, we propose a Spatial channel attention mechanism that captures cross-dimensional interaction, does not involve dimensionality reduction, and brings significant performance improvement with negligible computational overhead. The proposed attention mechanism can be seamlessly integrated into any convolutional neural network since it is a lightweight general module. Our method achieves a performance improvement of 2.08% on ResNet-50 and 1.02% on MobileNetV2 in top-one error rate on the ImageNet dataset.

SENet [7], as one of the state-of-the-art channel attention mechanisms, provides significant performance gains at an extremely low computational cost. However, SE attention only considers the encoded inter-channel information and ignores the importance of spatial information, which is crucial for image classification. The convolutional block attention module (CBAM) [10] provides robust representative attention by combining channel information and spatial information. Compared with SENet, the CBAM offers significant performance improvements with a small computational overhead. However, in the CBAM, spatial attention is obtained simply by applying global average pooling (GAP) and global max pooling (GMP), which compress the C channels into a single channel to obtain important spatial information. Similarly, coordinate attention (CA) [12] uses GAP to compress height H and width W separately to capture the interaction between spatial information and channel information.
Wang [8] pointed out that avoiding dimensionality reduction and maintaining appropriate cross-channel interaction are important when learning channel attention. Therefore, Misra [13] proposed a lightweight attention mechanism without dimensionality reduction, triplet attention (TA). TA captures the interaction between the spatial dimensions and the channel dimension through a rotation operation to enhance feature representations.
The SE module, CBAM, and TA, as illustrated by Misra [13], are shown in Figure 1a-c, respectively. However, triplet attention uses two branches to capture the cross-dimensional interaction of channel C with height H and width W, respectively, and a third branch to capture spatial information; this third branch is unnecessary and increases the complexity of the model. Therefore, we propose a simpler but better-performing attention mechanism that captures cross-dimensional interaction information without dimensionality reduction, namely Spatial channel attention (SCA).
Specifically, to capture cross-dimensional interaction and alleviate the spatial information loss caused by GAP or GMP, we aggregate the cross-dimensional interaction features between the spatial dimension H or W and the channel dimension C by simply permuting the input tensor. Then, we feed the permuted tensors into convolutional layers and Sigmoid activation layers to generate two interaction attention maps. Finally, we permute the attention maps back to the original input layout and apply them to the input tensor via multiplication.
Our Spatial channel attention has the following advantages. First, it emphasizes the importance of cross-dimensional interaction to capture not only orientation-aware channel information, but also channel-sensitive spatial information, which helps the model to locate and identify objects of interest more accurately. Second, our method is flexible and lightweight: it captures rich discriminative feature representations with negligible computational overhead and can be easily inserted into classic convolutional neural network building blocks, such as those of ResNet [14] and MobileNetV2 [15], to enhance features by emphasizing informative representations.
The proposed method considers cross-dimensional dependencies and is computationally efficient and inexpensive. For example, for ResNet-50 [14] with 25.557M parameters and 4.122 GFLOPs, the proposed Spatial channel attention mechanism achieves a 2.08% improvement in top-one error rate at the cost of 4.6K more parameters and 4.1 × 10^−2 more GFLOPs.
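To put these figures in proportion, the short calculation below converts the absolute overhead quoted above into relative terms. It uses only the four numbers stated in the text; the variable names are illustrative.

```python
# Relative overhead of SCA on ResNet-50, using only the figures quoted in the text.
base_params = 25.557e6   # ResNet-50 parameters
base_gflops = 4.122      # ResNet-50 GFLOPs
extra_params = 4.6e3     # additional parameters introduced by SCA
extra_gflops = 4.1e-2    # additional GFLOPs introduced by SCA

param_overhead_pct = 100.0 * extra_params / base_params
flop_overhead_pct = 100.0 * extra_gflops / base_gflops

print(f"parameter overhead: {param_overhead_pct:.3f}%")
print(f"FLOP overhead: {flop_overhead_pct:.2f}%")
```

The parameter overhead comes out at roughly 0.02% and the FLOP overhead at roughly 1%.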
We propose a Spatial channel attention mechanism to improve the performance of CNNs for image classification. The remainder of this paper is organized as follows. In Section 2, recent efforts in attention mechanisms for image classification are described. In Section 3, the working details of the proposed method, SCA, are introduced. In Section 4, extensive experiments are conducted to assess the performance of SCA. In Section 5, our work is summarized.

Related Work
Attention mechanisms originate from the human visual system, where humans selectively concentrate on regions of interest while ignoring the rest. Therefore, attention mechanisms are extensively studied in computer vision tasks, such as image classification [16][17][18], object detection [4,19], and image segmentation [5,6,11], aiming to tell a model where and what to attend to for boosting the performance of deep convolutional neural networks (CNNs). In this section, we review some attention mechanisms that are closely related to our work.
Attention mechanisms adaptively recalibrate the weights of features to improve the information perception ability of the model. According to the feature dimension to which it is applied, attention can be categorized into several types, such as spatial attention and channel attention.
To improve the capability of modeling spatial information in CNNs, spatial attention is widely used with great success. The non-local module [4] computes the relationship between a pixel and all other pixels to capture the long-range dependencies in images. However, the computational overhead of the non-local module is high. In order to reduce the amount of computation, GCNet [19] uses 1 × 1 convolution and scaling operations, and CCNet [5] uses criss-cross attention modules in a cascading manner to aggregate the information on rows and columns of pixels. DANet [11] compresses the 3D tensor to 2D and captures spatial information through matrix multiplication. Similarly, SPNet [6] uses strip pooling on the height and width of features separately and generates spatial attention maps through matrix multiplication.
Channel attention assigns weights to different channels to tell the model what to focus on, which is simple and effective. SENet [7] was the first to propose an efficient method for channel attention, providing significant performance improvements at a minimal additional computational cost. SENet compresses each 2D spatial feature to generate channel weights, explicitly establishing the interdependencies between channels. To balance the trade-off between model performance and complexity, Wang [8] proposed an attention mechanism without dimensionality reduction, efficient channel attention (ECA), which can bring significant performance gains by adding only a handful of parameters.
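The difference in parameter cost between the two designs can be illustrated with a rough count. The sketch below assumes the usual SE layout of two fully connected layers with a reduction ratio r, and a single 1D convolution of kernel size k for ECA; the values r = 16 and k = 3 and the channel widths are illustrative assumptions, not settings reported in this paper, and biases are ignored.

```python
def se_params(channels: int, reduction: int = 16) -> int:
    """Weights of the two fully connected layers in an SE block (biases ignored)."""
    hidden = channels // reduction
    return channels * hidden + hidden * channels


def eca_params(kernel_size: int = 3) -> int:
    """Weights of the single 1D convolution in an ECA block (no bias)."""
    return kernel_size


# Illustrative channel widths only.
for c in (256, 512, 1024, 2048):
    print(f"C={c}: SE weights={se_params(c)}, ECA weights={eca_params()}")
```

Under these assumptions, the SE count grows quadratically with the number of channels, whereas the ECA count depends only on the kernel size.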
However, these channel attention methods only consider inter-channel interdependencies and ignore spatial information. Therefore, the CBAM [10] combines channel attention and spatial attention to recalibrate the weights of features. The CBAM sequentially extracts attention maps along the channel and spatial dimensions and multiplies them with the input feature maps to achieve adaptive feature augmentation.
Similarly, CA [12] utilizes two parallel branches to extract the cross-dimensional interaction between channel C and height H or width W to generate an attention map. Furthermore, TA [13] proposes three-branch attention without dimensionality reduction, where two branches capture cross-dimensional interaction and the third branch is used to build spatial attention.

Method
Spatial channel attention can be viewed as a computational unit that aims to enhance the expressive power of the learned features for mobile networks. It can take any intermediate feature tensor X as the input and output a transformed tensor Y with augmented representations. To provide a clear description of the proposed Spatial channel attention, we first revisit the channel and spatial attention in the CBAM, which is widely used in convolutional neural networks.

Revisiting CBAM
This subsection revisits the channel attention module and spatial attention module used in the CBAM [10]. Let X ∈ R^{C×H×W} be the input of the CBAM channel attention module, where C, H, and W denote the number of channels and the height and width of the feature map, respectively.
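The channel attention weight in the CBAM can be expressed by the following equation:

$$\omega_c = \sigma\left(W_1^c\,\mathrm{ReLU}\left(W_0^c\,\mathrm{GAP}_c(X)\right) + W_1^c\,\mathrm{ReLU}\left(W_0^c\,\mathrm{GMP}_c(X)\right)\right)$$

where ω_c represents the channel attention weight, σ is the Sigmoid activation function, ReLU is the rectified linear unit activation function, and W_1^c and W_0^c are weight matrices of size C × C/r and C/r × C, respectively, with r being the reduction ratio. GAP_c and GMP_c are the global average pooling and global max pooling functions used for channel attention, which pool over the spatial dimensions of each channel.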
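As a minimal NumPy sketch of the CBAM channel attention described above (not the reference implementation), the function below pools over the spatial dimensions, passes both pooled vectors through the shared two-layer mapping, and applies the Sigmoid. The tensor sizes, the reduction ratio, and the random weights are illustrative assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cbam_channel_attention(x, w0, w1):
    """x: (C, H, W); w0: (C//r, C); w1: (C, C//r). Returns weights of shape (C, 1, 1)."""
    gap = x.mean(axis=(1, 2))  # global average pooling over H and W -> (C,)
    gmp = x.max(axis=(1, 2))   # global max pooling over H and W -> (C,)

    def shared_mlp(v):
        # W1 * ReLU(W0 * v), with the same weights for both pooled vectors
        return w1 @ np.maximum(w0 @ v, 0.0)

    return sigmoid(shared_mlp(gap) + shared_mlp(gmp)).reshape(-1, 1, 1)


# Illustrative sizes and random weights (assumptions for the demo).
rng = np.random.default_rng(0)
C, H, W, r = 8, 5, 6, 4
x = rng.standard_normal((C, H, W))
w0 = 0.1 * rng.standard_normal((C // r, C))
w1 = 0.1 * rng.standard_normal((C, C // r))

omega_c = cbam_channel_attention(x, w0, w1)
y = x * omega_c  # one weight per channel, broadcast over H and W
print(omega_c.shape, y.shape)
```

Note that each channel receives a single scalar weight, so all spatial positions within a channel are rescaled identically.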
Similarly, the spatial attention weight in the CBAM can be expressed by the following equation:

$$\omega_s = \sigma\left(\mathrm{Conv}_{7\times 7}\left(\left[\mathrm{GAP}_s(X);\,\mathrm{GMP}_s(X)\right]\right)\right)$$

where ω_s represents the spatial attention weight, Conv_{7×7} represents a convolution operation with a filter size of 7 × 7, [;] represents the tensor concatenation operation, and GAP_s and GMP_s are the global average pooling and global max pooling functions applied along the channel dimension, respectively.
The CBAM is widely used in convolutional neural networks and has proven to be efficient. However, the CBAM compresses the C channels into a single channel to obtain important spatial information and compresses the spatial dimension H × W into a single pixel to obtain channel importance, ignoring the cross-dimensional interaction information between the spatial and channel dimensions. Therefore, in the following, we introduce a novel attention block that considers both channel and spatial cross-dimensional information to enhance feature representation.
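The CBAM spatial attention described above can be sketched in NumPy as follows. This is an illustrative sketch rather than the reference implementation: the helper `conv2d_same` is a hypothetical plain-loop convolution (zero padding, stride 1, one output channel, no bias), and the tensor sizes and random kernel are assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def conv2d_same(x, kernel):
    """x: (C_in, H, W); kernel: (C_in, k, k) with odd k. Returns an (H, W) map."""
    _, k, _ = kernel.shape
    pad = k // 2
    h, w = x.shape[1:]
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))  # zero padding on H and W
    out = np.zeros((h, w))
    for i in range(k):
        for j in range(k):
            # contribution of kernel tap (i, j), summed over input channels
            out += np.einsum("c,chw->hw", kernel[:, i, j], xp[:, i:i + h, j:j + w])
    return out


def cbam_spatial_attention(x, kernel):
    """x: (C, H, W); kernel: (2, 7, 7). Returns weights of shape (1, H, W)."""
    gap = x.mean(axis=0, keepdims=True)          # average over channels -> (1, H, W)
    gmp = x.max(axis=0, keepdims=True)           # max over channels -> (1, H, W)
    pooled = np.concatenate([gap, gmp], axis=0)  # (2, H, W)
    return sigmoid(conv2d_same(pooled, kernel))[None]


# Illustrative sizes and random kernel (assumptions for the demo).
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 5, 6))
kernel = 0.1 * rng.standard_normal((2, 7, 7))

omega_s = cbam_spatial_attention(x, kernel)
y = x * omega_s  # one weight per spatial position, broadcast over channels
print(omega_s.shape, y.shape)
```

Here every channel at a given position is rescaled by the same weight, which is the channel compression that the discussion above identifies as a source of information loss.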

Spatial Channel Attention
As described in Section 1, we propose an attention mechanism with few parameters to capture the interaction between the spatial and channel dimensions, namely Spatial channel attention. Figure 2a,b show how we insert the attention mechanism into the residual block in ResNet and the inverted residual block in MobileNetV2, respectively.
The traditional way to compute spatial attention is to compress the channels of the input tensor to generate a weight for each spatial location. This can lead to incorrectly assigning higher weights to non-target regions. Similarly, the traditional way to compute channel attention is to compress the spatial information of the input tensor via global average pooling to generate a weight for each channel. This results in a severe loss of spatial information. Furthermore, these spatial attention and channel attention methods do not model any interdependence between the two. In simple terms, channel attention tells the model which channel to focus on, while spatial attention tells it where to focus within the channel. The disadvantage of this process is that channel attention and spatial attention are computed separately and independently of each other, without considering any relationship between them.
Therefore, we propose a Spatial channel attention mechanism that captures cross-dimensional interaction. The proposed Spatial channel attention is shown in Figure 3. There are two branches in Spatial channel attention, which are responsible for capturing the cross-dimensional interaction between the channel dimension C and the spatial dimension H or W, respectively. SCA first rearranges the input tensor through a permutation operation. Then, the permuted tensors are passed sequentially through a convolutional layer and a Sigmoid activation layer to generate two attention maps for cross-dimensional interaction. Finally, the attention maps are permuted again and applied to the input tensor via multiplication, obtaining feature representations that are interactively enhanced across dimensions.
As shown in Figure 4, given an input tensor X with dimension C × H × W, we first pass it to the two branches in the proposed attention module. In the top branch, we construct the interaction between the height dimension H and the channel dimension C. To do this, we permute X and rearrange it to W × H × C. Next, the rearranged tensor is successively fed to a convolutional layer with a kernel size of 1 × 1 and a batch normalization layer, and an intermediate output with dimension 1 × H × C is obtained. Then, we pass it through the Sigmoid activation function and obtain the attention weights. Finally, we permute the newly generated attention weights again to rearrange them as C × H × 1.
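$$F_{ch}(X) = \mathrm{Permute}_{ch}\left(\sigma\left(\mathrm{BN}\left(\mathrm{Conv}_{1\times 1}\left(\mathrm{Permute}_{ch}(X)\right)\right)\right)\right)$$

where Permute_ch represents the permutation operation of the branch that models the interaction between the channel dimension C and the spatial dimension H, Conv_{1×1} represents a convolution operation with a filter size of 1 × 1, and BN represents batch normalization.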
Figure 4. The attention-map-generation process of SCA. The top branch represents the cross-dimensional interaction between C and H, and the bottom branch represents the cross-dimensional interaction between C and W.
Likewise, in the bottom branch, we rearrange X into a dimensional representation of H × C × W. Then, we feed it sequentially to a convolutional layer with a filter size of 1 × 1, a batch normalization layer, and a Sigmoid activation function, obtaining attention weights of shape 1 × C × W. Finally, we permute the newly generated attention weights again, rearranging them as C × 1 × W.
Finally, we apply the two newly generated attention weights to the input tensor X via broadcast elementwise multiplication. A tensor Y weighted with cross-dimensional interaction can be obtained.
$$Y = X \odot F_{ch}(X) \odot F_{cw}(X)$$

where ⊙ denotes broadcast elementwise multiplication, F_ch(X) represents the interaction function between the channel dimension C and the height dimension H, and F_cw(X) represents the interaction function between the channel dimension C and the width dimension W.
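The two branches and the final multiplication described above can be traced with a small NumPy sketch. It is a simplified illustration under stated assumptions, not the authors' implementation: batch normalization is omitted, the 1 × 1 convolution with a single output channel is written as a weighted sum over the permuted leading axis, and the function names, tensor sizes, and random weights are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sca_branch(x, weight, spatial_axis):
    """One SCA branch on x of shape (C, H, W); batch normalization omitted.

    spatial_axis=2 is the top (C-H) branch: C x H x W -> W x H x C.
    spatial_axis=1 is the bottom (C-W) branch: C x H x W -> H x C x W.
    """
    permuted = np.swapaxes(x, 0, spatial_axis)
    # 1x1 convolution with one output channel = weighted sum over axis 0
    mixed = np.tensordot(weight, permuted, axes=(0, 0))  # (H, C) or (C, W)
    attn = sigmoid(mixed)[None]                          # (1, H, C) or (1, C, W)
    return np.swapaxes(attn, 0, spatial_axis)            # (C, H, 1) or (C, 1, W)


def sca(x, w_ch, w_cw):
    """Apply both branches to x via broadcast elementwise multiplication."""
    f_ch = sca_branch(x, w_ch, spatial_axis=2)  # (C, H, 1)
    f_cw = sca_branch(x, w_cw, spatial_axis=1)  # (C, 1, W)
    return x * f_ch * f_cw


# Illustrative sizes and random weights (assumptions for the demo).
rng = np.random.default_rng(0)
C, H, W = 4, 5, 6
x = rng.standard_normal((C, H, W))
w_ch = 0.1 * rng.standard_normal(W)  # top branch mixes over the W axis
w_cw = 0.1 * rng.standard_normal(H)  # bottom branch mixes over the H axis

y = sca(x, w_ch, w_cw)
print(sca_branch(x, w_ch, 2).shape, sca_branch(x, w_cw, 1).shape, y.shape)
```

In this sketch the weight for each (channel, row) or (channel, column) pair is distinct, so neither branch collapses the channel dimension or the retained spatial dimension.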

Results and Analysis
In this section, we first introduce our experimental settings. Then, the proposed method is evaluated on ImageNet-1K [20] for image classification based on ResNet-50 [14] and MobileNetV2 [15]. Ablation experiments are conducted to verify the effectiveness of cross-dimensional interaction. Finally, sample visualizations are provided from Grad-CAM to demonstrate the effectiveness of the proposed method in locating and identifying objects of interest.

Experimental Setup
For fair comparison, we followed the training configuration of ResNet. Likewise, we followed the training configuration and data augmentation method in [15] to implement our MobileNetV2-based architecture. We used the Adam [21] optimizer and a cosine learning schedule, training on a single NVIDIA Tesla P100 GPU.
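For reference, a cosine learning schedule is commonly computed as in the sketch below. The base learning rate, minimum learning rate, and number of steps are hypothetical placeholders, since the exact values are not stated here; warm-up and restarts are not modeled.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate at a given step (no warm-up, no restarts)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings, for illustration only.
total_steps, base_lr = 100, 0.1
for step in (0, 25, 50, 75, 100):
    print(step, round(cosine_lr(step, total_steps, base_lr), 5))
```

The rate starts at the base value, reaches half of it at the midpoint, and decays to the minimum at the final step.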

Comparative Experiment
The results of the comparative experiments are shown in Table 1. Spatial channel attention introduces the fewest parameters while consistently outperforming the other attention mechanisms. The ResNet-50-based model improved the top-one error rate on ImageNet by 2.08%, while the number of parameters only increased by 0.02% and the FLOPs increased by about 1%. Spatial channel attention outperformed SENet and CBAM with 0.66% and 0.18% improvements in top-one error rate, respectively. The main reason is that the proposed method considers spatial information and does not use GAP or GMP to reduce dimensionality, which avoids information loss. Furthermore, our method also outperformed TA by a small margin.
We observed similar performance trends in the smaller MobileNetV2 model. Compared with the MobileNetV2 baseline, using Spatial channel attention improved the top-one error rate by 1.02%, while increasing parameter complexity by only 0.03%. Spatial channel attention outperformed SENet with a 0.24% improvement in top-one error rate. We also observed that, in the case of MobileNetV2, the CBAM hurt the model performance: it reduced the accuracy by 1.71%. Experimental results showed that the proposed Spatial channel attention worked in both heavyweight and lightweight models with negligible increases in parameters and computation. Furthermore, our method also slightly outperformed TA. The main reason is that the proposed method uses multiplication to fuse the interactions of different dimensions, rather than simple addition, which treats the information of different dimensions equally.

Ablation Studies
To investigate the importance of the interaction between the different spatial dimensions and the channel dimension, we observed changes in performance when removing individual branches. In Table 2, CHA represents the top branch and CWA represents the bottom branch in Figure 4. As shown in Table 2, Spatial channel attention consistently outperformed the baseline model and its two single-branch counterparts.

Grad-CAM Visualization
To evaluate the effectiveness of the proposed method in locating and identifying objects of interest, sample visualizations are provided using the Grad-CAM [22] technique. Figure 5 shows the gradient visualization based on ResNet-50, where the ground-truth label of the top example is cock and that of the bottom example is Greater Swiss Mountain dog.
The visualizations show that the proposed method can help to locate target objects and improve the performance of CNNs.
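As a brief reminder of how such maps are typically computed, the sketch below follows the usual Grad-CAM recipe: channel weights are the spatial averages of the gradients of the class score with respect to a convolutional feature map, and the map is the ReLU of the weighted sum of activations. The inputs here are assumed to be precomputed arrays, and the final rescaling to [0, 1] is a common visualization convention rather than a detail taken from this paper.

```python
import numpy as np


def grad_cam(activations, gradients):
    """activations, gradients: (K, H, W) for one layer and one class score.

    Returns an (H, W) map rescaled to [0, 1] (all zeros if nothing is positive).
    """
    alpha = gradients.mean(axis=(1, 2))                       # (K,) channel importance
    weighted = np.tensordot(alpha, activations, axes=(0, 0))  # (H, W) weighted sum
    cam = np.maximum(weighted, 0.0)                           # keep positive evidence only
    peak = cam.max()
    return cam / peak if peak > 0 else cam


# Random stand-ins for a real feature map and its gradients (assumptions for the demo).
rng = np.random.default_rng(0)
acts = rng.standard_normal((16, 7, 7))
grads = rng.standard_normal((16, 7, 7))
heatmap = grad_cam(acts, grads)
print(heatmap.shape)
```

In practice the resulting map is upsampled to the input resolution and overlaid on the image.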

Conclusions
In this paper, we proposed an attention mechanism that does not involve dimensionality reduction, namely Spatial channel attention (SCA). SCA can capture cross-dimensional interaction information, including direction-aware channel information and channel-sensitive spatial information, which can help to improve the image classification performance of the model. Because SCA is a lightweight general module, it can be flexibly plugged into any convolutional neural network.
In the future, we plan to replace standard convolution with dilated convolution in the proposed method, aiming to reduce the computational cost while increasing the receptive field. We intend to apply SCA to object detection and other visual tasks and consider adding time-aware information to adapt to video-related tasks.

Figure 2. Connection implementation between the proposed attention mechanism and CNNs, where ⊕ denotes broadcast elementwise addition.