Article

CSGN: Combined Channel- and Spatial-Wise Dynamic Gating Architecture for Convolutional Neural Networks

Sangmin Hyun, Chang Ho Ryu, Ju Yeon Kang, Hyun Jo Lim and Tae Hee Han *
1 Department of Artificial Intelligence, Sungkyunkwan University, Suwon 16419, Korea
2 Department of Electrical and Computer Engineering, Sungkyunkwan University, Suwon 16419, Korea
3 Department of Semiconductor Systems Engineering, Sungkyunkwan University, Suwon 16419, Korea
* Author to whom correspondence should be addressed.
Electronics 2022, 11(17), 2678; https://doi.org/10.3390/electronics11172678
Submission received: 9 August 2022 / Revised: 19 August 2022 / Accepted: 24 August 2022 / Published: 26 August 2022
(This article belongs to the Section Artificial Intelligence)

Abstract

The explosive computation and memory requirements of convolutional neural networks (CNNs) hinder their deployment in resource-constrained devices. Because conventional CNNs perform identical parallelized computations even on redundant pixels, the saliency of various features in an image should be reflected for higher energy efficiency and market penetration. This paper proposes a novel channel and spatial gating network (CSGN) for adaptively selecting vital channels and generating spatial-wise execution masks. A CSGN can be characterized as a dynamic channel and a spatial-aware gating module by maximally utilizing opportunistic sparsity. Extensive experiments were conducted on the CIFAR-10 and ImageNet datasets based on ResNet. The results revealed that, with the proposed architecture, the amount of multiply-accumulate (MAC) operations was reduced by 1.97–11.78× and 1.37–13.12× on CIFAR-10 and ImageNet, respectively, with negligible accuracy degradation in the inference stage compared with the baseline architectures.

1. Introduction

The remarkable progress of modern convolutional neural networks (CNNs) has driven advancements across a wide range of scientific and engineering fields, particularly in image and vision applications [1]. However, neural networks are becoming deeper and wider to achieve high performance, and training and inference on high-resolution images in real-world applications such as autonomous driving [2], damage identification [3], and edge devices [4] suffer from excessive growth in computational complexity [5,6,7,8]. Model compression, which derives a simplified neural network from the original without significant accuracy loss, is an effective method for alleviating these computational burdens [9]. Nevertheless, most neural networks, including compressed and efficient models, perform the same computations for every image, regardless of its content.
Dynamic networks reduce computation by identifying channel and spatial features during the runtime and adjusting the model structure on demand. They also exhibit improved interpretability owing to selectively activated model components accounting for the analyzed input feature map during runtime. In addition, dynamic networks perform inference at a finer granularity when spatial features are taken into account. The spatially adaptive model can help extract the inherent sparsity of the feature map and be used with legacy model compression techniques.
Accordingly, dynamic-network-based gating architectures have been deployed to expand work-skipping opportunities through enhanced sparsity with gate functions [10,11]. In the decision-making scheme of dynamic networks, the gate function is a general and flexible approach to generating binary gate masks for selectively activating model components such as blocks, layers, channels, and auxiliary networks. Furthermore, the gate function can also be adapted to an arbitrary location in any backbone network as a plug-in module. The gating architectures can leverage the potential sparsity of the components by collaboratively considering various aspects of the models, such as the spatial-wise features of images and the dynamic network structure.
In this paper, we propose a channel and spatial gating network (CSGN) to enhance computational efficiency by generating optimized sparse masks. The CSGN is a gating architecture that considers both dynamic network and spatial-wise properties by combining two gating modules: channel and spatial gating. The outputs of these two modules create a synergistic effect and are applied to the output feature map to enhance sparsity.
The channel gating module with the Gumbel-SoftMax trick determines the effective channels for the next layer based on the saliency of the input channels. The Gumbel-SoftMax trick is a reparameterization technique that optimizes discrete decisions while enabling backpropagation. The number of calculations for incoming images is effectively reduced by constructing adaptive channels via channel gating, and the resulting structured sparsity is amenable to hardware implementation. In addition, a spatial gating module that skips ineffective regions by speculating on the effective output feature map is introduced. Spatial gating generates opportunistic spatial masks using partial sums of the weights and the input feature map. To relieve the workload of the sparse mask predictions, only highly correlated subchannels are utilized.
Furthermore, the CSGN offers insight into how the differentiable Gumbel-SoftMax can enhance channel gating by employing probabilistic masks for end-to-end training. To accomplish this during backward propagation, a global sparsity loss is proposed that is learned together with the channel target rate. In the inference stage, we demonstrate a deterministic method for generating sparse masks to reduce redundant computations.
The contributions of this study are as follows:
1.
We propose a fine-grained gating architecture combining channel- and spatial-wise sparsity to design a computation-efficient architecture on the fly.
2.
The two gating modules, CG and SG, can be integrated into CNNs and used with existing gradient-based optimizers. Additionally, we regularize the two gating modules by introducing a sparsity training loss that enables end-to-end training.
3.
The performance of CSGN on the widely used image classification datasets CIFAR-10 and ImageNet, and the object detection dataset MS COCO, is validated to demonstrate the combined effect. Furthermore, compared with other dynamic gating networks and static methods, the proposed network achieves competitive results.
The remainder of this paper is organized as follows: Section 2 introduces the computation-efficient studies related to CSGN and highlights the contribution of this study. Section 3 describes the CSGN in the following order: gating architecture, CG module, SG module, inference on CSGN, and sparsity training loss. Section 4 validates the effectiveness of CSGN. Finally, Section 5 provides the conclusion and future work.

2. Related Work

To address the extraordinary growth in computational complexity, static model compression research has been pursued to build computation-efficient CNN models. These models achieve efficiency through processes such as pruning [12,13,14,15], sparsity-aware architecture [16,17,18,19], knowledge distillation [20,21,22,23,24], network quantization [25,26,27,28], and tensor factorization [29,30,31]. Han et al. [9] paved the way for model compression incorporating pruning, quantization, and Huffman encoding to shrink over-parameterized CNNs by statically eradicating the weights based on the threshold and eliminating redundant network connections.
Compared with static compression, dynamic gating architectures enable existing network architectures to adapt on the fly, accounting for the characteristics of the input image. Gating architectures can be categorized into three types based on their granularity: layer-, channel-, and spatial-wise architectures. Table 1 summarizes the static compression and dynamic gating architectures discussed in this section.

2.1. Layer-Wise Gating Architecture

Layer-wise gating studies determine the number of conditionally executed layers in the neural networks. Inspired by the fact that the residual architecture is resilient to layer dropout, early-exiting schemes skip the remaining computations if a sufficient halting score is achieved in the early stage [32,33,34,35]. In addition, layer-skipping methods can independently apply the gate decision to the network layers to construct a dynamic depth network by skipping several intermediate layers [36,37,38].

2.2. Channel-Wise Gating Architecture

The channel-wise technique selects effective channels on the fly during forward propagation based on the fact that the same channel can be of disparate importance for different samples [47]. Similar to pruning, Liu et al. [15] utilized the scaling factor of batch normalization for channel pruning with a global threshold, which adapts simply to CNNs, and used lasso regularization to increase sparsity. To select individual channels, Chen et al. [39] proposed a completely separate network rather than using the existing structure. In contrast, a nondifferentiable method of top-k pruning after ranking based on channel saliency was suggested [40]. To address the lack of differentiability, Bejnordi et al. [41] combined channel gating with the Gumbel-SoftMax trick and coined a batch-shaping loss using the posterior probability, whereas CSGN applies a gating module to every 3 × 3 convolutional layer per residual block.
CGNet [42] considers the channel and spatial locations simultaneously: it allocates computation spatially and adaptively by enabling only the required subset of convolution filters in each layer, while the remaining filters are activated only in strategically selected regions.
When channel-wise gating architectures are trained without additional constraints, all candidate components of the model are activated, resulting in computational redundancy. Therefore, the channel-wise gating method should utilize the associated sparse loss function to construct structured channel sparsity.

2.3. Spatial-Wise Gating Architecture

Because not all pixels contribute equally to the final prediction of CNNs, spatial-wise gating architecture studies have been conducted to identify relevant regions in an image and perform adaptive inference based on the spatial features of images [43]. To avoid computation on less informative areas, the convolution operation is selectively performed on the sampled regions.
By combining spatial-wise gating and halting schemes, SACT [44] estimated pixel-level saliency and determined the dynamic depth based on the halting score. In addition, Gumbel-SoftMax was used to estimate the saliency of pixels [45], and a general framework for region-level adaptive inference was established by creating low-resolution image patches using reinforcement learning [46]. To efficiently predict the positions of zero elements on the output, Cao et al. [48] proposed a quantized sparse convolution. In summary, spatial-level gating architectures typically generate spatial masks with specific training techniques with the gate function; thus, convolution operations can be performed only on strategically sampled pixels with optimization techniques such as gradient estimation, reinforcement learning, and a trainable function policy. These techniques are used to train networks directly composed of non-differentiable functions with backward propagation.
Most previous gating architecture studies have focused on a single aspect of dynamic model components based on gating granularity, and as a result, they have not appropriately addressed more opportunistic sparsity induced by combined effects. Meanwhile, channel-wise gating architectures can collaboratively combine spatial-wise features to distill network sparsity maximally. Moreover, spatial-wise gating architectures can be used in conjunction with layer-wise methods, such as early-exiting and layer-skipping. By combining channel-wise and spatial-wise gating architectures, CSGN is designed to optimize the trade-off between accuracy and computational efficiency.

3. CSGN

The CSGN reduces CNN computation by boosting sparsity with two auxiliary branches called channel gating (CG) and spatial gating (SG). This section introduces the gating architecture, CSGN, and the auxiliary branches in detail. Next, the inference process on CSGN is outlined to reduce inference cost, and sparsity training loss is discussed to encourage the use of two gating modules in training.

3.1. Gating Architecture

Figure 1 illustrates the overall architecture of the convolutional layer in the residual block with the CSGN. CG incorporates a channel-wise squeezing function and a decision network with the Gumbel-SoftMax trick to allow gradient propagation. SG performs partitioning-based prediction to separate effective regions by producing a partial sum of the convolution. Both gating modules simultaneously generate channel- and spatial-wise execution masks based on the input feature map.
For example, as shown in Figure 1, CG squeezes the input feature map and selects effective channels after Gumbel-SoftMax sampling. In the output mask of CG, the selected channels are colored black, whereas the skipped channels are colored white. The output mask of SG excludes redundant locations by gate decision on the partial sum of convolution, and the spatial-aware region indicates an effective region, while unnecessary backgrounds are colored white.
To help explain the CSGN, the notations used are listed in Table 2. The residual block of ResNet [49] is engaged, which can be expressed as follows:
y = r(F(x, \Theta_1^l, \Theta_2^l) + x),    (1)
where F is a residual function. The residual block of CSGN can be denoted as follows:
y = r(\mathrm{Conv}_2(\mathrm{Conv}_1(x)) + x).    (2)
The CG and SG modules are operated on the convolutional layers; thus, modified convolutional layers are denoted as follows:
\mathrm{Conv}_1(x) = CG_1(x) \odot SG_1(x) \odot r(\mathrm{BN}(\Theta_1^l * x)),    (3)
\mathrm{Conv}_2(x) = CG_2(x) \odot SG_2(x) \odot \mathrm{BN}(\Theta_2^l * x),    (4)
where \Theta_1^l \in \mathbb{R}^{c_{l+1} \times c_l \times k_l \times k_l} and \Theta_2^l \in \mathbb{R}^{c_{l+1} \times c_{l+1} \times k_l \times k_l} are the weights of the residual block. BN and r are the batch normalization and activation functions (e.g., ReLU), respectively. Both \mathrm{Conv}_1(x) and \mathrm{Conv}_2(x) are obtained by element-wise multiplication (\odot) of the two gating modules and the naïve convolution (*). The process of \mathrm{Conv}_1(x) is depicted in Figure 1, and the two gating modules of \mathrm{Conv}_2(x) can be expressed similarly.
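For illustration, the following PyTorch-style sketch shows how the two execution masks of (3) could be applied to a gated convolutional layer. The module name and the gate interfaces are assumptions for exposition, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class GatedConv(nn.Module):
    """Sketch of Conv_1 in (3): CG(x) ⊙ SG(x) ⊙ r(BN(Θ * x))."""
    def __init__(self, in_ch, out_ch, channel_gate, spatial_gate):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.cg = channel_gate  # assumed to return a (N, out_ch, 1, 1) binary mask
        self.sg = spatial_gate  # assumed to return a (N, out_ch, H, W) binary mask

    def forward(self, x):
        cg_mask = self.cg(x)                    # channel-wise execution mask
        sg_mask = self.sg(x)                    # spatial-wise execution mask
        out = self.relu(self.bn(self.conv(x)))
        return cg_mask * sg_mask * out          # element-wise multiplication (⊙)
```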

3.2. CG Module

The CG module in the convolution layer determines the relevance of the input and output channels, allowing effective channels to propagate. The CG module has two functions: channel squeezing and the Gumbel-SoftMax trick.

3.2.1. Channel Squeezing

The goal of channel squeezing is to extract the channel saliency in each layer for a given input feature map. The computation amount can be reduced by removing pixels with little influence through global average pooling in (5).
CS(x) = \frac{1}{h_l w_l} \sum_{i=1}^{h_l} \sum_{j=1}^{w_l} x_{c_l, i, j}    (5)
The output of the average pooling is then passed through an FC layer f: CS(x) \in \mathbb{R}^{c_l \times 1 \times 1} \rightarrow S \in \mathbb{R}^{c_{l+1} \times 1 \times 1}, including an initialized weight tensor, which captures the dependency between the input and output channels.
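A minimal sketch of this squeezing step follows, assuming the saliency S is produced by global average pooling (5) followed by a fully connected layer; the class and attribute names are illustrative.

```python
import torch
import torch.nn as nn

class ChannelSqueeze(nn.Module):
    """Global average pooling (5) followed by the FC mapping f: R^{c_l} -> R^{c_{l+1}}."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.fc = nn.Linear(in_ch, out_ch)

    def forward(self, x):               # x: (N, c_l, H, W)
        cs = x.mean(dim=(2, 3))         # global average pooling over h_l × w_l
        return self.fc(cs)              # channel saliency S, shape (N, c_{l+1})
```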

3.2.2. Binary Gumbel-Softmax Trick

The Gumbel-SoftMax trick determines the channels of the next convolutional layer that are activated based on the relevance of the input and output channels. After the channel squeezing, the CG output should be mapped to a binary vector mask for relevant channel selection. A binary vector mask is applied to the output feature map of the convolutional layer with element-wise multiplication. A gradient estimator should be employed for end-to-end training if a non-differentiable function that cannot be trained directly with backward propagation is used in the gating modules.
The Gumbel-SoftMax trick, which approximates categorical samples by reparameterization, can overcome this problem by imposing differentiability on the gradient [12,36,45,50]. Gumbel-SoftMax was modified to replace the input with the saliency of the channels so that it could be leveraged in training binary-value gated channels.
The Gumbel-SoftMax trick is a transformation of the Gumbel-Max trick that approximates the sampling of a discrete distribution of input values and samples a categorical distribution with class probabilities π to draw sample Z . In Gumbel-Max, the argmax function performs one-hot encoding by adding noise that follows the Gumbel distribution to input logits. Sample Z is defined as:
Z = \mathrm{one\_hot}\left(\arg\max_i \left[ g_i + \log \pi_i \right]\right),    (6)
where g i and π i are independent and identically distributed samples from Gumbel (0, 1) and the categorical distribution with class probabilities, respectively. However, one-hot encoding does not support backward propagation owing to its lack of differentiability; thus, Gumbel-SoftMax replaces categorical samples with a continuous and differentiable approximation obtained by altering argmax to SoftMax as presented in (7).
Z_i = \frac{\exp((\log \pi_i + g_i)/\tau)}{\sum_{j=1}^{k} \exp((\log \pi_j + g_j)/\tau)}    (7)
Since the output of the CG module CG(x) is binary, the Gumbel-SoftMax trick transforms soft decisions (e.g., class probabilities) into hard decisions that can be determined simply from the channel saliency S. Considering the class probability \pi_1 as the channel execution probability \sigma(S), the non-execution probability \pi_2 is 1 - \sigma(S). By substituting \pi_1 and \pi_2 into (7), the sigmoid form \sigma can be derived as follows [45]:
Z = \sigma\left(\frac{S + g_1 - g_2}{\tau}\right),    (8)
where \tau is the Gumbel temperature, which controls how closely the samples approximate discrete one-hot vectors. When \tau \rightarrow 0, the sample vectors approach one-hot encoding; conversely, as \tau \rightarrow \infty, the sample vectors approach a uniform distribution.
Figure 2 depicts the forward and backward propagation in CG. In the forward propagation, hard decisions are employed, as expressed in (9). In backward propagation, the CG module is trained by obtaining a gradient from the soft decision in (8) using a straight-through estimator, as in (10).
CG(x) = \begin{cases} 1 & \text{if } Z > 0.5 \\ 0 & \text{otherwise} \end{cases} \quad \text{(forward)}    (9)
\frac{\partial L}{\partial Z} = \frac{\partial L}{\partial CG} \cdot \frac{\partial CG}{\partial Z} = \frac{\partial L}{\partial CG} \quad \text{(backward)}    (10)
Note that the relevance of the channel is estimated through Gumbel-SoftMax, and Gumbel sampling addresses the non-continuity of the argmax operation.
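The binary gate of (8)–(10) can be sketched as follows: Gumbel noise is added to the channel saliency during training, a hard threshold gives the forward mask, and a straight-through estimator passes the soft-decision gradient backward. This is a minimal sketch assuming a per-channel saliency tensor; the function name is a placeholder.

```python
import torch

def binary_gumbel_gate(saliency, tau=1.0, training=True):
    """Binary Gumbel-SoftMax gate with a straight-through estimator (sketch of (8)-(10))."""
    if training:
        # g1 - g2 for two i.i.d. Gumbel(0, 1) samples, as in (8)
        g1 = -torch.log(-torch.log(torch.rand_like(saliency) + 1e-20) + 1e-20)
        g2 = -torch.log(-torch.log(torch.rand_like(saliency) + 1e-20) + 1e-20)
        soft = torch.sigmoid((saliency + g1 - g2) / tau)
    else:
        # deterministic inference without Gumbel sampling (Section 3.4)
        soft = torch.sigmoid(saliency)
    hard = (soft > 0.5).float()               # hard decision, as in (9)
    # straight-through: forward uses the hard mask, backward uses the soft gradient (10)
    return (hard - soft).detach() + soft
```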

3.3. SG Module

The SG module in the convolutional layer estimates pixel-level saliency and skips the partial sums of irrelevant regions. As shown in (11), the SG module produces a spatial-wise mask SG(x) \in \{0, 1\}^{c_{l+1} \times h_l \times w_l} by employing a subset of the input features x_p \in \mathbb{R}^{p c_l \times h_l \times w_l} and weights \Theta_p \in \mathbb{R}^{c_{l+1} \times p c_l \times k_l \times k_l}. Here, p \in (0, 1] is the fraction value adopted in the partitioning of Gao et al. [40].
SG(x) = r(\mathrm{BN}(\Theta_p * x_p))    (11)
A sparse spatial mask for detecting the effective region is generated by predicting the output feature map. The output of the convolutional layer is determined in conjunction with the sparse spatial and channel masks. The SG module sparsifies the feature maps using a spatial-wise gate decision for the activated output channels by the CG module.
Because the residual network adopts ReLU as an activation function, the gate function in the SG can be described as a step function with a learnable threshold Φ . To impose differentiability, we alter the gate function to the sigmoid form for approximation during backward propagation, as follows:
\sigma(x - \Phi) = \frac{1}{1 + e^{-\alpha (x - \Phi)}},    (12)
where \Phi is the learnable excitation threshold of the features, and \alpha is a hyperparameter that regulates the difference between the approximated function and the step function. As \alpha increases, the sigmoid function approaches the step function.
The output of the SG determines whether the pixels should be computed using a gate function with BN and ReLU. The spatially gated mask identifies spatial-wise characteristics by defining discrete decisions for spatial locations.
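A minimal sketch of the SG module under these definitions, assuming the leading subset of input channels is used for the partial sum and a straight-through estimator couples the step-function gate with the sigmoid approximation of (12); the names and channel-selection rule are assumptions.

```python
import torch
import torch.nn as nn

class SpatialGate(nn.Module):
    """Sketch of (11)-(12): partial-sum prediction and a learnable-threshold gate."""
    def __init__(self, in_ch, out_ch, p=1/8, alpha=2.0):
        super().__init__()
        self.sub_ch = max(1, int(in_ch * p))      # subset of p * c_l input channels
        self.conv = nn.Conv2d(self.sub_ch, out_ch, kernel_size=3,
                              padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.phi = nn.Parameter(torch.zeros(1))   # learnable threshold Φ
        self.alpha = alpha

    def forward(self, x):                         # x: (N, c_l, H, W)
        partial = self.relu(self.bn(self.conv(x[:, :self.sub_ch])))  # partial sum, (11)
        soft = torch.sigmoid(self.alpha * (partial - self.phi))      # approximation, (12)
        hard = (partial > self.phi).float()       # step-function gate
        # straight-through: hard mask forward, sigmoid gradient backward
        return (hard - soft).detach() + soft
```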

3.4. Inference on CSGN

In the training stage of CG, the Gumbel-SoftMax trick, a well-defined stochastic categorical sampling method, is employed. Once the network is trained, adaptive gating during inference could follow the same procedure as training. However, stochastic sampling requires additional computation owing to the Gumbel noise; to fully exploit the trained sparsity in the convolution, the inference procedure should be implemented differently from training.
One alternative is the deterministic method, which can utilize the input directly as a soft decision without Gumbel sampling, and this approach makes a hard decision through the gate function. In addition, since the inference shows superior performance in a deterministic manner [36] and the probabilistic mask is eventually learned in a deterministic form [12], the deterministic method is usually adopted in the inference stage.
CG can generate the mask by considering only the sign of the channel selection without Gumbel-SoftMax because the sign of the sigmoid input in (8) can make gate decisions. As the input of (8) contains no logarithms or exponentials, the inference procedure can be computationally efficient. Furthermore, the channel-gated mask is utilized in naïve convolution to reduce multiply-accumulate (MAC) operations because convolution computations comprise a majority of the computations at the inference stage.
The SG module utilizes a subset of the input features and weights for partitioned convolution operations to produce an output spatial mask. In this process, the MAC reduction can be achieved by identifying relevant regions in an image and skipping irrelevant partial sums.
During inference, CSGN reduces the MAC computations for the current convolutional layer by skipping the partial sum for the output channel weight with the channel mask that is the output of the CG module. For the SG module, the spatial mask further relieves the computation burden of the current convolutional layer by considering the spatial redundancy of the input feature map. Consequently, CSGN operates only the partial sum extracted by CG and SG during inference, and the output feature map is delivered to the next convolutional layer as input.
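As a rough illustration of this inference path, the sketch below derives the channel mask directly from the sign of the saliency (the sigmoid input in (8)) and applies both masks to the output; this is conceptual only, since realizing the MAC savings requires sparse convolution kernels or hardware support that actually skips the masked partial sums.

```python
import torch

@torch.no_grad()
def gated_inference(x, conv, bn, relu, saliency, spatial_mask):
    """Deterministic gating at inference (sketch): the sign of the saliency
    replaces Gumbel sampling, so no logarithm or exponential is evaluated."""
    channel_mask = (saliency > 0).float()        # (N, c_{l+1}) channel-gated mask
    out = relu(bn(conv(x)))                      # dense conv shown for clarity; a sparse
                                                 # kernel would skip masked channels/pixels
    out = out * channel_mask[:, :, None, None]   # drop deselected output channels
    out = out * spatial_mask                     # drop irrelevant spatial locations
    return out
```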

3.5. Sparsity Training Loss

CSGN can be integrated into existing CNNs and should be trained with constrained sparsity. To achieve the given target rate in training the gating architecture, the CSGN accounts for the newly defined sparsity training loss for the CG and SG modules. The objective of sparsity loss is to determine the optimal model between conventional task loss (e.g., cross-entropy loss) and sparsity. Sparsity loss can be added to the entire loss function L t o t a l by granting a uniform target rate to each layer or a global target rate to the entire model.
Global sparsity loss is adopted as the loss function for the CG to assess the channel activation rate per layer. The global target rate determines the execution rate by summing the activated channels in all layers and penalizing when the execution rate differs from the target rate.
Let L denote the set of all layers in the model; the difference from the CG target rate (t_{CG} \in [0, 1]) is computed by dividing the sum of the activated channels by the sum of all channels. The channels of the model and the activated channels at layer l \in \{1, \ldots, L\} of the CG module are denoted as c_l and AC_l, respectively. A squared loss term is used to sparsify the CG; thus, the CG sparsity loss L_{CG} is defined as follows:
L_{CG} = \frac{1}{L} \left( \frac{\sum_{l \in L} AC_l}{\sum_{l \in L} c_l} - t_{CG} \right)^2.    (13)
The threshold target of the SG is applied with a uniform target rate for the intermediate layers. The loss term L S G is defined for the trainable threshold Φ in SG. The SG threshold target t S G is used for learning the threshold of layers, and the squared loss term can be expressed as in (14):
L_{SG} = \sum_{l \in L} \left( \Phi_l - t_{SG} \right)^2,    (14)
where \Phi_l specifies the threshold parameter of SG per layer. The total loss function considering the two gating modules of the CSGN is as follows:
L_{total} = L_{task} + \gamma L_{CG} + \lambda L_{SG},    (15)
where L_{task} is the task loss, and \gamma and \lambda are the scaling factors of CG and SG, respectively.
The sparsity loss facilitates the determination of the relevant channel subset for each layer and the identification of spatial regions in the feature maps, allowing the CSGN to achieve a practical solution with the least redundant work. Utilizing the sparsity loss is more flexible than the top-k pruning proposed in [40], which prunes a fixed number of channels per layer. The CSGN configures CG and SG to be trainable and achieves end-to-end training by utilizing a well-defined loss function.
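A compact sketch of the total loss in (13)–(15) follows, assuming each CG module exposes its binary channel mask and each SG module its learnable threshold; the variable names and interfaces are placeholders.

```python
import torch

def csgn_loss(task_loss, cg_masks, sg_thresholds, t_cg, t_sg,
              gamma=1e-4, lam=1e-3):
    """Total loss (15): task loss + global CG sparsity loss (13) + SG threshold loss (14)."""
    num_layers = len(cg_masks)
    activated = sum(m.sum() for m in cg_masks)     # sum of activated channels, Σ AC_l
    total = sum(m.numel() for m in cg_masks)       # sum of all candidate channels, Σ c_l
    loss_cg = (activated / total - t_cg) ** 2 / num_layers
    loss_sg = sum((phi - t_sg) ** 2 for phi in sg_thresholds)
    return task_loss + gamma * loss_cg + lam * loss_sg
```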

4. Evaluation

4.1. Experimental Setup

In this section, we demonstrate the performance of the CG and SG modules using two popular image classification benchmarks, CIFAR-10 [51] and ImageNet [52], as well as MS COCO [56], a representative object detection dataset. The baseline networks were implemented in PyTorch with ResNet20/32/44/110 on CIFAR-10 and ResNet18/34/50/101 on ImageNet to compare the performance with state-of-the-art gating architectures across model depths. We constructed an experimental environment using PyTorch 1.9, CUDA 10.2, and CUDNN 7.6.5 with an NVIDIA TITAN Xp (Pascal) GPU and an Intel Xeon E5-1650 CPU.

4.2. CIFAR-10

For the CIFAR-10 dataset, the CSGN was trained with the same hyperparameters implemented in ConvNet-AIG [36] and DynConv [45]: an SGD optimizer with a momentum of 0.9; a weight decay of 5 × 10^{-4}; a learning rate of 1 × 10^{-1}, 1 × 10^{-2}, and 1 × 10^{-3} for the 1st, 150th, and 250th epochs, respectively; a batch size of 256; a Gumbel temperature of 1; and a total of 350 epochs. The weight initialization proposed by He [53] was adopted.
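For reference, this schedule corresponds to a standard SGD setup such as the following sketch, in which the model is a stand-in for the gated ResNet and the data-loading loop is omitted.

```python
import torch

model = torch.nn.Linear(10, 10)   # placeholder for the gated ResNet baseline
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1,
                            momentum=0.9, weight_decay=5e-4)
# learning rate 1e-1 / 1e-2 / 1e-3 starting at epochs 1, 150, and 250
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[150, 250], gamma=0.1)
for epoch in range(350):
    # ... one training pass over CIFAR-10 with batch size 256 goes here ...
    scheduler.step()
```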
The CSGN was subsequently compared with gating networks used in previous studies: ConvNet-AIG, SACT [44], SkipNet [37], DynConv, and BAS [41]. The CG sparsity loss in (13) and the SG threshold loss in (14) specified in Section 3.5 were utilized to adjust the sparsity of each module.
The CG target rate was evaluated at {1.0, 0.8, 0.6, 0.4}, and the SG threshold target was evaluated at {0.0, 0.5} with a partition size of 1/8 on the baseline ResNet20, as summarized in Table 3. The hyperparameters for the loss were γ = 1 × 10^{-4}, λ = 1 × 10^{-3}, and α = 2.0.
ResNet models of various depths are used as baselines for the analysis, and a comparison with other studies is shown in Figure 3a. The results demonstrate that CSGN outperforms the other gating networks with a target rate of {1.0, 0.8} and a threshold target of 0.0, which are the optimal conditions. Table 3 summarizes the 1.97–11.78× MAC reductions and the corresponding accuracy degradation on the baseline ResNet20/32/44/110. When the target rate of the ResNet110 model is decreased, the accuracy is seriously degraded. Therefore, the channel target rate was set to 0.8 and the threshold to 0.0 for comparison with other studies. The trade-off between top-1 accuracy and computation cost can be evaluated using this target rate and threshold target.
DynConv and BAS apply the gating module to one 3 × 3 convolutional layer per basic block, whereas CSGN applies it to both 3 × 3 convolutional layers per basic block. Therefore, in ResNet models using the basic block, CSGN applies the gating module twice as often as DynConv and BAS. The accuracy decreases slightly while a significant MAC reduction is attained when deploying CSGN.

4.3. ImageNet

To evaluate CSGN extensively on a larger dataset, we applied it to ImageNet. The convolutional layers were initialized by fine-tuning from pretrained ResNet model parameters. ConvNet-AIG, FBS [40], BAS, MSDNet [35], and DynConv based on ResNet18/34/50/101 were adopted as the comparison group. The model was trained on a single GPU using the same hyperparameters as SACT and DynConv: 100 epochs, a batch size of 128, and a learning rate of 0.025, decayed by a factor of 0.1 at epochs 30 and 60. Similar to the CIFAR-10 dataset, t_{CG} and t_{SG} were set to {1.0, 0.8, 0.6, 0.4} and {0.0, 0.5}, respectively, to evaluate the trade-off points.
Table 4 summarizes the 1.37–13.12× MAC reductions and the corresponding accuracy degradation on the baseline ResNet18/34/50/101. Figure 3b shows the trade-off between top-1 accuracy and MACs compared with other gating networks under the optimal conditions of CSGN. The results show that CSGN achieves higher top-1 accuracy than other studies at similar computation costs on the same ResNet baseline architectures. On the ResNet18/34/50/101 baselines, the top-1 accuracy reached 69.39%, 73.35%, 75.19%, and 76.21%, respectively, when t_{CG} was 0.8. Compared with CIFAR-10, when ResNet contains a large number of residual blocks, the performance difference from other networks is clearly more significant.

4.4. MS COCO

We conducted an object detection experiment to show that CSGN can be applied to tasks other than classification. RetinaNet [54] and Faster-RCNN with a feature pyramid network (FPN) [55] were adopted as the detection frameworks, with CSGN as the backbone, and evaluated on the MS COCO 2017 dataset [56]. We trained CSGN for 100 epochs using the Adam optimizer with a learning rate of 1 × 10^{-5}, a batch size of 16, and fine-tuning from pretrained ResNet model parameters. The input image was resized so that the shorter edge was 600 pixels for both training and inference. We set the CG target rate to {0.8, 0.6} and the SG threshold to 0.0 with a partition size of 1/8 on the baseline ResNet50. During inference, we evaluated the MACs of the backbone network and the average precision (AP) over the entire validation set. Table 5 shows that CSGN achieves a 1.47–1.63× MAC reduction compared with the baseline network with a negligible AP drop of 0.4–0.6.

4.5. Ablation Study

To validate the effectiveness of the CSGN, the benefit of each module was clarified through ablation studies of the channel and spatial gating, the two CSGN modules. We applied the two modules one at a time to identify the optimal hyperparameters and assess their individual effectiveness.
Table 6 shows a comparison of the accuracy and sparsity according to the CG sparsity in CSGN(CG). It can be observed that the activated channel rate according to the target rate is trained without significant accuracy degradation.
Table 7 shows the CSGN (SG) performance according to the global threshold target and the hyperparameter α introduced in (12). As the threshold target was increased, the sparsity also increased, leading to accuracy degradation. A relatively slight decrease in accuracy was observed when the threshold exceeded 1.
To confirm the relationship between the accuracy and MACs of the CSGN, the module integrating CG and SG was evaluated. The CG target rate was set to {1.0, 0.8, 0.6, 0.4}, and the SG threshold target to {0.0, 0.5}, as in Table 3 and Table 4.
The CSGN reduced MACs over a wider range of operating points than other studies. CG and SG appear to complement each other's weaknesses by enabling the flow of informative features. As a result, a well-organized final prediction can be obtained.
Table 8 presents the correlation between a subset of channels and the output feature map in CSGN (SG). As the partition size increases, the accuracy degradation also increases when t_{SG} is 0.0. Although partial sums cannot predict the output features exactly, this result reveals that they can be used to effectively identify ineffective output features. When using SG alone with 1/8 partitioning, the accuracy decreases by approximately 1.18% and 1.23%; however, when coupled with CG, the modules work synergistically and produce negligible accuracy degradation. The α of the gate function affects the slope of the sigmoid function, which operates closer to the step function for larger values, as seen from the smaller accuracy drop when a large α is given.

4.6. Additional Analysis

To evaluate the global sparsity loss of the CG, we analyzed the sparsity per layer on ResNet20/32/44. The activated channels per layer in Figure 4 indicate that the first layers are less sparse than the last ones, which suggests that the anterior channels are more important than the posterior ones. The global sparsity changes according to the CG target rate; however, the sparsity of the anterior layers barely shifts.
We also compared against static methods to validate the effect of the dynamic gating architecture. Table 9 shows that the CSGN outperforms static methods such as soft filter pruning [13], FPGM [14], and network slimming [15]. A static method constrains the representation capability of the model because it permanently removes the computation of weights or whole filters regardless of the characteristics of the input image. In contrast, because the dynamic method estimates saliency depending on the input image, it preserves the representation capability of the model and achieves a superior trade-off between accuracy and computational efficiency.
To visualize and assess the amount of computation per spatial location of the SG module, ponder cost maps, as presented in DynConv [45] and SACT [44], were adopted (Figure 5). A dark region in the cost map indicates a small amount of computation, and a bright region implies a considerable amount of computation. The results show a difference in the amount of computation between complex and simple images, as the SG module extracts spatial-wise features from the images. In addition, although only a subset of the channels is used, it is sufficient to determine the effective regions by identifying the spatial features of the image.

5. Conclusions

In this study, we proposed a dynamic gating architecture that reduces the inference cost by considering the combined channel- and spatial-wise effects, which were verified through ablation studies. CSGN can be trained end-to-end using channel- and spatial-gated masks and applied to deeper and wider models. CSGN achieved a better trade-off between top-1 accuracy and computation cost than similar gating architectures. We believe that CSGN can be widely applicable and beneficial in resource-constrained mobile and edge applications, where energy efficiency is the primary concern. Future work on CSGN includes incorporating other domain-based network optimization techniques and inference speedup with specialized hardware architectures.

Author Contributions

Data curation, C.H.R. and H.J.L.; Formal analysis, C.H.R. and J.Y.K.; Investigation, C.H.R.; Methodology, S.H. and C.H.R.; Project administration, S.H.; Software, C.H.R.; Supervision, T.H.H.; Validation, H.J.L.; Visualization, C.H.R.; Writing—original draft, S.H. and J.Y.K.; Writing—review & editing, J.Y.K., H.J.L. and T.H.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00421, AI Graduate School Support Program(Sungkyunkwan University)), in part by the National R&D Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (No. 2020M3H2A1076786) and in part by the MOTIE and KEIT (20010560, Development of system level design and verification for in storage processing architecture based on phase change memory).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Alom, M.Z.; Taha, T.M.; Yakopcic, C.; Westberg, S.; Sidike, P.; Nasrin, M.S.; Hasan, M.; Van Essen, B.C.; Awwal, A.A.; Asari, V.K. A state-of-the-art survey on deep learning theory and architectures. Electronics 2019, 8, 292. [Google Scholar] [CrossRef]
  2. Zeng, X.; Wang, Z.; Hu, Y. Enabling Efficient Deep Convolutional Neural Network-based Sensor Fusion for Autonomous Driving. arXiv 2022, arXiv:2202.11231. [Google Scholar]
  3. Cuong-Le, T.; Nghia-Nguyen, T.; Khatir, S.; Trong-Nguyen, P.; Mirjalili, S.; Nguyen, K.D. An efficient approach for damage identification based on improved machine learning using PSO-SVM. Eng. Comput. 2021, 38, 3069–3084. [Google Scholar] [CrossRef]
  4. Maor, G.; Zeng, X.; Wang, Z.; Hu, Y. An FPGA implementation of stochastic computing-based LSTM. In Proceedings of the 2019 IEEE 37th International Conference on Computer Design (ICCD), Abu Dhabi, United Arab Emirates, 17–20 November 2019; IEEE: Abu Dhabi, United Arab Emirates, 2019; pp. 38–46. [Google Scholar]
  5. Chen, J.; Ran, X. Deep learning with edge computing: A review. Proc. IEEE 2019, 107, 1655–1674. [Google Scholar] [CrossRef]
  6. Canziani, A.; Paszke, A.; Culurciello, E. An analysis of deep neural network models for practical applications. arXiv 2016, arXiv:1605.07678. [Google Scholar]
  7. Liang, T.; Glossner, J.; Wang, L.; Shi, S.; Zhang, X. Pruning and quantization for deep neural network acceleration: A survey. Neurocomputing 2021, 461, 370–403. [Google Scholar] [CrossRef]
  8. Kim, Y.D.; Park, E.; Yoo, S.; Choi, T.; Yang, L.; Shin, D. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv 2015, arXiv:1511.06530. [Google Scholar]
  9. Han, S.; Mao, H.; Dally, W.J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv 2015, arXiv:1510.00149. [Google Scholar]
  10. Bengio, Y. Deep learning of representations: Looking forward. In Proceedings of the International Conference on Statistical Language and Speech Processing, Tarragona, Spain, 29–31 July 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 1–37. [Google Scholar]
  11. Sigaud, O.; Masson, C.; Filliat, D.; Stulp, F. Gated networks: An inventory. arXiv 2015, arXiv:1512.03201. [Google Scholar]
  12. Zhou, X.; Zhang, W.; Xu, H.; Zhang, T. Effective sparsification of neural networks with global sparsity constraint. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 3599–3608. [Google Scholar]
  13. He, Y.; Kang, G.; Dong, X.; Fu, Y.; Yang, Y. Soft filter pruning for accelerating deep convolutional neural networks. arXiv 2018, arXiv:1808.06866. [Google Scholar]
  14. He, Y.; Liu, P.; Wang, Z.; Hu, Z.; Yang, Y. Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4340–4349. [Google Scholar]
  15. Liu, Z.; Li, J.; Shen, Z.; Huang, G.; Yan, S.; Zhang, C. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2736–2744. [Google Scholar]
  16. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  17. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856. [Google Scholar]
  18. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
  19. Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Sandler, M.; Howard, A.; Le, Q.V. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2820–2828. [Google Scholar]
  20. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  21. Ba, J.; Caruana, R. Do deep nets really need to be deep? In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2654–2662. [Google Scholar]
  22. Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. Fitnets: Hints for thin deep nets. arXiv 2014, arXiv:1412.6550. [Google Scholar]
  23. Zagoruyko, S.; Komodakis, N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv 2016, arXiv:1612.03928. [Google Scholar]
  24. Yim, J.; Joo, D.; Bae, J.; Kim, J. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4133–4141. [Google Scholar]
  25. Zhou, S.; Wu, Y.; Ni, Z.; Zhou, X.; Wen, H.; Zou, Y. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv 2016, arXiv:1606.06160. [Google Scholar]
  26. Lin, X.; Zhao, C.; Pan, W. Towards accurate binary convolutional neural network. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  27. Rastegari, M.; Ordonez, V.; Redmon, J.; Farhadi, A. Xnor-net: Imagenet classification using binary convolutional neural networks. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 525–542. [Google Scholar]
  28. Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Binarized neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 4114–4122. [Google Scholar]
  29. Li, S.; Hanson, E.; Li, H.; Chen, Y. Penni: Pruned kernel sharing for efficient CNN inference. In Proceedings of the International Conference on Machine Learning, Virtual Event, 13–18 July 2020; pp. 5863–5873. [Google Scholar]
  30. Li, S.; Hanson, E.; Qian, X.; Li, H.H.; Chen, Y. ESCALATE: Boosting the Efficiency of Sparse CNN Accelerator with Kernel Decomposition. In Proceedings of the MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, Virtual Event, 18–22 October 2021; pp. 992–1004. [Google Scholar]
  31. Denton, E.L.; Zaremba, W.; Bruna, J.; LeCun, Y.; Fergus, R. Exploiting linear structure within convolutional networks for efficient evaluation. Adv. Neural Inf. Process. Syst. 2014, 1, 1269–1277. [Google Scholar]
  32. Graves, A. Adaptive computation time for recurrent neural networks. arXiv 2016, arXiv:1603.08983. [Google Scholar]
  33. Bolukbasi, T.; Wang, J.; Dekel, O.; Saligrama, V. Adaptive neural networks for efficient inference. In Proceedings of the International Conference on Machine Learning. PMLR, Sydney, Australia, 6–11 August 2017; pp. 527–536. [Google Scholar]
  34. Panda, P.; Sengupta, A.; Roy, K. Conditional deep learning for energy-efficient and enhanced pattern recognition. In Proceedings of the 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), Dresden, Germany, 14–18 March 2016; IEEE: Dresden, Germany, 2016; pp. 475–480. [Google Scholar]
  35. Huang, G.; Chen, D.; Li, T.; Wu, F.; Van Der Maaten, L.; Weinberger, K.Q. Multi-scale dense networks for resource efficient image classification. arXiv 2017, arXiv:1703.09844. [Google Scholar]
  36. Veit, A.; Belongie, S. Convolutional networks with adaptive inference graphs. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–18. [Google Scholar]
  37. Wang, X.; Yu, F.; Dou, Z.Y.; Darrell, T.; Gonzalez, J.E. Skipnet: Learning dynamic routing in convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 409–424. [Google Scholar]
  38. Wu, Z.; Nagarajan, T.; Kumar, A.; Rennie, S.; Davis, L.S.; Grauman, K.; Feris, R. Blockdrop: Dynamic inference paths in residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8817–8826. [Google Scholar]
  39. Chen, Z.; Li, Y.; Bengio, S.; Si, S. You look twice: Gaternet for dynamic filter selection in cnns. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9172–9180. [Google Scholar]
  40. Gao, X.; Zhao, Y.; Dudziak, Ł.; Mullins, R.; Xu, C.Z. Dynamic channel pruning: Feature boosting and suppression. arXiv 2018, arXiv:1810.05331. [Google Scholar]
  41. Bejnordi, B.E.; Blankevoort, T.; Welling, M. Batch-shaping for learning conditional channel gated networks. arXiv 2019, arXiv:1907.06627. [Google Scholar]
  42. Hua, W.; Zhou, Y.; De Sa, C.M.; Zhang, Z.; Suh, G.E. Channel gating neural networks. Adv. Neural Inf. Process. Syst. 2019, 32, 1886–1896. [Google Scholar]
  43. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929. [Google Scholar]
  44. Figurnov, M.; Collins, M.D.; Zhu, Y.; Zhang, L.; Huang, J.; Vetrov, D.; Salakhutdinov, R. Spatially adaptive computation time for residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1039–1048. [Google Scholar]
  45. Verelst, T.; Tuytelaars, T. Dynamic convolutions: Exploiting spatial sparsity for faster inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 2320–2329. [Google Scholar]
  46. Wang, Y.; Lv, K.; Huang, R.; Song, S.; Yang, L.; Huang, G. Glance and focus: A dynamic approach to reducing spatial redundancy in image classification. Adv. Neural Inf. Process. Syst. 2020, 33, 2432–2444. [Google Scholar]
  47. Lin, J.; Rao, Y.; Lu, J.; Zhou, J. Runtime neural pruning. Adv. Neural Inf. Process. Syst. 2017, 30, 2178–2188. [Google Scholar]
  48. Cao, S.; Ma, L.; Xiao, W.; Zhang, C.; Liu, Y.; Zhang, L.; Nie, L.; Yang, Z. Seernet: Predicting convolutional neural network feature-map sparsity through low-bit quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–19 June 2019; pp. 11216–11225. [Google Scholar]
  49. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 1–26 June 2016; pp. 770–778. [Google Scholar]
  50. Jang, E.; Gu, S.; Poole, B. Categorical reparameterization with gumbel-softmax. arXiv 2016, arXiv:1611.01144. [Google Scholar]
  51. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images. Master’s Thesis, Department of Computer Science, University of Toronto, Toronto, ON, Canada, 2009. [Google Scholar]
  52. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Miami, FL, USA, 2009; pp. 248–255. [Google Scholar]
  53. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, Washington, DC, USA, 7–13 December 2015; pp. 1026–1034. [Google Scholar]
  54. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  55. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  56. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
Figure 1. CSGN in the convolutional layer: CG generates a channel selection mask using Gumbel-SoftMax trick, and SG produces a spatial-wise mask with a partitioned convolution.
Figure 2. Forward and backward propagation in CG: forward and backward paths exploit hard and soft values for the gradient, respectively. In the inference process, the sparsity mask is generated without Gumbel sampling.
Figure 3. Performance comparison of CSGN with other gating networks on (a) CIFAR-10 and (b) ImageNet datasets.
Figure 4. Performance of CG with global sparsity loss according to the target rate on (a) ResNet20, (b) ResNet32, and (c) ResNet44 models.
Figure 5. Ponder cost map, which represents the amount of computation at each spatial location, indicates that SG predicts spatial features on CIFAR-10 images.
Table 1. Summary table for computation-efficient CNN models.

Static pruning: ProbMask [12], Soft Filter Pruning [13], FPGM [14], Network Slimming [15]
Dynamic gating architecture:
  Layer-wise (early-exit): ACT [32], Bolukbasi et al. [33], Panda et al. [34], MSDNet [35]
  Layer-wise (layer-skipping): ConvNet-AIG [36], SkipNet [37], Blockdrop [38]
  Channel-wise: GaterNet [39], FBS [40], BAS [41], CGNet [42]
  Spatial-wise: Zhou et al. [43], SACT [44], DynConv [45], GFNet [46]
Table 2. Notations used in CSGN.

Notation             Description
Θ^l                  Weight of the l-th convolutional layer
x, y                 Input and output feature maps
h_l, w_l, c_l, k_l   Height, width, channels, and filters at the l-th layer
σ                    Sigmoid function
*                    Naïve convolution operation
g                    Independent and identically distributed samples from the Gumbel distribution
π                    Categorical distribution with class probabilities
τ                    Temperature for Gumbel-SoftMax
L_CG, L_SG           Sparsity loss functions for CG and SG
Φ                    Threshold parameter of SG
t_CG, t_SG           Target rate in CG and threshold target in SG
Conv_1, Conv_2       Convolutional layers 1 and 2 in the residual block
CS(x)                Average pooling function, CS(x) ∈ R^{c_l × 1 × 1}
S                    Saliency of channels, S ∈ R^{c_{l+1} × 1 × 1}
CG(x)                CG output mask, CG(x) ∈ {0, 1}^{c_{l+1} × 1 × 1}
SG(x)                SG output mask, SG(x) ∈ {0, 1}^{c_{l+1} × h_l × w_l}
Table 3. Top-1 accuracy and the MAC reduction of the CSGN on CIFAR-10.

                                  t_SG = 0.0                      t_SG = 0.5
Baseline    t_CG:          1.0     0.8     0.6     0.4     1.0     0.8     0.6     0.4
ResNet20    Accuracy (%)   91.39   91.21   90.18   88.68   90.18   89.26   87.78   86.47
            MAC reduction  1.97×   2.46×   2.81×   3.82×   3.62×   4.53×   5.31×   7.11×
ResNet32    Accuracy (%)   92.18   91.74   91.76   90.58   90.99   90.79   89.03   88.27
            MAC reduction  1.99×   2.47×   3.11×   4.64×   3.89×   4.48×   5.58×   7.98×
ResNet44    Accuracy (%)   92.93   92.54   91.84   90.55   91.67   90.71   89.68   89.24
            MAC reduction  1.98×   2.71×   3.73×   5.09×   4.01×   5.39×   6.83×   9.51×
ResNet110   Accuracy (%)   93.36   92.88   91.91   90.74   88.98   86.21   84.65   82.43
            MAC reduction  2.10×   3.05×   4.48×   5.49×   6.13×   9.18×   10.42×  11.78×
Table 4. Top-1 accuracy and the MAC reduction of the CSGN on ImageNet.

                                  t_SG = 0.0                      t_SG = 0.5
Baseline    t_CG:          1.0     0.8     0.6     0.4     1.0     0.8     0.6     0.4
ResNet18    Accuracy (%)   70.21   69.39   65.22   63.03   69.12   68.76   64.14   63.08
            MAC reduction  2.18×   4.12×   6.78×   8.12×   3.81×   5.37×   8.20×   13.12×
ResNet34    Accuracy (%)   74.42   73.35   69.54   68.24   72.85   72.35   69.19   67.31
            MAC reduction  2.16×   2.46×   4.08×   7.34×   3.69×   4.33×   5.91×   9.16×
ResNet50    Accuracy (%)   75.34   75.19   74.14   72.73   75.08   73.86   71.70   69.60
            MAC reduction  1.37×   1.49×   1.61×   1.74×   1.43×   1.57×   1.72×   1.88×
ResNet101   Accuracy (%)   77.31   76.21   74.92   74.02   75.83   74.21   72.01   70.24
            MAC reduction  1.38×   1.47×   1.58×   1.71×   1.80×   1.87×   2.09×   2.14×
Table 5. Object detection results (bounding box AP) and the MAC reduction of the CSGN on COCO 2017.

Model               Configuration     AP     AP_0.5   AP_0.75   MAC reduction
RetinaNet [54]      Baseline          33.9   53.1     36.3      1.00×
                    CSGN (0.8/0.0)    33.5   52.6     35.8      1.48×
                    CSGN (0.6/0.0)    33.3   52.4     35.7      1.63×
Faster-RCNN [55]    Baseline          34.3   55.9     37.7      1.00×
                    CSGN (0.8/0.0)    33.9   55.6     37.3      1.47×
                    CSGN (0.6/0.0)    33.8   55.4     37.1      1.61×
Table 6. Effectiveness of CSGN (CG) according to the target rate with baseline ResNet20.

t_CG   Accuracy (%)   Sparsity (%)   MAC (×10^7)
0.8    92.43          21.60          3.43
0.7    92.20          31.40          3.12
0.6    91.85          40.70          2.77
0.5    90.92          50.50          2.32
0.4    91.05          60.10          1.85
0.3    90.08          70.03          1.45
Table 7. Effectiveness of CSGN (SG) according to the hyperparameters with baseline ResNet20.

α      t_SG   Accuracy (%)   Sparsity (%)   MAC (×10^7)
1.0    0.0    91.37          49.90          2.07
       0.5    91.35          73.10          1.73
       1.0    90.63          72.30          1.56
       1.5    90.39          95.50          1.29
       2.0    90.05          99.40          0.91
2.0    0.0    91.39          49.80          2.06
       0.5    90.18          74.50          1.69
       1.0    90.08          89.10          1.46
       1.5    89.96          96.10          1.17
       2.0    89.96          96.10          0.87
Table 8. Partitioning impact: effectiveness of SG according to the partition size and α.

α      1/p    Accuracy drop (%)
2.0    2      0.37
       4      0.56
       8      1.18
       16     1.48
1.0    2      0.51
       4      0.97
       8      1.23
       16     1.60
Table 9. Comparison with static compression methods in terms of top-1 accuracy and the amount of computation, with ResNet18 on ImageNet.

Model                      Accuracy (%)   MAC reduction
Soft filter pruning [13]   67.10          1.72×
FPGM [14]                  68.41          1.70×
Network slimming [15]      67.21          1.39×
CSGN                       70.21          2.19×
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

