Structured Compression of Convolutional Neural Networks for Specialized Tasks

Abstract: Convolutional neural networks (CNNs) offer significant advantages in various image classification tasks and computer vision applications. CNNs are increasingly deployed in environments ranging from edge and Internet of Things (IoT) devices to high-end computational infrastructures, such as supercomputers, cloud computing, and data centers. The growing amount of data, however, together with the growth in model size and computational complexity, introduces major computational challenges. Such challenges present entry barriers for IoT and edge devices and increase the operational expenses of large-scale computing systems. Thus, optimizing CNN algorithms has become essential. In this paper, we introduce the S-VELCRO compression algorithm, which exploits value locality to trim filters in CNN models utilized for specialized tasks. S-VELCRO uses structured compression, which can save costs and reduce overhead compared with unstructured compression. The algorithm runs in two steps: a preprocessing step identifies the filters with a high degree of value locality, and a compression step trims the selected filters. As a result, S-VELCRO reduces the computational load of the channel activation function and avoids the convolution computation of the corresponding trimmed filters. Compared with typical CNN compression algorithms, which run heavy back-propagation training computations, S-VELCRO has significantly lower computational requirements. Our experimental analysis shows that S-VELCRO achieves a compression-saving ratio between 6% and 30%, with no degradation in accuracy, for ResNet-18, MobileNet-V2, and GoogLeNet when used for specialized tasks.


Introduction
The usage of convolutional neural networks (CNNs) is continuously growing in computer vision applications ranging from IoT and edge devices to supercomputers and cloud infrastructures. Over the years, new CNNs have been introduced that achieve remarkable prediction accuracy on different classification tasks. The new models, however, are increasingly complex and demand ever more processing throughput for both inference and training. For example, the ResNet-101 model [1], introduced in 2015, used 101 layers and required nearly sevenfold greater computational throughput [2] than the AlexNet model [3], which was introduced in 2012 and had eight layers. In conjunction with the processing complexity, the growing number of model layers introduces a major challenge for real-time computer vision applications due to increasing inference latency. These challenges have become major entry barriers for CNNs on IoT and edge devices, which are limited in performance, memory footprint, cost, and energy. CNNs also significantly increase the power consumption and cost of high-end computer systems because they require special accelerators, such as graphics processing units (GPUs) or tensor processing units (TPUs), to handle their complex algorithms. Thus, optimizing CNN algorithms without excessive performance loss has become essential to fitting them within a system's envelope of cost, performance, memory, and energy. Different approaches, such as pruning [4][5][6][7] and quantization [8][9][10][11], have been introduced to optimize CNN models. In many cases, these techniques involve model retraining and complex back-propagation processing. We discuss these approaches and other techniques in detail in Section 2.
In this paper, we examine the applications of CNN models used for specialized tasks. Whereas general-purpose CNNs are used for a broad range of classification tasks, specialized CNNs are optimized and tuned to handle a small set of specific tasks. Specialized CNN applications are an emerging field in machine learning for computing systems that range from edge devices to high-end large-scale systems [12][13][14], and several applications have been reported recently [15][16][17]. In Ref. [13], a specialized CNN usage is presented for offline video analytics that is performed in a hierarchical manner. A specialized lightweight CNN model is used at the first level of the classification process, and only low-confidence classifications are moved to the second-level general-purpose CNN. Figure 1a illustrates such an application of a specialized CNN for vehicle detection on highways. In this example, the real-time video analytic application employs a compressed CNN model that is specialized in vehicle detection and is derived from a general-purpose CNN. Similar hierarchical specialized CNN approaches have been used for image classification because image datasets are typically organized in hierarchical classes. In this case, a different specialized CNN can be used for every hierarchy with the needed classification specialty. In Ref. [18], an ensemble of specialized expert models is presented, where each model is an expert in a specific classification task. Figure 1b illustrates such an ensemble of expert CNN models. The ensemble is governed by a gating network that selects one or multiple expert CNNs based on the input image. The gating network assigns a weight to every expert CNN selected, while unselected experts are assigned a weight of 0. A similar approach called cascaded-CNN is presented in Ref. [19]. In that case, each CNN is optimized for specialized tasks, and the classification of all models is combined to produce a complete prediction map. Additional examples also exist in game-scrapping applications [13].
In this paper, we extend our previous work [20] and introduce an enhanced novel algorithm for the structured compression of CNNs employed for specialized tasks. The new compression algorithm, S-VELCRO, is based on structured value-locality compression. S-VELCRO exploits the property of value locality, introduced in [20]: when CNNs are used for specialized tasks, their activation functions produce outputs within a proximal range of values during inference. Our prior work [20] introduced the value-locality-based compression (VELCRO) algorithm, which exploits value locality for unstructured model compression. VELCRO trims activation function elements without any predetermined constraint related to the network structure. Such unstructured compression methods, however, may incur overhead to compute addresses and indices of compressed elements and to store them in dedicated metadata storage elements. These limitations become particularly taxing when unstructured compression runs on GPUs and TPUs. As illustrated in Figure 2, S-VELCRO overcomes the limitations of VELCRO by performing structured compression, trimming model filters and their corresponding activation kernels.
To do so, S-VELCRO runs in two steps: a preprocessing stage identifies the filters with a high degree of value locality, and a compression step trims the selected filters. Unlike common CNN compression algorithms, S-VELCRO does not require any back-propagation training and thus has significantly fewer computational requirements. Our experimental environment examines S-VELCRO's capabilities on three CNN models-ResNet-18 [2], MobileNet V2 [21], and GoogLeNet [22]-using the ILSVRC-2012 (ImageNet) [23] dataset. In addition, we implement S-VELCRO on a hardware platform to measure its computational and energy savings.
The contributions of this paper are summarized as follows:

1. We present a novel algorithm, S-VELCRO, that exploits value locality to trim filters in CNN models utilized for specialized tasks. S-VELCRO offers structured compression that avoids two costs of unstructured compression: (i) the addressing overhead [24] of computing addresses and indices of compressed elements and (ii) the storage of those indices in dedicated metadata.

2. S-VELCRO provides a fast compression process that relies solely on statistics gathered during inference and avoids the heavy computations required for the back-propagation training used by traditional compression approaches, such as pruning.

3. The results of our experiments indicate that S-VELCRO produces a compression-saving ratio of computations in the range of 6-30% with no degradation in prediction accuracy.

4. We demonstrate the energy savings of S-VELCRO by implementing the compression algorithm on a hardware platform using a field-programmable gate array (FPGA) for ResNet-18. Our experimental results indicate a 24-27% reduction in energy consumption with S-VELCRO.
The remainder of this paper is organized as follows: Section 2 reviews previous work. Section 3 introduces the proposed method and algorithm. Section 4 presents the experimental results. Finally, Section 5 summarizes this study's conclusions.

Prior Work
CNN models typically require large amounts of storage and have high computational costs. Therefore, numerous model compression and acceleration methods have been proposed in recent years [25,26]. Approaches seek to optimize CNN computations by reducing redundancy, computational complexity, and memory storage. Such savings are critical for some real-time applications and in the deployment of CNN models on portable devices. In this section, we describe several related methods: model pruning and quantization, deep compression (a combination of pruning, quantization, and Huffman coding), knowledge distillation, CNN design for specialized inference tasks, and filter compression methods.

Pruning
Model pruning [4][5][6] is an effective approach to reducing model complexity and addressing the over-fitting problem. Taking inspiration from neuroscience studies, pruning tries to eliminate unimportant or redundant CNN parameters to which the model's performance is insensitive. Pruning techniques have been extensively studied [4,[27][28][29][30][31], and the core idea is to remove redundant weights, neurons, kernels, or filters with minimum accuracy loss at the inference stage of a trained network. Therefore, pruning can result in a smaller network and fewer computations. Traditional pruning algorithms consist of training a large model, pruning it, and then fine-tuning the pruned model, which may incur high computational complexity [32].
CNN pruning techniques can be classified as unstructured or structured. In unstructured pruning, elements such as weights and neurons are removed without consideration of the network structure; typically, less important weights or activations are independently replaced by zeros. In structured pruning [24], the pruning process is restricted by sparsity constraints and removes structured parts (e.g., filters or layers). Unstructured pruning may fully exploit redundancy in the network, but it may require a special format to represent the pruned elements, and speedup may require a particular software/hardware accelerator for sparse matrix multiplications. Structured pruning may limit the redundancy that can be exploited in the network, but its speedup is typically well supported by various off-the-shelf libraries.
Pruning redundant, noninformative elements is usually done in accordance with their importance and contribution to the network's accuracy. The process of pruning is based on ranking the network elements according to various metrics-e.g., the L1 or L2 norms [5,[33][34][35][36] of weights, activations, and filters or the saliency of two weight sets [37]. Dynamic mechanisms are required to prune activations because their importance may depend on the model's input. Reinforcement learning is used to prune channels in Ref. [38], and spatial correlations of CNN output feature maps (OFMs) are used to predict and prune zero-valued activations in [39,40]. Model size reduction and inference time speed-up have also been demonstrated by various pruning techniques based on weight magnitudes [27,41,42].
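To make the ranking idea concrete, the following minimal PyTorch sketch scores each filter of a convolution layer by its L1 norm, the simplest of the magnitude criteria cited above; it is a generic illustration rather than the exact procedure of any referenced work, and all names are ours.

```python
import torch
import torch.nn as nn

def rank_filters_by_l1(conv: nn.Conv2d) -> torch.Tensor:
    """Order a layer's filters by ascending L1 norm (least important first)."""
    # conv.weight has shape (out_channels, in_channels, kH, kW); each output
    # filter is scored by the L1 norm of all its weights.
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    return torch.argsort(scores)

# Usage: flag the four lowest-L1 filters of a layer as pruning candidates.
conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3)
prune_candidates = rank_filters_by_l1(conv)[:4]
```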
Gradual pruning methods [43] attempt to arrive at an accurate model given a resource-constrained environment (e.g., a bound on the model's memory footprint). In Ref. [44], the authors propose neuron importance score propagation (NISP) to jointly prune neurons in the entire network based on a unified goal. In Ref. [45], the authors prune neurons randomly, and random grouping of connection weights into hash buckets was proposed in Ref. [46]. Ref. [47] proposed a new iterative pruning scheme based on Taylor expansion while focusing on transfer learning. Their results indicate that CNNs can be pruned by iteratively removing the least important OFMs and that a Taylor expansion-based criterion improves over other criteria, such as weight pruning using the L2 norm and activation pruning using the mean, variance, and mutual information. In that study, large trained CNNs were adapted into efficient smaller networks for specialized tasks (specialized in a subset of classes). The authors observed that every layer has both high- and low-importance OFMs and that the median importance of OFMs tends to decrease with layer depth [47]. In Ref. [48], the authors proposed the CURL method (compression using residual connections and limited data), which consists of compressing residual blocks and a label refinement strategy for small datasets.
While pruning techniques attempt to eliminate redundant network parameters, quantization approaches, which are described in the next subsection, attempt to employ an efficient data representation for the model parameters.

Quantization
Quantization methods compress the network by reducing the number of bits used to represent each weight, filter, and feature map. Typically, CNNs use 32-bit floating-point precision. Several works [8][9][10][11] introduced fixed-point and vector quantization methods that trade off accuracy against compression. In Ref. [8], Vanhoucke et al. demonstrated that 8-bit quantization of the parameters can result in significant compression without significant loss of accuracy. However, quantization methods that use fewer than 8 bits tend to decrease the model accuracy significantly.
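To illustrate the bit-width reduction, here is a minimal sketch of textbook uniform affine quantization to 8 bits; it is not the specific scheme of Ref. [8], and the function name is ours.

```python
import torch

def quantize_8bit(w: torch.Tensor):
    """Uniform affine (asymmetric) quantization of a float tensor to 8 bits."""
    scale = (w.max() - w.min()) / 255.0        # step size spanning the value range
    zero_point = torch.round(-w.min() / scale) # integer code that represents 0.0
    q = torch.clamp(torch.round(w / scale + zero_point), 0, 255).to(torch.uint8)
    w_hat = (q.float() - zero_point) * scale   # dequantized approximation of w
    return q, w_hat

w = torch.randn(128, 64, 3, 3)                 # e.g., one conv layer's weights
q, w_hat = quantize_8bit(w)
print((w - w_hat).abs().max().item())          # worst-case error is about scale / 2
```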
Quantization schemes applied to weights and activations during training [49][50][51] try to reduce quantization errors at low precision while achieving accuracy comparable to full-precision networks. This approach may be limited by a lack of data or computational resources, and aggressive quantization generally causes a significant loss of accuracy. Post-training quantization methods [52][53][54][55] circumvent this limitation. For example, the authors of Ref. [53] formulated a minimum-square-error problem for the weights, and sparsity-aware quantization was introduced in Ref. [55].

Advanced Compression Methods
While pruning and quantization attempt to remove redundancy in the network structure and its data representation, advanced compression methods have been proposed to further optimize CNNs by adopting different approaches that are described in this subsection.
Deep compression, a combination of pruning, trained quantization, and Huffman coding, was proposed in Ref. [28]. This method includes pruning that removes redundant units or channels while weight and activation quantization occurs simultaneously.
Knowledge distillation [56][57][58] effectively trains a small (student) model from a large (teacher) model without a significant loss of accuracy. The student model should mimic the teacher model, and the problem is determining how to distill the knowledge from a larger CNN into a small network. Knowledge distillation systems consist of three main elements: knowledge, an algorithm for knowledge distillation, and a teacher-student architecture. A comprehensive survey of knowledge distillation is available in Ref. [58].
Training one large network and specializing it without additional training for efficient deployment was proposed in Ref. [59]. That work introduced the once-for-all (OFA) approach that supports diverse architectural constraints (e.g., power, cost, latency, and performance). A progressive shrinking algorithm reduces the model size by operating across four dimensions (image resolution, depth, width, and kernel size). Many specialized subnetworks can be easily obtained from the OFA network with accuracy similar to training them independently [47].
FoldedCNN [14] is another approach to CNN design for specialized inference tasks. FoldedCNN increases inference throughput and hardware utilization of specialized CNNs beyond increased batch size. This approach does not compress CNN models but rather increases arithmetic intensity, which boosts utilization and throughput when it runs on certain accelerators.

CNN Insights
Several CNN insight explorations have extended the understanding of CNNs' internal mechanisms and their relation to feature extraction. In Ref. [60], a visualization technique has been introduced that provides insight into the internal mechanisms of CNNs and can be used to select effective architectures. An additional simpler visualization method to estimate the receptive fields of units in each layer has also been suggested by Ref. [61]. Their results reveal that OFMs have interpretable patterns and extract features at several levels of abstraction (edges, textures, shapes, concepts, etc.). Another study [62] used ablation analysis and the addition of noise to quantify the contribution of OFM units and their role in the network's output. Their experiments suggest that highly class-selective elements may degrade network performance, so their removal may not impact overall performance.
Further study of the importance of individual units in CNNs found that, while removal of individual OFMs may not significantly decrease the overall model accuracy, it can significantly impact the accuracy of specific classes [63]. Thus, the ablation experiments in Ref. [63] demonstrated that individual units specialize in subsets of classes, and different methods were proposed to measure the contribution of individual OFMs to classification accuracy.
CNN filter compression techniques attempt to remove kernels and filters that correspond to unimportant weights. In Ref. [64], the authors proposed removing filters based on their importance. The coupling factors between consecutive layers are computed and used to remove unimportant pathways in the networks. These factors are used to maximize the variance of feature maps and to preserve the most relevant filters. Another study on CNN filter compression [65] showed that information richness and sparsity can be used to determine the importance of feature maps. The relationship between input feature maps and 2D kernels was examined, and the kernel sparsity and entropy (KSE) indicator was proposed for measuring the feature map importance. The authors proposed compressing CNNs by reducing the number of kernels based on the KSE indicator [65]. A common thread is the observation that feature maps may contribute differently to the accuracy. Hence, being able to quantify their importance is helpful for network compression. Additionally, the first model layers extract simple features such as edges, corners, and simple textures, and the last layers have higher representations (e.g., recognize objects).
The recent studies presented in Refs. [60][61][62][63][64][65] provided the motivation for the VELCRO algorithm for neural networks [20]. The core idea is that unimportant computations can be removed when a CNN model is used for specialized tasks, so compression and speedup are achieved with minimal accuracy degradation. VELCRO identifies output elements of the activation function with a high degree of value locality and replaces these elements with their corresponding arithmetic average values. It thereby reduces the computational load and performs unstructured compression of the network while avoiding a highly complex training process.
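The element-level idea behind VELCRO can be sketched as follows, assuming per-element variance (`var`) and average (`avg`) tensors have already been gathered over the preprocessing images; this is a conceptual sketch, not the authors' implementation, and the parameter names are ours.

```python
import torch

def velcro_unstructured(act: torch.Tensor, var: torch.Tensor,
                        avg: torch.Tensor, percentile: float) -> torch.Tensor:
    """Replace the lowest-variance activation elements with their averages."""
    threshold = torch.quantile(var.flatten(), percentile / 100.0)
    locality_mask = var <= threshold        # elements with high value locality
    # Unstructured: any individual element may be trimmed, which is why the
    # indices of trimmed elements must be tracked as metadata in practice.
    return torch.where(locality_mask, avg, act)
```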

Method and Algorithm
The S-VELCRO compression algorithm introduced in this study leverages the property of value locality, which was introduced in Ref. [20]. That study showed that when CNNs are used for specialized tasks, the output values of the activation tensor remain in close proximity across the inferred inputs. In addition, the authors used the variance tensor to quantify value locality, as illustrated in Figure 3 and defined in Equation (1). Figure 3 shows the activation function output tensors in a layer k and channel c of the CNN model. The variance tensor V[k] of convolution layer k is computed element-wise over the N preprocessing images:

$$V[k]_{c,i,j} = \operatorname{Var}_{0 \le m < N}\left(A^{(m)}[k]_{c,i,j}\right), \qquad (1)$$

where c is the channel index and i and j are the element coordinates.
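A minimal PyTorch sketch of this statistics-gathering step is shown below; it assumes the layer object passed in (e.g., an nn.ReLU module) emits the activation-function output tensor and that all images share the same input size. It is an illustration of Equation (1), not the paper's implementation.

```python
import torch

@torch.no_grad()
def activation_statistics(model, act_layer, images):
    """Gather the average tensor S[k] and variance tensor V[k] (Equation (1))
    of one activation layer over the N preprocessing images."""
    outputs = []
    hook = act_layer.register_forward_hook(
        lambda module, inputs, output: outputs.append(output.squeeze(0)))
    for image in images:                    # inference only, no back-propagation
        model(image.unsqueeze(0))
    hook.remove()
    stacked = torch.stack(outputs)          # shape: (N, c_k, w_k, h_k)
    S = stacked.mean(dim=0)                 # average tensor S[k]
    V = stacked.var(dim=0, unbiased=False)  # variance tensor V[k]
    return S, V
```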
The variance tensor is a fundamental metric used by the S-VELCRO algorithm to leverage value locality through the structured compression process of specialized-task CNNs. The S-VELCRO algorithm trims convolution filters that produce activation kernels with a high degree of value locality, thereby eliminating the need to compute the kernels that correspond to the trimmed filters. S-VELCRO runs in two stages:
1. Preprocessing stage: The S-VELCRO algorithm applies the uncompressed CNN model to calculate inference for a small group of images from the specialized-task preprocessing dataset. Note that the dataset used for the preprocessing stage is distinct from the validation dataset. The variance tensor is calculated using Equation (1) for each convolution layer. Because the preprocessing stage relies only on inference, it incurs significantly lower computational overhead than common approaches, which employ a lengthy back-propagation training step [66].

2. Compression stage: The compression stage is provided with a hyperparameter tuple T = {T_0, T_1, T_2, ...}, where each element corresponds to the number of channels to be compressed in the corresponding convolution layer; for example, T_k is the number of channels that will be compressed in convolution layer k. The compression stage processes every convolution layer separately. First, it calculates the rank of every channel by summing all elements of the variance tensor whose indices correspond to that channel. Next, it selects the T_k channels with the smallest rank to be compressed in convolution layer k. All compressed channels in the activation output function are replaced by the arithmetic average constants of the elements located at the same corresponding coordinates, while all other activation elements remain unchanged. This channel compression process avoids both the channel activation function computation and the convolution of the corresponding trimmed filters.

The hyperparameter tuple T determines the compression savings of each layer as well as the overall compression-saving ratio C for the model:

$$C = \frac{\sum_{k=0}^{K} T_k \cdot F_k \cdot w_k \cdot h_k}{\sum_{k=0}^{K} c_k \cdot F_k \cdot w_k \cdot h_k}, \qquad (2)$$

where the tuple T = {T_0, T_1, ..., T_K} represents the number of channels to be compressed, F_k is the filter size in layer k, and w_k and h_k are the channel width and height of the activation function output tensor for convolution layer k, respectively.

The complete and formal definition of the algorithm is given in Algorithm 1. An example of the compression algorithm is depicted in Figure 4. Figure 4a illustrates the calculation of the variance tensor in the preprocessing stage for N = 3 images, where the dimensions of the activation tensor are c_k = 3, w_k = 3, and h_k = 3 and c_k represents the number of channels in layer k. The S-VELCRO preprocessing stage performs inference on the preprocessing dataset to create a variance tensor V[k]. Figure 4b illustrates the compression stage. The hyperparameter T_k for layer k is defined in this example as T_k = 1, which means that only one channel will be compressed. As illustrated in Figure 4b, the ranks r[0], r[1], and r[2] are calculated for every channel, and the channel with the smallest rank is compressed. In this example, channel 2 has the lowest rank, so its activation function outputs are replaced with their arithmetic averages, while the remaining elements keep their original model activation function. The outcome of the S-VELCRO compression stage is the compressed activation-function output tensor Ã[k], where the computation of channel 2 (highlighted in green) is replaced by the arithmetic averages.
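Before turning to Algorithm 1, the short sketch below evaluates our reconstructed form of Equation (2) on toy numbers; the exact cost model (convolution work proportional to F_k · w_k · h_k per channel) is an assumption based on the surrounding definitions.

```python
def compression_saving_ratio(T, F, w, h, c):
    """Equation (2) as reconstructed above: the fraction of convolution work
    removed by trimming T[k] of the c[k] channels in each layer k, where
    each channel costs roughly F[k] * w[k] * h[k] operations."""
    saved = sum(T[k] * F[k] * w[k] * h[k] for k in range(len(T)))
    total = sum(c[k] * F[k] * w[k] * h[k] for k in range(len(T)))
    return saved / total

# Toy example: two layers with 3x3 filters over 64 and 128 input channels.
print(compression_saving_ratio(T=[8, 16], F=[64 * 9, 128 * 9],
                               w=[56, 28], h=[56, 28], c=[64, 128]))
```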
Algorithm 1: The S-VELCRO compression algorithm.
Input: A CNN model M with K activation-function outputs (each in a different convolution layer), N preprocessing images, and a channel-trimming hyperparameter tuple T = {T_0, T_1, ..., T_K}, where ∀k, 0 ≤ k < K: 0 ≤ T_k < c_k, and c_k is the number of channels in convolution layer k.
Output: A compressed CNN model M_C.

Preprocessing stage (variance tensor calculation):

Step 1: Let A[k] be the activation-function output tensor in convolution layer k, and let A^(m)[k] be the corresponding activation-tensor values at the inference of image m, 0 ≤ m < N, where the tensors A[k] and A^(m)[k] have the same dimensions and c_k, w_k, and h_k are the number of channels, the width, and the height of the tensors at convolution layer k, respectively, for 0 ≤ k < K.

Step 2: Let model M perform inference for each image m such that 0 ≤ m < N, and let V[k] be the variance tensor and S[k] the average tensor of convolution layer k (0 ≤ k < K), such that each element of V[k] is defined as in Equation (1) and S[k] is defined as

$$S[k]_{c,i,j} = \frac{1}{N}\sum_{m=0}^{N-1} A^{(m)}[k]_{c,i,j}$$

for each 0 ≤ c < c_k, 0 ≤ i < w_k, and 0 ≤ j < h_k.

Compression stage (filter trimming):
Step 3: For each convolution layer 0 ≤ k < K and each channel 0 ≤ c < c_k, let r[c]_k be the rank of channel c in convolution layer k:

$$r[c]_k = \sum_{i=0}^{w_k-1}\sum_{j=0}^{h_k-1} V[k]_{c,i,j}.$$

Let r_k = {r[0]_k, r[1]_k, ..., r[c_k − 1]_k} be the rank tuple of convolution layer k for 0 ≤ k < K, and let r̃_k be the tuple r_k sorted in monotone increasing order. Let I_k be the set of channel indices corresponding to the first T_k elements of r̃_k, and let the compressed activation-function output tensor Ã[k] be

$$\tilde{A}[k]_{c,i,j} = \begin{cases} S[k]_{c,i,j}, & c \in I_k, \\ A[k]_{c,i,j}, & c \notin I_k, \end{cases}$$

for each 0 ≤ c < c_k, 0 ≤ i < w_k, and 0 ≤ j < h_k.
Step 4: Let the compressed CNN model M_C be such that every activation-function output tensor A[k] is replaced with Ã[k] for every convolution layer 0 ≤ k < K.
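A compact sketch of Steps 3 and 4 for a single layer, assuming the variance tensor V and average tensor S from the preprocessing stage, might look as follows; in a deployed model the trimmed filters' convolutions are skipped entirely, whereas the sketch only shows the functional replacement of their outputs.

```python
import torch

def trim_channels(A: torch.Tensor, V: torch.Tensor,
                  S: torch.Tensor, T_k: int) -> torch.Tensor:
    """Steps 3-4 for one layer k: rank channels by summed variance and
    replace the T_k lowest-ranked channels with their average kernels."""
    ranks = V.sum(dim=(1, 2))              # r[c]_k: one rank per channel
    trimmed = torch.argsort(ranks)[:T_k]   # I_k: most value-local channels
    A_tilde = A.clone()
    A_tilde[trimmed] = S[trimmed]          # compressed channels become constants
    return A_tilde
```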
During the compression phase, the values of the tuple T that produce the optimal compression-saving ratio need to be found. As part of this study, we also introduce a novel automatic hyperparameter tuning process that avoids complex manual tuning. The user provides a target prediction accuracy, and the hyperparameter tuning algorithm searches for the optimal T that meets the target accuracy. Our approach incrementally increases each element in the tuple T as long as doing so does not decrease the overall prediction accuracy of the CNN model. This process is repeated for every element in the tuple; after reaching the last element, the entire tuning pass is repeated from the first element until the target prediction accuracy has been achieved. If the target prediction accuracy cannot be achieved, the hyperparameter tuning process terminates with the achievable prediction accuracy closest to the target.
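One plausible reading of this tuning loop is sketched below; it assumes a hypothetical `evaluate(T)` helper that compresses the model with tuple T and returns its top-1 accuracy, and it omits the fallback to the closest achievable accuracy for brevity.

```python
def tune_hyperparameters(evaluate, num_layers, max_channels, target_accuracy):
    """Greedy search for the tuple T: sweep its elements repeatedly and keep
    each increment that does not push accuracy below the target.

    evaluate(T) is a hypothetical helper returning the top-1 accuracy of the
    model compressed with tuple T; max_channels[k] is c_k for layer k."""
    T = [0] * num_layers
    changed = True
    while changed:                       # repeat full passes over the tuple
        changed = False
        for k in range(num_layers):
            if T[k] + 1 < max_channels[k]:
                T[k] += 1                # tentatively trim one more channel
                if evaluate(T) >= target_accuracy:
                    changed = True       # keep the increment
                else:
                    T[k] -= 1            # revert and move to the next layer
    return T
```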
In the last part of this section, we summarize the fundamental differences between VELCRO and S-VELCRO:

1. While VELCRO employs unstructured compression, which can potentially compress any activation output element, S-VELCRO performs structured compression of filters and their corresponding kernels.

2. VELCRO and S-VELCRO use different hyperparameter mechanisms for the compression process. While VELCRO employs a hyperparameter tuple that represents the percentile of elements in the variance tensor to be compressed in every convolution layer, S-VELCRO uses a hyperparameter tuple that denotes the number of filters to be trimmed in every layer.

3. VELCRO and S-VELCRO take fundamentally different approaches to the compression itself. VELCRO compresses all elements in the activation tensor whose variance falls within the percentile threshold and replaces them with their arithmetic averages, whereas S-VELCRO compresses kernels and their corresponding filters using the channel rank, such that all compressed channels in the activation output are replaced by the arithmetic-average kernel.

Experimental Results and Discussion
Our experimental analysis examines the S-VELCRO algorithm using various CNN models and different specialized tasks. The first subsection describes the experimental environment. Next, we present the compression ratios S-VELCRO achieved for different prediction accuracy targets, the distribution of filters removed in each model layer, and the overall saving in filter memory footprint size. Lastly, we show the energy saving for S-VELCRO by demonstrating a hardware implementation on an FPGA board.

Experimental Environment
The experimental environment used for our study is based on PyTorch [67]; the ResNet-18, MobileNet V2, and GoogLeNet CNN models [21,22,60] (with their PyTorch pre-trained models); and the ILSVRC-2012 dataset (also known as ImageNet) [23,65]. The S-VELCRO algorithm has been fully implemented in the PyTorch environment. Table 1 lists the three groups of specialized tasks used for our experiments: cats, dogs, and cars. Each group includes four classes of images from the ILSVRC-2012 dataset. Throughout the experimental analysis, we do not modify the first layer of the model, which is a common practice used in other studies [51].

Compression Algorithm Performance
In this section, we present our experimental results for the compression-saving ratio achieved by S-VELCRO on the three groups of specialized tasks (see Table 1). For the preprocessing step, we used a small subset (<2%) of the images as the preprocessing dataset; the validation step used the remaining images. It should be noted that the validation process was performed using images not used by the model in the compression step. This is essential for an unbiased evaluation of the model's performance and preserves the generalization property of the model.

Figure 5 presents the compression-saving ratios versus top-1 prediction accuracy of the three CNN models for cars, dogs, and cats. P0 denotes the prediction accuracy achieved by the compressed model with no degradation relative to the uncompressed model; P0 is summarized in Table 2 for every CNN and specialized task, and P0-n% represents a degradation of n% relative to P0. We examined the compression-saving ratios for five top-1 prediction accuracy targets: P0, P0-1%, P0-2%, P0-3%, and P0-4%. For each target, we examined different thresholds via trial and error and chose those that produced the highest compression-saving ratio. With no degradation in prediction accuracy relative to the uncompressed model, S-VELCRO achieved compression-saving ratios for ResNet-18 of 24.60%, 27.84%, and 26.48% for cars, dogs, and cats, respectively. For MobileNet V2, S-VELCRO achieved 13.35%, 17.33%, and 14.45% for cars, dogs, and cats, respectively, at the P0 target prediction accuracy. Lastly, for GoogLeNet, S-VELCRO achieved 11.64%, 6.07%, and 11.27% for cars, dogs, and cats, respectively. Note that when the top-1 prediction accuracy is compromised in the range of 1-4%, the compression-saving ratio increases by up to approximately 3%, 2.5%, and 4% for ResNet-18, MobileNet V2, and GoogLeNet, respectively.

Table 3 compares the S-VELCRO algorithm with other pruning and compression approaches for both specialized and general-purpose CNNs. First, it can be observed that S-VELCRO outperformed VELCRO for ResNet-18, achieved similar computation acceleration for MobileNet V2, and underperformed VELCRO for GoogLeNet. When comparing S-VELCRO to VELCRO, however, we should keep in mind that because VELCRO employs unstructured compression, it incurs overhead to compute addresses and indices of compressed elements and to store them in dedicated metadata storage elements. These limitations, which are out of the scope of this study, become more evident when the unstructured compression method runs on GPUs and TPUs. When comparing S-VELCRO to the pruning methods described in Table 3, it should be noted that although S-VELCRO achieves smaller computational savings, it requires significantly fewer computational resources than common pruning techniques [66] because it avoids back-propagation training.

Filter Memory Footprint Size
Our next experimental analysis examines the savings in filter memory footprint size; the results are summarized in Table 4. Figure 6 presents histograms of the percentage of filters removed by S-VELCRO in every layer of the CNNs. For ResNet-18, the early layers 1 and 3 and the deep layers 13, 15, and 16 of the model have the highest percentages of removed filters. For MobileNet V2, the overall filter memory footprint saving is significantly smaller than for ResNet-18. These observations reflect the highly compact nature of the MobileNet V2 network relative to ResNet-18 and GoogLeNet. Figure 6c illustrates the percentage of removed filters in every convolution stage across all model layers (convolution and inception) in GoogLeNet. Our results indicate that for the majority of the convolution stages, the trimmed-filter percentages are lower than 10%, while several convolution stages exhibit a high percentage of trimmed filters; for example, in convolution stage 53, nearly 100% of the filters are trimmed for cars and cats. These observations also explain the low compression-saving ratio measured for GoogLeNet.

Hardware Implementation
We demonstrated S-VELCRO energy savings on a Xilinx Alveo U280 Data Center accelerator FPGA card [68] (Figure 7). When implementing the ResNet-18 CNN model on the FPGA acceleration card, we measured the energy consumption of the original model and of the model compressed by S-VELCRO. Our hardware implementation was designed in Verilog and implemented using the Xilinx Vivado [69] design suite. Our energy consumption measurements for a single inference operation are illustrated in Figure 8. When the model was compressed with no degradation in prediction accuracy (P0), energy consumption was reduced by approximately 24-27%. When the model was compressed further, although the prediction accuracy was compromised in the range of 1-4%, the additional energy saving was negligible.


Conclusions
This study presents S-VELCRO, a new compression algorithm that exploits value locality to perform structured compression on CNN models used for specialized tasks. S-VELCRO eliminates the overhead that unstructured compression incurs in computing address indices and managing the metadata associated with compressed elements. In addition, S-VELCRO offers a fast compression process because it avoids back-propagation training, which involves a heavy computational load. Our experimental analysis indicates that S-VELCRO produces a compression-saving ratio of up to 27.84%, 11.27%, and 17.33% for ResNet-18, GoogLeNet, and MobileNet V2, respectively. In addition, S-VELCRO can save up to 44.92%, 22.37%, and 19.41% of the filter memory footprint for ResNet-18, MobileNet V2, and GoogLeNet, respectively. Lastly, we demonstrated S-VELCRO energy savings using an FPGA board, showing that it can save up to 27% of the hardware energy consumption.
