Energy-Constrained Deep Neural Network Compression for Depth Estimation

Many applications, such as autonomous driving and robotics, require accurately estimating depth in real time. Deep learning is currently the most popular approach to stereo depth estimation, but these models are computationally intensive, containing massive parameter sets ranging from thousands to millions, while often having to operate in highly energy-constrained environments. This makes them hard to deploy on low-power devices with limited storage in practice. To overcome this shortcoming, we model the training process of a deep neural network (DNN) for depth estimation under a given energy constraint as a constrained optimization problem and solve it through a proposed projected adaptive cubic quasi-Newton method (termed ProjACQN). Moreover, the trained model is deployed on both a GPU and an embedded device to evaluate its performance. Experiments show that the stage four results of ProjACQN on the KITTI-2012 and KITTI-2015 datasets under a 70% energy budget achieve (1) 0.13% and 0.61% lower three-pixel error, respectively, than the state-of-the-art ProjAdam on a single RTX 3090Ti; (2) 4.82% and 7.58% lower three-pixel error, respectively, than the pruning method Lottery-Ticket; and (3) 5.80% and 0.12% lower three-pixel error, respectively, than ProjAdam on the embedded device Nvidia Jetson AGX Xavier. These results show that our method can reduce the energy consumption of depth estimation DNNs while maintaining their accuracy.


Introduction
As a classical computer vision problem, stereo image-based depth estimation has a wide range of applications, such as autonomous driving, robotics and 3D scene understanding [1][2][3][4][5]. Many of these have to operate in highly energy-constrained environments. However, most of the existing deep-learning-based depth estimation methods focus on designing more powerful network architectures to obtain more accurate depth images. Thus, their strong performance often requires massive computing resources, making them too heavy to run on embedded devices.
Several state-of-the-art studies [6][7][8][9][10][11] attempt to address this problem by building lightweight networks, that is, they trade off computation and accuracy for low inference time. These "mini" models could be further reduced in size to save more energy while maintaining similar performance. An open-source study [12] proposes an end-to-end DNN training framework that provides quantitative energy consumption guarantees via weighted sparse projection and input masking. The input mask enables the input sparsity to be controlled by a trainable parameter, and thus increases the opportunities for energy reduction. They also present a projected version of Adam (termed ProjAdam) to train the model under a quantitatively estimated energy consumption. However, the method only applies to image classification tasks, and the network only includes fully connected layers and 2D convolution layers, which differ from many depth estimation networks.
In this paper, we likewise formulate the training process of a depth estimation DNN under a given energy budget as a constrained optimization problem. Different from prior work, we model the energy consumption of the depth estimation network after analyzing its architecture and taking 3D convolution layers into consideration. Furthermore, we propose a projected adaptive cubic quasi-Newton optimizer (termed ProjACQN) to better solve such a complex optimization problem. Unlike the commonly used first-order projected optimizers [12,13] or proxy optimizers [14][15][16], which only utilize gradient information, ProjACQN incorporates Hessian information weighted by the norm of the difference between the two previous estimates, and performs the projection onto the energy constraint after the parameter update.
We evaluate the efficiency of our method on KITTI-2012 [17] and KITTI-2015 [18] using the depth estimation DNN AnyNet [6]. Experiments show that our method can reduce the energy consumption of AnyNet by 30% while improving its accuracy by 0.13% and 0.62% compared to ProjAdam [12] on KITTI-2012 and KITTI-2015, respectively. Our method is also compared with three existing pruning methods (L1-norm pruning [19], BN pruning [20] and Lottery-Ticket [21]) under the same energy budget, and achieves the best result. In addition, we run the models on the embedded device Nvidia Jetson AGX Xavier [22], and find that ProjACQN with a 70% energy budget is able to outperform the dense model without an energy budget in terms of both three-pixel error and time consumption.

Related Work
Generally, research on depth estimation using DNNs mainly focuses on designing architectures with better performance. These methods can be classified into three classes: supervised, semi-supervised and unsupervised. Taking [23,24] as examples of supervised methods, ref. [23] performs prediction at adaptively selected locations, which are easier to estimate accurately, thereby alleviating excessive computation. Ref. [24] proposes a depth refinement architecture using 3D dilated convolutions to predict geometrically consistent disparity images. These supervised stereo depth estimation methods have achieved great success, but they require per-pixel ground truth depth data, which are often hard to acquire. To resolve this issue, an alternative is the class of unsupervised methods, which use the geometric constraints between stereo images as the supervisory signal instead of directly inputting ground truth disparity. Ref. [25] combines an unsupervised stereo disparity estimation network with a perceptual loss network, which enables it to refine the predicted disparity. Ref. [26] designs a Siamese autoencoder architecture to extract mutual information between the rectified stereo images; it also exploits mutual epipolar attention, and uses the optimal transport algorithm to refine the depth image. In addition, there are very few works on semi-supervised stereo depth estimation using DNNs. Ref. [27] trains in a semi-supervised manner to combine information from LIDAR and photometric data; the model can achieve better performance than those trained only with LIDAR.
The methods mentioned above often contain massive parameter sets to achieve their strong performance, but they require high computation and GPU memory costs, making them difficult to deploy on embedded devices. Recently, many studies have focused on improving energy efficiency through artificial intelligence [6][7][8][9][10][11][28][29]. This has inspired many studies to explore lightweight DNNs for stereo depth estimation. Ref. [10] performs online adaptive stereo depth estimation through a self-supervised learning method, which helps to save computation and GPU memory. Ref. [11] is based on a Max-tree hierarchical representation of image pairs, and is able to identify matching regions along image scan-lines. Ref. [8] estimates depth via a series of binary classifications; instead of obtaining an accurate depth map, it classifies objects according to their relative distance. Ref. [7] proposes an efficient neural network whose computation and latency savings mostly come from depth-wise separable convolutions and network pruning. Refs. [6,30] propose AnyNet to achieve a wide accuracy range of the disparity map according to the permitted inference time. In practice, AnyNet has four stages, and the higher the stage, the better the accuracy at a greater time cost.

Problem Formulation
In this section, we provide an energy model for a depth estimation DNN which consists of fully connected layers, 2D convolution layers and 3D convolution layers.

Energy Consumption Modeling
Generally, following the popular systolic array hardware architecture [31], training a typical depth estimation DNN under an energy budget can be formulated as [12]

$$\min_{W,\,M}\; L(W, M) \quad \text{s.t.} \quad E(W, M) \le E_{budget},\;\; \mathrm{supp}(W) \subseteq \mathrm{supp}(W_{dense}), \tag{1}$$

where $L(\cdot)$ is the loss function, $W$ the weight tensor of the sparse model, $W_{dense}$ the weight tensor of the original dense model, $M$ the input mask, $E_{budget}$ the given energy budget and $E$ the total energy cost. The depth estimation DNN model mainly consists of a sequence of fully connected layers, 2D convolution layers and 3D convolution layers, and their energy consumption can be decomposed into two parts: computation energy and data access energy. Let $\mathcal{U}$, $\mathcal{V}$, $\mathcal{W}$ be the sets of the fully connected layers, 2D convolution layers and 3D convolution layers in a DNN, respectively; $E^{(i)}_{comp}$ the energy consumed by computation units in layer $i$; and $E^{(i)}_{data}$ the energy consumed in layer $i$ when accessing data from the hardware memory. The total energy cost can be acquired through

$$E = \sum_{i \in \mathcal{U} \cup \mathcal{V} \cup \mathcal{W}} \left( E^{(i)}_{comp} + E^{(i)}_{data} \right). \tag{2}$$

Assume $e_{MAC}$ denotes the energy consumption of one Multiply-and-Accumulate (MAC) operation, $X^{(i)}$ the input tensor of layer $i$, and $\mathrm{supp}(\cdot)$ returns a binary tensor which labels the nonzero positions. Then, the computation energy cost of a fully connected layer is

$$E^{(i)}_{comp} = e_{MAC}\,\mathrm{sum}\!\left(\mathrm{supp}(W^{(i)})\,\mathrm{diag}\!\left(\mathrm{supp}(X^{(i)})\right)\right) \le e_{MAC}\,\|W^{(i)}\|_0. \tag{3}$$

The computation energy cost of a 2D convolution layer is

$$E^{(i)}_{comp} = e_{MAC}\,\mathrm{sum}\!\left(\mathrm{supp}(X^{(i)}) * \mathrm{supp}(W^{(i)})\right) \le e_{MAC}\, h' w' \,\|W^{(i)}\|_0, \tag{4}$$

where $*$ denotes convolution and $h' \times w'$ is the output spatial size. Generally, to accelerate the loading speed, it is common to load data from the DRAM to the cache and then from the cache to the register file (RF) when a DNN is inferencing [12,32]. Thus, each layer has to load its input and weights three times, once each from DRAM, cache and RF. Let $e_{DRAM}$, $e_{RF}$ and $e_{cache}$ be the unit energy costs of DRAM, RF and cache, respectively. Thus, the data access energy cost of each layer can be formulated as

$$E^{(i)}_{data} = e_{DRAM}\!\left(N^{input}_{DRAM} + N^{weight}_{DRAM}\right) + e_{cache}\!\left(N^{input}_{cache} + N^{weight}_{cache}\right) + e_{RF}\!\left(N^{input}_{RF} + N^{weight}_{RF}\right), \tag{5}$$

where $N^{input}_{DRAM}$, $N^{input}_{cache}$ and $N^{input}_{RF}$ are the total numbers of DRAM, cache and RF accesses related to the input, respectively, and $N^{weight}_{DRAM}$, $N^{weight}_{cache}$ and $N^{weight}_{RF}$ the total numbers related to the weights. Ref. [12] only gives the detailed computation of the energy cost for the fully connected layer and the 2D convolution layer.
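The per-layer energy terms described above can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation: the unit energy constants and access counts below are illustrative placeholders, and the helper names (`comp_energy_fc`, `comp_energy_conv2d`, `data_energy`) are ours.

```python
import numpy as np

# Illustrative unit energy costs (arbitrary units); real values are
# hardware-specific and these numbers are assumptions, not measurements.
E_MAC, E_DRAM, E_CACHE, E_RF = 1.0, 200.0, 6.0, 1.0

def comp_energy_fc(W):
    """Computation energy of a fully connected layer:
    at most one MAC per nonzero weight."""
    return E_MAC * np.count_nonzero(W)

def comp_energy_conv2d(W, out_h, out_w):
    """Computation energy bound of a 2D convolution layer: each
    nonzero weight fires once per output spatial position."""
    return E_MAC * out_h * out_w * np.count_nonzero(W)

def data_energy(n_input, n_weight):
    """Data-access energy; n_input/n_weight are access counts ordered
    as (DRAM, cache, RF) for the input and weight tensors."""
    return sum(e * (ni + nw)
               for e, ni, nw in zip((E_DRAM, E_CACHE, E_RF), n_input, n_weight))

# Total energy of a toy one-layer model: computation plus data access
W_fc = np.array([[0.5, 0.0], [0.0, -1.2]])          # two nonzero weights
e_total = comp_energy_fc(W_fc) + data_energy((100, 300, 900), (4, 12, 36))
```

Note how the DRAM term dominates the total: under this kind of model, pruning weights saves energy twice, once in MACs and once in memory traffic.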

Computation Energy for 3D Convolution Layer
Let $W^{(w)} \in \mathbb{R}^{c_{out} \times c_{in} \times r \times r \times r}$ be the weight tensor, where $c_{in}$ and $c_{out}$ are the numbers of input and output channels, respectively, and let $X^{(w)} \in \mathbb{R}^{c_{in} \times h \times w \times d}$ be the input tensor, where $h$, $w$ and $d$ are the height, width and depth of $X$, respectively. The computation of the 3D convolution operation can be formulated as

$$Y_{o,x,y,z} = \sum_{c=1}^{c_{in}} \sum_{i=1}^{r} \sum_{j=1}^{r} \sum_{k=1}^{r} W^{(w)}_{o,c,i,j,k}\, X^{(w)}_{c,\, sx+i-p,\, sy+j-p,\, sz+k-p}, \tag{6}$$

where $x$, $y$ and $z$ represent the corresponding position of the output element. Assume the convolution padding is $p$ and the convolution stride is $s$. Then, the size of the output tensor is $c_{out} \times d' \times h' \times w'$, with $h' = \lfloor (h + 2p - r)/s \rfloor + 1$ and $w'$, $d'$ defined analogously. Thus, the number of MAC operations is $\mathrm{sum}(\mathrm{supp}(X^{(w)}) * \mathrm{supp}(W^{(w)})) \le h' w' d'\, \|W^{(w)}\|_0$, and the computation energy cost can be approximated through

$$E^{(w)}_{comp} = e_{MAC}\,\mathrm{sum}\!\left(\mathrm{supp}(X^{(w)}) * \mathrm{supp}(W^{(w)})\right) \le e_{MAC}\, h' w' d'\, \|W^{(w)}\|_0. \tag{7}$$
Let $N^{(w)}_{W,DRAM}$, $N^{(w)}_{W,cache}$ and $N^{(w)}_{W,RF}$ be the access numbers of the weight tensor from DRAM, the cache and the RF, respectively, and $N^{(w)}_{X,DRAM}$, $N^{(w)}_{X,cache}$ and $N^{(w)}_{X,RF}$ the corresponding access numbers of the input tensor. For simplification, we can unfold the input tensor as $\tilde{X}^{(w)} \in \mathbb{R}^{h' w' d' \times c_{in} r^3}$ according to the output channel, where each column represents all elements in $X^{(w)}$ that operate with one element in $W^{(w)}$. The access numbers (Equation (8)) then follow the 2D convolution case in [12], with the weights streamed through a systolic array whose width and height are denoted $s_h$ and $s_w$, respectively, and cached in blocks of size $k_W$, where $k_W$ represents the cache size for the weight matrix; the unfolding additionally repeats $N_{overlap}$ elements that are shared between neighboring receptive fields, the number of overlapped elements due to the nature of the 3D convolution operation. According to Equations (5) and (8), the total data access energy consumption for a 3D convolution layer can be formulated as

$$E^{(w)}_{data} = e_{DRAM}\!\left(N^{(w)}_{X,DRAM} + N^{(w)}_{W,DRAM}\right) + e_{cache}\!\left(N^{(w)}_{X,cache} + N^{(w)}_{W,cache}\right) + e_{RF}\!\left(N^{(w)}_{X,RF} + N^{(w)}_{W,RF}\right). \tag{9}$$
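As a concrete check of the MAC-count bound above, the sketch below computes the output size and the resulting computation-energy bound for a sparse 3D kernel. The helper names are ours, and the unit MAC energy is an illustrative placeholder.

```python
import numpy as np

def conv3d_out_size(n, r, p, s):
    """Output size along one axis for kernel size r, padding p, stride s."""
    return (n + 2 * p - r) // s + 1

def comp_energy_conv3d_bound(W, h, w, d, p, s, e_mac=1.0):
    """Upper bound on 3D-convolution computation energy:
    #MACs <= h' * w' * d' * ||W||_0."""
    r = W.shape[2]                      # cubic kernel: r x r x r
    oh = conv3d_out_size(h, r, p, s)
    ow = conv3d_out_size(w, r, p, s)
    od = conv3d_out_size(d, r, p, s)
    return e_mac * oh * ow * od * np.count_nonzero(W)

# A dense 3x3x3 kernel with 2 output / 1 input channels on an 8^3 input:
# "same" padding (p=1, s=1) keeps the 8x8x8 output size.
W = np.ones((2, 1, 3, 3, 3))
bound = comp_energy_conv3d_bound(W, h=8, w=8, d=8, p=1, s=1)
```

Zeroing weights in `W` tightens the bound linearly, which is exactly the lever the energy-constrained projection exploits.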

Optimization Algorithm
In this section, we first provide the formulation of our projected adaptive cubic quasi-Newton optimizer. Then, we utilize the optimizer to solve the constrained optimization problem (1).

Projected Cubic Quasi-Newton Optimizer
Consider the following optimization problem:

$$\min_{W \in \mathcal{C}} L(W), \tag{10}$$

where the constraint set $\mathcal{C}$ is convex and compact. We can find the optimum through the following augmented second-order approximation:

$$W^{*}_{k+1} = \arg\min_{W}\; g_k^{\top}(W - W_k) + \frac{1}{2}(W - W_k)^{\top} B_k (W - W_k) + \frac{\rho}{2}\,\|W_k - W_{k-1}\|\,\|W - W_k\|^2, \tag{11}$$

where $g_k$ represents the gradient at iteration $k$, $B_k$ an approximation to the Hessian matrix at $W_k$, and $\rho$ a positive constant. We can obtain a stationary point without the restriction through setting the derivative of the objective function to zero:

$$g_k + B_k (W - W_k) + \rho\,\|W_k - W_{k-1}\|\,(W - W_k) = 0. \tag{12}$$

The update of the stationary point becomes

$$W^{*}_{k+1} = W_k - \left(B_k + \rho\,\|W_k - W_{k-1}\|\, I\right)^{-1} g_k. \tag{13}$$

To avoid the inversion difficulty caused by matrix degradation, we constrain the absolute values of the diagonal elements to be greater than a given positive parameter $\theta$:

$$[B_k]_{jj} \leftarrow \mathrm{sign}\!\left([B_k]_{jj}\right)\,\max\!\left(\left|[B_k]_{jj}\right|,\, \theta\right). \tag{14}$$

Then, the optimum can be acquired through projecting the stationary point onto the constraint set. This yields a novel update,

$$W_{k+1} = \mathrm{proj}_{\mathcal{C}}\!\left(W^{*}_{k+1}\right), \tag{15}$$

where the projection operation $\mathrm{proj}_{\mathcal{C}}$ projects the stationary point onto the optimization constraint set $\mathcal{C}$. The detailed algorithm for ProjACQN is shown in Algorithm 1. We then use this algorithm to solve problem (1) in the following subsection.
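A single step of this optimizer can be sketched as follows. This is a simplified illustration under two assumptions that are ours, not the paper's: the Hessian approximation is taken to be diagonal (`b_diag`), and the constraint set is a simple box so that the projection is a clip.

```python
import numpy as np

def projacqn_step(w, w_prev, grad, b_diag, proj, rho=1.0, theta=1e-3):
    """One projected adaptive cubic quasi-Newton step (sketch).

    Computes proj_C(w - (B + rho * ||w - w_prev|| * I)^{-1} g) with a
    diagonal B whose entries are clamped to have magnitude >= theta.
    """
    sign = np.where(b_diag < 0, -1.0, 1.0)
    b_safe = sign * np.maximum(np.abs(b_diag), theta)  # guard degeneracy
    reg = rho * np.linalg.norm(w - w_prev)             # adaptive cubic term
    return proj(w - grad / (b_safe + reg))

# Illustrative constraint set: the box [-1, 1]^n (a stand-in for the
# paper's energy-constraint projection)
proj_box = lambda v: np.clip(v, -1.0, 1.0)

w = np.array([0.5, -0.2])
w_new = projacqn_step(w, w_prev=w.copy(), grad=np.array([2.0, -0.1]),
                      b_diag=np.ones(2), proj=proj_box)
```

The `reg` term grows when successive iterates are far apart, shrinking the step, and vanishes as the iterates converge, recovering a plain quasi-Newton step.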

Update Weight Tensor and Input Mask
The problem in Equation (1) has two variables, $M$ and $W$. For simplification, it can be transformed into two sub-problems and solved through alternately updating the following two equations:

$$W_{k+1} = \arg\min_{W \in \mathcal{C}_W} L(W, M_t), \tag{16}$$

$$M_{t+1} = \arg\min_{M \in \mathcal{C}_M} L(W_{k+1}, M), \tag{17}$$

where $\mathcal{C}_W = \{W : E(W, M_t) \le E_{budget}\}$ and $\mathcal{C}_M = \{M : E(W_{k+1}, M) \le E_{budget}\}$. The sub-problems (16) and (17) are similar to (10), and can be optimized with the projected cubic quasi-Newton optimizer. According to Equations (15)-(17), the update of the weight tensor can be formulated as

$$W_{k+1} = \mathrm{proj}_{\mathcal{C}_W}\!\left(W_k - \left(B_k + \rho\,\|W_k - W_{k-1}\|\, I\right)^{-1} g_k\right). \tag{18}$$

The update of the mask tensor can be formulated as

$$M_{t+1} = \mathrm{proj}_{\mathcal{C}_M}\!\left(M_t - \left(B_t + \rho\,\|M_t - M_{t-1}\|\, I\right)^{-1} g_t\right). \tag{19}$$

According to [12], the projection problem (18) can be transformed into a knapsack problem, and the projection problem (19) is the well-known $L_0$ norm projection. To summarize, the complete algorithm for training an energy-constrained depth estimation DNN is given in Algorithm 2.
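The $L_0$ norm projection used for the mask update (19) is simple enough to state in full: keep the largest-magnitude entries and zero the rest. The sketch below is our illustration of that projection only; it is not the knapsack solver needed for the weight projection (18).

```python
import numpy as np

def proj_l0(v, k):
    """Project v onto the L0 ball {x : ||x||_0 <= k}: keep the k
    largest-magnitude entries of v and zero out all others."""
    out = np.zeros_like(v)
    if k > 0:
        keep = np.argsort(np.abs(v))[-k:]   # indices of the k largest |v_i|
        out[keep] = v[keep]
    return out

m = proj_l0(np.array([3.0, -1.0, 2.0]), k=2)
```

This is the standard hard-thresholding operator; ties are broken arbitrarily by the sort order, which is acceptable since any top-k selection attains the same projection distance.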

Experiment
To validate the efficiency of our method, we perform extensive experiments on KITTI-2012 [17] and KITTI-2015 [18] using the depth estimation network AnyNet [6]. The experiments can be divided into three parts: First, we test the performance of AnyNet trained with a GPU under different energy budgets, and the results are compared with the projected first-order optimizer ProjAdam. Then, we apply three existing pruning methods (L1-norm pruning [19], BN pruning [20] and Lottery-Ticket [21]) to AnyNet under the same energy budget for comparison. Finally, the performance of the trained models is also assessed on the embedded device Nvidia Jetson AGX Xavier [22].
Implementation Details

We implement AnyNet with four stages; the outputs of higher stages achieve better performance at a greater time cost. The hyper-parameters of the network are set to their default values. The DNN framework we experiment on is PyTorch 1.10.0 with Python 3.6, GPU-accelerated through CUDA-1.13. The hardware is a single RTX 3090Ti with an I9-10920X CPU and 32 GB of RAM. Following AnyNet, we use the metric three-pixel error to evaluate performance (lower is better). In addition, the predicted depth maps are also enhanced through histogram equalization [33] for comparison.

Dataset
The training set of KITTI-2012 contains 194 image pairs, while the test set contains 195 image pairs.Both the training set and test set of KITTI-2015 have 200 image pairs.
ProjAdam: The learning rates are searched among {a × 10^b}, where a ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9} and b ∈ {−5, −4, −3}, while the weight decays for the weight update and input mask update are set the same as those of ProjACQN. Other parameters are set to their default values in the literature.
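The learning-rate grid above expands to 27 candidate values, which can be enumerated directly:

```python
# Grid {a * 10**b : a in 1..9, b in {-5, -4, -3}} from the search above
lr_grid = sorted(a * 10.0 ** b for a in range(1, 10) for b in (-5, -4, -3))
# 27 values spanning 1e-5 through 9e-3
```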

Results on GPU under Different Energy Budgets
Here, we test the performance of AnyNet trained on a GPU under 50%, 60% and 70% energy budgets. Quantitative results on KITTI-2012 and KITTI-2015 are shown in Tables 1 and 2, respectively. On KITTI-2012, the stage four results of ProjACQN and ProjAdam are comparable to the dense model under a 70% energy budget, while ProjACQN achieves a much lower three-pixel error under a 60% energy budget. On KITTI-2015, the stage four results of ProjACQN achieve a lower three-pixel error under 60% and 70% energy budgets than ProjAdam and even outperform the dense model; this may be due to the removal of redundant information. Furthermore, ProjACQN achieves a comparable result under a 50% energy budget. We also show the training loss curves on KITTI-2012 and KITTI-2015 under a 70% energy budget in Figures 1 and 2, respectively. From these figures, we can see that ProjACQN achieves the best convergence speed. Figure 3 gives some visual examples of the predicted disparity.

Comparison Between Different Pruning Methods
To comprehensively compare our method with prior work, we also slim AnyNet through three existing pruning methods (L1-norm pruning, BN pruning and Lottery-Ticket) on KITTI-2012 and KITTI-2015. Here, we list the results under a 70% energy budget in Table 3. Our method achieves an evidently lower three-pixel error than the other methods.

Results on Embedded Device
In this section, we run the models trained through ProjAdam and ProjACQN on an Nvidia Jetson AGX Xavier under 50%, 60% and 70% energy budgets. Quantitative results on KITTI-2012 and KITTI-2015 are shown in Tables 4 and 5, respectively. It can be seen that the stage four result of ProjACQN has an obvious advantage on KITTI-2012 and KITTI-2015, while the performance of the other three stages is comparable. The FPS of stage four using the dense model, ProjAdam and ProjACQN is 11.5, 20.4 and 20.31, respectively. It should be noted that the three-pixel error increases mostly due to the Float16 data type of the embedded device, which could be addressed through quantization. Figures 4 and 5 present some visual examples of disparity predictions from stage four of AnyNet under 50%, 60% and 70% energy budgets on KITTI-2012 and KITTI-2015, respectively. We can see that the predictions of ProjACQN are more similar to those of the dense model than those of ProjAdam. It is worth noting that the results are noisy under 50% and 60% energy budgets, due to the input mask.

Conclusions
We have presented an approach to compress deep neural networks for depth estimation under a given energy constraint. The training process of the depth estimation DNN is formulated as a constrained optimization problem, which is solved through the proposed projected adaptive cubic quasi-Newton optimizer. Experiments show that our method can reduce the energy consumption of AnyNet by 30% while improving accuracy by 0.13% and 0.62% compared to the state-of-the-art method ProjAdam on KITTI-2012 and KITTI-2015, respectively. Compared with existing pruning methods, ProjACQN also achieves the best three-pixel error. It is worth mentioning that, when running the models on the embedded device Nvidia Jetson AGX Xavier, ProjACQN with a 70% energy budget is able to outperform the dense model without an energy budget in terms of both three-pixel error and time consumption.

Figure 1 .
Figure 1. The training loss curves of the four stages on KITTI-2012 under a 70% energy budget.

Figure 2 .
Figure 2. The training loss curves of four stages on KITTI-2015 under 70% energy cost.

Figure 3 .
Figure 3. Disparity predictions using ProjACQN from the four stages of AnyNet under a 70% energy budget on KITTI-2012 and KITTI-2015. Three-pixel errors are shown as red numbers; lower is better.

Figure 4 .
Figure 4. Disparity predictions from stage four of AnyNet under 50%, 60% and 70% energy budgets on KITTI-2012. The percentages in the figure represent the three-pixel errors; lower is better. Evaluation results of ProjACQN are marked in blue.

Figure 5 .
Figure 5. Disparity predictions from stage four of AnyNet under 50%, 60% and 70% energy budgets on KITTI-2015. The percentages in the figure represent the three-pixel errors; lower is better. Evaluation results of ProjACQN are marked in blue.

Table 1 .
Three-pixel error (%) and resulting energy consumption of AnyNet with 50%, 60% and 70% energy budgets on the KITTI-2012 dataset. A lower three-pixel error is better.