Population Risk Improvement with Model Compression: An Information-Theoretic Approach

It has been reported in many recent works on deep model compression that the population risk of a compressed model can be even better than that of the original model. In this paper, an information-theoretic explanation for this population risk improvement phenomenon is provided by jointly studying the decrease in the generalization error and the increase in the empirical risk that results from model compression. It is first shown that model compression reduces an information-theoretic bound on the generalization error, which suggests that model compression can be interpreted as a regularization technique to avoid overfitting. The increase in empirical risk caused by model compression is then characterized using rate distortion theory. These results imply that the overall population risk could be improved by model compression if the decrease in generalization error exceeds the increase in empirical risk. A linear regression example is presented to demonstrate that such a decrease in population risk due to model compression is indeed possible. Our theoretical results further suggest a way to improve a widely used model compression algorithm, i.e., Hessian-weighted K-means clustering, by regularizing the distance between the clustering centers. Experiments with neural networks are provided to validate our theoretical assertions.


Introduction
Although deep neural networks have achieved remarkable success in various domains [1], e.g., computer vision [2], playing games like Go [3], and autonomous driving [4], improving the performance of deep models often requires deeper layers and more complex network structures, which usually entail a large number of parameters. For example, in the application of image classification, it takes over 200 MB to save the parameters of AlexNet [2] and more than 500 MB for VGG-16 net [5]. Hence, it is difficult to port such large models to resource-limited devices such as mobile devices and embedded systems, due to their limited storage, bandwidth, energy, and computational resources.
For this reason, there has been a flurry of work on compressing deep neural networks (see [6][7][8] for recent surveys). Existing studies mainly focus on designing compression algorithms that reduce the memory and computational cost while keeping the same level of population risk. In some recent papers [9][10][11][12], aggressive model compression algorithms have been proposed, which require 10% or less of the storage of the original model. Surprisingly, it has been observed empirically in these works that the population risk of the compressed model can often be even better than that of the original model. This phenomenon is counter-intuitive at first glance, since more compression generally leads to more information loss.
Indeed, a compressed model would usually have a larger empirical risk than the original one, since machine learning methods are usually trained by minimizing the empirical risk. On the other hand, model compression could possibly decrease the generalization error, since it can be interpreted as a regularization technique to avoid overfitting. As the population risk is the sum of the empirical risk and the generalization error, it is possible for the population risk to be reduced by model compression.

Contributions
In this paper, we provide an information-theoretic explanation for the population risk improvement with model compression by jointly characterizing the decrease in generalization error and the increase in empirical risk. Specifically, we focus on the case where the model is compressed based on a pre-trained model.
We first prove that model compression leads to a tightening of the information-theoretic generalization error bound in [13], and it can therefore be interpreted as a regularization method to reduce overfitting. Furthermore, by defining a distortion metric based on the difference in the empirical risk between the original model obtained by empirical risk minimization (ERM) and the compressed model, we use rate distortion theory to characterize the increase in empirical risk as a function of the number of bits R used to describe the model. If the decrease in generalization error exceeds the increase in empirical risk, the population risk can be improved. An empirical illustration of this result for the MNIST dataset is provided in Figure 1, where model compression can lead to population risk improvement (details are given in Section 7). To better demonstrate our theoretical results, we investigate the example of linear regression comprehensively, where we develop explicit bounds on the generalization error and the increase in empirical risk.

Figure 1. Cross entropy loss versus compression ratio for a neural network on MNIST. The generalization error of $\hat{W}$ decreases and the empirical risk of $\hat{W}$ increases with more compression (smaller compression ratio). The population risk of $\hat{W}$ is less than that of $W$ for compression ratios larger than 6% in this figure. As the compression ratio goes to 100% (no compression), the population risk of $\hat{W}$ converges to that of the original model $W$.

Our results also suggest a way to improve a method for compression based on Hessian-weighted K-means clustering [11], in both the scalar and vector cases, by regularizing the distance between the clustering centers. Our experiments with neural networks validate our theoretical assertions and demonstrate the effectiveness of the proposed regularizer.

Related Works
There have been many studies on model compression for deep neural networks. Compression can be achieved by modifying the training process, e.g., network structure optimization [14], low precision neural networks [15], and neural networks with binary weights [16,17]. Here we mainly discuss compression approaches that are applied to a pre-trained model.
Pruning, quantization, and matrix factorization are the most popular approaches to compressing pre-trained deep neural networks. The study of pruning algorithms, which remove redundant parameters from neural networks, dates back to the 1980s and 1990s [18][19][20]. More recently, an iterative pruning and retraining algorithm to further reduce the size of deep models was proposed in [9,21]. The method of network quantization or weight sharing, i.e., employing a clustering algorithm to group the weights in a neural network, and its variants, including vector quantization [22], soft quantization [23,24], fixed point quantization [25], transform quantization [26], and Hessian-weighted quantization [11], have been extensively investigated. Matrix factorization, where a low-rank approximation of the weight matrix is used in place of the original weights, has also been widely studied [27][28][29].
All of the aforementioned works demonstrate the effectiveness of their compression methods via comprehensive numerical experiments. Little research has been done to develop a theoretical understanding of how model compression affects performance. In [30], an information-theoretic view of model compression via rate distortion theory is provided, with the focus on characterizing the tradeoff between model compression and only the empirical risk of the compressed model. In [31][32][33], using a PAC-Bayesian framework, a non-vacuous generalization error bound for the compressed model is derived based on its smaller model complexity.
In contrast to these works, instead of focusing on minimizing only the empirical risk as in [30], or only the generalization error as in [33], we use the mutual information based generalization error bound developed in [13,34] jointly with rate distortion theory to connect the analyses of generalization error and empirical risk. In this way, we are able to characterize the tradeoff between the decrease in generalization error and the increase in empirical risk that results from model compression, and thus provide an understanding of why model compression can improve the population risk. More importantly, our theoretical studies offer insights for designing practical model compression algorithms.
The rest of the paper is organized as follows. In Section 2, we provide relevant definitions and review relevant results from rate distortion theory. In Section 3, we prove that model compression results in the tightening of an information-theoretic generalization error upper bound. In Section 4, we use rate distortion theory to characterize the tradeoff between the increase in empirical risk and the decrease in generalization error that results from model compression. In Section 5, we quantify this tradeoff for a linear regression model. In Section 6, we discuss how the Hessian-weighted K-means clustering compression approach can be improved by using a regularizer motivated by our theoretical results. In Section 7, we provide some experiments with neural network models to validate our theoretical results and demonstrate the effectiveness of the proposed regularizer.

Notation 1. For a random variable $X$ generated from a distribution $\mu$, we use $\mathbb{E}_{X \sim \mu}$ to denote the expectation taken over $X$ with distribution $\mu$. We use $I_d$ to denote the d-dimensional identity matrix and $\|A\|$ to denote the spectral norm of a matrix $A$. The cumulant generating function (CGF) of a random variable $X$ is defined as $\Lambda_X(\lambda) \triangleq \ln \mathbb{E}[e^{\lambda(X - \mathbb{E}X)}]$. All logarithms are natural.

Review of Rate Distortion Theory
Rate distortion theory, introduced by Shannon [35], is a major branch of information theory that studies the fundamental limits of lossy data compression. It addresses the minimal number of bits per symbol, as measured by the rate R, to transmit a random variable W such that the receiver can reconstruct W without exceeding distortion D.
Specifically, let $W^m = \{W_1, W_2, \cdots, W_m\}$ denote a sequence of $m$ i.i.d. random variables $W_i \in \mathcal{W}$ generated from a source distribution $P_W$. An encoder $f_m: \mathcal{W}^m \to \{1, 2, \cdots, M\}$ maps the message $W^m$ into a codeword, and a decoder $g_m: \{1, 2, \cdots, M\} \to \hat{\mathcal{W}}^m$ reconstructs the message by an estimate $\hat{W}^m$ from the codeword, where $\hat{\mathcal{W}} \subseteq \mathcal{W}$ denotes the range of each $\hat{W}_i$. A distortion metric $d: \mathcal{W} \times \hat{\mathcal{W}} \to \mathbb{R}^+$ quantifies the difference between the original and reconstructed messages. The distortion between sequences $w^m$ and $\hat{w}^m$ is defined to be
$$d(w^m, \hat{w}^m) \triangleq \frac{1}{m} \sum_{i=1}^m d(w_i, \hat{w}_i).$$
A commonly used distortion metric is the squared distortion: $d(w, \hat{w}) = (w - \hat{w})^2$.
Now we define the rate-distortion and distortion-rate functions for lossy data compression.

Definition 2.
A tuple $(m, M, D)$ is said to be achievable if there exists an encoder-decoder pair $(f_m, g_m)$ with $M$ codewords whose expected distortion satisfies $\mathbb{E}[d(W^m, \hat{W}^m)] \le D$. The rate-distortion function and the distortion-rate function are defined as
$$R(D) \triangleq \lim_{m \to \infty} \frac{1}{m} \log M^*(m, D), \qquad D(R) \triangleq \lim_{m \to \infty} D^*(m, R),$$
where $M^*(m, D) \triangleq \min\{M : (m, M, D) \text{ is achievable}\}$ and $D^*(m, R) \triangleq \min\{D : (m, 2^{mR}, D) \text{ is achievable}\}$.
The main theorem of rate distortion theory is as follows.

Lemma 1 ([36]). For an i.i.d. source $W$ with distribution $P_W$ and distortion function $d(w, \hat{w})$:
$$R(D) = \min_{P_{\hat{W}|W}:\, \mathbb{E}[d(W, \hat{W})] \le D} I(W; \hat{W}), \qquad D(R) = \min_{P_{\hat{W}|W}:\, I(W; \hat{W}) \le R} \mathbb{E}[d(W, \hat{W})],$$
where $I(W; \hat{W}) \triangleq \mathbb{E}_{W,\hat{W}}\left[\ln \frac{P_{W,\hat{W}}}{P_W P_{\hat{W}}}\right]$ denotes the mutual information between $W$ and $\hat{W}$.
The rate-distortion function quantifies the smallest number of bits required to compress the data given the distortion, and the distortion-rate function quantifies the minimal distortion that can be achieved under the rate constraint.
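To make these quantities concrete, consider the classical Gaussian example: for a $\mathcal{N}(0, \sigma^2)$ source under squared distortion, $R(D) = \frac{1}{2}\ln(\sigma^2/D)$ for $0 \le D \le \sigma^2$, and hence $D(R) = \sigma^2 e^{-2R}$ (in nats). A minimal numerical sketch (the function names are ours):

```python
import numpy as np

def gaussian_rate_distortion(sigma2, D):
    """R(D) in nats for a N(0, sigma2) source under squared distortion."""
    return 0.5 * np.log(sigma2 / D) if D < sigma2 else 0.0

def gaussian_distortion_rate(sigma2, R):
    """D(R): the minimal squared distortion achievable at rate R nats."""
    return sigma2 * np.exp(-2.0 * R)

# Each extra nat of description shrinks the achievable distortion by e^{-2}.
for R in [0.0, 0.5, 1.0, 2.0]:
    print(f"R = {R:.1f} nats -> D(R) = {gaussian_distortion_rate(1.0, R):.4f}")
```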

Generalization Error
Consider an instance space $\mathcal{Z}$, a hypothesis space $\mathcal{W}$, and a non-negative loss function $\ell: \mathcal{W} \times \mathcal{Z} \to \mathbb{R}^+$. A training dataset $S = \{Z_1, \cdots, Z_n\}$ consists of $n$ i.i.d. samples $Z_i \in \mathcal{Z}$ drawn from an unknown distribution $\mu$. The goal of a supervised learning algorithm is to find an output hypothesis $w \in \mathcal{W}$ that minimizes the population risk
$$L_\mu(w) \triangleq \mathbb{E}_{Z \sim \mu}[\ell(w, Z)].$$
In practice, $\mu$ is unknown, and therefore $L_\mu(w)$ cannot be computed directly. Instead, the empirical risk of $w$ on the training dataset $S$ is studied, which is defined as
$$L_S(w) \triangleq \frac{1}{n} \sum_{i=1}^n \ell(w, Z_i).$$
A learning algorithm can be characterized by a randomized mapping from the training dataset $S$ to a hypothesis $W$ according to a conditional distribution $P_{W|S}$. The (expected) generalization error of a supervised learning algorithm is the expected difference between the population risk of the output hypothesis and its empirical risk on the training dataset:
$$\mathrm{gen}(\mu, P_{W|S}) \triangleq \mathbb{E}[L_\mu(W) - L_S(W)],$$
where the expectation is taken over the joint distribution $P_{S,W} = P_S \otimes P_{W|S}$. The generalization error measures the extent to which the learning algorithm overfits the training data.
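As a concrete illustration of these definitions, the following sketch estimates the generalization error of a toy least-squares learner by Monte Carlo, approximating the population risk with a large held-out sample (the data model and all names here are illustrative assumptions, not the setup of this paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, trials = 5, 20, 500
w_star = rng.normal(size=d)                 # assumed ground-truth model
gaps = []
for _ in range(trials):
    # draw a training set S of n i.i.d. samples from mu
    x = rng.normal(size=(n, d))
    y = x @ w_star + rng.normal(size=n)
    w = np.linalg.lstsq(x, y, rcond=None)[0]      # the learned hypothesis W
    # a large fresh sample approximates the population risk L_mu(W)
    x_te = rng.normal(size=(10_000, d))
    y_te = x_te @ w_star + rng.normal(size=10_000)
    pop = np.mean((y_te - x_te @ w) ** 2)
    emp = np.mean((y - x @ w) ** 2)               # empirical risk L_S(W)
    gaps.append(pop - emp)
print("estimated gen(mu, P_{W|S}):", np.mean(gaps))   # positive: overfitting
```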

Compression Can Improve Generalization
In this section, we show that lossy compression can lead to a tighter mutual information based generalization error upper bound, which potentially reduces the generalization error of a supervised learning algorithm.
We start with the following lemma from [13], which provides an upper bound on the generalization error using the mutual information $I(S; W)$ between the training dataset $S$ and the output of the learning algorithm $W$.

Lemma 2 ([13]). Suppose that $\ell(w, Z)$ is $\sigma$-sub-Gaussian under $Z \sim \mu$ for all $w \in \mathcal{W}$. Then,
$$|\mathrm{gen}(\mu, P_{W|S})| \le \sqrt{\frac{2\sigma^2}{n} I(S; W)}.$$
Compression can be viewed as a post-processing of the output of a learning algorithm. The output model $W$ generated by a learning algorithm can be quantized, pruned, factorized, or even perturbed by noise, which results in a compressed model $\hat{W}$. Assume that the compression algorithm is based only on $W$ and can be described by a conditional distribution $P_{\hat{W}|W}$. Then the following Markov chain holds: $S \to W \to \hat{W}$. By the data processing inequality,
$$I(S; \hat{W}) \le \min\{I(S; W),\, I(W; \hat{W})\}.$$
Thus, we have the following theorem characterizing the generalization error of the compressed model.

Theorem 1. Suppose that $\ell(\hat{w}, Z)$ is $\sigma$-sub-Gaussian under $Z \sim \mu$ for all $\hat{w} \in \hat{\mathcal{W}}$. Then,
$$|\mathrm{gen}(\mu, P_{\hat{W}|S})| \le \sqrt{\frac{2\sigma^2}{n} \min\{I(S; W),\, I(W; \hat{W})\}}.$$
Note that the generalization error upper bound in Theorem 1 for the compressed model is always no greater than the one in Lemma 2. This allows for the interpretation of compression as a regularization technique to reduce the generalization error.
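The data processing inequality at the heart of Theorem 1 is easy to check numerically. The following sketch builds a small discrete Markov chain $S \to W \to \hat{W}$ with a coarse quantizer as the compression step and verifies $I(S; \hat{W}) \le I(S; W)$ (the chain and kernels are toy assumptions):

```python
import numpy as np

def mutual_info(p_xy):
    """I(X;Y) in nats from a joint pmf given as a 2-D array."""
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (px @ py)[mask])))

rng = np.random.default_rng(0)
p_s = np.full(4, 0.25)                           # uniform source, |S| = 4
P_w_given_s = rng.dirichlet(np.ones(8), size=4)  # random learner, |W| = 8
P_what_given_w = np.zeros((8, 2))                # deterministic 2-level quantizer
P_what_given_w[np.arange(8), np.arange(8) // 4] = 1.0

p_sw = p_s[:, None] * P_w_given_s                # joint pmf of (S, W)
p_swhat = p_sw @ P_what_given_w                  # joint pmf of (S, W_hat)
print("I(S;W)     =", mutual_info(p_sw))
print("I(S;W_hat) =", mutual_info(p_swhat))      # never exceeds I(S;W)
```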

Generalization Error and Model Distortion
In this section, we define a distortion metric in model compression that allows us to relate the distortion (the increase in empirical risk) due to compression with the reduction in the generalization error bound discussed in Section 3.

Distortion Metric in Model Compression
The expected population risk of a model $W$ can be written as
$$\mathbb{E}[L_\mu(W)] = \mathbb{E}[L_S(W)] + \mathrm{gen}(\mu, P_{W|S}), \quad (11)$$
where the first term, which is the expected empirical risk, reflects how well the model $W$ fits the training data, while the second term reflects how well the model generalizes.
In the empirical risk minimization framework, we control both terms by (1) minimizing the empirical risk of $W$ directly or using other stochastic optimization algorithms, and (2) using regularization methods to control the generalization error, e.g., early stopping and dropout [1]. Now, consider the expected population risk of the compressed model $\hat{W}$:
$$\mathbb{E}[L_\mu(\hat{W})] = \mathbb{E}[L_S(W)] + \mathrm{gen}(\mu, P_{\hat{W}|S}) + \mathbb{E}[L_S(\hat{W}) - L_S(W)]. \quad (12)$$
Compared with (11), we note that the first empirical risk term is independent of the compression algorithm, the second generalization error term can be upper bounded by Theorem 1, and the third term $\mathbb{E}[L_S(\hat{W}) - L_S(W)]$ quantifies the increase in the empirical risk if we use the compressed model $\hat{W}$ instead of the original model $W$. We then define the following distortion metric for model compression:
$$d_S(w, \hat{w}) \triangleq L_S(\hat{w}) - L_S(w), \quad (13)$$
which is the difference in the empirical risk between the compressed model $\hat{w}$ and the original model $w$. In general, the function $d_S(w, \hat{w})$ is not always non-negative. However, for the ERM solution $W$, which is obtained by minimizing the empirical risk $L_S(w)$, we have $d_S(W, \hat{w}) \ge 0$ for any $\hat{w}$, which ensures that $d_S$ is a valid distortion metric. By Theorem 1, it follows that
$$\mathbb{E}[L_\mu(\hat{W})] \le \mathbb{E}[L_S(W)] + \underbrace{\sqrt{\frac{2\sigma^2}{n} I(W; \hat{W})} + \mathbb{E}_{S,W,\hat{W}}\big[d_S(W, \hat{W})\big]}_{\mathcal{L}_{S,W}(P_{\hat{W}|W})}, \quad (14)$$
where $\mathcal{L}_{S,W}(P_{\hat{W}|W})$ is an upper bound on the expected difference between the population risk of $\hat{W}$ and the empirical risk of the original model $W$ on the training dataset $S$. Note that $\mathbb{E}[L_S(W)]$ is independent of the compression algorithm. Therefore, the bound in (14) can be viewed as an upper bound on the population risk of the compressed model $\hat{W}$.

Population Risk Improvement
By Lemma 1, the smallest distortion that can be achieved at rate $R$ is $D(R) = \min_{P_{\hat{W}|W}:\, I(W;\hat{W}) \le R} \mathbb{E}_{S,W,\hat{W}}[d_S(W, \hat{W})]$. Thus, the tightest bound in (14) that can be achieved at rate $R$ is given in the following theorem.

Theorem 2. Suppose that $\ell(\hat{w}, Z)$ is $\sigma$-sub-Gaussian under $Z \sim \mu$ for all $\hat{w} \in \hat{\mathcal{W}}$. Then, for any rate $R$,
$$\min_{P_{\hat{W}|W}:\, I(W;\hat{W}) \le R} \mathcal{L}_{S,W}(P_{\hat{W}|W}) \le \sqrt{\frac{2\sigma^2 R}{n}} + D(R). \quad (15)$$
From the properties of the distortion-rate function [36], we know that $D(R)$ is a decreasing function of $R$. Thus, we see that as $R$ decreases, the first term in (15), which corresponds to the generalization error, decreases, while the second term, which corresponds to the empirical risk, increases. Due to this tradeoff, it may be possible for the bound in (15) to become smaller with compression, i.e., with a smaller rate $R$. This indicates that the population risk could be improved by a compression algorithm that minimizes the upper bound $\mathcal{L}_{S,W}(P_{\hat{W}|W})$.

Remark 1.
In order to conclude definitively that the population risk can be improved with compression, we need to find a lower bound (as a function of $R$) that matches (at least in the order sense) the upper bound in Theorem 2. This appears to be difficult to construct in general. One approach might be to use the same decomposition as in (12) and develop lower bounds for $\min_{I(W;\hat{W}) = R} \mathrm{gen}(\mu, P_{\hat{W}|S})$ and $\min_{I(W;\hat{W}) = R} \mathbb{E}_{S,W,\hat{W}}[d_S(W, \hat{W})]$ independently. However, such an approach runs into the following issues: (1) such a lower bound would be loose, since the compression algorithm $P_{\hat{W}|W}$ that minimizes the generalization error, the one that minimizes the distortion, and the one that minimizes the sum of the two can be quite different; and (2) a lower bound for the generalization error needs to be developed, which appears to be difficult, with the existing literature mainly focusing on lower bounding the excess risk, e.g., [37].
As will be shown in Section 7, we can indeed improve the population risk with a well-designed compression algorithm in practical applications.

Example: Linear Regression
In this section, we comprehensively explore the example of linear regression to gain a better understanding of the results in Section 4. To this end, we develop explicit upper bounds on the generalization error and the distortion-rate function $D(R)$. All proofs of the lemmas and theorems are provided in Appendices A-D.
Suppose that the dataset $S = \{Z_1, \cdots, Z_n\} = \{(X_1, Y_1), \cdots, (X_n, Y_n)\}$ is generated from the following linear model with weight vector $w^* = (w^{*(1)}, \cdots, w^{*(d)}) \in \mathbb{R}^d$:
$$Y_i = {w^*}^\top X_i + \varepsilon_i, \quad (16)$$
where the $X_i$'s are i.i.d. $d$-dimensional random vectors with distribution $\mathcal{N}(0, \Sigma_X)$, and $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ denotes i.i.d. Gaussian noise. We adopt the mean squared error as the loss function, and the corresponding empirical risk on $S$ is
$$L_S(w) = \frac{1}{n} \|Y - X^\top w\|_2^2$$
for $w \in \mathcal{W} = \mathbb{R}^d$, where $X \in \mathbb{R}^{d \times n}$ denotes all the input samples, and $Y \in \mathbb{R}^n$ denotes the responses. If $n > d$, the ERM solution is
$$W = (XX^\top)^{-1} XY,$$
which is deterministic given $S$. Its generalization error can be computed exactly, as in the following lemma (see Appendix A for a detailed proof). Before turning to the compressed model, the sketch below instantiates this setup numerically.
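A short sketch of this setup (with the ground truth $w^*$ chosen so that its entries cluster around two values, as in the simulations of Section 5.4; our choice of $\pm 1$ is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma2 = 50, 80, 1.0
w_star = rng.choice([-1.0, 1.0], size=d)    # entries well covered by 2 centers
X = rng.normal(size=(d, n))                 # Sigma_X = I_d; columns are X_i
Y = X.T @ w_star + np.sqrt(sigma2) * rng.normal(size=n)

# ERM solution W = (X X^T)^{-1} X Y, valid since n > d
W = np.linalg.solve(X @ X.T, X @ Y)
print("empirical risk L_S(W):", np.mean((Y - X.T @ W) ** 2))
```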

Information-Theoretic Generalization Bounds for Compressed Linear Model
We note that the mutual information based bound in Lemma 2 is not applicable to this linear regression model, since $W$ is a deterministic function of $S$, and $I(S; W) = \infty$. However, the bound in Theorem 1 can still be finite for a compressed model. Consider a compression algorithm which maps the original weights $W \in \mathbb{R}^d$ to the compressed model $\hat{W} \in \hat{\mathcal{W}} \subseteq \mathbb{R}^d$. For a fixed and compact $\hat{\mathcal{W}}$, we define
$$C(w^*) \triangleq \sup_{\hat{w} \in \hat{\mathcal{W}}} \|\hat{w} - w^*\|_2^2,$$
which measures the largest distance between a reconstruction $\hat{w}$ and the optimal weights $w^*$. The following proposition provides an upper bound on the generalization error of the compressed model $\hat{W}$; the detailed proof is provided in Appendix B.

Distortion-Rate Function for Linear Model
We now provide an upper bound on the distortion-rate function $D(R)$ for the linear regression model. Note that $\nabla L_S(W) = 0$, since $W$ minimizes the empirical risk. The Hessian matrix of the loss function is
$$H_S = \frac{2}{n} XX^\top,$$
which is not a function of $w$. Then, the distortion function can be written as
$$d_S(W, \hat{W}) = L_S(\hat{W}) - L_S(W) = (\hat{W} - W)^\top \frac{XX^\top}{n} (\hat{W} - W). \quad (23)$$
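Because the loss is quadratic and $\nabla L_S(W) = 0$, the expression in (23) is exact rather than an approximation, which a few lines of numerics confirm (a sanity-check sketch under the same synthetic setup as above):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 80
X = rng.normal(size=(d, n))
Y = X.T @ rng.normal(size=d) + rng.normal(size=n)
W = np.linalg.solve(X @ X.T, X @ Y)          # ERM solution: grad L_S(W) = 0

w_hat = W + 0.01 * rng.normal(size=d)        # any compressed/perturbed model
lhs = np.mean((Y - X.T @ w_hat) ** 2) - np.mean((Y - X.T @ W) ** 2)
rhs = (w_hat - W) @ (X @ X.T / n) @ (w_hat - W)
print(np.isclose(lhs, rhs))                  # True up to floating point error
```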
The following theorem characterizes upper bounds on $R(D)$ and $D(R)$ for linear regression.

Proof sketch. The proof of the upper bound on $R(D)$ is based on considering a Gaussian random vector with the same mean and covariance matrix as $W$. In addition, the upper bound is achieved when $W - \hat{W}$ is independent of the dataset $S$, with the conditional distribution given in (26), where $\alpha \triangleq \frac{nD}{d\sigma^2} \le 1$. Note that this "compression algorithm" requires knowledge of the optimal weights $w^*$, which are unknown in practice.
The details can be found in Appendix C.
Remark 2. As shown in [38], if $n > d/\epsilon^2$, then $\|\frac{1}{n} XX^\top - \Sigma_X\| \le \epsilon$ holds with high probability. Then, a corresponding lower bound on $R(D)$ holds if we approximate $\frac{1}{n} XX^\top$ in (23) by $\Sigma_X$; the bound is expressed in terms of a Gaussian random vector $W_G$ with the same mean and covariance as $W$. The details can be found in Appendix D.
Combining Propositions 1 and 2, we have the following result.
In (28) the first term corresponds to the generalization error, which decreases with compression, and the second term corresponds to the empirical risk, which increases with compression.

Evaluation and Visualization
In the following plots, we generate the training dataset $S$ using the linear model in (16) with $d = 50$, $n = 80$, $\Sigma_X = I_d$, and $\sigma^2 = 1$. We consider the following two compression algorithms. The first one is the conditional distribution $P_{\hat{W}|W}$ in the proof of achievability (26), which requires the knowledge of $w^*$ and is denoted as "Oracle". The second one is the well-known K-means clustering algorithm, where the weights in $W$ are grouped into $K$ clusters and represented by the cluster centers in the reconstruction $\hat{W}$. By changing the number of clusters $K$, we can control the rate $R$, i.e., $I(W; \hat{W})$. We average the performance and estimate $I(W; \hat{W})$ of these algorithms over 10,000 Monte Carlo trials in the simulation.
We note that $I(W; \hat{W})$ is equal to the number of bits used in compression only in the asymptotic regime of a large number of samples. In practice, we may have only one sample of the weights $W$, and therefore $I(W; \hat{W})$ simply measures the extent to which compression is performed by the compression algorithm.
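For a deterministic quantizer such as K-means, $I(W; \hat{W}) = H(\hat{W})$, so when the coordinates $W^{(j)}$ are treated as i.i.d. samples the per-coordinate rate can be estimated from the cluster-assignment frequencies, as in the following sketch (using scikit-learn's KMeans as a stand-in for the clustering step in our simulation):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
W = rng.normal(size=(1000, 1))               # weight coordinates as samples
K = 8
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(W)

# Deterministic quantizer: I(W; W_hat) = H(W_hat) <= log K nats.
p = np.bincount(km.labels_, minlength=K) / len(W)
H = -np.sum(p[p > 0] * np.log(p[p > 0]))
print(f"H(W_hat) = {H:.3f} nats (log K = {np.log(K):.3f})")
```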
In Figure 2a, we plot the generalization error bound in Proposition 1 as a function of the rate $R$ and compare it with the generalization errors of the Oracle and K-means algorithms. It can be seen that Proposition 1 provides a valid upper bound for the generalization error, but this bound is tight only when $R$ is small. Moreover, both compression algorithms achieve smaller generalization errors than that of the ERM solution $W$, which validates the result in Theorem 1. Figure 2b plots the upper bound on the distortion-rate function in Theorem 2 and the distortions achieved by the Oracle and K-means algorithms. The distortion of the Oracle decreases as we increase the rate $R$ and matches the $D(R)$ function well. However, there is a large gap between the distortion achieved by the K-means algorithm and $D(R)$. One possible explanation is that, since $w^*$ is unknown, it is impossible for the K-means algorithm to learn the optimal cluster centers with only one sample of $W$. Even if we view $W^{(j)}$, $j = 1, \cdots, d$, as i.i.d. samples from the same distribution, there is still a gap between the distortion achieved by the K-means algorithm and the optimal quantization as studied in [39].
We plot the population risks of the ERM solution $W$, the Oracle, and the K-means algorithm in Figure 2c. It is not surprising that the Oracle algorithm achieves a small population risk, since $\hat{W}$ is a function of $w^*$ and $\hat{W} = w^*$ when $R = 0$. However, it can be seen that the K-means algorithm achieves a smaller population risk than the original model $W$ when we use fewer clusters in the K-means algorithm, i.e., a smaller rate $R$, since the decrease in generalization error exceeds the increase in empirical risk. We note that the minimal population risk is achieved at $K = 2$, since we initialize $w^*$ so that $w^{*(i)}$, $1 \le i \le d$, can be well approximated by two cluster centers.

Clustering Algorithm Minimizing $\mathcal{L}_{S,W}$
In this section, we propose an improvement of the Hessian-weighted (HW) K-means clustering algorithm [11] for model compression, obtained by regularizing the distance between the cluster centers so as to minimize the upper bound $\mathcal{L}_{S,W}(P_{\hat{W}|W})$, as suggested by our theoretical results in Section 4.

Hessian-Weighted K-Means Clustering
The goal of HW K-means is to minimize the distortion in the empirical risk $d_S(W, \hat{W})$, which has the following Taylor series approximation:
$$d_S(W, \hat{W}) \approx \nabla L_S(W)^\top (\hat{W} - W) + \frac{1}{2} (\hat{W} - W)^\top H_S(W) (\hat{W} - W),$$
where $H_S(W)$ is the Hessian matrix. Assuming that $W$ is a local minimum of $L_S(w)$ (the ERM solution) with $\nabla L_S(W) \approx 0$, the first term can be ignored. Furthermore, the Hessian matrix $H_S(W)$ can be approximated by a diagonal matrix, which further simplifies the objective to $d_S(W, \hat{W}) \approx \sum_{j=1}^d h^{(j)} (W^{(j)} - \hat{W}^{(j)})^2$, where $h^{(j)}$ is the $j$-th diagonal element of the Hessian matrix.
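The diagonal entries $h^{(j)}$ must be estimated in practice. One common proxy, shown in the sketch below, is the average of squared gradients over the training data (an empirical-Fisher-style approximation; this is our assumption for illustration, and the exact estimator used in [11] may differ):

```python
import torch

def diag_hessian_proxy(model, loss_fn, loader):
    """Approximate the diagonal Hessian of the empirical risk by averaged
    squared gradients (empirical Fisher). Returns one tensor per parameter."""
    h = [torch.zeros_like(p) for p in model.parameters()]
    n_batches = 0
    for x, y in loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for h_j, p in zip(h, model.parameters()):
            if p.grad is not None:
                h_j += p.grad.detach() ** 2
        n_batches += 1
    return [h_j / n_batches for h_j in h]
```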

Diameter Regularization
In contrast to HW K-means, which considers only the empirical risk, our goal is to obtain as small a population risk as possible by minimizing the upper bound $\mathcal{L}_{S,W}(P_{\hat{W}|W})$ in (14). Here, we let the number of clusters $K$ be an input argument of the algorithm, so that $I(W; \hat{W}) \le \log_2 K$ bits, and we minimize $\mathcal{L}_{S,W}(P_{\hat{W}|W})$ by carefully designing the reconstructed weights given $K$, i.e., by choosing the cluster centers $\{c^{(1)}, \cdots, c^{(K)}\}$. Minimizing the sub-Gaussian parameter $\sigma$ is then one way to control the generalization error of the compression algorithm. Recall that in Proposition 1, the sub-Gaussian parameter is related to $C(w^*) = \sup_{\hat{w} \in \hat{\mathcal{W}}} \|\hat{w} - w^*\|_2^2$ in linear regression. Note that this quantity can be interpreted as the diameter of the set $\hat{\mathcal{W}}$. Since the ground truth $w^*$ is unknown in practice, we propose the following diameter regularization, approximating $C(w^*)$ by
$$\beta \max_{1 \le k < k' \le K} \big(c^{(k)} - c^{(k')}\big)^2,$$
where $\beta$ is a parameter that controls the penalty term and can be selected by cross-validation in practice. Our diameter-regularized Hessian-weighted (DRHW) K-means algorithm solves the following optimization problem:
$$\min_{\{c^{(k)}\},\, \{a_j\}} \; \sum_{j=1}^d h^{(j)} \big(W^{(j)} - c^{(a_j)}\big)^2 + \beta \max_{1 \le k < k' \le K} \big(c^{(k)} - c^{(k')}\big)^2,$$
where $a_j \in \{1, \cdots, K\}$ denotes the cluster assignment of weight $W^{(j)}$. This optimization problem can be easily extended to the vector case, which leads to a vector quantization algorithm. Suppose that we group the $d$-dimensional weights $w = \{w^{(1)}, \cdots, w^{(d)}\}$ into $d' = d/m$ vectors of length $m$, i.e., $\{w^{(1)}, \cdots, w^{(d')}\}$ with $w^{(j)} \in \mathbb{R}^m$; then our goal is to find cluster centers $c_k \in \mathbb{R}^m$ and assignments minimizing the following cost function:
$$\min_{\{c_k\},\, \{a_j\}} \; \sum_{j=1}^{d'} \big(w^{(j)} - c_{a_j}\big)^\top H^{(j)} \big(w^{(j)} - c_{a_j}\big) + \beta \max_{1 \le k < k' \le K} \|c_k - c_{k'}\|_2^2,$$
where $H^{(j)}$ is the diagonal Hessian matrix corresponding to the vector $w^{(j)}$. An iterative algorithm to solve the above optimization problem for vector quantization is provided in Algorithm 1. The algorithm alternates between minimizing the objective function over the cluster centers and over the assignments. In the Assignment step, we first fix the centers and assign each $w^{(j)}$ to its nearest neighbor. In the Update step, we then fix the assignments and update the centers by the weighted mean of each cluster. For the farthest pair of centers, the diameter regularizer pushes them toward each other, so that the output centers have a potentially smaller diameter than those of regular K-means. We note that the time complexity of the proposed diameter-regularized Hessian-weighted K-means algorithm is the same as that of the original K-means algorithm.
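A minimal sketch of the scalar DRHW iteration is given below. The assignment and update steps follow the description above; how exactly the diameter penalty enters the center update is our assumption (here, an exact coordinate-wise minimization of the regularized objective for the farthest pair of centers), and Algorithm 1 may differ in detail:

```python
import numpy as np

def drhw_kmeans(w, h, K, beta, n_iters=100, seed=0):
    """Diameter-regularized Hessian-weighted K-means, scalar case.
    w: (d,) flattened weights; h: (d,) diagonal Hessian estimates;
    beta = 0 recovers plain HW K-means."""
    rng = np.random.default_rng(seed)
    c = rng.choice(w, size=K, replace=False)            # init centers
    a = np.zeros(len(w), dtype=int)
    for _ in range(n_iters):
        # Assignment step: each weight goes to its nearest center.
        a = np.argmin((w[:, None] - c[None, :]) ** 2, axis=1)
        # Update step: Hessian-weighted mean of each cluster.
        for k in range(K):
            m = a == k
            if m.any():
                c[k] = np.sum(h[m] * w[m]) / np.sum(h[m])
        # Diameter step: re-solve the farthest pair of centers with the
        # beta-penalty included, pulling them toward each other.
        i, j = np.unravel_index(np.argmax(np.abs(c[:, None] - c[None, :])),
                                (K, K))
        for p, q in ((i, j), (j, i)):
            m = a == p
            hw, hs = np.sum(h[m] * w[m]), np.sum(h[m])
            c[p] = (hw + beta * c[q]) / (hs + beta)
    return c, a

# Usage: quantize synthetic weights, report both terms of the objective.
rng = np.random.default_rng(1)
w = rng.normal(size=1000)
h = rng.uniform(0.1, 1.0, size=1000)        # stand-in diagonal Hessian
c, a = drhw_kmeans(w, h, K=8, beta=50.0)
print("weighted distortion:", np.sum(h * (w - c[a]) ** 2))
print("center diameter    :", np.ptp(c))
```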

Experiments
In this section, we provide real-world experiments to validate our theoretical assertions and to evaluate the DRHW K-means algorithm. (The code for our experiments is available at https://github.com/wgao9/weight-quant (accessed on 13 August 2021).) Our experiments include the compression of: (i) a three-layer fully connected network on the MNIST dataset [40]; and (ii) a convolutional neural network with five convolutional layers and three linear layers on the CIFAR10 dataset [41]. (We downloaded the pre-trained model in PyTorch from https://github.com/aaron-xichen/pytorch-playground (accessed on 13 August 2021).)
In Theorem 1, an upper bound on the expected generalization error is provided; we therefore independently train 50 different models (with the same structure but different parameter initializations) using different subsets of the training samples, and average the results. We use 10% of the training data to train the model for MNIST and 20% of the training data for CIFAR10. For each experiment, we use the same number of clusters for every convolutional and fully connected layer.
In the following experiments, we plot the cross entropy loss as a function of the compression ratio. Note that the compression ratio can be controlled by changing the number of clusters $K$ in the quantization algorithm. To see this, suppose that the neural network has a total of $d$ parameters that need to be compressed, and each parameter takes $b$ bits. Let $C^{(k)}$ be the set of weights in cluster $k$, and let $b_k$ be the number of bits of the codeword assigned to the network parameters in cluster $k$, for $1 \le k \le K$. For a lookup table to decode the quantized values, we need $Kb$ bits to store all the reconstructed weights, i.e., the cluster centers $c = \{c^{(1)}, \cdots, c^{(K)}\}$. The compression ratio is then given by
$$\text{compression ratio} = \frac{Kb + \sum_{k=1}^K |C^{(k)}| \, b_k}{db},$$
where $|\cdot|$ denotes the number of elements in a set. In our experiments, we use a variable-length code, namely the Huffman code, to compute the compression ratio for different numbers of clusters $K$.
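The sketch below computes this ratio, building the Huffman code lengths $b_k$ from the cluster sizes (a self-contained illustration; helper names are ours, and $b = 32$ assumes 32-bit floating-point parameters):

```python
import heapq
import numpy as np

def huffman_code_lengths(counts):
    """Per-cluster Huffman codeword lengths (in bits) from cluster sizes."""
    if len(counts) == 1:
        return [1]                           # degenerate single-cluster code
    heap = [(int(n), k, [k]) for k, n in enumerate(counts)]
    heapq.heapify(heap)
    lengths = [0] * len(counts)
    while len(heap) > 1:
        n1, _, s1 = heapq.heappop(heap)
        n2, _, s2 = heapq.heappop(heap)
        for k in s1 + s2:
            lengths[k] += 1                  # every merge adds one bit
        heapq.heappush(heap, (n1 + n2, min(s1 + s2), s1 + s2))
    return lengths

def compression_ratio(assignments, K, b=32):
    """(Kb + sum_k |C^(k)| b_k) / (d b) for cluster assignments of d weights."""
    counts = np.bincount(assignments, minlength=K)
    b_k = huffman_code_lengths(counts)
    d = len(assignments)
    return (K * b + sum(int(n) * l for n, l in zip(counts, b_k))) / (d * b)

# Example: 1000 weights quantized into K = 8 clusters of random sizes.
rng = np.random.default_rng(0)
print(compression_ratio(rng.integers(0, 8, size=1000), K=8))
```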
In Figures 3 and 4, we compare the scalar DRHW K-means algorithm with the scalar HW K-means algorithm for different compression ratios on the MNIST and CIFAR10 datasets. Both figures demonstrate that the compression algorithm increases the empirical risk but decreases the generalization error, and the net effect is that both compressed models have smaller population risks than those of the original models. More importantly, the DRHW K-means algorithm produces a compressed model with a better population risk than that of the HW K-means algorithm. In Figure 5, we compare the population risk of the scalar DRHW K-means algorithm with that of the vector DRHW K-means algorithm with block length m = 2 for different compression ratios on the MNIST dataset. It can be seen from the figure that the improvement from vector quantization (m = 2) is quite modest, which implies that the dependence between the weights $W^{(j)}$ is weak. However, we can still observe the improvement from adding the diameter regularizer in the vector DRHW K-means algorithm by comparing the curves with β = 50 and β = 0.
In Figure 6, we demonstrate how β affects the performance of our diameter-regularized Hessian-weighted K-means algorithm in the scalar case. It can be seen that as β increases, the generalization error decreases and the distortion in empirical risk increases, which validates the idea that the proposed diameter regularizer can be used to reduce the generalization error. The value of β that results in the best population risk can therefore be chosen via cross-validation in practice.

Conclusions
In this paper, we have provided an information-theoretic understanding of how model compression affects the population risk of a compressed model. In particular, our results indicate that model compression may increase the empirical risk but decrease the generalization error. Therefore, it might be possible to achieve a smaller population risk via model compression. Our experiments validate these theoretical findings. Furthermore, we showed how our information-theoretic bound on the population risk can be used to optimize practical compression algorithms.
We note that our results could be applied to improve other compression algorithms, such as pruning and matrix factorization. Moreover, we believe that the information-theoretic analysis adopted here could be generalized to characterize a similar tradeoff between the generalization error and empirical risk in other applications beyond compressing pre-trained models, e.g., distributed optimization [42] and low precision training [15].