Distributed Vector Quantization Based on Kullback-Leibler Divergence

The goal of vector quantization is to use a small number of reproduction vectors to represent original vectors/data while maintaining the necessary fidelity of the data. Distributed signal processing has received much attention in recent years, since in many applications data are collected and stored at distributed nodes over networks, while centralizing all these data at one processing center is sometimes impractical. In this paper, we develop a distributed vector quantization (VQ) algorithm based on the Kullback-Leibler (K-L) divergence. We start from the centralized case and propose to minimize the K-L divergence between the distribution of the global original data and the distribution of the global reproduction vectors, and then obtain an online iterative solution to this optimization problem based on Robbins-Monro stochastic approximation. Afterwards, we extend the solution to distributed cases by introducing diffusion cooperation among nodes. Numerical simulations show that the performance of the distributed K-L-based VQ algorithm is very close to that of the corresponding centralized algorithm. Moreover, both the centralized and distributed K-L-based VQ algorithms show more robustness to outliers than the (centralized) Linde-Buzo-Gray (LBG) algorithm and the (centralized) self-organizing map (SOM) algorithm.


Introduction
Vector quantization is a signal processing method which uses reproduction vectors to represent original data vectors while maintaining the necessary fidelity of the data [1,2]. As one of the data compression methods able to reduce communication and storage burdens, vector quantization has been intensively studied in recent years. A vector quantizer is a system that maps original/input vectors into corresponding reproduction vectors drawn from a finite reproduction alphabet. Many well-known vector quantization algorithms have been proposed; the Linde-Buzo-Gray (LBG) algorithm [3,4] and the self-organizing map (SOM) algorithm [5-7] are two of the most popular. Based on information-theoretic concepts, vector quantization algorithms which aim to minimize the Cauchy-Schwartz (C-S) divergence or the Kullback-Leibler (K-L) divergence between the distributions of the original data and the reproduction vectors have also been devised, and have been shown to perform better than the LBG and SOM algorithms [8,9].
In signal processing over networks, data are usually collected/stored at different nodes, and data from all nodes are needed to make use of the overall information. In traditional signal processing algorithms, we need to transmit all data to one powerful processing center, which is sometimes infeasible for distributed applications, especially when the data amounts are very large. The reasons are that nodes over networks are usually limited in communication resources and sometimes in power (in the case of wireless sensor networks), and transmitting massive original data will consume large power/communication resources and also introduce a risk of data-privacy leakage. Furthermore, when the data amount is extremely large, the center node may not be able to process the whole data set efficiently. Therefore, distributed signal processing algorithms, which take these limitations into consideration, are needed in such cases. Many distributed algorithms have been proposed in recent years [10], such as distributed parameter estimation [11-20], distributed Kalman filtering [21,22], distributed detection [23,24], distributed clustering [25,26], and distributed information-theoretic learning [27,28]. In the majority of these distributed algorithms, signal processing tasks are accomplished at each node based on local computation, local data, and limited information exchange among neighboring nodes. During the processing, nodes only transmit the necessary information to their neighbors instead of transmitting all original data to one processing center, so as to reduce the communication complexity, protect data privacy, and, in the meantime, provide better flexibility and robustness to node/link failures. Figure 1 gives a brief sketch of the discussed distributed processing mechanism. In terms of distributed vector quantization, the LBG and SOM algorithms have been successfully extended to the distributed case in [29], and the proposed distributed LBG and SOM algorithms achieve performances close to those of the corresponding centralized LBG and SOM algorithms, respectively. Since simulation results in the literature on centralized vector quantization have shown that algorithms based on the C-S divergence and the K-L divergence can achieve better performances than the LBG and SOM algorithms [8,9], it is natural to develop divergence-based vector quantization algorithms for distributed processing. However, the existing divergence-based vector quantization algorithms [8,9] cannot be directly or easily extended to the distributed case, due to the lack of data samples for estimating the global data distribution at each individual node (details are provided in the following section).
In this paper, we develop a distributed divergence-based vector quantization algorithm that can solve a global vector quantization problem without transmitting original data among nodes. We first consider the centralized case. Taking the limitations of distributed cases into account, we define the objective function as the K-L divergence between the distribution of the global original data and the distribution of the global reproduction vectors, and then use the Robbins-Monro (R-M) stochastic approximation method [30,31] to efficiently solve the divergence-minimization problem online. We show that the iterative solution obtained for the centralized case can be easily extended to distributed cases by introducing diffusion cooperation among nodes, a frequently used technique in distributed processing [11,12]. Under diffusion cooperation, each node cooperatively estimates the reproduction vectors with its neighbors by exchanging intermediate estimates rather than transmitting original data. Simulations show that the local estimates obtained at different nodes are quite consistent. Besides, the performance of the distributed algorithm is very close to that of the corresponding centralized algorithm.

Starting from the Centralized Case
Mathematically, a vector quantizer is a mapping, q, from a K-dimensional input vector, x(n) = (x_0(n), ..., x_{K-1}(n)), n = 1, ..., N, to a reproduction vector q(x(n)), where the reproduction alphabet contains M reproduction vectors, {m_i, i = 1, ..., M}. The most important issue in quantization is how to preserve the fidelity of the data as much as possible with a limited number of reproduction vectors. Information theory provides natural measures for evaluating the fidelity of data in quantization. Divergences can be used to measure the degree of match between the distribution of the original data, p(x), and the distribution of the reproduction vectors; low divergence values indicate that the original data can be well represented by the reproduction vectors. Compared with the traditional fidelity measure, the sum/mean of squared distortion errors, divergences go beyond the second-order moment and evaluate the data fidelity from a more holistic perspective of the whole distributions. It is expected that divergence-based vector quantization has some advantages over squared-distortion-error-based quantization when the distribution of the distortion error is not Gaussian.
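As a concrete illustration of the mapping q described above, the following is a minimal nearest-neighbor quantizer under the Euclidean distortion measure; the function name and the example codebook values are illustrative, not taken from the paper.

```python
import numpy as np

def quantize(x, codebook):
    """Map a K-dimensional input vector x to the nearest reproduction
    vector m_i in the codebook (Euclidean distortion measure)."""
    codebook = np.asarray(codebook)
    i = np.argmin(np.linalg.norm(codebook - x, axis=1))
    return codebook[i]

# Example: M = 3 reproduction vectors in K = 2 dimensions
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
print(quantize(np.array([0.9, 1.2]), codebook))  # nearest is [1.0, 1.0]
```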
There are various kinds of divergences [32], each with unique characteristics. In [8], the authors studied centralized vector quantization based on the C-S divergence; in [9], centralized vector quantization based on the K-L divergence. In their methods, the Parzen window method is employed to estimate the distribution of the original data, p(x), based on all the data samples, as well as the distribution of the reproduction vectors. However, in distributed cases, the input data samples are dispersed over the whole network and it is usually impractical to gather them together. Thus, their methods may not be easily extended to the distributed setting.
In this paper, taking the limitations of distributed cases into consideration, we propose to perform the vector quantization by minimizing the K-L divergence between the distribution of the original data, p(x), and the distribution of the reproduction vectors. Following the approach of [8], we use the Parzen window method to estimate the distribution of the reproduction vectors, where κ_Θ(·) is a kernel with parameters Θ (the choice of kernel is discussed below).
Given the above estimator, the objective function is written as follows. Note that the divergence D is a function of the reproduction vectors, {m_i, i = 1, ..., M}, so the problem becomes the choice of locations for the reproduction vectors in the original data domain. Naturally, we can minimize D with respect to each m_i. The resulting equation depends on the global original-data distribution, p(x), which is unknown in advance and, more importantly, is hard to estimate in distributed cases. Fortunately, as seen from Formula (3), with the use of the K-L divergence the partial derivative is a mathematical expectation over p(x). For such a situation, we can employ the Robbins-Monro stochastic approximation method [30,31] to solve the equation effectively. The R-M method is an online iterative algorithm which directly solves this kind of equation using one data sample per iteration, without estimating the full data distribution. Thus, it avoids the difficulty of estimating the global data distribution, which is especially severe in distributed cases. This is the reason that we use the K-L divergence rather than the C-S divergence [8] to design our objective function in this paper. The iterative solution to our problem given by the R-M method is simple, where α(n) is the learning step-size. In this paper we use a monotonically decreasing step-size, α(n) = α_2 exp(-n/α_1), where α_1, α_2 are adjustable parameters.

Remark 1 (the choice of kernel): The most commonly used kernel κ_Θ(·) in the Parzen window estimation method is the Gaussian kernel with parameters Θ_i = {m_i, Σ_i}, where m_i is the mean vector and Σ_i is the covariance matrix. Inspired by works on robust mixture modeling using the t-distribution [33,34], in this paper we also introduce the multi-dimensional Student's t-distribution as a heavy-tailed alternative kernel, with mean vector m_i, precision matrix Σ_i, and degrees of freedom ν_i. If ν_i > 2, the covariance matrix of the distribution is ν_i(ν_i - 2)^{-1} Σ_i. As ν_i tends to infinity, the t-distribution converges to the Gaussian distribution. Equipped with this heavy-tailed kernel, the vector quantization is expected to be more robust to outliers.
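The construction above can be sketched as follows, writing q̂(x) for the Parzen estimate of the reproduction-vector distribution; this is a reconstruction under standard conventions, and the exact normalization and indexing of the paper's own display equations may differ.

```latex
% Parzen estimate of the reproduction-vector distribution
\hat{q}(x) \;=\; \frac{1}{M} \sum_{i=1}^{M} \kappa_{\Theta_i}(x - m_i)

% K-L divergence objective (H(p) does not depend on the m_i)
D\bigl(p \,\|\, \hat{q}\bigr)
  \;=\; \int p(x) \log \frac{p(x)}{\hat{q}(x)} \, dx
  \;=\; -\,\mathbb{E}_{p}\!\left[\log \hat{q}(x)\right] \;-\; H(p)

% The gradient is an expectation over p(x), amenable to Robbins-Monro
\frac{\partial D}{\partial m_i}
  \;=\; -\,\frac{1}{M}\,\mathbb{E}_{p}\!\left[
      \frac{1}{\hat{q}(x)}\,
      \frac{\partial \kappa_{\Theta_i}(x - m_i)}{\partial m_i}\right]

% Robbins-Monro online update using one sample x(n) per step
m_i(n+1) \;=\; m_i(n) \;+\; \alpha(n)\,
  \frac{\partial \log \hat{q}\bigl(x(n)\bigr)}{\partial m_i},
\qquad \alpha(n) = \alpha_2 \exp(-n/\alpha_1)
```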

Extended to Distributed Cases
When it comes to distributed cases, e.g., sensor networks, massive data are collected/stored by each sensor, and data from all sensors are needed in the quantization task to make use of the overall data information. However, due to the limited power and communication resources of the nodes, transmitting large amounts of data to a processing center might be a heavy burden. In the following, we show that our iterative solution obtained for the centralized case can be easily extended to distributed cases.
We consider a general network modeled as a connected graph with no isolated nodes. Each node is connected to several nodes, called its neighbors, and communicates only with these one-hop neighbors. In this setting, we introduce diffusion cooperation among the nodes to develop the corresponding distributed vector quantization algorithm. The distributed estimation algorithm with diffusion cooperation consists of two steps: local updating, and fusion based on information exchange.
Specifically, each node j first uses part of its own input vectors to iteratively estimate its local reproduction vectors. After this updating, each node sends its local estimates to its neighbors. Then each node combines the information from its neighbors to obtain fused estimates of the reproduction vectors, where B_j is the neighbor set of node j, and {c_jl} are combination coefficients satisfying Σ_{l∈B_j} c_jl = 1 and c_jl = 0 if l ∉ B_j. As seen from (7) and (8), although the objective is to minimize the divergence between the global data distribution p(x) and the distribution of the reproduction vectors, we do not need to know (or estimate) p(x), which is nontrivial in a distributed environment; each node needs only its own data in Equation (7). Each node repeats the above process until its fused estimates converge. Specifically, a node enters the OFF state when the maximum change of its fused estimates over a period of iterations falls below a threshold. OFF nodes stop computation and communication, and their neighbors use the last results transmitted by the OFF nodes to continue updating and fusion. The algorithm ends when all nodes are OFF. As the diffusion cooperation proceeds, local data information is diffused over the whole network without transmitting the original data. Finally, all nodes obtain consistent local estimates of the reproduction vectors based on the global data information.
For clarity, our distributed vector quantization algorithm based on diffusion cooperation is summarized as follows.
Initialization: Initialize the threshold value, the kernel parameters, and the reproduction vectors for each node.
Computation: Each node performs the following process until the termination rule is satisfied.
1. Use part of the node's local input vectors to iteratively update the estimates via Equation (7).
2. Transmit the local estimation results to the neighbors.
3. Fuse the results from the neighbors to obtain fused estimates of the reproduction vectors via Equation (8).
Termination rule: The algorithm ends when the states of all nodes are OFF.
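The updating-fusion loop above can be sketched as follows. This is an illustrative simulation on a toy ring network, not the paper's exact Equations (7) and (8): the local update is a soft Robbins-Monro step under a Gaussian kernel with a fixed, assumed bandwidth, the combination weights are uniform over the closed neighborhood rather than Metropolis, and the OFF-state termination is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(m, x_batch, alpha, sigma2=0.5):
    """One Robbins-Monro pass over a small batch: move each reproduction
    vector along a kernel-weighted direction toward the sample.
    Illustrative stand-in for the paper's Equation (7)."""
    for x in x_batch:
        d = x - m                                    # (M, K) differences
        w = np.exp(-0.5 * np.sum(d**2, axis=1) / sigma2)
        w /= w.sum() + 1e-12                         # soft responsibilities
        m = m + alpha * w[:, None] * d
    return m

def fuse(estimates, weights):
    """Diffusion fusion: convex combination of neighbor estimates
    (stand-in for the paper's Equation (8))."""
    return sum(c * m for c, m in zip(weights, estimates))

# Toy ring network of J = 4 nodes, M = 2 reproduction vectors in 2-D.
# Nodes hold unbalanced local data (clusters near 0 or near 3).
J, M, K = 4, 2, 2
data = [rng.normal(loc=(j % 2) * 3.0, size=(50, K)) for j in range(J)]
est = [rng.normal(size=(M, K)) for _ in range(J)]
neighbors = [(j, (j - 1) % J, (j + 1) % J) for j in range(J)]

for n in range(200):
    alpha = 0.5 * np.exp(-n / 50.0)                  # decreasing step-size
    new = [local_update(est[j], data[j][rng.choice(50, 10)], alpha)
           for j in range(J)]
    est = [fuse([new[l] for l in neighbors[j]], [1 / 3] * 3)
           for j in range(J)]
```

After the loop, the local codebooks of the four nodes are driven toward agreement by the repeated fusion step, even though each node only ever sees its own unbalanced data.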
In [35,36], the authors provided a detailed convergence analysis for a general class of distributed R-M-based algorithms. They proved that, in a connected network and under a set of explicit assumptions, distributed R-M-based algorithms converge to a point that is consistent across nodes and is a critical point of the corresponding objective function. Their results are fully applicable to our case when the threshold value is sufficiently close to zero. In practice, we usually use a small positive threshold value (large threshold values are not suggested, since the algorithm would be forced to stop while the estimates are still far from convergence) to make sure that the algorithm stops in a finite number of steps. In such a case, the different nodes may not converge to a strictly consistent point. Intuitively, a larger threshold value makes the algorithm stop in fewer steps, but it also enlarges the inconsistency degree of the local estimates at different nodes. In the following simulations, we study the effects of the threshold value on the convergence in detail.

Communication Complexity Analysis
In this subsection, we analyze the communication complexity of our distributed algorithm. In each iteration loop, each node transmits M local estimates of the reproduction vectors to its neighbors. Let |B_j| denote the number of neighbors of node j; then the communication complexity for one node in one iteration loop is O(|B_j| M). Let T_j denote the number of iterations executed by the node; then the total communication complexity for the node is O(|B_j| T_j M). On the other hand, the traditional centralized algorithm needs to gather all input data at a processing center. Let N_j denote the number of input vectors of node j; then the communication complexity is O(N_j H_j), where H_j is the number of hops from the node to the central processor. Since the number of input vectors is usually very large compared with the other quantities, the proposed distributed algorithm can significantly reduce the communication complexity in such cases.
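The comparison can be made concrete with a quick calculation, using hypothetical values for the quantities in the text (the numbers below are illustrative, not from the paper's experiments):

```python
# Per-node communication cost, in units of vectors transmitted.
B_j = 3       # number of neighbors |B_j|
T_j = 150     # iteration loops until the node turns OFF
M   = 10      # number of reproduction vectors
N_j = 10**5   # local input vectors at node j
H_j = 4       # hops from node j to the processing center

distributed = B_j * T_j * M    # O(|B_j| T_j M)
centralized = N_j * H_j        # O(N_j H_j)
print(distributed, centralized)  # 4500 vs 400000
```

Even for modest network sizes, the centralized cost scales with the raw data volume N_j, while the distributed cost depends only on the codebook size and the number of iterations.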

Numerical Experiments
We study the performance of our proposed algorithms by simulations in this section. We denote our distributed vector quantization using the K-L divergence as d-KL and the corresponding centralized vector quantization as c-KL. We employ two types of kernel in the simulations: the Gaussian kernel (g kernel) and the Student's t kernel (t kernel). For comparison, the simulation results obtained by the centralized LBG (c-LBG) and centralized SOM (c-SOM) algorithms are presented. Since the simulation results in [29] show that the performances of the distributed LBG and distributed SOM are very close to those of c-LBG and c-SOM, we do not additionally provide their simulation results. In addition, for the distributed cases, the case without cooperation among nodes is also tested, denoted as nc-KL.

Data Generation and Evaluation Indexes
We use the noisy double-moon data as the synthetic experimental data; they are widely used as a benchmark in signal processing and machine learning [8,28,29]. The noise distribution considered is heavy-tailed, which leads to data samples containing more outliers than under Gaussian noise. Although, according to the analysis in Section 2.3, our d-KL algorithm has advantages over c-KL in communication complexity when the number of data samples is large, for easy implementation of the simulation we set a relatively small number of data samples for an individual node, namely 700 (in each run, for each node, 200 samples are used as training data and the other 500 samples as testing data).
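A generator for such data might look as follows; the geometry parameters and the use of a Student's t distribution for the heavy-tailed noise are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def double_moon(n, radius=1.0, width=0.2, sep=-0.2, df=3.0, rng=None):
    """Generate n samples of noisy double-moon data. Heavy-tailed noise
    is drawn from a Student's t distribution (df degrees of freedom),
    which produces more outliers than Gaussian noise of similar scale."""
    rng = rng or np.random.default_rng(0)
    theta = rng.uniform(0, np.pi, n)          # angle along each half-moon
    upper = rng.random(n) < 0.5               # which moon each sample is on
    x = np.where(upper, radius * np.cos(theta),
                 radius + radius * np.cos(theta))
    y = np.where(upper, radius * np.sin(theta),
                 sep - radius * np.sin(theta))
    pts = np.stack([x, y], axis=1)
    pts += width * rng.standard_t(df, size=(n, 2))   # heavy-tailed noise
    return pts

samples = double_moon(700)   # e.g., 200 training + 500 testing per node
```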
Generally, the performance of a vector quantization algorithm is measured by the average (or total) distortion between the input vectors and their corresponding reproduction vectors, d(x(n), q(x(n))), where d(·,·) is a distortion measure function. We choose the commonly used Euclidean distance as the distortion measure for convenience of comparison. For the distributed algorithms, we evaluate the inconsistency degree of local estimates at different nodes by averaging std_j(m_i^j(k)) over all i and k, where m_i^j(k) stands for the k-th component of the i-th estimated reproduction vector at node j, normalized to the maximum |m_i^j(k)| among all nodes, and the function std_j(·) calculates the standard deviation of the corresponding estimates over all nodes. Besides, to provide more information about the convergence rate, we report the average number of iteration steps of the network, calculated as the average of NoIS_OFF(j) over all nodes j, where NoIS_OFF(j) stands for the number of iteration steps (NoIS) needed before node j becomes OFF.
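The two network-level evaluation indexes can be sketched in code as follows; this is an interpretation of the definitions above (normalization by the per-component maximum across nodes, then averaging the across-node standard deviation over all components), so details may differ from the paper's exact implementation.

```python
import numpy as np

def inconsistency(est):
    """est: array of shape (J, M, K) -- local codebooks of J nodes.
    Normalize each component by its maximum magnitude across nodes,
    then average the across-node standard deviation over all M*K
    components."""
    scale = np.max(np.abs(est), axis=0) + 1e-12   # (M, K) maxima over nodes
    normed = est / scale
    return np.mean(np.std(normed, axis=0))

def avg_iteration_steps(nois_off):
    """Average number of iteration steps before the nodes turn OFF."""
    return np.mean(nois_off)

# J = 3 nodes, M = 1 reproduction vector, K = 2 components
est = np.array([[[1.0, 2.0]], [[1.1, 2.1]], [[0.9, 1.9]]])
print(inconsistency(est))                      # small: nodes nearly agree
print(avg_iteration_steps([120, 140, 130]))    # 130.0
```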

Results
There are some crucial parameters in our proposed algorithms, such as the degree of node unbalance, the threshold value, the number of reproduction vectors, the network structure, etc.In the following simulations, we study the effects of the various parameters on the proposed algorithms in detail.
Firstly, we fix the threshold value at 0.002 and the number of reproduction vectors at 10, and test the performance of the algorithms under different degrees of node unbalance. Here we consider a network composed of 10 nodes, randomly distributed in a region. We let each node connect to its two nearest nodes, and then randomly add long-range connections with a probability of 0.1. We set the linear combination coefficients according to the Metropolis rule [37], an efficient and commonly used rule for designing combination coefficients in distributed processing [11,15]. For the distributed quantization algorithms, we let each node process 20% of its training data in one updating-fusion iteration loop. In Figure 2, we present the quantization error of the algorithms under the different data unbalance levels described above. All results are averages over 20 independent trials. For the distributed algorithms, the quantization errors are averaged over the whole network. Since the total percentages of the two moons are always balanced, node unbalance does not affect the performance of the centralized algorithms; their quantization errors are presented as straight lines. The distortion performances of the c-KLs, with both the g kernel and the t kernel, are better than those of c-LBG and c-SOM. As expected, c-KL with the t kernel slightly outperforms c-KL with the g kernel. The distortions of our distributed algorithm (with diffusion cooperation) are very close to, and sometimes even lower than, those of the corresponding centralized algorithm (similar phenomena have been found in studies on distributed clustering [25,28]). Besides, its performance is hardly affected by the node unbalance. In comparison, the distortion of the non-cooperative distributed algorithm increases quickly with the unbalance degree. Figure 3 shows the learning curves of the first components of the 10 reproduction vectors for the d-KL (t kernel) algorithm under unbalance level 3 in one trial. We see that the local estimated reproduction vectors at different nodes are quite consistent when the algorithm converges.

Next, for the d-KL algorithm, we let the proportion of input data processed at each node per iteration vary from 10% to 100% under unbalance level 3, and study the corresponding quantization error and inconsistency degree of local estimates at different nodes. The results are presented in Figure 4. We see that as the proportion increases, the quantization error of d-KL does not change noticeably, while the inconsistency degree increases slightly. The reason is that, in each iteration loop, although the fusion step still introduces global data information when the proportion is large, the ratio of global information to local information decreases as the (local data) proportion increases, which finally leads to the increasing difference in fused estimates among different nodes. Nevertheless, the overall inconsistency degree is quite low. Besides, the d-KL with the t kernel consistently outperforms the d-KL with the g kernel under the different proportions.

Secondly, we fix the unbalance level at 3 and the proportion of input data processed at each node per iteration at 20%, and let the threshold value vary from 0.0005 to 0.02 to study its effect on the convergence of d-KL (since, in the above simulations, the d-KL with the t kernel consistently outperforms the d-KL with the g kernel, we take the d-KL with the t kernel as the representative in the following simulations). The other parameters are kept the same as above. The corresponding results are shown in Figure 5. We see that, in this range, the quantization error of d-KL stays nearly the same, which indicates that the algorithm converges to similar sets of reproduction vectors. Besides, as the threshold value increases, the average number of iteration steps decreases while the inconsistency degree of the local estimates increases. This phenomenon is reasonable, since a relatively large threshold value allows the nodes to become OFF at a relatively early stage of convergence (when the estimates at the nodes are still varying over a relatively wide range), and thus reduces the average number of iteration steps. Correspondingly, the amount of estimate information exchanged among nodes is reduced, resulting in a larger inconsistency degree.

Thirdly, we study the performance of the algorithms under different numbers of reproduction vectors, ranging from six to 16. The threshold value is set to 0.002 and the other parameters are kept the same as above. The result is shown in Figure 6. We see that the quantization error decreases with an increasing number of reproduction vectors. The distortion performances of c-KL and d-KL are better than those of the other algorithms within the testing range, and the performance of d-KL is very close to that of c-KL. This result shows that the effectiveness of our proposed quantization algorithms is robust to the choice of the number of reproduction vectors.

Fourthly, we study the effect of network structure on the performance of d-KL in detail. We consider networks of different degrees of connectivity and of different topologies. For the former, we generate the 10-node network in a way similar to that used above, except that we let the probability of long-range connections, P_long, vary from 0 (resulting in a regular network) to 1 (resulting in a fully-connected network). For the latter, we generate four types of 40-node networks in a way similar to that employed in [15]: the Watts-Strogatz small-world network (WS), the Barabási-Albert scale-free network (BA), the regular network (RG), and the Erdős-Rényi random network (ER). In the latter case, the total number of connections/links is set to 80. We test the d-KL algorithm on these networks and show the simulation results in Figure 7 and Table 1. From Figure 7 we see that as P_long increases, the quantization error of d-KL does not change much, while both the average number of iteration steps and the inconsistency degree of local estimates monotonically decrease. The reason is that a larger value of P_long increases the connectivity of the network and makes information spread more efficiently. Thus, at each iteration step, an individual node can utilize more information provided by its neighbors to help its local estimation, which makes the algorithm converge faster and the inconsistency degree of local estimates smaller. From Table 1 we see that the ER random network performs best in terms of all three evaluation indexes, especially the inconsistency degree. This is because the ER random network has the shortest average path length among the four types of networks and thus the highest efficiency in information spreading. In contrast, the regular network, which has the longest average path length, yields the worst performance. Similar comparison results have also been observed in the study of distributed least mean squares algorithms [15]. These results give a guideline for the design of the network topology in distributed vector quantization.
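The Metropolis rule used above for the combination coefficients can be sketched as follows, in one common form where a link (j, l) receives weight 1/(1 + max of the two neighbor counts) and the self-weight absorbs the remainder; conventions for counting degree vary slightly across the literature.

```python
import numpy as np

def metropolis_weights(adj):
    """Combination coefficients c_{jl} by the Metropolis rule.
    adj: symmetric 0/1 adjacency matrix (no self-loops).
    Each row of the returned matrix sums to one, so fusion is a
    convex combination over the closed neighborhood."""
    J = adj.shape[0]
    deg = adj.sum(axis=1)                    # number of neighbors per node
    C = np.zeros((J, J))
    for j in range(J):
        for l in range(J):
            if j != l and adj[j, l]:
                C[j, l] = 1.0 / (1.0 + max(deg[j], deg[l]))
        C[j, j] = 1.0 - C[j].sum()           # self-weight absorbs the rest
    return C

# 4-node ring network: every node has degree 2
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]])
C = metropolis_weights(adj)
print(C.sum(axis=1))   # each row sums to 1
```

A useful property of this construction is that the weight matrix is symmetric and doubly stochastic, which is one reason it is popular in diffusion-based distributed algorithms.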

Conclusions
In this paper, we have developed a distributed K-L-based vector quantization algorithm. Only limited intermediate estimates are transmitted rather than the original data; thus, the communication complexity is reduced and data privacy is protected to some extent. Simulation results show that the d-KL algorithm achieves performance similar to that of the corresponding c-KL algorithm. Moreover, both the centralized and distributed K-L-based VQ algorithms show more robustness to outliers than the centralized LBG and SOM algorithms.

Figure 1. A brief sketch of the distributed processing mechanism: (a) Network structure and neighborhood of node j; (b) Local computation at node j.

Figure 2. The distortion performances of vector quantization algorithms under different levels of unbalance.

Figure 3. The learning curves of reproduction vectors for the d-KL (t kernel) algorithm under unbalance level 3.

Figure 4. The quantization error and inconsistency degree of the d-KL algorithm under different settings of data proportion.

Figure 5. Performances of the d-KL algorithm under different threshold values. (a) Quantization error; (b) Average number of iteration steps and inconsistency degree.

Figure 6. The distortion performances of vector quantization algorithms under different numbers of reproduction vectors.

Figure 7. Performances of the d-KL algorithm under different values of P_long. (a) Quantization error; (b) Average number of iteration steps and inconsistency degree.

Table 1. Performances of the d-KL algorithm under networks of different topologies.