Distributed Density Estimation Based on a Mixture of Factor Analyzers in a Sensor Network

Distributed density estimation in sensor networks has received much attention due to its broad applicability. When encountering high-dimensional observations, a mixture of factor analyzers (MFA) is taken to replace mixture of Gaussians for describing the distributions of observations. In this paper, we study distributed density estimation based on a mixture of factor analyzers. Existing estimation algorithms of the MFA are for the centralized case, which are not suitable for distributed processing in sensor networks. We present distributed density estimation algorithms for the MFA and its extension, the mixture of Student’s t-factor analyzers (MtFA). We first define an objective function as the linear combination of local log-likelihoods. Then, we give the derivation process of the distributed estimation algorithms for the MFA and MtFA in details, respectively. In these algorithms, the local sufficient statistics (LSS) are calculated at first and diffused. Then, each node performs a linear combination of the received LSS from nodes in its neighborhood to obtain the combined sufficient statistics (CSS). Parameters of the MFA and the MtFA can be obtained by using the CSS. Finally, we evaluate the performance of these algorithms by numerical simulations and application example. Experimental results validate the promising performance of the proposed algorithms.

Abstract: Distributed density estimation in sensor networks has received much attention due to its broad applicability. When encountering high-dimensional observations, a mixture of factor analyzers (MFA) is taken to replace mixture of Gaussians for describing the distributions of observations. In this paper, we study distributed density estimation based on a mixture of factor analyzers. Existing estimation algorithms of the MFA are for the centralized case, which are not suitable for distributed processing in sensor networks. We present distributed density estimation algorithms for the MFA and its extension, the mixture of Student's t-factor analyzers (MtFA). We first define an objective function as the linear combination of local log-likelihoods. Then, we give the derivation process of the distributed estimation algorithms for the MFA and MtFA in details, respectively. In these algorithms, the local sufficient statistics (LSS) are calculated at first and diffused. Then, each node performs a linear combination of the received LSS from nodes in its neighborhood to obtain the combined sufficient statistics (CSS). Parameters of the MFA and the MtFA can be obtained by using the CSS. Finally, we evaluate the performance of these algorithms by numerical simulations and application example. Experimental results validate the promising performance of the proposed algorithms.
Keywords: distributed density estimation; mixture of factor analyzers; mixture of Student's t-factor analyzers; sensor network

Introduction
Sensor networks are composed of tiny, intelligent sensor nodes that are deployed over a geographic region. This type of network has a broad range of applications, such as environmental monitoring, precision agriculture and military surveillance [1][2][3]. Distributed estimation over sensor networks is to estimate some parameters of interest through local computation and information exchange among neighbor nodes. Compared to centralized estimation, it does not need to send observations collected by all of the sensors to a powerful central node, so the complexity and resource consumption can be reduced. Furthermore, distributed estimation is more flexible and robust to node and/or link failure [2,4]. Recently, many distributed estimation algorithms have been proposed, such as distributed LMS [5], distributed recursive least squares (RLS) [6], distributed source location [7], distributed power allocation [8], distributed sparse estimation [9,10], distributed information theoretic learning [11] and distributed Gaussian process regression [12].
Mixture of Gaussians (GMM) is a flexible and powerful probabilistic modeling tool for density estimation. It has been used in several areas, such as pattern recognition, computer vision, signal and image analysis and machine learning. When estimating parameters in the GMM by the maximum likelihood criterion, the expectation maximization (EM) algorithm [13,14] is usually adopted. It iteratively performs the expectation step (E-step) to calculate the conditional expectations of unobserved/hidden variables and runs the maximization step (M-step) to estimate parameters of data distributions based on the result of the E-step. However, when the dimension of observations is high, the fitting performance of the GMM deteriorates or even the associated EM algorithm cannot work [15]. The main reason is that the GMM cannot realize dimensionality reduction, which is to compress highly-correlated components of observations. In this case, a mixture of factor analyzers (MFA) [16,17] can be considered. The MFA combines local factor analysis in a form of a finite mixture. As factor analysis can describe variability among high-dimensional observations in terms of potentially low-dimensional latent factors, the MFA can carry out dimensionality reduction simultaneously when finishing specific tasks. Moreover, in order to process the non-normality of data or outliers, normal distributions in the MFA can be replaced by Student's t-distributions, obtaining the mixture of Student's t-factor analyzers (MtFA) [18,19]. Therefore, the MFA and its extension MtFA are effective tools for processing high-dimensional observations [20]. They have been successfully applied in the domains of signal processing [21,22], bioinformatics [23,24] and other applied fields.
In sensor networks, GMM has been introduced for density estimation of observations [25][26][27][28][29][30][31][32]. The estimation process for the GMM needs to be realized by distributed EM algorithms. According to the way by which nodes communicate with each other, distributed EM algorithms can be classified into the incremental type [25,26], the consensus type [27] and the diffusion type [28][29][30][31][32]. In the incremental scheme [25,26], a long way from the first node to last node of the pre-selected path is needed. When any node along the path fails, reliability problem may happen. In the consensus-based distributed EM algorithm for the GMM [27], a consensus filter, by which global statistics for each node is achieved, is carried out between the E-step and the M-step at each iteration. The objective is to obtain the same estimations for all nodes at each iteration. In the diffusion type of distributed estimation for the GMM, each node exchanges information only with its neighbors through a diffusion cooperative protocol. Good performance is obtained while communication overhead is kept low [2,4]. In this paper, we focus on a diffusion type of distributed estimation. Among previous studies, a distributed model order and parameter estimation algorithm for the GMM was proposed in [28]. Moreover, algorithm performance was analyzed. In [29], a diffusion-based EM algorithm was presented for distributed estimation in unreliable sensor networks. In this scenario, some nodes may be subject to data failures and report only noise. The aim of the algorithm was to achieve the optimal performance within the whole range of SNRs. In [30], information diffusion and averaging were considered and performed simultaneously. In [31], an adaptive diffusion scheme was proposed. In [32], the performance of the diffusion-based EM algorithm was analyzed. It could be considered as a stochastic approximation method [33] to find the maximum likelihood estimation of the GMM.
As the MFA can handle high-dimensional observations, which are also usually encountered in sensor networks, in this paper, we propose distributed density estimation algorithms for the MFA and its extension MtFA. We represent these two algorithms as D-MFA and D-MtFA, respectively. Specially, for each node in the sensor network, we define an objective function as the linear combination of local log-likelihoods, whose combination weights are determined by the number of observations in the corresponding neighbor nodes. After local sufficient statistics are computed, the current node calculates its combined sufficient statistics by a linear weighed combination of these local sufficient statistics from nodes in its neighborhood set. Finally, parameters of the MFA and MtFA are updated using the combined sufficient statistics. Apart from the distributed processing of the MFA and the MtFA in this paper, there are two other differences from the existing algorithms. First, in the relevant algorithms [25,27,28], mixing proportions in the GMM are different at each node, whereas means and covariances are the same throughout the network. On the contrary, in this paper, all of the parameters in the MFA or the MtFA are the same throughout the network. Using this design, distributed clustering and classification can be done in arbitrary nodes after the estimation process finishes. Second, for each node, the objective function is directly defined. The combination of weights in the objective function is effectively designed.
The rest of this paper is organized as follows. In Section 2, a brief overview of the MFA and the MtFA are provided. In Section 3, the D-MFA algorithm and D-MtFA algorithm are formulated. In Section 4, numerical simulations for the synthetic observations are performed to illustrate the effectiveness and advantages of the proposed algorithms. Moreover, the application of these algorithms to distributed clustering is also presented. Finally, conclusions are drawn in Section 5.
The acronyms mentioned in this paper are listed in the following.

Mixture of Factor Analyzers
Let the observed dataset be Y = {y 1 , ..., y N }. In the MFA, it assumes that each p-dimensional data vector y n is generated as: where I is the number of mixing components. The corresponding q-dimensional (q < p) factor u n ∼ N (u n |0, I q ) is independent of the e ni ∼ N (e ni |0, D i ), where D i is a p × p diagonal matrix. The parameter µ i is the mean of the i-th analyzer, and A i (p × q) is the linear transformation known as the factor loading matrix. The so-called mixing proportions π i (i = 1, ..., I) are nonnegative and sum to one. The standard EM algorithm for the MFA is given in [15,16].

Mixture of Student's t-Factor Analyzers
Since the MFA adopts the normal family for the distributions of the errors and the latent factors, it is sensitive to outliers. An obvious way to improve the robustness of this model for observations having longer tails than normal is using the t-family of elliptically-symmetric distributions. Therefore, the MtFA has been proposed in [18]. In the MtFA, it assumes that p-dimensional data vector y n is generated in the same way as that in the MFA, as shown in Equation (1). However, the distributions of q-dimensional (q < p) factor u n and noise e ni are t(u n |0, I q , ν i ) and t(e ni |0, D i , ν i ), respectively. In the above Student's t distributions, ν i is called the degree of freedom that controls the length of the tails of the distributions. With this modification, the MtFA is more robust to outliers and can process the non-normality of observations in a better way [23]. In essence, t(u n |0, I q , ν i ) and t(e ni |0, D i , ν i ) can be respectively regarded as average Gaussian scale distributions N (u n |0, I q /w ni ) and N (e ni |0, D i /w ni ) with the Gamma distributed precision scalar w ni , that is: where G(·) denotes the Gamma distribution. The standard EM algorithm for the MtFA is given in [14,18].

Network Model and Objective Function
Consider a sensor network with M nodes.
The m-th node has N m data observations Y m = {y m,n } n=1,...,Nm (m = 1, ..., M ), and y m,n denotes the n-th observation in node m. The distribution of each p-dimensional observation y m,n is modeled by the MFA, defined in Equation (1). It is noted that the factor associated with y m,n here is represented as u m,n . The parameter set of the MFA The network topology is described by a graph. Let W denote the distance that a node can communicate via wireless radio links. Nodes m and l are connected if the Euclidean distance d m,l between m and l is less than or equal to W . Moreover, a graph is connected if for any pair of nodes (m, n), there exists a path from m to n. The neighborhood set of node m, denoted by R m , is defined as the one-hop neighbors of node m (including m itself). For example, in Figure 1, the dashed circle represents the neighborhood set of node m, containing Node 1, Node 2, node l and node m itself. In order to design the D-MFA and the D-MtFA algorithms, the objective functions should be carefully specified at first. Here, we take node m in the sensor network for example. We define the objective function F m (Θ) of the MFA at node m as a linear combination of local log-likelihoods log p(Y l |Θ) associated with nodes l in its neighborhood R m .
where {c lm } l∈Rm are some non-negative combination coefficients satisfying the condition It should be emphasized that when defining F m (Θ), we consider two important factors. First, as node m can only communicate with its neighbors, it is reasonable to define F m (Θ) as a combination of local log-likelihoods log p(Y l |Θ), (l ∈ R m ). When estimating Θ, node m can make use of the information from nodes in R m . Due to the effect of information diffusion, each node can obtain all information directly or indirectly from other nodes. Second, the contributions of different local log-likelihoods log p(Y l |Θ), (l ∈ R m ) for the estimation of Θ at node m may also be different. The combination coefficient c lm weights the importance of information flow from the node l (l ∈ R m ). Therefore, how to choose c lm is important. Here, we adopt a simple, but effective mechanism that c lm is determined by: If node l has a larger number of observations N l , the information from this node makes a larger contribution to obtain more accurate parameter estimations. Therefore, a larger combination coefficient c lm in Equation (3) can further make this contribution prominent. In the future, a more effective implementation, such as an adaptive strategy [4], can be considered to determine these combination coefficients better.

Distributed Density Estimation Algorithm for the MFA
After the objective functions F m (Θ) (m = 1, ..., M ) have been determined, the next task is to estimate parameters Θ in the MFA by maximizing F m (Θ). For node l, an I-dimensional binary latent variable z l,n , which is associated with y l,n , is introduced. As the MFA is a mixture model, z l,n,i (= 1) denotes that y l,n belongs to the i-th component of the MFA. The latent variables for node l in the neighborhood R m are U l = {u l,n } n=1,...,N l , Z l = {z l,n } n=1,...,N l , (l ∈ R m ). Now, F m (Θ) (m = 1, ..., M ) in Equation (2) can be expressed as: The three conditional probabilities in Equation (4) are: Here, we derive the distributed estimation algorithm with the aid of the standard EM algorithm [13]. First, we introduce two distributions q(Z l ) and q(U l |Z l ) defined over the latent variables. For any choice of q(Z l ) and q(U l |Z l ), the following decomposition holds: where: The verification of log-likelihood decomposition can be found in [34]. As F m (Θ) as a combination of local log-likelihoods, the whole decomposition can also be expressed by a combination of local log-likelihood decompositions, as shown in Equation (5).
Moreover, KL(q l ||p l ) in Equation (5) is the Kullback-Leibler divergence between q(Z l )q(U l |Z l ) and p(Z l |Y l , Θ)p(U l |Z l , Y l , Θ), which satisfies KL(q l ||p l ) ≥ 0. Therefore, it can be seen from Equation (5) that l∈Rm c lm · L(q l , Θ) ≤ F m (Θ). In other words, l∈Rm c lm · L(q l , Θ) is a lower bound on F m (Θ). As direct maximization of the F m (Θ) is difficult, it can be solved by the maximization of this lower bound instead.
Suppose that the parameters estimated in the last iteration are ..,I . In the first stage, the lower bound l∈Rm c lm · L(q l , Θ old ) is maximized with respect to q(Z l )q(U l |Z l ) while holding Θ old fixed. From Equation (5), this maximum can be achieved when KL(q l ||p l ) = 0. In other words, q(Z l )q(U l |Z l ) = p(Z l |Y l , Θ old )p(U l |Z l , Y l , Θ old ). Therefore, two conditional distributions, p(Z l |Y l , Θ old ) and p(U l |Z l , Y l , Θ old ), should be computed.
Concretely, for node l (l ∈ R m ), p(Z l |Y l , Θ old ) can be calculated by: where: Moreover, p(U l |Z l , Y l , Θ old ) should also be obtained by: The mean u m,n,i and covariance Ω i are: where: is an intermediate variable introduced to simplify expressions in the following steps.
When the above two conditional distributions have been obtained, q(Z l )q(U l |Z l ) is determined and held fixed, and the lower bound l∈Rm c lm ·L(q l , Θ) is maximized with respect to Θ to get new estimated Θ new . This will cause the lower bound to increase, which will necessarily cause the corresponding F m (Θ) to increase.
Concretely, the current lower bound is expressed as: Discard the second logarithmic term unrelated to Θ in Equation (10); the objective function represented by Q m (Θ) at node m is: p(z l,n,i |y l,n )p(u l,n |y l,n , z l,n,i ) × z l,n,i log π i N (u l,n |0, I q )N (y l,n |µ i + A i u l,n , D i ) Now, parameters in Θ can be obtained by taking derivation of Q m (Θ) with respect to Θ. First, π i and µ i are updated to: and: respectively. Subsequently, by performing derivation of Q m (Θ) with respect to A i , we have: Finally, the expression of D i can be obtained in the same way, where diag{·} denotes the operator setting off-diagonal terms to zero. In Equations (11)- (14), z l,n,i is the expectation of z l,n,i given by Equation (6). u l,n and u l,n u T l,n can be obtained from Equation (7), which are: u l,n = u l,n,i and u l,n u T l,n = Ω i + u l,n,i u T l,n,i respectively. Substituting Equation (15) into Equations (13) and (14), we have: and: where: From Equations (11), (12) and (16)- (18), we can see, when estimating parameters Θ at node m, that three combined sufficient statistics (CSS) must be obtained, represented as: where:

LSS
(1) ..,I are local sufficient statistics (LSS) of node l. Therefore, CSS in node m is a linear combination of the LSS of nodes in R m . If node l has a large number of observations, the accuracy of calculated LSS l should be high and should make an important contribution to the CSS of node m. A relatively large c lm in Equation (3) can make this contribution prominent, obtaining accurate estimation of Θ.
In the following, we summarize the realization process of the D-MFA algorithm.
Step 1 (Initialization): This initializes the parameters ..,I . Each node l broadcasts the number of its observations N l to its neighbors. When receiving this information, each node calculates the combination coefficient by Equation (3).
Step 2 (Computation): Each node l in the sensor network computes z l,n,i , Ω i and g i by Equations (6), (8) and (9), respectively. Then, it computes three local sufficient statistics ..,I according to its own observations Y l , by Equation (20).
Step 3 (Diffusion): Each node l in sensor networks diffuses its local sufficient statistics LSS l , as shown in Figure 1.
Step Step 5 (Estimation): Node m (m = 1, ..., M ) estimates π i , µ i , A i and D i according to Equations (11), (12), (16) and (17), respectively. Here, we substitute Equation (19) into Equations (11), (12) and (18), reformulating the estimation step as follows: where: Step 6 (Termination): Node m (m = 1, ..., M ) calculates its current local log-likelihood as: where superscript "new" denotes the newly estimated parameters at the current iteration. If log p(Y m |Θ new ) − log p(Y m |Θ old ) < , node m enters the terminated state; else, go to Step 2 and start the next iteration. It is noted that the terminated nodes do no computation or communication in the following iterations. If one node cannot receive information from a neighbor node in the next iteration, the node will use the received and saved LSS information from that neighbor node at the last iteration when updating CSS. When there is no message communication or information exchange in the network, implying all nodes reach the terminated state, the algorithm ends.

Distributed Density Estimation Algorithm for the MtFA
Compared to the MFA, the main difference of the MtFA is that it has an additional degree of freedom parameter ν i (i = 1, ..., I). Therefore, the parameter set of the MtFA to be estimated is Moreover, apart from Z l and U l , the latent variable W l = {w l,n,i } n=1,...,N l i=1,...,I should be introduced, explained in Section 2.2. Similarly, for node m in a sensor network, a linear combination of local log-likelihoods associated with nodes in R m is defined as: The derivation process of the D-MtFA algorithm is similar to that of the D-MFA, except that in Step 2, the posterior distribution of p(w m,n,i |y m,n , z m,n,i ) and p(u m,n |y m,n , z m,n,i , w m,n,i ) should be computed, and in Step 5, ν i needs to be estimated. We put this derivation in the Appendix in detail and directly describe the D-MtFA algorithm here.
Step 1 (Initialization): This initializes the values of the parameters ..,I . Each node l broadcasts the number of its observations N l to its neighbors. When receiving this information, each node calculates the combination coefficient by Equation (3).
Step 2 (Computation): Each node l in the sensor network computes five local sufficient statistics LSS l = {LSS The expressions of expectations z l,n,i and w l,n,i in Equation (22) are given in the Appendix. Moreover, the intermediate variables Ω i and g i should also be prepared for simplifying expressions in Step 5, shown in the Appendix.
Step 3 (Diffusion): Each node l in the sensor network diffuses its local sufficient statistics LSS m , as shown in Figure 1.
Step 4 (Combination): When node m (m = 1, ..., M ) receives the local sufficient statistics from all of its one-hop neighbor nodes l (l ∈ R m ), it calculates the combined sufficient statistics, shown as: Step 5 (Estimation): Node m (m = 1, ..., M ) estimates the parameters of the MtFA: where: In addition, ν i is updated by solving the following equation: where ψ(·) is the digamma function and ν old i is the value of ν i on the last iteration of this algorithm. Equation (29) can be solved by some numerical methods, i.e., the Newton method.
Step 6 (Termination): Node m (m = 1, ..., M ) calculates its current local log-likelihood log p(Y l |Θ new ), expressed in Equation (21). The superscript "new" denotes the newly-estimated parameters at the current iteration. The termination condition of the algorithm is the same as that in the D-MFA algorithm.

Synthetic Data
In this subsection, we test the performances of the proposed algorithms on synthetic data. Here, we consider a sensor network composed of 100 nodes to evaluate the estimation performance of the proposed algorithms. Nodes are randomly placed in a square of 5 × 5. The communication distance is taken as 0.8. In this setting, the connected graph reflecting network topology is shown in Figure 2. In the first 30 nodes (Node 1-Node 30), each node has 80 observations. In the next 40 nodes (Node 31-Node 70), each node contains 100 observations. In the last 30 nodes (Node 71-Node 100), each node has 120 observations. All of the 10-dimensional observations in the 100 nodes are assumed to be generated from three-component Gaussian mixtures. The parameters are as follows: (π 1 , π 2 , π 3 ) = (0.3, 0.5, 0.2); µ 1 = (3 3 3 3 3 0 0 0 0 0), µ 2 = (0 0 0 0 0 0 0 0 0 0), (1 1 1 1 1 0.1 0.1 0.1 0.1 0.1), We adopt several models to represent the distributions of these observations, and the task is to estimate parameters in the models. Here, we compare the performance of four schemes. In the first scheme, the standard EM algorithm for the MFA (S-MFA) is implemented in a centralized unit using all observations from 100 nodes. In the second scheme, the D-MFA algorithm proposed in Section 3.2 performs simultaneously in all nodes. In the third scheme, the EM algorithm for the MFA runs in each node using only local observations of that node. In other words, there is no information exchange among nodes. We abbreviate it as non-cooperation MFA (NC-MFA) for description convenience. In the last scheme, the distributed EM algorithm for the GMM (D-GMM) is implemented. In the D-GMM, the objective function is similar to that of the proposed D-MFA, except that MFA is replaced by GMM. It should be emphasized that the centralized unit is assumed to be always reliable in the S-MFA under ideal conditions. However, this condition is not always fulfilled when the centralized unit fails. Therefore, the S-MFA is seldom adopted in sensor networks. The aim is to test whether the estimation performance of the D-MFA can approach that of the S-MFA.
In the initialization of these MFA schemes, the dimension of factors are set to five. (π 0 1 , π 0 2 , π 0 3 ) = (1/3, 1/3, 1/3), {µ 0 1 , µ 0 2 , µ 0 3 } are set as randomly-selected observations in the those nodes. The initial elements in {D 0 1 , D 0 2 , D 0 3 } and {A 0 1 , A 0 2 , A 0 3 } are generated by standardized normal distributions. In order to make the estimation results visible, the principal component analysis is performed for observations, obtaining the two largest eigenvalues and the associated eigenvectors. Then, the observations, the estimated means µ i (i = 1, 2, 3) and the covariances Σ i (Σ i = A i A T i + D i ) after the termination of the algorithms can be projected into 2D principal subspace [34]. Figure 3 illustrates the results of the estimated parameters at the 2D principal subspace in these four schemes. In this figure, the estimated mean µ i of each component is denoted by "+", and the estimated covariance Σ i is represented by shaded ellipse. Concretely, in Figure 3a, parameters can be correctly estimated by the S-MFA, as the centralized unit can use all of the observations directly. In Figure 3b-d, the results of a randomly-selected node are given. For the NC-MFA, the appropriate parameters are incorrectly estimated, as it can only use its own observations, which also happens in other nodes. For the D-GMM, as it is based on GMM, it cannot describe and process high-dimensional observations well. Finally, in the D-MFA, each node can receive the calculated LSS from nodes in its neighborhood set and combine them for parameter estimation. Compared to GMM, MFA can reflect the properties of these high-dimensional observations more accurately. Therefore, the estimated means and covariances are correct in the D-MFA, as shown in Figure 3b. The other nodes have the same results as this selected node, which are not given here due to space limitation. Moreover, as the same observations and models are used in three MFA schemes, the changes of the average log-likelihood over all nodes in the S-MFA, log-likelihood of the D-MFA and the NC-MFA are shown in different lines in Figure 4. We can see that as the iteration increases, the D-MFA is convergent. Its convergence performance approaches that of the S-MFA.
In order to further show the estimation accuracy of the D-MFA at all of the nodes in a sensor network, we select two kinds of parameters (π 1 , π 2 , π 3 ) and µ 1 , giving the estimation results of these parameters. In Figure 5, the estimated (π 1 , π 2 , π 3 ) in the D-MFA and the NC-MFA at all 100 nodes are provided. In the NC-MFA, each node cannot correctly estimate parameters due to limited observations and no information exchange with other nodes, shown by dashed lines. On the contrary, after the D-MFA converges, the estimated values (π 1 , π 2 , π 3 ) in all 100 nodes approach their true values (0.3, 0.5, 0.2). In Figure 6, we compare all of the vector components in µ 1 estimated by these three MFA schemes. It is noted that for the D-MFA and the NC-MFA, we give the mean and standard deviation of each vector component over 100 nodes to reflect the whole performance of network. We can clearly see that the S-MFA can correctly estimate µ 1 , as it can use all of the observations. For the D-MFA, the mean of estimated µ 1 in each vector component approaches the corresponding true value (3 3 3 3 3 0 0 0 0 0), while the mean of µ 1 obtained by the NC-MFA is not consistent with true value. Moreover, the standard deviation of D-MFA is smaller than that of the NC-MFA. Since the other parameters lead to similar results, we omit them here.

Real Data
In several countries, there are monitoring sites located in different regions, whose tasks are to detect nutritional ingredients in the wine samples. These sites form a sensor network, in which each site only communicates with its neighbors and can implement local computations. The wine samples sent to these monitoring sites may belong to different cultivars. Therefore, in each monitoring site, these samples need to be classified, which is good for analyzing in-depth the relationship of nutritional ingredients in the wine and their cultivars. It is certain that the more the references are available for each site, the better the results will be. Therefore, the network and cooperation between sites are required.
In this subsection, we consider the wine cultivar clustering problem as a simulation of the above scenario. The database of this problem is the wine dataset, which is one of the most popular datasets in the UCImachine learning repository [35]. In this wine dataset, 178 samples are collected from a chemical analysis of wines grown in three different cultivars in Italy (the No. 1∼No. 59 samples belong to the first class; the No. 60∼No.130 samples belong to the second class; the No. 131∼No. 178 samples belong to the third class). Each sample has 13 attributes, so the dimension of observations is 13. The sensor network is composed of eight nodes, represented as eight monitoring sites. The average number of nodes in the neighborhood set is two, and the graph is guaranteed to be connected. The average number of samples in each site is 22. Clustering belongs to the unsupervised learning paradigm in machine learning. When the D-MFA (or the D-MtFA) is adopted for clustering, initial values of parameters in the D-MFA are set, which are the same as those in Section 4.1. Then, the corresponding algorithm derived in Section 3.2 (or Section 3.3) is performed. After algorithm converges, an additional computation step based on the estimated parameters Θ at node m is carried out to obtain z m,n,i by Equation (6). Finally, the cluster decision for each observation y m,n is: The clustering results of the D-MFA at Node 1∼Node 8 are shown in Figure 8. In this figure, blue "•" represents correctly-clustered observations, while red "×" denotes wrongly-clustered ones. From these figures, we can see that the correct ratio in eight nodes are 100%, 100%, 95.2%, 95.5%, 100%, 95.5%, 100%, 92.9%. There are five wrongly-clustered observations in all. The correct ratio in the entire network is 97.2%. In order to compare the performance of the D-MFA with that of the S-MFA, we perform the D-MFA and the S-MFA algorithms 20 times. The average correct ratio of these 20 runs by the D-MFA is 96.9%, approaching that by the S-MFA, which is 98.2%. The reason for the small performance gap between the S-MFA and the D-MFA may be that the number of observations for each node is small and the dimension is relatively high. The accuracy of the calculated LSS in Step 2 of the D-MFA are a little worse than those global sufficient statistics obtained by all of the observations in the E-step of the S-MFA. For the NC-MFA, as the number of observations in this example at each node is small, the clustering cannot be implemented. For the D-GMM, as the dimension of observations is high, it also cannot finish the task of this example effectively. Moreover, as there are no outliers in this dataset, the clustering result of the D-MtFA is the same as that of the D-MFA, which is not shown here. In summary, we can use the proposed schemes to realize distributed clustering.

Conclusions
In this paper, we propose a distributed density estimation method base on a mixture of factor analyzers in sensor networks. First, a linear combination of local log-likelihoods associated with nodes in its neighborhood (including itself) is defined as the objective function. In this objective function, the combination coefficients are determined by the number of observations in corresponding nodes. Then, the D-MFA and the D-MtFA algorithms are derived. In these algorithms, the combined sufficient statistics of each node are used to estimate parameters, which can be obtained by performing a linear combination of local sufficient statistics from nodes in its neighborhood. Finally, we evaluate the performances of the proposed algorithms and apply them to the tasks of clustering and classification. Experimental results show that they are promising and effective statistical tools for processing the high-dimensional datasets in a distributed way in sensor networks.
In our future work, we will investigate distributed algorithms that can automatically determine the structure of MFA, e.g., the number of components. We also intend to design adaptive strategies to adjust combination coefficients more flexibly. Moreover, the coverage problem [36] in a sensor network is important when implementing distributed algorithms. We will consider this issue in the future.
In the first stage, three conditional distributions are needed to compute. Concretely, p(z l,n,i |y l,n ) at node l is: p(z l,n,i |y l,n ) = π old i t(y l,n |µ old The conditional distribution of w l,n,i given z l,n,i and y l,n is obtained: p(w l,n,i |z l,n,i , y l,n ) = G w l,n,i ν old i + p 2 , ν old i + W l,n,i 2 (A2) where: W l,n,i = (y l,n − µ old i ) T A old i (A old i ) T + D old i (y l,n − µ old i ) The expectations z l,n,i are given by Equation (A1), and w l,n,i can be obtained from Equation (A2), which is: w l,n,i = ν old i + p ν old i + W l,n,i The conditional distribution of u l,n given w l,n,i , z l,n,i and y l,n can be computed: p(u l,n |w l,n,i , z l,n,i , y l,n ) = N (u l,n |u l,n,i , Ω i w l,n,i ) The mean u l,n,i and Ω i are: u l,n,i = g T i (y l,n − µ old i ) is an intermediate variable introduced to simplify expressions in the following stage. Moreover, the expectations u l,n and u l,n u T l,n are: u l,n = u l,n,i and u l,n u T l,n = Ω i w l,n,i + u l,n,i u T l,n,i Similar to the D-MFA, when the first stage finishes, the current lower bound is: p(z l,n,i |y l,n )p(w l,n,i |y l,n , z l,n,i )p(u l,n |y l,n , z l,n,i , w l,n,i ) × z l,n,i log π i N (u l,n |0, I q )G(w l,n,i |ν i /2, ν i /2)N (y l,n |µ i + A i u l,n , D i ) Now, at node m, parameters can be obtained by taking derivation of Q m (Θ) with respect to Θ. First, we obtain: respectively. By substituting Equations (22) and (23) into Equations (A3) and (A4), we can obtain Equations (24) and (25). Subsequently, by respectively performing derivation of Q m (Θ) with respect to A i and D i , we have: By substituting z l,n,i , w l,n,i , u l,n , u l,n u T l,n into Equations (A5) and (A6) and by considering the simplified expressions in Equations (22), (23) and (28), we can obtain Equations (26) and (27).
Finally, by performing derivation of Q m (Θ) with respect to ν i , the update equation for ν i is obtained, shown by Equation (29).