A Distributed Learning Method for ℓ1-Regularized Kernel Machine over Wireless Sensor Networks

In wireless sensor networks, centralized learning methods have very high communication costs and energy consumption. These are caused by the need to transmit scattered training examples from various sensor nodes to the central fusion center where a classifier or a regression machine is trained. To reduce the communication cost, a distributed learning method for a kernel machine that incorporates ℓ1 norm regularization (ℓ1-regularized) is investigated, and a novel distributed learning algorithm for the ℓ1-regularized kernel minimum mean squared error (KMSE) machine is proposed. The proposed algorithm relies on in-network processing and a collaboration that transmits the sparse model only between single-hop neighboring nodes. This paper evaluates the proposed algorithm with respect to the prediction accuracy, the sparse rate of model, the communication cost and the number of iterations on synthetic and real datasets. The simulation results show that the proposed algorithm can obtain approximately the same prediction accuracy as that obtained by the batch learning method. Moreover, it is significantly superior in terms of the sparse rate of model and communication cost, and it can converge with fewer iterations. Finally, an experiment conducted on a wireless sensor network (WSN) test platform further shows the advantages of the proposed algorithm with respect to communication cost.


Introduction
A wireless sensor network (WSN) consists of a large number of small battery-powered devices that can sense, process and communicate data. WSNs are used on a wide scale to monitor the environment, track objects, and so on [1,2]. Classification and regression of monitoring information are the most fundamental and important tasks in a variety of WSN applications such as vehicle classification, fault detection and intrusion detection [3][4][5]. Therefore, many machine learning methods developed for classification or regression problems are increasingly widely used in WSNs [6,7]. However, WSNs are frequently characterized as networks with a central node that runs main operations such as network synchronization, data processing and storage, while the remaining nodes only obtain and transmit information to the central node. For machine learning problems in WSNs, the scattered training examples must be transmitted from different sensor nodes to the central fusion center by multi-hop routing. Then, all the training examples are used at the central fusion center to train a classifier or a regression machine using the batch learning method. In this paper, this learning method is referred to as the "centralized learning method". Therefore, centralized learning methods have very high communication costs and energy consumption and are liable to cause congestion and failure on nodes near the central fusion center. This will lead to an energy imbalance among the nodes and greatly reduce the lifetime of the WSN [8]. To avoid and solve these problems, distributed learning methods for classifiers or regression machines, which depend on in-network processing through collaboration between single-hop neighboring nodes, have attracted more and more interest from researchers [9][10][11][12][13][14][15][16][17].
The kernel method or kernel machine, which is shorthand for the machine learning method based on the kernel function, has attracted broad attention because of the successful application of support vector machines (SVMs) and statistical learning theory. Because of its advantages in solving nonlinear problems, the kernel method has been successfully applied to many fields and has become a mainstream method of machine learning [18][19][20][21][22][23]. As WSN applications become more widespread, research on distributed learning methods for the kernel machine has attracted increasing attention in recent years [24][25][26][27]. Guestrin et al., in [10], presented a general framework for distributed linear regression and proposed a distributed learning method that depends on a fixed network structure. Unfortunately, this method requires a very high computational overhead to maintain the network structure, and it does not apply to nonlinear kernel machines. Predd et al., in [11,12], provided a distributed collaborative learning algorithm for a linear kernel regression machine that depends on collaboration between single-hop neighboring nodes and a consistency constraint on the prediction value of each shared training example. Because of the dependence on shared training examples, the convergence, convergence rate and communication cost of this algorithm are greatly affected by the number and distribution of shared training examples. Forero et al., in [13], proposed a distributed learning method for linear SVMs based on a consensus of weights and biases between single-hop neighboring nodes. Then, Forero et al., in [14], presented a distributed learning method for nonlinear SVMs that depends on sharing training examples among all nodes and enforces consensus on the predictions of the shared training examples on all nodes. However, the construction of the shared training examples is very cumbersome in this method.
Flouri et al., in [15,16], and Yumao et al., in [17], proposed distributed learning methods for SVM based on the sparse characteristic of SVM. Because the sparse characteristic of SVM is determined by the hinge loss function, the available distributed learning algorithms for SVMs in WSNs still have a very high communication cost.
As an extension of the minimum mean squared error, the kernel minimum mean squared error (KMSE) was developed for solving nonlinear classification or regression problems. Its excellent performance and general applicability are well known and have been demonstrated [28]. Moreover, ℓ1 norm regularization (ℓ1-regularized) is widely used to solve optimization problems by incorporating an ℓ1 norm penalty, and it can identify parsimonious models for training examples, as in Lasso and Compressive Sensing [29]. To address the dependence on a fixed network structure and shared training examples and the high communication cost in existing distributed learning methods for kernel machines, this paper introduces the ℓ1-regularized term instead of the ℓ2-regularized term to construct the optimization problem of KMSE and obtain a sparse model that reduces the amount of data transmitted between neighboring nodes. On this basis, this paper proposes a distributed learning algorithm for the ℓ1-regularized KMSE estimation problem that depends on in-network processing and collaboration by transmitting the sparse model only between single-hop neighboring nodes and is independent of shared training examples and a fixed network structure. Simulation results demonstrate that the proposed algorithm obtains almost the same prediction accuracy as the batch learning method, has significant advantages in terms of the sparse rate of model and communication cost compared with the existing algorithms, and converges with relatively few iterations.
The remainder of this paper is organized as follows. In Section 2, we briefly review the supervised learning model of the KMSE estimation problem, discuss the alternating direction method of multipliers, and describe the problem to be solved in this paper. In Section 3, we present a detailed derivation and solution for the ℓ1-regularized KMSE estimation problem and the collaborative approach between neighboring nodes. Then, we describe the proposed distributed learning algorithm for the ℓ1-regularized KMSE estimation problem. In Section 4, we evaluate the performance of the proposed algorithm by extensive simulation experiments with both a synthetic dataset and datasets from the UC Irvine Machine Learning Repository (UCI). In Section 5, we conduct an experiment on a WSN test platform to further validate the performance of the proposed algorithm with regard to communication cost. Finally, we conclude the paper in Section 6.

Preliminaries and Problem Statement
In this section, we briefly review the supervised learning model of KMSE estimation problems, discuss the alternating direction method of multipliers, and then describe the problem to be addressed in this paper.

Kernel Minimum Mean Squared Error Estimation
Given a training example set $S = \{(x_i, y_i)\}$, $x_i \in \mathbb{R}^d$, $i = 1, \ldots, m$, where $y_i \in \{1, -1\}$ or $y_i \in \mathbb{R}$ and the examples are drawn independently and identically distributed from an unknown distribution, a functional relationship between the inputs $x$ and the outputs $y$ can be inferred that minimizes the mean squared error between the outputs $y$ and the predictions $f(x)$. According to the empirical risk minimization principle, which estimates the function $f(\cdot)$ directly from the training examples, a function $f(\cdot)$ is selected from a class of functions so as to minimize the empirical risk, as shown in Equation (1):

$$f^{*} = \arg\min_{f} \frac{1}{m} \sum_{i=1}^{m} \bigl(y_i - f(x_i)\bigr)^2 \tag{1}$$

Kernel methods minimize Equation (1) by means of a generalized linear model [18]. They replace $f(x)$ by $w^T \varphi(x)$, where $\varphi(x)$ is a nonlinear mapping into a high-dimensional feature space, the Hilbert space $H_K$. For kernel minimum mean squared error estimation problems, the optimization problem involves solving for $w$ as in Equation (2):

$$w^{*} = \arg\min_{w} \frac{1}{m} \sum_{i=1}^{m} \bigl(y_i - w^T \varphi(x_i)\bigr)^2 + \lambda \|w\|^2 \tag{2}$$

where a regularization term is added that penalizes large values of $w$ and prevents the solution from overfitting the training examples. The value of $\lambda$ controls the tradeoff between the minimization of the empirical risk and the smoothness of the obtained solution. Because the loss function in Equation (2) is convex, the Representer theorem [18] states that the optimal solution can be expressed as a linear combination of the mapped training examples, $w^{*} = \sum_{i=1}^{m} \alpha_i \varphi(x_i)$, so that $f^{*}(x) = \sum_{i=1}^{m} \alpha_i \varphi^T(x_i) \cdot \varphi(x)$. When $\varphi^T(x_i) \cdot \varphi(x)$ is replaced by the kernel $k(x_i, x)$, the optimal solution takes its most common form, shown in Equation (3):

$$f^{*}(x) = \sum_{i=1}^{m} \alpha_i k(x_i, x) \tag{3}$$

where $k(\cdot, \cdot)$ is a kernel function selected as a similarity measure, which greatly simplifies obtaining a nonlinear model because the nonlinear mapping never has to be computed explicitly.
Here, $k(x_i, x)$ denotes the similarity measure between training example $x_i$ and the new example $x$, and $\alpha_i \in \mathbb{R}$, $\forall i \in \{1, \ldots, m\}$ is the coefficient of $k(x_i, x)$. In Equation (3), the prediction for the new example $x$ depends on all the training examples. The regularization term of Equation (2) can be written as a function of the coefficients $\alpha_i$ in Equation (3). A popular choice of regularization term is the ℓ2 norm used in ridge regression, which is widely used in many optimization problems; however, it cannot obtain a sparse solution. Another choice is the ℓ1 norm, which has been widely studied and applied owing to its sparsity.
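The kernel expansion in Equation (3) can be illustrated with a minimal sketch. The Gaussian form of the RBF kernel with width `sigma`, the function names, and the toy coefficients below are assumptions made for illustration; they are not taken from the paper. The sketch shows that examples with a zero coefficient contribute nothing to the prediction, which is what makes a sparse model cheap to evaluate and to transmit.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Assumed Gaussian RBF form: k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

def predict(x, train_x, alpha, sigma=1.0):
    """Evaluate f(x) = sum_i alpha_i * k(x_i, x) as in Equation (3).

    Training examples whose coefficient is exactly zero are skipped.
    """
    return sum(a * rbf_kernel(xi, x, sigma)
               for xi, a in zip(train_x, alpha) if a != 0.0)

# Toy data (hypothetical): three training examples, one with a zero coefficient.
train_x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
alpha = np.array([1.0, 0.0, -0.5])
value = predict([0.0, 0.0], train_x, alpha)
```

With an ℓ2 penalty, all entries of `alpha` are generally nonzero, so every training example is needed at prediction time.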

Alternating Direction Method of Multipliers
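Consider the separable problem in Equation (4) [29,30]:

$$\min_{x_1, \ldots, x_N,\, z} \sum_{i=1}^{N} F_i(x_i) \quad \text{s.t.} \quad x_i \in P_i,\ x_i = z,\ i = 1, \ldots, N \tag{4}$$

where $F_i: \mathbb{R}^m \mapsto \mathbb{R}$, $i = 1, \ldots, N$ is a convex function, $P_i$, $i = 1, \ldots, N$ represents the bounded polyhedral subsets of $\mathbb{R}^m$, $x_i \in \mathbb{R}^m$, $i = 1, \ldots, N$ is a local variable, and $z \in \mathbb{R}^m$ is a global variable. Because the constraint $x_i = z$, $i = 1, \ldots, N$ requires all the local variables to be equal, this problem is called the global consensus problem. The alternating direction method of multipliers (ADMM) for Problem (4) can be derived from the augmented Lagrangian shown in Equation (5):

$$L_{\rho}(x, z, y) = \sum_{i=1}^{N} \Bigl( F_i(x_i) + y_i^T (x_i - z) + \frac{\rho}{2} \|x_i - z\|_2^2 \Bigr) \tag{5}$$

where $y$ is the vector of dual variables or Lagrange multipliers, and $y_i$ is the element of $y$ that is the dual variable on the constraint $x_i = z$. The parameter $\rho > 0$ is called the penalty parameter, and the quadratic term is included to overcome the lack of strict convexity of the primal function in (4). The resulting ADMM algorithm is as follows:

$$x_i^{k+1} = \arg\min_{x_i \in P_i} \Bigl( F_i(x_i) + (y_i^{k})^T (x_i - z^{k}) + \frac{\rho}{2} \|x_i - z^{k}\|_2^2 \Bigr) \tag{6}$$

$$z^{k+1} = \frac{1}{N} \sum_{i=1}^{N} \Bigl( x_i^{k+1} + \frac{1}{\rho} y_i^{k} \Bigr) \tag{7}$$

$$y_i^{k+1} = y_i^{k} + \rho \bigl( x_i^{k+1} - z^{k+1} \bigr) \tag{8}$$

where Equations (6) and (8) are carried out independently for each $i = 1, \ldots, N$, and the update of the global variable $z$ in Equation (7) is handled at the central fusion center. Therefore, the ADMM algorithm operates in cycles and is considered a highly parallelizable method that applies to convex separable problems that are not necessarily strictly convex.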
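A small numerical sketch may help make the consensus iterations in Equations (6)-(8) concrete. It assumes scalar local variables, no polyhedral constraints, and the hypothetical local costs $F_i(x) = \frac{1}{2}(x - a_i)^2$, for which the local minimization has a closed form; none of these choices come from the paper. Under these assumptions the consensus value converges to the average of the $a_i$.

```python
import numpy as np

def consensus_admm(a, rho=1.0, iters=100):
    """Global-consensus ADMM for the assumed costs F_i(x) = 0.5 * (x - a_i)^2."""
    a = np.asarray(a, dtype=float)
    x = np.zeros_like(a)   # local variables x_i
    y = np.zeros_like(a)   # dual variables y_i
    z = 0.0                # global variable
    for _ in range(iters):
        # Local update, Equation (6): closed-form minimizer of
        # 0.5*(x - a_i)^2 + y_i*(x - z) + (rho/2)*(x - z)^2.
        x = (a - y + rho * z) / (1.0 + rho)
        # Global update, Equation (7): average of x_i + y_i / rho.
        z = float(np.mean(x + y / rho))
        # Dual update, Equation (8).
        y = y + rho * (x - z)
    return x, z

x, z = consensus_admm([1.0, 2.0, 6.0])
```

In this sketch only the averaging step needs information from all nodes, which is the step that the distributed method later replaces with averaging over single-hop neighbors.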

Problem Statement
Consider a WSN with $J$ sensor nodes. Assume that the WSN is connected and that any node $j \in J$ only communicates with its one-hop neighboring nodes. The set of all the one-hop neighboring nodes of node $j$ (and $j$ itself) is denoted as $B_j \subseteq J$. The training example set on node $j \in J$ is a subset of the entire training example set and is denoted by $S_j := \{(x_{jn}, y_{jn}), n = 1, 2, \ldots, N_j\}$, where $N_j$ is the number of training examples in $S_j$. We assume that no training examples have to be shared between neighboring nodes. To decrease the communication cost of the centralized learning method for kernel machines in WSNs, this paper studies the distributed learning method for the KMSE machine. Our research idea is to obtain a sparse model on each node by introducing ℓ1 norm regularization, which reduces the amount of data transmitted between neighboring nodes. Moreover, each node communicates only with its one-hop neighboring nodes to save energy; thus, we only consider in-network processing and collaboration between one-hop neighboring nodes.
For ease of description, two definitions are provided.

Distributed ℓ1-Regularized KMSE Machine
In this section, we first detail the derivation and solution of the optimization problem for the ℓ1-regularized KMSE machine. Then, we introduce the collaboration method between neighboring nodes, and, finally, propose and describe the distributed learning algorithm for the ℓ1-regularized KMSE machine.

Derivation and Solution of the ℓ1-Regularized KMSE Machine
The KMSE estimation problem with the ℓ1-regularized term is written as problem (9), where $\lambda \|f\|_1$ is the ℓ1-regularized term and $\lambda > 0$ is a scalar regularization parameter usually chosen through cross-validation. If $\{S_j\}_{j=1}^{J}$ is centrally available at the fusion center of the WSN, the global optimal model $f^{*}(x)$ can be obtained by solving the optimization problem (10). Problem (10) is an equivalent reformulation of (9); therefore, (9) and (10) yield the same global optimal model on the same training set. Starting from the form of (10), a further equivalent reformulation that is easy to decompose is constructed and written as (11), in which $J / \sum_{j=1}^{J} N_j$ is a constant that simplifies to $1/N_j$ when each node has the same number of training examples. Unless otherwise noted, the number of training examples is the same for each node.
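As a reading aid, problems (9)-(11) can be sketched as follows. This is one form consistent with the description above and is not guaranteed to match the original typesetting: the $1/m$ and $1/\sum_j N_j$ scalings of the loss term are assumptions, chosen so that (9) and (10) coincide when $m = \sum_j N_j$ and so that multiplying (10) by $J$ produces the constant $J / \sum_j N_j$ mentioned for (11).

```latex
\begin{align*}
\text{(9)}\quad  & \min_{f}\ \frac{1}{m}\sum_{i=1}^{m}\bigl(y_i - f(x_i)\bigr)^2 + \lambda\|f\|_1 \\
\text{(10)}\quad & \min_{f}\ \frac{1}{\sum_{j=1}^{J} N_j}\sum_{j=1}^{J}\sum_{n=1}^{N_j}\bigl(y_{jn} - f(x_{jn})\bigr)^2 + \lambda\|f\|_1 \\
\text{(11)}\quad & \min_{f}\ \sum_{j=1}^{J}\Bigl[\frac{J}{\sum_{i=1}^{J} N_i}\sum_{n=1}^{N_j}\bigl(y_{jn} - f(x_{jn})\bigr)^2 + \lambda\|f\|_1\Bigr]
\end{align*}
```

Under these assumptions the objective in (11) is $J$ times the objective in (10), so both have the same minimizer, and each bracketed summand depends only on the training examples of one node.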
In (11), $\frac{1}{N_j} \sum_{n=1}^{N_j} (y_{jn} - f(x_{jn}))^2 + \lambda \|f\|_1$ is referred to as the $j$-th term of the objective, which relies only on the local training examples of node $j$; thus, problem (11) can be split into $J$ subproblems, each constructed on one node with its local training examples. Our goal is to solve problem (11) in such a way that each term can be handled by its own processor on each node. To this end, the optimization problem (11) is rewritten with the local models $f_j(x)$ and a common global model $f(x)$ as problem (12), in which the equality constraint requires that all the local models produce the same prediction on the same training example. This approach ensures the consistency of the local model $f_j(x)$ and the global model $f(x)$ on the same training set. However, $\lambda \|f_j\|_1$ in (12) and $\lambda \|f\|_1$ in (11) may not have the same value because different training examples are used. Therefore, the optimization problem in (12) is an approximation of the optimization problem in (11). The equality-constrained convex optimization problem (12) is typically tackled through its dual problem, and the ADMM algorithm is used here to solve it. The augmented Lagrangian function for problem (12) is given by Equation (13) [30], where $p_j$ is the dual variable or Lagrange multiplier corresponding to the equality constraint $f_j(x_{jn}) = f(x_{jn})$, $\forall j \in J$, $n = 1, \ldots, N_j$, and $c > 0$ is the penalty parameter. To minimize $L(f_j, f, p_j)$, we take its partial derivatives with respect to $f_j$, $f$, and $p_j$, respectively, and set them to zero. The resulting ADMM algorithm consists of Equations (14)-(16), where Equations (14) and (16) are carried out independently on each node $j \in \{1, \ldots, J\}$, and the global optimal model $f(x)$ is usually handled at the central fusion center. This algorithm can be simplified further.
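For orientation, the consensus problem (12) and its augmented Lagrangian (13) can be sketched in the following form. The notation $p_j(x_{jn})$ for the multiplier attached to example $x_{jn}$ and the factor $c/2$ on the quadratic term are assumptions; the latter is chosen because it reproduces the factor $\frac{2}{N_j} + c$ that the text attributes to Equation (27).

```latex
\begin{align*}
\text{(12)}\quad & \min_{\{f_j\},\,f}\ \sum_{j=1}^{J}\Bigl[\frac{1}{N_j}\sum_{n=1}^{N_j}\bigl(y_{jn} - f_j(x_{jn})\bigr)^2 + \lambda\|f_j\|_1\Bigr]
\quad \text{s.t.}\quad f_j(x_{jn}) = f(x_{jn}),\ \forall j \in J,\ n = 1,\ldots,N_j \\
\text{(13)}\quad & L(f_j, f, p_j) = \sum_{j=1}^{J}\Bigl[\frac{1}{N_j}\sum_{n=1}^{N_j}\bigl(y_{jn} - f_j(x_{jn})\bigr)^2 + \lambda\|f_j\|_1\Bigr] \\
& \qquad + \sum_{j=1}^{J}\sum_{n=1}^{N_j} p_j(x_{jn})\bigl(f_j(x_{jn}) - f(x_{jn})\bigr)
  + \frac{c}{2}\sum_{j=1}^{J}\sum_{n=1}^{N_j}\bigl(f_j(x_{jn}) - f(x_{jn})\bigr)^2
\end{align*}
```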
With the average over $j \in \{1, \ldots, J\}$ of a prediction denoted with an overline, the update of $f^{k+1}(x_{jn})$ can be written as Equation (17). Similarly, averaging the update of $p_j^{k}$ yields Equation (18). Substituting Equation (17) into Equation (18) shows that $\bar{p}^{k+1} = 0$, i.e., the dual variables have an average value of zero after the first iteration. Using $f^{k+1}(x_{jn}) = \bar{f}^{k+1}(x_{jn})$, the ADMM algorithm can be written as Equations (19)-(21). Note that the average predictions $\bar{f}^{k+1}(x_j)$ of the local training examples on node $j$ are obtained from the models of all nodes within the WSN; consequently, every node would need the models of all other nodes. Without a fusion center or special WSN nodes, this would significantly increase the communication cost and energy consumption of each node. To reduce the communication overhead, the local average predictions $\bar{f}_{B_j}^{k+1}(x_j)$ obtained from the models on single-hop neighboring nodes are used as an approximation to the global average predictions $\bar{f}^{k+1}(x_j)$ that would be obtained from the models of all nodes. As a consequence, the iterative Formulas (19)-(21) can be rewritten as Equations (22)-(24), where $B_j \subseteq J$ is the set of neighboring nodes of node $j \in J$, including itself; $|B_j|$ denotes the number of nodes in $B_j$; and $\bar{f}_{B_j}^{k}(x_j)$ denotes the average predictions of the local training examples on node $j$, which are obtained from the models on the nodes in $B_j$ at the $k$-th iteration. In this paper, the distributed learning algorithm for the ℓ1-regularized KMSE machine is executed through iterations over Equations (22)-(24).
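A sketch of what the neighborhood iterations (22)-(24) look like under the standard consensus-ADMM pattern is given below. The exact form is an assumption: the multiplier notation $p_j^{k}(x_{jn})$ and the $c/2$ scaling follow the usual unscaled augmented Lagrangian and are not confirmed by the text.

```latex
\begin{align*}
\text{(22)}\quad f_j^{k+1} &= \arg\min_{f_j}\ \frac{1}{N_j}\sum_{n=1}^{N_j}\bigl(y_{jn} - f_j(x_{jn})\bigr)^2 + \lambda\|f_j\|_1
  + \sum_{n=1}^{N_j} p_j^{k}(x_{jn})\bigl(f_j(x_{jn}) - \bar f_{B_j}^{k}(x_{jn})\bigr) \\
  &\qquad + \frac{c}{2}\sum_{n=1}^{N_j}\bigl(f_j(x_{jn}) - \bar f_{B_j}^{k}(x_{jn})\bigr)^2 \\
\text{(23)}\quad \bar f_{B_j}^{k+1}(x_{jn}) &= \frac{1}{|B_j|}\sum_{i \in B_j} f_i^{k+1}(x_{jn}) \\
\text{(24)}\quad p_j^{k+1}(x_{jn}) &= p_j^{k}(x_{jn}) + c\,\bigl(f_j^{k+1}(x_{jn}) - \bar f_{B_j}^{k+1}(x_{jn})\bigr)
\end{align*}
```

Only the averaging step requires communication, and it involves nothing beyond the models of the nodes in $B_j$.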
The optimization problem (22) is still an unconstrained convex optimization problem that relies only on the local training examples of each node. To make it easier to solve, it is rewritten in matrix form as (25), where $Y_j \in \mathbb{R}^{N_j}$ is the output vector of the training examples on node $j \in J$, $K_j \in \mathbb{R}^{N_j \times (N_j + 1)}$ is the augmented matrix obtained from $k(x_1, x_2)$, $\forall x_1, x_2 \in S_j$, and $\alpha_j \in \mathbb{R}^{N_j + 1}$ is the coefficient vector. To solve the unconstrained optimization problem in (25), an equivalent reformulation with an equality constraint is constructed as shown in (26). Then, the ADMM is used to solve the optimization problem in (26), and the resulting iterations are given by Equations (27)-(29). In Equation (27), $I$ is the identity matrix, and $(\frac{2}{N_j} + c) K_j^T K_j + \rho I$ is always invertible because $\rho > 0$. In Equation (28), $S$ is the soft thresholding operator, which is defined in Equation (30). Each node $j \in J$ can then obtain the sparse coefficient vector $\alpha_j$ through the iterations of Equations (27)-(29); that is, the model contains relatively few key examples. The model obtained on each node can be expressed as in Equation (31), which has the same form as Equation (3); here $l$ is the number of key examples, and each coefficient $\alpha_{ji}$ is nonzero. Consequently, the distributed learning algorithm for the ℓ1-regularized KMSE machine can be executed in the sequence of Equations (27)-(29), (23), (24) for each iteration until each node obtains a stable model.
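The local solve described by Equations (27)-(29) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the scaled-dual form of ADMM with a hypothetical auxiliary variable `z` and scaled dual `u`, applies the ℓ1 penalty to every coefficient (including any bias column of the augmented matrix), and the names `local_l1_kmse`, `f_bar` and `p` are illustrative.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Soft thresholding: shrink each entry of v toward zero by kappa."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def local_l1_kmse(K, Y, f_bar, p, lam=0.1, c=1.0, rho=1.0, iters=200):
    """Assumed local problem on one node:

        min_a (1/N)||Y - K a||^2 + p^T (K a - f_bar)
              + (c/2)||K a - f_bar||^2 + lam * ||a||_1

    solved by ADMM with the splitting a = z (scaled dual u).
    """
    n, d = K.shape
    # System matrix of the coefficient update; invertible because rho > 0.
    A = (2.0 / n + c) * K.T @ K + rho * np.eye(d)
    b = (2.0 / n) * K.T @ Y + K.T @ (c * f_bar - p)
    z = np.zeros(d)
    u = np.zeros(d)
    for _ in range(iters):
        alpha = np.linalg.solve(A, b + rho * (z - u))   # cf. Equation (27)
        z = soft_threshold(alpha + u, lam / rho)        # cf. Equation (28)
        u = u + alpha - z                               # cf. Equation (29)
    return z  # z carries the exact zeros of the sparse model

# Toy usage (hypothetical data): identity "kernel" matrix, three examples.
K = np.eye(3)
Y = np.array([1.0, 0.1, -2.0])
coef = local_l1_kmse(K, Y, f_bar=Y.copy(), p=np.zeros(3), lam=0.2, c=1.0)
```

Returning `z` rather than `alpha` is a design choice of this sketch: the thresholded variable contains exact zeros, so the key examples can be read off directly from its nonzero entries.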

Collaboration Method between Neighboring Nodes
To compute the average predictions $\bar{f}_{B_j}^{k}(x_j)$ in Equation (23), each node requires all the models on its single-hop neighboring nodes. Therefore, the sparse models must be transferred between single-hop neighboring nodes. After each node receives all the models from its neighboring nodes, all the key examples in these models are added to the local training example set, and then all the received models are used to compute the average predictions of all the local training examples. In short, the distributed learning method presented here depends on in-network processing, with sparse models transferred only between single-hop neighboring nodes.
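The collaboration step can be sketched as below, under several assumptions: a model is represented as a hypothetical tuple `(centers, coef, bias)` holding its key examples and their coefficients, the kernel is a Gaussian RBF with width `sigma`, and the dual update follows the standard ADMM pattern. Adding the received key examples to the local training set is omitted here for brevity.

```python
import numpy as np

def rbf_matrix(X, centers, sigma=1.0):
    """Pairwise Gaussian RBF values between the rows of X and of centers."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def model_predict(model, X, sigma=1.0):
    """Predict with a sparse model given as (centers, coef, bias)."""
    centers, coef, bias = model
    return rbf_matrix(X, centers, sigma) @ coef + bias

def collaborate(X_local, own_model, neighbor_models, p, c=1.0, sigma=1.0):
    """Average neighborhood predictions, then update the dual variables."""
    models = [own_model] + list(neighbor_models)      # B_j includes node j
    preds = np.stack([model_predict(m, X_local, sigma) for m in models])
    f_bar = preds.mean(axis=0)                        # cf. Equation (23)
    p_new = p + c * (preds[0] - f_bar)                # cf. Equation (24)
    return f_bar, p_new

# Toy usage (hypothetical): one neighbor, two local examples.
X_local = np.array([[0.0, 0.0], [1.0, 1.0]])
own = (np.array([[0.0, 0.0]]), np.array([1.0]), 0.0)
nbr = (np.array([[1.0, 1.0]]), np.array([1.0]), 0.0)
f_bar, p_new = collaborate(X_local, own, [nbr], p=np.zeros(2), c=1.0)
```

Each transmitted model contains only its key examples and their coefficients, so the message size grows with the number of nonzero coefficients rather than with the size of the local training set.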

Distributed Learning Algorithm for the ℓ1-Regularized KMSE Machine
Based on the derivation and solution of the ℓ1-regularized KMSE estimation problem and the collaboration between neighboring nodes, we propose a distributed learning algorithm for the ℓ1-regularized KMSE machine (L1-DKMSE). The detailed steps of L1-DKMSE are as follows:

Algorithm 1: L1-DKMSE
Input: Initialize the training set $S_j := \{(x_j, y_j)\}$ for each node $j \in J$, the iteration counter $k = 0$, $\bar{f}_{B_j}^{k}(x_j) = y_j$, and $p_j^{k}(x_j) = 0$; choose the Radial Basis Function (RBF) as the kernel function; and initialize the kernel parameter $\sigma$ and the regularization parameter $\lambda$.
Output: the sparse model $f_j(\cdot)$, $\forall j \in J$.

Repeat:
Step 1: each node $j \in J$ obtains its sparse model $f_j^{k}(\cdot)$ by iterating Equations (27)-(29) on its training examples. Then, it broadcasts its sparse model to its one-hop neighboring nodes in $B_j$.
Step 2: each node $j \in J$ receives $f_i^{k}(\cdot)$, $i \in B_j$ and adds the key examples in $f_i^{k}(\cdot)$, $i \in B_j$ to its local training set.
Step 3: each node $j \in J$ predicts its local training examples by using $f_i^{k}(\cdot)$, $i \in B_j$ and then computes $\bar{f}_{B_j}^{k}(x_j)$ and $p_j^{k}(x_j)$ using Equations (23) and (24), respectively.
Step 4: if the models $f_j^{k}(\cdot)$ on all nodes are stable, stop; otherwise, set $k = k + 1$ and return to Step 1.

Numerical Simulations
To analyze the performance of Algorithm 1, extensive simulation experiments have been conducted using a synthetic dataset and UCI datasets. Consider a randomly generated connected network with $J = 30$ nodes, a minimum degree per node of two and a maximum degree per node of five. The Centralized L1-regularized KMSE learning algorithm (L1-CKMSE), the Centralized SVM learning algorithm (CSVM), the Distributed Kernel Least Squares learning algorithm (DKLS) in [10], the Distributed Parallel SVM learning algorithm (DPSVM) in [17], and the Distributed SVM learning algorithm based on a nonlinear kernel (NDSVM) in [14] are compared with Algorithm 1 with regard to prediction accuracy, sparse rate of model, communication cost, and iterations.

Synthetic Dataset
The synthetic dataset is composed of labeled training examples from two different equiprobable, nonlinearly separable classes $C_1$ and $C_2$.
Class $C_1$ contains examples from a two-dimensional Gaussian distribution with a covariance matrix of $\Sigma = [0.6, 0; 0, 0.4]$ and a mean vector $\mu_1 = [0, 0]^T$. Class $C_2$ is a mixture of Gaussian distributions with the mixing parameters $\pi_{21} = 0.3$ and $\pi_{22} = 0.7$, the mean vectors $\mu_2 = [-2, -2]^T$ and $\mu_3 = [2, 2]^T$, and the same covariance matrix $\Sigma$. Each local training set $S_j$ consists of
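The two-class distribution described above can be sampled with a short sketch. The class labels (+1 for $C_1$, -1 for $C_2$), the function name, the sample size and the random seed are assumptions for illustration; the number of examples per node is not specified here and is left to the caller.

```python
import numpy as np

def make_synthetic(n, seed=0):
    """Draw n labeled examples from the two equiprobable classes.

    C1: N([0, 0], Sigma); C2: 0.3 * N([-2, -2], Sigma) + 0.7 * N([2, 2], Sigma),
    with Sigma = diag(0.6, 0.4). Labels +1 / -1 are an assumed encoding.
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[0.6, 0.0], [0.0, 0.4]])
    means_c2 = np.array([[-2.0, -2.0], [2.0, 2.0]])
    labels = rng.choice([1, -1], size=n)           # equiprobable classes
    comp = rng.choice(2, size=n, p=[0.3, 0.7])     # mixture component for C2
    # C1 examples are centered at the origin, C2 examples at their component mean.
    centers = np.where((labels == 1)[:, None], 0.0, means_c2[comp])
    noise = rng.multivariate_normal(np.zeros(2), cov, size=n)
    return centers + noise, labels

X, y = make_synthetic(200, seed=0)
```

Because $C_1$ sits between the two components of $C_2$, no single line separates the classes, which is why a nonlinear kernel is needed.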

In general, the prediction accuracy obtained by the centralized learning algorithm is used as the benchmark. Thus, the prediction accuracy of the L1-CKMSE algorithm is used as the benchmark. As shown in Figure 1a, the prediction accuracy of CSVM, L1-CKMSE, DKLS, DPSVM, and L1-DKMSE algorithms on the synthetic dataset is nearly equivalent; however, NDSVM-1 and NDSVM-2 obtain a relatively low prediction accuracy. Therefore, the prediction accuracy of L1-DKMSE algorithm is nearly equivalent to the benchmark on the synthetic dataset and is much better than that of NDSVM.
The sparse rate of model is the ratio of the number of key examples to that of all training examples (see Definitions 1 and 2); thus, a lower sparse rate of model or fewer key examples is better. As shown in Figure 1b, the sparse rate of model obtained by CSVM, DKLS, DPSVM and NDSVM on the synthetic dataset is significantly higher than that obtained by L1-CKMSE and L1-DKMSE, whereas the sparse rate of model obtained by L1-CKMSE is slightly higher than that obtained by L1-DKMSE. Specifically, the sparse rate of model obtained by L1-CKMSE is 19.75% higher than that obtained by L1-DKMSE on the synthetic dataset. A comparison of the sparse rate of model obtained by all the compared algorithms shows that our proposed L1-DKMSE algorithm significantly outperforms the compared algorithms in this respect, indicating that L1-DKMSE can obtain the sparsest model among all the compared algorithms. Therefore, our algorithm can significantly reduce the computing costs of performing predictions.
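The sparse rate of model used in this comparison can be computed as in the following sketch, which assumes that a key example is any training example whose coefficient is nonzero; the function name and the tolerance parameter are hypothetical.

```python
import numpy as np

def sparse_rate(alpha, n_train, tol=0.0):
    """Ratio of key examples (|coefficient| > tol) to all training examples."""
    key_examples = int(np.count_nonzero(np.abs(np.asarray(alpha)) > tol))
    return key_examples / n_train

# Toy usage: two of four coefficients are nonzero.
rate = sparse_rate([0.7, 0.0, -1.7, 0.0], n_train=4)
```

A lower value means fewer kernel evaluations per prediction and fewer values to transmit per model.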
The communication costs are measured in terms of the number of scalars transmitted on all nodes. As shown in Figure 1c, the communication cost for L1-DKMSE is significantly less than that for CSVM, L1-CKMSE, DKLS, and NDSVM, and it is close to that for DPSVM on the synthetic dataset. Specifically, the communication costs for L1-DKMSE are 85.53%, 85.46%, 77.56%, 87.06%, and 91.24% less than for CSVM, L1-CKMSE, DKLS, NDSVM-1 and NDSVM-2, respectively, and 27.85% less than for DPSVM. Consequently, L1-DKMSE has been shown to significantly outperform CSVM, L1-CKMSE, DKLS, DPSVM and NDSVM in terms of communication cost on the synthetic dataset.
As shown in Figure 1d, the number of iterations required by L1-DKMSE is slightly higher than that required by DKLS and DPSVM but significantly lower than that required by CSVM, L1-CKMSE, NDSVM-1 and NDSVM-2. These results show that our proposed L1-DKMSE requires relatively few iterations on the synthetic dataset.

UCI Datasets
To further verify the applicability and effectiveness of the L1-DKMSE algorithm, three datasets from the UCI repository are used to conduct experiments. Brief descriptions of these three datasets are listed in Table 1. Table 2 shows the optimal values of the parameters used in the different algorithms for each dataset. The optimal values of the parameters used in CSVM, DPSVM and NDSVM are chosen by cross-validation, and those in L1-CKMSE, L1-DKMSE and DKLS are selected by the grid search method. Moreover, the number of shared examples in the NDSVM algorithm is denoted by L. In Figure 2a, the prediction accuracy achieved by CSVM, L1-CKMSE, and DPSVM is almost the same for each dataset. The prediction accuracy of L1-DKMSE is slightly below that obtained by the centralized learning algorithms (L1-CKMSE and CSVM) on the same dataset, but slightly higher than that obtained by NDSVM on the same dataset. Specifically, the prediction accuracy obtained by L1-DKMSE is 1.61%, 2.12%, and 2.64% below the prediction accuracy obtained by L1-CKMSE on the magic, default of credit card client and spambase datasets, respectively. The comparison shows no distinct difference in prediction accuracy between L1-DKMSE and L1-CKMSE, indicating that the prediction accuracies of L1-DKMSE and L1-CKMSE are nearly equivalent for each dataset.
As shown in Figure 2b, the sparse rate of model obtained by CSVM, DKLS, DPSVM and NDSVM on each dataset is significantly higher than that obtained by L1-CKMSE and L1-DKMSE for the same dataset, whereas the sparse rate of model obtained by L1-CKMSE is much higher than that obtained by L1-DKMSE. Specifically, the sparse rate of model results obtained by L1-DKMSE are 45.04%, 16.81%, and 22.96% lower than those obtained by L1-CKMSE on the magic, default of credit card client and spambase datasets, respectively. A comparison of the sparse rate of model obtained by these algorithms shows that the L1-DKMSE algorithm significantly outperforms the compared algorithms, indicating that L1-DKMSE can obtain the sparsest model among all the algorithms tested in this simulation and can significantly reduce the computing costs for performing predictions.
As shown in Figure 2c, the communication costs for L1-DKMSE are significantly lower than those for CSVM, L1-CKMSE, DKLS, DPSVM and NDSVM on each dataset. Among the five comparison algorithms, the communication costs for DKLS are the closest to those for the L1-DKMSE algorithm; however, the communication costs for L1-DKMSE are 57.02%, 41.69% and 19.71% less than those for DKLS on the magic, default of credit card client and spambase datasets, respectively. The comparison results show that L1-DKMSE significantly outperforms the compared algorithms with respect to communication costs on each dataset listed in Table 1.
As shown in Figure 2d, the number of iterations of L1-DKMSE is slightly higher than that of DKLS and DPSVM on the magic and default of credit card client datasets and slightly higher than that of DPSVM on the spambase dataset, but significantly lower than that of CSVM, L1-CKMSE and NDSVM. The simulation results on these three UCI datasets show that our proposed L1-DKMSE algorithm requires relatively few iterations.

Experiment on Test Platform
To further compare the communication costs of the different algorithms, an experiment is conducted on a test platform developed by our team to validate the proposed distributed learning method for kernel methods. The experiment is based on the communication-cost results obtained on the synthetic dataset, and only the cost of sending data is considered. Thus, the average amount of data sent by each node at every turn is easily calculated. The average amount of data sent at every turn on each node and the iterations of the different algorithms are shown in Table 3. Two 18650-type Li-ion batteries are used to power the sensor node. The direct load method is used to calculate the battery capacity from the battery voltage. The correspondence between voltage and battery capacity is shown in Table 4.
For each algorithm, each node broadcasts the corresponding amount of data at every turn and repeats this for the number of iterations listed in Table 3. The energy consumption of each node can then be calculated from the relationship between voltage and battery capacity in Table 4. The resulting energy consumption of each node for the different algorithms is illustrated in Figure 3. As Figure 3 shows, the energy consumption of each node for L1-DKMSE is significantly less than that for CSVM, L1-CKMSE, DKLS, DPSVM and NDSVM. Among the five comparison algorithms, DPSVM comes closest to the proposed algorithm; even so, the energy consumption of each node for L1-DKMSE is still 77.16% less than that for DPSVM. These results show that L1-DKMSE significantly outperforms the compared algorithms in terms of the energy consumption of each node.
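The calculation described above, reading battery capacity from voltage and comparing consumption between algorithms, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the voltage-to-capacity pairs are placeholder values and do not reproduce Table 4, linear interpolation between table entries is an assumption, and the function names are hypothetical.

```python
from bisect import bisect_left

# Placeholder (voltage in V, remaining capacity in %) pairs, sorted by voltage.
# These are NOT the values of Table 4; substitute the measured table.
VOLTAGE_TO_CAPACITY = [
    (6.0, 0.0),
    (7.0, 20.0),
    (7.4, 50.0),
    (7.8, 80.0),
    (8.4, 100.0),
]


def capacity_from_voltage(voltage, table=VOLTAGE_TO_CAPACITY):
    """Estimate remaining capacity (%) from a measured battery voltage.

    Values between table entries are linearly interpolated (an assumption);
    values outside the table are clamped to the first or last entry.
    """
    volts = [v for v, _ in table]
    if voltage <= volts[0]:
        return table[0][1]
    if voltage >= volts[-1]:
        return table[-1][1]
    i = bisect_left(volts, voltage)  # first entry with volts[i] >= voltage
    (v0, c0), (v1, c1) = table[i - 1], table[i]
    return c0 + (c1 - c0) * (voltage - v0) / (v1 - v0)


def consumed_capacity(v_before, v_after, table=VOLTAGE_TO_CAPACITY):
    """Capacity (%) consumed by a node between two voltage readings."""
    return capacity_from_voltage(v_before, table) - capacity_from_voltage(v_after, table)


def relative_saving(consumption, baseline):
    """Percentage by which `consumption` is lower than `baseline`."""
    return 100.0 * (baseline - consumption) / baseline
```

Under these assumptions, a node's consumption for one algorithm is `consumed_capacity(v_start, v_end)`, where the two voltages are read before and after the broadcasts, and `relative_saving` expresses a comparison of the kind reported for L1-DKMSE against DPSVM.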

Conclusions
In this paper, we presented a distributed learning algorithm, L1-DKMSE, for the 1  -regularized KMSE machine and demonstrated its effectiveness through simulation experiments as well as a test platform experiment. The experimental results indicated that L1-DKMSE can obtain almost the same prediction accuracy as that of the centralized learning method and learn a very sparse model. In particular, it can significantly decrease the communication costs during the model training process and can converge with fewer iterations. Because of its remarkable advantages in terms of the communication cost and the sparseness of the model, the L1-DKMSE algorithm is considered a feasible learning method for kernel machines in WSNs. In future work, we will explore the following topics: (1) how to transmit and share training examples between neighboring nodes under unreliable communication links; (2) how to select a neighboring node according to the residual energy of its neighbor nodes; and (3) how to apply an online learning method for a kernel machine to reduce the computational complexity and memory requirements.
