Data Compression Based on Stacked RBM-AE Model for Wireless Sensor Networks

Data compression is very important in wireless sensor networks (WSNs) with the limited energy of sensor nodes. Data communication results in energy consumption most of the time; the lifetime of sensor nodes is usually prolonged by reducing data transmission and reception. In this paper, we propose a new Stacked RBM Auto-Encoder (Stacked RBM-AE) model to compress sensing data, which is composed of a encode layer and a decode layer. In the encode layer, the sensing data is compressed; and in the decode layer, the sensing data is reconstructed. The encode layer and the decode layer are composed of four standard Restricted Boltzmann Machines (RBMs). We also provide an energy optimization method that can further reduce the energy consumption of the model storage and calculation by pruning the parameters of the model. We test the performance of the model by using the environment data collected by Intel Lab. When the compression ratio of the model is 10, the average Percentage RMS Difference value is 10.04%, and the average temperature reconstruction error value is 0.2815 °C. The node communication energy consumption in WSNs can be reduced by 90%. Compared with the traditional method, the proposed model has better compression efficiency and reconstruction accuracy under the same compression ratio. Our experiment results show that the new neural network model can not only apply to data compression for WSNs, but also have high compression efficiency and good transfer learning ability.


Introduction
Research has found sensor networks suitable to collect environment information, such as temperature, light and humidity in harder-to-reach areas. In recent years, with the development of information technology, wireless sensor networks (WSNs) have played an important role in a wide range of disciplines [1]. Unmanned vehicles need GPS and accelerometer to locate themselves, and to get information regarding their environments using a camera and lidar [2]. By fusing these multi-modal sensor data, they can also predict the moving trajectory of nearby objects. Similarly, environmental monitoring applications need temperature, humidity, wind direction, etc. The increase in WSNs applications and sensor nodes have led to an explosion in the amount of sensing data. If the sensing data collected by WSNs are directly sent to a gateway, it will not only consume a lot of power but also increase the probability of transmission errors. Because a sensor node's storage and portable energy resources is limited, it is important to design an energy-saving and redundant data compression scheme for WSNs. These factors have motivated efficient technologies to reduce energy consumption and extend the lifetime of WSNs.
WSNs' data collection methods include direct transmission to the base station, multi-hop forwarding, data aggregation, and modeling with coding. Data aggregation is an important technology In Ref. [26], a deep convolution network for ECG signal compression is proposed, which needs a large amount of calculation. The Restricted Boltzmann Machine (RBM) is a probabilistic model for a density over observed variables that uses a set of hidden variables (representing presence of features). In the standard RBM, all observed variables are related to all hidden variables by different parameters. It was widely used in classification and generation. In Ref. [27], the Convolutional RBM (C-RBM) was developed to achieve object detection. The C-RBM is a variant of the RBM model in which weights are shared to respect the spatial structure of images. In Ref. [28], a pre-training algorithm for the Deep Boltzmann Machines (DBMs) was described. In Ref. [29], a CS reconstruction algorithm was proposed, which amounts to two nested inference problems, one on the CS observation-matching problem, and the other on the RBM model. Compared with variational autoencoders (VAE) and generative adversarial networks (GAN), RBM has the simplest network structure and the minimum number of parameters. Correspondingly, RBM has a small computational energy consumption and is more suitable for use in WSNs with limited energy. In our survey, there is currently no research on how to directly use RBM to compress the sensing data for WSNs. Currently, we mainly explore how to use RBM to compress the sensing data. In our next research, we will also explore how to use VAE, GAN and other deep learning models to compress sensing data.
In this work, we combined the RBM generation model with the nonlinear learning method of deep learning theory, and propose the Stacked RBM-AE model, which can compress and reconstruct the sensing data for WSNs. We use four standard RBMs of different sizes to form the encoder and decoder of the Stacked RBM-AE model. The Stacked RBM-AE model compresses the sensing data by using its mathematical characteristics. In our experiments, we explore the performance of the model and compare it with other compression methods, and we propose an energy optimization method of the model to reduce the energy consumption of calculation. The contributions of this paper are summarized as follows:

•
We developed a hybrid model named Stacked RBM-AE that combines four standard RBMs with the feature of auto-encoder to compress and reconstruct the sensing data.

•
We proposed a new method of data compression, which has better reconstruction accuracy than the traditional algorithm under the same compression ratio.

•
We proposed an energy optimization method by pruning the model parameters, and we analyzed the efficiency of pruning different proportion model parameters on the reconstruction accuracy of the Stacked RBM-AE model under the same compression ratio.
The remainder of this paper is organized as follows: Section 2 introduces the architecture of the Stacked RBM-AE, and the details of Stacked RBM-AE training. Section 3 discusses the results of the experiments. Specially, we conduct several tricks on energy optimization. In Section 4, we sum up our algorithm, and discus further work.

Architecture of the Stacked RBM-AE
The Stacked RBM-AE model contains two parts. The first part of the Stacked RBM-AE model we call encoder E, which includes four standard RBMs (in Figures 1 and 2).  Each standard RBM is an undirected graph model, which includes two layers. The vector ( 1 , 2 , … , ) represents the visual layer, which is the input vector with n scalar components from the training dataset; The vector (ℎ , ℎ , … , ℎ ) represents the hidden layer vector with m scalar components. The standard RBM takes the state space ( , ) ∈ {0,1} + . The goal of RBM is to find ( ) the unknown true high dimensional distribution of the visual layer variables [30]. To achieve this goal, the high-dimensional distribution of the training dataset is modeled to get ( | ), where the sample distribution depends on the model parameters [31]. For the second part of the Stacked RBM-AE model, firstly we use the characteristics of autoencoder to flip the encoder to get a symmetrical scale decoding output named decoder ; the structure is shown in Figure 3. The initial input of the encoder is used as the input of the multilayer auto-encoder. Our goal is to satisfy the top output approximately equal to the bottom input.  Each standard RBM is an undirected graph model, which includes two layers. The vector ( 1 , 2 , … , ) represents the visual layer, which is the input vector with n scalar components from the training dataset; The vector (ℎ , ℎ , … , ℎ ) represents the hidden layer vector with m scalar components. The standard RBM takes the state space ( , ) ∈ {0,1} + . The goal of RBM is to find ( ) the unknown true high dimensional distribution of the visual layer variables [30]. To achieve this goal, the high-dimensional distribution of the training dataset is modeled to get ( | ), where the sample distribution depends on the model parameters [31]. For the second part of the Stacked RBM-AE model, firstly we use the characteristics of autoencoder to flip the encoder to get a symmetrical scale decoding output named decoder ; the structure is shown in Figure 3. The initial input of the encoder is used as the input of the multilayer auto-encoder. Our goal is to satisfy the top output approximately equal to the bottom input. Each standard RBM is an undirected graph model, which includes two layers. The vector v (v 1 , v 2 , . . . , v n ) represents the visual layer, which is the input vector with n scalar components from the training dataset; The vector h (h 1 , h 2 , . . . , h m ) represents the hidden layer vector with m scalar components. The standard RBM takes the state space (v, h) ∈ {0, 1} m+n . The goal of RBM is to find P data (v) the unknown true high dimensional distribution of the visual layer variables [30]. To achieve this goal, the high-dimensional distribution of the training dataset is modeled to get P(v|θ ), where the sample distribution depends on the model parameters θ [31].
For the second part of the Stacked RBM-AE model, firstly we use the characteristics of auto-encoder to flip the encoder E to get a symmetrical scale decoding output named decoder D; the structure is shown in Figure 3. The initial input of the encoder E is used as the input of the multilayer auto-encoder. Our goal is to satisfy the top output approximately equal to the bottom input. is the real-valued matrix of × , which represents the weights of the units connected; is the bias of visual layer units in × 1 size; is the bias of the hidden layer units in × 1 size. We use to represent the weights between and ℎ . represents the bias of . represents the bias of ℎ .
In the pre-training phase, we use CD algorithm [32] to train four standard RBMs of different sizes in turn. Algorithm 1 shows the procedure of the pre-training of standard RBM. It first calculates the activation probability of all hidden layer units and the activation probability of all visual layer units, and then updates model parameters by minimizing the model loss.

Algorithm 1: Pre-training of standard RBM
Input: A mini-batch of training data set S, the number of train iteration iter, the learning rate α Initialization: , , = 0 While i < iter do The hidden layer units activation probability = ( + ) Get ( ) by one step Gibbs sampling for The visible layer units activation probability = ( + ( ) ) Update parameters , , according to (7)- (9) i + + end For the standard RBM, the activation state of each hidden layer unit is conditionally independent for a given state of the visible layer unit. So, we can get the activation probability of ℎ :

Pre-training of standard RBM
For the standard RBM, v denotes the activation probability of all visual units, and h denotes the activation probability of all hidden layer units. The model parameter θ contains three parameters: W, b and c. v and h are the binary column vectors of 1 × n and 1 × m, respectively; W is the real-valued matrix of n × m, which represents the weights of the units connected; b is the bias of visual layer units in n × 1 size; c is the bias of the hidden layer units in m × 1 size. We use W ij to represent the weights between v i and h j . b i represents the bias of v i . c j represents the bias of h j .
In the pre-training phase, we use CD algorithm [32] to train four standard RBMs of different sizes in turn. Algorithm 1 shows the procedure of the pre-training of standard RBM. It first calculates the activation probability of all hidden layer units and the activation probability of all visual layer units, and then updates model parameters by minimizing the model loss. The visible layer units activation probability v = sigmoid b + Wh (1) Update parameters W, b, c according to (7)-(9) i + + end For the standard RBM, the activation state of each hidden layer unit is conditionally independent for a given state of the visible layer unit. So, we can get the activation probability of h j : Similarly, when a state of a hidden layer unit is given, the activation probability of the visual layer unit is also conditionally independent: We define the energy function as: The energy function means that there is an energy value between each visual layer unit and each hidden layer unit. By indexing and regularizing the energy function, we get the joint probability of v and h: We use the likelihood function derivation method to get the solution of parameters. The joint probability distribution P(v, h|θ) is known. By summing all the states of the hidden layer unit, the edge distribution P(v|θ ) of the visible layer unit set can be obtained: The representation based on the energy function is simply an alternative to the product representation of the factor [33]. According to Equation (5), we can get θ by maximizing P(v|θ ).
As ln(x) is increnebted, arg max θ P(v|θ ) and arg max θ ln(P(v|θ )) are equivalent. Thus, we define the train goal function L(θ) as: T is the number of train set. arg min θ L(θ) uses the gradient descent method [34]. In the direction of ∂L(θ) ∂(θ) , L(θ) becomes the fastest, we can find the minimum value of function by finding an optimal parameter adjustment stride in this direction.
We use α to represent the learning rate of the algorithm, and the parameters update formula can be defined as: v (k) is the state of v after k step Gibbs sampling.

Retraining of Stacked RBM-AE
We initialize the Stacked RBM-AE model by using pre-trained standard RBM parameters, but the pre-trained parameters are only applicable to standard RBM, and we do not achieve good performance on the Stacked RBM-AE model. Therefore, we use the pre-trained parameters to retrain the Stacked RBM-AE model and use the BP algorithm [35] to fine tune the model parameters. Algorithm 2 shows the procedure of the retraining of Stacked RBM-AE. It first calculates the encoder output e and the decoder output, and then calculates the model loss according to Equation (13), finally updating model parameters. The encoder output and the decoder output are calculated by the Stacked RBM. The encoder output e according to (10) Get e (1) by one step Gibbs sampling for e The decoder output d according to (11) The model loss L according to (13) Update parameters W, b, c according to (7)- (9) i + + end The encoder output of the Stacked RBM-AE model is calculated by four different sizes of RBMs in turn, as shown in Figure 2. We use W i , c i and b i to represent the parameters of the i layer standard RBM in the model. The encoder output is calculated by: where h i is the activation probability of all hidden layer units in the i th RBM. Similarly, the decoder output is calculated by: where v i is the activation probability of all visual layer units in the i th RBM. For the input data normalized to [0,1], they will be given the probability meaning that each neural unit value of the Stacked RBM-AE model still obeys the Bernoulli distribution of {0,1}. The normalized real value of the operation is considered to be the probability that the current unit has a value of 1. Then for a single training sample, the objective or cost function that minimizes the cross entropy is: For the training dataset, the loss of the Stacked RBM-AE model is defined as: where l is layer number. sl represents the unit numbers of l th layer, tl and tla represents the current layer number and the next layer number. The second term is the weights penalty item, which is used to prevent over-fitting. λ is the regularization parameter in the penalty item.

Preprocessing Dataset
The experimental data was collected by the Intel Lab wireless sensor network research team from the University of California, who placed 54 sensors to collect environmental data information in the laboratory from 28 February to 5 April 2004. The sensor node distribution by Intel Labs is shown in Figure 4. The weather board sensor collects time stamp topology information as well as temperature, humidity, light and voltage values every 31 s. Due to node failure, there are several temperature values above 100 • C and below −30 • C. We, therefore, preprocessed the temperature data by taking the threshold of −5 • C and 45 • C by a priori knowledge.
In order to reduce the difference between the input data of the compressed model, and converge the algorithm more quickly, we mapped the original data to [0, 1] by max − min normalization, where max is the maximum value of the node sensing data, min is minimum value.
where is layer number. represents the unit numbers of ℎ layer, and represents the current layer number and the next layer number. The second term is the weights penalty item, which is used to prevent over-fitting. λ is the regularization parameter in the penalty item.

Preprocessing Dataset
The experimental data was collected by the Intel Lab wireless sensor network research team from the University of California, who placed 54 sensors to collect environmental data information in the laboratory from 28 February to 5 April 2004. The sensor node distribution by Intel Labs is shown in Figure 4. The weather board sensor collects time stamp topology information as well as temperature, humidity, light and voltage values every 31 s. Due to node failure, there are several temperature values above 100 °C and below −30 °C . We, therefore, preprocessed the temperature data by taking the threshold of −5 °C and 45 °C by a priori knowledge.
In order to reduce the difference between the input data of the compressed model, and converge the algorithm more quickly, we mapped the original data to [0, 1] by − normalization, where is the maximum value of the node sensing data, is minimum value. * = − − (14) After the preprocessing is completed, the node temperature database is converted to a data set. Each node collects temperature data as a column vector, and stores this data in the temperature.txt file. Due to node failure, some nodes only recorded a small amount of data. The average number of data by each node after preprocessing is 29,665. The maximum number of data in all nodes is 55,080, and the minimum number is 2507. We divide each node data into two parts: The training set and the test set. In order to ensure that each node has enough training data to train the model, we did not use common split standards such as 7:3 and 6:2:2, and instead used 9:1.
The more the number of samples in the training set, and the closer the empirical distribution is to the true distribution, the more accurate will the fitness of model distribution be. Increasing the number of training sets is an effective measure to prevent over-fitting. We provide following three methods: (1) Obtain data from the source; (2) estimate the data distribution parameters of the training set by statistical methods such as point estimation and interval estimation, and use this distribution After the preprocessing is completed, the node temperature database is converted to a data set. Each node collects temperature data as a column vector, and stores this data in the temperature.txt file. Due to node failure, some nodes only recorded a small amount of data. The average number of data by each node after preprocessing is 29,665. The maximum number of data in all nodes is 55,080, and the minimum number is 2507. We divide each node data into two parts: The training set and the test set. In order to ensure that each node has enough training data to train the model, we did not use common split standards such as 7:3 and 6:2:2, and instead used 9:1.
The more the number of samples in the training set, and the closer the empirical distribution is to the true distribution, the more accurate will the fitness of model distribution be. Increasing the number of training sets is an effective measure to prevent over-fitting. We provide following three methods: (1) Obtain data from the source; (2) estimate the data distribution parameters of the training set by statistical methods such as point estimation and interval estimation, and use this distribution to generate more data; (3) augment the training set by interpolation in the original training set, such as Kriging Interpolation and Natural Neighbor Interpolation.

Compression and Reconstruction
We used the following performance criteria to evaluate the performance of our compression algorithm: (1) compression ratio (CR); (2) percentage RMS difference (PRD) [26]; (3) quality-score (QS) [26]. The definitions and formulas of these performance criteria are as follows: (1) Compression Ratio (CR): It is defined as the compressed data length over the size of uncompressed data, as shown in Equation (15).
where D or is the number of bytes of all original data. D cp is the number of bytes of all compressed data. The CR value is expected to be high for an effective compression algorithm.
(2) Percentage RMS Difference (PRD): It is a widely used performance measure that is used for calculating the quality of reconstructed data in the compression. The PRD value is expected to be as low as possible for a quality compression approach.
where S o represents the original input data, S r represents the reconstructed data.
(3) Quality-Score (QS): Other important evaluation criteria for determining the effectiveness of compression algorithms is the QS value. QS is the ratio of CR to PRD. It represents the reconstructed data quality. The larger the QS value, the better the compression and reconstruction performance.
The learning rate determines how far the weights move in the gradient direction in a mini-batch, which is usually set by the experimenter. If the learning rate is small, the training will become more reliable, but optimization will take a long time, because each step towards the minimum of the loss function is small. If the learning rate is big, the training may not converge and could also diverge. Optimization will cross the minimum value and cause loss function to become worse. There are many ways to set an initial value for the learning rate. A simple solution is to try a few different values to see which value will optimize the loss function without loss in training speed. In the pre-training phase, we refer to the parameters in Ref. [36], set the learning rate of 0.01/batch-size (the batch-size is 120). In the retraining phase, we start with a value of 0.1, then exponentially reduce the learning rate to 0.01, 0.001 and 0.0001. In order to find the optimal learning rate, we set different learning rates to train the model and record the summation of loss value when the model loss is no longer reduced [37]. Figure 5 shows the loss value of a model with different learning rates. We sought to find a point with the smallest value of model loss. In our experiment, we found that when the learning rate was between 0.0001 and 0.001, the model loss value was the smallest. We finally selected a learning rate of 0.0001.
We first use the training set of node 7 to pre-train the model without retraining, and test the efficiency of a different number of pre-training iterations to the compression performance of the model. Then we use the test set of node 7 to calculate the PRD value and the QS value of the model under different numbers of pre-training iterations. Similarly, we use the training set of node 7 to retrain the model with a fixed number of pre-training iterations. We use the test set of node 7 to calculate the PRD value and the QS value of the model under different numbers of retraining iterations. During the test, we added the PRD value and the QS value of each mini-batch of the test set, and then averaged the sum value as the final result. The number of mini-batches of the test set is 325. The results are shown in Figure 6. Without retraining, the compression performance of the model will not increase and could even decrease with an increase in the number of pre-training iterations. When the number of pre-training iterations is 10 and the number of retraining iterations is 0, the compression performance of the model is optimal with the smallest PRD value and the largest QS value. Thus, we set the number of pretraining iterations of our model to 10. When the number of pre-training iterations is fixed, increasing the number of retraining iterations can significantly improve the compression performance of the model. When the number of retraining iterations reaches 200 and above, the compression performance of the model tends to be stable. At this time, increasing the number of retraining  Without retraining, the compression performance of the model will not increase and could even decrease with an increase in the number of pre-training iterations. When the number of pre-training iterations is 10 and the number of retraining iterations is 0, the compression performance of the model is optimal with the smallest PRD value and the largest QS value. Thus, we set the number of pre-training iterations of our model to 10. When the number of pre-training iterations is fixed, increasing the number of retraining iterations can significantly improve the compression performance of the model. When the number of retraining iterations reaches 200 and above, the compression performance of the model tends to be stable. At this time, increasing the number of retraining iterations does not significantly improve the compression performance of the model. Considering that the model needs to be calculated on the sensor node, increasing the number of retraining iterations will lead to an increase in calculation energy consumption. We finally selected the number of retraining iterations as 200.
In this experiment, we test the compression performance of the model under different CRs. This experiment data uses the data of node 7 with the number of pre-training iterations of 10, and the number of retraining iterations of 200. We then set CR as 10, 20, 40, and 120. We use the test set of node 7 to calculate the PRD value and the QS value of the model under different CRs. During the test, we summed up the PRD value and the QS value of each mini-batch of the test set, and then averaged the sum value as the final result. The number of mini-batches of the test set is 325. Figure 7 shows the compression performance of the model under different CRs.  illustrated that for a single node, with the increase of the CR, the compression performance of the model is not significantly increased or decreased, which shows that the model can imbibe the inherent properties of the data. These inherent properties are inherently weighted on the weight matrix and are independent of the dimension after compression. This is the difference between the deep compression method and the shallow compression method, since the reconstruction error of the shallow compression method increases with an increase in the CR. The result means that our algorithm can get a higher CR in the case of minimal reconstruction error. In our experiments, when CR was 10, the PRD value was the smallest. The reconstructed data value is closest to the original data value. Although increasing CR can significantly increase the QS value, the PRD value will also increase, which represents a difference between the reconstructed data, and the original data becomes larger. In this experiment, we explore the optimal compression performance of the model, and in the next experiment, we set the CR to 10 to explore the reconstruction performance of the model. Figure 8 shows the reconstructed data and the original data of node 7 for our model, with the number of pre-training iterations being 10, the number of retraining iterations being 200, and CR of 10. We first use the model to compress the original data, and then use the model to reconstruct the compressed data. For all the data in the test set of node 7, we sum up the absolute value error between the original data and the reconstructed data, and then average the sum value as the final result. The average absolute value error obtained was 0.2815 °C , maximum value was 0.4602 °C , and minimum value was 0.0026 °C . Our model has been proven to have higher reconstruction accuracy, and the  illustrated that for a single node, with the increase of the CR, the compression performance of the model is not significantly increased or decreased, which shows that the model can imbibe the inherent properties of the data. These inherent properties are inherently weighted on the weight matrix and are independent of the dimension after compression. This is the difference between the deep compression method and the shallow compression method, since the reconstruction error of the shallow compression method increases with an increase in the CR. The result means that our algorithm can get a higher CR in the case of minimal reconstruction error. In our experiments, when CR was 10, the PRD value was the smallest. The reconstructed data value is closest to the original data value. Although increasing CR can significantly increase the QS value, the PRD value will also increase, which represents a difference between the reconstructed data, and the original data becomes larger. In this experiment, we explore the optimal compression performance of the model, and in the next experiment, we set the CR to 10 to explore the reconstruction performance of the model. Figure 8 shows the reconstructed data and the original data of node 7 for our model, with the number of pre-training iterations being 10, the number of retraining iterations being 200, and CR of 10. We first use the model to compress the original data, and then use the model to reconstruct the compressed data. For all the data in the test set of node 7, we sum up the absolute value error between the original data and the reconstructed data, and then average the sum value as the final result. The average absolute value error obtained was 0.2815 • C, maximum value was 0.4602 • C, and minimum value was 0.0026 • C. Our model has been proven to have higher reconstruction accuracy, and the reconstructed data can correctly approximate the trend and value of the original data. In the following experiments, we compare the performance of our algorithm with other compression algorithms. Performing CS algorithm on 40,000 data points for node 7. Since the length of stream data for CS algorithm cannot be too long, we divide these points into eight segments with segment length of 5000. We average the results of all segments as the final result of CS algorithm, and set the CR at 10. The results are shown in Table 1. At the same time, we test the performance of our algorithm on different data sets. The results are shown in Table 2.

Transfer Learning
In order to verify the generalization performance of the Stacked RBM-AE model, we use the data of node 7 to train the model, and then use the trained model to test the compression and reconstruction performance of all nodes. This process in deep learning is called transfer learning. At the same time, each node is trained separately to obtain the performance of the model of each node. We test the performance of the model after 10 pre-training iterations and 200 retraining iterations. We set the CR to 10 and the learning rate to 0.0001. We record the PRD value, QS value and reconstruction error of the model of each node. The PRD value and QS value of each node both are the average after summing up the value of PRD and QS of each mini-batch. The reconstruction error is the average after summing the absolute value error between the original data and the reconstructed data of each mini-batch. The red lines in Figures 9-11 show the compression and reconstruction performance of the model of each node. The compression and reconstruction performance of each node on testing with parameters of node 7 are shown in blue lines in Figures 9-11. In the following experiments, we compare the performance of our algorithm with other compression algorithms. Performing CS algorithm on 40,000 data points for node 7. Since the length of stream data for CS algorithm cannot be too long, we divide these points into eight segments with segment length of 5000. We average the results of all segments as the final result of CS algorithm, and set the CR at 10. The results are shown in Table 1. At the same time, we test the performance of our algorithm on different data sets. The results are shown in Table 2.

Transfer Learning
In order to verify the generalization performance of the Stacked RBM-AE model, we use the data of node 7 to train the model, and then use the trained model to test the compression and reconstruction performance of all nodes. This process in deep learning is called transfer learning. At the same time, each node is trained separately to obtain the performance of the model of each node. We test the performance of the model after 10 pre-training iterations and 200 retraining iterations. We set the CR to 10 and the learning rate to 0.0001. We record the PRD value, QS value and reconstruction error of the model of each node. The PRD value and QS value of each node both are the average after summing up the value of PRD and QS of each mini-batch. The reconstruction error is the average after summing the absolute value error between the original data and the reconstructed data of each mini-batch. The red lines in Figures 9-11 show the compression and reconstruction performance of the model of each node. The compression and reconstruction performance of each node on testing with parameters of node 7 are shown in blue lines in Figures 9-11.   We can see from the red lines in Figures 9 and 10 that the model has good compression performance for all nodes. For most nodes, the PRD value of the model is less than 20. There are some spikes in the red line in Figure 9, such as in node 18 and node 33. For this spike parts of the red line, we analyzed the model loss value during the training process, and find that the training loss value of the model is very small, reaching 3.7652, while the training loss value of most nodes are between 8 and 10. The phenomenon of low training error and high test error shows that the model has been over-fitted. Since we use a fixed number of training iterations in our experiments, the number of training iterations exceeds the optimal number of training iterations of the model for some nodes. By reducing the number of training iterations, the phenomenon of model over-fitting can be avoided effectively. The red line in Figure 11 shows that the error of data reconstruction is less than 1 for most nodes. Except those over-fitting nodes, the minimum and maximum error of data reconstruction can reach 0.2835 °C and 0.9303 °C for all nodes. Changing data cannot cause a significant drop in the compression and reconstruction performance of the model. The blue line shows that the model trained with the data of node 7 is not very suitable for most other nodes. Compared to the red lines, the PRD value and reconstruction error value are increased. For all nodes, the biggest difference of the PRD value between red line and blue line can reach 15.37, and the smallest difference is 0.16. The minimum and maximum error of the digital difference between the data reconstruction of red line and blue line is 0.0176 °C and 0.5289 °C , respectively. For nodes located near node 7, such as node 6, node 8 and node 9, the reconstruction error varies little and the performance of the model does not decrease significantly by using model parameters that are not corresponding to the nodes themselves.   We can see from the red lines in Figures 9 and 10 that the model has good compression performance for all nodes. For most nodes, the PRD value of the model is less than 20. There are some spikes in the red line in Figure 9, such as in node 18 and node 33. For this spike parts of the red line, we analyzed the model loss value during the training process, and find that the training loss value of the model is very small, reaching 3.7652, while the training loss value of most nodes are between 8 and 10. The phenomenon of low training error and high test error shows that the model has been over-fitted. Since we use a fixed number of training iterations in our experiments, the number of training iterations exceeds the optimal number of training iterations of the model for some nodes. By reducing the number of training iterations, the phenomenon of model over-fitting can be avoided effectively. The red line in Figure 11 shows that the error of data reconstruction is less than 1 for most nodes. Except those over-fitting nodes, the minimum and maximum error of data reconstruction can reach 0.2835 °C and 0.9303 °C for all nodes. Changing data cannot cause a significant drop in the compression and reconstruction performance of the model. The blue line shows that the model trained with the data of node 7 is not very suitable for most other nodes. Compared to the red lines, the PRD value and reconstruction error value are increased. For all nodes, the biggest difference of the PRD value between red line and blue line can reach 15.37, and the smallest difference is 0.16. The minimum and maximum error of the digital difference between the data reconstruction of red line and blue line is 0.0176 °C and 0.5289 °C , respectively. For nodes located near node 7, such as node 6, node 8 and node 9, the reconstruction error varies little and the performance of the model does not decrease significantly by using model parameters that are not corresponding to the nodes themselves.   We can see from the red lines in Figures 9 and 10 that the model has good compression performance for all nodes. For most nodes, the PRD value of the model is less than 20. There are some spikes in the red line in Figure 9, such as in node 18 and node 33. For this spike parts of the red line, we analyzed the model loss value during the training process, and find that the training loss value of the model is very small, reaching 3.7652, while the training loss value of most nodes are between 8 and 10. The phenomenon of low training error and high test error shows that the model has been over-fitted. Since we use a fixed number of training iterations in our experiments, the number of training iterations exceeds the optimal number of training iterations of the model for some nodes. By reducing the number of training iterations, the phenomenon of model over-fitting can be avoided effectively. The red line in Figure 11 shows that the error of data reconstruction is less than 1 for most nodes. Except those over-fitting nodes, the minimum and maximum error of data reconstruction can reach 0.2835 °C and 0.9303 °C for all nodes. Changing data cannot cause a significant drop in the compression and reconstruction performance of the model. The blue line shows that the model trained with the data of node 7 is not very suitable for most other nodes. Compared to the red lines, the PRD value and reconstruction error value are increased. For all nodes, the biggest difference of the PRD value between red line and blue line can reach 15.37, and the smallest difference is 0.16. The minimum and maximum error of the digital difference between the data reconstruction of red line and blue line is 0.0176 °C and 0.5289 °C , respectively. For nodes located near node 7, such as node 6, node 8 and node 9, the reconstruction error varies little and the performance of the model does not decrease significantly by using model parameters that are not corresponding to the nodes themselves. We can see from the red lines in Figures 9 and 10 that the model has good compression performance for all nodes. For most nodes, the PRD value of the model is less than 20. There are some spikes in the red line in Figure 9, such as in node 18 and node 33. For this spike parts of the red line, we analyzed the model loss value during the training process, and find that the training loss value of the model is very small, reaching 3.7652, while the training loss value of most nodes are between 8 and 10. The phenomenon of low training error and high test error shows that the model has been over-fitted. Since we use a fixed number of training iterations in our experiments, the number of training iterations exceeds the optimal number of training iterations of the model for some nodes. By reducing the number of training iterations, the phenomenon of model over-fitting can be avoided effectively. The red line in Figure 11 shows that the error of data reconstruction is less than 1 for most nodes. Except those over-fitting nodes, the minimum and maximum error of data reconstruction can reach 0.2835 • C and 0.9303 • C for all nodes. Changing data cannot cause a significant drop in the compression and reconstruction performance of the model. The blue line shows that the model trained with the data of node 7 is not very suitable for most other nodes. Compared to the red lines, the PRD value and reconstruction error value are increased. For all nodes, the biggest difference of the PRD value between red line and blue line can reach 15.37, and the smallest difference is 0.16. The minimum and maximum error of the digital difference between the data reconstruction of red line and blue line is 0.0176 • C and 0.5289 • C, respectively. For nodes located near node 7, such as node 6, node 8 and node 9, the reconstruction error varies little and the performance of the model does not decrease significantly by using model parameters that are not corresponding to the nodes themselves. The digital difference of reconstruction error of node 6, node 8 and node 9 is 0.0875 • C, 0.1127 • C and 0.0720 • C, respectively. This means that for some areas of the WSNs with strong spatial correlation, when a node in the area is trained, all nodes in the area can share the trained model parameters. These results prove that our algorithm has a good transfer learning ability. The error reduction speed during the model training helps judge whether the training is completed or not. When the training error value drops slowly and approaches a stable value, it means that the training is complete. In our experiments, when the training of the model was completed, the number of training iterations was mostly less than 200. This can further reduce the storage and communication energy consumption of node model parameters. The transfer learning ability can be applied to the design of clustering routing protocol in WSNs.

Energy Optimization
For the use of the Stacked RBM-AE model in WSNs, we present a solution. The server first uses the node data to train the model and sends the trained model parameters to the node (only the parameters of the encoder need to be sent, because the node does not need to decode the data). After the node receives the trained model parameters, the node constructs and initializes the encoder by using the parameters. When the node collects the sensing data, the encoder is used to encode and compress the data. The compressed data is formed into a data packet. Then the node sends the data packet to the server. The server uses the decoder to decode and reconstruct the sensing data of the data packet. For some areas of the WSNs with strong spatial correlation, the server only needs to train one time of one node in the area to get the model parameters, and all nodes in the area share the parameters. Figure 12 shows the process. The digital difference of reconstruction error of node 6, node 8 and node 9 is 0.0875 °C , 0.1127 °C and 0.0720 °C , respectively. This means that for some areas of the WSNs with strong spatial correlation, when a node in the area is trained, all nodes in the area can share the trained model parameters. These results prove that our algorithm has a good transfer learning ability. The error reduction speed during the model training helps judge whether the training is completed or not. When the training error value drops slowly and approaches a stable value, it means that the training is complete. In our experiments, when the training of the model was completed, the number of training iterations was mostly less than 200. This can further reduce the storage and communication energy consumption of node model parameters. The transfer learning ability can be applied to the design of clustering routing protocol in WSNs.

Energy Optimization
For the use of the Stacked RBM-AE model in WSNs, we present a solution. The server first uses the node data to train the model and sends the trained model parameters to the node (only the parameters of the encoder need to be sent, because the node does not need to decode the data). After the node receives the trained model parameters, the node constructs and initializes the encoder by using the parameters. When the node collects the sensing data, the encoder is used to encode and compress the data. The compressed data is formed into a data packet. Then the node sends the data packet to the server. The server uses the decoder to decode and reconstruct the sensing data of the data packet. For some areas of the WSNs with strong spatial correlation, the server only needs to train one time of one node in the area to get the model parameters, and all nodes in the area share the parameters. Figure 12 shows the process. In our experiments, we use the model to compress the sensing data; when the number of data is 120, we set CR to 10. If the type of original sensing data is float, the original data size is 120 × 4 = 480 (byte), and the packet size after compression is 12 × 4 = 48 (byte). The energy of data transmission can be reduced by 90% for each node. Using the Stacked RBM-AE model to compress a data packet needs 18,737 floating point calculations (including multiplication and addition). We train the model by using NVIDIA GeForce GTX 1080 Ti and Tensorflow. Training the model take 181.54 ms when the number of model training iterations is 200 and the number of sensing data is 120. Compressing this sensing data takes 0.23 ms. We also test the floating-point calculation speed of STM32F103. When the main frequency of the STM32F103 is 150 M, STM32F103 will take 42 instruction cycles (0.22 ms) to In our experiments, we use the model to compress the sensing data; when the number of data is 120, we set CR to 10. If the type of original sensing data is float, the original data size is 120 × 4 = 480 (byte), and the packet size after compression is 12 × 4 = 48 (byte). The energy of data transmission can be reduced by 90% for each node. Using the Stacked RBM-AE model to compress a data packet needs 18,737 floating point calculations (including multiplication and addition). We train the model by using NVIDIA GeForce GTX 1080 Ti and Tensorflow. Training the model take 181.54 ms when the number of model training iterations is 200 and the number of sensing data is 120. Compressing this sensing data takes 0.23 ms. We also test the floating-point calculation speed of STM32F103. When the main frequency of the STM32F103 is 150 M, STM32F103 will take 42 instruction cycles (0.22 ms) to execute a floating-point calculation. In theory, STM32F103 will take 4122 ms to calculate the encoder floating point calculation of the model, where the main frequency is 150 M. We utilize the parameter pruning method to further reduce storage energy consumption and communication energy consumption. Figure 13 shows the energy optimization process. Figure 14 shows the parameter pruning process.  Figure 13 shows the energy optimization process. Figure 14 shows the parameter pruning process.  We judge the importance of the parameters in the model according to the absolute value of the parameters [38], and then calculate the number of pruning and prune threshold, according to the prune ratio. The parameters in the model whose absolute value is less than the prune threshold are removed from the network, and then we retrain the model to restore the performance of the model. Algorithm 3 shows the procedure of the pruning parameters. We first sort the trained model weights according to the absolute value, and then multiply the prune rate and the number of weight parameters to get the number of parameters that we need to prune. After sorting is completed, we expand the weights into a 1-D vector. The numerical value is found in the vector as the prune threshold according to the number of prune parameters. The weights are compared with the prune threshold value. When the absolute value of the weights is less than the prune threshold, set the value to 0. We get a simplified model that initialized the model parameters by using the pruned weights. In order to restore accuracy, this simplified model requires retraining to fine-tune weights, because we prune a portion of the original model parameters [39]. We then iterate the pruning and retraining processes until the model performance returns to the original performance.   Figure 13 shows the energy optimization process. Figure 14 shows the parameter pruning process.  We judge the importance of the parameters in the model according to the absolute value of the parameters [38], and then calculate the number of pruning and prune threshold, according to the prune ratio. The parameters in the model whose absolute value is less than the prune threshold are removed from the network, and then we retrain the model to restore the performance of the model. Algorithm 3 shows the procedure of the pruning parameters. We first sort the trained model weights according to the absolute value, and then multiply the prune rate and the number of weight parameters to get the number of parameters that we need to prune. After sorting is completed, we expand the weights into a 1-D vector. The numerical value is found in the vector as the prune threshold according to the number of prune parameters. The weights are compared with the prune threshold value. When the absolute value of the weights is less than the prune threshold, set the value to 0. We get a simplified model that initialized the model parameters by using the pruned weights. In order to restore accuracy, this simplified model requires retraining to fine-tune weights, because we prune a portion of the original model parameters [39]. We then iterate the pruning and retraining processes until the model performance returns to the original performance. We judge the importance of the parameters in the model according to the absolute value of the parameters [38], and then calculate the number of pruning and prune threshold, according to the prune ratio. The parameters in the model whose absolute value is less than the prune threshold are removed from the network, and then we retrain the model to restore the performance of the model. Algorithm 3 shows the procedure of the pruning parameters. We first sort the trained model weights according to the absolute value, and then multiply the prune rate and the number of weight parameters to get the number of parameters that we need to prune. After sorting is completed, we expand the weights into a 1-D vector. The numerical value is found in the vector as the prune threshold according to the number of prune parameters. The weights are compared with the prune threshold value. When the absolute value of the weights is less than the prune threshold, set the value to 0. We get a simplified model that initialized the model parameters by using the pruned weights. In order to restore accuracy, this simplified model requires retraining to fine-tune weights, because we prune a portion of the original model parameters [39]. We then iterate the pruning and retraining processes until the model performance returns to the original performance.

Algorithm 3: Pruning Parameters. ABS(x) indicate the absolute value for x. Sort(x) indicate sorting by the numerical value of x.
Input: The model weights W, the prune rate α, the number of prune iterations iter While i < iter do Sort(ABS(W)) Get m which is the number of elements in W Get the number of W that need pruning n = α × m Get prune threshold thr if W < thr then W = 0 end Update parameters W according to Algorithm 2 i + + end The network of the Stacked RBM-AE model has eight layers, and each layer of the network is a standard RBM. We use L i to represent the i layer standard RBM in the model. The distribution of the Stacked RBM-AE model parameters is shown in the Table 3. As we can see from Table 3, most of the model parameters are located in L 1 , L 2 , L 7 and L 8 . When we prune weights, we only need to prune the weights of the L 1 -L 4 , because the weights of the L 5 -L 8 is the transposition of the L 1 -L 4 weights. In order to explore the efficiency of pruning the weights of the different layers on the performance of the model, we separately prune different rates of the weights of each layer. We use the data of node 7 to train and test the model. During the test, we record the average value of PRD, QS and reconstruction error of each mini-batch of the test set. We use W i to represent the weights of the i th layer. Figure 15 shows the compression performance of the model under different prune rates of each layer.
When the prune rate is below 20%, pruning the weight respectively of any layer will not lead to the decline of the performance of the model. This means that there are redundant data in the weights of our model, which cannot affect the performance of the model. On the whole, the compression performance decreases with the increasing the prune rate. Removing too many parameters from the model will affect the data learning and fitting ability of the model. If we seek to increase the prune rate while maintaining the performance of the model, we need to choose an appropriate prune rate according to the model compression performance change.

Conclusions
In this paper, we propose a Stacked RBM-AE model to compress WSNs data by using RBM and AE. In order to improve compression performance, we design a model parameter adjustment strategy, which includes two parts: pre-training and retraining. We test the efficiency of the number of iterations of pre-training and retraining on the performance of the model by experiment studies. We also offer a solution to use the model for WSNs and discuss the computational efficiency of the model. Considering the calculation and communication energy consumption, we use the method of prune parameters to further optimize the energy consumption of our algorithm. Our experimental results show that our algorithm has better transfer learning ability, and has better reconstruction accuracy than the traditional algorithms under the same CR. The data communication energy consumption can be reduced by 90%.
The sensor node usually be equipped with multiple sensors to collect different environmental monitoring data. This method can be extended to perform joint compression and reconstruction of multi-stream data. Since Gibbs Sampling and k-step divergence are used to estimate the probability distribution of the reconstructed data in this method, the theoretical system error is inevitably introduced. Exploring how to reduce the systematic error of the model is the focus of our next research work. VAE, GAN or other mixed deep learning models can also be used in future research work, which is expected to reduce the systematic error caused by the model assumptions and further reduce the energy consumption for WSNs. We will also look at how to use deep neural networks for WSNs.

Conclusions
In this paper, we propose a Stacked RBM-AE model to compress WSNs data by using RBM and AE. In order to improve compression performance, we design a model parameter adjustment strategy, which includes two parts: pre-training and retraining. We test the efficiency of the number of iterations of pre-training and retraining on the performance of the model by experiment studies. We also offer a solution to use the model for WSNs and discuss the computational efficiency of the model. Considering the calculation and communication energy consumption, we use the method of prune parameters to further optimize the energy consumption of our algorithm. Our experimental results show that our algorithm has better transfer learning ability, and has better reconstruction accuracy than the traditional algorithms under the same CR. The data communication energy consumption can be reduced by 90%.