Entropy Correlation and Its Impacts on Data Aggregation in a Wireless Sensor Network

A correlation characteristic has significant potential advantages for the development of efficient communication protocols in wireless sensor networks (WSNs). To exploit the correlation in WSNs, the correlation model is required. However, most of the present correlation models are linear and distance-dependent. This paper proposes a general distance-independent entropy correlation model based on the relation between joint entropy and the number of members in a group. This relation is estimated using entropy of individual members and entropy correlation coefficients of member pairs. The proposed model is then applied to evaluate two data aggregation schemes in WSNs including data compression and representative schemes. In the data compression scheme, some main routing strategies are compared and evaluated to find the most appropriate strategy. In the representative scheme, with the desired distortion requirement, a method to calculate the number of representative nodes and the selection of these nodes are proposed. The practical validations showed the effectiveness of the proposed correlation model and data reduction schemes.


Introduction
Wireless sensor networks (WSNs) are collections of sensor nodes that cooperatively monitor the surrounding environment over large physical areas. Recent advances in the integration of micro-electro-mechanical systems and digital electronics, together with the development of wireless communications, have enabled the wide deployment of WSNs. Sensor nodes are now equipped with various sensing capabilities in space and time and with higher processing capacities, so they can satisfy requests from various modern applications. Because sensor nodes are low-cost, small, and powered by batteries that cannot be replaced, energy conservation is commonly recognized as the key challenge in designing and operating the network.
In typical WSN applications, sensors are normally deployed densely over the monitoring area to achieve satisfactory coverage [1]. As a result, multiple sensors record data from the same event in the sensing field, i.e., the data recorded by these sensors are correlated with each other. The existence of correlation has many significant potential advantages for the development of efficient communication protocols well-suited to the WSN paradigm. For example, thanks to the degree of correlation, data in a correlated region can be compressed with a high ratio, which reduces the communication load and saves energy [2][3][4]. Moreover, with high enough correlation, it may not be necessary for every sensor node in a correlation group to transmit its data to the sink. Instead, a smaller number of sensor measurements (representatives) might be adequate to communicate the event features to the sink within a certain reliability level [5].
Since most present correlation models are distance-dependent, it is better to look at the information contained in the data itself rather than to consider only attribute meta-data such as location and time.
A distance-independent, entropy-based correlation model is used in Reference [22]. However, instead of being calculated directly from real data, the entropy correlation coefficient is taken to be the Pearson linear correlation coefficient to reduce the computational complexity, which also reduces the generality of using entropy.
In Reference [23], with the aim of looking at the data itself, correlation is recognized through the relation between the joint entropy of a set and the number of nodes in the set. Consider the increase in the joint entropy of a group when one variable is added to it. If the added variable is highly correlated with the variables in the group, i.e., it strongly depends on them, the increase is small. In other words, only a small amount of additional information is needed to specify the added variable. Therefore, as the number of variables in a group grows, the rate of increase of the joint entropy gradually falls and approaches zero. Essentially, the joint entropy approaches a "saturation" state as the number of considered variables increases. The stronger the correlation among the nodes, the faster the joint entropy reaches the "saturation" state. This phenomenon is described in Figure 2. The speed of approaching the "saturation" state can therefore be used to specify a correlation level. In Reference [23], the joint entropy is calculated from real data and then approximated by an exponential function of the number of nodes in the set. However, this model can only be obtained once the correlated set has been established. There has not been an efficient way to recognize the correlation among nodes using the relation between joint entropy and the number of nodes in a set.
To obtain the relation described in Figure 2, in principle, the joint entropy values of all possible subgroups must be calculated, and an efficient process is required to check the relation between joint entropy and the number of nodes in all subgroups. That process has $O(2^n)$ complexity and requires a vast amount of computation, which is not feasible in practice.
In this paper, we focus on discovering and exploiting the general correlation in WSNs by using information entropy theory and by looking at the sensed data itself. To overcome the above difficulties in correlation recognition, we develop an approach that evaluates the joint entropy of a group as a function of the number of nodes in the group, using the entropy of individual nodes and the entropy correlation coefficients between pairs of nodes in the group. From this evaluation, a definition of a correlated region is proposed and a correlation-clustering scheme based on this definition is presented. Using the proposed clustering scheme, sensor nodes are clustered into correlated regions, in which data aggregation can then be performed to save energy. We consider the impact of the proposed correlation model on the development of two efficient data aggregation schemes for WSNs: data compression and representative aggregation. This paper corrects and extends our previous studies in References [24,25].
The remainder of the paper is organized as follows. In Section 2, the estimation of the joint entropy of a group is presented. The validation of this estimation is also done in this section. Section 3 proposes a definition of the correlated region and then the correlation clustering scheme is described. In Section 4, the entropy correlation model is used to evaluate the main compression and aggregation schemes for WSNs. Lastly, the conclusion and future works are presented in Section 5.

Entropy Concept
To explore the correlational characteristics between the collected data, the concept of information entropy is used. In this section, at first, we review the concept of entropy and mutual information [26].
In information theory, the entropy of a random variable is a function that characterizes the "unpredictability" or "uncertainty" of the variable. If a random variable X takes on values in a set $\mathcal{X} = \{x_1, x_2, \ldots, x_n\}$ and is described by a probability distribution P(X), then the entropy H(X) of the random variable X is written below.
$$H(X) = -\sum_{x \in \mathcal{X}} P(x) \log P(x) \tag{1}$$
The unit of entropy is the "bit" or the "nat", depending on whether $\log P(x)$ uses the base-2 logarithm or the natural logarithm. In this paper, the base-2 logarithm is used and, hence, entropy is the expected number of bits of information contained in each event, with the expectation taken over all possible values.
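As a small illustration of the entropy definition above, the following sketch estimates the entropy of a sampled signal in bits with a plug-in (relative-frequency) estimator. The function name `entropy_bits` and the sample sequences are hypothetical; real sensor readings would normally have to be quantized into discrete symbols first, and that quantization step is an assumption that is not specified here.

```python
from collections import Counter
from math import log2

def entropy_bits(samples):
    """Plug-in estimate of H(X) in bits from a sequence of discrete symbols."""
    n = len(samples)
    # H(X) = sum p(x) * log2(1 / p(x)), with p(x) estimated by relative frequency
    return sum((c / n) * log2(n / c) for c in Counter(samples).values())

uniform4 = [0, 1, 2, 3] * 64   # 256 samples, four equally likely symbols
constant = [7] * 256           # a signal with no uncertainty
print(entropy_bits(uniform4))  # 2.0
print(entropy_bits(constant))  # 0.0
```

Four equally likely symbols need two bits each, while a constant signal carries no information, which matches the reading of entropy as an expected number of bits.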
For several random variables, the number of bits of information is given by the joint entropy, which is the entropy of a joint probability distribution or of a multi-valued random variable. For n random variables $X_1, X_2, \ldots, X_n$, the joint entropy $H(X_1, X_2, \ldots, X_n)$ is defined by the equation below.
$$H(X_1, \ldots, X_n) = -\sum_{x_1} \cdots \sum_{x_n} P(x_1, \ldots, x_n) \log_2 P(x_1, \ldots, x_n) \tag{2}$$
where $x_1, x_2, \ldots, x_n$ are particular values of $X_1, X_2, \ldots, X_n$, respectively, and $P(x_1, \ldots, x_n)$ is the probability of these values occurring together. Now consider the case of two random variables X and Y. The relation between entropy and joint entropy is shown below.
$$\max[H(X), H(Y)] \le H(X, Y) \le H(X) + H(Y) \tag{3}$$
The right-hand equality holds if X and Y are independent. The inequality shows that when the information in X fully covers the information in Y, the joint entropy equals H(X) alone, whereas when the two variables are independent, the joint entropy equals the sum of the entropies of both variables. In all other cases, the joint entropy is smaller than the sum of the two entropies. However, joint entropy alone cannot be used to evaluate the level of information sharing between pairs of random variables because its value depends on the entropy of each individual variable.
Another metric for the mutual dependence between two random variables is mutual information. In general, it measures how much information one random variable conveys, on average, about another. The formal definition of the mutual information of two random variables X and Y whose joint distribution is P(X, Y) is given by the equation below.
$$I(X, Y) = \sum_{x} \sum_{y} P(x, y) \log_2 \frac{P(x, y)}{P(x) P(y)} \tag{4}$$
The relation between mutual information and entropy is given by the equation below.
$$I(X, Y) = H(X) + H(Y) - H(X, Y) \tag{5}$$
It is difficult to compare the correlation level between two pairs of random variables using mutual information or joint entropy because their values depend on the entropy of each individual variable in the pair. To overcome this problem, we use a normalized measure of mutual information called the entropy correlation coefficient [27], which is given below.
$$\rho(X, Y) = \frac{2 I(X, Y)}{H(X) + H(Y)} = 2 - \frac{2 H(X, Y)}{H(X) + H(Y)} \tag{6}$$
$\rho(X, Y)$ is called the entropy correlation coefficient of the two random variables X and Y and is expressed through either the mutual information I(X, Y) or the joint entropy H(X, Y). The entropy correlation coefficient describes the relationship within a pair of data independently of the individual entropy values and, therefore, it can be used to compare the correlation levels of two pairs of data. From inequality (3), it can be found that the entropy correlation coefficient ρ varies from 0 to 1 (see Appendix A.1 for more explanation). The larger the value of ρ, the higher the correlation is. If ρ = 1 (in case H(X) = H(Y) = H(X, Y)), the two sets of data totally depend on each other. If ρ = 0 (in case H(X, Y) = H(X) + H(Y)), they are independent.
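The two boundary cases can be reproduced numerically. The sketch below assumes the normalized form ρ(X, Y) = 2 I(X, Y) / (H(X) + H(Y)), which agrees with the cases stated above (ρ = 1 when H(X) = H(Y) = H(X, Y), ρ = 0 when H(X, Y) = H(X) + H(Y)). The function names and the toy sequences are hypothetical, and returning 0 for two constant signals is a convention chosen here, not something taken from the paper.

```python
from collections import Counter
from math import log2

def entropy_bits(symbols):
    """Plug-in entropy estimate in bits; symbols may be tuples (joint entropy)."""
    n = len(symbols)
    return sum((c / n) * log2(n / c) for c in Counter(symbols).values())

def entropy_correlation(x, y):
    """rho(X, Y) = 2 I(X, Y) / (H(X) + H(Y)), I = H(X) + H(Y) - H(X, Y)."""
    hx, hy = entropy_bits(x), entropy_bits(y)
    hxy = entropy_bits(list(zip(x, y)))   # joint entropy of the paired samples
    if hx + hy == 0:
        return 0.0   # both signals constant: coefficient undefined, 0 by convention
    return 2 * (hx + hy - hxy) / (hx + hy)

x = [0, 1, 2, 3] * 4
y_same = [v + 10 for v in x]                       # fully determined by x
y_indep = [0] * 4 + [1] * 4 + [2] * 4 + [3] * 4    # every (x, y) pair occurs once
print(entropy_correlation(x, y_same))    # 1.0
print(entropy_correlation(x, y_indep))   # 0.0
```

Because the coefficient is normalized by H(X) + H(Y), the two results do not change if the signals are recoded with different symbol values.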

Joint Entropy Estimation
Assume that there is a set of N data $\{X_1, X_2, \ldots, X_N\}$ in which the entropy of each data, $H(X_i)$, and the entropy correlation coefficient, $\rho_{ij} = \rho(X_i, X_j)$ for any $1 \le i \ne j \le N$, satisfy the following conditions.
$$H_{min} \le H(X_i) \le H_{max}, \qquad \rho_{min} \le \rho_{ij} \le \rho_{max} \tag{7}$$
The joint entropy is estimated based on the idea of hierarchical clustering [28]. For a group that has only one node, the entropy of the node is limited by Equation (7):
$$H_1 = H(X_i) \le H_{max} = k_1 H_{max}$$
where $k_1 = 1$. For a group of two nodes $X_i$ and $X_j$, the definition of the entropy correlation coefficient in Equation (6) gives the following formula.
$$H_2 = H(X_i, X_j) = \frac{2 - \rho_{ij}}{2} \left[ H(X_i) + H(X_j) \right]$$
In addition, since $\rho_{ij} \ge \rho_{min}$ and $H(X_i), H(X_j) \le H_{max}$,
$$H_2 \le (2 - \rho_{min}) H_{max}$$
or
$$H_2 \le k_2 H_{max}, \qquad k_2 = b = 2 - \rho_{min}.$$
The coefficient $k_2$ can also be rewritten as below.
$$k_2 = \frac{b}{2} (k_1 + 1)$$
For a group of three nodes $X_i$, $X_j$, and $X_k$, the two nodes $X_i$ and $X_j$ are first replaced by an equivalent node $X_{ij}$ whose entropy $H(X_{ij})$ equals $H(X_i, X_j)$, so that $H(X_{ij}) = H(X_i, X_j) \le k_2 H_{max}$. According to hierarchical clustering [19,28], the correlation coefficient between one cluster and another cluster can be taken as the smallest correlation coefficient from any member of one cluster to any member of the other cluster. Therefore, the equation below is formed.
$$\rho(X_{ij}, X_k) = \min(\rho_{ik}, \rho_{jk}) \ge \rho_{min}$$
Then,
$$H_3 = H(X_{ij}, X_k) = \frac{2 - \rho(X_{ij}, X_k)}{2} \left[ H(X_{ij}) + H(X_k) \right] \le \frac{b}{2} \left( k_2 H_{max} + H_{max} \right)$$
or
$$H_3 \le k_3 H_{max}$$
where $k_3 = \frac{b}{2}(k_2 + 1)$. Similarly, the joint entropy $H_m$ of a group with m nodes can be considered as the joint entropy of a sub-cluster with m − 1 nodes and the remaining node. The entropy of the sub-cluster is the joint entropy of its m − 1 nodes, and the entropy correlation coefficient between the sub-cluster and the remaining node is the greatest, smallest, or average correlation coefficient from any member of the sub-cluster to the remaining node. Thus, the following formula is obtained.
$$H_m \le k_m H_{max} \tag{16}$$
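where $k_m = \frac{b}{2}(k_{m-1} + 1)$. From this recurrence relation, the general formula to calculate $k_m$ can be obtained as follows (m > 2).
$$k_m = \left(\frac{b}{2}\right)^{m-1} + \sum_{i=1}^{m-1} \left(\frac{b}{2}\right)^{i} \tag{17}$$
or in the more compact way (in case $b \ne 2$):
$$k_m = \left(\frac{b}{2}\right)^{m-1} + \frac{b}{2 - b} \left[ 1 - \left(\frac{b}{2}\right)^{m-1} \right] \tag{18}$$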
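The closed form follows from unrolling the recurrence $k_m = \frac{b}{2}(k_{m-1} + 1)$ with $k_1 = 1$ and summing a geometric series. A short derivation, in which the shorthand $r = b/2$ is introduced here only for brevity:

```latex
\begin{aligned}
k_m &= r\,(k_{m-1} + 1), \qquad r = \tfrac{b}{2}, \quad k_1 = 1 \\
    &= r^{m-1} k_1 + \sum_{i=1}^{m-1} r^{i} \\
    &= r^{m-1} + \frac{r\,(1 - r^{m-1})}{1 - r}
       \qquad (r \neq 1,\ \text{i.e. } b \neq 2), \\
k_m &= m \qquad (r = 1,\ \text{i.e. } b = 2).
\end{aligned}
```

The case $b = 2$ corresponds to $\rho_{min} = 0$, where the bound reduces to $H_m \le m H_{max}$, as expected for nodes that may be independent.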

Joint Entropy Lower Bound
The lower bound of the joint entropy of a group with m nodes can be determined in a similar way to the upper bound. In this case, the correlation coefficient between one cluster and another cluster is taken as the greatest correlation coefficient from any member of one cluster to any member of the other cluster. The results are below.
For a group that has only one node, we have the following equation.
$$H_1 = H(X_i) \ge H_{min} = l_1 H_{min}, \qquad l_1 = 1 \tag{19}$$
For a group of m nodes (m ≥ 2):
$$H_m \ge l_m H_{min} \tag{20}$$
where $l_m = \frac{c}{2}(l_{m-1} + 1)$ with $c = 2 - \rho_{max}$. From this recurrence relation, the general formula to calculate $l_m$ can be obtained as follows (m > 2):
$$l_m = \left(\frac{c}{2}\right)^{m-1} + \sum_{i=1}^{m-1} \left(\frac{c}{2}\right)^{i} \tag{21}$$
or in the more compact way (in case $c \ne 2$):
$$l_m = \left(\frac{c}{2}\right)^{m-1} + \frac{c}{2 - c} \left[ 1 - \left(\frac{c}{2}\right)^{m-1} \right] \tag{22}$$
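Both bounds share the same recurrence and differ only in their arguments, so they can be evaluated with a single helper. The sketch below is a minimal illustration; the function names and all numeric inputs (H_min = 2.0, H_max = 2.5, ρ_min = 0.6, ρ_max = 0.9) are hypothetical values chosen for the example, not measured data.

```python
def bound_coefficient(rho, m):
    """Coefficient from k_m = (b/2) * (k_{m-1} + 1), k_1 = 1, with b = 2 - rho."""
    b = 2.0 - rho
    k = 1.0
    for _ in range(m - 1):
        k = (b / 2.0) * (k + 1.0)
    return k

def joint_entropy_bounds(m, h_min, h_max, rho_min, rho_max):
    """Return (lower, upper) = (l_m * H_min, k_m * H_max) for a group of m nodes."""
    lower = bound_coefficient(rho_max, m) * h_min   # l_m uses c = 2 - rho_max
    upper = bound_coefficient(rho_min, m) * h_max   # k_m uses b = 2 - rho_min
    return lower, upper

# Hypothetical inputs, for illustration only.
for m in (1, 2, 5, 11):
    lo, hi = joint_entropy_bounds(m, 2.0, 2.5, 0.6, 0.9)
    print(m, round(lo, 3), round(hi, 3))
```

Iterating the recurrence avoids the special case that the compact formula needs when the correlation coefficient is zero.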

Validation of Joint Entropy Estimation
To validate the proposed joint entropy estimation, two special cases are considered first. In the first case, all nodes completely depend on each other, i.e., all nodes measure the same information. In this case: $H(X_1) = H(X_2) = \ldots = H(X_m) = H$, and $\rho_{ij} = 1, \forall i, j = 1, 2, \ldots, m;\ i \ne j$. Thus $H_{min} = H_{max} = H$ and $\rho_{min} = \rho_{max} = 1$, and then $k_m = l_m = 1$. Using Equations (17) and (21), we have $H_m = H(X_1, X_2, \ldots, X_m) = H$. These results show that the estimated joint entropy equals the actual joint entropy in this case.
In the second case, all nodes are completely independent of each other. In this case, $\rho_{ij} = 0$ for all $i \ne j$, so that $\rho_{min} = \rho_{max} = 0$, $b = c = 2$, and $k_m = l_m = m$. Using Equations (16) and (20), we have the following formula.
$$m H_{min} \le H_m \le m H_{max}$$
This inequality is true because in this case $H_m = \sum_{i=1}^{m} H(X_i)$ and $H_{min} \le H(X_i) \le H_{max}$.
Moreover, to verify the above estimation of joint entropy in practice, sample data supplied by the Intel Berkeley Research Lab [21] is used. The sample data was collected from 54 sensors deployed in the Intel Berkeley Research lab between 28 February 2004 and 5 April 2004. Mica2Dot sensors with weatherboards collected time-stamped topology information along with humidity, temperature, light, and voltage values once every 30 seconds. Data were collected using the TinyDB in-network query processing system built on the TinyOS platform. In this paper, temperature data is used as an example for validating the proposed estimation. A group of 11 nodes, named dataset 1, is chosen from 48 nodes with $\rho_{min} = 0.6$, $H_{min} = 2.16$, and $H_{max} = 2.55$. The node selection algorithm is presented in the next section. We choose 256 samples for each node to calculate entropy, joint entropy, and entropy correlation coefficients. The entropy of each node and the entropy correlation coefficient between each pair of nodes are shown in Tables 1 and 2, respectively. From these entropies and entropy correlation coefficients, the lower bound and the upper bound are calculated for subsets of the 11 nodes. Additionally, the practical joint entropy of each considered subset is calculated and compared with the estimated lower bound and upper bound. The results are shown in Table 3. The practical joint entropy of every subset lies between its lower bound and its upper bound. The above examples show the validity of the proposed estimation method.

Estimated Joint Entropy and Correlation
As mentioned in Reference [23], correlated nodes share a large amount of information among them. Therefore, their joint entropy does not increase much when the number of nodes in the group increases. In other words, the joint entropy goes to a "saturation" state when the number of nodes increases. On the other hand, from Equations (16) and (20), it can be seen that the upper bound and the lower bound functions of joint entropy have the same form and differ only in their arguments (the entropy value and the entropy correlation coefficient). If the differences between these arguments are small enough, the difference between the upper bound and the lower bound is small, which means that the real joint entropy value is close to its upper or lower bound. Therefore, the upper/lower bound function is chosen to estimate the joint entropy of the group.
Among these two arguments, the entropy correlation coefficient has a strong effect on the shape of the function. Figure 3 shows the estimated joint entropy with different values of an entropy correlation coefficient. It can be seen that, with a high enough value of an entropy correlation coefficient, the estimated joint entropy has the same characteristics as the calculated joint entropy of a correlation group. This means that the joint entropy will go to a "saturation" state when the number of nodes increases. The nodes with a higher correlation will approach the saturation state faster. From these results, it can be concluded that:


• It is acceptable to use a lower/upper bound function to estimate the joint entropy of a correlation group because they have similar characteristics of going to a "saturation" state when the number of nodes in the group increases.
• The entropy correlation coefficient of all pairs in the group can be represented by a correlation level of the group.
When correlated nodes are grouped, it is expected that the joint entropy of the group is as small as possible. The worst case of the joint entropy is its upper bound. Therefore, to evaluate the correlation of a group, the upper bound function should be used: if the upper bound satisfies a requirement, the practical joint entropy satisfies it as well. Figure 4 shows the upper bound and the practical joint entropy of dataset 1 from the previous section. Both show the joint entropy characteristics of a correlated group. The difference between them is caused by the differences among the node entropies and among the entropy correlation coefficients in the practical data.
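The saturation behaviour of the upper bound can be checked numerically. The sketch below assumes the recurrence $k_m = \frac{b}{2}(k_{m-1} + 1)$ with $k_1 = 1$ and $b = 2 - \rho$; its fixed point, $(2 - \rho)/\rho$ for ρ > 0, is a consequence derived here from that recurrence rather than a formula quoted from the paper. The function names and the three correlation levels are hypothetical.

```python
def k_coeff(rho, m):
    """k_m from k_m = (b/2) * (k_{m-1} + 1), k_1 = 1, b = 2 - rho."""
    r = (2.0 - rho) / 2.0
    k = 1.0
    for _ in range(m - 1):
        k = r * (k + 1.0)
    return k

def saturation_level(rho):
    """Fixed point of the recurrence: r / (1 - r) = (2 - rho) / rho, for rho > 0."""
    return (2.0 - rho) / rho

for rho in (0.2, 0.5, 0.8):   # illustrative correlation levels
    ks = [round(k_coeff(rho, m), 3) for m in (1, 5, 10, 20)]
    print(rho, ks, round(saturation_level(rho), 3))
```

The gap to the fixed point shrinks by a factor of $b/2 = 1 - \rho/2$ with every added node, so a higher correlation coefficient gives both a lower saturation level and a faster approach to it.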


Correlated Region Definition
As mentioned in Reference [10], sensor nodes in the same correlated region record information about a single event in the sensor field, i.e., their sensed data are correlated with each other. Since the sensed data are taken from the same event, the number of bits needed to represent the data should be the same, i.e., the entropies of the sensed data are similar. Likewise, the entropy correlation coefficients of all pairs in this region are similar.
Moreover, as shown in the last section, a group with these two properties, similar entropies and similar entropy correlation coefficients, is a correlated group. Therefore, the correlated region can be defined as follows.
Definition 1.
A correlated region with correlation level $\rho_0$ is a region in which the sensed data of all sensor nodes have the same entropy value and the entropy correlation coefficients between all pairs of nodes are also the same and equal to $\rho_0$.
However, in practical cases, it is difficult to find nodes whose entropies, or pairs of nodes whose entropy correlation coefficients, are exactly the same. Therefore, the correlated region can be defined in a more practical way, which is shown below.

Definition 2.
A group of m nodes $\{X_1, X_2, \ldots, X_m\}$ is in a correlated region with a correlation level $\rho_0$ if the entropies of all member nodes vary in a very small range around a value $H_0$ and the entropy correlation coefficients between all pairs of nodes are larger than or equal to $\rho_0$.
Here, $\Delta H$ denotes the entropy variation range, $H_0$ is called the "base entropy", and $\rho_0$ is called the "correlation level" of the data collected in the region. The higher the correlation level, the more strongly correlated the data collected in this region are. In this paper, a region with $\rho_0 \ge 0.5$ is called a highly correlated region.
With this definition, we can estimate the joint entropy of a group of m nodes $\{X_1, X_2, \ldots, X_m\}$ by the following equation.
$$H_m = k_m H_0$$
in which $k_m$ is calculated by using Equation (17) or Equation (18) with $b = 2 - \rho_0$. This equation is called the entropy correlation model. It will be used to analyze and evaluate the impact of the proposed correlation definition in the next section. One question that arises with Definition 2 is how the entropy variation range $\Delta H$ affects the precision of the joint entropy value. To answer this question, we consider the variation of the joint entropy value corresponding to the entropy variation. For simplicity, it is supposed that $\rho_0 = \rho_{ij} = \rho(X_i, X_j), \forall i \ne j$. From Definition 2, the real joint entropy value $H_{rm}$ of the group of m nodes $\{X_1, X_2, \ldots, X_m\}$ is then bounded in terms of $H_0$ and $\Delta H$, and this bound limits the difference between the real joint entropy value and the estimated joint entropy value. The smaller the entropy variation range $\Delta H$ is, the smaller the difference is. Therefore, the entropy variation range $\Delta H$ is chosen according to the required precision e (%) of the estimated joint entropy value.
In this paper, we require a precision of e = 85% for the estimated joint entropy value, i.e., an error of at most 15% between the real and the estimated values. Then $\Delta H \le 15\% H_0$.
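The link between the entropy variation range and the estimation error can be sketched as follows. This derivation assumes that the entropy condition of Definition 2 takes the symmetric form $|H(X_i) - H_0| \le \Delta H$ and that all pairs have the same coefficient $\rho_0$, so that the upper and lower bound coefficients coincide; both are assumptions made here for illustration.

```latex
\begin{aligned}
& H_0 - \Delta H \le H(X_i) \le H_0 + \Delta H, \quad \rho_{ij} = \rho_0
  \;\Rightarrow\; H_{min} \ge H_0 - \Delta H,\ \ H_{max} \le H_0 + \Delta H,\ \ l_m = k_m \\
& k_m (H_0 - \Delta H) \le H_{rm} \le k_m (H_0 + \Delta H) \\
& \frac{|H_{rm} - H_m|}{H_m} \le \frac{k_m \Delta H}{k_m H_0} = \frac{\Delta H}{H_0},
  \qquad H_m = k_m H_0
\end{aligned}
```

Under these assumptions the relative error does not depend on the group size m, and an error of at most 15% corresponds to $\Delta H \le 0.15 H_0$.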

Correlation Clustering Algorithm
Using the definition of the correlated region, a sensor field can be divided into correlated regions, each with a specified base entropy and correlation level. The clustering algorithm is described in Algorithm 1. First, an entropy range and a correlation level are chosen. Next, nodes whose entropy values lie in the entropy range are selected into a group. Then, the entropy correlation coefficients of all pairs in the group are calculated, and the node with the highest number of pairs that satisfy the correlation level is chosen as the core node. Next, nodes whose correlation coefficient with the core node is smaller than the correlation level are removed from the group. After that, the node with the highest number of pairs that do not satisfy the correlation level is removed, and this step is repeated until all pairs in the group satisfy the correlation level.
Algorithm 1.
[...]
Calculate entropy H(X_i)
ENDFOR
FOR each node X_i in the network
FOR each node X_j in the network and X_j ≠ X_i
[...]
Initialize new group G = Φ
[...]
FOR each node X_i in the network and not belonging to any group
[...]
FOR each node X_i in G
IF 0 < C(X_i) = max{C(X_j), X_j ∈ G}
Remove [...]
[...]
UNTIL max{C(X_j), X_j ∈ G} = 0
UNTIL all nodes are grouped
END
In step (*) of the algorithm, the base entropy and the correlation level are chosen so that they cover all possible values of entropy and of the entropy correlation coefficient in the network. The value of the entropy correlation coefficient should be chosen from high to low. In step (**) of the algorithm, if more than one node satisfies the condition 0 < C(X_i) = max{C(X_j), X_j ∈ G}, the node that has the maximum entropy value is removed.
It should be noted that the entropy of each individual node's data and the entropy correlation coefficient of each data pair must be known beforehand to implement the clustering. Therefore, a data acquisition period is required at the beginning to collect enough data. The calculation of entropy values and entropy correlation coefficients must be done before the clustering algorithm and the applications of correlation clustering.
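The pruning steps described in prose above can be sketched for a single choice of base entropy and correlation level. This is a minimal sketch under stated assumptions, not the paper's Algorithm 1: it handles one group only, it assumes the entropy range test has the form |H(X_i) − H_0| ≤ ∆H, and it takes the count of violating pairs per node as the quantity that drives removal. The function names and the four-node example are hypothetical.

```python
def extract_group(candidates, entropy, rho, rho0):
    """Prune candidates until every remaining pair has rho >= rho0."""
    group = list(candidates)
    if len(group) < 2:
        return group

    def satisfied(i):
        # number of partners of node i that meet the correlation level
        return sum(1 for j in group if j != i and rho[i][j] >= rho0)

    core = max(group, key=satisfied)   # core node: most satisfying pairs
    # drop nodes that are not correlated enough with the core node
    group = [i for i in group if i == core or rho[i][core] >= rho0]
    while True:
        # violating pairs per node (assumed meaning of the removal criterion)
        violations = {i: sum(1 for j in group if j != i and rho[i][j] < rho0)
                      for i in group}
        # most violations first; ties are broken by the larger entropy
        worst = max(group, key=lambda i: (violations[i], entropy[i]))
        if violations[worst] == 0:
            return group
        group.remove(worst)

def cluster_once(entropy, rho, h0, delta_h, rho0):
    """Select nodes in the entropy range (assumed symmetric), then prune."""
    in_range = [i for i, h in enumerate(entropy) if abs(h - h0) <= delta_h]
    return extract_group(in_range, entropy, rho, rho0)

entropy = [2.20, 2.30, 2.25, 3.50]   # hypothetical node entropies (bits)
rho = [[1.0, 0.7, 0.8, 0.1],         # hypothetical pairwise coefficients
       [0.7, 1.0, 0.4, 0.2],
       [0.8, 0.4, 1.0, 0.3],
       [0.1, 0.2, 0.3, 1.0]]
print(cluster_once(entropy, rho, h0=2.25, delta_h=0.2, rho0=0.6))   # [0, 2]
```

In the example, node 3 falls outside the entropy range, and node 1 is removed because its pair with node 2 violates the correlation level and it has the larger entropy of the two. A full clustering would repeat this over the ungrouped nodes with further entropy ranges and with correlation levels taken from high to low.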
Using the proposed correlation clustering algorithm, the sensor nodes in Section 2.2 are clustered as shown in Table 4. The network is divided into four groups. The first group includes 11 nodes with $\rho_0 = 0.6$. The second group includes six nodes with $\rho_0 = 0.6$ but in a different entropy value range. The third group includes eight nodes with $\rho_0 = 0.5$. All other nodes belong to the last group, which is weakly correlated. The first group is the one used as the example of a correlation group in Section 2. In addition to the joint entropy calculation, which shows the correlation among nodes in the established correlation group, Figure 5 shows the practical sensed data of the 11 nodes in the first group. The data of all nodes except node 33 are quite similar, i.e., they vary with time in the same way. The data of node 33 look different from the others, but their negation varies similarly to the others. Thus, all nodes are correlated with each other.

Validation
To validate the practical Definition 2 against the theoretical Definition 1, we compare the joint entropy estimated using Equation (29) with the joint entropy calculated from the practical data of the group selected by Definition 2. The result is shown in Figure 4. The two curves have similar shapes, but the estimated and practical values differ. As explained in the previous section, the difference is caused by the dissimilarity in entropy values and correlation coefficients of nodes in the practical environment. Most importantly, however, the correlation characteristic (the joint entropy approaches a "saturation" state as the number of nodes increases) is preserved.
To check this characteristic, the discrete derivative of the joint entropy with respect to the number of nodes in a group (∆H_n/∆n = H_n − H_{n−1}) is calculated for both the estimated and the practical joint entropy values and is shown in Figure 6. The derivatives are similar and decrease as the number of nodes increases in both cases. This means that the joint entropy approaches a "saturation" state as the number of nodes increases. The correlation characteristic is therefore preserved, and the model in Equation (29) can be used to estimate the joint entropy of a correlated group established by Definition 2.
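The discrete derivative used in this check can be sketched as follows. This is an illustrative sketch only: the node data are synthetic (a shared signal with occasional per-node perturbations), the function names are hypothetical, and the joint entropy is the empirical estimate from samples rather than the model of Equation (29).

```python
import random
from collections import Counter
from math import log2

def joint_entropy(columns):
    """Empirical joint entropy (bits) of a list of equally long symbol sequences."""
    samples = list(zip(*columns))
    total = len(samples)
    counts = Counter(samples)
    return -sum(c / total * log2(c / total) for c in counts.values())

def entropy_increments(columns):
    """Return [H_1, H_2 - H_1, ..., H_n - H_{n-1}] for nodes added in the given order."""
    increments = []
    previous = 0.0
    for n in range(1, len(columns) + 1):
        h_n = joint_entropy(columns[:n])
        increments.append(h_n - previous)
        previous = h_n
    return increments

random.seed(0)
# Synthetic shared event signal with 8 quantization levels (assumption).
base = [random.randrange(8) for _ in range(2000)]
# Each hypothetical node observes the shared signal, perturbed 10% of the time.
nodes = [[(b + (1 if random.random() < 0.1 else 0)) % 8 for b in base]
         for _ in range(6)]

increments = entropy_increments(nodes)
for n, delta in enumerate(increments, start=1):
    print(f"n = {n}: H_n - H_(n-1) = {delta:.3f} bits")
```

The first increment is the entropy of a single node. Later increments are the additional bits needed to specify each added node given those already in the group, and for correlated nodes they are smaller than the first.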
For complexity evaluation, if the relation between joint entropy and the number of nodes in a group is obtained by direct calculation, the joint entropy of every possible subgroup must be calculated, and the relation must then be checked for all subgroups. A group of n nodes has 2^n − 1 non-empty subgroups, so the complexity of the direct calculation is O(2^n). With the proposed approach, only the entropy of individual nodes and the entropy correlation coefficients of all pairs in the group need to be calculated, which has O(n^2) complexity. The proposed clustering algorithm, with O(n^3) complexity, then establishes the correlation subgroups. Thus, the overall complexity of the proposed correlation recognition is O(n^3), far below the O(2^n) of direct calculation.
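As a worked illustration of the gap between the two approaches, the following sketch counts the quantities involved for the 11-node group used as the example above. The function names are hypothetical, and the counts cover only the number of entropy evaluations, not the cost of each evaluation.

```python
from math import comb

def count_nonempty_subgroups(n):
    """Number of non-empty subgroups of n nodes, summed by subgroup size (2^n - 1)."""
    return sum(comb(n, k) for k in range(1, n + 1))

def count_pairs(n):
    """Number of node pairs whose entropy correlation coefficient is needed."""
    return comb(n, 2)

n = 11  # size of the first correlation group in the text
subgroups = count_nonempty_subgroups(n)
pairs = count_pairs(n)
print(f"n = {n}: {subgroups} subgroup joint entropies versus "
      f"{n} individual entropies and {pairs} pairwise coefficients")
```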

Data Aggregations Using Entropy Correlation
According to the aggregation strategy, data-level aggregation methods are divided into three types: the in-network query type, the data compression type, and the representative type [16]. Correlation is relevant to the data compression type and the representative type. This section applies the proposed entropy correlation model to data aggregation in WSNs, covering compression aggregation and representative aggregation. In this paper, it is assumed that correlated nodes are geographically close to each other. Thus, a correlated region may correspond to a geographical region.

Comparison of Compression Schemes
Using the correlation-based clustering algorithm, correlated nodes are grouped into a correlated region, and data compression can be applied to reduce the amount of data transferred in the network and thus save energy. Many compression techniques can be applied in a wireless sensor network [2][3][4].
In Reference [17], the impact of correlation on data compression was analyzed using a distance-based correlation model. In this paper, the impact of correlation on compression is evaluated using the same framework as Reference [17] but with our proposed correlation model.
To choose the most appropriate lossless compression approach, as in Reference [17], we evaluate three qualitatively different routing schemes: Distributed Source Coding [29], Routing Driven Compression [30,31], and Compression Driven Routing [32]. We consider sensor nodes arranged in a grid in which only the (2n − 1) nodes in the first column are sources, and there are n_1 hops on the shortest path between the sources and the sink [17]. The paths taken by the data and the intermediate aggregation of the three schemes are shown in Figure 7.
In Distributed Source Coding (DSC), the sensor nodes know about their correlations and can compress data to avoid transmitting redundant information. In this case, ideally, each source sends exactly the right amount of uncorrelated data to the sink along the shortest possible path without the need for intermediate compression. The energy consumption for this scheme is denoted E_DSC.
In Routing Driven Compression (RDC), the sensor nodes do not know about their correlations and send data along the shortest paths to the sink while allowing opportunistic compression wherever the paths overlap. The energy consumption for this scheme is denoted E_RDC.
In Compression Driven Routing (CDR), nodes have no knowledge of the correlation, but the data is compressed close to the sources and initially routed for the maximum possible compression at each hop. The energy consumption for this scheme is denoted E_CDR.
Using the estimated joint entropy model in Equation (29) in the expressions for E_DSC, E_RDC, and E_CDR, we can quantify the performance of each scheme (see Figure 8). The Distributed Source Coding scheme has the lowest energy consumption because its number of transmitted bits is the smallest among the lossless compression schemes; the higher the correlation level, the smaller its energy usage. For the Routing Driven Compression scheme, the correlation level has little effect on energy usage. For the Compression Driven Routing scheme, the energy usage is high at a low correlation level, but it decreases considerably as the correlation level increases, and the scheme approaches Distributed Source Coding in a highly correlated area. These results are similar to those in Reference [17], which used a different correlation model.
From this result, Distributed Source Coding and Compression Driven Routing are appropriate for wireless sensor networks in highly correlated environments. However, Distributed Source Coding is considerably more difficult to implement than Compression Driven Routing, which can be implemented easily through local compression. Therefore, Compression Driven Routing is recommended as the compression approach for wireless sensor networks with high correlation.

Compression Based Routing Scheme in a Correlated Region
In the previous sections, we proposed a clustering scheme that divides a wireless sensor network into correlated regions with a specified correlation level. In addition, several compression-based routing schemes were compared, and CDR was shown to be a suitable scheme. However, according to our definition of a correlated region, all nodes in the same region are correlated at the same level. This leads to many equivalent CDR paths with a similar maximum compression capability. Therefore, to obtain the optimal routing scheme in a correlated region, it is necessary to combine shortest path routing with CDR. In this paper, it is assumed that correlated nodes are geographically close to each other. Thus, the idea for finding the optimal compression-based routing in each correlated region is to use distance-based clustering, with data sent along the shortest paths to the cluster head of each cluster. Similar to Reference [17], we analyze the energy consumption of clustering-based routing schemes for 1-D, 2-D, and general topology networks when compression is performed using our proposed correlation model. Two compression situations are considered: compression at intermediate nodes on the shortest path tree (SPT) and compression only at the cluster heads.

1-D Analysis
Consider N source nodes linearly located with unit spacing on one side of a 2-D grid of nodes, with the sink on the other side. These source nodes are divided into N/s clusters, each consisting of s nodes. The cluster head of each cluster is located at the end of the cluster. The routing pattern is shown in Figure 9. As mentioned in the previous subsection, we consider two types of compression. In the first type, the data is compressed sequentially from one end of the cluster to the cluster head end. In the second type, compression is done only at the cluster head. The cluster head then sends the compressed data along the shortest path of P hops to the sink. The total bit-hop cost for this routing scheme is shown below.
where E_1in is the bit-hop cost within each cluster and E_1ex is the bit-hop cost for each cluster to send compressed information to the sink.

a. Sequential Compression along a Short Path Tree to the Cluster Head
In this case, at each hop, a node receives data from its previous hop, compresses it together with its own data, and transmits the compressed data to the next hop. The total bit-hop cost can be obtained as follows (detailed proof is in Appendix A.2).
where the coefficient k_s, according to Equation (18), is calculated by using the formula below.
From Equation (38), it can be found that the optimal cluster size depends on the entropy correlation coefficient ρ_0 and the number of hops P to the sink.
It is difficult to find a closed-form formula for the optimal cluster size. Instead, we examine the total bit-hop cost E_s as a function of cluster size for several values of the entropy correlation coefficient ρ_0. Figure 10 shows the total bit-hop cost E_s versus cluster size for different values of ρ_0 (in this case N = 50, P = 5, H_0 = 1). In the case of very low correlation (ρ_0 = 0.1), the optimal cluster size is s_opt = 1. This means that clustering brings no energy saving in this case, and all nodes communicate directly with the sink instead of through an intermediate node. In the case of high correlation (ρ_0 ≥ 0.5), the optimal cluster size is s_opt = N, i.e., it is not necessary to divide the correlated region into smaller subsets. The data is compressed sequentially from the far end of the cluster to the cluster head across all nodes, and the cluster head sends the compressed data to the sink. Because of the high correlation, the more nodes there are, the higher the compression ratio. When the entropy correlation coefficient is low (ρ_0 = 0.3), an optimal cluster size exists, but it is small. The correlated region should then be divided into smaller subsets to optimize the transmission cost. However, with such low correlation and a small number of correlated nodes, the benefit of the correlation characteristic is not significant.
The optimal cluster size also depends on the number of hops (P) to the sink. If P is large, the energy dissipated in transmitting data from the cluster head to the sink exceeds the energy dissipated in the other tasks, so the amount of data sent to the sink should be reduced. Moreover, compression of highly correlated data is more efficient when more data sets are compressed together. Thus, the number of nodes in each cluster should be increased, i.e., the number of clusters should be reduced, and it may even be unnecessary to divide the correlated region into smaller subsets.
From the above analysis, it can be concluded that, for a highly correlated region (ρ_0 ≥ 0.5), it is not necessary to divide a correlation cluster into smaller groups. The whole region forms a single cluster, and all nodes send data toward the cluster head. The cluster head compresses the data and sends it to the sink.
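The cluster-size sweep described above can be sketched numerically. The sketch below rests on two loud assumptions. First, `joint_entropy_placeholder` is a stand-in saturating model chosen only for illustration; it is not Equation (29). Second, the bit-hop accounting (node i forwards H_i bits over one hop inside the cluster, and the head forwards H_s bits over P hops) is one reading of the scheme, not the expression derived in Appendix A.2. All function names are hypothetical.

```python
def joint_entropy_placeholder(n, rho0, h0=1.0):
    """Stand-in saturating model (NOT Equation (29)): H_n = h0 * sum_{k<n} (1 - rho0)^k."""
    return h0 * sum((1.0 - rho0) ** k for k in range(n))

def bit_hop_cost_sequential(N, s, P, rho0, h0=1.0):
    """Total bit-hop cost for N/s clusters with hop-by-hop compression (assumed accounting)."""
    # Inside a cluster, the i-th node forwards the joint entropy of i readings over one hop.
    e_in = sum(joint_entropy_placeholder(i, rho0, h0) for i in range(1, s))
    # The cluster head forwards the joint entropy of all s readings over P hops.
    e_ex = P * joint_entropy_placeholder(s, rho0, h0)
    return (N // s) * (e_in + e_ex)

def best_cluster_size(N, P, rho0, h0=1.0):
    """Cluster size (restricted to divisors of N) with the lowest total cost."""
    sizes = [s for s in range(1, N + 1) if N % s == 0]
    return min(sizes, key=lambda s: bit_hop_cost_sequential(N, s, P, rho0, h0))

for rho0 in (0.1, 0.3, 0.5, 0.9):
    s_opt = best_cluster_size(N=50, P=5, rho0=rho0)
    print(f"rho0 = {rho0}: lowest-cost cluster size = {s_opt}")
```

Swapping in the actual joint entropy model only requires replacing `joint_entropy_placeholder`; the numerical optimum depends on that choice.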

b. Compression at a Cluster Head Only
In this case, each node receives data from the previous hop and forwards it, together with its own data, to the next hop without compression. The cluster head receives data from all nodes in the cluster, compresses it, and sends the compressed data to the sink. Therefore, the total bit-hop cost can be obtained as follows (detailed proof is in Appendix A.3).
where the coefficient k_s is calculated below.
Figure 11 shows the total bit-hop cost E_s versus cluster size for different values of ρ_0 (in this case N = 50, P = 30, H_0 = 1). There exists an optimal cluster size that is not at either end of the graph (s ≠ 1 and s ≠ N). However, this optimal value depends on the correlation coefficient. With a high correlation value (ρ_0 ≥ 0.5), the optimal cluster size decreases as the correlation increases. In addition, the optimal value strongly depends on the number of hops (P): the larger the number of hops, the larger the optimal cluster size. This is because the energy dissipated in transmitting data from the cluster head to the sink becomes dominant in comparison to the other tasks, so the amount of data sent to the sink should be reduced. Moreover, compression of highly correlated data is more efficient when more data sets are compressed together. Thus, the number of nodes in each cluster should be increased, i.e., the cluster size should be larger.
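The trade-off in the head-only case can also be sketched numerically. As before, this is an illustration under stated assumptions: `joint_entropy_placeholder` is a stand-in saturating model and not Equation (29), and the accounting (the j-th hop inside a cluster carries j uncompressed readings of H_0 bits each, and the head forwards H_s bits over P hops) is an assumed reading of the scheme rather than the expression from Appendix A.3. All names are hypothetical.

```python
def joint_entropy_placeholder(n, rho0, h0=1.0):
    """Stand-in saturating model (NOT Equation (29)): H_n = h0 * sum_{k<n} (1 - rho0)^k."""
    return h0 * sum((1.0 - rho0) ** k for k in range(n))

def bit_hop_cost_head_only(N, s, P, rho0, h0=1.0):
    """Total bit-hop cost for N/s clusters when only the cluster head compresses."""
    # Hop j inside the cluster carries j raw readings, for j = 1 .. s-1.
    e_in = h0 * s * (s - 1) / 2
    # The head compresses all s readings and forwards them over P hops.
    e_ex = P * joint_entropy_placeholder(s, rho0, h0)
    return (N // s) * (e_in + e_ex)

def best_cluster_size(N, P, rho0, h0=1.0):
    """Cluster size (restricted to divisors of N) with the lowest total cost."""
    sizes = [s for s in range(1, N + 1) if N % s == 0]
    return min(sizes, key=lambda s: bit_hop_cost_head_only(N, s, P, rho0, h0))

N, P = 50, 30
for s in (1, 2, 5, 10, 25, 50):
    cost = bit_hop_cost_head_only(N, s, P, rho0=0.5)
    print(f"s = {s:2d}: total bit-hop cost = {cost:.1f}")
print("lowest-cost cluster size:", best_cluster_size(N, P, rho0=0.5))
```

The intra-cluster term grows quadratically with s while the number of clusters shrinks, which is what produces an interior optimum under this accounting.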

2-D Analysis
Consider a 2-D network with N = n^2 nodes located on an n × n unit grid. The network nodes are divided into clusters of size s × s, each with s^2 nodes. The routing patterns within a cluster and from the cluster heads to the sink are shown in Figure 12. In each cluster, data is transmitted to the cluster head with compression at each hop (Figure 12a) or with compression at the cluster head only (Figure 12b). Data from the cluster heads is transmitted to the sink without any further compression along the transmission path (Figure 12c). The total bit-hop cost for this routing scheme is shown below.
where E_2in is the total bit-hop cost within clusters and E_2ex is the total bit-hop cost for clusters to send compressed information to the sink.

a. Compression along the Shortest Path Tree to the Cluster Head
In this case, in each cluster, a node receives data from the previous hop, compresses it together with its own data, and transmits the result to the next hop toward the cluster head, as shown in Figure 12a. The cluster head compresses all received data and sends it to the sink without any further compression along the transfer path, as shown in Figure 12c. The total bit-hop cost can be obtained by the formula below (detailed proof is in Appendix A.4).
Figure 13 shows the total bit-hop cost E_s versus cluster size for different values of ρ_0 (in this case n = 72, s = [1,2,3,4,6,8,9,12,18,24,36,72]). With a small correlation coefficient (ρ_0 ≤ 0.3), there exists an optimal cluster size that is not at either end of the graph. However, in the case of high correlation (ρ_0 ≥ 0.5), the optimal cluster size is s_opt = n, i.e., it is not necessary to divide the correlated region into smaller subsets. The reason is that, with high correlation, the energy for transmitting data from the cluster head to the sink becomes dominant, so the number of clusters should be reduced and the cluster size increased. Moreover, compressing a large amount of highly correlated data together is much more effective than dividing it into smaller groups.
Figure 13. Total bit-hop cost E_s versus cluster size for different values of the entropy correlation coefficient in the 2-D case with compression along the SPT to the cluster head.

b. Compression at the Cluster Head Only
In this case, in each cluster, a node receives data from the previous hop and forwards it to the next hop without compression, as shown in Figure 12b. The cluster head compresses all received data and sends it to the sink without any further compression along the transfer path, as shown in Figure 12c. The total bit-hop cost can be obtained as follows (detailed proof is in Appendix A.5).
Figure 14 shows the total bit-hop cost E_s versus cluster size for different values of ρ_0 (in this case n = 72, s = [1,2,3,4,6,8,9,12,18,24,36,72]). The optimal cluster size is not at either end of the graph. In this case, the transmission cost is highest at s = 1 and s = n because all nodes transfer data without compression. However, the optimal value depends on the correlation coefficient: with a higher correlation value, the optimal cluster size is smaller.

General Topology Model Analysis
The analysis in the previous subsections is based on simple topologies. To verify the robustness of our conclusions, we present a simulation with a more general topology, which is shown in Figure 15.
As in Section 4.1.2, the network is assumed to be deployed in an n × n area. Nodes are deployed randomly over the network area. The network can then be divided into square regions (clusters) of size s × s. In this simulation, we choose a network size of 48 m × 48 m (n = 48), a deployment density of 1 node/m^2, and s = [1,2,3,4,6,8,12,16,24,48]. We again evaluate the energy consumption of the clustering schemes in two cases: compression performed at intermediate nodes on the SPT and compression performed only at the cluster heads. In addition, because the node positions are set randomly, the distances between nodes are also random values. Therefore, the bit-hop cost is not used. Instead, the transmission cost is calculated using a distance-based energy model similar to that of Reference [33]. In this paper, for simplicity, the transmission cost is taken to be proportional to the square of the distance from the source to the destination.
Figure 14. Total bit-hop cost E_s versus cluster size for different values of the entropy correlation coefficient in the 2-D case with compression at the cluster head only.

a. Opportunistic Compression along the Short Path Tree to the Cluster Head
In this case, each node transmits an amount of data equal to the entropy or joint entropy of the data to be transmitted (its own data plus the data received from other nodes, according to the SPT). The entropy or joint entropy is calculated according to Equation (29) with H_0 = 1. Figure 16 shows the total transmission cost versus cluster size for different values of ρ_0. The simulation results agree with our previous analysis. The optimal cluster size is s = n, i.e., each correlated region should form a single cluster.

b. Compression at the Cluster Head Only
In this case, each node transmits its own data and forwards the data of other nodes to the next node on the SPT toward the cluster head without compression. Compression is done only at the cluster head before sending to the base station. Each node is assumed to have 1 bit of its own data. Figure 17 shows the total transmission cost versus cluster size for different values of ρ_0. The simulation result agrees with our previous analysis. The optimal cluster size is not at either end of the graph, and this optimal value depends on the correlation coefficient: with a higher correlation value, the optimal cluster size is smaller.

Optimal Routing Schemes in Correlation Networks
In correlation networks, nodes are divided into correlated regions. From the above analyses with different topologies, the optimal routing scheme in correlation networks can be established as follows.
- If compression along the transmission path to the cluster head is used, it is not necessary to divide a correlated region into smaller clusters to optimize the transmission cost. Instead, each correlated region becomes a cluster, and the optimal routing path in each cluster is the shortest path from the nodes to their cluster head.
- If compression is done at the cluster head only and not at intermediate nodes, the transmission path is the shortest path to the cluster head. To obtain the optimal transmission cost, it is necessary to divide a correlated region into smaller clusters. It is difficult to obtain an analytical solution for the optimal cluster size. However, the total transmission cost curves can be drawn to find a nearly optimal value for a specified correlation coefficient and number of network nodes.
In addition, compression along the shortest path tree has a lower transmission cost than compression only at the cluster head. This follows because the amount of data transmitted when compressing along the transmission path is smaller than when compressing at the cluster head only.

Representative Aggregation
In a correlated region with a high enough correlation level, it may not be necessary that every sensor node in a correlation group transmits its data to the sink. Instead, a smaller number of sensor measurements might be adequate to communicate the event features to the sink within a certain reliability level. These working sensors are called representative nodes of the region/group. To evaluate the reliability level, the distortion function is used.

Distortion Function
The distortion function can be interpreted as the percentage of information loss due to the network resource constraints. This research considers the entropy correlation concept. Therefore, the predefined entropy distortion function proposed in Equation (29) is used.
Suppose that there are N sensor nodes in total in the considered area, and denote their observed data as {X_1, X_2, …, X_N}. The joint entropy of all N sensors, H(X_1, X_2, …, X_N), is the maximum amount of information that can be gained for the area of interest. If a subset of these sensors, denoted {X_i1, X_i2, …, X_iM}, is selected to report its observed data to the sink, the information gained at the sink is H(X_i1, X_i2, …, X_iM). The distortion function is defined as the ratio of the decrease in the amount of information to the maximum amount of information, which is given by the equation below.
$$D = \frac{H(X_1, \ldots, X_N) - H(X_{i1}, \ldots, X_{iM})}{H(X_1, \ldots, X_N)} = 1 - \frac{H(X_{i1}, \ldots, X_{iM})}{H(X_1, \ldots, X_N)}$$
The value of D satisfies 0 ≤ D ≤ 1. Using the estimated joint entropy Equation (29), the distortion can be calculated by the formula below.
The distortion depends on the entropy correlation coefficient of the group, the number of representative nodes, and the total number of nodes in the group. If the desired distortion and the total number of nodes in the group are fixed, the number of representative nodes in a correlation group can be determined by using Equation (48).
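As an illustration of the definition above (the ratio of lost information to the maximum information), the following minimal sketch computes the distortion directly from discrete sample sequences using empirical joint entropy. It does not use the estimated joint entropy of Equation (29); the function names and the three toy sensor sequences are assumptions made only for this example.

```python
from collections import Counter
from math import log2


def joint_entropy(columns):
    """Empirical joint entropy (in bits) of several discrete sample sequences.

    `columns` is a list of equally long sequences, one per sensor.
    """
    rows = list(zip(*columns))  # one tuple of readings per time step
    n = len(rows)
    counts = Counter(rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def distortion(readings, selected):
    """D = 1 - H(selected sensors) / H(all sensors).

    Assumes H(all sensors) > 0, i.e. the readings are not all constant.
    """
    h_all = joint_entropy(list(readings.values()))
    h_sel = joint_entropy([readings[name] for name in selected])
    return 1.0 - h_sel / h_all


# Hypothetical toy data: sensor "b" duplicates "a"; "c" varies independently.
readings = {
    "a": [0, 0, 1, 1, 2, 2, 3, 3],
    "b": [0, 0, 1, 1, 2, 2, 3, 3],
    "c": [0, 1, 0, 1, 0, 1, 0, 1],
}

if __name__ == "__main__":
    print(distortion(readings, ["a", "c"]))  # no information is lost
    print(distortion(readings, ["a", "b"]))  # "b" adds nothing beyond "a"
```

In this toy case, reporting "a" and "c" loses nothing because "b" is redundant, whereas reporting "a" and "b" loses the one bit carried by "c".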

Number of Representative Nodes
As per the conclusion in the previous section, the number of representative nodes in a correlation group is determined based on an entropy correlation coefficient when the total number of nodes in the group is unchanged.
From Equation (48), we have formed the equation below.
Moreover, using Equation (18), we have formed the formula below.
in which k_N is calculated using Equation (18). Even though Equation (52) can be used to determine the number of representative nodes analytically, it is difficult to recognize the effect of correlation on the number of representative nodes from it. Thus, in this paper, a visual approach to estimate the number of representative nodes is presented.
Since representative-based aggregation is more effective in highly correlated regions, this paper considers regions with a high correlation level (ρ_0 ≥ 0.5). In such an environment, the joint entropy approaches a "saturation" state as the number of nodes increases. Therefore, for the same distortion, the number of representative nodes does not depend on the total number of nodes in the group if the number of representative nodes is high enough. At a high correlation level (ρ_0 ≥ 0.5), the estimated joint entropy saturates when the group reaches 14 nodes with distortion D ≈ 0, as shown in Figure 3. For that reason, we only consider groups of at most 20 nodes (N ≤ 20).
Using Equation (48), the relations between the distortion and the number of representative nodes for different entropy correlation coefficients are shown in Figure 18. Tables 5-7 show the number of representative nodes for D = 0.05, 0.1, and 0.15, respectively. For the same distortion, the number of representative nodes is quite similar even though the number of nodes in the correlation group differs. It depends only on the entropy correlation coefficient, as shown in Equation (52). The reason is that, in a correlation group, when the number of nodes is large enough, the joint entropy is almost unchanged. Thus, k_N can be considered constant. For that reason, iM depends only on β, i.e., on the correlation coefficient.
Note that the above results are based on Equation (48) or Equation (52) under the assumptions that the nodes have similar entropy and that the node pairs have similar entropy correlation coefficients. However, these conditions are difficult to obtain in practice, so the selection of representative nodes may, in some cases, not satisfy the distortion requirement. Therefore, under the more practical assumption in Definition 2 of a correlated region, the number of representative nodes should be increased by one over the theoretical calculation to satisfy the distortion limit. This increment was found through practical experience; it guarantees the required distortion limit and makes the selection of these nodes more flexible. The detailed explanation is given in the practical validation section.
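The procedure described in this section (find the smallest number of representative nodes that meets the distortion limit, then add one node as a practical margin) can be sketched as follows. The joint entropy model is passed in as a callable because Equations (29) and (48) are not reproduced here. The saturating curve `toy_joint_entropy` is a purely hypothetical stand-in, not the paper's model, and its outputs do not correspond to Tables 5-7.

```python
def representative_count(joint_entropy_of, n_total, d_max, margin=1):
    """Smallest M with 1 - H(M)/H(N) <= d_max, plus a practical margin.

    joint_entropy_of(m) must return the (estimated) joint entropy of m nodes.
    margin=1 mirrors the "theoretical calculation plus one" rule; margin=0
    gives the purely theoretical count.
    """
    h_all = joint_entropy_of(n_total)
    for m in range(1, n_total + 1):
        if 1.0 - joint_entropy_of(m) / h_all <= d_max:
            return min(m + margin, n_total)
    return n_total


def toy_joint_entropy(m, h0=2.0, rho=0.6):
    """Hypothetical saturating curve: each added node contributes a
    geometrically shrinking amount of new information (an assumption)."""
    return h0 * sum((1.0 - rho) ** k for k in range(m))


if __name__ == "__main__":
    for n_total in (11, 20):
        theoretical = representative_count(toy_joint_entropy, n_total, 0.05, margin=0)
        practical = representative_count(toy_joint_entropy, n_total, 0.05)
        print(n_total, theoretical, practical)
```

Because the toy curve saturates quickly, the count is the same for groups of 11 and 20 nodes, which mirrors the observation that the number of representative nodes is nearly independent of the group size once the joint entropy has saturated.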

Representative Node Selection
Once the number of representative nodes is known, these nodes must be selected from the group. The selection can serve different purposes, such as maximizing the total information (the information obtained from the representative nodes is maximal), maximizing coverage (the total area covered by the representative nodes is maximal), or energy balancing (the nodes with the highest remaining energy are chosen as representative nodes). In this paper, the obtained information is maximized by choosing as representative nodes those that are least correlated with all other nodes in the group. This is done by calculating the average entropy correlation coefficient of each node with all other nodes in the group and choosing the nodes with the smallest averages as representative nodes. The selection algorithm that maximizes the obtained information is shown in Algorithm 2.
The results of choosing representative nodes are presented in detail in the practical validation section below.

10. Add X_i to R;
11. Remove X_i from C;
12. ENDFOR
13. END
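The selection rule described in this section (rank nodes by their average entropy correlation coefficient with all other nodes and pick the least correlated ones) can be sketched as below. This is a minimal illustration and not a reconstruction of Algorithm 2, whose earlier steps are not shown here; the function name, the node labels, and the coefficient matrix are hypothetical.

```python
def select_representatives(nodes, rho, m):
    """Pick the m nodes with the smallest average pairwise coefficient.

    nodes: list of node identifiers (at least two).
    rho:   symmetric matrix (list of lists); rho[i][j] is the entropy
           correlation coefficient between nodes[i] and nodes[j].
    """
    n = len(nodes)
    # Average coefficient of each node with all *other* nodes in the group.
    average = {
        nodes[i]: sum(rho[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    }
    candidates = list(nodes)  # plays the role of set C
    chosen = []               # plays the role of set R
    for _ in range(min(m, n)):
        least = min(candidates, key=lambda node: average[node])
        chosen.append(least)      # add the node to R
        candidates.remove(least)  # remove the node from C
    return chosen


# Hypothetical coefficients for four nodes (values invented for illustration).
nodes = ["n1", "n2", "n3", "n4"]
rho = [
    [1.00, 0.90, 0.80, 0.50],
    [0.90, 1.00, 0.70, 0.65],
    [0.80, 0.70, 1.00, 0.40],
    [0.50, 0.65, 0.40, 1.00],
]

if __name__ == "__main__":
    print(select_representatives(nodes, rho, 2))
```

Ranking by a precomputed average keeps the sketch simple; ties are broken by the order of `nodes`, which is an arbitrary choice.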

Practical Validation
To validate the proposed representative node selection algorithm, we again use the temperature data supplied by the Intel Berkeley Research Lab [21]. In addition, the first correlation group with 11 nodes (dataset 1) is chosen at the same time as in the previous section.
First, the number of representative nodes is chosen based on the theoretical calculation. When the required distortion is 0.05, the number of representative nodes is 8. By maximizing the obtained information, the selected representative nodes are {4, 8, 9, 15, 18, 21, 33, 47}. The calculated distortion is 0.08, which does not satisfy the requirement that the distortion be less than or equal to 0.05. In the two other cases (D = 0.1 and D = 0.15), the actual distortions are also larger than the required distortion (see Table 8 for more details).
Next, the number of representative nodes is chosen based on the practical calculation (the theoretical calculation plus one). Now, if the required distortion is 0.05, the number of representative nodes is 9. The selected representative nodes for maximizing the obtained information are {4, 8, 9, 15, 18, 21, 33, 42, 47}. In this case, the actual distortion is 0.05. In the two other cases (D = 0.1 and D = 0.15), our selections also satisfy the requirement, with actual distortions of 0.1 and 0.15, respectively, as shown in Table 9. In addition, the distortion limit is satisfied by all possible selections of representative nodes. For this dataset, our selection therefore satisfies the requirement with the practical calculation but not with the theoretical calculation. The reason is that the chosen entropy range (∆H = 0.4 = 20% H_0) is quite large, which makes Equations (48) and (52) less precise.
Now we choose another dataset (dataset 2), also from the Intel Berkeley Research Lab [21], but at a different time. By using the proposed clustering algorithm, a set of 10 nodes, {5, 21, 24, 31, 33, 40, 41, 42, 46, 47}, is chosen. The entropy correlation coefficient is the same as for the previous dataset (ρ_0 = 0.6), but the entropy range is smaller (∆H = 0.3 = 12% H_0). The detailed entropy values are shown in Table 10.
Table 10. Entropy values of 10 nodes in the correlated region (dataset 2 with N = 10 nodes).
As with dataset 1, the number of representative nodes is first calculated based on the theoretical calculation. When the required distortion is 0.05, the number of representative nodes is 7. By maximizing the obtained information, the selected representative nodes are {5, 24, 33, 40, 41, 42, 46}. The calculated distortion is 0.068, which does not satisfy the requirement that the distortion be less than or equal to 0.05. However, in the two other cases (D = 0.1 and D = 0.15), the actual distortions are equal to or smaller than the required distortion (see Table 11 for more details). As mentioned in the previous section, the theoretical selection of representative nodes may, in some cases, not satisfy the requirement in practice, especially for a small distortion.

Next, the number of representative nodes is chosen based on the practical calculation (the theoretical calculation plus one). Now, if the required distortion is 0.05, the number of representative nodes is 8. The selected representative nodes for maximizing the obtained information are {5, 21, 24, 33, 40, 41, 42, 46}. The actual distortion is 0.045, which satisfies the requirement that the distortion be less than or equal to 0.05. In the two other cases (D = 0.1 and D = 0.15), our selections also satisfy the requirement, as shown in Table 12. In addition, the distortion limit is satisfied by all possible selections of representative nodes.
From these two examples, it can be seen that the entropy range chosen to establish the correlated region affects whether the distortion requirement is satisfied. In our experiment, the maximum range should be 20% (∆H = 20% H_0) with the practical calculation, whereas it should be less than 10% with the theoretical calculation. The practical calculation is recommended because the distortion requirement is then satisfied not only with maximum information selection but also with maximum coverage, energy balance, or other selections.

Conclusions
The paper has proposed a novel practical correlation model to establish correlation regions using only the entropy of individual data and entropy correlation coefficients of data pairs. This model is built by looking at the data itself and it can overcome the limitation of other distance-dependent models. Moreover, the proposed model is simple to use because its arguments including node entropy and entropy correlation coefficients can be calculated from the real data with light computing efforts.
Using the proposed correlation model, the deployment of correlation characteristics can be analyzed in detail and the impacts of correlation on data aggregation can be evaluated thoroughly. For compression aggregation in a high correlation environment (ρ_0 ≥ 0.5), the optimal transmission cost is obtained by clustering the network according to the correlated regions, so that each region corresponds to a cluster. In addition, in each cluster, data is transmitted and compressed along the shortest path to the cluster head.
For representative aggregation, the distortion function is established using the estimated joint entropy model. A method to determine the number of representative nodes and a representative node selection algorithm are then proposed. The paper also analyzes the difference between the theoretical and practical calculations. To establish a correlation group and exploit the correlation characteristic, it is recommended to choose a large enough entropy correlation coefficient (ρ_0 ≥ 0.5) and a small enough entropy range (∆H ≤ 20% H_0). In addition, the practical calculation of the number of representative nodes should be used to allow a flexible selection of the representative nodes.
The paper has only considered fixed correlated regions, i.e., the correlation among nodes does not change over time. If correlated regions may change over time, a mechanism is needed to recognize the change, and the network must then be re-clustered according to the new correlation relations. This will be considered carefully in the future, and a complete routing protocol with data aggregation will be developed and implemented to validate our results in practice. Moreover, some other approaches that take advantage of correlation characteristics will be considered.