Recognizing Information Feature Variation: Message Importance Transfer Measure and Its Applications in Big Data

Information transfer that characterizes the information feature variation can have a crucial impact on big data analytics and processing. In fact, a measure of information transfer can reflect system change statistically through the variable distributions, similar to the Kullback-Leibler (KL) divergence and Renyi divergence. Furthermore, to some degree, small probability events may carry the most important part of the total message in an information transfer of big data. Therefore, it is significant to propose an information transfer measure with respect to message importance from the viewpoint of small probability events. In this paper, we present the message importance transfer measure (MITM) and analyze its performance and applications in three aspects. First, we discuss the robustness of the MITM by using it to measure information distance. Then, we present a message importance transfer capacity by resorting to the MITM and give an upper bound for the information transfer process with disturbance. Finally, we apply the MITM to discuss queue length selection, which is the fundamental problem of caching operation in mobile edge computing.


Introduction
In recent years, due to the exploding amount of data, the computing complexity of data processing is growing rapidly. In particular, cloud data center traffic is projected to grow by an order of magnitude by 2020 [1,2]. To some degree, the reason for this phenomenon is that more and more mobile devices such as smartphones, tablets and mobile Internet of Things (IoT) devices are in use, and a growing number of cloud services are provided. In this context, it is necessary to dig out the valuable information from the collected data. On one hand, computation technologies including cloud computing and mobile edge computing (MEC) are needed for big data processing. On the other hand, it is essential to develop more efficient technologies for big data analysis and mining, such as distributed parallel computing, machine learning, deep learning, and neural networks [3][4][5][6].
As for data mining, small probability events usually attract much more attention than large probability ones [7][8][9][10]. In other words, the rarity of small probability events often carries higher practical value. For example, in an anti-terrorism scenario, we focus on only a few illegal and dangerous people [11]. Moreover, in synthetic identity (ID) detection, only a small number of artificial identities used for financial fraud need close attention [12]. In fact, it is challenging and significant to measure and mine small probability events.
According to rate-distortion theory, it is rational to regard small probability event detection as a clustering problem [13,14]. By using popular clustering principles (e.g., minimum within-cluster distance, maximum inter-cluster distance, and minimum compression distortion), some efficient clustering approaches were proposed to detect small probability events. Specifically, a graph-based rare category detection and a time-flexible rare category detection were presented based on the global similarity matrix and the time evolution of graphs, respectively [15,16]. Actually, these algorithms were proposed by resorting to traditional information measures and theory, which take the viewpoint of typical events, i.e., the large probability events. In information theory, there are two fundamental measures, Shannon entropy and Renyi entropy, which have a vital impact on wireless communication, estimation theory, signal processing, pattern recognition, etc. Nevertheless, they are not applicable to mining small probability events hidden in big data.
To this end, a new information measure named the message importance measure (MIM) was proposed from the perspective of big data [17]. To clarify the form of the MIM, we introduce its definition as follows.
Definition 1. For a continuous probability distribution f(x) of the variable X on a given interval S_x, the differential message importance measure (DMIM), which focuses on the small probability events, is defined as

L(f) = ∫_{S_x} f(x) e^{−f(x)} dx.

Furthermore, for a discrete probability distribution P = {p(x_1), p(x_2), ..., p(x_n)}, the relative message importance measure (RMIM) is given by

L(P) = ∑_{i=1}^{n} p(x_i) e^{−p(x_i)}.

By resorting to the exponential form, the MIM amplifies small probability elements much more than Shannon entropy and Renyi entropy, which involve a logarithm operator or a polynomial operator, respectively. Actually, this highlights the significance of small probability events in information measures and theory. In addition, a series of postulates has been investigated to characterize Shannon entropy and Renyi entropy. In particular, Fadeev's four postulates are well known to describe information measures [18]. In this setting, for two independent random distributions P and Q, Renyi entropy satisfies a weaker additivity postulate than the corresponding postulate for Shannon entropy, namely H(P × Q) = H(P) + H(Q), where the function H(·) denotes a kind of information measure. Similarly, the MIM satisfies a still weaker postulate than that for Renyi entropy. Consequently, from the viewpoint of the generalized Fadeev postulates, we can regard the MIM as a reasonable information measure similar to Shannon entropy and Renyi entropy.
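To illustrate the exponential form above, the following short numerical sketch computes the RMIM and contrasts it with Shannon entropy; the function names and the example distributions are our own illustration, assuming the discrete form L(P) = ∑ p_i e^{−p_i} consistent with the L(·) operator used later for the transfer capacity:

```python
import math

def rmim(p):
    # Relative message importance measure: L(P) = sum_i p_i * exp(-p_i)
    return sum(pi * math.exp(-pi) for pi in p)

def shannon(p):
    # Shannon entropy with natural logarithm, skipping zero-probability events
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

uniform = [0.25] * 4
skewed  = [0.97, 0.01, 0.01, 0.01]   # contains rare (small probability) events

# The per-event weight e^{-p_i} in the RMIM grows as p_i shrinks, so rare
# events receive the largest importance weight, unlike the logarithmic
# weighting inside Shannon entropy.
print(rmim(uniform), rmim(skewed))
print(shannon(uniform), shannon(skewed))
```

The point of the sketch is the weighting: for the skewed distribution, the rare events with p_i = 0.01 each carry weight e^{−0.01} ≈ 0.99, far larger than the weight e^{−0.97} of the dominant event.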

Message Importance Transfer Measure
As for an information transfer process, we construct a model in which the original probability distribution P and the final one Q of the transfer process satisfy the Lipschitz condition

|H(P) − H(Q)| ≤ λ ||P − Q||_1,    (5)

where H(·) is the corresponding information measure function, λ > 0 is the Lipschitz constant, and ||·||_1 denotes the l_1-norm.
Here, we shall analyze and measure the information transfer process mentioned in Equation (5) from the perspective of message importance. In fact, it is a significant problem to measure the message importance variation in big data analytics. According to Definition 1, it is natural to regard the DMIM or RMIM as an element for measuring the message importance distance, which can also be used in the discussion of information transfer processes. Then, an information transfer measure focusing on the message importance is proposed as follows.

Definition 2.
For two probability distributions g(x) and f(x) of the variable X on a given interval S_x, the message importance transfer measure (MITM) is defined as

D_I(g||f) = ∫_{S_x} g(x) e^{−g(x)} dx − ∫_{S_x} f(x) e^{−f(x)} dx = L(g) − L(f).    (6)

Furthermore, in terms of two discrete probability distributions Q = {q(x_1), q(x_2), ..., q(x_n)} and P = {p(x_1), p(x_2), ..., p(x_n)}, the MITM can be written as

D_I(Q||P) = ∑_{i=1}^{n} q(x_i) e^{−q(x_i)} − ∑_{i=1}^{n} p(x_i) e^{−p(x_i)}.    (7)

Note that Definition 2 characterizes a kind of relationship between two distributions from the perspective of information theory. In fact, it is a reasonable information measure that focuses on the effects of small probability elements, regarded as the message importance, for two end-to-end distributions. On one hand, the MITM provides a tool to reflect the change of message importance over the whole transfer process. On the other hand, it also reveals the entire information feature variation between two end-to-end distributions, which makes it a promising tool in data mining.
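The discrete MITM can be sketched in a few lines; here we take it as the difference of the two MIMs, consistent with the L(Y) − L(Y|X) form used later for the transfer capacity (the example distributions are our own illustration):

```python
import math

def mim(p):
    # Discrete MIM: L(P) = sum_i p_i * exp(-p_i)
    return sum(pi * math.exp(-pi) for pi in p)

def mitm(q, p):
    # Message importance transfer measure between two distributions,
    # taken as the difference of their MIMs: D_I(Q || P) = L(Q) - L(P).
    return mim(q) - mim(p)

P = [0.7, 0.2, 0.1]
Q = [0.5, 0.3, 0.2]   # distribution after an information transfer step
print(mitm(Q, P))
```

Note that, by construction, the measure vanishes when the two distributions coincide and is antisymmetric in its arguments, so it behaves like a signed distance of message importance rather than a divergence.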

Related Works for Information Measures in Big Data
There exist a variety of different information measures handling the problem of distributions, which can play a crucial role in many applications involved with artificial intelligence as well as big data analysis and processing.
As typical information measures, Shannon entropy and Renyi entropy are applicable to texture classification and intrinsic dimension estimation [19]. Likewise, the relative entropy, i.e., the KL divergence, is suitable for outlier detection [20] and functional magnetic resonance imaging (FMRI) data processing [21]. Moreover, the MIM and the non-parametric message importance measure (NMIM), both focusing on small probability events, have been proven effective in anomaly detection [17,22,23]. What is more, information divergences such as the message importance (M-I) divergence are applicable to extending machine learning methods by using distributions and their relationships as features [24].
In addition, some information measures have been proposed to reveal the correlation of messages during the information transfer process. For example, directed information [25][26][27][28] and Schreiber's transfer entropy [29] are commonly applied to infer causality structure and characterize the information transfer process. Moreover, referring to ideas from dynamical system theory, new information transfer measures have been proposed to indicate the causality between states and to control the systems [30][31][32].
However, in spite of numerous kinds of information measures, few works focus on how to characterize the information transfer from the perspective of message importance in big data. To this end, a new information measure different from the above is introduced.

Organization
We organize the rest of this paper as follows. In Section 2, we investigate the variation of message importance in the information transfer process by using MITM. In Section 3, we introduce the message importance transfer capacity measured by the MITM to describe the information transfer system with additive disturbance. In Section 4, the MITM and the KL divergence are used to guide the queue length selection for MEC from the viewpoint of the queue theory. Moreover, we also present some simulations to validate our theoretical results. Finally, we conclude in Section 5.

The Information Distance for Message Importance Variation
We now investigate the variation of message importance between two distributions by using an information transfer measure. This characterizes the information distance from the perspective of message importance, which can also reflect the robustness of the information transfer measure.
Consider an observation model P_{g_0|f_0}: f_0(x) → g_0(x), namely an information transfer map of the variable X from one distribution f_0(x) to another distribution g_0(x). In fact, it turns out not to be easy to cope with the two distributions directly. Instead, following a similar approach to [33], the relationship between f_0(x) and g_0(x) is given by

g_0(x) = f_0(x) + ε f_0^α(x) u(x),    (8)

and the constraint condition satisfies

∫_{S_x} f_0^α(x) u(x) dx = 0,    (9)

where ε and α are two positive adjustable coefficients, and u(x) is a perturbation function of the variable X on the interval S_x. Then, we discuss the information distance of message importance measured by the MITM of Definition 2. This characterizes the difference between the origin and the destination of the information transfer from the viewpoint of message importance. By using the model P_{g_0|f_0}: f_0(x) → g_0(x) mentioned above, the end-to-end MITM of the information transfer process is investigated as follows.
Proposition 1. For two probability distributions g_0(x) and f_0(x) whose relationship satisfies the conditions in Equations (8) and (9), the MITM is given by

D_I(g_0||f_0) = ε ∫_{S_x} (1 − f_0(x)) e^{−f_0(x)} f_0^α(x) u(x) dx + o(ε),

where ε and α are parameters, u(x) denotes a function of the variable X, and |D_I(g_0(x)||f_0(x))| ≤ ε ∫_{S_x} |f_0^α(x) u(x)| dx, which satisfies the constraint in Equation (5).
Proof of Proposition 1. According to the binomial theorem, the perturbed density in Equation (8) can be expanded in powers of ε. Then, by using the Taylor series expansion of e^x, Equation (12) is readily obtained. Therefore, by substituting Equation (12) into Equation (6), the proof of the proposition is readily completed.
Furthermore, it is not difficult to obtain the MITM between two different distributions g_1^{(u)} and g_2^{(u)} based on the same reference distribution f_0(x), where g_1^{(u)} and g_2^{(u)} both follow the perturbation model of Equation (8), ε and α are parameters, u_1(x) and u_2(x) denote the corresponding perturbation functions of the variable X on the interval S_x, and |D_I(g_1^{(u)}||g_2^{(u)})| satisfies a bound analogous to that in Proposition 1. Similarly, discrete probability distributions admit the same form of the MITM as in Proposition 1. In particular, for two distributions Q_0 = {q_0(x_1), q_0(x_2), ..., q_0(x_n)} and P_0 = {p_0(x_1), p_0(x_2), ..., p_0(x_n)}, the result holds if the relationship between Q_0 and P_0 satisfies

q_0(x_i) = p_0(x_i) + ε p_0^α(x_i) ũ(x_i),

with the constraint condition

∑_{i=1}^{n} p_0^α(x_i) ũ(x_i) = 0,

where ε and α are adjustable coefficients, and ũ(x_i) is a perturbation function of the variable X. Moreover, it is not difficult to obtain the discrete form of Equation (13) in the same way as above.
Remark 1. By resorting to the information distance measured by the MITM, the message importance distinction between two different distributions can be characterized. In the observation model of Equation (8), it is apparent that the parameter ε dominates the information distance when the perturbation function is finite and the parameter α < ∞. Furthermore, the MITM converges with order O(ε) for a small parameter ε. Actually, this provides a way to apply the MITM to measure the message importance variation in an information transfer process.
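A quick discrete sanity check of this first-order behavior can be run numerically. The sketch below assumes the perturbation model q_i = p_i + ε p_i^α u_i with ∑_i p_i^α u_i = 0 (our reading of Equations (8) and (9)) and compares the exact MITM against the first-order term ε ∑_i (1 − p_i) e^{−p_i} p_i^α u_i; the distribution and perturbation vector are hypothetical examples:

```python
import math

def mim(p):
    # Discrete MIM: L(P) = sum_i p_i * exp(-p_i)
    return sum(pi * math.exp(-pi) for pi in p)

p = [0.6, 0.3, 0.1]
alpha = 1.0
u = [1.0, -1.0, -3.0]                 # chosen so that sum_i p_i^alpha * u_i = 0
assert abs(sum(pi ** alpha * ui for pi, ui in zip(p, u))) < 1e-12

for eps in (1e-2, 1e-3):
    q = [pi + eps * pi ** alpha * ui for pi, ui in zip(p, u)]
    exact = mim(q) - mim(p)           # exact MITM D_I(Q || P)
    first_order = eps * sum((1 - pi) * math.exp(-pi) * pi ** alpha * ui
                            for pi, ui in zip(p, u))
    print(eps, exact, first_order)    # the gap shrinks like O(eps^2)
```

Shrinking ε by a factor of 10 shrinks the gap between the exact and first-order values by roughly a factor of 100, consistent with the O(ε) convergence stated in Remark 1.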

Message Importance Transfer Capacity
In this section, we shall utilize the MITM to analyze the information transfer processing shown in Figure 1. To this end, we propose the message importance transfer capacity based on the MITM as follows.

Figure 1. The end-to-end information transfer process, from the map/encoder to the demap/decoder.

Definition 3.
Assume that there exists an information transfer process in which p(y|x) is a probability distribution matrix characterizing the information transfer from the variable X to Y. The message importance transfer capacity is defined as

C = max_{p(x)} { L(Y) − L(Y|X) },    (18)

where p(y) = ∫_{S_x} p(x) p(y|x) dx, L(Y) = ∫_{S_y} p(y) e^{−p(y)} dy, and L(Y|X) = ∫_{S_x} ∫_{S_y} p(x, y) e^{−p(y|x)} dx dy, with the constraint |L(Y) − L(Y|X)| ≤ λ ||p(y) − p(y|x)||_1.
Then, we discuss some specific information transfer scenarios to gain insight into the applications of the message importance transfer capacity, as follows.

Binary Symmetric Information Transfer Matrix
Consider the binary symmetric information transfer matrix, in which the original variables are complemented with the transfer probability. In particular, the rows of the probability matrix are permutations of each other, and so are the columns, as can be seen in the following proposition.

Proposition 2.
Assume an information transfer process {X, p(y|x), Y} whose information transfer matrix is described as

p(y|x) = [[1−β, β], [β, 1−β]],

which implies that the variables X and Y both follow binary distributions. In this case, we have the message importance transfer capacity

C(β) = e^{−1/2} − (1 − β) e^{−(1−β)} − β e^{−β},

where β ∈ [0, 1] is the transfer error probability.

Proof of Proposition 2. Assume that the distribution of the variable X is the binary distribution (p, 1 − p). As well, it is readily seen that p(y_1) = p(1 − β) + (1 − p)β and p(y_2) = 1 − p(y_1). Moreover, according to the definition of C in Equation (18), we have

C(p, β) = p(y_1) e^{−p(y_1)} + p(y_2) e^{−p(y_2)} − (1 − β) e^{−(1−β)} − β e^{−β}.

Then, it is not difficult to see that ∂C(p, β)/∂p = (1 − 2β)[(1 − p(y_1)) e^{−p(y_1)} − (1 − p(y_2)) e^{−p(y_2)}]. According to the monotonic decrease of (1 − t)e^{−t} in t, p = 1/2 is the only solution of ∂C(p, β)/∂p = 0 (for β ≠ 1/2). Therefore, by substituting p = 1/2 into C(p, β), the proposition is verified.
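A brute-force numerical check of this proposition is straightforward, assuming the capacity form C(p, β) = L(Y) − L(Y|X) with L(P) = ∑ p e^{−p} from Equation (18); the grid search and parameter values are our own illustration:

```python
import math

def L(probs):
    # Discrete MIM operator: L(P) = sum_i p_i * exp(-p_i)
    return sum(p * math.exp(-p) for p in probs)

def capacity_bsc(p, beta):
    # Binary symmetric transfer matrix [[1-b, b], [b, 1-b]], input (p, 1-p).
    y1 = p * (1 - beta) + (1 - p) * beta
    LY = L([y1, 1 - y1])
    LYX = L([1 - beta, beta])   # each row contributes the same value
    return LY - LYX

beta = 0.1
grid = [i / 1000 for i in range(1001)]
best_p = max(grid, key=lambda p: capacity_bsc(p, beta))
print(best_p)                   # the maximizing input distribution is p = 1/2
print(capacity_bsc(0.5, 0.5))   # purely random transfer: capacity 0
```

The grid search confirms that the uniform input p = 1/2 maximizes C(p, β), that the capacity vanishes at β = 1/2, and that β = 0 gives the largest value e^{−1/2} − e^{−1}, matching Remark 2.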

Remark 2.
In light of Proposition 2, on one hand, when β = 1/2, in other words, when the information transfer process is purely random, we obtain the lower bound of the message importance transfer capacity, namely C(β) = 0. On the other hand, when β = 0, namely when the information transfer process is deterministic, we obtain the maximum message importance transfer capacity.

Binary Erasure Information Transfer Matrix
The binary erasure information transfer matrix is similar to the binary symmetric one; however, in the former, part of the information is lost rather than corrupted. In other words, a fraction of the information is erased. In this case, the message importance transfer capacity is discussed as follows.
Proposition 3. Consider an information transfer process {X, p(y|x), Y} in which the information transfer matrix is described as

p(y|x) = [[1−β, β, 0], [0, β, 1−β]],

which indicates that X follows a binary distribution and Y follows a 3-ary distribution, the middle symbol being the erasure. Then, we have

C(β) = (1 − β) ( e^{−(1−β)/2} − e^{−(1−β)} ),

where β ∈ [0, 1] is the erasure probability.

Proof of Proposition 3. Assume the distribution of the variable X is (p, 1 − p). As well, according to the binary erasure information transfer matrix, it is not difficult to see that p(y_1) = p(1 − β), p(y_2) = β, and p(y_3) = (1 − p)(1 − β), so that

C(p, β) = p(y_1) e^{−p(y_1)} + p(y_3) e^{−p(y_3)} − (1 − β) e^{−(1−β)}.    (26)

Due to the monotonic decrease of (1 − t)e^{−t} in t, it is readily seen that p = 1/2 is the only solution of ∂C(p, β)/∂p = 0. Thus, by substituting p = 1/2 into Equation (26), the proposition is readily verified.
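The erasure case can be checked numerically the same way, again assuming the capacity form L(Y) − L(Y|X) with L(P) = ∑ p e^{−p}; the erasure matrix layout and parameter values below are our own illustration:

```python
import math

def L(probs):
    # Discrete MIM operator: L(P) = sum_i p_i * exp(-p_i)
    return sum(p * math.exp(-p) for p in probs)

def capacity_bec(p, beta):
    # Binary erasure transfer matrix [[1-b, b, 0], [0, b, 1-b]]:
    # the middle output symbol is the erasure, input distribution (p, 1-p).
    y = [p * (1 - beta), beta, (1 - p) * (1 - beta)]
    LYX = L([1 - beta, beta])   # each row contributes the same value
    return L(y) - LYX

beta = 0.2
grid = [i / 1000 for i in range(1001)]
best_p = max(grid, key=lambda p: capacity_bec(p, beta))
# closed form at p = 1/2: (1-b) * (exp(-(1-b)/2) - exp(-(1-b)))
closed = (1 - beta) * (math.exp(-(1 - beta) / 2) - math.exp(-(1 - beta)))
print(best_p, capacity_bec(0.5, beta), closed)
```

The erasure weight β e^{−β} appears in both L(Y) and L(Y|X) and cancels, which is why the closed form depends only on the non-erased mass 1 − β.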

Strongly Symmetric Information Transfer Matrix
In terms of the strongly symmetric information transfer matrix, it can be regarded as an extension of the binary symmetric one. The message importance transfer capacity of the former is also analogous to that of the latter, as discussed below.
Proposition 4. Assume an information transfer process with the strongly symmetric transfer matrix whose diagonal entries are 1 − β and whose off-diagonal entries are β/(K − 1), which implies that the variables X and Y both obey K-ary distributions. We have

C(β) = e^{−1/K} − (1 − β) e^{−(1−β)} − β e^{−β/(K−1)},

where the parameter β ∈ (0, 1) and K ≥ 2.

Proof of Proposition 4. Assume the probability distribution of the variable X is {p(x_1), p(x_2), ..., p(x_K)}. As for the strongly symmetric transfer matrix, when the probabilities of x_i are all equal, that is, p(x_i) = 1/K (i = 1, 2, ..., K), the probabilities of y_j (j = 1, 2, ..., K) are also all equal, namely p(y_j) = 1/K. In addition, on account of the information transfer matrix, it is easy to see that

L(Y|X) = (1 − β) e^{−(1−β)} + β e^{−β/(K−1)},

which does not depend on the input distribution. What is more, according to the definition of the message importance transfer capacity in Equation (18), it is readily seen that C = max_{p(x)} { L(Y) − L(Y|X) }, where L(Y) = ∑_{y_j} p(y_j) e^{−p(y_j)}. Then, by using the Lagrange multiplier method with

G(p(y_j), λ_0) = ∑_{y_j} p(y_j) e^{−p(y_j)} + λ_0 ( ∑_{y_j} p(y_j) − 1 ),

and by setting ∂G(p(y_j), λ_0)/∂p(y_j) = 0 and ∂G(p(y_j), λ_0)/∂λ_0 = 0, it can be readily verified that the extreme value of ∑_{y_j} p(y_j) e^{−p(y_j)} is achieved by the solution p(y_1) = p(y_2) = ... = p(y_K) = 1/K.

In light of ∂²[∑_{y_j} p(y_j) e^{−p(y_j)}]/∂p(y_j)² < 0 for p(y_j) ∈ [0, 1], it is readily seen that this extreme value is a maximum: when the variable X follows the uniform distribution, which leads to the uniform distribution for the variable Y, we obtain the message importance transfer capacity C(β). This completes the proof of the proposition.
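The Lagrange-multiplier conclusion can be probed with a random search over the probability simplex, assuming the strongly symmetric matrix with 1 − β on the diagonal and β/(K − 1) elsewhere and the capacity form L(Y) − L(Y|X); the search procedure and parameters are our own illustration:

```python
import math, random

def L(probs):
    # Discrete MIM operator: L(P) = sum_i p_i * exp(-p_i)
    return sum(p * math.exp(-p) for p in probs)

def capacity_kary(px, beta):
    # Strongly symmetric K-ary matrix: 1-b on the diagonal, b/(K-1) elsewhere.
    K = len(px)
    off = beta / (K - 1)
    py = [sum(px[i] * ((1 - beta) if i == j else off) for i in range(K))
          for j in range(K)]
    LYX = L([1 - beta] + [off] * (K - 1))   # identical for every row
    return L(py) - LYX

K, beta = 4, 0.2
uniform = [1 / K] * K
random.seed(0)
best = uniform
for _ in range(2000):                       # random search over the simplex
    w = [random.random() for _ in range(K)]
    s = sum(w)
    cand = [wi / s for wi in w]
    if capacity_kary(cand, beta) > capacity_kary(best, beta):
        best = cand
print(capacity_kary(uniform, beta))
```

No randomly drawn input distribution should beat the uniform one, and the value at the uniform input matches the closed form e^{−1/K} − (1 − β)e^{−(1−β)} − βe^{−β/(K−1)}.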

Continuous Case for the Message Importance Transfer Capacity
By using the MITM as a measuring tool, the information transfer process in the continuous case is investigated. Considering the information transfer process described by Equation (17), it is significant to clarify the effect of a continuous disturbance on the message importance transfer capacity.
Theorem 1. Assume that there exists an information transfer process between the variables X and Y, denoted by {X, p(y|x), Y}, where E[X] = 0, E[X²] = P_s, and Y = X + Z. The variable Z denotes an independent memoryless additive disturbance whose mean and variance satisfy E[Z] = µ and E[(Z − µ)²] = σ², respectively. Then, adopting the MITM, the message importance transfer capacity is

C = max_{p(x): E[X]=0, E[X²]=P_s} { L(Y) − L(Z) },    (34a)

where P_N = µ² + σ², p(y) = ∫_{S_x} p(x) p(y|x) dx, the constraint is

|L(Y) − L(Z)| ≤ λ ||p(y) − p(z)||_1    (34b)

(λ > 0 is the Lipschitz constant), and L(·) is the MIM operator. That is, the variance of X has the dominant effect on the constraint of the message importance transfer capacity.
Proof of Theorem 1. According to Equation (17), we have Y = X + Z. Moreover, by virtue of the independence of X and Z, we have p(y|x) = p(y − x|x) = p(z), which indicates that p(y|x) = p(z). Then, we have

L(Y|X) = ∫_{S_x} ∫_{S_y} p(x, y) e^{−p(y|x)} dx dy = ∫ p(z) e^{−p(z)} dz = L(Z).

Consequently, in terms of Definition 3, L(Y) − L(Y|X) can be written as L(Y) − L(Z), which testifies Equation (34a). Furthermore, according to the fact that E[Y²] = E[X²] + E[Z²] = P_s + P_N, we have the constraint condition in Equation (34b). As well, by substituting the definition of the MITM into Equation (34a), Theorem 1 is proved.
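The quantity L(Y) − L(Z) can be estimated by Monte Carlo, using the identity L(f) = ∫ f e^{−f} = E[e^{−f(X)}]. The sketch below takes X and Z Gaussian purely for illustration (Theorem 1 only fixes the first two moments), so f has a closed-form density and Y is again Gaussian:

```python
import math, random

def gauss_pdf(x, mu, sigma):
    # Density of a normal distribution with mean mu and std sigma
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

random.seed(1)
n = 100_000
Ps, mu_z, sigma_z = 1.0, 0.0, 0.5
sigma_y = math.sqrt(Ps + sigma_z ** 2)   # Y = X + Z, X and Z independent

# Monte-Carlo estimates of L(f) = E[exp(-f(W))] for W ~ f
LZ = sum(math.exp(-gauss_pdf(random.gauss(mu_z, sigma_z), mu_z, sigma_z))
         for _ in range(n)) / n
LY = sum(math.exp(-gauss_pdf(random.gauss(mu_z, sigma_y), mu_z, sigma_y))
         for _ in range(n)) / n

print(LY - LZ)   # L(Y) - L(Z): the MITM between output and disturbance
```

Since Y has the larger variance, its density values are smaller everywhere near the mode, so e^{−p(y)} is closer to 1 and L(Y) exceeds L(Z), giving a positive transfer value.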

Remark 3.
For the message importance transfer capacity with an additive disturbance, it is worth noting that the distribution of the transferred variable Y with the constrained variance may have a significant impact in practical applications. In practice, the variance can be regarded as the power of the signal. Consequently, the message importance transfer capacity in Theorem 1 can be used to guide a signal transfer process with additive disturbance, provided the system does not change relatively much.
Corollary 1. Consider an information transfer process {X, p(y|x), Y}, where Y = X + Z and the variable Z denotes an independent Gaussian disturbance with E[Z] = µ_z and E[Z²] = σ²_z. Assume that the variable X follows a Gaussian mixture model whose independent Gaussian components have means µ_k and variances σ²_k (k = 1, 2, ..., N). In this case, the message importance transfer capacity C(µ_x, σ²_x) with the constrained variances can be characterized in terms of Θ = (1/N) ∑_{k=1}^{N} σ²_k. In particular, the parameters σ²_k can be controlled by the system parameters σ²_x, µ_x and µ_k, where µ_x and σ²_x are the mean and variance of the variable X.

Proof of Corollary 1. As for the Gaussian variable Z satisfying E[Z] = µ_z and E[Z²] = σ²_z, the DMIM can be computed in closed form, where erf(·) denotes the error function, namely erf(x) = (2/√π) ∫_0^x e^{−t²} dt, and the parameters α_0, β_0 and γ_0 are determined by µ_z and σ²_z; the resulting expression can then be approximated accordingly. In addition, according to Y = X + Z (with X and Z independent), the variable Y also follows a Gaussian mixture model with component means µ̃_k = µ_k + µ_z and variances σ̃²_k = σ²_k + σ²_z (k = 1, 2, ..., N), as given in Equation (48). By using the Taylor series expansion, we obtain the DMIM of the variable Y, in which the parameters α_1, β_1 and γ_1 are determined by µ̃_i and σ̃²_i (or µ̃_j and σ̃²_j), the means and variances of the Gaussian mixture model in Equation (48).
Furthermore, in light of Equations (47) and (52), we obtain the message importance transfer measure with the constrained variances σ²_k as in Equation (53a), where the parameter Θ can be regarded as a constant controlled by the system parameters σ²_x, µ_x and µ_k. Moreover, the parameter σ²_z is a system constant and the µ_k are regarded as constants, while the parameters σ²_k (k = 1, 2, ..., N) can be adjusted flexibly. According to the Lagrange multiplier method, the optimum is achieved when the variances σ²_k are all equal. By substituting Equation (55) into Equation (53a), the proof of Corollary 1 is completed.
In order to investigate the continuous information transfer process described in Corollary 1, we carried out the simulations shown in Figures 2 and 3. In particular, Figure 2 shows that when the variable X following a Gaussian mixture model is transferred to the variable Y, the message importance measures of X and Y become closer as N increases (N denotes the number of Gaussian components in the Gaussian mixture model). Besides, we also see that the differences between the message importance measures of the variables X and Y are not significant in the case of large variances σ²_k. In addition, from Figure 3, it is seen that the message importance transfer capacity increases with the number of Gaussian components. Moreover, the larger the variances σ²_k of the Gaussian mixture model are, the larger the message importance transfer capacity is.

Remark 4.
As for an additive disturbance system whose data source derives from a Gaussian mixture model, we can achieve the message importance transfer capacity if all the Gaussian components of the data source have the same variance σ²_k. In practice, when the power of the signal source is controlled in a signal transfer process, we can adjust the signal distributions to achieve the optimal message importance transfer by using Corollary 1.

Application in Mobile Edge Computing with the M/M/s/k Queue
Most mobile users have few computing resources and depend heavily on cloud computing. However, the large distance between the cloud and the end devices cannot meet the low-delay requirements of future applications. To cope with this issue, MEC has been proposed to complement cloud computing.
As far as MEC is concerned, the edge servers are placed in the Base Stations (BSs) to reduce the delay, while context-aware applications are kept close to the mobile users [34]. To characterize MEC more specifically, a MEC model is constructed based on queuing theory as follows.
The MEC system in Figure 4 consists of many mobile users, an edge server, and a central cloud located far from the local devices. For each mobile user, part or all of the service requests can be offloaded to the corresponding edge server when the communication is disturbed by other mobile users or environmental noise. If the upper bound of the service rate of the edge server is larger than the sum of the mobile users' request rates, the offloaded requests will be handled by the edge server. Otherwise, the overloaded requests will be offloaded to the central cloud for processing [35]. In these cases, the queue model on the edge server can be considered as the M/M/s/k queue, where the first M describes the request interarrival times of the mobile users and the second M denotes the request service times in the edge server, both following exponential distributions; the number of parallel processing cores is s, meaning that each processing core can serve at most one request simultaneously; and the queuing buffer size of the edge server is k. Note that we only consider a simple model of MEC to show the potential application of the MITM. In fact, there may be more complicated cases in MEC, such as fault tolerance, failover, and the existence of overlay networks; we shall consider these in the near future. In fact, it is significant for the MEC system to use a finite buffer (or caching) size to approximate the infinite one, which can be treated as a problem of queue length selection. To do this, we exploit the MITM and the KL divergence to measure the effect of queue state variation on the MEC performance as follows.

MITM in the Queueing Model
As a measure of the distance of message importance, the MITM characterizes the difference between two distributions. This can be applied to distinguish the state probability distributions in queue models. To give a more general analysis, we discuss the relationship between the queue state stationary distributions in the M/M/s/k model. The queue state stationary probabilities of the model with arrival rate λ and service rate µ can be described as

p_j = (a^j / j!) p_0 for 0 ≤ j ≤ s,  and  p_j = (a^s / s!) ρ^{j−s} p_0 for s < j ≤ s + k,

where p_0 is the probability of the empty state obtained by normalization, s is the number of servers, k is the size of the buffer or cache, the traffic intensity is ρ = a/s < 1, and a = λ/µ. Therefore, according to Definition 1, we can obtain the RMIM of the queue state stationary distribution in the M/M/s/k model. Then, by use of the Taylor series expansion, the approximate RMIM of Equation (57) is obtained, whose parameters ϕ_1 and ϕ_2 are given in Equations (58a) and (58b) (Proposition 5). Furthermore, referring to Equation (57), we can use the MITM to characterize the information difference for the queue model as follows.
In particular, if there is only one server, the corresponding queue model is M/M/1/k, and it is not difficult to obtain

D_I(s=1)(P_∞||P_k) = L(P_∞) − L(P_k),

where D_I(s=1)(P_∞||P_k) denotes the MITM with the number of servers s = 1, and P_∞ and P_k are the stationary distributions with buffer sizes ∞ and k. The corresponding optimal buffer size is given by

k* = min { k : D_I(s=1)(P_∞||P_k) ≤ δ },    (61)

where δ > 0 is a small enough parameter. It is apparent that δ plays an important role in selecting the caching size when a finite cache is used to imitate the infinite caching working mode.
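The buffer selection rule can be prototyped directly, assuming the discrete MIM form L(P) = ∑ p e^{−p}, the standard M/M/1/k stationary distribution (states 0, ..., k+1 with p_j ∝ ρ^j), and a truncation of the infinite-buffer geometric distribution; the function names and the tolerance δ are our own illustration:

```python
import math

def mm1k_dist(rho, k):
    # Stationary distribution of M/M/1/k (one server, buffer k, states 0..k+1),
    # assuming traffic intensity rho < 1.
    norm = (1 - rho ** (k + 2)) / (1 - rho)
    return [(rho ** j) / norm for j in range(k + 2)]

def mim(p):
    # Discrete MIM operator: L(P) = sum_j p_j * exp(-p_j)
    return sum(pj * math.exp(-pj) for pj in p)

def mitm_to_infinite(rho, k, tail=2000):
    # MITM between the infinite-buffer stationary distribution (M/M/1)
    # and the finite-buffer one, truncating the geometric tail at `tail`.
    p_inf = [(1 - rho) * rho ** j for j in range(tail)]
    return mim(p_inf) - mim(mm1k_dist(rho, k))

def select_buffer(rho, delta):
    # Smallest buffer size whose MITM gap to the infinite queue is below delta.
    k = 0
    while abs(mitm_to_infinite(rho, k)) > delta:
        k += 1
    return k

k_star = select_buffer(0.9, 1e-3)
print(k_star)
```

This mirrors the rule in Equation (61): the returned k is the smallest buffer whose stationary behavior is, in the MITM sense, within δ of the infinite-buffer queue.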

KL Divergence in the Queue Model
As a common information measure, the KL divergence can also be applied to measure the information distinction between the queue state stationary distributions with different buffer sizes. In particular, for the M/M/s model, we have the following proposition. Proposition 6. In the M/M/s model, the KL divergence between the queue state distributions P_{k+1} = {p_{k+1,0}, p_{k+1,1}, ..., p_{k+1,s+k+1}, 0, ..., 0} and P_k = {p_{k,0}, p_{k,1}, ..., p_{k,s+k}, 0, 0, ..., 0} with buffer sizes k + 1 and k can be derived in closed form, where the parameters p_{k,j}, p_{k+1,j}, ϕ_1 and ϕ_2 are the same as those in Proposition 5.
Furthermore, it is not difficult to obtain the KL divergence between the distributions P_∞ = {p_{∞,0}, p_{∞,1}, ...} and P_k = {p_{k,0}, p_{k,1}, ..., p_{k,s+k}, 0, 0, ..., 0} with buffer sizes ∞ and k. Similar to Equation (61), it is rational to use the KL divergence as the measure for selecting the buffer size. The corresponding optimal buffer size can be described as

k* = min { k : D(P_∞||P_k) ≤ δ },

where δ > 0 is a small enough parameter that adjusts the information transfer gap between the queue state stationary distributions P_∞ and P_k with buffer sizes ∞ and k, respectively. The resulting condition can then be expressed in terms of the buffer size (queue length) k and the parameters ϕ_1 and ϕ_2 given in Equations (58a) and (58b).
What is more, for the M/M/1/k model, the optimal buffer size is simplified accordingly. Therefore, by using information measures such as the MITM and the KL divergence, we may provide an effective method to select the caching size, which can exploit the resources of MEC more reasonably.
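The KL-based rule can be prototyped in the same spirit. One caveat of our sketch: the paper evaluates D(P_∞||P_k), but since P_k assigns zero probability to long queues, we compute the finite direction D(P_k||P_∞) instead; for M/M/1/k the likelihood ratio p_{k,j}/p_{∞,j} is constant, giving the exact value −ln(1 − ρ^{k+2}), which the code below reproduces (the function names and δ are our own illustration):

```python
import math

def mm1k_dist(rho, k):
    # Stationary distribution of M/M/1/k (states 0..k+1), assuming rho < 1.
    norm = (1 - rho ** (k + 2)) / (1 - rho)
    return [(rho ** j) / norm for j in range(k + 2)]

def kl(p, q):
    # KL divergence D(P || Q); the finite-buffer support is contained in
    # the infinite-buffer support, so D(P_k || P_inf) is finite.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_to_infinite(rho, k):
    pk = mm1k_dist(rho, k)
    p_inf = [(1 - rho) * rho ** j for j in range(len(pk))]
    return kl(pk, p_inf)

def select_buffer_kl(rho, delta):
    # Smallest buffer size whose KL gap to the infinite queue is below delta.
    k = 0
    while kl_to_infinite(rho, k) > delta:
        k += 1
    return k

k_kl = select_buffer_kl(0.9, 1e-3)
print(k_kl)
```

Because the divergence here collapses to −ln(1 − ρ^{k+2}), the selected buffer grows roughly like ln(1/δ)/ln(1/ρ), which makes the trade-off between tolerance δ and caching size explicit.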

Numerical Validation
To validate the derived theoretical results, we ran discrete-event simulations of the queue model in Matlab. Given the arrival rate λ and service rate µ of the queue model, the arrival and departure of each event are simulated over a fixed period. The specific parameters of the queue model are elaborated below. In the result figures, the legends D_I-Sim, D_I-Ana and D-Sim, D-Ana denote the simulation and analytical results for the MITM and the KL divergence, respectively.

Effect of the Traffic Intensity
We now exploit the M/M/s/k model to investigate the performance of the MITM and the KL divergence under different traffic intensities. In the simulations, the queue length, namely the buffer size, varies from 0 to 30, the number of servers is s = 1, and the traffic intensity is set to ρ = 0.6, 0.7, 0.9. Then, we compare the simulation results with the theoretical ones for the MITM and the KL divergence. From Figure 5, it is seen that the analytical results in Equations (59) and (60) validate the simulation results. In particular, Figure 5a,b shows that the analytical results of the MITM and the KL divergence fit the simulation experiments well in the M/M/s/k models with different traffic intensities. What is more, from Figure 5c, we can see that in the same queue model the MITM converges faster than the KL divergence. That is, the MITM offers a reasonably lower bound for queue length selection with respect to MEC. Besides, the lower the traffic intensity, the more caching resources can be saved.

Effect of the Number of Servers
With regard to the effect of the number of servers on the MITM and the KL divergence, we performed simulation experiments with the M/M/s model, setting the number of servers to s = 1, 3, 5, the queue length to k = 0, 1, 2, ..., 30, and the traffic intensity to ρ = 0.9. Then, we compare the simulation results with the theoretical ones. Figure 6a,b shows that the analytical results fit the simulation results well. From Figure 6c, similar to Figure 5c, we can also use the MITM to obtain a lower bound for queue length selection than the KL divergence provides. Moreover, with other conditions kept the same, a larger number of servers makes the MITM and the KL divergence converge faster. In other words, there is a trade-off between the number of servers and the caching size.

Performance Results for Different Arrival Events Distributions
We now discuss the performance results for different distributions of the events' arrivals, which are listed in Table 1. The average interarrival time is kept the same in all cases, namely 1/λ_0. As well, the number of servers and the traffic intensity are s = 1 and ρ = 0.9 in all cases, respectively. Then, we run simulations in the three cases to compare the testing results with the analytical results. As shown in Figure 7, the MITM converges faster than the KL divergence, which indicates that the MITM may provide a reasonable lower bound for selecting the caching size for MEC.
In addition, we can see that the Poisson arrival process (namely, exponentially distributed interarrival times) corresponds to the worst case among the three discussed cases with respect to the convergence of both the MITM and the KL divergence.

Figure 7. The performance of different information measures between the queue length k and ∞ for queue models with the same number of servers s = 1, the same traffic intensity ρ = 0.9, and different arrival event distributions.

Conclusions
In this paper, the information transfer problem was investigated from the perspective of information theory and big data analytics. An information measure, the MITM, was proposed to characterize the information distance between two distributions, similar to the KL divergence and Renyi divergence. Actually, this information measure plays a vital role in focusing on the message importance hidden in small probability events of big data, and is therefore suitable for characterizing information transfer processes in big data. We investigated the variation of message importance in the information transfer process by using the MITM. Furthermore, we proposed the message importance transfer capacity based on the MITM, so that an upper bound can be presented for the information transfer process with disturbance. In addition, we applied the information transfer measure to select the caching size in MEC. As the next step of this research, we shall carry out real data experiments to test some of the more complicated cases of MEC, make use of the information transfer measures to investigate related algorithms, and discuss the effect of window length on the overall system performance in big data analytics.
Author Contributions: R.S., S.L. and P.F. conceived and designed the methodology; R.S. and P.F. performed the mathematical analysis and the practical simulations; R.S., S.L. and P.F. discussed the results; R.S. and P.F. wrote the paper. All authors have read and approved the final manuscript.