Function Analysis of the Euclidean Distance between Probability Distributions

Minimization of the Euclidean distance between the output distribution and a set of Dirac delta functions as a performance criterion is known to match the distribution of the system output to the delta functions. In the analysis of the algorithm developed from that criterion and recursive gradient estimation, this paper reveals that the minimization process of the cost function involves two gradients with different functions: one forces spreading of output samples, and the other compels output samples to move close to the symbol points. To investigate the two functions, each gradient is controlled separately through individual normalization with its related input power. From the analysis and experimental results, it is verified that one gradient accelerates initial convergence speed by spreading output samples, while the other lowers the minimum mean squared error (MSE) by pulling error samples close together.


Introduction
Adaptive signal processing is carried out by minimizing or maximizing an appropriate performance criterion and adjusting the weights of algorithms designed from that criterion [1]. The mean squared error (MSE) criterion, which measures the average of the squares of the error signal, is widely employed in Gaussian noise environments. However, in non-Gaussian noise such as impulsive noise, the averaging of squared error samples that mitigates the effects of Gaussian noise is defeated, because a single large impulse can dominate the sums. Among more recent signal processing methods, information-theoretic learning (ITL) is based on the information potential concept, in which data samples are treated as physical particles in an information potential field where they interact with each other through information forces [2]. ITL methods usually exploit probability distribution functions constructed by the kernel density estimation method with the Gaussian kernel.
Among the ITL criteria, the Euclidean distance (ED) between two distributions has been known to be effective in signal processing fields demanding similarity measures [3][4][5]. For training adaptive systems for medical diagnosis, the ED criterion has been successfully applied to distinguish biomedical datasets [6]. For finite impulse response (FIR) adaptive filter structures in impulsive noise environments, the ED between the output distribution and a set of Dirac delta functions has been used as an efficient performance criterion, taking advantage of the outlier-cutting effect of the Gaussian kernel on output pairs and symbol-output pairs [7]. In this approach, minimization of the ED (MED) leads to adaptive algorithms that adjust weights so that the output distribution is formed into the shape of delta functions located at each symbol point; that is, output samples concentrate on symbol points. Though the blind MED algorithm is superior in robustness against impulsive noise and channel distortions, it suffers from a heavy computational burden, due in large part to the double summation operations required at each iteration for its gradient estimation. A follow-up study [8], however, shows that this burden can be reduced significantly by employing a recursive gradient estimation method.
The gradient in the ED minimization process of the MED algorithm has two components: one for the kernel function of output pairs and the other for the kernel function of symbol-output pairs. The roles of these two components have not been investigated or analyzed in the scientific literature. In this paper, we analyze the roles of the two components and verify the analysis by controlling each component individually, normalizing each with its related input power. Through simulation of multipath channel equalization under impulsive noise, their roles in managing sample pairs are verified, and it is shown that the proposed method of controlling each component through power normalization increases convergence speed and significantly lowers the steady-state MSE in multipath and impulsive noise environments.

MSE Criterion and Related Algorithms
Employing the tapped delay line (TDL) structure, the output at time k becomes y_k = W_k^T X_k with the input vector X_k = [x_k, x_{k-1}, ..., x_{k-L+1}]^T and weight vector W_k = [w_{0,k}, w_{1,k}, ..., w_{L-1,k}]^T. Given the desired signal d_k chosen randomly among the M symbol points (A_1, A_2, ..., A_M), the system error is calculated as e_k = d_k - y_k. In blind equalization, the constant modulus error is e_CME,k = |y_k|^2 - R_2, where R_2 = E[|d_k|^4]/E[|d_k|^2] is the constant modulus of the symbol set. The MSE criterion, one of the most widely used criteria, is the statistical average E[·] of the error power e_k^2 in supervised equalization and of the CME power (|y_k|^2 - R_2)^2 in blind equalization. For practical implementation, we can use the instantaneous squared error e_k^2 as a cost function in supervised equalization. With the gradient ∂e_k^2/∂W = -2 e_k X_k and a step size µ_LMS, minimization of e_k^2 leads to the least mean square (LMS) algorithm [1]:

W_{k+1} = W_k + 2 µ_LMS e_k X_k. (1)

As an extension of the LMS algorithm, the normalized LMS (NLMS) algorithm has been introduced, in which the gradient is scaled in inverse proportion to the dot product of the input vector with itself, ||X_k||^2 = X_k^T X_k = Σ_{m=0}^{L-1} x_{k-m}^2, as a result of minimizing the weight perturbation ||W_{k+1} - W_k||^2 of the LMS algorithm [1]. Then the NLMS algorithm becomes:

W_{k+1} = W_k + µ_NLMS e_k X_k / ||X_k||^2. (2)

The NLMS algorithm is known to be more stable with unknown signals and effective in real-time adaptive systems [10,11]. Under impulsive noise, a single large error sample induced by an impulse can generate a large weight perturbation, and the perturbation becomes zero only when the error e_k is zero. We can therefore predict that the weight update process (1) may be unstable in impulsive noise unless a very small step size is used. Likewise, the LMS and NLMS algorithms, both utilizing the instantaneous error power e_k^2, may become unstable in an impulsive noise environment.
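The two updates above can be sketched in a few lines of Python (a minimal illustration, assuming real-valued signals; the function names, step sizes, and the small eps regularizer are ours, and the factor of 2 from the gradient is absorbed into the step size):

```python
def lms_update(w, x, d, mu=0.01):
    # LMS: W_{k+1} = W_k + mu * e_k * X_k (factor 2 absorbed into mu)
    y = sum(wi * xi for wi, xi in zip(w, x))   # output y_k = W^T X
    e = d - y                                  # error e_k = d_k - y_k
    return [wi + mu * e * xi for wi, xi in zip(w, x)]

def nlms_update(w, x, d, mu=0.5, eps=1e-8):
    # NLMS: same update, divided by the input power ||X_k||^2
    y = sum(wi * xi for wi, xi in zip(w, x))
    e = d - y
    p = sum(xi * xi for xi in x) + eps         # eps guards against zero input
    return [wi + mu * e * xi / p for wi, xi in zip(w, x)]
```

Because the NLMS step is scaled by the current input power, its effective step size adapts to the signal level, which is why it behaves better with signals of unknown power.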

ED Criterion and Entropy
Unlike the MSE, which is based on error power, probability distribution functions can be used to construct a performance criterion. As one of the criteria utilizing distributions, the ED between the distribution f_D(d) of the transmitted symbols and the equalizer output distribution f_Y(y) is defined as [3,6]:

ED[f_D, f_Y] = ∫ (f_D(ξ) - f_Y(ξ))^2 dξ. (3)
Assuming that the modulation scheme is known to the receiver beforehand and all M symbol points (A_1, A_2, ..., A_M) are equally likely, the distribution of the transmitted symbols can be expressed as:

f_D(d) = (1/M) Σ_{m=1}^{M} δ(d - A_m). (4)

The output distribution can be estimated by the kernel density estimation method, f_Y(y) = (1/N) Σ_{i=1}^{N} G_σ(y - y_i), with a set of N available output samples {y_1, y_2, ..., y_N} [6]. Then the ED can be expressed as:

ED[f_D, f_Y] = 1/M + (1/N^2) Σ_{i=1}^{N} Σ_{j=1}^{N} G_{σ√2}(y_j - y_i) - (2/MN) Σ_{m=1}^{M} Σ_{i=1}^{N} G_σ(A_m - y_i). (5)

The first term 1/M in (5) is a constant that is not adjustable, so the ED can be reduced to the following performance criterion C_ED [7]:

C_ED = (1/N^2) Σ_{i=1}^{N} Σ_{j=1}^{N} G_{σ√2}(y_j - y_i) - (2/MN) Σ_{m=1}^{M} Σ_{i=1}^{N} G_σ(A_m - y_i). (6)

In ITL methods, data samples are treated as physical particles interacting with each other. If we place physical particles at the locations y_i and y_j, the Gaussian kernel G_{σ√2}(y_j - y_i) produces an exponentially decaying positive value as the distance between the two particles increases. This leads us to regard the Gaussian kernel G_{σ√2}(y_j - y_i) as a potential field inducing interaction among particles. Then Σ_{j=1}^{N} G_{σ√2}(y_j - y_i) corresponds to the sum of interactions acting on the i-th particle, and (1/N^2) Σ_{i=1}^{N} Σ_{j=1}^{N} G_{σ√2}(y_j - y_i) is the averaged sum of interactions over all pairs. This summed potential energy is referred to as the information potential in ITL methods [2]. Therefore, the second term of (6) is the information potential between symbol points and output samples, and the first term of (6) indicates the information potential among the output samples themselves.
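The information potential described above can be computed directly from its definition. The following sketch (our own helper names, assuming one-dimensional real-valued outputs) shows that coincident samples give the largest potential and that the potential decays as samples spread apart:

```python
import math

def gauss(x, sigma):
    # zero-mean Gaussian kernel G_sigma(x)
    return math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

def information_potential(samples, sigma):
    # averaged pairwise interactions: 1/N^2 * sum_i sum_j G_{sigma*sqrt(2)}(y_j - y_i)
    n = len(samples)
    s = math.sqrt(2) * sigma
    return sum(gauss(yj - yi, s) for yi in samples for yj in samples) / (n * n)
```

For example, information_potential([0.0, 0.0], 1.0) exceeds information_potential([0.0, 3.0], 1.0): pulling the two "particles" apart weakens their interaction, exactly the potential-field picture described above.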
On the other hand, the information potential can be interpreted through the concept of entropy, which can be described in terms of "energy dispersal" or the "spreading of energy" [11]. As one of the convenient entropy definitions, Rényi's entropy of order 2, H_Rényi(y), is defined in (7) as the logarithm of the integral of the squared probability density, which is much easier to estimate [2]:

H_Rényi(y) = -log ∫ f_Y^2(y) dy. (7)

When Rényi's entropy is used along with the kernel density estimation method, we obtain a much simpler form of Rényi's quadratic entropy:

H_Rényi(y) = -log[(1/N^2) Σ_{i=1}^{N} Σ_{j=1}^{N} G_{σ√2}(y_j - y_i)]. (8)

This leads to:

(1/N^2) Σ_{i=1}^{N} Σ_{j=1}^{N} G_{σ√2}(y_j - y_i) = 2^{-H_Rényi(y)}. (9)

Likewise:

(1/MN) Σ_{m=1}^{M} Σ_{i=1}^{N} G_σ(A_m - y_i) = 2^{-H_Rényi(A,y)}. (10)

Therefore the cost function C_ED becomes:

C_ED = 2^{-H_Rényi(y)} - 2^{1-H_Rényi(A,y)}. (11)

Equations (9) and (11) indicate that the entropy of the output samples increases as the distance (y_j - y_i) between the two information particles y_j and y_i increases. Therefore, (y_j - y_i) can be referred to as the entropy-governing output, and we can see that the first term of (11) controls the spreading of output samples. Likewise, the term 2^{1-H_Rényi(A,y)} in (11) governs de-spreading, that is, recombining the sample pairs of symbol points and output samples.
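The link between the information potential and Rényi's quadratic entropy can be illustrated numerically. A small sketch (our naming; the natural logarithm is used here, which changes only the base of (8) and not the ordering of entropy values) confirms that spreading the samples raises the entropy estimate:

```python
import math

def renyi_quadratic_entropy(samples, sigma):
    # H_2(y) = -log of the kernel-estimated information potential, as in (8)
    n = len(samples)
    s = math.sqrt(2) * sigma
    g = lambda x: math.exp(-x * x / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
    ip = sum(g(b - a) for a in samples for b in samples) / (n * n)
    return -math.log(ip)

# spreading the samples lowers the information potential, raising the entropy
tight = renyi_quadratic_entropy([0.0, 0.1, -0.1], sigma=0.6)
spread = renyi_quadratic_entropy([0.0, 2.0, -2.0], sigma=0.6)
```

Here `spread > tight`, matching the "energy dispersal" reading of entropy: the more dispersed the output samples, the larger H_Rényi(y).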

Entropy-Governing Variables and Recursive Algorithms
Defining y_{j,i} = y_j - y_i, e_{m,i} = A_m - y_i, and X_{j,i} = X_j - X_i for convenience, y_{j,i}, e_{m,i}, and X_{j,i} can be referred to as the entropy-governing output, entropy-governing error, and entropy-governing input, respectively. Using these entropy-governing variables and the on-line window of the N most recent output samples {y_{k-N+1}, ..., y_k}, the cost function at time k, C_ED,k, can be written as:

C_ED,k = U_k - V_k, (12)

where:

U_k = (1/N^2) Σ_{i=k-N+1}^{k} Σ_{j=k-N+1}^{k} G_{σ√2}(y_{j,i}), (13)

V_k = (2/MN) Σ_{m=1}^{M} Σ_{i=k-N+1}^{k} G_σ(e_{m,i}). (14)

Minimization of C_ED,k indicates that U_k forces spreading of output samples and -V_k compels output samples to move close to symbol points. Considering that initial-stage output samples may have clustered in the wrong places due to channel distortion, U_k is associated with the role of getting the output samples to move out in search of their destinations, that is, accelerating initial convergence speed. On the other hand, V_k is related with compelling output samples near a symbol point to come closer, lowering the minimum MSE.
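Under the definitions in (13) and (14), U_k and V_k can be evaluated over a window of output samples as in this sketch (function and variable names are ours; the 2/(MN) scaling on the cross term follows the cost above):

```python
import math

def gkernel(x, sigma):
    # Gaussian kernel G_sigma(x)
    return math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

def cost_terms(outputs, symbols, sigma):
    # U_k: information potential among the N windowed output samples, as in (13)
    # V_k: cross potential between the M symbol points and the outputs, as in (14)
    n, m = len(outputs), len(symbols)
    s2 = math.sqrt(2) * sigma
    u = sum(gkernel(yj - yi, s2) for yi in outputs for yj in outputs) / (n * n)
    v = 2.0 * sum(gkernel(a - yi, sigma) for a in symbols for yi in outputs) / (m * n)
    return u, v   # the cost is C_ED,k = u - v
```

When the outputs sit exactly on the symbol points, the cross term v dominates and the cost u - v is at its most negative, which is the configuration the minimization drives toward.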
On the other hand, the double summation operations in U_k and V_k impose a heavy computational burden. In [8] it has been revealed that each component U_{k+1} and V_{k+1} of C_ED,k+1 = U_{k+1} - V_{k+1} can be calculated recursively, so that the computational complexity of (12) is significantly reduced: U_{k+1} is obtained from U_k by adding the kernel terms involving the newly arrived output y_{k+1} and subtracting those involving the discarded output y_{k-N+1}, as in (15). Similarly, V_{k+1} is divided into the terms with y_{k+1} and the terms with y_{k-N+1}, as in (16). The gradients ∂U_k/∂W and ∂V_k/∂W are then calculated recursively by using Equations (15) and (16), yielding (17) and (18), respectively. Since the argument G_{σ√2}(y_{k,i})·y_{k,i} in (17) is a function of the entropy-governing output y_{k,i}, we can define the modified entropy-output ŷ_{k,i} = G_{σ√2}(y_{k,i})·y_{k,i}, which becomes a significantly mitigated value through the Gaussian kernel when the entropy-governing output y_{k,i} is large.
Then (17) can be rewritten as (20). Similarly, the argument G_σ(e_{m,k})·e_{m,k} in (18) is a function of the entropy-governing error e_{m,k}, so we define the modified entropy-error as:

ê_{m,k} = G_σ(e_{m,k})·e_{m,k}. (21)

The modified entropy-error ê_{m,k} also becomes a significantly reduced value through the Gaussian kernel when the entropy-governing error e_{m,k} is large. Then (18) can be rewritten as (22). Through minimization of C_ED,k = U_k - V_k with the gradients ∂U_k/∂W and ∂V_k/∂W obtained by (20) and (22), the following recursive MED (RMED) algorithm can be derived [7]:

W_{k+1} = W_k - µ (∂U_k/∂W - ∂V_k/∂W). (23)

Comparing the gradients of RMED with the LMS gradient in (1), which is composed of error and input, we find that ∂U_k/∂W and ∂V_k/∂W in (20) and (22) have the similar terms ŷ_{k,i}·X_{i,k} (modified entropy-output multiplied by entropy-governing input) and ê_{m,k}·X_k (modified entropy-error multiplied by input), respectively. Considering that impulsive noise may induce a large entropy-governing output y_{k,i} or entropy-governing error e_{m,k}, the modified entropy-output ŷ_{k,i} and modified entropy-error ê_{m,k}, which are significantly mitigated by the Gaussian kernel, can be viewed as playing a crucial role in obtaining stable gradients under strong impulsive noise. Therefore, we can anticipate that the RMED algorithm (23) has low weight perturbation in impulsive noise environments.
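The mitigation effect of the modified entropy variables can be seen by evaluating G_σ(x)·x directly. The sketch below (our helper name) shows that an impulsive argument is attenuated essentially to zero by the exponential tail of the kernel, while moderate arguments pass through:

```python
import math

def modified_entropy_value(x, sigma):
    # x scaled by the Gaussian kernel, as in the modified entropy-output/error:
    # large (impulse-induced) arguments are exponentially attenuated,
    # which keeps the RMED gradient bounded under impulsive noise
    g = math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))
    return g * x
```

With σ = 0.6 (the kernel size used later in the simulation), a moderate argument such as 0.5 yields a usable value, whereas an impulse-sized argument such as 50 is driven to essentially zero, so a single impulse cannot blow up the weight update.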

Input Power Estimation for Normalized Gradient
For the purpose of minimizing the weight perturbation ||W_{k+1} - W_k||^2 of the LMS algorithm in (1), the NLMS algorithm has been introduced, in which the gradient is normalized by the averaged power of the current input vector X_k.
Applying this approach to RMED, we propose in this section to normalize the gradients. Since the role of U_k (spreading output samples) differs from that of V_k (moving output samples close to symbol points), the gradients of (23) can be normalized separately as in (24) and (25), where P_U(k) is the average power of X_{i,k} and P_V(k) is the average power of X_k, as defined in (26) and (27). Since defeating the impulsive noise contained in the input by means of the averaging operation is considered ineffective, the denominators of (26) and (27) are likely to fluctuate under impulsive noise. This may make the algorithm sensitive to impulsive noise. Also, the summation operations make the algorithm computationally burdensome. To avoid these drawbacks, we can track the average powers P_U(k) and P_V(k) recursively with a balance parameter β (0 < β < 1) as:

P_U(k) = β P_U(k-1) + (1 - β) ||X_{i,k}||^2, (28)

P_V(k) = β P_V(k-1) + (1 - β) ||X_k||^2. (29)

With the recursive power estimation of (28) and (29), the proposed algorithm is summarized more formally in Table 1. In the following section, we investigate the new RMED algorithm (25) with separate normalization by P_U(k) in (28) and P_V(k) in (29) with respect to convergence speed and steady-state MSE.
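The recursive power tracking of (28) and (29) amounts to a one-line exponential average. A sketch (our naming, with β = 0.9 as an illustrative value rather than the paper's setting):

```python
def recursive_power(prev_p, x_vec, beta=0.9):
    # exponential power tracker: P(k) = beta * P(k-1) + (1 - beta) * ||X_k||^2
    # replaces the windowed summations of (26)/(27) with O(L) work per sample
    return beta * prev_p + (1 - beta) * sum(x * x for x in x_vec)
```

Because each new input contributes only a (1 - β) fraction, a single impulse-corrupted input vector perturbs the estimate far less than it would a short windowed average, which is the stated motivation for the recursive form.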

Results and Discussion
A base-band communication system with a multipath fading channel and impulsive noise, as used in the experiment, is depicted in Figure 1. The symbol set in the transmitter is composed of four equally probable symbols (-3, -1, 1, 3). The transmitted symbol is distorted by the multipath channel H(z) = 0.26 + 0.93z^-1 + 0.26z^-2 [12]. Impulsive noise n_k is added to the channel output. The distribution function f(n_k) of n_k is expressed in Table 2, where σ²_IN is the variance of the impulses, which are generated according to a Poisson process (occurrence rate ε), and σ²_GN is the variance of the background Gaussian noise [13]. The simulation setup and parameter values are described in Figure 1 and Table 2.
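The noise model of Table 2 (background Gaussian noise plus Poisson-occurring high-variance impulses) can be sketched as follows; the variance and rate values here are illustrative placeholders, not the exact simulation settings:

```python
import math
import random

def impulsive_noise(n, var_gn=0.001, var_in=50.0, rate=0.03, seed=1):
    # background Gaussian noise of variance var_gn; with probability `rate`
    # per sample (Poisson-style occurrence) an additional high-variance
    # impulse of variance var_in is superimposed
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        v = rng.gauss(0.0, math.sqrt(var_gn))
        if rng.random() < rate:
            v += rng.gauss(0.0, math.sqrt(var_in))
        out.append(v)
    return out
```

A trace generated this way is mostly small-amplitude Gaussian noise punctuated by rare large spikes, matching the qualitative picture of Figure 2.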

Table 2. Simulation parameters.

Features | Parameters
The symbol points in the transmitter | (-3, -1, 1, 3)
Kernel size σ | 0.6
The channel transfer function H(z) | 0.26 + 0.93z^-1 + 0.26z^-2
The noise distribution function | f(n_k)

An example of the impulsive noise used in this simulation is depicted in Figure 2. It has been analyzed in Section 4 that U_k is associated with the role of spreading output samples which are clustered at wrong positions due to distorted channel characteristics, and V_k is related with moving output samples close to symbol points. This process can be explained by investigating what happens in the initial-stage error distribution and observing how the distribution of output samples changes in the experimental environment. Figure 3 shows the error distribution in the initial stage with 200 error samples and an ensemble average of 500 runs. Considering that the four symbol points are (-3, -1, 1, 3), error values greater than 1.0 are associated with output samples that would be decided as wrong symbols. The cumulative probability of initial output samples placed in the wrong regions is calculated to be 0.35 from Figure 3 (35% of the output samples are not in place). The error distribution shows about 6 peaks or ridges on each side. This observation indicates that the output samples are clustered in groups (two groups are within the correct range, but four groups lie in incorrect positions on each side of the distribution). This result coincides clearly with the initial output distribution in Figure 4.
The output distribution, showing about 12 peaks, indicates that the initial output samples are clustered into 12 groups mostly located out of place, that is, not around -3, -1, 1, 3. On the 35% of output samples clustered in the wrong symbol regions, the spreading force has a positive effect, allowing them to move out in a blind search for their correct symbol positions. This process is observed in the graph for k = 700 in Figure 4. The output distribution at time k = 700 has an evenly spread shape, indicating that the clustered output samples have moved out and mingled with one another. At sample time k = 1800 the output samples start to settle in their correct symbol areas. From this phase, the force moving output samples close to the symbol points takes effect, lowering the steady-state MSE.
These results imply that U_k is related with convergence speed and V_k with steady-state MSE. To verify this analysis, we test the proposed algorithm in the following three modes with respect to convergence speed and steady-state MSE (we assume that the steady-state MSE is close to the minimum MSE). Mode 1 of the RMED-SN algorithm in (30) is for observing changes in initial convergence speed by normalizing only ∂U_k/∂W by the average power P_U(k) of the entropy-governing input X_{i,k}, compared with the non-normalized RMED. Mode 2 is for observing whether the normalization of ∂V_k/∂W by the power P_V(k) of the input X_k, without managing U_k, lowers the steady-state MSE of RMED. Finally, we see whether Mode 3, employing normalization of ∂U_k/∂W and ∂V_k/∂W simultaneously, yields both performance enhancements: faster convergence and lowered steady-state MSE.
Figure 5 shows the MSE learning performance for CMA, LMS, RMED, and Mode 1 of the proposed algorithm. As discussed in Section 2, the learning curves of the MSE-based algorithms CMA and LMS do not fall below -6 dB, being defeated by the impulsive noise. On the other hand, RMED and the proposed algorithm show rapid and stable convergence. The difference in convergence speed between RMED and Mode 1 is clearly observed: while RMED converges in about 4000 samples, Mode 1 does so in about 2000 samples. Therefore, Mode 1 converges about twice as fast as the RMED algorithm, verifying the analysis of the role of U_k, since only ∂U_k/∂W is normalized but ∂V_k/∂W is not, and we see little difference (about 1 dB) in the steady-state MSE.
In Figure 6, RMED and Mode 2 are compared. Both algorithms have similar convergence speed, with a difference of only about 500 samples. But after convergence, Mode 2 yields a steady-state MSE lower than that of the original RMED by over 2 dB. These findings indicate that the role of V_k is definitely related with lowering the minimum MSE. This is in accordance with the analysis that V_k plays the role of pulling error samples close together.
In Mode 3, it is still not clear whether the normalization of U_k for speeding up the initial convergence may have a negative influence in later iterations, so we try to reduce the U_k normalization gradually after convergence (k ≥ 3000) by using P°_U(k) in place of P_U(k), as defined in (33) with a constant c, 0 ≤ c ≤ 1.

The results for c = 0.8, 0.9, 0.99, and 1.0 are shown in Figure 8 in terms of error distribution, since the learning curves for the various constant values are not clearly distinguishable. The value of c in (33) may be related with the degree of gradual reduction of the normalization of U_k; that is, c = 1 indicates no reduction (Mode 3 as it is) and c = 0.8 means comparatively rapid reduction. From Figure 8, we observe that the error performance becomes better and then worse as c increases from 0.8 to 1.0. This implies that the gradual reduction of the normalization of U_k is effective, though only modestly. We may conclude that the normalization of U_k for speeding up the initial convergence has a slight negative influence in later iterations, and that this can be overcome by employing the gradual reduction of the U_k normalization.

Conclusions
Minimization of the Euclidean distance between the output distribution and a set of Dirac delta functions as a performance criterion is known to force the distribution of the system output toward delta functions located at each symbol point. In the analysis of the RMED algorithm developed from that criterion and recursive gradient estimation, it has been revealed in this paper that the minimization process of the cost function uses two gradients with different functions: one for U_k, which forces spreading of output samples, and the other for V_k, which compels output samples to move close to symbol points. In order to verify the roles of U_k and V_k by controlling them separately, we proposed to normalize ∂U_k/∂W with the averaged power of the entropy-governing input and ∂V_k/∂W with that of the input. From simulation results for the separate normalization of the gradients of RMED in multipath channel equalization under impulsive noise, about two times faster convergence through normalization of ∂U_k/∂W and a steady-state MSE lower by over 2 dB through normalization of ∂V_k/∂W have been observed. From the analysis and experimental results, we conclude that U_k is associated with accelerating initial convergence speed by spreading output samples that may have clustered in wrong places in the initial stage due to channel distortion, and V_k is related with lowering the minimum MSE by pulling error samples close together through the minimization of C_ED,k. It can also be concluded that applying normalization to the two factors ∂U_k/∂W and ∂V_k/∂W separately, each with its related input power, achieves significant performance enhancement.

Conflicts of Interest:
The authors declare no conflict of interest.