Article

Minimum Error Entropy Algorithms with Sparsity Penalty Constraints

Zongze Wu, Siyuan Peng, Wentao Ma, Badong Chen and Jose C. Principe
1 School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510640, China
2 School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China
3 Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, USA
* Author to whom correspondence should be addressed.
Entropy 2015, 17(5), 3419-3437; https://doi.org/10.3390/e17053419
Submission received: 30 January 2015 / Revised: 28 April 2015 / Accepted: 5 May 2015 / Published: 18 May 2015

Abstract

Recently, sparse adaptive learning algorithms have been developed to exploit system sparsity and to mitigate various noise disturbances in many applications. In particular, in sparse channel estimation, a parameter vector with sparsity characteristics can be estimated well from noisy measurements by a sparse adaptive filter. Most previous works use a mean square error (MSE) based cost to develop sparse filters, which is rational under the assumption of Gaussian distributions. However, the Gaussian assumption does not always hold in real-world environments. To address this issue, in this work we incorporate an l1-norm or a reweighted l1-norm into the minimum error entropy (MEE) criterion to develop new sparse adaptive filters, which may perform much better than the MSE based methods, especially in heavy-tailed non-Gaussian situations, since the error entropy can capture higher-order statistics of the errors. In addition, a new approximator of the l0-norm, based on the correntropy induced metric (CIM), is also used as a sparsity penalty term (SPT). We analyze the mean square convergence of the proposed sparse adaptive filters. An energy conservation relation is derived, and a sufficient condition ensuring mean square convergence is obtained. Simulation results confirm the superior performance of the new algorithms.

1. Introduction

In recent years, sparsity aware learning methods have received a lot of attention due to their broad applicability. In sparse channel estimation, the goal is usually to estimate the parameter vector of an unknown channel, most of whose taps are zero, under noise disturbances. So far, many sparsity aware adaptive filtering algorithms have been developed to solve the sparse channel estimation problem. In general, a sparse adaptive filtering algorithm can be derived by incorporating a sparsity penalty term (SPT), such as the l0-norm, into a traditional adaptive algorithm. Typical examples include sparse least mean square (LMS) [1–4], sparse affine projection algorithms (APA) [5], sparse recursive least squares (RLS) [6], and their variants [7–12].
However, the existing sparse adaptive filters have some limitations. Specifically, when data are non-Gaussian (especially when data are disturbed by impulsive noise or contain large outliers), they may perform very poorly. The main reason is that most of the existing algorithms are developed under the well-known mean square error (MSE) criterion, which relies heavily on the assumption of Gaussian distributions. This assumption does not always hold in practical applications. For instance, various types of artificial noise in electronic devices, atmospheric noise, and lightning spikes in natural phenomena can be described more accurately using non-Gaussian noise models [13,14]. When sparse filters are applied in such situations, their performance degrades substantially due to the sensitivity to impulsive noise or outliers [15].
Information theoretic learning (ITL), on the other hand, provides a nice approach to non-Gaussian signal processing [16,17]. The minimum error entropy (MEE) criterion [18–27] in ITL has been successfully used in adaptive filtering to improve learning performance in non-Gaussian noise. Basically, the MEE aims at minimizing the entropy of the training error such that the adaptive model becomes as close as possible to the unknown system. Since the MEE can capture higher-order statistics and the information content of signals rather than simply their energy, it is particularly useful for non-Gaussian machine learning and signal processing. In this work, we use the MEE instead of the MSE to develop sparse adaptive filtering algorithms. The new adaptive filters are very robust to impulsive noise.
As an important component, the SPT in sparse adaptive filters enables them to fit the sparse structure of the unknown system well. Finding the sparsest solution leads to l0-norm minimization, which is an NP-hard problem. In existing methods, the l1-norm and the reweighted l1-norm are frequently used as the SPT. As a nice approximator of the l0-norm, the correntropy induced metric (CIM) can also be used as a sparsity penalty term in sparse channel estimation [28,29]. In the present paper, we incorporate the above-mentioned SPTs into the MEE criterion and develop three sparse MEE algorithms, namely the sparse MEE with zero-attracting (l1-norm) penalty term (ZAMEE), the sparse MEE with the logarithmic (reweighted l1-norm [30]) penalty term (RZAMEE), and the sparse MEE with the CIM penalty term (CIMMEE).
The organization of the rest of the paper is as follows. In Section 2, we briefly introduce the MEE criterion and the CIM. In Section 3, we derive the ZAMEE, RZAMEE and CIMMEE algorithms. In Section 4, we establish an energy conservation relation and derive a sufficient condition that ensures the mean square convergence of the sparse MEE algorithms. In Section 5, we present simulation results to demonstrate the performance of the developed methods. Finally in Section 6, we give the conclusion.

2. MEE and CIM

2.1. Minimum Error Entropy Criterion

Figure 1 shows a general scheme of adaptive system training under the MEE criterion. As entropy measures the average uncertainty or diversity of a random variable, minimizing the error entropy makes the error distribution more concentrated (usually with a higher peak), and the discrepancy between the unknown system and the adaptive model is thereby minimized. In supervised learning, the error signal is, in general, defined as the difference between the outputs of the unknown system and the adaptive model.
Consider a linear channel model, where the input vector $X(n) = [x_{n-M+1}, \ldots, x_{n-1}, x_n]^T$ at time n is sent over an FIR channel with parameter vector $W^* = [w_1^*, w_2^*, \ldots, w_M^*]^T$ (M is the size of the channel memory). Assume that the channel parameters are real-valued, and most of them are zero. The desired signal d(n) is then
$$d(n) = W^{*T} X(n) + v(n) \tag{1}$$
where v(n) denotes an interference noise. Let $W(n) = [w_1(n), w_2(n), \ldots, w_M(n)]^T$ be the weight vector of an adaptive filter. The instantaneous error is $e(n) = d(n) - W^T(n)X(n)$. Assume that the error e(n) is a random variable with probability density function (PDF) $f_e(e)$, and let $\hat{f}_e(e)$ be an estimator of $f_e(e)$ based on a set of error samples. Then an estimator of Renyi's quadratic entropy of the error signal can be expressed as [16,17]
$$H_{R2}(e) = -\log \int \hat{f}_e^2(\xi)\, d\xi = -\log V(e) \tag{2}$$
where $V(e) = \int \hat{f}_e^2(\xi)\, d\xi$ is called the information potential (IP) [16–18]. Based on the Parzen window approach, the estimated probability density function of the error takes the following form [16,17]
$$\hat{f}_e(e) = \frac{1}{N}\sum_{i=1}^{N} \kappa_\sigma\left(e - e(i)\right) \tag{3}$$
where N is the number of samples, $\kappa_\sigma(\cdot)$ denotes a kernel function with bandwidth σ, and the N error samples are $\{e(1), e(2), \cdots, e(N)\}$. The Gaussian kernel is one of the most popular kernels, given by
$$\kappa_\sigma(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{x^2}{2\sigma^2}\right) \tag{4}$$
In this work, unless otherwise mentioned, the kernel function is a Gaussian kernel. Combining (2) and (3), one can derive
$$H_{R2}(e) = -\log \int \hat{f}_e^2(e)\, de = -\log \int \left(\frac{1}{N}\sum_{i=1}^{N} \kappa_\sigma(e - e(i))\right)^2 de = -\log \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\int \kappa_\sigma(e - e(i))\,\kappa_\sigma(e - e(j))\, de = -\log \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} \kappa_{\sqrt{2}\sigma}\left(e(i) - e(j)\right) \tag{5}$$
It follows easily that
$$V(e) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} \kappa_{\sqrt{2}\sigma}\left(e(i) - e(j)\right) \tag{6}$$
Obviously, minimizing the error entropy is equivalent to maximizing the information potential [31,32]. Thus, the optimization criterion for MEE training can be
$$J_{MEE} = \max_{W} V(e) \tag{7}$$
From (7), a steepest ascent algorithm for estimating the weight vector can be derived as
$$W(n+1) = W(n) + \eta\, \nabla V(e(n)) \tag{8}$$
where η denotes a step size, and ∇V(e(n)) stands for the gradient of the information potential with respect to the weight vector, expressed as
$$\nabla V(e(n)) = \frac{\partial V(e(n))}{\partial W(n)} = \frac{\partial}{\partial W(n)}\left(\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} \kappa_{\sqrt{2}\sigma}(e(i) - e(j))\right) = \frac{1}{2N^2\sigma^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\left[\kappa_{\sqrt{2}\sigma}(e(i) - e(j))\,(e(i) - e(j))\left(\frac{\partial y(i)}{\partial W(n)} - \frac{\partial y(j)}{\partial W(n)}\right)\right] \tag{9}$$
where y(i) and y(j) denote the outputs of the adaptive model at times i and j, respectively.
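To make these quantities concrete, the following NumPy sketch evaluates the information potential of Equation (6) and its gradient of Equation (9) for a linear model $y = W^T X$ (so that $\partial y(i)/\partial W(n) = X(i)$). The function names and the vectorized implementation are ours, not part of the original derivation.

```python
import numpy as np

def gaussian_kernel(x, sigma):
    # Gaussian kernel of Equation (4)
    return np.exp(-x**2 / (2.0 * sigma**2)) / (sigma * np.sqrt(2.0 * np.pi))

def information_potential(e, sigma):
    # Information potential V(e) of Equation (6); e holds N error samples
    diff = e[:, None] - e[None, :]                  # all pairwise differences e(i) - e(j)
    return np.mean(gaussian_kernel(diff, np.sqrt(2.0) * sigma))

def ip_gradient(e, X, sigma):
    # Gradient of V(e) w.r.t. the weights (Equation (9)) for a linear model
    # e = d - X @ W, where X is an N-by-M matrix of input vectors, so that
    # the partial derivative of y(i) with respect to W is simply X[i].
    N = len(e)
    diff = e[:, None] - e[None, :]
    k = gaussian_kernel(diff, np.sqrt(2.0) * sigma)
    g = (k * diff)[:, :, None] * (X[:, None, :] - X[None, :, :])
    return g.sum(axis=(0, 1)) / (2.0 * N**2 * sigma**2)
```

A steepest ascent step of Equation (8) is then simply `W = W + eta * ip_gradient(d - X @ W, X, sigma)`.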

2.2. Correntropy Induced Metric

Correntropy is a nonlinear similarity measure between two random variables, quantifying how similar two random variables are in a neighborhood of the joint space [28,33,34]. Given two vectors of samples, $X = [x_1, \cdots, x_N]^T$ and $Y = [y_1, \cdots, y_N]^T$, a sample mean estimator of the correntropy between X and Y is defined by
$$\hat{V}(X, Y) = \frac{1}{N}\sum_{i=1}^{N} \kappa_\sigma\left(x_i - y_i\right) \tag{10}$$
In order to find the sparsest vector (minimum l0-norm) satisfying a set of linear constraints, one can use the CIM as an approximation of the l0-norm. Based on the correntropy, the CIM is defined as [28]
$$CIM(X, Y) = \left(\kappa(0) - \hat{V}(X, Y)\right)^{1/2} \tag{11}$$
which is a metric in sample space and satisfies
  • Non-negativity: CIM(X, Y) ≥ 0.
  • Identity of indiscernibles: CIM(X, Y) = 0 if and only if X = Y.
  • Symmetry: CIM(X, Y) = CIM(Y, X).
  • Triangle inequality: CIM(X, Z) ≤ CIM(X, Y) + CIM(Y, Z).
The CIM provides a nice approximation for the l0-norm. Given a vector X = [x1,⋯,xN]T, the l0-norm of X can be approximated by [28,29]
$$\|X\|_0 \approx CIM^2(X, 0) = \frac{\kappa(0)}{N}\sum_{i=1}^{N}\left(1 - \exp\left(-\frac{x_i^2}{2\sigma^2}\right)\right) \tag{12}$$
Figure 2 shows the contours of the CIM in a 3-D space, from which one can observe that this metric divides the space into three regions, namely the Euclidean region, the Transition region and the Rectification region. The CIM behaves like an l2-norm (convex function) in the Euclidean region, like an l1-norm in the Transition region, and like an l0-norm (non-convex function) in the Rectification region. It can be shown that if |x_i| > δ, ∀x_i ≠ 0, then as σ→0 one can obtain a solution arbitrarily close to that of the l0-norm, where δ is a small positive number (see [29] for details). Therefore, with a smaller kernel width, the CIM will favor sparsity and can be used as a penalty term in sparse channel estimation.
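A small numeric illustration of Equation (12), assuming a Gaussian kernel; the example vector and kernel widths are arbitrary choices of ours. Note that CIM²(X, 0) counts the non-zero entries only up to the scale factor κ(0)/N, which is why a small kernel width makes it behave like the l0-norm.

```python
import numpy as np

def cim_squared(x, sigma):
    # CIM^2(x, 0) of Equation (12): a smooth surrogate of the l0-norm of x
    kappa0 = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    return kappa0 * np.mean(1.0 - np.exp(-x**2 / (2.0 * sigma**2)))

w = np.array([0.0, 0.0, 1.0, 0.0, -0.5, 0.0])
print(np.count_nonzero(w))           # true l0-norm: 2
print(cim_squared(w, sigma=0.05))    # small width: ~ kappa(0) * 2/6, proportional to the l0-norm
print(cim_squared(w, sigma=5.0))     # large width: behaves like a scaled squared l2-norm
```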

3. Sparse MEE Algorithms

3.1. Sparse MEE with Zero-Attracting (l1-norm) Penalty Term (ZAMEE)

To develop a sparse MEE algorithm with zero-attracting (l1-norm) penalty term [4], we introduce the cost function
$$J_{ZAMEE}(n) = J_{MEE}(n) + \lambda J_{ZA}(n) = -\frac{1}{L^2}\sum_{i=n-L+1}^{n}\sum_{j=n-L+1}^{n} \kappa_{\sqrt{2}\sigma_1}\left(e(i) - e(j)\right) + \lambda \left\|W(n)\right\|_1 \tag{13}$$
where $J_{ZA}(n) = \|W(n)\|_1$ denotes the l1-norm of the estimated parameter vector, L is the sliding data length, and σ1 is the kernel width in MEE. In (13), the MEE term is robust to impulsive noise, the ZA penalty term induces sparsity, and the two terms are balanced by a weight factor λ ≥ 0.
Based on the cost function (13), one can derive the following adaptive algorithm:
$$\begin{aligned} W(n+1) &= W(n) - \eta\, \frac{\partial J_{ZAMEE}(n)}{\partial W(n)} \\ &= W(n) - \eta \left[ -\frac{1}{2\sigma_1^2 L^2}\sum_{i=n-L+1}^{n}\sum_{j=n-L+1}^{n} \left[ (e(i) - e(j))\, \kappa_{\sqrt{2}\sigma_1}(e(i) - e(j))\, (X(i) - X(j)) \right] + \lambda\, \mathrm{sign}(W(n)) \right] \\ &= W(n) + \frac{\eta}{2\sigma_1^2 L^2}\sum_{i=n-L+1}^{n}\sum_{j=n-L+1}^{n} \left[ (e(i) - e(j))\, \kappa_{\sqrt{2}\sigma_1}(e(i) - e(j))\, (X(i) - X(j)) \right] - \rho\, \mathrm{sign}(W(n)) \end{aligned} \tag{14}$$
where ρ = ηλ is the zero-attractor control factor, and sign(·) is a component-wise sign function [24], with sign(x) = 1 for x > 0, sign(x) = −1 for x < 0, and sign(x) = 0 for x = 0. The algorithm (14) is referred to as the ZAMEE algorithm.
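The following sketch implements one ZAMEE iteration of Equation (14) over a sliding window; the routine and variable names are ours, and the Gaussian kernel of Equation (4) with bandwidth √2 σ1 is assumed.

```python
import numpy as np

def zamee_step(W, X_win, d_win, eta, rho, sigma1):
    # One ZAMEE update (Equation (14)); X_win is L-by-M, d_win has length L
    L = len(d_win)
    e = d_win - X_win @ W                           # window errors e(n-L+1), ..., e(n)
    diff = e[:, None] - e[None, :]                  # all pairwise differences e(i) - e(j)
    k = np.exp(-diff**2 / (4.0 * sigma1**2)) / (2.0 * sigma1 * np.sqrt(np.pi))
    mee = ((k * diff)[:, :, None] * (X_win[:, None, :] - X_win[None, :, :])).sum(axis=(0, 1))
    return W + eta * mee / (2.0 * sigma1**2 * L**2) - rho * np.sign(W)
```

Calling `zamee_step` once per sample, with the most recent L input vectors and desired outputs, reproduces the online behavior described above.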

3.2. Sparse MEE with the Logarithmic Penalty Term (RZAMEE)

In this part, we derive a sparse MEE algorithm with a logarithmic penalty term [1,2], which can also generate a zero attractor. The corresponding cost function is given by
$$J_{RZAMEE}(n) = J_{MEE}(n) + \lambda J_{RZA}(n) = -\frac{1}{L^2}\sum_{i=n-L+1}^{n}\sum_{j=n-L+1}^{n} \kappa_{\sqrt{2}\sigma_1}\left(e(i) - e(j)\right) + \lambda \sum_{i=1}^{M} \log\left(1 + |w_i(n)|/\delta\right) \tag{15}$$
where the log-sum penalty $\sum_{i=1}^{M} \log\left(1 + |w_i(n)|/\delta\right)$ behaves more like the l0-norm than the l1-norm $\|W\|_1$ does, and δ is a positive number. Then, a gradient-based adaptive algorithm can be easily derived as
$$\begin{aligned} W(n+1) &= W(n) - \eta\, \frac{\partial J_{RZAMEE}(n)}{\partial W(n)} \\ &= W(n) - \eta \left[ -\frac{1}{2\sigma_1^2 L^2}\sum_{i=n-L+1}^{n}\sum_{j=n-L+1}^{n} \left[ (e(i) - e(j))\, \kappa_{\sqrt{2}\sigma_1}(e(i) - e(j))\, (X(i) - X(j)) \right] + \lambda\, \frac{\mathrm{sign}(W(n))}{1 + \delta' |W(n)|} \right] \\ &= W(n) + \frac{\eta}{2\sigma_1^2 L^2}\sum_{i=n-L+1}^{n}\sum_{j=n-L+1}^{n} \left[ (e(i) - e(j))\, \kappa_{\sqrt{2}\sigma_1}(e(i) - e(j))\, (X(i) - X(j)) \right] - \rho\, \frac{\mathrm{sign}(W(n))}{1 + \delta' |W(n)|} \end{aligned} \tag{16}$$
where δ′ = 1/δ and the sign function, absolute value and division are applied component-wise. This algorithm is referred to as the RZAMEE algorithm.

3.3. Sparse MEE with CIM Penalty Term (CIMMEE)

One can also employ the CIM as a sparsity penalty term to develop a sparse MEE algorithm. A new cost function can be defined by
$$J_{CIMMEE}(n) = J_{MEE}(n) + \lambda J_{CIM}(n) = -\frac{1}{L^2}\sum_{i=n-L+1}^{n}\sum_{j=n-L+1}^{n} \kappa_{\sqrt{2}\sigma_1}\left(e(i) - e(j)\right) + \lambda\, \frac{1}{M\sigma_2\sqrt{2\pi}}\sum_{i=1}^{M}\left(1 - \exp\left(-\frac{w_i(n)^2}{2\sigma_2^2}\right)\right) \tag{17}$$
where σ2 denotes the kernel width in the CIM. With a small kernel width, the second term (i.e., the CIM) becomes a sparsity inducing term. Based on the new cost function (17), we derive a gradient-based adaptive algorithm:
$$\begin{aligned} W(n+1) &= W(n) - \eta\, \frac{\partial J_{CIMMEE}(n)}{\partial W(n)} \\ &= W(n) - \eta \left[ -\frac{1}{2\sigma_1^2 L^2}\sum_{i=n-L+1}^{n}\sum_{j=n-L+1}^{n} \left[ (e(i) - e(j))\, \kappa_{\sqrt{2}\sigma_1}(e(i) - e(j))\, (X(i) - X(j)) \right] + \lambda\, \frac{1}{M\sigma_2^3\sqrt{2\pi}}\, W(n) .\!*\, \exp\left(-\frac{W(n) .\!* W(n)}{2\sigma_2^2}\right) \right] \\ &= W(n) + \frac{\eta}{2\sigma_1^2 L^2}\sum_{i=n-L+1}^{n}\sum_{j=n-L+1}^{n} \left[ (e(i) - e(j))\, \kappa_{\sqrt{2}\sigma_1}(e(i) - e(j))\, (X(i) - X(j)) \right] - \rho\, \frac{1}{M\sigma_2^3\sqrt{2\pi}}\, W(n) .\!*\, \exp\left(-\frac{W(n) .\!* W(n)}{2\sigma_2^2}\right) \end{aligned} \tag{18}$$
where .* denotes the element-wise product. The above algorithm is referred to as the CIMMEE algorithm. The kernel width σ2 is a key parameter in the penalty term; a proper kernel width makes the CIM approximate the l0-norm well.
The derived sparsity aware MEE algorithms can be written in a unifying form:
$$W(n+1) = W(n) - \eta\, \frac{\partial J_{MEE}(\mathbf{e}(n))}{\partial W(n)} - G(W(n)) = W(n) + \eta\, \chi^T(n)\, h(\mathbf{e}(n)) - G(W(n)) \tag{19}$$
where $\mathbf{e}(n) = [e(n-L+1), e(n-L+2), \ldots, e(n)]^T$ is an L×1 error vector, $\chi(n) = [X(n-L+1), X(n-L+2), \ldots, X(n)]^T$ is an L×M input matrix, and $h(\mathbf{e}(n)) = [h_1(\mathbf{e}(n)), h_2(\mathbf{e}(n)), \ldots, h_L(\mathbf{e}(n))]^T$, in which
$$h_i(\mathbf{e}(n)) = \frac{\partial J_{MEE}(\mathbf{e}(n))}{\partial e(n-L+i)} \tag{20}$$
and G(W(n)), the (scaled) derivative of the sparsity penalty term with respect to W(n), is an M×1 vector. For ZAMEE, RZAMEE and CIMMEE, G(W(n)) is given respectively by $G(W(n)) = \rho\, \mathrm{sign}(W(n))$, $G(W(n)) = \rho\, \frac{\mathrm{sign}(W(n))}{1 + \delta'|W(n)|}$ and $G(W(n)) = \rho\, \frac{1}{M\sigma_2^3\sqrt{2\pi}}\, W(n) .\!*\, \exp\left(-\frac{W(n) .\!* W(n)}{2\sigma_2^2}\right)$.
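For reference, the three penalty gradients G(W(n)) can be written as the following small helpers (a sketch using the same notation; σ2 and δ′ are the parameters defined above, and the helper names are ours). Replacing the `rho * np.sign(W)` term in the ZAMEE sketch of Section 3.1 with `G_rza` or `G_cim` yields the RZAMEE and CIMMEE updates.

```python
import numpy as np

def G_za(W, rho):
    # zero-attracting (l1-norm) term of ZAMEE
    return rho * np.sign(W)

def G_rza(W, rho, delta_p):
    # reweighted zero-attracting term of RZAMEE; delta_p plays the role of delta'
    return rho * np.sign(W) / (1.0 + delta_p * np.abs(W))

def G_cim(W, rho, sigma2):
    # CIM penalty gradient of CIMMEE
    M = len(W)
    return rho * W * np.exp(-W**2 / (2.0 * sigma2**2)) / (M * sigma2**3 * np.sqrt(2.0 * np.pi))
```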

4. Mean Square Convergence Analysis

Now we analyze the mean square convergence of the algorithm (19). For simplicity and rigor, we only consider the case in which $G(W(n)) = \rho\, \frac{1}{M\sigma_2^3\sqrt{2\pi}}\, W(n) .\!*\, \exp\left(-\frac{W(n) .\!* W(n)}{2\sigma_2^2}\right)$, i.e., the CIMMEE algorithm. First, we derive a fundamental energy conservation relation [24,35,36].

4.1. Energy Conservation Relation

In order to present a unifying formulation for the sparse MEE algorithms, we rewrite e(n) = d(n) − W^T(n)X(n) in vector form as
$$\mathbf{e}(n) = \mathbf{d}(n) - \chi(n)\, W(n) \tag{21}$$
where $\mathbf{d}(n) = [d(n-L+1), d(n-L+2), \ldots, d(n)]^T$ is the L×1 desired signal vector. From (1), we derive
$$\mathbf{d}(n) = \chi(n)\, W^* + \mathbf{v}(n) \tag{22}$$
where $\mathbf{v}(n) = [v(n-L+1), v(n-L+2), \ldots, v(n)]^T$ is the noise vector. Combining (21) and (22), the error vector $\mathbf{e}(n)$ can be expressed as
$$\mathbf{e}(n) = \chi(n)\, \tilde{W}(n) + \mathbf{v}(n) \tag{23}$$
where $\tilde{W}(n) = W^* - W(n)$ is the weight error vector. Let us define the a priori error vector $\mathbf{e}_a(n)$ and the a posteriori error vector $\mathbf{e}_p(n)$ as follows:
$$\begin{cases} \mathbf{e}_a(n) = \chi(n)\, \tilde{W}(n) \\ \mathbf{e}_p(n) = \chi(n)\, \tilde{W}(n+1) \end{cases} \tag{24}$$
In addition, $\mathbf{e}_a(n)$ and $\mathbf{e}_p(n)$ are related by
$$\mathbf{e}_p(n) = \mathbf{e}_a(n) + \chi(n)\left(\tilde{W}(n+1) - \tilde{W}(n)\right) = \mathbf{e}_a(n) - \chi(n)\left(W(n+1) - W(n)\right) \tag{25}$$
To simplify the analysis, here we assume L = M. Then, combining (19) and (25), we have
$$\begin{aligned} \mathbf{e}_p(n) &= \mathbf{e}_a(n) - \chi(n)\left(\eta\, \chi^T(n)\, h(\mathbf{e}(n)) - G(W(n))\right) \\ \Rightarrow\ \mathbf{e}_p(n) - \mathbf{e}_a(n) &= -\Re(n)\left(\eta\, h(\mathbf{e}(n)) - \Re^{-1}(n)\chi(n)\, G(W(n))\right) \\ \Rightarrow\ \Re^{-1}(n)\left(\mathbf{e}_p(n) - \mathbf{e}_a(n)\right) &= -\left(\eta\, h(\mathbf{e}(n)) - \Re^{-1}(n)\chi(n)\, G(W(n))\right) \\ \Rightarrow\ \chi^T(n)\Re^{-1}(n)\left(\mathbf{e}_p(n) - \mathbf{e}_a(n)\right) &= -\left(\eta\, \chi^T(n)\, h(\mathbf{e}(n)) - G(W(n))\right) = -\left(W(n+1) - W(n)\right) \\ \Rightarrow\ \chi^T(n)\Re^{-1}(n)\left(\mathbf{e}_p(n) - \mathbf{e}_a(n)\right) &= \tilde{W}(n+1) - \tilde{W}(n) \end{aligned} \tag{26}$$
where $\Re(n) = \chi(n)\chi^T(n)$ is an L×L symmetric matrix, which is assumed to be invertible. Therefore, we have
$$\tilde{W}(n+1) = \tilde{W}(n) + \chi^T(n)\Re^{-1}(n)\left(\mathbf{e}_p(n) - \mathbf{e}_a(n)\right) \tag{27}$$
Squaring both sides of (27), we obtain
$$\tilde{W}^T(n+1)\,\tilde{W}(n+1) = \left[\tilde{W}(n) + \chi^T(n)\Re^{-1}(n)\left(\mathbf{e}_p(n) - \mathbf{e}_a(n)\right)\right]^T \left[\tilde{W}(n) + \chi^T(n)\Re^{-1}(n)\left(\mathbf{e}_p(n) - \mathbf{e}_a(n)\right)\right] \tag{28}$$
After some simple manipulations, we derive
$$\left\|\tilde{W}(n+1)\right\|^2 + \left\|\mathbf{e}_a(n)\right\|^2_{\Re^{-1}(n)} = \left\|\tilde{W}(n)\right\|^2 + \left\|\mathbf{e}_p(n)\right\|^2_{\Re^{-1}(n)} \tag{29}$$
where $\|\tilde{W}(n)\|^2 = \tilde{W}^T(n)\tilde{W}(n)$, $\|\mathbf{e}_a(n)\|^2_{\Re^{-1}(n)} = \mathbf{e}_a^T(n)\Re^{-1}(n)\mathbf{e}_a(n)$ and $\|\mathbf{e}_p(n)\|^2_{\Re^{-1}(n)} = \mathbf{e}_p^T(n)\Re^{-1}(n)\mathbf{e}_p(n)$. Taking the expectations of both sides of (29), we have
$$E\left[\left\|\tilde{W}(n+1)\right\|^2\right] + E\left[\left\|\mathbf{e}_a(n)\right\|^2_{\Re^{-1}(n)}\right] = E\left[\left\|\tilde{W}(n)\right\|^2\right] + E\left[\left\|\mathbf{e}_p(n)\right\|^2_{\Re^{-1}(n)}\right] \tag{30}$$
where $E[\cdot]$ denotes the expectation operator, and $E[\|\tilde{W}(n)\|^2]$ is the weight error power (WEP) at iteration n.
Remark: Equation (30) is referred to as the energy conservation relation for the proposed sparsity aware MEE algorithms, which is, interestingly, the same as the energy conservation relation derived in [24]. In fact, the sparsity penalty terms have no influence on the energy conservation relation. Similar extensions of the energy conservation relation to multi-dimensional error can be found in [37,38].
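In fact, relation (29) is an algebraic identity once L = M and ℜ(n) is invertible, regardless of which update produced W(n+1). The following sketch verifies it numerically for an arbitrary pair of weight vectors; all quantities are randomly generated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
M = L = 8
W_star = np.zeros(M); W_star[2] = 1.0                    # sparse true weight vector
W_old, W_new = rng.normal(size=M), rng.normal(size=M)    # W(n) and an arbitrary W(n+1)
chi = rng.normal(size=(L, M))                            # input matrix chi(n)

R_inv = np.linalg.inv(chi @ chi.T)                       # R(n)^{-1}
e_a = chi @ (W_star - W_old)                             # a priori error vector, Equation (24)
e_p = chi @ (W_star - W_new)                             # a posteriori error vector, Equation (24)

lhs = np.sum((W_star - W_new) ** 2) + e_a @ R_inv @ e_a
rhs = np.sum((W_star - W_old) ** 2) + e_p @ R_inv @ e_p
print(np.isclose(lhs, rhs))                              # True: Equation (29) holds exactly
```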

4.2. Sufficient Condition for Mean Square Convergence

Based on the energy conservation relation (30), a sufficient condition ensuring mean square convergence can be derived. Substituting $\mathbf{e}_p(n) = \mathbf{e}_a(n) - \left(\eta\,\Re(n)\, h(\mathbf{e}(n)) - \chi(n)\, G(W(n))\right)$ into (30), we obtain
$$E\left[\left\|\tilde{W}(n+1)\right\|^2\right] = E\left[\left\|\tilde{W}(n)\right\|^2\right] - 2\eta\, E\left[\mathbf{e}_a^T(n)\, h(\mathbf{e}(n))\right] + \eta^2 E\left[h^T(\mathbf{e}(n))\, \Re(n)\, h(\mathbf{e}(n))\right] + E\left[G^T(W(n))\, G(W(n))\right] + 2 E\left[\mathbf{e}_a^T(n)\Re^{-1}(n)\chi(n)\, G(W(n))\right] - 2\eta\, E\left[h^T(\mathbf{e}(n))\, \chi(n)\, G(W(n))\right] \tag{31}$$
Before evaluating the expectations $E[G^T(W(n))G(W(n))]$, $E[\mathbf{e}_a^T(n)\Re^{-1}(n)\chi(n)G(W(n))]$, $E[\mathbf{e}_a^T(n)h(\mathbf{e}(n))]$, $E[h^T(\mathbf{e}(n))\chi(n)G(W(n))]$ and $E[h^T(\mathbf{e}(n))\Re(n)h(\mathbf{e}(n))]$, we give the following assumptions.
Assumptions:
  • (A) The noise {v(n)} is independent, identically distributed, and independent of the input {X(n)}.
  • (B) The a priori error vector $\mathbf{e}_a(n)$ is jointly Gaussian distributed.
  • (C) The input vectors {X(n)} are zero-mean, independent, and identically distributed.
  • (D) ∀i, j ∈ {n−L+1,⋯,n}, $\Re_{i,j}(n)$ is independent of {e(i), e(j)}.
  • (E) The vectors {G(W(n))} are zero-mean, independent, identically distributed, and independent of the input {X(n)}.
Remark: Assumptions (A), (B), (C) and (D) are commonly used in the literature [35,36]. In this work, the unknown system is assumed to be sparse, with most coefficients zero or close to zero, so the weight vector W(n) of the adaptive filter is also sparse, especially at the final stage of convergence when the filter gets very close to the unknown system. Since W(n) is sparse, the vector $G(W(n)) = \rho\, \frac{1}{M\sigma_2^3\sqrt{2\pi}}\, W(n) .\!*\, \exp\left(-\frac{W(n) .\!* W(n)}{2\sigma_2^2}\right)$ will be close to a null vector, because $\phi(x) = \rho\, \frac{1}{M\sigma_2^3\sqrt{2\pi}}\, x \exp\left(-\frac{x^2}{2\sigma_2^2}\right) \approx 0$ when x is very close to zero or far from zero. Thus, for the CIMMEE, assumption (E) is reasonable.
If the above assumptions hold, in a similar way to [24,35], one can derive
$$E\left[\mathbf{e}_a^T(n)\, h(\mathbf{e}(n))\right] = \gamma^2(n)\, \theta_G\!\left(\gamma^2(n)\right) \tag{32}$$
$$E\left[h^T(\mathbf{e}(n))\, \Re(n)\, h(\mathbf{e}(n))\right] = \theta_I\!\left(\gamma^2(n)\right) E\left[\|X(n)\|^2\right] \tag{33}$$
where $\gamma^2(n) = E\left[(e_a(n-L+i))^2\right]$, and $\theta_G(\gamma^2(n))$ and $\theta_I(\gamma^2(n))$ denote two functions of $\gamma^2(n)$. The subscript G in $\theta_G$ indicates that the Gaussian assumption (B) is the main assumption behind Equation (32), and the subscript I in $\theta_I$ indicates that the independence assumption (D) is the major assumption leading to expression (33). For more details about (32) and (33), interested readers are referred to [24]. By assumption (E), it follows easily that
$$\begin{cases} E\left[\mathbf{e}_a^T(n)\Re^{-1}(n)\chi(n)\, G(W(n))\right] = E\left[\mathbf{e}_a^T(n)\Re^{-1}(n)\chi(n)\right] E\left[G(W(n))\right] = 0 \\ E\left[h^T(\mathbf{e}(n))\, \chi(n)\, G(W(n))\right] = E\left[h^T(\mathbf{e}(n))\, \chi(n)\right] E\left[G(W(n))\right] = 0 \end{cases} \tag{34}$$
Let the variance of {G(W(n))} be ς2. Then we derive
$$E\left[G^T(W(n))\, G(W(n))\right] = \varsigma^2 \tag{35}$$
Substituting (32), (33), (34) and (35) into (31), we obtain
$$E\left[\left\|\tilde{W}(n+1)\right\|^2\right] = E\left[\left\|\tilde{W}(n)\right\|^2\right] - 2\eta\, \gamma^2(n)\, \theta_G\!\left(\gamma^2(n)\right) + \eta^2\, \theta_I\!\left(\gamma^2(n)\right) E\left[\|X(n)\|^2\right] + \varsigma^2 \tag{36}$$
From (36), we observe
$$E\left[\left\|\tilde{W}(n+1)\right\|^2\right] \le E\left[\left\|\tilde{W}(n)\right\|^2\right] \iff \eta^2\, \theta_I\!\left(\gamma^2(n)\right) E\left[\|X(n)\|^2\right] - 2\eta\, \gamma^2(n)\, \theta_G\!\left(\gamma^2(n)\right) + \varsigma^2 \le 0 \tag{37}$$
Since ℜ(n) is assumed to be invertible, we have
$$\theta_I\!\left(\gamma^2(n)\right) E\left[\|X(n)\|^2\right] > 0 \tag{38}$$
Thus, to make the weight error power decrease monotonically (and hence converge), the step size η should satisfy the following inequality:
$$\frac{\gamma^2(n)\, \theta_G\!\left(\gamma^2(n)\right) - \sqrt{\Upsilon}}{\theta_I\!\left(\gamma^2(n)\right) E\left[\|X(n)\|^2\right]} \le \eta \le \frac{\gamma^2(n)\, \theta_G\!\left(\gamma^2(n)\right) + \sqrt{\Upsilon}}{\theta_I\!\left(\gamma^2(n)\right) E\left[\|X(n)\|^2\right]} \tag{39}$$
where $\Upsilon = \left(\gamma^2(n)\, \theta_G\!\left(\gamma^2(n)\right)\right)^2 - \theta_I\!\left(\gamma^2(n)\right) E\left[\|X(n)\|^2\right] \varsigma^2$. As η > 0, the above inequality implies
$$\theta_G\!\left(\gamma^2(n)\right) > 0 \tag{40}$$
$$\Upsilon \ge 0 \tag{41}$$
Therefore, a sufficient condition for the mean square convergence will be
$$\begin{cases} \theta_G\!\left(\gamma^2(n)\right) > 0 \\ \dfrac{\gamma^2(n)\, \theta_G\!\left(\gamma^2(n)\right) - \sqrt{\Upsilon}}{\theta_I\!\left(\gamma^2(n)\right) E\left[\|X(n)\|^2\right]} \le \eta \le \dfrac{\gamma^2(n)\, \theta_G\!\left(\gamma^2(n)\right) + \sqrt{\Upsilon}}{\theta_I\!\left(\gamma^2(n)\right) E\left[\|X(n)\|^2\right]}, \quad \Upsilon \ge 0 \end{cases} \quad \forall n \tag{42}$$
Remark: It is worth noting that the sufficient condition (42) does not ensure that the WEP converges to zero. Actually, for a stochastic gradient-based algorithm, there is always some misadjustment. Even so, the derived condition guarantees the monotonic decrease of the WEP and ensures that the algorithm does not diverge.

5. Simulation Results

In this section, we perform simulations on time-varying channel estimation to demonstrate the performance of the proposed sparsity aware MEE algorithms (ZAMEE, RZAMEE, and CIMMEE), compared with several other algorithms, including least absolute deviation (LAD) [39], MEE, zero-attracting LMS (ZALMS), and reweighted zero-attracting LMS (RZALMS), in a sparse system identification setting. In all simulations, the performance measure is the mean square deviation (MSD), defined as
$$\mathrm{MSD} = E\left[\left\|W^* - W(n)\right\|^2\right] \tag{43}$$

5.1. Experiment 1

In the first experiment, in order to examine the effect of system sparsity, we use a time-varying system with a filter of 20 coefficients. The parameter vector of the unknown channel is assumed to be
$$W^* = \begin{cases} [0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0] & n \le 2000 \\ [1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0] & 2000 < n \le 3000 \\ [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1] & 3000 < n \end{cases} \tag{44}$$
In (44), the channel memory size M is 20. The channel has a sparsity of 1/20 during the first 2000 iterations, the sparsity changes to 10/20 (ten non-zero taps) between iterations 2000 and 3000, and the channel is completely non-sparse after 3000 iterations. The input signal {x(n)} is a white Gaussian random sequence with zero mean and unit variance. The simulation results below are obtained by averaging over 100 independent Monte Carlo runs, and each run has 5000 iterations.
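A sketch of the setup just described (the time-varying channel of Equation (44), the white Gaussian input, and the MSD of Equation (43)); the helper names are ours.

```python
import numpy as np

def true_weights(n):
    # time-varying channel of Equation (44), M = 20 taps
    w = np.zeros(20)
    if n <= 2000:
        w[4] = 1.0            # 1 non-zero tap (sparsity 1/20)
    elif n <= 3000:
        w[::2] = 1.0          # 10 non-zero taps
    else:
        w[:] = 1.0            # completely non-sparse
    return w

def msd(W_star, W):
    # mean square deviation of Equation (43) for a single run;
    # in the experiments it is averaged over 100 Monte Carlo runs
    return np.sum((W_star - W) ** 2)

rng = np.random.default_rng(1)
x = rng.standard_normal(5000)  # white Gaussian input, zero mean, unit variance
```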
We employ the alpha-stable distribution [40] as the impulsive noise model, which has been widely applied in the literature [41–43]. The characteristic function of the alpha-stable distribution is given by
$$f(t) = \exp\left\{ j\delta t - \gamma |t|^\alpha \left[1 + j\beta\, \mathrm{sgn}(t)\, S(t, \alpha)\right] \right\} \tag{45}$$
in which
$$S(t, \alpha) = \begin{cases} \tan\dfrac{\alpha\pi}{2} & \text{if } \alpha \ne 1 \\ \dfrac{2}{\pi}\log|t| & \text{if } \alpha = 1 \end{cases} \tag{46}$$
where α ∈ (0, 2] is the characteristic factor, −∞ < δ < +∞ is the location parameter, β ∈ [−1, 1] is the symmetry parameter, and γ > 0 is the dispersion parameter. Such a distribution is called a symmetric alpha-stable (SαS) distribution when β = 0. We define the parameter vector as V = (α, β, γ, δ).
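Alpha-stable samples with the parameter vector V = (α, β, γ, δ) can be generated, for instance, with SciPy's `levy_stable` distribution. Note that SciPy parameterizes the distribution by a scale c rather than by the dispersion γ; under the parameterization of Equation (45) we assume the mapping γ = c^α, so this is a hedged sketch rather than the exact noise generator used in the paper.

```python
import numpy as np
from scipy.stats import levy_stable

def alpha_stable_noise(n, alpha, beta=0.0, gamma=1.0, delta=0.0, seed=None):
    # draw n samples from the alpha-stable model of Equation (45);
    # dispersion gamma is converted to SciPy's scale as c = gamma**(1/alpha) (assumed mapping)
    return levy_stable.rvs(alpha, beta, loc=delta, scale=gamma ** (1.0 / alpha),
                           size=n, random_state=seed)

v = alpha_stable_noise(5000, alpha=1.2, gamma=0.2, seed=1)   # V = (1.2, 0, 0.2, 0), as used below
```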
First, we investigate the convergence behaviors of the proposed methods in impulsive alpha-stable noise, where the noise parameter vector is V = (1.2, 0, 0.2, 0). The sliding data length for MEE is L = 20. The step size is set to 0.03 for all algorithms. The kernel widths in MEE and CIM are 2.0 and 0.04, respectively. For all sparsity aware algorithms, ρ is set to 0.0001. The parameter δ′ for RZALMS and RZAMEE is 10. The average convergence curves in terms of the MSD are shown in Figure 3. As one can see from the MSD results, when the channel is very sparse (before the 2000th iteration), the sparsity aware MEE algorithms achieve a faster convergence rate and better steady-state performance than the other robust algorithms (LAD, MEE), while ZALMS and RZALMS perform poorly, as they are sensitive to impulsive noise. For this reason, only MEE and LAD are compared with the proposed algorithms in the subsequent experiments. In addition, CIMMEE achieves a lower MSD than ZAMEE and RZAMEE, since the CIM provides a nice approximation of the l0-norm. After the 2000th iteration, as the number of non-zero taps increases to ten, the performance of ZAMEE and RZAMEE deteriorates, while CIMMEE maintains the best performance among the sparsity aware filters. After 3000 iterations, the sparsity aware MEE algorithms still perform comparably to the MEE, even though the system is now completely non-sparse.
Second, we conduct simulations with different γ (0.2, 0.4, 0.6, 0.8, 1) and α (1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7) to further demonstrate the performance of the proposed methods. In this simulation, we mainly focus on the very sparse channel of the first stage of the above model. The step size is set to 0.02 for all algorithms, and the other parameter settings are the same as in the previous simulation. The MSDs versus γ and versus α are illustrated in Figures 4 and 5, respectively. Evidently, the sparsity aware MEE algorithms perform well for the different parameters of the impulsive noise model. Moreover, the CIMMEE achieves much lower MSDs in all cases. These results confirm that the proposed sparsity aware MEE algorithms, especially CIMMEE, can efficiently estimate a sparse channel in impulsive noise environments.
Third, we perform simulations to investigate how the kernel width σ1, an important parameter of the sparsity aware MEE, affects the performance. Here, the steady-state MSDs of the CIMMEE with different σ1 (0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, and 5) and different α (1, 1.2, 1.4, 1.6, 1.8, and 2) are computed. The other parameters are set as: γ = 1, η = 0.01, ρ = 0.0001, σ2 = 0.04 and δ′ = 10. The results are given in Figure 6. One can see that the CIMMEE achieves different MSDs for different σ1 and under different noise distributions. In this example, the lowest MSD is obtained around σ1 = 1.5. From these results, we conclude that the kernel width in MEE has a significant influence on the performance.

5.2. Experiment 2

In the second experiment, the system is the same as in the first experiment, except for the switching times. The parameter vector of the unknown channel is assumed to be
$$W^* = \begin{cases} [0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0] & n \le 5000 \\ [1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0] & 5000 < n \le 10000 \\ [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1] & 10000 < n \end{cases} \tag{47}$$
and the channel memory size M is 20. The input signal {x(n)} is now a correlated signal generated by the process x(n) = 0.8x(n−1) + v(n) and then normalized to unit variance, where v(n) is a white Gaussian process. The observation noise is the same as in the first experiment, with the same parameters. All simulation results are obtained by averaging over 100 independent Monte Carlo runs, and each run performs 15,000 iterations. The sliding data length is L = 20. The step size is set to 0.04 for all algorithms. The kernel widths in MEE and CIM are 3.0 and 0.05, respectively. For all sparse MEE algorithms, ρ is set to 0.0001. The parameter δ′ for RZAMEE is 10. Figure 7 shows the average MSD of the three sparse MEE filters. As seen from the MSD results, similar performance trends are observed as in the first experiment. When the system is very sparse, the CIMMEE achieves better steady-state performance than ZAMEE and RZAMEE. As the number of non-zero taps increases to 10, and even to 20 (completely non-sparse), the CIMMEE algorithm still performs better than the other sparse MEE filters, because the CIM provides a nice approximation of the l0-norm.
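The correlated input described above can be generated as follows (a sketch; the normalization to unit variance follows the description in the text).

```python
import numpy as np

rng = np.random.default_rng(2)
N_iter = 15000
g = rng.standard_normal(N_iter)        # driving white Gaussian process v(n)
x = np.zeros(N_iter)
for n in range(1, N_iter):
    x[n] = 0.8 * x[n - 1] + g[n]       # AR(1) process x(n) = 0.8 x(n-1) + v(n)
x = x / x.std()                        # normalize to unit variance
```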
Second, we perform simulations to investigate how the kernel width σ1 and the characteristic factor α affect the performance; these are important parameters for the sparsity aware MEE. Here, the steady-state MSDs of the CIMMEE with different σ1 (1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, and 5) and different α (1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, and 1.9) are computed. The other filter parameters are set as: γ = 1, η = 0.01, ρ = 0.0001, σ2 = 0.05 and δ′ = 10. Figure 8 shows the simulation results in 3-D space. As one can observe clearly, the best performance of the CIMMEE is obtained at about σ1 = 3. If σ1 is too small or too large, the convergence performance becomes worse. However, the MSD is little affected by the characteristic factor α, which implies that the MEE is an extremely robust criterion in impulsive non-Gaussian noise.

5.3. Experiment 3

In the third experiment, we demonstrate the performance when the input signal is a fragment of 2 s of real speech, sampled at 8 kHz [4,8]. Figure 9 shows the acoustic echo path used in the simulation, a 1024-tap system with 52 non-zero coefficients, which can be considered very sparse. The output is still disturbed by alpha-stable noise, with noise parameter vector V = (1.4, 0, 0.2, 0). All simulation results are obtained by averaging over 100 independent Monte Carlo runs. The sliding data length is L = 20. The other parameters are set as: η = 0.0015, ρ = 0.0001, σ1 = 1.0, σ2 = 0.05 and δ′ = 10. The convergence curves of the sparse MEE algorithms are shown in Figure 10. Compared with ZAMEE and RZAMEE, the CIMMEE algorithm achieves a smaller MSD.

6. Conclusion

The MEE, as an adaptation criterion, has been successfully applied in many fields because of its desirable performance in non-Gaussian situations. In this work, we develop several sparsity aware MEE algorithms, including ZAMEE, RZAMEE, and CIMMEE, which are derived by incorporating different sparsity penalty terms into the MEE criterion. The mean square convergence properties of the proposed algorithms have been analyzed. Based on an energy conservation relation, we derive a sufficient condition that guarantees the mean square stability. Simulation results show that the new algorithms can achieve excellent performance, especially when the measurements are disturbed by impulsive non-Gaussian noises. How to select proper parameters, such as the kernel bandwidth, is an important issue. This will be an interesting topic for future study.

Acknowledgments

This work was supported by the 973 Program (No. 2015CB351703) and the National Natural Science Foundation of China (Nos. 61372152 and 61271210).
MSC Codes: 62B10

Author Contributions

The contributions of each author are as follows: Zongze Wu and Badong Chen proved the main results and wrote the draft; Siyuan Peng and Wentao Ma carried out the simulations; Jose C. Principe polished the language and was in charge of technical checking. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chen, Y.; Gu, Y.; Hero, A.O. Sparse LMS for system identification, Proceedings of 35th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2009), Taipei, Taiwan, 19–24 April 2009; pp. 3125–3128.
  2. Gu, Y.; Jin, J.; Mei, S. l0 norm constraint LMS algorithm for sparse system identification. IEEE Signal Process. Lett. 2009, 16, 774–777. [Google Scholar]
  3. Jin, J.; Qu, Q.; Gu, Y. Robust Zero-point Attraction Least Mean Square Algorithm on Near Sparse System Identification. IET Signal Process. 2013, 7, 210–218. [Google Scholar]
  4. Shi, K.; Shi, P. Convergence analysis of sparse LMS algorithms with l1-norm penalty based on white input signal. Signal Process 2010, 90, 3289–3293. [Google Scholar]
  5. Yin, D.; So, H.C.; Gu, Y. Sparse Constraint Affine Projection Algorithm with Parallel Implementation and Application in Compressive Sensing, Proceedings of 39th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP2014), Florence, Italy, 4–9 May 2014; pp. 7288–7292.
  6. Babadi, B.; Kalouptsidis, N.; Tarokh, V. SPARLS: The sparse RLS algorithm. IEEE Trans. Signal Process. 2010, 58, 4013–4025. [Google Scholar]
  7. Wu, F.Y.; Tong, F. Gradient optimization p-norm-like constraint LMS algorithm for sparse system estimation. Signal Process 2013, 93, 967–971. [Google Scholar]
  8. Salman, M.S. Sparse leaky LMS algorithm for system identification and its convergence analysis. Int. J. Adapt. Control Signal Process. 2014, 28, 1065–1072. [Google Scholar]
  9. Aliyu, M.L.; Alkassim, M.A.; Salman, M.S. A p-norm variable step-size LMS algorithm for sparse system identification. Signal Image Video Process 2014. [Google Scholar] [CrossRef]
  10. Wu, F.Y.; Tong, F. Non-Uniform Norm Constraint LMS Algorithm for Sparse System Identification. IEEE Commun. Lett. 2013, 17, 385–388. [Google Scholar]
  11. Das, B.K.; Chakraborty, M. Sparse Adaptive Filtering by an Adaptive Convex Combination of the LMS and the ZA-LMS Algorithms. IEEE Trans. Circuits Syst. 2014, 61, 1499–1507. [Google Scholar]
  12. Liu, Y.; Li, C.; Zhang, Z. Diffusion sparse least-mean squares over networks. IEEE Trans. Signal Process 2012, 60, 4480–4485. [Google Scholar]
  13. Plataniotis, K.N.; Androutsos, D.; Venetsanopoulos, A.N. Nonlinear filtering of non-Gaussian noise. J. Intell. Robot. Syst. 1997, 19, 207–231. [Google Scholar]
  14. Weng, B.; Barner, K.E. Nonlinear system identification in impulsive environments. IEEE Trans. Signal Process. 2005, 53, 2588–2594. [Google Scholar]
  15. Golub, G.H.; van Loan, C.F. Matrix Computation; the Johns Hopkins University Press: Baltimore, MD, USA, 1983. [Google Scholar]
  16. Principe, J.C. Information Theoretic Learning: Renyi’s Entropy and Kernel Perspectives; Springer: New York, NY, USA, 2010. [Google Scholar]
  17. Chen, B.; Zhu, Y.; Hu, J.; Principe, J.C. System Parameter Identification: Information Criteria and Algorithms; Elsevier: Amsterdam, The Netherlands, 2013. [Google Scholar]
  18. Erdogmus, D.; Principe, J.C. From linear adaptive filtering to nonlinear information processing. IEEE Signal Process. Mag. 2006, 23, 15–33. [Google Scholar]
  19. Erdogmus, D.; Principe, J.C. An error-entropy minimization for supervised training of nonlinear adaptive systems. IEEE Trans. Signal Process. 2002, 50, 1780–1786. [Google Scholar]
  20. Chen, B.; Hu, J.; Pu, L.; Sun, Z. Stochastic gradient algorithm under (h, ϕ)-entropy criterion. Circuit Syst. Signal Process 2007, 26, 941–960. [Google Scholar]
  21. Wolsztynski, E.; Thierry, E.; Pronzato, L. Minimum-entropy estimation in semi-parametric models. Signal Process. 2005, 85, 937–949. [Google Scholar]
  22. Song, A.; Qiu, T. The Equivalency of Minimum Error Entropy Criterion and Minimum Dispersion Criterion for Symmetric Stable Signal Processing. IEEE Signal Process. Lett. 2010, 17, 32–35. [Google Scholar]
  23. Chen, B.; Principe, J.C. Some further results on the minimum error entropy estimation. Entropy 2012, 14, 966–977. [Google Scholar]
  24. Chen, B.; Zhu, Y.; Hu, J. Mean-square convergence analysis of ADALINE training with minimum error entropy criterion. IEEE Trans. Neural Netw. 2010, 21, 1168–1179. [Google Scholar]
  25. Chen, B.; Principe, J.C. On the Smoothed Minimum Error Entropy Criterion. Entropy 2012, 14, 2311–2323. [Google Scholar]
  26. Li, C.; Shen, P.; Liu, Y.; Zhang, Z. Diffusion information theoretic learning for distributed estimation over network. IEEE Trans. Signal Process. 2013, 61, 4011–4024. [Google Scholar]
  27. Xue, Y.; Zhu, X. The minimum error entropy based robust wireless channel tracking in impulsive noise. IEEE Commun. Lett. 2002, 6, 228–230. [Google Scholar]
  28. Liu, W.F.; Pokharel, P.P.; Principe, J.C. Correntropy: Properties and Applications in Non-Gaussian Signal Processing. IEEE Trans. Signal Process. 2007, 55, 5286–5298. [Google Scholar]
  29. Seth, S.; Principe, J.C. Compressed signal reconstruction using the correntropy induced metric, Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP2008), Las Vegas, NV, USA, 31 March–4 April 2008; pp. 3845–3848.
  30. Wipf, D.P.; Nagarajan, S.S. A new view of automatic relevance determination. Adv. Neural Inf. Process. Syst. 2008. Available online: http://papers.nips.cc/paper/3372-a-new-view-of-automatic-relevance-determination accessed on 5 May 2015.
  31. Chen, B.; Zhu, P.; Principe, J.C. Survival information potential: A new criterion for adaptive system training. IEEE Trans. Signal Process. 2012, 60, 1184–1194. [Google Scholar]
  32. Principe, J.C.; Xu, D.; Zhao, Q.; John, F. Learning from examples with information theoretic criteria. J.VLSI Signal Process. Syst. Signal Image Video Technol. 2000, 26, 61–77. [Google Scholar]
  33. Chen, B.; Xing, L.; Liang, J.; Zheng, N.; Principe, J.C. Steady-state Mean-square Error Analysis for Adaptive Filtering under the Maximum Correntropy Criterion. IEEE Signal Process. Lett. 2014, 21, 880–884. [Google Scholar]
  34. Chen, B.; Principe, J.C. Maximum correntropy estimation is a smoothed MAP estimation. IEEE Signal Process. Lett. 2012, 19, 491–494. [Google Scholar]
  35. Al-Naffouri, T.Y.; Sayed, A.H. Adaptive filters with error nonlinearities: Mean-square analysis and optimum design. EURASIP J. Appl. Signal Process. 2001, 4, 192–205. [Google Scholar]
  36. Douglas, S.C.; Meng, T.H.Y. Stochastic gradient adaptation under general error criteria. IEEE Trans. Signal Process. 1994, 42, 1335–1351. [Google Scholar]
  37. Sayed, A.H. Fundamentals of Adaptive Filtering; Wiley: New York, NY, USA, 2003. [Google Scholar]
  38. Shin, H.-C.; Sayed, A.H. Mean-square performance of a family of affine projection algorithms. IEEE Trans. Signal Process. 2004, 52, 90–102. [Google Scholar]
  39. Papoulis, E.V.; Stathaki, T. A normalized robust mixed-norm adaptive algorithm for system identification. Signal Process. Lett. 2004, 11, 5286–5298. [Google Scholar]
  40. Shao, M.; Nikias, C.L. Signal processing with fractional lower order moments: Stable processes and their applications. Proc. IEEE 1993, 81, 986–1010. [Google Scholar]
  41. Weng, B.; Barner, K.E. Nonlinear system identification in impulsive environments. IEEE Trans. Signal Process. 2005, 53, 2588–2594. [Google Scholar]
  42. Georgiadis, A.T.; Mulgrew, B. A family of recursive algorithms for channel identification in alpha-stable noise, Proceedings of the Fifth Bayona Workshop on Emerging Technologies in Telecommunications, Bayona, Spain, 6–8 September 1999; pp. 153–157.
  43. Wang, J.; Kuruoglu, E.E.; Zhou, T. Alpha-stable channel capacity. IEEE Commun. Lett. 2011, 15, 1107–1109. [Google Scholar]
Figure 1. Adaptive system training under minimum error entropy (MEE) criterion.
Figure 2. Contours of CIM(X, 0) in 3-D space (where kernel size is 0.4).
Figure 3. Tracking and steady-state behaviors of 20-order adaptive filters.
Figure 4. Steady-state mean square deviation (MSD) versus different values of γ.
Figure 5. Steady-state mean square deviation (MSD) versus different values of α.
Figure 6. Steady-state mean square deviation (MSD) of sparse minimum error entropy (MEE) with the correntropy induced metric (CIM) penalty term (CIMMEE) with different kernel size σ1 for different α.
Figure 7. Tracking and steady-state behaviors of 20-order adaptive filters with correlated input.
Figure 8. Steady-state mean square deviation (MSD) of sparse minimum error entropy (MEE) with the correntropy induced metric (CIM) penalty term (CIMMEE) with different kernel size σ1 and different α in 3-D space.
Figure 9. Acoustic echo path with length M = 1024.
Figure 10. Convergence behaviors with speech signal input. The speech signal is shown in the upper plot.
