A Nonlinear Maximum Correntropy Information Filter for High-Dimensional Neural Decoding

Neural signal decoding is a critical technology in brain machine interfaces (BMIs) for interpreting movement intention from the activity of multiple neurons recorded from paralyzed patients. As a commonly used decoding algorithm, the Kalman filter is often applied to derive the movement states from high-dimensional neural firing observations. However, its performance is limited for noisy nonlinear neural systems with high-dimensional measurements. In this paper, we propose a nonlinear maximum correntropy information filter, aiming at better state estimation in the filtering process for noisy high-dimensional measurement systems. We reconstruct the measurement model between the high-dimensional measurements and low-dimensional states using a neural network, and derive the state estimate under the correntropy criterion to cope with non-Gaussian noise and eliminate large initial uncertainty. Moreover, analyses of convergence and robustness are given. The effectiveness of the proposed algorithm is evaluated by applying it to multiple segments of neural spiking data from two rats to interpret the movement states while the subjects perform a two-lever discrimination task. Our results demonstrate better and more robust state estimation performance compared with other filters.


Introduction
Brain machine interface (BMI) establishes a direct communication pathway between the brain and external devices [1][2][3][4][5][6]. A BMI collects noisy signals from hundreds of neurons in the brain and estimates a motor intention from these signals [7,8]. The estimated movement intention can then be used to control a robot to assist motor-disabled people [9][10][11][12][13][14][15][16][17][18][19]. Signal processing algorithms play a key role in BMI. As a commonly used state-observation model, the Kalman filter (KF) has been adopted to decode movement intents as the state from high-dimensional observations formed by the firing activity of multiple neurons [20][21][22][23], in which the movement state evolves over time as described by a linear state model, and the observation model reflects how neuronal firing tunes to movement under Gaussian noise. The Kalman filter nicely accounts for the gradual change of the continuous brain state, and is thus especially appropriate for brain control tasks in which the subject continuously adjusts the brain states to control an external robot.
However, there are some challenges in applying the state-observation model to BMI. Firstly, nervous systems are nonlinear in general [24,25], and the traditional KF is not suited to nonlinear systems. As extensions of the KF, the extended Kalman filter (EKF) [26], unscented Kalman filter (UKF) [27,28], and cubature Kalman filter (CKF) [29] can deal with nonlinear systems, but they only approximate the nonlinearity and their performance degrades under non-Gaussian noise. In this paper, the convergence speed of the proposed algorithm is analyzed theoretically. The proposed algorithm is applied to real neural recordings from two rats while the subjects were manually performing a two-lever discrimination task. The two-dimensional movement states are estimated from the activity of multiple neurons. The performance, convergence time, sensitivity to bad initial conditions, and robustness to heavy-tailed noise of the proposed algorithm are compared with those of the traditional Kalman filter and a neural network decoder. We also use the 2D mean square error to evaluate the movement estimation against the ground truth.
Our contribution is to propose a new methodology to better estimate the state in the filtering process for noisy high-dimensional measurement systems. A neural network is utilized to construct the nonlinear measurement model between the high-dimensional measurements and low-dimensional states. The correntropy criterion is adopted in deriving the state to cope with non-Gaussian noise and eliminate uncertainty. The rest of this paper is organized as follows. In Section 2, the nonlinear maximum correntropy information filter (NMCIF) is introduced. In Section 3, the detailed derivation of the NMCIF is given, and the robustness and convergence of the algorithm are analyzed. In Section 4, a real-data application on the two-lever discrimination task of rats is used to validate the effectiveness of the proposed algorithm compared with other filters. In Section 5, the discussion and conclusion are given.

Nonlinear Maximum Correntropy Information Filter
In general, the state-observation model is written as a dynamic system:

x_k = F x_{k-1} + q_k,    (1)
y_k = H(x_k) + ζ_k,    (2)

where x_k denotes the n-dimensional state at time k; in our application, it is the 2-dimensional position. F denotes the state transition function; in our application, it is a linear function F estimated from the training data by least squares [53]. q_k denotes the process noise with mean q̄_k and covariance matrix Q_k, which are estimated from the residuals of the state transition approximation. y_k denotes the m-dimensional measurement; in our application, it is a high-dimensional vector formed by the firing activities of multiple neurons, including the firing history. H is the nonlinear observation model; in our application, it is estimated from the state vector to each dimension of the neural activity independently. In the linear case, each row of H describes how the neural activity tunes to the movement states through a weighted combination. ζ_k denotes the measurement noise, which is estimated from the residuals of the observation model.

For a nonlinear system with high-dimensional correlated observations, as in most BMI experiments, the noisy measurements are collected from hundreds of neurons, among which there is heavy connectivity. The commonly used second-order statistics criterion, MMSE, may not be optimal, as the noise in a nonlinear system is generally not Gaussian distributed [42]. Correntropy is a key concept in information-theoretic learning (ITL), which measures the similarity between two random variables considering all even-order statistics. Assuming X and Y are two random variables, the correntropy is defined as:

V(X, Y) = E[κ(X, Y)] = ∫ κ(x, y) dF_XY(x, y),    (3)

where E represents the expectation operator, F_XY(x, y) denotes the joint distribution function of the two random variables, and κ(·, ·) is a shift-invariant Mercer kernel.
The Gaussian kernel is simple and has been successfully applied in heavy-tailed noise environments [44,46,47], so we choose the Gaussian kernel to suppress this noise, i.e.,

κ(x, y) = G_σ(e) = exp(−e² / (2σ²)),    (4)

where e = x − y, and σ > 0 is the kernel bandwidth. Taking the Taylor series expansion of the Gaussian kernel shows that correntropy contains not only the second-order moment but also all higher even-order statistical information. In particular, when the kernel bandwidth σ is 20 times larger than the value chosen by Silverman's rule [44], the second-order moment plays the dominant role in correntropy.
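As a quick numerical illustration (our own sketch, not part of the original derivation), the sample correntropy with a Gaussian kernel barely reacts to a single large outlier that dominates the MSE; the function and variable names below are ours:

```python
import numpy as np

def correntropy(x, y, sigma):
    """Sample estimate of V(X, Y) = E[G_sigma(X - Y)] with the Gaussian
    kernel G_sigma(e) = exp(-e^2 / (2 * sigma^2))."""
    e = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.mean(np.exp(-e**2 / (2.0 * sigma**2))))

target = np.zeros(100)
clean = np.zeros(100)
noisy = np.zeros(100)
noisy[0] = 50.0                                     # one heavy-tailed outlier

mse_noisy = float(np.mean((noisy - target) ** 2))   # 25.0: dominated by the outlier
v_clean = correntropy(clean, target, sigma=2.0)     # 1.0
v_noisy = correntropy(noisy, target, sigma=2.0)     # ~0.99: outlier almost ignored
```

The outlier shifts the MSE from 0 to 25, while the correntropy drops only from 1.0 to about 0.99, reflecting the insensitivity to large errors discussed above.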
In this section, we propose a new estimation method for the state-observation model. Let W_k be the information matrix of the process noise, i.e., the inverse of its covariance matrix, W_k = Q_k^{-1}. We propose the NMCIF as follows. The prior mean x̄_k and the corresponding information matrix χ̄_k are first computed as:

x̄_k = F x̂_{k-1} + q̄_k,    (5)
χ̄_k = (F χ_{k-1}^{-1} F^T + W_k^{-1})^{-1},    (6)

and the posterior mean x̂_k is recursively calculated by the fixed-point iterative method [54] as:

x̂_k^{(t)} = x̄_k + K_k^{(t-1)} (g(y_k) − x̄_k),    (7)

where g(·) denotes the nonlinear preprocessing of the observation, which reduces the dimensionality of the original observation to that of the state, such that:

g(y_k) = x_k + r_k.    (8)

The superscript (t) stands for the fixed-point iteration index, and the iterative process stops when:

‖x̂_k^{(t)} − x̂_k^{(t-1)}‖ / ‖x̂_k^{(t-1)}‖ ≤ ω,    (9)

with ω being a small positive value, or when the iteration index reaches a preset value. K_k^{(t-1)} stands for the Kalman gain:

K_k^{(t-1)} = (χ̃_k^{(t-1)} + Ṽ_k^{(t-1)})^{-1} Ṽ_k^{(t-1)},    (10)

where χ̃_k^{(t-1)} is the revised prior information matrix denoted as in Equation (12), and Ṽ_k^{(t-1)} is denoted as in Equation (13).
In our application, we use a multi-layer perceptron as a universal approximator, which takes the high-dimensional neural firing as input and outputs the preprocessed states as the observation. The observation model is then an identity matrix. r_k denotes the neural network approximation residual, with mean r̄_k and covariance matrix R_k. Since the neural network can account for the bias, we assume that the process noise and measurement noise have zero means. V_k is defined by V_k = R_k^{-1}. The corresponding information matrix χ_k of the posterior is updated as:

χ_k = χ̃_k^{(T)} + Ṽ_k^{(T)},    (11)

where, for each iteration t,

χ̃_k^{(t)} = S_{χ,k}^T C_{χ,k}^{(t)} S_{χ,k},    (12)
Ṽ_k^{(t)} = S_{V,k}^T C_{V,k}^{(t)} S_{V,k},    (13)
C_{χ,k}^{(t)} = diag(G_σ(e_{k,1}), …, G_σ(e_{k,n})),    (14)
C_{V,k}^{(t)} = diag(G_σ(e_{k,n+1}), …, G_σ(e_{k,2n})),    (15)

where d_{k,i} denotes the i-th element of D_k, m_{k,i} denotes the i-th row of M_k, and

e_{k,i} = d_{k,i} − m_{k,i} x̂_k^{(t)}.    (16)

Here we take D_k as the transformed observation, M_k as the new observation model, and e_k as the noise term; S_k can be obtained by the Cholesky decomposition of the inverse of the covariance of the noise term, with S_k = diag(S_{χ,k}, S_{V,k}), S_{χ,k}^T S_{χ,k} = χ̄_k, and S_{V,k}^T S_{V,k} = V_k. Note that when the kernel bandwidth becomes increasingly large, the performance of the algorithm approaches that of the corresponding nonlinear information filter (NIF). In particular, if σ → ∞, then C_{χ,k} → I and C_{V,k} → I, and the proposed algorithm reduces to the NIF. Thus, the NIF algorithm is a special case of the proposed algorithm, and is summarized as follows:

Derivation of Estimation on the Mean of Posterior
Next, we give the derivation of the nonlinear maximum correntropy information filter.
The mean of the prior state x̄_k and the corresponding information matrix are computed in the same way as Equations (5) and (6). By Equations (1), (8) and (17), we can build the following equation:

[x̄_k; g(y_k)] = [I_n; I_n] x_k + δ_k,    (22)

where [·; ·] denotes vertical stacking, I_n is an n × n identity matrix, and

δ_k = [−(x_k − x̄_k); r_k].

We can easily obtain the inverse of the covariance of the noise term in Equation (22) as:

E[δ_k δ_k^T]^{-1} = diag(χ̄_k, V_k) = S_k^T S_k,

where S_k is the Cholesky decomposition of E[δ_k δ_k^T]^{-1}.
Left multiplying both sides of Equation (22) by S_k yields:

D_k = M_k x_k + e_k,    (23)

where

D_k = S_k [x̄_k; g(y_k)],  M_k = S_k [I_n; I_n],  e_k = S_k δ_k.    (24)

The correntropy-based cost function is introduced as:

J(x_k) = Σ_{i=1}^{2n} G_σ(d_{k,i} − m_{k,i} x_k),    (25)

where d_{k,i} denotes the i-th component of D_k, m_{k,i} denotes the i-th row of M_k, and n is the dimension of the state.
By a derivation similar to [55], the fixed-point iterative method is adopted as:

x̂_k^{(t)} = (M_k^T C_k^{(t-1)} M_k)^{-1} M_k^T C_k^{(t-1)} D_k,    (26)

where the superscript (t) stands for the fixed-point iteration index, C_k = diag(C_{χ,k}, C_{V,k}), and the diagonal elements are defined in Equations (14) and (15). In fact, Equation (26) can be further written as Equation (7) (the detailed derivation can be seen in Appendix A).
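The fixed-point recursion in Equation (26) can be sketched as an iteratively reweighted least-squares loop. The snippet below is a minimal illustration on hypothetical data: the function name `mcc_fixed_point` and the synthetic `M`, `D` are ours, and the update is the generic maximum correntropy regression form, not the paper's exact NMCIF implementation:

```python
import numpy as np

def mcc_fixed_point(M, D, sigma, tol=1e-8, max_iter=100):
    """Fixed-point (IRLS) solution of max_x sum_i G_sigma(d_i - m_i x):
    x = (M^T C M)^{-1} M^T C D with C = diag(G_sigma(e_i)) recomputed
    from the current residuals at each iteration."""
    x = np.linalg.lstsq(M, D, rcond=None)[0]          # least-squares start
    for t in range(max_iter):
        e = D - M @ x
        c = np.exp(-e**2 / (2.0 * sigma**2))          # correntropy weights
        x_new = np.linalg.solve(M.T @ (c[:, None] * M), M.T @ (c * D))
        if np.linalg.norm(x_new - x) <= tol * np.linalg.norm(x):
            return x_new
        x = x_new
    return x

rng = np.random.default_rng(0)
M = rng.normal(size=(200, 2))
x_true = np.array([1.0, -2.0])
D = M @ x_true + 0.01 * rng.normal(size=200)
D[:5] += 30.0                                         # heavy-tailed outliers
x_mcc = mcc_fixed_point(M, D, sigma=1.0)
x_ls = np.linalg.lstsq(M, D, rcond=None)[0]           # biased by the outliers
```

With a moderate bandwidth the outlier rows receive near-zero weights, so the fixed-point solution stays close to the true parameters while plain least squares is pulled away.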

Derivation of Information Matrix
After the iterative process stops, the estimation error information matrix χ_k also needs to be updated. According to [56], the influence function (IF), defined as an n × 1 vector, can be used to derive the error covariance matrix of an estimator under an assumed probability distribution, and the relationship between the influence function and the asymptotic covariance matrix of the estimation error P_k is:

P_k = E[IF(e, A_0) IF(e, A_0)^T],    (27)

where E is the expectation operator. The detailed derivation can be seen in Appendix B.
The IF can be computed as:

IF(e, A_0) = (Σ_{i=1}^{2n} E_{A_0}[ϕ′(e)] m_{k,i}^T m_{k,i})^{-1} m_k^T ϕ(e),    (28)

where E_{A_0}[ϕ′(e)] is the expectation of the first derivative of ϕ(e) under the distribution A_0, A_0 is the target distribution of e, m_{k,i}^T is the transpose of the i-th row of M_k, and M_k is denoted in Equation (24). The detailed derivation can be seen in Appendix C. Thus, the corresponding information matrix is written as:

χ_k = ((E_{A_0}[ϕ′(e)])² / E_{A_0}[ϕ²(e)]) M_k^T M_k.    (29)

Here we prove that E_{A_0}[ϕ′(e)] and E_{A_0}[ϕ²(e)] are constants in Equations (A29) and (A30) of Appendix D; substituting them into Equation (29) yields:

χ_k = (σ³(σ² + 2θ)^{3/2} / (θ(σ² + θ)³)) M_k^T M_k,    (30)

where σ is the kernel bandwidth, and θ is the variance of the distribution A_0. Note that the information matrix in Equation (30) holds only in an asymptotic sense and is suitable for a state with high dimension. For a state with small dimension, however, and especially in the derivation of our algorithm, Equation (30) may not be a good choice. Equation (26) can be referred to as the iterative reweighted least squares (IRLS) method.
Thus, assuming the iteration terminates at t = T, the final estimate is x̂_k^{(T)}, and the associated covariance P_k can be obtained by:

P_k = (M_k^T C_k^{(T)} M_k)^{-1}.    (31)

Thus, the information matrix is computed as:

χ_k = M_k^T C_k^{(T)} M_k.    (32)

We can see that the information matrix in Equation (30) is equivalent to the least squares (LS) information, i.e., M_k^T M_k, multiplied by a constant scalar smaller than one. The information matrix in Equation (32) implies that each row is multiplied by a different scalar depending on the residual error e_{k,i} as in Equation (16); an outlier is suppressed by multiplying its row by a smaller scalar when calculating the information matrix. Thus, compared with Equation (30), Equation (32) may be relatively better in the small-dimension case, especially in the proposed algorithm. Unless otherwise specified, the proposed algorithm uses the information matrix in Equation (32).
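To make the contrast concrete, here is a toy numerical sketch (our own illustration with made-up numbers) of the per-row weighting behind Equation (32): the row with a large residual contributes almost no information, whereas plain least squares counts every row fully.

```python
import numpy as np

# Hypothetical 3-observation example; all numbers are illustrative.
M = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
e = np.array([0.1, -0.2, 5.0])            # third residual is an outlier
sigma = 2.0
c = np.exp(-e**2 / (2.0 * sigma**2))      # per-row correntropy weights in (0, 1]

chi_weighted = M.T @ (c[:, None] * M)     # Eq.(32)-style: outlier row nearly removed
chi_ls = M.T @ M                          # plain least-squares information
```

The weight of the outlier row is about 0.04, so `chi_weighted` reports far less information along that row's direction than the unweighted least-squares matrix.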

Robustness Analysis
According to [56], the influence function (IF) can also be used to quantitatively measure the robustness of an estimator, i.e., its infinitesimal robustness. If an estimator is infinitesimally robust, the corresponding IF must satisfy continuity and boundedness [57]. The IF in Equation (28) satisfies continuity; next, we prove its boundedness. In Equation (28), the term (Σ_{i=1}^{2n} E_{A_0}[ϕ′(e)] m_{k,i}^T m_{k,i})^{-1} m_k^T is a known matrix, so the boundedness of the IF depends on ϕ(e):
ϕ(e) = e exp(−e² / (2σ²)),    (33)

where e is the error and σ is the kernel bandwidth. Note that ϕ(e) is a continuous and odd function; according to the boundedness theorem [58], it is bounded on any closed interval, i.e., for any τ > 0, ϕ(e) is bounded on [−τ, τ]. On the interval [τ, +∞), according to L'Hôpital's rule [59], we have:

lim_{e→+∞} ϕ(e) = lim_{e→+∞} e / exp(e²/(2σ²)) = lim_{e→+∞} σ² / (e exp(e²/(2σ²))) = 0.    (34)

Consequently, ϕ(e) is bounded on the interval [τ, +∞), and it is bounded on (−∞, −τ] because ϕ(e) is odd. This completes the proof. The IF in Equation (28) satisfies continuity and boundedness, so the proposed algorithm can be regarded as infinitesimally robust.
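This boundedness can be checked numerically. The sketch below (our own check, assuming the Gaussian-kernel score function ϕ(e) = e·exp(−e²/(2σ²))) evaluates ϕ on a dense grid and compares it against the analytic maximum σ·e^{−1/2}, attained at e = ±σ:

```python
import numpy as np

def phi(e, sigma):
    """Score function phi(e) = e * exp(-e^2 / (2 * sigma^2))."""
    return e * np.exp(-e**2 / (2.0 * sigma**2))

sigma = 2.0
grid = np.linspace(-100.0, 100.0, 400_001)     # dense grid over a wide range
bound = sigma * np.exp(-0.5)                   # analytic maximum, at e = +/- sigma
grid_max = float(np.max(np.abs(phi(grid, sigma))))
```

The grid maximum never exceeds the analytic bound, and ϕ decays to zero for large |e|, which is exactly the property that suppresses heavy-tailed outliers.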

Convergence Analysis
In Section 3.1, we adopted a fixed-point iterative algorithm to obtain the posterior estimate in Equation (26). In fact, the convergence of the fixed-point iteration is affected by the kernel bandwidth σ. According to the Banach fixed-point theorem [60], the sufficient condition on the choice of the kernel bandwidth that ensures convergence of the fixed-point iteration is the same as that in [55,61].
Here we focus on the convergence rate of the fixed-point iteration. For the sake of this discussion, Equation (26) can be written in the equivalent form:

x_k = (Σ_{i=1}^{2n} G_σ(e_{k,i}) m_{k,i}^T m_{k,i})^{-1} (Σ_{i=1}^{2n} G_σ(e_{k,i}) m_{k,i}^T d_{k,i}),    (35)

where e_{k,i}, m_{k,i}, and d_{k,i} are defined as in Equation (24). Thus, Equation (35) forms a fixed-point equation:

x_k = f(x_k).    (36)

By [55] and Equation (37), the 1-norm of the gradient of f at the optimal point can be bounded in terms of β, N_mm^{-1}, e_{k,i}, m_{k,i}, and d_{k,i}, where ‖·‖₁ denotes the 1-norm of a vector or the induced 1-norm of a matrix. Since β, N_mm^{-1}, e_{k,i}, m_{k,i}, and d_{k,i} are bounded, it is evident that if we choose a large enough kernel bandwidth σ, the gradient vector is close to zero with a bounded 1-norm. The gradient at the optimal point is then practically zero, and the algorithm is at least quadratically convergent. In particular, if σ → ∞, then G_σ(e_{k,i}) → 1, and the fixed-point iteration reduces to the MMSE solution (i.e., the NIF in Section 2), which has a zero gradient vector at the optimal solution and converges in just one step. Consequently, as the kernel bandwidth σ decreases, the gradient of the fixed-point map at the optimal solution increases, and the order of convergence drops from super-linear to linear [62]. Moreover, the robustness of the algorithm increases with a smaller kernel bandwidth.
From the above analysis, the smaller the kernel bandwidth, the more robust the algorithm and the slower the convergence rate. However, the kernel bandwidth has a lower bound that guarantees the convergence of the fixed-point method. On the other hand, as the kernel bandwidth grows, the convergence rate increases and the algorithm behaves like the corresponding NIF. In particular, if σ → ∞, then C_k → I, and the proposed algorithm reduces to the NIF and converges in just one step.
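A scalar toy example (our own sketch; the data and function name are made up) illustrates this trade-off: with a very large bandwidth the fixed-point map is essentially the sample mean and stops after one step, while a small bandwidth needs more iterations:

```python
import numpy as np

def iterations_to_converge(sigma, tol=1e-10, max_iter=500):
    """Scalar MCC fixed point x = sum(c_i * d_i) / sum(c_i) with
    correntropy weights c_i = G_sigma(d_i - x); returns the number of
    iterations until |x_new - x| <= tol."""
    d = np.array([0.9, 1.0, 1.1, 8.0])   # three inliers and one outlier
    x = float(d.mean())                  # MMSE (infinite-bandwidth) start
    for t in range(1, max_iter + 1):
        c = np.exp(-(d - x) ** 2 / (2.0 * sigma ** 2))
        x_new = float(np.sum(c * d) / np.sum(c))
        if abs(x_new - x) <= tol:
            return t
        x = x_new
    return max_iter

iters_small = iterations_to_converge(sigma=0.5)   # robust but slower
iters_large = iterations_to_converge(sigma=1e6)   # behaves like MMSE: one step
```

The small-bandwidth run also lands near the inlier cluster around 1.0 rather than the outlier-contaminated mean, matching the robustness side of the trade-off.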

Neural Decoding Using Nonlinear Maximum Correntropy Information Filter
In this section, we apply the proposed nonlinear maximum correntropy information filter to decode the two-dimensional movement intention from high-dimensional neural observations, and compare it with the Kalman filter and a multi-layer perceptron network.

Experiment and Data Collection
The BMI experiment paradigm of the rats' two-lever discrimination task was designed and conducted at the Hong Kong University of Science and Technology. All animal handling procedures were approved by the Animal Care Committee of the Hong Kong University of Science and Technology. Two 16-channel microelectrode arrays were implanted in the primary motor cortex (M1) and the medial prefrontal cortex (mPFC), respectively, in the left hemisphere. The raw neural extracellular potential was recorded by a multi-channel acquisition processor (Plexon Inc., Dallas, TX, USA) at a 40-kHz sampling rate. The recorded potential was filtered by a 4-pole Butterworth high-pass filter at 500 Hz, and action potentials were detected with a manually set threshold (about four times the standard deviation of the filtered potential) in the online recording system. The spike trains were binned by a 100-ms time window without overlap for each channel, and the spike count in each 100-ms window was taken as the firing rate.
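The binning step can be sketched as follows (a minimal illustration with made-up spike times; the function and parameter names are ours):

```python
import numpy as np

def bin_spike_train(spike_times_s, duration_s, bin_ms=100):
    """Bin spike times (in seconds) into non-overlapping windows of
    bin_ms milliseconds; the spike count per window serves as the
    firing rate for that bin."""
    edges = np.arange(0.0, duration_s + 1e-9, bin_ms / 1000.0)
    counts, _ = np.histogram(spike_times_s, bins=edges)
    return counts

# Hypothetical spikes on one channel over a 0.5-s stretch.
spikes = [0.01, 0.05, 0.12, 0.31, 0.33, 0.34]
rates = bin_spike_train(spikes, duration_s=0.5)   # five 100-ms bins
```

Applying this per channel and stacking the counts across the 32 channels yields one high-dimensional firing-rate vector per 100-ms time step.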
The rats were trained to perform the two-lever discrimination task. In this task, the rats were required to distinguish two audio tones, which were randomly given by the computer at the trial start. The rats needed to use their right paw to press the corresponding lever and hold it for over 500 ms within a 6-second try-out time to get a water reward. On the high pitch (10 kHz), the rats needed to reach the high lever, which was 10 cm high, while on the low tone (1.5 kHz), the rats turned to the low lever, which was positioned 6 cm lower than the high lever. After a successful hold, water was provided by a pump, together with a short feedback tone of the same frequency as the start cue. The behavioral events and their timing information were recorded by the behavioral chamber (Lafayette Instrument, Lafayette, IN, USA) and synchronized with the neural spike activity through the Plexon digital input. In total, there are 51 high-lever-press trials and 80 low-lever-press trials in rat A, and 72 high-lever-press trials and 128 low-lever-press trials in rat B.
To model the above procedure, we consider only the successful trials, and the two-dimensional states are denoted as x_k. To label the behavioral data, the 500 ms before the start cue is set as the rest state with [0,0], and holding the high lever and the low lever are set as [1,1] and [1,−1], respectively. All successful trials are segmented from the start to 500 ms after the start, and from 500 ms before the press to the press onset. All segments are concatenated to represent the behavioral process and smoothed by a sigmoid function. After the feedback, the rats spend 200 ms returning to the rest state, which is smoothed by a sigmoid function as well. We use the current time step of the neural firing rates as the 32-dimensional measurements, denoted as y_k, to decode the behavioral state. Figure 1c shows one segment of spike rasters over time in five channels, and the corresponding behavioral states in 2D are shown in Figure 1a,b, respectively. Following the above design, the state equations and measurement equations can be written as Equations (1) and (8). The state transition F in Equation (1) is obtained by the least squares approach, the nonlinear relationship g(·) in Equation (8) is approximated by a multi-layer perceptron network, and the means and covariances of the process noise q_k in Equation (1) and the measurement noise r_k in Equation (8) are estimated from the residual errors. Next, according to the above model, we apply the proposed algorithm to estimate the two-dimensional states x_k from the high-dimensional measurements y_k, and compare it with the KF and neural network decoding performance. Note that the neural network takes in the current neural firing rates together with the past 400 ms of firing rate history and outputs the 2-dimensional movement intention. The number of hidden processing elements (PEs) is set to 10.
The weights are initialized 20 times and trained by steepest gradient-descent backpropagation, which minimizes the mean square error between the output and the ground-truth movement. We use 60% of the data for weight training and 40% for testing. We use the 2D mean square error (2D-MSE) between the movement estimate and the ground truth as the criterion to evaluate the decoding performance.
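The evaluation metric can be sketched as follows. We assume here that the 2D-MSE is the squared Euclidean distance between the estimated and true 2D positions averaged over time steps (the function name and this exact definition are our assumption):

```python
import numpy as np

def mse_2d(x_est, x_true):
    """2D mean square error between estimated and true trajectories:
    mean over time steps k of ||x_est[k] - x_true[k]||^2."""
    d = np.asarray(x_est, dtype=float) - np.asarray(x_true, dtype=float)
    return float(np.mean(np.sum(d**2, axis=1)))

# Tiny hypothetical trajectory of two time steps.
x_true = np.array([[0.0, 0.0], [1.0, 1.0]])
x_est = np.array([[0.1, 0.0], [1.0, 0.8]])
err = mse_2d(x_est, x_true)   # (0.01 + 0.04) / 2 = 0.025
```

A single scalar per decoded segment makes the comparison across KF, NN, NIF, and NMCIF straightforward.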

Figure 2a,b show a segment of decoded 2-D movements, x_{k,1} and x_{k,2}, using KF, with NMCIF and NN presented for comparison. We can see that the curve of NMCIF (the magenta solid curve) is closer to the ground truth (the red dotted curve) than that of KF (the blue solid curve). This is because the proposed method adopts the nonlinear preprocessing of the observation into the same dimension as the states instead of the linear model, which takes the interactions among recorded neurons into account. Moreover, from the 210th to the 220th time step, the 320th to the 330th time step, and the 350th to the 400th time step in x_{k,2} of Figure 2, we can see that the curve of NMCIF (the magenta solid curve) is smoother than that of NN (the green solid curve) due to the consideration of the time-dynamic nature of the state. Table 1 shows the statistical 2D-MSEs of the algorithms between the true and estimated values across 10 data segments for each rat. Please note that for the NN, we use an MLP network to approximate the nonlinearity; an RNN can also be used, whose results are 0.3555 ± 0.0959 and 0.3984 ± 0.0735. In terms of the averaged performance, the 2D-MSE of NMCIF in rat_A decreases by 57.62% and 6.91% compared with those of KF and NN, respectively, and the 2D-MSE of NMCIF in rat_B decreases by 25.89% and 5.17%. The results show that the proposed algorithm is superior to the KF and NN. We next discuss the convergence and sensitivity of the proposed algorithm in the presence of largely deviated state initials.
From Section 3.4, we know that the kernel bandwidth plays a key role in the performance of NMCIF. Figure 3 shows the relationship between performance and the kernel bandwidth. We can see that if the kernel bandwidth is too small, the proposed algorithm performs worse and may even diverge. If the kernel bandwidth is too large, the performance of NMCIF (the magenta solid curve) is similar to that of NIF (the black solid curve). With a proper kernel bandwidth, the NMCIF outperforms the NIF; in particular, when the kernel bandwidth σ = 2, the NMCIF achieves its best performance. Figure 4 shows the relationship between the number of fixed-point iterations and the kernel bandwidth. As one can see, the larger the kernel bandwidth, the fewer iterations are required. Figure 5 shows the comparison between the KF and the proposed NMCIF to exhibit the advantage of the information-type filter when the initial condition is largely uncertain. The initial covariance matrix in KF is set to 1, while the initial information matrix in NMCIF is set to a small positive value (10^{-6}). Figure 5a,b show the estimates of x_{k,1} and x_{k,2} by the different algorithms compared with the ground truth on a segment of the data when the initial estimate deviates greatly from the initial true value. As one can see, the NMCIF (the magenta solid curve) converges quickly to the neighborhood of the ground truth (the red dotted curve), while the KF (the blue solid curve) converges slowly. Table 2 summarizes the statistical decoding error over 10 segments of data with largely deviated initial values. In terms of the averaged performance, the 2D-MSEs of x_k by NMCIF in rat_A and rat_B decrease by 33.84% and 26.00%, respectively, compared with those of KF. It is obvious that NMCIF is superior to KF when the initial estimate deviates greatly from the initial true values.
In Figure 5, the red dotted curve is the ground truth, the blue solid curve is the Kalman filter, and the magenta solid curve is the nonlinear maximum correntropy information filter.

As our algorithm demonstrates the best performance over KF and NN for neural decoding, we further discuss the robustness of the proposed algorithm in the presence of large noise, which is commonly seen in online recorded neural activity [63]. Here, 3.3% of the time instances are randomly corrupted with large noise, corresponding to r_k in Equation (8). Figure 6a,b show one segment of movement decoding using the nonlinear information filter and the proposed method. In these figures, the red dotted curve denotes the ground truth and the blue asterisks denote the timings with large noise. The performance of NIF and NMCIF is presented for comparison. From the 410th to the 420th time step and the 460th to the 470th time step in Figure 6a, and the 440th to the 450th time step in Figure 6b, we can see that the curve of NMCIF (the magenta solid curve) is smoother than that of NIF (the black solid curve) because of the use of correntropy against the heavy-tailed non-Gaussian noise. The statistical 2D-MSEs of x_k by NIF and NMCIF across 10 data segments are summarized in Table 3. Note that NMCIF_A denotes the NMCIF using the information matrix in Equation (30), and NMCIF_B denotes the NMCIF using the information matrix in Equation (32). In terms of the averaged performance, the 2D-MSEs of NMCIF_B in rat_A and rat_B decrease by 29.54% and 8.97%, respectively, compared with those of NIF. Therefore, NMCIF exhibits better performance than NIF under non-Gaussian noise. The 2D-MSE of NMCIF_B is less than that of NMCIF_A, which indicates that the information matrix in Equation (32) may be relatively better in NMCIF.

Figure 6. Reconstruction of 2-D movements affected by noisy observations. The X-axis is time; the Y-axis is the first (a) and the second (b) dimension of the movement. The red dotted curve is the ground truth, the black solid curve is the nonlinear information filter, and the magenta solid curve is the nonlinear maximum correntropy information filter.

Discussion and Conclusions
BMI records and decodes neural signals to predict movement intents, with the goal of improving the lives of motor-impaired patients by restoring motor functions. As a widely used decoding algorithm, the KF is usually adopted to estimate the movement as the state from neural activity observations under linear and Gaussian-noise assumptions. However, nervous systems are nonlinear in general; the observations are high-dimensional, since recordings are collected from hundreds of neurons, and can be very noisy; and the initial movement states of disabled patients cannot always be known. All these factors degrade the performance of the KF. In this paper, we propose a nonlinear maximum correntropy information filter to derive the state in the filtering process from noisy high-dimensional measurements. A nonlinear model is used to preprocess the high-dimensional neural observations into the same dimensionality as the states, which allows the connectivity among neurons to be considered. The correntropy criterion is adopted to address the presence of non-Gaussian noise. The information matrix of the state is derived with less sensitivity to the state's initial conditions. We provide an analysis of the convergence condition and robustness. The proposed algorithm is applied to decode the movements from real neural data collected from rats performing a two-lever discrimination task. The 2D reconstruction error (2D-MSE) of NMCIF in rat_A decreases by 57.62% and 6.91% compared with those of KF and NN, respectively, and the 2D-MSE of NMCIF in rat_B decreases by 25.89% and 5.17%. The results demonstrate that the nonlinear model considering the connectivity of neural activities contributes to better estimation performance.
When the initial estimate deviates greatly from the initial true value, the 2D-MSEs of NMCIF in rat_A and rat_B decrease by 33.84% and 26.00%, respectively, compared with those of KF, exhibiting the superiority of the information-type filter. Moreover, the 2D-MSEs of NMCIF in rat_A and rat_B decrease by 29.54% and 8.97%, respectively, compared with those of NIF in the presence of heavy-tailed non-Gaussian noise, showing the proposed algorithm's superiority over the NIF. These results confirm the effectiveness of the proposed algorithm, which can potentially improve decoding performance for BMI. In our study, the NN used as the nonlinear approximator is not limited to an MLP; it can also be an RNN, etc. RNNs are widely used in speech recognition and natural language processing, and in the area of motor intention estimation they have recently been used to classify letter shapes from invasive brain signals [37]. However, the training of an MLP is much simpler than backpropagation through time in an RNN. Our method follows the combination of a neural network and a simple state-observation model. It inherits a simple interpretation of the state dynamics and the observation model, which correspond to the neural tuning property and the neural connectivity, respectively. On the other hand, the neural tuning property generally changes over time [64,65], and the performance of a static decoder may decrease if we use a fixed observation model. Thus, considering the adaptation in the neural system, the neural network in our method allows the introduction of adaptive nonlinear models [66,67] to replace the stationary observation model. Our study currently uses invasive recordings of single-neuron spikes in rats. The rat is an ideal subject for estimating motor intention and is widely used in many studies [68][69][70], and the Kalman filter has been used in rats [39], monkeys [40], and human subjects [71]. We plan to apply our algorithm to non-human primates and patients in the future.
As more electrodes could be implanted in such subjects, the recorded neural signals would build higher dimensional observations. We would expect the superiority of the proposed algorithm to be clearer on these data.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix B. The Relationship of the Influence Function and Asymptotic Covariance Matrix
Following [57], we derive the relationship between the IF and the asymptotic covariance matrix of the estimation error vector. Using a Taylor expansion involving the derivative of the functional, the estimator T(A) can be expressed as:

T(A) = T(A_0) + ∫ IF(e, A_0) d(A − A_0)(e) + rem(A − A_0),    (A9)

where rem(A − A_0) is a remainder term. According to the von Mises expansion and assuming the IF exists, and since Fisher consistency holds at A_0, i.e., ∫ IF(e, A_0) dA_0(e) = 0, Equation (A9) can be written as:

T(A) = T(A_0) + ∫ IF(e, A_0) dA(e) + rem(A − A_0).    (A11)

For A = F_m(e), with F_m(e) being the empirical distribution function, F_m(e) = (1/m) Σ_{i=1}^{m} u(e − e_i), where u(·) is the unit step function, the integral term in Equation (A11) can be written in linear form as:

∫ IF(e, A_0) dF_m(e) = (1/m) Σ_{i=1}^{m} IF(e_i, A_0).    (A12)

Then, we have:

√m (T(F_m) − T(A_0)) = (1/√m) Σ_{i=1}^{m} IF(e_i, A_0) + √m · rem(F_m − A_0).    (A13)

If the remainder term converges in probability to zero, then by virtue of the central limit theorem and Slutsky's lemma, the probability distribution of √m (T(F_m) − T(A_0)) tends to N(0, Λ_k), with the asymptotic covariance matrix given by:

Λ_k = E[IF(e, A_0) IF(e, A_0)^T].    (A14)

Appendix C. The Derivation of Influence Function
For the sake of this discussion, we omit the time step k. From Equation (24), we can obtain the residual vector e = D − Mx. Consider an ε-contamination model A = (1 − ε)A_0 + ε∆_e, where A_0 is the target distribution, ∆_e is the probability mass at e, and ε is the contamination ratio, and define the cumulative probability distribution of the residual vector as A_0(e). In general, the state estimate based on correntropy can be obtained by solving Equation (26), which can be written in the equivalent form:

Σ_{i=1}^{2n} λ(e_i, m_i^T, x) = 0,  with  λ(e_i, m_i^T, x) = m_i^T ϕ(d_i − m_i x).    (A15)

Let T(A) be the estimator based on correntropy (i.e., the state estimate x_k at the distribution A), written as a functional of A. If each λ(e_i, m_i^T, x) in Equation (A15) is multiplied by the probability mass 1/(2n), then by the Glivenko-Cantelli theorem [72], Equation (A15) asymptotically tends to:

∫ λ(e, m^T, T(A)) dA(e) = 0,    (A16)

where e and m^T are generalizations of e_i and m_i^T. According to [56], the asymptotic IF of T(A) is defined as:

IF(e, A_0) = lim_{ε→0} [T((1 − ε)A_0 + ε∆_e) − T(A_0)] / ε.    (A17)

To derive the expression of the IF, we substitute the ε-contamination model A = (1 − ε)A_0 + ε∆_e into Equation (A16) to yield:

(1 − ε) ∫ λ(e, m^T, T(A)) dA_0(e) + ε λ(e, m^T, T(A)) = 0.    (A18)

Taking the derivative of Equation (A18) with respect to ε and evaluating at ε = 0 then yields the IF in Equation (28).