Kernel Mixture Correntropy Conjugate Gradient Algorithm for Time Series Prediction

Kernel adaptive filtering (KAF) is an effective nonlinear learning algorithm, which has been widely used in time series prediction. The traditional KAF is based on the stochastic gradient descent (SGD) method, which has slow convergence speed and low filtering accuracy. Hence, a kernel conjugate gradient (KCG) algorithm has been proposed with low computational complexity, while achieving comparable performance to some KAF algorithms, e.g., the kernel recursive least squares (KRLS). However, the learning performance of KCG is not robust in non-Gaussian noise environments. Meanwhile, correntropy, a local similarity measure defined in kernel space, can suppress large outliers in robust signal processing. On the basis of correntropy, the mixture correntropy has been developed, which uses a mixture of two Gaussian functions as its kernel function to further improve the learning performance. Accordingly, this article proposes a novel KCG algorithm, named the kernel mixture correntropy conjugate gradient (KMCCG), with the help of the mixture correntropy criterion (MCC). The proposed algorithm has less computational complexity and can achieve better performance in non-Gaussian noise environments. To further control the growing radial basis function (RBF) network in this algorithm, we also use a simple sparsification criterion based on the angle between elements in the reproducing kernel Hilbert space (RKHS). The prediction simulation results on a synthetic chaotic time series and a real benchmark dataset show that the proposed algorithm can achieve better computational performance. In addition, the proposed algorithm is also successfully applied to the practical task of malware prediction in the field of malware analysis. The results demonstrate that our proposed algorithm not only has a short training time, but also achieves high prediction accuracy.


Introduction
Usually, traditional time series prediction methods mainly include autoregression, Kalman filtering, and moving average models. These traditional approaches focus on mathematical statistics and lack the capabilities of self-learning, self-organization, and self-adaption. In particular, they cannot be effectively used for nonlinear, multi-feature data in analyzing some complex problems. Currently, some machine learning methods have been developed to address this issue, such as the support vector machine (SVM), the artificial neural network (ANN), and deep neural networks. Among these, kernel adaptive filtering (KAF) is attractive for online prediction, but each arriving sample adds a new unit to its radial basis function (RBF) network, which leads to the increase of the memory requirement and computation of KAF. There are many traditional methods to restrain the growth of the network structure, such as the novelty criterion [16], the correlation criterion [17], the approximate linear dependence criterion [8], and the surprise criterion [18]. Since the angle between two elements in the Hilbert space can be expressed by the inner product, and the similarity of elements in the Hilbert space can be measured by this angle, here we use a sparsification criterion based on the angle between elements. The angle criterion used here not only provides geometric intuition, but also offers a simple structure that can be implemented easily.
With the rapid advancement of Internet technology [19], the issue of network security imposes huge challenges on the Internet. Specifically, the demand for malware analysis has become increasingly urgent, and practitioners and researchers have been making progress in the field of malware prediction and detection [20]. Usually, malware implements its intentions by calling the existing application programming interfaces (API) in the system. Therefore, the API call time series, as a record of software behavior, can be analyzed to achieve malware prediction and detection [21]. Generally speaking, the obtained API call time series can be used to predict future malicious behavior. Specifically, in this article, the proposed algorithm KMCCG is also used in the practical application of malware prediction.
The main contributions of this article are summarized as follows. (1) On the basis of mixture correntropy, a novel robust algorithm KMCCG is proposed through a comprehensive use of the half-quadratic optimization method, the CG technique, and the kernel trick. KMCCG can not only improve the learning accuracy, but also maintain robustness to impulse noise. (2) In view of the issue that the algorithm KMCCG produces a growing RBF network, a sparsification criterion based on the angle is used to control the network structure. (3) For a special time series analysis application in relation to malware prediction, KMCCG is accordingly used to achieve this task, which verifies that our proposed algorithm can achieve higher prediction accuracy with less training time.
The rest of this article is organized as follows. In Section 2, the mixture correntropy, the algorithm KCG, and the sparsification criterion are introduced. In Section 3, the details of our algorithm KMCCG are presented. In Section 4, the simulations on time series prediction and the experiment on the malware prediction task are conducted to verify the effectiveness of our proposed algorithm. The conclusion is summarized in Section 5.

Mixture Correntropy
Correntropy is a local similarity function, which is defined as the generalized correlation in kernel space. It is closely related to the cross-information potential (CIP) in information theoretic learning (ITL) [22]. It shows very promising results in nonlinear non-Gaussian signal processing. The main property of correntropy is that it provides an effective mechanism to mitigate the influence of large outliers. Recently, correntropy has been successfully applied in various areas, such as signal processing [23], machine learning [24][25][26], adaptive filtering [27][28][29], and others [30][31][32].
The correntropy is used to represent the similarity between two random variables X and Y. Let k_σ(·, ·) be a Mercer kernel function with kernel bandwidth σ, and let E[·] be the mathematical expectation. Then, the correntropy is defined as:

V(X, Y) = E[k_σ(X, Y)].  (1)

Generally, the Gaussian kernel is the most widely used kernel in correntropy, and it is as follows:

G_σ(e) = exp(−e² / (2σ²)),  (2)

where e = X − Y is the error value.
Here, a nonlinear mapping ϕ(·) is used by the kernel function to map the input space U to the high-dimensional feature space F, and it satisfies ⟨ϕ(X), ϕ(Y)⟩ = k_σ(X, Y). Then, (1) is rewritten as:

V(X, Y) = E[⟨ϕ(X), ϕ(Y)⟩].  (3)

Since the joint probability density of the data in practical applications is usually unknown, for a finite sample {x_i, y_i}, i = 1, ..., N, the correntropy can be estimated as:

V̂(X, Y) = (1/N) Σ_{i=1}^{N} G_σ(x_i − y_i).  (4)

Generally, the kernel bandwidth is one of the key parameters in correntropy. A small kernel bandwidth makes the algorithm more robust to outliers, but leads to slow convergence and poor accuracy. On the other hand, when the kernel bandwidth increases, the robustness is significantly reduced in the presence of abnormal values. In order to achieve better performance, a new similarity measure, the mixture correntropy criterion (MCC), was proposed [11]. It can achieve faster convergence speed and higher filtering accuracy, while maintaining robustness to outliers. The mixture correntropy uses the mixture of two Gaussian functions as the kernel function, and its definition is as follows:

M(X, Y) = E[α G_{σ1}(e) + (1 − α) G_{σ2}(e)],  (5)

where 0 ≤ α ≤ 1 is the mixture coefficient and σ1 and σ2 (with σ1 ≤ σ2) are the kernel bandwidths of the Gaussian functions G_{σ1}(·) and G_{σ2}(·), respectively. When the mixture coefficient α takes a suitable value, the performance of MCC can be better than that of the original correntropy criterion, so the mixture correntropy is a more flexible measure of similarity. Typically, the empirical mixture correntropy loss can be expressed as L(X, Y) or L(e), where X = [x_1, x_2, ..., x_N]^T, Y = [y_1, y_2, ..., y_N]^T, and e = [e_1, e_2, ..., e_N]^T. Then, it is defined as follows:

L(e) = 1 − (1/N) Σ_{i=1}^{N} [α G_{σ1}(e_i) + (1 − α) G_{σ2}(e_i)],  (6)

where e_i = x_i − y_i. Here, the Hessian matrix of L(e) with respect to e is:

H_{L(e)} = diag(θ(e_1), ..., θ(e_N)),  (7)

where

θ(e_i) = (1/N) [ (α/σ1²) G_{σ1}(e_i) (1 − e_i²/σ1²) + ((1 − α)/σ2²) G_{σ2}(e_i) (1 − e_i²/σ2²) ].

It can be seen that the Hessian matrix H_{L(e)} of the empirical mixture correntropy loss, as a function of e, is positive definite (and hence L(e) is convex) only when the condition ‖e‖_∞ = max_{1≤i≤N} |e_i| ≤ σ1 is satisfied. Therefore, the mixture correntropy loss function cannot be directly applied to convex optimization [33].
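To make the loss and its conditional convexity concrete, the following Python sketch evaluates the empirical mixture correntropy loss L(e) and the diagonal entries of its Hessian. The parameter values (α = 0.5, σ1 = 1, σ2 = 4, with σ1 ≤ σ2) are illustrative assumptions, not values taken from the text.

```python
import numpy as np

def gaussian(e, sigma):
    # G_sigma(e) = exp(-e^2 / (2 sigma^2))
    return np.exp(-e**2 / (2.0 * sigma**2))

def mixture_loss(e, alpha=0.5, s1=1.0, s2=4.0):
    # Empirical mixture correntropy loss L(e)
    return 1.0 - np.mean(alpha * gaussian(e, s1) + (1.0 - alpha) * gaussian(e, s2))

def hessian_diag(e, alpha=0.5, s1=1.0, s2=4.0):
    # Diagonal entries of the Hessian of L(e); all-positive entries mean
    # the loss is convex at this error vector
    n = len(e)
    t1 = (alpha / (n * s1**2)) * gaussian(e, s1) * (1.0 - e**2 / s1**2)
    t2 = ((1.0 - alpha) / (n * s2**2)) * gaussian(e, s2) * (1.0 - e**2 / s2**2)
    return t1 + t2

e_small = np.array([0.2, -0.5, 0.8])    # all |e_i| <= sigma_1: convex region
e_outlier = np.array([0.2, -0.5, 5.0])  # one large outlier breaks convexity
```

With all errors inside [−σ1, σ1] every Hessian entry stays positive, while a single large outlier makes one entry negative, which is exactly why the loss cannot be treated as globally convex.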

Kernel Conjugate Gradient Algorithm
The CG method is intermediate between the steepest descent and Newton methods. It accelerates the typically slow convergence associated with steepest descent, while avoiding the information requirements associated with the evaluation, storage, and inversion of the Hessian as required by Newton's method [10].
Let A ∈ R^{n×n} be a symmetric positive definite matrix and d_1, d_2, ..., d_m be a set of non-zero vectors in R^n. If d_i^T A d_j = 0 for all i ≠ j, then d_1, d_2, ..., d_m are said to be conjugate with respect to A.
Consider the following unconstrained optimization problem:

min_{x ∈ R^n} f(x),  (8)

where f is a continuously differentiable function on R^n. The nonlinear CG method for solving (8) has the following iterative format:

x_{k+1} = x_k + α_k p_k, p_{k+1} = −g_{k+1} + β_k p_k, p_0 = −g_0,  (9)

where g_k = ∇f(x_k), p_k is the search direction, α_k > 0 is the step factor, and β_k is a parameter; different choices of β_k correspond to different CG methods. When f(x) is a strictly convex quadratic function, the search directions p generated by (9) are conjugate with respect to the Hessian matrix of f(x). The process of CG iteration is described in Algorithm 1, where r_i is the residual vector, p_i is the search direction, and x_i is the approximate solution. In addition, α_i and β_i are the step factors, and the stopping criterion is that the algorithm achieves convergence.

Algorithm 1
The conjugate gradient (CG) method. Input: a symmetric positive definite matrix A; the vector b; the initial iteration value x.

In order to address the nonlinear problem effectively, the KCG algorithm was proposed for online applications [11]. To use kernel techniques, the solution vector of the CG algorithm is represented by a linear combination of input vectors. The convergence speed of the online KCG algorithm is then much faster than that of the kernel least mean square (KLMS) algorithm. Actually, KCG achieves the same convergence performance as KRLS in many cases; however, the computational cost is greatly reduced [11]. Another attractive feature of KCG is that it does not require user-defined parameters. The KCG algorithm is described as Algorithm 2. In this algorithm, G is the Gram matrix, η is the coefficient vector, κ(·, ·) is the Mercer kernel, M is the size of the dictionary, r_0 and r_1 are the residual vectors, and e is the error vector. Moreover, α_1, α_2, and β_2 are step sizes; s_1 is the residual vector of the normal equation; v_1 and v_2 are intermediate vectors; and H stands for the conjugate transpose.
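As a concrete illustration of the CG iteration outlined in Algorithm 1, the following Python sketch solves Ax = b for a symmetric positive definite A. The Fletcher-Reeves-style update for β is one common choice; the 2×2 system at the bottom is an illustrative example.

```python
import numpy as np

def conjugate_gradient(A, b, x=None, tol=1e-10, max_iter=None):
    """Classic CG for a symmetric positive definite matrix A."""
    n = len(b)
    x = np.zeros(n) if x is None else x.astype(float)
    r = b - A @ x                 # initial residual r_0
    p = r.copy()                  # initial search direction p_0
    max_iter = n if max_iter is None else max_iter
    for _ in range(max_iter):
        Ap = A @ p
        pAp = p @ Ap
        if pAp <= 1e-300:         # direction exhausted (r was zero)
            break
        alpha = (r @ r) / pAp     # step factor alpha_i
        x = x + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:   # stopping criterion
            r = r_new
            break
        beta = (r_new @ r_new) / (r @ r)  # Fletcher-Reeves beta_i
        p = r_new + beta * p              # next conjugate direction
        r = r_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)      # exact solution reached in <= 2 steps
```

For an n × n system, CG reaches the exact solution in at most n iterations in exact arithmetic, which is why `max_iter` defaults to n.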

Algorithm 2
The kernel conjugate gradient (KCG) algorithm.

Since the KCG algorithm is derived from the solution of a least squares problem, good performance can be maintained in a Gaussian noise environment [11]. However, in the non-Gaussian case, the performance of KCG may not be satisfactory [11]. Therefore, we use the mixture correntropy as the loss function and propose the KMCCG algorithm.

Sparsification Criterion
KAF uses an online learning strategy, that is, each time a new sample arrives, it is allocated a storage unit. The linear growth of the network structure leads to an increase in the memory requirements and calculations of KAF. Since the angle between two elements in the feature space can be represented by the inner product, and the similarity of the elements in the space can be measured by this angle, a simple sparsification criterion based on the angle between elements in the RKHS is used to control the network structure [11]. The cosine of the angle between ϕ(x) and ϕ(y) is as follows:

cos(ϕ(x), ϕ(y)) = κ(x, y) / √(κ(x, x) κ(y, y)).  (10)
The algorithm reconstructs the network "dictionary" through the sparsification criterion. If the current dictionary is D and a new sample (x k , d k ) is coming, the procedure of the angle criterion-based sparsification can be described through the following two steps.
First, for each x_i in D, the parameter ν_i is calculated:

ν_i = |κ(x_k, x_i)| / √(κ(x_k, x_k) κ(x_i, x_i)).  (11)

Second, if max_i {|ν_i|} is smaller than the predefined threshold ν_0, (ϕ(x_k), d_k) is added to D; otherwise, it is discarded. Because the parameter ν_0 represents the level of similarity between the new element and the old elements in D, we call it the similarity parameter.
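The two-step admission test above can be sketched in a few lines of Python. The Gaussian kernel, the threshold ν_0 = 0.8, and the one-dimensional test points are illustrative assumptions.

```python
import numpy as np

def gauss_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y)**2) / (2.0 * sigma**2))

def admit(x_new, dictionary, nu0=0.8, sigma=1.0):
    # Angle criterion: admit x_new only if its largest cosine similarity
    # to the current dictionary elements stays below the threshold nu0.
    if not dictionary:
        return True                      # the first element is always stored
    kxx = gauss_kernel(x_new, x_new, sigma)
    cos = [abs(gauss_kernel(x_new, xi, sigma)) /
           np.sqrt(kxx * gauss_kernel(xi, xi, sigma)) for xi in dictionary]
    return max(cos) < nu0

D = [np.array([0.0]), np.array([3.0])]
novel = admit(np.array([1.5]), D)        # far from both centers -> admitted
redundant = admit(np.array([0.1]), D)    # close to the first center -> rejected
```

Note that for a Gaussian kernel κ(x, x) = 1, so the cosine reduces to κ(x_k, x_i) itself; the general normalized form is kept here for clarity.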

Half-Quadratic Optimization of the Mixture Correntropy
For the mixture correntropy loss function (6), since its Hessian matrix is not positive definite everywhere, its global convexity cannot be guaranteed. Therefore, the mixture correntropy loss cannot be directly applied to the convex optimization problem. Fortunately, the half-quadratic (HQ) optimization method is an effective way to address this non-convex optimization problem [33]: it converts the original objective function into a half-quadratic objective function. In this article, the HQ optimization method is used to transform the mixture correntropy loss function, and then the CG method is employed to solve the transformed function.
Since the mixture correntropy is the sum of two exponential functions, we let:

g(v) = −v ln(−v) + v, v < 0.  (12)

According to the theory of the conjugate function, the conjugate function of g(v) is given by:

g*(u) = sup_{v<0} (uv − g(v)).  (13)

Let h(v) = uv − g(v) = uv + v ln(−v) − v. Setting dh/dv = u + ln(−v) = 0 yields v = −exp(−u). Therefore, we can get:

g*(u) = sup_{v<0} (uv + v ln(−v) − v) = exp(−u),  (14)

where the supremum is reached at v = −exp(−u).
When u = e_k² / (2σ²), we can get:

exp(−e_k² / (2σ²)) = sup_{v<0} ( v e_k² / (2σ²) + v ln(−v) − v ),  (15)

where the supremum is reached at v = −exp(−e_k² / (2σ²)). Then, the loss (6) can be written in the half-quadratic form:

L_HQ(e, v_1, v_2) = 1 − (1/N) Σ_{i=1}^{N} { α [ v_{1,i} e_i²/(2σ1²) + v_{1,i} ln(−v_{1,i}) − v_{1,i} ] + (1 − α) [ v_{2,i} e_i²/(2σ2²) + v_{2,i} ln(−v_{2,i}) − v_{2,i} ] }.  (16)

Because minimizing the mixture correntropy loss (6) over e is equivalent to jointly minimizing (16) over e, v_1, and v_2, the equivalent solution can be obtained, for fixed auxiliary variables, by solving the following weighted least squares problem (up to the constant factor 1/N):

min_e Σ_{i=1}^{N} w_i e_i²,  (17)

where

w_i = − [ α v_{1,i}/(2σ1²) + (1 − α) v_{2,i}/(2σ2²) ], v_{1,i} = −exp(−e_i²/(2σ1²)), v_{2,i} = −exp(−e_i²/(2σ2²)).

The Hessian matrix of the weighted least squares problem (17) is as follows:

H(e) = diag(2w_1, 2w_2, ..., 2w_N),  (18)

where v_{1,i} < 0 and v_{2,i} < 0 for all i. Hence, we obtain H(e) > 0, and (17) is a globally convex optimization problem; when the Hessian matrix is positive definite, the objective is a convex function. The objective function can then be optimized by the conjugate gradient method.
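The conjugate-function identity and the resulting least squares weights can be checked numerically. The sketch below assumes the standard half-quadratic construction for Gaussian kernels; the test point u = 0.7, the grid, and the parameter values (α = 0.5, σ1 = 1, σ2 = 4) are illustrative.

```python
import numpy as np

# Numerical check of the conjugate relation
#   exp(-u) = sup_{v<0} (u*v + v*ln(-v) - v),   attained at v = -exp(-u)
u = 0.7
v = -np.exp(np.linspace(-6.0, 2.0, 20001))   # dense grid of negative v values
vals = u * v + v * np.log(-v) - v
v_star = v[np.argmax(vals)]                  # should approximate -exp(-u)

def hq_weights(e, alpha=0.5, s1=1.0, s2=4.0):
    # Auxiliary variables of the half-quadratic reformulation and the
    # resulting weights w_i; large errors receive small weights, which is
    # the mechanism that suppresses outliers.
    v1 = -np.exp(-e**2 / (2.0 * s1**2))
    v2 = -np.exp(-e**2 / (2.0 * s2**2))
    return -(alpha * v1 / (2.0 * s1**2) + (1.0 - alpha) * v2 / (2.0 * s2**2))

w = hq_weights(np.array([0.1, 1.0, 10.0]))   # monotonically shrinking weights
```

Because every w_i is strictly positive, the fixed-weight subproblem is an ordinary weighted least squares problem, while an error of 10 receives a weight orders of magnitude smaller than an error of 0.1.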

Kernel Mixture Correntropy Conjugate Gradient Algorithm
The core of KAF is to transform the input data into a high-dimensional feature space through the kernel function. The inner product operation in the feature space is calculated efficiently by the kernel trick. The goal is to obtain the mapping function f(x) between the input and the output. According to adaptive filtering theory, we can obtain:

f(x) = Σ_{i=1}^{k} η_i κ(x, x_i),  (19)

where η_i is the expansion coefficient and κ(·, x_i) is a kernel function with center x_i. According to (17), we consider the following kernel-induced weighted least squares problem:

min_η (d − G η)^T Λ (d − G η),  (20)

where d = [d_1, ..., d_k]^T is the target vector, Λ = diag(w_1, ..., w_k) collects the weights in (17), η = [η_1, η_2, ..., η_k]^T is the expansion coefficient vector, and G is the Gram matrix, which is defined as:

G_{ij} = κ(x_i, x_j), i, j = 1, ..., k.  (21)

Then, we use the CG method to solve the weighted least squares problem (20). The major work of the online algorithm KMCCG lies in the update of the Gram matrix G_M and the coefficient vector η_M. The Gram matrix G_M can be updated as follows:

G_M = [ G_{M−1}  g_M ; g_M^T  q_M ],  (22)

where g_M = [κ(x_1, x_M), ..., κ(x_{M−1}, x_M)]^T and q_M = κ(x_M, x_M). Since [η_{M−1}^T, 0]^T is a good approximation of η_M, only a few iterations (one or two) with this initial vector can achieve satisfactory performance. We use [η_{M−1}^T, 0]^T as the initial value of η_M, and the initial residual r_0 can then be expressed as follows:

r_0 = d_M − G_M [η_{M−1}^T, 0]^T.  (23)

As an illustrative example, suppose the dictionary contains X_1 = [x_1] after the first input. When k = 2, q_2 = κ(x_2, x_2) = 0.6595, g_2 = κ(x_1, x_2) = 0.3337, and ν_1 = 0.3337; then, because max{|ν_1|} < ν_0 = 0.8, x_2 is added to the dictionary, which means that M = M + 1 = 2 and X_2 = [X_1, x_2]. The Gram matrix G_2 is updated according to (22):

G_2 = [ G_1  g_2 ; g_2^T  q_2 ].  (24)

Then, the residuals can be obtained as r_0 = [0, 0.3991]^T and r_1 = [−0.1834, 0.1210]^T. On this basis, the coefficient vector η_2 is updated. When a new input arrives, the algorithm continuously updates the coefficients to make them more suitable for the next input. Finally, the algorithm stops when the convergence condition is satisfied, that is, when it reaches the maximum number of iterations.
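The online update described above can be sketched in Python as follows. This is an illustrative reconstruction, not the authors' exact implementation: the Gram matrix is rebuilt at every step for clarity instead of being grown incrementally with the bordered update, a small ridge term `lam` is added for numerical stability (an assumption not in the text), and two warm-started CG iterations are run per sample, as the text suggests. The toy sine series, kernel bandwidths, and threshold ν_0 = 0.95 are likewise illustrative.

```python
import numpy as np

def kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b)**2) / (2.0 * sigma**2))

def cg_steps(A, b, x0, iters=2):
    # A few warm-started CG iterations on the weighted normal equations.
    x, r = x0.copy(), b - A @ x0
    p = r.copy()
    for _ in range(iters):
        Ap = A @ p
        pAp = p @ Ap
        if pAp <= 1e-12:
            break
        step = (r @ r) / pAp
        x = x + step * p
        r_new = r - step * Ap
        if r_new @ r_new < 1e-20:
            return x
        beta = (r_new @ r_new) / (r @ r)
        p, r = r_new + beta * p, r_new
    return x

def kmccg_sketch(X, d, sigma=1.0, nu0=0.95, alpha=0.5, s1=0.5, s2=2.0, lam=1e-3):
    centers, targets = [], []
    eta = np.zeros(0)
    preds = np.zeros(len(d))
    for k in range(len(d)):
        kx = np.array([kernel(X[k], c, sigma) for c in centers])
        preds[k] = kx @ eta if len(centers) else 0.0
        # angle criterion: for a Gaussian kernel kappa(x, x) = 1, so the
        # cosine to each dictionary element is just kappa(x_k, x_i)
        if len(centers) == 0 or np.max(np.abs(kx)) < nu0:
            centers.append(X[k])
            targets.append(d[k])
            eta = np.append(eta, 0.0)          # warm start [eta_{M-1}, 0]
        G = np.array([[kernel(a, c, sigma) for c in centers] for a in centers])
        t = np.array(targets)
        res = t - G @ eta
        v1 = -np.exp(-res**2 / (2.0 * s1**2))  # HQ auxiliary variables
        v2 = -np.exp(-res**2 / (2.0 * s2**2))
        w = -(alpha * v1 / (2.0 * s1**2) + (1.0 - alpha) * v2 / (2.0 * s2**2))
        A = G.T @ (w[:, None] * G) + lam * np.eye(len(centers))
        eta = cg_steps(A, G.T @ (w * t), eta, iters=2)
    return preds, len(centers)

# toy usage: one-step prediction of a noiseless sine wave with 3 past samples
t = np.arange(403)
s = np.sin(0.2 * t)
X = np.stack([s[2:-1], s[1:-2], s[:-3]], axis=1)
d = s[3:]
preds, M = kmccg_sketch(X, d)
```

The angle criterion keeps the dictionary small (the sine embedding lies on a closed curve, so admissions stop once the curve is covered), while the HQ weights keep the fixed-point subproblem convex at every step.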

Computational Time Complexity Analysis
As shown in Table 1, we obtain the computational time complexity of the four algorithms KLMS, KRLS, KCG, and KMCCG after analyzing their implementations. In this table, M is the dictionary size. KLMS is a simple KAF with minimal computational complexity. KRLS requires (4M² + 4M) additions, (4M² + 4M) multiplications, and one division per iteration, while KCG achieves a convergence speed and filtering accuracy comparable to KRLS [11] with relatively small computational complexity. In addition, compared with KCG, KMCCG requires two extra divisions and one extra multiplication when calculating v_k, as well as one extra addition and one extra multiplication to update the residual vector r_0. Table 1 shows that, in each iteration, KMCCG requires fewer additions and multiplications than KRLS, but four more division operations. Because the number of instruction cycles required by a division is generally about 20 times that of an addition, and M is usually greater than 100, the computational complexity of KMCCG is still lower than that of the KRLS algorithm. Moreover, when the input vector does not meet the sparsification criterion, KMCCG performs no additional calculation, whereas KRLS still requires (4M² + 4M) additions and (4M² + 4M) multiplications. Therefore, KMCCG can achieve higher prediction accuracy with lower computational and storage costs.

Table 1. Computational time complexity per iteration of the KLMS, KRLS, KCG, and KMCCG algorithms (numbers of additions, multiplications, and divisions; M is the dictionary size).

Experimental Results and Discussions
In this section, the experiments on short-term predictions of the Mackey-Glass chaotic time series, minimum daily temperatures time series, and the real-world malware API call sequence are conducted to illustrate the performance of our proposed algorithm.

Mackey-Glass Time Series Prediction
The chaotic time series is one of the fundamental forms of movement in nature and human society. Generally, the classical Mackey-Glass chaotic time series is generated by the following delay differential equation [6]:

dx(t)/dt = −b x(t) + a x(t − τ) / (1 + x(t − τ)^n),  (25)

where the parameters are set to a = 0.2, b = 0.1, n = 10, and τ = 30. Moreover, the sampling period is 6 s. This experiment uses the past seven samples u(k) = [x(k), x(k − 1), ..., x(k − 6)] to predict the current input d(k) = x(k + 1). We use the dimensions of a matrix to represent its size. Therefore, the size of the training input set is (7 × 1000), the size of the training target set is (1000 × 1), the size of the testing input set is (7 × 200), and the size of the testing target set is (200 × 1). The algorithm KMCCG was compared with the quantized KLMS (QKLMS) algorithm [34], the quantized kernel maximum correntropy (QKMC) algorithm [35], and the kernel maximum mixture correntropy (KMMCC) algorithm [15] in four different noise environments, to verify the performance of our proposed algorithm. Here, QKLMS is one of the most classical KAF algorithms based on the mean square error (MSE) criterion, and it achieves good prediction accuracy in the Gaussian noise environment. QKMC is a KAF algorithm based on correntropy, which can also achieve good prediction accuracy and maintain robustness. The recently proposed algorithm KMMCC, which incorporates the mixture correntropy criterion, has been demonstrated to obtain satisfactory prediction accuracy and robustness. All algorithms were configured with a Gaussian kernel. For a fair comparison, the parameters of each algorithm were tuned to achieve its best performance. Finally, the performance of each algorithm was evaluated by the MSE, which is defined here as follows:

MSE = (1/N) Σ_{k=1}^{N} (d(k) − ŷ(k))²,  (26)

where N represents the number of predicted values and ŷ(k) is the predicted output. Figure 1 shows the learning performance of these algorithms in the four noise environments.
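As a sketch of how such data can be prepared, the following Python code integrates the Mackey-Glass equation with a simple Euler scheme and builds the 7-lag input/target pairs described above. The integration step size, the constant initial history, and the internal sampling rate (60 Euler steps of 0.1 per 6 s sample) are illustrative assumptions.

```python
import numpy as np

def mackey_glass(n_samples, a=0.2, b=0.1, n=10, tau=30.0, dt=0.1, sample_every=60):
    # Euler integration of dx/dt = a*x(t-tau)/(1 + x(t-tau)^n) - b*x(t),
    # returning one sample every `sample_every` steps (a 6 s period here).
    hist = int(round(tau / dt))               # delay expressed in steps
    total = hist + n_samples * sample_every + 1
    x = np.empty(total)
    x[:hist + 1] = 1.2                        # constant initial history (assumption)
    for k in range(hist, total - 1):
        x_tau = x[k - hist]
        x[k + 1] = x[k] + dt * (a * x_tau / (1.0 + x_tau**n) - b * x[k])
    return x[hist + 1::sample_every][:n_samples]

def sliding_window(x, order=7):
    # u(k) = [x(k), x(k-1), ..., x(k-order+1)] predicts d(k) = x(k+1)
    U = np.array([x[k - order + 1:k + 1][::-1] for k in range(order - 1, len(x) - 1)])
    d = x[order:]
    return U, d

def mse(d, y):
    # MSE as in (26)
    return np.mean((d - y)**2)

series = mackey_glass(1200)
U, d = sliding_window(series)
```

The first 1000 rows of `U` (transposed) then give the (7 × 1000) training input set and the next 200 the (7 × 200) testing input set.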
Obviously, in all four types of noise environments, the testing MSE of KMCCG was smaller than that of the stochastic gradient-based filtering algorithms, i.e., QKLMS, QKMC, and KMMCC. Meanwhile, the convergence speed of KMCCG was obviously faster than that of the other compared algorithms. This verifies that the CG technique used in KMCCG can achieve a faster convergence speed and higher learning accuracy. Therefore, the algorithm KMCCG achieves the best performance among all the compared algorithms.

Minimum Daily Temperatures Time Series Prediction
In this section, the minimum daily temperatures time series is selected as the dataset to verify the performance of the proposed algorithm. This dataset describes the minimum daily temperature in Melbourne, Australia, over 10 years (1981-1990) [36]. The unit is degrees Celsius, and there are 3650 observations. The data source is the Australian Bureau of Meteorology.
Here, we use the previous five samples u(k) = [x(k), x(k − 1), ..., x(k − 4)] to predict d(k) = x(k + 1), where u(k) and d(k) represent the input vector and the corresponding expected output, respectively. Additionally, the size of the training input set was (5 × 1000); the size of the training target set was (1000 × 1); the size of the testing input set was (5 × 200); and the size of the testing target set was (200 × 1). Finally, the MSE was also used to evaluate the performance of the algorithms. Our algorithm was then compared with QKLMS, QKMC, and KMMCC to verify its computational performance. Figure 2 shows the learning curves of these algorithms. Obviously, the testing MSE of KMCCG is less than that of QKLMS, QKMC, and KMMCC, which demonstrates that the proposed algorithm performs better than the three other algorithms.
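The 5-lag embedding and the split sizes quoted above can be reproduced as follows. Since the Melbourne dataset itself is not included here, a seasonal synthetic series stands in for it purely to illustrate the shapes; its amplitude and noise level are arbitrary assumptions.

```python
import numpy as np

# Synthetic stand-in for the minimum daily temperatures series (illustrative only)
rng = np.random.default_rng(0)
x = 11.0 + 4.0 * np.sin(2 * np.pi * np.arange(1300) / 365) + rng.normal(0, 1, 1300)

order = 5   # u(k) = [x(k), ..., x(k-4)], d(k) = x(k+1)
U = np.array([x[k - order + 1:k + 1][::-1] for k in range(order - 1, len(x) - 1)])
d = x[order:]

# Split sizes as stated in the text: (5 x 1000) / (1000 x 1) for training,
# (5 x 200) / (200 x 1) for testing.
train_U, train_d = U[:1000].T, d[:1000].reshape(-1, 1)
test_U, test_d = U[1000:1200].T, d[1000:1200].reshape(-1, 1)
```

The transposition matches the paper's convention of stacking input vectors as columns.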

Malware API Call Sequence Prediction
In this section, we apply the proposed algorithm to malware API call sequence prediction, verifying the effectiveness of our algorithm on actual time series data. The purpose of this experiment is to predict what the next API call will be, which can help determine whether a program is malware or not.

Background
The API is the service interface provided by the operating system. Applications call APIs to complete file reading and writing, network access, and other tasks [37]. Meanwhile, malware also needs to call APIs to implement its functions. Hence, extracting the API call sequence is an effective way to predict and detect malware behavior [21].
With the rapid advances in computational intelligence methodology, using machine learning algorithms to predict malware via the API call sequence can make malware prediction more intelligent, and new malware can be detected in a more timely manner [38]. In this field, SVM, ANN, and other methods have been applied to malware prediction and detection, and some satisfactory results have been achieved.
In [39], with the help of global features using the Gabor wavelet transform and Gist, the feed-forward ANN was developed to identify the behavior of malicious data with a good accuracy. In [40], after abstracting the complex behaviors based on the semantic analysis of dynamic API sequences, an SVM was proposed to achieve malware detection with good generalization ability. Furthermore, with the popular use of the deep learning method, some DNN models were also applied to tackle the issue of malware detection. For instance, in [41], the features were extracted from five minutes of API call log sequences by using a recurrent neural network, and then, they were input to the convolutional neural network to achieve deep learning with the purpose of malware detection.
Although some good performance has been achieved by the above approaches, several limitations still exist, such as long training times and difficulty in parameter determination. Since the mixture correntropy, as a new measure of local similarity defined in kernel space, can be used to address large outliers, our algorithm KMCCG can be considered for malware prediction in order to reduce the training time while maintaining high prediction accuracy and robustness to abnormal data in the API call sequence. It should be noted that although some traditional machine learning-based malware prediction and detection algorithms may be vulnerable to adversarial methods or tools, such as EvadeML [42] and poisoning attacks [43], KMCCG may be a better choice for malware prediction, in consideration of the satisfactory robustness achieved by using the mixture correntropy.
We mainly analyze the acquired API call sequence and predict the malicious behavior that may occur in the future using our proposed kernel learning algorithm. Then, by combining these predicted malicious behaviors with the actually detected malicious behaviors, we extract feature vectors and integrate them as the discriminant basis of malware detection. In so doing, we can determine whether an application belongs to malware or not through a machine learning classification model.

Experimental Result
API call information can be extracted by static and dynamic methods. Through the use of the static method [44], the API list can be extracted from the portable executable (PE) format of the executable files. Furthermore, with the dynamic method [45], the called API can be observed by running the executable files.
While creating the dataset, we randomly selected Windows malware samples from the malware datasets of Dasmalwerk and VirusShare and put the software into the Cuckoo sandbox for automatic analysis. In order to avoid security issues caused by malware propagation and accidental execution, we deployed the Cuckoo sandbox in an Ubuntu environment. Figure 3 shows a flowchart for building the malware corpus. The specific analysis process of a sample is as follows: (1) The first step is to launch Cuckoo on the Linux platform. (2) Then, we submit the sample to be analyzed to Cuckoo. (3) Cuckoo uploads the sample to the virtual machine and collects the behavior data. (4) After the analysis is completed, Cuckoo generates an analysis report in its own working directory. The API sequence is then extracted from the report, and Figure 4 shows some API call time series; each line in this figure represents one malware API call time series. In the following experiment, the API call sequence of a certain malware sample is selected as the dataset.
This experiment used the past seven samples to predict the current input. The size of the training input set was (7 × 1000); the size of the training target set was (1000 × 1); the size of the testing input set was (7 × 200); and the size of the testing target set was (200 × 1). Figure 5 shows the time series dataset obtained by replacing each API in the call sequence with the frequency of that API in the whole dataset, and Figure 6 shows the normalized version of the sequence in Figure 5. In this experiment, the performance of the algorithm was verified by comparing the prediction accuracy of QKLMS, KMMCC, ANN, SVM, and KMCCG. Considering that the prediction error is an effective evaluation metric, we again adopted the MSE defined in (26) to evaluate the algorithm performance. Figure 7 shows the learning performance of all five algorithms, i.e., the relationship between the testing MSE and the number of iterations. Obviously, the MSE of KMCCG is smaller than that of the classical KAF algorithms, i.e., QKLMS and KMMCC. The total training time and the average MSE over five runs of the experiment are summarized in Table 2. It can be seen that the proposed algorithm achieves a prediction accuracy equivalent to the popular ANN and SVM, but requires less training time. Hence, the algorithm KMCCG can be successfully applied to predict malware API call sequences, which verifies the satisfactory performance of our proposed algorithm.
Then, Gaussian noise was added to the malware API call time series to further verify the robustness of the algorithm. In the real world, noise is often not caused by a single source, but by a combination of many different sources; as the number of noise sources increases, their sum tends toward a Gaussian distribution. Hence, Gaussian noise is considered in this experiment to analyze the impact of noise on the system. Figure 8 shows the performance comparison of the proposed algorithm with the other four algorithms in the Gaussian noise environment; the evaluation metric is again the MSE in (26). It can be seen that the algorithm KMCCG achieves higher prediction accuracy and is more stable in the noisy environment, which shows that KMCCG has satisfactory robustness.

Conclusions
Through the combination of the MCC and the KCG algorithm, a novel kernel learning algorithm, i.e., KMCCG, is proposed in this article. Specifically, in an effort to effectively curb the growing RBF network in KMCCG, a sparsification criterion based on the angle between elements in the RKHS, which is equivalent to the coherence criterion, is used to control the increase of the data size in online applications. The proposed algorithm achieves a much faster convergence speed than the KLMS algorithm and lower computational complexity than the KRLS algorithm. The prediction results on the Mackey-Glass chaotic time series and the minimum daily temperatures time series showed that our algorithm achieves good performance in robustness, filtering accuracy, and computational efficiency. Furthermore, the proposed kernel learning algorithm was applied to malware prediction. The results also showed that the algorithm KMCCG not only has a short training time, but also maintains high prediction accuracy, which further verifies the satisfactory performance of KMCCG.
As a use case of our algorithm, this article only focused on the task of malware API call sequence prediction. Actually, the prediction experiment of malware API call time series is only a part of the malware detection technology, and the software cannot be directly classified as malware or a benign one by only using our method. In future work, we will combine the results of future behavior prediction with the actual detected malicious behavior as the basis of classification reference and judge whether the application belongs to malware or not. Moreover, to further extend the applications of malware prediction and detection while using the algorithm KMCCG, we will discuss some other classification tasks for malware through the evaluation of false positives.
Author Contributions: In this article, X.L. and Y.G. provided the original ideas and were responsible for revising the whole article; N.X. designed and performed the experiments and wrote the original article; W.W., L.W., C.H., and W.Z. analyzed the data. All authors read and approved the final manuscript.