1. Introduction
In practical applications, if the data are collected in a multi-source environment, the noise distribution is complex and unknown; therefore, it is almost impossible for a single noise distribution to clearly describe the real noise [1]. Ridge regression is a method that implements a sum-of-squares error function together with regularization, thus controlling the bias–variance trade-off [2,3]. It is intended to find the concealed linear structures in the original data [4,5]. For the transition from a linear to a nonlinear function, the following generalization can be made [6]: by mapping the input vectors into a high-dimensional feature space $H$ ($H$ a Hilbert space) through some nonlinear mapping, seek the solution of the optimization problem in the space $H$. Using a suitable kernel function $K$, nonlinear mappings can be estimated by kernel ridge regression (KRR), which is an extension of ridge regression with kernel techniques. In recent years, KRR has been increasingly welcomed as a data-rich nonlinear forecasting tool [7], applicable in many different contexts [8,9,10], such as machine learning, optical character recognition, and especially wind speed/power forecasting.
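As a minimal illustration of the kernel technique described above, the following sketch implements kernel ridge regression in pure Python with a Gaussian kernel. All function names, parameter values, and the toy data are illustrative assumptions, not the paper's implementation.

```python
import math

def gauss_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))
    d2 = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-d2 / (2 * sigma ** 2))

def solve(A, b):
    # Gaussian elimination with partial pivoting (small systems only).
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def krr_fit(X, y, mu=1e-6, sigma=1.0):
    # Dual coefficients alpha = (K + mu*I)^{-1} y (ridge in feature space).
    n = len(X)
    K = [[gauss_kernel(X[i], X[j], sigma) + (mu if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(K, y)

def krr_predict(X, alpha, x, sigma=1.0):
    # f(x) = sum_i alpha_i * K(x_i, x)
    return sum(a * gauss_kernel(xi, x, sigma) for a, xi in zip(alpha, X))

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.0, 4.0, 9.0]          # samples of y = x^2
alpha = krr_fit(X, y)
```

With a tiny regularizer `mu`, the fitted function nearly interpolates the training points, showing how the nonlinear fit is obtained purely through kernel evaluations.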
Generally, the existing techniques used for wind-speed forecasting include: (i) physical; (ii) statistical (also called data-driven); and (iii) artificial intelligence (AI)-based methods. The physical models attempt to estimate wind flow around and inside the wind farm using the physical laws governing atmospheric behavior [11,12]. The statistical models seek relationships between a set of explanatory variables and the online measured generation data; the historical wind-speed data recorded at the site are only used to establish the statistical model, which can take a variety of forms, including the persistence method and auto-regressive models [13,14]. AI methods include artificial neural networks (ANNs) [15], deep learning [16], SVR machines [17,18], and hybrid methods [19,20].
Suykens et al. [21,22,23] proposed the least squares support vector regression model with Gaussian noise (LS-SVR, also known as kernel ridge regression (KRR)). A mixed model based on multi-objective optimization [24,25] and a mixed method based on singular spectrum analysis, the firefly algorithm, and a BP neural network [26] predict wind speed with complicated noise, indicating that mixed prediction methods have powerful predictive ability. A mixed kernel machine [27] has been applied to forecast wind speed under noise, improving the performance of wind-speed prediction. In [28], models fitted to Gaussian–Laplacian (G-L) mixed noise are developed, and good performance is obtained compared with existing regression algorithms.
To solve the above problems, we study a model with G-L mixed noise characteristics for complex or unknown noise distributions, and we construct a technique to search for the optimal solution of the corresponding regression task. Although many optimization algorithms have been implemented in past years, we exploit the augmented Lagrangian method (ALM), as shown in Section 4. If the task is not differentiable or is discontinuous, the subgradient descent method can be employed; the SMO algorithm [29] can also be used when the sample size is very large.
The structure of this paper is as follows. Section 2 derives the optimal empirical risk loss by the Bayesian principle. Section 3 constructs the model of G-L mixed noise. Section 4 gives the solution and algorithm design. In Section 5, the numerical experiment of short-term wind-speed prediction is presented. Finally, we conclude the work.
2. Bayesian Principle to Mixed Noise Empirical Risk Loss
Given the dataset $D=\{(x_1,y_1),\ldots,(x_N,y_N)\}$, where $x_i\in\mathbb{R}^n$ and $y_i\in\mathbb{R}$, $D$ is the training data. $\mathbb{R}$ represents the real number set, $\mathbb{R}^n$ is the $n$-dimensional Euclidean space, and $N$ is the sample size. The superscript $T$ denotes the transpose of a matrix. Assuming that the samples of dataset $D$ are generated by an additive-noise function, the relationship between the measured value $y_i$ and the predicted value $f(x_i)$ is

$$y_i=f(x_i)+\xi_i,\qquad i=1,\ldots,N, \tag{2}$$

where the $\xi_i$ are random, i.i.d. (independent, identically distributed) noise variables with PDF (probability density function) $p(\xi)$ of mean $\mu$ and standard deviation $\sigma$. Generally, the noise PDF $p(\xi)$ is unknown. It is necessary to predict the unknown target $f(x)$ from the training set $D$.
Following the authors of [30,31], the optimal empirical risk loss in the sense of maximum likelihood estimation (MLE) is

$$c(\xi)=-\log p(\xi), \tag{3}$$

i.e., the empirical risk loss $c(\xi)$ is the negative log-likelihood of the noise characteristic.
It is assumed that the noise in Equation (2) is Laplacian, with $p(\xi_i)\propto\exp(-|\xi_i|/\sigma)$. By Equation (3), in the MLE sense the optimal empirical risk loss should be $c(\xi_i)=|\xi_i|$ (up to constants that do not affect the minimization). Suppose the noise in Equation (2) is Gaussian with zero mean and homoscedastic standard deviation $\sigma$. By Equation (3), the empirical risk loss of Gaussian noise with homoscedasticity is $c(\xi_i)=\xi_i^2$. If instead the noise in Equation (2) is Gaussian with zero mean and heteroscedastic standard deviations $\sigma_i$, then by Equation (3) the empirical risk loss for Gaussian noise with heteroscedasticity is $c(\xi_i)=\xi_i^2/\sigma_i^2$ ($i=1,\ldots,N$).
Assume the noise $\xi$ in Equation (2) is a mixture of two kinds of noise with PDFs $p_1(\xi)$ and $p_2(\xi)$, respectively. Suppose that $p(\xi)=\lambda p_1(\xi)+(1-\lambda)p_2(\xi)$. By Equation (3), the corresponding empirical risk loss of the mixed noise is taken as

$$c(\xi)=\lambda c_1(\xi)+(1-\lambda)c_2(\xi), \tag{4}$$

where $c_1(\xi)$ and $c_2(\xi)$ are the convex empirical risk losses of the above two kinds of noise characteristics, respectively. The weight factors are $\lambda$ and $1-\lambda$. Figure 1 displays the Gaussian–Laplacian (G-L) empirical risk loss for different values of the weight parameter $\lambda$ [29].
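The losses above can be evaluated numerically. The following pure-Python functions (names illustrative) compute the Laplacian, Gaussian, and G-L mixed empirical risk losses in the weighted form of Equation (4):

```python
def laplace_loss(xi):
    # MLE loss for Laplacian noise: c(xi) = |xi|.
    return abs(xi)

def gauss_loss(xi, sigma_i=1.0):
    # MLE loss for Gaussian noise: c(xi) = xi^2 / sigma_i^2;
    # sigma_i == 1 recovers the homoscedastic case.
    return xi ** 2 / sigma_i ** 2

def gl_mixed_loss(xi, lam=0.5, sigma_i=1.0):
    # G-L mixed loss of Equation (4):
    # lam * Gaussian loss + (1 - lam) * Laplacian loss.
    return lam * gauss_loss(xi, sigma_i) + (1.0 - lam) * laplace_loss(xi)
```

For example, `gl_mixed_loss(2.0, lam=0.5)` evaluates to 0.5·4 + 0.5·2 = 3.0.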
3. Model of G-L Mixed Noise-Characteristic
Given the training samples $D$, construct the linear regressor $f(x)=(\varpi\cdot x)+b$. To deal with nonlinear problems, the approach can be summarized as follows: map the input vectors into a high-dimensional feature space $H$ through a nonlinear mapping $\Phi$ induced by a nonlinear kernel function $K$, where the kernel $K$ is any positive definite Mercer kernel.
Definition 1 ([6,28]). (Positive definite Mercer kernel.) Assume that $X$ is a subset of $\mathbb{R}^n$. A kernel function $K(x,x')$ defined on $X\times X$ is called a positive definite Mercer kernel if there is a mapping $\Phi:X\to H$ ($H$ a Hilbert space) such that

$$K(x,x')=(\Phi(x)\cdot\Phi(x')),$$

where $(\cdot\,,\cdot)$ represents the inner product in the space $H$. Therefore, the optimization problem in the space $H$ can be solved: the input vectors enter only through inner products in the feature space $H$. Through the use of a kernel $K$, the linear model can be extended to a nonlinear one.
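A finite-sample consequence of Definition 1 is that the Gram matrix $K_{ij}=K(x_i,x_j)$ is positive semidefinite: $v^TKv\ge 0$ for every vector $v$. The sketch below (pure Python, Gaussian kernel, illustrative names) checks this numerically on random data:

```python
import math
import random

def gauss_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 * sigma^2)), a Mercer kernel.
    d2 = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-d2 / (2 * sigma ** 2))

def gram(points, kernel):
    # Gram matrix K_ij = K(x_i, x_j) on a finite sample.
    return [[kernel(xi, xj) for xj in points] for xi in points]

def quad_form(K, v):
    # v^T K v; nonnegative for all v iff K is positive semidefinite.
    n = len(K)
    return sum(v[i] * K[i][j] * v[j] for i in range(n) for j in range(n))

random.seed(0)
points = [[random.uniform(-1.0, 1.0)] for _ in range(6)]
K = gram(points, gauss_kernel)
trials = [quad_form(K, [random.uniform(-1.0, 1.0) for _ in range(6)])
          for _ in range(200)]
all_nonnegative = all(t >= -1e-12 for t in trials)
```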
In general, a mixed distribution has fine approximation ability to any continuous distribution. When there is no prior knowledge of the real noise, it can adapt well to unknown or complicated noise. Thus, a uniform model with mixed noise characteristics is presented. The primal problem of the model is formalized as

$$\min_{\varpi,b,\xi}\ \frac{1}{2}\|\varpi\|^2+C\sum_{i=1}^{N}\big(\lambda_1 c_1(\xi_i)+\lambda_2 c_2(\xi_i)\big)\quad\text{s.t.}\quad y_i=(\varpi\cdot\Phi(x_i))+b+\xi_i,\ i=1,\ldots,N,$$

where the parameter $\varpi$ represents the weight vector, $b$ is the bias term, $C>0$ is the penalty parameter, and the weight factors are $\lambda_1,\lambda_2\ge 0$ with $\lambda_1+\lambda_2=1$. $\Phi(\cdot)$ is a nonlinear mapping which transfers the input dataset to a higher-dimensional feature space $H$, $\xi_i$ is the random noise variable at time $i$, and $c_1(\cdot)$ and $c_2(\cdot)$ are the convex loss functions for the noise characteristic at sample point $(x_i,y_i)$ ($i=1,\ldots,N$).

In the application domain, most noise distributions obey neither a Gaussian nor a Laplacian distribution; the noise distribution is complicated, and it is almost impossible to describe real noise with a single distribution. It has been reported that mixed-noise models, constituted by multiple noise distributions, perform better than single-noise models [1]. As a function-fitting machine, the goal is to estimate an unknown function $f(x)$ from the dataset $D$. In this section, G-L mixed homoscedastic and heteroscedastic noise distributions are used to fit the complicated noise characteristic.
3.1. Model of G-L Mixed Homoscedastic Noise-Characteristic
Suppose the noise in Equation (2) is Gaussian with zero mean and homoscedastic standard deviation $\sigma$. By Equation (3), the empirical risk loss of the homoscedastic Gaussian noise characteristic is $c_1(\xi_i)=\xi_i^2$, and the Laplacian noise loss is $c_2(\xi_i)=|\xi_i|$. Adopting the G-L mixed homoscedastic noise distribution to fit the complicated noise characteristic, by Equation (4), the empirical risk loss of the G-L mixed homoscedastic noise is $c(\xi_i)=\lambda_1\xi_i^2+\lambda_2|\xi_i|$. Putting forward the model of the G-L mixed homoscedastic noise characteristic, its primal problem is depicted as

$$\min_{\varpi,b,\xi}\ \frac{1}{2}\|\varpi\|^2+C\sum_{i=1}^{N}\big(\lambda_1\xi_i^2+\lambda_2|\xi_i|\big)\quad\text{s.t.}\quad y_i=(\varpi\cdot\Phi(x_i))+b+\xi_i,\ i=1,\ldots,N, \tag{7}$$

where the parameter vector is $\varpi$, $\sigma$ is homoscedastic, $C>0$ is a penalty parameter, and the weight factors are $\lambda_1$ and $\lambda_2$.
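The objective of the primal problem in Equation (7) can be evaluated directly for a candidate linear model. The following pure-Python sketch (illustrative names; a linear feature map $\Phi(x)=x$ is assumed for simplicity) computes it:

```python
def gl_primal_objective(w, b, X, y, C=1.0, lam1=0.5, lam2=0.5):
    # 0.5 * ||w||^2 + C * sum_i (lam1 * xi_i^2 + lam2 * |xi_i|),
    # with residuals xi_i = y_i - (w . x_i + b).
    reg = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for x_vec, yi in zip(X, y):
        resid = yi - (sum(wj * xj for wj, xj in zip(w, x_vec)) + b)
        loss += lam1 * resid ** 2 + lam2 * abs(resid)
    return reg + C * loss

# Residuals are 1 and 0, so the objective is 0.5 + (0.5*1 + 0.5*1) = 1.5.
value = gl_primal_objective([1.0], 0.0, [[1.0], [2.0]], [2.0, 2.0])
```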
Proposition 1. The solution of the primal problem in Equation (7) is existent and unique with respect to $\varpi$.

Theorem 1. The dual problem of the primal problem in Equation (7) is the problem in Equation (8), where $\sigma$ is homoscedastic, $C>0$ is a penalty parameter, and the weight factors are $\lambda_1$ and $\lambda_2$.

Proof. We introduce the Lagrange functional

$$L(\varpi,b,\xi;\alpha)=\frac{1}{2}\|\varpi\|^2+C\sum_{i=1}^{N}\big(\lambda_1\xi_i^2+\lambda_2|\xi_i|\big)+\sum_{i=1}^{N}\alpha_i\big(y_i-(\varpi\cdot\Phi(x_i))-b-\xi_i\big).$$

Minimizing $L$ and setting the partial derivatives with respect to $\varpi$, $b$, and $\xi_i$ to zero, on the basis of the KKT conditions we get

$$\varpi=\sum_{i=1}^{N}\alpha_i\Phi(x_i),\qquad \sum_{i=1}^{N}\alpha_i=0.$$

Substituting these extreme conditions back into $L$ and maximizing with respect to $\alpha$, the dual problem in Equation (8) of the primal problem in Equation (7) is derived. □
The decision function of the model may be represented as

$$f(x)=\sum_{i=1}^{N}\alpha_i K(x_i,x)+b,$$

where the parameter vector is $\alpha=(\alpha_1,\ldots,\alpha_N)^T$, $K(x_i,x)=(\Phi(x_i)\cdot\Phi(x))$ is the inner product in $H$, and $K$ is the kernel function.
Suppose the noise in Equation (2) is purely Gaussian homoscedastic noise, that is, Gaussian noise of zero mean and homoscedastic variance $\sigma^2$ (i.e., $\lambda_2=0$). Then the dual problem of the classical Gaussian-noise model can be derived as a special case of Theorem 1.
3.2. Model of G-L Mixed Heteroscedastic Noise-Characteristic
It is assumed that the noise in Equation (2) is Gaussian with zero mean and heteroscedastic standard deviations $\sigma_i$, that is, $\xi_i\sim N(0,\sigma_i^2)$, $i=1,\ldots,N$. From Equation (3), the empirical risk loss of the heteroscedastic Gaussian noise characteristic is $c_1(\xi_i)=\xi_i^2/\sigma_i^2$, and the loss function of the Laplacian noise is $c_2(\xi_i)=|\xi_i|$, $i=1,\ldots,N$. Utilizing the G-L mixed heteroscedastic noise distribution to predict the complicated noise characteristic, from Equation (4), the loss function corresponding to the G-L mixed heteroscedastic noise is $c(\xi_i)=\lambda_1\xi_i^2/\sigma_i^2+\lambda_2|\xi_i|$. A new model with the G-L mixed heteroscedastic noise characteristic is thus proposed; its primal problem is depicted as

$$\min_{\varpi,b,\xi}\ \frac{1}{2}\|\varpi\|^2+C\sum_{i=1}^{N}\Big(\lambda_1\frac{\xi_i^2}{\sigma_i^2}+\lambda_2|\xi_i|\Big)\quad\text{s.t.}\quad y_i=(\varpi\cdot\Phi(x_i))+b+\xi_i,\ i=1,\ldots,N, \tag{10}$$

where the parameter vector is $\varpi$, the $\sigma_i$ are heteroscedastic, and $C>0$ is the penalty parameter. The weight factors are $\lambda_1$ and $\lambda_2$.
Proposition 2. The solution of the primal problem in Equation (10) is existent and unique with respect to $\varpi$.

Theorem 2. The dual problem of the model in Equation (10) has the same form as the dual in Equation (8), with the homoscedastic loss replaced by its heteroscedastic counterpart; the $\sigma_i$ are heteroscedastic, $C>0$ is the penalty parameter, and the weight factors are $\lambda_1$ and $\lambda_2$.

Proof. The proof of Theorem 2 follows by analogy with that of Theorem 1. □
The decision function of this model may be expressed as

$$f(x)=\sum_{i=1}^{N}\alpha_i K(x_i,x)+b,$$

where the parameter vector is $\alpha=(\alpha_1,\ldots,\alpha_N)^T$ and $K$ is the kernel function.
Suppose the noise in Equation (2) is G-L mixed homoscedastic noise, in which the Gaussian component has zero mean and homoscedastic variance $\sigma^2$; then Theorem 1 can be deduced from Theorem 2.
4. Solution from ALM
In this section, we use the augmented Lagrangian method (ALM) [32] to solve the dual problem in Equation (8) by applying gradient descent or Newton's method to a sequence of equality-constrained problems. By eliminating equality constraints, arbitrary equality-constrained problems can be reduced to equivalent unconstrained problems [33,34]. If there are large-scale training samples, some rapid optimization techniques can be combined with the proposed model, for example the sequential minimal optimization (SMO) algorithm [29] and the stochastic gradient descent (SGD) algorithm [35].
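To make the ALM idea concrete, the sketch below applies it to a toy equality-constrained problem of the same shape as the dual: minimize a strongly convex quadratic subject to $\sum_i\alpha_i=0$. It is a pure-Python illustration under simplifying assumptions (identity quadratic term, gradient-descent inner solver), not the paper's algorithm.

```python
def alm_minimize(y, rho=10.0, outer=50, inner=300, lr=0.01):
    # Minimize f(a) = 0.5*||a||^2 - y.a  subject to  sum(a) = 0,
    # using the augmented Lagrangian
    #   L(a, nu) = f(a) + nu*sum(a) + (rho/2)*sum(a)^2.
    n = len(y)
    a = [0.0] * n
    nu = 0.0
    for _ in range(outer):
        for _ in range(inner):          # inner unconstrained minimization
            s = sum(a)
            grad = [a[i] - y[i] + nu + rho * s for i in range(n)]
            a = [a[i] - lr * g for i, g in enumerate(grad)]
        nu += rho * sum(a)              # multiplier (dual) update
    return a

# The analytic solution of this toy problem is a_i = y_i - mean(y).
alpha = alm_minimize([1.0, 2.0, 3.0])
```

The outer loop drives the constraint violation to zero while the multiplier `nu` converges to the constraint's Lagrange multiplier; the same pattern applies to the equality constraint of the dual problem.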
Theorems 1 and 2 provide effective recognition techniques for the homoscedastic and heteroscedastic models, respectively. In this section, we derive the solution from ALM and the algorithm for the model of G-L mixed homoscedastic noise characteristic; analogously, the solution of the heteroscedastic model can be obtained by the ALM method.
(1) Let the dataset be $D=\{(x_i,y_i)\}_{i=1}^{N}$, where $x_i\in\mathbb{R}^n$ and $y_i\in\mathbb{R}$.
(2) Search the optimal parameters by using the 10-fold cross-validation strategy, and select an appropriate kernel function.
(3) Solve the dual problem in Equation (8), and get the optimal solution $\alpha^{*}=(\alpha_1^{*},\ldots,\alpha_N^{*})^T$.
(4) Build the decision function as follows:

$$f(x)=\sum_{i=1}^{N}\alpha_i^{*}K(x_i,x)+b^{*},$$

where $K(x_i,x)=(\Phi(x_i)\cdot\Phi(x))$ ($i=1,\ldots,N$) is the inner product in $H$ and $K$ is a kernel function.
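Step (2) relies on 10-fold cross-validation. A minimal fold splitter can be sketched in pure Python (illustrative; it assumes a simple interleaved assignment of samples to folds):

```python
def kfold_indices(n, k=10):
    # Split indices 0..n-1 into k interleaved folds and yield
    # (train_indices, test_indices) pairs, one pair per fold.
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(kfold_indices(20, k=10))
```

Each candidate parameter setting ($C$, $\lambda_1$, kernel parameter) is scored by averaging the validation error over the $k$ splits, and the best setting is kept.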
5. Case Study
This section tests and verifies the validity of the constructed model by comparing it with other techniques on a wind-speed dataset from Heilongjiang, China. The case study consists of the following subsections: the G-L mixed-noise characteristic of wind speed, prediction performance evaluation criteria, and short-term wind-speed forecasting based on an actual dataset.
5.1. G-L Mixed-Noise-Characteristic of Wind-Speed
To demonstrate the effectiveness of the proposed model, we collected wind-speed data from Heilongjiang. The dataset consists of more than one year of wind-speed data, with values recorded every 10 min. We first examined the G-L mixed noise and conducted experiments on it. We found that turbulence is the main reason for the high uncertainty of random wind-speed fluctuations; from the perspective of wind energy, the most significant feature of wind-energy resources is their variability. To show the distribution of wind speed, a wind-speed value is taken every 5 s and the histogram of wind speed within 1–2 h is calculated. Two typical distributions are given: one calculated when the wind speed is high and the other when the wind speed is low (see Figure 2 and Figure 3, respectively).
We analyzed the one-month time-series dataset and used the persistence method to investigate the error distribution [32]. The results show that the wind-speed error obtained from the persistence prediction is not subject to a single distribution, but approximately follows a G-L mixed distribution, as shown in Figure 4. As can be seen from the above charts and figures, the wind-speed error approximately satisfies a G-L mixed distribution; forecasting under such noise is therefore a mixed-noise task.
5.2. Prediction Performance Evaluation Criteria
It is generally known that no prediction model forecasts perfectly. The predictive performance of the compared models is assessed by common evaluation criteria: MAE (mean absolute error), RMSE (root mean square error), MAPE (mean absolute percentage error), and SEP (the standard error of prediction). The four criteria are defined as follows:

$$\mathrm{MAE}=\frac{1}{N}\sum_{i=1}^{N}\big|y_i-\hat{y}_i\big|,\qquad \mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\big(y_i-\hat{y}_i\big)^2},$$

$$\mathrm{MAPE}=\frac{100}{N}\sum_{i=1}^{N}\Big|\frac{y_i-\hat{y}_i}{y_i}\Big|,\qquad \mathrm{SEP}=\frac{100\cdot\mathrm{RMSE}}{\bar{y}},$$

where $N$ is the size of the dataset, $y_i$ is the $i$th actual observed value, $\hat{y}_i$ is the $i$th forecasted result, and $\bar{y}$ is the mean value of the observations [36,37,38,39,40]. MAE shows how similar the predicted values are to the observed values, while RMSE measures the overall deviation between predicted and observed values. MAPE is the ratio between error and observed value, and SEP is the ratio of RMSE to the average observation. The latter two are dimensionless measures of the accuracy of a wind-speed system and are sensitive to small changes.
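The four criteria can be computed directly; a pure-Python sketch follows (names illustrative; MAPE and SEP are in percent, and observed values are assumed nonzero):

```python
import math

def mae(y, yhat):
    # Mean absolute error.
    return sum(abs(a - p) for a, p in zip(y, yhat)) / len(y)

def rmse(y, yhat):
    # Root mean square error.
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y, yhat)) / len(y))

def mape(y, yhat):
    # Mean absolute percentage error, in percent; observed values must be nonzero.
    return 100.0 / len(y) * sum(abs((a - p) / a) for a, p in zip(y, yhat))

def sep(y, yhat):
    # Standard error of prediction: RMSE relative to the mean observation, in percent.
    return 100.0 * rmse(y, yhat) / (sum(y) / len(y))
```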
5.3. Short-Term Wind-Speed Forecasting with Real dataset
In this section, 2160 consecutive data points (points 1–2160, a time span of 15 days) are extracted as the training set and 720 consecutive data points (points 2161–2880, a time span of 5 days) are extracted as the testing set. The input vector is built from the historical observations preceding the forecast time, where $y_t$ is the actual observed wind speed at moment $t$, and the forecast value is the wind speed at the corresponding future moment. That is, the above models are used to forecast the wind speed at each point 10, 30, and 60 min ahead, respectively.
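Turning the wind-speed series into supervised training pairs uses lagged inputs. A minimal sketch follows (pure Python; the lag count and horizon are illustrative assumptions):

```python
def make_lagged(series, n_lags=6, horizon=1):
    # Build (x_t, y_{t+horizon-1}) pairs where x_t collects the
    # n_lags observations immediately preceding the target.
    X, y = [], []
    for t in range(n_lags, len(series) - horizon + 1):
        X.append(series[t - n_lags:t])
        y.append(series[t + horizon - 1])
    return X, y

X, y = make_lagged([1.0, 2.0, 3.0, 4.0, 5.0], n_lags=2, horizon=1)
```

With 10-min sampling, horizons of 1, 3, and 6 steps correspond to forecasts 10, 30, and 60 min ahead.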
Figure 5, Figure 6, Figure 7, Figure 8, Figure 9, Figure 10, Figure 11, Figure 12 and Figure 13 describe the forecasting results given by the four models.
The four models were implemented in Matlab 7.8. The optimal parameters of each model were searched by using the 10-fold cross-validation technique; the technology of parameter selection is studied in detail in [41,42]. Practical application demonstrates that both the polynomial kernel and the Gaussian kernel perform well under the smoothness assumption. Under these circumstances, the models employ the polynomial and Gaussian kernel functions [43]:

$$K(x,x')=\big((x\cdot x')+1\big)^{d},\qquad K(x,x')=\exp\!\Big(-\frac{\|x-x'\|^2}{\sigma^2}\Big),$$

where $d$ is a positive integer and $\sigma$ is a positive number.
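Both kernels can be written down directly; a pure-Python sketch matching the forms above (parameter defaults are illustrative):

```python
import math

def poly_kernel(x, z, d=2):
    # K(x, z) = ((x . z) + 1)^d, with d a positive integer.
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** d

def gauss_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / sigma^2), with sigma a positive number.
    d2 = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-d2 / sigma ** 2)
```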
The dual problems of the comparison models, all built on the Gaussian-noise assumption, are as follows. The authors of [41,44] define the dual problem of the classical ε-insensitive support vector regression. The authors of [45,46] studied formulations with equality constraints and inequality constraints, in which the loss function of Gaussian noise is the squared loss $\xi_i^2$ ($i=1,\ldots,N$). Suykens et al. [22] studied the least squares model for Gaussian noise, whose dual problem reduces to a linear system in the Lagrange multipliers; the slack variables and the penalty constants play their usual roles. For the ε-insensitive formulations, the size of ε is not fixed beforehand, but is a variable whose value is traded off against the model complexity and the slack variables through a constant [35].
In Figure 5, Figure 8 and Figure 11, the wind-speed forecasting results at each point given by the four models are presented 10, 30, and 60 min ahead, respectively. Figure 6, Figure 9, and Figure 12 show the error statistics of wind-speed prediction using the above four models. The box plots (Figure 7, Figure 10, and Figure 13) at several noise levels further demonstrate intuitively the comparative error statistics of the four wind-speed forecasting models. The statistical criteria of MAE, RMSE, MAPE, and SEP are displayed in Table 1, Table 2 and Table 3.
From the box-whisker plots in Figure 7, Figure 10, and Figure 13, as well as Table 1, Table 2 and Table 3, it can be concluded that, in most cases, the forecasting error of the proposed model is superior to those of the comparison models. As the prediction horizon increases to 30 and 60 min, the forecasting errors of the different models increase while the relative differences decrease, so the distinction matters less in those cases. Nevertheless, Table 1, Table 2 and Table 3 show that, under all the criteria of MAE, RMSE, MAPE, and SEP, the Gaussian–Laplacian mixed-noise model is slightly better than the classical models.
6. Conclusions
Most existing regression techniques suppose that the noise model is single. Wind-speed forecasting is complicated by volatility and uncertainty, and is thus difficult to model with a single noise distribution. Our main work is summarized as follows: (1) the optimal empirical risk loss of G-L mixed noise is deduced by the Bayesian principle; (2) models with G-L mixed homoscedastic noise and G-L mixed heteroscedastic noise are developed for complicated noise; (3) the dual problems of both models are obtained using the Lagrange functional and the KKT conditions; (4) the stability and effectiveness of the algorithm are guaranteed by solving with the ALM method; and (5) the proposed technology is used to predict short-term wind speed from historical data, forecasting the wind speed 10, 30, and 60 min ahead, respectively. The comparison results show that the proposed model outperforms classical technologies under the statistical criteria.
In the same way, we can also study Gaussian–Laplacian or Gaussian–Weibull mixed-noise classification models. Such new hybrid noise models would effectively solve complicated noise classification problems.