To obtain “good” models from a set of acquired or existing data, three sub-problems must be solved: selecting the data to be used for design, selecting the model structure (features and topology), and estimating the model parameters.
These sub-problems are solved by the application of a model design framework composed of two existing tools. The first, denoted ApproxHull, performs data selection from the data available for design. The feature and topology search is solved by the evolutionary part of MOGA (Multi-Objective Genetic Algorithm), while parameter estimation is performed by its gradient part.
2.1.1. Data Selection
To design data-driven models such as RBFs (Radial Basis Functions), it is mandatory that the training set includes the samples that enclose the whole input–output range in which the underlying process is supposed to operate. To determine such samples, called convex hull (CH) points, out of the whole dataset, convex hull algorithms can be applied.
Standard convex hull algorithms suffer from both excessive time and space complexity in high-dimensional problems. To tackle these challenges, ApproxHull was proposed in [29] as a randomized approximation convex hull algorithm. To identify the convex hull points, ApproxHull employs two main computational geometry concepts: the hyperplane distance and the convex hull distance.
Given a point $\mathbf{x} \in \mathbb{R}^d$ in a $d$-dimensional Euclidean space, and a hyperplane $H$, the hyperplane distance of $\mathbf{x}$ to $H$ is obtained by:

$$\operatorname{dist}(\mathbf{x}, H) = \frac{\left|\mathbf{n}^{T}\mathbf{x} + d\right|}{\left\lVert \mathbf{n} \right\rVert} \quad (1)$$

where $\mathbf{n}$ and $d$ are the normal vector and the offset of $H$, respectively.
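As an illustration only, Equation (1) can be evaluated directly. The following minimal sketch (hypothetical helper name, not part of ApproxHull's implementation) assumes the hyperplane is given by its normal vector and offset:

```python
import numpy as np

def hyperplane_distance(x, n, d):
    """Distance of point x to the hyperplane {z : n.z + d = 0}, as in Equation (1)."""
    return abs(np.dot(n, x) + d) / np.linalg.norm(n)

# Example: distance of (1, 1) to the line x1 + x2 - 1 = 0 is 1/sqrt(2)
print(hyperplane_distance(np.array([1.0, 1.0]), np.array([1.0, 1.0]), -1.0))
```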
Given a set $X = \{\mathbf{x}_1, \dots, \mathbf{x}_m\} \subset \mathbb{R}^d$ and a point $\mathbf{x} \in \mathbb{R}^d$, the Euclidean distance between $\mathbf{x}$ and the convex hull of $X$, denoted by conv($X$), can be computed by solving the following quadratic optimization problem:

$$\min_{\mathbf{a}} \; \left\lVert \mathbf{V}\mathbf{a} - \mathbf{x} \right\rVert^{2} \quad (2)$$

where $\mathbf{V} = \left[ \mathbf{x}_1 \; \cdots \; \mathbf{x}_m \right]$, $\mathbf{a} = \left[ a_1, \dots, a_m \right]^{T}$ with $a_i \geq 0$, and $\sum_{i=1}^{m} a_i = 1$. Assuming that the optimal solution of Equation (2) is $\mathbf{a}^{*}$, the distance of point $\mathbf{x}$ to conv($X$) is given by:

$$\operatorname{dist}\left(\mathbf{x}, \operatorname{conv}(X)\right) = \left\lVert \mathbf{V}\mathbf{a}^{*} - \mathbf{x} \right\rVert \quad (3)$$
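For illustration, problem (2) can be solved with an off-the-shelf constrained optimizer. The sketch below (assumed helper names, not the authors' implementation) uses SciPy's SLSQP solver to compute the convex hull distance (3):

```python
import numpy as np
from scipy.optimize import minimize

def convex_hull_distance(x, X):
    """Distance of x to conv(X), Equations (2)-(3). X holds one point per row."""
    m = X.shape[0]
    obj = lambda a: np.sum((X.T @ a - x) ** 2)            # ||V a - x||^2
    constraints = ({'type': 'eq', 'fun': lambda a: np.sum(a) - 1.0},)
    bounds = [(0.0, 1.0)] * m                              # a_i >= 0 (convex combination)
    a0 = np.full(m, 1.0 / m)
    res = minimize(obj, a0, bounds=bounds, constraints=constraints, method='SLSQP')
    return np.linalg.norm(X.T @ res.x - x)

# Example: distance of (2, 0) to the triangle with vertices (0,0), (1,0), (0,1) is 1
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(convex_hull_distance(np.array([2.0, 0.0]), X))
```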
ApproxHull consists of five main steps. In Step 1, each dimension of the input dataset is scaled to the range [−1, 1]. In Step 2, the maximum and minimum samples with respect to each dimension are identified and considered as the vertices of the initial convex hull. In Step 3, a population of k facets based on the current vertices of the convex hull is generated. In Step 4, the farthest points from each facet in the current population are identified using Equation (1) and, if they have not been detected before, are considered as new vertices of the convex hull. Finally, in Step 5, the current convex hull is updated by adding the newly found vertices to the current set of vertices. Steps 3 to 5 are executed iteratively until no vertex is found in Step 4, or until the newly found vertices are very close to the current convex hull, thus containing no useful information. Points close to the current convex hull are identified using the convex hull distance (3), compared against an acceptable user-defined threshold.
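The iterative part of this procedure can be sketched as follows. This is a highly simplified, illustrative version only (our own helper names; random facet generation and a hyperplane-distance stopping test are assumed in place of the actual facet population and the convex hull distance test of Equation (3)), not the published ApproxHull algorithm:

```python
import numpy as np

def facet_hyperplane(P):
    """Hyperplane (n, offset) through the d points in P (one per row): n.z + offset = 0."""
    A = P[1:] - P[0]                          # (d-1) x d matrix of edge vectors
    n = np.linalg.svd(A)[2][-1]               # null-space direction = facet normal
    return n, -np.dot(n, P[0])

def approx_hull_sketch(data, k=30, threshold=1e-3, seed=0):
    """Simplified sketch of Steps 2-5: grow the vertex set until no distant point remains."""
    rng = np.random.default_rng(seed)
    d = data.shape[1]
    # Step 2: extreme samples of each dimension form the initial vertex set
    vertices = set(np.argmin(data, axis=0)) | set(np.argmax(data, axis=0))
    while True:
        new = set()
        v_idx = np.array(sorted(vertices))
        for _ in range(k):                    # Step 3: k facets from random d-subsets of vertices
            n, offset = facet_hyperplane(data[rng.choice(v_idx, size=d, replace=False)])
            dist = np.abs(data @ n + offset) / np.linalg.norm(n)
            cand = int(np.argmax(dist))       # Step 4: farthest sample from this facet
            if cand not in vertices and dist[cand] > threshold:
                new.add(cand)
        if not new:                           # stop: no sufficiently distant vertex was found
            break
        vertices |= new                       # Step 5: update the convex hull vertices
    return data[sorted(vertices)]
```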
In a preliminary step, before determining the CH points, ApproxHull eliminates replicas and linear combinations of samples/features. After having identified the CH points, ApproxHull generates the training, test and validation sets to be used by MOGA, according to user specifications, but incorporating the CH points in the training set.
2.1.2. Parameter Separability
We shall be using models that are linear–nonlinearly separable in their parameters [30,31]. The output of this type of model, at time step $k$, is given as:

$$\hat{y}(k) = \boldsymbol{\varphi}^{T}\left(\mathbf{x}(k), \mathbf{v}\right)\mathbf{u} \quad (4)$$

In (4), $\mathbf{x}(k)$ is the ANN input at step $k$, $\boldsymbol{\varphi}$ is the basis functions vector, $\mathbf{u}$ is the (linear) output weights vector, and $\mathbf{v}$ represents the nonlinear parameters. For simplicity, we shall assume here only one hidden layer, and $\mathbf{v}$ is composed of $n$ vectors of parameters, one for each neuron $i$. This type of model comprises Multilayer Perceptrons, Radial Basis Function (RBF) networks, B-Spline and ASMOD models, Wavelet networks, and Mamdani, Takagi, and Takagi–Sugeno fuzzy models (satisfying certain assumptions) [32].
This means that the model parameters can be divided into linear and nonlinear parameters:

$$\mathbf{w} = \left[ \mathbf{u}^{T} \; \mathbf{v}^{T} \right]^{T} \quad (5)$$

and that this separability can be exploited in the training algorithms. For a set of input patterns $X$, training the model means finding the values of $\mathbf{w}$ such that the following criterion is minimized:

$$\Omega(X, \mathbf{w}) = \frac{\left\lVert \mathbf{y} - \hat{\mathbf{y}}(X, \mathbf{w}) \right\rVert_{2}^{2}}{2} \quad (6)$$

where $\mathbf{y}$ is the target output vector and $\lVert \cdot \rVert$ denotes the Euclidean norm. Replacing (4) in (6) we have:

$$\Omega(X, \mathbf{u}, \mathbf{v}) = \frac{\left\lVert \mathbf{y} - \boldsymbol{\Gamma}(X, \mathbf{v})\,\mathbf{u} \right\rVert_{2}^{2}}{2} \quad (7)$$

where $\boldsymbol{\Gamma}(X, \mathbf{v}) = \left[ \boldsymbol{\varphi}(\mathbf{x}_1, \mathbf{v}) \; \cdots \; \boldsymbol{\varphi}(\mathbf{x}_m, \mathbf{v}) \right]^{T}$, $m$ being the number of patterns in the training set. As (7) is a linear problem in $\mathbf{u}$, its optimal solution is given as:

$$\hat{\mathbf{u}} = \boldsymbol{\Gamma}^{+}(X, \mathbf{v})\,\mathbf{y} \quad (8)$$

where the symbol ‘+’ denotes a pseudo-inverse operation. Replacing (8) in (7), we have a new criterion, which is only dependent on the nonlinear parameters:

$$\Psi(X, \mathbf{v}) = \frac{\left\lVert \mathbf{y} - \boldsymbol{\Gamma}(X, \mathbf{v})\,\boldsymbol{\Gamma}^{+}(X, \mathbf{v})\,\mathbf{y} \right\rVert_{2}^{2}}{2} \quad (9)$$
The advantages of using (9) instead of (7) are threefold:
It lowers the problem dimensionality, as the number of model parameters to determine is reduced;
The initial value of (9) is much smaller than that of (7);
Typically, the rate of convergence of gradient algorithms using (9) is faster than using (7).
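To make the use of (8) and (9) concrete, the following minimal sketch (our own names, assuming Gaussian basis functions and a target vector y) evaluates the optimal linear weights and the reduced criterion for fixed nonlinear parameters:

```python
import numpy as np

def rbf_design_matrix(X, centers, sigmas):
    """Gamma(X, v): output of each Gaussian basis function (columns) for each pattern (rows)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigmas ** 2))

def reduced_criterion(X, y, centers, sigmas):
    """Equations (8) and (9): optimal linear weights and criterion value for fixed v."""
    Gamma = rbf_design_matrix(X, centers, sigmas)
    u_opt = np.linalg.pinv(Gamma) @ y                 # Equation (8)
    residual = y - Gamma @ u_opt                      # depends only on the nonlinear parameters
    return u_opt, 0.5 * residual @ residual           # Equation (9)
```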
2.1.5. MOGA
This framework is described in detail in [37], and it will be briefly discussed here. MOGA evolves ANN structures whose parameters are separable (in this case RBFs), with each structure being trained by minimizing criterion (9), as described in Section 2.1.3. As we shall be designing forecasting models, where we want to predict the evolution of a specific variable within a predefined PH (Prediction Horizon), the models should provide multi-step-ahead forecasts. This type of forecast can be achieved in a direct mode, by having several one-step-ahead forecasting models, each providing the prediction for one specific step ahead within the PH. An alternative method, which is followed in this work, is to use a recursive version. In this case, only one model is used, but its inputs evolve with time. Consider the Nonlinear Auto-Regressive model with Exogenous inputs (NARX), with just one input, for simplicity:

$$\hat{y}(k+1 \mid k) = f\left( y(k - d_{1,1}), \dots, y(k - d_{1,n_1}),\; u(k - d_{2,1}), \dots, u(k - d_{2,n_2}) \right) \quad (17)$$

where $\hat{y}(k+1 \mid k)$ denotes the prediction for time-step $k+1$ given the measured data at time $k$, and $d_{i,j}$ the $j$th delay for variable $i$. This represents the one-step-ahead prediction within a prediction horizon. As we iterate (17) over the PH, some or all of the time indices in the right-hand side will become larger than $k$, which means that the corresponding forecasts must be employed instead of measured values. What has been said for NARX models is also valid for NAR models (with no exogenous inputs).
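A generic sketch of this recursive use of the one-step-ahead model over the prediction horizon is given below (our own illustration, with an assumed one-step model f and, for brevity, a NAR structure):

```python
import numpy as np

def recursive_forecast(f, history, delays, PH):
    """Iterate a one-step NAR model y(k+1) = f(y(k - d_1), ..., y(k - d_n)) over PH steps,
    feeding predictions back in place of (not yet available) future measurements."""
    buf = list(history)                               # measured data up to time k
    preds = []
    for _ in range(PH):
        x = np.array([buf[-1 - d] for d in delays])   # lagged inputs (predictions reused)
        y_next = f(x)
        preds.append(y_next)
        buf.append(y_next)                            # recursion: prediction becomes an input
    return np.array(preds)

# Example with a toy linear "model" and delays 0 and 1 (i.e., y(k) and y(k-1))
print(recursive_forecast(lambda x: 0.6 * x[0] + 0.3 * x[1],
                         history=[0.2, 0.5, 0.9], delays=[0, 1], PH=5))
```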
The evolutionary part of MOGA evolves a population of ANN structures. Each topology comprises the number of neurons in the single hidden layer (for an RBF model) and the model inputs, or features. MOGA assumes that the number of neurons must lie within a user-specified range, $[n_{min}, n_{max}]$. Additionally, one needs to select the features to use for a specific model, i.e., one must perform input selection. In MOGA we assume that, from a total number $q$ of available features, denoted as $F$, each model must select the most representative $d$ features, with $d$ within a user-specified interval $[d_m, d_M]$. For this reason, each ANN structure is codified as shown in Figure 1:
The first component corresponds to the number of neurons. The next $d_m$ positions represent the minimum number of features, while the last (white) positions hold a variable number of inputs, up to the predefined maximum number. The values correspond to the indices of the features $f_j$ in the columns of $F$.
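As a purely hypothetical example of this codification (values invented for illustration):

```python
# [ neurons | d_m mandatory feature indices | up to d_M - d_m optional feature indices ]
chromosome = [8, 3, 17, 42, 5, 61]        # 8 hidden neurons; inputs f3, f17, f42, f5 and f61
n_neurons, feature_indices = chromosome[0], chromosome[1:]   # indices of columns of F
```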
The operation of MOGA is a typical evolutionary procedure. We shall refer the reader to publication [
37] regarding the genetic operators.
The model design cycle is illustrated in
Figure 2. First, the search space should be defined. That includes the input variables to be considered, the lags to be considered for each variable, and the admissible range of neurons and inputs. The total input data, denoted as
F, together with the target data, must then be partitioned into three different sets:
training set, to estimate the model parameters;
test set, to perform early stopping; and
validation set, to analyze the MOGA performance.
Secondly, the optimization objectives and goals need to be defined. Typical objectives are the Root-Mean-Square Error (RMSE) evaluated on the training set or on the test set, as well as the model complexity, #(v) (the number of nonlinear parameters), or the norm of the linear parameters, $\lVert \mathbf{u} \rVert$. For forecasting applications, as is the case here, one criterion is also used to assess forecasting performance. Assume a time series sim, a subset of the design data, with $p$ data points. For each point, the model (14) is used to make predictions up to PH steps ahead. Then, an error matrix is built:

$$\mathbf{E} = \begin{bmatrix} e[1,1] & \cdots & e[1,PH] \\ \vdots & \ddots & \vdots \\ e[p,1] & \cdots & e[p,PH] \end{bmatrix}$$

where $e[i,j]$ is the model forecasting error taken from instant $i$ of sim, at step $j$ within the prediction horizon. Denoting the RMS function operating over the $i$th column of matrix $\mathbf{E}$ by $\operatorname{RMS}_i(\mathbf{E})$, the forecasting performance criterion is the sum of the RMS values of the columns of $\mathbf{E}$:

$$\sum_{i=1}^{PH} \operatorname{RMS}_i(\mathbf{E})$$
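As an illustration (our own sketch; the forecaster interface and the NumPy array inputs are assumptions, not the framework's code), the error matrix and this criterion could be computed as follows:

```python
import numpy as np

def forecasting_criterion(E):
    """Sum of the RMS values of the columns of the error matrix E (p x PH)."""
    return np.sqrt(np.mean(E ** 2, axis=0)).sum()

def build_error_matrix(forecaster, series, PH):
    """E[i, j]: error of the (j+1)-step-ahead forecast issued at instant i of `series`.
    `forecaster(series[:i + 1], PH)` is assumed to return the PH predictions made at instant i."""
    p = len(series) - PH
    E = np.empty((p, PH))
    for i in range(p):
        E[i, :] = series[i + 1:i + 1 + PH] - forecaster(series[:i + 1], PH)
    return E
```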
Notice that every performance criterion can be minimized, or set as a restriction, in the MOGA formulation.
After having formulated the optimization problem, and after setting other hyperparameters, such as the number of elements in the population (npop), the number of iterations (niter), and the genetic algorithm parameters (proportion of random immigrants, selective pressure, crossover rate and survival rate), the hybrid evolutionary-gradient method is executed.
Each element in the population corresponds to a certain RBF structure. As the model is nonlinear, a gradient algorithm such as the LM algorithm minimizing (6) is only guaranteed to converge to a local minimum. For this reason, each RBF model is trained a user-specified number of times, starting with different initial values for the nonlinear parameters. MOGA allows the initial centers to be chosen using the heuristics mentioned in Section 2.1.4, or using an adaptive clustering algorithm [38].
As the problem is multi-objective, there are several ways of identifying which training trial is the best one. One strategy is to select the training trial whose Euclidean distance from the origin of the objective space is the smallest. The green arrow in Figure 3 illustrates this situation for two objectives. In the second strategy, the average of the objective values over all training trials is calculated, and the trial whose values are closest to this average is selected as the best one (i.e., the red arrow in Figure 3). The remaining strategies select the training trial which minimizes one particular objective better than the other trials. As an example, the yellow and blue arrows in Figure 3 mark the best training trials with respect to objective 1 and objective 2, respectively.
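These selection strategies can be summarized in a short sketch (our own helper names; the strategy labels are illustrative), assuming the trial objective values are stored row-wise in an array and that all objectives are minimized:

```python
import numpy as np

def pick_best_trial(objectives, strategy="min_distance"):
    """Select the 'best' training trial from an (n_trials x n_objectives) array."""
    if strategy == "min_distance":                 # closest to the origin (green arrow)
        return int(np.argmin(np.linalg.norm(objectives, axis=1)))
    if strategy == "closest_to_average":           # closest to the average trial (red arrow)
        mean = objectives.mean(axis=0)
        return int(np.argmin(np.linalg.norm(objectives - mean, axis=1)))
    # otherwise: best trial for one particular objective (yellow/blue arrows)
    return int(np.argmin(objectives[:, int(strategy)]))

# Example: 4 training trials, 2 objectives
trials = np.array([[0.3, 0.9], [0.5, 0.5], [0.2, 1.2], [0.9, 0.1]])
print(pick_best_trial(trials), pick_best_trial(trials, "closest_to_average"), pick_best_trial(trials, 0))
```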
After having executed the specified number of iterations, we have performance values of
npop *
niter different models. As the problem is multi-objective, a subset of these models corresponds to non-dominated models (
nd), or Pareto solutions. If one or more objectives is (are) set as restriction(s), a subset of
nd, denoted as preferential solutions,
pref, corresponds to the non-dominated solutions, which meet the goals. An example is shown in
Figure 4.
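The notions of non-dominated and preferential sets can be illustrated with the following sketch (our own helper names), assuming all objectives are minimized, the model objective values are stored row-wise in a NumPy array, and the goals act as upper bounds:

```python
import numpy as np

def non_dominated(F):
    """Indices of the non-dominated (Pareto) solutions of an (n_models x n_objectives) array F."""
    idx = []
    for i, fi in enumerate(F):
        # fi is dominated if some other row is no worse in every objective and better in one
        dominated = np.any(np.all(F <= fi, axis=1) & np.any(F < fi, axis=1))
        if not dominated:
            idx.append(i)
    return idx

def preferential(F, goals):
    """Non-dominated solutions that also meet the goals set on the objectives."""
    return [i for i in non_dominated(F) if np.all(F[i] <= goals)]
```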
The performance of MOGA models is assessed on the non-dominated model set, or in the preferential model set. If a single solution is sought, it will be chosen on the basis of the objective values of those model sets, performance criteria applied to the validation set, and possibly other criteria.
When the analysis of the solutions provided by MOGA indicates that the process should be repeated, the problem definition steps should be revised. In this case, two major actions can be carried out: input space redefinition, by removing or adding one or more features (variables and lagged input terms in the case of modelling problems), and improving the trade-off surface coverage, by changing objectives or redefining goals. This process may be advantageous, as usually the output of one run allows us to reduce the number of input terms (and possibly variables, for modelling problems) by eliminating those not present in the resulting population. Additionally, it usually becomes possible to narrow the range for the number of neurons in view of the results obtained in one run. This results in a smaller search space in a subsequent run of MOGA, possibly achieving faster convergence and a better approximation of the Pareto front.
Typically, for a specific problem, an initial MOGA execution is performed, minimizing all objectives. Then, a second execution is run, where typically some objectives are set as restrictions.