Improved Predictive Ability of KPLS Regression with Memetic Algorithms

Abstract: Kernel partial least squares regression (KPLS) is a non-linear method for predicting one or more dependent variables from a set of predictors, which transforms the original datasets into a feature space where it is possible to generate a linear model and extract orthogonal factors, also called components. A difficulty in implementing KPLS regression is determining the number of components and the kernel function parameters that maximize its performance. In this work, a method is proposed to improve the predictive ability of KPLS regression by means of memetic algorithms. A metaheuristic tuning procedure is carried out to select the number of components and the kernel function parameters that maximize the cumulative predictive squared correlation coefficient, an overall indicator of the predictive ability of KPLS. The proposed methodology led to the estimation of optimal parameters of the KPLS regression, improving its predictive ability.


Introduction
Partial least squares regression (PLS) has its origin in the field of economics with the work of Herman Wold [1] in the 1960s. It is currently popular in the social sciences, marketing and chemometrics [2]. PLS regression, however, is a linear method and is unsuitable for describing data structures that show variations with non-linear characteristics [3].
To solve the problem, Rosipal and Trejo [4] propose the Kernel partial least squares regression (KPLS) that transforms the original data sets into an arbitrary dimensionality feature space through non-linear mapping, and then creates a linear model in the feature space [5]. KPLS regression has been widely used in many scientific fields because of its high generalizability.
Shawe-Taylor and Cristianini [6] state that the improvement of classical multivariate statistical methods based on the decomposition into singular values with the use of kernel functions such as KPLS, is an area of research that has been widely used in recent years. Furthermore, Bennet and Embrechts [7] point out that future research should make full use of the potential of KPLS algorithms, that statistical learning theory can help clarify why KPLS generalizes well, and that implementing variants to KPLS could solve its limitations.
However, a recurrent problem is to determine the number of components (latent variables) and the parameters of the KPLS regression that maximize the predictive ability of the models generated [8]. Huang et al. [9] point out that the main problems in KPLS regression are the kernel function selection and the determination of its parameters, as well as the selection of the number of components; however, the selection of the kernel function remains an open problem. A usual choice of kernel function found in the literature [10,11] is the Gaussian (radial basis function) kernel

k(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²)),   (1)

where σ > 0 is the kernel width parameter.
There is no theoretical basis for how to specify kernel function parameter values; however, they must be specified before any kernel-based method is applied. The most popular approach is to select them empirically, and to a lesser extent cross-validation has been used to decide on kernel parameter values [10]. More recent studies have emphasised that kernel parameters must be optimised simultaneously with the selection of components, as these choices depend on each other [12,13].
The work of Fu et al. [13] presents a methodology for the simultaneous estimation of the kernel function parameter and the number of components in KPCA and KPLS models. The aim of this paper is to propose a methodology to improve the predictive ability of KPLS regression, taking as reference the metaheuristic tuning approach of Mello-Román and Hernández [14,15] but using memetic algorithms (MA) as the iterative optimizer in the selection of the number of components and the kernel function parameter.
Nature-inspired metaheuristic algorithms such as genetic algorithms (GA), particle swarm optimisation (PSO) and others have demonstrated good performance in optimising KPLS regression; however, they require significant computational effort that increases considerably with the number of iterations [14,15]. In this article, MA is chosen as the KPLS regression optimizer because of its ability to generate optimal solutions in a reasonable time, even for very complex problem structures [16].
MA is considered a powerful problem solver in the field of continuous optimization, as it offers a balance between the exploration of the search space through its scheme of evolutionary algorithms and the focused exploitation of promising regions with a local search algorithm [16]. Algorithm design is not a trivial task and the non-existence of a universal optimiser suggests that algorithms designed efficiently must specifically address the characteristics of the problems to be optimised [17]. In this work, preliminary tests are carried out for the MA configuration, in order to determine the best balance between exploration and local search.
It is necessary to mention that there are other regression techniques used for predictive purposes such as Auto-Regressive Integrated Moving Average (ARIMA) [18] and others that have developed versions that introduce the kernel concept such as Support Vector Regression (SVR) and Gaussian Processes (GP) [19,20]. KPLS differs from them in that it addresses the problem of reducing the data dimensionality [10].
As for the robustness of kernel-based regression methods, some research has concluded that the choice of kernel function plays an important role in it [21,22]. A non-linear kernel function (for example, the Gaussian kernel function) leads to fairly robust regression methods with respect to outliers in the original input space. That is, non-linear modelling is not only attractive when the data structure is complex and a linear method may not be suitable, but can also be an interesting option when the data contain outliers [22].

Kernel Partial Least Squares Regression
Given a matrix X_{n×m} of independent variables {x_i}_{i=1}^n ⊂ R^m and a matrix Y_{n×r} of response variables {y_i}_{i=1}^n ⊂ R^r, assume a non-linear transformation of the input variables {x_i}_{i=1}^n into a feature space F equipped with an inner product; that is, ϕ : x_i ∈ R^m → ϕ(x_i) ∈ F. Denote by Φ the (n × M) matrix whose i-th row is the vector ϕ(x_i). Depending on the non-linear mapping ϕ(·), the feature space F can be highly dimensional.
The kernel function k(x_i, x_j) = ϕ(x_i) · ϕ(x_j) calculates the inner product in the feature space F. The kernel matrix K = ΦΦ^T is the (n × n) matrix of the cross products of the transformed data {ϕ(x_i)}_{i=1}^n; that is, the element of the i-th row and the j-th column is k(x_i, x_j). The KPLS regression algorithm proposed by Rosipal and Trejo [4] is as follows:

Algorithm 1. KPLS algorithm according to Rosipal and Trejo [4].
1. Initialize the score vector u (e.g., with a column of Y).
2. Compute t = Ku and normalize t ← t/‖t‖.
3. Compute c = Y^T t.
4. Compute u = Yc and normalize u ← u/‖u‖.
5. Repeat steps 2-4 until convergence.
6. Deflate the K and Y matrices:
   K ← (I − tt^T) K (I − tt^T),   Y ← Y − tt^T Y.
7. Repeat steps 2-6 until the h vectors u and t have been extracted.

The deflation technique incorporated in step 6 of Algorithm 1 can be viewed as an implementation of Hotelling deflation [23,24]. After the extraction of the h components, with the generated matrices T_{n×h} and U_{n×h}, the matrix B of coefficients of the KPLS regression is defined by the following expression:

B = Φ^T U (T^T K U)^{−1} T^T Y.

Predictions on the training data set are obtained from the formula

Ŷ = K U (T^T K U)^{−1} T^T Y = T T^T Y,

and the following expression is used for predictions on test data sets:

Ŷ_t = K_t U (T^T K U)^{−1} T^T Y,

where K_t is the (n_t × n) matrix whose elements are the kernel functions k_ij = k(x_i, x_j) evaluated on the values of the test set {x_i}_{i=n+1}^{n+n_t} and the training set {x_j}_{j=1}^n [3]. Before applying KPLS it is necessary to center the data in the feature space [25]. The following procedures can be applied for this purpose:

K̃ = (I − (1/n) 1_n 1_n^T) K (I − (1/n) 1_n 1_n^T),
K̃_t = (K_t − (1/n) 1_{n_t} 1_n^T K)(I − (1/n) 1_n 1_n^T),

where I is the n-dimensional identity matrix and 1_n, 1_{n_t} represent the vectors whose elements are ones, with length n and n_t, respectively. KPLS regression has over the years demonstrated high predictive performance; however, the choice of the optimal kernel function and its parameters is an open problem [7,26]. This research aims to contribute to the solution of this problem by applying an optimization algorithm. To do so, it is first necessary to define the mathematical function to be optimized.
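Algorithm 1 and the prediction formulas above can be sketched in NumPy as follows. This is an illustrative reimplementation with a Gaussian kernel, not the authors' R code; all function names are ours:

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def center_kernel(K):
    """Center the training kernel matrix in the feature space."""
    n = K.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n
    return C @ K @ C

def kpls_fit(K, Y, h, tol=1e-10, max_iter=500):
    """Extract h components with the kernel NIPALS iteration of Algorithm 1."""
    n = K.shape[0]
    Kd, Yd = K.copy(), Y.astype(float).copy()
    T, U = np.zeros((n, h)), np.zeros((n, h))
    for a in range(h):
        u = Yd[:, [0]].copy()
        for _ in range(max_iter):
            t = Kd @ u                      # step 2: t = Ku, then normalize
            t /= np.linalg.norm(t)
            c = Yd.T @ t                    # step 3
            u_new = Yd @ c                  # step 4: u = Yc, then normalize
            u_new /= np.linalg.norm(u_new)
            if np.linalg.norm(u_new - u) < tol:
                u = u_new
                break
            u = u_new
        T[:, [a]], U[:, [a]] = t, u
        D = np.eye(n) - t @ t.T             # step 6: deflation projector
        Kd = D @ Kd @ D
        Yd = Yd - t @ (t.T @ Yd)
    return T, U

def kpls_predict(Kt, K, Y, T, U):
    """Y_hat = Kt U (T' K U)^{-1} T' Y; pass Kt = K for training predictions."""
    return Kt @ U @ np.linalg.solve(T.T @ K @ U, T.T @ Y)
```

By construction the extracted score vectors t are mutually orthonormal, since each deflation projects K onto the orthogonal complement of the previous t.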
Next, the mathematical structure of some indicators of the predictive ability of the KPLS regression is described.

Predictive Ability of KPLS
In any empirical modelling, it is essential to determine the correct complexity of the model. In PLS/KPLS regression it is necessary to check the predictive significance of each component, and not to add components that are not significant. According to Abdi [27], when the purpose of the regression is to generate models capable of predicting the value of the dependent variables in new observations, each component can only be considered relevant if it improves that prediction.
The cross-validation procedure is a practical and reliable way of testing predictive significance and has become the standard in PLS [28].
The cross-validation procedure is one of the oldest re-sampling techniques [29]. It consists of dividing the dataset into k segments of (approximately) equal size, fitting the model on k − 1 segments, and validating it on the remaining segment. This procedure is repeated for all k possible choices of the held-out segment [30]. In this paper, the cross-validation procedure was applied to randomly divide the data set into k = 10 segments of equal size [31].
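The segmentation step can be sketched as follows (an illustrative helper, not the paper's implementation; the function name and seed are ours):

```python
import numpy as np

def kfold_segments(n, k=10, seed=0):
    """Randomly partition indices 0..n-1 into k segments of (nearly) equal size.
    Each observation is held out exactly once across the k folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

# e.g., the 899 observations of the dataset used later in the paper
folds = kfold_segments(899, k=10)
```

For each fold, the model is fitted on the concatenation of the other k − 1 segments and evaluated on the held-out segment.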
The way to evaluate the predictive ability of a model is by comparing the observed values and the predictions of the model [32]. Several studies take as an indicator of predictive ability in PLS/KPLS the root mean square error (RMSE) [13] or functions of it [33].
The RMSE is defined as

RMSE = √(PRESS / n),

where PRESS (Predictive Residual Sum of Squares) is the sum of the squared differences between the predictions ŷ_i and the observed values y_i. However, RMSE has the disadvantage that its values depend on the scales of measurement of the dependent variables, and in certain comparative studies it is not possible to use it. For this reason, the predictive ability of the models is often quantified in terms of the coefficient obtained from cross-validation processes, called by some authors the predictive squared correlation coefficient:

Q² = 1 − PRESS / RSS,

where RSS (Residual Sum of Squares) is the sum of the squared differences of the observed values y_i with respect to the mean ȳ. In PLS/KPLS regression, the coefficient Q² is calculated for each extracted component h: RSS is calculated using h − 1 components and PRESS using h components, that is, Q²_h = 1 − PRESS_h / RSS_{h−1}. According to Thévenot [34], if the research interest is focused on evaluating the overall predictive ability of the PLS/KPLS regression model for a set of h > 1 components, a suitable indicator is the coefficient Q²cum, which evaluates the model's capacity for generalization for the set of components {1, 2, . . . , h}. It is obtained by means of the mathematical expression

Q²cum = 1 − ∏_{j=1}^{h} (PRESS_j / RSS_{j−1}).

The coefficient Q²cum takes values between 0 and 1, and the higher its value, the better the predictive performance [35].
As mentioned above, a key factor in PLS/KPLS regression is to define the optimal number of components. One criterion is to consider component h significant if Q²_h ≥ 0.0975 or, equivalently, to retain component h if PRESS_h / RSS_{h−1} ≤ 0.9025. Another criterion is to consider the optimal number of components according to the variation of the coefficients Q² and Q²cum: the coefficient Q²cum increases as components are added up to a certain maximum value and then decreases. This critical point provides an estimate of the optimal number of components.
In this work the second approach is chosen: to evaluate the variation of the coefficient Q²cum and to take as optimal the number of components for which Q²cum reaches its maximum value. This criterion is equivalent to considering component h significant if Q²_h > 0, since adding component h increases Q²cum exactly when PRESS_h < RSS_{h−1}.
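The indicators above translate directly into code. The following is a minimal sketch of the PRESS, Q² and Q²cum formulas (function names are ours; the two input lists must be aligned so that rss_list[j] is the RSS of the model with j components, rss_list[0] being the RSS of the null model):

```python
import numpy as np

def press(y, yhat):
    """Predictive residual sum of squares between observations and predictions."""
    return float(((np.asarray(y, float) - np.asarray(yhat, float)) ** 2).sum())

def q2(press_h, rss_prev):
    """Q2_h = 1 - PRESS_h / RSS_{h-1}: PRESS from the cross-validated model
    with h components, RSS from the fitted model with h-1 components."""
    return 1.0 - press_h / rss_prev

def q2_cum(press_list, rss_list):
    """Q2cum = 1 - prod_j PRESS_j / RSS_{j-1}, over components j = 1..h."""
    ratio = 1.0
    for p, r in zip(press_list, rss_list):
        ratio *= p / r
    return 1.0 - ratio
```

For a single component, Q²cum reduces to Q²_1, and each additional component with PRESS_h < RSS_{h−1} increases Q²cum.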

Memetic Algorithms
Moscato [37] introduces for the first time the term Memetic algorithms (MA) to refer to metaheuristic algorithms that show a hybrid approach. MA integrates evolutionary algorithms (such as GA) and a local search (LS) procedure to generate a hybrid algorithm with characteristics of both components. For a modern and detailed review, the text by Moscato and Cotta [38] can be used. Figure 1 shows a flowchart for a MA. At a simple level, the only difference between this and a flowchart for evolutionary algorithms is the extra "Perform local search" step [39].
Gogna and Tayal [40] describe the methodology used by MA in the following generic steps: MA starts with the generation of an initial population of solutions (individuals). This population can be generated randomly or, for better results, it can be generated using a heuristic (problem-specific) approach. The next step is the generation of a new population. This involves the use of several operations to generate offspring. The first operator is the selection operator which selects the best individual from the population in a similar way to that used in evolutionary algorithms. This is followed by the recombination of the selected individuals to produce offspring as in the case of GA. The generated offspring undergo mutations to increase the diversity of the population. The next step uses a local search operator to produce random variations in the generated population. This operation is repeated to search, in each individual's environment, for a candidate solution that is better suited than the original solution.
MA is conceived in an eclectic and pragmatic paradigm, open to the integration of other techniques (metaheuristic or not). Given the breadth of possibilities, it is worth noting that the algorithmic design used in this research is based on the proposal in [16]. The proposal implements MA variants with the paradigm of local search chains [41], denoted here MA-LS-Chains. Its main feature is the ability to apply LS several times to the same solution, using a fixed number of iterations/function evaluations each time. The final state of the LS parameters after each LS application becomes the initial point of a subsequent LS application on the same solution, creating an LS chain. MA-LS-Chains has proven to be more efficient than other algorithms, mainly in solving high-dimensional problems [16]. Algorithm 2 presents an example of pseudocode for MA-LS-Chains. The Bergmeir et al. proposal [16] uses a steady-state genetic algorithm (SSGA) as the evolutionary algorithm [42,43].

Algorithm 2. Pseudocode of MA-LS-Chains [16]
1. Generate the initial population.
2. while not termination-condition do
3.   Run the SSGA with nfrec evaluations.
4.   Build the set S_LS of individuals which can be refined by LS.
5.   Pick the best individual c_LS in S_LS.
6.   if c_LS belongs to an existing LS chain then
7.     Initialize the LS operator with the LS state stored with c_LS.
8.   else
9.     Initialize the LS operator with the default LS parameters.
10.  end if
11.  Apply the LS algorithm to c_LS with intensity I_str, giving c_r_LS.
12.  Replace c_LS by c_r_LS in the population.
13.  Store the final LS state with c_r_LS.
14. end while
After generating the initial population, the following steps are executed in a loop: the SSGA is run with a certain number of evaluations nfrec. Then, the set S_LS is built with the individuals of the population that have never been improved by the LS, or that have been improved by the LS with an improvement (in fitness) greater than δ_LS^min, where δ_LS^min is a parameter of the algorithm (by default δ_LS^min = 10^−8). If S_LS is not empty, the LS is applied with an intensity of I_str to the best individual in S_LS. If S_LS is empty, the whole population is reinitialized, except for the best individual, which is maintained in the population [16].
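The interplay between the evolutionary step and the chained local search can be illustrated with a deliberately simplified sketch. This toy loop is not MA-LS-Chains itself: it runs one steady-state GA step per generation and then a short hill-climbing LS on the current best individual, storing only the LS step size per individual as the "chained" state (all names and parameter values are ours):

```python
import numpy as np

def memetic_minimize(f, lo, hi, pop_size=20, generations=40, ls_steps=20, seed=0):
    """Toy memetic loop: steady-state GA step + short LS on the best
    individual, resuming the stored LS step size (the 'LS chain' idea)."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    pop = rng.uniform(lo, hi, size=(pop_size, len(lo)))
    fit = np.array([f(x) for x in pop])
    ls_state = {}                         # individual index -> stored LS step size
    for _ in range(generations):
        # steady-state GA: blend two parents, mutate, replace the worst
        i, j = rng.choice(pop_size, size=2, replace=False)
        alpha = rng.uniform()
        child = alpha * pop[i] + (1 - alpha) * pop[j]
        child = np.clip(child + rng.normal(scale=0.1, size=child.shape), lo, hi)
        fc = f(child)
        worst = int(np.argmax(fit))
        if fc < fit[worst]:
            pop[worst], fit[worst] = child, fc
            ls_state.pop(worst, None)     # a new individual starts a fresh chain
        # local search on the best individual, resuming its stored step size
        best = int(np.argmin(fit))
        step = ls_state.get(best, 0.5)
        x, fx = pop[best].copy(), fit[best]
        for _ in range(ls_steps):
            cand = np.clip(x + rng.normal(scale=step, size=x.shape), lo, hi)
            fcand = f(cand)
            if fcand < fx:
                x, fx = cand, fcand
            else:
                step *= 0.9               # shrink on failure; this is the LS state
        pop[best], fit[best] = x, fx
        ls_state[best] = step             # the chain resumes from here next time
    best = int(np.argmin(fit))
    return pop[best], fit[best]
```

The stored step size plays the role of the "final LS state" of Algorithm 2: when the same individual is selected again, the LS resumes where it left off rather than restarting from the default parameters.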

Methodology
The main problems in KPLS are: the selection of the kernel function with its parameters and the number of components [9]. This research proposes a metaheuristic tuning procedure for the simultaneous estimation of the parameters of the kernel function and the number of components. It takes as objective function the coefficient Q 2 cum and studies the performance of MA as optimization agent.
Given the original data matrices X_{n×m} and Y_{n×r}, a kernel function k(x_i, x_j) with parameter θ, and the kernel matrix K_{n×n} generated from X_{n×m} and k(x_i, x_j), the KPLS regression is performed according to Algorithm 1 and h components are extracted. The objective function is the coefficient Q²cum(θ, h) determined by Equation (11), which is a function of the values that h and θ take in the KPLS regression. The optimization task can be written as

max_{θ, h} Q²cum(θ, h)   subject to   θ ∈ S ⊂ R,  h ∈ Z⁺, h < m,

where the number of components h can take positive integer values lower than the number of columns of the X matrix and the domain of the kernel function parameter θ is a predefined subset S of the real numbers. The objective function Q²cum is an indicator of the overall predictive power of the KPLS regression; it takes values between 0 and 1, and the higher its value, the higher the predictive power of the model [35]. It is obtained from cross-validation procedures and is dimensionless, that is, it is not affected by the scales of the response variables.
To solve the optimization problem, an algorithm is implemented in a general iterative optimizer, in this case MA [15]. The problem is solved exclusively for feasible solutions of θ and h; infeasible solutions are discarded [44]. Algorithm 3 describes the steps of the proposed method:

Algorithm 3. KPLS parameters selection [15]
1. Given X_{n×m} the input matrix, Y_{n×r} the output matrix, and a kernel function k(x_i, x_j) with parameter θ ∈ S ⊂ R.
2. The optimizer randomly generates an initial population p of values of θ (individuals).
3. From the input matrix X_{n×m}, p kernel matrices K_{n×n} are generated, one matrix for each θ.
4. KPLS regression (Algorithm 1) between K_{n×n} and Y_{n×r} is executed with h ∈ Z⁺, h ≤ m components.
5. The coefficient Q²cum is computed by cross-validation for each θ and each number of components h.
6. Identify the θ associated with the maximum Q²cum and extract the h for which this value is reached.
7. g optimizer iterations are performed. The fitness function of the optimizer is max Q²cum.
8. The optimizer determines the best values of θ and h and the optimal Q²cum.
The algorithm described in Algorithm 3 has a generic character. It is designed for population-based metaheuristic optimizers and can be used with different kernel functions. For this work, MA was used as the iterative optimizer and the Gaussian kernel function (Equation (1)), where the kernel function parameter θ = σ > 0.
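The overall tuning loop of Algorithm 3 can be sketched generically. The sketch below substitutes a simple random search for the metaheuristic optimizer (so it is a stand-in, not MA itself); `q2cum` is assumed to be a callable implementing steps 3-5 of Algorithm 3 for one (σ, h) pair, and all names and bounds are illustrative:

```python
import numpy as np

def tune_kpls(q2cum, sigma_lo, sigma_hi, h_max, pop=20, iters=50, seed=0):
    """Random-search stand-in for the optimizer of Algorithm 3: sample
    candidate kernel parameters, evaluate Q2cum for every feasible number
    of components h, and keep the best (sigma, h) pair."""
    rng = np.random.default_rng(seed)
    best_q, best_sigma, best_h = -np.inf, None, None
    for _ in range(iters):                       # optimizer iterations (g)
        for sigma in rng.uniform(sigma_lo, sigma_hi, size=pop):
            for h in range(1, h_max + 1):        # feasible component counts
                q = q2cum(sigma, h)
                if q > best_q:
                    best_q, best_sigma, best_h = q, sigma, h
    return best_q, best_sigma, best_h
```

In the paper's setting, the MA replaces the uniform sampling with its evolutionary and local search operators, but the evaluate-and-keep-the-best structure of the loop is the same.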

Dataset and Pilot Framework
A data set on academic performance in Mathematics is used, built from individual characteristics of 899 Paraguayan students aged 12-15, from 16 educational institutions. The measurement instrument applied was a questionnaire measuring 9 predictor variables: Sex, Age, Classroom Learning, Bilingualism, Autonomous Learning, Academic Self-Concept, External Tutoring, Future Expectations and Work Status. Some of these predictor variables are multiple response. The response variable is the Mathematics academic performance of the participating students, determined on a numerical scale. The dataset is available in the GitHub digital repository at the following link: https://github.com/jorgemellopy/KPLS-MA (accessed on 25 February 2021).
In [45], this dataset was statistically modelled using Multiple Linear Regression (MLR). The value of the Coefficient of Determination obtained was R 2 = 37.9%, which represents the proportion of variability in academic performance that is attributed to the set of all predictor variables included in the model.
Preliminary tests are carried out for the MA configuration [46], to determine the best balance between exploration and intensification in the search space. This is done by taking different values of the parameter effort in MA. This parameter takes values between 0 and 1 giving the ratio of the number of evaluations performed in the local search and in the evolutionary algorithm (SSGA) [16].
The KPLS parameter selection algorithm (Algorithm 3) was run k = 30 times for each configuration [47], extracting the best results, mean and standard deviation of the estimates, and running time [48]. A non-parametric statistical analysis of the results obtained for the different MA configurations was carried out [49]. The Kruskal-Wallis test for independent samples, a non-parametric multiple comparison procedure, was used. Kruskal-Wallis is an extension of the Mann-Whitney U test and is considered the non-parametric analogue of the one-way analysis of variance (ANOVA). Once an optimal configuration of the KPLS regression has been selected, the results obtained on the same dataset by other regression techniques such as MLR and PLS are compared.
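The Kruskal-Wallis comparison is rank-based, which makes its statistic easy to illustrate. The following is a minimal sketch of the H statistic for distinct observations (no tie correction), not the implementation used in the paper, which relied on standard statistical software:

```python
import numpy as np

def kruskal_wallis_h(*groups):
    """H statistic of the Kruskal-Wallis test (assumes no tied values)."""
    data = np.concatenate([np.asarray(g, float) for g in groups])
    order = np.argsort(data)
    ranks = np.empty(len(data))
    ranks[order] = np.arange(1, len(data) + 1)   # ranks over the pooled sample
    n = len(data)
    h, start = 0.0, 0
    for g in groups:
        r = ranks[start:start + len(g)]          # ranks belonging to this group
        h += r.sum() ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * h - 3.0 * (n + 1)
```

Under the null hypothesis of identical distributions, H approximately follows a chi-squared distribution with (number of groups − 1) degrees of freedom, from which the significance values reported in Table 3 are obtained.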
All the experimental evaluations were carried out with the R software support. This required the installation of the following packages: plsdepot [50], kernlab [11], metaheuristicOpt [51] and Rmalschains [16]. The tests were performed on a computer with an Intel® Celeron® CPU G1610 @ 2.60 GHz and 4.00 GB RAM.
The package Rmalschains implements in its continuous optimization approach the following local search strategies [16]: Covariance Matrix Adaptation Evolution Strategy (CMAES) [52], Solis Wets solver (SW) [53], Subgrouping Solis Wets (SSW) [54] and Simplex [55]. The CMAES algorithm was selected to perform the local search because of the adaptability of its parameters, its fast convergence, and the very good results it obtains [16].

Preliminary Tests
Preliminary tests were carried out to select the best value for the parameter effort (Table 1). Table 2 shows the behavior of the estimates of Q²cum, σ and running time for the different tested values of the parameter effort in MA. Minimal differences are observed in the running time; however, given the characteristics and conditions of the experiments, they are not considered relevant. The longest running time obtained was 8281.5 s = 138 min. The independent-samples Kruskal-Wallis test was executed taking as null hypothesis the similarity of distributions of Q²cum, σ and the running time for the different values of the parameter effort in MA. The results presented in Table 3 indicate that there is no statistical evidence to reject the null hypothesis of similarity of distributions, both for the estimates of Q²cum and σ. The dispersion in the running time leads to rejecting the null hypothesis at a significance level of α = 0.05. Table 3. Independent-samples Kruskal-Wallis test summary.

Null Hypothesis | Sig. | Decision
The distribution of Q²cum is the same across categories of effort | 0.799 | Retain the null hypothesis
The distribution of σ is the same across categories of effort | 0.055 | Retain the null hypothesis
The distribution of running time is the same across categories of effort | 0.000 ¹ | Reject the null hypothesis
¹ The significance level is 0.05.
Since there are no significant differences in the estimation of the kernel function parameter, the value of the parameter effort = 0.75 in MA is taken, for which the average estimation of Q 2 cum is higher. Therefore, the KPLS regression is configured with a number of components h = 2 and σ = 1.6964.
In order to have a reference for the efficiency of MA in the proposed optimisation task, k = 30 runs of Algorithm 3 were executed taking as optimisers Genetic Algorithms (GA) and Particle Swarm Optimisation (PSO) [15], both with population size p = 50 and number of iterations equal to 20. The hyperparameters of these optimisers were kept at the default values set in the R packages used. GA and PSO also found the maximum value of Q²cum at h = 2, in 28/30 and 26/30 runs, respectively. Table 4 details the hyperparameters used and compares the results given by GA and PSO with those of MA, for h = 2. Since the comparison of different metaheuristics as optimisers of the KPLS regression is not the main objective of this paper, no statistical analysis is included; however, it is possible to observe that MA, GA and PSO obtain similar estimates of h, σ and Q²cum.

Benchmarking Predictive Improvement
In [45], MLR was applied to the same dataset. The coefficient of determination obtained was R² = 37.9%, which represents the proportion of variability of the response variable explained by the set of six predictor variables included in the model. These predictors were selected in a forward-selection process, according to the reduction in the residual sum of squares achieved by including each predictor and the subsequent normalization of the values obtained. The results obtained by the MLR are taken as a benchmark to evaluate the performance of the proposed method.
To evaluate the improvement in predictive ability in relation to simple PLS, the KPLS regression with Gaussian kernel function and parameter σ = 1.6964 was executed once again. The results for both methods with a number of components h = 2 are presented in Table 5. The results allow us to identify an improvement in the predictive ability of the KPLS regression using the proposed methodology, taking as a reference both the results obtained by PLS in this trial and previous research on the same dataset. In general, the results are considered adequate in terms of the accuracy of the estimates and the computational effort required.
The proposed methodology for optimising the KPLS regression using MA has been implemented on other open datasets [56]. The results can be found in detail at [57], the conclusions are similar to those reached in this paper and lead to confirm the hypothesis of improving the predictive ability of KPLS by means of MA.

Discussion
This work addresses the problem of optimizing the predictive ability of the KPLS regression and proposes a metaheuristic tuning procedure for the selection of the parameters of the kernel function θ and the number of components h. The optimization problem is defined taking as an objective function the coefficient Q 2 cum , an indicator of the overall predictive ability of the KPLS regression obtained from cross-validation procedures. To solve the problem, a selection algorithm of the parameters θ and h is designed, incorporating a Memetic algorithm (MA) as an iterative optimizer.
Preliminary tests were carried out to obtain an adequate MA configuration and the right balance between exploration and local search. No significant differences were observed in the average estimates of Q 2 cum and θ for different MA configurations. Through the proposed methodology, optimal values were selected for the kernel function parameter and the number of components in the KPLS regression. MA has performed well in terms of accuracy of estimates with a low running time.
The improved KPLS regression through the proposed procedure obtained better results than the PLS regression. The procedures described here may constitute a valid reference for the users of the technique who are looking for a better predictive performance. However, it is necessary to mention that the conclusions are limited to the evaluated dataset and the chosen computational tools.
Future research could study the effectiveness of the proposed method in mass data environments such as those frequently found in marketing, genomics, brain imaging, manufacturing and others. A natural extension of this research line is the search for an optimal configuration of the kernel function when it is integrated with other multivariate techniques.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.