Kernel Partial Least Square Regression with High Resistance to Multiple Outliers and Bad Leverage Points on Near-Infrared Spectral Data Analysis

: Multivariate statistical analysis such as partial least square regression (PLSR) is the common data processing technique used to handle high-dimensional data space on near-infrared (NIR) spectral datasets. The PLSR is useful to tackle the multicollinearity and heteroscedasticity problem that can be commonly found in such data space. With the problem of the nonlinear structure in the original input space, the use of the classical PLSR model might not be appropriate. In addition, the contamination of multiple outliers and high leverage points (HLPs) in the dataset could further damage the model. Generally, HLPs contain both good leverage points (GLPs) and bad leverage points (BLPs); therefore, in this case, removing the BLPs seems relevant since it has a signiﬁcant impact on the parameter estimates and can slow down the convergence process. On the other hand, the GLPs provide a good efﬁciency in the model calibration process; thus, they should not be eliminated. In this study, robust alternatives to the existing kernel partial least square (KPLS) regression, which are called the kernel partial robust GM6-estimator (KPRGM6) regression and the kernel partial robust modiﬁed GM6-estimator (KPRMGM6) regression are introduced. The nonlinear solution on PLSR was handled through kernel-based learning by nonlinearly projecting the original input data matrix into a high-dimensional feature mapping that corresponded to the reproducing kernel Hilbert spaces (RKHS). To increase the robustness, the improvements on GM6 estimators are presented with the nonlinear PLSR. Based on the investigation using several artiﬁcial dataset scenarios from Monte Carlo simulations and two sets from the near-infrared (NIR) spectral dataset, the proposed robust KPRMGM6 is found to be superior to the robust KPRGM6 and non-robust KPLS.


Introduction
In vibrational spectroscopic techniques, multivariate statistical analysis is the common method used in the pre-treatment screening, processing, and interpreting of near-infrared (NIR) spectral data. It allows a huge number of spectral to be processed in relation to the amount of chemical quantities. However, with the high-dimensional and irregular data space problem in the NIR dataset, the use of classical multivariate analysis is sometimes not appropriate. Moreover, with its dataset complexity, it suffers from contamination of multiple outliers and high leverage points (HLPs). These are important factors that can contribute to inaccurate interpretation and can be computationally intensive. The outliers are observations that produce high residual or outlying in the y -coordinate while the HLPs are outlying in the X -coordinate. The HLPs comprise good leverage points (GLPs) and bad leverage points (BLPs). The GLPs, in this case, are not significant because they are still near the fitted regression line, and they can increase the efficiency of an estimate [1,2]. On the other hand, the BLPs are far from the majority pattern of the data; hence, they are highly significant on the computed values of various estimates [2,3]. To prevent serious damage to the parameter estimate, only the affected outliers and BLPs should be eliminated during the model fitting process. In relation to this, there are many studies [4][5][6][7] related to identifying the outliers and HLPs that have been conducted, and none of them has classified the HLPs into good or bad. Therefore, it seems timely to introduce some alternatives to the nonlinear robust multivariate method that can handle irregular data space problems and are able to identify outliers and BLPs in the dataset.
A recent well-known multivariate method that is used to downscale the high-dimensional dataset is the partial least square regression (PLSR) method [8]. It is used by projecting a large set of the n × m matrix X into a smaller set of uncorrelated variables called latent variables that correspond to the n × 1 vector y. This has been beneficial to overcome the multicollinearity and heteroscedasticity problem in the variables. Theoretically, PLSR is known as a class of linear methods with the basic assumption that such linearity serves both relationships between observed variables and modeling in the latent variables [9]. With the irregular space in the NIR spectral dataset, the nonlinear PLSR is preferred as it has shown its superiority to the linear method [10,11]. Furthermore, the nonlinear solution for PLSR is conducted by mapping the original input data matrix into the high-dimensional nonlinear feature space through a nonlinear mapping function. Among all the existing nonlinear methods, the kernel method is the most powerful method due to its flexibility and efficiency in the computational aspect [12][13][14][15][16]. Additionally, in the class of kernel mapping, to reach the nonlinear optimization procedure, using the reproducing kernel Hilbert spaces (RKHS) [17] is the most suggested procedure [13,18]. It produces a consistent solution through its unique functional spaces into kernel partial least square (KPLS) regression [13]. However, not much attention has been given to making a robust procedure on the KPLS regression method.
To address the issue, improvements on bounded influence and high breakdown point (with close to 50%) robust procedure of the GM6-estimator [19] are introduced in the KPLS. In the GM6-estimator, the initial estimator uses the least trimmed of squares (LTS) [20]. To detect the outliers and HLPs, the initial weight function is used by using robust Mahalanobis distance (RM 2 i ) based on the minimum volume ellipsoid (MVE) [21]. However, it has been reported that the suggested initial weight function is not resistant to the swamping effects and not able to classify the HLPs into good or bad [3].
Another robust method was introduced by Rousseeuw and Leroy [22] called reweighted least squares (RLS), based on the least median of squares (LMS). This RLS-LMS is an improvement to the inefficiency of classical LMS [23], which yields a better high breakdown point to identify outliers [24,25]. However, the LMS is sensitive to the contamination of multiple HLPs [24]; therefore, the diagnostic robust generalized potential (DRGP) method [26,27] is suggested to overcome this limitation. The DRGP calculates the generalized potential criteria of each observation to examine whether the suspected observations contain potential HLPs. In addition, further improvement of the efficiency of DRGP that is known as DRGP(ISE) was also introduced by replacing the RM 2 i based on MVE with index set equality (ISE) [28]. The DRGP(ISE) calculates the initial location and scale in parameter estimates through robust procedures. To make DRGP(ISE) more efficient, only outliers and BLPs are down-weighted since it has been proven that the least square (LS) estimates can only be damaged by the BLPs [26]. This is conducted by classifying the observations using diagnostic criteria of modified generalized studentized residuals (MGT i ) [3] against the DRGP to prevent the swamping and masking effects. In this case, only the regular observations and GLPs will be given a weight of 1, while the weight for outliers and BLP observations will be assigned a weight of 0.
In this study, the new robust methods called the kernel partial robust GM6-estimator (KPRGM6) regression and the kernel partial robust modified GM6-estimator (KPRMGM6) are presented. The KPRGM6 is the extension work of the kernel partial robust M-estimator (KPRM) [7], which employs generalized weight w i , which contains both row and column weights to remove the outliers and BLPs in the nonlinear kernel. The KPRMGM6 improves the initial weight in the GM6-estimator by applying several steps to classify the outliers and BLPs using diagnostic criteria of MGT i and the generalized potential values of DRGP. The desirability indices use several statistical measures to assess the performance of the methods, which are: root mean square error (RMSE), coefficient of determination (R 2 ), and standard error (SE). The RMSE measures the absolute error of the predicted model; R 2 is the proportion of variation in the data summarized by the model and indicates the reliability of the goodness of fit for the model; and SE measures the uncertainty in the prediction. In this case, the RPD parameter is no longer significant since it is not different from R 2 in evaluating the quality of the model [29]. To evaluate the performance, the non-robust KPLS is also included in comparison to the proposed robust methods.
The main objectives of this study are: (1) to formulate a robust nonlinear solution to the PLSR method using kernel-based learning of RKHS with the modified GM6-estimator in handling the irregular data space in the input data matrix; (2) to evaluate the performance of the proposed robust methods KPRGM6 and KPRMGM6 in classifying the outliers and BLPs during the model fitting process; and (3) to apply the proposed methods on several artificial dataset scenarios with Monte Carlo simulations and sets of NIR spectral of oil palm (Elaeis guineensis Jacq.) fruit mesocarp (fresh and dried ground). We limit the study by applying only the PLSR method as the principal solution to reduce the dimension in NIR spectral dataset into smaller new latent variables. The solutions also deal with the robust procedures and kernel method to downgrade the influence of outliers and BLPs and to figure out the nonlinear behavior in the dataset. The significance of this study is that it can contribute to the development of big data analysis, particularly for the process control in the NIR spectral data analysis.

Kernel Partial Least Square Regression
With the high-dimensional and irregular data space in the raw NIR spectral dataset, the kernel-based learning solution using the RKHS procedure is proposed. In general, each point in the original input data matrix is mapped nonlinearly to a higher dimensional feature space F that corresponds to a RKHS space. The theory and some general description on this can be found in Aronszajn [17]. For convenience, denote x ∈ X ⊂ m , where m is the number of predictor variables, and y ∈ Y ⊂ p , where p is the number of response variables. Let X be the centered matrix of X -space and y the centered matrix of Y -space. The mixed relation in the PLSR [30] can be formed by integrating the linear inner relation X = VP T + E between X and y block score and the outer relation y = uq T + f for the n × 1 vector y as: where V is a n × l (for l ≤ m) matrix of the n × 1 vector v g v g = Xw j / w j T w j l g=1 , a a T = b inner q T is the l × 1 vector of the coefficient, and f f = gq T + f denotes the n × 1 vector of residual in the mixed relation where f is a n × 1 vector of residual in outer relation for response y. w j w j = X T u / u T u m j=1 is a m × 1 vector of weight for X, is a l × 1 vector of regression coefficient as LS solution on the decomposition of vector u u = b g v g l g=1 of inner relation, g is a n × 1 vector of residual in the inner relation of u and V, and q q g = y T v g / v g T v g l g=1 is the loading l × 1 vector of outer relation for the response y. Equation (1) holds a = V T y, and without loss of generality X = VP T as the outer relation for predictor X, the formulation in Equation (1) can be reconstructed as: where V = XW * and W * = W(P T W) −1 . Define b PLSR = W (P T W) −1 a as m × 1 vector coefficient of mixed relation in the PLSR, then Equation (2) is equivalent to: the residual f has to be minimized. Applying the relation between inner and outer relation both in X and y, they can be calculated as W = X T u, P = X T V(V T V) −1 . The estimator for the parameter b PLSR in Equation (3) then can be calculated as: b PLSR denotes the m dimensional vector of regression coefficient in the PLSR model. The computation of the PLSR model uses the PLSR general algorithm called the nonlinear iterative partial least squares (NIPALS) algorithm [31]. This algorithm allows an iterative procedure to solve the singular value decomposition problems. In relation to the integration of kernel mapping function on the original input of the NIPALS algorithm, the modification in the algorithm is needed to extract the new latent variables from the kernel matrices. This new latent variable is calculated efficiently in the feature space F by using the nonlinear kernel functions.
Using the RKHS space H to correspond to a kernel function K, any kernel function as a sequence of linearly independent functions can be stated as a connection between the RKHS space H defined by K and space of feature mapping F. Let K(x , y) be defined as any positive definite kernel following Mercer's theorem [17] on a compact domain X × X, and it can be written as: is the sequence of eigenfunctions, and S is the dimension of the space H. Renormalize Equation (5), then: It is obvious to see that any kernel K(x, y) in Equation (6) also corresponds to a canonical (Euclidean) dot product in a possibly higher dimensional feature space F, with the mapping function ϕ written as: x ∈ X represents the feature mapping. Assuming a nonlinear transformation of the original input matrix {x i } n i=1 into the feature mapping F in the form of: where the mapping ϕ replaces x i with the sequence of eigenfunctions ϕ(x i ) and produces the high-dimensional and can even be infinite feature space F. Define ϕ as the n × S matrix of mapped space data where the ith row is the vector ϕ(x i ) in the feature space F, rather than using an explicit nonlinear mapping, the use of nonlinear kernel function is preferred. Recall Equation (6) in which the deflation is obtained as: where K is represented as n × n kernel Gram matrix of the cross dot products among all mapped input data points {ϕ(x i )} n i=1 . Now, recall the PLSR theorem whereby once the component score variable n × l matrix V in the linear PLS is obtained, a nonlinear PLS is determined as the new input matrix. Applying v g l g=1 as the new extraction of the normalized latent variables, the deflation of the matrix K and vector y is formulated as: Thus, the coefficient matrix of the KPLS regression model feature space F can be obtained from:b the final prediction of the PLSR model is given as: whereŷ andŷ v are the prediction of the training set and validation set, respectively. K v is the n v × n kernel matrix of the validation set with each element composed of is the input vectors of the validation set whereby n v is the number of samples in the validation set and x j n j=1 is the input vectors of the training set. As mentioned in Rosipal [13], centering the kernel K ij is necessary to produce the bias term to be zero. Centralization on the mapped data in F can be calculated as: where I represents the n -dimensional identity matrix and 1 n , 1 n v denotes the vectors with elements equal to 1, with lengths n and n v , respectively.

Kernel Partial Robust GM6-Estimator (KPRGM6)
Extending the work of Jia [7], which provides the robust version of KPLS using a modified robust M-estimator, here another kernel version using a robust GM6-estimator is introduced. The robust GM6-estimator proposed by Coakley and Hettmansperger [19] combines a high breakdown point (50%) of LTS [20] as the initial estimator and the LMS estimator [23] for the initial estimates of the scale, which provides 0.95 efficiency Letb LTS be the initial estimator obtained in LTS andσ LMS be the initial scale estimate in LMS. The initial solution for the partial robust GM6-estimator can be rewritten as: is the derivative of the re-descending score function, and f i b LTS is the residual using the initial LTS estimator To identify outliers and HLPs in the dataset, the classical threshold value of suitable chi-square distribution χ 2 0.95,m with RM 2 i is used: where m v and C v are robust estimates of multivariate location and covariance matrix of the minimum covariance determinant (MCD) estimator [21] calculated from the matrix of V.
The improvement on KPRGM6 is now the modified M-estimator of Serneels [6] replaced with the GM6-estimator. Here, the final estimates are calculated iteratively rather than in the single-step (Newton Raphson). The final estimates for the partial robust GM6-estimator can be defined as:b where w r i and w x i are robust weights in row-and column-based, respectively. w r i uses slight modification on vertical weight in the y direction using a re-descending where ρ(z , c) is a fair weight function [32] ρ(z, c) = 1/ 1 + z c 2 , and tuning constant c follows Huber's function [33]. While w x i is calculated using: is a robust estimator of the center of the l dimensional score vectors, and v i = (v i,1 , . . . , v i,l ) T is the vector of component score matrix V, which should be estimated. The kernel version of PRGM6 is the input matrix X, which is subsequently replaced by the outer product ϕϕ T of the n × n kernel Gram matrix K. A particular concern in the KPRGM6 is to assign the generalized weight w i on the observations that are suspicious as outliers and HLPs. Let K be defined as the new weighted matrix that is calculated using the remaining dataset of X, so that: where Ω is said to be the diagonal weight matrix, with the ith elements in the diagonal matrix equal to the generalized weight w i in the PRGM6.

Kernel Partial Robust Modified GM6-Estimator
Another improvement in the robust procedure of GM6-estimator called the kernel partial robust modified GM6 regression (KPRMGM6) is introduced. This method accommodates several robust approaches on initial weight in the GM6-estimator to remove both outliers and BLPs in the dataset. Initially, we identify the suspected outliers using the high breakdown point of the RLS-MLS estimator. Thereafter, we observe the suspected potential HLPs using the DRGP(ISE) method, which is computationally faster. To prevent the swamping and masking effects, the diagnostic criteria using fast MGT i (FMGT i ) and generalized potential DRGP are employed to re-confirm the suspicions into true outliers and BLPs in the dataset. Here, the outliers and BLPs will be weighted as 0, while the remaining observations will be weighted as 1. This weight is then applied as the initial weight w i in the GM6-estimator to determine the final estimates iteratively. The procedure uses several robust approaches, which are discussed in the following order.

Reweighted Least Squares Based on Least Median of Squares
The formulation of the robust procedure solution of the RLS-LMS estimator on the PLSR model can be written as [22]:  the observation will be identified as a suspected outlier and is assigned as 0; otherwise assign w i as 1. Here, the suspected outliers will be placed in the deletion set D r , while the remaining observations with no suspected outliers will be in the remaining set R r .

Diagnostic Robust Generalized Potential Based on Index Set Equality
The DRGP(ISE) was introduced by Lim and Midi [28] as an improvement to the MVE [21] and fast minimizing covariance determinant (MCD) [34] in the initial weight of classical DRGP [26]. The method employs two steps. The first step is to determine the suspected multiple outliers and HLPs using RM 2 i based on ISE. Here, the suspected observations are placed in the deletion set D x , while the remaining observations are in the remaining set R x . In ISE, arbitrarily, let T old be a subset containing h different observations matrix of V using the same initial subset T old and the DRGP(ISE) will yield to the same final location and scale estimates as in fast MCD [28]. Define m T old and C T old as the mean vector and covariance matrix of the whole observations in T old . In addition, let I old = π old (1) , π old (2) , . . . , π old (h) be the index set related to the sample items in T old with their Mahalanobis distance squares denoted as d 2 old (i) following Equation (12)  and π (i) being the permutation on {1, 2, . . . , h}. The replacement procedure to fast MCD is if I new = I old , re-define T old := T new followed by the mean vector m T old := m T new and the covariance matrix C T old := C T new to recalculate d 2 old (i). Otherwise, if I new = I old , then the process is converged. The second step is to re-confirm the suspected observations in D x into potential outliers and HLPs using the generalized potential p * ii that is examined through the robust cut-off point p * ii > Median p * ii + 3Q n p * ii [5]. Q n is an order statistics of all pairwise distance proposed by Rousseeuw and Croux [35]. The formula for generalized potential p * ii [5] can be written as: where w d is the number of deleted cases and k is the number of regressors (including the intercept). The rule is that if p * ii satisfies the criteria, then the suspicion D x in the first step is true. Otherwise, put back the observation and recalculate the p * ii on the remaining subset R x .

Fast Modified Generalized Studentized Residuals
The FMGT i improves the efficiency of existing FMGT i [3] by accommodating the RLS-LMS and DRGP(ISE) as initial estimators to remove the suspected outliers and HLPs in the calculation of parameter estimates. This method builds a deletion group D based on the union of remaining sets of R r in RLS-LMS and R x in DRGP(ISE). Let R be the remaining group in the dataset, and the estimated parameters of the PLSR model as in Equation (4) can be reformed as: where the residual of ith remaining observations is given by The formulation of the externally studentized residual denoted as t * i for i ∈ R is: where the w ii(R) is the ith diagonal element of hat matrix calculated using: the additional point i in the R set is defined as: Hence, the formulation of the externally studentized residual t * i for i ∈ D is given by: Following [3], the FMGT i for all observations can be rewritten by combining Equations (16) and (17) as follows: where f i(R) is the residual of the ith observations in the remaining group R. Theσ R−1 is the scale estimate of the remaining group R excluding the ith case andσ R is the scale estimate of the remaining group R. The improvement on the diagnostic criteria [3] for classification of observation by using FMGT i in Equation (18) and generalized potential p * ii of DRGP(ISE) in Equation (15) then can be proposed into four following categories: (i) Observation is classified as a regular observation:

Proposed Algorithm in Kernel Partial Robust Modified GM6-Estimator
In general, the proposed improvement in the modified GM6-estimator is to introduce a new initial weight with high resistance to the contamination of multiple outliers and BLPs in the dataset. The main computational steps in the proposed KPRMGM6 can be summarized as follows.
Step 1: Compute the kernel Gram matrix K in Equation (7) of the cross dot products among all mapped input data points.
Step 2: Centralize the kernel Gram matrix K as in Equation (9).
Step 3: Uses identity matrix I as the initial weight Ω on the centralized kernel matrix to obtain the weighted K g K g = ΩK g Ω and output vector y g , y g = Ωy g .
Step 4: Regress K g on y g to obtain the weight w g , then apply the normalization and rename it as v g .
Step 5: Continue the steps until convergence and l number of PLS are determined. Here l < m.
Step 6: The new calculated latent variables denoted as the n × l matrix of V are used as the new input space.
Step 7: Calculate the residual f i based on the initial LTS estimator using new latent variables V.
Step 9: Calculate the standardized residuals e i e i = f î σ LMS using f i andσ LMS .
Step 10: Compute the proposed improvement of initial weight w i in Equation (10)  Step 11: Calculate the bounded influence function for BLPs t i using standardized residuals e i and improvised initial weight w i , t i = e i w i . .
Step 12: Apply the weighted least squares (WLS) iteratively to obtain the parameter estimates ofb PGM6 as in Equation (10).
Step 13: Calculate the new residual f i from WLS and repeat Steps (8-12) until convergence.

Results and Discussions
To examine the performance, all the proposed methods consisted of KPLS, KPRGM6, and KPRMGM6 and were evaluated using artificial data and NIR spectral data. The artificial data use the Monte Carlo simulation with some scenarios applied. The NIR spectral data use the spectral signature of oil palm fruit mesocarp (fresh and dried ground) with three interested dependent variables (chemical parameters): percent of oil to dry mesocarp (%ODM), percent of oil to wet mesocarp (%OWM), and percent of fat fatty acid (%FFA). Here, the NIR spectral data associated with its chemical parameters were converted into the comma-separated values (CSV) text file format. This analysis was performed using R i386 software (http://rproject.org) with version 3.4.2 for windows.

Monte Carlo Simulation Study
It is known that almost all the available chemometric methods are only capable of capturing linear datasets; however, they fail to capture the structure of highly nonlinear datasets. The artificial data use the sin(x) function [7] to create the nonlinear behavior in the dataset. The artificial dataset is generated randomly using a uniform distribution within the range of [0, 10]. The scenarios in the simulation use a different artificial dataset that is based on sample size, number of predictors, and level of contamination of outliers and HLPs. The sample size uses n = 40, 60, 160, and 400, while the number of predictor variables uses m = 41, 101, and 201. The different levels α = 0.05, 0.15, and 0.25 of outliers and HLPs were applied to evaluate the robustness. The formulation of this artificial data can be defined as follows: 10) (j = 1, 2, 3, . . . , m) e ∼ N (0, 1) (i = 1, 2, 3, . . . , n) x j = c j (j = 1, 2, 3, . . . , m) y = sin(−4/5(k)) + e (i = 1, 2, 3, . . . , n) where the c j and e are independent of each other while x j and y are observable variables. The variable c j is the sum of the arithmetic sequence k and random values using uniform distribution. The artificial spectra X is the collection of x j , while y uses the sine function, which is calculated through the sum of the sine function of stated initial sequence with added noise e. If the sample order is considered as HLPs in X dimension, then c j = k + U(5, 10). Corresponding to the vertical outlier in y, the y i = sin(−4/5(k)) − e i , while if it is considered as HLP, the y i = sin(−4/5(k)) + e i , where e ∼ U (3,5). Clearly, the illustration of the scenarios using this formulation can be seen in Figure 1. As seen in Figure 1, the illustrations of the scenarios using different level contamination of outliers and HLPs in the dataset are presented in terms of an open circle. In Figure 1a, the scenario uses a small sample size, number of predictors, and level of contaminations. While in Figure 1b,c, the scenarios are set to be higher. These data are assumed good enough to evaluate the nonlinear effects of the proposed methods. We decide not to increase the level contamination of outliers and HLPs greater than 25% since with relating to data acquisition quality, this is not acceptable. The superiority of the proposed methods is evaluated by plotting the prediction values on the original artificial data (see Figure 2). As seen in Figure 2a, using a small sample size, number of predictors, and 5% level contamination, the calibration models with KPRGM6 and KPRMGM6 produce better prediction results than KPLS. This can be observed through the fitted line of predicted values against the actual values in the proposed methods (see the green line and blue line). According to the robustness performance, the KPRMGM6 is superior to KPRGM6. This is because the KPRGM6 suffers the influence of HLPs in the dataset. By increasing the sample size, the number of predictors, and level contaminations (see Figure 2b,c), the proposed KPRMGM6 (blue line) is still superior to the KPRGM6 and KPLS since it almost fits the whole actual values. In general, the kernel solution succeeded in figuring out the nonlinear structure of the dataset. With the contamination of outliers and HLPs in the dataset, the non-robust KPLS suffers from the swamping and masking effects, while the robust KPRGM6 is only able to downgrade partially some of the effects. On the other hand, KPRMGM6 is able to prevent the contamination of both outliers and BLPs; hence, it produces a better fitting line and efficiency. To confirm this finding, the Monte Carlo simulation ran 10,000 simulations, and the results are based on the average of statistical measures.  By preventing the effects of outliers and HLPs in the fitting process (see Table A1), the KPRMGM6 produces the lowest prediction error (RMSE) and better R 2 than KPRGM6 and KPLS. The KPLS still suffers from the swamping effects in the fitting process. The KPRGM6 only partially succeeds in removing the influence of outliers and BLPs in the dataset. Using different scenarios, even with low and high levels of contaminations applied in the fitting process, the results using the KPRMGM6 are still satisfactory. Clear outliers and HLPs in the dataset can be observed in Figure 3. In Figure 3a, using small sample size, number of predictors, and 5% level contamination, the KPLS really suffers from the outliers and HLPs in the dataset. This has a dramatic impact to mislead the fitted model, making the accuracy of the model low. The KPRGM6 is only able to partially downgrade the effect of contamination, which leads to a decrease in the accuracy of the fitted model. Using a higher level of contaminations (15% and 25%), the KPRGM6 still suffers from the influence, which increases the prediction error; in addition, the accuracy in KPLS becomes worse (see Figure 3b,c). Here, the fitted regression line using KPRMGM6 shows better performance (efficiency and accuracy) than the KPRGM6 and KPLS since the model is not really affected by the contamination of outliers and BLPs.

NIR Spectral Data
The spectral data use light absorbance in each j wavelength band adopted from the Beer-Lambert law [36], and the data are presented in m × 1 column vector x j using the log base 10. The spectral measurement was obtained by scanning (in contact) the fruit mesocarp using a portable handheld NIR spectrometer (QualitySpec Trek) from Analytical Spectral Devices (ASD Inc., Boulder, Colorado (CO), USA). A total of 80 fruit bunches were harvested from the site of the breeding trial in Palapa Estate, PT. Ivomas Tunggal, Riau Province, Indonesia. In a bunch, there were 12 fruit mesocarp samples collected from different sampling positions. The sampling positions comprised the vertical and horizontal lines in a bunch [37]: bottom-front, bottom-left, bottom-back, bottom-right, equator-front, equator-left, equator-back, equator-right, top-front, top-left, top-back, and top-right. Right after the collection, the fruit mesocarp samples were sent immediately to the laboratory for spectral measurement and wet chemistry analysis. The sources of variability such as planting materials (Dami Mas, Clone, Benin, Cameroon, Angola, Colombia), planting year (2010,2011,2012), and ripeness level (unripe, under-ripe, ripe, over-ripe) were also considered to cover the different sources of variation in the palm population as much as possible.
Two sets of NIR spectral data with different sample properties of fruit mesocarp (fresh and dried ground) were used. The average of three spectra measurements of each fruit sample mesocarp was used in the computation. The fresh fruit mesocarp was used to estimate the %ODM and %OWM, while the dried ground mesocarp was used to estimate the %FFA.
These parameters were analyzed through conventional analytical chemistry that adopts standard test methods from the Palm Oil Research Institute of Malaysia (PORIM) [38,39]. The %ODM was calculated on a dry matter basis, which removes the weight of water content, while the %OWM used a wet matter basis. Statistically, the distribution range of %ODM used as the dataset is 56.38-86.9%; the %OWM is 19.75-64.81%; and the %FFA is 0.17-6.3%. The NIR spectra on oil palm fruit mesocarp (both in fresh and dried ground mesocarp) and its frequency distribution on response variables, the %ODM, %OWM, and %FFA, can be seen in the previous study [37]. It is important to note that there is no prior knowledge on whether outliers and HLPs are present in this dataset. Therefore, the methods were evaluated based on their accuracy improvement through its desirability index.

Oil to Dry Mesocarp
The NIR spectra of fresh fruit mesocarp from 960 observations involving 488 wavelengths (range 550-2500 nm: 4 nm interval) were used in this study. Using the %ODM as the response variable, the effectiveness of the proposed methods was presented. As seen in Table 1, the robust KPRMGM6 improves the accuracy of the model with the lowest prediction error (0.128) and better R 2 (0.999) than the KPRGM6 and KPLS. The non-robust KPLS produces higher SE (0.282-0.283) with lower R 2 (0.927) compared to the KPRGM6 and KPRMGM6. It is known that the proposed robust methods are able to prevent the influence of outliers and HLPs in the dataset; thus, they produce more accurate predictions. The comparison between the measured and predicted values is presented in the fitting line regression (see Figure 4). It shows that by using the proposed robust methods (see Figure 4b,c), the model predictions have been able to approximately estimate the measured values. The robust KPRMGM6 fits the data more accurately compared to the robust KPRGM6. Corresponding to the residual plots produced by the proposed methods, the KPRMGM6 (see Figure 4c) yields the lowest prediction error than the remaining methods. Using a similar NIR spectral dataset as in the previous analysis, %OWM was used as a response variable. As seen in Table 1, it is known that the proposed robust KPRGM6 and KPRMGM6 are still superior to the non-robust KPLS. This is because the KPLS fails to prevent the influence of outliers and HLPs that may exist in the dataset. The two robust methods produce better R 2 (0.998-0.999) and lower prediction error (0.301-0.364). Based on these results, the KPRGM6 and KPRMGM6 are still superior as they show their robustness compared to the non-robust KPLS. The visual comparison between the measured values against the predicted values of the two proposed robust methods can be seen in Figure 5. It is observed that both the KPRGM6 and KPRMGM6 have successfully fit the measured values adequately with less error (see Figure 5b,c). However, the KPRMGM6 still outperforms the KPRGM6. As highlighted in the residual plots, the KPRMGM6 yields the lowest prediction error (see Figure 5c) compared to KPRGM6, which is dominantly close to 0. Hence, the results confirm our previous claims. In this study, even though the same NIR spectral data are used as predictor variables, with the different use of response variables, the desirability indices show different results. Nonetheless, the summarized performance of the two proposed robust methods remains satisfactory

Fat Fatty Acids
Another set of NIR spectra of dried ground mesocarp from 839 observations and 500 wavelengths (range 500-2500 nm: 4 nm interval) was used in the study. Here, the %FFA was used as the response variable. As seen in Table 1, the two proposed robust KPRGM6 and KPRMGM6 are better than non-robust KPLS as it yields the highest error in the prediction (0.222) and the lowest R 2 (0.814). The robust procedures in KPRGM6 and KPRMGM6 have prevented the PLSR model from the influence of outliers and HLPs that may exist in the dataset. Based on the fitted regression line graphs (see Figure 6), the KPRGM6 and KPRMGM6 models summarize the variability (84.4-86.6%) of the response data adequately. As seen in the residual plot (see Figure 6b,c), the KPRMGM6 produces less error (0.117) than KPRGM6 (0.207). It also observed that there is no specific pattern shown in the residual plot both in KPRGM6 and KPRMGM6; this claims a good fit of predicted values that are a random pattern. With robust procedures provided in the PLSR model, this model is relatively good enough to be used in further interpretation since it is resistant to the influence of outliers and BLPs. The range of %FFA data used in the calibration model is still needed to be expanded in order to guarantee all the possible outcome values are covered in the model. In general, we measure the %FFA using the conventional laboratory method, which is based on the extracted oil from ground dried mesocarp. This then would contribute to increasing the bias in the model accuracy of the proposed method.

Conclusions
With consideration to the high dimensionality and irregular data space in the dataset, particularly for chemometric analysis on NIR spectral data, the two new robust methods called the KPRGM6 and KPRMGM6 algorithms are proposed. The methods combine the benefits of linear PLSR and the kernel-based learning RHKS with the robustness of the modified GM6 estimators. Based on the results, the proposed robust methods have generally succeeded in preventing the influence of outliers and HLPs in the dataset and captured the nonlinear relationships through high-dimensional feature mapping. The nonrobust KPLS suffers from the contamination of outliers and HLPs; hence, it has decreased the accuracy of the PLS model. In the investigation, the use of different datasets reached different desirability indexes where the KPRMGM6 is generally superior to KPRGM6 in terms of accuracy improvement. The proposed modified KPRMGM6 shows its superiority by removing only the outliers and BLPs in the dataset since GLPs contribute to increasing the efficiency during the fitting process of parameter estimates. By adding some artificial outliers and HLPs in the Monte Carlo simulation data, the KPRGM6 has only managed to detect some outliers while the rest are unidentified. On the other hand, the contamination of all real outliers and BLPs in the dataset has to be down-weighted for the proposed KPRMGM6 to show great prominence. For future research, combining the proposed KPRMGM6 with the robust pre-processing and wavelength selection method is highly suggested. This will contribute to creating a more flexible, robust PLSR method rather than applying the method separately. A consideration to employing such machine learning (ML) and deep learning (DL) methods to figure out the nonlinear behavior in the dataset is also encouraged.