Robust Wavelength Selection Using Filter-Wrapper Method and Input Scaling on Near Infrared Spectral Data †

The extraction of relevant wavelengths from a large Near Infrared Spectroscopy (NIRS) dataset is a significant challenge in vibrational spectroscopy research. Wavelength selection nonetheless improves chemical interpretability by emphasizing the chemical entities related to the chemical parameters of the samples. Given the complexity of the dataset, irrelevant wavelengths may still be included in the multivariate calibration. This makes the computational process unnecessarily complex and decreases the accuracy and robustness of the model. In multivariate analysis, Partial Least Square Regression (PLSR) is a method commonly used to build a predictive model from NIR spectral data. However, neither the PLSR method nor common commercial chemometrics software applies a standard wavelength selection procedure to screen out irrelevant wavelengths. In this study, a new robust wavelength selection procedure called the modified VIP-MCUVE (mod-VIP-MCUVE), which uses a Filter-Wrapper method and an input scaling strategy, is introduced. The proposed method combines the modified Variable Importance in Projection (VIP) and the modified Monte Carlo Uninformative Variable Elimination (MCUVE) to calculate the scale matrix of the input variables. The modified VIP uses the orthogonal components of Partial Least Square (PLS) to investigate the informative variables in the model by applying the amount of variation in both X and y, {SSX, SSY}, simultaneously. The modified MCUVE uses a robust reliability coefficient and a robust tolerance interval in the selection procedure. To evaluate the proposed method, the classical VIP, MCUVE, and autoscaling procedure in classical PLSR were also included in the evaluation. Using artificial data with Monte Carlo simulation and NIR spectral data of oil palm (Elaeis guineensis Jacq.) fruit mesocarp, the study shows that the proposed method improves model interpretability, is computationally efficient, and produces better model accuracy.


Introduction
The selection of relevant wavelengths in the chemometric analysis of Near Infrared (NIR) spectral data is crucial to prevent relevant variables from being removed from the analysis. Partial Least Square Regression (PLSR), a conventional method used in chemometrics, has no standard procedure for this selection. There are related studies of its application in oil and fat assessment in oil palm (Elaeis guineensis Jacq.). However, many researchers still use band partition experiments (see [1][2][3]), manually segmenting the wavelengths into several bands: visible (400-700 nm), NIR (701-1100 nm), Shortwave Infrared 1 (SWIR1: 1101-1351 nm), SWIR2 (1400-1800 nm), and SWIR3 (2000-2500 nm). The selection is commonly done by trial and error [1], guided by the improvement attained in model accuracy. This approach is inefficient and requires considerable experience. Moreover, an in-depth analysis needs to be carried out to understand the NIR spectral signature based on its chemical information. A wavelength selection method is therefore required to assess the contribution of each wavelength. This selection reduces the number of variables used in the model. For interpretation, the selection method may give a better understanding of the underlying processes in the sample studied.
In the review papers (see [4][5][6][7]), several recommended wavelength selection methods have been discussed. The authors highlighted the limitations and properties of each method presented, but none suggested which method is better than the others. A convenient approach is to compare the methods and examine their performance using simulation experiments and real problems. There are three main categories of variable selection methods: filter, wrapper, and embedded. The main differences between these categories lie in their processing steps. In the filter methods, relevant variables are ranked and selected according to a threshold on the relevancy index calculated from the fitted model [8,9]. The filter methods are considered fast and straightforward because no learning algorithm is required in the computational process. However, the filter methods do not take into account bias in the learning model and neglect any conditional dependence (or independence) that might exist [10]. In the wrapper methods, a supervised learning approach applies a search algorithm iteratively, using either a deterministic or a randomized search [5]. The wrapper methods are known to be costly because the evaluation criterion requires a predefined learning algorithm and a cross-validation procedure [5]. The embedded methods combine the advantages of both the filter and wrapper methods [7]. They evaluate the quality of the selected relevant variables during model building without performing a separate evaluation process on the learning model [11]. This embedded strategy has motivated the current study to integrate the steps of the filter and wrapper methods with the objective of improving the performance of the PLSR model.
There are several methods associated with both the filter and the wrapper categories. Among the filter methods, some reported selection procedures are the Stepwise Regression Coefficients [12], Loading Weights [13], Correlation Coefficient [14], and Variable Importance in Projection (VIP) [15]. Among these procedures, the VIP has earned considerable attention because of its stability and consistency in selecting the relevant wavelengths and its relatively low computational cost. In simulations using different dataset matrices, the VIP outperforms the other selection methods (see [6,16]). In the classical VIP [17], the score is calculated as a weighted combination, over all components, of the squared PLSR weights. However, the score highlights the variation only in the predictive components without including the orthogonal components. An upgrade of the VIP score based on the Orthogonal Projections to Latent Structures (OPLS) [18] was then introduced. The OPLS-VIP score systematically includes and differentiates the variation in both the predictive and orthogonal components. In some studies (see [18,19]), the OPLS-VIP has shown its superiority to the classical VIP. Among the wrapper methods, some selection methods have also been discussed, such as the Genetic Algorithm (GA) [20], Uninformative Variable Elimination (UVE) [21], Iterative Predictor Weighting [22], and Backward Variable Elimination [23]. Among these methods, the UVE is the most consistent in improving the PLSR performance [21,24].
Nonetheless, the leave-one-out validation procedure is still applied in the classical UVE. This procedure leads to overfitting and is time-consuming when acquiring the stability values for a large dataset. As an improvement, the Monte Carlo Uninformative Variable Elimination (MCUVE) method was proposed by Cai [25]. The MCUVE adopts the Monte Carlo principle to evaluate the stability of the corresponding coefficients. However, the reliability coefficient and the cut-off criterion used in the method to date are not robust enough. This has motivated the current study to propose a slight improvement that makes the MCUVE procedure more robust.
In practice, it is difficult to eliminate all the irrelevant wavelengths in spectra processing. Using too few wavelengths (as predictor variables) in the model calibration may result in overfitting or underfitting. To overcome this, a new robust procedure is needed that highlights the relevant wavelengths and downgrades the influence of the irrelevant wavelengths in the PLSR model. It has been shown that the scaling method in the PLS model is also essential to improve the convergence speed of the algorithm [26,27]. In addition, auto-scaling, which uses mean centering and the standard deviation, is a common scaling procedure in the data pre-processing step. Hence, another scaling strategy should be considered in the improvement.
The main objectives of this study are (1) to establish a new procedure for wavelength selection called the modified VIP-MCUVE (mod-VIP-MCUVE) with an input scaling strategy in the PLSR model; (2) to evaluate the performance of the proposed method against the standard auto-scaling procedure in the PLSR and the input scaling strategy using the classical VIP and MCUVE methods; and (3) to apply the proposed method to the artificial data and to NIR spectra of oil palm fruit mesocarp (fresh and dried ground).

Partial Least Square Regression
Partial Least Square Regression (PLSR) was first introduced by Wold [28] as a generalized statistical method and has become a standard method in spectroscopy analysis. Let us define a multiple regression model that relates m predictors X to a response variable y. In matrix form this can be written as
$$y = Xb + e \tag{1}$$
where y is an n × 1 vector of the response variable, X is an n × m matrix of predictors, b is an m × 1 vector of unknown parameters, and e is an n × 1 vector of random errors. The solution for the estimator of b using the least-squares method is given as
$$\hat{b} = (X^T X)^{-1} X^T y \tag{2}$$
Here the dataset has a large number of predictors m. Hence, there is an infinite number of solutions for estimating b because $X^T X$ is singular, so the usual rank condition of regression is not met. In this case, it is necessary to extract new latent variables by maximizing a covariance criterion between the predictor X and the response y that links the central values of these two sets [29].
Initializing a starting n × 1 score vector u from any single y as in Equation (1), there exists an outer relation for the predictor X as
$$X = VP^T + E \tag{3}$$
where P is the m × l matrix consisting of the m × 1 loading vectors $p_g = X^T v_g / (v_g^T v_g)$, $v_g = X w_g$ is the n × 1 column vector of scores formed from the variables $x_j$ in X, $w_g$ is the m × 1 vector of weights for X, V is an n × l matrix of the n × 1 vectors $v_g$, and E is an n × m matrix of residuals in the outer relation for the predictor X. Following these, the outer relation for the response y can also be defined as
$$y = uq^T + f \tag{4}$$
where q is the l × 1 loading vector with elements $q_g = y^T v_g / (v_g^T v_g)$, g = 1, …, l, and f is an n × 1 vector of residuals in y.
The relation between the score vector u and V is called the linear inner relation between the X and y block scores, which can simply be written as
$$u = V b_{inner} + g \tag{5}$$
where $b_{inner}$ is the l × 1 vector of regression coefficients obtained as the least-squares solution of the decomposition of the vector u, and g is an n × 1 vector of residuals in the inner relation. Applying normalization to P, W, and q to improve the inner relation, the mixed relation of the PLSR model obtained by integrating Equations (4) and (5) is
$$y = Va + f^{*} \tag{6}$$
where $a = b_{inner} q^T$ is the l × 1 coefficient vector and $f^{*} = g q^T + f$ denotes the n × 1 vector of residuals in the mixed relation. Equation (6) holds with $a = V^T y$, and, without loss of generality, $X = VP^T$ as in Equation (3). The formulation in Equation (6) can then be reconstructed by multiplying both sides by the weight matrix W, which gives
$$y = XW(P^T W)^{-1} a + f^{*} \tag{7}$$
with $V = XW^{*}$ and $W^{*} = W(P^T W)^{-1}$. Let us define $b_{PLSR} = W(P^T W)^{-1} a$ as the m × 1 coefficient vector of the mixed relation in the PLSR; then Equation (7) is equivalent to
$$y = X b_{PLSR} + f^{*} \tag{8}$$
where $f^{*}$ has to be minimized. Applying the relations in Equations (3) and (4), $W = X^T u$. Therefore, the estimator for the parameter $b_{PLSR}$ can be calculated as
$$\hat{b}_{PLSR} = W(P^T W)^{-1} V^T y \tag{9}$$
where $\hat{b}_{PLSR}$ denotes the m-dimensional vector of regression coefficients in the PLSR model.
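The construction of the PLSR coefficient vector can be illustrated with a minimal NumPy sketch of the single-response NIPALS algorithm. This is not the authors' implementation: the function name `pls1_fit` is hypothetical, the data are mean-centred as an assumption, and the y-loadings q (regression of y on each score vector) play the role of the coefficient vector a in the mixed relation.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """PLS1 via NIPALS on mean-centred data (illustrative sketch).

    Returns weights W, loadings P, y-loadings q, scores V and the
    coefficient vector b = W (P^T W)^{-1} q.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xr = X - X.mean(axis=0)                    # centred predictors (assumption)
    yr = y - y.mean()                          # centred response (assumption)
    n, m = X.shape
    W = np.zeros((m, n_components))            # weight vectors w_g
    P = np.zeros((m, n_components))            # loading vectors p_g
    V = np.zeros((n, n_components))            # score vectors v_g
    q = np.zeros(n_components)                 # y-loadings q_g
    for g in range(n_components):
        w = Xr.T @ yr                          # for a single y, u is y itself
        w /= np.linalg.norm(w)                 # normalise the weight vector
        v = Xr @ w                             # scores v_g = X w_g
        vv = v @ v
        p = Xr.T @ v / vv                      # p_g = X^T v_g / (v_g^T v_g)
        q[g] = yr @ v / vv                     # q_g = y^T v_g / (v_g^T v_g)
        Xr = Xr - np.outer(v, p)               # deflate X
        yr = yr - q[g] * v                     # deflate y
        W[:, g], P[:, g], V[:, g] = w, p, v
    b = W @ np.linalg.solve(P.T @ W, q)        # b = W (P^T W)^{-1} q
    return W, P, q, V, b
```

When all components are kept and X has full column rank, the coefficients from this sketch coincide with the ordinary least-squares solution on the centred data; with fewer components they give the regularised PLSR estimate.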

Variable Importance in Projection
In the classical VIP [17], the VIP score measures the contribution of each jth wavelength in the multivariate model based on the projection onto the PLS components. The method became popular because of its simple procedure and low computational complexity [16]. The VIP score is formulated through the normalized loading weights $v_g = w_g / \lVert w_g \rVert$ and the explained sum of squares of y for each predictive component. Mathematically, the VIP score for each jth wavelength in the PLS model with l components can be calculated as
$$VIP_j = \sqrt{\frac{m \sum_{g=1}^{l} v_{jg}^{2}\, SSY_{comp;g}}{\sum_{g=1}^{l} SSY_{comp;g}}} \tag{10}$$
where m is the number of predictors, $SSY_{comp;g}$ is the variance of y explained by the gth PLS component, and $\sum_{g=1}^{l} SSY_{comp;g}$ is the total variance explained by the PLS model over l components. By this criterion, a wavelength with VIP score > 1 is considered most relevant, while a wavelength with VIP score < 0.5 is considered irrelevant.
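The classical VIP score can be sketched in NumPy as follows. The helper names `pls1` and `vip_scores` are hypothetical, the PLS fit is a minimal single-response NIPALS on mean-centred data (an assumption), and the variance of y explained by component g is taken as q_g² v_gᵀv_g.

```python
import numpy as np

def pls1(X, y, n_components):
    """Minimal PLS1 (NIPALS) on centred data: weights W, scores V, y-loadings q."""
    Xr = X - X.mean(axis=0)
    yr = y - y.mean()
    W, V, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)              # normalised loading weight
        v = Xr @ w                          # score vector
        vv = v @ v
        qg = (yr @ v) / vv                  # y-loading of this component
        p = Xr.T @ v / vv                   # X-loading, used only for deflation
        Xr = Xr - np.outer(v, p)
        yr = yr - qg * v
        W.append(w)
        V.append(v)
        q.append(qg)
    return np.column_stack(W), np.column_stack(V), np.array(q)

def vip_scores(W, V, q):
    """Classical VIP: sqrt(m * sum_g v_jg^2 SSY_g / sum_g SSY_g)."""
    m = W.shape[0]
    ssy = q ** 2 * np.sum(V ** 2, axis=0)   # SSY explained by each component
    Wn = W / np.linalg.norm(W, axis=0)      # normalised loading weights
    return np.sqrt(m * (Wn ** 2 @ ssy) / ssy.sum())
```

Because each normalised weight vector has unit length, the squared VIP scores sum to m, which is why a score of 1 serves as the natural "average importance" threshold.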

Uninformative Variable Elimination
The classical UVE method [21] uses the leave-one-out jackknife method and artificial random noise variables, denoted as the n × m matrix N, to compute the statistical parameters. The reliability of each wavelength is then calculated from the PLSR coefficients and used as the variable selection criterion. However, when handling a large dataset, this procedure becomes costly [25,30]. As a solution, the Monte Carlo method, which is based on random selection and probability statistics, is applied in the UVE, giving the so-called MCUVE [25]. In the MCUVE, subsamples of a specific size $N_t$ are randomly selected from the training set to build r PLS sub-models. This produces a set of PLSR coefficient vectors $\hat{b}_{PLSR}$, collected as an r × m matrix. The reliability $c_j$ is computed as the ratio of the mean to the standard deviation of the PLSR coefficients $b_{i^{*}j}$ of the jth wavelength over the $i^{*} = 1, \ldots, r$ coefficient vectors, $c_j = \mathrm{mean}(b_j) / \mathrm{sd}(b_j)$. The highest $c_j$ represents the most reliable wavelength; lower values indicate less reliable wavelengths.
The cut-off threshold criterion in MCUVE is defined through the maximum absolute value of the reliability $c_{artif}$ of the artificial random noise variables in the matrix N. A wavelength with $c_j$ less than that of the artificial random noise, $c_{artif}$, is removed from the PLSR model.
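The classical MCUVE procedure described above can be sketched as follows. This is an illustrative sketch, not the authors' code: the names `pls1_coef` and `mcuve` are hypothetical, the subsample fraction and number of runs are arbitrary defaults, and the artificial noise is given a very small amplitude (an assumption borrowed from common UVE practice) so that it does not disturb the model.

```python
import numpy as np

def pls1_coef(X, y, n_components):
    """PLSR coefficients b = W (P^T W)^{-1} q from a minimal PLS1 (NIPALS) fit."""
    Xr = X - X.mean(axis=0)
    yr = y - y.mean()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)
        v = Xr @ w
        vv = v @ v
        p = Xr.T @ v / vv
        W.append(w)
        P.append(p)
        q.append(yr @ v / vv)
        Xr = Xr - np.outer(v, p)            # deflating X alone suffices for PLS1
    W, P = np.column_stack(W), np.column_stack(P)
    return W @ np.linalg.solve(P.T @ W, np.array(q))

def mcuve(X, y, n_components=2, n_runs=100, frac=0.8, seed=0):
    """Reliability c_j = mean/sd over Monte Carlo sub-models, with noise cut-off."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    noise = 1e-10 * rng.uniform(size=(n, m))    # artificial noise matrix N (assumed scale)
    Z = np.hstack([X, noise])
    n_sub = int(frac * n)
    B = np.empty((n_runs, 2 * m))               # r x (real + artificial) coefficients
    for i in range(n_runs):
        idx = rng.choice(n, size=n_sub, replace=False)
        B[i] = pls1_coef(Z[idx], y[idx], n_components)
    c = B.mean(axis=0) / B.std(axis=0, ddof=1)  # reliability of every column
    c_real, c_artif = c[:m], c[m:]
    cutoff = np.max(np.abs(c_artif))            # classical cut-off threshold
    keep = np.abs(c_real) > cutoff              # wavelengths retained in the model
    return c_real, cutoff, keep
```

The returned `keep` mask implements the classical elimination rule; the robust cut-off proposed in this study replaces only the `cutoff` line.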

Input Scaling of Filter-Wrapper Method
Auto-scaling is a common input scaling method used to standardize a dataset in the PLSR modeling process. It transforms each input variable to the same variance using its mean and standard deviation [31,32]. However, this method is disadvantageous when the original input variables are measured on the same scale. Moreover, it removes the interpretability related to the loadings [33]. With regard to signal intensity in the Near Infrared region, auto-scaling fails to keep useful interpretive information about the wavelength contribution, since the low-intensity regions are inflated to the same magnitude as the high-intensity regions [34]. The scaling method is crucial for correcting wavelength-dependent scattering effects in the NIR spectra dataset [35,36]. To overcome this, a new alternative input scaling method based on the Filter and Wrapper methods is proposed. This method is denoted as mod-VIP-MCUVE. Besides correcting the wavelength-dependent scattering effects and preserving the chemical interpretive information in each wavelength, the proposed method eliminates the influence of irrelevant wavelengths during the modeling process. In general, the three main computational steps of the mod-VIP-MCUVE method are the following.

• Step 1: Compute the total OPLS-VIP score and use it to scale the original input matrix.
• Step 2: Run the modified MCUVE procedure using the scaled input matrix of OPLS-VIP to get the reliability scores.
• Step 3: Re-scale the input matrix using the reliability scores as the final scaled input matrix in the PLSR model.
In the OPLS-VIP [19], the VIP score is measured based not only on the projections onto the PLS components but also on the orthogonal components. This score considers the variation in both the predictor variable X (SSX) and the response variable y (SSY). Here, the fourth of the four OPLS-VIP variants is preferred because of its interpretative ability. The OPLS-VIP score uses the combination {SSX, SSY} in the weighting parameters together with the normalized loadings $v_g$. The total OPLS-VIP score, calculated from $VIP_{pred}$ (predictive components) and $VIP_{ortho}$ (orthogonal components), is then used as the final VIP score.
Let us redefine g as the predictive component and $g_o$ as the orthogonal component; then l stands for the total number of predictive components and $l_o$ for the total number of orthogonal components, with m and $m_o$ the total numbers of variables used in the predictive and orthogonal components, respectively. In the OPLS-VIP formulations for the predictive and orthogonal parts, Equations (12) and (13), the sum of squares (SS) of both variable y and variable X carries the subscripts comp;g and comp;$g_o$. The subscript comp;g refers to the explained SS of the gth predictive component, while the subscript comp;$g_o$ refers to the explained SS of the $g_o$th orthogonal component. The SS with subscript cum refers to the cumulative explained SS over all components in the model. The total OPLS-VIP score (denoted as VIP-total) is then formulated from both parts, where M is the sum of the variables used in the predictive and orthogonal components. This total OPLS-VIP score is used to scale the original wavelength variables into the new input matrix. Let us define $\tilde{X}$ as the scaled input variable, calculated by applying the OPLS-VIP score to the predictor variable X, which is initially not scaled. Mathematically it can be written as
$$\tilde{X} = X\Omega$$
where $\Omega \in \mathbb{R}^{m \times m}$ is a diagonal weight matrix of size m × m. The element $\lambda_j$ on the diagonal of Ω is a non-negative scaling factor for the jth input wavelength. The new scaled $\tilde{X}$ is used as the new input matrix in the modified MCUVE.
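The input scaling step itself is a column-wise multiplication by a diagonal weight matrix. A minimal NumPy sketch, with the hypothetical function name `scale_inputs` and arbitrary example weights standing in for the OPLS-VIP scores:

```python
import numpy as np

def scale_inputs(X, weights):
    """Scale each column j of X by a non-negative factor lambda_j (X @ diag(lambda))."""
    lam = np.asarray(weights, dtype=float)
    if np.any(lam < 0):
        raise ValueError("scaling factors must be non-negative")
    omega = np.diag(lam)                 # diagonal weight matrix, m x m
    return np.asarray(X, dtype=float) @ omega

# Example with made-up weights: a weight of 0 removes a wavelength's influence,
# a weight above 1 emphasises it.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
X_scaled = scale_inputs(X, [1.5, 0.0, 0.5])
```

Unlike hard elimination, a wavelength with a small but non-zero weight stays in the model with reduced influence, which is the sense in which irrelevant wavelengths are "downgraded".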
In the modified MCUVE, the drawback of the classical cut-off threshold criterion in Equation (11) has been discussed by Centner (see [21]). As an alternative, a new robust cut-off criterion based on a one-sided tolerance interval from Natrella [37] is proposed, which eliminates the irrelevant wavelengths more stably. The cut-off value is calculated using the robust location and scale of the reliability coefficients obtained from the added artificial uninformative random variables. In addition, it includes a factor k that is a function of the desired proportion γ, the error level α, and the number of repetitions r used in the MC random subsample selection. Using the $c_{artif}$ in Equation (11), the new cut-off criterion is defined from these robust estimates and the factor k. This criterion has the benefit of separating the most relevant variables ($c_j$ > cut-off value) from the less relevant ones for further interpretation. Applying the reliability $c_j$ of the modified MCUVE as the element $\lambda_j$ in Ω, the new scaled input variable $\tilde{X}^{*}$ for the PLSR model is then updated.
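A cut-off of this kind can be sketched with the Python standard library. Several choices below are assumptions rather than details taken from this study: the cut-off is taken as "robust location + k × robust scale", the median and the scaled median absolute deviation (MAD) serve as the robust estimators, the absolute reliabilities are used, and k follows the widely used closed-form approximation for one-sided normal tolerance factors. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, median

def tolerance_factor(n, proportion, confidence):
    """Approximate one-sided normal tolerance factor k for sample size n."""
    z_p = NormalDist().inv_cdf(proportion)      # quantile for the covered proportion
    z_g = NormalDist().inv_cdf(confidence)      # quantile for the confidence level
    a = 1.0 - z_g ** 2 / (2.0 * (n - 1))
    b = z_p ** 2 - z_g ** 2 / n
    return (z_p + sqrt(z_p ** 2 - a * b)) / a

def robust_cutoff(c_artif, r, gamma=0.95, alpha=0.05):
    """Cut-off = median + k * MAD of |c_artif| (assumed form of the robust criterion)."""
    vals = [abs(c) for c in c_artif]
    loc = median(vals)                                      # robust location
    mad = 1.4826 * median([abs(v - loc) for v in vals])     # robust scale
    k = tolerance_factor(r, gamma, 1.0 - alpha)             # k depends on gamma, alpha, r
    return loc + k * mad
```

With these estimators a single extreme noise reliability barely moves the threshold, whereas the classical maximum-based cut-off is determined entirely by that one value.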

Monte Carlo Simulation Study
A simulation study was carried out to evaluate the performance of the proposed method and to compare it with the existing methods discussed in this study. Following the simulation study of Kim [26], the artificial dataset was generated randomly using the Uniform distribution (0,1) with added noise that follows the normal distribution. This dataset was applied in a linear combination equation under different scenarios. Five sample sizes (n = 40, 60, 150, 400, and 600), three numbers of predictor variables (m = 41, 101, and 201), and five proportions of important variables (IV = 0.10, 0.20, 0.40, 0.60, and 0.80) were considered. The 100(IV)% of predictor variables were selected as important variables, and the remaining 100(1 − IV)% were considered less important. The formulation of this simulation can be defined as follows
(…, mo; $j_e$ = 1, 2, …, me)
$$iv = \{iv_1, iv_2, \ldots, iv_{100(IV)\% \cdot m}\}$$
$$y = Xb + e_0, \quad i = 1, 2, \ldots, n; \; j = iv_1, iv_2, \ldots, iv_{100(IV)\% \cdot m} \tag{17}$$
where m is the total number of predictors, mo is the number of observable variables, and me = (m − 100(IV)% · m)/2 is the number of artificial noise variables. These artificial variables are classified as less important variables in the dataset. In Equation (17), $c_{j_o}$, $ce_{j_e}$, and $e_j$ are independent of each other, while X and y are the observable variables. The $c_{j_o}$ follows the Uniform distribution (1,10) with size n. The artificial noise variables $ce_{j_e}$ are added to the predictors and follow the Uniform distribution (5,20) with size n; they are classified as less important variables. The $e_j$ follows the standard normal distribution with size n, and b represents the coefficient vector for the selected important variables, which follows the Uniform distribution (0,7) with size m. The set iv contains the selected important variables in mo, and $e_0$ is the error added in the linear combination of y.
In the PLSR model, the number of PLS components is a key modeling choice that is often regarded as subjective. In Figure 1, a re-sampling procedure called cross-validation is used to select the optimum number of PLS components, based on the lowest Root Mean Square Error of Prediction (RMSEP). 'Selection' denotes the optimum number of PLS components suggested by the cross-validation technique (highlighted with the blue dashed line), while 'Abs.minimum' refers to the lowest RMSEP (highlighted with the gray dashed line). As the number of PLS components used in the PLSR model increases, the mean RMSEP decreases. The optimum number of PLS components depends on how strongly the original variables contribute to the model. In the experiments using different levels of n, m, and IV, the proposed mod-VIP-MCUVE requires fewer PLS components to reach the minimum RMSEP than the other methods (the classical PLSR method with no input scaling applied, VIP, and MCUVE). The results are consistent and satisfactory. The MCUVE input scaling method is comparable to the proposed method, but it still produces higher RMSEP values. When fewer variables are used as predictors, a faster computational speed is attained. Based on the global minimum of the cross-validation, the proposed mod-VIP-MCUVE has succeeded in reducing the RMSEP and improving the accuracy of the PLSR model.
Several statistical measures are used as evaluation indices to assess the goodness of the methods: Root Mean Square Error (RMSE), Coefficient of Determination (R 2 ), Ratio of Performance to Deviation (RPD), and Standard Error (SE). The RMSE is an absolute measure of fit, R 2 measures the proportion of variation in the data explained by the model, RPD assesses the reliability of the goodness of fit of the model, and SE measures the uncertainty in the NIRS prediction. In this section, the Monte Carlo simulation was repeated 10,000 times, and the results are based on the averages of the statistical measures (see Table 1). Several scenarios with different treatments are applied to evaluate the PLSR model. According to the results, comparing the RMSE, R 2 , and SE values in all scenarios, the proposed mod-VIP-MCUVE produced better accuracy than the other methods. Judged by its RPD, the proposed method is also more reliable than the others. By downgrading the irrelevant variables in the fitting process, the proposed method performs comparably to the classical PLSR method with all variables involved. This shows that the proposed mod-VIP-MCUVE, which relies on fewer variables, is more efficient than the traditional PLSR model since it obtains a similar accuracy.
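The four evaluation indices can be computed as in the following standard-library sketch. The definitions used here are common chemometrics conventions and are assumptions with respect to this study: RPD is taken as the standard deviation of the reference values divided by the RMSE, and SE as the sample standard deviation of the residuals (a bias-corrected standard error). The function name `evaluate` is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def evaluate(y_true, y_pred):
    """Return RMSE, R2, RPD and SE for paired reference and predicted values."""
    resid = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in resid)               # residual sum of squares
    y_bar = mean(y_true)
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)   # total sum of squares
    rmse = sqrt(ss_res / len(resid))
    return {
        "RMSE": rmse,
        "R2": 1.0 - ss_res / ss_tot,
        "RPD": stdev(y_true) / rmse,                 # assumed definition: SD / RMSE
        "SE": stdev(resid),                          # assumed definition: SD of residuals
    }

# Small made-up example
scores = evaluate([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.0])
```

Lower RMSE and SE and higher R2 and RPD indicate a better model, which is the direction of comparison used when reading Table 1.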
The most relevant variables selected by the methods are classified according to each method's cut-off threshold criterion on the score values. To evaluate the interpretability, the calculations are plotted in Figure 2.
In Figure 2, the selection of variables in each method uses a different cut-off criterion. The comparison includes the VIP-total score because this procedure is also part of the proposed mod-VIP-MCUVE method. For selection, the classical VIP and the VIP-total methods use VIP score > 1, while the MCUVE and the mod-VIP-MCUVE use a cut-off threshold criterion. The VIP-total suggests a greater number of relevant variables than the classical VIP. The MCUVE uses the standard cut-off threshold criterion and classifies a greater number of variables as relevant than the classical VIP and the VIP-total. The proposed mod-VIP-MCUVE uses the robust cut-off threshold (red line) and collects a higher number of relevant variables than the MCUVE (green line). The mod-VIP-MCUVE succeeds in downgrading the irrelevant variables (close to 0) and highlighting the pertinent variables in the computational process. Using the proposed mod-VIP-MCUVE, the final subset of selected relevant variables gives the best prediction capability, with better accuracy than the other methods.
The computing time during the fitting process was recorded to evaluate the efficiency of the proposed mod-VIP-MCUVE method. In Figure 3, the proposed mod-VIP-MCUVE method outperformed the others. Across different sample sizes and numbers of predictors, the proposed method consistently speeds up convergence. The PLSR has the worst performance because of its inefficient computing time, even though auto-scaling with the mean and standard deviation is applied in the procedure.

NIR Spectral Dataset
A total of 80 fruit bunches were collected from the breeding trial site in Palapa Estate, PT. Ivomas Tunggal, Riau Province, Indonesia. Sources of variability such as planting material (Dami Mas, Clone, Benin, Cameroon, Angola, Colombia), planting year (2010-2012), and ripeness level (unripe, under ripe, ripe, over ripe) were considered in order to cover as much as possible of the whole range of potential variation in the palm population. Right after harvest, the bunch samples were sent immediately to the laboratory for spectral measurement and wet chemistry analysis. The fruit mesocarp samples were collected from 12 sampling positions defined by the vertical and horizontal lines in a bunch (see Figure 4): bottom-front, bottom-left, bottom-back, bottom-right, equator-front, equator-left, equator-back, equator-right, top-front, top-left, top-back, and top-right. The spectral measurement was done by scanning (in contact) the oil palm fruit mesocarp using a portable handheld NIR spectrometer, QualitySpec Trek, from Analytical Spectral Devices (ASD Inc., Boulder, CO, USA). The resulting NIR spectral dataset, shown in Figure 5, was then used in evaluating the proposed method. The spectral data, as the light absorbance in each jth wavelength band, were derived from the Beer-Lambert Law [38] and presented in the m × 1 column vector x j using the log base 10.
Each fruit mesocarp sample was scanned three times, and the averaged spectra were used in the computation (see Figure 5). There are two sample conditions with different parameters observed in this study: fresh fruit mesocarp and dried ground mesocarp. The fresh fruit mesocarp is used to estimate the percentage of Oil to Dry Mesocarp (%ODM) and Oil to Wet Mesocarp (%OWM), while the dried ground mesocarp is used to estimate the percentage of Free Fatty Acids (%FFA). These parameters were analyzed through wet chemistry analysis using the standard test methods from the Palm Oil Research Institute of Malaysia (PORIM) [39,40]. The %ODM is calculated on a dry matter basis, which removes the weight of the water content, while the %OWM uses a wet matter basis. As seen in Figure 6, the %ODM ranges over 56.38-86.9%, the %OWM over 19.75-64.81%, and the %FFA over 0.17-6.3%. These wide ranges show the actual variation covered in the analysis.
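The conversion to base-10 absorbance and the averaging of the three scans can be sketched as follows. This assumes that the instrument reports reflectance R on a 0-1 scale and that absorbance is taken as log10(1/R), with the averaging done after the conversion; neither detail is stated in this study, and the function names are hypothetical.

```python
import math

def absorbance(reflectance):
    """Convert a reflectance spectrum (values in (0, 1]) to absorbance log10(1/R)."""
    return [-math.log10(r) for r in reflectance]

def mean_spectrum(scans):
    """Average several scans of the same sample wavelength by wavelength."""
    n = len(scans)
    return [sum(values) / n for values in zip(*scans)]

# Three made-up scans of one sample at two wavelengths
scans = [[0.10, 0.50], [0.10, 0.50], [0.10, 0.50]]
spectrum = mean_spectrum([absorbance(s) for s in scans])
```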

Oil to Dry Mesocarp
In the spectral measurement on the fresh fruit mesocarp samples (Figure 5a), each spectrum is composed of 489 wavelengths as data points (range 550-2500 nm: 4 nm interval), with about 960 spectra collected in total. Here, the importance of the wavelengths is generally unknown and needs to be investigated. The selection of the most informative wavelengths in the NIR spectra related to the %ODM in the fresh fruit mesocarp is crucial for further data interpretation. The summary of the fitting performance on the dataset using the calibration model with the wavelength selection methods is presented in Table 2. Note: nPLS is the optimum number of PLS components used in the PLSR model.
As seen in Table 2, the proposed mod-VIP-MCUVE is superior to the other methods. Using the auto-scaling method, the classical PLSR shows the worst performance compared to the other methods with wavelength selection and input scaling applied. The proposed mod-VIP-MCUVE and MCUVE use fewer PLS components than the classical PLSR and VIP method in the fitting process. The classical PLSR suffers overfitting due to the higher number of PLS components used in the model. The VIP method has low accuracy since there are many variables removed in the computation (see Figure 7). Unlike in the MCUVE method, the proposed mod-VIP-MCUVE has better performance and more efficient computationally because of a lower number of PLS components (26 PLS) in the model. The proposed method succeeds in highlighting the most relevant wavelengths and downgrades the influence of irrelevant wavelengths. This result confirms the usefulness of the wavelengths selection and input scaling applied in the input variables which leads to faster convergence speed. removed many irrelevant wavelengths in the model. As assumed earlier, the more wavelengths are excluded in the model the lower the accuracy in the prediction result. In the fourth plot of mod-VIP-MCUVE, the green line shows the old cut-off threshold using previous MCUVE (threshold = 5.486), while the red line is the new proposed robust cut-off threshold (threshold = 2.916). The proposed mod-VIP-MCUVE method shows better cut-off threshold since there is only a smaller number of wavelengths indicated as the most irrelevant wavelengths. In Figure 7, information regarding the relevant wavelengths related to the %ODM is presented. All the variable selection methods selected the same spectral region, which has the most relevant contribution to the response variable. The methods show a different cut-off threshold to remove the irrelevant wavelengths in the regions. 
It can be observed that the VIP, MCUVE, and VIP-total removed many wavelengths from the model as irrelevant. As assumed earlier, the more wavelengths are excluded from the model, the lower the accuracy of the prediction result. In the fourth plot (mod-VIP-MCUVE), the green line shows the old cut-off threshold from the previous MCUVE (threshold = 5.486), while the red line is the newly proposed robust cut-off threshold (threshold = 2.916). The proposed mod-VIP-MCUVE method gives a better cut-off threshold because only a small number of wavelengths are flagged as the most irrelevant.
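The effect of the cut-off threshold discussed above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the exact robust reliability coefficient of mod-VIP-MCUVE is not given in this section, so the median/MAD form below is an assumption, and the coefficient values and the `scores` list are hypothetical. Only the two thresholds (5.486 and 2.916) are taken from the text.

```python
import statistics

def classical_reliability(coefs):
    """MCUVE-style reliability: mean / standard deviation of one variable's
    regression coefficients over the Monte Carlo runs."""
    return statistics.mean(coefs) / statistics.stdev(coefs)

def robust_reliability(coefs):
    """Assumed robust analogue (not taken from the paper): median / scaled MAD."""
    med = statistics.median(coefs)
    mad = statistics.median([abs(c - med) for c in coefs])
    if mad == 0:
        raise ValueError("MAD is zero; robust reliability is undefined")
    return med / (1.4826 * mad)

def retained(scores, threshold):
    """Indices of variables whose absolute score reaches the cut-off."""
    return [j for j, s in enumerate(scores) if abs(s) >= threshold]

# Hypothetical coefficients of one wavelength over six runs; the last run is an outlier.
coefs = [0.50, 0.52, 0.48, 0.51, 0.49, 5.00]
print(classical_reliability(coefs))  # pulled down by the single outlier
print(robust_reliability(coefs))     # largely unaffected by the outlier

# Hypothetical reliability scores for five wavelengths.
scores = [6.2, -3.1, 1.0, 4.0, -7.5]
kept_old = retained(scores, 5.486)   # old MCUVE cut-off quoted in the text
kept_new = retained(scores, 2.916)   # proposed robust cut-off quoted in the text
print(kept_old, kept_new)
```

With these hypothetical scores, the lower cut-off discards only the weakest wavelength, whereas the higher cut-off also discards wavelengths with moderate scores. This mirrors the behaviour described for the red and green lines in Figure 7.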
The diffuse selected reflectance [38] is important for identifying the relevant wavelengths related to the %ODM. It reveals their fundamental attribution to overtone or combination bands involving molecular stretching and bending absorption over a wide spectral range. The main absorption in the NIR spectral range is produced by combinations and overtones of the C-H, O-H, N-H, and C=O groups. The relevant wavelength ranges are marked with lowercase letters in the figure. Based on Figure 7, the well-defined absorption bands can be observed from the visible red color (a: 668-684 nm), CH 2

Oil to Wet Mesocarp
Using the same set of NIR spectral data from fresh mesocarp as in Section 5.1, the %OWM is used as the response variable. As indicated in Figure 6, the variability of the water content shifts the distribution of the response variable. The fitting performance of the prediction results under the different variable selection treatments is summarized in Table 3. The comparison shows that the proposed mod-VIP-MCUVE achieves better performance than the other methods. It improves the RMSE and R² relative to the VIP, MCUVE, and PLSR methods. Moreover, the number of optimal PLS components selected in the calibrated model of the proposed mod-VIP-MCUVE is the smallest (17 PLS), which indicates that the important information can still be retained even with fewer variables. This also demonstrates the need to perform wavelength selection before fitting the calibration model. In Figure 8, the cut-off thresholds of VIP and MCUVE remove many wavelengths, so much important information is lost. The proposed mod-VIP-MCUVE with the newly proposed cut-off (red line) removes only the most irrelevant variables and retains the rich information in the spectra.
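For reference, the two accuracy indices compared in this section can be computed as in the following minimal Python sketch. It assumes the standard definitions of RMSE and R²; the exact variants used for the tables (for example calibration versus prediction error) are not stated in this section, and the response values below are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between reference and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical reference values and model predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(rmse(y_true, y_pred), r_squared(y_true, y_pred))
```

A lower RMSE and an R² closer to 1 indicate a better fit, which is the sense in which the methods are ranked.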

Free Fatty Acids
Another NIR spectral dataset (Figure 5b), with a total of 839 observations and 500 wavelengths (range 500-2500 nm, 4 nm interval), was collected from dried ground mesocarp samples. The importance of the wavelengths related to the %FFA is unknown. The fitting performance of the calibration models with the different wavelength selection methods is summarized in Table 4. Similar to the previous results, the proposed mod-VIP-MCUVE is superior to the classical VIP and the classical PLSR. The performance of the MCUVE is comparable to that of the mod-VIP-MCUVE. However, the proposed mod-VIP-MCUVE uses fewer PLS components (25 PLS) than the MCUVE, the classical PLSR, and the VIP method in the fitting process, which demonstrates its computational efficiency. The classical PLSR still suffers from overfitting due to the higher number of PLS components used in the model. The VIP method also has low accuracy because many variables are removed in the computation (see Figure 9). The MCUVE performs better than the classical PLSR and VIP; however, it is a non-robust method. Again, the proposed mod-VIP-MCUVE method succeeds in highlighting the most relevant wavelengths and downweighting the influence of irrelevant wavelengths. These results confirm the usefulness of the wavelength selection method within the input scaling strategy for improving the classical PLSR model.
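The idea of highlighting relevant wavelengths and downweighting irrelevant ones through input scaling can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' procedure: each column is autoscaled and then multiplied by a per-wavelength weight (a diagonal scale matrix). The `weights` values are hypothetical, whereas in the proposed method the scale matrix is derived from the modified VIP and modified MCUVE.

```python
import statistics

def autoscale(column):
    """Mean-center a column and divide it by its sample standard deviation."""
    mu = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(x - mu) / sd for x in column]

def scale_inputs(X, weights):
    """Autoscale each column of X (a list of rows) and multiply column j by
    weights[j], i.e. X_scaled = X_auto * diag(weights)."""
    columns = [autoscale(list(col)) for col in zip(*X)]
    weighted = [[w * v for v in col] for col, w in zip(columns, weights)]
    return [list(row) for row in zip(*weighted)]

# Hypothetical data: three observations, three wavelengths.
X = [[1.0, 10.0, 5.0],
     [2.0, 30.0, 4.0],
     [3.0, 20.0, 6.0]]
# Hypothetical weights: keep wavelength 0, suppress wavelength 1, halve wavelength 2.
weights = [1.0, 0.0, 0.5]
X_scaled = scale_inputs(X, weights)
print(X_scaled)
```

A weight of zero removes a wavelength from the subsequent regression entirely, while intermediate weights reduce its influence without discarding it.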

As shown in Figure 9, the cut-off thresholds of the mod-VIP-MCUVE and MCUVE succeed in removing only the most irrelevant wavelengths while keeping the remaining relevant variables in the model. The VIP and the VIP-total remove many variables from the model, so the less relevant variables are also included in the removal process. Hence, the less important variables are reduced in the fitting process. In the fourth plot (mod-VIP-MCUVE), the old cut-off threshold from the previous MCUVE, shown as the green line, is 8.071, while the newly proposed cut-off threshold, shown as the red line, is 3.039. Based on these thresholds, the number of irrelevant variables indicated by the mod-VIP-MCUVE is still smaller than that indicated by the MCUVE.
The fundamental attribute of diffuse selected reflectance related to the %FFA is crucial to investigate. The mod-VIP-MCUVE method is used to interpret the selected wavelengths, which have well-defined absorption in both the visible and NIR regions. In Figure 9, the most relevant wavelengths are C=O stretch fourth overtone and C-H second overtone

Figure 9. Comparison of selected wavelengths from different wavelength selection methods using spectral data of ground dried mesocarp on the %FFA.

Conclusions
In summary, the NIR spectral region contains rich and abundant information that warrants further interpretation using advanced chemometric techniques. The study has shown that wavelength selection using the input scaling strategy is promising, particularly for high-dimensional datasets such as NIRS spectral data. The proposed method is robust because it uses a robust reliability weight procedure to rescale the original input matrix. According to the evaluation indices, the proposed mod-VIP-MCUVE method is superior to the classical PLSR, the VIP method, and the MCUVE method. Using the modified robust cut-off threshold, the proposed method succeeds in identifying the irrelevant wavelengths in the model. Moreover, the proposed method reduces the data dimension and improves the model accuracy while lowering the computational complexity. The proposed method has also successfully investigated the fundamental attribute of diffuse selected reflectance of the NIRS spectral absorption, which is essential for improving chemical interpretability. Further, applying the input scaling procedure with a robust selection procedure for the optimum number of PLS components is expected to reduce the computational complexity further.