Abstract
Multicollinearity often occurs when two or more predictor variables are correlated, especially for high dimensional data (HDD) where the number of variables exceeds the number of observations ($p > n$). The statistically inspired modification of partial least squares (SIMPLS) is a very popular technique for solving a partial least squares regression problem due to its efficiency, speed, and ease of understanding. SIMPLS operates on the empirical covariance matrix of the explanatory and response variables. Nevertheless, SIMPLS is easily affected by outliers. In order to rectify this problem, a robust iteratively reweighted SIMPLS (RWSIMPLS) was introduced. Nonetheless, it is still not very efficient, as the RWSIMPLS algorithm is based on a weighting function that does not specify any method for identifying high leverage points (HLPs), i.e., outlying observations in the X-direction. HLPs have the most detrimental effect on the computed values of various estimates, which results in misleading conclusions about the fitted regression model. Hence, their effects need to be reduced by assigning smaller weights to them. As a solution to this problem, we propose an improved SIMPLS based on a new weight function obtained from the MRCD-PCA diagnostic method for identifying HLPs in HDD and name this method MRCD-PCA-RWSIMPLS. A new MRCD-PCA-RWSIMPLS diagnostic plot is also established for classifying observations into four types of data points, i.e., regular observations, vertical outliers, and good and bad leverage points. Numerical examples and Monte Carlo simulations show that MRCD-PCA-RWSIMPLS offers substantial improvements over SIMPLS and RWSIMPLS. The proposed diagnostic plot is able to classify observations into the correct groups. In contrast, the SIMPLS and RWSIMPLS plots fail to classify observations correctly and show masking and swamping effects.
1. Introduction
Multicollinearity occurs when independent variables in the regression model are highly correlated with each other. As a result of multicollinearity, the standard errors of the ordinary least squares (OLS) estimates become inflated, which weakens the statistical power of the regression model, especially when confronting high dimensional data (HDD). Hence, one may be unable to declare an individual independent variable significant even if it has a strong relationship with the response variable. Moreover, an independent variable's coefficient may change sign, which makes it more difficult to specify the correct model. Additionally, multicollinearity can cause the correlation matrix of the X variables to become singular; thus, OLS fails to produce estimates, and no predictions of the y-variable can be made. These problems become severe when dealing with high dimensional data where $p > n$. The rapid advancement of computer technology and statistics allows for the generation of data on thousands of variables. Classical multivariate statistical methods are no longer capable of effectively analyzing such high-dimensional data. As a result, Partial Least Squares (PLS) was developed to address this issue, and its applications have gained popularity in a wide variety of natural science fields. For example, PLS is used to quantify genes in genomic and proteomic data. In neuroinformatics, PLS is used to investigate brain function. It is also used in computer science for image recognition and in the engineering industry for process control. PLS was originally designed to solve a problem in chemometrics, but it has since become a standard tool in spectroscopy, magnetic resonance, atomic emission, and chromatography experiments.
Partial Least Squares Regression (PLSR) was introduced with the main aim of building a regression model for multicollinear and high dimensional data with one or several response variables []. This method is very popular in chemometrics [], genomic and proteomic analysis [], and neural network studies [] and is widely applied in engineering. When dealing with one response or several responses, partial least squares regression is commonly referred to as PLS1 or PLS2, respectively. In general, the PLSR method forms a relationship between dependent and independent variables by constructing new explanatory variables called PLS factors, latent variables, or PLS components. The PLS components describe the maximum correlation between predictors and response variables. The concept of the PLS method is similar to that of Principal Component Regression (PCR). As in PCR, a set of uncorrelated components is formed to make a prediction. The major difference between the two procedures is that, in PLS, the components are chosen based on the maximum covariance between the $X$ and $y$ variables, while in PCR the components are chosen based on the maximum variance of the $X$ variables alone.
There are numerous PLS approaches in the literature, such as NIPALS, which was introduced by Wold around 1975 []. NIPALS requires deflation of $X$ and $Y$ in order to compute the components, which is very time consuming for a huge dataset []. Lindgren et al. [] proposed the Kernel Algorithm, whose result is identical to NIPALS, but in the Kernel Algorithm deflation is carried out only for the $X$ matrix in the next computation of components. The latter method also requires much time since the procedure is based on eigen decomposition. The SIMPLS technique was introduced by de Jong []. In SIMPLS, no deflation of the centered data matrices $X$ and $Y$ is made; instead, deflation is carried out on the cross-product (covariance) matrix $S = X^\top Y$. Höskuldsson [] suggested a method similar to the Kernel Algorithm named the Eigenvector algorithm. The idea is to compute the eigenvectors corresponding to the $A$ largest eigenvalues, where $A$ is the desired number of PLS components. Orthogonal Projections to Latent Structures (O-PLS) was introduced by Trygg et al. [] in order to maximize both the correlation and the covariance between the x and y scores so as to achieve both good prediction and interpretation.
However, it is now evident that PLS is not robust in the sense that its estimates are easily affected by outliers. In regression, outliers can be categorized into vertical outliers, residual outliers, and high leverage points (HLPs), also simply called leverage points []. Vertical outliers are outlying observations in the $y$-direction, while residual outliers are observations that have large residuals. HLPs are outlying observations in the $X$-space. They have the most detrimental effect on the computed values of various estimates, which results in incorrect statistical model development and adversely impacts decision making. Therefore, accurate detection of HLPs using reliable methods is vital. In general, classical approaches typically fail to correctly identify outliers in a dataset. As a result, the estimated model will be affected by masking and swamping problems. 'Masking' refers to a situation where outliers are incorrectly declared as inliers, while 'swamping' refers to normal observations being incorrectly declared as outliers.
Since outliers have an undue effect on the computed values of various classical estimates, robust methods that are not easily affected by outliers have been put forward. Robustness refers to the ability of a method to remain unaffected, or less affected, by outliers. Several robust alternatives to conventional or classical PLS have been suggested in the literature. The key objective of robust PLS is to discover contaminated data and reduce their effects by assigning smaller weights to them when estimating a regression model. There are two approaches to robust PLS: down-weighting outliers and robustifying the covariance matrix. For instance, Wakeling et al. [] used nonrobust initial weights to replace the univariate OLS regressions. Cummins et al. [] proposed iteratively reweighted algorithms, but the weights were not resistant to high leverage points. For the second approach, Gil et al. [] proposed a method based on constructing robust covariance matrices using the Stahel–Donoho estimator. However, this method cannot be applied to high-dimensional datasets since it uses a resampling technique that draws subsets of size p + 2. Hubert and Branden [] developed a robust SIMPLS by constructing a robust covariance matrix using the minimum covariance determinant (MCD) and the idea of projection pursuit, which can be performed on both low and high dimensional data. Then, in 2005, Serneels et al. [] proposed partial robust M-regression (PRM) in order to reduce the effect of outliers in both the residual space and the space of the predictor variables. In 2017, Alin and Agostinelli [] developed a robust iteratively reweighted SIMPLS (RWSIMPLS) by employing the weight function of Markatou et al. []. The weakness of this weight function is that there is no clear discussion of how the effects of outliers or HLPs are reduced, since no method for the identification of outliers or HLPs is employed. It is therefore possible to incorrectly detect and down-weight non-outlying observations.
The work of [] has motivated us to improve their method, i.e., the robust weighted SIMPLS, by basing it on a new robust weighting function. The weight is formulated based on the minimum regularized covariance determinant and principal component analysis (MRCD-PCA) algorithm for the identification of HLPs in HDD. MRCD-PCA is an extension of MRCD [] that incorporates the PCA method into the MRCD algorithm, and it is used for the detection of high leverage points. A simulation study and two real datasets are used to assess the performance of our proposed method compared to SIMPLS and RWSIMPLS. All analyses were performed using the R program [].
The objectives of this study are as follows: (1) to develop a new robust weighted SIMPLS by incorporating a new weight function based on our newly proposed MRCD-PCA algorithm for identifying HLPs; (2) to establish a new algorithm for classifying observations into four categories, namely regular or good observations, good leverage points (outlying observations in the $X$-space only), vertical outliers (outlying observations in the $y$-space), and bad leverage points (outlying observations in both the $X$-space and the $y$-space); (3) to determine the optimum number of PLS components, denoted as $A$, where $A$ is determined based on the proposed robust root mean squared error of prediction, i.e., R-RMSEP; and (4) to evaluate the performance of the proposed method compared to the existing methods, namely SIMPLS and RWSIMPLS. This paper focuses only on PLS1 regression and SIMPLS.
In the next section, we describe the classical SIMPLS algorithm and some existing robust algorithms. In Section 3, we present our proposed MRCD-PCA-RWSIMPLS. In Section 4, we demonstrate the performance of our proposed technique on some simulated data and two real datasets. We compare our method with conventional SIMPLS and the RWSIMPLS methods. We also illustrate our proposed robust diagnostic plots and compute the influence function of real datasets.
2. The SIMPLS Algorithm
The SIMPLS algorithm maximizes the covariance between the $X$- and $Y$-scores, $\operatorname{cov}(t_a, u_a)$, under orthogonality constraints on the score vectors and unit-length constraints on the weight vectors. The first SIMPLS component is identical to the NIPALS or Kernel algorithm result, but the following components are different. In the SIMPLS approach, there is no deflation of the mean-centered data matrices $X$ and $Y$; instead, deflation is carried out on the covariance or cross-product matrix of $X$ and $Y$, given as $S = X^\top Y$. Thus, the computation of the SIMPLS algorithm is much simpler and less complicated than that of other PLS algorithms.
2.1. Classical SIMPLS
The multiple linear regression or multivariate regression model can be written as follows:
$$Y = X B + E, \quad (1)$$
where $X$ is an $n \times p$ matrix containing the values of the $p$ explanatory variables at $n$ points, $B$ is a $p \times m$ matrix of regression coefficients (a $p$-dimensional column vector when there is a single response), $E$ is an $n \times m$ matrix of random errors, and the response $Y$ is an $n \times m$ matrix, where $m$ is the dimension of the response variables at the $n$ observations.
In the SIMPLS method, the $X$ and $Y$ variables are modeled by linear latent variables through the regression models in (2) and (3):
$$X = T P^\top + E_X, \quad (2)$$
$$Y = U Q^\top + E_Y, \quad (3)$$
where $X$ is an $n \times p$ matrix of predictors, and $Y$ is an $n \times m$ matrix of response variables. $E_X$ and $E_Y$ are the error terms, assumed to be independent and identically distributed random normal variables, for $X$ and $Y$, respectively. $T$ and $U$ are the score matrices or latent variables of $X$ and $Y$, respectively, and they are formulated as follows:
$$T = X W, \quad (4)$$
$$U = Y V, \quad (5)$$
for $a = 1, \dots, A$. The $p \times A$ matrix $W$ and the $m \times A$ matrix $V$ are the weight matrices, which can be computed by singular value decomposition (SVD). The optimal $x$-weight vector, $w_a$, is the first eigenvector of the eigenvalue problem $S S^\top w = \lambda w$, and the optimal $y$-weight vector, $v_a$, is proportional to $S^\top w_a$; that is, $w_a$ and $v_a$ are the first left and right singular vectors of the cross-product matrix $S = X^\top Y$. $A$ is the number of PLS components, which is less than or equal to the rank of $X$. $P$ ($p \times A$) and $Q$ ($m \times A$) are the orthogonal loading matrices for $X$ and $Y$, respectively.
The fundamental goal of SIMPLS is to maximize the covariance between $t$ (the $x$-scores) and $u$ (the $y$-scores). In conventional SIMPLS, the covariance between the vectors $t$ and $u$ is estimated by the sample covariance $\operatorname{cov}(t, u) = t^\top u / (n - 1)$. A constraint on the weight vectors, namely that they have unit length, is needed to solve the maximization problem. The maximization under the unit-length constraint is given in Equation (6),
$$(w_a, v_a) = \underset{\|w\| = \|v\| = 1}{\arg\max}\; \operatorname{cov}(Xw, Yv). \quad (6)$$
The first score vectors $t_1$ and $u_1$ are the solutions to the maximization problem for the $X$- and $Y$-spaces, respectively. The subsequent score vectors maximize the same criterion as in Equation (6) with the additional constraint that they are orthogonal to the previous score vectors, i.e., $t_a^\top t_b = 0$ and $u_a^\top u_b = 0$ for $b < a$. Instead of deflating the original $X$ and $Y$ matrices, orthogonality is obtained by finding the subsequent score vectors on residual matrices. Different PLS algorithms provide different ways of calculating these residual matrices.
Below is the algorithm of the classical SIMPLS approach [].
Step 1: Compute the centered data matrices, $\tilde{X}$ and $\tilde{Y}$, by subtracting the mean of each column.
Step 2: Based on the centered matrices $\tilde{X}$ and $\tilde{Y}$ in Step 1, determine the initial empirical covariance matrix $S_0 = \tilde{X}^\top \tilde{Y}$.
Step 3: Iterate Steps 3 to 9 for $a = 1, 2, \dots, A$.
If $a = 1$, set $S_a = S_0$.
If $a > 1$, set $S_a = S_0 - P_{a-1}\left(P_{a-1}^\top P_{a-1}\right)^{-1} P_{a-1}^\top S_0$, where $P_{a-1} = [p_1, \dots, p_{a-1}]$.
This step makes the new score orthogonal to all previous score vectors, as the search is performed in the orthogonal complement of the column space of $P_{a-1}$.
Step 4: Perform an SVD on $S_a$ and obtain the first left singular vector $w_a$ of $S_a$, where $w_a$ is the dominant eigenvector of $S_a S_a^\top$.
Step 5: $w_a$ is normalized to have a Euclidean length of 1.
Step 6: The score vector is obtained by projecting the matrix $\tilde{X}$ onto the optimal direction $w_a$, and it is calculated as $t_a = \tilde{X} w_a$.
Step 7: Then, $t_a$ is normalized to have a Euclidean length of 1.
Step 8: The loading vector of $X$, $p_a$, is determined by regressing the matrix $\tilde{X}$ on $t_a$, i.e., $p_a = \tilde{X}^\top t_a / (t_a^\top t_a)$.
Step 9: Next, the loading vector of $Y$, $q_a$, is calculated as $q_a = \tilde{Y}^\top t_a / (t_a^\top t_a)$.
Note that Steps 3 to 9 are iteratively repeated for all extracted components. Then, the model can be estimated by computing the final estimated regression coefficients from the algorithm above. The parameter estimate is calculated as described in Equation (14):
$$\hat{\beta}_{\text{SIMPLS}} = W \left(P^\top W\right)^{-1} q, \quad (14)$$
where $W = [w_1, \dots, w_A]$ is the matrix of weight vectors, $P = [p_1, \dots, p_A]$ is the matrix of $x$-loading vectors, and $q = (q_1, \dots, q_A)^\top$ is the vector of $y$-loadings. Conventional OLS is applied in Steps 8 and 9. As already mentioned, conventional methods such as OLS are not robust, i.e., they are easily affected by outlying observations, particularly high leverage points. Consequently, the empirical covariance matrix is not reliable, as it is strongly affected by outlying points. Hence, the SIMPLS algorithm produces non-robust estimates, resulting in inefficient estimates and misleading conclusions about the fitted regression model. These drawbacks have inspired researchers to develop many robust SIMPLS algorithms.
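To make the steps above concrete, the following is a minimal R sketch of the classical SIMPLS recursion for a single response (PLS1). The function name simpls_fit and the internal variable names are illustrative, not from the original paper; the coefficient formula follows the Equation (14)-style expression given above.

```r
# Minimal classical SIMPLS sketch for a single response (PLS1).
simpls_fit <- function(X, y, A = 2) {
  Xc <- scale(X, center = TRUE, scale = FALSE)   # Step 1: column-center X
  yc <- scale(y, center = TRUE, scale = FALSE)   # Step 1: column-center y
  S0 <- crossprod(Xc, yc)                        # Step 2: S0 = X'y
  p  <- ncol(Xc)
  W <- matrix(0, p, A); P <- matrix(0, p, A); Q <- numeric(A)
  S <- S0
  for (a in 1:A) {
    if (a > 1) {                                 # Step 3: project S0 onto the orthogonal
      Pa <- P[, 1:(a - 1), drop = FALSE]         # complement of the previous x-loadings
      S  <- S0 - Pa %*% solve(crossprod(Pa), crossprod(Pa, S0))
    }
    w <- svd(S)$u[, 1]                           # Steps 4-5: dominant (unit-length) left singular vector
    t <- Xc %*% w                                # Step 6: score vector t = Xc w
    t <- t / sqrt(sum(t^2))                      # Step 7: normalize t to length 1
    P[, a] <- crossprod(Xc, t)                   # Step 8: x-loading p = Xc' t (since t't = 1)
    Q[a]   <- drop(crossprod(yc, t))             # Step 9: y-loading q = yc' t
    W[, a] <- w
  }
  beta <- W %*% solve(crossprod(P, W), Q)        # Eq. (14)-style coefficients on centered X
  list(coefficients = beta, weights = W, loadings = P, yloadings = Q)
}
```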
2.2. Robust Iteratively Reweighted SIMPLS (RWSIMPLS)
Alin and Agostinelli [] proposed the robust iteratively reweighted SIMPLS, an improvement of the SIMPLS algorithm in which the weight function suggested by [], given in Equation (15), is integrated into the SIMPLS algorithm in order to reduce the influence of outlying observations in the dataset:
where $r_i$ is the residual of the $i$th estimate. This weight function depends on the residual, the chosen model distribution density, and the sample empirical distribution. For the RWSIMPLS procedure, the components are extracted by deflating a weighted covariance matrix as described in Equation (16):
$$S^{w} = \tilde{X}_w^\top \tilde{Y}_w, \quad (16)$$
where $\tilde{X}_w$ and $\tilde{Y}_w$ are the weighted matrices obtained by multiplying each row of the centered data by the square root of its weight. The weight of observation $i$ combines the weights $w_{ij}$ of the residuals corresponding to the $m$ response variables, for $i = 1, \dots, n$ observations.
The algorithm of RWSIMPLS can be summarized as follows.
Step 1: Center each of the observations in the $X$ ($n \times p$) and $Y$ ($n \times m$) matrices by subtracting the median of each column, where $p$ is the dimension of the $X$ variables and $m$ is the dimension of $Y$.
Step 2: Scale the response variable observations in each column by using the median absolute deviation (MAD).
Step 3: Compute the weights for each observation by using the formulas in Equations (13) and (15), using the scaled response in place of the residual.
Step 4: Determine the weighted predictor and response variables by multiplying each row by the square root of its weight.
Step 5: Take a small sample out of $X$ and $Y$ without replacement. Then, perform the classical SIMPLS on these selected data and use the result to determine the initial coefficient estimates, where $n_s$ denotes the selected sample size.
Step 6: Calculate the residuals $r_{ij} = y_{ij} - \hat{y}_{ij}$, where $\hat{y}_{ij}$ is the fitted value, for $i = 1, \dots, n$ and $j = 1, \dots, m$.
Step 7: Scale and center the residuals, $r_{ij}$, from Step 6:
with $c = 1.4826$. Maronna et al. [] suggested the use of 1.4826 to make the MAD comparable to the standard deviation for normally distributed data.
Step 8: Compute the weights for each residual, and then reweight each of the observations in $X$ and $Y$.
Step 9: Compute the reweighted covariance matrix $S^{w} = \tilde{X}_w^\top \tilde{Y}_w$.
Step 10: The classical SIMPLS is applied on $S^{w}$ to estimate the new coefficients.
Step 11: Repeat Steps 6–10 until convergence, where convergence is achieved when the maximum absolute deviation between two consecutive coefficient estimates is less than a specified tolerance.
Step 12: Steps 5–11 are repeated c times, say c = 5 times. The final estimate $\hat{\beta}$ is the coefficient estimate with the smallest of the five absolute deviations.
Although RWSIMPLS has been shown to be more efficient than SIMPLS, it is suspected that it is still not efficient enough because of the weighting function in Equation (15) used in the RWSIMPLS procedure. The main aim of using the weighting function in the RWSIMPLS algorithm is to reduce the effect of outliers, especially HLPs. However, no method for the identification of HLPs is considered when defining the weight function. Instead, [] employed the weight function in Equation (15), which is based on the residual, model density, and distribution function, but not on any method for the identification of HLPs. As a result, there is no guarantee that the proposed weight will lessen the impact of true HLPs on the parameter estimates, and this may produce misleading conclusions. This limitation prompted us to propose a new version of RWSIMPLS based on the MRCD-PCA method for identifying HLPs in HDD, denoted MRCD-PCA-RWSIMPLS. We expect this method to be more efficient than SIMPLS and RWSIMPLS.
3. The Proposed MRCD-PCA-RWSIMPLS Algorithm
In this section, we introduce our new proposed MRCD-PCA-RWSIMPLS procedure. We first describe how the weighting function is formulated. Then, the full algorithm of the proposed method is illustrated.
3.1. Formulating the Weighting Function
The diagnostic robust Mahalanobis distance based on the combination of minimum regularized covariance determinant and principal component analysis (RMD(MRCD-PCA)) method is expected to be very successful in the detection of multiple HLPs in high dimensional data with sparse structures.
The method consists of three steps. In the first step, the dimension of the high dimensional sparse dataset is reduced by using the PCA method; the idea of exploiting PCA is to replace the original matrix by the eigenvectors corresponding to its k largest eigenvalues. In the second step, we generate a fitted matrix in the original dimension based on the chosen number of principal components, selected such that their cumulative variance is at least 80%. In the third step, the fitted matrix is shrunk so that an invertible covariance matrix for HDD is obtained: the MRCD [] is performed on this fitted matrix to determine the robust mean and robust covariance of the HDD. Using these robust estimators, the distance of each observation is computed by employing the Robust Mahalanobis Distance (RMD) based on MRCD-PCA.
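A compact illustration of this three-step RMD(MRCD-PCA) computation is sketched below in R. It assumes the CovMrcd() estimator from the rrcov package as the MRCD implementation and uses the 80% cumulative-variance rule from the text; the function name mrcd_pca_rmd is hypothetical.

```r
library(rrcov)

mrcd_pca_rmd <- function(X, var_explained = 0.80) {
  # Step 1: reduce the dimension of X with PCA
  pc  <- prcomp(X, center = TRUE, scale. = FALSE)
  cum <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
  k   <- which(cum >= var_explained)[1]          # smallest k explaining >= 80% variance
  # Step 2: fitted matrix in the original dimension from the first k components
  X_fit <- pc$x[, 1:k, drop = FALSE] %*% t(pc$rotation[, 1:k, drop = FALSE])
  X_fit <- sweep(X_fit, 2, pc$center, "+")
  # Step 3: MRCD location and scatter on the fitted matrix, then robust distances
  est <- CovMrcd(X_fit)
  sqrt(mahalanobis(X_fit, center = getCenter(est), cov = getCov(est)))
}
```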
The advantage of this method is that the data are transformed based on the orthogonal PCA components, which can reduce noise and solve the multicollinearity issue without losing much information. We obtain the following estimates:
MRCD-PCA location estimates:
MRCD-PCA scatter estimates:
where the first term is a location estimate based on the subset $H$, $D$ is a diagonal matrix of scales for the independent variables, $\bar{x}$ is the vector of means of the independent variables, $\rho$ is a regularization parameter, the eigenvectors and eigenvalues are those of the identity target matrix $I$, $H$ is the subset of data points with the smallest covariance determinant, and $S(H)$ is the sample covariance of the subset $H$.
Following [], after obtaining the robust multivariate location and scatter estimates given by MRCD, we compute the Robust Mahalanobis Distance (RMD) as follows:
$$\mathrm{RMD}_i = \sqrt{(x_i - \hat{\mu}_{\text{MRCD-PCA}})^\top \hat{\Sigma}_{\text{MRCD-PCA}}^{-1} (x_i - \hat{\mu}_{\text{MRCD-PCA}})}, \quad (29)$$
where $\hat{\mu}_{\text{MRCD-PCA}}$ and $\hat{\Sigma}_{\text{MRCD-PCA}}$ are the robust mean and robust covariance matrix of MRCD-PCA, respectively. Since the distribution of $\mathrm{RMD}_i$ is intractable, as per [], we use a confidence bound type cut-off point as follows.
Any observation whose $\mathrm{RMD}_i$ exceeds this cut-off point is considered an HLP. Following [,], a weight function is formulated based on the diagnostic method for the detection of outliers, with the main aim of reducing their effect on the parameter estimates. Thus, we formulate a new weight function based on $\mathrm{RMD}_i$ as follows:
where a weight smaller than 1, decreasing with the distance, is given to leverage points, and a weight equal to 1 is given to regular observations. Subsequently, this weight is integrated in the establishment of the MRCD-PCA-RWSIMPLS algorithm.
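Since the exact down-weight for flagged points is not reproduced above, the sketch below assumes the common choice of shrinking the weight in proportion to the cut-off over the distance; treat this form as illustrative only.

```r
# Weight function in the spirit of Equation (31): weight 1 for regular observations,
# a smaller weight (assumed here to be cutoff/RMD) for declared HLPs.
hlp_weights <- function(rmd, cutoff) {
  w <- rep(1, length(rmd))             # regular observations keep full weight
  flagged <- rmd > cutoff              # observations exceeding the cut-off are HLPs
  w[flagged] <- cutoff / rmd[flagged]  # the farther the point, the smaller the weight
  w
}
```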
In our proposed algorithm, we scale the dataset by using the robust weighted mean and the robust weighted standard deviation. The empirical covariance in our algorithm is computed from the robustly scaled, weighted dataset, $\tilde{X}_w$ and $\tilde{Y}_w$. Then, we extract the PLS components by deflating the robust scaled weighted covariance matrix, $S^{w}$.
3.2. The Proposed Algorithm of MRCD-PCA-RWSIMPLS
Step 1: Calculate the weights as in Equation (31) and store them in a diagonal matrix.
Step 2: Compute the robust weighted mean and robust weighted standard deviation for each column of the $X$ matrix and each column of the $Y$ matrix:
where $\bar{x}_j^{w}$ is the robust weighted mean of column $x_j$, $s_{x_j}^{w}$ is the robust weighted standard deviation of column $x_j$, $\bar{y}_k^{w}$ is the robust weighted mean of column $y_k$, and $s_{y_k}^{w}$ is the robust weighted standard deviation of column $y_k$.
Step 3: Center and scale each of the data matrices, $X$ and $Y$, by using the robust weighted means and robust weighted standard deviations from Step 2.
Step 4: Determine the robust weighted covariance, $S^{w}$, based on the weighted scaled observations in Step 3.
Step 5: Perform ordinary SIMPLS as in Section 2.1 on the robustly scaled, weighted dataset, $\tilde{X}_w$ and $\tilde{Y}_w$, by using the robust weighted scaled covariance, $S^{w}$, from Step 4 to determine the estimated coefficients.
Step 6: Note that Steps 4 to 5 are repeated for all $a = 1, \dots, A$, where $A$ is the optimal number of components selected based on the robust root mean squared error of prediction (R-RMSEP). We call this algorithm a robust weighted SIMPLS based on MRCD-PCA (MRCD-PCA-RWSIMPLS).
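Putting the pieces together, a rough sketch of Steps 1–5 is given below, reusing the hypothetical helpers mrcd_pca_rmd(), hlp_weights(), and simpls_fit() sketched earlier; the cut-off value is supplied by the caller rather than derived from Equation (30).

```r
mrcd_pca_rwsimpls <- function(X, y, A = 2, cutoff) {
  rmd <- mrcd_pca_rmd(X)                         # robust distances for X
  w   <- hlp_weights(rmd, cutoff)                # Step 1: weights from the RMDs
  # Step 2: robust weighted means and standard deviations, column by column
  wmean <- function(v) sum(w * v) / sum(w)
  wsd   <- function(v) sqrt(sum(w * (v - wmean(v))^2) / sum(w))
  mx <- apply(X, 2, wmean); sx <- apply(X, 2, wsd)
  my <- wmean(y);           sy <- wsd(y)
  # Step 3: center and scale with the robust weighted estimates
  Xs <- sweep(sweep(X, 2, mx, "-"), 2, sx, "/")
  ys <- (y - my) / sy
  # Steps 4-5: weighted data (rows times sqrt of weights), then ordinary SIMPLS,
  # whose cross-product of the weighted data plays the role of S^w
  Xw <- sqrt(w) * Xs
  yw <- sqrt(w) * ys
  simpls_fit(Xw, yw, A = A)
}
```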
3.3. The Proposed Diagnostic Plots Based on MRCD-PCA-RWSIMPLS
A diagnostic plot is very useful for practitioners to quickly capture abnormalities in data. Rousseeuw and van Zomeren [] established a diagnostic plot to classify observations into four types of data points, namely regular observations, good leverage points, vertical outliers, and bad leverage points. Their plot displays the standardized residuals versus the robust Mahalanobis distances based on the minimum volume ellipsoid (MVE) or the minimum covariance determinant (MCD). In PLS regression analysis, the orthogonal, score, and residual distances can be used to measure the degree of outlyingness of an observation []. Alin and Agostinelli [] proposed a diagnostic plot that classifies observations into four types of data points by plotting the RWSIMPLS standardized residuals versus the leverage values, i.e., the diagonal elements of the hat matrix used to identify HLPs. Any observation whose leverage value is larger than the cut-off point, 2p/n, is considered an HLP. The disadvantage of this plot is that it employs leverage values, which have been shown to be not very successful in detecting multiple HLPs due to the masking effect []. Hence, using leverage values as a diagnostic measure for identifying HLPs produces unsatisfactory results. Thus, we propose a new diagnostic plot that plots the MRCD-PCA-RWSIMPLS standardized residuals against the robust Mahalanobis distances in Equation (29), with the cut-off point in Equation (30).
The robust standardized residual-score distance plot is a scatter plot of the robust standardized residuals versus the robust diagnostic measure for identifying HLPs. The vertical axis of the graph is the robust standardized residual, and the horizontal axis is the robust score distance, as described in Equation (29). The robust standardized residual and robust score distance threshold lines are also shown on the plot to indicate the existence and the types of outliers. Based on the robust standardized residual-score distance plot, we can categorize observations into four categories:
- Observations in the lower left corner are "regular observations".
- Observations in the upper left corner have small robust score distances but large robust standardized residuals. These points are known as "vertical outliers".
- Observations in the lower right corner are "good leverage points". These points are far from the bulk of the data in the X-space but follow the direction of the fitted model; they have a large robust score distance with a small robust standardized residual.
- Observations in the upper right corner are "bad leverage points", which have large robust standardized residuals and large robust score distances.

It is important to highlight the difference between the Rousseeuw and van Zomeren [] plot and our proposed diagnostic plot. The diagnostic plot proposed by Rousseeuw and van Zomeren [] places the standardized least trimmed squares residuals on the Y-axis, while the robust Mahalanobis distance based on the minimum volume ellipsoid (MVE) or the minimum covariance determinant (MCD) is plotted on the X-axis. This plot can only be used for low dimensional data. Our proposed MRCD-PCA-RWSIMPLS diagnostic plot plots the MRCD-PCA-RWSIMPLS standardized residuals against the robust score distance based on the MRCD-PCA diagnostic measure for identifying HLPs in HDD. Our proposed plot can be applied to both low and high dimensional data.
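A minimal sketch of such a plot in R is shown below; the ±2.5 band for the standardized residuals and the median/MAD standardization are conventional choices and not necessarily those used in the paper.

```r
plot_diagnostics <- function(rmd, resid, cutoff, res_cut = 2.5) {
  std_res <- (resid - median(resid)) / mad(resid)   # robustly standardized residuals
  plot(rmd, std_res,
       xlab = "Robust score distance (MRCD-PCA)",
       ylab = "Standardized residual",
       main = "MRCD-PCA-RWSIMPLS diagnostic plot")
  abline(v = cutoff, lty = 2)                       # HLP threshold (Equation (30))
  abline(h = c(-res_cut, res_cut), lty = 2)         # vertical-outlier thresholds
  text(rmd, std_res, labels = seq_along(rmd), pos = 3, cex = 0.6)
}
```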
4. Results and Discussion
This section reports an extensive Monte Carlo simulation study and the analysis of two real datasets in order to assess the performance of our proposed MRCD-PCA-RWSIMPLS compared to SIMPLS and RWSIMPLS methods. All computations were performed using R software.
4.1. Simulation Study
The simulation study was conducted following []'s simulation design. The observations for the response variable are obtained from $y = X\beta + e$. The true parameter vector $\beta$ is drawn from a normal distribution with a mean of 0 and a standard deviation of 0.001. The explanatory variable matrix is constructed as $X = T P^\top$, so that the rank of $X$ equals the number of components $k$, which is the number of columns of $T$; $T$ is an $n \times k$ matrix, and $P$ is a $p \times k$ matrix. The columns of $T$ and $P$ are generated from the standard normal distribution, where $k$ is the number of components and $e$ is drawn from N(0, 1).
In this study, we consider an asymmetric contamination scenario for two cases, using a Student's t-distribution with 2 degrees of freedom and a Student's t-distribution with 5 degrees of freedom. In the first case, we contaminate the data in the X direction by generating three different proportions of outliers, namely 10%, 20%, and 30%, under the corresponding contamination models for the Student's t-distributions with 2 and 5 degrees of freedom. In the second case, we contaminate the data in both the X and y directions by generating the same percentages of outliers as in the first case, with the contamination models defined analogously for the two error distributions. In order to investigate the effect of different sample sizes and different numbers of predictors on the MRCD-PCA-RWSIMPLS estimates, four sample sizes (30, 60, 100, and 200) and four numbers of predictors (100, 400, 800, and 1000) were considered, and under each setting we repeated the simulation 1000 times, i.e., S = 1000. Following [], the optimal number of components, A, is fixed at two for each setting. As per Alin and Agostinelli [], the mean squared error (MSE) of $\hat{\beta}$ is used as a criterion to evaluate the performance of our proposed MRCD-PCA-RWSIMPLS compared to the existing SIMPLS and RWSIMPLS estimators. The MSE can easily be decomposed into two components: one measures variability (precision), and the other measures bias (accuracy). A good estimator is one that has the least MSE. The results are presented in Table 1 and Table 2.
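The following R sketch illustrates the clean part of this design (X = TP' with k = 2 components, beta drawn from N(0, 0.001), y = X beta + e) together with a simple X-direction contamination; the shift applied to the contaminated rows is an assumption, since the exact contamination models are not reproduced above.

```r
simulate_pls_data <- function(n = 60, p = 400, k = 2, eps = 0.10, df = 2, shift = 10) {
  Tmat <- matrix(rnorm(n * k), n, k)          # latent scores
  Pmat <- matrix(rnorm(p * k), p, k)          # loadings
  X    <- Tmat %*% t(Pmat)                    # rank-k predictor matrix
  beta <- rnorm(p, mean = 0, sd = 0.001)      # true coefficients
  y    <- drop(X %*% beta + rnorm(n))         # clean responses, e ~ N(0, 1)
  n_out <- floor(eps * n)                     # number of contaminated rows
  if (n_out > 0) {
    rows <- seq_len(n_out)                    # assumed: first eps*n rows become X-direction outliers
    X[rows, ] <- matrix(rt(n_out * p, df = df) + shift, n_out, p)
  }
  list(X = X, y = y, beta = beta)
}
```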

Table 1.
Mean squared errors of the regression parameter estimates. Contamination in the X direction.

Table 2.
Mean squared errors of the regression parameter estimates. Contamination in the X and y directions.
It can be observed from Table 1 and Table 2 that MRCD-PCA-RWSIMPLS is the most efficient method, as evidenced by its having the smallest MSE values, followed by RWSIMPLS and SIMPLS, irrespective of error distribution, sample size, outlier direction, and percentage of contamination. We observe that the classical SIMPLS performs poorly.
In the following section, the performance of our proposed MRCD-PCA-RWSIMPLS is further evaluated with regard to its classification of observations into the four categories and its empirical influence function, using two real datasets.
4.2. Gasoline Dataset
The Gasoline dataset contains near-infrared (NIR) spectroscopy measurements at 401 wavelengths (the X-variables) from 900 nm to 1700 nm for 60 gasoline samples. The response variable is the octane number, ranging from 83.4 to 89.6. Mevik and Wehrens [] noted that this dataset contains no outliers. Branden and Hubert [] used the same dataset to compare the performance of the SIMPLS and RSIMPLS methods via the empirical influence function. Alin and Agostinelli [] also employed this dataset in their study to compare the performance of their proposed RWSIMPLS with the SIMPLS and PRM methods.
In building the PLS regression model, it is very crucial to select the optimal number of components, $A$ []. The most popular method for choosing the appropriate number of PLS components is leave-one-out cross-validation (LOOCV) []. The data are divided into a training set and a test set. The training set is used to estimate the regression parameters, while the test set, which should be independent of the training set, is used to assess prediction. The root mean squared error of prediction is computed for each number of PLS components, $k$, on the test set and is used to choose the appropriate number of PLS components. It is defined as
$$\mathrm{RMSEP}_k = \sqrt{\frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \left(y_i - \hat{y}_{i,k}\right)^2},$$
where $y_i$ is the $i$th observation in the test set, and $\hat{y}_{i,k}$ is the estimated value of the $i$th test-set observation based on $k$ PLS components.
In order to determine the optimal number of components, the SIMPLS leave-one-out cross-validation (LOOCV) root mean squared error is calculated for k = 1, 2, 3, …, 10 components, and the scree plot is obtained. We divide the dataset into training and test datasets; for the Gasoline data, up to 10 components are considered. In this paper, we also propose a robust root mean squared error of prediction (R-RMSEP), in which the prediction errors on the test set are computed from the predictions of the MRCD-PCA robust weighted SIMPLS fit. R-RMSEP is compared to the classical RMSEP, and the results are shown in Table 3 and Table 4. Based on Table 4, the optimal number of components chosen is k = 3, which explains 95.78 percent of the variation in the Gasoline data; this total variance is yielded by R-RMSEP. Likewise, in Table 3, R-RMSEP provides a low error value of 0.264. The scree plot of R-RMSEP in Figure 1 indicates that there is a significant drop from PLS component one to PLS component three, and after PLS component three the graph line shows no changes and is almost flat. On the other hand, the scree plot of the classical RMSEP in Figure 2 shows a sharp elbow occurring at component two. However, as observed in Table 4, the classical RMSEP indicates only 85.58 percent of the total variance explained for component two, which is less than the total variance computed by R-RMSEP with three components. Hence, k = 3 is chosen as the number of components, after which the cross-validation error does not show a significant decrease [] and the total variance explained is more than 90 percent, as suggested by []. We use three components to construct our proposed diagnostic plot for this dataset in order to classify the observations into the different types of data points.
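For the classical RMSEP, the LOOCV computation can be reproduced with the pls package in R, which ships the Gasoline data; the robust R-RMSEP is computed analogously but with predictions from the MRCD-PCA-RWSIMPLS fit, so the snippet below covers only the classical part.

```r
library(pls)
data(gasoline)                                        # 60 samples: octane response + 401 NIR channels
fit <- plsr(octane ~ NIR, ncomp = 10, data = gasoline, validation = "LOO")
RMSEP(fit)                                            # LOOCV RMSEP for k = 1, ..., 10 components
summary(fit)                                          # percentage of variance explained per component
```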

Table 3.
Leave one out cross validation of R-RMSEP and classical RMSEP for Gasoline data.

Table 4.
Percentage of variance explained of independent variables and response variable for Gasoline data.

Figure 1.
Scree plot for R-RMSEP for Gasoline data.

Figure 2.
Scree plot for RMSEP for Gasoline data.
The diagnostic plots for SIMPLS, RWSIMPLS, and MRCD-PCA-RWSIMPLS are presented in Figure 3, Figure 4 and Figure 5. It can be observed from Figure 4 that the plot based on RWSIMPLS produces erroneous results because many good observations are declared as outliers; this phenomenon is referred to as a swamping effect. The RWSIMPLS plot declares observations 3, 15, 41, and 57 as good leverage points, observation 53 as a bad leverage point, and observations 1, 2, 4, 6, 7, 9, 11, 12, 16, 17, 18, 19, 20, 26, 28, 32, 33, 34, 35, 36, 37, 44, 48, 51, 54, 55, 56, and 59 as vertical outliers. The plot in Figure 5 based on SIMPLS also shows the swamping effect, declaring eight good HLPs, i.e., cases 2, 3, 4, 14, 15, 47, 54, and 55, one bad HLP (observation 57), and four vertical outliers (observations 5, 11, 12, and 17). However, it is interesting to observe the plot in Figure 3, which is based on our proposed MRCD-PCA-RWSIMPLS: it shows fewer swamping effects, declaring only three good leverage points (observations 2, 3, and 15). Thus, our proposed MRCD-PCA-RWSIMPLS plot is more reliable than the other two plots.

Figure 3.
Diagnostic plot of MRCD-PCA-RWSIMPLS for Gasoline data.

Figure 4.
Diagnostic plot of RWSIMPLS for Gasoline data.

Figure 5.
Diagnostic plot of SIMPLS for Gasoline data.
The performance of the MRCD-PCA-RWSIMPLS method is further evaluated with regard to its empirical influence function.
Influence Function
Following [], the empirical influence function was constructed by first scaling the dependent variable ($y$) to have a similar range to the predictor variables. Next, in order to contaminate the dataset, we simultaneously replace the first explanatory variable value and the first dependent variable value with values ranging from −20 to 20. Then, the empirical influence function is calculated by Equation (39):
where $\hat{\beta}^{*}$ is the estimated coefficient parameter obtained from the contaminated dataset, $\hat{\beta}$ is the estimated coefficient parameter of the clean dataset, and $n$ equals 60.
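A sketch of this computation is given below, using the simpls_fit() helper from Section 2.1 as an example fitting function; it assumes the response has already been rescaled and that the scaling in Equation (39) is multiplication by n, which is an assumption rather than the paper's exact formula.

```r
empirical_influence <- function(X, y, fit_fun, A = 2, grid = seq(-20, 20, by = 1)) {
  n <- nrow(X)
  beta_clean <- fit_fun(X, y, A)$coefficients        # estimate on the clean data
  sapply(grid, function(z) {
    Xc <- X; yc <- y
    Xc[1, 1] <- z                                     # contaminate the first explanatory value
    yc[1]    <- z                                     # and the first response value
    beta_z <- fit_fun(Xc, yc, A)$coefficients
    n * sqrt(sum((beta_z - beta_clean)^2))            # assumed: n times the norm of the shift
  })
}
```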
Figure 6 demonstrates the empirical influence functions for SIMPLS, RWSIMPLS, and our proposed MRCD-PCA-RWSIMPLS method. We observe that the empirical influence function for SIMPLS is strongly affected when the first values of the response and explanatory variables are replaced. It can be observed that SIMPLS has the largest empirical influence values, and its plot plunges at $y_1 = 0$ and $x_{11} = 0$, producing a small value of the influence function at that point, similar to those of RWSIMPLS and MRCD-PCA-RWSIMPLS. It is interesting to observe that MRCD-PCA-RWSIMPLS has the lowest empirical influence values and shows no fluctuation in its estimates. In contrast, the empirical influence values for RWSIMPLS are higher than those of MRCD-PCA-RWSIMPLS, with moderate fluctuations.

Figure 6.
Empirical influence functions of SIMPLS (red line), RWSIMPLS (blue line), and MRCD-PCA-RWSIMPLS (green line) for Gasoline data.
4.3. Octane Dataset
The second real example in this paper is the Octane dataset, which contains near-infrared (NIR) absorbance spectra over p = 226 wavelengths for n = 39 gasoline samples. Leave-one-out cross-validation (LOOCV) is performed on SIMPLS to determine the optimal number of components for k = 1, 2, 3, …, 10, and the scree plot is obtained. The robust root mean squared error of prediction (R-RMSEP) is also computed. We first divide the dataset into training and test datasets; for the Octane data, up to 10 components are considered. From the scree plot in Figure 7, it can be observed in the R-RMSEP plot that there is a substantial drop from component one to component two, and after component two the graph line shows no major changes. Moreover, in Table 5, R-RMSEP produces a smaller root mean squared error of 0.576 compared to the classical RMSEP. In Table 6, the two components of R-RMSEP explain almost 80 percent of the total variance in the Octane data. Therefore, it is adequate to choose the number of components equal to two for the Octane dataset. In contrast, for the classical RMSEP in Figure 8, the lowest point of the graph is at the third component, and there is a fluctuation in the line, indicating that the classical RMSEP is disturbed by the outliers in the Octane dataset. Based on Table 6, the third component of the classical RMSEP produces a very high variance explained of 99.06 percent, which is almost 100 percent of the data explained, and this can cause overfitting, as the classical RMSEP is influenced by outliers [].

Figure 7.
Scree plot of R-RMSEP for Octane data.

Table 5.
Leave one out cross validation for R-RMSEP and classical RMSEP for Octane data.

Table 6.
Percentage of variance explained of independent variables and response variable for Octane data.

Figure 8.
Scree plot of RMSEP for Octane data.
We also illustrate our proposed diagnostic plot on this dataset to classify outlying observations and to compare the robustness of our proposed algorithm, MRCD-PCA-RWSIMPLS, with that of the other two methods, RWSIMPLS and classical SIMPLS. The authors of [,] used this dataset to identify outliers, and they diagnosed six outliers in the X-space; the outliers are the observations that contain added ethanol. From Figure 9, it is clearly shown that MRCD-PCA-RWSIMPLS successfully detects the true six high leverage points (cases 25, 26, 36, 37, 38, and 39). On the other hand, in Figure 10, the RWSIMPLS method suffers from swamping and masking problems, as it declares good observations as outliers and fails to detect some real outliers. RWSIMPLS declares observations 6, 23, 25, 34, 37, 38, and 39 as good HLPs and observations 4, 7, 10, 13, 16, 18, 20, 21, 22, 24, 26, 32, and 36 as vertical outliers. The classical SIMPLS also suffers from masking and swamping effects. As observed in Figure 11, SIMPLS identifies observations 23, 26, 36, 37, 38, and 39 as good HLPs and observations 7, 13, and 32 as vertical outliers; it suffers from a masking problem, as it fails to detect the outlying observation 25. The MRCD-PCA-RWSIMPLS diagnostic plot is thus very efficient in detecting and classifying outliers without masking and swamping problems, whereas RWSIMPLS detects some of the outlying observations correctly but flags additional false outliers, and the performance of SIMPLS is even worse, as it fails to detect all of the true HLPs.

Figure 9.
Diagnostic plot of MRCD-PCA-RWSIMPLS for Octane data.

Figure 10.
Diagnostic plot of RWSIMPLS for Octane data.

Figure 11.
Diagnostic plot of SIMPLS for Octane data.
Influence Function
In order to compute the empirical influence function for the Octane data, we first removed the six outliers (observations 25, 26, 36, 37, 38, and 39) from the dataset in order to avoid distorting errors in the influence function computation. The data are then contaminated in the same manner as in [] by simultaneously replacing the first explanatory variable value and the first dependent variable value with values ranging from −20 to 20. The empirical influence function is calculated as in Equation (39). Figure 12 displays the empirical influence functions for SIMPLS, RWSIMPLS, and our proposed MRCD-PCA-RWSIMPLS method. We observe that the empirical influence function for SIMPLS is affected when the first values of the response and explanatory variables are replaced. It can be observed that SIMPLS has large empirical influence values, and its plot plunges at $y_1 = 0$ and $x_{11} = 0$, providing a small value of the influence function at that point. The performance of RWSIMPLS is poor, as it fluctuates tremendously when the data are corrupted, showing that RWSIMPLS is extremely vulnerable to even a single outlier in the data. It is interesting to observe that MRCD-PCA-RWSIMPLS has the lowest empirical influence values and shows no fluctuation in its estimates.

Figure 12.
Empirical influence functions of SIMPLS (red line), RWSIMPLS (blue line), and MRCD-PCA-RWSIMPLS (green line) for Octane data.
5. Conclusions
In this paper, a robust weighted SIMPLS based on weights obtained from MRCD-PCA was developed; we call this method MRCD-PCA-RWSIMPLS. A new diagnostic plot for classifying observations into four types of data points, namely regular observations, good leverage points, vertical outliers, and bad leverage points, was also established. The robust weighting function introduced in the MRCD-PCA-RWSIMPLS algorithm tremendously improves the efficiency of its estimates. MRCD-PCA-RWSIMPLS outperformed the SIMPLS and RWSIMPLS methods, as it had the smallest MSE and the smallest empirical influence values, while SIMPLS produced the worst performance. The proposed MRCD-PCA-RWSIMPLS diagnostic plot was able to classify observations into the correct groups, whereas the SIMPLS and RWSIMPLS diagnostic plots failed to correctly identify the actual types of data points and suffered from masking and swamping effects. Hence, the merits of our proposed MRCD-PCA-RWSIMPLS method and its diagnostic plot are confirmed, as reflected in their application to real datasets and in the Monte Carlo simulation study. Nonetheless, our proposed MRCD-PCA-RWSIMPLS method has its own limitation, as its computational running time is slightly longer than that of the SIMPLS and RWSIMPLS methods. This slightly increased running time is a trade-off one has to accept when using our proposed method, as it is the most efficient method for confronting outliers in HDD.
Author Contributions
Conceptualization, S.Z. and H.M.; methodology, S.Z. and H.M.; validation, H.M. and M.S.M.; formal analysis, S.Z. and M.S.M.; writing—original draft preparation, S.Z.; writing review and editing, H.M. All authors have read and agreed to the published version of the manuscript.
Funding
This article was partially supported by the Fundamental Research Grant Scheme (FRGS) under the Ministry of Higher Education, Malaysia, with project number FRGS/1/2019/STG06/UPM/01/1.
Institutional Review Board Statement
Ethical review and approval were waived for this study due to its theoretical and mathematical approach.
Informed Consent Statement
Not applicable.
Data Availability Statement
The Gasoline and Octane datasets are used to verify the performance of our proposed method. The Gasoline dataset has been used by M. Hubert [] and is available in the package "pls" in R. The Octane dataset has been used by M. Hubert [] and Boudt et al. [] and is available in the package "rrcov" in R.
Acknowledgments
The authors would like to thank the reviewers for their constructive suggestions.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
References
- Thakkar, S.; Perkins, R.; Hong, H.; Tong, W. Computational Toxicology. In Comprehensive Toxicology, 3rd ed.; Elsevier Ltd.: Amsterdam, The Netherlands, 2018; Volume 5–15.
- Berntsson, F. Methods of High-Dimensional Statistical Analysis for the Prediction and Monitoring of Engine Oil Quality; KTH Royal Institute of Technology School of Engineering Sciences: Stockholm, Sweden, 2016.
- Boulesteix, A.-L.; Strimmer, K. Partial least squares: A versatile tool for the analysis of high-dimensional genomic data. Briefings Bioinform. 2006, 8, 32–44.
- Bulut, E.; Yolcu, U.; Tasmektepligil, M.Y. The use of partial least squares regression and feed forward artificial neural networks methods in prediction vertical and broad jumping of young football players. World Appl. Sci. J. 2013, 21, 572–577.
- Varmuza, K.; Filzmoser, P. Introduction to Multivariate Statistical Analysis in Chemometrics; CRC Press: Boca Raton, FL, USA, 2016.
- Lindgren, F.; Rännar, S. Alternative partial least-squares (PLS) algorithms. Perspect. Drug Discov. Design 1998, 12, 105–113.
- de Jong, S. SIMPLS: An alternative approach to partial least squares regression. Chemom. Intell. Lab. Syst. 1993, 18, 251–263.
- Höskuldsson, A. PLS regression methods. J. Chemom. 1988, 2, 211–228.
- Trygg, J.; Wold, S. Orthogonal projections to latent structures (O-PLS). J. Chemom. 2002, 16, 119–128.
- Alguraibawi, M.; Midi, H.; Imon, A.H.M.R. A new robust diagnostic plot for classifying good and bad high leverage points in a multiple linear regression model. Math. Probl. Eng. 2015, 2015, 279472.
- Wakeling, I.N.; Macfie, H.J.H. A robust PLS procedure. J. Chemom. 1992, 6, 189–198.
- Cummins, D.J.; Andrews, C.W. Iteratively reweighted partial least squares: A performance analysis by Monte Carlo simulation. J. Chemom. 1995, 9, 489–507.
- Gil, J.A.; Romera, R. On robust partial least squares (PLS) methods. J. Chemom. 1998, 12, 365–378.
- Hubert, M.; Branden, K.V. Robust methods for partial least squares regression. J. Chemom. 2003, 17, 537–549.
- Serneels, S.; Croux, C.; Filzmoser, P.; Van Espen, P.J. Partial robust M-regression. Chemom. Intell. Lab. Syst. 2005, 79, 55–64.
- Alin, A.; Agostinelli, C. Robust iteratively reweighted SIMPLS. J. Chemom. 2017, 31, e2881.
- Markatou, M.; Basu, A.; Lindsay, B.G. Weighted likelihood estimating equations with a bootstrap search. J. Am. Stat. Assoc. 1998, 93, 740–750.
- Boudt, K.; Rousseeuw, P.J.; Vanduffel, S.; Verdonck, T. The minimum regularized covariance determinant estimator. Stat. Comput. 2019, 30, 113–128.
- Maronna, R.A.; Martin, R.D.; Yohai, V.J.; Salibian-Barrera, M. Robust Statistics; John Wiley & Sons: Hoboken, NJ, USA, 2006.
- Lim, H.A.; Midi, H. Diagnostic Robust Generalized Potential Based on Index Set Equality (DRGP (ISE)) for the identification of high leverage points in linear model. Comput. Stat. 2016, 31, 859–877.
- Coakley, C.W.; Hettmansperger, T.P. A bounded influence, high breakdown, efficient regression estimator. J. Am. Stat. Assoc. 1993, 88, 872–880.
- Dhhan, W.; Rana, S.; Midi, H. A high breakdown, high efficiency and bounded influence modified GM estimator based on support vector regression. J. Appl. Stat. 2016, 44, 700–714.
- Rousseeuw, P.J.; Van Zomeren, B.C. Unmasking multivariate outliers and leverage points. J. Am. Stat. Assoc. 1990, 85, 633–639.
- Midi, H.; Ramli, N.M.; Imon, A.H.M.R. The performance of diagnostic-robust generalized potential approach for the identification of multiple high leverage points in linear regression. J. Appl. Stat. 2009, 36, 1–15.
- Mevik, B.-H.; Wehrens, R. Principal component and partial least squares regression in R. J. Stat. Softw. 2007, 1, 128–129.
- Branden, K.V.; Hubert, M. Robustness properties of a robust partial least squares regression method. Anal. Chim. Acta 2004, 515, 229–241.
- Nengsih, T.A.; Bertrand, F.; Maumy-Bertrand, M.; Meyer, N. Determining the number of components in PLS regression on incomplete data set. Stat. Appl. Genet. Mol. Biol. 2019, 1–28.
- Turkmen, A.S. Robust Partial Least Squares for Regression and Classification; Auburn University: Auburn, AL, USA, 2018.
- Thennadil, S.N.; Dewar, M.; Herdsman, C.; Nordon, A.; Becker, E. Automated weighted outlier detection technique for multivariate data. Control Eng. Pract. 2018, 70, 40–49.
- Liu, Y.; Tran, T.; Postma, G.; Buydens, L.; Jansen, J. Estimating the number of components and detecting outliers using Angle Distribution of Loading Subspaces (ADLS) in PCA analysis. Anal. Chim. Acta 2018, 1020, 17–29.
- Hubert, M.; Rousseeuw, P.J.; Branden, K.V. ROBPCA: A new approach to robust principal component analysis. Technometrics 2005, 47, 64–79.
Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).