1. Introduction
Semi-parametric linear additive (SLA) models combine the flexibility of non-parametric regression models with the simplicity of linear regression models, and they are widely used for data analysis in many fields. In SLA models, the mean of the response variable is assumed to depend linearly on some explanatory variables and non-linearly, in an additive form, on the others.
Suppose $y = (y_1, \ldots, y_n)^\top$ is the vector of the response variable and $X = (x_1, \ldots, x_p)$ is the $n \times p$ design matrix with $p$ covariates and $n$ observations. Without loss of generality, assume that $X$ is partitioned into $X_1 = (x_1, \ldots, x_q)$ and $X_2 = (x_{q+1}, \ldots, x_p)$ for some $1 \le q \le p$. Then, the semi-parametric linear additive model (see, e.g., [1]) is defined as
$$ y_i = \sum_{j=1}^{q} x_{ij}\,\beta_j + \sum_{j=q+1}^{p} f_j(x_{ij}) + \varepsilon_i, \qquad i = 1, \ldots, n, $$
where $\beta = (\beta_1, \ldots, \beta_q)^\top$ is a $q$-dimensional vector of unknown parameters, $f_{q+1}, \ldots, f_p$ are unknown smooth functions, and the $\varepsilon_i$'s are random error terms, which are presumed to be independent of $X$. It is assumed that the response and the covariates are centered, and thus the intercept term is omitted without loss of generality.
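As a toy illustration of data arising from such a model (the particular choices of $q$, the coefficients, and the smooth functions below are illustrative assumptions, not settings used in this paper), one could simulate:

```r
# Simulate a small semi-parametric linear additive dataset:
# two covariates enter linearly, two enter through smooth functions.
set.seed(1)
n <- 200; p <- 4; q <- 2
X <- matrix(runif(n * p), n, p)               # covariates in (0, 1)
beta <- c(1.5, -2)                            # illustrative linear coefficients
f3 <- function(x) sin(2 * pi * x)             # illustrative smooth components
f4 <- function(x) 4 * (x - 0.5)^2
y <- drop(X[, 1:q] %*% beta) + f3(X[, 3]) + f4(X[, 4]) + rnorm(n, sd = 0.5)
```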
There are several approaches for the estimation of non-parametric additive models, including the back-fitting technique (see [2]), simultaneous estimation and optimization [3,4,5,6], the mixed-model approach [1,7,8], and the boosting approach [9,10]. Ref. [5] presented a review of some of these methods up to 2006, and [11] performed several comparisons between these techniques. The problem of variable selection and penalized estimation in additive models has been investigated by many researchers [12,13,14,15,16,17,18,19,20,21,22].
An essential concern in practice is to identify the linear and nonlinear parts of the SLA model, i.e., to determine whether each explanatory variable should enter the model linearly or non-linearly. Ref. [23] studied an additive regression model as the standard model by assuming that each of the functions is decomposed into linear and nonlinear parts; their proposed estimation approach was a penalized regression scheme based on a group minimax concave penalty. Ref. [24] considered the additive model and separated the linear and nonlinear predictors by using two group penalty functions, one enforcing sparsity and the other enforcing linearity of the components. Ref. [25] introduced a model similar to that of [23], but imposed the LASSO and group LASSO penalty functions on the coefficients of the linear part and the coefficients of the spline estimator of the nonlinear part, respectively. Ref. [26] introduced a similar additive model and enforced linearity on the spline approximations of the functions using a group penalty on the second derivatives of the B-splines. There are further contributions on the problem of structure recognition and separation of the nonlinear and linear parts of the SLA model [27,28]. A more detailed review of the proposed separation approaches is given in Section 2.
The presence of outliers, i.e., unusual observations that fail to follow the pattern of the bulk of the data, is a frequent problem in model fitting. In such situations, robust regression approaches are used to mitigate the undesirable effects of the outliers. Some of the most popular robust regression approaches are M-estimation, S-estimation, the least median of squares, and the least trimmed squares; see [29] for more details. The least trimmed squares (LTS) method, suggested by Rousseeuw and Leroy [30], is one of the most popular robust regression techniques: it minimizes the sum of the $h$ smallest squared residuals instead of the full sum, for a specified positive integer trimming parameter $h \le n$. The LTS estimator attains the maximum possible breakdown point (50%) [31]. Several works have studied robust estimation for semi-parametric and non-parametric linear models (see, e.g., [32,33,34]).
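For reference, in the linear regression setting the LTS estimator can be written in generic notation (a standard formulation, not reproduced from this paper's displays) as
$$ \hat{\beta}_{\mathrm{LTS}} = \arg\min_{\beta} \sum_{i=1}^{h} r_{(i)}^{2}(\beta), $$
where $r_{(1)}^{2}(\beta) \le \cdots \le r_{(n)}^{2}(\beta)$ are the ordered squared residuals $r_i(\beta) = y_i - x_i^{\top}\beta$ and $h \le n$ is the trimming parameter.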
In this paper, we consider the effect of outliers on simultaneous separation and estimation methods in SLA models, and we introduce an LTS version of the separation and sparse estimation approach suggested by [24]. The paper is organized as follows. Section 2 presents a literature review of some simultaneous separation and estimation approaches. Section 3 contains the general LTS version of the approaches presented in Section 2; our implementation then applies the LTS version of the method proposed by [24]. The finite-sample breakdown point of the proposed estimator is established, and a computational algorithm is introduced. Comprehensive simulation studies are conducted in Section 4, in which several criteria are evaluated for six competing models. In Section 5, the proposed approach is applied to the Boston housing prices dataset, the prediction performance of the different methods is compared, and the effect of the outliers is illustrated through partial residual plots for all competing schemes.
3. Robust Penalized Estimation Methods
All of the penalized loss functions (3), (4), (6), and (7) can be written in the following general form:
$$ L_m(\theta) = \sum_{i=1}^{n} \ell_{m,i}(\theta) + \sum_{j} P_{m,j}(\theta_j), $$
where $\ell_{m,i}(\theta)$ is the loss function of the $i$th observation, and $P_{m,j}(\theta_j)$ is the penalty function of the $j$th parameter, $j = 1, \ldots, p$, in the $m$th model, $m = 1, \ldots, 4$.
The least trimmed squares (see [35]) penalized loss function associated with the $m$th model is then as follows:
$$ L_m^{\mathrm{LTS}}(\theta, \delta) = \sum_{i=1}^{n} \delta_i\, \ell_{m,i}(\theta) + \sum_{j} P_{m,j}(\theta_j), $$
where $\delta_i$ is the binary indicator clarifying whether the $i$th observation is a normal observation or an outlier point, such that $\delta_i \in \{0, 1\}$ and $\sum_{i=1}^{n} \delta_i = h$, for $i = 1, \ldots, n$, and $h$ is a starting conjecture for the number of normal observations. Let $\Delta = \mathrm{diag}(\delta_1, \ldots, \delta_n)$ be the diagonal matrix with diagonal elements $\delta_1, \ldots, \delta_n$. The resulting robust sparse semi-parametric linear estimator is obtained by the following optimization problem:
$$ \big(\hat{\theta}, \hat{\delta}\big) = \arg\min_{\theta,\,\delta}\; L_m^{\mathrm{LTS}}(\theta, \delta) \quad \text{subject to } \delta_i \in \{0, 1\},\; \sum_{i=1}^{n} \delta_i = h. $$
In this work, we only consider the robust version of penalized loss function (4); its LTS penalized counterpart, obtained from the general form above, is referred to as scheme (11). Hereafter, we name scheme (4) sparse semi-parametric linear additive (SSLA) and scheme (11) robust sparse semi-parametric linear additive (RSSLA). We also name the corresponding special case of schemes (4) and (11) sparse nonlinear additive (SNLA) and robust sparse nonlinear additive (RSNLA), respectively, because in this special case the schemes reduce to purely nonlinear additive forms. As an alternative competitor to these schemes, the simple linear LASSO regression is also considered, which is called sparse linear (SL), and its robust version based on the LTS method is called robust sparse linear (RSL) in this research.
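For concreteness, the SL competitor is an ordinary LASSO fit; given a numeric response y and design matrix X, a minimal sketch with the glmnet package (the implementation noted in Section 4 for the SL and RSL models) could be:

```r
library(glmnet)
# SL: plain LASSO on all covariates, with the penalty level chosen by cross-validation.
sl_fit <- cv.glmnet(X, y, alpha = 1)
coef(sl_fit, s = "lambda.min")
# RSL would wrap the same LASSO fit in the LTS indicator scheme of Section 3.2,
# repeatedly refitting on the h observations with the smallest squared residuals.
```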
3.1. The Breakdown Point of the RSSLA Model
The RSSLA estimator is obtained as the minimizer of the RSSLA penalized loss (11) over the regression parameters and the indicator vector. Conventionally, we consider $h = n - \lceil n\alpha \rceil$, where $\lceil a \rceil$ denotes the ceiling of $a$ and $\alpha$ is the assumed percentage of leverage (outlying) observations; indeed, $\alpha$ is a starting guess for the proportion of outlier points. Some researchers propose specific default choices of $\alpha$ (see [36] for more details), and others have proposed alternative choices of $h$.
The finite-sample breakdown point (FBP; see, e.g., [29]) is a measure of the resistance of an estimation method to contamination. For the complete sample $Z = \{(x_i, y_i)\}_{i=1}^{n}$, the FBP of an estimator $T$ is given by
$$ \mathrm{FBP}(T; Z) = \min\left\{ \frac{m}{n} : \sup_{Z_m} \big\| T(Z_m) - T(Z) \big\| = \infty \right\}, $$
where $Z_m$ is a corrupted sample obtained from $Z$ by replacing $m$ of the complete $n$ observations by arbitrary points. In the following theorem, the FBP of the proposed RSSLA estimator is established.
Theorem 1. The FBP of the RSSLA estimator is
Proof. Let
be the corrupted sample by replacing the last
sample points. Then the number of normal points in
is
. For an arbitrary sample
, we can write
where
.
Let
be such that
; then
Since
, we can write
and hence
.
Let
be the
matrix of
s,
. Change the last
observations of
such that the last
m observations of
are changed to
, with
and
,
, in which
is a vector with 1 as its ith element and zeros elsewhere, and
Let
and consider the point
. Now, for the last
m sample points, according to
, it can be written that
Also, for the corrupted sample, we can write
in which at least one of the last m points of the corrupted sample belongs to the set of the h smallest residuals. Now, considering
such that
, it can be seen that
since
, which is a contradiction. Thus, we deduce that
which means that breakdown occurs as M tends to infinity, and the proof is completed. □
3.2. Computational Penalized LTS Algorithm
To find the optimal indicator vector, we would have to search for the minimum over all $\binom{n}{h}$ subsets of size $h$ of the complete sample. Thus, for even moderately large sample sizes, achieving the optimal value may require prohibitive time and memory. To make fitting the RSSLA model practical, an analog of the FAST-LTS algorithm developed by [35] is proposed.
Let $\delta^{(k)}$ be the indicator vector obtained at iteration $k$ and $\hat{\theta}^{(k)}$ be the argument that minimizes the penalized loss, given $\delta^{(k)}$, in the $k$th iteration. Then, the indicator vector is updated as
$$ \delta_i^{(k+1)} = \mathbb{1}\!\left\{ r_i^2\big(\hat{\theta}^{(k)}\big) \le r_{(h)}^2\big(\hat{\theta}^{(k)}\big) \right\}, \qquad i = 1, \ldots, n, $$
where $r_{(1)}^2 \le \cdots \le r_{(n)}^2$ are the sorted squared residuals. It is obvious that the penalized LTS objective does not increase at each such step, and the algorithm continues until convergence.
To increase the chance that the resulting solution is as close as possible to the optimal solution of the combinatorial problem, the steps of the algorithm are replicated $s$ times with $s$ starting indicator vectors. To decrease the computational cost, the methodology proposed by [35] is applied, in which only two iterations of the algorithm are performed for each starting indicator vector, a small number, $k$, of the candidates with the lowest objective values are kept, and the algorithm is continued for these candidates until convergence occurs. The final result is the indicator vector attaining the minimum value of the optimization problem.
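As a rough illustration of this strategy, the concentration (C-)steps can be organized as in the following sketch, which assumes a generic penalized fitting routine `penalized_fit` with a matching `predict` method (both hypothetical placeholders, not the authors' code) and, unlike the two-step-then-screen scheme above, simply iterates each start to convergence:

```r
# Penalized LTS via concentration steps, in the spirit of FAST-LTS:
# fit on the current h-subset, keep the h observations with the smallest
# squared residuals, and iterate until the subset stabilizes.
penalized_lts <- function(y, X, h, n_starts = 50, max_iter = 20) {
  n <- length(y)
  best <- list(obj = Inf)
  for (s in seq_len(n_starts)) {
    idx <- sort(sample.int(n, h))                          # random starting h-subset
    for (k in seq_len(max_iter)) {
      fit <- penalized_fit(y[idx], X[idx, , drop = FALSE]) # hypothetical penalized fitter
      r2  <- (y - predict(fit, X))^2                       # squared residuals on all points
      new_idx <- sort(order(r2)[seq_len(h)])               # h smallest squared residuals
      if (identical(new_idx, idx)) break                   # converged: subset unchanged
      idx <- new_idx
    }
    obj <- sum(r2[idx])                                    # trimmed part of the objective
    if (obj < best$obj) best <- list(obj = obj, idx = idx, fit = fit)
  }
  best
}
```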
4. Simulation Study
In this section, we present an extensive simulation study to examine the performance of the proposed estimators in the presence of outliers. We consider the simulation scenarios proposed by [24] to generate the clean data, which come from a semi-parametric additive model with both linear and nonlinear components. The errors are generated from a normal distribution with zero mean and fixed variance. The covariates are generated from a multivariate normal distribution with zero mean vector and a prescribed covariance matrix, and then the cumulative distribution function of the standard normal distribution is applied to them to transform their range into $(0, 1)$.
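The covariate-generation step can be sketched as follows; the AR(1)-type correlation $\rho^{|j-k|}$ used here is only an illustrative assumption, since the paper's covariance specification is not repeated in this section:

```r
library(MASS)
# Draw correlated covariates and map them to (0, 1) via the Gaussian copula transform.
gen_covariates <- function(n, p, rho = 0.5) {
  Sigma <- rho^abs(outer(1:p, 1:p, "-"))            # assumed AR(1)-type correlation
  Z <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)    # multivariate normal draws
  pnorm(Z)                                          # standard normal CDF -> range (0, 1)
}
X <- gen_covariates(n = 100, p = 10)
```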
The simulation study is performed for a fixed number of iterations of data generation and estimation. In each iteration, we generate clean train datasets with the prescribed numbers of covariates and observations, together with corresponding test datasets. Then, we contaminate the train and test response values as follows. From the n samples, we randomly choose a subset of observations as outliers. For each chosen observation, we independently generate two quantities from a uniform distribution and use them, together with the sample standard deviation of the clean responses, to modify the corresponding response value. On a Core i5-10210U CPU (1.60 GHz) with 8 GB of RAM and R version 4.2.1 (64-bit), the mean computation time is 16.84 min for the SSLA model (with BIC optimization for choosing the penalization parameters), 0.26 s for the SL model, 2.41 min for the SNLA model, 53.29 s for the RSSLA model (without BIC optimization for choosing the penalization parameters), 0.23 s for the RSL model, and 4.54 min for the RSNLA model. Note that the code for the proposed models is developed in R, while the SL and RSL models use the R package glmnet, whose main procedures are implemented in compiled code.
Several criteria are considered in this simulation study to examine the performance of the estimators. The mean integrated square error (MISE) is computed for the estimated regression functions. To assess the prediction efficiency of the proposed methods in the presence of outliers, we define the clean-data mean square error (CMSE) and the clean-data prediction error (CPE) for the train and test datasets, respectively.
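For concreteness, natural versions of these criteria (written here in generic notation; the paper's exact displays are not reproduced) are
$$ \mathrm{CMSE} = \frac{1}{n}\sum_{i=1}^{n}\big(\hat{y}_i - y_i^{\mathrm{clean}}\big)^2, \qquad \mathrm{CPE} = \frac{1}{n_{\mathrm{test}}}\sum_{i=1}^{n_{\mathrm{test}}}\big(\hat{y}_i^{\mathrm{test}} - y_i^{\mathrm{test,clean}}\big)^2, $$
where the fits and predictions come from models trained on the contaminated data, while the targets are the uncontaminated (clean) train and test responses; similarly, the MISE of an estimated component can be approximated by averaging $\big(\hat{f}_j(x_{ij}) - f_j(x_{ij})\big)^2$ over the observed design points and the simulation replications.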
The false negative rate (FNR) and the false positive rate (FPR) are also computed; in their definitions, $|A|$ stands for the cardinality of a set $A$.
We also define the false linear rate (FLR) and false non-linear rate (FNLR) criteria to examine the separation performance of the SSLA and RSSLA models, as well as the false outlier rate (FOR) and false non-outlier rate (FNOR) criteria to examine the outlier detection performance of the robust models.
The means and standard errors of all the above criteria are tabulated for all the different scenarios in Table 1, Table 2, Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8. From Table 1 and Table 2, one can see that the robust separative model RSSLA is the most powerful model for estimating the true regression functions, especially for the larger sample size (Table 2), while for the smaller sample size (Table 1), the RSNLA model is also successful in estimating the nonlinear regression functions. From Table 3, it can be observed that the RSSLA model is generally more efficient than the other competitors (except the RSNLA model in a few cases) in the sense of the clean-data MSE (CMSE). The clean-data prediction performance of the RSSLA model is the best among the six models, based on the CPE values tabulated in Table 4. From Table 5, the RSNLA model is the best model based on the FNR criterion, while the best values of the FPR criterion are obtained for the SL model, based on the values in Table 6. However, one can see that the FNR and FPR values for the RSSLA model are better than those of the SSLA model, which means that robust modeling improves the FNR and FPR values in the separative semi-parametric linear model. From Table 7, it can be seen that both the SSLA and RSSLA models have near-zero values of the FLR, while the RSSLA model has significantly lower values of the FNLR than the SSLA model. This shows that robust modeling helps the model to separate the linear and nonlinear covariates more accurately. Finally, from the values of the FOR and FNOR criteria in Table 8, we can deduce that the RSSLA model is the most powerful of the three robust models for the correct detection of the outliers.
5. Case Study
To evaluate the performance of the proposed method on a real dataset, we analyze the Boston housing prices dataset [37,38] with 506 observations and 14 features, available in the R package MASS [39]. Here, we consider the median value of owner-occupied homes in USD 1000 (Median Price) as the response variable, and the following covariates:
Crime rate: per capita crime rate by town;
Nitrogen Oxides: nitrogen oxide concentration (parts per 10 million);
Rooms: average number of rooms per dwelling;
Age: proportion of owner-occupied units built prior to 1940;
Distances: weighted mean of distances to five Boston employment centers;
Lower Status: lower status of the population (percent).
The semi-parametric linear additive model with Median Price as the response and these six covariates is considered.
Leave-one-out cross-validation is considered, restricted to the samples whose squared train residuals are below the 90% quantile (i.e., those not flagged as outliers) in all models. We call this criterion the trimmed leave-one-out cross-validation (TLOOCV):
$$ \mathrm{TLOOCV} = \frac{1}{|C|}\sum_{i \in C}\big(y_i - \hat{y}_{(-i)}\big)^2, $$
where $\hat{y}_{(-i)}$ is the prediction of $y_i$ using all observations except the $i$th one, and $C$ is the set of indices whose squared residuals are below the 90% quantile.
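A minimal sketch of this criterion in R (using generic `fit_model` and `predict_model` placeholders rather than the paper's estimators) could be:

```r
# Trimmed leave-one-out cross-validation: average squared LOO prediction error
# over the observations whose full-fit squared residuals fall below the 90% quantile.
tloocv <- function(y, X, fit_model, predict_model) {
  full_fit <- fit_model(y, X)
  r2 <- (y - predict_model(full_fit, X))^2
  keep <- which(r2 < quantile(r2, 0.90))             # indices not flagged as outliers
  loo_err <- sapply(keep, function(i) {
    fit_i <- fit_model(y[-i], X[-i, , drop = FALSE])
    (y[i] - predict_model(fit_i, X[i, , drop = FALSE]))^2
  })
  mean(loo_err)
}
```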
Values of the TLOOCV are presented in Table 9, along with the percentage of test points considered as outliers. As one can see from Table 9, the RSSLA model achieves the smallest value of the TLOOCV among all models.
To draw the partial residual plot for the jth covariate, we compute the residuals of the regression of the response variable against all covariates except the jth covariate, and then plot these residuals against the jth covariate. These plots are shown in Figure 1, Figure 2, Figure 3, Figure 4, Figure 5 and Figure 6 for all six models. The outliers are the points whose squared residuals are greater than the 90% quantile of the squared residuals.
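For illustration, a generic partial residual plot of this kind can be produced along the following lines; a plain least-squares fit is used here only as a stand-in for the six fitted models in the paper:

```r
# Partial residual plot for the jth covariate: regress y on all covariates except
# the jth one, then plot those residuals against the jth covariate, marking points
# whose squared (partial) residual exceeds the 90% quantile.
partial_residual_plot <- function(y, X, j, outlier_q = 0.90) {
  r <- resid(lm(y ~ X[, -j, drop = FALSE]))
  out <- r^2 > quantile(r^2, outlier_q)
  plot(X[, j], r, col = ifelse(out, "red", "black"),
       xlab = colnames(X)[j], ylab = "Partial residuals")
  lines(lowess(X[, j], r), lty = 2)                  # smooth trend to reveal nonlinearity
}
```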