1. Introduction
In traditional statistical models, the response variable is usually studied in the one-dimensional case; see, for example, references [1,2,3,4]. Recently, there has been an increasing need to predict several response variables from a common set of covariates. For example, in bioinformatics, multiple phenotypes of a patient are to be predicted based on measures of genetic variation. In economics, the returns of multiple stocks can be predicted based on a common set of economic and market variables. In neuroscience, the responses of different regions of the brain can be predicted based on the input stimulus. Such data analysis problems arise widely and are naturally formulated as multivariate regression.
For regression with multiple response variables, one can naively fit a separate linear regression for each response variable, ignoring the inherent correlation structure among the responses. In fact, performing a separate ordinary least squares regression for each response variable is equivalent to maximum likelihood estimation of the coefficient matrix. Another practical problem is that the number of parameters in the regression coefficient matrix can be huge even for a modest number of response variables.
It is natural to think of ways to improve this simple approach. Two directions can be used to reduce the effective degrees of freedom of the model and to improve its predictive power and interpretability. One direction is to reduce the dimensionality of the coefficient matrix. To take advantage of correlations between the response variables, the rank of the coefficient matrix is restricted, which gives rise to the well-known reduced-rank regression (RRR), as in references [5,6]. RRR allows the response variables to borrow information from each other through a common set of latent factors to improve the prediction accuracy. However, each latent factor is a linear combination of all predictors. When there are a large number of predictor variables, some of them may not be useful for prediction. Therefore, an alternative approach is to reduce the number of free parameters by variable selection, excluding redundant predictors from the latent factors. Rank reduction and variable selection within the RRR framework have been extensively studied in the recent literature [7,8,9,10,11,12]. However, in all these works, the components of the error terms are assumed to be independently and identically distributed. Ref. [13] added variable selection and covariance estimation to the RRR model to improve the predictive performance and facilitate interpretation.
Although RRR can substantially reduce the number of free parameters in multivariate problems, Ref. [14] found that it is extremely sensitive to outliers. In real-world data analysis, outliers and leverage points are bound to occur, and thus the genuine reduced-rank structure can easily be masked or distorted. Ref. [15] proposed a method to detect outliers in a continuous distribution based on the cumulative distribution function. Ref. [16] proposed the use of order statistics to detect extreme values in continuously distributed samples; in the ten case studies of outlier detection reported in reference [16], the order-statistics method performed excellently in a comparative study. However, this approach to outlier detection does not involve covariates, whereas the effect of covariates needs to be taken into account in our paper.
The research most closely related to this paper includes sparse reduced-rank regression with covariance estimation (Cov-SRRR) in [13] and robust reduced-rank regression (R4) in [14]. However, the Cov-SRRR method does not consider robustness, and the R4 method considers neither the correlations among error terms nor the sparsity of the coefficient matrix; each addresses only a subset of our objectives. The main contributions of this paper are highlighted as follows.
- In this work, the mean-shift method is used to obtain a robust estimator that can explicitly detect outliers, account for general correlations among error terms, and improve estimation accuracy by applying shrinkage to the precision matrix. In addition, we incorporate variable selection into the model estimation through a penalty function on the coefficient matrix. 
- A block coordinate descent algorithm is proposed to obtain the parameter estimates by minimizing the objective function. The monotonicity of the algorithm is also established. 
The rest of the paper is organized as follows: in Section 2, we formally introduce the framework of reduced-rank regression with a sparse mean-shift outlier component and an unknown error covariance matrix, and a penalized estimation method is used to obtain sparse estimates. A block coordinate descent algorithm for estimating the model proposed in Section 2 is developed in Section 3. Simulation studies comparing the proposed model with relevant competitors are given in Section 4. A real data application is presented in Section 5. Finally, a summary and brief discussion are given in Section 6.
  2. Robust Reduced-Rank Regression Model with Unknown Covariance
Firstly, we write the multivariate regression model in matrix notation as
$$ Y = XB + E, \qquad \operatorname{rank}(B) \le r, \qquad (1) $$
resulting in the famous RRR model, where $Y \in \mathbb{R}^{n \times q}$ is the response matrix, $X \in \mathbb{R}^{n \times p}$ is the predictor matrix, $r \le \min(p, q)$ denotes the rank, $B$ is a $p \times q$ coefficient matrix, and $E$ is a random $n \times q$ error matrix whose rows are independent draws from the $N(0, \Sigma)$ distribution.
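As a point of reference, the following minimal numpy sketch (not from the paper; all names and dimensions are illustrative) computes the classical RRR estimator in (1) by truncating the ordinary least squares fit to rank $r$, the standard closed-form solution when the error covariance is taken to be the identity.

```python
import numpy as np

def rrr_fit(Y, X, r):
    """Classical reduced-rank regression: project the OLS fit onto its
    top-r response directions (identity error covariance assumed)."""
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)    # p x q OLS coefficients
    _, _, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
    P_r = Vt[:r].T @ Vt[:r]                          # q x q rank-r projector
    return B_ols @ P_r                               # rank-r coefficient matrix

# Toy example with simulated data (illustrative sizes only)
rng = np.random.default_rng(0)
n, p, q, r = 100, 8, 5, 2
X = rng.standard_normal((n, p))
B_true = rng.standard_normal((p, r)) @ rng.standard_normal((r, q))
Y = X @ B_true + 0.1 * rng.standard_normal((n, q))
B_hat = rrr_fit(Y, X, r)
print(np.linalg.matrix_rank(B_hat))                  # -> 2
```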
Ref. [17] investigated outlier detection from the perspective of adding a mean-shift parameter for each data point in the case of a one-dimensional response variable. Ref. [14] introduced a multivariate mean-shift regression model for detecting outliers. We introduce a mean-shift regression model that also accounts for dependence among the multivariate responses. The mean-shift multivariate regression model can be written as
$$ Y = XB + C + E, \qquad (2) $$
where $Y$ is the $n \times q$ response matrix, $X$ is the $n \times p$ predictor matrix, $B$ is the $p \times q$ coefficient matrix, $C$ is the $n \times q$ matrix which describes the outlying effects on $Y$, and $E$ is the $n \times q$ error matrix whose rows are independent draws from $N(0, \Sigma)$, which is a symmetric distribution. The unknown parameters in the model can be collectively written as $(B, C, \Sigma)$ or $(B, C, \Omega)$, where $\Omega = \Sigma^{-1}$ denotes the precision matrix. The assumption that the errors are correlated is necessary for many applications, for example, when the predictors are not sufficient to eliminate all the correlations between the dependent variables. Obviously, this leads to an over-parameterized model, so we must regularize the unknown matrices appropriately. We assume that $B$ has a low-rank structure and that $C$ is a sparse matrix with only a few nonzero entries, corresponding to the observations that are inconsistent with the majority of the data. Thus, the following sparse robust reduced-rank regression problem is proposed
$$ \min_{B, C, \Omega} \; \frac{1}{2} \operatorname{tr}\!\left[ \Omega (Y - XB - C)^{\top} (Y - XB - C) \right] - \frac{n}{2} \log \det \Omega + \lambda_{\Omega} \sum_{j \neq k} |\omega_{jk}| + P_{1}(B; \lambda_{1}) + P_{2}(C; \lambda_{2}), \quad \text{s.t. } \operatorname{rank}(B) \le r, \qquad (3) $$
where $P_{1}(\cdot; \lambda_{1})$ and $P_{2}(\cdot; \lambda_{2})$ are the penalty functions for $B$ and $C$, with penalty parameters $\lambda_{1}$ and $\lambda_{2}$, respectively. The term $\omega_{jk}$ is the $(j,k)$ entry of $\Omega$, and the lasso penalty on the off-diagonal entries of $\Omega$ serves two purposes. First, it ensures that the optimal solution for $\Omega$ has a finite objective function value when there are more responses than samples; second, the penalty yields a sparse estimate. The penalty $P_{1}$ encourages a sparse estimate of the coefficient matrix $B$, and the term $P_{2}$ is used to obtain a sparse estimate of $C$. The quantities $\lambda_{1}$, $\lambda_{2}$ and $\lambda_{\Omega}$ are the tuning parameters. Based on the symmetry of the multivariate normal distribution, the covariance matrix, or equivalently the precision matrix, must be symmetric, which reduces the number of parameters and improves the efficiency of parameter estimation. Following reference [18], we estimate the precision matrix using the Gaussian likelihood together with a sparsity-inducing $\ell_{1}$-penalty on all of its entries,
$$ \min_{\Omega \succ 0} \; \operatorname{tr}(\hat{\Sigma} \Omega) - \log \det \Omega + \lambda \sum_{j,k} |\omega_{jk}|, \qquad (4) $$
where $\operatorname{tr}$ is the trace of a matrix and $\hat{\Sigma}$ is the sample covariance matrix.
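For intuition about how the mean-shift term enters model (2), the following purely illustrative simulation (dimensions, seed and names are hypothetical and not taken from the paper's studies) generates data with a low-rank $B$, a row-sparse outlier matrix $C$, and correlated errors.

```python
import numpy as np

# Illustrative simulation from the mean-shift model Y = X B + C + E of (2);
# all sizes and names are hypothetical, not taken from the paper's studies.
rng = np.random.default_rng(1)
n, p, q, r, n_out = 200, 10, 6, 2, 10

X = rng.standard_normal((n, p))
B = rng.standard_normal((p, r)) @ rng.standard_normal((r, q))   # low-rank coefficients

C = np.zeros((n, q))                        # mean-shift matrix: mostly zero rows
out_rows = rng.choice(n, size=n_out, replace=False)
C[out_rows] = 8.0 * rng.standard_normal((n_out, q))             # a few outlying rows

A = rng.standard_normal((q, q))
Sigma = 0.5 * (A @ A.T) / q + 0.5 * np.eye(q)                   # correlated error covariance
E = rng.multivariate_normal(np.zeros(q), Sigma, size=n)

Y = X @ B + C + E                           # observed responses
```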
As a special case of model (2) and problem (3), setting $C = 0$ and $\Sigma = \sigma^{2} I_{q}$ results in the reduced-rank regression (RRR) model given in book [19]. The reduced-rank restriction imposes a number of linear constraints on the regression coefficients, which reduces the number of parameters and improves the estimation efficiency. According to the rank constraint, $B$ can be expressed as a product of two rank-$r$ matrices as follows
$$ B = S V^{\top}, \qquad S \in \mathbb{R}^{p \times r}, \; V \in \mathbb{R}^{q \times r}. \qquad (5) $$
Inspired by the penalized regression with a grouped lasso penalty in the literature [12,20], consider the following optimization problem
$$ \min_{S, V} \; \frac{1}{2} \| Y - X S V^{\top} \|_{F}^{2} + \lambda \sum_{j=1}^{p} \| s_{j} \|_{2}, \qquad \text{s.t. } V^{\top} V = I_{r}, \qquad (6) $$
where $s_{j}$ is the $j$th row of $S$, $\lambda$ is the corresponding tuning parameter, and $I_{r}$ is the identity matrix of order $r$. The sparsity of the rows means that the unselected variables are not associated with any response variable. An element-wise sparsity penalty can also be applied to $S$. These two types of sparsity can be blended: firstly, row sparsity is employed in a screening step to reduce $p$ to some desired dimension $d$ ($d < p$), and then element-wise sparsity is pursued in each individual loading vector. In this paper, similar to sparse reduced-rank regression (SRRR) in the literature [12], only the row-wise sparsity is considered for $S$.
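The grouped lasso penalty on the rows of $S$ is typically handled through a row-wise (group) soft-thresholding operator; the sketch below is a generic implementation of that proximal map, included only as an illustration and not taken from the paper.

```python
import numpy as np

def row_soft_threshold(S, lam):
    """Group soft-thresholding (proximal map of lam * sum_j ||s_j||_2):
    shrink each row of S by lam in Euclidean norm and set rows whose norm
    does not exceed lam exactly to zero."""
    norms = np.linalg.norm(S, axis=1, keepdims=True)
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)
    return S * scale
```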
The problem (3) can then be expressed as
$$ \min_{S, V, C, \Omega} \; \frac{1}{2} \operatorname{tr}\!\left[ \Omega (Y - X S V^{\top} - C)^{\top} (Y - X S V^{\top} - C) \right] - \frac{n}{2} \log \det \Omega + \lambda_{\Omega} \sum_{j \neq k} |\omega_{jk}| + \sum_{j=1}^{p} P(\| s_{j} \|_{2}; \lambda_{1}) + P_{2}(C; \lambda_{2}), \quad \text{s.t. } V^{\top} V = I_{r}. \qquad (7) $$
As mentioned above, the assumptions of the model are very mild and, therefore, it can be applied in a wide range of applications. In addition, the model parameters are appropriately regularized, so the model fitting is robust and the results are readily interpretable. As shown in Theorem 2 of paper [14], when the covariance matrix is not considered, the mean-shift model is equivalent to a robust M-estimation problem for $B$. Both the simulation studies and the form of the model itself demonstrate the robustness of the method.
  3. Computational Algorithm
This section presents an algorithm for solving the optimization problem (
7), iteratively optimizing with respect to 
, 
, 
 and 
.
For given $(S, V, C)$, the problem of solving for the precision matrix $\Omega$ can be expressed as
$$ \min_{\Omega \succ 0} \; \frac{n}{2}\left[ \operatorname{tr}(\hat{\Sigma}_{R} \Omega) - \log \det \Omega \right] + \lambda_{\Omega} \sum_{j \neq k} |\omega_{jk}|, \qquad (8) $$
where $\hat{\Sigma}_{R} = \frac{1}{n} (Y - X S V^{\top} - C)^{\top} (Y - X S V^{\top} - C)$. Reference [21] describes how this problem can be solved by applying the DP-GLASSO algorithm in the R package dpglasso. The DP-GLASSO algorithm optimizes the primal objective over the precision matrix itself, whereas the GLASSO algorithm given in reference [22] solves the dual of the graphical lasso penalized likelihood in terms of the covariance matrix. However, reference [21] shows that the DP-GLASSO algorithm performs better than the GLASSO algorithm.
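As an illustration of the $\Omega$-step (8), the sketch below uses scikit-learn's graphical_lasso, which solves the same $\ell_{1}$-penalized Gaussian likelihood with a GLASSO-type solver, as a stand-in for DP-GLASSO (available only as the R package dpglasso); the function name, arguments, and the rescaling of the tuning parameter are illustrative assumptions.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def omega_step(Y, X, B, C, alpha):
    """Omega-update: l1-penalized Gaussian likelihood on the residual covariance.
    Illustrative stand-in: the paper uses DP-GLASSO, while scikit-learn's
    graphical_lasso solves the same penalized likelihood (off-diagonal penalty)
    with a GLASSO-type solver; alpha plays the role of a rescaled lambda_Omega."""
    R = Y - X @ B - C                        # residuals after removing mean shifts
    Sigma_R = (R.T @ R) / Y.shape[0]         # empirical residual covariance
    _, Omega = graphical_lasso(Sigma_R, alpha=alpha)
    return Omega
```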
For fixed $C$ and $\Omega$, the objective of minimization over $B$ is
$$ \frac{1}{2} \operatorname{tr}\!\left[ \Omega (Y - XB - C)^{\top} (Y - XB - C) \right] + P_{1}(B; \lambda_{1}), \qquad (9) $$
subject to the constraint $\operatorname{rank}(B) \le r$. Applying the transformation $\tilde{Y} = (Y - C)\Omega^{1/2}$ and $\tilde{B} = B\Omega^{1/2}$, we have $\operatorname{rank}(\tilde{B}) = \operatorname{rank}(B)$, and a row of $\tilde{B}$ is zero exactly when the corresponding row of $B$ is zero. The first term in the objective function (9) can be written as
$$ \frac{1}{2} \operatorname{tr}\!\left[ \Omega (Y - XB - C)^{\top} (Y - XB - C) \right] = \frac{1}{2} \| \tilde{Y} - X \tilde{B} \|_{F}^{2}, \qquad (10) $$
where $\Omega^{1/2}$ denotes the symmetric square root of $\Omega$. When $\Omega = I_{q}$, the problem (9) becomes a SRRR problem given in the literature [12], which can be solved iteratively. In general, writing $\tilde{B} = S V^{\top}$ with $V^{\top} V = I_{r}$ leads to the SRRR-type subproblem
$$ \min_{S, V} \; \frac{1}{2} \| \tilde{Y} - X S V^{\top} \|_{F}^{2} + \sum_{j=1}^{p} P(\| s_{j} \|_{2}; \lambda_{1}), \qquad \text{s.t. } V^{\top} V = I_{r}, \qquad (11) $$
which is solved by alternately updating $S$ and $V$. For the sparsity-inducing penalty functions $P(\cdot; \lambda)$, there are many choices. The $\ell_{1}$ penalty in reference [23] is the most popular in the sparse penalty literature, but it suffers from inconsistency and biased estimates, especially when the predictors are correlated. To alleviate these problems, non-convex penalties such as the $\ell_{p}$ ($0 < p < 1$) penalty, SCAD, and the capped $\ell_{1}$ penalty have been proposed to approximate the ideal non-convex $\ell_{0}$ penalty.
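To make the penalty-threshold correspondence used below concrete, here is a small sketch of three scalar thresholding rules (soft thresholding for $\ell_{1}$, hard thresholding for $\ell_{0}$, and the SCAD rule); the formulas follow the standard literature and are not code from the paper.

```python
import numpy as np

def soft_threshold(t, lam):                     # corresponds to the l1 penalty
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def hard_threshold(t, lam):                     # corresponds to the l0 penalty
    return np.where(np.abs(t) > lam, t, 0.0)

def scad_threshold(t, lam, a=3.7):              # SCAD thresholding rule (Fan-Li form)
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 2 * lam,
                    soft_threshold(t, lam),
                    np.where(np.abs(t) <= a * lam,
                             ((a - 1) * t - np.sign(t) * a * lam) / (a - 2),
                             t))
```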
To solve the S-optimization problem for a general penalty function $P$, we propose to use a thresholding-rule-based $\Theta$-estimator, since most penalty functions correspond to a thresholding rule. The threshold-based iterative selection procedure (TISP) in reference [24] can be used to solve the penalized problem for any $P$ associated with a thresholding rule $\Theta$ (an odd, unbounded monotone shrinkage function) given in reference [17].
According to reference [25], a $\Theta$-estimator is linked to a general penalty function $P$ by
$$ P(t; \lambda) = \int_{0}^{|t|} \left[ \sup\{ s : \Theta(s; \lambda) \le u \} - u \right] \mathrm{d}u + q(t; \lambda), \qquad (12) $$
for some nonnegative $q(\cdot; \lambda)$ such that $q(\Theta(s; \lambda); \lambda) = 0$ for all $s$. Any thresholding rule can be handled in this way, and the technique covers all popular penalties, including but not limited to the aforementioned $\ell_{1}$, SCAD, $\ell_{p}$ and $\ell_{0}$ penalties, so the method is broadly applicable. The partial derivative of the smooth part of (11), $f(S) = \frac{1}{2}\| \tilde{Y} - X S V^{\top} \|_{F}^{2}$, with respect to $S$ can be calculated as
$$ \frac{\partial f(S)}{\partial S} = -X^{\top}\left( \tilde{Y} - X S V^{\top} \right) V = X^{\top} X S - X^{\top} \tilde{Y} V. $$
The closed-form solution for $S$ given $V$ is
$$ S^{(k+1)} = \vec{\Theta}\!\left( S^{(k)} + \frac{1}{\rho} X^{\top}\left( \tilde{Y} V - X S^{(k)} \right); \lambda_{1} \right), \qquad (13) $$
where $\rho$ is a scaling constant no smaller than the largest eigenvalue of $X^{\top} X$ and $\vec{\Theta}$ is the multivariate (row-wise) thresholding function corresponding to the penalty $P$. Further details are given in the literature [25].
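Below is a sketch of one TISP-type update for $S$ given $V$, following the reconstruction of (13) above and using row-wise soft-thresholding as the rule $\vec{\Theta}$ for the grouped $\ell_{1}$ penalty; the $\lambda$ scaling and all names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def s_step_tisp(S, V, Y_tilde, X, lam, rho):
    """One TISP-type update for S given V (cf. the reconstructed (13)):
    gradient step on 0.5*||Y_tilde - X S V^T||_F^2, then row-wise thresholding.
    Assumes V has orthonormal columns; rho should be no smaller than the
    largest eigenvalue of X^T X."""
    grad = X.T @ (X @ S - Y_tilde @ V)           # gradient of the smooth loss w.r.t. S
    Z = S - grad / rho                           # gradient (Landweber) step
    # Row-wise soft-thresholding as the multivariate threshold rule for the
    # grouped l1 penalty; other rules (hard, SCAD, quantile) slot in here.
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z * np.maximum(1.0 - (lam / rho) / np.maximum(norms, 1e-12), 0.0)
```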
When $S$ is given, the optimization problem for $V$ becomes
$$ \min_{V} \; \frac{1}{2} \| \tilde{Y} - X S V^{\top} \|_{F}^{2}, \qquad \text{s.t. } V^{\top} V = I_{r}. \qquad (14) $$
This can be identified as a Procrustes rotation problem, realizable through computing the singular value decomposition $\tilde{Y}^{\top} X S = U D W^{\top}$. The optimal $V = U W^{\top}$. The full algorithm is presented in Algorithm 1.
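A minimal sketch of the Procrustes $V$-update in (14); the SVD target $\tilde{Y}^{\top} X S$ follows the reconstruction above, and the names are illustrative.

```python
import numpy as np

def v_step_procrustes(S, Y_tilde, X):
    """V-update as an orthogonal Procrustes rotation: maximize tr(V^T Y_tilde^T X S)
    over V with orthonormal columns via the SVD of Y_tilde^T X S."""
    M = Y_tilde.T @ X @ S                   # q x r matrix
    U, _, Wt = np.linalg.svd(M, full_matrices=False)
    return U @ Wt                           # optimal q x r V with V^T V = I_r
```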
For fixed $\Omega$, $S$ and $V$, the problem of solving for $C$ reduces to the minimization of
$$ \ell(C) + P_{2}(C; \lambda_{2}), \qquad \ell(C) = \frac{1}{2} \operatorname{tr}\!\left[ \Omega (Y - XB - C)^{\top} (Y - XB - C) \right]. \qquad (15) $$
Suppose $R = Y - XB$ denotes the current residual without the mean-shift term. To solve problem (15), construct a surrogate function
$$ g(C; C^{(k)}) = \ell(C^{(k)}) + \operatorname{tr}\!\left[ \nabla \ell(C^{(k)})^{\top} (C - C^{(k)}) \right] + \frac{\rho}{2} \| C - C^{(k)} \|_{F}^{2} + P_{2}(C; \lambda_{2}), \qquad (16) $$
where $\nabla \ell(C^{(k)}) = -(R - C^{(k)})\Omega$ is the gradient of $\ell$ with respect to $C$. Now define the $(k+1)$th iterate as
$$ C^{(k+1)} = \arg\min_{C} \; g(C; C^{(k)}). \qquad (17) $$
The problem of minimizing the objective thus boils down to the $g$-optimization in (17), which is simpler than directly minimizing (15). We rewrite the problem in the form of
$$ \min_{C} \; \frac{1}{2} \left\| C - \left( C^{(k)} - \frac{1}{\rho} \nabla \ell(C^{(k)}) \right) \right\|_{F}^{2} + \frac{1}{\rho} P_{2}(C; \lambda_{2}). \qquad (18) $$
To solve problem (18) for a general penalty function $P_{2}$, we again propose to use thresholding-rule-based $\Theta$-estimators. The $\ell_{0}$ penalty has the appealing property of promoting sparsity and can directly restrict the cardinality of nonzero elements or rows in the estimated matrix. Moreover, tuning the constraint form $\| C \|_{0} \le \alpha n q$ is intuitive and easy compared with the penalty form $\lambda \| C \|_{0}$; here $\alpha$ is used as an upper bound for the percentage of nonzero terms. In this paper, to guarantee the sparsity of $C$, we advise using the $\ell_{0}$ constraint in place of the $\ell_{0}$ penalty for ease of tuning. By employing a quantile thresholding rule $\Theta^{\#}$, this problem can be solved through TISP easily. The quantile thresholding rule is a special case of hard thresholding, which corresponds to the $\ell_{0}$ penalty. Let $\Theta^{\#}(\cdot; \alpha)$ be the element-wise quantile threshold function; then
$$ \left[ \Theta^{\#}(M; \alpha) \right]_{ij} = m_{ij} \, 1\{ |m_{ij}| \ge \lambda \}, \qquad (19) $$
where $\lambda$ is the $(1 - \alpha)$ quantile of the $\{ |m_{ij}| \}$. The tuning parameter $\alpha$ has a defined range $[0, 1)$ and serves as an upper bound for the nonzero percentage, which is easier to interpret.
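A short sketch of the element-wise quantile thresholding operator as reconstructed in (19); names are illustrative.

```python
import numpy as np

def quantile_threshold(M, alpha):
    """Element-wise quantile thresholding: keep roughly a fraction alpha of the
    entries of M with the largest absolute values and set the rest to zero."""
    if alpha <= 0:
        return np.zeros_like(M)
    lam = np.quantile(np.abs(M), 1.0 - alpha)    # (1 - alpha) quantile of |m_ij|
    return np.where(np.abs(M) >= lam, M, 0.0)
```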
The row sparsity regularization can be used for variable selection in our multiple-response problem and is used in our main algorithm. In particular, the row-wise $\ell_{0}$ constraint calls for the multivariate version of quantile thresholding $\vec{\Theta}^{\#}(\cdot; \alpha)$:
$$ \left[ \vec{\Theta}^{\#}(M; \alpha) \right]_{i \cdot} = m_{i \cdot} \, 1\{ \| m_{i \cdot} \|_{2} \ge \lambda \}, \qquad (20) $$
where $\lambda$ is the $(1 - \alpha)$ quantile of the row norms $\| m_{i \cdot} \|_{2}$. The row-wise constraint version can be realized by simply changing $\Theta^{\#}$ to $\vec{\Theta}^{\#}$.
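The row-wise (multivariate) variant, as reconstructed in (20), can be sketched analogously; again the names are illustrative.

```python
import numpy as np

def row_quantile_threshold(M, alpha):
    """Row-wise (multivariate) quantile thresholding: keep roughly a fraction
    alpha of the rows of M with the largest Euclidean norms, zero out the rest.
    alpha is expected to lie in (0, 1)."""
    norms = np.linalg.norm(M, axis=1)
    lam = np.quantile(norms, 1.0 - alpha)        # (1 - alpha) quantile of row norms
    out = M.copy()
    out[norms < lam] = 0.0
    return out
```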
Both the row-wise sparsity and the element-wise sparsity are considered for $C$, to find outlier rows and outlier elements, respectively.
With the SRRR subproblem constructed as in (11), our algorithm iteratively applies the DP-GLASSO, the SRRR, and simple multivariate thresholding operations until the objective function (7) has converged. The complete numerical algorithm is summarized in Algorithm 1.
      
Algorithm 1 Algorithm to solve Cov-SR4
Require: $Y$, $X$, $r$: the desired rank; $M$: the maximum iteration number; $\epsilon$: error tolerance; and the initial estimates $\Omega^{(0)}$, $C^{(0)}$, $(S^{(0)}, V^{(0)})$.
1: $k \leftarrow 0$;
2: repeat
3:   for fixed $(S^{(k)}, V^{(k)}, C^{(k)})$, solve $\Omega^{(k+1)}$ with objective function (8) using the DP-GLASSO algorithm;
4:   for fixed $(C^{(k)}, \Omega^{(k+1)})$, solve $(S^{(k+1)}, V^{(k+1)})$ defined by the SRRR problem (11);
5:   for fixed $(S^{(k+1)}, V^{(k+1)}, \Omega^{(k+1)})$, use the multivariate thresholding operations to solve problem (18) and obtain $C^{(k+1)}$;
6:   $k \leftarrow k + 1$;
7:   calculate $F^{(k)}$ and $F^{(k-1)}$ according to Equation (7);
8: until $|F^{(k)} - F^{(k-1)}| \le \epsilon$ or $k \ge M$
9: return $S$, $V$, $C$ and $\Omega$.
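To tie the steps together, the following schematic Python skeleton mirrors Algorithm 1 under this section's reconstruction, reusing the illustrative helpers sketched earlier (omega_step, s_step_tisp, v_step_procrustes, row_quantile_threshold); it is an outline under stated assumptions, not the authors' implementation.

```python
import numpy as np

def sym_sqrt(Omega):
    """Symmetric square root of an SPD matrix and its inverse."""
    w, U = np.linalg.eigh(Omega)
    return U @ np.diag(np.sqrt(w)) @ U.T, U @ np.diag(1.0 / np.sqrt(w)) @ U.T

def cov_sr4(Y, X, r, alpha_C, lam_S, lam_omega, rho, max_iter=100, tol=1e-6):
    """Schematic block coordinate descent for objective (7), wired from the
    illustrative helpers sketched earlier; rho is a step-size constant taken
    large enough (e.g., no smaller than the largest eigenvalue of X^T X)."""
    n, q = Y.shape
    p = X.shape[1]
    rng = np.random.default_rng(0)
    S = np.zeros((p, r))                               # loading factor (B_tilde = S V^T)
    V = np.linalg.qr(rng.standard_normal((q, r)))[0]   # orthonormal factor
    C = np.zeros((n, q))                               # mean-shift outlier matrix
    Omega = np.eye(q)
    W, W_inv = sym_sqrt(Omega)
    F_old = np.inf
    for _ in range(max_iter):
        B = S @ V.T @ W_inv                            # coefficients on the original scale
        Omega = omega_step(Y, X, B, C, lam_omega)      # step 3: precision update
        W, W_inv = sym_sqrt(Omega)
        Y_tilde = (Y - C) @ W                          # transformed response of (10)
        for _ in range(5):                             # step 4: a few inner SRRR sweeps
            S = s_step_tisp(S, V, Y_tilde, X, lam_S, rho)
            V = v_step_procrustes(S, Y_tilde, X)
        B = S @ V.T @ W_inv
        G = -(Y - X @ B - C) @ Omega                   # gradient of the loss w.r.t. C
        C = row_quantile_threshold(C - G / rho, alpha_C)   # step 5: outlier rows
        R = Y - X @ B - C                              # step 7: monitor objective (7)
        # penalty terms omitted from this monitoring value for brevity
        F_new = 0.5 * np.trace(Omega @ R.T @ R) - 0.5 * n * np.log(np.linalg.det(Omega))
        if abs(F_old - F_new) < tol:                   # step 8: stopping rule
            break
        F_old = F_new
    return S, V, C, Omega
```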
Theorem 1. Assume the largest eigenvalue of $X^{\top} X$ equals $L$. Then, as long as $\rho \ge L$, the sequence of iterates $(S^{(k)}, V^{(k)}, C^{(k)}, \Omega^{(k)})$ produced by Algorithm 1 satisfies
$$ F(S^{(k+1)}, V^{(k+1)}, C^{(k+1)}, \Omega^{(k+1)}) \le F(S^{(k)}, V^{(k)}, C^{(k)}, \Omega^{(k)}), $$
where $F$ denotes the objective function (7). That is, the objective function values are non-increasing during the iteration.  Proof.  Firstly, for fixed $S$, $V$ and $C$, the optimization for $\Omega$ with objective function (8) using the DP-GLASSO algorithm is a block coordinate descent (BCD), which is given in reference [21] and decreases the objective at every step.
Secondly, for fixed $\Omega$, $V$ and $C$, the optimization for $S$ with the closed-form solution (13) is equivalent to the minimization of the following surrogate function
$$ g(S; S^{(k)}) = f(S^{(k)}) + \operatorname{tr}\!\left[ \nabla f(S^{(k)})^{\top} (S - S^{(k)}) \right] + \frac{\rho}{2} \| S - S^{(k)} \|_{F}^{2} + \sum_{j=1}^{p} P(\| s_{j} \|_{2}; \lambda_{1}), $$
where $f(S) = \frac{1}{2} \| \tilde{Y} - X S V^{\top} \|_{F}^{2}$. Let $S^{(k+1)} = \arg\min_{S} g(S; S^{(k)})$. Then
$$ f(S^{(k+1)}) + \sum_{j=1}^{p} P(\| s_{j}^{(k+1)} \|_{2}; \lambda_{1}) \le g(S^{(k+1)}; S^{(k)}) \le g(S^{(k)}; S^{(k)}) = f(S^{(k)}) + \sum_{j=1}^{p} P(\| s_{j}^{(k)} \|_{2}; \lambda_{1}), $$
where the first inequality comes from the Taylor expansion and $\rho \ge L$, and the second inequality holds because $S^{(k+1)}$ is a minimum point of $g(\cdot; S^{(k)})$. The monotonicity of the V-optimization step is due to the fact that it is a closed-form minimization.
The monotonicity of the C-optimization step can be deduced from the following inequality
$$ \ell(C^{(k+1)}) + P_{2}(C^{(k+1)}; \lambda_{2}) \le g(C^{(k+1)}; C^{(k)}) \le g(C^{(k)}; C^{(k)}) = \ell(C^{(k)}) + P_{2}(C^{(k)}; \lambda_{2}), $$
where the first inequality comes from the Taylor expansion and $\rho \ge L$, and the second inequality holds because $C^{(k+1)}$ is a minimum point of $g(\cdot; C^{(k)})$.    □