Solving High-Dimensional Problems in Statistical Modelling: A Comparative Study

In this work, we present numerical methods appropriate for parameter estimation in high-dimensional statistical modelling. The solution of these problems is not unique, and a crucial question arises regarding the way a solution can be found. A common choice is to keep the solution with the minimum norm. There are cases in which this solution is not adequate and regularisation techniques have to be considered. We identify the cases in which regularisation is required and those in which it is not. We present a thorough comparison among existing methods, both for estimating the coefficients of the model when the design matrix has correlated covariates and for variable selection in supersaturated designs. An extensive analysis of the properties of design matrices with correlated covariates is given. Numerical results for simulated and real data are presented.


Introduction
Many fields of science, and especially health studies, require the solution of problems in which the number of characteristics is larger than the sample size. These problems are referred to as high-dimensional problems. In the present paper, we focus on solving high-dimensional problems in statistical modelling.
We consider the linear regression model
y = Xβ + ε, (1)
where X = [1 x_1 ··· x_d] is the design matrix of order n × (d + 1), which is assumed to be high-dimensional, i.e., n < d. The columns x_i ∼ N(0_n, σ_i² I_n), i = 1, 2, . . . , d, are the correlated covariates of the model, and all the elements of the first column of the design matrix are equal to 1, corresponding to the mean effect. The response vector y has length n, and ε = (ε_1, ε_2, . . . , ε_n)^T is the n-vector of independent and identically distributed (i.i.d.) random errors, where ε_i ∼ N(0, σ²) for all i = 1, 2, . . . , n.
In the present study we focus on the following two points.

1.
Estimation of the regression parameter β ∈ R^{d+1}. From a numerical linear algebra point of view, the statistical model (1) can be considered as an underdetermined system. Such a system has infinitely many solutions. The first way to determine the desired vector β is to keep the solution with the minimum norm. This solution is referred to as the minimum norm solution (MNS) [1] (p. 264). Another way of solving these problems is based on regularisation techniques. Specifically, these methods allow us to solve a different problem which has a unique solution and thus to estimate the desired vector β. One of the most popular regularisation methods is Tikhonov regularisation [2]. Another regularisation technique in use is the ℓp-ℓq regularisation [3,4]. It is of major importance to decide whether problem (1) can be solved directly in the least squares sense or whether regularisation is required. Therefore, we describe a way of choosing the appropriate method for solving (1) for design matrices with correlated covariates. We study the properties of these matrices extensively and prove that, as the correlation of the covariates increases, the generalised condition number of the design matrix increases as well, and thus the design matrix becomes ill-conditioned.

2.
Specification of the most important factors of the statistical model. Variable selection is a major issue in solving high-dimensional problems. By variable selection we refer to the specification of the important variables (active factors) in the linear regression model, i.e., the variables which play a crucial role in the model. The rest of the variables (inactive factors) can be omitted. We deal with variable selection in supersaturated designs (SSDs), which are fractional factorial designs in which the run size is less than the number of all the main effects. In this class of designs, the columns of X, except the first column, have elements ±1. The symbols 1 and −1 are usually utilised to denote the high and low level of each factor, respectively. The correlation of SSDs is usually small, i.e., r ≤ 0.5. The analysis of SSDs is a central issue in Statistics, and many methods for analysing these designs have been proposed. In [5], a Dantzig selector was introduced. Recently, a sure independence screening method has been applied in a model selection method for SSDs [6], and a support vector machine recursive feature elimination method has been used for feature selection [7]. In our study, as we want to retain sparsity in variable selection, we adopt the ℓp-ℓq regularisation and the SVD principal regression method [8] in order to determine the most important factors of the statistical model.
In the regression model (1), there is no error in the design matrix X which defines the model; X is always considered an unperturbed matrix with covariates from the normal distribution and with well-determined rank. However, we assume i.i.d. random errors ε = (ε_1, ε_2, . . . , ε_n)^T, ε_i ∼ N(0, σ²) for all i = 1, 2, . . . , n, incorporated in the model as given by relation (1). Thus, we deal with problems that are well-posed on the given set of data, in the sense of [9].
The paper is organised as follows. In Section 2, we briefly present some methods for solving high-dimensional problems. We initially display the MNS, and subsequently we present two regularisation methods; specifically, Tikhonov regularisation and a general regularisation technique, the ℓp-ℓq regularisation method, are discussed. The described methods are used for estimating the regression parameter β of (1) for design matrices with correlated covariates, and the results are given in Section 3. These methods can be applied to ill-posed problems as well. Variable selection for SSDs can be found in Section 4. We close this work with several concluding remarks in Section 5.

Methods Overview
In this section, we present some methods for solving high-dimensional problems.

Minimum Norm Solution
The system (1), which is underdetermined, does not have a unique solution; it has infinitely many solutions, and we seek the solution whose norm is minimised, i.e., the minimum norm solution (MNS): among all minimisers of ||y − Xβ||²₂ over β ∈ R^{d+1}, the one with minimum ||β||₂ [1] (p. 264). A necessary and sufficient condition characterising the MNS is given in the following theorem.
Theorem 1. Let X ∈ R^{n×(d+1)} be a high-dimensional matrix, i.e., n < d, with rank(X) = n, and let β* be a solution of the underdetermined system Xβ = y. Then, β* is the MNS if and only if β* ∈ Range(X^T).
Proof. As β* is a solution of the underdetermined system Xβ = y, we have
Xβ* = y. (2)
Let us consider the QR factorisation of X^T, i.e., X^T = Q [R_1; 0], where Q ∈ R^{(d+1)×(d+1)} is orthogonal and R_1 ∈ R^{n×n} is upper triangular. Therefore, (2) can be rewritten as R_1^T Q^T β* = y. If we set z = Q^T β* and partition z = (z_1^T, z_2^T)^T with z_1 ∈ R^n, then R_1^T z_1 = y determines z_1 uniquely, whereas z_2 is free. Moreover, we have ||β*||²₂ = ||z_1||²₂ + ||z_2||²₂, which is minimised if and only if z_2 = 0, i.e., if and only if β* lies in the span of the first n columns of Q, that is, β* ∈ Range(X^T). Taking into account the result of the above theorem, we obtain the formula for the MNS β*, which is given by
β* = X^T (X X^T)^{−1} y. (5)
Formula (5) cannot be used directly for calculating the vector β, as it is not a stable computation. Therefore, we state Algorithm 1, a stable way of calculating the MNS through the singular value decomposition (SVD) of the design matrix X [1] (p. 265). The operation count for this algorithm is dominated by the computation of the SVD, which requires a cost of O(nd 2 ) flops.
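The SVD-based computation of the MNS can be sketched as follows. This is a minimal Python illustration (the paper's experiments use Julia), with function and variable names of our own choosing:

```python
import numpy as np

def minimum_norm_solution(X, y):
    """MNS of the underdetermined system X beta = y through the SVD of X,
    the stable analogue of beta* = X^T (X X^T)^{-1} y."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # economy-size SVD
    return Vt.T @ ((U.T @ y) / s)

# underdetermined example: n = 5 observations, d = 9 covariates
rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.standard_normal((5, 9))])
y = rng.standard_normal(5)
beta = minimum_norm_solution(X, y)
# beta solves X beta = y exactly and coincides with the pseudoinverse solution
```

Dividing the coefficients U^T y by the singular values and mapping back through V avoids forming X X^T, whose explicit inversion squares the condition number.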

The Discrete Picard Condition
It is crucial to identify when problem (1) can be solved directly with a satisfactory MNS and when different ways of handling the solution must be employed. In [10], a criterion is proposed for deciding whether a least squares problem can have a satisfactory direct solution or not. This criterion employs the SVD of the design matrix X and the discrete Picard condition as defined in [11,12]. Let u_i, v_i and s_i, i = 1, 2, . . . , n, denote the left singular vectors, the right singular vectors and the singular values of X, respectively. The discrete Picard condition ensures that the solution can be approximated by a regularised solution [13].
Definition 1 (The discrete Picard condition). The discrete Picard condition (DPC) requires that the ratios |c_i| / s_i, where c_i = u_i^T y, decrease to zero as i → n. The DPC implies that the coefficients |c_i| tend to zero faster than the singular values.
Example 1. Let us now consider design matrices of order 50 × 101 whose columns have the same variance σ² and the same correlation structure r. In particular, we test two design matrices X with (r, σ²) = (0.9, 0.25) and (r, σ²) = (0.999, 1). In Figure 1, we display the ratios |c_i| / s_i and |ĉ_i| / s_i, which correspond to the noise-free and the noisy problem, where c_i = u_i^T y, ĉ_i = u_i^T ŷ, i = 1, 2, . . . , n, and ŷ = y + ε. If the two graphs are close, the MNS is satisfactory; otherwise, regularisation techniques are necessary for deriving a good approximation of the desired vector β. As we can see in Figure 1, the values of the depicted ratios are very close for the design matrix with r = 0.9, whereas for the highly correlated matrix with r = 0.999 the ratios differ. This implies that a regularisation method is necessary in the second case.
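The ratios of Definition 1 are cheap to compute from the SVD. A Python sketch in the spirit of Example 1 (here, for simplicity, the covariates are uncorrelated and the noise level is our own choice):

```python
import numpy as np

def picard_ratios(X, y):
    """Ratios |c_i| / s_i from the SVD of X, with c_i = u_i^T y."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return np.abs(U.T @ y) / s

# compare noise-free and noisy ratios
rng = np.random.default_rng(1)
n, d = 50, 100
X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, d))])
y = X @ rng.standard_normal(d + 1)
rho_clean = picard_ratios(X, y)
rho_noisy = picard_ratios(X, y + 0.1 * rng.standard_normal(n))
# if the two curves stay close, the MNS is satisfactory; if they diverge
# where the singular values are small, regularisation is needed
```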

Regularisation Techniques
There are cases where the MNS β* cannot achieve a good approximation of the desired unknown solution β. Since, in the linear regression model (1), the design matrix X is always unperturbed, and thus its rank can be known a priori, we can adopt regularisation techniques. In the present section, we present two regularisation methods: the popular Tikhonov regularisation [2] and the ℓp-ℓq regularisation, which has recently received considerable attention [3,4]. Both of these techniques replace the initial problem with another one which is close to the original.

Tikhonov Regularisation
A regularisation method that is widely used is Tikhonov regularisation. The standard form of Tikhonov regularisation, applied to the linear regression model (1), is
min_{β ∈ R^{d+1}} { ||y − Xβ||²₂ + λ ||β||²₂ }, (6)
where λ > 0 is the regularisation parameter. The solution of the penalised least-squares problem (6) is given by the formula
β_λ = (X^T X + λ I)^{−1} X^T y.
As we can see, Tikhonov regularisation depends on the regularisation parameter λ. An appropriate method for selecting λ leads to a satisfactory approximation β_λ of the desired regression parameter β. The error in the input data for the statistical model that we study follows the normal distribution, i.e., ε ∼ N(0_n, σ² I_n). Therefore, the norm of the error ||ε||₂ is known. In the case of a known error norm, the appropriate method for selecting the regularisation parameter is the discrepancy principle, reported in Algorithm 2 [14] (p. 283). Following also the analysis presented in [9], and due to the uniqueness of λ for most reasonable values of ||ε||₂ (see, for example, [15]), we adopt this method for our study.
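A grid-search version of the discrepancy principle can be sketched as follows. This is a Python illustration under our own assumptions (the grid, dimensions and noise level are illustrative; Algorithm 2 in [14] is the authoritative statement):

```python
import numpy as np

def tikhonov(X, y, lam):
    """beta_lambda = (X^T X + lambda I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def discrepancy_principle(X, y, err_norm, grid):
    """Smallest lambda on the grid whose residual reaches ||eps||_2.
    The residual norm grows with lambda, so we scan the grid upwards."""
    for lam in sorted(grid):
        if np.linalg.norm(y - X @ tikhonov(X, y, lam)) >= err_norm:
            return lam
    return max(grid)

# small synthetic problem
rng = np.random.default_rng(2)
n, d = 20, 40
X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, d))])
eps = rng.standard_normal(n)
y = X @ rng.standard_normal(d + 1) + eps
lam = discrepancy_principle(X, y, np.linalg.norm(eps), np.logspace(-6, 2, 50))
beta_lam = tikhonov(X, y, lam)
```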

ℓp-ℓq Regularisation
A more general regularisation technique is the so-called ℓp-ℓq regularisation [3]. The main idea of this approach is to replace the least-squares minimisation of ||y − Xβ||₂ by an ℓp-ℓq minimisation problem of the form
min_{β ∈ R^{d+1}} { ||y − Xβ||_p^p + µ ||β||_q^q }, (7)
where µ > 0 is the regularisation parameter and 0 < p, q ≤ 2. We denote the solution of the minimisation problem (7) by β̂_µ. Remark 1. In the case p = q = 2, the regularised minimisation problem (7) reduces to Tikhonov regularisation.
Concerning the selection of the regularisation parameter, we choose the optimal value of µ, i.e., the value that minimises the error norm ||β_µ − β||₂ over a given grid of values for µ. Concerning the computational cost, the implementation of the ℓp-ℓq regularisation requires O(nd) flops.
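One common way to attack the nonconvex problem (7) is iteratively reweighted least squares, sketched below in Python. This is an illustrative solver of our own choosing, not necessarily the algorithm used in [3,4]; the smoothing constant `eps` guards the weights against division by zero:

```python
import numpy as np

def lp_lq_irls(X, y, mu, p=0.8, q=0.1, iters=50, eps=1e-6):
    """Iteratively reweighted least squares sketch for
    min ||y - X b||_p^p + mu * ||b||_q^q  with 0 < p, q <= 2."""
    b = X.T @ np.linalg.solve(X @ X.T, y)        # start from the MNS
    for _ in range(iters):
        r = y - X @ b
        wr = (np.abs(r) + eps) ** (p - 2)        # residual weights
        wb = (np.abs(b) + eps) ** (q - 2)        # penalty weights
        A = X.T @ (wr[:, None] * X) + mu * np.diag(wb)
        b = np.linalg.solve(A, X.T @ (wr * y))   # weighted normal equations
    return b

rng = np.random.default_rng(4)
X = np.hstack([np.ones((20, 1)), rng.standard_normal((20, 40))])
y = X @ rng.standard_normal(41) + 0.1 * rng.standard_normal(20)
b = lp_lq_irls(X, y, mu=1.0)
```

With q well below 1, the penalty weights blow up on small coefficients, which is what drives them towards zero and yields the sparsity exploited in Section 4.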

Design Matrix with Correlated Covariates
In high-dimensional applications, the design matrix X = [1 x_1 ··· x_d] has correlated covariates x_i ∼ N(0_n, σ_i² I_n), i = 1, . . . , d, where σ_i² is the variance of x_i and r_ij denotes the correlation between the covariates x_i and x_j. Next, we present a thorough investigation of the properties that characterise these matrices.

Correlated Covariates with Same Variance and Correlation
We initially consider design matrices whose correlated covariates have the same variance σ² and the same correlation r. In the following theorem, we formulate and prove in detail formulae for the singular values of the design matrix X. In [16], this case of design matrix is considered, and a brief description of the eigenvalues of the matrix X^T X is given.
Theorem 2. Let X = [1 x_1 ··· x_d] ∈ R^{n×(d+1)} be a high-dimensional design matrix of full rank whose columns x_i ∼ N(0_n, σ² I_n), i = 1, 2, . . . , d, have correlation structure r. The singular values of the matrix X are
s_1 = √n,  s_i = σ √((n − 1)(1 − r)), i = 2, . . . , n − 1,  s_n = σ √((n − 1)[(d − 1)r + 1]).
Proof. The n singular values of X are the square roots of the n non-zero eigenvalues of X^T X. We therefore compute the matrix X^T X. Its first row and column are (n, 0, . . . , 0), since 1^T 1 = n and 1^T x_i = Σ_j x_ji = 0 for all i = 1, . . . , d, due to the construction of the design matrix X according to the normal distribution. Therefore, the matrix X^T X has one eigenvalue equal to n. Moreover, we can express the variance σ² of each covariate in terms of vector norms as σ² = ||x_i − x̄_i 1||²₂ / (n − 1), where x̄_i denotes the mean value of x_i. As the mean value of each x_i is zero, we have σ² = ||x_i||²₂ / (n − 1). The submatrix X̃^T X̃ of X^T X, corresponding to the covariates, can be written as
X̃^T X̃ = (n − 1) σ² [(1 − r) I + r J],
where J is the d × d matrix with all elements equal to 1. The non-zero eigenvalues of X̃^T X̃ are λ_1 = (n − 1)σ²(1 − r), with algebraic multiplicity n − 2, and λ_2 = (n − 1)σ²[(d − 1)r + 1], with algebraic multiplicity 1. Therefore, the singular values of X are as stated above.
Let us denote by κ(X) the generalised condition number of X, i.e., κ(X) = ||X||₂ · ||X†||₂, where X† = X^T (X X^T)^{−1} is the pseudoinverse of X [1] (p. 246). It is known that the generalised condition number can be expressed in terms of the maximum singular value s_max and the minimum singular value s_min of X as κ(X) = s_max / s_min [1] (p. 216).
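The eigenvalue computation at the heart of the proof can be checked numerically. A short Python sketch (the parameter values are our own choice; the theorem keeps only the non-zero eigenvalues of this d × d matrix, together with the eigenvalue n):

```python
import numpy as np

# M = (n-1) sigma^2 [(1-r) I + r J] has eigenvalue (n-1) sigma^2 (1-r)
# with multiplicity d-1 and (n-1) sigma^2 [(d-1) r + 1] with multiplicity 1.
n, d, sigma2, r = 50, 100, 2.0, 0.9
M = (n - 1) * sigma2 * ((1 - r) * np.eye(d) + r * np.ones((d, d)))
eig = np.sort(np.linalg.eigvalsh(M))
small = (n - 1) * sigma2 * (1 - r)
large = (n - 1) * sigma2 * ((d - 1) * r + 1)
ok = np.allclose(eig[:-1], small) and np.isclose(eig[-1], large)
```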
In Theorem 3, we express the generalised condition number of X in terms of the correlation structure r.
Theorem 3. Let X = [1 x_1 ··· x_d] ∈ R^{n×(d+1)} be a high-dimensional design matrix of full rank whose columns x_i ∼ N(0_n, σ² I_n), i = 1, 2, . . . , d, have correlation structure r. The generalised condition number of X is given by
κ(X) = √( n / ((n − 1)σ²(1 − r)) ), if σ² ≤ n / ((n − 1)[(d − 1)r + 1]),
κ(X) = √( ((d − 1)r + 1) / (1 − r) ), if n / ((n − 1)[(d − 1)r + 1]) < σ² < n / ((n − 1)(1 − r)),
κ(X) = √( (n − 1)σ² [(d − 1)r + 1] / n ), if σ² ≥ n / ((n − 1)(1 − r)).
Proof. By Theorem 2, s_n = σ √((n − 1)[(d − 1)r + 1]) > s_i = σ √((n − 1)(1 − r)), i = 2, . . . , n − 1, holds. Therefore, we have to distinguish three cases for the position of s_1 = √n: the first case is s_1 ≥ s_n, the second case is s_i < s_1 < s_n, and the last one is s_1 ≤ s_i.
First case: if s_1 ≥ s_n, then s_max = s_1 and s_min = s_i, so κ(X) = s_1 / s_i = √( n / ((n − 1)σ²(1 − r)) ). The restriction s_1 ≥ s_n is equivalent to n ≥ (n − 1)σ²[(d − 1)r + 1], i.e., σ² ≤ n / ((n − 1)[(d − 1)r + 1]).
Second case: if s_i < s_1 < s_n, then s_max = s_n and s_min = s_i, so κ(X) = s_n / s_i = √( ((d − 1)r + 1) / (1 − r) ). The restriction s_i < s_1 < s_n is equivalent to n / ((n − 1)[(d − 1)r + 1]) < σ² < n / ((n − 1)(1 − r)).
Third case: if s_1 ≤ s_i, then s_max = s_n and s_min = s_1, so κ(X) = s_n / s_1 = √( (n − 1)σ² [(d − 1)r + 1] / n ). The restriction s_1 ≤ s_i is equivalent to σ² ≥ n / ((n − 1)(1 − r)).
Taking into consideration the derived formulae for the generalised condition number of the design matrix X, we see that if r ≈ 1 the generalised condition number κ(X) becomes large. A detailed example is presented next.

Example 2.
In this example, we plot the generalised condition number of X as a function of the correlation r. We consider n = 50, d = 100 and σ² = 2. In Figure 2, we display κ(X) for correlation r → 1. As we see in Figure 2, as the correlation r tends to 1, the generalised condition number κ(X) increases rapidly.
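The blow-up seen in Figure 2 can be reproduced from the middle-case formula of Theorem 3. A quick Python check (the grid of correlations is our own choice):

```python
import numpy as np

# middle case of Theorem 3: kappa(X) = sqrt(((d-1) r + 1) / (1 - r))
d = 100
rs = [0.9, 0.99, 0.999, 0.9999]
kappas = [float(np.sqrt(((d - 1) * r + 1) / (1 - r))) for r in rs]
# kappa grows without bound as r -> 1
```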

Highly Correlated Covariates with Different Variance and Correlation
Next, we consider a more general and more common case, in which the covariates x_i ∼ N(0_n, σ_i² I_n) of the design matrix X have different variances σ_i² and correlations r_ij, i, j = 1, . . . , d. Based on the results presented in [17] for the eigenvalues of the matrix X^T X, analytic formulae for the singular values of X can be recorded, as summarised in the following theorem. Theorem 4. Let X = [1 x_1 ··· x_d] ∈ R^{n×(d+1)} be a high-dimensional design matrix of full rank whose columns x_i ∼ N(0_n, σ_i² I_n), i = 1, 2, . . . , d, have a highly correlated structure r_ij, in the sense that 1 − r_ij = O(δ) as δ → 0. Then the singular values of the matrix X admit analytic expressions following [17]; in particular, s_min = O(δ).
As we noted in Section 3.1, the generalised condition number is equal to the ratio s_max / s_min, and in the present case s_min = O(δ), since 1 − r_ij = O(δ), i.e., the covariates are highly correlated. Therefore, the value of κ(X) is large, and this affects the solution of the corresponding problem.

Remark 2. As the correlation r increases, the generalised condition number κ(X) increases as well.
From Theorems 2 and 4 we deduce that the case of highly correlated covariates leads to possible instability, and thus regularisation is recommended. This result is confirmed by Table 1, presented in Section 3.3.

Numerical Implementation
The simulation study presented in this section and in Section 4 was implemented using the Julia programming language.
Given the high-dimensional design matrix X of order n × (d + 1), the response vector y of order n and the n-vector ε = (ε_1, ε_2, . . . , ε_n)^T of i.i.d. random errors, ε_j ∼ N(0, 1), j = 1, 2, . . . , n, we estimate the vector β by using the methods described in Section 2. We consider design matrices X with correlated covariates and distinguish the two aforementioned cases. The results for the first case, i.e., covariates with the same correlation r and the same variance σ², are recorded in Tables 1 and 2. The results for the second case are displayed in Table 3.
The implemented simulation scheme is the following. For each design matrix X, a random vector β is generated and y = Xβ denotes the noise-free response vector. Then, 100 iterations are performed; in each one, the response vector is perturbed by noise ε_i, resulting in a noisy response vector ŷ = y + ε_i, i = 1, 2, . . . , 100. Eventually, the regression parameter β̂ is computed by using both the MNS given by Algorithm 1 and the regularisation techniques. The quality of the generated approximate solution β̂ is assessed by the mean square error (MSE) between β and β̂, which is given by the formula MSE(β̂) = E[ ||β − β̂||²₂ ]. In Algorithm 3, we summarise the simulation scheme.

Algorithm 3: Simulation scheme.
Input: design matrix X ∈ R^{n×(d+1)}
Result: MSE(β̂)
β = randn(d + 1); y = Xβ;
for i ← 1 to 100 do
  ŷ = y + ε_i;
  compute β̂_i from X and ŷ;
end
MSE(β̂) = mean of ||β − β̂_i||²₂ over i = 1, . . . , 100.
In Tables 1-3, we present the results of estimating the regression parameter β for different orders of the design matrices X. In the first two columns of the tables, the correlation r and the variance σ² of the covariates are recorded, respectively. In Table 3, we record the intervals in which the correlation and the variance lie. In the third column, the adopted methods are listed; specifically, we record the MNS, Tikhonov regularisation and the ℓp-ℓq regularisation technique for different pairs (p, q). The fourth column contains the grid of values used for the regularisation parameter λ or µ for Tikhonov or ℓp-ℓq regularisation, respectively. In the last column, the MSE(β̂) of the derived approximate solutions β̂ is recorded. As we can see in these tables, in the case of highly correlated design matrices, regularisation is necessary for deriving a good approximation of the desired vector β. On the other hand, if the correlation of the design matrix is not high, the MNS can achieve a fair estimation and a regularisation method does not improve the results, as verified by MSE(β̂). Therefore, according to the presented results, for matrices with moderately correlated covariates, regularisation is redundant, as the MNS yields adequate results. However, as the correlation between the covariates rises, regularisation becomes essential.
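The simulation scheme of Algorithm 3 can be sketched in a few lines. A Python version (the paper's code is in Julia; the `solver` interface and the example dimensions are assumptions of this sketch):

```python
import numpy as np

def simulate_mse(X, solver, n_iter=100, seed=0):
    """Monte Carlo estimate of MSE(beta_hat), following Algorithm 3.
    `solver(X, y_noisy)` is any estimator returning beta_hat."""
    rng = np.random.default_rng(seed)
    n, d1 = X.shape
    beta = rng.standard_normal(d1)      # random true coefficient vector
    y = X @ beta                        # noise-free response
    errs = []
    for _ in range(n_iter):
        y_noisy = y + rng.standard_normal(n)   # eps_j ~ N(0, 1)
        beta_hat = solver(X, y_noisy)
        errs.append(np.sum((beta - beta_hat) ** 2))
    return float(np.mean(errs))

# usage with the MNS as the estimator
rng = np.random.default_rng(3)
X = np.hstack([np.ones((30, 1)), rng.standard_normal((30, 60))])
mns = lambda X, y: X.T @ np.linalg.solve(X @ X.T, y)
mse = simulate_mse(X, mns)
```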
Note that, in the case of design matrices with the same variance and correlation r = 0.999 (Tables 1 and 2), the regularisation techniques, Tikhonov and ℓp-ℓq, achieve comparable results. The choice of the pair of parameters (p, q) and of the values of the required regularisation parameter plays an important role in the efficient implementation of both methods.

Variable Selection in SSDs
In this section, we are interested in selecting the active factors of SSDs by using the methods described in Section 2. In our comparison, we also include the SVD principal regression method, which is used in SSDs and was proposed in [8]. We briefly refer to this method as SVD regression. The main computational cost of this approach is the evaluation of the SVD.
We measure the effectiveness of these methods through the Type I and Type II error rates. In particular, Type I error measures the cost of declaring an inactive factor to be active and Type II measures the cost of declaring an active effect to be inactive. In our numerical experiments, we consider 500 different realisations of the error and in the presented tables we record the mean value of Type I, II error rates.
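The two error rates are straightforward to compute once the true and selected active sets are known. A small Python helper (the 0-based indexing convention is an assumption of this sketch):

```python
def type_errors(true_active, selected, d):
    """Type I: fraction of inactive factors declared active;
    Type II: fraction of active factors declared inactive.
    Factors are indexed 0 .. d-1."""
    true_active, selected = set(true_active), set(selected)
    inactive = set(range(d)) - true_active
    t1 = len(selected & inactive) / len(inactive) if inactive else 0.0
    t2 = len(true_active - selected) / len(true_active) if true_active else 0.0
    return t1, t2

print(type_errors({15}, {15}, 23))   # perfect selection -> (0.0, 0.0)
```

Averaging these rates over the 500 error realisations gives the values reported in the tables.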
It is worth mentioning that both the MNS and Tikhonov regularisation declare all the factors active, i.e., Type I = 1, Type II = 0, for all the tested SSDs. Therefore, these methods are not suitable for variable selection and we do not include them in the tables that follow.
Example 3 (An illustrative example). In this example, we exhibit in detail the performance of each method on a particular problem. For this purpose, we adopt the illustrative example presented in [8], with the design matrix given there. A first column x_0 with all entries equal to 1 is added to the matrix, corresponding to the mean effect. The simulated data are generated by the model described in [8]. As we can see from the generated approximate solutions β̂, the MNS and the Tikhonov regularised solution cannot specify the active factors of the model and completely spoil the sparsity. On the other hand, the ℓp-ℓq regularisation method and the SVD regression can determine the active factors of the model appropriately. Example 4 (Williams' data). We consider the well-known Williams dataset (rubber age data), which is reported in Table 4. It is a classical SSD dataset and has been tested in several works, such as [8]. As noted in [8], since columns 13 and 16 of the original design matrix are identical, column 13 is removed for our numerical experiments. For this dataset we consider the real case and three synthetic cases.
We initially deal with the real case, where the design matrix X and the response vector y are given, without prior knowledge of the desired vector β. In the literature, it is reported that the active factor is x_15. In this case, according to our numerical experiments, the SVD regression and the ℓp-ℓq regularisation method with p = 0.8, q = 0.1 indicate that the factor x_15 is important. In particular, the proposed models, i.e., the coefficients β_i, are given in Table 5.
The second case corresponds to the three synthetic cases, see [8] and the references therein, which are given below. For these simulated cases, we record the results in Table 6. In particular, we compute the Type I and II error rates for the described methods. We apply the ℓp-ℓq regularisation method with µ = 5 and the SVD regression at significance level a = 0.05. As we notice in this table, both the ℓp-ℓq regularisation and the SVD regression select the important factors sufficiently well, as seen from the corresponding Type I and II error rates. The first model has the particularity that it includes the interaction of the factors x_5 and x_9, which does not usually appear in SSD analysis, and it is a challenging case for all the methods.

Example 5 (A 3-circulant SSD).
In this example, we consider one more SSD, which is also used in [18], and it is recorded in Table 7. We test the behaviour of the methods for variable selection by considering three models which can be found in [19] and are given below.
Model 1: y ∼ N(10 x_1, I_8). The results are presented in Table 8. For the three models used, we apply the ℓp-ℓq regularisation method with µ = 5, 5.5, 0.5, respectively, and the SVD regression with a = 0.25. According to the presented numerical results, we see that both the ℓp-ℓq regularisation and the SVD regression achieve satisfactory Type I and II error rates for Model 1. On the other hand, for Model 2 the SVD regression fails to specify the active factors, whereas the ℓp-ℓq regularisation method achieves a better Type II error. However, neither of the methods produces fair results for Model 3. The coefficients of this model are not sufficiently close, and this fact affects the behaviour of the methods.
Example 6. In this example we consider a real data set presented in [20], which deals with moss bags of Rhynchostegium riparioides exposed to different water concentrations of 11 trace elements under laboratory conditions. The design matrix X can be found in Table 1 of [20]. We consider the main effects and the second- and third-order interactions of the influent factors. Therefore, we have a 67 × 232 SSD, and we select the important factors by applying the ℓp-ℓq regularisation with µ = 0.75 and the SVD regression at significance level a = 0.05.
From Table 9, we see that both the ℓ2-ℓ0.1 and the SVD regression methods identify the main effect Zn as an active factor. The second-order interactions Cd/Mn, As/Pb and Mn/Ni are also identified as active. These results are in agreement with [20].
Table 9. Important elements and interactions.

[Table 9 columns: Method | Main Effects | Second-Order Interactions | Third-Order Interactions.]

Conclusions
In the present work, we analysed the properties of design matrices with correlated covariates. Specifically, we derived and proved formulae for the singular values of these matrices, and we studied the connection of the generalised condition number with the correlation structure. Moreover, we described some available methods for solving high-dimensional problems. We checked the behaviour of the MNS and the necessity of applying regularisation techniques when estimating the regression parameter β in the linear regression model. We concluded that, in solving high-dimensional statistical problems, the following remarks must be taken into consideration.

1.
Regularisation should be applied only if the given data set satisfies the discrete Picard condition. In this case, the regularisation parameter can be uniquely chosen by applying the discrepancy principle method.

2.
The regression parameter β ∈ R^{d+1} can be satisfactorily estimated by the MNS if the design matrix is not highly correlated, but in the case of highly correlated data matrices we have to adopt regularisation techniques. The quality of the derived estimation β̂ of β is assessed by the computation of MSE(β̂).

3.
In variable selection, where sparse solutions are needed, SVD regression or ℓp-ℓq regularisation can be used. When only a few factors of the experiment need to be specified (perhaps only the most important), SVD regression may be preferable, since it avoids regularisation and the troublesome procedure of defining the regularisation parameter. The quality of the variable selection proposed by the estimation methods is assessed by the evaluation of the Type I and II error rates.
In conclusion, the proposed scheme for the selection of the appropriate method for the solution of high-dimensional statistical problems is summarised in the following logical diagram, see Figure 3.

[Figure 3. Logical diagram: design matrix with correlated covariates X; two branches, highly correlated (correlation ≈ 1) and not highly correlated (correlation ≤ 0.9).]