1. Introduction
Multiple regression is one of the most widely used tools of applied statistical modeling, employed for solving various data fitting problems, analyzing the predictors' impact on the outcome variable, and making predictions. Linear models are commonly constructed by minimizing the total squared errors of the response variable, and such a model is known as the ordinary least squares (OLS) regression. Multiple textbooks, monographs, research papers, and internet sources are devoted to OLS models and their applications in statistical modeling; see, for example, [1,2,3,4,5,6,7,8,9,10].
The minimization of the OLS objective produces the so-called normal system of equations for the estimation of the model coefficients, and the current paper is devoted to the consideration of this normal system, its various properties, and its relations to several other theoretical and practical questions. It describes the classic Laplace expansion and its double form for presenting a determinant via the sum of the elements in a row, a column, or both, multiplied by their cofactors, which allows the calculation of determinants via those of lower order [11,12,13]. The paper demonstrates that the double Laplace expansion is a useful tool for presenting statistical modeling relations.
Another property obtained from the normal system of equations defines the geometric meaning of multiple linear regression as a hyperplane going through some special points [14,15]: a model with n parameters goes through n points of weighted mean values. These mean values can be calculated from the data, and the normal system can be interpreted as the equations of interpolation through the points of the weighted means [16,17]. Generally, multivariate interpolation presents a difficult problem requiring grid approximations and other special techniques [18,19,20,21,22]; however, with linear interpolation via the points of weighted means, the multivariate interpolation can be performed easily.
The normal system of equations can sometimes be based on an ill-conditioned matrix, with its determinant close to zero. An inversion of such a matrix can produce large values, inducing inflated parameters of regression. In such cases, known as the effects of multicollinearity among the predictors, regularization in the form of ridge regression, LASSO, elastic net, and other techniques [23,24,25,26,27,28,29,30,31] is applied. In these approaches, a penalizing function is added to the OLS objective, which produces parameters of the model adjusted according to the applied restrictions. This paper considers the possibility of applying a penalizing function directly to the normal system of equations, which corresponds to modified parameters of the model and has its own interesting features.
The work also considers the estimation of logistic regression, which is commonly built in nonlinear modeling by the maximum likelihood criterion. As is well known, generalized linear models (GLM) can be represented via linear link functions [32,33,34]. Finding how to transform an outcome variable to obtain the linear link helps reduce a nonlinear model estimation to a linear one. The paper describes the Mahalanobis distance, known for scaling multivariable observations in data clustering, segmentation, fusion, and other practical needs [35,36]. This distance is employed for measuring how far the observations are from the weighted means of the predictors used for interpolation by the normal system. A binary outcome variable can be weighted by these distances in relation to the centers of the weighted means of the predictors. The weighted means of the binary outcome can serve as estimated proportions, which, in turn, can be used as the dependent variable values in the linear link function. Then, the parameters of the logistic regression can be found by solving the normal system of the linearized problem.
The paper is structured as follows. After Section 1 (Introduction), Section 2 describes the Laplace expansion and its double form, and Section 3 presents the relations of multiple regression in terms of the Laplace expansion. Section 4 considers the geometrical properties of multiple regression as hyperplane interpolation via the points of weighted means. Section 5 describes ridge regularization of the normal system, and Section 6 defines Mahalanobis distances from the observations to the weighted means for building a linear link function of the nonlinear model. Section 7 illustrates the techniques with numerical examples and demonstrates that these approaches are useful in various data modeling problems. Finally, Section 8 summarizes the results.
2. Laplace Expansion and Its Double Form
To start with the main formulation of the Laplace expansion, consider a square matrix A of the n-th order, with elements a_jk, where j, k = 1, 2, …, n:

A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}.   (1)
Due to the Laplace expansion [13], the determinant det(A) of the matrix (1) can be presented as the sum of the elements of any row multiplied by the cofactors of these elements, so with any j-th row the determinant equals

\det(A) = \sum_{k=1}^{n} a_{jk} (-1)^{j+k} M_{jk},   (2)

where M_{jk} are the minors, or determinants of the (n − 1)-th order submatrices obtained by removing the j-th row and the k-th column from the original matrix (1). The signed minors A_{jk} = (-1)^{j+k} M_{jk} are the cofactors of the elements a_{jk}, and the expression (2) is called the cofactor expansion of the determinant. Similarly, a determinant can be expanded as the sum of the elements of any column multiplied by their cofactors.
By the same pattern, a determinant can be expressed via the double sum of the products of the elements in any row and column, weighted by their cofactors as well. Indeed, expanding the minors in (2) by the elements of another row or column yields the expression for the dual expansion.
Let us consider such a formula in more detail. Suppose that matrix A (1) is extended to a matrix B of the (n + 1)-th order by an additional first row with elements x_k, an additional first column with elements b_j, and an additional element y in the upper left corner:

B = \begin{pmatrix} y & x_1 & x_2 & \cdots & x_n \\ b_1 & a_{11} & a_{12} & \cdots & a_{1n} \\ \vdots & \vdots & \vdots & & \vdots \\ b_n & a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}.   (3)
The determinant of matrix B can be reduced to the following formula:

\det(B) = y \det(A) - \sum_{j=1}^{n} \sum_{k=1}^{n} b_j x_k (-1)^{j+k} M_{jk}.   (4)
This formula is expressed via the determinant of matrix A and the double sum of the weighted products of the elements x_k in the first row and b_j in the first column. The weights in the sum (4) are defined by the minors M_{jk} of matrix A, or, more exactly, by the signed minors A_{jk} = (-1)^{j+k} M_{jk}, which are the cofactors of the elements a_jk of the matrix (1).
By assuming that matrix A is not singular, so its determinant differs from zero, it is possible to represent the expression (4) as

\det(B) = \det(A) \left( y - \sum_{j=1}^{n} \sum_{k=1}^{n} b_j x_k \frac{(-1)^{j+k} M_{jk}}{\det(A)} \right).   (5)
The elements of the inverted matrix A^{-1} are defined by the cofactors of the transposed matrix divided by the determinant of the matrix, so we can rewrite the result (5) in the matrix form:

\det(B) = \det(A) \left( y - x' A^{-1} b \right),   (6)
where x and b are the column vectors with the elements x_k and b_j in (3), respectively, and the prime denotes transposition, so x' is the row vector.
If the determinant (3) equals zero, det(B) = 0, then, with the non-singular matrix A, the result (6) yields the equation

y = x' A^{-1} b.   (7)
We will return to this bilinear form in further consideration.
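As a quick illustration of the relations (3), (6) and (7), the following R sketch (not taken from the paper; the matrices are randomly generated) numerically checks the double-expansion identity and the resulting bilinear form:

# Numerical check of det(B) = det(A) * (y - x' A^(-1) b) in (6)-(7)
set.seed(1)
n <- 4
A <- crossprod(matrix(rnorm(n * n), n))    # a non-singular n x n matrix
b <- rnorm(n); x <- rnorm(n); y <- 2.5     # border elements of the matrix (3)
B <- rbind(c(y, x), cbind(b, A))           # the bordered (n + 1) x (n + 1) matrix B
all.equal(det(B), det(A) * (y - as.numeric(t(x) %*% solve(A) %*% b)))  # TRUE
# setting det(B) = 0 and solving for the corner element gives the bilinear form (7):
y0 <- as.numeric(t(x) %*% solve(A) %*% b)
det(rbind(c(y0, x), cbind(b, A)))          # approximately zero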
3. Multiple Linear Regression in Laplace Expansion
Let us briefly recall some main relations of regression modeling. The multiple linear regression for the dependent variable y and n independent variables x_k (k = 1, 2, …, n) can be written as

y_i = a_0 + a_1 x_{i1} + a_2 x_{i2} + \cdots + a_n x_{in} + \varepsilon_i,   (8)

where i denotes the observations (i = 1, 2, …, N), the intercept a_0 and the coefficients of regression a_j are the unknown parameters of the model, and ε_i is the error term. The parameters are estimated by minimizing the sum of squared deviations, which is the OLS criterion:

S = \sum_{i=1}^{N} \left( y_i - a_0 - a_1 x_{i1} - \cdots - a_n x_{in} \right)^2.   (9)
Equalizing the derivatives of the expression (9) by the unknown parameters to zero yields the so-called normal system:

a_0 N + a_1 \sum_i x_{i1} + \cdots + a_n \sum_i x_{in} = \sum_i y_i ,
a_0 \sum_i x_{ij} + a_1 \sum_i x_{ij} x_{i1} + \cdots + a_n \sum_i x_{ij} x_{in} = \sum_i x_{ij} y_i , \quad j = 1, 2, \ldots, n.   (10)

Dividing the first equation in (10) by N defines the intercept in the regression model:

a_0 = \bar{y} - a_1 \bar{x}_1 - \cdots - a_n \bar{x}_n.   (11)
Substituting the intercept (11) into the other n equations in (10) reduces them to the following equation in matrix form:

C_{xx} a = c_{xy},   (12)
in which the matrix C_xx and the vector c_xy of the n-th order are defined as follows:

C_{xx} = \frac{1}{N} X' X, \quad c_{xy} = \frac{1}{N} X' y.   (13)
In the relations (13), X denotes the N × n matrix of the centered observations of the predictors, y is the vector of the centered observations of the dependent variable, and X' is the transposed matrix. The symmetric matrix C_xx and the vector c_xy in (12) and (13) are the sample covariance matrix of the predictors x among themselves and the vector of their covariances with the dependent variable y, respectively. Resolving Equation (12) for the vector a of the coefficients of regression yields the expression

a = C_{xx}^{-1} c_{xy},   (14)
where C_{xx}^{-1} is the inverted matrix. With the obtained parameters (14), the model (8) for the centered variables x and y can be presented as follows:

y = x' a = x' C_{xx}^{-1} c_{xy},   (15)

where x' is the row vector of the predictors and y is the value of the dependent variable predicted by the regression.
The obtained solution (15) coincides with Equation (7). Thus, if in the matrix (3) we use the covariances, with the vector b = c_xy and the matrix A = C_xx, then the regression model corresponds to the following equation for the determinant of the matrix (3):

\det \begin{pmatrix} y & x' \\ c_{xy} & C_{xx} \end{pmatrix} = 0,   (16)

which reduces to the solution (15), proving that the regression model can be expressed in terms of the Laplace expansion (7).
For an explicit example, let us use standardized variables, centered and normalized by their standard deviations, so that, in place of the covariances, we have correlations. The model with two predictors can be presented as follows:

y = \beta_1 x_1 + \beta_2 x_2,   (17)

where the parameters are the so-called beta coefficients of the normalized equation. The matrix and vector of correlations (13) for the model (17) are

C_{xx} = \begin{pmatrix} 1 & r_{12} \\ r_{12} & 1 \end{pmatrix}, \quad c_{xy} = \begin{pmatrix} r_{1y} \\ r_{2y} \end{pmatrix},   (18)

where r_{12} is the correlation between the predictors and r_{1y}, r_{2y} are their correlations with the outcome. Then, relation (16) can be written explicitly as

\det \begin{pmatrix} y & x_1 & x_2 \\ r_{1y} & 1 & r_{12} \\ r_{2y} & r_{12} & 1 \end{pmatrix} = 0.   (19)
Applying the Laplace expansion (2) by the elements of the first row, which are the names of the variables, produces the equation

y (1 - r_{12}^2) - x_1 (r_{1y} - r_{12} r_{2y}) + x_2 (r_{1y} r_{12} - r_{2y}) = 0.   (20)

By finding the determinants of the second order, we resolve Equation (20) for the variable y:

y = \frac{r_{1y} - r_{12} r_{2y}}{1 - r_{12}^2} \, x_1 + \frac{r_{2y} - r_{12} r_{1y}}{1 - r_{12}^2} \, x_2.   (21)

This coincides with the model (17), with the beta coefficients expressed via the correlations. The Laplace expansion by the first column in (19) yields the equation

y (1 - r_{12}^2) - r_{1y} (x_1 - r_{12} x_2) + r_{2y} (r_{12} x_1 - x_2) = 0,   (22)

which can also be regrouped to the solution (21). Substituting values of the two predictors into the determinant (19) or into the explicit Equations (21) and (22) yields a predicted value of the dependent variable.
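The following R sketch (with hypothetical correlation values, not the paper's data) compares the explicit beta-coefficient solution (21) with the root of the determinant Equation (19) in y:

r12 <- 0.5; r1y <- 0.7; r2y <- 0.4          # assumed correlations
beta1 <- (r1y - r12 * r2y) / (1 - r12^2)    # beta coefficients as in (21)
beta2 <- (r2y - r12 * r1y) / (1 - r12^2)
x1 <- 0.8; x2 <- -0.3                       # standardized predictor values
y_hat <- beta1 * x1 + beta2 * x2            # prediction from (21)
f <- function(y) det(rbind(c(y, x1, x2),    # determinant (19) as a function of y
                           c(r1y, 1, r12),
                           c(r2y, r12, 1)))
c(uniroot(f, c(-10, 10))$root, y_hat)       # the two values coincide (up to root-finding tolerance)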
4. Multivariate Interpolation in Laplace Expansion
The Laplace expansion has a useful geometric interpretation. Consider a system of linear equations with the given coefficients a_jk and b_j, and the variables x_k:

a_{j1} x_1 + a_{j2} x_2 + \cdots + a_{jn} x_n = b_j, \quad j = 1, 2, \ldots, n.   (23)
This system geometrically corresponds to a hyperplane of the n-th order going through n multidimensional points with the coordinates defined by the coefficients in each Equation (23), which are (b_1, a_11, a_12, …, a_1n), (b_2, a_21, a_22, …, a_2n), …, (b_n, a_n1, a_n2, …, a_nn). A practical way of finding the hyperplane going through these given points is known in analytic geometry [14,15]. It involves solving the following determinant equation:

\det \begin{pmatrix} y & x_1 & x_2 & \cdots & x_n \\ b_1 & a_{11} & a_{12} & \cdots & a_{1n} \\ \vdots & \vdots & \vdots & & \vdots \\ b_n & a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix} = 0.   (24)
The elements of this determinant match the matrix (3), and Equation (24) corresponds to expression (16). The first row of (24) contains the names of the variables, and the other rows contain the parameters of the system (23). Substituting the names in the first row of (24) by the values in any other row leads to a determinant with two identical rows. Such a determinant equals zero, so the hyperplane goes exactly through each of the multidimensional points.
The problem of multivariate interpolation becomes easily solvable by linear interpolation in the approach (23) and (24): indeed, using the values of the predictors at a needed point in place of the names in (24) yields the value of the outcome. Expanding the determinant by the elements of the first row produces the equation of the hyperplane as an explicit formula for interpolation. The linear interpolation can be extended to polynomial interpolation in many variables if, besides the linear terms, we also include quadratic terms, mixed effects, higher powers of the variables, and other kinds of polynomial or nonlinear terms in the model. All such terms can be denoted as new variables and used in the determinant (24) for multivariable interpolation.
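A minimal R sketch of this interpolation, with hypothetical points, is given below; solving det = 0 for y in (24) reduces to the bilinear form (7), which is used directly:

set.seed(2)
n <- 3
A <- matrix(rnorm(n * n), n)                # coordinates a_jk of the n given points
b <- rnorm(n)                               # outcome values b_j at those points
interp <- function(x) as.numeric(t(x) %*% solve(A) %*% b)   # bilinear form (7)
sapply(1:n, function(j) interp(A[j, ]) - b[j])   # all ~ 0: the hyperplane passes through the points
interp(c(0.2, -1.0, 0.5))                        # interpolated value at a new point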
Let us return to the normal system (10) of the non-centered and non-standardized variables and show how it can be formulated in terms of multivariate interpolation through a set of special multidimensional points. The OLS model with 1 + n parameters of the intercept and coefficients of regression (9) can be presented as the hyperplane going through 1 + n points of the weighted mean values of all the variables. More specifically, assuming that the total of any x differs from zero, we divide each j-th equation in (10) by its coefficient at the intercept a_0, so this system reduces to the following one:

a_0 + a_1 \frac{\sum_i x_{ij} x_{i1}}{\sum_i x_{ij}} + \cdots + a_n \frac{\sum_i x_{ij} x_{in}}{\sum_i x_{ij}} = \frac{\sum_i x_{ij} y_i}{\sum_i x_{ij}}, \quad j = 0, 1, \ldots, n,   (25)

where x_{i0} ≡ 1, so the equation with j = 0 is the first equation of (10) divided by N.
The system (25) can be represented via the mean values, denoted by bars:

a_0 + a_1 \bar{x}_1^{(j)} + a_2 \bar{x}_2^{(j)} + \cdots + a_n \bar{x}_n^{(j)} = \bar{y}^{(j)}, \quad j = 0, 1, \ldots, n,   (26)
with the means and weighted means defined as

\bar{y}^{(j)} = \sum_{i=1}^{N} w_{ij} y_i, \quad \bar{x}_k^{(j)} = \sum_{i=1}^{N} w_{ij} x_{ik}, \quad j = 0, 1, \ldots, n, \quad k = 1, \ldots, n.   (27)
The weights of the i-th observations in their total for each variable x_j are

w_{ij} = \frac{x_{ij}}{\sum_{i=1}^{N} x_{ij}},   (28)

with the totals of the weights for each j-th variable equal to one (for j = 0, these are the uniform weights 1/N). Thus, \bar{y}^{(j)} in (26) and (27) are the mean values of y, and \bar{x}_k^{(j)} are the mean values of x_k, weighted by the j-th set of weights built from x_j (j = 1, 2, …, n).
Consequently, the normal system (10) can be reduced to the system (26), where each equation describes the hyperplane going through the corresponding (1 + n)-dimensional point defined in the rows of (27). Therefore, the linear regression (8) can be seen geometrically as a hyperplane going through the 1 + n points of the mean values (27) of the variables weighted by each other variable. Such a hyperplane can be built similarly to the procedure described in (23) and (24), so, with the normal system (26), there is the determinant equation:

\det \begin{pmatrix} y & x_0 & x_1 & \cdots & x_n \\ \bar{y}^{(0)} & 1 & \bar{x}_1^{(0)} & \cdots & \bar{x}_n^{(0)} \\ \bar{y}^{(1)} & 1 & \bar{x}_1^{(1)} & \cdots & \bar{x}_n^{(1)} \\ \vdots & \vdots & \vdots & & \vdots \\ \bar{y}^{(n)} & 1 & \bar{x}_1^{(n)} & \cdots & \bar{x}_n^{(n)} \end{pmatrix} = 0.   (29)

Unlike in (24), there is also a column with x_0 for the intercept, which corresponds to the variable identically equal to one.
Expanding the determinant (29) by the elements of the first row produces an explicit formula for the hyperplane, which is also the regression model. For an explicit example of the model with one predictor, Equation (29) becomes

\det \begin{pmatrix} y & x_0 & x_1 \\ \bar{y} & 1 & \bar{x}_1 \\ \bar{y}^{(1)} & 1 & \bar{x}_1^{(1)} \end{pmatrix} = 0.   (30)

With the identity x_0 = 1, finding the determinants of the second order and resolving Equation (30) for the dependent variable y produces the following expression:

y = \bar{y} + \frac{\bar{y}^{(1)} - \bar{y}}{\bar{x}_1^{(1)} - \bar{x}_1} \left( x_1 - \bar{x}_1 \right).   (31)
For only one predictor, x_1 can be simplified to x and the weights w_1 to w, with the weighted means denoted by \bar{x}_w and \bar{y}_w, so the pair regression (31) can be reduced to

y = \bar{y} + \frac{\bar{y}_w - \bar{y}}{\bar{x}_w - \bar{x}} \left( x - \bar{x} \right).   (32)

The regression line (32) goes through the two points of the means and the weighted means of the variables because the equality x = \bar{x} yields y = \bar{y}, and the equality x = \bar{x}_w yields y = \bar{y}_w. The slope in (32) can be transformed into the expression

\frac{\bar{y}_w - \bar{y}}{\bar{x}_w - \bar{x}} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},   (33)

which is the quotient of the sample covariance of x with y and the variance of the independent variable x, so it coincides with the regular definition of the slope for the pair regression. The intercept in (32) is

a_0 = \bar{y} - \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \, \bar{x},   (34)

which corresponds to the relation (11) for the case of one predictor.
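The following R sketch with simulated data illustrates the relations (27), (28) and (32)-(34): the line through the point of the plain means and the point of the x-weighted means reproduces the OLS pair regression:

set.seed(3)
x <- runif(50, 1, 5)
y <- 2 + 3 * x + rnorm(50)
w <- x / sum(x)                              # weights (28) built from the predictor
xbar <- mean(x);      ybar <- mean(y)        # point of the plain means
xbar_w <- sum(w * x); ybar_w <- sum(w * y)   # point of the weighted means (27)
slope <- (ybar_w - ybar) / (xbar_w - xbar)   # slope (33)
intercept <- ybar - slope * xbar             # intercept (34)
c(intercept, slope)
coef(lm(y ~ x))                              # the same values from the OLS fit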
The normal system of equations considered above in (10), (25) and (26), or in the matrix form (12), can contain an ill-conditioned matrix with the determinant close to zero, so an inversion of such a matrix produces large values, inducing inflated parameters of regression. This situation requires imposing regularization with a penalizing function, which is commonly added to the OLS objective. However, as shown below, it is possible to apply a penalizing function directly to the normal system.
5. Regularization of the Normal System
For a clear exposition, let us start with the regular ridge regression (RR). The OLS criterion (9) in the matrix form for the standardized variables, with the added penalizing function of the quadratic norm of the standardized beta-coefficients, is defined as

RR(\beta_{RR}) = \frac{1}{N} (y - X\beta_{RR})'(y - X\beta_{RR}) + k \, \beta_{RR}'\beta_{RR},   (35)

where β_RR denotes the vector of the ridge regression parameters and k is a small positive ridge parameter. The notations (13) are used for the assumed normalized variables. Minimizing the objective (35) by this vector yields the system of equations

(C_{xx} + kI) \beta_{RR} = c_{xy},   (36)
where I is the identity matrix of the n-th order. Equation (36) presents the normal system (12) with the added scalar matrix of the constant k. Resolving (36) yields

\beta_{RR} = (C_{xx} + kI)^{-1} c_{xy},   (37)
which is the vector of the beta-coefficients for the RR model. The ridge regression solution (37) exists even for a singular correlation matrix C_xx, and it reduces to the OLS solution (14) when k tends to zero.
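A minimal R sketch of the solution (37), with a hypothetical correlation matrix of two strongly collinear predictors, is as follows:

ridge_beta <- function(Cxx, cxy, k) solve(Cxx + k * diag(nrow(Cxx)), cxy)  # solution (37)
Cxx <- matrix(c(1, 0.9, 0.9, 1), 2)          # assumed correlations of collinear predictors
cxy <- c(0.80, 0.75)                         # assumed correlations with the outcome
ridge_beta(Cxx, cxy, 0)                      # k = 0: the OLS solution (14)
ridge_beta(Cxx, cxy, 0.2)                    # k > 0: shrunken, more stable coefficients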
Instead of penalizing the OLS objective (35), it is possible to apply the ridge regularization term to the normal system (12) itself, for the new parameters β_NS:

NS(\beta_{NS}) = (C_{xx}\beta_{NS} - c_{xy})'(C_{xx}\beta_{NS} - c_{xy}) + k \, \beta_{NS}'\beta_{NS},   (38)

where NS denotes the normal system. The derivative of the objective (38) by the vector β_NS produces the equation

(C_{xx}^2 + kI)\,\beta_{NS} = C_{xx} c_{xy};   (39)

then, the ridge solution by the regularized normal system becomes as follows:

\beta_{NS} = (C_{xx}^2 + kI)^{-1} C_{xx} c_{xy}.   (40)
A similar solution, corresponding to the parameter k = 1, was obtained for the NS regularization by another criterion in [30] (note that there is a typo with the missing sign of inversion for the matrix in formula (25) of that work).
Another useful step in building a regularized solution is the additional adjustment of the regularized vector of beta-coefficients, denoted here by \hat{\beta}, by a constant term q, which improves the quality of the data fit:

\beta_{adj} = q \, \hat{\beta}.   (41)

For any obtained regularized solution of beta-coefficients \hat{\beta}, the term q is defined as

q = \frac{\hat{\beta}' c_{xy}}{\hat{\beta}' C_{xx} \hat{\beta}}.   (42)

The maximum value of the coefficient of multiple determination, often used for the estimation of the regression quality of fit, with the adjustment (41) and (42), equals

R_{max}^2 = \frac{(\hat{\beta}' c_{xy})^2}{\hat{\beta}' C_{xx} \hat{\beta}}.   (43)
More detail on this adjustment parameter is given in [30].
Substituting the ridge solution (37) into (42) and using the obtained constant in (41) leads to the following adjusted ridge solution:

\beta_{RR,adj} = \frac{c_{xy}'(C_{xx} + kI)^{-1} c_{xy}}{c_{xy}'(C_{xx} + kI)^{-1} C_{xx} (C_{xx} + kI)^{-1} c_{xy}} \, (C_{xx} + kI)^{-1} c_{xy}.   (44)

Similarly, by using the normal system solution (40), we obtain the term (42) and the adjusted solution:

\beta_{NS,adj} = \frac{c_{xy}' C_{xx} (C_{xx}^2 + kI)^{-1} c_{xy}}{c_{xy}' (C_{xx}^2 + kI)^{-1} C_{xx}^3 (C_{xx}^2 + kI)^{-1} c_{xy}} \, (C_{xx}^2 + kI)^{-1} C_{xx} c_{xy}.   (45)

Note that the symmetric matrices in the quotients (44) and (45) are commutative. The adjusted solutions (44) and (45) produce the maximum possible data fit by the vectors of the structures (37) or (40), respectively.
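The following R sketch encodes the normal-system solution (40) and the adjustment (41)-(43) as reconstructed above, reusing the hypothetical Cxx and cxy from the previous sketch; it is an illustration of the formulas, not the authors' code:

ns_beta <- function(Cxx, cxy, k)             # solution (40)
  solve(Cxx %*% Cxx + k * diag(nrow(Cxx)), Cxx %*% cxy)
adjust <- function(beta, Cxx, cxy) {
  q <- as.numeric(t(beta) %*% cxy) / as.numeric(t(beta) %*% Cxx %*% beta)  # term (42)
  list(q = q,
       beta_adj = q * beta,                                                # adjusted vector (41)
       R2_adj = as.numeric(t(beta) %*% cxy)^2 /
                as.numeric(t(beta) %*% Cxx %*% beta))                      # maximum R2 (43)
}
b_ns <- ns_beta(Cxx, cxy, 0.4)
adjust(b_ns, Cxx, cxy)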
6. Mahalanobis Distance to the Points of Means
The weighted mean values (27) can also be applied for facilitating estimations in some nonlinear models by their transformation to the corresponding linear link functions. Let us consider grouping the observations of the variables x_ij around the points (27) of the weighted means \bar{x}_k^{(j)} of the predictors. As a measure of the distance from a multivariate observation to the points of mean values for the variables measured in different units, the Mahalanobis distance can be tried out. The n + 1 points of means correspond to the n + 1 centers with the coordinates given in all the k-th rows of (27), without the mean values of the y variable. A distance from an i-th point to the k-th center can be defined by the Mahalanobis distance as follows:

d_{ik}^2 = (x_i - m_k)' C_{xx}^{-1} (x_i - m_k),   (46)

where C_{xx}^{-1} is the inverted covariance matrix, x_i is the column vector of the i-th values of all the x variables, and m_k is the column vector of the means of the x variables in each k-th row of (27):

m_k = \left( \bar{x}_1^{(k)}, \bar{x}_2^{(k)}, \ldots, \bar{x}_n^{(k)} \right)', \quad k = 0, 1, \ldots, n,   (47)
including the first row of (27) with the non-weighted mean values. A smaller d_ik (46) corresponds to a higher level of belonging of the i-th observation to the k-th cluster of mean values, so the weights of such belonging could be defined by the reciprocal distances 1/d_ik. However, such a metric cannot be used in the case d_ik = 0, which can happen in real data.
The weights for the correspondence of an i-th observation x_i to a k-th center m_k (47) can be better defined via the probability density function (pdf) of the multinormal distribution based on the Mahalanobis distance (46):

f_{ik} = (2\pi)^{-n/2} \, |C_{xx}|^{-1/2} \exp\!\left( -d_{ik}^2 / 2 \right).   (48)

A smaller d_ik corresponds to a bigger pdf (48), and for d_ik = 0, the pdf reaches its maximum, so there is no singularity. It is convenient to use the weights v_ik of the belonging of an i-th observation to a k-th group defined by the pdf values (48) normalized within each i-th observation:

v_{ik} = \frac{f_{ik}}{\sum_{k=0}^{n} f_{ik}}.   (49)
Thus, for each i-th observation, the total of the weights across the groups of means equals one, \sum_{k=0}^{n} v_{ik} = 1. With the weights (49), the weighted means of the dependent variable can be found in relation to each k-th group. Note that the group k = 0 corresponds to the predictor x_0 of the intercept in the first row of the Formulae (25)-(27).
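An R sketch of the weights (46)-(49) on simulated data is shown below; the normalizing constant of the pdf (48) cancels in (49), so only the exponential kernel is computed. The matrix X and the objects defined here are illustrative, not the paper's data:

set.seed(4)
N <- 60; n <- 3
X <- matrix(rnorm(N * n, mean = 5), N, n)          # simulated predictors
W <- cbind(1, X)                                   # columns x0, x1, ..., xn used as weights
M <- apply(W, 2, function(w) colSums(w * X) / sum(w))  # n x (n + 1) centers m_k of (47)
Sinv <- solve(cov(X))                              # inverted covariance matrix
d2 <- sapply(1:(n + 1), function(k) {              # squared Mahalanobis distances (46)
  D <- sweep(X, 2, M[, k])
  rowSums((D %*% Sinv) * D)
})
f <- exp(-d2 / 2)                                  # kernel of the multinormal pdf (48)
V <- f / rowSums(f)                                # normalized weights v_ik of (49)
rowSums(V)[1:5]                                    # each row of weights sums to one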
Suppose the dependent variable y is presented by a binary variable with 0 and 1 values. With the weights (49), the weighted means for y are

\tilde{y}_k = \frac{\sum_{i=1}^{N} v_{ik} y_i}{\sum_{i=1}^{N} v_{ik}}, \quad k = 0, 1, \ldots, n.   (50)

The binary outcome model is commonly built in the form of the logistic regression by the maximum likelihood criterion. With this model, the probability at each i-th point of observations can be found by the expression

p_i = \frac{\exp(b_0 + b_1 x_{i1} + \cdots + b_n x_{in})}{1 + \exp(b_0 + b_1 x_{i1} + \cdots + b_n x_{in})},   (51)

where b_j are the estimated logit regression parameters. The model (51) can be transformed into the linear link function:

\ln \frac{p_i}{1 - p_i} = b_0 + b_1 x_{i1} + \cdots + b_n x_{in}.   (52)

It is clear that the binary 0-or-1 values of the original outcome variable y_i cannot be used under the logarithm in (52). However, we can define the new dependent variable z on the left-hand side of (52) via the proportions (50) as follows:

z_k = \ln \frac{\tilde{y}_k}{1 - \tilde{y}_k}, \quad k = 0, 1, \ldots, n.   (53)

Then, using the values (53) as the dependent variable in Equation (26), we solve this normal system of equations with respect to the parameters of the logistic regression in the linearized form (52). The obtained parameters can serve in the logit Equation (51) for estimating the probability at each observation point.
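Continuing the previous sketch, the following R code illustrates one reading of this linearized estimation: the weighted proportions (50), their logits (53), and the solution of the system (26) with these logits as the outcome, compared with the maximum likelihood logit fit. The binary outcome is simulated, so the numbers are only illustrative:

y <- rbinom(N, 1, plogis(-4 + 0.8 * X[, 1]))       # simulated binary outcome
p_k <- colSums(V * y) / colSums(V)                 # weighted proportions (50)
z_k <- log(p_k / (1 - p_k))                        # linear-link outcome values (53)
G <- t(M)                                          # (n + 1) x n weighted means of the predictors (27)
b_link <- solve(cbind(1, G), z_k)                  # intercept and slopes of the linear link (52)
b_link
coef(glm(y ~ X, family = binomial))                # maximum likelihood logit, for comparison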
7. Numerical Examples
From the MASS package of the R software, the dataset "Cars93" was taken for the numerical examples. There are 93 observations on different car brands and models, measured by 27 numerical, ordinal, and nominal variables. The following eight numerical variables were used as predictors: x1—Engine size (liters), x2—Horsepower, x3—Rev. per mile (engine revolutions per mile in highest gear), x4—Fuel tank capacity (US gallons), x5—Length (inches), x6—Wheelbase (inches), x7—Width (inches), x8—Weight (pounds). As the dependent variable y, the Price (in USD 1000) is taken for the linear models with the numerical outcome, and the binary variable Origin (USA and non-USA companies as 1 and 0 values, respectively) is taken for the logistic regression.
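The data setup described above can be reproduced with the following R sketch; the column names follow the MASS::Cars93 documentation:

library(MASS)
data(Cars93)
X <- as.matrix(Cars93[, c("EngineSize", "Horsepower", "Rev.per.mile",
                          "Fuel.tank.capacity", "Length", "Wheelbase",
                          "Width", "Weight")])          # predictors x1, ..., x8
y_price  <- Cars93$Price                                # numerical outcome (USD 1000)
y_origin <- as.numeric(Cars93$Origin == "USA")          # binary outcome: 1 = USA, 0 = non-USA
dim(X); table(y_origin)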
Table 1, in its first two numerical columns, presents the pair correlations r of the dependent variable Price with the predictors and the beta-coefficients of the OLS linear regression (14) for the normalized predictors. The next five columns display the ridge regressions (37) adjusted by the relations (41) and (42), or (44), with different values of the ridge parameter k. The OLS model corresponds to the case k = 0, and all the regressions show how their coefficients behave in profiling by the k parameter. The last three rows in Table 1 present the coefficient of multiple determination R2, the parameter of adjustment q (42), and the coefficient R2adj adjusted by this parameter. With bigger k, the quality of fit R2 of the ridge regressions diminishes, although the adjustment parameter q improves it to the R2adj values.
The coefficients of the pairwise regressions of the price on each single predictor coincide with the pair correlations for the standardized data. Assuming that all the pairwise models have a meaningful relation of the outcome with the predictors, we can notice that, in the OLS multiple regression, the parameter of x3 becomes positive and that of x7 becomes negative, so they are of the opposite directions to those observed in the pairwise relations. Such a change of sign in the coefficients corresponds to the effects of multicollinearity among the predictors in the multiple regression. Ridge regression makes the model less prone to the collinearity impact, and by increasing the parameter k, we can reach a multiple ridge regression with coefficients of the same signs as in the pairwise models or in the correlations. However, as we see in Table 1, only with the ridge parameter growing to the rather big value of k = 1 does the coefficient of x3 become negative, as in the correlation, while the coefficient of x7 still cannot reach a positive value as in the pair relation.
A possibility to achieve multiple regression coefficients with signs coinciding with those of the pairwise models can be found by using the ridge penalization in the normal system itself, as described in the derivation (38)-(40). Table 2 is built similarly to the previous table, starting with the correlations and the OLS model, but the next columns present the results of the ridge regularization of the normal system with the adjustment of (40)-(42), or (45). The results of Table 2 demonstrate that, already with the ridge parameter k = 0.4, the coefficients in the multiple regressions all become of the same signs as the coefficients in the pairwise models or the correlations. It means that ridge penalization applied not to the original OLS objective but to the normal system presents a better way to obtain interpretable solutions for multiple linear models.
Let us consider the Origin binary variable in the OLS modeling by the same eight predictors. Its solution can be reduced to the normal system of equations presented via the weighted mean values, as described in the relations (25)-(27). Table 3 shows the coefficients of the normal system given in the weighted mean values (26). The first column in Table 3 presents the weighted means of the dependent variable. The next column, of the variable x0, corresponds to the intercept. After that, the weighted means of the predictors are on the right-hand side. The values in each row define the nine multidimensional points through which the hyperplane of the linear regression (8) goes. Solving this normal system produces the coefficients of the OLS regression model.
For the Origin binary outcome, Table 4 in the first two numerical columns presents the non-standardized coefficients of the OLS and the logit regressions. The last row shows the coefficient of multiple determination R2 for the OLS model and the pseudo-R2 for the logit model (defined as one minus the quotient of the residual deviance by the null deviance). These models have a similar quality of fit and prediction.
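For reference, the pseudo-R2 described above can be computed in R from a fitted logit model as follows (using the variables from the data-setup sketch):

fit_logit <- glm(y_origin ~ X, family = binomial)
1 - fit_logit$deviance / fit_logit$null.deviance        # one minus residual/null deviance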
The last column in Table 4 shows the estimation via the linear link by the Mahalanobis distance and the weighted proportions described in the Formulae (46)-(53). The R2 in this case equals one, but it only indicates that this model corresponds to the interpolation via the points of the weighted means, so its residuals equal zero. The three solutions in Table 4 look different, but their correlations are very high. Correlating without the intercepts yields 0.9496 for OLS and Logit, 0.8996 for OLS and Linear-link, and 0.9906 for Logit and Linear-link. With the intercepts, these values become 0.9986, 0.9990, and 0.9998, respectively.
With these three models, we can estimate the fitted values of the outcome variable y and find the correlations between them, which are 0.9119 between OLS and Logit, 0.9875 between OLS and Linear-link, and 0.9527 between the Logit and Linear-link models. Thus, these vectors of fitted values are highly correlated; however, some OLS predictions can turn out to be negative or above one, while both the Logit and Linear-link models always produce meaningful probability values in the [0, 1] range. The predicted variables can also be dichotomized to 0 for the values below 0.5 and to 1 for the values equal to or higher than 0.5; then, it is possible to build the cross-tables of counts for the original versus the predicted values. The quality of prediction can be evaluated by the hit rate, which is the share of the correct 0 and 1 predictions of the dependent binary variable in the total number of observations. The hit rate reached by each of the three predictions is about 83%, so, in general, all these models are of a good quality of fit and prediction.
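A short R sketch of the dichotomization and the hit rate, shown for the logit fit from the previous sketch, is as follows:

p_hat <- fitted(fit_logit)                       # predicted probabilities
pred  <- as.numeric(p_hat >= 0.5)                # dichotomized at the 0.5 threshold
table(observed = y_origin, predicted = pred)     # cross-table of counts
mean(pred == y_origin)                           # hit rate, the share of correct predictions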