1. Introduction
Linear regression remains one of the most widely used modeling tools in statistics, data science, and machine learning [1]. In its familiar form, the ordinary least squares (OLS) estimator chooses coefficients that minimize the squared discrepancy between observations and predictions. When the design matrix $X$ has full column rank, the fitted values and the associated orthogonal projector onto the column space take the classical form $P = X(X^\top X)^{-1}X^\top$, with $P$ symmetric and idempotent [2]. More generally, the orthogonal projector onto $\mathcal{C}(X)$ can be written as $P = XX^{+}$ in terms of the Moore–Penrose pseudoinverse $X^{+}$ [2].
The fitted component and residual are $\hat{Y} = PY$ and $r = (I - P)Y$, so that $Y = \hat{Y} + r$ is the orthogonal decomposition of the response into a part explained by $\mathcal{C}(X)$ and a part orthogonal to it. This algebraic formulation is standard, but it can obscure the underlying geometry that governs the behavior of the residuals, and inversion-based formulas can be fragile in high-dimensional or nearly singular regimes [2].
A classical identity from numerical linear algebra makes the geometry explicit:
$I - P = NN^\top$, (4)
where the columns of $N$ form an orthonormal basis of $\mathcal{C}(X)^\perp$ [2]. When $\operatorname{rank}(X) = n - 1$, the orthogonal complement is one-dimensional and the projector reduces to the rank-one form
$I - P = nn^\top$, (5)
where $n$ is the unit vector (unique up to sign) perpendicular to $\mathcal{C}(X)$. In this codimension-one setting, the residual is the projection of $Y$ onto a single direction, and the operator (5) completely avoids matrix inversion. For higher codimension, the residual lies in a subspace of dimension $k = n - \operatorname{rank}(X)$, and the factorization (4) makes this structure explicit.
The construction of normal directions motivates generalizations of the classical cross-product to $\mathbb{R}^n$. Belhaouari et al. formalize a determinant-based vector product that yields a vector orthogonal to $n - 1$ linearly independent vectors in $\mathbb{R}^n$ [3]. In regression problems with $\operatorname{rank}(X) = n - 1$, a single normal vector uniquely defines the rank-one projector (5). For higher codimensions, these constructions naturally combine with null-space and QR-based methods to form an orthonormal residual basis $N$.
Multicollinearity is a persistent challenge in regression modeling, as highly dependent predictors inflate variances, destabilize coefficient estimates, and impede interpretation [4,5]. Classical diagnostics such as the variance inflation factor (VIF) quantify this effect but offer limited geometric insight. We therefore introduce the Geometric Multicollinearity Index (GMI), a scale-invariant, volume-based measure derived from the polar sine and the Gram determinant. For a full column-rank matrix $X$ with $p \le n$, the polar sine normalizes the square root of the Gram determinant $\det(X^\top X)$ by the product of the column norms, taking values in $[0, 1]$: it equals 1 for orthogonal predictors and approaches 0 as multicollinearity increases.
In rank-deficient or over-parameterized regimes, the Gram determinant vanishes and the GMI saturates at 1, indicating extreme geometric degeneracy. As a normalized volume measure of the parallelotope spanned by the columns of $X$, the GMI provides a compact, scale-invariant geometric diagnostic across dimensions. In this regime, $\mathrm{GMI} = 1$ should be interpreted as a diagnostic flag for complete geometric collapse, rather than as a measure of severity beyond rank deficiency.
Although both regression and principal component analysis (PCA) rely on orthogonal projections, their objectives differ fundamentally [6]. PCA projects $X$ onto directions of maximum variance, yielding residuals that remain in the predictor (feature) space. By contrast, regression projects the response, decomposing $Y$ into a fitted component $\hat{Y} \in \mathcal{C}(X)$ and a residual $r \in \mathcal{C}(X)^\perp$ [2]. This distinction between feature-space and response-space residuals is essential for interpreting linear-model diagnostics.
The scope of the present paper is as follows. The projector and residual identities above are standard results from orthogonal projection theory. The contribution of this paper lies in a rank-aware geometric interpretation of these identities via multivector and cross-product constructions, together with the introduction of the volume-based Geometric Multicollinearity Index (GMI) as a complementary geometric diagnostic.
This work is theoretical in nature. We develop a rank-aware residual projector, characterize the associated residual subspaces, and formalize GMI as an intrinsic measure of multicollinearity. Algorithmic, large-scale, regularized, and nonlinear extensions are beyond the scope of this paper and are left for future work.
The main contributions of this theoretical paper are as follows:
- Rank-aware geometric formulation: we develop a unified, rank-aware description of the residual projector based on the identity $I - P = NN^\top$, which reduces to the rank-one form $nn^\top$ when $\operatorname{rank}(X) = n - 1$ and generalizes to arbitrary rank.
- Multivector perspective on residual spaces: we reformulate $\mathcal{C}(X)^\perp$ and the associated projector using multivectors, wedge products, and cross-products, yielding a basis-independent geometric characterization valid in both codimension-one and higher-codimension settings.
- Geometric Multicollinearity Index (GMI): we introduce a scale-invariant, volume-based diagnostic derived from the Gram determinant and polar sine, which quantifies geometric degeneracy in both full-rank and rank-deficient regimes and complements classical VIF-type measures.
- Regression versus PCA residuals: we clarify the conceptual distinction between regression residuals and PCA reconstruction residuals, emphasizing the impact of projecting the response versus the predictors on residual geometry and interpretation.
- Numerical illustrations: we provide simple numerical examples illustrating rank-one and multivector residual projections and the behavior of GMI under controlled perturbations.
The paper is organized as follows.
Section 2 introduces the notation and basic geometric concepts.
Section 3 develops the main theoretical results on residual projectors and the geometric structure of $\mathcal{C}(X)^\perp$.
Section 4 presents numerical illustrations, including the behavior of GMI under controlled multicollinearity.
Section 5 and
Section 6 discuss implications and conclude.
2. Notation and Preliminaries
We briefly fix notation and recall the algebraic and geometric concepts that underpin the proposed framework. We follow standard linear algebra conventions as in Strang [
7], Lay et al. [
8], and Golub and Van Loan [
2].
2.1. Basic Notation and Subspaces
The transpose of a matrix or vector is denoted by the superscript ⊤. Lowercase letters (e.g., $x$, $n$) denote column vectors, and uppercase letters (e.g., $X$, $N$) denote matrices. For a matrix $A$, we write $\mathcal{C}(A)$ for its column space and denote its rank by $\operatorname{rank}(A)$. Throughout, $r$ denotes the residual vector, while $\operatorname{rank}(X)$ denotes the rank of $X$, to avoid notational ambiguity. All notation is defined once and used consistently throughout the manuscript; in particular, $N$ denotes an orthonormal basis of the orthogonal complement $\mathcal{C}(X)^\perp$.
We reserve $r$ exclusively for the residual vector in (8). The orthogonal complement of a subspace $S$ is denoted by $S^\perp$, and the null space of a matrix $A$ is $\mathcal{N}(A)$. A summary of all symbols and subspace conventions is provided in Abbreviations.
Given $v \in \mathbb{R}^n$, the matrix $vv^\top$ denotes its outer product. The identity matrix in $\mathbb{R}^{n \times n}$ is denoted by $I$.
2.2. Projection and Residual Operators
Let $X \in \mathbb{R}^{n \times p}$ have rank $\operatorname{rank}(X) \le p$. The orthogonal projector onto the column space $\mathcal{C}(X)$ is the unique symmetric idempotent matrix $P$ whose range equals $\mathcal{C}(X)$. When $X$ has full column rank ($\operatorname{rank}(X) = p$), this projector admits the well-known closed form
$P = X(X^\top X)^{-1}X^\top$, (6)
which is symmetric and idempotent [2]. In the general case, including rank-deficient designs, $P$ can be written as
$P = XX^{+}$, (7)
where $X^{+}$ denotes the Moore–Penrose pseudoinverse of $X$ [2], computed throughout via an SVD-based routine.
Both (6) and (7) implement the same geometric operation: orthogonal projection onto $\mathcal{C}(X)$.
When $\operatorname{rank}(X) = p$, the same orthogonal projector can equivalently be constructed via a Cholesky factorization of $X^\top X$, producing identical residuals while requiring positive definiteness of the Gram matrix.
For any response vector $Y \in \mathbb{R}^n$, the fitted component is $\hat{Y} = PY$, and the residual is
$r = Y - \hat{Y} = (I - P)Y$, (8)
which obeys $X^\top r = 0$ and hence lies in the orthogonal complement $\mathcal{C}(X)^\perp$ [2].
Let $\rho = \operatorname{rank}(X)$ and set $k = n - \rho$. Choose any matrix $N \in \mathbb{R}^{n \times k}$ whose columns form an orthonormal basis of $\mathcal{C}(X)^\perp$:
$N^\top N = I_k, \qquad \mathcal{C}(N) = \mathcal{C}(X)^\perp$. (9)
Then the residual projector admits the geometric factorization
$I - P = NN^\top$, (10)
and the null space of $I - P$ is $\mathcal{C}(X)$.
This characterization is standard in numerical linear algebra and regression theory [2] and forms the algebraic backbone of our geometric residual formulation. Equation (10) makes explicit that the residual $r$ lies in a $k$-dimensional orthogonal complement spanned by the columns of $N$.
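The factorization above can be checked numerically. The following sketch (a minimal NumPy/SciPy illustration, not code from the paper) builds the projector from the pseudoinverse, an orthonormal null-space basis $N$ of $X^\top$, and verifies $I - P = NN^\top$ and $X^\top r = 0$ on random data.

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(0)
n, p = 8, 3
X = rng.standard_normal((n, p))   # full column rank with probability 1
Y = rng.standard_normal(n)

# Projector via the Moore-Penrose pseudoinverse: P = X X^+.
P = X @ np.linalg.pinv(X)

# Orthonormal basis N of C(X)^perp = null(X^T), computed via SVD.
N = null_space(X.T)               # shape (n, n - rank(X))

# Geometric factorization: I - P = N N^T.
assert np.allclose(np.eye(n) - P, N @ N.T)

# Residual r = (I - P) Y is orthogonal to every column of X.
r = (np.eye(n) - P) @ Y
assert np.allclose(X.T @ r, 0, atol=1e-10)
```

Note that `scipy.linalg.null_space` already returns an orthonormal basis, so no separate QR step is needed here.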
2.3. Cross-Product and Cross–Wedge–QR Method
To construct explicit orthogonal directions, especially in the codimension-one case where $\operatorname{rank}(X) = n - 1$, we employ the cross-product in $\mathbb{R}^n$ [3]. Given $n - 1$ linearly independent vectors $v_1, \dots, v_{n-1} \in \mathbb{R}^n$, the determinant-based construction of Belhaouari et al. produces a vector
$w = v_1 \times \cdots \times v_{n-1}$,
which is orthogonal to each $v_i$:
$\langle w, v_i \rangle = 0, \quad i = 1, \dots, n - 1$.
This extends earlier $n$-dimensional vector products and related constructions in eigenanalysis and geometry [9,10].
In the regression setting, when $\operatorname{rank}(X) = n - 1$, the predictor space $\mathcal{C}(X)$ has codimension one and $\mathcal{C}(X)^\perp$ is spanned by a single unit vector $n$. Choosing any basis of $\mathcal{C}(X)$ and applying the cross-product yields a nonzero vector orthogonal to $\mathcal{C}(X)$; normalizing it gives a unit normal $n$ spanning $\mathcal{C}(X)^\perp$. In this case, the residual projector reduces to the rank-one form $I - P = nn^\top$ and the residual is simply $r = (n^\top Y)\,n$.
In this codimension-one setting, the rank-one projector coincides with the classical cross-product construction, and we refer to this case as the cross-product (rank-one) residual projector.
A key geometric property of the cross-product is that its norm equals the $(n-1)$-dimensional volume of the parallelotope spanned by its arguments [3]:
$\|v_1 \times \cdots \times v_{n-1}\| = \operatorname{Vol}_{n-1}(v_1, \dots, v_{n-1})$.
Thus, the cross-product simultaneously provides an orthogonal direction and encodes a volume. When $k > 1$, a single cross-product yields at most one null direction; in that case, additional null directions must be obtained by other means (e.g., null-space methods and QR), as discussed later.
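A determinant-based generalized cross-product in $\mathbb{R}^n$ can be sketched as follows. This is an illustrative implementation of the cofactor-expansion construction (the helper name `cross_nd` is ours, not from [3]); it returns a vector orthogonal to all $n-1$ inputs whose norm equals the $(n-1)$-volume of the parallelotope they span.

```python
import numpy as np

def cross_nd(V):
    """Generalized cross-product of the n-1 rows of V (shape (n-1, n)).

    Component i is the signed (n-1)x(n-1) minor obtained by deleting
    column i, so v . w = det([v; V]) vanishes for every row v of V.
    """
    n = V.shape[1]
    assert V.shape == (n - 1, n)
    w = np.empty(n)
    for i in range(n):
        minor = np.delete(V, i, axis=1)          # drop column i
        w[i] = (-1) ** i * np.linalg.det(minor)  # cofactor sign
    return w

rng = np.random.default_rng(1)
V = rng.standard_normal((3, 4))                  # three vectors in R^4
w = cross_nd(V)
assert np.allclose(V @ w, 0, atol=1e-10)         # orthogonal to each argument
vol = np.sqrt(np.linalg.det(V @ V.T))            # sqrt of Gram determinant
assert np.isclose(np.linalg.norm(w), vol)        # norm equals the 3-volume
```

For $n = 3$ this reduces to the classical cross-product of two vectors.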
2.4. Wedge Products, Gram Matrices, and Polar Sine
To describe volumes and linear independence in a coordinate-free manner, we rely on standard constructions from linear algebra and multilinear geometry [2,7,8]. Given vectors $v_1, \dots, v_k \in \mathbb{R}^n$, their wedge product $v_1 \wedge \cdots \wedge v_k$ encodes the oriented $k$-dimensional parallelotope spanned by $v_1, \dots, v_k$. Geometrically, $v_1 \wedge \cdots \wedge v_k = 0$ if and only if the vectors are linearly dependent, and its norm $\|v_1 \wedge \cdots \wedge v_k\|$ equals the $k$-dimensional volume of the associated parallelotope. This interpretation underlies many geometric treatments of subspaces, volumes, and orthogonality in $\mathbb{R}^n$ [2,7].
For computational purposes, such volumes can be expressed using Gram matrices. Given vectors $v_1, \dots, v_m \in \mathbb{R}^n$, the associated Gram matrix is
$G = \left[\langle v_i, v_j \rangle\right]_{i,j=1}^{m}$.
The $m$-dimensional volume of the parallelotope spanned by $v_1, \dots, v_m$ is given by
$\operatorname{Vol}_m(v_1, \dots, v_m) = \sqrt{\det G}$,
a classical identity that plays a central role in covariance analysis and principal component analysis [2,6].
We quantify the angular separation of $p$ predictor vectors using the polar sine. Let $X = [x_1, \dots, x_p] \in \mathbb{R}^{n \times p}$ with Gram matrix $G = X^\top X$. Following geometric constructions based on determinants and cross-products [3,9,10], the polar sine is defined as
$\operatorname{psin}(x_1, \dots, x_p) = \dfrac{\sqrt{\det G}}{\prod_{j=1}^{p} \|x_j\|}$.
When $p \le n$ and the columns of $X$ are linearly independent, $\det G > 0$ and $\operatorname{psin} \in (0, 1]$; it equals 1 if and only if the vectors are mutually orthogonal. If the columns of $X$ are linearly dependent or $p > n$, then $G$ is singular and $\det G = 0$, and we adopt the convention $\operatorname{psin} = 0$. Thus, the polar sine takes values in $[0, 1]$, is scale-invariant, and decreases toward zero as the set of columns becomes geometrically degenerate.
Within the present framework, the polar sine provides a normalized geometric measure of predictor-space collapse under multicollinearity and forms the basis of the Geometric Multicollinearity Index (GMI) introduced in the following section.
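The polar sine, and the GMI built on it, can be computed in a few lines. The following sketch assumes the complementary form $\mathrm{GMI} = 1 - \operatorname{psin}$, which matches the behavior described in the text (GMI equal to 0 for orthogonal columns and saturating at 1 under collinearity); the formal definition appears later in the paper.

```python
import numpy as np

def polar_sine(X):
    """psin(x_1,...,x_p) = sqrt(det(X^T X)) / prod_j ||x_j||; 0 if singular."""
    G = X.T @ X
    detG = np.linalg.det(G)
    if detG <= 0:                   # dependent columns or p > n
        return 0.0
    return np.sqrt(detG) / np.prod(np.linalg.norm(X, axis=0))

def gmi(X):
    """Geometric Multicollinearity Index, assumed form GMI = 1 - psin."""
    return 1.0 - polar_sine(X)

# Orthogonal columns: psin = 1, GMI = 0.
X_orth = np.eye(5)[:, :3]
assert np.isclose(polar_sine(X_orth), 1.0)

# Nearly collinear columns: psin small, GMI close to 1.
x = np.ones(5)
X_coll = np.column_stack([x, x + 1e-3 * np.arange(5)])
assert gmi(X_coll) > 0.9
```

Both quantities are invariant under rescaling of individual columns, as the definition requires.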
2.5. Projection Strategies: QR and Cross-Product Formulations
Two projection strategies play a central role in the remainder of the paper.
First, QR decomposition is used when an entire orthonormal basis of $\mathcal{C}(X)$ or $\mathcal{C}(X)^\perp$ is required. If $X = QR$ is a (thin) QR factorization with orthonormal columns in $Q$, then $QQ^\top$ is the orthogonal projector onto $\mathcal{C}(X)$, and orthonormal complements can be obtained by extending $Q$ to an orthogonal basis of $\mathbb{R}^n$ [2]. QR-based projectors are numerically stable and well suited for high-dimensional or nearly singular problems.
Second, when the residual space is one-dimensional (for example, when $\operatorname{rank}(X) = n - 1$), the cross-product provides a compact alternative. It produces a normal vector $n$ directly, from which the residual projector is derived:
$I - P = nn^\top$.
This follows without forming or inverting $X^\top X$ [3]. In higher-codimension settings ($k > 1$), cross-products can still be used to generate individual null directions, but a full residual basis is obtained more naturally from the null space of $X^\top$ and orthonormalised via QR.
When null directions are generated sequentially via cross-products and accumulated into a multivector basis that is subsequently orthonormalised using QR, we refer to the resulting construction as the Recursive Cross–Wedge–QR method.
These geometric primitives—orthonormal bases, cross-products, wedge products, Gram matrices, and polar sine—constitute the toolkit on which the residual projection framework developed in the following sections is built.
Sketch–QR Baseline (Reproducibility)
The Sketch–QR baseline follows a standard randomized range-finding procedure; throughout, $s$ denotes the sketch size (the number of random probe vectors). Given $X \in \mathbb{R}^{n \times p}$ and a sketch size $s$, we draw a Gaussian sketching matrix $\Omega$ with independent and identically distributed $\mathcal{N}(0, 1)$ entries and form the sketch $Z = X\Omega$. We then compute a thin QR factorization $Z = \tilde{Q}\tilde{R}$, where $\tilde{Q}$ has orthonormal columns. The approximate projector is $\tilde{P} = \tilde{Q}\tilde{Q}^\top$, and the approximate residual is $\tilde{r} = (I - \tilde{P})Y$. Unless stated otherwise, we use a single sketch without power iterations or oversampling [11,12].
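The Sketch–QR baseline can be sketched in NumPy as follows. This is an illustrative reimplementation (dimensions and seed are our choices, not the paper's); when the sketch size reaches the rank of $X$, a single Gaussian sketch captures the range almost surely and the approximate residual matches the exact one.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, rank, s = 200, 20, 10, 10
# Rank-deficient design: X = A B has rank 10 almost surely.
A = rng.standard_normal((n, rank))
B = rng.standard_normal((rank, p))
X = A @ B
Y = rng.standard_normal(n)

# Gaussian sketch of the range: Z = X Omega with i.i.d. N(0,1) entries.
Omega = rng.standard_normal((p, s))
Q, _ = np.linalg.qr(X @ Omega)         # thin QR: Q is n x s, orthonormal columns

# Approximate residual via the sketched projector Q Q^T.
r_tilde = Y - Q @ (Q.T @ Y)

# Exact residual via the Moore-Penrose pseudoinverse baseline.
r_exact = Y - X @ (np.linalg.pinv(X) @ Y)

cos_sim = (r_tilde @ r_exact) / (np.linalg.norm(r_tilde) * np.linalg.norm(r_exact))
assert cos_sim > 1 - 1e-6              # s >= rank(X): range captured exactly
```

With $s$ below the rank, the sketch spans only part of $\mathcal{C}(X)$ and the residual becomes a genuine approximation, which is the regime the cosine-similarity diagnostic is meant to quantify.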
4. Illustrative Numerical Examples
When the design matrix $X$ is rank deficient, the normal equations $X^\top X\beta = X^\top Y$ do not admit a unique solution. In all rank-deficient experiments, we therefore compute the minimum-norm least-squares solution using the Moore–Penrose pseudoinverse, i.e., $\hat{\beta} = X^{+}Y$, and refer to this baseline as OLS (Moore–Penrose pseudoinverse). The full implementation and experimental scripts are publicly available [13].
Predictive performance is assessed using $K$-fold cross-validation with $K = 5$, reporting CV MSE and CV $R^2$. Unless stated otherwise, all randomized procedures use a fixed random seed (seed = 42) to ensure reproducibility.
Coefficient stability is evaluated via repeated perturbation trials with the design matrix $X$ held fixed. In each trial $t$, the response is perturbed as $Y^{(t)} = Y + \varepsilon^{(t)}$, and the model is refit to obtain coefficients $\hat{\beta}^{(t)}$. We let $s_j$ denote the sample standard deviation of the $j$th coefficient $\hat{\beta}_j^{(t)}$ across trials. We summarize coefficient stability by $\operatorname{median}_j\, s_j$, i.e., the median across coefficients of their across-trial standard deviations.
To validate the geometric projector, we verify agreement between the classical OLS residual and the geometric residual by monitoring $\|r_{\mathrm{OLS}} - r_{\mathrm{geom}}\|$ up to numerical tolerance.
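The stability protocol can be sketched as follows. The trial count, noise scale, and helper name `coef_stability` are our illustrative choices (the paper's exact settings are not reproduced here); the refit uses the minimum-norm pseudoinverse baseline described above.

```python
import numpy as np

def coef_stability(X, Y, T=200, sigma=0.1, seed=42):
    """Median across coefficients of the across-trial standard deviation
    of beta-hat, refitting minimum-norm least squares on perturbed Y."""
    rng = np.random.default_rng(seed)
    Xp = np.linalg.pinv(X)              # X is held fixed across trials
    betas = np.array([Xp @ (Y + sigma * rng.standard_normal(len(Y)))
                      for _ in range(T)])
    return np.median(betas.std(axis=0, ddof=1))

rng = np.random.default_rng(0)
n = 100
x = rng.standard_normal(n)
X_orth = np.column_stack([x, rng.standard_normal(n)])        # well conditioned
X_coll = np.column_stack([x, x + 1e-2 * rng.standard_normal(n)])  # collinear
Y = rng.standard_normal(n)

# Collinear designs amplify coefficient variability under Y-perturbations.
assert coef_stability(X_coll, Y) > coef_stability(X_orth, Y)
```

The comparison makes the geometric point concrete: the same response noise produces far larger coefficient swings when the predictor parallelotope is nearly degenerate.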
Sketch–QR [11] yields an approximate basis $\tilde{Q}$ and projector $\tilde{P} = \tilde{Q}\tilde{Q}^\top$, producing the approximate residual $\tilde{r} = (I - \tilde{P})Y$. Agreement with the exact residual $r$ is summarized using the cosine similarity between $r$ and $\tilde{r}$ and the residual discrepancy $\|r - \tilde{r}\|$ (and, when reported, a principal-angle distortion between $\mathcal{C}(X)$ and $\mathcal{C}(\tilde{Q})$).
4.1. Rank-One Residual Projection in $\mathbb{R}^3$
This example shows explicitly how the cross-product normal vector reproduces the ordinary least-squares (OLS) residual in the rank-one case, in line with Proposition 1.
We consider a rank-one residual setting in which the orthogonal complement of the column space is one-dimensional. Let $X \in \mathbb{R}^{3 \times 2}$ have full column rank (so that $X^\top X$ is invertible). The ordinary least-squares projection matrix is $P = X(X^\top X)^{-1}X^\top$, and a direct calculation yields the corresponding residual $r = (I - P)Y$.
To recover the same residual in purely geometric form, we construct a unit normal vector to the column space of $X$ by normalizing the cross-product of its two columns. The resulting vector $n$ satisfies $n \perp \mathcal{C}(X)$ and $\|n\| = 1$, so Proposition 1 applies, and $I - P = nn^\top$. The residual can therefore be written as the rank-one projection $r = (n^\top Y)\,n$, which coincides with $(I - P)Y$ and lies on the line spanned by $n$.
From a computational point of view, this example highlights the appeal of the geometric formulation in low dimensions: a cross-product and a scalar–vector multiplication suffice to obtain the residual, avoiding explicit matrix inversion. Once $n$ is known, applying the projector to a new response vector reduces to one dot product and one scaled copy of $n$, i.e., $\mathcal{O}(n)$ work per residual.
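The rank-one construction can be verified numerically. The design matrix and response below are illustrative stand-ins (the paper's symbolic example is not reproduced here); the check is that the cross-product route and the classical projector give the same residual.

```python
import numpy as np

# Illustrative 3x2 full-column-rank design and response.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
Y = np.array([1.0, 0.0, 2.0])

# Classical OLS residual via P = X (X^T X)^{-1} X^T.
P = X @ np.linalg.solve(X.T @ X, X.T)
r_ols = Y - P @ Y

# Geometric route: unit normal from the cross-product of the two columns.
w = np.cross(X[:, 0], X[:, 1])
nvec = w / np.linalg.norm(w)
r_geo = (nvec @ Y) * nvec          # rank-one projection (n n^T) Y

assert np.allclose(r_ols, r_geo)   # identical residuals
```

For this particular data, both routes give $r = (0.5, -1.0, 0.5)$, a vector on the line spanned by the normal direction.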
4.2. Multivector Residual Projection in $\mathbb{R}^4$
This example illustrates the multivector residual projection via an orthonormal basis of $\mathcal{C}(X)^\perp$, as in Theorem 2, and links it to the Geometric Multicollinearity Index.
We now consider a multivector setting in which the residual subspace has dimension greater than one. Let $X \in \mathbb{R}^{4 \times 2}$ have full column rank. The ordinary least-squares projector is again $P = X(X^\top X)^{-1}X^\top$, and a straightforward computation gives the residual $r = (I - P)Y$. Since $\operatorname{rank}(X) = 2$, the orthogonal complement has dimension $k = 4 - 2 = 2$. A basis for the null space of $X^\top$ can be computed directly, and applying Gram–Schmidt yields an orthonormal basis $N = [n_1, n_2]$ with $n_1 \perp n_2$ and $\|n_1\| = \|n_2\| = 1$. The general residual projection theorem then gives
$I - P = NN^\top = n_1 n_1^\top + n_2 n_2^\top$.
Evaluating $NN^\top Y$ for this $N$ and $Y$ reproduces the OLS residual, making explicit that the residual lives in the two-dimensional subspace spanned by $n_1$ and $n_2$, and realizing the null-space + QR construction of Theorem 2 in a concrete low-dimensional setting.
From a complexity point of view, building $N$ from a null-space basis followed by QR (or Gram–Schmidt) is a one-off cost of order $\mathcal{O}(nk^2)$ for a residual subspace of dimension $k$, while applying the projector $NN^\top$ to a new response vector requires only $\mathcal{O}(nk)$ operations.
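The multivector construction can likewise be checked numerically; the 4×2 design and response below are illustrative stand-ins for the paper's symbolic example.

```python
import numpy as np
from scipy.linalg import null_space

# Illustrative 4x2 full-column-rank design and response.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [1.0, 2.0]])
Y = np.array([1.0, 2.0, 0.0, 1.0])

# OLS residual via the projector P = X (X^T X)^{-1} X^T.
P = X @ np.linalg.solve(X.T @ X, X.T)
r_ols = Y - P @ Y

# Orthonormal basis of C(X)^perp = null(X^T); here k = 4 - 2 = 2.
N = null_space(X.T)
assert N.shape == (4, 2)

# Multivector residual projection: r = N N^T Y.
r_geo = N @ (N.T @ Y)
assert np.allclose(r_ols, r_geo)
```

The one-off basis construction (`null_space` plus its internal orthonormalization) is then amortized over repeated applications of $NN^\top$ to new responses.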
4.2.1. GMI for the Example
For this design matrix $X$, the Gram matrix is $G = X^\top X$. Dividing $\sqrt{\det G}$ by the product of the column norms yields the polar sine, and hence the Geometric Multicollinearity Index $\mathrm{GMI} = 1 - \operatorname{psin}$.
In this small example, the resulting GMI value indicates a moderate degree of collinearity between the two regressors; the residual subspace $\mathcal{C}(X)^\perp$ has dimension two (as $k = 4 - 2 = 2$), and, compared to the orthonormal case ($\mathrm{GMI} = 0$), the fitted coefficients are more sensitive to perturbations of $Y$.
This example isolates the effect of controlled collinearity on GMI in a minimal symbolic setting.
Consider a one-parameter family of matrices $X_\varepsilon$ whose two columns become progressively more collinear as $\varepsilon$ grows. For every admissible $\varepsilon$, the two columns are linearly independent, so $G_\varepsilon = X_\varepsilon^\top X_\varepsilon$ is positive definite and the polar sine is well defined. A direct computation gives the polar sine as an explicit function of $\varepsilon$, and hence $\mathrm{GMI}(\varepsilon) = 1 - \operatorname{psin}(\varepsilon)$. As $\varepsilon$ increases, the columns become more collinear, the polar sine decreases, and GMI increases, reflecting the growing multicollinearity in a purely geometric way. In families where the columns eventually become exactly collinear, the Gram determinant vanishes, the polar sine drops to zero, and GMI reaches 1, in agreement with the rank-deficient discussion in Section 3.
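The monotone behavior of GMI under controlled collinearity can be sketched with a hypothetical one-parameter family (the specific family below is ours, chosen so the second column drifts toward the first; it is not the paper's symbolic family).

```python
import numpy as np

def gmi(X):
    """GMI = 1 - polar sine (assumed form), psin = sqrt(det G) / prod ||x_j||."""
    G = X.T @ X
    detG = max(np.linalg.det(G), 0.0)
    return 1.0 - np.sqrt(detG) / np.prod(np.linalg.norm(X, axis=0))

# Second column interpolates from e2 (orthogonal) toward x1 (collinear).
x1 = np.array([1.0, 0.0, 0.0])
vals = []
for t in [0.0, 0.5, 0.9, 0.99]:
    x2 = (1 - t) * np.array([0.0, 1.0, 0.0]) + t * x1
    x2 /= np.linalg.norm(x2)
    vals.append(gmi(np.column_stack([x1, x2])))

# GMI starts at 0 for orthogonal columns and increases toward 1.
assert abs(vals[0]) < 1e-9
assert all(a < b for a, b in zip(vals, vals[1:]))
```

For two columns, the polar sine is simply the sine of the angle between them, so GMI rises smoothly from 0 toward 1 as that angle closes.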
4.2.2. Large-Scale Multivector Residual Projection
To complement the low-dimensional symbolic example above, we consider a larger synthetic design in which the residual subspace has a dimension greater than one and the multivector formulation is essential.
We generate a rank-deficient design matrix $X \in \mathbb{R}^{n \times p}$ with $\operatorname{rank}(X) < p$, so that the residual subspace dimension satisfies $k = n - \operatorname{rank}(X) > 1$. All predictors are standardized, and the response is constructed so that the exact residual has a prescribed norm, ensuring a controlled comparison between methods.
Table 2 reports residual norms, cosine similarity to the exact residual, and run times for the variants of the Article 1 projector only. The exact constructions (null space/QR, rank-one cross-product when applicable, and the Moore–Penrose pseudoinverse baseline for OLS) reproduce the same residual up to numerical precision, confirming the multivector identity $r = NN^\top Y$ in a high-dimensional setting.
Sketch–QR yields an approximate residual, whose agreement with the exact residual is summarized by cosine similarity. Entries marked NaN or N/A indicate methods that are not applicable in this regime (e.g., the rank-one cross-product when $k > 1$), while FAIL indicates that the required algebraic conditions (e.g., $\operatorname{rank}(X) = n - 1$) are violated.
This experiment demonstrates that the geometric residual formulation extends naturally from symbolic low-dimensional examples to realistic large-n, rank-deficient designs, without relying on regularization or nonlinear modeling.
Together with the symbolic $\mathbb{R}^4$ example, this large-scale synthetic experiment confirms that the multivector residual projector $NN^\top$ is exact across dimensions, while Sketch–QR provides a principled approximation when a reduced basis is desired, setting the stage for the real-data illustrations in Section 4.3.
4.3. Illustrative Real Data Use of GMI
Real-data experiments use two regression benchmarks: Boston Housing and Auto MPG. Both are standardized via centering and scaling before forming X, and they provide realistic multicollinearity patterns suitable for illustrating GMI as a purely geometric, response-independent diagnostic.
This example illustrates how GMI can serve as a purely geometric multicollinearity diagnostic on real data and how it can be interpreted alongside coefficient stability.
We use the Boston Housing dataset because it is a widely used regression benchmark with well-known predictor correlations, making it a convenient setting for illustrating multicollinearity diagnostics and residual geometry on realistic data.
To illustrate the geometric diagnostics on real data, we use the classical Boston Housing dataset, with median house price as the response and a set of standardized socio-economic and structural predictors. The protocol used in this subsection is as follows:
- (a)
Standardize all predictors to zero mean and unit variance, and form the design matrix X.
- (b)
For selected subsets of predictors, form the corresponding design matrix $X$, compute the Gram matrix $G = X^\top X$, and evaluate the polar sine and GMI using the definitions in (18).
- (c)
Report predictive metrics using five-fold cross-validation (CV), so that MSE and $R^2$ are out-of-sample summaries rather than in-sample fit statistics.
- (d)
Fit ordinary least-squares models on these subsets and monitor the variation of the estimated coefficients under small perturbations of Y (e.g., additive Gaussian noise or resampling).
The purpose of this subsection is to demonstrate (i) that the geometric residual projector reproduces the classical OLS residuals on a realistic dataset and (ii) that GMI provides a compact, response-independent summary of predictor-space degeneracy.
4.3.1. Boston Housing and Auto MPG: Increasing Model Size
Table 3 and Table 4 report results for increasing model size (nested predictor subsets of growing size and, where applicable, the full model). For each subset, we report out-of-sample predictive metrics using five-fold cross-validation (CV MSE and CV $R^2$), geometric multicollinearity (GMI), and classical diagnostics (max VIF and the condition number), along with a summary of coefficient stability.
We restrict algorithmic comparisons to the projector variants defined in Table 1 (OLS/Cholesky when applicable, null space/QR, rank-one cross-product when $\operatorname{rank}(X) = n - 1$, and Sketch–QR).
Table 3 and Table 4 show that increasing the model size improves predictive performance (lower CV MSE, higher CV $R^2$) while typically increasing multicollinearity. In Boston Housing, moving from the smallest subset to the full model reduces CV MSE and increases CV $R^2$, while GMI increases from 0 and the classical diagnostics (max VIF, condition number) increase from their orthogonal baseline of 1. In Auto MPG, the collinearity is even stronger: the full model attains the best CV MSE and CV $R^2$, but the corresponding GMI, max VIF, and condition number are substantially larger.
These results illustrate the intended role of GMI: it is computed purely from X (independent of Y) and tracks familiar multicollinearity measures (VIF and condition number), while the coefficient stability summary provides a concrete link between predictor space degeneracy and sensitivity of fitted parameters.
4.3.2. Implementation Note (Stable GMI Computation)
When the Gram matrix $G = X^\top X$ is poorly conditioned, direct evaluation of determinants is numerically fragile. We therefore compute the polar sine and GMI in the log-domain with a small jitter $\delta$ added to the diagonal of $G$. In implementation, $\log\det(G + \delta I)$ is evaluated using a numerically stable log-determinant routine (equivalently, a Cholesky factorization when $G + \delta I$ is positive definite). Unless stated otherwise, we use a fixed small default value of $\delta$.
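A log-domain evaluation along these lines can be sketched as follows; the jitter value $\delta = 10^{-10}$ and the clamp to $[0, 1]$ are our illustrative choices, not the paper's stated defaults.

```python
import numpy as np

def gmi_stable(X, jitter=1e-10):
    """Log-domain GMI: psin via slogdet of G + jitter*I, avoiding the
    underflow of a direct determinant when G is poorly conditioned."""
    G = X.T @ X
    p = G.shape[0]
    sign, logdet = np.linalg.slogdet(G + jitter * np.eye(p))
    log_psin = 0.5 * logdet - np.sum(np.log(np.linalg.norm(X, axis=0)))
    # Clamp psin to [0, 1] to guard against jitter-induced overshoot.
    return 1.0 - min(np.exp(log_psin), 1.0)

# Nearly identical columns: the Gram matrix is near-singular, yet the
# log-domain route returns a finite GMI close to 1.
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
X = np.column_stack([x, x + 1e-8 * rng.standard_normal(1000)])
val = gmi_stable(X)
assert 0.0 <= val <= 1.0 and val > 0.99
```

`np.linalg.slogdet` returns the sign and the log of the absolute determinant, so the product of near-zero and near-huge eigenvalues never materializes as a single floating-point determinant.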
Figure 4 makes the equivalence of the exact geometric projector and classical OLS residuals explicit: the discrepancy curve for the exact QR-basis construction is at numerical noise level in both model configurations considered. In contrast, Sketch–QR constructs an approximate projector $\tilde{P}$, so its residual $\tilde{r}$ differs slightly from $r$; the magnitude of this discrepancy decreases as the sketch size $s$ increases. Overall, this visualization complements Table 3 and Table 4 by separating the exact residual-geometry identities (Article 1 methods) from the approximate Sketch–QR construction while keeping the presentation compact and reproducible.
5. Discussion
This paper presents a rank-aware geometric interpretation of linear regression residuals via orthogonal projectors. In the codimension-one case, the residual is fully determined by a unit normal $n$, resulting in $r = (n^\top Y)\,n$; in general rank settings, the residual lies in a higher-dimensional orthogonal complement spanned by an orthonormal basis $N$, with $r = NN^\top Y$. This formulation makes the residual subspace explicit and decouples basis construction from repeated projection.
An immediate consequence is that all exact computational routes (normal-equation OLS, Cholesky-based solvers, and null-space or QR constructions) must yield identical residuals up to numerical tolerance, as they implement the same projector onto $\mathcal{C}(X)^\perp$. We confirm this numerically through residual agreement metrics and discrepancy visualizations, showing that differences arise only from conditioning and the computational regime. In contrast, Sketch–QR yields an approximate projector and residual, whose deviation from the exact solution is quantified using cosine similarity and norm-based discrepancies.
We further link residual geometry to predictor-space conditioning through the Geometric Multicollinearity Index (GMI), a scale-invariant, normalized volume measure derived from the polar sine. Across the Boston Housing and Auto MPG benchmarks, higher GMI values consistently align with larger max VIF and condition numbers, while coefficient stability summaries illustrate increased parameter sensitivity under geometric degeneracy.
Although both regression and PCA rely on orthogonal projections, their objectives differ fundamentally: PCA projects predictors to minimize reconstruction error, with residuals remaining in the predictor space, whereas regression projects the response, yielding residuals in $\mathcal{C}(X)^\perp$.
Finally, this work focuses on unregularized least squares and exact orthogonal projection. With regularization, the operators are no longer orthogonal and the residual subspace becomes non-Euclidean or data-adaptive, motivating future work on regularized and sketch-aware residual geometry while preserving the geometric diagnostics enabled by GMI.
6. Conclusions
This paper presented a rank-aware geometric interpretation of linear regression residuals. In the codimension-one case, the residual projector reduces to the rank-one form $I - P = nn^\top$, where $n$ spans $\mathcal{C}(X)^\perp$. For general rank, the residual lies in a $k$-dimensional orthogonal complement, with projector $NN^\top$ for an orthonormal basis $N$ of $\mathcal{C}(X)^\perp$. Although algebraically equivalent to the classical OLS formulation, this representation makes the residual subspace explicit and decouples basis construction from repeated projection.
Building on this viewpoint, we introduced the Geometric Multicollinearity Index (GMI), a scale-invariant diagnostic derived from the polar sine and Gram determinant that quantifies predictor-space degeneracy. Synthetic experiments confirm predictable behavior under controlled perturbations, while real-data benchmarks on Boston Housing and Auto MPG show that increasing model size typically improves predictive performance while increasing multicollinearity; GMI captures this trade-off and aligns with max VIF and the condition number. Across all experiments, exact projector constructions yield identical residuals up to numerical tolerance, whereas Sketch–QR produces a controlled approximation whose deviation is quantified via cosine similarity and residual discrepancy norms.
In general, these results elevate $\mathcal{C}(X)^\perp$ to a central geometric object in regression analysis and position GMI as a concise indicator of departure from orthogonal designs. Extensions to sketch-based, large-scale, streaming, and nonlinear settings are deferred to companion work.