General Hyperplane Prior Distributions Based on Geometric Invariances for Bayesian Multivariate Linear Regression

Based on geometric invariance properties, we derive an explicit prior distribution for the parameters of multivariate linear regression problems in the absence of further prior information. The problem is formulated as a rotationally-invariant distribution of L-dimensional hyperplanes inN dimensions, and the associated system of partial differential equations is solved. The derived prior distribution generalizes the already known special cases, e.g., 2D plane in three dimensions.


Introduction
In the context of Bayesian probability theory, a proper assignment of prior probabilities is crucial.Depending on the domain, quite different prior information can be available.It may be in the form of point estimates provided by domain experts (see, e.g., [1] for prior distribution elicitation) or in the form of invariances (of the prior knowledge) of the system of interest, which should be reflected in the prior probability density [2].However, especially for the ubiquitous case of the estimation of parameters of linear equation systems (like a straight line or hyperplane fitting), the latter requirement is often violated.Consider, for concreteness, the simple case of y = ax, a straight line through the origin, with a the parameter of interest.Here, the commonly-applied prior is constant, p (a | I) = const., often accompanied by statements like "Since we do not have specific prior information, we chose a uniform prior on a...".In Figure 1 on the left-hand side, 15 random samples generated from this prior distribution with a ∈ [0, 50] are displayed.Confronted with this result, the typical response is (at least in the experience of the author) that instead, a more "uniform" prior distribution of the slopes was intended, which is often depicted like in Figure 1 on the right-hand side.This plot was generated from a prior distribution that has an equal probability density for the angle of the line to the abscissa, corresponding to Additionally, in fact, in practice, the units of the axes are commonly chosen in such a way that extreme values of the slopes are not a priori overrepresented.If we generalize this requirement to more than one independent or dependent variable, then the desired prior probability should be invariant under arbitrary rotations in this parameter space.Some important special cases have been given already in [3], e.g., for a 1D line in two dimensions or a 2D plane in three dimensions.There also, the governing transformation invariance equation underlying invariant priors is derived.These special cases have since then been generalized to invariant priors for (N − 1)-dimensional hyperplanes in N -dimensional space; see, e.g., [4].These hyperplane priors proved to be valuable for Bayesian neural networks [5], where the specific properties of the prior density favored node-pruning instead of simple edge pruning of standard (quadratic) weight regularizers.This is especially helpful for a Bayesian approach to fully-connected deep convolutional networks; see e.g., [6,7].Nevertheless, the general case of prior probability densities for L-dimensional hyperplanes in N -dimensions (N > L) in a suitable parameterization has not been available so far.It has even been conjectured that it is impossible to derive a general solution [8].Luckily, this conjecture has been too pessimistic, and an explicit formula for the prior density, which can directly be applied to linear regression problems, is derived below.
It should be pointed out that multivariate regression is of course a longstanding topic in Bayesian inference.with classical contributions, e.g., by Box and Tiao [9], Zellner [10] or West [11].However, the standard approach is based on the use of conjugate priors (instead of invariance priors), mostly for computational convenience [12].In contrast, the subsequently derived prior distribution is determined by the basic desideratum of consistency if the available prior information is invariant under the considered transformations (i.e., rotations).Whether this invariance holds depends on the considered problem and must not be assumed without further consideration (similar to the case of flat priors for the coefficients).For example, the assumption of rotation invariance may not be suitable for covariates with different underlying units (e.g., m 2 , kg).

Problem Statement
In standard notation, a multivariate regression model is notated as follows: with: where z i is the response vector, y i the model value vector, x i the vector of the L covariates for observation i, t the intercept vector and A the M × L-dimensional matrix of adjacent regression coefficients.The observation noise i of each data point is often considered as Gaussian distributed, i ∼ N (0, Σ).This regression model can also be considered as estimating the "best" L-dimensional hyperplane in an N -dimensional space, because in an N -dimensional space, an L-dimensional hyperplane is given by: The quantity of interest is the prior probability density which remains invariant under translations and rotations of the coordinate system.

Derivation
Using the transformation invariance equation derived in [3]: for infinitesimal transformations of the form we can establish a partial differential equation system for F .

Invariance under Translations
Let us first consider a translation with respect to y i : y i = y i + , i.e., g i = 1, g j,j =i = 0.Then, the equation in the primed variables reads: Collecting the coefficients yields t i = t i + , and therefore, Equation ( 5) results in: which holds for any i.Therefore, F (A, t) can be a function of a only.Since F (A | I) does not depend on t, the prior distribution is improper (not normalizable in t) as long as there are no limits on the magnitude of t.
The translation with respect to x i results in the same conclusion.

Invariance under Rotations
The general rotation in n-dimensional space may be expressed as a sequence of rotations around rotation axes, which are perpendicular to the planes spanned by appropriately-chosen pairs of coordinate system basis vectors [13].This is based on the fact that any orthogonal matrix, i.e., rotation matrices, can be written uniquely as a product of 2 × 2 rotations.To avoid convoluted language, we denote in the following the rotation around the rotation axis that is perpendicular to the plane spanned by the linear combination of the basis vectors e i and e j simply as rotation in the x i x j -plane.

Rotation in the x
Now, we perform one such infinitesimal 2 × 2-rotation for independent parameters around an arbitrary rotation axis perpendicular to the plane spanned by e i and e j , preserving all other coordinates: Substituting the primed coordinates into Equation ( 5) yields the implied transformations: and, therefore, the partial differential equation: 3.2.2.Rotation in the y i y j -Plane Now, we perform one such rotation in the plane of two dependent parameters e i and e j ; thus y k = y k ∀ k = (j, i) and: Substituting the primed coordinates into Equation ( 5) yields the implied transformations: and, therefore, the partial differential equation: 3.2.3.Rotation in a Plane Spanned by x i y j -Axes Performing a rotation in the xy-plane, we obtain: which yields (see the Appendix): and therefore:

The PDE System
The translation invariance of Equation ( 5) excludes a dependence of Rotation invariance with respect to the y-axis requires F to fulfill the homogeneous, linear system of first order partial differential equations (i, j ∈ [1, M ] , i = j) (i.e., Equation ( 21)): and similar for rotations around the x-axis (i, j ∈ [1, L] , i = j) (Equation ( 13)): Rotations around an axis perpendicular to a plane given by an x,y-pair require the probability distribution to obey also Using the product rule, the double sum can be rewritten as and the last term of the previous equation can be split into three parts and simplified: Using this, Equation (30) can be written as:

Solution
This system of PDEs (Equations ( 28), ( 29) and ( 33)) can be tackled with the theory of Lie groups, which provides a systematic, though algebraically-intensive solution strategy, which is implemented in contemporary computer algebra systems.The solutions of several test cases computed by the Maple computer algebra system (http://www.maplesoft.com/)(it proved to be superior to MATHEMATICA (www.http://www.wolfram.com/mathematica/)for the present PDE-systems) led to the conjecture that a general solution to this PDE system is given by the sum of the squares of all possible minors of the coefficient matrix: where A n denotes a submatrix (minor) of size n × n (this notation is used at various places throughout the paper and should not be confused with the power of a matrix, which does not occur in this paper) and P = Min (M, L).Equation (34) does not appear unreasonable from the onset as prior density, because it preserves the underlying symmetry of the problem (permutation invariance of the parameters) and it is non-negative.
An explicit example for the case N = 4, L = 2 is: A two-dimensional slice of this probability density is given in Figure 2. The high symmetry of the prior distribution with respect to parameter permutations results in similar, "Cauchy"-like shapes if slices along other parameter axis are displayed.For the case N = 6, L = 3, the solution is given by:
We will make repeated use of the Laplace expansion of determinants: where the minor M n−1 ij is the (n − 1) × (n − 1)-matrix derived from the n × n-matrix A n by deletion of the i-th row and j-th column (by definition M 0 := 1).The cofactor matrix A n−1 ij is defined to be: and satisfies the following n-equations Further useful is the following form of the Laplace expansion, taking into account index shifts of a previous deletion of row k and column i of an (n + 1)-matrix A n+1 , resulting in the minor M n ki : where M n−1 (jk)(li) is the minor given by deletion of the j-th and k-th row and the l-th and i-th column.l and j are defined as: In the following, we face the problem of possibly too heavy of a nomenclature, because we need summation indices, while we also need to keep track of the original indices underlying the entries in the minors, where some rows and columns have been deleted, although the relative order is preserved.The mapping could be expressed, e.g., as a i(i )j(j ) with i , j ∈ [1, m] and i (:) ∈ [1, M ] and j (:) ∈ [1, L].To avoid this cumbersome notation, we implicitly assume from now on (up to the Conclusion Section) this mapping for all summations that are indexed by either k or l.Therefore: (41) 6.2.x i x j -and y i y j -Rotations We now verify that Equation (34) solves Equation (29).It is obvious that only those determinants of Equation (34) that contain column i or column j have the potential to provide non-zero contributions in Equation (29): if column j is missing, the derivative in the first term is zero.If, instead, column i is missing, then the derivative in the second term of Equation (29) yields zero.To proceed, we introduce H(A) via: It is noteworthy that H(A) has a very simple form: it is given by a sum of positive terms.This almost decouples the problem, and we can largely proceed on a term-by-term basis.Using the equality: the left-hand side of Equation (29) transforms to (i, j ∈ [1, L] , i = j): and using Equation (38), we obtain: and, therefore, Equation (34) solves Equation ( 29).The calculation is similar for Equation ( 28) and yields the result that Equation (34) solves also the system Equation (28).

(x i y j )-Rotations
The verification of the successful solution of Equation ( 33) by Equation ( 34) requires some more steps.As before, Equation (33) can be written as: and after multiplication with H (A) L+M +3 2 as:

Matrices with Either Row j or Column i
The inner double sum can be simplified for all matrices containing either row j or column i (i.e., all matrices of size P × P and all matrices A m,r of size ) using the Laplace expansion (here, the expansion with respect to row j is shown): which cancels the corresponding determinant of H (A) in the last term of Equation (48).

Matrices with Neither Row j nor Column i
The basic idea is to show that M −1 m L−1 m -matrices with neither row j nor column i, m ∈ (1, 2, • • • , P − 1), cancel with the contributions of the corresponding matrices including row j and column i of size (m + 1) × (m + 1) of the second term.
Please note that there is a one-to-one correspondence of minors of size m × m without the j-th row and i-th column and the matrices of size (m + 1) × (m + 1) with row j and column i in the second term, therefore allowing one to label both with the same index r.After division by − (M + L + 1), the remaining terms of Equation ( 48) are (taking into account that the labeling of the rows and columns of the matrices of size (m + 1) × (m + 1) and (m) × (m) must be consistent): with H ji now only containing determinants with neither row j nor column i.If we now only consider the relevant term of H ji , we can write: The equation is trivially true if det A m,r ji = 0; otherwise, we can divide by det A m,r ji and obtain: Replacing the various cofactors by the corresponding minors (cf.Equations ( 36) and (37)) yields: and after replacing det (A m+1,r ) by its Laplace expansion together with multiplication by (−1) (i+j) , the equation reads: and can be simplified to: (54) because (−1) 2i equals one in the second term.Therefore, the third term cancels with the second term for k = j, and the remaining equation is given by: Using Equation (39) together with the definition Equation (40), the inner sum of the first term of Equation ( 55) can be rewritten as: and: Splitting the summation over k into two parts (k < j) and (k > j) and inserting the definition for k , we obtain: where the first two terms cancel the last two terms.Summarizing the previous approach, we have shown that for an arbitrary n × n-determinant, the first and third term of Equation (48) almost cancel.Only determinants not containing the j-th row and the i-th column remain.These remaining contributions are canceled by the (n + 1)-order determinant (required to contain the matrix element a ji ) of the second term in Equation ( 48).This schema can be repeated down to n = 1, and the last step (n = 0) is easily explicitly calculated.This finishes our derivation.

Relation to Previously-Derived Special Cases
The underlying equation systems of the special case of an (n−1)-dimensional hyperplane in an n-dim space used in [8] and in this paper differ slightly due to a different parameterization, and therefore, the derived priors appear on first glance to be different, although they are identical, as will be shown below.
For probability density functions in different coordinate systems, the following equation holds: where |• • • | denotes the absolute value of the Jacobi determinant: The equation describing the (n-1)-dim hyperplane in an n-dim space in this paper is given by: and results in the following prior: In [8], the corresponding hyperplane equation reads: with prior distribution: The latter constraints yield a proper (normalizable) prior.The relation of the two different parameterizations is given by: which yields the Jacobian: Using this result and Equation (65), we can write: which shows the equivalence of the two priors (Equations ( 62) and ( 64)).The requirement of b 2 i > R 2 0 leads to: In the case of all a 1i = 0, we obtain: which means that the lower limit R 2 0 corresponds to an upper limit of t 2 1 .

Practical Hints
In the worst case, the hyperplane prior has an exponentially-increasing number of determinants with increasing dimension.The total number of individual determinants for an N -dimensional plane in a 2N -dimensional space is given by: which is already 70 for a 4D hyperplane in an 8D space.Therefore, it is advantageous to compute the determinants using iteratively the Laplace expansion, starting from small determinants, storing the determinants of the previous step.This requires the storage of at most N N/2 2 terms.As a proposal density for Markov chain Monte Carlo (MCMC) sampling methods (e.g., rejection sampling), the dominating multivariate Cauchy distribution is a good candidate.Source code for the set up of the PDE system and for the solution, together with a Maple script for the verification of the solution, can be obtained from the author.

Conclusions
This paper has derived a prior density for L-dimensional hyperplanes in N -dimensional space, based on geometric invariances.It is suited, e.g., to parameter estimation of multilinear regression problems in the absence of further prior knowledge or Bayesian model estimation for neural networks.In the latter case, the prior has to be made proper by suitable restriction of the range of the offset parameters, which depends on domain knowledge.The obtained prior density avoids the too strong weight of "large" values of the regression coefficients typically assigned by uniform priors.Being a rational function, its influence on the parameter estimates on standard problems with Gaussian uncertainties (resulting in an exponential likelihood) on the data will be limited.However, this can be different for robust estimation approaches with heavy-tailed likelihood distributions.

Figure 2 .
Figure 2. Probability density of p (a 11 , a 21 | a 12 , a 22 , I) for a 12 = 3 and a 22 = 5 for the case N = 4, L = 2.The probability density exhibits the typical "Cauchy-"like shape with heavy tails compared to a binormal distribution.Due to the symmetry of the prior distribution, slices with respect to the other parameters display the same basic features.