
General Hyperplane Prior Distributions Based on Geometric Invariances for Bayesian Multivariate Linear Regression

Max-Planck-Institute for Plasmaphysics, Boltzmannstrasse 2, 85748 Garching, Germany
Academic Editors: Frédéric Barbaresco and Ali Mohammad-Djafari
Entropy 2015, 17(6), 3898-3912; https://doi.org/10.3390/e17063898
Received: 30 March 2015 / Revised: 1 June 2015 / Accepted: 2 June 2015 / Published: 10 June 2015
(This article belongs to the Special Issue Information, Entropy and Their Geometric Structures)

Abstract

Based on geometric invariance properties, we derive an explicit prior distribution for the parameters of multivariate linear regression problems in the absence of further prior information. The problem is formulated as a rotationally-invariant distribution of \(L\)-dimensional hyperplanes in \(N\) dimensions, and the associated system of partial differential equations is solved. The derived prior distribution generalizes the already known special cases, e.g., 2D plane in three dimensions.
Keywords: prior probabilities; hyperplanes; geometrical probability; neural networks

1. Introduction

In the context of Bayesian probability theory, a proper assignment of prior probabilities is crucial. Depending on the domain, quite different prior information can be available. It may be in the form of point estimates provided by domain experts (see, e.g., [1] for prior distribution elicitation) or in the form of invariances (of the prior knowledge) of the system of interest, which should be reflected in the prior probability density [2]. However, especially in the ubiquitous case of estimating the parameters of linear equation systems (like straight-line or hyperplane fitting), the latter requirement is often violated. Consider, for concreteness, the simple case of y = ax, a straight line through the origin, with a the parameter of interest. Here, the commonly-applied prior is constant, p(a|I) = const., often accompanied by statements like “Since we do not have specific prior information, we choose a uniform prior on a…”. In Figure 1 on the left-hand side, 15 random samples generated from this prior distribution with a ∈ [0, 50] are displayed. Confronted with this result, the typical response is (at least in the experience of the author) that instead a more “uniform” prior distribution of the slopes was intended, which is often depicted as in Figure 1 on the right-hand side. This plot was generated from a prior distribution that has an equal probability density for the angle of the line to the abscissa, corresponding to
\[ p(a \mid I) \propto \frac{1}{(1 + a^2)^{3/2}} . \]
In fact, in practice the units of the axes are commonly chosen in such a way that extreme values of the slopes are not a priori overrepresented. If we generalize this requirement to more than one independent or dependent variable, then the desired prior probability should be invariant under arbitrary rotations in this parameter space. Some important special cases have already been given in [3], e.g., for a 1D line in two dimensions or a 2D plane in three dimensions; there, the governing transformation invariance equation underlying invariant priors is also derived. These special cases have since been generalized to invariant priors for (N − 1)-dimensional hyperplanes in N-dimensional space; see, e.g., [4]. These hyperplane priors proved to be valuable for Bayesian neural networks [5], where the specific properties of the prior density favored node pruning instead of the simple edge pruning of standard (quadratic) weight regularizers. This is especially helpful for a Bayesian approach to fully-connected deep convolutional networks; see, e.g., [6,7].
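The normalized version of this density has a closed-form inverse cumulative distribution function, so slope samples like those in Figure 1 (right) can be drawn directly. The following Python sketch is illustrative only (not part of the original paper; the function name and sample sizes are assumptions):

```python
import math
import random

def sample_slope(u):
    """Inverse-CDF sample from the normalized density p(a) = (1/2)(1 + a^2)^(-3/2).

    The CDF is F(a) = (1 + a / sqrt(1 + a^2)) / 2, so with s = 2u - 1
    the inverse is a = s / sqrt(1 - s^2).
    """
    s = 2.0 * u - 1.0
    return s / math.sqrt(1.0 - s * s)

random.seed(0)
samples = [sample_slope(random.random()) for _ in range(100_000)]

# The empirical CDF at a = 1 should approach F(1) = (1 + 1/sqrt(2))/2 ≈ 0.854
frac = sum(1 for a in samples if a <= 1.0) / len(samples)
```

Heavy tails are visible immediately: moderate slopes dominate, but very steep lines still occur occasionally, in contrast to a prior that is uniform in the slope.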
Nevertheless, the general case of prior probability densities for L-dimensional hyperplanes in N-dimensions (N > L) in a suitable parameterization has not been available so far. It has even been conjectured that it is impossible to derive a general solution [8]. Luckily, this conjecture has been too pessimistic, and an explicit formula for the prior density, which can directly be applied to linear regression problems, is derived below.
It should be pointed out that multivariate regression is, of course, a longstanding topic in Bayesian inference, with classical contributions, e.g., by Box and Tiao [9], Zellner [10] or West [11]. However, the standard approach is based on the use of conjugate priors (instead of invariance priors), mostly for computational convenience [12]. In contrast, the subsequently derived prior distribution is determined by the basic desideratum of consistency if the available prior information is invariant under the considered transformations (i.e., rotations). Whether this invariance holds depends on the considered problem and must not be assumed without further consideration (similar to the case of flat priors for the coefficients). For example, the assumption of rotation invariance may not be suitable for covariates with different underlying units (e.g., m², kg).

2. Problem Statement

In standard notation, a multivariate regression model reads as follows:
\[ y_i = A x_i + t, \qquad x_i \in \mathbb{R}^L,\; A \in \mathbb{R}^{M \times L},\; t \in \mathbb{R}^M \;\text{and}\; y_i \in \mathbb{R}^M, \]
with:
\[ z_i = y_i + \epsilon_i, \qquad \epsilon_i \in \mathbb{R}^M, \]
where \(z_i\) is the response vector, \(y_i\) the model value vector, \(x_i\) the vector of the L covariates for observation i, t the intercept vector and A the M × L-dimensional matrix of regression coefficients. The observation noise \(\epsilon_i\) of each data point is often considered to be Gaussian distributed, \(\epsilon_i \sim \mathcal{N}(0, \Sigma)\). This regression model can also be considered as estimating the “best” L-dimensional hyperplane in an N-dimensional space, because in an N-dimensional space, an L-dimensional hyperplane is given by:
\[ \begin{aligned} y_1 &= a_{11} x_1 + a_{12} x_2 + \cdots + a_{1L} x_L + t_1 \\ y_2 &= a_{21} x_1 + a_{22} x_2 + \cdots + a_{2L} x_L + t_2 \\ y_3 &= a_{31} x_1 + a_{32} x_2 + \cdots + a_{3L} x_L + t_3 \\ &\;\;\vdots \\ y_M &= a_{M1} x_1 + a_{M2} x_2 + \cdots + a_{ML} x_L + t_M \end{aligned} \]
with M = N − L.
The quantity of interest is the prior probability density \(F(A, t) = F(a_{11}, \ldots, a_{ML}, t_1, \ldots, t_M \mid I)\) for the coefficients \(a_{11}, \ldots, a_{ML}, t_1, \ldots, t_M\), which remains invariant under translations and rotations of the coordinate system.
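For concreteness, this regression model can be simulated and fit with ordinary least squares as a baseline. The following Python sketch is illustrative only (the dimensions L = M = 2, the noise scale and the Gaussian noise model are assumptions for the example, not prescriptions of the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

L, M, n_obs = 2, 2, 200           # covariates, responses, observations (N = L + M)
A_true = rng.normal(size=(M, L))  # regression coefficient matrix
t_true = rng.normal(size=M)       # intercept vector
sigma = 0.1                       # noise scale (illustrative choice)

X = rng.normal(size=(n_obs, L))                                  # covariates x_i
Z = X @ A_true.T + t_true + sigma * rng.normal(size=(n_obs, M))  # responses z_i

# Ordinary least-squares point estimate (a column of ones absorbs t)
X1 = np.hstack([X, np.ones((n_obs, 1))])
coef, *_ = np.linalg.lstsq(X1, Z, rcond=None)
A_hat, t_hat = coef[:L].T, coef[L]
```

In a Bayesian treatment, the prior derived below would multiply the Gaussian likelihood of such data; the least-squares fit corresponds to a flat prior on all coefficients.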

3. Derivation

Using the transformation invariance equation derived in [3]:
\[ \sum_{i=1}^{N} \frac{\partial}{\partial z_i} \bigl( F(z_1, \ldots, z_N)\, g_i(z_1, \ldots, z_N) \bigr) = 0 \]
for infinitesimal transformations of the form \(z_i' = z_i + \epsilon\, g_i(z_1, \ldots, z_N)\), we can establish a system of partial differential equations for F.

3.1. Invariance under Translations

Let us first consider a translation with respect to \(y_i\): \(y_i' = y_i + \epsilon\), i.e., \(g_i = 1\), \(g_j = 0\) for \(j \neq i\). Then, the equation in the primed variables reads:
\[ y_i' = y_i + \epsilon = a_{i1} x_1 + a_{i2} x_2 + \cdots + a_{iL} x_L + t_i'. \]
Collecting the coefficients yields \(t_i' = t_i + \epsilon\), and therefore, Equation (5) results in:
\[ 0 + \cdots + 0 + \frac{\partial}{\partial t_i} \bigl( F(A, t) \cdot 1 \bigr) + 0 + \cdots + 0 = 0, \]
which holds for any i. Therefore, F(A, t) can be a function of A only. Since F(A|I) does not depend on t, the prior distribution is improper (not normalizable in t) as long as there are no limits on the magnitude of t.
The translation with respect to xi results in the same conclusion.

3.2. Invariance under Rotations

The general rotation in n-dimensional space may be expressed as a sequence of rotations around rotation axes, which are perpendicular to the planes spanned by appropriately-chosen pairs of coordinate system basis vectors [13]. This is based on the fact that any rotation matrix, being orthogonal, can be written as a product of 2 × 2 rotations. To avoid convoluted language, we denote in the following the rotation around the rotation axis that is perpendicular to the plane spanned by the linear combination of the basis vectors \(e_i\) and \(e_j\) simply as a rotation in the \(x_i x_j\)-plane.

3.2.1. Rotation in the xixj-Plane

Now, we perform one such infinitesimal 2 × 2-rotation for independent parameters around an arbitrary rotation axis perpendicular to the plane spanned by \(e_i\) and \(e_j\), preserving all other coordinates: \(x_k' = x_k \;\forall k \notin (j, i)\) and
\[ x_i' = x_i - \epsilon x_j, \]
\[ x_j' = \epsilon x_i + x_j. \]
Substituting the primed coordinates into Equation (5) yields the implied transformations:
\[ a_{ki}' = a_{ki} - a_{kj}\,\epsilon, \]
\[ a_{kj}' = a_{kj} + a_{ki}\,\epsilon, \]
\[ t_k' = t_k \]
and, therefore, the partial differential equation:
\[ -\sum_{k=1}^{M} \frac{\partial}{\partial a_{ki}} \bigl( F(A)\, a_{kj} \bigr) + \sum_{k=1}^{M} \frac{\partial}{\partial a_{kj}} \bigl( F(A)\, a_{ki} \bigr) = 0. \]

3.2.2. Rotation in the yiyj-Plane

Now, we perform one such rotation in the plane of two dependent parameters \(e_i\) and \(e_j\); thus \(y_k' = y_k \;\forall k \notin (j, i)\) and:
\[ y_i' = y_i - \epsilon y_j, \]
\[ y_j' = \epsilon y_i + y_j. \]
Substituting the primed coordinates into Equation (5) yields the implied transformations:
\[ a_{ik}' = a_{ik} - a_{jk}\,\epsilon, \]
\[ a_{jk}' = a_{jk} + a_{ik}\,\epsilon, \]
\[ t_i' = t_i - t_j\,\epsilon, \]
\[ t_j' = t_j + t_i\,\epsilon, \]
\[ t_k' = t_k \]
and, therefore, the partial differential equation:
\[ -\sum_{k=1}^{L} \frac{\partial}{\partial a_{ik}} \bigl( F(A)\, a_{jk} \bigr) + \sum_{k=1}^{L} \frac{\partial}{\partial a_{jk}} \bigl( F(A)\, a_{ik} \bigr) = 0. \]

3.2.3. Rotation in a Plane Spanned by xiyj-Axes

Performing a rotation in the \(x_i y_j\)-plane, we obtain:
\[ x_i' = x_i - \epsilon y_j, \]
\[ y_j' = \epsilon x_i + y_j, \]
which yields (see the Appendix):
\[ a_{ji}' = a_{ji} + (1 + a_{ji}^2)\,\epsilon, \]
\[ a_{kl}' = a_{kl} + (a_{jl}\, a_{ki})\,\epsilon \quad (\{k, l\} \neq \{j, i\}), \]
\[ t_k' = t_k + (a_{ki}\, t_j)\,\epsilon \]
and therefore:
\[ \sum_{k=1}^{M} \sum_{l=1}^{L} \frac{\partial}{\partial a_{kl}} \bigl( F\, (a_{jl}\, a_{ki}) \bigr) + a_{ji}\, F + \frac{\partial F}{\partial a_{ji}} = 0. \]

4. The PDE System

The translation invariance of Equation (5) excludes a dependence of F on t1, ⋯, tM, so F is of the form F(a11, ⋯, aML|I). Rotation invariance with respect to the y-axis requires F to fulfill the homogeneous, linear system of first-order partial differential equations (i, j ∈ [1, M], i ≠ j) (i.e., Equation (21)):
\[ \sum_{k=1}^{L} a_{jk}\, \frac{\partial F}{\partial a_{ik}} - \sum_{k=1}^{L} a_{ik}\, \frac{\partial F}{\partial a_{jk}} = 0 \]
and similarly for rotations around the x-axis (i, j ∈ [1, L], i ≠ j) (Equation (13)):
\[ \sum_{k=1}^{M} a_{kj}\, \frac{\partial F}{\partial a_{ki}} - \sum_{k=1}^{M} a_{ki}\, \frac{\partial F}{\partial a_{kj}} = 0. \]
Rotations around an axis perpendicular to a plane given by an x,y-pair require the probability distribution to also obey (i ∈ [1, L], j ∈ [1, M]):
\[ \sum_{k=1}^{M} \sum_{l=1}^{L} \frac{\partial}{\partial a_{kl}} \bigl( F\, (a_{jl}\, a_{ki}) \bigr) + a_{ji}\, F + \frac{\partial F}{\partial a_{ji}} = 0. \]
Using the product rule, the double sum can be rewritten as
\[ \sum_{k=1}^{M} \sum_{l=1}^{L} \frac{\partial}{\partial a_{kl}} \bigl( F\, a_{jl}\, a_{ki} \bigr) = \sum_{k=1}^{M} \sum_{l=1}^{L} a_{jl}\, a_{ki}\, \frac{\partial F}{\partial a_{kl}} + F \sum_{k=1}^{M} \sum_{l=1}^{L} \frac{\partial}{\partial a_{kl}} \bigl( a_{jl}\, a_{ki} \bigr) \]
and the last term of the previous equation can be split into three parts and simplified:
\[ F \sum_{k=1}^{M} \sum_{l=1}^{L} \frac{\partial}{\partial a_{kl}} (a_{jl}\, a_{ki}) = F \sum_{k=1, k\neq j}^{M} \frac{\partial}{\partial a_{ki}} (a_{ji}\, a_{ki}) + F \sum_{l=1, l\neq i}^{L} \frac{\partial}{\partial a_{jl}} (a_{jl}\, a_{ji}) + F\, \frac{\partial}{\partial a_{ji}}\, a_{ji}^2 = (M-1)\, a_{ji} F + (L-1)\, a_{ji} F + 2\, a_{ji} F = (M+L)\, a_{ji} F. \]
Using this, Equation (30) can be written as:
\[ \sum_{k=1}^{M} \sum_{l=1}^{L} a_{jl}\, a_{ki}\, \frac{\partial F}{\partial a_{kl}} + \frac{\partial F}{\partial a_{ji}} + (M+L+1)\, a_{ji}\, F = 0. \]
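The PDE system can be spot-checked numerically for a concrete case. The Python sketch below (not from the paper; it assumes the explicit N = 4, L = 2 solution given in Section 5) verifies via central finite differences that both an x-rotation equation and the combined xy-rotation equation hold at a random point:

```python
import numpy as np

def F(a):
    """Explicit prior for M = L = 2: F = H^(-5/2), H = 1 + sum(a^2) + det(A)^2."""
    a11, a12, a21, a22 = a
    H = 1 + a11**2 + a12**2 + a21**2 + a22**2 + (a11 * a22 - a12 * a21)**2
    return H**-2.5

def grad_F(a, h=1e-6):
    """Central finite-difference gradient of F."""
    g = np.zeros(4)
    for i in range(4):
        ap, am = a.copy(), a.copy()
        ap[i] += h
        am[i] -= h
        g[i] = (F(ap) - F(am)) / (2 * h)
    return g

rng = np.random.default_rng(0)
a = rng.normal(size=4)           # flat order: a11, a12, a21, a22
A = a.reshape(2, 2)
g = grad_F(a).reshape(2, 2)

# x-rotation PDE (i = 1, j = 2): sum_k a_k2 dF/da_k1 - sum_k a_k1 dF/da_k2 = 0
res_x = A[:, 1] @ g[:, 0] - A[:, 0] @ g[:, 1]

# xy-rotation PDE (j = i = 1):
# sum_{k,l} a_1l a_k1 dF/da_kl + dF/da_11 + (M + L + 1) a_11 F = 0
res_xy = np.einsum('l,k,kl->', A[0], A[:, 0], g) + g[0, 0] + 5 * A[0, 0] * F(a)
```

Both residuals vanish up to finite-difference error, which is a useful sanity check before attempting the general proof.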

5. Solution

This system of PDEs (Equations (28), (29) and (33)) can be tackled with the theory of Lie groups, which provides a systematic, though algebraically-intensive, solution strategy that is implemented in contemporary computer algebra systems. The solutions of several test cases computed by the Maple computer algebra system (http://www.maplesoft.com/) (it proved to be superior to Mathematica (http://www.wolfram.com/mathematica/) for the present PDE systems) led to the conjecture that a general solution to this PDE system is given by the sum of the squares of all possible minors of the coefficient matrix:
\[ F(a_{11}, \ldots, a_{ML} \mid I) = \left[ 1 + \sum_{k=1}^{\binom{M}{P}\binom{L}{P}} \bigl(\det(A^{P,k})\bigr)^2 + \sum_{k=1}^{\binom{M}{P-1}\binom{L}{P-1}} \bigl(\det(A^{P-1,k})\bigr)^2 + \cdots + \sum_{k=1}^{\binom{M}{1}\binom{L}{1}} \bigl(\det(A^{1,k})\bigr)^2 \right]^{-\frac{M+L+1}{2}} \]
where \(A^{n,k}\) denotes the k-th submatrix (minor) of size n × n (this notation is used at various places throughout the paper and should not be confused with the power of a matrix, which does not occur in this paper) and P = min(M, L). Equation (34) does not appear unreasonable from the outset as a prior density, because it preserves the underlying symmetry of the problem (permutation invariance of the parameters) and it is non-negative.
An explicit example for the case N = 4, L = 2 is:
\[ F(a_{11}, a_{12}, a_{21}, a_{22} \mid I) = \bigl[ 1 + a_{11}^2 + a_{12}^2 + a_{21}^2 + a_{22}^2 + (a_{11} a_{22} - a_{12} a_{21})^2 \bigr]^{-5/2}. \]
A two-dimensional slice of this probability density is given in Figure 2. The high symmetry of the prior distribution with respect to parameter permutations results in similar, “Cauchy”-like shapes if slices along other parameter axes are displayed.
For the case N = 6, L = 3, the solution is given by:
\[ \begin{aligned} F(a_{11}, \ldots, a_{33} \mid I) = \bigl( 1 &+ a_{11}^2 + a_{12}^2 + a_{13}^2 + a_{21}^2 + a_{22}^2 + a_{23}^2 + a_{31}^2 + a_{32}^2 + a_{33}^2 \\ &+ (a_{22} a_{33} - a_{23} a_{32})^2 + (a_{21} a_{33} - a_{23} a_{31})^2 + (a_{21} a_{32} - a_{22} a_{31})^2 \\ &+ (a_{12} a_{33} - a_{13} a_{32})^2 + (a_{11} a_{33} - a_{13} a_{31})^2 + (a_{11} a_{32} - a_{12} a_{31})^2 \\ &+ (a_{12} a_{23} - a_{13} a_{22})^2 + (a_{11} a_{23} - a_{13} a_{21})^2 + (a_{11} a_{22} - a_{12} a_{21})^2 \\ &+ \bigl( a_{11}(a_{22} a_{33} - a_{23} a_{32}) - a_{12}(a_{21} a_{33} - a_{23} a_{31}) + a_{13}(a_{21} a_{32} - a_{22} a_{31}) \bigr)^2 \bigr)^{-7/2}. \end{aligned} \]
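For general M and L, the solution can be evaluated by enumerating all square submatrices. A possible Python implementation (illustrative, not the author's code; it uses numpy for the determinants) that reproduces the explicit N = 4, L = 2 formula is:

```python
from itertools import combinations

import numpy as np

def hyperplane_prior(A):
    """Evaluate the conjectured general solution:
    [1 + sum of squared determinants of all square minors]^(-(M+L+1)/2)."""
    M, L = A.shape
    H = 1.0
    for m in range(1, min(M, L) + 1):
        for rows in combinations(range(M), m):
            for cols in combinations(range(L), m):
                H += np.linalg.det(A[np.ix_(rows, cols)]) ** 2
    return H ** (-(M + L + 1) / 2)

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 2))
a11, a12, a21, a22 = A.ravel()

# Explicit N = 4, L = 2 formula for comparison
explicit = (1 + a11**2 + a12**2 + a21**2 + a22**2
              + (a11 * a22 - a12 * a21)**2) ** -2.5
```

The triple loop enumerates exactly the minors appearing in the general formula, so the same function also covers the N = 6, L = 3 case above.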

6. Proof

6.1. Preliminaries

To prove that Equation (34) fulfills the equation system given by Equations (28), (29) and (33), we verify directly that Equation (34) solves the PDEs.
We will make repeated use of the Laplace expansion of determinants:
\[ \det(A^n) = \sum_{j=1}^{n} a_{ij}\, (-1)^{i+j} \det(M_{ij}^{n-1}) \]
where the minor \(M_{ij}^{n-1}\) is the (n − 1) × (n − 1)-matrix derived from the n × n-matrix \(A^n\) by deletion of the i-th row and j-th column (by definition, \(M^0 := 1\)). The cofactor matrix \(A_{ij}^{n-1}\) is defined to be:
\[ A_{ij}^{n-1} = (-1)^{i+j}\, M_{ij}^{n-1} \]
and satisfies the following equations (i, j, k = 1, 2, ⋯, n):
\[ \sum_{j=1}^{n} a_{ij} \det(A_{kj}^{n-1}) = \delta_{ik} \det(A^n), \qquad \sum_{i=1}^{n} a_{ij} \det(A_{ik}^{n-1}) = \delta_{jk} \det(A^n). \]
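These cofactor identities are easy to confirm numerically. The following Python sketch (illustrative, not part of the paper) checks both relations for a random 4 × 4 matrix:

```python
import numpy as np

def minor_det(A, i, j):
    """Determinant of A with row i and column j deleted."""
    return np.linalg.det(np.delete(np.delete(A, i, axis=0), j, axis=1))

rng = np.random.default_rng(2)
n = 4
A = rng.normal(size=(n, n))
detA = np.linalg.det(A)

# Cofactor matrix: C[i, j] = (-1)^(i+j) * minor(i, j)
C = np.array([[(-1) ** (i + j) * minor_det(A, i, j) for j in range(n)]
              for i in range(n)])

# Row identity: sum_j a_ij C_kj = delta_ik det(A); column identity analogously
row_check = A @ C.T   # should equal det(A) * identity
col_check = A.T @ C   # should equal det(A) * identity
```

Both products reduce to det(A) times the identity matrix, which is the matrix form of the two sums above.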
Further useful is the following form of the Laplace expansion, taking into account the index shifts from a previous deletion of row k and column i of an (n + 1) × (n + 1)-matrix \(A^{n+1}\), resulting in the minor \(M_{ki}^n\):
\[ \det(M_{ki}^{n}) = \sum_{l=1, l\neq i}^{n+1} a_{jl}\, (-1)^{(l'+j')} \det(M_{(jk)(li)}^{n-1}) \]
where \(M_{(jk)(li)}^{n-1}\) is the minor given by deletion of the j-th and k-th rows and the l-th and i-th columns. l′ and j′ are defined as:
\[ l' = l \;\; (l < i) \quad \text{and} \quad l' = l - 1 \;\; (l > i), \qquad j' = j \;\; (j < k) \quad \text{and} \quad j' = j - 1 \;\; (j > k). \]
In the following, we face the problem of a possibly too heavy nomenclature, because we need summation indices, while we also need to keep track of the original indices underlying the entries in the minors, where some rows and columns have been deleted, although the relative order is preserved. The mapping could be expressed, e.g., as \(a_{i(i')j(j')}\) with i′, j′ ∈ [1, m] and i(⋅) ∈ [1, M] and j(⋅) ∈ [1, L]. To avoid this cumbersome notation, we implicitly assume from now on (up to the Conclusions Section) this mapping for all summations that are indexed by either k or l. Therefore:
\[ \sum_{k=1}^{m} a_{ki} \det(A_{kj}^{m-1}) \quad \text{has to be read as} \quad \sum_{k'=1}^{m} a_{k(k')i} \det(A_{k(k')j}^{m-1}). \]

6.2. xixj- and yiyj-Rotations

We now verify that Equation (34) solves Equation (29). It is obvious that only those determinants of Equation (34) that contain column i or column j have the potential to provide non-zero contributions in Equation (29): if column j is missing, the derivative in the first term is zero. If, instead, column i is missing, then the derivative in the second term of Equation (29) yields zero. To proceed, we introduce H(A) via:
\[ F(A) = H(A)^{-\frac{L+M+1}{2}}. \]
It is noteworthy that H(A) has a very simple form: it is given by a sum of positive terms. This almost decouples the problem, and we can largely proceed on a term-by-term basis. Using the equality:
\[ \frac{\partial}{\partial a_{pq}} \bigl( \det(A^m) \bigr)^2 = 2 \det(A^m) \det(A_{pq}^{m-1}), \]
the left-hand side of Equation (29) transforms to (i, j ∈ [1, L], i ≠ j):
\[ -(M+L+1)\, H(A)^{-\frac{L+M+3}{2}}\, \det(A^m) \left( \sum_{k=1}^{m} a_{kj} \det(A_{ki}^{m-1}) - \sum_{k=1}^{m} a_{ki} \det(A_{kj}^{m-1}) \right) \]
and using Equation (38), we obtain:
\[ -(M+L+1)\, H(A)^{-\frac{L+M+3}{2}}\, \det(A^m) \bigl( \delta_{ij} \det(A^m) - \delta_{ij} \det(A^m) \bigr) = 0 \]
and, therefore, Equation (34) solves Equation (29). The calculation for Equation (28) is similar and yields the result that Equation (34) also solves the system of Equation (28).

6.3. (xiyj)-Rotations

The verification that Equation (34) solves Equation (33) requires a few more steps. As before, Equation (33) can be written as:
\[ -\frac{M+L+1}{2} \left( \sum_{k=1}^{M} \sum_{l=1}^{L} a_{jl}\, a_{ki}\, H(A)^{-\frac{L+M+3}{2}}\, \frac{\partial H(A)}{\partial a_{kl}} + H(A)^{-\frac{L+M+3}{2}}\, \frac{\partial H(A)}{\partial a_{ji}} - 2\, a_{ji}\, H(A)^{-\frac{L+M+1}{2}} \right) = 0 \]
and, after multiplication with \(H(A)^{\frac{L+M+3}{2}}\), as:
\[ (M+L+1) \left( \sum_{m=1}^{P} \sum_{r=1}^{\binom{M}{m}\binom{L}{m}} \left( \sum_{k=1}^{m} \sum_{l=1}^{m} a_{jl}\, a_{ki} \det(A^{m,r}) \det(A_{kl}^{m-1,r}) + \det(A^{m,r}) \det(A_{ji}^{m-1,r}) \right) - a_{ji}\, H(A) \right) = 0 \]

6.3.1. Matrices with Either Row j or Column i

The inner double sum can be simplified for all matrices containing either row j or column i (i.e., all matrices of size P × P and all matrices \(A^{m,r}\) of size m × m, m ∈ (1, 2, ⋯, P − 1), with label \(r = 1, 2, \ldots, \binom{M}{m}\binom{L}{m} - \binom{M-1}{m}\binom{L-1}{m}\)) using the Laplace expansion (here, the expansion with respect to row j is shown):
\[ \sum_{k=1}^{m} \sum_{l=1}^{m} a_{jl}\, a_{ki} \det(A^{m,r}) \det(A_{kl}^{m-1,r}) = \det(A^{m,r}) \sum_{k=1}^{m} a_{ki} \sum_{l=1}^{m} a_{jl} \det(A_{kl}^{m-1,r}) = \det(A^{m,r}) \sum_{k=1}^{m} a_{ki}\, \delta_{jk} \det(A^{m,r}) = a_{ji} \bigl( \det(A^{m,r}) \bigr)^2, \]
which cancels the corresponding determinant of H (A) in the last term of Equation (48).
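This collapse of the double sum can also be confirmed numerically for a single m × m matrix. A small Python sketch (illustrative, with an arbitrary choice of row j and column i) checking the identity above for a random 3 × 3 matrix reads:

```python
import numpy as np

def cofactor(A, k, l):
    """Cofactor: (-1)^(k+l) times the minor of A at row k, column l."""
    sub = np.delete(np.delete(A, k, axis=0), l, axis=1)
    return (-1) ** (k + l) * np.linalg.det(sub)

rng = np.random.default_rng(3)
m = 3
A = rng.normal(size=(m, m))
j, i = 1, 2                      # an arbitrary row j and column i (0-based)
detA = np.linalg.det(A)

lhs = sum(A[j, l] * A[k, i] * detA * cofactor(A, k, l)
          for k in range(m) for l in range(m))
rhs = A[j, i] * detA**2          # the double sum collapses to a_ji det(A)^2
```

The inner sum over l is a Laplace expansion of row j along row k, which is nonzero only for k = j; this is exactly the delta function used in the derivation.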

6.3.2. Matrices with Neither Row j nor Column i

The basic idea is to show that the \(\binom{M-1}{m}\binom{L-1}{m}\) matrices with neither row j nor column i, m ∈ (1, 2, ⋯, P − 1), cancel with the contributions of the corresponding matrices of size (m + 1) × (m + 1) including row j and column i in the second term.
Please note that there is a one-to-one correspondence between the minors of size m × m without the j-th row and i-th column and the matrices of size (m + 1) × (m + 1) with row j and column i in the second term, therefore allowing one to label both with the same index r. After division by (M + L + 1), the remaining terms of Equation (48) are (taking into account that the labeling of the rows and columns of the matrices of size (m + 1) × (m + 1) and m × m must be consistent):
\[ \sum_{k=1, k\neq j}^{m+1} \sum_{l=1, l\neq i}^{m+1} a_{jl}\, a_{ki} \det(A_{ji}^{m,r}) \det(A_{(jk)(il)}^{m-1,r}) + \det(A^{m+1,r}) \det(A_{ji}^{m,r}) - a_{ji}\, H_{ji}(A) = 0 \]
with \(H_{ji}\) now containing only determinants with neither row j nor column i. If we now consider only the relevant term of \(H_{ji}\), we can write:
\[ \sum_{k=1, k\neq j}^{m+1} \sum_{l=1, l\neq i}^{m+1} a_{jl}\, a_{ki} \det(A_{ji}^{m,r}) \det(A_{(jk)(il)}^{m-1,r}) + \det(A^{m+1,r}) \det(A_{ji}^{m,r}) - a_{ji} \bigl( \det(A_{ji}^{m,r}) \bigr)^2 = 0. \]
The equation is trivially true if \(\det(A_{ji}^{m,r}) = 0\); otherwise, we can divide by \(\det(A_{ji}^{m,r})\) and obtain:
\[ \sum_{k=1, k\neq j}^{m+1} \sum_{l=1, l\neq i}^{m+1} a_{jl}\, a_{ki} \det(A_{(jk)(il)}^{m-1,r}) + \det(A^{m+1,r}) - a_{ji} \det(A_{ji}^{m,r}) = 0. \]
Replacing the various cofactors by the corresponding minors (cf. Equations (36) and (37)) yields:
\[ \sum_{k=1, k\neq j}^{m+1} \sum_{l=1, l\neq i}^{m+1} a_{jl}\, a_{ki}\, (-1)^{(i+j+k'+l')} \det(M_{(jk)(il)}^{m-1,r}) + \det(A^{m+1,r}) - a_{ji}\, (-1)^{(i+j)} \det(M_{ji}^{m,r}) = 0 \]
and after replacing \(\det(A^{m+1,r})\) by its Laplace expansion together with multiplication by \((-1)^{(i+j)}\), the equation reads:
\[ \sum_{k=1, k\neq j}^{m+1} \sum_{l=1, l\neq i}^{m+1} a_{jl}\, a_{ki}\, (-1)^{(k'+l')} \det(M_{(jk)(il)}^{m-1,r}) + (-1)^{(i+j)} \sum_{k=1}^{m+1} a_{ki}\, (-1)^{(i+k)} \det(M_{ki}^{m,r}) - a_{ji} \det(M_{ji}^{m,r}) = 0 \]
and can be simplified to:
\[ \sum_{k=1, k\neq j}^{m+1} \sum_{l=1, l\neq i}^{m+1} a_{jl}\, a_{ki}\, (-1)^{(k'+l')} \det(M_{(jk)(il)}^{m-1,r}) + \sum_{k=1}^{m+1} a_{ki}\, (-1)^{(j+k)} \det(M_{ki}^{m,r}) - a_{ji} \det(M_{ji}^{m,r}) = 0 \]
because \((-1)^{2i}\) equals one in the second term. Therefore, the third term cancels with the second term for k = j, and the remaining equation is given by:
\[ \sum_{k=1, k\neq j}^{m+1} a_{ki}\, (-1)^{k'} \sum_{l=1, l\neq i}^{m+1} a_{jl}\, (-1)^{l'} \det(M_{(jk)(il)}^{m-1,r}) + \sum_{k=1, k\neq j}^{m+1} a_{ki}\, (-1)^{(j+k)} \det(M_{ki}^{m,r}) = 0. \]
Using Equation (39) together with the definition Equation (40), the inner sum of the first term of Equation (55) can be rewritten as:
\[ \sum_{l=1, l\neq i}^{m+1} a_{jl}\, (-1)^{l'} \det(M_{(jk)(il)}^{m-1,r}) = (-1)^{j} \det(M_{ki}^{m,r}) \quad (j < k) \]
and:
\[ \sum_{l=1, l\neq i}^{m+1} a_{jl}\, (-1)^{l'} \det(M_{(jk)(il)}^{m-1,r}) = (-1)^{j-1} \det(M_{ki}^{m,r}) \quad (j > k). \]
Splitting the summation over k into two parts (k < j) and (k > j) and inserting the definition for k′, we obtain:
\[ \sum_{k=1}^{k<j} a_{ki}\, (-1)^{(j+k)} \det(M_{ki}^{m,r}) + \sum_{k>j}^{m+1} a_{ki}\, (-1)^{(j+k)} \det(M_{ki}^{m,r}) + \sum_{k=1}^{k<j} a_{ki}\, (-1)^{((j-1)+k)} \det(M_{ki}^{m,r}) + \sum_{k>j}^{m+1} a_{ki}\, (-1)^{(j+(k-1))} \det(M_{ki}^{m,r}) = 0 \]
where the first two terms cancel the last two terms.
Summarizing the previous approach, we have shown that for an arbitrary n × n-determinant, the first and third terms of Equation (48) almost cancel. Only determinants not containing the j-th row and the i-th column remain. These remaining contributions are canceled by the (n + 1)-order determinant (required to contain the matrix element aji) of the second term in Equation (48). This scheme can be repeated down to n = 1, and the last step (n = 0) is easily calculated explicitly. This finishes our derivation.

7. Relation to Previously-Derived Special Cases

The underlying equation systems of the special case of an (n − 1)-dimensional hyperplane in an n-dimensional space used in [8] and in this paper differ slightly due to a different parameterization; therefore, the derived priors appear at first glance to be different, although they are identical, as will be shown below.
For probability density functions in different coordinate systems, the following equation holds:
\[ p(a)\, da = p(b(a)) \left| \frac{\partial(b)}{\partial(a)} \right| da, \]
where |⋯| denotes the absolute value of the Jacobi determinant:
\[ \left| \frac{\partial(b)}{\partial(a)} \right| = \left| \det \begin{pmatrix} \frac{\partial b_1}{\partial a_1} & \frac{\partial b_1}{\partial a_2} & \cdots & \frac{\partial b_1}{\partial a_n} \\ \vdots & & & \vdots \\ \frac{\partial b_n}{\partial a_1} & \frac{\partial b_n}{\partial a_2} & \cdots & \frac{\partial b_n}{\partial a_n} \end{pmatrix} \right| . \]
The equation describing the (n − 1)-dimensional hyperplane in an n-dimensional space in this paper is given by:
\[ y_1 = a_{11} x_1 + a_{12} x_2 + \cdots + a_{1(n-1)} x_{n-1} + t_1 \]
and results in the following prior:
\[ p(a_{11}, a_{12}, \ldots, a_{1(n-1)}, t_1) = \left( 1 + \sum_{i=1}^{n-1} a_{1i}^2 \right)^{-\frac{n+1}{2}}. \]
In [8], the corresponding hyperplane equation reads:
\[ 0 = b_1 x_1 + b_2 x_2 + \cdots + b_n x_n + 1 \]
with prior distribution:
\[ p(b_1, b_2, \ldots, b_n) = \left( \sum_{i=1}^{n} b_i^2 \right)^{-\frac{n+1}{2}}, \quad \text{with} \quad \sum_{i=1}^{n} b_i^2 > R_0^2. \]
The latter constraint yields a proper (normalizable) prior. The relation between the two parameterizations is given by:
\[ b_i = \frac{a_{1i}}{t_1} \;\; (i < n) \quad \text{and} \quad b_n = -\frac{1}{t_1}, \]
which yields the Jacobian:
\[ \left| \frac{\partial(b)}{\partial(a)} \right| = \left| \det \begin{pmatrix} \frac{1}{t_1} & 0 & \cdots & 0 & -\frac{a_{11}}{t_1^2} \\ 0 & \frac{1}{t_1} & \cdots & 0 & -\frac{a_{12}}{t_1^2} \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & \frac{1}{t_1} & -\frac{a_{1(n-1)}}{t_1^2} \\ 0 & 0 & \cdots & 0 & \frac{1}{t_1^2} \end{pmatrix} \right| = \frac{1}{|t_1|^{n+1}}. \]
Using this result and Equation (65), we can write:
\[ p(b(a)) \left| \frac{\partial(b)}{\partial(a)} \right| da = \frac{1}{\left( \sum_{i=1}^{n-1} \left( \frac{a_{1i}}{t_1} \right)^2 + \frac{1}{t_1^2} \right)^{\frac{n+1}{2}} |t_1|^{n+1}}\, da = \frac{1}{\left( 1 + \sum_{i=1}^{n-1} a_{1i}^2 \right)^{\frac{n+1}{2}}}\, da, \]
which shows the equivalence of the two priors (Equations (62) and (64)). The requirement \(\sum_i b_i^2 > R_0^2\) leads to:
\[ R_0^2 \le \sum_{i=1}^{n} b_i^2 = \sum_{i=1}^{n-1} \left( \frac{a_{1i}}{t_1} \right)^2 + \frac{1}{t_1^2} = \frac{1}{t_1^2} \left( 1 + \sum_{i=1}^{n-1} a_{1i}^2 \right). \]
In the case of all \(a_{1i} = 0\), we obtain:
\[ t_1^2 \le \frac{1}{R_0^2}, \]
which means that the lower limit \(R_0^2\) corresponds to an upper limit on \(t_1^2\).
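The equivalence of the two parameterizations can be spot-checked numerically. The following Python sketch is illustrative; the sign convention for \(b_n\) is an assumption chosen to be consistent with the hyperplane equations above:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
a = rng.normal(size=n - 1)   # slopes a_{11}, ..., a_{1(n-1)}
t1 = 0.7                     # intercept t_1 (any nonzero value)

# Assumed mapping between the parameterizations: b_i = a_{1i}/t_1, b_n = -1/t_1
b = np.append(a / t1, -1.0 / t1)

p_a = (1 + np.sum(a**2)) ** (-(n + 1) / 2)   # prior in the (a, t_1) parameterization
p_b = np.sum(b**2) ** (-(n + 1) / 2)         # prior of [8] in the b parameterization
jac = 1.0 / abs(t1) ** (n + 1)               # absolute Jacobi determinant
```

Multiplying the b-space density by the Jacobian recovers the (a, t₁)-space density for any choice of a and nonzero t₁, as shown algebraically above.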

8. Practical Hints

In the worst case, the hyperplane prior has an exponentially-increasing number of determinants with increasing dimension. The total number of individual determinants for an N-dimensional plane in a 2N-dimensional space is given by:
\[ \sum_{k=0}^{N} \binom{N}{k}^2 = \binom{2N}{N}, \]
which is already 70 for a 4D hyperplane in an 8D space. Therefore, it is advantageous to compute the determinants iteratively using the Laplace expansion, starting from small determinants and storing the determinants of the previous step. This requires the storage of at most \(\binom{N}{N/2}^2\) terms. As a proposal density for Monte Carlo sampling methods (e.g., rejection sampling or Markov chain Monte Carlo (MCMC)), the dominating multivariate Cauchy distribution is a good candidate. Source code for the setup of the PDE system and for the solution, together with a Maple script for the verification of the solution, can be obtained from the author.
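The quoted count follows from the Vandermonde identity and is easy to tabulate; a short Python check (illustrative, not part of the paper) is:

```python
from math import comb

def n_terms(N):
    """Number of squared-minor terms (including the constant 1, i.e. the k = 0
    term) in the prior for an N-dimensional plane in a 2N-dimensional space."""
    return sum(comb(N, k) ** 2 for k in range(N + 1))

# Vandermonde identity: sum_k C(N, k)^2 = C(2N, N)
counts = {N: n_terms(N) for N in range(1, 8)}
```

The growth of \(\binom{2N}{N}\) is roughly \(4^N/\sqrt{\pi N}\), which motivates the iterative Laplace-expansion strategy with reuse of smaller determinants.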

9. Conclusions

This paper has derived a prior density for L-dimensional hyperplanes in N-dimensional space, based on geometric invariances. It is suited, e.g., to parameter estimation of multivariate linear regression problems in the absence of further prior knowledge or to Bayesian model estimation for neural networks. In the latter case, the prior has to be made proper by a suitable restriction of the range of the offset parameters, which depends on domain knowledge. The obtained prior density avoids the overly strong weight on “large” values of the regression coefficients typically assigned by uniform priors. Being a rational function, its influence on the parameter estimates in standard problems with Gaussian uncertainties (resulting in an exponential likelihood) on the data will be limited. However, this can be different for robust estimation approaches with heavy-tailed likelihood distributions.

Conflicts of Interest

The author declares no conflict of interest.

Appendix

In this section, the relation between the primed coefficients \(a_{nm}'\) and the unprimed coefficients \(a_{nm}\) is derived. A rotation in the \(x_i y_j\)-plane relates \(x_i, y_j\) with \(x_i', y_j'\) by:
\[ x_i' = x_i - \epsilon y_j, \]
\[ y_j' = \epsilon x_i + y_j \]
and \(x_k' = x_k\), k = 1, ⋯, L; k ≠ i and \(y_k' = y_k\), k = 1, ⋯, M; k ≠ j. Using this, the system Equation (5) in the transformed coordinate system reads (n = 1, ⋯, M; n ≠ j):
\[ \begin{aligned} y_n &= a_{n1}' x_1 + a_{n2}' x_2 + \cdots + a_{ni}'(x_i - \epsilon y_j) + a_{n(i+1)}' x_{i+1} + \cdots + a_{nL}' x_L + t_n' \\ y_j + \epsilon x_i &= a_{j1}' x_1 + a_{j2}' x_2 + \cdots + a_{ji}'(x_i - \epsilon y_j) + a_{j(i+1)}' x_{i+1} + \cdots + a_{jL}' x_L + t_j'. \end{aligned} \]
Solving for \(y_j\), we obtain:
\[ y_j = \frac{1}{1 + a_{ji}'\epsilon} \left( t_j' - \epsilon x_i + \sum_{k=1}^{L} a_{jk}' x_k \right) \]
and subsequently:
\[ y_n = \left( t_n' + \sum_{k=1}^{L} a_{nk}' x_k \right) - \frac{a_{ni}'\,\epsilon}{1 + a_{ji}'\epsilon} \left( t_j' - \epsilon x_i + \sum_{k=1}^{L} a_{jk}' x_k \right). \]
Using the Taylor expansion \(1/(1 + a_{ji}'\epsilon) = 1 - a_{ji}'\epsilon + O(\epsilon^2)\) up to first order and collecting the coefficients, the previous equations yield:
\[ a_{ji} = a_{ji}' - a_{ji}'^2\,\epsilon - \epsilon; \qquad t_j = t_j' - t_j'\, a_{ji}'\,\epsilon \]
\[ a_{nk} = a_{nk}' - a_{ni}'\, a_{jk}'\,\epsilon; \qquad t_n = t_n' - t_j'\, a_{ni}'\,\epsilon. \]
First, we solve for \(a_{ji}'\):
\[ a_{ji}'^2\,\epsilon - a_{ji}' + \epsilon + a_{ji} = 0 \;\Rightarrow\; a_{ji}' = \frac{1 - \sqrt{1 - 4\epsilon(\epsilon + a_{ji})}}{2\epsilon} = a_{ji} + (1 + a_{ji}^2)\,\epsilon + O(\epsilon^2) \]
and next for \(a_{jk}'\):
\[ a_{jk} = a_{jk}'(1 - a_{ji}'\epsilon) \;\Rightarrow\; a_{jk}' = \frac{a_{jk}}{1 - a_{ji}\epsilon} = a_{jk}(1 + a_{ji}\epsilon) + O(\epsilon^2). \]
A similar calculation for \(a_{ni}'\) yields:
\[ a_{ni}' = a_{ni}(1 + a_{ji}\epsilon) + O(\epsilon^2), \]
which then allows one to compute \(a_{nk}'\) for index pairs with {nk} ≠ {ji}:
\[ a_{nk}' = a_{nk} + (a_{ni} + a_{ni} a_{ji}\epsilon)(a_{jk} + a_{jk} a_{ji}\epsilon)\,\epsilon = a_{nk} + a_{ni}\, a_{jk}\,\epsilon + O(\epsilon^2). \]
The offset variable \(t_j\) is given by:
\[ t_j = t_j'(1 - a_{ji}'\epsilon) \;\Rightarrow\; t_j' = \frac{t_j}{1 - a_{ji}\epsilon} = t_j(1 + a_{ji}\epsilon) + O(\epsilon^2), \]
and the other offset variables \(t_n\) by:
\[ t_n' = t_n + (a_{ni} + a_{ni} a_{ji}\epsilon)(t_j + t_j a_{ji}\epsilon)\,\epsilon = t_n + a_{ni}\, t_j\,\epsilon + O(\epsilon^2), \]
which concludes the derivation of Equations (24)–(26).

References

1. Gosling, J.P.; Oakley, J.E.; O'Hagan, A. Nonparametric elicitation for heavy-tailed prior distributions. Bayesian Anal. 2007, 2, 693–718.
2. Jaynes, E.T. Prior Probabilities. IEEE Trans. Syst. Sci. Cybern. 1968, SSC-4, 227–241.
3. Kendall, M.; Moran, P. Geometrical Probability; Griffin: London, UK, 1963.
4. Von der Linden, W.; Dose, V.; von Toussaint, U. Bayesian Probability Theory: Applications in the Physical Sciences, 1st ed.; Cambridge University Press: Cambridge, UK, 2014.
5. Von Toussaint, U.; Gori, S.; Dose, V. Bayesian Neural-Networks-Based Evaluation of Binary Speckle Data. Appl. Opt. 2004, 43, 5356–5363.
6. Hinton, G.; Salakhutdinov, R. Reducing the Dimensionality of Data with Neural Networks. Science 2006, 313, 504–507.
7. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
8. Dose, V. Hyperplane Priors. In Bayesian Inference and Maximum Entropy Methods in Science and Engineering; AIP Conference Proceedings 659; Williams, C.J., Ed.; American Institute of Physics: Melville, NY, USA, 2003; pp. 350–357.
9. Box, G.E.P.; Tiao, G.C. Bayesian Inference in Statistical Analysis; Wiley: New York, NY, USA, 1992; reprint of the 1973 edition.
10. Zellner, A. An Introduction to Bayesian Inference in Econometrics; Wiley: New York, NY, USA, 1971.
11. West, M. Outlier Models and Prior Distributions in Bayesian Linear Regression. J. R. Stat. Soc. B 1984, 46, 431–439.
12. O'Hagan, A. Kendall's Advanced Theory of Statistics, Volume 2B: Bayesian Inference, 1st ed.; Arnold Publishers: New York, NY, USA, 1994.
13. Landau, L.; Lifschitz, E. Lehrbuch der Theoretischen Physik I, 1st ed.; Akademie Verlag: Berlin, Germany, 1962.
Figure 1. Comparison of two different priors. (a) 15 random samples drawn from p(a|I) = 1/50, i.e., a uniform distribution in the slope with 0 ≤ a ≤ 50. (b) The density \(p(a|I) \propto (1 + a^2)^{-3/2}\), corresponding to a distribution uniform in the angle, is visualized by 15 samples.
Figure 2. Probability density of p(a11, a21 | a12, a22, I) for a12 = 3 and a22 = 5 for the case N = 4, L = 2. The probability density exhibits the typical “Cauchy”-like shape with heavy tails compared to a binormal distribution. Due to the symmetry of the prior distribution, slices with respect to the other parameters display the same basic features.