Article

Learning Coefficient of Vandermonde Matrix-Type Singularities in Model Selection

Department of Mathematics, College of Science & Technology, Nihon University, 1-8-14, Surugadai, Kanda, Chiyoda-ku, Tokyo 101-8308, Japan
Entropy 2019, 21(6), 561; https://doi.org/10.3390/e21060561
Submission received: 11 March 2019 / Revised: 23 May 2019 / Accepted: 29 May 2019 / Published: 4 June 2019
(This article belongs to the Special Issue Bayesian Inference and Information Theory)

Abstract

In recent years, the selection of appropriate learning models has become more important as the need to analyze learning systems has grown, and many model selection methods have been developed. The learning coefficient in Bayesian estimation, which measures the learning efficiency of singular learning models, plays an important role in several information criteria. In regular models, the learning coefficient equals half the dimension of the parameter space, whereas in singular models it is smaller and varies from model to model. Mathematically, the learning coefficient is the log canonical threshold. In this paper, we provide a new rational blowing-up method for obtaining these coefficients. By applying it to Vandermonde matrix-type singularities, we demonstrate the efficiency of the method.

1. Introduction

In recent studies, real data associated with, for example, image and speech recognition, psychology, and economics have been analyzed by learning systems. Many learning models have therefore been proposed, and the need for appropriate model selection methods has increased.
In this section, we first introduce the widely-applicable information criterion (WAIC) [1,2,3,4,5,6,7] and cross-validation in Bayesian model selection.
Let $q(x)$ be the true probability density function of variables $x \in \mathbb{R}^N$, and let $x^n := \{x_i\}_{i=1}^n$ be $n$ training samples drawn independently and identically from $q(x)$. Consider a learning model written in probabilistic form as $p(x|w)$, where $w \in W \subset \mathbb{R}^d$ is a parameter.
Suppose that the purpose of the learning system is to estimate the unknown true density function $q(x)$ from $x^n$ using $p(x|w)$ in Bayesian estimation. Let $\psi(w)$ be an a priori probability density function on the parameter set $W$ and $p(w|x^n)$ be the a posteriori probability density function:
$$p(w|x^n) = \frac{1}{Z_n(\beta)}\,\psi(w)\prod_{i=1}^n p(x_i|w)^\beta,$$
where:
$$Z_n(\beta) = \int_W \psi(w)\prod_{i=1}^n p(x_i|w)^\beta \, dw,$$
for inverse temperature $\beta$. We typically set $\beta = 1$. Define:
$$E_w^\beta[f(w)] = \frac{\int f(w)\,\psi(w)\prod_{i=1}^n p(x_i|w)^\beta \, dw}{\int \psi(w)\prod_{i=1}^n p(x_i|w)^\beta \, dw},$$
and:
$$V_w^\beta[f(w)] = E_w^\beta[f(w)^2] - E_w^\beta[f(w)]^2.$$
We then have the predictive density function $p(x|x^n) = E_w^\beta[p(x|w)]$, which is the average inference of the Bayesian density function.
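As an informal numerical illustration of these definitions (our own toy setting, not part of the original analysis, with model, prior, sample size, and grid all chosen for illustration), the posterior $p(w|x^n)$ and the predictive density $E_w^\beta[p(x|w)]$ can be approximated on a parameter grid for a one-parameter Bernoulli model with a uniform prior:

```python
import numpy as np

# Hypothetical toy setting (not from the paper): Bernoulli model
# p(x|w) = w^x (1-w)^(1-x) with a uniform prior psi(w) on [0, 1].
# We approximate the posterior p(w|x^n) and the predictive density
# p(x|x^n) = E_w^beta[p(x|w)] on a parameter grid.

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=50)       # n = 50 training samples from q
beta = 1.0                              # inverse temperature

w = np.linspace(1e-6, 1 - 1e-6, 2001)   # grid over the parameter set W
dw = w[1] - w[0]
log_lik = (x[:, None] * np.log(w) + (1 - x[:, None]) * np.log(1 - w)).sum(axis=0)
post = np.exp(beta * (log_lik - log_lik.max()))  # psi(w) constant, absorbed
post /= post.sum() * dw                          # normalized p(w|x^n)

def predictive(x_new):
    """Bayes predictive density p(x|x^n) = E_w^beta[p(x|w)]."""
    lik = w ** x_new * (1 - w) ** (1 - x_new)
    return np.sum(lik * post) * dw

print(predictive(1) + predictive(0))    # a density on {0, 1}: sums to 1
```

For this conjugate toy case, the grid predictive can be checked against the exact value $(k+1)/(n+2)$, where $k$ is the number of ones.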
We next introduce the Kullback function $K(q\|p)$ and the empirical Kullback function $K_n(q\|p)$ for density functions $p(x)$, $q(x)$:
$$K(q\|p) = \int q(x)\log\frac{q(x)}{p(x)}\,dx,$$
$$K_n(q\|p) = \frac{1}{n}\sum_{i=1}^n \log\frac{q(x_i)}{p(x_i)}.$$
The function $K(q\|p)$, which is always non-negative and satisfies $K(q\|p) = 0$ if and only if $q(x) = p(x)$, is a pseudo-distance between the density functions $p(x)$ and $q(x)$. We define the Bayes training loss $T_n$ and the Bayes generalization loss $G_n$ as follows:
$$T_n = -\frac{1}{n}\sum_{i=1}^n \log p(x_i|x^n),$$
and:
$$G_n = -\int q(x)\log p(x|x^n)\,dx.$$
Additionally, we define the Bayesian generalization error $B_g$ and the Bayesian training error $B_t$ as follows:
$$B_g = K\big(q(x)\,\big\|\,p(x|x^n)\big)$$
and
$$B_t = K_n\big(q(x)\,\big\|\,p(x|x^n)\big).$$
Then, we have:
$$B_g = G_n - S, \qquad B_t = T_n - S_n$$
for the average entropy $S = -\int q(x)\log q(x)\,dx$ and the empirical entropy $S_n = -\frac{1}{n}\sum_{i=1}^n \log q(x_i)$ of the true density function. The value $B_g$ describes how precisely the predictive function approximates the true density function.
We define $x^n \setminus x_i = \{x_1,\dots,x_{i-1},x_{i+1},\dots,x_n\}$. The WAIC is given by:
$$W_n = T_n + \frac{\beta}{n}\sum_{i=1}^n V_w^\beta[\log p(x_i|w)],$$
and the cross-validation loss is given by:
$$C_n = -\frac{1}{n}\sum_{i=1}^n \log p(x_i|x^n\setminus x_i)$$
for $n \ge 2$.
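The quantities $T_n$, $W_n$, and $C_n$ can be evaluated directly from training samples. The following sketch (our own toy setup, not from the paper) computes them for a one-parameter Bernoulli model with a uniform prior and $\beta = 1$ on a grid; the leave-one-out identity $p(x_i|x^n\setminus x_i) = 1/E_w^\beta[1/p(x_i|w)]$, which is exact at $\beta = 1$, is used for $C_n$:

```python
import numpy as np

# A sketch (assumed toy model, not from the paper): Bernoulli model with
# uniform prior and beta = 1.  We evaluate the Bayes training loss T_n,
# the WAIC W_n, and the cross-validation loss C_n on a parameter grid,
# using the exact leave-one-out identity (valid at beta = 1):
#     p(x_i | x^n \ x_i) = 1 / E_w[ 1 / p(x_i|w) ].

rng = np.random.default_rng(1)
n = 200
x = rng.binomial(1, 0.3, size=n)

w = np.linspace(1e-6, 1 - 1e-6, 4001)
dw = w[1] - w[0]
loglik = x[:, None] * np.log(w) + (1 - x[:, None]) * np.log(1 - w)  # (n, grid)
total = loglik.sum(axis=0)
post = np.exp(total - total.max())
post /= post.sum() * dw                      # posterior p(w|x^n) on the grid

def E(f):                                    # posterior mean E_w^beta[f(w)]
    return np.sum(f * post) * dw

p_i = np.array([E(np.exp(loglik[i])) for i in range(n)])
T_n = -np.mean(np.log(p_i))                  # Bayes training loss

V_i = np.array([E(loglik[i] ** 2) - E(loglik[i]) ** 2 for i in range(n)])
W_n = T_n + np.mean(V_i)                     # WAIC (beta = 1)

loo = 1.0 / np.array([E(np.exp(-loglik[i])) for i in range(n)])
C_n = -np.mean(np.log(loo))                  # cross-validation loss

print(T_n, W_n, C_n)                         # W_n and C_n nearly coincide
```

By Jensen's inequality, $C_n \ge T_n$ holds deterministically in this computation, and $W_n$ and $C_n$ agree closely for moderate $n$, consistent with the asymptotic relations below.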
Watanabe [2,3,6,7] proved the following four relations:
$$\begin{aligned}
\mathbb{E}[G_n] &= L(w_0) + \frac{1}{n\beta}\big(\lambda + (\beta-1)\nu\big) + o\!\Big(\frac{1}{n}\Big),\\
\mathbb{E}[T_n] &= L(w_0) + \frac{1}{n\beta}\big(\lambda - (\beta+1)\nu\big) + o\!\Big(\frac{1}{n}\Big),\\
\mathbb{E}[W_n] &= L(w_0) + \frac{1}{n\beta}\big(\lambda + (\beta-1)\nu\big) + o\!\Big(\frac{1}{n}\Big),\\
\mathbb{E}[C_n] &= L(w_0) + \frac{1}{n\beta}\big(\lambda + (\beta-1)\nu\big) + o\!\Big(\frac{1}{n}\Big),
\end{aligned}$$
for the learning coefficient $\lambda \in \mathbb{Q}$ and the singular fluctuation $\nu \in \mathbb{R}$, where $L(w) = -\mathbb{E}_x[\log p(x|w)]$ and $w_0 \in W_0 = \{w_0 \in W \mid L(w_0) = \min_{w\in W} L(w)\}$. For singular models, the set $W_0$ is typically not a singleton. Nevertheless, the WAIC and cross-validation can estimate the Bayesian generalization error without any knowledge of the true probability density function.
These values are calculated from the training samples $x_i$ using the learning model $p$. In real applications and experiments, we typically do not know the true distribution, but only the values of the training errors. These methods are effective precisely because a suitable model can nevertheless be selected from several statistical models by comparing these values.
In this paper, we consider the value $\lambda$, which equals the log canonical threshold introduced in Definition 1. This coefficient is not needed to evaluate the WAIC or cross-validation in practice; however, the learning coefficients from our recent results have been used very effectively by Drton and Plummer [8] for model selection with their "singular Bayesian information criterion (sBIC)". The sBIC method works even when only bounds on the learning coefficients are available.
For regular models, it is known that $\lambda = \nu = d/2$, where $d$ is the dimension of the parameter space. The value $\lambda$ is obtained by a blowing-up process, which is the main tool in the desingularization of an algebraic variety. The following theorem is an analytic version of Hironaka's theorem [9] used by Atiyah [10].
Theorem 1
(Desingularization [9]). Let $f$ be an analytic function in a neighborhood of $w^* = (w_1^*,\dots,w_d^*) \in \mathbb{R}^d$ with $f(w^*) = 0$. There exist an open set $U \ni w^*$, an analytic manifold $M$, and a proper analytic map $\mu$ from $M$ to $U$ such that: (1) $\mu : M \setminus E \to U \setminus f^{-1}(0)$ is an isomorphism, where $E = \mu^{-1}(f^{-1}(0))$, and (2) for each $u \in M$, there is a local analytic coordinate system $(u_1,\dots,u_n)$ such that $f(\mu(u)) = \pm u_1^{s_1} u_2^{s_2}\cdots u_n^{s_n}$, where $s_1,\dots,s_n$ are non-negative integers.
The theorem establishes the existence of a desingularization map; however, it is generally still difficult to obtain such maps for Kullback functions because their singularities are very complicated. From the learning coefficient $\lambda$ and its order $\theta$, the value $\nu$ is obtained theoretically as follows. Let $\xi(u)$ be an empirical process defined on the manifold obtained by a resolution of singularities, and let $\int_{u^*} du$ denote the sum over the local coordinates that attain the minimum $\lambda$ and the maximum $\theta$. We then have:
$$\nu = \frac{1}{2}\,\mathbb{E}_\xi\!\left[\frac{\displaystyle\int_0^\infty dt\int_{u^*} du\,\sqrt{t}\,\xi(u)\, t^{\lambda-1/2}\, e^{-\beta t + \beta\sqrt{t}\,\xi(u)}}{\displaystyle\int_0^\infty dt\int_{u^*} du\, t^{\lambda-1/2}\, e^{-\beta t + \beta\sqrt{t}\,\xi(u)}}\right],$$
where $\xi(u)$ is a random variable of a Gaussian process with mean zero and covariance $\mathbb{E}_\xi[\xi(w)\xi(u)] = \mathbb{E}_X[a(x,w)a(x,u)]$ for the analytic function $a(x,w)$ obtained from $\log(q(x)/p(x|w))$ by the resolution of singularities.
Our purpose in this paper is to obtain λ . In recent studies, we determined the learning coefficients for reduced rank regression [11], the three-layered neural network with one input unit and one output unit [12,13], normal mixture models with a dimension of one [14], and the restricted Boltzmann machine [15]. Additionally, Rusakov and Geiger [16,17] and Zwiernik [18], respectively, obtained the learning coefficients for naive Bayesian networks and directed tree models with hidden variables. Drton et al. [19] considered these coefficients of the Gaussian latent tree and forest models.
The papers [20,21] derived bounds on the learning coefficients for Vandermonde matrix-type singularities and explicit values under some conditions.
The remainder of this paper is structured as follows: In Section 2, we introduce log canonical thresholds in algebraic geometry. In Section 3, we summarize key theorems for obtaining learning coefficients for learning theory. In Section 4, we present our main results. We consider the log canonical thresholds of Vandermonde matrix-type singularities (Definition 3). We present our conclusions in Section 5.

2. Log Canonical Threshold

Definition 1.
Let $f$ be an analytic function in a neighborhood $U$ of $w^*$. Let $\psi$ be a $C^\infty$ function with compact support. Define the log canonical threshold $\lambda_{w^*}(f,\psi)$ as the largest pole of $\int_U |f|^{2z}\psi\,dw$ over $\mathbb{C}$, or of $\int_U |f|^{z}\psi\,dw$ over $\mathbb{R}$. Additionally, define $\theta_{w^*}(f,\psi)$ as its order. If $\psi(w^*) \ne 0$, then we write $\lambda_{w^*}(f) = \lambda_{w^*}(f,\psi)$ and $\theta_{w^*}(f) = \theta_{w^*}(f,\psi)$, because the log canonical threshold and its order are then independent of $\psi$.
Applying Hironaka's Theorem 1 to the function $f(w)$, we obtain a proper analytic map $\mu$ from a manifold $M$ to a neighborhood $U$ of $w^*$ that satisfies conditions (1) and (2) of the theorem. The integral $\int_U |f|^z \psi(w)\,dw$ is then equal to $\int_M |f\circ\mu|^z\,\psi(\mu(u))\,|\mu'(u)|\,du$, which is a sum of terms of the form $\int_{U_M} |u_1^{s_1} u_2^{s_2}\cdots u_d^{s_d}|^z\,\psi(\mu(u))\,|\mu'(u)|\,du$, where $(u_1,\dots,u_d)$ is a local analytic coordinate system on $U_M \subset M$. Therefore, the poles can be obtained. Note that for each $w^*$ with $f(w^*) \ne 0$, there exists a neighborhood $U$ such that $f(w) \ne 0$ for all $w \in U$; thus, $\int_U |f|^z \psi(w)\,dw$ has no poles. The learning coefficient is the log canonical threshold of the Kullback function (relative entropy) over the real field.
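As a one-variable illustration of Definition 1 (our own example, not from the paper): for $f(w) = w^2$ and $\psi \equiv 1$ near $w^* = 0$, working over $\mathbb{R}$,

```latex
\int_{-1}^{1} |w^2|^{z}\,dw \;=\; \frac{2}{2z+1},
```

which has its largest pole at $z = -1/2$ with order $\theta = 1$; thus $\lambda_0(w^2) = 1/2$, matching the regular-model value $d/2$ with $d = 1$.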

3. Main Theorems

In this section, we introduce several theorems for obtaining log canonical thresholds over the real field. Theorem 2 (the method for determining the deepest singular point), Theorem 3 (the method of adding variables), and Theorem 4 (the rational blowing-up method) are very helpful for obtaining log canonical thresholds. Working over the real field, these theorems reduce the number of coordinate changes needed in blow-ups.
We denote constants, such as $a^*$, $b^*$, and $w^*$, by the suffix $*$. Define the norm of a matrix $C = (c_{ij})$ as $\|C\| = \sqrt{\sum_{i,j} |c_{ij}|^2}$. Set $\mathbb{N}_{+0} = \mathbb{N}\cup\{0\}$.
Lemma 1
([14,22,23]). Let $U$ be a neighborhood of $w^* \in \mathbb{R}^d$. Consider the ring of analytic functions on $U$. Let $\mathcal{J}$ be the ideal generated by $f_1,\dots,f_n$, which are analytic functions defined on $U$.
(1) If $g_1^2 + \cdots + g_m^2 \le f_1^2 + \cdots + f_n^2$, then $\lambda_{w^*}(g_1^2 + \cdots + g_m^2) \le \lambda_{w^*}(f_1^2 + \cdots + f_n^2)$.
(2) If $g_1,\dots,g_m \in \mathcal{J}$, then $\lambda_{w^*}(g_1^2 + \cdots + g_m^2) \le \lambda_{w^*}(f_1^2 + \cdots + f_n^2)$. In particular, if $g_1,\dots,g_m$ generate the ideal $\mathcal{J}$, then $\lambda_{w^*}(f_1^2 + \cdots + f_n^2) = \lambda_{w^*}(g_1^2 + \cdots + g_m^2)$.
The following lemma is also used in the proofs.
Lemma 2
([15]). Let $\mathcal{J}$, $\mathcal{J}'$ be the ideals generated by $f_1(w),\dots,f_n(w)$ and $g_1(w'),\dots,g_m(w')$, respectively. If $w$ and $w'$ are different variables, then:
$$\lambda_{(w^*,w'^*)}\big(f_1^2+\cdots+f_n^2+g_1^2+\cdots+g_m^2\big) = \lambda_{w^*}\big(f_1^2+\cdots+f_n^2\big) + \lambda_{w'^*}\big(g_1^2+\cdots+g_m^2\big).$$
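For instance (our own check, not from the paper), Lemma 2 combined with the one-variable computation $\lambda_0(w^2) = 1/2$ gives, for the independent variables $w_1$ and $w_2$,

```latex
\lambda_{(0,0)}\big(w_1^2 + w_2^2\big) \;=\; \lambda_0(w_1^2) + \lambda_0(w_2^2) \;=\; \tfrac{1}{2} + \tfrac{1}{2} \;=\; 1,
```

again the regular-model value $d/2$ for $d = 2$.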
Theorem 2
(Method for determining the deepest singular point [21]). Let $f_1(w_1,\dots,w_d)$, …, $f_m(w_1,\dots,w_d)$ be homogeneous functions of $w_1,\dots,w_j$ ($j \le d$). Furthermore, let $\psi$ be a $C^\infty$ function such that $\psi(0,\dots,0,w_{j+1}^*,\dots,w_d^*) \ge \psi(w_1^*,\dots,w_d^*)$ and $\psi$ is a homogeneous function of $w_1,\dots,w_j$ in a small neighborhood of $(0,\dots,0,w_{j+1}^*,\dots,w_d^*)$. Then, we have:
$$\lambda_{(0,\dots,0,w_{j+1}^*,\dots,w_d^*)}\big(f_1^2+\cdots+f_m^2,\psi\big) \le \lambda_{(w_1^*,\dots,w_j^*,w_{j+1}^*,\dots,w_d^*)}\big(f_1^2+\cdots+f_m^2,\psi\big).$$
Theorem 3
(Method of adding variables [21]). Let $f_1(w_1,\dots,w_d)$, …, $f_m(w_1,\dots,w_d)$ be homogeneous functions of $w_1,\dots,w_d$. Set $f_1'(w_2,\dots,w_d) = f_1(1,w_2,\dots,w_d)$, …, $f_m'(w_2,\dots,w_d) = f_m(1,w_2,\dots,w_d)$. If $w_1^* \ne 0$, then we have:
$$\lambda_{(w_1^*,\dots,w_d^*)}\big(f_1^2+\cdots+f_m^2\big) = \lambda_{(w_2^*/w_1^*,\dots,w_d^*/w_1^*)}\big(f_1'^2+\cdots+f_m'^2\big).$$
Resolutions of singularities are usually obtained by constructing blow-ups along smooth submanifolds. In this paper, we use a blow-up method along certain singular varieties, as explained below, to obtain log canonical thresholds.
Theorem 4 (Rational blow-up process).
Let $m \le d$ and $(\alpha_1,\dots,\alpha_m) \in \mathbb{R}^m$ with $\alpha_1,\dots,\alpha_m > 0$. Consider the set $U = \{w = (w_1,\dots,w_d) \in \mathbb{R}^d \mid 0 \le w_1,\dots,w_d \le 1\}$.
For $1 \le i \le m$, we set $g_i(u^i) = (g_1^i(u^i),\dots,g_d^i(u^i))$, where $g_i^i(u^i) = u_i^i$, $g_j^i(u^i) = u_j^i\,(u_i^i)^{\alpha_j/\alpha_i}$ ($1 \le j \le m$, $j \ne i$), and $g_j^i(u^i) = u_j^i$ ($m+1 \le j \le d$).
Let $f$, $\psi$ be analytic functions defined on $U$, and set $f^i(u_1^i,\dots,u_d^i) = f(g_i(u^i))$. Then, we have:
$$\min_{1\le i\le m}\Big\{\lambda_0\Big(f^i(u^i),\ (u_i^i)^{\sum_{j=1}^m \alpha_j/\alpha_i \,-\, 1}\,\psi(g_i(u^i))\Big)\Big\} = \lambda_0(f,\psi).$$
Proof. 
The proof of this theorem uses a resolution of singularities along a smooth submanifold.
Set $w_i' = w_i^{1/\alpha_i}$, and construct the blow-up along the submanifold $\{w_1' = \cdots = w_m' = 0\}$. Then, for each $1 \le i \le m$, we have the branch $w_i' = v$, $w_j' = v\,w_j''$ ($1 \le j \le m$, $j \ne i$), and $w_j' = w_j$ ($m+1 \le j \le d$).
Consider the set $U^i = \{u^i = (u_1^i,\dots,u_d^i) \in \mathbb{R}^d \mid 0 \le u_1^i,\dots,u_d^i \le 1\}$ for $1 \le i \le m$.
Set $u_i^i = v^{\alpha_i}$ and $u_j^i = (w_j'')^{\alpha_j}$ ($j \ne i$); then we have $w_i = u_i^i$, $w_j = (u_i^i)^{\alpha_j/\alpha_i}\,u_j^i$ ($1 \le j \le m$, $j \ne i$), and $w_j = u_j^i$ ($m+1 \le j \le d$).
The Jacobian $\partial w/\partial u^i$ is $\prod_{1\le j\le m,\, j\ne i} (u_i^i)^{\alpha_j/\alpha_i}$. □
The theorem is simple; however, it is a useful tool for obtaining log canonical thresholds.
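As an informal numerical sanity check (our own toy example, not part of the original analysis), consider $f(w_1,w_2) = w_1^2 + w_2^4$ on $[0,1]^2$, whose log canonical threshold at the origin is $\lambda = 1/2 + 1/4 = 3/4$ (via Lemma 2, or via a weighted blow-up with $\alpha = (2,4)$ in the spirit of Theorem 4). The threshold governs the small-$\varepsilon$ volume scaling $\mathrm{Vol}\{f < \varepsilon\} \sim c\,\varepsilon^\lambda$, which can be observed directly:

```python
import numpy as np

# Numeric sanity check (our own toy, not from the paper): for
# f(w1, w2) = w1^2 + w2^4 on [0,1]^2, lambda_0(f) = 1/2 + 1/4 = 3/4,
# so the volume V(eps) = Vol{ f < eps } scales like eps^(3/4) and the
# slope of log V(eps) against log eps approaches 0.75.

def volume(eps, n=200_000):
    # V(eps) = int_0^1 min(1, sqrt(max(eps - w2^4, 0))) dw2, midpoint rule
    w2 = (np.arange(n) + 0.5) / n
    inner = np.sqrt(np.clip(eps - w2 ** 4, 0.0, None))
    return np.mean(np.minimum(inner, 1.0))

e1, e2 = 1e-4, 1e-6
slope = (np.log(volume(e1)) - np.log(volume(e2))) / (np.log(e1) - np.log(e2))
print(slope)   # close to lambda = 0.75
```

For this $f$, the scaling $V(\varepsilon) = c\,\varepsilon^{3/4}$ is exact for $\varepsilon \le 1$, so the fitted slope recovers $\lambda$ up to quadrature error.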

4. Main Results

In this section, we apply the theorems in Section 3 to Vandermonde matrix-type singularities, which are generic and essential in learning theory. Their associated log canonical thresholds provide the learning coefficients of, for example, three-layered neural networks in Section 4.1, normal mixture models in Section 4.2, and mixtures of binomial distributions [24].

4.1. Three-Layered Neural Network

Consider the three-layered neural network with $N$ input units, $H$ hidden units, and $M$ output units, which is trained to estimate the true distribution with $r$ hidden units. Denote an input value by $z = (z_j) \in \mathbb{R}^N$ with a probability density function $q(z)$. Then, an output value $y = (y_k) \in \mathbb{R}^M$ of the three-layered neural network is given by $y_k = f_k(z,w) + (\mathrm{noise})$, where $w = \{a_{ki}, b_{ij}\}_{1\le i\le H}$ and:
$$f_k(z,w) = \sum_{i=1}^H a_{ki}\tanh\Big(\sum_{j=1}^N b_{ij} z_j\Big).$$
Consider a statistical model:
$$p(y|z,w) = \frac{1}{(2\pi)^{M/2}}\exp\Big(-\frac{1}{2}\,\|y - f(z,w)\|^2\Big),$$
and $p(z,y|w) = p(y|z,w)\,q(z)$. Assume that the true distribution:
$$p(y|z,w_t^*) = \frac{1}{(2\pi)^{M/2}}\exp\Big(-\frac{1}{2}\,\|y - f(z,w_t^*)\|^2\Big)$$
is included in the learning model, where $w_t^* = \{a_{k,H+i}^*, b_{H+i,j}^*\}_{1\le i\le r}$ and $f_k(z,w_t^*) = \sum_{i=1}^r (-a_{k,H+i}^*)\tanh\big(\sum_{j=1}^N b_{H+i,j}^* z_j\big)$.
We have:
$$p(z,y|w) = \frac{1}{(2\pi)^{M/2}}\exp\Big(-\frac{1}{2}\,\|y - f(z,w)\|^2\Big)\,q(z),$$
and the notation $(z,y)$ for the three-layered neural network corresponds to $x$ in Section 1 and Section 2.

4.2. Normal Mixture Models

We consider a normal mixture model [14] with identity covariance matrices:
$$p(x|w) = \frac{1}{(2\pi)^{N/2}}\sum_{i=1}^H a_i \exp\Big(-\sum_{j=1}^N \frac{(z_j - b_{ij})^2}{2}\Big),$$
where $x = (z_1,\dots,z_N)$, $w = \{a_i, b_{ij}\}_{1\le i\le H}$, $\sum_{i=1}^H a_i = 1$, and $a_i \ge 0$.
Set the true distribution to be:
$$p(x|w_t^*) = \frac{1}{(2\pi)^{N/2}}\sum_{i=H+1}^{H+r} (-a_i^*)\exp\Big(-\sum_{j=1}^N \frac{(z_j - b_{ij}^*)^2}{2}\Big),$$
where $w_t^* = \{a_i^*, b_{ij}^*\}_{H+1\le i\le H+r}$, $\sum_{i=H+1}^{H+r}(-a_i^*) = 1$, and $a_i^* < 0$. (To simplify what follows, we use values $a_i^* < 0$, not $a_i^* > 0$.)

4.3. Vandermonde Matrix-Type Singularities

Definition 2.
Fix $Q \in \mathbb{N}$. Define:
$$[b_1^*, b_2^*, \dots, b_N^*]_Q = \gamma_i\,(0,\dots,0,b_i^*,\dots,b_N^*) \quad \text{if } b_1^* = \cdots = b_{i-1}^* = 0,\ b_i^* \ne 0,$$
where:
$$\gamma_i = \begin{cases} 1 & \text{if } Q \text{ is odd},\\ \mathrm{sign}(b_i^*) & \text{if } Q \text{ is even}. \end{cases}$$
For simplicity, we use the notation $w = \{a_{ki}, b_{ij}\}_{1\le i\le H}$ instead of $w = \{a_{ki}, b_{ij}\}_{1\le k\le M,\ 1\le i\le H,\ 1\le j\le N}$, because we always have $1\le k\le M$ and $1\le j\le N$ in this section.
Definition 3.
Fix $Q \in \mathbb{N}$ and $m \in \mathbb{N}_{+0}$.
Let:
$$A = \begin{pmatrix}
a_{11} & \cdots & a_{1H} & a_{1,H+1}^* & \cdots & a_{1,H+r}^* \\
a_{21} & \cdots & a_{2H} & a_{2,H+1}^* & \cdots & a_{2,H+r}^* \\
\vdots & & \vdots & \vdots & & \vdots \\
a_{M1} & \cdots & a_{MH} & a_{M,H+1}^* & \cdots & a_{M,H+r}^*
\end{pmatrix},$$
$$B_I = \Big(\prod_{j=1}^N b_{1j}^{\ell_j},\ \prod_{j=1}^N b_{2j}^{\ell_j},\ \dots,\ \prod_{j=1}^N b_{Hj}^{\ell_j},\ \prod_{j=1}^N (b_{H+1,j}^*)^{\ell_j},\ \dots,\ \prod_{j=1}^N (b_{H+r,j}^*)^{\ell_j}\Big)^t,$$
for $I = (\ell_1,\dots,\ell_N) \in \mathbb{N}_{+0}^N$, and:
$$B = (B_I)_{\ell_1+\cdots+\ell_N = Qn+m,\ n\ge 0} = \big(B_{(m,0,\dots,0)},\ B_{(m-1,1,\dots,0)},\ \dots,\ B_{(0,0,\dots,m)},\ B_{(m+Q,0,\dots,0)},\ \dots\big)$$
($t$ denotes the transpose).
Here $a_{ki}$ and $b_{ij}$ ($1\le k\le M$, $1\le i\le H$, $1\le j\le N$) are variables in a neighborhood of $a_{ki}^*$ and $b_{ij}^*$, where $a_{ki}^*$ and $b_{ij}^*$ are fixed constants.
Let $\mathcal{J}$ be the ideal generated by the elements of $AB$.
We call singularities of $\mathcal{J}$ Vandermonde matrix-type singularities.
To simplify, we usually assume that:
$$(a_{1,H+j}^*, a_{2,H+j}^*, \dots, a_{M,H+j}^*)^t \ne 0, \qquad (b_{H+j,1}^*, b_{H+j,2}^*, \dots, b_{H+j,N}^*) \ne 0,$$
for $1\le j\le r$, and:
$$[b_{H+j,1}^*, b_{H+j,2}^*, \dots, b_{H+j,N}^*]_Q \ne [b_{H+j',1}^*, b_{H+j',2}^*, \dots, b_{H+j',N}^*]_Q,$$
for $j \ne j'$.
Example 1.
If $m = N = M = r = 1$, $Q = 2$, $H = 3$, then we have:
$$A = (a_{11}\ \ a_{12}\ \ a_{13}\ \ a_{14}^*), \qquad B = \begin{pmatrix}
b_{11} & b_{11}^3 & b_{11}^5 & b_{11}^7 \\
b_{21} & b_{21}^3 & b_{21}^5 & b_{21}^7 \\
b_{31} & b_{31}^3 & b_{31}^5 & b_{31}^7 \\
b_{41}^* & b_{41}^{*3} & b_{41}^{*5} & b_{41}^{*7}
\end{pmatrix}.$$
These matrices $A$, $B$ correspond to the three-layered neural network:
$$p(y|z,w) = \frac{1}{(2\pi)^{1/2}}\exp\Big(-\frac{1}{2}\big(y - a_{11}\tanh(b_{11}z) - a_{12}\tanh(b_{21}z) - a_{13}\tanh(b_{31}z)\big)^2\Big),$$
and the true distribution:
$$p(y|z,w_t^*) = \frac{1}{(2\pi)^{1/2}}\exp\Big(-\frac{1}{2}\big(y + a_{14}^*\tanh(b_{41}^*z)\big)^2\Big).$$
Example 2.
If $Q = r = M = 1$, $H = 2$, $N = 2$, then we have $A = (a_{11}\ \ a_{12}\ \ a_{13}^*)$ and:
$$B = \begin{pmatrix}
b_{11} & b_{12} & b_{11}^2 & b_{11}b_{12} & b_{12}^2 & b_{11}^3 & b_{11}b_{12}^2 & b_{11}^2 b_{12} & b_{12}^3 \\
b_{21} & b_{22} & b_{21}^2 & b_{21}b_{22} & b_{22}^2 & b_{21}^3 & b_{21}b_{22}^2 & b_{21}^2 b_{22} & b_{22}^3 \\
b_{31}^* & b_{32}^* & b_{31}^{*2} & b_{31}^* b_{32}^* & b_{32}^{*2} & b_{31}^{*3} & b_{31}^* b_{32}^{*2} & b_{31}^{*2} b_{32}^* & b_{32}^{*3}
\end{pmatrix}.$$
If $a_{13}^* = -1$, these matrices $A$, $B$ correspond to a normal mixture model with identity covariance matrices:
$$p(x|w) = \frac{a_{11}}{2\pi}\exp\Big(-\frac{(z_1-b_{11})^2 + (z_2-b_{12})^2}{2}\Big) + \frac{a_{12}}{2\pi}\exp\Big(-\frac{(z_1-b_{21})^2 + (z_2-b_{22})^2}{2}\Big),$$
with $\sum_{i=1}^2 a_{1i} = 1$, $a_{1i} \ge 0$, and the true distribution is:
$$p(x|w_t^*) = \frac{1}{2\pi}\,(-a_{13}^*)\exp\Big(-\frac{(z_1-b_{31}^*)^2 + (z_2-b_{32}^*)^2}{2}\Big), \qquad a_{13}^* = -1.$$
In this paper, we denote:
$$A_{M,H} = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1H} \\
a_{21} & a_{22} & \cdots & a_{2H} \\
\vdots & & & \vdots \\
a_{M1} & a_{M2} & \cdots & a_{MH}
\end{pmatrix}, \qquad
B_{H,N,I} = \begin{pmatrix}
\prod_{j=1}^N b_{1j}^{\ell_j} \\
\prod_{j=1}^N b_{2j}^{\ell_j} \\
\vdots \\
\prod_{j=1}^N b_{Hj}^{\ell_j}
\end{pmatrix}, \quad \text{and}$$
$$B_{H,N}(Q) = (B_{H,N,I})_{\ell_1+\cdots+\ell_N = Qn+1,\ 0\le n\le H-1}.$$
Furthermore, we denote $a^* = \begin{pmatrix} a_{1,H+1}^* \\ \vdots \\ a_{M,H+1}^* \end{pmatrix}$ and:
$$(A_{M,H},\, a^*) = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1H} & a_{1,H+1}^* \\
a_{21} & a_{22} & \cdots & a_{2H} & a_{2,H+1}^* \\
\vdots & & & \vdots & \vdots \\
a_{M1} & a_{M2} & \cdots & a_{MH} & a_{M,H+1}^*
\end{pmatrix}.$$
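The notation above can be made concrete with a small numeric helper (our own illustration of the notation, not code from the paper): build $B_{H,N}(Q)$ from an $H \times N$ array $b$, where the columns are the vectors $B_{H,N,I}$ over multi-indices $I = (\ell_1,\dots,\ell_N)$ with $\ell_1+\cdots+\ell_N = Qn+1$, $0 \le n \le H-1$:

```python
import itertools
import numpy as np

# Our own illustration of Definition 3's notation (not from the paper):
# columns of B_{H,N}(Q) are B_{H,N,I} = (prod_j b_{ij}^{l_j})_{1<=i<=H}
# over multi-indices I with |I| = Qn + 1 for 0 <= n <= H - 1.

def multi_indices(N, degree):
    """All I = (l_1, ..., l_N) of non-negative integers with |I| = degree."""
    return [I for I in itertools.product(range(degree + 1), repeat=N)
            if sum(I) == degree]

def B_matrix(b, Q):
    b = np.asarray(b, dtype=float)
    H, N = b.shape
    cols = []
    for n in range(H):                        # degrees Qn + 1, 0 <= n <= H-1
        for I in multi_indices(N, Q * n + 1):
            cols.append(np.prod(b ** np.array(I), axis=1))
    return np.column_stack(cols)

# shape matching the B displayed in Example 2 (3 rows, N = 2, Q = 1):
b = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
print(B_matrix(b, 1).shape)   # (3, 9): degrees 1, 2, 3 give 2 + 3 + 4 columns
```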
Theorem 5
([14]). Consider a sufficiently small neighborhood $U$ of:
$$w^* = \{a_{ki}^*, b_{ij}^*\}_{1\le i\le H}$$
and variables $w = \{a_{ki}, b_{ij}\}_{1\le i\le H}$ in the set $U$. Set $(b_{01}^{**}, b_{02}^{**}, \dots, b_{0N}^{**}) = (0,\dots,0)$. Let $(b_{11}^{**}, b_{12}^{**}, \dots, b_{1N}^{**})$, …, $(b_{r'1}^{**}, b_{r'2}^{**}, \dots, b_{r'N}^{**})$ be the distinct real vectors in:
$$\{[b_{i1}^*, b_{i2}^*, \dots, b_{iN}^*]_Q \ne 0,\ \text{for } i = 1,\dots,H+r\};$$
that is,
$$\{(b_{11}^{**},\dots,b_{1N}^{**}),\dots,(b_{r'1}^{**},\dots,b_{r'N}^{**})\} = \{[b_{i1}^*,\dots,b_{iN}^*]_Q \ne 0,\ i = 1,\dots,H+r\}.$$
Then, $r'$ is uniquely determined, and $r' \ge r$ by the assumption in Definition 3. Set $(b_{i1}^{**},\dots,b_{iN}^{**}) = [b_{H+i,1}^*,\dots,b_{H+i,N}^*]_Q$ for $1\le i\le r$. Assume that:
$$[b_{i1}^*,\dots,b_{iN}^*]_Q = \begin{cases}
0, & 1\le i\le H_0,\\
(b_{11}^{**},\dots,b_{1N}^{**}), & H_0+1\le i\le H_0+H_1,\\
(b_{21}^{**},\dots,b_{2N}^{**}), & H_0+H_1+1\le i\le H_0+H_1+H_2,\\
\qquad\vdots & \\
(b_{r'1}^{**},\dots,b_{r'N}^{**}), & H_0+\cdots+H_{r'-1}+1\le i\le H_0+\cdots+H_{r'},
\end{cases}$$
and $H_0+\cdots+H_{r'} = H$. We then have:
$$\begin{aligned}
\lambda_{w^*}\big(\|AB\|^2\big) &= \frac{Mr}{2} + \lambda_{w_1^{(0)*}}\big(\|A_{M,H_0}\, B_{H_0,N}(Q)\|^2\big)\\
&\quad + \sum_{\alpha=1}^{r}\lambda_{w_1^{(\alpha)*}}\big(\|(A_{M,H_\alpha-1},\, a^{(\alpha)*})\, B_{H_\alpha,N}(1)\|^2\big) + \sum_{\alpha=r+1}^{r'}\lambda_{w_1^{(\alpha)*}}\big(\|A_{M,H_\alpha-1}\, B_{H_\alpha-1,N}(1)\|^2\big),
\end{aligned}$$
where $w_1^{(0)*} = \{a_{k,i}^*, 0\}_{1\le i\le H_0}$, $w_1^{(\alpha)*} = \{a_{k,H_0+\cdots+H_{\alpha-1}+i}^*, 0\}_{2\le i\le H_\alpha}$, and $a^{(\alpha)*} = (a_{1,H+\alpha}^*, \dots, a_{M,H+\alpha}^*)^t$ for $\alpha \ge 1$.
Theorem 6.
We use the same notation as in Theorem 5. Set $\lambda = \lambda_0\big(\|A_{M,H}\, B_{H,N}(Q)\|^2\big)$.
We have the following:
1. $H = 1$:
$$\lambda = \min\Big\{\frac{M}{2},\ \frac{N}{2}\Big\}.$$
2. $H = 2$:
$$\lambda = \min\Big\{\frac{\beta N + (2-\beta)M}{2},\ \beta = 0,1,2;\quad \frac{2N + Q(N-1+M)}{2(Q+1)}\Big\}.$$
3. $H = 3$:
$$\begin{aligned}
\lambda = \min\Big\{&\frac{\beta N + (3-\beta)M}{2},\ \beta = 0,1,2,3;\\
&\frac{\beta N + (3-\beta)M + Q\big(\alpha(N+\alpha-\beta) + (3-\alpha)M\big)}{2(Q+1)},\ \alpha = 1,\dots,\beta-1,\ \beta = 2,3;\\
&\frac{3N + Q(3N-3+3M)}{2(2Q+1)}\Big\}.
\end{aligned}$$
4. $H = 4$:
$$\begin{aligned}
\lambda = \min\Big\{&\frac{\beta N + (4-\beta)M}{2},\ \beta = 0,1,2,3,4;\\
&\frac{\beta N + (4-\beta)M + Q\big(\alpha(N+\alpha-\beta) + (4-\alpha)M\big)}{2(Q+1)},\ \alpha = 1,\dots,\beta-1,\ \beta = 2,3,4;\\
&\frac{4N + Q\big(\alpha N - \alpha - 1 + (8-\alpha)M\big)}{2(2Q+1)},\ \alpha = 2,3,4;\quad \frac{4N + Q(5N-5+3M)}{2(2Q+1)};\\
&\frac{3N + M + Q\big(\alpha N - \alpha + (8-\alpha)M\big)}{2(2Q+1)},\ \alpha = 2,3;\\
&\frac{4N + Q(6N-6+6M)}{2(3Q+1)}\Big\}.
\end{aligned}$$
Its proof appears in Appendix A.
In the paper [22], we obtained exact values for $N = 1$:
$$\lambda_0\big(\|A_{M,H}\, B_{H,1}(Q)\|^2\big) = \frac{MQk(k+1) + 2H}{4(1+kQ)},$$
where $k = \max\{i \in \mathbb{Z} : 2H \ge M(i(i-1)Q + 2i)\}$, and we had:
$$\theta = \begin{cases} 1, & \text{if } 2H > M\big(k(k-1)Q + 2k\big),\\ 2, & \text{if } 2H = M\big(k(k-1)Q + 2k\big). \end{cases}$$
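These closed forms are easy to evaluate mechanically. The following sketch (our own arrangement of the formulas above, not code from the paper) computes $\lambda$ for $H = 2$ via the explicit minimum of Theorem 6 and the exact $N = 1$ value from [22], and checks that the two agree where they overlap:

```python
from fractions import Fraction

# Our own arrangement (not from the paper) of Theorem 6 (H = 2) and the
# N = 1 closed form lambda = (MQk(k+1) + 2H) / (4(1 + kQ)) with
# k = max{ i : 2H >= M(i(i-1)Q + 2i) }.

def lam_H2(N, M, Q):
    cands = [Fraction(b * N + (2 - b) * M, 2) for b in (0, 1, 2)]
    cands.append(Fraction(2 * N + Q * (N - 1 + M), 2 * (Q + 1)))
    return min(cands)

def lam_N1(H, M, Q):
    k = 0
    while 2 * H >= M * ((k + 1) * k * Q + 2 * (k + 1)):
        k += 1                    # k = largest i satisfying the bound
    return Fraction(M * Q * k * (k + 1) + 2 * H, 4 * (1 + k * Q))

# the two formulas agree on their common domain (N = 1, H = 2)
for M in range(1, 6):
    for Q in range(1, 6):
        assert lam_H2(1, M, Q) == lam_N1(2, M, Q)
print(lam_H2(1, 1, 2))   # 2/3
```

Exact rational arithmetic (`Fraction`) is used because $\lambda \in \mathbb{Q}$, so equality of the two formulas can be tested without floating-point tolerance.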

5. Conclusions

In this paper, we proposed a new "rational blowing-up" method (Theorem 4), applied it to Vandermonde matrix-type singularities, and demonstrated its effectiveness. Theorem 6 gives the explicit values of the log canonical thresholds for $H = 1, 2, 3, 4$. Our future research aim is to improve these methods and obtain explicit values for the general model.
These theoretical values provide a mathematical measure of the precision of numerical calculations for the information criteria in Section 1. Furthermore, our theoretical results should be helpful in numerical experiments such as Markov chain Monte Carlo (MCMC). In the papers [25,26], a mathematical foundation for analyzing and improving the precision of the MCMC method was constructed by using theoretical values of marginal likelihoods.
We will also consider these applications in the future.

Funding

This research was funded by the Ministry of Education, Culture, Sports, Science, and Technology in Japan, Grant-in-Aid for Scientific Research 18K11479.

Acknowledgments

We thank Maxine Garcia, PhD, from Edanz Group for editing a draft of this manuscript.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A

We prove Theorem 6 by blowing-up processes.
Let $j[\alpha], j[\alpha,\beta] \in \{1,\dots,N\}$ and $i[\alpha], i[\alpha,\beta] \in \{1,\dots,H\}$.
By constructing the blow-up along $\{b_{ij} = 0,\ 1\le i\le H,\ 1\le j\le N\}$ and choosing one branch of the blow-up process, we may assume that $b_{ij} = v_1 b'_{ij}$ for all $i, j$, with $b'_{1,j[1]} \ne 0$. We also set $b''_{ij} = b'_{ij} - \dfrac{b'_{i,j[1]}}{b'_{1,j[1]}}\, b'_{1j}$ ($i\ge 2$, $j\ne j[1]$), $b''_{1j} = b'_{1j}$, $b''_{i,j[1]} = b'_{i,j[1]}$ ($i, j\ge 1$), and $a'_{k1} = \sum_{i=1}^H a_{ki}\, b'_{i,j[1]}$.
Then, we have:
$$f_{kI} = \sum_{i=1}^H a_{ki}\prod_{j=1}^N b_{ij}^{\ell_j} = v_1^{|I|}\Big(a_{k1}\prod_{j=1}^N b_{1j}'^{\,\ell_j} + \sum_{i=2}^H a_{ki}\, b_{i,j[1]}'^{\,\ell_{j[1]}}\prod_{j\ne j[1]}\Big(b''_{ij} + \frac{b'_{i,j[1]}}{b'_{1,j[1]}}\, b'_{1j}\Big)^{\ell_j}\Big),$$
for $I = (\ell_1,\dots,\ell_N)$. Hence:
$$\begin{aligned}
\mathcal{J} &= \Big\langle f_{kI} \;\Big|\; I = (\ell_1,\dots,\ell_N)\in\mathbb{N}_{+0}^N,\ |I| = nQ+1,\ n\in\mathbb{N}_{+0},\ 1\le k\le M\Big\rangle\\
&= \Big\langle v_1^{nQ+1}\sum_{i=1}^H a_{ki}\, b_{i,j[1]}'^{\,nQ+1},\quad v_1^{|I|}\sum_{i=2}^H a_{ki}\prod_{j=1}^N b_{ij}''^{\,\ell_j}\Big\rangle\\
&= \Big\langle v_1 a'_{k1},\quad v_1^{(n+1)Q+1}\sum_{i=2}^H a_{ki}\, b_{i,j[1]}'^{\,nQ+1}\big(b_{i,j[1]}'^{\,Q} - b_{1,j[1]}'^{\,Q}\big),\quad v_1^{|I|}\sum_{i=2}^H a_{ki}\prod_{j=1}^N b_{ij}''^{\,\ell_j}\Big\rangle.
\end{aligned}$$
For simplicity, we again write $b_{ij}$ for $b''_{ij}$ and $a_{k1}$ for $a'_{k1}$.
By constructing the blow-up along $\{b_{ij} = 0,\ 2\le i\le H,\ 2\le j\le N\}$ and choosing one branch of the blow-up process, set $b_{ij} = v_{21} b'_{ij}$ for $i, j\ge 2$, with $b'_{2,j[2]} \ne 0$.
  • If $j[1] \ne j[2]$, then set $b''_{ij} = b'_{ij} - \dfrac{b'_{i,j[2]}}{b'_{2,j[2]}}\, b'_{2j}$ ($i\ge 3$, $j\ne j[2]$), $b''_{2j} = b'_{2j}$, $b''_{i,j[2]} = b'_{i,j[2]}$ ($i, j\ge 2$), and $a'_{k2} = \sum_{i=2}^H a_{ki}\, b'_{i,j[2]}$. Then:
$$\begin{aligned}
\mathcal{J} &= \Big\langle v_1 a_{k1},\ v_1^{(n+1)Q+1}\sum_{i=2}^H a_{ki}\, b_{i,j[1]}^{nQ+1}\big(b_{i,j[1]}^{Q} - b_{1,j[1]}^{Q}\big),\ v_1^{|I|}\sum_{i=2}^H a_{ki}\prod_{j=1}^N b_{ij}^{\ell_j}\Big\rangle\\
&= \Big\langle v_1 a_{k1},\ v_1 v_{21} a'_{k2},\ v_1^{(n+1)Q+1} v_{21}^{nQ+1}\sum_{i=3}^H a_{ki}\, b_{i,j[1]}^{nQ+1}\big(b_{i,j[1]}^{Q} - v_{21}^{Q}\, b_{1,j[1]}^{Q}\big),\\
&\qquad v_1^{|I|} v_{21}^{|I|}\sum_{i=3}^H a_{ki}\prod_{j=1}^N b_{ij}''^{\,\ell_j},\ v_1^{(n+1)Q+1} v_{21}^{(n+1)Q+1}\sum_{i=3}^H a_{ki}\, b_{i,j[2]}'^{\,nQ+1}\big(b_{i,j[2]}'^{\,Q} - b_{2,j[2]}'^{\,Q}\big),\ \dots\Big\rangle.
\end{aligned}$$
    For simplicity, we again write $b_{ij}$ for $b''_{ij}$ and $a_{k2}$ for $a'_{k2}$.
  • If $j[1] = j[2]$, then, by constructing the blow-up along $\{b_{ij} = 0,\ 2\le i\le H,\ 2\le j\le N,\ j\ne j[2]\}$ and choosing one branch of the blow-up process, we set $b_{ij} = v_{22} b'_{ij}$ for $i, j\ge 2$, $j\ne j[2]$. Assume that $b'_{i[2,2],j[2,2]} \ne 0$. By constructing the rational blow-up along $\{v_1^Q = 0,\ v_{22} = 0\}$, we have the following cases (i) and (ii).
    (i) Set $v_{22} = v_1^Q v'_{22}$. Set $b''_{ij} = b'_{ij} - \dfrac{b'_{i,j[2]}\big(b_{i,j[2]}'^{\,Q} - b_{1,j[2]}'^{\,Q}\big)}{b'_{2,j[2]}\big(b_{2,j[2]}'^{\,Q} - b_{1,j[2]}'^{\,Q}\big)}\, b'_{2j}$ for $i\ge 3$, $j\ne j[2]$, $b''_{2j} = b'_{2j}$, $b''_{i,j[2]} = b'_{i,j[2]}$ for $i, j\ge 2$, and $a'_{k2} = \sum_{i=2}^H a_{ki}\, b'_{i,j[2]}\big(b_{i,j[2]}'^{\,Q} - v_{21}^{Q}\, b_{1,j[2]}'^{\,Q}\big)$. Then:
$$\begin{aligned}
\mathcal{J} &= \Big\langle v_1 a_{k1},\ v_1^{(n+1)Q+1} v_{21}^{nQ+1}\sum_{i=2}^H a_{ki}\, b_{i,j[1]}^{nQ+1}\big(b_{i,j[1]}^{Q} - v_{21}^{Q}\, b_{1,j[1]}^{Q}\big),\\
&\qquad v_1^{|I|(Q+1)-Q\ell_{j[2]}}\, v_{21}^{|I|}\, v_{22}'^{\,|I|-\ell_{j[2]}}\sum_{i=2}^H a_{ki}\prod_{j=1}^N b_{ij}^{\ell_j}\ \ \text{for}\ |I| - \ell_{j[2]} > 0\Big\rangle\\
&= \Big\langle v_1 a_{k1},\ v_1^{Q+1} v_{21}\, a'_{k2},\ v_1^{(n+2)Q+1} v_{21}^{(n+1)Q+1}\sum_{i=3}^H a_{ki}\, b_{i,j[1]}^{nQ+1}\big(b_{i,j[1]}^{Q} - v_{21}^{Q}\, b_{1,j[1]}^{Q}\big)\big(b_{i,j[1]}^{Q} - b_{2,j[2]}^{Q}\big),\\
&\qquad v_1^{|I|(Q+1)-Q\ell_{j[2]}}\, v_{21}^{|I|}\, v_{22}'^{\,|I|-\ell_{j[2]}}\sum_{i=3}^H a_{ki}\prod_{j=1}^N b_{ij}''^{\,\ell_j}\ \ \text{for}\ |I| - \ell_{j[2]} > 0\Big\rangle.
\end{aligned}$$
    (ii) Set $v_1 = v_{22}^{1/Q} v'_1$. Set $b''_{ij} = b'_{ij} - \dfrac{b'_{i,j[2,2]}}{b'_{i[2,2],j[2,2]}}\, b'_{i[2,2],j}$ for $i\ge 2$, $i\ne i[2,2]$, $j\ne j[2,2], j[2]$, and $a'_{k,i[2,2]} = \sum_{i=2}^H a_{ki}\, b'_{i,j[2,2]}$. Then:
$$\begin{aligned}
\mathcal{J} &= \Big\langle v_1 v_{22}^{1/Q} a_{k1},\ v_{22}^{(n+1)+1/Q}\, v_1^{(n+1)Q+1}\sum_{i=2}^H a_{ki}\, b_{i,j[1]}^{nQ+1}\big(b_{i,j[1]}^{Q} - b_{1,j[1]}^{Q}\big),\\
&\qquad v_{22}^{|I|/Q}\, v_1^{|I|}\, v_{21}^{|I|}\, v_{22}^{|I|-\ell_{j[2]}}\sum_{i=2}^H a_{ki}\, b_{i,j[2]}^{\ell_{j[2]}}\, b_{i,j[2,2]}^{\ell_{j[2,2]}}\ \ \text{for}\ |I| = \ell_{j[2]} + \ell_{j[2,2]},\ \ell_{j[2,2]} > 0,\\
&\qquad v_{22}^{|I|/Q}\, v_1^{|I|}\, v_{21}^{|I|}\, v_{22}^{|I|-\ell_{j[2]}}\sum_{i\ge 2,\, i\ne i[2,2]}^{H} a_{ki}\prod_{j=1}^N b_{ij}^{\ell_j}\ \ \text{for}\ |I| - \ell_{j[2]} - \ell_{j[2,2]} > 0\Big\rangle\\
&= \Big\langle v_1 v_{22}^{1/Q} a_{k1},\ v_1 v_{22}^{1+1/Q}\, v_{21}\, a'_{k,i[2,2]},\ v_{22}^{(n+1)+1/Q}\, v_1^{(n+1)Q+1}\sum_{i=2}^H a_{ki}\, b_{i,j[1]}^{nQ+1}\big(b_{i,j[1]}^{Q} - b_{1,j[1]}^{Q}\big),\\
&\qquad v_{22}^{|I|/Q}\, v_1^{|I|}\, v_{21}^{|I|}\, v_{22}^{|I|-\ell_{j[2]}}\sum_{i=2}^H a_{ki}\, b_{i,j[2]}^{\ell_{j[2]}}\, b_{i,j[2,2]}^{\ell_{j[2,2]}}\ \ \text{for}\ |I| = \ell_{j[2]} + \ell_{j[2,2]},\ \ell_{j[2,2]} > 0,\\
&\qquad v_{22}^{|I|/Q}\, v_1^{|I|}\, v_{21}^{|I|}\, v_{22}^{|I|-\ell_{j[2]}}\sum_{i\ge 2,\, i\ne i[2,2]}^{H} a_{ki}\prod_{j=1}^N b_{ij}^{\ell_j}\ \ \text{for}\ |I| - \ell_{j[2]} - \ell_{j[2,2]} > 0\Big\rangle.
\end{aligned}$$
By iterating such blow-ups, we obtain the theorem.

References

  1. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723. [Google Scholar] [CrossRef]
  2. Watanabe, S. Algebraic analysis for nonidentifiable learning machines. Neural Comput. 2001, 13, 899–933. [Google Scholar] [CrossRef] [PubMed]
  3. Watanabe, S. Algebraic geometrical methods for hierarchical learning machines. Neural Netw. 2001, 14, 1049–1060. [Google Scholar] [CrossRef]
  4. Watanabe, S. Algebraic geometry of learning machines with singularities and their prior distributions. J. Jpn. Soc. Artif. Intell. 2001, 16, 308–315. [Google Scholar]
  5. Watanabe, S. Algebraic Geometry and Statistical Learning Theory; Cambridge University Press: New York, NY, USA, 2009; Volume 25. [Google Scholar]
  6. Watanabe, S. Equations of states in singular statistical estimation. Neural Netw. 2010, 23, 20–34. [Google Scholar] [CrossRef] [Green Version]
  7. Watanabe, S. Mathematical Theory of Bayesian Statistics; CRC Press: New York, NY, USA, 2018. [Google Scholar]
  8. Drton, M.; Plummer, M. A Bayesian information criterion for singular models. J. R. Statist. Soc. B 2017, 79, 1–38. [Google Scholar] [CrossRef]
  9. Hironaka, H. Resolution of singularities of an algebraic variety over a field of characteristic zero. Ann. Math. 1964, 79, 109–326. [Google Scholar] [CrossRef]
  10. Atiyah, M.F. Resolution of singularities and division of distributions. Commun. Pure Appl. Math. 1970, 13, 145–150. [Google Scholar] [CrossRef]
  11. Aoyagi, M.; Watanabe, S. Stochastic complexities of reduced rank regression in Bayesian estimation. Neural Netw. 2005, 18, 924–933. [Google Scholar] [CrossRef]
  12. Aoyagi, M.; Watanabe, S. Resolution of singularities and the generalization error with Bayesian estimation for layered neural network. IEICE Trans. J88-D-II 2005, 10, 2112–2124. [Google Scholar]
  13. Aoyagi, M. The zeta function of learning theory and generalization error of three layered neural perceptron. RIMS Kokyuroku Recent Top. Real Complex Singul. 2006, 1501, 153–167. [Google Scholar]
  14. Aoyagi, M. A Bayesian learning coefficient of generalization error and Vandermonde matrix-type singularities. Commun. Stat. Theory Methods 2010, 39, 2667–2687. [Google Scholar] [CrossRef]
  15. Aoyagi, M. Learning coefficient in Bayesian estimation of restricted Boltzmann machine. J. Algebr. Stat. 2013, 4, 30–57. [Google Scholar] [CrossRef]
  16. Rusakov, D.; Geiger, D. Asymptotic Model Selection for Naive Bayesian Networks. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, Alberta, AB, Canada, 1–4 August 2002; pp. 438–445. [Google Scholar]
  17. Rusakov, D.; Geiger, D. Asymptotic model selection for naive Bayesian networks. J. Mach. Learn. Res. 2005, 6, 1–35. [Google Scholar]
  18. Zwiernik, P. An asymptotic behavior of the marginal likelihood for general Markov models. J. Mach. Learn. Res. 2011, 12, 3283–3310. [Google Scholar]
  19. Drton, M.; Lin, S.; Weihs, L.; Zwiernik, P. Marginal likelihood and model selection for Gaussian latent tree and forest models. Bernoulli 2017, 23, 1202–1232. [Google Scholar] [CrossRef] [Green Version]
  20. Aoyagi, M.; Nagata, K. Learning coefficient of generalization error in Bayesian estimation and Vandermonde matrix type singularity. Neural Comput. 2012, 24, 1569–1610. [Google Scholar] [CrossRef] [PubMed]
  21. Aoyagi, M. Consideration on singularities in learning theory and the learning coefficient. Entropy 2013, 15, 3714–3733. [Google Scholar] [CrossRef]
  22. Aoyagi, M. Log canonical threshold of Vandermonde matrix type singularities and generalization error of a three layered neural network. Int. J. Pure Appl. Math. 2009, 52, 177–204. [Google Scholar]
  23. Lin, S. Asymptotic approximation of marginal likelihood integrals. arXiv 2010, arXiv:1003.5338v2. [Google Scholar]
  24. Yamazaki, K.; Aoyagi, M.; Watanabe, S. Asymptotic analysis of Bayesian generalization error with Newton diagram. Neural Netw. 2010, 23, 35–43. [Google Scholar] [CrossRef] [PubMed]
  25. Nagata, K.; Watanabe, S. Exchange Monte Carlo Sampling from Bayesian posterior for singular learning machines. IEEE Trans. Neural Netw. 2008, 19, 1253–1266. [Google Scholar] [CrossRef]
  26. Nagata, K.; Watanabe, S. Asymptotic behavior of exchange ratio in exchange Monte Carlo method. Int. J. Neural Netw. 2008, 21, 980–988. [Google Scholar] [CrossRef] [PubMed]
