Learning Coefficient of Vandermonde Matrix-Type Singularities in Model Selection

In recent years, selecting appropriate learning models has become more important with the increased need to analyze learning systems, and many model selection methods have been developed. The learning coefficient in Bayesian estimation, which measures the learning efficiency of singular learning models, plays an important role in several information criteria. The learning coefficient of a regular model is known to be half the dimension of the parameter space, while that of a singular model is smaller and varies from model to model. Mathematically, the learning coefficient is the log canonical threshold. In this paper, we provide a new rational blowing-up method for obtaining these coefficients. By applying it to Vandermonde matrix-type singularities, we demonstrate the effectiveness of the method.


Introduction
In recent studies, real data associated with, for example, image or speech recognition, psychology, and economics, have been analyzed by learning systems. Hence, many learning models have been proposed, and thus, the need for appropriate model selection methods has increased.
Let q(x) be a true probability density function of variables x ∈ R^N, and let x^n := {x_i}_{i=1}^n be n training samples selected from q(x) independently and identically. Consider a learning model written in probabilistic form as p(x|w), where w ∈ W ⊂ R^d is a parameter.
Suppose that the purpose of the learning system is to estimate the unknown true density function q(x) from x^n using p(x|w) in Bayesian estimation. Let ψ(w) be an a priori probability density function on the parameter set W and p(w|x^n) be the a posteriori probability density function:

p(w|x^n) = (1/Z_n) ψ(w) ∏_{i=1}^n p(x_i|w)^β, where Z_n = ∫_W ψ(w) ∏_{i=1}^n p(x_i|w)^β dw,

for inverse temperature β. We typically set β = 1. Define:

K(q||p) = ∫ q(x) log(q(x)/p(x)) dx.

The function K(q||p), which always has a non-negative value and satisfies K(q||p) = 0 if and only if q(x) = p(x), is a pseudo-distance between the density functions p(x) and q(x). We define the Bayes training loss T_n and the Bayes generalization loss G_n as follows:

T_n = −(1/n) ∑_{i=1}^n log p(x_i|x^n), and G_n = −∫ q(x) log p(x|x^n) dx,

where p(x|x^n) = ∫ p(x|w) p(w|x^n) dw is the predictive density.
Additionally, we define the Bayesian generalization error B_g and the Bayesian training error B_t as follows: B_g = K(q(x)||p(x|x^n)) and B_t = K_n(q(x)||p(x|x^n)), where K_n(q||p) = (1/n) ∑_{i=1}^n log(q(x_i)/p(x_i)) is the empirical version of K.
Then, we have B_g = G_n − S and B_t = T_n − S_n, where S = −∫ q(x) log q(x) dx is the average entropy and S_n = −(1/n) ∑_{i=1}^n log q(x_i) is the empirical entropy of the true density function. The value B_g describes how precisely the predictive density approximates the true density function.
We define x^n\x_i = {x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_n}. The WAIC is given by

W_n = T_n + (1/n) ∑_{i=1}^n { E_w[(log p(x_i|w))²] − (E_w[log p(x_i|w)])² },

where E_w denotes the expectation with respect to the a posteriori distribution, and the cross-validation loss is given by

C_n = −(1/n) ∑_{i=1}^n log p(x_i|x^n\x_i).

Watanabe [2,3,6,7] proved that the WAIC and the cross-validation loss are asymptotically equivalent to each other and, on average, to the Bayesian generalization loss. For singular models, the set W_0 is typically not a singleton set. Nevertheless, the WAIC and the cross-validation can estimate the Bayesian generalization error without any knowledge of the true probability density function; these values are calculated from the training samples x_i using the learning model p. In real applications or experiments, we typically do not know the true distribution, but only the values of the training errors. Our purpose is to show that these methods are effective: we can select a suitable model from several statistical models by observing these values.
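As a concrete sanity check of these definitions, the following sketch (our illustration, not from the paper) uses a one-dimensional normal-mean model with prior N(0, 1) and importance sampling from the prior to compute T_n, the WAIC, and the leave-one-out cross-validation loss; all variable names and the choice of model are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, S = 50, 20_000
x = rng.normal(0.5, 1.0, n)      # training samples x^n from the true density q
w = rng.normal(0.0, 1.0, S)      # draws from the prior psi(w) = N(0, 1)

# log p(x_i | w_s) for the model p(x|w) = N(x; w, 1)
logp = -0.5 * (x[None, :] - w[:, None]) ** 2 - 0.5 * np.log(2 * np.pi)

# posterior weights (beta = 1): proportional to the likelihood of each draw
logw = logp.sum(axis=1)
pw = np.exp(logw - logw.max())
pw /= pw.sum()

# Bayes training loss T_n, with predictive p(x_i|x^n) = E_w[p(x_i|w)]
T_n = -np.mean(np.log(pw @ np.exp(logp)))

# WAIC = T_n + (1/n) sum_i Var_w[log p(x_i|w)]   (functional variance term)
m1 = pw @ logp
m2 = pw @ logp ** 2
waic = T_n + np.mean(m2 - m1 ** 2)

# leave-one-out CV via the identity p(x_i | x^n\x_i) = 1 / E_w[1 / p(x_i|w)]
cv = np.mean(np.log(pw @ np.exp(-logp)))

print(T_n, waic, cv)
```

Here E_w is the posterior expectation, approximated by the self-normalized weights pw; on runs like this, waic and cv agree closely, consistent with Watanabe's asymptotic equivalence.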
In this paper, we consider the value λ, which equals the log canonical threshold introduced in Definition 1. This coefficient is not needed to evaluate the WAIC or the cross-validation in practice; however, the learning coefficients from our recent results have been used very effectively by Drton and Plummer [8] for model selection using a method called the "singular Bayesian information criterion (sBIC)". The sBIC method works even when only bounds on the learning coefficients are available.
For regular models, it is known that λ = ν = d/2 holds, where d is the dimension of the parameter space. The value λ is obtained by a blowing-up process, which is the main tool in the desingularization of an algebraic variety. The following theorem is an analytic version of Hironaka's theorem [9] used by Atiyah [10].
The theorem establishes the existence of the desingularization map; however, it is generally still difficult to obtain such maps for Kullback functions because the singularities of these functions are very complicated. From the learning coefficient λ and its order θ, the value ν is obtained theoretically as follows. Let ξ(u) be an empirical process defined on the manifold obtained by a resolution of singularities, and let ∑_{u*} denote the sum over the local coordinates that attain the minimum λ and the maximum θ. Then ν is given by an expectation over ξ of a functional supported on these coordinates, where ξ(u) is a random variable of a Gaussian process with mean zero and covariance E_ξ[ξ(w)ξ(u)] = E_X[a(x, w)a(x, u)] for the analytic function a(x, w) obtained from log(q(x)/p(x|w)) by the resolution of singularities. Our purpose in this paper is to obtain λ. In recent studies, we determined the learning coefficients for reduced rank regression [11], the three-layered neural network with one input unit and one output unit [12,13], normal mixture models with a dimension of one [14], and the restricted Boltzmann machine [15]. Additionally, Rusakov and Geiger [16,17] and Zwiernik [18], respectively, obtained the learning coefficients for naive Bayesian networks and directed tree models with hidden variables. Drton et al. [19] considered these coefficients for the Gaussian latent tree and forest models.
The papers [20,21] derived bounds on the learning coefficients for Vandermonde matrix-type singularities and explicit values under some conditions. The remainder of this paper is structured as follows: In Section 2, we introduce log canonical thresholds in algebraic geometry. In Section 3, we summarize key theorems for obtaining learning coefficients for learning theory. In Section 4, we present our main results. We consider the log canonical thresholds of Vandermonde matrix-type singularities (Definition 3). We present our conclusions in Section 5.

Log Canonical Threshold
Definition 1. Let f be an analytic function in a neighborhood U of w*, and let ψ be a C^∞ function with compact support satisfying ψ(w*) > 0. The zeta function ζ(z) = ∫_U |f(w)|^z ψ(w) dw is holomorphic for Re(z) > 0 and extends meromorphically to the complex plane with poles on the negative real axis. Define the log canonical threshold λ of f at w* by −λ being the largest pole of ζ(z), and its order θ as the order of that pole. This definition is well posed because the log canonical threshold and its order are independent of ψ.
Applying Hironaka's Theorem 1 to the function f(w), we obtain a proper analytic map µ from a manifold M to the neighborhood U of w* that satisfies conditions (1) and (2) of Hironaka's theorem.
In each local analytic coordinate system on U_M ⊂ M, the pull-back f∘µ and the Jacobian of µ are normal crossing, and therefore the poles of ζ(z) can be obtained. Note that for each w* with f(w*) ≠ 0, there exists a neighborhood U such that f(w) ≠ 0 for all w ∈ U; thus, ∫_U |f|^z ψ(w) dw has no poles. The learning coefficient is the log canonical threshold of the Kullback function (relative entropy) over the real field.
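The log canonical threshold can also be probed numerically: λ and θ govern how the volume of the sublevel set {w : f(w) < t} shrinks, V(t) ≈ c t^λ (−log t)^{θ−1} as t → 0. The following sketch (our illustration, not from the paper) estimates λ by Monte Carlo for f(w) = (w_1 w_2)², whose known values are λ = 1/2 and θ = 2:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.uniform(-1.0, 1.0, size=(2_000_000, 2))
f = (w[:, 0] * w[:, 1]) ** 2        # singular on the coordinate axes

# V(t) = vol{w : f(w) < t} ~ c * t**lam * (-log t)**(theta - 1)
ts = np.logspace(-8, -4, 20)
V = np.array([(f < t).mean() for t in ts])

# slope of log V against log t; the (-log t) factor (theta = 2 here)
# biases the raw slope slightly below the true lambda = 1/2
lam = np.polyfit(np.log(ts), np.log(V), 1)[0]
print(lam)
```

The fitted slope comes out a little below 1/2 precisely because θ > 1; for a regular model one would instead observe the slope d/2.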

Main Theorems
In this section, several theorems are introduced for obtaining log canonical thresholds over the real field. Theorem 2 (the method for determining the deepest singular point), Theorem 3 (the method of adding variables), and Theorem 4 (the rational blowing-up method) are very helpful for obtaining log canonical thresholds. Working over the real field, these theorems reduce the number of changes of coordinates via blow-ups.
We denote constants, such as a*, b*, and w*, by the suffix *. Define the norm of a matrix C = (c_ij) as ||C|| = (∑_{i,j} c_ij²)^{1/2}.
Lemma 1 ([14,22,23]). Let U be a neighborhood of w* ∈ R^d. Consider the ring of analytic functions on U. Let J be the ideal generated by f_1, . . . , f_n, which are analytic functions defined on U.
The following lemma is also used in the proofs.
The lemma gives a corresponding expression when w*_1 = 0. Resolutions of singularities are obtained by constructing the blow-up along a smooth submanifold. In this paper, we use the blow-up method along certain singular varieties, as explained below, for obtaining log canonical thresholds.
Let f and ψ be analytic functions defined on U, and set f_i(u_{1i}, · · · , u_{di}) = f(g_i(u_i)); the theorem relates the log canonical threshold of f to those of the f_i. Proof. The proof of this theorem uses a resolution of singularities along a smooth submanifold.
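The mechanics of reading off a pole after a blow-up can be illustrated symbolically. The sketch below (our example, not one of the paper's theorems) pulls f(w) = (w_1 w_2)² back through one chart of the blow-up at the origin and tracks the Jacobian determinant:

```python
import sympy as sp

w1, w2, u1, u2 = sp.symbols('w1 w2 u1 u2', real=True)
f = (w1 * w2) ** 2

# one chart of the blow-up at the origin: w1 = u1, w2 = u1*u2
chart = {w1: u1, w2: u1 * u2}
pulled = sp.factor(f.subs(chart))        # u1**4 * u2**2: normal crossings
jac = sp.simplify(sp.Matrix([[sp.diff(expr, v) for v in (u1, u2)]
                             for expr in (u1, u1 * u2)]).det())
print(pulled, jac)

# In this chart, |f|^z |jac| = |u1|**(4*z + 1) * |u2|**(2*z); the zeta
# integral has poles where 4z + 2 = 0 and where 2z + 1 = 0, i.e. z = -1/2
# from both factors, so lambda = 1/2 with order theta = 2.
```

Collecting exponents chart by chart in this way is exactly how the poles of ζ(z) are obtained once the pull-back is normal crossing.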

Consider the set
The theorem is simple; however, it is a useful tool for obtaining log canonical thresholds.

Main Results
In this section, we apply the theorems in Section 3 to Vandermonde matrix-type singularities, which are generic and essential in learning theory. Their associated log canonical thresholds provide the learning coefficients of, for example, three-layered neural networks in Section 4.1, normal mixture models in Section 4.2, and mixtures of binomial distributions [24].

Three-Layered Neural Network
Consider the three-layered neural network with N input units, H hidden units, and M output units, which is trained to estimate the true distribution with r hidden units. Denote an input value by z = (z_j) ∈ R^N with a probability density function q(z). Then, an output value y = (y_k) ∈ R^M of the three-layered neural network is given by y_k = f_k(z, w) + (noise), where w = {a_ki, b_ij}_{1≤k≤M, 1≤i≤H, 1≤j≤N},

f_k(z, w) = ∑_{i=1}^H a_ki tanh(∑_{j=1}^N b_ij z_j),

p(y|z, w) = (1/(2π)^{M/2}) exp(−(1/2)||y − f(z, w)||²), and p(z, y|w) = p(y|z, w)q(z). Assume that the true distribution is q(z, y) = (1/(2π)^{M/2}) exp(−(1/2)||y − f(z, w*)||²) q(z) for a true parameter w* with r hidden units; the notation (z, y) for the three-layered neural network corresponds to x in Sections 1 and 2.
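A direct transcription of this model (with the tanh activation and unit-variance Gaussian output noise as written above; the array shapes and function names are our choices) might look as follows:

```python
import numpy as np

def f(z, a, b):
    """Three-layered network: f_k(z) = sum_i a[k, i] * tanh(sum_j b[i, j] * z[j])."""
    return a @ np.tanh(b @ z)

def log_p_y_given_z(y, z, a, b):
    """log p(y|z, w) for Gaussian output noise with identity covariance."""
    M = len(y)
    resid = y - f(z, a, b)
    return -0.5 * resid @ resid - 0.5 * M * np.log(2 * np.pi)

# N = 3 inputs, H = 4 hidden units, M = 2 outputs
rng = np.random.default_rng(0)
a = rng.normal(size=(2, 4))   # a[k, i]
b = rng.normal(size=(4, 3))   # b[i, j]
z = rng.normal(size=3)
y = f(z, a, b)                # noiseless output: the log-density is maximal here
print(log_p_y_given_z(y, z, a, b))
```

At a noiseless output y = f(z, w), the log-density reduces to the normalizing constant −(M/2) log 2π, which is a quick consistency check on the transcription.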

Normal Mixture Models
We consider a normal mixture model [14] with identity matrix variances:

p(x|w) = ∑_{i=1}^H a_i (1/(2π)^{N/2}) exp(−(1/2)||x − b_i||²), ∑_{i=1}^H a_i = 1, a_i ≥ 0,

where the true distribution has the same form with parameter w*_t = {a*_i, b*_ij}_{H+1≤i≤H+r} and ∑_{i=H+1}^{H+r} a*_i = −1, a*_i < 0. (In order to simplify the following, we use the values a*_i < 0, not a*_i > 0.)
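The sign convention a*_i < 0 lets the model density minus the true density be written as a single signed mixture over H + r components. A small numerical check of this bookkeeping (our illustration; the weights and means are arbitrary):

```python
import numpy as np

def gauss(x, b):
    # N-dimensional normal density with mean b and identity covariance
    d = x - b
    return np.exp(-0.5 * d @ d) / (2 * np.pi) ** (len(x) / 2)

rng = np.random.default_rng(0)
N, H, r = 2, 3, 2
b = rng.normal(size=(H + r, N))         # means; the last r rows belong to the truth
a_model = np.array([0.5, 0.3, 0.2])     # model weights, summing to 1
a_true = np.array([-0.6, -0.4])         # true weights with negative signs, summing to -1
a = np.concatenate([a_model, a_true])   # combined coefficients, summing to 0

x = rng.normal(size=N)
p = sum(a_model[i] * gauss(x, b[i]) for i in range(H))          # model density
q = sum(-a_true[j] * gauss(x, b[H + j]) for j in range(r))      # true density
combined = sum(a[i] * gauss(x, b[i]) for i in range(H + r))
print(np.isclose(p - q, combined))
```

The Kullback function K(w) is controlled by exactly this signed difference p(x|w) − q(x), which is why the combined coefficient vector (with the negative true weights) is the natural object of study.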

Vandermonde Matrix-Type Singularities
For simplicity, we use the notation w = {a_ki, b_ij}_{1≤i≤H} instead of w = {a_ki, b_ij}_{1≤k≤M, 1≤i≤H, 1≤j≤N}, because we always have 1 ≤ k ≤ M and 1 ≤ j ≤ N in this section (t denotes the transpose). Here, a_ki and b_ij (1 ≤ k ≤ M, 1 ≤ i ≤ H, 1 ≤ j ≤ N) are the variables in a neighborhood of a*_ki and b*_ij, where a*_ki and b*_ij are fixed constants. Let J be the ideal generated by the elements of AB. We call the singularities of J Vandermonde matrix-type singularities. To simplify, we usually assume that:

(a*_{1,H+j}, a*_{2,H+j}, · · · , a*_{M,H+j})^t ≠ 0, (b*_{H+j,1}, b*_{H+j,2}, · · · , b*_{H+j,N}) ≠ 0,

for 1 ≤ j ≤ r, and that the vectors (b*_{H+j,1}, · · · , b*_{H+j,N}) are pairwise distinct for j ≠ j′.
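For concreteness, the generators of the ideal J can be produced symbolically. The sketch below (our illustration) builds generic matrices A and B for M = 2, H = 2, r = 1, N = 2 and lists the entries of AB; note that it ignores the power structure that the full definition of Vandermonde matrix-type singularities imposes on the entries of B:

```python
import sympy as sp

M, H, r, N = 2, 2, 1, 2
A = sp.Matrix(M, H + r, lambda k, i: sp.Symbol(f'a{k+1}{i+1}'))
B = sp.Matrix(H + r, N, lambda i, j: sp.Symbol(f'b{i+1}{j+1}'))

# The M*N entries of AB generate the ideal J of Definition 3
gens = list(A * B)
print(gens[0])   # the (1,1) generator: a11*b11 + a12*b21 + a13*b31
```

The log canonical threshold of ||AB||² at w* is then governed by how these bilinear generators vanish simultaneously near the singular point.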
If a*_13 = −1, these matrices A, B correspond to a normal mixture model with identity matrix variances and mixing weights a_1i ≥ 0, whose true distribution is a single normal distribution with weight −a*_13 = 1.
In this paper, we denote by A the matrix (a_ki) with rows (a_11, a_12, · · · , a_1H), (a_21, a_22, · · · , a_2H), and so on ([14]). Consider a sufficiently small neighborhood U of w*. Then, r is uniquely determined by the assumption in Definition 3.

Consider the log canonical threshold of the function ||A B^{(Q)}_{H,N}||² at w*. We have the following theorem, whose proof appears in Appendix A.

Conclusions
In this paper, we proposed a new method of "rational blowing-up" (Theorem 4), and we applied the method to Vandermonde matrix-type singularities and demonstrated its effectiveness. Theorem 6 determines the explicit values of log canonical thresholds for H = 1, 2, 3, 4. Our future research aim is to improve our methods and obtain explicit values for the general model.
These theoretical values introduce a mathematical measure of preciseness for numerical calculations in the information criteria of Section 1. Furthermore, our theoretical results will be helpful in numerical experiments such as Markov chain Monte Carlo (MCMC) methods. In [25,26], a mathematical foundation for analyzing and improving the precision of the MCMC method was constructed by using the theoretical values of marginal likelihoods.
We will also consider these applications in the future.
For simplicity, we set b_ij = b_ij^{[2]} and a_k1 = a_k1^{[2]} again. By constructing the blow-up along {b_ij = 0, 2 ≤ j ≤ N, 2 ≤ i ≤ H} and by choosing one branch of the blow-up process, set b_ij = v_21 b_ij^{[2]} for i, j ≥ 2, with b_2j^{[2]} = 0 (i, j ≥ 2), and a_k2 = ∑_{i=2}^H a_ki b_ij^{[2]}.