Abstract
In recent years, selecting appropriate learning models has become more important with the increased need to analyze learning systems, and many model selection methods have been developed. The learning coefficient in Bayesian estimation, which serves to measure learning efficiency in singular learning models, plays an important role in several information criteria. The learning coefficient in regular models is known to be half the dimension of the parameter space, while that in singular models is smaller and varies across learning models. The learning coefficient is known mathematically as the log canonical threshold. In this paper, we provide a new rational blowing-up method for obtaining these coefficients. By applying the method to Vandermonde matrix-type singularities, we show its efficiency.
1. Introduction
In recent studies, real data from fields such as image and speech recognition, psychology, and economics have been analyzed by learning systems. Many learning models have therefore been proposed, and the need for appropriate model selection methods has increased.
In this section, we first introduce the widely-applicable information criterion (WAIC) [1,2,3,4,5,6,7] and cross-validation in Bayesian model selection.
Let $q(x)$ be a true probability density function of variables $x$, and let $X^n = (X_1, \ldots, X_n)$ be $n$ training samples selected from $q(x)$ independently and identically. Consider a learning model that is written in probabilistic form as $p(x \mid w)$, where $w \in W \subset \mathbb{R}^d$ is a parameter.
Suppose that the purpose of the learning system is to estimate the unknown true density function $q(x)$ from $X^n$ using $p(x \mid w)$ in Bayesian estimation. Let $\varphi(w)$ be an a priori probability density function on the parameter set $W$ and $p(w \mid X^n)$ be the a posteriori probability density function:
$$p(w \mid X^n) = \frac{1}{Z_n}\, \varphi(w) \prod_{i=1}^{n} p(X_i \mid w)^{\beta},$$
where:
$$Z_n = \int_W \varphi(w) \prod_{i=1}^{n} p(X_i \mid w)^{\beta}\, dw,$$
for inverse temperature $\beta > 0$. We typically set $\beta = 1$. Define:
$$E_w[f(w)] = \int f(w)\, p(w \mid X^n)\, dw,$$
and:
$$p(x \mid X^n) = E_w[p(x \mid w)].$$
We then have the predictive density function $p(x \mid X^n)$, which is the average inference of the Bayesian density function.
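For concreteness, the following is a minimal numerical sketch (not from the paper) of these definitions for a hypothetical one-parameter model, approximating the a posteriori density and the predictive density on a parameter grid; the function names and the grid approximation are our own illustration.

```python
import numpy as np

def posterior_on_grid(x_data, w_grid, log_p, log_prior, beta=1.0):
    """Discrete approximation of the a posteriori density p(w | X^n).

    x_data   : iterable of training samples X_1, ..., X_n
    w_grid   : 1-D array of parameter values (hypothetical one-parameter model)
    log_p    : function (x, w) -> log p(x | w)
    log_prior: function w -> log of the a priori density
    beta     : inverse temperature (typically 1)
    """
    log_lik = np.array([sum(log_p(x, w) for x in x_data) for w in w_grid])
    log_post = beta * log_lik + np.array([log_prior(w) for w in w_grid])
    log_post -= log_post.max()            # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

def predictive_density(x, w_grid, post, p):
    """Predictive density p(x | X^n) = E_w[p(x | w)] on the grid."""
    return float(np.dot(post, [p(x, w) for w in w_grid]))
```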
We next introduce the Kullback function $K(q \| p)$ and the empirical Kullback function $K_n(q \| p)$ for density functions $q(x)$ and $p(x)$:
$$K(q \| p) = \int q(x) \log \frac{q(x)}{p(x)}\, dx, \qquad K_n(q \| p) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{q(X_i)}{p(X_i)}.$$
The function $K(q \| p)$, which always has a non-negative value and satisfies $K(q \| p) = 0$ if and only if $q = p$, is a pseudo-distance between the density functions $q$ and $p$. We define the Bayes training loss $B_t$ and the Bayes generalization loss $B_g$ as follows:
$$B_t = -\frac{1}{n} \sum_{i=1}^{n} \log p(X_i \mid X^n),$$
and:
$$B_g = -\int q(x) \log p(x \mid X^n)\, dx.$$
Additionally, we define the Bayesian generalization error $G_n$ and the Bayesian training error $T_n$ as follows:
$$G_n = K\bigl(q(x)\,\|\,p(x \mid X^n)\bigr),$$
and:
$$T_n = K_n\bigl(q(x)\,\|\,p(x \mid X^n)\bigr).$$
Then, we have:
$$B_g = S + G_n, \qquad B_t = S_n + T_n,$$
for the average entropy $S = -\int q(x) \log q(x)\, dx$ and the empirical entropy $S_n = -\frac{1}{n} \sum_{i=1}^{n} \log q(X_i)$ of the true density function. The value $G_n$ describes how precisely the predictive function approximates the true density function.
We define the functional variance
$$V = \sum_{i=1}^{n} \Bigl\{ E_w\bigl[(\log p(X_i \mid w))^2\bigr] - E_w\bigl[\log p(X_i \mid w)\bigr]^2 \Bigr\}.$$
The WAIC is denoted by:
$$\mathrm{WAIC} = B_t + \frac{\beta V}{n},$$
and the cross-validation loss is denoted by:
$$C_v = \frac{1}{n} \sum_{i=1}^{n} \log E_w\bigl[p(X_i \mid w)^{-1}\bigr],$$
for $\beta = 1$.
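As a computational sketch (not from the paper), the quantities above can be estimated from posterior draws produced by any sampler; here `log_lik` is a hypothetical array with entries $\log p(X_i \mid w_s)$ for posterior draws $w_1, \ldots, w_S$, and $\beta = 1$ is assumed.

```python
import numpy as np
from scipy.special import logsumexp

def waic_and_cv(log_lik):
    """Bayes training loss, WAIC, and importance-sampling leave-one-out
    cross-validation loss from posterior log-likelihoods (beta = 1 assumed).

    log_lik : array of shape (S, n), log_lik[s, i] = log p(X_i | w_s).
    """
    S, n = log_lik.shape
    # log E_w[p(X_i | w)], approximated by the posterior sample average
    log_pred = logsumexp(log_lik, axis=0) - np.log(S)
    B_t = -np.mean(log_pred)                      # Bayes training loss
    V = np.sum(np.var(log_lik, axis=0))           # functional variance
    waic = B_t + V / n
    # C_v = (1/n) * sum_i log E_w[1 / p(X_i | w)]  (importance-sampling LOO)
    C_v = np.mean(logsumexp(-log_lik, axis=0) - np.log(S))
    return B_t, waic, C_v
```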
Watanabe [2,3,6,7] proved the following four relations:
$$E[B_g] = S + \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right), \qquad E[B_t] = S + \frac{\lambda - 2\nu}{n} + o\!\left(\frac{1}{n}\right),$$
$$E[\mathrm{WAIC}] = E[B_g] + o\!\left(\frac{1}{n}\right), \qquad E[C_v] = E[B_g] + o\!\left(\frac{1}{n}\right),$$
for the learning coefficient $\lambda$ and the singular fluctuation $\nu$, where $\lambda$ is the log canonical threshold of the Kullback function (Definition 1) and $\nu = \lim_{n \to \infty} E[V]/2$ for $\beta = 1$. For singular models, the set of true parameters $\{w \in W : p(x \mid w) = q(x)\}$ is typically not a singleton set. Nevertheless, the WAIC and the cross-validation can estimate the Bayesian generalization error without any knowledge of the true probability density function.
These values are calculated from the training samples using only the learning model $p$. In real applications or experiments, we typically do not know the true distribution, but only the values of the training losses. This is why these methods are effective: we can select a suitable model from several statistical models by comparing these values.
In this paper, we consider the learning coefficient $\lambda$, which is equal to the log canonical threshold introduced in Definition 1. This coefficient is not needed to evaluate the WAIC or the cross-validation in practice; however, the learning coefficients from our recent results have been used very effectively by Drton and Plummer [8] for model selection with the "singular Bayesian information criterion (sBIC)". The sBIC method works even with only bounds on the learning coefficients.
It is known that $\lambda = d/2$ holds for regular models, where $d$ is the dimension of the parameter space. The value of $\lambda$ is obtained by a blowing-up process, which is the main tool in the desingularization of an algebraic variety. The following theorem is an analytic version of Hironaka's theorem [9] used by Atiyah [10].
Theorem 1
(Desingularization [9]). Let f be an analytic function in a neighborhood of with . There exists an open set , an analytic manifold M, and a proper analytic map μ from M to U such that: (1) is an isomorphism, where , and: (2) for each , there is a local analytic coordinate system such that , where are non-negative integers.
The theorem establishes the existence of the desingularization map; however, it is generally still difficult to obtain such maps for Kullback functions because the singularities of these functions are very complicated. From the learning coefficient $\lambda$ and its order $\theta$, the asymptotic behavior of the generalization error is obtained theoretically as follows: let $\xi_n$ be an empirical process defined on the manifold obtained by a resolution of singularities, and let the sum be taken over the local coordinates that attain the minimum $\lambda$ and the maximum $\theta$. We then obtain an asymptotic expansion of the generalization error in terms of $\lambda$, $\theta$, and $\xi_n$,
where $\xi_n$ converges in law to a random variable of a Gaussian process with mean zero whose covariance is given by the analytic function obtained by the resolution of singularities using Theorem 1.
Our purpose in this paper is to obtain the learning coefficient $\lambda$. In recent studies, we determined the learning coefficients for reduced rank regression [11], the three-layered neural network with one input unit and one output unit [12,13], normal mixture models with a dimension of one [14], and the restricted Boltzmann machine [15]. Additionally, Rusakov and Geiger [16,17] and Zwiernik [18], respectively, obtained the learning coefficients for naive Bayesian networks and for directed tree models with hidden variables. Drton et al. [19] considered these coefficients for Gaussian latent tree and forest models.
The papers [20,21] derived bounds on the learning coefficients for Vandermonde matrix-type singularities and explicit values under some conditions.
The remainder of this paper is structured as follows: In Section 2, we introduce log canonical thresholds in algebraic geometry. In Section 3, we summarize key theorems for obtaining learning coefficients for learning theory. In Section 4, we present our main results. We consider the log canonical thresholds of Vandermonde matrix-type singularities (Definition 3). We present our conclusions in Section 5.
2. Log Canonical Threshold
Definition 1.
Let $f$ be an analytic function in a neighborhood $U$ of $w^* \in \mathbb{R}^d$ (when $f$ is vector- or matrix-valued, $|f(w)|$ below denotes its norm). Let $\psi$ be a $C^\infty$ function with compact support. Define the log canonical threshold $\lambda_{w^*}(f, \psi)$ by the condition that $-\lambda_{w^*}(f, \psi)$ is the largest pole of
$$\int_U |f(w)|^{2z}\, \psi(w)\, dw,$$
where the variables are considered over the complex field or over the real field. Additionally, define $\theta_{w^*}(f, \psi)$ as its order. If $\psi(w^*) > 0$, then we define $\lambda_{w^*}(f) = \lambda_{w^*}(f, \psi)$ and $\theta_{w^*}(f) = \theta_{w^*}(f, \psi)$, because the log canonical threshold and its order are then independent of $\psi$.
Applying Hironaka's Theorem 1 to the function $f$, we have a proper analytic map $\mu$ from a manifold $M$ to a neighborhood $U$ of $w^*$ that satisfies conditions (1) and (2) of Theorem 1. Then, the integration $\int_U |f(w)|^{2z} \psi(w)\, dw$ is equal to $\int_M |f(\mu(u))|^{2z} \psi(\mu(u))\, |\mu'(u)|\, du$, which is the sum of integrals in normal crossing form over the local analytic coordinate systems on $M$. Therefore, the poles can be obtained. Note that for each $w$ with $f(w) \neq 0$, there exists a neighborhood $U'$ of $w$ such that $f(w') \neq 0$ for all $w' \in U'$; thus, the corresponding integral has no poles. The learning coefficient is the log canonical threshold of the Kullback function (relative entropy) over the real field.
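As an illustration of how the poles arise (standard computations, not taken from the paper; the truncation to $[0,\varepsilon]^d$ and the constant $C_d$ below are our own notation), consider first the normal crossing form produced by Theorem 1, with $\psi \equiv 1$ and the Jacobian factor omitted:
$$f(u) = u_1^{k_1} \cdots u_d^{k_d}, \qquad \int_{[0,\varepsilon]^d} |f(u)|^{2z}\, du = \prod_{j=1}^{d} \frac{\varepsilon^{2 k_j z + 1}}{2 k_j z + 1},$$
so the largest pole is $z = -\min_j 1/(2 k_j)$, giving $\lambda = \min_j 1/(2 k_j)$ with $\theta$ equal to the number of indices attaining this minimum. Similarly, for a regular model whose Kullback function is comparable to a nondegenerate quadratic form,
$$K(w) = w_1^2 + \cdots + w_d^2, \qquad \int_{\|w\| < \varepsilon} K(w)^{z}\, dw = C_d \int_0^{\varepsilon} r^{2z + d - 1}\, dr = C_d\, \frac{\varepsilon^{2z + d}}{2z + d},$$
where $C_d$ is the surface area of the unit sphere in $\mathbb{R}^d$; the largest pole is $z = -d/2$, recovering $\lambda = d/2$ and $\theta = 1$.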
3. Main Theorems
In this section, several theorems are introduced for obtaining log canonical thresholds over the real field. Theorem 2 (the method for determining the deepest singular point), Theorem 3 (the method to add variables), and Theorem 4 (the rational blowing-up method) are very helpful for obtaining log canonical thresholds. Working over the real field, these theorems are useful for reducing the number of changes of coordinates required by blow-ups.
We denote constants, such as $a^*$, $b^*$, and $w^*$, by the suffix $*$. Define the norm of a matrix $C = (c_{ij})$ as $\|C\| = \sqrt{\sum_{i,j} c_{ij}^2}$. Set .
Lemma 1
([14,22,23]). Let $U$ be a neighborhood of $w^* \in \mathbb{R}^d$. Consider the ring of analytic functions on $U$. Let $\mathcal{I}$ be the ideal generated by $f_1, \ldots, f_n$, which are analytic functions defined on $U$. If $g_1, \ldots, g_m \in \mathcal{I}$, then
$$\lambda_{w^*}(g_1, \ldots, g_m) \leq \lambda_{w^*}(f_1, \ldots, f_n).$$
If, in addition, $f_1, \ldots, f_n \in \langle g_1, \ldots, g_m \rangle$, then the reverse inequality also holds. In particular, if $g_1, \ldots, g_m$ generate the ideal $\mathcal{I}$, then
$$\lambda_{w^*}(g_1, \ldots, g_m) = \lambda_{w^*}(f_1, \ldots, f_n), \qquad \theta_{w^*}(g_1, \ldots, g_m) = \theta_{w^*}(f_1, \ldots, f_n).$$
Here $\lambda_{w^*}(f_1, \ldots, f_n)$ denotes the log canonical threshold of the vector-valued function $(f_1, \ldots, f_n)$ in the sense of Definition 1.
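As a small illustration of the lemma (a standard example, not from the paper): at the origin, $(w_1, w_2)$ and $(w_1, w_2, w_1 + w_2)$ generate the same ideal, and indeed
$$w_1^2 + w_2^2 \;\leq\; w_1^2 + w_2^2 + (w_1 + w_2)^2 \;\leq\; 3\,(w_1^2 + w_2^2),$$
so the two tuples have the same log canonical threshold and order. A direct computation in polar coordinates gives
$$\int_{\|w\| < \varepsilon} (w_1^2 + w_2^2)^{z}\, dw = 2\pi \int_0^{\varepsilon} r^{2z + 1}\, dr = \frac{2\pi\, \varepsilon^{2z + 2}}{2z + 2},$$
so $\lambda_0 = 1$ and $\theta_0 = 1$ for both tuples.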
The following lemma is also used in the proofs.
Lemma 2
([15]). Let $\mathcal{I}_1$ and $\mathcal{I}_2$ be the ideals generated by $f_1(w), \ldots, f_n(w)$ and $g_1(w'), \ldots, g_m(w')$, respectively. If $w$ and $w'$ are different variables, then:
$$\lambda_{(w^*, w'^*)}(f_1, \ldots, f_n, g_1, \ldots, g_m) = \lambda_{w^*}(f_1, \ldots, f_n) + \lambda_{w'^*}(g_1, \ldots, g_m).$$
Theorem 2
(Method for determining the deepest singular point [21]). Let , …, be homogeneous functions of . Furthermore, let ψ be a function such that and is a homogeneous function of in a small neighborhood of . Then, we have:
Theorem 3
(Method to add variables [21]). Let , …, be homogeneous functions of . Set , …, . If , then we have:
Resolutions of singularities are obtained by constructing blow-ups along smooth submanifolds. In this paper, we use a blow-up method along certain singular varieties, as explained below, to obtain log canonical thresholds.
Theorem 4 (Rational blow-up process).
Let and . Consider the set .
We set for , , .
Let ψ be analytic functions defined on U and . Then, we have:
Proof.
The proof of this theorem uses a resolution of singularities along a smooth submanifold.
Set , and construct the blow-up along the submanifold . Then, we have for , , , and .
Consider the set for .
Set , then we have , , and .
The Jacobian is . □
The theorem is simple; however, it is a useful tool for obtaining log canonical thresholds.
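As a minimal illustration in the spirit of Theorem 4 (a standard example with our own notation, not one taken from the paper), consider the tuple $(w_1, w_2^2)$, that is, the zeta function
$$\zeta(z) = \int_{|w_1|, |w_2| < \varepsilon} (w_1^2 + w_2^4)^{z}\, dw_1\, dw_2 .$$
On the region $|w_1| \leq w_2^2$, the weighted substitution $w_1 = u_1 u_2^2$, $w_2 = u_2$ with $|u_1| \leq 1$ has Jacobian $u_2^2$ and gives
$$\int_{|u_1| \leq 1} \int_{|u_2| < \varepsilon} (u_1^2 + 1)^{z}\, |u_2|^{4z + 2}\, du_2\, du_1,$$
whose largest pole is at $4z + 3 = 0$. On the remaining region $|w_1| \geq w_2^2$, we have $w_1^2 + w_2^4 \asymp w_1^2$ and $|w_2| \leq |w_1|^{1/2}$, which again yields the largest pole $z = -3/4$. Hence $\lambda_0 = 3/4$ and $\theta_0 = 1$, obtained with a single weighted ("rational") substitution instead of two ordinary blow-ups.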
4. Main Results
In this section, we apply the theorems in Section 3 to Vandermonde matrix-type singularities, which are generic and essential in learning theory. Their associated log canonical thresholds provide the learning coefficients of, for example, three-layered neural networks in Section 4.1, normal mixture models in Section 4.2, and mixtures of binomial distributions [24].
4.1. Three-Layered Neural Network
Consider the three-layered neural network with N input units, H hidden units, and M output units, which is trained for estimating the true distribution with r hidden units. Denote an input value by with a probability density function . Then, an output value of the three layered neural network is given by where and:
Consider a statistical model:
and . Assume that the true distribution:
is included in the learning model, where and
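As a concrete sketch of the model in this subsection (our own illustration; we assume the standard parameterization $f_k(x, w) = \sum_{h} a_{kh} \tanh(b_h \cdot x)$ with unit-variance Gaussian output noise, as used for three-layered networks in [12,13,20]):

```python
import numpy as np

def network_output(x, a, b):
    """Three-layered network with N inputs, H hidden tanh units, M outputs.

    x : (N,) input vector
    a : (M, H) hidden-to-output weights
    b : (H, N) input-to-hidden weights
    Returns f(x, w) = a @ tanh(b @ x), an (M,) output vector.
    """
    return a @ np.tanh(b @ x)

def log_model_density(y, x, a, b):
    """log p(y | x, w) for a unit-variance Gaussian output model (assumed)."""
    r = y - network_output(x, a, b)
    return -0.5 * float(r @ r) - 0.5 * r.shape[0] * np.log(2.0 * np.pi)
```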
4.2. Normal Mixture Models
We consider a normal mixture model [14] with identity matrix variances:
where and , .
Set the true distribution by:
where and , . (In order to simplify the following, we use the values , not .)
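The following is a minimal sketch of the mixture density in this subsection (our own illustration; the normalization assumes identity covariance matrices as stated, and the variable names are ours):

```python
import numpy as np

def normal_mixture_density(x, weights, means):
    """Normal mixture with identity covariance matrices.

    x       : (N,) evaluation point
    weights : (H,) nonnegative mixing proportions summing to one
    means   : (H, N) component mean vectors
    Returns sum_h weights[h] * N(x | means[h], I_N).
    """
    n_dim = x.shape[0]
    sq_dist = np.sum((means - x) ** 2, axis=1)          # (H,) squared distances
    comp = np.exp(-0.5 * sq_dist) / (2.0 * np.pi) ** (n_dim / 2.0)
    return float(weights @ comp)
```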
4.3. Vandermonde Matrix-Type Singularities
Definition 2.
Fix . Define: if , , and
For simplicity, we use the notation instead of because we always have and in this section.
Definition 3.
Fix and .
Let ,
for , and:
(t denotes the transpose).
and are the variables in a neighborhood of and , where and are fixed constants.
Let be the ideal generated by the elements of .
We call singularities of Vandermonde matrix-type singularities.
To simplify, we usually assume that:
for and:
for .
Example 1.
If , , , then we have , . These matrices correspond to the three-layered neural network:
and the true distribution:
Example 2.
If , , then we have ,
If , these matrices correspond to a normal mixture model with identity matrix variances:
, , and the true distribution is:
In this paper, we denote:
.
Furthermore, we denote: and:
Theorem 5
([14]). Consider sufficiently small neighborhood U of:
and variables in set U. Set . Let each , …, be a different real vector in:
that is,
Then, is uniquely determined, and by the assumption in Definition 3. Set for . Assume that
and . We then have:
where
and for .
Theorem 6.
We use the same notation as in Theorem 5. Set .
We have the following:
- 1.
- . .
- 2.
- . , , .
- 3.
- ., ,, ,
- 4.
- ., ,, ,,,
Its proof appears in Appendix A.
In paper [22], we had exact values for :
where: and we had:
5. Conclusions
In this paper, we proposed a new "rational blowing-up" method (Theorem 4), applied it to Vandermonde matrix-type singularities, and demonstrated its effectiveness. Theorem 6 determines the explicit values of the log canonical thresholds for the cases considered there. Our future research aim is to improve our methods and obtain explicit values for the general model.
These theoretical values provide a mathematical measure of the precision of numerical calculations of the information criteria in Section 1. Furthermore, our theoretical results will be helpful in numerical experiments such as Markov chain Monte Carlo (MCMC) methods. In [25,26], a mathematical foundation for analyzing and improving the precision of the MCMC method was constructed using the theoretical values of marginal likelihoods.
We will also consider these applications in the future.
Funding
This research was funded by the Ministry of Education, Culture, Sports, Science, and Technology in Japan, Grant-in-Aid for Scientific Research 18K11479.
Acknowledgments
We thank Maxine Garcia, PhD, from Edanz Group for editing a draft of this manuscript.
Conflicts of Interest
The author declares no conflict of interest.
Appendix A
We demonstrate Theorem 6 by blowing-up processes.
Let and .
By constructing the blow-up along and by choosing one branch of the blow-up process, we assume that for , and . We also set (, ), , () and .
Then, we have:
for .
For simplicity, we set and again.
By constructing the blow-up along and by choosing one branch of the blow-up process, set for and .
- If , then set (, ), , (), and .For simplicity, we set and again.
- If , then, by constructing the blow-up along and by choosing one branch of the blow-up process, we set for , . Assume that .
- (a)
- By constructing the rational blow-up along , we have the following (i) and (ii).
- Set .for , , , for , and .
- Set .Set for , . and .
By constructing blow-ups, we have the theorem.
References
1. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723.
2. Watanabe, S. Algebraic analysis for nonidentifiable learning machines. Neural Comput. 2001, 13, 899–933.
3. Watanabe, S. Algebraic geometrical methods for hierarchical learning machines. Neural Netw. 2001, 14, 1049–1060.
4. Watanabe, S. Algebraic geometry of learning machines with singularities and their prior distributions. J. Jpn. Soc. Artif. Intell. 2001, 16, 308–315.
5. Watanabe, S. Algebraic Geometry and Statistical Learning Theory; Cambridge University Press: New York, NY, USA, 2009; Volume 25.
6. Watanabe, S. Equations of states in singular statistical estimation. Neural Netw. 2010, 23, 20–34.
7. Watanabe, S. Mathematical Theory of Bayesian Statistics; CRC Press: New York, NY, USA, 2018.
8. Drton, M.; Plummer, M. A Bayesian information criterion for singular models. J. R. Statist. Soc. B 2017, 79, 1–38.
9. Hironaka, H. Resolution of singularities of an algebraic variety over a field of characteristic zero. Ann. Math. 1964, 79, 109–326.
10. Atiyah, M.F. Resolution of singularities and division of distributions. Commun. Pure Appl. Math. 1970, 23, 145–150.
11. Aoyagi, M.; Watanabe, S. Stochastic complexities of reduced rank regression in Bayesian estimation. Neural Netw. 2005, 18, 924–933.
12. Aoyagi, M.; Watanabe, S. Resolution of singularities and the generalization error with Bayesian estimation for layered neural network. IEICE Trans. J88-D-II 2005, 10, 2112–2124.
13. Aoyagi, M. The zeta function of learning theory and generalization error of three layered neural perceptron. RIMS Kokyuroku Recent Top. Real Complex Singul. 2006, 1501, 153–167.
14. Aoyagi, M. A Bayesian learning coefficient of generalization error and Vandermonde matrix-type singularities. Commun. Stat. Theory Methods 2010, 39, 2667–2687.
15. Aoyagi, M. Learning coefficient in Bayesian estimation of restricted Boltzmann machine. J. Algebr. Stat. 2013, 4, 30–57.
16. Rusakov, D.; Geiger, D. Asymptotic model selection for naive Bayesian networks. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, Alberta, Canada, 1–4 August 2002; pp. 438–445.
17. Rusakov, D.; Geiger, D. Asymptotic model selection for naive Bayesian networks. J. Mach. Learn. Res. 2005, 6, 1–35.
18. Zwiernik, P. An asymptotic behavior of the marginal likelihood for general Markov models. J. Mach. Learn. Res. 2011, 12, 3283–3310.
19. Drton, M.; Lin, S.; Weihs, L.; Zwiernik, P. Marginal likelihood and model selection for Gaussian latent tree and forest models. Bernoulli 2017, 23, 1202–1232.
20. Aoyagi, M.; Nagata, K. Learning coefficient of generalization error in Bayesian estimation and Vandermonde matrix type singularity. Neural Comput. 2012, 24, 1569–1610.
21. Aoyagi, M. Consideration on singularities in learning theory and the learning coefficient. Entropy 2013, 15, 3714–3733.
22. Aoyagi, M. Log canonical threshold of Vandermonde matrix type singularities and generalization error of a three layered neural network. Int. J. Pure Appl. Math. 2009, 52, 177–204.
23. Lin, S. Asymptotic approximation of marginal likelihood integrals. arXiv 2010, arXiv:1003.5338v2.
24. Yamazaki, K.; Aoyagi, M.; Watanabe, S. Asymptotic analysis of Bayesian generalization error with Newton diagram. Neural Netw. 2010, 23, 35–43.
25. Nagata, K.; Watanabe, S. Exchange Monte Carlo sampling from Bayesian posterior for singular learning machines. IEEE Trans. Neural Netw. 2008, 19, 1253–1266.
26. Nagata, K.; Watanabe, S. Asymptotic behavior of exchange ratio in exchange Monte Carlo method. Int. J. Neural Netw. 2008, 21, 980–988.