Abstract
In recent years, selecting appropriate learning models has become more important with the increased need to analyze learning systems, and many model selection methods have been developed. The learning coefficient in Bayesian estimation, which serves to measure learning efficiency in singular learning models, plays an important role in several information criteria. The learning coefficient in regular models is known to be half the dimension of the parameter space, while that in singular models is smaller and varies across learning models. The learning coefficient is known mathematically as the log canonical threshold. In this paper, we provide a new rational blowing-up method for obtaining these coefficients. By applying the method to Vandermonde matrix-type singularities, we show its efficiency.
1. Introduction
In recent studies, real data from fields such as image and speech recognition, psychology, and economics have been analyzed by learning systems. Many learning models have therefore been proposed, and the need for appropriate model selection methods has increased.
In this section, we first introduce the widely-applicable information criterion (WAIC) [1,2,3,4,5,6,7] and cross-validation in Bayesian model selection.
Let $q(x)$ be a true probability density function of variables $x$, and let $X^n = (X_1, \ldots, X_n)$ be $n$ training samples selected from $q(x)$ independently and identically. Consider a learning model that is written in probabilistic form as $p(x \mid w)$, where $w \in W \subset \mathbb{R}^d$ is a parameter.
Suppose that the purpose of the learning system is to estimate the unknown true density function $q(x)$ from $X^n$ using $p(x \mid w)$ in Bayesian estimation. Let $\varphi(w)$ be an a priori probability density function on the parameter set $W$ and $p(w \mid X^n)$ be the a posteriori probability density function:
$$p(w \mid X^n) = \frac{1}{Z_n}\, \varphi(w) \prod_{i=1}^{n} p(X_i \mid w)^{\beta},$$
where:
$$Z_n = \int_W \varphi(w) \prod_{i=1}^{n} p(X_i \mid w)^{\beta}\, dw,$$
for inverse temperature $\beta > 0$. We typically set $\beta = 1$. Define:
$$E_w[f(w)] = \int f(w)\, p(w \mid X^n)\, dw,$$
and:
$$p(x \mid X^n) = E_w[p(x \mid w)].$$
We then have the predictive density function $p(x \mid X^n)$, which is the average inference of the Bayesian density function.
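For concreteness, the following is a minimal numerical sketch (not from the paper) of these definitions for a hypothetical one-parameter model, approximating the a posteriori density and the predictive density on a parameter grid; the function names and the grid approximation are our own illustration.

```python
import numpy as np

def posterior_on_grid(x_data, w_grid, log_p, log_prior, beta=1.0):
    """Discrete approximation of the a posteriori density p(w | X^n).

    x_data   : iterable of training samples X_1, ..., X_n
    w_grid   : 1-D array of parameter values (hypothetical one-parameter model)
    log_p    : function (x, w) -> log p(x | w)
    log_prior: function w -> log of the a priori density
    beta     : inverse temperature (typically 1)
    """
    log_lik = np.array([sum(log_p(x, w) for x in x_data) for w in w_grid])
    log_post = beta * log_lik + np.array([log_prior(w) for w in w_grid])
    log_post -= log_post.max()            # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

def predictive_density(x, w_grid, post, p):
    """Predictive density p(x | X^n) = E_w[p(x | w)] on the grid."""
    return float(np.dot(post, [p(x, w) for w in w_grid]))
```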
We next introduce the Kullback function $K(q \| p)$ and the empirical Kullback function $K_n(q \| p)$ for density functions $q(x)$ and $p(x)$:
$$K(q \| p) = \int q(x) \log \frac{q(x)}{p(x)}\, dx, \qquad K_n(q \| p) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{q(X_i)}{p(X_i)}.$$
The function $K(q \| p)$, which always has a non-negative value and satisfies $K(q \| p) = 0$ if and only if $q = p$, is a pseudo-distance between the density functions $q$ and $p$. We define the Bayes training loss $B_t$ and the Bayes generalization loss $B_g$ as follows:
$$B_t = -\frac{1}{n} \sum_{i=1}^{n} \log p(X_i \mid X^n),$$
and:
$$B_g = -\int q(x) \log p(x \mid X^n)\, dx.$$
Additionally, we define the Bayesian generalization error $G_n$ and the Bayesian training error $T_n$ as follows:
$$G_n = K\bigl(q(x)\,\|\,p(x \mid X^n)\bigr),$$
and:
$$T_n = K_n\bigl(q(x)\,\|\,p(x \mid X^n)\bigr).$$
Then, we have:
$$B_g = S + G_n, \qquad B_t = S_n + T_n,$$
for the average entropy $S = -\int q(x) \log q(x)\, dx$ and the empirical entropy $S_n = -\frac{1}{n} \sum_{i=1}^{n} \log q(X_i)$ of the true density function. The value $G_n$ describes how precisely the predictive function approximates the true density function.
We define the functional variance
$$V = \sum_{i=1}^{n} \Bigl\{ E_w\bigl[(\log p(X_i \mid w))^2\bigr] - E_w\bigl[\log p(X_i \mid w)\bigr]^2 \Bigr\}.$$
The WAIC is denoted by:
$$\mathrm{WAIC} = B_t + \frac{\beta V}{n},$$
and the cross-validation loss is denoted by:
$$C_v = \frac{1}{n} \sum_{i=1}^{n} \log E_w\bigl[p(X_i \mid w)^{-1}\bigr],$$
for $\beta = 1$.
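As a computational sketch (not from the paper), the quantities above can be estimated from posterior draws produced by any sampler; here `log_lik` is a hypothetical array with entries $\log p(X_i \mid w_s)$ for posterior draws $w_1, \ldots, w_S$, and $\beta = 1$ is assumed.

```python
import numpy as np
from scipy.special import logsumexp

def waic_and_cv(log_lik):
    """Bayes training loss, WAIC, and importance-sampling leave-one-out
    cross-validation loss from posterior log-likelihoods (beta = 1 assumed).

    log_lik : array of shape (S, n), log_lik[s, i] = log p(X_i | w_s).
    """
    S, n = log_lik.shape
    # log E_w[p(X_i | w)], approximated by the posterior sample average
    log_pred = logsumexp(log_lik, axis=0) - np.log(S)
    B_t = -np.mean(log_pred)                      # Bayes training loss
    V = np.sum(np.var(log_lik, axis=0))           # functional variance
    waic = B_t + V / n
    # C_v = (1/n) * sum_i log E_w[1 / p(X_i | w)]  (importance-sampling LOO)
    C_v = np.mean(logsumexp(-log_lik, axis=0) - np.log(S))
    return B_t, waic, C_v
```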
Watanabe [2,3,6,7] proved the following four relations:
$$E[B_g] = S + \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right), \qquad E[B_t] = S + \frac{\lambda - 2\nu}{n} + o\!\left(\frac{1}{n}\right),$$
$$E[\mathrm{WAIC}] = E[B_g] + o\!\left(\frac{1}{n}\right), \qquad E[C_v] = E[B_g] + o\!\left(\frac{1}{n}\right),$$
for the learning coefficient $\lambda$ and the singular fluctuation $\nu$, where $\lambda$ is the log canonical threshold of the Kullback function (Definition 1) and $\nu = \lim_{n \to \infty} E[V]/2$ for $\beta = 1$. For singular models, the set of true parameters $\{w \in W : p(x \mid w) = q(x)\}$ is typically not a singleton set. Nevertheless, the WAIC and the cross-validation can estimate the Bayesian generalization error without any knowledge of the true probability density function.
These values are calculated from the training samples using only the learning model $p$. In real applications or experiments, we typically do not know the true distribution, but only the values of the training losses. This is why these methods are effective: we can select a suitable model from several statistical models by comparing these values.
In this paper, we consider the learning coefficient $\lambda$, which is equal to the log canonical threshold introduced in Definition 1. This coefficient is not needed to evaluate the WAIC or the cross-validation in practice; however, the learning coefficients from our recent results have been used very effectively by Drton and Plummer [8] for model selection with the "singular Bayesian information criterion (sBIC)". The sBIC method works even with only bounds on the learning coefficients.
It is known that $\lambda = d/2$ holds for regular models, where $d$ is the dimension of the parameter space. The value of $\lambda$ is obtained by a blowing-up process, which is the main tool in the desingularization of an algebraic variety. The following theorem is an analytic version of Hironaka's theorem [9] used by Atiyah [10].
Theorem 1
(Desingularization [9]). Let f be an analytic function in a neighborhood of with . There exists an open set , an analytic manifold M, and a proper analytic map μ from M to U such that: (1) is an isomorphism, where , and: (2) for each , there is a local analytic coordinate system such that , where are non-negative integers.
The theorem establishes the existence of the desingularization map; however, it is generally still difficult to obtain such maps for Kullback functions because the singularities of these functions are very complicated. From the learning coefficient $\lambda$ and its order $\theta$, the asymptotic behavior of the generalization error is obtained theoretically as follows: let $\xi_n$ be an empirical process defined on the manifold obtained by a resolution of singularities, and let the sum be taken over the local coordinates that attain the minimum $\lambda$ and the maximum $\theta$. We then obtain an asymptotic expansion of the generalization error in terms of $\lambda$, $\theta$, and $\xi_n$,
where $\xi_n$ converges in law to a random variable of a Gaussian process with mean zero whose covariance is given by the analytic function obtained by the resolution of singularities using Theorem 1.
Our purpose in this paper is to obtain the learning coefficient $\lambda$. In recent studies, we determined the learning coefficients for reduced rank regression [11], the three-layered neural network with one input unit and one output unit [12,13], normal mixture models with a dimension of one [14], and the restricted Boltzmann machine [15]. Additionally, Rusakov and Geiger [16,17] and Zwiernik [18], respectively, obtained the learning coefficients for naive Bayesian networks and for directed tree models with hidden variables. Drton et al. [19] considered these coefficients for Gaussian latent tree and forest models.
The papers [20,21] derived bounds on the learning coefficients for Vandermonde matrix-type singularities and explicit values under some conditions.
The remainder of this paper is structured as follows: In Section 2, we introduce log canonical thresholds in algebraic geometry. In Section 3, we summarize key theorems for obtaining learning coefficients for learning theory. In Section 4, we present our main results. We consider the log canonical thresholds of Vandermonde matrix-type singularities (Definition 3). We present our conclusions in Section 5.
2. Log Canonical Threshold
Definition 1.
Let $f$ be an analytic function in a neighborhood $U$ of $w^* \in \mathbb{R}^d$ (when $f$ is vector- or matrix-valued, $|f(w)|$ below denotes its norm). Let $\psi$ be a $C^\infty$ function with compact support. Define the log canonical threshold $\lambda_{w^*}(f, \psi)$ by the condition that $-\lambda_{w^*}(f, \psi)$ is the largest pole of
$$\int_U |f(w)|^{2z}\, \psi(w)\, dw,$$
where the variables are considered over the complex field or over the real field. Additionally, define $\theta_{w^*}(f, \psi)$ as its order. If $\psi(w^*) > 0$, then we define $\lambda_{w^*}(f) = \lambda_{w^*}(f, \psi)$ and $\theta_{w^*}(f) = \theta_{w^*}(f, \psi)$, because the log canonical threshold and its order are then independent of $\psi$.
Applying Hironaka's Theorem 1 to the function $f$, we have a proper analytic map $\mu$ from a manifold $M$ to a neighborhood $U$ of $w^*$ that satisfies conditions (1) and (2) of Theorem 1. Then, the integration $\int_U |f(w)|^{2z} \psi(w)\, dw$ is equal to $\int_M |f(\mu(u))|^{2z} \psi(\mu(u))\, |\mu'(u)|\, du$, which is the sum of integrals in normal crossing form over the local analytic coordinate systems on $M$. Therefore, the poles can be obtained. Note that for each $w$ with $f(w) \neq 0$, there exists a neighborhood $U'$ of $w$ such that $f(w') \neq 0$ for all $w' \in U'$; thus, the corresponding integral has no poles. The learning coefficient is the log canonical threshold of the Kullback function (relative entropy) over the real field.
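As an illustration of how the poles arise (standard computations, not taken from the paper; the truncation to $[0,\varepsilon]^d$ and the constant $C_d$ below are our own notation), consider first the normal crossing form produced by Theorem 1, with $\psi \equiv 1$ and the Jacobian factor omitted:
$$f(u) = u_1^{k_1} \cdots u_d^{k_d}, \qquad \int_{[0,\varepsilon]^d} |f(u)|^{2z}\, du = \prod_{j=1}^{d} \frac{\varepsilon^{2 k_j z + 1}}{2 k_j z + 1},$$
so the largest pole is $z = -\min_j 1/(2 k_j)$, giving $\lambda = \min_j 1/(2 k_j)$ with $\theta$ equal to the number of indices attaining this minimum. Similarly, for a regular model whose Kullback function is comparable to a nondegenerate quadratic form,
$$K(w) = w_1^2 + \cdots + w_d^2, \qquad \int_{\|w\| < \varepsilon} K(w)^{z}\, dw = C_d \int_0^{\varepsilon} r^{2z + d - 1}\, dr = C_d\, \frac{\varepsilon^{2z + d}}{2z + d},$$
where $C_d$ is the surface area of the unit sphere in $\mathbb{R}^d$; the largest pole is $z = -d/2$, recovering $\lambda = d/2$ and $\theta = 1$.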
3. Main Theorems
In this section, several theorems are introduced for obtaining log canonical thresholds over the real field. Theorem 2 (the method for determining the deepest singular point), Theorem 3 (the method to add variables), and Theorem 4 (the rational blowing-up method) are very helpful for obtaining log canonical thresholds. Working over the real field, these theorems are useful for reducing the number of changes of coordinates required by blow-ups.
We denote constants, such as $a^*$, $b^*$, and $w^*$, by the suffix $*$. Define the norm of a matrix $C = (c_{ij})$ as $\|C\| = \sqrt{\sum_{i,j} c_{ij}^2}$. Set .
Lemma 1
([14,22,23]). Let $U$ be a neighborhood of $w^* \in \mathbb{R}^d$. Consider the ring of analytic functions on $U$. Let $\mathcal{I}$ be the ideal generated by $f_1, \ldots, f_n$, which are analytic functions defined on $U$. If $g_1, \ldots, g_m \in \mathcal{I}$, then
$$\lambda_{w^*}(g_1, \ldots, g_m) \leq \lambda_{w^*}(f_1, \ldots, f_n).$$
If, in addition, $f_1, \ldots, f_n \in \langle g_1, \ldots, g_m \rangle$, then the reverse inequality also holds. In particular, if $g_1, \ldots, g_m$ generate the ideal $\mathcal{I}$, then
$$\lambda_{w^*}(g_1, \ldots, g_m) = \lambda_{w^*}(f_1, \ldots, f_n), \qquad \theta_{w^*}(g_1, \ldots, g_m) = \theta_{w^*}(f_1, \ldots, f_n).$$
Here $\lambda_{w^*}(f_1, \ldots, f_n)$ denotes the log canonical threshold of the vector-valued function $(f_1, \ldots, f_n)$ in the sense of Definition 1.
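As a small illustration of the lemma (a standard example, not from the paper): at the origin, $(w_1, w_2)$ and $(w_1, w_2, w_1 + w_2)$ generate the same ideal, and indeed
$$w_1^2 + w_2^2 \;\leq\; w_1^2 + w_2^2 + (w_1 + w_2)^2 \;\leq\; 3\,(w_1^2 + w_2^2),$$
so the two tuples have the same log canonical threshold and order. A direct computation in polar coordinates gives
$$\int_{\|w\| < \varepsilon} (w_1^2 + w_2^2)^{z}\, dw = 2\pi \int_0^{\varepsilon} r^{2z + 1}\, dr = \frac{2\pi\, \varepsilon^{2z + 2}}{2z + 2},$$
so $\lambda_0 = 1$ and $\theta_0 = 1$ for both tuples.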
The following lemma is also used in the proofs.
Lemma 2
([15]). Let $\mathcal{I}_1$ and $\mathcal{I}_2$ be the ideals generated by $f_1(w), \ldots, f_n(w)$ and $g_1(w'), \ldots, g_m(w')$, respectively. If $w$ and $w'$ are different variables, then:
$$\lambda_{(w^*, w'^*)}(f_1, \ldots, f_n, g_1, \ldots, g_m) = \lambda_{w^*}(f_1, \ldots, f_n) + \lambda_{w'^*}(g_1, \ldots, g_m).$$
Theorem 2
(Method for determining the deepest singular point [21]). Let , …, be homogeneous functions of . Furthermore, let ψ be a function such that and is a homogeneous function of in a small neighborhood of . Then, we have:
Theorem 3
(Method to add variables [21]). Let , …, be homogeneous functions of . Set , …, . If , then we have:
Resolutions of singularities are obtained by constructing blow-ups along smooth submanifolds. In this paper, we use a blow-up method along certain singular varieties, as explained below, to obtain log canonical thresholds.
Theorem 4 (Rational blow-up process).
Let and . Consider the set .
We set for , , .
Let ψ be analytic functions defined on U and . Then, we have:
Proof.
The proof of this theorem uses a resolution of singularities along a smooth submanifold.
Set , and construct the blow-up along the submanifold . Then, we have for , , , and .
Consider the set for .
Set , then we have , , and .
The Jacobian is . □
The theorem is simple; however, it is a useful tool for obtaining log canonical thresholds.
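As a minimal illustration in the spirit of Theorem 4 (a standard example with our own notation, not one taken from the paper), consider the tuple $(w_1, w_2^2)$, that is, the zeta function
$$\zeta(z) = \int_{|w_1|, |w_2| < \varepsilon} (w_1^2 + w_2^4)^{z}\, dw_1\, dw_2 .$$
On the region $|w_1| \leq w_2^2$, the weighted substitution $w_1 = u_1 u_2^2$, $w_2 = u_2$ with $|u_1| \leq 1$ has Jacobian $u_2^2$ and gives
$$\int_{|u_1| \leq 1} \int_{|u_2| < \varepsilon} (u_1^2 + 1)^{z}\, |u_2|^{4z + 2}\, du_2\, du_1,$$
whose largest pole is at $4z + 3 = 0$. On the remaining region $|w_1| \geq w_2^2$, we have $w_1^2 + w_2^4 \asymp w_1^2$ and $|w_2| \leq |w_1|^{1/2}$, which again yields the largest pole $z = -3/4$. Hence $\lambda_0 = 3/4$ and $\theta_0 = 1$, obtained with a single weighted ("rational") substitution instead of two ordinary blow-ups.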
4. Main Results
In this section, we apply the theorems in Section 3 to Vandermonde matrix-type singularities, which are generic and essential in learning theory. Their associated log canonical thresholds provide the learning coefficients of, for example, three-layered neural networks in Section 4.1, normal mixture models in Section 4.2, and mixtures of binomial distributions [24].
4.1. Three-Layered Neural Network
Consider the three-layered neural network with N input units, H hidden units, and M output units, which is trained for estimating the true distribution with r hidden units. Denote an input value by with a probability density function . Then, an output value of the three layered neural network is given by where and:
Consider a statistical model:
and . Assume that the true distribution:
is included in the learning model, where and
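As a concrete sketch of the model in this subsection (our own illustration; we assume the standard parameterization $f_k(x, w) = \sum_{h} a_{kh} \tanh(b_h \cdot x)$ with unit-variance Gaussian output noise, as used for three-layered networks in [12,13,20]):

```python
import numpy as np

def network_output(x, a, b):
    """Three-layered network with N inputs, H hidden tanh units, M outputs.

    x : (N,) input vector
    a : (M, H) hidden-to-output weights
    b : (H, N) input-to-hidden weights
    Returns f(x, w) = a @ tanh(b @ x), an (M,) output vector.
    """
    return a @ np.tanh(b @ x)

def log_model_density(y, x, a, b):
    """log p(y | x, w) for a unit-variance Gaussian output model (assumed)."""
    r = y - network_output(x, a, b)
    return -0.5 * float(r @ r) - 0.5 * r.shape[0] * np.log(2.0 * np.pi)
```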
4.2. Normal Mixture Models
We consider a normal mixture model [14] with identity matrix variances:
where and , .
Set the true distribution by:
where and , . (In order to simplify the following, we use the values , not .)
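The following is a minimal sketch of the mixture density in this subsection (our own illustration; the normalization assumes identity covariance matrices as stated, and the variable names are ours):

```python
import numpy as np

def normal_mixture_density(x, weights, means):
    """Normal mixture with identity covariance matrices.

    x       : (N,) evaluation point
    weights : (H,) nonnegative mixing proportions summing to one
    means   : (H, N) component mean vectors
    Returns sum_h weights[h] * N(x | means[h], I_N).
    """
    n_dim = x.shape[0]
    sq_dist = np.sum((means - x) ** 2, axis=1)          # (H,) squared distances
    comp = np.exp(-0.5 * sq_dist) / (2.0 * np.pi) ** (n_dim / 2.0)
    return float(weights @ comp)
```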
4.3. Vandermonde Matrix-Type Singularities
Definition 2.
Fix . Define: if , , and
For simplicity, we use the notation instead of because we always have and in this section.
Definition 3.
Fix and .
Let ,
for , and:
(t denotes the transpose).
and are the variables in a neighborhood of and , where and are fixed constants.
Let be the ideal generated by the elements of .
We call singularities of Vandermonde matrix-type singularities.
To simplify, we usually assume that:
for and:
for .
Example 1.
If , , , then we have , . These matrices correspond to the three-layered neural network:
and the true distribution:
Example 2.
If , , then we have ,
If , these matrices correspond to a normal mixture model with identity matrix variances:
, , and the true distribution is:
In this paper, we denote:
.
Furthermore, we denote: and:
Theorem 5
([14]). Consider sufficiently small neighborhood U of:
and variables in set U. Set . Let each , …, be a different real vector in:
that is,
Then, is uniquely determined, and by the assumption in Definition 3. Set for . Assume that
and . We then have:
where
and for .
Theorem 6.
We use the same notation as in Theorem 5. Set .
We have the following:
- 1.
- . .
- 2.
- . , , .
- 3.
- ., ,, ,
- 4.
- ., ,, ,,,
Its proof appears in Appendix A.
In paper [22], we had exact values for :
where: and we had:
5. Conclusions
In this paper, we proposed a new "rational blowing-up" method (Theorem 4), applied it to Vandermonde matrix-type singularities, and demonstrated its effectiveness. Theorem 6 determines the explicit values of the log canonical thresholds for the cases considered there. Our future research aim is to improve our methods and obtain explicit values for the general model.
These theoretical values provide a mathematical measure of the precision of numerical calculations of the information criteria in Section 1. Furthermore, our theoretical results will be helpful in numerical experiments such as Markov chain Monte Carlo (MCMC) methods. In [25,26], a mathematical foundation for analyzing and improving the precision of the MCMC method was constructed using the theoretical values of marginal likelihoods.
We will also consider these applications in the future.
Funding
This research was funded by the Ministry of Education, Culture, Sports, Science, and Technology in Japan, Grant-in-Aid for Scientific Research 18K11479.
Acknowledgments
We thank Maxine Garcia, PhD, from Edanz Group for editing a draft of this manuscript.
Conflicts of Interest
The author declares no conflict of interest.
Appendix A
We demonstrate Theorem 6 by blowing-up processes.
Let and .
By constructing the blow-up along and by choosing one branch of the blow-up process, we assume that for , and . We also set (, ), , () and .
Then, we have:
for .
For simplicity, we set and again.
By constructing the blow-up along and by choosing one branch of the blow-up process, set for and .
- If , then set (, ), , (), and .For simplicity, we set and again.
- If , then, by constructing the blow-up along and by choosing one branch of the blow-up process, we set for , . Assume that .
- (a)
- By constructing the rational blow-up along , we have the following (i) and (ii).
- Set .for , , , for , and .
- Set .Set for , . and .
By constructing blow-ups, we have the theorem.
References
1. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723.
2. Watanabe, S. Algebraic analysis for nonidentifiable learning machines. Neural Comput. 2001, 13, 899–933.
3. Watanabe, S. Algebraic geometrical methods for hierarchical learning machines. Neural Netw. 2001, 14, 1049–1060.
4. Watanabe, S. Algebraic geometry of learning machines with singularities and their prior distributions. J. Jpn. Soc. Artif. Intell. 2001, 16, 308–315.
5. Watanabe, S. Algebraic Geometry and Statistical Learning Theory; Cambridge University Press: New York, NY, USA, 2009; Volume 25.
6. Watanabe, S. Equations of states in singular statistical estimation. Neural Netw. 2010, 23, 20–34.
7. Watanabe, S. Mathematical Theory of Bayesian Statistics; CRC Press: New York, NY, USA, 2018.
8. Drton, M.; Plummer, M. A Bayesian information criterion for singular models. J. R. Statist. Soc. B 2017, 79, 1–38.
9. Hironaka, H. Resolution of singularities of an algebraic variety over a field of characteristic zero. Ann. Math. 1964, 79, 109–326.
10. Atiyah, M.F. Resolution of singularities and division of distributions. Commun. Pure Appl. Math. 1970, 23, 145–150.
11. Aoyagi, M.; Watanabe, S. Stochastic complexities of reduced rank regression in Bayesian estimation. Neural Netw. 2005, 18, 924–933.
12. Aoyagi, M.; Watanabe, S. Resolution of singularities and the generalization error with Bayesian estimation for layered neural network. IEICE Trans. J88-D-II 2005, 10, 2112–2124.
13. Aoyagi, M. The zeta function of learning theory and generalization error of three layered neural perceptron. RIMS Kokyuroku Recent Top. Real Complex Singul. 2006, 1501, 153–167.
14. Aoyagi, M. A Bayesian learning coefficient of generalization error and Vandermonde matrix-type singularities. Commun. Stat. Theory Methods 2010, 39, 2667–2687.
15. Aoyagi, M. Learning coefficient in Bayesian estimation of restricted Boltzmann machine. J. Algebr. Stat. 2013, 4, 30–57.
16. Rusakov, D.; Geiger, D. Asymptotic model selection for naive Bayesian networks. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, Alberta, Canada, 1–4 August 2002; pp. 438–445.
17. Rusakov, D.; Geiger, D. Asymptotic model selection for naive Bayesian networks. J. Mach. Learn. Res. 2005, 6, 1–35.
18. Zwiernik, P. An asymptotic behavior of the marginal likelihood for general Markov models. J. Mach. Learn. Res. 2011, 12, 3283–3310.
19. Drton, M.; Lin, S.; Weihs, L.; Zwiernik, P. Marginal likelihood and model selection for Gaussian latent tree and forest models. Bernoulli 2017, 23, 1202–1232.
20. Aoyagi, M.; Nagata, K. Learning coefficient of generalization error in Bayesian estimation and Vandermonde matrix type singularity. Neural Comput. 2012, 24, 1569–1610.
21. Aoyagi, M. Consideration on singularities in learning theory and the learning coefficient. Entropy 2013, 15, 3714–3733.
22. Aoyagi, M. Log canonical threshold of Vandermonde matrix type singularities and generalization error of a three layered neural network. Int. J. Pure Appl. Math. 2009, 52, 177–204.
23. Lin, S. Asymptotic approximation of marginal likelihood integrals. arXiv 2010, arXiv:1003.5338v2.
24. Yamazaki, K.; Aoyagi, M.; Watanabe, S. Asymptotic analysis of Bayesian generalization error with Newton diagram. Neural Netw. 2010, 23, 35–43.
25. Nagata, K.; Watanabe, S. Exchange Monte Carlo sampling from Bayesian posterior for singular learning machines. IEEE Trans. Neural Netw. 2008, 19, 1253–1266.
26. Nagata, K.; Watanabe, S. Asymptotic behavior of exchange ratio in exchange Monte Carlo method. Int. J. Neural Netw. 2008, 21, 980–988.