Abstract
We give a systematic investigation of the reproducing property of the zonal translation network and apply this property to kernel regularized regression. We propose the concept of the Marcinkiewicz–Zygmund setting (MZS) for scattered nodes on the unit sphere. We show that, under the MZ condition, the corresponding convolutional zonal translation network is a reproducing kernel Hilbert space. Based on these facts, we propose a kernel regularized regression learning framework and provide an upper bound estimate for the learning rate. We also prove the density of the zonal translation network by means of spherical Fourier–Laplace series.
Keywords:
kernel regularized regression; learning theory; convolution translation network; reproducing kernel Hilbert space; Marcinkiewicz–Zygmund inequality; quadrature rule; learning rate
MSC:
41A25
1. Introduction
It is known that convolutional neural networks provide various models and algorithms for processing data in many fields, such as computer vision (see [1]), natural language processing (see [2]), and sequence analysis in bioinformatics (see [3]). Regularized neural network learning has thus become an attractive research topic (see [4,5,6,7,8,9]). In this paper, we give a theoretical analysis of the learning rate of regularized regression associated with the zonal translation network on the unit sphere.
1.1. Kernel Regularized Learning
Let X be a compact subset of the d-dimensional Euclidean space with the usual norm, and let Y be a nonempty closed subset of the real line contained in [−M, M] for a given M > 0. The aim of regression learning is to learn, from a hypothesis function space, the target function that describes the relationship between the input x and the output y. In most cases, the target function is available only through a set of observations drawn independently and identically distributed (i.i.d.) according to a joint probability distribution (measure) ρ on X × Y, where ρ(y|x) is the conditional probability of y for a given x and ρ_X is the marginal probability about x, i.e., for every integrable function φ, there holds ∫_{X×Y} φ(x, y) dρ = ∫_X ∫_Y φ(x, y) dρ(y|x) dρ_X(x).
For a given normed space B consisting of real functions on X, we define the regularized learning framework with B as
where λ > 0 is the regularization parameter and the data-fitting term is the empirical mean of the squared loss over the observations.
To analyze the convergence of algorithm (1) quantitatively, we often use the integral framework (see [10,11])
where .
The optimal target function is the regression function
satisfying
where the infimum is taken over all measurable functions f. Moreover, there holds the famous equality (see [12])
The choices for the hypothesis space B in (1) are rich. For example, C.P. An et al. chose the algebraic polynomial class as B (see [13,14,15]). In [16], C. De Mol et al. chose a dictionary as B. Recently, some papers have chosen a Sobolev space as the hypothesis space B (see [17,18]). By the kernel method, we traditionally mean replacing B with a reproducing kernel Hilbert space (RKHS), which is a Hilbert space consisting of real functions defined on X, equipped with a Mercer kernel K on X × X (i.e., K is a continuous and symmetric function on X × X and, for any finite set of points in X, the Mercer matrices (K(x_i, x_j)) are positive semi-definite) such that
and there holds the embedding inequality
where c is a constant independent of f and x. Two results follow for the optimal solution . The reproducing property (3) yields the representation
The embedding inequality (4) yields the inequality
1.2. Marcinkiewicz–Zygmund Setting (MZS)
It is particularly important to mention here that translation networks have recently been used as the hypothesis space in regularized learning (see [25,26]). From the viewpoint of approximation theory, a simple single-layer translation network with m neurons is a function class produced by translating a given function, and can be written as
where is a given node set, and a given is a translation operator corresponding to . For example, when and , we choose as the usual convolution translation operator for a function defined on or a periodic function (see [27,28]). When is the unit sphere in , one can choose as the zonal translation operator for a given function defined on the interval [−1, 1] (see [29]). In [30], we defined a kind of translation operator for . To ensure that the single-layer translation network can approximate the constant function, is modified as
In the case of and , R.D. Nowak et al. used (7) to design regularized learning frameworks (see [31]). An algorithm for designing such networks is provided by S.B. Lin et al. in [26] and is applied to construct regularized learning algorithms. In [5], was used to construct deep neural network learning frameworks. Investigations of the same type are given in [32,33,34].
It is easy to see that the approximation ability and the construction of a translation network depend upon the node set (see [35,36,37]). On the other hand, according to the view of [38], the quadrature rule and the Marcinkiewicz–Zygmund (MZ) inequality associated with the node set also influence the construction of the translation network. Let Ω be a bounded closed set equipped with a measure μ satisfying μ(Ω) < +∞. We denote by the linear space of polynomials on Ω of degree at most n, equipped with the corresponding L²-inner product. The n-point quadrature rule (QR) is
where the nodes lie in Ω and the weights are all positive for every index. We say the QR (8) has polynomial exactness n if there is a positive integer n such that
The Marcinkiewicz–Zygmund (MZ) inequality based on the set is
where the weights in (10) may not be the same as those in (8) and (9). Another important inequality associated with polynomial approximation, called the MZ condition in the case of the unit sphere in [39], is
where C is a constant independent of p, and n and r are any positive integers.
(i) In many cases of the domain Ω, relations (9)–(11) coexist. For example, when Ω = [−1, 1], (9)–(11) hold when the nodes are the zeros of the n-th Jacobi polynomial orthogonal with respect to the given weight and the weights are the associated Cotes–Christoffel numbers (see Theorem A and Theorem B in [40] and Theorem 3.4.1 in [41]); a toy numerical illustration is given after these remarks. H.N. Mhaskar et al. first showed in [42] that (9) and (10) coexist on the unit sphere, and the corresponding relation (11) was shown in [43].
(ii) According to the view of [38], the quadrature rule (QR) follows automatically from the Marcinkiewicz–Zygmund (MZ) inequality in many cases of Ω. H.N. Mhaskar et al. gave a general method of transition from the MZ inequality to the polynomial exact QR in [44]; see also Theorem 4.1 in [42]. In particular, in the case of the interval, (10) may be obtained from (9) and (11) directly (see [45]).
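The following toy numerical sketch (not taken from the paper; all names and parameter values are illustrative) shows both phenomena on the interval [−1, 1]: the n-point Gauss–Legendre rule, whose nodes are the zeros of the n-th Legendre polynomial and whose weights are the Cotes–Christoffel numbers, is exact for polynomials of degree at most 2n − 1, and the same nodes and weights give an MZ-type comparison between the discrete and continuous L² norms of lower-degree polynomials.

# Toy illustration (not from the paper): on [-1, 1] with the Lebesgue measure,
# the n-point Gauss-Legendre rule is exact for polynomials of degree <= 2n - 1,
# and the same nodes/weights give a Marcinkiewicz-Zygmund-type norm equivalence
# for polynomials of moderate degree.
import numpy as np

n = 12
nodes, weights = np.polynomial.legendre.leggauss(n)  # Gauss-Legendre nodes/weights

# Polynomial exactness (relation (9)): integrate x^k exactly for k <= 2n - 1.
for k in range(2 * n):
    quad = np.sum(weights * nodes**k)
    exact = 0.0 if k % 2 == 1 else 2.0 / (k + 1)     # integral of x^k over [-1, 1]
    assert abs(quad - exact) < 1e-12

# MZ-type comparison (relation (10)): for a polynomial p of degree <= n - 1,
# the discrete L^2 norm on the nodes matches the continuous L^2 norm
# (here even with equality, because the rule is exact for p^2).
rng = np.random.default_rng(0)
coeffs = rng.standard_normal(n)                      # Legendre coefficients of p, degree n - 1
p = np.polynomial.legendre.Legendre(coeffs)
discrete_sq = np.sum(weights * p(nodes) ** 2)
# Continuous squared L^2 norm via orthogonality of the Legendre polynomials:
continuous_sq = np.sum(coeffs**2 * 2.0 / (2 * np.arange(n) + 1))
print(discrete_sq, continuous_sq)                    # the two values coincide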
These facts show that, besides the polynomial exactness formula (QR) (9), the MZ inequality (10) is also an important feature for describing the node set. For this reason, a node set that yields an MZ inequality is given a special terminology, the Marcinkiewicz–Zygmund family (MZF) (see [38,46,47,48]). However, from this literature, we know that the MZF does not totally coincide with the Lagrange interpolation nodes in the case of . Hyperinterpolation was then developed with the help of the exact QR (see [49,50,51,52,53]) and has been applied to approximation theory and regularized learning (see [13,14,15,54]). On the other hand, the problem of the polynomial exact QR has also been investigated on its own (see [55,56]). The concept of the spherical t-design was first defined in [57] and has subsequently been investigated in many papers; one can see the classical references [58,59]. We say is a spherical t-design if
where is the volume of and is any spherical polynomial of degree at most t. Moreover, in many applications, the polynomial exact QR and the MZF have been used as assumptions. For example, C.P. An et al. gave an approximation order for hyperinterpolation under the assumptions that (9), (12), and the MZ inequality (10) hold (see [60,61]). Also, in [25], Lin et al. investigated regularized regression associated with a zonal translation network by assuming that the node set is a type of spherical t-design.
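As a toy illustration of the t-design condition (12) (our example, not from the paper), the six vertices of the octahedron form a spherical 3-design on the two-dimensional sphere: averaging any polynomial of degree at most 3 over these nodes reproduces its average over the sphere, while degree-4 polynomials already show a gap. The helper names below are ours.

# Toy check (not from the paper): the 6 vertices of the octahedron form a
# spherical 3-design on S^2, i.e., their average reproduces the sphere average
# of every polynomial of degree <= 3, but they are not a 4-design.
import itertools
import numpy as np

nodes = np.array([[ 1, 0, 0], [-1, 0, 0],
                  [ 0, 1, 0], [ 0,-1, 0],
                  [ 0, 0, 1], [ 0, 0,-1]], dtype=float)

def sphere_average_monomial(a, b, c):
    """Exact average of x^a y^b z^c over S^2 (normalized surface measure)."""
    if a % 2 or b % 2 or c % 2:
        return 0.0
    double_fact = lambda k: np.prod(np.arange(k, 0, -2)) if k > 0 else 1.0
    return (double_fact(a - 1) * double_fact(b - 1) * double_fact(c - 1)
            / double_fact(a + b + c + 1))

def design_defect(t):
    """Largest deviation between the node average and the sphere average over
    all monomials of total degree <= t."""
    worst = 0.0
    for a, b, c in itertools.product(range(t + 1), repeat=3):
        if a + b + c > t:
            continue
        node_avg = np.mean(nodes[:, 0]**a * nodes[:, 1]**b * nodes[:, 2]**c)
        worst = max(worst, abs(node_avg - sphere_average_monomial(a, b, c)))
    return worst

print(design_defect(3))   # ~0: exact for all polynomials of degree <= 3
print(design_defect(4))   # clearly nonzero: not a 4-design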
The polynomial exact QR is also a good tool in approximation theory. For example, H.N. Mhaskar et al. used a polynomial exact QR to construct the first periodic translation operators (see [27]) and the zonal translation network operators (see [29]). Along this line, the translation operators defined on the unit ball, on the Euclidean space, and on the interval were constructed (see [28,30,62]).
The above investigations encourage us to define a terminology that covers both the Marcinkiewicz–Zygmund family (MZF) and the polynomial exact QR: we call it the Marcinkiewicz–Zygmund setting (MZS).
Definition 1
(Marcinkiewicz–Zygmund setting (MZS) on Ω). We say a given finite node set forms a Marcinkiewicz–Zygmund setting on Ω if (9)–(11) simultaneously hold.
In this paper, we design the translation network by taking , assuming that the node set satisfies the MZS, and choosing the zonal translation with being a given integrable function on . Under these assumptions, we provide a learning framework with as the hypothesis space and establish its learning rate.
The contributions of this paper are twofold. First, after absorbing the ideas of [38,46,47,48] and the successful experience of [13,25,27,29,42,60,61,63], we propose the concept of the Marcinkiewicz–Zygmund setting (MZS) for scattered nodes on the unit sphere; based on this assumption, we show the convergence rate for the approximation error of kernel regularized learning associated with spherical Fourier analysis. Second, we give a new application of translation networks and, at the same time, expand the application scope of kernel regularized learning.
The paper is organized as follows. In Section 2, we first show the density of the zonal translation class and then show the reproducing property of the translation network . In Section 3, we present the main results of the paper: a new regression learning framework and learning setting, the error decomposition used in the analysis, and an estimate for the convergence rate. In Section 4, we give several lemmas that are used to prove the main results. The proofs of all the theorems and propositions are given in Section 5.
Throughout the paper, we write A ≲ B if there is a positive constant C, independent of A and B, such that A ≤ CB. In particular, we write A = O(1) to indicate that A is a bounded quantity. We write A ∼ B if both A ≲ B and B ≲ A.
2. The Properties of the Translation Networks on the Unit Sphere
Let . Then, H.N. Mhaskar et al. constructed in [29] a sequence of approximation operators to show that the zonal translation class
is dense in if for all where
and is the n-th Legendre polynomial, which satisfies the orthogonality relation
with , , and it is known that (see (B.2.1), (B.2.2), and (B.5.1) of [64]) It follows that
where .
Let denote the space of all homogeneous polynomials of degree n in d variables. We denote by the class of all measurable functions defined on with the finite norm
and for , we assume that is the space of continuous functions on with the uniform norm.
For a given integer , the restriction to of a homogeneous harmonic polynomial of degree n is called a spherical harmonic of degree n. If , then , so that Y is determined by its restriction to the unit sphere. Let denote the space of the spherical harmonics of degree n. Then,
Spherical harmonics of different degrees are orthogonal on the unit sphere. For further properties of spherical harmonics, one can refer to [65].
For let be an orthonormal basis of . Then,
where denotes the surface area of and . Furthermore, by (1.2.8) in [64], we have,
where is the n-th generalized Legendre polynomial, the same as in (13). Combining (13) and (14), we have
Also, there holds the Funk–Hecke formula (see (1.2.11) in [64] or (1.2.6) in [66]):
In particular, there holds
For a we define . Then,
where It is known that (see (6.1.4) in [66])
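For the special case d = 3, the addition formula (14) reduces to the classical Legendre addition theorem, and it can be checked numerically. The following sketch is our illustration (not part of the paper); it assumes SciPy's complex orthonormal spherical harmonics and their angle conventions, and the helper random_point is ours.

# Numerical check (special case d = 3, not taken from the paper) of the addition
# formula: sum_{k=-n}^{n} Y_{n,k}(x) * conj(Y_{n,k}(y)) = (2n+1)/(4*pi) * P_n(x . y),
# using SciPy's complex orthonormal spherical harmonics.
import numpy as np
from scipy.special import sph_harm, eval_legendre

rng = np.random.default_rng(1)

def random_point():
    """Random point on S^2; returns the vector and its (azimuth, polar) angles."""
    v = rng.standard_normal(3)
    v /= np.linalg.norm(v)
    theta = np.arctan2(v[1], v[0]) % (2 * np.pi)   # azimuthal angle in [0, 2*pi)
    phi = np.arccos(v[2])                          # polar angle in [0, pi]
    return v, theta, phi

n = 5
x, tx, px = random_point()
y, ty, py = random_point()

lhs = sum(sph_harm(k, n, tx, px) * np.conj(sph_harm(k, n, ty, py))
          for k in range(-n, n + 1))
rhs = (2 * n + 1) / (4 * np.pi) * eval_legendre(n, np.dot(x, y))

print(lhs.real, rhs)   # the two values agree up to rounding error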
2.1. Density
We first give a general discrimination method for density.
Proposition 1
(see Lemma 1 in Chapter 18 of [67]). For a subset V in a normed linear space E, the following two properties are equivalent:
(a) V is fundamental in E (that is, its linear span is dense in E).
(b) The annihilator of V in the dual space E* is {0} (that is, 0 is the only element of E* that annihilates V).
Based on this proposition, we can show the density of in a qualitative way.
Theorem 1.
Let satisfy for all Then, is dense in .
Proof.
See the proof in Section 5. □
We can quantitatively show the density of in .
Let denote the set of all continuous functions defined on and
Define a differential operator as
and
Theorem 2.
Let be sufficiently smooth (for example, for all ) and satisfy for all . Then, for a given and , there is a such that
Proof.
See the proof in Section 5. □
2.2. MZS on the Unit Sphere
We first restate a proposition.
Proposition 2.
There is a finite subset and positive constants such that for any given , we have two nonnegative number sets and satisfying
such that
and
Moreover, for any and , there exists a constant such that
and
Proof.
Inequalities (21)–(22) were proved by H.N. Mhaskar et al. in [42] (see also [29]) and have since been extended to other domains (see [68]). Inequality (24) may be found in [29]. Inequality (23) follows from (22) and the following fact (see Theorem 2.1 in [43]):
Suppose that Ω is a finite subset of , is a set of positive numbers, and n is a positive integer. If for a holds the inequality
with independent of f, then for any and any with
where depends only on d and p.
Proposition 3.
For any given , there exists a finite subset , which forms an MZS on .
Proof.
The results follow from (21)–(23). □
2.3. The Reproducing Property
Let be a given even function. For a given finite set , i.e., , and the corresponding finite number set , we define a zonal translation network as
where . Then, it is easy to see that
where for , and we define and
For , we define a bivariate operation as
and
Because of (15), we see by Theorem 4 in Chapter 17 of [67] that the matrix is positive definite for a given n.
It follows that the vector is unique. Then, for a given n, is a finite-dimensional Hilbert space, whose dimension is , and it is isometrically isomorphic to , where
Since is a finite-dimensional Hilbert space, we know by Theorem A in Section 3 of Part I of [69] that it must be a reproducing kernel Hilbert space; what remains is to find the reproducing kernel.
We have the following proposition.
Proposition 4.
Let satisfy and
If satisfies the MZ condition (23) by , then is a finite-dimensional reproducing kernel Hilbert space associated with the kernel
i.e.,
and there is a constant such that
Proof.
See the proof in Section 5. □
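The reproducing kernel in Proposition 4 is built from the zonal translations at the nodes; its precise form is the one given in (26). As a hedged illustration only (not the paper's formula), the following minimal sketch shows why a kernel of the candidate convolutional zonal form Σ_k w_k φ(⟨x, z_k⟩) φ(⟨y, z_k⟩) is automatically positive semi-definite: it is a weighted Gram kernel. The node set, weights, and activation below are hypothetical stand-ins.

# Minimal sketch (assumed candidate form; formula (26) of the paper is not
# reproduced here): a convolutional zonal kernel built from nodes z_k on the
# sphere, positive weights w_k, and an even activation phi on [-1, 1],
#   K(x, y) = sum_k w_k * phi(<x, z_k>) * phi(<y, z_k>),
# is positive semi-definite because it is a weighted Gram kernel.
import numpy as np

rng = np.random.default_rng(2)
d, m = 3, 20                                   # ambient dimension, number of nodes

def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

Z = unit(rng.standard_normal((m, d)))          # hypothetical scattered nodes z_k
w = np.full(m, 4 * np.pi / m)                  # hypothetical positive weights w_k
phi = np.cosh                                  # hypothetical even activation on [-1, 1]

def K(X, Y):
    """Kernel matrix K(x_i, y_j) = sum_k w_k * phi(x_i . z_k) * phi(y_j . z_k)."""
    FX = phi(X @ Z.T)                          # features phi(<x_i, z_k>)
    FY = phi(Y @ Z.T)
    return FX @ (w[:, None] * FY.T)

X = unit(rng.standard_normal((50, d)))         # test points on the sphere
G = K(X, X)
eigs = np.linalg.eigvalsh(G)
print(eigs.min() >= -1e-10)                    # True: the Gram matrix is PSD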
Corollary 1.
Under the assumptions of Proposition 4, is a finite-dimensional reproducing kernel Hilbert space associated with the inner product defined by
where
and the corresponding reproducing kernel is
Furthermore, there is a constant such that
Proof.
The results can be obtained from Proposition 4 together with the fact that the real line R is a reproducing kernel Hilbert space whose reproducing kernel is 1 and whose inner product is the usual product of two real numbers. □
Corollary 2.
Proof.
See the proof in Section 5. □
3. Apply to Kernel Regularized Regression
We now apply the above reproducing kernel Hilbert spaces to kernel regularized regression.
3.1. Learning Framework
For a set of observations drawn i.i.d. according to a joint distribution on , and a given real number satisfying , we define a regularized framework as
where are the regularization parameters, and
Note that the n in (31) may differ from the sample size m; it can be chosen according to our needs in order to increase the learning rate.
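To make the framework concrete, the following minimal sketch (our illustration, under the assumption that the hypothesis space consists of zonal translation networks f(x) = Σ_k c_k φ(⟨x, z_k⟩) and that the penalty is a plain squared coefficient norm; the exact norm in (31) may be a weighted one) solves the resulting finite-dimensional regularized least squares problem in closed form. All names and parameter values are illustrative.

# Minimal sketch of a regularized regression of the type considered here,
# under the stated assumptions; not the paper's exact framework (31).
import numpy as np

rng = np.random.default_rng(3)
d, m_nodes, m_samples, lam = 3, 25, 200, 1e-3

def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

Z = unit(rng.standard_normal((m_nodes, d)))        # nodes z_k (assumed to form an MZS)
phi = np.cosh                                      # hypothetical even activation on [-1, 1]

# Synthetic sample (x_i, y_i) on the sphere: a smooth target plus noise.
X = unit(rng.standard_normal((m_samples, d)))
f_target = lambda x: np.exp(x[:, 0]) + 0.5 * x[:, 2] ** 2
y = f_target(X) + 0.05 * rng.standard_normal(m_samples)

# Regularized least squares over the coefficients c:
#   min_c (1/m) * sum_i (sum_k c_k phi(<x_i, z_k>) - y_i)^2 + lam * ||c||^2
F = phi(X @ Z.T)                                   # design matrix phi(<x_i, z_k>)
c = np.linalg.solve(F.T @ F / m_samples + lam * np.eye(m_nodes),
                    F.T @ y / m_samples)

predict = lambda Xnew: phi(Xnew @ Z.T) @ c
X_test = unit(rng.standard_normal((1000, d)))
print(np.mean((predict(X_test) - f_target(X_test)) ** 2))  # small test error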
To carry out the convergence analysis for (31), we need to bound the error
which is an approximation problem whose convergence rate depends upon the approximation ability of . An error decomposition will be given in Section 3.2.
3.2. Error Decompositions
By (2) and the definition of , we have
where we have used the fact that for and , there holds
and
is a K-functional that quantifies the approximation error, whose decay will be described later. Thus, the main quantity that we need to estimate is the sample error
3.3. Convergence Rate for the K-Functional
We first provide a convergence rate for the K-functional .
Proposition 5.
Let satisfy for all , and let be a positive integer. Then, there is a , which forms an MZS on such that
where
Proof.
See the proof in Section 5. □
Corollary 3.
Let satisfy for all , and for a given l, there holds . If are chosen such that and , then
3.4. The Learning Rate
Theorem 3.
Let satisfy (25) and
If , then there is a constant such that for any , with confidence , there holds
If , then there is a constant such that for any , with confidence , there holds
Proof.
See the proof in Section 5. □
Corollary 4.
Let satisfy (25) and
If , then there is a constant such that for any , with confidence , there holds
If , then there is a constant such that for any , with confidence , there holds
Proof.
See the proof in Section 5. □
Corollary 5.
Let satisfy for all and let be the MZS defined as in Proposition 5. If , then for any , with confidence , there holds
3.5. Comments
We have proposed the concept of the MZS for a scattered node set on the unit sphere, with which we have shown that the related convolutional zonal translation network is a reproducing kernel Hilbert space, and we have established the learning rate for the kernel regularized least squares regression model. We now give further comments on the results.
(1) The zonal translation network that we have chosen is a finite-dimensional reproducing kernel Hilbert space; our discussion therefore belongs to the scope of the kernel method and combines (neural) translation networks with learning theory.
(2) Compared with existing convergence rate estimates for neural network learning, our upper estimates are dimension-independent (see Theorem in [25], Theorem 3.1 in [70], Theorem 7 in [71], and Theorem 1 in [26]).
(3) The density derivation in Theorem 1 for the zonal translation network is qualitative; the density deduction in Theorem 2 is quantitative, with the help of spherical Fourier analysis. We think that this method can be extended to other domains, such as the unit ball and the Euclidean space, etc.
(4) We expect that, with the help of the MZ condition, one may show the reproducing property for deep translation networks and thus investigate the performance of deep convolutional translation learning with the kernel method (see [6,7,33]).
(5) We provide a method for constructing a finite-dimensional reproducing kernel Hilbert space with a convolutional kernel on domains that admit a near-best-approximation operator, for example, the interval , the unit sphere , and the unit ball , etc. (see [64]). The only assumption that we need is (25). The set may be any finite scattered set satisfying the MZ condition (23), whose parameters can be obtained according to (5.3.5) in Theorem 5.3.6 of [64]. To the best of our knowledge, this is the first time that the reproducing property of a zonal neural network has been shown.
(6) In many references, to obtain an explicit learning rate, one often assumes that the approximation error (i.e., the K-functional) decays at a polynomial rate. It was proved in [72] that the K-functional is equivalent to a modulus of smoothness. In the present paper, an upper estimate for the convergence rate of the K-functional is provided for the first time (see (34)).
(7) One advantage of framework (31) is that it is a finite-dimensional strictly convex quadratic optimization problem; because of the structure of , the optimal solution of (31) is unique and can be obtained by the gradient descent algorithm (a minimal sketch is given after these comments).
(8) It is easy to see that the optimal solution depends upon both the distribution and the function . How to quantitatively describe this influence, i.e., the robustness of with respect to and , is a significant research direction; for research of this kind, one can refer to [73,74,75].
(9) Combining the upper estimate (40) and the convergence (35), we know that if are chosen such that and , then for any , with confidence , there holds the convergence
Convergence (41) shows that under these assumptions, algorithm (31) is convergent.
(10) We now compare the learning framework of the present paper with the general learning framework (1) associated with a reproducing kernel Hilbert space . Theorem 1 in [76] gives the sample error bound
Inequality (42) is a fundamental inequality for obtaining the optimal learning rate with the integral operator approach (see Theorem 2 in [76]). Inequality (37) in Theorem 3 shows that the sample error estimate (42) also holds for (31). But is a finite-dimensional proper subset of , which is itself a reproducing kernel Hilbert space. These facts show that the learning framework (1) may attain convergence as well as the optimal learning rate if and are chosen properly.
(11) In this paper, we have illustrated our idea of kernel regularized translation network learning with the zonal translation network. The essence is an application of the MZ inequality and the exact QR, or the MZ condition and the MZS. We conjecture that the results in the present paper may be extended to many other translation networks whose domains satisfy the MZ condition and the MZS, for example, the periodic translation network (see [27]), the translation on the interval (see [30]), the unit sphere (see [29]), and the unit ball (see [62]).
(12) Recently, the exact spherical QR has been used to investigate convergence for the spherical scattered data-fitting problem (see [25,77,78]). The Tikhonov regularization model used is of the following type:
where is a native space of the type , is a scattered set, and and are the positive numbers defined in the polynomial exact QR as in (21); is the target function to be fitted. It is hoped that the method used in the present paper can be applied to investigate the convergence of the algorithm
where and are defined as in Section 2.
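As noted in comment (7), framework (31) is a finite-dimensional strictly convex quadratic problem, so its unique minimizer can also be reached by gradient descent. The following sketch is our illustration only, using the same coefficient parametrization and illustrative names as in the earlier sketch (the paper's exact norm may differ); it compares gradient descent with the direct solve of the same quadratic objective.

# Minimal gradient descent sketch for a strictly convex quadratic objective of
# the type mentioned in comment (7):
#   J(c) = (1/m) * ||F c - y||^2 + lam * ||c||^2,
#   grad J(c) = (2/m) * F^T (F c - y) + 2 * lam * c.
import numpy as np

def fit_gradient_descent(F, y, lam, lr=None, iters=5000):
    m, n = F.shape
    if lr is None:
        # Safe step size 1/L, with L an upper bound on the largest Hessian eigenvalue.
        L = 2 * np.linalg.norm(F, 2) ** 2 / m + 2 * lam
        lr = 1.0 / L
    c = np.zeros(n)
    for _ in range(iters):
        grad = 2 * F.T @ (F @ c - y) / m + 2 * lam * c
        c -= lr * grad
    return c

# Example: compare with the closed-form solution of the same quadratic problem.
rng = np.random.default_rng(4)
F = rng.standard_normal((200, 25))
y = rng.standard_normal(200)
lam = 1e-2
c_gd = fit_gradient_descent(F, y, lam)
c_direct = np.linalg.solve(F.T @ F / 200 + lam * np.eye(25), F.T @ y / 200)
print(np.max(np.abs(c_gd - c_direct)))   # close to 0: both reach the unique minimizer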
4. Lemmas
To give a capacity-independent generalization error for algorithm (31), we need some concepts of convex analysis.
Differentiability. Let H be a Hilbert space and F be a real function on H. We say that F is differentiable at f ∈ H if there is an element of H, denoted ∇F(f), such that for any g ∈ H, there holds
and we write F′(f) or ∇F(f). It is known that, for a differentiable convex function F, an element minimizes F on H if and only if its gradient vanishes there (see Proposition 17.4 in [79]).
To prove the main results, we need some lemmas.
Lemma 1.
Let H be a Hilbert space, ξ be a random variable on with values in H, and let be independent samples drawn according to ρ. Assume that almost surely. Denote . Then, for any , with confidence , there holds
Proof.
See [76]. □
Lemma 2.
Let H be a reproducing kernel Hilbert space over X with respect to the kernel K. If E and F are closed subspaces of H such that and , then K = L + M, where L and M are the reproducing kernels of E and F, respectively. Moreover, for , we have
Proof.
See Corollary 1 in Chapter 31 of [67] or the Theorem in Section 6 in part I of [69]. □
Lemma 3.
There hold the following equalities:
and
Proof of (44). By the equality
we have
Since and by the definition of the derivative, we have from the above equality that
We then have (44). In the same way, we can obtain (45). □
Lemma 4.
Let be a Hilbert space consisting of real functions on X. Then,
and
Proof.
Equality (47) is a rearrangement of the parallelogram law. Equality (48) can be shown with (47).
□
Lemma 5.
Framework (31) has a unique solution and (32) has a unique solution . Moreover, there holds the bound
where κ is defined as in (30).
There holds the equality
and the equality
Proof.
Lemma 6.
The solutions and satisfy the inequality
where
5. Proofs of Theorems and Propositions
Proof of Theorem 1.
If is not dense in , i.e.,
then by (b) in Proposition 1, we know , and there is a nonzero functional such that
Proof of Theorem 2.
For a nonnegative function satisfying (a) and , or (b) , we define a near-best-approximation operator as
Then, by [80], we know
, and for any , there hold ,
and
where
In the same way, we define a near-best-approximation operator as
Then, it is known (see Lemma 4.1.1 in [66] or Theorem 2.6.3 in [64]) that and for , and there is a constant such that for any
Since , we have
On the other hand, by (16), we have for and that
Since , we have by (21) that
Define an operator as
Then, and
It follows that
where
Because of (22) and (20), we have
where we have used the fact that . It follows that
where we have used equality (18) and . Substituting (62) into (61), we then have
Since depends upon n and , we can choose sufficiently large l and N such that
Also, since for , we have for sufficiently large n that
Proof of Proposition 4.
By the definition of and the definition of the kernel in (26), we have for any that
The reproducing property (27) then holds. We now show (28). In fact, by Cauchy’s inequality, we have
On the other hand, by the Minkowski inequality and inequality (23), we have
where we have used (17), (57), and (25). Inequality (28) thus holds, where we have also used (16). □
Proof of Corollary 2.
Since is defined as in Proposition 2, we know and , and condition (25) is satisfied with ; we then have the results of Corollary 2 by Proposition 4.
□
Proof of Proposition 5.
Since , we have
where
and
Author Contributions
Conceptualization, X.R.; methodology, B.S.; validation, X.R.; formal analysis, X.R.; resources, B.S. and S.W.; writing—original draft preparation, X.R. and B.S.; writing—review and editing, B.S. and S.W.; supervision, B.S. All authors have read and agreed to the published version of the manuscript.
Funding
The work is supported by the National Natural Science Foundation of China under Grants No. 61877039, the NSFC/RGC Joint Research Scheme (Project No. 12061160462 and N_CityU102/20) of China and Natural Science Foundation of Jiangxi Province of China (20232BAB201021).
Data Availability Statement
The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
- Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.-V.; Norouzi, M.; Macherey, W.; Cao, Y.; Gao, Q. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv 2016, arXiv:1609.08144. [Google Scholar]
- Alipanahi, B.; Delong, A.; Weirauch, M.T.; Frey, B.J. Predicting the sequence specificities of DNA-and RNA-binding proteins by deep learning. Nat. Biotechnol. 2015, 33, 831–838. [Google Scholar] [CrossRef] [PubMed]
- Chui, C.K.; Lin, S.-B.; Zhou, D.-X. Construction of neural networks for realization of localized deep learning. arXiv 2018, arXiv:1803.03503. [Google Scholar] [CrossRef]
- Chui, C.K.; Lin, S.-B.; Zhou, D.-X. Deep neural networks for rotation-invariance approximation and learning. Anal. Appl. 2019, 17, 737–772. [Google Scholar] [CrossRef]
- Fang, Z.-Y.; Feng, H.; Huang, S.; Zhou, D.-X. Theory of deep convolutional neural networks II: Spherical analysis. Neural Netw. 2020, 131, 154–162. [Google Scholar] [CrossRef]
- Feng, H.; Huang, S.; Zhou, D.-X. Generalization analysis of CNNs for classification on spheres. IEEE Trans. Neural Netw. Learn. Syst. 2021, 34, 6200–6213. [Google Scholar] [CrossRef]
- Zhou, D.-X. Deep distributed convolutional neural networks: Universality. Anal. Appl. 2018, 16, 895–919. [Google Scholar] [CrossRef]
- Zhou, D.-X. Universality of deep convolutional neural networks. Appl. Comput. Harmon. Anal. 2020, 48, 787–794. [Google Scholar] [CrossRef]
- Cucker, F.; Zhou, D.-X. Learning Theory: An Approximation Theory Viewpoint; Cambridge University Press: New York, NY, USA, 2007. [Google Scholar]
- Steinwart, I.; Christmann, A. Support Vector Machines; Springer: New York, NY, USA, 2008. [Google Scholar]
- Cucker, F.; Smale, S. On the mathematical foundations of learning. Bull. Amer. Math. Soc. 2001, 39, 1–49. [Google Scholar] [CrossRef]
- An, C.-P.; Chen, X.-J.; Sloan, I.H.; Womersley, R.S. Regularized least squares approximations on the sphere using spherical designs. SIAM J. Numer. Anal. 2012, 50, 1513–1534. [Google Scholar] [CrossRef][Green Version]
- An, C.-P.; Wu, H.-N. Lasso hyperinterpolation over general regions. SIAM J. Sci. Comput. 2021, 43, A3967–A3991. [Google Scholar] [CrossRef]
- An, C.-P.; Ran, J.-S. Hard thresholding hyperinterpolation over general regions. arXiv 2023, arXiv:2209.14634. [Google Scholar]
- De Mol, C.; De Vito, E.; Rosasco, L. Elastic-net regularization in learning theory. J. Complex. 2009, 25, 201–230. [Google Scholar] [CrossRef]
- Fischer, S.; Steinwart, I. Sobolev norm learning rates for regularized least-squares algorithms. J. Mach. Learn. Res. 2020, 21, 8464–8501. [Google Scholar]
- Lai, J.-F.; Li, Z.-F.; Huang, D.-G.; Lin, Q. The optimality of kernel classifiers in Sobolev space. arXiv 2024, arXiv:2402.01148. [Google Scholar]
- Sun, H.-W.; Wu, Q. Least square regression with indefinite kernels and coefficient regularization. Appl. Comput. Harmon. Anal. 2011, 30, 96–109. [Google Scholar] [CrossRef]
- Wu, Q.; Zhou, D.-X. Learning with sample dependent hypothesis spaces. Comput. Math. Appl. 2008, 56, 2896–2907. [Google Scholar] [CrossRef]
- Chen, H.; Wu, J.-T.; Chen, D.-R. Semi-supervised learning for regression based on the diffusion matrix. Sci. Sin. Math. 2014, 44, 399–408. (In Chinese) [Google Scholar]
- Sun, X.-J.; Sheng, B.-H. The learning rate of kernel regularized regression associated with a correntropy-induced loss. Adv. Math. 2024, 53, 633–652. [Google Scholar]
- Wu, Q.; Zhou, D.-X. Analysis of support vector machine classification. J. Comput. Anal. Appl. 2006, 8, 99–119. [Google Scholar]
- Sheng, B.-H. Reproducing property of bounded linear operators and kernel regularized least square regressions. Int. J. Wavelets Multiresolut. Inf. Process. 2024, 22, 2450013. [Google Scholar] [CrossRef]
- Lin, S.-B.; Wang, D.; Zhou, D.-X. Sketching with spherical designs for noisy data fitting on spheres. SIAM J. Sci. Comput. 2024, 46, A313–A337. [Google Scholar] [CrossRef]
- Lin, S.-B.; Zeng, J.-S.; Zhang, X.-Q. Constructive neural network learning. IEEE Trans. Cybern. 2019, 49, 221–232. [Google Scholar] [CrossRef]
- Mhaskar, H.N.; Micchelli, C.A. Degree of approximation by neural and translation networks with single hidden layer. Adv. Appl. Math. 1995, 16, 151–183. [Google Scholar] [CrossRef]
- Sheng, B.-H.; Zhou, S.-P.; Li, H.-T. On approximation by translation networks in Lp(Rk) spaces. Adv. Math. 2007, 36, 29–38. [Google Scholar]
- Mhaskar, H.N.; Narcowich, F.J.; Ward, J.D. Approximation properties of zonal function networks using scattered data on the sphere. Adv. Comput. Math. 1999, 11, 121–137. [Google Scholar] [CrossRef]
- Sheng, B.-H. On approximation by reproducing kernel spaces in weighted Lp-spaces. J. Syst. Sci. Complex. 2007, 20, 623–638. [Google Scholar] [CrossRef]
- Parhi, R.; Nowak, R.D. Banach space representer theorems for neural networks and ridge splines. J. Mach. Learn. Res. 2021, 22, 1–40. [Google Scholar]
- Oono, K.; Suzuki, Y.J. Approximation and non-parameteric estimate of ResNet-type convolutional neural networks. arXiv 2023, arXiv:1903.10047. [Google Scholar]
- Shen, G.-H.; Jiao, Y.-L.; Lin, Y.-Y.; Huang, J. Non-asymptotic excess risk bounds for classification with deep convolutional neural networks. arXiv 2021, arXiv:2105.00292. [Google Scholar]
- Mallat, S. Understanding deep convolutional networks. Phil. Trans. R. Soc. A 2016, 374, 20150203. [Google Scholar] [CrossRef] [PubMed]
- Narcowich, F.J.; Ward, J.D.; Wendland, H. Sobolev error estimates and a Bernstein inequality for scattered data interpolation via radial basis functions. Constr. Approx. 2006, 24, 175–186. [Google Scholar] [CrossRef]
- Narcowich, F.J.; Ward, J.D. Scattered data interpolation on spheres: Error estimates and locally supported basis functions. SIAM J. Math. Anal. 2002, 33, 1393–1410. [Google Scholar] [CrossRef]
- Narcowich, F.J.; Sun, X.P.; Ward, J.D.; Wendland, H. Direct and inverse Sobolev error estimates for scattered data interpolation via spherical basis functions. Found. Comput. Math. 2007, 7, 369–390. [Google Scholar] [CrossRef]
- Gröchenig, K. Sampling, Marcinkiewicz-Zygmund inequalities, approximation and quadrature rules. J. Approx. Theory 2020, 257, 105455. [Google Scholar] [CrossRef]
- Gia, Q.T.L.; Mhaskar, H.N. Localized linear polynomial operators and quadrature formulas on the sphere. SIAM J. Numer. Anal. 2008, 47, 440–466. [Google Scholar] [CrossRef]
- Xu, Y. The Marcinkiewicz-Zygmund inequalities with derivatives. Approx. Theory Its Appl. 1991, 7, 100–107. [Google Scholar] [CrossRef]
- Szegö, G. Orthogonal Polynomials; American Mathematical Society: New York, NY, USA, 1967. [Google Scholar]
- Mhaskar, H.N.; Narcowich, F.J.; Ward, J.D. Spherical Marcinkiewicz-Zygmund inequalities and positive quadrature. Math. Comput. 2001, 70, 1113–1130, Corrigendum in Math. Comp. 2001, 71, 453–454. [Google Scholar] [CrossRef]
- Dai, F. On generalized hyperinterpolation on the sphere. Proc. Amer. Math. Soc. 2006, 134, 2931–2941. [Google Scholar] [CrossRef]
- Mhaskar, H.N.; Narcowich, F.J.; Sivakumar, N.; Ward, J.D. Approximation with interpolatory constraints. Proc. Amer. Math. Soc. 2001, 130, 1355–1364. [Google Scholar] [CrossRef]
- Xu, Y. Mean convergence of generalized Jacobi series and interpolating polynomials, II. J. Approx. Theory 1994, 76, 77–92. [Google Scholar] [CrossRef]
- Marzo, J. Marcinkiewicz-Zygmund inequalities and interpolation by spherical harmonics. J. Funct. Anal. 2007, 250, 559–587. [Google Scholar] [CrossRef]
- Marzo, J.; Pridhnani, B. Sufficient conditions for sampling and interpolation on the sphere. Constr. Approx. 2014, 40, 241–257. [Google Scholar] [CrossRef][Green Version]
- Wang, H.P. Marcinkiewicz-Zygmund inequalities and interpolation by spherical polynomials with respect to doubling weights. J. Math. Anal. Appl. 2015, 423, 1630–1649. [Google Scholar] [CrossRef]
- Gia, T.L.; Sloan, I.H. The uniform norm of hyperinterpolation on the unit sphere in an arbitrary number of dimensions. Constr. Approx. 2001, 17, 249–265. [Google Scholar] [CrossRef]
- Sloan, I.H. Polynomial interpolation and hyperinterpolation over general regions. J. Approx. Theory 1995, 83, 238–254. [Google Scholar] [CrossRef]
- Sloan, I.H.; Womersley, R.S. Constructive polynomial approximation on the sphere. J. Approx. Theory 2000, 103, 91–118. [Google Scholar] [CrossRef]
- Wang, H.-P. Optimal lower estimates for the worst case cubature error and the approximation by hyperinterpolation operators in the Sobolev space setting on the sphere. Int. J. Wavelets Multiresolut. Inf. Process. 2009, 7, 813–823. [Google Scholar] [CrossRef]
- Wang, H.-P.; Wang, K.; Wang, X.-L. On the norm of the hyperinterpolation operator on the d-dimensional cube. Comput. Math. Appl. 2014, 68, 632–638. [Google Scholar]
- Sloan, I.H.; Womersley, R.S. Filtered hyperinterpolation: A constructive polynomial approximation on the sphere. Int. J. Geomath. 2012, 3, 95–117. [Google Scholar] [CrossRef]
- Bondarenko, A.; Radchenko, D.; Viazovska, M. Well-separated spherical designs. Constr. Approx. 2015, 41, 93–112. [Google Scholar] [CrossRef]
- Hesse, K.; Womersley, R.S. Numerical integration with polynomial exactness over a spherical cap. Adv. Comput. Math. 2012, 36, 451–483. [Google Scholar] [CrossRef]
- Delsarte, P.; Goethals, J.M.; Seidel, J.J. Spherical codes and designs. Geom. Dedicata 1977, 6, 363–388. [Google Scholar] [CrossRef]
- An, C.-P.; Chen, X.-J.; Sloan, I.H.; Womersley, R.S. Well conditioned spherical designs for integration and interpolation on the two-sphere. SIAM J. Numer. Anal. 2010, 48, 2135–2157. [Google Scholar] [CrossRef]
- Chen, X.; Frommer, A.; Lang, B. Computational existence proof for spherical t-designs. Numer. Math. 2010, 117, 289–305. [Google Scholar] [CrossRef]
- An, C.-P.; Wu, H.-N. Bypassing the quadrature exactness assumption of hyperinterpolation on the sphere. J. Complex. 2024, 80, 101789. [Google Scholar] [CrossRef]
- An, C.-P.; Wu, H.-N. On the quadrature exactness in hyperinterpolation. BIT Numer. Math. 2022, 62, 1899–1919. [Google Scholar] [CrossRef]
- Sun, X.-J.; Sheng, B.-H.; Liu, L.; Pan, X.-L. On the density of translation networks defined on the unit ball. Math. Found. Comput. 2024, 7, 386–404. [Google Scholar] [CrossRef]
- Wang, H.-P.; Wang, K. Optimal recovery of Besov classes of generalized smoothness and Sobolev class on the sphere. J. Complex. 2016, 32, 40–52. [Google Scholar] [CrossRef]
- Dai, F.; Xu, Y. Approximation Theory and Harmonic Analysis on Spheres and Balls; Springer: New York, NY, USA, 2013. [Google Scholar]
- Müller, C. Spherical Harmonics; Springer: Berlin/Heidelberg, Germany, 1966. [Google Scholar]
- Wang, K.-Y.; Li, L.-Q. Harmonic Analysis and Approximation on the Unit Sphere; Science Press: New York, NY, USA, 2000. [Google Scholar]
- Cheney, W.; Light, W. A Course in Approximation Theory; China Machine Press: Beijing, China, 2004. [Google Scholar]
- Dai, F.; Wang, H.-P. Positive cubature formulas and Marcinkiewicz-Zygmund inequalities on spherical caps. Constr. Approx. 2010, 31, 1–36. [Google Scholar] [CrossRef][Green Version]
- Aronszajn, N. Theory of reproducing kernels. Trans. Amer. Math. Soc. 1950, 68, 337–404. [Google Scholar] [CrossRef]
- Lin, S.-B.; Wang, Y.-G.; Zhou, D.-X. Distributed filtered hyperinterpolation for noisy data on the sphere. SIAM J. Numer. Anal. 2021, 59, 634–659. [Google Scholar] [CrossRef]
- Montúfar, G.; Wang, Y.-G. Distributed learning via filtered hyperinterpolation on manifolds. Found. Comput. Math. 2022, 22, 1219–1271. [Google Scholar] [CrossRef]
- Sheng, B.-H.; Wang, J.-L. Moduli of smoothness, K-functionals and Jackson-type inequalities associated with kernel function approximation in learning theory. Anal. Appl. 2024, 22, 981–1022. [Google Scholar] [CrossRef]
- Christmann, A.; Xiang, D.-H.; Zhou, D.-X. Total stability of kernel methods. Neurocomputing 2018, 289, 101–118. [Google Scholar] [CrossRef]
- Sheng, B.-H.; Liu, H.-X.; Wang, H.-M. The learning rate for the kernel regularized regression (KRR) with a differentiable strongly convex loss. Commun. Pure Appl. Anal. 2020, 19, 3973–4005. [Google Scholar] [CrossRef]
- Wang, S.-H.; Sheng, B.-H. Error analysis of kernel regularized pairwise learning with a strongly convex loss. Math. Found. Comput. 2023, 6, 625–650. [Google Scholar] [CrossRef]
- Smale, S.; Zhou, D.-X. Learning theory estimates via integral operators and their applications. Constr. Approx. 2007, 26, 153–172. [Google Scholar] [CrossRef]
- Lin, S.-B. Integral operator approaches for scattered data fitting on sphere. arXiv 2024, arXiv:2401.15294. [Google Scholar]
- Feng, H.; Lin, S.-B.; Zhou, D.-X. Radial basis function approximation with distributively stored data on spheres. Constr. Approx. 2024, 60, 1–31. [Google Scholar] [CrossRef]
- Bauschke, H.H.; Combettes, P.L. Convex Analysis and Monotone Operator Theory in Hilbert Spaces; Springer: New York, NY, USA, 2010. [Google Scholar]
- Kyriazis, G.; Petrushev, P.; Xu, Y. Jacobi decomposition of weighted Triebel-Lizorkin and Besov spaces. Stud. Math. 2008, 186, 161–202. [Google Scholar] [CrossRef]
- Chen, W.; Ditzian, Z. Best approximation and K-functionals. Acta Math. Hung. 1997, 75, 165–208. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).