1. Introduction
We consider the following regression model:
$$Y_i = f(X_i) + \varepsilon_i, \quad i = 1, \dots, n, \qquad (1)$$
where $f = f(t)$, $t \in [0,1]^k$, is an unknown real-valued random function (random field) that is continuous on a compact set with probability 1. The design $X_1, \dots, X_n$ consists of a set of observable random $k$-dimensional vectors taking values in $[0,1]^k$ with possibly unknown distributions, and it is not necessarily independent or identically distributed. We consider the design as an array of random vectors that may depend on $n$; in particular, this scheme includes models with a fixed design. It is not assumed that the random function $f$ is independent of the design. Some conditions on the random errors $\varepsilon_i$ are given below.
This paper is devoted to the construction of uniformly consistent (in the sense of convergence in probability) kernel-type estimators for the regression function under minimal assumptions on the dependence of the design points.
Let us review publications related to kernel estimation in the problem under consideration. We do not aspire to give a comprehensive overview of this actively developing area of nonparametric estimation; we only indicate publications representing the main trends in this direction. The most popular procedures for kernel estimation in the classical case of a nonrandom regression function are apparently related to the Nadaraya–Watson, local polynomial (in particular, local linear), Priestley–Chao, and Gasser–Müller estimators, as well as their modifications (see, for example, the books [1,2,3,4,5]).
We are primarily interested in the conditions on the design elements, and in this regard, we note that, traditionally, in regression problems, it is customary to consider deterministic and random designs separately. Part of this division seems to be due to the different approaches to the study of estimators in the two cases. Moreover, initially, there was a certain specialization of kernel-type estimators by design type: the Nadaraya–Watson estimators were studied, for example, only in the case of a random design, while the Priestley–Chao and Gasser–Müller estimators were studied for the one-dimensional nonrandom case. Subsequently, many generalizations were obtained in this direction, and the boundaries mentioned above became blurred (see, for example, [6,7,8,9]). The Nadaraya–Watson estimators in the case of a nonrandom regular multidimensional design were studied, for example, in [10].
In the case of a deterministic design, one or another regularity condition on the design is assumed in the vast majority of works (see, for example, [10,11,12,13,14,15,16,17,18,19]). In papers dealing with a random design, independent identically distributed design elements are often considered [1,11,20,21,22,23,24,25,26,27,28,29,30]. Over the past decades, however, many forms of dependence of random variables have been proposed, and the corresponding limit theorems for sequences with such properties (as well as probabilistic inequalities) have been proved. This development in probability theory has fully affected nonparametric regression as well: as design elements, one often considers samples from a stationary sequence of random variables satisfying one or another known form of dependence. In particular, to construct the design elements, authors have used various mixing conditions, moving average schemes, associated random variables, Markov or martingale properties, etc. In this regard, we note, for example, the papers [1,18,21,31,32,33,34,35,36,37,38,39,40,41]. The recent papers [42,43,44,45,46,47,48] consider nonstationary sequences of design elements with certain special forms of dependence (Markov chains, autoregression, partial sums of moving averages, etc.). The uniform consistency of kernel-type estimators of the regression function, for both deterministic and random designs, has been studied by many authors (see, for example, [11,16,18,23,24,25,26,36,38,39,42,43,44,49,50] and the references therein).
It is worth noting that, when observations are dependent by the very nature of the stochastic experiment, the precise form of this dependence is difficult to determine from real sample data. In this regard, developing new approaches to the statistical analysis of large samples of dependent observations that do not satisfy the classical mixing conditions or other known forms of dependence, as well as studying new, statistically clearer and better justified forms of dependence, is of interest not only from a theoretical point of view but is also relevant and especially important for applications.
In the present paper, we continue to develop the concept of dense data proposed in [51,52,53]. In those papers, it is established that, in order to recover the regression function, it suffices to know noisy values of this function on a set of points that is dense (in one sense or another) in the domain of the regression function. We construct new kernel-type estimators using special weighted sums of observations that have the structure of Riemann integral sums and, in the case of a dense design, are close to the corresponding integrals; here, the stochastic nature of the design points plays no role. In previous papers, in the case of a random design, the dense filling of the regression function domain with design points was guaranteed by various concrete forms of weak dependence of the design points, and the asymptotic properties of the estimators were studied using the corresponding probabilistic limit theorems. It is important to emphasize that the new estimators are uniformly consistent not only in the cases of weak dependence mentioned above but also under significantly different dependence structures of the observations, when neither ergodicity nor stationarity holds and the classical mixing conditions and other known dependence conditions fail (see Example 2 in Section 2 below). In addition, the new estimators are universal with respect to the nature of the design: it can be fixed (and not necessarily regular) or random (and not necessarily satisfying the traditional dependence conditions).
Note that the proposed estimators belong to the class of local linear kernel estimators, but with weights other than those used in the classical version. These weights are given by the Lebesgue measures of the elements of some finite random partition of the design sample space, where each element of the partition corresponds to one design point. In this paper, explicit upper bounds on the rate of uniform convergence in probability of the new estimators to the regression function are obtained simultaneously for fixed and random designs, while, in contrast to previously known results, the rate of convergence of our estimators is insensitive to the dependence structure of the design points. The only design characteristic explicitly included in the resulting upper bounds is the minimal radius of the ε-net formed by the design elements in the regression function domain. This minimal radius is a qualitatively different characteristic compared with previously known ones in terms of which sufficient conditions for the uniform consistency of kernel-type estimators can be described; its advantage over classical weak dependence characteristics is that it is insensitive to the correlation of the design observations. The main requirement is that this radius tends to zero in probability as the number of observations increases without bound. Such a requirement, as noted above, is essentially necessary: only when the design densely fills the domain of the regression function can the function be recovered with any prescribed accuracy.
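To make the construction concrete, the following is a minimal numerical sketch (our illustration, not the authors' code) for $k = 2$: the weights are approximated by the Lebesgue measures of the sup-norm Voronoi cells of the design points, one admissible partition of the kind described above, and the estimate of $f(t)$ is the intercept of a weighted local linear fit. All function names and tuning constants here are our assumptions.

```python
# A minimal sketch (ours) of the proposed estimator: a local linear fit whose
# weights are the Lebesgue measures of a partition of [0,1]^k with exactly one
# design point per cell (here, the sup-norm Voronoi cells of the design).
import numpy as np

def epanechnikov(u):
    # symmetric Lipschitz density supported on [-1, 1]
    return 0.75 * np.maximum(1.0 - u * u, 0.0)

def product_kernel(t):
    # K(t) = prod_l kappa(t_l); vanishes outside the cube [-1, 1]^k
    return np.prod(epanechnikov(t), axis=-1)

def voronoi_weights(X, n_mc=100_000, chunk=2_000, seed=0):
    # Monte Carlo approximation of the Lebesgue measures of the Voronoi cells
    # of the design points inside [0, 1]^k
    rng = np.random.default_rng(seed)
    n, k = X.shape
    counts = np.zeros(n, dtype=np.int64)
    for _ in range(n_mc // chunk):
        U = rng.random((chunk, k))                        # uniform probe points
        d = np.max(np.abs(U[:, None, :] - X[None, :, :]), axis=2)
        counts += np.bincount(np.argmin(d, axis=1), minlength=n)
    return counts / ((n_mc // chunk) * chunk)

def local_linear_estimate(t, X, Y, h, w):
    # weighted least squares fit of an affine function near t; the estimate
    # of f(t) is the intercept of the fitted hyperplane
    W = product_kernel((X - t) / h) * w                   # kernel * cell measure
    Z = np.hstack([np.ones((len(X), 1)), X - t])          # regressors [1, X_i - t]
    return np.linalg.solve(Z.T @ (W[:, None] * Z), Z.T @ (W * Y))[0]

# toy run: f(t) = sin(2*pi*t_1) + t_2, noisy observations, random design, k = 2
rng = np.random.default_rng(1)
X = rng.random((500, 2))
Y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] + 0.1 * rng.standard_normal(500)
print(local_linear_estimate(np.array([0.3, 0.5]), X, Y, h=0.15,
                            w=voronoi_weights(X)))
```

Any other measurable partition with one design point per cell and uniformly small cell diameters could be substituted for the Voronoi choice.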
Previously, similar ideas were implemented in [51,52] for local constant estimators and in [53] for local linear estimators with a univariate design. The estimators from [53] are a particular case (for $k = 1$) of the estimators proposed here. Note that the construction of the estimators from [53], in which the weights of the weighted least squares method are the spacings constructed from the variational series of the design points, has no direct generalization to the case of functions of several variables. Similar conditions on the design elements were also used in [54,55] for nonparametric regression and in [56,57,58] for nonlinear regression. In particular, in [54,55], similar conditions were proposed for the Nadaraya–Watson estimators, but those conditions guarantee only pointwise consistency. In [54,55,59], conditions for the uniform consistency of the Nadaraya–Watson estimators and the classical local linear estimators are obtained in terms of dense data, but those conditions are not as simple as the ones in the present paper and require a more uniform dense filling of the regression function domain with design points than is required here.
Note also that model (1) assumes that the unknown function $f$ is a random process (random field) with almost surely continuous trajectories. This statement, more general than the classical one, is considered partly in order to apply the obtained results to estimating the mean and covariance functions of a continuous random field. In connection with random regression functions, we note, for example, the recent works [60,61,62,63,64,65,66,67], in which the mean and covariance functions of the random regression function $f$ are estimated from $N$ independent realizations of $f$, with noisy values of each of these random curves observed at some set of design points (the design can be either common to all trajectories or vary from series to series). We consider one of the variants of this problem as an application of the main result. Previously, we considered some statements of this problem in the case of a univariate design (see [51,53,68,69,70]).
To conclude the Introduction, it is worth noting that all kernel methods suffer from the so-called “curse of dimensionality”: as the dimension of the design points increases, the convergence rates of kernel methods decrease. Therefore, in higher dimensions, kernel methods require relatively large samples to achieve the required accuracy. In this regard, reducing the dimension of the design by selecting relevant factors plays an important role. As a guide in this direction, we point out the recent paper [71], where one can find a detailed bibliography.
This paper has the following structure. Section 2 contains the main results. In Section 3, as a corollary, the problem of estimating the mean function in the dense design case is considered. The proofs of the results are found in Section 4.
The authors are currently preparing a continuation of this work that will provide both computer simulations of the statistical procedures proposed here and examples of processing real data.
2. Main Results
2.1. Notation and Main Assumptions
We agree that vectors are column vectors. Vectors are denoted by bold letters, and matrices by straight capital letters. Denote by $\operatorname{diag}\{a_1, \dots, a_k\}$ the diagonal matrix of dimension $k \times k$ with the corresponding elements on the main diagonal. The symbol $\top$ denotes transposition of a vector or matrix, and $\det A$ denotes the determinant of a matrix $A$. Denote by $\operatorname{mes}(\cdot)$ the Lebesgue measure in $\mathbb{R}^k$. For any vector $x = (x_1, \dots, x_k)^\top$, the symbol $\|x\|$ means the sup-norm in $\mathbb{R}^k$, i.e., $\|x\| = \max_{l \le k} |x_l|$. For an arbitrary matrix $A$, as a matrix norm, we consider the norm subordinate to the vector sup-norm, i.e., $\|A\| = \max_{l \le k} \sum_{m \le k} |a_{lm}|$, where the symbol $a_{lm}$ here and below denotes the matrix entry at the intersection of the $l$-th row and the $m$-th column. Notice that we may consider any column vector as a matrix of dimension $k \times 1$, and its sup-norm coincides with the matrix norm above. Therefore, we use one symbol to denote these two norms.
By $\operatorname{diam}(A)$, we denote the diameter of a set $A$, i.e., $\operatorname{diam}(A) = \sup_{x, y \in A} \|x - y\|$. We also need notation for the modulus of continuity of a function $f$ defined on the unit $k$-dimensional cube $[0,1]^k$:
$$\omega_f(\delta) = \sup\{|f(t) - f(s)| :\ t, s \in [0,1]^k,\ \|t - s\| \le \delta\}.$$
Next, by $O(g(\varepsilon))$, we denote a random variable $\zeta_\varepsilon$ such that, for all $\varepsilon \le \varepsilon_0$, one has $|\zeta_\varepsilon| \le \eta\, g(\varepsilon)$, where $\varepsilon_0$ and $\eta$ are positive (maybe random or not) variables and the function $g$ does not depend on the parameters of the model under consideration. We agree that, throughout what follows, all limits, unless otherwise stated, are taken as $n \to \infty$. In what follows, without loss of generality, we assume that $h \le 1$.
We now formulate the following four main assumptions.
The observations $\{Y_i;\ i = 1, \dots, n\}$ have the structure (1), where $f(t)$, $t \in [0,1]^k$, is an unknown real-valued random function on $[0,1]^k$ that is continuous with probability 1. The design points $X_1, \dots, X_n$ are a set of observable $k$-dimensional random variables with, generally speaking, unknown distributions and with values in $[0,1]^k$, not necessarily independent or identically distributed; in addition, the random variables $X_i$ may depend on $n$.
For each $n$, the unobservable random errors $\{\varepsilon_i;\ i \le n\}$ form a sequence of martingale differences whose conditional moments of order $p$ are bounded by a constant that does not depend on $n$. It is also assumed that the sequence of errors is independent of the design and of the random function $f$; moreover, the random variables $\varepsilon_i$ may depend on $n$.
The kernel function $K(t)$, $t \in \mathbb{R}^k$, vanishes outside the cube $[-1, 1]^k$ and can be represented as $K(t) = \prod_{l=1}^{k} \kappa(t_l)$ for $t = (t_1, \dots, t_k)^\top$, where $\kappa$ is a symmetric distribution density with the support $[-1, 1]$. We assume that the function $\kappa$ satisfies the Lipschitz condition with some constant.
In what follows, we need the notation $K_h(t) := h^{-k} K(t / h)$. It is clear that $K_h$ is a distribution density on the cube $[-h, h]^k$.
The following condition is the only limitation on the design.
For each $n$, there exists a random partition of the set $[0,1]^k$ into $n$ Jordan-measurable subsets $\Delta_{ni}$ such that every element of this partition contains exactly one point from the set $\{X_1, \dots, X_n\}$ (the numbering of the partition elements is such that $X_i \in \Delta_{ni}$), and $\delta_n := \max_{i \le n} \operatorname{diam}(\Delta_{ni}) \to 0$ in probability. Here, it is assumed that the diameters $\operatorname{diam}(\Delta_{ni})$ are random variables, i.e., measurable mappings of the probability space.
Remark 1. Traditionally, a family of nonempty sets forms a partition of the set $[0,1]^k$ if the elements of the family are pairwise disjoint and their union is $[0,1]^k$. Let us agree that, in the condition above, the elements of the partition are allowed to intersect along sets of zero Lebesgue measure (for example, along boundaries). This reservation allows us not to exclude the situation of multiple points in the design. In the case of pairwise distinct design points, the reservation is not required. Note also that, without the above convention, the condition can be formulated analogously in terms of a partition in the traditional sense.
Remark 2. The condition above means that, for any $n$, the set of design points forms an $\varepsilon$-net of $[0,1]^k$ for every $\varepsilon \ge \delta_n$.
Remark 3. Note that the condition above is satisfied for any nonrandom regular design. If the design points are independent and identically distributed and the set $[0,1]^k$ is the support of their distribution, then the condition is also fulfilled. If, additionally, the distribution density is separated from zero on $[0,1]^k$, then $\delta_n \to 0$ with probability 1. If the design is a stationary sequence satisfying an $\alpha$-mixing condition and the support of the marginal distribution is $[0,1]^k$, then the condition is also satisfied. Note that all kinds of dependence of design points known to the authors satisfy this condition, but its fulfillment is also quite possible for other types of dependence not yet described in the modern literature on nonparametric regression (see Example 2 below).
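As a numerical aside (ours), the design characteristic just discussed can be approximated directly: the radius of the $\varepsilon$-net formed by the design points is, up to grid resolution, the largest sup-norm distance from a point of $[0,1]^k$ to its nearest design point. The helper below and its parameters are illustrative assumptions.

```python
# A sketch (ours) approximating the epsilon-net radius of a design in [0,1]^k:
# the largest sup-norm distance from a grid point to its nearest design point.
import numpy as np

def net_radius(X, grid_pts=21):
    n, k = X.shape
    axes = np.meshgrid(*([np.linspace(0.0, 1.0, grid_pts)] * k))
    grid = np.stack(axes, axis=-1).reshape(-1, k)
    d = np.max(np.abs(grid[:, None, :] - X[None, :, :]), axis=2)
    return np.min(d, axis=1).max()

# for an i.i.d. uniform design, the radius visibly tends to zero as n grows
rng = np.random.default_rng(0)
for n in (100, 1_000, 10_000):
    print(n, round(net_radius(rng.random((n, 2))), 4))
```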
We introduce the following class of estimators for the regression function $f$:
$$\widehat{f}_n(t) = \mathbf{e}_1^{\top}\, \widehat{\boldsymbol{\theta}}_n(t), \qquad (4)$$
where $\widehat{\boldsymbol{\theta}}_n(t)$ is the weighted least squares solution described in Remark 4 below and $\mathbf{e}_1$ is the $(k+1)$-dimensional vector such that the first coordinate equals 1 and the other ones vanish.
Remark 4. It is easy to check that the kernel estimator (4) is the first coordinate of the $(k+1)$-dimensional estimator delivered by the variant (5) of the weighted least squares method. Thus, the proposed class of estimators is, in a certain sense (in fact, by construction), close to the classical multidimensional local linear kernel estimators, but in the weighted least squares problem (5), we use slightly different weights.

2.2. The Main Theorem, Corollaries, and Examples
The following theorem is the main result of this paper. It allows us to construct a confidence region (tube) for an unknown regression function.
Theorem 1. Let the above conditions be satisfied. Then, for any fixed $h$, with probability 1, relation (6) is valid with a nonnegative random variable satisfying bound (7), where the constants defined in (36) and in Lemmas 6 and 8 depend on $k$ and on the kernel $K$; furthermore, the last of these constants additionally depends on $p$.

Remark 5. Let $f$ be a nonrandom regression function, and substitute its (nonrandom) modulus of continuity into (7). Applying Markov's power inequality with exponent $p$ to the second term in (7), it is easy to see that, under the conditions of the above theorem, the right-hand side of (7) admits an explicit upper bound (with a positive constant depending on $k$, $p$, and $K$), and there exists a solution $h_n$ to Equation (8). It is clear that this solution tends to zero as $n$ grows. In fact, the quantity $h_n$ minimizes in $h$ the order of smallness of the right-hand side of relation (6). Notice that, by virtue of (8), the corresponding limit relations are valid.

Theorem 1 implies the following two assertions.
Corollary 1. Let the above conditions be fulfilled, and let the regression function belong to a set of equicontinuous nonrandom functions from the space $C([0,1]^k)$. Then, the uniform convergence asserted in Theorem 1 holds with $h = h_n$, where $h_n$ is a solution to Equation (8) in which the modulus of continuity is replaced with the universal modulus of the equicontinuous class. Moreover, the corresponding limit relation is valid.

Corollary 2. If the above conditions are fulfilled and the modulus of continuity of the random regression function satisfies, with probability 1, the bound $\omega_f(\delta) \le \zeta\,\psi(\delta)$ for some proper random variable $\zeta$ and a positive nonrandom function $\psi$ such that $\psi(\delta) \to 0$ as $\delta \to 0$, then the corresponding limit relation holds with $h = h_n$, where $h_n$ is a solution to Equation (8) in which the modulus of continuity is replaced with $\psi$.

Example 1. Consider the setting of Corollary 2, where the modulus of continuity of the regression function admits a bound of the indicated form with some proper random variable $\zeta$. In particular, if the regression function is a Wiener process on $[0,1]$ and the independent identically distributed random errors have a normal distribution with zero mean, then, for all sufficiently small $h$, the corresponding explicit bounds hold for arbitrarily small positive values of the auxiliary parameters.

Example 2. Let a sequence of bivariate random variables be defined by relation (9), where the two auxiliary sequences are independent and uniformly distributed on two rectangles whose union is the unit square, and the sequence of switches is independent of them and consists of Bernoulli random variables with success probability $1/2$; i.e., the distribution of each design point is an equilibrium mixture of the two uniform distributions on the corresponding rectangles. The dependence between the design points is defined, for any natural $i$, by the equalities in (9). In this case, the random variables in (9) form a stationary sequence of random variables uniformly distributed on the square $[0,1]^2$; but, say, all known mixing conditions fail here since the corresponding mixing coefficients do not vanish for all natural $m$ and $n$. On the other hand, it is easy to check that this stationary sequence satisfies the Glivenko–Cantelli theorem. This means that, for any fixed $t$ and $h$, the proportion of design points in the corresponding cube approaches its Lebesgue measure almost surely and uniformly in $t$; here, $\#$ is the counting measure, and the constants involved do not depend on $t$ and $h$. In other words, the sequence satisfies the design condition of Section 2.1.

In the general case, according to the scheme of this example, one can construct various sequences of dependent random variables uniformly distributed over $[0,1]^2$ by choosing various sequences of Bernoulli switches for which each of the two rectangles is visited along an unbounded sequence of indices. In this case, the design condition is also satisfied, but the corresponding sequence (not necessarily stationary) may fail to satisfy the strong law of large numbers. For example, this situation occurs when we choose at random the rectangle into which the first point is thrown and then alternate the number of throws between the two rectangles as follows: 2, 4, 8, 16, and so on. Indeed, splitting the indices into the collections corresponding to observations in the first and the second rectangle, one sees that the asymptotic densities of these collections oscillate between two distinct limits along the ends of the blocks; hence, by the strong law of large numbers for the two auxiliary sequences, the averages in (10) oscillate almost surely between two different limit values as $n \to \infty$ on the event that the first point falls into the first rectangle. Similar conclusions are valid for all outcomes constituting the complementary event.
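The block-switching scheme at the end of Example 2 is easy to simulate. In the sketch below (our code; we take the two rectangles to be the lower and upper halves of the unit square, one admissible choice), the first point lands in a randomly chosen rectangle, after which the throws alternate in blocks of 2, 4, 8, 16, and so on; the running fraction of points in the lower half then oscillates instead of converging, so the strong law of large numbers fails although the square is filled densely.

```python
# A simulation sketch (ours) of the block-switching design from Example 2.
import numpy as np

def switching_design(n, rng):
    halves = ((0.0, 0.5), (0.5, 1.0))      # y-ranges of the two rectangles
    side = int(rng.integers(2))             # fair coin for the first block
    pts, block = [], 1                      # block sizes: 1, 2, 4, 8, ...
    while len(pts) < n:
        lo, hi = halves[side]
        m = min(block, n - len(pts))
        xy = np.column_stack([rng.random(m), lo + (hi - lo) * rng.random(m)])
        pts.extend(xy)
        side, block = 1 - side, 2 * block   # switch rectangle, double block
    return np.asarray(pts)

rng = np.random.default_rng(42)
X = switching_design(4096, rng)

# the running fraction of points in the lower half oscillates, so the SLLN
# fails, even though the square [0,1]^2 is filled densely
frac = np.cumsum(X[:, 1] < 0.5) / np.arange(1, len(X) + 1)
print([round(frac[2**j - 1], 3) for j in range(8, 13)])
```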
3. Estimating the Mean and Covariance of a Random Regression Function with a Dense Design
In this section, as an application of Theorem 1, we consider one of the variants of the problem of estimating the mean function of an almost surely continuous random process. The estimation of the mean and covariance functions plays an important role, and many recent works have been devoted to this problem (see, for example, [60,67,72,73] and the references therein). As in the classical formulation of the nonparametric regression problem of the form (1), the design in the problem of estimating the mean and covariance functions is considered to be either random or deterministic. For a random design, as a rule, it is assumed that the design points are independent identically distributed random variables (see, for example, [60,61,62,63,64,65,66,74,75]). In the case of a deterministic design, one or another regularity condition is usually used; for example, a nonrandom equidistant design was discussed in [61]. In addition, in the problem of estimating the mean function, it is customary to subdivide the design into certain types depending on the number of design points per trajectory.
The literature focuses on two opposing types of data: the design is in one sense or another “sparse” (for example, the number of design points for each realization of the regression process is uniformly bounded; see [60,61,62,74]), or the design is somewhat “dense” (the number of design elements in each series grows with the number of series; see [60,62,65,74]). In Theorem 2 below, the second of these types of design is considered, provided only that the design condition of Section 2 is satisfied in each of the independent series. Note that our formulation of the problem of estimating the mean function also includes the situation of a general deterministic design. The approaches to mean function estimation used for dense and sparse data are often different (see, for example, [73,76]). In the case of a growing number of observations in each series, it is natural to first estimate the random regression function in each series and then average these estimators over all series (see, for example, [61,65]). This is exactly what we do next (see Formula (12) below), following this generally accepted approach. The uniform consistency of mean function estimators was studied, for example, in [60,62,64,66,72].
Thus, we consider the following statement of the problem. We have $N$ independent copies of model (1) satisfying the conditions of Section 2:
$$Y_{ij} = f_j(X_{ij}) + \varepsilon_{ij}, \quad i = 1, \dots, n, \quad j = 1, \dots, N, \qquad (11)$$
where the $f_j$, $j = 1, \dots, N$, are independent identically distributed unobservable random processes with a.s. continuous trajectories, and the collections of design points and random errors satisfy, for every $j$, the conditions of Section 2. Here and below in this section, the index $j$ denotes the copy number of model (1). We define an estimator for the mean function by the equality
$$\widehat{f}(t) = \frac{1}{N} \sum_{j=1}^{N} \widehat{f}_j(t), \qquad (12)$$
where the estimators $\widehat{f}_j$ are determined by Formula (4), with the values from (1) replaced by the corresponding characteristics from (11).
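Continuing the numerical sketch from the Introduction (with `local_linear_estimate` and `voronoi_weights` as defined there; all names are our assumptions), the averaging in (12) can be written as follows.

```python
# A schematic (ours) of the averaging in (12): estimate the regression
# function separately in each independent series, then average over series;
# local_linear_estimate and voronoi_weights are the sketches given in the
# Introduction.
import numpy as np

def mean_function_estimate(t, series, h):
    # series: a list of (X_j, Y_j) pairs, one pair per independent copy (11)
    return float(np.mean([local_linear_estimate(t, X, Y, h, voronoi_weights(X))
                          for X, Y in series]))
```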
A corollary of Theorem 1 is the following assertion.
Theorem 2. For model (11), let the conditions of Section 2 be fulfilled in each series and let condition (13) hold. Moreover, let a sequence and a subsequence of naturals satisfy conditions (14). Then, the estimator (12) is uniformly consistent for the mean function.

Remark 6. If condition (13) is replaced with a slightly stronger constraint, then, by (14), one can show the uniform consistency of the corresponding estimator of the mixed second moment, where the quantities involved are defined in (14). The proof of this fact is similar to the proof of Theorem 2, and thus, we omit the detailed arguments. In other words, under the above conditions, the corresponding estimator is uniformly consistent for the covariance function of the random regression field.
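Under the same assumptions, the covariance estimator discussed in Remark 6 admits an analogous schematic form (again our sketch, not the authors' code).

```python
# A schematic (ours) of the covariance estimator discussed in Remark 6,
# built from the mixed second moment and the mean estimates.
import numpy as np

def covariance_estimate(u, v, series, h):
    fu = np.array([local_linear_estimate(u, X, Y, h, voronoi_weights(X))
                   for X, Y in series])
    fv = np.array([local_linear_estimate(v, X, Y, h, voronoi_weights(X))
                   for X, Y in series])
    # averaged mixed second moment minus the product of the mean estimates
    return float(np.mean(fu * fv) - np.mean(fu) * np.mean(fv))
```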
4. Proofs

Throughout this section, we assume that the conditions of Section 2 are satisfied. To prove Theorem 1, we need a number of auxiliary assertions; these use the quantities introduced in the definitions (15)–(18).
Remark 7. We emphasize that, due to the support properties of the density $K_h$, the summation domains in all sums in (15) and (16), as well as in all sums in the formulas given below in (41), (42), (44), and (51), coincide with the set defined in (19), and the domains of integration in (17) and (18), respectively, coincide with the corresponding set. These facts are fundamental for the further analysis.

Lemma 1. Under the above conditions, relations (20)–(23) are valid.

Proof. Relation (21) obviously follows from the representation of the kernel introduced in the kernel condition above. The statements (20), (22), and (23) follow from Lemma 1 in [53] and the abovementioned representation for the kernel function. □
Lemma 2. On the subset of elementary events defined by the corresponding relation, the inequalities (24)–(26) hold, and on the complementary subset of elementary events, the estimates (27) and (28) hold for any admissible parameters.

Proof. To derive the first estimates in (24)–(26), one should take into account Remark 7 and the relations indicated there. The second estimates in (24)–(26) follow from the well-known error estimate for the approximation of integrals of Lipschitz functions by the corresponding Riemann integral sums, with the Lipschitz constants of the functions involved taken into account. It is easy to verify that, under the conditions of the lemma, the estimates (29) and (30) hold. To complete the derivation of (24)–(26), we need to take into account the definitions (15)–(18) and the estimates (29) and (30). Finally, (27) follows from the first relation in (22) and the second relation in (24). When deriving the last assertion (28) of the lemma, we use the assertions of Lemmas 1 and 2. □
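The key approximation mechanism used in this proof, namely that Riemann-type sums over a partition with one design point per cell approximate the corresponding integrals of Lipschitz functions up to the maximal cell diameter, is easy to verify numerically; the following sketch (ours, for $k = 1$ and an illustrative Lipschitz function) checks the bound.

```python
# Numerical check (ours) of the Riemann-sum error bound for a Lipschitz g:
# |sum_i g(X_i) mu(cell_i) - int g| <= L * max_i diam(cell_i) * mu([0,1]).
import numpy as np

def g(x):
    return np.abs(np.sin(5.0 * x))           # Lipschitz with constant L = 5
L = 5.0

rng = np.random.default_rng(0)
x = np.sort(rng.random(1_000))               # design points in [0, 1]
edges = np.concatenate(([0.0], (x[:-1] + x[1:]) / 2.0, [1.0]))
lengths = np.diff(edges)                     # cell measures; x[i] lies in cell i

riemann = np.sum(g(x) * lengths)             # sum_i g(X_i) * mu(cell_i)
mid = (np.arange(1_000_000) + 0.5) / 1_000_000
exact = g(mid).mean()                        # fine midpoint value of the integral
print(abs(riemann - exact), "<=", L * lengths.max())
```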
In what follows, we also need to establish the Lipschitz property of the functions introduced above.
Lemma 3. On the set of elementary outcomes defined by the corresponding relation, the inequalities (31)–(33) hold for any admissible arguments, and on the set of elementary outcomes satisfying the complementary condition, the inequality (34) holds.

Proof. First of all, we note that the normalized kernel $K_h$ satisfies the Lipschitz condition with the corresponding constant, and for the sets defined in (19), under the conditions of the lemma, we have the required estimate. From here, we easily obtain (31). The proof of estimate (32) differs only slightly from the one just shown. Similarly, estimate (33) is proved after dividing the original sum into three subsums; it follows directly from the upper bounds (24) and (25) for the values under consideration, as well as from (31) and (32), and we only deduce the main estimate. To prove (34), we only need to use estimate (33), as well as the uniform lower bound, on the set of outcomes defined by the corresponding inequality (see (27)). After this, it remains to add up the obtained estimates and carry out elementary calculations, which we omit. □
We define the entries of a square matrix of size $k$ as in (35). Note that, by virtue of the Cauchy–Bunyakovsky inequality, the corresponding bound holds for all entries. We also need some further notation: it is easy to see that the difference involved is the variance of a nondegenerate distribution and is thus strictly positive; in addition, the corresponding identity holds.
The following assertion holds.
Lemma 4. On the set of outcomes defined by the corresponding relation, the stated bound on the inverse of the matrix introduced above is valid.

Proof. Consider the auxiliary diagonal square matrix of size $k$ with the entries indicated above. By Lemma 1, the corresponding representation holds, and the determinant of this diagonal matrix with positive entries admits the required lower estimate. In deriving the subsequent estimate, we use the definition of the matrix determinant and estimate (28), and we take into account the fact that, due to Lemmas 1 and 2, the corresponding bounds hold for all entries. Hence, on the set of elementary outcomes defined by the corresponding relation, relation (39) holds. The assertion of the lemma can now be easily derived directly from the definition of the inverse matrix and relation (39). □
We also need notation for several $n$-dimensional vectors. In this notation, the estimator in (4) admits the representation (40). Using definitions (2) and (3) and the first notation in (15), one can rewrite the corresponding matrix and the vectors involved as in (41) and (42).
The following assertion holds.
Lemma 5. The following identity is valid.

Proof. We use the Frobenius formula to invert a nondegenerate block matrix, in which the upper-left block is a square nondegenerate matrix and the remaining diagonal block is square of the complementary size. In this formula, we substitute the blocks arising from representation (41). Then, in view of (41) and the Frobenius formula, the first component of the vector of interest to us takes the required form. □
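For the reader's convenience, the Frobenius identity used above can be stated in generic block notation (our rendering of a standard fact):
$$\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1}
=
\begin{pmatrix}
A^{-1} + A^{-1} B H^{-1} C A^{-1} & -A^{-1} B H^{-1} \\
-H^{-1} C A^{-1} & H^{-1}
\end{pmatrix},
\qquad H = D - C A^{-1} B,$$
which is valid whenever the block $A$ and the Schur complement $H$ are nondegenerate.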
Lemma 6. On the set of elementary outcomes defined by the corresponding relation, the stated estimate is valid.

Proof. We use definition (43). Next, using the Frobenius formula (see the proof of Lemma 5), we obtain the identity (44). Consider each of the terms on the right-hand side of this relation. Notice that, for the standard inner product $\langle \cdot, \cdot \rangle$ in $\mathbb{R}^k$ and the sup-norm under consideration, we have the inequality $|\langle x, y \rangle| \le k \|x\| \|y\|$. Therefore, the estimates (45) hold. Further, taking into account Remark 7, we obtain the remaining bounds. The above relations, together with the identity (44) and the estimates (45), complete the proof of the lemma. □
Lemma 7. For any admissible arguments, on the set of outcomes defined by the corresponding relation, the stated estimates are valid, where $\mathbf{e}_r$ is the $k$-dimensional vector with unit $r$-th component and zeros as the other ones.

Proof. The first assertion of the lemma follows from Lemma 4, relations (45), and the corresponding auxiliary estimate. To prove the second assertion, first of all, we use the representation of the inverse matrix in which the entries of the adjoint matrix are, up to the signs $\pm$, the complementary minors of the original matrix. As is known, the determinant of any square matrix $A$ of size $k$ with entries $a_{lm}$, $l, m = 1, \dots, k$, can be represented as the multilinear form
$$\det A = \sum_{\sigma} (-1)^{\operatorname{inv}(\sigma)} \prod_{l=1}^{k} a_{l \sigma(l)}, \qquad (46)$$
where the summation is over all permutations $\sigma$ of the natural numbers $1, \dots, k$, and $\operatorname{inv}(\sigma)$ is the number of inversions in a particular permutation $\sigma$. We need Formula (46) in order to represent the determinant and the complementary minors of the matrix in the inverse matrix, respectively, as $k$- and $(k-1)$-linear forms of type (46) constructed from the entries (35).
To prove this, we show that, in the definition of the function under consideration, one term is Lipschitz while the other summand is locally Lipschitz; the method is the same as that in Lemma 3 (see the proof of (34)). First, we note that representation (47) holds, where the quantities involved are the complementary minors to the corresponding entries of the matrix. Note that, by virtue of notation (43), relation (27) can be rewritten accordingly. Further, to calculate the Lipschitz constant for the ratio in (47), we need the lower bounds (39) and (48) for the denominators of the fractions on the right-hand side of (47), which are valid on the corresponding set of elementary outcomes; here, we take into account the fact that the lower bound in (39) has the appropriate order of smallness. Moreover, Lemmas 2 and 3, together with representation (46) (with suitably replaced elements), allow us to assert that the upper bound for the determinant has the same order of smallness in $h$ and that the corresponding Lipschitz constant is estimated from above accordingly. At the same time, the complementary minors are estimated from above, with a Lipschitz constant of the appropriate order, uniformly over all indices. In fact, we only need to repeat the calculations from the derivation of estimate (34) in order to calculate the Lipschitz constant of the fraction on the right-hand side of (47), from which it follows that the mentioned constant has the required form with an explicit constant factor. For the remaining term, we establish that, in the $h$-neighborhood of the point under consideration, it is locally Lipschitz with a constant of the same form. By analogy with (47), we represent this summand in the corresponding form. Repeating almost verbatim the previous arguments of this proof, we conclude that, in the $h$-neighborhood of any point (under the corresponding condition), the function under consideration satisfies the Lipschitz condition with a constant that is significantly less than that in (49). Therefore, the final constant in the lemma can be set equal to the larger of the two constants obtained. □
Lemma 8. For any admissible parameters, on the subset of elementary events defined by the corresponding relation, the stated estimate holds, where the conditional probability is taken given the $\sigma$-algebra generated by the design, and the constant involved is defined in (61).

Proof. We use the notation (43). Then, from (41), (42), and the Frobenius formula, we obtain the corresponding identity. In virtue of estimate (27), Lemma 2, and Remark 7, the bound (52) holds. The distribution tail is estimated by Kolmogorov's dyadic chaining (see, for example, [77]). First of all, we note that the set under the supremum sign can be replaced with the set of dyadic rational points. Therefore, relation (53) holds, where $\lceil a \rceil$ denotes the smallest integer greater than or equal to $a$ and the sequence involved consists of positive numbers. To evaluate the probabilities on the right-hand side of (53), we need the martingale inequality (54) (see [78], Theorem 2.1), which applies to martingale differences with finite moments of the corresponding order. To obtain an upper bound for the first probability, we use (54): from (29), Lemma 7, and an elementary estimate, we obtain the required bound with probability 1. Next, taking into account this estimate, the above inequality and (54) imply relation (55), with the constant defined in (56). To estimate the remaining probability, we use inequality (54) once more. Here, we note that the summation domain under consideration coincides with the corresponding set; these facts, Lemma 7, and the corresponding relation are taken into account in the derivation, and to prove (57), we use the second assertion of Lemma 7. Thus, in virtue of inequality (54), we obtain (59), with the constant defined in (60). Now, using relations (53), (55), and (59), we conclude the desired tail bound. The optimal sequence that minimizes the right-hand side of this inequality can be written out explicitly, with the coefficient given by the corresponding equality; for the specified sequence, we obtain the final estimate. This relation, together with estimate (52), completes the derivation of Lemma 8 if we define the constant in (61) via the constants in (56) and (60). □
We now complete the proof of Theorem 1. We choose the parameter defined via (50) and, together with Lemma 8, take into account the corresponding limit relation. It remains to use the identity (40) and the assertions of Lemmas 5 and 6. Theorem 1 is proved.
To prove Theorem 2, we need the following auxiliary assertion.
Lemma 9. If condition (13) is satisfied, then the first assertion stated below holds, and for independent copies of an almost surely continuous random process, the uniform law of large numbers (62) is valid.

Proof. The first assertion of the lemma follows from (13) and Lebesgue's dominated convergence theorem. For simplicity, let $k = 1$. For an arbitrary fixed $r$, the corresponding relations hold. Now, notice the two auxiliary bounds indicated above; therefore, the right-hand side of the penultimate relation does not exceed the required quantity, and due to the arbitrariness of $r$ and the first assertion of the lemma, relation (62) is proved for $k = 1$. For arbitrary $k$, the derivation of the statement of the lemma is similar. □
Proof of Theorem 2. We first prove the version (64) of the law of large numbers for the quantities under consideration, where the sequences involved are defined in (14). We introduce the corresponding events and notice that, for any positive number, the bound (65) holds. Next, from Theorem 1, the corresponding estimate follows. To complete the proof of (64), it remains to estimate the first term on the right-hand side of (65) using Markov's inequality, taking into account the limit relations from (14) and the last estimate. The assertion of Theorem 2 now follows from Lemma 9, the limit relation (64), and a simple final estimate. Thus, Theorem 2 is proved. □