Convergence Rates for Empirical Estimation of Binary Classification Bounds

Sekeh, Salimeh Yasaei; Noshad, Morteza; Moon, Kevin R.; Hero, Alfred O.

doi:10.3390/e21121144

Open AccessArticle

Convergence Rates for Empirical Estimation of Binary Classification Bounds

¹

School of Computing and Information Science, University of Maine, Orono, ME 04469, USA

²

Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, USA

³

Department of Mathematics and Statistics, Utah State University, Logan, UT 84322, USA

^*

Author to whom correspondence should be addressed.

Entropy 2019, 21(12), 1144; https://doi.org/10.3390/e21121144

Submission received: 5 November 2019 / Accepted: 15 November 2019 / Published: 23 November 2019

(This article belongs to the Special Issue Robust Procedures for Estimating and Testing in the Framework of Divergence Measures)

Download

Browse Figures

Versions Notes

Abstract

:

Bounding the best achievable error probability for binary classification problems is relevant to many applications including machine learning, signal processing, and information theory. Many bounds on the Bayes binary classification error rate depend on information divergences between the pair of class distributions. Recently, the Henze–Penrose (HP) divergence has been proposed for bounding classification error probability. We consider the problem of empirically estimating the HP-divergence from random samples. We derive a bound on the convergence rate for the Friedman–Rafsky (FR) estimator of the HP-divergence, which is related to a multivariate runs statistic for testing between two distributions. The FR estimator is derived from a multicolored Euclidean minimal spanning tree (MST) that spans the merged samples. We obtain a concentration inequality for the Friedman–Rafsky estimator of the Henze–Penrose divergence. We validate our results experimentally and illustrate their application to real datasets.

Keywords:

classification; Bayes error rate; Henze–Penrose divergence; Friedman–Rafsky test statistic; convergence rates; bias and variance trade-off; concentration bounds; minimal spanning trees

1. Introduction

Divergence measures between probability density functions are used in many signal processing applications including classification, segmentation, source separation, and clustering (see [1,2,3]). For more applications of divergence measures, we refer to [4].

In classification problems, the Bayes error rate is the expected risk for the Bayes classifier, which assigns a given feature vector

x

to the class with the highest posterior probability. The Bayes error rate is the lowest possible error rate of any classifier for a particular joint distribution. Mathematically, let

x_{1}, x_{2}, \dots, x_{N} \in R^{d}

be realizations of random vector

X

and class labels

S \in {0, 1}

, with prior probabilities

p = P (S = 0)

and

q = P (S = 1)

, such that

p + q = 1

. Given conditional probability densities

f_{0} (x)

and

f_{1} (x)

, the Bayes error rate is given by

ϵ = \int_{R^{d}} min \{p f_{0} (x), q f_{1} (x)\} d x .

(1)

The Bayes error rate provides a measure of classification difficulty. Thus, when known, the Bayes error rate can be used to guide the user in the choice of classifier and tuning parameter selection. In practice, the Bayes error is rarely known and must be estimated from data. Estimation of the Bayes error rate is difficult due to the nonsmooth min function within the integral in (1). Thus, research has focused on deriving tight bounds on the Bayes error rate based on smooth relaxations of the min function. Many of these bounds can be expressed in terms of divergence measures such as the Bhattacharyya [5] and Jensen–Shannon [6]. Tighter bounds on the Bayes error rate can be obtained using an important divergence measure known as the Henze–Penrose (HP) divergence [7,8].

Many techniques have been developed for estimating divergence measures. These methods can be broadly classified into two categories: (i) plug-in estimators in which we estimate the probability densities and then plug them in the divergence function [9,10,11,12], (ii) entropic graph approaches, in which the relationship between the divergence function and a graph functional in Euclidean space is derived [8,13]. Examples of plug-in methods include k-nearest neighbor (K-NN) and Kernel density estimator (KDE) divergence estimators. Examples of entropic graph approaches include methods based on minimal spanning trees (MST), K-nearest neighbors graphs (K-NNG), minimal matching graphs (MMG), traveling salesman problem (TSP), and their power-weighted variants.

Disadvantages of plug-in estimators are that these methods often require assumptions on the support set boundary and are more computationally complex than direct graph-based approaches. Thus, for practical and computational reasons, the asymptotic behavior of entropic graph approaches has been of great interest. Asymptotic analysis has been used to justify graph based approaches. For instance, in [14], the authors showed that a cross match statistic based on optimal weighted matching converges to the the HP-divergence. In [15], a more complex approach based on the K-NNG was proposed that also converges to the HP-divergence.

The first contribution of our paper is that we obtain a bound on the convergence rates for the Friedman and Rafsky (FR) estimator of the HP-divergence, which is based on a multivariate extension of the non-parametric run length test of equality of distributions. This estimator is constructed using a multicolored MST on the labeled training set where MST edges connecting samples with dichotomous labels are colored differently from edges connecting identically labeled samples. While previous works have investigated the FR test statistic in the context of estimating the HP-divergence (see [8,16]), to the best of our knowledge, its minimax MSE convergence rate has not been previously derived. The bound on convergence rate is established by using the umbrella theorem of [17], for which we define a dual version of the multicolor MST. The proposed dual MST in this work is different than the standard dual MST introduced by Yukich in [17]. We show that the bias rate of the FR estimator is bounded by a function of N,

η

and d, as

O ({(N)}^{- η^{2} / (d (η + 1))})

, where N is the total sample size, d is the dimension of the data samples

d \geq 2

, and

η

is the Hölder smoothness parameter

0 < η \leq 1

. We also obtain the variance rate bound as

O ({(N)}^{- 1})

.

The second contribution of our paper is a new concentration bound for the FR test statistic. The bound is obtained by establishing a growth bound and a smoothness condition for the multicolored MST. Since the FR test statistic is not a Euclidean functional, we cannot use the standard subadditivity and superadditivity approaches of [17,18,19]. Our concentration inequality is derived using a different Hamming distance approach and a dual graph to the multicolored MST.

We experimentally validate our theoretic results. We compare the MSE theory and simulation in three experiments with various dimensions

d = 2, 4, 8

. We observe that, in all three experiments, as sample size increases, the MSE rate decreases and, for higher dimensions, the rate is slower. In all sets of experiments, our theory matches the experimental results. Furthermore, we illustrate the application of our results on estimation of the Bayes error rate on three real datasets.

1.1. Related Work

Much research on minimal graphs has focused on the use of Euclidean functionals for signal processing and statistics applications such as image registration [20,21], pattern matching [22], and non-parametric divergence estimation [23]. A K-NNG-based estimator of Rényi and f-divergence measures has been proposed in [13]. Additional examples of direct estimators of divergence measures include statistic based on the nonparametric two sample problem, the Smirnov maximum deviation test [24], and the Wald–Wolfowitz [25] runs test, which have been studied in [26].

Many entropic graph estimators such as MST, K-NNG, MMG, and TSP have been considered for multivariate data from a single probability density f. In particular, the normalized weight function of graph constructions all converge almost surely to the Rényi entropy of f [17,27]. For N uniformly distributed points, the MSE is

O (N^{- 1 / d})

[28,29]. Later, Hero et al. [30,31] reported bounds on

L_{γ}

-norm bias convergence rates of power-weighted Euclidean weight functionals of order

γ

for densities f belonging to the space of Hölder continuous functions

Σ_{d} (η, K)

as

O (N^{- α η / (α η + 1) 1 / d})

, where

0 < η \leq 1

,

d \geq 1

,

γ \in (1, d)

, and

α = (d - γ) / d

. In this work, we derive a bound on convergence rate of FR estimator for the HP-divergence when the density functions belong to the Hölder class,

Σ_{d} (η, K)

, for

0 < η \leq 1

,

d \geq 2

[32]. Note that throughout the paper we assume the density functions are absolutely continuous and bounded with support on the unit cube

{[0, 1]}^{d}

.

In [28], Yukich introduced the general framework of continuous and quasi-additive Euclidean functionals. This has led to many convergence rate bounds of entropic graph divergence estimators.

The framework of [28] is as follows: Let F be finite subset of points in

{[0, 1]}^{d}

,

d \geq 2

, drawn from an underlying density. A real-valued function

L_{γ}

defined on F is called a Euclidean functional of order

γ

if it is of the form

L_{γ} (F) = min_{E \in E} \sum_{e \in E} {| e (F) |}^{γ}

, where

E

is a set of graphs, e is an edge in the graph E,

| e |

is the Euclidean length of e, and

γ

is called the edge exponent or power-weighting constant. The MST, TSP, and MMG are some examples for which

γ = 1

.

Following this framework, we show that the FR test statistic satisfies the required continuity and quasi-additivity properties to obtain similar convergence rates to those predicted in [28]. What distinguishes our work from previous work is that the count of dichotomous edges in the multicolored MST is not Euclidean. Therefore, the results in [17,27,30,31] are not directly applicable.

Using the isoperimetric approach, Talagrand [33] showed that, when the Euclidean functional

L_{γ}

is based on the MST or TSP, then the functional

L_{γ}

for derived random vertices uniformly distributed in a hypercube

{[0, 1]}^{d}

is concentrated around its mean. Namely, with high probability, the functional

L_{γ}

and its mean do not differ by more than

C {(N log N)}^{(d - γ) / 2 d}

. In this paper, we establish concentration bounds for the FR statistic: with high probability

1 - δ

, the FR statistic differs from its mean by not more than

O ({(N)}^{(d - 1) / d} {(log (C / δ))}^{(d - 1) / d})

, where C is a function of N and d.

1.2. Organization

This paper is organized as follows. In Section 2, we first introduce the HP-divergence and the FR multivariate test statistic. We then present the bias and variance rates of the FR-based estimator of HP-divergence followed by the concentration bounds and the minimax MSE convergence rate. Section 3 provides simulations that validate the theory. All proofs and relevant lemmas are given in the Appendix A, Appendix B, Appendix C, Appendix D and Appendix E.

Throughout the paper, we denote expectation by

E

and variance by abbreviation

Var

. Bold face type indicates random variables. In this paper, when we say number of samples we mean number of observations.

2. The Henze–Penrose Divergence Measure

Consider parameters

p \in (0, 1)

and

q = 1 - p

. We focus on estimating the HP-divergence measure between distributions

f_{0}

and

f_{1}

with domain

R^{d}

defined by

D_{p} (f_{0}, f_{1}) = \frac{1}{4 p q} [\int \frac{{(p f_{0} (x) - q f_{1} (x))}^{2}}{p f_{0} (x) + q f_{1} (x)} d x - {(p - q)}^{2}] .

(2)

It can be verified that this measure is bounded between 0 and 1 and, if

f_{0} (x) = f_{1} (x)

, then

D_{p} = 0

. In contrast with some other divergences such as the Kullback–Liebler [34] and Rényi divergences [35], the HP-divergence is symmetrical, i.e.,

D_{p} (f_{0}, f_{1}) = D_{q} (f_{1}, f_{0})

. By invoking relation (3) in [8],

\begin{matrix} \int \frac{{(p f_{0} (x) - q f_{1} (x))}^{2}}{p f_{0} (x) + q f_{1} (x)} d x = 1 - 4 p q A_{p} (f_{0}, f_{1}), \end{matrix}

where

\begin{matrix} \begin{matrix} A_{p} (f_{0}, f_{1}) & = & \int \frac{f_{0} (x) f_{1} (x)}{p f_{0} (x) + q f_{1} (x)} d x = E_{f_{0}} [{(p \frac{f_{0} (X)}{f_{1} (X)} + q)}^{- 1}], \\ u_{p} (f_{0}, f_{1}) & = & 1 - 4 p q A_{p} (f_{0}, f_{1}), \end{matrix} \end{matrix}

one can rewrite

D_{p}

in the alternative form:

\begin{matrix} D_{p} (f_{0}, f_{1}) = 1 - A_{p} (f_{0}, f_{1}) = \frac{u_{p} (f_{0}, f_{1})}{4 p q} - \frac{{(p - q)}^{2}}{4 p q} . \end{matrix}

Throughout the paper, we refer to

A_{p} (f_{0}, f_{1})

as the HP-integral. The HP-divergence measure belongs to the class of

ϕ

-divergences [36]. For the special case

p = 0.5

, the divergence (2) becomes the symmetric

χ^{2}

-divergence and is similar to the Rukhin f-divergence. See [37,38].

2.1. The Multivariate Runs Test Statistic

The MST is a graph of minimum weight among all graphs

E

that span n vertices. The MST has many applications including pattern recognition [39], clustering [40], nonparametric regression [41], and testing of randomness [42]. In this section, we focus on the FR multivariate two sample test statistic constructed from the MST.

Assume that sample realizations from

f_{0}

and

f_{1}

, denoted by

X_{m} \in R^{m \times d}

and

Y_{n} \in R^{n \times d}

, respectively, are available. Construct an MST spanning the samples from both

f_{0}

and

f_{1}

and color the edges in the MST that connect dichotomous samples green and color the remaining edges black. The FR test statistic

R_{m, n} : = R_{m, n} (X_{m}, Y_{n})

is the number of green edges in the MST. Note that the test assumes a unique MST, therefore all inter point distances between data points must be distinct. We recall the following theorem from [7,8]:

Theorem 1.

As

m \to \infty

and

n \to \infty

such that

\frac{m}{n + m} \to p

and

\frac{n}{n + m} \to q

,

\begin{matrix} 1 - R_{m, n} (X_{m}, Y_{n}) \frac{m + n}{2 m n} \to D_{p} (f_{0}, f_{1}), a . s . \end{matrix}

(3)

In the next section, we obtain bounds on the MSE convergence rates of the FR approximation for HP-divergence between densities that belong to

Σ_{d} (η, K)

, the class of Hölder continuous functions with Lipschitz constant K and smoothness parameter

0 < η \leq 1

[32]:

Definition 1

(Hölder class). Let

X \subset R^{d}

be a compact space. The Hölder class

Σ_{d} (η, K)

, with η-Hölder parameter, of functions with the

L_{d}

-norm, consists of the functions g that satisfy

\begin{matrix} \begin{matrix} \{g : ∥ g (z) - p_{x}^{⌊ η ⌋} (z) ∥_{d} \leq K ∥ x - z ∥_{d}^{η}, x, z \in X\}, \end{matrix} \end{matrix}

(4)

where

p_{x}^{k} (z)

is the Taylor polynomial (multinomial) of g of order k expanded about the point

x

and

⌊ η ⌋

is defined as the greatest integer strictly less than η.

In what follows, we will use both notations

R_{m, n}

and

R_{m, n} (X_{m}, Y_{n})

for the FR statistic over the combined samples.

2.2. Convergence Rates

In this subsection, we obtain the mean convergence rate bounds for general non-uniform Lebesgue densities

f_{0}

and

f_{1}

belonging to the Hölder class

Σ_{d} (η, K)

. Since the expectation of

R_{m, n}

can be closely approximated by the sum of the expectation of the FR statistic constructed on a dense partition of

{[0, 1]}^{d}

,

R_{m, n}

is a quasi-additive functional in mean. The family of bounds (A16) in Appendix B enables us to achieve the minimax convergence rate for the mean under the Hölder class assumption with smoothness parameter

0 < η \leq 1

,

d \geq 2

:

Theorem 2

(Convergence Rate of the Mean). Let

d \geq 2

, and

R_{m, n}

be the FR statistic for samples drawn from Hölder continuous and bounded density functions

f_{0}

and

f_{1}

in

Σ_{d} (η, K)

. Then, for

d \geq 2

,

\begin{matrix} |\frac{E [R_{m, n}]}{m + n} - 2 p q \int \frac{f_{0} (x) f_{1} (x)}{p f_{0} (x) + q f_{1} (x)} d x| \leq O ({(m + n)}^{- η^{2} / (d (η + 1))}) . \end{matrix}

(5)

This bound holds over the class of Lebesgue densities

f_{0}, f_{1} \in Σ_{d} (η, K)

,

0 < η \leq 1

. Note that this assumption can be relaxed to

f_{0} \in Σ_{d}^{s} (η, K_{0})

and

f_{1} \in Σ_{d}^{s} (η, K_{1})

that is Lebesgue densities

f_{0}

and

f_{1}

belong to the Strong Hölder class with the same Hölder parameter

η

and different constants

K_{0}

and

K_{1}

, respectively.

The following variance bound uses the Efron–Stein inequality [43]. Note that in Theorem 3 we do not impose any strict assumptions. We only assume that the density functions are absolutely continuous and bounded with support on the unit cube

{[0, 1]}^{d}

. Appendix C contains the proof.

Theorem 3.

The variance of the HP-integral estimator based on the FR statistic,

R_{m, n} / (m + n)

is bounded by

\begin{matrix} V a r (\frac{R_{m, n} (X_{m}, Y_{n})}{m + n}) \leq \frac{32 c_{d}^{2} q}{(m + n)}, \end{matrix}

(6)

where the constant

c_{d}

depends only on d.

By combining Theorems 2 and 3, we obtain the MSE rate of the form

O ({m + n)}^{- η^{2} / (d (η + 1))}) + O ({(m + n)}^{- 1})

. Figure 1 indicates a heat map showing the MSE rate as a function of d and

N = m = n

. The heat map shows that the MSE rate of the FR test statistic-based estimator given in (3) is small for large sample size N.

2.3. Proof Sketch of Theorem 2

In this subsection, we first establish subadditivity and superadditivity properties of the FR statistic, which will be employed to derive the MSE convergence rate bound. This will establish that the mean of the FR test statistic is a quasi-additive functional:

Theorem 4.

Let

R_{m, n} (X_{m}, Y_{n})

be the number of edges that link nodes from differently labeled samples

X_{m} = {X_{1}, \dots, X_{m}}

and

Y_{n} = {Y_{1}, \dots, Y_{n}}

in

{[0, 1]}^{d}

. Partition

{[0, 1]}^{d}

into

l^{d}

equal volume subcubes

Q_{i}

such that

m_{i}

and

n_{i}

are the number of samples from

{X_{1}, \dots, X_{m}}

and

{Y_{1}, \dots, Y_{n}}

, respectively, that fall into the partition

Q_{i}

. Then, there exists a constant

c_{1}

such that

\begin{matrix} E [R_{m, n} (X_{m}, Y_{n})] \leq \sum_{i = 1}^{l^{d}} E [R_{m_{i}, n_{i}} ((X_{m}, Y_{n}) \cap Q_{i})] + 2 c_{1} l^{d - 1} {(m + n)}^{1 / d} . \end{matrix}

(7)

Here,

R_{m_{i}, n_{i}}

is the number of dichotomous edges in partition

Q_{i}

. Conversely, for the same conditions as above on partitions

Q_{i}

, there exists a constant

c_{2}

such that

\begin{matrix} E [R_{m, n} (X_{m}, Y_{n})] \geq \sum_{i = 1}^{l^{d}} E [R_{m_{i}, n_{i}} ((X_{m}, Y_{n}) \cap Q_{i})] - 2 c_{2} l^{d - 1} {(m + n)}^{1 / d} . \end{matrix}

(8)

The inequalities (7) and (8) are inspired by corresponding inequalities in [30,31]. The full proof is given in Appendix A. The key result in the proof is the inequality:

R_{m, n} (X_{m}, Y_{n}) \leq \sum_{i = 1}^{l^{d}} R_{m_{i}, n_{i}} ((X_{m}, Y_{n}) \cap Q_{i}) + 2 | D |,

where

| D |

indicates the number of all edges of the MST which intersect two different partitions.

Furthermore, we adapt the theory developed in [17,30] to derive the MSE convergence rate of the FR statistic-based estimator by defining a dual MST and dual FR statistic, denoted by

{MST}^{*}

and

R_{m, n}^{*}

respectively (see Figure 2):

Definition 2

(Dual MST, MST

^{*}

and dual FR statistic

R_{m, n}^{*}

). Let

F_{i}

be the set of corner points of the subsection

Q_{i}

for

1 \leq i \leq l^{d}

. Then, we define

{MST}^{*} (X_{m} \cup Y_{n} \cap Q_{i})

as the boundary MST graph of partition

Q_{i}

[17], which contains

X_{m}

and

Y_{n}

points falling inside the section

Q_{i}

and those corner points in

F_{i}

which minimize total MST length. Notice it is allowed to connect the MSTs in

Q_{i}

and

Q_{j}

through points strictly contained in

Q_{i}

and

Q_{j}

and corner points are taken into account under condition of minimizing total MST length. Another word, the dual MST can connect the points in

Q_{i} \cup Q_{j}

by direct edges to pair to another point in

Q_{i} \cup Q_{j}

or the corner the corner points (we assume that all corner points are connected) in order to minimize the total length. To clarify this, assume that there are two points in

Q_{i} \cup Q_{j}

, then the dual MST consists of the two edges connecting these points to the corner if they are closed to a corner point; otherwise, dual MST consists of an edge connecting one to another. Furthermore, we define

R_{m, n}^{*} (X_{m}, Y_{n} \cap Q_{i})

as the number of edges in an

{MST}^{*}

graph connecting nodes from different samples and number of edges connecting to the corner points. Note that the edges connected to the corner nodes (regardless of the type of points) are always counted in dual FR test statistic

R_{m, n}^{*}

.

In Appendix B, we show that the dual FR test statistic is a quasi-additive functional in mean and

R_{m, n}^{*} (X_{m}, Y_{n}) \geq R_{m, n} (X_{m}, Y_{n})

. This property holds true since

MST (X_{m}, Y_{n})

and

{MST}^{*} (X_{m}, Y_{n})

graphs can only be different in the edges connected to the corner nodes, and in

R^{*} (X_{m}, Y_{n})

we take all of the edges between these nodes and corner nodes into account.

To prove Theorem 2, we partition

{[0, 1]}^{d}

into

l^{d}

subcubes. Then, by applying Theorem 4 and the dual MST, we derive the bias rate in terms of partition parameter l (see (A16) in Theorem A1). See Appendix B and Appendix E for details. According to (A16), for

d \geq 2

, and

l = 1, 2, \dots

, the slowest rates as a function of l are

l^{d} {(m + n)}^{η / d}

and

l^{- η d}

. Therefore, we obtain an l-independent bound by letting l be a function of

m + n

that minimizes the maximum of these rates i.e.,

l (m + n) = a r g min_{l} max \{l^{d} {(m + n)}^{- η / d}, l^{- η d}\} .

The full proof of the bound in (2) is given in Appendix B.

2.4. Concentration Bounds

Another main contribution of our work in this part is to provide an exponential inequality convergence bound derived for the FR estimator of the HP-divergence. The error of this estimator can be decomposed into a bias term and a variance-like term via the triangle inequality:

\begin{matrix} \begin{matrix} |R_{m, n} - \int \frac{f_{0} (x) f_{1} (x)}{p f_{0} (x) + q f_{1} (x)} d x| \leq \underset{variance - like term}{\underset{︸}{|R_{m, n} - E [R_{m, n}]|}} \\ + \underset{bias term}{\underset{︸}{|E [R_{m, n}] - \int \frac{f_{0} (x) f_{1} (x)}{p f_{0} (x) + q f_{1} (x)} d x|}} . \end{matrix} \end{matrix}

The bias bound was given in Theorem 2. Therefore, we focus on an exponential concentration bound for the variance-like term. One application of concentration bounds is to employ these bounds to compare confidence intervals on the HP-divergence measure in terms of the FR estimator. In [44,45], the authors provided an exponential inequality convergence bound for an estimator of Rény divergence for a smooth Hölder class of densities on the d-dimensional unite cube

{[0, 1]}^{d}

. We show that if

X_{m}

and

Y_{n}

are the set of m and n points drawn from any two distributions

f_{0}

and

f_{1}

, respectively, the FR criteria

R_{m, n}

is tightly concentrated. Namely, we establish that, with high probability,

R_{m, n}

is within

1 - O ({(m + n)}^{- 2 / d} {ϵ^{*}}^{2})

of its expected value, where

ϵ^{*}

is the solution of the following convex optimization problem:

\begin{matrix} \min_{ϵ \geq 0} & C_{m, n}^{'} (ϵ) exp (\frac{- {(t / (2 ϵ))}^{d / (d - 1)}}{(m + n) \tilde{C}}) \\ subject to & ϵ \geq O (7^{d + 1} {(m + n)}^{1 / d}), \end{matrix}

(9)

where

\tilde{C} = 8 {(4)}^{d / (d - 1)}

and

\begin{matrix} C_{m, n}^{'} (ϵ) = 8 {(1 - O ({(m + n)}^{- 2 / d} ϵ^{2}))}^{- 2} . \end{matrix}

(10)

Note that, under the assumption

{(m + n)}^{1 / d} ≃ 1

,

C_{m, n}^{'} (ϵ)

becomes a constant depending only on

ϵ

by

8 {(1 - (c ϵ^{2})}^{- 2}

, where c is a constant. This is inferred from Theorems 5 and 6 below as

{(m + n)}^{1 / d} ≃ 1

. See Appendix D, specifically Lemmas A8–A12 for more detail. Indeed, we first show the concentration around the median. A median is by definition any real number

M_{e}

that satisfies the inequalities

P (X \leq M_{e}) \geq 1 / 2

and

P (X \geq M_{e}) \geq 1 / 2

. To derive the concentration results, the properties of growth bounds and smoothness for

R_{m, n}

, given in Appendix D, are exploited.

Theorem 5

(Concentration around the median). Let

M_{e}

be a median of

R_{m, n}

which implies that

P (R_{m, n} \leq M_{e}) \geq 1 / 2

. Recall

ϵ^{*}

from (9) then we have

\begin{matrix} \begin{matrix} P (| R_{m, n} (X_{m}, Y_{n}) - M_{e} | \geq t) \leq C_{m, n}^{'} (ϵ^{*}) exp (\frac{- {(t / ϵ^{*})}^{d / (d - 1)}}{(m + n) \tilde{C}}), \end{matrix} \end{matrix}

(11)

where

\tilde{C} = 8 {(4)}^{d / (d - 1)}

.

Theorem 6

(Concentration of

R_{m, n}

around the mean). Let

R_{m, n}

be the FR statistic. Then,

\begin{matrix} P (| R_{m, n} - E [R_{m, n}] | \geq t) \leq C_{m, n}^{'} (ϵ^{*}) exp (\frac{- {(t / (2 ϵ^{*}))}^{d / (d - 1)}}{(m + n) \tilde{C}}) . \end{matrix}

(12)

Here,

\tilde{C} = 8 {(4)}^{d / (d - 1)}

and the explicit form for

C_{m, n}^{'} (ϵ^{*})

is given by (10) when

ϵ = ϵ^{*}

.

See Appendix D for full proofs of Theorems 5 and 6. Here, we sketch the proofs. The proof of the concentration inequality for

R_{m, n}

, Theorem 6, requires involving the median

M_{e}

, where

P (R_{m, n} \leq M_{e}) \geq 1 / 2

, inside the probability term by using

| R_{m, n} - E [R_{m, n}] | \leq | R_{m, n} - M_{e} | + | E [R_{m, n}] - M_{e} | .

To prove the expressions for the concentration around the median, Theorem 5, we first consider the

h^{d}

uniform partitions of

{[0, 1]}^{d}

, with edges parallel to the coordinate axes having edge lengths

h^{- 1}

and volumes

h^{- d}

. Then, by applying the Markov inequality, we show that with at least probability

1 - (δ_{m, n}^{h} / ϵ)

, where

δ_{m, n}^{h} = O (h^{d - 1} {(m + n)}^{1 / d})

, the FR statistic

R_{m, n}

is subadditive with

2 ϵ

threshold. Afterward, owing to the induction method [17], the growth bound can be derived with at least probability

1 - (h δ_{m, n}^{h} / ϵ)

. The growth bound explains that with high probability there exists a constant depending on

ϵ

and h,

C_{ϵ, h}

, such that

R_{m, n} \leq C_{ϵ, h} {(m n)}^{1 - 1 / d}

. Applying the law of total probability and semi-isoperimetric inequality (A108) in Lemma A11 gives us (A35). By considering the solution to convex optimization problem (9), i.e.,

ϵ^{*}

and optimal

h = 7

the claimed results (11) and (12) are derived. The only constraint here is that

ϵ

is lower bounded by a function of

δ_{m, n}^{h} = O (h^{d - 1} {(m + n)}^{1 / d})

.

Next, we provide a bound for the variance-like term with high probability at least

1 - δ

. According to the previous results, we expect that this bound depends on

ϵ^{*}

, d, m and n. The proof is short and is given in Appendix D.

Theorem 7

(Variance-like bound for

R_{m, n}

). Let

R_{m, n}

be the FR statistic. With at least probability

1 - δ

, we have

\begin{matrix} |R_{m, n} - E [R_{m, n}]| \leq O (ϵ^{*} {(m + n)}^{(d - 1) / d} {(log (C_{m, n}^{'} (ϵ^{*}) / δ))}^{(d - 1) / d}) . \end{matrix}

(13)

or, equivalently,

\begin{matrix} |\frac{R_{m, n}}{m + n} - \frac{E [R_{m, n}]}{m + n}| \leq O (ϵ^{*} {(m + n)}^{- 1 / d} {(log (C_{m, n}^{'} (ϵ^{*}) / δ))}^{(d - 1) / d}), \end{matrix}

(14)

where

C_{m, n}^{'} (ϵ^{*})

depends on

m, n

, and d is given in (10) when

ϵ = ϵ^{*}

.

3. Numerical Experiments

3.1. Simulation Study

In this section, we apply the FR statistic estimate of the HP-divergence to both simulated and real data sets. We present results of a simulation study that evaluates the proposed bound on the MSE. We numerically validate the theory stated in Section 2.2 and Section 2.4 using multiple simulations. In the first set of simulations, we consider two multivariate Normal random vectors

X

,

Y

and perform three experiments

d = 2, 4, 8

, to analyze the FR test statistic-based estimator performance as the sample sizes m, n increase. For the three dimensions

d = 2, 4, 8

, we generate samples from two normal distributions with identity covariance and shifted means:

μ_{1} = [0, 0]

,

μ_{2} = [1, 0]

and

μ_{1} = [0, 0, 0, 0]

,

μ_{2} = [1, 0, 0, 0]

and

μ_{1} = [0, 0, \dots, 0]

,

μ_{2} = [1, 0, \dots, 0]

when

d = 2

,

d = 4

and

d = 8

, respectively. For all of the following experiments, the sample sizes for each class are equal (

m = n

).

We vary

N = m = n

up to 800. From Figure 3, we deduce that, when the sample size increases, the MSE decreases such that for higher dimensions the rate is slower. Furthermore, we compare the experiments with the theory in Figure 3. Our theory generally matches the experimental results. However, the MSE for the experiments tends to decrease to zero faster than the theoretical bound. Since the Gaussian distribution has a smooth density, this suggests that a tighter bound on the MSE may be possible by imposing stricter assumptions on the density smoothness as in [12].

In our next simulation, we compare three bivariate cases: first, we generate samples from a standard Normal distribution. Second, we consider a distinct smooth class of distributions i.e., binomial Gamma density with standard parameters and dependency coefficient

ρ = 0.5

. Third, we generate samples from Standard t-student distributions. Our goal in this experiment is to compare the MSE of the HP-divergence estimator between two identical distributions,

f_{0} = f_{1}

, when

f_{0}

is one of the Gamma, Normal, and t-student density function. In Figure 4, we observe that the MSE decreases as N increases for all three distributions.

3.2. Real Datasets

We now show the results of applying the FR test statistic to estimate the HP-divergence using three different real datasets [46]:

Human Activity Recognition (HAR), Wearable Computing, Classification of Body Postures and Movements (PUC-Rio): This dataset contains five classes (sitting-down, standing-up, standing, walking, and sitting) collected on eight hours of activities of four healthy subjects.
Skin Segmentation dataset (SKIN): The skin dataset is collected by randomly sampling B,G,R values from face images of various age groups (young, middle, and old), race groups (white, black, and asian), and genders obtained from the FERET and PAL databases [47].
Sensorless Drive Diagnosis (ENGIN) dataset: In this dataset, features are extracted from electric current drive signals. The drive has intact and defective components. The dataset contains 11 different classes with different conditions. Each condition has been measured several times under 12 different operating conditions, e.g., different speeds, load moments, and load forces.

We focus on two classes from each of the HAR, SKIN, and ENGIN datasets, specifically, for HAR dataset two classes “sitting” and “standing” and for SKIN dataset the classes “Skin” and “Non-skin” are considered. In the ENGIN dataset, the drive has intact and defective components, which results in 11 different classes with different conditions. We choose conditions 1 and 2.

In the first experiment, we computed the HP-divergence using KDE plug-in estimator and then the MSE for the FR test statistic estimator is derived as the sample size

N = m = n

increases. We used 95% confidence interval as the error bars. We observe in Figure 5 that the estimated HP-divergence ranges in

[0, 1]

, which is one of the HP-divergence properties [8]. Interestingly, when N increases the HP-divergence tends to 1 for all HAR, SKIN, and ENGIN datasets. Note that in this set of experiments we have repeated the experiments on independent parts of the datasets to obtain the error bars. Figure 6 shows that the MSE expectedly decreases as the sample size grows for all three datasets. Here, we have used the KDE plug-in estimator [12], implemented on the all available samples, to determine the true HP-divergence. Furthermore, according to Figure 6, the FR test statistic-based estimator suggests that the Bayes error rate is larger for the SKIN dataset compared to the HAR and ENGIN datasets.

In our next experiment, we add the first six features (dimensions) in order to our datasets and evaluate the FR test statistic’s performance as the HP-divergence estimator. Surprisingly, the estimated HP-divergence doesn’t change for the HAR sample; however, big changes are observed for the SKIN and ENGIN samples (see Figure 7).

Finally, we apply the concentration bounds on the FR test statistic (i.e., Theorems 6 and 7) and compute theoretical implicit variance-like bound for the FR criteria with

δ = 0.05

error for the real datasets ENGIN, HAR, and SKIN. Since datasets ENGIN, HAR, and SKIN have the equal total sample size

N = m + n = 1200

and different dimensions

d = 14, 12, 4

, respectively; here, we first intend to compare the concentration bound (13) on the FR statistic in terms of dimension d when

δ = 0.05

. For real datasets ENGIN, HAR, and SKIN, we obtain

P (| R_{m, n} - E [R_{m, n}] | \leq ξ) \geq 0.95,

where

ξ = ξ^{'} [0.257, 0.005, 0.6 \times 10^{- 11}]

, respectively, and

ξ^{'}

is a constant not dependent on d. One observes that as the dimension decreases the interval becomes significantly tighter. However, this could not be generally correct and computing bound (13) precisely requires the knowledge of distributions and unknown constants. In Table 1, we compute the standard variance-like bound by applying the percentiles technique and observe that the bound threshold is not monotonic in terms of dimension d. Table 1 shows the FR test statistic, HP-divergence estimate (denoted by

R_{m, n}

,

{\hat{D}}_{p}

, respectively), and standard variance-like interval for the FR statistic using the three real datasets HAR, SKIN, and ENGIN.

4. Conclusions

We derived a bound on the MSE convergence rate for the Friedman–Rafsky estimator of the Henze–Penrose divergence assuming the densities are sufficiently smooth. We employed a partitioning strategy to derive the bias rate which depends on the number of partitions, the sample size

m + n

, the Hölder smoothness parameter

η

, and the dimension d. However, by using the optimal partition number, we derived the MSE convergence rate only in terms of

m + n

,

η

, and d. We validated our proposed MSE convergence rate using simulations and illustrated the approach for the meta-learning problem of estimating the HP-divergence for three real-world data sets. We also provided concentration bounds around the median and mean of the estimator. These bounds explicitly provide the rate that the FR statistic approaches its median/mean with high probability, not only as a function of the number of samples, m, n, but also in terms of the dimension of the space d. By using these results, we explored the asymptotic behavior of a variance-like rate in terms of m, n, and d.

Author Contributions

Conceptualization, S.Y.S., M.N. and A.O.H.; methodology, S.Y.S. and M.N.; software, S.Y.S. and M.N.; validation, S.Y.S., M.N., K.R.M. and A.O.H.; formal analysis, S.Y.S., M.N. and K.R.M.; investigation, S.Y.S. and M.N.; resources, S.Y.S. and M.N.; data curation, M.N.; writing—original draft preparation, S.Y.S.; writing—review and editing, M.N., K.R.M. and A.O.H.; supervision, A.O.H.; project administration, A.O.H.; funding acquisition, A.O.H.

Funding

The work presented in this paper was partially supported by ARO grant W911NF-15-1-0479 and DOE grants DE-NA0002534 and DE-NA0003921.

Conflicts of Interest

The authors declare no conflict of interest. The founding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.

Abbreviations

HP	Henze-Penrose
BER	Bayes error rate
MST	Minimal Spanning Tree
FR	Friedman-Rafsky
MSE	Mean squared error

Appendix A. Proof of Theorem 4

In this section, we prove the subadditivity and superadditivity for the mean of FR test statistic. For this, first we need to illustrate the following lemma.

Lemma A1.

Let

{Q_{i}}_{i = 1}^{l^{d}}

be a uniform partition of

{[0, 1]}^{d}

into

l^{d}

subcubes

Q_{i}

with edges parallel to the coordinate axes having edge lengths

l^{- 1}

and volumes

l^{- d}

. Let

D_{i j}

be the set of edges of MST graph between

Q_{i}

and

Q_{j}

with cardinality

| D_{i j} |

, then for

| D |

defined as the sum of

| D_{i j} |

for all

i, j = 1, \dots, l^{d}

,

i \neq j

, we have

E | D | = O (l^{d - 1} n^{1 / d})

, or more explicitly

\begin{matrix} E [| D |] \leq C^{'} l^{d - 1} n^{1 / d} + O (l^{d - 1} n^{(1 / d) - s}), \end{matrix}

(A1)

where

η > 0

is the Hölder smoothness parameter and

s = \frac{(1 - 1 / d) η}{d ((1 - 1 / d) η + 1)} .

Here, and in what follows, denote

Ξ_{M S T} (X_{n})

the length of the shortest spanning tree on

X_{n} = {X_{1}, \dots, X_{n}}

, namely

\begin{matrix} Ξ_{M S T} (X_{n}) : = min_{T} \sum_{e \in T} | e |, \end{matrix}

where the minimum is over all spanning trees T of the vertex set

X_{n}

. Using the subadditivity relation for

Ξ_{M S T}

in [17], with the uniform partition of

{[0, 1]}^{d}

into

l^{d}

subcubes

Q_{i}

with edges parallel to the coordinate axes having edge lengths

l^{- 1}

and volumes

l^{- d}

, we have

\begin{matrix} Ξ_{M S T} (X_{n}) \leq \sum_{i = 1}^{l^{d}} Ξ_{M S T} (X_{n} \cap Q_{i}) + C l^{d - 1}, \end{matrix}

(A2)

where C is constant. Denote D the set of all edges of

M S T (⋃_{i = 1}^{M} Q_{i})

that intersect two different subcubes

Q_{i}

and

Q_{j}

with cardinality

| D |

. Let

| e_{i} |

be the length of i-th edge in set D. We can write

\begin{matrix} \sum_{i \in | D |} | e_{i} | \leq C l^{d - 1} and E \sum_{i \in | D |} | e_{i} | \leq C l^{d - 1}, \end{matrix}

also we know that

\begin{matrix} E \sum_{i \in | D |} | e_{i} | = E_{D} \sum_{i \in | D |} E [| e_{i} | | D] . \end{matrix}

(A3)

Note that using the result from ([31], Proposition 3), for some constants

C_{i 1}

and

C_{i 2}

, we have

\begin{matrix} E | e_{i} | \leq C_{i 1} n^{- 1 / d} + C_{i 2} n^{- (1 / d) - s}, i \in | D | . \end{matrix}

(A4)

Now, let

C_{1} = max_{i} {C_{i 1}}

and

C_{2} = max_{i} {C_{i 2}}

, hence we can bound the expectation (A3) as

\begin{matrix} E | D | (C_{1} n^{- 1 / d} + C_{2} (n^{- (1 / d) - s})) \leq C l^{d - 1}, \end{matrix}

which implies

\begin{matrix} \begin{matrix} E | D | \leq (C_{1} n^{- 1 / d} + O (n^{- (1 / d) - s})) \\ \leq C^{'} l^{d - 1} n^{1 / d} + O (l^{d - 1} n^{(1 / d) - s}) . \end{matrix} \end{matrix}

To aim toward the goal (7), we partition

{[0, 1]}^{d}

into

M : = l^{d}

subcubes

Q_{i}

of side

1 / l

. Recalling Lemma 2.1 in [48], we therefore have the set inclusion:

\begin{matrix} M S T (⋃_{i = 1}^{M} Q_{i}) \subset ⋃_{i = 1}^{M} M S T (Q_{i}) \cup D, \end{matrix}

(A5)

where D is defined as in Lemma A1. Let

m_{i}

and

n_{i}

be the number of sample

{X_{1}, \dots, X_{m}}

and

{Y_{1}, \dots, Y_{n}}

respectively falling into the partition

Q_{i}

, such that

\sum_{i} m_{i} = m

and

\sum_{i} n_{i} = n

. Introduce sets A and B as

A : = M S T (⋃_{i = 1}^{M} Q_{i}), B : = ⋃_{i = 1}^{M} M S T (Q_{i}) .

Since set B has fewer edges than set A, thus (A5) implies that the difference set of B and A contains at most

2 | D |

edges, where

| D |

is the number of edges in D. On the other word,

\begin{matrix} \begin{matrix} | A Δ B | \leq | A - B | + | B - A | = | D | + | B - A | \\ = | D | + (| B | - | B \cap A | \leq | D | + (| A | - | B \cap A |) = 2 | D | . \end{matrix} \end{matrix}

The number of edge linked nodes from different samples in set A is bounded by the number of edge linked nodes from different samples in set B plus

2 | D |

:

\begin{matrix} R_{m, n} (X_{m}, Y_{n}) \leq \sum_{i = 1}^{M} R_{m_{i}, n_{i}} ((X_{m}, Y_{n}) \cap Q_{i}) + 2 | D | . \end{matrix}

(A6)

Here,

R_{m_{i}, n_{i}}

stands with the number edge linked nodes from different samples in partition

Q_{i}

, M. Next, we address the reader to Lemma A1, where it has been shown that there is a constant c such that

E | D | \leq c l^{d - 1} {(m + n)}^{1 / d}

. This concludes the claimed assertion (7). Now, to accomplish the proof, the lower bound term in (8) is obtained with similar methodology and the set inclusion:

\begin{matrix} ⋃_{i = 1}^{M} M S T (Q_{i}) \subset M S T (⋃_{i = 1}^{M} Q_{i}) \cup D . \end{matrix}

(A7)

This completes the proof.

Appendix B. Proof of Theorem 2

As many of continuous subadditive functionals on

{[0, 1]}^{d}

, in the case of the FR statistic, there exists a dual superadditive functional

R_{m, n}^{*}

based on dual MST,

{MST}^{*}

, proposed in Definition 2. Note that, in the MST* graph, the degrees of the corner points are bounded by

c_{d}

, where it only depends on dimension d, and is the bound for degree of every node in MST graph. The following properties hold true for dual FR test statistic,

R_{m, n}^{*}

:

Lemma A2.

Given samples

X_{m} = {X_{1}, \dots, X_{m}}

and

Y_{n} = {Y_{1}, \dots, Y_{n}}

, the following inequalities hold true:

(i): For constant $c_{d}$ which depends on d:

$\begin{matrix} \begin{matrix} R_{m, n}^{*} (X_{m}, Y_{n}) \leq R_{m, n} (X_{m}, Y_{n}) + c_{d} 2^{d}, \\ R_{m, n} (X_{m}, Y_{n}) \leq R_{m, n}^{*} (X_{m}, Y_{n}) . \end{matrix} \end{matrix}$

(A8)
(ii): (Subadditivity on $E [R_{m, n}^{*}]$ and Superadditivity) Partition ${[0, 1]}^{d}$ into $l^{d}$ subcubes $Q_{i}$ such that $m_{i}$ , $n_{i}$ be the number of sample $X_{m} = {X_{1}, \dots, X_{m}}$ and $Y_{n} = {Y_{1}, \dots, Y_{n}}$ respectively falling into the partition $Q_{i}$ with dual $R_{m_{i}, n_{i}}^{*}$ . Then, we have

$\begin{matrix} E [R_{m, n}^{*} (X_{m}, Y_{n})] \leq \sum_{i = 1}^{l^{d}} E [R_{m_{i}, n_{i}}^{*} ((X_{m}, Y_{n}) \cap Q_{i})] + c l^{d - 1} {(m + n)}^{1 / d}, \\ R_{m, n}^{*} (X_{m}, Y_{n}) \geq \sum_{i = 1}^{l^{d}} R_{m_{i}, n_{i}}^{*} ((X_{m}, Y_{n}) \cap Q_{i}) - 2^{d} c_{d} l^{d}, \end{matrix}$

(A9)

where c is a constant.

(i) Consider the nodes connected to the corner points. Since

MST (X_{m}, Y_{n})

and

{MST}^{*} (X_{m}, Y_{n})

can only be different in the edges connected to these nodes, and in

R^{*} (X_{m}, Y_{n})

we take all of the edges between these nodes and corner nodes into account, so we obviously have the second relation in (A8). In addition, for the first inequality in (A8), it is enough to say that the total number of edges connected to the corner nodes is upper bounded by

2^{d} c_{d}

.

(ii) Let

| D^{*} |

be the set of edges of the

{MST}^{*}

graph which intersect two different partitions. Since MST and

{MST}^{*}

are only different in edges of points connected to the corners and edges crossing different partitions. Therefore,

| D^{*} | \leq | D |

. By eliminating one edge in set D in the worse scenario we would face two possibilities: either the corresponding node is connected to the corner which is counted anyways or any other point in MST graph which wouldn’t change the FR test statistic. This implies the following subadditivity relation:

\begin{matrix} R_{m, n}^{*} (X_{m}, Y_{n}) - | D | \leq \sum_{i = 1}^{l^{d}} R_{m_{i}, n_{i}}^{*} ((X_{m}, Y_{n}) \cap Q_{i}) . \end{matrix}

Further from Lemma A1, we know that there is a constant c such that

E | D | \leq c l^{d - 1} {(m + n)}^{1 / d}

. Hence, the first inequality in (A9) is obtained. Next, consider

| D_{c}^{*} |

which represents the total number of edges from both samples only connected to the all corners points in

{MST}^{*}

graph. Therefore, one can easily claim:

\begin{matrix} R_{m, n}^{*} (X_{m}, Y_{n}) \geq \sum_{i = 1}^{l^{d}} R_{m_{i}, n_{i}}^{*} ((X_{m}, Y_{n}) \cap Q_{i}) - | D_{c}^{*} | . \end{matrix}

In addition, we know that

| D_{c}^{*} | \leq 2^{d} l^{d} c_{d}

where

c_{d}

stands with the largest possible degree of any vertex. One can write

\begin{matrix} R_{m, n}^{*} (X_{m}, Y_{n}) \geq \sum_{i = 1}^{l^{d}} R_{m_{i}, n_{i}}^{*} ((X_{m}, Y_{n}) \cap Q_{i}) - 2^{d} c_{d} l^{d} . \end{matrix}

The following list of Lemmas A3, A4 and A6 are inspired from [49] and are required to prove Theorem A1. See Appendix E for their proofs.

Lemma A3.

Let

g (x)

be a density function with support

{[0, 1]}^{d}

and belong to the Hölder class

Σ_{d} (η, L)

,

0 < η \leq 1

, stated in Definition 1. In addition, assume that

P (x)

is a η-Hölder smooth function, such that its absolute value is bounded from above by a constant. Define the quantized density function with parameter l and constants

ϕ_{i}

as

\begin{matrix} \hat{g} (x) = \sum_{i = 1}^{M} ϕ_{i} 1 {x \in Q_{i}}, w h e r e ϕ_{i} = l^{d} \int_{Q_{i}} g (x) d x . \end{matrix}

(A10)

Let

M = l^{d}

and

Q_{i} = {x, x_{i} : ∥ x - x_{i} ∥ < l^{- d}}

. Then,

\begin{matrix} \int ∥ (g (x) - \hat{g} (x)) P (x) ∥ d x \leq O (l^{- d η}) . \end{matrix}

(A11)

Lemma A4.

Denote

Δ (x, S)

the degree of vertex

x \in S

in the

M S T

over set

S

with the n number of vertices. For given function

P (x, x)

, one obtains

\int P (x, x) g (x) E [Δ (x, S)] d x = 2 \int P (x, x) g (x) d x + ς_{η} (l, n),

(A12)

where, for constant

η > 0

,

\begin{matrix} ς_{η} (l, n) = (O (l / n) - 2 l^{d} / n) \int g (x) P (x, x) d x + O (l^{- d η}) . \end{matrix}

(A13)

Lemma A5.

Assume that, for given k,

g_{k} (x)

is a bounded function belong to

Σ_{d} (η, L)

. Let

P : R^{d} \times R^{d} \mapsto [0, 1]

be a symmetric, smooth, jointly measurable function, such that, given k, for almost every

x \in R^{d}

,

P (x, .)

is measurable with

x

a Lebesgue point of the function

g_{k} (.) P (x, .)

. Assume that the first derivative P is bounded. For each k, let

Z_{1}^{k}, Z_{2}^{k}, \dots, Z_{k}^{k}

be an independent d-dimensional variable with common density function

g_{k}

. Set

Z_{k} = {Z_{1}^{k}, Z_{2}^{k} \dots, Z_{k}^{k}}

and

Z_{k}^{x} = {x, Z_{2}^{k}, Z_{3}^{k} \dots, Z_{k}^{k}}

. Then,

\begin{matrix} E [\sum_{j = 2}^{k} P (x, Z_{j}^{k}) 1 \{(x, Z_{j}^{k}) \in M S T (Z_{k}^{x})\}] = P (x, x) E [Δ (x, Z_{k}^{x})] + \{O (k^{- η / d}) + O (k^{- 1 / d})\} . \end{matrix}

(A14)

Lemma A6.

Consider the notations and assumptions in Lemma A5. Then,

\begin{matrix} | k^{- 1} \underset{1 \leq i < j \leq k}{\sum \sum} P (Z_{i}^{k}, Z_{j}^{k}) 1 {(Z_{i}^{k}, Z_{j}^{k}) \in M S T (Z_{k})} - \int_{R^{d}} P (x, x) g_{k} (x) d x | \\ \leq ς_{η} (l, k) + O (k^{- η / d}) + O (k^{- 1 / d}) . \end{matrix}

(A15)

Here,

M S T (S)

denotes the MST graph over nice and finite set

S \subset R^{d}

and η is the smoothness Hölder parameter. Note that

ς_{η} (l, k)

is given as before in Lemma A4 (A13).

Theorem A1.

Assume

R_{m, n} : = R (X_{m}, Y_{n})

denotes the FR test statistic and densities

f_{0}

and

f_{1}

belong to the Hölder class

Σ_{d} (η, L)

,

0 < η \leq 1

. Then, the rate for the bias of the

R_{m, n}

estimator for

d \geq 2

is of the form:

\begin{matrix} |\frac{E [R_{m, n}]}{m + n} - 2 p q \int \frac{f_{0} (x) f_{1} (x)}{p f_{0} (x) + q f_{1} (x)} d x| \leq O (l^{d} {(m + n)}^{- η / d}) + O (l^{- d η}) . \end{matrix}

(A16)

The proof and a more explicit form for the bound (A16) are given in Appendix E.

Now, we are at the position to prove the assertion in (5). Without loss of generality, assume that

(m + n) l^{- d} > 1

. In the range

d \geq 2

and

0 < η \leq 1

, we select l as a function of

m + n

to be the sequence increasing in

m + n

which minimizes the maximum of these rates:

\begin{matrix} l (m + n) = a r g min_{l} max \{l^{d} {(m + n)}^{- η / d}, l^{- η d}\} . \end{matrix}

The solution

l = l (m + n)

occurs when

l^{d} {(m + n)}^{- η / d} = l^{- η d}

, or equivalently

l = ⌊ {(m + n)}^{η / (d^{2} (η + 1))} ⌋

. Substitute this into l in the bound (A16), the RHS expression in (5) for

d \geq 2

is established.

Appendix C. Proof of Theorems 3

To bound the variance, we will apply one of the first concentration inequalities which was proved by Efron and Stein [43] and further was improved by Steele [18].

Lemma A7

(The Efron–Stein Inequality). Let

X_{m} = {X_{1}, \dots, X_{m}}

be a random vector on the space

S

. Let

X^{'} = {X_{1}^{'}, \dots, X_{m}^{'}}

be the copy of random vector

X_{m}

. Then, if

f : S \times \dots \times S \to R

, we have

\begin{matrix} V [f (X_{m})] \leq \frac{1}{2} \sum_{i = 1}^{m} E [{(f (X_{1}, \dots, X_{m}) - f (X_{1}, \dots, X_{i}^{'}, \dots, X_{m}))}^{2}] . \end{matrix}

(A17)

Consider two set of nodes

X_{i}

,

1 \leq i \leq m

and

Y_{j}

for

1 \leq j \leq n

. Without loss of generality, assume that

m < n

. Then, consider the

n - m

virtual random points

X_{m + 1}, \dots, X_{n}

with the same distribution as

X_{i}

, and define

Z_{i} : = (X_{i}, Y_{i})

. Now, for using the Efron–Stein inequality on set

Z_{n} = {Z_{1}, \dots, Z_{n}}

, we involve another independent copy of

Z_{n}

as

Z_{n}^{'} = {Z_{1}^{'}, \dots, Z_{n}^{'}}

, and define

Z_{n}^{(i)} : = (Z_{1}, \dots, Z_{i - 1}, Z_{i}^{'}, Z_{i + 1}, \dots, Z_{n})

, then

Z_{n}^{(1)}

becomes

(Z_{1}^{'}, Z_{2}, \dots, Z_{n}) = \{(X_{1}^{'}, Y_{1}^{'}), (X_{2}, Y_{2}), \dots, (X_{m}, Y_{n})\} = : (X_{m}^{(1)}, Y_{n}^{(1)})

where

(X_{1}^{'}, Y_{1}^{'})

is independent copy of

(X_{1}, Y_{1})

. Next, define the function

r_{m, n} (Z_{n}) : = R_{m, n} / (m + n)

, which means that we discard the random samples

X_{m + 1}, \dots, X_{n}

, and find the previously defined

R_{m, n}

function on the nodes

X_{i}

,

1 \leq i \leq m

and

Y_{j}

for

1 \leq j \leq n

, and multiply by some coefficient to normalize it. Then, according to the Efron–Stein inequality, we have

\begin{matrix} V a r (r_{m, n} (Z_{n})) \leq \frac{1}{2} \sum_{i = 1}^{n} E [{(r_{m, n} (Z_{n}) - r_{m, n} (Z_{n}^{(i)}))}^{2}] . \end{matrix}

Now, we can divide the RHS as

\begin{matrix} \frac{1}{2} \sum_{i = 1}^{n} E [{(r_{m, n} (Z_{n}) - r_{m, n} (Z_{n}^{(i)}))}^{2}] = \frac{1}{2} \sum_{i = 1}^{m} E [{(r_{m, n} (Z_{n}) - r_{m, n} (Z_{n}^{(i)}))}^{2}] \\ + \frac{1}{2} \sum_{i = m + 1}^{n} E [{(r_{m, n} (Z_{n}) - r_{m, n} (Z_{n}^{(i)}))}^{2}] . \end{matrix}

(A18)

The first summand becomes

\begin{matrix} \begin{matrix} = \frac{1}{2} \sum_{i = 1}^{m} E [{(r_{m, n} (Z_{n}) - r_{m, n} (Z_{n}^{(i)}))}^{2}] = \frac{m}{2 {(m + n)}^{2}} E [{(R_{m, n} (X_{m}, Y_{n}) - R_{m, n} (X_{m}^{(1)}, Y_{n}^{(1)}))}^{2}], \end{matrix} \end{matrix}

which can also be upper bounded as follows:

\begin{matrix} \begin{matrix} |R_{m, n} (X_{m}, Y_{n}) - R_{m, n} (X_{m}^{(1)}, Y_{n}^{(1)})| \leq |R_{m, n} (X_{m}, Y_{n}) - R_{m, n} (X_{m}^{(1)}, Y_{n})| \\ + |R (X_{m}^{(1)}, Y_{n}) - R_{m, n} (X_{m}^{(1)}, Y_{n}^{(1)})| . \end{matrix} \end{matrix}

(A19)

For deriving an upper bound on the second line in (A19), we should observe how much changing a point’s position modifies the amount of

R_{m, n} (X_{m}, Y_{n})

. We consider two steps of changing

X_{1}

’s position: we first remove it from the graph, and then add it to the new position. Removing it would change

R_{m, n} (X_{m}, Y_{n})

at most by

2 c_{d}

because

X_{1}

has a degree of at most

c_{d}

, and

c_{d}

edges will be removed from the MST graph, and

c_{d}

edges will be added to it. Similarly, adding

X_{1}

to the new position will affect

R_{m, n} (X_{m, n}, Y_{m, n})

at most by

2 c_{d}

. Thus, we have

|R_{m, n} (X_{m}, Y_{n}) - R_{m, n} (X_{m}^{(1)}, Y_{n})| \leq 4 c_{d},

and we can also similarly reason that

|R_{m, n} (X_{m}^{(1)}, Y_{n}) - R_{m, n} (X_{m}^{(1)}, Y_{n}^{(1)})| \leq 4 c_{d} .

Therefore, totally we would have

\begin{matrix} |R_{m, n} (X_{m}, Y_{n}) - R_{m, n} (X_{m}^{(1)}, Y_{n}^{(1)})| \leq 8 c_{d} . \end{matrix}

Furthermore, the second summand in (A18) becomes

\begin{matrix} \begin{matrix} = \frac{1}{2} \sum_{i = m + 1}^{n} E [{(r_{m, n} (Z_{n}) - r_{m, n} (Z_{n}^{(i)}))}^{2}] = K_{m, n} E [{(R_{m, n} (X_{m}, Y_{n}) - R_{m, n} (X_{m}^{(m + 1)}, Y_{n}^{(m + 1)}))}^{2}], \end{matrix} \end{matrix}

where

K_{m, n} = \frac{n - m}{2 {(m + n)}^{2}}

. Since, in

(X_{m}^{(m + 1)}, Y_{n}^{(m + 1)})

, the point

X_{m + 1}^{'}

is a copy of virtual random point

X_{m + 1}

, therefore this point doesn’t change the FR test statistic

R_{m, n}

. In addition, following the above arguments, we have

\begin{matrix} |R_{m, n} (X_{m}, Y_{n}) - R_{m, n} (X_{m}, Y_{n}^{(m + 1)})| \leq 4 c_{d} . \end{matrix}

Hence, we can bound the variance as below:

\begin{matrix} V a r (r_{m, n} (Z_{n})) \leq \frac{8 c_{d}^{2} (n - m)}{{(m + n)}^{2}} + \frac{32 c_{d}^{2} m}{{(m + n)}^{2}} . \end{matrix}

(A20)

Combining all results with the fact that

\frac{n}{m + n} \to q

concludes the proof.

Appendix D. Proof of Theorems 5–7

We will need the following prominent results for the proofs.

Lemma A8.

For

h = 1, 2, \dots

, let

δ_{m, n}^{h}

be the function

c h^{d - 1} {(m + n)}^{1 / d}

, where c is a constant. Then, for

ϵ > 0

, we have

P (R_{m, n} (X_{m}, Y_{n}) \leq \sum_{i = 1}^{h^{d}} R_{m_{i}, n_{i}} (X_{m_{i}}, Y_{n_{i}}) + 2 ϵ) \geq \frac{ϵ - δ_{m, n}^{h}}{ϵ} .

(A21)

Note that, in the case

ϵ \leq δ_{m, n}^{h}

, the above claimed inequality becomes trivial.

The subadditivity property for FR test statistic

R_{m, n}

in Lemma A8, as well as Euclidean functionals, leads to several non-trivial consequences. The growth bound was first explored by Rhee (1993b) [50], and as is illustrated in [17,27] has a wide range of applications. In this paper, we investigate the probabilistic growth bound for

R_{m, n}

. This observation will lead us to our main goal in this appendix that is providing the proof of Theorem 6. For what follows, we will use

δ_{m, n}^{h}

notation for the expression

O (h^{d - 1} {(m + n)}^{1 / d})

.

Lemma A9

(Growth bounds for

R_{m, n}

). Let

R_{m, n}

be the FR test statistic. Then, for given non-negative ϵ, such that

ϵ \geq h^{2} δ_{m, n}^{h}

, with at least probability

g (ϵ) : = 1 - \frac{h δ_{m, n}^{h}}{ϵ}

,

h = 2, 3, \dots

, we have

\begin{matrix} R_{m, n} (X_{m}, Y_{n}) \leq c_{ϵ, h}^{″} {(# X_{m} # Y_{n})}^{1 - 1 / d} . \end{matrix}

(A22)

Here,

c_{ϵ, h}^{″} = O (\frac{ϵ}{h^{d - 1} - 1})

depending only on ϵ and h.

The complexity of

R_{m, n}

’s behavior and the need to pursue the proof encouraged us to explore the smoothness condition for

R_{m, n}

. In fact, this is where both subadditivity and superadditivity for the FR statistic are used together and become more important.

Lemma A10

(Smoothness for

R_{m, n}

). Given observations of

X_{m} : = (X_{m^{'}}, X_{m^{″}}) = {X_{1}, \dots, X_{m^{'}}, X_{m^{'} + 1}, \dots, X_{m}},

where

m^{'} + m^{″} = m

and

Y_{n} : = (Y_{n^{'}}, Y_{n^{″}}) = {Y_{1}, \dots, Y_{n^{'}}, Y_{n^{'} + 1}, \dots, Y_{n}}

, where

n^{'} + n^{″} = n

, denote

R_{m, n} (X_{m}, Y_{n})

as before, the number of edges of

MST (X_{m}, Y_{n})

which connect a point of

X_{m}

to a point of

Y_{n}

. Then, for given integer

h \geq 2

, for all

(X_{n}, Y_{m}) \in {[0, 1]}^{d}

,

ϵ \geq h^{2} δ_{m, n}^{h}

where

δ_{m, n}^{h} = O (h^{d - 1} {(m + n)}^{1 / d})

, we have

\begin{matrix} P (| R_{m, n} (X_{m}, Y_{n}) - R_{m^{'}, n^{'}} (X_{m^{'}}, Y_{n^{'}}) | \leq {\tilde{c}}_{ϵ, h} {(# X_{m^{″}} # Y_{n^{″}})}^{1 - 1 / d}) \\ \geq 1 - \frac{2 h δ_{m, n}^{h}}{ϵ}, \end{matrix}

(A23)

where

{\tilde{c}}_{ϵ, h} = O (\frac{ϵ}{h^{d - 1} - 1})

.

Remark: Using Lemma A10, we can imply the continuty property, i.e., for all observations

(X_{m}, Y_{n})

and

(X_{m^{'}}, Y_{n^{'}})

, with at least probability

2 g (ϵ) - 1

, one obtains

\begin{matrix} \begin{matrix} | R_{m, n} (X_{m}, Y_{n}) - R_{m^{'}, n^{'}} (X_{m^{'}}, Y_{n^{'}}) | \\ \leq c_{ϵ, h}^{*} {(# (X_{m} Δ X_{m^{'}}) # (Y_{n} Δ Y_{n^{'}}))}^{1 - 1 / d}, \end{matrix} \end{matrix}

(A24)

for given

ϵ > 0

,

c_{ϵ, h}^{*} = O (\frac{ϵ}{h^{d - 1} - 1})

,

h \geq 2

. Here,

X_{m} Δ X_{m^{'}}

denotes symmetric difference of observations

X_{m}

and

X_{m^{'}}

.

The path to approach the assertions (11) and (12) proceeds via semi-isoperimentic inequality for the

R_{m, n}

involving the Hamming distance.

Lemma A11

(Semi-Isoperimetry). Let μ be a measure on

{[0, 1]}^{d}

;

μ^{n}

denotes the product measure on space

{({[0, 1]}^{d})}^{n}

. In addition, let

M_{e}

denotes a median of

R_{m, n}

. Set

\begin{matrix} A : = \{X_{m} \in {({[0, 1]}^{d})}^{m}, Y_{n} \in {({[0, 1]}^{d})}^{n}; R_{m, n} (X_{m}, Y_{n}) \leq M_{e}\} . \end{matrix}

(A25)

Following the notations in [17],

H (x, x^{'}) = # {i, x_{i} \neq x_{i}^{'})

and

ϕ_{A} (x^{'}) + ϕ_{A} (y^{'}) = min {H (x, x^{'}) + H (y, y^{'}) : x, y \in A}

and

ϕ_{A} (x^{'}) ϕ_{A} (y^{'}) = min {H (x, x^{'}) H (y, y^{'}) : x, y \in A}

. Then,

\begin{matrix} μ^{m + n} (\{x^{'} \in {({[0, 1]}^{d})}^{m}, y^{'} \in {({[0, 1]}^{d})}^{n} : ϕ_{A} (x^{'}) ϕ_{A} (y^{'}) \geq t\}) \\ \leq 4 exp (\frac{- t}{8 (m + n)}) . \end{matrix}

(A26)

Now, we continue by providing the proof of Theorem 5. Recall (A25) and denote

\begin{matrix} \begin{matrix} F_{x} : = \{x_{i}, i = 1, \dots, m, x_{i} = x_{i}^{'}\}, \\ F_{y} : = \{y_{j}, j = 1, \dots, n, y_{j} = y_{j}^{'}\}, \\ and \\ G_{x} : = \{x_{i}, i = 1, \dots, m, x_{i} \neq x_{i}^{'}\}, \\ G_{y} : = \{y_{j}, j = 1, \dots, n, y_{j} \neq y_{j}^{'}\} . \end{matrix} \end{matrix}

In addition, for given integer h, define events

B

,

B^{'}

by

\begin{matrix} \begin{matrix} B : = \{| R_{m, n} (X_{m}^{'}, Y_{n}^{'}) - R (F_{x}, F_{y}) | \leq c_{ϵ, h} {(# G_{x} # G_{y})}^{1 - 1 / d}\}, \\ B^{'} : = \{| R (F_{x}, F_{y}) - R_{m, n} (X_{m}, Y_{n}) | \leq c_{ϵ, h} {(# G_{x} # G_{y})}^{1 - 1 / d}\}, \end{matrix} \end{matrix}

where

c_{ϵ, h}

is a constant. By virtue of smoothness property, Lemma A10, for

ϵ \geq h^{2} δ_{m, n}^{h}

, we know

P (B) \geq 2 g (ϵ) - 1

and

P (B^{'}) \geq 2 g (ϵ) - 1

. On the other hand, we have

\begin{matrix} R_{m, n} (X_{m}^{'}, Y_{n}^{'}) \leq | R_{m, n} (X_{m}^{'}, Y_{n}^{'}) - R (F_{x}, F_{y}) | \\ + | R (F_{x}, F_{y}) - R_{m, n} (X_{m}, Y_{n}) | + R_{m, n} (X_{m}, Y_{n}) . \\ = | ϖ^{'} | + | ϖ | + R_{m, n} (X_{m}, Y_{n}) (say) . \end{matrix}

Moreover,

P (R_{m, n} (X_{m}, Y_{n}) \leq M_{e}) \geq 1 / 2

. Therefore, we can write

\begin{matrix} 1 / 2 \leq P (R_{m, n} (X_{m}^{'}, Y_{n}^{'}) \leq M_{e} + | ϖ^{'} | + | ϖ |) \\ \leq P (R_{m, n} (X_{m}^{'}, Y_{n}^{'}) \leq M_{e} + | ϖ^{'} | + | ϖ | | B \cap B^{'}) P (B \cap B^{'}) \\ + P (B^{c} \cup B^{' c}) . \end{matrix}

(A27)

Thus, we obtain

\begin{matrix} \begin{matrix} P (R_{m, n} (X_{m}^{'}, Y_{n}^{'}) \leq M_{e} + 4 ϵ {(# G_{x} # G_{y})}^{1 - 1 / d}) \\ \geq (1 / 2 - 1 + P (B \cap B^{'})) / P (B \cap B^{'}) \\ = 1 - ({(2 P (B \cap B^{'}))}^{- 1}) . \end{matrix} \end{matrix}

Note that

P (B \cap B^{'}) = P (B) P (B^{'}) \geq {(2 g (ϵ) - 1)}^{2}

. Now, we easily claim that

1 - ({(2 P (B \cap B^{'}))}^{- 1}) \geq 1 - ({(2 {(2 g (ϵ) - 1)}^{2})}^{- 1}) .

(A28)

Thus,

\begin{matrix} P (R_{m, n} (X_{m}^{'}, Y_{n}^{'}) \leq M_{e} + 4 ϵ {(# G_{x} # G_{y})}^{1 - 1 / d}) \geq 1 - ({(2 {(2 g (ϵ) - 1)}^{2})}^{- 1}) . \end{matrix}

On the other word, calling

ϕ_{A} (x^{'})

and

ϕ_{A} (y^{'})

in Lemma A11, we get

\begin{matrix} \begin{matrix} P (R_{m, n} (X_{m}^{'}, Y_{n}^{'}) \leq M_{e} + 4 ϵ {(ϕ_{A} (x^{'}) ϕ_{A} (y^{'}))}^{1 - 1 / d}) \geq 1 - ({(2 {(2 g (ϵ) - 1)}^{2})}^{- 1}) . \end{matrix} \end{matrix}

(A29)

Furthermore, denote event

C : = \{R_{m, n} (X_{m}^{'}, Y_{n}^{'}) \leq M_{e} + 4 ϵ {(ϕ_{A} (x^{'}) ϕ_{A} (y^{'}))}^{1 - 1 / d}\} .

Then, we have

\begin{matrix} P (R_{m, n} (X_{m}, Y_{n}) \geq M_{e} + t) = μ^{m + n} (R_{m, n} (X_{m}^{'}, Y_{n}^{'}) \geq M_{e} + t) \\ = μ^{m + n} ((R_{m, n} (X_{m}^{'}, Y_{n}^{'}) \geq M_{e} + t) | C) P (C) \\ + μ^{m + n} ((R_{m, n} (X_{m}^{'}, Y_{n}^{'}) \geq M_{e} + t) | C^{c}) P (C^{c}) \\ \leq μ^{m + n} ({(ϕ_{A} (x^{'}) ϕ_{A} (y^{'}))}^{1 - 1 / d} \geq \frac{t}{4 ϵ}) P (C) \\ + μ^{m + n} ((R_{m, n} (X_{m}^{'}, Y_{n}^{'}) \geq M_{e} + t) | C^{c}) P (C^{c}) . \\ Using P (C) = 1 - P (C^{c}) \\ = μ^{m + n} ({(ϕ_{A} (x^{'}) ϕ_{A} (y^{'}))}^{1 - 1 / d} \geq \frac{t}{4 ϵ}) \\ + P (C^{c}) {μ^{m + n} ((R_{m, n} (X_{m}^{'}, Y_{n}^{'}) \geq M_{e} + t) | C^{c}) \\ - μ^{m + n} ({(ϕ_{A} (x^{'}) ϕ_{A} (y^{'}))}^{1 - 1 / d} \geq \frac{t}{4 ϵ})} . \end{matrix}

(A30)

Define set

K_{t} = \{{(ϕ_{A} (x^{'}) ϕ_{A} (y^{'}))}^{1 - 1 / d} \geq \frac{t}{4 ϵ}\}

, so

\begin{matrix} \begin{matrix} μ^{m + n} (R_{m, n} (X_{m}^{'}, Y_{n}^{'}) \geq M_{e} + t | C^{c}) \\ = μ^{m + n} (R_{m, n} (X_{m}^{'}, Y_{n}^{'}) \geq M_{e} + t | C^{c}, K_{t}) μ^{m + n} (K_{t}) + μ^{m + n} ((R_{m, n} (X_{m}^{'}, Y_{n}^{'}) \geq M_{e} + t) | C^{c}, K_{t}^{c}) μ^{m + n} (K_{t}^{c}) . \end{matrix} \end{matrix}

Since

μ^{m + n} (R_{m, n} (X_{m}^{'}, Y_{n}^{'}) \geq M_{e} + t | C^{c}, K_{t}) = 1,

and

\begin{matrix} \begin{matrix} μ^{m + n} (R_{m, n} (X_{m}^{'}, Y_{n}^{'}) \geq M_{e} + t | C^{c}, K_{t}^{c}) = μ^{m + n} (R_{m, n} (X_{m}^{'}, Y_{n}^{'}) \geq M_{e} + t) . \end{matrix} \end{matrix}

Consequently, from (A30), one can write

\begin{matrix} \begin{matrix} P (R_{m, n} (X_{m}, Y_{n}) \geq M_{e} + t) \\ \leq μ^{m + n} ({(ϕ_{A} (x^{'}) ϕ_{A} (y^{'}))}^{1 - 1 / d} \geq \frac{t}{4 ϵ}) \\ + P (C^{c}) \{μ^{m + n} (R_{m, n} (X_{m}^{'}, Y_{n}^{'}) \geq M_{e} + t) μ^{m + n} (K_{t}^{c})\} \\ \leq μ^{m + n} ({(ϕ_{A} (x^{'}) ϕ_{A} (y^{'}))}^{1 - 1 / d} \geq \frac{t}{4 ϵ}) \\ + ({(2 {(2 g (ϵ) - 1)}^{2})}^{- 1}) P (R_{m, n} (X_{m}, Y_{n}) \geq M_{e} + t) . \end{matrix} \end{matrix}

(A31)

The last inequality implies by owing to (A29) and

μ^{m + n} (K_{t}^{c}) \leq 1

. For

g (ϵ) \geq 1 / 2 + 1 / (2 \sqrt{2})

, we have

1 - ({(2 {(2 g (ϵ) - 1)}^{2})}^{- 1}) \geq 0,

or equivalently this holds true when

ϵ \geq (2 h \sqrt{2} δ_{m, n}^{h}) / (\sqrt{2} - 1)

. Furthermore, for

h \geq 7

, we have

h^{2} δ_{m, n}^{h} \geq (2 h \sqrt{2} δ_{m, n}^{h}) / (\sqrt{2} - 1),

(A32)

therefore

P (R_{m, n} (X_{m}, Y_{n}) \geq M_{e} + t)

is less than and equal to

{(1 - ({(2 {(2 g (ϵ) - 1)}^{2})}^{- 1}))}^{- 1} μ^{m + n} ({(ϕ_{A} (x^{'}) ϕ_{A} (y^{'}))}^{1 - 1 / d} \geq \frac{t}{4 ϵ}) .

(A33)

By virtue of Lemma A11, finally we obtain

\begin{matrix} P (R_{m, n} (X_{m}, Y_{n}) \geq M_{e} + t) \leq 4 {(1 - ({(2 {(2 g (ϵ) - 1)}^{2})}^{- 1}))}^{- 1} exp (\frac{- t^{d / (d - 1)}}{8 {(4 ϵ)}^{d / d - 1} (m + n)}) . \end{matrix}

(A34)

Similarly, we can derive the same bound on

P (R_{m, n} (X_{m}, Y_{n}) \leq M_{e} - t)

, so we obtain

\begin{matrix} P (| R_{m, n} - M_{e} | \geq t) \leq C_{m, n}^{'} (ϵ, h) exp (\frac{- t^{d / (d - 1)}}{8 {(4 ϵ)}^{d / (d - 1)} (m + n)}), \end{matrix}

(A35)

where

\begin{matrix} C_{m, n}^{'} (ϵ, h) = 8 {(1 - 2^{- 1} {(1 - \frac{2 h O (h^{d - 1} {(m + n)}^{1 / d})}{ϵ})}^{- 2})}^{- 1} . \end{matrix}

(A36)

We will analyze (A35) together with Theorem 6. The next lemma will be employed in Theorem 6’s proof.

Lemma A12

(Deviation of the Mean and Median). Consider

M_{e}

as a median of

R_{m, n}

. Then, for

ϵ \geq h^{2} δ_{m, n}^{h}

and given

h \geq 7

, we have

| E [R_{m, n} (X_{m}, Y_{n})] - M_{e} | \leq C_{m, n} (ϵ, h) {(m + n)}^{(d - 1) / d},

(A37)

where

C_{m, n} (ϵ, h)

is a constant depending on ϵ, h, m, and n by

\begin{matrix} C_{m, n} (ϵ, h) = C {(1 - ({(2 {(2 g (ϵ) - 1)}^{2})}^{- 1}))}^{- 1}, \end{matrix}

(A38)

where C is a constant and

δ_{m, n}^{h} = O (h^{d - 1} {(m + n)}^{1 / d}), and g (ϵ) = 1 - \frac{h δ_{m, n}^{h}}{ϵ} .

We conclude this part by pursuing our primary intension which has been the Theorem 6’s proof. Observe from Theorem 5, (11) that

\begin{matrix} \begin{matrix} P (| R_{m, n} - E [R_{m, n}] | \geq t + C_{m, n} (ϵ, l) ({m + n)}^{(d - 1) / d}) \\ \leq P (| R_{m, n} - M_{e} | + | E [R_{m, n}] - M_{e} | \\ \geq t + C_{m, n} (ϵ, l) ({m + n)}^{(d - 1) / d}) \\ \leq P (| R_{m, n} - M_{e} | \geq t) \\ \leq 8 {(1 - ({(2 {(2 g (ϵ) - 1)}^{2})}^{- 1}))}^{- 1} exp (\frac{- t^{d / (d - 1)}}{8 {(4 ϵ)}^{d / d - 1} (m + n)}) . \end{matrix} \end{matrix}

Note that the last bound is derived by (11). The rest of the proof is as the following: When

t \geq 2 C_{m, n} {(ϵ, h) (m + n)}^{(d - 1) / d}

, we use

(t - C_{m, n} (ϵ, h) {({m + n)}^{(d - 1) / d})}^{d / (d - 1)} \geq {(t / 2)}^{d / (d - 1)} .

Therefore, it turns out that

\begin{matrix} P (| R_{m, n} - E [R_{m, n}] | \geq t) \\ \leq 8 {(1 - ({(2 {(2 g (ϵ) - 1)}^{2})}^{- 1}))}^{- 1} exp (\frac{- t^{d / (d - 1)}}{8 {(8 ϵ)}^{d / (d - 1)} (m + n)}) . \end{matrix}

(A39)

In other words, there exist constants

C_{m, n}^{'} (ϵ, h)

depending on

m, n

,

ϵ

, and h such that

\begin{matrix} P (| R_{m, n} - E [R_{m, n}] | \geq t) \leq C_{m, n}^{'} (ϵ, h) exp (\frac{- {(t / (2 ϵ))}^{d / (d - 1)}}{(m + n) \tilde{C}}), \end{matrix}

(A40)

where

\tilde{C} = 8 {(4)}^{d / (d - 1)}

.

To verify the behavior of bound (A40) in terms of

ϵ

, observe (A35) first; it is not hard to see that this function is decreasing in

ϵ

. However, the function

exp (\frac{- {(t / (2 ϵ))}^{d / (d - 1)}}{(m + n) \tilde{C}})

increases in

ϵ

. Therefore, one can not immediately infer that the bound in (12) is monotonic with respect to

ϵ

. For fixed

N = n + m

, d, and h, the first and second derivatives of the bound (12) with respect to

ϵ

are quite complicated functions. Thus, deriving an explicit optimal solution for the minimization problem with the objective function (12) is not feasible. However, in the sequel, we discuss that under conditions when t is not much larger than

N = m + n

, this bound becomes convex with respect to

ϵ

. Set

\begin{matrix} K (ϵ) = C_{m, n}^{'} (ϵ, h) exp (\frac{- B (t)}{ϵ^{d / (d - 1)}}), \end{matrix}

(A41)

where

C_{m, n}^{'}

is given in (10) and

B (t) = \frac{t^{d / (d - 1)}}{8 {(8)}^{d / (d - 1)} (N)} .

By taking the derivative with respect to

ϵ

, we have

\begin{matrix} \frac{d K (ϵ)}{d ϵ} = K (ϵ) (\frac{d}{d ϵ} (log C_{m, n}^{'}) + \frac{B (t) d / (d - 1)}{ϵ^{(2 d - 1) / (d - 1)}}), \end{matrix}

(A42)

where

\begin{matrix} \frac{d}{d ϵ} (log C_{m, n}^{'}) = \frac{- 4 a_{h} ϵ}{(ϵ - 2 a_{h}) (8 a_{h}^{2} - 8 ϵ a_{h} + ϵ^{2})}, \end{matrix}

(A43)

where

a_{h} = h δ_{m, n}^{h}

. The second derivative

K (ϵ)

with respect to

ϵ

after simplification is given as

\begin{matrix} \frac{d^{2}}{d ϵ^{2}} K (ϵ) = {(\frac{- 4 a_{h} ϵ}{(ϵ - 2 a_{h}) (8 a_{h}^{2} - 8 ϵ a_{h} + ϵ^{2})} + \frac{B (t) \bar{d}}{ϵ^{\bar{d} + 1}})}^{2} \\ + K (ϵ) (\frac{8 a_{h} (8 a_{h}^{3} + ϵ^{2} (ϵ - 5 a_{h}))}{{(8 a_{h}^{2} - 8 a_{h} ϵ + ϵ^{2})}^{2} {(ϵ - 2 a_{h})}^{2}} - \frac{B (t) \bar{d} (\bar{d} + 1)}{ϵ^{\bar{d} + 2}}), \end{matrix}

(A44)

where

\bar{d} = d / (d - 1)

. The first term in (A44) and

K (ϵ)

are non-negative, so

K (ϵ)

is convex if the second term in the second line of (A44) is non-negative. We know that

ϵ \geq h^{2} δ_{m, n}^{h} = h a_{h}

, when

h = 7

, we can parameterize

ϵ

by setting it equal to

γ a_{h}

, where

γ \geq 7

. After simplification,

K (ϵ)

is convex if

\begin{matrix} a_{h}^{\bar{d} - 1} (γ^{\bar{d} - 1} + 3 γ^{\bar{d} - 2}) + B (t) \bar{d} (\bar{d} + 1) \\ \times {a_{h}^{- 1} (- 32 γ^{- 6} + 64 γ^{- 5} - 48 γ^{- 4} + 8 γ^{- 3} - \frac{7}{2} γ^{- 2} + 2 γ^{- 1} - \frac{1}{8}) \\ + a_{h}^{- 2} (32 γ^{- 6} - 64 γ^{- 5} + 40 γ^{- 4} + 8 γ^{- 3} + \frac{1}{2} γ^{- 2})} \geq 0 . \end{matrix}

(A45)

This is implied if

\begin{matrix} 0 \leq B (t) \bar{d} (\bar{d} + 1) a_{h}^{- 1} \\ \times (- 32 γ^{- 6} + 64 γ^{- 5} - 48 γ^{- 4} + 8 γ^{- 3} - \frac{7}{2} γ^{- 2} + 2 γ^{- 1} - \frac{1}{8}), \end{matrix}

(A46)

such that

γ \geq 7

. One can easily check that, as

γ \to \infty

, then (A46) tends to

- \frac{1}{8} B (t) \bar{d} (\bar{d} + 1) a_{h}^{- 1}

. This term can be negligible unless we have t that is much larger than

N = m + n

with the threshold depending on d. Here, by setting

B (t) / a_{h} = 1

, a rough threshold

t = O (7^{d - 1} {(m + n)}^{1 - 1 / d^{2}})

depending on d,

m + n

is proposed. Therefore, minimizing (A35) and (A40) with respect to

ϵ

when optimal

h = 7

is a convex optimization problem. Denote

ϵ^{*}

the solution of the convex optimization problem (9). By plugging optimal h (

h = 7

) and

ϵ

(

ϵ = ϵ^{*})

in (A35) and (A40), we derive (11) and (12), respectively.

In this appendix, we also analyze the bound numerically. By simulation, we observed that lower h i.e.,

h = 7

is the optimal value experimentally. Indeed, this can be verified by Theorem 11’s proof. We address the reader to Lemma A8 in Appendix D and Appendix E where, as h increases, the lower bound for the probability increases too. In other words, for fixed

N = m + n

and d, the lowest h implies the maximum bound in (A92). For this, we set

h = 7

in our experiments. We vary the dimension d and sample size

N = m + n

in relatively large and small ranges. In Table A1, we solve (9) for various values of d and

N = m + n

. We also compute the lower bound for

ϵ

i.e.,

7^{d + 1} N^{1 / d}

per experiment. In Table A1, we observe that as we have higher dimension the optimal value

ϵ^{*}

equals the

ϵ

lower bound

h^{d + 1} N^{1 / d}

, but this is not true for smaller dimensions with even relatively large sample size.

Table A1. d, N,

ϵ^{*}

are dimension, total sample size

m + n

, and optimal

ϵ

for the bound in (12). The column

h^{d + 1} N^{1 / d}

represents approximately the lower bound for

ϵ

which is our constraint in the minimization problem and our assumption in Theorems 5 and 6. Here, we set

h = 7

.

Table A1. d, N,

ϵ^{*}

are dimension, total sample size

m + n

, and optimal

ϵ

for the bound in (12). The column

h^{d + 1} N^{1 / d}

represents approximately the lower bound for

ϵ

which is our constraint in the minimization problem and our assumption in Theorems 5 and 6. Here, we set

h = 7

.

Concentration Bound (11)
$d$	$N = m + n$	$ϵ^{*}$	$t_{0}$	$h^{d + 1} N^{1 / d}$	Optimal (11)
2	$10^{3}$	$1.1424 \times 10^{4}$	$2 \times 10^{7}$	$1.0847 \times 10^{4}$	0.3439
4	$10^{4}$	$1.7746 \times 10^{5}$	$3 \times 10^{10}$	168,070	0.0895
5	550	$4.7236 \times 10^{5}$	$10^{10}$	$4.1559 \times 10^{5}$	0.9929
6	$10^{4}$	$3.8727 \times 10^{6}$	$2 \times 10^{12}$	$3.8225 \times 10^{6}$	0.1637
8	1200	$9.7899 \times 10^{7}$	$12 \times 10^{12}$	$9.7899 \times 10^{7}$	0.7176
10	3500	$4.4718 \times 10^{9}$	$2 \times 10^{15}$	$4.4718 \times 10^{9}$	0.4795
15	$10^{8}$	$1.1348 \times 10^{14}$	$10^{24}$	$1.1348 \times 10^{14}$	0.9042

To validate our proposed bound in (12), we again set

h = 7

and for

d = 4, 5, 7

we ran experiments with sample sizes

N = m + n = 9000, 1100, 140

, respectively. Then, we solved the minimization problem to derive optimal bound for t in the range

10^{10} [1, 3]

. Note that we chose this range to have a non-trivial bound for all three curves; otherwise, the bounds partly become one. Figure A1 shows that when t increases in the given range, the optimal curves approach zero.

Figure A1. Optimal bound for (12), when

h = 7

versus

t \in 10^{10} [1, 3]

. The bound decreases as t grows.

Figure A1. Optimal bound for (12), when

h = 7

versus

t \in 10^{10} [1, 3]

. The bound decreases as t grows.

To prove the Theorem 7 in the concentration of

R_{m, n}

, Theorem 6, let

δ = C_{m, n}^{'} (ϵ^{*}) exp (\frac{- {(t / (2 ϵ^{*}))}^{d / (d - 1)}}{(m + n) \tilde{C}}),

this implies

t = O (ϵ^{*} {(m + n)}^{(d - 1) / d} (log {(C_{m, n}^{'} (ϵ^{*}) / δ))}^{(d - 1) / d})

and the proofs are completed.

Appendix E. Additional Proofs

Lemma A3: Let

g (x)

be a density function with support

{[0, 1]}^{d}

and belong to the Hölder class

Σ_{d} (η, L)

,

0 < η \leq 1

, expressed in Definition 1. In addition, assume that

P (x)

is a

η

-Hölder smooth function, such that its absolute value is bounded from above by some constants c. Define the quantized density function with parameter l and constants

ϕ_{i}

as

\begin{matrix} \hat{g} (x) = \sum_{i = 1}^{M} ϕ_{i} 1 {x \in Q_{i}}, where ϕ_{i} = l^{d} \int_{Q_{i}} g (x) d x, \end{matrix}

(A47)

and

M = l^{d}

and

Q_{i} = {x, x_{i} : ∥ x - x_{i} ∥ < l^{- d}}

. Then,

\begin{matrix} \int ∥ (g (x) - \hat{g} (x)) P (x) ∥ d x \leq O (l^{- d η}) . \end{matrix}

(A48)

Proof.

By the mean value theorem, there exist points

ϵ_{i} \in Q_{i}

such that

\begin{matrix} ϕ_{i} = l^{d} \int_{Q_{i}} g (x) d x = g (ϵ_{i}) . \end{matrix}

Using the fact that

g \in Σ_{d} (η, L)

and

P (x)

is a bounded function, we have

\begin{matrix} \begin{matrix} \int ∥ g (x) - \hat{g} (x)) P (x) ∥ d x & = \sum_{i = 1}^{M} \int_{Q_{i}} ∥ (g (x) - Φ_{i}) P (x) ∥ d x \\ = \sum_{i = 1}^{M} \int_{Q_{i}} ∥ (g (x) - g (ϵ_{i})) P (x) ∥ d x \\ \leq c L \sum_{i = 1}^{M} \int_{Q_{i}} ∥ x - ϵ_{i} ∥^{η} d x . \end{matrix} \end{matrix}

Here, L is the Hölder constant. As

x, ϵ_{i} \in Q_{i}

, a sub-cube with edge length

l^{- 1}

, then

∥ x - ϵ_{i} ∥^{η} = O (l^{- d η})

and

\sum_{i = 1}^{M} \int_{Q_{i}} d x = 1

. This concludes the proof. □

Lemma A4: Let

Δ (x, S)

denote the degree of vertex

x \in S

in the

M S T

over set

S \subset R^{d}

with the n number of vertices. For given function

P (x, x)

, one yields

\int P (x, x) g (x) E [Δ (x, S)] d x = 2 \int P (x, x) g (x) d x + ς_{η} (l, n),

(A49)

where for constant

η > 0

,

\begin{matrix} ς_{η} (l, n) = (O (l / n) - 2 l^{d} / n) \int g (x) P (x, x) d x + O (l^{- d η}) . \end{matrix}

(A50)

Proof.

Recall notations in Lemma A3 and

\begin{matrix} | \int g (x) P (x) d x - \int \hat{g} (x) P (x) d x | \leq \int | (g (x) - \hat{g} (x)) P (x) | d x . \end{matrix}

Therefore, by substituting

\hat{g}

, defined in (A47), into g with considering its error, we have

\begin{matrix} \int P (x, x) g (x) E [Δ (x, S)] d x \\ = \int P (x, x) E [Δ (x, S)] \sum_{i = 1}^{M} ϕ_{i} 1 {x \in Q_{i}} d x + O (l^{- d η}) \\ = \sum_{i = 1}^{M} ϕ_{i} \int_{Q_{i}} P (x, x) E [Δ (x, S)] d x + O (l^{- d η}) . \end{matrix}

(A51)

Here,

Q_{i}

represents as before in Lemma A3, so the RHS of (A51) becomes

\begin{matrix} \begin{matrix} \sum_{i = 1}^{M} ϕ_{i} \int_{Q_{i}} P (x, x) E [Δ (x, S \cap Q_{i})] d x + \sum_{i = 1}^{M} ϕ_{i} \int_{Q_{i}} P (x, x) O (l^{1 - d} / n) + O (l^{- d η}) \\ = \sum_{i = 1}^{M} ϕ_{i} P (x_{i}, x_{i}) \frac{1}{M} \int_{Q_{i}} M E [Δ (x, S \cap Q_{i})] d x + \sum_{i = 1}^{M} ϕ_{i} \int_{Q_{i}} P (x, x) O (l^{1 - d} / n) + 2 O (l^{- d η}) . \end{matrix} \end{matrix}

(A52)

Now, note that

\int_{Q_{i}} M E [Δ (x, S \cap Q_{i})] d x

is the expectation of

E [Δ (x, S \cap Q_{i})]

over the nodes in

Q_{i}

, which is equal to

2 - \frac{2}{k_{i}}

, where

k_{i} = \frac{n}{M}

. Consequently, we have

\begin{matrix} \int P (x, x) g (x) E [Δ (x, S)] d x = (2 - \frac{2 M}{n}) \sum_{i = 1}^{M} ϕ_{i} P (x_{i}, x_{i}) \frac{1}{M} + O (\frac{l^{1 - d}}{n}) \sum_{i = 1}^{M} ϕ_{i} P (x_{i}, x_{i}) + 3 O (l^{- d η}) \\ = 2 \int g (x) P (x, x) d x + 5 O (l^{- d η)}) + M (O (\frac{l^{1 - d}}{n}) - (\frac{2}{n})) \int g (x) P (x, x) d x . \end{matrix}

(A53)

This gives the assertion (A49). □

Lemma A5: Assume that, for given k,

g_{k} (x)

is a bounded function belong to

Σ_{d} (η, L)

. Let

P : R^{d} \times R^{d} \mapsto [0, 1]

be a symmetric, smooth, jointly measurable function, such that, given k, for almost every

x \in R^{d}

,

P (x, .)

is measurable with

x

a Lebesgue point of the function

g_{k} (.) P (x, .)

. Assume that the first derivative P is bounded. For each k, let

Z_{1}^{k}, Z_{2}^{k}, \dots, Z_{k}^{k}

be independent d-dimensional variable with common density function

g_{k}

. Set

Z_{k} = {Z_{1}^{k}, Z_{2}^{k} \dots, Z_{k}^{k}}

and

Z_{k}^{x} = {x, Z_{2}^{k}, Z_{3}^{k} \dots, Z_{k}^{k}}

. Then,

\begin{matrix} E [\sum_{j = 2}^{k} P (x, Z_{j}^{k}) 1 \{(x, Z_{j}^{k}) \in M S T (Z_{k}^{x})\}] \\ = P (x, x) E [Δ (x, Z_{k}^{x})] + \{O (k^{- η / d}) + O (k^{- 1 / d})\} . \end{matrix}

(A54)

Proof.

Let

B (x, r) = {y : ∥ y - x ∥_{d} \leq r}

. For any positive K, we can obtain:

\begin{matrix} E \sum_{j = 2}^{k} | P (x, Z_{j}^{k}) - P (x, x) | 1 \{Z_{j}^{k} \in B (x, K k^{- 1 / d})\} \\ = (k - 1) \int_{B (x; K k^{- 1 / d})} | (P (x, y) g_{k} (y) - P (x, x) g_{k} (x)) + P (x, x) (g_{k} (x) - g_{k} (y)) | d y \\ \leq (k - 1) [\int_{B (x; K k^{- 1 / d})} | (P (x, y) g_{k} (y) - P (x, x) g_{k} (x)) | d y + O (k^{- η / d}) V (B (x, K k^{- 1 / d})], \end{matrix}

(A55)

where

V

is the volume of space

B

which equals

O (k^{- 1})

. Note that the above inequality appears because

g_{k} (x) \in Σ_{d} (η, L)

and

P (x, x) \in [0, 1]

. The first order Taylor series expansion of

P (x, y)

around

x

is

\begin{matrix} \begin{matrix} P (x, y) & = P (x, x) + P^{(1)} (x, x) ∥ y - x ∥ + o ({∥ y - x ∥}^{2}) \\ = P (x, x) + O (k^{- 1 / d}) + o (k^{- 2 / d}) . \end{matrix} \end{matrix}

Then, by recalling the Hölder class, we have

\begin{matrix} \begin{matrix} | P (x, y) g_{k} (y) - P (x, x) g_{k} (x) | & = | (P (x, x) + O (k^{- 1 / d})) (g_{k} (x) + O (k^{- η / d})) - P (x, x) g_{k} (x) | \\ = O (k^{- η / d}) + O (k^{- 1 / d}) . \end{matrix} \end{matrix}

Hence, the RHS of (A55) becomes

\begin{matrix} \begin{matrix} (k - 1) [(O (k^{- η / d}) + O (k^{- 1 / d})) V (B (x, K k^{- 1 / d})) + O (k^{- η / d}) V (B (x, K k^{- 1 / d}))] \\ = (k - 1) [O (k^{- 1 - η / d}) + O (k^{- 1 - 1 / d})] . \end{matrix} \end{matrix}

The expression in (A54) can be obtained by choice of K. □

Lemma A6: Consider the notations and assumptions in Lemma A5. Then,

\begin{matrix} | k^{- 1} \underset{1 \leq i < j \leq k}{\sum \sum} P (Z_{i}^{k}, Z_{j}^{k}) 1 {(Z_{i}^{k}, Z_{j}^{k}) \in M S T (Z_{k}) - \int_{R^{d}} P (x, x) g_{k} (x) d x | \\ \leq ς_{η} (l, k) + O (k^{- η / d}) + O (k^{- 1 / d}) . \end{matrix}

(A56)

Here,

M S T (S)

denotes the MST graph over nice and finite set

S \subset R^{d}

and

η

is the smoothness Hölder parameter. Note that

ς_{η} (l, k)

is given as before in (A50).

Proof.

Following notations in [49], let

Δ (x, S)

denote the degree of vertex

x

in the

M S T (S)

graph. Moreover, let

x

be a Lebesgue point of

g_{k}

with

g_{k} (x) > 0

. In addition, let

Z_{k}^{x}

be the point process

{x, Z_{2}^{k}, Z_{3}^{k}, \dots, Z_{k}^{k}}

. Now, by virtue of (A55) in Lemma A5, we can write

\begin{matrix} \begin{matrix} E [\sum_{j = 2}^{k} P (x, Z_{j}^{k}) 1 {(x, Z_{j}^{k}) \in M S T (Z_{k}^{x})}] = P (x, x) E [Δ (x, Z_{k}^{x})] + \{O (k^{- η / d}) + O (k^{- 1 / d})\} . \end{matrix} \end{matrix}

(A57)

On the other hand, it can be seen that

\begin{matrix} k^{- 1} E [\underset{1 \leq i < j \leq k}{\sum \sum} P (Z_{i}^{k}, Z_{j}^{k}) 1 {(Z_{i}^{k}, Z_{j}^{k}) \in M S T (Z_{k})}] \\ = \frac{1}{2} E [\sum_{j = 2}^{k} P (Z_{1}^{k}, Z_{j}^{k}) 1 {(Z_{i}^{k}, Z_{j}^{k}) \in M S T (Z_{k})}] \\ = \frac{1}{2} \int g_{k} (x) d x E [\sum_{j = 2}^{k} P (x, Z_{j}^{k}) 1 {(x, Z_{j}^{k}) \in M S T (Z_{k})}] . \end{matrix}

(A58)

Recalling (A57),

\begin{matrix} = \frac{1}{2} \int g_{k} (x) P (x, x) E [Δ (x, Z_{k}^{x})] d x + O (k^{- η / d}) + O (k^{- 1 / d}) . \end{matrix}

(A59)

By virtue of Lemma A4, (A49) can be substituted into expression (A59) to obtain (A56). □

Theorem A1: Assume

R_{m, n} : = R (X_{m}, Y_{n})

denotes the FR test statistic as before. Then, the rate for the bias of the

R_{m, n}

estimator for

0 < η \leq 1

,

d \geq 2

is of the form:

\begin{matrix} | \frac{E [R_{m, n}]}{m + n} - 2 p q \int \frac{f_{0} (x) f_{1} (x)}{p f_{0} (x) + q f_{1} (x)} d x | \leq O (l^{d} {(m + n)}^{- η / d}) + O (l^{- d η}) . \end{matrix}

(A60)

Here,

η

is the Holder smoothness parameter. A more explicit form for the bound on the RHS is given in (A61) below:

\begin{matrix} | \frac{E [R_{m, n}^{'} (X_{m}, Y_{n})]}{m + n} - \int \frac{2 p q f_{0} (x) f_{1} (x)}{p f_{0} (x) + q f_{1} (x)} d x | \leq O (l^{d} {(m + n)}^{- η / d}) \\ + O (l^{d} {(m + n)}^{- 1 / 2}) + 2 c_{1} l^{d - 1} {(m + n)}^{(1 / d) - 1} + c_{d} 2^{d} {(m + n)}^{- 1} \\ - 2 l^{d} {(m + n)}^{- 1} \int \frac{2 p q f_{0} (x) f_{1} (x)}{p f_{0} (x) + q f_{1} (x)} d x + c_{2} {(m + n)}^{- 1} l^{d} \\ + O (l) {(m + n)}^{- 1} \sum_{i = 1}^{M} l^{d} {(a_{i})}^{- 1} \int \frac{2 f_{0} (x) f_{1} (x)}{p f_{0} (x) + q f_{1} (x)} d x + O (l^{- d η}) \\ + O (l) \sum_{i = 1}^{M} l^{d / 2} \frac{\sqrt{b_{i}}}{a_{i}^{2}} \int \frac{2 f_{0} (x) f_{1} (x) (f_{0} (x) \sqrt{m} + f_{1} (x) \sqrt{n})}{{(m f_{0} (x) + n f_{1} (x))}^{2}} d x \\ + \sum_{i = 1}^{M} 2 l^{- d / 2} \frac{\sqrt{b_{i}}}{a_{i}^{2}} \int \frac{f_{0} (x) f_{1} (x) {(α_{i} β_{i} (m a_{i} f_{0}^{2} (x) + n b_{i} f_{1}^{2} (x)))}^{1 / 2}}{{(m f_{0} (x) + n f_{1} (x))}^{2} (m + n)} d x . \end{matrix}

(A61)

Proof.

Assume

M_{m}

and

N_{n}

be Poisson variables with mean m and n, respectively, one independent of another and of

{X_{i}}

and

{Y_{j}}

. Let also

X_{m}^{'}

and

Y_{n}^{'}

be the Poisson processes

{X_{1}, \dots, X_{M_{n}}}

and

{Y_{1}, \dots, Y_{N_{n}}}

. Set

R_{m, n}^{'} : = R_{m, n} (X_{m}^{'}, Y_{n}^{'})

. Applying Lemma 1, and (12) cf. [49], we can write

\begin{matrix} | R_{m, n}^{'} - R_{m, n} | \leq K_{d} (| M_{m} - m | + | N_{n} - n |) . \end{matrix}

(A62)

Here,

K_{d}

denotes the largest possible degree of any vertex of the MST graph in

R^{d}

. Moreover, by the matter of Poisson variable fact and using Stirling approximation [51], we have

E [| M_{m} - m |] = e^{- m} \frac{m^{m + 1}}{m!} \leq e^{- m} \frac{m^{m + 1}}{\sqrt{2 π} m^{m + 1 / 2} e^{- m}} = O (m^{1 / 2}) .

(A63)

Similarly,

E [| N_{n} - n |] = O (n^{1 / 2})

. Therefore, by (A62), one yields

\begin{matrix} \begin{matrix} E [R_{m, n}] = E [R_{m, n} - R_{m, n}^{'}] + E [R_{m, n}^{'}] = O ({(m + n)}^{1 / 2}) + E [R_{m, n}^{'}] . \end{matrix} \end{matrix}

(A64)

Therefore,

\begin{matrix} \frac{E [R_{m, n}]}{m + n} = \frac{E [R_{m, n}^{'}]}{m + n} + O ({(m + n)}^{- 1 / 2}) . \end{matrix}

(A65)

Hence, it will suffice to obtain the rate of convergence of

E [R_{m, n}^{'}] / (m + n)

in the RHS of (A65). For this, let

m_{i}

,

n_{i}

denote the number of Poisson process samples

X_{m}^{'}

and

Y_{n}^{'}

with the FR statistic

R_{m, n}^{'}

, falling into partitions

Q_{i}^{'}

with FR statistic

R_{m_{i}, n_{i}}^{'}

. Then, by virtue of Lemma 4, we can write

\begin{matrix} E [R_{m, n}^{'}] \leq \sum_{i = 1}^{M} E [R_{m_{i}, n_{i}}^{'}] + 2 c_{1} l^{d - 1} {(m + n)}^{1 / d} . \end{matrix}

Note that the Binomial RVs

m_{i}

,

n_{i}

are independent with marginal distributions

m_{i} \sim B (m, a_{i} l^{- d})

,

n_{i} \sim B (n, b_{i} l^{- d})

, where

a_{i}

,

b_{i}

are non-negative constants satisfying,

\forall i, a_{i} \leq b_{i}

and

\sum_{i = 1}^{l^{d}} a_{i} l^{- d} = \sum_{i = 1}^{l^{d}} b_{i} l^{- d} = 1

. Therefore,

E [R_{m, n}^{'}] \leq \sum_{i = 1}^{M} E [E [R_{m_{i}, n_{i}}^{'} | m_{i}, n_{i}]] + 2 c_{1} l^{d - 1} {(m + n)}^{1 / d} .

(A66)

Let us first compute the internal expectation given

m_{i}

,

n_{i}

. For this reason, given

m_{i}

,

n_{i}

, let

Z_{1}^{m_{i}, n_{i}}, Z_{2}^{m_{i}, n_{i}}, \dots

be independent variables with common densities

g_{m_{i}, n_{i}} (x) = (m_{i} f_{0} (x) + n_{i} f_{1} (x)) / (m_{i} + n_{i})

,

x \in R^{d}

. Moreover, let

L_{m_{i}, n_{i}}

be an independent Poisson variable with mean

m_{i} + n_{i}

. Denote

F_{m_{i}, n_{i}}^{'} = {Z_{1}^{m_{i}, n_{i}}, \dots, Z_{L_{m_{i} . n_{i}}}^{m_{i}, n_{i}}}

a non-homogeneous Poisson of rate

m_{i} f_{0} + n_{i} f_{1}

. Let

F_{m_{i}, n_{i}}

be the non-Poisson point process

{Z_{1}^{m_{i}, n_{i}}, \dots Z_{m_{i} + n_{i}}^{m_{i}, n_{i}}}

. Assign a mark from the set

{1, 2}

to each points of

F_{m_{i}, n_{i}}^{'}

. Let

{\tilde{X}}_{m_{i}}^{'}

be the sets of points marked 1 with each probability

m_{i} f_{0} (x) / (m_{i} f_{0} (x) + n_{i} f_{i} (x))

and let

{\tilde{Y}}_{n_{i}}^{'}

be the set points with mark 2. Note that owing to the marking theorem [52],

{\tilde{X}}_{m_{i}}^{'}

and

{\tilde{Y}}_{n_{i}}^{'}

are independent Poisson processes with the same distribution as

X_{m_{i}}^{'}

and

Y_{n_{i}}^{'}

, respectively. Considering

{\tilde{R}}_{m_{i} . n_{i}}^{'}

as FR statistic over nodes in

{\tilde{X}}_{m_{i}}^{'} \cup {\tilde{Y}}_{n_{i}}^{'}

we have

\begin{matrix} E [R_{m_{i}, n_{i}}^{'} | m_{i}, n_{i}] = E [{\tilde{R}}_{m_{i}, n_{i}}^{'} | m_{i}, n_{i}] . \end{matrix}

Again using Lemma 1 and analogous arguments in [49] along with the fact that

E [| M_{m} + N_{n} - m - n |] = O ({(m + n)}^{1 / 2})

, we have

\begin{matrix} \begin{matrix} E [{\tilde{R}}_{m_{i}, n_{i}}^{'} | m_{i}, n_{i}] = E [E [{\tilde{R}}_{m_{i}, n_{i}}^{'} | F_{m_{i}, n_{i}}^{'}]] \\ = E [\underset{s < j < m_{i} + n_{i}}{\sum \sum} P_{m_{i}, n_{i}} (Z_{s}^{m_{i}, n_{i}}, Z_{j}^{m_{i}, n_{i}}) 1 \{(Z_{s}^{m_{i}, n_{i}}, Z_{j}^{m_{i}, n_{i}}) \in F_{m_{i}, n_{i}}\}] + O ({(m_{i} + n_{i})}^{1 / 2})) . \end{matrix} \end{matrix}

Here,

\begin{matrix} \begin{matrix} P_{m_{i}, n_{i}} (x, y) : = P_{r} {mark x \neq mark y, (x, y) \in F_{m_{i}, n_{i}}^{'}} \\ = \frac{m_{i} f_{0} (x) n_{i} f_{1} (y) + n_{i} f_{1} (x) m_{i} f_{0} (y)}{(m_{i} f_{0} (x) + n_{i} f_{1} (x)) (m_{i} f_{0} (y) + n_{i} f_{1} (y))} . \end{matrix} \end{matrix}

By owing to Lemma A6, we obtain

\begin{matrix} \begin{matrix} \sum_{i = 1}^{M} E_{m_{i}, n_{i}} E [\underset{s < j < m_{i} + n_{i}}{\sum \sum} P_{m_{i}, n_{i}} (Z_{s}^{m_{i}, n_{i}}, Z_{j}^{m_{i}, n_{i}}) 1 \{(Z_{s}^{m_{i}, n_{i}}, Z_{j}^{m_{i}, n_{i}}) \in F_{m_{i}, n_{i}}\}] + \sum_{i = 1}^{M} E_{m_{i}, n_{i}} [O {((m_{i} + n_{i}))}^{1 / 2}] \\ = \sum_{i = 1}^{M} E_{m_{i}, n_{i}} [(m_{i} + n_{i}) \int g_{m_{i}, n_{i}} (x, x) P_{m_{i}, n_{i}} (x, x) d x + (ς_{η} (l, m_{i}, n_{i}) + O ({(m_{i} + n_{i})}^{- η / d}) \\ + O ({(m_{i} + n_{i})}^{- 1 / d})) (m_{i} + n_{i})] + \sum_{i = 1}^{M} E_{m_{i}, n_{i}} [O ({(m_{i} + n_{i})}^{1 / 2})], \end{matrix} \end{matrix}

(A67)

where

\begin{matrix} ς_{η} (l, m_{i}, n_{i}) = (O (l / (m_{i} + n_{i})) - 2 l^{d} / (m_{i} + n_{i})) \int g_{m_{i}, n_{i}} (x) P_{m_{i}, n_{i}} (x, x) d x + O (l^{- d η}) . \end{matrix}

The expression in (A67) equals

\begin{matrix} \sum_{i = 1}^{M} \int E_{m_{i}, n_{i}} [\frac{2 m_{i} n_{i} f_{0} (x) f_{1} (x)}{m_{i} f_{0} (x) + n_{i} f_{1} (x)}] d x + \sum_{i = 1}^{M} E_{m_{i}, n_{i}} [(m_{i} + n_{i}) ς_{η} (l, m_{i}, n_{i})] \\ + O (l^{d} {(m + n)}^{1 - η / d}) + O (l^{d} {(m + n)}^{1 / 2}) . \end{matrix}

(A68)

Because of Jensen inequality for concave function:

\begin{matrix} \begin{matrix} \sum_{i = 1}^{M} E_{m_{i}, n_{i}} [O ({(m_{i} + n_{i})}^{1 / 2})] = \sum_{i = 1}^{M} O {(E [m_{i}] + E [n_{i}])}^{1 / 2} \\ = \sum_{i = 1}^{M} O {(m a_{i} l^{- d} + n b_{i} l^{- d})}^{1 / 2} = O (l^{d} {(m + n)}^{1 / 2}) . \end{matrix} \end{matrix}

In addition, similarly since

η < d

, we have

\begin{matrix} \sum_{i = 1}^{M} E_{m_{i}, n_{i}} [O ({(m_{i} + n_{i})}^{1 - η / d})] = O (l^{d} {(m + n)}^{1 - η / d}), \end{matrix}

(A69)

and, for

d \geq 2

, one yields

\begin{matrix} \begin{matrix} \sum_{i = 1}^{M} E_{m_{i}, n_{i}} [O ({(m_{i} + n_{i})}^{1 - 1 / d})] = O (l^{d} {(m + n)}^{1 - 1 / d}) = O (l^{d} {(m + n)}^{1 / 2}) . \end{matrix} \end{matrix}

(A70)

Next, we state the following lemma (Lemma 1 from [30,31]), which will be used in the sequel:

Lemma A13.

Let

k (x)

be a continuously differential function of

x \in R

which is convex and monotone decreasing over

x \geq 0

. Set

k^{'} (x) = \frac{d k (x)}{d x}

. Then, for any

x_{0} > 0

, we have

\begin{matrix} k (x_{0}) + \frac{k (x_{0})}{x_{0}} | x - x_{0} | \geq k (x) \geq k (x_{0}) - k^{'} (x_{0}) | x - x_{0} | . \end{matrix}

(A71)

Next, continuing the proof of (A60), we attend to find an upper bound for

\begin{matrix} E_{m_{i}, n_{i}} [\frac{m_{i} n_{i}}{m_{i} f_{0} (x) + n_{i} f_{1} (x)}] . \end{matrix}

(A72)

In order to pursue this aim, in Lemma A13, consider

k (x) = \frac{1}{x}

and

x_{0} = E_{m_{i}, n_{i}} [m_{i} f_{0} (x) + n_{i} f_{1} (x)]

, therefore as the function

k (x)

is decreasing and convex, one can write

\begin{matrix} \frac{1}{m_{i} f_{0} (x) + n_{i} f_{1} (x)} \leq \frac{1}{E_{m_{i}, n_{i}} [m_{i} f_{0} (x) + n_{i} f_{1} (x)]} + \frac{| m_{i} f_{0} (x) + n_{i} f_{1} (x) - E_{m_{i}, n_{i}} [m_{i} f_{0} (x) + n_{i} f_{1} (x)] |}{E_{m_{i}, n_{i}}^{2} [m_{i} f_{0} (x) + n_{i} f_{1} (x)]} . \end{matrix}

(A73)

Using the Hölder inequality implies the following inequality:

\begin{matrix} \begin{matrix} E_{m_{i}, n_{i}} [\frac{m_{i} n_{i}}{m_{i} f_{0} (x) + n_{i} f_{1} (x)}] \leq \frac{E_{m_{i}, n_{i}} [m_{i} n_{i}]}{E_{m_{i}, n_{i}} [m_{i} f_{0} (x) + n_{i} f_{1} (x)]} \\ + \frac{{(E_{m_{i}, n_{i}} [m_{i}^{2} n_{i}^{2}])}^{1 / 2}}{E_{m_{i}, n_{i}}^{2} [m_{i} f_{0} (x) + n_{i} f_{1} (x)]} \times {(E_{m_{i}, n_{i}} {[m_{i} f_{0} (x) + n_{i} f_{1} (x) - E_{m_{i}, n_{i}} [m_{i} f_{0} (x) + n_{i} f_{1} (x)]]}^{2})}^{1 / 2} . \end{matrix} \end{matrix}

(A74)

As random variables

m_{i}

,

n_{i}

are independent, and because of

V [m_{i}] \leq m a_{i} l^{- d}

,

V [n_{i}] \leq n b_{i} l^{- d}

, we can claim that the RHS of (A74) becomes less than and equal to

\begin{matrix} \begin{matrix} \frac{m n a_{i} b_{i} l^{- 2 d}}{m a_{i} l^{- d} f_{0} (x) + n b_{i} l^{- d} f_{1} (x)} + \frac{{(α_{i} β_{i} (m a_{i} l^{- d} f_{0}^{2} (x) + n b_{i} l^{- d} f_{1}^{2} (x)))}^{1 / 2}}{{(m a_{i} f_{0} (x) + n b_{i} f_{1} (x))}^{2}}, \end{matrix} \end{matrix}

(A75)

where

\begin{matrix} \begin{matrix} α_{i} = m a_{i} l^{d} (1 - a_{i} l^{- d}) + m^{2} a_{i}^{2}, \\ β_{i} = n b_{i} l^{d} (1 - b_{i} l^{- d}) + n^{2} b_{i}^{2} . \end{matrix} \end{matrix}

Going back to (A66), we have

\begin{matrix} E [{R^{'}}_{m, n} (X_{m}, Y_{n})] \leq \sum_{i = 1}^{M} a_{i} b_{i} l^{- d} \int \frac{2 m n f_{0} (x) f_{1} (x)}{m a_{i} f_{0} (x) + n b_{i} f_{1} (x)} d x \\ + \sum_{i = 1}^{M} 2 \int \frac{f_{0} (x) f_{1} (x) {(α_{i} β_{i} (m a_{i} l^{- d} f_{0}^{2} (x) + n b_{i} l^{- d} f_{1}^{2} (x)))}^{1 / 2}}{{(m a_{i} f_{0} (x) + n b_{i} f_{1} (x))}^{2}} d x \\ + \sum_{i = 1}^{M} E_{m_{i}, n_{i}} [(m_{i} + n_{i}) ς_{η} (l, m_{i}, n_{i})] + O (l^{d} {(m + n)}^{1 - η / d}) \\ + O (l^{d} {(m + n)}^{1 / 2}) + 2 c_{1} l^{d - 1} {(m + n)}^{1 / d} . \end{matrix}

(A76)

Finally, owing to

a_{i} \leq b_{i}

and

\sum_{i = 1}^{M} b_{i} l^{- d} = 1

, when

\frac{m}{m + n} \to p

, we have

\begin{matrix} \frac{E [{R^{'}}_{m, n} (X_{m}, Y_{n})]}{m + n} \leq \int \frac{2 p q f_{0} (x) f_{1} (x)}{p f_{0} (x) + q f_{1} (x)} d x \\ + \sum_{i = 1}^{M} 2 \int \frac{f_{0} (x) f_{1} (x) {(α_{i} β_{i} (m a_{i} l^{- d} f_{0}^{2} (x) + n b_{i} l^{- d} f_{1}^{2} (x)))}^{1 / 2}}{{(m a_{i} f_{0} (x) + n b_{i} f_{1} (x))}^{2} (m + n)} d x \\ + \frac{1}{m + n} \sum_{i = 1}^{M} E_{m_{i}, n_{i}} [(m_{i} + n_{i}) ς_{η} (l, m_{i}, n_{i})] + O (l^{d} {(m + n)}^{- η / d}) \\ + O (l^{d} {(m + n)}^{- 1 / 2}) + 2 c_{1} l^{d - 1} {(m + n)}^{(1 / d) - 1} . \end{matrix}

(A77)

Passing to Definition 2,

{MST}^{*}

, and Lemma A2, a similar discussion as above, consider the Poisson processes samples and the FR statistic under the union of samples, denoted by

{R^{'}}_{m, n}^{*}

, and superadditivity of dual

R_{m, n}^{*}

, we have

\begin{matrix} \begin{matrix} E [{R^{'}}_{m, n}^{*} (X_{m}, Y_{n})] \geq \sum_{i = 1}^{M} E [{R^{'}}_{m_{i}, n_{i}}^{*} ((X_{m}, Y_{n}) \cap Q_{i})] - c_{2} l^{d} \\ = \sum_{i = 1}^{M} E_{m_{i}, n_{i}} [E [{R^{'}}_{m_{i}, n_{i}}^{*} ((X_{m}, Y_{n}) \cap Q_{i}) | m_{i}, n_{i}]] - c_{2} l^{d} \\ \geq \sum_{i = 1}^{M} E_{m_{i}, n_{i}} [E [{R^{'}}_{m_{i}, n_{i}} ((X_{m}, Y_{n}) \cap Q_{i}) | m_{i}, n_{i}]] - c_{2} l^{d}, \end{matrix} \end{matrix}

(A78)

the last line is derived from Lemma A2, (ii), inequality (A8). Owing to the Lemma A6, (A69), and (A70), one obtains

\begin{matrix} E [{R^{'}}_{m, n}^{*} (X_{m}, Y_{n})] \geq \sum_{i = 1}^{M} \int E_{m_{i}, n_{i}} [\frac{2 m_{i} n_{i} f_{0} (x) f_{1} (x)}{m_{i} f_{0} (x) + n_{i} f_{1} (x)}] d x \\ - \sum_{i = 1}^{M} E_{m_{i}, n_{i}} [(m_{i} + n_{i}) ς_{η} (l, m_{i}, n_{i})] - O (l^{d} {(m + n)}^{1 - η / d}) - O (l^{d} {(m + n)}^{1 / 2}) - c_{2} l^{d} . \end{matrix}

(A79)

Furthermore, by using the Jenson’s inequality, we get

\begin{matrix} \begin{matrix} E_{m_{i}, n_{i}} [\frac{m_{i} n_{i}}{m_{i} f_{0} (x) + n_{i} f_{1} (x)}] \geq \frac{E [m_{i}] E [n_{i}]}{E [m_{i}] f_{0} (x) + E [n_{i}] f_{1} (x)} = \frac{l^{- d} (m a_{i} n b_{i})}{m a_{i} f_{0} (x) + n b_{i} f_{1} (x)} . \end{matrix} \end{matrix}

Therefore, since

a_{i} \leq b_{i}

, we can write

\begin{matrix} E_{m_{i}, n_{i}} [\frac{m_{i} n_{i}}{m_{i} f_{0} (x) + n_{i} f_{1} (x)}] \geq \frac{l^{- d} m n a_{i} b_{i}}{b_{i} (m f_{0} (x) + n f_{1} (x))} = \frac{l^{- d} m n a_{i}}{(m f_{0} (x) + n f_{1} (x))} . \end{matrix}

(A80)

Consequently, the RHS of (A79) becomes greater than or equal to

\begin{matrix} \sum_{i = 1}^{M} a_{i} l^{- d} \int \frac{2 m n f_{0} (x) f_{1} (x)}{m f_{0} (x) + n f_{1} (x)} d x \\ - \sum_{i = 1}^{M} E_{m_{i}, n_{i}} [(m_{i} + n_{i}) ς_{η} (l, m_{i}, n_{i})] - O (l^{d} {(m + n)}^{1 - η / d}) - O (l^{d} {(m + n)}^{1 / 2}) - c_{2} l^{d} . \end{matrix}

(A81)

Finally, since

\sum_{i = 1}^{M} a_{i} l^{- d} = 1

and

\frac{m}{m + n} \to p

, we have

\begin{matrix} \frac{E [{R^{'}}_{m, n}^{*} (X_{m}, Y_{n})]}{m + n} \geq \int \frac{2 p q f_{0} (x) f_{1} (x)}{p f_{0} (x) + q f_{1} (x)} d x - {(m + n)}^{- 1} \sum_{i = 1}^{M} E_{m_{i}, n_{i}} [(m_{i} + n_{i}) ς (l, m_{i}, n_{i})] \\ - O (l^{d} {(m + n)}^{- η / d}) - O (l^{d} {(m + n)}^{- 1 / 2}) - c_{2} l^{d} {(m + n)}^{- 1} . \end{matrix}

(A82)

By definition of the dual

R_{m, n}^{*}

and (i) in Lemma A2,

\begin{matrix} \frac{E [R_{m, n}^{'} (X_{m}, Y_{n})]}{m + n} + \frac{c_{d} 2^{d}}{m + n} \geq \frac{E [{R^{'}}_{m, n}^{*} (X_{m}, Y_{n})]}{m + n}, \end{matrix}

(A83)

we can imply

\begin{matrix} \begin{matrix} \frac{E [{R^{'}}_{m, n} (X_{m}, Y_{n})]}{m + n} \geq \int \frac{2 p q f_{0} (x) f_{1} (x)}{p f_{0} (x) + q f_{1} (x)} d x - {(m + n)}^{- 1} \sum_{i = 1}^{M} E_{m_{i}, n_{i}} [(m_{i} + n_{i}) ς_{η} (l, m_{i}, n_{i})] \\ - O (l^{d} {(m + n)}^{- η / d}) - O (l^{d} {(m + n)}^{- 1 / 2}) - c_{2} l^{d} {(m + n)}^{- 1} - c_{d} 2^{d} {(m + n)}^{- 1} . \end{matrix} \end{matrix}

(A84)

The combination of two lower and upper bounds (A84) and (A77) yields the following result

\begin{matrix} | \frac{E [R_{m, n}^{'} (X_{m}, Y_{n})]}{m + n} - \int \frac{2 p q f_{0} (x) f_{1} (x)}{p f_{0} (x) + q f_{1} (x)} d x | \\ \leq O (l^{d} {(m + n)}^{- η / d}) + O (l^{d} {(m + n)}^{- 1 / 2}) + 2 c_{1} l^{d - 1} {(m + n)}^{(1 / d) - 1} \\ + c_{d} 2^{d} {(m + n)}^{- 1} + c_{2} {(m + n)}^{- 1} l^{d} + \frac{1}{m + n} \sum_{i = 1}^{M} E_{m_{i}, n_{i}} [(m_{i} + n_{i}) ς_{η} (l, m_{i}, n_{i})] \\ + \sum_{i = 1}^{M} 2 \int \frac{f_{0} (x) f_{1} (x) {(α_{i} β_{i} (m a_{i} l^{- d} f_{0}^{2} (x) + n b_{i} l^{- d} f_{1}^{2} (x)))}^{1 / 2}}{{(m a_{i} f_{0} (x) + n b_{i} f_{1} (x))}^{2} (m + n)} d x . \end{matrix}

(A85)

Recall

ς_{η} (l, m_{i}, n_{i})

, then we obtain

\begin{matrix} \begin{matrix} \sum_{i = 1}^{M} E_{m_{i}, n_{i}} [(m_{i} + n_{i}) ς_{η} (l, m_{i}, n_{i})] = \sum_{i = 1}^{M} O (l) \int E [\frac{2 m_{i} n_{i} f_{0} (x) f_{1} (x)}{(m_{i} + n_{i}) (m_{i} f_{0} (x) + n_{i} f_{1} (x))}] d x \\ - 2 l^{d} \sum_{i = 1}^{M} \int E [\frac{2 m_{i} n_{i} f_{0} (x) f_{1} (x)}{(m_{i} + n_{i}) (m_{i} f_{0} (x) + n_{i} f_{1} (x))}] d x + O (l^{- η}) \sum_{i = 1}^{M} E_{m_{i}, n_{i}} [m_{i} + n_{i}] . \end{matrix} \end{matrix}

(A86)

In addition, we have

\begin{matrix} \begin{matrix} E_{m_{i}, n_{i}} [\frac{2 m_{i} n_{i} f_{0} (x) f_{1} (x)}{(m_{i} + n_{i}) (m_{i} f_{0} (x) + n_{i} f_{1} (x))}] \geq \frac{1}{m + n} E_{m_{i}, n_{i}} [\frac{2 m_{i} n_{i} f_{0} (x) f_{1} (x)}{(m_{i} f_{0} (x) + n_{i} f_{1} (x))}] . \end{matrix} \end{matrix}

(A87)

This implies

\begin{matrix} \begin{matrix} \sum_{i = 1}^{M} \int E [\frac{2 m_{i} n_{i} f_{0} (x) f_{1} (x)}{(m_{i} + n_{i}) (m_{i} f_{0} (x) + n_{i} f_{1} (x))}] d x \geq \int \frac{2 p q f_{0} (x) f_{1} (x)}{p f_{0} (x) + q f_{1} (x)} d x . \end{matrix} \end{matrix}

(A88)

Note that the above inequality is derived from (A80) and

\frac{m}{m + n} \to p

. Furthermore,

\begin{matrix} \frac{1}{m + n} \sum_{i = 1}^{M} O (l) \int E_{m_{i}, n_{i}} [\frac{2 m_{i} n_{i} f_{0} (x) f_{1} (x)}{(m_{i} + n_{i}) (m_{i} f_{0} (x) + n_{i} f_{1} (x))}] d x \\ \leq \sum_{i = 1}^{M} O (l) \int E_{m_{i}, n_{i}} [\frac{2 m_{i} n_{i} f_{0} (x) f_{1} (x)}{{(m_{i} + n_{i})}^{2} (m_{i} f_{0} (x) + n_{i} f_{1} (x))}] d x \\ \leq \sum_{i = 1}^{M} O (l) \int E_{m_{i}, n_{i}} [\frac{2 f_{0} (x) f_{1} (x)}{(m_{i} f_{0} (x) + n_{i} f_{1} (x))}] d x . \end{matrix}

(A89)

The last line holds because of

m_{i} n_{i} \leq {(m_{i} + n_{i})}^{2}

. Going back to (A73), we can give an upper bound for the RHS of above inequality as

\begin{matrix} \begin{matrix} E_{m_{i}, n_{i}} [{(m_{i} f_{0} (x) + n_{i} f_{1} (x))}^{- 1}] \leq {(m a_{i} l^{- d} f_{0} (x) + n b_{i} l^{- d} f_{1} (x))}^{- 1} \\ + (E_{m_{i}, n_{i}} | m_{i} f_{0} (x) + n_{i} f_{1} (x) - (E [m_{i}] f_{0} (x) + E [n_{i}] f_{1} (x) |) / {(m a_{i} l^{- d} f_{0} (x) + n b_{i} l^{- d} f_{1} (x))}^{2} . \end{matrix} \end{matrix}

Note that we have assumed

a_{i} \leq b_{i}

and by using Hölder inequality we write

\begin{matrix} \begin{matrix} E_{m_{i}, n_{i}} [{(m_{i} f_{0} (x) + n_{i} f_{1} (x))}^{- 1}] \leq l^{d} {(a_{i})}^{- 1} {(m f_{0} (x) + n f_{1} (x))}^{- 1} \\ + (f_{0} (x) \sqrt{V (m_{i})} + f_{1} (x) \sqrt{V (n_{i})}) / (a_{i}^{2} l^{- d} {(m f_{0} (x) + n f_{1} (x))}^{2}) \leq l^{d} {(a_{i})}^{- 1} {(m f_{0} (x) + n f_{1} (x))}^{- 1} \\ + l^{- d / 2} \sqrt{b_{i}} (f_{0} (x) \sqrt{m} + f_{1} (x) \sqrt{n}) / (a_{i}^{2} l^{- d} {(m f_{0} (x) + n f_{1} (x))}^{2}) . \end{matrix} \end{matrix}

(A90)

As result, we have

\begin{matrix} \sum_{i = 1}^{M} O (l) \int E_{m_{i}, n_{i}} [\frac{2 f_{0} (x) f_{1} (x)}{(m_{i} f_{0} (x) + n_{i} f_{1} (x))}] d x \\ \leq \sum_{i = 1}^{M} O (l) \int l^{d} {(a_{i})}^{- 1} \frac{2 f_{0} (x) f_{1} (x)}{m f_{0} (x) + n f_{1} (x)} d x \\ + \sum_{i = 1}^{M} O (l) \int l^{- d / 2} \sqrt{b_{i}} \frac{2 f_{0} (x) f_{1} (x) (f_{0} (x) \sqrt{m} + f_{1} (x) \sqrt{n})}{a_{i}^{2} l^{- d} {(m f_{0} (x) + n f_{1} (x))}^{2}} d x . \end{matrix}

(A91)

As a consequence, owing to (A85), for

0 < η \leq 1

,

d \geq 2

, which implies

η \leq d - 1

, we can derive (A61). Thus, the proof can be concluded by giving the summarized bound in (A60). □

Lemma A8: For

h = 1, 2, \dots

, let

δ_{m, n}^{h}

be the function

c h^{d - 1} {(m + n)}^{1 / d}

. Then, for

ϵ > 0

, we have

P (R_{m, n} (X_{m}, Y_{n}) \leq \sum_{i = 1}^{h^{d}} R_{m_{i}, n_{i}} (X_{m_{i}}, Y_{n_{i}}) + 2 ϵ) \geq \frac{ϵ - δ_{m, n}^{h}}{ϵ} .

(A92)

Note that in case

ϵ \leq δ_{m, n}^{h}

the above claimed inequality is trivial.

Proof.

Consider the cardinality of the set of all edges of

MST (⋃_{i = 1}^{h^{d}} Q_{i})

which intersect two different subcubes

Q_{i}

and

Q_{j}

,

| D |

. Using the Markov inequality, we can write

\begin{matrix} P (| D | \geq ϵ) \leq \frac{E (| D |)}{ϵ}, \end{matrix}

where

ϵ > 0

. Since

E | D | \leq c h^{d - 1} {(m + n)}^{1 / d} : = δ_{m, n}^{h}

, therefore for

ϵ > δ_{m, n}^{h}

and

h = 1, 2, \dots

:

\begin{matrix} P (| D | \geq ϵ) \leq \frac{δ_{m, n}^{h}}{ϵ} . \end{matrix}

In addition, if

Q_{i}

,

i = 1, \dots h^{d}

is a partition of

{[0, 1]}^{d}

into congruent subcubes of edge length

1 / h

, then

\begin{matrix} P (\sum_{i = 1}^{h^{d}} R_{m_{i}, n_{i}} (X_{m}, Y_{n} \cap Q_{i}) + 2 | D | \geq \sum_{i = 1}^{h^{d}} R_{m_{i}, n_{i}} (X_{m}, Y_{n} \cap Q_{i}) + 2 ϵ) \leq \frac{δ_{m, n}^{h}}{ϵ} . \end{matrix}

(A93)

This implies

\begin{matrix} P (\sum_{i = 1}^{h^{d}} R_{m_{i}, n_{i}} (X_{m}, Y_{n} \cap Q_{i}) + 2 | D | \leq \sum_{i = 1}^{h^{d}} R_{m_{i}, n_{i}} (X_{m}, Y_{n} \cap Q_{i}) + 2 ϵ) \geq 1 - \frac{δ_{m, n}^{h}}{ϵ} . \end{matrix}

(A94)

By subadditivity (A6), we can write

\begin{matrix} R_{m, n} (X_{m}, Y_{n}) \leq \sum_{i = 1}^{h^{d}} R_{m_{i}, n_{i}} (X_{m}, Y_{n} \cap Q_{i}) + 2 | D |, \end{matrix}

and this along with (A94) establishes (A92). □

Lemma A9: (Growth bounds for

R_{m, n}

) Let

R_{m, n}

be the FR statistic. Then, for given non-negative

ϵ

, such that

ϵ \geq h^{2} δ_{m, n}^{h}

, with at least probability

g (ϵ) : = 1 - \frac{h δ_{m, n}^{h}}{ϵ}

,

h = 2, 3, \dots

, we have

\begin{matrix} R_{m, n} (X_{m}, Y_{n}) \leq c_{ϵ, h}^{″} {(# X_{m} # Y_{n})}^{1 - 1 / d} . \end{matrix}

(A95)

Here,

c_{ϵ, h}^{″} = O (\frac{ϵ}{h^{d - 1} - 1})

depending only on

ϵ

, h. Note that, for

ϵ < h^{2} δ_{m, n}^{h}

, the claim is trivial.

Proof.

Without loss of generality, consider the unit cube

{[0, 1]}^{d}

. For given h, if

Q_{i}

,

i = 1, \dots h^{d}

is a partition of

{[0, 1]}^{d}

into congruent subcubes of edge length

1 / h

, then, by Lemma A8, we have

P (R_{m, n} (X_{m}, Y_{n}) \leq \sum_{i = 1}^{h^{d}} R_{m_{i}, n_{i}} (X_{m_{i}}, Y_{n_{i}}) + 2 ϵ) \geq \frac{ϵ - δ_{m, n}^{h}}{ϵ} .

(A96)

We apply the induction methodology on

# X_{m}

and

# Y_{n}

. Set

c : = sup_{x, y \in {[0, 1]}^{d}} R_{m, n} ({x, y})

which is finite according to assumption. Moreover, set

c_{2} : = \frac{2 ϵ}{h^{d - 1} - 1}

and

c_{1} : = c + d h^{d - 1} c_{2}

. Therefore, it is sufficient to show that for all

(X_{m}, Y_{n}) \in {[0, 1]}^{d}

with at least probability

g (ϵ)

\begin{matrix} R_{m, n} (X_{m}, Y_{n}) \leq c_{1} {(# X_{m} # Y_{n})}^{(d - 1) / d} . \end{matrix}

(A97)

Alternatively, as for the induction hypothesis, we assume the stronger bound

\begin{matrix} R_{m, n} (X_{m}, Y_{n}) \leq c_{1} {(# X_{m} # Y_{n})}^{(d - 1) / d} - c_{2} \end{matrix}

(A98)

holds whenever

# X_{m} < m

and

# Y_{n} < n

with at least probability

g (ϵ)

. Note that

d \geq 2

,

ϵ > 0

and

c_{1}

,

c_{2}

both depend on

ϵ

, h. Hence,

\begin{matrix} c_{1} - c_{2} = c + c_{2} (d h^{d - 1} - 1) \geq c + c_{2} (h^{d - 1} - 1) = c + 2 ϵ \geq c, \end{matrix}

which implies

P (R_{m, n} \leq c_{1} - c_{2}) \geq P (R_{m, n} \leq c)

. In addition, we know that

P (R_{m, n} \leq c) = 1 \geq g (ϵ)

; therefore, the induction hypothesis holds particularly

# X_{m} = 1

and

# Y_{n} = 1

. Now, consider the partition

Q_{i}

of

{[0, 1]}^{d}

; therefore, for all

1 \leq i \leq h^{d}

, we have

m_{i} : = # (X_{m} \cap Q_{i}) < m

and

n_{i} : = # (Y_{n} \cap Q_{i}) < n

and thus, by induction hypothesis, one yields with at least probability

g (ϵ)

\begin{matrix} R_{m_{i}, n_{i}} (X_{m}, Y_{n} \cap Q_{i}) \leq c_{1} {(m_{i} n_{i})}^{1 - 1 / d} - c_{2} . \end{matrix}

(A99)

Set

B

the event

\{all i : R_{m_{i}, n_{i}} \leq c_{1} {(m_{i} n_{i})}^{1 - 1 / d} - c_{2}\}

and

B_{i}

stands with the event

\{R_{m_{i}, n_{i}} \leq c_{1} {(m_{i} n_{i})}^{1 - 1 / d} - c_{2}\}

. From (A96) and since

Q_{i}

’s are partitions, which implies

\begin{matrix} \begin{matrix} P (B) = {(P (B_{i}))}^{h^{d}} \leq P (B_{i}), P (B^{c}) = P (⋃_{i = 1}^{l^{d}} B_{i}^{c}) \leq \sum_{i = 1}^{h^{d}} P (B_{i}^{c}) \leq h^{d} (1 - g (ϵ)), \\ and P (B) = \prod_{i = 1}^{h^{d}} P (B_{i}) \geq {(g (ϵ))}^{h^{d}}, \end{matrix} \end{matrix}

we thus obtain

\begin{matrix} \begin{matrix} \frac{ϵ - δ_{m, n}^{h}}{ϵ} \leq P (R_{m, n} \leq \sum_{i = 1}^{h^{d}} R_{m_{i}, n_{i}} (X_{m_{i}}, Y_{n_{i}}) + 2 ϵ | B) P (B) + P (R_{m, n} \leq \sum_{i = 1}^{h^{d}} R_{m_{i}, n_{i}} (X_{m_{i}}, Y_{n_{i}}) + 2 ϵ | B^{c}) P (B^{c}) \\ \leq P (R_{m, n} \leq \sum_{i = 1}^{l^{d}} R_{m_{i}, n_{i}} (X_{m_{i}}, Y_{n_{i}}) + 2 ϵ | B) P (B) + P (B^{c}) . \end{matrix} \end{matrix}

Equivalently,

\begin{matrix} \begin{matrix} P (R_{m, n} \leq \sum_{i = 1}^{h^{d}} R_{m_{i}, n_{i}} (X_{m_{i}}, Y_{n_{i}}) + 2 ϵ | B) \geq (1 - \frac{δ_{m, n}^{h}}{ϵ} - 1 + P (B)) / P (B) = 1 - \frac{δ_{m, n}^{h}}{ϵ P (B)} . \end{matrix} \end{matrix}

In fact, in this stage, we want to show that

\begin{matrix} 1 - \frac{δ_{m, n}^{h}}{ϵ P (B)} \geq g (ϵ) or P (B) \geq \frac{δ_{m, n}^{h}}{ϵ (1 - g (ϵ))} . \end{matrix}

Since

P (B) \geq {(g (ϵ))}^{h^{d}}

, therefore it is sufficient to derive that

{(g (ϵ))}^{h^{d}} \geq \frac{δ_{m, n}^{h}}{ϵ (1 - g (ϵ))}

. Indeed, for given

g (ϵ) = (\frac{ϵ - h δ_{m, n}^{h}}{ϵ})

, we have

g (ϵ) \leq \frac{ϵ - δ_{m, n}^{h}}{ϵ}

hence

\frac{δ_{m, n}^{h}}{ϵ (1 - g (ϵ))} = \frac{1}{h} \leq 1

. Furthermore, we know

\frac{1}{h} \leq 1 - \frac{1}{h^{(1 / h^{d})}}

and since

ϵ \geq h^{2} δ_{m, n}^{h}

this implies

\frac{h δ_{m, n}^{h}}{ϵ} \leq \frac{1}{h}

and consequently

\frac{h δ_{m, n}^{h}}{ϵ} \leq 1 - \frac{1}{h^{h^{- d}}}

or

g {(ϵ)}^{h^{d}} = {(\frac{ϵ - h δ_{m, n}^{h}}{ϵ})}^{h^{d}} \geq \frac{1}{h} = \frac{δ_{m, n}^{h}}{ϵ (1 - g (ϵ))} .

This implies the fact that for

ϵ \geq h^{2} δ_{m, n}^{h}

\begin{matrix} \begin{matrix} P (R_{m, n} \leq \sum_{i = 1}^{h^{d}} (c_{1} {(m_{i} n_{i})}^{1 - 1 / d} - c_{2}) + 2 ϵ) \geq g (ϵ), where g (ϵ) = \frac{ϵ - h δ_{m, n}^{h}}{ϵ} . \end{matrix} \end{matrix}

Now, let

γ : = # {i : m_{i}, n_{i} > 0}

and using Hölder inequality gives

P (R_{m, n} (X_{m}, Y_{n}) \leq c_{1} γ^{1 / d} {(m n)}^{1 - 1 / d} - γ c_{2} + c_{2} (h^{d - 1} - 1)) \geq g (ϵ) .

(A100)

Next, we just need to show that

c_{1} γ^{1 / d} {(m n)}^{1 - 1 / d} - γ c_{2} + c_{2} (h^{d - 1} - 1)

in (A100) is less than or equal to

c_{1} {(m n)}^{1 - 1 / d} - c_{2}

, which is equivalent to show

\begin{matrix} c_{2} (h^{d - 1} - γ) \leq c_{1} {(m n)}^{1 - 1 / d} (1 - γ^{1 / d}) . \end{matrix}

We know that

m, n \geq 1

and

c_{1} \geq d h^{d - 1} c_{2}

, so it is sufficient to get

\begin{matrix} c_{2} (h^{d - 1} - γ) \leq d h^{d - 1} c_{2} (1 - γ^{1 / d}), \end{matrix}

(A101)

choose t as

γ = t h^{d}

, then

0 < t \leq 1

, so (A101) becomes

\begin{matrix} (h^{- 1} - t) \geq d h^{- 1} (1 - h t^{1 / d}) . \end{matrix}

(A102)

Note that the function

d h^{- 1} (1 - h t^{1 / d}) + t - h^{- 1}

has a minimum at

t = 1

which implies (A101) and subsequently (A95). Hence, the proof is completed. □

Lemma A10: (Smoothness for

R_{m, n}

) Given observations of

X_{m} : = (X_{m^{'}}, X_{m^{″}}) = {X_{1}, \dots, X_{m^{'}}, X_{m^{'} + 1}, \dots, X_{m}},

such that

m^{'} + m^{″} = m

and

Y_{n} : = (Y_{n^{'}}, Y_{n^{″}}) = {Y_{1}, \dots, Y_{n^{'}}, Y_{n^{'} + 1}, \dots, Y_{n}}

, where

n^{'} + n^{″} = n

, denote

R_{m, n} (X_{m}, Y_{n})

as before, the number of edges of

MST (X_{m}, Y_{n})

which connect a point of

X_{m}

to a point of

Y_{n}

. Then, for integer

h \geq 2

, for all

(X_{n}, Y_{m}) \in {[0, 1]}^{d}

,

ϵ \geq h^{2} δ_{m, n}^{h}

, where

δ_{m, n}^{h} = O (h^{d - 1} {(m + n)}^{1 / d})

, we have

\begin{matrix} P (| R_{m, n} (X_{m}, Y_{n}) - R_{m^{'}, n^{'}} (X_{m^{'}}, Y_{n^{'}}) | \leq {\tilde{c}}_{ϵ, h} {(# X_{m^{″}} # Y_{n^{″}})}^{1 - 1 / d}) \geq 1 - \frac{2 h δ_{m, n}^{h}}{ϵ}, \end{matrix}

(A103)

where

{\tilde{c}}_{ϵ, h} = O (\frac{ϵ}{h^{d - 1} - 1})

. For the case

ϵ < h^{2} δ_{m, n}^{h}

, this holds trivially.

Proof.

We begin with removing the edges which contain a vertex in

X_{m^{″}}

and

Y_{n^{″}}

in minimal spanning tree on

(X_{m}, Y_{n})

. Now, since each vertex has bounded degree, say

c_{d}

, we can generate a subgraph in which has at most

c_{d} (# X_{m^{″}} + # Y_{n^{″}})

components. Next, choose one vertex from each component and form the minimal spanning tree on these vertices, assuming all of them can be considered in FR test statistic, we can write

\begin{matrix} \begin{matrix} R_{m, n} (X_{m}, Y_{n}) \leq R_{m^{'}, n^{'}} (X_{m^{'}}, Y_{n^{'}}) + c_{ϵ, h}^{″} {(c_{d}^{2} # X_{m^{″}} # Y_{n^{″}})}^{1 - 1 / d}, \\ or equivalently \\ \leq R_{m^{'}, n^{'}} (X_{m^{'}}, Y_{n^{'}}) + c_{ϵ 1}^{h} {(# X_{m^{″}} # Y_{n^{″}})}^{1 - 1 / d}, \end{matrix} \end{matrix}

(A104)

with probability at least

g (ϵ)

, where

g (ϵ)

is as in Lemma A9. Note that this expression is obtained from Lemma A9. In this stage, it remains to show that with at least probability

g (ϵ)

\begin{matrix} R_{m, n} (X_{m}, Y_{n}) \geq R_{m^{'}, n^{'}} (X_{m^{'}}, Y_{n^{'}}) - {\tilde{c}}_{ϵ, h} {(# X_{m^{″}} # Y_{n^{″}})}^{1 - 1 / d}, \end{matrix}

(A105)

which, again by using the method before, with at least probability

g (ϵ)

, one derives

\begin{matrix} \begin{matrix} R_{m^{'}, n^{'}} (X_{m^{'}}, Y_{n^{'}}) \leq R_{m, n} (X_{m}, Y_{n}) + {\hat{c}}_{ϵ, h} {(c_{d}^{2} (# X_{m^{″}} # Y_{n^{″}}))}^{1 - 1 / d}, \\ o r e q u i v a l e n t l y \\ \leq R_{m, n} (X_{m}, Y_{n}) + c_{ϵ 2}^{h} {(# X_{m^{″}} # Y_{n^{″}})}^{1 - 1 / d} . \end{matrix} \end{matrix}

Letting

{\tilde{c}}_{ϵ, h} = max {c_{ϵ 1}^{h}, c_{ϵ 2}^{h}}

implies (A105). Thus,

\begin{matrix} P (| R_{m, n} (X_{m}, Y_{n}) - R_{m^{'}, n^{'}} (X_{m^{'}}, Y_{n^{'}}) | \geq {\tilde{c}}_{ϵ, h} {(# X_{m^{″}} # Y_{n^{″}})}^{1 - 1 / d}) \leq 2 - 2 g (ϵ), \end{matrix}

(A106)

Hence, the smoothness is given with at least probability

2 g (ϵ) - 1

as in the statement of Lemma A10. □

Lemma A11: (Semi-Isoperimetry) Let

μ

be a measure on

{[0, 1]}^{d}

;

μ^{n}

denotes the product measure on space

{({[0, 1]}^{d})}^{n}

. In addition, let

M_{e}

denotes a median of

R_{m, n}

. Set

\begin{matrix} A : = \{X_{m} \in {({[0, 1]}^{d})}^{m}, Y_{n} \in {({[0, 1]}^{d})}^{n}; R_{m, n} (X_{m}, Y_{n}) \leq M_{e}\} . \end{matrix}

(A107)

Then,

\begin{matrix} μ^{m + n} (\{x^{'} \in {({[0, 1]}^{d})}^{m}, y^{'} \in ({[0, 1]}^{n}) : ϕ_{A} (x^{'}) ϕ_{A} (y^{'}) \geq t\}) \leq 4 exp (\frac{- t}{8 (m + n)}) . \end{matrix}

(A108)

Proof.

Let

ϕ_{A} (z^{'}) = min {H (z, z^{'}), z \in A}

. Using Proposition 6.5 in [17], isoperimetric inequality, we have

μ^{m + n} (\{z^{'} \in {({[0, 1]}^{d})}^{m + n} : ϕ_{A} (z^{'}) \geq t\}) \leq 4 exp (\frac{- t^{2}}{8 (m + n)}) .

(A109)

Furthermore, we know that

{(ϕ_{A} (x^{'}) + ϕ_{A} (y^{'}))}^{2} \geq ϕ_{A} (x^{'}) ϕ_{A} (y^{'}),

hence

\begin{matrix} μ^{m + n} (\{(x^{'} \in {({[0, 1]}^{d})}^{m}, y^{'} \in ({[0, 1]}^{n}) : ϕ_{A} (x^{'}) ϕ_{A} (y^{'}) \geq t\}) \\ \leq μ^{m + n} (\{(x^{'} \in {({[0, 1]}^{d})}^{m}, y^{'} \in ({[0, 1]}^{n}) : {(ϕ_{A} (x^{'}) + ϕ_{A} (y^{'}))}^{2} \geq t\}) \\ = μ^{m + n} (\{(x^{'} \in {({[0, 1]}^{d})}^{m}, y^{'} \in ({[0, 1]}^{n}) : ϕ_{A} (x^{'}) + ϕ_{A} (y^{'}) \geq \sqrt{t}\}) . \end{matrix}

(A110)

The last equality in (A110) achieves because of

ϕ_{A} (x^{'}), ϕ_{A} (y^{'}) \geq 0

and note that

ϕ_{A} (z^{'}) \geq ϕ_{A} (x^{'}) + ϕ_{A} (y^{'})

. Therefore,

\begin{matrix} \begin{matrix} μ^{m + n} (\{(x^{'} \in {({[0, 1]}^{d})}^{m}, y^{'} \in ({[0, 1]}^{n}) : ϕ_{A} (x^{'}) + ϕ_{A} (y^{'}) \geq \sqrt{t}\}) \\ \leq μ^{m + n} (\{(z^{'} \in {({[0, 1]}^{d})}^{m + n} : ϕ_{A} (z^{'}) \geq \sqrt{t}\}) . \end{matrix} \end{matrix}

By recalling (A109), we derive the bound (A108). □

Lemma A12: (Deviation of the Mean and Median) Consider

M_{e}

as a median of

R_{m, n}

. Then, for given

g (ϵ) = 1 - \frac{h δ_{m, n}^{h}}{ϵ}

, and

δ_{m, n}^{h} = O (h^{d - 1} {(m + n)}^{1 / d})

such that for

h \geq 7

,

ϵ \geq h^{2} δ_{m, n}^{h}

, we have

| E [R_{m, n} (X_{m}, Y_{n})] - M_{e} | \leq C_{m, n} (ϵ, h) {(m + n)}^{(d - 1) / d},

(A111)

where

C_{m, n} (ϵ, h)

stands with a form depends on

ϵ

, h, m, n as

\begin{matrix} C_{m, n} (ϵ, h) = C {(1 - ({(2 {(2 g (ϵ) - 1)}^{2})}^{- 1}))}^{- 1}, \end{matrix}

(A112)

where C is a constant.

Proof.

Following the analogous arguments in [17,53], we have

\begin{matrix} | E [R_{m, n} (X_{m}, Y_{n})] - M_{e} | \leq E | R_{m, n} (X_{m}, Y_{n}) - M_{e} | = \int_{0}^{\infty} P (| R_{m, n} (X_{m}, Y_{n}) - M_{e} | \geq t) d t \\ \leq 8 {(1 - (1 / (2 {(2 g (ϵ) - 1)}^{2})))}^{- 1} \int_{0}^{\infty} exp (\frac{- t^{d / (d - 1)}}{8 {(4 ϵ)}^{d / d - 1} (m + n)}) d t \\ = C {(1 - ({(2 {(2 g (ϵ) - 1)}^{2})}^{- 1}))}^{- 1} {(m + n)}^{(d - 1) / d}, \end{matrix}

(A113)

where

g (ϵ) = 1 - (h O (h^{d - 1} {(m + n)}^{1 / d})) / ϵ

. The inequality in (A113) is implied from Theorem 5. Hence, the proof is completed. □

References

Xuan, G.; Chia, P.; Wu, M. Bhattacharyya distance feature selection. In Proceedings of the 13th International Conference on Pattern Recognition, Vienna, Austria, 25–29 August 1996; Volume 2, pp. 195–199. [Google Scholar]
Hamza, A.; Krim, H. Image registration and segmentation by maximizing the Jensen-Renyi divergence. In Energy Minimization Methods in Computer Vision and Pattern Recognition. EMMCVPR 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 147–163. [Google Scholar]
Hild, K.E.; Erdogmus, D.; Principe, J. Blind source separation using Renyi’s mutual information. IEEE Signal Process. Lett. 2001, 8, 174–176. [Google Scholar] [CrossRef]
Basseville, M. Divergence measures for statistical data processing–An annotated bibliography. Signal Process. 2013, 93, 621–633. [Google Scholar] [CrossRef]
Battacharyya, A. On a measure of divergence between two multinomial populations. Sankhy ā Indian J. Stat. 1946, 7, 401–406. [Google Scholar]
Lin, J. Divergence Measures Based on the Shannon Entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151. [Google Scholar] [CrossRef]
Berisha, V.; Hero, A. Empirical non-parametric estimation of the Fisher information. IEEE Signal Process. Lett. 2015, 22, 988–992. [Google Scholar] [CrossRef]
Berisha, V.; Wisler, A.; Hero, A.; Spanias, A. Empirically estimable classification bounds based on a nonparametric divergence measure. IEEE Trans. Signal Process. 2016, 64, 580–591. [Google Scholar] [CrossRef]
Moon, K.; Hero, A. Multivariate f-divergence estimation with confidence. In Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; pp. 2420–2428. [Google Scholar]
Moon, K.; Hero, A. Ensemble estimation of multivariate f-divergence. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Honolulu, HI, USA, 29 June–4 July 2014; pp. 356–360. [Google Scholar]
Moon, K.; Sricharan, K.; Greenewald, K.; Hero, A. Improving convergence of divergence functional ensemble estimators. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, 10–15 July 2016; pp. 1133–1137. [Google Scholar]
Moon, K.; Sricharan, K.; Greenewald, K.; Hero, A. Nonparametric ensemble estimation of distributional functionals. arXiv 2016, arXiv:1601.06884v2. [Google Scholar]
Noshad, M.; Moon, K.; Yasaei Sekeh, S.; Hero, A. Direct Estimation of Information Divergence Using Nearest Neighbor Ratios. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017. [Google Scholar]
Yasaei Sekeh, S.; Oselio, B.; Hero, A. A Dimension-Independent discriminant between distributions. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018. [Google Scholar]
Noshad, M.; Hero, A. Rate-optimal Meta Learning of Classification Error. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018. [Google Scholar]
Wisler, A.; Berisha, V.; Wei, D.; Ramamurthy, K.; Spanias, A. Empirically-estimable multi-class classification bounds. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016. [Google Scholar]
Yukich, J. Probability Theory of Classical Euclidean Optimization; Lecture Notes in Mathematics; Springer: Berlin, Germany, 1998; Volume 1675. [Google Scholar]
Steele, J. An Efron–Stein inequality for nonsymmetric statistics. Ann. Stat. 1986, 14, 753–758. [Google Scholar] [CrossRef]
Aldous, D.; Steele, J.M. Asymptotic for Euclidean minimal spanning trees on random points. Probab. Theory Relat. Fields 1992, 92, 247–258. [Google Scholar] [CrossRef]
Ma, B.; Hero, A.; Gorman, J.; Michel, O. Image registration with minimal spanning tree algorithm. In Proceedings of the IEEE International Conference on Image Processing, Vancouver, BC, Canada, 10–13 September 2000; pp. 481–484. [Google Scholar]
Neemuchwala, H.; Hero, A.; Carson, P. Image registration using entropy measures and entropic graphs. Eur. J. Signal Process. 2005, 85, 277–296. [Google Scholar] [CrossRef]
Hero, A.; Ma, B.; Michel , O.J.; Gorman, J. Applications of entropic spanning graphs. IEEE Signal Process. Mag. 2002, 19, 85–95. [Google Scholar] [CrossRef]
Hero, A.; Michel, O. Estimation of Rényi information divergence via pruned minimal spanning trees. In Proceedings of the IEEE Workshop on Higher Order Statistics, Caesarea, Isreal, 16 June 1999. [Google Scholar]
Smirnov, N. On the estimation of the discrepancy between empirical curves of distribution for two independent samples. Bull. Mosc. Univ. 1939, 2, 3–6. [Google Scholar]
Wald, A.; Wolfowitz, J. On a test whether two samples are from the same population. Ann. Math. Stat. 1940, 11, 147–162. [Google Scholar] [CrossRef]
Gibbons, J. Nonparametric Statistical Inference; McGraw-Hill: New York, NY, USA, 1971. [Google Scholar]
Singh, S.; Póczos, B. Probability Theory and Combinatorial Optimization; CBMF-NSF Regional Conference in Applied Mathematics; Society for Industrial and Applied Mathematics (SIAM): Philadelphia, PA, USA, 1997; Volume 69. [Google Scholar]
Redmond, C.; Yukich, J. Limit theorems and rates of convergence for Euclidean functionals. Ann. Appl. Probab. 1994, 4, 1057–1073. [Google Scholar] [CrossRef]
Redmond, C.; Yukich, J. Asymptotics for Euclidean functionals with power weighted edges. Stoch. Process. Their Appl. 1996, 6, 289–304. [Google Scholar] [CrossRef]
Hero, A.; Costa, J.; Ma, B. Convergence Rates of Minimal Graphs with Random Vertices. Available online: https://pdfs.semanticscholar.org/7817/308a5065aa0dd44098319eb66f81d4fa7a14.pdf (accessed on 18 November 2019).
Hero, A.; Costa, J.; Ma, B. Asymptotic Relations between Minimal Graphs and Alpha-Entropy; Tech. Rep.; Communication and Signal Processing Laboratory (CSPL), Department EECS, University of Michigan: Ann Arbor, MI, USA, 2003. [Google Scholar]
Lorentz, G. Approximation of Functions; Holt, Rinehart and Winston: New York, NY, USA, 1996. [Google Scholar]
Talagrand, M. Concentration of measure and isoperimetric inequalities in product spaces. Publications Mathématiques de i’I. H. E. S. 1995, 81, 73–205. [Google Scholar] [CrossRef]
Kullback, S.; Leibler, R. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, USA, 20 June–30 July 1961; pp. 547–561. [Google Scholar]
Ali, S.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B (Methodol.) 1996, 28, 131–142. [Google Scholar] [CrossRef]
Cha, S. Comprehensive survey on distance/similarity measures between probability density functions. Int. J. Math. Models Methods Appl. Sci. 2007, 1, 300–307. [Google Scholar]
Rukhin, A. Optimal estimator for the mixture parameter by the method of moments and information affinity. In Proceedings of the 12th Prague Conference on Information Theory, Prague, Czech Republic, 29 August–2 September 1994; pp. 214–219. [Google Scholar]
Toussaint, G. The relative neighborhood graph of a finite planar set. Pattern Recognit. 1980, 12, 261–268. [Google Scholar] [CrossRef]
Zahn, C. Graph-theoretical methods for detecting and describing Gestalt clusters. IEEE Trans. Comput. 1971, 100, 68–86. [Google Scholar] [CrossRef]
Banks, D.; Lavine, M.; Newton, H. The minimal spanning tree for nonparametric regression and structure discovery. In Computing Science and Statistics, Proceedings of the 24th Symposium on the Interface; Joseph Newton, H., Ed.; Interface Foundation of North America: Fairfax Station, FA, USA, 1992; pp. 370–374. [Google Scholar]
Hoffman, R.; Jain, A. A test of randomness based on the minimal spanning tree. Pattern Recognit. Lett. 1983, 1, 175–180. [Google Scholar] [CrossRef]
Efron, B.; Stein, C. The jackknife estimate of variance. Ann. Stat. 1981, 9, 586–596. [Google Scholar] [CrossRef]
Singh, S.; Póczos, B. Generalized exponential concentration inequality for Rényi divergence estimation. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), Bejing, China, 22–24 June 2014; pp. 333–341. [Google Scholar]
Singh, S.; Póczos, B. Exponential concentration of a density functional estimator. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; pp. 3032–3040. [Google Scholar]
Lichman, M. UCI Machine Learning Repository. 2013. Available online: https://www.re3data.org/repository/r3d100010960 (accessed on 18 November 2019).
Bhatt, R.B.; Sharma, G.; Dhall, A.; Chaudhury, S. Efficient skin region segmentation using low complexity fuzzy decision tree model. In Proceedings of the IEEE-INDICON, Ahmedabad, India, 16–18 December 2009; pp. 1–4. [Google Scholar]
Steele, J.; Shepp, L.; Eddy, W. On the number of leaves of a euclidean minimal spanning tree. J. Appl. Prob. 1987, 24, 809–826. [Google Scholar] [CrossRef]
Henze, N.; Penrose, M. On the multivarite runs test. Ann. Stat. 1999, 27, 290–298. [Google Scholar]
Rhee, W. A matching problem and subadditive Euclidean funetionals. Ann. Appl. Prob. 1993, 3, 794–801. [Google Scholar] [CrossRef]
Whittaker, E.; Watson, G. A Course in Modern Analysis, 4th ed.; Cambridge University Press: New York, NY, USA, 1996. [Google Scholar]
Kingman, J. Poisson Processes; Oxford Univ. Press: Oxford, UK, 1993. [Google Scholar]
Pál, D.; Póczos, B.; Szapesvári, C. Estimation of Renyi entropy andmutual information based on generalized nearest-neighbor graphs. In Proceedings of the 23th International Conference on Neural Information Processing Systems (NIPS 2010), Vancouver, BC, Canada, 6–9 December 2010. [Google Scholar]

Figure 1. Heat map of the theoretical MSE rate of the FR estimator of the HP-divergence based on Theorems 2 and 3 as a function of dimension and sample size when

N = m = n

. Note the color transition (MSE) as sample size increases for high dimension. For fixed sample size N, the MSE rate degrades in higher dimensions.

Figure 1. Heat map of the theoretical MSE rate of the FR estimator of the HP-divergence based on Theorems 2 and 3 as a function of dimension and sample size when

N = m = n

. Note the color transition (MSE) as sample size increases for high dimension. For fixed sample size N, the MSE rate degrades in higher dimensions.

Figure 2. The dual MST spanning the merged set

X_{m}

(blue points) and

Y_{n}

(red points) drawn from two Gaussian distributions. The dual FR statistic (

R_{m, n}^{*}

) is the number of edges in the

{MST}^{*}

(contains nodes in

X_{m} \cup Y_{n} \cup {2 corner points}

) that connect samples from different color nodes and corners (denoted in green). Black edges are the non-dichotomous edges in the

{MST}^{*}

.

Figure 2. The dual MST spanning the merged set

X_{m}

(blue points) and

Y_{n}

(red points) drawn from two Gaussian distributions. The dual FR statistic (

R_{m, n}^{*}

) is the number of edges in the

{MST}^{*}

(contains nodes in

X_{m} \cup Y_{n} \cup {2 corner points}

) that connect samples from different color nodes and corners (denoted in green). Black edges are the non-dichotomous edges in the

{MST}^{*}

.

Figure 3. Comparison of the bound on the MSE theory and experiments for

d = 2, 4, 8

standard Gaussian random vectors versus sample size from 100 trials.

Figure 3. Comparison of the bound on the MSE theory and experiments for

d = 2, 4, 8

standard Gaussian random vectors versus sample size from 100 trials.

Figure 4. Comparison of experimentally predicted MSE of the FR-statistic as a function of sample size

m = n

in various distributions Standard Normal, Gamma (

α_{1} = α_{2} = 1, β_{1} = β_{2} = 1, ρ = 0.5

) and Standard t-Student.

Figure 4. Comparison of experimentally predicted MSE of the FR-statistic as a function of sample size

m = n

in various distributions Standard Normal, Gamma (

α_{1} = α_{2} = 1, β_{1} = β_{2} = 1, ρ = 0.5

) and Standard t-Student.

Figure 5. HP-divergence vs. sample size for three real datasets HAR, SKIN, and ENGIN.

Figure 6. The empirical MSE vs. sample size. The empirical MSE of the FR estimator for all three datasets HAR, SKIN, and ENGIN decreases for larger sample size N.

Figure 7. HP-divergence vs. dimension for three datasets HAR, SKIN, and ENGIN.

Table 1.

R_{m, n}

,

{\hat{D}}_{p}

, m, and n are the FR test statistic, HP-divergence estimates using

R_{m, n}

, and sample sizes for two classes, respectively.

Table 1.

R_{m, n}

,

{\hat{D}}_{p}

, m, and n are the FR test statistic, HP-divergence estimates using

R_{m, n}

, and sample sizes for two classes, respectively.

FR Test Statistic
Dataset	$E [R_{m, n}]$	${\hat{D}}_{p}$	$m$	$n$	Variance-Like Interval
HAR	3	0.995	600	600	(2.994,3.006)
SKIN	4.2	0.993	600	600	(4.196,4.204)
ENGIN	1.8	0.997	600	600	(1.798,1.802)

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Sekeh, S.Y.; Noshad, M.; Moon, K.R.; Hero, A.O. Convergence Rates for Empirical Estimation of Binary Classification Bounds. Entropy 2019, 21, 1144. https://doi.org/10.3390/e21121144

AMA Style

Sekeh SY, Noshad M, Moon KR, Hero AO. Convergence Rates for Empirical Estimation of Binary Classification Bounds. Entropy. 2019; 21(12):1144. https://doi.org/10.3390/e21121144

Chicago/Turabian Style

Sekeh, Salimeh Yasaei, Morteza Noshad, Kevin R. Moon, and Alfred O. Hero. 2019. "Convergence Rates for Empirical Estimation of Binary Classification Bounds" Entropy 21, no. 12: 1144. https://doi.org/10.3390/e21121144

APA Style

Sekeh, S. Y., Noshad, M., Moon, K. R., & Hero, A. O. (2019). Convergence Rates for Empirical Estimation of Binary Classification Bounds. Entropy, 21(12), 1144. https://doi.org/10.3390/e21121144

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Convergence Rates for Empirical Estimation of Binary Classification Bounds

Abstract

1. Introduction

1.1. Related Work

1.2. Organization

2. The Henze–Penrose Divergence Measure

2.1. The Multivariate Runs Test Statistic

2.2. Convergence Rates

2.3. Proof Sketch of Theorem 2

2.4. Concentration Bounds

3. Numerical Experiments

3.1. Simulation Study

3.2. Real Datasets

4. Conclusions

Author Contributions

Funding

Conflicts of Interest

Abbreviations

Appendix A. Proof of Theorem 4

Appendix B. Proof of Theorem 2

Appendix C. Proof of Theorems 3

Appendix D. Proof of Theorems 5–7

Appendix E. Additional Proofs

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI