On Accuracy of PDF Divergence Estimators and Their Applicability to Representative Data Sampling

Generalisation error estimation is an important issue in machine learning. Cross-validation, traditionally used for this purpose, requires building multiple models and repeating the whole procedure many times in order to produce reliable error estimates. It is, however, possible to accurately estimate the error using only a single model, if the training and test data are chosen appropriately. This paper investigates the possibility of using various probability density function divergence measures for the purpose of representative data sampling. As it turns out, the first difficulty one needs to deal with is the estimation of the divergence itself. In contrast to other publications on this subject, the experimental results provided in this study show that in many cases accurate estimation is not possible unless samples consisting of thousands of instances are used. Exhaustive experiments on divergence-guided representative data sampling have been performed using 26 publicly available benchmark datasets and 70 PDF divergence estimators, and their results have been analysed and discussed.


Introduction
In our previous work [1,2] we have proposed the Density Preserving Sampling (DPS) procedure as an alternative to the commonly used cross-validation (CV) technique [3] for the purpose of generalisation error estimation. The new method proved to be comparable with repeated CV in terms of the bias of the produced estimates and superior to CV in terms of variance, while eliminating the need for repetitions in the estimation process. In [2] it has also been demonstrated that it is possible to select only a single DPS fold and use it as validation data to obtain an error estimate with accuracy comparable to that of 10× repeated CV, effectively reducing the computational requirements by another order of magnitude. The problem of selecting the right DPS fold however still remains unsolved. Correntropy [4], which is the basis of the cost function used in DPS optimisation, is only moderately correlated with the bias of the error estimator. Moreover, the correlation is only present for a single, carefully chosen value of the kernel smoothing parameter, and there is no principled way to discover this value [2]. The idea of further reducing the computational cost of generalisation error estimation nevertheless remains very attractive.
In this paper we investigate the possibilities of selecting a representative subset of data from a larger dataset. Unfortunately, there is no universal and measurable notion of representativeness. A standard definition of a representative sample, found in any statistical textbook, states that it should have the same properties as the population from which it has been drawn. The question of "which properties" however remains open, and the answer differs from one application to another. In our case the application can be stated as accurate estimation of the generalisation performance of a predictive model. For this purpose some easily calculable measure of representativeness, based on the probability density functions (PDFs) as the most universal characteristic of data, is required. We thus examine a number of the most popular divergence measures from the literature, in order to investigate their usability for our goal.
There are however some challenges here. First of all, in most real-world applications the PDFs have unknown, non-parametric forms and as a result need to be somehow approximated. The two best known and most commonly used methods of doing this are the Parzen window [5] and k-Nearest Neighbour (kNN) [6] based density estimators. The problem is that if the estimates of the PDFs are poor, it is hard to expect the value of a divergence measure calculated using these estimates to be reliable. The PDF estimates can be inaccurate for many reasons, such as an incorrect choice of the parameters of the density estimation methods, not enough data used for the estimation, or non-representativeness of the data. Moreover, as the divergence measures in use are defined as integrals over various functions of the PDFs, in all but the simplest cases there are no closed-form formulas for their calculation. The only way is to resort to some estimation procedure, which can make the discussed problem even worse. This issue was already indicated many years ago by Akaike [7], who noticed that "the difficulty of constructing an adequate model based on the information provided by a finite number of observations is not fully recognized" (see also [8]) and "it must be realized that there is usually a big gap between the theoretical results and the practical procedures". However, despite a great deal of research and publications on the subject of divergence measures, some of which, e.g., [9], has been triggered by [7], not much has changed in the context of practical procedures for their estimation.
The significance of this research thus stems from (1) gathering relevant PDF divergence concepts and presenting them in a unified way, (2) an experimental convergence study of various PDF divergence estimators, which in fact questions their usability for samples smaller than thousands of instances, and (3) an investigation of representative sampling by optimising divergence measures. This paper can hence be perceived as a critical review which, based on extensive experimental investigations, shows the potential benefits but also serious limitations of the investigated techniques under a broad spectrum of conditions. This paper is organized as follows. Section 2 discusses the two most commonly used PDF estimation techniques, which form a basis for estimation of the divergence measures described in Section 3. Section 4 investigates empirical convergence properties of these estimators, using four toy problems. In Section 5 we present experimental results of applying the divergence estimators to the problem of selecting a representative subset of a larger dataset for generalisation error estimation. A discussion of some implications of these results is given in Section 6 and the final conclusions can be found in Section 7.

Estimation of the PDFs
Estimation of the PDFs directly from data is a fundamental issue in the context of divergence estimation for at least two reasons: (1) the divergence is measured between two or more PDFs, so some expressions for the PDFs are needed and (2) the PDFs very rarely have known parametric forms. Although there exist methods which do not estimate the PDFs directly (see for example [10,11] and references therein), they still require appropriate parameter settings, which usually cannot be done in a principled and fully automatic way. In this regard they are in fact not different from the so-called "indirect" methods investigated here, since the whole difficulty lies in setting these parameters correctly, as already mentioned in [2] and shown in the subsequent sections.

Parzen Window Method
The Parzen window method [5] is the most commonly used non-parametric PDF estimation procedure. The estimate f̂(x) of the unknown density f(x) can be obtained using:

$$\hat{f}(x) = \frac{1}{N V_N} \sum_{i=1}^{N} \varphi\!\left(\frac{x - x_i}{\sigma_N}\right) \qquad (1)$$

where N is the dataset size, V_N stands for the window volume, φ is some window function and σ_N is the smoothing parameter, also known as the window bandwidth [6]. The window function is often chosen to be Gaussian due to its analytical properties, thus the two PDFs p(x) and q(x) can be approximated by:

$$\hat{p}(x) = \frac{1}{N_p} \sum_{i=1}^{N_p} G(x - x_{p_i}, \sigma_p^2 I), \qquad \hat{q}(x) = \frac{1}{N_q} \sum_{i=1}^{N_q} G(x - x_{q_i}, \sigma_q^2 I) \qquad (2)$$

where N_p and N_q are the numbers of d-dimensional points drawn i.i.d. according to p(x) and q(x) respectively, and G(x − x_i, σ²I) is a Gaussian PDF given by:

$$G(x - x_i, \sigma^2 I) = \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\!\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right) \qquad (3)$$

The Gaussian PDF of (3) corresponds to the window function with the normalising constant V_N absorbed. Since in (3) the Gaussian is spherical, it is characterized by a single parameter σ. Although intuitively this limits flexibility, the estimators of (2) converge to the true densities when N → ∞, provided that σ_p and σ_q tend to 0 at an appropriate rate. When dealing with two subsamples of the same dataset, we assumed that σ_p = σ_q = σ, eliminating one of the parameters. Hence, given a set of data, the PDF estimation comes down to a single parameter, whose value should be selected very carefully. If σ is chosen too big, the density estimate will be oversmoothed and the fine details of the PDF will be lost. If σ is chosen too small, the estimate will be spiky, with large regions of values near 0. For this reason there has been a substantial amount of research focused on automatic selection of the window width from data; for a comprehensive review of the subject see [12]. Three different automatic bandwidth selection methods have been used in this study (a minimal code sketch of the resulting density estimator is given after the list below):
• Pseudo-likelihood cross-validation [13], which attempts to select the bandwidth σ maximizing a pseudo-likelihood function of the density estimate, using a leave-one-out approximation to avoid a trivial maximum at σ = 0. Interestingly, the pseudo-likelihood method minimizes the Kullback-Leibler divergence between the true density and the estimated density, but it tends to produce inconsistent estimates for heavy-tailed distributions [12].
• "Rule of Thumb" (RoT) minimisation [14] of the Asymptotic Mean Integrated Squared Error (AMISE) between the true distribution and its estimate. Calculation of the bandwidth minimizing the AMISE criterion requires estimation of the integral of the squared second derivative of the unknown true density function (see [12]), which is a difficult task by itself. The RoT method thus replaces the unknown value with an estimate calculated with reference to a normal distribution. This makes the method computationally tractable at the risk of producing poor estimates for non-Gaussian PDFs.
• "Solve-the-equation plug-in method" [15], which also minimizes AMISE between the true distribution and its estimate, but without assuming any parametric form of the former.This method is currently considered as state-of-the-art [16], although it has a computational complexity which is quadratic in the dataset size.A fast approximate bandwidth selection algorithm of [17], which scales linearly in the size of data has been used in this study.

k-Nearest Neighbour Method
The second well known probability density estimator is the k-nearest neighbour (kNN) method, according to which the densities p(x) and q(x) can be approximated as [6,18]:

$$\hat{p}(x) = \frac{k\, \Gamma(d/2 + 1)}{N_p\, \pi^{d/2}\, r_k(x)^d} \qquad (4)$$

$$\hat{q}(x) = \frac{k\, \Gamma(d/2 + 1)}{N_q\, \pi^{d/2}\, s_k(x)^d} \qquad (5)$$

where k is the nearest neighbour count, π^(d/2)/Γ(d/2 + 1) is the volume of a unit ball in d dimensions and r_k(x), s_k(x) are the Euclidean distances from x to its k-th nearest neighbour in X_p (the set of instances drawn i.i.d. from p(x)) and X_q (the set of instances drawn i.i.d. from q(x)) respectively. Note that if x ∈ X_p then the influence of x on the density estimate should be eliminated, thus N_p in (4) becomes N_p − 1 and r_k(x) denotes the distance to the k-th nearest neighbour in X_p \ {x} rather than in X_p. A similar remark applies to the situation when x ∈ X_q.
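The following minimal sketch (our own illustration, not the code used in the experiments) implements the kNN density estimate of (4), including the leave-one-out correction for points belonging to the sample itself.

```python
import numpy as np
from math import gamma, pi

def knn_density(x, X, k, exclude_self=False):
    """kNN density estimate at x: k over (N times the volume of the d-ball
    reaching the k-th nearest neighbour of x in X)."""
    N, d = X.shape
    dists = np.sort(np.linalg.norm(X - x, axis=1))
    if exclude_self:                  # x belongs to X: drop the zero self-distance
        dists = dists[1:]
        N -= 1
    r_k = dists[k - 1]                # distance to the k-th nearest neighbour
    unit_ball_volume = pi ** (d / 2.0) / gamma(d / 2.0 + 1.0)
    return k / (N * unit_ball_volume * r_k ** d)
```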

Divergence Measures
More than a dozen different divergence measures can now be found in the literature [19]. Perhaps the most prominent of them is the family of Chernoff's α-divergences [20], which includes such measures as the Kullback-Leibler divergence [21] or the squared Hellinger distance [22] as its special cases. Although most of these measures have strong theoretical foundations, there are no closed-form solutions for calculating them exactly, apart from the cases when the probability density functions are some simple parametric models like, e.g., Gaussians. For this reason one is forced to resort to some kind of estimators of the divergences. Although the estimators can be obtained in various ways, one of them is to estimate the unknown probability density functions first and then substitute them into the formulas for the divergences. This is the approach taken in this study. Note that the literature on PDF divergence estimation is somewhat inconsistent, hence in the following sections an attempt has been made to gather all the relevant concepts and present them in a unified way, filling some of the existing gaps. As a result, most of the formulas that can be found in Sections 3 and 4 have been derived or transformed by the authors for the purpose of this study.
Throughout the following sections the sample mean is used as the estimator of the expected value. By the Law of Large Numbers the sample mean converges to the expected value with probability 1 as the sample size tends to infinity. The expected value of an arbitrary function g(x) with respect to the PDF p(x) is:

$$E_p[g(x)] = \int g(x)\, p(x)\, dx \qquad (6)$$

and can be approximated by:

$$E_p[g(x)] \approx \frac{1}{N_p} \sum_{i=1}^{N_p} g(x_{p_i}) \qquad (7)$$

Kullback-Leibler Divergence
The Kullback-Leibler divergence (D KL ), also known as information divergence or relative entropy, is probably the most widely used measure of similarity between two PDFs [21].The measure has been used in a wide variety of applications, like data condensation [23], Blind Source Separation via Independent Component Analysis [24,25], classification [26,27], or image processing [28,29] to name a few.D KL between the joint PDF and a product of marginal PDFs is equal to the mutual information between the two random variables, which is an important concept of Shannon's information theory [30].
The Kullback-Leibler divergence is non-symmetric and can be interpreted as the number of additional bits (if the base 2 logarithm is used) needed to encode instances from the true distribution p(x) using a code based on the distribution q(x) [21]. For two continuous random variables, D_KL is given by:

$$D_{KL}(p, q) = \int p(x) \log\frac{p(x)}{q(x)}\, dx \qquad (8)$$

Natural logarithms are used throughout this paper unless otherwise stated. From the above definition it is easy to see that D_KL is only defined if q(x) > 0 for every x for which p(x) > 0. Using (6) and (7) one can also write:

$$D_{KL}(p, q) = E_p\!\left[\log\frac{p(x)}{q(x)}\right] \approx \frac{1}{N_p}\sum_{i=1}^{N_p} \log\frac{p(x_{p_i})}{q(x_{p_i})} \qquad (9)$$

Finally, by substituting (2) into (9) and rearranging, the estimator D̂_KL(p, q) becomes:

$$\hat{D}_{KL}(p, q) = \frac{1}{N_p}\sum_{i=1}^{N_p} \log\frac{\frac{1}{N_p}\sum_{j=1}^{N_p} G(x_{p_i} - x_{p_j}, \sigma^2 I)}{\frac{1}{N_q}\sum_{j=1}^{N_q} G(x_{p_i} - x_{q_j}, \sigma^2 I)} \qquad (10)$$

In the experiments in Section 4 another estimator of D_KL, derived in [18,31] and based on the kNN density estimate rather than the Parzen window, is also used:

$$\hat{D}_{KL}(p, q) = \frac{d}{N_p}\sum_{i=1}^{N_p} \log\frac{s_k(x_{p_i})}{r_k(x_{p_i})} + \log\frac{N_q}{N_p - 1} \qquad (11)$$

where r_k(x_{p_i}) and s_k(x_{p_i}) are the Euclidean distances to the k-th nearest neighbour of x_{p_i} in X_p \ {x_{p_i}} and X_q respectively. For the special case when both p(x) and q(x) are Mixtures of Gaussians there exist other techniques for approximation of D_KL, which have been reviewed in [32].
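A minimal sketch of the kNN-based estimator of (11) is given below; it is our own illustration assuming Euclidean distances, not the implementation used in the experiments.

```python
import numpy as np

def kl_divergence_knn(Xp, Xq, k=1):
    """kNN-based Kullback-Leibler divergence estimate between samples Xp ~ p and Xq ~ q."""
    Np, d = Xp.shape
    Nq = Xq.shape[0]
    total = 0.0
    for x in Xp:
        r = np.sort(np.linalg.norm(Xp - x, axis=1))[k]       # index k skips the zero self-distance
        s = np.sort(np.linalg.norm(Xq - x, axis=1))[k - 1]   # k-th nearest neighbour in Xq
        total += np.log(s / r)
    return d * total / Np + np.log(Nq / (Np - 1.0))
```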

Jeffrey's Divergence
A big inconvenience of the Kullback-Leibler divergence, especially in the context of practical applications [8,33], is its asymmetry. Jeffrey's divergence (D_J) is a simple way of making D_KL symmetric and is given by [34]:

$$D_J(p, q) = D_{KL}(p, q) + D_{KL}(q, p) \qquad (12)$$

which addresses the asymmetry issue in a very simple and intuitive way. Note however that there is another problem: D_J is not defined if either p(x) = 0 or q(x) = 0, which is in fact even more restrictive than in the case of D_KL. Jeffrey's divergence has been used for example in [29] for classification of multimedia data with Support Vector Machines.
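Since (12) is simply a symmetrised D_KL, any of the estimators above can be reused directly; a one-line sketch building on the kl_divergence_knn function from the previous sketch:

```python
def jeffreys_divergence_knn(Xp, Xq, k=1):
    """Jeffrey's divergence as the sum of the two directed kNN-based KL estimates."""
    return kl_divergence_knn(Xp, Xq, k) + kl_divergence_knn(Xq, Xp, k)
```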

Jensen-Shannon Divergence
The Jensen-Shannon divergence (D_JS) is a measure designed to address the weaknesses of the Kullback-Leibler divergence. Namely, unlike the latter, D_JS is symmetric, always finite and semibounded [35]. The Jensen-Shannon divergence is defined in terms of D_KL as:

$$D_{JS}(p, q) = \frac{1}{2} D_{KL}(p, m) + \frac{1}{2} D_{KL}(q, m) \qquad (13)$$

where m(x) = ½(p(x) + q(x)). Unfortunately no estimator of D_JS was given in [35], but it can be approximated using the estimators of D_KL as:

$$\hat{D}_{JS}(p, q) = \frac{1}{2N_p}\sum_{i=1}^{N_p} \log\frac{p(x_{p_i})}{m(x_{p_i})} + \frac{1}{2N_q}\sum_{i=1}^{N_q} \log\frac{q(x_{q_i})}{m(x_{q_i})} \qquad (14)$$

Thus again using (2):

$$\hat{D}_{JS}(p, q) = \frac{1}{2N_p}\sum_{i=1}^{N_p} \log\frac{\hat{p}(x_{p_i})}{\hat{m}(x_{p_i})} + \frac{1}{2N_q}\sum_{i=1}^{N_q} \log\frac{\hat{q}(x_{q_i})}{\hat{m}(x_{q_i})}, \qquad \hat{m}(x) = \frac{1}{2}\left(\hat{p}(x) + \hat{q}(x)\right) \qquad (15)$$

with p̂(x) and q̂(x) given by (2). Using the kNN density estimates of (4) and (5) in place of (2), one obtains the corresponding kNN-based estimator (16). Some of the applications of the Jensen-Shannon divergence include feature clustering for text classification [36] and outlier detection in sensor data [37].
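A minimal sketch of the Parzen window based estimator of (15) follows; it reuses the parzen_density function from the earlier sketch and a single shared bandwidth σ, as assumed in Section 2 (again our own illustration, not the experimental code).

```python
import numpy as np

def js_divergence_parzen(Xp, Xq, sigma):
    """Parzen window based Jensen-Shannon divergence estimate with a shared bandwidth."""
    def half_term(X, use_p):
        logs = []
        for x in X:
            p_hat = parzen_density(x, Xp, sigma)
            q_hat = parzen_density(x, Xq, sigma)
            m_hat = 0.5 * (p_hat + q_hat)                 # mixture density estimate
            logs.append(np.log((p_hat if use_p else q_hat) / m_hat))
        return 0.5 * np.mean(logs)
    return half_term(Xp, use_p=True) + half_term(Xq, use_p=False)
```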

Cauchy-Schwarz Divergence
The Cauchy-Schwarz divergence D_CS is a symmetric measure which obeys 0 ≤ D_CS ≤ ∞, with the minimum obtained for p(x) = q(x) [38]. The measure, inspired by the Cauchy-Schwarz inequality, was derived as a part of the Information Theoretic Learning (ITL) framework [39] and its theoretical properties have been further investigated in [11]. The Cauchy-Schwarz divergence is given by:

$$D_{CS}(p, q) = -\log\frac{\left(\int p(x)\, q(x)\, dx\right)^2}{\int p^2(x)\, dx \int q^2(x)\, dx} \qquad (17)$$

If the Parzen window method with Gaussian kernels is used for PDF estimation, substituting (2) into each integral of (17) in turn and rearranging yields integrals of products of pairs of Gaussian kernels. Using the Gaussian convolution property, i.e.,

$$\int G(x - x_i, \sigma^2 I)\, G(x - x_j, \sigma^2 I)\, dx = G(x_i - x_j, 2\sigma^2 I)$$

and inserting the resulting expressions into (17), one finally obtains:

$$\hat{D}_{CS}(p, q) = -\log\frac{\left(\frac{1}{N_p N_q}\sum_{i=1}^{N_p}\sum_{j=1}^{N_q} G(x_{p_i} - x_{q_j}, 2\sigma^2 I)\right)^2}{\left(\frac{1}{N_p^2}\sum_{i=1}^{N_p}\sum_{j=1}^{N_p} G(x_{p_i} - x_{p_j}, 2\sigma^2 I)\right)\left(\frac{1}{N_q^2}\sum_{i=1}^{N_q}\sum_{j=1}^{N_q} G(x_{q_i} - x_{q_j}, 2\sigma^2 I)\right)} \qquad (18)$$

Note that unlike for the other divergence measures presented above, in (18) the only approximation is the Parzen windowing itself, as due to the Gaussian convolution property there was no need to use (7). This suggests that the estimator of D_CS should potentially be more reliable than those of D_KL, D_J or D_JS.
It is interesting to note that D_CS can also be written as:

$$D_{CS}(p, q) = 2H(X; Y) - H(X) - H(Y) \qquad (19)$$

where H(X) = −log IP(X) denotes Renyi's quadratic entropy, IP(X) stands for the Information Potential [39] and H(X; Y) = −log IP(X; Y) is the corresponding cross-entropy term, which emphasizes the direct relation of D_CS to information theory. The Cauchy-Schwarz divergence has been used for example for classification and clustering [40,41].
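A minimal sketch of (18)/(19) in terms of estimated information potentials follows (our own illustration); note the kernel width 2σ² arising from the convolution of two σ-width Gaussian windows.

```python
import numpy as np

def information_potential(Xa, Xb, sigma):
    """Mean pairwise Gaussian kernel of width sqrt(2)*sigma between two samples."""
    d = Xa.shape[1]
    sq = np.sum((Xa[:, None, :] - Xb[None, :, :]) ** 2, axis=2)
    norm = (4.0 * np.pi * sigma ** 2) ** (d / 2.0)        # normaliser of G(., 2*sigma^2*I)
    return np.mean(np.exp(-sq / (4.0 * sigma ** 2)) / norm)

def cs_divergence_parzen(Xp, Xq, sigma):
    """Cauchy-Schwarz divergence estimate: 2*H(X;Y) - H(X) - H(Y) with H = -log IP."""
    ip_pq = information_potential(Xp, Xq, sigma)
    ip_pp = information_potential(Xp, Xp, sigma)
    ip_qq = information_potential(Xq, Xq, sigma)
    return -np.log(ip_pq ** 2 / (ip_pp * ip_qq))
```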

Mean Integrated Squared Error
The Integrated Squared Error (ISE) is a measure of distance between two PDFs. It is also a special case of a family of divergence measures presented in [42]. However, perhaps the best known application of ISE is estimation of the kernel bandwidth in the Parzen density method [12]. ISE is given by:

$$ISE(p, q) = \int \left(p(x) - q(x)\right)^2 dx \qquad (20)$$

After rearranging and applying (7) the following estimation formula can be obtained:

$$ISE(p, q) = E_p[p(x) - q(x)] - E_q[p(x) - q(x)] \approx \frac{1}{N_p}\sum_{i=1}^{N_p}\left[p(x_{p_i}) - q(x_{p_i})\right] - \frac{1}{N_q}\sum_{i=1}^{N_q}\left[p(x_{q_i}) - q(x_{q_i})\right] \qquad (21)$$

Using the Parzen window density estimators of (2) and rearranging one gets:

$$\widehat{ISE}(p, q) = \frac{1}{N_p^2}\sum_{i,j} G(x_{p_i} - x_{p_j}, \sigma^2 I) - \frac{2}{N_p N_q}\sum_{i,j} G(x_{p_i} - x_{q_j}, \sigma^2 I) + \frac{1}{N_q^2}\sum_{i,j} G(x_{q_i} - x_{q_j}, \sigma^2 I) \qquad (22)$$

which is a result surprisingly similar to (19), but this time the information potentials are used instead of entropies and the Gaussian kernel width is equal to σ rather than √2σ. As before, ISE can also be estimated using the kNN density estimators of (4) and (5), by substituting them for p(x) and q(x) in (21), yielding the estimator (23). Since ISE depends on the particular realisation of a set of points [12,17], in practice the Mean Integrated Squared Error (MISE) is used instead. MISE is the expectation of ISE and here it is estimated using (7).
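A minimal sketch of the Parzen window based ISE estimator of (22) is given below (our own illustration); structurally it mirrors the D_CS sketch above, but without the logarithms and with kernels of width σ.

```python
import numpy as np

def mean_kernel(Xa, Xb, sigma):
    """Average sigma-width Gaussian kernel value between all pairs from Xa and Xb."""
    d = Xa.shape[1]
    sq = np.sum((Xa[:, None, :] - Xb[None, :, :]) ** 2, axis=2)
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    return np.mean(np.exp(-sq / (2.0 * sigma ** 2)) / norm)

def ise_parzen(Xp, Xq, sigma):
    """Integrated Squared Error estimate between the Parzen estimates of p and q."""
    return (mean_kernel(Xp, Xp, sigma) - 2.0 * mean_kernel(Xp, Xq, sigma)
            + mean_kernel(Xq, Xq, sigma))
```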

Empirical Convergence of the Divergence Estimators
In this section we present an empirical convergence study of the estimators from Section 3, using a number of toy problems for which most of the divergence measures can be calculated exactly. The goal of these experiments is to check whether, and how quickly in terms of the sample size, the estimators converge.

Experiment Setup
Following [18], an empirical convergence study using four toy problems has been designed. For each of them two Gaussian distributions were used; the contour plots for the first three toy problems can be seen in Figure 1. The distributions were chosen to be Gaussian, that is p(x) = G(x − μ_p, Σ_p) and q(x) = G(x − μ_q, Σ_q), as in this case there exist closed-form formulas enabling exact calculation of most of the divergence measures introduced in the previous sections. The parameters of the PDFs were:
1. Problem 1, which is the same as the one in [18];
2. Problem 2, where the means of both distributions are equal, but the covariance matrices are not;
3. Problem 3, where the covariance matrices of both distributions are equal, Σ_p = Σ_q = [1, 0.9; 0.9, 1], but the means are not;
4. Problem 4, where the Gaussians are 20-dimensional, μ_p = [0 0 ... 0]^T, Σ_p = I, and μ_q, Σ_q have been generated randomly from the [−1, +1] and [0, +2] intervals respectively.
For each experiment, 100 random samples of an exponentially increasing size were drawn from both p(x) and q(x). The divergence estimate was then calculated as the mean value of the estimates for each of these 100 samples.
Denoting by d the dimensionality of the distributions, the Kullback-Leibler divergence between two Gaussian distributions p_G(x) and q_G(x) can be calculated using [43]:

$$D_{KL}(p_G, q_G) = \frac{1}{2}\left[\log\frac{|\Sigma_q|}{|\Sigma_p|} + \mathrm{tr}\left(\Sigma_q^{-1}\Sigma_p\right) + (\mu_q - \mu_p)^T \Sigma_q^{-1} (\mu_q - \mu_p) - d\right] \qquad (27)$$

Calculation of the Jeffrey's divergence between two Gaussian PDFs is then straightforward, as by (12) it is simply the sum of the two directed Kullback-Leibler divergences. In order to calculate D_CS and ISE exactly for the special case of two Gaussian distributions, the following Gaussian multiplication formula is used:

$$G(x - \mu_p, \Sigma_p)\, G(x - \mu_q, \Sigma_q) = G(\mu_p - \mu_q, \Sigma_p + \Sigma_q)\, G(x - \mu_{pq}, \Sigma_{pq}) \qquad (28)$$

where the exact expressions for μ_pq and Σ_pq are irrelevant in this case, as shown below. Using (28), the following holds:

$$\int G(x - \mu_p, \Sigma_p)\, G(x - \mu_q, \Sigma_q)\, dx = G(\mu_p - \mu_q, \Sigma_p + \Sigma_q) \qquad (29)$$

From the above and (17) and (20) the closed-form solutions for D_CS and ISE are:

$$D_{CS}(p_G, q_G) = -\log\frac{G(\mu_p - \mu_q, \Sigma_p + \Sigma_q)^2}{G(0, 2\Sigma_p)\, G(0, 2\Sigma_q)} \qquad (30)$$

$$ISE(p_G, q_G) = G(0, 2\Sigma_p) - 2\,G(\mu_p - \mu_q, \Sigma_p + \Sigma_q) + G(0, 2\Sigma_q) \qquad (31)$$

Unfortunately, there is no closed-form formula to calculate the Jensen-Shannon divergence, so an estimate based on (6), calculated for N_p = N_q = 100000, had to be used:

$$D_{JS}(p_G, q_G) \approx \frac{1}{2N_p}\sum_{i=1}^{N_p}\log\frac{p_G(x_{p_i})}{m_G(x_{p_i})} + \frac{1}{2N_q}\sum_{i=1}^{N_q}\log\frac{q_G(x_{q_i})}{m_G(x_{q_i})} \qquad (32)$$

where m_G(x) = ½(p_G(x) + q_G(x)). Note that in the equation above the terms under the logarithms do not depend on any PDF estimator-specific parameters, as the PDFs are given in analytical form, so the sample mean is the only estimation required.

Figures 2-10 present the results of the experiments for the 4 toy problems. The "best" estimate in each plot has been marked with a bold red line. Additionally, the confidence intervals (mean ± one standard deviation) for the best method were also plotted. The best estimator has been chosen according to the criterion of fast convergence to the true value of the estimated measure and lack of divergence after reaching that target value. The scoring function of Equation (37) has been proposed and used to choose the best estimator, where M = {10, 20, ..., 100, 200, ..., 1000, 2000, ..., 10000} is the set of sample sizes, ȳ_i is the mean value of the estimator for the sample size m_i ∈ M and t is the true value of the divergence measure. Note that this scoring function heavily penalizes any deviation from the true value for large sample sizes, which in effect assigns low scores to estimators which have not converged or have started to diverge.
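As an illustration, the closed-form D_KL of (27), which serves as the reference value for the convergence plots, can be computed as follows (a sketch with illustrative parameter values only, not those of the toy problems above):

```python
import numpy as np

def kl_gaussians(mu_p, cov_p, mu_q, cov_q):
    """Closed-form Kullback-Leibler divergence D_KL(p_G, q_G) between two Gaussians."""
    d = mu_p.shape[0]
    inv_q = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    return 0.5 * (np.log(np.linalg.det(cov_q) / np.linalg.det(cov_p))
                  + np.trace(inv_q @ cov_p)
                  + diff @ inv_q @ diff
                  - d)

# Illustrative usage: two 2-dimensional Gaussians with correlated and spherical covariances
mu_p, cov_p = np.zeros(2), np.array([[1.0, 0.9], [0.9, 1.0]])
mu_q, cov_q = np.ones(2), np.eye(2)
print(kl_gaussians(mu_p, cov_p, mu_q, cov_q))
```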
In the figures below the following code has been used for denoting the divergence estimators: XX−Y for the Parzen window based estimators and XX−k for the kNN based estimators, where k is the number of nearest neighbours and the remaining symbols have been given in Table 1.

Estimation of the Kullback-Leibler (KL) Divergence
Figure 2 shows the plots presenting the evolution of the Kullback-Leibler divergence estimators based on the Parzen window method as the sample size increases, for all 4 toy problems. The values on the horizontal axis denote the number of instances drawn from each distribution. As can be seen, there is a considerable discrepancy between the various estimators (i.e., estimators using various bandwidth selection methods). More specifically, while some of them seem to be converging to the true value, others diverge, which is especially well visible in the case of the high-dimensional toy problem 4. Moreover, even the "best" estimators only reach the true divergence values for sample sizes at which, if encountered in practice, the representativeness of even a random subsample should not be a problem. In such cases the purposefulness of divergence guided sampling seems doubtful. Figure 3 presents the experimental results for the Kullback-Leibler divergence estimators based on the kNN density estimator. In this case, the convergence for the 2-dimensional problems, albeit slow, can still be observed regardless of the value of k and has also been proven in [18]. However, for the 20-dimensional problem 4, the estimators fail completely.

Estimation of the Jeffrey's (J) Divergence
The behaviour of the Parzen window based estimators of the Jeffrey's divergence has been depicted in Figure 4. For the 2-dimensional problems, the picture is very similar to the case of D_KL. However, in the high dimensional space most of the estimators cannot even be evaluated due to numerical problems, resulting from near-zero values of many Gaussian functions in the calculation of D̂_KL(q, p).

Summary
The picture emerging from the experimental results does not look very optimistic. Many estimates of various divergence measures either diverge or converge too slowly, questioning their usefulness for the purpose pursued in this study. Of all the estimators examined, the Parzen window based Jensen-Shannon divergence estimator looks most promising, as it converges relatively quickly, although it also demonstrates a considerable variance before converging: even for a sample of 10 instances the true value is within one standard deviation of the mean value of the estimate.
A common problem is also the behaviour of most estimators in a high dimensional space. In this case only the Parzen window and kNN based estimators of D_JS seem to be acting reasonably (once again, note the units on the vertical axes in Figures 6d and 7d), but in general Parzen windowing with kernels with exponential fall-off is known to be problematic, as the density in large areas of the high-dimensional space is necessarily close to 0.
It is also important to bear in mind that all the toy problems examined are relatively simple, as they consist of symmetric, unimodal distributions, which unfortunately are rarely encountered in practice. As a result we can already expect the divergence estimation for real-world datasets to be even more difficult.

On the basis of the above experiments it is also very difficult to state what the minimal sample size should be for the divergence estimate to be accurate. There are at least two factors one needs to bear in mind: the dimensionality of the PDF and its shape. The higher the dimensionality, the more samples are needed to cover the input space and reflect the true underlying distribution. The same applies to the shape of the PDF: a spherical Gaussian can be described with a much lower number of samples than, e.g., some complex distribution with a lot of peaks and valleys. Moreover, the minimum sample size depends not only on the problem but also on the divergence estimator used. As can be seen in the above experiments, even for a single problem and a fixed sample size, one estimator could have already converged, while another requires a sample twice as big for convergence.

PDF Divergence Guided Sampling for Error Estimation
In this section the correlation between various PDF divergence estimators and the bias of a generalisation error estimate is investigated empirically using a number of benchmark datasets. Although, as shown in Section 4, the PDF divergence measures are rather difficult to estimate, they can still be useful in the context of ranking PDFs according to their similarity to a given reference distribution. The goal of these experiments is thus to assess the possibility of estimating the generalisation error in a single run, i.e., without retraining the model used, by taking advantage of PDF divergence estimators within the sampling process. This would allow the computational cost of error estimation to be reduced further when compared to both CV and DPS.

Experiment Setup
The experiments have been performed using 26 publicly available datasets (Table 2) and 20 classifiers (Table 3). For each dataset the below procedure was followed:
1. 400 stratified splits of the dataset have been generated, leaving 87.5% of instances for training and 12.5% of instances for testing, which corresponds to the 8-fold CV and DPS setting used in [1,2];
2. for each of the 400 splits and each class of the dataset, 70 different estimators of the divergence between the training and test parts were calculated, accounting for all possible combinations of the techniques listed in Table 1, where in the case of kNN density based estimators k = {1, 2, 3, 5, 9} was used; the rationale behind estimating the divergence for each class separately is that a dataset can only be representative of a classification problem if the same is true for every one of its classes;
3. for each divergence estimator the splits were sorted by the estimated value, forming 400 new splits per estimator (since for some splits some estimators produced negative values, these splits have been discarded);
4. 11 splits were selected for each divergence estimator based on the estimated value averaged over all classes, including the splits with the lowest and highest averaged estimates and 9 intermediate values;
5. the classifiers were trained using the training part and tested using the test part of each selected split and each divergence estimator, producing error estimates sorted according to the divergence estimate.
The above procedure has been depicted in Figure 11. A minimal illustrative sketch of steps 1-4 is given below.
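The sketch below is our own illustration (not the code used in the experiments): it generates stratified splits, scores each split with a per-class divergence estimate between its training and test parts, and returns 11 splits spanning the range of averaged estimates. The divergence_fn argument stands for any estimator from Section 3, e.g. the kNN-based D_KL sketch given earlier; all function and parameter names are ours.

```python
import numpy as np

def rank_splits(X, y, divergence_fn, n_splits=400, test_fraction=0.125, seed=0):
    """Generate stratified splits, score them with a per-class divergence estimate
    between training and test parts, and keep 11 splits spanning the score range."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    scored = []
    for _ in range(n_splits):
        # stratified split: sample the test part class by class
        test_idx = np.concatenate([
            rng.choice(np.flatnonzero(y == c),
                       size=max(1, int(round(test_fraction * np.sum(y == c)))),
                       replace=False)
            for c in classes])
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        # divergence between training and test parts, estimated per class
        divs = [divergence_fn(X[train_mask & (y == c)], X[~train_mask & (y == c)])
                for c in classes]
        if all(v >= 0 for v in divs):          # discard splits with negative estimates
            scored.append((float(np.mean(divs)), train_mask))
    # sort by the estimate averaged over classes and keep 11 evenly spread splits
    scored.sort(key=lambda t: t[0])
    keep = np.linspace(0, len(scored) - 1, 11).astype(int)
    return [scored[i] for i in keep]
```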

Correlation between Divergence Estimators and Bias
In the course of the experiments over 30,000 correlation coefficients have been calculated, accounting for all dataset/classifier/divergence estimator triplets, with the exception of the cases in which calculation of the correlation was not possible due to numerical problems during calculation of the divergence estimators (especially the ones based on AMISE bandwidth selection).
The maps of linear correlation coefficients between bias and divergence estimates, averaged over all divergence estimators used, have been depicted in Figure 12. The crossed-out cells denote the situation in which for all 11 splits both the values of the divergence estimator and the bias were constant, so it was impossible to assess correlation. As can be seen, for the signed bias, moderate correlation can be observed only for a handful of datasets. However, in some cases this applies to all (chr, let) or almost all (cba) classifiers. For other datasets the correlation is weak to none and sometimes even negative. Only occasional and rather weak correlation can be observed in the absolute bias scenario. This can be confirmed by looking at Figure 13 with histograms of correlation coefficients for all 30,000 dataset/classifier/divergence estimator triplets in both signed and absolute bias scenarios. Thus only the former scenario appears viable, as in the case of absolute bias the histogram is skewed towards −1 rather than 1.

Figure 18 presents the histograms of correlation coefficients for individual divergence estimators. As can be seen, there is only a handful of estimators which demonstrate a certain degree of correlation with the bias, including some of the Cauchy-Schwarz and Kullback-Leibler divergence estimators and especially kl2-k3, kl2-k5 and kl2-k9. This seems to contradict the experimental results presented in Figure 3, where it can be seen that the higher the number of neighbours, the slower the convergence of the kl2 estimators. In the case of kl2-k1 in Figure 18 however, the histogram is symmetric if not skewed to the left, while its shape becomes more right-skewed as the number of nearest neighbours is increased.
In Figure 19 the histograms of datasets, classifiers and divergence estimators for the 806 high (≥ 0.9) signed bias correlation cases have been presented. The first observation is that the correlation is indeed strong only for 3 to 4 datasets and the divergence estimators already identified. The disappointing performance of the ise1, j2, js2 and ise2 estimators has also been confirmed. Also note that although the histogram of classifiers does not present a uniform distribution, there are numerous high correlation cases for almost all classifiers, with knnc, gfc, efc taking the lead, and treec being the worst one.
The most surprising conclusion can however be drawn from examination of the four datasets for which the high correlation has been observed. A closer look at Table 2 reveals that the one thing they have in common is a large number of classes, ranging from 20 to 24, while most of the remaining datasets have only 2 to 3 classes. Since in the experimental setting used the divergences have been approximated for each class separately, the estimates have been effectively calculated for very small sample sizes (the average class size for the let dataset is just 39 instances). From the experiments described in Section 4 it is however clear that for sample sizes of this order the estimates are necessarily far from converging, especially in the case of high-dimensional problems. However, in order to put things into perspective, one needs to realize that the 806 high correlation cases constitute just above 2.6% of the total number of over 30,000 cases. Thus effectively they form the tail of the distribution depicted in Figure 13 and most likely do not have any practical meaning.
For comparison with the results of [2], scatter plots of the 49 unique subsamples of the Cone-torus dataset for the lowest values of all divergence estimators used in the experiments have been depicted in Figure 20. The number in round brackets in the title of each plot denotes an identifier of the unique subset. The decision boundaries of a quadratic classifier (qdc) have also been superimposed on each plot. The classifier has been chosen due to its stability, so that any drastic changes in the shape of the decision boundaries can be attributed to considerable changes in the structure of the dataset used for training. In the majority of cases, the decision boundaries resemble the ones given in [2]. The same applies to the banana-shaped class, which is also clearly visible in most cases. This can be contrasted with Figure 21, containing the scatter plots of 49 unique subsets for the highest values of the divergence estimators, where the decision boundaries take on a variety of shapes. As can be seen though, the properties of the subsamples do depend on the values of the divergence estimators. For the Cone-torus dataset (cnt) there was however only a handful of high correlation cases. This behaviour is in fact very similar to that of DPS, where typically 7 out of 8 folds resembled the original dataset when examined visually. Thus although the examined divergence estimators were not able to produce a single fold allowing for generalisation error estimation, they could be used in a setting similar to the one presented in [2].

The authors report "excellent" results if three or more representatives from each class are selected in the case of the Gaussian datasets and six or more representatives in the non-Gaussian setting, although no numerical results are presented. According to the experimental results reported in Section 4, for a sample size of 100 in most cases it is difficult to expect the D̂_KL estimate to approximate the true divergence value well. However, by manipulating the kernel covariance matrix one is able to almost freely influence the value of the estimate. In the discussed paper, the authors have set the kernel covariance matrix to be equal to the sample covariance matrix, which led to excellent performance only on the Gaussian datasets. This is not surprising, as in this case a single kernel function was able to approximate the class PDF well, if located correctly. If one also takes into account that a relatively stable quadratic classifier was used in the experiments, the results should be attributed to this specific experimental setting rather than to optimisation of the divergence. The authors admit that "the selection of kernel function and kernel covariance matrix is not clearly understood", which suggests that it is the manual tuning of the covariance matrix which might be responsible for the "excellent" results in the non-Gaussian scenario.
Surprisingly, in most of the literature the Kullback-Leibler or Jeffrey's divergence is not estimated at all. Instead, it is either argued that optimisation of a given objective function is equivalent to optimisation of the divergence between an empirical measure and the true yet unknown distribution (e.g., [47,48]), or closed-form solutions are used, restricting the PDFs to be Gaussian (e.g., [29,49]). Hence it appears that in practice D_KL is mostly of theoretical interest, stemming from its connection to information theory.
In contrast, in [37] the authors use an estimator of the Jensen-Shannon divergence in a form similar to (14) but with a different kernel function. In their experimental setting the divergence estimator is calculated for two samples with sizes equal to 1024 and 10,240, which according to the results presented in Figure 6 is more than enough to obtain accurate estimates, even in a high-dimensional space.
Estimation of the divergence (and other measures for that matter) is hence always connected with a risk resulting from poor accuracy of the estimators, if the datasets do not contain thousands or tens of thousands of instances. This issue is however often neglected by other researchers. An example can be found in [41], where a Cauchy-Schwarz divergence based clustering algorithm is derived. The authors report good performance for two synthetic, 2-dimensional datasets (419 instances, 2 clusters and 819 instances, 3 clusters) using the RoT method to determine the "optimal" Parzen window width and then heuristically annealing it around that value. These results are broadly in agreement with the ones presented in Figure 10, although they might also be a result of manual tuning of the annealing process.
As for the second experimental part of this study, due to the wide range of datasets used, it can be stated that sampling by optimisation of any divergence measure estimator presented in Section 3 does not produce subsets which lead to consistently accurate estimation of the generalisation error. At this stage it is however difficult to state whether the reason for this result is the unsuitability of the divergence measures used or the poor accuracy of their estimators, especially in the light of the properties of the datasets for which high correlation has been identified.

Conclusions
In this paper, the accuracy and empirical convergence of various PDF divergence estimation methods and their application to sampling for the purpose of generalisation error estimation have been evaluated. Five different divergence measures, all having sound theoretical foundations, have been examined, paired with two PDF estimators with various parameter settings, leading to 70 divergence estimators in total.
The most important outcome of this study is that the evaluated divergence measures can be estimated accurately only if large amounts of data are available. For some estimators and problems it is possible to obtain accurate estimates for sample sizes in the range of 400-600 instances, while in other cases even 10,000 instances are not sufficient. It is important to emphasize here that empirical convergence has been assessed using simple, synthetic problems with data sampled from Gaussian distributions. Despite this fact, the results are not very encouraging and in fact call into question the practical usability of the examined PDF divergence measures, at least in situations where their values need to be approximated.
The experimental results of Section 4 have however revealed that although the estimators might be off the true divergence value by a large margin, their values can nevertheless differ quite considerably from one fixed size sample to another. Hence there is a possibility that such estimators can still quantify the similarity between two PDFs, which would be sufficient for applications in which the relative rather than the absolute values of the estimator are of interest. One such application has also been investigated in this paper.
Building upon the encouraging experimental results of the Density Preserving Sampling technique derived in [1,2], the idea was to exploit some of the PDF divergence measures as objective functions of a sampling procedure, in order to obtain a representative subsample of a given dataset, usable for accurate estimation of the generalisation error. This would in effect further reduce the computational cost of generalisation performance assessment, not only when compared to cross-validation but also with respect to DPS. Unfortunately, in the experiments of Section 5 no strong correlation between the bias and the divergence estimators has been identified. Although in some particular cases discussed in the previous sections the correlation coefficient exceeded 0.90, these cases account for just above 2.6% of the total number of examined cases and should most likely be credited to a specific set of circumstances rather than to any properties of the divergence estimators used. Hence the PDF divergence measures examined here still remain of only theoretical significance, at least from the point of view of predictive error estimation.

Figure 12. Correlation between bias and divergence estimates averaged over the latter.

Figure 17. Correlation maps for each dataset and signed bias (axes as in Figure 15).

Figure 19. Histograms of datasets, classifiers and divergence estimators for the 806 high (≥ 0.9) signed bias correlation cases (numbers of cases denoted on the vertical axis).
* The number of instances actually used in the experiments, selected randomly with stratified sampling from the whole, much larger dataset in order to keep the experiments computationally tractable.

Table 3. Classifiers used in the experiments.