Nonparametric Multivariate Density Estimation: Case Study of Cauchy Mixture Model

Tomas Ruzgas; Mantas Lukauskas; Gedmantas Čepkauskas

doi:10.3390/math9212717

Abstract

Estimation of probability density functions (pdf) is considered an essential part of statistical modelling. Heteroskedasticity and outliers are the problems that make data analysis harder. The Cauchy mixture model helps us to cover both of them. This paper studies five different significant types of non-parametric multivariate density estimation techniques algorithmically and empirically. At the same time, we do not make assumptions about the origin of data from any known parametric families of distribution. The method of the inversion formula is made when the cluster of noise is involved in the general mixture model. The effectiveness of the method is demonstrated through a simulation study. The relationship between the accuracy of evaluation and complicated multidimensional Cauchy mixture models (CMM) is analyzed using the Monte Carlo method. For larger dimensions (d ~ 5) and small samples (n ~ 50), the adaptive kernel method is recommended. If the sample is n ~ 100, it is recommended to use a modified inversion formula (MIDE). It is better for larger samples with overlapping distributions to use a semi-parametric kernel estimation and more isolated distribution-modified inversion methods. For the mean absolute percentage error, it is recommended to use a semi-parametric kernel estimation when the sample has overlapping distributions. In the smaller dimensions (d = 2) and a sample is with overlapping distributions, it is recommended to use the semi-parametric kernel method (SKDE) and for isolated distributions, it is recommended to use modified inversion formula (MIDE). The inversion formula algorithm shows that with noise cluster, the results of the inversion formula improved significantly.

Keywords:

Cauchy mixture model; nonparametric density estimation; density estimation algorithms; adapted kernel density estimate; logspline estimation

MSC:

62G05; 62G07; 62G30

1. Introduction

Estimation of probability density functions (pdf) is considered an essential part of statistical modelling. It expresses random variables as functions of other variables, making it possible to detect hidden relationships between data [1]. In a significant number of machine learning algorithms, it is essential to determine a previously unknown function of the distribution density of the data. The function of the distribution density is applied in the Bayesian classifier [2,3], in density-based clustering algorithms [4,5,6], or information-based feature selection algorithms [7,8]. Effective density estimates must be carefully created in advance to obtain unknown functions of probability density. Nowadays, there is still much focus on developing innovative density estimation procedures [9,10]. Density estimation is an open research topic in the fast-growing area of deep learning. Scientists have begun proposing robust density estimators based on neural networks such as Parzen neural networks [11], soft-constrained neural networks [12], and others [13].

Let us say that the random vector X∈R^d satisfies the distribution mixture model if its distribution density f(x) satisfies the equation

f (x) = \sum_{k = 1}^{q} p_{k} f_{k} (x) = f (x, θ)

. The parameter q is the number of mixture clusters, and p_k is the a priori probability. These conditions must also be met:

p_{k} > 0

and

\sum_{k = 1}^{q} p_{k} = 1

. The function

f_{k} (x)

is a function of the distribution density, and

θ

is a multidimensional parameter of the model. Suppose

X

is a d-dimensional random vector with a distribution density

f (x)

, and there is a sample of independent copies of

X

, where

X = (X (1), \dots, X (n))

. It can be argued that the sample satisfies the mixture model if

X (t)

satisfies

f (x) = \sum_{k = 1}^{q} p_{k} f_{k} (x) = f (x, θ)

.

One of the statistical tasks is to estimate the density of the observed random variable. Suppose the available sample’s distribution type is known (Normal, Poisson, and others). In that case, the distribution density of the data can be estimated simply using mean and covariance matrix estimates, fitting them to a defined distribution [14,15,16]. Thus, the standard parametric method is applied when the assumptions about the density form are met. When estimating density in a parametric way, the value of the multidimensional distribution parameter θ needs to be found, which is not straightforward because the number of parameters increases rapidly as the dimension d increases. For example, in the case of a mixture of Gaussian distributions,

\dim θ = \frac{1}{2 q d (d + 1)} + q d + q - 1

, and even with a small dimension

d = q = 5

, the model will consist of

\dim (θ) = 104

parameters. When searching for parameter estimates, it may be necessary to solve the optimization problem in the 104-dimensional space. In practice, the number of clusters

q

may also be unknown, and it needs to be estimated. The parametric method is not proper when the random size distribution is unknown. In this case, non-parametric methods are used to determine certain forms of density estimates [17,18,19].

The histogram is one of the simplest and oldest estimates of density. To the best of our knowledge, data in the form of histograms (without graphical representation) were first presented in 1661 to determine mortality probabilities in different age groups [20]. To approximate the density

f (x)

in the area

Ω

, the number of observations

X (t)

falling into

Ω

is calculated and divided by n and the volume of the area

Ω

. The area of space to which all observations fall is first found. That means the fluctuation intervals of all

X

projections on the axes

X^{(1)}

,

X^{(2)}

,…,

X^{(d)}

are found. The fluctuation intervals of the observations are divided into l partial intervals and in the hypercubes

Ω_{j} (j = 1, \dots, r)

bounded by them, the density estimate is calculated as

\hat{f} (x) = \frac{n (Ω_{j})}{n \cdot h_{1} \cdot h_{2} \dots \cdot h_{d}} .

(1)

Here

n (Ω_{j})

is the number of observations entering the hypercube

Ω_{j}

and

h_{j}, j = 1, \dots, d

are the edges of the hypercube [21,22]. It is recommended to select the number of hypercubes [17,23,24], and to choose

r ≅ 1 + 3.32 \log (n)

, and

l = \sqrt[d]{r}

has to be an integer number, so

r

is chosen that

{⌈ \sqrt[d]{1 + 3.32 l o g n}}^{d} ⌉

.

A histogram is one of the simplest means of presenting data that is easy to understand and convenient. This estimate is a function that acquires non-negative values, and its integral is equal to one. However, it is not continuous. That poses problems when knowing the density estimate derivatives is essential, mainly when density estimation is used in intermediate steps of other methods, such as clustering using a gradient algorithm or plotting high-measurement data-level lines. Remarkably, the histogram stood as the only non-parametric density estimator until the 1950’s when substantial and simultaneous progress was made in density estimation and spectral density estimation. In 1951, in a little-known paper, Fix and Hodges [25] introduced the basic algorithm of non-parametric density estimation; an unpublished technical report was formally published as a review by Silverman and Jones in 1989 [26]. They addressed the problem of statistical discrimination when the parametric form of the sampling density was not known. During the following decade, several general algorithms and alternative theoretical modes of analysis were introduced by Rosenblatt in 1956 [27], Parzen in 1962 [28], and Cencov in 1962 [29]. Then followed the second wave of essential and primarily theoretical papers by Watson and Leadbetter in 1963 [30], Loftsgaarden and Quesenberry in 1965 [31], Schwartz in 1967 [32], Epanechnikov in 1969 [33], Tarter and Kronmal in 1970 [34], and Kimeldorf and Wahba in 1971 [35]. Next, Cacoullos introduced the natural multivariate generalization in 1966 [36]. Finally, in the 1970s, the first papers focusing on the practical application of these methods were published by Scott et al. in 1979 [24] and Silverman in 1978 [37]. These and later multivariate applications awaited the computing revolution.

Modern data analysis uses several non-parametric methods for statistically estimating the distribution density of multivariate random variables. Kernel estimates are particularly prevalent [38,39]. Quite popular and spline [40,41] and semi-parametric [42,43,44,45,46] algorithms. However, detailed comparisons of the effectiveness of existing popular estimates for multimodal density are lacking. With the most popular non-parametric estimation procedures, optimal selection of their parameters is encountered in practice. The most crucial element in the design of kernel estimates is the width of the smoothing. It is not easy to select the nodes of the spline estimates. Although several adaptive procedures for the selection of these parameters have been developed [39,47,48,49,50,51,52], however, they are not efficient enough when the sample volume is not large, especially then the observational dimension is large. In the latter case, it is appropriate to apply data design [53,54,55,56] because of the more extensive the dimension of the observed random vectors, the more complex the task of parameter selection.

The main idea of this paper is to estimate the performance of different density estimators by using density mixtures to show another type of problem, which may result from data heteroscedasticity and outliers. The relationship between the accuracy of evaluation and complicated multidimensional Cauchy mixture models (CMM) is analyzed using the Monte Carlo method. For example, Kalantan and Einbeck [57] used engineering data and, for computer vision, used CMM, comparing it with the Gaussian mixture model. Azzari and Foi [58] used harmony between Gauss and heavy-tailed Cauchy to find noise-model parameters that make outlier estimation robust when imaged dominated by texture. Finally, Teimouri [59] analyzed patients with Cushing’s syndrome and their diagnostic tests. The focus was on the tetra hydrocortisone urine release rate (mg/24 h) and evaluating parameters in the EM algorithm and Cauchy mixture model.

Scientific novelty. Evaluation accuracy comparative analysis is made by using different probability density estimation procedures. Density function estimates are chosen as popular different technique estimates, which other researchers have already analyzed. This research is essential because it focuses on Cauchy distributions.

2. The Density Estimation Algorithms

This section aims to present the density estimation algorithms used in the study theoretically. All algorithms are presented with algorithms theoretical substantiation. When making the histogram, each

X (t)

can be imagined as a separate column with a height of

1 / n

. Then it makes sense to change the centre of the column to

X (t)

itself and get the following function:

\hat{f} (x) = \frac{1}{n \cdot h_{1} \cdot h_{2} \dots \cdot h_{d}} \sum_{t = 1}^{n} I_{C_{h} (X (t))} (x) .

(2)

Here

C_{h}

is a hypercube with centre

X (t)

, and the lengths of the edges are

h_{1}, \dots, h_{d}

. In summary, instead of the indicator function, a smooth “prominence”—the kernel function—can be used at each observed point. The multidimensional fixed-width bandwidth estimate with the kernel function K and the fixed (global) kernel width parameter h, which can be used to estimate the density

\hat{f} (x)

of the multidimensional data X∈R^d, is then defined as follows:

\hat{f} (x) = \frac{1}{n h^{d}} \sum_{t = 1}^{n} K (\frac{x - X (t)}{h}) .

(3)

These are some of the most common non-parametric estimates of distribution density [38,39,60,61]. The kernel function is selected to meet the following condition:

\int_{R^{d}} K (x) d x = 1 .

(4)

The standard normal distribution density function

φ

is often used as the kernel [62,63]:

ϕ (x) = (2 π)^{- d / 2} \exp (- \frac{1}{2} x^{'} x) .

(5)

Often the observations are not evenly distributed in all directions. Therefore, it is desirable to scale the data by eliminating the most significant dispersion differences in the different coordinate directions. One suitable method for this [64] is data standardization. That means the sample’s effect on a linear transformation. The mean of the transformed data is zero, the covariance matrix is unitary, and (3) apply the Equation to already standardized data. For example, suppose Z is a standardized random vector,

Z = S^{- \frac{1}{2}} (X - \bar{X}),

(6)

here

\bar{X}

is the empirical mean of the sample, and S∈R^d×d is the empirical covariance matrix. Based on the fixed kernel width density estimate (3), a more complex standardized data density estimate has been constructed:

\hat{f_{z}} (z) = \frac{1}{n h^{d}} \sum_{t = 1}^{n} K (\frac{z - Z (t)}{h}) .

(7)

\overset{´}{f} (x) = \frac{{(d e t S)}^{- 1 / 2}}{n h^{d}} \sum_{t = 1}^{n} K (S^{- 1 / 2} \frac{x - X (t)}{h}) .

(8)

The optimal kernel width

h^{*}

for a fixed core width is determined by minimizing the average integral root mean square error (MISE) [65]. For example, when the distribution of observations is normal with a unit covariance matrix in Gaussian kernel, the expression

h^{*}

proposed by [65] is

h^{*} = A n^{- \frac{1}{d + 4}}

, here

A = {[4 / (2 d + 1)]}^{\frac{1}{d + 4}}

. More sophisticated kernel width selection methods (such as the least-squares cross-checking method) are obtained by more complex and lengthy calculations [66,67,68,69,70].

In practical research, the kernel width is often selected experimentally. If the value of h is small, the density function estimate has more modes that correspond to the layout of the observed data. A higher value of h means more significant smoothing of the estimate.

Although fixed-core width density estimates are widely used to estimate non-parametric densities, they often have some practical drawbacks [65]. For example, fixed-core width density estimates do not ensure the distribution ends’ integrity without over-smoothing the underlying bulk density.

2.1. Adapted Kernel Density Estimate (AKDE)

A good improvement on the fixed kernel width density estimate is the adapted kernel density estimate [65]. The adapted kernel density estimate is constructed similarly to the fixed kernel width density estimate. The kernel describes the density at each observed point. In this case, the kernel width is already considered when moving from one observation to another. In areas of different smoothness, it is appropriate to take different kernel widths. This method consists of two steps: estimation of the adapted kernel width and density estimation by the kernel method, using the information obtained in the first step. The algorithm can be summarized as follows:

Step 1: The elements of sample

X = (X (1), \dots, X (n))

are standardized to

Z = (Z (1), \dots, Z (n))

such that

\hat{E} [Z] = 0

and

\hat{E} [Z Z^{'}] = I

.

Step 2: Estimates

{\tilde{f}}_{Z} (z)

of the fixed kernel density estimate (3) satisfying the condition

{\tilde{f}}_{Z} (Z (t)) > 0

, ∀t.

Step 3: The local width parameter is determined

λ_{t} = {(\frac{{\tilde{f}}_{Z} (Z (t))}{g})}^{- γ}

, where g is

{\tilde{f}}_{Z} (z)

the geometric mean,

\log g = \frac{1}{n} \sum_{i = 1}^{n} \log \hat{f_{Z}} (Z (t))

and γ is the sensitivity parameter:

0 \leq γ \leq 1

.

Step 4: An adapted kernel estimate is made with variable-width kernels:

{\hat{f}}_{Z} (z) = \frac{1}{n} \sum_{t = 1}^{n} h^{- d} λ_{t}^{- d} K (\frac{z - Z (t)}{h λ_{t}})

.

Where h is the same global smoothness parameter as in Equation (3), the higher

γ

, the more sensitive the density selection. Quite often, the parameter value is selected as follows

γ = \frac{1}{2}

[65,71].

2.2. Semi-Parametric Kernel Density Estimate (SKDE)

When data are scarce, parametric estimates are often applied even when the unknown density is not parameterized. Therefore, it is essential to mention the combination of parametric and non-parametric estimates. For example, one of the semi-parametric estimates of kernel density was examined by F. Hoti and L. Holmström [46]. This estimate divides the random vector into two subvectors and estimates the distribution density of one of them by the kernel method. Afterward, another relative density is approximated by the Normal distribution density [46]. For example, suppose d and s are positive integers d ≥ 2, 1 ≤ s ≤ d − 1. Using this method, the d-dimensional vector X∈R^d is decomposed into two s and (d–s) dimensioned subvectors

X = (\begin{matrix} Y \\ Z \end{matrix})

, and the sample is decomposed accordingly:

X = (\begin{matrix} Y \\ Z \end{matrix})

, where Y∈R^s, Z∈R^d^-s. The evaluated density function is expressed as the product of the distribution density of the random vector Y and the conditional distribution density of the random vector Z:

f_{X} (x) = f_{(Y, Z)} (y, z) = f_{Y} (y) f_{Z | Y = y} (z | y)

,

x = (\begin{matrix} y \\ z \end{matrix}) \in R^{d}

. Here

f_{X}

and

f_{Y}

are the densities of

X

and

Y

.

f_{Z | Y = y}

is the density of

Z

when

Y = y

.

Suppose that the relative density

Y = y

is multidimensional normal Gaussian, but the density

f_{Y}

does not belong to any family of parametric functions. The density

f_{X}

is then obtained by estimating

f_{Y}

in a non-parametric manner and applying a multidimensional Normal density to each

f_{Z | Y = y}

. The density function

f_{Y} (y)

, as with (8), is evaluated by the kernel method [65]. Since the sample elements are not standardized, the smoothness parameter is not the same in all directions. Therefore, using the kernel method, it is replaced by the s-dimensional matrix H:

\hat{f} (y) = \frac{1}{n} \sum_{t = 1}^{n} \frac{1}{\det (H)} K (H^{- 1} (y - Y (t))) .

(9)

Usually, the shape of H is chosen diagonally—

H = d i a g (h_{1}, \dots, h_{s})

, and the smoothness parameters are calculated as follow

h_{j} = {(\frac{4}{s + 2})}^{1 / (s + 4)} n^{- 1 / (s + 4)} σ_{j} .

(10)

It should be noted that this form, when s = 1, was proposed by B. W. Silverman [65]. Replacing the standard deviation σ_j of the component Yj with its estimate

{\hat{σ}}_{j} = \sqrt{\frac{\sum {(X_{j} - \bar{X_{j}})}^{2}}{n_{j}}}

and by the rule of D. W. Scott [39] first multiplier is always between 0.924 and 1.059

{\hat{h}}_{j}

can be calculated as follows

{\hat{h}}_{j} = n^{- 1 / (s + 4)} {\hat{σ}}_{j} .

(11)

This Scott’s rule is easy to summarize for the smoothness matrix H:

\hat{H} = n^{- 1 / (s + 4)} {Σ^}^{1 / 2} .

(12)

Here

\hat{Σ} = diag ({\hat{σ}}_{1}^{2}, {\hat{σ}}_{2}^{2}, \dots, {\hat{σ}}_{s}^{2})

is the diagonal matrix of Y empirical variances.

The conditional density

f_{Z | Y} (\cdot | y)

is approximated by the Gaussian distribution N(m(y), C(y)), where m(y), C(y) denote the conditional mean of the vector Y and the conditional covariance matrix:

m (y) = E (Z | Y = y)

, y∈R^s,

C (y) = E [{(Z - m (y))}^{'} (Z - m (y)) | Y = y]

, y∈R^s. For the estimation of m(y) and C(y), it is proposed to apply the kernel smoothing:

\hat{m} (y) = \frac{\sum_{t = 1}^{n} K_{H_{2}} (y - Y (t)) Z (t)}{\sum_{j = 1}^{n} K_{H_{2}} (y - Y (j))} = \sum_{t = 1}^{n} W_{H_{2}} (y - Y (t)) Z (t), y \in R^{s} .

(13)

Here are the weights

W_{H_{2}} (y - Y (t)) = \frac{K_{H_{2}} (y - Y (t))}{\sum_{j = 1}^{n} K_{H_{2}} (y - Y (j))}

.

The sum of which is equal to one. The formula (13) can be understood as a regression estimate of the conditional mean function m of Nadaraya and Watson [72,73]. The conditional covariance matrix can be evaluated similarly

\hat{C} (y) = \sum_{t = 1}^{n} W_{H_{3}} (y - Y (t)) {(Z (t) - \hat{m} (y))}^{'} (Z (t) - \hat{m} (y))

, y∈R^s. The parametric estimate of the relative density f(Z|Y) = y looks akin to this

{\hat{f}}_{Z | Y = y} (z) = {[{(2 π)}^{d - s} \hat{\det C} (y)]}^{- 1 / 2} \exp \{- \frac{1}{2} (z - \hat{m} (y)) \hat{C} {(y)}^{- 1} {(z - \hat{m} (y))}^{'}\}

, z∈R^d^-s. The estimate of the distribution density f_X of X then is:

{\hat{f}}_{X} (x) = {\hat{f}}_{(Y, Z)} (y, z) = {\hat{f}}_{Y} (y) {\hat{f}}_{Z | Y = y} (z)

, x = (y,z)∈R^d.

The procedure described above is called the semi-parametric kernel density estimate. In practice, even if the conditional assumption of the normality of several random vector components is satisfied. The decomposition dimensions also influence the accuracy of the density estimation results, and the choice of the coordinates influences the accuracy of the density estimation results. One way to select them is to use the least-squares method or the maximum likelihood cross-entropy method recommended by original method authors [46]. The authors propose the parameters H₂ and H₃ [46] to select 2H.

2.3. Logspline Estimation (LSDE)

This subsection describes the logspline estimation (LSDE) calculation. One-dimensional polynomial splines are called partial polynomials of a certain degree. Breakpoints that contain a transition from one polynomial to another are called nodes. Suppose that the vector

t = (t_{1}, \dots, t_{K}) \in R^{K}

defines a set of such K points. Splines describe smooth connections, showing how different areas are separated by nodes [74]. These constraints are precisely defined by expressing partial polynomials in the number of continuous derivatives s. These include partially linear curves. If there are no restrictions, breakpoints are allowed in the nodes of these functions. Assuming that the functions are globally continuous, it is required that the individual linear parts meet at each node. If greater smoothness is needed (for continuous first-order derivatives), then the flexibility of the nodes is lost. Moreover, the curves are considered simple linear functions. The term “linear spline” is applied to a continuous partial linear function in the literature on approximation theory. Accordingly, the term “cubic spline” is assigned to continuous cubic functions with second-order continuous derivatives and nodes that allow jumps of third-order derivatives. If the polynomial degree is b and the vector of the nodes is t, then the set of polynomial splines with s continuous derivatives forms a linear space. For example, a set of linear splines with nodes in the sequence t is defined by function

1, x, (x - t_{1}) +, \dots, (x - t_{K}) + .

(14)

Here (٠)₊ = max(٠,0). We will rely on this set as the base of space. The base of the spline space of degree b and s smoothness consists of monomial whose form

{(x - t_{k})}_{+}^{s + j}

, here 1 ≤ j ≤ b − s. Using this formula, in the case of classical cubic splines, where b = 3 and s = 2, the base consists of elements

1, x, x^{2}, x^{3}, {(x - t_{1})}^{3} +, \dots, {(x - t_{K})}^{3} + .

(15)

From the model point of view, this base is convenient because the individual functions at the nodes are merged. In expressions (14) and (15), each function is precisely associated with one of the nodes, and removing this function essentially corresponds to removing the node itself. It is known that the numerical properties of functions (14) and (15) are poor. For example, the solution matrix deteriorates as rapidly as the number of nodes decreases in linear regression problems. A practical alternative is the so-called B-spline base [75,76]. These functions are designed to be supported in several contiguous intervals defined by nodes (b + 1 contiguous intervals are used for the smoothest splines). Suppose we can find the basis for splines of space B₁(x; t), …, B_J(x; t) b with smoothness s and a sequence of nodes t so that any function in space can be written as

g (x; β, t) = β_{1} B_{1} (x; t) + \dots + β_{J} B_{J} (x; t)

. Where the corresponding coefficient vector is β = (β₁, …, β_J)′. As seen from (14) and (15), then spline spaces of maximum smoothness are used J = K + b + 1.

According to the title of the subsection, the object of this analysis is the logarithmic density. Suppose X is a random vector that takes values from the interval (L, U). In the individual case, L and U can be ±∞. The parameters L and U are set to 2t₁ − t₂ and 2t_K − t_{K − 1}, respectively. If β₁ ≥ 0 or β_K _{− 1}≥0, then the adjustment is made 2L_old − t₁ and U_new = 2U_old − t_K is performed. The method of Kooperberg and Stone [52,77,78,79], known as logspline, is implemented with cubic spline. The cubic spline is described in (15). These functions are also continuously differentiated, and the partial polynomials are defined accordingly in the sequence of nodes t = (t₁, …, t_K). In each interval [t₁, t₂], …, [t_K₋₁, t_K] cubic splines are also cubic polynomials, but at the edges (L, t₁] and [t_K, U) are linear functions. The minimum number of nodes is K ≥ 3 (otherwise, a linear function or constant can be obtained). The basis form is 1, B₁(x; t), …, B_J(x; t), where J = K − 1.

It is said that the vector β = (β₁, …, β_J)′∈R^J exists then

C (β, t) = l o g (\int_{L}^{U} e x p (β_{1} B_{1} (x; t) + \dots + β_{J} B_{J} (x; t)) d x) < \infty

. Suppose B denotes a set of such possible vectors. After selecting β∈B, the family of positive density functions in the interval (L, U) is defined the form of which is

g (x; β, t) = e x p (β_{1} B_{1} (x; t) + \dots + β_{J} B_{J} (x; t) - C (β, t)), L < x < U .

(16)

Now, having a random sample n of magnitude X(1), …, X(n) from the interval (L, U) with an unknown density function f, the logical probability function corresponding to the model of logsplines (16) is

l (β, t) = \sum_{i} l o g (g (X_{i}; β, t)) = \sum_{i} \sum_{j} β_{j} B_{j} (X_{i}; t) - n C (β, t), β \in B .

(17)

where estimation of maximum likelihood

\hat{β} = a r g m a x_{β \in B} l (β, t)

and an estimate of density

\hat{f} = g (x; \hat{β}, t)

, L < x < U.

Let us say that during the stepwise determination procedure, the sequence of models is denoted by v, the vth model has J_v base functions. The Generalized Akaike Information Criterion (AIC) selects the best model [80]. Suppose that

{\hat{l}}_{v}

defines the estimate of the logic-likelihood function (17) for the vth model. The Equation defines the Akaike information criterion

{AIC}_{a, v} (t) = - 2 {\hat{l}}_{v} (t) + a J_{v}

for which the model has a loss parameter a. From many models, the one whose value of v minimizes AIC_a_,v. Stone [52] recommends the use of a = log n.

2.4. PPDE Algorithm. Estimation of the Projection Density of the Target

The projection pursuit density estimator (PPDE) proposed by Friedman is based on the target projection and consistent projection Gaussianization. The essence of J. H. Friedman and coauthors [54,55,81] in estimating the target projection density is to search for “interesting”, small-measurement data projections. The distribution structures, where the projections have distributions that are very different (in the sense of some projection index) from Gaussian. Huber [82] made a heuristic proposal to consider the Gaussian distribution as the least interesting. This proposal is based on the facts that:

The multidimensional Gaussian distribution is entirely defined by its linear structure (mean and covariance matrices). Therefore, it is desired to feel a data structure independent of the correlation and linear transformations such as the scale parameter.
All projections of a multidimensional Gaussian distribution are also Gaussian distributions. Thus, if the projection differs insignificantly from the Gaussian distribution, it indicates that distribution is also close to the Gaussian.
For multidimensional data with a structure in multiple projection directions, many projections will have a distribution close to normal. This statement follows from the central limit theorem.
In the case of constant variance, the Gaussian distribution is considered to be the least informative.

Friedman developed Huber’s idea and proposed an algorithm called exploratory target projection to estimate multidimensional non-parametric density. This procedure consists of five steps:

(1): Data standardization simplifies layout, scalability, and correlation structures;
(2): Projection index: the degrees of ‘interest’ in various directions are determined.
(3): Optimization strategy: search for the direction in which the projection index is the largest.
(4): Data transformation: the one-dimensional density is calculated in the chosen direction, and the data are multiplied.
(5): Density formation: multidimensional density is formed from the calculated one-dimensional densities. Multidimensional density is a function of one-dimensional densities.

The following projection index construction has been proposed. It is known that all projections of a multidimensional Gaussian distribution are one-dimensional Gaussian distributions. If the distribution in one direction is not Gaussian, then the multidimensional distribution is also not Gaussian. Therefore, the projection index

I (τ)

shows how far the one-dimensional density

f_{τ} (y)

is in the direction

τ (Y = τ^{'} Z)

from the Gaussian distribution when

Z

is a standardized quantity [83]:

\tilde{I} (τ) = \int_{- \infty}^{\infty} {(f_{τ} (y) - ϕ (y))}^{2} d y, where ϕ (y) = \frac{1}{\sqrt{2 π}} e^{\frac{- y^{2}}{2}} .

(18)

The projection direction

τ

, which maximizes the projection of a distribution

\tilde{I} (τ)

, is chosen to highlight the multimodal or other nonlinear structure of that distribution. We transform the data

y

by equality

R = 2 Φ (Y) - 1 = 2 Φ (τ^{'} Z) - 1, R \in [- 1, 1]

, where

Φ (u)

is a function of the standard normal distribution

Φ (u) = \frac{1}{\sqrt{2 π}} \int_{- \infty}^{u} e^{\frac{- t^{2}}{2}} d t

. The distribution density of the transformed quantity

R

, function

f_{R} (r)

can be rewritten as

f_{R} (r) = \frac{f_{τ} (y)}{|\frac{\partial r}{\partial y}|} = \frac{f_{τ} (y)}{2 ϕ (y)} .

(19)

Equation (18) can be rewritten by changing the variable

y

to

r : \tilde{I} (τ) = \int_{- 1}^{1} 2 ϕ (y) {(f_{R} (r) - 1 / 2)}^{2} d r = \int_{- 1}^{1} 2 ϕ (Φ^{- 1} (\frac{R + 1}{2})) {(f_{R} (r) - 1 / 2)}^{2} d r .

Friedman [55] proposed a slightly different form of the projection index

I (τ)

, taking the integrated square error as a measure of

R

inequality:

I (τ) = \int_{- 1}^{1} {(f_{R} (r) - 1 / 2)}^{2} d r = \int_{- 1}^{1} f_{R}^{2} (r) d r - 1 / 2 .

(20)

Note that if the distribution of

Y

is Gaussian, then

f_{R} (r) = \frac{1}{2}

, ∀r, and the projection index

I (τ)

is zero. The more the

Y

distribution differs from the normal, the higher the value of the index

I (τ)

. Since

R \in [- 1, 1]

,

f_{R} (r)

can be decomposed by orthogonal Lagrangian polynomials,

{\{ψ_{j}\}}_{j = 0}^{\infty}

, i.e.,

f_{R} (r) = \sum_{j = 0}^{\infty} b_{j} ψ_{j} (r)

:

I (τ) = \int_{- 1}^{1} f_{R}^{2} (r) d r - 1 / 2 = \int_{- 1}^{1} [\sum_{j = 0}^{\infty} b_{j} ψ_{j} (r)] f_{R} (r) d r - 1 / 2 .

(21)

An iterative expression defines orthogonal Lagrangian polynomials.

ψ_{0} (r) = 1

and

ψ_{1} (r) = r . ψ_{j} (r) = \frac{(2 j - 1) r ψ_{j - 1} (r) - (j - 1) ψ_{j - 2} (r)}{j}

, then

j \geq 2

. It follows from the orthogonality property that the coefficients

b_{j}

can be calculated as follows

b_{j} = \frac{2 j + 1}{2} \int_{- 1}^{1} ψ_{j} (r) f_{R} (r) d r = \frac{2 j + 1}{2} E_{R} [ψ_{j} (r)] = \frac{2 j + 1}{2} \frac{1}{n} \sum_{t = 1}^{n} ψ_{j} (2 Φ (Y (t)) - 1),

where

\int_{- 1}^{1} ψ_{j} (r) f_{R} (r) d r = E_{R} [ψ_{j} (r)]

is the mean of the sample approximates the expression. Thus, equality can be written as

I (τ) = \int_{- 1}^{1} f_{R}^{2} (r) d r - 1 / 2 = \sum_{j = 1}^{s} \frac{2 j + 1}{2} E_{R}^{2} [ψ_{j} (r)] .

(22)

It should be noted that the infinite amount has been changed to finite. Such a change has advantages: the sum is calculated faster, giving robustness to the projection index. By summing only a finite number of members, the slowly fading “tails” of the projection distributions have a more negligible effect on the value of the projection index. Therefore, it is suggested to choose 4 ≤ s ≤ 7.

There are many methods for finding “interesting” projections. The method used in this research for finding the ‘most interesting’ projection direction is a mixed optimization strategy [55,64,84]. After defining the analytical expression of the projection index, its gradient in the projection direction τ is obtained as follows

\frac{\partial I}{\partial τ} = \frac{2}{\sqrt{2 π}} \sum_{j = 1}^{s} (2 j + 1) E [ψ_{j} (r)]E[ψ_{j}^{'} (r) e^{- y^{2} / 2} (z - τ y)] .

(23)

Here, the Lagrangian polynomial derivative is calculated by an iterative formula:

ψ_{1}^{'} (r) = 1

, then

ψ_{j}^{'} (r) = r ψ_{j - 1}^{'} (r) + j ψ_{j - 1} (r)

, then

j \geq 1

. Initially, an approximate step optimizer is found by searching in the directions of the main components and their combinations so that the initial convergence to the maximum can be achieved quickly. Then, the approximate step optimizer (steepest ascent) quickly selects the projections required to ascend to the (local) maximum of the projection index. The projection index is used to search for ‘interesting’ data projections. However, it is usually not enough to find a single projection to reasonably estimate the multidimensional density. In general, “interesting” directions do not have to be orthogonal and may require more projection directions than the data dimension. Therefore, when estimating density by targeted projection, the so-called deletion of the data structure is applied. A nonlinear scale transformation is performed, found in the projection direction, so the distribution of the transformed data becomes normal. This operation ensures that the same direction as before was not found when searching for another projection direction.

The deletion of the data structure is based on the fact that if the projection of one-dimensional data projection

τ^{'} Z

has a distribution density

f_{τ} (y)

and a corresponding distribution function

F_{τ}

, then the random variable is equal to

\tilde{Y} = Φ^{- 1} (F_{τ} (Y)),

(24)

where

Φ^{- 1}

is the inverse of the standard normal distribution. Friedman [55] proposed to calculate the empirical estimate of the distribution function as follows

{\hat{F}}_{τ} (y) = r a n k (Y) / n - \frac{1}{2 n}

, where

r a n k (y)

is the rank of

Y

in the whole sample of size n. Unfortunately, this estimate is not accurate and often results in a very uneven density function. By denoting

Z^{(0)} = Z

, we will discuss how

Z^{(k - 1)}

is obtained from

Z^{(k)}

. Based on Equation (14),

Z^{(k)}

can be defined as

Z^{(k)} = Z^{(k - 1)} + [Φ^{- 1} (F_{τ} (τ^{'} Z^{(k - 1)})) - τ^{'} Z^{(k - 1)}] τ .

(25)

The same procedure is performed to find the ‘most interesting’ projection with

Z^{(k)}

searching for a new direction. This sequence is repeated until the multidimensional distribution becomes close to the Gaussian distribution in all directions. It has been observed [55] that gaussianization in one direction disrupts normalcy in the directions previously studied so that their projection index

I (τ)

is no longer zero. However, studies show [54] that the changes in results are minimal. Multidimensional density is calculated from one-dimensional density estimates.

The relationship between the multidimensional densities

Z^{(k)}

and

Z^{(k - 1)}

(where

Z^{(k)}

is the structure of the distant data

Z^{(k - 1)}

along the k-th projection

τ (k)

) is

f_{τ (k)} (z^{(k)}) = \frac{f_{τ (k - 1)} (z^{(k - 1)})}{|J_{k} (z^{(k - 1)})|}

and

f_{τ (k - 1)} (z^{(k - 1)}) = f_{τ (k)} (z^{(k)}) |J_{k} (z^{(k - 1)})|

, here is the Jacobian

J_{k} (z^{(k - 1)}) = \frac{\partial z^{(k)}}{\partial z^{(k - 1)}} = \frac{\partial (U z^{(k)})}{\partial (U z^{(k - 1)})} = \frac{\partial y^{(k)}}{\partial y^{(k - 1)}} = \frac{f_{τ (k)} (y^{(k - 1)})}{ϕ (y^{(k)})} = \frac{f_{τ (k)} (τ^{'} (k) z^{(k - 1)})}{ϕ (τ^{'} (k) z^{(k)})} \geq 0

.

Starting from the initial multidimensional data

Z^{(0)}

gaussianization procedure is performed for each “interesting” projection found by

I (τ)

. After a certain number, the projections’ multidimensional data

Z^{(M)}

differ slightly from the normal distribution. Density

Z^{(0)}

can be calculated as follows

\begin{matrix} f (z^{(0)}) = f_{τ (1)} (z^{(1)}) J_{1} (z^{(0)}) = f_{τ (2)} (z^{(2)}) J_{2} (z^{(1)}) J_{1} (z^{(0)}) = f_{τ (M)} (z^{(M)}) \prod_{k = 1}^{M} J_{k} (z^{(k - 1)}) \\ \approx ϕ (z^{(M)}) \prod_{k = 1}^{M} J_{k} (z^{(k - 1)}) = ϕ (z^{(M)}) \prod_{k = 1}^{M} \frac{f_{τ (k)} (τ^{'} (k) z^{(k - 1)})}{ϕ (τ^{'} (k) z^{(k)})} . \end{matrix}

(26)

The one-dimensional density of the projected data

f_{τ (k)} (τ^{'} (k) z^{(k - 1)})

is calculated according to Equation (18) or more precisely

f_{τ (k)} (τ^{'} (k) z^{(k - 1)}) = ϕ (τ^{'} (k) z^{(k - 1)}) \sum_{j = 0}^{s} \frac{2 j + 1}{n} \sum_{t = 1}^{n} ψ_{j} (r_{t}^{(k - 1)}) ψ_{j} (r^{(k - 1)}) .

(27)

Then, replacing the unknown one-dimensional distribution densities on the right-hand side (26) with their statistical estimates, we obtain

\hat{f} (z) = ϕ (z^{(M)}) \prod_{k = 1}^{M} \frac{{\hat{f}}_{τ (k)} (τ^{'} (k) z^{(k - 1)})}{ϕ (τ^{'} (k) z^{(k)})} .

(28)

The target projection density estimate is calculated relatively quickly because of the shape of the multivariate projection index and the iterative relationship between polynomials.

2.5. Inversion Formula

When examining approximations of parametric methods, it should be emphasized that as the data dimension increases, the number of model parameters increases rapidly, making it more difficult to find accurate parameter estimates. One-dimensional data projections

X_{τ} = τ^{'} X

density

f_{τ}

is much easier to find than multidimensional data density

f

because there exists a mutually unambiguous correspondence,

f \leftrightarrow \{f_{τ}, τ \in R^{d}\}

. It is quite natural to try to find the multidimensional density

f

using the density estimates

{\hat{f}}_{τ}

of one-dimensional observational projections. It should be noted that in the case of the mixture, when the distributions are Gaussian, the projections of observations are also distributed according to the (one-dimensional) Gaussian mixture model

f_{τ} (x) = \sum_{k = 1}^{q} p_{k, τ} φ_{k, τ} (x) = f_{τ} (x, θ_{τ}) .

(29)

Here

φ_{k, τ} (x) = φ (x; m_{k, τ}, σ_{k, τ}^{2})

– one-dimensional Gaussian density. The parameter

θ

of the multidimensional mixture. The distribution parameters of the data projections

θ_{τ} = (p_{k, τ}, m_{k, τ}, σ_{k, τ}^{2})

,

k = 1, \dots, q

are related by equations:

p_{j, τ} = p_{j}, m_{j, τ} = τ^{'} M_{j} and σ_{j, τ}^{2} = τ^{'} R_{j} τ

. Using the inversion formula

f (x) = \frac{1}{{(2 π)}^{d}} \int_{R^{d}}^{} e^{- i t^{'} x} ψ (t) d t,

(30)

where

ψ (t) = E e^{i t^{'} x}

denotes the characteristic function of the random variable

X

. Marking

u = |t|

,

τ = t / |t|

and changing the variables to a spherical coordinate system, density is written

f (x) = \frac{1}{{(2 π)}^{d}} \int_{τ : |τ| = 1}^{} d s \int_{0}^{\infty} e^{- i u τ^{'} x} ψ (u τ) u^{d - 1} d u .

(31)

Here, the first integral is understood as the surface integral of the unit sphere. After noting the characteristic function of the projection of the observed random variable as

ψ_{τ} (u) = E e^{i u τ^{'} X}

. Then equality

ψ (u τ) = ψ_{τ} (u)

holds. By selecting the set

T

of projection directions evenly spaced on the sphere and replacing the characteristic function with its estimate (

\hat{f} (x))

a formula

\hat{f} (x) = \frac{A (d)}{# T} \sum_{τ \in T} \int_{0}^{\infty} e^{- i u τ^{'} x} {\hat{ψ}}_{τ} (u) u^{d - 1} e^{- h u^{2}} d u,

(32)

is obtained to calculate the estimate [85,86]. Here and # continue to denote the number of elements in the set

T

. Using the d-meter ball volume

V_{d} (R) = \frac{π^{\frac{d}{2}} R^{d}}{Γ (\frac{d}{2} + 1)} = \{\begin{matrix} \frac{π^{\frac{d}{2}} R^{d}}{(\frac{d}{2})!}, & then d \mod 2 \equiv 0 \\ \frac{2^{\frac{d + 1}{2}} π^{\frac{d - 1}{2}} R^{d}}{d!!}, & then d \mod 2 \equiv 1 \end{matrix},

(33)

the constant

A (d)

depending on the data dimension can be calculated using

A (d) = \frac{{(V_{d} (1))}_{R}^{'}}{{(2 π)}^{d}} = \frac{d 2^{- d} π^{- \frac{d}{2}}}{Γ (\frac{d}{2} + 1)} .

(34)

Computer simulation studies have shown that the density estimates obtained using the inversion formula are not smooth. Therefore, in formula (32), an additional multiplier

e^{- h u^{2}}

is used below the integral sign. This multiplier further smoothes the estimate

\hat{f} (x)

(32) with the Gaussian kernel function. This form of the multiplier allows the value of the integral to be calculated analytically. The number of clusters and Gaussian mixture parameters was selected using the constructive procedure and software developed at the Lithuanian Institute of Mathematics and Informatics, applying the w² type criterion [87]. Formula (32) can be used for various estimates of the characteristic function of the projected data. We will discuss the two methods used in this work.

One of them is based on the density approximation of the Gaussian distribution mixture model. In the present case, after replacing the parameters of the Gaussian mixture with their statistical estimates (

{\hat{p}}_{k, τ} = p_{k}, {\hat{m}}_{k, τ} = τ^{'} M_{k}, σ_{k, τ}^{2} = τ^{'} R_{k} τ)

(Page 10), the following parametric estimate

{\hat{ψ}}_{τ} (u) = \sum_{k = 1}^{{\hat{q}}_{τ}} {\hat{p}}_{k, τ} e^{i u {\hat{m}}_{k, τ} - u^{2} {\hat{σ}}_{k, τ}^{2} / 2}

(35)

of the characteristic function is used, and adding (32) to (35) gives

\begin{matrix} \hat{f} (x) = \frac{A (d)}{# T} \sum_{τ \in T} \sum_{k = 1}^{{\hat{q}}_{τ}} {\hat{p}}_{k, τ} \int_{0}^{\infty} e^{i u ({\hat{m}}_{k, τ} - τ^{'} x) - u^{2} (h + {\hat{σ}}_{k, τ}^{2} / 2)} u^{d - 1} d u \\ = \frac{A (d)}{# T} \sum_{τ \in T} \sum_{k = 1}^{{\hat{q}}_{τ}} {\hat{p}}_{k, τ} I_{d - 1} (\frac{{\hat{m}}_{k, τ} - τ^{'} x}{\sqrt{{\hat{σ}}_{k, τ}^{2} + 2 h}}) {(\sqrt{{\hat{σ}}_{k, τ}^{2} + 2 h})}^{- d} \end{matrix}

(36)

and where

I_{j} (y)

can be written as

I_{j} (y) = Re [\int_{0}^{\infty} e^{i y z - z^{2} / 2} z^{j} d z] .

(37)

It should be noted that only the real part of the expression can be considered here. The sum of the imaginary parts must be equal to zero. Because the density estimate

\hat{f} (x)

can acquire only real values. The chosen form of the smoothing multiplier

e^{- h u^{2}}

allows relating the smoothing parameter h to the variances of the projection clusters—in the calculations, the variances are increased by

2 h

. How to calculate expression (37) is given in Appendix B.

2.6. Modified Density Estimate of the Inversion Formula

One of the disadvantages of the inversion formula method defined in (32) is that the Gaussian distribution mixture model described by this estimate (where

f_{k} = φ_{k}

) evaluates well only the density of observations close to it. However, when approximating the density under study with a mixture of Gaussian distributions, the estimation of the density of the inversion formula often becomes complicated due to a large number of components with low a priori probabilities. Their number can be reduced by introducing a noise cluster—the modified algorithm based on a multidimensional Gaussian distribution mixture model. Let us use the inversion formula (30). The parametric estimate of the characteristic function of uniform distribution density can be calculated as follows

\hat{ψ} (u) = \frac{2}{(b - a) u} \sin \frac{(b - a) u}{2} \cdot e^{\frac{i u (a + b)}{2}} .

(38)

In uniform distribution density function (38), b is the maximum value, and a is the minimum value. In the density estimate calculation formula (32), construct the estimation of the characteristic function as a union of the characteristic functions of a mixture of Gaussian distributions and a uniform distribution with corresponding a priori probabilities as follows

{\hat{ψ}}_{τ} (u) = \sum_{k = 1}^{{\hat{q}}_{τ}} {\hat{p}}_{k, τ} e^{i u {\hat{m}}_{k, τ} - u^{2} {\hat{σ}}_{k, τ}^{2} / 2} + {\hat{p}}_{0, τ} \frac{2}{(b (τ) - a (τ)) u} \sin \frac{(b (τ) - a (τ)) u}{2} \cdot e^{\frac{i u (a (τ) + b (τ))}{2}} .

(39)

Here the second term describes a uniform distributed noise cluster and

{\hat{p}}_{0}

is the weight of the noise cluster. Based on the parameters of the uniform distribution and the projected data, we can write

a (τ) = {(τ^{'} x)}_{\min} - \frac{{(τ^{'} x)}_{\max} - {(τ^{'} x)}_{\min}}{2 (n - 1)} and

(40)

b (τ) = {(τ^{'} x)}_{\max} + \frac{{(τ^{'} x)}_{\max} - {(τ^{'} x)}_{\min}}{2 (n - 1)} .

(41)

Using notations such as (36), we can write

\begin{matrix} \hat{f} (x) = \frac{A (d)}{# T} \sum_{τ \in T} [\sum_{k = 1}^{{\hat{q}}_{τ}} {\hat{p}}_{k, τ} I_{d - 1} (\frac{{\hat{m}}_{k, τ} - τ^{'} x}{\sqrt{{\hat{σ}}_{k, τ}^{2} + 2 h}}) {({\hat{σ}}_{k, τ}^{2} + 2 h)}^{- \frac{d}{2}} \\ + \frac{2 {\hat{p}}_{0, τ}}{b (τ) - a (τ)} J_{d - 2} (\frac{a (τ) + b (τ) - 2 τ^{'} x}{2 \sqrt{2 h}}, \frac{b (τ) - a (τ)}{2 \sqrt{2 h}}) \cdot {(2 h)}^{- \frac{d - 1}{2}}] . \end{matrix}

(42)

where the expression

I_{j} (y)

is the same as (37) and its value is

I_{j} (y) = C_{j} (y)

and

J_{j} (y, z) = Re [\int_{0}^{\infty} e^{i y u - u^{2} / 2} \cdot \sin z u \cdot u^{j} d u] .

(43)

By integrating, we get

\begin{matrix} \int_{0}^{\infty} e^{i y u - u^{2} / 2} \cdot \sin z u \cdot u^{j} d u = \int_{0}^{\infty} (\cos y u + i \sin y u) \cdot \sin z u \cdot e^{- u^{2} / 2} \cdot u^{j} d u \\ = \int_{0}^{\infty} (\frac{\sin (y + z) u + \sin (z - y) u}{2} + i \frac{\cos (y - z) u - \cos (y + z) u}{2}) \cdot e^{- u^{2} / 2} \cdot u^{j} d u \\ = \frac{1}{2} S_{j} (y + z) + \frac{1}{2} S_{j} (z - y) + i \frac{1}{2} C_{j} (y - z) - i \frac{1}{2} C_{j} (y + z) . \end{matrix}

(44)

the above formula uses the variables

S_{j} (y)

and

C_{j} (y)

, the calculation of which is given in formulas (52) and (53) in Appendix B.

3. Materials and Methods

Density estimation algorithms were presented in the previous section. The Monte Carlo method was used in this study. Such a comparison of algorithms allows us to measure the real observation density values and evaluate algorithms’ efficiency. For the research, we used multidimensional (d = 2, 5, 10, 15) distributions of the Cauchy mixture

\sum_{j = 1}^{q} p_{j} C (x, m_{j}, u_{j}) .

(45)

Additionally,

C (x, m_{j}, u_{j})

is defined as follows

C (x, m_{j}, u_{j}) = \prod_{k = 1}^{d} \frac{u_{j k}}{π [u_{j k}^{2} + (x_{k} - m_{j k})^{2}]} .

(46)

Calculations were performed using sample sizes of n = 50, 100, 200, 400, 800 while changing the number of distributions, their weights, and centers (see. Table 1). In each case, 100,000 samples were generated.

Table 1. Parameters table.

In cases of d = 10, 15, the same weights were used as in d = 5. Additionally, centres are located on the apexes of the hypercube.

Algorithms used in the research: AKDE—adaptive kernel, PPDE—projection pursuit, LSDE—logspline, SKDE—semi-parametric kernel, IFDE—inversion formula, MIDE—inversion formula with noise cluster. In IFDE and MIDE methods are used mixture parameters, calculated with a program made in an institute of Mathematics and Informatics (Vilnius) [87].

Selection of parameters in the density estimation procedure. In this study, the Monte Carlo method aimed to perform the accuracy of the non-parametric estimates of distribution density previously described in the methodological sections (AKDE, PPDE, LSDE, SKDE, IFDE, MIDE) comparative analysis. The authors [34] propose to collect the value of the sensitivity parameter (γ, see. AKDE method step 3) used in the AKDE method from the set {0.2; 0.4; 0.6; 0.8}. The specific value of the parameter is determined by a probabilistic cross-check [88,89]. In the SKDE, all possible values of the sub-vector Y dimension s (1≤ s ≤ d − 1, where d is dimensions, see page 5) and their corresponding coordinates were reselected. The most factual errors were used to compare the results with other studied methods. The LSDE method minimizes the Akaike information criterion by selecting the number of baseline spline points [78]. The computer program for calculating this estimate is provided in the R package and was used in the study. Akaike information criterion

A I C = - 2 l (t) + a J (t)

,

J

—degree of spline,

a = \log (n)

,

l

—probability function used to select the spline coefficients. The MIDE method has a smoothing parameter, h. The chosen form of the smoothing multiplier

e^{- h u^{2}}

allows relating the smoothing parameter h to the variances of the projection clusters. Modelling studies have shown that this method is sensitive to parameter selection. If h is set too low, the estimate becomes very slick and has large errors. Excessive smoothing of the density estimate does not greatly affect its quality. In the studies, it was observed that the estimation becomes uneven due to the similarity of the values of the observations projected in some directions, thus distinguishing low-weight components with small dispersions. The smoothing parameter (h) as well as the specific value of the noise cluster weight (probability) from the set {0.05; 0.1; 0.15; 0.2; 0.3; 0.4} are selected by cross-checking the least squares [65]. The vector of the estimate parameters is searched for in such a way that it minimizes the integrated square error

Θ = \arg \min_{Θ} \int_{- \infty}^{\infty} {(\hat{f_{Θ}} (x) - f (x))}^{2} d x = a r g \min_{Θ} \{\hat{‖ f_{Θ}} {(x) ‖}_{2}^{2} - \frac{2}{n} \sum_{t = 1}^{n} \hat{f_{Θ}} (X (t))\},

(47)

where Θ is the evaluated parameter and F(x) is the observed random variable distribution function. Changing an unknown distribution function to an empirical distribution function yields an expression for the parameter estimate

Θ^= \arg \min_{Θ} (\hat{‖ f_{Θ}} {(x) ‖}_{2}^{2} - \frac{2}{n} \sum_{t = 1}^{n} \hat{f_{Θ}} (X (t) | t),

(48)

where

\hat{f_{Θ}} (x | t)

is the value of the estimate at point x, which is calculated by subtracting the value of X(t) from the observations. In addition, empirical research suggests that it is better to look for a maximum local minimum point rather than a global minimum [90]. Using PPDE method and following the recommendation of the paper [38], the order of the spread was

4 \leq s \leq 6

(see Page 9), and the projection directions were chosen to maximize the value of the estimate of the design index (2) recommended by J. H. Friedman

I (α) = \int_{- 1}^{1} f_{r}^{2} (r) d r - \frac{1}{2} = \sum_{j = 1}^{J} \frac{2 j + 1}{2} E_{r}^{2} [ψ_{j} (r)] .

(49)

4. Results and Discussion

This section presents the main results obtained during the simulations. We calculate the mean absolute error and (50) mean absolute percentage error (51) to evaluate the accuracy.

δ_{1} = \frac{1}{n} \sum_{t = 1}^{n} |f (x (t)) - \hat{f} (x (t))| ≅ \int |f (x) - \hat{f} (x)| f (x) d x .

(50)

δ_{2} = \frac{2}{n} \sum_{t = 1}^{n} |\frac{f (x (t)) - \hat{f} (x (t))}{f (x (t)) + \hat{f} (x (t))}| ≅ \int |f (x) - \hat{f} (x)| d x .

(51)

The result tables (Table 2 and Table A1, Table A2, Table A3, Table A4, Table A5, Table A6, Table A7, Table A8, Table A9 and Table A10) provide 100,000 samples densities mean absolute percentage error. The values in parentheses provide information about the standard deviation of errors. The best results in these tables are bolded and underlined. According to Table 2, it is concluded that when n = 100, d = 5, the best results are obtained by SKDE and MIDE methods. Based on Table A2, it can be observed that when q = 2, n = 200, the best results are obtained using SKDE and MIDE methods. According to Table A3, it is concluded that when q = 3, n = 200, in the case of highly overlapping distributions (i = 1, 2), the best results are obtained by the SKDE method, and in the case of more isolated distributions (i ≥ 3)—by the MIDE method. Based on Table A4, it can be observed that when q = 3, n ≥ 400, the best results are obtained by SKDE, while the second-best method is MIDE. According to Table A5, it is concluded that when q = 4, n = 400, in the case of highly overlapping distributions (i ≤ 3), the best results are obtained by the SKDE method and in the case of more isolated distributions (i ≥ 4)—by the MIDE method. Table A6 shows results of q = 4, n ≥ 400, it can be noticed that, in the case of highly overlapping or average isolated distributions (i ≤ 5), the best results are obtained by the SKDE method and in the case of more isolated distributions (i = 6)—by the MIDE method. Table A7 and Table A8 show results of q = 2 and n = 50, It can be noticed that in all cases highly overlapping or isolated distributions, the best results are obtained by AKDE method and in the case of more isolated distributions (i = 6) with p₁ = 0.6; p₂ = 0.4— by the MIDE method. Table A9 and Table A10 show results q = 3 and n = 50; the best results are obtained by the AKDE method in all cases (highly overlapping or isolated distributions).The LSDE method with huge outliers (|x − m_j| > 100 uj) is grouped with a more significant number of values closer to the centre of the distribution. With the help of the calculated spline coefficients, the density in the outliers is estimated at a value close to 10¹⁰⁰. That is incorrect, and in such cases, the use of this method is not recommended.

Table 2. An example of mean absolute percentage error.

The results for the smaller dimensions (d = 2) are presented in Table A1. It can be seen that the best results are obtained using the SKDE method, both in large- and small-scale overlapping cases (i < 4). On the other hand, in the case of isolated distributions (i ≥ 5), good results were obtained by the MIDE method.

In the case of mean absolute percentage error, recommended using the semiparametric kernel when the sample has overlapping distributions. In the case of two dimensions (d ~ 2) and a sample is with overlapping distributions, it is recommended to use the semiparametric kernel method and for isolated distributions, to use the adaptive kernel method.

5. Conclusions

This paper reviewed the most popular and most often used nonparametric density estimation algorithms. The density estimation inversion formula was also presented in this article. It was observed that when a noise cluster is included, the results of the inversion formula improved statistically significantly. Based on the mean absolute error, in the case of higher dimension (d ~ 5) and small samples (n ~ 50), it is recommended to use the adaptive kernel method. If the sample is n ~ 100, then the modified inversion formula method showed the best results. For larger samples with overlapping distributions it is recommended to use a semi-parametric kernel and for more isolated distribution—modified inversion methods. Based on the mean absolute percentage error, it is recommended to use the semiparametric kernel when the sample is with overlapping distributions. In the case of two dimensions (d ~ 2) and a sample is with overlapping distributions, it is recommended to use the semiparametric kernel method. For isolated distributions, it is recommended to use the adaptive kernel method.

Author Contributions

Conceptualization, T.R. and M.L.; methodology, T.R.; software, T.R. and M.L.; formal analysis, T.R. and M.L.; investigation, T.R. and M.L.; writing—original draft preparation, T.R., M.L. and G.Č.; writing—review and editing, M.L. and G.Č.; supervision, T.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors are thankful to the area editor and the reviewers for giving valuable comments and suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. 10² times upscaled mean absolute error with d = 2; p₁ = 0.5; p₂ = 0.5; n = 100.

Evaluation Methods		Density
		d = 2; p₁ = 0.5; p₂ = 0.5; n = 100
		i = 1	i = 2	i = 3	i = 4	i = 5	i = 6
AKDE	Mean	1.74	1.41	1.16	1.08	1.02	0.99
AKDE	SD	(0.94)	(0.80)	(0.70)	(0.59)	(0.54)	(0.50)
PPDE	Mean	2.21	1.85	1.52	1.32	1.25	1.21
PPDE	SD	(0.49)	(0.41)	(0.40)	(0.44)	(0.37)	(0.31)
LSDE	Mean	0.87	0.71	0.78	0.63	0.69	0.69
LSDE	SD	(0.43)	(0.20)	(0.08)	(0.08)	(0.04)	(0.09)
SKDE	Mean	0.63	0.61	0.52	0.52	0.51	0.51
SKDE	SD	(0.12)	(0.17)	(0.07)	(0.05)	(0.06)	(0.04)
IFDE	Mean	1.69	1.31	0.97	0.75	0.61	0.53
IFDE	SD	(0.06)	(0.10)	(0.08)	(0.01)	(0.04)	(0.06)
MIDE	Mean	0.69	0.66	0.57	0.55	0.51	0.51
MIDE	SD	(0.06)	(0.10)	(0.08)	(0.01)	(0.04)	(0.06)

Table A2. 10⁴ times upscaled mean absolute error with d = 5; p₁ = 0.7; p₂ = 0.3; n = 200.

Evaluation Methods		Density
		d=5; p₁ = 0.7; p₂ = 0.3; n = 200
		i = 1	i = 2	i = 3	i = 4	i = 5	i = 6
AKDE	Mean	0.979	0.801	0.703	0.664	0.655	0.654
AKDE	SD	(0.171)	(0.146)	(0.145)	(0.155)	(0.158)	(0.159)
PPDE	Mean	1.001	0.822	0.722	0.681	0.671	0.669
PPDE	SD	(0.174)	(0.152)	(0.151)	(0.160)	(0.163)	(0.163)
LSDE	Mean	5.039	4.185	2.632	0.944	0.665	0.660
LSDE	SD	(1.265)	(6.747)	(1.081)	(0.138)	(0.140)	(0.112)
SKDE	Mean	0.857	0.759	0.705	0.658	0.649	0.638
SKDE	SD	(0.087)	(0.069)	(0.076)	(0.085)	(0.083)	(0.083)
IFDE	Mean	0.912	0.801	0.721	0.681	0.667	0.666
IFDE	SD	(0.133)	(0.149)	(0.151)	(0.160)	(0.162)	(0.163)
MIDE	Mean	0.956	0.788	0.694	0.661	0.657	0.640
MIDE	SD	(0.162)	(0.154)	(0.144)	(0.152)	(0.158)	(0.163)

Table A3. 10⁴ times upscaled mean absolute error with d = 5; p₁ = 0.45; p₂ = 0.45; p₃ = 0.1; n= 200.

Evaluation Methods		Density
		d = 5; p₁ = 0.45; p₂ = 0.45; p₃ = 0.1; n = 200
		i = 1	i = 2	i = 3	i = 4	i = 5	i = 6
AKDE	Mean	0.970	0.725	0.576	0.533	0.504	0.500
AKDE	SD	(0.127)	(0.089)	(0.078)	(0.073)	(0.069)	(0.066)
PPDE	Mean	0.992	0.746	0.594	0.528	0.506	0.500
PPDE	SD	(0.137)	(0.092)	(0.077)	(0.072)	(0.069)	(0.066)
LSDE	Mean	1.057	0.775	0.652	0.590	0.508	0.503
LSDE	SD	(0.164)	(0.203)	(0.650)	(0.491)	(0.067)	(0.081)
SKDE	Mean	0.6245	0.6274	0.6312	0.629	0.630	0.628
SKDE	SD	(0.072)	(0.025)	(0.027)	(0.049)	(0.049)	(0.050)
IFDE	Mean	0.990	0.743	0.589	0.525	0.497	0.499
IFDE	SD	(0.136)	(0.091)	(0.076)	(0.071)	(0.071)	(0.067)
MIDE	Mean	0.993	0.746	0.574	0.525	0.496	0.490
MIDE	SD	(0.137)	(0.092)	(0.077)	(0.072)	(0.069)	(0.066)

Table A4. 10⁴ times upscaled mean absolute error with d = 5; p₁ = 0.4; p₂ = 0.4; p₃ = 0.2; n = 400.

Evaluation Methods		Density
		d = 5; p₁ = 0.4; p₂ = 0.4; p₃ = 0.2; n = 400
		i = 1	i = 2	i = 3	i = 4	i = 5	i = 6
AKDE	Mean	0.916	0.689	0.527	0.445	0.410	0.396
AKDE	SD	(0.099)	(0.068)	(0.048)	(0.044)	(0.048)	(0.052)
PPDE	Mean	0.937	0.709	0.545	0.461	0.423	0.407
PPDE	SD	(0.109)	(0.074)	(0.049)	(0.044)	(0.048)	(0.052)
LSDE	Mean	0.815	0.549	0.511	0.443	0.404	0.401
LSDE	SD	(0.007)	(0.063)	(0.151)	(0.094)	(0.040)	(0.030)
SKDE	Mean	0.655	0.499	0.413	0.388	0.385	0.384
SKDE	SD	(0.064)	(0.049)	(0.034)	(0.031)	(0.028)	(0.027)
IFDE	Mean	0.937	0.709	0.544	0.460	0.423	0.404
IFDE	SD	(0.109)	(0.074)	(0.049)	(0.044)	(0.048)	(0.052)
MIDE	Mean	0.757	0.509	0.415	0.391	0.391	0.388
MIDE	SD	(0.109)	(0.074)	(0.049)	(0.044)	(0.048)	(0.052)

Table A5. 10⁴ times upscaled mean absolute error with d = 5; p₁ = 0.25; p₂ = 0.25; p₃ = 0.25; p₄ = 0.25; n = 400.

Evaluation Methods		Density
		d = 5; p₁ = 0.25; p₂ = 0.25; p₃ = 0.25; p₄ = 0.25; n = 400
		i = 1	i = 2	i = 3	i = 4	i = 5	i = 6
AKDE	Mean	0.912	0.645	0.447	0.351	0.309	0.290
AKDE	SD	(0.128)	(0.068)	(0.029)	(0.019)	(0.026)	(0.030)
PPDE	Mean	0.934	0.665	0.464	0.365	0.321	0.299
PPDE	SD	(0.145)	(0.089)	(0.048)	(0.031)	(0.033)	(0.035)
LSDE	Mean	0.934	0.676	0.464	0.365	0.321	0.293
LSDE	SD	(0.145)	(0.064)	(0.048)	(0.031)	(0.033)	(0.039)
SKDE	Mean	0.658	0.472	0.372	0.345	0.316	0.290
SKDE	SD	(0.071)	(0.031)	(0.020)	(0.017)	(0.019)	(0.018)
IFDE	Mean	0.933	0.665	0.464	0.365	0.321	0.299
IFDE	SD	(0.145)	(0.089)	(0.048)	(0.031)	(0.033)	(0.035)
MIDE	Mean	0.889	0.622	0.433	0.341	0.307	0.281
MIDE	SD	(0.118)	(0.074)	(0.037)	(0.026)	(0.019)	(0.027)

Table A6. 10⁴ times upscaled mean absolute error with d = 5; p₁ = 0.1; p₂ = 0.1; p₃ = 0.1; p₄ = 0.7; n = 400.

EvaluationMethods		Density
		d = 5; p₁ = 0.1; p₂ = 0.1; p₃ = 0.1; p₄ = 0.7; n = 400
		i = 1	i = 2	i = 3	i = 4	i = 5	i = 6
AKDE	Mean	0.957	0.800	0.678	0.617	0.587	0.571
AKDE	SD	(0.131)	(0.127)	(0.103)	(0.090)	(0.087)	(0.087)
PPDE	Mean	0.979	0.821	0.697	0.634	0.603	0.586
PPDE	SD	(0.141)	(0.137)	(0.112)	(0.099)	(0.094)	(0.093)
LSDE	Mean	0.979	0.821	0.697	0.634	0.596	0.586
LSDE	SD	(0.141)	(0.137)	(0.112)	(0.099)	(0.098)	(0.093)
SKDE	Mean	0.687	0.580	0.514	0.496	0.491	0.489
SKDE	SD	(0.076)	(0.070)	(0.058)	(0.058)	(0.058)	(0.056)
IFDE	Mean	0.979	0.820	0.697	0.634	0.602	0.585
IFDE	SD	(0.141)	(0.137)	(0.112)	(0.098)	(0.094)	(0.093)
MIDE	Mean	0.924	0.770	0.652	0.597	0.533	0.488
MIDE	SD	(0.135)	(0.131)	(0.108)	(0.093)	(0.092)	(0.091)

Table A7. 10⁴ times upscaled mean absolute error with d = 5; p₁ = 0.5; p₂ = 0.5; n = 50.

Evaluation Methods		Density
		d = 5; p₁ = 0.5; p₂ = 0.5; n = 50
		i = 1	i = 2	i = 3	i = 4	i = 5	i = 6
AKDE	Mean	1.093	0.828	0.758	0.741	0.739	0.740
AKDE	SD	(0.095)	(0.099)	(0.114)	(0.123)	(0.130)	(0.134)
PPDE	Mean	1.147	0.872	0.794	0.770	0.764	0.762
PPDE	SD	(0.157)	(0.150)	(0.157)	(0.156)	(0.159)	(0.160)
LSDE	Mean	2.100	1.997	2.002	2.010	2.013	2.014
LSDE	SD	(0.078)	(0.028)	(0.017)	(0.019)	(0.024)	(0.025)
SKDE	Mean	1.149	0.875	0.797	0.773	0.765	0.763
SKDE	SD	(0.160)	(0.154)	(0.160)	(0.160)	(0.161)	(0.162)
IFDE	Mean	1.145	0.864	0.780	0.765	0.763	0.757
IFDE	SD	(0.163)	(0.137)	(0.140)	(0.150)	(0.163)	(0.156)
MIDE	Mean	1.094	0.860	0.759	0.767	0.742	0.752
MIDE	SD	(0.142)	(0.167)	(0.160)	(0.156)	(0.163)	(0.160)

Table A8. 10⁴ times upscaled mean absolute error with d = 5; p₁ = 0.6; p₂ = 0.4; n = 50.

Evaluation Methods		Density
		d = 5; p₁ = 0.6; p₂ = 0.4; n = 50
		i = 1	i = 2	i = 3	i = 4	i = 5	i = 6
AKDE	Mean	1.138	0.872	0.770	0.748	0.745	0.746
AKDE	SD	(0.105)	(0.085)	(0.126)	(0.143)	(0.152)	(0.155)
PPDE	Mean	1.192	0.918	0.808	0.778	0.771	0.769
PPDE	SD	(0.150)	(0.136)	(0.172)	(0.178)	(0.181)	(0.182)
LSDE	Mean	2.114	1.995	1.977	1.983	1.986	1.987
LSDE	SD	(0.101)	(0.083)	(0.080)	(0.079)	(0.082)	(0.084)
SKDE	Mean	1.195	0.919	0.810	0.780	0.772	0.770
SKDE	SD	(0.154)	(0.138)	(0.174)	(0.182)	(0.183)	(0.183)
IFDE	Mean	1.183	0.906	0.802	0.778	0.765	0.769
IFDE	SD	(0.142)	(0.125)	(0.163)	(0.185)	(0.176)	(0.185)
MIDE	Mean	1.152	0.882	0.782	0.754	0.747	0.742
MIDE	SD	(0.136)	(0.124)	(0.155)	(0.175)	(0.176)	(0.180)

Table A9. 10⁴ times upscaled mean absolute error with d = 5; p₁ = 0.33; p₂ = 0.33; p₃ = 0.33; n = 50.

Evaluation Methods		Density
		d = 5; p₁ = 0.33; p₂ = 0.33; p₃ = 0.33; n = 50
		i = 1	i = 2	i = 3	i = 4	i = 5	i = 6
AKDE	Mean	1.166	0.828	0.634	0.547	0.512	0.500
AKDE	SD	(0.120)	(0.086)	(0.057)	(0.064)	(0.074)	(0.080)
PPDE	Mean	1.224	0.879	0.677	0.581	0.540	0.523
PPDE	SD	(0.184)	(0.128)	(0.107)	(0.108)	(0.108)	(0.109)
LSDE	Mean	2.075	1.934	1.921	1.939	1.938	1.937
LSDE	SD	(0.127)	(0.094)	(0.058)	(0.054)	(0.048)	(0.044)
SKDE	Mean	1.226	0.881	0.678	0.583	0.542	0.524
SKDE	SD	(0.186)	(0.130)	(0.109)	(0.110)	(0.110)	(0.110)
IFDE	Mean	1.215	0.839	0.649	0.554	0.522	0.513
IFDE	SD	(0.175)	(0.099)	(0.110)	(0.102)	(0.110)	(0.111)
MIDE	Mean	1.182	0.834	0.638	0.545	0.518	0.501
MIDE	SD	(0.167)	(0.124)	(0.097)	(0.101)	(0.106)	(0.106)

Table A10. 10⁴ times upscaled mean absolute error with d = 5; p₁ = 0.45; p₂ = 0.45; p₃ = 0.1; n = 50.

Evaluation Methods		Density
		d = 5; p₁ = 0.45; p₂ = 0.45; p₃ = 0.1; n = 50
		i = 1	i = 2	i = 3	i = 4	i = 5	i = 6
AKDE	Mean	1.126	0.838	0.690	0.633	0.618	0.615
AKDE	SD	(0.112)	(0.126)	(0.063)	(0.053)	(0.061)	(0.067)
PPDE	Mean	1.182	0.882	0.727	0.660	0.640	0.634
PPDE	SD	(0.156)	(0.132)	(0.091)	(0.085)	(0.087)	(0.088)
LSDE	Mean	2.101	2.002	1.985	2.010	2.014	2.015
LSDE	SD	(0.105)	(0.063)	(0.012)	(0.029)	(0.028)	(0.025)
SKDE	Mean	1.183	0.885	0.729	0.663	0.642	0.635
SKDE	SD	(0.157)	(0.134)	(0.094)	(0.089)	(0.090)	(0.090)
IFDE	Mean	1.170	0.859	0.702	0.649	0.624	0.619
IFDE	SD	(0.142)	(0.129)	(0.074)	(0.088)	(0.083)	(0.086)
MIDE	Mean	1.142	0.850	0.696	0.639	0.620	0.618
MIDE	SD	(0.141)	(0.125)	(0.080)	(0.084)	(0.083)	(0.087)

Appendix B

Calculate expression (36). Marked

C_{j} (y) = \int_{0}^{\infty} \cos y z \cdot e^{- z^{2} / 2} \cdot z^{j} d z and

(A1)

S_{j} (y) = \int_{0}^{\infty} \sin y z \cdot e^{- z^{2} / 2} \cdot z^{j} d z .

(A2)

The Equation holds

\int_{0}^{\infty} e^{- i y z - z^{2} / 2} z^{j} d z = C_{j} (y) + i S_{j} (y) .

(A3)

Integration in parts results in

\begin{matrix} C_{j} (y) = {e^{- \frac{z^{2}}{2}} z^{j - 1} \cos y z|}_{0}^{\infty} + \int_{0}^{\infty} e^{- \frac{z^{2}}{2}} ((j - 1) z^{j - 2} \cos y z - y z^{j - 1} \sin y z) d z = \\ = 1_{\{j = 1\}} + (j - 1) C_{j - 2} (y) - y S_{j - 1} (y), j \geq 1 . \end{matrix}

(A4)

Analogously expressing

S_{j} (y)

and taking into account the constraints of the j index, recursive equations are obtained

C_{j} (y) = (j - 1) C_{j - 2} (y) - y S_{j - 1} (y), j \geq 2 and

(A5)

C_{1} (y) = 1 - y S_{0} (y) also

(A6)

S_{j} (y) = (j - 1) S_{j - 2} (y) - y C_{j - 1} (y), j \geq 2 and

(A7)

S_{1} (y) = y C_{0} (y), t h e n j = 1 .

(A8)

To calculate the functions

C_{0} (y)

and

S_{0} (y)

it is used that

{(S_{0} (y))}_{y}^{'} = \int_{0}^{\infty} z \cos y z \cdot e^{- z^{2} / 2} d z = C_{1} (y) .

(A9)

From (A7) and (A10), it is obtained that

S_{0}

satisfies the differential equation

S_{0}^{'} (y) = 1 - y S_{0} (y) .

This Equation is solved by spreading

S_{0}

by Taylor series

S_{0}^{'} (y) = \sum_{l = 0}^{\infty} c_{l + 1} (l + 1) y^{l + 1} = 1 - \sum_{l = 2}^{\infty} c_{l - 1} y^{l} .

(A10)

Comparing the coefficients to similar members, their values are found

c_{0} = 0

,

c_{1} = 1

,

c_{l} = - c_{l - 2} / l

,

l \geq 2

. Thus,

S_{0} (y) = \sum_{l = 0}^{\infty} \frac{{(- 1)}^{l} y^{2 l + 1}}{(2 l + 1)!!} = y - \frac{y^{3}}{3!!} + \frac{y^{5}}{5!!} - \frac{y^{7}}{7!!} + \dots .

(A11)

C_{0}

is found from expression (50)

\begin{matrix} C_{0} (y) = \int_{0}^{\infty} \cos y z \cdot e^{- z^{2} / 2} d z = \frac{1}{2} \int_{- \infty}^{\infty} \cos y z \cdot e^{- z^{2} / 2} d z \\ = \frac{1}{2} \int_{- \infty}^{\infty} (\cos y z - i \sin y z) \cdot e^{- z^{2} / 2} d z = \sqrt{\frac{π}{2}} e^{- y^{2} / 2} . \end{matrix}

(A12)

Seeking integral (32) value

I_{j} (y) = C_{j} (y)

.

References

Duda, R.O.; Hart, P.E. Pattern Classification and Scene Analysis; Wiley: New York, NY, USA, 1973; Volume 3. [Google Scholar]
John, G.; Langley, P. Estimating Continuous Distributions in Bayesian Classifiers. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, 18–20 August 1995. [Google Scholar]
Wang, X.-Z.; He, Y.-L.; Wang, D.D. Non-Naive Bayesian Classifiers for Classification Problems with Continuous Attributes. IEEE Trans. Cybern. 2013, 44, 21–39. [Google Scholar] [CrossRef] [PubMed]
Azzalini, A.; Menardi, G. Clustering via nonparametric density estimation: The R package pdf Cluster. arXiv 2013, arXiv:1301.6559. [Google Scholar]
Cuevas, A.; Febrero-Bande, M.; Fraiman, R. Cluster analysis: A further approach based on density estimation. Comput. Stat. Data Anal. 2001, 36, 441–459. [Google Scholar] [CrossRef]
Campello, R.J.; Kröger, P.; Sander, J.; Zimek, A. Density-based clustering. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2020, 10, e1343. [Google Scholar] [CrossRef]
Kwak, N.; Choi, C.-H. Input feature selection by mutual information based on Parzen window. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 1667–1671. [Google Scholar] [CrossRef] [Green Version]
Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238. [Google Scholar] [CrossRef] [PubMed]
Li, D.; Yang, K.; Wong, W.H. Density Estimation via Discrepancy Based Adaptive Sequential Partition. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Advances in Neural Information Processing Systems 29. pp. 1091–1099. [Google Scholar]
Rothfuss, J.; Ferreira, F.; Walther, S.; Ulrich, M. Conditional density estimation with neural networks: Best practices and benchmarks. arXiv 2019, arXiv:1903.00954. [Google Scholar]
Trentin, E.; Lusnig, L.; Cavalli, F. Parzen neural networks: Fundamentals, properties, and an application to forensic anthropology. Neural Netw. 2018, 97, 137–151. [Google Scholar] [CrossRef]
Trentin, E. Soft-Constrained Neural Networks for Nonparametric Density Estimation. Neural Process. Lett. 2017, 48, 915–932. [Google Scholar] [CrossRef]
Huynh, H.T.; Nguyen, L. Nonparametric maximum likelihood estimation using neural networks. Pattern Recognit. Lett. 2020, 138, 580–586. [Google Scholar] [CrossRef]
Archambeau, C.; Verleysen, M. Fully nonparametric probability density function estimation with finite gaussian mixture models. In Proceedings of the 5th ICPAR Conference, Calcutta, India, 10–13 December 2003; pp. 81–84. [Google Scholar]
Priebe, C.E. Adaptive mixtures. J. Am. Stat. Assoc. 1994, 89, 796–806. [Google Scholar] [CrossRef]
Scott, D.W. Remarks on fitting and interpreting mixture models. Comput. Sci. Stat. 1999, 104–109. [Google Scholar]
Delicado, P.; Del Río, M. A generalization of histogram type estimators. J. Nonparametr. Stat. 2003, 15, 113–135. [Google Scholar] [CrossRef]
Peel, D.; MacLahlan, G. Finite Mixture Models; Wiley Series in Probability and Statistics; John Wiley & Sons Inc.: Hoboken, NJ, USA, 2000. [Google Scholar]
Minnotte, M.C. Achieving higher-order convergence rates for density estimation with binned data. J. Am. Stat. Assoc. 1998, 93, 663–672. [Google Scholar] [CrossRef]
Tapia, R.A.; Thompson, J.R. Nonparametric Probability Density Estimation; Johns Hopkins University Press: Baltimore, MD, USA, 1978; p. 176. [Google Scholar]
Jones, M.C.; Samiuddin, M.; Al-Harbey, A.H.; Maatouk, T.A.H. The edge frequency polygon. Biometrika 1998, 85, 235–239. [Google Scholar] [CrossRef]
Simonoff, J.S. The anchor position of histograms and frequency polygons: Quantitative and qualitative smoothing. Commun. Stat. Simul. Comput. 1995, 24, 691–710. [Google Scholar] [CrossRef]
Scott, D.W. Averaged Shifted Histograms: Effective Nonparametric Density Estimators in Several Dimensions. Ann. Stat. 1985, 13, 1024–1040. [Google Scholar] [CrossRef]
Scott, D.W. On optimal and data-based histograms. Biometrika 1979, 66, 605–610. [Google Scholar] [CrossRef]
Fix, E.; Hodges, J. An important contribution to nonparametric discriminant analysis and density estimation. Int. Stat. Rev. 1951, 3, 233–238. [Google Scholar]
Silverman, B.W.; Jones, M.C.; Fix, E. An Important Contribution to Nonparametric Discriminant Analysis and Density Estimation: Commentary on Fix and Hodges (1951). Int. Stat. Rev. 1989, 57, 233. [Google Scholar] [CrossRef]
Rosenblatt, M. A central limit theorem and a strong mixing condition. Proc. Natl. Acad. Sci. USA 1956, 42, 43–47. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Parzen, E. On Estimation of a Probability Density Function and Mode. Ann. Math. Stat. 1962, 33, 1065–1076. [Google Scholar] [CrossRef]
Cencov, N.N. Estimation of an unknown distribution density from observations. Soviet Math. 1962, 3, 1559–1566. [Google Scholar]
Watson, G.S.; Leadbetter, M. On the estimation of the probability density, I. Ann. Math. Stat. 1963, 34, 480–491. [Google Scholar] [CrossRef]
Loftsgaarden, D.O.; Quesenberry, C.P. A Nonparametric Estimate of a Multivariate Density Function. Ann. Math. Stat. 1965, 36, 1049–1051. [Google Scholar] [CrossRef]
Schwartz, S.C. Estimation of Probability Density by an Orthogonal Series. Ann. Math. Stat. 1967, 38, 1261–1265. [Google Scholar] [CrossRef]
Epanechnikov, V.A. Non-Parametric Estimation of a Multivariate Probability Density. Theory Probab. Its Appl. 1969, 14, 153–158. [Google Scholar] [CrossRef]
Tarter, M.; Kronmal, R. On Multivariate Density Estimates Based on Orthogonal Expansions. Ann. Math. Stat. 1970, 41, 718–722. [Google Scholar] [CrossRef]
Kimeldorf, G.; Wahba, G. Some results on Tchebycheffian spline functions. J. Math. Anal. Appl. 1971, 33, 82–95. [Google Scholar] [CrossRef] [Green Version]
Cacoullos, T.; Sobel, M. An inverse sampling procedure for selecting the most probable event in a multinomial distribution. In Multivariate Analysis; Academic Press: New York, NY, USA, 1966; pp. 423–455. [Google Scholar]
Silverman, B. Choosing the window width when estimating a density. Biometrika 1978, 65, 1–11. [Google Scholar] [CrossRef]
Hwang, J.-N.; Lay, S.-R.; Lippman, A. Nonparametric multivariate density estimation: A comparative study. IEEE Trans. Signal. Process. 1994, 42, 2795–2810. [Google Scholar] [CrossRef] [Green Version]
Scott, D. Multivariate Density Estimation. Ann. Stat. 1992, 20, 1236–1265. [Google Scholar]
Kooperberg, C. Bivariate density estimation with an application to survival analysis. J. Comput. Graph. Stat. 1998, 7, 322–341. [Google Scholar]
Takada, T. Nonparametric density estimation: A comparative study. Econ. Bull. 2001, 3, 1–10. [Google Scholar]
Delgado, M.A.; Robinson, P.M. Nonparametric and semiparametric methods for economic research. J. Econ. Surv. 1992, 6, 201–249. [Google Scholar] [CrossRef] [Green Version]
Gill, R.D.; Wellner, J.A.; Præstgaard, J. Non-and semi-parametric maximum likelihood estimators and the von mises method (part 1) [with discussion and reply]. Scand. J. Stat. 1989, 16, 97–128. [Google Scholar]
Gill, R.D.; Van Der Vaart, A.W. Non-and semi-parametric maximum likelihood estimators and the von Mises method: II. Scand. J. Stat. 1993, 20, 271–288. [Google Scholar]
Hyndman, R.J.; Yao, Q. Nonparametric Estimation and Symmetry Tests for Conditional Density Functions. J. Nonparametr. Stat. 2002, 14, 259–278. [Google Scholar] [CrossRef] [Green Version]
Holmström, L.; Hoti, F. Application of semiparametric density estimation to classification. In Proceedings of the 17th International Conference on Pattern Recognition, Cambridge, UK, 26 August 2004; Volume 3, pp. 371–374. [Google Scholar]
Castellana, J.; Leadbetter, M. On smoothed probability density estimation for stationary processes. Stoch. Process. Their Appl. 1986, 21, 179–193. [Google Scholar] [CrossRef] [Green Version]
Chiu, S.-T. Bandwidth Selection for Kernel Density Estimation. Ann. Stat. 1991, 19, 1883–1905. [Google Scholar] [CrossRef]
Gu, C.; Qiu, C. Smoothing Spline Density Estimation: Theory. Ann. Stat. 1993, 21, 217–234. [Google Scholar] [CrossRef]
Härdle, W.; Müller, M. Multivariate and Semiparametric Kernel Regression; SFB 373 Discussion Paper; Springer: Heidelberg/Berlin, Germany, 1997. [Google Scholar]
Jones, M.C.; Marron, J.S.; Sheather, S.J. A brief survey of bandwidth selection for density estimation. J. Am. Stat. Assoc. 1996, 91, 401–407. [Google Scholar] [CrossRef]
Stone, C.J.; Hansen, M.H.; Kooperberg, C.; Truong, Y.K. Polynomial splines and their tensor products in ex-tended linear modeling: 1994 Wald memorial lecture. Ann. Stat. 1997, 25, 1371–1470. [Google Scholar] [CrossRef]
Aladjem, M. Projection pursuit mixture density estimation. IEEE Trans. Signal. Process. 2005, 53, 4376–4383. [Google Scholar] [CrossRef]
Friedman, J.H.; Stuetzle, W.; Schroeder, A. Projection pursuit density estimation. J. Am. Stat. Assoc. 1984, 79, 599–608. [Google Scholar] [CrossRef]
Friedman, J.H. Exploratory projection pursuit. J. Am. Stat. Assoc. 1987, 82, 249–266. [Google Scholar] [CrossRef]
Rudzkis, R.; Radavičius, M. Testing Hypotheses on Discriminant Space in the Mixture Model of Gaussian Distributions. Acta Appl. Math. 2003, 79, 105–114. [Google Scholar] [CrossRef]
Kalantan, Z.I.; Einbeck, J. Quantile-Based Estimation of the Finite Cauchy Mixture Model. Symmetry 2019, 11, 1186. [Google Scholar] [CrossRef] [Green Version]
Azzari, L.; Foi, A. Gaussian-Cauchy mixture modeling for robust signal-dependent noise estimation. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 5357–5361. [Google Scholar]
Teimouri, M. Statistical Inference for Mixture of Cauchy Distributions. arXiv 2018, arXiv:1809.05722. [Google Scholar]
Jones, M.C. Discretized and interpolated kernel density estimates. J. Am. Stat. Assoc. 1989, 84, 733–741. [Google Scholar] [CrossRef]
Lambert, C.G.; Harrington, S.E.; Harvey, C.R.; Glodjo, A. Efficient on-line nonparametric kernel density estimation. Algorithmica 1999, 25, 37–57. [Google Scholar] [CrossRef]
Gasser, T.; Müller, H.-G.; Mammitzsch, V. Kernels for Nonparametric Curve Estimation. J. R. Stat. Soc. Ser. B 1985, 47, 238–252. [Google Scholar] [CrossRef]
Marron, J.; Nolan, D. Canonical kernels for density estimation. Stat. Probab. Lett. 1988, 7, 195–199. [Google Scholar] [CrossRef]
Fukunaga, K.; Hostetler, L. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans. Inf. Theory 1975, 21, 32–40. [Google Scholar] [CrossRef] [Green Version]
Silverman, B.W. Density Estimation for Statistics and Data Analysis. In Monographs on Statistics and Applied Probability; Chapman & Hall: London, UK, 1994. [Google Scholar]
Breiman, L.; Meisel, W.; Purcell, E. Variable kernel estimates of multivariate densities. Technometrics 1977, 19, 135–144. [Google Scholar] [CrossRef]
Hall, P.; Sheather, S.J.; Jones, M.; Marron, J.S. On optimal data-based bandwidth selection in kernel density estimation. Biometrika 1991, 78, 263–269. [Google Scholar] [CrossRef]
Hart, J.D.; Vieu, P. Data-Driven Bandwidth Choice for Density Estimation Based on Dependent Data. Ann. Stat. 1990, 18, 873–890. [Google Scholar] [CrossRef]
Marron, J.S. An Asymptotically Efficient Solution to the Bandwidth Problem of Kernel Density Estimation. Ann. Stat. 1985, 13, 1011–1023. [Google Scholar] [CrossRef]
Wand, M.P.; Jones, M.C. Multivariate plug-in bandwidth selection. Comput. Stat. 1994, 9, 97–116. [Google Scholar]
Abramson, I.S. On Bandwidth Variation in Kernel Estimates-A Square Root Law. Ann. Stat. 1982, 10, 1217–1223. [Google Scholar] [CrossRef]
Nadaraya, E.A. On estimating regression. Theory Probab. Its Appl. 1964, 9, 141–142. [Google Scholar] [CrossRef]
Watson, G.S. Smooth regression analysis. Sankhyā Indian J. Stat. Ser. A 1964, 26, 359–372. [Google Scholar]
Smith, P.W.; Schumaker, L. Spline Functions: Basic Theory. Math. Comput. 1982, 38, 652. [Google Scholar] [CrossRef]
De Boor, C.; De Boor, C. A Practical Guide to Splines; Springer: New York, NY, USA, 1978; Volume 27. [Google Scholar]
Koo, J.-Y. Bivariate B-splines for tensor logspline density estimation. Comput. Stat. Data Anal. 1996, 21, 31–42. [Google Scholar] [CrossRef]
Hansen, M.H.; Kooperberg, C. Spline Adaptation in Extended Linear Models (with comments and a rejoinder by the authors). Stat. Sci. 2002, 17, 2–51. [Google Scholar] [CrossRef]
Kooperberg, C.; Stone, C.J. A study of logspline density estimation. Comput. Stat. Data Anal. 1991, 12, 327–347. [Google Scholar] [CrossRef]
Kooperberg, C.; Stone, C.J. Logspline density estimation for censored data. J. Comput. Graph. Stat. 1992, 1, 301–328. [Google Scholar]
Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control. 1974, 19, 716–723. [Google Scholar] [CrossRef]
Friedman, J.; Tukey, J. A Projection Pursuit Algorithm for Exploratory Data Analysis. IEEE Trans. Comput. 1974, 100, 881–890. [Google Scholar] [CrossRef]
Huber, P.J. Projection pursuit. Ann. Stat. 1985, 13, 435–475. [Google Scholar] [CrossRef]
Hall, P. On Polynomial-Based Projection Indices for Exploratory Projection Pursuit. Ann. Stat. 1989, 17, 589–605. [Google Scholar] [CrossRef]
Goffe, W.L.; Ferrier, G.; Rogers, J. Global optimization of statistical functions with simulated annealing. J. Econ. 1994, 60, 65–99. [Google Scholar] [CrossRef] [Green Version]
Ruzgas, T. The Nonparametric Estimation of Multivariate Distribution Density Applying Clustering Procedures. Ph.D. Thesis, Matematikos ir Informatikos Institutas, Vilnius, Lithuania, 2007. [Google Scholar]
Kavaliauskas, M.; Rudzkis, R.; Ruzgas, T. The projection-based multi-variate distribution density estimation. Acta Comment. Univ. Tartu. Math. 2004, 8, 135–141. [Google Scholar]
Rudzkis, R.; Radavičius, M. Statistical estimation of a mixture of Gaussian distributions. Acta Appl. Math. 1995, 38, 37–54. [Google Scholar] [CrossRef]
Van deLaan, M. Efficient and Inefficient Estimation in Semiparametric Models; CWI Tracts: Amsterdam, The Netherlands, 1995. [Google Scholar]
Van Der Laan, M.J.; Dudoit, S.; Keles, S. Asymptotic Optimality of Likelihood-Based Cross-Validation. Stat. Appl. Genet. Mol. Biol. 2004, 3, 1–23. [Google Scholar] [CrossRef] [Green Version]
Hall, P. Large Sample Optimality of Least Squares Cross-Validation in Density Estimation. Ann. Stat. 1983, 11, 1156–1174. [Google Scholar] [CrossRef]

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Article Metrics

Citations

Article Access Statistics

Journal Statistics

Multiple requests from the same IP address are counted as one view.

Number of Components	Proportions of Components	Location Parameters	Separation Size of Locations
q = 2	p₁ = (1 − p₂), p₂ = 0.1, 0.3, 0.5	m₁ = (0, 0), m₂ = (0.5i, 0.5i)	i = 1, 2,…, 6
q = 3	p₁ = p₂ = (1 − p₃)/2, p₃ = 0.1, 1/3, 0.8	m₁ = (0, 0), m₂ = (0.5i, 0.5i), m₃ = (0.5i, 0)	i = 1, 2,…, 6
q = 4	p₁ = p₂ = p₃ = (1 − p₄)/3, p₄ = 0.1, 0.25, 0.7	m₁ = (0, 0), m₂ = (0.5i, 0.5i), m₃ = (0.5i, 0), m₄ = (0, 0.5i)	i = 1, 2,…, 6
q = 2	p₁ = (1 − p₂), p₂ = 0.1, 0.2, 0.3, 0.4, 0.5	m₁ = (0, 0, 0, 0, 0), m₂ = (0.5i, 0.5i, 0.5i, 0.5i, 0.5i)	i = 1, 2,…, 6
q = 3	p₁ = p₂ = (1 − p₃)/2, p₃ = 0.1, 0.2, 1/3, 0.4, 0.6, 0.8	m₁ = (0, 0, 0, 0, 0), m₂ = (0.5i, 0.5i, 0.5i, 0.5i, 0.5i), m₃ = (0.5i, 0.5i, 0, 0, 0)	i = 1, 2,…, 6
q = 4	p₁ = p₂ = p₃ = (1 − p₄)/3, p₄ = 0.1, 0.16, 0.25, 0.4, 0.7	m₁ = (0, 0, 0, 0, 0), m₂ = (0.5i, 0.5i, 0.5i, 0.5i, 0.5i), m₃ = (0.5i, 0.5i, 0, 0, 0), m₄ = (0, 0, 0.5i, 0.5i, 0.5i)	i = 1, 2,…, 6

Evaluation Methods		Density
		d = 5; p₁ = p₂ = p₃ = 1/3; n = 100
		i = 1	i = 2	i = 3	i = 4	i = 5	i = 6
AKDE	Mean	0.8268	0.8257	0.8198	0.8128	0.8066	0.8075
AKDE	SD	(0.0760)	(0.0814)	(0.0848)	(0.0827)	(0.0788)	(0.0731)
PPDE	Mean	0.9243	0.9319	0.9303	0.9300	0.9284	0.9250
PPDE	SD	(0.0500)	(0.0364)	(0.0375)	(0.0387)	(0.0410)	(0.0433)
LSDE	Mean	0.8043	0.8162	0.8583	0.8611	0.8613	0.8711
LSDE	SD	(0.0534)	(0.0540)	(0.0490)	(0.0349)	(0.0434)	(0.0577)
SKDE	Mean	0.7158	0.7144	0.7088	0.7071	0.7179	0.7227
SKDE	SD	(0.0260)	(0.0905)	(0.0905)	(0.0830)	(0.0631)	(0.0499)
IFDE	Mean	0.94593	0.8886	0.7857	0.8463	0.8761	0.8312
IFDE	SD	(0.0362)	(0.1318)	(0.0706)	(0.0380)	(0.1110)	(0.0538)
MIDE	Mean	0.7389	0.7332	0.7235	0.7149	0.7121	0.7219
MIDE	SD	(0.0280)	(0.0221)	(0.0338)	(0.0195)	(0.0208)	(0.0203)