Nonparametric Estimation of Continuously Parametrized Families of Probability Density Functions—Computational Aspects

Abstract: We consider a rather general problem of nonparametric estimation of an uncountable set of probability density functions (p.d.f.'s) of the form f(x; r), where r is a non-random real variable ranging over a finite interval [R1, R2]. We put emphasis on the algorithmic aspects of this problem, since they are crucial for the exploratory analysis of the big data needed for the estimation. A specialized learning algorithm, based on the 2D FFT, is proposed and tested on observations that allow for estimating the p.d.f.'s of jet engine temperatures as functions of its rotation speed. We also derive theoretical results concerning the convergence of the estimation procedure that contain hints on selecting the parameters of the estimation algorithm.


Introduction
Consider a family f(x; r) of functions such that, for every r ∈ [R1, R2] ⊂ R, f(·; r): R → R is a probability density function (p.d.f.) on the real line R, while the non-random parameter r takes values from a finite interval [R1, R2]. Assume that we have observations (κl, rl), l = 1, 2, . . . , d at our disposal, where, for rl ∈ [R1, R2], observation κl is drawn at random according to the p.d.f. f(x; rl), x ∈ R. Our aim is to propose, under mild assumptions, a nonparametric estimator of the whole continuum set of p.d.f.'s F = { f(·; r), r ∈ [R1, R2]}, assuming that the number of data d → ∞ and that the rl's cover [R1, R2] more and more densely as d → ∞. Later on, we shall refer to the above stated problem as the F-estimation problem.
In this paper, we concentrate mainly on the algorithmic aspects of the F -estimation problem, since it is computationally demanding. However, we also provide an outline of the proof that the proposed learning algorithm is convergent in the integrated mean squared error (IMSE) sense.
Before providing motivating examples, we briefly indicate the similarities, differences, and generality of this problem among other nonparametric estimation tasks that have been considered since the 1950s [1-4]: 1.
The F-estimation problem has certain similarities to the nonparametric estimation problem of a bivariate p.d.f. Notice, however, the important difference, namely that in our case r is a non-random parameter. In other words, our sub-task is to select the rl's in an appropriate way (or to confine ourselves to an interval [R1, R2] which is covered densely by passive observations of pairs (κl, rl), l = 1, 2, . . . , d).

2.
One can also notice a formal similarity between our problem and the problem of nonparametric estimation of a non-stationary p.d.f. f(x; t), say, that was introduced to the statistical literature by Rutkowski (see [5-7]) and continued in papers on concept drift tracking (see [8-10]). There is, however, a fundamental difference between the time t and the parameter r. Namely, time is not reversible, and we usually do not have an opportunity to repeat observations at instants preceding the present t. On the contrary, in the F-estimation problem, we allow the next observation to be made at rl+1 < rl. Furthermore, we also allow repeated observations for the same value of r.

3.
The estimation of several p.d.f.'s was considered in [11], where it was pointed out that this is a computationally demanding problem, because all of these densities should be estimated simultaneously. However, the goal of that paper is quite different from ours. Namely, in [11], the goal was to compare several densities that are not necessarily indexed by the same additional parameter, which does not arise as a parameter of an estimator. 4.
Similarly, calculating the median of the estimated f(·; r), we obtain an estimator of the median regression on r.
Analogously, estimators of other quantile regression functions can be obtained. 6.
When we allow x and/or r in f(x; r) to be vectors, the F-estimation problem also covers multivariate density and regression estimation problems. In this paper, we do not pursue these generalizations, since even for univariate x and r we obtain a computationally and data-demanding problem. On the other hand, we propose a double orthogonal expansion as the basis for solving the F-estimation problem. Replacing the orthogonal functions of x and r by their multivariate counterparts, we obtain an estimator that formally covers the multivariate cases as well, but it is still non-trivial to derive a computationally efficient algorithm and to establish its asymptotic properties.
Below, we outline examples of possible applications of the F-estimation: • The temperature x of a jet engine depends on the rotating speed r of its turbine and on many other non-controllable factors. It is relatively easy to collect a large number of observations (κl, rl), l = 1, 2, . . . , d from proper and improper operating conditions of the engine. For diagnostic purposes, it is desirable to estimate the continuum set of p.d.f.'s of the temperatures for rotating speeds r ∈ [R1, R2]. This is our main motivating example, discussed in detail in Section 7.1.

•
Consider the fuel consumption x of a test exemplar of a new car model. The main impact on x comes from the car speed r, but x also depends on the road surface, the type of tyres, and many other factors. It would be of interest for users to know the whole family F of p.d.f.'s in addition to the mean fuel consumption.
In a similar vein, it would be of interest to estimate F when x is the braking distance of a car running at speed r.

•
Cereal crop yield x depends on the amount r of fertilizer applied to a unit area, as well as on soil valuation, weather conditions, etc. Estimating F would allow for selecting an r which provides a compromise between a high yield and its robustness against other conditions.
Taking into account the rapidly growing amount of available data and increasing computational power, it would be desirable to extend many other examples of nonparametric regression applications to estimating the whole F. Our starting point for constructing an estimator of F is nonparametric estimation of p.d.f.'s by orthogonal series estimators. We refer the reader to the classic results in this field [1,3,12-15]. We shall also need some results on the influence of rounding errors on these estimators (see [16,17]).
The paper is organized as follows: the derivation of the algorithm is presented and a fast method of computation is proposed. Subsequently, tests on synthetic data are performed and the convergence of the method is shown. Finally, a real-world problem regarding jet turbine temperature is presented, as well as other possible applications. As an appendix, a detailed proof of convergence is given.

Derivation of the Estimation Algorithm
Let us define (X(r), r) as a generic pair, where, for fixed r ∈ [R1, R2], random variable (r.v.) X(r) has p.d.f. f(x; r). We use a semicolon in order to indicate that r is a non-random variable. For simplicity of the exposition, we assume that the X(r)'s have a bounded support, [S1, S2] ⊂ R say, which is the same for every r ∈ [R1, R2]. We additionally assume that f is square integrable, i.e., f(·; r) ∈ L2([S1, S2]) for every r ∈ [R1, R2].

Preparations-Orthogonal Expansion
Consider two orthogonal and normalized (orthonormal) sequences v_k(x), x ∈ [S1, S2], k = 1, 2, . . . and T_j(r), r ∈ [R1, R2], j = 1, 2, . . . that are complete in L2([S1, S2]) and L2([R1, R2]), respectively. Then, f(x; r) can be represented by the series (convergent in L2([S1, S2]))

f(x; r) = \sum_{k=1}^{\infty} a_k(r)\, v_k(x), \qquad (1)

with the following coefficients:

a_k(r) = \int_{S_1}^{S_2} f(x; r)\, v_k(x)\, dx.

Notice that the a_k(r)'s can be interpreted as follows:

a_k(r) = E_r\left[ v_k(X(r)) \right], \qquad (2)

where E_r stands for the expectation with respect to the random variable X(r), having p.d.f. f(x; r). Furthermore, each a_k(r) can be represented by the series

a_k(r) = \sum_{j=1}^{\infty} \alpha_{kj}\, T_j(r), \qquad (3)

with constant coefficients α_kj, defined as follows:

\alpha_{kj} = \int_{R_1}^{R_2} a_k(r)\, T_j(r)\, dr. \qquad (4)

Series (3) is convergent in L2([R1, R2]). By back substitution, we obtain the following series representation of f:

f(x; r) = \sum_{k=1}^{\infty} \sum_{j=1}^{\infty} \alpha_{kj}\, T_j(r)\, v_k(x). \qquad (5)

The series in (5), convergent in L2([S1, S2] × [R1, R2]), forms a basis for constructing estimators of F. They differ in the way of estimating the α_kj's and in the way of resolving the so-called bias-variance dilemma. The latter can be resolved by appropriately down-weighting the estimators of the α_kj's. In this paper, we confine ourselves to the simplest way of down-weighting, namely, the truncation of both sums in a way described later on.

Intermediate Estimator
The estimator f̂(x; r) of the family f(x; r), r ∈ [R1, R2], that is simplest from the computational point of view is obtained when we additionally assume that the observations of the X(r)'s are made on an equidistant grid ρ1 < ρ2 < . . . < ρM that splits [R1, R2] into non-overlapping intervals of length ∆r > 0, which cover all of [R1, R2] in such a way that R1 = ρ1 − ∆r/2 and R2 = ρM + ∆r/2. In this section, we tentatively impose the restriction that only repeated, but independent and identically distributed (i.i.d.), observations of X(ρm), m = 1, 2, . . . , M are available. In the asymptotic analysis at the end of this paper, we shall assume that M grows to infinity with the number of observations d. Then, the positions of the ρm's and ∆r will also change, but we do not display this fact in the notation, unless necessary.
At each ρm, m = 1, 2, . . . , M, nm > 0 observations (Xl(ρm), ρm), l = 1, 2, . . . , nm are made, keeping the above-mentioned assumptions on mutual independence in force. Additionally, the mutual independence of the following lists of r.v.'s is postulated:

X_1(\rho_m), X_2(\rho_m), \ldots, X_{n_m}(\rho_m), \quad m = 1, 2, \ldots, M. \qquad (6)

Then, one can estimate the α_kj's by

\hat{\alpha}_{kj} = \Delta_r \sum_{m=1}^{M} \hat{a}_k(\rho_m)\, T_j(\rho_m), \qquad (7)

where

\hat{a}_k(\rho_m) = \frac{1}{n_m} \sum_{l=1}^{n_m} v_k\left( X_l(\rho_m) \right). \qquad (8)

Estimators (7) are justified by (4). Notice that in (7) the simplest quadrature formula is used for approximating the integral in (4). When an active experiment is applicable, one can select the ρm's at the nodes of more advanced quadratures with weights, in the same spirit as in [17,18] for nonparametric regression estimators, where bias reduction was proved. In turn, estimators (8) are motivated by replacing the expectations in (2) by the corresponding empirical means.
Truncating series (5) at the K-th and J-th terms and substituting α̂_kj instead of α_kj, we obtain the following estimator of the family f(x; r), r ∈ [R1, R2]:

\hat{f}(x; r) = \sum_{k=1}^{K} \sum_{j=1}^{J} \hat{\alpha}_{kj}\, T_j(r)\, v_k(x). \qquad (9)

Later on, K and J will depend on the number of observations, but this is not displayed, unless necessary.
Estimator (9) is quite general in the sense that one can select (and mix) different orthonormal bases vk's and Tj's. In particular, the trigonometric system, Legendre polynomials, the Haar system, and other orthogonal wavelets can be applied. A reduction of the computational complexity of (9) is possible when, for a given orthogonal system, a discrete and fast counterpart exists. We illustrate this idea in the next section, by selecting the vk's and Tj's as trigonometric bases and applying the discrete Fourier transform (DFT) and the fast Fourier transform (FFT) as its fast implementation.
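The double-series construction above can be sketched directly. The snippet below is a minimal illustration, not the paper's implementation: it uses the cosine orthonormal basis for both axes, the empirical means (8), the rectangle quadrature (7), and the truncated series (9); all function names are ours.

```python
import numpy as np

def cos_basis(x, k, a, b):
    """Orthonormal cosine basis on [a, b]: v_0 = 1/sqrt(b-a),
    v_k = sqrt(2/(b-a)) * cos(k*pi*(x-a)/(b-a)) for k >= 1."""
    if k == 0:
        return np.full_like(np.asarray(x, float), 1.0 / np.sqrt(b - a))
    return np.sqrt(2.0 / (b - a)) * np.cos(k * np.pi * (x - a) / (b - a))

def estimate_alpha(samples, rho, S, R, K, J):
    """samples[m] holds the observations of X(rho[m]); rho is the equidistant
    grid on [R1, R2].  Returns the K x J matrix of alpha-hat coefficients."""
    (S1, S2), (R1, R2) = S, R
    M = len(rho)
    dr = (R2 - R1) / M
    a_hat = np.empty((K, M))
    for m in range(M):
        x = np.asarray(samples[m], float)
        for k in range(K):
            a_hat[k, m] = cos_basis(x, k, S1, S2).mean()   # empirical mean, as in (8)
    alpha = np.empty((K, J))
    for j in range(J):
        Tj = cos_basis(rho, j, R1, R2)
        alpha[:, j] = dr * (a_hat * Tj).sum(axis=1)        # rectangle quadrature, as in (7)
    return alpha

def f_hat(x, r, alpha, S, R):
    """Truncated double series (9) evaluated at a single point (x, r)."""
    (S1, S2), (R1, R2) = S, R
    K, J = alpha.shape
    vx = np.array([cos_basis(x, k, S1, S2) for k in range(K)])
    tr = np.array([cos_basis(r, j, R1, R2) for j in range(J)])
    return float(vx @ alpha @ tr)
```

For instance, with samples drawn uniformly on [0, 1] the reconstruction is close to the flat density f ≡ 1 for any r.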

Efficient Learning Algorithm
Our aim in this section is to propose an algorithm for the fast calculation of the α̂kj's in (9), using the FFT, which is necessary in order to learn a proper selection of K and J, or a proper selection of those α̂kj's that are essentially different from zero.

Data Preprocessing
The FFT algorithm operates on data on a grid. Thus, our first step is to attach the set of raw observations to such a grid. We already have one axis of the grid, namely the points ρm, m = 1, 2, . . . , M. Define bins Bm, m = 1, 2, . . . , M in the following way: for l = 1, 2, . . . , d, check whether rl ∈ [ρm − ∆r/2, ρm + ∆r/2); if so, attach κl to bin Bm. Notice that the contents of each Bm depend on ∆r, but this is not displayed in the notation. Denote by n̄m the cardinality of Bm. Clearly, we must have ∑M m=1 n̄m = d. Informally, one can consider the xmj, j = 1, 2, . . . , n̄m in bin Bm as slightly distorted realizations of the r.v.'s in (6).
The second split of the grid goes along the x-axis. Denote by χn, n = 1, 2, . . . , N the equidistant grid points along the x-axis, spaced 2∆x apart, so that the intervals [χn − ∆x, χn + ∆x) cover [S1, S2]. Now, we are ready to define the number of observations qmn that are attached to each grid point (χn, ρm). Namely, qmn is the number of observations that are contained in bin Bm and simultaneously take values in [χn − ∆x, χn + ∆x), n = 1, 2, . . . , N, m = 1, 2, . . . , M. Let us define the M × N matrix P with elements:

p_{mn} = q_{mn} / \bar{n}_m, \quad n = 1, 2, \ldots, N, \; m = 1, 2, \ldots, M. \qquad (11)

Clearly, the rows pm· sum up to 1. Notice, however, that, strictly speaking, the pm·'s are not histograms of the r.v.'s X(ρm), since they are based on the observations contained in the bins Bm. Nevertheless, we shall later interpret them as such because, as ∆r → 0, the pm·'s converge to f(x; ρm), assuming that also ∆x → 0 in an appropriate way (see [13] for precise statements).
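The preprocessing step above amounts to a two-dimensional histogram with row normalization. A compact sketch (the function name and argument layout are ours):

```python
import numpy as np

def preprocess(kappa, r, M, N, R, S):
    """Bin the raw pairs (kappa_l, r_l) onto an M x N grid and return the
    row-normalized frequency matrix P of (11); rows with data sum to 1."""
    (R1, R2), (S1, S2) = R, S
    # q[m, n] counts observations falling into bin B_m along r
    # and into the n-th x-bin simultaneously
    q, _, _ = np.histogram2d(r, kappa, bins=[M, N], range=[[R1, R2], [S1, S2]])
    n_bar = q.sum(axis=1, keepdims=True)               # cardinality of each B_m
    P = np.divide(q, n_bar, out=np.zeros_like(q), where=n_bar > 0)
    return P, q
```

Empty bins Bm are left as zero rows rather than dividing by zero; how the paper handles empty bins is not stated, so this choice is ours.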

Fast Calculations and Smoothing
The crucial, most time-consuming step is smoothing the preprocessed data contained in matrix P. Therefore, it is expedient to apply the 2D FFT in order to calculate the DFT of P:

G = \mathrm{FFT}_{2D}(P; M, N),

where the resulting M × N matrix G has complex elements gmn, n = 1, 2, . . . , N, m = 1, 2, . . . , M (see, e.g., [19] for the definition of the 2D DFT and its implementation as the 2D FFT). Obviously, the inverse transform provides P = FFT−1 2D(G; M, N). Thus, in order to smooth P, we have to remove the high-frequency components from matrix G, retaining only an appropriately chosen K × J sub-matrix Ĝ, 1 < K < M, 1 < J < N, and setting the other elements of G to zero. Instead of setting zeros, one can apply at this stage so-called windowing, e.g., using the Hamming window, which provides a milder way of approaching zero. Remark 1. The appropriate choice of the K × J sub-matrix Ĝ, with elements ĝkj, means that we have to take into account that the complex-valued matrix G has the component corresponding to the (0, 0) frequency placed at g11 (or gMN). Analogously, the other components of G corresponding to low frequencies are placed at the "corners" of this matrix. Hence, in order to cancel high frequencies, we have to reshape matrix G so as to have the (0, 0) frequency in its middle and the other low frequencies nearby. Then, we put zeros at the positions corresponding to high frequencies, select the sub-matrix Ĝ, and reshape it in the order reverse to that of the reshaping of G. This last step is necessary so as to obtain a smoothed version of the P matrix from the K × J matrix Ĝ.
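The reshaping described in Remark 1 is exactly what `fftshift` does: it moves the (0, 0) frequency to the center so that a centered K × J block is the low-frequency part. A minimal sketch of the smoothing step (function name ours):

```python
import numpy as np

def smooth_P(P, K, J):
    """Low-pass smooth the frequency matrix P: take its 2D DFT, keep only a
    centered K x J block of lowest frequencies, zero the rest, and invert."""
    M, N = P.shape
    G = np.fft.fftshift(np.fft.fft2(P))        # (0,0) frequency moved to the center
    mask = np.zeros((M, N))
    m0, n0 = M // 2, N // 2                    # center index after fftshift
    mask[m0 - K // 2 : m0 + K // 2 + 1, n0 - J // 2 : n0 + J // 2 + 1] = 1.0
    P_smooth = np.fft.ifft2(np.fft.ifftshift(G * mask))
    return np.real(P_smooth)                   # imaginary residue is numerical noise
```

Keeping the full spectrum reproduces P exactly, and since the (0, 0) component is always retained, the mean of P is preserved by the smoothing.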

Remark 2. It is crucial for further considerations to observe that the α̂kj's are directly calculable from the ĝkj's and their conjugates by adding or subtracting their real and imaginary parts. Remark 3. The choice of K, J, or the choice of the indexes k, j for which α̂kj is left nonzero, is crucial for the proper functioning of the estimation algorithm, since this choice dictates the amount of smoothness of the resulting surface. Below, we discuss possible ways of learning the algorithm to select them properly.
Although the F-estimation problem differs from that of estimating bivariate p.d.f.'s, we may consider the methods elaborated in this field as candidates that might also be useful here.

1.
The cross-validation approach (see [20], where the asymptotic optimality of this method is proved for the trigonometric and the Hermite series density estimators). 2. The Kronmal and Tarter rule [21], which, in our case, reads as follows. For m = 1, 2, . . . , M and l = 1, 2, . . . , N, check condition (13), in which c > 0 is a preselected constant. According to the derivations in [21], c = 1, but in the same paper it is advocated to use c = 0.5. From our simulation experiments, it follows that, for moderate M and N, c = 1.5 is appropriate, while, for larger M, N, even larger constants are better. If condition (13) holds for gkl, then leave it in matrix G as a nonzero element. Otherwise, set gkl = 0 in this matrix. Set Ĝ = G. Notice that, in this case, matrix Ĝ has the same dimensions as G, but many of its elements are zeros.
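Since the exact form of condition (13) is not reproduced here, the sketch below uses a generic magnitude threshold as an illustrative stand-in for the Kronmal-Tarter test, not the published rule itself:

```python
import numpy as np

def coefficient_mask(G, c=1.5):
    """Keep DFT coefficients whose squared magnitude exceeds c times the
    average squared magnitude; zero out the rest.  The threshold form is an
    illustrative stand-in for condition (13), not the exact published rule."""
    power = np.abs(G) ** 2
    return np.where(power > c * power.mean(), G, 0.0)
```

The result has the same dimensions as G, with the suppressed coefficients set to zero, as described in the text.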
The selection of ∆r and ∆x (or, equivalently, M and N) as functions of the number of observations is also very important for proper estimation. We give some hints on their choice in the penultimate section, where the asymptotic analysis is discussed.
The performance of Algorithm 1 is illustrated in the next section.
Step 1: Collect the observations (κl, rl), l = 1, 2, . . . , d and select the grid parameters ∆r and ∆x (equivalently, M and N).
Step 2: Perform data preprocessing, as described in Section 3.1, in order to obtain matrix P.
Step 3: Calculate G = FFT2D(P; M, N).
Step 4: Select the elements of matrix Ĝ either by selecting K and J, using cross-validation, or by the Kronmal-Tarter rule.
Step 5: Calculate the matrix P̂ = FFT−1 2D(Ĝ; K, J) when cross-validation is used in Step 4, or as P̂ = FFT−1 2D(Ĝ; M, N) when the Kronmal-Tarter rule is applied. Output: P̂ is the output of the algorithm, if it suffices to have the estimates of f(x; r) on the grid. If one needs estimates of f(x; r) at intermediate points, then calculate the α̂kl's from the corresponding elements of the Ĝ matrix (see Remark 2) and use them in (9).

Test on Synthetic Data
The first test is made using synthetic data. These data are generated from the family of probability density functions

f(x; r_i) = N(\mu, \sigma^2),

where N(μ, σ²) is the normal distribution with mean μ = ri and variance σ² = 1. The data are generated with 200 points ri = 0, 0.25, . . . , 50 and 300 random points for each ri. These data are binned in order to obtain the matrix of probabilities (11) with size 200 × 56. The similarity between two p.d.f.'s can be measured using distances defined in many ways. Here, we use the Hellinger distance and the Kullback-Leibler divergence.
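The synthetic-data setup can be sketched as follows; the support bounds (four standard deviations beyond the grid) and the exact step layout are our assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# r-grid of the experiment: 200 equally spaced nodes with step 0.25,
# and 300 draws of X(r_i) ~ N(r_i, 1) at each node
r_grid = np.arange(200) * 0.25
samples = rng.normal(loc=r_grid[:, None], scale=1.0, size=(200, 300))

# Bin each row into 56 equal x-bins covering the (assumed) common support,
# then normalize rows to obtain the probability matrix of (11)
S1, S2 = r_grid.min() - 4.0, r_grid.max() + 4.0
edges = np.linspace(S1, S2, 57)
P = np.stack([np.histogram(samples[i], bins=edges)[0] for i in range(200)]).astype(float)
P /= P.sum(axis=1, keepdims=True)
```

The resulting 200 × 56 matrix P is the input to the FFT smoothing step; each row peaks near its own ri.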
For a specific r, the Hellinger distance is defined as follows:

H(r) = \left( \frac{1}{2} \int_{S_1}^{S_2} \left( \sqrt{f(x; r)} - \sqrt{\hat{f}(x; r)} \right)^2 dx \right)^{1/2}.

Another integration along r is required in order to obtain the distance for all r's:

H = \int_{R_1}^{R_2} H(r)\, dr.

The Kullback-Leibler divergence for a specific r can be defined as follows:

KL(r) = \int_{S_1}^{S_2} f(x; r) \log \frac{f(x; r)}{\hat{f}(x; r)}\, dx.

Again, an additional integration along r is required:

KL = \int_{R_1}^{R_2} KL(r)\, dr.

Due to the inherent randomness, the calculations were carried out 100 times. The results are presented in Table 1. Observe that the Kullback-Leibler divergence is not symmetric. Its symmetrized version provides almost the same results as in Table 1. The reconstruction using the presented algorithm can be seen in Figure 1.
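Both distances are straightforward to evaluate on a discretization grid; the helper names below are ours:

```python
import numpy as np

def hellinger(p, q, dx):
    """Discretized Hellinger distance between two densities sampled on an
    equidistant grid with spacing dx."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx)

def kl_divergence(p, q, dx, eps=1e-12):
    """Discretized Kullback-Leibler divergence KL(p || q); eps guards log(0)."""
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q)) * dx
```

Averaging these over the r-grid gives the global figures reported in Table 1; note the KL divergence is not symmetric in its arguments, as remarked in the text.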

Convergence of the Learning Procedure
In this section, we prove the convergence of the learning procedure in a simplified version similar to that described in Section 2.2, but without the discretization with respect to r. Then, we shall discuss additional conditions for the convergence of its discretized version.
Let X(r), X1(r), . . . , Xn(r) be independent, identically distributed random variables with parameter r, with p.d.f. f(·, r) ∈ L2([S1, S2]), where, for simplicity, we assume that n is the same for each r. Then, f has the representation

f(x; r) = \sum_{k=1}^{\infty} a_k(r)\, v_k(x), \qquad (19)

which is convergent in the L2([S1, S2]) sense. Then, we estimate the ak(r)'s as follows:

\hat{a}_k(r) = \frac{1}{n} \sum_{i=1}^{n} v_k\left( X_i(r) \right).

Lemma 1. For every r ∈ [R1, R2], âk(r) is an unbiased estimator of ak(r).
Proof. Indeed,

E_r\left[ \hat{a}_k(r) \right] = E_r\left[ v_k(X_1(r)) \right] = \int_{S_1}^{S_2} v_k(x)\, f(x; r)\, dx = a_k(r).

As an estimator of f(·, r), we take the truncated version of (19):

\hat{f}_n(x; r) = \sum_{k=1}^{K(n)} \hat{a}_k(r)\, v_k(x), \qquad (23)

where the truncation point K depends on n. It may also depend on r, but, for the asymptotic analysis, we take this simpler version. The standard way of assessing the distance between f and its estimator f̂n is the mean integrated squared error (MISE). Notice, however, that, in our case, the MISE additionally depends on r, which is further denoted as MISE(r). Thus, in order to have a global error, we consider the integrated MISE(r), defined as follows:

IMISE_n = \int_{R_1}^{R_2} MISE(r)\, dr.

The MISE(r) is defined as follows:

MISE(r) = E_r \int_{S_1}^{S_2} \left( f(x; r) - \hat{f}_n(x; r) \right)^2 dx,

where the expectation w.r.t. f(·, r) concerns all the Xi(r)'s that are present in (23). Using the orthonormality of the vk's, we obtain:

MISE(r) = \sum_{k=1}^{K(n)} E_r\left( \hat{a}_k(r) - a_k(r) \right)^2 + \sum_{k=K(n)+1}^{\infty} a_k^2(r). \qquad (26)

Continuing (26), we obtain for each r ∈ [R1, R2]:

MISE(r) = Var(r, K(n)) + Bias^2(r, K(n)),

where Var(r, K(n)) denotes the first sum in (26) and Bias²(r, K(n)) the second one. It is known that, for square integrable f, we have ∑∞ k=1 a²k(r) < ∞. Thus, if K(n) → ∞ when n → ∞, then for every r ∈ [R1, R2] we obtain Bias²(r, K(n)) → 0. This result is not sufficient to prove the convergence of IMISEn to zero as n → ∞. To this end, we need a stronger result, namely an upper bound on Bias²(r, K(n)) which is independent of r.

Lemma 2.
Let us assume that ∂f(x, r)/∂x exists and is a continuous function of x in [S1, S2] for each r ∈ [R1, R2]. Furthermore, there exists a constant 0 < U < ∞, which does not depend on r, such that

\sup_{x \in [S_1, S_2]} \left| \frac{\partial f(x, r)}{\partial x} \right| \le U, \quad r \in [R_1, R_2].

If K(n) → ∞ when n → ∞, then, for n sufficiently large, we have:

Bias^2(r, K(n)) \le \frac{U^2 (S_2 - S_1)^2}{K(n)}.

Proof. It is known that, under the above assumptions, |ak(r)| ≤ U (S2 − S1)/k, k = 1, 2, . . . . Thus, Bias²(r, K(n)) ≤ U²(S2 − S1)² ∑∞ k=K(n) k−2, which finishes the proof, since it is known that, for sufficiently large K(n), ∑∞ k=K(n) k−2 behaves as K−1(n).
For evaluating the variance component, we use Lemma 1:

Var(r, K(n)) = \sum_{k=1}^{K(n)} Var_r\left( \hat{a}_k(r) \right), \qquad (32)

where Varr(·) is the variance of an r.v. having the p.d.f. f(x, r). Let us assume that the orthonormal and complete system of vk's is uniformly bounded in the following sense: there exist a non-negative integer p and a constant Cp such that

\sup_{x \in [S_1, S_2]} |v_k(x)| \le C_p\, k^p, \quad k = 1, 2, \ldots. \qquad (33)
Notice that (33) holds for the trigonometric system with p = 0. Lemma 3. If (33) holds, then

Var(r, K(n)) \le \frac{C_p^2\, K^{2p+1}(n)}{n}. \qquad (34)

Proof. The Xi(r)'s are i.i.d. and (33) holds. Thus,

Var_r\left( \hat{a}_k(r) \right) = \frac{1}{n} Var_r\left( v_k(X_1(r)) \right) \le \frac{C_p^2\, k^{2p}}{n}.

Using this fact in (32) finishes the proof.
Notice that the bound in (34) does not depend on r.
Theorem 1. Let the assumptions of Lemmas 2 and 3 hold. If the sequence K(n) is selected in such a way that the following conditions hold:

K(n) \to \infty \quad \text{and} \quad \frac{K^{2p+1}(n)}{n} \to 0 \quad \text{as } n \to \infty, \qquad (36)

then the estimation error IMISEn → 0 as n → ∞.
Proof. By Lemmas 2 and 3, we have uniform (in r) bounds for the squared bias and the variance, respectively. Thus,

MISE(r) \le \frac{C_p^2\, K^{2p+1}(n)}{n} + \frac{U^2 (S_2 - S_1)^2}{K(n)}.

Hence,

IMISE_n \le (R_2 - R_1) \left[ \frac{C_p^2\, K^{2p+1}(n)}{n} + \frac{U^2 (S_2 - S_1)^2}{K(n)} \right], \qquad (38)

which finishes the proof by applying (36).
Observe that, for the vk's being the trigonometric system, we have p = 0 and the r.h.s. of (38) is, roughly, of the form c1 K(n)/n + c2/K(n), c1 > 0, c2 > 0, which is minimized by K(n) = c3 √n for a certain constant c3 > 0. This implies that IMISEn = O(n−1/2).
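The minimization mentioned above can be carried out explicitly; setting the derivative of the bound to zero gives the stated rate:

```latex
\frac{d}{dK}\left( c_1 \frac{K}{n} + \frac{c_2}{K} \right)
  = \frac{c_1}{n} - \frac{c_2}{K^2} = 0
  \;\Longrightarrow\;
  K(n) = \sqrt{\frac{c_2}{c_1}}\; n^{1/2},
  \qquad
  \mathrm{IMISE}_n
  \le 2\sqrt{c_1 c_2}\; n^{-1/2} = O\!\left(n^{-1/2}\right).
```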

Proposed Algorithm
Let us define (X(r), r) as a generic pair, where, for fixed r, the r.v. X(r) has p.d.f. f(x; r). We use a semicolon in order to indicate that r is a non-random variable. Observations: we admit that Xi(ri) and Xj(rj) are independent random variables even when ri = rj for i ≠ j.
In general, r.v.'s in (39) are assumed to be mutually independent.
Consider two orthonormal sequences vk(x), x ∈ [κ1, κ2], k = 1, 2, . . . and Tj(r), r ∈ [R1, R2], j = 1, 2, . . . that are complete in L2([κ1, κ2]) and L2([R1, R2]), respectively. Then, f(x; r) can be represented by the series (convergent in L2([κ1, κ2]))

f(x; r) = \sum_{k=1}^{\infty} a_k(r)\, v_k(x), \qquad a_k(r) = \int_{\kappa_1}^{\kappa_2} f(x; r)\, v_k(x)\, dx.

Notice that the ak(r)'s can be interpreted as

a_k(r) = E_r\left[ v_k(X(r)) \right],

where Er stands for the expectation with respect to the random variable X(r) with p.d.f. f(x; r). Furthermore, each ak(r) can be represented by the series:

a_k(r) = \sum_{j=1}^{\infty} \alpha_{kj}\, T_j(r), \qquad \alpha_{kj} = \int_{R_1}^{R_2} a_k(r)\, T_j(r)\, dr. \qquad (42)

By back substitution, we obtain the following series representation (convergent in L2([κ1, κ2] × [R1, R2])):

f(x; r) = \sum_{k=1}^{\infty} \sum_{j=1}^{\infty} \alpha_{kj}\, T_j(r)\, v_k(x). \qquad (44)

The estimator f̂(x; r) of the family f(x; r), r ∈ [R1, R2], that is simplest from the computational point of view is obtained when we additionally assume that the observations of the X(r)'s are made on an equidistant grid ρ1 < ρ2 < . . . < ρM that splits [R1, R2] into non-overlapping intervals of length ∆ > 0.
Then, one can estimate the αkj's by

\hat{\alpha}_{kj} = \Delta \sum_{m=1}^{M} \hat{a}_k(\rho_m)\, T_j(\rho_m), \qquad \hat{a}_k(\rho_m) = \frac{1}{n_m} \sum_{l=1}^{n_m} v_k\left( X_l(\rho_m) \right). \qquad (46)

Estimators (46) are justified by (42). Notice that in (46) the simplest quadrature formula is used for approximating the integral in (42). When an active experiment is applicable, one can select the ρm's at the nodes of more advanced quadratures with weights, in the same spirit as in [17] for nonparametric regression estimators.
Truncating series (44) at the K-th and J-th terms and substituting α̂kj instead of αkj, we obtain the corresponding estimator of the family f(x; r), r ∈ [R1, R2]. If a trigonometric series is used for Tj(r) on [R1, R2], then the α̂kj's can be calculated using the FFT. In addition, the vk's can be trigonometric too.

Subject Overview
A jet turbine is a well-known engine, in use for at least 90 years. The typical application is as an aircraft power plant, but it is also used in some ground applications; the main examples are firefighting and snow removal. In recent years, some companies have started to develop engines optimized for thermal power rather than thrust.
In a very simplistic view (see Figure 2), an already running turbine has one input parameter, namely the fuel flow. As outputs, we have its rotational speed (given in rpm and denoted by r) and its temperature (in °C, denoted by T).
In ideal conditions, there should exist a simple relationship between T and r. Since many factors vary and not all of them can be directly measured, we can think about this relationship in probabilistic terms, which puts us in the framework of the problem stated in the Introduction.

Data Preparation
The original process data from the server have the form of a JSON file of a database table. In this table, we are interested in only two columns, namely the turbine rotation speed (in rpm) and the turbine temperature. Since the resulting temperature depends on many other factors, not all of them measured or even known, we treat this value as a random variable X and the rotation speed as a known (so not random) r. The number of relevant data records is 71,268.
These two columns have to be converted into frequencies for each ri. Groups are formed on a rectangular grid. An example of such a grid can be seen in Figure 3. The resulting number of samples in each box of the grid is shown in Figure 4. They are converted into frequencies by simple division. In order not to obscure the image, a 32 × 32 grid was used. As a result, we obtain a matrix containing the observations near the grid points ρ1, . . . , ρM.

The Estimation Process
From the matrix obtained in the previous section, the two-dimensional discrete Fourier transform is calculated using the FFT algorithm. The result is a matrix Fmn of equal size, but containing complex values. Many properties of the Fourier transform can be found in the literature; a good example is the book [22]. We use the symmetry and antisymmetry of the resulting matrix. The general inversion of the Fourier transform is defined in the following way:

f_{mn} = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} F_{ij}\, \exp\left( 2\pi \jmath \left( \frac{i m}{M} + \frac{j n}{N} \right) \right), \qquad (49)

where ȷ² = −1. Equation (49) is defined only for the original points of the matrix (48). A continuous surface can be obtained by replacing the terms m/M ∈ [0, 1] and n/N ∈ [0, 1] by r/rmax ∈ [0, 1] and T/Tmax ∈ [0, 1]. We cannot guarantee that between grid points the result would be real. The sensible solution is to use the absolute value of the possibly complex number.
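The continuous evaluation just described is trigonometric interpolation of the inverse DFT; a minimal sketch (function name ours), which reduces to the exact inverse transform at the grid points and takes the absolute value elsewhere, as suggested in the text:

```python
import numpy as np

def eval_surface(G, u, v):
    """Evaluate the inverse-DFT trigonometric surface of an M x N spectrum G
    at normalized coordinates u, v in [0, 1] (u = r / r_max, v = T / T_max).
    Between grid nodes the sum is generally complex, so its absolute value
    is returned."""
    M, N = G.shape
    m = np.arange(M)[:, None]
    n = np.arange(N)[None, :]
    s = np.sum(G * np.exp(2j * np.pi * (m * u + n * v))) / (M * N)
    return abs(s)
```

At u = i/M, v = j/N this coincides with element (i, j) of the inverse 2D FFT of G.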
We can ask why use the FFT if we reconstruct the same data again. The reason is the smoothing of the result. The reconstruction can use only a selected part of the spectrum, obviously the lower frequencies. As mentioned before, they reside in the center of the (reshaped) matrix. In the truncated reconstruction, γk is the correction factor necessary to compensate for the omitted data. Its values should be selected so that the reconstruction result is still a p.d.f. and integrates to 1. Since the size of the retained part of the FFT matrix is 4 · K · L, the careful selection of K and L is crucial. When these numbers are too small (see Figure 5), the result loses any resemblance to the original data (Figure 4). On the other hand, when the numbers are too large, we obtain a detailed reconstruction (Figure 6) with all unnecessary details. The acceptable reconstruction in Figure 7 is made with K = 16 and L = 4. Obviously, a smaller size means faster calculations.
The exact values can be obtained experimentally (as in this example) or by using a method such as the Kronmal-Tarter rule.
Another heuristic approach is to reduce the number of the entry points (probabilities) f(X, ρ) by removing perimeter data. The result of such an approach can be seen in Figure 8. The removal of peripheral data results in a reduction of over-fitting.

Process Simulation
Simulation is an important tool in both the design and the subsequent maintenance of a system, especially if the physical device is cumbersome, costly, or dangerous to use. The obvious method of simulating the engine temperature at a specific rotational speed is to generate random numbers from the distribution specific to that speed. The simplest method is the so-called acceptance-rejection method. In this method, a random number from a distribution with density f(x), x ∈ [a, b], is generated as follows (in general using any dense random number generator, but typically the uniform one is used): 1. draw x uniformly from [a, b] and y uniformly from [0, max f]; 2. if y > f(x), then go to 1, otherwise the result is x.
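The two steps above can be sketched as follows (function name and argument layout are ours):

```python
import numpy as np

def accept_reject(f, a, b, f_max, rng, size=1):
    """Acceptance-rejection sampling from a density f on [a, b] bounded by
    f_max: draw x ~ U(a, b) and y ~ U(0, f_max); accept x when y <= f(x)."""
    out = []
    while len(out) < size:
        x = rng.uniform(a, b)
        y = rng.uniform(0.0, f_max)
        if y <= f(x):                 # accept; otherwise redraw
            out.append(x)
    return np.array(out)
```

Any density known only through its estimated values on a grid can be plugged in via interpolation, which is how the estimated f(·; r) would be used here.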
We can easily obtain the distribution for a specific parameter r. Using the example from Section 7.1, with r = 50,000, we obtain the distribution shown in Figure 9.
If the random generation algorithm is used directly, the resulting process path is highly irregular and perceived as not real. To avoid this, a simple filter can be used, where T̃ is the resulting new random number, T̂i+1 is the new process value, and T̂i is the old process value. The result with parameter τ = 0.05 can be seen in Figure 10.
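The exact filter formula is not reproduced above; a plausible reading is first-order exponential smoothing, T̂i+1 = τ T̃ + (1 − τ) T̂i, which the sketch below implements under that assumption:

```python
import numpy as np

def smooth_path(raw, tau=0.05):
    """Exponential smoothing of a simulated temperature path, assuming the
    filter T_hat[i+1] = tau * T_tilde[i] + (1 - tau) * T_hat[i]."""
    out = np.empty_like(np.asarray(raw, float))
    out[0] = raw[0]
    for i in range(1, len(raw)):
        out[i] = tau * raw[i] + (1.0 - tau) * out[i - 1]
    return out
```

With τ = 0.05, each new random draw moves the path only slightly, producing the regular-looking trajectory of Figure 10.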

Process Diagnostics
From the resulting function f(r, T), for a specific r, we can calculate the expected value of T and also an interval in which a fraction α ∈ [0, 1] of realizations can be found, the equivalent of a confidence interval. If the parameter α is selected carefully, we can discern whether an actual pair (r, T) from the process lies inside or outside it. Other similar diagnostic methods can be seen in [23,24].
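Such a central interval can be read off the discretized density via its cumulative sum; a minimal sketch (function name ours):

```python
import numpy as np

def central_interval(pdf_vals, grid, alpha=0.95):
    """Central interval containing a fraction alpha of the mass of a density
    given by its values on an equidistant grid (the diagnostic band of the
    text)."""
    dx = grid[1] - grid[0]
    cdf = np.cumsum(pdf_vals) * dx
    cdf /= cdf[-1]                      # renormalize against discretization error
    lo = grid[np.searchsorted(cdf, (1 - alpha) / 2)]
    hi = grid[np.searchsorted(cdf, 1 - (1 - alpha) / 2)]
    return lo, hi
```

A measured pair (r, T) with T outside the interval computed from the estimated f(·; r) would then be flagged as suspect.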

Convergence of the Learning Procedure
In this section, we prove the convergence of the learning procedure in a simplified version similar to that described in Section 2.2, but without the discretization with respect to r. Then, we shall discuss additional conditions for the convergence of its discretized version.
Let X(r), X1(r), . . . , Xn(r) be independent, identically distributed random variables with parameter r, with p.d.f. f(·, r) ∈ L2([S1, S2]), where, for simplicity, we assume that n is the same for each r. Then, f has the representation

f(x; r) = \sum_{k=1}^{\infty} a_k(r)\, v_k(x), \qquad (52)

which is convergent in the L2([S1, S2]) sense. Then, we estimate the ak(r)'s as follows:

\hat{a}_k(r) = \frac{1}{n} \sum_{i=1}^{n} v_k\left( X_i(r) \right).

Lemma 4. For every r ∈ [R1, R2], âk(r) is an unbiased estimator of ak(r).
Proof. Indeed,

E_r\left[ \hat{a}_k(r) \right] = E_r\left[ v_k(X_1(r)) \right] = \int_{S_1}^{S_2} v_k(x)\, f(x; r)\, dx = a_k(r).

As an estimator of f(·, r), we take the truncated version of (52):

\hat{f}_n(x; r) = \sum_{k=1}^{K(n)} \hat{a}_k(r)\, v_k(x), \qquad (56)

where the truncation point K depends on n. It may also depend on r, but, for the asymptotic analysis, we take this simpler version. The standard way of assessing the distance between f and its estimator f̂n is the mean integrated squared error (MISE). Notice, however, that, in our case, the MISE additionally depends on r, which is further denoted as MISE(r). Thus, in order to have a global error, we consider the integrated MISE(r), which is defined as follows:

IMISE_n = \int_{R_1}^{R_2} MISE(r)\, dr.

The MISE(r) is defined as follows:

MISE(r) = E_r \int_{S_1}^{S_2} \left( f(x; r) - \hat{f}_n(x; r) \right)^2 dx,

where the expectation w.r.t. f(·, r) concerns all the Xi(r)'s that are present in (56).
Using the orthonormality of the vk's, we obtain:

MISE(r) = \sum_{k=1}^{K(n)} E_r\left( \hat{a}_k(r) - a_k(r) \right)^2 + \sum_{k=K(n)+1}^{\infty} a_k^2(r). \qquad (59)

Continuing (59), we obtain for each r ∈ [R1, R2]:

MISE(r) = Var(r, K(n)) + Bias^2(r, K(n)),

where Var(r, K(n)) denotes the first sum in (59) and Bias²(r, K(n)) the second one. It is known that, for square integrable f, we have ∑∞ k=1 a²k(r) < ∞. Thus, if K(n) → ∞ when n → ∞, then for every r ∈ [R1, R2] we obtain Bias²(r, K(n)) → 0. This result is not sufficient to prove the convergence of IMISEn to zero as n → ∞. To this end, we need a stronger result, namely an upper bound on Bias²(r, K(n)), which is independent of r.

Lemma 5.
Let us assume that ∂f(x, r)/∂x exists and is a continuous function of x in [S1, S2] for each r ∈ [R1, R2]. Furthermore, there exists a constant 0 < U < ∞, which does not depend on r, such that

\sup_{x \in [S_1, S_2]} \left| \frac{\partial f(x, r)}{\partial x} \right| \le U, \quad r \in [R_1, R_2].

If K(n) → ∞ when n → ∞, then, for n sufficiently large, we have:

Bias^2(r, K(n)) \le \frac{U^2 (S_2 - S_1)^2}{K(n)}.

Proof. It is known that, under the above assumptions, |ak(r)| ≤ U (S2 − S1)/k, k = 1, 2, . . . . Thus, Bias²(r, K(n)) ≤ U²(S2 − S1)² ∑∞ k=K(n) k−2, which finishes the proof, since it is known that, for sufficiently large K(n), ∑∞ k=K(n) k−2 behaves as K−1(n).
For evaluating the variance component, we use Lemma 4:

Var(r, K(n)) = \sum_{k=1}^{K(n)} Var_r\left( \hat{a}_k(r) \right), \qquad (65)

where Varr(·) is the variance of an r.v. having the p.d.f. f(x, r). Let us assume that the orthonormal and complete system of vk's is uniformly bounded, i.e., there exist a non-negative integer p and a constant Cp such that

\sup_{x \in [S_1, S_2]} |v_k(x)| \le C_p\, k^p, \quad k = 1, 2, \ldots. \qquad (66)

Notice that (66) holds for the trigonometric system with p = 0. Lemma 6. If (66) holds, then

Var(r, K(n)) \le \frac{C_p^2\, K^{2p+1}(n)}{n}. \qquad (67)

Proof. The Xi(r)'s are i.i.d. and (66) holds. Thus,

Var_r\left( \hat{a}_k(r) \right) = \frac{1}{n} Var_r\left( v_k(X_1(r)) \right) \le \frac{C_p^2\, k^{2p}}{n}.
Using this fact in (65) finishes the proof.
Notice that the bound in (67) does not depend on r.
Theorem 2. Let the assumptions of Lemmas 5 and 6 hold. If the sequence K(n) is selected in such a way that the following conditions hold:

K(n) \to \infty \quad \text{and} \quad \frac{K^{2p+1}(n)}{n} \to 0 \quad \text{as } n \to \infty, \qquad (69)

then the estimation error IMISEn → 0 as n → ∞.
Proof. By Lemmas 5 and 6, we have uniform (in r) bounds for the squared bias and the variance, respectively. Thus,

MISE(r) \le \frac{C_p^2\, K^{2p+1}(n)}{n} + \frac{U^2 (S_2 - S_1)^2}{K(n)}.

Hence,

IMISE_n \le (R_2 - R_1) \left[ \frac{C_p^2\, K^{2p+1}(n)}{n} + \frac{U^2 (S_2 - S_1)^2}{K(n)} \right], \qquad (71)

which finishes the proof by applying (69).
Observe that, for the vk's being the trigonometric system, we have p = 0 and the r.h.s. of (71) is, roughly, of the form c1 K(n)/n + c2/K(n), c1 > 0, c2 > 0, which is minimized by K(n) = c3 √n for a certain constant c3 > 0. This implies that IMISEn = O(n−1/2). Notice that this rate is known (see [25]) to be the best possible when a bivariate, continuously differentiable regression function is estimated by any nonparametric method.
However, this convergence rate was obtained under the idealized assumption that we have observations of the Xj(r)'s for each r. Further discussion of this topic is outside the scope of this paper. We only mention that in [17] it was proved that orthogonal series estimators of p.d.f.'s retain consistency under mild assumptions concerning grouped observations. In particular, in the notation of Section 3, the bin width ∆x should depend on the number of observations d as follows: ∆x(d) → 0 as d → ∞, but in such a way that ∆²x(d) multiplied by the cube of the number of spanning orthogonal terms also approaches zero. One can expect that, for the trigonometric series, ∆r should fulfill similar conditions.

Conclusions
In this paper, a method for the estimation of families of density functions is presented, along with an efficient learning algorithm. It gives insight into the relation between a probability density function and the external factors that influence it. The method was used in a real-world problem to simulate and diagnose a jet turbine.
Additional significant possible applications include the estimation of quality indexes for decision-making and optimal control, especially for repetitive and/or spatially distributed processes (see [26,27] for recent contributions).
From the formal point of view, the presented method can easily be extended to the multidimensional case. Namely, if r is multivariate, r̄ = [r(1), r(2), . . . , r(d)] say, then it suffices to replace the orthogonal system Tj(r), j = 1, 2, . . . by the tensor product of the Tj's, i.e., by all possible products of the following form: Tj1(r(1)) Tj2(r(2)) · . . . · Tjd(r(d)), where each ji takes all values in {1, 2, . . .}, i = 1, 2, . . . , d. As a consequence, one has to replace all sums over j by multiple sums over the ji's. When the Tji's form the trigonometric system, one can apply the multidimensional FFT algorithm. Clearly, a much larger number of observations is also necessary for a reliable estimation, but the family of estimated p.d.f.'s f̂(x, r̄) provides much more information than a regression function of r̄.
Funding: This research received no external funding.