Kernel Analysis Based on Dirichlet Processes Mixture Models

Kernels play a crucial role in Gaussian process regression. Analyzing kernels from their spectral domain has attracted extensive attention in recent years. Gaussian mixture models (GMM) are used to model the spectrum of kernels. However, the number of components in a GMM is fixed. Thus, this model suffers from overfitting or underfitting. In this paper, we try to combine the spectral domain of kernels with nonparametric Bayesian models. Dirichlet processes mixture models are used to resolve this problem by changing the number of components according to the data size. Multiple experiments have been conducted on this model and it shows competitive performance.


Introduction
Probabilistic models are essential for machine learning to have a correct understanding of problems. While frequentist models are considered a traditional method for statisticians, Bayesian models are more widely used in machine learning fields. A Bayesian model is a statistical model where probability is used to represent all uncertainties in the model. Uncertainty of both input and output are under consideration in a Bayesian model. A parametric Bayesian model assumes a finite set of parameters which is not flexible because the complexity of the model is bounded while the amount of data is unbounded. A Bayesian nonparametric model is a Bayesian model on infinite-dimensional parameter space. The amount of information captured by the model from the data can grow as the amount of the data grows. These models are always connected with some kinds of stochastic processes to describe an infinite number of parameters. For example, the Gaussian process (GP) and Dirichlet process mixture models (DPMM) are two cornerstones of Bayesian nonparametric models.
The recent development of deep learning has led to the extensive use of the neural network in a variety of domains and a higher performance than conventional methods is achieved. However, this is less than satisfactory for statisticians because of the uninterpretability of black-box function approximation. Neal [16] shows that as the number of hidden unions approaches infinity, Bayesian neural networks converge to Gaussian processes. Thus Gaussian processes are regarded as powerful and interpretable alternatives to neural networks. It provides a principled, practical and probabilistic approach to learning in kernel machines [17].
The property of a Gaussian process is closely related to its kernel (or covariance function). Kernels provide a depiction of smoothness and periodicity for models and parameter selection of kernel is essential for model training. There are many widely used kernels, for example, the squared exponential (SE) kernels, rational quadratic kernels, polynomial kernels and the Matérn kernels family. The expressiveness of a kernel is defined as the ability to discover the inherent patterns of the data.
Recent research [18][19][20] has proposed some expressive kernels to discover patterns without human intervention. Deep architectures are used in designing kernels. Other previous work [21,22] sought to combine the Gaussian process into a Bayesian neural network framework. However, these approaches usually do not have closed forms and are thus less interpretable. Meanwhile, sophisticated approximate inference techniques, for example, variational inference and expectation propagation, are required in the training step. Combining and modifying existing kernels to make new kernels is straightforward because the sum, the product or the convolution of two kernels is also a kernel. There are many challenges [23,24] to designing kernels according to this principle for specialized applications. Specific restrictions are enforced on these kernels to avoid overfitting and complicated inference. However, it is hard to generalize these kernels to other applications for their specialized kernel structures.
We use Gaussian mixture models (GMM) to analyze the spectral density of popular kernels and find that traditional GMM suffer from overfitting and underfitting by a fixed number of components. Therefore, we seek to combine an infinite mixture model in designing kernels. This innovation introduces more flexibility and expressiveness into designed kernels.
Our work is distinct in that we combine Dirichlet process mixture models into analyzing and designing kernels from the spectral domain. Using traditional Gaussian mixture models in kernel analysis [25] imposes restraint on the expressiveness of kernels. Dirichlet process mixture models are flexible models to fit data because the complexity of models grows as more data are observed. By using it to our advantage, we propose the infinite mixture (IM) kernel. The combination of these two traditional Bayesian nonparametric models is one of the most innovative contributions of this work. Theoretically, we prove that with possibly infinite components in mixture models, designed kernels can fit any linear or nonlinear function by arbitrary accuracy. Besides, a Markov Chain Monte Carlo (MCMC) implementation of a hierarchical Bayesian model for spectral analysis is presented and experiments are conducted on real-world temporal data which show the IM kernel's competitiveness in comparison with other kernels and its superiority in extrapolation.
To the best of our knowledge, this is the first trial to combine the Dirichlet process mixture model in designing kernels. The result is promising on account of the computational convenience and expressiveness of Dirichlet process mixture models. Our infinite version has several advantages. (1) We use MCMC methods (collapsed Gibbs sampling specifically) to avoid the local minimum problem which frequently occurs in derivative-based methods like EM used in conventional GMM. (2) The number of Gaussian components K is automatically determined by the provided data, which alleviates the requirement for expertise. On the other hand, it is beneficial for the customization of a kernel for a specific scene. (3) Setting K in advance limits the expressiveness of kernels, while an infinite version makes more details available.
The main contribution of the paper can be summarized in two parts: • We try to combine a hierarchical Bayesian model with the spectral mixture kernel. Our method shows better and more robust performance. • The original spectral mixture kernel initializes the mean of each component uniformly between 0 and the Nyquist frequency. We analyze the robustness of the spectral mixture kernel and find this strategy to have fluctuant performance. One of the strongest advantages of our method is that the infinite version has a better solution for initialization and has more stable performance.
In Section 2, we have a brief review of the Dirichlet process and the Gaussian process. We introduce a hierarchical Dirichlet mixture model in Section 3 and use this model to analyze kernels in the spectral domain and put forward the concept of the infinite mixture kernel in Section 4. In Section 5, we conduct multiple experiments to show the robustness and good performance of the IM kernel.

Dirichlet Distribution
Dirichlet distribution is an extension of Beta distribution from bivariate to multivariate and can be expressed as: where x 1 , . . . , x K > 0 and ∑ K i=1 = 1. The Beta function B(α) is the normalizer of the distribution and can be expressed in the form of Gamma function: Thus we can regard a sample from the Dirichlet distribution (x 1 , . . . , x K ) as a distribution itself. The probability density function (PDF) of the gamma distribution is: There is another version of the PDF of the gamma distribution which is also widely used: The transformation between these two versions is:

Dirichlet Process
Dirichlet processes (DP) are a family of stochastic processes whose observations are probability distributions. A Dirichlet process has two parameters, a base distribution H and a positive real-valued scalar α called scaling or concentration parameter. Given these two parameters and a measurable set S, the Dirichlet process DP(H, α) is a stochastic process whose sample path is a probability distribution over S, such that the following holds: then ((G(B 1 ), . . . , G(B n )) ∼ Dir(αH(B 1 ), . . . , αH(B n )).
The concentration parameter α is considered the pseudo count number and the base distribution H can be considered the prior knowledge of the distribution [26].

Gaussian Process
The Gaussian process is widely used as a statistical model for modeling functions. In the field of machine learning, we are not especially interested in drawing random function from the prior. There always exist raw training data or evaluation results of a model from a series of experiments iterations. These data provide extra knowledge of the model and incorporating this knowledge into the model is of primary interest. In the Gaussian process regression, we seek to predict f (x) based on limited observations. We express Gaussian process as: Suppose f (x) is a real process, we can define the mean function m(x) and the covariance function k(x, x ) as:

A Hierarchical Dirichlet Process Mixture Model
A Gaussian mixture model can be written as: where µ i denotes the mean of the i-th component, Λ i denotes the precision of the i-th component and π i denotes the weight or mixture coefficient of the i-th component which satisfies ∑ K i=1 π i = 1 and π i ≥ 0 for i = 1, . . . , K. A GMM model can be learned using expectation-maximization (EM) algorithm. However, the number of components K is set manually here which is inconvenient in practical applications because K should vary with the complexity of data.
Given hyperparameters for means, precisions and weights respectively, a hierarchical model in high dimension with a potentially infinite number of Gaussian components is implemented. We first consider inference in a finite version with K components and then extend it into infinite hierarchical models as the limiting case of the finite one. We outline our model in a probabilistic graphical model for clarity. Figure 1 illustrates the hierarchical structure embedded in the model. Nine nodes represent different collections of random variables and one node represents observations. We classify these random variables into hyperparameters, global variables and local variables. The first column contains hyperparameters with priors set by sufficient statistics from the data. The second column contains means, precisions and weights of K components which we call global variables. The node tagged by c i represents the local variable that indicates which component generates this data point.
In Gaussian mixture models, the data y j N j=1 is assumed to come from a generative model: To establish a hierarchical model, we put priors on µ i and Λ i : where G denotes the gamma distribution. First we have a deeper insight into µ. We give hyperparameters λ and r vague priors as: where µ y = 1 N ∑ N j=1 y j and Σ y = 1 N ∑ N j=1 (y j − µ y ) 2 denote the mean and variance of all observations. Figure 1. Graphical model representation of an infinite hierarchical mixture model. Nodes denote random variables. Under the condition of high dimensionality, each node can be a collection of multiple random variables. Edges denote dependence. Two blue plates denote replication of global hidden variables and the red plate denotes replication of local hidden variables and observations. Notice that blue plates have potentially infinite number of replications, while the red plate has only N replications corresponding to each data point.
When considering prior in Bayesian inference, the usage of conjugate priors allows the results to be derived in closed form. It is conventional to use Gaussian, gamma and inverse-gamma distribution as conjugate priors for scalar means, precisions and variances, respectively [27]. Considering the Wishart distribution is a generalization of the gamma distribution to multiple dimensions, we use Gaussian, Wishart and inverse-Wishart distribution as conjugate priors for high-dimensional hyperparameters. The posterior distribution of µ given other variables is: where n i = ∑ N j=1 δ(c j , i) denotes the number of data points generated by the i-th component and y i = 1 n i ∑ j:c j =i y j denotes the mean of observations in this component. As to hyperparameters, taking Equation (14) and Equation (15) as priors and Equation (13) as likelihood, we get the posteriors: where W denotes the Wishart distribution and D denotes the number of dimensions. The posterior of precision and its hyperparameters can be derived in the same way. The weight of components π 1 , . . . , π K have a hyperparameter α. We can simply set the prior of α as p α −1 ∼ G( 1 2 , 2), which means p(α) ∝ α −3/2 exp(− 1 2α ). The joint distribution of N category variables c j is: Notice that this distribution is a multinomial distribution with permutation rather than combination and the Dirichlet distribution is the conjugate prior of the multinomial distribution in Bayesian statistics. In order to bring as less prior information as possible, we give a symmetric Dirichlet distribution prior for π i : By combining Equations (18) and (19) we can integrate out the weight: Collapsed Gibbs sampling requires updating one variable each time, so c j for all j = 1, . . . , N should be considered separately. We can derive from Equation (20) directly: where c −j denotes c j for i = 1, . . . , j − 1, j + 1, . . . , N and n −j,i denotes the number of data points generated by the i-th component without the j-th data point. When K rounds towards infinity, Equation (21) , which is not equal to 1. Thus the probability of this data point belonging to a new component is p(c j = c j for all j = j|c −j , α) = α/(n − 1 + α). The mean and precision of the new component are sampled from the prior. In the meantime, components are removed when they get empty. After enough iterations, this model tends to be stabilized. The detailed derivation of all formulas in this section can be found in Appendix A.

Kernel Analysis in the Spectral Domain
In this section, we analyze kernels in their spectral domain and propose a new kernel called the infinite mixture (IM) kernel. The IM kernel is defined as a kernel with infinite Gaussian mixture components in its spectral domain. Considering the symmetric property of spectral density for stationary kernels, the spectral density of the infinite mixture kernel is designed as: where µ i , Λ i and w i are the mean, precision and weight for the i-th component respectively. The design of a kernel is of essential importance in Gaussian process. When considering interpolation, we need the covariance function to be smooth enough, which means if two input points are close to each other, we expect the value of the function at those points to be similar. This requires the covariance matrix to have the attribute that if two points are closer, the corresponding element in the matrix will be larger. The squared exponential (SE) kernel, also called the radial basis function kernel, is the most widely used kernel and has the form: It is infinitely differentiable, which means the Gaussian process fitted with this kernel has mean square derivatives in any order and is thus very smooth. A stationary kernel is a function of τ = x − x .
According to the definition, the SE kernel is a typical stationary kernel and can be reformulated as k(τ) = exp(− τ 2 /2 2 ). The application of the Bochner's theorem (Theorem 1) [28,29] enables us to analyze a stationary kernel from its spectral domain.

Theorem 1.
A complex-valued function k on R D is the function of a weakly stationary mean square continuous complex-valued random process on R D if and only if it can be represented as: where µ is a positive finite measure.
If µ has a density S(s), that is, dµ(s) = S(s)ds, then S is called the spectral density or power spectral of kernel k. Furthermore, when the spectral density exists, the Wiener-Khintchine theorem shows that the covariance function and the spectral density are Fourier duals of each other: By substituting the SE kernel in Equation (23) into (25), we can get the corresponding spectral density of the SE kernel as a Gaussian function S(s) = (2π 2 ) D/2 exp(−2π 2 2 s s), where D denotes the dimensionality of s. Note that the variance of process is k(0) = S(s)ds, which is exactly the integral in spectral domain.
Analyzing kernel from a spectral (or frequency) perspective, the characteristic of zero-centered Gaussian in the spectral density of the SE kernel restricts its expressiveness. Wilson and Adams [25] proposed the spectral mixture (SM) kernel as an extension of traditional kernels by using several Gaussian components instead of one zero-centered Gaussian.
The number of components of the SM kernel is set to a fixed value. However, without additional expertise, setting K in advance is not suitable for practical application which gives rise to underfitting or overfitting. When the inherent pattern of data is simple, setting K with a large value will take noise in frequency domain into consideration and result in overfitting. Meanwhile, when the inherent pattern is sophisticated, setting K with a small value will lead to the ignorance of details and result in underfitting.
In order to address this problem, we combine an infinite mixture model into kernel design and propose the infinite mixture kernel under a Bayesian framework. By combining Equations (25) and (22), we get the analytic expression of the IM kernel: where ∑ ∞ i=1 w i = 1 and w i ≥ 0. The IM kernel can assign K automatically as the distribution of data changes. The Wiener's Tauberian theorem (Theorem 2) gives the theory ground to the assertion that an infinite mixture of Gaussians is dense in the set of all possible distributions. This theorem is stronger because it only uses translation transformation to make the span dense in L 1 (R) function space. Various popular kernels can be reconstructed by the IM kernel via analyzing their spectral density which shows the expressiveness of the IM kernels from another point of view.
Before applying infinite mixture models, we need to get the empirical spectral density of the data first. From Theorem 1, the spectral density is proportional to the discrete Fourier transform (DFT) of the squared data. The time complexity of DFT is O(n 2 ), which can be reduced to O(n log n) using fast Fourier transform (FFT). FFT is used here to calculate the spectral density. Notice that if the observations are not equidistant, data should be preprocessed. A simple way to preprocess the data is using interpolation methods like linear interpolation, polynomial interpolation and spline interpolation. A more precise and complex approach is using a compressed sensing algorithm [30]. Further experiments show that when the input has a grid structure except for a lack of input in a few ranges, it works well without preprocessing. We refer to it as a quasi-grid structure. The empirical spectral density is taken as an unnormalized distribution and we get observations by slice sampling [31].

Experiments
In this section, we conduct a series of experiments to verify the robustness and good performance of the IM kernel on multiple datasets. We first introduce details of the implementation of the hierarchical mixture model and analyze the independence of samples. Then we perform robustness test on the SE kernel and the IM kernel, finding that performance of the SE kernel fluctuates wildly with different random seeds while IM kernel maintains good performance. Finally, we compare IM kernel with other widely used kernels in several datasets. The IM kernel shows smaller negative log-likelihood and mean square error (MSE) than others. Matlab (version 9.1) code to perform all experiments is available on GitHub (https://github.com/deeperKernelInsight/infinitemixturekernel). All experiments are conducted on a PC with a 3.20 GHz Hexa-core CPU and 16 GB RAM.

An MCMC Implementation of the Hierarchical Model
A collapsed Gibbs sampler marginalizes over a part of dependent variables while sampling from other variables. This strategy is widely used in hierarchical Bayesian models like latent Dirichlet allocation (LDA). Collapsing out the Dirichlet distribution in this model helps to accelerate sample steps. We use collapsed Gibbs sampling as the specific MCMC method to sample hyperparameters, local variables and global variables. Meanwhile, the exact order of sampling parameters influences the performance. In the implementation, we sample means and precisions for all components first, with hyperparameters next and indicator variables at last.
We use the airline passenger dataset which records monthly passenger numbers from 1949 to 1961 here to look into this MCMC process. Before the Markov chain becomes stationary, there are a few iterations for "burn-in". In all experiments below, we set 1000 iterations for "burn-in". Since collapsed Gibbs sampling draws only one parameter each time, observations close to each other tend to be correlated. Figure 2 plots the autocorrelation function of hyperparameters taking lag as x-axis. From this figure, the parameters are deemed independent of itself after 200 iterations. After the "burn-in" iterations, we record observations every 100 iterations and choose the most suitable one to refine.   Figure 3a shows the number of components in the process of training with a different number of sampling points. For universality, we set only one component in the beginning. It increases rapidly in the first few iterations and stays stable with slight fluctuations. With the number of sampling points increases, the number of components used to depict the model grows. The primary cause is that models become complex with the growth of sampling points and this is where the flexibility emerges. Figure 3b plots histogram of 100 observations for α in the condition of 10,000 sampling points. The concentrated parameter α takes values around 2.5, which means only α/(n + α) ≈ 0.025% of data points belonging to a new component.

Robustness Test
Since the learning surface for a spectral mixture kernel is often high multimodal, the original model of the SE kernel is vulnerable for its stern demand for initialization. The SE kernel takes a rather simple and greedy strategy by setting initial weights as k(0)/K, initial variances sampled from a truncated Gaussian distribution with a mean equal to the range of the data and initial frequencies sampled from a uniform distribution from 0 to its Nyquist frequency. Figure 4 shows the performance of the SM kernel with different random seeds. It performs well when the random seed takes the value 3,141,879. However, taking 579 and 1457 as random seeds results in bad performance. Notice that there are obvious performance gaps between random seeds which indicates that the simple strategy used by the SM kernel is not stable and is thus not a good option. Performances of the IM kernel with different random seeds are not plotted because results remain stable and overlap in the chart. By switching the number of sampling points, the performances are slightly different. With a small number of sampling points, kernel tends to show secular trends at the expense of details. Meanwhile, with a large number of sampling points, kernel tends to show more details while secular trend flattens out.

Performance on Multiple Datasets
The datasets we employ are publicly available and has been used in previous research [24]: (a) monthly gasoline demand in Ontario; (b) monthly death tolls from bronchitis in the UK; (c) mean monthly air temperature; (d)weekday bus ridership in Iowa city. We divide these datasets each into two parts with 2/3 data as the training set and others as the testing set. We attempt to give fair treatment to these kernels by initializing hyperparameters with high marginal log-likelihoods and fully training kernels to convergence. The marginal log-likelihood with noise observations can be expressed as: We use GPML toolbox (http://www.gaussianprocess.org) to refine parameters and evaluate other popular kernels. Figure 5 shows the predictive distribution of the IM kernel and its spectral density. The top row displays predicting results on the testing set with the training and testing set in blue and black respectively. The mean of the predictive distribution is in red. The IM kernel shows good predictive ability with the testing set located in 95% of the predictive mass shown in shadow. The bottom row shows the log spectral density of the IM kernel in red and the log empirical spectral density in blue. The learned kernel fitted well with the empirical spectral. The log spectral density of kernel in the Temperature dataset shows the ability to filter noise. Tables 1 and 2 show negative log-likelihood and MSE of the SE kernel, the Matérn kernel, the squared exponential (SE) kernel and the IM kernel with different numbers of sampling points. In experiments, we set numbers of sampling points as 5000, 10,000 and 50,000. Only the standard deviations of the IM and SM kernels are recorded in the brackets due to the space limitation. The IM kernel shows better performance than others both in negative log-likelihood and in MSE. Meanwhile, when the number of sampling performance increases, the result turns out better. While a larger number of sampling points needs a longer time to sample and inference, the trade-off between accuracy and time cost is concerned in practical application.

Conclusions
We introduced the infinite mixture kernel by combining a hierarchical Bayesian model into the analyze of the spectral domain. The extensive experiments have shown the robustness and better performance of our method than the finite version. The results are competitive with popularly used kernels and the spectral density of the IM kernel provides an extra interpretation of the inherent pattern of the data.
In future work, one can combine the IM kernel with approximate inference algorithms. What's more, in this paper we only analyze the kernel of a specific stochastic process. Although the Gaussian distribution and the Gaussian processes are widely used in many different fields, there are many other distributions and their corresponding stochastic processes, for example, the q-Gaussian distribution and the distributions which emerge in the anomalous diffusion contexts and their corresponding stochastic processes. Analyzing the performance of these unconventional distributions and their corresponding stochastic processes can result in interesting discoveries. We believe that kernel analysis from a spectral perspective helps to design more powerful kernels while a deeper investigation is still needed for future research.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript:

GP
Gaussian process GMM Gaussian mixture models DPMM Dirichlet processes mixture model PDF Probability density function CDF Cumulative distribution function SM Spectral mixture kernel SE Squared exponential kernel IM Infinite mixture kernel

Appendix A. Hierarchical Bayesian Models
A traditional Gaussian mixture model is expressed as: p (y|µ 1 , . . . , µ k , s 1 , . . . , s k , π 1 , . . . , where µ j , s j , and π j are the mean, the precision, and the proportional hyperparameters for the j-th components. The mean µ j is given Gaussain prior: where λ and r are hyperparameters. The hyperparameters λ and r are also given Gaussian and Gamma priors: where µ y and σ 2 y are the mean and variance of all observations. Take Equation (A2) as the prior and Equation (A1) as the likelihood function, we can get the posterior distribution for the means µ j : p(µ j |y j , s j , λ, r) ∝ p(µ j |s j , λ, r)p(y j |µ j , s j , λ, r) ∝ p(µ j |λ, r)p(y j |µ j , s j ) where y j is all observations belonging to class j, n j is the number of these observations, and y j = 1 n j ∑ y j,i . Take Equation (A3) as the prior and Equation (A2) as the likelihood function, we get the posterior distribution for the hyperparameter λ: Take Equation (A4) as the prior and Equation (A2) as the likelihood function, we get the posterior distribution for the hyperparameter r: The precision of each component s j is given Gamma prior: where β and w are hyperparameters with inverse Gamma and Gamma priors: Take Equation (A8) as the prior and Equation (A1) as the likelihood function, we can get the posterior distribution for the means s j : p(s j |y j , µ j , β, w) ∝ p(s j |µ j , β, w)p(y j |µ j , s j , β, w) ∝ p(s j |β, w)p(y j |µ j , s j ) Take Equation (A10) as the prior and Equation (A8) as the likelihood function, we get the posterior distribution for the hyperparameter w: p(w|s 1 , . . . , s k , β) ∝ p(w|β)p(s 1 , . . . , s k |w, β) ∝ p(w) k ∏ j=1 p(s j |w, β) Take Equation (A9) as the prior and Equation (A8) as the likelihood function, we get the posterior distribution for the hyperparameter β: p(β|s 1 , . . . , s k , w) ∝ p(β|w)p(s 1 , . . . , s k |β, w) The posterior distribution of β is in analytical form but not a stantard form. It can be proved that p (log(β)|s 1 , . . . , s k , w) is log-concave. Thus we can use adaptive rejection sampling method [32] to sample this distribution.
We can integrate out the proportional hyperparameters and get the distribution of each indicator conditioning on α: p (c 1 , . . . , c n |α) = p (c 1 , . . . , c k |π 1 , . . . , π k ) p (π 1 , . . . , π k ) dπ When using Gibbs sampling methods to sample c, we have to sample each c i conditioning on others c i . So we need to get the conditional distribution of c i : where n −i,j is the number of the j-th components except the i-th point.
α is given an inverse gamma distribution: ).