Conditional Deep Gaussian Processes: Empirical Bayes Hyperdata Learning

It is desirable to combine the expressive power of deep learning with Gaussian Process (GP) in one expressive Bayesian learning model. Deep kernel learning showed success as a deep network used for feature extraction. Then, a GP was used as the function model. Recently, it was suggested that, albeit training with marginal likelihood, the deterministic nature of a feature extractor might lead to overfitting, and replacement with a Bayesian network seemed to cure it. Here, we propose the conditional deep Gaussian process (DGP) in which the intermediate GPs in hierarchical composition are supported by the hyperdata and the exposed GP remains zero mean. Motivated by the inducing points in sparse GP, the hyperdata also play the role of function supports, but are hyperparameters rather than random variables. It follows our previous moment matching approach to approximate the marginal prior for conditional DGP with a GP carrying an effective kernel. Thus, as in empirical Bayes, the hyperdata are learned by optimizing the approximate marginal likelihood which implicitly depends on the hyperdata via the kernel. We show the equivalence with the deep kernel learning in the limit of dense hyperdata in latent space. However, the conditional DGP and the corresponding approximate inference enjoy the benefit of being more Bayesian than deep kernel learning. Preliminary extrapolation results demonstrate expressive power from the depth of hierarchy by exploiting the exact covariance and hyperdata learning, in comparison with GP kernel composition, DGP variational inference and deep kernel learning. We also address the non-Gaussian aspect of our model as well as way of upgrading to a full Bayes inference.


Introduction
The deep Gaussian process [1] is a Bayesian learning model which combines both the expressive power of deep neural networks [2] and calibrated uncertainty estimation. The hierarchical composition of Gaussian Processes (GPs) [3] is the origin of expressiveness, but also renders inference intractable, as the marginalization of GPs in the stage of computing evidence is not analytically possible. Expectation propagation [4,5] and variational inference [6][7][8][9] are approximate inference schemes for DGP. The latter has issues of posterior collapse, which turns DGP into a GP with transformed input. References [8,9] address this issue and compositional freedom [10] in such hierarchical learning. Nevertheless, inferential challenges continue to slow the adoption of DGP.
Despite challenges, there has been progresses in understanding this seemingly simple yet profound model. In the case where the GPs in the hierarchy are zero-mean, DGP exhibits pathology, becoming a constant function as the depth increases [11]. Using the fact that the exponential covariance function is strictly convex, references [12,13] studied the conditional statistics for squared distance in function space, suggesting region in hyperparameter space to avoid the pathology. Recently, reference [14] showed the connection between DGP and a deep neural network with bottlenecked layers, and reference [15] suggested that a DGP with a large width may collapse back to a GP.

Related Work
In the literature on deep probabilistic models, [26] proposed the conditional neural process in which the mean and variance functions are learned from the encoded representation of context data in a regression setup for target data. Deep Gaussian processes (DGPs) constitute one family of models for composition functions by conditioning input to a GP on the output of another GP [1]. A similar idea appeared in the works of warped GP [27,28]. The implicit process in [29] is a stochastic process embedding the Gaussian distribution into a neural network. Solutions of stochastic differential equation driven by GP are also examples of composite processes [30]. Variational DGP casts the inference problem in terms of optimizing ELBO [6] or EP [5]. However, the multi-modalness of DGP posterior [10,23] may arise from the fact that the hidden mappings in intermediate layers are dependent [9]. Inference schemes capable of capturing the multi-modal nature of DGP posterior were recently proposed by [8,9]. Depth of neural network models and the function expressivity were studied in [31,32], and uncertainty estimates were investigated in [33]. DGP in weight space representation and its variational Bayesian approach to DGP inference were introduced in [34], which were based on the notion of random feature expansion of Gaussian [35] and arcsine [36] kernels. Deep hierarchical SVMs and PCAs were introduced in [37].
Moment matching is a way to approximate a complex distribution with, for instance, a Gaussian by capturing the mean and the second moment. Reference [38] considered a GP regression with uncertain input, and replaced the non-Gaussian predictive distribution with a Gaussian carrying the matched mean and variance. Expectation propagation, in [4], computed the vector of mean and variance parameters of non-Gaussian posterior distributions. Reference [21] approximated the distribution over unseen pixels as a multivariate Gaussian with matched mean and covariance. Moment matching is also extensively applicable to comparing two distributions [39] where the embedded means in RKHS are computed. In generative models, the model parameters are learned from comparing the model and data distributions [40].
Inducing points are an important technique in sparse GP [20,22,41,42] and DGP. In addition to being locally defined as a function's input and output, [43] introduced a transformation to form a global set of inducing features. One popular transformation uses the basis of Gaussian so that one can recover the local inducing points easily [43]. Transformation using the basis of spherical harmonic functions in [44] allows orthogonal inducing features and connects with the arcsine kernels of Bayesian deep neural network [45]. Reference [46] employed the inter-domain features in DGP inference. Recently, [47] proposed a method to express the local inducing points in the weight space representation. All the methods cited here treated the inducing points or features in a full Bayes approach, as they are random variables associated with an approximate distribution [24].

Background
Here, we briefly introduce the notions of the Gaussian process as a model for random continuous function f (x) : R d → R. A deep Gaussian process [1] is a hierarchical composition of Gaussian processes for modeling general composite function where the bold faced function f 1 : R d → R H 1 has an output consisting of H 1 independent GPs, and similarly for f 2 : R H 1 → R H 2 and so on. The depth and width of DGP are thus denoted by L and H 1:L , respectively.

Gaussian Process
In machine learning, the attention is often restricted to the finite set of correlated random variables f := { f (x 1 ), · · · , f (x N )} corresponding to the design location X = (x 1 , · · · , x N ) T . Denoting f i := f (x i ), the above set of random variables is a GP if and only if the following relations, are satisfied for all indices i, j. For convenience, we can use f ∼ GP (µ, k) to denote the above. The mean function µ(·) : R d → R and the covariance function k(·, ·) : R d × R d → R then fully specify the GP. One can proceed to write down the multivariate normal distribution as the pdf The covariance matrix K has matrix element K ij = k(x i , x j ), characterizing the correlation between the function values. The covariance function k encodes function properties such as smoothness. The vector m := µ(X) represents the mean values at corresponding inputs. Popular covariance functions include the squared exponential (SE) k(x i , x j ) = σ 2 exp[−||x i − x j || 2 /(2 2 )] and the family of Matern functions. The signal magnitude σ and length scale are hyper-parameters. The conditional property of Gaussians allows one to place constraint on the model p(f). Given a set of function values u = f (Z), the space of random function f now only includes those passing through these fixed points. Then the conditional pdf p(f|u) has the conditional mean and covariance: where the matrix K XZ represents the covariance matrix evaluated at X against Z.

Deep Gaussian Process
We follow the seminal work in [1] to generalize the notion of GP to the composite functions f L • f L−1 • · · · f 2 • f 1 (x). In most literature, DGP is defined from a generative point of view. Namely, the joint distribution for the simplest zero-mean DGP with L = 2 and H 2 = H 1 = 1 can be expressed as with the conditional defined as f 2 | f 1 ∼ GP (0, k( f 1 , f 1 )) and f 1 ∼ GP (0, k(X, X).

Marginal Prior, Covariance and Marginal Likelihood
In the above DGP model, the exposed GP for f 2 is connected with the data output y, and the intermediate GP for f 1 with the data input X. In Bayesian learning, both f s shall be marginalized when computing the evidence. Now we define the marginal prior as in which the bold faced f 1 representing the set of intermediate function values are marginalized, but the exposed f 2 is not. Note that the notation f (x) = f 2 ( f 1 (x)) is not ambiguous in a generative view, but may cause some confusion in the marginal view as the label f 1 has been integrated out. To avoid confusing with the exposed function f 2 (·), we still use f (·) to denote the marginalized composite function unless otherwise stated. Motivated to write down an objective in terms of marginal likelihood, the moment matching method in [23] was proposed, so one can approximate Equation (6) with a multivariate Gaussian q(f|X) such that the mean and the covariance are matched. In the zero-mean DGP considered in [23], the covariance matching refers to In the case where the squared exponential kernel is used in both GPs, the approximate [23]. The hyperparameters include the length scale and signal magnitude σ with layer indexed at the subscript.
Consequently, the evidence of the data X, y associated with the 2-layer DGP can be approximately expressed as Thus, the learning of hyperparameters σs and s in the zero-mean DGP model is through the gradient descent on log p(y|X), and the gradient components ∂K ∂ 1,2 and ∂K ∂σ 1,2 are needed in the framework of GPy [48].

Model
Following the previous discussion, we shall introduce the model of conditional DGP along with the covariance and marginal prior. The mathematical connection with deep kernel learning and the non-Gaussian aspect of marginal prior will be discussed. The difference between the original DGP and the conditional DGP is that the intermediate GPs in the latter are conditioned on the hyperdata. Learning the hyperdata via the approximate marginal likelihood is, loosely speaking, an empirical Bayesian learning of the feature function in the setting of deep kernel learning.

Conditional Deep Gaussian Process
In the simple two-layer hierarchy with width H 1 = H 2 = 1, the hyperdata {Z, u} = {z 1:M ∈ R d , u 1:M ∈ R} are introduced as support for the intermediate GP for f 1 , and the exposed GP for f 2 remains zero-mean and does not condition on any point. Thus, f 1 can be viewed as a space of random functions constrained with f 1 (z 1:M ) = u 1:M , and the Gaussian distribution p( f 1 (x 1:N )|Z, u) has its conditional mean and covariance in Equation (3) (with m on RHS set to zero) and (4), respectively. Following Equation (6), the marginal prior for this conditional DGP can be similarly expressed as With f 1 being conditioned on the hyperdata {Z, u}, one can see that the multivariate Gaussian p( f 1 (x 1:N )|Z, u) emits samples in the space of random functions passing through the fixed hyperdata so that Equation (9) is a sum of an infinite number of GPs. Namely, with f 1 under the constraints due to the hyperdata and the smoothness implied in kernel k 1 . Therefore, f are represented by an ensemble of GPs with same kernel but different feature functions. We shall come back to this point more rigorously in Section 4.2. Now we shall approximate the intractable distribution in Equation (9) with a multivariate Gaussian q(f|X, Z, u) carrying the matched covariance. The following lemma is useful for the case where the exposed GP for f 2 | f 1 uses the squared exponential (SE) kernel. [49]) The covariance in p(f) (Equation (9)) with the SE kernel k 2 (x, y) = σ 2 2 exp[−(x − y) 2 /2 2 2 ] in the exposed GP for f 2 | f 1 can be calculated analytically. With the Gaussian conditional distribution, p(f 1 |X, Z, u), supported by the hyperdata, the effective kernel reads

Lemma 1. (Lemma 3 in
where m i,j := m(x i,j ) and c ij : ) are the conditional mean and covariance, respectively, at the inputs x i,j . The positive parameter δ 2 ij := c ii + c jj − 2c ij and the the length scale 2 dictates how the uncertainty about f 1 affects the function composition.
Next, in addition to the hyperparameters such as σs and s, the function values u 1:M are hyperdata that shall be learned from the objective. With approximating the non-Gaussian marginal prior p(f|X, Z, u) with q(f|X, Z, u), we are able to compute the approximate marginal likelihood as the objective The learning of all hyperparameter data follows the standard gradient descent used in GPy [48], and the gradient components include the usual ones, such as ∂K eff /∂ 2 in exposed layer and those related to the intermediate layer ∂K eff /∂ 1 and the hyperdata ∂K eff /∂u 1:M through chaining with ∂K eff /∂(m i − m j ) and ∂K eff /∂δ 2 ij via Equations (3) and (4). To exploit the expressive power of neural network during optimization, the hyperdata can be further modeled by a neural network; i.e., with w denoting the weight parameters. In such case, the weights w are learned instead of the hyperdata u 1:M .

When Conditional DGP Is Almost a GP
In the limiting case where the probabilistic nature of f 1 is negligible, then the conditional DGP becomes a GP with the transformed input; i.e., the distribution p(f 1 |X, Z, u) becomes highly concentrated around a certain conditional meanf 1 (x). To get insight, we reexamine the covariance in the setting where f 1 is almost deterministic. We can reparameterize the random function f 1 at two distinct inputs x 1,2 for the purpose of computing covariance: where m(x) is the conditional mean given the fixed Z and u. The random character lies in the two correlated random variables, ( i , j ) T ∼ N (0, C), corresponding to the weak but correlated signal around zero. Under that assumption, we follow the analysis in [9,38] and prove the following lemma.

Lemma 2.
Consider p( f ) defined in Equation (9), with f 2 | f 1 being a more general GP (µ 2 , k 2 ) and f 1 reparametrized as in Equation (13). The covariance, cov( , has the following form: where the cs are matrix elements of kernel matrix C associated with the weak random variables i,j in Equation (13). The notation m i,j := m(x i,j ) and prime as derivative is used.

Proof.
The assumption is that f 2 | f 1 ∼ GP (µ 2 , k 2 ) and that f 1 (x) a weak random function (x) overlaying a fixed function m(x). At any two inputs x i,j , we expand the target function f to the second order: where the shorthanded notation m i := m(x i ) and (x i ) := i is used. Note that ( i , j ) is bivariate Gaussian with zero mean and covariance matrix C.
Then one uses the fact that f 2 | f 1 , f 2 | f 1 and f 2 | f 1 are jointly Gaussian to compute the conditional covariance, which can be expressed in a compact form: The operatorÔ accounts for the fact that cov[ f 2 (m i ), f 2 (m j )] = ∂ 2 m i m j k 2 (m i , m j ) and cov[ f 2 (m i ), f 2 (m j )] = ∂ 2 m j k 2 (m i , m j ). Thus, the operator readŝ Now we are ready to deal with the outer expectation with respect to the s. Note that the covariance c ij := E[ i j ] = c(x i , x j ) and variance c ii := E[ 2 i ] = c(x i , x i ) are matrix elements of C. Consequently, we prove the total covariance in Equation (14).
j ) hold for the stationary k 2 , the above covariance (Equation (14)) with µ 2 = 0 is identical to the effective kernel in Equation (10) in the limit 2 2 δ 2 ij , which reads Such a situation occurs when the inputs Z in hyperdata are dense enough so that f 1 becomes almost deterministic.
Consequently, in the limit when the conditional covariance in δ 2 is small compared with the length scale 2 2 , Equation (19) indicates that the effective kernel is the SE kernel with a deterministic input m(x), which is equivalent to the deep kernel with SE as the base kernel (see Equation (5) in [16]). On the other hand, when δ 2 and 2 are comparable, the terms within the first bracket in the RHS of Equation (19) are a non-stationary function which may attribute multiple frequencies in the function f . The deep kernel with the spectral mixture kernel (Equation (6) in [16]) as the base is similar to the effective kernel.

Non-Gaussian Aspect
The statistics of the non-Gaussian marginal prior p(f|X, Z, u) are not solely determined by the moments up to the second order. The fourth moment can be derived in a similar manner in [23] with the help of the theorem in [50]. Relevant discussion of the heavy-tailed character in Bayesian deep neural network can be found in [51][52][53]. See Lemma A1 for the details of computing the general fourth moment in the case where SE kernel is used in f 2 | f 1 in the conditional 2-layer DGP. Here, we briefly discuss the non-Gaussian aspect, focusing on the variance of covariance, i.e., by comparing E with p being the true distribution (Equation (9)) and q being the approximating Gaussian.
In the SE case, one can verify the difference in the fourth order expectation value: where we have used the fact that the inequalities (1 + hold. Therefore, the inequality suggests the heavytailed statistics of the marginal prior p( f (x i ), f (x j )) over any pair of function values.

Results
The works in [54,55] demonstrate that GPs can still have superior expressive power and generalization if the kernels are dedicatedly designed. With the belief that deeper models generalize better than the shallower counterparts [56], DGP models are expected to perform better in fitting and generalization than GP models do if the same kernel is used in both. However, such expectation may not be fully realized, as the approximate inference may lose some power in DGP. For instance, diminishing variance in the posterior over the latent function was reported in [9] regarding the variational inference for DGP [6]. Here, with a demonstration of extrapolating real-world time series data with the conditional DGP, we shall show that the depth, along with optimizing the hyperdata, does enhance the expressive power and the generalization due to the multiple length scale and multiplefrequency character of the effective kernel. In addition, the moment matching method as an approximate inference for conditional DGP does not suffer from the posterior collapse. The simulation codes can be found in the github repository. Figure 1a,b shows fitting and extrapolating the classic carbon dioxide data (yellow marks for training, red for test) with GPs using, respectively, the SE kernel and a mixture of SE, periodic SE and rational quadratic kernels [3].

Mauna Loa Data
All the θs are hyperparameters in the mixture kernel. As a result of the multiple time scales appearing in the data, the vanilla GP fails to capture the short time trend, but the GP with mixture of kernels can still present excellent expressivity and generalization. The log marginal likelihoods (logML) were 144 and 459 for the vanilla GP and kernel mixture GP, respectively. The two-layer zero-mean DGP with SE kernel in both layers performed better than the single-layer counterpart. In Figure 1c, Figure 1. Extrapolation of standardized CO 2 time series data (yellow dots for training and red dots for test) using GP with three kernels. The dark solid line represents the predictive mean, and the shaded area is the the model's confidence. Panel (a) displays the result using a single GP with an SE kernel. Panel (b) was obtained following the kernel composition in [3]. Panel (c) came from using the effective kernel of 2-layer zero-mean DGP with SE used in both layers [23]. Next, we shall see whether improved extrapolation can arise in other deep models or other inference schemes. In Figure 2, the results from DKL and from DGP using the variational inference are shown. Both were implemented in GPFlux [57]. We modified the tutorial code for hybrid GP with three-layer neural network as the code for DKL. The result in Figure 2a does not show good fitting nor good extrapolation, which is to some extent consistent with the simulation of a Bayesian neural network with ReLu activation [32]. As for the DGP using variational inference, the deeper models do not show much improvement compared to the vanilla GP, and the obtained ELBO was 135 for the two-layer DGP (Figure 2b), and it was 127 for three-layer (Figure 2c). Now we continue to show the performance of our model. In the two-layer model, we have 50 points in hyperdata supporting the intermediate GP. A width-5 tanh neural network is used to represent the hyperdata, i.e., u 1:50 = nn w (z 1:50 ). Then, the hyperparameters, including σ 1,2 , 1,2 and weight parameter w, were learned from gradient descent upon the approximate marginal likelihood. The top panel in Figure 3a displays the prediction and confidence from the posterior over f 1 , obtained from a GP conditioned on the learned hyperparameters and hyperdata. The logML of the two-layer model was also 338, the same as the SE[SE] GP, and in the bottom panel of Figure 3a one can observe a good fit with the training data. More importantly, the extrapolation shows some high frequency signal in the confidence (shaded region). In comparing ot with Figure 3 of [54], the high-frequency signal only appeared after a periodic kernel was inserted. We attribute the high-frequency signal to the propagation of uncertainty in f 1 (top panel) to the exposed layer (see discussion in Section 4.2).
Lastly, the three-layer model using 37 and 23 hyperdata in the f 1 and f 2 layer, respectively, has its results in Figure 3b. Those hyperdata were parameterized by the same neural network used in the two-layer model. The training had a logML of 253, resulting in a good fit with the data. The extrapolation captured the long term trend in its predictive mean, and the test data were mostly covered in the confidence region. In the latent layers, more expressive patterns overlaying the latent mappings seemed to emerge due to the uncertainty and the depth of the model. The learned σ 1,2,3 ≈ (0.49, 0.86, 2.56) and 1,2,3 ≈ (0.014, 1.2, 0.46) show that different layers managed to learn different resolutions.

Airline Data
The models under consideration can be applied to the airline data too. It can be seen in Figure 4 that the vanilla GP was too simple for the complex time-series data, and the GP with the same kernel composition could both fit and extrapolate well. The logML values were -11.7 and 81.9 for the vanilla GP and kernel mixture GP, respectively. Similarly, the SE[SE] kernel captured the multiple length scale character in the data, resulting in a good fit with logML 20.9, but poor extrapolation. Here, we display the results using the DKL, variational inference DGP, both two-layer and three-layer, in Figure 5. For the airline data, the DKL with a ReLu neural network as feature extractor panel (Figure 5a) had similar performance to its counterpart on CO 2 data, as did the variational DGP (Figure 5b,c). Our two-layer model, aided by the probabilistic latent layer supported by 13 hyperdata, showed improved extrapolation along with the high-frequency signal in prediction and confidence. The optimal logML was 28.5, along with the learned σ 1,2 = (3.03, 0.73), 1,2 = (0.026, 2.19) and noise level σ n = 0.004. As shown in Figure 6b, the latent function supported by learned hyperdata shows an increasing trend on top of an oscillating pattern, which led to the periodic extrapolation in the predictive distribution, albeit only the vanilla kernels were used. It is interesting to compare Figure 6b with Figure 6a, as the latter model had 23 hyperdata supporting the latent function, and the vanishing uncertainty learned in the latent function produced an extrapolation that collapsed to zero. The logML in Figure 6a

Discussion
What did we gain and lose while modifying the original DGP defined in Equation (6) by additionally conditioning the intermediate GPs on the hyperdata? On one hand, when the hyperdata are dense, the conditional DGP is mathematically connected with the deep kernel learning, i.e., a GP with warped input. On the other hand, in the situations when less dense hyperdata are present and the latent GPs are representations of random functions passing through the hyperdata, the conditional DGP can be viewed as an ensemble of deep kernels, and the moment matching method allows us to express it in a closed form. What do we lose in such an approximation? Apparently, the approximate q for the true marginal prior p in Equation (9) cannot account for the heavy-tailed statistics.
In the demonstration, the presence of hyperdata constrains the space of the intermediate functions and moves the mass of the function distribution toward the more probable ones in the process of optimization. Comparing the SE[SE] GP, which represents an approximate version of zero-mean 2-layer DGP, against the conditional DGP model, the constrained space of intermediate functions does not affect the learning significantly, and the generalization is improved. Besides, the uncertainty in the latent layers is not collapsed.
One possible criticism of the present model may result from the empirical Bayes learning of the weight parameters. Although the weight parameters are hyperparameters in both our model and in DKL, it is important to distinguish that the weight parameters in our model parameterize u 1:M , which supports the intermediate GP, representing an ensemble of latent functions. In DKL, however, the weight parameters fully determine the one latent function, which might lead to overfitting even though marginal likelihood is used as an objective [19]. A possible extension is to consider upgrading the hyperdata to random variables, and the associated mean and variance in q(u 1:M ) can also be modeled as neural network functions of Z. The moment matching can then be applied to approximate the marginal prior df 1 dup(f 2 |f 1 )p(f 1 |X, Z, u)q(u).

Conclusions
Deep Gaussian processes (DGPs), based on nested Gaussian processes (GPs), offer the possibility of expressive inference and calibrated uncertainty, but are limited by intractable marginalization. Approximate inference for DGPs via inducing points and variational inference allows scalable inference, but incurs costs by limiting expressiveness and causing an inability to propagate uncertainty. We introduce effectively deep kernels with optimizable hyperdata supporting latent GPs via a moment-matching approximation. The approach allows joint optimization of hyperdata and GP parameters via maximization of marginal likelihood. We show that the approach avoids mode collapse, connects DGPs and deep kernel learning, effectively propagating uncertainty. Future directions on conditional DGP include that consideration of randomness in the hyperdata and the corresponding inference.