Conditional Deep Gaussian Processes: Multi-Fidelity Kernel Learning

Deep Gaussian Processes (DGPs) were proposed as an expressive Bayesian model capable of a mathematically grounded estimation of uncertainty. The expressivity of DGPs results not only from their compositional character but also from the distribution propagation within the hierarchy. Recently, it was pointed out that the hierarchical structure of DGP is well suited to modeling multi-fidelity regression, in which one is provided sparse observations with high precision together with plenty of low fidelity observations. We propose the conditional DGP model, in which the latent GPs are directly supported by the fixed lower fidelity data. The moment matching method is then applied to approximate the marginal prior of the conditional DGP with a GP. The obtained effective kernels are implicit functions of the lower-fidelity data, manifesting the expressivity contributed by distribution propagation within the hierarchy. The hyperparameters are learned by optimizing the approximate marginal likelihood. Experiments with synthetic and high dimensional data show comparable performance against other multi-fidelity regression methods, variational inference, and multi-output GP. We conclude that, with the low fidelity data and the hierarchical DGP structure, the effective kernel encodes the inductive bias for the true function while retaining the compositional freedom.


Introduction
Multi-fidelity regression refers to a category of learning tasks in which a set of sparse data is given to infer the underlying function, but a larger amount of less precise or noisy observations is also provided. Multi-fidelity tasks frequently occur in various fields of science because precise measurement is often costly while approximate measurements are more affordable (see [1,2] for example). The assumption that the more precise function is a function of the less precise one [1,3] is shared by some hierarchical learning algorithms (e.g., one-shot learning in [4], meta learning [5], and continual learning [6]). Thus, one can view the plentiful low fidelity data as a source of prior knowledge so that the function can be efficiently learned with sparse data.
In Gaussian Process (GP) regression [7] domain experts can encode their knowledge into the combinations of covariance functions [8,9], building an expressive learning model. However, construction of an appropriate kernel becomes less clear when building a prior for the precise function in the context of multi-fidelity regression because the uncertainty, both epistemic and aleatoric, in the low fidelity function prior learned by the plentiful data should be taken into account. It is desirable to fuse the low fidelity data to an effective kernel as a prior, taking advantage of marginal likelihood being able to avoid overfitting, and then perform the GP regression as if only the sparse precise observations are given.
Deep Gaussian Process (DGP) [10] and similar models [11,12] are expressive models with a hierarchical composition of GPs. As pointed out in [3], a hierarchical structure is particularly suitable for fusing data of different fidelity into one learning model. Although full Bayesian inference is promising in obtaining expressiveness while avoiding overfitting, exact inference is not tractable, and approximate solutions such as the variational approach [13-15] are employed. Ironically, the major difficulty in inference comes from the marginalization of the latent GPs in Bayesian learning, which, on the flip side, is also why overfitting can be prevented.
We propose a conditional DGP model in which the intermediate GPs are supported by the lower fidelity data. We also define the corresponding marginal prior distribution which is obtained by marginalizing all GPs except the exposed one. For some families of kernel compositions, we previously developed an analytical method in calculating exact covariance in the marginal prior [16]. As such, the method is applied here so the marginal prior is approximated as a GP prior with an effective kernel. The high fidelity data are then connected to the exposed GP, and the hyperparameters throughout the hierarchy are optimized via the marginal likelihood. Our model, therefore, captures the expressiveness embedded in hierarchical composition, retains the Bayesian character hinted in the marginal prior, but loses the non-Gaussian aspect of DGP. From the analytical expressions, one can partially understand the propagation of uncertainty in latent GPs as it is responsible for the non-stationary aspect of effective kernels. Moreover, the compositional freedom, i.e., different compositions may result in the same target function, in a DGP model [17,18] can be shown to be intact in our approach.
The paper is organized as follows. In Section 2, we review the literature on multi-fidelity regression models and deep kernels. A background on GP, DGP, and the moment matching method is introduced in Section 3. The conditional DGP model, defined as a marginal prior, and the exact covariance associated with two families of kernel compositions are discussed in Section 4. The method of hyperparameter learning is given in Section 5, and simulations of synthetic and high dimensional multi-fidelity regression in a variety of situations are presented in Section 6. A brief discussion and the conclusion appear in Sections 7 and 8, respectively.

Related Work
Assuming autoregressive relations between data of different fidelity, Kennedy and O'Hagan [1] proposed the AR1 model for multi-fidelity regression tasks. Le Gratiet and Garnier [19] improved computational efficiency with a recursive multi-fidelity model. Deep-MF [20] mapped the input space to a latent space and followed the work in Kennedy and O'Hagan [1]. NARGP [21] stacked a sequence of GPs in which the posterior mean of the low-fidelity function is passed to the input of the next GP while the associated uncertainty is not. GPAR [22] uses a similar conditional structure between functions of interest. MF-DGP in [3] exploited the DGP structure for multi-fidelity regression tasks and used the approximate variational inference in [13]. Multi-output GPs [23,24] regard the observations from different data sets as realizations of a vector-valued function; [25] modeled the multi-output GP using a general relation between multiple target functions and multiple hidden functions. Alignment learning [26,27] is an application of warped GP [11,12] to time series data. We model the multi-fidelity regression as kernel learning, effectively taking into account the space of functions representing the low fidelity data.
As for general studies of deep and non-stationary kernels, Williams [28] and Cho and Saul [29] used the basis of error functions and Heaviside polynomial functions to obtain the arc-sine and arc-cosine kernel functions, respectively, of neural networks. Duvenaud et al. [30] employed the analogy between neural networks and GPs, and constructed the deep kernel for DGP. Dunlop et al. [31] analyzed a variety of non-stationary kernel compositions in DGP, and Shen et al. [32] provided an insight from the Wigner transformation of general two-input functions. Wilson et al. [33] proposed a general recipe for constructing deep kernels with neural networks. Daniely et al. [34] computed the deep kernel from the perspective of two correlated random variables. Mairal et al. [35] and Van der Wilk et al. [36] studied deep kernels in convolutional models. The moment matching method [16] allows obtaining the effective kernel encoding the uncertainty in learning the lower fidelity function.

Gaussian Process and Deep Gaussian Process
Gaussian Process (GP) [7] is a popular Bayesian learning model in which the prior over a continuous function is modeled as a Gaussian. Denoted by f ∼ GP(µ, k), the collection of any finite set of function values f(x_1), f(x_2), …, f(x_N) follows a multivariate Gaussian distribution. Thus, a continuous and deterministic mean function µ(·) and a positive definite kernel function k(·, ·) fully specify the stochastic process. It is common to consider the zero-mean case and write down the prior distribution, p(f|X) = N(0, K) with covariance matrix K. In the setting of a regression task with input and output data {X, y}, the hyperparameters in the mean and kernel functions can be learned by optimizing the marginal likelihood, p(y|X) = ∫ df p(y|f) p(f|X).
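The zero-mean GP marginal likelihood above can be evaluated in a few lines of NumPy. The following is a minimal illustrative sketch, not code from the paper; the SE kernel, the noise level, and all function names are our own choices:

```python
import numpy as np

def se_kernel(X1, X2, sigma=1.0, ell=1.0):
    """Squared exponential kernel k(x, x') = sigma^2 exp(-(x - x')^2 / (2 ell^2))."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return sigma**2 * np.exp(-0.5 * d2 / ell**2)

def log_marginal_likelihood(X, y, sigma=1.0, ell=1.0, noise=0.1):
    """log p(y|X) = -1/2 y^T Kn^{-1} y - 1/2 log|Kn| - N/2 log(2 pi),
    with Kn = K + noise^2 I, computed stably via a Cholesky factorization."""
    N = len(X)
    Kn = se_kernel(X, X, sigma, ell) + noise**2 * np.eye(N)
    L = np.linalg.cholesky(Kn)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # log|Kn| = 2 * sum(log diag(L))
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * N * np.log(2 * np.pi)
```

Optimizing this quantity with respect to (sigma, ell, noise) recovers the standard GP hyperparameter learning mentioned in the text.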
Deep Gaussian Process (DGP) was proposed in [10] as a hierarchical composition of GPs for superior expressivity. From a generative view, the distribution over the composition reads

p(y, F^L, …, F^1 | X) = p(y | F^L) ∏_{i=1}^{L} p(F^i | F^{i-1}),

in which the boldface capital symbol F^i stands for a multi-output GP in the ith layer, and the independent components have p(f^i | F^{i-1}) = N(0, K(F^{i-1}, F^{i-1})). The above is the general DGP, and the width of each layer is denoted by H_i := |F^i|. In this notation, the zeroth layer represents the collection of inputs X. Here, we shall consider the DGP with L = 2 and H_2 = H_1 = 1 and the three-layer counterpart. The intractability of exact inference is a result of the fact that the random variables F^i for L > i > 0 appear in the covariance matrix K. In a full Bayesian inference, these random variables are marginalized in order to estimate the evidence p(y|X) associated with the data.

Multi-Fidelity Deep Gaussian Process
The multi-fidelity model in [1] considered the regression task for a data set consisting of observations measured with both high and low precision. For simplicity, the more precise observations are denoted by {X, y} and those with low precision by {X_1, y_1}. The main assumption made in [1] is that the less precise observations shall come from a function f_1(x) modeled by a GP with zero mean and kernel k, while the more precise ones come from the combination f(x) = α f_1(x) + h(x). With the residual function h being a GP with kernel k_h, one can jointly model the two subsets: the covariance within the precise observations is α² k_ij + k_{h,ij}, that within the less precise observations is k_ij, and the cross covariance is α k_ij, where k_ij refers to the covariance between the two inputs at x_i and x_j.
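The AR1 joint covariance described above can be assembled explicitly. Below is a minimal NumPy sketch under our own illustrative parameter choices (SE kernels for both k and k_h; the function names are not from any package):

```python
import numpy as np

def se(X1, X2, sigma, ell):
    """Squared exponential kernel matrix between two 1-D input sets."""
    return sigma**2 * np.exp(-0.5 * (X1[:, None] - X2[None, :])**2 / ell**2)

def ar1_joint_cov(X, X1, alpha, k_params, kh_params):
    """Joint covariance of [f(X); f1(X1)] under f = alpha * f1 + h,
    with independent GPs f1 ~ GP(0, k) and h ~ GP(0, k_h)."""
    Kff = alpha**2 * se(X, X, *k_params) + se(X, X, *kh_params)  # among precise
    Kf1 = alpha * se(X, X1, *k_params)                           # cross covariance
    K11 = se(X1, X1, *k_params)                                  # among less precise
    return np.block([[Kff, Kf1], [Kf1.T, K11]])
```

Because f and f_1 are jointly Gaussian under the AR1 assumption, this block matrix is a valid (positive semi-definite) covariance and can be plugged into a single GP regression over the stacked observations.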
The work in [3] generalized the above linear relationship between the more and less precise functions to a nonlinear one, i.e., f(x) = f_2(f_1(x)) + noise. The hierarchical structure in DGP is suitable for such nonlinear modeling. The variational inference scheme [13] was employed to evaluate the evidence lower bounds (ELBOs) for the data at all levels of precision, and the sum over all ELBOs is the objective for learning the hyperparameters and inducing points.

Covariance in Marginal Prior of DGP
The variational inference, e.g., [13], starts with connecting the joint distribution p(f_1, f_2|X) with the data y, followed by applying Jensen's inequality along with an approximate posterior in evaluating the ELBO. In contrast, we proposed in [16] that the marginal prior for the DGP, with the boldface symbols representing the sets of function values f(·) = f_2(f_1(·)), f_2(·), and f_1(·), can be approximated as a GP, i.e., q(f|X) = N(0, K_eff) in the zero-mean case. Matching the covariance in p and q leads to a closed form of the effective covariance function for certain families of kernel compositions. The SE[SC] composition, i.e., the squared exponential and squared cosine kernels being used in the GPs for f_2|f_1 and f_1, respectively, is an example. With the intractable marginalization over the intermediate f_1 taken care of by the moment matching approximation, one can evaluate the approximate marginal likelihood for the data set {X, y}. In the following, we develop, along the lines of [16], the approximate inference for the multi-fidelity data consisting of precise observations {X, y} and less precise observations {X_{1:L-1}, y_{1:L-1}} with the L-layer width-1 DGP models. The effective kernels k_eff shall encode the knowledge built on these less precise data, which allows modeling the precise function even with a sparse data set.

Conditional DGP and Multi-Fidelity Kernel Learning
In the simplest case, we are given two subsets of data, {X, y} with high precision and {X_1, y_1} with low precision. We can start by defining the conditional DGP in terms of the marginal prior, where the Gaussian distribution p(f_1|X, X_1, y_1) = N(f_1(x_{1:N})|m, Σ) has the conditional mean in vector form,

m = K_{X,X_1} K_{X_1,X_1}^{-1} y_1, (5)

and the conditional covariance in matrix form,

Σ = K_{X,X} - K_{X,X_1} K_{X_1,X_1}^{-1} K_{X_1,X}. (6)

The matrix K_{X,X_1} registers the covariance among the inputs in X and X_1, and likewise for K_{X,X} and K_{X_1,X_1}. Thus, the set of function values f_1(x_{1:N}) associated with the N inputs in X is supported by the low fidelity data.
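The conditional mean and covariance in Equations (5) and (6) are the standard GP conditioning formulas and can be sketched directly in NumPy. This is an illustrative implementation with our own helper names and an SE kernel with assumed hyperparameters; a noise/jitter term is added to K_{X_1,X_1} for numerical stability:

```python
import numpy as np

def se(A, B, sigma=1.0, ell=0.2):
    """Squared exponential kernel matrix for 1-D inputs."""
    return sigma**2 * np.exp(-0.5 * (A[:, None] - B[None, :])**2 / ell**2)

def condition_f1(X, X1, y1, noise=1e-3):
    """m = K_{X,X1} (K_{X1,X1} + noise I)^{-1} y1,
    Sigma = K_{X,X} - K_{X,X1} (K_{X1,X1} + noise I)^{-1} K_{X1,X}."""
    K11 = se(X1, X1) + noise * np.eye(len(X1))
    Kx1 = se(X, X1)
    m = Kx1 @ np.linalg.solve(K11, y1)
    Sigma = se(X, X) - Kx1 @ np.linalg.solve(K11, Kx1.T)
    return m, Sigma
```

The returned pair (m, Sigma) is exactly what the effective kernels in the next section consume.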
The data {X, y} with high precision are then associated with the function f(x) = f_2(f_1(x)). Following the previous discussion, we may write down the true evidence for the precise data conditioned on the less precise ones,

p(y|X, X_1, y_1) = ∫ df p(y|f) p(f|X, X_1, y_1). (7)

To proceed with the moment matching approximation of the true evidence in Equation (7), one needs to find the effective kernel in the approximate distribution q(f|X, X_1, y_1) = N(0, K_eff) and replace the true distribution in Equation (4) with the approximate distribution. Therefore, the learning in the conditional DGP includes the hyperparameters in the exposed GP, f_2|f_1, and those in the intermediate GP, f_1. Standard gradient descent is applied to the above approximate marginal likelihood. One can customize the kernel K_eff in the GPy [37] framework and implement the gradient components ∂K_eff/∂θ with θ ∈ {σ_{1,2}, ℓ_{1,2}} in the optimization.

Analytic Effective Kernels
Here, we consider the conditional DGP with a two-layer, width-1 hierarchy, focusing on the cases where the exposed GP for f_2|f_1 in Equation (4) uses the squared exponential (SE) kernel or the squared cosine (SC) kernel. We also follow the notation in [16] so that the composition denoted by k_2[k_1] represents that k_2 is the kernel used in the exposed GP and k_1 the kernel used in the intermediate GP. For example, SE[SC] means k_2 is SE while k_1 is SC. Following [16], the exact covariance in the marginal prior Equation (4) is calculated as

k_eff(x_i, x_j) = E_{p(f_1|X, X_1, y_1)} [ k_2(f_1(x_i), f_1(x_j)) ].

Thus, when the exposed GP has a kernel k_2 in the exponential family, the above integral is tractable and the analytic k_eff can be implemented as a customized kernel. The following lemmas from [16] are useful for the present case with a nonzero conditional mean and a conditional covariance in f_1.

Lemma 1 (Lemma 2 in [16]). For a vector of Gaussian random variables g_{1:n} ∼ N(m, C), the expectation of the exponential quadratic form exp[-(1/2) Q(g_1, g_2, ⋯, g_n)], with Q(g_{1:n}) = ∑_{i,j} A_ij g_i g_j ≥ 0 and the n-dimensional matrix A in the quadratic form being symmetric, has the following closed form,

E{ exp[-(1/2) Q(g_{1:n})] } = |I_n + CA|^{-1/2} exp[ -(1/2) m^t (I_n + AC)^{-1} A m ],

where the transpose operation on a column vector is denoted by the superscript t.
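This Gaussian expectation identity can be sanity-checked numerically. The sketch below, with our own illustrative numbers (a bivariate Gaussian and the quadratic form A that appears in the SE composition), compares the closed form against a Monte Carlo estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
m = np.array([0.5, -0.3])
C = np.array([[1.0, 0.4], [0.4, 0.8]])
A = np.array([[1.0, -1.0], [-1.0, 1.0]])  # quadratic form of the SE composition

# Closed form: |I + CA|^{-1/2} exp(-1/2 m^T (I + AC)^{-1} A m)
I = np.eye(2)
closed = np.linalg.det(I + C @ A) ** -0.5 * np.exp(
    -0.5 * m @ np.linalg.solve(I + A @ C, A @ m))

# Monte Carlo estimate of E[exp(-1/2 g^T A g)] for g ~ N(m, C)
g = rng.multivariate_normal(m, C, size=200_000)
mc = np.mean(np.exp(-0.5 * np.einsum('ni,ij,nj->n', g, A, g)))
```

With these values the two estimates agree to about two decimal places, which is the expected Monte Carlo accuracy at this sample size.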
We shall emphasize that our previous work [16] did not discuss the cases when the intermediate GP for f 1 is conditioned on the low precision data {X 1 , y 1 }. Thus, the conditional mean and the non-stationary conditional covariance were not considered in [16].
The effective kernel for the SE kernel k_2 in the exposed GP can be calculated analytically. With the Gaussian conditional distribution p(f_1|X, X_1, y_1) supported by the low fidelity data, the effective kernel reads

k_eff(x_i, x_j) = σ_2² (1 + δ²_ij)^{-1/2} exp[ -(m_i - m_j)² / (2 ℓ_2² (1 + δ²_ij)) ], (12)

where m_{i,j} := E[f_1(x_{i,j})|X_1, y_1] is the conditional mean of f_1 and δ²_ij := [c(x_i, x_i) + c(x_j, x_j) - 2c(x_i, x_j)]/ℓ_2² with the conditional covariance c. The positive parameter δ²_ij and the length scale ℓ_2 in k_2 dictate how the uncertainty in f_1(x) affects the function composition.
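The SE effective kernel of Equation (12) is straightforward to vectorize given the conditional mean vector m and conditional covariance matrix C of f_1 at the training inputs. A minimal sketch (our own function and argument names, not library code):

```python
import numpy as np

def se_effective_kernel(m, C, sigma2=1.0, ell2=1.0):
    """Effective kernel of Equation (12), built from the conditional mean m
    and conditional covariance C of f1 at the training inputs."""
    mi, mj = m[:, None], m[None, :]
    cii, cjj = np.diag(C)[:, None], np.diag(C)[None, :]
    delta2 = (cii + cjj - 2 * C) / ell2**2  # non-stationary term delta_ij^2
    return sigma2**2 / np.sqrt(1 + delta2) * np.exp(
        -(mi - mj)**2 / (2 * ell2**2 * (1 + delta2)))
```

When C vanishes, the expression collapses to an SE kernel on the warped inputs m(x), consistent with the deterministic limit discussed below.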

Proof. For the SE[·] composition, one can represent the kernel k_2(f_1(x_i), f_1(x_j)) = σ_2² exp[-(f_1(x_i) - f_1(x_j))²/(2ℓ_2²)] as an exponential quadratic form σ_2² exp[-Q/2] with Q = f_1^t A f_1 and A = [[1, -1], [-1, 1]], where ℓ_2 = 1 is set for ease of notation. Now f_1 = (f_1(x_i), f_1(x_j))^t is a bivariate Gaussian variable with mean m and covariance matrix C. To calculate the expectation value in Equation (12) via Lemma 1, we need to compute the 2-by-2 matrix (I_2 + AC)^{-1} A. The seemingly complicated matrix is in fact reducible: since ACA = (c_ii + c_jj - 2c_ij) A = δ²_ij A, one can show that (I_2 + AC)^{-1} A = A/(1 + δ²_ij), which leads to the exponential term in the kernel, exp[-(m_i - m_j)²/(2(1 + δ²_ij))]. With the determinant |I_2 + CA| = 1 + δ²_ij and restoring the length scale ℓ_2, the kernel in Equation (12) is reproduced.
A few observations are in order. First, we can rewrite ℓ_2² δ²_ij = (1, -1) [[c_ii, c_ij], [c_ji, c_jj]] (1, -1)^t, which guarantees the positiveness of δ²_ij as the two-by-two sub-block of the covariance matrix is positive-definite. Secondly, there are deterministic and probabilistic aspects of the kernel in Equation (12). When the length scale ℓ_2 is very large, the term δ² encoding the uncertainty in f_1 becomes irrelevant, and the kernel is approximately an SE kernel with the input transformed via the conditional mean Equation (5), which is reminiscent of the deep kernel proposed in [33], where a GP is stacked on the output of a DNN. The kernel used in [21] similarly considers the conditional mean in f_1 as a deterministic transformation while the uncertainty is ignored. On the other hand, when δ² and ℓ_2² are comparable, the (epistemic) uncertainty in f_1 shaped by the supports y_1 is relevant. The effective kernel then represents the covariance in the ensemble of GPs, each of which receives the inputs transformed by one f_1 sampled from the intermediate GP. Thirdly, we shall stress that the appearance of δ² is a signature of marginalization over the latent function in deep probabilistic models. A similar square distance also appeared in [30], where the effective deep kernel was derived from a recursive inner product in the space defined by neural network feature functions.
In the following lemma, we consider the composition where the kernel in the outer layer is the squared cosine, k_h(x, y) = (σ_2²/2){1 + cos[(x - y)/ℓ_2]}, which is a special case of the spectral mixture kernel [38].

Lemma 4. The covariance in f of the marginal prior with the SC kernel used in the exposed GP is given by

k_eff(x_i, x_j) = (σ_2²/2) { 1 + cos[(m_i - m_j)/ℓ_2] exp(-δ²_ij/2) },

where δ²_ij is defined as before.
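The SC effective kernel admits an equally compact vectorized form; the sketch below mirrors the SE version with our own illustrative function name:

```python
import numpy as np

def sc_effective_kernel(m, C, sigma2=1.0, ell2=1.0):
    """SC effective kernel: (sigma2^2/2) {1 + cos[(m_i - m_j)/ell2] exp(-delta_ij^2/2)},
    with delta_ij^2 = [c_ii + c_jj - 2 c_ij] / ell2^2."""
    mi, mj = m[:, None], m[None, :]
    cii, cjj = np.diag(C)[:, None], np.diag(C)[None, :]
    delta2 = (cii + cjj - 2 * C) / ell2**2
    return 0.5 * sigma2**2 * (1 + np.cos((mi - mj) / ell2) * np.exp(-0.5 * delta2))
```

As noted in the text, the cosine factor sees only the warped means while the exponential damping carries the conditional covariance of the intermediate GP.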
The form of the product of cosine and exponential kernels is similar to the deep spectral mixture kernel [33]. In our case, the cosine function has the warped input m(x_i) - m(x_j), but the exponential function has the input c(x_i, x_i) + c(x_j, x_j) - 2c(x_i, x_j) due to the conditional covariance in the intermediate GP.

Samples from the Marginal Prior
Now we study the samples from the approximate marginal prior with the effective kernel in Equation (12). We shall vary the low fidelity data X_1, y_1 to see how they affect the inductive bias for the target function; see the illustration in Figure 1. The top row displays the low-fidelity functions f_1|X_1, y_1, which are obtained by a standard GP regression. Here, the low-fidelity data are noiseless observations of three simple functions (linear on the left, hyperbolic tangent in the middle, and sine on the right). The conditional covariance and conditional mean are then fed into the effective kernel in Equation (12), so we can sample functions from the prior carrying the effective kernel. The samples are displayed in the second row.
In such cases, it can be seen that f_1|X_1, y_1 is nearly a deterministic function (top row) given the sufficient amount of noiseless observations in {X_1, y_1}. In fact, the left panel in the second row is equivalent to samples from an SE kernel, as f_1 is the identity function. Moving to the second column, one can see that the nonlinear warping generates additional kinks in the target functions. The case in the third column with periodic warping imparts periodic patterns to the sampled functions.
Next, we shall see the effect of uncertainty in f_1|X_1, y_1 (third row) on the sampled functions (bottom row). The increased uncertainty (shown by the shaded region) in f_1 generates weak, high-frequency signals in the target function due to the non-stationary δ² in Equation (12). We stress that these weak signals are not white noise. The noise in the low fidelity data can even reverse the sign of the sampled functions, as seen by comparing the second against the bottom row in the third column. Consequently, the expressivity of the effective kernel gets a contribution from the uncertainty in learning the low fidelity functions.

Figure 1. Samples from the approximate marginal prior with the effective kernel in Equation (12). The low fidelity data X_1, y_1, marked by the cross symbols, and the low fidelity function f_1|X_1, y_1 with its uncertainty are shown in the top (noiseless) and third (noisy) rows. Top row: the uncertainty in X_1, y_1 is negligible, so f_1 is nearly a deterministic function and the effective kernels are basically kernels with warped input. The corresponding samples from q are shown in the second row. Third row: the noise in X_1, y_1 generates the samples in the bottom row, which carry additional high-frequency signals due to the non-stationary δ² in Equation (12).
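The sampling experiment described above can be reproduced end to end in a short script. The following is a self-contained illustrative sketch (our own grid sizes, length scales, and jitter values; the low fidelity function sin(4πx) matches one of the cases used later in the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def se(A, B, s=1.0, l=0.3):
    """Squared exponential kernel matrix for 1-D inputs."""
    return s**2 * np.exp(-0.5 * (A[:, None] - B[None, :])**2 / l**2)

# Low fidelity data: noiseless observations of f1(x) = sin(4 pi x)
X1 = np.linspace(0, 1, 25)
y1 = np.sin(4 * np.pi * X1)

# Condition f1 on (X1, y1) at the grid where prior samples are drawn
X = np.linspace(0, 1, 100)
K11 = se(X1, X1) + 1e-6 * np.eye(len(X1))
Kx1 = se(X, X1)
m = Kx1 @ np.linalg.solve(K11, y1)
C = se(X, X) - Kx1 @ np.linalg.solve(K11, Kx1.T)

# Effective kernel of Equation (12) with sigma_2 = ell_2 = 1, then draw samples
d2 = np.diag(C)[:, None] + np.diag(C)[None, :] - 2 * C
Keff = (1 + d2) ** -0.5 * np.exp(-(m[:, None] - m[None, :])**2 / (2 * (1 + d2)))
samples = rng.multivariate_normal(np.zeros(len(X)), Keff + 1e-8 * np.eye(len(X)), size=3)
```

Plotting the rows of `samples` against X reproduces the qualitative behavior in Figure 1: periodic warping of the inputs, plus high-frequency components wherever the conditional uncertainty in f_1 is non-negligible.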

Method
Since we approximate the marginal prior for the conditional DGP with a GP, the corresponding approximate marginal likelihood should be the objective for jointly learning the hyperparameters, including those in the exposed GP and the intermediate GPs. From the analytical expression for the effective kernel, e.g., Equation (12), the gradient components include the explicit derivatives ∂K_eff/∂σ_2 and ∂K_eff/∂ℓ_2 as well as the implicit derivatives ∂K_eff/∂σ_1 and ∂K_eff/∂ℓ_1, which can be computed via the chain rule.
With the data consisting of observations of different fidelity, an alternative method can learn the hyperparameters associated with each layer of the hierarchy sequentially; see Algorithm 1 for details. The low fidelity data are fed into the first GP regression model for inferring f_1 and the hyperparameters ℓ_1 and σ_1. The conditional mean and conditional covariance in f_1|X_1, y_1 are then sent to the effective kernel. The second GP, using the effective kernel, infers the high fidelity function f with the marginal likelihood of the high fidelity data as the objective. Optimization of the second model yields the hyperparameters ℓ_2 and σ_2 in the second layer. Learning in the three-layer hierarchy can be generalized from the two-layer hierarchy. In the Appendix, a comparison of regression using the two methods is shown.

Algorithm 1 A learning algorithm for conditional DGP multi-fidelity regression
Input: two sources of data, low-fidelity data (X_1, y_1) and high-fidelity data (X, y), kernel k_1 for the function f_1, and the test input x*.
1. Fit a GP with kernel k_1 to the low-fidelity data (X_1, y_1), learning the hyperparameters σ_1 and ℓ_1 via the marginal likelihood.
2. Compute the conditional mean and covariance of f_1|X_1, y_1 at the high-fidelity inputs X and form the effective kernel k_eff.
3. Fit the exposed GP with k_eff to the high-fidelity data (X, y), learning σ_2 and ℓ_2 via the marginal likelihood.
Output: the hyperparameters {σ_{1,2}, ℓ_{1,2}} and the predictive mean µ* and variance σ*² at x*.
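The sequential procedure of Algorithm 1 can be sketched compactly. The following illustrative implementation replaces gradient-based optimization with a crude grid search over the length scales only (amplitudes and noise are fixed for brevity), so it is a sketch of the two-stage structure rather than the paper's actual optimizer:

```python
import numpy as np

def se(A, B, s, l):
    """Squared exponential kernel matrix for 1-D inputs."""
    return s**2 * np.exp(-0.5 * (A[:, None] - B[None, :])**2 / l**2)

def lml(K, y, noise):
    """Gaussian log marginal likelihood with noise variance noise^2."""
    Kn = K + noise**2 * np.eye(len(y))
    return (-0.5 * y @ np.linalg.solve(Kn, y)
            - 0.5 * np.linalg.slogdet(Kn)[1]
            - 0.5 * len(y) * np.log(2 * np.pi))

def fit_conditional_dgp(X1, y1, X, y, grid=(0.05, 0.1, 0.2, 0.4, 0.8), noise=1e-2):
    # Step 1: fit the intermediate GP on the low fidelity data (grid search on ell_1)
    l1 = max(grid, key=lambda l: lml(se(X1, X1, 1.0, l), y1, noise))
    # Step 2: condition f1 on (X1, y1) at the high fidelity inputs X
    K11 = se(X1, X1, 1.0, l1) + noise**2 * np.eye(len(X1))
    Kx1 = se(X, X1, 1.0, l1)
    m = Kx1 @ np.linalg.solve(K11, y1)
    C = se(X, X, 1.0, l1) - Kx1 @ np.linalg.solve(K11, Kx1.T)
    # Step 3: fit the exposed GP with the effective kernel (grid search on ell_2)
    d2 = np.diag(C)[:, None] + np.diag(C)[None, :] - 2 * C
    def keff(l2):
        dd = d2 / l2**2
        return (1 + dd)**-0.5 * np.exp(-(m[:, None] - m[None, :])**2 / (2 * l2**2 * (1 + dd)))
    l2 = max(grid, key=lambda l: lml(keff(l), y, noise))
    return l1, l2
```

In practice, as the text notes, one would instead implement K_eff as a custom kernel (e.g., in GPy) and optimize all hyperparameters with gradients.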

Results
In this section, we present the results of multi-fidelity regression given low fidelity data X_1, y_1 and high fidelity data X, y, using the 2-layer conditional DGP model. Cases with three levels of fidelity can be handled with the 3-layer counterpart. The toy demonstrations in Section 6.1 focus on data sets in which the target function is a composite, f(x) = f_2(f_1(x)). The low fidelity data are observations of f_1(x), while the high fidelity data are observations of f(x). The aspect of compositional freedom is discussed in Section 6.2, where the same target function is inferred with the same high fidelity data but with low fidelity data resulting from a variety of functions. In Section 6.3, we switch to the case where the low fidelity data are also observations of the target function f, but with large noise. In Section 6.4, we compare our model with the work in [3] on data sets with high dimensional inputs.

Synthetic Two-Fidelity Function Regression
The first example in Figure 2 consists of 10 random observations of the target function f(x) = (x - √2) sin²(8πx), together with a larger set of low fidelity observations of the inner function f_1.

Compositional Freedom and Varying Low-Fidelity Data
Given the good learning results in the previous subsection, one may wonder about the effect of a different low fidelity data set on inferring the high fidelity function. Here, we consider the same high fidelity data from the target function in Figure 2, but the low fidelity data are observations of f_1(x) = x, f_1(x) = tanh x, f_1(x) = sin 4πx, and f_1(x) = sin 8πx. Figure 4 displays the results. Plots in the top rows represent f_1|X_1, y_1, while the bottom rows show the inferred target function given the high fidelity data (red dots). It can be seen in the leftmost column of panel (a) that the linear f_1 is not a probable low fidelity function, as the true target function (red dashed line) in the bottom lies outside the predictive confidence. Similarly, in the second plot in (a), f_1 being a hyperbolic tangent function is unlikely to account for the true target function. A periodic f_1 is more likely to account for the true target function than the first two cases, and the rightmost plot with f_1(x) = sin 8πx leads to a predictive mean very close to the true target function.

Next, the low fidelity data become noisy observations of the same four functions. As shown in panel (b), the increased variance in f_1|X_1, y_1 also raises the variance in f, especially when comparing the first two cases in (a) against those in (b). A dramatic difference can be found in comparing the third plot in (a) against that in (b). In (b), the presence of noise in the low fidelity data slightly raises the uncertainty in f_1, but the ensuing inference in f generates a prediction which fails to contain most of the true target function within its model confidence. Thus, the likelihood that f_1(x) = sin 4πx is the probable low fidelity function is greatly reduced by the noise in the observations. Lastly, the noise in observing f_1(x) = sin 8πx as the low fidelity data does not significantly change the inferred target function.
Therefore, the inductive bias associated with the target function is indeed controllable by the intermediate function distribution conditioned on the lower fidelity data. This observation motivates DGP learning from common single-fidelity regression data with the intermediate GPs conditioned on some optimizable hyperdata [39]. These hyperdata constrain the space of intermediate functions, and the uncertainty therein contributes to the expressiveness of the model.

Denoising Regression
Here we continue with the inference of the same target function f(x) = (x - √2) sin²(8πx), but now the low fidelity data set becomes noisy observations of the target function itself. See Figure 5 for an illustration. We have 15 observations of f with a noise level of 0.001 (red dots) as high fidelity data and 30 observations of the same function with a noise level of 0.1 (dark cross symbols) as low fidelity data. We follow the same procedure, inferring f_1 with the low fidelity data and then using the conditional mean and covariance to construct the effective kernel for inferring the target function f with the high fidelity data. Unlike the previous cases, the relation between f and f_1 here is not clear. However, the structure of DGP can be viewed as the intermediate GP emitting infinitely many samples of f_1 into the exposed GP. Qualitatively, one can expect the actual prediction for f to be the average over the GP models with different warpings f_1. Consequently, we may expect the variance in predicting f to be reduced.

Indeed, as shown in Figure 5, the predictive variance using a GP with the low fidelity (high noise) observations only is marked by the light-blue region around the predictive mean (light-blue solid line). When the statistical information in f_1|X_1, y_1 is transferred to the effective kernel, the new prediction possesses much tighter uncertainty (marked by the light-green shaded region) around the improved predictive mean (dark solid line), even in the region away from the low-noise observations.

Multi-Fidelity Data with High Dimensional Input
The work in [3], along with their public code in emukit [40], assembles a set of multi-fidelity regression data sets in which the input x is of high dimension. Here we demonstrate the simulation results on these data (see [3] for details). The simulation is performed using the analytic effective kernels of Section 4 together with the benchmark implementations in emukit [40], and Algorithm 1 is followed to obtain the results. The performance of generalization is measured in terms of the mean negative log likelihood (MNLL). Table 1 displays the results using the same random seed for MF-DGP and our methods. We also include the simulation of standard GP regression with the high fidelity data only. It is seen that the knowledge about the low fidelity function is significant for predicting the high-level simulation (compared with the vanilla GP) and that the informative kernels have better performance in these cases.

Discussion
In this paper, we propose a novel kernel learning method which is able to fuse data of low fidelity into a prior for the high fidelity function. Our approach addresses two limitations of prior research on GPs: the need to choose or design a kernel [8,9] and the lack of explicit dependence on the observations in the prediction (in the Student-t process [41] the latter is possible). We resolve the limitation associated with reliance on designed kernels by introducing new data-dependent kernels together with effective approximate inference. Our results show that the method is effective and that our moment-matching approximation retains some of the multi-scale, multi-frequency, and non-stationary correlations that are characteristic of deep kernels, e.g., [33]. The compositional freedom [18] pertaining to hierarchical learning is also manifest in our approach.

Conclusions
Central to the allure of Bayesian methods, including Gaussian Processes, is the ability to calibrate model uncertainty through marginalization over hidden variables. The power and promise of DGP lie in allowing rich compositions of functions while maintaining the Bayesian character of inference over unobserved functions. Modeling multi-fidelity data with the hierarchical DGP makes it possible to exploit its expressive power and to account for the effects of uncertainty propagation. Whereas most approaches are based on variational approximations for inference and Monte Carlo sampling in the prediction stage, our approach uses a moment-based approximation in which the marginal prior of DGP is analytically approximated with a GP. In both cases, the full implications of these approximations are unknown. Continued research is required to understand the full strengths and limitations of each approach.

Acknowledgments:
We would like to thank the authors of [22] for helpful correspondence.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript:

DGP  Deep Gaussian Process
SE   Squared Exponential
SC   Squared Cosine

Appendix A

Figure A1 shows the two results of multi-fidelity regression with the same data. The left panel is obtained by jointly learning the hyperparameters in all layers with standard gradient descent on the approximate marginal likelihood, while the right panel is from learning the hyperparameters sequentially with Algorithm 1. It is noted that the right panel yields a higher log marginal likelihood.