Machine Learning for Multiple Yield Curve Markets: Fast Calibration in the Gaussian Affine Framework

Calibration is a highly challenging task, in particular in multiple yield curve markets. This paper is a first attempt to study the chances and challenges of the application of machine learning techniques for this. We employ Gaussian process regression, a machine learning methodology having many similarities with extended Kalman filtering - a technique which has been applied many times to interest rate markets and term structure models. <br><br>We find very good results for the single curve markets and many challenges for the multi curve markets in a Vasicek framework. The Gaussian process regression is implemented with the Adam optimizer and the non-linear conjugate gradient method, where the latter performs best. We also point towards future research.


Introduction
It is the aim of this paper to apply machine learning techniques to the calibration of bond prices in multi curve markets in order to predict the term structure of basic instruments.The challenges are two-fold: on the one side interest rate markets are characterized by having not only single instruments, like stocks, but full term structures, i.e. the curve of yields for different investment periods.On the other side, in multi curve markets not only one term structure is present but multiple yield curves for different lengths of the future investment period are given in the market and have to be calibrated.This is a very challenging task, see Eberlein et al. (2019) for an example using Lévy processes.
The co-existence of different yield curves associated to different tenors is a phenomenon in interest rate markets which originates with the 2007-2009 financial crisis.In this time, spreads between different yield curves reached their peak beyond 200 basis points.Since then the spreads have remained on a non-negligible level.The most important curves to be considered in the current economic environment are the overnight indexed swap (OIS) rates and the interbank offered rates (abbreviated as Ibor, such as Libor rates from the London interbank market) of various tenors.In the European market these are respectively the Eonia-based OIS rates and the Euribor rates.The literature on multiple yield curves is manifold and we refer to Grbac and Runggaldier (2015) and Henrard (2014) for an overview.The general theory and affine models have been developed and applied, among others, in Mercurio (2010); Grbac et al. (2015); Cuchiero et al. (2016Cuchiero et al. ( , 2019)); Grbac et al. (2020).
The recent developments have seen many machine learning techniques, in particular deep learning became very popular.While deep learning typically needs big data, here we are more confronted with small data together with a high-dimensional prediction problem, since a full curve (the term structure), and in the multi curve market, even multiple curves have to be calibrated and predicted.To be able to deal with this efficiently, one would like to incorporate information from the past, and a Bayesian approach seems best suited to this.We choose Gaussian process regression (GPR) as our machine learning approach which ensures fast calibration; see De Spiegeleer et al. (2018).This is a non-parametric Bayesian approach to regression and is able to capture non-linear relationships between variables.The task of learning in Gaussian processes simplifies to determining suitable properties for the covariance and mean function, which will determine the calibration of our model.We place ourselves in the context of the Vasiček model, which is a famous affine model, see Filipović (2009) and Keller-Ressel et al. (2018) for a guide to the literature and details.

Related literature
Calibration of log-bond prices in the simple Vasiček model framework via Gaussian processes for machine learning has already been applied in Beleza Sousa et al. (2012) for a single maturity and in Beleza Sousa et al. (2014) for several maturities.They both rely on the theory of Gaussian processes presented in Rasmussen and Williams (2006).While in Section 3 we will extend Beleza Sousa et al. (2012) by presenting an additional optimization method, Beleza Sousa et al. (2014) and Section 4 constitute a different access to the calibration of interest rate markets with several maturities.While Beleza Sousa et al. (2014) calibrate solely zero-coupon log-bond prices, from which one is not able to construct a post-crisis multi-curve interest rate market, we calibrate zero-coupon log-bond prices and log-δ-bond prices on top in order to encompass forward rate agreements and to be conform with the multi-curve framework, cf.Grbac et al. (2020) for the notion of δ-bonds.The modelling of log-δ-bond prices allows to build forward-rate-agreements (FRAs) so that we have the basic building blocks for multi-curve interest rate markets.

Gaussian process regression
Following (Rasmussen and Williams, 2006, Chapter 2 and 5) we provide a brief introduction to Gaussian process regression (GPR).For a moment consider a regression problem with additive Gaussian noise: we observe y 1 , . . ., y n together with covariates x 1 , . . ., x n and assume that where f (•, θ) is a regression function depending on the unknown parameter θ and is a d-dimensional noise vector which we assume to consist of i.i.d.mean-zero Gaussian errors and standard deviation σ.
In the Bayesian approach we are not left without any knowledge on θ but may start from a prior distribution; sometimes this distribution can be deducted from previous experience in similar experiments, while otherwise one chooses an uninformative prior.Assuming continuity of the prior, we denote the prior density of θ by p(θ).Inference is now performed by computing the a posteriori distribution of θ conditional on the observation (x, y).This can be achieved by Bayes' rule, i.e.
where we assumed only that the distribution of θ does not depend on x.Similarly, we can compute p(y|x) from p(y|x, θ) by integrating with respect to p(θ).
For a moment, we drop the dependence on x in the notation.Assuming only that the observation y ∼ N (µ y , Σ yy ) is normally distributed, we are already able to state the marginal likelihood p(y): it is given up to a normalizing constant c by (1) If we assume moreover that ξ = f (θ) is jointly normally distributed with y, we arrive at the multivariate Gaussian case.Hence, the conditional distribution p(ξ|x, y) is again Gaussian and can be computed explicitly.Starting from (ξ, y) ∼ N (µ, Σ), where we split µ = (µ ξ , µ y ) and we can compute p(ξ|y) through some straightforward calculations1 .First, observe that y ∼ N (µ y , Σ ξξ + σ2 I n ).We obtain Σ yy = Σ ξξ + σ2 I n .Hence the marginal likelihood is given by (2) Second, we compute This formula is the basis for the calibration in the Vasiček model, as we will show now.

The single-curve Vasiček interest rate model
As a first step, we calibrate the single-curve Vasiček interest rate model following Beleza Sousa et al. (2012).To begin with, we want to mention the important difference in mathematical finance between the objective measure P and the riskneutral measure Q.The statistical propagation of all stochastic processes takes place under P. Arbitrage-free pricing means computing prices for options; of course the prices depend on the driving stochastic factors.The fundamental theorem of asset pricing now yields that arbitrage-free prices of traded assets can be computed by taking expectations under a risk-neutral measure Q of the discounted pay-offs.For a calibration, the risk-neutral measure has to be fitted to observed option prices, which is our main target.
In this sense, we consider zero-coupon bond prices under the risk-neutral measure Q.The Vasiček model is a single-factor model driven by the short rate r = (r t ) t≥0 which is given by the solution of the stochastic differential equation with initial value r 0 .Here W is a Q-Brownian motion and κ, θ, σ are positive constants.For a positive κ the process r converges to the long-term mean θ.
The price process of the zero-coupon bond price with maturity T is denoted by (P (t, T )) 0≤t≤T with P (T, T ) = 1.The Vasiček model is an affine bond price model, which implies that bond prices take an exponential affine form, i.e.
for t ≤ T. (5) Here, the functions (A(•), We recall that the solution of the stochastic differential equation ( 4) is given by (cf.(Brigo and Mercurio, 2001, Chapter 4.2)), Together with (5), this implies that the zero coupon bond prices are log-normally distributed.To apply the Gaussian process regression, we consider in the sequel log-bond prices for t ≤ T , The corresponding mean function is given by and the covariance function is given by An observation now consists of a vector y = (y(t 1 , T ), . . ., y(t n , T )) + of logbond prices with additional Gaussian i.i.d.noise with variance σ2 .Consequently, y ∼ N (µ y , Σ yy ) with We define the vector of hyper-parameters as Θ = (r 0 , κ, θ, σ) and aim at finding the most reasonable values of hyper-parameters, minimizing the negative log marginal likelihood defined in (1) with the underlying mean covariance function from (9).
Remark 1 (Extending the observation).Note that in the set-up considered here, the driving single factor process r is considered not observable at t 0 , t 1 , . . ., t n , compare Equation ( 9) which depends on r 0 only (and on Θ of course).Such an approach may be appropriate for short time intervals.The propagation of r 0 to the future times t 1 , . . ., t n is solely based on the risk-neutral evolution under the risk-neutral measure Q via µ and c.
If, on the contrary, updating of r should be incorporated, the situation gets more complicated since the evolution of r (taking place under the statistical measure P) needs to be taken into account.One can pragmatically proceed as follows: assume that at each time point r is observable.We neglect the information contained in this observation about the hyper-parameters Θ.Then, the formulas need to be modified only slightly by using conditional expressions.For example, µ(t, T ) conditional on r s for s ≤ t equals and, if t = s, this simplifies even further.
In a similar spirit, one may extend the observation by adding bond prices with different maturities or, additionally, derivatives.A classical calibration example would be the fitting of the hyper-parameters to the observation of bond prices and derivatives, both with different maturities, at a fixed time t 0 .The Gaussian process regression hence also provides a suitable approach for this task.We refer to De Spiegeleer et al. ( 2018) for an extensive calibration example in this regard.
We refer to Rasmussen and Williams (2006) for other approaches to model selection such as cross validation or alignment.For our application purposes maximizing the log-marginal likelihood is a good choice since we already have information about the choice of covariance structure, and it only remains to optimize the hyperparameters, cf.Fischer et al. (2016).

Prediction with Gaussian Processes regression
The prediction of new log-bond prices given some training data is of interest in the context of the risk management of portfolios and options on zero-coupon bonds and other interest derivatives.The application to pricing can be done along the lines of Dümbgen and Rogers (2014), which we do not deepen here.Once having found the calibrated parameters Θ * , we can apply those calibrated parameters to make predictions for new input data.
For the dynamic prediction of log-bond prices we seek a function that is able to predict new output values given some new input values, initially based on a set of training data.For this purpose we specify in a first step the prior distribution, expressing our prior beliefs over the function.In Figure 1 we plotted for the sake of illustration one simulated log-bond price and an arbitrary chosen prior, expressing our beliefs.This could correspond to the log-prices of a 1-year maturity bond, plotted over 252 days.The blue line constitutes the prior mean and the shaded area nearly twice the prior standard deviation corresponding to the 95% confidence interval (CI).Without observing any training data, we are neither able to narrow the CI of our prior nor to enhance our prediction.
The next step is to incorporate a set of training data, to enhance our prediction.The Gaussian process regression has a quite fascinating property in this regard: consider an observation at the time point t, t ∈ {1, . . ., 251}.The prediction for y(t) is of course y(t) itself -and hence perfect.The prediction of y(s), be it for s < t (which is often called smoothing) or the prediction for s > t (called filtering) Note that the initial value at time 0 is known, as well as the final value P (T, T ) = 1 implying log P (T, T ) = 0. is normally distributed, but, according to (3), has a smaller variance compared to our initial prediction without the observation of y(t).Increasing the number of observations, to, say, y(t 1 ), . . ., y(t n ) improves the prediction dramatically.We illustrate this property in Figure 2. Depending on the location of s relative to t 1 , . . ., t n the prediction can become very exact (bottom right -for s ∈ [0, 130]) and may still have a significant error (e. g. for s around 180).
The joint distribution of y and ỹ can directly be obtained from the mean and covariance functions in Equations ( 7), ( 8): since they are jointly normally distributed, we obtain that Recall that µ y and Σ yy were already specified in Equation ( 9).Analogously, we obtain that The posterior distribution is obtained by using Equation (3): conditional on the training data y, we obtain that Given some training data, Equation (10) now allows to calculate the posterior distribution, i.e. the conditional distribution of the predictions ỹ given the observation y, and to make predictions, derive confidence intervals, etc.  b), (c), and (d), respectively.We want to predict log-bond-prices (red dotted line).The red dots constitute observations of the log-bond prices.The blue line denotes the posterior mean, which we take as prediction.The shaded areas represent the 95% CIs.
The posterior distributions, i.e. the prior conditioned on some training data of 1, 3, 100, and 102 observations, respectively, is shown in in Figure 2. The red dotted line represents a simulated trajectory of the log-bond price, while the red dots constitute the observed training data.The blue line is the posterior mean, which we take as our prediction, and the shaded areas denote the 95% confidence interval (CI).We observe that the more training data we have, the more accurate our prediction gets.While in Figure 2 (a) the CI is quite large, two additional training data narrow the CI significantly down, see Figure 2 (b).Of course, the observation times t i play an important role in this: if we had observations t i close to t i−1 , the additional gain in knowledge would most likely be small, while in Figure 2 (b), the (t i ) are nicely spread.Consequently, if one is able to choose the (t i ), one can apply results from optimal experimental design to achieve a highly efficient reduction in the prediction variance.
In most practical cases, the (t i ) can not be chosen freely, and just arrive sequentially, as we illustrate in 2 (c)-(d): here, we aim at predicting log-bond prices of a 1-year maturity bond which we have observed daily until day t n = 125.Figure 2 (c) describes the situation in which we do not have any information about future prices of the 1-year maturity bond, i.e. the 2nd half of the 1 year.Figure 2 (d) depicts the situation in which we assume to know two log-bond prices or option strikes on zero-coupon bonds in the future, which enhances the prediction and narrows the CI down.

Performance measures
In order to monitor the performance of the calibration or the prediction, one typically splits the observed data-set into two disjoint sets.The first set is called training set and is used to fit the model.The second one is the validation set and used as a proxy for the generalization error.Given the number of observed data points and number of parameters to optimize for our simulation task, we recommend to split the training and validation test set in a 70%/30%-ratio.The quality of the predictions can be assessed with the standardized mean squared error (SMSE) loss and the mean standardized log loss (MSLL).
The SMSE considers the average of the squared residual between the mean prediction and the target of the test set and then standardizes it by the variance of the targets of the test cases.
where σ2 denotes the variance of the targets of the test cases.The SMSE is a simple approach and the reader should bear in mind that this assessment of the quality does not incorporate information about the predictive variance.But as it represents a standard measure of the quality of an estimator, it is useful to consider it for the sake of comparability with other literature.
The MSLL as defined in Rasmussen and Williams (2006) is obtained by averaging the negative log probability of the target under the model over the test set and then standardizing it by subtracting the loss that would be obtained under the trivial model with mean and variance of the training data as in Equation ( 1 The MSLL will be around zero for simple methods and negative for better methods.Note that in (11) we omitted in the notation the dependence on the parameter set Θ, since now the parameter set is fixed.Moreover, we notice that the MSLL incorporates the predictive mean and the predictive variances unlike the SMSE.
In the following section we simulate log-bond prices in a Vasiček single-curve model and calibrate the underlying parameters of the model.This is performed for one maturity.Since we deal with simulated prices, we consider the noise-free setting and set σ2 = 0 in (2).

Calibration results for the Vasiček single-curve model
We generate 1.000 samples of log-bond price time series, each sample consisting of 250 consecutive prices (approximately one year of trading time) via Equations ( 4) -(5).As parameter set we fix For each of the 1.000 samples we seek the optimal choice of hyper-parameters Θ = {r 0 , κ, θ, σ} based on the simulated trajectory of 250 log-bond prices.For this purpose we minimize the negative log marginal likelihood (1) with underlying mean and covariance function ( 7) and ( 8), respectively, by means of two optimization methods: the non-linear conjugate gradient (CG) algorithm and the adaptive moment estimation (Adam) optimization algorithm.The CG algorithm uses a non-linear  conjugate gradient by Polak and Ribiere (1969).For details on the CG algorithm we refer to Nocedal and Wright (2006).The Adam optimization algorithm is based on adaptive estimates of lower-order moments and is an extension of the stochastic gradient descent.For details of the Adam optimization algorithm we refer to Kingma and Ba (2014).
After a random initialization of the parameters to optimize, the CG optimization and the Adam optimization were performed using the python library SciPy and TensorFlow, respectively.Parallelization of the 1.000 independent runs was achieved with the python library multiprocessing.The outcomes can be summarized as follows.
1.The results of the calibration via the conjugate gradient optimization algorithm are very satisfying.In Figure 3 the learned parameters r 0 , κ, θ, and σ of 1.000 simulated log-bond prices are plotted in 50 bins histograms.The red dashed lines in each sub-plot indicate the true model parameters.
The mean and standard deviation of the learned parameters are summarized in Table 1.We observe that the mean of the learned parameters reaches the true parameters very closely while the standard deviation is reasonable.
2. The results of the calibration via the Adam algorithm are satisfying, even though we note a shift of the mean reversion parameter κ and the volatility parameter σ.The learned parameters r 0 , κ, θ, and σ of 1000 simulated log-bond prices are plotted in 50 bins histograms in Figure 4 and summarized with their mean and standard deviation in Table 1.The Adam algorithm slowly reduces the learning rate over time to speed up the learning algorithm.Nonetheless, one needs to specify a suitable learning rate.If the learning rate is small, training is more reliable, but the optimization time in order to find a minimum can increase rapidly.If the learning rate is too big, the optimizer can overshoot a minimum.We tried different learning rates of 0.0001, 0.001, 0.01, 0.05, and 0.1.Finally, we decided in favor of a learning rate of 0.05 and performed the training over 700 epochs in order to achieve a suitable trade-off between accuracy and computation time.We expect the results to improve slightly with more epochs, at the expense of a longer training time.
We conclude that in the single-curve Vasiček specification both optimization algorithms provide reliable results.The optimization by means of the CG algorithm outperforms the Adam algorithm, compare Table 1.However, there is hope that the results obtained by the Adam algorithm can be improved by increasing the number of training epochs or by choosing a smaller learning rate in addition to more training epochs.

Multi-curve Vasiček interest rate model
The financial crisis in 2007-2009 has triggered many changes in financial markets.In the post-crisis interest rate markets, multiple yield curves are standard: due to credit and liquidity issues, we obtain for each tenure a different curve, see for example Grbac and Runggaldier (2015) and the rich literature referenced therein.
The detailed mechanism in the multi-curve markets is quite involved, and we refer Grbac et al. (2020) for a precise description.Intuitively, traded instruments are forward-rate agreements which exchange a fixed premium against a floating rate over a future time interval [T, T + δ].Most notably, in addition to the parameter maturity a second parameter appears: the tenor δ.While before the crisis, the curves were independent of δ, after the crisis the tenure can no longer be neglected.The authors in Grbac et al. (2020) show that forward-rate agreements can be decomposed into δ-bonds, which we denote by P (t, T, δ).If δ = 0 we write P (t, T ) = P (t, T, 0), which is the single-curve case considered previously.
In the multi-curve Vasiček interest rate model we consider two maturity time points T ≤ T , hence δ = T − T .We calibrate zero-coupon bond prices P (t, T, 0) and tenor-δ-bond prices P (t, T, δ) under the risk-neutral measure Q at time t for the maturity T .
The calibration in the multi-curve framework corresponds to the situation in which we are given several data-sets of different bond prices, where all of them are sharing the same hyper-parameters.We call this situation multi-task learning and reveal the different meaning in contrast to (Rasmussen and Williams, 2006, Section 5.4.3)where the corresponding log-marginal likelihoods of the individual problems are summed up and the result is then optimized with respect to the hyper-parameters.As our multi-curve example exhibits correlation, the latter approach can not be applied.
We will utilize a two-dimensional driving factor process r = (r 1 , r 2 ) , generalizing Equation ( 4).Modelling r as a two-dimensional Ornstein-Uhlenbeck process can be done as follows: let r be the unique solution of the SDE where W 1 , W 2 are two standard Q-Brownian motions with correlation ρ.The zero-coupon bond prices only depend on r 1 and we utilize the single-curve affine framework: The functions (A(•), B(•)) : [0, T ] → R × R satisfy the Riccati equations and we find in the new notation (replacing κ by κ 1 , etc.) that As in ( 7) and ( 8) we obtain that zero-coupon log-bond prices y(t, T, 0) = log P (t, T, 0) are normal with mean function For two given time points t i , t j ≤ T the covariance function is The next step is to develop the prices for the tenor-δ bonds.We assume that while for the tenor 0 the interest rate is r 1 , the interest rate for tenor δ is r 1 + r 2 .This implies that Since r is an affine process, this expectation can be computed explicitly.Using the affine machinery we obtain that where the function (Φ(•), Ψ(•)) : [0, T ] → R × R 2 satisfies the Riccati equations.This implies that Proposition 2. Under the above assumptions, log P (•, T, δ), 0 ≤ t ≤ T is a Gaussian process with mean function and with covariance function Note that µ and c in the above proposition do not depend on δ.We rather use δ as index to distinguish mean and covariance functions for the zero-coupon bonds and the δ-tenor bonds.
The proof is relegated to the appendix.The next step is to phrase observation and prediction in the Gaussian setting explicitly.We denote y = (y 0 , y δ ) where y 0 = (y(t 1 , T, 0), . . ., y(t n , T, 0)) and y δ = (y(t 1 , T, δ), . . ., y(t n , T, δ)) -of course, at each time point t i we observe two bond prices now: P (t i , T, 0) and P (t i , T, δ).The vector y is normally distributed, y ∼ N (µ y , Σ y ) with We calculate these parameters explicitly and, analogously to the single-curve set-up, the calibration methodology with Gaussian process regression follows.
To begin with, note that Σ 00 y coincides with Σ yy from equation ( 9), when we replace the parameters r 0 , κ, θ, σ by r 1 0 , κ 1 , θ 1 , σ 1 , respectively.The next step for computing the covariance matrix is to compute and, analogously, in a similar way to Σ 00 y we obtain Σ δδ y .In the multi-curve calibration we aim at minimizing the log marginal likelihood (1) with corresponding mean function and covariance matrix (18).

Calibration results
For the calibration in the multi-curve setting we have to consider log-bond prices and the logarithm of tenor-δ bond prices.To be specific, we generate 1000 sequences of log-bond prices and 1000 sequences of log tenor-δ bond prices with maturity 1 by means of (13), and ( 16)-( 17).We choose as parameters For each sequence we generate 125 training data points of log-bond prices and 125 training data points of log tenor-δ bond prices.This corresponds to a computational effort similar to the single-curve specifications, since the underlying covariance matrix will be (250 × 250)-dimensional.Based on the simulated prices we aim at finding the optimal choice of hyper-parameters For this purpose, we apply the CG optimization algorithm and the Adam optimization algorithm.After a random initialization of the parameters to optimize, we perform the calibration procedure using SciPy and TensorFlow for the CG and Adam optimizer, respectively.Parallelization of the 1000 independent runs is achieved with the library multiprocessing.The outcomes are as follows.
1. Considering the results of the calibration by means of the CG algorithm, we note several facts.In this regard, the mean and the standard deviation of the calibrated parameters can be found in Table 2. Figure 5 shows the learned parameters of the processes r 1 and r 2 in 50 bins histograms for 1000 simulations.) of 1000 simulated log-bond prices and log tenor-δ bond prices, as well as the true Vasiček parameters.We note that the CG optimizer yields for every parameter a higher standard deviation than the Adam algorithm.This can in particular be observed for the parameter θ 2 .
The red dashed line in each of the sub-plots indicates the true model parameter value.First, except for the long term mean, the volatility for r 2 is higher than the one for r 1 which implies more difficulties in the estimation of the parameters for r 2 , which is clearly visible in the results.For r 1 , we are able to estimate the parameters well (in the mean), with the most difficulty in the estimation of κ 1 which shows a high standard deviation.
For estimating the parameters of r 2 we face more difficulties, as expected.
The standard deviation of θ 2 is very high -it is known from filtering theory and statistics that the mean is difficult to estimate, which is reflected here.
Similarly, it seems difficult to estimate the speed of reversion parameter κ 2 and we observe a peak around 0.02 in κ 2 .This might be due to a local minimum, where the optimizer gets stuck.
2. For the calibration results by means of the Adam algorithm we note the following.The learned parameters are illustrated in a 50 bins histogram, see Figure 6, and mean and standard deviation of each parameter are stated in Table 2.After trying several learning rates of 0.0001, 0.001, 0.01, 0.05, and 0.1 we decided in favor of the learning rate 0.05 and chose 750 training epochs.While most mean values of the learned parameters are not as close to the true values as the learned parameters from the CG algorithm, we notice, that the standard deviation of the learned parameters is smaller compared to the standard deviation of the learned parameters from the CG.Especially, comparing the values of θ 2 in Figure 5 and Figure 6, we observe the different range of calibrated values.

Conclusion
Concluding the simulation results, we can state that the calibration in the multi-curve framework constitutes a more challenging task compared to the calibration in the single-curve framework, since we need to find 8 parameters instead of 4 parameters.In particular, in order to keep the computational complexity similar to the single-curve framework, we chose 125 training data time points, which results in a covariance matrix of the same dimension as in the single-curve setting.We are confident that Figure 5: 50 bins histogram of the learned parameters with the conjugate gradient (CG) optimizer.Total simulations: 1000.We note, that the learned parameters are centered around the true parameter values except for the volatility parameters σ 1 and σ 2 , exhibiting a shift.In statistics, the task of estimating the volatility parameters is well known.However, our observations reveal, that the task of finding the volatility parameter in the multi-curve Vasiček interest rate model constitutes a challenging task.Moreover, the range of the parameters, especially for θ 2 , is very high.doubling the training input data points would improve the results at the expense of computation time.
It would also be interesting to analyze, how other estimators perform in comparison to the shown results, like for example classical maximum-likelihood estimators.Since it is already known that it is difficult to estimate the variance with MLtechniques, it could also be very interesting to mix classical approaches with ML Figure 6: 50 bins histogram of the learned parameters obtained with the adaptive moment estimation (Adam) optimizer.The learning rate is 0.05 and training is performed over 750 epochs.Total simulations: 1000.We note, that the range of learned parameters is more narrow compared to the CG optimizer, resulting in a lower standard deviation, cf.Table 2.
approaches, which we leave for future work.
A first step in order to extend the presented approach could consist in studying further optimization techniques such as the adaptive gradient algorithm (AdaGrad) or its extension Adadelta, the root mean square propagation (RMSProp), and the Nesterov accelerated gradient (NAG), one could further investigate the prediction of log-bond prices with the learned parameters in order to obtain decision-making support for the purchase of options on zero-coupon bonds and use the underlying strike prices as worst-case scenarios.A next step could be the development of further short rate models such as the Cox-Ingersoll-Ross, the Hull-White extended Vasiček or Cox-Ingersoll-Ross framework.An interesting application is the calibration of interest rate models including jumps.Beyond that, the study of the calibration of interest rate markets by means of Bayesian neural networks seems very promising and remains to be addressed in future work.
The Gaussian process regression approach naturally comes with the a posteriori distribution, which contains much more information compared to the simple prediction (which contains only the mean).It seems to be highly interesting to utilize this for assessing the model risk of the calibration and compare it to the non-linear approaches recently developed in Fadina et al. (2019);Hölzermann (2020).
Summarizing, the calibration of multiple yield curves is a difficult task and we hope to stimulate future research with this initial study showing promising results on the one side and many future challenges on the other side.

B Coding notes and concluding remarks
We add some valuable remarks regarding the coding framework and deliver insight to the technical difficulties that arose during the calibration exercise.
In our application we needed customized covariance matrices.Therefore, we decided against applying one of the Python libraries for the implementation of Gaussian process regression (amongst others the packages pyGP, pyGPs, scikit-learn, GPy, Gpytorch, GPflow) due to their limited choice of covariance functions.We aimed at achieving a trade-off between computational speed and accuracy.We performed all simulations in two environments.The first comprises an AMD R Ryzen 2700x CPU equipped with 16 cores, and one GeForce GTX 1070 GPU.The second comprises 4 Intel(R) Xeon(R) Gold 6134 CPU resulting in 32 cores, and 4 GeForce GTX 1080 GPUs.Since the calibration of the different settings was performed in two environments with different hardware specifications, we do not compare the optimizers regarding time consumption.
Utilizing all available resources comprising CPUs and GPUs for parallel operations also turned out to be a delicate task.For example the TensorFlow default distribution is built without extensions such as AVX2 and FMA.Our CPU supports the advanced vector extensions AVX2 FMA, requiring the build from source of the library TensorFlow.The reader should bear in mind that also minor code improvements of functions which are called the most, improve the overall performance considerably.

Figure 1 :
Figure1: Illustration of a Gaussian process prior.Our task is to predict the unobserved log-bond-prices (red dotted line) with maturity T = 252.The blue line represents the prior mean and the shaded area represents the 95% confidence interval.Note that the initial value at time 0 is known, as well as the final value P (T, T ) = 1 implying log P (T, T ) = 0.
Figure 2: Illustration of Gaussian process posterior distributions, i.e. the prior conditioned on 1, 3, 100, and 102 observed training examples in (a), (b), (c), and (d), respectively.We want to predict log-bond-prices (red dotted line).The red dots constitute observations of the log-bond prices.The blue line denotes the posterior mean, which we take as prediction.The shaded areas represent the 95% CIs.

Figure 3 :
Figure 3: 50 bins histogram of the learned parameters obtained with the CG optimizer.Total simulations: 1000.

Figure 4 :
Figure 4: 50 bins histogram of the learned parameters obtained with the Adam optimizer.The learning rate is 0.05 and training is performed over 700 epochs.Total simulations: 1000.

Table 1 :
Single curve calibration results with mean and standard deviations (StDev) of the learned parameters (params.) of 1000 simulated log-bond prices as well as the true Vasiček parameters.

Table 2 :
Calibration results with mean and standard deviations (StDev) of the learned parameters (params.