Deep Arbitrage-Free Learning in a Generalized HJM Framework via Arbitrage-Regularization

Abstract: A regularization approach to model selection, within a generalized HJM framework, is introduced, which learns the closest arbitrage-free model to a prespecified factor model. This optimization problem is represented as the limit of a one-parameter family of computationally tractable penalized model-selection tasks. General theoretical results are derived and then specialized to affine term-structure models, where new types of arbitrage-free machine learning models for the forward-rate curve are estimated numerically and compared to classical short-rate and dynamic Nelson-Siegel factor models.


Introduction
The compatibility of penalized regularization with machine learning approaches allows for the successful treatment of various challenges in learning theory, such as variable selection (see Tibshirani (1996)) and dimension reduction (see Zou et al. (2006)). The objective of many machine learning models used in mathematical finance is to predict asset prices by learning functions depending on stochastic inputs. In general, there is no guarantee that these stochastic factor models are consistent with no-arbitrage conditions. This paper introduces a novel penalized regularization approach to address this modelling difficulty in a manner consistent with financial theory. The incorporation of an arbitrage-penalty term allows various machine learning methods to be directly and coherently integrated into mathematical finance applications. We focus on regression-type model selection tasks in this article. However, the arbitrage-penalty can also be applied to other types of machine learning algorithms with financial applications.
To motivate our approach, we first consider informally, similar to (Björk 2009, Chapter 10), the following simple situation, which will later be made precise. Let (Ω, F, (F_t)_{t≥0}, P) be a filtered probability space satisfying the usual conditions. Let (X_t)_{t≥0} be an F_t-adapted, real-valued stochastic process with continuous paths representing the price of a financial asset. Let r be the constant risk-free interest rate and assume a fixed time interval [0, T]. Existence of a martingale measure Q equivalent to the underlying real-world measure P implies absence of arbitrage. The price at time t ∈ [0, T] of a derivative security with integrable payoff f(X_T) at time T is given by the risk-neutral pricing formula

E_Q[ e^{−r(T−t)} f(X_T) | F_t ].   (1)

With L_t = (dQ/dP)|_{F_t}, we may express Equation (1) under the real-world probability measure P as

E_P[ e^{−r(T−t)} (L_T/L_t) f(X_T) | F_t ].   (2)

Equivalently, the price given by Equation (2) can be expressed under P by defining the state-price density process Z_t = e^{−rt} L_t. If Q is the minimal martingale measure of Schweizer (1995), then the transformation

f(X_t) → Z_t f(X_t)   (3)

can be interpreted as finding the process closest to (f(X_t))_{t≥0} which is a (local) martingale under P. The purpose of this paper is to find an analogue of the transformation (3) in the setting where X_t is described by a stochastic factor model, as is the case with most machine learning approaches to mathematical finance. For example, X_t may be described by a deep neural network with stochastic inputs.
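The risk-neutral pricing formula (1) can be illustrated with a minimal Monte Carlo sketch. The geometric-Brownian-motion dynamics under Q, the call payoff, and all parameter values below are illustrative assumptions for this example only, not part of the framework developed in this paper:

```python
import math
import random

def mc_price(f, x0, r, T, sigma, n_paths=100_000, seed=0):
    """Monte Carlo estimate of E_Q[e^{-r(T-t)} f(X_T)] at t = 0, assuming
    X follows a geometric Brownian motion under Q (an assumption made
    purely for this illustration)."""
    rng = random.Random(seed)
    disc = math.exp(-r * T)
    total = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)
        # Exact GBM terminal value under the risk-neutral drift r.
        x_T = x0 * math.exp((r - 0.5 * sigma ** 2) * T + sigma * math.sqrt(T) * z)
        total += f(x_T)
    return disc * total / n_paths

# Price a European call with (hypothetical) strike 100.
price = mc_price(lambda x: max(x - 100.0, 0.0), x0=100.0, r=0.02, T=1.0, sigma=0.2)
```

Passing the constant payoff f ≡ 1 recovers the discount factor e^{−rT}, a quick sanity check on the discounting in (1).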
The above discussion ignored well-known results regarding the uniqueness of Q (see Schweizer (1995)) and other important generalizations of the martingale approach to arbitrage theory. In particular, the more general setting of the fundamental theorem of asset pricing of Delbaen and Schachermayer (1998) implies that if "arbitrage", in the sense of no free lunch with vanishing risk, exists, then the transformation (3) is undefined. However, many machine learning approaches to mathematical finance may admit arbitrage, so it is necessary to consider the general case. The arbitrage-regularization framework introduced in this paper integrates machine learning methodologies with the general martingale approach to arbitrage theory.
We consider a general framework for learning the arbitrage-free factor model that is most similar to a given factor model, within a prespecified class of alternative factor models. The search is carried out by minimizing a loss function measuring the distance of the alternative model to the original factor model, subject to the constraint that the market described by the alternative model consists of local martingales under a reference probability measure.
The main theoretical results rely on asymptotics of the arbitrage-regularization penalty for selecting the optimal arbitrage-free model from a class of stochastic factor models. Relaxations of the asymptotic results, necessary for practical implementation, are also presented. Throughout this paper, the bond market serves as the primary example of our methods, since no-arbitrage conditions for factor models are well understood there; see Filipović (2001) and the references therein. Numerical results applying the arbitrage-regularization methodology are implemented using real data.
The remainder of this paper is organized as follows. Section 2 states the arbitrage-regularization problem and presents an overview of relevant background on bond markets. Section 3 develops the arbitrage-penalty and establishes the main asymptotic optimality results. Non-asymptotic relaxations of these results are also considered and linked with transaction costs. Section 4 specializes the general results to bond markets, where a simplified expression for the arbitrage-penalty is obtained. Numerical implementations of the results are considered, and the arbitrage-regularization methodology is used to generate new machine learning-based models consistent with no free lunch with vanishing risk (NFLVR); the results are compared with classical term-structure models as benchmarks. Section 5 concludes, and Appendix A contains supplementary material primarily required for the proofs, such as functional Itô calculus and Γ-convergence results. Proofs of the main theorems of the paper are included in Appendix B.

The Arbitrage-Regularization Problem
For the remainder of the paper, all stochastic processes are defined on a common stochastic basis (Ω, F, (F_t)_{t≥0}, P). Let P⋆ be a probability measure equivalent to the reference probability measure P, and let (r_t)_{t≥0} denote the risk-free rate in effect at time t ≥ 0. Assume that there exists an asset whose price process, denoted by (N_t)_{t≥0}, is a strictly positive P⋆-martingale serving as numéraire. Unless otherwise specified, all processes in this paper will be described under the martingale measure for (N_t)_{t≥0}, denoted P^N. The choice of numéraire can be used to encode or remove any trend from the price processes being modelled. Price processes which are local martingales under P^N or P⋆ are usually only semi-martingales under the objective measure P. Further details on numéraires can be found in Shreve (2004). We consider a large financial market (X_t(u))_{t≥0, u∈U}, indexed by a non-empty Borel subset U ⊆ R^D, where D is a positive integer. For example, (X_t(u))_{t≥0, u∈U} may represent a bond market where, using the parameterization of Musiela and Rutkowski (1997), U = [0, ∞) is the collection of all possible maturities and X_t(u) is the time-t price of a zero-coupon bond with maturity u.
For each u ∈ U, the process (X_t(u))_{t≥0} will be driven by a latent, possibly infinite-dimensional, factor process. In the case of the bond market, this latent process is the forward-rate curve. Write

X_t(u) = S_t(φ^u_t, [[φ^u]]_t; u),

where φ^u_t := φ(t, β_t, u), {S_t(·, ·; u)}_{u∈U} is a family of path-dependent functionals encoding the latent process into the asset price X_t(u), φ is the factor model for the latent process, and β_t are the R^d-valued stochastic factors driving the latent process. Following Fournie (2010), S_t will be allowed to depend on the local quadratic variation of the factor process φ(t, β_t, u), denoted by [[φ^u]]_t. In the case of the bond market, S_t is the map taking a forward-rate curve, such as (φ(t, β_t, u))_{u∈U}, to the time-t price of a zero-coupon bond with maturity u. It will often be convenient to use the reparameterization of Musiela and Rutkowski (1997), re-expressing the curve through the change of variables

τ := u − s,  for 0 ≤ t ≤ s ≤ u,   (6)

where τ represents the time to maturity of the bond.
In general, S_t will be allowed to depend on the path of φ(t, β_t, u). Thus, S_t will be a path-dependent functional of regularity C^{1,2}_b in the sense of Fournie (2010), as discussed in Appendix A.2. However, as in the bond market, if S_t depends only on the current value of φ(t, β_t, u), then the requirement that S_t be of class C^{1,2}_b, in the sense of Fournie (2010), is equivalent to it being of regularity C^{1,2}(I × R^d) in the classical sense, where I := [0, ∞). Therefore, the classical Itô calculus applies to such S_t.
Analogously to Björk and Christensen (1999), the factor model ϕ for the latent process will always be assumed to be suitably integrable and suitably differentiable. Specifically, ϕ will belong to a Banach subspace X of L^p_{ν⊗µ}(I × R^d × U) which can be continuously embedded within the Fréchet space C^{1,2,2}(I × R^d × U), where ν is a Borel probability measure supported on I, µ is a Borel probability measure supported on R^d × U, and both ν and µ are equivalent to the corresponding Lebesgue measures restricted to their supports. Here, 1 ≤ p < ∞ is kept fixed.
An example from the bond-modelling literature is the Nelson-Siegel model (see Nelson and Siegel (1987) and Diebold and Rudebusch (2013)), which expresses the forward-rate curve as a function of its level, slope, and curvature through the factor model. The Nelson-Siegel family is part of a larger class of affine term-structure models, in which, at any given time, the forward-rate curve is described in terms of a set of market factors as

φ(t, β_t, u) = ϕ_0(u) + Σ_{i=1}^d β^i_t ϕ_i(u),   (7)

where d is a positive integer, ϕ_i ∈ C^2(U), and ϕ_0 is a forward-rate curve typically calibrated to the data available at time t = 0. Note that the forward-rate curves in (7) are parameterized according to the change of variables in (6); however, since U represents all times to maturity, these are indeed traded assets. As shown in Filipović (2001), the Nelson-Siegel model is typically not arbitrage-free; we would therefore like to learn the closest arbitrage-free factor model driven by the same stochastic factors. Thus, given a non-empty and unbounded hypothesis class H ⊆ X of plausible alternative models, we optimize

argmin_{φ∈H} ℓ(ϕ − φ) subject to: each S_t(φ^u_t, [[φ^u]]_t; u) is a P^N-local martingale,   (8)

where H is required to contain the (naive) factor model ϕ and ℓ : X → [0, ∞) is a continuous and coercive loss function; for example, ℓ may be taken to be the norm on X. Geometrically, (8) describes a projection of ϕ onto the (possibly non-convex) subset of H of factor models making each S_t(φ^u_t, [[φ^u]]_t; u) a P^N-local martingale for every u ∈ U. The requirement that H contain the (naive) factor model ϕ is a consistency requirement: it ensures that, for any arbitrage-free factor model ϕ, the solution to problem (8) is ϕ itself.
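As a concrete instance of an affine factor model of the form (7), the sketch below evaluates a Nelson-Siegel forward-rate curve. The factor loadings are the standard Nelson-Siegel choices, and the shape parameter and factor values are illustrative assumptions:

```python
import math

def nelson_siegel_forward(u, beta1, beta2, beta3, tau=1.5):
    """Nelson-Siegel instantaneous forward rate at time-to-maturity u:
    level + slope*exp(-u/tau) + curvature*(u/tau)*exp(-u/tau)."""
    x = u / tau
    return beta1 + beta2 * math.exp(-x) + beta3 * x * math.exp(-x)

# The short end loads on level + slope; the long end converges to the level.
short_end = nelson_siegel_forward(0.0, beta1=0.03, beta2=-0.01, beta3=0.02)    # 0.02
long_end = nelson_siegel_forward(100.0, beta1=0.03, beta2=-0.01, beta3=0.02)   # ~0.03
```

The two evaluations illustrate the factor interpretation used throughout: at u = 0 the curve equals level + slope, while for large u the slope and curvature loadings decay and the curve flattens to the level factor.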
In general, the problem described by (8) may be challenging to implement, as projections onto non-convex sets are intractable. In analogy with the regularization literature, such as Hastie et al. (2015), we instead consider the following relaxation of problem (8), which is more amenable to numerical implementation:

argmin_{φ∈H} ℓ(ϕ − φ) + AF_λ(φ),   (9)

where {AF_λ}_{2≤λ<∞} is a family of functions from H to [0, ∞] taking the value 0 if each S_t(φ^u_t, [[φ^u]]_t; u) is a P^N-local martingale simultaneously for every value of u, and λ is a meta-parameter determining the amount of emphasis placed on penalizing factor models which fail to meet this requirement. Problem (9) is called the arbitrage-regularization problem.
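The mechanism behind the relaxation (9) — replacing a hard constraint with a penalty whose minimizer approaches the constrained minimizer as the penalty weight grows — can be illustrated on a toy quadratic problem. The quadratic penalty below is a stand-in for illustration only, not the arbitrage-penalty AF_λ:

```python
def penalized_projection(w0, a, lam):
    """Minimize ||w - w0||^2 + lam*(a.w)^2 in closed form. As lam grows,
    the minimizer converges to the projection of w0 onto the constraint
    set {w : a.w = 0}, mirroring how (9) relaxes the constraint in (8)."""
    dot = sum(x * y for x, y in zip(a, w0))
    norm2 = sum(x * x for x in a)
    # First-order condition: w = w0 - lam*(a.w)*a, solved for a.w explicitly.
    c = lam * dot / (1.0 + lam * norm2)
    return [x - c * y for x, y in zip(w0, a)]

# Small penalty weight: constraint badly violated; large weight: nearly satisfied.
w_small = penalized_projection([1.0, 1.0], [1.0, 1.0], 1.0)    # residual a.w = 2/3
w_large = penalized_projection([1.0, 1.0], [1.0, 1.0], 1e6)    # residual a.w ~ 1e-6
```

Here the constraint residual a·w shrinks at rate 1/λ, so the penalized solutions trace a path converging to the exact projection, which is the behaviour Theorem 1 establishes for the genuine arbitrage-penalty.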
At the moment, only two lines of research are comparable to the arbitrage-regularization problem. Results of the first kind, such as the arbitrage-free Nelson-Siegel model of Christensen et al. (2011a), provide closed-form, case-by-case arbitrage-free variants of specific models, and only when those models coincide with specific arbitrage-free HJM-type factor models, such as those studied by Björk and Christensen (1999). However, the reliance on analytic methods typically limits this type of approach to simple or specific models and does not yield a general or computationally viable solution to the problem. Moreover, arbitrage-free corrections derived in this way are not guaranteed to be optimal in the sense of (8), or approximately optimal in the sense of (9). This will be examined further in the numerics section of this paper.
The use of a penalty to capture no-arbitrage conditions has, to the best of the authors' knowledge, thus far only been explored numerically by Chen et al. (2019), within the discrete-time portfolio-optimization setting. A similar problem was treated in Chen et al. (2006) for learning the equivalent martingale measure in the multinomial-tree setting for stock prices. Our paper provides the first theoretical result in this direction, as well as the first such framework that applies to large financial markets, such as bond markets, and to the continuous-time setting.
Before presenting the main results we first state necessary assumptions.
Assumption 1. The following assumptions will be maintained throughout this paper.
(i) β_t is an R^d-valued diffusion process which is the unique strong solution to

dβ_t = α(t, β_t) dt + σ(t, β_t) dW_t,   (10)

where β_0 ∈ R^d, W_t is an R^d-valued Brownian motion, the components α_i : R^{1+d} → R are continuous, and the components (σ_{i,j} : R^{1+d} → R)_{i,j=1}^d are measurable and such that the diffusion matrix is a continuous function of β for any fixed t ≥ 0.
(ii) The stochastic differential equation (10) has a unique R^d-valued solution for each initial condition b, verifying the "predictable-dependence" condition of Fournie (2010), where S^d_+ denotes the set of d × d positive semi-definite matrices with real coefficients.

The central problem of the paper will be addressed in full generality before turning to applications in term-structure models in the next section.

Main Results
In this section, we show the asymptotic equivalence of problems (8) and (9) for general asset classes. This requires the construction of the penalty term AF_λ, measuring how far a given factor model is from inducing P^N-local martingales. The construction of AF_λ proceeds in two steps. First, a drift condition guaranteeing that the processes {S_t(φ^u_t, [[φ^u]]_t; u)}_{u∈U} are simultaneously P^N-local martingales is obtained. This condition generalizes the drift condition of Heath et al. (1992) and provides an analogue of the consistency condition of Filipović and Teichmann (2004). Second, the drift condition is used to produce the penalty term in (9). Subsequently, the optimizers of (9) will be used to asymptotically solve problem (8).
Proposition 1 (Drift Condition). S_t(φ^u_t, [[φ^u]]_t; u) is a P^N-local martingale simultaneously for every u ∈ U if and only if the drift condition (11) is satisfied P^N-a.s. for every t ∈ [0, ∞) and every u ∈ U, where D and ∇ respectively denote the horizontal and vertical derivatives of Fournie (2010) (see Appendix A.2, Equation (A3)).
The drift condition in Proposition 1 implies that if φ is such that the difference of the left- and right-hand sides of (11) is equal to 0, P^N-a.s. for all u ∈ U, then S_t(φ^u_t, [[φ^u]]_t; u) is a P^N-local martingale simultaneously for all u ∈ U. Thus, S_t(φ^u_t, [[φ^u]]_t; u) is simultaneously a P^N-local martingale for all u ∈ U if, for every u ∈ U, the [0, ∞)-valued process Λ^u_t(φ) is equal to 0 P^N-a.s., where Λ^u_t(φ) is defined from (11) as the discrepancy between its left- and right-hand sides, with φ^u_t = φ(t, β_t, u). The arbitrage-penalty is defined as follows.
Remark 1. Whenever |Λ^u_t(φ)|^λ fails to be integrable, we make the convention that AF_λ(φ) = ∞.

The convergence of the optimizers of (9) to the optimizers of (8) is demonstrated in the next theorem. The proof relies on the theory of Γ-convergence, which is useful for interchanging limit and arginf operations.
Assumption 2. Assume that (i) for every φ ∈ H and P^N-a.e. ω ∈ Ω, the relevant function satisfies the stated regularity condition. Note that both statements (i) and (ii) are with respect to the relative topology on H.
Theorem 1. Under Assumption 2, the following hold: (i) Equation (8) admits a minimizer on H; (ii) the optimizers of (9) converge to an optimizer of (8) as λ → ∞, where ι_H is defined on H as in (16).

Theorem 1 provides a theoretical means of asymptotically computing the optimizer of problem (8). In practice, this limit cannot always be computed, and only very large values of λ can be used. However, in reality, trading does not occur in a frictionless market: every transaction placed at time t incurs a cost k_t > 0. Moreover, only a finite number of assets are traded.
Consider a market with frictions where only finitely many assets are traded. In this setting, an admissible strategy is an adapted, left-continuous, finite-variation process θ_t ∈ R^n whose corresponding wealth process is P-a.s. bounded below. In the context of this paper, the sub-market with proportional transaction cost k_t > 0 is precisely such a market. Any such admissible strategy on this finite sub-market defines an admissible portfolio whose liquidation value, as defined by (Guasoni 2006, Equation 2.2) and (Guasoni 2006, Remark 2.4), is given by (17), where φ̄ denotes the optimizer of (9) for a fixed value of 2 ≤ λ < ∞, Dθ^i denotes the weak derivative of θ^i in the sense of measures, and |Dθ^i| denotes its variation. The first term on the right-hand side of (17) represents the capital gains from trading, the second represents the costs incurred from transactions, and the last term represents the cost of instantaneous liquidation at time t. Although more general transaction costs may be considered, the proportional transaction costs presented here suffice for the formulation of the next result.
The next result guarantees that the market model φ̄ is arbitrage-free, granted that k_t is large enough to cover the spread between the prices S_t(φ̄^u_t, [[φ̄^u]]_t; u) and their arbitrage-free counterparts. The following assumption quantifies the requirement that λ be taken sufficiently large.
Assumption 3. There exist some 0 < m < M and some 2 < λ such that, for every 0 ≤ t, every positive integer n, and every u_1, ..., u_n ∈ U, the stated bound holds.

Proposition 2. If m ≤ k_t ≤ M for all times 0 ≤ t, then for any admissible strategy θ trading the assets S_t(φ̄^u_t, [[φ̄^u]]_t; u), the market admits no arbitrage.

In the next section, we apply Theorem 1 and the arbitrage-regularization (9) to the bond market.

Arbitrage-Regularization for Bond Pricing
As discussed in Diebold and Rudebusch (2013), affine term-structure models are commonly used in forward-rate curve modelling due to their tractability and interpretability. In the formulation of Björk and Christensen (1999), as further developed in Filipović (2000) and Filipović et al. (2010), affine term-structure models are characterized by (7) together with the additional requirement that the stochastic factor process β_t follows an affine diffusion. By Cuchiero (2011), this means that the dynamics of β_t are given by (18), for some d × d matrices ζ and {ζ_k}_{k=1}^d and some vectors γ and {γ_k}_{k=1}^d in R^d, such that there exists a solution (L, R) to the associated Riccati system. Fix meta-parameters p, κ ≥ 1 and U = [0, ∞). For the next result, all factor models will be taken as belonging to the weighted Sobolev space W^{p,k}_w, where C is the unique constant ensuring that 1 ∈ W^{p,k}_w and its weighted integral is equal to 1, and where C_c^∞ is the space of all compactly supported functions with infinitely many derivatives. Furthermore, k is required to satisfy an additional condition.
Remark 2.
In the case where p = 2 and k ≥ 1, the Sobolev space W^{p,k}_w is a reproducing kernel Hilbert space; therefore, point evaluation is a continuous linear functional and, by the (weighted) Morrey-Sobolev theorem of Brown and Opic (1992), it can be embedded within a space of continuous functions. Therefore, given any φ ∈ W^{p,k}_w and any R^d-valued stochastic process β_t, the process (φ(t, β_t, ·))_{t≥0} is a well-defined process in the space of forward-rate curves of Filipović (2001). Analytic tractability is ensured by requiring that the factor models considered for the arbitrage-regularization (9) belong to the class H defined by (21), where w(u) = e^{−|u|^κ}. This class of functions generalizes the Nelson-Siegel family (7) discussed in the introduction. Under these conditions, the following theorem characterizes the asymptotic behaviour of (9) in λ as solving problem (8), for fixed meta-parameters p, κ ≥ 1. Following Filipović (2001), it will be convenient to introduce notation used in the next theorem.

Theorem 2. Let ϕ be given by (7), let β_t be as in (18), and fix p, κ ≥ 1. Then: (i) for every λ ∈ [2/3, 1) there exists an element φ_λ in H minimizing (23), where Λ^u_t(φ) admits a simplified closed-form expression; and (ii) the inclusion (26) holds, where ι_H is as in (16).
It may be convenient to view φ_λ as a function of λ when interpreting approximations of the limit (26). The following result removes the difficulty posed by the unbounded interval [2, ∞), in which λ lies, by reparameterizing problem (23) in terms of a bounded meta-parameter λ̄.
Next, the arbitrage-regularization of forward-rate curves will be considered using deep learning methods.

A Deep Learning Approach to Arbitrage-Regularization
The flexibility of feed-forward neural networks (ffNNs), as described in the universal approximation theorems of Hornik (1991) and Kratsios (2019b), makes the collection of ffNNs a well-suited class of alternative models for the arbitrage-regularization problem. In the context of this paper, an ffNN is any function from R^d to R^{d+1} of the form (28), i.e., a composition of affine maps, defined by weight matrices W_N, ..., W_1, interleaved with componentwise activation functions. The process β_t will be assumed to be a d-dimensional Ornstein-Uhlenbeck process, and in particular will be of the form (18). Therefore, the special class of models considered here is of the form (7).
It has been shown in Rahimi and Recht (2008), among others, that if a network is appropriately designed, then training only the final layer, with the matrices W_N, ..., W_1 suitably initialized, performs comparably to networks with all layers trained. More recently, the approximation capabilities of neural networks with randomized first layers were established in Gonon (2020). This phenomenon has been observed in numerous numerical studies, such as Jaeger and Haas (2004), where the entries of the matrices W_N, ..., W_1 are chosen entirely randomly. This practice has also become fundamental to feasible implementations of recurrent neural network (RNN) theory and reservoir computing, as studied in Gelenbe (1989), where training speed is a key factor in determining the feasibility of the RNN and reservoir-computing paradigms.
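The random-initialization strategy described above can be sketched as a random-feature regression: the hidden weights are drawn once, frozen, and only the output layer is fit by ridge-regularized least squares. The architecture below (tanh activation, the width, and the target function) is an illustrative assumption, not the network of Section 4:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_feature_fit(x_train, y_train, width=300, scale=2.0):
    """Freeze randomly drawn hidden weights and train only the output
    layer, in the spirit of Rahimi and Recht (2008)."""
    d = x_train.shape[1]
    W = rng.normal(0.0, scale, size=(d, width))   # frozen random weights
    b = rng.uniform(-1.0, 1.0, size=width)        # frozen random biases
    H = np.tanh(x_train @ W + b)                  # random hidden features
    # Ridge-regularized least squares on the final layer only.
    coef = np.linalg.solve(H.T @ H + 1e-6 * np.eye(width), H.T @ y_train)
    return W, b, coef

def random_feature_predict(x, W, b, coef):
    return np.tanh(x @ W + b) @ coef

# Fit a smooth curve; only `coef` is learned.
x = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)
y = np.sin(3.0 * x[:, 0])
W, b, coef = random_feature_fit(x, y)
train_mse = np.mean((random_feature_predict(x, W, b, coef) - y) ** 2)
```

Because the hidden layer is fixed, the training step reduces to a single linear solve, which is what makes this initialization scheme attractive when training speed matters.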
The hypothesis class of alternative factor models to be considered in the arbitrage-regularization problem effectively reduces from (28) to (29), where φ is a given factor model of the form (21) and {u_j}_{j=1}^J is a uniform random sample on a non-empty compact subset of U, with J > 0. Thus, the optimization problem (30) is random, since it relies on the randomly generated data points {u_j}_{j=1}^J. However, instead of initializing f in an ad hoc random manner, the initialization (30) guarantees that the shapes generated by (29) are close to those produced by the naive factor model (7). In this case, a brief computation shows that Λ^u_t(φ) simplifies to (31), where F(u) := ∫_0^u f(s) ds, with the integration defined component-wise, and β^i denotes the i-th entry of the vector β.

Numerical Implementations
The data-set for this implementation consists of German bond data for 31 maturities, with observations on 1273 trading days from January 4th, 2010 to December 30th, 2014. As is common practice in machine learning, further details of our code and implementation can be found in Kratsios (2019a). The code is relatively flexible and can be adapted to other bond data-sets.
The arbitrage-regularization methodology will now be applied to two factor models of affine type, and its performance evaluated numerically. The first factor model is the commonly used dynamic Nelson-Siegel model of Diebold and Rudebusch (2013), and the second is a machine-learning extension of the classical PCA approach to term-structure modelling. The arbitrage-regularization of each model will be benchmarked against both the original factor models and the HJM extension of the Vasiček model. The Vasiček model is a natural benchmark since, as shown in Björk and Christensen (1999), it is consistent with a low-dimensional factor model. Therefore, each of the factor models contains roughly the same number of driving factors, which ensures that the comparisons are fair. Moreover, the numéraire process N_t will be taken to be the money-market account, and we take P⋆ = P. The meta-parameter λ is taken to be 1 − 10^{−4}, so that it is approximately 1.
As described in (29)-(31), the solution to the arbitrage-regularization (9) will be numerically approximated using randomly initialized deep feed-forward neural networks. The network f of (29) is selected to have fixed depth N = 5 and fixed height d = d_i = 10^2, and its weights are learned using the ADAM algorithm. The meta-parameters p = 2 and κ = 1 are chosen empirically, and the parameters of the Ornstein-Uhlenbeck process are estimated using maximum likelihood. Once the model parameters have been learned, and the factor model optimizing (9) has been obtained, day-ahead predictions of the stochastic factors are produced through Kalman-filter estimates of the hidden factors β_t for each of the factor models. In the case of the Vasiček model, the unobservable short rate is also estimated using the Kalman filter (see Bain and Crisan (2009)). These day-ahead predictions are then fed into the factor model and used to compute next-day bond prices, which are compared to the realized next-day bond prices.
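The Kalman-filtering step used for the day-ahead factor predictions can be sketched in a deliberately scalar setting. The state-space model and all parameter values below are illustrative assumptions, not the calibrated models of this section:

```python
import math
import random

def kalman_filter(ys, a, q, r, m0=0.0, p0=1.0):
    """Scalar Kalman filter for a latent AR(1) (discretized OU) state
    b_{t+1} = a*b_t + N(0, q), observed as y_t = b_t + N(0, r).
    Returns the filtered state estimates used for day-ahead prediction."""
    m, p = m0, p0
    filtered = []
    for y in ys:
        m, p = a * m, a * a * p + q            # predict
        k = p / (p + r)                        # Kalman gain
        m, p = m + k * (y - m), (1.0 - k) * p  # update with the observation
        filtered.append(m)
    return filtered

# Simulate a noisy factor path and filter it.
rng = random.Random(1)
a, q, r = 0.95, 0.05, 0.5
b, states, obs = 0.0, [], []
for _ in range(2000):
    b = a * b + rng.gauss(0.0, math.sqrt(q))
    states.append(b)
    obs.append(b + rng.gauss(0.0, math.sqrt(r)))
est = kalman_filter(obs, a, q, r)
mse_raw = sum((y - s) ** 2 for y, s in zip(obs, states)) / len(states)
mse_filt = sum((m - s) ** 2 for m, s in zip(est, states)) / len(states)
```

The filtered estimates track the hidden factor substantially better than the raw observations, which is precisely what makes the filtered β_t a reasonable input for next-day bond-price predictions.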
Model 1: dNS (Dynamic Nelson-Siegel Model)
The static Nelson-Siegel model describes the forward-rate curve as

φ(u) = β^1 + β^2 e^{−u/τ} + β^3 (u/τ) e^{−u/τ},   (32)

where, as discussed in Diebold and Rudebusch (2013), the first factor represents the long-term level of the forward-rate curve, the second represents its slope, the third represents its curvature, and τ is a shape parameter, typically kept fixed. Since market conditions are continually changing, the Nelson-Siegel model is typically extended from a static model to a dynamic one by replacing the static choice of β with a three-dimensional Ornstein-Uhlenbeck process and fixing the shape parameter τ > 0, as in Diebold and Rudebusch (2013). However, as demonstrated in Filipović (2001), the dynamic Nelson-Siegel model does not admit a measure equivalent to P^N making the entire bond market simultaneously into local martingales. It was then shown in Christensen et al. (2011a) that a specific additive perturbation of the Nelson-Siegel family circumvents this problem, but empirically this is observed to come at the cost of reduced predictive accuracy. In our implementation, the parameters of the Ornstein-Uhlenbeck process driving β^i_t are estimated using the maximum-likelihood method described in Meucci (2005).
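The maximum-likelihood calibration of Ornstein-Uhlenbeck factors, as in Meucci (2005), reduces to fitting the exact AR(1) transition of an evenly sampled OU path. A minimal sketch with simulated data; the sample size and true parameter values are illustrative assumptions:

```python
import math
import random

def fit_ou_mle(x, dt):
    """Maximum-likelihood fit of an Ornstein-Uhlenbeck process
    dX = kappa*(theta - X) dt + sigma dW from an evenly sampled path,
    via its exact AR(1) transition X_{t+dt} = a*X_t + c + noise."""
    n = len(x) - 1
    sx, sy = sum(x[:-1]), sum(x[1:])
    sxx = sum(v * v for v in x[:-1])
    sxy = sum(u * v for u, v in zip(x[:-1], x[1:]))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # AR(1) slope
    c = (sy - a * sx) / n                           # AR(1) intercept
    resid_var = sum((v - a * u - c) ** 2 for u, v in zip(x[:-1], x[1:])) / n
    kappa = -math.log(a) / dt                       # mean-reversion speed
    theta = c / (1.0 - a)                           # long-run mean
    sigma = math.sqrt(resid_var * 2.0 * kappa / (1.0 - a * a))
    return kappa, theta, sigma

# Recover parameters from a simulated path (kappa=1.0, theta=0.5, sigma=0.3).
rng = random.Random(2)
dt, kappa, theta, sigma = 0.1, 1.0, 0.5, 0.3
a = math.exp(-kappa * dt)
noise_sd = sigma * math.sqrt((1.0 - a * a) / (2.0 * kappa))
x = [theta]
for _ in range(10_000):
    x.append(a * x[-1] + theta * (1.0 - a) + rng.gauss(0.0, noise_sd))
kappa_hat, theta_hat, sigma_hat = fit_ou_mle(x, dt)
```

Because the OU transition density is Gaussian, this AR(1) regression is the exact maximum-likelihood estimator rather than an Euler approximation, which is why it remains accurate at the daily sampling frequencies used here.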

Model 2: dPCA (Machine-Learning Model)
The dynamic Nelson-Siegel model's shape has been developed through practitioner experience. The second factor model considered here is of a different type, with its factors learned algorithmically. As with (32), consider a static three-factor model for the forward-rate curve of the form (33), where φ_1, ..., φ_3 are the first three principal components of the forward-rate curve calibrated on the first 100 days of data. Subsequently, a time-series for the β^i parameters is generated, using the first 100 days of data, where on each day the β^i are optimized according to the Elastic-Net (ENET) regression problem of Hastie et al. (2015), on rolling windows consisting of 100 data points, where {u_{k,t}}_{k=0}^{K_t} are the available data points on the forward-rate curve at time t. The meta-parameters θ_1 and θ_2 are chosen by cross-validation on the first 100 training days and then fixed.
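The ENET regression of Hastie et al. (2015) can be solved by coordinate descent with a soft-thresholding update. The design matrix, ground-truth coefficients, and penalty weights below are illustrative assumptions; a minimal sketch:

```python
import numpy as np

def elastic_net(X, y, theta1, theta2, n_iter=500):
    """Coordinate-descent solver for the elastic-net problem
        min_b  (1/2n)||y - X b||^2 + theta1*||b||_1 + theta2*||b||_2^2."""
    n, d = X.shape
    b = np.zeros(d)
    z = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(d):
            r = y - X @ b + X[:, j] * b[j]     # partial residual excluding b_j
            rho = X[:, j] @ r / n
            # Soft-thresholding update for the j-th coordinate.
            b[j] = np.sign(rho) * max(abs(rho) - theta1, 0.0) / (z[j] + 2.0 * theta2)
    return b

# A sparse ground truth: a heavier theta1 shrinks and selects factors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, 0.0, 0.0, -2.0, 0.0]) + 0.1 * rng.normal(size=200)
b_light = elastic_net(X, y, theta1=0.01, theta2=0.01)
b_heavy = elastic_net(X, y, theta1=1.0, theta2=0.01)
```

The L1 term drives irrelevant loadings to exactly zero while the L2 term stabilizes correlated factors, which is the factor-selection behaviour exploited by the dPCA model.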
The ENET regression is used due to its factor-selection abilities and computational efficiency. Next, analogously to the dynamic Nelson-Siegel model, an R^3-valued Ornstein-Uhlenbeck process β_t is calibrated to the time-series β^{ENET}_t, using the maximum-likelihood methodology outlined in Meucci (2005). These provide the hidden stochastic factors in the dynamic PCA model (33). Thus, the dPCA model is the factor model with stochastic inputs defined by (33). The resulting model differs from the dynamic Nelson-Siegel model in that its factors and dynamics are not chosen by practitioner experience but are learned from the data, and they implicitly encode some path-dependence. However, as with the dynamic Nelson-Siegel model, it falls within the scope of Theorem 2.

Discussion
The predictive performance of the Vasiček (Vasiček), dPCA, A-Reg(dPCA), the dynamic Nelson-Siegel Model (dNS), the arbitrage-free Nelson-Siegel model of Christensen et al. (2011a) (AFNS), and the arbitrage-regularization of the dynamic Nelson-Siegel Model (A-Reg(dNS)) is reported in the following tables.The predictive quality is quantified by the estimated mean-squared errors when making day-ahead predictions of the bond price for each maturity, for all but the first days in our data-set.The lowest estimated mean-squared errors recorded are highlighted using bold font and the second lowest estimated mean-squared errors on each maturity are emphasized using italics.
Table 1 evaluates the performance of the considered models on the short-to-mid end of the curve. Overall, the performance of all the models is comparable at the very short end, but soon after, the dPCA model begins to outperform the rest. The accuracy of the Vasiček model for small maturities is likely due to it being a short-rate model.
In Table 2, the dPCA model outperforms the rest by progressively larger margins. Most notably, in Tables 3 and 4, which summarize the performance of the models for very long bond maturities, the A-Reg(dPCA) model shows very low predictive error for a low number of factors while simultaneously being consistent with no-arbitrage conditions.
An advantage of the A-Reg(dPCA) model is that it can accurately model the long end of the forward-rate curve in an arbitrage-free manner. This is due to the dynamic factor-selection properties of the dPCA model, which could not have been used in an arbitrage-consistent manner were it not for Theorem 2.
The numerical implementation highlights a few key facts about the arbitrage-regularization methodology. First, for nearly every maturity, the empirical performance of the arbitrage-regularization of a factor model is comparable to that of the original factor model. An analogous phenomenon was observed in Devin et al. (2010) when projecting infinite-dimensional arbitrage-free HJM models onto the finite-dimensional manifold of Nelson-Siegel curves. Therefore, correcting for arbitrage does not come at a significant predictive cost. However, it does come with the benefit of making the model theoretically sound and compatible with the techniques of arbitrage-pricing theory.
Second, since (9) incorporates an additional constraint into the modelling procedure, the arbitrage-regularization of a factor model exhibits a reduction in performance compared to the initial factor model. This phenomenon has also been observed empirically in Christensen et al. (2011a) for the arbitrage-free Nelson-Siegel correction of the dynamic Nelson-Siegel model. Therefore, one should not expect to improve on the predictive performance of the initial factor model by correcting for the existence of arbitrage.
Third, the empirical performance of A-Reg(dPCA) was significantly better than that of the other arbitrage-free models, namely AFNS, A-Reg(dNS), and the Vasiček model, across nearly all maturities. This was especially true for mid- and long-maturity zero-coupon bonds. Moreover, the performance of A-Reg(dPCA) and dPCA were comparable. Similarly, for most maturities, the empirical performance of the AFNS, dNS, and A-Reg(dNS) models was similar and notably lower than that of the A-Reg(dPCA), dPCA, and Vasiček models. This emphasizes the fact that the arbitrage-regularization methodology produces performant models only if the original model itself produces accurate predictions. Therefore, it is up to the practitioner to make an appropriate choice of model. However, the methodology used to develop dPCA and A-Reg(dPCA) could be used as a generic starting point.
Since the arbitrage-regularization methodology applies to nearly any factor model, one may use any methodology to produce an accurate reference factor model and then apply arbitrage-regularization to make it theoretically consistent at a small cost in performance. This opens the possibility of applying machine learning models, such as dPCA, to finance without the worry that they are not arbitrage-free, since their asymptotic arbitrage-regularization is well-defined. Furthermore, the flexibility of deep feed-forward neural networks allows for the efficient implementation of (9).
The AFNS model proposes an arbitrage-free correction of the dynamic Nelson-Siegel model. However, there is no guarantee that AFNS corrects dNS optimally, and the predictive gap between these two models is documented in Christensen et al. (2011a). This is echoed in both Tables 2 and 3. Furthermore, it is also reflected by Theorem 2, which guarantees asymptotic optimality of the A-Reg(dNS) model.
Unlike most regularization problems, where there is a trade-off between the regularization term and the (un-regularized) objective function, arbitrage-regularization requires λ to be taken as close to 1 as possible. Since the limit (26) can only be approximated numerically, it cannot be evaluated exactly at λ = 1; however, λ can be taken arbitrarily close to, but less than, 1. This choice is justified by Figures 1 and 2, which illustrate that for values of λ near 1 there is little change in the model's predictive performance. Figures 1 and 2 plot the change in the shape of the day-ahead predicted forward-rate curve and the change in the MSE of the day-ahead predicted bond prices as a function of λ. In those figures, the pink curves correspond to low values of λ and the curves progressing towards blue correspond to high values of λ. Please note that in these plots the reparameterization of Corollary 1 is used and, with a slight abuse of notation, the reparameterized penalty parameter is still denoted by λ.
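As a toy illustration of this limiting behaviour, consider minimizing a convex combination (1 − λ)·(fit loss) + λ·(penalty) over a one-dimensional parameter and sweeping λ towards 1. The functions below are hypothetical stand-ins, not the paper's objective: the fit loss measures distance to a reference model at x = 1, while the penalty vanishes only at the "arbitrage-free" point x = 0, so the minimizer drifts from 1 towards 0 as λ → 1.

```python
import numpy as np

def fit_loss(x):
    # Toy distance to the reference (possibly arbitrage-admitting) model at x = 1.
    return (x - 1.0) ** 2

def af_penalty(x):
    # Toy stand-in for the arbitrage penalty: zero only at the arbitrage-free point x = 0.
    return x ** 2

def penalized_minimizer(lam, grid):
    # Grid-minimize the convex combination (1 - lam) * fit + lam * penalty.
    vals = (1.0 - lam) * fit_loss(grid) + lam * af_penalty(grid)
    return grid[np.argmin(vals)]

grid = np.linspace(-2.0, 2.0, 40001)  # spacing 1e-4
for lam in [0.5, 0.9, 0.99, 0.999]:
    x_star = penalized_minimizer(lam, grid)
    # For this quadratic toy the closed-form minimizer is x = 1 - lam.
    print(f"lambda={lam}: minimizer ~ {x_star:.4f} (closed form {1.0 - lam:.4f})")
```

For this quadratic toy the first-order condition gives the minimizer x = 1 − λ exactly, mirroring how, in the paper, λ is pushed arbitrarily close to (but below) 1 rather than evaluated at the limit.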
In the case of the dNS model, an interesting property is that long-maturity bond prices do not change much under arbitrage-regularization, whereas short-maturity bond prices exhibit more dramatic changes. This property suggests that the dNS model is closer to being arbitrage-free on the long end of the curve than it is on the short end.

This paper introduced a novel model-selection problem and provided an asymptotic solution in the form of the penalized optimization problem (9). The problem was posed and solved in a generalized HJM-type setting in Theorem 1, and specialized to the term-structure of interest rates setting in Theorem 2, where simple expressions for the penalty term were derived.
The key innovation of the paper was the construction of the penalty term AF λ defining the arbitrage-regularization problem (9). The construction of this term in Proposition 1 relied on the structure of the generalized HJM-type setting proposed in Heath et al. (1992) and generalized in (4), which allowed the dynamics of a large class of factor models with stochastic inputs to be encoded into the specific structure of any asset class.
The numerical feasibility of the proposed method was made possible by the flexibility of feed-forward neural networks, as demonstrated in Hornik (1991); Kratsios (2019b), which allowed the optimizer of the arbitrage-regularization problem (9) to be approximated to arbitrary precision. In the numerics section of this paper, it was found that the arbitrage-regularization of a factor model does not heavily impact its predictive performance but does make it approximately consistent with no-arbitrage requirements.
In particular, the compatibility of the proposed approach with generic factor models with stochastic inputs allowed for the consistent use of factor models generated from machine learning methods. The A-Reg(dPCA) model is a novel example of such an approximately arbitrage-free model where the dynamics and factors were generated algorithmically instead of through practitioner experience.
The precise quantification of the approximate arbitrage-free property was made in Proposition 2. Thus, factor models that are only approximately arbitrage-free under the stylized assumption of no transaction costs are indeed arbitrage-free once proportional transaction costs, a more realistic assumption, are in place.
Finally, the arbitrage-regularization approach introduced in this paper opens the door to the use of predictive machine-learning factor models compatible with the no-free-lunch-with-vanishing-risk condition. The general treatment in Theorem 1 can be transferred to other asset classes and to models generated by other learning algorithms. This approach can become an important new avenue of research lying at the junction of predictive machine learning and mathematical finance.
Some relevant technical background is briefly discussed within this appendix. These topics include related aspects of the functional Itô calculus introduced in Dupire (2009) and developed in Cont and Fournié (2013). For a path x t ∈ Λ d, its horizontal extension by h > 0 is defined by

x t,h (u) := x(u) for 0 ≤ u ≤ t, and x t,h (u) := x(t) for t < u ≤ t + h, (A1)

and its height h > 0 vertical extension is defined by

x t h (u) := x(u) for 0 ≤ u < t, and x t h (t) := x(t) + h. (A2)

For a functional from Λ d to R, its vertical and horizontal derivatives on the path x t ∈ Λ d are defined by infinitesimally extending the path in S(x t , v t ) either vertically or horizontally using (A1) and (A2). However, since the calculus should not look into the future, only non-anticipative functionals are to be considered.
Definition A1 (Non-Anticipative Functional; (Fournie 2010, Definition 2.1)). A non-anticipative functional is a family of functionals S := {S t } t∈[0,T] where each S t depends on a path (x, v) only through its restriction to [0, t].

Analogously to classical calculus, the limiting ratio between the difference of S(x t , v t ) and its extensions defines its horizontal and vertical derivatives at any given time 0 ≤ t, respectively, by

DS(x t , v t ) := lim h↓0 [S(x t,h , v t,h ) − S(x t , v t )]/h, ∇S(x t , v t ) := ( lim h→0 [S(x t h e i , v t ) − S(x t , v t )]/h ) i=1,...,d , (A3)

where the limits defined in (A3) are taken in Λ d with respect to a metric d ∞ built from the usual Euclidean norm ‖x‖ on the path component and the Frobenius norm ‖v‖ F on the volatility component. As will be seen shortly, the horizontal and vertical derivatives extend the time and spatial derivatives from ordinary calculus. However, some technical remarks must first be addressed.
In general, these path derivatives are not defined for every non-anticipative functional S : Λ d → R; moreover, even when they are, analogously to the classical calculus, there is no guarantee that the vertical (resp. horizontal) derivative is continuous with respect to d ∞. Analogously to the traditional Itô calculus, the functionals for which one can derive a useful Itô formula are those admitting one continuous horizontal derivative DS(x t , v t ) and two continuous vertical derivatives; i.e., those for which ∇S(x t , v t ) and ∇ 2 S(x t , v t ) := ∇(∇S(x t , v t )) are both continuous.
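Two standard examples (in the simpler setting where the functional depends on the path x alone) may help fix intuition for these derivatives; these examples are illustrative and do not appear in the original text:

```latex
% (i) Current value squared: S(x_t) = x(t)^2.
%     A vertical bump of height h gives S(x_t^h) = (x(t)+h)^2, so
\nabla S(x_t) = 2\,x(t), \qquad \nabla^2 S(x_t) = 2,
% while a horizontal extension leaves the current value unchanged, so
\mathcal{D} S(x_t) = 0.
%
% (ii) Running integral: S(x_t) = \int_0^t x(s)\,ds.
%     A vertical bump changes x only at the single time t (a null set), so
\nabla S(x_t) = 0, \qquad \nabla^2 S(x_t) = 0,
% while a horizontal extension of length h adds approximately h\,x(t), so
\mathcal{D} S(x_t) = x(t).
```

Example (i) behaves like a purely "spatial" function and example (ii) like a purely "temporal" one, matching the claim that the vertical and horizontal derivatives extend the spatial and time derivatives of ordinary calculus.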
However, for a tractable extension of the Itô formula with respect to S to be possible, it is additionally required that S be boundedness-preserving. Here, a functional S : Λ d → R is said to be boundedness-preserving if, for every non-empty compact subset K ⊆ R d and every t 0 ∈ [0, T], there exists a constant C K,t 0 > 0 such that |S(x t , v t )| ≤ C K,t 0 for every t ≤ t 0 and every path x t taking values in K. The collection of all functionals S : Λ d → R which are boundedness-preserving and have continuous, boundedness-preserving derivatives ∇S(x t , v t ), DS(x t , v t ), and ∇ 2 S(x t , v t ) at every path x t ∈ Λ d is denoted by C 1,2.
(iii) (Stability Under Relaxation): For every n ∈ N, let F̃ n be a function from X to R ∪ {∞} satisfying lsc F n ≤ F̃ n ≤ F n , where lsc F n is the largest lower semi-continuous function dominated by F n , point-wise; then {F̃ n } n∈N and {F n } n∈N have the same Γ-limit, whenever it exists.
The first of the two critical ingredients in Theorem A2 was the lower semi-continuity of the objective and the second was its coerciveness. Analogously to the definition of equi-continuity, when working with a sequence of functions {F n } n∈N , in order to apply the machinery analogous to Theorem A2 we require that there exists a non-empty compact subset K of X satisfying

inf x∈X F n (x) = inf x∈K F n (x) for every n ∈ N. (A8)

The property described by (A8) is called mild equi-coerciveness. A stronger condition, which we will also make use of, is equi-coerciveness, which states that for every t > 0 there exists a compact subset K t of (X, d) such that {x ∈ X : F n (x) ≤ t} ⊆ K t for every n ∈ N. The central result in the theory of Γ-convergence is the following extension of Theorem A2.
Theorem A4 (The Fundamental Theorem of Γ-Convergence; (Braides 2002, Theorem 2.10), (Focardi 2012, Theorem 2.1)). If {F n } n∈N is a mildly equi-coercive sequence of functions from X to R ∪ {∞} for which the Γ-limit F exists in X, then F attains its minimum and

min x∈X F(x) = lim n→∞ inf x∈X F n (x).

If, moreover, {F n } n∈N is equi-coercive, then every cluster point of any sequence {x n } n∈N of approximate minimizers, i.e., satisfying F n (x n ) ≤ inf x∈X F n (x) + ε n with ε n ↓ 0, is a minimizer of F. This section closes with the precise definition of convergence in the Γ-sense.
Definition A2 ((Dal Maso 1993, Chapter 4)). Let {F n } n∈N be a sequence of R ∪ {∞}-valued functions on a metric space (X, d). A function F is the Γ-limit of {F n } n∈N if and only if, for every x ∈ X, both of the following hold:

(i) F(x) ≤ liminf n→∞ F n (x n ) for every sequence {x n } n∈N converging to x;
(ii) F(x) ≥ limsup n→∞ F n (x n ) for some sequence {x n } n∈N converging to x.
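A toy numerical illustration of the fundamental theorem may be useful. The sequence below is a hypothetical example, not taken from the paper: F_n(x) = (x − 1)² + sin(nx)/n converges uniformly (hence in the Γ-sense) to F(x) = (x − 1)², and is equi-coercive since F_n(x) ≥ (x − 1)² − 1/n; the theorem then predicts inf F_n → min F = 0 with approximate minimizers accumulating at x = 1.

```python
import numpy as np

# F_n(x) = (x - 1)^2 + sin(n x) / n: uniformly convergent perturbations of
# F(x) = (x - 1)^2, so the Gamma-limit is F and Theorem A4 applies.
def F_n(x, n):
    return (x - 1.0) ** 2 + np.sin(n * x) / n

grid = np.linspace(-3.0, 5.0, 200001)  # spacing 4e-5
for n in [1, 10, 100, 10000]:
    vals = F_n(grid, n)
    i = int(np.argmin(vals))
    # inf F_n approaches min F = 0 and the grid minimizer approaches x = 1.
    print(f"n={n}: inf F_n ~ {vals[i]:.5f}, argmin ~ {grid[i]:.4f}")
```

Note that for small n the perturbation sin(nx)/n visibly shifts both the infimum and the minimizer, while for large n both converge, which is exactly the content of Theorem A4 under equi-coercivity.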

Appendix B. Proofs
Proof of Proposition 1. For legibility, for each u ∈ U, we represent the process φ(t, β t , u) by φ u t. By the Functional Itô Formula, (Cont and Fournié 2013, Theorem 4.1), it follows that the dynamics (A9) hold for every u ∈ U. From (A9) it follows that, for each u ∈ U, the price processes X t (u) are P N -local-martingales if and only if the drift in (A9) vanishes. Next, the quantities a u and γ u are described. By the usual Itô formula, it follows that (A11) holds for each u ∈ U, and (A11) identifies these quantities. Combining (A28) and (A29) yields the claim for every 0 ≤ T and every finite number of u 1 , . . ., u n ∈ U.
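The functional Itô formula plays the same role here that the classical Itô formula plays in the standard drift-condition argument; the following sketch of that classical analogue is added for orientation and is not part of the original proof:

```latex
% Classical analogue of the drift condition used above: for a continuous Ito process
dX_t = a_t\,dt + \gamma_t\,dW_t ,
% with a and gamma progressively measurable and suitably integrable,
% X is a local martingale under P if and only if
a_t = 0 \quad dt \otimes d\mathbb{P}\text{-a.e.}
% In the functional setting, applying the functional Ito formula to
% S(\phi^u_t, [[\phi^u]]_t; u) isolates an analogous drift term for each maturity u,
% and requiring that drift to vanish simultaneously for every u in U yields the
% local-martingale characterization of the proposition.
```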
Since the domain I × R d × U has a smooth boundary and since k was assumed to satisfy (20), the (weighted) Morrey-Sobolev Theorem of Brown and Opic (1992) applies. Therefore, W p,k w (I × R d × U ) can be continuously embedded within C 2,2,2 (I × R d × U ), and therefore Assumption 1 (ii) holds. Furthermore, by (6) together with (Cont and Fournié 2013, Example 1), each S t (•, •; u) satisfies Assumption 1 (iii); thus Assumption 1 is satisfied.
Next, (26) is reformulated in terms of Theorem 1 and Assumption 2 is verified. Subsequently, the optimizers of the objective function under the limit on the left-hand side of (26) are shown to exist for λ ≥ 2.
It will be convenient to work under the parameterization of Musiela and Rutkowski (1997), in which the constants c i are defined by c i := φ i (0) for i = 0, . . ., d. Equation (A30) is satisfied, P N -a.s., for every 0 ≤ t ≤ u and every u ∈ U, if and only if Λ u t (φ) = 0 P N -a.s., where Λ u t (φ) is defined through the terms ζ k;i,j Φ i (u)Φ j (u) in (A34). These terms are continuous in u; moreover, they are continuous in t since they are constant in t. Therefore, (t, u) → Λ u t (φ) is continuous for every φ ∈ H; whence Assumption 2 (i) holds. Next, Assumption 2 (ii) will be verified. Given the dynamics of (18), (Filipović 2009, Proposition 9.3) characterizes all φ 0 , . . ., φ d for which the forward-rate curve (21) corresponds, through (6), to a bond market in which each bond price is a P N -local-martingale; all such φ 0 , . . ., φ d are solutions to the differential Riccati system (A35), where, as before, {Φ i } d i=0 and φ are related through (21) and (22). Differentiating the Riccati system (A35) with respect to u yields an equivalent differential system of the form (A36). Therefore, (A36) can be rewritten as

{φ ∈ H : (∀u ∈ U ) S t (φ u t , [[φ u ]] t ; u) is a P N -local-martingale} = {φ ∈ H : {Φ i } d i=0 solves (A37) for some c 0 , . . ., c d ∈ R}.

(A38)
Since C 2 (R; R d+1 ) is equipped with the topology of uniform convergence on compacts of the functions and their first two derivatives, it follows that the right-hand side of (A38) is closed in C 2 (R; R d+1 ); whence it is closed in the relative topology on H ⊆ C 2 (R; R d+1 ). Thus, Assumption 2 (ii) holds.
Lastly, condition (19) implies that, for every φ ∈ H and every λ ≥ 2, the integral (A39), built from the terms ζ k;i,j Φ i (u)Φ j (u), is finite. Thus, for every λ ≥ 2, the map φ → AF λ (φ) is continuous from W p,k w (I × R d × U ) to [0, ∞). Furthermore, since the loss function ℓ is continuous, for every λ ≥ 2 the function φ → ℓ(ϕ − φ) + AF λ (φ) is continuous. Moreover, since both ℓ and AF λ are bounded below by 0 and finite-valued, so is ℓ(ϕ − •) + AF λ (•). Lastly, since ℓ is coercive, by definition, for every r ≥ 0 there exists a compact subset K r ⊆ H satisfying

{φ ∈ H : ℓ(ϕ − φ) ≤ r} ⊆ K r . (A41)

Therefore, the non-negativity of each AF λ implies that, for every λ ≥ 2 and every r ≥ 0, the sub-level set {φ ∈ H : ℓ(ϕ − φ) + AF λ (φ) ≤ r} is also contained in K r .

4.2.1. Model 1: The Dynamic Nelson-Siegel Model (Practitioner Model)

The Nelson-Siegel family is a low-dimensional family of forward-rate curve models used by various central banks to produce forward-rate or yield curves. As discussed in Carmona (2014), Finland, Italy, and Spain are such examples, with other countries such as Canada, Belgium, and France relying on a slight extension of this model. The Nelson-Siegel model's popularity is largely due to its interpretable factors and satisfactory empirical performance. It is defined by a low-dimensional family of exponential-polynomial forward-rate curves.
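For reference, the standard Nelson-Siegel parameterization of the instantaneous forward-rate curve is recalled below; conventions and normalizations vary across references, and the paper's exact display may differ:

```latex
f_t(\tau) \;=\; \beta_{0,t} \;+\; \beta_{1,t}\, e^{-\lambda \tau}
\;+\; \beta_{2,t}\, \lambda \tau\, e^{-\lambda \tau},
% where tau is the time to maturity, beta_{0,t} is the level factor,
% beta_{1,t} the slope factor, beta_{2,t} the curvature factor,
% and lambda > 0 a decay parameter.
```

The interpretability mentioned above comes from this decomposition: the three factors load on a constant, a decaying exponential, and a hump-shaped term, capturing the level, slope, and curvature of the curve.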

Figure 1. Day-ahead predictions as a function of λ across given maturities for the A-Reg(dNS) model. (a) Average day-ahead predicted yield curves. (b) Estimated MSE of day-ahead bond price predictions.

Figure 2. Day-ahead predictions as a function of λ across given maturities for the A-Reg(dPCA) model. (a) Average day-ahead predicted yield curves. (b) Estimated MSE of day-ahead bond price predictions.
L p ν⊗µ (I × R d × U ): space of (equivalence classes of) p-integrable functions with respect to ν ⊗ µ, normed by • p,ν⊗µ .

By Theorem 2, since each S s ( φ̂ u i s , [[ φ̂ u i ]] s ; u i ) is a P N -local martingale and, by construction, P N ∼ P, the Fundamental Theorem of Asset Pricing of (Delbaen and Schachermayer 1994) implies that the resulting bond market admits no free lunch with vanishing risk.

Proof of Theorem 2. By definition of (R d , g t ), β t , and H, Assumptions 1 (i) and (iv) hold; thus only Assumptions 1 (ii) and (iii) must be verified in order to ensure that the stated problem falls within the scope of this paper. Let µ := μ ⊗ υ, where υ is the unique probability measure with Lebesgue density proportional to e − β κ on R d , and where dν/dm (t) = e −|t| 1 [0,∞) (t); here 1 [0,∞) is the (probabilistic, not convex-analytic) indicator function of the interval [0, ∞) and m is the Lebesgue measure on R. Therefore, the bond prices induced by the elements of W p,k w (I × R d × U ) are P N -local martingales if and only if, for every u ∈ U, every 0 ≤ t ≤ u, and P N -a.e. ω ∈ Ω, the following holds:

0 < X t (u) = exp(− ∫ u t φ(t, β t , s) ds) = exp(− ∫ τ 0 φ(t, β t , v) dv),

where τ = u − t and v = u − s for t ≤ s ≤ u. Under this reparameterization, in the case where each S t (•, •; u) is as in (6), it is shown in (Filipović 2009, Proposition 9.3) that, for each u ∈ U and 0 ≤ t, the bond prices S t (φ u t , [[φ u ]] t ; u) are each P N -local-martingales, simultaneously for every u ∈ U, if and only if (11) holds P N -a.s., simultaneously for every u ∈ U.
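The Riccati-type characterization of local-martingale bond prices invoked above can be illustrated in the simplest affine case. In the Vasiček short-rate model (one of the benchmarks in the numerics), bond prices are exponential-affine, P(t, t + τ) = exp(A(τ) − B(τ) r_t), and the exponent B solves the scalar linear Riccati-type ODE B'(τ) = 1 − κB(τ), B(0) = 0, with closed form B(τ) = (1 − e^{−κτ})/κ. The sketch below (with an arbitrary illustrative value κ = 0.5) checks the closed form against a direct Euler integration of the ODE:

```python
import numpy as np

# Vasicek model: the bond-price exponent B solves B'(tau) = 1 - kappa * B(tau),
# B(0) = 0, with closed-form solution B(tau) = (1 - exp(-kappa * tau)) / kappa.
kappa = 0.5  # illustrative mean-reversion speed, not a calibrated value

def B_closed(tau):
    return (1.0 - np.exp(-kappa * tau)) / kappa

# Explicit Euler integration of the ODE as a sanity check.
n_steps, T = 100000, 10.0
dt = T / n_steps
B = 0.0
for _ in range(n_steps):
    B += dt * (1.0 - kappa * B)

print(f"B(10): Euler ~ {B:.6f}, closed form {B_closed(10.0):.6f}")
```

The quadratic term that makes the system genuinely Riccati appears only in the A(τ) component for Gaussian models like Vasiček; in the general affine setting of (A35) the coefficients Φ_i satisfy coupled quadratic ODEs of the same flavor.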