Solving Schrödinger Bridges via Maximum Likelihood

The Schrödinger bridge problem (SBP) finds the most likely stochastic evolution between two probability distributions given a prior stochastic evolution. As well as applications in the natural sciences, problems of this kind have important applications in machine learning such as dataset alignment and hypothesis testing. Whilst the theory behind this problem is relatively mature, scalable numerical recipes to estimate the Schrödinger bridge remain an active area of research. Our main contribution is the proof of equivalence between solving the SBP and an autoregressive maximum likelihood estimation objective. This formulation circumvents many of the challenges of density estimation and enables direct application of successful machine learning techniques. We propose a numerical procedure to estimate SBPs using Gaussian process and demonstrate the practical usage of our approach in numerical simulations and experiments.


Introduction
Analysis of cross-sectional data is ubiquitous in machine learning and science. Temporal data are typically sampled at discrete intervals due to technological or physical constraints. This means information between time points is lost. This motivates the need to model the stochastic evolution of a process between sampled time points. The classical Schrödinger bridge problem [1,2] finds the most likely stochastic process that evolves a distribution π 0 (x) to another distribution π 1 (y) consistently with a pre-specified Brownian motion. We consider a more general dynamical Schrödinger bridge problem for any pre-specified diffusion prior. Practically, this generalization allows us to exploit domain knowledge, e.g., oceanic and atmospheric flows might be interpolated from empirical measurements using previously established dynamics as priors.
In the classical set up numerical approaches to solve the Schrödinger bridge are mainly based on the Sinkhorn-Knopp algorithm [3]; however, extending those algorithms to more general diffusion priors and marginals requires complex adaptations. We introduce an iterative proportional maximum likelihood (IPML) algorithm to solve the general Schrödinger bridge problem. The IPML algorithm also obtains a good approximation for the dynamics of the underlying physical process that solves the SBP. This contrasts to previous approaches [4][5][6] that estimate the value that extremises the SBP objective. Practically, this means we obtain physically interpretable solutions that we can leverage for downstream tasks.
IPML's inspiration is from probabilistic numerics (PN) [7]. We combine a PN styled formulation with the iterative proportional fitting procedure (IPFP) [8,9]. The algorithm iteratively simulates trajectories that converge to the SBP. We prove that IPML converges in probability at each iteration. Our numerical experiments show the algorithm can be implemented with Gaussian process (GP) models of the drifts. GPs allow us to incorporate functional prior information. We demonstrate the practical use of our algorithm on realworld embryoid cells data (see Figure 4) by quantitatively and qualitatively comparing the performance of our algorithm to state-of-the-art deep learning methods and optimal transport techniques. To summarise, the main contributions of our work are: • We recast the iterations of the dynamic IPFP algorithm as a regression-based maximum likelihood objective. This is formalised in Theorem 1 and Observation 1. This differs to prior approaches such as [10] where their maximum likelihood formulation solves a density estimation problem. This allows the application of many regression methods from machine learning that scale well to high dimensional problems. Note that this formulation can be parametrised with any method of choice. We chose GPs; however, neural networks would also be well suited, • We solve the aforementioned regression objectives using GPs [11] motivated by the connection between the drift of stochastic differential equations and GPs [12], • We provide a conceptual comparison with the approach by [10] and detail why the density estimation formulation in [10] scales poorly with dimension, • Finally we re-implement the approach by [10] and compare approaches across a series of numerical experiments. Furthermore we empirically show how our approach works well in dimensions higher than 2.

Technical Background
Our solution has three components. (1) We reformulate the SBP as a dynamical system giving a stochastic differential equation (SDE) with initial value (IV) and final value (FV) constraints [13]. (2) We reverse the system, reformulating the FV constraint as an IV constraint. Both IV problems are solved through a stochastic control formulation (Section 2.1.2). (3) We iterate the IV and FV constrained problems to converge to the full boundary value constrained SDEs.

Dynamic Formulation
The dynamic version of the Schrödinger bridge is written in terms of measures over the space of trajectories, which describe the stochastic dynamics defined over the unit interval.
where Q ∈ D(π 0 , π 1 ) is a measure with prescribed marginals of π 0 , π 1 at times 0, 1 that is (X 0 ) # Q = π 0 and (X 1 ) # Q = π 1 . Q γ 0 is a drift augmented Brownian motion with a scalar volatility γ acting as the prior, see Figure 1. Traditionally Q γ 0 :=W γ is a Wiener measure with volatility γ, however we consider the general setting in this work. The prior Q γ 0 can be written as a solution of the diffusion: From here we use b 0 to denote the drift of the prior Q γ 0 . Note that a finite KL implies Q and Q γ 0 are both Itô SDEs with volatility γ. The diffusion coefficient γ is a time homogeneous constant and for the KL-divergence to be finite the process Q must have γ as its diffusion coefficient.

Time Reversal of Diffusions
When sampling solutions, IV constraints are trivially solved by initialisation of samples, but FV constraints are more problematic; however, if we can reformulate the SBP as a time reversed diffusion the FV constraint becomes an IV constraint (Section 3.2). Here we review how the diffusion is reversed. We reverse time (see Figure 1) in the random variable x(t) described by the diffusion in Equation (2) such that x − (t) = x(1 − t). The reverse time diffusion x − (t) is also an Itô process. For a modern application of time reversal in machine learning see [14].
where β − (t) is a Brownian motion adapted to the reverse filtration (F − i ) i∈T , that is F − t ⊆ F − s , s ≤ t. Furthermore, the dual drift u − (x, t) satisfies Nelson's duality relation: where p(x, t) solves the associated Fokker-Planck equation.

Stochastic Control Formulation
Now that the FV constraint has been converted to an IV constraint, we cast the problem into a stochastic control formulation to estimate the drift of each diffusion process. Following from the dynamic formulation, the control formulation casts the problem explicitly in terms of stochastic differential equations. The control formulation is used to enforce constraints as initial value problems. Furthermore, the drift based formulations of the SBP admit a reverse time formulation, which starts the chain at the end of the interval and progresses the dynamics backwards in time to the start. Lemma 2 ([20]). Let the measure Q be defined by solutions to the SDE dx Then the KL divergence D KL Q Q γ 0 can be decomposed in terms of either the forward or reversed diffusion as where x − (t) is the time reversal of x + (t), and x + (1) ∼ π Q 1 .
Proof. This theorem follows by a direct application of the disintegration theorem (Appendix B) followed by Girsanov's theorem. A detailed proof can be found in [20].
Using the above decompositions, we can solve the SBP by minimising either decomposition in Equation (4) over the space of random processes b ± (t) that satisfy a valid Itô SDE drift. The backwards and forwards objectives are respectively: min Q∈D(π 0 ,π 1 ) While we do not directly use the stochastic control formulations, the drift-based formulation serves as inspiration for our iterative scheme. Specifically, the existence and parametrisation of an optimal drift as shown in [10,20].
This Lemma is key in formulating our ML approach to IPFP since it justifies our parametrisation of the drift in terms of a deterministic function u ± * i.e., b ± * (t) = u ± * (x ± (t), t). For a brief introduction to the Schrödinger system and potentials see Appendix A.

Iterative Proportional Fitting Procedure
We now have two boundary value constrained diffusion processes solved through a stochastic control formulation. We use the iterative proportional fitting procedure (IPFP) to alternate between the forward and backward formulation with only one initial value constraint enforced at a time (see Figure 2), such that convergence to the full boundary value problem is guaranteed. Algorithm 1: Iterative Proportional Fitting (IPFP) [21], for the traditional SBP with Brownian prior set Q γ 0 := W γ . input : π 0 (x), π 1 (y), Q γ The measure theoretic version of IPFP we introduce is an extension of the continuous IPFP, initially proposed by [8] to a more general setting over general probability measures (see [21,22]). The convergence of IPFP has been shown in [9] and further extensions and results have been presented in [21,22]. The idea behind this family of approaches is to alternate minimising KL between the two marginal distribution constraints, until convergence. The quantities P * i and Q * i in the algorithm of Figure 2 are known as half bridges [10] and can be expressed in closed form in terms of known quantities, where p and p P * i 0 are the marginals of Q * i−1 and P * i at times 1 and 0, respectively. While a closed form expression to estimate half bridges is known, its components are usually not available in closed form and require approximations. Note that the time reversed formulations (Section 2.1.2) and a simpler variant of Lemma 3 also apply to the half bridge problems. Applying the disintegration theorem (see Appendix B) we can reduce the IPFP introduced to the more popular instance of IPFP proposed by [8]. Furthermore for the case of discrete measures this algorithm reduces to the Sinkhorn-Knopp algorithm [3].

Methodology
In this section we introduce IPML by providing the theoretical foundations of our algorithm and convergence guarantees. Then, we propose a practical implementation of IPML based on a Bayesian non-parametric model (GP). These components allow us to solve the general SBP problem.

Approximate Half Bridge Solving as Optimal Drift Estimation
We present a novel approach to approximately solve the empirical Schrödinger bridge problem by exploiting the closed form expressions of the half bridge problem. Rather than parametrising the measures in the half bridge and solving the optimisation numerically, we seek to directly approximate the measure that extremises the half bridge objective. We achieve this by using Gaussian processes [11] to estimate the drift of the trajectories sampled from the optimal half bridge measure.
We start from the following observation, which tells us how to sample from the optimal half bridge distribution (see Appendix C for proof).

Observation 1.
We can parametrise a measure Q with its drift as either of the SDEs, Then, we can sample from the solution to the half bridges, via simulating trajectories (e.g., using the Euler-Maruyama (EM) method) following the SDEs Solutions to the above SDEs are distributed according to P * − and P * + , respectively.
Intuitively, we are performing a cut-and-paste-styled operation by cutting the dynamics of the shortened (unconstrained) time interval and pasting the constraint to it at the corresponding boundary. be sampled/ discretised trajectories from the SDE that represents the half bridge measure P + : Then carrying out maximum likelihood estimation of the time reversed drift u 0 − on time reversed samples: where u − is the drift of our estimator SDE: converges in probability to the to the true dual drift is the optimal half bridge drift for P + as N → ∞, ∆t → 0 (see Appendix D for proof).
Note that w.l.o.g. the above result also holds for estimating the forward drift from backward samples. The combination of Observation 1 and Theorem 1 constitute one the main contributions of this work as they allow us to solve half bridges with a simple regression objective. Finally it is important to highlight that in practice for most interesting SDEs we can only sample approximately using schemes such as the Euler Mayurama (EM) method, for these settings our proof of Theorem 1 does not hold, however we believe it may be possible to extend/adapt the result for the approximate EM sample case.

On the Need for Time Reversal
We have already provided the intuition as to how time reversal enables the exchange of initial value constraints for final value constraints. In what follows we provide a more detailed technical review of this important simplification. Consider the first half bridge problem for a given IPFP iteration, P * − = arg inf P∈D(·,π 1 ) D KL (P||Q). When formulated without time reversal in terms of the forward drift and the forward diffusion it reduces to a stochastic control problem which is subject to dynamics, which is a forward SDE with a terminal hitting condition. Unlike a forward SDE with an initial value problem the above SDE is not trivial to sample from, one approach would be to solve the backwards Kolmogorov equation subject to the terminal condition; however, this approach requires (1) solving a parabolic PDE that typically involves mesh-based methods that do not scale well in high dimensions and (2) carrying out density estimation on the samples from π 1 , also problematic in high dimensions; however, if we consider the time reversed process, the terminal condition becomes an initial value prob- for which it is easy to sample consistent trajectories via the Euler-Maruyama (EM) discretisation without mesh-based methods or additional density estimations.

Iterative Proportional Maximum Likelihood (IPML)
Combining the KL minimisation routines in the original IPFP algorithm with our MLE-based drift estimation provides Figure 3. The routine DriftFit fits a dual drift on the EM sampled trajectories, and can be parametrised by any function estimation procedure with consistency guarantees. The SDESolve routine generates M trajectories using the EM method. Dist(prior, learned bridge) GP prior = 0 N=10 GP prior = 0 N=20 GP prior = 0 N=40 GP prior = 1 N=20 GP prior = 1 N=40

Drift Estimation with Gaussian Processes:
One can choose any parametric or non parametric model to carry out the DriftFit routine. In this section, we briefly introduce our implementation. Our DriftFit routine is based on the work in [12,23] that uses GPs to estimate the drift of SDEs from observations. We refer to this routine as GPDriftFit. We can restate the regression problems given by Equation (8) and its reversed counterpart in the following form: where ∼ N (0, I d ). Placing GP priors on the drift functions u − ∼ GP and u + ∼ GP, we arrive at standard multi-output GP formulation. Following [12,24], we assume the dimensions of the drift function are independent (assuming that the drift dimensions are decoupled is the least restrictive assumption as coupling them would impose a form of regularization). (equivalent to imposing a block-diagonal kernel matrix on a multioutput GP) and thus we fit a separate GP for each dimension. See Appendix E.1 for the specification of the predictive means. Similar to [12,23], we are not making use of the predictive variances (using the predictive mean as an estimate for the drift can also be interpreted as a form of kernel ridge regression under the empirical risk minimisation framework). Instead, we simply use the predictive mean as an estimate of the drift and subsequently use that estimate to perform the EM method, thus effectively we could interpret this approach as a form of kernel ridge regression under the empirical risk minimisation framework.
Note that the advantage of the Bayesian GP interpretation of this procedure is that the GP posterior under certain conditions is naturally conjugate to the posterior of the SDE drift when modelled with a GP prior [23,25]. Furthermore, the Bayesian non-parametric formulation allows for the encoding of the prior drift function as the mean function of the GP prior (i.e., u ∼ GP (b Q γ 0 0 , k)). This becomes helpful when trying to sample unlikely paths and evaluating the fitted drift at samples that were not observed by the GP during training, here we want the fitted drift to fall back to the prior rather than to a Brownian motion given by a GP prior with 0 mean.
We now have the relevant ingredients to carry out IPML as specified in Figure 3. The computational cost of IPML with a Gaussian process is detailed in Appendix H.5. The majority of the computation is spent fitting the GP in DriftFit at each iteration and scales as a function of the time discretization used as well as the size of π 0 and π 1 .

Related Methodology
In this section we carry out a conceptual comparison with two pre-existing numerical approaches for solving the static SBP. While our goal is to solve the dynamic SBP, the solution of the static SBP can be used to construct that of the dynamic SBP [10], and this connection is central to our discussion. We would also like to highlight that an algorithm akin to IPML has been proposed concurrently and independently by [26], the main difference with our algorithm is that they estimate the drifts of the SDEs using neural networks score matching while we use using Gaussian processes and maximum-likelihoodbased ideas.

Sinkhorn-Knop Algorithms
Within the machine learning community the static SBP with a Brownian motion prior is popularly known as entropic optimal transport [4]. In this formulation there are no trajectories and the empirical distributions are treated as discrete measures [π 0 ] j = 1 N and [π 1 ] i = 1 M thus the objective is given by min Q∈D(π 0 ,π 1 ) Q, where h is the discrete entropy, ·, · computes the dot product between two matrices. The cost matrix C corresponds to the log transition density induced by the prior SDE: C , which in the case of a Brownian motion prior reduces to a scaled Euclidean distance.
Once the problem has been discretized as described, the Sinkhorn-Knopp algorithm [3] can be applied directly to fit an optimal discrete transport map (and discrete SBP potentials φ,φ) between the two distributions. Recent work [6] has showed empirical success in forming a continuous approximation of the SBP potentials using the logsumexp formula [6]; however, it still remains to formally analyse the accuracy of the logsumexp potentials.
Estimating the cost matrix C Q γ 0 ij required by the Sinkhorn algorithm requires a mixture of both density estimation and further simulation of the prior. Furthermore, once we have the logsumexp potentials, additional high dimensional integrals must be estimated every time we wish to evaluate the optimal drift. Details are discussed in Appendix F. In addition, the Sinkhorn-Knopp algorithm still faces challenges in high dimensional spaces, specially for small values of γ [27]. Many proposed enhancements and literature [4,5] have focused on cost functions that implicitly require a Brownian motion prior and thus do not apply to our general setting.

Data-Driven Schrödinger Bridge (DDSB)
The method proposed by [10] is perhaps the most similar approach to our approach and consists of iterating two coupled density estimation objectives (see Appendix G) fitted at the marginals until convergence. While conceptually similar to our approach, there are three key differences. As with Sinkhorn-Knopp-based methods, their approach aims to solve the static SBP. Once converged, it requires further approximation to estimate the optimal drift. The coupled maximum likelihood formulation of the static half bridges in [10] is based on un-normalised density estimation with respect to the SBP potentials φ,φ. The coupling of φ,φ in these objectives does not directly admit the application of modern methods in density estimation since it does not allow us to freely parametrise MLE estimators for the boundaries, and thus, neural density estimators such as [28][29][30] cannot be taken advantage of to circumvent the computation of the partition function. Regression problems are ubiquitous in machine learning (ML) and thus why we believe that this formulation can be very impactful as it allows us to leverage all these methods from ML. Experimentally, regression methods have been observed to scale better to high dimensional problems than density estimation methods, we observed similar evidence as we were unable to scale up the DDSB method beyond two dimensions.
Additionally, to compute the normalising term in the DDSB objective, we have to estimate a multidimensional integral that is not taken with respect to a probability distribution. This poses a difficult challenge in high dimensions. For a more detailed commentary, see Appendix G. Note that the method by [20] uses importance sampling to estimate these quantities which performs poorly beyond two dimensions.
It worth noting that we are interested in a method that can obtain a dynamic interpolation between two distributions. A downside of solving the static bridge either by Sinkhorn or [10] is that it does not directly provide us with an estimate of the optimal dynamics. In order to obtain an estimate of the optimal drift we require a series of approximations to estimate the integrals in Equations (A70) and (A71), thus every time we evaluate the optimal drift using these approaches we have to simulate the prior SDE O(N + M) times. This makes the runtime of obtaining the drift and dynamical interpolation expensive, since for each Euler step we take we have to simulate another SDE and backpropagate through it to evaluate the drift.
Finally we would like to highlight that a series of modern approaches to generative modelling [26,[31][32][33] motivate both empirically and theoretically the gain in accuracy obtained in generative modelling tasks when using a dynamical approach rather than a static one.

Numerical Experiments
In this section, we demonstrate the capability of IPML to solve the Schrödinger bridge while efficiently incorporating priors on a range of different tasks from synthetic experiments to embryo cells (code supporting experiments can be found at https://github. com/franciscovargas/GP_Sinkhorn, accessed on the 22 August 2021).

Simple 1D and 2D Distribution Alignment
The first experiment considered is a simple alignment experiment where π 0 and π 1 are either unimodal or bimodal Gaussian distributions (see Appendix H.1 for exact details on the distributions). In Table 1 we compare the accuracy of the fitted marginals with our implementation of the Data-Driven Schrödinger Bridge (DDSB) by [10]. The scoring metrics used are the Earth mover's distance (EMD) as well as a Kolmogorov-Smirnov (KS) statistic on both sample sets. We were unable to get DDSB to work well when π 0 and π 1 were distant from each other. In these distant settings, DDSB collapses the mass of the marginals to a single data point (see Appendix H.1.1), as a result for the unimodal experiment we had to set γ = 100 for the DDSB approach to yield sensible results. We can observe IPML obtains better marginals overall and at a lower value of γ = 1.
We carried out 2D experiments with our approach to show the ability of our method to diffuse from a simple uni-modal distribution to a multi-modal distribution as in Figure 1 where our learned bridge successfully splits. Furthermore, we can visually observe how the forward and backwards trajectories are mirror images of each other as expected.

2D Double Well Experiments
In the double well experiment, we illustrate how to incorporate an arbitrary functional prior and learn the distribution over paths connecting π 0 and π 1 . In order to encode prior information, we experiment with the potential well illustrated in Figures 1 and A3. The boundary distributions π 1 , π 0 are taken to be Gaussian distributions centred at the centre of each well, respectively (See Appendix H.2 for the experiment specification and IPML parameters).
The motivation behind this experiment is to show that the SBP with this prior follows low energy (according to the well's potential function) trajectories for configurations of particles sampled at the wells. Intuitively, we can expect the learned trajectories to avoid the high energy peak located at x = (0, 0) and go via the "passes" on either side. Note that if we estimated the optimal transport (OT) geodesics between π 0 and π 1 or similarly ran IPML with a Brownian motion prior, the learned optimal trajectories would go right through the middle, which is the highest energy path between both wells.
The prior is incorporated in the algorithm in two different ways, (1) by having the first drift Q * 0 to follow the negative derivative of the potential function dx(t) = −∇ x U(x) + γdβ(t) and (2) by setting the mean function of the GP used to fit the drift as the negative derivative of the potential function. As illustrated in Figure 1, the learned trajectories between the two wells avoid the main high energy region and go through via lower energy passes, thus respecting the potential prior; however, the behaviour of the trajectory will differ depending on the choice of kernel, as illustrated in Figure A3, underlying the need for its careful consideration. We compared our approach to DDSB using the mean squared error distance to the prior in our evaluation. The results were DDSB: 22.1, IPML (no prior): 19.9, IPML: (prior) 18.7. As we can see, IPML considerably outperforms DDSB.

Finite Sample/Iteration Convergence
Theorem 1 provides us with asymptotic guarantees; however, it does not extend to the finite sample and discretisation case. To highlight the importance of finite effects on IPML, we carried out this analysis empirically. In Figure 3 we plot an empirical estimate of the error term in the control formulation of the SB (Equation (5)). This term is effectively the mean squared error between the learned drift and the prior drift (gradient field of the well). We analyse this metric for different values of N (number of samples) and ∆ t (discretization factor). We observe that IPML quickly reaches a low error valley, then, the cumulative error from the successive finite sample MLE can be observed and the drift starts to slowly deviate from the prior, this motivates early stopping. As N is increased, IPML achieves lower error faster and deviates less from the prior in later iterations. Finally, we observe placing the drift prior via the GP has a significant effect in improving the error and its convergence. Additionally in Appendix E we detail how this question could be approached from a theoretical perspective while underlining its significance and difficulty as illustrated by the lack of such analyses in related algorithms [10,22].

Single Cell-Embryo Body (EB) Data Set
We perform an experiment on an embryoid body scRNA-seq time course [34]. Singlecell RNA sequencing enables accurate identification of cells at specific time points; however, all cells are destroyed by measurement. This prevents modelling single-cell trajectory and instead we rely on modelling the data-manifold at discrete time points. The datasets consist of 5 time points illustrated in Figure A4. To evaluate the performance of the algorithms, we fit the models at the endpoints (T = 1, 5) and predict the intermediate frames. The metric used is the Earth mover's distance between the data at intermediate frames and the predicted distribution. We evaluate the performance at the endpoints by considering the prediction of the forward model at T = 5 and the backward model at T = 1.
We compare the performance with two methods. The first one, TrajectoryNet [34], uses continuous normalising flows with a soft constraint based on optimal transport. The second leverages the McCann interpolant [35,36] to interpolate the discrete OT solution; this corresponds to the linear interpolation induced by the transport map.
The results are summarised in Table 2. In most frames, IPML outperforms Trajec-toryNet and performs similarly to OT. As the noise (volatility) in the trajectories goes to 0, IPML theoretically converges to the OT solution with linear geodesics. So without any additional prior information, we would not expect IPML to outperform OT; however, when dealing with finite data, the DriftFit procedure allows for some non-linearity in the trajectories if it improves the fit. As a result, we do see some differences in the convex hull displayed in Figure 4 where we observe a better coverage of the single cell observations. We hypothesize this explains the improvement in performance vs. OT for frame 4. The performance of IPML may be further improved by incorporating domain-specific knowledge as a prior. It would outperform OT in those settings. Table 2. Earth Moving Distance (EMD) on the EB data. EXP stands for the exponential kernel and EQ for exponentiated quadratic. The column "full" represents EMD averaged over all frames whereas "path" is averaged over the intermediate frames (T = 2, 3, 4).

Motion Capture
In this experiment, we demonstrate how IPML can be used to model human motion from sensor data (data from The CMU Graphics Lab Motion Capture Database funded by the NSF (http://mocap.cs.cmu.edu, accessed on the 22 August 2021)). The motion corresponds to a basketball movement where the subject raises both arms simultaneously as illustrated in Figure 5 and each sensor corresponds to an oriented angle. We focus on modelling the right shoulder and elbow where the starting distribution corresponds to the leftmost image and the ending one the rightmost. This results in a 4-dimensional space as we model both position and velocity for each sensor. We compare the fit of IPML using a Brownian and a 2nd order linear ODE (Langevin) prior. The experimental details can be found in Appendix H.3. As illustrated in Figure 5, IPML is able to approximately model the dynamics using both priors. The Brownian prior displays noisy trajectories as expected in contrast to the Langevin prior that, by construction, smooths out the predicted positions (due to the 2nd order term). We observe that the Langevin prior approximates the step function nature of the true trajectory more closely than the Brownian prior, additionally, we can see that it also produces slightly better alignments.

Limitations and Opportunities
In this work, we propose to use GPs to estimate the drift; however, IPML could also be used with a parametric function estimator (e.g., a neural network). This could be useful with high-dimensional data where GPs may underperform. The main advantage of using a GP is its capacity to incorporate functional priors via the mean function. This can be useful in applications such as molecular dynamics where a potential function may be available. In contrast, implementing functional regularization in a neural network would require approximating a non-trivial high-dimensional integral to estimate the mean squared error between the parametric estimator and the functional prior.
A particular useful extension would be to adapt the SBP to work with more general forms of volatility functions that are not constant. This can be used, for example, in enforcing positivity constraints on a stochastic process via a geometric Brownian motion prior; this has applications in modelling biological signals such as transcription factors [37]. Work in this direction would require extending the theory of SBPs to non constant volatility functions.
Another promising setting is when multiple frames of data are available rather than just two boundary conditions. The IPFP algorithm trivially adapts to multiple constraints [21] rather than just initial and terminal distributions. Future work could explore experiments similar to the one presented in [34] where multiple frames are considered during training and the performance is measured using a leave-one-out procedure.

Conclusions
We have presented IPML, a method to solve the Schrödinger bridge for arbitrary diffusion priors. We presented theoretical results guaranteeing convergence in the limit of infinite data. We devised a practical application of the algorithm using Gaussian processes and presented several experiments on a variety of problems from synthetic to biological data. The approach opens up opportunities in science, where oftentimes, prior knowledge about the temporal evolution of a process has been developed but needs to be combined with data-driven methods to scale up to modern problems.

Conflicts of Interest:
The authors declare no conflict of interest.

Assumptions Across Proofs
For abstraction purposes we will list the set of assumptions assumed across all results in this appendix: • All SDEs considered have L−Lipchitz diffusion coefficients as well as satisfying linear growth • The optimal drifts are elements of a compact space and thus satisfy the HJB equations. Note for the proposes of Theorem 1 this can be relaxed using notions of Γ−convergence, • All SDEs satisfy a Fokker-Plank equation and thus p(

Appendix A. Brief Introduction to the Schrödinger System and Potentials
The Schrödinger system and its potentials are mentioned when introducing some of the results and connections of the full Schrödinger bridge problem. In this section, we will provide a brief introduction to how the system arises from the original static Schrödinger bridge. For brevity we will use x = x(0) and y = x(1) to denote the boundaries.
The static Schrödinger bridge problem can be derived from the full dynamic bridge by marginalising out the dynamics via the Disintegration Theorem and focusing on the problem only concerning what is happening at the boundaries. The static SBP is formulated as: arg min q(x,y)∈D(π 0 ,π 1 ) where p Q γ (x, y) is the prior joint distribution for the boundary and can be obtained by solving the FPK equation corresponding to the prior SDE. Now if we write down the Lagrangian for the above problem we arrive at: which via performing the appropriate variations wrt to q leads to the optimal solution: which when relabeling the terms containing the Lagrange multipliers we obtain: such that:φ Furthermore, if we relabel the potentials to indicate the times they correspond to we arrive at the Schrödinger system: The above functional system is refered to as the Schrödinger system and the Schrödinger potentials are given by φ,φ. Furthermore the time interpolates for the potentials can be obtained by the propagations: For a more rigorous and extensive introduction please see [20].

Appendix B. Disintegration Theorem-Product Rule for Measures
In this section, we present the Disintegration Theorem in the context of probability measures, which serves as the extension of the product rule to measures that do not admit the traditional product rule. Furthermore we will provide a proof for a direct lemma of the Disintegration Theorem that is more analogous to the standard product rule. Similar to the product rule these theorems are essential for decomposing and manipulating path measures and thus is needed for most results pertaining to the dynamic SBP.

Theorem A1. (Disintegration Theorem for continuous probability measures):
For a probability space (Z, B(Z), P), where Z is a product space: Z = Z x × Z y and π i : Z → Z i is a measurable function known as the canonical projection operator (i.e., π x (z x , z y ) = z x and π −1 where P x (·) = P(π −1 (·)) is a probability measure, typically referred to as a push-forward measure, and corresponds to the marginal distribution.
A direct consequence of the above instance of the disintegration theorem is, with f (x, y) = 1 A x ×A y (x, y), We can see that, in the context of probability measures, the above is effectively analogous to the product rule.

RN Derivative Disintegration
We now have the required ingredients to show the following: Lemma A1. (RN-derivative product rule) Given two probability measures defined on the same product space, (Z x × Z y , B(Z x × Z y ), P) and (Z x × Z y , B(Z x × Z y ), Q), the Radon-Nikodym derivative dP dQ (x, y) can be decomposed as Proof. Note that this is an instructive sketch for a well known result [10,20,38,39]. We assume that dQ y|x >> dP y|x , which is not a trivial fact, yet it follows from the disintegration theorem in this particular setting (see Theorem 1.6 b in [39]) . Starting from we apply the Radon-Nikodym theorem to P(A y |x) and then to P x :

Now, via the disintegration we have that
Thus, we conclude that which, via the Radon-Nikodym theorem, implies Note that swapping Q for the Lebesgue measure would result in the standard product rule for probability density functions.

Appendix C. Proof Sketches for Half Bridges
In this section we provide a proof sketch for the closed form solution of the half bridges as well as a proof for Observation 1.
Theorem A2. The forward half bridge admits the following solution: Proof. Via the disintegration theorem, we have the following decomposition of KL: Thus, via matching terms accordingly, we can construct P * by setting P(·|x) = Q(·|x) and matching the constraints: Observation A1. We can parametrise a measure Q with its drift as the solution to either of the following SDEs: Then, we can sample from the solution to the following half bridges: via simulating trajectories following the SDEs Paths sampled from the above SDEs will be distributed according to P * − and P * + , respectively.
Proof. (Sketch) W.l.o.g., consider the decomposition of the KL divergence that follows from the disintegration Theorem (Appendix B): Furthermore the disintegration Q(·|x(0)) is a solution to the dynamics dx + (t) = b + (t) + √ γdβ + (t). We can make the term E π P 0 [D KL (P(·|x)||Q(·|x))] go to 0 by setting P(·|x(0)) = Q(·|x(0)), it is clear the dynamics of P(·|x(0)) follows dx What is left is to attach the constraint via x(0) ∼ π 0 (x(0)) which brings D KL (π P 0 ||π Q 0 ) to 0. This is simply enforcing the constraints via the product rule (Disintegration Theorem) and then matching the remainder of the unconstrained interval with the disintegration for the reference distribution Q, following Equation (7).

Appendix D. Proof Sketch for Reverse-MLE Consistency
The proof sketch for Theorem 1 will show how the likelihood converges in the large data and small time step limit to an optimisation of the KL divergence between two reverse time diffusions, from here one can use the standard arguments to show this quantity is minimised when the two measures describe the same stochastic process or equivalently when the drifts are equal. Lemma A2. Normalizing the time reversed likelihood with the true discretised backwards SDE density does not affect the maximum likelihood estimate. That is: be sampled/ discretised trajectories from the SDE that represents the half bridge measure P + : Then carrying out maximum likelihood estimation of the time reversed drift u 0 − on time reversed samples: where, and u − is the drift of our estimator SDE: converges in probability to the to the true dual drift u − 0 (x, t) = u + 0 (x, t) − γ∇ x ln p(x, t) where u + 0 (x, t) is the optimal half bridge drift for P + as N → ∞, ∆t → 0. Under the assumptions: • The density p(x, t) is differentiable with respect to x, • The optimal drift lies in a compact space (this can be relaxed using notions of Γ−convergence), • The prior drift coefficient is L−Lipchitz and satisfies linear growth.

Proof. For the interest of brevity let
Taking logs and applying Lemma A2 to Equation (A19) yields: where q(x t T ) represents the terminal distribution of the forward SDE (i.e., x + (1) ∼ π). Now we can equivalently write the above expression in terms of the time reversed samples (i.e., x − t i = x + t n−i ): Expanding the squares and re-arranging: t k −∆t . Now we can consider the limit of Equation (A25): We can write the sum over the time grid as a stochastic integral if we express the inner terms using the continuous time approximation from Lemma A3: From [15,16,18] it follows thatx (n)− (t) is a semi-martingale (w.r.t. to the backwards filtration, see [17,40]) then in the limit the stochastic integrals are taken with respect to the true time reversed stochastic process adapted to the backwards filtration (F − i ) i∈T (see Theorem 2.13 in [41]): now we have: where by Lemma A3 each random function x (n)− (t) is sampled i.i.d from the SDE: we can now apply the weak law of large numbers (WLLN): Now using that the log RN derivative between the estimator SDE and the true SDE is given by Girsanov's theorem [20,42,43]: we arrive at: Thus using 1 N L(u) to denote the negative normalized log-likelihood we have shown the following pointwise (i.e., for u ∈ B) convergence in probability: Then if B , is compact plus additional continuity and boundedness assumptions on L(u) a stronger form of uniform convergence in probability can be attained: from the above it directly follows that: where u * N is the MLE estimate of the dual drift at N samples. Which implies: See Chapter 4.5 of [44] for a more detailed discussion on the required assumptions of L.

Extending Theorem 1 to EM Samples
Note that the above proof holds for discretised samples from the original SDE, however we have been unable to extend it to approximate sampling schemes such as EM. We believe that the following result motivates the possibility that Theorem 1 holds when the samples are obtained from a consistent scheme such as EM: Lemma A3. (Convergence of discrete time reversal) The time reversal of discrete Euler-Mayurama samples: converges in probability to the solutions of the time reversed diffusion: Proof. First let's consider the continuous time step-wise approximation induced by the the EM samples: From [45] it is a well known result that under the standard regularity assumptions on the drift u(x, t) (i.e., Lipchitz continuity in t and and x) that the above approximation converges in probability to the SDE solution x + (t) as ∆t → 0.
Observing that that the time reversal of the above corresponds to the approximation induced by the reverse samples we now consider the continuous time reversed approximation: Usingx(t) converges in probability to x + (t) we have that ∀t ∈ [0, 1], > 0: we carry out the following substitution s = 1 − t then ∀s ∈ [0, 1], > 0: which completes the proof. Note that showing that reversing backwards samples reversed to a forward direction converges in probability follows the same sketch structure.
Now it requires to be proved thatx (n)− (t) is a semi-martingale which may follow from Lemma A3 as ∆t → 0. Another potential proof strategy would be to try and exploit the strong convergence properties of the Euler scheme to show that the stochastic integral from Theorem 1 also converges in the case of Euler samples. It may be the case that a correction term needs to be introduced in order for the integrals built from the reversed EM sample to converge to the backwards integrals. This remains an interesting question for future work.

Appendix E. Towards a Finite Sample Analysis of Approximate IPFP Schemes
We use the term approximate IPFP schemes for methodologies such as the one we present in this paper (i.e., IPML) where steps 5, 6 of IPFP algorithm (Figure 2) are replaced with inexact approximations. In this section we will present a rough sketch that takes the first step towards formally analysing approximate IPFP schemes in the finite sample/discretisation regime. This will serve to illustrate many of the challenges that still remain in this analysis as well as provide some initial results.

Conjecture A1. (Heuristic Finite Sample Bound)
Given that: • The exact IPFP at the ith iteration can be bounded from above as: The approximate IPFP projection operatorsP ± π : X → X are K−Lipchitz • and the approximate finite sample projection error can be bounded by a constant: Then it follows that the approximate IPFP iteration error can be bounded from above by: when P ± π are non-expansive operators (K = 1), and: π 0,1 = inf Q∈D(π 0 ,π 1 ) D KL Q Q γ 0 , and ||.|| is the L 2 (Q * ) norm error as per Equation (4), that is it is just the half bridge in Line 6 of the algorithm in Figure 2 written in terms of the drifts.
Proof. (Sketch) For notational simplicity we will relabel the half bridges in the algorithm of Figure 2 as projection operators P ± π : X → X in a function space X . This leads to the following iterates: For the approximate IPFP we have the iterates: Furthermore, to simplify the analysis further we will consider the composition of the two projection operators (i.e., P π 0,1 = P + π 0 • P − π 1 ) combined into a single operator (for both exact and approximate projections): We will now proceed to bound the error: From the assumptions it follows that: yielding: Now we will proceed to analyse the first term: Now we can expand the recurrence until the first iteration, yielding: When K = 1 we attain the bound: When the projection operatorP π 0,1 is non expansive this gives the bound: Note that the above sketch is more of a strategy to outline the challenges and steps required for this analysis rather than a proof itself. We believe that the formal finite sample analysis on approximate IPFP schemes merits its own separate work and is thus not the focus of this work; however, in order to highlight its importance we have provided a template towards this formal analysis following a strategy similar to [46]. This analysis allows us to understand what conditions are needed to be shown and how they will affect the finite sample convergence rates.
We will now proceed to discuss each of the assumptions that are required as elements of the above analysis strategy.
Assumption A1. The exact IPFP at the ith iteration can be bounded from above as: This assumption/conjecture is somewhat reasonable since Proposition 2.1 in [22] proves formally that the sum of the KL errors of the ith marginals are bounded by S Q γ 0 π 0,1 i , furthermore via the disintegration theorem one can show that the half bridge error terms are in fact bounded by the sum of the error of the ith marginals [20]. Thus it seems feasible to extend the result in [22] to the path error we use in our conjecture.
Assumption A2. The approximate IPFP projection operatorsP ± π : X → X are K−Lipchitz. Specifically we are interested in the cases where they may be non-expansive and contractive.
This assumption is specific to the approximation method used to estimate the half bridges. It requires going into rigorous details on the nature of the approximation and imposing suitable regularity assumptions. Not that ideally we want the operator to be contractive or at the very least not expansive since for Lipchitz constants greater than one the cumulative error would potentially accumulate exponentially.
It is interesting to note that in the contractive case we can attain the bound: where the error term is constant with respect to the iteration number. Additionally the bound in Equation (A48) also trivially applies.
Assumption A3. The approximate finite sample projection error can be bounded by a constant: This should typically be something that one shows; however, as seen in [46], it is something that is often assumed in these proof strategies. The reason for assuming the result is that the projection varies depending on how we approximate the KL minimisation, in our case we combine our proposed IPML with drift estimation via Gaussian processes. The error of the drift estimation (i.e., ∆T,N ) via this approach is not straightforward to obtain and is still an area of active research see [23,25] for more details. As mentioned earlier this is not the focus of our work thus it is reasonable for us to assume that the drift estimation machinery proposed by prior authors is sound and should be possible to bound with a reasonable error. that is each drift coordinate depends on the entire input space and thus no simplifying assumptions have been made in this setting.

Appendix F. Estimates Required by the Sinkhorn-Knop Algorithm
In order to adapt the Sinkhorn-Knop algorithm to a general prior we require the following considerations: • The Sinkhorn-Knop algorithm requires computing the cost C . For more general SDE priors whose transition densities are not available in closed form can be challenging:

-
First we have to estimate the empirical distribution for ln p Q γ 0 (y i |x j ) by generating multiple x(1) samples from the priors SDE for each x j in the dataset: Once we have samples {x k } j we must carry out density estimation for each j in order to evaluate ln p Q γ 0 (y i |x j ). • The Sinkhorn-Knop algorithm produces discrete potentials, which can be interpolated using the logsumexp formula; however, how do we go from these static potentials to time dependant trajectories?
-From [10] we can obtain expressions for the optimal drift as a function of the time extended potentials: The first integral can be approximated using the Montecarlo approximation via sampling from the prior SDE to draw samples from p Q γ 0 (y(t)|z(0)); however, computing theφ potential integral is less clear and requires careful thought.

-
Note that in order to estimate the drift we would have to simulate the SDE prior every time we want to evaluate the drift, which itself will be run in another SDE simulation to generate optimal trajectories.

Appendix G. Approximations Required by DDSB
The approach in [10] iterates the following two objectives.
whereφ 0 , φ 1 are parametric functions aimed at estimating the potentials, and: Note that in Equation (A74) the outer most integral is not taken with respect to a probability distribution and thus does not admit standard approximations. The authors in [10] propose the method of importance sampling [49]; however, this does not scale well to higher dimensions. This is contrasting to our approach where all integrals are expectations with respect to the empirical distribution and the SDEs being fitted.

Appendix H. Experiments
In this section we provide in depth details regarding the parameter configurations of our approach plus our experimental setup. For all the experiments below, unless stated otherwise, we used the exponential kernel with a lengthscale set at the default value of 1. We ran IPML for 5 iterations with a discretization factor ∆ t = 0.01 and γ = 1. The iteration number (5) was selected by observing that the algorithm converged after that number in all experiments considered.

Appendix H.1. 1D Experiment
The prior Q γ 0 in IPM (the practical implementation of Q γ 0 will be added in the main paper during the revision period) is specified by a prior drift and an initial value distribution π Q γ 0 0 . We set π Q γ 0 0 = π 0 and the prior to a Brownian motion W γ unless stated otherwise. The initial value distribution does not affect the argmax of the SBP and is thus a free parameter.
Appendix H.1.1. Delta Collapse in DDSB Here we will illustrate a common failure case of the approach in [10]. That is when the distributions π 0 and π 1 are distant from each other the methodology proposed in [10] breaks for suitable values of γ (i.e., γ = 1, 2, 3...). Using the same marginals π 0 , π 1 as in Experiment 1 we re-train the method by [10] with γ = 1. In Figure A2 we can see the results of this experiment. We can see the marginals learned by DDSB collapse all the mass in a very small low density region for the true marginals. We found this phenomena to occur often and harder to overcome in 2-dimensional experiments and thus did not pursue further comparisons with this method. Note we normalised the delta spike of the DDSB marginals to have a range in [0, 1] so we could compare to the true density, its actual value is on the order of 10 11 , which confirms that the model is collapsing to a point mass (dirac delta function). Figure A2. Schrödinger Bridge results using the method by [10] on unimodal to bimodal Gaussian 1D data. This example illustrates the Dirac delta collapse of the marginals.

Appendix H.2. Well Experiment
We used the following potential to model the double wells: U(x, y) = 5 2 (x 2 − 1) 2 + y 2 + 1 furthermore, we used the boundary distributions: where we selected the boundary distributions to be located at the wells and visually follow a similar spread/curvature as the contour of the well. For these experiments we use γ = 3.
As highlighted previously π Q γ 0 0 is a free parameter, which in this section we set as: we select this as the marginal prior since its samples cover the space over which the wells is defined well, and helps estimate the backwards drift of the prior more accurately. As mentioned in the paper, using different kernels as well as lengthscales resulted in slightly different behaviour on the well experiment where the trajectory would not split as illustrated in Figure A3. The splitting pattern was obtained using an exponential kernel with a lower lengthscale (0.25) underlying the need for careful consideration of the kernel and its parameters.

Discussion on Kernel Choice
We can observe that when using EQ kernel, the trajectories do not cross both passes but instead choose one. We conjecture that this is due to the fitted predictive means having a preference for simpler functions that send nearby particles in the same direction. We empirically verify this conjecture by experimenting with alternative kernels to EQ (i.e., exponential) that model a wider class of functions. We believe that these alternative kernels are less prone to discouraging splitting since, when viewed within the kernel ridge regression framework, they span an RKHS of functions that do not have the smoothness constraint of the EQ kernel.
The data used for this experiment were taken directly from the GitHub repository published by [34]. PCA is first applied to the data and the first five components are selected. We used the code available at https://github.com/KrishnaswamyLab/TrajectoryNet/tree/ master/TrajectoryNet (accessed on the 22 August 2021) to reproduce the performance of TrajectoryNet as well as the optimal transport baseline. half a revolution. We visually confirmed with the true trajectories that this seems like a reasonable prior. Figure A4. Fitted SBP distribution on the start and end data slices with Brownian prior on single-cell human Embryo data from [34]. Observations depicted as red point clouds. See Section 5.3 for experimental details.

Appendix H.4. Cell Experiment
For the Brownian motion prior that we use as a baseline comparison in Appendix H.3 we explored several values of γ and selected the best performing one (visually) of γ = 0.3. Note we do pick a smaller value of γ for the Brownian motion and this is partly to compensate that the Langevin prior trajectories will always look smooth by construction thus we pick a not so noisy Brownian motion that still gives good results.
Note that the volatility term here is singular and in our mocap application setting it is exactly the vector: where the dimension d is given by the number of sensors we use to fit the motion. A careful reader may wonder weather the SBP machinery still applies to such a sparsely structured diffusion and in fact it does. Following Theorem 4 of [43] we can see how the Radon-Nikodym derivative is finite and can be expressed in an almost identical fashion to the original controlled SBP formulation and thus justifying the existence of the SBP in this scenario as well as the application of the IPML algorithm and its derivatives.  Figure A5. Trajectories of the shoulder and arm oriented angle sensor through the motion. The two plots on the right demonstrate IPML's fit using a Brownian and Langevin prior.