Multi-Modal Latent Diffusion

Multimodal datasets are ubiquitous in modern applications, and multimodal Variational Autoencoders are a popular family of models that aim to learn a joint representation of different modalities. However, existing approaches suffer from a coherence–quality tradeoff in which models with good generation quality lack generative coherence across modalities, and vice versa. In this paper, we discuss the limitations underlying the unsatisfactory performance of existing methods, in order to motivate the need for a different approach. We propose a novel method that uses a set of independently trained, unimodal, deterministic autoencoders. Individual latent variables are concatenated into a common latent space, which is then fed to a masked diffusion model to enable generative modeling. We introduce a new multi-time training method to learn the conditional score network for multimodal diffusion. Our methodology substantially outperforms competitors in both generation quality and coherence, as shown through an extensive experimental campaign.


INTRODUCTION
Multi-modal generative modelling is a crucial area of research in machine learning that aims to develop models capable of generating data according to multiple modalities, such as images, text, audio, and more. This is important because real-world observations are often captured in various forms, and combining multiple modalities describing the same information can be an invaluable asset. For instance, images and text can provide complementary information in describing an object, and audio and video can capture different aspects of a scene. Multi-modal generative models can also help in tasks such as data augmentation (He et al., 2023; Azizi et al., 2023; Sariyildiz et al., 2023), missing modality imputation (Antelmi et al., 2019; Da Silva-Filarder et al., 2021; Zhang et al., 2023; Tran et al., 2017), and conditional generation (Huang et al., 2022; Lee et al., 2019b).
Multi-modal models have flourished over the past years and have seen tremendous interest from academia and industry, especially in the content creation sector. Whereas most recent approaches focus on specialization, by considering text as the primary input to be associated mainly with images (Rombach et al., 2022; Saharia et al., 2022; Ramesh et al., 2022; Tao et al., 2022; Wu et al., 2022; Nichol et al., 2022; Chang et al., 2023) and videos (Blattmann et al., 2023; Hong et al., 2023; Singer et al., 2022), in this work we target an established literature whose scope is more general, and in which all modalities are considered equally important. A large body of work relies on extensions of the Variational Autoencoder (VAE) (Kingma & Welling, 2014) to the multi-modal domain: initially interested in learning joint latent representations of multi-modal data, such works have mostly focused on generative modeling. Multi-modal generative models aim at high-quality data generation, as well as generative coherence across all modalities. These objectives apply both to the joint generation of new data and to the conditional generation of missing modalities, given a disjoint set of available modalities.
In short, multi-modal VAEs rely on combinations of uni-modal VAEs, and the design space consists mainly in the way the uni-modal latent variables are combined to construct the joint posterior distribution. Early work such as Wu & Goodman (2018) adopts a product-of-experts approach, whereas others (Shi et al., 2019) consider a mixture-of-experts approach. Product-based models achieve high generative quality, but suffer in terms of both joint and conditional coherence. This was found to be due to expert mis-calibration issues (Shi et al., 2019; Sutter et al., 2021). On the other hand, mixture-based models produce coherent but qualitatively poor samples. A first attempt to address the so-called coherence-quality tradeoff (Daunhawer et al., 2022) is represented by the mixture-of-products-of-experts approach (Sutter et al., 2021). However, recent comparative studies (Daunhawer et al., 2022) show that none of the existing approaches fulfill both the generative quality and coherence criteria. A variety of techniques aim at finding a better operating point, such as contrastive learning (Shi et al., 2021), hierarchical schemes (Vasco et al., 2022), total-correlation-based calibration of single-modality encoders (Hwang et al., 2021), or different training objectives (Sutter et al., 2020). More recently, the work in Palumbo et al. (2023) considers explicitly separated shared and private latent spaces to overcome the aforementioned limitations.
Expanding on results presented in Daunhawer et al. (2022), in Section 2 we further investigate the tradeoff between generative coherence and quality, and argue that it is intrinsic to all variants of multi-modal VAEs. We indicate two root causes of this problem: latent variable collapse (Alemi et al., 2018; Dieng et al., 2019) and information loss due to mixture sub-sampling. To tackle these issues, in Section 3 we propose a new approach which uses a set of independent, uni-modal deterministic autoencoders whose latent variables are simply concatenated into a joint latent variable. Joint and conditional generative capabilities are provided by an additional model that learns a probability density associated with the joint latent variable. We propose an extension of score-based diffusion models (Song et al., 2021b) to operate on the multi-modal latent space, and derive both forward and backward dynamics that are compatible with the multi-modal nature of the latent data.
In Section 4 we propose a novel method to train the multi-modal score network, such that it can be used for both joint and conditional generation. Our approach is based on a guidance mechanism, which we compare to alternatives. We label our approach Multi-modal Latent Diffusion (MLD).
Our experimental evaluation of MLD in Section 5 provides compelling evidence of the superiority of our approach for multi-modal generative modeling. We compare MLD to a large variety of VAE-based alternatives, on several real-life multi-modal data-sets, in terms of generative quality and both joint and conditional coherence. Our model outperforms alternatives in all possible scenarios, even those that are notoriously difficult because modalities might be only loosely correlated. Note that recent work also explores the joint generation of multiple modalities (Ruan et al., 2023; Hu et al., 2023), but such approaches are application specific, e.g., text-to-image, and essentially only target two modalities. When relevant, we compare our method to additional recent alternatives to multi-modal diffusion (Bao et al., 2023; Wesego & Rooshenas, 2023), and show superior performance of MLD.

LIMITATIONS OF MULTI-MODAL VAES
In this work, we consider multi-modal VAEs (Wu & Goodman, 2018; Shi et al., 2019; Sutter et al., 2021; Palumbo et al., 2023) as the standard modeling approach to tackle both joint and conditional generation of multiple modalities. Our goal here is to motivate the need to go beyond this standard approach, to overcome limitations that affect multi-modal VAEs and result in a trade-off between generation quality and generative coherence (Daunhawer et al., 2022; Palumbo et al., 2023).
Consider the random variable X = {X_1, ..., X_M} ∼ p_D(x_1, ..., x_M), consisting in the set of M modalities sampled from the (unknown) multi-modal data distribution p_D. We indicate the marginal distribution of a single modality by X_i ∼ p_D^i(x_i), and the collection of a generic subset of modalities by X_A ∼ p_D^A(x_A), with X_A := {X_i}_{i∈A}, where A ⊂ {1, ..., M} is a set of indexes. For example, given A = {1, 3, 5}, then X_A = {X_1, X_3, X_5}.
We begin by considering uni-modal VAEs as particular instances of the Markov chain X → Z → X̂, where Z is a latent variable and X̂ is the generated variable. Models are specified by two conditional distributions, called the encoder, Z | X=x ∼ q_ψ(z | x), and the decoder, X̂ | Z=z ∼ p_θ(x | z). Given a prior distribution p_n(z), the objective is to define a generative model whose samples are distributed as closely as possible to the original data.
In the case of multi-modal VAEs, we consider the general family of Mixture of Products of Experts (MOPOE) (Sutter et al., 2021), which includes as particular cases many existing variants, such as the Product of Experts (MVAE) (Wu & Goodman, 2018) and the Mixture of Experts (MMVAE) (Shi et al., 2019). Formally, a collection of K arbitrary subsets of modalities S = {A_1, ..., A_K}, along with weighting coefficients ω_i ≥ 0, Σ_i ω_i = 1, defines the joint posterior as the mixture q_ψ(z | x) = Σ_{i=1}^K ω_i q_{ψ_{A_i}}(z | x_{A_i}). To lighten the notation, we use q_{ψ_{A_i}} in place of q^i_{ψ_{A_i}}, noting that the various q^i_{ψ_{A_i}} can have both different parameters ψ_{A_i} and different functional forms. For example, in the MOPOE (Sutter et al., 2021) parametrization, we have q_{ψ_{A_i}}(z | x_{A_i}) = Π_{j∈A_i} q_{ψ_j}(z | x_j). Our exposition is more general and not limited to this assumption. The selection of the posterior can be understood as the result of a two-step procedure where i) each subset of modalities A_i is encoded into a specific latent variable Y_i ∼ q_{ψ_{A_i}}(· | x_{A_i}), and ii) the latent variable Z is obtained as Z = Y_i with probability ω_i. Optimization is performed w.r.t.
the following evidence lower bound (ELBO) (Daunhawer et al., 2022; Sutter et al., 2021). A well-known limitation called the latent collapse problem (Alemi et al., 2018; Dieng et al., 2019) affects the quality of the latent variables Z. Consider the hypothetical case of arbitrarily flexible encoders and decoders: then, posteriors with zero mutual information with respect to the model inputs are valid maximizers of Equation (1). To prove this, it is sufficient to substitute the posteriors with the prior p_n(z) in Equation (1) and observe that the optimal value L = ∫ p_D(x) log p_D(x) dx is achieved (Alemi et al., 2018; Dieng et al., 2019). The problem of information loss is exacerbated in the case of multi-modal VAEs (Daunhawer et al., 2022). Intuitively, even if the encoders q_{ψ_{A_i}}(z | x_{A_i}) carry relevant information about their inputs X_{A_i}, step ii) of the multi-modal encoding procedure described above induces a further information bottleneck. A fraction ω_i of the time, the latent variable Z will be a copy of Y_i, which only provides information about the subset X_{A_i}. No matter how good the encoding step is, the information about X_{{1,...,M}\A_i} that is not contained in X_{A_i} cannot be retrieved.
Furthermore, if the latent variable carries zero mutual information w.r.t. the multi-modal input, coherent conditional generation of a set of modalities given others is impossible, since X_{A_1} ⊥ X_{A_2} for any generic sets A_1, A_2. While flexible decoders, with parameters θ = {θ_1, ..., θ_M} (where we use p_{θ_i} instead of p^i_{θ_i} to unclutter the notation), could enforce preservation of information and guarantee a better quality of the jointly generated data, in practice the latent collapse phenomenon induces multi-modal VAEs to converge toward a sub-optimal operating regime. When the posterior q_ψ(z | x) collapses onto the uninformative prior p_n(z), the ELBO in Equation (1) reduces to a sum of modality-independent reconstruction terms Σ_i ω_i Σ_{j∈A_i} ∫∫ p_D^j(x_j) p_n(z) log p_{θ_j}(x_j | z) dz dx_j. In this case, flexible decoders can similarly ignore the latent variable and converge to the solution p_{θ_j}(x_j | z) = p_D^j(x_j), where, paradoxically, the quality of the approximation of the various marginal distributions is extremely high, while there is a complete lack of joint coherence.
General principles to avoid latent collapse consist in explicitly forcing the learning of informative encoders q_ψ(z | x) via β-annealing of the Kullback-Leibler (KL) term in the ELBO, and in reducing the representational power of encoders and decoders. While β-annealing has been explored in the literature (Wu & Goodman, 2018) with limited improvements, reducing the flexibility of encoders/decoders clearly impacts generation quality. Hence the presence of a trade-off: to improve coherence, the flexibility of encoders/decoders should be constrained, which in turn hurts generative quality. This trade-off has been recently addressed in the literature on multi-modal VAEs (Daunhawer et al., 2022; Palumbo et al., 2023), but our experimental results in Section 5 indicate that there is ample room for improvement, and that a new approach is truly needed.

OUR APPROACH: MULTI-MODAL LATENT DIFFUSION
We propose a new method for multi-modal generative modeling that, by design, does not suffer from the limitations discussed in Section 2. Our objective is to enable both high-quality and coherent joint/conditional data generation, using a simple design (see Appendix A for a schematic representation). As an overview, we use deterministic uni-modal autoencoders, whereby each modality X_i is encoded through its encoder e_{ψ_i} (a short form for e^i_{ψ_i}) into the modality-specific latent variable Z_i, and decoded into the corresponding X̂_i = d_{θ_i}(Z_i). Our approach can be interpreted as a latent variable model where the different latent variables Z_i are concatenated as Z = [Z_1, ..., Z_M]. This corresponds to parametrizing the two conditional distributions as the Dirac deltas q_ψ(z | x) = Π_i δ(z_i − e_{ψ_i}(x_i)) and p_θ(x | z) = Π_i δ(x_i − d_{θ_i}(z_i)), respectively. Then, in place of an ELBO, we optimize the parameters of our autoencoders by minimizing the following sum of modality-specific losses: Σ_{i=1}^M E_{p_D^i(x_i)}[l_i(x_i, d_{θ_i}(e_{ψ_i}(x_i)))], (2) where l_i can be any valid distance function, e.g., the squared norm ∥·∥². Parameters ψ_i, θ_i are modality specific: minimization of Equation (2) thus corresponds to individual training of the different autoencoders. Since the mapping from input to latent is deterministic, there is no loss of information between X and Z. Moreover, this choice avoids any form of interference in the back-propagated gradients corresponding to the uni-modal reconstruction losses. Consequently, gradient conflict issues (Javaloy et al., 2022), where stronger modalities pollute weaker ones, are avoided.
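The first stage described above can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the modality names, dimensions, and linear maps are hypothetical stand-ins for the encoders e_{ψ_i} and decoders d_{θ_i}.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical modality dimensions: (input_dim, latent_dim). Deterministic
# autoencoders impose no constraint that latent sizes match across modalities.
dims = {"image": (64, 8), "audio": (32, 4)}

# Toy linear stand-ins for the modality-specific encoders/decoders.
enc = {m: rng.normal(scale=0.1, size=(di, dz)) for m, (di, dz) in dims.items()}
dec = {m: rng.normal(scale=0.1, size=(dz, di)) for m, (di, dz) in dims.items()}

def encode_all(x):
    # Each modality is encoded independently; latents are concatenated
    # into the joint latent Z = [Z_1, ..., Z_M].
    return np.concatenate([x[m] @ enc[m] for m in dims], axis=-1)

def total_loss(x):
    # Equation (2): a plain sum of modality-specific reconstruction losses,
    # here with l_i the squared norm. No term couples two modalities, so
    # gradients of one loss never touch another modality's parameters.
    return sum(np.mean((x[m] - (x[m] @ enc[m]) @ dec[m]) ** 2) for m in dims)

batch = {m: rng.normal(size=(16, di)) for m, (di, _) in dims.items()}
z = encode_all(batch)   # joint latent of size 8 + 4 per sample
loss = total_loss(batch)
```

Because the total loss decomposes per modality, each autoencoder can be trained in isolation, which is what enables the independent pre-training of the first stage.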
To enable such a simple design to become a generative model, it is sufficient to generate samples from the induced latent distribution Z ∼ q_ψ(z) = ∫ p_D(x) q_ψ(z | x) dx and decode them as X̂ = d_θ(Z). To obtain such samples, we follow the two-stage procedure described in Loaiza-Ganem et al. (2022); Tran et al. (2021), where samples from the lower-dimensional q_ψ(z) are obtained through an appropriate generative model. We consider score-based diffusion models in latent space (Rombach et al., 2022; Vahdat et al., 2021) to solve this task, and call our approach Multi-modal Latent Diffusion (MLD). It may be helpful to clarify, at this point, that the two-stage training of MLD is carried out separately: uni-modal deterministic autoencoders are pre-trained first, followed by the training of the score-based diffusion model, which is explained in more detail later.
To conclude the overview of our method, for joint data generation, one can sample from noise, perform backward diffusion, and then decode the generated multi-modal latent variable to obtain the corresponding data samples.For conditional data generation, given one modality, the reverse diffusion is guided by this modality, while the other modalities are generated by sampling from noise.The generated latent variable is then decoded to obtain data samples of the missing modality.

JOINT AND CONDITIONAL MULTI-MODAL LATENT DIFFUSION PROCESSES
In the first stage of our method, the deterministic encoders project the input modalities X i into the corresponding latent spaces Z i .This transformation induces a distribution q ψ (z) for the latent variable Z = [Z 1 , . . ., Z M ], resulting from the concatenation of uni-modal latent variables.
Joint generation. To generate a new sample for all modalities, we use a simple score-based diffusion model in latent space (Sohl-Dickstein et al., 2015; Song et al., 2021b; Vahdat et al., 2021; Loaiza-Ganem et al., 2022; Tran et al., 2021). This requires reversing a stochastic noising process, starting from a simple Gaussian distribution. Formally, the noising process is defined by a Stochastic Differential Equation (SDE) of the form dR_t = α(t) R_t dt + g(t) dW_t, (3) where α(t) R_t and g(t) are the drift and diffusion terms, respectively, and W_t is a Wiener process.
The time-varying probability density q(r, t) of the stochastic process at time t ∈ [0, T], where T is finite, satisfies the Fokker-Planck equation (Oksendal, 2013), with initial conditions q(r, 0). We assume uniqueness and existence of a stationary distribution ρ(r) for the process in Equation (3). The forward diffusion dynamics depend on the initial conditions R_0 ∼ q(r, 0). We consider R_0 = Z to be the initial condition for the diffusion process, which is equivalent to q(r, 0) = q_ψ(r). Under loose conditions (Anderson, 1982), a time-reversed stochastic process exists, with a new SDE of the form dR_t = [−α(T−t) R_t + g²(T−t) ∇ log q(R_t, T−t)] dt + g(T−t) dW_t, (4) indicating that, in principle, simulation of Equation (4) allows to generate samples from the desired distribution q(r, 0). In practice, we use a parametric score network s_χ(r, t) to approximate the true score function, and we approximate q(r, T) with the stationary distribution ρ(r). Indeed, the KL divergence between the true density and the generated data distribution is bounded, as described by Song et al. (2021a); Franzese et al. (2023), by (1/2) ∫_0^T g²(t) E[∥s_χ(R_t, t) − ∇ log q(R_t, t)∥²] dt + KL(q(·, T) ∥ ρ), (5) where the first term on the r.h.s. is referred to as the score-matching objective, and is the loss over which the score network is optimized, while the second is a vanishing term for T → ∞.
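The score-matching term above can be made concrete with a small sketch. Assuming the VPSDE instantiation described in the appendix (β_min = 0.1, β_max = 20), the Gaussian perturbation kernel is available in closed form, so the objective reduces to denoising score matching; the lambda used as a score network is a toy placeholder for s_χ.

```python
import numpy as np

rng = np.random.default_rng(1)
beta_min, beta_max = 0.1, 20.0  # VPSDE schedule from the appendix

def int_beta(t):
    # \int_0^t beta(s) ds with beta(t) = beta_min + t (beta_max - beta_min)
    return beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2

def perturb(z0, t):
    # VPSDE perturbation kernel: R_t | R_0 = z0 is Gaussian with mean
    # z0 * exp(-0.5 * int_beta(t)) and variance 1 - exp(-int_beta(t)).
    mean = z0 * np.exp(-0.5 * int_beta(t))
    std = np.sqrt(1.0 - np.exp(-int_beta(t)))
    eps = rng.normal(size=z0.shape)
    return mean + std * eps, eps, std

def dsm_loss(score_net, z0, t):
    # Denoising score matching: the kernel's score at R_t is -eps / std,
    # so the network is regressed onto that target.
    r_t, eps, std = perturb(z0, t)
    target = -eps / std
    return np.mean((score_net(r_t, t) - target) ** 2)

z0 = rng.normal(size=(32, 12))               # joint latents from the first stage
loss = dsm_loss(lambda r, t: -r, z0, t=0.5)  # toy stand-in for s_chi
```

In practice the loss is averaged over uniformly sampled times t, with the re-weighting discussed in the appendix.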
To conclude, joint generation of all modalities is achieved through the simulation of the reverse-time SDE in Equation (4), followed by a simple decoding procedure. Indeed, optimally trained decoders (achieving zero loss in Equation (2)) can be used to transform Z ∼ q_ψ(z) into samples from p_D.
Conditional generation. Given a generic partition of all modalities into non-overlapping sets A_1 ∪ A_2, where A_2 = {1, ..., M} \ A_1, conditional generation requires samples from the conditional distribution q_ψ(z_{A_1} | z_{A_2}), which are obtained through masked forward and backward diffusion processes.
Given conditioning latent modalities z_{A_2}, we consider a modified forward diffusion process with initial conditions R_0 = C(Z_{A_1}, z_{A_2}). The composition operation C(·) concatenates generated (R^{A_1}) and conditioning (z_{A_2}) latents. As an illustration, consider M = 5 and A_1 = {1, 3, 5}, such that X_{A_1} = {X_1, X_3, X_5} and C(R^{A_1}_t, z_{A_2}) = [R^1_t, z_2, R^3_t, z_4, R^5_t]. More formally, we define the masked forward diffusion SDE dR_t = m(A_1) ⊙ [α(t) R_t dt + g(t) dW_t]. (6) The mask m(A_1) contains M vectors u_i, one per modality, each with the corresponding cardinality. If modality j ∈ A_1, then u_j = 1, otherwise u_j = 0. The effect of masking is thus to "freeze", throughout the diffusion process, the part of the random variable R_t corresponding to the conditioning latent modalities z_{A_2}. We naturally associate to this modified forward process the conditional time-varying density q(r, t | z_{A_2}) = q(r^{A_1}, t | z_{A_2}) δ(r^{A_2} − z_{A_2}).
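The mask m(A_1) and the composition C(·) are simple bookkeeping on the concatenated latent; a minimal sketch follows, where the per-modality latent sizes are hypothetical.

```python
import numpy as np

# Hypothetical latent sizes for M = 5 modalities.
latent_dims = [3, 2, 4, 2, 3]
offsets = np.cumsum([0] + latent_dims)

def mask(a1):
    # m(A_1): one vector u_i per modality, u_i = 1 if i in A_1 else 0,
    # flattened to match the concatenated latent Z = [Z_1, ..., Z_M].
    m = np.zeros(offsets[-1])
    for i in a1:
        m[offsets[i]:offsets[i + 1]] = 1.0
    return m

def compose(r_a1, z_a2, a1):
    # C(.): keep the diffused coordinates on A_1 and freeze the
    # conditioning latents on A_2, the complement of A_1.
    m = mask(a1)
    return m * r_a1 + (1.0 - m) * z_a2

a1 = [0, 2, 4]                      # generated modalities (0-indexed example)
rng = np.random.default_rng(2)
r = rng.normal(size=offsets[-1])    # diffused buffer holding R_t^{A_1}
z = rng.normal(size=offsets[-1])    # buffer holding conditioning latents z_{A_2}
state = compose(r, z, a1)
```

Applying the same mask to both drift and diffusion terms of Equation (6) guarantees the frozen coordinates never move during the forward process.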
To sample from q_ψ(z_{A_1} | z_{A_2}), we derive the reverse-time dynamics of Equation (6) as dR_t = m(A_1) ⊙ [−α(T−t) R_t + g²(T−t) ∇ log q(R_t, T−t | z_{A_2})] dt + m(A_1) ⊙ g(T−t) dW_t. (7) Then, we approximate q(r^{A_1}, T | z_{A_2}) by its corresponding steady-state distribution ρ(r^{A_1}), and the true (conditional) score function ∇ log q(r, t | z_{A_2}) by a conditional score network s_χ(r^{A_1}, t | z_{A_2}).

GUIDANCE MECHANISMS TO LEARN THE CONDITIONAL SCORE NETWORK
A correctly optimized score network s_χ(r, t) allows, through simulation of Equation (4), to obtain samples from the joint distribution q_ψ(z). Similarly, a conditional score network s_χ(r^{A_1}, t | z_{A_2}) allows, through the simulation of Equation (7), to sample from q_ψ(z_{A_1} | z_{A_2}). In Section 4.1 we extend guidance mechanisms used in classical diffusion models to allow multi-modal conditional generation. A naïve alternative is to rely on the unconditional score network s_χ(r, t) for the conditional generation task, by casting it as an in-painting objective. Intuitively, any missing modality could be recovered in the same way as a uni-modal diffusion model can recover masked information. In Section 4.2 we discuss the implicit assumptions underlying in-painting from an information-theoretic perspective, and argue that, in the context of multi-modal data, such assumptions are difficult to satisfy. Our intuition is corroborated by ample empirical evidence, where our method consistently outperforms alternatives.

MULTI-TIME DIFFUSION
We propose a modification to the classifier-free guidance technique (Ho & Salimans, 2022) to learn a score network that can generate conditional and unconditional samples from any subset of modalities.
Instead of training a separate score network for each possible combination of conditional modalities, which is computationally infeasible, we use a single architecture that accepts all modalities as inputs and a multi-time vector τ = [t 1 , . . ., t M ].The multi-time vector serves two purposes: it is both a conditioning signal and the time at which we observe the diffusion process.
Training: learning the conditional score network relies on randomization. As discussed in Section 3.1, we consider an arbitrary partitioning of all modalities into two disjoint sets, A_1 and A_2. The set A_2 contains randomly selected conditioning modalities, while the remaining modalities belong to the set A_1. Then, during training, the parametric score network estimates ∇ log q(r, t | z_{A_2}), whereby the set A_2 is randomly chosen at every step. This is achieved by the masked diffusion process from Equation (6), which only diffuses modalities in A_1. More formally, the score network input is the composition C(R^{A_1}_t, z_{A_2}), together with the multi-time vector τ(A_1, t), whose entries are equal to the diffusion time t for modalities in A_1 and to zero for the conditioning modalities in A_2. More precisely, the algorithm for multi-time diffusion training (see Appendix A for the pseudo-code) proceeds as follows. At each step, a set of conditioning modalities A_2 is sampled from a predefined distribution ν over P({1, ..., M}), the powerset of all modalities, with ν(∅) > 0. The corresponding set A_1 and mask m(A_1) are constructed, and a sample X is drawn from the training data-set. The corresponding latent variables Z_{A_1} = {e^i_ψ(X_i)}_{i∈A_1} and Z_{A_2} = {e^i_ψ(X_i)}_{i∈A_2} are computed using the pre-trained encoders, and a diffusion process starting from R_0 = C(Z_{A_1}, Z_{A_2}) is simulated for a randomly chosen diffusion time t, using the conditional forward SDE with the mask m(A_1). The score network is then fed the current state R_t and the multi-time vector τ(A_1, t), and the difference between the score network's prediction and the true score is computed, applying the mask m(A_1). The score network parameters are updated using stochastic gradient descent, and this process is repeated for a total of L training steps. Clearly, when A_2 = ∅, training proceeds as for an un-masked diffusion process, since the mask m(A_1) allows all latent variables to be diffused.
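One training step can be sketched as follows. This is a simplified numpy illustration, assuming the VPSDE kernel, a ν that samples each modality into A_2 independently with probability 1/2 (any distribution with ν(∅) > 0 would do), and a toy lambda in place of s_χ.

```python
import numpy as np

rng = np.random.default_rng(3)
latent_dims = [3, 2, 4]                  # hypothetical sizes for M = 3 modalities
offsets = np.cumsum([0] + latent_dims)
D = offsets[-1]
beta_min, beta_max = 0.1, 20.0

def int_beta(t):
    return beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2

def mask(a1):
    m = np.zeros(D)
    for i in a1:
        m[offsets[i]:offsets[i + 1]] = 1.0
    return m

def multi_time(a1, t):
    # tau(A_1, t): diffusion time t for modalities in A_1, zero for the
    # frozen conditioning modalities in A_2.
    return np.array([t if i in a1 else 0.0 for i in range(len(latent_dims))])

def training_step(z0, score_net):
    # Randomize the conditioning set A_2 (here: each modality w.p. 1/2).
    a2 = [i for i in range(len(latent_dims)) if rng.random() < 0.5]
    a1 = [i for i in range(len(latent_dims)) if i not in a2]
    if not a1:                           # nothing to diffuse, skip this draw
        return None
    m, t = mask(a1), rng.uniform(1e-3, 1.0)
    # Masked forward diffusion: only A_1 coordinates are perturbed.
    std = np.sqrt(1.0 - np.exp(-int_beta(t)))
    eps = rng.normal(size=z0.shape)
    r_t = np.where(m > 0, z0 * np.exp(-0.5 * int_beta(t)) + std * eps, z0)
    target = -eps / std
    pred = score_net(r_t, multi_time(a1, t))
    # Masked score-matching loss: only diffused coordinates contribute.
    return np.mean((m * (pred - target)) ** 2)

z0 = rng.normal(size=(8, D))
loss = training_step(z0, lambda r, tau: -r)  # toy stand-in for s_chi
```

In the actual algorithm the returned loss would be back-propagated through the score network; here the step only shows how randomization, masking, and the multi-time vector fit together.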
Conditional generation: any valid numerical integration scheme for Equation (7) can be used for conditional sampling (see Appendix A for an implementation using the Euler-Maruyama integrator). First, the conditioning modalities in the set A_2 are encoded into the corresponding latent variables z_{A_2} = {e^j(x_j)}_{j∈A_2}. Then, numerical integration is performed with step-size ∆t = T/N, starting from the initial conditions R_0 = C(R^{A_1}_0, z_{A_2}), with R^{A_1}_0 ∼ ρ(r^{A_1}). At each integration step, the score network s_χ is fed the current state of the process and the multi-time vector τ(A_1, ·). Before updating the state, the masking is applied. Finally, the generated modalities are obtained through the decoders as X̂_{A_1} = {d^j_θ(R^j_T)}_{j∈A_1}. Inference-time conditional generation is not randomized: the conditioning modalities are the ones that are available, whereas the remaining ones are those we wish to generate.
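The masked Euler-Maruyama loop can be sketched as below, again for the VPSDE with a toy score network (the true score of a standard normal) standing in for the trained s_χ; dimensions and the mask are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
D = 6                                   # joint latent size (hypothetical)
m = np.array([1., 1., 1., 0., 0., 0.])  # mask m(A_1): first modality generated
z_a2 = rng.normal(size=(1, D))          # encoded conditioning latents (last coords)
beta_min, beta_max = 0.1, 20.0

def beta(t):
    return beta_min + t * (beta_max - beta_min)

def sample_conditional(score_net, n_steps=200, T=1.0):
    dt = T / n_steps
    # Initial state: stationary noise on A_1 coordinates, frozen z_{A_2} on A_2.
    r = np.where(m > 0, rng.normal(size=z_a2.shape), z_a2)
    for k in range(n_steps):
        t = T - k * dt                  # integrate the reverse SDE from T to 0
        drift = -0.5 * beta(t) * r - beta(t) * score_net(r, t)
        noise = np.sqrt(beta(t) * dt) * rng.normal(size=r.shape)
        # Euler-Maruyama update, masked so A_2 coordinates never move.
        r = np.where(m > 0, r - drift * dt + noise, r)
    return r

r_final = sample_conditional(lambda r, t: -r)
```

The final state would then be split per modality and passed through the corresponding decoders to obtain the generated data.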
Any-to-any multi-modality has been recently studied through the composition of modality-specific diffusion models (Tang et al., 2023), by designing cross-attention and training procedures that allow arbitrary conditional generation. The work by Tang et al. (2023) relies on latent interpolation of input modalities, which is akin to mixture models, and uses it as a conditioning signal for individual diffusion models. This is substantially different from the joint nature of the multi-modal latent diffusion we present in our work: instead of forcing entanglement through cross-attention between score networks, our model relies on a joint diffusion process, whereby modalities naturally co-evolve according to the diffusion dynamics. Another recent work (Wu et al., 2023) targets multi-modal conversational agents, where the strong underlying assumption is to consider one modality, i.e., text, as a guide for the alignment and generation of other modalities. Even if conversational objectives are orthogonal to our work, techniques akin to instruction following for cross-generation are an interesting illustration of the powerful in-context learning capabilities of LLMs (Xie et al., 2022; Min et al., 2022).

IN-PAINTING AND ITS IMPLICIT ASSUMPTIONS
Under certain assumptions, given an unconditional score network s_χ(r, t) that approximates the true score ∇ log q(r, t), it is possible to obtain a conditional score network s_χ(r^{A_1}, t | z_{A_2}) that approximates ∇ log q(r^{A_1}, t | z_{A_2}). We start by observing the equality q(r^{A_1}, t | z_{A_2}) = ∫ q(z_{A_2} | C(r^{A_1}, r^{A_2}), t) q(C(r^{A_1}, r^{A_2}), t) / q_ψ(z_{A_2}) dr^{A_2}, (8) where, with a slight abuse of notation, we indicate with q(z_{A_2} | C(r^{A_1}, r^{A_2}), t) the density associated to the event: the portion corresponding to A_2 of the latent variable Z is equal to z_{A_2}, given that the whole diffused latent R_t at time t is equal to C(r^{A_1}, r^{A_2}). In the literature, the quantity q(z_{A_2} | C(r^{A_1}, r^{A_2}), t) is typically approximated by dropping its dependency on r^{A_1}. This approximation can be used to manipulate Equation (8) as q(r^{A_1}, t | z_{A_2}) ≃ ∫ q(r^{A_2}, t | z_{A_2}) q(r^{A_1}, t | r^{A_2}, t) dr^{A_2}.
Further Monte-Carlo approximations (Song et al., 2021b; Lugmayr et al., 2022) of the integral allow implementation of a practical scheme, where an approximate conditional score network is used to generate conditional samples. This approach, known in the literature as in-painting, provides high quality results in several uni-modal application domains (Song et al., 2021b; Lugmayr et al., 2022).
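A minimal sketch of this in-painting baseline follows, under the same VPSDE toy setup as before: each reverse step uses the unconditional score on all coordinates, then the A_2 coordinates are overwritten with a forward-diffused sample of the conditioning latents, which is exactly where the approximation of Equation (8) enters. The mask and score network are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(5)
D = 6
m = np.array([1., 1., 1., 0., 0., 0.])  # A_1 = generated coordinates
z_a2 = rng.normal(size=(1, D))          # conditioning latents on A_2
beta_min, beta_max = 0.1, 20.0

def beta(t):
    return beta_min + t * (beta_max - beta_min)

def int_beta(t):
    return beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2

def inpaint(score_net, n_steps=200, T=1.0):
    dt = T / n_steps
    r = rng.normal(size=z_a2.shape)
    for k in range(n_steps):
        t = T - k * dt
        # Unconditional reverse-SDE step on all coordinates.
        drift = -0.5 * beta(t) * r - beta(t) * score_net(r, t)
        r = r - drift * dt + np.sqrt(beta(t) * dt) * rng.normal(size=r.shape)
        # Overwrite A_2 coordinates with forward-diffused conditioning latents:
        # this implements the approximation q(z_A2 | r, t) ~ q(z_A2 | r^A2, t).
        std = np.sqrt(1.0 - np.exp(-int_beta(t)))
        diffused = (z_a2 * np.exp(-0.5 * int_beta(t))
                    + std * rng.normal(size=z_a2.shape))
        r = np.where(m > 0, r, diffused)
    return r

r_final = inpaint(lambda r, t: -r)       # toy stand-in for s_chi
```

Contrast this with the multi-time scheme of Section 4.1, where the conditioning coordinates are frozen rather than re-noised, and the score network is explicitly trained on the conditional task.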
The KL divergence between q(z_{A_2} | C(r^{A_1}, r^{A_2}), t) and q(z_{A_2} | r^{A_2}, t) quantifies, for fixed r^{A_1}, r^{A_2}, the discrepancy between the true and approximated conditional probabilities. Similarly, the expected divergence ∫ q(r, t) KL[q(z_{A_2} | r, t) ∥ q(z_{A_2} | r^{A_2}, t)] dr provides information about the average discrepancy. Simple manipulations allow to recast this average as a discrepancy in terms of mutual information, ∆ = I(Z^{A_2}; R_t) − I(Z^{A_2}; R^{A_2}_t) ≥ 0, where positivity follows from the data processing inequality applied to the Markov chain Z^{A_2} → R_t → R^{A_2}_t, as R_t is the result of a diffusion with the latent variable as initial condition. The positive quantity ∆ is close to zero whenever the rate of loss of information w.r.t. the initial conditions is similar for the two subsets A_1, A_2. In other terms, ∆ ≃ 0 whenever, out of the whole R_t, the portion R^{A_2}_t is a sufficient statistic for Z^{A_2}. The assumptions underlying the approximation are in general not valid in the case of multi-modal learning, where the robustness to stochastic perturbations of the latent variables corresponding to the various modalities can vary greatly. Our claims are supported empirically by an ample analysis on real data in Appendix B, where we show that the multi-time diffusion approach consistently outperforms in-painting.

EXPERIMENTS
We compare our method, MLD, to MVAE (Wu & Goodman, 2018), MMVAE (Shi et al., 2019), MOPOE (Sutter et al., 2021), the Hierarchical Generative Model (NEXUS) (Vasco et al., 2022), the Multi-view Total Correlation Autoencoder (MVTCAE) (Hwang et al., 2021), and MMVAE+ (Palumbo et al., 2023), re-implementing competitors in the same code base as our method, and selecting their best hyperparameters (as indicated by the authors). For a fair comparison, we use the same encoder/decoder architecture for all the models. For MLD, the score network is implemented using a simple stacked multilayer perceptron (MLP) with skip connections (see Appendix A for more details). Results. Overall, MLD largely outperforms alternatives from the literature, both in terms of coherence and generative quality. VAE-based models suffer from a coherence-quality trade-off and modality collapse for highly heterogeneous data-sets. We proceed to show this on several standard benchmarks from the multi-modal VAE literature (see Appendix C for details on the data-sets).
The first data-set we consider is MNIST-SVHN (Shi et al., 2019), where the two modalities differ in complexity. High variability, noise and ambiguity make attaining good coherence for the SVHN modality a challenging task. Overall, MLD outperforms all VAE-based alternatives in terms of coherence, especially for joint generation and for conditional generation of MNIST given SVHN, see Table 1. Mixture models (MMVAE, MOPOE) suffer from modality collapse (poor SVHN generation), whereas product-of-experts models (MVAE, MVTCAE) generate better quality samples at the expense of SVHN-to-MNIST conditional coherence. Joint generation is poor for all VAE models. Interestingly, these models also fail at SVHN self-reconstruction, which we discuss in Appendix E. MLD achieves the best performance also in terms of generation quality, as confirmed by qualitative results (Figure 1), showing for example how MLD conditionally generates multiple SVHN digits within one sample, given the input MNIST image, whereas other methods fail to do so. The Multi-modal Handwritten Digits data-set (MHD) (Vasco et al., 2022) contains gray-scale digit images, motion trajectories of the handwriting, and sounds of the spoken digits. In our experiments, we do not use the label as a fourth modality. While digit image and trajectory share a good amount of information, the sound modality contains much more modality-specific variation. Consequently, conditional generation involving the sound modality, along with joint generation, are challenging tasks. Coherence-wise (Table 2), MLD outperforms all the competitors, with the biggest difference seen in joint generation and in sound-to-other-modalities generation (in the latter task MVTCAE performs better than the other competitors but is still worse than MLD). MLD dominates alternatives also in terms of generation quality (Table 3). This is true both for the image and sound modalities, for which some VAE-based models fail to produce high-quality results, demonstrating the limitation of these
methods in handling highly heterogeneous modalities. MLD, on the other hand, achieves high generation quality for all modalities, possibly due to the independent training of the autoencoders, which avoids interference. The POLYMNIST data-set (Sutter et al., 2021) consists of 5 modalities synthetically generated by using MNIST digits and varying the background images. The homogeneous nature of the modalities is expected to mitigate gradient conflict issues in VAE-based models, and consequently reduce modality collapse. However, MLD still outperforms all alternatives, as shown in Figure 2. Concerning generation coherence, MLD achieves the best performance in all cases, with the single exception of the case with one observed modality. On the qualitative side, not only is MLD superior to alternatives, but its results remain stable as more modalities are considered, a capability that not all competitors share. MLD* denotes the version of our method using a more powerful image autoencoder.

CONCLUSION AND LIMITATIONS
We have presented a new multi-modal generative model, Multimodal Latent Diffusion (MLD), to address the well-known coherence-quality tradeoff that is inherent in existing multi-modal VAE-based models. MLD uses a set of independently trained, uni-modal, deterministic autoencoders. The generative properties of our model stem from a masked diffusion process that operates on latent variables. We also developed a new multi-time training method to learn the conditional score network for multi-modal diffusion. An extensive experimental campaign on various real-life data-sets provided compelling evidence of the effectiveness of MLD for multi-modal generative modeling. In all scenarios, including cases with loosely correlated modalities and high-resolution datasets, MLD consistently outperformed the alternatives from the state-of-the-art.

A DIFFUSION IN THE MULTIMODAL LATENT SPACE
In this section, we provide additional technical details of MLD. We first discuss a naive approach based on in-painting, which uses only the unconditional score network for both joint and conditional generation. We also discuss an alternative training scheme based on work from the image-caption literature (Bao et al., 2023). Finally, we provide extra technical details on the score network architecture and the sampling technique.

A.1 MODALITIES AUTO-ENCODERS
Each deterministic autoencoder used in the first stage of MLD has a vector latent space with no size constraints. In contrast, VAE-based models generally require the latent space of each individual VAE to be of exactly the same size, to allow the definition of a joint latent space.
In our approach, before concatenation, the modality-specific latent spaces are normalized elementwise by their mean and standard deviation. In practice, we use the statistics computed on the first training batch, which we found sufficient to obtain reliable estimates. This operation harmonizes the different modality-specific latent spaces and therefore facilitates the learning of a joint score network.
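The normalization and concatenation described above can be sketched as follows; the function names and the 1e-6 variance guard are illustrative, not part of the paper:

```python
import numpy as np

def fit_latent_stats(first_batch_latents):
    """Per-modality elementwise mean/std, estimated once from the first
    training batch (as described in the text; names are illustrative)."""
    stats = []
    for z in first_batch_latents:          # z: (batch, latent_dim)
        mu = z.mean(axis=0)
        sigma = z.std(axis=0) + 1e-6       # guard against zero variance
        stats.append((mu, sigma))
    return stats

def normalize_and_concat(latents, stats):
    """Harmonize modality-specific latents, then build the joint latent space."""
    normed = [(z - mu) / sigma for z, (mu, sigma) in zip(latents, stats)]
    return np.concatenate(normed, axis=-1)
```

After this step, every coordinate of the joint latent vector has comparable scale, regardless of which modality it came from.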

A.2 MULTI-MODAL DIFFUSION SDE
In Section 3, we presented our multi-modal latent diffusion process, which enables both joint and conditional multi-modal generation. The role of the SDE is to gradually add noise to the data, perturbing its structure until a pure-noise distribution is reached. In this work, we consider the variance-preserving SDE (VPSDE) (Song et al., 2021b). In this framework we have ρ(r) = N(0, I), α(t) = −(1/2) β(t) and g(t) = √β(t), where β(t) = β_min + t (β_max − β_min). Following (Ho et al., 2020; Song et al., 2021b), we set β_min = 0.1 and β_max = 20. With this configuration, substituting into Equation (3) yields the forward SDE

dR_t = −(1/2) β(t) R_t dt + √β(t) dW_t.

The corresponding perturbation kernel is

q(R_t | R_0 = z) = N( z e^{−(1/2) ∫_0^t β(s) ds}, (1 − e^{−∫_0^t β(s) ds}) I ).

The marginal score ∇ log q(R_t, t) is approximated by a score network s_χ(R_t, t), whose parameters χ can be optimized by minimizing the ELBO in Equation (5); we found that using the same re-scaling as in Song et al. (2021b) is more stable.
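The VPSDE perturbation kernel above admits a closed form in which the mean-scaling and standard deviation depend only on t; a minimal sketch with the β_min = 0.1, β_max = 20 values from the text:

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0   # values used in the paper (Song et al., 2021b)

def vp_coeffs(t):
    """Mean scaling and std of the VPSDE perturbation kernel
    q(R_t | R_0) = N(mean_coef * R_0, std**2 * I)."""
    # integral of beta(s) ds on [0, t], with beta(s) = bmin + s * (bmax - bmin)
    log_coef = -0.5 * (BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t ** 2)
    mean_coef = np.exp(log_coef)
    std = np.sqrt(1.0 - mean_coef ** 2)
    return mean_coef, std

def perturb(z0, t, rng):
    """Sample R_t ~ q(r | z0, t) in one shot (no SDE simulation needed)."""
    mean_coef, std = vp_coeffs(t)
    return mean_coef * z0 + std * rng.standard_normal(z0.shape)
```

At t = 0 the kernel is the identity (mean_coef = 1, std = 0), while at t = 1 the mean scaling is below 0.01 and the sample is essentially pure noise.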
The reverse process is described by a different SDE (Equation (4)). When using a variance-preserving SDE, Equation (4) specializes to

dR_t = [ (1/2) β(T−t) R_t + β(T−t) ∇ log q(R_t, T−t) ] dt + √β(T−t) dW̄_t, (11)

with R_0 ∼ ρ(r) as initial condition and time t flowing from t = 0 to t = T.
Once the parametric score network is optimized, sampling R_T ∼ q_ψ(r) is possible through the simulation of Equation (11), allowing joint generation. A numerical SDE solver can be used to sample R_T, which is then fed to the modality-specific decoders to jointly generate a set X = {d_θ^i(R_T^i)}, i = 1, ..., M. As explained in Section 4.2, the unconditional score network s_χ(R_t, t) also enables conditional generation through the approximation described in Song et al. (2021b).
As described in Algorithm 1, one can generate a set of modalities A_1 conditioned on an available set of modalities A_2. First, the available modalities are encoded into their respective latent spaces z^{A_2}, and the initial missing part is sampled from the stationary distribution, R_0^{A_1} ∼ ρ(r^{A_1}). Using an SDE solver (e.g. Euler-Maruyama), the reverse diffusion SDE (Equation (11)) is discretized with finite time steps ∆t = T/N, starting from t = 0 and iterating until t ≈ T. At each iteration, the available portion of the latent space is diffused to the same noise level as R_t^{A_1}, allowing the use of the unconditional score network; then the reverse diffusion update is performed. This process is repeated until t ≈ T, yielding R_T^{A_1} = Ẑ^{A_1}, which can be decoded to recover x̂^{A_1}. Note that joint generation can be seen as a special case of Algorithm 1 with A_2 = ∅. We name this first approach Multi-modal Latent Diffusion with In-painting (MLD IN-PAINT) and provide an extensive comparison with our method MLD in Appendix B.

Algorithm 1: MLD IN-PAINT conditional generation
// Diffuse the available portion of the latent space (Equation (10))
As discussed in Section 4.2, the approximation enabling the in-painting approach can be effective in several domains, but its generalization to the multi-modal latent space scenario is not trivial. We argue that this is due to the heterogeneity of the modalities, which induces latent spaces with different characteristics.
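A minimal numerical sketch of the in-painting loop of Algorithm 1, with a hypothetical `score_fn(r, s)` standing in for the unconditional score network and a coordinate `mask` marking the conditioning latents:

```python
import numpy as np

BMIN, BMAX = 0.1, 20.0

def inpaint_generate(score_fn, z_cond, mask, dim, n_steps=250, T=1.0, rng=None):
    """Sketch of Algorithm 1 (MLD IN-PAINT): Euler-Maruyama reverse diffusion
    where, at each step, the available latents are re-diffused to the current
    noise level and pasted in, so the unconditional score can be reused.
    `mask` is 1 on conditioning coordinates (signature is illustrative)."""
    rng = rng or np.random.default_rng(0)
    dt = T / n_steps
    r = rng.standard_normal(dim)                       # R_0 ~ rho(r)
    for k in range(n_steps):
        s = T - k * dt                                 # current noise level
        beta = BMIN + s * (BMAX - BMIN)
        mean_coef = np.exp(-0.5 * (BMIN * s + 0.5 * (BMAX - BMIN) * s ** 2))
        std = np.sqrt(1.0 - mean_coef ** 2)
        # in-painting: diffuse the known part to level s and overwrite
        noised_cond = mean_coef * z_cond + std * rng.standard_normal(dim)
        r = np.where(mask > 0, noised_cond, r)
        # Euler-Maruyama step of the reverse VPSDE
        drift = 0.5 * beta * r + beta * score_fn(r, s)
        r = r + drift * dt + np.sqrt(beta * dt) * rng.standard_normal(dim)
    return np.where(mask > 0, z_cond, r)               # clamp conditioning part
```

Setting the mask to all zeros recovers the joint-generation special case (A_2 = ∅).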
Across different modality-specific latent spaces, the rate of information loss can vary through the diffusion process. We verify this hypothesis with the following experiment.
Latent space robustness against diffusion perturbation: We analyze the effect of the forward diffusion perturbation on the latent space through time. We encode the modalities using their respective encoders to obtain the latent variables Z = [e_ψ^1(X^1), ..., e_ψ^M(X^M)]. Given a time t ∈ [0, T], we diffuse the different latent spaces by applying Equation (10) to get R_t ∼ q(r | z, t), with R_t being the perturbed version of the latent space at time t. We feed the modality-specific decoders with the perturbed latent space to obtain X̂_t = {d_θ^i(R_t^i)}, i = 1, ..., M, where X̂_t denotes the modalities reconstructed from the perturbed latent space. To evaluate the information loss induced by the diffusion process on the different modalities, we assess coherence preservation in the reconstructed modalities X̂_t by computing the coherence (in %) as done in Section 5.
We expect high coherence for t ≈ 0 compared to t ≈ T, since the information in the latent space is better preserved at the beginning of the diffusion process than in the last phase of the forward SDE, where all dependence on the initial conditions vanishes. Figure 5 shows the coherence as a function of the diffusion time t ∈ [0, 1] for different modalities across multiple datasets. We observe that, within the same dataset, some modalities exhibit a distinctive level of robustness (using coherence as a proxy) against the diffusion perturbation, in comparison with the remaining modalities of the same dataset. For instance, we remark that SVHN is less robust than MNIST, which should manifest as under-performance in SVHN-to-MNIST conditional generation, an intuition that we verify in Appendix B.
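The probe described above can be sketched as follows, with a hypothetical `decode_and_classify` callable standing in for the decoder-plus-classifier pipeline that scores coherence in [0, 1]:

```python
import numpy as np

def robustness_curve(z, decode_and_classify, ts, rng):
    """Sketch of the latent-robustness probe: perturb a modality's latents
    with the forward VPSDE kernel at each time t, decode, and score the
    resulting coherence (beta_min=0.1, beta_max=20 as in the text)."""
    curve = []
    for t in ts:
        mean_coef = np.exp(-0.5 * (0.1 * t + 0.5 * 19.9 * t ** 2))
        std = np.sqrt(1.0 - mean_coef ** 2)
        z_t = mean_coef * z + std * rng.standard_normal(z.shape)
        curve.append(decode_and_classify(z_t))
    return curve
```

On a toy proxy task (e.g. predicting the sign of one latent coordinate), the curve decays from near-perfect coherence at t ≈ 0 toward chance level at t ≈ 1, mirroring the behavior in Figure 5.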

A.3 MULTI-TIME MASKED MULTI-MODAL SDE
To learn a score network capable of both conditional and joint generation, we proposed in Section 4 a multi-time masked diffusion process.
Algorithm 2 presents pseudo-code for the multi-time masked training. The masked diffusion process is applied following a randomization with probability d. First, a subset of modalities A_2 is selected at random to act as the conditioning modalities, and the remaining set A_1 is diffused. The time t is sampled uniformly from [0, T], and the portion of the latent space corresponding to A_1 is diffused accordingly. Using the masking shown in Algorithm 2, the portion of the latent space corresponding to A_2 is not diffused and is forced to equal R_0^{A_2} = z^{A_2}. The multi-time vector τ is then constructed. Lastly, the score network is optimized by minimizing a masked loss restricted to the diffused part of the latent space. With probability (1 − d), all the modalities are diffused at the same time and A_2 = ∅. Since the randomization of A_1 and A_2 can result in diffusing portions of the latent space of different sizes, we calibrate the loss by re-weighting it according to the cardinalities of the diffused and frozen portions of the latent space, where dim(·) denotes the sum of the latent space cardinalities of a given subset of modalities, with dim(∅) = 0.
Algorithm 2: MLD masked multi-time diffusion training step
// Diffuse the available portion of the latent space (Equation (10))
The optimized score network can approximate both the conditional and the unconditional true score (Equation (13)); joint generation is a special case of the latter with A_2 = ∅. Algorithm 3 describes the reverse conditional generation pseudo-code. It is instructive to compare this algorithm with Algorithm 1. The main difference resides in the use of the multi-time score network, which enables conditional generation with the multi-time vector playing the role of both time information and conditioning signal. In Algorithm 1, by contrast, we do not have a conditional score network; we therefore resort to the approximation from Section 4.2 and use the unconditional score.
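One training step of the masked multi-time scheme can be sketched as follows; all names, the uniform subset randomization, the noise-prediction loss form, and the total-to-diffused dimension re-weighting are illustrative choices, not the paper's exact implementation:

```python
import numpy as np

def masked_training_step(z_list, score_fn, d=0.5, T=1.0, rng=None):
    """Sketch of one step of Algorithm 2. With probability d, a random
    conditioning subset A2 is frozen at its clean latent value (tau = 0)
    while the rest (A1) is diffused at a shared time t; with probability
    1 - d, everything is diffused (A2 empty)."""
    rng = rng or np.random.default_rng(0)
    M = len(z_list)
    if rng.random() < d:
        cond = rng.random(M) < 0.5            # A2 membership
        if cond.all():                        # keep at least one diffused modality
            cond[rng.integers(M)] = False
    else:
        cond = np.zeros(M, dtype=bool)        # A2 = empty set
    t = rng.uniform(0.0, T)
    mean_coef = np.exp(-0.5 * (0.1 * t + 0.5 * (20.0 - 0.1) * t ** 2))
    std = np.sqrt(1.0 - mean_coef ** 2)
    r, tau, target, mask = [], [], [], []
    for frozen, z in zip(cond, z_list):
        if frozen:                            # R_0^{A2} = z^{A2}, tau_m = 0
            r.append(z)
            tau.append(0.0)
            target.append(np.zeros_like(z))
            mask.append(np.zeros_like(z))
        else:                                 # diffuse with the shared time t
            eps = rng.standard_normal(z.shape)
            r.append(mean_coef * z + std * eps)
            tau.append(t)
            target.append(eps)                # noise-prediction target
            mask.append(np.ones_like(z))
    r, target, mask = (np.concatenate(a) for a in (r, target, mask))
    pred = score_fn(r, np.array(tau))         # multi-time score network
    weight = mask.size / max(mask.sum(), 1.0) # loss calibration by dimensions
    return weight * np.mean(mask * (pred - target) ** 2)
```

Only the diffused coordinates contribute to the loss; the frozen latents enter solely as conditioning input to the score network.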
Algorithm 3: MLD conditional generation. Data: The set of modalities to be generated

A.4 UNI-DIFFUSER TRAINING
The work presented in Bao et al. (2023) is specialized for image-caption applications. The approach is based on a multi-modal diffusion model applied to a unified latent embedding, obtained via pre-trained autoencoders, and incorporates pre-trained models (CLIP (Radford et al., 2021) and GPT-2 (Radford et al., 2019)). The unified latent space is composed of an image embedding, a CLIP image embedding, and a CLIP text embedding. Note that the CLIP model is pre-trained on pairs of multi-modal data (image-text), which is expected to enhance generative performance. Since it is not trivial to obtain a jointly trained encoder similar to CLIP for arbitrary modalities, evaluating this model on different modalities across different datasets (e.g. including audio) is not an easy task.
To compare with this work, we adapt the training scheme presented in Bao et al. (2023) to our MLD method. Instead of applying a masked multi-modal SDE to train the score network, every portion of the latent space is diffused according to a different time t_i ∼ U(0, 1); therefore, the multi-time vector fed to the score network is τ(t) = [t_1 ∼ U(0, 1), ..., t_M ∼ U(0, 1)]. For fairness, we use the same score network and reverse-process sampler as for our MLD version with multi-time training, and call this variant Multi-modal Latent Diffusion UniDiffuser (MLD UNI).
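The difference from the masked scheme can be sketched in a few lines: each modality draws its own independent diffusion time, and all modalities are perturbed (function names are illustrative; the VPSDE kernel uses β_min = 0.1, β_max = 20 as above):

```python
import numpy as np

def unidiffuser_perturb(z_list, rng):
    """MLD UNI training variant (after Bao et al., 2023): every modality is
    diffused at its own time t_i ~ U(0, 1), so no modality is ever frozen."""
    tau = rng.uniform(0.0, 1.0, size=len(z_list))
    out = []
    for t, z in zip(tau, z_list):
        mean_coef = np.exp(-0.5 * (0.1 * t + 0.5 * 19.9 * t ** 2))
        std = np.sqrt(1.0 - mean_coef ** 2)
        out.append(mean_coef * z + std * rng.standard_normal(z.shape))
    return tau, out
```

In contrast to Algorithm 2, the multi-time vector here never contains exact zeros during training; conditioning at test time is expressed by setting the corresponding entries of τ to 0.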

A.5 INTUITIVE SUMMARY: HOW DOES MLD CAPTURE MODALITY INTERACTIONS?
MLD treats the latent space of each modality as a variable that evolves differently through the diffusion process, according to a multi-time vector. The masked multi-time training enables the model to learn the score for every combination of conditionally diffused modalities, using the frozen modalities as the conditioning signal, through a randomized scheme. By learning the score function of the diffused modalities at different time steps, the score model captures the correlations between modalities. At test time, the diffusion time of each modality is chosen to modulate its influence on the generation, as follows.
For joint generation, the model uses the unconditional score, which corresponds to using the same diffusion time for all modalities. Thus, all modalities influence each other equally. This ensures that the modality-interaction information is faithful to that of the observed data distribution.
The model can also generate modalities conditionally using the conditional score, by freezing the conditioning modalities during the reverse process. The frozen state is similar to the final state of the reverse process, where information is unperturbed; thus the influence of the conditioning modalities is maximal. Consequently, the generated modalities reflect the necessary information from the conditioning modalities and achieve the desired correlation.

A.6 TECHNICAL DETAILS
Sampling schedule: We use the sampling schedule proposed in Lugmayr et al. (2022), which was shown to improve the coherence of conditional and joint generation. We use the best parameters suggested by the authors: N = 250 time steps, with r = 10 re-sampling repetitions and jump size j = 10. For readability, Algorithm 1 and Algorithm 3 present pseudo-code with a linear sampling schedule, which can easily be adapted to any other schedule.
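One simple way to lay out such a re-sampling schedule as a list of time-step indices is sketched below; this is an assumption-laden approximation of the schedule in Lugmayr et al. (2022), whose exact interleaving of jumps may differ:

```python
def repaint_schedule(n_steps=250, jump=10, n_resample=10):
    """Sketch of a RePaint-style schedule: after every `jump` reverse steps,
    jump back `jump` steps (re-noising) and redo them, `n_resample` times,
    to better harmonize the conditioned and generated latents."""
    ts = []
    t = n_steps
    while t > 0:
        block = list(range(t, max(t - jump, 0), -1))   # `jump` reverse steps
        repeats = n_resample if t - jump > 0 else 1
        for k in range(repeats):
            ts.extend(block)                           # reverse (denoising) pass
            if k < repeats - 1:
                ts.extend(reversed(block))             # jump back (re-noise)
        t -= jump
    return ts
```

The resulting list starts at the highest noise level, revisits each block several times, and ends at step 1; a linear schedule is recovered with `n_resample=1`.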
Training the score network: Inspired by the architecture of Dupont et al. (2022), we use simple residual MLP blocks with skip connections as our score network (see Figure 6). We set the width and number of blocks proportionally to the number of modalities and the latent space size.
As in Song & Ermon (2020), we use an exponential moving average (EMA) of the model parameters, with momentum parameter m = 0.999.
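The EMA update is a one-liner per parameter; a minimal sketch with the m = 0.999 momentum from the text (function name is illustrative):

```python
import numpy as np

def ema_update(ema_params, params, m=0.999):
    """Exponential moving average of score-network parameters, as in
    Song & Ermon (2020): ema <- m * ema + (1 - m) * current."""
    return [m * e + (1.0 - m) * p for e, p in zip(ema_params, params)]
```

At sampling time, the EMA copy of the parameters is typically used in place of the raw training parameters.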

B MLD ABLATION STUDY
In this section, we compare MLD with the two variants presented in Appendix A: MLD IN-PAINT, a naive approach without our proposed multi-time masked SDE, and MLD UNI, a variant of our method using the training scheme of Bao et al. (2023). We also analyze the effect of the randomization parameter d on MLD performance through an ablation study.

B.1 MLD AND ITS VARIANTS
Table 4 summarizes the approaches adopted in each variant. All the considered models share the same deterministic autoencoders, trained during the first stage.
For fairness, our evaluation uses the same configuration and code base as MLD. This includes the autoencoder architectures and latent space sizes (as in Section 5) and the same score network (Figure 6) across experiments, with MLD IN-PAINT using the same architecture with a single time dimension instead of the multi-time vector. In all variants, joint and conditional generation are conducted using the same reverse sampling schedule described in Appendix A.6.
Results: In some cases, the MLD variants can match the joint generation performance of MLD but, overall, they are less efficient and have noticeable weaknesses: MLD IN-PAINT under-performs in conditional generation when considering relatively complex modalities, and MLD UNI is not able to leverage the presence of multiple modalities to improve cross-generation, especially for datasets with a large number of modalities. MLD, on the other hand, overcomes all these limitations.
MNIST-SVHN. In Table 5, MLD achieves the best results and dominates cross-generation performance. We observe that MLD IN-PAINT lacks coherence for SVHN-to-MNIST conditional generation, a result we expected from the analysis of the experiment in Figure 5. MLD UNI, despite using a multi-time diffusion process, under-performs our method, which indicates the effectiveness of our masked diffusion process in learning the conditional score network. Since all models use the same deterministic autoencoders, the observed generation quality is relatively similar (see Figure 8 for qualitative results).
MHD. Table 6 shows the performance results for the MHD dataset in terms of generative coherence. MLD achieves the best joint generation coherence and, along with MLD UNI, dominates cross-generation coherence. MLD IN-PAINT shows a lack of coherence when conditioning on the sound modality alone, a predictable result since this is a more difficult configuration, the sound modality being loosely correlated with the other modalities. We also observe that MLD IN-PAINT performs worse than the two other alternatives when conditioned on the trajectory modality, which is the smallest modality in terms of latent size. This indicates another limitation of the naive approach regarding coherent generation when handling latent spaces of different sizes, a weakness our method MLD overcomes. Table 7 presents the generation quality results, which are homogeneous across the variants, with MLD achieving either the best or second-best performance.
POLYMNIST. In Figure 9, we remark the superiority of MLD in both generative coherence and quality. MLD UNI is not able to leverage the presence of a large number of modalities to improve conditional generation coherence; interestingly, an increase in the number of input modalities negatively impacts the performance of MLD UNI.
MNIST-SVHN (Shi et al., 2019) is constructed using pairs of MNIST and SVHN images sharing the same digit class (see Figure 12a). Each instance of a digit class (in either dataset) is randomly paired with 20 instances of the same digit class from the other dataset. SVHN samples are obtained from house numbers in Google Street View images, characterized by a variety of colors, shapes and angles. Many SVHN samples are noisy and can contain different digits within the same sample, due to the imperfect cropping of the original full house-number image. One challenge of this dataset for multi-modal generative models is to learn to extract the digit number and reconstruct a coherent MNIST modality.
MHD (Vasco et al., 2022) is composed of 3 modalities: synthetically generated images and motion trajectories of handwritten digits, associated with their spoken sounds. The images are gray-scale 1 × 28 × 28, and the handwriting trajectories are represented by 1 × 200 vectors. The spoken-digit sound is a 1s-long audio clip processed as a Mel-spectrogram, constructed with a 512 ms hopping window and 128 Mel bins, resulting in a 1 × 128 × 32 representation. This benchmark is the closest to a real-world multi-modal sensor scenario, because of the presence of three completely different modalities, with the audio modality representing a complex data type. Therefore, similarly to SVHN, the conditional generation of coherent images or trajectories from sound represents a challenging use case.
POLYMNIST (Sutter et al., 2021) is an extended version of the MNIST dataset with 5 modalities. Each modality is constructed by overlaying a randomly chosen set of MNIST digits on a random crop from a modality-specific, 3-channel background image. This synthetic dataset allows evaluating the scalability of multi-modal generative models to a large number of modalities. Although this dataset is composed only of images, the modality-specific backgrounds have different textures, which results in different levels of difficulty: in Figure 12c, the digits are more difficult to distinguish in modalities 1 and 5 than in the remaining modalities.
For MNIST-SVHN, MHD and POLYMNIST, the shared semantic information is the digit class. Single-modality classifiers are trained to classify the digit of a given modality sample. To compute the coherence of the conditional generation of modality m given a subset of modalities A, we feed the modality-specific pre-trained classifier C_m with the conditionally generated sample X̂_m. The predicted label is compared to the ground-truth label y_{X_A}, which is the label of the modalities in the subset X_A. For N samples, the average matching rate establishes the coherence. For all experiments, N equals the size of the test set.

The joint generation coherence is measured by feeding the generated samples of each modality to their specific trained classifiers. The rate at which all classifiers output the same predicted digit label, over N generations, is the joint generation coherence.
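This agreement rate can be sketched as follows, with `predictions` a hypothetical (M, N) array holding the label predicted by each of the M classifiers for each of the N jointly generated samples:

```python
import numpy as np

def joint_coherence(predictions):
    """Joint-generation coherence: the fraction of generated samples for
    which all per-modality classifiers predict the same digit label."""
    agree = np.all(predictions == predictions[0], axis=0)
    return float(agree.mean())
```

A sample counts as coherent only if every classifier agrees, so a single incoherent modality is enough to fail that sample.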
The leave-one-out coherence is the conditional generation coherence computed over all possible conditioning subsets that exclude the generated modality.

C.2.2 GENERATION QUALITY
For each modality, we consider the following metrics: • RGB images: FID (Heusel et al., 2017) is the standard state-of-the-art metric for evaluating the image generation quality of generative models.
• Audio: FAD (Kilgour et al., 2019) is the standard state-of-the-art metric for evaluating audio generation. FAD is robust against noise and consistent with human judgments (Vinay & Lerch, 2022). As for FID, a Fréchet distance is computed, but using VGGish (an audio classifier model) embeddings instead.
• Other modalities: For other modality types, we derive FMD (Fréchet Modality Distance), a metric similar to FID and FAD. We compute the Fréchet distance between the statistics of the activations of the modality-specific pre-trained classifiers used for coherence evaluation. FMD is used to evaluate the generation quality of the MNIST modality in MNIST-SVHN, and of the image and trajectory modalities in the MHD dataset.
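The Fréchet-distance computation behind FMD can be sketched as follows. For brevity this version uses the diagonal-covariance form; the full metric, as in FID, uses complete covariance matrices and a matrix square root:

```python
import numpy as np

def frechet_modality_distance(act_real, act_gen):
    """Diagonal-covariance sketch of FMD between two sets of classifier
    activations, each of shape (n_samples, n_features):
    ||mu1 - mu2||^2 + sum(v1 + v2 - 2 * sqrt(v1 * v2))."""
    mu1, mu2 = act_real.mean(axis=0), act_gen.mean(axis=0)
    v1, v2 = act_real.var(axis=0), act_gen.var(axis=0)
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.sum(v1 + v2 - 2.0 * np.sqrt(v1 * v2)))
```

Identical activation sets give a distance of zero, and a pure mean shift of c per feature contributes c squared per feature.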
For conditional generation, we compute the quality metric (FID, FAD or FMD) between the conditionally generated modality and the real data. For joint generation, we use the randomly generated modality and an equal number of randomly selected samples from the real data.
For CUB, we use 10000 samples to evaluate the generation quality in terms of FID. In the remaining experiments, we use 5000 samples to evaluate the performance in terms of FID, FAD or FMD.

D IMPLEMENTATION DETAILS
We report in this section the implementation details for each benchmark. We used the same unified code base for all baselines, built on the PyTorch framework. The VAE implementations are adapted from the official code whenever available (MVAE, MMVAE and MOPOE as in 3, MVTCAE 4 and NEXUS 5). For fairness, MLD and all the VAE-based models use the same autoencoder architectures.
We use the best hyper-parameters suggested by the authors. Across all datasets, we use the Adam optimizer (Kingma & Ba, 2014) for training.

D.1 MLD
MLD uses the same autoencoder architectures as the VAE-based models, except that these are deterministic autoencoders. The autoencoders are trained using the same reconstruction loss terms as the VAE-based models. Table 8 and Table 9 summarize the hyper-parameters used during the two phases of MLD training. Note that for the image modality in the CUB dataset, data augmentation was necessary to overcome over-fitting when training the deterministic autoencoder (we used TrivialAugmentWide from the Torchvision library).
For MNIST-SVHN, we follow Shi et al. (2019) and use the same autoencoder architecture and pre-trained classifier. The latent space size is set to 20 and β = 5.0; for MVTCAE, α = 5/6. For both modalities, the likelihood is a Laplace distribution. For NEXUS, we use the same modality latent space sizes as in MLD; the joint NEXUS latent space size is set to 20, with β_i = 1.0 and β_c = 5.0. We train all the VAE models for 150 epochs with batch size 256 and learning rate 1e−3.
For MHD, we reuse the autoencoder architectures and pre-trained classifiers of Vasco et al. (2022). We adopt the hyper-parameters of Vasco et al. (2022) to train the NEXUS model with the same settings, besides discarding the label modality. For the remaining VAE-based models, the latent space size is set to 128, β = 1.0, and α = 5/6 for MVTCAE. For all modalities, the mean squared error (MSE) is used as the reconstruction loss, as in Vasco et al. (2022). These models are trained for 600 epochs with batch size 128 and learning rate 1e−3.
For POLYMNIST, we use the same autoencoder architecture and pre-trained classifier as Sutter et al. (2021); Hwang et al. (2021). We set the latent space size to 512, β = 2.5, and α = 5/6 for MVTCAE. For all modalities, the likelihood is a Laplace distribution. For NEXUS, we use the same modality latent space sizes as in MLD; the joint NEXUS latent space size is set to 64, with β_i = 1.0 and β_c = 2.5. We train all the models for 300 epochs with batch size 256 and learning rate 1e−3.
For CUB, we use the same autoencoder architectures and implementation settings as in Daunhawer et al. (2022). Laplace and one-hot categorical distributions are used as likelihoods for the image and caption modalities, respectively. The latent space size is set to 64, with β = 9.0 for MVAE, MVTCAE and MOPOE, and β = 1 for MMVAE. We set α = 5/6 for MVTCAE. For NEXUS, we use the same modality latent space sizes as in MLD; the joint NEXUS latent space size is set to 64, with β_i = 1.0 and β_c = 1. We train all models for 150 epochs with batch size 64, with learning rate 5e−4 for MVAE, MVTCAE and MOPOE and 1e−3 for the remaining models.
Finally, note that in the official implementations of Sutter et al. (2021) and Hwang et al. (2021), for the POLYMNIST and MNIST-SVHN datasets, the classifiers used for evaluation had dropout active. In our implementation, we make sure to deactivate dropout during the evaluation step.

D.3 MLD WITH POWERFUL AUTOENCODER
Here we provide more details about the CUB experiment using the more powerful autoencoder, denoted MLD* in Figure 3. We use an architecture similar to Rombach et al. (2022), adapted to 64 × 64 resolution images. We modified the autoencoder architecture to be deterministic and trained the model with a simple mean squared error loss. We kept the same configuration as the CUB experiment described above, including the text autoencoder, score network, and hyper-parameters. We also performed experiments with the same settings on 128 × 128 resolution images; the qualitative results are included in Figure 25.

D.4 COMPUTATION RESOURCES
In our experiments, we used 4 A100 GPUs, for a total of roughly 4 months of experiments.

E ADDITIONAL RESULTS
In this section, we report detailed results for all of our experiments, including standard deviation and additional qualitative samples, for all the data-sets and all the methods we compared in our work.
E.1 MNIST-SVHN
E.1.1 SELF-RECONSTRUCTION
In Table 10, we report results on self-coherence, which we use to support the arguments of Section 2. This metric measures the loss of information due to latent collapse, by assessing the ability of each competing model to reconstruct an arbitrary modality given the same modality, or a set thereof, as input. For our MLD model, self-reconstruction is done without using the diffusion model component: the modality is encoded with its deterministic encoder, and the decoder is fed the resulting latent variable to obtain the reconstruction.
We observe that VAE-based models fail at reconstructing SVHN given SVHN. This is especially visible for the product-of-experts-based models (MVAE and MVTCAE). In MLD, the deterministic autoencoders do not suffer from this weakness and achieve overall the best performance.
Figure 13 shows qualitative results for self-generation. We remark that, in some samples generated by the VAE-based models, the digits differ from those in the input sample, indicating information loss due to latent collapse: for example, the generation of the MNIST digit 3 for MVAE, and the generation of the SVHN digit 2 for MVTCAE.
Preprint. Under review.

E.3 POLYMNIST
Evaluation metrics. Coherence is measured as in Shi et al. (2019); Sutter et al. (2021); Palumbo et al. (2023), using pre-trained classifiers on the generated data and checking the consistency of their outputs. Generative quality is computed using the Fréchet Inception Distance (FID) (Heusel et al., 2017) and the Fréchet Audio Distance (FAD) (Kilgour et al., 2019) for images and audio, respectively. Full details on the metrics are included in Appendix C. All results are averaged over 5 seeds (we report standard deviations in Appendix E).

Figure 1 :
Figure 1: Qualitative results for MNIST-SVHN. For each model we report: MNIST to SVHN conditional generation on the left, SVHN to MNIST conditional generation on the right.

Figure 2 :
Figure 2: Results for the POLYMNIST dataset. Left: a comparison of the generative coherence (% ↑) and quality in terms of FID (↓) as a function of the number of inputs. We report the average performance following the leave-one-out strategy (see Appendix C). Right: qualitative results for the joint generation of the 5 modalities.

Figure 5 :
Figure 5: Coherence as a function of the diffusion process time for three datasets. The diffusion perturbation is applied to the modality latent spaces after elementwise normalization.

Figure 8 :
Figure 8: Qualitative results for MNIST-SVHN. For each model we report: MNIST to SVHN conditional generation on the left, SVHN to MNIST conditional generation on the right.

Figure 9: Figure 11:
Figure 9: Results for the POLYMNIST dataset. Left: a comparison of the generative coherence (% ↑) and quality in terms of FID (↓) as a function of the number of input modalities. We report the average performance following the leave-one-out strategy (see Appendix C). Right: qualitative results for the joint generation of the 5 modalities.

CUB
Figure 12: Illustrative example of the datasets used for the evaluation.
The leave-one-out coherence is Coherence(X̂_m | X_A), with A = {1, .., M} \ m. Due to the large number of modalities in POLYMNIST, similarly to Sutter et al. (2021); Hwang et al. (2021); Daunhawer et al. (2022), we compute the average leave-one-out conditional coherence as a function of the size of the observed modality subset. Due to the unavailability of labels in the CUB dataset, we use CLIP-S (Hessel et al., 2021), a state-of-the-art metric for image-captioning evaluation.

Figure 13: Figure 14: Figure 15:
Figure 13: Self-generation qualitative results for MNIST-SVHN. For each model we report: MNIST to MNIST conditional generation on the left, SVHN to SVHN conditional generation on the right.

Figure 16 :
Figure 16: Joint generation qualitative results for MHD. The three modalities are randomly generated simultaneously (top row: image; middle row: trajectory vector rendered as an image; bottom row: sound Mel-spectrogram).

Figure 18 :
Figure 18: Top: generation coherence (%) for POLYMNIST (higher is better). Bottom: generation quality (FID; lower is better). We report the average leave-one-out performance as a function of the number of observed modalities for each modality X_i. Joint refers to the random generation of the 5 modalities simultaneously.

Figure 19 :
Figure 19: Top: generation coherence (%) for POLYMNIST (higher is better). Bottom: generation quality (FID; lower is better). We report the average leave-one-out performance as a function of the number of observed modalities for each modality X_i. Joint refers to the random generation of the 5 modalities simultaneously.

Figure 20: Figure 21:
Figure 20: Conditional generation qualitative results for POLYMNIST. The modality X_2 (first row) is used as the condition to generate the 4 remaining modalities (rows below).

Figure 22 :
Figure 22: Results for the POLYMNIST dataset. Left: a comparison of the generative coherence (↑) and quality in terms of FID (↓) as a function of the number of inputs.

Table 1 :
Generation coherence and quality for MNIST-SVHN (M: MNIST, S: SVHN). The generation quality is measured in terms of the Fréchet Modality Distance (FMD) for MNIST and FID for SVHN.

Table 2 :
Generation coherence (%) for MHD (higher is better). The top line refers to the generated modality, while the observed modality subsets are presented below.

Table 3 :
Generation quality for MHD in terms of FMD for the image and trajectory modalities and FAD for the sound modality (lower is better).
Score network s_χ architecture used in our MLD implementation. The residual MLP block architecture is shown in Figure 7.

Table 4 :
MLD and its variants ablation study

Table 5 :
Generation coherence and quality for MNIST-SVHN (M: MNIST, S: SVHN). The generation quality is measured in terms of FMD for MNIST and FID for SVHN.

Table 6 :
Generation coherence (% ↑) for MHD (higher is better). The top line refers to the generated modality, while the observed modality subsets are presented below.

Table 10 :
Self-generation coherence and quality for MNIST-SVHN (M: MNIST, S: SVHN). The generation quality is measured in terms of FMD for MNIST and FID for SVHN.