Article

Multi-Scale Optimal Transport Transformer for Efficient Exemplar-Based Image Translation

by Jinsong Zhang 1, Xiongzheng Li 2 and Yuqin Lin 1,*

1 School of Computer Science, Big Data and Software, Fuzhou University, Fuzhou 350108, China
2 Alibaba Group, Hangzhou 310052, China
* Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2026, 10(4), 107; https://doi.org/10.3390/bdcc10040107
Submission received: 11 February 2026 / Revised: 18 March 2026 / Accepted: 26 March 2026 / Published: 1 April 2026

Abstract

Exemplar-based image translation generates an output image by transferring appearance from a reference exemplar to a content image. Existing works consider only the local correspondences between the two modalities and ignore the global distribution within each modality, so they struggle to obtain fine-grained details with efficient computation. In this paper, we propose OTFormer, a multi-scale Optimal Transport transformer for exemplar-based image translation. We formulate cross-modal alignment as a multi-scale optimal transport problem, which progressively produces a globally coherent matching. In addition, we design a lightweight multi-scale fusion block to extract and fuse features efficiently. Experiments on CelebA-HQ and DeepFashion demonstrate that OTFormer improves both image fidelity and style adherence, while reducing model parameters by 62% and achieving faster inference compared with strong baselines. These results highlight OT-guided global alignment as an effective and deployable solution for high-fidelity exemplar-based image translation.

1. Introduction

Image-to-image translation is an important problem in computer vision, which has many applications, such as conditional image editing [1,2] and controllable synthesis [3,4,5]. Among its variants, exemplar-based image translation aims to generate an output by transferring appearance attributes (e.g., texture, color, and local details) from a reference exemplar to a content image while preserving the underlying structure. Despite rapid progress, high-fidelity transfer, i.e., generating images that are both photorealistic and accurately preserve the structure of the content image while inheriting the appearance of the exemplar, remains difficult. A method must estimate reliable cross-image correspondences under semantic mismatch, and it should handle pose or layout changes and occlusions. Moreover, it must still satisfy deployment constraints on model size and inference speed. This setting exposes a persistent tension between correspondence accuracy and model efficiency.
Current methods largely fall into two streams. Each stream addresses a different side of the tension.
  • Attention-based methods [6,7,8,9] use dense pairwise attention to compute correspondences, relying on local similarity matching. However, local matching only considers the correlation between cross-modal local features and ignores the global structure within each domain. This weakness can produce misaligned features and artifacts, such as distorted textures or abnormal style placement. In addition, the quadratic complexity of attention and the large parameter footprint increase computation and memory demands, which makes real-world deployment harder.
  • Diffusion-based models [10,11] can achieve strong generation fidelity. However, exemplar-based translation rarely offers precisely aligned exemplar–content pairs. This data scarcity limits direct supervision for the denoising process [10,11]. Furthermore, diffusion models also require iterative sampling at inference time, which slows inference and increases computational cost [12]. These properties reduce their appeal in efficiency-critical scenarios.
We argue that the core issue is the lack of a principled framework for cross-domain feature alignment. A useful framework should be globally coherent and computationally efficient. Optimal Transport (OT) theory [13,14] provides a principled mathematical foundation for alignment: OT casts alignment as a minimal-cost mapping between two distributions. UNITE [15] pioneered unbalanced OT for exemplar-based translation. However, it still relies on a single-scale OT formulation and a conventional CNN decoder, which limits fine-grained transfer and multi-scale style control. More importantly, existing methods do not fully leverage OT in lightweight multi-scale designs, and they fail to use OT to enforce local semantic similarity and global structural preservation at the same time.
To address these problems, we propose OTFormer, which integrates OT theory into a hierarchical Transformer framework for exemplar-based translation. OTFormer replaces local similarity attention with a multi-scale OT formulation for correspondence learning, which encourages semantic matching and enforces spatial coherence across domains. In addition, we stack several OTFormer blocks at different resolutions for multi-stage refinement: early blocks operate at a low resolution and capture the global style layout, while later blocks operate at higher resolutions and refine local textures and details. To achieve parameter-efficient modeling, we also introduce a lightweight Multi-Scale Fusion (MSF) block. Specifically, we adopt depthwise convolution for efficient feature extraction and introduce parallel depthwise convolutions with different receptive fields for multi-scale feature aggregation. Experimental results demonstrate the effectiveness of our method compared with existing methods.
The main contributions can be summarized as follows:
  • We propose OTFormer for exemplar-based translation, which provides a globally coherent and theoretically grounded alternative to local attention matching.
  • We design a progressive alignment scheme and a lightweight MSF block, which supports coarse-to-fine style transfer while maintaining strong parameter efficiency.
  • We show that OTFormer outperforms GAN-based and diffusion-based methods in visual quality and semantic consistency, offering a better efficiency profile in parameters and inference.
The remainder of this paper is organized as follows. Section 2 reviews related work on exemplar-based image translation, covering GAN-based, diffusion-based, and optimal transport-based methods. Section 3 details the proposed OTFormer framework, including the multi-scale optimal transport formulation and the lightweight multi-scale fusion block. Section 4 presents experimental results and ablation studies on CelebA-HQ and DeepFashion. Section 5 concludes the paper.

2. Literature Review

Exemplar-based image translation has mainly progressed through two generative paradigms: Generative Adversarial Networks (GANs) and diffusion models. We also review prior OT-based methods related to exemplar-based translation.

2.1. GAN-Based Methods

GAN-based methods [16,17,18] for exemplar-based translation typically learn a mapping from a content image and an exemplar to an output by leveraging adversarial training. Early approaches focused on learning style transfer through cycle consistency or feature matching, but often struggled with fine-grained alignment. Later works introduced explicit correspondence learning mechanisms to improve transfer quality.
Generative Adversarial Networks (GANs) [19,20] have been a core tool for exemplar-based translation. GAN-based methods can transfer visual style while preserving content structure in many settings [21,22]. CoCosNet [6] introduced a dual-encoder design with pixel-wise attention by estimating correspondences between the content and the exemplar. CoCosNet delivers strong results, but the dense matching in pixel space is expensive. CoCosNet-v2 [7] addresses this cost with a ConvGRU-assisted PatchMatch strategy. This strategy improves matching efficiency across multiple resolutions.
Transformer architectures have further pushed this line of work. DynaST [8] uses dynamic sparse attention to refine feature transformation. MAT [9] proposes masked adaptive transformers to achieve better alignment for high-quality image synthesis. Chiu et al. [23] incorporate object co-saliency awareness into the colorization process, leading to more natural and visually appealing results. HyperplaneGAN [24] proposes a unified translation framework for facial-attribute editing and synthesis. These methods improve flexibility, but they still struggle to obtain fine-grained details with efficient computation, because standard attention mainly models pairwise cross-image similarities and ignores intra-image structural dependencies, which can lead to style inconsistency or content blur in generated images.

2.2. Diffusion-Based Methods

Diffusion models have recently emerged as a powerful class of generative models. For exemplar-based translation, they are typically adapted by conditioning the denoising process on both content and exemplar images. The iterative refinement process of diffusion enables high-quality synthesis, but at the cost of slow inference and large model sizes.
Diffusion models [25,26] have recently achieved strong performance in image generation. These models synthesize images through iterative denoising. A learned reverse process gradually transforms noise into an image.
Several works adapt diffusion models to exemplar-based translation. MIDMs [10] proposes a matching-and-generation framework with exemplar guidance. EBDM [11] proposes a Brownian bridge diffusion model for inter-domain transformation. These approaches can produce high-quality images, but they face practical constraints. Exemplar-based translation rarely provides tightly aligned training pairs, and this data scarcity makes it hard to accurately supervise denoising trajectories. Kosuji et al. [27] address this issue by leveraging a pre-trained diffusion model for exemplar-based color translation. Jin et al. [28] propose an exemplar-based image synthesis framework that adapts powerful diffusion models using an appearance matching adapter. Diffusion inference also requires many sampling steps, which increases computational cost and slows inference [12,29,30]. These limitations reduce feasibility for real-time or resource-limited deployment.

2.3. Optimal Transport-Based Methods

Instead of attention mechanisms, Optimal Transport (OT) offers a principled alternative for feature alignment [13,31]. OT defines correspondence as a transport plan between feature distributions, which can be obtained through cost minimization by solving the OT problem. This global optimization perspective enforces spatial coherence and is robust to local ambiguities.
UNITE [15] introduces unbalanced OT [32,33] for cross-domain alignment. However, it adopts a single-resolution OT formulation and a conventional CNN decoder, which limits the capacity of high-quality image translation. In contrast, multi-scale OT can capture cross-image matching and within-image structure more directly, but prior GAN-based methods do not fully explore this potential.
Our OTFormer integrates multi-scale OT into progressive correspondence estimation. This design treats alignment as a global optimization problem rather than a purely local matching step, which preserves cross-modal similarity and intra-modal structure during refinement. A lightweight multi-scale fusion module further improves efficiency, reducing parameters while maintaining synthesis quality.

3. Methodology

In this section, we present the proposed OTFormer framework. We first describe the overall architecture, then detail the optimal transport formulation within each OTFormer block, and finally introduce the lightweight multi-scale fusion block.

3.1. Optimal Transport Transformer

We propose OTFormer, a new framework for exemplar-based image translation. OTFormer integrates optimal transport theory with hierarchical feature learning. This design targets key limitations in prior work. GAN-based methods [6,7,8,9] often depend on expensive attention mechanisms, while diffusion-based methods [10,11] often require slow iterative sampling and large models. OTFormer takes a different route by formulating feature alignment as a multi-scale mass transport optimization problem. This formulation not only encourages globally coherent correspondences across images, but also preserves inter-modal style consistency. At the same time, the formulation maintains intra-modal structural fidelity.
As illustrated in Figure 1, the framework consists of dual encoders extracting multi-scale features from input content image I c and exemplar image I e , respectively, followed by a decoder comprising multiple cascaded OTFormer blocks. Each OTFormer block solves an optimal transport problem between feature distributions, and produces a minimal-cost transport plan. Then, we can warp the exemplar features to align with the content features using the transport plan. Then, we adopt a lightweight multi-scale fusion block to extract and fuse features, which reduces parameters while achieving superior synthesis quality.

3.1.1. Encoders

OTFormer employs symmetric dual encoders $E_c$ and $E_e$ to extract hierarchical features from the input content image and the input exemplar image. Given $I_c, I_e \in \mathbb{R}^{H \times W \times 3}$, the encoders produce multi-scale feature pyramids:
$$\{I_c^l\}_{l=0}^{2} = E_c(I_c), \qquad \{I_e^l\}_{l=0}^{2} = E_e(I_e),$$
where $I^l$ denotes the feature map at pyramid level $l$. Each encoder contains three strided convolutional blocks, each of which progressively increases the receptive field. This design supports robust multi-scale representation learning, which enables coarse-to-fine correspondence estimation. Note that the content and exemplar branches use encoders with the same architecture but different weights.
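As a concrete illustration, the dual pyramid encoders can be sketched in PyTorch. This is a minimal sketch only: the channel widths, normalization, and activation choices below are our assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PyramidEncoder(nn.Module):
    """Illustrative three-level strided-conv encoder (hypothetical channel widths)."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        self.stages = nn.ModuleList()
        ch = in_ch
        for i in range(3):
            out = base * (2 ** i)
            self.stages.append(nn.Sequential(
                nn.Conv2d(ch, out, kernel_size=3, stride=2, padding=1),
                nn.InstanceNorm2d(out),
                nn.ReLU(inplace=True),
            ))
            ch = out

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # levels l = 0, 1, 2 at 1/2, 1/4, 1/8 resolution
        return feats

# Same architecture, separate weights for the content and exemplar branches.
E_c, E_e = PyramidEncoder(), PyramidEncoder()
```

Each stage halves the spatial resolution, so a 64x64 input yields 32x32, 16x16, and 8x8 feature maps.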

3.1.2. OTFormer Block

The decoder is formed by a cascade of OTFormer blocks, each tasked with estimating precise cross-modal correspondences and refining generated features at low computational cost. As depicted in Figure 2, each OTFormer block receives three inputs: the current generated feature map $I_g^0$, the content feature $I_c^1$, and the exemplar feature $I_e^1$. Central to the block is a differentiable, entropically regularized optimal transport solver (implemented via the Sinkhorn algorithm), which computes a transport plan $T$ between the generated and exemplar feature distributions. This plan is used to warp exemplar features into the spatial domain of the generated features, followed by efficient multi-scale fusion with the content features to produce the refined output $I_g^1$.
  • Optimal transport We summarize the discrete optimal transport formulation used in the OTFormer block. Let
$$X = \{x_i\}_{i=1}^{n}, \qquad Y = \{y_j\}_{j=1}^{m}$$
be two sets of feature vectors obtained by spatially flattening $I_g^0$ and $I_e^1$, respectively, where $x_i, y_j \in \mathbb{R}^{C}$. Their marginal weight vectors are denoted $a \in \Delta_n$ and $b \in \Delta_m$ (default uniform distributions: $a_i = \frac{1}{n}$, $b_j = \frac{1}{m}$). The cost matrix $C \in \mathbb{R}^{n \times m}$ is defined by a dissimilarity measure; we adopt cosine-based dissimilarity:
$$C_{ij} = 1 - \frac{\langle x_i, y_j \rangle}{\|x_i\| \, \|y_j\|}.$$
We choose cosine dissimilarity because deep features tend to be high-dimensional and their magnitudes can vary due to factors like contrast or illumination; cosine similarity focuses on the direction of feature vectors, which better captures semantic content. Moreover, it is differentiable and has been widely used in attention mechanisms, making it a natural fit for our end-to-end training.
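The cosine-based cost matrix takes only a few lines to compute. The sketch below assumes flattened feature arrays; the `eps` guard against zero norms is our own implementation detail.

```python
import numpy as np

def cosine_cost(X, Y, eps=1e-8):
    """C[i, j] = 1 - <x_i, y_j> / (||x_i|| * ||y_j||) for rows of X (n x C) and Y (m x C)."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)
    Yn = Y / (np.linalg.norm(Y, axis=1, keepdims=True) + eps)
    return 1.0 - Xn @ Yn.T  # values in [0, 2]; 0 means identical direction
```

Because the features are normalized first, the cost depends only on direction, matching the magnitude-invariance argument above.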
The hyperparameters in Algorithm 1 are chosen as follows: $\epsilon = 10^{-8}$ prevents division by zero; the regularization parameter $\tau = 0.03$ balances convergence speed and solution accuracy, selected empirically based on validation performance; $\gamma = 1$ controls the unbalancedness (here we use balanced OT). The maximum number of iterations is set to 20, which is sufficient for convergence in practice.
Algorithm 1 Sinkhorn–Knopp Algorithm
1: procedure Initialization
2:     Normalize features: $I_g^0 \leftarrow I_g^0 / \sqrt{\|I_g^0\|_2^2 + \epsilon}$    ▹ $\epsilon = 10^{-8}$
3:     Compute cost matrix: $C \leftarrow 1 - \langle I_g^0, I_e^1 \rangle \in \mathbb{R}^{B \times L \times L}$
4:     Initialize kernel: $K \leftarrow \exp(-C / \tau)$    ▹ $\tau = 0.03$
5:     Set $\rho \leftarrow \gamma / (\gamma + \epsilon)$    ▹ $\gamma = 1$
6:     Initialize $a \leftarrow \mathbf{1}_L / L$, $b \leftarrow \mathbf{1}_L / L$
7: end procedure
8: procedure SinkhornIteration
9:     for $k = 1$ to $max\_iter$ do
10:         $b \leftarrow \left( \frac{\mathbf{1}_L / L}{K^\top a + \epsilon} \right)^{\rho}$    ▹ Target marginal
11:         $a \leftarrow \left( \frac{\mathbf{1}_L / L}{K b + \epsilon} \right)^{\rho}$    ▹ Source marginal
12:     end for
13:     $T \leftarrow \mathrm{diag}(a) \, K \, \mathrm{diag}(b) \in \mathbb{R}^{B \times L \times L}$
14:     $T^* \leftarrow T / \sum_{j=1}^{L} T_{:,:,j}$    ▹ Column normalization
15: end procedure
We solve the entropically regularized Kantorovich problem:
$$T^* = \arg\min_{T \in U(a,b)} \; \langle C, T \rangle - \varepsilon H(T) = \arg\min_{T \in U(a,b)} \; \sum_{i,j} C_{ij} T_{ij} + \varepsilon \sum_{i,j} T_{ij} \log T_{ij},$$
where $U(a,b) = \{ T \in \mathbb{R}_{+}^{n \times m} \mid T \mathbf{1}_m = a, \; T^\top \mathbf{1}_n = b \}$, $H(T) = -\sum_{i,j} T_{ij} \log T_{ij}$ is the entropy term, and $\varepsilon > 0$ is the entropic regularization coefficient. Problem (3) is efficiently solved using Sinkhorn–Knopp iterations [34,35], implemented in the log-domain for numerical stability. Taking the optimization between the generated feature $I_g^0$ and the exemplar feature $I_e^1$ as an example, the Sinkhorn–Knopp algorithm is given in Algorithm 1. The resulting transport plan $T^*$ is differentiable, enabling end-to-end training.
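The entropically regularized problem can be solved with plain Sinkhorn iterations. The NumPy sketch below covers only the balanced, uniform-marginal case; it mirrors the structure of Algorithm 1 but, as a simplification of ours, omits the unbalanced exponent, the batch dimension, and the log-domain stabilization.

```python
import numpy as np

def sinkhorn(C, tau=0.03, max_iter=20, eps=1e-8):
    """Entropic-regularized OT via Sinkhorn-Knopp (balanced case, uniform marginals).

    C: (n, m) cost matrix. Returns a transport plan T whose row sums approach 1/n
    and whose column sums approach 1/m as the iterations converge."""
    n, m = C.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-C / tau)                 # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(max_iter):
        u = a / (K @ v + eps)            # enforce the source marginal
        v = b / (K.T @ u + eps)          # enforce the target marginal
    return u[:, None] * K * v[None, :]   # T = diag(u) K diag(v)
```

Smaller `tau` sharpens the plan toward a hard assignment but slows convergence, which is why the iteration count is a fixed hyperparameter.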
  • Warping and fusion Using the transport plan $T^*$ and exemplar features $I_e^1$, we warp the exemplar features to obtain style-consistent generated features:
$$I_w^1 = T^* \times I_e^1,$$
where $\times$ denotes matrix multiplication. The warped feature $I_w^1$ is roughly aligned with the generated feature $I_g^0$. These are fused by simple addition, followed by bilinear upsampling if higher resolution is required. Subsequently, a lightweight multi-scale fusion (MSF) block refines the fused feature:
$$I_g = \mathrm{MSF}(\mathrm{Up}(I_w^1 + I_g^0)).$$
To improve consistency with the content image, each OTFormer block concatenates content features along the channel dimension. This concatenation encourages the generated features to remain aligned with the content structure. Then, two MSF blocks are used to fuse the concatenated features. These MSF blocks produce the final generated feature I g 1 .
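The warping step reduces to a matrix product between the plan and the flattened exemplar features. In the sketch below we row-normalize the plan so that each warped vector is a convex combination of exemplar features; this normalization convention is our own illustration, not necessarily the paper's exact scheme.

```python
import numpy as np

def warp_with_plan(T, Y, eps=1e-8):
    """Warp exemplar features Y (m x C) into the source layout using plan T (n x m).

    Rows of T are normalized so each output row is a weighted average (convex
    combination) of exemplar feature vectors."""
    T_norm = T / (T.sum(axis=1, keepdims=True) + eps)
    return T_norm @ Y
```

With a near-diagonal plan, each source location simply copies its matched exemplar feature, which is the intended aligned warp.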
  • Multi-scale Fusion Block Exemplar-based translation needs effective feature reuse and strong multi-scale context aggregation. Many existing methods use large networks to boost capacity. However, these networks can produce unstable feature statistics during training, and can also miss important multi-scale context or lose information during processing. These issues limit the modeling of complex structural dependencies.
Therefore, we introduce a multi-scale fusion (MSF) block to aggregate multi-scale spatial context efficiently. Figure 3 presents the overall architecture of the MSF block. The MSF block extracts spatial context at multiple scales and fuses it efficiently. Specifically, the block takes an input feature tensor $F \in \mathbb{R}^{L \times C}$, where $L = H \times W$ denotes the flattened spatial dimension and $C$ denotes the number of channels. We first apply Layer Normalization [36] to stabilize feature statistics, yielding $\hat{F}$. Then, we adopt a pointwise layer to mix channel information, implemented as a linear projection followed by a GELU activation [37]:
$$F_{\mathrm{pw}} = \mathrm{GELU}(W_1 \hat{F} + b_1), \qquad W_1 \in \mathbb{R}^{C \times C},$$
where $W_1$ and $b_1$ are learnable parameters.
After that, we reshape the feature to $F_{\mathrm{sp}} \in \mathbb{R}^{H \times W \times C}$ for spatial processing. Three depthwise convolutions with kernel sizes $3 \times 3$, $5 \times 5$, and $7 \times 7$ are used in parallel to extract multi-scale spatial context:
$$F_3 = \mathrm{DWConv}_{3 \times 3}(F_{\mathrm{sp}}), \quad F_5 = \mathrm{DWConv}_{5 \times 5}(F_{\mathrm{sp}}), \quad F_7 = \mathrm{DWConv}_{7 \times 7}(F_{\mathrm{sp}}),$$
where $\mathrm{DWConv}_{k \times k}(\cdot)$ denotes depthwise convolution [38].
Then, the multi-scale features are concatenated along the channel dimension. We also include $F_{\mathrm{pw}}$ in the concatenation to preserve low-level information:
$$F_{\mathrm{cat}} = \mathrm{Concat}(F_3, F_5, F_7, F_{\mathrm{pw}}) \in \mathbb{R}^{H \times W \times 4C}.$$
Finally, another pointwise projection compresses the channel dimension:
$$F_{\mathrm{out}} = W_2 F_{\mathrm{cat}} + b_2, \qquad W_2 \in \mathbb{R}^{4C \times C},$$
where $W_2$ and $b_2$ are learnable parameters. A standard convolution would require $O(k^2 C^2)$ parameters per scale, while the depthwise branch uses only $O(k^2 C)$ parameters per scale, keeping the computation lightweight.
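The operator sequence above (LayerNorm, pointwise GELU projection, parallel depthwise convolutions, concatenation, pointwise compression) can be sketched as a PyTorch module. This is a minimal sketch under our own assumptions; the `hw` argument and the placement of the reshapes are illustrative, since the paper specifies only the operator sequence.

```python
import torch
import torch.nn as nn

class MSFBlock(nn.Module):
    """Sketch of a multi-scale fusion block: LayerNorm -> pointwise GELU ->
    parallel 3x3/5x5/7x7 depthwise convs -> concat with the pointwise branch ->
    pointwise compression back to C channels."""
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.pw1 = nn.Linear(channels, channels)
        self.act = nn.GELU()
        # groups=channels makes each conv depthwise: O(k^2 C) params per scale
        self.dw = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in (3, 5, 7)
        ])
        self.pw2 = nn.Linear(4 * channels, channels)

    def forward(self, x, hw):
        # x: (B, L, C) with L = H * W; hw = (H, W)
        H, W = hw
        B, L, C = x.shape
        f = self.act(self.pw1(self.norm(x)))            # channel mixing
        f_sp = f.transpose(1, 2).reshape(B, C, H, W)    # to spatial layout
        branches = [conv(f_sp).reshape(B, C, L).transpose(1, 2) for conv in self.dw]
        f_cat = torch.cat(branches + [f], dim=-1)       # (B, L, 4C)
        return self.pw2(f_cat)                          # compress back to C
```

The `groups=channels` argument is what makes each branch depthwise, giving the $O(k^2 C)$ parameter count noted above.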
The proposed MSF block improves local feature extraction while preserving efficiency. It also strengthens multi-scale context aggregation, which benefits translation quality. In Figure 1, MSF blocks fuse the transformed features $I_w^1$ and the generated features $I_g^0$. A convolution layer further transforms the content feature $I_c^2$ before concatenation. Two additional MSF blocks then process the fused representation and output the final generated feature $I_g^1$.
A stack of OTFormer blocks produces the final output image $I_g$ through a convolution layer that maps the final features to RGB space. Following the multi-scale synthesis of StyleGAN2 [39], the model also predicts RGB outputs at multiple resolutions, upsamples them to the output resolution, and aggregates them by pixel-wise summation to obtain the final image $I_g$.
Discussion: For attention-based correspondence, the computational cost is $O(n^2)$ for computing similarity and weighting. In our optimal transport (OT) based approach, each Sinkhorn iteration also costs $O(n^2)$; with a fixed number of iterations $L$ (e.g., 20), the total cost becomes $O(Ln^2)$. However, due to our multi-scale design, the OT problems are solved at low resolutions (for instance, $n = 256$ for a $16 \times 16$ feature map). This makes $O(Ln^2)$ comparable to or even smaller than the cost of attention at high resolutions (e.g., $n = 4096$ for a $64 \times 64$ map). Furthermore, the lightweight MSF block further reduces overall FLOPs. These factors explain why our method achieves faster inference despite using iterative optimization.
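The arithmetic behind this argument is easy to verify with the numbers quoted in the text:

```python
# Back-of-envelope check: L Sinkhorn iterations at low resolution vs. a single
# dense attention pass at high resolution (operation counts, not wall-clock time).
def pairwise_ops(n):
    """O(n^2) cost of one dense similarity/weighting pass over n tokens."""
    return n * n

L_iters = 20                                  # fixed Sinkhorn iteration count
ot_cost = L_iters * pairwise_ops(16 * 16)     # OT solved on a 16x16 feature map
attn_cost = pairwise_ops(64 * 64)             # attention on a 64x64 feature map
assert ot_cost < attn_cost                    # 1,310,720 vs. 16,777,216 ops
```

Even with 20 iterations, the low-resolution OT pass needs roughly 13x fewer pairwise operations than high-resolution attention.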

3.2. Loss Function

To train the OTFormer end to end, we adopt a combination of loss functions to encourage photorealistic and faithful synthesis.
Correspondence Constraint Loss: To enforce structural alignment between the warped exemplar image and the content image, we minimize the $\ell_1$ distance between them. The warped image is given by
$$I_w = T \times I_e^{\downarrow},$$
where $\downarrow$ denotes downsampling (via bilinear interpolation) to match the spatial resolution of the transport plan $T$. The correspondence loss is
$$\mathcal{L}_w = \| I_w - I_c^{\downarrow} \|_1.$$
This loss is applied to all OTFormer blocks.
Perceptual Loss: We measure content consistency between the generated and content images using a perceptual loss based on pre-trained VGG [40] features:
$$\mathcal{L}_p = \| \phi_l(I_g) - \phi_l(I_c) \|_2,$$
where $\phi_l$ denotes activations from the $l$-th layer.
Style Loss: To encourage stylistic similarity, we minimize the difference between the Gram matrices of generated and exemplar features:
$$\mathcal{L}_s = \| \mathrm{Gram}(\phi_l(I_g)) - \mathrm{Gram}(\phi_l(I_e)) \|_2,$$
where $\mathrm{Gram}(\cdot)$ computes the Gram matrix [41].
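A standard implementation of the Gram matrix flattens the spatial dimensions and takes an inner product over channels. The $1/(CHW)$ normalization below is a common convention, not necessarily the paper's exact scaling.

```python
import torch

def gram(feat):
    """Gram matrix of a feature map: (B, C, H, W) -> (B, C, C), normalized by C*H*W."""
    B, C, H, W = feat.shape
    f = feat.reshape(B, C, H * W)              # flatten spatial dims
    return f @ f.transpose(1, 2) / (C * H * W)  # channel-by-channel correlations
```

Because spatial positions are summed out, the Gram matrix captures texture statistics while discarding layout, which is why it serves as a style descriptor.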
Contextual Loss: We further enhance style fidelity using the contextual loss [42]:
$$\mathcal{L}_{con} = -\log \Big[ \sum_l W_l \, \mathrm{CX}\big( \phi_l(I_e), \phi_l(\hat{I}_t) \big) \Big],$$
where $\mathrm{CX}$ measures contextual similarity and $W_l$ denotes weighting coefficients for VGG-19 layers [40].
Adversarial Loss: A generative adversarial network (GAN) framework encourages photorealism by pitting the generator $G$, i.e., OTFormer, against a discriminator $D$:
$$\mathcal{L}_{adv}^{D} = -\mathbb{E}[h(D(I_t))] - \mathbb{E}[h(-D(I_g))],$$
$$\mathcal{L}_{adv}^{G} = -\mathbb{E}[D(I_g)],$$
where $h(t) = \min(0, -1 + t)$ is the hinge function and $I_t$ denotes the real target image.
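A common hinge-GAN implementation (using the standard SAGAN-style convention $h(t) = \min(0, t - 1)$) looks as follows; this is a generic sketch of the hinge objectives, not the paper's exact training code.

```python
import torch

def d_hinge_loss(d_real, d_fake):
    """Discriminator hinge loss: push D(real) above +1 and D(fake) below -1."""
    h = lambda t: torch.clamp(t - 1.0, max=0.0)   # h(t) = min(0, t - 1)
    return -(h(d_real).mean() + h(-d_fake).mean())

def g_hinge_loss(d_fake):
    """Generator loss: raise the discriminator score of generated images."""
    return -d_fake.mean()
```

When the discriminator already separates real and fake scores beyond the margin, its loss saturates at zero, which stabilizes adversarial training.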

3.3. Implementation Details

We implemented OTFormer using PyTorch 1.8.0 and trained it on an NVIDIA GeForce RTX 4090 workstation. We use learning rates of $2 \times 10^{-4}$ and $4 \times 10^{-4}$ for OTFormer and the discriminator, respectively. During training, we set the batch size to 7, the largest value that fits into the memory of a single NVIDIA GeForce RTX 4090 GPU (24 GB) given the model size and input resolution. This setting allows stable training while maximizing GPU utilization. Following [9], we train the whole model for 60 epochs on the CelebA-HQ dataset and for 100 epochs on the DeepFashion dataset.

4. Experimental Results

In this section, we empirically evaluate OTFormer on two public datasets. We first describe the datasets and evaluation metrics, then compare our method with state-of-the-art approaches both quantitatively and qualitatively. Finally, we present ablation studies to analyze the contribution of each component.

4.1. Datasets and Metrics

We conduct comparison experiments on two public datasets, i.e., CelebA-HQ [43] and DeepFashion [44]. The CelebA-HQ dataset contains 30,000 face images, while the DeepFashion dataset contains 52,712 person images. Following [6,9], we perform the edge-to-image task on CelebA-HQ, where the edge image is obtained by connecting the facial landmarks for facial regions and detecting Canny edges for background regions. The DeepFashion dataset provides paired images of the same identity in different poses; therefore, we perform the pose-to-image task on DeepFashion.
Following previous works [6,7,9], we adopt several metrics to evaluate the quality of the generated results. We evaluate distribution-level quality using the Fréchet inception distance (FID) and the sliced Wasserstein distance (SWD). On the CelebA-HQ dataset, we evaluate style relevance and semantic consistency following [6,9]. The style relevance for color and texture is computed from the relu1_2 and relu2_2 features of a pre-trained VGG-19 network. The semantic consistency is evaluated as the average cosine similarity between high-level VGG-19 features. The data pairs in the DeepFashion dataset have ground-truth images; therefore, we adopt the structural similarity (SSIM) and peak signal-to-noise ratio (PSNR) to measure the quality of the generated results at the pixel level. To better align with human perception, we also adopt the learned perceptual image patch similarity (LPIPS) to evaluate perceptual similarity. In addition, to assess efficiency, we report the number of parameters (Params), floating-point operations (FLOPs), and inference time of the different methods. The inference time is computed as the average runtime over 1000 images on the same machine.
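As a reference point for the pixel-level metrics, PSNR can be computed directly from its definition. This is a generic sketch, not the evaluation script used in the paper; the `data_range` default assumes 8-bit images, and identical inputs (zero MSE) would need a separate guard.

```python
import numpy as np

def psnr(ref, test, data_range=255.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)  # higher is better
```

SSIM and LPIPS involve windowed statistics and a learned network, respectively, so in practice they are computed with established library implementations rather than from scratch.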

4.2. Comparison Results

We compare OTFormer with several state-of-the-art methods, including CoCosNet [6], CoCosNet-v2 [7], UNITE [15], DynaST [8], MIDMs [10], MAT [9], and EBDM [11]. For fair comparison, we report numbers from the original papers or re-evaluate with the official code when available. The codes of MIDMs and EBDM are not publicly available, so we cite their reported numbers.
Quantitative results. The quantitative results on CelebA-HQ and DeepFashion highlight the clear advantages of our method. Our method improves generation quality across multiple metrics, and maintains strong efficiency in parameters and computation.
The quantitative results on the DeepFashion dataset are shown in Table 1. Our method outperforms all other methods across all reported metrics. For distribution-based metrics, our OTFormer achieves the best FID and SWD scores, which indicates superior ability to generate realistic images. For pixel-level metrics, our method achieves the highest SSIM and PSNR, while also obtaining the lowest LPIPS score, which demonstrates better structural similarity and perceptual quality compared to other methods. Furthermore, our method is more efficient than all other methods in terms of parameters, FLOPs, and inference time. Specifically, our method uses 17.4 million parameters, a 62% reduction compared with the previous method with the fewest parameters (CoCosNet-v2, 45.6 million), which suggests that our method is more practical for real-world applications. Table 2 presents the quantitative results on the CelebA-HQ dataset. Our method achieves competitive performance on the FID, texture, and color metrics, while the semantic consistency is comparable to the baselines.
Figure 4 visualizes the FID and texture scores of different methods on the CelebA-HQ dataset. Our method achieves the best FID and texture scores, which indicates superior generation quality and style consistency. At the same time, our method has significantly fewer parameters than all other methods, which highlights the efficiency of our method.
Qualitative results. Figure 5 shows qualitative comparisons on CelebA-HQ. OTFormer produces more photorealistic results while better preserving exemplar style cues, such as hair color and eye details. Competing methods often introduce artifacts or lose fine-grained textures.
On DeepFashion (as shown in Figure 6), OTFormer not only preserves pattern continuity even under occlusion but also infers reasonable textures for invisible parts. For instance, it generates plausible sleeve patterns in occluded areas and maintains consistent shirt patterns near the bottom. More qualitative results can be found in the supplementary document.

4.3. Ablation Study

To verify the effectiveness of our contributions, we carry out ablation studies on the architecture design and loss functions.

4.3.1. Architecture Design

We studied several variants of the model architecture. All models were trained under the same loss functions and training strategy for fair comparison.
The Model with Attention Mechanism (w Att): We adopt a traditional cross-attention mechanism to replace the optimal transport solver to evaluate the effectiveness of the optimal transport formulation in modeling correspondences.
The Model without Multi-Scale Fusion Block (w/o MSF): In this variant, the multi-scale fusion block is replaced by a standard single-scale convolutional layer, which lacks the ability to explicitly capture multi-scale spatial context. This tests the necessity of multi-scale fusion in effectively integrating features.
The Model without Multi-Scale OTFormer Block (w/o MSOTF): This model omits the multi-scale design in the OTFormer blocks, i.e., it only models correspondences at a single low-resolution scale, removing the coarse-to-fine refinement process.
The Model without Multi-Scale Depth Convolution (w/o MSD): Within the multi-scale fusion block, parallel depthwise convolution layers with varying kernel sizes are replaced by a single depthwise convolution layer. This variant assesses the impact of capturing multi-scale receptive fields in feature fusion.
Table 3 and Figure 7 show the quantitative and qualitative results on the CelebA-HQ dataset. We can see that our full model consistently outperforms all ablated variants across most metrics. Notably, replacing the optimal transport solver with a standard attention mechanism (w Att) leads to decreased performance with worse FID and semantic consistency, which suggests the effectiveness of the optimal transport formulation for exemplar-based image translation. This indicates that the optimal transport solver can better model the complex correspondences between the content and exemplar features, leading to improved generation quality. Furthermore, the model without multi-scale OTFormer block (w/o MSOTF) also performs worse than the full model, which indicates that the multi-scale design is important for modeling correspondences at different scales and improving generation quality. Moreover, by comparing the model without multi-scale fusion block (w/o MSF) and the model without multi-scale depth convolution (w/o MSD) with the full model, we can see the effectiveness of the proposed blocks and designs. From Figure 7, we can obtain similar conclusions, which further validate the effectiveness of the proposed architecture design.

4.3.2. Loss Functions

To validate the loss design, we conduct experiments on the CelebA-HQ dataset with different loss combinations; Table 4 and Figure 8 show the quantitative and qualitative results. Table 4 shows that each loss term contributes to generation quality. The model without the correspondence constraint loss (w/o Cor) achieves the best SWD, but all other metrics degrade, indicating that this loss is useful for improving the spatial alignment between the final output and the content image. Without the contextual loss (w/o CX), the model fails to capture the exemplar's style accurately, causing a significant drop in the color and texture metrics. The model without the perceptual loss (w/o Per) obtains the worst SWD score, showing that the perceptual loss is important for preserving semantic consistency between the generated image and the content image. The style loss likewise proves important for style consistency. Overall, all loss terms are effective for improving the final results. The qualitative results in Figure 8 suggest the same conclusions: without the perceptual loss, the model fails to preserve the structures of the content image, and without the style loss, it struggles to reproduce the exemplar's style accurately. In contrast, the full model preserves the content structure and transfers the exemplar style more faithfully.
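To make the roles of these terms concrete, the sketch below shows a common form of the perceptual and Gram-matrix style losses on feature maps, combined as a weighted sum. This is a generic illustration, not the paper's exact objective: the weights, the feature extractor, and the omitted adversarial, contextual, and correspondence terms are all assumptions here.

```python
import numpy as np

def gram(feat):
    """Gram matrix of a (C, H, W) feature map; captures channel co-activation
    statistics, i.e., style, while discarding spatial layout."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def perceptual_loss(f_gen, f_con):
    # pulls generated features toward the content features (structure)
    return float(np.mean((f_gen - f_con) ** 2))

def style_loss(f_gen, f_ex):
    # pulls generated feature statistics toward the exemplar (style)
    return float(np.mean((gram(f_gen) - gram(f_ex)) ** 2))

def total_loss(f_gen, f_con, f_ex, w_per=1.0, w_style=1.0):
    """Illustrative weighted sum of two of the loss terms; the full objective
    also includes adversarial, contextual, and correspondence losses."""
    return w_per * perceptual_loss(f_gen, f_con) + w_style * style_loss(f_gen, f_ex)

# Sanity check: identical features give zero loss.
x = np.ones((2, 3, 3))
assert total_loss(x, x, x) == 0.0
```

Dropping `style_loss` leaves spatial structure intact but untethers the output's texture statistics from the exemplar, matching the degradation observed for w/o Style in Table 4.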

4.4. Limitations and Future Work

Despite its strong performance, OTFormer has limitations. As shown in Figure 9, it can struggle under extreme occlusion or when the exemplar and content have large pose differences, leading to structural distortions. Ambiguous style cues, such as textureless regions, may result in blur. Future work will explore incorporating semantic priors and more robust cost functions to handle such cases. Additionally, extending OTFormer to video translation is a promising direction.

5. Conclusions

In this paper, we propose OTFormer, a new framework for exemplar-based image translation. OTFormer integrates optimal transport into a Transformer architecture by introducing a lightweight multi-scale fusion block, which reduces correspondence misalignment between content and exemplar features. Extensive experiments on two public benchmark datasets validate the effectiveness of the proposed method. OTFormer can not only preserve content structure more consistently, but can also transfer exemplar style attributes more faithfully. Furthermore, our model uses 62% fewer parameters than existing methods and runs faster at inference time, which improves its practical applicability.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/bdcc10040107/s1, Figures S1–S8: More qualitative results.

Author Contributions

Conceptualization, J.Z., X.L. and Y.L.; methodology, Y.L. and J.Z.; software, J.Z.; validation, Y.L.; formal analysis, J.Z. and Y.L.; investigation, J.Z.; data curation, Y.L.; writing—original draft preparation, J.Z.; writing—review and editing, Y.L.; visualization, J.Z. and X.L.; supervision, Y.L.; project administration, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data analyzed in this study were obtained from publicly available datasets. The specific datasets and access information are as follows: DeepFashion (available at https://liuziwei7.github.io/projects/DeepFashion.html, accessed on 26 June 2016) and CelebA-HQ (available at https://mmlab.ie.cuhk.edu.hk/projects/CelebA/CelebAMask_HQ.html, accessed on 14 June 2020). No new raw data were created in this study.

Conflicts of Interest

Author Dr. Xiongzheng Li was employed by Alibaba Group (China). The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Bsoul, A.A.R.; Alshboul, Y. Integrating Convolutional Neural Networks with a Firefly Algorithm for Enhanced Digital Image Forensics. AI 2025, 6, 321. [Google Scholar] [CrossRef]
  2. Zhang, J.; Li, K.; Lai, Y.K.; Yang, J. Pise: Person image synthesis and editing with decoupled gan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Computer Vision Foundation: New York, NY, USA, 2021; pp. 7982–7990. [Google Scholar]
  3. Martini, L.; Iacono, S.; Zolezzi, D.; Vercelli, G.V. Advancing Persistent Character Generation: Comparative Analysis of Fine-Tuning Techniques for Diffusion Models. AI 2024, 5, 1779–1792. [Google Scholar] [CrossRef]
  4. Zhang, L.; Lu, W.; Huang, Y.; Sun, X.; Zhang, H. Unpaired Remote Sensing Image Super-Resolution with Multi-Stage Aggregation Networks. Remote Sens. 2021, 13, 3167. [Google Scholar] [CrossRef]
  5. Zhang, J.; Li, X.; Jia, H.; Li, J.; Su, Z.; Wang, G.; Li, K. LoGAvatar: Local Gaussian Splatting for human avatar modeling from monocular video. Comput.-Aided Des. 2025, 190, 103973. [Google Scholar] [CrossRef]
  6. Zhang, P.; Zhang, B.; Chen, D.; Yuan, L.; Wen, F. Cross-domain correspondence learning for exemplar-based image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Computer Vision Foundation: New York, NY, USA, 2020; pp. 5143–5153. [Google Scholar]
  7. Zhou, X.; Zhang, B.; Zhang, T.; Zhang, P.; Bao, J.; Chen, D.; Zhang, Z.; Wen, F. Cocosnet v2: Full-resolution correspondence learning for image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Computer Vision Foundation: New York, NY, USA, 2021; pp. 11465–11475. [Google Scholar]
  8. Liu, S.; Ye, J.; Ren, S.; Wang, X. Dynast: Dynamic sparse transformer for exemplar-guided image generation. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 72–90. [Google Scholar]
  9. Jiang, C.; Gao, F.; Ma, B.; Lin, Y.; Wang, N.; Xu, G. Masked and Adaptive Transformer for Exemplar Based Image Translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Computer Vision Foundation: New York, NY, USA, 2023; pp. 22418–22427. [Google Scholar]
  10. Seo, J.; Lee, G.; Cho, S.; Lee, J.; Kim, S. Midms: Matching interleaved diffusion models for exemplar-based image translation. In Proceedings of the AAAI Conference on Artificial Intelligence; Association for the Advancement of Artificial Intelligence: Washington, DC, USA, 2023; Volume 37, pp. 2191–2199. [Google Scholar]
  11. Lee, E.; Jeong, S.; Sohn, K. EBDM: Exemplar-guided Image Translation with Brownian-bridge Diffusion Models. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2024. [Google Scholar]
  12. Bhunia, A.K.; Khan, S.; Cholakkal, H.; Anwer, R.M.; Laaksonen, J.; Shah, M.; Khan, F.S. Person image synthesis via denoising diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Computer Vision Foundation: New York, NY, USA, 2023; pp. 5968–5976. [Google Scholar]
  13. Courty, N.; Flamary, R.; Tuia, D.; Rakotomamonjy, A. Optimal transport for domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1853–1865. [Google Scholar] [CrossRef] [PubMed]
  14. Villani, C. Optimal Transport: Old and New; Springer: Berlin/Heidelberg, Germany, 2009; Volume 338. [Google Scholar]
  15. Zhan, F.; Yu, Y.; Cui, K.; Zhang, G.; Lu, S.; Pan, J.; Zhang, C.; Ma, F.; Xie, X.; Miao, C. Unbalanced feature transport for exemplar-based image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Computer Vision Foundation: New York, NY, USA, 2021; pp. 15028–15038. [Google Scholar]
  16. Zhang, J.; Lai, Y.K.; Ma, J.; Li, K. Multi-scale information transport generative adversarial network for human pose transfer. Displays 2024, 84, 102786. [Google Scholar] [CrossRef]
  17. Li, K.; Zhang, J.; Liu, Y.; Lai, Y.K.; Dai, Q. PoNA: Pose-guided non-local attention for human pose transfer. IEEE Trans. Image Process. 2020, 29, 9584–9599. [Google Scholar] [CrossRef]
  18. Zhang, J.; Liu, X.; Li, K. Human pose transfer by adaptive hierarchical deformation. In Computer Graphics Forum; Wiley Online Library: Hoboken, NJ, USA, 2020; Volume 39, pp. 325–337. [Google Scholar]
  19. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  20. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784. [Google Scholar] [CrossRef]
  21. Zhang, J.; Lai, Y.K.; Yang, J.; Li, K. PISE-V: Person image and video synthesis with decoupled GAN. Vis. Comput. 2024, 41, 5781–5798. [Google Scholar] [CrossRef]
  22. Jing, Y.; Yang, Y.; Feng, Z.; Ye, J.; Yu, Y.; Song, M. Neural style transfer: A review. IEEE Trans. Vis. Comput. Graph. 2019, 26, 3365–3385. [Google Scholar] [CrossRef] [PubMed]
  23. Chiu, Y.H.; Chang, K.H.; Lin, I.C. Exemplar-based image colorization with awareness of object co-saliency. Multimed. Tools Appl. 2026, 85, 57. [Google Scholar] [CrossRef]
  24. Li, D.; Deng, H.; Qin, P.; Chen, W.; Feng, G. HyperplaneGAN: A unified consistent translation framework for facial attribute editing. Multimed. Tools Appl. 2025, 84, 24229–24253. [Google Scholar] [CrossRef]
  25. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  26. Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. arXiv 2020, arXiv:2010.02502. [Google Scholar]
  27. Kosugi, S. Leveraging the Powerful Attention of a Pre-trained Diffusion Model for Exemplar-based Image Colorization. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 10059–10069. [Google Scholar] [CrossRef]
  28. Jin, S.; Nam, J.; Kim, J.; Chung, D.; Kim, Y.S.; Park, J.; Chu, H.; Kim, S. AM-Adapter: Appearance Matching Adapter for Exemplar-based Semantic Image Synthesis in-the-Wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision; Computer Vision Foundation: New York, NY, USA, 2025; pp. 17077–17086. [Google Scholar]
  29. Yang, L.; Zhang, Z.; Song, Y.; Hong, S.; Xu, R.; Zhao, Y.; Zhang, W.; Cui, B.; Yang, M.H. Diffusion models: A comprehensive survey of methods and applications. ACM Comput. Surv. 2023, 56, 1–39. [Google Scholar] [CrossRef]
  30. Zhang, J.; Zhu, M.; Zhang, Y.; Zheng, Z.; Liu, Y.; Li, K. SpeechAct: Towards generating whole-body motion from speech. IEEE Trans. Vis. Comput. Graph. 2025, 31, 6737–6750. [Google Scholar] [CrossRef]
  31. Singh, S.P.; Jaggi, M. Model fusion via optimal transport. Adv. Neural Inf. Process. Syst. 2020, 33, 22045–22055. [Google Scholar]
  32. Séjourné, T.; Peyré, G.; Vialard, F.X. Unbalanced optimal transport, from theory to numerics. Handb. Numer. Anal. 2023, 24, 407–471. [Google Scholar]
  33. Pham, K.; Le, K.; Ho, N.; Pham, T.; Bui, H. On unbalanced optimal transport: An analysis of sinkhorn algorithm. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2020; pp. 7673–7682. [Google Scholar]
  34. Sinkhorn, R. Diagonal equivalence to matrices with prescribed row and column sums. Am. Math. Mon. 1967, 74, 402–405. [Google Scholar] [CrossRef]
  35. Peyré, G.; Cuturi, M. Computational optimal transport: With applications to data science. Found. Trends® Mach. Learn. 2019, 11, 355–607. [Google Scholar] [CrossRef]
  36. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar] [CrossRef]
  37. Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  38. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Computer Vision Foundation: New York, NY, USA, 2017; pp. 1251–1258. [Google Scholar]
  39. Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Computer Vision Foundation: New York, NY, USA, 2020; pp. 8110–8119. [Google Scholar]
  40. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  41. Johnson, J.; Alahi, A.; Li, F.-F. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 694–711. [Google Scholar]
  42. Mechrez, R.; Talmi, I.; Zelnik-Manor, L. The contextual loss for image transformation with non-aligned data. In Proceedings of the European Conference on Computer Vision (ECCV); Computer Vision Foundation: New York, NY, USA, 2018; pp. 768–783. [Google Scholar]
  43. Lee, C.H.; Liu, Z.; Wu, L.; Luo, P. MaskGAN: Towards diverse and interactive facial image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Computer Vision Foundation: New York, NY, USA, 2020; pp. 5549–5558. [Google Scholar]
  44. Liu, Z.; Luo, P.; Qiu, S.; Wang, X.; Tang, X. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Computer Vision Foundation: New York, NY, USA, 2016; pp. 1096–1104. [Google Scholar]
Figure 1. Overview of OTFormer. The inputs are a content image and an exemplar image. The model extracts three-level hierarchical features from both inputs. The model then establishes correspondences progressively with optimal transport. The model finally synthesizes the output image with a stack of OTFormer blocks.
Figure 2. Architecture of an OTFormer block. The block takes three inputs: the current generated feature map I g 0 , the content feature I c 1 , and the exemplar feature I e 1 . The core component is a differentiable optimal transport solver, which computes a transport plan T between the generated and exemplar features. The transport plan is used to warp exemplar features into the spatial domain of the generated features, followed by efficient multi-scale fusion with the content features to produce the refined output I g 1 .
Figure 3. Overview of the multi-scale fusion (MSF) block. The block first applies a pointwise layer to mix channel information, then uses three parallel depthwise convolutions with different kernel sizes to extract multi-scale spatial context. Finally, the multi-scale features are concatenated and fused by another pointwise layer.
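The pointwise-then-parallel-depthwise structure of the MSF block can be sketched as follows. This is a minimal numpy illustration of the dataflow only: the depthwise kernels here are fixed averaging filters and the weights are random, standing in for learned parameters.

```python
import numpy as np

def pointwise(x, w):
    """1x1 convolution: mixes channels at each spatial position.
    x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)."""
    return np.einsum('oc,chw->ohw', w, x)

def depthwise(x, k):
    """k x k depthwise convolution with 'same' padding, one filter per channel.
    Averaging kernels are used purely for illustration (real weights are learned)."""
    c, h, w = x.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros_like(x)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = xp[:, i:i + k, j:j + k].mean(axis=(1, 2))
    return out

def msf_block(x, w_in, w_out, kernels=(3, 5, 7)):
    """Multi-scale fusion: pointwise -> parallel depthwise branches -> concat -> pointwise."""
    h = pointwise(x, w_in)                            # mix channels
    branches = [depthwise(h, k) for k in kernels]     # multi-scale spatial context
    return pointwise(np.concatenate(branches, axis=0), w_out)  # fuse branches

# Hypothetical shapes: C channels in and out, three branches concatenated.
rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
x = rng.random((C, H, W))
w_in = rng.random((C, C))
w_out = rng.random((C, 3 * C))
y = msf_block(x, w_in, w_out)   # shape (C, H, W)
```

Because each depthwise branch touches only one channel at a time, the parameter count grows with the number of kernel sizes rather than with the square of the channel count, which is consistent with the block's lightweight design goal.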
Figure 4. Quantitative comparison of OTFormer with other methods on CelebA-HQ. Lower FID means higher image quality, and higher texture score indicates better style similarity. Circle size indicates the number of model parameters.
Figure 5. Qualitative results compared with four state-of-the-art methods on the CelebA-HQ dataset.
Figure 6. Qualitative results compared with four state-of-the-art methods on the DeepFashion dataset.
Figure 7. Qualitative results of ablation studies on architecture design.
Figure 8. Qualitative results of ablation studies on loss functions.
Figure 9. Failure cases of OTFormer.
Table 1. Quantitative comparison on the DeepFashion [44] dataset.
| Model | FID ↓ | SWD ↓ | SSIM ↑ | PSNR ↑ | LPIPS ↓ | Params ↓ | FLOPs ↓ | Time ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CoCosNet [6] | 14.4 | 17.2 | 0.501 | 15.0 | 0.286 | 146.3 | 396.5 | 0.088 |
| CoCosNet-v2 [7] | 13.0 | 16.7 | 0.628 | 17.3 | 0.195 | 45.6 | 394.4 | 0.176 |
| UNITE [15] | 13.1 | 16.7 | - | - | - | 186.7 | 474.3 | 0.105 |
| DynaST [8] | 8.4 | 11.8 | 0.703 | 18.5 | 0.160 | 91.1 | 340.7 | 0.061 |
| MIDMs [10] | 10.9 | 10.1 | - | - | - | - | - | - |
| MAT [9] | 8.2 | 11.0 | 0.661 | 17.6 | 0.181 | 103.6 | 128.0 | 0.043 |
| EBDM [11] | 10.6 | 12.4 | - | - | - | 765.0 | - | - |
| Ours | 6.9 | 9.2 | 0.704 | 18.6 | 0.157 | 17.4 | 50.0 | 0.037 |
Table 2. Quantitative comparison on the CelebA-HQ [43] dataset.
| Model | FID ↓ | SWD ↓ | Texture ↑ | Color ↑ | Semantic ↑ |
| --- | --- | --- | --- | --- | --- |
| CoCosNet [6] | 14.3 | 15.2 | 0.958 | 0.977 | 0.949 |
| CoCosNet-v2 [7] | 13.2 | 14.0 | 0.954 | 0.975 | 0.948 |
| UNITE [15] | 13.2 | 14.9 | 0.952 | 0.966 | 0.950 |
| DynaST [8] | 12.0 | 12.4 | 0.959 | 0.978 | 0.952 |
| MIDMs [10] | 15.7 | 12.3 | 0.962 | 0.982 | 0.915 |
| MAT [9] | 11.5 | 13.2 | 0.965 | 0.986 | 0.949 |
| EBDM [11] | 11.8 | 12.1 | 0.968 | 0.984 | 0.920 |
| Ours | 11.4 | 13.1 | 0.970 | 0.988 | 0.948 |
Table 3. Ablation study on architecture design on CelebA-HQ.
| Model | FID ↓ | SWD ↓ | Texture ↑ | Color ↑ | Semantic ↑ |
| --- | --- | --- | --- | --- | --- |
| w Att | 12.4 | 14.3 | 0.965 | 0.984 | 0.944 |
| w/o MSF | 11.8 | 13.4 | 0.962 | 0.982 | 0.948 |
| w/o MSOTF | 12.1 | 12.9 | 0.968 | 0.987 | 0.946 |
| w/o MSD | 11.6 | 13.7 | 0.987 | 0.968 | 0.947 |
| Ours | 11.4 | 13.1 | 0.970 | 0.988 | 0.948 |
Table 4. Ablation study on loss functions on CelebA-HQ.
| Model | FID ↓ | SWD ↓ | Texture ↑ | Color ↑ | Semantic ↑ |
| --- | --- | --- | --- | --- | --- |
| w/o Cor | 11.9 | 12.1 | 0.987 | 0.969 | 0.947 |
| w/o CX | 11.5 | 12.6 | 0.983 | 0.960 | 0.950 |
| w/o Adv | 16.1 | 16.9 | 0.988 | 0.967 | 0.946 |
| w/o Per | 12.6 | 155.7 | 0.987 | 0.974 | 0.915 |
| w/o Style | 11.8 | 12.2 | 0.979 | 0.960 | 0.950 |
| Ours | 11.4 | 13.1 | 0.970 | 0.988 | 0.948 |