Multi-Scale Optimal Transport Transformer for Efficient Exemplar-Based Image Translation
Abstract
1. Introduction
- Attention-based methods [6,7,8,9] compute correspondences with dense pairwise attention based on local similarity matching. However, local matching considers only the correlation between cross-modal local features and ignores the global structure within each domain, which can produce misaligned features and artifacts such as distorted textures or misplaced styles. Moreover, the quadratic complexity of attention and the large parameter footprint raise computation and memory demands, making real-world deployment harder.
- Diffusion-based models [10,11] achieve strong generation fidelity. However, exemplar-based translation rarely offers precisely aligned exemplar–content pairs, and this data scarcity limits direct supervision of the denoising process [10,11]. Diffusion models also require iterative sampling at inference time, which slows inference and increases computational cost [12]. These properties reduce their appeal in efficiency-critical scenarios.
- We propose OTFormer for exemplar-based translation, which provides a globally coherent and theoretically grounded alternative to local attention matching.
- We design a progressive alignment scheme and a lightweight multi-scale fusion (MSF) block, which support coarse-to-fine style transfer while maintaining strong parameter efficiency.
- We show that OTFormer outperforms GAN-based and diffusion-based methods in visual quality and semantic consistency while offering a better efficiency profile in parameters and inference time.
2. Literature Review
2.1. GAN-Based Methods
2.2. Diffusion-Based Methods
2.3. Optimal Transport-Based Methods
3. Methodology
3.1. Optimal Transport Transformer
3.1.1. Encoders
3.1.2. OTFormer Block
- Optimal transport. We summarize the discrete optimal transport formulation used in the OTFormer block: content and exemplar features define a pairwise cost over local descriptors, and a transport plan between the two feature sets is computed with the Sinkhorn–Knopp iterations (Algorithm 1); the standard entropic formulation is recalled below.
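For reference, the entropic optimal transport problem that the Sinkhorn–Knopp iterations of Algorithm 1 solve takes the following standard form; the notation follows Peyré and Cuturi [74], and the symbols C, μ, ν, and ε are illustrative rather than taken from the paper.

```latex
% Standard entropic OT between discrete marginals \mu (length N) and \nu (length M);
% C is the pairwise cost between content and exemplar features. Requires amsmath.
\mathbf{T}^{\star}
  = \operatorname*{arg\,min}_{\mathbf{T} \in \Pi(\mu, \nu)}
    \langle \mathbf{T}, \mathbf{C} \rangle - \varepsilon H(\mathbf{T}),
\qquad
\Pi(\mu, \nu)
  = \bigl\{ \mathbf{T} \in \mathbb{R}_{+}^{N \times M} :
    \mathbf{T}\mathbf{1}_{M} = \mu,\;
    \mathbf{T}^{\top}\mathbf{1}_{N} = \nu \bigr\},
```

where $H(\mathbf{T}) = -\sum_{ij} T_{ij}(\log T_{ij} - 1)$ is the entropic regularizer and $\varepsilon > 0$ trades sharpness of the plan against the speed and stability of the iterations.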
Algorithm 1. Sinkhorn–Knopp algorithm.
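A minimal NumPy sketch of the Sinkhorn–Knopp iterations referenced in Algorithm 1. The cosine cost, uniform marginals, regularization strength `eps`, and iteration count are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def sinkhorn_knopp(C: np.ndarray, mu: np.ndarray, nu: np.ndarray,
                   eps: float = 0.1, n_iters: int = 100) -> np.ndarray:
    """Entropic OT plan between marginals mu (N,) and nu (M,) for cost C (N, M)."""
    K = np.exp(-C / eps)                 # Gibbs kernel; smaller eps => sharper plan
    u = np.ones_like(mu)                 # (log-domain updates are safer for tiny eps)
    v = np.ones_like(nu)
    for _ in range(n_iters):             # alternate marginal projections
        u = mu / (K @ v + 1e-8)          # enforce row marginals
        v = nu / (K.T @ u + 1e-8)        # enforce column marginals
    return u[:, None] * K * v[None, :]   # plan T = diag(u) K diag(v)

# Toy usage: uniform marginals over random content/exemplar features.
rng = np.random.default_rng(0)
F_c = rng.standard_normal((64, 32))
F_e = rng.standard_normal((48, 32))
C = 1.0 - (F_c / np.linalg.norm(F_c, axis=1, keepdims=True)) \
        @ (F_e / np.linalg.norm(F_e, axis=1, keepdims=True)).T  # cosine cost
T = sinkhorn_knopp(C, np.full(64, 1 / 64), np.full(48, 1 / 48))
print(T.shape, round(T.sum(), 4))  # (64, 48), total mass ~1.0
```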
- Warping and fusion. Using the transport plan and the exemplar features, we warp the exemplar features to obtain style-consistent generated features; a common realization of this warping is sketched below.
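A frequent way to realize such warping from a Sinkhorn plan is a barycentric projection, in which each content position receives a convex combination of exemplar features. The sketch below assumes this projection; it illustrates the operation rather than the paper's exact warping and fusion operators.

```python
import numpy as np

def barycentric_warp(T: np.ndarray, F_e: np.ndarray) -> np.ndarray:
    """Warp exemplar features F_e (M, d) onto content positions using plan T (N, M)."""
    weights = T / (T.sum(axis=1, keepdims=True) + 1e-8)  # row-normalize the plan
    return weights @ F_e                                  # (N, d) warped features

rng = np.random.default_rng(0)
T = rng.random((64, 48)); T /= T.sum()   # toy transport plan with unit total mass
F_e = rng.standard_normal((48, 32))      # toy exemplar features
print(barycentric_warp(T, F_e).shape)    # (64, 32)
```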
- Multi-scale Fusion Block. Exemplar-based translation needs effective feature reuse and strong multi-scale context aggregation. Many existing methods enlarge the network to boost capacity, but such networks can produce unstable feature statistics during training and can miss important multi-scale context or lose information during processing. These issues limit the modeling of complex structural dependencies; a lightweight alternative is sketched below.
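As a rough illustration of the kind of lightweight multi-scale aggregation this motivates, the sketch below fuses parallel depthwise separable convolution branches [77] at several dilation rates with a residual connection. The channel width, branch rates, and fusion layer are assumptions for illustration, not the authors' MSF design.

```python
import torch
import torch.nn as nn

class MultiScaleFusionSketch(nn.Module):
    """Lightweight multi-scale fusion: parallel depthwise-separable branches
    at different dilation rates, fused by a 1x1 convolution (illustrative only)."""
    def __init__(self, channels: int = 64, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=r, dilation=r,
                          groups=channels, bias=False),     # depthwise, dilation r
                nn.Conv2d(channels, channels, 1, bias=False),  # pointwise mixing
                nn.GELU(),
            ) for r in rates
        ])
        self.fuse = nn.Conv2d(channels * len(rates), channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [branch(x) for branch in self.branches]    # multi-scale context
        return x + self.fuse(torch.cat(feats, dim=1))      # residual fusion

x = torch.randn(1, 64, 32, 32)
print(MultiScaleFusionSketch()(x).shape)  # torch.Size([1, 64, 32, 32])
```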
3.2. Loss Function
3.3. Implementation Details
4. Experimental Results
4.1. Datasets and Metrics
4.2. Comparison Results
4.3. Ablation Study
4.3.1. Architecture Design
4.3.2. Loss Functions
4.4. Limitations and Future Work
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Bsoul, A.A.R.; Alshboul, Y. Integrating Convolutional Neural Networks with a Firefly Algorithm for Enhanced Digital Image Forensics. AI 2025, 6, 321. [Google Scholar] [CrossRef]
- Zhang, J.; Li, K.; Lai, Y.K.; Yang, J. Pise: Person image synthesis and editing with decoupled gan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Computer Vision Foundation: New York, NY, USA, 2021; pp. 7982–7990. [Google Scholar]
- Martini, L.; Iacono, S.; Zolezzi, D.; Vercelli, G.V. Advancing Persistent Character Generation: Comparative Analysis of Fine-Tuning Techniques for Diffusion Models. AI 2024, 5, 1779–1792. [Google Scholar] [CrossRef]
- Zhang, L.; Lu, W.; Huang, Y.; Sun, X.; Zhang, H. Unpaired Remote Sensing Image Super-Resolution with Multi-Stage Aggregation Networks. Remote Sens. 2021, 13, 3167. [Google Scholar] [CrossRef]
- Zhang, J.; Li, X.; Jia, H.; Li, J.; Su, Z.; Wang, G.; Li, K. LoGAvatar: Local Gaussian Splatting for human avatar modeling from monocular video. Comput.-Aided Des. 2025, 190, 103973. [Google Scholar] [CrossRef]
- Zhang, P.; Zhang, B.; Chen, D.; Yuan, L.; Wen, F. Cross-domain correspondence learning for exemplar-based image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Computer Vision Foundation: New York, NY, USA, 2020; pp. 5143–5153. [Google Scholar]
- Zhou, X.; Zhang, B.; Zhang, T.; Zhang, P.; Bao, J.; Chen, D.; Zhang, Z.; Wen, F. Cocosnet v2: Full-resolution correspondence learning for image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Computer Vision Foundation: New York, NY, USA, 2021; pp. 11465–11475. [Google Scholar]
- Liu, S.; Ye, J.; Ren, S.; Wang, X. Dynast: Dynamic sparse transformer for exemplar-guided image generation. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 72–90. [Google Scholar]
- Jiang, C.; Gao, F.; Ma, B.; Lin, Y.; Wang, N.; Xu, G. Masked and Adaptive Transformer for Exemplar Based Image Translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Computer Vision Foundation: New York, NY, USA, 2023; pp. 22418–22427. [Google Scholar]
- Seo, J.; Lee, G.; Cho, S.; Lee, J.; Kim, S. Midms: Matching interleaved diffusion models for exemplar-based image translation. In Proceedings of the AAAI Conference on Artificial Intelligence; Association for the Advancement of Artificial Intelligence: Washington, DC, USA, 2023; Volume 37, pp. 2191–2199. [Google Scholar]
- Lee, E.; Jeong, S.; Sohn, K. EBDM: Exemplar-guided Image Translation with Brownian-bridge Diffusion Models. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2024. [Google Scholar]
- Bhunia, A.K.; Khan, S.; Cholakkal, H.; Anwer, R.M.; Laaksonen, J.; Shah, M.; Khan, F.S. Person image synthesis via denoising diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Computer Vision Foundation: New York, NY, USA, 2023; pp. 5968–5976. [Google Scholar]
- Courty, N.; Flamary, R.; Tuia, D.; Rakotomamonjy, A. Optimal transport for domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1853–1865. [Google Scholar] [CrossRef] [PubMed]
- Villani, C. Optimal Transport: Old and New; Springer: Berlin/Heidelberg, Germany, 2009; Volume 338. [Google Scholar]
- Zhan, F.; Yu, Y.; Cui, K.; Zhang, G.; Lu, S.; Pan, J.; Zhang, C.; Ma, F.; Xie, X.; Miao, C. Unbalanced feature transport for exemplar-based image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Computer Vision Foundation: New York, NY, USA, 2021; pp. 15028–15038. [Google Scholar]
- Zhang, J.; Lai, Y.K.; Ma, J.; Li, K. Multi-scale information transport generative adversarial network for human pose transfer. Displays 2024, 84, 102786. [Google Scholar] [CrossRef]
- Li, K.; Zhang, J.; Liu, Y.; Lai, Y.K.; Dai, Q. PoNA: Pose-guided non-local attention for human pose transfer. IEEE Trans. Image Process. 2020, 29, 9584–9599. [Google Scholar] [CrossRef]
- Zhang, J.; Liu, X.; Li, K. Human pose transfer by adaptive hierarchical deformation. In Computer Graphics Forum; Wiley Online Library: Hoboken, NJ, USA, 2020; Volume 39, pp. 325–337. [Google Scholar]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
- Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784. [Google Scholar] [CrossRef]
- Zhang, J.; Lai, Y.K.; Yang, J.; Li, K. PISE-V: Person image and video synthesis with decoupled GAN. Vis. Comput. 2024, 41, 5781–5798. [Google Scholar] [CrossRef]
- Jing, Y.; Yang, Y.; Feng, Z.; Ye, J.; Yu, Y.; Song, M. Neural style transfer: A review. IEEE Trans. Vis. Comput. Graph. 2019, 26, 3365–3385. [Google Scholar] [CrossRef] [PubMed]
- Chiu, Y.H.; Chang, K.H.; Lin, I.C. Exemplar-based image colorization with awareness of object co-saliency. Multimed. Tools Appl. 2026, 85, 57. [Google Scholar] [CrossRef]
- Li, D.; Deng, H.; Qin, P.; Chen, W.; Feng, G. HyperplaneGAN: A unified consistent translation framework for facial attribute editing. Multimed. Tools Appl. 2025, 84, 24229–24253. [Google Scholar] [CrossRef]
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
- Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. arXiv 2020, arXiv:2010.02502. [Google Scholar]
- Kosugi, S. Leveraging the Powerful Attention of a Pre-trained Diffusion Model for Exemplar-based Image Colorization. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 10059–10069. [Google Scholar] [CrossRef]
- Jin, S.; Nam, J.; Kim, J.; Chung, D.; Kim, Y.S.; Park, J.; Chu, H.; Kim, S. AM-Adapter: Appearance Matching Adapter for Exemplar-based Semantic Image Synthesis in-the-Wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision; Computer Vision Foundation: New York, NY, USA, 2025; pp. 17077–17086. [Google Scholar]
- Yang, L.; Zhang, Z.; Song, Y.; Hong, S.; Xu, R.; Zhao, Y.; Zhang, W.; Cui, B.; Yang, M.H. Diffusion models: A comprehensive survey of methods and applications. ACM Comput. Surv. 2023, 56, 1–39. [Google Scholar] [CrossRef]
- Zhang, J.; Zhu, M.; Zhang, Y.; Zheng, Z.; Liu, Y.; Li, K. SpeechAct: Towards generating whole-body motion from speech. IEEE Trans. Vis. Comput. Graph. 2025, 31, 6737–6750. [Google Scholar] [CrossRef]
- Singh, S.P.; Jaggi, M. Model fusion via optimal transport. Adv. Neural Inf. Process. Syst. 2020, 33, 22045–22055. [Google Scholar]
- Séjourné, T.; Peyré, G.; Vialard, F.X. Unbalanced optimal transport, from theory to numerics. Handb. Numer. Anal. 2023, 24, 407–471. [Google Scholar]
- Pham, K.; Le, K.; Ho, N.; Pham, T.; Bui, H. On unbalanced optimal transport: An analysis of sinkhorn algorithm. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2020; pp. 7673–7682. [Google Scholar]
- Sinkhorn, R. Diagonal equivalence to matrices with prescribed row and column sums. Am. Math. Mon. 1967, 74, 402–405. [Google Scholar] [CrossRef]
- Peyré, G.; Cuturi, M. Computational optimal transport: With applications to data science. Found. Trends® Mach. Learn. 2019, 11, 355–607. [Google Scholar] [CrossRef]
- Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar] [CrossRef]
- Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
- Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Computer Vision Foundation: New York, NY, USA, 2017; pp. 1251–1258. [Google Scholar]
- Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Computer Vision Foundation: New York, NY, USA, 2020; pp. 8110–8119. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Johnson, J.; Alahi, A.; Li, F.-F. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 694–711. [Google Scholar]
- Mechrez, R.; Talmi, I.; Zelnik-Manor, L. The contextual loss for image transformation with non-aligned data. In Proceedings of the European Conference on Computer Vision (ECCV); Computer Vision Foundation: New York, NY, USA, 2018; pp. 768–783. [Google Scholar]
- Lee, C.H.; Liu, Z.; Wu, L.; Luo, P. MaskGAN: Towards diverse and interactive facial image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Computer Vision Foundation: New York, NY, USA, 2020; pp. 5549–5558. [Google Scholar]
- Liu, Z.; Luo, P.; Qiu, S.; Wang, X.; Tang, X. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Computer Vision Foundation: New York, NY, USA, 2016; pp. 1096–1104. [Google Scholar]
| Model | FID ↓ | SWD ↓ | SSIM ↑ | PSNR ↑ | LPIPS ↓ | Params (M) ↓ | FLOPs (G) ↓ | Time (s) ↓ |
|---|---|---|---|---|---|---|---|---|
| CoCosNet [6] | 14.4 | 17.2 | 0.501 | 15.0 | 0.286 | 146.3 | 396.5 | 0.088 |
| CoCosNet-v2 [7] | 13.0 | 16.7 | 0.628 | 17.3 | 0.195 | 45.6 | 394.4 | 0.176 |
| UNITE [15] | 13.1 | 16.7 | - | - | - | 186.7 | 474.3 | 0.105 |
| DynaST [8] | 8.4 | 11.8 | 0.703 | 18.5 | 0.160 | 91.1 | 340.7 | 0.061 |
| MIDMs [10] | 10.9 | 10.1 | - | - | - | - | - | - |
| MAT [9] | 8.2 | 11.0 | 0.661 | 17.6 | 0.181 | 103.6 | 128.0 | 0.043 |
| EBDM [11] | 10.6 | 12.4 | - | - | - | 765.0 | - | - |
| Ours | 6.9 | 9.2 | 0.704 | 18.6 | 0.157 | 17.4 | 50.0 | 0.037 |
| Model | FID ↓ | SWD ↓ | Texture ↑ | Color ↑ | Semantic ↑ |
|---|---|---|---|---|---|
| CoCosNet [6] | 14.3 | 15.2 | 0.958 | 0.977 | 0.949 |
| CoCosNet-v2 [7] | 13.2 | 14.0 | 0.954 | 0.975 | 0.948 |
| UNITE [15] | 13.2 | 14.9 | 0.952 | 0.966 | 0.950 |
| DynaST [8] | 12.0 | 12.4 | 0.959 | 0.978 | 0.952 |
| MIDMs [10] | 15.7 | 12.3 | 0.962 | 0.982 | 0.915 |
| MAT [9] | 11.5 | 13.2 | 0.965 | 0.986 | 0.949 |
| EBDM [11] | 11.8 | 12.1 | 0.968 | 0.984 | 0.920 |
| Ours | 11.4 | 13.1 | 0.970 | 0.988 | 0.948 |
| Model | FID ↓ | SWD ↓ | Texture ↑ | Color ↑ | Semantic ↑ |
|---|---|---|---|---|---|
| w/ Att | 12.4 | 14.3 | 0.965 | 0.984 | 0.944 |
| w/o MSF | 11.8 | 13.4 | 0.962 | 0.982 | 0.948 |
| w/o MSOTF | 12.1 | 12.9 | 0.968 | 0.987 | 0.946 |
| w/o MSD | 11.6 | 13.7 | 0.987 | 0.968 | 0.947 |
| Ours | 11.4 | 13.1 | 0.970 | 0.988 | 0.948 |
| Model | FID ↓ | SWD ↓ | Texture ↑ | Color ↑ | Semantic ↑ |
|---|---|---|---|---|---|
| w/o Cor | 11.9 | 12.1 | 0.987 | 0.969 | 0.947 |
| w/o CX | 11.5 | 12.6 | 0.983 | 0.960 | 0.950 |
| w/o Adv | 16.1 | 16.9 | 0.988 | 0.967 | 0.946 |
| w/o Per | 12.6 | 155.7 | 0.987 | 0.974 | 0.915 |
| w/o Style | 11.8 | 12.2 | 0.979 | 0.960 | 0.950 |
| Ours | 11.4 | 13.1 | 0.970 | 0.988 | 0.948 |