Noise Improves Multimodal Machine Translation: Rethinking the Role of Visual Context
Abstract
1. Introduction
2. Background
2.1. Machine Translation
2.2. Multimodal Machine Translation
3. Revisiting Existing Methods
3.1. Detailed Comparison
3.1.1. Setup
3.1.2. Comparison of Architecture
| # | Method | EN→DE Test16 | EN→DE Test17 | EN→DE MSCOCO | EN→FR Test16 | EN→FR Test17 | EN→FR MSCOCO |
|---|---|---|---|---|---|---|---|
| | Image-Must MMT Systems | | | | | | |
| 1 | Del+obj [28] | 38.00 | - | - | 59.80 | - | - |
| 2 | DCCN [29] | 39.70 | 31.00 | 26.70 | 61.20 | 54.30 | 45.40 |
| 3 | Gated Fusion [15] | 41.96 | 33.59 | 29.04 | 61.69 | 54.85 | 44.86 |
| 4 | Selective Attn [30] | 41.84 | 34.32 | 30.22 | 62.24 | 54.52 | 44.82 |
| 5 | PLUVR [12] | 40.30 | 33.45 | 30.28 | 61.31 | 53.15 | 43.65 |
| 6 | ITP [31] | 41.77 | 34.58 | 30.61 | - | - | - |
| 7 | PVP [11] | 42.30 | - | - | 65.50 | - | - |
| 8 | VTLM [32] | 43.30 | 37.60 | 35.10 | - | - | - |
| 9 | VL-T5 [18] | 45.40 | 41.99 | 36.96 | 65.42 | 59.92 | 51.98 |
| | Image-Free MMT Systems | | | | | | |
| 10 | Transformer-Base [19] | 38.33 | 31.36 | 27.54 | 60.60 | 53.16 | 42.83 |
| 11 | ImagiT [33] | 38.50 | 32.10 | 28.70 | 59.70 | 52.40 | 45.30 |
| 12 | Transformer-Small [19] | 39.68 | 32.99 | 28.50 | 61.31 | 53.85 | 44.03 |
| 13 | Transformer-Tiny [19] | 41.02 | 33.36 | 29.88 | 61.80 | 53.46 | 44.52 |
| 14 | UVR-NMT [10] | 40.79 | 32.16 | 29.02 | 61.00 | 53.20 | 43.71 |
| 15 | RMMT [15] | 41.45 | 32.94 | 30.01 | 62.12 | 54.39 | 44.52 |
| 16 | IKD-MMT [16] | 41.28 | 33.83 | 30.17 | 62.53 | 54.84 | - |
| 17 | VALHALLA [34] | 41.90 | 34.00 | 30.40 | 62.30 | 55.10 | 45.70 |
| 18 | T5 [18] | 44.81 | 42.25 | 37.37 | 65.63 | 59.82 | 52.06 |
3.1.3. Comparison of Training Data
3.1.4. Comparison of the Usage of Vision
3.2. Discussion
4. Visual Probing Method
4.1. Visual Type Regularization
- Noise Image (Noise): For simplicity of implementation, we sample Gaussian noise with the same dimensions as the first two types of visual features (36 × 4 bounding-box coordinates and 36 × 2048 RoI features) and use it as the visual input (see the sketch below).
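For concreteness, the sketch below shows one way such noise features could be drawn. It is a minimal illustration under stated assumptions: the paper specifies only the feature dimensions, so the function name and the use of unit-variance Gaussian noise are ours.

```python
import torch

def sample_noise_visual_features(batch_size: int,
                                 num_regions: int = 36,
                                 feat_dim: int = 2048,
                                 box_dim: int = 4):
    """Draw Gaussian noise shaped like Faster R-CNN region features:
    (36 x 2048) RoI features and (36 x 4) bounding-box coordinates,
    matching the dimensions given in Section 4.1. The unit-variance
    Gaussian is an illustrative assumption."""
    roi_feats = torch.randn(batch_size, num_regions, feat_dim)
    boxes = torch.randn(batch_size, num_regions, box_dim)
    return roi_feats, boxes

# Example: a batch of 8 "noise images" to pair with the text inputs.
noise_feats, noise_boxes = sample_noise_visual_features(batch_size=8)
```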
4.2. Visual Contribution Probing
4.3. Modality Relationship Probing
5. Results
5.1. Setup
5.2. Main Results
- (1) When text features are strong, the test results are nearly identical regardless of whether images are used, as discussed in Section 3.1. Applying the baseline consistency learning method yields results comparable to the original VL-T5 (a gain of only 0.03 BLEU). When consistency learning is applied to models tested with text only (7), the results are very close to those of visual type regularization (4 and 5), with differences below 0.04 BLEU. This suggests that, when text features are dominant, consistency learning achieves results comparable to visual type regularization (a sketch of the consistency objective is given after this list).
- (2) When real image features are available at test time, incorporating additional visual features offers limited benefit. Even when non-original data (e.g., noise or generated images) are used during training (models 4 and 5), testing with the original image yields only a modest improvement (around 0.2 BLEU). Conversely, using noise or generated images for consistency learning slightly decreases performance (around 0.02 BLEU). This suggests that, when real image features are available, additional visual features do not significantly improve performance and consistency learning is less effective.
- (3) The choice of visual features can significantly affect model performance, and the distribution of visual features should be considered carefully during training. Combining different visual features (8, 9, 10, 11, and 12) leads either to a drop in performance (up to −0.66 BLEU) or to minimal gains (+0.1 BLEU), primarily because of the large variation in feature distributions. In most cases, these combinations perform worse than using real images during testing.
- (4) Consistency learning is an effective method for improving the performance of image-free MMT systems, even without real images during training or testing. Specifically, the use of (Image, Generated)/Generated data pairs results in performance improvements, consistent with findings from prior image-free MMT systems [16,34]. Furthermore, our study shows that performance can be further enhanced even when real images are not used for training (e.g., using (Noise, Noise)/Noise for training).
- (5) Using visual context as a constraint in consistency learning can improve translation quality on in-domain data, but may hurt performance on out-of-domain datasets. As shown in Table 2, various combinations of visual features and consistency learning techniques improve in-domain performance, but result in decreased performance on out-of-domain data (e.g., MSCOCO). This suggests that consistency learning with visual regularization helps the model adapt to in-domain data, but struggles with more challenging out-of-domain instances (e.g., ambiguous verbs), as seen in MSCOCO, which is consistent with prior reports [34].
- (6) Although the differences in BLEU and BERTScore across various visual input types may appear numerically minor, their remarkable consistency across multiple test sets, evaluation metrics, and training–testing configurations constitutes a meaningful observation. This invariance indicates that the model’s behavior is largely insensitive to the semantic content of visual signals, thereby reinforcing our conclusion that visual inputs function primarily as regularization factors rather than as semantically informative modalities in strong MMT architectures.
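For reference, the consistency-learning objective discussed above can be written as a symmetric KL term between the output distributions produced with two different visual inputs, in the spirit of R-Drop [38]. The sketch below is a minimal illustration; the loss weighting, function names, and model interface are assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_a: torch.Tensor, logits_b: torch.Tensor) -> torch.Tensor:
    """Symmetric KL divergence between the decoder output distributions
    obtained for the same sentence with two different visual inputs
    (e.g., real image vs. noise)."""
    log_p = F.log_softmax(logits_a, dim=-1)
    log_q = F.log_softmax(logits_b, dim=-1)
    p, q = log_p.exp(), log_q.exp()
    kl_pq = F.kl_div(log_q, p, reduction="batchmean")  # KL(p || q)
    kl_qp = F.kl_div(log_p, q, reduction="batchmean")  # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)

# Hypothetical training step for an (Image, Noise)/Image configuration:
#   logits_img   = model(text, image_feats)
#   logits_noise = model(text, noise_feats)
#   loss = ce(logits_img, target) + ce(logits_noise, target) \
#          + alpha * consistency_loss(logits_img, logits_noise)
```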
6. Analysis
6.1. Visual Contribution Analysis
6.2. Modality Relationship Analysis
- V–T Embedding: The left part of the figure shows that the model is unable to effectively use visual information to assist the text representation from a semantic perspective. The visual–text contrast loss is similar across the four methods (ranging from about 103 to 103.2), indicating that the models fail to map the representations of the different modalities into the same encoding space. This behavior is consistent with results observed in many vision–language models, such as ViLBERT [42] and LXMERT [43]. (A minimal sketch of such a contrastive probe is given after this list.)
- Visual Embedding: The middle part of the figure illustrates that noise and visual information serve two key roles: model regularization and visual representation optimization. The consistency constraints allow the model to reduce the differences between visual representations, even when the visual types differ (e.g., Image vs. Noise and Generated). However, the model fails to optimize the representational distance when purely noisy visual input is used (d), as the contrast loss increases with training and eventually plateaus.
- Text Embedding: The right part of the figure shows that visual inputs, including noise, act as regularizers during consistency training. When a real image is used during consistency training, the model significantly improves its text representation. However, when other visual types, such as noise or generated images, are used for visual regularization, the resulting textual representation is nearly identical to that of the model trained with the noise consistency constraint.
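The modality-relationship probe above compares embeddings through a contrastive loss. A minimal InfoNCE-style formulation, in the spirit of noise-contrastive estimation [40], is sketched below; the pooling, temperature, and symmetric formulation are illustrative assumptions rather than the exact probe used in the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_probe_loss(visual_emb: torch.Tensor,
                           text_emb: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style contrast between pooled visual and text embeddings:
    matching (visual, text) pairs in a batch are positives, all other
    in-batch pairs serve as negatives."""
    v = F.normalize(visual_emb, dim=-1)   # (B, d)
    t = F.normalize(text_emb, dim=-1)     # (B, d)
    logits = v @ t.t() / temperature      # (B, B) cosine-similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric cross-entropy: visual-to-text and text-to-visual directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```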
6.3. Effect of Visual Inputs
6.4. Case Study
7. Discussion
- (1) Future research in the field of MMT should focus on testing in real-world settings. Current methodologies are built on a weak baseline (Transformer-Tiny) and low-resource text settings, in which visual information has been shown to play a significant role by providing complementary semantic information not available in the text. However, future research should investigate whether visual information is still necessary in real-world settings, where the text is sufficient and the baseline is more robust.
- (2) Visual information serves as a regularization method in MMT and can even be replaced by random noise when training on the Multi30K dataset. Real images and generated images have similar regularization effects (as discussed in Section 4). Even when the regularization source differs, a model trained with (Noise, Noise) still produces translation results comparable to those of a model trained on regular visual data. This finding aligns with image-free MMT methods [16,44]. Exploring simpler constraints to exploit such features (visual or noise) is a potential direction for future research.
- (3) Stronger translation metrics should be used. Many existing MMT methodologies [15,30] do not comprehensively analyze the semantic similarity produced by different models using the most recent semantic similarity measures [41]. We observe that the text embeddings of models are nearly identical even though their BLEU scores vary by up to 0.4 points. This implies that the reported improvements may only produce models that are better aligned with the BLEU metric, rather than models that truly enhance translation quality (see the example after this list).
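As a practical illustration of point (3), semantic similarity can be reported alongside BLEU using the bert_score package [41]. The snippet below is a usage sketch; the example sentences and the default language/model choice are assumptions.

```python
# pip install bert-score
from bert_score import score

candidates = ["Ein Mann fährt mit dem Fahrrad die Straße entlang."]  # system output (hypothetical)
references = ["Ein Mann fährt auf der Straße Fahrrad."]              # reference translation (hypothetical)

# Returns per-sentence precision, recall, and F1 tensors.
P, R, F1 = score(candidates, references, lang="de")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```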
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Deng, J.; Dong, W.; Socher, R.; Li, L.; Li, K.; Li, F. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), Miami, FL, USA, 20–25 June 2009; IEEE Computer Society: Washington, DC, USA, 2009; pp. 248–255. [Google Scholar] [CrossRef]
- Krizhevsky, A.; Hinton, G. Learning multiple layers of features from tiny images. Technical Report, 8 April 2009. [Google Scholar]
- Callison-Burch, C.; Koehn, P.; Monz, C.; Schroeder, J. Findings of the 2009 Workshop on Statistical Machine Translation. In Proceedings of the Fourth Workshop on Statistical Machine Translation, Athens, Greece, 30–31 March 2009; pp. 1–28. [Google Scholar]
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, Montreal, QC, Canada, 8–13 December 2014; pp. 3104–3112. [Google Scholar]
- Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; Parikh, D. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; IEEE Computer Society: Washington, DC, USA, 2017; pp. 6325–6334. [Google Scholar] [CrossRef]
- Zellers, R.; Bisk, Y.; Farhadi, A.; Choi, Y. From Recognition to Cognition: Visual Commonsense Reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; Computer Vision Foundation/IEEE: Washington, DC, USA, 2019; pp. 6720–6731. [Google Scholar] [CrossRef]
- Futeral, M.; Schmid, C.; Sagot, B.; Bawden, R. Towards Zero-Shot Multimodal Machine Translation. arXiv 2024, arXiv:2407.13579. [Google Scholar]
- Elliott, D.; Frank, S.; Sima’an, K.; Specia, L. Multi30K: Multilingual English-German Image Descriptions. In Proceedings of the 5th Workshop on Vision and Language, Berlin, Germany, 12 August 2016; pp. 70–74. [Google Scholar] [CrossRef]
- Barrault, L.; Bougares, F.; Specia, L.; Lala, C.; Elliott, D.; Frank, S. Findings of the Third Shared Task on Multimodal Machine Translation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, Brussels, Belgium, 31 October–1 November 2018; pp. 304–323. [Google Scholar] [CrossRef]
- Zhang, Z.; Chen, K.; Wang, R.; Utiyama, M.; Sumita, E.; Li, Z.; Zhao, H. Neural Machine Translation with Universal Visual Representation. In Proceedings of the 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
- Huang, P.Y.; Hu, J.; Chang, X.; Hauptmann, A. Unsupervised Multimodal Neural Machine Translation with Pseudo Visual Pivoting. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8226–8237. [Google Scholar] [CrossRef]
- Fang, Q.; Feng, Y. Neural Machine Translation with Phrase-Level Universal Visual Representations. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 5687–5698. [Google Scholar] [CrossRef]
- Caglayan, O.; Madhyastha, P.; Specia, L.; Barrault, L. Probing the Need for Visual Context in Multimodal Machine Translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4159–4170. [Google Scholar] [CrossRef]
- Li, J.; Ataman, D.; Sennrich, R. Vision Matters When It Should: Sanity Checking Multimodal Machine Translation Models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online/Punta Cana, Dominican Republic, 7–11 November 2021; pp. 8556–8562. [Google Scholar] [CrossRef]
- Wu, Z.; Kong, L.; Bi, W.; Li, X.; Kao, B. Good for Misconceived Reasons: An Empirical Revisiting on the Need for Visual Context in Multimodal Machine Translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; pp. 6153–6166. [Google Scholar] [CrossRef]
- Peng, R.; Zeng, Y.; Zhao, J. Distill The Image to Nowhere: Inversion Knowledge Distillation for Multimodal Machine Translation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 2379–2390. [Google Scholar] [CrossRef]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 140:1–140:67. [Google Scholar]
- Cho, J.; Lei, J.; Tan, H.; Bansal, M. Unifying Vision-and-Language Tasks via Text Generation. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, Virtual Event, 18–24 July 2021; Proceedings of Machine Learning Research. Volume 139, pp. 1931–1942. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
- Yao, S.; Wan, X. Multimodal Transformer for Multimodal Machine Translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 4346–4350. [Google Scholar] [CrossRef]
- Futeral, M.; Schmid, C.; Laptev, I.; Sagot, B.; Bawden, R. Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 5394–5413. [Google Scholar] [CrossRef]
- Kukacka, J.; Golkov, V.; Cremers, D. Regularization for Deep Learning: A Taxonomy. arXiv 2017, arXiv:1710.10686. [Google Scholar]
- Bowen, B.; Vijayan, V.; Grigsby, S.; Anderson, T.; Gwinnup, J. Detecting concrete visual tokens for Multimodal Machine Translation. In Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), Chicago, IL, USA, 30 September–2 October 2024; pp. 29–38. [Google Scholar]
- Jabri, A.; Joulin, A.; van der Maaten, L. Revisiting Visual Question Answering Baselines. In Lecture Notes in Computer Science, Proceedings of the ECCV (8), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; Volume 9912, pp. 727–739. [Google Scholar]
- Young, P.; Lai, A.; Hodosh, M.; Hockenmaier, J. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2014, 2, 67–78. [Google Scholar] [CrossRef]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar] [CrossRef]
- Post, M. A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium, 31 October–1 November 2018; pp. 186–191. [Google Scholar] [CrossRef]
- Ive, J.; Madhyastha, P.; Specia, L. Distilling Translations with Visual Awareness. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 6525–6538. [Google Scholar] [CrossRef]
- Lin, H.; Meng, F.; Su, J.; Yin, Y.; Yang, Z.; Ge, Y.; Zhou, J.; Luo, J. Dynamic Context-guided Capsule Network for Multimodal Machine Translation. In Proceedings of the MM ’20: The 28th ACM International Conference on Multimedia, Virtual Event/Seattle, WA, USA, 12–16 October 2020; pp. 1320–1329. [Google Scholar] [CrossRef]
- Li, B.; Lv, C.; Zhou, Z.; Zhou, T.; Xiao, T.; Ma, A.; Zhu, J. On Vision Features in Multimodal Machine Translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 6327–6337. [Google Scholar] [CrossRef]
- Ji, B.; Zhang, T.; Zou, Y.; Hu, B.; Shen, S. Increasing Visual Awareness in Multimodal Neural Machine Translation from an Information Theoretic Perspective. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 6755–6764. [Google Scholar] [CrossRef]
- Caglayan, O.; Kuyu, M.; Amac, M.S.; Madhyastha, P.; Erdem, E.; Erdem, A.; Specia, L. Cross-lingual Visual Pre-training for Multimodal Machine Translation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, 16 April 2021; pp. 1317–1324. [Google Scholar] [CrossRef]
- Long, Q.; Wang, M.; Li, L. Generative Imagination Elevates Machine Translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 5738–5748. [Google Scholar] [CrossRef]
- Li, Y.; Panda, R.; Kim, Y.; Chen, C.R.; Feris, R.; Cox, D.D.; Vasconcelos, N. VALHALLA: Visual Hallucination for Machine Translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 5206–5216. [Google Scholar] [CrossRef]
- Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, Montreal, QC, Canada, 7–12 December 2015; pp. 91–99. [Google Scholar]
- Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.; Shamma, D.A.; et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. Int. J. Comput. Vis. 2017, 123, 32–73. [Google Scholar] [CrossRef]
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the CVPR, New Orleans, LA, USA, 18–24 June 2022; pp. 10674–10685. [Google Scholar]
- Liang, X.; Wu, L.; Li, J.; Wang, Y.; Meng, Q.; Qin, T.; Chen, W.; Zhang, M.; Liu, T. R-Drop: Regularized Dropout for Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, Virtual, 6–14 December 2021; pp. 10890–10905. [Google Scholar]
- Pan, X.; Wang, M.; Wu, L.; Li, L. Contrastive Learning for Many-to-many Multilingual Neural Machine Translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Virtual Event, 1–6 August 2021; pp. 244–258. [Google Scholar] [CrossRef]
- Gutmann, M.; Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the AISTATS, Sardinia, Italy, 13–15 May 2010; pp. 297–304. [Google Scholar]
- Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. In Proceedings of the 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, Vancouver, BC, Canada, 8–14 December 2019; pp. 13–23. [Google Scholar]
- Tan, H.; Bansal, M. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 5100–5111. [Google Scholar] [CrossRef]
- Elliott, D.; Kádár, Á. Imagination Improves Multimodal Translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Taipei, Taiwan, 1 December 2017; pp. 130–141. [Google Scholar]

| # | Model | Test16 BLEU | Test16 BERTScore | Test17 BLEU | Test17 BERTScore | MSCOCO BLEU | MSCOCO BERTScore | Average BLEU | Average BERTScore |
|---|---|---|---|---|---|---|---|---|---|
| | Testing with the original matching images | | | | | | | | |
| 1 | VL-T5 [18] | 45.40 ± 0.04 | 0.8298 | 41.99 ± 0.29 | 0.8151 | 36.96 ± 0.07 | 0.8152 | 41.45 | 0.8200 |
| 2 | +consistency learning | 45.21 ± 0.18 | 0.8294 | 42.08 ± 0.23 | 0.8149 | 37.15 ± 0.12 | 0.8156 | 41.48 (+0.03) | 0.8200 (+0.0 × 10⁻³) |
| 3 | +(Noise, Generated)/Image | 45.12 ± 0.19 | 0.8296 | 41.89 ± 0.12 | 0.8151 | 37.27 ± 0.34 | 0.8155 | 41.43 (−0.02) | 0.8201 (+0.1 × 10⁻³) |
| 4 | +(Image, Generated)/Image | 45.39 ± 0.22 | 0.8297 | 42.05 ± 0.28 | 0.8152 | 37.43 ± 0.28 | 0.8159 | 41.62 (+0.17) | 0.8203 (+0.3 × 10⁻³) |
| 5 | +(Image, Noise)/Image | 45.39 ± 0.21 | 0.8294 | 42.05 ± 0.28 | 0.8152 | 37.48 ± 0.43 | 0.8160 | 41.64 (+0.19) | 0.8202 (+0.2 × 10⁻³) |
| | Testing without the original matching images | | | | | | | | |
| 6 | +w/o Visual | 44.81 ± 0.05 | 0.8294 | 42.25 ± 0.34 | 0.8157 | 37.37 ± 0.12 | 0.8158 | 41.48 (+0.03) | 0.8203 (+0.3 × 10⁻³) |
| 7 | +w/o Visual+consistency learning | 45.12 ± 0.13 | 0.8295 | 42.04 ± 0.15 | 0.8159 | 37.64 ± 0.17 | 0.8163 | 41.60 (+0.15) | 0.8206 (+0.6 × 10⁻³) |
| 8 | +(Image, Generated)/Noise | 44.84 ± 0.34 | 0.8286 | 41.04 ± 1.19 | 0.8143 | 36.51 ± 0.86 | 0.8139 | 40.79 (−0.66) | 0.8189 (−1.1 × 10⁻³) |
| 9 | +(Noise, Generated)/Noise | 44.87 ± 0.05 | 0.8293 | 41.87 ± 0.04 | 0.8153 | 37.12 ± 0.44 | 0.8156 | 41.29 (−0.16) | 0.8201 (+0.1 × 10⁻³) |
| 10 | +(Image, Noise)/Noise | 45.08 ± 0.15 | 0.8289 | 42.02 ± 0.19 | 0.8152 | 37.20 ± 0.40 | 0.8156 | 41.43 (−0.02) | 0.8199 (−0.1 × 10⁻³) |
| 11 | +(Image, Noise)/Generated | 45.12 ± 0.08 | 0.8292 | 42.06 ± 0.17 | 0.8153 | 37.14 ± 0.11 | 0.8148 | 41.44 (−0.01) | 0.8198 (−0.2 × 10⁻³) |
| 12 | +(Noise, Generated)/Generated | 45.22 ± 0.04 | 0.8294 | 42.83 ± 0.20 | 0.8152 | 37.33 ± 0.30 | 0.8155 | 41.46 (+0.01) | 0.8200 (+0.0 × 10⁻³) |
| 13 | +(Noise, Noise)/Noise | 45.26 ± 0.25 | 0.8293 | 42.15 ± 0.25 | 0.8153 | 37.36 ± 0.21 | 0.8157 | 41.59 (+0.14) | 0.8201 (+0.1 × 10⁻³) |
| 14 | +(Image, Generated)/Generated | 45.42 ± 0.19 | 0.8296 | 42.11 ± 0.10 | 0.8152 | 37.53 ± 0.24 | 0.8162 | 41.69 (+0.24) | 0.8203 (+0.3 × 10⁻³) |

| Model | Newstest14 |
|---|---|
| VL-T5 | 23.41 ± 0.05 |
| +(Image, Generated)/Noise | 22.10 ± 1.25 |
| +(Noise, Noise)/Noise | 23.19 ± 0.12 |
| +(Image, Generated)/Generated | 23.23 ± 0.10 |
| +(Noise, Generated)/Generated | 23.26 ± 0.02 |
| +(Noise, Generated)/Noise | 23.28 ± 0.03 |
| +w/o Visual | 23.44 ± 0.07 |
| +w/o Visual+consistency learning | 23.20 ± 0.10 |

| Training with | Test16 | Test17 | MSCOCO | Avg. |
|---|---|---|---|---|
| Image | 44.54 ± 0.54 | 41.00 ± 1.40 | 36.66 ± 0.55 | 40.73 |
| Generated | 44.39 ± 0.58 | 41.10 ± 1.17 | 36.87 ± 0.49 | 40.79 |
| Noise | 45.12 ± 0.13 | 42.29 ± 0.37 | 37.21 ± 0.48 | 41.54 |

| Testing with | Test16 | Test17 | MSCOCO | Avg. |
|---|---|---|---|---|
| Image | 45.15 ± 0.08 | 42.09 ± 0.41 | 37.02 ± 0.23 | 41.42 |
| Generated | 45.11 ± 0.07 | 42.07 ± 0.20 | 37.00 ± 0.29 | 41.39 |
| Noise | 45.12 ± 0.13 | 42.29 ± 0.37 | 37.21 ± 0.48 | 41.54 |