Article

Noise Improves Multimodal Machine Translation: Rethinking the Role of Visual Context

School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(11), 1874; https://doi.org/10.3390/math13111874
Submission received: 13 March 2025 / Revised: 16 May 2025 / Accepted: 23 May 2025 / Published: 3 June 2025
(This article belongs to the Special Issue Artificial Intelligence: Deep Learning and Computer Vision)

Abstract

Multimodal Machine Translation (MMT) has long been assumed to outperform traditional text-only MT by leveraging visual information. However, recent studies challenge this assumption, showing that MMT models perform similarly even when tested without images or with mismatched images. This raises fundamental questions about the actual utility of visual information in MMT, which this work aims to investigate. We first revisit commonly used image-must and image-free MMT approaches, identifying that suboptimal performance may stem from insufficiently robust baseline models. To further examine the role of visual information, we propose a novel visual type regularization method and introduce two probing tasks—Visual Contribution Probing and Modality Relationship Probing—to analyze whether and how visual features influence a strong MMT model. Surprisingly, our findings on a mainstream dataset indicate that the gains from visual information are marginal. We attribute this improvement primarily to a regularization effect, which can be replicated using random noise. Our results suggest that the MMT community should critically re-evaluate baseline models, evaluation metrics, and dataset design to advance multimodal learning meaningfully.

1. Introduction

With the rise of large-scale multimodal data and advances in computing power, research has expanded beyond unimodal tasks such as image classification [1,2] and machine translation [3,4] to more complex multimodal tasks, including visual question answering [5] and visual commonsense reasoning [6]. These tasks require an integrated understanding of visual, textual, and other modalities.
Multimodal Machine Translation (MMT) [7,8,9] integrates textual and visual information to enhance translation quality, typically requiring both a source text and a corresponding image (Figure 1). As in other multimodal tasks, MMT research assumes that additional modalities, such as images, contribute to translation performance [10,11,12]—either by providing contextual information unavailable in text [13,14] or by acting as a regularization signal during training [15,16]. However, a key paradox persists: empirical results often show comparable translation performance whether models are trained with text alone or with image–text pairs. This raises fundamental questions about the actual contribution of visual information, necessitating a critical reassessment of current MMT approaches.
To this end, we systematically compare and analyze existing image-must and image-free MMT systems. Our analysis highlights the limitations of prior baselines, prompting us to propose a more robust alternative—pre-trained T5 [17] and VL-T5 [18]—to better assess the necessity of visual information (Section 3). To further investigate the role of visual input, we introduce visual type regularization, incorporating three types of visual signals—ranging from semantically relevant to irrelevant (Figure 1 and Figure 2)—alongside two probing tasks: (1) Visual Contribution Probing, which evaluates the impact of visual information on MMT models (Section 4.2), and (2) Modality Relationship Probing, which analyzes the representation effect by visualizing contrastive loss dynamics between modalities during training (Section 4.3).
Our probing results demonstrate that adding real images, generated images, or even noise can serve as a regularization mechanism when using a strong baseline such as VL-T5 [18] (Section 5.2). While our visual type regularization approach yields slight improvements in BLEU scores, these gains appear to stem from domain biases in consistency learning (Section 6.1) rather than meaningful visual context, which has minimal impact on sentence representation (Section 6.2). Notably, we find that introducing noise during training achieves comparable improvements to real-image-based training, suggesting that visual information functions more as a regularization signal than a source of semantic enhancement (Section 6.2 and Section 6.4). These findings call for a re-evaluation of MMT research priorities, including baseline selection, evaluation metrics, and dataset design. We hope this study fosters a more critical approach to multimodal learning, particularly in real-world settings, and inspires future advancements in the field.

2. Background

2.1. Machine Translation

Machine translation (MT) aims to convert a source-language sentence into a target-language sentence. Given a parallel corpus $\{(x_i, y_i)\}_{i=1}^{N}$, the machine translation problem naturally reduces to likelihood maximization, as follows:
\[ \mathcal{L}_{\mathrm{MT}} = \frac{1}{N} \sum_{i=1}^{N} \log p\left(y_i \mid x_i\right), \]
where $x$ and $y$ denote the source sentence and the reference in two different languages. Current MT systems are built on the attention-based encoder–decoder architecture [19]. Final test results are obtained by averaging the last ten checkpoints under a fixed random seed.
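In practice, this likelihood is maximized by minimizing the equivalent token-level negative log-likelihood. The sketch below shows one way this is commonly implemented for encoder–decoder models; the tensor shapes and the pad_id argument are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def mt_nll_loss(logits: torch.Tensor, targets: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Token-level negative log-likelihood for machine translation (sketch).

    logits:  (batch, tgt_len, vocab) unnormalized decoder outputs
    targets: (batch, tgt_len) reference token ids y_i
    Padding positions (pad_id) are excluded, so the result approximates
    the average of -log p(y_i | x_i) over the batch.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    return F.nll_loss(
        log_probs.view(-1, log_probs.size(-1)),
        targets.view(-1),
        ignore_index=pad_id,
        reduction="mean",
    )
```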

2.2. Multimodal Machine Translation

MMT introduces the image modality and constructs a dataset $\{(x_i, z_i, y_i)\}_{i=1}^{N}$, unlike MT, which uses only two parallel languages, as shown in Equation (1). MMT typically fuses the encoded source language and image with the attention mechanism [19,20] to obtain a multimodal feature. Similar to bilingual MT, the MMT model is trained with the following objective:
\[ \mathcal{L}_{\mathrm{MMT}} = \frac{1}{N} \sum_{i=1}^{N} \log p\left(y_i \mid x_i, z_i\right), \]
where $z$ denotes the image. Current MMT systems can be categorized into two types: (1) Image-must MMT, where the image content is considered helpful during both training and testing. These methods argue that images matter when the text alone leads to incorrect, ambiguous, or gender-neutral translations [14,21], and they focus on better modal interaction between textual and visual information to weigh the contribution of each modality. (2) Image-free MMT, where the image corpus is considered less critical and can be omitted at test time. These methods regard visual input as useful during training, acting as regularization [22,23] or as a source of bias [24].
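To make the fusion step concrete, the following is a minimal sketch of one common pattern: text states attend over projected image-region features before decoding, so that the decoder models p(y | x, z). The layer sizes, the residual/LayerNorm arrangement, and the use of nn.MultiheadAttention are our own illustrative choices, not the architecture of any specific system discussed below.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Minimal text-image fusion via cross-attention (illustrative only)."""

    def __init__(self, d_model: int = 512, d_image: int = 2048, n_heads: int = 8):
        super().__init__()
        self.img_proj = nn.Linear(d_image, d_model)   # project RoI features to model size
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_states: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_states: (batch, src_len, d_model); image_feats: (batch, n_regions, d_image)
        img = self.img_proj(image_feats)
        attended, _ = self.cross_attn(query=text_states, key=img, value=img)
        return self.norm(text_states + attended)      # residual fusion of text and vision
```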

3. Revisiting Existing Methods

This section compares the two types of methods in terms of architecture, amount of training data, and use of images, and presents the resulting insights.

3.1. Detailed Comparison

3.1.1. Setup

The Multi30K dataset [8] is widely used in MMT, consisting of two multilingual expansions (DE and FR) of Flickr30K [25]. The training and validation sets contain 29,000 and 1014 instances, respectively. We evaluate the MMT systems listed in Table 1 and report the BLEU [26] scores using sacreBLEU [27] on Multi30K across three test splits: Test2016, Test2017, and MSCOCO.
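For reference, BLEU scores of the kind reported in Table 1 can be computed with sacreBLEU's Python API roughly as follows; the hypothesis and reference sentences below are toy placeholders rather than system outputs.

```python
import sacrebleu

# hyps: system translations; refs: one reference per hypothesis (toy examples)
hyps = ["ein Mann fährt Fahrrad .", "zwei Hunde spielen im Park ."]
refs = ["ein Mann fährt ein Fahrrad .", "zwei Hunde spielen in einem Park ."]

bleu = sacrebleu.corpus_bleu(hyps, [refs])  # references passed as a list of reference streams
print(f"BLEU = {bleu.score:.2f}")
```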

3.1.2. Comparison of Architecture

As shown in Table 1, BLEU score differences between models of the same architecture are minimal. Most methods are based on the transformer architecture, categorized into three model sizes: base (1, 5, 7, 8, 10, 11), small (2, 12), and tiny (3, 4, 6, 13, 14, 15, 16, 17). For example, models such as Gated Fusion (3), Selective Attn (4), and ITP (6), all based on the tiny setting, show less than a 0.3-point difference in BLEU scores on Test2016. This suggests that architectural variation within the same capacity range may not significantly impact performance, at least under standard evaluation. Similarly, for models based on the transformer-base setting, the difference between PVP (7) and VTLM (8) is within 1 point. VL-T5 (9) and T5 (18) exhibit nearly identical performance across datasets, even though VL-T5 incorporates images, while T5 does not, implying limited benefits from visual input in this setup. In contrast, models with different architectures, such as the transformer models with varying sizes (10, 12, 13), show significant performance differences. We hypothesize that this disparity arises from capacity-induced overfitting, as observed by Wu et al. [15].
Table 1. Main results from Test2016, Test2017, and MSCOCO for EN→DE and EN→FR. The first category collects the existing image-must MMT systems, which take both source sentences and paired images as input for testing. The second category illustrates the systems that do not require real images as input for testing. Some results are from [15]. Existing works mainly build upon weak baselines.
| # | Method | EN→DE Test16 | EN→DE Test17 | EN→DE MSCOCO | EN→FR Test16 | EN→FR Test17 | EN→FR MSCOCO |
|---|--------|--------------|--------------|--------------|--------------|--------------|--------------|
| | Image-Must MMT Systems | | | | | | |
| 1 | Del+obj [28] | 38.00 | - | - | 59.80 | - | - |
| 2 | DCCN [29] | 39.70 | 31.00 | 26.70 | 61.20 | 54.30 | 45.40 |
| 3 | Gated Fusion [15] | 41.96 | 33.59 | 29.04 | 61.69 | 54.85 | 44.86 |
| 4 | Selective Attn [30] | 41.84 | 34.32 | 30.22 | 62.24 | 54.52 | 44.82 |
| 5 | PLUVR [12] | 40.30 | 33.45 | 30.28 | 61.31 | 53.15 | 43.65 |
| 6 | ITP [31] | 41.77 | 34.58 | 30.61 | - | - | - |
| 7 | PVP [11] | 42.30 | - | - | 65.50 | - | - |
| 8 | VTLM [32] | 43.30 | 37.60 | 35.10 | - | - | - |
| 9 | VL-T5 [18] | 45.40 | 41.99 | 36.96 | 65.42 | 59.92 | 51.98 |
| | Image-Free MMT Systems | | | | | | |
| 10 | Transformer-Base [19] | 38.33 | 31.36 | 27.54 | 60.60 | 53.16 | 42.83 |
| 11 | ImagiT [33] | 38.50 | 32.10 | 28.70 | 59.70 | 52.40 | 45.30 |
| 12 | Transformer-Small [19] | 39.68 | 32.99 | 28.50 | 61.31 | 53.85 | 44.03 |
| 13 | Transformer-Tiny [19] | 41.02 | 33.36 | 29.88 | 61.80 | 53.46 | 44.52 |
| 14 | UVR-NMT [10] | 40.79 | 32.16 | 29.02 | 61.00 | 53.20 | 43.71 |
| 15 | RMMT [15] | 41.45 | 32.94 | 30.01 | 62.12 | 54.39 | 44.52 |
| 16 | IKD-MMT [16] | 41.28 | 33.83 | 30.17 | 62.53 | 54.84 | - |
| 17 | VALHALLA [34] | 41.90 | 34.00 | 30.40 | 62.30 | 55.10 | 45.70 |
| 18 | T5 [18] | 44.81 | 42.25 | 37.37 | 65.63 | 59.82 | 52.06 |

3.1.3. Comparison of Training Data

Training data plays a more significant role than architectural differences in MMT systems. We observe that smaller models can sometimes outperform larger ones, potentially due to inadequate pre-training or overfitting in larger architectures. We compare three transformer-based MMT systems: Del+obj (1) [28], PVP (7) [11], and VTLM (8) [32]. The latter two methods use more training data. Specifically, PVP and VTLM are trained on a larger corpus that includes additional parallel data, while Del+obj is trained on the basic Multi30K dataset. The comparative results indicate that PVP (7) and VTLM (8) perform significantly better than Del+obj (1) as a result of their large-scale training data, with BLEU gains of nearly 4 and 5 points, respectively, on the Test16 EN→DE split. This highlights that data scale, rather than architectural novelty, contributes more directly to performance improvement. The largest models, VL-T5/T5, exhibit the best results and outperform PVP (7) and VTLM (8), with a BLEU boost of nearly 5 points on Test17. We attribute this to extensive pre-training on large-scale text–image corpora (for VL-T5) and textual data (for T5), further supporting the crucial role of training data.

3.1.4. Comparison of the Usage of Vision

During testing, visual information does not appear to be essential. Our findings indicate that many MMT systems achieve comparable performance with or without real image input at test time. Image-free MMT systems prohibit the use of real images during testing, while images remain optional during training. Some methods (10, 12, 13, 18) do not use images during training at all; instead, they rely entirely on textual signals, and their performance remains competitive with vision-aware models. Others (11, 14, 15, 16, 17) rely on generation or retrieval modules to produce text-guided features, which can be considered visual signals. These pseudo-visual features aim to simulate the benefits of visual grounding without requiring real images. However, the results of these various image-free MMT systems show little difference, as evidenced by the similar performance of image-free MMT models (e.g., 10/11, 13/14/15, and 16/17).

3.2. Discussion

The observed performance improvements in MMT systems may not necessarily stem from the acquisition of visual features. In fact, our results suggest that the role of visual input is often overstated, and performance gains are more plausibly attributed to other factors. Variations in architecture, training data, and experimental configurations across methods complicate efforts to draw definitive conclusions. Despite these differences, the highest- and second-highest-performing image-free and image-must systems, respectively, show comparable results. Notably, VL-T5/T5 and Gated Fusion/Transformer-Tiny exhibit similar performance levels, regardless of whether images are included in the input. For example, VL-T5 (which utilizes images) and T5 (text only) achieve nearly identical BLEU scores across all test sets, challenging the assumption that image grounding is essential for MMT.
In both image-free and image-must settings, VL-T5 and T5 consistently outperform other methods, achieving superior translation results. This indicates that model scale, pre-training, and language modeling capacity are likely more influential than access to visual context. These findings challenge the prevailing belief, as expressed in prior works such as PLUVR, PVP, and ITP, that visual information is essential for MMT. Given these results, we argue that many past works may have overestimated the contribution of vision due to limited baselines or inadequate experimental controls. This discrepancy prompts a re-examination of the role of visual inputs, especially when evaluated against stronger baselines like VL-T5 and T5. Furthermore, it raises critical questions about the practical utility of visual information in real-world MMT applications.

4. Visual Probing Method

A comparison of existing MMT methods reveals a disparity in performance, leaving the role of visual information in MMT unclear. To address this, we propose a visual type regularization method and introduce two probing tasks, which are detailed in the following sections.

4.1. Visual Type Regularization

Our goal is to investigate whether visual context must be semantically relevant to enhance the translation quality of MMT systems, or whether any visual input, real or not, has a similar effect. To this end, we explore three types of visual inputs, as illustrated in Figure 1:
  • Real Image (Image): This feature is extracted from the original paired image v using a Faster R-CNN [35] trained on Visual Genome [36] with n = 36 object regions. Each object is represented by its position feature (i.e., bounding box coordinates) and its 2048-dimensional region-of-interest (RoI) feature.
  • Generated Image (Generated): We use the text-to-image model Stable Diffusion [37] to generate sentence-dependent images, rather than integrating a generation module into the MT framework [33,34]. Features for this type are extracted in the same way as for real images.
  • Noise Image (Noise): For simplicity of implementation, we sample Gaussian noise with the same dimensions as the first two feature types (36 × 4 bounding-box coordinates and 36 × 2048 RoI features) and use it as input; a minimal construction sketch follows this list.
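The sketch below shows how the three visual input types can be put into a common tensor format of 36 bounding boxes plus 36 RoI vectors. The real and generated image features are assumed to be extracted offline (e.g., with Faster R-CNN), and the helper names are hypothetical.

```python
import torch

N_REGIONS, BOX_DIM, ROI_DIM = 36, 4, 2048

def noise_visual_input(batch_size: int) -> dict:
    """Noise 'image': Gaussian samples with the same shape as real RoI features."""
    return {
        "boxes": torch.randn(batch_size, N_REGIONS, BOX_DIM),
        "feats": torch.randn(batch_size, N_REGIONS, ROI_DIM),
    }

def image_visual_input(precomputed_boxes: torch.Tensor, precomputed_feats: torch.Tensor) -> dict:
    """Real or generated image: region features extracted offline (e.g., by Faster R-CNN)."""
    assert precomputed_feats.shape[-2:] == (N_REGIONS, ROI_DIM)
    return {"boxes": precomputed_boxes, "feats": precomputed_feats}

# Example: a batch of 8 noise 'images' for the Noise visual type.
noise_batch = noise_visual_input(batch_size=8)
```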
Recent MMT methods, such as VALHALLA [34] and IKD-MMT [16], rely on consistency learning [38] and have achieved state-of-the-art (SOTA) performance. These methods attribute their performance improvements to the integration of visual features. However, it is important to note that these approaches often employ weak baselines, making it difficult to determine whether the performance gains stem from consistency learning, the incorporation of visual features, or both.
We propose a visual type regularization strategy to investigate whether the enhancement in performance is attributed to visual features or consistency learning, as illustrated in Figure 2.

4.2. Visual Contribution Probing

In this approach, we input various types of visual data (as described in the previous section) alongside text into the MMT model. If two visual inputs are identical, this can be considered a form of consistency learning. We then obtain two distributions from the inference processes of two sub-networks with shared weights. This allows us to test whether different types of visual inputs contribute to different aspects of the MMT system, with the constraints imposed by the two distributions.
To measure the contribution of visual information, we use visual type regularization to evaluate the consistency of the outputs from the two networks. This is performed using the KL divergence $D_{\mathrm{KL}}$, which constrains the original network. The consistency is measured as follows:
\[ \mathcal{L}_{\mathrm{KL}}^{i} = \frac{1}{2} D_{\mathrm{KL}}\!\left( p_1(y_i \mid x_i, z_i) \,\|\, p_2(y_i \mid x_i, z_i^{*}) \right) + \frac{1}{2} D_{\mathrm{KL}}\!\left( p_2(y_i \mid x_i, z_i^{*}) \,\|\, p_1(y_i \mid x_i, z_i) \right), \]
where $z_i$ is one type of visual input (visual 1) and $z_i^{*}$ is another type of visual input (visual 2). With the basic negative log-likelihood objective $\mathcal{L}_{\mathrm{MMT}}$ over the two forward passes and $\alpha$ as the loss weight, the final objective of our MMT system is to minimize $\mathcal{L}^{i}$ for each instance $(x_i, z_i, y_i)$, as follows:
\[ \mathcal{L}^{i} = \mathcal{L}_{\mathrm{MMT}}^{i} + \alpha \cdot \mathcal{L}_{\mathrm{KL}}^{i} = -\frac{1}{2} \log p_1(y_i \mid x_i, z_i) - \frac{1}{2} \log p_2(y_i \mid x_i, z_i^{*}) + \frac{\alpha}{2} D_{\mathrm{KL}}\!\left( p_1(y_i \mid x_i, z_i) \,\|\, p_2(y_i \mid x_i, z_i^{*}) \right) + \frac{\alpha}{2} D_{\mathrm{KL}}\!\left( p_2(y_i \mid x_i, z_i^{*}) \,\|\, p_1(y_i \mid x_i, z_i) \right). \]
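A minimal sketch of this training objective is given below, assuming the model returns token-level logits for each of the two forward passes; the symmetric KL term mirrors Equations (3) and (4), while handling of padded positions and the reduction choices are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def visual_type_regularization_loss(logits1, logits2, targets, pad_id=0, alpha=1.0):
    """Sketch of L = L_MMT + alpha * L_KL from Equation (4).

    logits1, logits2: (batch, tgt_len, vocab) from two forward passes that share
    weights but receive different visual inputs z and z*.
    """
    log_p1 = F.log_softmax(logits1, dim=-1)
    log_p2 = F.log_softmax(logits2, dim=-1)

    # Negative log-likelihood averaged over the two passes (padding ignored).
    flat_t = targets.view(-1)
    nll = 0.5 * (
        F.nll_loss(log_p1.view(-1, log_p1.size(-1)), flat_t, ignore_index=pad_id)
        + F.nll_loss(log_p2.view(-1, log_p2.size(-1)), flat_t, ignore_index=pad_id)
    )

    # Symmetric KL divergence between the two output distributions.
    kl = 0.5 * (
        F.kl_div(log_p1, log_p2, log_target=True, reduction="batchmean")   # KL(p2 || p1)
        + F.kl_div(log_p2, log_p1, log_target=True, reduction="batchmean") # KL(p1 || p2)
    )
    return nll + alpha * kl
```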

4.3. Modality Relationship Probing

In previous works, contrastive learning has been shown to improve translation quality by enhancing the quality of representations [31,39]. The core idea behind contrastive learning is to bring positive examples closer together while pushing negative examples farther apart in the representational space, thereby improving the model’s ability to distinguish between relevant and irrelevant features.
For simplicity, we consider a batch of data pairs $\{(q_i, v_i)\}_{i=1}^{N}$. As shown in Figure 1, for each query sample, we obtain its final query encoding $F(q_i)$ through a sub-network, where the input could be visual ($z_1$), text ($x$), or a combination of both (text+visual, $(x, z_1)$). We then retrieve the corresponding positive embedding $\{F^{*}(v_j^{*}),\ i = j\}$ for each query, and the negative embeddings $\{F^{*}(v_j),\ i \neq j\}$ for the other samples within the batch, through another sub-network.
The contrastive loss [40] is then computed as follows:
\[ \mathcal{L}_{\mathrm{ctr}} = -\sum_{(q_i, v_j) \in \mathcal{D}} \log \frac{ e^{\,\mathrm{sim}\left(F(q_i),\, F^{*}(v_j)\right)/\tau} }{ \sum_{v_k \in \mathcal{D}} e^{\,\mathrm{sim}\left(F(q_i),\, F^{*}(v_k)\right)/\tau} }, \]
where sim(·) measures the similarity between embeddings, $\mathcal{D}$ is the data pool of a batch, and $\tau$ is the temperature. Thus, we can obtain the contrastive loss for visual-to-text, visual-to-visual, and text-to-text relations by computing it over the corresponding data pairs.
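A minimal sketch of this in-batch contrastive loss follows, assuming cosine similarity for sim(·) and that each sub-network outputs one pooled embedding per sample; these pooling and similarity choices are our assumptions rather than details from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(queries: torch.Tensor, values: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """In-batch InfoNCE-style loss for modality relationship probing (sketch).

    queries: (batch, dim) encodings F(q_i) from one sub-network
    values:  (batch, dim) encodings F*(v_j) from the other sub-network;
             the matching index (i == j) is the positive, all others are negatives.
    """
    q = F.normalize(queries, dim=-1)
    v = F.normalize(values, dim=-1)
    sim = q @ v.t() / tau                        # (batch, batch) similarities / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(sim, labels)          # -log softmax of each positive pair
```

Computing this loss separately over visual–text, visual–visual, and text–text encoding pairs yields the three probing curves analyzed in Section 6.2.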

5. Results

5.1. Setup

We use the stronger pre-trained T5-Base and VL-T5-Base [18] as our baselines rather than a weak baseline (the non-pre-trained Transformer-Tiny [19]). The training settings follow the original configuration of Cho et al. [18]. We perform beam search with a beam size of 5 and set $\tau = 0.1$ for all experiments. All models are trained and evaluated on one machine with four RTX 3090 GPUs. We report the average results over three random seeds for all tests.
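For orientation, beam-search decoding with beam size 5 looks roughly as follows with a stock Hugging Face T5 checkpoint; the paper fine-tunes T5/VL-T5 on Multi30K, so the off-the-shelf t5-base model and prompt used here are only an illustrative stand-in.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Off-the-shelf t5-base as an illustrative stand-in for the fine-tuned baseline.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

inputs = tokenizer("translate English to German: A man is riding a bicycle.",
                   return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5, max_length=64)  # beam search, beam size 5
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```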

5.2. Main Results

The upper part of Table 2 compares five models, including the baseline VL-T5 (trained with real images), the consistency learning method [34] (state of the art in MMT), and our proposed regularization method. The lower part of the table compares results from tests where real images were not used. We employ both BLEU and BERTScore [41] as evaluation metrics for each method.
(1)
When text features are strong, the test results are nearly identical, regardless of whether images are used, as discussed in Section 3.1. Applying the baseline consistency learning method yields results comparable to the initial VL-T5 (only a small gain of 0.03 BLEU). When consistency learning is applied to models tested with only text (7), the results are very similar to those from visual type regularization (4 and 5), with differences under 0.04. This suggests that consistency learning can achieve comparable results to visual type regularization when text features are dominant.
(2)
When real image features are available during testing, incorporating additional visual features offers limited benefits. Even when using non-original data (e.g., noise or generated images) during training (as seen in models 4 and 5), the results are slightly improved when the original image is used during testing, although the improvement is modest (around 0.2 BLEU). However, when using noise or generated images for consistency learning, performance slightly decreases (around 0.02 BLEU). This suggests that when real image features are available, additional visual features may not significantly improve performance, and consistency learning may not be as effective.
(3)
The choice of visual features can have a significant impact on model performance, and the distribution of visual features should be carefully considered during training. Combining different visual features (8, 9, 10, 11, and 12) leads to either a decrease in performance (up to −0.66 BLEU) or minimal improvements (+0.1 BLEU), primarily due to the excessive variations in feature distributions. In most cases, these combinations perform worse than using real images during testing.
(4)
Consistency learning is an effective method for improving the performance of image-free MMT systems, even without real images during training or testing. Specifically, the use of (Image, Generated)/Generated data pairs results in performance improvements, consistent with findings from prior image-free MMT systems [16,34]. Furthermore, our study shows that performance can be further enhanced even when real images are not used for training (e.g., using (Noise, Noise)/Noise for training).
(5)
Using visual context as a constraint in consistency learning can improve translation quality on in-domain data, but may hurt performance on out-of-domain datasets. As shown in Table 2, various combinations of visual features and consistency learning techniques improve in-domain performance, but result in decreased performance on out-of-domain data (e.g., MSCOCO). This suggests that consistency learning with visual regularization helps the model adapt to in-domain data, but struggles with more challenging out-of-domain instances (e.g., ambiguous verbs), as seen in MSCOCO, which is consistent with prior reports [34].
(6)
Although the differences in BLEU and BERTScore across various visual input types may appear numerically minor, their remarkable consistency across multiple test sets, evaluation metrics, and training–testing configurations constitutes a meaningful observation. This invariance indicates that the model’s behavior is largely insensitive to the semantic content of visual signals, thereby reinforcing our conclusion that visual inputs function primarily as regularization factors rather than as semantically informative modalities in strong MMT architectures.

6. Analysis

6.1. Visual Contribution Analysis

Our earlier analysis highlighted that consistency learning improves performance on in-domain datasets but leads to a decrease in performance on out-of-domain datasets. To investigate whether this performance reduction is due to the inclusion of visual information or the consistency learning itself, we conduct further testing on a text-only test set, Newstest14 [3]. Newstest14 primarily consists of news content and features longer text compared with Multi30K.
As shown in Table 3, we evaluate a model trained on Multi30K and observe a similar performance drop when applying the visual consistency constraint. However, we also find that applying the consistency constraint using text-only input (the last line in the table) results in a comparable decrease in performance. This finding supports previous studies suggesting that consistency learning tends to generalize poorly on out-of-domain data [38].
Moreover, the performance with text-only input closely mirrors the results observed when using visual constraints. This indicates that domain-specific biases in datasets play a significant role and suggests that consistency learning might have a more pronounced impact on MMT models than the inclusion of visual context.

6.2. Modality Relationship Analysis

Overall, Table 2 and Figure 3 demonstrate that visual signals influence the translation results primarily by affecting the text representation, which is similar to the impact of noise on the translation model.
To further investigate why visual type regularization improves the MMT model, we examine four models that show performance improvements (models 4, 5, 13, and 14). We compare the variations in the distances between the representations of visual–text, visual–visual, and text–text during the training process, as described in Section 4.3 (see Figure 3).
  • V–T Embedding: The left part of Figure 3 shows that the model is unable to effectively use visual information to assist the text representation from a semantic perspective. The visual–text contrastive loss across the four methods is similar (ranging from 103.2 to 103), indicating that the models fail to map the representations of different modalities into the same encoding space. This behavior is consistent with the results observed in many vision–language models, such as ViLBERT [42] and LXMERT [43].
  • Visual Embedding: The middle part of Figure 3 illustrates that noise and visual information serve two key roles: model regularization and visual representation optimization. The consistency constraints allow the model to reduce the differences between visual representations, even when the visual types differ (e.g., Image vs. Noise and Generated). However, the model fails to optimize the representational distance when purely noisy visual input is used (curve (d)), as the contrastive loss increases with training and eventually plateaus.
  • Text Embedding: The right part of the figure shows that visual inputs, including noise, act as regularizers during consistency training. When using a real image during consistency training, the model significantly improves its text representation. However, when other visual types, such as noise or generated images, are used for visual regularization, the trained model’s textual representation is nearly identical to that of the model trained using the noise consistency constraint.

6.3. Effect of Visual Inputs

Previous experiments utilizing visual type regularization have shown that visual information influences model training but does not semantically assist the MMT model. To directly compare the impacts of different visual inputs on translation performance, we conduct experiments where the model is trained with noise and tested with noise across various visual contexts.
Table 4 presents the results when the model is trained with ordered versus unordered images (real or generated images compared with noise) and tested with noise. The results show an increase in the average BLEU score when ordered visual inputs (either real or generated images) are used during training, compared with training with noise. This increase suggests that the ordered visual information does contribute to the model’s training process. However, when this visual information is removed during testing, the model’s performance drops, indicating the importance of visual input during training.
On the other hand, Table 5 illustrates that the final results across all datasets remain nearly identical regardless of the type of visual input used during testing. The average change in BLEU score is less than 0.15. Even when the model is trained with noise, substituting different types of visual inputs (real, generated, or noise) during testing leads to minimal changes in performance. This suggests that, unlike previous work [16], where vision was considered essential for training, the model’s consideration of visual input in our experiments is largely akin to noise, having minimal impact on the final translation performance.

6.4. Case Study

We present two representative translation examples to visually highlight the differences between the models, as shown in Figure 4. These examples reveal how different training setups with various visual inputs influence the translation output.
In Figure 4a, we compare models trained with different visual contexts: (Noise, Noise)/Noise, (Image, Generated)/Image, and (Image, Noise)/Image. The models produce slight differences in translation, and these variations appear mainly in grammatical structure (e.g., prepositions) rather than in content words; the noun vocabulary remains nearly identical across models. This suggests that the presence of noise, whether during training or testing, influences not only individual word choices but also the broader translation behavior of the model.
Similarly, Figure 4b highlights the differences between the model trained on (Image, Image)/Image and the others. This model tends to show more variation in the noun vocabulary, while the other models (trained on combinations like (Image, Noise)/Image and (Image, Generated)/Image) tend to produce similar content. This pattern of variation in vocabulary and structure is representative of the majority of the translation cases observed in our study. It aligns with the phenomenon of text–text contrast loss (Figure 3), where the model struggles to generate significantly different outputs when the input modality remains consistent.
We also compare the parameter distributions in the final attention layer for the (Image, Noise)/Image and (Image, Generated)/Generated combinations, as shown in Figure 5. The results show that the (Image, Generated)/Generated combination has a more concentrated distribution, indicating a more focused and stable attention mechanism, which likely contributes to better performance. In contrast, the (Image, Noise)/Image combination exhibits a more spread-out distribution, suggesting less stability and alignment in the attention process, leading to suboptimal results.

7. Discussion

(1)
Future research in the field of MMT should focus on testing in real-world settings. Current methodologies are built on a weak baseline (Transformer-Tiny) under low-resource text settings, which have suggested that visual information plays a significant role by providing complementary semantic information not available in the text. However, future research should investigate whether visual information is still necessary in real-world settings, where the text is sufficient and the baseline is more robust.
(2)
Visual information serves as a regularization method in MMT and can even be replaced by random noise when training on the Multi30K dataset. Real images and generated images have similar regularization effects (as discussed in Section 4). Even when the regularization signal differs, the model trained on (Noise, Noise) still produces translation results comparable to those of a model trained on regular visual data. This finding aligns with image-free MMT methods [16,44]. Exploring simpler constraints to exploit such features (visual or noise) may be a promising direction for future research.
(3)
Stronger translation metrics should be used. Many existing methodologies in MMT [15,30] do not comprehensively analyze the semantic similarity of outputs produced by different models using recent semantic similarity measures [41]. We observe that the text embeddings of different models are nearly the same even though their BLEU scores vary by up to 0.4 points. This implies that the reported improvements may only yield models that are better aligned with BLEU, rather than models that truly enhance translation quality.

Author Contributions

Conceptualization, X.M., J.R. and X.L.; methodology, J.R. and X.L.; writing—original draft preparation, J.R.; writing—review and editing, X.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (Grant No. 62206076), Guangdong Basic and Applied Basic Research Foundation (Grant No. 2024A1515011491), Shenzhen Science and Technology Program (Grant Nos. ZDSYS20230626091203008, KQTD2024072910215406, KJZD20231023094700001), and Shenzhen College Stability Support Plan (Grant Nos. GXWD20220811173340003, GXWD20220817123150002).

Data Availability Statement

The dataset presented in this study for training and testing is available in Multi30K at https://github.com/multi30k/dataset, accessed on 21 December 2023.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Deng, J.; Dong, W.; Socher, R.; Li, L.; Li, K.; Li, F. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), Miami, FL, USA, 20–25 June 2009; IEEE Computer Society: Washington, DC, USA, 2009; pp. 248–255. [Google Scholar] [CrossRef]
  2. Krizhevsky, A.; Hinton, G. Learning multiple layers of features from tiny images. Technical Report, 8 April 2009. [Google Scholar]
  3. Callison-Burch, C.; Koehn, P.; Monz, C.; Schroeder, J. Findings of the 2009 Workshop on Statistical Machine Translation. In Proceedings of the Fourth Workshop on Statistical Machine Translation, Athens, Greece, 30–31 March 2009; pp. 1–28. [Google Scholar]
  4. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, Montreal, QC, Canada, 8–13 December 2014; pp. 3104–3112. [Google Scholar]
  5. Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; Parikh, D. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; IEEE Computer Society: Washington, DC, USA, 2017; pp. 6325–6334. [Google Scholar] [CrossRef]
  6. Zellers, R.; Bisk, Y.; Farhadi, A.; Choi, Y. From Recognition to Cognition: Visual Commonsense Reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; Computer Vision Foundation/IEEE: Washington, DC, USA, 2019; pp. 6720–6731. [Google Scholar] [CrossRef]
  7. Futeral, M.; Schmid, C.; Sagot, B.; Bawden, R. Towards Zero-Shot Multimodal Machine Translation. arXiv 2024, arXiv:2407.13579. [Google Scholar]
  8. Elliott, D.; Frank, S.; Sima’an, K.; Specia, L. Multi30K: Multilingual English-German Image Descriptions. In Proceedings of the 5th Workshop on Vision and Language, Berlin, Germany, 12 August 2016; pp. 70–74. [Google Scholar] [CrossRef]
  9. Barrault, L.; Bougares, F.; Specia, L.; Lala, C.; Elliott, D.; Frank, S. Findings of the Third Shared Task on Multimodal Machine Translation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, Belgium, Brussels, 31 October–1 November 2018; pp. 304–323. [Google Scholar] [CrossRef]
  10. Zhang, Z.; Chen, K.; Wang, R.; Utiyama, M.; Sumita, E.; Li, Z.; Zhao, H. Neural Machine Translation with Universal Visual Representation. In Proceedings of the 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  11. Huang, P.Y.; Hu, J.; Chang, X.; Hauptmann, A. Unsupervised Multimodal Neural Machine Translation with Pseudo Visual Pivoting. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8226–8237. [Google Scholar] [CrossRef]
  12. Fang, Q.; Feng, Y. Neural Machine Translation with Phrase-Level Universal Visual Representations. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 5687–5698. [Google Scholar] [CrossRef]
  13. Caglayan, O.; Madhyastha, P.; Specia, L.; Barrault, L. Probing the Need for Visual Context in Multimodal Machine Translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4159–4170. [Google Scholar] [CrossRef]
  14. Li, J.; Ataman, D.; Sennrich, R. Vision Matters When It Should: Sanity Checking Multimodal Machine Translation Models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online/Punta Cana, Dominican Republic, 7–11 November 2021; pp. 8556–8562. [Google Scholar] [CrossRef]
  15. Wu, Z.; Kong, L.; Bi, W.; Li, X.; Kao, B. Good for Misconceived Reasons: An Empirical Revisiting on the Need for Visual Context in Multimodal Machine Translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; pp. 6153–6166. [Google Scholar] [CrossRef]
  16. Peng, R.; Zeng, Y.; Zhao, J. Distill The Image to Nowhere: Inversion Knowledge Distillation for Multimodal Machine Translation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 2379–2390. [Google Scholar] [CrossRef]
  17. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 140:1–140:67. [Google Scholar]
  18. Cho, J.; Lei, J.; Tan, H.; Bansal, M. Unifying Vision-and-Language Tasks via Text Generation. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, Virtual Event, 18–24 July 2021; Proceedings of Machine Learning Research. Volume 139, pp. 1931–1942. [Google Scholar]
  19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  20. Yao, S.; Wan, X. Multimodal Transformer for Multimodal Machine Translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 4346–4350. [Google Scholar] [CrossRef]
  21. Futeral, M.; Schmid, C.; Laptev, I.; Sagot, B.; Bawden, R. Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 5394–5413. [Google Scholar] [CrossRef]
  22. Kukacka, J.; Golkov, V.; Cremers, D. Regularization for Deep Learning: A Taxonomy. arXiv 2017, arXiv:1710.10686. [Google Scholar]
  23. Bowen, B.; Vijayan, V.; Grigsby, S.; Anderson, T.; Gwinnup, J. Detecting concrete visual tokens for Multimodal Machine Translation. In Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), Chicago, IL, USA, 30 September–2 October 2024; pp. 29–38. [Google Scholar]
  24. Jabri, A.; Joulin, A.; van der Maaten, L. Revisiting Visual Question Answering Baselines. In Lecture Notes in Computer Science, Proceedings of the ECCV (8), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; Volume 9912, pp. 727–739. [Google Scholar]
  25. Young, P.; Lai, A.; Hodosh, M.; Hockenmaier, J. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2014, 2, 67–78. [Google Scholar] [CrossRef]
  26. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar] [CrossRef]
  27. Post, M. A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium, 31 October–1 November 2018; pp. 186–191. [Google Scholar] [CrossRef]
  28. Ive, J.; Madhyastha, P.; Specia, L. Distilling Translations with Visual Awareness. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 6525–6538. [Google Scholar] [CrossRef]
  29. Lin, H.; Meng, F.; Su, J.; Yin, Y.; Yang, Z.; Ge, Y.; Zhou, J.; Luo, J. Dynamic Context-guided Capsule Network for Multimodal Machine Translation. In Proceedings of the MM ’20: The 28th ACM International Conference on Multimedia, Virtual Event/Seattle, WA, USA, 12–16 October 2020; pp. 1320–1329. [Google Scholar] [CrossRef]
  30. Li, B.; Lv, C.; Zhou, Z.; Zhou, T.; Xiao, T.; Ma, A.; Zhu, J. On Vision Features in Multimodal Machine Translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 6327–6337. [Google Scholar] [CrossRef]
  31. Ji, B.; Zhang, T.; Zou, Y.; Hu, B.; Shen, S. Increasing Visual Awareness in Multimodal Neural Machine Translation from an Information Theoretic Perspective. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 6755–6764. [Google Scholar] [CrossRef]
  32. Caglayan, O.; Kuyu, M.; Amac, M.S.; Madhyastha, P.; Erdem, E.; Erdem, A.; Specia, L. Cross-lingual Visual Pre-training for Multimodal Machine Translation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, 16 April 2021; pp. 1317–1324. [Google Scholar] [CrossRef]
  33. Long, Q.; Wang, M.; Li, L. Generative Imagination Elevates Machine Translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 5738–5748. [Google Scholar] [CrossRef]
  34. Li, Y.; Panda, R.; Kim, Y.; Chen, C.R.; Feris, R.; Cox, D.D.; Vasconcelos, N. VALHALLA: Visual Hallucination for Machine Translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 5206–5216. [Google Scholar] [CrossRef]
  35. Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, Montreal, QC Canada, 7–12 December 2015; pp. 91–99. [Google Scholar]
  36. Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.; Shamma, D.A.; et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. Int. J. Comput. Vis. 2017, 123, 32–73. [Google Scholar] [CrossRef]
  37. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the CVPR, New Orleans, LA, USA, 18–24 June 2022; pp. 10674–10685. [Google Scholar]
  38. Liang, X.; Wu, L.; Li, J.; Wang, Y.; Meng, Q.; Qin, T.; Chen, W.; Zhang, M.; Liu, T. R-Drop: Regularized Dropout for Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, Virtual, 6–14 December 2021; pp. 10890–10905. [Google Scholar]
  39. Pan, X.; Wang, M.; Wu, L.; Li, L. Contrastive Learning for Many-to-many Multilingual Neural Machine Translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Virtual Event, 1–6 August 2021; pp. 244–258. [Google Scholar] [CrossRef]
  40. Gutmann, M.; Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the AISTATS, Sardinia, Italy, 13–15 May 2010; pp. 297–304. [Google Scholar]
  41. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. In Proceedings of the 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  42. Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, Vancouver, BC, Canada, 8–14 December 2019; pp. 13–23. [Google Scholar]
  43. Tan, H.; Bansal, M. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 5100–5111. [Google Scholar] [CrossRef]
  44. Elliott, D.; Kádár, Á. Imagination Improves Multimodal Translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Taipei, Taiwan, 1 December 2017; pp. 130–141. [Google Scholar]
Figure 1. An example of the MMT translation process. The MMT model requires both text and images as inputs. The visual input can be different types, such as (1) real image, (2) generated image, and (3) noise image. We use different visual information to explore the effect of visual signals on MMT systems.
Figure 2. An overview of our visual type regularization. In training, we change the visual input type to perform consistency training through the KL divergence of two output distributions. z and z* denote different image inputs.
Figure 3. Modality relationship probing. (a) Models did not learn the difference between the modalities. (b) Unlike those trained with generated and real images, models fail to regard noise as the visual representation. (c) Text representation can be significantly improved by consistency training, while the effect of consistency learning with two different visual types on the model (a,b) is equivalent to noise information (d). (X, Y)/Z means that the model adopts our visual type regularization, training with types X and Y while validating/testing with type Z.
Figure 4. Translation examples from MSCOCO of the four models with consistency learning. The translated sentences have little connection to the visual input. Nouns with different translation results are marked in bold. (a) (Noise, Noise)/Noise sometimes produces syntactic variation relative to (Image, Generated)/Image and (Image, Noise)/Image, with almost the same noun vocabulary; the differing prepositions are marked in italics. (b) (Image, Image)/Image differs from the other three (which tend to output the same content) in the variation of the noun vocabulary.
Figure 5. The parameter distribution of the output module in the final attention layer for (Image, Noise)/Image and (Image, Generated)/Generated: (a) self-attention of (Image, Noise)/Image and (b) self-attention of (Image, Generated)/Generated. A deeper shade of blue indicates a more concentrated distribution.
Table 2. Visual type regularization results. “+consistency learning” is a baseline for our comparison, and its two types of visual input are identical. “+w/o Visual” means that we train this model with a zero vector as the visual input, and (X, Y)/Z means that the model adopts visual type regularization, training with types X and Y while validating/testing with type Z.
| # | Model | Test16 BLEU | Test16 BERTScore | Test17 BLEU | Test17 BERTScore | MSCOCO BLEU | MSCOCO BERTScore | Average BLEU | Average BERTScore |
|---|-------|-------------|------------------|-------------|------------------|-------------|------------------|--------------|-------------------|
| | Testing with the original matching images | | | | | | | | |
| 1 | VL-T5 [18] | 45.40 ± 0.04 | 0.8298 | 41.99 ± 0.29 | 0.8151 | 36.96 ± 0.07 | 0.8152 | 41.45 | 0.8200 |
| 2 | +consistency learning | 45.21 ± 0.18 | 0.8294 | 42.08 ± 0.23 | 0.8149 | 37.15 ± 0.12 | 0.8156 | 41.48 (+0.03) | 0.8200 (+0.0 × 10⁻³) |
| 3 | +(Noise, Generated)/Image | 45.12 ± 0.19 | 0.8296 | 41.89 ± 0.12 | 0.8151 | 37.27 ± 0.34 | 0.8155 | 41.43 (−0.02) | 0.8201 (+0.1 × 10⁻³) |
| 4 | +(Image, Generated)/Image | 45.39 ± 0.22 | 0.8297 | 42.05 ± 0.28 | 0.8152 | 37.43 ± 0.28 | 0.8159 | 41.62 (+0.17) | 0.8203 (+0.3 × 10⁻³) |
| 5 | +(Image, Noise)/Image | 45.39 ± 0.21 | 0.8294 | 42.05 ± 0.28 | 0.8152 | 37.48 ± 0.43 | 0.8160 | 41.64 (+0.19) | 0.8202 (+0.2 × 10⁻³) |
| | Testing without the original matching images | | | | | | | | |
| 6 | +w/o Visual | 44.81 ± 0.05 | 0.8294 | 42.25 ± 0.34 | 0.8157 | 37.37 ± 0.12 | 0.8158 | 41.48 (+0.03) | 0.8203 (+0.3 × 10⁻³) |
| 7 | +w/o Visual+consistency learning | 45.12 ± 0.13 | 0.8295 | 42.04 ± 0.15 | 0.8159 | 37.64 ± 0.17 | 0.8163 | 41.60 (+0.15) | 0.8206 (+0.6 × 10⁻³) |
| 8 | +(Image, Generated)/Noise | 44.84 ± 0.34 | 0.8286 | 41.04 ± 1.19 | 0.8143 | 36.51 ± 0.86 | 0.8139 | 40.79 (−0.66) | 0.8189 (−1.1 × 10⁻³) |
| 9 | +(Noise, Generated)/Noise | 44.87 ± 0.05 | 0.8293 | 41.87 ± 0.04 | 0.8153 | 37.12 ± 0.44 | 0.8156 | 41.29 (−0.16) | 0.8201 (+0.1 × 10⁻³) |
| 10 | +(Image, Noise)/Noise | 45.08 ± 0.15 | 0.8289 | 42.02 ± 0.19 | 0.8152 | 37.20 ± 0.40 | 0.8156 | 41.43 (−0.02) | 0.8199 (−0.1 × 10⁻³) |
| 11 | +(Image, Noise)/Generated | 45.12 ± 0.08 | 0.8292 | 42.06 ± 0.17 | 0.8153 | 37.14 ± 0.11 | 0.8148 | 41.44 (−0.01) | 0.8198 (−0.2 × 10⁻³) |
| 12 | +(Noise, Generated)/Generated | 45.22 ± 0.04 | 0.8294 | 42.83 ± 0.20 | 0.8152 | 37.33 ± 0.30 | 0.8155 | 41.46 (+0.01) | 0.8200 (+0.0 × 10⁻³) |
| 13 | +(Noise, Noise)/Noise | 45.26 ± 0.25 | 0.8293 | 42.15 ± 0.25 | 0.8153 | 37.36 ± 0.21 | 0.8157 | 41.59 (+0.14) | 0.8201 (+0.1 × 10⁻³) |
| 14 | +(Image, Generated)/Generated | 45.42 ± 0.19 | 0.8296 | 42.11 ± 0.10 | 0.8152 | 37.53 ± 0.24 | 0.8162 | 41.69 (+0.24) | 0.8203 (+0.3 × 10⁻³) |
Table 3. Out-of-domain results for consistency regularization on Newstest14. Methods with the consistency constraint score lower than the baseline VL-T5 or T5 (w/o Visual). Bold font highlights the best result.
| Model | Newstest14 |
|-------|------------|
| VL-T5 | 23.41 ± 0.05 |
| +(Image, Generated)/Noise | 22.10 ± 1.25 |
| +(Noise, Noise)/Noise | 23.19 ± 0.12 |
| +(Image, Generated)/Generated | 23.23 ± 0.10 |
| +(Noise, Generated)/Generated | 23.26 ± 0.02 |
| +(Noise, Generated)/Noise | 23.28 ± 0.03 |
| +w/o Visual | **23.44 ± 0.07** |
| +w/o Visual+consistency learning | 23.20 ± 0.10 |
Table 4. Noise testing results. Models are trained with different visual types and tested with noise.
| Training with | Test16 | Test17 | MSCOCO | Avg. |
|---------------|--------|--------|--------|------|
| Image | 44.54 ± 0.54 | 41.00 ± 1.40 | 36.66 ± 0.55 | 40.73 |
| Generated | 44.39 ± 0.58 | 41.10 ± 1.17 | 36.87 ± 0.49 | 40.79 |
| Noise | 45.12 ± 0.13 | 42.29 ± 0.37 | 37.21 ± 0.48 | 41.54 |
Table 5. Noise training results. The model is trained with noise and tested with different visual types.
| Testing with | Test16 | Test17 | MSCOCO | Avg. |
|--------------|--------|--------|--------|------|
| Image | 45.15 ± 0.08 | 42.09 ± 0.41 | 37.02 ± 0.23 | 41.42 |
| Generated | 45.11 ± 0.07 | 42.07 ± 0.20 | 37.00 ± 0.29 | 41.39 |
| Noise | 45.12 ± 0.13 | 42.29 ± 0.37 | 37.21 ± 0.48 | 41.54 |