2.1.1. Theoretical Foundations of Multimodal Learning
Multimodal learning operates via the brain’s parallel processing architecture, fundamentally anchored in Paivio’s (1971) Dual-Coding Theory (DCT) [5]. This framework establishes that verbal and nonverbal inputs generate segregated yet interconnected memory representations, with empirical evidence confirming that auditory word forms and visual images create distinct but mutually reinforcing cognitive traces [6]. These dual pathways enhance recall through two complementary neurocognitive mechanisms: (a) activating parallel neural routes to strengthen memory encoding, and (b) providing enriched retrieval cues, specifically through auditory and visual stimuli serving as cognitive anchors.
Building upon these principles, Mayer and Moreno’s Cognitive Theory of Multimedia Learning (CTML) [7] provides an evidence-based instructional design framework built on three core assumptions: (1) visual and auditory channels independently process pictorial and verbal information; (2) limited working memory capacity constrains information intake; and (3) active cognitive integration synthesizes multisensory inputs. These mechanisms collectively resolve the fundamental tension between multimodal stimulation and cognitive load constraints, thereby enabling efficient knowledge construction.
This integrated theoretical foundation offers robust scaffolding for the empirical investigation of sensory-mediated learning. In the present study, we operationalize auditory and visual stimuli within cognitive retrieval paradigms to systematically examine unimodal (auditory-only or visual-only) and multimodal (audiovisual) processing conditions.
2.1.2. The Influence of Sound on Memory
The auditory modality plays a distinctive role in lexical memory through phonological processing mechanisms. Gkalitsiou & Byrd noted that the phonological loop, a core component of the working memory system, is specialized for the temporary storage and manipulation of phonological information (e.g., verbal repetition, digit span tasks) [8]. Speech input can effectively reduce visual-channel cognitive load by distributing processing across multiple sensory modalities. For instance, Sweller et al. noted that concurrent narration with visual animation capitalizes on working memory’s dual-channel capacity while alleviating single-channel overload. However, phonological loop efficiency is not absolute but rather is modulated by multiple acoustic parameters, including emotional valence, spatial characteristics, linguistic features, and perceptual complexity [9].
The emotional qualities of sound significantly shape how we process and remember words. Dolcos et al. reported that when sounds carry positive emotions, they boost memory accuracy for word meanings by creating constructive interactions between the brain’s emotion processing centers and working memory systems [10]. Research by Kensinger confirmed that people recall word meanings more accurately when hearing positive voices compared to neutral tones. Conversely, negative emotional speech, such as angry tones, triggers defensive reactions through amygdala activation [11]. Mickley Steinmetz et al. and Bigot et al. indicated that this redirects mental resources from understanding word meanings towards processing emotional content, resulting in measurable memory impairment [12,13]. Eye-tracking studies demonstrate this emotional modulation in real time: positive speech shortens initial fixation durations on abstract words, while negative speech causes visible attention disruption through increased pupil dilation. These physiological responses, reported by Mueller & Kuchinke, reveal how sound’s emotional properties regulate memory formation through early-stage attention control mechanisms [14].
Neurocognitive research demonstrates that spatial attributes of auditory stimuli optimize crossmodal resource allocation through ecological cue simulation [15,16]. When lexical items are paired with spatially localized sounds, empirical evidence reveals significantly enhanced free recall accuracy (Δaccuracy = +18.7%, p < 0.001) [17], indicating privileged access to memory traces via spatial coding mechanisms. This phenomenon originates from dynamic phase coupling between auditory spatial working memory buffers and hippocampal place cell networks [18]. Complementary eye-tracking evidence confirms that 3D auditory cues reduce visual-channel dependency by 23.7% (d = 1.15), demonstrating efficient crossmodal resource substitution [19,20].
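To make the reported statistics concrete, the following minimal Python sketch shows how a between-group recall difference (Δaccuracy) and its standardized effect size (Cohen’s d) could be computed; the group labels and accuracy values are purely hypothetical illustrations, not data from the cited studies.

```python
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation for two independent groups."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    n_a, n_b = len(group_a), len(group_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (mean_a - mean_b) / pooled_sd

# Hypothetical free-recall accuracy (proportion correct) per participant.
spatial_sound = [0.78, 0.82, 0.74, 0.80, 0.85, 0.77]  # words paired with spatially localized sounds
non_spatial   = [0.61, 0.58, 0.65, 0.60, 0.63, 0.59]  # words paired with non-localized sounds

delta_accuracy = statistics.mean(spatial_sound) - statistics.mean(non_spatial)
print(f"Δaccuracy = {delta_accuracy:+.1%}")
print(f"Cohen's d = {cohens_d(spatial_sound, non_spatial):.2f}")
```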
Combining spatial attributes with the semantic processing power of linguistic features allows for precise attentional guidance. Rhythmic and rhyming speech patterns can improve the efficiency of the phonological loop, as recently reported by Saito et al. [21]. Similarly, semantically congruent sound pairings (e.g., pairing ‘water’ with the sound of dripping water) can strengthen the semantic network. Empirical data by Gupta & Tisdale suggest an 18% memory advantage for rhyming words compared to a no-rhyme condition, and semantically related onomatopoeia reduces recognition reaction times by 210 ms [22]. Eye-tracking measures performed by Huettig & McQueen show that fixation frequency on words decreases under high semantic congruence, suggesting that memory efficiency increases through optimal integration of linguistic features [23].
The complexity of auditory stimuli may counteract these language-driven enhancement effects through resource competition. Clear monophonic speech optimizes phonological loop efficiency, whereas complex speech increases cognitive load, as reported by Bidelman & Krishnan [24]. Experimental evidence reported by Fernández-Quezada et al. suggests a nonlinear relationship between noise intensity and memory performance. Eye-tracking metrics reveal complexity-induced attentional disturbances [25], with increased variability in gaze duration for key information under noisy conditions (SD rising from 120 ms to 280 ms), suggesting attentional instability due to resource competition, as reported by Marsh et al. [26].
Purely auditory input exhibits a fundamental limitation in supporting long-term consolidation of form-meaning associations due to the absence of visual cues, as reported by Amedi et al. [27]. Eye-tracking studies by Mayberry et al. demonstrate that visual information significantly enhances memory stability through the redundant encoding of form-meaning mappings [28]. These findings collectively indicate that unimodal auditory input without visual supplementation reduces memory persistence.
In summary, previous scholars have investigated multiple dimensions of sound itself, including its affective, spatial (left and right channel), linguistic-structural, and complexity properties. However, few experiments have directly examined how sound facilitates word memory, which provides new ideas and points of attention for our experiments.
2.1.3. The Influence of Pictures on Memory
In the visual modality, pictures enhance memory performance through semantic activation and visual attention mechanisms. Empirical evidence recently reported by Brewin & Langley and Bainbridge et al. demonstrates that items presented as pictures exhibit higher free verbal recall rates than those presented as text [29,30]. These findings collectively indicate that pictorial presentation confers a distinct advantage in memory facilitation. However, the magnitude of this mnemonic benefit is contingent upon several factors: the dynamic characteristics of the visual stimuli, specific pictorial attributes (including contrast and color properties), and the degree of congruence between pictorial and verbal information.
Empirical studies by Cheng et al., Li et al., and Shang et al. demonstrate that motion-enhanced visual stimuli (e.g., GIF animations, brief video sequences) significantly potentiate initial attentional capture through motion salience, while concurrently elevating visual distraction indices [31,32,33]. Comparative analyses by Cilia et al. [34] reveal substantially shorter target acquisition latencies for dynamic versus static stimuli, attributable to motion trajectories automatically triggering visual orienting reflexes. Directional motion indicators (e.g., animated arrows) enhance gaze allocation to critical regions during cued search paradigms. However, complex visual scenes containing multiple concurrent object movements impair automatic symbolic cue processing, inducing saccadic pattern disruption, as documented by Guzzon et al. [35]. These collective findings support an inverted U-shaped relationship between motion intensity and mnemonic facilitation, with moderate dynamism optimizing attentional guidance efficacy.
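One way such an inverted U-shaped relationship could be quantified is by fitting a quadratic term to recall performance as a function of motion intensity; the sketch below illustrates this with hypothetical values (the intensity scale and accuracies are assumptions, not results from the cited studies), where a negative quadratic coefficient and a vertex within the tested range would indicate that moderate dynamism is optimal.

```python
import numpy as np

# Hypothetical data: motion intensity (arbitrary 0-10 scale) vs. recall accuracy.
motion_intensity = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
recall_accuracy  = np.array([0.55, 0.62, 0.68, 0.73, 0.76, 0.77,
                             0.75, 0.71, 0.66, 0.60, 0.54])

# Fit recall = a*intensity^2 + b*intensity + c.
a, b, c = np.polyfit(motion_intensity, recall_accuracy, deg=2)

# An inverted U implies a < 0; the vertex estimates the "optimal" intensity.
optimal_intensity = -b / (2 * a)
print(f"quadratic coefficient a = {a:.4f} (negative => inverted U)")
print(f"estimated optimal motion intensity ≈ {optimal_intensity:.1f}")
```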
Theeuwes, Gao et al., and Rosenholtz et al. noted that in picture displays, high contrast combined with warm colors (e.g., red, yellow) optimizes early attentional allocation by increasing visual salience but may impair deep semantic processing [36,37,38]. Eye-tracking data by Zhang et al. further reveal that high-contrast images elicit significantly denser gaze hotspot clustering than low-contrast images, yet they also increase semantic association errors [39]. Wang et al. argued that these findings imply that perceptual properties must be systematically integrated with picture–word congruence: while high contrast facilitates encoding when semantic clarity is high, it amplifies surface–depth processing conflicts under semantic ambiguity [40].
Mayer, Sadoski & Paivio, and Cao et al. reported that pictures exhibiting high semantic alignment with lexical content enhance memory encoding through “semantic resonance” mechanisms [41,42,43]. However, Zhang et al. indicated that complex pictorial stimuli require additional cognitive resources for visual filtering, which compromises vocabulary semantic integration and long-term retention [39]. Empirical evidence by Rey further demonstrates that irrelevant pictorial elements reduce learning efficiency by approximately 18%, indicating that visual interference significantly impairs semantic integration processes [44].
Notably, pictorial stimuli exhibit distinct temporal processing characteristics compared to textual stimuli, as reported by Potter et al., Hochstein & Ahissar, and Ma et al. [45,46,47]. Eye-tracking studies demonstrate that image processing involves unique temporal dynamics with potential interference effects. Underwood & Foulsham reported that pictorial stimuli capture attention faster than text, as reflected in significantly shorter first fixation durations. These findings collectively substantiate the fundamental role of pictorial elements as primary visual access points in information processing systems [48].
In summary, existing research has systematically investigated multiple aspects of pictorial processing through eye-tracking methodologies, including dynamic visual features, picture-form memory effects under picture–word congruence, and temporal processing characteristics with potential interference effects. However, our review identified a notable research gap: few experimental studies have specifically examined pictorial facilitation effects on lexical memory retention or explored subsequent memory consolidation processes. An exception is work on auditory integration, which introduces a further experimental dimension. To bridge this gap, we designed a follow-up study integrating these two critical dimensions.
2.1.4. Multimodal Interaction Effects of Sound and Image
The synergistic effect of crossmodal integration arises from the complementary and dynamic interaction of multisensory information channels. The combination of auditory and visual stimuli initiates a “dual encoding-cross-reinforcement” mechanism: while speech input activates phonological word representations, pictorial stimuli engage semantic networks, with both modalities converging to strengthen memory traces through associative encoding in the hippocampus. Empirical evidence by Clark & Paivio supports this mechanism: for instance, simultaneous presentation of the word “bird” with corresponding visual imagery enhances long-term memory retention through the cross-activation of phonological and visual coding systems [49].
Oculomotor indicators also provide fine-grained evidence for multimodal integration. For example, a series of studies by Heikkilä & Räihä found that an audiovisual congruent group switched gaze 40% more frequently than a unimodal group, suggesting that multimodal information facilitates dynamic integration and that switching paths are more focused on semantically critical areas [50]. It has also been found that the frequency of gaze switching was significantly higher in a multimodal group than in a unimodal group (25% more on average), reflecting the observation by Paré et al. that learners more frequently integrate content across auditory and visual information streams [51]. Together, these results suggest that multimodal input enhances learning efficiency by optimizing resource allocation.
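For illustration, gaze-switching frequency of the kind reported above can be derived from a fixation sequence in which each fixation is labeled with an area of interest (AOI); the sketch below assumes hypothetical “picture” and “text” AOIs and invented trial sequences rather than the coding scheme of the cited studies.

```python
def count_aoi_switches(fixation_aois):
    """Count transitions between different areas of interest (AOIs)
    in a chronologically ordered list of fixation labels."""
    return sum(1 for prev, curr in zip(fixation_aois, fixation_aois[1:]) if prev != curr)

# Hypothetical fixation sequences (one trial per condition).
multimodal_trial = ["picture", "text", "picture", "text", "picture", "picture", "text"]
unimodal_trial   = ["text", "text", "text", "picture", "text", "text", "text"]

switches_multi = count_aoi_switches(multimodal_trial)  # 5 switches
switches_uni   = count_aoi_switches(unimodal_trial)    # 2 switches
relative_increase = (switches_multi - switches_uni) / switches_uni
print(f"multimodal: {switches_multi} switches, unimodal: {switches_uni} switches "
      f"({relative_increase:.0%} more gaze switching)")
```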
However, Mayer (2001) suggests that multimodal designs must strictly adhere to the “consistency principle”. Eitel et al. found that inconsistent audiovisual information induces competing representations in working memory, leading to memory fragmentation [52]. Empirical evidence from eye-tracking studies confirms this effect: Eitel et al. observed a 30% increase in eye movements toward verbal stimuli in an incongruent multimodal condition, suggesting that learners repeatedly attempt to reconcile conflicting information. This inefficient integration process depletes cognitive resources and ultimately reduces learning efficiency [52].
In summary, the research reviewed here has examined the learning effects of sound, pictures, and their multimodal combination, which we have analyzed inductively. However, two key limitations remain. First, analyses of multimodal synergies have relied heavily on traditional statistical methods (e.g., ANOVA) and lack dynamic quantitative tools to characterize nonlinear interactions between auditory and visual stimuli over time. Second, assessments of cognitive resource allocation often rely on subjective reports or isolated behavioral metrics (e.g., reaction time), failing to use micro-behavioral data from eye tracking (e.g., average saccade [ASC], standard deviation of fixation duration [SD-FD]) to quantify the process of attentional reallocation.
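To illustrate the kind of micro-behavioral indices referred to here, the following minimal sketch computes an average number of saccades per trial (one possible reading of ASC) and the standard deviation of fixation duration (SD-FD) from an event-level eye-tracking log; the log format, the ASC definition, and the sample values are assumptions for illustration rather than the metric definitions used in this study.

```python
import statistics
from collections import defaultdict

# Hypothetical event log: (trial_id, event_type, duration_ms).
events = [
    (1, "fixation", 210), (1, "saccade", 35), (1, "fixation", 180),
    (1, "saccade", 42),   (1, "fixation", 320),
    (2, "fixation", 150), (2, "saccade", 30), (2, "fixation", 260),
]

saccades_per_trial = defaultdict(int)
fixation_durations = []

for trial_id, event_type, duration_ms in events:
    if event_type == "saccade":
        saccades_per_trial[trial_id] += 1
    elif event_type == "fixation":
        fixation_durations.append(duration_ms)

# ASC: mean number of saccades across trials (assumed definition).
asc = statistics.mean(saccades_per_trial.values())
# SD-FD: standard deviation of fixation durations (in ms).
sd_fd = statistics.stdev(fixation_durations)

print(f"ASC (mean saccades per trial): {asc:.1f}")
print(f"SD-FD (SD of fixation duration): {sd_fd:.1f} ms")
```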