2.1.1. Theoretical Foundations of Multimodal Learning
Multimodal learning operates via the brain’s parallel processing architecture, fundamentally anchored in Paivio’s (1971) Dual-Coding Theory (DCT) [5]. This framework establishes that verbal and nonverbal inputs generate segregated yet interconnected memory representations, with empirical evidence confirming that auditory word forms and visual images create distinct but mutually reinforcing cognitive traces [6]. These dual pathways enhance recall through two complementary neurocognitive mechanisms: (a) activating parallel neural routes to strengthen memory encoding, and (b) providing enriched retrieval cues, specifically through auditory and visual stimuli serving as cognitive anchors.
Building upon these principles, Mayer and Moreno’s Cognitive Theory of Multimedia Learning (CTML) [7] provides an evidence-based instructional design framework built on three core assumptions: (1) visual and auditory channels independently process pictorial and verbal information; (2) limited working memory capacity constrains information intake; and (3) active cognitive integration synthesizes multisensory inputs. These mechanisms collectively resolve the fundamental tension between multimodal stimulation and cognitive load constraints, thereby enabling efficient knowledge construction.
This integrated theoretical foundation offers robust scaffolding for the empirical investigation of sensory-mediated learning. In the present study, we operationalize auditory and visual stimuli within cognitive retrieval paradigms to systematically examine unimodal (auditory-only or visual-only) and multimodal (audiovisual) processing conditions.
2.1.2. The Influence of Sound on Memory
The auditory modality plays a distinctive role in lexical memory through phonological processing mechanisms. Gkalitsiou & Byrd noted that the phonological loop, a core component of the working memory system, is specialized for the temporary storage and manipulation of phonological information (e.g., verbal repetition, digit span tasks) [8]. Speech input can effectively reduce visual-channel cognitive load by distributing processing across multiple sensory modalities. For instance, Sweller et al. noted that concurrent narration with visual animation capitalizes on working memory’s dual-channel capacity while alleviating single-channel overload. However, phonological loop efficiency is not absolute but rather is modulated by multiple acoustic parameters, including emotional valence, spatial characteristics, linguistic features, and perceptual complexity [9].
The emotional qualities of sound significantly shape how we process and remember words. Dolcos et al. reported that when sounds carry positive emotions, they boost memory accuracy for word meanings by creating constructive interactions between the brain’s emotion processing centers and working memory systems [10]. Research by Kensinger confirmed that people recall word meanings more accurately when hearing positive voices compared to neutral tones. Conversely, negative emotional speech, such as angry tones, triggers defensive reactions through amygdala activation [11]. Mickley Steinmetz et al. and Bigot et al. indicated that this redirects mental resources from understanding word meanings towards processing emotional content, resulting in measurable memory impairment [12,13]. Eye-tracking studies demonstrate this emotional modulation in real time: positive speech shortens initial fixation durations on abstract words, while negative speech causes visible attention disruption through increased pupil dilation. These physiological responses, reported by Mueller & Kuchinke, reveal how sound’s emotional properties regulate memory formation through early-stage attention control mechanisms [14].
Neurocognitive research demonstrates that spatial attributes of auditory stimuli optimize crossmodal resource allocation through ecological cue simulation [15,16]. When lexical items are paired with spatially localized sounds, empirical evidence reveals significantly enhanced free recall accuracy (Δaccuracy = +18.7%, p < 0.001) [17], indicating privileged access to memory traces via spatial coding mechanisms. This phenomenon originates from dynamic phase coupling between auditory spatial working memory buffers and hippocampal place cell networks [18]. Complementary eye-tracking evidence confirms that 3D auditory cues reduce visual-channel dependency by 23.7% (d = 1.15), demonstrating efficient crossmodal resource substitution [19,20].
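To make the reported statistics concrete, the following minimal Python sketch shows how a between-group recall difference (Δaccuracy) and its standardized effect size (Cohen’s d) could be computed; the group labels and accuracy values are purely hypothetical illustrations, not data from the cited studies.

```python
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation for two independent groups."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    n_a, n_b = len(group_a), len(group_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (mean_a - mean_b) / pooled_sd

# Hypothetical free-recall accuracy (proportion correct) per participant.
spatial_sound = [0.78, 0.82, 0.74, 0.80, 0.85, 0.77]  # words paired with spatially localized sounds
non_spatial   = [0.61, 0.58, 0.65, 0.60, 0.63, 0.59]  # words paired with non-localized sounds

delta_accuracy = statistics.mean(spatial_sound) - statistics.mean(non_spatial)
print(f"Δaccuracy = {delta_accuracy:+.1%}")
print(f"Cohen's d = {cohens_d(spatial_sound, non_spatial):.2f}")
```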
Combining spatial attributes with the semantic processing power of linguistic features allows for precise attentional guidance. Rhythmic and rhyming speech patterns can improve the efficiency of the phonological loop, as recently reported by Saito et al. [21]. Similarly, semantically congruent sound pairings (e.g., pairing ‘water’ with the sound of dripping water) can strengthen the semantic network. Empirical data by Gupta & Tisdale suggest an 18% memory advantage for rhyming words compared to a no-rhyme condition, and semantically related onomatopoeia reduces recognition reaction times by 210 ms [22]. Eye-tracking measures performed by Huettig & McQueen show that fixation frequency on words decreases under high semantic congruence, suggesting that memory efficiency increases through optimal integration of linguistic features [23].
The complexity of auditory stimuli may counteract these language-driven enhancement effects through resource competition. Clear monophonic speech optimizes phonological loop efficiency, whereas complex speech increases cognitive load, as reported by Bidelman & Krishnan [24]. Experimental evidence reported by Fernández-Quezada et al. suggests a nonlinear relationship between noise intensity and memory performance. Eye-tracking metrics reveal complexity-induced attentional disturbances [25], with increased variability in gaze duration for key information under noisy conditions (SD rising from 120 ms to 280 ms), suggesting attentional instability due to resource competition, as reported by Marsh et al. [26].
Purely auditory input exhibits a fundamental limitation in supporting long-term consolidation of form-meaning associations due to the absence of visual cues, as reported by Amedi et al. [27]. Eye-tracking studies by Mayberry et al. demonstrate that visual information significantly enhances memory stability through the redundant encoding of form-meaning mappings [28]. These findings collectively indicate that unimodal auditory input without visual supplementation reduces memory persistence.
In summary, previous scholars have investigated multiple dimensions of sound itself, including its affective, spatial (left and right channel), linguistic-structural, and complexity properties. However, few experiments have directly examined how sound facilitates word memory, which provides new ideas and points of attention for our experiments.
2.1.3. The Influence of Pictures on Memory
In the visual modality, pictures enhance memory performance through semantic activation and visual attention mechanisms. Empirical evidence recently reported by Brewin & Langley and Bainbridge et al. demonstrates that items presented as pictures exhibit higher free verbal recall rates than those presented as text [29,30]. These findings collectively indicate that pictorial presentation confers a distinct advantage in memory facilitation. However, the magnitude of this mnemonic benefit is contingent upon several factors: the dynamic characteristics of the visual stimuli, specific pictorial attributes (including contrast and color properties), and the degree of congruence between pictorial and verbal information.
Empirical studies by Cheng et al., Li et al., and Shang et al. demonstrate that motion-enhanced visual stimuli (e.g., GIF animations, brief video sequences) significantly potentiate initial attentional capture through motion salience, while concurrently elevating visual distraction indices [31,32,33]. Comparative analyses by Cilia et al. [34] reveal substantially shorter target acquisition latencies for dynamic versus static stimuli, attributable to motion trajectories automatically triggering visual orienting reflexes. Directional motion indicators (e.g., animated arrows) enhance gaze allocation to critical regions during cued search paradigms. However, complex visual scenes containing multiple concurrent object movements impair automatic symbolic cue processing, inducing saccadic pattern disruption, as documented by Guzzon et al. [35]. These collective findings support an inverted U-shaped relationship between motion intensity and mnemonic facilitation, with moderate dynamism optimizing attentional guidance efficacy.
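One way such an inverted U-shaped relationship could be quantified is by fitting a quadratic term to recall performance as a function of motion intensity; the sketch below illustrates this with hypothetical values (the intensity scale and accuracies are assumptions, not results from the cited studies), where a negative quadratic coefficient and a vertex within the tested range would indicate that moderate dynamism is optimal.

```python
import numpy as np

# Hypothetical data: motion intensity (arbitrary 0-10 scale) vs. recall accuracy.
motion_intensity = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
recall_accuracy  = np.array([0.55, 0.62, 0.68, 0.73, 0.76, 0.77,
                             0.75, 0.71, 0.66, 0.60, 0.54])

# Fit recall = a*intensity^2 + b*intensity + c.
a, b, c = np.polyfit(motion_intensity, recall_accuracy, deg=2)

# An inverted U implies a < 0; the vertex estimates the "optimal" intensity.
optimal_intensity = -b / (2 * a)
print(f"quadratic coefficient a = {a:.4f} (negative => inverted U)")
print(f"estimated optimal motion intensity ≈ {optimal_intensity:.1f}")
```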
Theeuwes, Gao et al., and Rosenholtz et al. noted that in picture displays, high contrast combined with warm colors (e.g., red, yellow) optimizes early attentional allocation by increasing visual salience but may impair deep semantic processing [36,37,38]. Eye-tracking data by Zhang et al. further reveal that high-contrast images elicit significantly denser gaze hotspot clustering than low-contrast images, yet they also increase semantic association errors [39]. Wang et al. argued that these findings imply that perceptual properties must be systematically integrated with picture–word congruence: while high contrast facilitates encoding when semantic clarity is high, it amplifies surface–depth processing conflicts under semantic ambiguity [40].
Mayer, Sadoski & Paivio, and Cao et al. reported that pictures exhibiting high semantic alignment with lexical content enhance memory encoding through “semantic resonance” mechanisms [41,42,43]. However, Zhang et al. indicated that complex pictorial stimuli require additional cognitive resources for visual filtering, which compromises vocabulary semantic integration and long-term retention [39]. Empirical evidence by Rey further demonstrates that irrelevant pictorial elements reduce learning efficiency by approximately 18%, indicating that visual interference significantly impairs semantic integration processes [44].
Notably, pictorial stimuli exhibit distinct temporal processing characteristics compared to textual stimuli, as reported by Potter et al., Hochstein & Ahissar, and Ma et al. [45,46,47]. Eye-tracking studies demonstrate that image processing involves unique temporal dynamics with potential interference effects. Underwood & Foulsham reported that pictorial stimuli capture attention faster than text, as reflected in significantly shorter first fixation durations. These findings collectively substantiate the fundamental role of pictorial elements as primary visual access points in information processing systems [48].
In summary, existing research has systematically investigated multiple aspects of pictorial processing through eye-tracking methodologies, including dynamic visual features, picture-form memory effects under picture–word congruence, and temporal processing characteristics with potential interference effects. However, our review identified a notable research gap: few experimental studies have specifically examined pictorial facilitation effects on lexical memory retention or explored subsequent memory consolidation processes. An exception is work on auditory integration, which introduces a further experimental dimension. To bridge this gap, we designed a follow-up study integrating these two critical dimensions.
2.1.4. Multimodal Interaction Effects of Sound and Image
The synergistic effect of crossmodal integration arises from the complementary and dynamic interaction of multisensory information channels. The combination of auditory and visual stimuli initiates a “dual encoding-cross-reinforcement” mechanism: while speech input activates phonological word representations, pictorial stimuli engage semantic networks, with both modalities converging to strengthen memory traces through associative encoding in the hippocampus. Empirical evidence by Clark & Paivio supports this mechanism: for instance, simultaneous presentation of the word “bird” with corresponding visual imagery enhances long-term memory retention through the cross-activation of phonological and visual coding systems [49].
Oculomotor indicators also provide fine-grained evidence for multimodal integration. For example, a series of studies by Heikkilä & Räihä found that an audiovisual congruent group switched gaze 40% more frequently than a unimodal group, suggesting that multimodal information facilitates dynamic integration and that switching paths are more focused on semantically critical areas [50]. It has also been found that the frequency of gaze switching was significantly higher in a multimodal group than in a unimodal group (25% more on average), reflecting the observation by Paré et al. that learners more frequently integrate content across auditory and visual information streams [51]. Together, these results suggest that multimodal input enhances learning efficiency by optimizing resource allocation.
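For illustration, gaze-switching frequency of the kind reported above can be derived from a fixation sequence in which each fixation is labeled with an area of interest (AOI); the sketch below assumes hypothetical “picture” and “text” AOIs and invented trial sequences rather than the coding scheme of the cited studies.

```python
def count_aoi_switches(fixation_aois):
    """Count transitions between different areas of interest (AOIs)
    in a chronologically ordered list of fixation labels."""
    return sum(1 for prev, curr in zip(fixation_aois, fixation_aois[1:]) if prev != curr)

# Hypothetical fixation sequences (one trial per condition).
multimodal_trial = ["picture", "text", "picture", "text", "picture", "picture", "text"]
unimodal_trial   = ["text", "text", "text", "picture", "text", "text", "text"]

switches_multi = count_aoi_switches(multimodal_trial)  # 5 switches
switches_uni   = count_aoi_switches(unimodal_trial)    # 2 switches
relative_increase = (switches_multi - switches_uni) / switches_uni
print(f"multimodal: {switches_multi} switches, unimodal: {switches_uni} switches "
      f"({relative_increase:.0%} more gaze switching)")
```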
However, Mayer (2001) suggests that multimodal designs must strictly adhere to the “consistency principle”. Eitel et al. found that inconsistent audiovisual information induces competing representations in working memory, leading to memory fragmentation [52]. Empirical evidence from eye-tracking studies confirms this effect: Eitel et al. observed a 30% increase in eye movements toward verbal stimuli in an incongruent multimodal condition, suggesting that learners repeatedly attempt to reconcile conflicting information. This inefficient integration process depletes cognitive resources and ultimately reduces learning efficiency [52].
In summary, the research reviewed here has examined the learning effects of sound, pictures, and their multimodal combination, which we have analyzed inductively. However, two key limitations remain. First, analyses of multimodal synergies have relied heavily on traditional statistical methods (e.g., ANOVA) and lack dynamic quantitative tools to characterize nonlinear interactions between auditory and visual stimuli over time. Second, assessments of cognitive resource allocation often rely on subjective reports or isolated behavioral metrics (e.g., reaction time), failing to use micro-behavioral data from eye tracking (e.g., average saccade [ASC], standard deviation of fixation duration [SD-FD]) to quantify the process of attentional reallocation.
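To illustrate the kind of micro-behavioral indices referred to here, the following minimal sketch computes an average number of saccades per trial (one possible reading of ASC) and the standard deviation of fixation duration (SD-FD) from an event-level eye-tracking log; the log format, the ASC definition, and the sample values are assumptions for illustration rather than the metric definitions used in this study.

```python
import statistics
from collections import defaultdict

# Hypothetical event log: (trial_id, event_type, duration_ms).
events = [
    (1, "fixation", 210), (1, "saccade", 35), (1, "fixation", 180),
    (1, "saccade", 42),   (1, "fixation", 320),
    (2, "fixation", 150), (2, "saccade", 30), (2, "fixation", 260),
]

saccades_per_trial = defaultdict(int)
fixation_durations = []

for trial_id, event_type, duration_ms in events:
    if event_type == "saccade":
        saccades_per_trial[trial_id] += 1
    elif event_type == "fixation":
        fixation_durations.append(duration_ms)

# ASC: mean number of saccades across trials (assumed definition).
asc = statistics.mean(saccades_per_trial.values())
# SD-FD: standard deviation of fixation durations (in ms).
sd_fd = statistics.stdev(fixation_durations)

print(f"ASC (mean saccades per trial): {asc:.1f}")
print(f"SD-FD (SD of fixation duration): {sd_fd:.1f} ms")
```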