4.1. Extraction of Key Visual Elements Based on Gaze Information and Attention Patterns
To enhance color perception accuracy and ensure personalized musical generation, this experiment employed the state-of-the-art Tobii Eye Tracker 5 system to collect high-precision eye movement data from children. The collected metrics included gaze duration (measured in milliseconds), eye movement trajectory (recorded as x, y coordinates), fixation sequences, and areas of interest (AOIs) with associated attention weights. This approach enabled dynamic element extraction based on each user's unique visual attention patterns, accurately reflecting personal focus areas and aesthetic preferences.
For eye movement analysis, we concentrated on children’s gaze information with particular emphasis on temporal dynamics and spatial distribution. Key visual elements were extracted based on empirically determined gaze duration thresholds derived from pilot studies. To minimize measurement error and ensure meaningful engagement, extraction occurred only when gaze duration exceeded 2000 ms within a specific spatial area (defined as a 100 × 100 pixel region). The benchmark for the continuous viewing time of each image is 60 s, typically yielding at least four meaningfully extracted elements per image, though this varied based on individual viewing patterns. The inherently individualized nature of gaze patterns, duration, and sequential viewing order enabled truly personalized element extraction, serving as crucial differentiating factors in subsequent music generation and ensuring each child received a unique auditory experience tailored to their visual engagement.
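The extraction rule above (keep a region only when gaze dwells within a 100 × 100 pixel window for more than 2000 ms) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `GazeSample` structure and the greedy clustering strategy are assumptions.

```python
from dataclasses import dataclass

@dataclass
class GazeSample:
    x: float    # screen x coordinate (pixels)
    y: float    # screen y coordinate (pixels)
    t_ms: float # timestamp (milliseconds)

def extract_elements(samples, region_px=100, min_dwell_ms=2000):
    """Greedily group consecutive samples that stay within a
    region_px x region_px window; keep clusters whose dwell time
    exceeds min_dwell_ms. Returns (center_x, center_y, dwell_ms)."""
    elements = []
    if not samples:
        return elements

    def close(cluster):
        dwell = cluster[-1].t_ms - cluster[0].t_ms
        if dwell >= min_dwell_ms:
            n = len(cluster)
            elements.append((sum(p.x for p in cluster) / n,
                             sum(p.y for p in cluster) / n, dwell))

    cluster = [samples[0]]
    for s in samples[1:]:
        xs = [p.x for p in cluster] + [s.x]
        ys = [p.y for p in cluster] + [s.y]
        if max(xs) - min(xs) <= region_px and max(ys) - min(ys) <= region_px:
            cluster.append(s)   # still inside the spatial window
        else:
            close(cluster)      # gaze left the region: finalize cluster
            cluster = [s]
    close(cluster)
    return elements
```

With samples at the tracker's 133 Hz rate (about 7.5 ms apart), a steady gaze of roughly 3 s on one spot yields a single extracted element, while brief glances elsewhere are discarded.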
This study aimed to investigate and quantify eye movement patterns of first-grade children during artwork viewing tasks across three carefully controlled interaction modes: (1) artwork viewing without musical accompaniment (control condition), (2) artwork viewing with AI-generated music based on real-time visual information processing, and (3) artwork viewing with pre-selected custom background music chosen by experts (For details, see
Section 4.2). We hypothesized that children viewing artworks accompanied by AI-composed music would demonstrate significantly more effective eye-movement patterns, including extended gaze duration, reduced saccadic movements, more focused AOI concentration, and enhanced visual exploration, compared to both no-music and generic custom background music conditions.
4.2. Participant Demographics and Selection Criteria
This investigation formed an integral component of a multi-year research program exploring the complex relationship between AI-generated music, aesthetic perception, and artwork appreciation in school-aged children. Given the distinct developmental stages characterizing children's cognitive, perceptual, and emotional maturation, participant age was carefully defined based on developmental psychology principles. The study recruited 120 first-grade students (age range: 6.0–7.3 years; mean age: 6.1 ± 0.4 years; baseline attention span: ≥15 s; 95% of the children had more than two months of music theory instruction), along with 16 parents (8 mothers, 8 fathers) and 5 experienced preschool teachers (minimum 5 years of teaching experience) from three international primary schools in Shijiazhuang, China. Written informed consent was obtained from all guardians, child assent was secured through age-appropriate procedures, and the study protocol received full approval from the Institutional Review Board of Hainan University.
The study employed a randomized three-arm trial design. The 120 children were randomly assigned, using computer-generated randomization, to three experimental groups (A, B, and C) of 40 participants each, with stratification ensuring balanced gender distribution. Random assignment was employed to evenly distribute potential confounding factors (e.g., musical familiarity, prior art exposure) across conditions. Group A (control) viewed paintings without background music in a quiet environment; Group B (experimental) viewed paintings accompanied by AI-composed background music. These compositions were pre-generated based on the aggregate gaze patterns of a separate cohort of children during pilot testing to semantically and emotionally match each artwork (
Figure 5); Group C (comparison) viewed paintings accompanied by customized background music selected based on a ranking survey of 30 music-related practitioners (5 years of working experience). The primary outcome was average fixation duration, with secondary outcomes including parent/teacher Likert ratings and AOI focus metrics. The researchers initially curated 15 candidate musical pieces from mainstream music platforms. Using a standardized ranking methodology, the practitioners evaluated music–painting congruence, with the highest-ranked selections serving as experimental stimuli. Additionally, the order in which each child viewed the paintings was randomized. Teachers and parents participated in a separate evaluation phase, experiencing both the AI-generated music and no-music conditions to provide expert assessment of the system's educational value.
4.3. Artificial Intelligence Composition Process and Musical Generation
The AI-composed music was generated prior to the main study. The generation workflow involved training our model on gaze data and color mappings from pilot studies. The system was designed to create a musical piece that, based on the pilot data, represented a coherent auditory counterpart to the artwork's color composition and typical viewing patterns. It first processes children's eye-tracking data, identifying gaze hotspot regions, dwell times, and scan paths during artwork viewing. Based on a 10 × 10 spatial grid division, each gazed region generates corresponding notes according to its dominant color, with gaze duration determining note duration values and scan order establishing temporal note relationships. This mapping process produces an initial note sequence serving as conditional input for the LSTM model, guiding subsequent music generation.
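The grid-cell-to-note mapping described above can be sketched as follows. The pentatonic pitch set, the hue-based color mapping, and the dwell-time thresholds are illustrative assumptions; the paper does not specify the exact correspondence used.

```python
import colorsys

# Assumed pitch vocabulary: C major pentatonic over two octaves (MIDI numbers)
PENTATONIC = [60, 62, 64, 67, 69, 72, 74, 76, 79, 81]

def color_to_pitch(r, g, b):
    """Map a dominant RGB color to a MIDI pitch via its hue (assumed rule)."""
    h, _, _ = colorsys.rgb_to_hls(r / 255, g / 255, b / 255)  # h in [0, 1)
    return PENTATONIC[min(int(h * len(PENTATONIC)), len(PENTATONIC) - 1)]

def gaze_to_notes(gazed_cells):
    """gazed_cells: list of ((r, g, b), dwell_ms) tuples, in scan order.
    Returns (pitch, duration_beats) pairs forming the seed sequence
    that conditions the LSTM model."""
    notes = []
    for (r, g, b), dwell_ms in gazed_cells:
        # Longer gaze on a cell -> longer note value (assumed thresholds)
        if dwell_ms >= 2000:
            beats = 1.0    # quarter note
        elif dwell_ms >= 1000:
            beats = 0.5    # eighth note
        else:
            beats = 0.25   # sixteenth note
        notes.append((color_to_pitch(r, g, b), beats))
    return notes
```

Scan order is preserved simply by keeping the list order, which establishes the temporal relationships between notes before the autoregressive model extends the sequence.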
The generation phase employs an autoregressive approach to progressively construct complete melodies. At each time step, the model calculates the probability distribution for the next event through forward propagation based on previously generated event sequences and initial conditions. Each generated event undergoes validity checking to ensure pitch remains within acceptable ranges, duration conforms to basic rhythmic patterns, and velocity changes occur smoothly and naturally. The post-processing stage optimizes and adjusts raw generated sequences. First, rhythmic quantization aligns durations to nearest standard note values (e.g., quarter notes, eighth notes), ensuring rhythmic regularity. Second, pitch range checking prevents extreme pitches outside children’s auditory comfort zones. Finally, structural validation ensures generated melodies maintain overall framework consistency with initial color–note mappings without deviating excessively. The entire generation process typically completes within seconds, achieving real-time transformation from visual input to auditory output, providing children with immediate, personalized audiovisual synesthetic experiences. This study successfully generated personalized musical compositions corresponding to three carefully selected paintings (designated as Art Work 1: “Polychrome Rhythm”—a portrait composition; Art Work 2: “Happy Family”—a portrait composition; Art Work 3: “Jungle Adventure”—an animal scene) through our novel AI composition system based on real-time visual information processing (
Figure 6).
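The post-processing stage described above (rhythmic quantization, pitch-range checking) might look like the following sketch. The standard note grid and the C3–C6 comfort range are assumptions for illustration, not values reported by the authors.

```python
# Assumed standard note values in beats: whole down to sixteenth
NOTE_GRID = [4.0, 2.0, 1.0, 0.5, 0.25]
# Assumed children's auditory comfort range (MIDI C3..C6)
PITCH_MIN, PITCH_MAX = 48, 84

def quantize_duration(beats):
    """Snap a raw duration to the nearest standard note value."""
    return min(NOTE_GRID, key=lambda v: abs(v - beats))

def clamp_pitch(pitch):
    """Fold out-of-range pitches back into the comfort zone by octaves,
    preserving pitch class."""
    while pitch < PITCH_MIN:
        pitch += 12
    while pitch > PITCH_MAX:
        pitch -= 12
    return pitch

def postprocess(events):
    """events: list of (pitch, duration_beats) from the raw model output."""
    return [(clamp_pitch(p), quantize_duration(d)) for p, d in events]
```

Octave folding rather than hard clipping keeps the melodic contour intact while enforcing the range check, which is one reasonable way to satisfy the "without deviating excessively" constraint.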
4.4. Eye Tracking Analysis Methodology
Eye movements were measured binocularly with a Tobii Eye Tracker 5, which has a gaze accuracy of 0.5°, drift of <0.3°, and an acquisition rate of 133 Hz. The three paintings were each displayed on a 15.6-inch monitor with a resolution of 1920 × 1080. Eye movement data acquisition was implemented in Processing 3.5.4. Children sat in a comfortable chair, without head fixation, with the screen positioned in the primary gaze direction at a working distance of 60 cm. A 5-point calibration was completed for each child at the beginning of the data collection session.
Fixations were extracted using a rewritten version of the GazeTrack algorithm implemented in Processing. The algorithm records eye movement trajectories, fixation count, and fixation duration (a fixation was terminated once the gaze trajectory moved beyond a 100 × 100 pixel region). A recording was deemed unreliable if more than 25% of the data were missing or more than 20% of fixations fell outside the painting. During each session, a researcher sat in a corner observing the children and advanced to the next set of pictures when a child looked away from the screen for at least 5 s. Eye movement heat maps were generated with GazePointHeatMap (pygaze.org).
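The two exclusion rules above (more than 25% missing data, or more than 20% of fixations outside the painting) can be expressed as a small screening function. The argument layout is an illustrative assumption, not the study's actual pipeline.

```python
def recording_is_reliable(missing_fraction, fixations, painting_rect,
                          max_missing=0.25, max_outside=0.20):
    """Apply the exclusion criteria to one recording.
    missing_fraction: fraction of gaze samples lost during tracking.
    fixations: list of (x, y) fixation centers in screen pixels.
    painting_rect: (x0, y0, x1, y1) bounds of the painting on screen."""
    if missing_fraction > max_missing:
        return False  # too much missing eye-tracking data
    x0, y0, x1, y1 = painting_rect
    outside = sum(1 for (x, y) in fixations
                  if not (x0 <= x <= x1 and y0 <= y <= y1))
    # An empty fixation list trivially passes the outside-painting check
    return not fixations or outside / len(fixations) <= max_outside
```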
AOIs were defined a priori by two teachers (both with expertise in child development and art education) to identify semantically meaningful regions in each artwork. For portrait compositions (Art Works 1 and 2), meaningful AOIs included facial features, hair, and shoulders. For the animal scene (Art Work 3), meaningful AOIs included the main animal figures and interactive elements.
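With AOIs defined a priori as regions of each artwork, the share of gaze time falling in meaningful areas (a metric reported in the Results) reduces to a simple hit-test. Representing each AOI as an axis-aligned rectangle is an assumption for illustration.

```python
def aoi_gaze_share(fixations, aois):
    """fixations: list of (x, y, duration_ms) tuples.
    aois: dict mapping AOI name -> (x0, y0, x1, y1) rectangle.
    Returns the fraction of total fixation time spent inside any AOI."""
    total = sum(d for _, _, d in fixations)
    if total == 0:
        return 0.0
    in_aoi = 0.0
    for x, y, d in fixations:
        # A fixation counts toward AOI time if it lands in any defined region
        if any(x0 <= x <= x1 and y0 <= y <= y1
               for x0, y0, x1, y1 in aois.values()):
            in_aoi += d
    return in_aoi / total
```

Duration-weighting (rather than counting fixations) matches the interest in how long children dwell on meaningful regions, not merely how often they visit them.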
Statistical analysis was performed using GraphPad Prism version 8.0 (GraphPad, graphpad.com). Given the exploratory nature of this initial study and its primary focus on estimating effect sizes and demonstrating the feasibility of the AI-generated music intervention, group differences for each artwork were investigated using independent-samples t-tests. A Bonferroni correction was applied to the three primary artwork comparisons, setting the significance threshold at p < 0.0167. Effect sizes are reported as Cohen's d with 95% confidence intervals (CIs), calculated using the pooled standard deviation. We acknowledge that a more powerful repeated-measures approach (e.g., a mixed-effects model) accounting for the within-subject correlation across the three artworks would be more appropriate for future confirmatory studies with larger sample sizes. A p value < 0.05 was considered statistically significant for all other comparisons.
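The pooled-variance t statistic and pooled-SD Cohen's d used for these comparisons can be reproduced without a statistics package; this is a generic sketch of the standard formulas, not GraphPad Prism internals.

```python
import math

# Bonferroni-corrected threshold for the three primary artwork comparisons
ALPHA_BONFERRONI = 0.05 / 3  # ~0.0167

def _pooled(x, y):
    """Group means and pooled variance for two independent samples."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)  # sample variances
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    pooled_var = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)
    return mx, my, pooled_var

def t_statistic(x, y):
    """Independent-samples t-test statistic with pooled variance."""
    mx, my, pv = _pooled(x, y)
    return (mx - my) / math.sqrt(pv * (1 / len(x) + 1 / len(y)))

def cohens_d(x, y):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    mx, my, pv = _pooled(x, y)
    return (mx - my) / math.sqrt(pv)
```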
4.5. Results
Of the initial 120 participants recruited, 24 were excluded from final analysis due to invalid data (excessive missing eye-tracking data n = 15, failure to complete all viewing sessions n = 7, technical difficulties n = 2), resulting in a final sample of 96 children. A comparison showed no significant differences in age or gender between excluded and included participants (both p > 0.05), indicating no systematic selection bias. The analyzed sample had a mean age of 6.1 ± 0.4 years with balanced gender distribution (48 males, 48 females). Group A (no-music control) comprised 31 children (18 males, 13 females), Group B (AI composition experimental) included 33 children (14 males, 19 females), and Group C (custom music comparison) contained 32 children (15 males, 17 females). Groups were well-matched on demographic variables with no significant differences in age (F(2, 93) = 0.84, p = 0.435) or gender distribution (χ2 = 2.31, p = 0.315).
Table 1 presents demographic characteristics and eye movement metrics comparing Groups A and B. While no significant age differences existed between groups (t(62) = 0.93,
p = 0.356), all eye movement characteristics showed statistically significant and practically meaningful differences. Children in the AI music condition (Group B) demonstrated significantly longer average fixation durations across all artworks: Art Work 1 (55.31 ± 8.42 s vs. 41.23 ± 9.78 s, t(62) = 6.23,
p = 0.0011, Cohen’s d = 1.31), Art Work 2 (64.42 ± 10.21 s vs. 45.67 ± 11.34 s, t(62) = 7.89,
p < 0.0001, Cohen’s d = 1.59), and Art Work 3 (56.73 ± 9.65 s vs. 37.98 ± 8.91 s, t(62) = 8.12,
p = 0.0002, Cohen’s d = 1.11). Total average fixation duration across all three paintings was significantly higher in Group B (58.82 ± 7.38 s) compared to Group A (41.29 ± 6.92 s), t(62) = 5.17,
p < 0.001, Cohen’s d = 1.31, representing a substantial increase of 17.53 s (42.5% improvement).
Table 2 displays detailed demographic characteristics and eye movement metrics comparing Groups B and C. Despite no significant differences in age (t(63) = 0.77,
p = 0.444) or gender distribution (χ2 = 0.94,
p = 0.332), eye movement characteristics differed dramatically between conditions. Children in the AI music condition (Group B) showed significantly longer average fixation durations for all artworks compared to the custom music condition: Art Work 1 (55.31 ± 8.42 s vs. 31.45 ± 7.23 s, t(63) = 12.34,
p < 0.0001, Cohen’s d = 3.04), Art Work 2 (64.42 ± 10.21 s vs. 38.76 ± 9.45 s, t(63) = 10.56,
p < 0.0001, Cohen’s d = 2.61), and Art Work 3 (56.73 ± 9.65 s vs. 30.09 ± 6.78 s, t(63) = 13.01,
p < 0.0001, Cohen’s d = 3.21). Total fixation duration across all paintings was significantly higher in Group B (58.82 ± 7.38 s) compared to Group C (33.43 ± 5.67 s), t(63) = 15.67,
p < 0.0001, Cohen’s d = 3.87, representing a remarkable improvement of 25.39 s (75.9% increase).
Across all three paintings, children in the AI music condition (Group B) demonstrated significantly longer average fixation durations compared to both Groups A and C. Surprisingly, children exposed to generic custom musical backgrounds (Group C) exhibited lower average fixation durations than both other groups, suggesting that non-personalized music may actually interfere with visual attention. Between-group comparisons revealed significant differences in fixation patterns: Art Work 1 showed significant differences between all group pairs (B vs. A: p = 0.0176; B vs. C: p < 0.0001; A vs. C: p < 0.01). For Art Works 2 and 3, Groups B and A showed marked differences in fixation duration distribution, suggesting qualitatively different viewing strategies.
In terms of the area of interest (AOI) analysis (see Table 3 and Table 4 for details), Group B exhibited significantly different attention distribution patterns compared with the control group. For all three artworks, the proportion of gaze frequency in meaningful areas (such as the face, hair, and shoulders) was significantly higher in Group B (p < 0.01). Group B's gaze frequency in these areas was nearly twice that of Group A, indicating that AI-generated music can draw children's gaze to meaningful areas for longer periods. The comparison between Groups B and C likewise showed a slight advantage for the AI-generated music. Furthermore, the AOI analysis clearly shows (Figure 7) that, compared with Groups A and C, Group B's viewing paths were more coherent, indicating that AI-generated music may provide stronger guidance and better align with the natural viewing path of artworks.
4.6. Discussion
The multifaceted evaluation of our proposed system encompassed three complementary dimensions: objective behavioral assessment of children’s viewing patterns, subjective evaluation from parents regarding educational value and child engagement, and professional assessment from experienced teachers concerning pedagogical effectiveness and practical implementation.
4.6.1. Children’s Behavioral Evaluation: Objective Metrics of Engagement
The child evaluation methodology was grounded in Hyson’s framework for assessing learning qualities in early childhood, which encompasses both affective/motivational and action/behavioral dimensions. The affective–motivational dimension, characterized as enthusiasm, incorporates intrinsic interest, emotional pleasure, and autonomous learning motivation—critical factors in sustained educational engagement. The complementary action/behavioral dimension, manifesting as active participation, encompasses sustained concentration, task persistence, cognitive flexibility, and emergent self-regulation capabilities. This multidimensional approach emphasizes objective assessment through systematically observable and quantifiable behaviors rather than subjective impressions.
Viewing duration emerges as a primary behavioral indicator of children’s sustained concentration and task persistence when engaging with specific artworks—particularly significant given the developmental characteristics of this age group. Children exposed to AI-composed music (Group B) demonstrated statistically significant and educationally meaningful extensions in fixation duration compared to both no-music (Group A) and generic custom music (Group C) conditions. This finding assumes particular importance given the typically limited attention spans characteristic of 6.0–7.3-year-old children, who are in a transitional developmental phase between preoperational and concrete operational thinking. The magnitude of improvement—42.5% compared to no music and 75.9% compared to generic music—suggests that personalized audio-visual cross-modal integration creates a powerful scaffolding effect that supports sustained attention. The significant improvements in sustained attention and AOI focus under the AI-music condition suggest that the personalized auditory stimuli may have reduced the ‘perceived difficulty’ associated with processing complex artworks. This aligns with findings from other learning domains, where addressing factors such as Emotions and Self-consideration—key components of perceived difficulty (
Spagnolo, 2025;
Spagnolo et al., 2024)—can facilitate deeper engagement. Our AI-driven approach, by generating congruent music, likely positively influenced these affective and metacognitive dimensions.
AOI analysis revealed that 71% of children in Group B not only exhibited more focused fixation points (extraction occurred only when gaze duration exceeded 2000 ms within a specific spatial area; for details, see Section 4.1) but also demonstrated extended dwell time within semantically meaningful regions (such as the face, hair, and shoulders) compared to Groups A and C. This pattern potentially indicates either enhanced concentration on relevant visual information or increased capacity for visual information processing when supported by congruent auditory stimuli. The synchronized audio-visual experience may facilitate deeper encoding and more elaborate processing of visual details, consistent with dual-coding theory and multimedia learning principles.
These empirical findings suggest that artwork viewing experiences elicit dramatically varying levels of concentration, visual exploration, and sustained engagement depending on the accompanying audiovisual environment. The differences may be partially attributable to the synchronization between visual content and musical tempo, rhythm, and emotional valence. While further research is needed to definitively characterize the specific mechanisms underlying these relationships, the current approach offers valuable insights for investigating children’s concentration patterns, individual differences in cognitive styles, and optimal conditions for aesthetic engagement.
Additionally, our experimental evidence indicates specific parameters for optimizing child-oriented audio-visual content. Musical accompaniments benefit from concise presentation, with optimal duration ranging from 30–40 s per visual segment to maintain engagement without cognitive overload. Musical styles should be deliberately active and engaging, featuring rhythmic variability and melodic interest that captures and maintains children’s naturally dynamic attention while avoiding monotony or excessive complexity that might distract from visual content.
4.6.2. Parents’ Subjective Evaluation: Family Perspectives on Educational Value
Parents’ subjective evaluations were systematically collected based on careful observation of their children’s behavioral responses during the experimental sessions and their personal assessment of the system’s educational merit. Given the absence of statistically significant differences in objective metrics between custom music and no-music conditions for gaze duration and AOI patterns, parental and teacher evaluations strategically focused on comparing AI-generated personalized music conditions with traditional no-music artwork viewing. Sixteen parents (8 mothers, 8 fathers) provided subjective evaluations using a validated 5-point Likert scale (1 = strongly disagree, 5 = strongly agree) across the following carefully constructed indicators:
Enhanced Interest and Comprehension: “My child demonstrated increased interest in and understanding of the artwork when accompanied by AI-generated music” (Mean rating: 4.56 ± 0.51).
Age-Appropriate Design: “The system’s objectives and content are clearly aligned with my child’s developmental stage and cognitive abilities” (Mean rating: 4.69 ± 0.48).
Innovation and Engagement: “The design content is lively, interesting, and represents an innovative approach to art education” (Mean rating: 4.81 ± 0.40).
Positive Emotional Response: “My child exhibited positive attitudes, genuine happiness, and strong participation desire during the experience” (Mean rating: 4.75 ± 0.45).
Sustained Interest: “My child expressed willingness and enthusiasm for repeated experiences with the system” (Mean rating: 4.63 ± 0.50).
Results indicated predominantly high parental ratings across all dimensions, with no ratings below 4.0 on any indicator. Qualitative feedback highlighted several key themes: parents appreciated the personalized nature of the musical accompaniment, noting that it seemed to “speak to” their individual child’s interests. Many parents commented on the marked difference in their children’s sustained attention compared to traditional museum or gallery visits. This experimental approach was praised for facilitating self-directed viewing, enhancing attentional skill development while enabling immersive artwork experiences that foster active exploration rather than passive observation.
Parents particularly valued the system’s capacity to support multi-perspective artwork perception without imposing adult interpretations. The design’s emphasis on interactive differences and exploratory engagement while avoiding unidirectional knowledge transfer resonated strongly with contemporary parenting philosophies emphasizing child autonomy and constructivist learning (
Figure 8).
4.6.3. Teachers’ Subjective Evaluation: Professional Educational Assessment
Five experienced preschool educators (mean teaching experience: 8.4 ± 2.3 years) provided professional evaluations based on systematic observation of student behaviors and personal experience with the system, leveraging their extensive pedagogical expertise. Evaluations utilized the same 5-point scale across educationally relevant indicators:
Cognitive Enhancement: “The system demonstrably enhances children’s artwork understanding and aesthetic appreciation” (Mean rating: 4.80 ± 0.45).
Practical Implementation: “The approach offers operational cost-effectiveness and practical feasibility for classroom integration” (Mean rating: 4.40 ± 0.55).
Student Engagement: “Children exhibit high levels of interactive participation and sustained engagement” (Mean rating: 4.80 ± 0.45).
Developmental Appropriateness: “Content aligns with age-appropriate cognitive and emotional development principles” (Mean rating: 5.00 ± 0.00).
Child-Centered Pedagogy: “The system reflects child-centered approaches and promotes subjective initiative in learning” (Mean rating: 4.80 ± 0.45).
Results demonstrated strong professional endorsement of the experimental design model across all pedagogical dimensions. Teachers unanimously agreed (5.0/5.0) that the content was developmentally appropriate, with one educator noting: “This approach brilliantly bridges the gap between children’s natural synesthetic tendencies and formal art appreciation.” The content planning and format were recognized as exemplifying child-centered, interactive, and playful educational approaches that align with contemporary early childhood education best practices.
Teachers particularly emphasized how the system enables children to experience truly personalized musical accompaniment that responds to their individual viewing patterns, creating what one teacher described as “a dialogue between the child and the artwork mediated by music.” The seamless integration of visual interaction with rhythmic musical variations was noted to create genuinely immersive experiences that maintain educational value while maximizing engagement. Several teachers suggested potential applications beyond art appreciation, including literacy development, emotional regulation, and cross-cultural education (
Figure 9).
In summary, this study addresses the critical challenge of enhancing children’s aesthetic development by employing cutting-edge artificial intelligence to integrate personalized visual information with contextually appropriate musical rhythm, thereby significantly improving engagement. Through establishing audiovisual fluency relationships based on scientifically grounded music–color correspondences, our design prioritizes child-centered, developmentally appropriate approaches that respect individual differences. Multi-sensory stimulation successfully attracts and maintains children’s attention to artworks, realizing the pedagogical ideal of learning through play while fostering deep aesthetic appreciation. Both objective behavioral assessments using state-of-the-art eye-tracking technology and subjective evaluations from key stakeholders—teachers and parents—confirm the high applicability and educational value of this experimental design for cultivating children’s aesthetic appreciation and supporting holistic cognitive development.
It is important to contextualize the scope of our current implementation. This study employed pre-composed AI music based on aggregate gaze data from pilot studies. While this approach successfully demonstrated the significant efficacy of AI-generated auditory cues in enhancing sustained attention and AOI-directed attention, it does not represent true real-time personalization. The logical next step and the future direction of this research lie in developing systems capable of generating music in real-time from each child’s own gaze dynamics during the viewing experience. Achieving such a system represents the next frontier in personalized aesthetic education, moving from a pre-defined, stimulus-matched approach to a fully adaptive, child-centered one.