Abstract
Engagement with exemplary artworks is fundamental to children's cognitive and emotional development and supports multidimensional learning across diverse perceptual domains. However, children in early stages of cognitive development often struggle to comprehend and internalize the complex visual narratives and abstract artistic concepts embodied in sophisticated artworks. This study presents a methodological framework for enhancing children's artwork comprehension by leveraging audio-visual cross-modal integration. Drawing on cross-modal correspondences between the visual and auditory perceptual systems, we developed a method that derives musical elements from gaze-behavior patterns collected in prior pilot studies of artwork viewing. Using deep learning, specifically Recurrent Neural Networks (RNNs), these visual–musical correspondences are transformed into cohesive, aesthetically pleasing musical compositions that remain semantically and emotionally congruent with the observed visual content. The efficacy and practical applicability of the proposed method were validated empirically with 96 children, assessed objectively through eye-tracking measures of viewing behavior, and complemented by qualitative evaluations from 16 parents and 5 experienced preschool educators. The findings indicate that AI-generated, artwork-matched audiovisual support significantly improves children's sustained engagement and attentional focus, potentially scaffolding deeper processing and informing future developments in aesthetic education.
The results demonstrate statistically significant improvements in children's sustained engagement (fixation duration: 58.82 ± 7.38 s vs. 41.29 ± 6.92 s, p < 0.001, Cohen's d ≈ 1.29), attentional focus (AOI gaze frequency increased by 73%, p < 0.001), and parents' subjective ratings (means 4.56–4.81 on a 5-point scale) when artwork viewing is augmented with AI-generated, personalized audio-visual support.