MEMA: Multimodal Aesthetic Evaluation of Music in Visual Contexts

Zhang, Huaye; Chen, Chenglizhao; Song, Mengke; Chen, Tingting; Jiang, Diqiong; Liu, Lichun; Liu, Xinyu

doi:10.3390/s26041395


by Huaye Zhang, Chenglizhao Chen, Mengke Song, Tingting Chen, Diqiong Jiang, Lichun Liu and Xinyu Liu *

Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China

* Author to whom correspondence should be addressed.

Sensors 2026, 26(4), 1395; https://doi.org/10.3390/s26041395

Submission received: 3 January 2026 / Revised: 14 February 2026 / Accepted: 20 February 2026 / Published: 23 February 2026

(This article belongs to the Special Issue Music Acquisition and Automatic Processing for Machine Learning-Based Applications)


Abstract

Recent technologies such as music retrieval, soundtrack generation, and video understanding have developed rapidly. Consequently, the aesthetic evaluation of video soundtracks has become an important research topic in academia. Soundtracks are key elements in shaping the emotional atmosphere and driving the narrative rhythm. Therefore, they require systematic methods to assess their artistic coordination with visual content. However, existing approaches mostly focus on evaluating the quality of the music itself. They often lack the ability to model the deeper aesthetic synergy between audio and visuals. To address this gap, we propose MEMA, a new soundtrack aesthetic evaluation model. MEMA employs a two-stage training strategy. The first stage builds a crossmodal imagination mechanism using a Conditional Variational Autoencoder. This method achieves bidirectional semantic reconstruction between audio and visuals. The second stage introduces a Guided Cross-Attention Alignment Module. This module enhances the model’s focus on key narrative moments in video. To facilitate this research, we also construct VMAE-Sets. It is the first large-scale dataset dedicated to soundtrack aesthetic evaluation. Finally, MEMA performs scoring and textual evaluation along three core aesthetic dimensions. Experimental results demonstrate that MEMA outperforms existing methods, achieving average improvements of 18.137% in LCC and 17.866% in SRCC compared to the strongest baseline. These findings confirm its superior audio–visual narrative alignment, demonstrating high consistency with human judgments.

Keywords:

multimodal learning; aesthetic quality assessment; audio-visual alignment; cross-attention; video soundtrack evaluation

1. Introduction

The rapid development of multimodal technologies, including video generation [1,2,3,4,5,6,7] and video understanding [8,9,10,11,12,13,14], has resulted in automated audio–visual content creation. Large-scale vision language models such as CLIP [15] have demonstrated the effectiveness of crossmodal alignment through contrastive learning. This progress introduces a critical challenge beyond simple generation: evaluating the artistic coordination between soundtracks and visuals. This task is a form of aesthetic assessment, which judges subjective qualities like emotional resonance and creative originality. Within computer vision and multimedia, aesthetic assessment ensures that machine-generated content aligns with human perceptions of beauty, spanning domains like image and video aesthetic assessment. Consequently, a clear need has emerged for systematic methods and high-quality datasets to specifically evaluate video soundtrack aesthetics. However, aesthetic evaluation of the audio–visual relationship remains a critical challenge.

At the methodological level, current evaluation approaches exhibit significant limitations. Traditional music quality assessment methods treat audio signals in isolation and focus solely on physical or perceptual metrics such as the signal-to-noise ratio, loudness, and rhythm stability [16,17,18]. These approaches overlook the functional role of music in cinematic contexts. Music serves as a narrative device that works together with visual elements to construct emotional tension and rhythmic structure. As a result, such methods cannot capture the expressive intent and narrative value of soundtracks in specific scenes. Although recent multimodal evaluation methods attempt to integrate video and audio information, most remain limited to low-level feature fusion. A common approach is to extract features from image frames and audio segments. These features are then concatenated directly or processed by shallow attention mechanisms for joint modeling. While these techniques offer modest improvements in capturing inter-modal correlations, they fall short in modeling the dynamic interplay between musical progression and narrative development. They also fail to capture the aesthetic coupling between timbre, rhythm, and visual atmosphere [19].

On the data side, there is currently no high-quality aesthetic evaluation dataset that is comprehensive, fine-grained, and focused on artistic expression. Compared to general video–music alignment tasks, film soundtracks place greater emphasis on high-level emotional alignment between the music and the visual narrative. This alignment goes beyond crossmodal information matching. It involves the synergy between musical atmosphere and visual storytelling across emotional rhythms, narrative structures, and thematic metaphors. However, existing datasets are insufficient to support in-depth aesthetic evaluations between video and music. For example, SymMV [4] provides annotations for video–music alignment but primarily contains narrow-range samples using a solo piano. This limitation restricts its musical diversity. BGM909 [4] offers high-quality audio–video–text triplets but focuses mainly on semantic-level alignment. It lacks descriptions of whether the soundtrack emotionally resonates with the visuals. Similarly, HarmonySet [20] is designed for aesthetic evaluation tasks but mainly covers short-video scenarios. It fails to capture the more nuanced and complex aesthetic requirements of film scoring. At the core of this problem lies a significant bottleneck in obtaining high-quality aesthetic annotations. Expert annotation requires interdisciplinary knowledge of both music composition and film theory. This makes large-scale dataset construction costly and time-consuming. As a result, existing datasets often only provide coarse-grained labels and fail to distinguish between different aesthetic dimensions.

To address the above challenges, we propose a solution that includes both a new evaluation model and a dataset. For the model, we introduce MEMA (Multimodal Aesthetic Evaluation of Music), which captures the artistic relationships between audio and visuals through crossmodal imagination. The model employs a Conditional Variational Autoencoder (CVAE) [21] for bidirectional audio–visual reconstruction. This mechanism enables the model to learn deep features by imagining one modality from another. MEMA decomposes aesthetic judgments into three dimensions: narrative–emotional congruence, technical integration, and thematic identity and originality. These dimensions are grounded in film music theory and enable fine-grained evaluation. For the dataset, we introduce VMAE-Sets (Video–Music Aesthetic Evaluation Datasets), derived from critically acclaimed films and their original soundtracks. To overcome the annotation bottleneck, we design a weakly supervised pipeline that leverages large language models to extract structured aesthetic evaluations from user reviews in movie databases. This pipeline generates multidimensional annotations for each video–music pair without requiring expert labeling.

The contributions of our work are as follows:

We formulate video soundtrack aesthetic evaluation as a new task. This task aims to assess the artistic coordination between music and visual content. It provides a foundation for automatic quality assessment in film scoring and video production.
We propose MEMA, a multimodal model that captures audio–visual artistic relationships through crossmodal imagination. The model decomposes aesthetic judgments into three dimensions and achieves the strongest performance among all evaluated models, with average improvements of 18.137 percent in LCC and 17.866 percent in SRCC compared to the best-performing baseline.
We introduce VMAE-Sets, the first large-scale dataset for soundtrack aesthetic evaluation. A weakly supervised pipeline based on large language models enables scalable annotation without expert labeling.

2. Related Work

2.1. Audio Evaluation

Music quality evaluation has evolved from traditional acoustic assessment to modern data-driven approaches. Early work primarily focused on objective physical metrics such as audio fidelity and the signal-to-noise ratio [22,23]. These methods often relied on engineered features or signal-processing measures. With the advent of deep learning, researchers began to learn complex quality metrics directly from human subjective ratings [24]. Some studies employ the discriminator of a Generative Adversarial Network as a no-reference audio quality assessor [25]. This approach demonstrates strong alignment with human perception in certain settings. Other works have explored modeling higher-level musical attributes such as rhythm, harmony, and timbre [26,27,28]. Music emotion recognition has also been extensively studied [29]. However, most approaches remain single-modal. They analyze intrinsic musical qualities such as rhythmic stability but overlook the functional role of music within a visual narrative. This paradigm fails to capture the dynamic alignment between audio and visual elements like plot development or character emotions. Current metrics rely heavily on technical acoustic features rather than aesthetic dimensions like narrative congruence. Consequently, evaluating audio in isolation is insufficient. There is an urgent need for a crossmodal framework that effectively assesses the deep integration of audio and visual modalities.

2.2. Soundtrack Understanding

Currently, there are no specialized models designed specifically for soundtrack understanding. Researchers typically repurpose general video understanding models such as InternVideo, Video-LLaMA, and their successors [8,9,10,11,12,13,14,30,31]. These frameworks have achieved strong performance on semantic recognition tasks through large-scale multimodal pretraining. They are effective at capturing factual content such as object categories, actions, and events. However, they offer limited insight into aesthetic and emotional dimensions. These general foundation models fall short when applied to soundtrack evaluation. They lack the ability to model nuanced crossmodal relationships such as the narrative–emotional congruence between music and visual storytelling. Their training objectives are typically oriented toward recognition or captioning. This focus encourages semantic alignment but does not optimize for aesthetic qualities like emotional resonance, pacing, or thematic coherence in audio–visual compositions.

2.3. Video–Text and Video–Music Datasets

Alongside progress in video understanding, several video–music datasets have been proposed [32]. However, most were not designed with in-depth aesthetic evaluation in mind. SymMV [4] provides video–symbolic music pairs with detailed annotations of chords, melodies, and accompaniments. However, its scale is limited, and the music is largely restricted to piano renditions. This results in a lack of diversity in musical style and scene representation. The VidMuse V2M dataset [33] contains hundreds of thousands of video–music pairs covering movie trailers, advertisements, and documentaries. It is primarily used for video-to-music generation and matching. Its annotations focus on semantic-level alignment and do not provide explicit labels for aesthetic evaluation. HarmonySet [20] offers tens of thousands of samples annotated along dimensions including rhythm synchronization, emotional alignment, and thematic coherence. While it demonstrates more depth than earlier datasets, it primarily covers short-form videos rather than cinematic-length content. Although it includes labels such as emotional alignment, it does not comprehensively address more complex aesthetic dimensions of film soundtracks such as narrative–emotional congruence, technical integration, and thematic identity and originality.

2.4. Audio Aesthetic Evaluation Datasets

In the audio domain, several datasets have been proposed to quantify subjective evaluations of musical aesthetics. These datasets provide important references for our crossmodal setting. AES-Natural [26] unifies automatic quality assessment for speech, music, and sound. However, its clips are relatively short, and the evaluation dimensions focus on generic quality rather than detailed textual analysis of aesthetic properties. This makes it insufficient for studying complex musical structure in cinematic contexts. MusicEval [27] takes a step further by providing expert evaluations of AI-generated instrumental music. However, its evaluation dimensions remain relatively coarse. Moreover, its text alignment metric becomes ineffective for assessing music that is not conditioned on textual prompts. This limits its applicability to video soundtrack evaluation. SongEval [28] introduces a multi-dimensional aesthetic evaluation of full songs with vocals. It proposes a five-dimensional framework annotated by professional musicians, including coherence, memorability, vocal naturalness, structural clarity, and overall musicality. While SongEval is an important milestone for single-modal song aesthetics, it is not designed for crossmodal tasks. All dimensions are evaluated in isolation on the audio itself without considering how the music aligns with visual content or supports narrative coherence. Thus, there remains a fundamental gap between existing audio-only aesthetic datasets and our goal of deeply fused video–soundtrack evaluation.

3. The VMAE-Sets Dataset

VMAE-Sets is a multimodal dataset comprising aligned text–audio–video samples for film soundtrack aesthetic evaluation. It assesses emotional and narrative congruence, technical integration, and thematic originality. Table 1 compares VMAE-Sets with representative datasets. Unlike SongEval [28] and HarmonySet [20], VMAE-Sets uniquely integrates video, OST, and commentary, providing a unified foundation for future research.

3.1. Data Collection

To ensure diversity and quality, we selected clips from critically acclaimed narrative films worldwide, covering a wide range of eras, genres, and production scales. For the audio component, we sourced official OST (original soundtrack) tracks from public distributions. User reviews of each soundtrack were collected from online platforms to serve as the raw textual material for later modeling and scoring. The final dataset contains 1170 soundtrack–video–text triplets, with a total duration of approximately 106.8 h, making it one of the largest datasets of its kind focused specifically on soundtrack aesthetics.

3.2. Video Processing

We adopted the Pretrained Conformers for Audio Fingerprinting method [35] to automatically match each OST to its corresponding appearance(s) in the film. After successful alignment, we extracted the matched video segments to ensure precise temporal synchronization between the music and the visuals. It is important to note that a single soundtrack track may be reused or segmented multiple times within a film, meaning that one OST track may correspond to 1 to $n$ (where $n \geq 2$) video segments.

3.3. Definition of Evaluation Metrics

Inspired by prior work in film music theory and audio–visual studies [36,37,38], we define three evaluation metrics:

Narrative–Emotional Congruence (NEC) [36]: It measures the alignment between the emotional atmosphere created by the soundtrack and the narrative elements in the video, including plot progression, character psychology, and narrative tension. For example, cheerful music in a suspenseful or tragic scene would receive a lower NEC score.

Technical Integration (TI) [37]: It evaluates how well the soundtrack is integrated with other auditory components in the film, such as dialogue and sound effects, including aspects like mixing balance, masking, and clarity.

Thematic Identity and Originality (TIO) [38]: It focuses on the recognizability and creative uniqueness of the soundtrack’s musical theme—whether it possesses a distinct motive or leitmotif that connects with characters or plotlines and whether it avoids overly templated or formulaic orchestration.

3.4. Textual Review Processing and Scoring

To ensure objectivity, we processed user reviews, weighted by their number of “likes”, using large language models (LLMs). For each dimension, the LLMs generated (1) a score (1–10), (2) a confidence score for the judgment, and (3) a corresponding textual explanation.

We employed two LLMs (DeepSeek and Qwen) with three independent samplings per pair to reduce bias. Figure 1 illustrates the pipeline. We computed individual model scores using confidence-weighted averaging:

s_{c} = \frac{\sum_{j} \mathrm{score}_{j} \times \mathrm{conf}_{j}}{\sum_{j} \mathrm{conf}_{j}},

(1)

where $\mathrm{score}_{j}$ and $\mathrm{conf}_{j}$ are the score and confidence of the $j$-th sample.

The final fused score from both LLMs was aggregated as

s_{final} = \frac{s_{model\,1} \times \mathrm{conf}_{model\,1} + s_{model\,2} \times \mathrm{conf}_{model\,2}}{\mathrm{conf}_{model\,1} + \mathrm{conf}_{model\,2}},

(2)

and the average confidence is defined as

\mathrm{avg}_{confidence} = \frac{\mathrm{conf}_{model\,1} + \mathrm{conf}_{model\,2}}{2}.

(3)
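The aggregation in Equations (1)–(3) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' pipeline: the function names are hypothetical, and the text does not state how each model-level confidence is obtained, so it is passed in as an input here.

```python
from typing import Sequence, Tuple


def confidence_weighted_score(scores: Sequence[float], confs: Sequence[float]) -> float:
    """Equation (1): confidence-weighted mean over the samplings of one LLM."""
    total_conf = sum(confs)
    if total_conf <= 0:
        raise ValueError("confidences must sum to a positive value")
    return sum(s * c for s, c in zip(scores, confs)) / total_conf


def fuse_models(s1: float, conf1: float, s2: float, conf2: float) -> Tuple[float, float]:
    """Equations (2) and (3): fused score and average confidence of two LLMs.

    conf1 and conf2 are model-level confidences; how they are derived from the
    per-sample confidences is an assumption left to the caller.
    """
    s_final = (s1 * conf1 + s2 * conf2) / (conf1 + conf2)
    avg_confidence = (conf1 + conf2) / 2.0
    return s_final, avg_confidence


# Three samplings from one model for a single dimension (illustrative values).
s_c = confidence_weighted_score([8.0, 7.0, 9.0], [0.9, 0.6, 0.5])
```

A sampling with higher confidence pulls the model score toward its value, and the same weighting is then applied once more across the two models.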

The dataset covers 1170 samples across three dimensions. It contains over 6 million words of explanatory text. Sample lengths range from 70 to 1500 words, with an average of 900 words.

4. Model Construction

The MEMA model takes three types of input data, which correspond as follows: (1) The music sequence $A^{OST} = \{a_{1}, \dots, a_{T}\}$ is derived from the pure OST, i.e., the isolated film soundtrack with non-musical elements (dialogue, ambient sounds, and sound effects) removed. (2) The $P$ segments $\{V_{j}\}_{j=1}^{P}$ are the visual content of the video, i.e., the image frames or visual features extracted from each video segment. (3) For each segment $j$, $A_{j}^{clip}$ denotes the audio track of the $j$-th video clip, i.e., the original mixed audio accompanying that clip, which includes dialogue, ambient sounds, and sound effects. Thus, $A^{OST}$ and $\{A_{j}^{clip}\}_{j=1}^{P}$ provide two distinct audio views (isolated music vs. in-context mix), while $\{V_{j}\}_{j=1}^{P}$ provides the visual context; this design allows the model to assess how the pure soundtrack integrates with the environmental auditory elements in the final film.

The model outputs three aesthetic scores (NEC, TI, and TIO) and textual explanations. The architecture consists of five main components: First, feature extraction modules encode audio and video inputs separately and fuse them at the segment level. Second, two encoding towers build global representations for music and scene collections. Third, a Crossmodal Imagination Module (Figure 2) establishes bidirectional semantic mappings between audio and visual modalities. Fourth, a Guided Cross-Attention Alignment Module enables music semantics to actively select relevant visual segments. Finally, prediction heads output aesthetic scores and generate textual explanations through a frozen language model. Training proceeds in two stages: self-supervised pretraining for crossmodal alignment and weakly supervised fine-tuning [39] for aesthetic prediction.

4.1. Feature Extraction and Local Fusion

Audio Encoder: Film music contains both abstract emotional semantics and concrete acoustic textures. Existing audio encoders typically capture only one aspect. We adopt a dual-pathway structure to model both. For semantic features, the OST $A^{OST}$ is sliced using a 10-s window with a 2-s stride to obtain $M$ segments. Each segment is processed by CLAP [40] to extract $X_{clap} \in \mathbb{R}^{M \times 512}$, then layer-normalized and projected:

H_{clap} = W_{c} X_{clap} + b_{c},

(4)

where $W_{c} \in \mathbb{R}^{512 \times 768}$. For acoustic features, Librosa extracts 37-dimensional descriptors $X_{lib} \in \mathbb{R}^{M \times 37}$ including MFCC, Chroma, RMS, zero-crossing rate, and timbral statistics. A three-layer 1D CNN processes these statistics into

H_{motif} = f_{CNN}(X_{lib}) \in \mathbb{R}^{K \times 768}.

(5)

For each clip-level audio $A_{j}^{clip}$, features are concatenated and projected:

x_{aud}^{(j)} = W_{a} [x_{clap}^{(j)}; x_{lib}^{(j)}] + b_{a},

(6)

where $W_{a} \in \mathbb{R}^{549 \times 768}$.
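To make the shapes concrete, the sketch below counts the windows produced by a 10-s window with a 2-s stride and applies the clip-level projection of Equation (6). It is an illustration under stated assumptions: the helper names are hypothetical, no padding of the final partial window is assumed, the weights are random placeholders, and the projection is written row-vector style (`x @ W_a`) so that the stated shape $549 \times 768$ maps 549 inputs to 768 outputs.

```python
import numpy as np


def num_windows(duration_s: float, window_s: float = 10.0, stride_s: float = 2.0) -> int:
    """Number of full windows M, assuming the trailing partial window is dropped."""
    if duration_s < window_s:
        return 0
    return int((duration_s - window_s) // stride_s) + 1


def project_clip_audio(x_clap: np.ndarray, x_lib: np.ndarray,
                       w_a: np.ndarray, b_a: np.ndarray) -> np.ndarray:
    """Equation (6): concatenate CLAP (512-d) and Librosa (37-d) features, then project."""
    x = np.concatenate([x_clap, x_lib])  # 512 + 37 = 549 dimensions
    return x @ w_a + b_a                 # 768 dimensions


rng = np.random.default_rng(0)
w_a = rng.standard_normal((549, 768)) * 0.02  # placeholder weights, not trained values
b_a = np.zeros(768)
x_aud = project_clip_audio(rng.standard_normal(512), rng.standard_normal(37), w_a, b_a)
```

Under these assumptions, a three-minute track (180 s) yields `num_windows(180.0)` windows, each overlapping its neighbor by 8 s.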

Video Encoder: High-level semantic vectors alone make the model insensitive to visual aesthetics. We introduce a multi-layer mechanism to capture both global semantics and low-level properties. For each segment $V_{j}$, eight frames are sampled and input into InternVideo2 to obtain $x_{global}^{(j)} \in \mathbb{R}^{768}$. The segment is also uniformly divided into five non-overlapping time intervals along the temporal axis to extract 40-dimensional low-level descriptors (eight dimensions per interval) including color, texture, and edge density, yielding $x_{low}^{(j)} \in \mathbb{R}^{5 \times 40}$. The features are concatenated:

x_{vid}^{(j)} = [x_{global}^{(j)}; \mathrm{Flatten}(x_{low}^{(j)})] \in \mathbb{R}^{968}.

(7)

Segment-level Audio–Visual Fusion Module: After obtaining features, segment-level fusion aligns the two modalities locally. Let $v_{i} = x_{vid}^{(i)}$ and $a_{i} = x_{aud}^{(i)}$. Linear transformations generate query, key, and value matrices, followed by cross-attention and residual fusion:

\begin{aligned}
Q &= W_{Q} v_{i}, \quad K = W_{K} a_{i}, \quad V = W_{V} a_{i}, \\
\mathrm{Attention}(Q, K, V) &= \mathrm{Softmax}(Q K^{\top} / \sqrt{768})\, V, \\
H' &= \mathrm{LayerNorm}(Q + \mathrm{Attention}(Q, K, V)), \\
H_{f} &= \mathrm{LayerNorm}(H' + \mathrm{FFN}(H')).
\end{aligned}

(8)

Applying this to all $P$ segments yields $H^{fusion} \in \mathbb{R}^{P \times 768}$.

4.2. Audio Tower and Video Tower

Audio Tower: Local features must be aggregated into global representations. The Audio Tower integrates semantic and acoustic features to model the temporal progression of musical themes. The sequences $H_{clap}$ and $H_{motif}$ are concatenated and enhanced with sinusoidal positional encoding:

H_{music} = \mathrm{PE}(\mathrm{Concat}(H_{clap}, H_{motif})) \in \mathbb{R}^{(M + K) \times 768}.

(9)

A 6-layer Transformer [41] encoder with 12 heads and a 3072-dimensional FFN outputs $H_{a} \in \mathbb{R}^{(M + K) \times 768}$.
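Equation (9) can be illustrated with the standard sinusoidal encoding from the original Transformer formulation. The sketch assumes that this standard form is the one used and that the encoding is added to the concatenated sequence; the function names are hypothetical.

```python
import numpy as np


def sinusoidal_positional_encoding(length: int, dim: int = 768) -> np.ndarray:
    """Standard sin/cos positional encoding; dim is assumed to be even."""
    positions = np.arange(length)[:, None]                       # (length, 1)
    freqs = np.exp(np.arange(0, dim, 2) * (-np.log(10000.0) / dim))  # (dim / 2,)
    pe = np.zeros((length, dim))
    pe[:, 0::2] = np.sin(positions * freqs)  # even channels
    pe[:, 1::2] = np.cos(positions * freqs)  # odd channels
    return pe


def build_music_sequence(h_clap: np.ndarray, h_motif: np.ndarray) -> np.ndarray:
    """Concatenate the M semantic and K acoustic tokens, then add positions."""
    h = np.concatenate([h_clap, h_motif], axis=0)                # (M + K, 768)
    return h + sinusoidal_positional_encoding(h.shape[0], h.shape[1])


h_music = build_music_sequence(np.zeros((86, 768)), np.zeros((12, 768)))
```

Because the semantic tokens come first in the concatenation, the first $M$ rows of the output keep their window order, which is what later allows them to be sliced out as a contiguous block.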

Video Tower: Film editing often does not follow a linear timeline due to flashbacks and intercuts. Strict temporal positional encoding may introduce noise. The Video Tower uses a masking mechanism instead of explicit positional encoding. The fusion features $H^{fusion}$ are layer-normalized. During encoding, an additive mask matrix $M_{scene} \in \{0, -\infty\}^{P \times P}$ controls attention, ensuring narratively related segments attend to each other:

\begin{aligned}
M_{scene}(i, j) &= \begin{cases} 0 & \text{if related}, \\ -\infty & \text{otherwise}, \end{cases} \\
\mathrm{Attention}(Q, K, V) &= \mathrm{Softmax}((Q K^{\top} + M_{scene}) / \sqrt{768})\, V.
\end{aligned}

(10)

A 6-layer Transformer encoder outputs $H_{v} \in \mathbb{R}^{P \times 768}$. This design allows narratively related segments to attend to each other regardless of their temporal positions.
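The effect of the scene mask in Equation (10) can be seen in a single-head NumPy sketch. For brevity it uses identity projections (so $Q = K = V = H$), which is a simplification rather than the paper's configuration, and it assumes every segment is related to at least itself so that no softmax row is fully masked. The `related` matrix is a hypothetical input.

```python
import numpy as np


def masked_scene_attention(h: np.ndarray, related: np.ndarray):
    """Single-head attention with an additive 0 / -inf mask over segment pairs.

    h:       (P, d) segment features (identity Q/K/V projections for brevity).
    related: (P, P) boolean matrix; True where two segments are narratively related.
    """
    d = h.shape[1]
    mask = np.where(related, 0.0, -np.inf)
    logits = (h @ h.T + mask) / np.sqrt(d)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)                             # exp(-inf) == 0
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ h, weights


rng = np.random.default_rng(0)
h = rng.standard_normal((4, 16))
# Segments 0 and 3 (e.g., a scene and its flashback) are related; 1 and 2 form another group.
related = np.eye(4, dtype=bool)
related[0, 3] = related[3, 0] = True
related[1, 2] = related[2, 1] = True
out, weights = masked_scene_attention(h, related)
```

Segments 0 and 3 exchange information even though they are not adjacent in time, while unrelated pairs receive exactly zero attention weight.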

4.3. Crossmodal Imagination Module

Existing multimodal methods align features through contrastive learning or concatenation. These approaches capture surface-level correlations but fail to model deep semantic connections. We propose a crossmodal Conditional Variational Autoencoder (CVAE) that learns to generate one modality from another. This resembles human imagination and enables the model to understand how music evokes visual imagery. The module consists of two symmetric paths for bidirectional generation. The encoder maps input $H$ (either $H_{a}$ or $H_{v}$) to latent parameters, and a Transformer decoder reconstructs features conditioned on the other modality:

\begin{aligned}
[\mu, \log \sigma^{2}] &= f_{enc}(H), \quad z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I), \\
\hat{H}_{v} &= f_{dec}(z \mid H_{a}) \in \mathbb{R}^{P \times 768}, \quad \hat{H}_{a} = f_{dec}(z \mid H_{v}) \in \mathbb{R}^{(M + K) \times 768}.
\end{aligned}

(11)

The loss function combines reconstruction and KL divergence:

L_{CVAE} = \| H_{v} - \hat{H}_{v} \|_{2}^{2} + \| H_{a} - \hat{H}_{a} \|_{2}^{2} + \lambda_{KL} D_{KL}(q(z \mid H) \,\|\, \mathcal{N}(0, I)).

(12)

Through bidirectional reconstruction, the model learns a unified latent space where semantically related features are mapped to nearby regions. The CVAE uses a symmetric encoder–decoder architecture: the encoder is a 3-layer Transformer (six heads; 2048-d feed-forward) that maps $H_{a}$ or $H_{v}$ to latent parameters; the decoder mirrors it with a 3-layer Transformer that performs cross-attention between the latent sample $z$ and the conditioning modality. The latent dimension is set to $z \in \mathbb{R}^{256}$ to capture high-level semantic correspondences while avoiding overfitting to low-level details. The symmetric encoder–decoder structure ensures that both bidirectional pathways have equivalent capacity.
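The reparameterization step in Equation (11) and the loss in Equation (12) can be written compactly. The sketch below uses the well-known closed form of the KL divergence between a diagonal Gaussian and the standard normal; the encoder and decoder themselves are omitted, and all function names are hypothetical.

```python
import numpy as np


def reparameterize(mu: np.ndarray, log_var: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """z = mu + sigma * eps with eps ~ N(0, I), where sigma = exp(0.5 * log_var)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mu: np.ndarray, log_var: np.ndarray) -> float:
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return float(-0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var)))


def cvae_loss(h_v, h_v_hat, h_a, h_a_hat, mu, log_var, lambda_kl: float) -> float:
    """Equation (12): squared reconstruction errors plus the weighted KL term."""
    rec = np.sum((h_v - h_v_hat) ** 2) + np.sum((h_a - h_a_hat) ** 2)
    return float(rec + lambda_kl * kl_to_standard_normal(mu, log_var))


rng = np.random.default_rng(0)
mu, log_var = np.zeros(256), np.zeros(256)  # 256-d latent, as in the text
z = reparameterize(mu, log_var, rng)
```

Sampling through `mu` and `log_var` rather than drawing `z` directly is what keeps the sampling step differentiable with respect to the encoder outputs.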

4.4. Guided Cross-Attention Alignment Module

Segment-level fusion captures local correspondence. However, film scores exhibit long-range thematic structures spanning multiple scenes. A musical theme may relate to temporally distant scenes. Local fusion cannot capture this global narrative role. We introduce the Guided Cross-Attention Alignment Module (GCAAM) (Figure 3), where music semantics guide the attention process by serving as queries to selectively attend to relevant visual segments. This “guidance” is realized through two key design choices that distinguish the GCAAM from standard cross-attention: (1) We use only the first $M$ semantic tokens ($H_{a}^{(M)}$) from the Audio Tower. These tokens are derived from CLAP embeddings and were aligned through CVAE pretraining to capture global musical structure and thematic progression. This selective use of semantically rich tokens, rather than all audio features, enables the model to focus on musically meaningful patterns. (2) The audio-to-video attention is asymmetric: music acts as the sole query source, while video serves as the key and value, establishing a music-driven selection mechanism rather than bidirectional mutual attention. We extract the first $M$ semantic tokens from $H_{a}$, denoted $H_{a}^{(M)} \in \mathbb{R}^{M \times 768}$, as queries. The scene representation $H_{v}$ serves as keys and values, and cross-attention is computed as follows:

\begin{aligned}
Q_{m} &= W_{Q} H_{a}^{(M)}, \quad K_{v} = W_{K} H_{v}, \quad V_{v} = W_{V} H_{v}, \\
H' &= \mathrm{Softmax}(Q_{m} K_{v}^{\top} / \sqrt{768})\, V_{v}.
\end{aligned}

(13)

The output passes through residual connections, layer normalization, and FFN to produce $H_{aligned} \in \mathbb{R}^{M \times 768}$. This mechanism establishes a global mapping between musical themes and scene contexts. The attention weights can be visualized to analyze which scenes each musical segment attends to.
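A minimal single-head sketch of the asymmetric attention in Equation (13) follows. It covers only the attention step (the residual connection, layer normalization, and FFN are omitted), uses random placeholder weights, and scales by $\sqrt{d}$, which equals $\sqrt{768}$ at the paper's width. The function names are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def guided_cross_attention(h_a, h_v, m, w_q, w_k, w_v):
    """Music-to-video attention: the first m audio tokens query all P video segments.

    h_a: (M + K, d) Audio Tower output; h_v: (P, d) Video Tower output.
    Returns the attended features (m, d) and the attention map (m, P).
    """
    q = h_a[:m] @ w_q          # queries come only from the semantic audio tokens
    k = h_v @ w_k              # keys and values come only from the video segments
    v = h_v @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights


rng = np.random.default_rng(0)
d, M, K, P = 32, 6, 3, 5       # small illustrative sizes; the paper uses d = 768
h_a, h_v = rng.standard_normal((M + K, d)), rng.standard_normal((P, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
h_prime, attn = guided_cross_attention(h_a, h_v, M, w_q, w_k, w_v)
```

Each row of `attn` is a distribution over the $P$ video segments for one musical window, which is the quantity that would be plotted when visualizing which scenes a musical segment attends to.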

4.5. Prediction Heads

Score Prediction Heads: The model outputs three aesthetic scores from features that best reflect each dimension. NEC measures narrative–emotional congruence and is computed from

u_{NEC} = \mathrm{Pool}_{t}(H_{aligned}) \in \mathbb{R}^{768}.

(14)

TIO evaluates thematic identity and originality from

u_{TIO} = \mathrm{Pool}_{t}(H_{a}) \in \mathbb{R}^{768}.

(15)

TI assesses technical integration by comparing global and local audio, with

u_{a}^{OST} = \mathrm{Pool}_{t}(H_{a}), \quad u_{a}^{clip} = \mathrm{Pool}_{j}(\{x_{aud}^{(j)}\}), \quad \Delta = u_{a}^{OST} - u_{a}^{clip}, \quad u_{TI} = [u_{a}^{OST}; u_{a}^{clip}; \Delta] \in \mathbb{R}^{2304}.

(16)

Each metric uses a two-layer MLP ($d_{in} \to 512$ with ReLU and Dropout 0.1, then $512 \to 1$ with Sigmoid). Training uses confidence-weighted MSE, where

L_{score}^{t} = \frac{1}{N} \sum_{i} w_{i}^{t} (y_{i}^{t} - \hat{y}_{i}^{t})^{2}.

(17)
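The TI feature of Equation (16) and the loss of Equation (17) are simple enough to sketch directly. The pooling operator is not specified in the text, so mean pooling is assumed here; the function names and the sample values are illustrative only.

```python
import numpy as np


def ti_feature(h_a: np.ndarray, x_aud_clips: np.ndarray) -> np.ndarray:
    """Equation (16): [u_OST; u_clip; u_OST - u_clip], assuming mean pooling.

    h_a:         (M + K, 768) Audio Tower output for the isolated OST.
    x_aud_clips: (P, 768) clip-level features of the in-context audio mix.
    """
    u_ost = h_a.mean(axis=0)
    u_clip = x_aud_clips.mean(axis=0)
    return np.concatenate([u_ost, u_clip, u_ost - u_clip])  # 3 * 768 = 2304


def confidence_weighted_mse(y_true: np.ndarray, y_pred: np.ndarray, weights: np.ndarray) -> float:
    """Equation (17): per-sample squared error scaled by annotation confidence."""
    return float(np.mean(weights * (y_true - y_pred) ** 2))


u_ti = ti_feature(np.ones((10, 768)), np.zeros((4, 768)))
loss = confidence_weighted_mse(np.array([0.8, 0.5]), np.array([0.6, 0.5]), np.array([1.0, 0.5]))
```

The explicit difference term gives the MLP direct access to how far the in-context mix departs from the isolated score, and low-confidence labels contribute proportionally less to the gradient.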

LLM Output Head: The model also generates textual explanations. Separate MLP projectors map task vectors to the language embedding space as follows:

\phi_{NEC} : \mathbb{R}^{768} \to \mathbb{R}^{1536} \to \mathbb{R}^{3584}, \quad \phi_{TIO} : \mathbb{R}^{768} \to \mathbb{R}^{1536} \to \mathbb{R}^{3584}, \quad \phi_{TI} : \mathbb{R}^{2304} \to \mathbb{R}^{1536} \to \mathbb{R}^{3584}.

(18)

Projected embeddings are concatenated with VideoLLaMA2 features and input to a frozen Qwen2-7B [42] language model with task-specific prompts. Only the projectors are trainable. Cross-entropy loss is calculated as

L_{text} = - \sum_{t} y_{t} \log p(y_{t} \mid y_{<t}, H_{task}).

(19)

4.6. Training Strategy

Self-supervised Pretraining: The first stage performs crossmodal alignment without labels. The loss function combines reconstruction, KL divergence, and alignment regularization to yield

\begin{aligned}
L_{pre} &= \lambda_{rec} L_{rec} + \lambda_{KL} L_{KL} + \lambda_{align} L_{align}, \\
L_{rec} &= \| H_{v} - \hat{H}_{v} \|_{2}^{2} + \| H_{a} - \hat{H}_{a} \|_{2}^{2}, \\
L_{KL} &= D_{KL}(q(z \mid H) \,\|\, \mathcal{N}(0, I)), \\
L_{align} &= \mathrm{MSE}(A_{attn}, \bar{A}_{attn}),
\end{aligned}

(20)

where $A_{attn}$ denotes GCAAM attention weights and $\bar{A}_{attn}$ is their segment-wise average. To prevent posterior collapse during CVAE training, we employ a KL annealing strategy for $\lambda_{KL}$. Specifically, $\lambda_{KL}$ linearly increases from zero to a maximum value $\lambda_{KL}^{max}$ over the first $N_{warmup}$ epochs: $\lambda_{KL}^{(t)} = \min(t / N_{warmup}, 1) \cdot \lambda_{KL}^{max}$. This allows the model to first learn effective reconstruction before gradually introducing regularization. The final target value $\lambda_{KL}^{max} = 0.01$ was selected via grid search over $\{0.001, 0.01, 0.05, 0.1, 0.5\}$ on the validation set, balancing reconstruction quality and latent space smoothness. The settings are as follows: $N_{warmup} = 10$ epochs, $\lambda_{rec} = 1.0$, and $\lambda_{align} = 0.1$.
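The annealing schedule and the weighted sum in Equation (20) can be sketched as follows, using the reported settings as defaults. The function names are hypothetical, and the individual loss terms are assumed to be computed elsewhere.

```python
def kl_weight(epoch: int, n_warmup: int = 10, lambda_kl_max: float = 0.01) -> float:
    """Linear KL annealing: min(t / N_warmup, 1) * lambda_KL^max."""
    return min(epoch / n_warmup, 1.0) * lambda_kl_max


def pretraining_loss(l_rec: float, l_kl: float, l_align: float, epoch: int,
                     lambda_rec: float = 1.0, lambda_align: float = 0.1) -> float:
    """Equation (20), with the KL weight taken from the annealing schedule."""
    return lambda_rec * l_rec + kl_weight(epoch) * l_kl + lambda_align * l_align


# The KL weight ramps up over the first 10 epochs and then stays at its maximum.
schedule = [kl_weight(t) for t in (0, 5, 10, 25)]
```

At epoch 0 the KL term is switched off entirely, so the first updates are driven by reconstruction and alignment alone.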

Weakly Supervised Fine-tuning: The second stage adapts the model to aesthetic prediction. CVAE parameters are frozen. The loss function combines score regression and text generation using

L_{score} = \sum_{t} \omega_{t} \cdot \mathrm{MSE}(s_{t}, \hat{s}_{t}) \quad \text{and} \quad L_{text} = \sum_{t} \gamma_{t} \cdot \mathrm{CE}(y_{t}, \hat{y}_{t}),

(21)

where $t \in \{NEC, TIO, TI\}$. Correlation regularization stabilizes training:

L_{fine} = \alpha_{score} L_{score} + \beta_{text} L_{text} + \lambda_{lcc} (1 - \rho),

(22)

where $\rho$ represents Pearson’s correlation. Training uses AdamW with a learning rate of $5 \times 10^{-5}$ and mixed precision.
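The correlation regularizer in Equation (22) penalizes batches whose predictions are poorly correlated with the targets. The sketch below computes it with NumPy; the weights `alpha`, `beta`, and `lambda_lcc` are not reported in this section, so the values used in the example are placeholders.

```python
import numpy as np


def pearson(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson's correlation between predictions and targets within a batch."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


def fine_tuning_loss(l_score: float, l_text: float, preds: np.ndarray, targets: np.ndarray,
                     alpha: float, beta: float, lambda_lcc: float) -> float:
    """Equation (22): weighted score and text losses plus lambda_lcc * (1 - rho)."""
    rho = pearson(preds, targets)
    return alpha * l_score + beta * l_text + lambda_lcc * (1.0 - rho)


targets = np.array([0.2, 0.5, 0.7, 0.9])
preds = 0.5 * targets + 0.1  # perfectly correlated but miscalibrated predictions
loss = fine_tuning_loss(0.04, 1.5, preds, targets, alpha=1.0, beta=0.5, lambda_lcc=0.1)
```

In this example the correlation term vanishes because the predictions are an exact linear function of the targets; the remaining miscalibration is left to the MSE term, which is why the two terms are complementary.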

5. Experiments and Analysis

5.1. Experimental Setup

To validate the effectiveness of MEMA, we conducted experiments on the VMAE-Sets dataset. Since no existing models are tailored for soundtrack aesthetic evaluation, we compare MEMA against several categories of baselines: (1) a no-reference audio quality model (MOSNet); (2) two multimodal video quality models (FastVQA and PTM-VQA); (3) a general-purpose multimodal alignment model (ImageBind), which learns a joint embedding space across six modalities including audio and video; and (4) a video–music matching model (VidMuse), a state-of-the-art approach for video-to-music generation and retrieval tasks. MOSNet is only evaluated on TI, as NEC and TIO require video processing. FastVQA and PTM-VQA were originally designed for single-score assessments, while ImageBind and VidMuse were designed for crossmodal alignment rather than aesthetic scoring. We adapt all multimodal models by adding three independent score heads while preserving their original feature extraction structures. The evaluation metrics include Pearson’s Linear Correlation Coefficient (LCC), Spearman’s Rank Correlation Coefficient (SRCC), Kendall’s Rank Correlation Coefficient (KTAU), the Mean Squared Error (MSE), and the Mean Absolute Error (MAE). These metrics measure the correlation and error between predicted scores and human annotations. The training batch size is 32. Results are shown in Table 2.
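For reference, the five metrics can be computed as in the sketch below. It is a simplified NumPy version that assumes no tied values (the rank helper does not average ties, and the Kendall coefficient is the tau-a variant); a statistics library with tie handling would normally be preferred for real evaluations. All function names are hypothetical.

```python
import numpy as np


def lcc(pred: np.ndarray, target: np.ndarray) -> float:
    """Pearson's linear correlation coefficient."""
    return float(np.corrcoef(pred, target)[0, 1])


def _ranks(x: np.ndarray) -> np.ndarray:
    """Ranks 0..n-1, assuming all values are distinct."""
    ranks = np.empty(len(x))
    ranks[np.argsort(x)] = np.arange(len(x))
    return ranks


def srcc(pred: np.ndarray, target: np.ndarray) -> float:
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return lcc(_ranks(pred), _ranks(target))


def ktau(pred: np.ndarray, target: np.ndarray) -> float:
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n, s = len(pred), 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign(pred[i] - pred[j]) * np.sign(target[i] - target[j])
    return float(s / (n * (n - 1) / 2))


def mse(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.mean((pred - target) ** 2))


def mae(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.mean(np.abs(pred - target)))


pred = np.array([1.0, 2.0, 3.0, 4.0])
target = np.array([1.0, 3.0, 2.0, 4.0])
results = {f.__name__: f(pred, target) for f in (lcc, srcc, ktau, mse, mae)}
```

The three correlation metrics reward correct ordering, whereas MSE and MAE measure absolute calibration, so a model can rank samples well and still show a large error.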

5.2. Overall Performance Comparison

MEMA outperforms all baseline methods on the correlation metrics (LCC, SRCC, and KTAU) across all three aesthetic dimensions. Compared to VidMuse, the strongest baseline, MEMA achieves substantial improvements: in the NEC dimension, LCC improves by 24.528%, SRCC by 22.760%, and KTAU by 30.824%. In the TI dimension, LCC improves by 17.763%, SRCC by 16.553%, and KTAU by 6.510%. In the TIO dimension, the improvements are 12.121% for LCC, 14.286% for SRCC, and 9.337% for KTAU. Notably, ImageBind and VidMuse outperform the video quality assessment models (FastVQA and PTM-VQA) on most correlation metrics, suggesting that crossmodal alignment capabilities are beneficial for this task. However, these general-purpose multimodal models still fall short of MEMA, as they lack task-specific mechanisms for aesthetic evaluation. VidMuse shows stronger performance than ImageBind, likely due to its specialization in video–music correspondence learning. Nevertheless, MEMA surpasses VidMuse by a clear margin, supporting the effectiveness of our CVAE-based crossmodal imagination module and guided cross-attention alignment mechanism. MEMA shows particularly strong performance in the NEC and TI dimensions, highlighting its advantages in crossmodal narrative consistency and audio–visual fusion quality. Performance in TIO is relatively lower across all models. We attribute this to the nature of thematic originality, which depends heavily on internal musical structure and artistic style. These aspects are more subjective and harder to annotate reliably, resulting in higher variance in the dataset. Nevertheless, all three dimensions maintain relatively high correlation metrics. This suggests that MEMA captures the holistic aesthetic features of film music from multiple perspectives.
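The percentages above are relative gains over VidMuse. As a check, the short sketch below recomputes them from the VidMuse and MEMA rows of Table 2; the dictionary layout and the function name `relative_gain` are illustrative only.

```python
# Correlation metrics copied from Table 2 (VidMuse is the strongest baseline).
VIDMUSE = {
    "NEC": {"LCC": 0.583, "SRCC": 0.558, "KTAU": 0.425},
    "TI": {"LCC": 0.608, "SRCC": 0.586, "KTAU": 0.384},
    "TIO": {"LCC": 0.495, "SRCC": 0.483, "KTAU": 0.332},
}
MEMA = {
    "NEC": {"LCC": 0.726, "SRCC": 0.685, "KTAU": 0.556},
    "TI": {"LCC": 0.716, "SRCC": 0.683, "KTAU": 0.409},
    "TIO": {"LCC": 0.555, "SRCC": 0.552, "KTAU": 0.363},
}


def relative_gain(ours, baseline):
    """Relative improvement over the baseline, in percent."""
    return (ours - baseline) / baseline * 100.0


gains = {
    dim: {metric: relative_gain(MEMA[dim][metric], VIDMUSE[dim][metric])
          for metric in MEMA[dim]}
    for dim in MEMA
}
avg_lcc_gain = sum(g["LCC"] for g in gains.values()) / len(gains)
avg_srcc_gain = sum(g["SRCC"] for g in gains.values()) / len(gains)

for dim, g in gains.items():
    print(dim, {metric: round(value, 3) for metric, value in g.items()})
print("average LCC gain:", round(avg_lcc_gain, 2))
print("average SRCC gain:", round(avg_srcc_gain, 2))
```

Averaging the three per-dimension gains gives roughly 18.14% for LCC and 17.87% for SRCC, in line with the average improvements quoted in the abstract.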

5.3. Ablation

To evaluate the contribution of each key module to the overall performance, we conduct a systematic ablation study on VMAE-Sets. Using the MEMA model as the baseline, we remove or replace individual core components to examine their effects on the aesthetic evaluation tasks. The ablation settings are as follows:

(1) w/o CVAE: The Crossmodal Imagination Module is removed, and the model performs crossmodal mapping using feature concatenation only.
(2) w/o GCAAM: The Guided Cross-Attention Alignment Module is replaced with a standard attention mechanism.
(3) w/o Acoustic Branch: The acoustic feature branch is removed, retaining only the semantic audio encoding pathway.
(4) w/o AV-Fusion: The segment-level audio–visual fusion layer is removed, and the model performs global-level alignment only.
(5) Shared Head: All three dimensions share a single scoring head rather than using independent branches.

All experiments are conducted using the same data split and training procedure. LCC and SRCC are used as the evaluation metrics.

As shown in Table 3, the CVAE has the largest overall impact. Removing it leads to SRCC drops of 46.131%, 33.675%, and 43.295% in the NEC, TI, and TIO dimensions, respectively, highlighting its central role in establishing deep semantic mappings between music and visual content. Interestingly, the GCAAM primarily affects the NEC dimension (25.109% drop) while having minimal impact on TI (0.146%) and TIO (0.383%), suggesting its specialized role in narrative–emotional alignment. Eliminating the Acoustic Branch results in consistent drops of 21.022%, 26.647%, and 24.138%, reflecting its complementary contribution to audio–visual integration quality. Removing AV-Fusion causes substantial degradation across all dimensions, with SRCC decreasing by 32.409%, 36.310%, and 25.670%, respectively; its TI drop is the largest of any variant. The Shared Head configuration has the smallest average effect, causing only 4.526%, 3.075%, and 2.490% degradation, indicating that the three aesthetic dimensions share certain feature patterns.
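The reported percentages are consistent with a relative decrease measured against the full model. Assuming that definition (it is inferred from the values in Table 3 rather than stated in the text), the NEC drop for the w/o CVAE variant works out as:

```latex
\Delta_{d} = \frac{SRCC_{d}^{full} - SRCC_{d}^{ablated}}{SRCC_{d}^{full}} \times 100\%,
\qquad
\Delta_{NEC}^{w/o\,CVAE} = \frac{0.685 - 0.369}{0.685} \times 100\% \approx 46.131\%
```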

5.4. Robustness Analysis

To evaluate the reliability and generalization ability of MEMA in practical scenarios, we assess performance variations under different video and audio quality conditions. All models are trained using the same dataset and settings. During testing, input video or audio signals are subjected to degradation. Video degradation is achieved by downscaling the resolution from the original 480p to 360p (mild) and 240p (severe). Audio degradation is implemented by reducing the bitrate to 128 kbps, 64 kbps, and 32 kbps. These constitute three quality levels: normal, mildly degraded, and severely degraded. As shown in Figure 4, MEMA exhibits the lowest degradation rate at every quality level, and its performance curve remains the most stable among the compared models. This indicates that MEMA maintains higher prediction stability even under low-quality conditions. This robustness makes MEMA more suitable for real-world applications such as soundtrack evaluation and retrieval, where input quality may fluctuate significantly.
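One way to produce such degraded test inputs is sketched below. The paper does not name the tool it used, so the choice of the `ffmpeg` command-line program is an assumption, as are the function name `degrade_command` and the pairing of each video height with one audio bitrate. The function only builds the argument list; it does not run anything.

```python
# Quality levels from Section 5.4. Pairing a video height with an audio
# bitrate per level is an assumption; the text lists the values separately.
QUALITY_LEVELS = {
    "normal": {"height": 480, "audio_kbps": 128},
    "mild": {"height": 360, "audio_kbps": 64},
    "severe": {"height": 240, "audio_kbps": 32},
}


def degrade_command(src, dst, level):
    """Build an ffmpeg argument list that re-encodes `src` at the given level."""
    cfg = QUALITY_LEVELS[level]
    return [
        "ffmpeg", "-y", "-i", src,
        # scale=-2:H keeps the aspect ratio and rounds the width to an even value.
        "-vf", f"scale=-2:{cfg['height']}",
        # Target audio bitrate in kbit/s.
        "-b:a", f"{cfg['audio_kbps']}k",
        dst,
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where ffmpeg is installed. To degrade only one modality, as the text allows, the corresponding option pair can be dropped from the list.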

5.5. Visualization Analysis

To explore the internal working mechanism of MEMA, we visualize its latent space to examine how music and video modalities are aligned. Figure 5 compares the distribution of audio and visual samples before and after CVAE pretraining. Blue circles represent music segments, orange triangles represent video segments, and gray lines connect semantically matched pairs. Before pretraining, audio and video samples form separate clusters with minimal overlap. The long gray lines indicate that semantically related samples are distant in the latent space due to a lack of alignment. After CVAE pretraining, paired audio and video embeddings are located much closer to each other, and the gray lines are significantly shortened. This indicates that the model has learned to map semantically related music and video into neighboring regions, bringing the two modalities into alignment. This alignment facilitates more efficient attention allocation by the downstream GCAAM.

To validate the effectiveness of the GCAAM, we compare its attention patterns with standard cross-attention. In Figure 6 (left side), traditional attention weights are scattered and easily influenced by low-level features such as motion or color. This makes it difficult to focus on narrative-critical moments in the video. The model fails to establish a strong link between music emotion and narrative progression. In contrast, the GCAAM introduces global semantics from the entire soundtrack as guidance. This results in a distinctly different attention pattern. High-weight regions concentrate around key narrative events: 10.3 s for main character appearance, 26.1 s for parent–child interaction, and 66.7 s for crisis exploration. These regions correspond closely with emotional shifts in the music. They form continuous bands of attention over time. The model consistently enhances attention in segments with recurring narrative motifs. This showcases its understanding of long-term musical structure.

5.6. Pairwise Matching Discrimination

We compute the pairwise matching accuracy based on the three scoring heads to evaluate discriminative capability across aesthetic dimensions. For each video segment V_{i}, we construct a ground-truth pair (V_{i}, M_{i}) and a randomly mismatched pair (V_{i}, M_{j}), where j \neq i. From the scoring head for dimension d \in \{NEC, TI, TIO\}, we obtain two scores: s_{i, d}^{match} and s_{i, d}^{mismatch}. If s_{i, d}^{match} > s_{i, d}^{mismatch}, we count this as a correct prediction. The pairwise matching accuracy is the ratio of correct predictions over all test samples. Since MOSNet only outputs scores related to music themes, it is included solely in the TIO comparison. NEC and TI are evaluated using FastVQA, PTM-VQA, and MEMA. As shown in Figure 7, MEMA achieves superior pairwise matching accuracy across all three dimensions compared to baseline methods.
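The procedure above can be sketched in a few lines. Here `score_fn` is a hypothetical stand-in for one scoring head (it takes a video and a music item and returns a score for a single dimension), and the fixed random seed is an assumption added so that the mismatched pairs are reproducible.

```python
import random


def pairwise_matching_accuracy(score_fn, videos, musics, seed=0):
    """Fraction of videos whose true soundtrack outscores a random mismatch.

    videos[i] and musics[i] form the ground-truth pair (V_i, M_i).
    Requires at least two items so that a mismatched index j != i exists.
    """
    rng = random.Random(seed)
    n = len(videos)
    correct = 0
    for i in range(n):
        # Draw a mismatched soundtrack index j != i uniformly at random.
        j = rng.choice([k for k in range(n) if k != i])
        s_match = score_fn(videos[i], musics[i])
        s_mismatch = score_fn(videos[i], musics[j])
        if s_match > s_mismatch:  # strict inequality, as in the text
            correct += 1
    return correct / n
```

Running the function once per dimension, with the corresponding head as `score_fn`, gives the three accuracies compared in Figure 7. A head that cannot tell pairs apart scores near 0.5 when ties are rare, and exactly 0 if it returns a constant, because of the strict inequality.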

5.7. Crossmodal Retrieval on SymMV

To isolate and evaluate the effectiveness of the proposed Conditional Variational Autoencoder (CVAE), we perform a video-to-music retrieval task on the SymMV dataset. Since SymMV contains paired video and background music tracks without aesthetic scores, it is well suited for assessing crossmodal semantic alignment.

Experimental Setup: During training, the CVAE is optimized using the matched video–audio pairs from the SymMV training set to minimize reconstruction loss and KL divergence. During testing, for each query video, we utilize the CVAE to map its visual representation to the latent space z and decode it to generate an “imagined” audio feature \hat{H}_{a}. We then compute the cosine similarity between \hat{H}_{a} and the real audio features H_{a} of all music tracks in the test set database. The performance is measured by Recall at K (R@1, R@5, and R@10) and the Median Rank (MR).
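The ranking step can be sketched as follows, taking the imagined features as given. The function names are illustrative, features are plain lists of floats, and ties are resolved optimistically (the true track's rank counts only tracks with strictly higher similarity), which is an assumption.

```python
import math
from statistics import median


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieval_metrics(imagined, gallery, ks=(1, 5, 10)):
    """Recall@K and Median Rank for video-to-music retrieval.

    imagined[i]: decoded audio feature for query video i.
    gallery[i]:  real audio feature of the track paired with video i.
    """
    ranks = []
    for i, query in enumerate(imagined):
        sims = [cosine(query, track) for track in gallery]
        # Rank of the true track = 1 + number of tracks scoring strictly higher.
        ranks.append(1 + sum(s > sims[i] for s in sims))
    recall = {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return recall, median(ranks)
```

With R@K defined this way, a query counts as a hit when its paired track appears among the K most similar tracks, and MR is the median position of the paired track over all queries.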

Results: We compare our CVAE with a standard dual-encoder model optimized via InfoNCE loss (contrastive learning) and the strong multimodal foundation model ImageBind. As shown in Table 4, MEMA-CVAE achieves the best results at the top of the ranking (an R@1 of 19.5% and an R@5 of 41.8%) and the best Median Rank (7.0). Interestingly, ImageBind performs slightly better in broad retrieval (an R@10 of 54.3% versus 53.6%), likely due to its massive pretraining corpus, whereas our CVAE yields more accurate exact matches. This suggests that generating an “imagined” feature conditioned on the visual input provides a tighter semantic match than simply learning a joint embedding space.

5.8. Inference Efficiency Analysis

To verify that MEMA maintains high inference efficiency while ensuring performance, we analyze the runtime of key components. All experiments are conducted on an RTX 4060 platform. As shown in Table 5, MEMA achieves an average inference time of approximately 2.49 milliseconds per sample. This corresponds to a throughput of about 401 samples per second (listed as FPS in Table 5), which is well above the threshold for real-time processing. The majority of computation time is spent on feature extraction. The Audio Tower and Video Tower (listed as Music Tower and Scene Tower in Table 5) account for approximately 1.27 ms and 0.67 ms, respectively. The Clip Binder module consumes only 0.24 ms due to its lightweight fusion strategy. The GCAAM requires just 0.16 ms to compute. The final aggregation and score prediction modules together take less than 0.16 ms, with temporal attention pooling and the scoring heads each taking roughly 0.08 ms. This indicates that the decision-making phase introduces negligible additional latency. These results demonstrate that MEMA offers a favorable balance between prediction quality and computational cost. This makes it suitable for time-sensitive applications such as real-time soundtrack evaluation and retrieval systems.
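The total latency and throughput follow directly from the per-stage timings. The short sketch below redoes that arithmetic with the values in Table 5; the variable names are illustrative.

```python
# Per-stage latencies in milliseconds, copied from Table 5.
STAGE_MS = {
    "Clip Binder": 0.23567,
    "Music Tower": 1.27006,
    "Scene Tower": 0.66514,
    "GCAAM": 0.16491,
    "Aggregation & Prediction": 0.15790,
}

total_ms = sum(STAGE_MS.values())
throughput = 1000.0 / total_ms  # samples processed per second
shares = {name: ms / total_ms for name, ms in STAGE_MS.items()}
tower_share = shares["Music Tower"] + shares["Scene Tower"]

print(f"total: {total_ms:.5f} ms, throughput: {throughput:.1f} samples/s")
print(f"share spent in the two towers: {tower_share:.1%}")
```

The stages sum to 2.49368 ms, which gives about 401 samples per second, and the two feature-extraction towers account for a little over three quarters of the total.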

5.9. Human Evaluation vs. Model Prediction

To evaluate the reliability of MEMA in real-world perceptual tasks, we conducted a human evaluation involving 20 volunteers with backgrounds in music and film. The volunteers rated 200 video–soundtrack pairs across three dimensions on a scale of 0 to 10. We recorded MEMA predictions for the same samples and performed consistency analysis. As shown in Figure 8, scatter plots across three dimensions show points clustered near the diagonal line, indicating a clear linear relationship. The Pearson correlation is highest on TI at 0.77, suggesting good alignment with human judgment in perceiving technical integration of dialogue, sound effects, and music. NEC shows a correlation of 0.72, indicating strong performance in capturing the emotional atmosphere and narrative alignment. TIO yields a lower correlation of 0.54 but remains statistically significant. This implies moderate ability to identify musical thematicity and originality. This dimension tends to be more dependent on subjective aesthetics and exhibits greater individual variation. The scatter density maps show that both human scores and model predictions concentrate in the 6 to 9 range, forming a slightly right-skewed distribution. This indicates high overall sample quality and close alignment between model and human judgment on high-scoring samples. MEMA demonstrates human-aligned perceptual results across multiple subjective dimensions with stable scoring behavior.

5.10. LLM Training Effectiveness

We adopt VideoLLaMA2 as the baseline model for large language model evaluation. The original video–audio–text generation flow of VideoLLaMA2 is kept frozen during training. Using VMAE-Sets as the training material, we train only the projection heads from the NEC, TI, and TIO vectors to the Qwen2 model space. The comparison results are shown in Figure 9. In its original form, VideoLLaMA2 struggles to capture artistic relationships between soundtrack and the emotional–narrative structure. This limitation may be due to lack of understanding or representation of multimodal aesthetic semantics. With VMAE-Sets and carefully extracted task-specific features, the model generates more detailed and context-aware evaluations. The fine-tuned output demonstrates improved ability to articulate alignment between soundtrack, narrative, and emotional cues. These enriched assessments provide valuable insights for future soundtrack composition. They potentially support more intelligent and perceptually guided music scoring in multimedia applications.

6. Conclusions

This work addresses the challenge of evaluating the aesthetic quality of film soundtracks by introducing a multidimensional assessment framework that accounts for both narrative and technical coherence between music and video. We propose MEMA, a multimodal model that rethinks soundtrack evaluation as a crossmodal alignment problem, guided by long-range musical semantics and clip-level audio–visual interactions.

Our experimental results demonstrate that MEMA outperforms state-of-the-art video and audio quality assessment models across all three proposed dimensions—narrative–emotional congruence (NEC), technical integration (TI), and thematic identity and originality (TIO). Through ablation studies, we validated the critical roles of the CVAE imagination module and guided attention alignment (GCAAM). The model exhibits strong robustness under varying input quality, closely aligns with human judgment in subjective evaluation, and generates semantically rich assessments via a Qwen2-enhanced LLM head.

These findings suggest that multimodal aesthetic evaluation benefits from modeling not only fused sensory signals but also their structured narrative alignment. MEMA sets a foundation for intelligent soundtrack assessment and recommendation and offers new perspectives for future research in computational aesthetics, crossmodal generation, and music-driven video understanding.

Author Contributions

Conceptualization, C.C. and H.Z.; methodology, H.Z., C.C. and M.S.; software, H.Z. and M.S.; validation, M.S., T.C. and L.L.; formal analysis, H.Z. and D.J.; investigation, T.C. and L.L.; resources, X.L. and C.C.; data curation, H.Z., T.C. and L.L.; writing—original draft preparation, H.Z.; writing—review and editing, D.J. and X.L.; visualization, H.Z. and M.S.; supervision, X.L. and D.J.; project administration, X.L.; funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Excellent Young Scientists Fund of the Natural Science Foundation of Shandong Province (grant number: ZR2024YQ071).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in this study.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors upon request.

Acknowledgments

We thank all the volunteers who participated in the music aesthetic evaluation experiments.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

NEC	Narrative–Emotional Congruence
TI	Technical Integration
TIO	Thematic Identity and Originality
MEMA	Multimodal Aesthetic Evaluation of Music
VMAE-Sets	Video–Music Aesthetic Evaluation Datasets
OST	Original Soundtrack
CVAE	Conditional Variational Autoencoder
GCAAM	Guided Cross-Attention Alignment Module
LLM	Large Language Model
LCC	Linear Correlation Coefficient
SRCC	Spearman’s Rank Correlation Coefficient
KTAU	Kendall Rank Correlation Coefficient
MSE	Mean Squared Error
MAE	Mean Absolute Error
MFCC	Mel-frequency Cepstral Coefficients
RMS	Root Mean Square
CNN	Convolutional Neural Network
MLP	Multi-Layer Perceptron
FFN	Feed-Forward Network
PE	Positional Encoding

References

1. Ma, Y.; Feng, K.; Hu, Z.; Wang, X.; Wang, Y.; Zheng, M.; He, X.; Zhu, C.; Liu, H.; He, Y.; et al. Controllable video generation: A survey. arXiv 2025, arXiv:2507.16869.
2. Elmoghany, M.; Rossi, R.; Yoon, S.; Mukherjee, S.; Bakr, E.; Mathur, P.; Wu, G.; Lai, V.D.; Lipka, N.; Zhang, R.; et al. A survey on long-video storytelling generation: Architectures, consistency, and cinematic quality. arXiv 2025, arXiv:2507.07202.
3. Li, S.; Qin, Y.; Zheng, M.; Jin, X.; Liu, Y. Diff-bgm: A diffusion model for video background music generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 27348–27357.
4. Zhuo, L.; Wang, Z.; Wang, B.; Liao, Y.; Bao, C.; Peng, S.; Han, S.; Zhang, A.; Fang, F.; Liu, S. Video background music generation: Dataset, method and evaluation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 15637–15647.
5. Lin, Y.-B.; Tian, Y.; Yang, L.; Bertasius, G.; Wang, H. Vmas: Video-to-music generation via semantic alignment in web music videos. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 1155–1165.
6. Li, S.; Yang, B.; Yin, C.; Sun, C.; Zhang, Y.; Dong, W.; Li, C. Vidmusician: Video-to-music generation with semantic-rhythmic alignment via hierarchical visual features. arXiv 2024, arXiv:2412.06296.
7. Zuo, H.; You, W.; Wu, J.; Ren, S.; Chen, P.; Zhou, M.; Lu, Y.; Sun, L. Gvmgen: A general video-to-music generation model with hierarchical attentions. Proc. AAAI Conf. Artif. Intell. 2025, 39, 23099–23107.
8. Wang, Y.; Li, K.; Li, Y.; He, Y.; Huang, B.; Zhao, Z.; Zhang, H.; Xu, J.; Liu, Y.; Wang, Z.; et al. Internvideo: General video foundation models via generative and discriminative learning. arXiv 2022, arXiv:2212.03191.
9. Zhang, H.; Li, X.; Bing, L. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv 2023, arXiv:2306.02858.
10. Cheng, Z.; Leng, S.; Zhang, H.; Xin, Y.; Li, X.; Chen, G.; Zhu, Y.; Zhang, W.; Luo, Z.; Zhao, D.; et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms. arXiv 2024, arXiv:2406.07476.
11. Wang, Y.; Li, K.; Li, X.; Yu, J.; He, Y.; Chen, G.; Pei, B.; Zheng, R.; Wang, Z.; Shi, Y.; et al. Internvideo2: Scaling foundation models for multimodal video understanding. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 396–416.
12. Madan, N.; Møgelmose, A.; Modi, R.; Rawat, Y.S.; Moeslund, T.B. Foundation models for video understanding: A survey. arXiv 2024, arXiv:2405.03770.
13. He, B.; Li, H.; Jang, Y.K.; Jia, M.; Cao, X.; Shah, A.; Shrivastava, A.; Lim, S.-N. Ma-lmm: Memory-augmented large multimodal model for long-term video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 13504–13514.
14. Li, K.; Li, X.; Wang, Y.; He, Y.; Wang, Y.; Wang, L.; Qiao, Y. Videomamba: State space model for efficient video understanding. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 237–255.
15. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; pp. 8748–8763.
16. Touros, G.; Giannakopoulos, T. Video soundtrack evaluation with machine learning: Data availability, feature extraction, and classification. In Advances in Speech and Music Technology; Signals and Communication Technology; Springer: Cham, Switzerland, 2022; pp. 137–157.
17. Awan, M.; Nadeem, A.; Mustafa, A. Efficient audio-visual fusion for video classification. arXiv 2024, arXiv:2411.05603.
18. Zellers, R.; Lu, J.; Lu, X.; Yu, Y.; Zhao, Y.; Salehi, M.; Kusupati, A.; Hessel, J.; Farhadi, A.; Choi, Y. MERLOT reserve: Neural script knowledge through vision and language and sound. arXiv 2022, arXiv:2201.02639.
19. Arandjelovic, R.; Zisserman, A. Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 609–617.
20. Zhou, Z.; Mei, K.; Lu, Y.; Wang, T.; Rao, F. Harmonyset: A comprehensive dataset for understanding video-music semantic alignment and temporal synchronization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 1–15 June 2025; pp. 3152–3162.
21. Sohn, K.; Lee, H.; Yan, X. Learning structured output representation using deep conditional generative models. Adv. Neural Inf. Process. Syst. 2015, 28, 3483–3491.
22. Organiściak, K.; Borkowski, J. Single-ended quality measurement of a music content via convolutional recurrent neural networks. Metrol. Meas. Syst. 2020, 27, 721–733.
23. Wisnu, D.A.M.G.; Rini, S.; Zezario, R.E.; Wang, H.-M.; Tsao, Y. HAAQI-Net: A non-intrusive neural music audio quality assessment model for hearing aids. arXiv 2024, arXiv:2401.01145.
24. Song, H.; Kim, M.; Park, D.; Shin, Y.; Lee, J.-G. Learning from noisy labels with deep neural networks: A survey. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 8135–8153.
25. Hilmkil, A.; Thomé, C.; Arpteg, A. Perceiving music quality with GANs. arXiv 2020, arXiv:2006.06287.
26. Tjandra, A.; Wu, Y.-C.; Guo, B.; Hoffman, J.; Ellis, B.; Vyas, A.; Shi, B.; Chen, S.; Le, M.; Zacharov, N.; et al. Meta Audiobox aesthetics: Unified automatic quality assessment for speech, music, and sound. arXiv 2025, arXiv:2502.05139.
27. Liu, C.; Wang, H.; Zhao, J.; Zhao, S.; Bu, H.; Xu, X.; Zhou, J.; Sun, H.; Qin, Y. MusicEval: A generative music dataset with expert ratings for automatic text-to-music evaluation. arXiv 2025, arXiv:2501.10811.
28. Yao, J.; Ma, G.; Xue, H.; Chen, H.; Hao, C.; Jiang, Y.; Liu, H.; Yuan, R.; Xu, J.; Xue, W.; et al. SongEval: A benchmark dataset for song aesthetics evaluation. arXiv 2025, arXiv:2505.10793.
29. Kim, Y.E.; Schmidt, E.M.; Migneco, R.; Morton, B.G.; Richardson, P.; Scott, J.; Speck, J.A.; Turnbull, D. Music emotion recognition: A state of the art review. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Utrecht, The Netherlands, 9–13 August 2010; pp. 937–952.
30. Girdhar, R.; El-Nouby, A.; Liu, Z.; Singh, M.; Alwala, K.V.; Joulin, A.; Misra, I. ImageBind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 15180–15190.
31. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 34892–34916.
32. Gan, C.; Huang, D.; Chen, P.; Tenenbaum, J.B.; Torralba, A. Foley music: Learning to generate music from videos. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 758–775.
33. Tian, Z.; Liu, Z.; Yuan, R.; Pan, J.; Huang, X.; Liu, Q.; Tan, X.; Chen, Q.; Xue, W.; Guo, Y. VidMuse: A simple video-to-music generation framework with long-short-term modeling. arXiv 2024, arXiv:2406.04321.
34. Koh, E.Y.; Cheuk, K.W.; Heung, K.Y.; Agres, K.R.; Herremans, D. MERP: A Music Dataset with Emotion Ratings and Raters’ Profile Information. Sensors 2023, 23, 382.
35. Altwlkany, K.; Selmanovic, E.; Delalic, S. Pretrained conformers for audio fingerprinting and retrieval. arXiv 2025, arXiv:2508.11609.
36. Gorbman, C. Unheard Melodies: Narrative Film Music; Indiana University Press: Bloomington, IN, USA, 1987.
37. Chion, M. Audio-Vision: Sound on Screen; Columbia University Press: New York, NY, USA, 1994.
38. Brown, R.S. Overtones and Undertones: Reading Film Music; University of California Press: Berkeley, CA, USA, 1994.
39. Zhou, Z.-H. A brief introduction to weakly supervised learning. Natl. Sci. Rev. 2018, 5, 44–53.
40. Elizalde, B.; Deshmukh, S.; Al Ismail, M.; Wang, H. CLAP: Learning audio concepts from natural language supervision. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–9 June 2023; pp. 1–5.
41. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
42. Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv 2024, arXiv:2409.12191.

Figure 1. LLM-based scoring pipeline for VMAE-Sets. User comments and their “likes” are fed into two complementary LLMs (Qwen and DeepSeek), which generate multiple samples of scores and textual analyses for NEC, TI, and TIO. The confidence-weighted fusion produces final labels and rich explanations for each soundtrack.

Figure 2. Crossmodal Imagination Module architecture. The CVAE-based module enables bidirectional reconstruction between audio and visual modalities through symmetric encoder–decoder paths, learning reversible semantic mappings in a unified latent space.

Figure 3. Processed tokens from the Audio and Video Towers are fed into the GCAAM. Combined with VideoLLaMA2 vectors, the model generates the final scores and text.

Figure 4. Performance under degraded video and audio quality, using average LCC as the evaluation metric.

Figure 5. Distribution of audio and visual samples in the latent space before and after CVAE pretraining.

Figure 6. Attention patterns of the GCAAM and traditional cross-attention.

Figure 7. Pairwise matching accuracy for each dimension.

Figure 8. Scatter plots of “Human Score vs. Model Prediction” across the three dimensions.

Figure 9. Comparison of VideoLLaMA2’s original and fine-tuned performance.

Table 1. Comparison of representative datasets for music evaluation.

Dataset	Total Hours	Video	Pure Audio ¹	Text ²	Evaluation Score	Task Focus
SymMV [4]	76.5	✓	✓	×	×	Video background-music generation
MusicEval [27]	16.67	×	✓	×	✓	Text-to-music evaluation
SongEval [28]	140.32	×	✓	×	✓	Music aesthetic evaluation
MERP [34]	—	×	✓	×	✓	Songs with valence ratings
AES-Natural [26]	∼500	×	✓	×	✓	Audio aesthetic evaluation
HarmonySet [20]	458.8	✓	×	✓	×	Narrative and thematic alignment
VMAE-Sets (ours)	106.8	✓	✓	✓	✓	Soundtrack aesthetic evaluation

Notes: ¹ “Pure Audio” refers to the isolated soundtrack with non-musical elements removed for video-based datasets; for audio-only datasets, it refers to the original audio. ² “Text” refers to aesthetic or quality assessment texts accompanying a music segment.

Table 2. Comparison of Models on Three Aesthetic Dimensions.

Model	Dimension	LCC ↑	SRCC ↑	KTAU ↑	MSE ↓	MAE ↓
MOSNet	TI	0.235	0.356	0.286	0.988	1.132
FastVQA	NEC	0.318	0.317	0.286	1.263	1.135
FastVQA	TIO	0.324	0.452	0.307	0.683	0.637
FastVQA	TI	0.437	0.405	0.274	0.578	0.716
PTM-VQA	NEC	0.213	0.198	0.217	0.432	0.579
PTM-VQA	TIO	0.426	0.384	0.269	0.675	0.762
PTM-VQA	TI	0.332	0.578	0.408	0.473	0.681
ImageBind	NEC	0.518	0.493	0.382	0.618	0.702
ImageBind	TIO	0.463	0.441	0.308	0.628	0.719
ImageBind	TI	0.542	0.508	0.355	0.538	0.701
VidMuse	NEC	0.583	0.558	0.425	0.545	0.632
VidMuse	TIO	0.495	0.483	0.332	0.601	0.638
VidMuse	TI	0.608	0.586	0.384	0.501	0.695
MEMA (ours)	NEC	0.726	0.685	0.556	0.458	0.537
MEMA (ours)	TIO	0.555	0.552	0.363	0.568	0.761
MEMA (ours)	TI	0.716	0.683	0.409	0.472	0.681

↑: higher is better; ↓: lower is better. Bold: best value in that column and dimension (e.g., highest SRCC in NEC).

Table 3. Ablation Study Results.

Model Variant	SRCC (NEC)	SRCC (TI)	SRCC (TIO)	LCC (NEC)	LCC (TI)	LCC (TIO)
MEMA-full	0.685	0.683	0.522	0.726	0.716	0.555
w/o CVAE	0.369	0.453	0.296	0.424	0.472	0.322
w/o GCAAM	0.513	0.682	0.520	0.556	0.710	0.550
w/o Acoustic Branch	0.541	0.501	0.396	0.583	0.512	0.423
w/o AV-Fusion	0.463	0.435	0.388	0.495	0.458	0.416
Shared Head	0.654	0.662	0.509	0.702	0.698	0.537

Table 4. Comparison of Cross-modal Video-Music Retrieval Performance on the SymMV Dataset. Higher Recall (R@K) and lower Median Rank (MR) indicate better performance.

Model	R@1 ↑	R@5 ↑	R@10 ↑	MR ↓
Dual-Encoder (Contrastive)	13.2%	32.5%	48.7%	12.0
ImageBind	17.6%	38.2%	54.3%	9.0
MEMA-CVAE (ours)	19.5%	41.8%	53.6%	7.0

↑: higher is better; ↓: lower is better. Bold: best value in that column.

Table 5. Inference time breakdown of MEMA.

Main Steps	FPS	Milliseconds
Clip Binder		0.23567 ms
Music Tower		1.27006 ms
Scene Tower		0.66514 ms
GCAAM		0.16491 ms
Aggregation & Prediction		0.15790 ms
– Temporal attention pooling		0.07760 ms
– Score prediction head		0.08029 ms
Total Inference Time per sample	401.0	2.49368 ms

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

