MEMA: Multimodal Aesthetic Evaluation of Music in Visual Contexts
Abstract
1. Introduction
- We formulate video soundtrack aesthetic evaluation as a new task. This task aims to assess the artistic coordination between music and visual content. It provides a foundation for automatic quality assessment in film scoring and video production.
- We propose MEMA, a multimodal model that captures audio–visual artistic relationships through crossmodal imagination. The model decomposes aesthetic judgments into three dimensions, narrative–emotional congruence (NEC), technical integration (TI), and thematic identity and originality (TIO), and achieves the strongest performance among all evaluated models, with average improvements of 18.137% in LCC and 17.866% in SRCC over the best-performing baseline.
- We introduce VMAE-Sets, the first large-scale dataset for soundtrack aesthetic evaluation. A weakly supervised pipeline based on large language models enables scalable annotation without expert labeling.
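The reported average improvements can be sanity-checked as a mean of per-dimension relative gains. The sketch below is illustrative only and is not the authors' script. The MEMA scores are the MEMA-full LCC values and the MEMA SRCC values from the results tables, with 0.552 used for SRCC on TIO. The baseline scores are an assumed reading of the comparison table, chosen because they reproduce the reported figures; the names `mema`, `baseline`, and `mean_relative_gain` are hypothetical.

```python
# Per-dimension scores (assumed reading of the results tables).
mema = {
    "LCC": {"NEC": 0.726, "TI": 0.716, "TIO": 0.555},
    "SRCC": {"NEC": 0.685, "TI": 0.683, "TIO": 0.552},
}
baseline = {
    "LCC": {"NEC": 0.583, "TI": 0.608, "TIO": 0.495},
    "SRCC": {"NEC": 0.558, "TI": 0.586, "TIO": 0.483},
}


def mean_relative_gain(ours, ref):
    """Average of (ours - ref) / ref over all dimensions, in percent."""
    gains = [(ours[d] - ref[d]) / ref[d] * 100.0 for d in ours]
    return sum(gains) / len(gains)


avg_gain = {m: mean_relative_gain(mema[m], baseline[m]) for m in mema}
for metric, gain in avg_gain.items():
    print(f"{metric}: {gain:.2f}% average relative improvement")
```

Under these assumptions the two averages agree with the reported 18.137% and 17.866% to about two decimal places; small differences in the last digit depend on whether per-dimension gains are rounded before averaging.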
2. Related Work
2.1. Audio Evaluation
2.2. Soundtrack Understanding
2.3. Video–Text and Video–Music Datasets
2.4. Audio Aesthetic Evaluation Datasets
3. The VMAE-Sets Dataset
3.1. Data Collection
3.2. Video Processing
3.3. Definition of Evaluation Metrics
3.4. Textual Review Processing and Scoring
4. Model Construction
4.1. Feature Extraction and Local Fusion
4.2. Audio Tower and Video Tower
4.3. Crossmodal Imagination Module
4.4. Guided Cross-Attention Alignment Module
4.5. Prediction Heads
4.6. Training Strategy
5. Experiments and Analysis
5.1. Experimental Setup
5.2. Overall Performance Comparison
5.3. Ablation
5.4. Robustness Analysis
5.5. Visualization Analysis
5.6. Pairwise Matching Discrimination
5.7. Crossmodal Retrieval on SymMV
5.8. Inference Efficiency Analysis
5.9. Human Evaluation vs. Model Prediction
5.10. LLM Training Effectiveness
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| NEC | Narrative–Emotional Congruence |
| TI | Technical Integration |
| TIO | Thematic Identity and Originality |
| MEMA | Multimodal Aesthetic Evaluation of Music |
| VMAE-Sets | Video–Music Aesthetic Evaluation Datasets |
| OST | Original Soundtrack |
| CVAE | Conditional Variational Autoencoder |
| GCAAM | Guided Cross-Attention Alignment Module |
| LLM | Large Language Model |
| LCC | Linear Correlation Coefficient |
| SRCC | Spearman’s Rank Correlation Coefficient |
| KTAU | Kendall Rank Correlation Coefficient |
| MSE | Mean Squared Error |
| MAE | Mean Absolute Error |
| MFCC | Mel-frequency Cepstral Coefficients |
| RMS | Root Mean Square |
| CNN | Convolutional Neural Network |
| MLP | Multi-Layer Perceptron |
| FFN | Feed-Forward Network |
| PE | Positional Encoding |
References
- Ma, Y.; Feng, K.; Hu, Z.; Wang, X.; Wang, Y.; Zheng, M.; He, X.; Zhu, C.; Liu, H.; He, Y.; et al. Controllable video generation: A survey. arXiv 2025, arXiv:2507.16869.
- Elmoghany, M.; Rossi, R.; Yoon, S.; Mukherjee, S.; Bakr, E.; Mathur, P.; Wu, G.; Lai, V.D.; Lipka, N.; Zhang, R.; et al. A survey on long-video storytelling generation: Architectures, consistency, and cinematic quality. arXiv 2025, arXiv:2507.07202.
- Li, S.; Qin, Y.; Zheng, M.; Jin, X.; Liu, Y. Diff-bgm: A diffusion model for video background music generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 27348–27357.
- Zhuo, L.; Wang, Z.; Wang, B.; Liao, Y.; Bao, C.; Peng, S.; Han, S.; Zhang, A.; Fang, F.; Liu, S. Video background music generation: Dataset, method and evaluation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 15637–15647.
- Lin, Y.-B.; Tian, Y.; Yang, L.; Bertasius, G.; Wang, H. Vmas: Video-to-music generation via semantic alignment in web music videos. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 1155–1165.
- Li, S.; Yang, B.; Yin, C.; Sun, C.; Zhang, Y.; Dong, W.; Li, C. Vidmusician: Video-to-music generation with semantic-rhythmic alignment via hierarchical visual features. arXiv 2024, arXiv:2412.06296.
- Zuo, H.; You, W.; Wu, J.; Ren, S.; Chen, P.; Zhou, M.; Lu, Y.; Sun, L. Gvmgen: A general video-to-music generation model with hierarchical attentions. Proc. AAAI Conf. Artif. Intell. 2025, 39, 23099–23107.
- Wang, Y.; Li, K.; Li, Y.; He, Y.; Huang, B.; Zhao, Z.; Zhang, H.; Xu, J.; Liu, Y.; Wang, Z.; et al. Internvideo: General video foundation models via generative and discriminative learning. arXiv 2022, arXiv:2212.03191.
- Zhang, H.; Li, X.; Bing, L. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv 2023, arXiv:2306.02858.
- Cheng, Z.; Leng, S.; Zhang, H.; Xin, Y.; Li, X.; Chen, G.; Zhu, Y.; Zhang, W.; Luo, Z.; Zhao, D.; et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms. arXiv 2024, arXiv:2406.07476.
- Wang, Y.; Li, K.; Li, X.; Yu, J.; He, Y.; Chen, G.; Pei, B.; Zheng, R.; Wang, Z.; Shi, Y.; et al. Internvideo2: Scaling foundation models for multimodal video understanding. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 396–416.
- Madan, N.; Møgelmose, A.; Modi, R.; Rawat, Y.S.; Moeslund, T.B. Foundation models for video understanding: A survey. arXiv 2024, arXiv:2405.03770.
- He, B.; Li, H.; Jang, Y.K.; Jia, M.; Cao, X.; Shah, A.; Shrivastava, A.; Lim, S.-N. Ma-lmm: Memory-augmented large multimodal model for long-term video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 13504–13514.
- Li, K.; Li, X.; Wang, Y.; He, Y.; Wang, Y.; Wang, L.; Qiao, Y. Videomamba: State space model for efficient video understanding. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 237–255.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; pp. 8748–8763.
- Touros, G.; Giannakopoulos, T. Video soundtrack evaluation with machine learning: Data availability, feature extraction, and classification. In Advances in Speech and Music Technology; Signals and Communication Technology; Springer: Cham, Switzerland, 2022; pp. 137–157.
- Awan, M.; Nadeem, A.; Mustafa, A. Efficient audio-visual fusion for video classification. arXiv 2024, arXiv:2411.05603.
- Zellers, R.; Lu, J.; Lu, X.; Yu, Y.; Zhao, Y.; Salehi, M.; Kusupati, A.; Hessel, J.; Farhadi, A.; Choi, Y. MERLOT reserve: Neural script knowledge through vision and language and sound. arXiv 2022, arXiv:2201.02639.
- Arandjelovic, R.; Zisserman, A. Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 609–617.
- Zhou, Z.; Mei, K.; Lu, Y.; Wang, T.; Rao, F. Harmonyset: A comprehensive dataset for understanding video-music semantic alignment and temporal synchronization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 3152–3162.
- Sohn, K.; Lee, H.; Yan, X. Learning structured output representation using deep conditional generative models. Adv. Neural Inf. Process. Syst. 2015, 28, 3483–3491.
- Organiściak, K.; Borkowski, J. Single-ended quality measurement of a music content via convolutional recurrent neural networks. Metrol. Meas. Syst. 2020, 27, 721–733.
- Wisnu, D.A.M.G.; Rini, S.; Zezario, R.E.; Wang, H.-M.; Tsao, Y. HAAQI-Net: A non-intrusive neural music audio quality assessment model for hearing aids. arXiv 2024, arXiv:2401.01145.
- Song, H.; Kim, M.; Park, D.; Shin, Y.; Lee, J.-G. Learning from noisy labels with deep neural networks: A survey. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 8135–8153.
- Hilmkil, A.; Thomé, C.; Arpteg, A. Perceiving music quality with GANs. arXiv 2020, arXiv:2006.06287.
- Tjandra, A.; Wu, Y.-C.; Guo, B.; Hoffman, J.; Ellis, B.; Vyas, A.; Shi, B.; Chen, S.; Le, M.; Zacharov, N.; et al. Meta Audiobox aesthetics: Unified automatic quality assessment for speech, music, and sound. arXiv 2025, arXiv:2502.05139.
- Liu, C.; Wang, H.; Zhao, J.; Zhao, S.; Bu, H.; Xu, X.; Zhou, J.; Sun, H.; Qin, Y. MusicEval: A generative music dataset with expert ratings for automatic text-to-music evaluation. arXiv 2025, arXiv:2501.10811.
- Yao, J.; Ma, G.; Xue, H.; Chen, H.; Hao, C.; Jiang, Y.; Liu, H.; Yuan, R.; Xu, J.; Xue, W.; et al. SongEval: A benchmark dataset for song aesthetics evaluation. arXiv 2025, arXiv:2505.10793.
- Kim, Y.E.; Schmidt, E.M.; Migneco, R.; Morton, B.G.; Richardson, P.; Scott, J.; Speck, J.A.; Turnbull, D. Music emotion recognition: A state of the art review. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Utrecht, The Netherlands, 9–13 August 2010; pp. 937–952.
- Girdhar, R.; El-Nouby, A.; Liu, Z.; Singh, M.; Alwala, K.V.; Joulin, A.; Misra, I. ImageBind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 15180–15190.
- Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 34892–34916.
- Gan, C.; Huang, D.; Chen, P.; Tenenbaum, J.B.; Torralba, A. Foley music: Learning to generate music from videos. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 758–775.
- Tian, Z.; Liu, Z.; Yuan, R.; Pan, J.; Huang, X.; Liu, Q.; Tan, X.; Chen, Q.; Xue, W.; Guo, Y. VidMuse: A simple video-to-music generation framework with long-short-term modeling. arXiv 2024, arXiv:2406.04321.
- Koh, E.Y.; Cheuk, K.W.; Heung, K.Y.; Agres, K.R.; Herremans, D. MERP: A Music Dataset with Emotion Ratings and Raters’ Profile Information. Sensors 2023, 23, 382.
- Altwlkany, K.; Selmanovic, E.; Delalic, S. Pretrained conformers for audio fingerprinting and retrieval. arXiv 2025, arXiv:2508.11609.
- Gorbman, C. Unheard Melodies: Narrative Film Music; Indiana University Press: Bloomington, IN, USA, 1987.
- Chion, M. Audio-Vision: Sound on Screen; Columbia University Press: New York, NY, USA, 1994.
- Brown, R.S. Overtones and Undertones: Reading Film Music; University of California Press: Berkeley, CA, USA, 1994.
- Zhou, Z.-H. A brief introduction to weakly supervised learning. Natl. Sci. Rev. 2018, 5, 44–53.
- Elizalde, B.; Deshmukh, S.; Al Ismail, M.; Wang, H. CLAP: Learning audio concepts from natural language supervision. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–9 June 2023; pp. 1–5.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
- Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv 2024, arXiv:2409.12191.

| Dataset | Total Hours | Video | Pure Audio 1 | Text 2 | Evaluation Score | Task Focus |
|---|---|---|---|---|---|---|
| SymMV [4] | 76.5 | ✓ | ✓ | × | × | Video background-music generation |
| MusicEval [27] | 16.67 | × | ✓ | × | ✓ | Text-to-music evaluation |
| SongEval [28] | 140.32 | × | ✓ | × | ✓ | Music aesthetic evaluation |
| MERP [34] | — | × | ✓ | × | ✓ | Songs with valence ratings |
| AES-Natural [26] | ∼500 | × | ✓ | × | ✓ | Audio aesthetic evaluation |
| HarmonySet [20] | 458.8 | ✓ | × | ✓ | × | Narrative and thematic alignment |
| VMAE-Sets (ours) | 106.8 | ✓ | ✓ | ✓ | ✓ | Soundtrack aesthetic evaluation |
| Model | Dimension | LCC ↑ | SRCC ↑ | KTAU ↑ | MSE ↓ | MAE ↓ |
|---|---|---|---|---|---|---|
| MOSNet | TI | 0.235 | 0.356 | 0.286 | 0.988 | 1.132 |
| | NEC | 0.318 | 0.317 | 0.286 | 1.263 | 1.135 |
| FastVQA | TIO | 0.324 | 0.452 | 0.307 | 0.683 | 0.637 |
| | TI | 0.437 | 0.405 | 0.274 | 0.578 | 0.716 |
| | NEC | 0.213 | 0.198 | 0.217 | 0.432 | 0.579 |
| PTM-VQA | TIO | 0.426 | 0.384 | 0.269 | 0.675 | 0.762 |
| | TI | 0.332 | 0.578 | 0.408 | 0.473 | 0.681 |
| | NEC | 0.518 | 0.493 | 0.382 | 0.618 | 0.702 |
| ImageBind | TIO | 0.463 | 0.441 | 0.308 | 0.628 | 0.719 |
| | TI | 0.542 | 0.508 | 0.355 | 0.538 | 0.701 |
| | NEC | 0.583 | 0.558 | 0.425 | 0.545 | 0.632 |
| VidMuse | TIO | 0.495 | 0.483 | 0.332 | 0.601 | 0.638 |
| | TI | 0.608 | 0.586 | 0.384 | 0.501 | 0.695 |
| | NEC | 0.726 | 0.685 | 0.556 | 0.458 | 0.537 |
| MEMA (ours) | TIO | 0.555 | 0.552 | 0.363 | 0.568 | 0.761 |
| | TI | 0.716 | 0.683 | 0.409 | 0.472 | 0.681 |
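
The five columns of the comparison table can be computed from paired predicted and reference scores. The sketch below is a minimal, dependency-free illustration and not the authors' evaluation code. The function names and toy scores are hypothetical, and Kendall's coefficient is implemented as tau-a (no tie correction), which may differ from the variant used in the paper.

```python
import math


def pearson(x, y):
    """Linear correlation coefficient (LCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def average_ranks(v):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    ranks = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """SRCC: Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_a(x, y):
    """KTAU as tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


def mse(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def mae(x, y):
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Hypothetical predicted and reference scores for five clips.
pred = [3.1, 4.0, 2.2, 4.8, 4.1]
target = [3.0, 4.2, 2.5, 4.5, 3.9]
scores = {
    "LCC": pearson(pred, target),
    "SRCC": spearman(pred, target),
    "KTAU": kendall_tau_a(pred, target),
    "MSE": mse(pred, target),
    "MAE": mae(pred, target),
}
print({k: round(v, 3) for k, v in scores.items()})
```

The correlation metrics reward correct ordering of clips, while MSE and MAE penalize absolute score error, which is why a model can rank well on one group and poorly on the other.
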
| Model Variant | SRCC (NEC) | SRCC (TI) | SRCC (TIO) | LCC (NEC) | LCC (TI) | LCC (TIO) |
|---|---|---|---|---|---|---|
| MEMA-full | 0.685 | 0.683 | 0.522 | 0.726 | 0.716 | 0.555 |
| w/o CVAE | 0.369 | 0.453 | 0.296 | 0.424 | 0.472 | 0.322 |
| w/o GCAAM | 0.513 | 0.682 | 0.520 | 0.556 | 0.710 | 0.550 |
| w/o Acoustic Branch | 0.541 | 0.501 | 0.396 | 0.583 | 0.512 | 0.423 |
| w/o AV-Fusion | 0.463 | 0.435 | 0.388 | 0.495 | 0.458 | 0.416 |
| Shared Head | 0.654 | 0.662 | 0.509 | 0.702 | 0.698 | 0.537 |
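
One way to read the ablation table is as the absolute SRCC drop of each variant relative to MEMA-full. The sketch below only re-arranges the SRCC values listed above; the variable names are hypothetical and no new results are introduced.

```python
# SRCC values copied from the ablation table.
srcc = {
    "MEMA-full": {"NEC": 0.685, "TI": 0.683, "TIO": 0.522},
    "w/o CVAE": {"NEC": 0.369, "TI": 0.453, "TIO": 0.296},
    "w/o GCAAM": {"NEC": 0.513, "TI": 0.682, "TIO": 0.520},
    "w/o Acoustic Branch": {"NEC": 0.541, "TI": 0.501, "TIO": 0.396},
    "w/o AV-Fusion": {"NEC": 0.463, "TI": 0.435, "TIO": 0.388},
    "Shared Head": {"NEC": 0.654, "TI": 0.662, "TIO": 0.509},
}

full = srcc["MEMA-full"]
# Absolute drop per dimension for every ablated variant.
drops = {
    variant: {dim: round(full[dim] - vals[dim], 3) for dim in full}
    for variant, vals in srcc.items()
    if variant != "MEMA-full"
}
# Variant whose removal costs the most SRCC summed over the three dimensions.
worst = max(drops, key=lambda v: sum(drops[v].values()))

for variant, d in drops.items():
    print(f"{variant:22s} {d}")
print("largest total drop:", worst)
```

On these numbers, removing the CVAE gives the largest drop on all three dimensions, while removing GCAAM affects NEC (0.172) far more than TI (0.001) or TIO (0.002).
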
| Model | R@1 ↑ | R@5 ↑ | R@10 ↑ | MR ↓ |
|---|---|---|---|---|
| Dual-Encoder (Contrastive) | 13.2% | 32.5% | 48.7% | 12.0 |
| ImageBind | 17.6% | 38.2% | 54.3% | 9.0 |
| MEMA-CVAE (ours) | 19.5% | 41.8% | 53.6% | 7.0 |
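
The retrieval columns can be derived from a query-by-candidate similarity matrix. The sketch below assumes that the true match of query `i` is candidate `i` and that MR denotes the median rank of the true match; both are assumptions, since the table does not define MR. The function name and the toy matrix are hypothetical.

```python
from statistics import median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Return ({k: Recall@k}, median rank) for a square similarity matrix.

    sim[i][j] is the similarity between query i and candidate j, and the
    ground-truth match of query i is assumed to be candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank 1 means no candidate scores strictly higher than the true match.
        ranks.append(1 + sum(1 for s in row if s > row[i]))
    recall = {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return recall, median(ranks)


# Toy 4x4 example: the true matches are ranked 1, 2, 1, and 4.
sim = [
    [0.9, 0.1, 0.2, 0.3],
    [0.8, 0.5, 0.1, 0.2],
    [0.3, 0.2, 0.7, 0.1],
    [0.6, 0.7, 0.8, 0.4],
]
recall, med_rank = retrieval_metrics(sim, ks=(1, 2))
print(recall, med_rank)
```

Higher R@K and lower MR both indicate that the matching soundtrack is retrieved closer to the top of the ranked list.
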
| Main Steps | FPS | Time (ms) |
|---|---|---|
| Clip Binder | | 0.23567 |
| Music Tower | | 1.27006 |
| Scene Tower | | 0.66514 |
| GCAAM | | 0.16491 |
| Aggregation & Prediction | | 0.15790 |
| – Temporal attention pooling | | 0.07760 |
| – Score prediction head | | 0.08029 |
| Total Inference Time per sample | 401.0 | 2.49368 |
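
The total latency and the FPS figure follow directly from the per-stage timings. The sketch below sums the five top-level stages from the table and converts the total to samples per second; the variable names are hypothetical, and the two indented sub-steps are left out because they are already included in "Aggregation & Prediction".

```python
# Per-stage latency in milliseconds, copied from the efficiency table.
stage_ms = {
    "Clip Binder": 0.23567,
    "Music Tower": 1.27006,
    "Scene Tower": 0.66514,
    "GCAAM": 0.16491,
    "Aggregation & Prediction": 0.15790,
}

total_ms = sum(stage_ms.values())
fps = 1000.0 / total_ms  # samples processed per second
bottleneck = max(stage_ms, key=stage_ms.get)
share = stage_ms[bottleneck] / total_ms * 100.0

print(f"total: {total_ms:.5f} ms, throughput: {fps:.1f} FPS")
print(f"slowest stage: {bottleneck} ({share:.1f}% of total)")
```

The Music Tower accounts for roughly half of the per-sample time, so it is the first place to look if lower latency is needed.
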
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Zhang, H.; Chen, C.; Song, M.; Chen, T.; Jiang, D.; Liu, L.; Liu, X. MEMA: Multimodal Aesthetic Evaluation of Music in Visual Contexts. Sensors 2026, 26, 1395. https://doi.org/10.3390/s26041395

