Article

AFCLNet: An Attention and Feature-Consistency-Loss-Based Multi-Task Learning Network for Affective Matching Prediction in Music–Video Clips

1 State Key Laboratory of Media Convergence and Communication, Beijing 100024, China
2 Key Laboratory of Acoustic Visual Technology and Intelligent Control System, Ministry of Culture and Tourism, Beijing 100024, China
3 School of Information and Communication Engineering, Communication University of China, Beijing 100024, China
* Author to whom correspondence should be addressed.
Sensors 2026, 26(1), 123; https://doi.org/10.3390/s26010123
Submission received: 1 December 2025 / Revised: 22 December 2025 / Accepted: 22 December 2025 / Published: 24 December 2025
(This article belongs to the Special Issue Recent Advances in Smart Mobile Sensing Technology)

Abstract

Affective matching prediction between music and video clips is essential for intelligent mobile sensing systems, where multimodal affective cues collected from smart devices must be jointly analyzed for context-aware media understanding. However, traditional approaches relying on single-modality feature extraction struggle to capture complex cross-modal dependencies, resulting in a gap between low-level audiovisual signals and high-level affective semantics. To address these challenges, a dual-driven framework that integrates perceptual characteristics with objective feature representations is proposed for audiovisual affective matching prediction. The framework incorporates fine-grained affective states of audiovisual data to better characterize cross-modal correlations from an emotional distribution perspective. Moreover, a decoupled Deep Canonical Correlation Analysis (DCCA) approach is developed, incorporating discriminative sample-pairing criteria (matched/mismatched data discrimination) and separate modality-specific component extractors, which dynamically refine the feature projection space. To further enhance multimodal feature interaction, an Attention and Feature-Consistency-Loss-Based Multi-Task Learning Network (AFCLNet) is proposed. In addition, a feature-consistency loss function is introduced to impose joint constraints across dual semantic embeddings, ensuring both affective consistency and matching accuracy. Experiments on a self-collected benchmark dataset demonstrate that the proposed method achieves a mean absolute error of 0.109 in music–video matching score prediction, significantly outperforming existing approaches.
Keywords: audiovisual affective matching; DCCA; attention; feature-consistency loss
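
The abstract names a feature-consistency loss that jointly constrains the two modality embeddings inside a multi-task matching-score predictor. As a rough illustration only, and not the paper's implementation, the PyTorch sketch below shows one common way such a combined objective can be formed; the cosine-based consistency term, the L1 regression term, the embedding size, and the weighting factor are all assumptions introduced here for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConsistencyMatchingLoss(nn.Module):
    """Hypothetical multi-task loss: a feature-consistency term that pulls the
    music and video embeddings of a matched pair together, plus a regression
    term on the predicted matching score. The form and weighting are
    illustrative assumptions, not the AFCLNet loss as published."""

    def __init__(self, consistency_weight: float = 0.5):
        super().__init__()
        self.consistency_weight = consistency_weight

    def forward(self, music_emb, video_emb, pred_score, target_score):
        # Feature-consistency term: 1 - cosine similarity between the two
        # modality embeddings (smaller means more consistent representations).
        consistency = 1.0 - F.cosine_similarity(music_emb, video_emb, dim=-1).mean()
        # Matching-score regression term (L1 is used here since the paper
        # reports mean absolute error).
        regression = F.l1_loss(pred_score, target_score)
        return regression + self.consistency_weight * consistency

# Usage sketch with random tensors standing in for network outputs.
music_emb = torch.randn(8, 128)
video_emb = torch.randn(8, 128)
pred_score = torch.rand(8)
target_score = torch.rand(8)
loss = ConsistencyMatchingLoss()(music_emb, video_emb, pred_score, target_score)
```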

Share and Cite

MDPI and ACS Style

Su, Z.; Liu, J.; Zhang, L.; Feng, Y.; Ren, H. AFCLNet: An Attention and Feature-Consistency-Loss-Based Multi-Task Learning Network for Affective Matching Prediction in Music–Video Clips. Sensors 2026, 26, 123. https://doi.org/10.3390/s26010123


