Multimodal Learning for Multimedia Content Analysis and Understanding

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Electronic Multimedia".

Deadline for manuscript submissions: 10 May 2026

Special Issue Editors


Guest Editor
School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
Interests: multimedia computing

Guest Editor
School of Computer and Information Science, Hubei Engineering University, Xiaogan 432000, China
Interests: multimedia computing

Special Issue Information

Dear Colleagues,

With the explosive growth of multimedia data such as images, videos, audio, and text, the demand for effective multimedia content analysis and semantic understanding has become increasingly urgent. Traditional unimodal approaches often struggle to capture the rich, complementary information distributed across modalities. To address this, researchers have increasingly turned to multimodal learning frameworks that support joint representation and interaction among diverse data sources. This shift relies mainly on techniques such as multimodal fusion, alignment, and semantic correlation modeling, which deepen the understanding of multimodal data and improve performance in tasks such as cross-modal retrieval, caption generation, and vision–language inference. However, challenges such as modality imbalance, multimodal misalignment, and limited domain adaptability remain open problems. In light of these challenges, this Special Issue aims to showcase recent progress and emerging trends in multimedia content analysis and understanding, with a particular emphasis on robust, scalable, and generalizable multimodal solutions. We invite contributions that not only propose novel models and algorithms but also address practical deployment issues, offering insight into how multimodal systems can be applied effectively in real-world scenarios.

We look forward to receiving your contributions.

Dr. Donglin Zhang
Dr. Zhen Liu
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 250 words) can be sent to the Editorial Office for assessment.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • multimodal fusion
  • vision–language alignment
  • modality imbalance
  • domain adaptation
  • representation learning
  • multimedia retrieval
  • multimedia intelligence
  • multimedia-related applications

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (3 papers)


Research

25 pages, 1558 KB  
Article
Towards Scalable Monitoring: An Interpretable Multimodal Framework for Migration Content Detection on TikTok Under Data Scarcity
by Dimitrios Taranis, Gerasimos Razis and Ioannis Anagnostopoulos
Electronics 2026, 15(4), 850; https://doi.org/10.3390/electronics15040850 - 17 Feb 2026
Abstract
Short-form video platforms such as TikTok (TikTok Pte. Ltd., Singapore) host large volumes of user-generated, often ephemeral, content related to irregular migration, where relevant cues are distributed across visual scenes, on-screen text, and multilingual captions. Automatically identifying migration-related videos is challenging due to this multimodal complexity and the scarcity of labeled data in sensitive domains. This paper presents an interpretable multimodal classification framework designed for deployment under data-scarce conditions. We extract features from platform metadata, automated video analysis (Google Cloud Video Intelligence), and Optical Character Recognition (OCR) text, and compare text-only, OCR-only, and vision-only baselines against a multimodal fusion approach using Logistic Regression, Random Forest, and XGBoost. In this pilot study, multimodal fusion consistently improves class separation over single-modality models, achieving an F1-score of 0.92 for the migration-related class under stratified cross-validation. Given the limited sample size, these results are interpreted as evidence of feature separability rather than definitive generalization. Feature importance and SHAP analyses identify OCR-derived keywords, maritime cues, and regional indicators as the most influential predictors. To assess robustness under data scarcity, we apply SMOTE to synthetically expand the training set to 500 samples and evaluate performance on a small held-out set of real videos, observing stable results that further support feature-level robustness. Finally, we demonstrate scalability by constructing a weakly labeled corpus of 600 videos using the identified multimodal cues, highlighting the suitability of the proposed feature set for weakly supervised monitoring at scale. Overall, this work serves as a methodological blueprint for building interpretable multimodal monitoring pipelines in sensitive, low-resource settings.
(This article belongs to the Special Issue Multimodal Learning for Multimedia Content Analysis and Understanding)
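The fusion-versus-baseline comparison described in the abstract can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the paper's pipeline: the feature matrices are synthetic stand-ins for the metadata, OCR, and vision features, only logistic regression is shown, and the class weights merely echo the imbalance the paper handles with SMOTE.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for multimodal features: the first 12 columns play
# the role of text/OCR features, the remaining 12 the role of vision
# features; real features would come from platform APIs and OCR output.
X, y = make_classification(n_samples=400, n_features=24, n_informative=8,
                           weights=[0.7, 0.3], random_state=0)
X_text, X_vision = X[:, :12], X[:, 12:]

# Stratified cross-validated F1 for each single modality and for a
# simple early-fusion baseline (feature concatenation).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
f1 = {}
for name, feats in {"text": X_text, "vision": X_vision,
                    "fused": np.hstack([X_text, X_vision])}.items():
    clf = LogisticRegression(max_iter=1000)
    f1[name] = cross_val_score(clf, feats, y, cv=cv, scoring="f1").mean()
print(f1)
```

Concatenation is the simplest fusion scheme; it lets the same classifier weigh cues from either modality, which is the effect the abstract attributes to its fusion approach.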

26 pages, 619 KB  
Article
Benchmarking LLM-as-a-Judge Models for 5W1H Extraction Evaluation
by José Cassola-Bacallao, José Morales-Donaire, Paula Hernández-Montoya and Brian Keith-Norambuena
Electronics 2026, 15(3), 659; https://doi.org/10.3390/electronics15030659 - 3 Feb 2026
Abstract
Evaluating 5W1H (Who, What, When, Where, Why, and How) information extraction systems remains challenging, as traditional information retrieval metrics like ROUGE and BLEU fail to capture semantic accuracy and narrative coherence. The LLM-as-a-Judge paradigm offers a promising alternative, yet systematic comparisons of judge models for this task are lacking. This study benchmarks multiple large language models, including state-of-the-art models such as GPT, Claude, and Gemini, as evaluators of 5W1H extractions from Spanish news articles. We assess judge performance across six quality criteria: Factual Accuracy, Completeness, Relevance and Conciseness, Clarity and Readability, Faithfulness to Source, and Overall Coherence. Our analysis examines inter-judge agreement, score distribution patterns, criterion-level variance, and the relationship between evaluation quality and computational cost. Using two Spanish-language corpora (BASSE and FLARES), we identify which criteria exhibit consistent cross-model agreement and which prove most sensitive to judge selection. The main contribution of this work is providing the first systematic benchmark of LLM-as-a-Judge models for 5W1H extraction evaluation in Spanish, validated against expert journalistic judgment. Results reveal that all evaluated models achieve alignment levels above 90% across all metrics. Specifically, Claude Sonnet 4.5 emerges as the most accurate evaluator with a Global Judgment Acceptance Rate (JAR) of 99.79%. Furthermore, meta-evaluation with human experts demonstrates a substantial inter-annotator agreement of κ=0.6739. Finally, we provide recommendations for judge model selection based on task requirements and resource constraints, contributing practical guidance for researchers implementing LLM-based evaluation pipelines for information extraction tasks.
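The two agreement measures the abstract reports, an acceptance rate against human verdicts and Cohen's κ, can be computed as below. The verdict vectors are invented for illustration, and since the paper's JAR metric is defined in the article itself, plain percent agreement is used here as a stand-in.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary verdicts: 1 = the 5W1H extraction is judged
# acceptable; `human` is the expert annotation, `judge` the LLM verdict.
human = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1])
judge = np.array([1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1])

agreement = (human == judge).mean()      # raw percent agreement
kappa = cohen_kappa_score(human, judge)  # chance-corrected agreement
print(f"agreement={agreement:.3f}, kappa={kappa:.3f}")
```

Cohen's κ discounts the agreement expected by chance, which is why a high raw agreement (here 11/12) maps to a lower κ (here 0.8); that distinction is what makes κ the standard meta-evaluation statistic alongside acceptance rates.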

31 pages, 12343 KB  
Article
Ensemble Clustering Method via Robust Consensus Learning
by Jia Qu, Qidong Dai, Zekang Bian, Jie Zhou and Zhibin Jiang
Electronics 2025, 14(23), 4764; https://doi.org/10.3390/electronics14234764 - 3 Dec 2025
Abstract
Although ensemble clustering methods based on the co-association (CA) matrix have achieved considerable success, they still face the following challenges: (1) in the label space, the noise within the connective matrices and the structural differences between them are often neglected, and (2) the rich structural information inherent in the feature space is overlooked. To address these issues, we propose an Ensemble Clustering Method via Robust Consensus Learning (ECM-RCL). Specifically, for each connective matrix, a symmetric error matrix is first introduced in the label space to characterize the noise. Then, a set of mapping models is designed, each of which processes a denoised connective matrix to recover a reliable consensus matrix. Moreover, multi-order graph structures are introduced into the feature space to enhance the expressiveness of the consensus matrix further. To preserve a clear cluster structure, a theoretical rank constraint with a block-diagonal enhancement property is imposed on the consensus matrix. Finally, spectral clustering is applied to the refined consensus matrix to obtain the final clustering result. Experimental results demonstrate that ECM-RCL achieves superior clustering performance compared to several state-of-the-art methods.
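The baseline co-association scheme that this line of work builds on can be sketched as follows: several base clusterings vote on pairwise co-membership, and spectral clustering is run on the resulting matrix. This omits the paper's actual contributions (error matrices for denoising, multi-order graph structures, and the rank constraint); it only shows the plain CA-matrix pipeline, with synthetic data.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
n, runs = len(X), 10

# Co-association matrix: CA[i, j] = fraction of base clusterings that
# place samples i and j in the same cluster.
CA = np.zeros((n, n))
for seed in range(runs):
    labels = KMeans(n_clusters=3, n_init=1, random_state=seed).fit_predict(X)
    CA += labels[:, None] == labels[None, :]
CA /= runs

# Treat CA as a precomputed affinity and extract the consensus partition.
consensus = SpectralClustering(n_clusters=3, affinity="precomputed",
                               random_state=0).fit_predict(CA)
```

The CA matrix is symmetric with a unit diagonal by construction, which makes it a valid affinity matrix; the paper's contribution is to denoise and structurally enrich this matrix before the final spectral step.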
