Advanced Scene Understanding Methods and Applications in Multi-Modal Data

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Computer Science & Engineering".

Deadline for manuscript submissions: 15 August 2026

Special Issue Editors

School of Software, Shandong University, Ji'nan 250100, China
Interests: autonomous driving; computer vision; deep learning

Guest Editor
School of Mathematics and Computer Science, Quanzhou Normal University, Quanzhou 362000, China
Interests: computer vision; deep learning; remote sensing

Guest Editor
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore
Interests: remote sensing; computer vision; fault diagnosis

Special Issue Information

Dear Colleagues,

Deep-learning-based algorithms have proven versatile and effective across domain-specific tasks such as remote sensing and autonomous driving. Despite these achievements, existing unimodal models still fall short of the diverse requirements of everyday applications. This has spurred research into multimodal pattern recognition, where models exemplified by CLIP have significantly improved multimodal scene understanding. More recently developed Large Multimodal Models (LMMs) such as Gemini (Google) and Sora (OpenAI) further demonstrate strong abilities in comprehending or creating realistic and imaginative videos. Although deep-learning-based multimodal algorithms have attracted widespread attention, they still face numerous challenges when processing dynamic visual scenes.

These challenges include integrating and aligning multimodal information (e.g., video, audio, 3D data, and time-series data), addressing domain shift, handling noisy data and labeling defects, and discovering novel objects or patterns. Furthermore, enforcing temporal consistency and coherence in these algorithms remains a significant challenge for understanding multimodal scenes. This Special Issue aims to provide a platform for researchers to share the latest advances in multimodal model theories, methodologies, and applications. We also cordially invite submissions exploring the potential of multimodal data in enhancing the diversity and inclusivity of scene understanding. Topics of interest include the following:

  • Visual, LiDAR, and radar perception, including 2D/3D object detection and 2D/3D object tracking;
  • Remote-sensing-related tasks;
  • Time-series data prediction/classification;
  • Domain adaptation for classification/detection/segmentation;
  • Scene parsing, semantic segmentation, instance segmentation, and panoptic segmentation;
  • Human-centric visual understanding, human–human/object interaction and understanding, human activity understanding, and human intention modeling;
  • Person re-identification, pose estimation, and part parsing;
  • New benchmark datasets and survey papers related to these topics.

We look forward to receiving your contributions.

Dr. Xiankai Lu
Dr. Yiyou Guo
Dr. Jinsheng Ji
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, proceed to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 250 words) can be sent to the Editorial Office for assessment.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • multimodal data representation and modeling
  • large multimodal models and applications
  • multimodal scene understanding and inference
  • multimodal data alignment and fusion
  • pattern recognition based on time-series data and text–video/text–image data
  • object segmentation, detection, and recognition based on 2D, 3D, and egocentric/exocentric video data

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies is available on the MDPI website.

Published Papers (1 paper)

Research

19 pages, 1198 KB  
Article
GSMTNet: Dual-Stream Video Anomaly Detection via Gated Spatio-Temporal Graph and Multi-Scale Temporal Learning
by Di Jiang, Huicheng Lai, Guxue Gao, Dan Ma and Liejun Wang
Electronics 2026, 15(6), 1200; https://doi.org/10.3390/electronics15061200 - 13 Mar 2026
Abstract
Video Anomaly Detection aims to identify video segments containing abnormal events. Detecting anomalies relies heavily on temporal modeling, particularly when anomalies exhibit only subtle deviations from normal events. However, most existing methods inadequately model the heterogeneity in spatiotemporal relationships, especially the dynamic interactions between human pose and video appearance. To address this, we propose GSMTNet, a dual-stream heterogeneous unsupervised network integrating gated spatio-temporal graph convolution and multi-scale temporal learning. First, we introduce a dynamic graph structure learning module, which leverages gated spatio-temporal graph convolutions with manifold transformations to model latent spatial relationships via human pose graphs. This is coupled with a normalizing flow-based density estimation module to model the probability distribution of normal samples in a latent space. Second, we design a hybrid dilated temporal module that employs multi-scale temporal feature learning to simultaneously capture long- and short-term dependencies, thereby enhancing the separability between normal patterns and potential deviations. Finally, we propose a dual-stream fusion module to hierarchically integrate features learned from pose graphs and raw video sequences, followed by a prediction head that computes anomaly scores from the fused features. Extensive experiments demonstrate state-of-the-art performance, achieving 86.81% AUC on ShanghaiTech and 70.43% on UBnormal, outperforming existing methods in rare anomaly scenarios.