Multimodal Information-Assisted Visual Recognition or Generation

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: 31 October 2025 | Viewed by 3691

Special Issue Editors


Guest Editor
Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China
Interests: computer vision; multimodal information fusion

Guest Editor
College of Computer Science and Technology, Ocean University of China, Qingdao 266100, China
Interests: artificial intelligence; deep learning; computer vision; natural language processing; artificial intelligence oceanography

Guest Editor
Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China
Interests: big data processing; software engineering; federated learning

Special Issue Information

Dear Colleagues,

Multimodal information is widely used in real-world computer vision applications such as human–computer interaction, healthcare (including disease diagnosis), autonomous vehicles, and emotion recognition, and recent techniques have been applied successfully to these problems. This Special Issue aims to collect cutting-edge research at the intersection of computer science and multimodal information processing and to explore advances in visual recognition and generation. From large-model-guided visual interpretation to diffusion-model-based image and video generation, we seek work that applies state-of-the-art techniques to the open challenges in these fields.

Dr. Wenjuan Gong
Prof. Dr. Guoqiang Zhong
Prof. Dr. Weishan Zhang
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • computer vision
  • multimodal information fusion
  • large-model-guided visual interpretation
  • GAN and diffusion-model-based visual generation
  • federated multimodal learning for computer vision
  • quantum multimodal algorithms for computer vision

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad-scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (4 papers)


Research

16 pages, 21785 KiB  
Article
Three-Dimensional Geological Modeling Method Based on Potential Vector Fields
by Peigang Liu, Zheng Li, Gang Yu and Zongmin Li
Appl. Sci. 2025, 15(7), 3594; https://doi.org/10.3390/app15073594 - 25 Mar 2025
Viewed by 180
Abstract
With the development of 3D geological modeling, implicit modeling methods have gradually gained popularity. However, existing potential field methods cannot directly represent unconformable geological interfaces. In response, an implicit modeling method based on a potential vector field is proposed; it generates geological surface models using the potential vector field method and a generalized marching cubes algorithm, and visualizes the modeling results. An experiment on the study area of a mineral deposit produced a consistent 3D geological surface model with no topological errors, demonstrating the effectiveness of the method for surface modeling of unconformable geological interfaces.
(This article belongs to the Special Issue Multimodal Information-Assisted Visual Recognition or Generation)

20 pages, 6721 KiB  
Article
RPS-YOLO: A Recursive Pyramid Structure-Based YOLO Network for Small Object Detection in Unmanned Aerial Vehicle Scenarios
by Penghui Lei, Chenkang Wang and Peigang Liu
Appl. Sci. 2025, 15(4), 2039; https://doi.org/10.3390/app15042039 - 15 Feb 2025
Viewed by 597
Abstract
The rapid advancement of unmanned aerial vehicle (UAV) technology has facilitated its use across a wide range of scenarios. Due to the high mobility and flexibility of drones, the images they capture often exhibit significant scale variations and severe object occlusions, leading to a high density of small objects. However, existing object detection algorithms struggle to detect small objects effectively in cross-scale detection scenarios. To overcome these difficulties, we introduce a new object detection model, RPS-YOLO, based on the YOLOv8 architecture. Unlike existing methods that rely on traditional feature pyramids, our approach introduces a recursive feature pyramid (RFP) structure. This structure performs two rounds of feature extraction, and we reduce one downsampling step in the first round to enhance attention to small objects during cross-scale detection. Additionally, we design a novel attention mechanism that improves feature representation and mitigates feature degradation during convolution by capturing spatial- and channel-specific details. Another key innovation is the proposed Localization IOU (LIOU) loss function for bounding box regression, which accelerates the regression process by incorporating angular constraints. Experiments conducted on the VisDrone-DET2021 and UAVDT datasets show that RPS-YOLO surpasses YOLOv8s, with mAP50 improvements of 8.2% and 3.4%, respectively. Our approach demonstrates that incorporating recursive feature extraction and exploiting detailed information for multi-scale detection significantly improves detection performance, particularly for small objects in UAV images.
(This article belongs to the Special Issue Multimodal Information-Assisted Visual Recognition or Generation)

18 pages, 2980 KiB  
Article
Adaptive Multimodal Fusion with Cross-Attention for Robust Scene Segmentation and Urban Economic Analysis
by Chun Zhong, Shihong Zeng and Hongqiu Zhu
Appl. Sci. 2025, 15(1), 438; https://doi.org/10.3390/app15010438 - 6 Jan 2025
Viewed by 1361
Abstract
With the increasing demand for accurate multimodal data analysis in complex scenarios, existing models often struggle to effectively capture and fuse information across diverse modalities, especially when data include varying scales and levels of detail. To address these challenges, this study presents an enhanced Swin Transformer V2-based model designed for robust multimodal data processing. The method analyzes urban economic activities and spatial layout using satellite and street view images, with applications in traffic flow and business activity intensity, highlighting its practical significance. The model incorporates a multi-scale feature extraction module into the window attention mechanism, combining local and global window attention with adaptive pooling to achieve comprehensive multi-scale feature fusion and representation. This approach enables the model to effectively capture information at different scales, enhancing its expressiveness in complex scenes. Additionally, a cross-attention-based multimodal feature fusion mechanism integrates spatial structure information from scene graphs with Swin Transformer’s image classification outputs. By calculating similarities and correlations between scene graph embeddings and image classifications, this mechanism dynamically adjusts each modality’s contribution to the fused representation, leveraging complementary information for a more coherent multimodal understanding. Compared with the baseline method, the proposed bimodal model performs better, improving accuracy by 3% to reach 91.5%, which supports its effectiveness in processing and fusing multimodal information. These results highlight the advantages of combining multi-scale feature extraction and cross-modal alignment to improve performance on complex multimodal tasks.
(This article belongs to the Special Issue Multimodal Information-Assisted Visual Recognition or Generation)

21 pages, 4434 KiB  
Article
Exploring Gait Recognition in Wild Nighttime Scenes
by Haotian Li, Wenjuan Gong, Yutong Li, Yikai Wu, Kechen Li and Jordi Gonzàlez
Appl. Sci. 2025, 15(1), 350; https://doi.org/10.3390/app15010350 - 2 Jan 2025
Viewed by 770
Abstract
Currently, gait recognition research is gradually expanding from ideal indoor environments to real-world outdoor scenarios. However, recognition scenarios in practical applications are often more complex than those considered in existing studies. For instance, real-world scenarios present multiple influencing factors, such as viewpoint variations and diverse carried items. Notably, many gait recognition tasks occur under low-light conditions at night. At present, research on gait recognition in nocturnal environments is relatively limited, and effective methods for nighttime gait recognition are lacking. To address this gap, this study extends gait recognition research to outdoor nighttime environments and introduces the first wild gait dataset encompassing both daytime and nighttime data, named Gait Recognition of Day and Night (GaitDN). Furthermore, to tackle the challenges posed by low-light conditions and other influencing factors in outdoor nighttime gait recognition, we propose a novel pose-based gait recognition framework called GaitSAT. This framework models the intrinsic correlations of human joints by integrating self-attention and graph convolution modules. We conduct a comprehensive evaluation of the proposed method and existing approaches using both the GaitDN dataset and other available datasets. The proposed GaitSAT achieves state-of-the-art performance on the OUMVLP, GREW, Gait3D, and GaitDN datasets, with Rank-1 accuracies of 60.77%, 57.37%, 22.90%, and 86.24%, respectively. Experimental results demonstrate that GaitSAT achieves higher accuracy and superior generalization capabilities compared to state-of-the-art pose-based methods.
(This article belongs to the Special Issue Multimodal Information-Assisted Visual Recognition or Generation)