Applied Multimodal AI: Methods and Applications Across Domains

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: 20 July 2026

Special Issue Editors

Dr. Fei Shen
Guest Editor
Sea-NExT Joint Lab, National University of Singapore, Singapore 119077, Singapore
Interests: large model safety; image/video generation and editing

Dr. JungHwan Oh
Guest Editor
Department of Computer Science and Engineering, University of North Texas, Denton, TX 76203, USA
Interests: medical image and video analysis

Special Issue Information

Dear Colleagues,

This Special Issue highlights practical advances in multimodal AI—systems that jointly reason over text, images, audio, video, time-series, and structured data to solve real-world problems across domains. We welcome application-driven studies that demonstrate measurable impacts in healthcare (medical imaging + EHR text), robotics and embodied AI (vision–language–action), autonomous driving (camera–LiDAR–map fusion), education and accessibility (speech–vision for assistive tools), and creative industries (audio-visual understanding).

Submissions may cover fusion architectures (early/intermediate/late), cross-modal alignment and grounding, retrieval-augmented multimodal reasoning, controllability and tool use, and on-device or streaming inference under resource constraints. We particularly encourage works that address robustness to domain shift, privacy and security, fairness across languages and geographies, and human-in-the-loop evaluation. Contributions can include original research, systems papers, datasets/benchmarks, comprehensive evaluations, and application case studies with open resources. The goal is to provide a clear picture of how multimodal methods translate from algorithms to dependable products, establishing best practices for data curation, fusion design, deployment, and rigorous, task-relevant metrics.
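
To make the fusion taxonomy above concrete, here is a minimal, illustrative PyTorch sketch of early, intermediate, and late fusion for a two-modality classifier. It is not drawn from any submission; every module name, dimension, and the late-fusion averaging rule is an assumption chosen for exposition.

    import torch
    import torch.nn as nn

    class TwoModalityClassifier(nn.Module):
        """Illustrative only: early / intermediate / late fusion of two modalities."""
        def __init__(self, d_img=512, d_txt=256, n_classes=10, fusion="intermediate"):
            super().__init__()
            self.fusion = fusion
            self.img_enc = nn.Linear(d_img, 128)   # stand-in for a vision encoder
            self.txt_enc = nn.Linear(d_txt, 128)   # stand-in for a text encoder
            if fusion == "early":
                # Fuse raw features before any modality-specific encoding.
                self.head = nn.Linear(d_img + d_txt, n_classes)
            elif fusion == "intermediate":
                # Encode each modality, then fuse the intermediate embeddings.
                self.head = nn.Linear(128 + 128, n_classes)
            else:
                # Late fusion: independent per-modality predictions, averaged.
                self.img_head = nn.Linear(128, n_classes)
                self.txt_head = nn.Linear(128, n_classes)

        def forward(self, img, txt):
            if self.fusion == "early":
                return self.head(torch.cat([img, txt], dim=-1))
            zi, zt = self.img_enc(img), self.txt_enc(txt)
            if self.fusion == "intermediate":
                return self.head(torch.cat([zi, zt], dim=-1))
            return 0.5 * (self.img_head(zi) + self.txt_head(zt))

In practice the choice hinges on where the complementary information lives: early fusion preserves raw cross-modal correlations, intermediate fusion is the most common compromise, and late fusion degrades most gracefully when a modality is missing.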

Dr. Fei Shen
Dr. JungHwan Oh
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 250 words) can be sent to the Editorial Office for assessment.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • large model safety
  • image/video generation and editing
  • medical image and video analysis

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (4 papers)

Research

25 pages, 11626 KB  
Article
Rethinking Visual Attention for Reducing Hallucination in Large Vision–Language Models
by Xuewen Li and Yuan Liu
Appl. Sci. 2026, 16(9), 4143; https://doi.org/10.3390/app16094143 - 23 Apr 2026
Abstract
Large Vision–Language Models (LVLMs) have achieved strong performance in multimodal understanding and generation. However, they remain prone to hallucination, where generated content deviates from the visual input, reducing output reliability. We analyze the attention mechanism and identify two key issues in visual information use. The model exhibits insufficient overall attention to visual tokens and weak or dispersed attention to semantically relevant regions, limiting effective visual grounding. We propose a tuning-free attention intervention method applied at inference time. In the encoding stage, we apply a structured rescaling to the attention logits associated with visual tokens, introducing a structural bias in the visual subspace. In the decoding stage, we filter attention heads based on their response magnitudes and perform weighted aggregation using their global response intensities. This design reinforces salient visual evidence while suppressing weak or diffuse attention patterns. Experiments on CHAIR and POPE show that our method reduces hallucination without additional training. On the CHAIR benchmark, it reduces the sentence-level metric by 15.5% and the instance-level metric by 5.7% on average, while consistently improving performance across multiple LVLMs and maintaining strong results on general multimodal benchmarks such as MME.
(This article belongs to the Special Issue Applied Multimodal AI: Methods and Applications Across Domains)
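
The abstract gives no equations or code, so the following is only a speculative sketch of the two interventions it names — an encoding-stage bias on visual-token attention logits and a decoding-stage filtering and response-weighted aggregation of attention heads. The additive bias form, the top-k rule, and all tensor shapes are assumptions, not the authors' method.

    import torch

    def bias_visual_logits(attn_logits, visual_mask, delta=1.0):
        """Encoding stage: push attention mass toward visual tokens.

        attn_logits: (heads, q_len, k_len) pre-softmax scores
        visual_mask: (k_len,) bool, True at visual-token positions
        The paper calls this a "structured rescaling"; a simple additive
        bias is used here as a stand-in because the exact rule is not given.
        """
        biased = attn_logits.clone()
        biased[..., visual_mask] += delta
        return biased

    def aggregate_filtered_heads(head_outputs, visual_response, top_k=8):
        """Decoding stage: keep the heads that respond most strongly to
        visual tokens and aggregate them weighted by response intensity.

        head_outputs:    (heads, q_len, dim) per-head outputs
        visual_response: (heads,) mean attention mass on visual tokens
        """
        idx = visual_response.topk(top_k).indices
        w = torch.softmax(visual_response[idx], dim=0)        # (top_k,)
        return (w[:, None, None] * head_outputs[idx]).sum(dim=0)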

19 pages, 2502 KB  
Article
Automatic Sleep Staging with Long-Term Temporal Modeling Using Single-Channel EEG
by Qiyu Yang, Dejun Zhang and Yi Huang
Appl. Sci. 2026, 16(9), 4092; https://doi.org/10.3390/app16094092 - 22 Apr 2026
Abstract
With the increasing demand for sleep health monitoring, automatic sleep staging using single-channel electroencephalogram (EEG) signals has become increasingly prominent due to its clinical practicality. Existing methods have achieved notable progress, but they often fail to adequately capture long-term temporal dependencies and struggle to characterize transition phases. We propose SleepLT, an automated sleep staging framework that integrates multi-scale wavelet decomposition (MWD) and multi-head latent Fourier attention (MLFA). The MLFA module incorporates Fourier analysis into self-attention mechanisms and employs a partially weight-sharing bottleneck to optimize Key/Value generation, effectively capturing sleep rhythms. Extensive experiments on SleepEDF-78 and SHHS datasets demonstrate strong and consistent performance, with Macro F1 improvements of 2.1–3.2% over the compared baselines. Visualizations confirm that SleepLT enhances inter-class discriminability between sleep stages, robustly detects salient waveforms, and effectively captures transitions through long-sequence modeling. These results indicate that SleepLT is effective for automatic sleep staging from single-channel EEG, particularly in improving the recognition of ambiguous transitional stages such as N1 and REM.
(This article belongs to the Special Issue Applied Multimodal AI: Methods and Applications Across Domains)
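
MLFA is described only at a high level, so the block below is a speculative stand-in rather than the published design: an FNet-style spectral view of the input feeds a shared low-rank bottleneck from which Keys and Values are generated (a guess at the "partially weight-sharing bottleneck"), while Queries come from the raw sequence. All dimensions, and the use of the real FFT component, are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LatentFourierAttention(nn.Module):
        """Speculative stand-in for MLFA, not the published design."""
        def __init__(self, d_model=128, n_heads=4, d_latent=32):
            super().__init__()
            self.n_heads, self.d_head = n_heads, d_model // n_heads
            self.wq = nn.Linear(d_model, d_model)
            self.w_latent = nn.Linear(d_model, d_latent)  # shared K/V bottleneck
            self.wk = nn.Linear(d_latent, d_model)
            self.wv = nn.Linear(d_latent, d_model)
            self.wo = nn.Linear(d_model, d_model)

        def forward(self, x):                             # x: (batch, time, d_model)
            b, t, d = x.shape
            # A spectral view along time exposes rhythmic structure (e.g.
            # spindles); the real part keeps shapes and dtypes unchanged.
            x_freq = torch.fft.fft(x, dim=1).real
            lat = self.w_latent(x_freq)                   # (b, t, d_latent)
            def split(z):
                return z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
            q, k, v = split(self.wq(x)), split(self.wk(lat)), split(self.wv(lat))
            out = F.scaled_dot_product_attention(q, k, v)  # (b, heads, t, d_head)
            return self.wo(out.transpose(1, 2).reshape(b, t, d))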

13 pages, 2368 KB  
Article
DGE-YOLO: Dual-Branch Gathering and Attention for Efficient Accurate UAV Object Detection
by Kunwei Lv, Zhiren Xiao, Hang Ren, Xiali Li and Ping Lan
Appl. Sci. 2026, 16(8), 4004; https://doi.org/10.3390/app16084004 - 20 Apr 2026
Abstract
The rapid proliferation of unmanned aerial vehicles (UAVs) has amplified the need for robust and efficient object detection in diverse aerial environments. However, detecting small objects under complex conditions (e.g., low illumination, cluttered backgrounds, and thermal–visual discrepancies) remains challenging. While many existing detectors emphasize real-time inference, they often rely on weak or late fusion strategies, resulting in suboptimal utilization of complementary multi-modal cues. To address this limitation, we propose DGE-YOLO, an enhanced YOLO-based framework for effective infrared–visible (IR–RGB) multi-modal fusion in UAV object detection. DGE-YOLO adopts a dual-branch architecture for modality-specific feature extraction, preserving modality-aware representations before fusion. To strengthen cross-scale semantics, we introduce an Efficient Multi-scale Attention (EMA) module that improves feature discrimination across spatial resolutions. Furthermore, we replace the conventional neck with a Gather-and-Distribute module to reduce information loss during feature aggregation and improve multi-scale feature propagation. Extensive experiments on the DroneVehicle dataset demonstrate that DGE-YOLO consistently outperforms state-of-the-art baselines, confirming its effectiveness and practicality as an applied multi-modal detection solution for UAV scenarios.
(This article belongs to the Special Issue Applied Multimodal AI: Methods and Applications Across Domains)
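
DGE-YOLO's EMA and Gather-and-Distribute modules are not reproduced here; the snippet below is only a generic sketch of the dual-branch idea the abstract describes — separate RGB and IR stems whose features are concatenated and reweighted by a lightweight channel attention before projection. Channel counts and the attention form are assumptions.

    import torch
    import torch.nn as nn

    class DualBranchFusion(nn.Module):
        """Generic sketch of modality-specific stems with attention-weighted fusion."""
        def __init__(self, c_out=64):
            super().__init__()
            # Separate stems preserve modality-aware features before fusion.
            self.rgb_stem = nn.Sequential(nn.Conv2d(3, c_out, 3, 2, 1), nn.SiLU())
            self.ir_stem = nn.Sequential(nn.Conv2d(1, c_out, 3, 2, 1), nn.SiLU())
            # Channel attention decides how much each fused channel contributes.
            self.attn = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(2 * c_out, 2 * c_out, 1),
                nn.Sigmoid(),
            )
            self.proj = nn.Conv2d(2 * c_out, c_out, 1)

        def forward(self, rgb, ir):              # rgb: (b,3,H,W), ir: (b,1,H,W)
            f = torch.cat([self.rgb_stem(rgb), self.ir_stem(ir)], dim=1)
            return self.proj(f * self.attn(f))   # attention-reweighted fusion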

19 pages, 7295 KB  
Article
Video Identifying and Eraser: Use Multi-Task Cascaded Convolutional Neural Network to Enhance Safety in a Text-to-Video Diffusion Model
by Shuang Lin, Ranran Zhou and Yong Wang
Appl. Sci. 2026, 16(6), 2995; https://doi.org/10.3390/app16062995 - 20 Mar 2026
Abstract
Current security solutions predominantly rely on cloud-based implementations, often neglecting computational resource constraints and operational efficiency. While contemporary methodologies typically require additional training, the few that operate without retraining frequently yield suboptimal performance. To address these limitations, this work leverages a pre-trained MTCNN architecture to detect faces of copyright-protected individuals. We construct a facial landmark database comprising five critical fiducial points, which serves as a supplementary module integrated into the stable diffusion framework, enabling real-time security filtering for synthesized video content. The proposed system utilizes MTCNN models pre-trained in the cloud to build a repository of copyrighted facial signatures, generating a geometric parameter database of facial landmarks. This database, coupled with a parallel verification unit, functions as a plugin within the standard Stable Diffusion pipeline. By leveraging Stable Diffusion’s native decoder, we decode stochastic frames from the U-Net latent representations and perform real-time comparative analysis to identify potential copyright violations in generated video sequences. Upon detecting an infringement, an on-screen display (OSD) alert notifies the user and immediately halts the text-to-video (T2V) generation process. Experimental evaluations demonstrate that our framework effectively mitigates the resource constraints and latency issues inherent in edge deployment scenarios of prior security implementations. Leveraging MTCNN’s proven robustness and extensive edge compatibility for facial recognition, the proposed detection and obfuscation plugin integrates seamlessly with Stable Diffusion while preserving generation quality.
(This article belongs to the Special Issue Applied Multimodal AI: Methods and Applications Across Domains)
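
As a rough illustration of the landmark-matching idea (not the authors' implementation), this sketch uses the facenet-pytorch MTCNN to extract the five fiducial points from a decoded frame, normalizes them to the face box, and compares them against a database of protected signatures. The normalization, the distance metric, and the threshold are all assumptions.

    import numpy as np
    from facenet_pytorch import MTCNN  # pip install facenet-pytorch

    mtcnn = MTCNN(keep_all=True)

    def normalized_landmarks(img):
        """Per-face 5x2 landmark arrays, normalized to the detected face box."""
        boxes, _, lms = mtcnn.detect(img, landmarks=True)
        if boxes is None:
            return []
        out = []
        for box, lm in zip(boxes, lms):
            x1, y1, x2, y2 = box
            scale = np.array([x2 - x1, y2 - y1])
            out.append((lm - np.array([x1, y1])) / scale)  # box-relative coords
        return out

    def frame_violates(img, protected_db, threshold=0.08):
        """True if any face's landmark geometry is close to a protected
        signature (each entry a (5, 2) normalized array)."""
        for lm in normalized_landmarks(img):
            for sig in protected_db:
                if np.linalg.norm(lm - sig) < threshold:
                    return True
        return False

In the pipeline the abstract describes, such a check would run on frames decoded from the U-Net latents during generation, so a match can raise the OSD alert and halt T2V synthesis before the video is delivered.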