Applied Multimodal AI: Methods and Applications Across Domains

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: 20 July 2026

Special Issue Editors

Dr. Fei Shen
Guest Editor
Sea-NExT Joint Lab, National University of Singapore, Singapore 119077, Singapore
Interests: large model safety; image/video generation and editing

Dr. JungHwan Oh
Guest Editor
Department of Computer Science and Engineering, University of North Texas, Denton, TX 76203, USA
Interests: medical image and video analysis

Special Issue Information

Dear Colleagues,

This Special Issue highlights practical advances in multimodal AI—systems that jointly reason over text, images, audio, video, time-series, and structured data to solve real-world problems across domains. We welcome application-driven studies that demonstrate measurable impacts in healthcare (medical imaging + EHR text), robotics and embodied AI (vision–language–action), autonomous driving (camera–LiDAR–map fusion), education and accessibility (speech–vision for assistive tools), and creative industries (audio-visual understanding).

Submissions may cover fusion architectures (early/intermediate/late), cross-modal alignment and grounding, retrieval-augmented multimodal reasoning, controllability and tool use, and on-device or streaming inference under resource constraints. We particularly encourage works that address robustness to domain shift, privacy and security, fairness across languages and geographies, and human-in-the-loop evaluation. Contributions can include original research, systems papers, datasets/benchmarks, comprehensive evaluations, and application case studies with open resources. The goal is to provide a clear picture of how multimodal methods translate from algorithms to dependable products, establishing best practices for data curation, fusion design, deployment, and rigorous, task-relevant metrics.
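
To make the fusion taxonomy above concrete, here is a minimal, illustrative PyTorch sketch of early, intermediate, and late fusion for a two-modality classifier. It is not drawn from any submission; every module name, dimension, and the late-fusion averaging rule is an assumption chosen for exposition.

    import torch
    import torch.nn as nn

    class TwoModalityClassifier(nn.Module):
        """Illustrative only: early / intermediate / late fusion of two modalities."""
        def __init__(self, d_img=512, d_txt=256, n_classes=10, fusion="intermediate"):
            super().__init__()
            self.fusion = fusion
            self.img_enc = nn.Linear(d_img, 128)   # stand-in for a vision encoder
            self.txt_enc = nn.Linear(d_txt, 128)   # stand-in for a text encoder
            if fusion == "early":
                # Fuse raw features before any modality-specific encoding.
                self.head = nn.Linear(d_img + d_txt, n_classes)
            elif fusion == "intermediate":
                # Encode each modality, then fuse the intermediate embeddings.
                self.head = nn.Linear(128 + 128, n_classes)
            else:
                # Late fusion: independent per-modality predictions, averaged.
                self.img_head = nn.Linear(128, n_classes)
                self.txt_head = nn.Linear(128, n_classes)

        def forward(self, img, txt):
            if self.fusion == "early":
                return self.head(torch.cat([img, txt], dim=-1))
            zi, zt = self.img_enc(img), self.txt_enc(txt)
            if self.fusion == "intermediate":
                return self.head(torch.cat([zi, zt], dim=-1))
            return 0.5 * (self.img_head(zi) + self.txt_head(zt))

In practice the choice hinges on where the complementary information lives: early fusion preserves raw cross-modal correlations, intermediate fusion is the most common compromise, and late fusion degrades most gracefully when a modality is missing.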

Dr. Fei Shen
Dr. JungHwan Oh
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 250 words) can be sent to the Editorial Office for assessment.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • large model safety
  • image/video generation and editing
  • medical image and video analysis

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (4 papers)

Research

25 pages, 11626 KB  
Article
Rethinking Visual Attention for Reducing Hallucination in Large Vision–Language Models
by Xuewen Li and Yuan Liu
Appl. Sci. 2026, 16(9), 4143; https://doi.org/10.3390/app16094143 - 23 Apr 2026
Abstract
Large Vision–Language Models (LVLMs) have achieved strong performance in multimodal understanding and generation. However, they remain prone to hallucination, where generated content deviates from the visual input, reducing output reliability. We analyze the attention mechanism and identify two key issues in visual information use. The model exhibits insufficient overall attention to visual tokens and weak or dispersed attention to semantically relevant regions, limiting effective visual grounding. We propose a tuning-free attention intervention method applied at inference time. In the encoding stage, we apply a structured rescaling to the attention logits associated with visual tokens, introducing a structural bias in the visual subspace. In the decoding stage, we filter attention heads based on their response magnitudes and perform weighted aggregation using their global response intensities. This design reinforces salient visual evidence while suppressing weak or diffuse attention patterns. Experiments on CHAIR and POPE show that our method reduces hallucination without additional training. On the CHAIR benchmark, it reduces the sentence-level metric by 15.5% and the instance-level metric by 5.7% on average, while consistently improving performance across multiple LVLMs and maintaining strong results on general multimodal benchmarks such as MME.
(This article belongs to the Special Issue Applied Multimodal AI: Methods and Applications Across Domains)
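
The abstract gives no equations or code, so the following is only a speculative sketch of the two interventions it names — an encoding-stage bias on visual-token attention logits and a decoding-stage filtering and response-weighted aggregation of attention heads. The additive bias form, the top-k rule, and all tensor shapes are assumptions, not the authors' method.

    import torch

    def bias_visual_logits(attn_logits, visual_mask, delta=1.0):
        """Encoding stage: push attention mass toward visual tokens.

        attn_logits: (heads, q_len, k_len) pre-softmax scores
        visual_mask: (k_len,) bool, True at visual-token positions
        The paper calls this a "structured rescaling"; a simple additive
        bias is used here as a stand-in because the exact rule is not given.
        """
        biased = attn_logits.clone()
        biased[..., visual_mask] += delta
        return biased

    def aggregate_filtered_heads(head_outputs, visual_response, top_k=8):
        """Decoding stage: keep the heads that respond most strongly to
        visual tokens and aggregate them weighted by response intensity.

        head_outputs:    (heads, q_len, dim) per-head outputs
        visual_response: (heads,) mean attention mass on visual tokens
        """
        idx = visual_response.topk(top_k).indices
        w = torch.softmax(visual_response[idx], dim=0)        # (top_k,)
        return (w[:, None, None] * head_outputs[idx]).sum(dim=0)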

19 pages, 2502 KB  
Article
Automatic Sleep Staging with Long-Term Temporal Modeling Using Single-Channel EEG
by Qiyu Yang, Dejun Zhang and Yi Huang
Appl. Sci. 2026, 16(9), 4092; https://doi.org/10.3390/app16094092 - 22 Apr 2026
Abstract
With the increasing demand for sleep health monitoring, automatic sleep staging using single-channel electroencephalogram (EEG) signals has become increasingly prominent due to its clinical practicality. Existing methods have achieved notable progress, but they often fail to adequately capture long-term temporal dependencies and struggle to characterize transition phases. We propose SleepLT, an automated sleep staging framework that integrates multi-scale wavelet decomposition (MWD) and multi-head latent Fourier attention (MLFA). The MLFA module incorporates Fourier analysis into self-attention mechanisms and employs a partially weight-sharing bottleneck to optimize Key/Value generation, effectively capturing sleep rhythms. Extensive experiments on SleepEDF-78 and SHHS datasets demonstrate strong and consistent performance, with Macro F1 improvements of 2.1–3.2% over the compared baselines. Visualizations confirm that SleepLT enhances inter-class discriminability between sleep stages, robustly detects salient waveforms, and effectively captures transitions through long-sequence modeling. These results indicate that SleepLT is effective for automatic sleep staging from single-channel EEG, particularly in improving the recognition of ambiguous transitional stages such as N1 and REM.
(This article belongs to the Special Issue Applied Multimodal AI: Methods and Applications Across Domains)
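
MLFA is described only at a high level, so the block below is a speculative stand-in rather than the published design: an FNet-style spectral view of the input feeds a shared low-rank bottleneck from which Keys and Values are generated (a guess at the "partially weight-sharing bottleneck"), while Queries come from the raw sequence. All dimensions, and the use of the real FFT component, are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LatentFourierAttention(nn.Module):
        """Speculative stand-in for MLFA, not the published design."""
        def __init__(self, d_model=128, n_heads=4, d_latent=32):
            super().__init__()
            self.n_heads, self.d_head = n_heads, d_model // n_heads
            self.wq = nn.Linear(d_model, d_model)
            self.w_latent = nn.Linear(d_model, d_latent)  # shared K/V bottleneck
            self.wk = nn.Linear(d_latent, d_model)
            self.wv = nn.Linear(d_latent, d_model)
            self.wo = nn.Linear(d_model, d_model)

        def forward(self, x):                             # x: (batch, time, d_model)
            b, t, d = x.shape
            # A spectral view along time exposes rhythmic structure (e.g.
            # spindles); the real part keeps shapes and dtypes unchanged.
            x_freq = torch.fft.fft(x, dim=1).real
            lat = self.w_latent(x_freq)                   # (b, t, d_latent)
            def split(z):
                return z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
            q, k, v = split(self.wq(x)), split(self.wk(lat)), split(self.wv(lat))
            out = F.scaled_dot_product_attention(q, k, v)  # (b, heads, t, d_head)
            return self.wo(out.transpose(1, 2).reshape(b, t, d))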

13 pages, 2368 KB  
Article
DGE-YOLO: Dual-Branch Gathering and Attention for Efficient Accurate UAV Object Detection
by Kunwei Lv, Zhiren Xiao, Hang Ren, Xiali Li and Ping Lan
Appl. Sci. 2026, 16(8), 4004; https://doi.org/10.3390/app16084004 - 20 Apr 2026
Abstract
The rapid proliferation of unmanned aerial vehicles (UAVs) has amplified the need for robust and efficient object detection in diverse aerial environments. However, detecting small objects under complex conditions (e.g., low illumination, cluttered backgrounds, and thermal–visual discrepancies) remains challenging. While many existing detectors emphasize real-time inference, they often rely on weak or late fusion strategies, resulting in suboptimal utilization of complementary multi-modal cues. To address this limitation, we propose DGE-YOLO, an enhanced YOLO-based framework for effective infrared–visible (IR–RGB) multi-modal fusion in UAV object detection. DGE-YOLO adopts a dual-branch architecture for modality-specific feature extraction, preserving modality-aware representations before fusion. To strengthen cross-scale semantics, we introduce an Efficient Multi-scale Attention (EMA) module that improves feature discrimination across spatial resolutions. Furthermore, we replace the conventional neck with a Gather-and-Distribute module to reduce information loss during feature aggregation and improve multi-scale feature propagation. Extensive experiments on the DroneVehicle dataset demonstrate that DGE-YOLO consistently outperforms state-of-the-art baselines, confirming its effectiveness and practicality as an applied multi-modal detection solution for UAV scenarios.
(This article belongs to the Special Issue Applied Multimodal AI: Methods and Applications Across Domains)
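
DGE-YOLO's EMA and Gather-and-Distribute modules are not reproduced here; the snippet below is only a generic sketch of the dual-branch idea the abstract describes — separate RGB and IR stems whose features are concatenated and reweighted by a lightweight channel attention before projection. Channel counts and the attention form are assumptions.

    import torch
    import torch.nn as nn

    class DualBranchFusion(nn.Module):
        """Generic sketch of modality-specific stems with attention-weighted fusion."""
        def __init__(self, c_out=64):
            super().__init__()
            # Separate stems preserve modality-aware features before fusion.
            self.rgb_stem = nn.Sequential(nn.Conv2d(3, c_out, 3, 2, 1), nn.SiLU())
            self.ir_stem = nn.Sequential(nn.Conv2d(1, c_out, 3, 2, 1), nn.SiLU())
            # Channel attention decides how much each fused channel contributes.
            self.attn = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(2 * c_out, 2 * c_out, 1),
                nn.Sigmoid(),
            )
            self.proj = nn.Conv2d(2 * c_out, c_out, 1)

        def forward(self, rgb, ir):              # rgb: (b,3,H,W), ir: (b,1,H,W)
            f = torch.cat([self.rgb_stem(rgb), self.ir_stem(ir)], dim=1)
            return self.proj(f * self.attn(f))   # attention-reweighted fusion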

19 pages, 7295 KB  
Article
Video Identifying and Eraser: Use Multi-Task Cascaded Convolutional Neural Network to Enhance Safety in a Text-to-Video Diffusion Model
by Shuang Lin, Ranran Zhou and Yong Wang
Appl. Sci. 2026, 16(6), 2995; https://doi.org/10.3390/app16062995 - 20 Mar 2026
Abstract
Current security solutions predominantly rely on cloud-based implementations, often neglecting computational resource constraints and operational efficiency. While contemporary methodologies typically require additional training, the few that operate without retraining frequently yield suboptimal performance. To address these limitations, this work leverages a pre-trained MTCNN architecture to detect faces of copyright-protected individuals. We construct a facial landmark database comprising five critical fiducial points, which serves as a supplementary module integrated into the stable diffusion framework, enabling real-time security filtering for synthesized video content. The proposed system utilizes MTCNN models pre-trained in the cloud to build a repository of copyrighted facial signatures, generating a geometric parameter database of facial landmarks. This database, coupled with a parallel verification unit, functions as a plugin within the standard Stable Diffusion pipeline. By leveraging Stable Diffusion’s native decoder, we decode stochastic frames from the U-Net latent representations and perform real-time comparative analysis to identify potential copyright violations in generated video sequences. Upon detecting an infringement, an on-screen display (OSD) alert notifies the user and immediately halts the text-to-video (T2V) generation process. Experimental evaluations demonstrate that our framework effectively mitigates the resource constraints and latency issues inherent in edge deployment scenarios of prior security implementations. Leveraging MTCNN’s proven robustness and extensive edge compatibility for facial recognition, the proposed detection and obfuscation plugin integrates seamlessly with Stable Diffusion while preserving generation quality.
(This article belongs to the Special Issue Applied Multimodal AI: Methods and Applications Across Domains)
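
As a rough illustration of the landmark-matching idea (not the authors' implementation), this sketch uses the facenet-pytorch MTCNN to extract the five fiducial points from a decoded frame, normalizes them to the face box, and compares them against a database of protected signatures. The normalization, the distance metric, and the threshold are all assumptions.

    import numpy as np
    from facenet_pytorch import MTCNN  # pip install facenet-pytorch

    mtcnn = MTCNN(keep_all=True)

    def normalized_landmarks(img):
        """Per-face 5x2 landmark arrays, normalized to the detected face box."""
        boxes, _, lms = mtcnn.detect(img, landmarks=True)
        if boxes is None:
            return []
        out = []
        for box, lm in zip(boxes, lms):
            x1, y1, x2, y2 = box
            scale = np.array([x2 - x1, y2 - y1])
            out.append((lm - np.array([x1, y1])) / scale)  # box-relative coords
        return out

    def frame_violates(img, protected_db, threshold=0.08):
        """True if any face's landmark geometry is close to a protected
        signature (each entry a (5, 2) normalized array)."""
        for lm in normalized_landmarks(img):
            for sig in protected_db:
                if np.linalg.norm(lm - sig) < threshold:
                    return True
        return False

In the pipeline the abstract describes, such a check would run on frames decoded from the U-Net latents during generation, so a match can raise the OSD alert and halt T2V synthesis before the video is delivered.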