Real-Time Audio, Video and Image Processing: Latest Advances and Prospects

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Computer Science & Engineering".

Deadline for manuscript submissions: closed (15 October 2025) | Viewed by 6103

Special Issue Editors

Dr. Muchao Ye

Guest Editor

Department of Computer Science, University of Iowa, Iowa City, IA 52242, USA
Interests: AI safety and multi-modal learning

Dr. Pan He

Guest Editor

Department of Computer Science and Software Engineering, Auburn University, Auburn, AL 36849, USA
Interests: computer vision; machine learning; deep learning; smart infrastructure; intelligent transportation

Special Issue Information

Dear Colleagues,

The rapid advancements that have been made in the real-time processing of audio, image, and video data are driving significant progress in a variety of applications across industries, from autonomous vehicles and smart cities to healthcare and entertainment. As these technologies continue to evolve, they are increasingly playing a critical role in the development of next-generation systems that require ultra-low latency, high reliability, and efficient processing capabilities.

This Special Issue aims to explore the latest advances and future prospects in real-time audio, image, and video processing. We invite original research articles, comprehensive reviews, and case studies that address the challenges of, propose solutions for, and demonstrate applications of these technologies. We are particularly interested in contributions that highlight innovative approaches and cutting-edge techniques in this fast-evolving field.

Topics of interest include, but are not limited to:

Real-time audio and speech processing techniques;
Real-time image and video processing algorithms;
Applications of AI and machine learning in real-time processing;
5G/6G-enabled real-time communication systems;
Edge computing for real-time audio, image, and video processing;
IoT and wearable devices for multimedia applications;
Low-latency streaming and broadcasting technologies;
Data compression and transmission;
Security and privacy in multimedia processing;
Applications in autonomous vehicles, smart cities, and healthcare.

Dr. Muchao Ye
Dr. Pan He
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

real-time processing
artificial intelligence
machine learning
computer vision

Benefits of Publishing in a Special Issue

Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (4 papers)

Research

17 pages, 1594 KB

Open Access Article

TransMODAL: A Dual-Stream Transformer with Adaptive Co-Attention for Efficient Human Action Recognition

by Majid Joudaki, Mehdi Imani and Hamid R. Arabnia

Electronics 2025, 14(16), 3326; https://doi.org/10.3390/electronics14163326 - 21 Aug 2025

Viewed by 1136

Abstract

Human Action Recognition has seen significant advances through transformer-based architectures, yet achieving a nuanced understanding often requires fusing multiple data modalities. Standard models relying solely on RGB video can struggle with actions defined by subtle motion cues rather than appearance. This paper introduces TransMODAL, a novel dual-stream transformer that synergistically fuses spatiotemporal appearance features from a pre-trained VideoMAE (Video Masked AutoEncoders) backbone with explicit skeletal kinematics from a state-of-the-art pose estimation pipeline (RT-DETR (Real-Time DEtection Transformer) + ViTPose++). We propose two key architectural innovations to enable effective and efficient fusion: a CoAttentionFusion module that facilitates deep, iterative cross-modal feature exchange between the RGB and pose streams, and an efficient AdaptiveSelector mechanism that dynamically prunes less informative spatiotemporal tokens to reduce computational overhead. Evaluated on three challenging benchmarks, TransMODAL demonstrates robust generalization, achieving accuracies of 98.5% on KTH, 96.9% on UCF101, and 84.2% on HMDB51. These results significantly outperform a strong VideoMAE-only baseline and are competitive with state-of-the-art methods, demonstrating the profound impact of explicit pose guidance. TransMODAL presents a powerful and efficient paradigm for composing pre-trained foundation models to tackle complex video understanding tasks by providing a fully reproducible implementation and strong benchmark results.

(This article belongs to the Special Issue Real-Time Audio, Video and Image Processing: Latest Advances and Prospects)

17 pages, 7350 KB

Open Access Article

Lightweight Network for Spoof Fingerprint Detection by Attention-Aggregated Receptive Field-Wise Feature

by Md Al Amin, Naim Reza and Ho Yub Jung

Electronics 2025, 14(9), 1823; https://doi.org/10.3390/electronics14091823 - 29 Apr 2025

Viewed by 1540

Abstract

The spread of biometric systems utilizing fingerprints has increased the need for advanced spoof detection techniques, but training convolutional neural networks (CNNs) with the limited number of images available in fingerprint datasets poses significant challenges. In this paper, we propose a lightweight network architecture which addresses the challenges inherent in small fingerprint datasets by employing a moderately deep network architecture which is sufficient for extracting essential features from fingerprint images. We apply a hyperbolic tangent activation to the final feature map, which has features from local receptive fields, and average the responses into a single value. Thus, our architecture reduces overfitting by increasing the number of effective labels during training. Additionally, the incorporation of the spatial attention module enhances feature representation, culminating in improved accuracy. The evaluation results show that the proposed model, with only 0.14 million parameters, outperforms existing techniques including lightweight models and transfer-learning-based models, achieving superior average test accuracies of 98.30% and 95.57% on the LivDet-2015 and -2017 datasets, respectively. It also delivers state-of-the-art cross-material performance, with corresponding average classification error values of 0.81% and 1.91%, making it highly effective for on-device fingerprint authentication.

(This article belongs to the Special Issue Real-Time Audio, Video and Image Processing: Latest Advances and Prospects)

34 pages, 122053 KB

Open Access Article

Development of a Virtual Environment for Rapid Generation of Synthetic Training Images for Artificial Intelligence Object Recognition

by Chenyu Wang, Lawrence Tinsley and Barmak Honarvar Shakibaei Asli

Electronics 2024, 13(23), 4740; https://doi.org/10.3390/electronics13234740 - 29 Nov 2024

Cited by 1 | Viewed by 1406

Abstract

In the field of machine learning and computer vision, the lack of annotated datasets is a major challenge for model development and accuracy improvement. Synthetic data generation addresses this issue by providing large, diverse, and accurately annotated datasets, thereby enhancing model training and validation. This study presents a Unity-based virtual environment that utilises the Unity Perception package to generate high-quality datasets. First, high-precision 3D (Three-Dimensional) models are created using a 3D structured light scanner, with textures processed to remove specular reflections. These models are then imported into Unity to generate diverse and accurately annotated synthetic datasets. The experimental results indicate that object recognition models trained with synthetic data achieve a high rate of performance on real images, validating the effectiveness of synthetic data in improving model generalisation and application performance. Monocular distance measurement verification shows that the synthetic data closely matches real-world physical scales, confirming its visual realism and physical accuracy.

(This article belongs to the Special Issue Real-Time Audio, Video and Image Processing: Latest Advances and Prospects)

17 pages, 4207 KB

Open Access Article

by Huihui Zhang, Qibing Qin, Meiling Ge and Jianyong Huang

Electronics 2024, 13(22), 4520; https://doi.org/10.3390/electronics13224520 - 18 Nov 2024

Cited by 2 | Viewed by 1423

Abstract

Remote sensing image retrieval (RSIR) plays a crucial role in remote sensing applications, focusing on retrieving a collection of items that closely match a specified query image. Due to the advantages of low storage cost and fast search speed, deep hashing has been one of the most active research problems in remote sensing image retrieval. However, remote sensing images contain many content-irrelevant backgrounds or noises, and they often lack the ability to capture essential fine-grained features. In addition, existing hash learning often relies on random sampling or semi-hard negative mining strategies to form training batches, which could be overwhelmed by some redundant pairs that slow down the model convergence and compromise the retrieval performance. To solve these problems effectively, a novel Deep Multi-similarity Hashing with Spatial-enhanced Learning, termed DMsH-SL, is proposed to learn compact yet discriminative binary descriptors for remote sensing image retrieval. Specifically, to suppress interfering information and accurately localize the target location, by introducing a spatial enhancement learning mechanism, the spatial group-enhanced hierarchical network is firstly designed to learn the spatial distribution of different semantic sub-features, capturing the noise-robust semantic embedding representation. Furthermore, to fully explore the similarity relationships of data points in the embedding space, the multi-similarity loss is proposed to construct informative and representative training batches, which is based on pairwise mining and weighting to compute the self-similarity and relative similarity of the image pairs, effectively mitigating the effects of redundant and unbalanced pairs. Experimental results on three benchmark datasets validate the superior performance of our approach.

(This article belongs to the Special Issue Real-Time Audio, Video and Image Processing: Latest Advances and Prospects)
