AI-Driven Image and Video Understanding

A special issue of Journal of Imaging (ISSN 2313-433X). This special issue belongs to the section "AI in Imaging".

Deadline for manuscript submissions: 31 May 2026 | Viewed by 3985

Special Issue Editors

Dr. Gongyang Li, Guest Editor
School of Communication and Information Engineering, Shanghai University, Shanghai 200444, China
Interests: computer vision; multi-modal data processing; saliency detection; object/semantic segmentation; defect inspection

Dr. Xiaofei Zhou, Guest Editor
School of Automation, Hangzhou Dianzi University, Hangzhou 310000, China
Interests: computer vision; image and video processing; saliency detection; defect detection

Dr. Yong Wu, Guest Editor
Department of Computer Science and Technology, Tongji University, Shanghai 201804, China
Interests: machine learning; meta-learning; domain adaptation; image classification; gaze estimation; point cloud; saliency detection

Special Issue Information

Dear Colleague,

Artificial Intelligence (AI) has revolutionized image and video understanding, evolving from theoretical exploration into a practical cornerstone of critical domains, including medical imaging for disease diagnosis, real-time perception in autonomous vehicles, intelligent content curation on multimedia platforms, and smart surveillance for public safety. This transformation has made AI indispensable, yet it also underscores the need to address emerging challenges in visual data analysis. This Special Issue aims to showcase cutting-edge research at the dynamic intersection of AI and visual content understanding.

Against the backdrop of explosive growth in unstructured visual data (e.g., high-resolution images, long-form videos) and rapid advances in deep learning (e.g., transformer-based architectures), generative models (e.g., diffusion models), and multimodal fusion techniques, we actively seek submissions that push the boundaries of existing knowledge. Focused topics include novel neural architectures for image and video understanding, multi-modal data processing, AI-driven computer vision, and AI-driven vision applications.

We invite submissions that blend rigorous theoretical insights with robust empirical validation, explicitly addressing key challenges facing visual AI systems. By bringing together researchers from computer vision, machine learning, and application domains, this Special Issue seeks to accelerate cross-field knowledge exchange, drive technical innovation, and highlight how AI can unlock richer, more reliable insights from visual data. Ultimately, it aims to shape the future of AI-driven image and video understanding, fostering solutions that deliver tangible societal benefits.

Dr. Gongyang Li
Dr. Xiaofei Zhou
Dr. Yong Wu
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 250 words) can be sent to the Editorial Office for assessment.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Journal of Imaging is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1800 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • image and video understanding
  • AI-driven computer vision
  • AI-driven vision applications
  • multi-modal data processing
  • image and video quality assessment
  • image and video super-resolution
  • object segmentation and detection
  • saliency detection

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (5 papers)


Research

26 pages, 1385 KB  
Article
Probabilistic Short-Term Sky Image Forecasting Using VQ-VAE and Transformer Models on Sky Camera Data
by Chingiz Seyidbayli, Soheil Nezakat and Andreas Reinhardt
J. Imaging 2026, 12(4), 165; https://doi.org/10.3390/jimaging12040165 - 10 Apr 2026
Viewed by 294
Abstract
Cloud cover significantly reduces the electrical power output of photovoltaic systems, making accurate short-term cloud movement predictions essential for reliable solar energy production planning. This article presents a deep learning framework that directly estimates cloud movement from ground-based all-sky camera images, rather than predicting future production from past power data. The system is based on a three-step process: First, a lightweight Convolutional Neural Network segments cloud regions and produces probabilistic masks that represent the spatial distribution of clouds in a compact and computationally efficient manner. This allows subsequent models to focus on the geometry of clouds rather than irrelevant visual features such as illumination changes. Second, a Vector Quantized Variational Autoencoder compresses these masks into discrete latent token sequences, reducing dimensionality while preserving fundamental cloud structure patterns. Third, a GPT-style autoregressive transformer learns temporal dependencies in this token space and predicts future sequences based on past observations, enabling iterative multi-step predictions, where each prediction serves as the input for subsequent time steps. Our evaluations show an average intersection-over-union ratio of 0.92 and a pixel accuracy of 0.96 for single-step (5 s ahead) predictions, while performance smoothly decreases to an intersection-over-union ratio of 0.65 and an accuracy of 0.80 in 10 min autoregressive propagation. The framework also provides prediction uncertainty estimates through token-level entropy measurement, which shows positive correlation with prediction error and serves as a confidence indicator for downstream decision-making in solar energy forecasting applications.
(This article belongs to the Special Issue AI-Driven Image and Video Understanding)
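
The abstract above reports token-level entropy over the predicted VQ tokens as a confidence signal. A minimal numpy sketch of how such a per-token entropy could be computed from transformer logits follows; the codebook size (512) and tensor shapes are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def token_entropy(logits: np.ndarray) -> np.ndarray:
    """Shannon entropy (in nats) of the softmax distribution at each token position."""
    z = logits - logits.max(axis=-1, keepdims=True)        # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)  # softmax over the codebook
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

# Hypothetical shapes: 10 predicted latent tokens over a 512-entry VQ codebook.
rng = np.random.default_rng(0)
logits = rng.normal(size=(10, 512))
entropy = token_entropy(logits)
confidence = 1.0 - entropy / np.log(512)  # normalize by max entropy; 1.0 = fully certain
print(confidence.round(3))
```

In an autoregressive rollout, averaging this per-token confidence over each predicted frame would yield the kind of step-wise uncertainty estimate that the abstract correlates with prediction error.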

29 pages, 4764 KB  
Article
A Two-Level Illumination Correction Network for Digital Meter Reading Recognition in Non-Uniform Low-Light Conditions
by Haoning Fu, Zhiwei Xie, Wenzhu Jiang, Xingjiang Ma and Dongying Yang
J. Imaging 2026, 12(4), 146; https://doi.org/10.3390/jimaging12040146 - 25 Mar 2026
Viewed by 271
Abstract
The automatic reading recognition of digital instruments is crucial for achieving metering automation and intelligent inspection. However, in non-standardized industrial environments, the masking effect caused by the coupling of non-uniform low-light conditions and the reflective surfaces of instrument panels severely degrades the displayed information, significantly limiting the recognition performance. Conventional image processing methods, while aiming to restore the imaging quality of instrument panels through low-light enhancement, inevitably introduce overexposure and indiscriminately amplify background noise during this process. To address the two key challenges of illumination recovery and noise suppression in the process of restoring panel image quality under non-uniform low-light conditions, this paper proposes a coarse-to-fine cascaded perception framework (CFCP). First, a lightweight YOLOv10 detector is employed to coarsely localize the meter reading region under non-uniform illumination conditions. Second, an Adaptive Illumination Correction Module (AICM) is designed to decouple and correct the illumination component at the pixel level, effectively restoring details in dark areas. Then, an Illumination-invariant Feature Perception Module (IFPM) is embedded at the feature level to dynamically perceive illumination-invariant features and filter out noise interference. Finally, the refined detection results are fed into a lightweight sequence recognition network to obtain the final meter readings. Experiments on a self-built industrial digital instrument dataset show that the proposed method achieves 93.2% recognition accuracy, with 17.1 ms latency and only 7.9 M parameters.
(This article belongs to the Special Issue AI-Driven Image and Video Understanding)
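
The AICM described above is a learned module; for orientation only, a classical hand-crafted analogue of pixel-level illumination decoupling is the Retinex-style decomposition sketched below. The Gaussian sigma and gamma values are arbitrary assumptions, and this is not the authors' method.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def correct_illumination(gray: np.ndarray, sigma: float = 25.0,
                         gamma: float = 0.6) -> np.ndarray:
    """Retinex-style correction: estimate a smooth illumination map, divide it
    out, then gamma-lift the remaining reflectance to recover dark regions."""
    img = gray.astype(np.float32) / 255.0
    illumination = gaussian_filter(img, sigma=sigma) + 1e-3  # low-frequency light field
    reflectance = np.clip(img / illumination, 0.0, 2.0) / 2.0
    corrected = np.power(reflectance, gamma)                 # brighten shadows nonlinearly
    return (corrected * 255.0).astype(np.uint8)
```

Dividing out a blurred copy of the image removes the non-uniform light field but also amplifies sensor noise in the darkest regions, which is precisely the trade-off the paper's feature-level IFPM is designed to counteract.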

45 pages, 798 KB  
Article
Real-Time Visual Anomaly Detection in High-Speed Motorsport: An Entropy-Driven Hybrid Retrieval- and Cache-Augmented Architecture
by Rubén Juárez Cádiz and Fernando Rodríguez-Sela
J. Imaging 2026, 12(2), 60; https://doi.org/10.3390/jimaging12020060 - 28 Jan 2026
Viewed by 630
Abstract
At 300 km/h, an end-to-end vision delay of 100 ms corresponds to 8.3 m of unobserved travel; therefore, real-time anomaly monitoring must balance sensitivity with strict tail-latency constraints at the edge. We propose a hybrid cache–retrieval inference architecture for visual anomaly detection in high-speed motorsport that exploits lap-to-lap spatiotemporal redundancy while reserving local similarity retrieval for genuinely uncertain events. The system combines a hierarchical visual encoder (a lightweight backbone with selective refinement via a Nested U-Net for texture-level cues) and an uncertainty-driven router that selects between two memory pathways: (i) a static cache of precomputed scene embeddings for track/background context and (ii) local similarity retrieval over historical telemetry–vision patterns to ground ambiguous frames, improve interpretability, and stabilize decisions under high uncertainty. Routing is governed by an entropy signal computed from prediction and embedding uncertainty: low-entropy frames follow a cache-first path, whereas high-entropy frames trigger retrieval and refinement to preserve decision stability without sacrificing latency. On a high-fidelity closed-circuit benchmark with synchronized onboard video and telemetry and controlled anomaly injections (tire degradation, suspension chatter, and illumination shifts), the proposed approach reduces mean end-to-end latency to 21.7 ms versus 48.6 ms for a retrieval-only baseline (55.3% reduction) while achieving Macro-F1 = 0.89 at safety-oriented operating points. The framework is designed for passive monitoring and decision support, producing advisory outputs without actuating ECU control strategies.
(This article belongs to the Special Issue AI-Driven Image and Video Understanding)
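
The entropy-gated routing described above can be illustrated with a toy gate. Thresholding normalized classifier entropy alone is a simplification (the paper combines prediction and embedding uncertainty), and the threshold value here is an assumption.

```python
import numpy as np

def route_frame(pred_probs: np.ndarray, tau: float = 0.5) -> tuple[str, float]:
    """Entropy-gated routing: low-entropy frames take the fast cache path,
    high-entropy frames trigger similarity retrieval and refinement."""
    entropy = -(pred_probs * np.log(pred_probs + 1e-12)).sum()
    entropy /= np.log(len(pred_probs))  # normalize to [0, 1]
    return ("retrieval" if entropy > tau else "cache"), float(entropy)

print(route_frame(np.array([0.97, 0.02, 0.01])))  # confident frame -> cache path
print(route_frame(np.array([0.40, 0.35, 0.25])))  # ambiguous frame -> retrieval path
```

The abstract's headline numbers check out: at 300 km/h a vehicle covers 300 / 3.6 ≈ 83.3 m/s, so 100 ms of end-to-end delay corresponds to about 8.3 m of unobserved travel, and the reported drop from 48.6 ms to 21.7 ms is indeed a 55.3% reduction.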

26 pages, 9792 KB  
Article
LLM-Based Pose Normalization and Multimodal Fusion for Facial Expression Recognition in Extreme Poses
by Bohan Chen, Bowen Qu, Yu Zhou, Han Huang, Jianing Guo, Yanning Xian, Longxiang Ma, Jinxuan Yu and Jingyu Chen
J. Imaging 2026, 12(1), 24; https://doi.org/10.3390/jimaging12010024 - 4 Jan 2026
Cited by 1 | Viewed by 791
Abstract
Facial expression recognition (FER) technology has progressively matured over time. However, existing FER methods are primarily optimized for frontal face images, and their recognition accuracy degrades significantly when processing profile or large-angle rotated facial images. Consequently, this limitation hinders the practical deployment of FER systems. To mitigate the interference caused by large pose variations and improve recognition accuracy, we propose an FER method based on profile-to-frontal transformation and multimodal learning. Specifically, we first leverage the visual understanding and generation capabilities of Qwen-Image-Edit to transform profile images to frontal viewpoints, preserving key expression features while standardizing facial poses. Second, we introduce the CLIP model to enhance the semantic representation capability of expression features through vision–language joint learning. Qualitative and quantitative experiments on the RAF (89.39%), EXPW (67.17%), and AffectNet-7 (62.66%) datasets demonstrate that our method outperforms existing approaches.
(This article belongs to the Special Issue AI-Driven Image and Video Understanding)
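
The paper's CLIP branch is trained jointly with the frontalization step; as a rough zero-shot illustration of the vision–language half only, one might score a frontalized face crop against expression prompts with an off-the-shelf CLIP checkpoint. The checkpoint name, prompt wording, file path, and seven-class label set are assumptions, not details from the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Off-the-shelf CLIP; the paper fine-tunes its own vision-language model.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

expressions = ["angry", "disgusted", "fearful", "happy", "neutral", "sad", "surprised"]
prompts = [f"a photo of a {e} face" for e in expressions]
image = Image.open("frontalized_face.jpg")  # hypothetical pose-normalized crop

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-to-text similarity scores
probs = logits.softmax(dim=-1)[0]
print({e: round(p.item(), 3) for e, p in zip(expressions, probs)})
```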

22 pages, 18974 KB  
Article
Lightweight 3D CNN for MRI Analysis in Alzheimer’s Disease: Balancing Accuracy and Efficiency
by Kerang Cao, Zhongqing Lu, Chengkui Zhao, Jiaming Du, Lele Li, Hoekyung Jung and Minghui Geng
J. Imaging 2025, 11(12), 426; https://doi.org/10.3390/jimaging11120426 - 28 Nov 2025
Viewed by 1438
Abstract
Alzheimer’s disease (AD) is a progressive neurodegenerative disorder characterized by subtle structural changes in the brain, which can be observed through MRI scans. Although traditional diagnostic approaches rely on clinical and neuropsychological assessments, deep learning-based methods such as 3D convolutional neural networks (CNNs) have recently been introduced to improve diagnostic accuracy. However, their high computational complexity remains a challenge. To address this, we propose a lightweight magnetic resonance imaging (MRI) classification framework that integrates adaptive multi-scale feature extraction with structural pruning and parameter optimization. The pruned model achieves a compact architecture with approximately 490k parameters (0.49 million), 4.39 billion floating-point operations, and a model size of 1.9 MB, while maintaining high classification performance across three binary tasks. The proposed framework was evaluated on the Alzheimer’s Disease Neuroimaging Initiative dataset, a widely used benchmark for AD research. Notably, the model achieves a performance density (PD) of 189.87, where PD is a custom efficiency metric defined as the classification accuracy (%) per million parameters, approximately 70× higher than the base model, reflecting its balance between accuracy and computational efficiency. Experimental results demonstrate that the proposed framework significantly reduces resource consumption without compromising diagnostic performance, providing a practical foundation for real-time and resource-constrained clinical applications in Alzheimer’s disease detection.
(This article belongs to the Special Issue AI-Driven Image and Video Understanding)
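
The custom performance-density metric above is simple to verify: PD is the classification accuracy in percent divided by the parameter count in millions. Back-solving from the reported PD of 189.87 and 0.49 M parameters implies an accuracy of roughly 93%; that accuracy value is an inference for illustration, not a number stated in the abstract.

```python
def performance_density(accuracy_pct: float, params_millions: float) -> float:
    """PD = classification accuracy (%) per million parameters."""
    return accuracy_pct / params_millions

# 189.87 * 0.49 ≈ 93.0, so ~93% accuracy is consistent with the reported PD.
print(performance_density(93.04, 0.49))  # ≈ 189.9
```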