
Image and Video Processing and Recognition Based on Artificial Intelligence: 3rd Edition

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Sensing and Imaging".

Deadline for manuscript submissions: 20 December 2025 | Viewed by 13655

Special Issue Editors


Guest Editor
Division of Electronics and Electrical Engineering, Dongguk University, 30, Pildong-ro 1-gil, Jung-gu, Seoul 04620, Republic of Korea
Interests: deep learning; biometrics; image processing

Special Issue Information

Dear Colleagues,

Recent developments have led to the widespread application of artificial intelligence (AI) and sensing techniques to image and video processing and recognition. Although state-of-the-art technology has matured, its performance is still affected by varying environmental conditions and heterogeneous databases. This Special Issue invites high-quality, state-of-the-art academic papers on challenging issues in the field of AI- and sensing-based image and video processing and recognition. We solicit original papers reporting completed, unpublished research that is not currently under review by any other conference, magazine, or journal. Topics of interest include, but are not limited to, the following:

  • AI and sensing-technique-based image processing, understanding, recognition, compression, and reconstruction;
  • AI and sensing-technique-based video processing, understanding, recognition, compression, and reconstruction;
  • Computer vision based on AI and sensing techniques;
  • AI and sensing-technique-based biometrics;
  • AI and sensing-technique-based object detection and tracking;
  • Approaches that combine AI and sensing techniques and conventional methods for image and video processing and recognition;
  • Generative adversarial network (GAN)-based image and video processing and recognition;
  • Approaches that combine AI and blockchain methods for image and video processing and recognition.

Prof. Dr. Kang Ryoung Park
Prof. Dr. Sangyoun Lee
Prof. Dr. Euntai Kim
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website. Once registered, go to the submission form to submit your manuscript. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • image processing, understanding, recognition, compression, and reconstruction based on sensing techniques and AI
  • video processing, understanding, recognition, compression, and reconstruction based on sensing techniques and AI
  • computer vision based on sensing techniques and AI
  • biometrics based on sensing techniques and AI
  • fusion of AI and conventional methods
  • fusion of AI and blockchain methods

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found on the MDPI website.

Published Papers (9 papers)


Research


20 pages, 2194 KiB  
Article
An Efficient and Low-Complexity Transformer-Based Deep Learning Framework for High-Dynamic-Range Image Reconstruction
by Josue Lopez-Cabrejos, Thuanne Paixão, Ana Beatriz Alvarez and Diodomiro Baldomero Luque
Sensors 2025, 25(5), 1497; https://doi.org/10.3390/s25051497 - 28 Feb 2025
Viewed by 656
Abstract
High-dynamic-range (HDR) image reconstruction involves creating an HDR image from multiple low-dynamic-range images as input, providing a computational solution to enhance image quality. This task presents several challenges, such as frame misalignment, overexposure, and motion, which are addressed using deep learning algorithms. In this context, various architectures with different approaches exist, such as convolutional neural networks, diffusion networks, generative adversarial networks, and Transformer-based architectures, with the latter offering the best quality but at a high computational cost. This paper proposes an HDR reconstruction architecture using a Transformer-based approach to achieve results competitive with the state of the art while reducing computational cost. The number of self-attention blocks was reduced for feature refinement. To prevent quality degradation, a Convolutional Block Attention Module was added, enhancing image features by using the central frame as a reference. The proposed architecture was evaluated on two datasets, achieving the best results on Tel’s dataset in terms of quality metrics. The computational cost indicated that the architecture was significantly more efficient than other Transformer-based approaches for reconstruction. The results of this research suggest that low-complexity Transformer-based architectures have great potential, with applications extending beyond HDR reconstruction to other domains. Full article
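
The Convolutional Block Attention Module mentioned in this abstract is a standard channel-plus-spatial attention block; the sketch below is a minimal PyTorch rendering of that general design, not the authors' exact configuration (channel count, reduction ratio, and placement in the reconstruction network are assumptions).

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Minimal channel + spatial attention (CBAM-style) block."""
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        # Channel attention: shared MLP over average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: convolution over channel-wise average and max maps.
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        channel_att = torch.sigmoid(
            self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3)))
        ).view(b, c, 1, 1)
        x = x * channel_att
        spatial_in = torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(spatial_in))

# Example: refine features extracted from the central (reference) frame.
feats = torch.randn(1, 64, 128, 128)
print(CBAM(64)(feats).shape)  # torch.Size([1, 64, 128, 128])
```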

22 pages, 3085 KiB  
Article
Detection of Human Traffic Controllers Wearing Construction Workwear via Synthetic Data Generation
by Seunghyun Baik and Euntai Kim
Sensors 2025, 25(3), 816; https://doi.org/10.3390/s25030816 - 29 Jan 2025
Viewed by 608
Abstract
Developing Level 3 or higher autonomous vehicles requires the ability to follow human traffic controllers in situations where regular traffic signals are unavailable, such as during construction. However, detecting human traffic controllers at construction sites is challenging due to the lack of dedicated datasets and variations in their appearance. This paper proposes a method for detecting human traffic controllers by generating synthetic images with diffusion models. We introduce a color-boosting technique to enhance image diversity and employ a cut-and-paste mechanism for seamless integration into realistic road scenes. We generate 19,840 synthetic images, combined with 600 real-world images, to train a YOLOv7 model. The trained model achieves an AP50 score of 73.9%, improving by 32.9% over the baseline. The HTC600 dataset used in our experiments is publicly available to support autonomous driving research. Full article
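
As a rough illustration of the cut-and-paste step described above, the sketch below alpha-blends a segmented traffic-controller cutout into a road-scene image; the function name, mask format, and blending details are assumptions for illustration, not the authors' pipeline.

```python
import numpy as np

def cut_and_paste(scene: np.ndarray, cutout: np.ndarray, mask: np.ndarray,
                  top: int, left: int) -> np.ndarray:
    """Paste a segmented person cutout into a road scene at (top, left).

    scene:  HxWx3 uint8 background image (real road scene).
    cutout: hxwx3 uint8 synthetic traffic-controller crop.
    mask:   hxw float in [0, 1]; 1 where the person is, 0 for background.
    """
    out = scene.copy().astype(np.float32)
    h, w = cutout.shape[:2]
    region = out[top:top + h, left:left + w]
    alpha = mask[..., None]  # broadcast the mask over the RGB channels
    # Alpha-blend the cutout onto the scene; a soft mask edge reduces seams.
    region[:] = alpha * cutout.astype(np.float32) + (1.0 - alpha) * region
    return out.astype(np.uint8)

# Toy usage with arrays standing in for real images.
scene = np.zeros((512, 1024, 3), dtype=np.uint8)
cutout = np.full((200, 80, 3), 255, dtype=np.uint8)
mask = np.ones((200, 80), dtype=np.float32)
composited = cut_and_paste(scene, cutout, mask, top=250, left=400)
print(composited.shape)  # (512, 1024, 3)
```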

12 pages, 1842 KiB  
Article
Neural Radiance Fields for Fisheye Driving Scenes Using Edge-Aware Integrated Depth Supervision
by Jiho Choi and Sang Jun Lee
Sensors 2024, 24(21), 6790; https://doi.org/10.3390/s24216790 - 22 Oct 2024
Viewed by 1033
Abstract
Neural radiance fields (NeRF) have become an effective method for encoding scenes into neural representations, allowing for the synthesis of photorealistic images of unseen viewpoints from given input images. However, the applicability of traditional NeRF is significantly limited by its assumption that images are captured for object-centric scenes with a pinhole camera. Expanding these boundaries, we focus on driving scenarios using a fisheye camera, which offers the advantage of capturing visual information from a wide field of view. To address the challenges due to the unbounded and distorted characteristics of fisheye images, we propose an edge-aware integration loss function. This approach leverages sparse LiDAR projections and dense depth maps estimated from a learning-based depth model. The proposed algorithm assigns larger weights to neighboring points that have depth values similar to the sensor data. Experiments were conducted on the KITTI-360 and JBNU-Depth360 datasets, which are public and real-world datasets of driving scenarios using fisheye cameras. Experimental results demonstrated that the proposed method is effective in synthesizing novel view images, outperforming existing approaches. Full article
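
The weighting idea (trusting estimated depths more where they agree with nearby LiDAR returns) could look roughly like the following PyTorch sketch; the weighting function, the handling of pixels without LiDAR, and the omission of the edge-aware term are assumptions and do not reproduce the paper's loss.

```python
import torch

def depth_similarity_weighted_loss(rendered: torch.Tensor,
                                   estimated: torch.Tensor,
                                   lidar: torch.Tensor,
                                   valid: torch.Tensor,
                                   sigma: float = 0.5) -> torch.Tensor:
    """Hypothetical weighted depth-supervision loss.

    rendered:  (N,) depths rendered by the NeRF for sampled pixels.
    estimated: (N,) dense depths from a monocular depth network.
    lidar:     (N,) nearest sparse LiDAR depth for each pixel.
    valid:     (N,) bool mask of pixels that have a nearby LiDAR point.
    Pixels whose estimated depth agrees with the LiDAR measurement get
    larger weights, so unreliable regions contribute less to the loss.
    """
    agreement = torch.exp(-((estimated - lidar) ** 2) / (2 * sigma ** 2))
    weight = torch.where(valid, agreement, torch.full_like(agreement, 0.1))
    return (weight * (rendered - estimated).abs()).mean()

# Toy example with random depths.
n = 1024
loss = depth_similarity_weighted_loss(torch.rand(n) * 50, torch.rand(n) * 50,
                                      torch.rand(n) * 50, torch.rand(n) > 0.7)
print(loss.item())
```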

19 pages, 14507 KiB  
Article
High-Precision Multi-Object Tracking in Satellite Videos via Pixel-Wise Adaptive Feature Enhancement
by Gang Wan, Zhijuan Su, Yitian Wu, Ningbo Guo, Dianwei Cong, Zhanji Wei, Wei Liu and Guoping Wang
Sensors 2024, 24(19), 6489; https://doi.org/10.3390/s24196489 - 9 Oct 2024
Cited by 2 | Viewed by 1434
Abstract
In this paper, we focus on the multi-target tracking (MOT) task in satellite videos. To achieve efficient and accurate tracking, we propose a transformer-distillation-based end-to-end joint detection and tracking (JDT) method. Specifically, (1) considering that targets in satellite videos usually have small scales and are shot from a bird’s-eye view, we propose a pixel-wise transformer-based feature distillation module through which useful object representations are learned via pixel-wise distillation using a strong teacher detection network; (2) targets in satellite videos, such as airplanes, ships, and vehicles, usually have similar appearances, so we propose a temperature-controllable key feature learning objective function, and by highlighting the learning of similar features during distilling, the tracking accuracy for such objects can be further improved; (3) we propose a method that is based on an end-to-end network but simultaneously learns from a highly precise teacher network and tracking head during training so that the tracking accuracy of the end-to-end network can be improved via distillation without compromising efficiency. The experimental results on three recently released publicly available datasets demonstrated the superior performance of the proposed method for satellite videos. The proposed method achieved over 90% overall tracking performance on the AIR-MOT dataset. Full article
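
A generic, temperature-scaled pixel-wise feature distillation loss of the kind alluded to here might be sketched as follows; the feature shapes, the KL formulation, and the temperature value are assumptions rather than the paper's objective.

```python
import torch
import torch.nn.functional as F

def pixelwise_distillation_loss(student_feat: torch.Tensor,
                                teacher_feat: torch.Tensor,
                                temperature: float = 2.0) -> torch.Tensor:
    """Hypothetical pixel-wise feature distillation with a temperature.

    student_feat, teacher_feat: (B, C, H, W) feature maps of the same shape.
    Each spatial location's channel vector is turned into a soft distribution;
    the temperature controls how sharply the teacher's strongest responses
    (e.g., small, similar-looking targets) dominate the distillation signal.
    """
    b, c, h, w = student_feat.shape
    s = student_feat.permute(0, 2, 3, 1).reshape(-1, c)   # one vector per pixel
    t = teacher_feat.permute(0, 2, 3, 1).reshape(-1, c)
    log_p_s = F.log_softmax(s / temperature, dim=1)
    p_t = F.softmax(t / temperature, dim=1)
    # Standard knowledge-distillation scaling by T^2 keeps gradients comparable.
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * temperature ** 2

loss = pixelwise_distillation_loss(torch.randn(2, 256, 32, 32),
                                   torch.randn(2, 256, 32, 32))
print(loss.item())
```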

21 pages, 2501 KiB  
Article
RetinaViT: Efficient Visual Backbone for Online Video Streams
by Tomoyuki Suzuki and Yoshimitsu Aoki
Sensors 2024, 24(17), 5457; https://doi.org/10.3390/s24175457 - 23 Aug 2024
Viewed by 1138
Abstract
In online video understanding, which has a wide range of real-world applications, inference speed is crucial. Many approaches involve frame-level visual feature extraction, which often represents the biggest bottleneck. We propose RetinaViT, an efficient method for extracting frame-level visual features in an online video stream, aiming to fundamentally enhance the efficiency of online video understanding tasks. RetinaViT is composed of efficiently approximated Transformer blocks that only take changed tokens (event tokens) as queries and reuse the already processed tokens from the previous timestep for the others. Furthermore, we restrict keys and values to the spatial neighborhoods of event tokens to further improve efficiency. RetinaViT involves tuning multiple parameters, which we determine through a multi-step process. During model training, we randomly vary these parameters and then perform black-box optimization to maximize accuracy and efficiency on the pre-trained model. We conducted extensive experiments on various online video recognition tasks, including action recognition, pose estimation, and object segmentation, validating the effectiveness of each component in RetinaViT and demonstrating improvements in the speed/accuracy trade-off compared to baselines. In particular, for action recognition, RetinaViT built on ViT-B16 reduces inference time by approximately 61.9% on the CPU and 50.8% on the GPU, while achieving slight accuracy improvements rather than degradation. Full article
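
The event-token idea can be illustrated with a toy sketch: tokens whose embeddings changed between frames are recomputed, while the rest reuse cached outputs. The thresholding rule and the stock TransformerEncoderLayer are assumptions; for clarity the toy block recomputes all tokens and keeps only the event-token outputs, whereas the paper forms queries from event tokens only and restricts keys/values to their spatial neighborhoods.

```python
import torch

def select_event_tokens(curr_tokens: torch.Tensor,
                        prev_tokens: torch.Tensor,
                        threshold: float = 0.1) -> torch.Tensor:
    """Return indices of tokens whose embedding changed noticeably.

    curr_tokens, prev_tokens: (N, D) patch embeddings for the current
    and previous frame of the stream.
    """
    change = (curr_tokens - prev_tokens).norm(dim=1) / (prev_tokens.norm(dim=1) + 1e-6)
    return torch.nonzero(change > threshold, as_tuple=False).squeeze(1)

def streaming_block(block, curr_tokens, prev_tokens, cached_out, threshold=0.1):
    """Update only event tokens; all other outputs are reused from the cache."""
    event_idx = select_event_tokens(curr_tokens, prev_tokens, threshold)
    out = cached_out.clone()
    if event_idx.numel() > 0:
        # Toy simplification: run the full block and keep only event-token rows.
        updated = block(curr_tokens.unsqueeze(0)).squeeze(0)
        out[event_idx] = updated[event_idx]
    return out

# Toy usage with a stock encoder layer standing in for a ViT block.
block = torch.nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
prev = torch.randn(196, 64)
curr = prev.clone()
curr[:10] += 1.0              # only the first 10 patches changed
cached = torch.randn(196, 64)  # outputs cached from the previous timestep
print(streaming_block(block, curr, prev, cached).shape)  # torch.Size([196, 64])
```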

26 pages, 11283 KiB  
Article
Infrared Image Super-Resolution Network Utilizing the Enhanced Transformer and U-Net
by Feng Huang, Yunxiang Li, Xiaojing Ye and Jing Wu
Sensors 2024, 24(14), 4686; https://doi.org/10.3390/s24144686 - 19 Jul 2024
Cited by 2 | Viewed by 1438
Abstract
Infrared images hold significant value in applications such as remote sensing and fire safety. However, infrared detectors often face the problem of high hardware costs, which limits their widespread use. Advancements in deep learning have spurred innovative approaches to image super-resolution (SR), but comparatively few efforts have been dedicated to the exploration of infrared images. To address this, we design the Residual Swin Transformer and Average Pooling Block (RSTAB) and propose the SwinAIR, which can effectively extract and fuse the diverse frequency features in infrared images and achieve superior SR reconstruction performance. By further integrating SwinAIR with U-Net, we propose the SwinAIR-GAN for real infrared image SR reconstruction. SwinAIR-GAN extends the degradation space to better simulate the degradation process of real infrared images. Additionally, it incorporates spectral normalization, dropout, and artifact discrimination loss to reduce the potential image artifacts. Qualitative and quantitative evaluations on various datasets confirm the effectiveness of our proposed method in reconstructing realistic textures and details of infrared images. Full article
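
The notion of a widened degradation space for real infrared images can be illustrated with a toy blur-downsample-noise pipeline; the kernel, scale factor, and noise model below are assumptions and are far simpler than the randomized degradation space the paper describes.

```python
import numpy as np

def degrade_infrared(hr: np.ndarray, scale: int = 4,
                     blur_sigma: float = 1.2, noise_std: float = 0.02,
                     rng=None) -> np.ndarray:
    """Toy degradation pipeline: Gaussian blur -> downsample -> noise.

    hr: HxW float32 image in [0, 1]. Returns an (H/scale)x(W/scale) LR image.
    Real pipelines randomize blur kernels, noise types, and compression to
    widen the degradation space; this sketch keeps one fixed instance of each.
    """
    rng = rng or np.random.default_rng(0)
    # Separable Gaussian blur.
    radius = int(3 * blur_sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * blur_sigma ** 2))
    k /= k.sum()
    blurred = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, hr)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, blurred)
    # Naive downsampling by striding, then additive Gaussian noise.
    lr = blurred[::scale, ::scale]
    lr = np.clip(lr + rng.normal(0.0, noise_std, lr.shape), 0.0, 1.0)
    return lr.astype(np.float32)

lr = degrade_infrared(np.random.default_rng(1).random((256, 256), dtype=np.float32))
print(lr.shape)  # (64, 64)
```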

21 pages, 6785 KiB  
Article
Multi-Granularity Aggregation with Spatiotemporal Consistency for Video-Based Person Re-Identification
by Hean Sung Lee, Minjung Kim, Sungjun Jang, Han Byeol Bae and Sangyoun Lee
Sensors 2024, 24(7), 2229; https://doi.org/10.3390/s24072229 - 30 Mar 2024
Cited by 1 | Viewed by 1366
Abstract
Video-based person re-identification (ReID) aims to exploit relevant features from spatial and temporal knowledge. Widely used methods include the part- and attention-based approaches for suppressing irrelevant spatial–temporal features. However, it is still challenging to overcome inconsistencies across video frames due to occlusion and imperfect detection. These mismatches make temporal processing ineffective and create an imbalance of crucial spatial information. To address these problems, we propose the Spatiotemporal Multi-Granularity Aggregation (ST-MGA) method, which is specifically designed to accumulate relevant features with spatiotemporally consistent cues. The proposed framework consists of three main stages: extraction, which extracts spatiotemporally consistent partial information; augmentation, which augments the partial information with different granularity levels; and aggregation, which effectively aggregates the augmented spatiotemporal information. We first introduce the consistent part-attention (CPA) module, which extracts spatiotemporally consistent and well-aligned attentive parts. Sub-parts derived from CPA provide temporally consistent semantic information, solving misalignment problems in videos due to occlusion or inaccurate detection, and maximize the efficiency of aggregation through uniform partial information. To enhance the diversity of spatial and temporal cues, we introduce the Multi-Attention Part Augmentation (MA-PA) block, which incorporates fine parts at various granular levels, and the Long-/Short-term Temporal Augmentation (LS-TA) block, designed to capture both long- and short-term temporal relations. Using densely separated part cues, ST-MGA fully exploits and aggregates the spatiotemporal multi-granular patterns by comparing relations between parts and scales. In the experiments, the proposed ST-MGA renders state-of-the-art performance on several video-based ReID benchmarks (i.e., MARS, DukeMTMC-VideoReID, and LS-VID). Full article
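
Multi-granularity part pooling, in its simplest form, splits a frame's feature map into horizontal stripes at several scales and pools each stripe; the fixed stripes below are a stand-in for the paper's learned, consistent part attention, so treat this as a sketch of the granularity idea only.

```python
import torch

def multi_granularity_parts(feat: torch.Tensor, granularities=(1, 2, 4)):
    """Split a frame feature map into horizontal stripes at several granularities.

    feat: (B, C, H, W) backbone feature map of one frame.
    Returns a list of (B, C) part descriptors: with the default granularities,
    1 global + 2 halves + 4 quarters, each pooled by averaging over its stripe.
    """
    b, c, h, w = feat.shape
    parts = []
    for g in granularities:
        stripe_h = h // g
        for i in range(g):
            stripe = feat[:, :, i * stripe_h:(i + 1) * stripe_h, :]
            parts.append(stripe.mean(dim=(2, 3)))
    return parts

parts = multi_granularity_parts(torch.randn(8, 2048, 16, 8))
print(len(parts), parts[0].shape)  # 7 torch.Size([8, 2048])
```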

15 pages, 861 KiB  
Article
Conv3D-Based Video Violence Detection Network Using Optical Flow and RGB Data
by Jae-Hyuk Park, Mohamed Mahmoud and Hyun-Soo Kang
Sensors 2024, 24(2), 317; https://doi.org/10.3390/s24020317 - 5 Jan 2024
Cited by 16 | Viewed by 3135
Abstract
Detecting violent behavior in videos to ensure public safety and security poses a significant challenge. Precisely identifying and categorizing instances of violence in real-life closed-circuit television, which vary across specifications and locations, requires comprehensive understanding and processing of the sequential information embedded in these videos. This study aims to introduce a model that adeptly grasps the spatiotemporal context of videos within diverse settings and specifications of violent scenarios. We propose a method to accurately capture spatiotemporal features linked to violent behaviors using optical flow and RGB data. The approach leverages a Conv3D-based ResNet-3D model as the foundational network, capable of handling high-dimensional video data. The efficiency and accuracy of violence detection are enhanced by integrating an attention mechanism, which assigns greater weight to the most crucial frames within the RGB and optical-flow sequences during instances of violence. Our model was evaluated on the UBI-Fight, Hockey, Crowd, and Movie-Fights datasets; the proposed method outperformed existing state-of-the-art techniques, achieving area under the curve scores of 95.4, 98.1, 94.5, and 100.0 on the respective datasets. Moreover, this research not only has the potential to be applied in real-time surveillance systems but also promises to contribute to a broader spectrum of research in video analysis and understanding. Full article
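
A two-stream ResNet-3D classifier with a soft per-frame attention, in the spirit of the description above, could be sketched as follows; packing optical flow into three channels, computing frame scores from the raw RGB frames, and the fusion head are all assumptions for illustration, not the authors' model.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class TwoStreamViolenceNet(nn.Module):
    """Sketch: RGB + optical-flow ResNet-3D streams with frame-level attention."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.rgb_net = r3d_18(weights=None)   # pretrained weights could be used instead
        self.flow_net = r3d_18(weights=None)
        feat_dim = self.rgb_net.fc.in_features
        self.rgb_net.fc = nn.Identity()
        self.flow_net.fc = nn.Identity()
        # Per-frame attention scores computed from the RGB frames themselves.
        self.frame_score = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 1))
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # rgb, flow: (B, 3, T, H, W); flow is assumed packed into 3 channels.
        b, c, t, h, w = rgb.shape
        # Score each frame, then re-weight frames so the most crucial ones dominate.
        scores = self.frame_score(rgb.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w))
        attn = torch.softmax(scores.view(b, 1, t, 1, 1), dim=2)
        rgb = rgb * (attn * t)    # rescale so the overall magnitude stays comparable
        flow = flow * (attn * t)
        feats = torch.cat([self.rgb_net(rgb), self.flow_net(flow)], dim=1)
        return self.classifier(feats)

model = TwoStreamViolenceNet()
logits = model(torch.randn(2, 3, 16, 112, 112), torch.randn(2, 3, 16, 112, 112))
print(logits.shape)  # torch.Size([2, 2])
```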

Review


19 pages, 1482 KiB  
Review
A Comprehensive Evaluation of Iris Segmentation on Benchmarking Datasets
by Mst Rumana Sumi, Priyanka Das, Afzal Hossain, Soumyabrata Dey and Stephanie Schuckers
Sensors 2024, 24(21), 7079; https://doi.org/10.3390/s24217079 - 3 Nov 2024
Cited by 1 | Viewed by 1494
Abstract
Iris is one of the most widely used biometric modalities because of its uniqueness, high matching performance, and inherently secure nature. Iris segmentation is an essential preliminary step for iris-based biometric authentication. The authentication accuracy is directly connected with the iris segmentation accuracy. In the last few years, deep-learning-based iris segmentation methodologies have increasingly been adopted because of their ability to handle challenging segmentation tasks and their advantages over traditional segmentation techniques. However, the biggest challenge to the biometric community is the scarcity of open-source resources for adoption for application and reproducibility. This review provides a comprehensive examination of available open-source iris segmentation resources, including datasets, algorithms, and tools. In the process, we designed three U-Net and U-Net++ architecture-influenced segmentation algorithms as standard benchmarks, trained them on a large composite dataset (>45K samples), and created 1K manually segmented ground truth masks. Overall, eleven state-of-the-art algorithms were benchmarked against five datasets encompassing multiple sensors, environmental conditions, demography, and illumination. This assessment highlights the strengths, limitations, and practical implications of each method and identifies gaps that future studies should address to improve segmentation accuracy and robustness. To foster future research, all resources developed during this work would be made publicly available. Full article
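
Benchmarking segmentation algorithms against ground-truth masks typically reduces to overlap metrics such as IoU and Dice; the short sketch below shows that scoring step. The metrics themselves are standard, but their use here is illustrative rather than the paper's exact evaluation protocol.

```python
import numpy as np

def iou_and_dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7):
    """Compute IoU and Dice for one binary iris mask pair.

    pred, gt: HxW boolean arrays (True = iris pixel).
    """
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / (union + eps)
    dice = 2 * inter / (pred.sum() + gt.sum() + eps)
    return float(iou), float(dice)

# Toy example: a slightly shifted circle versus the ground-truth circle.
yy, xx = np.mgrid[:240, :320]
gt = (xx - 160) ** 2 + (yy - 120) ** 2 < 60 ** 2
pred = (xx - 165) ** 2 + (yy - 120) ** 2 < 60 ** 2
print(iou_and_dice(pred, gt))
```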
