
Image and Video Processing and Recognition Based on Artificial Intelligence: 3rd Edition

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Sensing and Imaging".

Deadline for manuscript submissions: 20 December 2025 | Viewed by 13655

Special Issue Editors


Guest Editor
Division of Electronics and Electrical Engineering, Dongguk University, 30, Pildong-ro 1-gil, Jung-gu, Seoul 04620, Republic of Korea
Interests: deep learning; biometrics; image processing

Special Issue Information

Dear Colleagues,

Recent developments have led to the widespread application of artificial intelligence (AI) and sensing techniques to image and video processing and recognition. Although state-of-the-art technology has matured, its performance is still affected by varying environmental conditions and heterogeneous databases. This Special Issue invites high-quality, state-of-the-art academic papers on challenging issues in the field of AI- and sensing-based image and video processing and recognition. We solicit original papers reporting completed, unpublished research that is not currently under review by any other conference, magazine, or journal. Topics of interest include, but are not limited to, the following:

  • AI and sensing-technique-based image processing, understanding, recognition, compression, and reconstruction;
  • AI and sensing-technique-based video processing, understanding, recognition, compression, and reconstruction;
  • Computer vision based on AI and sensing techniques;
  • AI and sensing-technique-based biometrics;
  • AI and sensing-technique-based object detection and tracking;
  • Approaches that combine AI and sensing techniques and conventional methods for image and video processing and recognition;
  • Generative adversarial network (GAN)-based image and video processing and recognition;
  • Approaches that combine AI and blockchain methods for image and video processing and recognition.

Prof. Dr. Kang Ryoung Park
Prof. Dr. Sangyoun Lee
Prof. Dr. Euntai Kim
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website. Once registered, go to the submission form to submit your manuscript. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • image processing, understanding, recognition, compression, and reconstruction based on sensing techniques and AI
  • video processing, understanding, recognition, compression, and reconstruction based on sensing techniques and AI
  • computer vision based on sensing techniques and AI
  • biometrics based on sensing techniques and AI
  • fusion of AI and conventional methods
  • fusion of AI and blockchain methods

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found on the MDPI website.

Published Papers (9 papers)


Research


20 pages, 2194 KiB  
Article
An Efficient and Low-Complexity Transformer-Based Deep Learning Framework for High-Dynamic-Range Image Reconstruction
by Josue Lopez-Cabrejos, Thuanne Paixão, Ana Beatriz Alvarez and Diodomiro Baldomero Luque
Sensors 2025, 25(5), 1497; https://doi.org/10.3390/s25051497 - 28 Feb 2025
Viewed by 656
Abstract
High-dynamic-range (HDR) image reconstruction involves creating an HDR image from multiple low-dynamic-range images as input, providing a computational solution to enhance image quality. This task presents several challenges, such as frame misalignment, overexposure, and motion, which are addressed using deep learning algorithms. In this context, various architectures with different approaches exist, such as convolutional neural networks, diffusion networks, generative adversarial networks, and Transformer-based architectures, with the latter offering the best quality but at a high computational cost. This paper proposes an HDR reconstruction architecture using a Transformer-based approach to achieve results competitive with the state of the art while reducing computational cost. The number of self-attention blocks was reduced for feature refinement. To prevent quality degradation, a Convolutional Block Attention Module was added, enhancing image features by using the central frame as a reference. The proposed architecture was evaluated on two datasets, achieving the best results on Tel’s dataset in terms of quality metrics. The computational cost indicated that the architecture was significantly more efficient than other Transformer-based approaches for reconstruction. The results of this research suggest that low-complexity Transformer-based architectures have great potential, with applications extending beyond HDR reconstruction to other domains. Full article
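
The Convolutional Block Attention Module mentioned in this abstract is a standard channel-plus-spatial attention block; the sketch below is a minimal PyTorch rendering of that general design, not the authors' exact configuration (channel count, reduction ratio, and placement in the reconstruction network are assumptions).

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Minimal channel + spatial attention (CBAM-style) block."""
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        # Channel attention: shared MLP over average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: convolution over channel-wise average and max maps.
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        channel_att = torch.sigmoid(
            self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3)))
        ).view(b, c, 1, 1)
        x = x * channel_att
        spatial_in = torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(spatial_in))

# Example: refine features extracted from the central (reference) frame.
feats = torch.randn(1, 64, 128, 128)
print(CBAM(64)(feats).shape)  # torch.Size([1, 64, 128, 128])
```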

22 pages, 3085 KiB  
Article
Detection of Human Traffic Controllers Wearing Construction Workwear via Synthetic Data Generation
by Seunghyun Baik and Euntai Kim
Sensors 2025, 25(3), 816; https://doi.org/10.3390/s25030816 - 29 Jan 2025
Viewed by 608
Abstract
Developing Level 3 or higher autonomous vehicles requires the ability to follow human traffic controllers in situations where regular traffic signals are unavailable, such as during construction. However, detecting human traffic controllers at construction sites is challenging due to the lack of dedicated datasets and variations in their appearance. This paper proposes a method for detecting human traffic controllers by generating synthetic images with diffusion models. We introduce a color-boosting technique to enhance image diversity and employ a cut-and-paste mechanism for seamless integration into realistic road scenes. We generate 19,840 synthetic images, combined with 600 real-world images, to train a YOLOv7 model. The trained model achieves an AP50 score of 73.9%, improving by 32.9% over the baseline. The HTC600 dataset used in our experiments is publicly available to support autonomous driving research. Full article
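
As a rough illustration of the cut-and-paste step described above, the sketch below alpha-blends a segmented traffic-controller cutout into a road-scene image; the function name, mask format, and blending details are assumptions for illustration, not the authors' pipeline.

```python
import numpy as np

def cut_and_paste(scene: np.ndarray, cutout: np.ndarray, mask: np.ndarray,
                  top: int, left: int) -> np.ndarray:
    """Paste a segmented person cutout into a road scene at (top, left).

    scene:  HxWx3 uint8 background image (real road scene).
    cutout: hxwx3 uint8 synthetic traffic-controller crop.
    mask:   hxw float in [0, 1]; 1 where the person is, 0 for background.
    """
    out = scene.copy().astype(np.float32)
    h, w = cutout.shape[:2]
    region = out[top:top + h, left:left + w]
    alpha = mask[..., None]  # broadcast the mask over the RGB channels
    # Alpha-blend the cutout onto the scene; a soft mask edge reduces seams.
    region[:] = alpha * cutout.astype(np.float32) + (1.0 - alpha) * region
    return out.astype(np.uint8)

# Toy usage with arrays standing in for real images.
scene = np.zeros((512, 1024, 3), dtype=np.uint8)
cutout = np.full((200, 80, 3), 255, dtype=np.uint8)
mask = np.ones((200, 80), dtype=np.float32)
composited = cut_and_paste(scene, cutout, mask, top=250, left=400)
print(composited.shape)  # (512, 1024, 3)
```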

12 pages, 1842 KiB  
Article
Neural Radiance Fields for Fisheye Driving Scenes Using Edge-Aware Integrated Depth Supervision
by Jiho Choi and Sang Jun Lee
Sensors 2024, 24(21), 6790; https://doi.org/10.3390/s24216790 - 22 Oct 2024
Viewed by 1033
Abstract
Neural radiance fields (NeRF) have become an effective method for encoding scenes into neural representations, allowing for the synthesis of photorealistic images of unseen viewpoints from given input images. However, the applicability of traditional NeRF is significantly limited by its assumption that images are captured for object-centric scenes with a pinhole camera. Expanding these boundaries, we focus on driving scenarios using a fisheye camera, which offers the advantage of capturing visual information from a wide field of view. To address the challenges due to the unbounded and distorted characteristics of fisheye images, we propose an edge-aware integration loss function. This approach leverages sparse LiDAR projections and dense depth maps estimated from a learning-based depth model. The proposed algorithm assigns larger weights to neighboring points that have depth values similar to the sensor data. Experiments were conducted on the KITTI-360 and JBNU-Depth360 datasets, which are public and real-world datasets of driving scenarios using fisheye cameras. Experimental results demonstrated that the proposed method is effective in synthesizing novel view images, outperforming existing approaches. Full article
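
The weighting idea (trusting estimated depths more where they agree with nearby LiDAR returns) could look roughly like the following PyTorch sketch; the weighting function, the handling of pixels without LiDAR, and the omission of the edge-aware term are assumptions and do not reproduce the paper's loss.

```python
import torch

def depth_similarity_weighted_loss(rendered: torch.Tensor,
                                   estimated: torch.Tensor,
                                   lidar: torch.Tensor,
                                   valid: torch.Tensor,
                                   sigma: float = 0.5) -> torch.Tensor:
    """Hypothetical weighted depth-supervision loss.

    rendered:  (N,) depths rendered by the NeRF for sampled pixels.
    estimated: (N,) dense depths from a monocular depth network.
    lidar:     (N,) nearest sparse LiDAR depth for each pixel.
    valid:     (N,) bool mask of pixels that have a nearby LiDAR point.
    Pixels whose estimated depth agrees with the LiDAR measurement get
    larger weights, so unreliable regions contribute less to the loss.
    """
    agreement = torch.exp(-((estimated - lidar) ** 2) / (2 * sigma ** 2))
    weight = torch.where(valid, agreement, torch.full_like(agreement, 0.1))
    return (weight * (rendered - estimated).abs()).mean()

# Toy example with random depths.
n = 1024
loss = depth_similarity_weighted_loss(torch.rand(n) * 50, torch.rand(n) * 50,
                                      torch.rand(n) * 50, torch.rand(n) > 0.7)
print(loss.item())
```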

19 pages, 14507 KiB  
Article
High-Precision Multi-Object Tracking in Satellite Videos via Pixel-Wise Adaptive Feature Enhancement
by Gang Wan, Zhijuan Su, Yitian Wu, Ningbo Guo, Dianwei Cong, Zhanji Wei, Wei Liu and Guoping Wang
Sensors 2024, 24(19), 6489; https://doi.org/10.3390/s24196489 - 9 Oct 2024
Cited by 2 | Viewed by 1434
Abstract
In this paper, we focus on the multi-target tracking (MOT) task in satellite videos. To achieve efficient and accurate tracking, we propose a transformer-distillation-based end-to-end joint detection and tracking (JDT) method. Specifically, (1) considering that targets in satellite videos usually have small scales and are shot from a bird’s-eye view, we propose a pixel-wise transformer-based feature distillation module through which useful object representations are learned via pixel-wise distillation using a strong teacher detection network; (2) targets in satellite videos, such as airplanes, ships, and vehicles, usually have similar appearances, so we propose a temperature-controllable key feature learning objective function, and by highlighting the learning of similar features during distilling, the tracking accuracy for such objects can be further improved; (3) we propose a method that is based on an end-to-end network but simultaneously learns from a highly precise teacher network and tracking head during training so that the tracking accuracy of the end-to-end network can be improved via distillation without compromising efficiency. The experimental results on three recently released publicly available datasets demonstrated the superior performance of the proposed method for satellite videos. The proposed method achieved over 90% overall tracking performance on the AIR-MOT dataset. Full article
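
A generic, temperature-scaled pixel-wise feature distillation loss of the kind alluded to here might be sketched as follows; the feature shapes, the KL formulation, and the temperature value are assumptions rather than the paper's objective.

```python
import torch
import torch.nn.functional as F

def pixelwise_distillation_loss(student_feat: torch.Tensor,
                                teacher_feat: torch.Tensor,
                                temperature: float = 2.0) -> torch.Tensor:
    """Hypothetical pixel-wise feature distillation with a temperature.

    student_feat, teacher_feat: (B, C, H, W) feature maps of the same shape.
    Each spatial location's channel vector is turned into a soft distribution;
    the temperature controls how sharply the teacher's strongest responses
    (e.g., small, similar-looking targets) dominate the distillation signal.
    """
    b, c, h, w = student_feat.shape
    s = student_feat.permute(0, 2, 3, 1).reshape(-1, c)   # one vector per pixel
    t = teacher_feat.permute(0, 2, 3, 1).reshape(-1, c)
    log_p_s = F.log_softmax(s / temperature, dim=1)
    p_t = F.softmax(t / temperature, dim=1)
    # Standard knowledge-distillation scaling by T^2 keeps gradients comparable.
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * temperature ** 2

loss = pixelwise_distillation_loss(torch.randn(2, 256, 32, 32),
                                   torch.randn(2, 256, 32, 32))
print(loss.item())
```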

21 pages, 2501 KiB  
Article
RetinaViT: Efficient Visual Backbone for Online Video Streams
by Tomoyuki Suzuki and Yoshimitsu Aoki
Sensors 2024, 24(17), 5457; https://doi.org/10.3390/s24175457 - 23 Aug 2024
Viewed by 1138
Abstract
In online video understanding, which has a wide range of real-world applications, inference speed is crucial. Many approaches involve frame-level visual feature extraction, which often represents the biggest bottleneck. We propose RetinaViT, an efficient method for extracting frame-level visual features in an online video stream, aiming to fundamentally enhance the efficiency of online video understanding tasks. RetinaViT is composed of efficiently approximated Transformer blocks that only take changed tokens (event tokens) as queries and reuse the already processed tokens from the previous timestep for the others. Furthermore, we restrict keys and values to the spatial neighborhoods of event tokens to further improve efficiency. RetinaViT involves tuning multiple parameters, which we determine through a multi-step process. During model training, we randomly vary these parameters and then perform black-box optimization to maximize accuracy and efficiency on the pre-trained model. We conducted extensive experiments on various online video recognition tasks, including action recognition, pose estimation, and object segmentation, validating the effectiveness of each component in RetinaViT and demonstrating improvements in the speed/accuracy trade-off compared to baselines. In particular, for action recognition, RetinaViT built on ViT-B16 reduces inference time by approximately 61.9% on the CPU and 50.8% on the GPU, while achieving slight accuracy improvements rather than degradation. Full article
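
The event-token idea can be illustrated with a toy sketch: tokens whose embeddings changed between frames are recomputed, while the rest reuse cached outputs. The thresholding rule and the stock TransformerEncoderLayer are assumptions; for clarity the toy block recomputes all tokens and keeps only the event-token outputs, whereas the paper forms queries from event tokens only and restricts keys/values to their spatial neighborhoods.

```python
import torch

def select_event_tokens(curr_tokens: torch.Tensor,
                        prev_tokens: torch.Tensor,
                        threshold: float = 0.1) -> torch.Tensor:
    """Return indices of tokens whose embedding changed noticeably.

    curr_tokens, prev_tokens: (N, D) patch embeddings for the current
    and previous frame of the stream.
    """
    change = (curr_tokens - prev_tokens).norm(dim=1) / (prev_tokens.norm(dim=1) + 1e-6)
    return torch.nonzero(change > threshold, as_tuple=False).squeeze(1)

def streaming_block(block, curr_tokens, prev_tokens, cached_out, threshold=0.1):
    """Update only event tokens; all other outputs are reused from the cache."""
    event_idx = select_event_tokens(curr_tokens, prev_tokens, threshold)
    out = cached_out.clone()
    if event_idx.numel() > 0:
        # Toy simplification: run the full block and keep only event-token rows.
        updated = block(curr_tokens.unsqueeze(0)).squeeze(0)
        out[event_idx] = updated[event_idx]
    return out

# Toy usage with a stock encoder layer standing in for a ViT block.
block = torch.nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
prev = torch.randn(196, 64)
curr = prev.clone()
curr[:10] += 1.0              # only the first 10 patches changed
cached = torch.randn(196, 64)  # outputs cached from the previous timestep
print(streaming_block(block, curr, prev, cached).shape)  # torch.Size([196, 64])
```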

26 pages, 11283 KiB  
Article
Infrared Image Super-Resolution Network Utilizing the Enhanced Transformer and U-Net
by Feng Huang, Yunxiang Li, Xiaojing Ye and Jing Wu
Sensors 2024, 24(14), 4686; https://doi.org/10.3390/s24144686 - 19 Jul 2024
Cited by 2 | Viewed by 1438
Abstract
Infrared images hold significant value in applications such as remote sensing and fire safety. However, infrared detectors often face the problem of high hardware costs, which limits their widespread use. Advancements in deep learning have spurred innovative approaches to image super-resolution (SR), but comparatively few efforts have been dedicated to the exploration of infrared images. To address this, we design the Residual Swin Transformer and Average Pooling Block (RSTAB) and propose the SwinAIR, which can effectively extract and fuse the diverse frequency features in infrared images and achieve superior SR reconstruction performance. By further integrating SwinAIR with U-Net, we propose the SwinAIR-GAN for real infrared image SR reconstruction. SwinAIR-GAN extends the degradation space to better simulate the degradation process of real infrared images. Additionally, it incorporates spectral normalization, dropout, and artifact discrimination loss to reduce the potential image artifacts. Qualitative and quantitative evaluations on various datasets confirm the effectiveness of our proposed method in reconstructing realistic textures and details of infrared images. Full article
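
The notion of a widened degradation space for real infrared images can be illustrated with a toy blur-downsample-noise pipeline; the kernel, scale factor, and noise model below are assumptions and are far simpler than the randomized degradation space the paper describes.

```python
import numpy as np

def degrade_infrared(hr: np.ndarray, scale: int = 4,
                     blur_sigma: float = 1.2, noise_std: float = 0.02,
                     rng=None) -> np.ndarray:
    """Toy degradation pipeline: Gaussian blur -> downsample -> noise.

    hr: HxW float32 image in [0, 1]. Returns an (H/scale)x(W/scale) LR image.
    Real pipelines randomize blur kernels, noise types, and compression to
    widen the degradation space; this sketch keeps one fixed instance of each.
    """
    rng = rng or np.random.default_rng(0)
    # Separable Gaussian blur.
    radius = int(3 * blur_sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * blur_sigma ** 2))
    k /= k.sum()
    blurred = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, hr)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, blurred)
    # Naive downsampling by striding, then additive Gaussian noise.
    lr = blurred[::scale, ::scale]
    lr = np.clip(lr + rng.normal(0.0, noise_std, lr.shape), 0.0, 1.0)
    return lr.astype(np.float32)

lr = degrade_infrared(np.random.default_rng(1).random((256, 256), dtype=np.float32))
print(lr.shape)  # (64, 64)
```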

21 pages, 6785 KiB  
Article
Multi-Granularity Aggregation with Spatiotemporal Consistency for Video-Based Person Re-Identification
by Hean Sung Lee, Minjung Kim, Sungjun Jang, Han Byeol Bae and Sangyoun Lee
Sensors 2024, 24(7), 2229; https://doi.org/10.3390/s24072229 - 30 Mar 2024
Cited by 1 | Viewed by 1366
Abstract
Video-based person re-identification (ReID) aims to exploit relevant features from spatial and temporal knowledge. Widely used methods include the part- and attention-based approaches for suppressing irrelevant spatial–temporal features. However, it is still challenging to overcome inconsistencies across video frames due to occlusion and imperfect detection. These mismatches make temporal processing ineffective and create an imbalance of crucial spatial information. To address these problems, we propose the Spatiotemporal Multi-Granularity Aggregation (ST-MGA) method, which is specifically designed to accumulate relevant features with spatiotemporally consistent cues. The proposed framework consists of three main stages: extraction, which extracts spatiotemporally consistent partial information; augmentation, which augments the partial information with different granularity levels; and aggregation, which effectively aggregates the augmented spatiotemporal information. We first introduce the consistent part-attention (CPA) module, which extracts spatiotemporally consistent and well-aligned attentive parts. Sub-parts derived from CPA provide temporally consistent semantic information, solving misalignment problems in videos due to occlusion or inaccurate detection, and maximize the efficiency of aggregation through uniform partial information. To enhance the diversity of spatial and temporal cues, we introduce the Multi-Attention Part Augmentation (MA-PA) block, which incorporates fine parts at various granular levels, and the Long-/Short-term Temporal Augmentation (LS-TA) block, designed to capture both long- and short-term temporal relations. Using densely separated part cues, ST-MGA fully exploits and aggregates the spatiotemporal multi-granular patterns by comparing relations between parts and scales. In the experiments, the proposed ST-MGA renders state-of-the-art performance on several video-based ReID benchmarks (i.e., MARS, DukeMTMC-VideoReID, and LS-VID). Full article
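
Multi-granularity part pooling, in its simplest form, splits a frame's feature map into horizontal stripes at several scales and pools each stripe; the fixed stripes below are a stand-in for the paper's learned, consistent part attention, so treat this as a sketch of the granularity idea only.

```python
import torch

def multi_granularity_parts(feat: torch.Tensor, granularities=(1, 2, 4)):
    """Split a frame feature map into horizontal stripes at several granularities.

    feat: (B, C, H, W) backbone feature map of one frame.
    Returns a list of (B, C) part descriptors: with the default granularities,
    1 global + 2 halves + 4 quarters, each pooled by averaging over its stripe.
    """
    b, c, h, w = feat.shape
    parts = []
    for g in granularities:
        stripe_h = h // g
        for i in range(g):
            stripe = feat[:, :, i * stripe_h:(i + 1) * stripe_h, :]
            parts.append(stripe.mean(dim=(2, 3)))
    return parts

parts = multi_granularity_parts(torch.randn(8, 2048, 16, 8))
print(len(parts), parts[0].shape)  # 7 torch.Size([8, 2048])
```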

15 pages, 861 KiB  
Article
Conv3D-Based Video Violence Detection Network Using Optical Flow and RGB Data
by Jae-Hyuk Park, Mohamed Mahmoud and Hyun-Soo Kang
Sensors 2024, 24(2), 317; https://doi.org/10.3390/s24020317 - 5 Jan 2024
Cited by 16 | Viewed by 3135
Abstract
Detecting violent behavior in videos to ensure public safety and security poses a significant challenge. Precisely identifying and categorizing instances of violence in real-life closed-circuit television, which vary across specifications and locations, requires comprehensive understanding and processing of the sequential information embedded in these videos. This study aims to introduce a model that adeptly grasps the spatiotemporal context of videos within diverse settings and specifications of violent scenarios. We propose a method to accurately capture spatiotemporal features linked to violent behaviors using optical flow and RGB data. The approach leverages a Conv3D-based ResNet-3D model as the foundational network, capable of handling high-dimensional video data. The efficiency and accuracy of violence detection are enhanced by integrating an attention mechanism, which assigns greater weight to the most crucial frames within the RGB and optical-flow sequences during instances of violence. Our model was evaluated on the UBI-Fight, Hockey, Crowd, and Movie-Fights datasets; the proposed method outperformed existing state-of-the-art techniques, achieving area under the curve scores of 95.4, 98.1, 94.5, and 100.0 on the respective datasets. Moreover, this research not only has the potential to be applied in real-time surveillance systems but also promises to contribute to a broader spectrum of research in video analysis and understanding. Full article
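
A two-stream ResNet-3D classifier with a soft per-frame attention, in the spirit of the description above, could be sketched as follows; packing optical flow into three channels, computing frame scores from the raw RGB frames, and the fusion head are all assumptions for illustration, not the authors' model.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class TwoStreamViolenceNet(nn.Module):
    """Sketch: RGB + optical-flow ResNet-3D streams with frame-level attention."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.rgb_net = r3d_18(weights=None)   # pretrained weights could be used instead
        self.flow_net = r3d_18(weights=None)
        feat_dim = self.rgb_net.fc.in_features
        self.rgb_net.fc = nn.Identity()
        self.flow_net.fc = nn.Identity()
        # Per-frame attention scores computed from the RGB frames themselves.
        self.frame_score = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 1))
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # rgb, flow: (B, 3, T, H, W); flow is assumed packed into 3 channels.
        b, c, t, h, w = rgb.shape
        # Score each frame, then re-weight frames so the most crucial ones dominate.
        scores = self.frame_score(rgb.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w))
        attn = torch.softmax(scores.view(b, 1, t, 1, 1), dim=2)
        rgb = rgb * (attn * t)    # rescale so the overall magnitude stays comparable
        flow = flow * (attn * t)
        feats = torch.cat([self.rgb_net(rgb), self.flow_net(flow)], dim=1)
        return self.classifier(feats)

model = TwoStreamViolenceNet()
logits = model(torch.randn(2, 3, 16, 112, 112), torch.randn(2, 3, 16, 112, 112))
print(logits.shape)  # torch.Size([2, 2])
```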

Review


19 pages, 1482 KiB  
Review
A Comprehensive Evaluation of Iris Segmentation on Benchmarking Datasets
by Mst Rumana Sumi, Priyanka Das, Afzal Hossain, Soumyabrata Dey and Stephanie Schuckers
Sensors 2024, 24(21), 7079; https://doi.org/10.3390/s24217079 - 3 Nov 2024
Cited by 1 | Viewed by 1494
Abstract
Iris is one of the most widely used biometric modalities because of its uniqueness, high matching performance, and inherently secure nature. Iris segmentation is an essential preliminary step for iris-based biometric authentication. The authentication accuracy is directly connected with the iris segmentation accuracy. In the last few years, deep-learning-based iris segmentation methodologies have increasingly been adopted because of their ability to handle challenging segmentation tasks and their advantages over traditional segmentation techniques. However, the biggest challenge to the biometric community is the scarcity of open-source resources for adoption for application and reproducibility. This review provides a comprehensive examination of available open-source iris segmentation resources, including datasets, algorithms, and tools. In the process, we designed three U-Net and U-Net++ architecture-influenced segmentation algorithms as standard benchmarks, trained them on a large composite dataset (>45K samples), and created 1K manually segmented ground truth masks. Overall, eleven state-of-the-art algorithms were benchmarked against five datasets encompassing multiple sensors, environmental conditions, demography, and illumination. This assessment highlights the strengths, limitations, and practical implications of each method and identifies gaps that future studies should address to improve segmentation accuracy and robustness. To foster future research, all resources developed during this work would be made publicly available. Full article
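
Benchmarking segmentation algorithms against ground-truth masks typically reduces to overlap metrics such as IoU and Dice; the short sketch below shows that scoring step. The metrics themselves are standard, but their use here is illustrative rather than the paper's exact evaluation protocol.

```python
import numpy as np

def iou_and_dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7):
    """Compute IoU and Dice for one binary iris mask pair.

    pred, gt: HxW boolean arrays (True = iris pixel).
    """
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / (union + eps)
    dice = 2 * inter / (pred.sum() + gt.sum() + eps)
    return float(iou), float(dice)

# Toy example: a slightly shifted circle versus the ground-truth circle.
yy, xx = np.mgrid[:240, :320]
gt = (xx - 160) ** 2 + (yy - 120) ** 2 < 60 ** 2
pred = (xx - 165) ** 2 + (yy - 120) ** 2 < 60 ** 2
print(iou_and_dice(pred, gt))
```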
