Applications of Video Processing and Computer Vision Sensor II

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Intelligent Sensors".

Deadline for manuscript submissions: closed (25 September 2023) | Viewed by 19751

Special Issue Editors


Dr. Yong Ju Jung
Guest Editor
School of Computing, Gachon University, Seongnam, Republic of Korea
Interests: image processing; computer vision

Dr. Joohyung Lee
Guest Editor
School of Computing, Gachon University, Seongnam, Republic of Korea
Interests: video streaming; edge intelligence; IoT systems

Dr. Giorgio Fumera
Guest Editor
Department of Electrical and Electronic Engineering, University of Cagliari, Piazza d’Armi, 09123 Cagliari, Italy
Interests: pattern recognition; image processing; intelligent video surveillance

Special Issue Information

Dear Colleagues,

With the recent proliferation of deep learning technology and the democratization of AI through edge computing devices, AI-powered computer vision sensors have garnered a great deal of interest from academia and industry. Accordingly, we have witnessed the dramatic growth of AI-based vision applications (e.g., video surveillance and self-driving vehicles) across many fields. The goal of this Special Issue is to invite researchers who tackle important and challenging issues in various AI-based vision applications involving video processing and computer vision sensors. In particular, this Special Issue aims to provide information on recent progress in AI-enabled computational photography and machine vision, smart camera systems, and applications for the intelligent edge. Topics of interest include, but are not limited to:

  1. Deep Learning-Based Computational Photography:
  • Image/video manipulation (camera ISP, inpainting, relighting, super-resolution, deblurring, de-hazing, artifact removal, etc.);
  • Image-to-image translation;
  • Video-to-video translation;
  • Image/video restoration and enhancement on mobile devices;
  • Image fusion for single and multi-camera;
  • Hyperspectral imaging;
  • Depth estimation;
  2. AI-Enabled Machine Vision:
  • Object detection and real-time tracking;
  • Anomaly detection;
  • Crowd monitoring and crowd behaviour analysis;
  • Face detection, recognition, and modeling;
  • Human activity recognition;
  • Emotion recognition;
  3. Smart Camera Systems and Applications for Intelligent Edge:
  • Resource management in edge devices for video surveillance;
  • Architecture and efficient operation procedures for edge devices to support video surveillance;
  • Intelligent edge system and protocol design for video surveillance;
  • Deep learning model optimization for intelligent edge;
  • Input filtering for object detection and tracking for the intelligent edge.

Dr. Yong Ju Jung
Dr. Joohyung Lee
Dr. Giorgio Fumera
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • computational photography
  • image/video understanding and recognition
  • deep learning algorithms
  • AI-enabled machine vision
  • smart camera systems
  • intelligent edge
  • edge computing devices

Published Papers (9 papers)


Research

19 pages, 6090 KiB  
Article
Video Global Motion Compensation Based on Affine Inverse Transform Model
by Nan Zhang, Weifeng Liu and Xingyu Xia
Sensors 2023, 23(18), 7750; https://doi.org/10.3390/s23187750 - 8 Sep 2023
Viewed by 752
Abstract
Global motion greatly increases the number of false alarms for object detection in video sequences against dynamic backgrounds. Therefore, before detecting the target in the dynamic background, it is necessary to estimate and compensate the global motion to eliminate the influence of the global motion. In this paper, we use the SURF (speeded up robust features) algorithm combined with the MSAC (M-Estimate Sample Consensus) algorithm to process the video. The global motion of a video sequence is estimated according to the feature point matching pairs of adjacent frames of the video sequence and the global motion parameters of the video sequence under the dynamic background. On this basis, we propose an inverse transformation model of affine transformation, which acts on each adjacent frame of the video sequence in turn. The model compensates the global motion, and outputs a video sequence after global motion compensation from a specific view for object detection. Experimental results show that the algorithm proposed in this paper can accurately perform motion compensation on video sequences containing complex global motion, and the compensated video sequences achieve higher peak signal-to-noise ratio and better visual effects.
(This article belongs to the Special Issue Applications of Video Processing and Computer Vision Sensor II)
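
A minimal sketch of the feature-based global motion compensation idea described above, assuming OpenCV is available; ORB features and RANSAC stand in for SURF and MSAC (SURF requires the non-free opencv-contrib build), so this illustrates the general pipeline rather than the authors' exact implementation:

```python
# Sketch: estimate the global (camera) motion between adjacent frames from
# feature matches, then warp the current frame with the inverse affine
# transform to cancel that motion.
import cv2
import numpy as np

def compensate_global_motion(prev_gray, curr_gray):
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    if des1 is None or des2 is None:
        return curr_gray  # not enough texture to estimate motion

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:500]
    src = np.float32([kp1[m.queryIdx].pt for m in matches])
    dst = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Robustly estimate the 2x3 affine model of the global motion.
    A, _ = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC, ransacReprojThreshold=3.0)
    if A is None:
        return curr_gray

    # Apply the inverse affine transform so the current frame is warped back
    # onto the previous frame's coordinates, removing the global motion.
    A_inv = cv2.invertAffineTransform(A)
    h, w = curr_gray.shape[:2]
    return cv2.warpAffine(curr_gray, A_inv, (w, h))
```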

19 pages, 3715 KiB  
Article
Aggregating Different Scales of Attention on Feature Variants for Tomato Leaf Disease Diagnosis from Image Data: A Transformer Driven Study
by Shahriar Hossain, Md Tanzim Reza, Amitabha Chakrabarty and Yong Ju Jung
Sensors 2023, 23(7), 3751; https://doi.org/10.3390/s23073751 - 5 Apr 2023
Cited by 3 | Viewed by 1954
Abstract
Tomato leaf diseases can incur significant financial damage by having adverse impacts on crops and, consequently, they are a major concern for tomato growers all over the world. The diseases may come in a variety of forms, caused by environmental stress and various pathogens. An automated approach to detect leaf disease from images would assist farmers to take effective control measures quickly and affordably. Therefore, the proposed study aims to analyze the effects of transformer-based approaches that aggregate different scales of attention on variants of features for the classification of tomato leaf diseases from image data. Four state-of-the-art transformer-based models, namely, External Attention Transformer (EANet), Multi-Axis Vision Transformer (MaxViT), Compact Convolutional Transformers (CCT), and Pyramid Vision Transformer (PVT), are trained and tested on a multiclass tomato disease dataset. The result analysis showcases that MaxViT comfortably outperforms the other three transformer models with 97% overall accuracy, as opposed to the 89% accuracy achieved by EANet, 91% by CCT, and 93% by PVT. MaxViT also achieves a smoother learning curve compared to the other transformers. Afterwards, we further verified the legitimacy of the results on another relatively smaller dataset. Overall, the exhaustive empirical analysis presented in the paper proves that the MaxViT architecture is the most effective transformer model to classify tomato leaf disease, providing the availability of powerful hardware to incorporate the model.
(This article belongs to the Special Issue Applications of Video Processing and Computer Vision Sensor II)
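
For illustration, a hedged sketch of fine-tuning a pretrained MaxViT-style classifier on an image-folder dataset, assuming the timm and torchvision libraries; the model name, class count, data path, and hyper-parameters are placeholders rather than the paper's configuration:

```python
# Sketch: fine-tune a pretrained transformer classifier on a multiclass
# leaf-disease image folder. The MaxViT variant name assumes a recent timm
# release; any timm classification model can be substituted.
import timm
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

NUM_CLASSES = 10  # placeholder number of disease classes
model = timm.create_model("maxvit_tiny_tf_224", pretrained=True, num_classes=NUM_CLASSES)

tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
train_set = datasets.ImageFolder("data/tomato_leaf/train", transform=tfm)  # placeholder path
loader = DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```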

19 pages, 13345 KiB  
Article
Multi-Stage Network for Event-Based Video Deblurring with Residual Hint Attention
by Jeongmin Kim and Yong Ju Jung
Sensors 2023, 23(6), 2880; https://doi.org/10.3390/s23062880 - 7 Mar 2023
Cited by 2 | Viewed by 1845
Abstract
Video deblurring aims at removing the motion blur caused by the movement of objects or camera shake. Traditional video deblurring methods have mainly focused on frame-based deblurring, which takes only blurry frames as the input to produce sharp frames. However, frame-based deblurring has shown poor picture quality in challenging cases of video restoration where severely blurred frames are provided as the input. To overcome this issue, recent studies have begun to explore the event-based approach, which uses the event sequence captured by an event camera for motion deblurring. Event cameras have several advantages compared to conventional frame cameras. Among these advantages, event cameras have a low latency in imaging data acquisition (0.001 ms for event cameras vs. 10 ms for frame cameras). Hence, event data can be acquired at a high acquisition rate (up to one microsecond). This means that the event sequence contains more accurate motion information than video frames. Additionally, event data can be acquired with less motion blur. Due to these advantages, the use of event data is highly beneficial for achieving improvements in the quality of deblurred frames. Accordingly, the results of event-based video deblurring are superior to those of frame-based deblurring methods, even for severely blurred video frames. However, the direct use of event data can often generate visual artifacts in the final output frame (e.g., image noise and incorrect textures), because event data intrinsically contain insufficient textures and event noise. To tackle this issue in event-based deblurring, we propose a two-stage coarse-refinement network by adding a frame-based refinement stage that utilizes all the available frames with more abundant textures to further improve the picture quality of the first-stage coarse output. Specifically, a coarse intermediate frame is estimated by performing event-based video deblurring in the first-stage network. A residual hint attention (RHA) module is also proposed to extract useful attention information from the coarse output and all the available frames. This module connects the first and second stages and effectively guides the frame-based refinement of the coarse output. The final deblurred frame is then obtained by refining the coarse output using the residual hint attention and all the available frame information in the second-stage network. We validated the deblurring performance of the proposed network on the GoPro synthetic dataset (33 videos and 4702 frames) and the HQF real dataset (11 videos and 2212 frames). Compared to the state-of-the-art method (D2Net), we achieved a performance improvement of 1 dB in PSNR and 0.05 in SSIM on the GoPro dataset, and an improvement of 1.7 dB in PSNR and 0.03 in SSIM on the HQF dataset.
(This article belongs to the Special Issue Applications of Video Processing and Computer Vision Sensor II)
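
A compact, hypothetical PyTorch skeleton of the two-stage coarse-refinement idea; the channel sizes, layer counts, and the residual hint attention block are illustrative stand-ins for the architecture described above, not the authors' network:

```python
# Skeleton: stage 1 fuses a blurry frame with an event voxel grid into a coarse
# estimate, a residual-hint-attention block derives a spatial attention map from
# the coarse output and neighboring frames, and stage 2 refines the result.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class ResidualHintAttention(nn.Module):
    """Predicts a spatial attention map from the coarse output and extra frames."""
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(conv_block(in_ch, 32), nn.Conv2d(32, 1, 1), nn.Sigmoid())

    def forward(self, coarse, frames):
        return self.net(torch.cat([coarse, frames], dim=1))

class TwoStageDeblur(nn.Module):
    def __init__(self, event_bins=5, num_frames=3):
        super().__init__()
        self.stage1 = nn.Sequential(conv_block(3 + event_bins, 64), conv_block(64, 64),
                                    nn.Conv2d(64, 3, 3, padding=1))
        self.rha = ResidualHintAttention(3 + 3 * num_frames)
        self.stage2 = nn.Sequential(conv_block(3 + 3 * num_frames, 64), conv_block(64, 64),
                                    nn.Conv2d(64, 3, 3, padding=1))

    def forward(self, blurry, events, frames):
        # frames: neighboring video frames stacked along the channel axis
        coarse = self.stage1(torch.cat([blurry, events], dim=1)) + blurry  # coarse estimate
        attn = self.rha(coarse, frames)                                    # spatial hint attention
        refined_in = torch.cat([coarse, frames], dim=1) * attn
        return self.stage2(refined_in) + coarse                            # refined frame

# Usage with dummy tensors (one 128x128 crop, 5 event bins, 3 neighboring frames):
net = TwoStageDeblur()
out = net(torch.rand(1, 3, 128, 128), torch.rand(1, 5, 128, 128), torch.rand(1, 9, 128, 128))
```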

16 pages, 6480 KiB  
Article
Cognitive Video Surveillance Management in Hierarchical Edge Computing System with Long Short-Term Memory Model
by Dilshod Bazarov Ravshan Ugli, Jingyeom Kim, Alaelddin F. Y. Mohammed and Joohyung Lee
Sensors 2023, 23(5), 2869; https://doi.org/10.3390/s23052869 - 6 Mar 2023
Cited by 4 | Viewed by 1998
Abstract
Nowadays, deep learning (DL)-based video surveillance services are widely used in smart cities because of their ability to accurately identify and track objects, such as vehicles and pedestrians, in real time. This allows a more efficient traffic management and improved public safety. However, DL-based video surveillance services that require object movement and motion tracking (e.g., for detecting abnormal object behaviors) can consume a substantial amount of computing and memory capacity, such as (i) GPU computing resources for model inference and (ii) GPU memory resources for model loading. This paper presents a novel cognitive video surveillance management with long short-term memory (LSTM) model, denoted as the CogVSM framework. We consider DL-based video surveillance services in a hierarchical edge computing system. The proposed CogVSM forecasts object appearance patterns and smooths out the forecast results needed for an adaptive model release. Here, we aim to reduce standby GPU memory by model release while avoiding unnecessary model reloads for a sudden object appearance. CogVSM hinges on an LSTM-based deep learning architecture explicitly designed for future object appearance pattern prediction by training previous time-series patterns to achieve these objectives. By referring to the result of the LSTM-based prediction, the proposed framework controls the threshold time value in a dynamic manner by using an exponential weighted moving average (EWMA) technique. Comparative evaluations on both simulated and real-world measurement data on the commercial edge devices prove that the LSTM-based model in the CogVSM can achieve a high predictive accuracy, i.e., a root-mean-square error metric of 0.795. In addition, the suggested framework utilizes up to 32.1% less GPU memory than the baseline and 8.9% less than previous work.
(This article belongs to the Special Issue Applications of Video Processing and Computer Vision Sensor II)
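
The release logic can be illustrated with a short, assumption-laden sketch: a forecaster (the paper uses an LSTM) predicts future object-appearance counts, the forecast is smoothed with an EWMA, and the surveillance model is unloaded from GPU memory only when the smoothed forecast predicts a sufficiently long idle period. The thresholds below are arbitrary placeholders:

```python
# Sketch of EWMA smoothing plus a dynamic release decision, independent of the
# specific forecaster that produces the per-step object counts.
def ewma(values, alpha=0.3):
    smoothed, s = [], None
    for v in values:
        s = v if s is None else alpha * v + (1 - alpha) * s
        smoothed.append(s)
    return smoothed

def should_release_model(forecast_counts, idle_threshold=0.5, idle_steps=5):
    """Return True if the smoothed forecast predicts a long enough idle period."""
    smoothed = ewma(forecast_counts)
    idle = 0
    for s in smoothed:
        idle = idle + 1 if s < idle_threshold else 0
        if idle >= idle_steps:
            return True
    return False

# Example: forecasted per-step object counts from the forecaster.
print(should_release_model([3, 2, 1, 0.4, 0.2, 0.1, 0, 0, 0, 0, 0, 0]))  # True
```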

13 pages, 1391 KiB  
Article
Brickognize: Applying Photo-Realistic Image Synthesis for Lego Bricks Recognition with Limited Data
by Joel Vidal, Guillem Vallicrosa, Robert Martí and Marc Barnada
Sensors 2023, 23(4), 1898; https://doi.org/10.3390/s23041898 - 8 Feb 2023
Cited by 2 | Viewed by 3604
Abstract
During the last few years, supervised deep convolutional neural networks have become the state-of-the-art for image recognition tasks. Nevertheless, their performance is severely linked to the amount and quality of the training data. Acquiring and labeling data is a major challenge that limits their expansion to new applications, especially with limited data. Recognition of Lego bricks is a clear example of a real-world deep learning application that has been limited by the difficulties associated with data gathering and training. In this work, photo-realistic image synthesis and few-shot fine-tuning are proposed to overcome limited data in the context of Lego bricks recognition. Using synthetic images and a limited set of 20 real-world images from a controlled environment, the proposed system is evaluated on controlled and uncontrolled real-world testing datasets. Results show the good performance of the synthetically generated data and how limited data from a controlled domain can be successfully used for the few-shot fine-tuning of the synthetic training without a perceptible narrowing of its domain. Obtained results reach an AP50 value of 91.33% for uncontrolled scenarios and 98.7% for controlled ones.
(This article belongs to the Special Issue Applications of Video Processing and Computer Vision Sensor II)
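
A hedged sketch of the few-shot fine-tuning step using the torchvision detection API; the class count, checkpoint name, and the dummy sample are placeholders, and Faster R-CNN merely stands in for whatever detector the authors actually used:

```python
# Sketch: fine-tune a detector on a handful of labeled real photos after
# pretraining on synthetic renders.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 1 + 10  # background + brick categories (placeholder)

# Replace the classification head for the target classes.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)
# model.load_state_dict(torch.load("synthetic_pretrained.pth"))  # hypothetical checkpoint

optimizer = torch.optim.SGD([p for p in model.parameters() if p.requires_grad],
                            lr=1e-3, momentum=0.9)
model.train()

# One fine-tuning step on a dummy sample; in practice the loop runs over the
# ~20 labeled real-world images.
images = [torch.rand(3, 300, 300)]
targets = [{"boxes": torch.tensor([[30.0, 40.0, 120.0, 150.0]]),
            "labels": torch.tensor([1])}]
loss_dict = model(images, targets)   # detection models return a dict of losses in train mode
loss = sum(loss_dict.values())
optimizer.zero_grad()
loss.backward()
optimizer.step()
```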

19 pages, 8276 KiB  
Article
A Gait-Based Real-Time Gender Classification System Using Whole Body Joints
by Muhammad Azhar, Sehat Ullah, Khalil Ullah, Ikram Syed and Jaehyuk Choi
Sensors 2022, 22(23), 9113; https://doi.org/10.3390/s22239113 - 24 Nov 2022
Cited by 5 | Viewed by 2003
Abstract
Gait-based gender classification is a challenging task since people may walk in different directions with varying speed, gait style, and occluded joints. The majority of research studies in the literature focused on gender-specific joints, while there is less attention on the comparison of all of a body’s joints. To consider all of the joints, it is essential to determine a person’s gender based on their gait using a Kinect sensor. This paper proposes a logistic-regression-based machine learning model using whole body joints for gender classification. The proposed method consists of different phases including gait feature extraction based on three dimensional (3D) positions, feature selection, and classification of human gender. The Kinect sensor is used to extract 3D features of different joints. Different statistical tools such as Cronbach’s alpha, correlation, t-test, and ANOVA techniques are exploited to select significant joints. The Cronbach’s alpha technique yields an average result of 99.74%, which indicates the reliability of joints. Similarly, the correlation results indicate that there is a significant difference between male and female joints during gait. t-test and ANOVA approaches demonstrate that all twenty joints are statistically significant for gender classification, because the p-value for each joint is zero and less than 1%. Finally, classification is performed based on the selected features using a binary logistic regression model. A total of one hundred (100) volunteers participated in the experiments in a real scenario. The suggested method successfully classifies gender based on 3D features recorded in real-time using a machine learning classifier with an accuracy of 98.0% using all body joints. The proposed method outperformed the existing systems which mostly rely on digital images.
(This article belongs to the Special Issue Applications of Video Processing and Computer Vision Sensor II)
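
A minimal sketch of the final classification step, assuming scikit-learn; the synthetic joint sequences and the simple per-joint summary features are placeholders for the Kinect-derived gait features described above:

```python
# Sketch: binary logistic regression over per-subject gait features derived
# from 3D joint positions (here, mean x/y/z per joint over the sequence).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_subjects, n_joints = 100, 20

# Each subject: a gait sequence of shape (frames, joints, 3); random stand-ins here.
sequences = [rng.normal(size=(60, n_joints, 3)) for _ in range(n_subjects)]
labels = rng.integers(0, 2, size=n_subjects)   # 0 = female, 1 = male (placeholder)

# Per-joint summary features: mean 3D position over the sequence -> 60-dim vector.
X = np.stack([seq.mean(axis=0).reshape(-1) for seq in sequences])

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```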

17 pages, 2073 KiB  
Article
A Study on Fast and Low-Complexity Algorithms for Versatile Video Coding
by Kiho Choi
Sensors 2022, 22(22), 8990; https://doi.org/10.3390/s22228990 - 20 Nov 2022
Cited by 4 | Viewed by 1932
Abstract
Versatile Video Coding (VVC)/H.266, completed in 2020, provides half the bitrate of the previous video coding standard (i.e., High-Efficiency Video Coding (HEVC)/H.265) while maintaining the same visual quality. The primary goal of VVC/H.266 is to achieve a compression capability that is noticeably better than that of HEVC/H.265, as well as the functionality to support a variety of applications with a single profile. Although VVC/H.266 has improved its coding performance by incorporating new advanced technologies with flexible partitioning, the increased encoding complexity has become a challenging issue in practical market usage. To address the complexity issue of VVC/H.266, significant efforts have been expended to develop practical methods for reducing the encoding and decoding processes of VVC/H.266. In this study, we provide an overview of the VVC/H.266 standard, and compared with previous video coding standards, examine a key challenge to VVC/H.266 coding. Furthermore, we survey and present recent technical advances in fast and low-complexity VVC/H.266, focusing on key technical areas.
(This article belongs to the Special Issue Applications of Video Processing and Computer Vision Sensor II)

16 pages, 19775 KiB  
Article
Hint-Based Image Colorization Based on Hierarchical Vision Transformer
by Subin Lee and Yong Ju Jung
Sensors 2022, 22(19), 7419; https://doi.org/10.3390/s22197419 - 29 Sep 2022
Cited by 2 | Viewed by 1740
Abstract
Hint-based image colorization is an image-to-image translation task that aims at creating a full-color image from an input luminance image when a small set of color values for some pixels are given as hints. Though traditional deep-learning-based methods have been proposed in the literature, they are based on convolutional neural networks (CNNs) that have strong spatial locality due to the convolution operations. This often causes non-trivial visual artifacts in the colorization results, such as false color and color bleeding artifacts. To overcome this limitation, this study proposes a vision transformer-based colorization network. The proposed hint-based colorization network has a hierarchical vision transformer architecture in the form of an encoder-decoder structure based on transformer blocks. As the proposed method uses the transformer blocks that can learn rich long-range dependency, it can achieve visually plausible colorization results, even with a small number of color hints. Through the verification experiments, the results reveal that the proposed transformer model outperforms the conventional CNN-based models. In addition, we qualitatively analyze the effect of the long-range dependency of the transformer model on hint-based image colorization.
(This article belongs to the Special Issue Applications of Video Processing and Computer Vision Sensor II)
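
For readers unfamiliar with the task setup, a small sketch of the usual hint-based colorization input (luminance channel plus sparse ab hints and a hint mask), assuming scikit-image is available; the colorization network itself is omitted and the hint-sampling scheme is a placeholder:

```python
# Sketch: build the (L, sparse ab hints, hint mask) input tensor commonly used
# for hint-based colorization.
import numpy as np
from skimage import color, data

rgb = data.astronaut() / 255.0                 # sample RGB image in [0, 1]
lab = color.rgb2lab(rgb)                       # L in [0, 100], a/b roughly in [-128, 127]
L = lab[..., :1]                               # luminance input to the network

# Simulate a handful of user hints at random pixel locations.
rng = np.random.default_rng(0)
h, w = L.shape[:2]
hint_ab = np.zeros((h, w, 2))
hint_mask = np.zeros((h, w, 1))
for _ in range(10):                            # 10 sparse color hints
    y, x = rng.integers(0, h), rng.integers(0, w)
    hint_ab[y, x] = lab[y, x, 1:]              # ground-truth ab value at the hint pixel
    hint_mask[y, x] = 1.0

# Network input: concatenate normalized L, sparse ab hints, and the hint mask (H, W, 4).
net_input = np.concatenate([L / 100.0, hint_ab / 128.0, hint_mask], axis=-1)
print(net_input.shape)
```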

18 pages, 10448 KiB  
Article
Multi-Scale Attention-Guided Non-Local Network for HDR Image Reconstruction
by Howoon Yoon, S. M. Nadim Uddin and Yong Ju Jung
Sensors 2022, 22(18), 7044; https://doi.org/10.3390/s22187044 - 17 Sep 2022
Cited by 3 | Viewed by 2360
Abstract
High-dynamic-range (HDR) image reconstruction methods are designed to fuse multiple low-dynamic-range (LDR) images captured with different exposure values into a single HDR image. Recent CNN-based methods mostly perform local attention- or alignment-based fusion of multiple LDR images to create HDR contents. Depending on a single attention mechanism or alignment causes failure in compensating ghosting artifacts, which can arise in the synthesized HDR images due to the motion of objects or camera movement across different LDR image inputs. In this study, we propose a multi-scale attention-guided non-local network called MSANLnet for efficient HDR image reconstruction. To mitigate the ghosting artifacts, the proposed MSANLnet performs implicit alignment of LDR image features with multi-scale spatial attention modules and then reconstructs pixel intensity values using long-range dependencies through non-local means-based fusion. These modules adaptively select useful information that is not damaged by an object’s movement or unfavorable lighting conditions for image pixel fusion. Quantitative evaluations against several current state-of-the-art methods show that the proposed approach achieves higher performance than the existing methods. Moreover, comparative visual results show the effectiveness of the proposed method in restoring saturated information from original input images and mitigating ghosting artifacts caused by large movement of objects. Ablation studies show the effectiveness of the proposed method, architectural choices, and modules for efficient HDR reconstruction.
(This article belongs to the Special Issue Applications of Video Processing and Computer Vision Sensor II)
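
A generic PyTorch sketch of spatial-attention-guided fusion of multi-exposure features, illustrating the idea of suppressing misaligned or saturated regions before fusion; it is not the paper's MSANLnet, and the channel sizes are arbitrary:

```python
# Sketch: non-reference LDR features are weighted by attention maps computed
# against the reference exposure, then the weighted features are fused.
import torch
import torch.nn as nn

class SpatialAttentionFusion(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.Sigmoid())
        self.fuse = nn.Conv2d(3 * ch, ch, 3, padding=1)

    def forward(self, ref, under, over):
        # Attention maps suppress regions of the non-reference exposures that
        # are misaligned or saturated relative to the reference frame.
        under_w = under * self.attn(torch.cat([under, ref], dim=1))
        over_w = over * self.attn(torch.cat([over, ref], dim=1))
        return self.fuse(torch.cat([under_w, ref, over_w], dim=1))

# Usage with dummy 64-channel feature maps from three LDR exposures:
module = SpatialAttentionFusion()
hdr_feat = module(torch.rand(1, 64, 64, 64), torch.rand(1, 64, 64, 64), torch.rand(1, 64, 64, 64))
print(hdr_feat.shape)  # torch.Size([1, 64, 64, 64])
```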
