Deep Learning for Computer Vision Application

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Computer Science & Engineering".

Deadline for manuscript submissions: closed (15 February 2025) | Viewed by 26392

Special Issue Editor


Dr. Hamed Mozaffari
Guest Editor
Research Officer (AI/ML Expert), Construction Research Centre, National Research Council Canada, Ottawa, ON K1A 0R6, Canada
Interests: computer vision; image processing; artificial intelligence; deep learning; medical imaging; thermal imaging; spectroscopy; virtual reality; data analytics and risk assessment; electronics/embedded systems

Special Issue Information

Dear Colleagues,

Artificial intelligence (AI) methods, and more specifically deep neural networks (also called deep learning models), have become the core technique for computer vision tasks across various applications. These powerful deep learning models enable state-of-the-art automation in pattern recognition from image data. Their impact is visible in daily life, from automatically sorting and retrieving photos in Google Photos to autonomous cars. However, these techniques have not yet been applied to all computer vision tasks. Future studies should seek further applications of AI, supported by better data acquisition and cleaning as well as continued model optimization, innovation, and research. In this Special Issue, we are particularly interested in new applications of deep learning in the computer vision field.

Topics of interest include but are not limited to:

  • Image classification using deep learning;
  • Object detection using deep learning;
  • Semantic and instance segmentation using deep learning;
  • Deep learning techniques for generating new images (generative adversarial networks);
  • Employing reinforcement learning for computer vision tasks;
  • Application of deep learning in the Internet of Things (IoT);
  • Application of deep learning in embedded systems, sensor development, and electronics;
  • Computer vision tasks using deep learning (medical image processing, remote sensing, hyperspectral imaging, thermal imaging, space and extraterrestrial observations);
  • Image sequence analysis using deep learning;
  • Deep learning and computer vision for smart and green building, smart industry, and smart devices.

Dr. Hamed Mozaffari
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • convolutional neural network
  • deep learning
  • computer vision
  • artificial intelligence
  • image processing
  • medical image processing
  • internet of things
  • thermal imaging
  • image technologies
  • application of deep learning
  • autonomous vehicles
  • image classification
  • object detection
  • object segmentation

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (10 papers)


Research


15 pages, 9988 KiB  
Article
Geometry-Aware 3D Hand–Object Pose Estimation Under Occlusion via Hierarchical Feature Decoupling
by Yuting Cai, Huimin Pan, Jiayi Yang, Yichen Liu, Quanli Gao and Xihan Wang
Electronics 2025, 14(5), 1029; https://doi.org/10.3390/electronics14051029 - 5 Mar 2025
Viewed by 622
Abstract
Hand–object occlusion poses a significant challenge in 3D pose estimation. During hand–object interactions, parts of the hand or object are frequently occluded by the other, making it difficult to extract discriminative features for accurate pose estimation. Traditional methods typically extract features for both the hand and object from a single image using a shared backbone network. However, this approach often results in feature contamination, where hand and object features are mixed, especially in occluded regions. To address these issues, we propose a novel 3D hand–object pose estimation framework that explicitly tackles the problem of occlusion through two key innovations. While existing methods rely on a single backbone for feature extraction, our framework introduces a feature decoupling strategy that shares low-level features (using ResNet-50) to capture interaction contexts, while separating high-level features into two independent branches. This design ensures that hand-specific features and object-specific features are processed separately, reducing feature contamination and improving pose estimation accuracy under occlusion. Recognizing the correlation between the hand’s occluded regions and the object’s geometry, we introduce the Hand–Object Cross-Attention Transformer (HOCAT) module. Unlike traditional attention mechanisms that focus solely on feature correlations, the HOCAT leverages the geometric stability of the object as prior knowledge to guide the reconstruction of occluded hand regions. Specifically, the object features (key/value) provide contextual information to enhance the hand features (query), enabling the model to infer the positions of occluded hand joints based on the object’s known structure. This approach significantly improves the model’s ability to handle complex occlusion scenarios. The experimental results demonstrate that our method achieves significant improvements in hand–object pose estimation tasks on publicly available datasets such as HO3D V2 and Dex-YCB. On the HO3D V2 dataset, the PAMPJPE reaches 9.1 mm, the PAMPVPE is 9.0 mm, and the F-score reaches 95.8%.
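
As a rough illustration of the query/key–value pattern the abstract describes, the sketch below implements a generic cross-attention block in PyTorch in which hand features act as queries and object features supply keys and values. The module name, dimensions, and residual structure are assumptions for illustration only, not the paper's implementation.

```python
import torch
import torch.nn as nn

class HandObjectCrossAttention(nn.Module):
    """Illustrative cross-attention block: hand features attend to object features.

    A generic sketch of the query/key-value pattern described in the abstract
    (HOCAT); layer sizes and names are hypothetical, not the paper's code."""
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hand_feats: torch.Tensor, obj_feats: torch.Tensor) -> torch.Tensor:
        # hand_feats: (B, N_hand, dim) queries; obj_feats: (B, N_obj, dim) keys/values.
        enhanced, _ = self.attn(query=hand_feats, key=obj_feats, value=obj_feats)
        # Residual connection keeps the original hand features and adds object context.
        return self.norm(hand_feats + enhanced)

# Example usage with random tensors standing in for backbone features.
hand = torch.randn(2, 21, 256)   # e.g., 21 hand-joint tokens
obj = torch.randn(2, 64, 256)    # e.g., 64 object tokens
out = HandObjectCrossAttention()(hand, obj)
print(out.shape)  # torch.Size([2, 21, 256])
```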

15 pages, 23802 KiB  
Article
Vision-Based Prediction of Flashover Using Transformers and Convolutional Long Short-Term Memory Model
by M. Hamed Mozaffari, Yuchuan Li, Niloofar Hooshyaripour and Yoon Ko
Electronics 2024, 13(23), 4776; https://doi.org/10.3390/electronics13234776 - 3 Dec 2024
Viewed by 844
Abstract
The prediction of fire growth is crucial for effective firefighting and rescue operations. Recent advancements in vision-based techniques using RGB vision and infrared (IR) thermal imaging data, coupled with artificial intelligence and deep learning techniques, have shown promise for detecting fire and predicting its behavior. This study introduces the use of Convolutional Long Short-term Memory (ConvLSTM) network models for predicting room fire growth by analyzing spatiotemporal IR thermal imaging data acquired from full-scale room fire tests. Our findings revealed that SwinLSTM, an enhanced version of ConvLSTM combined with transformers (a deep learning architecture based on a new mechanism called multi-head attention) for computer vision purposes, can be used for the prediction of room fire flashover occurrence. Notably, transformer-based ConvLSTM deep learning models, such as SwinLSTM, demonstrate superior prediction capability, which suggests a new vision-based smart solution for future fire growth prediction tasks. The main focus of this work is to perform a feasibility study on the use of a pure vision-based deep learning model for analysis of future video data to anticipate the behavior of fire growth in room fire incidents.
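
For readers unfamiliar with ConvLSTM, the sketch below shows a standard ConvLSTM cell (convolutions in place of the fully connected LSTM gates) rolled over a short sequence of frames. It illustrates the spatiotemporal recurrence the paper builds on, not the SwinLSTM architecture itself, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Standard ConvLSTM cell: convolutions replace the matrix multiplications
    of a regular LSTM, so hidden states keep a spatial layout. A generic sketch,
    not the SwinLSTM variant used in the paper."""
    def __init__(self, in_ch: int, hidden_ch: int, kernel: int = 3):
        super().__init__()
        self.hidden_ch = hidden_ch
        self.conv = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch, kernel, padding=kernel // 2)

    def forward(self, x, state):
        h, c = state  # hidden and cell states, each (B, hidden_ch, H, W)
        gates = self.conv(torch.cat([x, h], dim=1))
        i, f, o, g = torch.chunk(gates, 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c_next = f * c + i * g
        h_next = o * torch.tanh(c_next)
        return h_next, c_next

# Roll the cell over a sequence of thermal frames (B, T, C, H, W) to build a
# spatiotemporal representation that a prediction head could decode.
B, T, C, H, W = 1, 8, 1, 64, 64
cell = ConvLSTMCell(C, 16)
h = torch.zeros(B, 16, H, W)
c = torch.zeros(B, 16, H, W)
frames = torch.randn(B, T, C, H, W)
for t in range(T):
    h, c = cell(frames[:, t], (h, c))
print(h.shape)  # torch.Size([1, 16, 64, 64])
```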

16 pages, 5772 KiB  
Article
Optimizing Football Formation Analysis via LSTM-Based Event Detection
by Benjamin Orr, Ephraim Pan and Dah-Jye Lee
Electronics 2024, 13(20), 4105; https://doi.org/10.3390/electronics13204105 - 18 Oct 2024
Viewed by 1553
Abstract
The process of manually annotating sports footage is a demanding one. In American football alone, coaches spend thousands of hours reviewing and analyzing videos each season. We aim to automate this process by developing a system that generates comprehensive statistical reports from full-length football game videos. Having previously demonstrated the proof of concept for our system, here, we present optimizations to our preprocessing techniques along with an inventive method for multi-person event detection in sports videos. Employing a long short-term memory (LSTM)-based architecture to detect the snap in American football, we achieve an outstanding LSI (Levenshtein similarity index) of 0.9445, suggesting a normalized difference of less than 0.06 between predictions and ground truth labels. We also illustrate the utility of snap detection as a means of identifying when the offensive players assume their formation. Our results exhibit not only the success of our unique approach and underlying optimizations but also the potential for continued robustness as we pursue the development of our remaining system components.
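
The Levenshtein similarity index compares a predicted event sequence against the ground-truth sequence via edit distance. A minimal sketch follows, assuming the common normalization 1 − distance / max(length); the paper's exact definition of LSI may differ.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two sequences."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution
            prev = cur
    return dp[n]

def levenshtein_similarity(pred: str, truth: str) -> float:
    """One common normalization: 1 - distance / max length. The paper's exact
    LSI definition may differ; this is only an illustrative assumption."""
    if not pred and not truth:
        return 1.0
    return 1.0 - levenshtein_distance(pred, truth) / max(len(pred), len(truth))

# Frame-level labels encoded as characters, e.g. 'n' = no snap, 's' = snap.
print(levenshtein_similarity("nnnsnnn", "nnnnsnn"))  # ~0.714
```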

19 pages, 9439 KiB  
Article
MFAD-RTDETR: A Multi-Frequency Aggregate Diffusion Feature Flow Composite Model for Printed Circuit Board Defect Detection
by Zhihua Xie and Xiaowei Zou
Electronics 2024, 13(17), 3557; https://doi.org/10.3390/electronics13173557 - 7 Sep 2024
Cited by 1 | Viewed by 2120
Abstract
To address the challenges of excessive model parameters and low detection accuracy in printed circuit board (PCB) defect detection, this paper proposes a novel PCB defect detection model based on the improved RTDETR (Real-Time Detection, Embedding and Tracking) method, named MFAD-RTDETR. Specifically, the proposed model introduces the designed Detail Feature Retainer (DFR) into the original RTDETR backbone to capture and retain local details. Subsequently, based on the Mamba architecture, the Visual State Space (VSS) module is integrated to enhance global attention while reducing the original quadratic complexity to a linear level. Furthermore, by exploiting the deformable attention mechanism, which dynamically adjusts reference points, the model achieves precise localization of target defects and improves the accuracy of the transformer in complex visual tasks. Meanwhile, a receptive field synthesis mechanism is incorporated to enrich multi-scale semantic information and reduce parameter complexity. In addition, the scheme proposes a novel Multi-frequency Aggregation and Diffusion feature composite paradigm (MFAD-feature composite paradigm), which consists of the Aggregation Diffusion Fusion (ADF) module and the Refiner Feature Composition (RFC) module. It aims to strengthen features with fine-grained awareness while preserving a certain level of global attention. Finally, the Wise IoU (WIoU) dynamic nonmonotonic focusing mechanism is used to reduce competition among high-quality anchor boxes and mitigate the effects of the harmful gradients from low-quality examples, thereby concentrating on anchor boxes of average quality to promote the overall performance of the detector. Extensive experiments are conducted on the PCB defect dataset released by Peking University to validate the effectiveness of the proposed model. The experimental results show that our approach achieves 97.0% mean Average Precision (mAP)@0.5 and 51.0% mAP@0.5:0.95, which significantly outperforms the original RTDETR. Moreover, the model reduces the number of parameters by approximately 18.2% compared to the original RTDETR.
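
For context on the reported metrics, the sketch below computes a standard box IoU and lists the IoU thresholds behind mAP@0.5 and mAP@0.5:0.95 (the COCO-style convention). It is generic background, not code from the paper.

```python
import numpy as np

def box_iou(a, b):
    """IoU between two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# mAP@0.5 counts a detection as correct if its IoU with a ground-truth box is
# at least 0.5; mAP@0.5:0.95 averages AP over thresholds 0.50, 0.55, ..., 0.95.
thresholds = np.arange(0.5, 1.0, 0.05)
print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143
print(thresholds)
```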

18 pages, 89225 KiB  
Article
Graph Attention Networks and Track Management for Multiple Object Tracking
by Yajuan Zhang, Yongquan Liang, Ahmed Elazab, Zhihui Wang and Changmiao Wang
Electronics 2023, 12(19), 4079; https://doi.org/10.3390/electronics12194079 - 28 Sep 2023
Cited by 1 | Viewed by 2364
Abstract
Multiple object tracking (MOT) constitutes a critical research area within the field of computer vision. The creation of robust and efficient systems, which can approximate the mechanisms of human vision, is essential to enhance the efficacy of multiple object-tracking techniques. However, obstacles such as repetitive target appearances and frequent occlusions cause considerable inaccuracies or omissions in detection. Following the updating of these inaccurate observations into the tracklet, the effectiveness of the tracking model, employing appearance features, declines significantly. This paper introduces a novel method of multiple object tracking, employing graph attention networks and track management (GATM). Utilizing a graph attention network, an attention mechanism is employed to capture the relationships of nodes within the graph as well as node-to-node correlations across graphs. This mechanism allows selective focus on the features of advantageous nodes and enhances discriminability between node features, subsequently improving the performance and robustness of multiple object tracking. Simultaneously, we categorize distinct tracklet states and introduce an efficient track management method, which employs varying processing techniques for tracklets in diverse states. This method can manage occluded tracks in crowded scenes and improves tracking accuracy. Experiments conducted on three challenging public datasets (MOT16, MOT17, and MOT20) demonstrate that our method could deliver competitive performance.
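
The sketch below shows a single-head graph attention layer in the spirit of GAT, where each node (e.g., a detection or tracklet feature) attends only to its neighbors in the graph. Layer sizes and the toy graph are illustrative assumptions, not the GATM implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """Single-head graph attention layer in the style of GAT.
    A generic sketch of attention over graph neighbors; not the paper's code."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, in_dim) node features, adj: (N, N) binary adjacency matrix.
        h = self.W(x)                                    # (N, out_dim)
        N = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(N, N, -1),
                           h.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1))      # (N, N) raw attention scores
        e = e.masked_fill(adj == 0, float('-inf'))       # attend only to neighbors
        alpha = torch.softmax(e, dim=-1)
        return F.elu(alpha @ h)

# Toy graph: 4 detections/tracklets, fully connected.
x = torch.randn(4, 32)
adj = torch.ones(4, 4)
print(GraphAttentionLayer(32, 16)(x, adj).shape)  # torch.Size([4, 16])
```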

14 pages, 3711 KiB  
Article
VHR-BirdPose: Vision Transformer-Based HRNet for Bird Pose Estimation with Attention Mechanism
by Runang He, Xiaomin Wang, Huazhen Chen and Chang Liu
Electronics 2023, 12(17), 3643; https://doi.org/10.3390/electronics12173643 - 29 Aug 2023
Cited by 7 | Viewed by 2723
Abstract
Pose estimation plays a crucial role in recognizing and analyzing the postures, actions, and movements of humans and animals using computer vision and machine learning techniques. However, bird pose estimation encounters specific challenges, including bird diversity, posture variation, and the fine granularity of posture. To overcome these challenges, we propose VHR-BirdPose, a method that combines Vision Transformer (ViT) and Deep High-Resolution Network (HRNet) with an attention mechanism. VHR-BirdPose effectively extracts features using Vision Transformer’s self-attention mechanism, which captures global dependencies in the images and allows for better capturing of pose details and changes. The attention mechanism is employed to enhance the focus on bird keypoints, improving the accuracy of pose estimation. By combining HRNet with Vision Transformer, our model can extract multi-scale features while maintaining high-resolution details and incorporating richer semantic information through the attention mechanism. This integration of HRNet and Vision Transformer leverages the advantages of both models, resulting in accurate and robust bird pose estimation. We conducted extensive experiments on the Animal Kingdom dataset to evaluate the performance of VHR-BirdPose. The results demonstrate that our proposed method achieves state-of-the-art performance in bird pose estimation. VHR-BirdPose is of great significance for advancing the study of bird behavior, ecological understanding, and the protection of bird populations.
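
Heatmap-based pose estimators such as HRNet typically end by decoding one heatmap per keypoint into coordinates. The sketch below shows that generic decoding step via per-channel argmax; it is standard background, and the paper's own decoding and attention details are not reproduced here.

```python
import torch

def decode_keypoints(heatmaps: torch.Tensor):
    """Decode keypoint coordinates from predicted heatmaps by taking the argmax
    per channel, the usual final step of heatmap-based pose estimators.
    A generic sketch; keypoint count and heatmap size are illustrative.

    heatmaps: (B, K, H, W) with one channel per keypoint.
    Returns coords (B, K, 2) as (x, y) and confidences (B, K)."""
    B, K, H, W = heatmaps.shape
    flat = heatmaps.view(B, K, -1)
    conf, idx = flat.max(dim=-1)
    x = (idx % W).float()
    y = (idx // W).float()
    return torch.stack([x, y], dim=-1), conf

coords, conf = decode_keypoints(torch.rand(1, 17, 64, 48))
print(coords.shape, conf.shape)  # torch.Size([1, 17, 2]) torch.Size([1, 17])
```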

20 pages, 10877 KiB  
Article
SGooTY: A Scheme Combining the GoogLeNet-Tiny and YOLOv5-CBAM Models for Nüshu Recognition
by Yan Zhang and Liumei Zhang
Electronics 2023, 12(13), 2819; https://doi.org/10.3390/electronics12132819 - 26 Jun 2023
Cited by 1 | Viewed by 1885
Abstract
With the development of society, the intangible cultural heritage of Chinese Nüshu is in danger of extinction. To promote the research and popularization of traditional Chinese culture, we use deep learning to automatically detect and recognize handwritten Nüshu characters. To address difficulties such as the creation of a Nüshu character dataset, uneven samples, and difficulties in character recognition, we first build a large-scale handwritten Nüshu character dataset, HWNS2023, by using various data augmentation methods. This dataset contains 5500 Nüshu images and 1364 labeled character samples. Second, in this paper, we propose a two-stage scheme model combining GoogLeNet-tiny and YOLOv5-CBAM (SGooTY) for Nüshu recognition. In the first stage, five basic deep learning models including AlexNet, VGGNet16, GoogLeNet, MobileNetV3, and ResNet are trained and tested on the dataset, and the model structure is improved to enhance the accuracy of recognizing handwritten Nüshu characters. In the second stage, we combine an object detection model to re-recognize misidentified handwritten Nüshu characters to ensure the accuracy of the overall system. Experimental results show that in the first stage, the improved model achieves the highest accuracy of 99.3% in recognizing Nüshu characters, which significantly improves the recognition rate of handwritten Nüshu characters. After integrating the object recognition model, the overall recognition accuracy of the model reached 99.9%.
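
One plausible way to wire the two-stage idea, classify each character first and fall back to an object detector when the classifier is uncertain, is sketched below. The confidence threshold, function names, and routing rule are assumptions for illustration, not the SGooTY implementation.

```python
from typing import Callable, Tuple

def two_stage_recognize(
    image,
    classify: Callable[[object], Tuple[str, float]],   # stage 1: (label, confidence)
    detect_and_recognize: Callable[[object], str],      # stage 2: detector-based re-recognition
    conf_threshold: float = 0.9,                         # hypothetical routing threshold
) -> str:
    """Assumed decision flow: trust the fast classifier when it is confident,
    otherwise re-recognize the character with the heavier detection model."""
    label, confidence = classify(image)
    if confidence >= conf_threshold:
        return label
    return detect_and_recognize(image)

# Toy usage with stand-in models.
result = two_stage_recognize(
    image="nushu_sample.png",
    classify=lambda img: ("character_17", 0.42),          # low-confidence first stage
    detect_and_recognize=lambda img: "character_23",       # second stage overrides it
)
print(result)  # character_23
```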

20 pages, 9967 KiB  
Article
CNN-Based Fluid Motion Estimation Using Correlation Coefficient and Multiscale Cost Volume
by Jun Chen, Hui Duan, Yuanxin Song, Ming Tang and Zemin Cai
Electronics 2022, 11(24), 4159; https://doi.org/10.3390/electronics11244159 - 13 Dec 2022
Cited by 2 | Viewed by 2702
Abstract
Motion estimation for complex fluid flows via their image sequences is a challenging issue in computer vision. It plays a significant role in scientific research and engineering applications related to meteorology, oceanography, and fluid mechanics. In this paper, we introduce a novel convolutional neural network (CNN)-based motion estimator for complex fluid flows using multiscale cost volume. It uses correlation coefficients as the matching costs, which can improve the accuracy of motion estimation by enhancing the discrimination of the feature matching and overcoming the feature distortions caused by the changes of fluid shapes and illuminations. Specifically, it first generates sparse seeds by a feature extraction network. A correlation pyramid is then constructed for all pairs of sparse seeds, and the predicted matches are iteratively updated through a recurrent neural network, which looks up a multi-scale cost volume from the correlation pyramid via a multi-scale search scheme. Then it uses the searched multi-scale cost volume, the current matches, and the context features as the input features to update the predicted matches. Since the multi-scale cost volume contains motion information for both large and small displacements, it can recover small-scale motion structures. However, the predicted matches are sparse, so the final flow field is computed by performing a CNN-based interpolation for these sparse matches. The experimental results show that our method significantly outperforms the current motion estimators in capturing different motion patterns in complex fluid flows, especially in recovering some small-scale vortices. It also achieves state-of-the-art evaluation results on the public fluid datasets and successfully captures the storms in Jupiter’s White Ovals from the remote sensing images.
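
The sketch below illustrates the core idea of using correlation coefficients as matching costs: features are zero-meaned and normalized so that a dot product between a pixel's feature and a displaced feature becomes a correlation coefficient, and the scores over a local search window form a cost volume. It is a single-scale toy version under assumed shapes, not the paper's multi-scale implementation.

```python
import torch

def correlation_cost_volume(feat1: torch.Tensor, feat2: torch.Tensor, radius: int = 3) -> torch.Tensor:
    """Local cost volume built from correlation coefficients.

    feat1, feat2: (B, C, H, W) feature maps of two consecutive frames.
    Returns (B, (2*radius+1)**2, H, W): for every pixel in feat1, its normalized
    correlation with each candidate displacement in feat2 within the search radius.
    A generic single-scale sketch; border handling here simply wraps around."""
    def normalize(f):
        # Zero-mean, unit-norm over channels so the dot product is a correlation coefficient.
        f = f - f.mean(dim=1, keepdim=True)
        return f / (f.norm(dim=1, keepdim=True) + 1e-6)

    f1, f2 = normalize(feat1), normalize(feat2)
    costs = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted = torch.roll(f2, shifts=(dy, dx), dims=(2, 3))
            costs.append((f1 * shifted).sum(dim=1, keepdim=True))
    return torch.cat(costs, dim=1)

vol = correlation_cost_volume(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(vol.shape)  # torch.Size([1, 49, 32, 32])
```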

Review


30 pages, 4599 KiB  
Review
Deep Learning Innovations in Video Classification: A Survey on Techniques and Dataset Evaluations
by Makara Mao, Ahyoung Lee and Min Hong
Electronics 2024, 13(14), 2732; https://doi.org/10.3390/electronics13142732 - 11 Jul 2024
Cited by 3 | Viewed by 7645
Abstract
Video classification has achieved remarkable success in recent years, driven by advanced deep learning models that automatically categorize video content. This paper provides a comprehensive review of video classification techniques and the datasets used in this field. We summarize key findings from recent research, focusing on network architectures, model evaluation metrics, and parallel processing methods that enhance training speed. Our review includes an in-depth analysis of state-of-the-art deep learning models and hybrid architectures, comparing models to traditional approaches and highlighting their advantages and limitations. Critical challenges such as handling large-scale datasets, improving model robustness, and addressing computational constraints are explored. By evaluating performance metrics, we identify areas where current models excel and where improvements are needed. Additionally, we discuss data augmentation techniques designed to enhance dataset accuracy and address specific challenges in video classification tasks. This survey also examines the evolution of convolutional neural networks (CNNs) in image processing and their adaptation to video classification tasks. We propose future research directions and provide a detailed comparison of existing approaches using the UCF-101 dataset, highlighting progress and ongoing challenges in achieving robust video classification.

Other


24 pages, 6316 KiB  
Systematic Review
Deep Learning Approaches for Chest Radiograph Interpretation: A Systematic Review
by Hammad Iqbal, Arshad Khan, Narayan Nepal, Faheem Khan and Yeon-Kug Moon
Electronics 2024, 13(23), 4688; https://doi.org/10.3390/electronics13234688 - 27 Nov 2024
Cited by 2 | Viewed by 2043
Abstract
Lung diseases are a major global health concern, with nearly 4 million deaths annually, according to the World Health Organization (WHO). Chest X-rays (CXR) are widely used as a cost-effective and efficient diagnostic tool by radiologists to detect conditions such as pneumonia, tuberculosis, COVID-19, and lung cancer. This review paper provides an overview of the current research on diagnosing lung diseases using CXR images and Artificial Intelligence (AI), without focusing on any specific disease. It examines different approaches employed by researchers to leverage CXR, an accessible diagnostic medium, for early lung disease detection. This review shortlisted 11 research papers addressing this problem through AI, exploring the datasets used and their sources. Results varied across studies: for lung cancer, Deep Convolutional Neural Network (DCNN) achieved 97.20% accuracy, while multiclass frameworks like ResNet152V2+Bi-GRU (gated recurrent unit) reached 79.78% and 93.38%, respectively. For COVID-19 detection, accuracy rates of 98% and 99.37% were achieved using EfficientNet and Parallel Convolutional Neural Network-Extreme Learning Machine (CNN-ELM). Additionally, studies on the CXR-14 dataset (14 classes) showed high accuracy, with MobileNet V2 reaching 94%. Other notable results include 73% accuracy with VDSNet, 98.05% with VGG19+CNN for three classes, and high accuracy in detecting pediatric pneumonia, lung opacity, pneumothorax, and tuberculosis.
