Deep Learning for Perception and Recognition: Method and Applications

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Sensing and Imaging".

Deadline for manuscript submissions: 31 October 2025 | Viewed by 22888

Special Issue Editors


Guest Editor
State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang, China
Interests: multimodal perception and recognition; light field processing; and computer vision in industrial scenarios

Guest Editor
College of Computer Science and Technology, Shanghai Electric Power University, Shanghai, China
Interests: pattern recognition and machine learning; interpretable artificial intelligence; computer vision; smart grid

Guest Editor
School of Automation, Central South University, Changsha 410083, China
Interests: infrared thermography; temperature measurement; deep learning; vision-based measurement; object detection; information fusion

Special Issue Information

Dear Colleagues,

The rapid advancement of deep learning technology has brought about transformative breakthroughs in perception and recognition systems across a wide range of applications. In addition to driving innovation in industrial sectors, it has opened up significant opportunities in fields such as intelligent transportation, smart cities, healthcare, and robotics.

Deep learning significantly enhances the accuracy and robustness of perception and recognition systems through hierarchical feature extraction in multilayer neural networks, achieving remarkable results in areas such as soft sensing, image classification, natural language processing, and object detection. By training on large volumes of labeled data, deep learning algorithms are able to automatically learn complex feature representations and efficiently recognize objects during the perception process.

As application scenarios grow more complex and data become increasingly diverse, deep learning models continue to face significant challenges in solving real-world perception and recognition problems. These challenges include ensuring model generalization when dealing with noisy, imbalanced, or limited data; enhancing performance through self-supervised, few-shot, or transfer learning in cases of insufficient labeled data; integrating information across different scales, dimensions, and modalities; and developing explainable perception and recognition systems for high-risk applications.

This Special Issue seeks to highlight advanced research in deep learning for perception and recognition. Submitted papers should clearly present novel contributions, whether in general methodologies or innovative applications, addressing any of the following topics:

  • Image and video processing;
  • Visual intelligent perception;
  • Multimodal information fusion and learning;
  • Perception and recognition with applications;
  • Pattern recognition and analysis;
  • Knowledge and learning systems with applications;
  • Explainable machine learning.

Dr. Gaochang Wu
Dr. Zizhu Fan
Dr. Dong Pan
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once registered, authors can access the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • image processing
  • visual perception
  • multimodal learning
  • multimodal fusion
  • pattern analysis
  • knowledge system
  • explainable machine learning

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found on the MDPI website.

Published Papers (17 papers)

Research

22 pages, 1087 KB  
Article
Modeling the Internal and Contextual Attention for Self-Supervised Skeleton-Based Action Recognition
by Wentian Xin, Yue Teng, Jikang Zhang, Yi Liu, Ruyi Liu, Yuzhi Hu and Qiguang Miao
Sensors 2025, 25(21), 6532; https://doi.org/10.3390/s25216532 - 23 Oct 2025
Abstract
Multimodal contrastive learning has achieved significant performance advantages in self-supervised skeleton-based action recognition. Previous methods are limited by modality imbalance, which reduces alignment accuracy and makes it difficult to combine important spatial–temporal frequency patterns, leading to confusion between modalities and weaker feature representations. To overcome these problems, we explore intra-modality feature-wise self-similarity and inter-modality instance-wise cross-consistency, and discover two inherent correlations that benefit recognition: (i) Global Perspective, which expresses how action semantics carry a broad, high-level understanding and supports the use of globally discriminative feature representations; and (ii) Focus Adaptation, which refers to the role of the frequency spectrum in guiding attention toward key joints by emphasizing compact and salient signal patterns. Building upon these insights, we propose a novel language–skeleton contrastive learning framework, MICA, comprising two key components: (a) Feature Modulation, which constructs a skeleton–language action conceptual domain to minimize the expected information gain between the vision and language modalities; and (b) Frequency Feature Learning, which introduces a Frequency-domain Spatial–Temporal block (FreST) that focuses on sparse key human joints in the frequency domain, where signal energy is compact. Extensive experiments demonstrate that our method achieves remarkable action recognition performance on widely used benchmark datasets, including NTU RGB+D 60 and NTU RGB+D 120. On the challenging PKU-MMD dataset in particular, MICA achieves an improvement of at least 4.6% over classical methods such as CrosSCLR and AimCLR, demonstrating its ability to capture internal and contextual attention information.
(This article belongs to the Special Issue Deep Learning for Perception and Recognition: Method and Applications)
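
Contrastive frameworks like the one described above are commonly trained with an InfoNCE-style objective that pulls matching skeleton and text embeddings together and pushes mismatched pairs apart. The following is a minimal, framework-free sketch of a one-directional InfoNCE loss, assuming that row i of `skeleton_emb` and row i of `text_emb` form a positive pair; the function names and the temperature of 0.1 are illustrative and do not reproduce the authors' Feature Modulation or FreST components.

```python
import math
from typing import List, Sequence


def _l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(skeleton_emb: Sequence[Sequence[float]],
             text_emb: Sequence[Sequence[float]],
             temperature: float = 0.1) -> float:
    """Mean InfoNCE loss; row i of each input is assumed to be a positive pair."""
    s = [_l2_normalize(v) for v in skeleton_emb]
    t = [_l2_normalize(v) for v in text_emb]
    total = 0.0
    for i, si in enumerate(s):
        # Cosine similarity of skeleton i to every text embedding, scaled by temperature.
        logits = [sum(a * b for a, b in zip(si, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp over all candidates (positive + negatives).
        peak = max(logits)
        log_partition = peak + math.log(sum(math.exp(z - peak) for z in logits))
        # Cross-entropy with the matching text as the target class.
        total += log_partition - logits[i]
    return total / len(s)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned={aligned:.4f} swapped={swapped:.4f}")
```

With these toy embeddings, correctly matched pairs give a loss close to zero, while the swapped pairing gives a loss of about 10 (that is, 1/temperature), which is the behavior the objective relies on.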

24 pages, 13390 KB  
Article
Performance of Acoustic, Electro-Acoustic and Optical Sensors in Precise Waveform Analysis of a Plucked and Struck Guitar String
by Jan Jasiński, Marek Pluta, Roman Trojanowski, Julia Grygiel and Jerzy Wiciak
Sensors 2025, 25(21), 6514; https://doi.org/10.3390/s25216514 - 22 Oct 2025
Abstract
This study presents a comparative performance analysis of three sensor technologies (microphone, magnetic pickup, and laser Doppler vibrometer) for capturing string vibration under varied excitation conditions: striking, plectrum plucking, and wire plucking. Two different magnetic pickups are included in the comparison. Measurements were taken at multiple excitation levels on a simplified electric guitar mounted on a stable platform with repeatable excitation mechanisms. The analysis focuses on each sensor’s capacity to resolve fine-scale waveform features during the initial attack, while also taking into account its ability to measure general changes in instrument dynamics and timbre. We evaluate how well the sensors distinguish vibro-acoustic phenomena resulting from changes in excitation method, excitation strength, and measurement location. Our findings highlight the significant influence of sensor choice on observable string vibration. While the microphone captures the overall radiated sound, it lacks the required spatial selectivity, and its SNR is 34 dB lower than that of the other methods. Magnetic pickups enable precise string-specific measurements, offering a compelling balance of accuracy and cost-effectiveness; however, their low-pass frequency characteristic limits temporal fidelity and must be accounted for when analysing general sound timbre. Laser Doppler vibrometers provide superior micro-temporal fidelity, which can have critical implications for physical modeling, instrument design, and advanced audio signal processing, but they have severe practical limitations. Critically, we demonstrate that the required optical target, even when weighing as little as 0.1% of the string’s mass, alters the string’s vibratory characteristics by influencing RMS energy and spectral content.
(This article belongs to the Special Issue Deep Learning for Perception and Recognition: Method and Applications)
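
The 34 dB gap is a logarithmic quantity. Assuming the usual power-based definition SNR = 10 log10(P_signal / P_noise) (the abstract does not state how SNR was estimated), the helpers below convert such a gap into linear factors; the function names are illustrative.

```python
import math


def snr_db(signal_power: float, noise_power: float) -> float:
    """Signal-to-noise ratio in decibels from mean signal and noise powers."""
    return 10.0 * math.log10(signal_power / noise_power)


def db_gap_to_power_ratio(gap_db: float) -> float:
    """Factor by which a power ratio changes for a given difference in dB."""
    return 10.0 ** (gap_db / 10.0)


def db_gap_to_amplitude_ratio(gap_db: float) -> float:
    """Equivalent factor for amplitude (RMS) ratios, which use 20*log10."""
    return 10.0 ** (gap_db / 20.0)


print(round(db_gap_to_power_ratio(34.0)))          # 2512
print(round(db_gap_to_amplitude_ratio(34.0), 1))   # 50.1
```

Under this definition, a 34 dB SNR deficit means the microphone's signal-to-noise power ratio is roughly 2,500 times smaller, or about 50 times smaller in amplitude terms.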

30 pages, 73820 KB  
Article
Progressive Multi-Scale Perception Network for Non-Uniformly Blurred Underwater Image Restoration
by Dechuan Kong, Yandi Zhang, Xiaohu Zhao, Yanyan Wang and Yanqiang Wang
Sensors 2025, 25(17), 5439; https://doi.org/10.3390/s25175439 - 2 Sep 2025
Viewed by 711
Abstract
Underwater imaging is affected by spatially varying blur caused by water flow turbulence, light scattering, and camera motion, resulting in severe visual quality loss and diminished performance in downstream vision tasks. Although numerous underwater image enhancement methods have been proposed, non-uniform blur under realistic underwater conditions remains largely underexplored. To bridge this gap, we propose PMSPNet, a Progressive Multi-Scale Perception Network designed to handle non-uniform underwater blur. The network integrates a Hybrid Interaction Attention Module to precisely model feature ambiguity directions and regional disparities. In addition, a Progressive Motion-Aware Perception Branch captures spatial orientation variations in blurred regions, progressively refining the localization of blur-related features, and a Progressive Feature Feedback Block enhances reconstruction quality through iterative feature feedback across scales. To facilitate robust evaluation, we construct the Non-uniform Underwater Blur Benchmark, which comprises diverse real-world blur patterns. Extensive experiments on multiple real-world underwater datasets demonstrate that PMSPNet consistently surpasses state-of-the-art methods, achieving an average PSNR of 25.51 dB with an inference time of 0.01 s. It can thus provide high-quality input from underwater sensors for visual perception and downstream applications such as underwater robots, marine ecological monitoring, and inspection tasks.
(This article belongs to the Special Issue Deep Learning for Perception and Recognition: Method and Applications)
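
PSNR compares a restored image with a reference through the mean squared error, PSNR = 10 log10(MAX² / MSE). The sketch below implements that standard definition on flattened pixel lists and inverts it to interpret the reported figure; it assumes 8-bit intensities (MAX = 255), which the abstract does not specify, and the helper names are illustrative.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], restored: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio (dB) between two equally sized flattened images."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def rmse_for_psnr(psnr_db: float, max_value: float = 255.0) -> float:
    """Root-mean-square error implied by a given PSNR on the same intensity scale."""
    return max_value / 10.0 ** (psnr_db / 20.0)


print(round(rmse_for_psnr(25.51), 1))  # 13.5 (average error in 8-bit gray levels)
```

Under the 8-bit assumption, an average PSNR of 25.51 dB corresponds to a root-mean-square error of about 13.5 gray levels.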

21 pages, 3937 KB  
Article
Wind Turbine Blade Defect Recognition Method Based on Large-Vision-Model Transfer Learning
by Xin Li, Jinghe Tian, Xinfu Pang, Li Shen, Haibo Li and Zedong Zheng
Sensors 2025, 25(14), 4414; https://doi.org/10.3390/s25144414 - 15 Jul 2025
Viewed by 806
Abstract
Timely and accurate detection of wind turbine blade surface defects is crucial for ensuring operational safety and improving maintenance efficiency in large-scale wind farms. However, existing methods often suffer from poor generalization, background interference, and inadequate real-time performance. To overcome these limitations, we developed an end-to-end defect recognition framework structured as a three-stage process: blade localization using YOLOv5, robust feature extraction via the large vision model DINOv2, and defect classification using a Stochastic Configuration Network (SCN). Compared with conventional CNN-based approaches, DINOv2 significantly improves representation capability for complex textures. The experimental results show that the proposed method achieves a classification accuracy of 97.8% and an average inference time of 19.65 ms per image, satisfying real-time requirements. Compared with traditional methods, this framework provides a more scalable, accurate, and efficient solution for the intelligent inspection and maintenance of wind turbine blades.
(This article belongs to the Special Issue Deep Learning for Perception and Recognition: Method and Applications)
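
The three-stage design (localize the blade, embed the crop, classify the embedding) can be expressed as a small pipeline with injectable stages. This is a structural sketch only: the stage callables are hypothetical placeholders standing in for YOLOv5, DINOv2, and the SCN, and no real model API is called.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]          # grayscale image as rows of pixels (simplification)
Box = Tuple[int, int, int, int]    # (x0, y0, x1, y1) in pixels, upper bounds exclusive


def crop(image: Image, box: Box) -> Image:
    """Cut the detected region out of the full frame."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


@dataclass
class BladeDefectPipeline:
    # Each stage is an injected callable; none of these are real library APIs.
    localize: Callable[[Image], Sequence[Box]]   # stand-in for a blade detector (e.g. YOLOv5)
    embed: Callable[[Image], List[float]]        # stand-in for a feature backbone (e.g. DINOv2)
    classify: Callable[[List[float]], str]       # stand-in for the SCN classifier

    def run(self, image: Image) -> List[Tuple[Box, str]]:
        results = []
        for box in self.localize(image):
            features = self.embed(crop(image, box))
            results.append((box, self.classify(features)))
        return results


# Toy stages: one fixed box, mean-intensity "feature", threshold "classifier".
pipeline = BladeDefectPipeline(
    localize=lambda img: [(1, 1, 3, 3)],
    embed=lambda patch: [sum(map(sum, patch)) / sum(len(r) for r in patch)],
    classify=lambda f: "defect" if f[0] > 0.5 else "normal",
)
frame = [[0.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 1.0, 0.0],
         [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 0.0, 0.0]]
print(pipeline.run(frame))
```

Keeping each stage behind a plain callable means the classifier can be retrained or swapped without touching the detector or the feature extractor.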

17 pages, 4558 KB  
Article
Automated Anomaly Detection in Blast Furnace Shaft Static Pressure Using Adversarial Autoencoders and Mode Decomposition
by Xiaodong Sun, Jie Zhu, Bing Tang and Zhaohui Jiang
Sensors 2025, 25(11), 3473; https://doi.org/10.3390/s25113473 - 31 May 2025
Viewed by 742
Abstract
Monitoring the blast furnace shaft static pressure is crucial for maintaining a stable ironmaking process. Traditional rule-based methods and manual inspections suffer from high labor costs and inconsistent standards. This article proposes a new unsupervised anomaly detection framework that combines an adversarial autoencoder with variational mode decomposition (VMD). First, VMD is combined with sample entropy calculation and a clustering algorithm to extract the trend, periodic, and other components of the multidimensional signals; these components are then integrated into an improved adversarially trained autoencoder to detect global and local anomalies. The proposed method achieves an accuracy of 0.95, a recall of 0.91, and an F1 score of 0.93, demonstrating that it effectively captures multi-scale anomalies, including value bias, morphological changes, and sudden fluctuations, while providing analysts with interpretable, detailed anomaly diagnoses.
(This article belongs to the Special Issue Deep Learning for Perception and Recognition: Method and Applications)
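
Sample entropy quantifies how regular a signal is, which makes it a natural score for telling smooth trend or periodic modes apart from noise-like ones after decomposition. The sketch below is a direct O(n²) implementation of the standard SampEn(m, r) definition; here `r` is an absolute tolerance (it is often set to about 0.2 times the series' standard deviation). How the paper combines these scores with VMD and clustering is not specified in the abstract, so that part is not reproduced.

```python
import math
import random
from typing import Sequence


def sample_entropy(x: Sequence[float], m: int = 2, r: float = 0.2) -> float:
    """SampEn(m, r) = ln(B / A), using the Chebyshev distance and no self-matches.

    B counts template pairs of length m within tolerance r, A counts pairs of
    length m + 1; both use the same n - m starting positions.
    """
    n = len(x)
    if n <= m + 1:
        raise ValueError("series is too short for the chosen embedding dimension")

    def count_matches(length: int) -> int:
        templates = [x[i:i + length] for i in range(n - m)]
        count = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                if max(abs(a - b) for a, b in zip(templates[i], templates[j])) <= r:
                    count += 1
        return count

    b = count_matches(m)
    a = count_matches(m + 1)
    if a == 0 or b == 0:
        return math.inf  # undefined in the strict sense; treated as maximally irregular
    return math.log(b / a)


rng = random.Random(0)
periodic = [float(i % 2) for i in range(200)]   # perfectly regular component
noise = [rng.random() for _ in range(200)]      # irregular, noise-like component
print(sample_entropy(periodic), sample_entropy(noise))
```

The alternating series scores 0, while uniform noise scores roughly 1, so thresholding or clustering these values is one way to separate regular components from irregular ones.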

15 pages, 4638 KB  
Article
Invisible CMOS Camera Dazzling for Conducting Adversarial Attacks on Deep Neural Networks
by Zvi Stein, Adir Hazan and Adrian Stern
Sensors 2025, 25(7), 2301; https://doi.org/10.3390/s25072301 - 4 Apr 2025
Viewed by 1231
Abstract
Despite the outstanding performance of deep neural networks (DNNs), they remain vulnerable to adversarial attacks. While digital-domain adversarial attacks are well documented, most physical-world attacks are visible to the human eye. Here, we present a novel invisible, optical-based physical adversarial attack that works by dazzling a CMOS camera. The attack uses a designed light pulse sequence that the camera’s shutter mechanism transforms into a spatial pattern within the acquired image. We provide a detailed analysis of the photopic conditions required to keep the attacking light source invisible to human observers while effectively disrupting the image, thereby deceiving the DNN. The results indicate that the light source duty cycle controls the tradeoff between the attack’s success rate and the degree of concealment.
(This article belongs to the Special Issue Deep Learning for Perception and Recognition: Method and Applications)
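
Assuming a rolling shutter (the readout scheme used by most CMOS cameras), each image row starts its exposure slightly later than the previous one, so a light source that blinks faster than the eye can follow is recorded as horizontal bands. The sketch below simulates that effect under simplifying assumptions (a square-wave source, a fixed per-row delay, uniform exposure, no sensor noise); the parameter values and function names are illustrative and are not taken from the paper.

```python
from typing import List


def light_on(t: float, period: float, duty: float) -> bool:
    """Square-wave source: on for the first `duty` fraction of every period."""
    return (t % period) < duty * period


def rolling_shutter_rows(n_rows: int, row_delay: float, exposure: float,
                         period: float, duty: float, steps: int = 200) -> List[float]:
    """Fraction of each row's exposure window during which the source is on."""
    rows = []
    for r in range(n_rows):
        start = r * row_delay  # row r begins exposing r * row_delay seconds after row 0
        lit = sum(light_on(start + exposure * k / steps, period, duty)
                  for k in range(steps))
        rows.append(lit / steps)
    return rows


# 400 rows, 10 us row delay, 100 us exposure, 2 ms pulse period, 50% duty cycle.
bands = rolling_shutter_rows(400, 1e-5, 1e-4, 2e-3, 0.5)
print(bands[0], bands[150])  # a fully lit row and a fully dark row
```

Rows whose exposure falls inside an on-phase are fully lit and rows inside an off-phase stay dark, producing bands. Since the eye integrates light over time, perceived brightness scales roughly with the duty cycle, which is one way to read the success-versus-concealment trade-off reported in the abstract.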

21 pages, 2969 KB  
Article
HGF-MiLaG: Hierarchical Graph Fusion for Emotion Recognition in Conversation with Mid-Late Gender-Aware Strategy
by Yihan Wang, Rongrong Hao, Ziheng Li, Xinhe Kuang, Jiacheng Dong, Qi Zhang, Fengkui Qian and Changzeng Fu
Sensors 2025, 25(4), 1182; https://doi.org/10.3390/s25041182 - 14 Feb 2025
Cited by 3 | Viewed by 1346
Abstract
Emotion recognition in conversation (ERC) is an important research direction in human-computer interaction (HCI); it recognizes emotions by analyzing utterance signals to enhance user experience and plays an important role in several domains. However, existing ERC research mainly constructs graph networks by directly modeling interactions on multimodal fused features, which cannot adequately capture complex dialog dependencies based on time, speaker, modality, and so on. In addition, existing multi-task learning frameworks for ERC do not systematically investigate how and where gender information should be injected into the model to optimize ERC performance. To address these problems, this paper proposes Hierarchical Graph Fusion for ERC with a Mid-Late Gender-aware Strategy (HGF-MiLaG). HGF-MiLaG uses a hierarchical fusion graph to adequately capture intra-modal and inter-modal speaker dependencies and temporal dependencies. HGF-MiLaG also explores how the location at which gender information is injected affects ERC performance, and ultimately employs a Mid-Late multilevel gender-aware strategy that allows the hierarchical graph network to determine the proportion of emotion and gender information in the classifier. Empirical results on two public multimodal datasets (i.e., IEMOCAP and MELD) demonstrate that HGF-MiLaG outperforms existing methods.
(This article belongs to the Special Issue Deep Learning for Perception and Recognition: Method and Applications)
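
One common way to encode temporal, speaker, and cross-modal dependencies for ERC is a graph with one node per (utterance, modality) and typed edges. The following sketch builds such an edge list; the window size, the edge-type names, and the flat (non-hierarchical) layout are assumptions for illustration and do not reproduce HGF-MiLaG's hierarchical fusion graph or its gender-aware strategy.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Node = Tuple[int, str]          # (utterance index, modality)
Edge = Tuple[Node, Node, str]   # (source, target, edge type)


def build_conversation_graph(speakers: Sequence[str], modalities: Sequence[str],
                             window: int = 2) -> List[Edge]:
    """Hypothetical ERC graph: one node per (utterance, modality)."""
    edges: List[Edge] = []
    n = len(speakers)
    # Intra-modal temporal edges within `window` steps, typed by speaker relation.
    for m in modalities:
        for i in range(n):
            for j in range(i + 1, min(n, i + window + 1)):
                kind = "same_speaker" if speakers[i] == speakers[j] else "cross_speaker"
                edges.append(((i, m), (j, m), kind))
    # Inter-modal edges linking the modalities of the same utterance.
    for i in range(n):
        for m1, m2 in combinations(modalities, 2):
            edges.append(((i, m1), (i, m2), "cross_modal"))
    return edges


graph = build_conversation_graph(["A", "B", "A"], ["text", "audio"], window=2)
print(len(graph))
```

A relational graph network could then learn a separate transformation per edge type; the paper's actual fusion order and layer design are not described in the abstract.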

14 pages, 12776 KB  
Article
A Projective-Geometry-Aware Network for 3D Vertebra Localization in Calibrated Biplanar X-Ray Images
by Kangqing Ye, Wenyuan Sun, Rong Tao and Guoyan Zheng
Sensors 2025, 25(4), 1123; https://doi.org/10.3390/s25041123 - 13 Feb 2025
Cited by 1 | Viewed by 1277
Abstract
Current Deep Learning (DL)-based methods for vertebra localization in biplanar X-ray images mainly focus on two-dimensional (2D) information and neglect the projective geometry, limiting the accuracy of 3D navigation in X-ray-guided spine surgery. A method for 3D vertebra localization from calibrated biplanar X-ray images is therefore highly desirable. In this study, we propose a projective-geometry-aware network for localizing 3D vertebrae in calibrated biplanar X-ray images, referred to as ProVLNet. ProVLNet consists of three components: a Siamese 2D feature extractor that extracts local appearance features from the biplanar X-ray images, a spatial alignment fusion module that incorporates the projective geometry when fusing the extracted 2D features in 3D space, and a 3D landmark regression module that regresses the 3D coordinates of the vertebrae from the fused 3D features. Evaluated on two typical and challenging datasets acquired from the lumbar and the thoracic spine, ProVLNet achieved identification rates of 99.53% and 98.98% and point-to-point errors of 0.64 mm and 1.38 mm, respectively, outperforming state-of-the-art (SOTA) methods.
(This article belongs to the Special Issue Deep Learning for Perception and Recognition: Method and Applications)
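
The spatial alignment fusion module uses projective geometry to place 2D features in 3D space. A minimal sketch, assuming an ideal point-source (pinhole) projection model with known 3×4 calibration matrices, nearest-pixel sampling, and scalar features, projects a voxel centre into both views and averages the features it lands on. All names and matrices are hypothetical, and the actual module in ProVLNet may differ (for example, in its sampling or learned fusion).

```python
from typing import List, Optional, Sequence, Tuple

Matrix3x4 = Sequence[Sequence[float]]
FeatureMap = List[List[float]]  # one scalar feature per pixel, for brevity
Point3 = Tuple[float, float, float]


def project(P: Matrix3x4, X: Point3) -> Optional[Tuple[float, float]]:
    """Project a 3D point with a calibrated 3x4 matrix (pinhole model)."""
    x, y, z = X
    u, v, w = (row[0] * x + row[1] * y + row[2] * z + row[3] for row in P)
    if w <= 0:
        return None  # point is behind the source: no valid projection
    return u / w, v / w


def sample_nearest(fmap: FeatureMap, uv: Optional[Tuple[float, float]]) -> Optional[float]:
    """Nearest-pixel lookup; None when the projection misses the image."""
    if uv is None:
        return None
    col, row = round(uv[0]), round(uv[1])
    if 0 <= row < len(fmap) and 0 <= col < len(fmap[0]):
        return fmap[row][col]
    return None


def fuse_voxel(X: Point3, views: Sequence[Tuple[Matrix3x4, FeatureMap]]) -> float:
    """Average the 2D features that one voxel centre projects onto across all views."""
    samples = [sample_nearest(fmap, project(P, X)) for P, fmap in views]
    valid = [s for s in samples if s is not None]
    return sum(valid) / len(valid) if valid else 0.0


# Toy calibrations chosen so that the point (0, 0, 1) lands on pixel (2, 2) in both views.
P_ap = [[100.0, 0.0, 2.0, 0.0], [0.0, 100.0, 2.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
P_lat = [[2.0, 0.0, 100.0, -98.0], [2.0, 100.0, 0.0, 2.0], [1.0, 0.0, 0.0, 1.0]]
frontal = [[0.0] * 5 for _ in range(5)]
lateral = [[0.0] * 5 for _ in range(5)]
frontal[2][2], lateral[2][2] = 7.0, 3.0
print(fuse_voxel((0.0, 0.0, 1.0), [(P_ap, frontal), (P_lat, lateral)]))
```

Repeating this for every voxel of a grid yields a 3D feature volume from which a regression head could estimate landmark coordinates.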

18 pages, 12390 KB  
Article
DeiT and Image Deep Learning-Driven Correction of Particle Size Effect: A Novel Approach to Improving NIRS-XRF Coal Quality Analysis Accuracy
by Jiaxin Yin, Ruonan Liu, Wangbao Yin, Suotang Jia and Lei Zhang
Sensors 2025, 25(3), 928; https://doi.org/10.3390/s25030928 - 4 Feb 2025
Cited by 1 | Viewed by 1589
Abstract
Coal, as a vital global energy resource, directly impacts the efficiency of power generation and environmental protection. Rapid and accurate coal quality analysis is therefore essential for its clean and efficient utilization. However, combined near-infrared spectroscopy and X-ray fluorescence (NIRS-XRF) analysis often suffers from the particle size effect of coal samples, resulting in unstable and inaccurate analytical outcomes. This study introduces a novel correction method that combines the Segment Anything Model (SAM) for precise particle segmentation with Data-Efficient Image Transformers (DeiTs) to analyze the relationship between particle size and ash measurement errors. Microscopic images of coal samples are processed with SAM to generate binary mask images reflecting particle size characteristics. These masks are then analyzed by a DeiT model with transfer learning to build an effective correction model. Experiments show a 22% reduction in both standard deviation (SD) and root mean square error (RMSE), significantly enhancing the accuracy and consistency of ash prediction. By integrating cutting-edge image processing and deep learning, this approach effectively reduces submillimeter particle size effects, improves model adaptability, and enhances measurement reliability. It also holds potential for broader applications in analyzing complex samples, advancing automation and efficiency in online analytical systems, and driving innovation across industries.
(This article belongs to the Special Issue Deep Learning for Perception and Recognition: Method and Applications)
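
The abstract reports SD and RMSE reductions without defining them precisely. Assuming SD refers to the spread of repeated ash predictions for one sample and RMSE to the deviation from reference ash values, the helpers below compute both metrics and the relative reduction; this interpretation and the function names are assumptions, and the numbers are toy values, not data from the study.

```python
import math
import statistics
from typing import Sequence


def rmse(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Root-mean-square error against reference (e.g. laboratory) values."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(predicted))


def repeatability_sd(repeats: Sequence[float]) -> float:
    """Sample standard deviation of repeated predictions for one sample."""
    return statistics.stdev(repeats)


def percent_reduction(before: float, after: float) -> float:
    """Relative reduction in percent; 22.0 means the metric fell by 22%."""
    return 100.0 * (before - after) / before


# Toy numbers only, not data from the study.
print(rmse([3.0, 4.0], [0.0, 0.0]), repeatability_sd([1.0, 2.0, 3.0]))
print(percent_reduction(10.0, 7.8))
```

A 22% reduction means the corrected metric is 0.78 times its uncorrected value, whichever of the two metrics is considered.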

28 pages, 16917 KB  
Article
A Framework of State Estimation on Laminar Grinding Based on the CT Image–Force Model
by Jihao Liu, Guoyan Zheng and Weixin Yan
Sensors 2025, 25(1), 238; https://doi.org/10.3390/s25010238 - 3 Jan 2025
Viewed by 1091
Abstract
Localizing the cutting tip during laminar grinding is a major challenge for safe surgery. To address this problem, we develop a state estimation framework based on a CT image–force model. The framework takes the pre-operative CT image and the intra-operative milling force signal as inputs. A bone milling force prediction model is built so that the surgically planned paths can be transformed into predicted milling force sequences. The intra-operative milling force signal is segmented with a tumbling window algorithm, and the similarity between the predicted sequences and the segmented milling signal is derived with the dynamic time warping (DTW) algorithm. The derived similarity indicates the position of the cutting tip. Finally, random sample consensus (RANSAC) is used to reduce the influence of certain disturbing factors. The code for the functional simulations has been made publicly available.
(This article belongs to the Special Issue Deep Learning for Perception and Recognition: Method and Applications)
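
The matching loop described above (tumbling-window segmentation, DTW comparison against predicted force sequences, best match as the position estimate) can be sketched as follows. The window size, the absolute-difference cost, discarding an incomplete trailing window, and all function names are assumptions; the RANSAC stage is omitted.

```python
import math
from typing import List, Sequence


def tumbling_windows(signal: Sequence[float], size: int) -> List[List[float]]:
    """Split a signal into consecutive, non-overlapping windows; a short tail is dropped."""
    return [list(signal[i:i + size]) for i in range(0, len(signal) - size + 1, size)]


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic dynamic time warping with absolute-difference cost."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def best_match(segment: Sequence[float], predictions: Sequence[Sequence[float]]) -> int:
    """Index of the predicted force sequence most similar (lowest DTW) to the segment."""
    distances = [dtw_distance(segment, p) for p in predictions]
    return min(range(len(distances)), key=distances.__getitem__)


measured = [0.1, 0.2, 5.0, 5.1, 6.0, 6.1]
predicted = [[0.0, 0.0, 1.0], [5.0, 6.0, 6.0], [9.0, 9.0, 9.0]]
for window in tumbling_windows(measured, 3):
    print(window, best_match(window, predicted))
```

Because DTW tolerates local stretching in time, a measured segment can match a predicted sequence even when the actual feed rate differs slightly from the planned one.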

19 pages, 2113 KB  
Article
3D-BCLAM: A Lightweight Neurodynamic Model for Assessing Student Learning Effectiveness
by Wei Zhuang, Yunhong Zhang, Yuan Wang and Kaiyang He
Sensors 2024, 24(23), 7856; https://doi.org/10.3390/s24237856 - 9 Dec 2024
Viewed by 1321
Abstract
Evaluating students’ learning effectiveness is of great importance for gaining a deeper understanding of the learning process, accurately diagnosing learning barriers, and developing effective teaching strategies. Emotion, as a key factor influencing learning outcomes, provides a novel perspective for identifying cognitive states and emotional experiences. However, traditional evaluation methods suffer from one-sided feature extraction and high model complexity, which often makes it difficult to fully exploit the value of emotional data. To address this challenge, we propose a lightweight neurodynamic model, 3D-BCLAM, which integrates Bidirectional Convolutional Long Short-Term Memory (BCL) with a dynamic attention mechanism to efficiently capture dynamic emotional changes in time series at very low computational cost. 3D-BCLAM enables a comprehensive evaluation of students’ learning outcomes that covers not only the cognitive level but also a detailed analysis of the emotional dimension. On public datasets, 3D-BCLAM significantly outperforms traditional machine learning models as well as deep learning models based on Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). These results validate the effectiveness of 3D-BCLAM and support further innovation in the assessment of student learning effectiveness.
(This article belongs to the Special Issue Deep Learning for Perception and Recognition: Method and Applications)
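
The abstract pairs a BCL encoder with a dynamic attention mechanism. The sketch below shows generic dot-product attention pooling over time steps, in which a query vector scores every hidden state and the softmax weights form a context vector; treating the hidden states as BCL outputs and using a fixed query are assumptions, since the paper's exact attention formulation is not given in the abstract.

```python
import math
from typing import List, Sequence, Tuple


def attention_pool(hidden_states: Sequence[Sequence[float]],
                   query: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Softmax-weighted sum of per-time-step features (dot-product scores)."""
    scores = [sum(h * q for h, q in zip(state, query)) for state in hidden_states]
    peak = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    context = [sum(w * state[d] for w, state in zip(weights, hidden_states))
               for d in range(dim)]
    return context, weights


# Three time steps of 2-D features; the query favours the second dimension.
states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
context, weights = attention_pool(states, [0.0, 4.0])
print([round(w, 3) for w in weights])
```

In a trained model, the query (or the scoring function) would be learned, so the network decides which moments of the emotional time series contribute most to the summary vector.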

15 pages, 884 KB  
Article
Node Classification Method Based on Hierarchical Hypergraph Neural Network
by Feng Xu, Wanyue Xiong, Zizhu Fan and Licheng Sun
Sensors 2024, 24(23), 7655; https://doi.org/10.3390/s24237655 - 29 Nov 2024
Viewed by 2019
Abstract
Hypergraph neural networks have gained widespread attention due to their effectiveness in handling graph-structured data with complex relationships and multi-dimensional interactions. However, existing hypergraph neural network models mainly rely on planar message-passing mechanisms, which have two limitations: (i) low efficiency in encoding long-distance information; and (ii) underutilization of high-order neighborhood features, since information is aggregated only along the edges of the original graph. This paper proposes a hierarchical hypergraph neural network (HCHG) to address these issues. HCHG exploits the high-order relationship-capturing capability of hypergraphs, uses the Louvain community detection algorithm to identify community structures within the network, and constructs hypergraphs layer by layer. In the bottom-level hypergraph, the model establishes high-order relationships through direct neighbor nodes, while in the top-level hypergraph, it captures global relationships between aggregated communities. Through three hierarchical message-passing mechanisms, HCHG effectively integrates local and global information, enhancing multi-resolution node representations and significantly improving performance in node classification tasks. The model also performs well on 3D multi-view datasets, which can be created by capturing 3D shapes and geometric features with sensors or by manual modeling, opening extensive application scenarios for analyzing three-dimensional shapes and complex geometric structures. Theoretical analysis and experimental results show that HCHG outperforms traditional hypergraph neural networks on complex networks.
(This article belongs to the Special Issue Deep Learning for Perception and Recognition: Method and Applications)
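
The two hypergraph levels described above can be illustrated with plain mean aggregation: bottom-level hyperedges group each node with its direct neighbours, and top-level hyperedges group the members of each community. In this sketch the communities are taken as given (they could come, for example, from NetworkX's `louvain_communities`), and the mean aggregation, the equal-weight mixing of the two levels, and all function names are assumptions; the paper's three message-passing mechanisms and learned weights are not reproduced.

```python
from typing import Dict, FrozenSet, List, Sequence, Set

Hyperedge = FrozenSet[int]


def neighborhood_hyperedges(adjacency: Dict[int, Sequence[int]]) -> List[Hyperedge]:
    """Bottom level: one hyperedge per node, containing the node and its direct neighbours."""
    return [frozenset({v, *nbrs}) for v, nbrs in sorted(adjacency.items())]


def community_hyperedges(communities: Sequence[Set[int]]) -> List[Hyperedge]:
    """Top level: one hyperedge per detected community."""
    return [frozenset(c) for c in communities]


def hypergraph_mean_pass(features: Dict[int, float],
                         hyperedges: Sequence[Hyperedge]) -> Dict[int, float]:
    """Two-stage mean aggregation: nodes -> hyperedges -> nodes."""
    edge_msgs = [sum(features[v] for v in e) / len(e) for e in hyperedges]
    updated = {}
    for v in features:
        incident = [msg for e, msg in zip(hyperedges, edge_msgs) if v in e]
        updated[v] = sum(incident) / len(incident) if incident else features[v]
    return updated


adjacency = {0: [1], 1: [0, 2], 2: [1], 3: []}
features = {0: 0.0, 1: 3.0, 2: 6.0, 3: 9.0}
local = hypergraph_mean_pass(features, neighborhood_hyperedges(adjacency))
global_ = hypergraph_mean_pass(features, community_hyperedges([{0, 1, 2}, {3}]))
combined = {v: 0.5 * (local[v] + global_[v]) for v in features}
print(local, global_, combined)
```

The local pass only mixes information between adjacent nodes, while the community pass lets node 0 receive information from node 2 in a single step, which is the kind of long-range shortcut the hierarchy is meant to provide.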

15 pages, 458 KB  
Article
Facial Anti-Spoofing Using “Clue Maps”
by Liang Yu Gong, Xue Jun Li and Peter Han Joo Chong
Sensors 2024, 24(23), 7635; https://doi.org/10.3390/s24237635 - 29 Nov 2024
Viewed by 1759
Abstract
Spoofing attacks (or presentation attacks) can easily be mounted against facial recognition systems, leaving online financial systems vulnerable. Given the high demand for spoofing attack detection, there is an urgent need for anti-spoofing solutions with strong generalization ability. Multi-modality methods, such as combining depth images with RGB images, and feature fusion methods currently perform well on certain datasets. However, the cost of obtaining depth information and physiological signals, especially biological signals, is relatively high. This paper proposes a representation learning method based on an Auto-Encoder structure built from Swin Transformer and ResNet. Its training is supervised by cross-entropy loss, semi-hard triplet loss, and Smooth L1 pixel-wise loss. The architecture contains three parts: an Encoder, a Decoder, and an auxiliary classifier. The Encoder extracts features that capture correlations between patches, and the Decoder generates universal “Clue Maps” for further contrastive learning. Finally, the auxiliary classifier assists the model's decision by providing a preliminary result. Extensive experiments evaluated the Attack Presentation Classification Error Rate (APCER), Bonafide Presentation Classification Error Rate (BPCER), and Average Classification Error Rate (ACER) on popular spoofing databases (CelebA, OULU, and CASIA-MFSD), comparing the method against several existing anti-spoofing models. Our approach outperforms these models, reaching ACERs of 1.2% and 1.6% in intra-dataset experiments. In the inter-dataset experiment, with CASIA-MFSD as the training set and Replay-Attack as the testing set, it achieves a new state-of-the-art Half Total Error Rate (HTER) of 23.8%.
(This article belongs to the Special Issue Deep Learning for Perception and Recognition: Method and Applications)
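The error rates quoted above follow a simple recipe. The sketch below computes them for a binary detector under these assumptions: label 1 marks an attack, label 0 marks a bona fide sample, and a score at or above the threshold is classified as an attack. It simplifies ISO/IEC 30107-3, which reports APCER per presentation-attack-instrument species. The function names and the EER-based threshold choice for HTER are illustrative assumptions, not the paper's evaluation code.

```python
def pad_error_rates(labels, scores, threshold=0.5):
    """Return (APCER, BPCER, ACER) for label 1 = attack, 0 = bona fide."""
    attacks = [s for y, s in zip(labels, scores) if y == 1]
    bona_fide = [s for y, s in zip(labels, scores) if y == 0]
    # APCER: share of attacks wrongly accepted as bona fide.
    apcer = sum(s < threshold for s in attacks) / len(attacks)
    # BPCER: share of bona fide presentations wrongly rejected as attacks.
    bpcer = sum(s >= threshold for s in bona_fide) / len(bona_fide)
    return apcer, bpcer, (apcer + bpcer) / 2


def eer_threshold(labels, scores):
    """Pick the candidate threshold where the two error rates are closest (the EER point)."""
    best_t, best_gap = None, float("inf")
    for t in sorted(set(scores)):
        apcer, bpcer, _ = pad_error_rates(labels, scores, t)
        if abs(apcer - bpcer) < best_gap:
            best_t, best_gap = t, abs(apcer - bpcer)
    return best_t


def hter(labels, scores, threshold):
    """Half Total Error Rate: mean of the false acceptance and false rejection rates.

    In a cross-dataset protocol, the threshold is fixed on the training domain
    and then applied unchanged to the test domain.
    """
    far, frr, _ = pad_error_rates(labels, scores, threshold)
    return (far + frr) / 2


labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.1, 0.6, 0.2, 0.3]
print(pad_error_rates(labels, scores))  # (0.25, 0.25, 0.25)
print(eer_threshold(labels, scores))    # 0.6
```

Under this binary simplification, HTER and ACER combine the same two error rates. What differs is the protocol: HTER is usually reported with a threshold carried over from a different dataset.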

17 pages, 3450 KB  
Article
Coal and Gangue Detection Networks with Compact and High-Performance Design
by Xiangyu Cao, Huajie Liu, Yang Liu, Junheng Li and Ke Xu
Sensors 2024, 24(22), 7318; https://doi.org/10.3390/s24227318 - 16 Nov 2024
Cited by 1 | Viewed by 1156
Abstract
The efficient separation of coal and gangue remains a critical challenge in modern coal mining, directly affecting energy efficiency, environmental protection, and sustainable development. Current machine vision-based sorting methods face significant challenges in dense scenes, where label rewriting severely degrades model performance, particularly when coal and gangue are closely distributed in conveyor belt images. This paper introduces CGDet (Coal and Gangue Detection), a novel compact convolutional neural network that addresses these challenges through two key innovations. First, we propose an Object Distribution Density Measurement (ODDM) method that quantitatively analyzes the distribution density of coal and gangue, enabling optimal selection of input and feature map resolutions to mitigate label rewriting. Second, we develop a Relative Resolution Object Scale Measurement (RROSM) method to assess object scales. This method guides the design of a streamlined feature fusion structure that eliminates redundant components while maintaining detection accuracy. Experimental results demonstrate the effectiveness of our approach. CGDet achieved AP50 and AR50 scores of 96.7% and 99.2%, respectively. It also reduced model parameters by 46.76%, computational cost by 47.94%, and inference time by 31.50% compared with traditional models. These improvements make CGDet particularly suitable for real-time coal and gangue sorting in underground mining environments, where computational resources are limited but high accuracy is essential. Our work provides a new perspective on designing compact yet high-performance object detection networks for dense scene applications.
(This article belongs to the Special Issue Deep Learning for Perception and Recognition: Method and Applications)
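To make the label-rewriting problem concrete, the sketch below counts how many ground-truth objects are lost when several object centres fall into the same grid cell. It assumes a simple centre-based assigner in which each cell of a feature map with a given stride can hold only one object. It illustrates why feature-map resolution matters and is not the paper's ODDM method. The function names and example coordinates are hypothetical.

```python
from collections import Counter


def rewritten_labels(centers, stride):
    """Number of object labels overwritten on a feature map with the given stride.

    Assumes each grid cell (stride x stride input pixels) keeps only one object, so
    every additional object whose centre lands in an occupied cell rewrites a label.
    """
    cells = Counter((int(x // stride), int(y // stride)) for x, y in centers)
    return sum(n - 1 for n in cells.values())


def rewrite_rate(centers, stride):
    """Fraction of objects whose labels are lost at this stride."""
    return rewritten_labels(centers, stride) / len(centers)


# Two pairs of closely packed lumps on a conveyor-belt image (pixel coordinates).
centers = [(5, 5), (12, 6), (40, 40), (44, 45)]
for stride in (32, 8, 4):
    print(stride, rewrite_rate(centers, stride))
# 32 0.5
# 8 0.25
# 4 0.0
```

Smaller strides (higher-resolution feature maps) lower the rewrite rate but cost more computation. A density measure helps choose the coarsest resolution that still keeps closely packed objects apart.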

37 pages, 15011 KB  
Article
Steering-Angle Prediction and Controller Design Based on Improved YOLOv5 for Steering-by-Wire System
by Cunliang Ye, Yunlong Wang, Yongfu Wang and Yan Liu
Sensors 2024, 24(21), 7035; https://doi.org/10.3390/s24217035 - 31 Oct 2024
Cited by 1 | Viewed by 2748
Abstract
Steering-angle prediction plays a crucial role in the control of autonomous vehicles (AVs), and the task comprises both predicting and controlling the steering angle. However, the prediction accuracy and computational efficiency of standard YOLOv5 are limited. For steering-angle control, angular velocity is difficult to measure, and control performance is affected by external disturbances and unknown friction. This paper proposes YOLOv5Ms, a lightweight steering-angle prediction network based on YOLOv5, which aims to achieve accurate prediction while improving computational efficiency. In addition, it proposes a neural-network-based adaptive output feedback control scheme with output constraints to regulate the steering angle predicted by YOLOv5Ms. First, because most lane-line datasets consist of simulated images and lack diversity, a new lane dataset derived from real roads was manually created to train the proposed network. Second, to improve real-time accuracy in steering-angle prediction and the effectiveness of steering control, the bounding-box regression loss is changed from the generalized intersection over union (GIoU) loss to Shape-IoU_Loss, which converges better. Compared with YOLOv5s, YOLOv5Ms reduces weight storage by 30.34% while improving accuracy by 7.38%. For the controller, the backstepping control method is combined with a Lyapunov barrier function to design the adaptive neural network output feedback controller with output constraints. Finally, a rigorous stability analysis based on Lyapunov stability theory guarantees that all signals in the closed-loop system remain bounded. Numerical simulations and experiments show that the proposed method reduces the root mean squared error (RMSE) by 39.16% compared with traditional backstepping control. It also achieves good estimation performance for the angle, angular velocity, and unknown disturbances.
(This article belongs to the Special Issue Deep Learning for Perception and Recognition: Method and Applications)
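The GIoU loss that YOLOv5Ms replaces, and the RMSE used to compare controllers, both have short closed forms. The sketch below implements them for axis-aligned boxes given as (x1, y1, x2, y2). It does not implement Shape-IoU itself, and the function names are illustrative rather than taken from the paper or from YOLOv5's code.

```python
import math


def _area(box):
    """Area of an (x1, y1, x2, y2) box; empty or inverted boxes have zero area."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def giou(a, b):
    """Generalized IoU: IoU minus the share of the enclosing box C not covered by A or B."""
    inter = _area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = _area(a) + _area(b) - inter
    enclose = _area((min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3])))
    return inter / union - (enclose - union) / enclose


def giou_loss(pred, target):
    """GIoU loss lies between 0 and 2 and stays informative for non-overlapping boxes."""
    return 1.0 - giou(pred, target)


def rmse(predicted, measured):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(predicted, measured)) / len(measured))


print(giou_loss((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 + 5/63, about 1.079
print(giou_loss((0, 0, 1, 1), (2, 0, 3, 1)))  # disjoint boxes: 1 + 1/3, about 1.333
```

Plain IoU loss is flat (equal to 1) for any two disjoint boxes. The enclosing-box penalty gives GIoU a gradient that pulls separated boxes together, which is why later variants such as Shape-IoU build on the same box-overlap framework.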

24 pages, 5816 KB  
Article
Adaptive FPGA-Based Accelerators for Human–Robot Interaction in Indoor Environments
by Mangali Sravanthi, Sravan Kumar Gunturi, Mangali Chinna Chinnaiah, Siew-Kei Lam, G. Divya Vani, Mudasar Basha, Narambhatla Janardhan, Dodde Hari Krishna and Sanjay Dubey
Sensors 2024, 24(21), 6986; https://doi.org/10.3390/s24216986 - 30 Oct 2024
Cited by 2 | Viewed by 1933
Abstract
This study addresses the challenges of real-time human–robot interaction using adaptive field-programmable gate array (FPGA)-based accelerators. Predicting human posture in confined indoor areas is a significant challenge for service robots. The proposed approach works on two levels: estimating the human's location, and determining the robot's intention to serve based on that location at static and adaptive positions. This paper presents three methodologies to address these challenges. The first is binary classification of static and adaptive postures for human localization in indoor environments, using sensor fusion. The second is adaptive Simultaneous Localization and Mapping (SLAM) that allows the robot to deliver its task. The third is implicit human–robot communication. VLSI hardware schemes are developed for the proposed methods. First, the control unit processes real-time data from PIR sensors and multiple ultrasonic sensors to analyze the human posture. Next, the static and adaptive human posture data are communicated to the robot via Wi-Fi. Finally, the robot serves the human using an adaptive SLAM-based triangulation navigation method. Experimental validation was conducted in a hospital environment. The proposed algorithms were coded in Verilog HDL, then simulated and synthesized using Vivado 2017.3. A Xilinx ZedBoard FPGA development board was used for experimental validation.
(This article belongs to the Special Issue Deep Learning for Perception and Recognition: Method and Applications)
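The sensing pipeline above, in which ultrasonic ranging feeds a position estimate, can be illustrated with two textbook building blocks. The Python sketch below converts an ultrasonic echo time to a distance. It then recovers a 2-D position from ranges to three fixed, non-collinear reference points by linearising the circle equations. The following are assumptions for illustration: the speed-of-sound constant, the anchor layout, and the static-versus-adaptive rule (a PIR motion flag plus the spread of range readings, with a hypothetical tolerance). The paper's method runs as Verilog on an FPGA, and its details may differ.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumption: dry air at about 20 °C


def echo_to_distance(echo_time_s):
    """Ultrasonic ranging: the pulse travels out and back, so halve the path length."""
    return SPEED_OF_SOUND_M_S * echo_time_s / 2


def trilaterate(anchors, ranges):
    """2-D position from ranges to three known anchor points.

    Subtracting the first circle equation from the other two leaves a 2x2 linear
    system, solved here with Cramer's rule.
    """
    (x1, y1), (x2, y2), (x3, y3) = anchors
    r1, r2, r3 = ranges
    a11, a12, a21, a22 = 2 * (x2 - x1), 2 * (y2 - y1), 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = r1**2 - r2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = r1**2 - r3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("anchors must not be collinear")
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det


def classify_posture(pir_motion, ranges_over_time, tol_m=0.05):
    """Hypothetical rule: 'static' if the PIR sees no motion and ranges barely change."""
    spread = max(ranges_over_time) - min(ranges_over_time)
    return "static" if not pir_motion and spread <= tol_m else "adaptive"


anchors = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
person = (1.0, 2.0)
ranges = [math.dist(a, person) for a in anchors]
x, y = trilaterate(anchors, ranges)
print(round(x, 6), round(y, 6))                     # 1.0 2.0
print(classify_posture(False, [1.50, 1.52, 1.51]))  # static
```

Solving the linearised system avoids iterative optimisation, which suits fixed-point hardware. With noisy ranges, a least-squares fit over more than three sensors would be the usual extension.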

15 pages, 2372 KB  
Article
PDeT: A Progressive Deformable Transformer for Photovoltaic Panel Defect Segmentation
by Peng Zhou, Hong Fang and Gaochang Wu
Sensors 2024, 24(21), 6908; https://doi.org/10.3390/s24216908 - 28 Oct 2024
Cited by 6 | Viewed by 1572
Abstract
Defects in photovoltaic (PV) panels can significantly reduce the system's power generation efficiency and may cause localized overheating due to uneven current distribution. Therefore, precise pixel-level defect detection, i.e., defect segmentation, is essential to ensuring stable operation. However, effective defect segmentation places two demands on the network. The feature extractor must adaptively determine the appropriate scale or receptive field to localize defects accurately. The decoder must seamlessly fuse coarse-level semantics with fine-grained features to enhance high-level representations. In this paper, we propose a Progressive Deformable Transformer (PDeT) for defect segmentation in PV cells. The approach learns spatial sampling offsets and progressively refines features through coarse-level semantic attention. Specifically, the network adaptively predicts spatial offset positions and computes self-attention at them, expanding the model’s receptive field and enabling feature extraction for objects of various shapes. Furthermore, we introduce a semantic aggregation module that refines semantic information by converting the fused feature map into a scale space and balancing contextual information. Extensive experiments demonstrate the effectiveness of our method. It achieves an mIoU of 88.41% on our solar cell dataset, outperforming other methods. Additionally, to validate the PDeT’s applicability across domains, we trained and tested it on the MVTec-AD dataset. The results show that the PDeT also achieves excellent recognition performance in various other scenarios.
(This article belongs to the Special Issue Deep Learning for Perception and Recognition: Method and Applications)
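Two ingredients mentioned above can be sketched compactly: sampling a feature map at learned fractional offsets, which is the core of deformable attention, and the mIoU metric used to report results. The pure-Python sketch below follows the single-query, single-head formulation popularised by Deformable DETR. Offsets and attention logits are passed in directly rather than predicted by a network. It is not the PDeT module, and all names are illustrative. The mIoU function accumulates a confusion matrix over flattened per-pixel labels and skips classes absent from both the ground truth and the prediction.

```python
import math


def bilinear(feat, x, y):
    """Sample a single-channel map (list of rows) at fractional (x, y); zero outside."""
    h, w = len(feat), len(feat[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= xi < w and 0 <= yi < h:
                value += (1 - abs(x - xi)) * (1 - abs(y - yi)) * feat[yi][xi]
    return value


def deformable_sample(feat, ref, offsets, logits):
    """Softmax-weighted sum of features sampled at ref + each offset."""
    m = max(logits)
    weights = [math.exp(z - m) for z in logits]
    total = sum(weights)
    return sum(wt / total * bilinear(feat, ref[0] + dx, ref[1] + dy)
               for wt, (dx, dy) in zip(weights, offsets))


def mean_iou(y_true, y_pred, num_classes):
    """mIoU over flattened per-pixel class labels."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    ious = []
    for c in range(num_classes):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(num_classes)) - tp
        fn = sum(cm[c]) - tp
        if tp + fp + fn:  # skip classes absent from both ground truth and prediction
            ious.append(tp / (tp + fp + fn))
    return sum(ious) / len(ious)


feat = [[0.0, 1.0], [2.0, 3.0]]
print(bilinear(feat, 0.5, 0.5))                                       # 1.5
print(deformable_sample(feat, (0, 0), [(0, 0), (1, 1)], [0.0, 0.0]))  # 1.5
print(mean_iou([0, 0, 1, 1, 1, 0], [0, 0, 1, 1, 0, 0], 2))
```

Because bilinear interpolation is differentiable in x and y, a network can learn the offsets by gradient descent. This is what lets the receptive field adapt to irregular defect shapes.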