AI-Based Computer Vision Sensors & Systems—2nd Edition

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Sensing and Imaging".

Deadline for manuscript submissions: 30 June 2026 | Viewed by 14597

Special Issue Editors


Prof. Dr. Xuefeng Liang
Guest Editor
School of Artificial Intelligence, Xidian University, Xi'an, China
Interests: visual cognitive computing; computer vision; visual big data mining; intelligent algorithms

Dr. Guangyu Chen
Guest Editor Assistant
Research Institute of Electrical Communication, Tohoku University, Sendai, Miyagi, Japan
Interests: spatial mechanisms of human visual attention; size tuning; cognitive science; LLM for psychology; explainable human–AI interaction systems

Special Issue Information

Dear Colleagues,

Artificial intelligence (AI) in computer vision sensors and systems is a specialized field that encompasses both current and historical AI advancements, as well as their potential impact and future prospects within sensor technology and its applications. This Special Issue explores the innovative landscape of AI-based computer vision sensors and systems, emphasizing their transformative potential across a variety of applications. These technologies harness advanced imaging techniques to facilitate real-time analysis and intelligent decision-making. We invite researchers to submit original articles investigating the use of RGB cameras, depth cameras (e.g., LiDAR), and thermal cameras in conjunction with image processing units (GPUs, TPUs, FPGAs) and object detection frameworks (e.g., YOLO, SSD, Faster R-CNN) in areas such as environmental monitoring, healthcare imaging, autonomous navigation, and security systems. This Issue aims to highlight innovative methodologies that enhance object detection, gesture recognition, and real-time analytics, ultimately advancing the capabilities of computer vision.

Prof. Dr. Xuefeng Liang
Guest Editor

Dr. Guangyu Chen
Guest Editor Assistant

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 250 words) can be sent to the Editorial Office for assessment.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • RGB cameras
  • depth cameras (LiDAR)
  • thermal cameras
  • image processing units (GPUs, TPUs, FPGAs)
  • YOLO (You Only Look Once)
  • gesture recognition systems
  • autonomous navigation systems
  • augmented reality (AR)
  • industrial automation
  • smart surveillance systems

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (13 papers)


Research

36 pages, 9783 KB  
Article
Spectral-YOLOv13: A Dual-Domain Vision-Mamba Sensing Framework for Fine-Grained Coral Health Assessment and Continuous Ecological Forecasting
by Litian Yang, Wenkun Chen, Zhuoyue Mo, Xin Gao, Minzhi Mo, Chunlei Xia and Liankuan Zhang
Sensors 2026, 26(10), 3265; https://doi.org/10.3390/s26103265 - 21 May 2026
Abstract
Coral reefs are among the most important and vulnerable marine ecosystems worldwide. AI-powered underwater visual monitoring has become essential for effective reef conservation, yet current methods still face severe limitations: spectral ambiguity caused by underwater turbidity, fine-grained confusion in early coral health assessment, and discrete forecasting models that cannot represent continuous ecological degradation dynamics. To address these issues, we propose Spectral-YOLOv13, a dual-domain vision-Mamba sensing framework for high-precision coral health evaluation and continuous ecological forecasting. The framework incorporates three novel components: a Wavelet-Integrated Omni-Neck (WIO-Neck) to perform multi-scale spectral filtering and suppress turbidity-induced noise; a Contrastive Prototype Head (CP-Head) to enhance discriminability between visually similar health states; and a Bio-Mamba Predictor based on state-space models to capture long-term continuous health trajectories. Extensive experiments on the CR-Mix++ dataset demonstrate that Spectral-YOLOv13 achieves 53.8% mAP with strong robustness in turbid underwater environments. It reduces four-week forecasting error by 26.8% and maintains real-time inference speed at 112 FPS. This work provides a reliable and high-performance vision framework for practical underwater coral reef monitoring and proactive conservation management.
(This article belongs to the Special Issue AI-Based Computer Vision Sensors & Systems—2nd Edition)

18 pages, 21058 KB  
Article
MSSA-Net: Multi-Modal Structural and Semantic-Adaptive Network for Low-Light Image Enhancement
by Tianxiang Chen, Xiaoyi Wang, Tongshun Zhang and Qiuzhan Zhou
Sensors 2026, 26(7), 2059; https://doi.org/10.3390/s26072059 - 25 Mar 2026
Viewed by 588
Abstract
Low-light image enhancement (LLIE) remains challenging due to severe degradation of high-frequency structures and semantic ambiguity under extreme darkness. Although existing methods achieve satisfactory brightness recovery, they often suffer from structural inconsistency and semantic drift, as diverse scenes are typically processed with uniform enhancement strategies or static text prompts. To address these issues, we propose a Multi-Modal Structural and Semantic-Adaptive Network (MSSA-Net) under a structure-anchored paradigm. First, we design a Multi-Scale Self-Refinement Block (MSRB) to enhance degraded visible representations through multi-scale feature extraction and progressive refinement. Meanwhile, a pseudo-infrared structural prior derived from the input image is introduced to provide noise-insensitive geometric cues. These cues are extracted via a Structure-Guided Cross-Attention (SGCA) module to produce structure-dominant features. The refined visible features and structural features are then adaptively integrated through an adaptive residual fusion (ARF) module to achieve balanced restoration. Furthermore, we develop a Large Multi-modal Model (LMM)-Driven Scene-Adaptive Attention mechanism that generates instance-aware scene tags from a coarse preview and injects semantic embeddings into visual features. Extensive experiments demonstrate that MSSA-Net improves structural fidelity, brightness recovery, and semantic naturalness across multiple benchmarks.

28 pages, 5658 KB  
Article
A Multimodule Collaborative Framework for Unsupervised Visible–Infrared Person Re-Identification with Channel Enhancement Modality
by Baoshan Sun, Yi Du and Liqing Gao
Sensors 2026, 26(6), 1770; https://doi.org/10.3390/s26061770 - 11 Mar 2026
Viewed by 517
Abstract
Unsupervised visible–infrared person re-identification (USL-VI-ReID) plays a pivotal role in cross-modal computer vision applications for intelligent surveillance and public safety. However, the task remains hampered by large modality gaps and limited granularity in feature representations. In particular, channel augmentation (CA) is typically used only for data augmentation, and its potential as an independent input modality remains unexplored. To address these shortcomings, we present a multimodule collaborative USL-VI-ReID framework that explicitly treats CA as a separate input modality. The framework combines four complementary modules. The Person-ReID Adaptive Convolutional Block Attention Module (PA-CBAM) extracts discriminative features using a two-level attention mechanism that refines salient spatial and channel cues. The Varied Regional Alignment (VRA) module performs cross-modal regional alignment and leverages Multimodal Assisted Adversarial Learning (MAAL) to reinforce region-level correspondence. The Varied Regional Neighbor Learning (VRNL) module implements reliable neighborhood learning via multi-region association to stabilize pseudo-labels and capture local structure. Finally, the Uniform Merging (UM) module merges split clusters through alternating contrastive learning to improve cluster consistency. We evaluate the proposed method on SYSU-MM01 and RegDB. On RegDB’s visible-to-infrared setting, the approach achieves Rank-1 = 93.34%, mean Average Precision (mAP) = 87.55%, and mean Inverse Negative Penalty (mINP) = 76.08%. These results indicate that our method effectively reduces modal discrepancies and increases feature discriminability. It outperforms most existing unsupervised baselines and several supervised approaches, thereby advancing the practical applicability of USL-VI-ReID.

22 pages, 6170 KB  
Article
A Lightweight Net with Dual-Path Feature Enhancer and Bidirectional Gated Fusion for Cloud Detection
by Yan Mo, Puhui Chen, Shaowei Bai and Erbao Xiao
Sensors 2026, 26(5), 1727; https://doi.org/10.3390/s26051727 - 9 Mar 2026
Viewed by 402
Abstract
Cloud detection serves as a critical preprocessing step in remote sensing image processing and quantitative applications. However, prevailing deep learning-based models often depend on computationally intensive backbone networks to achieve high accuracy, which hinders their deployment in resource-constrained scenarios such as on-board processing or edge computing. To balance accuracy and efficiency, this paper introduces a lightweight network for cloud detection. The core innovations of our network are twofold: (1) a dual-path feature enhancer that operates at the front end to extract and fuse multi-scale features through a parallel architecture, significantly enriching feature diversity and representational capacity, thereby alleviating the need for a complex backbone, and (2) a bidirectional gated fusion module, which adaptively integrates multi-scale features from the dual-path feature enhancer with deep semantic features from the backbone decoder through a gated attention mechanism and dynamic convolution, thereby enhancing feature discriminability. Comprehensive experiments on the public HRC_WHU dataset demonstrate that the proposed model achieves a high overall accuracy of 96.31% and a mean intersection-over-union of 92.82%, with only 12.04 GFLOPs of computational cost, outperforming several state-of-the-art methods. These results validate that our approach effectively balances high detection performance with computational efficiency, offering a practical solution for real-time, lightweight cloud detection in high-resolution remote sensing imagery.

23 pages, 4564 KB  
Article
Two-Stage Wildlife Event Classification for Edge Deployment
by Aditya S. Viswanathan, Adis Bock, Zoe Bent, Mark A. Peyton, Daniel M. Tartakovsky and Javier E. Santos
Sensors 2026, 26(4), 1366; https://doi.org/10.3390/s26041366 - 21 Feb 2026
Viewed by 857
Abstract
Camera-based wildlife monitoring is often overwhelmed by non-target triggers and slowed by manual review or cloud-dependent inference, which can prevent timely intervention for high-stakes human–wildlife conflicts. Our key contribution is a deployable, fully offline edge vision sensor that achieves near-real-time, highly accurate wildlife event classification by combining detector-based empty-image suppression with a lightweight classifier trained with a staged transfer-learning curriculum. Specifically, Stage 1 uses a pretrained You Only Look Once (YOLO)-family detector for permissive animal localization and empty-trigger suppression, and Stage 2 uses a lightweight EfficientNet-based binary classifier to confirm puma on detector crops and gate downstream actions. Our design is robust to low-quality nighttime monochrome imagery (motion blur, low contrast, illumination artifacts, and partial-body captures) and operates using commercially available components in connectivity-limited settings. In field deployments running since May 2025, end-to-end latency from camera trigger to action command is approximately 4 s. Ablation studies using a dataset of labeled wildlife images (pumas, not pumas) show that the two-stage approach substantially reduces false alarms in identifying pumas relative to a full-image classifier while maintaining high recall. On the held-out test set (N = 1434 events), the proposed two-stage cascade achieves precision 0.983, recall 0.975, F1 0.979, accuracy 0.986, and balanced accuracy 0.983, with only 8 false positives and 12 false negatives. The system can be easily adapted for other species, as demonstrated by rapid retraining of the second stage to classify ringtails. Downstream responses (e.g., notifications and optional audio/light outputs) provide flexible actuation capabilities that can be configured to support intervention.
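The reported metrics can be cross-checked from confusion-matrix counts. The sketch below is illustrative only: the 8 false positives, 12 false negatives, and N = 1434 come from the abstract, while the true-positive and true-negative counts (468 and 946) are not stated there and are back-solved assumptions chosen to reproduce the reported figures.

```python
def cascade_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # true-positive rate
    specificity = tn / (tn + fp)     # true-negative rate
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "balanced_accuracy": (recall + specificity) / 2,
    }


# fp, fn are from the abstract; tp and tn are ASSUMED (back-solved so that
# tp + tn + fp + fn = 1434 and the rounded metrics match the reported ones).
metrics = cascade_metrics(tp=468, tn=946, fp=8, fn=12)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed counts, all five rounded values agree with those quoted in the abstract, which suggests the reported metrics are mutually consistent.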

18 pages, 2041 KB  
Article
Wavelet-CNet: Wavelet Cross Fusion and Detail Enhancement Network for RGB-Thermal Semantic Segmentation
by Wentao Zhang, Qi Zhang and Yue Yan
Sensors 2026, 26(3), 1067; https://doi.org/10.3390/s26031067 - 6 Feb 2026
Viewed by 518
Abstract
Leveraging thermal infrared (TIR) imagery to complement RGB spatial information is a key technology in industrial sensing. This technology enables mobile devices to perform scene understanding through RGB-T semantic segmentation. However, existing networks conduct only limited information interaction between modalities and lack specific designs to exploit the thermal aggregation entropy of the thermal modality, resulting in inefficient feature complementarity within bilateral structures. To address these challenges, we propose Wavelet-CNet for RGB-T semantic segmentation. Specifically, we design a Wavelet Cross Fusion Module (WCFM) that applies wavelet transforms to separately extract four types of low- and high-frequency information from RGB and thermal features, which are then fed back into attention mechanisms for dual-modal feature reconstruction. Furthermore, a Cross-Scale Detail Enhancement Module (CSDEM) introduces cross-scale contextual information from the TIR branch into each fusion stage, aligning global localization through contour information from thermal features. Wavelet-CNet achieves competitive mIoU scores of 58.3% and 85.77% on MFNet and PST900, respectively, while ablation studies on MFNet further validate the effectiveness of the proposed WCFM and CSDEM modules.

27 pages, 20812 KB  
Article
A Lightweight Radar–Camera Fusion Deep Learning Model for Human Activity Recognition
by Minkyung Jeon and Sungmin Woo
Sensors 2026, 26(3), 894; https://doi.org/10.3390/s26030894 - 29 Jan 2026
Cited by 2 | Viewed by 898
Abstract
Human activity recognition in privacy-sensitive indoor environments requires sensing modalities that remain robust under illumination variation and background clutter while preserving user anonymity. To this end, this study proposes a lightweight radar–camera fusion deep learning model that integrates motion signatures from FMCW radar with coarse spatial cues from ultra-low-resolution camera frames. The radar stream is processed as a Range–Doppler–Time cube, where each frame is flattened and sequentially encoded using a Transformer-based temporal model to capture fine-grained micro-Doppler patterns. The visual stream employs a privacy-preserving 4×5-pixel camera input, from which a temporal sequence of difference frames is extracted and modeled with a dedicated camera Transformer encoder. The two modality-specific feature vectors, each representing the temporal dynamics of motion, are concatenated and passed through a lightweight fully connected classifier to predict human activity categories. A multimodal dataset of synchronized radar cubes and ultra-low-resolution camera sequences across 15 activity classes was constructed for evaluation. Experimental results show that the proposed fusion model achieves 98.74% classification accuracy, significantly outperforming single-modality baselines (single-radar and single-camera). Despite this accuracy, the entire model requires only 11 million floating-point operations (11 MFLOPs), making it highly efficient for deployment on embedded or edge devices.

28 pages, 3652 KB  
Article
A Ground-Based Visual System for UAV Detection and Altitude Measurement: Deployment and Evaluation of Ghost-YOLOv11n on Edge Devices
by Hongyu Wang, Yifeng Qu, Zheng Dang, Duosheng Wu, Mingzhu Cui, Hanqi Shi and Jintao Zhao
Sensors 2026, 26(1), 205; https://doi.org/10.3390/s26010205 - 28 Dec 2025
Viewed by 1108
Abstract
The growing threat of unauthorized drones to ground-based critical infrastructure necessitates efficient ground-to-air surveillance systems. This paper proposes a lightweight framework for UAV detection and altitude measurement from a fixed ground perspective. We introduce Ghost-YOLOv11n, an optimized detector that integrates GhostConv modules into YOLOv11n, reducing computational complexity by 12.7% while achieving 98.8% mAP@0.5 on a comprehensive dataset of 8795 images. Deployed on a LuBanCat4 edge device with Rockchip RK3588S NPU acceleration, the model achieves 20 FPS. For stable altitude estimation, we employ an Extended Kalman Filter to refine measurements from a monocular ranging method based on similar-triangle geometry. Experimental results under ground monitoring scenarios show height measurement errors remain within 10% up to 30 m. This work provides a cost-effective, edge-deployable solution specifically for ground-based anti-drone applications.
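The similar-triangle ranging idea can be sketched as follows. This is a simplified illustration, not the paper's method: it assumes a pinhole camera with a known focal length in pixels and a known physical drone width, and it smooths the result with a scalar linear Kalman filter as a stand-in for the Extended Kalman Filter mentioned in the abstract. All numeric values and names are hypothetical, and the conversion from camera-axis distance to altitude is not covered because the abstract does not describe the camera geometry.

```python
def range_from_similar_triangles(focal_px: float, real_width_m: float,
                                 bbox_width_px: float) -> float:
    """Pinhole similar-triangle distance along the optical axis: Z = f * W / w."""
    if bbox_width_px <= 0:
        raise ValueError("bounding-box width must be positive")
    return focal_px * real_width_m / bbox_width_px


class ScalarKalman:
    """1D constant-value Kalman filter (a linear stand-in, NOT an EKF)."""

    def __init__(self, x0: float, p0: float, q: float, r: float):
        self.x = x0   # state estimate (distance in metres)
        self.p = p0   # estimate variance
        self.q = q    # process-noise variance
        self.r = r    # measurement-noise variance

    def update(self, z: float) -> float:
        self.p += self.q                  # predict: uncertainty grows
        k = self.p / (self.p + self.r)    # Kalman gain
        self.x += k * (z - self.x)        # correct with measurement z
        self.p *= 1.0 - k
        return self.x


# Hypothetical inputs: 1000 px focal length, 0.35 m wide drone,
# and noisy detector box widths over five frames.
widths_px = [17.0, 18.0, 17.5, 16.5, 17.5]
raw = [range_from_similar_triangles(1000.0, 0.35, w) for w in widths_px]
kf = ScalarKalman(x0=raw[0], p0=1.0, q=0.01, r=0.5)
smoothed = [kf.update(z) for z in raw]
```

Because the estimate scales with 1/w, a one-pixel error in box width matters more as the drone gets farther away, which is one plausible reason a filter is used to stabilize the measurements.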

19 pages, 1656 KB  
Article
YOLOv11-GLIDE: An Improved YOLOv11n Student Behavior Detection Algorithm Based on Scale-Based Dynamic Loss and Channel Prior Convolutional Attention
by Haiyan Wang, Guiyuan Gao, Wei Zhang, Kejing Li, Na Che, Caihua Yan and Liu Wang
Sensors 2025, 25(22), 6972; https://doi.org/10.3390/s25226972 - 14 Nov 2025
Viewed by 1594
Abstract
Student classroom behavior recognition is a core research direction in intelligent education systems. Real-time analysis of students’ learning states and behavioral features through classroom monitoring provides quantitative support for teaching evaluation, classroom management, and personalized instruction, offering significant value for data-driven educational decision-making. To address the issues of low detection accuracy and severe occlusion in classroom behavior detection, this article proposes an improved YOLOv11n-based algorithm named YOLOv11-GLIDE. The model introduces a Channel Prior Convolutional Attention (CPCA) mechanism to integrate global and local feature information, enhancing feature extraction and detection performance. A scale-based dynamic loss (SD Loss) is designed to adaptively adjust the loss weights according to object scale, improving regression stability and detection accuracy. In addition, Space-to-Depth Convolution (SPD-Conv) replaces traditional down-sampling to reduce fine-grained feature loss and computational cost. Experimental results on the SCB-Dataset3 demonstrate that YOLOv11-GLIDE achieves an excellent balance between accuracy and lightweight design. Compared with the baseline YOLOv11n, mAP@0.5 and mAP@0.5:0.95 increase by 2.5% and 7.6%, while parameters and GFLOPs are reduced by 9.4% and 11.1%, respectively. The detection speed reaches 127.9 FPS, meeting the practical requirements of embedded classroom monitoring systems for accurate and efficient student behavior recognition.
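SPD-Conv is usually described in the literature as a space-to-depth rearrangement followed by a non-strided convolution. The sketch below shows only the rearrangement step on nested Python lists, to illustrate why no pixel values are discarded during down-sampling. The channel ordering and the function name are illustrative assumptions, and the convolution that would follow is omitted.

```python
def space_to_depth(fmap, scale=2):
    """Rearrange an H x W x C feature map into (H/scale) x (W/scale) x (C*scale^2).

    Every input value is kept: each scale x scale spatial block is moved
    into the channel dimension instead of being pooled or skipped.
    """
    height, width = len(fmap), len(fmap[0])
    if height % scale or width % scale:
        raise ValueError("height and width must be divisible by scale")
    out = []
    for i in range(0, height, scale):
        row = []
        for j in range(0, width, scale):
            cell = []
            for di in range(scale):        # walk the block row by row
                for dj in range(scale):
                    cell.extend(fmap[i + di][j + dj])
            row.append(cell)
        out.append(row)
    return out


# A 4 x 4 single-channel map whose values are their own flat indices.
fmap = [[[r * 4 + c] for c in range(4)] for r in range(4)]
out = space_to_depth(fmap)
print(out[0][0])  # the top-left 2 x 2 block, now stored as four channels
```

In contrast to a stride-2 convolution or pooling layer, the output here has a quarter of the spatial positions but four times the channels, so fine-grained detail remains available to the next layer.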

27 pages, 5331 KB  
Article
Real-Time Robust 2.5D Stereo Multi-Object Tracking with Lightweight Stereo Matching Algorithm
by Jinhyeong Lee, Junyoung Shin, Eunwoo Park and Daekeun Kim
Sensors 2025, 25(21), 6773; https://doi.org/10.3390/s25216773 - 5 Nov 2025
Cited by 2 | Viewed by 2602
Abstract
Multi-object tracking faces persistent challenges from occlusions and truncations in monocular vision systems. While stereo vision provides depth information, existing approaches require computationally expensive dense matching or 3D reconstruction. This paper presents a real-time 2.5D stereo multi-object tracking framework combining lightweight stereo matching with resilient tracker management. The stereo matching module employs Direct Linear Transform-based triangulation using only bounding box coordinates, eliminating costly feature extraction while maintaining robust correspondence through geometric constraints. A dual-tracker architecture maintains independent trackers in both views, enabling re-identification when objects become occluded in one view but remain visible in the other. Experimental validation on a refrigerator monitoring dataset demonstrates that the proposed framework, StereoSORT, achieves a multiple object tracking accuracy (MOTA) of 0.932 and an identification F1 score (IDF1) of 0.823, substantially outperforming monocular trackers, including OC-SORT (IDF1: 0.765) and ByteTrack (IDF1: 0.609). The system achieves a 50.1 mm median depth error, comparable to commercial sensors, while maintaining 70 FPS on standard hardware. These results validate that geometric constraints alone enable robust stereo tracking without appearance features, offering a practical solution for resource-constrained environments where computational efficiency and tracking reliability are equally critical.
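Direct Linear Transform (DLT) triangulation from two views can be sketched in a few lines. The version below is the inhomogeneous least-squares form, written in pure Python for self-containment. It assumes known 3x4 projection matrices and uses one image point per view; treating the bounding-box centre as that point is an assumption, since the abstract only says bounding box coordinates are used. The camera parameters and all names are hypothetical.

```python
def solve3(matrix, rhs):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    a = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def triangulate_dlt(p_left, p_right, uv_left, uv_right):
    """Inhomogeneous DLT: least-squares 3D point from two pixel observations."""
    rows = []
    for p, (u, v) in ((p_left, uv_left), (p_right, uv_right)):
        # Each view contributes u*P[2] - P[0] and v*P[2] - P[1].
        rows.append([u * p[2][j] - p[0][j] for j in range(4)])
        rows.append([v * p[2][j] - p[1][j] for j in range(4)])
    # Normal equations (A^T A) X = A^T b, with b = -(last column of each row).
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(-r[i] * r[3] for r in rows) for i in range(3)]
    return solve3(ata, atb)


# Hypothetical rectified rig: f = 800 px, principal point (320, 240),
# right camera shifted 0.1 m along +x.
P_LEFT = [[800.0, 0.0, 320.0, 0.0], [0.0, 800.0, 240.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
P_RIGHT = [[800.0, 0.0, 320.0, -80.0], [0.0, 800.0, 240.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

# Box centres of the same object in the left and right images.
xyz = triangulate_dlt(P_LEFT, P_RIGHT, (400.0, 280.0), (360.0, 280.0))
```

With these synthetic inputs the recovered point lies 2 m in front of the left camera. In practice the two box centres rarely correspond to exactly the same physical point, so some depth error is expected.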

18 pages, 5377 KB  
Article
M3ENet: A Multi-Modal Fusion Network for Efficient Micro-Expression Recognition
by Ke Zhao, Xuanyu Liu and Guangqian Yang
Sensors 2025, 25(20), 6276; https://doi.org/10.3390/s25206276 - 10 Oct 2025
Cited by 4 | Viewed by 1584
Abstract
Micro-expression recognition (MER) aims to detect brief and subtle facial movements that reveal suppressed emotions, discerning authentic emotional responses in scenarios such as visitor experience analysis in museum settings. However, it remains a highly challenging task due to the fleeting duration, low intensity, and limited availability of annotated data. Most existing approaches rely solely on either appearance or motion cues, thereby restricting their ability to capture expressive information fully. To overcome these limitations, we propose a lightweight multi-modal fusion network, termed M3ENet, which integrates both motion and appearance cues through early-stage feature fusion. Specifically, our model extracts horizontal, vertical, and strain-based optical flow between the onset and apex frames, alongside RGB images from the onset, apex, and offset frames. These inputs are processed by two modality-specific subnetworks, whose features are fused to exploit complementary information for robust classification. To improve generalization in low-data regimes, we employ targeted data augmentation and adopt focal loss to mitigate class imbalance. Extensive experiments on five benchmark datasets, including CASME I, CASME II, CAS(ME)², SAMM, and MMEW, demonstrate that M3ENet achieves state-of-the-art performance with high efficiency. Ablation studies and Grad-CAM visualizations further confirm the effectiveness and interpretability of the proposed architecture.
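Focal loss mitigates class imbalance by down-weighting examples the model already classifies confidently. A minimal per-sample sketch is given below. The values gamma = 2.0 and alpha = 1.0 are illustrative defaults, not settings taken from the paper, and the function name is hypothetical.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one sample: -alpha * (1 - p_t)^gamma * log(p_t).

    probs  : predicted class probabilities (already softmax-normalized)
    target : index of the true class
    gamma  : focusing parameter; gamma = 0 recovers plain cross-entropy
    """
    p_t = min(max(probs[target], 1e-12), 1.0)  # clamp to avoid log(0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss([0.9, 0.1], target=0)   # confident, correct prediction
hard = focal_loss([0.3, 0.7], target=0)   # misclassified sample
print(f"easy={easy:.4f} hard={hard:.4f}")
```

The easy sample contributes far less to the total loss than the hard one, so rare or difficult classes are not drowned out by abundant, well-classified examples.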

26 pages, 5861 KB  
Article
Robust Industrial Surface Defect Detection Using Statistical Feature Extraction and Capsule Network Architectures
by Azeddine Mjahad and Alfredo Rosado-Muñoz
Sensors 2025, 25(19), 6063; https://doi.org/10.3390/s25196063 - 2 Oct 2025
Cited by 3 | Viewed by 1496
Abstract
Automated quality control is critical in modern manufacturing, especially for metallic cast components, where fast and accurate surface defect detection is required. This study evaluates classical Machine Learning (ML) algorithms using extracted statistical parameters and deep learning (DL) architectures including ResNet50, Capsule Networks, and a 3D Convolutional Neural Network (CNN3D) using 3D image inputs. Using the Dataset Original, ML models with the selected parameters achieved high performance: RF reached 99.4 ± 0.2% precision and 99.4 ± 0.2% sensitivity, GB 96.0 ± 0.2% precision and 96.0 ± 0.2% sensitivity. ResNet50 trained with extracted parameters reached 98.0 ± 1.5% accuracy and 98.2 ± 1.7% F1-score. Capsule-based architectures achieved the best results, with ConvCapsuleLayer reaching 98.7 ± 0.2% accuracy and 100.0 ± 0.0% precision for the normal class, and 98.9 ± 0.2% F1-score for the affected class. CNN3D applied on 3D image inputs reached 88.61 ± 1.01% accuracy and 90.14 ± 0.95% F1-score. Using the Dataset Expanded with ML and PCA-selected features, Random Forest achieved 99.4 ± 0.2% precision and 99.4 ± 0.2% sensitivity, K-Nearest Neighbors 99.2 ± 0.0% precision and 99.2 ± 0.0% sensitivity, and SVM 99.2 ± 0.0% precision and 99.2 ± 0.0% sensitivity, demonstrating consistent high performance. All models were evaluated using repeated train-test splits to calculate averages of standard metrics (accuracy, precision, recall, F1-score), and processing times were measured, showing very low per-image execution times (as low as 3.69 × 10⁻⁴ s/image), supporting potential real-time industrial application. These results indicate that combining statistical descriptors with ML and DL architectures provides a robust and scalable solution for automated, non-destructive surface defect detection, with high accuracy and reliability across both the original and expanded datasets.

19 pages, 3920 KB  
Article
HCDFI-YOLOv8: A Transmission Line Ice Cover Detection Model Based on Improved YOLOv8 in Complex Environmental Contexts
by Lipeng Kang, Feng Xing, Tao Zhong and Caiyan Qin
Sensors 2025, 25(17), 5421; https://doi.org/10.3390/s25175421 - 2 Sep 2025
Cited by 3 | Viewed by 1319
Abstract
When unmanned aerial vehicles (UAVs) inspect transmission lines for ice cover, variable shooting angles and complex backgrounds often lead to poor ice-cover recognition accuracy and difficulty in accurately identifying the target. To address these issues, this study proposes an improved icing detection model, HCDFI-YOLOv8, based on You Only Look Once version 8 (YOLOv8). First, a cross-dense hybrid (CDH) parallel heterogeneous convolutional module is proposed, which not only improves the detection accuracy of the model but also effectively limits the surge in floating-point operations that such improvements usually cause. Second, an improved CSPDarknet53 to 2-Stage FPN_Dynamic Feature Fusion (C2f_DFF) module is proposed for weighted fusion of deep and shallow features, reducing feature loss in the neck network. Third, the detection head is optimized with the feature adaptive spatial feature fusion (FASFF) detection head module to enhance the model’s ability to extract features at different scales. Finally, a new inner-complete intersection over union (Inner_CIoU) loss function is introduced to resolve the contradiction in the CIoU loss function used in the original YOLOv8. Experimental results demonstrate that the proposed HCDFI-YOLOv8 model achieves a 2.7% improvement in mAP@0.5 and a 2.5% improvement in mAP@0.5:0.95 compared to standard YOLOv8. Among twelve models for icing detection, the proposed model delivers the highest overall detection accuracy. These results verify the accuracy of HCDFI-YOLOv8 in complex transmission line environments and provide effective technical support for transmission line ice cover detection.
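For reference, the baseline CIoU loss that the abstract says the original YOLOv8 uses can be sketched as below. This shows only the standard CIoU formulation (overlap, centre-distance, and aspect-ratio terms). The paper's Inner_CIoU variant is not reproduced here because the abstract does not give its formula. Boxes are assumed to be valid (x1, y1, x2, y2) tuples with positive width and height.

```python
import math


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def ciou_loss(pred, gt):
    """CIoU loss: 1 - IoU + rho^2 / c^2 + alpha * v."""
    overlap = iou(pred, gt)
    # Squared distance between box centres.
    rho2 = (((pred[0] + pred[2]) - (gt[0] + gt[2])) ** 2
            + ((pred[1] + pred[3]) - (gt[1] + gt[3])) ** 2) / 4.0
    # Squared diagonal of the smallest box enclosing both.
    enc_w = max(pred[2], gt[2]) - min(pred[0], gt[0])
    enc_h = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = enc_w ** 2 + enc_h ** 2
    # Aspect-ratio consistency term.
    v = (4.0 / math.pi ** 2) * (
        math.atan((gt[2] - gt[0]) / (gt[3] - gt[1]))
        - math.atan((pred[2] - pred[0]) / (pred[3] - pred[1]))
    ) ** 2
    alpha = v / ((1.0 - overlap) + v) if v > 0 else 0.0
    return 1.0 - overlap + rho2 / c2 + alpha * v


shifted = ciou_loss((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
print(f"CIoU loss for a diagonally shifted box: {shifted:.4f}")
```

The loss is zero only when the predicted and ground-truth boxes coincide, and the centre-distance term still provides a gradient signal when the boxes do not overlap at all.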
