Search Results (605)

Search Parameters:
Keywords = RGB-D dataset

17 pages, 673 KB  
Article
LST-AGCN: A Novel Unified Lightweight Attention Framework for Efficient Skeleton-Based Action Recognition
by Khadija Lasri, Khalid El Fazazy, Adnane Mohamed Mahraz, Hamid Tairi and Jamal Riffi
Big Data Cogn. Comput. 2026, 10(4), 125; https://doi.org/10.3390/bdcc10040125 - 18 Apr 2026
Abstract
While Graph Convolutional Networks (GCNs) have revolutionized skeleton-based action recognition, existing methods face a critical efficiency–accuracy dilemma: state-of-the-art approaches achieve high performance through computationally expensive multi-stream fusion (joint, bone, joint motion, and bone motion) and deep architectures, limiting real-world deployment on resource-constrained devices. We propose LST-AGCN (Lightweight Spatial–Temporal Attention Graph Convolutional Network), introducing three technical contributions that address this challenge: (1) Unified Attention Module (UAM)—a framework that integrates channel, spatial, and temporal attention through a single compact operation, significantly reducing attention parameters compared to separate attention mechanisms; (2) Depthwise Separable Attention Mechanism (DSAM)—a factorization using depthwise separable convolutions that achieves linear complexity reduction from O(C2) to O(C) in attention operations; and (3) Efficient Topology-Aware Fusion (ETAF)—an adaptive Joint-wise Attention strategy that captures fine-grained spatial relationships without quadratic complexity growth. Extensive experiments on NTU RGB+D 60 and NTU RGB+D 120 datasets demonstrate that LST-AGCN achieves strong performance using only joint modality (86.14%/94.0% and 79.5%/82.0% Top-1 accuracy with 99.0% Top-5 on cross-view) while requiring 14.11 M parameters and 19.02 GFLOPs, delivering efficient inference suitable for edge deployment. Full article
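The O(C2) to O(C) reduction claimed for DSAM can be illustrated by a simple parameter count: a dense channel-mixing projection needs C × C weights, whereas a depthwise factorization uses one small filter per channel, i.e. k · C weights. A minimal sketch (function names and the filter size k are illustrative, not taken from the paper):

```python
def dense_attention_params(C):
    # Full channel-mixing projection: every channel attends to every other channel.
    return C * C  # O(C^2)

def depthwise_attention_params(C, k=3):
    # Depthwise factorization: one length-k filter per channel, no cross-channel mixing.
    return k * C  # O(C)

C = 256
print(dense_attention_params(C), depthwise_attention_params(C))  # 65536 768
```

At C = 256 the factorized form needs roughly 85× fewer weights, which is the kind of saving that makes the attention affordable on edge devices.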
36 pages, 23663 KB  
Article
Neuro-Prismatic Video Models for Causality-Aware Action Recognition in Neural Rehabilitation Systems
by Hend Alshaya
Mathematics 2026, 14(8), 1341; https://doi.org/10.3390/math14081341 - 16 Apr 2026
Abstract
Video-based action recognition for neural rehabilitation—spanning stroke recovery, Parkinsonian gait assessment, and cerebral palsy monitoring—faces critical challenges, including temporal ambiguity, non-causal motion correlations, and the absence of causally grounded dynamics modeling. While transformer-based architectures achieve strong performance, they often exploit spurious temporal and environmental cues, limiting reliability in safety-critical clinical settings. We propose NeuroPrisma, a neuro-prismatic video framework that integrates frequency-domain spectral decomposition with causal intervention under Structural Causal Models (SCMs) via the backdoor criterion. NeuroPrisma introduces (i) a Prismatic Spectral Attention (PSA) module, which applies discrete Fourier transforms to decompose temporal features into multi-scale frequency bands, disentangling slow postural dynamics from rapid corrective movements, and (ii) a Causal Intervention Layer (CIL), which performs do-calculus-based backdoor adjustment to remove confounding influences and produce causally invariant representations. PSA preconditions representations prior to intervention, improving confounder estimation and causal robustness. Extensive evaluation against seven state-of-the-art models (I3D, SlowFast, TimeSformer, ViViT, Video Swin Transformer, UniFormerV2, and VideoMAE) demonstrates that NeuroPrisma achieves 98.7% Top-1 accuracy on UCF101, 82.4% on HMDB51, 71.2% on Something-Something V2, and 91.5%/95.8% on NTU RGB+D (Cross-Subject/Cross-View), consistently outperforming prior methods. It further reduces the Causal Confusion Score (CCS) by 42.3%, indicating substantially lower reliance on spurious correlations, while maintaining real-time performance with 23.4 ms latency per 16-frame clip on an NVIDIA A100 GPU. All improvements are statistically significant (p < 0.001, Cohen’s d = 0.72–1.24). 
Evaluation was conducted exclusively on benchmark datasets (UCF101, HMDB51, Something-Something V2, and NTU RGB+D) under controlled conditions, without direct clinical validation on neurological patient cohorts. Overfitting was mitigated using three random seeds (42, 123, 456), RandAugment, Mixup (α = 0.8), weight decay (0.05), and early stopping. Cross-dataset generalization from UCF101 to HMDB51 without fine-tuning achieved 76.2% Top-1 accuracy. Future work will focus on prospective clinical validation across stroke, Parkinson’s disease, and cerebral palsy populations, including correlation with standardized clinical assessment scales such as Fugl–Meyer, UPDRS, and GMFCS. These results establish NeuroPrisma as a causally grounded and computationally efficient framework for reliable, real-time movement assessment in clinical rehabilitation systems. Full article
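The spectral decomposition idea behind PSA can be sketched with a plain discrete Fourier transform: masking rfft bins splits a temporal signal into complementary low- and high-frequency components, loosely mirroring the separation of slow postural dynamics from rapid corrective movements. The single cut-off below is illustrative; the paper's module uses multi-scale bands with attention:

```python
import numpy as np

def spectral_bands(x, cut):
    # Mask DFT bins above/below `cut` to split x into two complementary bands.
    X = np.fft.rfft(x)
    low, high = X.copy(), X.copy()
    low[cut:] = 0.0    # keep slow components only
    high[:cut] = 0.0   # keep fast components only
    return np.fft.irfft(low, n=len(x)), np.fft.irfft(high, n=len(x))

t = np.linspace(0.0, 1.0, 64, endpoint=False)
x = np.sin(2 * np.pi * 2 * t) + 0.3 * np.sin(2 * np.pi * 20 * t)  # slow + fast motion
lo, hi = spectral_bands(x, cut=8)
print(np.allclose(lo + hi, x))  # True: the two bands partition the signal exactly
```

Because the masks are complementary, the bands sum back to the original signal, so no temporal information is lost before the causal intervention stage.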

28 pages, 5786 KB  
Article
Multi-Wavelet Fusion Transformer with Token-to-Spectrum Traceback for Physically Interpretable Bearing Fault Diagnosis
by Hongzhi Fan, Chao Zhang, Mingyu Sun, Kexi Xu, Wenyang Zhang and Ximing Zhang
Vibration 2026, 9(2), 28; https://doi.org/10.3390/vibration9020028 - 15 Apr 2026
Abstract
Rolling bearing fault diagnosis under complex and noisy operating conditions requires not only high diagnostic accuracy but also interpretability that can be quantitatively verified against physically meaningful excitation structures. However, many existing deep learning approaches rely on a single time–frequency (TF) representation and provide limited, non-verifiable links between model decisions and the original vibration patterns. To address this issue, we propose MBT-XAI, a multi-wavelet TF fusion network with a Token-to-Spectrum Traceback (TST) mechanism for structure-preserving, physics-consistent interpretability. Three complementary wavelets, namely Morlet, Mexican Hat, and Complex Morlet, are used to construct multi-view TF representations, which are encoded into RGB channels and adaptively fused via cross-channel attention within a Transformer backbone. TST maps patch-token attributions back to the TF domain, enabling quantitative evaluation of physics consistency through overlap-based metrics. Experiments on the public CWRU dataset and an industrial IMUST dataset show that MBT-XAI achieves 98.13 ± 0.24% and 96.23 ± 0.31% accuracy at SNR = 0 dB, outperforming the strongest baseline by 2.83% and 2.43%, respectively. Under AWGN contamination, MBT-XAI maintains 95.44 ± 0.38%/93.45 ± 0.47% accuracy on CWRU and 95.80 ± 0.33%/92.91 ± 0.51% accuracy on IMUST at SNR = −2/−4 dB. Under colored-noise contamination, the proposed method also preserves robust performance under pink and brown noise at the same SNR levels. Quantitative interpretability evaluation further indicates high alignment between salient frequency regions and theoretical fault-characteristic bands, with IoU = 80.21 ± 0.86% and Coverage = 91.70 ± 0.63%. In addition, MBT-XAI requires 10.393 M parameters and 10.678 GFLOPs, with an inference latency of 14.7 ms per sample (batch size = 1) on an NVIDIA GeForce RTX 3060 GPU. 
These results suggest that multi-wavelet TF modeling with attention-based fusion and TF-level traceback provides an accurate, robust, and physics-consistent framework for intelligent bearing fault diagnosis. Full article
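The overlap-based interpretability metrics reported (IoU and Coverage between salient frequency regions and theoretical fault-characteristic bands) have straightforward definitions on binary masks. A sketch under the usual conventions; the paper's exact band construction is not reproduced here:

```python
import numpy as np

def iou(salient, theoretical):
    # Intersection over union of the two frequency-bin masks.
    inter = np.logical_and(salient, theoretical).sum()
    union = np.logical_or(salient, theoretical).sum()
    return inter / union

def coverage(salient, theoretical):
    # Fraction of the theoretical fault band highlighted by the attribution.
    return np.logical_and(salient, theoretical).sum() / theoretical.sum()

salient = np.zeros(100, bool); salient[20:60] = True          # attributed TF bins
theoretical = np.zeros(100, bool); theoretical[30:70] = True  # fault-characteristic band
print(round(iou(salient, theoretical), 2), round(coverage(salient, theoretical), 2))  # 0.6 0.75
```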

24 pages, 11059 KB  
Article
Large-Scale Modeling of Urban Rooftop Solar Energy Potential Using UAS-Based Digital Photogrammetry and GIS Spatial Analysis: A Case Study of Sofia City, Bulgaria
by Stelian Dimitrov, Martin Iliev, Bilyana Borisova, Stefan Petrov, Ivo Ihtimanski, Leonid Todorov, Ivan Ivanov, Stoyan Valchev and Kristian Georgiev
Urban Sci. 2026, 10(4), 210; https://doi.org/10.3390/urbansci10040210 - 14 Apr 2026
Abstract
Urban rooftop photovoltaic systems represent a substantial yet still underutilized renewable energy resource, particularly in high-density residential environments. Accurate large-scale assessment of rooftop solar potential, however, remains challenging due to the complex geometry of urban morphology and the limited availability of high-resolution geospatial data. This study presents a large-scale methodological framework for estimating the theoretical photovoltaic potential of urban rooftop spaces using Unmanned Aerial System (UAS)-based digital photogrammetry and GIS-based spatial analysis. The approach integrates centimeter-resolution Digital Surface Models (DSMs) and orthophotos derived from fixed-wing UAS surveys with detailed rooftop vectorization and solar radiation modeling implemented in a GIS environment. The methodology accounts for rooftop geometry, surface orientation, slope, shading effects, and rooftop-mounted obstacles. It begins with the collection of high-resolution RGB imagery suitable for detailed three-dimensional reconstruction: the images are captured with a UAS equipped with a S.O.D.A. 3D photogrammetric camera, and a dense, georeferenced three-dimensional point cloud is created from the UAS imagery. Based on the point cloud, a high-resolution Digital Surface Model (DSM) was produced. Rooftop boundaries and rooftop-mounted structures were digitized on the basis of an orthophoto created from the UAS imagery. The analysis workflow consists of solar radiation modeling using ArcGIS Pro, followed by filtering of low-radiation rooftops, steep slopes, and north-oriented rooftops. Finally, the potential electricity production is calculated. The framework was applied to high-density residential districts in Sofia, Bulgaria, dominated by prefabricated panel buildings with predominantly flat rooftops. 
Drone applications in such studies are typically restricted to modeling individual roofs, which severely limits their scalability for district-wide evaluations. To overcome this, the study employs a specialized fixed-wing UAS uniquely certified for legal operations over densely populated urban environments. This platform rapidly maps large territories, ensuring consistent lighting and shading conditions that significantly enhance the accuracy of subsequent rooftop digitization. Furthermore, the resulting centimeter-level precision enables the exact vectorization of micro-rooftop obstacles. Capturing these intricate details is a critical innovation that effectively prevents the overestimation of solar energy potential commonly observed in conventional large-scale models. Solar radiation was modeled at the pixel level for a full annual cycle and filtered using photovoltaic suitability criteria, including minimum annual radiation thresholds, slope, and aspect constraints. Theoretical electricity production was subsequently estimated using zonal statistics and system performance parameters representative of contemporary photovoltaic installations. The results indicate a total theoretical annual electricity potential of approximately 76.7 GWh for the analyzed rooftop spaces, with an average production of about 34 MWh per rooftop and pronounced spatial variability driven by rooftop geometry and exposure conditions. The findings demonstrate the significant renewable energy potential embedded in existing urban rooftop infrastructure and highlight the applicability of UAS-based photogrammetry for high-resolution, large-area solar potential assessments. The proposed framework provides actionable information for urban energy planning, municipal solar cadaster development, and the strategic integration of photovoltaic systems into dense urban environments, particularly in regions lacking open-access high-resolution geospatial datasets. Full article
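The final step, converting filtered rooftop radiation into theoretical electricity production, is essentially insolation × usable area × module efficiency × performance ratio. A sketch with generic placeholder parameter values, not the study's calibrated ones:

```python
def annual_yield_kwh(radiation_kwh_m2, area_m2, efficiency=0.20, performance_ratio=0.80):
    # Theoretical production = insolation x usable area x module efficiency x system PR.
    # Efficiency and PR here are generic placeholder values, not the study's parameters.
    return radiation_kwh_m2 * area_m2 * efficiency * performance_ratio

# A flat roof receiving 1300 kWh/m2/year with 400 m2 of usable area:
print(round(annual_yield_kwh(1300, 400), 1))  # 83200.0 kWh, i.e. about 83 MWh per year
```

In a GIS workflow the radiation term comes from zonal statistics over the per-pixel annual radiation raster, restricted to pixels passing the slope, aspect, and minimum-radiation filters.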
(This article belongs to the Special Issue Remote Sensing & GIS Applications in Urban Science)

32 pages, 21661 KB  
Article
Robust Human-to-Robot Handover System Under Adverse Lighting
by Yifei Wang, Baoguo Xu, Huijun Li and Aiguo Song
Biomimetics 2026, 11(4), 231; https://doi.org/10.3390/biomimetics11040231 - 1 Apr 2026
Abstract
Human-to-robot (H2R) handovers are critical in human–robot interaction but are challenged by complex environments that impact robot perception. Traditional RGB-based perception methods exhibit severe performance degradation under harsh lighting (e.g., glare and darkness). Furthermore, H2R handovers occur in unstructured environments populated with fine-grained visual details, such as multi-angle hand configurations and novel object geometries, where conventional semantic segmentation and grasp generation approaches struggle to generalize. To overcome lighting disturbances, we present an H2R handover system with a dual-path perception pipeline. The system fuses perception data from a stereo RGB-D camera (eye-in-hand) and a time-of-flight (ToF) camera (fixed scene) under normal lighting, and switches to the ToF camera for reliable perception under glare and darkness. In parallel, to address the complex spatial and geometric features, we augment the Point Transformer v3 (PTv3) architecture by integrating a T-Net module and a self-attention mechanism to fuse the relative positional angle features between human and robot, enabling efficient real-time 3D semantic segmentation of both the object and the human hand. For grasp generation, we extend GraspNet with a grasp selection module optimized for H2R scenarios. 
We validate our approach through extensive experiments: (1) a semantic segmentation dataset with 7500 annotated point clouds covering 15 objects and 5 relative angles and tested on 750 point clouds from 15 unseen objects, where our method achieves 84.4% mIoU, outperforming Swin3D-L by 3.26 percentage points with 3.2× faster inference; (2) 250 real-world handover trials comparing our method with the baseline across 5 objects, 5 hand postures, and 5 angles, showing an improvement of 18.4 percentage points in success rate; (3) 450 trials under controlled adverse lighting (darkness and glare), where our dual-path perception method achieves 82.7% overall success, surpassing single-camera baselines by up to 39.4 percentage points; and (4) a comparative experiment against a state-of-the-art multimodal H2R handover method under identical adverse lighting, where our system achieves 75.0% success (15/20) versus the baseline’s 15.0% (3/20), further confirming the lighting robustness of our design. These results demonstrate the system’s robustness and generalization in challenging H2R handover scenarios. Full article
(This article belongs to the Special Issue Human-Inspired Grasp Control in Robotics 2025)

34 pages, 24153 KB  
Article
Forest Vegetation 3D Localization Using Deep Learning Object Detectors
by Paulo A. S. Mendes, António P. Coimbra and Aníbal T. de Almeida
Appl. Sci. 2026, 16(7), 3375; https://doi.org/10.3390/app16073375 - 31 Mar 2026
Abstract
Forest fires are becoming increasingly prevalent and destructive in many regions of the world, posing significant threats to biodiversity, ecosystems, human settlements, climate, and the economy. The United States of America (USA), Australia, Canada, Greece and Portugal are five regions that have experienced enormous forest fires. One way to reduce the size and range of forest fires is to decrease the amount of flammable material in forests. This can be achieved using autonomous Unmanned Ground Vehicles (UGVs) specialized in vegetation cutting and equipped with Artificial Intelligence (AI) algorithms to identify and differentiate between vegetation that should be preserved and material that should be removed as potential fire fuel. In this paper, an innovative study of forest vegetation detection, classification and 3D localization using ground vehicles' RGB and depth images is presented to support autonomous forest cleaning operations to prevent fires. The presented work, a continuation of previous research, presents a method for 3D object localization in the real world using Deep Learning Object Detection (DLOD) combined with an RGB-D camera. It presents and compares results of eight recent high-performance DLOD architectures, YOLOv5, YOLOv7, YOLOv8, YOLO-NAS, YOLOv9, YOLOv10, YOLO11 and YOLOv12, to detect and classify forest vegetation in five classes: "Grass", "Live vegetation", "Cut vegetation", "Dead vegetation", and "Tree-trunk". For the training of the DLOD models, our custom dataset acquired in dense forests in Portugal is used. A methodology is presented that combines the best DLOD model trained for vegetation detection and classification with an RGB-D camera for the 3D localization of the classified detected objects in the real world. The presented methods are employed in an Unmanned Ground Vehicle (UGV) to localize forest vegetation that needs to be thinned for fire prevention purposes. 
A key challenge for autonomous forest vegetation cleaning is the reliable discrimination of the objects that must be identified to achieve fire prevention using autonomous unmanned ground vehicles in dense forests. With the obtained results, forest vegetation is precisely detected, classified and localized using the DL models and the localization method presented. The fastest DLOD architecture to train is YOLOv5, and the fastest to infer are YOLOv7 and YOLOv12. The innovation presented is the real-time detection, classification, and 3D localization of vegetation using DLOD architectures, with localization errors of the real-world object in width, height and depth under 21.4%, 20.7% and 11%, respectively, using only a depth camera and a processing unit. The 3D localized objects are represented as parallelepiped geometric shapes. The methodology for vegetation detection, classification and localization presented in this paper is highly suitable for future autonomous forest vegetation cleaning using specialized unmanned ground vehicles. Full article
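Localizing a detected object in 3D from an RGB-D camera follows the standard pinhole back-projection of a pixel plus its depth reading. A minimal sketch; the intrinsics shown are made-up example values, not the camera's calibration:

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    # Pinhole camera model: pixel (u, v) plus a depth reading (metres) gives the
    # 3-D point in the camera frame. fx, fy are focal lengths in pixels and
    # (cx, cy) is the principal point.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return x, y, depth

# Centre pixel of a detection box, 2.5 m away, with example intrinsics:
print(backproject(320, 240, 2.5, fx=600.0, fy=600.0, cx=320.0, cy=240.0))  # (0.0, 0.0, 2.5)
```

Applying this to the corners of a detection box at the near and far depths inside it yields the parallelepiped that bounds the object in the real world.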

36 pages, 6199 KB  
Systematic Review
Intelligent and Automated Technologies for Textile Recycling Pre-Processing: A Systematic Literature Review
by Daniel Lopes, Eduardo J. Solteiro Pires, Vítor Filipe, Manuel F. Silva and Luís F. Rocha
Technologies 2026, 14(4), 200; https://doi.org/10.3390/technologies14040200 - 27 Mar 2026
Abstract
Textile-to-textile recycling is strongly constrained by upstream pre-processing, where post-consumer clothing must be identified, separated, and prepared under high variability in materials, appearance, and contamination. This paper presents a Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA)-guided systematic literature review of intelligent and automated technologies for textile recycling pre-processing covering the period from 2015 to 2025. After screening and quality assessment, 21 primary studies published between 2020 and 2025 were included. The literature is synthesized across three task families: (i) identification of fiber/material, composition, or color; (ii) sorting, considered only when explicit separation strategies are defined to operationalize identification outcomes into routing actions or output streams; and (iii) contaminant detection and/or removal, targeting non-recyclable items. Results show that identification dominates the field (19/21 studies), supported by Red–Green–Blue (RGB) and red–green–blue plus depth (RGB-D) imaging and material-signature sensing, including near-infrared (NIR) spectroscopy, hyperspectral imaging (HSI), and Raman spectroscopy. In contrast, sorting as a defined separation stage is less frequent (4/21), and contaminant-related automation remains sparse (3/21). Most studies are validated in laboratory conditions, with limited semi-industrial evidence, highlighting a persistent perception-to-action gap. Overall, the review indicates that robust separation strategies, representative datasets, and end-to-end system integration remain key bottlenecks for scalable automated textile recycling pre-processing. Full article

15 pages, 2219 KB  
Article
One Patch Is All You Need: Joint Surface Material Reconstruction and Classification from Minimal Visual Cues
by Sindhuja Penchala, Gavin Money, Gabriel Marques, Samuel Wood, Jessica Kirschman, Travis Atkison, Shahram Rahimi and Noorbakhsh Amiri Golilarz
Sensors 2026, 26(7), 2083; https://doi.org/10.3390/s26072083 - 27 Mar 2026
Abstract
Understanding material surfaces from sparse visual cues is critical for applications in robotics, simulation and material perception. However, most existing methods rely on dense or full-scene observations, limiting their effectiveness in constrained or partial-view environments. This gap highlights the need for models capable of inferring surface properties from extremely limited visual information. To address this challenge, we introduce SMARC, a unified model for Surface MAterial Reconstruction and Classification from minimal visual input. Given only a single contiguous patch covering 10% of the image, SMARC reconstructs the full RGB surface while simultaneously classifying the material category. Our architecture combines a Partial Convolutional U-Net with a classification head, enabling both spatial inpainting and semantic understanding under extreme observation sparsity. We compared SMARC against five models including convolutional autoencoders, Vision Transformer (ViT), Masked Autoencoder (MAE), Swin Transformer and DETR using the Touch and Go dataset of real-world surface textures. SMARC achieves the highest performance among the evaluated methods with a PSNR of 17.55 dB and a surface classification accuracy of 85.10%. These results validate the effectiveness of SMARC for surface material understanding and highlight its potential for deployment in robotic perception tasks where visual access is inherently limited. Full article
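The "10% contiguous patch" input condition and the PSNR reconstruction metric can both be sketched in a few lines. Patch placement and image size are illustrative; this is not the SMARC pipeline itself:

```python
import numpy as np

def keep_patch(img, top, left, frac=0.10):
    # Zero everything except one contiguous square patch covering ~frac of the area.
    side = int(np.sqrt(frac * img.shape[0] * img.shape[1]))
    out = np.zeros_like(img)
    out[top:top + side, left:left + side] = img[top:top + side, left:left + side]
    return out

def psnr(ref, est, peak=1.0):
    # Peak signal-to-noise ratio in dB, the reconstruction metric reported above.
    mse = np.mean((ref - est) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
img = rng.random((64, 64)) * 0.9 + 0.05  # synthetic strictly-positive texture
masked = keep_patch(img, 10, 10)
print(round((masked > 0).mean(), 4))     # 0.0977, i.e. roughly 10% of pixels survive
```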
(This article belongs to the Special Issue Advanced Sensors and AI Integration for Human–Robot Teaming)

25 pages, 3612 KB  
Article
Learning Modality Complementarity for RGB-D Salient Object Detection via Dynamic Neural Network
by Yuanhao Li, Jia Song, Chenglizhao Chen and Xinyu Liu
Electronics 2026, 15(7), 1361; https://doi.org/10.3390/electronics15071361 - 25 Mar 2026
Abstract
RGB-D salient object detection (RGB-D SOD) aims to accurately localize and segment visually salient objects by jointly leveraging RGB images and depth maps. Some existing methods rely on static fusion strategies with fixed paths and weights, which treat all regions equally and fail to capture the varying importance of different regions and modalities. Although some attention-based methods alleviate the limitations of static fusion by assigning adaptive weights to different regions and modalities, the quality of RGB and depth data may degrade in real-world scenarios due to sensor noise, illumination changes, or environmental interference. These attention-based methods often overlook inter-modality quality differences and complementarity, making them prone to over-relying on a certain modality, which can lead to noise introduction, feature conflicts, and performance degradation. To address these limitations, this paper proposes a novel dynamic feature routing and fusion framework for RGB-D SOD, which adaptively adjusts the fusion strategy according to the quality of input modalities. To enable modality quality awareness, the proposed method characterizes the modality complementarity between RGB and depth features in a task-driven manner inspired by information-theoretic principles. We introduce a task-relevance scoring function which is integrated with a mutual information estimator to quantify such complementarity, and emphasizes task-relevant features while suppressing redundancy. A dynamic routing module is then designed to perform feature selection guided by the captured complementarity. In addition, we propose a novel cross-modal fusion module to adaptively fuse the features selected by the dynamic routing module, which effectively enhances complementary representations while suppressing redundant features and noise interference. 
Extensive experiments conducted on seven public RGB-D SOD benchmark datasets demonstrate that the proposed method consistently achieves competitive performance, outperforming existing methods by an average of approximately 1% across multiple evaluation metrics. Notably, in challenging scenarios with severe modality quality degradation, the proposed method outperforms existing best-performing methods by up to 1.8%, demonstrating strong robustness against cluttered backgrounds, complex object structures, and diverse object scales. Overall, the proposed dynamic fusion framework provides a novel solution to modality quality imbalance in RGB-D salient object detection. Full article
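The quality-aware fusion principle described (down-weighting a degraded modality instead of fusing with fixed weights) can be sketched with a softmax over per-modality relevance scores. The scoring function itself is learned in the paper, so the scalar scores below are stand-ins:

```python
import numpy as np

def fuse(rgb_feat, depth_feat, rgb_score, depth_score):
    # Softmax over per-modality task-relevance scores gives adaptive fusion weights,
    # so a degraded modality (low score) contributes less to the fused feature.
    w = np.exp([rgb_score, depth_score])
    w = w / w.sum()
    return w[0] * rgb_feat + w[1] * depth_feat

rgb = np.ones(4)    # healthy RGB feature (toy values)
depth = np.zeros(4) # degraded depth feature (toy values)
print(fuse(rgb, depth, rgb_score=2.0, depth_score=0.0))  # each entry ~0.881: depth is down-weighted
```

With equal scores this reduces to a fixed 50/50 average, which is exactly the static-fusion behavior the dynamic routing framework is designed to avoid.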
(This article belongs to the Section Artificial Intelligence)

21 pages, 2657 KB  
Article
Research on Forest Fire Detection and Segmentation Based on MST++ Hyperspectral Reconstruction Technology
by Shuai Tang, Jie Xu and Li Zhang
Fire 2026, 9(4), 139; https://doi.org/10.3390/fire9040139 - 25 Mar 2026
Abstract
The increasing frequency of global forest fires necessitates rapid and accurate detection methods. This study proposes a forest fire detection and segmentation framework based on the MST++ hyperspectral reconstruction model to improve the accuracy and robustness of wildfire monitoring under complex environmental conditions. The proposed method first reconstructs hyperspectral images from RGB inputs using an MST++ model trained on the NTIRE 2022 RGB-to-hyperspectral dataset (950 paired samples), followed by fire and smoke segmentation based on spectrally sensitive bands. For segmentation experiments, 118 flame images from the BoWFire dataset and 100 manually annotated smoke images from public datasets (D-Fire and DFS) were used. Quantitative results demonstrate that the proposed MST++-based method significantly outperforms the conventional U-Net baseline. In flame segmentation, MST++ achieved an IoU of 76.90%, an F1 score of 86.81%, and a Kappa coefficient of 0.8603, compared to 44.42%, 58.15%, and 0.5625 for U-Net, respectively. For smoke segmentation, MST++ achieved an IoU of 91.76% and an F1 score of 95.66%, surpassing U-Net by 17.08% and 10.32%, respectively. In fire–smoke overlapping scenarios, MST++ maintained strong robustness, achieving an IoU of 89.64% for smoke detection. These results indicate that hyperspectral reconstruction enhances discrimination capability among flame, smoke, and complex backgrounds, particularly under low-light and overlapping conditions. The proposed framework provides a reliable and efficient solution for early forest fire detection and demonstrates the potential of hyperspectral reconstruction approaches in disaster monitoring applications. Full article
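The reported segmentation metrics (IoU, F1, and the Kappa coefficient) all derive from the binary confusion counts of predicted versus ground-truth masks. A compact sketch using a 1-D mask for brevity:

```python
import numpy as np

def seg_metrics(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = int(np.sum(pred & gt)); fp = int(np.sum(pred & ~gt))
    fn = int(np.sum(~pred & gt)); tn = int(np.sum(~pred & ~gt))
    n = tp + fp + fn + tn
    iou = tp / (tp + fp + fn)                # intersection over union
    f1 = 2 * tp / (2 * tp + fp + fn)         # Dice / F1 score
    po = (tp + tn) / n                       # observed agreement
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2  # chance agreement
    kappa = (po - pe) / (1 - pe)             # Cohen's kappa
    return iou, f1, kappa

pred = np.zeros(100); pred[10:60] = 1  # predicted flame mask
gt = np.zeros(100); gt[20:70] = 1      # ground-truth mask
iou, f1, kappa = seg_metrics(pred, gt)
print(round(iou, 4), round(f1, 4), round(kappa, 4))  # 0.6667 0.8 0.6
```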

23 pages, 5784 KB  
Article
Learning Italian Hand Gesture Culture Through an Automatic Gesture Recognition Approach
by Chiara Innocente, Giorgio Di Pisa, Irene Lionetti, Andrea Mamoli, Manuela Vitulano, Giorgia Marullo, Simone Maffei, Enrico Vezzetti and Luca Ulrich
Future Internet 2026, 18(4), 177; https://doi.org/10.3390/fi18040177 - 24 Mar 2026
Abstract
Italian hand gestures constitute a distinctive and widely recognized form of nonverbal communication, deeply embedded in everyday interaction and cultural identity. Despite their prominence, these gestures are rarely formalized or systematically taught, posing challenges for foreign speakers and visitors seeking to interpret their meaning and pragmatic use. Moreover, their ephemeral and embodied nature complicates traditional preservation and transmission approaches, positioning them within the broader domain of intangible cultural heritage. This paper introduces a machine learning–based framework for recognizing iconic Italian hand gestures, designed to support cultural learning and engagement among foreign speakers and visitors. The approach combines RGB–D sensing with depth-enhanced geometric feature extraction, employing interpretable classification models trained on a purpose-built dataset. The recognition system is integrated into a non-immersive virtual reality application simulating an interactive digital totem conceived for public arrival spaces, providing tutorial content, real-time gesture recognition, and immediate feedback within a playful and accessible learning environment. Three supervised machine learning pipelines were evaluated, and Random Forest achieved the best overall performance. Its integration with an Isolation Forest module was further considered for deployment, achieving a macro-averaged accuracy and F1-score of 0.82 under a 5-fold cross-validation protocol. An experimental user study was conducted with 25 subjects to evaluate the proposed interactive system in terms of usability, user engagement, and learning effectiveness, obtaining favorable results and demonstrating its potential as a practical tool for cultural education and intercultural communication. Full article

23 pages, 10822 KB  
Article
Off-Road Autonomous Vehicle Semantic Segmentation and Spatial Overlay Video Assembly
by Itai Dror, Omer Aviv and Ofer Hadar
Sensors 2026, 26(6), 1944; https://doi.org/10.3390/s26061944 - 19 Mar 2026
Viewed by 425
Abstract
Autonomous systems are expanding rapidly, driving a demand for robust perception technologies capable of navigating challenging, unstructured environments. While urban autonomy has made significant progress, off-road environments pose unique challenges, including dynamic terrain and limited communication infrastructure. This research addresses these challenges by introducing a novel three-part solution for off-road autonomous vehicles. First, we present a large-scale off-road dataset curated to capture the visual complexity and variability of unstructured environments, providing a realistic training ground that supports improved model generalization. Second, we propose a Confusion-Aware Loss (CAL) that dynamically penalizes systematic misclassifications based on class-level confusion statistics. When combined with cross-entropy, CAL improves segmentation mean Intersection over Union (mIoU) on the off-road test set from 68.66% to 70.06% and achieves cross-domain gains of up to ~0.49% mIoU on the Cityscapes dataset. Third, leveraging semantic segmentation as an intermediate representation, we introduce a spatial overlay video encoding scheme that preserves high-fidelity RGB information in semantically critical regions while compressing non-essential background regions. Experimental results demonstrate Peak Signal-to-Noise Ratio (PSNR) improvements of up to +5 dB and Video Multi-Method Assessment Fusion (VMAF) gains of up to +40 points under lossy compression, enabling efficient and reliable off-road autonomous operation. This integrated approach provides a robust framework for real-time remote operation in bandwidth-constrained environments. Full article
(This article belongs to the Special Issue Machine Learning in Image/Video Processing and Sensing)
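The Confusion-Aware Loss (CAL) described above penalizes systematic misclassifications using class-level confusion statistics. The paper's exact formulation is not given in the abstract; the sketch below shows one plausible reading, assuming the penalty is the probability mass a sample assigns to classes its true class is historically confused with, added to standard cross-entropy. Function name, the `lam` weight, and the normalization are assumptions.

```python
import numpy as np

def confusion_aware_loss(probs, targets, conf_mat, lam=1.0):
    """Cross-entropy plus a confusion-statistics penalty (illustrative).

    probs    : (N, C) predicted class probabilities
    targets  : (N,)   integer ground-truth labels
    conf_mat : (C, C) running confusion matrix (rows = true class)
    lam      : weight of the confusion penalty
    """
    # Row-normalize the confusion matrix to per-class error rates,
    # then zero the diagonal so correct predictions are not penalized.
    rates = conf_mat / conf_mat.sum(axis=1, keepdims=True)
    np.fill_diagonal(rates, 0.0)
    ce = -np.log(probs[np.arange(len(targets)), targets] + 1e-12).mean()
    # Penalize probability mass placed on historically confused classes.
    penalty = (rates[targets] * probs).sum(axis=1).mean()
    return ce + lam * penalty
```

With an identity confusion matrix (no historical confusion) the penalty vanishes and the loss reduces to plain cross-entropy, which matches the abstract's description of CAL as an additive term combined with cross-entropy.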

17 pages, 1167 KB  
Article
HOIMamba: Bidirectional State-Space Modeling for Monocular 3D Human–Object Interaction Reconstruction
by Jinsong Zhang and Yuqin Lin
Biomimetics 2026, 11(3), 214; https://doi.org/10.3390/biomimetics11030214 - 17 Mar 2026
Viewed by 553
Abstract
Monocular 3D human–object interaction (HOI) reconstruction requires jointly recovering articulated human geometry, object pose, and physically plausible contact from a single RGB image. While recent token-based methods commonly employ dense self-attention to capture global dependencies, isotropic all-to-all mixing tends to entangle spatial-geometric cues (e.g., contact locality) with channel-wise semantic cues (e.g., action/affordance), and provides limited control for representing directional and asymmetric physical influence between humans and objects. This paper presents HOIMamba, a state-space sequence modeling framework that reformulates HOI reconstruction as bidirectional, multi-scale interaction state inference. Instead of relying on symmetric correlation aggregation, HOIMamba uses structured state evolution to propagate interaction evidence. We introduce a multi-scale state-space module (MSSM) to capture interaction dependencies spanning local contact details and global body–object coordination. Building on MSSM, we propose a spatial-channel grouped SSM (SCSSM) block that factorizes interaction modeling into a spatial pathway for geometric/contact dependencies and a channel pathway for semantic/functional correlations, followed by gated fusion. HOIMamba further performs explicit bidirectional propagation between human and object states to better reflect asymmetric reciprocity in physical interactions. We evaluate HOIMamba on two public benchmarks, BEHAVE and InterCap, using Chamfer distance for human/object meshes and contact precision/recall induced by reconstructed geometry. HOIMamba achieves consistent improvements over representative prior methods. On the BEHAVE dataset, it reduces human Chamfer distance by 8.6% and improves contact recall by 13.5% compared to the strongest Transformer-based baseline, with similar gains observed on the InterCap dataset. Ablation studies on BEHAVE verify the contributions of state-space modeling, multi-scale inference, spatial-channel factorization, and bidirectional interaction reasoning. Full article
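The bidirectional state-space modeling at the core of HOIMamba can be illustrated with a deliberately simplified linear recurrence: a per-channel scan h_t = a·h_{t-1} + b·x_t run once forward and once backward over the token sequence, with the two directions fused. This toy sketch uses plain summation where the paper describes gated fusion, and omits the multi-scale and spatial-channel grouping; all names here are illustrative.

```python
import numpy as np

def ssm_scan(x, a, b):
    """Minimal diagonal linear state-space scan.

    x : (T, D) input sequence; a, b : (D,) per-channel decay / input gains.
    Returns the hidden state at every step, shape (T, D).
    """
    h = np.zeros(x.shape[1])
    states = []
    for t in range(len(x)):
        h = a * h + b * x[t]   # h_t = a * h_{t-1} + b * x_t
        states.append(h.copy())
    return np.stack(states)

def bidirectional_ssm(x, a, b):
    """Forward + backward scans, fused by summation (the paper uses gating)."""
    fwd = ssm_scan(x, a, b)
    bwd = ssm_scan(x[::-1], a, b)[::-1]
    return fwd + bwd
```

The appeal of this family of models is that each direction is a single linear-time pass over the sequence, in contrast to the quadratic all-to-all mixing of dense self-attention mentioned in the abstract.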

15 pages, 3088 KB  
Article
Lightweight Semantic Segmentation Algorithm Based on Gated Visual State Space Models
by Kui Di, Jinming Cheng, Lili Zhang and Yubin Bao
Electronics 2026, 15(6), 1175; https://doi.org/10.3390/electronics15061175 - 12 Mar 2026
Viewed by 398
Abstract
LiDAR serves as the primary sensor for acquiring environmental information in intelligent driving systems. However, under adverse weather conditions, point cloud signals obtained by LiDAR suffer from intensity attenuation and noise interference, leading to a decline in segmentation accuracy. To address these issues, this paper designs a lightweight semantic segmentation system based on the Gated Visual State Space Model (VMamba), named RainMamba. Specifically, the system utilizes spherical projection to transform point clouds into 2D sequences and constructs a physical perception feature embedding module guided by the Beer–Lambert law to explicitly model and suppress spatial noise at the source. Subsequently, an uncertainty-weighted cross-modal correction module is employed to incorporate RGB images for dynamically calibrating the degraded point cloud data. Finally, a VMamba backbone is adopted to establish global dependencies with linear complexity. Experimental results on the SemanticKITTI dataset demonstrate that the system achieves an inference speed of 83 FPS, with a relative mIoU improvement of approximately 7.2% compared to the real-time baseline PolarNet. Furthermore, zero-shot evaluations on the real-world SemanticSTF dataset validate the system’s robust Sim-to-Real generalization capability. Notably, RainMamba delivers highly competitive accuracy comparable to the state-of-the-art heavy-weight model PTv3 while requiring a significantly lower parameter footprint, thereby demonstrating its immense potential for practical edge-computing deployment. Full article
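The spherical projection step mentioned above, mapping a LiDAR point cloud onto a 2D range image so that image-style backbones can process it, is a standard construction. The sketch below follows the common formulation (yaw to image columns, pitch to rows within the sensor's vertical field of view); the resolution, field-of-view values, and function name are assumptions, not taken from the paper.

```python
import numpy as np

def spherical_project(points, H=64, W=1024, fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project (N, 3) LiDAR points to an (H, W) range image.

    Columns encode yaw (azimuth), rows encode pitch (elevation) within
    the vertical field of view. Pixels with no return are set to -1.
    """
    r = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(points[:, 1], points[:, 0])
    pitch = np.arcsin(points[:, 2] / r)
    fu, fd = np.radians(fov_up_deg), np.radians(fov_down_deg)
    # Map yaw in [-pi, pi] to column [0, W) and pitch to row [0, H).
    u = (0.5 * (yaw / np.pi + 1.0) * W).astype(int)
    v = ((1.0 - (pitch - fd) / (fu - fd)) * H).astype(int)
    u = np.clip(u, 0, W - 1)
    v = np.clip(v, 0, H - 1)
    img = np.full((H, W), -1.0)  # -1 marks "no return"
    img[v, u] = r
    return img
```

On such a range image, the paper's Beer–Lambert-guided module can then model range-dependent intensity attenuation per pixel, since each pixel carries the distance of its return.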

21 pages, 1926 KB  
Article
From 2D to 3D: A Generative Model from Single Image to Digital 3D of Chinese Three Gorges Cultural Relics
by Guang Wu, Mingyuan Ge, Yunxiang Wang, Youhao Chen and Li Liu
Appl. Sci. 2026, 16(6), 2678; https://doi.org/10.3390/app16062678 - 11 Mar 2026
Viewed by 590
Abstract
The acquisition of high-quality three-dimensional (3D) models of cultural relics often relies on expensive scanning equipment or multi-view image capture, which limits large-scale deployment in real-world heritage conservation scenarios. Large-scale water impoundment in the Three Gorges region has resulted in the permanent submergence of numerous cultural relics and archaeological remains. For many of these artifacts, only a single two-dimensional image remains as the sole visual record, posing significant challenges for reconstructing their original three-dimensional geometry and appearance and rendering traditional multi-view reconstruction and physical scanning methods infeasible. To address this challenge, we propose a generative framework for reconstructing high-fidelity 3D digital models of Chinese Three Gorges cultural relics from a single two-dimensional (2D) image. Building upon recent advances in generative 3D representation learning, the proposed method adopts a transformer-based image-to-triplane architecture to infer an implicit 3D representation directly from a single RGB image. A vision transformer encoder is employed to extract global and local visual features, which are subsequently projected into a compact triplane representation through a cross-attention-based decoder. The reconstructed triplane features are further decoded by a neural radiance field (NeRF) to synthesize dense geometry and appearance, enabling accurate mesh extraction and novel-view rendering. To enhance robustness under in-the-wild conditions, the model implicitly estimates camera parameters during inference without relying on explicit calibration information. The proposed method is evaluated on a dataset of Chinese Three Gorges cultural relics, covering diverse artifact categories and visual styles. Experimental results demonstrate that the proposed framework produces structurally coherent and visually consistent 3D reconstructions from a single image, effectively preserving key morphological characteristics of cultural relics under limited data conditions. Compared with existing single-image and multi-view reconstruction baselines, the proposed framework exhibits better reconstruction accuracy, visual consistency, and generalization capability. This study provides an efficient and scalable solution for the digital reconstruction of submerged heritage artifacts from archival images and contributes to the application of generative 3D modeling techniques in cultural heritage preservation and restoration. Full article
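The triplane representation referenced above stores a 3D scene as three axis-aligned 2D feature planes (XY, XZ, YZ); a 3D query point is featurized by sampling each plane at the point's two corresponding coordinates and aggregating, after which a small decoder (here, the NeRF) predicts density and color. The sketch below shows the sampling step only, assuming normalized coordinates in [0, 1]^3 and sum aggregation; names and shapes are illustrative, not the paper's.

```python
import numpy as np

def sample_plane(plane, u, v):
    """Bilinearly sample a (C, H, W) feature plane at (u, v) in [0, 1]^2."""
    C, H, W = plane.shape
    x, y = u * (W - 1), v * (H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * plane[:, y0, x0]
            + wx * (1 - wy) * plane[:, y0, x1]
            + (1 - wx) * wy * plane[:, y1, x0]
            + wx * wy * plane[:, y1, x1])

def triplane_feature(planes, p):
    """Feature for a 3D point p by sampling the XY, XZ and YZ planes.

    planes : dict with keys 'xy', 'xz', 'yz', each a (C, H, W) array.
    p      : (3,) point with coordinates in [0, 1].
    """
    x, y, z = p
    return (sample_plane(planes['xy'], x, y)
            + sample_plane(planes['xz'], x, z)
            + sample_plane(planes['yz'], y, z))
```

The attraction of this factorization is memory: three C×H×W planes stand in for a dense C×H×W×D voxel grid, which is what makes a transformer decoder that emits the planes directly from image features practical.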
