Search Results (108)

Search Parameters:
Keywords = 3D-Swin Transformer

27 pages, 5763 KB  
Article
Cardio-Dense: Diagnosis of Cardiac Abnormalities Based on Phonocardiogram Using Improved Swin Transformer Through Lightweight Dense Blocks
by Alaa E. S. Ahmed, Mostafa E. A. Ibrahim and Yassine Daadaa
Diagnostics 2026, 16(10), 1421; https://doi.org/10.3390/diagnostics16101421 - 7 May 2026
Viewed by 228
Abstract
Background: Cardiovascular diseases (CVDs) are among the leading causes of mortality worldwide. Proper diagnosis of cardiovascular diseases calls for a low-cost approach based on phonocardiography (PCG) signals. Several deep learning (DL)-driven CVD systems are now being developed to identify various phases of the disease. Nevertheless, their accuracy falls short of expectations, and they necessitate substantial processing resources and training data. Methods: This paper proposes Cardio-Dense, a hybrid framework for multi-class CVD detection from phonocardiogram signals. The PCG waveform is first denoised in the wavelet domain and then converted into a 2D time–frequency spectrogram using the continuous wavelet transform (CWT). We design a joint architecture that combines a Swin Transformer for capturing global contextual dependencies with lightweight DenseBlocks for efficient local feature refinement, enabling robust learning from PCG spectrograms across five disease classes. Results: Experiments on PCG datasets achieve up to 0.977 accuracy, 0.975 sensitivity, 0.992 specificity, 0.978 F1-score, 0.978 AUC, and 0.976 precision, while maintaining low computational overhead suitable for real-time inference. Conclusions: The findings indicate that the proposed model provides an economical, non-invasive method for preliminary signal-level identification of multi-class heart valve diseases. It benefits clinicians by decreasing the need for arduous and error-prone manual PCG analysis. Furthermore, it offers quick, near-real-time categorization suitable for clinical and portable applications. Full article
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)
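
As a rough illustration of the preprocessing pipeline this abstract describes (wavelet-domain denoising followed by a CWT time–frequency spectrogram), the sketch below uses PyWavelets; the wavelet names, decomposition level, and thresholding rule are assumptions, not the authors' settings.

```python
# Wavelet denoising of a PCG signal followed by a CWT spectrogram (illustrative only).
import numpy as np
import pywt

def denoise_pcg(signal, wavelet="db4", level=5):
    """Soft-threshold detail coefficients using a universal threshold estimate."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # Noise level estimated from the finest detail band (median absolute deviation).
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thr = sigma * np.sqrt(2.0 * np.log(len(signal)))
    coeffs[1:] = [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(signal)]

def pcg_to_spectrogram(signal, fs=2000, num_scales=64, wavelet="morl"):
    """Continuous wavelet transform -> 2D magnitude spectrogram (scales x time)."""
    scales = np.arange(1, num_scales + 1)
    coef, _freqs = pywt.cwt(signal, scales, wavelet, sampling_period=1.0 / fs)
    return np.abs(coef)

# Example: one second of a synthetic heart-sound-like signal at 2 kHz.
fs = 2000
t = np.linspace(0, 1, fs, endpoint=False)
pcg = np.sin(2 * np.pi * 40 * t) + 0.3 * np.random.randn(fs)
spec = pcg_to_spectrogram(denoise_pcg(pcg), fs=fs)
print(spec.shape)  # (64, 2000), ready to be resized and fed to a 2D backbone
```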

22 pages, 2008 KB  
Article
A Video Frame Prediction Method Based on Latent-Space Autoregressive Modeling
by Congcong Zhang, Jin Tian, Lihua Gong, Yujin Zhang, Fei Wu and Han Pan
Appl. Sci. 2026, 16(9), 4423; https://doi.org/10.3390/app16094423 - 1 May 2026
Viewed by 253
Abstract
Video prediction is a fundamental task in computer vision with broad applications in intelligent robotics, autonomous driving, and related fields. However, existing methods often struggle to simultaneously model long-term temporal dependencies, preserve local details, and alleviate error accumulation during autoregressive prediction. To address these issues, this paper proposes a two-stage video prediction framework composed of a HybridResSwin Autoencoder (HRS-AE) and an Enhanced FAR Transformer (EFAR). In the first stage, HRS-AE learns compact and discriminative latent representations from input video frames while preserving essential spatial structures and fine-grained details. In the second stage, EFAR performs autoregressive temporal prediction in the latent space, and the predicted latent representations are then decoded to reconstruct future video frames. Experiments on the KTH, BAIR, and Moving MNIST datasets show that the proposed method achieves competitive performance under the adopted evaluation protocol. Specifically, the proposed framework achieves a PSNR of 30.27 dB and an LPIPS of 0.0722 on KTH, a PSNR of 20.95 dB on BAIR, and an SSIM of 0.961 with an MSE of 22.9 on Moving MNIST. In addition, ablation studies further indicate that the proposed components contribute to latent representation learning and long-horizon prediction stability. These results suggest that the proposed framework provides a promising approach for video prediction with favorable reconstruction quality, perceptual consistency, and temporal coherence. Full article
(This article belongs to the Topic Intelligent Image Processing Technology)
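
A minimal sketch of the two-stage latent autoregressive pattern the abstract describes, with a tiny convolutional autoencoder standing in for HRS-AE and a GRU standing in for the EFAR transformer; all module sizes are illustrative assumptions, not the paper's architecture.

```python
# Stage 1: encode frames to latents; Stage 2: predict future latents autoregressively,
# then decode them back to frames (illustrative stand-ins for HRS-AE and EFAR).
import torch
import torch.nn as nn

class TinyAE(nn.Module):
    def __init__(self, z_dim=128):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.Flatten(), nn.Linear(64 * 16 * 16, z_dim))
        self.dec = nn.Sequential(
            nn.Linear(z_dim, 64 * 16 * 16), nn.ReLU(), nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, 2, 1), nn.Sigmoid())

    def encode(self, x): return self.enc(x)
    def decode(self, z): return self.dec(z)

class LatentAR(nn.Module):
    """Predicts the next latent from the history of latents."""
    def __init__(self, z_dim=128, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(z_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, z_dim)

    def forward(self, z_seq):                 # (B, T, z_dim)
        h, _ = self.rnn(z_seq)
        return self.head(h[:, -1])            # next latent, (B, z_dim)

# Roll out 5 future frames from 10 context frames of 64x64 video.
ae, ar = TinyAE(), LatentAR()
frames = torch.rand(2, 10, 1, 64, 64)
z = torch.stack([ae.encode(frames[:, t]) for t in range(10)], dim=1)
preds = []
for _ in range(5):
    z_next = ar(z)                            # autoregressive step in latent space
    preds.append(ae.decode(z_next))
    z = torch.cat([z, z_next.unsqueeze(1)], dim=1)
print(torch.stack(preds, 1).shape)            # (2, 5, 1, 64, 64)
```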

33 pages, 7941 KB  
Article
PGT-Net: A Physics-Guided Transformer–CNN Hybrid Network for Low-Light Image Enhancement and Object Detection in Traffic Scenes
by Bin Chen, Jian Qiao, Baowei Li, Shipeng Liu and Wei She
J. Imaging 2026, 12(5), 191; https://doi.org/10.3390/jimaging12050191 - 28 Apr 2026
Viewed by 280
Abstract
In autonomous driving and intelligent transportation systems, the degradation of image quality under low-light conditions severely impacts the reliability of subsequent object detection. Existing methods predominantly employ data-driven deep learning models for image enhancement, often lacking physical interpretability and struggling to maintain robustness in complex lighting-varying traffic scenarios. To address this, this paper proposes a Physics-Guided Transformer–CNN Hybrid Network (PGT-Net) for end-to-end joint optimization of low-light enhancement and object detection. PGT-Net integrates the atmospheric scattering physical model with a deep learning architecture: first, a learnable physical guidance branch estimates the scene’s atmospheric illumination map and transmittance map, providing explicit physical priors for the network; second, a dual-branch enhancement backbone is designed, where the local CNN branch (based on an improved UNet) restores fine textures, while the Global Transformer Branch (based on Swin Transformer) models long-range dependencies to correct global uneven illumination, with features adaptively combined via a Physical Fusion Module to ensure enhancement results align with physical laws while retaining rich visual features; finally, the enhanced images are directly fed into a lightweight detection head (e.g., YOLOv7) for joint training and optimization. Comprehensive experiments on public datasets (ExDark, BDD100K-night, etc.) demonstrate that PGT-Net significantly outperforms mainstream methods (e.g., RetinexNet, KinD, Zero-DCE) in both low-light image enhancement quality (PSNR/SSIM) and object detection accuracy (mAP), while maintaining high inference efficiency. This research offers an interpretable, high-performance solution for visual perception tasks under adverse lighting conditions, holding strong theoretical significance and practical value. Full article
(This article belongs to the Section AI in Imaging)
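
The physical prior referenced here is the atmospheric scattering model I = J·t + A·(1 − t). Below is a minimal sketch of how scene radiance J can be recovered in closed form once illumination A and transmittance t are estimated; the estimates here are random placeholders, since the paper learns them.

```python
# Invert the atmospheric scattering model I = J*t + A*(1-t) for the scene radiance J.
import numpy as np

def recover_radiance(I, A, t, t_min=0.05):
    """Clamp t to keep the division stable, then solve for J."""
    t = np.clip(t, t_min, 1.0)
    return (I - A * (1.0 - t)) / t

# Toy example: a dim image with a spatially varying transmittance estimate.
rng = np.random.default_rng(0)
I = rng.uniform(0.0, 0.3, size=(64, 64, 3))     # low-light observation
A = np.array([0.9, 0.9, 0.9])                   # global atmospheric light estimate
t = rng.uniform(0.3, 0.8, size=(64, 64, 1))     # per-pixel transmittance estimate
J = np.clip(recover_radiance(I, A, t), 0.0, 1.0)
print(J.shape)  # (64, 64, 3) enhanced estimate that a detection head could consume
```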

18 pages, 4961 KB  
Article
A Generalizable Low-Precision Softmax Approximation for Small-FPGA Deployment of Vision Transformers
by Samuel Aboagye, Lujun Zhai and Suxia Cui
Electronics 2026, 15(9), 1774; https://doi.org/10.3390/electronics15091774 - 22 Apr 2026
Viewed by 387
Abstract
Softmax is a fundamental yet computationally expensive operation in vision transformer attention, posing significant challenges for deployment on resource-constrained FPGAs (Field Programmable Gate Arrays). Internal buffer sizes and computational precision demands grow rapidly at the softmax stage of the attention pipeline, mainly because of the exponential function in the softmax computation. This paper proposes a low-precision softmax approximation that combines a truncated Maclaurin-series exponential with input-range clamping to enable efficient hardware realization without sacrificing reconstruction quality. By bounding extreme attention scores that contribute negligibly to final outputs, the proposed method mitigates the instability of low-order polynomial approximations while preserving their hardware efficiency. The approach is first validated in software using SwinIR (image restoration using the Swin Transformer) super-resolution to ensure reconstruction fidelity and is then analyzed for FPGA deployment. SwinIR is a multi-stage variant of transformer designs such as DeiT and ViT, making it a suitable testbed for assessing how the change affects reconstruction fidelity in transformers. Experimental results demonstrate that the proposed fourth-order clamped approximation achieves near-reference performance, incurring only 0.15 dB PSNR and 0.0059 SSIM degradation on SwinIR-M, while significantly reducing precision and memory requirements. For the large SwinIR model (SwinIR-L), a PSNR increase with less than 0.01 SSIM loss is observed, further highlighting the insignificance of extreme values as model size grows. A Horner-form reformulation further improves hardware efficiency by limiting intermediate precision growth. Overall, this work presents a reconstruction-aware and hardware-friendly softmax reformulation that enables practical deployment of vision transformers on small FPGA platforms. The contribution is also used to improve the ViTA accelerator design, to which we add bias initialization and a runtime-variable PE loop bound. Full article
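
A minimal sketch of the core idea, assuming a fourth-order truncation and an illustrative clamp range (the paper's exact parameters are not reproduced): clamp the max-subtracted scores, evaluate the truncated Maclaurin exponential in Horner form, then normalize.

```python
# Clamped, truncated-Maclaurin softmax approximation evaluated in Horner form.
import numpy as np

def exp_maclaurin_horner(x, order=4):
    """exp(x) ~= 1 + x(1 + x/2(1 + x/3(1 + x/4))) for a 4th-order truncation."""
    y = np.ones_like(x)
    for k in range(order, 0, -1):          # Horner evaluation limits precision growth
        y = 1.0 + x * y / k
    return y

def approx_softmax(scores, clamp_lo=-2.0, order=4):
    # Max-subtraction keeps inputs non-positive; clamping bounds the extreme scores,
    # which is where a low-order polynomial approximation of exp() is least accurate.
    z = scores - scores.max(axis=-1, keepdims=True)
    z = np.clip(z, clamp_lo, 0.0)
    e = exp_maclaurin_horner(z, order)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.random.randn(2, 8, 8) * 3.0    # toy attention logits
ref = np.exp(scores - scores.max(-1, keepdims=True))
ref /= ref.sum(-1, keepdims=True)
approx = approx_softmax(scores)
print(np.abs(ref - approx).max())          # deviation depends on clamp range and order
```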

36 pages, 23663 KB  
Article
Neuro-Prismatic Video Models for Causality-Aware Action Recognition in Neural Rehabilitation Systems
by Hend Alshaya
Mathematics 2026, 14(8), 1341; https://doi.org/10.3390/math14081341 - 16 Apr 2026
Viewed by 410
Abstract
Video-based action recognition for neural rehabilitation—spanning stroke recovery, Parkinsonian gait assessment, and cerebral palsy monitoring—faces critical challenges, including temporal ambiguity, non-causal motion correlations, and the absence of causally grounded dynamics modeling. While transformer-based architectures achieve strong performance, they often exploit spurious temporal and environmental cues, limiting reliability in safety-critical clinical settings. We propose NeuroPrisma, a neuro-prismatic video framework that integrates frequency-domain spectral decomposition with causal intervention under Structural Causal Models (SCMs) via the backdoor criterion. NeuroPrisma introduces (i) a Prismatic Spectral Attention (PSA) module, which applies discrete Fourier transforms to decompose temporal features into multi-scale frequency bands, disentangling slow postural dynamics from rapid corrective movements, and (ii) a Causal Intervention Layer (CIL), which performs do-calculus-based backdoor adjustment to remove confounding influences and produce causally invariant representations. PSA preconditions representations prior to intervention, improving confounder estimation and causal robustness. Extensive evaluation against seven state-of-the-art models (I3D, SlowFast, TimeSformer, ViViT, Video Swin Transformer, UniFormerV2, and VideoMAE) demonstrates that NeuroPrisma achieves 98.7% Top-1 accuracy on UCF101, 82.4% on HMDB51, 71.2% on Something-Something V2, and 91.5%/95.8% on NTU RGB+D (Cross-Subject/Cross-View), consistently outperforming prior methods. It further reduces the Causal Confusion Score (CCS) by 42.3%, indicating substantially lower reliance on spurious correlations, while maintaining real-time performance with 23.4 ms latency per 16-frame clip on an NVIDIA A100 GPU. All improvements are statistically significant (p < 0.001, Cohen’s d = 0.72–1.24). Evaluation was conducted exclusively on benchmark datasets (UCF101, HMDB51, Something-Something V2, and NTU RGB+D) under controlled conditions, without direct clinical validation on neurological patient cohorts. Overfitting was mitigated using three random seeds (42, 123, 456), RandAugment, Mixup (α = 0.8), weight decay (0.05), and early stopping. Cross-dataset generalization from UCF101 to HMDB51 without fine-tuning achieved 76.2% Top-1 accuracy. Future work will focus on prospective clinical validation across stroke, Parkinson’s disease, and cerebral palsy populations, including correlation with standardized clinical assessment scales such as Fugl–Meyer, UPDRS, and GMFCS. These results establish NeuroPrisma as a causally grounded and computationally efficient framework for reliable, real-time movement assessment in clinical rehabilitation systems. Full article
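
A minimal sketch of the frequency-band decomposition behind the Prismatic Spectral Attention idea: apply an FFT along the temporal axis of a clip's feature sequence, split the spectrum into low/mid/high bands, and re-synthesize each band; the band boundaries and tensor shapes are assumptions.

```python
# Temporal FFT band splitting: slow postural dynamics vs. fast corrective movements.
import torch

def spectral_bands(features, cuts=(2, 6)):
    """features: (B, T, C). Returns band-limited signals, one per frequency band."""
    spec = torch.fft.rfft(features, dim=1)             # (B, T//2+1, C) complex
    bands = []
    edges = (0,) + cuts + (spec.shape[1],)
    for lo, hi in zip(edges[:-1], edges[1:]):
        masked = torch.zeros_like(spec)
        masked[:, lo:hi] = spec[:, lo:hi]               # keep one frequency band
        bands.append(torch.fft.irfft(masked, n=features.shape[1], dim=1))
    return bands                                        # slow / intermediate / fast dynamics

x = torch.randn(4, 16, 256)                             # 16-frame feature sequence
low, mid, high = spectral_bands(x)
print(low.shape, torch.allclose(low + mid + high, x, atol=1e-5))  # bands sum back to x
```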

32 pages, 21661 KB  
Article
Robust Human-to-Robot Handover System Under Adverse Lighting
by Yifei Wang, Baoguo Xu, Huijun Li and Aiguo Song
Biomimetics 2026, 11(4), 231; https://doi.org/10.3390/biomimetics11040231 - 1 Apr 2026
Viewed by 635
Abstract
Human-to-robot (H2R) handovers are critical in human–robot interaction but are challenged by complex environments that impact robot perception. Traditional RGB-based perception methods exhibit severe performance degradation under harsh lighting (e.g., glare and darkness). Furthermore, H2R handovers occur in unstructured environments populated with fine-grained visual details, such as multi-angle hand configurations and novel object geometries, where conventional semantic segmentation and grasp generation approaches struggle to generalize. To overcome lighting disturbances, we present an H2R handover system with a dual-path perception pipeline. The system fuses perception data from a stereo RGB-D camera (eye-in-hand) and a time-of-flight (ToF) camera (fixed scene) under normal lighting, and switches to the ToF camera for reliable perception under glare and darkness. In parallel, to address the complex spatial and geometric features, we augment the Point Transformer v3 (PTv3) architecture by integrating a T-Net module and a self-attention mechanism to fuse the relative positional angle features between human and robot, enabling efficient real-time 3D semantic segmentation of both the object and the human hand. For grasp generation, we extend GraspNet with a grasp selection module optimized for H2R scenarios. We validate our approach through extensive experiments: (1) a semantic segmentation dataset with 7500 annotated point clouds covering 15 objects and 5 relative angles and tested on 750 point clouds from 15 unseen objects, where our method achieves 84.4% mIoU, outperforming Swin3D-L by 3.26 percentage points with 3.2× faster inference; (2) 250 real-world handover trials comparing our method with the baseline across 5 objects, 5 hand postures, and 5 angles, showing an improvement of 18.4 percentage points in success rate; (3) 450 trials under controlled adverse lighting (darkness and glare), where our dual-path perception method achieves 82.7% overall success, surpassing single-camera baselines by up to 39.4 percentage points; and (4) a comparative experiment against a state-of-the-art multimodal H2R handover method under identical adverse lighting, where our system achieves 75.0% success (15/20) versus the baseline’s 15.0% (3/20), further confirming the lighting robustness of our design. These results demonstrate the system’s robustness and generalization in challenging H2R handover scenarios. Full article
(This article belongs to the Special Issue Human-Inspired Grasp Control in Robotics 2025)

35 pages, 14172 KB  
Article
A Multimodal Time-Frequency Fusion Architecture for Fault Diagnosis in Rotating Machinery
by Hui Wang, Congming Wu, Yong Jiang, Yanqing Ouyang, Chongguang Ren, Xianqiong Tang and Wei Zhou
Appl. Sci. 2026, 16(7), 3269; https://doi.org/10.3390/app16073269 - 27 Mar 2026
Viewed by 506
Abstract
Accurate fault diagnosis of rotating machinery in complex industrial environments demands an optimal trade-off between feature representation capability and computational efficiency. Existing single-modality models relying solely on 1D time-series signals or heavy 2D time-frequency images often fail to simultaneously capture high-frequency transient impacts and long-range degradation trends. CLiST (Complementary Lightweight Spatiotemporal Network), a novel lightweight multimodal framework driven by time-frequency fusion, was proposed to overcome this limitation. The architecture of CLiST employs a synergistic dual-stream design: a LightTS module efficiently extracts global operational trends from 1D vibration signals with linear complexity, while a structurally pruned LiteSwin integrated with Triplet Attention captures local high-frequency textures from 2D continuous wavelet transform (CWT) images. This mechanism establishes explicit cross-dimensional dependencies, effectively eliminating feature blind spots without excessive computational overhead. The experimental results show that CLiST not only achieves perfect accuracy on the fundamental CWRU benchmark but also exhibits exceptional spatial generalization when independently evaluated on non-dominant sensor axes of the XJTUGearbox dataset. Furthermore, validation on the real-world dataset (Guangzhou port) proves that the framework has excellent robustness to the attenuation of the signal transmission path and reduces the performance fluctuation between remote measurement points. Ultimately, CLiST delivers highly reliable AI-driven image and signal-processing solutions for vibration monitoring in industrial equipment. Full article
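
A minimal sketch of the dual-stream pattern described above: a lightweight 1D branch for global trends in the raw vibration signal and a small 2D branch for local texture in its CWT image, fused here by simple concatenation (the paper's cross-dimensional fusion mechanism is not reproduced); branch sizes are assumptions.

```python
# Two-branch (1D signal + 2D CWT image) fault classifier with late feature fusion.
import torch
import torch.nn as nn

class DualStreamClassifier(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.branch_1d = nn.Sequential(                      # trend branch (time series)
            nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
            nn.AdaptiveAvgPool1d(8), nn.Flatten())           # -> (B, 16*8)
        self.branch_2d = nn.Sequential(                      # texture branch (CWT image)
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten())           # -> (B, 32*16)
        self.head = nn.Linear(16 * 8 + 32 * 16, num_classes)

    def forward(self, signal, cwt_image):
        f = torch.cat([self.branch_1d(signal), self.branch_2d(cwt_image)], dim=1)
        return self.head(f)

model = DualStreamClassifier()
logits = model(torch.randn(2, 1, 1024), torch.randn(2, 1, 64, 64))
print(logits.shape)  # (2, 10)
```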

15 pages, 2219 KB  
Article
One Patch Is All You Need: Joint Surface Material Reconstruction and Classification from Minimal Visual Cues
by Sindhuja Penchala, Gavin Money, Gabriel Marques, Samuel Wood, Jessica Kirschman, Travis Atkison, Shahram Rahimi and Noorbakhsh Amiri Golilarz
Sensors 2026, 26(7), 2083; https://doi.org/10.3390/s26072083 - 27 Mar 2026
Viewed by 443
Abstract
Understanding material surfaces from sparse visual cues is critical for applications in robotics, simulation, and material perception. However, most existing methods rely on dense or full scene observations, limiting their effectiveness in constrained or partial view environments. This gap highlights the need for models capable of inferring surface properties from extremely limited visual information. To address this challenge, we introduce SMARC, a unified model for Surface MAterial Reconstruction and Classification from minimal visual input. Given only a single contiguous patch covering 10% of the image, SMARC reconstructs the full RGB surface while simultaneously classifying the material category. Our architecture combines a Partial Convolutional U-Net with a classification head, enabling both spatial inpainting and semantic understanding under extreme observation sparsity. We compared SMARC against five models, including convolutional autoencoders, Vision Transformer (ViT), Masked Autoencoder (MAE), Swin Transformer, and DETR, using the Touch and Go dataset of real-world surface textures. SMARC achieves the highest performance among the evaluated methods with a PSNR of 17.55 dB and a surface classification accuracy of 85.10%. These results validate the effectiveness of SMARC for surface material understanding and highlight its potential for deployment in robotic perception tasks where visual access is inherently limited. Full article
(This article belongs to the Special Issue Advanced Sensors and AI Integration for Human–Robot Teaming)
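
A minimal sketch of the partial-convolution building block that a Partial Convolutional U-Net relies on: the convolution is evaluated only over observed pixels (per a binary mask), the result is rescaled by window coverage, and the mask is updated for the next layer; layer sizes are illustrative, not the authors' exact module.

```python
# Partial convolution: convolve over valid pixels only and propagate an updated mask.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, stride=1, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, stride, padding, bias=True)
        # Fixed all-ones kernel used to count valid pixels under each window.
        self.register_buffer("ones", torch.ones(1, in_ch, k, k))
        self.stride, self.padding = stride, padding

    def forward(self, x, mask):                  # mask: (B,1,H,W), 1 = observed
        mask_c = mask.expand_as(x)
        valid = F.conv2d(mask_c, self.ones, stride=self.stride, padding=self.padding)
        out = self.conv(x * mask_c)
        bias = self.conv.bias.view(1, -1, 1, 1)
        scale = self.ones.numel() / valid.clamp(min=1.0)    # renormalize by coverage
        out = (out - bias) * scale + bias
        out = out * (valid > 0)                             # zero where no valid input
        new_mask = (valid > 0).float()                      # propagate updated mask
        return out, new_mask

layer = PartialConv2d(3, 16)
img = torch.rand(1, 3, 64, 64)
mask = torch.zeros(1, 1, 64, 64); mask[..., 20:40, 20:40] = 1.0   # ~10% visible patch
feat, new_mask = layer(img, mask)
print(feat.shape, new_mask.mean().item())
```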

18 pages, 2199 KB  
Article
Brain-Oct-Pvt: A Physics-Guided Transformer with Radial Prior and Deformable Alignment for Neurovascular Segmentation
by Quan Lan, Jianuo Huang, Chenxi Huang, Songyuan Song, Yuhao Shi, Zijun Zhao, Wenwen Wu, Hongbin Chen and Nan Liu
Bioengineering 2026, 13(3), 332; https://doi.org/10.3390/bioengineering13030332 - 13 Mar 2026
Viewed by 595
Abstract
The primary objective of this study is to develop a specialized deep learning framework specifically adapted for the unique physical characteristics of neurovascular Optical Coherence Tomography (OCT) imaging. Although Polyp-PVT, originally designed for polyp segmentation, shows promise for OCT analysis, it faces limitations in neurovascular applications. The default RGB input wastes resources on duplicated grayscale data, while its fixed-scale fusion struggles with vascular curvature variations. Furthermore, the attention mechanism fails to capture radial vessel patterns, and geometric constraints limit thin boundary detection. To address these challenges, we propose Brain-OCT-PVT with key innovations: a single-channel input stem reducing parameters by two-thirds; a Radial Intensity Module (RIM) using polar transforms and angular convolution to model annular structures; and a Deformable Cross-scale Fusion Module (D-CFM) with learnable offsets. The Boundary-aware Attention Module (BAM) combines Laplace edge detection with a Swin Transformer for sub-pixel consistency. A specialized loss function combines the Dice Similarity Coefficient (Dice), BoundaryIoU on 2-pixel dilated edges, and Focal Tversky to handle extreme class imbalance. Evaluation on 13 clinical cases achieves a Dice score of 95.06% and a 95% Hausdorff Distance (HD95) of 0.269 mm, demonstrating superior performance compared to existing approaches. Full article
(This article belongs to the Special Issue AI-Driven Imaging and Analysis for Biomedical Applications)
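
A minimal sketch of a compound segmentation loss in the spirit described here, combining Dice and Focal Tversky terms (the BoundaryIoU term on dilated edges is omitted for brevity); the weights and Tversky parameters are assumptions.

```python
# Dice + Focal Tversky compound loss for highly imbalanced vessel segmentation.
import torch

def dice_loss(pred, target, eps=1e-6):
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def focal_tversky_loss(pred, target, alpha=0.7, beta=0.3, gamma=0.75, eps=1e-6):
    tp = (pred * target).sum()
    fn = ((1 - pred) * target).sum()
    fp = (pred * (1 - target)).sum()
    tversky = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
    return (1.0 - tversky) ** gamma      # focal exponent emphasizes hard, imbalanced cases

def combined_loss(logits, target, w_dice=0.5, w_ft=0.5):
    prob = torch.sigmoid(logits)
    return w_dice * dice_loss(prob, target) + w_ft * focal_tversky_loss(prob, target)

logits = torch.randn(2, 1, 128, 128, requires_grad=True)
target = (torch.rand(2, 1, 128, 128) > 0.95).float()   # sparse vessel-like mask
loss = combined_loss(logits, target)
loss.backward()
print(float(loss))
```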

33 pages, 4786 KB  
Article
A Hierarchical Multi-View Deep Learning Framework for Autism Classification Using Structural and Functional MRI
by Nayif Mohammed Hammash and Mohammed Chachan Younis
J. Imaging 2026, 12(3), 109; https://doi.org/10.3390/jimaging12030109 - 4 Mar 2026
Viewed by 730
Abstract
Autism classification is challenging due to the subtle, heterogeneous, and overlapping neural activation profiles that occur in individuals with autism. Novel deep learning approaches, such as Convolutional Neural Networks (CNNs) and their variants, as well as Transformers, have shown moderate performance in discriminating between autism and normal cohorts; yet, they often struggle to jointly capture the spatial–structural and temporal–functional variations present in autistic brains. To overcome these shortcomings, we propose a novel hierarchical deep learning framework that extracts the inherent spatial dependencies from the dual-modal MRI scans. For sMRI, we develop a 3D Hierarchical Convolutional Neural Network to capture both fine and coarse anatomical structures via multi-view projections along the axial, sagittal, and coronal planes. For the fMRI case, we introduced a bidirectional LSTM-based temporal encoder to examine regional brain dynamics and functional connectivity. The sequential embeddings and correlations are combined into a unified spatiotemporal representation of functional imaging, which is then classified using a multilayer perceptron to ensure continuity in diagnostic predictions across the examined modalities. Finally, a cross-modality fusion scheme was employed to integrate feature representations of both modalities. Extensive evaluations on the ABIDE I dataset (NYU repository) demonstrate that our proposed framework outperforms existing baselines, including Vision/Swin Transformers and various newly developed CNN variants. For the sMRI branch, we achieved 90.19 ± 0.12% accuracy (precision: 90.85 ± 0.16%, recall: 89.27 ± 0.19%, F1-score: 90.05 ± 0.14%, and focal loss: 0.3982). For the fMRI branch, we achieved an accuracy of 88.93 ± 0.15% (precision: 89.78 ± 0.18%, recall: 88.29 ± 0.20%, F1-score: 89.03 ± 0.17%, and focal loss of 0.4437). These outcomes affirm the superior generalization and robustness of the proposed framework for integrating structural and functional brain representations to achieve accurate autism classification. Full article
(This article belongs to the Section Medical Imaging)
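
A minimal sketch of the multi-view idea in the sMRI branch: collapsing a 3D volume into axial, sagittal, and coronal projections that per-view 2D encoders can consume. Mean projection and the axis conventions are assumptions here; the paper's hierarchical 3D CNN itself is not reproduced.

```python
# Collapse a 3D sMRI volume into three orthogonal 2D projections.
import numpy as np

def multi_view_projections(volume):
    """volume: (D, H, W) -> dict of three 2D mean-intensity projections."""
    return {
        "axial":    volume.mean(axis=0),   # collapse depth  -> (H, W)
        "coronal":  volume.mean(axis=1),   # collapse height -> (D, W)
        "sagittal": volume.mean(axis=2),   # collapse width  -> (D, H)
    }

vol = np.random.rand(96, 128, 128).astype(np.float32)   # toy sMRI volume
views = multi_view_projections(vol)
print({k: v.shape for k, v in views.items()})
```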

20 pages, 3334 KB  
Article
A Rolling Bearing Fault Diagnosis Method Based on the STRN-CM Model
by Shiyou Xu, Wei Zhang, Shan Pang, Shenglin Wu, Rongzhen Zhao, Yijuan Qin and Pinshuo Guo
Machines 2026, 14(3), 279; https://doi.org/10.3390/machines14030279 - 2 Mar 2026
Viewed by 465
Abstract
The operational safety of rotating machinery heavily relies on the condition of its rolling bearings. However, under strong background noise and variable operating conditions, weak fault-induced impact responses are easily overwhelmed. To address these challenges, this paper proposes a dual-branch cross-modal fault diagnosis framework (STRN-CM) that integrates a Swin Transformer with a one-dimensional wide-kernel deep residual network (1D ResNet). The model builds a complementary structure of heterogeneous features: the enhanced 1D ResNet branch responds directly to transient impulse features, enabling early fault detection from the raw vibration signal, while the Swin Transformer branch captures long-range periodic texture patterns from time–frequency images, which carry strong temporal dependencies. In addition, a Cross-Modal Attention Fusion (CMAF) scheme is introduced. Using high signal-to-noise ratio (SNR) temporal impulse features as query probes, the mechanism dynamically calibrates the response weights of time–frequency features, thereby achieving adaptive denoising and enhancement at the feature level. Experimental results demonstrate that STRN-CM achieves a diagnostic accuracy of 93.04% under harsh −6 dB noise conditions on the Case Western Reserve University (CWRU) dataset. Furthermore, it achieves 97.99% accuracy on the Paderborn University (PU) dataset, demonstrating significantly better generalization than single-modal networks in cross-load and real fatigue damage transfer tasks. Full article
(This article belongs to the Section Machines Testing and Maintenance)
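
A minimal sketch of the cross-modal attention fusion idea: tokens from the time-domain branch act as queries that re-weight time–frequency tokens via standard multi-head attention. The dimensions and the single attention layer are assumptions, not the exact CMAF module.

```python
# Temporal-branch tokens query the time-frequency (Swin) tokens via cross-attention.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, temporal_feats, tf_feats):
        # temporal_feats: (B, Nt, dim) queries; tf_feats: (B, Nf, dim) keys/values.
        fused, _ = self.attn(temporal_feats, tf_feats, tf_feats)
        return self.norm(temporal_feats + fused)       # residual keeps the impulse cues

fusion = CrossModalFusion()
q = torch.randn(2, 32, 128)     # tokens from the 1D ResNet branch
kv = torch.randn(2, 49, 128)    # patch tokens from the time-frequency branch
print(fusion(q, kv).shape)      # (2, 32, 128)
```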

18 pages, 4746 KB  
Article
MS2-CL: Multi-Scale Self-Supervised Learning for Camera to LiDAR Cross-Modal Place Recognition
by Wen Liu, Lei Ma, Xuanshun Zhuang and Zhongliang Deng
Sensors 2026, 26(5), 1561; https://doi.org/10.3390/s26051561 - 2 Mar 2026
Viewed by 501
Abstract
Place recognition is a fundamental challenge for robotics and autonomous vehicles. While visual place recognition has achieved high precision, cross-modal place recognition—specifically, visual localization within large-scale point cloud maps—remains a formidable problem. Existing methods often struggle with the significant domain gap between modalities and can be computationally prohibitive, especially those processing raw 3D point clouds. Furthermore, they frequently fail to learn features invariant to viewpoint and scale variations, limiting generalization to unseen environments. In this paper, we formulate cross-modal recognition as a problem of learning a scale-invariant, unified embedding space. Our framework employs a hierarchical Swin Transformer to extract multi-scale features from unified 2D representations of both modalities. The central principle of our method is a multi-scale self-distillation paradigm, which recasts feature learning as an intra-modal knowledge transfer task. Specifically, the coarse-scale “teacher” features provide supervision for the fine-scale “student” features. The final inter-modal alignment is then achieved via a global contrastive loss, exclusively leveraging the semantically rich “teacher” embeddings to ensure a reliable and discriminative matching. Extensive experiments on the KITTI and KITTI-360 datasets demonstrate that our method achieves state-of-the-art performance. Notably, using only the KITTI-trained model without fine-tuning, Recall@1 exceeds 60% on all evaluable sequences of KITTI-360 at a 10 m threshold. Code and pre-trained models will be made publicly available upon acceptance. Full article
(This article belongs to the Section Radar Sensors)
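
A minimal sketch of the inter-modal alignment step as an InfoNCE-style contrastive loss between camera and LiDAR global descriptors, where matching pairs in a batch are positives and all other pairs are negatives. The multi-scale self-distillation term is omitted; the temperature and embedding size are assumptions.

```python
# Symmetric InfoNCE contrastive loss over cross-modal global embeddings.
import torch
import torch.nn.functional as F

def infonce(img_emb, pc_emb, temperature=0.07):
    """img_emb, pc_emb: (B, D) global descriptors for the same B places."""
    img = F.normalize(img_emb, dim=1)
    pc = F.normalize(pc_emb, dim=1)
    logits = img @ pc.t() / temperature          # (B, B) cross-modal similarities
    labels = torch.arange(img.shape[0], device=img.device)
    # Symmetric loss: image->point-cloud and point-cloud->image retrieval.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

img_emb = torch.randn(8, 256, requires_grad=True)
pc_emb = torch.randn(8, 256, requires_grad=True)
loss = infonce(img_emb, pc_emb)
loss.backward()
print(float(loss))
```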

25 pages, 6702 KB  
Article
Soft Optical Sensor for Embryo Quality Evaluation Based on Multi-Focal Image Fusion and RAG-Enhanced Vision Transformers
by Domas Jonaitis, Vidas Raudonis, Egle Drejeriene, Agne Kozlovskaja-Gumbriene and Andres Salumets
Sensors 2026, 26(5), 1441; https://doi.org/10.3390/s26051441 - 25 Feb 2026
Viewed by 431
Abstract
Assessing human embryo quality is a critical step in in vitro fertilization (IVF), yet traditional manual grading remains subjective and physically limited by the shallow depth-of-field in conventional microscopy. This study develops a novel “soft optical sensor” architecture that transforms standard optical microscopy into an automated, high-precision instrument for embryo quality assessment. The proposed system integrates two key computational innovations: (1) a multi-focal image fusion module that reconstructs lost morphological details from Z-stack focal planes, effectively creating a 3D-aware representation from 2D inputs; and (2) a retrieval-augmented generation (RAG) framework coupled with a Swin Transformer to provide both high-accuracy classification and explainable clinical rationales. Validated on a large-scale clinical dataset of 102,308 images (prior to augmentation), the system achieves a diagnostic accuracy of 94.11%. This performance surpasses standard single-plane analysis methods by 9.43%, demonstrating the critical importance of fusing multi-focal data. Furthermore, the RAG module successfully grounds model predictions in standard ESHRE consensus guidelines, generating natural language explanations. The results demonstrate that this soft sensor approach significantly reduces inter-observer variability and offers a robust tool for standardized morphological assessment, though prospective validation against live birth outcomes remains essential for clinical adoption. Full article
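
A minimal sketch of a multi-focal (Z-stack) fusion step using a classical focus measure (local Laplacian energy) to keep the sharpest plane at each pixel; this is a generic focus-stacking illustration, not the paper's learned fusion module.

```python
# Focus stacking: pick, per pixel, the focal plane with the highest local sharpness.
import numpy as np
from scipy.ndimage import laplace, uniform_filter

def fuse_focal_stack(stack):
    """stack: (Z, H, W) grayscale focal planes -> (H, W) all-in-focus image."""
    # Local energy of the Laplacian as a per-pixel sharpness score for each plane.
    sharpness = np.stack([uniform_filter(np.abs(laplace(p)), size=9) for p in stack])
    best = np.argmax(sharpness, axis=0)                  # (H, W) index of sharpest plane
    return np.take_along_axis(stack, best[None], axis=0)[0]

stack = np.random.rand(7, 256, 256).astype(np.float32)  # toy 7-plane Z-stack
fused = fuse_focal_stack(stack)
print(fused.shape)  # (256, 256)
```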

23 pages, 5043 KB  
Article
A Hybrid of ResNext101_32x8d and Swin Transformer Networks with XAI for Alzheimer’s Disease Detection
by Saeed Mohsen, Amr Yousef and M. Abdel-Aziz
Computers 2026, 15(2), 95; https://doi.org/10.3390/computers15020095 - 2 Feb 2026
Viewed by 741
Abstract
Medical images obtained from advanced imaging devices play a crucial role in supporting disease diagnosis and detection. Nevertheless, acquiring such images is often costly and storage-intensive, and it is time-consuming to diagnose individuals. The use of artificial intelligence (AI)-based automated diagnostic systems provides potential solutions to address the limitations of cost and diagnostic time. In particular, deep learning and explainable AI (XAI) techniques provide a reliable and robust approach to classifying medical images. This paper presents a hybrid model comprising two networks, ResNext101_32x8d and Swin Transformer to differentiate four categories of Alzheimer’s disease: no dementia, very mild dementia, mild dementia, and moderate dementia. The combination of the two networks is applied to imbalanced data, trained on 5120 MRI images, validated on 768 images, and tested on 512 other images. Grad-CAM and LIME techniques with a saliency map are employed to interpret the predictions of the model, providing transparent and clinically interpretable decision support. The proposed combination is realized through a TensorFlow framework, incorporating hyperparameter optimization and various data augmentation methods. The performance evaluation of the proposed model is conducted through several metrics, including the error matrix, precision recall (PR), receiver operating characteristic (ROC), accuracy, and loss curves. Experimental results reveal that the hybrid of ResNext101_32x8d and Swin Transformer achieved a testing accuracy of 98.83% with a corresponding loss rate of 0.1019. Furthermore, for the combination “ResNext101_32x8d + Swin Transformer”, the precision, F1-score, and recall were 99.39%, 99.15%, and 98.91%, respectively, while the area under the ROC curve (AUC) was 1.00, “100%”. The combination of proposed networks with XAI techniques establishes a unique contribution to advance medical AI systems and assist radiologists during Alzheimer’s disease screening of patients. Full article
(This article belongs to the Section AI-Driven Innovations)
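
A minimal sketch of a two-backbone hybrid classifier in this spirit: pooled features from a ResNeXt101_32x8d branch and a Swin Transformer branch are concatenated ahead of a small head over the four dementia classes. The timm model names, feature sizes, and head are assumptions, not the authors' training setup.

```python
# Concatenate pooled features from two timm backbones and classify four classes.
import torch
import torch.nn as nn
import timm

class HybridClassifier(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        # num_classes=0 makes each backbone return its pooled feature vector.
        self.cnn = timm.create_model("resnext101_32x8d", pretrained=False, num_classes=0)
        self.swin = timm.create_model("swin_base_patch4_window7_224",
                                      pretrained=False, num_classes=0)
        feat_dim = self.cnn.num_features + self.swin.num_features
        self.head = nn.Sequential(nn.Dropout(0.3), nn.Linear(feat_dim, num_classes))

    def forward(self, x):
        f = torch.cat([self.cnn(x), self.swin(x)], dim=1)
        return self.head(f)

model = HybridClassifier()
logits = model(torch.rand(2, 3, 224, 224))   # MRI slices resized to 224x224, 3-channel
print(logits.shape)                          # (2, 4)
```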

26 pages, 8779 KB  
Article
TAUT: A Remote Sensing-Based Terrain-Adaptive U-Net Transformer for High-Resolution Spatiotemporal Downscaling of Temperature over Southwest China
by Zezhi Cheng, Jiping Guan, Li Xiang, Jingnan Wang and Jie Xiang
Remote Sens. 2026, 18(3), 416; https://doi.org/10.3390/rs18030416 - 27 Jan 2026
Viewed by 890
Abstract
High-precision temperature prediction is crucial for dealing with extreme weather events against the background of global warming. However, due to the limitations of computing resources, it is difficult for numerical weather prediction (NWP) models to directly provide high spatio-temporal resolution data that meets the specific application requirements of a given region. This problem is particularly prominent in areas with complex terrain. The use of remote sensing data, especially high-resolution terrain data, provides key information for understanding and simulating the interaction between land and atmosphere in complex terrain, making the integration of remote sensing and NWP outputs to achieve high-precision meteorological element downscaling a core challenge. To address the challenge of temperature downscaling in the complex terrain areas of Southwest China, this paper proposes a novel deep learning model—the Terrain Adaptive U-Net Transformer (TAUT). This model takes the encoder–decoder structure of U-Net as the skeleton, deeply integrates the global attention mechanism of Swin Transformer with the local spatiotemporal feature extraction ability of three-dimensional convolution, and introduces a multi-branch terrain-adaptive module (MBTA) that adaptively integrates terrain remote sensing data with various meteorological fields, such as temperature and wind. As a result, a spatio-temporal high-resolution downscaling of 2 m temperature was realized in the complex terrain area of Southwest China (from 0.1° to 0.01° in space, and from 3 h intervals to 1 h intervals in time). The experimental results show that within the 48 h downscaling window, the TAUT model outperforms comparison models such as bilinear interpolation, SRCNN, U-Net, and EDVR in all evaluation metrics (MAE, RMSE, COR, ACC, PSNR, SSIM). Systematic ablation experiments verified the independent contributions and synergistic effects of the Swin Transformer module, the 3D convolution module, and the MBTA module in improving model performance. In addition, regional terrain verification shows that the model demonstrates good adaptability and stability under different terrain types (mountains, plateaus, basins). Especially in cases of high-temperature extreme weather, it can more precisely restore the temperature distribution details and spatial textures affected by the terrain, verifying the significant impact of terrain remote sensing data on the accuracy of temperature downscaling. The core contribution of this study lies in the successful construction of a hybrid architecture that can jointly leverage the local feature extraction advantages of CNNs and the global context modeling capabilities of Transformers, and effectively integrate key terrain remote sensing data through dedicated modules. The TAUT model offers an effective deep learning solution for precise temperature prediction in complex terrain areas and also provides a reference framework for the integration of remote sensing data and numerical model data in deep learning models. Full article
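
A minimal sketch of the terrain-aware downscaling pattern: bilinearly upsample the coarse 2 m temperature field to the target grid, concatenate a high-resolution elevation channel, and let a small CNN predict a residual correction. The layer sizes are assumptions and stand in for, rather than reproduce, the full U-Net/Swin/3D-convolution design.

```python
# Bilinear-upsampled baseline plus a learned, terrain-conditioned residual correction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TerrainAwareDownscaler(nn.Module):
    def __init__(self, scale=10):
        super().__init__()
        self.scale = scale
        self.refine = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1))

    def forward(self, coarse_t2m, hires_dem):
        # coarse_t2m: (B,1,h,w), e.g. 0.1 deg grid; hires_dem: (B,1,H,W) on the 0.01 deg grid.
        up = F.interpolate(coarse_t2m, scale_factor=self.scale,
                           mode="bilinear", align_corners=False)
        correction = self.refine(torch.cat([up, hires_dem], dim=1))
        return up + correction          # interpolation baseline + learned residual

model = TerrainAwareDownscaler()
out = model(torch.randn(1, 1, 32, 32), torch.randn(1, 1, 320, 320))
print(out.shape)  # (1, 1, 320, 320)
```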
