Search Results (253)

Search Parameters:
Keywords = cross-modal transformer

24 pages, 2308 KB  
Review
Review on Application of Machine Vision-Based Intelligent Algorithms in Gear Defect Detection
by Dehai Zhang, Shengmao Zhou, Yujuan Zheng and Xiaoguang Xu
Processes 2025, 13(10), 3370; https://doi.org/10.3390/pr13103370 - 21 Oct 2025
Viewed by 440
Abstract
Gear defect detection directly affects the operational reliability of critical equipment in fields such as automotive and aerospace. Machine vision-based gear defect detection, with its advantages of non-contact measurement, high efficiency, and cost-effectiveness, has become a key support for quality control in intelligent manufacturing. However, it still faces challenges including difficulties in the semantic alignment of multimodal data, the imbalance between real-time detection requirements and computational resources, and poor model generalization in few-shot scenarios. Taking the paradigm evolution of gear defect detection technology as its main thread, this paper systematically reviews the development from traditional image processing to deep learning and focuses on the innovative application of intelligent algorithms. A research framework of “technical bottleneck-breakthrough path-application verification” is constructed: for multimodal fusion, the Transformer-based cross-modal feature alignment mechanism is analyzed in depth, clarifying how joint embeddings of visual and vibration signals are realized by establishing global correlation mappings; for resource constraints, lightweight models such as MobileNet and ShuffleNet are quantitatively compared, verifying that they reduce parameter counts by 40–60% while keeping mean Average Precision essentially unchanged; for small-sample scenarios, few-shot generation models based on contrastive learning are systematically surveyed, confirming that their accuracy in the 10-shot setting can reach 90% of that of fully supervised models, thus enhancing generalization ability. Future research can focus on the collaboration between few-shot generation and physical simulation, edge-cloud dynamic scheduling, defect evolution modeling driven by multiphysics fields, and the standardization of explainable artificial intelligence. These directions aim to build a gear detection system with autonomous perception capabilities, promoting the development of industrial quality inspection toward high-precision, high-robustness, and low-cost intelligence. Full article
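To make the kind of Transformer-based cross-modal alignment discussed in this review concrete, the following is a minimal, illustrative sketch of joint embedding of image-patch and vibration-frame features via cross-attention. All module names, dimensions, and the random inputs are assumptions for demonstration; this is not the reviewed systems' implementation.

```python
import torch
import torch.nn as nn

class CrossModalAligner(nn.Module):
    """Toy cross-attention block: visual patch tokens attend to vibration tokens."""
    def __init__(self, d_vis=256, d_vib=64, d_model=128, n_heads=4):
        super().__init__()
        self.proj_vis = nn.Linear(d_vis, d_model)   # map both modalities into a shared space
        self.proj_vib = nn.Linear(d_vib, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, vis_tokens, vib_tokens):
        q = self.proj_vis(vis_tokens)               # (B, N_patches, d_model)
        kv = self.proj_vib(vib_tokens)              # (B, N_frames, d_model)
        fused, attn_map = self.attn(q, kv, kv)      # global correlation between the modalities
        return self.norm(q + fused), attn_map       # residual joint embedding + attention weights

# Random stand-in features: 2 samples, 196 image patches, 50 vibration frames.
vis = torch.randn(2, 196, 256)
vib = torch.randn(2, 50, 64)
joint, attn = CrossModalAligner()(vis, vib)
print(joint.shape, attn.shape)   # torch.Size([2, 196, 128]) torch.Size([2, 196, 50])
```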

24 pages, 4921 KB  
Article
YOLOv11-DCFNet: A Robust Dual-Modal Fusion Method for Infrared and Visible Road Crack Detection in Weak- or No-Light Illumination Environments
by Xinbao Chen, Yaohui Zhang, Junqi Lei, Lelin Li, Lifang Liu and Dongshui Zhang
Remote Sens. 2025, 17(20), 3488; https://doi.org/10.3390/rs17203488 - 20 Oct 2025
Viewed by 243
Abstract
Road cracks represent a significant challenge that impacts the long-term performance and safety of transportation infrastructure. Early identification of these cracks is crucial for effective road maintenance management. However, traditional crack recognition methods that rely on visible light images often experience substantial performance degradation in weak-light environments, such as at night or within tunnels. This degradation is characterized by blurred or deficient image textures, indistinct target edges, and reduced detection accuracy, which hinders the ability to achieve reliable all-weather target detection. To address these challenges, this study introduces a dual-modal crack detection method named YOLOv11-DCFNet. This method is based on an enhanced YOLOv11 architecture and incorporates a Cross-Modality Fusion Transformer (CFT) module. It establishes a dual-branch feature extraction structure that utilizes both infrared and visible light within the original YOLOv11 framework, effectively leveraging the high contrast capabilities of thermal infrared images to detect cracks under weak- or no-light conditions. The experimental results demonstrate that the proposed YOLOv11-DCFNet method significantly outperforms the single-modal model (YOLOv11-RGB) in both weak-light and no-light scenarios. Under weak-light conditions, the fusion model effectively utilizes the weak texture features of RGB images alongside the thermal radiation information from infrared (IR) images. This leads to an improvement in Precision from 83.8% to 95.3%, Recall from 81.5% to 90.5%, mAP@0.5 from 84.9% to 92.9%, and mAP@0.5:0.95 from 41.7% to 56.3%, thereby enhancing both detection accuracy and quality. In no-light conditions, the RGB single modality performs poorly due to the absence of visible light information, with an mAP@0.5 of only 67.5%. However, by incorporating IR thermal radiation features, the fusion model enhances Precision, Recall, and mAP@0.5 to 95.3%, 90.5%, and 92.9%, respectively, maintaining high detection accuracy and stability even in extreme no-light environments. The results of this study indicate that YOLOv11-DCFNet exhibits strong robustness and generalization ability across various low illumination conditions, providing effective technical support for night-time road maintenance and crack monitoring systems. Full article
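As a rough illustration of how a Cross-Modality Fusion Transformer-style block can mix infrared and visible feature maps, the sketch below flattens both maps into tokens, lets a shared self-attention layer exchange information across modalities, and folds them back. Channel counts and layer settings are assumptions, not the YOLOv11-DCFNet configuration.

```python
import torch
import torch.nn as nn

class TokenFusion(nn.Module):
    """Simplified CFT-style fusion: self-attention over concatenated RGB and IR tokens."""
    def __init__(self, channels=256, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=n_heads,
                                           batch_first=True, dim_feedforward=512)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, rgb_feat, ir_feat):             # both (B, C, H, W)
        B, C, H, W = rgb_feat.shape
        rgb = rgb_feat.flatten(2).transpose(1, 2)      # (B, H*W, C)
        ir = ir_feat.flatten(2).transpose(1, 2)
        tokens = torch.cat([rgb, ir], dim=1)           # attention can mix the two modalities
        fused = self.encoder(tokens)
        rgb_out, ir_out = fused.split(H * W, dim=1)    # fold back into feature maps
        return (rgb_out.transpose(1, 2).reshape(B, C, H, W),
                ir_out.transpose(1, 2).reshape(B, C, H, W))

rgb, ir = torch.randn(1, 256, 20, 20), torch.randn(1, 256, 20, 20)
f_rgb, f_ir = TokenFusion()(rgb, ir)
print(f_rgb.shape, f_ir.shape)                         # torch.Size([1, 256, 20, 20]) twice
```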

23 pages, 5774 KB  
Article
A Multimodal Voice Phishing Detection System Integrating Text and Audio Analysis
by Jiwon Kim, Seuli Gu, Youngbeom Kim, Sukwon Lee and Changgu Kang
Appl. Sci. 2025, 15(20), 11170; https://doi.org/10.3390/app152011170 - 18 Oct 2025
Viewed by 270
Abstract
Voice phishing has emerged as a critical security threat, exploiting both linguistic manipulation and advances in synthetic speech technologies. Traditional keyword-based approaches often fail to capture contextual patterns or detect forged audio, limiting their effectiveness in real-world scenarios. To address this gap, we propose a multimodal voice phishing detection system that integrates text and audio analysis. The text module employs a KoBERT-based transformer classifier with self-attention interpretation, while the audio module leverages MFCC features and a CNN–BiLSTM classifier to identify synthetic speech. A fusion mechanism combines the outputs of both modalities, with experiments conducted on real-world call transcripts, phishing datasets, and synthetic voice corpora. The results demonstrate that the proposed system consistently achieves high accuracy, precision, recall, and F1-scores on validation data while maintaining robust performance in noisy and diverse real-call scenarios. Furthermore, attention-based interpretability enhances trustworthiness by revealing cross-token and discourse-level interaction patterns specific to phishing contexts. These findings highlight the potential of the proposed system as a reliable, explainable, and deployable solution for preventing the financial and social damage caused by voice phishing. Unlike prior studies limited to single-modality or shallow fusion, our work presents a fully integrated text–audio detection pipeline optimized for Korean real-world datasets and robust to noisy, multi-speaker conditions. Full article
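A minimal sketch of the kind of audio branch and score-level fusion described above: a small CNN–BiLSTM over an MFCC matrix produces a synthetic-speech probability, which is then blended with a text-model score. The MFCC tensor, the text score, and the fusion weight are random/illustrative stand-ins (real MFCCs would come from, e.g., librosa or torchaudio), not the authors' configuration.

```python
import torch
import torch.nn as nn

class AudioBranch(nn.Module):
    """CNN over MFCC time-frequency patches followed by a BiLSTM over time."""
    def __init__(self, n_mfcc=40, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
        )
        self.lstm = nn.LSTM(16 * (n_mfcc // 2), hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, mfcc):                          # mfcc: (B, n_mfcc, T)
        x = self.cnn(mfcc.unsqueeze(1))               # (B, 16, n_mfcc/2, T)
        x = x.permute(0, 3, 1, 2).flatten(2)          # (B, T, 16 * n_mfcc/2)
        out, _ = self.lstm(x)
        return torch.sigmoid(self.head(out[:, -1]))   # P(synthetic speech)

mfcc = torch.randn(4, 40, 200)                        # stand-in MFCCs for 4 calls
p_audio = AudioBranch()(mfcc).squeeze(1)
p_text = torch.tensor([0.9, 0.2, 0.7, 0.1])           # stand-in for a KoBERT-style text score
alpha = 0.6                                           # illustrative fusion weight
p_final = alpha * p_text + (1 - alpha) * p_audio      # simple score-level fusion
print(p_final)
```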

20 pages, 6483 KB  
Article
Loop-MapNet: A Multi-Modal HDMap Perception Framework with SDMap Dynamic Evolution and Priors
by Yuxuan Tang, Jie Hu, Daode Zhang, Wencai Xu, Feiyu Zhao and Xinghao Cheng
Appl. Sci. 2025, 15(20), 11160; https://doi.org/10.3390/app152011160 - 17 Oct 2025
Viewed by 294
Abstract
High-definition maps (HDMaps) are critical for safe autonomy on structured roads. Yet traditional production—relying on dedicated mapping fleets and manual quality control—is costly and slow, impeding large-scale, frequent updates. Recently, standard-definition maps (SDMaps) derived from remote sensing have been adopted as priors to support HDMap perception, lowering cost but struggling with subtle urban changes and localization drift. We propose Loop-MapNet, a self-evolving, multimodal, closed-loop mapping framework. Loop-MapNet effectively leverages surround-view images, LiDAR point clouds, and SDMaps; it fuses multi-scale vision via a weighted BiFPN, and couples PointPillars BEV and SDMap topology encoders for cross-modal sensing. A Transformer-based bidirectional adaptive cross-attention aligns SDMap with online perception, enabling robust fusion under heterogeneity. We further introduce a confidence-guided masked autoencoder (CG-MAE) that leverages confidence and probabilistic distillation to both capture implicit SDMap priors and enhance the detailed representation of low-confidence HDMap regions. With spatiotemporal consistency checks, Loop-MapNet incrementally updates SDMaps to form a perception–mapping–update loop, compensating remote-sensing latency and enabling online map optimization. On nuScenes, within 120 m, Loop-MapNet attains 61.05% mIoU, surpassing the best baseline by 0.77%. Under extreme localization errors, it maintains 60.46% mIoU, improving robustness by 2.77%; CG-MAE pre-training raises accuracy in low-confidence regions by 1.72%. These results demonstrate advantages in fusion and robustness, moving beyond one-way prior injection and enabling HDMap–SDMap co-evolution for closed-loop autonomy and rapid SDMap refresh from remote sensing. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
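The abstract describes a confidence-guided masked autoencoder (CG-MAE) without giving its details; the toy sketch below shows only the masking step under the assumption that mask sampling is biased toward low-confidence HDMap patches. The sampling rule, shapes, and ratio are illustrative guesses, not the paper's design.

```python
import torch

def confidence_guided_mask(confidence, mask_ratio=0.5):
    """Sample a patch mask that prefers low-confidence regions.

    confidence: (B, N) per-patch confidence in [0, 1]. Returns a boolean mask of
    shape (B, N) where True marks patches hidden from the encoder.
    """
    B, N = confidence.shape
    n_mask = int(mask_ratio * N)
    # Lower confidence -> higher sampling weight (epsilon keeps weights positive).
    weights = (1.0 - confidence).clamp_min(1e-3)
    idx = torch.multinomial(weights, n_mask, replacement=False)   # (B, n_mask)
    mask = torch.zeros(B, N, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return mask

conf = torch.rand(2, 100)              # stand-in per-patch confidences from an HDMap head
mask = confidence_guided_mask(conf, mask_ratio=0.5)
print(mask.shape, mask.sum(dim=1))     # torch.Size([2, 100]) tensor([50, 50])
```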

51 pages, 4751 KB  
Review
Large Language Models and 3D Vision for Intelligent Robotic Perception and Autonomy
by Vinit Mehta, Charu Sharma and Karthick Thiyagarajan
Sensors 2025, 25(20), 6394; https://doi.org/10.3390/s25206394 - 16 Oct 2025
Viewed by 421
Abstract
With the rapid advancement of artificial intelligence and robotics, the integration of Large Language Models (LLMs) with 3D vision is emerging as a transformative approach to enhancing robotic sensing technologies. This convergence enables machines to perceive, reason, and interact with complex environments through natural language and spatial understanding, bridging the gap between linguistic intelligence and spatial perception. This review provides a comprehensive analysis of state-of-the-art methodologies, applications, and challenges at the intersection of LLMs and 3D vision, with a focus on next-generation robotic sensing technologies. We first introduce the foundational principles of LLMs and 3D data representations, followed by an in-depth examination of 3D sensing technologies critical for robotics. The review then explores key advancements in scene understanding, text-to-3D generation, object grounding, and embodied agents, highlighting cutting-edge techniques such as zero-shot 3D segmentation, dynamic scene synthesis, and language-guided manipulation. Furthermore, we discuss multimodal LLMs that integrate 3D data with touch, auditory, and thermal inputs, enhancing environmental comprehension and robotic decision-making. To support future research, we catalog benchmark datasets and evaluation metrics tailored for 3D-language and vision tasks. Finally, we identify key challenges and future research directions, including adaptive model architectures, enhanced cross-modal alignment, and real-time processing capabilities, which pave the way for more intelligent, context-aware, and autonomous robotic sensing systems. Full article
(This article belongs to the Special Issue Advanced Sensors and AI Integration for Human–Robot Teaming)

18 pages, 4337 KB  
Article
A Transformer-Based Multimodal Fusion Network for Emotion Recognition Using EEG and Facial Expressions in Hearing-Impaired Subjects
by Shuni Feng, Qingzhou Wu, Kailin Zhang and Yu Song
Sensors 2025, 25(20), 6278; https://doi.org/10.3390/s25206278 - 10 Oct 2025
Viewed by 488
Abstract
Hearing-impaired people face challenges in expressing and perceiving emotions, and traditional single-modal emotion recognition methods demonstrate limited effectiveness in complex environments. To enhance recognition performance, this paper proposes a multimodal multi-head attention fusion neural network (MMHA-FNN). This method utilizes differential entropy (DE) and bilinear interpolation features as inputs, learning the spatial–temporal characteristics of brain regions through an MBConv-based module. By incorporating the Transformer-based multi-head self-attention mechanism, we dynamically model the dependencies between EEG and facial expression features, enabling adaptive weighting and deep interaction of cross-modal characteristics. The experiments comprised a four-class task on the MED-HI dataset (15 subjects, 300 trials), with the classes happy, sad, fear, and calmness, where ‘calmness’ corresponds to a low-arousal neutral state as defined in the MED-HI protocol. Results indicate that the proposed method achieved an average accuracy of 81.14%, significantly outperforming feature concatenation (71.02%) and decision-layer fusion (69.45%). This study demonstrates the complementary nature of EEG and facial expressions in emotion recognition among hearing-impaired individuals and validates the effectiveness of attention-based feature-layer interaction fusion in enhancing emotion recognition performance. Full article
(This article belongs to the Section Biomedical Sensors)
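The differential entropy (DE) features mentioned above are commonly computed band by band under a Gaussian assumption, DE = 0.5 ln(2πe σ²). The sketch below shows one way to do that; the sampling rate, band edges, and filter order are assumptions, not the MED-HI preprocessing settings.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 250                      # assumed sampling rate (Hz)
BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30), "gamma": (30, 45)}

def differential_entropy(eeg, fs=FS, bands=BANDS):
    """eeg: (n_channels, n_samples). Returns DE features of shape (n_channels, n_bands).

    Under a Gaussian assumption, the DE of a signal with variance sigma^2 is
    0.5 * ln(2 * pi * e * sigma^2).
    """
    feats = np.zeros((eeg.shape[0], len(bands)))
    for j, (lo, hi) in enumerate(bands.values()):
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        filtered = filtfilt(b, a, eeg, axis=1)          # band-pass each channel
        var = filtered.var(axis=1)
        feats[:, j] = 0.5 * np.log(2 * np.pi * np.e * var)
    return feats

eeg = np.random.randn(32, 4 * FS)       # stand-in: 32 channels, one 4-second trial
print(differential_entropy(eeg).shape)  # (32, 4)
```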

19 pages, 3418 KB  
Article
WSVAD-CLIP: Temporally Aware and Prompt Learning with CLIP for Weakly Supervised Video Anomaly Detection
by Min Li, Jing Sang, Yuanyao Lu and Lina Du
J. Imaging 2025, 11(10), 354; https://doi.org/10.3390/jimaging11100354 - 10 Oct 2025
Viewed by 658
Abstract
Weakly Supervised Video Anomaly Detection (WSVAD) is a critical task in computer vision. It aims to localize and recognize abnormal behaviors using only video-level labels. Without frame-level annotations, it becomes significantly challenging to model temporal dependencies. Given the diversity of abnormal events, it is also difficult to model semantic representations. Recently, the cross-modal pre-trained model Contrastive Language-Image Pretraining (CLIP) has shown a strong ability to align visual and textual information. This provides new opportunities for video anomaly detection. Inspired by CLIP, WSVAD-CLIP is proposed as a framework that uses its cross-modal knowledge to bridge the semantic gap between text and vision. First, the Axial-Graph (AG) Module is introduced. It combines an Axial Transformer and Lite Graph Attention Networks (LiteGAT) to capture global temporal structures and local abnormal correlations. Second, a Text Prompt mechanism is designed. It fuses a learnable prompt with a knowledge-enhanced prompt to improve the semantic expressiveness of category embeddings. Third, the Abnormal Visual-Guided Text Prompt (AVGTP) mechanism is proposed to aggregate anomalous visual context for adaptively refining textual representations. Extensive experiments on UCF-Crime and XD-Violence datasets show that WSVAD-CLIP notably outperforms existing methods in coarse-grained anomaly detection. It also achieves superior performance in fine-grained anomaly recognition tasks, validating its effectiveness and generalizability. Full article
(This article belongs to the Section Computer Vision and Pattern Recognition)
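To give a flavor of pairing learnable prompt vectors with class text embeddings and scoring frames by similarity, here is a toy sketch with stand-in encoders in place of CLIP. The class count (a "normal" class plus 13 anomaly types, as in UCF-Crime), dimensions, temperature, and the simple anomaly score are illustrative assumptions, not the WSVAD-CLIP design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedClassEmbedding(nn.Module):
    """Learnable prompt tokens combined with stand-in class-name token embeddings."""
    def __init__(self, n_classes=14, n_ctx=8, dim=512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_classes, n_ctx, dim) * 0.02)   # learnable prompt
        self.register_buffer("cls_emb", torch.randn(n_classes, 1, dim))      # stand-in for CLIP text tokens
        self.proj = nn.Linear(dim, dim)

    def forward(self):
        tokens = torch.cat([self.ctx, self.cls_emb], dim=1)          # (C, n_ctx+1, dim)
        return F.normalize(self.proj(tokens.mean(dim=1)), dim=-1)    # (C, dim) class text features

frame_feat = F.normalize(torch.randn(64, 512), dim=-1)   # stand-in CLIP image features per frame
text_feat = PromptedClassEmbedding()()                    # (14, 512)
logits = frame_feat @ text_feat.t() / 0.07                # frame-vs-class similarity
anomaly_score = 1.0 - logits.softmax(dim=-1)[:, 0]        # assume class 0 is "normal"
print(anomaly_score.shape)                                 # torch.Size([64])
```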

20 pages, 1740 KB  
Article
Cross-Modal Alignment Enhancement for Vision–Language Tracking via Textual Heatmap Mapping
by Wei Xu, Gu Geng, Xinming Zhang and Di Yuan
AI 2025, 6(10), 263; https://doi.org/10.3390/ai6100263 - 8 Oct 2025
Viewed by 752
Abstract
Single-object vision–language tracking has become an important research topic due to its potential in applications such as intelligent surveillance and autonomous driving. However, existing cross-modal alignment methods typically rely on contrastive learning and struggle to effectively address semantic ambiguity or the presence of multiple similar objects. This study aims to explore how to achieve more robust vision–language alignment under these challenging conditions, thereby achieving accurate object localization. To this end, we propose a text heatmap mapping (THM) module that enhances the spatial guidance of textual cues in tracking. The THM module integrates visual and language features and generates semantically aware heatmaps, enabling the tracker to focus on the most relevant regions while suppressing distractors. This framework, developed based on UVLTrack, combines a visual transformer with a pre-trained language encoder. The proposed method is evaluated on benchmark datasets such as OTB99, LaSOT, and TNL2K. The main contribution of this paper is the introduction of a novel spatial alignment mechanism for multimodal tracking and its effectiveness on various tracking benchmarks. Results demonstrate that the THM-based tracker improves robustness to semantic ambiguity and multi-instance interference, outperforming baseline frameworks. Full article
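A minimal sketch of the text-heatmap idea described above: cosine similarity between a sentence embedding and each spatial visual feature yields a spatial map that re-weights the visual features toward text-relevant regions. Shapes, the temperature, and the softmax normalization are assumptions, not the THM module's exact formulation.

```python
import torch
import torch.nn.functional as F

def text_heatmap(visual_feat, text_emb, temperature=0.1):
    """visual_feat: (B, C, H, W); text_emb: (B, C).
    Returns a heatmap (B, H, W) and re-weighted visual features (B, C, H, W)."""
    B, C, H, W = visual_feat.shape
    v = F.normalize(visual_feat.flatten(2), dim=1)            # (B, C, H*W)
    t = F.normalize(text_emb, dim=1).unsqueeze(1)              # (B, 1, C)
    sim = torch.bmm(t, v).squeeze(1) / temperature             # (B, H*W) cosine similarity
    heat = sim.softmax(dim=-1).view(B, H, W)                   # spatial distribution of the text cue
    weighted = visual_feat * heat.unsqueeze(1)                 # suppress text-irrelevant regions
    return heat, weighted

vis = torch.randn(2, 256, 16, 16)     # stand-in visual features from a tracker backbone
txt = torch.randn(2, 256)             # stand-in language-encoder output
heat, weighted = text_heatmap(vis, txt)
print(heat.shape, weighted.shape)     # torch.Size([2, 16, 16]) torch.Size([2, 256, 16, 16])
```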

17 pages, 1318 KB  
Article
Robust 3D Object Detection in Complex Traffic via Unified Feature Alignment in Bird’s Eye View
by Ajian Liu, Yandi Zhang, Huichao Shi and Juan Chen
World Electr. Veh. J. 2025, 16(10), 567; https://doi.org/10.3390/wevj16100567 - 2 Oct 2025
Viewed by 360
Abstract
Reliable three-dimensional (3D) object detection is critical for intelligent vehicles to ensure safety in complex traffic environments, and recent progress in multi-modal sensor fusion, particularly between LiDAR and camera, has advanced environment perception in urban driving. However, existing approaches remain vulnerable to occlusions and dense traffic, where depth estimation errors, calibration deviations, and cross-modal misalignment are often exacerbated. To overcome these limitations, we propose BEVAlign, a local–global feature alignment framework designed to generate unified BEV representations from heterogeneous sensor modalities. The framework incorporates a Local Alignment (LA) module that enhances camera-to-BEV view transformation through graph-based neighbor modeling and dual-depth encoding, mitigating local misalignment from depth estimation errors. To further address global misalignment in BEV representations, we present the Global Alignment (GA) module comprising a bidirectional deformable cross-attention (BDCA) mechanism and CBR blocks. BDCA employs dual queries from LiDAR and camera to jointly predict spatial sampling offsets and aggregate features, enabling bidirectional alignment within the BEV domain. The stacked CBR blocks then refine and integrate the aligned features into unified BEV representations. Experiments on the nuScenes benchmark highlight the effectiveness of BEVAlign, which achieves 71.7% mAP, outperforming BEVFusion by 1.5%. Notably, it achieves strong performance on small and occluded objects, particularly in dense traffic scenarios. These findings provide a basis for advancing cooperative environment perception in next-generation intelligent vehicle systems. Full article
(This article belongs to the Special Issue Recent Advances in Intelligent Vehicle)
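As a simplified, non-deformable stand-in for a BDCA-style block, the sketch below runs cross-attention in both directions (camera BEV queries attend to LiDAR BEV and vice versa) and sums the aligned streams. Token counts, dimensions, and the output head are illustrative assumptions rather than the BEVAlign implementation.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Simplified, non-deformable stand-in for a bidirectional cross-attention block."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.cam2lidar = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lidar2cam = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.LayerNorm(dim))

    def forward(self, cam_bev, lidar_bev):                  # both (B, N_cells, dim)
        cam_aligned, _ = self.cam2lidar(cam_bev, lidar_bev, lidar_bev)
        lidar_aligned, _ = self.lidar2cam(lidar_bev, cam_bev, cam_bev)
        return self.out(cam_aligned + lidar_aligned)        # unified BEV representation

cam = torch.randn(1, 400, 128)     # stand-in: downsampled camera BEV grid as tokens
lidar = torch.randn(1, 400, 128)   # stand-in: downsampled LiDAR BEV grid as tokens
print(BidirectionalCrossAttention()(cam, lidar).shape)      # torch.Size([1, 400, 128])
```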

18 pages, 11220 KB  
Article
LM3D: Lightweight Multimodal 3D Object Detection with an Efficient Fusion Module and Encoders
by Yuto Sakai, Tomoyasu Shimada, Xiangbo Kong and Hiroyuki Tomiyama
Appl. Sci. 2025, 15(19), 10676; https://doi.org/10.3390/app151910676 - 2 Oct 2025
Viewed by 531
Abstract
In recent years, the demand for both high accuracy and real-time performance in 3D object detection has increased alongside the advancement of autonomous driving technology. While multimodal methods that integrate LiDAR and camera data have demonstrated high accuracy, these methods often have high computational costs and latency. To address these issues, we propose an efficient 3D object detection network that integrates three key components: a DepthWise Lightweight Encoder (DWLE) module for efficient feature extraction, an Efficient LiDAR Image Fusion (ELIF) module that combines channel attention with cross-modal feature interaction, and a Mixture of CNN and Point Transformer (MCPT) module for capturing rich spatial contextual information. Experimental results on the KITTI dataset demonstrate that our proposed method outperforms existing approaches by achieving approximately 0.6% higher 3D mAP, 7.6% faster inference speed, and 17.0% fewer parameters. These results highlight the effectiveness of our approach in balancing accuracy, speed, and model size, making it a promising solution for real-time applications in autonomous driving. Full article
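The abstract does not specify the DepthWise Lightweight Encoder internals, but depthwise-separable convolution is the usual building block behind such lightweight encoders; the sketch below shows that standard block for illustration only, with invented channel sizes.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (per-channel) followed by a 1x1 pointwise conv; far fewer
    parameters and FLOPs than a standard 3x3 convolution at the same width."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False),  # depthwise
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),                          # pointwise
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

x = torch.randn(1, 64, 128, 128)                            # stand-in camera feature map
print(DepthwiseSeparableConv(64, 128, stride=2)(x).shape)   # torch.Size([1, 128, 64, 64])
```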

15 pages, 2373 KB  
Article
LLM-Empowered Kolmogorov-Arnold Frequency Learning for Time Series Forecasting in Power Systems
by Zheng Yang, Yang Yu, Shanshan Lin and Yue Zhang
Mathematics 2025, 13(19), 3149; https://doi.org/10.3390/math13193149 - 2 Oct 2025
Viewed by 347
Abstract
With the rapid evolution of artificial intelligence technologies in power systems, data-driven time-series forecasting has become instrumental in enhancing the stability and reliability of power systems, allowing operators to anticipate demand fluctuations and optimize energy distribution. Despite the notable progress made by current methods, they are still hindered by two major limitations: most existing models are relatively small in architecture, failing to fully leverage the potential of large-scale models, and they are based on fixed nonlinear mapping functions that cannot adequately capture complex patterns, leading to information loss. To this end, an LLM-empowered Kolmogorov–Arnold frequency learning (LKFL) framework is proposed for time series forecasting in power systems, which consists of LLM-based prompt representation learning, KAN-based frequency representation learning, and entropy-oriented cross-modal fusion. Specifically, LKFL first transforms multivariable time-series data into text prompts and leverages a pre-trained LLM to extract semantic-rich prompt representations. It then applies the Fast Fourier Transform to convert the time-series data into the frequency domain and employs Kolmogorov–Arnold networks (KAN) to capture multi-scale periodic structures and complex frequency characteristics. Finally, LKFL integrates the prompt and frequency representations through an entropy-oriented cross-modal fusion strategy, which minimizes the semantic gap between different modalities and ensures full integration of complementary information. This comprehensive approach enables LKFL to achieve superior forecasting performance in power systems. Extensive evaluations on five benchmarks verify that LKFL sets a new standard for time-series forecasting in power systems compared with baseline methods. Full article
(This article belongs to the Special Issue Artificial Intelligence and Data Science, 2nd Edition)
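The sketch below illustrates the two ingredients named above in a hedged, simplified form: an FFT of a load series whose spectrum feeds a small network (a plain MLP used here as a stand-in for the KAN branch), and a text prompt rendering of the series that a frozen LLM would encode. The series length, prompt wording, and layer sizes are all invented for illustration.

```python
import torch
import torch.nn as nn

series = torch.randn(8, 96)                         # stand-in: 8 series of 96 hourly load readings
spectrum = torch.fft.rfft(series, dim=-1)           # frequency-domain view of each series
freq_feat = torch.cat([spectrum.real, spectrum.imag], dim=-1)   # (8, 2 * (96 // 2 + 1))

# Stand-in for the KAN branch: a small MLP over the frequency features.
kan_like = nn.Sequential(nn.Linear(freq_feat.shape[-1], 64), nn.GELU(), nn.Linear(64, 32))
freq_repr = kan_like(freq_feat)                     # (8, 32)

# Stand-in for the prompt branch: the series is rendered as text for a frozen LLM encoder.
values = ", ".join(f"{v.item():.2f}" for v in series[0, :8])
prompt = f"Hourly load readings (first 8 of 96): {values}. Forecast the next 24 hours."
print(freq_repr.shape)
print(prompt)
```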

23 pages, 1548 KB  
Article
Customizable Length Constrained Image-Text Summarization via Knapsack Optimization
by Xuan Liu, Xiangyu Qu, Yu Weng, Yutong Gao, Zheng Liu and Xianggan Liu
Symmetry 2025, 17(10), 1629; https://doi.org/10.3390/sym17101629 - 2 Oct 2025
Viewed by 310
Abstract
With the proliferation of multimedia data, controllable summarization generation has become a key focus in Artificial Intelligence Content Generation. However, many traditional methods lack precise control over output length, often resulting in summaries that are either too verbose or too brief, thus failing to meet diverse user needs. In this paper, we propose a length-customizable approach for multimodal image-text summarization. Our method integrates combinatorial optimization with deep learning to address the length-control challenge. Specifically, we formulate the summarization task as a knapsack optimization problem, enhanced by a greedy algorithm to strictly adhere to user-defined length constraints. Additionally, we introduce a multimodal attention mechanism to ensure balanced and coherent integration of textual and visual information. To further enhance semantic alignment, we employ a cross-modal matching strategy for image selection based on pre-trained vision-language models. Experimental evaluations on the MSMO dataset against baselines such as LEAD-3, Seq2Seq, Attention, and Transformer show that our method achieves a ROUGE-1 score of 40.52, ROUGE-2 of 16.07, and ROUGE-L of 35.15, outperforming existing length-controllable baselines. Moreover, our approach attains the lowest length variance, confirming its precise adherence to target summary lengths. These results validate the effectiveness of our method in generating high-quality, length-constrained multimodal summaries. Full article
(This article belongs to the Section Computer)
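To make the knapsack formulation concrete, here is a toy version: each sentence has a relevance value and a token-length weight, and a greedy value-per-token pass selects sentences under a user length budget. The sentences, scores, and budget are made-up stand-ins, and the greedy heuristic stands in for whatever exact procedure the paper uses.

```python
def greedy_knapsack_summary(sentences, scores, lengths, budget):
    """Pick sentences maximizing total relevance without exceeding `budget` tokens.

    Greedy by value-per-token ratio (a standard knapsack heuristic); returns the
    chosen sentences in their original order plus the token count used.
    """
    order = sorted(range(len(sentences)), key=lambda i: scores[i] / lengths[i], reverse=True)
    chosen, used = [], 0
    for i in order:
        if used + lengths[i] <= budget:
            chosen.append(i)
            used += lengths[i]
    return [sentences[i] for i in sorted(chosen)], used

sents = ["The company reported record revenue.",
         "Shares rose sharply in early trading.",
         "The CEO also commented on unrelated hobbies.",
         "Analysts expect continued growth next quarter."]
scores = [0.9, 0.7, 0.2, 0.8]           # stand-in relevance scores from a model
lengths = [6, 7, 8, 7]                   # token counts per sentence
summary, used = greedy_knapsack_summary(sents, scores, lengths, budget=15)
print(summary, used)                     # the two best value-per-token sentences within 15 tokens
```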

19 pages, 7222 KB  
Article
Multi-Channel Spectro-Temporal Representations for Speech-Based Parkinson’s Disease Detection
by Hadi Sedigh Malekroodi, Nuwan Madusanka, Byeong-il Lee and Myunggi Yi
J. Imaging 2025, 11(10), 341; https://doi.org/10.3390/jimaging11100341 - 1 Oct 2025
Viewed by 315
Abstract
Early, non-invasive detection of Parkinson’s Disease (PD) using speech analysis offers promise for scalable screening. In this work, we propose a multi-channel spectro-temporal deep-learning approach for PD detection from sentence-level speech, a clinically relevant yet underexplored modality. We extract and fuse three complementary time–frequency representations—mel spectrogram, constant-Q transform (CQT), and gammatone spectrogram—into a three-channel input analogous to an RGB image. This fused representation is evaluated across CNNs (ResNet, DenseNet, and EfficientNet) and Vision Transformer using the PC-GITA dataset, under 10-fold subject-independent cross-validation for robust assessment. Results showed that fusion consistently improves performance over single representations across architectures. EfficientNet-B2 achieves the highest accuracy (84.39% ± 5.19%) and F1-score (84.35% ± 5.52%), outperforming recent methods using handcrafted features or pretrained models (e.g., Wav2Vec2.0, HuBERT) on the same task and dataset. Performance varies with sentence type, with emotionally salient and prosodically emphasized utterances yielding higher AUC, suggesting that richer prosody enhances discriminability. Our findings indicate that multi-channel fusion enhances sensitivity to subtle speech impairments in PD by integrating complementary spectral information. Our approach implies that multi-channel fusion could enhance the detection of discriminative acoustic biomarkers, potentially offering a more robust and effective framework for speech-based PD screening, though further validation is needed before clinical application. Full article
(This article belongs to the Special Issue Celebrating the 10th Anniversary of the Journal of Imaging)
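A rough sketch of building the RGB-like three-channel input described above: mel and CQT spectrograms come from librosa, while the gammatone channel, which requires a dedicated filterbank package, is approximated here by a second mel configuration purely as a placeholder. Target shape, normalization, and the synthetic signal are assumptions.

```python
import numpy as np
import librosa
from scipy.ndimage import zoom

def three_channel_input(y, sr, out_shape=(128, 128)):
    """Stack mel, CQT, and a placeholder third channel into an RGB-like tensor."""
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128))
    cqt = librosa.amplitude_to_db(np.abs(librosa.cqt(y=y, sr=sr)))
    # Placeholder for the gammatone spectrogram (needs an external gammatone package);
    # a second mel setting stands in here for illustration only.
    gamma_like = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64, fmax=sr // 2))

    def fit(spec):                                   # resize every view to a common grid
        return zoom(spec, (out_shape[0] / spec.shape[0], out_shape[1] / spec.shape[1]), order=1)

    x = np.stack([fit(s) for s in (mel, cqt, gamma_like)], axis=0)   # (3, 128, 128)
    return (x - x.mean()) / (x.std() + 1e-8)                          # CNN/ViT-ready input

sr = 16000
y = np.random.randn(3 * sr).astype(np.float32)       # stand-in for a sentence recording
print(three_channel_input(y, sr).shape)               # (3, 128, 128)
```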

12 pages, 4847 KB  
Article
Surformer v1: Transformer-Based Surface Classification Using Tactile and Vision Features
by Manish Kansana, Elias Hossain, Shahram Rahimi and Noorbakhsh Amiri Golilarz
Information 2025, 16(10), 839; https://doi.org/10.3390/info16100839 - 27 Sep 2025
Viewed by 389
Abstract
Surface material recognition is a key component in robotic perception and physical interaction, particularly when leveraging both tactile and visual sensory inputs. In this work, we propose Surformer v1, a transformer-based architecture designed for surface classification using structured tactile features and Principal Component Analysis (PCA)-reduced visual embeddings extracted via ResNet-50. The model integrates modality-specific encoders with cross-modal attention layers, enabling rich interactions between vision and touch. Currently, state-of-the-art deep learning models for vision tasks have achieved remarkable performance. With this in mind, our first set of experiments focused exclusively on tactile-only surface classification. Using feature engineering, we trained and evaluated multiple machine learning models, assessing their accuracy and inference time. We then implemented an encoder-only Transformer model tailored for tactile features. This model not only achieved the highest accuracy but also demonstrated significantly faster inference than the other evaluated models, highlighting its potential for real-time applications. To extend this investigation, we introduced a multimodal fusion setup by combining vision and tactile inputs. We trained both Surformer v1 (using structured features) and a Multimodal CNN (using raw images) to examine the impact of feature-based versus image-based multimodal learning on classification accuracy and computational efficiency. The results showed that Surformer v1 achieved 99.4% accuracy with an inference time of 0.7271 ms, while the Multimodal CNN achieved slightly higher accuracy but required significantly more inference time. These findings suggest that Surformer v1 offers a compelling balance between accuracy, efficiency, and computational cost for surface material recognition. The results also underscore the effectiveness of integrating feature learning, cross-modal attention, and transformer-based fusion in capturing the complementary strengths of tactile and visual modalities. Full article
(This article belongs to the Special Issue AI-Based Image Processing and Computer Vision)
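A hedged sketch of the feature-preparation step described above: PCA compresses (assumed precomputed) ResNet-50 embeddings, and each modality then becomes one token in a tiny transformer encoder. The sample counts, dimensions, and classifier head are invented; this is not Surformer v1 itself.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

# Assume 500 samples with precomputed ResNet-50 embeddings (2048-d) and 12 tactile features.
vis_emb = np.random.randn(500, 2048).astype(np.float32)
tactile = np.random.randn(500, 12).astype(np.float32)

vis_pca = PCA(n_components=32).fit_transform(vis_emb)       # compact visual descriptor

class TinyFusionEncoder(nn.Module):
    """Each modality becomes one token; a transformer encoder layer lets them interact."""
    def __init__(self, d_model=64, n_classes=8):
        super().__init__()
        self.vis_proj = nn.Linear(32, d_model)
        self.tac_proj = nn.Linear(12, d_model)
        self.encoder = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, vis, tac):
        tokens = torch.stack([self.vis_proj(vis), self.tac_proj(tac)], dim=1)   # (B, 2, d)
        return self.head(self.encoder(tokens).mean(dim=1))                      # class logits

logits = TinyFusionEncoder()(torch.from_numpy(vis_pca.astype(np.float32)),
                             torch.from_numpy(tactile))
print(logits.shape)   # torch.Size([500, 8])
```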

20 pages, 1488 KB  
Article
Attention-Fusion-Based Two-Stream Vision Transformer for Heart Sound Classification
by Kalpeshkumar Ranipa, Wei-Ping Zhu and M. N. S. Swamy
Bioengineering 2025, 12(10), 1033; https://doi.org/10.3390/bioengineering12101033 - 26 Sep 2025
Viewed by 467
Abstract
Vision Transformers (ViTs), inspired by their success in natural language processing, have recently gained attention for heart sound classification (HSC). However, most of the existing studies on HSC rely on single-stream architectures, overlooking the advantages of multi-resolution features. While multi-stream architectures employing early or late fusion strategies have been proposed, they often fall short of effectively capturing cross-modal feature interactions. Additionally, conventional fusion methods, such as concatenation, averaging, or max pooling, frequently result in information loss. To address these limitations, this paper presents a novel attention fusion-based two-stream Vision Transformer (AFTViT) architecture for HSC that leverages two-dimensional mel-cepstral domain features. The proposed method employs a ViT-based encoder to capture long-range dependencies and diverse contextual information at multiple scales. A novel attention block is then used to integrate cross-context features at the feature level, enhancing the overall feature representation. Experiments conducted on the PhysioNet2016 and PhysioNet2022 datasets demonstrate that the AFTViT outperforms state-of-the-art CNN-based methods in terms of accuracy. These results highlight the potential of the AFTViT framework for early diagnosis of cardiovascular diseases, offering a valuable tool for cardiologists and researchers in developing advanced HSC techniques. Full article
(This article belongs to the Section Biosignal Processing)
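As a minimal alternative to concatenation, averaging, or max pooling, the sketch below learns per-sample attention weights over two feature streams and blends them. It conveys the general idea of attention-based stream fusion only; the AFTViT attention block itself is not specified here, and all shapes are illustrative.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Learn per-sample weights for two feature streams instead of concatenating them."""
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, f1, f2):                        # both (B, dim)
        stacked = torch.stack([f1, f2], dim=1)        # (B, 2, dim)
        weights = self.score(stacked).softmax(dim=1)  # (B, 2, 1) stream-level attention
        return (weights * stacked).sum(dim=1)         # (B, dim) fused representation

stream_a = torch.randn(4, 256)    # stand-in features from one mel-cepstral resolution
stream_b = torch.randn(4, 256)    # stand-in features from another resolution
print(AttentionFusion()(stream_a, stream_b).shape)    # torch.Size([4, 256])
```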
