Search Results (131)

Search Parameters:
Keywords = cross-modal alignment

35 pages, 1458 KiB  
Article
User Comment-Guided Cross-Modal Attention for Interpretable Multimodal Fake News Detection
by Zepu Yi, Chenxu Tang and Songfeng Lu
Appl. Sci. 2025, 15(14), 7904; https://doi.org/10.3390/app15147904 - 15 Jul 2025
Viewed by 218
Abstract
The proliferation of fake news in the digital age poses a pressing challenge, with profound and harmful impacts on societal structures: it misguides public opinion, erodes social trust, and exacerbates social polarization. Current fake news detection methods are largely limited to superficial text analysis or basic text–image integration, which face significant limitations in accurately identifying deceptive information. To bridge this gap, we propose the UC-CMAF framework, which comprehensively integrates news text, images, and user comments through an adaptive co-attention fusion mechanism. The UC-CMAF workflow consists of four key subprocesses: multimodal feature extraction, cross-modal adaptive collaborative attention fusion of news text and images, cross-modal attention fusion of user comments with news text and images, and finally, input of the fused features into a fake news detector. Specifically, we introduce multi-head cross-modal attention heatmaps and comment importance visualizations to provide interpretability support for the model's predictions, revealing key semantic areas and user perspectives that influence judgments. Through the cross-modal adaptive collaborative attention mechanism, UC-CMAF achieves deep semantic alignment between news text and images and uses social signals from user comments to build an enhanced credibility evaluation path, offering a new paradigm for interpretable fake information detection. Experimental results demonstrate that UC-CMAF consistently outperforms 15 baseline models across two benchmark datasets, achieving F1 scores of 0.894 and 0.909. These results validate the effectiveness of its adaptive cross-modal attention mechanism and the incorporation of user comments in enhancing both detection accuracy and interpretability.
(This article belongs to the Special Issue Explainable Artificial Intelligence Technology and Its Applications)
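To make the co-attention idea concrete, here is a minimal sketch of cross-modal attention in which text tokens attend to image patches (PyTorch; the module name and tensor shapes are illustrative assumptions, not the authors' implementation). The returned attention weights are the kind of signal that can be rendered as the interpretability heatmaps the abstract mentions:

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Minimal cross-modal attention: text tokens attend to image patches."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text, image):
        # text: (B, T, D) queries; image: (B, P, D) keys/values
        fused, weights = self.attn(text, image, image)
        return self.norm(text + fused), weights  # weights can be drawn as a heatmap

text = torch.randn(2, 32, 256)     # hypothetical news-text token features
image = torch.randn(2, 49, 256)    # hypothetical 7x7 image-patch features
fused, heatmap = CrossModalAttention()(text, image)
print(fused.shape, heatmap.shape)  # (2, 32, 256) and (2, 32, 49)
```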

21 pages, 1118 KiB  
Review
Integrating Large Language Models into Robotic Autonomy: A Review of Motion, Voice, and Training Pipelines
by Yutong Liu, Qingquan Sun and Dhruvi Rajeshkumar Kapadia
AI 2025, 6(7), 158; https://doi.org/10.3390/ai6070158 - 15 Jul 2025
Viewed by 780
Abstract
This survey provides a comprehensive review of the integration of large language models (LLMs) into autonomous robotic systems, organized around four key pillars: locomotion, navigation, manipulation, and voice-based interaction. We examine how LLMs enhance robotic autonomy by translating high-level natural language commands into low-level control signals, supporting semantic planning and enabling adaptive execution. Systems like SayTap improve gait stability through LLM-generated contact patterns, while TrustNavGPT achieves a 5.7% word error rate (WER) under noisy voice-guided conditions by modeling user uncertainty. Frameworks such as MapGPT, LLM-Planner, and 3D-LOTUS++ integrate multi-modal data—including vision, speech, and proprioception—for robust planning and real-time recovery. We also highlight the use of physics-informed neural networks (PINNs) to model object deformation and support precision in contact-rich manipulation tasks. To bridge the gap between simulation and real-world deployment, we synthesize best practices from benchmark datasets (e.g., RH20T, Open X-Embodiment) and training pipelines designed for one-shot imitation learning and cross-embodiment generalization. Additionally, we analyze deployment trade-offs across cloud, edge, and hybrid architectures, emphasizing latency, scalability, and privacy. The survey concludes with a multi-dimensional taxonomy and cross-domain synthesis, offering design insights and future directions for building intelligent, human-aligned robotic systems powered by LLMs.
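As a generic illustration of the command-translation pipeline the review surveys, the sketch below asks an LLM for a structured action and validates it before it reaches a controller. The prompt, JSON schema, and field names are invented for illustration; no specific system from the review is reproduced:

```python
import json

PROMPT = """Translate the instruction into JSON with keys
'linear_velocity' (m/s), 'angular_velocity' (rad/s), and 'gait' (walk|trot).
Instruction: {cmd}"""

def parse_llm_reply(reply: str) -> dict:
    """Validate the model's JSON before it ever reaches the low-level controller."""
    action = json.loads(reply)
    assert set(action) == {"linear_velocity", "angular_velocity", "gait"}
    return action

print(PROMPT.format(cmd="walk forward slowly"))
# A reply from any chat-completion API might look like:
reply = '{"linear_velocity": 0.3, "angular_velocity": 0.0, "gait": "walk"}'
print(parse_llm_reply(reply))
```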

21 pages, 12122 KiB  
Article
RA3T: An Innovative Region-Aligned 3D Transformer for Self-Supervised Sim-to-Real Adaptation in Low-Altitude UAV Vision
by Xingrao Ma, Jie Xie, Di Shao, Aiting Yao and Chengzu Dong
Electronics 2025, 14(14), 2797; https://doi.org/10.3390/electronics14142797 - 11 Jul 2025
Viewed by 209
Abstract
Low-altitude unmanned aerial vehicle (UAV) vision is critically hindered by the Sim-to-Real Gap, where models trained exclusively on simulation data degrade under real-world variations in lighting, texture, and weather. To address this problem, we propose RA3T (Region-Aligned 3D Transformer), a novel self-supervised framework that enables robust Sim-to-Real adaptation. Specifically, we first develop a dual-branch strategy for self-supervised feature learning, integrating Masked Autoencoders and contrastive learning. This approach extracts domain-invariant representations from unlabeled simulated imagery to enhance robustness against occlusion while reducing annotation dependency. Leveraging these learned features, we then introduce a 3D Transformer fusion module that unifies multi-view RGB and LiDAR point clouds through cross-modal attention. By explicitly modeling spatial layouts and height differentials, this component significantly improves recognition of small and occluded targets in complex low-altitude environments. To address persistent fine-grained domain shifts, we finally design region-level adversarial calibration that deploys local discriminators on partitioned feature maps. This mechanism directly aligns texture, shadow, and illumination discrepancies which challenge conventional global alignment methods. Extensive experiments on UAV benchmarks VisDrone and DOTA demonstrate the effectiveness of RA3T. The framework achieves +5.1% mAP on VisDrone and +7.4% mAP on DOTA over the 2D adversarial baseline, particularly on small objects and sparse occlusions, while maintaining real-time performance of 17 FPS at 1024 × 1024 resolution on an RTX 4080 GPU. Visual analysis confirms that the synergistic integration of 3D geometric encoding and local adversarial alignment effectively mitigates domain gaps caused by uneven illumination and perspective variations, establishing an efficient pathway for simulation-to-reality UAV perception.
(This article belongs to the Special Issue Innovative Technologies and Services for Unmanned Aerial Vehicles)
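The contrastive half of the dual-branch self-supervision can be summarized by the standard InfoNCE objective. This is a minimal sketch under the usual assumption that two augmented views of the same image form the positive pair, not RA3T's exact loss:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.07):
    """Contrastive loss between two augmented views; positives on the diagonal."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                    # (B, B) similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # view i matches view i
    return F.cross_entropy(logits, targets)

z1, z2 = torch.randn(16, 128), torch.randn(16, 128)  # embeddings of two views
print(info_nce(z1, z2).item())
```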

26 pages, 2178 KiB  
Article
Cross-Modal Fake News Detection Method Based on Multi-Level Fusion Without Evidence
by Ping He, Hanxue Zhang, Shufu Cao and Yali Wu
Algorithms 2025, 18(7), 426; https://doi.org/10.3390/a18070426 - 10 Jul 2025
Viewed by 284
Abstract
Multimodal feature fusion in fake news detection can integrate complementary information from different modalities, but semantic inconsistency across modalities makes fusion difficult, and a single fusion pass loses information. In addition, although detection can be improved by drawing on external evidence, such evidence arrives with a lag, its reliability and completeness are hard to guarantee, and it may introduce noise that interferes with the model's judgment. Therefore, a cross-modal fake news detection method based on evidence-free multi-level fusion (CM-MLF) is proposed. The method resolves semantic inconsistency through cross-modal alignment and uses attention mechanisms to fuse text and image features at multiple levels, without the assistance of external evidential features, further enhancing the expressive power of the features. Experiments show that the method achieves better detection results on multiple benchmark datasets, effectively improving the accuracy and robustness of cross-modal fake news detection.
(This article belongs to the Special Issue Algorithms for Feature Selection (3rd Edition))
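A minimal sketch of what multi-level fusion without evidence can look like: one level of bidirectional cross-modal attention followed by a learned gate over the pooled modalities. The module and its dimensions are assumptions, not the CM-MLF architecture:

```python
import torch
import torch.nn as nn

class MultiLevelFusion(nn.Module):
    """Level 1: bidirectional cross-modal attention. Level 2: gated merge."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.t2i = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.i2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, text, image):           # (B, T, D) and (B, P, D)
        t, _ = self.t2i(text, image, image)   # text enriched by image
        i, _ = self.i2t(image, text, text)    # image enriched by text
        t, i = t.mean(1), i.mean(1)           # pool each enriched modality
        g = self.gate(torch.cat([t, i], -1))  # learned per-dimension gate
        return g * t + (1 - g) * i            # fused feature for the detector

fused = MultiLevelFusion()(torch.randn(2, 32, 256), torch.randn(2, 49, 256))
print(fused.shape)  # (2, 256)
```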

26 pages, 92114 KiB  
Article
Multi-Modal Remote Sensing Image Registration Method Combining Scale-Invariant Feature Transform with Co-Occurrence Filter and Histogram of Oriented Gradients Features
by Yi Yang, Shuo Liu, Haitao Zhang, Dacheng Li and Ling Ma
Remote Sens. 2025, 17(13), 2246; https://doi.org/10.3390/rs17132246 - 30 Jun 2025
Viewed by 330
Abstract
Multi-modal remote sensing images often exhibit complex and nonlinear radiation differences, which significantly hinder the performance of traditional feature-based image registration methods such as Scale-Invariant Feature Transform (SIFT). In contrast, structural features—such as edges and contours—remain relatively consistent across modalities. To address this challenge, we propose a novel multi-modal image registration method, Cof-SIFT, which integrates a co-occurrence filter with SIFT. By replacing the traditional Gaussian filter with a co-occurrence filter, Cof-SIFT effectively suppresses texture variations while preserving structural information, thereby enhancing robustness to cross-modal differences. To further improve image registration accuracy, we introduce an extended approach, Cof-SIFT_HOG, which extracts Histogram of Oriented Gradients (HOG) features from the image gradient magnitude map of corresponding points and refines their positions based on HOG similarity. This refinement yields more precise alignment between the reference image and the image to be registered. We evaluated Cof-SIFT and Cof-SIFT_HOG on a diverse set of multi-modal remote sensing image pairs. The experimental results demonstrate that both methods outperform existing approaches, including SIFT, COFSM, SAR-SIFT, PSO-SIFT, and OS-SIFT, in terms of robustness and registration accuracy. Notably, Cof-SIFT_HOG achieves the highest overall performance, confirming the effectiveness of the proposed structure-preserving filtering and corresponding-point refinement strategies in cross-modal registration tasks.
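A rough sketch of the pipeline's two ideas, with an edge-preserving bilateral filter standing in for the co-occurrence filter (which has no off-the-shelf OpenCV call) and HOG similarity used to screen SIFT matches; the file names and the 0.7 threshold are placeholders:

```python
import cv2
import numpy as np
from skimage.feature import hog

img1 = cv2.imread("optical.png", cv2.IMREAD_GRAYSCALE)  # placeholder paths
img2 = cv2.imread("sar.png", cv2.IMREAD_GRAYSCALE)

# Stand-in for the co-occurrence filter: edge-preserving smoothing
smooth1 = cv2.bilateralFilter(img1, 9, 75, 75)
smooth2 = cv2.bilateralFilter(img2, 9, 75, 75)

sift = cv2.SIFT_create()
k1, d1 = sift.detectAndCompute(smooth1, None)
k2, d2 = sift.detectAndCompute(smooth2, None)
matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(d1, d2)

def hog_similarity(imA, imB, ptA, ptB, win=32):
    """Cosine similarity of HOG descriptors around two matched keypoints."""
    ax, ay, bx, by = int(ptA[0]), int(ptA[1]), int(ptB[0]), int(ptB[1])
    pa = imA[ay - win // 2:ay + win // 2, ax - win // 2:ax + win // 2]
    pb = imB[by - win // 2:by + win // 2, bx - win // 2:bx + win // 2]
    if pa.shape != (win, win) or pb.shape != (win, win):
        return -1.0  # keypoint too close to the image border
    ha, hb = hog(pa), hog(pb)
    return float(ha @ hb / (np.linalg.norm(ha) * np.linalg.norm(hb) + 1e-8))

good = [m for m in matches
        if hog_similarity(img1, img2, k1[m.queryIdx].pt, k2[m.trainIdx].pt) > 0.7]
```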

14 pages, 921 KiB  
Article
Physiotherapy Management of Plantar Fasciitis: A National Cross-Sectional Survey in Saudi Arabia
by Abdulmajeed Muhaysin Alnefaie, Hosam Alzahrani and Mansour Abdullah Alshehri
J. Clin. Med. 2025, 14(13), 4584; https://doi.org/10.3390/jcm14134584 - 27 Jun 2025
Viewed by 431
Abstract
Background/Objectives: Plantar fasciitis is the most common cause of heel pain, affecting 4–7% of the general population. Physiotherapy is a key component of conservative management. However, there is limited evidence on how physiotherapists in Saudi Arabia manage this condition. This study aimed to investigate current physiotherapy practices for plantar fasciitis in Saudi Arabia and assess their alignment with international clinical guidelines. Methods: A cross-sectional survey was conducted among licensed physiotherapists practicing in Saudi Arabia who had treated patients with plantar fasciitis. An online questionnaire, adapted from a validated UK-based survey, gathered data on participant demographics, service characteristics, diagnostic criteria, treatment goals, outcome measures, and intervention strategies. Descriptive statistics were used for analysis. Results: A total of 399 physiotherapists participated. Diagnosis was mainly based on clinical signs such as pain during plantar fascia stretch (72.9%), early morning pain (70.4%), and medial heel tenderness (69.4%). Common goals of intervention included pain reduction (93.4%), functional improvement (69.9%), and patient education (57.3%). Pain scales (74.9%) and functional tests (49.1%) were the most frequently used outcome measures. Interventions such as exercise therapy (92.0%), stretching (89.4%), and strengthening (84.7%) were widely used. More advanced modalities like shockwave therapy and dry needling were less frequently reported. Conclusions: Physiotherapy practices largely align with international guidelines. However, variation in outcome assessments and underuse of advanced modalities indicate the need for national clinical guidelines and targeted training programs. These steps may promote more consistent, evidence-based care and improve patient outcomes in Saudi Arabia.

17 pages, 7434 KiB  
Article
Cell-Type Annotation for scATAC-Seq Data by Integrating Chromatin Accessibility and Genome Sequence
by Guo Wei, Long Wang, Yan Liu and Xiaohui Zhang
Biomolecules 2025, 15(7), 938; https://doi.org/10.3390/biom15070938 - 27 Jun 2025
Viewed by 411
Abstract
Single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) technology enables single-cell resolution analysis of chromatin accessibility, offering critical insights into gene regulation, epigenetic heterogeneity, and cellular differentiation across various biological contexts. However, existing cell annotation methods face notable limitations. Cross-omics approaches, which rely on single-cell RNA sequencing (scRNA-seq) as a reference, often struggle with data alignment due to fundamental differences between transcriptional and chromatin accessibility modalities. Meanwhile, intra-omics methods, which rely solely on scATAC-seq data, are frequently affected by batch effects and fail to fully utilize genomic sequence information for accurate annotation. To address these challenges, we propose scAttG, a novel deep learning framework that integrates graph attention networks (GATs) and convolutional neural networks (CNNs) to capture both chromatin accessibility signals and genomic sequence features. By utilizing the nucleotide sequences corresponding to scATAC-seq peaks, scAttG enhances both the robustness and accuracy of cell-type annotation. Experimental results across multiple scATAC-seq datasets suggest that scAttG generally performs favorably compared to existing methods, showing competitive performance in single-cell chromatin accessibility-based cell-type annotation.
(This article belongs to the Section Molecular Biology)
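Two ingredients of the approach, attention over a cell graph and a CNN over peak sequences, can be sketched in plain PyTorch as follows (layer sizes, the graph construction, and both class names are assumptions rather than scAttG's published architecture):

```python
import torch
import torch.nn as nn

class PeakSeqCNN(nn.Module):
    """1-D CNN over a one-hot DNA sequence of a peak (A, C, G, T channels)."""
    def __init__(self, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(4, 32, kernel_size=8), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1), nn.Flatten(), nn.Linear(32, out_dim))

    def forward(self, seq_onehot):   # (B, 4, L)
        return self.net(seq_onehot)

class CellGraphAttention(nn.Module):
    """One graph-attention step over a cell-cell similarity graph."""
    def __init__(self, dim):
        super().__init__()
        self.q, self.k = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, x, adj):       # x: (N, D) cells, adj: (N, N) 0/1 mask
        scores = self.q(x) @ self.k(x).t() / x.size(1) ** 0.5
        scores = scores.masked_fill(adj == 0, float("-inf"))
        return torch.softmax(scores, -1) @ x   # attention-weighted neighbors

cells = torch.randn(100, 64)
adj = (torch.rand(100, 100) < 0.05).float()
adj.fill_diagonal_(1.0)                          # self-loops keep softmax finite
print(CellGraphAttention(64)(cells, adj).shape)  # (100, 64)
```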

28 pages, 11793 KiB  
Article
Unsupervised Multimodal UAV Image Registration via Style Transfer and Cascade Network
by Xiaoye Bi, Rongkai Qie, Chengyang Tao, Zhaoxiang Zhang and Yuelei Xu
Remote Sens. 2025, 17(13), 2160; https://doi.org/10.3390/rs17132160 - 24 Jun 2025
Cited by 1 | Viewed by 346
Abstract
Cross-modal image registration for unmanned aerial vehicle (UAV) platforms presents significant challenges due to large-scale deformations, distinct imaging mechanisms, and pronounced modality discrepancies. This paper proposes a novel multi-scale cascaded registration network based on style transfer that achieves superior performance: up to 67% reduction in mean squared error (from 0.0106 to 0.0068), 9.27% enhancement in normalized cross-correlation, 26% improvement in local normalized cross-correlation, and 8% increase in mutual information compared to state-of-the-art methods. The architecture integrates a cross-modal style transfer network (CSTNet) that transforms visible images into pseudo-infrared representations to unify modality characteristics, and a multi-scale cascaded registration network (MCRNet) that performs progressive spatial alignment across multiple resolution scales using diffeomorphic deformation modeling to ensure smooth and invertible transformations. A self-supervised learning paradigm based on image reconstruction eliminates reliance on manually annotated data while maintaining registration accuracy through synthetic deformation generation. Extensive experiments on the LLVIP dataset demonstrate the method's robustness under challenging conditions involving large-scale transformations, with ablation studies confirming that style transfer contributes 28% MSE improvement and diffeomorphic registration prevents 10.6% performance degradation. The proposed approach provides a robust solution for cross-modal image registration in dynamic UAV environments, offering significant implications for downstream applications such as target detection, tracking, and surveillance.
(This article belongs to the Special Issue Advances in Deep Learning Approaches: UAV Data Analysis)
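A minimal sketch of the registration step: warping a style-transferred image with a predicted displacement field via grid_sample. In a cascade this would be repeated from coarse to fine scales; shapes and magnitudes are illustrative, and true diffeomorphic integration is omitted:

```python
import torch
import torch.nn.functional as F

def warp(moving, flow):
    """Warp a (B, C, H, W) image with a (B, 2, H, W) displacement field."""
    B, _, H, W = moving.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    base = torch.stack((xs, ys), -1).unsqueeze(0).expand(B, -1, -1, -1)
    grid = base + flow.permute(0, 2, 3, 1)   # identity grid + displacements
    return F.grid_sample(moving, grid, align_corners=True)

pseudo_ir = torch.randn(1, 1, 128, 128)    # style-transferred visible image
flow = 0.05 * torch.randn(1, 2, 128, 128)  # one cascade stage's field
aligned = warp(pseudo_ir, flow)
print(aligned.shape)  # (1, 1, 128, 128)
```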

27 pages, 4737 KiB  
Article
Context-Aware Multimodal Fusion with Sensor-Augmented Cross-Modal Learning: The BLAF Architecture for Robust Chinese Homophone Disambiguation in Dynamic Environments
by Yu Sun, Yihang Qin, Wenhao Chen, Xuan Li and Chunlian Li
Appl. Sci. 2025, 15(13), 7068; https://doi.org/10.3390/app15137068 - 23 Jun 2025
Viewed by 513
Abstract
Chinese, a tonal language with inherent homophonic ambiguity, poses significant challenges for semantic disambiguation in natural language processing (NLP), hindering applications like speech recognition, dialog systems, and assistive technologies. Traditional static disambiguation methods suffer from poor adaptability in dynamic environments and low-frequency scenarios, limiting their real-world utility. To address these limitations, we propose BLAF—a novel MacBERT-BiLSTM Hybrid Architecture—that synergizes global semantic understanding with local sequential dependencies through dynamic multimodal feature fusion. This framework incorporates innovative mechanisms for the principled weighting of heterogeneous features, effective alignment of representations, and sensor-augmented cross-modal learning to enhance robustness, particularly in noisy environments. Employing a staged optimization strategy, BLAF achieves state-of-the-art performance on the SIGHAN 2015 benchmark (with data fine-tuning and supplementation): 93.37% accuracy and 93.25% F1 score, surpassing pure BERT by 15.74% in accuracy. Ablation studies confirm the critical contributions of the integrated components. Furthermore, the sensor-augmented module significantly improves robustness under noise (raising speech SNR to 18.6 dB at 75 dB ambient noise and reducing word error rates by 12.7%). By bridging gaps among tonal phonetics, contextual semantics, and computational efficiency, BLAF establishes a scalable paradigm for robust Chinese homophone disambiguation in industrial NLP applications. This work advances cognitive intelligence in Chinese NLP and provides a blueprint for adaptive disambiguation in resource-constrained and dynamic scenarios.
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications—2nd Edition)
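The hybrid's core, BERT-style global features plus a BiLSTM for local order merged by a learned gate, can be sketched as below using the public hfl/chinese-macbert-base checkpoint. The pooling and gate design are assumptions, not BLAF's exact mechanism, and the sensor-augmented branch is omitted:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertBiLSTM(nn.Module):
    """Global (MacBERT) and local (BiLSTM) features merged by a learned gate."""
    def __init__(self, name="hfl/chinese-macbert-base", hidden=256):
        super().__init__()
        self.bert = AutoModel.from_pretrained(name)
        d = self.bert.config.hidden_size
        self.lstm = nn.LSTM(d, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, d)
        self.gate = nn.Sequential(nn.Linear(2 * d, 1), nn.Sigmoid())

    def forward(self, **enc):
        h = self.bert(**enc).last_hidden_state    # (B, T, D) global semantics
        s, _ = self.lstm(h)                       # (B, T, 2H) local dependencies
        hp, sp = h.mean(1), self.proj(s).mean(1)  # pooled global / local vectors
        g = self.gate(torch.cat([hp, sp], -1))    # dynamic fusion weight
        return g * hp + (1 - g) * sp

tok = AutoTokenizer.from_pretrained("hfl/chinese-macbert-base")
enc = tok(["他要去银行"], return_tensors="pt")
print(BertBiLSTM()(**enc).shape)  # (1, 768)
```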

22 pages, 2610 KiB  
Article
Multi-Modal Entity Alignment Based on Enhanced Relationship Learning and Multi-Layer Feature Fusion
by Huayu Li, Yujie Hou, Jing Liu, Peiying Zhang, Cuicui Wang and Kai Liu
Symmetry 2025, 17(7), 990; https://doi.org/10.3390/sym17070990 - 23 Jun 2025
Viewed by 340
Abstract
Entity alignment is a critical technique for integrating diverse knowledge graphs. Although existing methods have achieved impressive success in traditional entity alignment, they may struggle to handle the complexities arising from interactions and dependencies in multi-modal knowledge. In this paper, a novel multi-modal entity alignment model called ERMF is proposed, which leverages distinct modal characteristics of entities to identify equivalent entities across different multi-modal knowledge graphs. The symmetry in cross-modal interactions and hierarchical feature fusion is a core design principle of our approach. Specifically, we first utilize different feature encoders to independently extract features from different modalities. Concurrently, visual features and nearest neighbor negative sampling methods are incorporated to design a vision-guided negative sample generation strategy based on contrastive learning, ensuring a symmetric balance between positive and negative samples and guiding the model to learn effective relationship embeddings. Subsequently, in the feature fusion stage, we propose a multi-layer feature fusion approach that incorporates cross-attention and cross-modal attention mechanisms with symmetric processing of intra- and inter-modal correlations, thereby obtaining multi-granularity features. Extensive experiments were conducted on two public datasets, namely FB15K-DB15K and FB15K-YAGO15K. With 20% aligned seeds, ERMF improves Hits@1 by 8.4% and 26%, and MRR by 6% and 19.2% compared to the best baseline. The symmetric architecture of our model ensures the robust and balanced utilization of multi-modal information, aligning with the principles of structural and functional symmetry in knowledge integration.
(This article belongs to the Section Computer)
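The vision-guided negative sampling the abstract describes can be approximated as follows: treat each entity's visually nearest non-matches as hard negatives inside a margin loss. The distance metric, k, and margin are assumptions, not ERMF's published loss:

```python
import torch
import torch.nn.functional as F

def vision_guided_negatives(vis, k=5):
    """Each entity's k visually nearest *other* entities become hard negatives."""
    sim = F.normalize(vis, dim=1) @ F.normalize(vis, dim=1).t()
    sim.fill_diagonal_(float("-inf"))  # an entity cannot be its own negative
    return sim.topk(k, dim=1).indices  # (N, k)

def alignment_loss(src, tgt, neg_idx, margin=0.5):
    """Aligned pairs must sit closer than visually similar negatives."""
    pos = (src - tgt).norm(dim=1)                        # (N,)
    neg = (src.unsqueeze(1) - tgt[neg_idx]).norm(dim=2)  # (N, k)
    return F.relu(pos.unsqueeze(1) - neg + margin).mean()

vis = torch.randn(100, 64)                            # entity visual features
src, tgt = torch.randn(100, 64), torch.randn(100, 64) # embeddings per graph
print(alignment_loss(src, tgt, vision_guided_negatives(vis)).item())
```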

19 pages, 5602 KiB  
Article
PnPDA+: A Meta Feature-Guided Domain Adapter for Collaborative Perception
by Liang Xin, Guangtao Zhou, Zhaoyang Yu, Danni Wang, Tianyou Luo, Xiaoyuan Fu and Jinglin Li
World Electr. Veh. J. 2025, 16(7), 343; https://doi.org/10.3390/wevj16070343 - 21 Jun 2025
Viewed by 276
Abstract
Although cooperative perception enhances situational awareness by enabling vehicles to share intermediate features, real-world deployment faces challenges due to heterogeneity in sensor modalities, architectures, and encoder parameters across agents. These domain gaps often result in semantic inconsistencies among the shared features, thereby degrading the quality of feature fusion. Existing approaches either necessitate the retraining of private models or fail to adapt to newly introduced agents. To address these limitations, we propose PnPDA+, a unified and modular domain adaptation framework designed for heterogeneous multi-vehicle cooperative perception. PnPDA+ consists of two key components: a Meta Feature Extraction Network (MFEN) and a Plug-and-Play Domain Adapter (PnPDA). MFEN extracts domain-aware and frame-aware meta features from received heterogeneous features, encoding domain-specific knowledge and spatial-temporal cues to serve as high-level semantic priors. Guided by these meta features, the PnPDA module performs adaptive semantic conversion to enhance cross-agent feature alignment without modifying existing perception models. This design ensures the scalable integration of emerging vehicles with minimal fine-tuning, significantly improving both semantic consistency and generalization. Experiments on OPV2V show that PnPDA+ outperforms state-of-the-art methods by 4.08% in perception accuracy while preserving model integrity and scalability.
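One common way to realize semantic conversion guided by a meta feature is FiLM-style modulation: the meta vector predicts per-channel scales and shifts applied to the received feature map. The sketch below illustrates that idea only; channel counts and the adapter's internals are assumptions, not the paper's design:

```python
import torch
import torch.nn as nn

class PlugAndPlayAdapter(nn.Module):
    """FiLM-style adapter: a meta vector modulates a foreign feature map."""
    def __init__(self, channels=64, meta_dim=32):
        super().__init__()
        self.film = nn.Linear(meta_dim, 2 * channels)   # scales and shifts
        self.refine = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feat, meta):   # feat: (B, C, H, W), meta: (B, meta_dim)
        scale, shift = self.film(meta).chunk(2, dim=1)
        feat = feat * (1 + scale[..., None, None]) + shift[..., None, None]
        return self.refine(feat)     # converted toward the ego agent's domain

feat = torch.randn(2, 64, 32, 32)   # received heterogeneous BEV features
meta = torch.randn(2, 32)           # domain-aware meta feature
print(PlugAndPlayAdapter()(feat, meta).shape)  # (2, 64, 32, 32)
```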

12 pages, 3214 KiB  
Article
Singular Value Decomposition (SVD) Method for LiDAR and Camera Sensor Fusion and Pattern Matching Algorithm
by Kaiqiao Tian, Meiqi Song, Ka C. Cheok, Micho Radovnikovich, Kazuyuki Kobayashi and Changqing Cai
Sensors 2025, 25(13), 3876; https://doi.org/10.3390/s25133876 - 21 Jun 2025
Viewed by 680
Abstract
LiDAR and camera sensors are widely utilized in autonomous vehicles (AVs) and robotics due to their complementary sensing capabilities—LiDAR provides precise depth information, while cameras capture rich visual context. However, effective multi-sensor fusion remains challenging due to discrepancies in resolution, data format, and viewpoint. In this paper, we propose a robust pattern matching algorithm that leverages singular value decomposition (SVD) and gradient descent (GD) to align geometric features—such as object contours and convex hulls—across LiDAR and camera modalities. Unlike traditional calibration methods that require manual targets, our approach is targetless, extracting matched patterns from projected LiDAR point clouds and 2D image segments. The algorithm computes the optimal transformation matrix between sensors, correcting misalignments in rotation, translation, and scale. Experimental results on a vehicle-mounted sensing platform demonstrate an alignment accuracy improvement of up to 85%, with the final projection error reduced to less than 1 pixel. This pattern-based SVD-GD framework offers a practical solution for maintaining reliable cross-sensor alignment and fusion under long-term calibration drift, enabling real-time perception systems in autonomous driving to operate robustly without recalibration.
(This article belongs to the Special Issue Recent Advances in LiDAR Sensor)
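The SVD step at the heart of such alignment is the classic Kabsch/Procrustes solution for the rotation and translation between matched point sets; scale and residual refinement would then fall to the gradient-descent stage. A minimal NumPy sketch on synthetic points, not the paper's data flow:

```python
import numpy as np

def kabsch(P, Q):
    """Best-fit rotation R and translation t mapping point set P onto Q."""
    cP, cQ = P.mean(0), Q.mean(0)
    H = (P - cP).T @ (Q - cQ)               # cross-covariance of centered sets
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against a reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cQ - R @ cP

theta = np.pi / 6                           # synthetic ground-truth rotation
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
P = np.random.rand(50, 3)                   # e.g. matched contour points
Q = P @ R_true.T + np.array([0.1, -0.2, 0.05])
R, t = kabsch(P, Q)
print(np.allclose(P @ R.T + t, Q, atol=1e-6))  # True
```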

22 pages, 44010 KiB  
Article
SMM-POD: Panoramic 3D Object Detection via Spherical Multi-Stage Multi-Modal Fusion
by Jinghan Zhang, Yusheng Yang, Zhiyuan Gao, Hang Shi and Yangmin Xie
Remote Sens. 2025, 17(12), 2089; https://doi.org/10.3390/rs17122089 - 18 Jun 2025
Viewed by 490
Abstract
Panoramic 3D object detection is a challenging task due to image distortion, sensor heterogeneity, and the difficulty of combining information from multiple modalities over a wide field-of-view (FoV). To address these issues, we propose SMM-POD, a novel framework that introduces a spherical multi-stage fusion strategy for panoramic 3D detection. Our approach creates a five-channel spherical image aligned with LiDAR data and uses a quasi-uniform Voronoi sphere (UVS) model to reduce projection distortion. A cross-attention-based feature extraction module and a transformer encoder–decoder with spherical positional encoding enable the accurate and efficient fusion of image and point cloud features. For precise 3D localization, we adopt a Frustum PointNet module. Experiments on the DAIR-V2X-I benchmark and our self-collected SHU-3DPOD dataset show that SMM-POD achieves a state-of-the-art performance across all object categories. It significantly improves the detection of small objects like cyclists and pedestrians and maintains stable results under various environmental conditions. These results demonstrate the effectiveness of SMM-POD in panoramic multi-modal 3D perception and establish it as a strong baseline for wide FoV object detection.
(This article belongs to the Section Urban Remote Sensing)
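The spherical-image construction can be sketched as an equirectangular projection of LiDAR points. Here only two channels (range, intensity) are filled, whereas the paper's image has five channels, and the quasi-uniform Voronoi model is not reproduced:

```python
import numpy as np

def spherical_project(points, H=256, W=1024):
    """Scatter LiDAR points (N, 4: x, y, z, intensity) onto a spherical grid."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points[:, :3], axis=1)
    az = np.arctan2(y, x)                    # azimuth in [-pi, pi]
    el = np.arcsin(z / np.maximum(r, 1e-6))  # elevation
    u = ((az + np.pi) / (2 * np.pi) * W).astype(int) % W
    v = ((np.pi / 2 - el) / np.pi * H).clip(0, H - 1).astype(int)
    img = np.zeros((2, H, W), np.float32)    # channels: range, intensity
    img[0, v, u] = r
    img[1, v, u] = points[:, 3]
    return img

pts = np.random.randn(10000, 4)
print(spherical_project(pts).shape)  # (2, 256, 1024)
```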

28 pages, 4916 KiB  
Article
Research on Bearing Fault Diagnosis Method for Varying Operating Conditions Based on Spatiotemporal Feature Fusion
by Jin Wang, Yan Wang, Junhui Yu, Qingping Li, Hailin Wang and Xinzhi Zhou
Sensors 2025, 25(12), 3789; https://doi.org/10.3390/s25123789 - 17 Jun 2025
Viewed by 385
Abstract
In real-world scenarios, the rotational speed of bearings is variable. As operating conditions change, the feature distribution of bearing vibration data becomes inconsistent, so a model trained under one operating condition (source domain) cannot be applied directly to another condition (target domain). Furthermore, the lack of sufficient labeled data in the target domain further complicates fault diagnosis under varying operating conditions. To address this issue, this paper proposes a spatiotemporal feature fusion domain-adaptive network (STFDAN) framework for bearing fault diagnosis under varying operating conditions. The framework constructs a feature extraction and domain adaptation network based on a parallel architecture, designed to capture the complex dynamic characteristics of vibration signals. First, the Fast Fourier Transform (FFT) and Variational Mode Decomposition (VMD) are used to extract the spectral and modal features of the signals, generating a joint representation with multi-level information. Then, a Squeeze-and-Excitation-enhanced Convolutional Neural Network (SECNN) and a Bidirectional Long Short-Term Memory network (BiLSTM) are employed in parallel to dynamically adjust weights and capture high-dimensional spatiotemporal features. A cross-attention mechanism enables the interaction and fusion of spatial and temporal features, significantly enhancing the complementarity and coupling of the feature representations. Finally, a Multi-Kernel Maximum Mean Discrepancy (MKMMD) is introduced to align the feature distributions between the source and target domains, enabling efficient fault diagnosis under varying bearing conditions. The proposed STFDAN framework is evaluated using bearing datasets from Case Western Reserve University (CWRU), Jiangnan University (JNU), and Southeast University (SEU). Experimental results demonstrate that STFDAN achieves high diagnostic accuracy across different load conditions and effectively solves the bearing fault diagnosis problem under varying operating conditions.
(This article belongs to the Section Fault Diagnosis & Sensors)
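The MKMMD term that aligns source- and target-condition features is straightforward to write down; below is a biased-estimator sketch with assumed kernel bandwidths (the paper's kernel choices may differ):

```python
import torch

def gaussian_kernels(x, y, sigmas=(1.0, 2.0, 4.0, 8.0)):
    """Sum of RBF kernels at several bandwidths (the 'multi-kernel' part)."""
    d2 = torch.cdist(x, y) ** 2
    return sum(torch.exp(-d2 / (2 * s ** 2)) for s in sigmas)

def mk_mmd(source, target):
    """Multi-kernel MMD between source- and target-domain feature batches."""
    return (gaussian_kernels(source, source).mean()
            + gaussian_kernels(target, target).mean()
            - 2 * gaussian_kernels(source, target).mean())

src = torch.randn(32, 128)  # features from the labeled source condition
tgt = torch.randn(32, 128)  # features from the unlabeled target condition
loss = mk_mmd(src, tgt)     # added to the classification loss during training
print(loss.item())
```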

20 pages, 2511 KiB  
Article
MT-CMVAD: A Multi-Modal Transformer Framework for Cross-Modal Video Anomaly Detection
by Hantao Ding, Shengfeng Lou, Hairong Ye and Yanbing Chen
Appl. Sci. 2025, 15(12), 6773; https://doi.org/10.3390/app15126773 - 16 Jun 2025
Viewed by 711
Abstract
Video anomaly detection (VAD) faces significant challenges in multimodal semantic alignment and long-term temporal modeling within open surveillance scenarios. Existing methods are often plagued by modality discrepancies and fragmented temporal reasoning. To address these issues, we introduce MT-CMVAD, a hierarchically structured Transformer architecture that makes two key technical contributions: (1) A Context-Aware Dynamic Fusion Module that leverages cross-modal attention with learnable gating coefficients to effectively bridge the gap between RGB and optical flow modalities through adaptive feature recalibration, significantly enhancing fusion performance; (2) A Multi-Scale Spatiotemporal Transformer that establishes global-temporal dependencies via dilated attention mechanisms while preserving local spatial semantics through pyramidal feature aggregation. To address the sparse anomaly supervision dilemma, we propose a hybrid learning objective that integrates dual-stream reconstruction loss with prototype-based contrastive discrimination, enabling the joint optimization of pattern restoration and discriminative representation learning. Our extensive experiments on the UCF-Crime, UBI-Fights, and UBnormal datasets demonstrate state-of-the-art performance, achieving AUC scores of 98.9%, 94.7%, and 82.9%, respectively. The explicit spatiotemporal encoding scheme further improves temporal alignment accuracy by 2.4%, contributing to enhanced anomaly localization and overall detection accuracy. Additionally, the proposed framework achieves a 14.3% reduction in FLOPs and demonstrates 18.7% faster convergence during training, highlighting its practical value for real-world deployment. Our optimized window-shift attention mechanism also reduces computational complexity, making MT-CMVAD a robust and efficient solution for safety-critical video understanding tasks.
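The "cross-modal attention with learnable gating coefficients" can be sketched as RGB tokens attending to optical-flow tokens, with a per-dimension learnable gate on the mixed-in signal (the shapes and the zero-initialized gate are assumptions, not the paper's exact module):

```python
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    """RGB tokens attend to flow tokens; a learnable gate scales the mix-in."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(dim))  # learnable gating coefficients

    def forward(self, rgb, flow):              # (B, T, D) token sequences
        cross, _ = self.attn(rgb, flow, flow)  # flow-conditioned RGB tokens
        return rgb + torch.sigmoid(self.gate) * cross

rgb, flow = torch.randn(2, 16, 256), torch.randn(2, 16, 256)
print(GatedCrossModalFusion()(rgb, flow).shape)  # (2, 16, 256)
```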
