Search Results (1,053)

Search Parameters:
Keywords = multi-modal feature fusion

18 pages, 3445 KB  
Article
Underwater Objective Detection Algorithm Based on YOLOv8-Improved Multimodality Image Fusion Technology
by Yage Qie, Chao Fang, Jinghua Huang, Donghao Wu and Jian Jiang
Machines 2025, 13(11), 982; https://doi.org/10.3390/machines13110982 (registering DOI) - 24 Oct 2025
Abstract
The field of underwater robotics is experiencing rapid growth, and accurate object detection is a fundamental component of it. Given the prevalence of false alarms and missed detections caused by intricate subaquatic conditions and substantial image noise, this study introduces an enhanced detection framework that combines the YOLOv8 architecture with multimodal visual fusion. To counter the degraded detection performance of the model in complex environments such as low illumination, features from visible-light images are fused with the thermal distribution features of infrared images, thereby yielding more comprehensive image information. Furthermore, to focus precisely on crucial target regions and information, a Multi-Scale Cross-Axis Attention mechanism (MSCA) is introduced, which significantly enhances detection accuracy. Finally, to meet the lightweight requirement of the model, an Efficient Shared Convolution Head (ESC_Head) is designed. The experimental findings reveal that the YOLOv8-FUSED framework attains a mean average precision (mAP) of 82.1%, an 8.7% improvement over the baseline YOLOv8 architecture. The proposed approach also exhibits superior detection capabilities relative to existing techniques while satisfying the requirement for real-time underwater object detection. Full article
(This article belongs to the Section Robotics, Mechatronics and Intelligent Machines)
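
For readers who want a concrete picture of the visible/infrared fusion step described above, the following minimal PyTorch sketch concatenates RGB and infrared feature maps and reweights the result with a small channel gate. It is an illustration under assumed tensor shapes and module choices, not the YOLOv8-FUSED implementation.

```python
# Hypothetical sketch of visible/infrared feature fusion, not the paper's implementation.
import torch
import torch.nn as nn

class SimpleRGBTFusion(nn.Module):
    """Concatenate RGB and IR feature maps, then reweight channels with a small attention gate."""
    def __init__(self, rgb_ch: int, ir_ch: int, out_ch: int):
        super().__init__()
        self.proj = nn.Conv2d(rgb_ch + ir_ch, out_ch, kernel_size=1)
        self.attn = nn.Sequential(                      # squeeze-and-excitation style gate
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch // 4, 1), nn.ReLU(),
            nn.Conv2d(out_ch // 4, out_ch, 1), nn.Sigmoid(),
        )

    def forward(self, rgb_feat: torch.Tensor, ir_feat: torch.Tensor) -> torch.Tensor:
        fused = self.proj(torch.cat([rgb_feat, ir_feat], dim=1))
        return fused * self.attn(fused)                 # channel-wise gating

# Example: fuse backbone features from the two modalities.
rgb = torch.randn(2, 256, 80, 80)
ir = torch.randn(2, 128, 80, 80)
out = SimpleRGBTFusion(256, 128, 256)(rgb, ir)          # -> (2, 256, 80, 80)
```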

24 pages, 5556 KB  
Article
Efficient Wearable Sensor-Based Activity Recognition for Human–Robot Collaboration in Agricultural Environments
by Sakorn Mekruksavanich and Anuchit Jitpattanakul
Informatics 2025, 12(4), 115; https://doi.org/10.3390/informatics12040115 - 23 Oct 2025
Abstract
This study focuses on human awareness, a critical component in human–robot interaction, particularly within agricultural environments where interactions are enriched by complex contextual information. The main objective is identifying human activities occurring during collaborative harvesting tasks involving humans and robots. To achieve this, we propose a novel and lightweight deep learning model, named 1D-ResNeXt, designed explicitly for recognizing activities in agriculture-related human–robot collaboration. The model is built as an end-to-end architecture incorporating feature fusion and a multi-kernel convolutional block strategy. It utilizes residual connections and a split–transform–merge mechanism to mitigate performance degradation and reduce model complexity by limiting the number of trainable parameters. Sensor data were collected from twenty individuals with five wearable devices placed on different body parts. Each sensor was embedded with tri-axial accelerometers, gyroscopes, and magnetometers. Under real field conditions, the participants performed several sub-tasks commonly associated with agricultural labor, such as lifting and carrying loads. Before classification, the raw sensor signals were pre-processed to eliminate noise. The cleaned time-series data were then input into the proposed deep learning network for sequential pattern recognition. Experimental results showed that the chest-mounted sensor achieved the highest F1-score of 99.86%, outperforming other sensor placements and combinations. An analysis of temporal window sizes (0.5, 1.0, 1.5, and 2.0 s) demonstrated that the 0.5 s window provided the best recognition performance, indicating that key activity features in agriculture can be captured over short intervals. Moreover, a comprehensive evaluation of sensor modalities revealed that multimodal fusion of accelerometer, gyroscope, and magnetometer data yielded the best accuracy at 99.92%. The combination of accelerometer and gyroscope data offered an optimal compromise, achieving 99.49% accuracy while maintaining lower system complexity. These findings highlight the importance of strategic sensor placement and data fusion in enhancing activity recognition performance while reducing the need for extensive data and computational resources. This work contributes to developing intelligent, efficient, and adaptive collaborative systems, offering promising applications in agriculture and beyond, with improved safety, cost-efficiency, and real-time operational capability. Full article
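
The split-transform-merge idea behind 1D-ResNeXt can be approximated with grouped 1-D convolutions and a residual connection, as in the sketch below. Channel count, cardinality, and kernel size are illustrative assumptions, not the configuration reported in the paper.

```python
# Illustrative 1D ResNeXt-style block (split-transform-merge via grouped convolution);
# layer sizes are assumptions, not the 1D-ResNeXt configuration from the paper.
import torch
import torch.nn as nn

class ResNeXtBlock1D(nn.Module):
    def __init__(self, channels: int = 64, cardinality: int = 8, kernel_size: int = 5):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.BatchNorm1d(channels), nn.ReLU(),
            # grouped convolution = parallel "split-transform" paths
            nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2,
                      groups=cardinality),
            nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.BatchNorm1d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(x + self.transform(x))   # residual "merge"

# A 0.5 s window at 50 Hz from a 9-axis IMU (acc + gyro + mag) projected to 64 channels.
window = torch.randn(4, 9, 25)
feat = nn.Conv1d(9, 64, 1)(window)
out = ResNeXtBlock1D()(feat)                       # -> (4, 64, 25)
```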

23 pages, 6498 KB  
Article
A Cross-Modal Deep Feature Fusion Framework Based on Ensemble Learning for Land Use Classification
by Xiaohuan Wu, Houji Qi, Keli Wang, Yikun Liu and Yang Wang
ISPRS Int. J. Geo-Inf. 2025, 14(11), 411; https://doi.org/10.3390/ijgi14110411 - 23 Oct 2025
Abstract
Land use classification based on multi-modal data fusion has gained significant attention due to its potential to capture the complex characteristics of urban environments. However, effectively extracting and integrating discriminative features derived from heterogeneous geospatial data remains challenging. This study proposes an ensemble learning framework for land use classification by fusing cross-modal deep features from both physical and socioeconomic perspectives. Specifically, the framework utilizes the Masked Autoencoder (MAE) to extract global spatial dependencies from remote sensing imagery and applies long short-term memory (LSTM) networks to model spatial distribution patterns of points of interest (POIs) based on type co-occurrence. Furthermore, we employ inter-modal contrastive learning to enhance the representation of physical and socioeconomic features. To verify the superiority of the ensemble learning framework, we apply it to map the land use distribution of Beijing. By coupling various physical and socioeconomic features, the framework achieves an average accuracy of 84.33%, surpassing several comparative baseline methods. Furthermore, the framework demonstrates comparable performance when applied to a Shenzhen dataset, confirming its robustness and generalizability. The findings highlight the importance of fully extracting and effectively integrating multi-source deep features in land use classification, providing a robust solution for urban planning and sustainable development. Full article
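
The inter-modal contrastive learning mentioned above can be illustrated with a symmetric InfoNCE-style loss between image and POI embeddings of the same parcel; the embedding size and temperature below are assumptions, not the authors' settings.

```python
# Toy inter-modal contrastive (InfoNCE-style) loss between remote-sensing and POI
# embeddings; projection size and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def infonce(img_emb: torch.Tensor, poi_emb: torch.Tensor, temperature: float = 0.07):
    """img_emb, poi_emb: (batch, dim) embeddings of the same land-use parcels."""
    img = F.normalize(img_emb, dim=-1)
    poi = F.normalize(poi_emb, dim=-1)
    logits = img @ poi.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(img.size(0))           # matching parcel = positive pair
    # symmetric loss: image->POI and POI->image
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = infonce(torch.randn(8, 128), torch.randn(8, 128))
```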

25 pages, 1741 KB  
Article
Event-Aware Multimodal Time-Series Forecasting via Symmetry-Preserving Graph-Based Cross-Regional Transfer Learning
by Shu Cao and Can Zhou
Symmetry 2025, 17(11), 1788; https://doi.org/10.3390/sym17111788 - 22 Oct 2025
Abstract
Forecasting real-world time series in domains with strong event sensitivity and regional variability poses unique challenges, as predictive models must account for sudden disruptions, heterogeneous contextual factors, and structural differences across locations. To tackle these challenges, we draw on the concept of symmetry, understood here as the balance and invariance patterns across temporal, multimodal, and structural dimensions that help reveal consistent relationships and recurring patterns within complex systems. This study is based on two multimodal datasets covering 12 tourist regions and more than 3 years of records, ensuring robustness and practical relevance of the results. In many applications, such as monitoring economic indicators, assessing operational performance, or predicting demand patterns, short-term fluctuations are often triggered by discrete events, policy changes, or external incidents, which conventional statistical and deep learning approaches struggle to model effectively. To address these limitations, we propose an event-aware multimodal time-series forecasting framework with graph-based regional transfer built upon an enhanced PatchTST backbone. The framework unifies multimodal feature extraction, event-sensitive temporal reasoning, and graph-based structural adaptation. Unlike Informer, Autoformer, FEDformer, or PatchTST, our model explicitly addresses naive multimodal fusion, event-agnostic modeling, and weak cross-regional transfer by introducing an event-aware Multimodal Encoder, a Temporal Event Reasoner, and a Multiscale Graph Module. Experiments on diverse multi-region multimodal datasets demonstrate that our method achieves substantial improvements over eight state-of-the-art baselines in forecasting accuracy, event response modeling, and transfer efficiency. Specifically, our model achieves a 15.06% improvement in the event recovery index, a 15.1% reduction in MAE, and a 19.7% decrease in event response error compared to PatchTST, highlighting its empirical impact on tourism event economics forecasting. Full article

24 pages, 2308 KB  
Review
Review on Application of Machine Vision-Based Intelligent Algorithms in Gear Defect Detection
by Dehai Zhang, Shengmao Zhou, Yujuan Zheng and Xiaoguang Xu
Processes 2025, 13(10), 3370; https://doi.org/10.3390/pr13103370 - 21 Oct 2025
Viewed by 285
Abstract
Gear defect detection directly affects the operational reliability of critical equipment in fields such as automotive and aerospace. Gear defect detection technology based on machine vision, leveraging the advantages of non-contact measurement, high efficiency, and cost-effectiveness, has become a key support for quality control in intelligent manufacturing. However, it still faces challenges including difficulties in semantic alignment of multimodal data, the imbalance between real-time detection requirements and computational resources, and poor model generalization in few-shot scenarios. This paper takes the paradigm evolution of gear defect detection technology as its main line, systematically reviews its development from traditional image processing to deep learning, and focuses on the innovative application of intelligent algorithms. A research framework of “technical bottleneck-breakthrough path-application verification” is constructed: for the problem of multimodal fusion, the cross-modal feature alignment mechanism based on Transformer networks is analyzed in depth, clarifying its technical path of realizing joint embedding of visual and vibration signals by establishing global correlation mapping; for resource constraints, the performance of lightweight models such as MobileNet and ShuffleNet is quantitatively compared, verifying that these models reduce parameters by 40–60% while keeping mean average precision essentially unchanged; for small-sample scenarios, few-shot generation models based on contrastive learning are systematically organized, confirming that their accuracy in the 10-shot scenario can reach 90% of that of fully supervised models, thus enhancing generalization ability. Future research can focus on the collaboration between few-shot generation and physical simulation, edge-cloud dynamic scheduling, defect evolution modeling driven by multiphysics fields, and standardization of explainable artificial intelligence. The aim is to construct a gear detection system with autonomous perception capabilities, promoting the development of industrial quality inspection toward high-precision, high-robustness, and low-cost intelligence. Full article

28 pages, 5272 KB  
Article
A Cross-Attention Gating Mechanism-Based Multimodal Feature Fusion Method for Software Defect Prediction
by Renliang Wang and Feng Liu
Appl. Sci. 2025, 15(20), 11259; https://doi.org/10.3390/app152011259 - 21 Oct 2025
Viewed by 96
Abstract
Early software defect prediction research primarily relied on software metric features. However, such features struggle to fully capture semantic information in source code. Consequently, methods based on semantic features have gradually become mainstream. Nevertheless, the empirical value of traditional metric features remains significant. To address this, we propose a multi-modal feature fusion method based on a cross-attention gating mechanism for software defect prediction (GMCA-SDP). This method aims to effectively fuse multiple modal features of source code to improve defect prediction performance. Unlike previous approaches, GMCA-SDP can simultaneously integrate traditional metric features and various code semantic features. During the feature fusion stage, it considers both the contribution differences among different types of features and the information interaction between modalities. We select three types of modal features as input for GMCA-SDP: traditional metric features, semantic features extracted from the abstract syntax tree, and structural features extracted from the control flow graph. Our experiments on nine open-source projects demonstrate that the GMCA-SDP method outperforms six mainstream defect prediction models, with average improvements of 18.7% in F1, 10.9% in AUC, and 14.1% in G-mean. Full article
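
A rough sketch of cross-attention combined with a gate between two code-feature modalities (for example, metric features attending to AST-derived semantic features) is shown below; the dimensions and gating form are assumptions rather than the GMCA-SDP design.

```python
# Hedged sketch of gated cross-attention between two code-feature modalities;
# shapes and the gating form are assumptions, not the GMCA-SDP architecture.
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, query_mod: torch.Tensor, context_mod: torch.Tensor) -> torch.Tensor:
        # the query modality attends to the other modality
        attended, _ = self.attn(query_mod, context_mod, context_mod)
        g = self.gate(torch.cat([query_mod, attended], dim=-1))   # how much to take in
        return g * attended + (1 - g) * query_mod

metric = torch.randn(16, 1, 128)    # one token of traditional metric features
ast = torch.randn(16, 20, 128)      # sequence of AST-node embeddings
fused = GatedCrossAttention()(metric, ast)     # -> (16, 1, 128)
```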

29 pages, 48102 KB  
Article
Infrared Temporal Differential Perception for Space-Based Aerial Targets
by Lan Guo, Xin Chen, Cong Gao, Zhiqi Zhao and Peng Rao
Remote Sens. 2025, 17(20), 3487; https://doi.org/10.3390/rs17203487 - 20 Oct 2025
Viewed by 214
Abstract
Space-based infrared (IR) detection, with wide coverage, all-time operation, and stealth, is crucial for aerial target surveillance. Under low signal-to-noise ratio (SNR) conditions, however, its small target size, limited features, and strong clutter often lead to missed detections and false alarms, reducing stability and real-time performance. To overcome these issues of energy-integration imaging in perceiving dim targets, this paper proposes a biomimetic vision-inspired Infrared Temporal Differential Detection (ITDD) method. The ITDD method generates sparse event streams by triggering on pixel-level radiation variations and establishes an irradiance-based sensitivity model with optimized threshold voltage, spectral bands, and optical aperture parameters. IR sequences are converted into differential event streams with inherent noise, upon which a lightweight multi-modal fusion detection network is developed. Simulation experiments demonstrate that ITDD reduces data volume by three orders of magnitude and improves the SNR by 4.21 times. On the SITP-QLEF dataset, the network achieves a detection rate of 99.31% and a false alarm rate of 1.97×10⁻⁵, confirming its effectiveness and application potential under complex backgrounds. As the current findings are based on simulated data, future work will focus on building an ITDD demonstration system to validate the approach with real-world IR measurements. Full article
(This article belongs to the Special Issue Deep Learning-Based Small-Target Detection in Remote Sensing)
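
The temporal-differential idea, converting an IR frame sequence into sparse events by thresholding per-pixel changes, can be illustrated as follows; the threshold and polarity encoding are assumptions, not the ITDD sensitivity model.

```python
# Rough illustration of turning an IR frame sequence into sparse "temporal differential"
# events by thresholding per-pixel changes; threshold and polarity encoding are assumed.
import numpy as np

def frames_to_events(frames: np.ndarray, threshold: float = 0.05):
    """frames: (T, H, W) array of IR irradiance. Returns a list of (t, y, x, polarity)."""
    events = []
    ref = frames[0].copy()                      # last value that triggered each pixel
    for t in range(1, frames.shape[0]):
        diff = frames[t] - ref
        fired = np.abs(diff) >= threshold       # pixel-level trigger
        ys, xs = np.nonzero(fired)
        for y, x in zip(ys, xs):
            events.append((t, y, x, 1 if diff[y, x] > 0 else -1))
        ref[fired] = frames[t][fired]           # reset reference only where triggered
    return events

ev = frames_to_events(np.random.rand(10, 32, 32).astype(np.float32))
print(len(ev), "events from", 10 * 32 * 32, "pixel samples")   # far fewer than raw samples
```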

22 pages, 4780 KB  
Article
A Fusion Estimation Method for Tire-Road Friction Coefficient Based on Weather and Road Images
by Jiye Huang, Xinshi Chen, Qingsong Jin and Ping Li
Lubricants 2025, 13(10), 459; https://doi.org/10.3390/lubricants13100459 - 20 Oct 2025
Viewed by 169
Abstract
The tire-road friction coefficient (TRFC) is a critical parameter that significantly influences vehicle safety, handling stability, and driving comfort. Existing estimation methods based on vehicle dynamics suffer from a substantial decline in accuracy under conditions with insufficient excitation, while vision-based approaches are often limited by the generalization ability of their datasets, making them less effective in complex and variable real-driving environments. To address these challenges, this paper proposes a novel, low-cost fusion method for TRFC estimation that integrates weather conditions and road image data. The proposed approach begins by employing semantic segmentation to partition the input images into two distinct regions, sky and road. The segmented images are then fed into a road recognition network and a weather recognition network for road type and weather classification. Furthermore, a fusion decision tree incorporating an uncertainty modeling mechanism is introduced to dynamically integrate these multi-source features, thereby enhancing the robustness of the estimation. Experimental results demonstrate that the proposed method maintains stable and reliable estimation performance even on unseen road surfaces, significantly outperforming single-modality methods. This indicates its high practical value and promising potential for broad application. Full article
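
A toy version of decision-level fusion of road-type and weather classifications into a friction estimate is sketched below; the friction values and the uncertainty-weighted fallback are illustrative assumptions, not the paper's fusion decision tree.

```python
# Toy decision-level fusion of road-type and weather classification into a friction
# estimate; lookup values and the confidence rule are rough illustrative assumptions.
FRICTION_TABLE = {
    ("asphalt", "dry"): 0.85, ("asphalt", "rain"): 0.55, ("asphalt", "snow"): 0.25,
    ("gravel", "dry"): 0.60, ("gravel", "rain"): 0.45, ("gravel", "snow"): 0.20,
}

def estimate_trfc(road_probs: dict, weather_probs: dict, confidence_floor: float = 0.6):
    """road_probs / weather_probs: class -> probability from the two recognition networks."""
    road, p_road = max(road_probs.items(), key=lambda kv: kv[1])
    weather, p_weather = max(weather_probs.items(), key=lambda kv: kv[1])
    if min(p_road, p_weather) < confidence_floor:
        # uncertain branch: fall back to a probability-weighted average over the table
        return sum(FRICTION_TABLE[(r, w)] * road_probs[r] * weather_probs[w]
                   for r in road_probs for w in weather_probs)
    return FRICTION_TABLE[(road, weather)]

mu = estimate_trfc({"asphalt": 0.7, "gravel": 0.3}, {"dry": 0.5, "rain": 0.4, "snow": 0.1})
print(round(mu, 3))
```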

23 pages, 1986 KB  
Article
GMHCA-MCBILSTM: A Gated Multi-Head Cross-Modal Attention-Based Network for Emotion Recognition Using Multi-Physiological Signals
by Xueping Li, Yanbo Li, Yuhang Li and Yuan Yang
Algorithms 2025, 18(10), 664; https://doi.org/10.3390/a18100664 - 20 Oct 2025
Viewed by 281
Abstract
To address the limitations of the single-modal electroencephalogram (EEG), such as its single physiological dimension, weak anti-interference ability, and inability to fully reflect emotional states, this paper proposes a gated multi-head cross-attention module (GMHCA) for multimodal fusion of EEG, electrooculography (EOG), and electrodermal activity (EDA). This attention module employs three independent and parallel attention computation units to assign independent attention weights to different feature subsets across modalities. Combined with a modality complementarity metric, the gating mechanism suppresses redundant heads and enhances the information transmission of key heads. Through multi-head concatenation, cross-modal interaction results from different perspectives are fused. For the backbone network, a multi-scale convolution and bidirectional long short-term memory network (MC-BiLSTM) is designed for feature extraction, tailored to the characteristics of each modality. Experiments show that this method, which primarily fuses eight-channel EEG with peripheral physiological signals, achieves an emotion recognition accuracy of 89.45%, a 7.68% improvement over single-modal EEG. In addition, in cross-subject experiments conducted on the SEED-IV dataset, the EEG+EOG modality achieved a classification accuracy of 92.73%. Both results were significantly better than the baseline methods, demonstrating the effectiveness of the proposed GMHCA module and MC-BiLSTM feature extraction network for multimodal fusion. Through the novel attention gating mechanism, higher recognition accuracy is achieved while significantly reducing the number of EEG channels, providing new ideas based on attention mechanisms and gated fusion for multimodal emotion recognition in resource-constrained environments. Full article
(This article belongs to the Special Issue Machine Learning in Medical Signal and Image Processing (4th Edition))
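
The idea of gating individual attention heads so that redundant heads are suppressed can be sketched as follows; the per-head sigmoid gate and dimensions are assumptions, not the GMHCA module itself.

```python
# Minimal sketch of per-head gating on top of multi-head cross-attention, in the spirit
# of suppressing redundant heads; the gating form is an assumption, not GMHCA.
import torch
import torch.nn as nn

class HeadGatedCrossAttention(nn.Module):
    def __init__(self, dim: int = 96, heads: int = 3):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.q, self.kv = nn.Linear(dim, dim), nn.Linear(dim, 2 * dim)
        self.head_gate = nn.Parameter(torch.zeros(heads))   # one learnable gate per head
        self.out = nn.Linear(dim, dim)

    def forward(self, eeg: torch.Tensor, peripheral: torch.Tensor) -> torch.Tensor:
        B, T, _ = eeg.shape
        q = self.q(eeg).view(B, T, self.heads, self.head_dim).transpose(1, 2)
        k, v = self.kv(peripheral).chunk(2, dim=-1)
        k = k.view(B, -1, self.heads, self.head_dim).transpose(1, 2)
        v = v.view(B, -1, self.heads, self.head_dim).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-1, -2) / self.head_dim ** 0.5, dim=-1)
        heads_out = attn @ v                                  # (B, heads, T, head_dim)
        gate = torch.sigmoid(self.head_gate).view(1, -1, 1, 1)
        heads_out = heads_out * gate                          # weak heads are damped
        return self.out(heads_out.transpose(1, 2).reshape(B, T, -1))

fused = HeadGatedCrossAttention()(torch.randn(2, 50, 96), torch.randn(2, 50, 96))
```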

18 pages, 1010 KB  
Article
ViT-BiLSTM Multimodal Learning for Paediatric ADHD Recognition: Integrating Wearable Sensor Data with Clinical Profiles
by Lin Wang and Guang Yang
Sensors 2025, 25(20), 6459; https://doi.org/10.3390/s25206459 - 18 Oct 2025
Viewed by 302
Abstract
ADHD classification has traditionally relied on accelerometer-derived tabular features, which summarise static activity but fail to capture spatial-temporal patterns, potentially limiting model performance. We developed a multimodal deep learning framework that transforms raw accelerometer signals into images and integrates them with clinical tabular data, enabling the joint exploration of dynamic activity patterns and static clinical characteristics. Data were collected from children aged 7–13 years, including accelerometer recordings from Apple Watches and clinical measures from standardised questionnaires. Deep learning models for image feature extraction and multiple fusion strategies were evaluated to identify the most effective representation and integration method. Our analyses indicated that combining activity images with clinical variables facilitated the classification of ADHD, with the ViT-BiLSTM model using cross-attention fusion achieving the highest performance. These findings suggest that multimodal learning can become a robust approach to ADHD classification by leveraging complementary information from activity dynamics and clinical data. Our framework and code will be made publicly available to support reproducibility and future research. Full article
(This article belongs to the Section Wearables)

23 pages, 5774 KB  
Article
A Multimodal Voice Phishing Detection System Integrating Text and Audio Analysis
by Jiwon Kim, Seuli Gu, Youngbeom Kim, Sukwon Lee and Changgu Kang
Appl. Sci. 2025, 15(20), 11170; https://doi.org/10.3390/app152011170 - 18 Oct 2025
Viewed by 197
Abstract
Voice phishing has emerged as a critical security threat, exploiting both linguistic manipulation and advances in synthetic speech technologies. Traditional keyword-based approaches often fail to capture contextual patterns or detect forged audio, limiting their effectiveness in real-world scenarios. To address this gap, we propose a multimodal voice phishing detection system that integrates text and audio analysis. The text module employs a KoBERT-based transformer classifier with self-attention interpretation, while the audio module leverages MFCC features and a CNN–BiLSTM classifier to identify synthetic speech. A fusion mechanism combines the outputs of both modalities, with experiments conducted on real-world call transcripts, phishing datasets, and synthetic voice corpora. The results demonstrate that the proposed system consistently achieves high accuracy, precision, recall, and F1-scores on validation data while maintaining robust performance in noisy and diverse real-call scenarios. Furthermore, attention-based interpretability enhances trustworthiness by revealing cross-token and discourse-level interaction patterns specific to phishing contexts. These findings highlight the potential of the proposed system as a reliable, explainable, and deployable solution for preventing the financial and social damage caused by voice phishing. Unlike prior studies limited to single-modality or shallow fusion, our work presents a fully integrated text–audio detection pipeline optimized for Korean real-world datasets and robust to noisy, multi-speaker conditions. Full article
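
The score-level combination of the two modalities can be illustrated with a simple weighted late fusion; the weights and threshold below are assumptions, not the paper's fusion mechanism.

```python
# Simple score-level (late) fusion of a text classifier and an audio classifier, as an
# illustration only; weighting scheme and threshold are assumptions.
def fuse_phishing_scores(p_text: float, p_audio_synthetic: float,
                         w_text: float = 0.6, threshold: float = 0.5) -> bool:
    """p_text: probability the transcript is phishing (e.g., from a KoBERT classifier).
    p_audio_synthetic: probability the voice is synthetic (e.g., from a CNN-BiLSTM on MFCCs).
    Returns True if the call should be flagged."""
    fused = w_text * p_text + (1.0 - w_text) * p_audio_synthetic
    return fused >= threshold

print(fuse_phishing_scores(0.82, 0.35))   # strong text evidence flags the call -> True
```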

22 pages, 1678 KB  
Article
Image Completion Network Considering Global and Local Information
by Yubo Liu, Ke Chen and Alan Penn
Buildings 2025, 15(20), 3746; https://doi.org/10.3390/buildings15203746 - 17 Oct 2025
Viewed by 201
Abstract
Accurate depth image inpainting in complex urban environments remains a critical challenge due to occlusions, reflections, and sensor limitations, which often result in significant data loss. We propose a hybrid deep learning framework that explicitly combines local and global modelling through Convolutional Neural Networks (CNNs) and Transformer modules. The model employs a multi-branch parallel architecture, where the CNN branch captures fine-grained local textures and edges, while the Transformer branch models global semantic structures and long-range dependencies. We introduce an optimized attention mechanism, Agent Attention, which differs from existing efficient/linear attention methods by using learnable proxy tokens tailored for urban scene categories (e.g., façades, sky, ground). A content-guided dynamic fusion module adaptively combines multi-scale features to enhance structural alignment and texture recovery. The framework is trained with a composite loss function incorporating pixel accuracy, perceptual similarity, adversarial realism, and structural consistency. Extensive experiments on the Paris StreetView dataset demonstrate that the proposed method achieves state-of-the-art performance, outperforming existing approaches in PSNR, SSIM, and LPIPS metrics. The study highlights the potential of multi-scale modeling for urban depth inpainting and discusses challenges in real-world deployment, ethical considerations, and future directions for multimodal integration. Full article
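
Attention routed through a small set of learnable proxy tokens can be sketched as below; this generic two-stage formulation is only an approximation of the idea, not the paper's Agent Attention module.

```python
# Sketch of attention routed through learnable proxy ("agent") tokens, which reduces the
# quadratic cost of full self-attention; a generic approximation, not Agent Attention.
import torch
import torch.nn as nn

class ProxyTokenAttention(nn.Module):
    def __init__(self, dim: int = 64, num_proxies: int = 8, heads: int = 4):
        super().__init__()
        self.proxies = nn.Parameter(torch.randn(1, num_proxies, dim))  # e.g., one per scene category
        self.gather = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.broadcast = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:           # tokens: (B, N, dim)
        proxies = self.proxies.expand(tokens.size(0), -1, -1)
        summary, _ = self.gather(proxies, tokens, tokens)      # proxies summarise the image
        updated, _ = self.broadcast(tokens, summary, summary)  # tokens read the summaries back
        return tokens + updated

patches = torch.randn(2, 196, 64)          # 14x14 patch embeddings of a masked image
out = ProxyTokenAttention()(patches)        # -> (2, 196, 64)
```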

20 pages, 2565 KB  
Article
GBV-Net: Hierarchical Fusion of Facial Expressions and Physiological Signals for Multimodal Emotion Recognition
by Jiling Yu, Yandong Ru, Bangjun Lei and Hongming Chen
Sensors 2025, 25(20), 6397; https://doi.org/10.3390/s25206397 - 16 Oct 2025
Viewed by 449
Abstract
A core challenge in multimodal emotion recognition lies in the precise capture of the inherent multimodal interactive nature of human emotions. Addressing the limitation of existing methods, which often process visual signals (facial expressions) and physiological signals (EEG, ECG, EOG, and GSR) in isolation and thus fail to exploit their complementary strengths effectively, this paper presents a new multimodal emotion recognition framework called the Gated Biological Visual Network (GBV-Net). This framework enhances emotion recognition accuracy through deep synergistic fusion of facial expressions and physiological signals. GBV-Net integrates three core modules: (1) a facial feature extractor based on a modified ConvNeXt V2 architecture incorporating lightweight Transformers, specifically designed to capture subtle spatio-temporal dynamics in facial expressions; (2) a hybrid physiological feature extractor combining 1D convolutions, Temporal Convolutional Networks (TCNs), and convolutional self-attention mechanisms, adept at modeling local patterns and long-range temporal dependencies in physiological signals; and (3) an enhanced gated attention fusion module capable of adaptively learning inter-modal weights to achieve dynamic, synergistic integration at the feature level. A thorough investigation of the publicly accessible DEAP and MAHNOB-HCI datasets reveals that GBV-Net surpasses contemporary methods. Specifically, on the DEAP dataset, the model attained classification accuracies of 95.10% for Valence and 95.65% for Arousal, with F1-scores of 95.52% and 96.35%, respectively. On MAHNOB-HCI, the accuracies achieved were 97.28% for Valence and 97.73% for Arousal, with F1-scores of 97.50% and 97.74%, respectively. These experimental findings substantiate that GBV-Net effectively captures deep-level interactive information between multimodal signals, thereby improving emotion recognition accuracy. Full article
(This article belongs to the Section Biomedical Sensors)

14 pages, 1149 KB  
Article
Modality Information Aggregation Graph Attention Network with Adversarial Training for Multi-Modal Knowledge Graph Completion
by Hankiz Yilahun, Elyar Aili, Seyyare Imam and Askar Hamdulla
Information 2025, 16(10), 907; https://doi.org/10.3390/info16100907 - 16 Oct 2025
Viewed by 170
Abstract
Multi-modal knowledge graph completion (MMKGC) aims to complete knowledge graphs by integrating structural information with multi-modal (e.g., visual, textual, and numerical) features and leveraging cross-modal reasoning within a unified semantic space to infer and supplement missing factual knowledge. Current MMKGC methods have advanced in terms of integrating multi-modal information but have overlooked the imbalance in modality importance for target entities. Treating all modalities equally dilutes critical semantics and amplifies irrelevant information, which in turn limits the semantic understanding and predictive performance of the model. To address these limitations, we propose a modality information aggregation graph attention network with adversarial training for multi-modal knowledge graph completion (MIAGAT-AT). MIAGAT-AT focuses on hierarchically modeling complex cross-modal interactions. By combining the multi-head attention mechanism with modality-specific projection methods, it precisely captures global semantic dependencies and dynamically adjusts the weight of modality embeddings according to the importance of each modality, thereby optimizing cross-modal information fusion. Moreover, through the use of random noise and multi-layer residual blocks, adversarial training generates high-quality multi-modal feature representations, thereby effectively enhancing information from imbalanced modalities. Experimental results demonstrate that our approach significantly outperforms 18 existing baselines and establishes a strong performance baseline across three distinct datasets. Full article

20 pages, 4914 KB  
Article
Dual-Channel Parallel Multimodal Feature Fusion for Bearing Fault Diagnosis
by Wanrong Li, Haichao Cai, Xiaokang Yang, Yujun Xue, Jun Ye and Xiangyi Hu
Machines 2025, 13(10), 950; https://doi.org/10.3390/machines13100950 - 15 Oct 2025
Viewed by 336
Abstract
In recent years, the powerful feature extraction capabilities of deep learning have attracted widespread attention in the field of bearing fault diagnosis. To address the limitations of single-modal and single-channel feature extraction methods, which often result in incomplete information representation and difficulty in obtaining high-quality fault features, this paper proposes a dual-channel parallel multimodal feature fusion model for bearing fault diagnosis. In this method, the one-dimensional vibration signals are first transformed into two-dimensional time-frequency representations using the continuous wavelet transform (CWT). Subsequently, both the one-dimensional vibration signals and the two-dimensional time-frequency representations are fed simultaneously into the dual-branch parallel model. Within this architecture, the first branch employs a combination of a one-dimensional convolutional neural network (1DCNN) and a bidirectional gated recurrent unit (BiGRU) to extract temporal features from the one-dimensional vibration signals. The second branch utilizes a dilated convolutional network to capture spatial time–frequency information from the CWT-derived two-dimensional time–frequency representations. The features extracted by both branches are then input into the feature fusion layer. Furthermore, to leverage fault features more comprehensively, a channel attention mechanism is embedded after the feature fusion layer. This enables the network to focus more effectively on salient features across channels while suppressing interference from redundant features, thereby enhancing the performance and accuracy of the dual-branch network. Finally, the fused fault features are passed to a softmax classifier for fault classification. Experimental results demonstrate that the proposed method achieved an average accuracy of 99.50% on the Case Western Reserve University (CWRU) bearing dataset and 97.33% on the Southeast University (SEU) bearing dataset. These results confirm that the proposed model effectively improves fault diagnosis accuracy and exhibits strong generalization capability. Full article
(This article belongs to the Section Machines Testing and Maintenance)
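
As an illustration of the CWT preprocessing step, the sketch below converts a 1-D vibration segment into a 2-D scalogram with PyWavelets; the Morlet wavelet, scale range, and sampling rate are assumptions, not the exact settings used in the paper.

```python
# Illustrative conversion of a 1-D vibration segment into a 2-D time-frequency image with
# the continuous wavelet transform; wavelet choice and scales are assumptions.
import numpy as np
import pywt

def vibration_to_scalogram(segment: np.ndarray, fs: float = 12_000.0,
                           num_scales: int = 64) -> np.ndarray:
    """segment: (N,) vibration samples. Returns a (num_scales, N) scalogram magnitude."""
    scales = np.arange(1, num_scales + 1)
    coeffs, _freqs = pywt.cwt(segment, scales, "morl", sampling_period=1.0 / fs)
    return np.abs(coeffs)

# A synthetic bearing-like signal: a 150 Hz tone with periodic impulse content and noise.
t = np.arange(0, 0.1, 1 / 12_000)
signal = np.sin(2 * np.pi * 150 * t) + 0.5 * (np.sin(2 * np.pi * 1_800 * t) ** 8)
scalogram = vibration_to_scalogram(signal + 0.1 * np.random.randn(t.size))
print(scalogram.shape)   # (64, 1200) image fed to the 2-D branch
```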
