Search Results (2,410)

Search Parameters:
Keywords = multimodal datasets

25 pages, 20117 KB  
Article
Intelligent Corrosion Diagnosis of High-Strength Bolts Based on Multi-Modal Feature Fusion and APO-XGBoost
by Hanyue Zhang, Yin Wu, Bo Sun, Yanyi Liu and Wenbo Liu
Sensors 2026, 26(8), 2520; https://doi.org/10.3390/s26082520 (registering DOI) - 19 Apr 2026
Abstract
High-strength bolts are critical structural components that are highly susceptible to corrosion in complex environments, posing significant threats to structural safety and reliability. Although acoustic emission (AE) technology has been widely applied in structural health monitoring, existing studies mainly focus on damage mode identification or source localization, while the identification of corrosion evolution stages based on AE signals remains insufficient. This study develops an intelligent corrosion diagnosis framework for high-strength bolts by integrating multimodal feature fusion and optimized machine learning. AE signals are first collected from the near-end and far-end of bolts using a wireless sensor network and then transformed into time–frequency representations via continuous wavelet transform (CWT). The resulting time–frequency images are fed into a modified ResNet-18 network to extract deep features, while statistical features are simultaneously extracted from the raw signals to preserve global information. These heterogeneous features are subsequently fused to form a comprehensive representation of corrosion characteristics. Furthermore, an artificial protozoa optimizer (APO) is introduced to adaptively optimize the hyperparameters of the XGBoost model. The results demonstrate that AE signals generated by hammering bolts with different corrosion levels can be successfully distinguished. The proposed method achieves high accuracy in corrosion stage classification and outperforms conventional approaches. Even when evaluated on an additional M30 bolt dataset, the proposed method maintains robust performance, demonstrating excellent generalization capability across different bolt sizes. These results demonstrate the practical potential of the proposed method for intelligent bolt corrosion diagnosis.
(This article belongs to the Section Fault Diagnosis & Sensors)
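A rough sketch of the fusion pipeline this abstract describes, under stated assumptions: the mean per-scale CWT energy below is a crude stand-in for the paper's ResNet-18 deep features, the statistical features are a generic subset, and all parameter values (scales, wavelet, XGBoost settings, which APO would tune) are illustrative rather than the authors'.

```python
import numpy as np
import pywt
from xgboost import XGBClassifier

def bolt_features(sig):
    # Global statistical features from the raw AE waveform.
    stats = np.array([sig.mean(), sig.std(), np.abs(sig).max(),
                      ((sig - sig.mean()) ** 3).mean() / sig.std() ** 3])
    # Time-frequency representation via continuous wavelet transform (CWT).
    coef, _ = pywt.cwt(sig, np.arange(1, 64), "morl")
    tf = (coef ** 2).mean(axis=1)       # per-scale energy (deep-feature stand-in)
    return np.concatenate([stats, tf])  # fused feature vector

rng = np.random.default_rng(0)
X = np.stack([bolt_features(rng.standard_normal(2048)) for _ in range(30)])
y = np.arange(30) % 3                   # three illustrative corrosion stages
XGBClassifier(n_estimators=100, max_depth=4).fit(X, y)  # APO would tune these
```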
21 pages, 1699 KB  
Article
Three-Way Multimodal Learning with Severely Missing Modalities
by Hanrui Wang, Yu Fang, Xin Wang and Fan Min
Information 2026, 17(4), 384; https://doi.org/10.3390/info17040384 (registering DOI) - 19 Apr 2026
Abstract
Missing modalities remain a major obstacle to the real-world deployment of multimodal learning systems, as incomplete inputs can substantially degrade model performance. Existing methods often suffer from biased imputation under high missing rates and lack uncertainty-aware, differentiated processing. Inspired by three-way decision, a framework for handling uncertainty by adding a deferment option to acceptance and rejection, we propose three-way multimodal learning with severely missing modalities (3WML-SMM), a novel framework that introduces a three-way decision mechanism into both missing-modality imputation and feature regularization for the first time. Specifically, 3WML-SMM treats variance not merely as a descriptive measure of uncertainty, but as a decision signal for adaptive processing. Based on this idea, the framework incorporates (1) a variance-guided three-way imputation strategy with accept–delay–reject decisions to reduce unreliable reconstruction when only a limited number of complete samples are available and (2) a dimension-wise adaptive feature enhancement module that performs fine-grained regularization according to perturbation uncertainty. Experiments on the CMU Multimodal Opinion Sentiment Intensity (CMU-MOSI) and Multimodal Internet Movie Database (MM-IMDb) datasets show that 3WML-SMM consistently outperforms representative baselines, including reconstruction-based methods, complete-input multimodal methods, and missing-modality-specific methods under severe missing-modality settings, with statistically significant improvements over the multimodal learning with severely missing modality (SMIL) baseline (p<0.05). These results demonstrate the effectiveness of the proposed framework, even in extreme settings where only 10% of the text modality is available.
(This article belongs to the Section Artificial Intelligence)
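A minimal sketch of the variance-guided accept–delay–reject decision the abstract describes. The thresholds and the shrinkage applied to deferred dimensions are illustrative guesses, not the paper's settings.

```python
import numpy as np

def three_way_impute(recon_mean, recon_var, accept_th=0.1, reject_th=0.5):
    # recon_mean / recon_var: per-dimension mean and variance of a
    # reconstructed (imputed) modality feature vector.
    imputed = np.zeros_like(recon_mean)
    accept = recon_var <= accept_th                      # confident: accept as-is
    defer = (recon_var > accept_th) & (recon_var < reject_th)
    imputed[accept] = recon_mean[accept]
    imputed[defer] = 0.5 * recon_mean[defer]             # deferred: shrink
    return imputed                                       # rejected dims stay zero
```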
32 pages, 8881 KB  
Article
WS-R-IR Adapter: A Multimodal RGB–Infrared Remote Sensing Framework for Water Surface Object Detection
by Bin Xue, Qiang Yu, Kun Ding, Mengxin Jiang, Ying Wang, Shiming Xiang and Chunhong Pan
Remote Sens. 2026, 18(8), 1220; https://doi.org/10.3390/rs18081220 - 17 Apr 2026
Abstract
Water surface object detection in shipborne remote sensing is challenged by unstable wave-induced backgrounds, illumination variations, extreme scale changes with tiny objects, and limited annotations. Multimodal RGB–infrared (RGB–IR) sensing leverages complementary visible and infrared cues to enhance robustness. However, most existing RGB–IR methods rely on backbones pretrained on limited-scale data, which constrains their performance in complex water surface scenes. In this work, we propose the WS-R-IR Adapter, a parameter-efficient vision foundation model (VFM)-based framework for shipborne RGB–IR object detection. Instead of full fine-tuning, it adapts frozen VFM representations via lightweight task-specific designs. The WS-R-IR Adapter includes (1) a water scene domain-aware modal adapter that progressively guides frozen backbone features with evolving semantic cues, (2) a parallel multi-scale structural perception module for fine-grained, scale-sensitive modeling, (3) an adaptive RGB–IR feature modulation fusion strategy, and (4) a resolution-aligned context semantic and structural detail fusion module. Moreover, we introduce an object-guided global-to-local registration framework to address dynamic cross-modal misalignment, and construct modality-aligned PoLaRIS-DET and ASV-RI-DET datasets that cover diverse water surface scenes. On the two datasets, the proposed method achieves mAP@0.5:0.95 scores of 74.2% and 50.2%, respectively, significantly outperforming existing methods with only 11.9M additional parameters. These results demonstrate the effectiveness of parameter-efficient VFM adaptation for multimodal water surface remote sensing.
(This article belongs to the Section Remote Sensing Image Processing)
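The general mechanism this entry builds on, parameter-efficient adaptation of a frozen backbone, can be sketched as a residual bottleneck adapter. The module, names, and sizes below are hypothetical placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)

    def forward(self, x):
        # Residual connection: frozen features pass through unchanged while
        # the small trainable branch adds a task-specific correction.
        return x + self.up(torch.nn.functional.gelu(self.down(x)))

backbone = nn.Linear(256, 256)      # stand-in for one frozen VFM layer
for p in backbone.parameters():
    p.requires_grad = False         # only the adapter is trained
adapter = BottleneckAdapter(256)
out = adapter(backbone(torch.randn(4, 256)))
```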
21 pages, 1194 KB  
Article
Environment-Aware Proactive Beam Prediction in mmWave V2I via Multi-Modal Prior Mask Map
by Changpeng Zhou and Youyun Xu
Sensors 2026, 26(8), 2488; https://doi.org/10.3390/s26082488 - 17 Apr 2026
Abstract
In millimeter wave V2I communication systems, accurate beam prediction is crucial for optimizing network performance and improving signal transmission efficiency. Traditional beam prediction methods mainly rely on single-modal data, which often fails to capture the comprehensive environmental information required for high-accuracy prediction. In contrast, multi-modal approaches leverage complementary information from different data sources and offer a more promising solution. However, many existing fusion methods primarily depend on real-time sensory inputs and do not fully exploit stable environmental features in V2I scenarios, limiting the effective use of each modality. To address these limitations, this paper proposes an environment-aware proactive beam prediction method based on a multi-modal prior mask map (MMPMM), which integrates offline mapping with an online beam prediction network. Specifically, the method fuses information from images, point clouds, positions, and the MMPMM to predict the optimal beam index. The MMPMM provides channel-related prior information by extracting static V2I scene features offline without incurring any additional online measurement overhead. Experimental results on real-world datasets demonstrate that the proposed method achieves a Top-3 beam prediction accuracy of up to 71.23% while maintaining stable performance under the evaluated dynamic and degraded conditions, demonstrating its effectiveness in the considered scenarios.
(This article belongs to the Special Issue 6G Communication and Edge Intelligence in Wireless Sensor Networks)
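For reference, a Top-3 accuracy like the 71.23% quoted above is conventionally computed by counting a sample as correct when the true beam index falls among the k highest-scoring beams; the shapes below are illustrative.

```python
import numpy as np

def topk_accuracy(scores, labels, k=3):
    # scores: (N, num_beams) predicted beam scores; labels: (N,) true indices
    topk = np.argsort(scores, axis=1)[:, -k:]
    return np.mean([labels[i] in topk[i] for i in range(len(labels))])
```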
29 pages, 3416 KB  
Article
Enhancing Collaborative AI Learning: A Blockchain-Secured, Edge-Enabled Platform for Multimodal Education in IIoT Environments
by Ahsan Rafiq, Eduard Melnik, Alexey Samoylov, Alexander Kozlovskiy and Irina Safronenkova
Big Data Cogn. Comput. 2026, 10(4), 123; https://doi.org/10.3390/bdcc10040123 - 17 Apr 2026
Abstract
As industries deploy more connected devices in factories, warehouses, and smart facilities, the need for artificial intelligence (AI) systems that can operate securely in distributed, data-intensive environments is growing. Traditional centralized learning and online education platforms struggle when students and systems have to process real-time streams (sensors, video, text) with strict latency and privacy requirements. To address this challenge, a blockchain-secured, edge-enabled multimodal federated learning framework tailored for Industrial IoT (IIoT) environments is proposed. The model integrates four key layers: (i) a blockchain layer that provides credentialing, transparency, and token-based incentives; (ii) a multimodal community layer that supports group formation, peer consensus, and cross-modal learning across text, images, audio, and sensor data; (iii) an edge computing layer that enables low-latency task offloading and secure training within Intel SGX enclaves; and (iv) a data layer that applies pre-processing, differential privacy, and synthetic augmentation to safeguard sensitive information. Experiments on industrial multimodal datasets demonstrate 42% faster model aggregation, 78.9% multimodal accuracy, and 1.9% accuracy loss under ε = 1.0 differential privacy. These results indicate a scalable and practical path for decentralized AI training in next-generation IIoT systems and suggest that such a platform can technically support educational processes, although its pedagogical effectiveness still requires validation.
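Two generic ingredients named in this abstract, federated averaging and Gaussian-noise differential privacy, can be sketched as follows. The clipping bound and noise scale are placeholders, and the paper's blockchain and SGX layers are not modeled here.

```python
import numpy as np

def dp_clip_and_noise(update, clip=1.0, sigma=0.5, rng=None):
    # Clip the update to bound sensitivity, then add calibrated Gaussian noise.
    rng = rng or np.random.default_rng(0)
    clipped = update * min(1.0, clip / (np.linalg.norm(update) + 1e-12))
    return clipped + rng.normal(0.0, sigma * clip, update.shape)

def fedavg(updates, sizes):
    # Weighted average of client updates by local dataset size.
    total = sum(sizes)
    return sum(u * (n / total) for u, n in zip(updates, sizes))

agg = fedavg([dp_clip_and_noise(np.random.randn(10)) for _ in range(5)],
             sizes=[100, 80, 120, 90, 110])
```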
25 pages, 5736 KB  
Article
Photogrammetry–Polarimetry Fusion for 3D Structural Edge Extraction and Physics-Guided Classification
by Mohammad Saadatseresht, Hossein Arefi and Fatemeh Torkamandi
J. Sens. Actuator Netw. 2026, 15(2), 33; https://doi.org/10.3390/jsan15020033 - 16 Apr 2026
Abstract
The accurate interpretation of structural edges requires distinguishing geometry-driven discontinuities from reflectance- and illumination-induced variations. Conventional photogrammetric pipelines rely primarily on radiometric and geometric cues, which often lack physical interpretability under complex material and lighting conditions. This study proposes a photogrammetry–polarimetry fusion framework for physics-guided semantic classification of 3D structural edges. Radiometric, geometric, and polarimetric features are integrated within a noise-normalized representation to enable modality-independent interpretation. A rule-based classification scheme is introduced to assign edges to physically meaningful categories, including geometric, material, specular, illumination, and polarization-driven phenomena. The method is evaluated on a calibrated geometric object and a cultural heritage statue. Results show that polarization provides complementary information that reduces ambiguity between geometry-driven and reflectance-driven edge responses while preserving the underlying reconstructed geometry. On the calibrated dataset, edge detection achieves 88.4% precision, 95.5% recall, and an F1-score of approximately 0.92. Multi-view integration further improves the completeness of geometry-dominant 3D edges. The proposed framework introduces a physics-guided semantic sensing layer for multi-modal 3D perception, enabling more robust and interpretable structural analysis in photogrammetric workflows.
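As a quick sanity check, the reported F1-score follows directly from the stated precision and recall (F1 is their harmonic mean):

```python
p, r = 0.884, 0.955
f1 = 2 * p * r / (p + r)   # ≈ 0.918, consistent with the reported ~0.92
```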
55 pages, 1671 KB  
Article
Multimodal Large Language Model-Based Explainable Boosting Machine Analysis for Interpretation of State-of-Health Prediction of Lithium-Ion Batteries
by Jaehyeok Lee, Jaeseung Lee and Jehyeok Rew
Electronics 2026, 15(8), 1675; https://doi.org/10.3390/electronics15081675 - 16 Apr 2026
Abstract
Accurate prediction of the state of health (SOH) of lithium-ion batteries is essential for ensuring the safety and reliability of electric vehicles and energy storage systems. While machine learning (ML)-based models have demonstrated strong predictive performance, their limited interpretability remains a major challenge for deployment in safety-critical applications. Although explainable boosting machines (EBMs) provide an interpretable alternative through their additive structure, existing studies still rely on manual analysis of model outputs, which restricts scalability and reproducibility. To address this limitation, this study proposes a structured interpretation framework that integrates EBMs with multimodal large language models (MLLMs). The proposed framework employs EBMs to generate SOH predictions along with global feature importance and variable-level score-density visualizations. These outputs are subsequently processed by an MLLM to perform automated interpretation at both global and variable levels, followed by aggregation, cross-validation, and generation of a unified interpretation report. Experiments were conducted on a lithium-ion battery degradation dataset and the EBM achieved competitive predictive performance compared to baseline ML models. In addition, the quality of the generated interpretations was evaluated using both an MLLM-as-a-Judge and a user study. The evaluation results show that the generated interpretations consistently achieved high scores, with average ratings exceeding 4.5 out of 5 across key criteria such as interpretation accuracy and faithfulness, as assessed by both independent MLLMs and domain experts. These results demonstrate that the proposed framework enables reliable and scalable interpretation of battery SOH prediction models, providing a practical solution for explainable artificial intelligence in battery health management.
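The EBM side of such a framework can be sketched with the `interpret` package: fit an explainable boosting machine and extract the global term importances that would be handed (together with score plots) to an MLLM. The feature names and data are invented placeholders, and the attribute names assume a recent `interpret` release.

```python
import numpy as np
from interpret.glassbox import ExplainableBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 1.0 - 0.05 * X[:, 0] + 0.02 * X[:, 1] + rng.normal(0, 0.01, 300)  # fake SOH

ebm = ExplainableBoostingRegressor(
    feature_names=["cycle_count", "avg_temp", "internal_resistance", "capacity"])
ebm.fit(X, y)
# Global feature importances, one input to the automated interpretation step.
print(dict(zip(ebm.term_names_, ebm.term_importances())))
```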
30 pages, 2314 KB  
Article
Confidence-Aware Gated Multimodal Fusion for Robust Temporal Action Localization in Occluded Environments
by Masato Takami and Tomohiro Fukuda
Sensors 2026, 26(8), 2454; https://doi.org/10.3390/s26082454 - 16 Apr 2026
Abstract
In industrial environments, robust Temporal Action Localization (TAL) is essential; however, frequent occlusions often compromise the reliability of skeletal data, leading to negative transfer in multimodal fusion. To address this challenge, we propose a Gated Skeleton Refinement Module (Gated SRM), a universal front-end preprocessing module that explicitly incorporates OpenPose confidence scores into the network architecture. By applying these scores as a logarithmic bias within a self-attention mechanism, our method achieves soft suppression—dynamically attenuating the attention weights assigned to unreliable joints—before adaptively fusing the refined skeletal features with RGB representations through a learnable gating network. Extensive experiments on the heavily occluded IKEA ASM dataset demonstrate that our approach effectively prevents the catastrophic accuracy degradation typical of naive and established multimodal fusion strategies, improving the mean Average Precision (mAP) to 21.77%, maintaining parity with the RGB-only baseline while demonstrating superior robustness. Furthermore, the system maintains a practical end-to-end inference speed of approximately 9.2 frames per second (FPS), which is sufficient for monitoring macro-level industrial workflows. By prioritizing confidence-based data selection over data restoration, this sensor-metadata-driven architecture offers a robust and principled fail-safe for real-world action recognition under occlusion.
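The core mechanism described above, confidence entering self-attention as a logarithmic additive bias, can be sketched in a few lines: since log c approaches minus infinity as c approaches 0, attention to low-confidence joints is softly suppressed. Head counts, shapes, and the gating network are omitted; this is a minimal sketch, not the paper's module.

```python
import torch

def confidence_biased_attention(q, k, v, conf, eps=1e-6):
    # q, k, v: (B, J, D) per-joint features; conf: (B, J) confidences in [0, 1]
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (B, J, J)
    scores = scores + torch.log(conf + eps).unsqueeze(1)    # bias each key column
    return torch.softmax(scores, dim=-1) @ v
```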
17 pages, 2884 KB  
Article
From Real-World Practice to an Ideal Rehabilitation Pathway in Osteoarthritis: A Delphi Consensus on Patient Itineraries
by Helena Bascuñana-Ambrós, Alex Trejo-Omeñaca, Carlos Cordero-García, Sergio Fuertes-González, Juan Ignacio Castillo-Martín, Michelle Catta-Preta, Jan Ferrer-Picó, Josep Maria Monguet-Fierro and Jacobo Formigo-Couceiro
J. Clin. Med. 2026, 15(8), 3047; https://doi.org/10.3390/jcm15083047 - 16 Apr 2026
Abstract
Background: Care for knee osteoarthritis (KOA) is frequently fragmented, and pathway-level decisions within Physical Medicine and Rehabilitation (PM&R) are influenced by local organizations. The objective of this study was to identify areas of agreement and disagreement among PM&R experts and to translate these into a clinically interpretable, function-oriented care pathway for KOA within rehabilitation services. Methods: A two-round Real-Time Delphi study was conducted using the SmartDelphi web platform. A steering committee of five PM&R physicians developed a 37-item questionnaire covering referral/access, functional and outcome assessment, conservative management, escalation/referral thresholds, and follow-up/discharge. Round 1 was online (SERMEF osteoarthritis working group; 46 invited, 40 completed; 87.0%) with responses collected until 30 April 2025. Round 2 was an in-person, facilitated validation round on 30 May 2025 at the SERMEF Congress (A Coruña; 85 invited, 70 completed; 82.4%). Items were rated on a 6-point Likert scale; consensus strength was defined by interquartile range (IQR): strong (0–1) vs. weak (≥2). No patient-level data were collected; participant characteristics were comparable across rounds, suggesting consensus refinement reflected deliberation rather than panel shifts over time. Results: Consensus supported a longitudinal, function-first pathway that was structured into five phases: entry/referral to PM&R; comprehensive functional assessment using a minimum outcomes dataset (pain VAS/NRS, WOMAC function, quality-of-life scale); multimodal conservative rehabilitation combining exercise/physiotherapy, education/self-management support, and indicated oral/topical therapies; reassessment-guided escalation in non-responders, reserving interventional PM&R techniques, multidisciplinary musculoskeletal pain-unit management, or orthopedic evaluation for persistent pain and/or functional limitation; and longitudinal monitoring with defined discharge criteria. Conclusions: SERMEF PM&R experts converged on an implementation-oriented, outcomes-driven KOA itinerary centred on functioning, conservative multimodal care, structured reassessment, and explicit discharge planning.
(This article belongs to the Section Clinical Rehabilitation)
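The consensus rule stated in the abstract reduces to a one-liner: strong consensus when the interquartile range of the 6-point Likert ratings is 0–1, weak when it is 2 or more. The sample ratings are invented.

```python
import numpy as np

def consensus_strength(ratings):
    q1, q3 = np.percentile(ratings, [25, 75])
    return "strong" if (q3 - q1) <= 1 else "weak"

print(consensus_strength([5, 5, 6, 5, 6, 5]))   # strong (IQR 0.75)
print(consensus_strength([2, 3, 5, 6, 4, 1]))   # weak (IQR 2.5)
```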
40 pages, 3667 KB  
Review
Deep Learning Methods for SAR and Optical Image Fusion: A Review
by Chengyan Guo, Zhiyuan Zhang, Kexin Huang, Lan Luo, Ziqing Yang, Shuyun Shi and Junpeng Shi
Remote Sens. 2026, 18(8), 1196; https://doi.org/10.3390/rs18081196 - 16 Apr 2026
Abstract
Synthetic Aperture Radar (SAR) and optical image fusion technology plays a crucial role in remote sensing applications. It effectively combines the high spatial resolution and rich spectral information of optical images with the all-weather and penetrating observation advantages of SAR images, thereby significantly enhancing image interpretation accuracy and task execution capabilities. This paper systematically reviews deep learning-based fusion methods for SAR and optical images, with a particular focus on recent advances in deep learning models. Furthermore, it summarizes commonly used evaluation metrics for assessing fusion image quality, providing a basis for comparing and analyzing the performance of different methods. In addition, commonly used SAR-optical fusion datasets are briefly reviewed to highlight their roles in algorithm development and performance evaluation. Unlike conventional review articles, this paper further analyzes the guidance and supporting role of fusion algorithms from the perspective of typical and specific applications. Finally, it identifies key challenges and issues faced by current fusion methods, including data registration, model lightweight design, and multimodal feature alignment, and offers perspectives on future research directions. This review aims to provide a roadmap and references for the development of SAR and optical image fusion technology.
16 pages, 904 KB  
Article
AI-Based Quantification of Botulinum Neurotoxin-Induced Facial Changes: Wrinkle Reduction, Region-Specific Effects, and Functional Correlates of Facial Muscle Activity
by Ibrahim Güler, Armin Kraus, Gerrit Grieb and Henrik Stelling
Toxins 2026, 18(4), 188; https://doi.org/10.3390/toxins18040188 - 15 Apr 2026
Abstract
Botulinum neurotoxin (BoNT) treatment outcomes are commonly assessed through visual evaluation of facial wrinkle patterns, a process that remains inherently subjective despite structured grading systems. This study evaluated whether contemporary multimodal artificial intelligence (AI) systems can identify facial changes associated with BoNT treatment, using region-specific wrinkle patterns as surrogate markers of underlying muscle activity. A dataset of 46 facial images (23 pre-treatment, 23 post-treatment) was analyzed using four multimodal models, each assessed across five independent runs. Models were tasked with classifying treatment state from single images, detecting wrinkle presence in the forehead, glabella, and periorbital regions, and generating exploratory severity scores and age estimates. Two models achieved 100% accuracy in distinguishing pre- from post-treatment images in this dataset, while region-specific wrinkle detection was variable and frequently did not exceed majority-class baselines. Inter-run reliability varied substantially across models. Exploratory wrinkle severity scores showed directional differences between treatment states, whereas apparent age estimates demonstrated minimal systematic variation. These findings suggest that global facial changes associated with BoNT treatment appear to be detectable in model outputs, but region-specific assessment remains limited, underscoring the need for cautious interpretation and further validation.
(This article belongs to the Special Issue Study on Botulinum Toxin in Facial Diseases and Aesthetics)
32 pages, 1120 KB  
Article
Ontology-Guided Multimodal Framework for Explainable Music Similarity and Recommendation
by Mikhail Rumiantcev
Big Data Cogn. Comput. 2026, 10(4), 122; https://doi.org/10.3390/bdcc10040122 - 15 Apr 2026
Abstract
Analyzing music similarity in large catalogs is challenging because people perceive music differently and important details are found in audio, text, and metadata. This article introduces a multimodal framework that uses an ontology to make music similarity and recommendation more explainable. The framework brings together learned features from audio, lyrics, and other text with structured metadata in a shared similarity space, and then improves ranking with a music ontology that captures relationships between songs, artists, genres, and moods. The design works with any encoder that creates fixed-size features. This study uses strong neural audio and text encoders, mainly based on transformers. This approach allows the system to handle different input types while staying reliable across datasets. This study tests the framework on several open music and audio datasets using content-based retrieval tasks and standard ranking measures. In addition to Configurations C1–C4, this study includes an external content-based reference baseline based on conventional MIR audio descriptors. This baseline represents a signal-level retrieval approach that models complementary aspects of the audio signal, such as timbre, harmony, and spectral characteristics, and is evaluated under the same retrieval protocol as the main framework. It is included to provide an external comparison point outside the proposed C1–C4 design. Compared to audio-only and non-ontological variants within the same framework, the proposed multimodal and ontology-guided configurations achieve better precision, recall, and mean average precision, and also cover more rare content. Visualizations and case studies show that combining different data types and using ontology-based reranking can improve performance and make results easier to interpret. This work lays the groundwork for explainable, cognitively informed music recommendation systems and points to future work in modeling user behavior over time and adapting to different cultures.
(This article belongs to the Section Cognitive System)
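A toy sketch of ontology-guided reranking of the kind described: embedding similarity is combined with a bonus for candidates linked to the query in the ontology (shared artist, genre, mood, and so on). The linear combination and weight are illustrative, not the paper's scheme.

```python
import numpy as np

def ontology_rerank(sim, ontology_link, weight=0.2):
    # sim: (N,) cosine similarity of candidate tracks to the query track
    # ontology_link: (N,) 0/1 flag for an ontology relation to the query
    return np.argsort(-(sim + weight * ontology_link))

order = ontology_rerank(np.array([0.90, 0.70, 0.65]), np.array([0, 1, 1]))
```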
30 pages, 711 KB  
Article
Artificial Intelligence-Driven Multimodal Sensor Fusion for Complex Market Systems via Federated Transformer-Based Learning
by Lei Shi, Mingran Tian, Yinfei Yi, Xinyi Hu, Xiaoya Wang, Yating Yang and Manzhou Li
Sensors 2026, 26(8), 2418; https://doi.org/10.3390/s26082418 - 15 Apr 2026
Abstract
In highly digitalized and networked modern trading systems, large volumes of heterogeneous data are continuously generated from multiple sources during market operations. However, due to the complexity of data structures, significant differences in temporal scales, and constraints imposed by data privacy protection, traditional single-source modeling approaches are unable to fully exploit multisource information. To address this issue, a federated multimodal prediction framework for complex market systems, termed Federated Market-Sensor Transformer (FMST), is proposed. In this framework, data originating from different information sources are uniformly modeled as multimodal time series. A multimodal market-sensor representation module is constructed to perform unified feature encoding, and a cross-modal Transformer fusion architecture is employed to characterize dynamic interaction relationships among different information sources. Meanwhile, a federated collaborative learning mechanism is introduced during the training phase, enabling multiple data nodes to perform collaborative model optimization without sharing raw data. In this manner, data privacy can be preserved while improving the cross-region generalization capability of the model. Systematic experimental evaluation is conducted on the constructed multimodal market-sensor dataset. The experimental results demonstrate that the proposed method consistently outperforms traditional statistical models and deep learning approaches across multiple evaluation metrics. In the main prediction experiment, FMST achieves a root mean square error (RMSE) of 0.1136, a mean absolute error (MAE) of 0.0832, and a coefficient of determination R² of 0.8517, while the direction prediction accuracy reaches 74.56%, clearly outperforming baseline models including ARIMA, LSTM, Temporal CNN, Transformer, and FedAvg-LSTM. In the cross-region generalization experiment, FMST maintains strong performance, achieving an RMSE of 0.1242, an MAE of 0.0908, an R² value of 0.8261, and a direction prediction accuracy of 72.48%. The ablation study further indicates that the three core components—multimodal market-sensor representation, cross-modal Transformer fusion, and federated collaborative learning—each make important contributions to the overall model performance. These experimental findings demonstrate that the proposed method can effectively integrate multisource market information and significantly enhance the prediction capability for complex market dynamics, providing a new technical pathway for the application of artificial intelligence-driven multimodal sensing systems in economic data analysis.
(This article belongs to the Special Issue Artificial Intelligence-Driven Sensing)
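For reference, the four evaluation metrics quoted above are computed the standard way; this is generic evaluation code, not the FMST model itself.

```python
import numpy as np

def eval_forecast(y_true, y_pred):
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    r2 = 1 - (err ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()
    # Direction accuracy: fraction of steps where predicted and true moves agree.
    dir_acc = np.mean(np.sign(np.diff(y_pred)) == np.sign(np.diff(y_true)))
    return rmse, mae, r2, dir_acc
```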
23 pages, 9927 KB  
Article
A Relative Orbital Motion-Guided Framework for Generating Multimodal Visual Data of Spacecraft
by Wanyun Li, Yurong Huo, Qinyu Zhu, Yao Lu, Yuqiang Fang and Yasheng Zhang
Remote Sens. 2026, 18(8), 1177; https://doi.org/10.3390/rs18081177 - 15 Apr 2026
Abstract
The advancement of on-orbit servicing and space debris removal missions has established high-precision visual perception for non-cooperative spacecraft as a key research focus. However, the availability of high-quality, diverse spacecraft image datasets is severely limited due to extreme on-orbit imaging conditions, data confidentiality, and morphological diversity of targets, significantly constraining the advancement of data-driven algorithms in this domain. To address this challenge, we propose a relative orbital motion-guided framework for generating multimodal visual data of spacecraft. The proposed method integrates an orbital dynamics model into the synthetic data generation pipeline to simulate typical relative motion patterns between the camera and the target in a realistic orbital environment, thereby generating image sequences characterized by continuous spatiotemporal evolution. Targeting four representative spacecraft—Tiangong, Spacedragon, ICESat, and Cassini—this work simultaneously generates a dataset comprising 8000 samples, each containing four strictly aligned modalities: RGB images, instance segmentation masks, depth maps, and surface normal maps, along with precise 6-degree-of-freedom (6-DoF) pose ground truth. Furthermore, an end-to-end physical image degradation model is developed to accurately simulate the complete imaging chain—from optical diffraction and aberrations to sensor sampling and noise—thereby effectively narrowing the domain gap between synthetic and real data. By addressing three key aspects—physical motion modeling, synchronous multimodal ground truth, and imaging degradation simulation—this work provides a crucial data foundation for training, testing, and validating data-driven on-orbit perception algorithms.
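A compressed sketch of a physical degradation chain of the kind described: optical blur, then photon (shot) noise, then sensor read noise. A Gaussian PSF stands in for the paper's diffraction and aberration model, and all constants are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(img, psf_sigma=1.5, photons=1000, read_noise=0.01, seed=0):
    rng = np.random.default_rng(seed)
    blurred = gaussian_filter(img, psf_sigma)                       # optics
    shot = rng.poisson(np.clip(blurred, 0, 1) * photons) / photons  # shot noise
    return shot + rng.normal(0.0, read_noise, img.shape)            # read noise

clean = np.full((64, 64), 0.5)
noisy = degrade(clean)
```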
38 pages, 588 KB  
Review
A Unified Information Bottleneck Framework for Multimodal Biomedical Machine Learning
by Liang Dong
Entropy 2026, 28(4), 445; https://doi.org/10.3390/e28040445 - 14 Apr 2026
Abstract
Multimodal biomedical machine learning increasingly integrates heterogeneous data sources (including medical imaging, multi-omics profiles, electronic health records, and wearable sensor signals) to support clinical diagnosis, prognosis, and treatment response prediction. Despite strong empirical performance, most existing multimodal systems lack a principled theoretical foundation for understanding why fusion improves prediction, how information is distributed across modalities, and when models can be trusted under incomplete or shifting data. This paper develops a unified information-theoretic framework that formalizes multimodal biomedical learning as an information optimization problem. We formulate multimodal representation learning through the information bottleneck principle, deriving a variational objective that balances predictive sufficiency against informational compression in an architecture-agnostic manner. Building on this foundation, we introduce information-theoretic tools for decomposing modality contributions via conditional mutual information, quantifying redundancy and synergy, and diagnosing fusion collapse. We further show that robustness to missing modalities can be cast as an information consistency problem and extend the framework to longitudinal disease modeling through transfer entropy and sequential information bottleneck objectives. Applications to multimodal foundation models, uncertainty quantification, calibration, and out-of-distribution detection are developed. Empirical case studies across three biomedical datasets (TCGA breast cancer multi-omics, TCGA glioma clinical-plus-molecular data, and OASIS-2 longitudinal Alzheimer’s data) show that the framework’s key quantities are computable and interpretable on real data: MI decomposition identifies modality dominance and redundancy; the VMIB traces a compression–prediction tradeoff in the information plane; entropy-based selective prediction raises accuracy from 0.787 to 0.939 at 50% coverage; transfer entropy reveals stage-dependent modality influence in disease progression; and pretraining/adaptation diagnostics distinguish efficient from wasteful fine-tuning strategies. Together, these results develop entropy and mutual information as organizing principles for the design, analysis, and evaluation of multimodal biomedical AI systems.
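The entropy-based selective prediction reported above (accuracy 0.787 to 0.939 at 50% coverage) follows a standard recipe that can be sketched directly: rank test samples by predictive entropy, keep only the most confident fraction (the coverage), and measure accuracy on the kept subset.

```python
import numpy as np

def selective_accuracy(probs, labels, coverage=0.5):
    # probs: (N, C) predicted class probabilities; labels: (N,) true classes
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    keep = np.argsort(entropy)[: int(len(labels) * coverage)]
    return (probs[keep].argmax(axis=1) == labels[keep]).mean()
```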