Search Results (35)

Search Parameters:
Keywords = low-rank multi-modal fusion

32 pages, 22420 KB  
Article
FuDensityNet: Occlusion-Aware Multimodal Activation for Robust Object Detection
by Zainab Ouardirhi, Mostapha Zbakh, Mohammed Benjelloun and Sidi Ahmed Mahmoudi
Electronics 2026, 15(13), 2783; https://doi.org/10.3390/electronics15132783 (registering DOI) - 24 Jun 2026
Abstract
Accurate object detection remains a major challenge in autonomous systems and surveillance, particularly when objects are partially or fully obscured by occlusions. To address this issue, we revisit FuDensityNet as a multimodal detection framework that jointly leverages 2D RGB images and 3D LiDAR point clouds for robust feature representation. The model integrates spatial and depth cues through low-rank tensor fusion (LRTF) and incorporates an Occlusion Rate (OR) assessment module that estimates the degree of occlusion and dynamically selects the most suitable detection pathway to preserve performance. Experiments on the KITTI and NuScenes datasets indicate that this adaptive strategy improves robustness under high occlusion while maintaining competitive accuracy in less challenging conditions. In particular, FuDensityNet attains 76.6% AP for car detection under “Hard” conditions on KITTI and outperforms several RGB-only and RGB–LiDAR baselines. Owing to its adaptive and modular design, FuDensityNet remains compatible with both 2D and 3D detection pipelines, making it a practical option for real-world environments where visual obstructions are frequent. Full article
(This article belongs to the Special Issue Computer Vision and Machine Learning: Real-World Applications)
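The abstract above names low-rank tensor fusion but gives no formulation. The following is a minimal NumPy sketch of the general low-rank fusion idea: the full outer-product fusion tensor is replaced by rank-R modality-specific factors. It is not the authors' implementation; the function name, argument names, and shapes are assumptions chosen for illustration.

```python
import numpy as np


def low_rank_fusion(z_rgb, z_lidar, factors_rgb, factors_lidar):
    """Fuse two feature vectors using rank-R modality-specific factors.

    factors_* has shape (R, d_m + 1, d_out). The extra row pairs with a
    constant 1 appended to each feature vector, so unimodal terms survive
    the element-wise product. All names and shapes here are hypothetical.
    """
    a = np.append(z_rgb, 1.0)      # (d_rgb + 1,)
    b = np.append(z_lidar, 1.0)    # (d_lidar + 1,)
    # Project each modality once per rank component.
    proj_a = np.einsum("i,rio->ro", a, factors_rgb)    # (R, d_out)
    proj_b = np.einsum("i,rio->ro", b, factors_lidar)  # (R, d_out)
    # Multiply across modalities, then sum over the rank dimension.
    return (proj_a * proj_b).sum(axis=0)               # (d_out,)
```

The result matches contracting the two augmented vectors against the full fusion tensor built from the same factors, but the full tensor, whose size grows with the product of the modality dimensions, is never materialized.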

30 pages, 719 KB  
Article
A Multimodal Sensor-Based Self-Supervised Learning Framework for Low-Noise System State Prediction and Anomaly Detection
by Kexin Guo, Jingwen Wang, Jiayu Lin, Ningjing Chen, Hengyuan Chen, Zilang Zhou and Manzhou Li
Sensors 2026, 26(12), 3851; https://doi.org/10.3390/s26123851 - 17 Jun 2026
Viewed by 201
Abstract
To address the challenges of strong signal noise, pronounced cross-modal asynchrony, high subjectivity in manually defined state labels, and insufficient model stability under extreme abnormal conditions in multi-source sensor systems, a low-noise system state prediction and anomaly detection method based on multimodal sensor signals and self-supervised representation learning is proposed. Environmental sensing data, device status data, network transmission data, operational behavior data, and event log data are uniformly modeled as system state perception signals. A temporal masking-based state structure modeling method, a state-oriented contrastive learning representation constraint mechanism, and a state representation and downstream prediction task alignment strategy are designed to learn stable, transferable, and interpretable system state features. Experimental results demonstrate that the proposed method achieves the best performance in multimodal sensor state prediction and anomaly detection tasks, with mean squared error (MSE), mean absolute error (MAE), and root mean square error (RMSE) values of 0.0167, 0.0856, and 0.1291, respectively, outperforming baseline models such as GARCH, MLP, LSTM, TCN, and Transformer. Meanwhile, IC, RankIC, and AUC reach 0.494, 0.460, and 0.815, respectively, indicating stronger state-ranking capability and improved discrimination between high-abnormality and low-abnormality states. At the classification recognition level, superior accuracy, precision, recall, and F1-score are also achieved by the proposed method, suggesting that potential abnormal states can be identified more accurately. Ablation experiments verify the effectiveness of multimodal fusion, temporal masking modeling, self-supervised contrastive constraints, and task alignment strategies. 
Robustness experiments further show that lower prediction errors and higher AUC can still be maintained under high-fluctuation and extreme-shock states, demonstrating strong noise resistance, stability, and practical application potential in complex sensor system scenarios. Full article

42 pages, 2864 KB  
Article
A Hybrid Nonlinear Greater Cane Rat Algorithm with Teaching–Learning-Based Optimization for Global Optimization and Constrained Engineering Applications
by Jinzhong Zhang, Hongkai Li, Tan Zhang and Zhen He
Biomimetics 2026, 11(6), 397; https://doi.org/10.3390/biomimetics11060397 - 4 Jun 2026
Viewed by 227
Abstract
The greater cane rat algorithm (GCRA) represents an emerging swarm intelligence paradigm derived from the instinctual survival patterns exhibited by greater cane rats (GCRs), which simulates the typical male-dominated survival patterns of the GCR species, including rainy-season mating and reproduction behaviors, dry-season behavioral differentiation of solitary males and clustered females, and their nonlinear adaptive foraging characteristics. Nevertheless, the original GCRA suffers from inherent defects in complex and high-dimensional optimization scenarios, encompassing premature convergence phenomena, inadequate local exploitation proficiency, constrained convergence precision, and a proneness to stagnation at local optima, which severely restrict its practical engineering application. To address the aforementioned limitations, this work introduces an enhanced hybrid variant of the greater cane rat algorithm, amalgamated with Teaching-and-Learning-Based Optimization (TLBO) and designated as the TLGCRA, incorporating three pivotal targeted innovations. Specifically, the TLGCRA innovatively introduces the two-stage teacher–student interactive learning mechanism of TLBO on the basis of retaining the core evolutionary and behavioral characteristics of the original GCRA, which effectively compensates for the insufficient local disturbance capability of the original algorithm and enriches population diversity to avoid local optimum stagnation. Furthermore, an adaptive parameter tuning strategy is innovatively designed and embedded in the iterative optimization process, which dynamically balances the global exploration and local exploitation capabilities of the algorithm, fundamentally improving the low learning efficiency and weak mining performance of the GCRA. A suite of computational simulations is conducted across 23 canonical benchmark functions and six representative constrained engineering design optimization scenarios. 
The introduced TLGCRA is benchmarked against the canonical GCRA, LPSO, and ten cutting-edge metaheuristic approaches. Empirical outcomes substantiate that the TLGCRA attains marked performance advantages in terms of convergence velocity, solution precision, and algorithmic resilience. In particular, the optimized design effectively improves the optimal solution precision of the algorithm in complex multimodal function optimization, and the standard deviation of multiple independent runs in six engineering application cases is close to zero, verifying its excellent stability. Statistical verification employing the Friedman test and Wilcoxon signed-rank test additionally corroborates that the TLGCRA exhibits statistically robust and dependable optimization efficacy. In summary, the proposed innovative fusion strategies endow the TLGCRA with stronger environmental adaptability and comprehensive optimization performance, enabling it to realize faster convergence speed and higher computational accuracy, as well as outstanding stability and robustness, thus furnishing a viable resolution framework for intricate constrained engineering optimization challenges. Full article
(This article belongs to the Section Biological Optimisation and Management)

22 pages, 3888 KB  
Article
Multimodal Learning for Integrity Classification of Building Foundation Piles Using Low-Strain Reflection Testing
by Qi-Ling Luo, Cang Chen, Ming-Chao Li, Gan-Lin Feng and Gao-Xiang Tang
Buildings 2026, 16(11), 2126; https://doi.org/10.3390/buildings16112126 - 26 May 2026
Viewed by 174
Abstract
Low-strain reflection testing is widely used for the rapid screening of pile integrity, but its interpretation still relies heavily on manual judgment. This study proposes a dual representation learning framework for classifying the integrity of building foundation piles from low-strain testing records. A dataset containing 1139 piles from engineering projects was established and divided into four integrity classes. Each record was represented in two complementary forms: structured features extracted from engineering parameters and waveform characteristics, and a redrawn waveform image generated from coordinate point data. Support vector machine (SVM), random forest (RF), and convolutional neural network (CNN) models were used as single modality baselines, and their performance was compared with that of a multimodal neural network (MNN) trained on paired structured and image inputs. The multimodal model achieved the highest overall accuracy on the main evaluation subset, reaching 84.65%, whereas the random forest achieved the best Macro-Recall and Macro-F1. This result suggests that multimodal fusion mainly improved overall robustness rather than consistently enhancing performance across all classes. Clearly intact piles and severely defective piles were easier to identify, whereas Class II remained the most difficult category because of its borderline signal characteristics. In the supplementary external validation set, the same ranking of model performance was observed, and the multimodal model achieved an accuracy of 85%. These results indicate that the proposed framework has strong potential for computer-assisted screening of building foundation piles. Full article
(This article belongs to the Section Construction Management, and Computers & Digitization)

20 pages, 37476 KB  
Article
In-Orbit MapAnything: An Enhanced Feed-Forward Metric Framework for 3D Reconstruction of Non-Cooperative Space Targets Under Complex Lighting
by Yinxi Lu, Hongyuan Wang, Qianhao Ning, Ziyang Liu, Yunzhao Zang, Zhen Liao and Zhiqiang Yan
Sensors 2026, 26(7), 2026; https://doi.org/10.3390/s26072026 - 24 Mar 2026
Cited by 1 | Viewed by 727
Abstract
Precise 3D reconstruction of non-cooperative space targets is a prerequisite for active debris removal and on-orbit servicing. However, this task is impeded by severe environmental challenges. Specifically, the limited dynamic range of visible light cameras leads to frequent overexposure or underexposure under extreme space lighting. Compounded by sparse textures and strong specular reflections, these factors significantly constrain reconstruction accuracy. While existing general-purpose feed-forward models such as MapAnything offer efficient inference, their geometric recovery capabilities degrade sharply when facing significant domain shifts. To address these issues, this paper proposes an enhanced 3D reconstruction framework tailored for the space environment named In-Orbit MapAnything. First, to mitigate data scarcity, we construct a high-quality space target dataset incorporating extreme illumination characteristics, which provides comprehensive auxiliary modalities including accurate camera poses and dense point clouds. Second, we propose the SatMap-Adapter module to mitigate feature degradation caused by severe specular reflections. This architecture employs a hierarchical cascade sampling strategy to align multi-level backbone features and utilizes a lightweight adaptive fusion module to dynamically integrate shallow photometric cues, intermediate structural information, and deep semantic features. Finally, we employ a weight-decomposed low-rank adaptation strategy to achieve parameter-efficient fine-tuning while strictly freezing the pre-trained backbone. Experimental results demonstrate that the proposed method decreases the absolute relative error and Chamfer distance by 15.23% and 20.02% respectively compared to the baseline MapAnything model, while maintaining a rapid inference speed. 
The proposed approach effectively suppresses reconstruction noise on metallic surfaces and recovers fine geometric structures, validating the effectiveness of our feature-enhanced framework in extreme space environments. Full article

26 pages, 977 KB  
Article
KE-MLLM: A Knowledge-Enhanced Multi-Sensor Learning Framework for Explainable Fake Review Detection
by Jiaying Chen, Jingyi Liu, Yiwen Liang and Mengjie Zhou
Appl. Sci. 2026, 16(6), 2909; https://doi.org/10.3390/app16062909 - 18 Mar 2026
Cited by 1 | Viewed by 754
Abstract
The proliferation of fake reviews on e-commerce and social platforms has severely undermined consumer trust and market integrity, necessitating robust and interpretable real-time detection mechanisms with multi-sensor data fusion capabilities. While traditional machine learning approaches have shown promise in identifying fraudulent reviews, they often lack transparency and fail to leverage the rich contextual knowledge embedded in large-scale datasets. In this paper, we propose KE-MLLM (Knowledge-Enhanced Multimodal Large Language Model), a unified framework that integrates knowledge-enhanced prompting with parameter-efficient fine-tuning for explainable fake review detection. Our approach employs LoRA (Low-Rank Adaptation) to fine-tune lightweight large language models (LLaMA-3-8B) on review text, while incorporating multimodal behavioral sensor signals including temporal patterns, user metadata, and social network characteristics for comprehensive anomaly sensing. To address the critical need for interpretability in fraud detection systems, we implement a Chain-of-Thought (CoT) reasoning module that generates human-understandable explanations for classification decisions, highlighting linguistic anomalies, sentiment inconsistencies, and behavioral red flags. We enhance the model’s discriminative capability through a knowledge distillation strategy that transfers domain-specific expertise from larger teacher models while maintaining computational efficiency suitable for edge sensing devices. Extensive experiments on two benchmark datasets—YelpChi and Amazon Reviews from the DGL Fraud Dataset—show that KE-MLLM achieves strong performance, reaching an F1-score of 94.3% and an AUC-ROC of 96.7% on YelpChi and outperforming the strongest baseline in our comparison by 5.8 and 4.2 percentage points, respectively. 
Furthermore, human evaluation indicates that the generated explanations achieve 89.5% consistency with expert annotations, suggesting that the framework can improve the interpretability and practical usefulness of automated fraud detection systems. The proposed framework provides a useful step toward more accurate and interpretable fake review detection and offers a practical reference for building more transparent and accountable AI systems in high-stakes applications. Full article

33 pages, 4366 KB  
Article
Structured and Factorized Multi-Modal Representation Learning for Physiological Affective State and Music Preference Inference
by Wenli Qu and Mu-Jiang-Shan Wang
Symmetry 2026, 18(3), 488; https://doi.org/10.3390/sym18030488 - 12 Mar 2026
Cited by 1 | Viewed by 635
Abstract
Emotions and affective responses are core intervention targets in music therapy. Through acoustic elements, music can evoke emotional responses at physiological and neurological levels, influencing cognition and behavior while providing an important dimension for evaluating therapeutic efficacy. However, emotions are inherently abstract and difficult to represent directly. Artificial intelligence models therefore provide a promising tool for modeling and quantifying such abstract affective states from physiological signals. In this paper, we propose a structured and explicitly factorized multi-modal representation learning framework for joint affective state and preference inference. Instead of entangling heterogeneous dynamics within monolithic encoders, the framework decomposes representation learning into cross-channel interaction modeling and intra-channel temporal–spectral organization modeling. The framework integrates electroencephalography (EEG), peripheral physiological signals (GSR, BVP, EMG, respiration, and temperature), and eye-movement data (EOG) within a unified temporal modeling paradigm. At its core, a Dynamic Token Feature Extractor (DTFE) transforms raw time series into compact token representations and explicitly factorizes representation learning into (i) explicit channel-wise cross-series interaction modeling and (ii) temporal–spectral refinement via learnable frequency-domain gating. These complementary structural modules are implemented through Cross-Series Intersection (CSI) and Intra-Series Intersection (ISI), which perform low-rank channel dependency learning and adaptive spectral modulation, respectively. A hierarchical cross-modal fusion strategy integrates modality-level tokens in a representation-consistent and interaction-aware manner, enabling coordinated modeling of neural, autonomic, and attentional responses. The entire framework is optimized under a unified multi-task objective for valence, arousal, and liking prediction. 
Experiments on the DEAP dataset demonstrate consistent improvements over state-of-the-art methods. The model achieves 98.32% and 98.45% accuracy for valence and arousal prediction, 97.96% for quadrant classification in single-task evaluation, and 92.8%, 91.8%, and 93.6% accuracy for valence, arousal, and liking in joint multi-task settings. Overall, this work establishes a structure-aware and factorized multi-modal representation learning framework for robust affective decoding and intelligent music therapy systems. Full article
(This article belongs to the Section Computer)

29 pages, 14346 KB  
Article
LRCFuse: Infrared and Visible Image Fusion Based on Low-Rank Representation and Convolutional Sparse Learning
by Jingjing Liu, Yujie Zhu, Yuhao Zhang, Aiying Guo, Mengjiao Li and Jianhua Zhang
Sensors 2026, 26(6), 1771; https://doi.org/10.3390/s26061771 - 11 Mar 2026
Cited by 1 | Viewed by 564
Abstract
With the development of cross-modal image fusion in multi-sensor systems, current fusion technologies have made significant progress in feature extraction, facilitating more effective image analysis. However, insufficient fusion information may degrade the correlation between the source and fused images, often resulting in the omission of critical features from the original modalities. Therefore, in order to preserve as much information as possible, especially for the complete extraction of effective feature information in source images, this paper proposes a new cross-modal image fusion method based on low-rank representation and convolutional sparse learning named LRCFuse. Firstly, the learned low-rank representation (LLRR) blocks are employed to perform dimensionality reduction on the source images while simultaneously extracting their low-rank and sparse feature components. Nevertheless, considering that the low-rank representation has insufficient modeling ability for different modal images, we introduce common feature preservation module (CFPM) blocks based on convolutional sparse coding. By leveraging the CFPM module, LRCFuse recovers common features from both source images to mitigate the loss caused by the imperfect assumptions of low-rank representation. Based on this, a multi-level optimization strategy incorporating pixel loss, shallow-level loss, mid-level loss, deep-level loss, and Sobel loss is proposed to hierarchically learn and refine diverse image features. Quantitative and qualitative evaluations are conducted across various datasets, revealing that LRCFuse can effectively detect infrared salient targets, preserve additional details in visible images, and achieve better fusion results for subsequent downstream tasks. Full article
(This article belongs to the Special Issue Machine Learning in Image/Video Processing and Sensing)

20 pages, 2326 KB  
Article
A LLaMA-Based Efficient Fine-Tuning Method for Image Captioning Using Multi-Feature Dynamic Prompts
by Yongyang Yin, Hengyu Cao, Chunsheng Zhang, Faxun Jin, Xin Liu and Jun Lin
Appl. Sci. 2026, 16(4), 1857; https://doi.org/10.3390/app16041857 - 12 Feb 2026
Viewed by 638
Abstract
To address the trade-off between parameter scale and generation quality in Vision-Language Models (VLMs), this study proposes a Multi-Feature Dynamic Instruction Tuning (MFDIT) image captioning model based on LLaMA. By integrating CLIP-based global features with SAM-derived local features, the model constructs a multi-level visual representation. Additionally, a Dynamic Prompt Adapter is designed to enable cross-modal semantic alignment with adaptive flexibility. Combined with a Low-Rank Adaptation (LoRA) fine-tuning strategy, the proposed method enhances the model’s capability in describing diverse images while training only 20 million parameters, accounting for merely 0.05% of the total parameter volume. Experimental results demonstrate that the model achieves a CIDEr score of 126.7 on the MSCOCO dataset, surpassing traditional adapter-based approaches by 3.0 points. Moreover, in the MME Benchmark evaluation, the proposed model outperforms the mainstream LLaMA-Adapter V2 by 7.3% and 3.8% in OCR and object counting tasks, respectively. Ablation studies further validate the synergistic effects of multi-feature fusion and dynamic instruction optimization. This research provides an efficient solution for parameter-efficient multimodal model training and potential deployment in resource-constrained environments. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
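Several abstracts in this listing rely on Low-Rank Adaptation (LoRA). The sketch below illustrates the general mechanism only, not any listed paper's code: a frozen weight matrix receives a trainable low-rank update scaled by alpha / rank, which is why the trainable share of parameters stays small. The class name, layer sizes, and initialization scale are illustrative assumptions.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable rank-r update (illustrative only)."""

    def __init__(self, weight, rank, alpha, rng):
        d_out, d_in = weight.shape
        self.weight = weight                                 # frozen, never updated
        self.A = rng.normal(scale=0.01, size=(rank, d_in))   # trainable
        self.B = np.zeros((d_out, rank))                     # trainable, zero at start
        self.scale = alpha / rank

    def __call__(self, x):
        # Base projection plus the scaled low-rank correction B @ A.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_fraction(self):
        trainable = self.A.size + self.B.size
        return trainable / (trainable + self.weight.size)


rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
layer = LoRALinear(W, rank=4, alpha=8.0, rng=rng)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly. In this toy 64 x 64 case, 512 of 4608 parameters are trainable; the fraction shrinks further as the frozen matrix grows relative to the rank.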

32 pages, 3217 KB  
Review
Architecting the Orthopedical Clinical AI Pipeline: A Review of Integrating Foundation Models and FHIR for Agentic Clinical Assistants and Digital Twins
by Assiya Boltaboyeva, Zhanel Baigarayeva, Baglan Imanbek, Bibars Amangeldy, Nurdaulet Tasmurzayev, Kassymbek Ozhikenov, Zhadyra Alimbayeva, Chingiz Alimbayev and Nurgul Karymsakova
Algorithms 2026, 19(2), 99; https://doi.org/10.3390/a19020099 - 27 Jan 2026
Cited by 1 | Viewed by 1270
Abstract
The exponential growth of multimodal orthopedic data, ranging from longitudinal Electronic Health Records to high-resolution musculoskeletal imaging, has rendered manual analysis insufficient. This has established Large Language Models (LLMs) as algorithmically necessary for managing healthcare complexity. However, their deployment in high-stakes surgical environments presents a fundamental algorithmic paradox: while generic foundation models possess vast reasoning capabilities, they often lack the precise, protocol-driven domain knowledge required for safe orthopedic decision support. This review provides a structured synthesis of the emerging algorithmic frameworks required to build modern clinical AI assistants. We deconstruct current methodologies into their core components: large-language-model adaptation, multimodal data fusion, and standardized data interoperability pipelines. Rather than proposing a single proprietary architecture, we analyze how recent literature connects specific algorithmic choices such as the trade-offs between full fine-tuning and Low-Rank Adaptation to their computational costs and factual reliability. Furthermore, we examine the theoretical architectures required for ‘agentic’ capabilities, where AI systems integrate outputs from deep convolutional neural networks and biosensors. The review concludes by outlining the unresolved challenges in algorithmic bias, security, and interoperability that must be addressed to transition these technologies from research prototypes to scalable clinical solutions. Full article
(This article belongs to the Special Issue Artificial Intelligence Algorithms for Healthcare: 2nd Edition)

31 pages, 7858 KB  
Article
Domain-Adapted MLLMs for Interpretable Road Traffic Accident Analysis Using Remote Sensing Imagery
by Bing He, Wei He, Qing Chang, Wen Luo and Lingli Xiao
ISPRS Int. J. Geo-Inf. 2026, 15(1), 8; https://doi.org/10.3390/ijgi15010008 - 21 Dec 2025
Cited by 2 | Viewed by 1084
Abstract
Traditional road traffic accident analysis has long relied on structured data, making it difficult to integrate high-dimensional heterogeneous information such as remote sensing imagery and leading to an incomplete understanding of accident scene environments. This study proposes a road traffic accident analysis framework based on Multimodal Large Language Models. The approach integrates high-resolution remote sensing imagery with structured accident data through a three-stage progressive training pipeline. Specifically, we fine-tune three open-source vision–language models using Low-Rank Adaptation (LoRA) to sequentially optimize the model’s capabilities in visual environmental description, multi-task accident classification, and Chain-of-Thought (CoT) driven causal reasoning. A multimodal dataset was constructed containing remote sensing image descriptions, accident classification labels, and interpretable reasoning chains. Experimental results show that the fine-tuned model achieved a maximum improvement in the CIDEr score for image description tasks. In the joint classification task of accident severity and duration, the model achieved an accuracy of 71.61% and an F1-score of 0.8473. In the CoT reasoning task, both METEOR and CIDEr scores improved significantly. These results validate the effectiveness of structured reasoning mechanisms in multimodal fusion for transportation applications, providing a feasible path toward interpretable and intelligent analysis for real-world traffic management. Full article
(This article belongs to the Topic Artificial Intelligence Models, Tools and Applications)

19 pages, 5931 KB  
Article
Vascular-Aware Multimodal MR–PET Reconstruction for Early Stroke Detection: A Physics-Informed, Topology-Preserving, Adversarial Super-Resolution Framework
by Krzysztof Malczewski
Appl. Sci. 2025, 15(22), 12186; https://doi.org/10.3390/app152212186 - 17 Nov 2025
Cited by 1 | Viewed by 963
Abstract
Rapid and reliable identification of large vessel occlusions and critical stenoses is essential for guiding treatment in acute ischemic stroke. Conventional MR angiography (MRA) and PET protocols are constrained by trade-offs among acquisition time, spatial resolution, and motion tolerance. A multimodal MR–PET angiography reconstruction framework is introduced that integrates joint Hankel-structured sparsity with topology-preserving multitask learning to overcome these limitations. High-resolution time-of-flight MRA and perfusion-sensitive PET volumes are reconstructed from undersampled data using a cross-modal low-rank Hankel prior coupled to a super-resolution generator optimized with adversarial, perceptual, and pixel-wise losses. Vesselness filtering and centerline continuity terms enforce preservation of fine arterial topology, while learned k-space and sinogram sampling concentrate measurements within vascular territories. Motion correction, blind deblurring, and modality-specific denoising are embedded to improve robustness under clinical conditions. A multitask output head estimates occlusion probability, stenosis localization, and collateral flow, with hypoperfusion mapping generated for dynamic PET. Evaluation on clinical and synthetically undersampled MR–PET studies demonstrated consistent improvements over MR-only, PET-only, and conventional fusion methods. The framework achieved higher image quality (MRA PSNR gains up to 3.7 dB and SSIM improvements of 0.042), reduced vascular topology breaks by over 20%, and improved large vessel occlusion detection by nearly 10% in AUROC, while maintaining at least a 40% reduction in sampling. These findings demonstrate that embedding vascular-aware priors within a joint Hankel–sparse MR–PET framework enables accelerated acquisition with clinically relevant benefits for early stroke assessment. Full article

24 pages, 5484 KB  
Article
TFI-Fusion: Hierarchical Triple-Stream Feature Interaction Network for Infrared and Visible Image Fusion
by Mingyang Zhao, Shaochen Su and Hao Li
Information 2025, 16(10), 844; https://doi.org/10.3390/info16100844 - 30 Sep 2025
Viewed by 863
Abstract
As a key technology in multimodal information processing, infrared and visible image fusion holds significant application value in fields such as military reconnaissance, intelligent security, and autonomous driving. To address the limitations of existing methods, this paper proposes the Hierarchical Triple-Feature Interaction Fusion Network (TFI-Fusion). Based on a hierarchical triple-stream feature interaction mechanism, the network achieves high-quality fusion through a two-stage, separate-model processing approach: In the first stage, a single model extracts low-rank components (representing global structural features) and sparse components (representing local detail features) from source images via the Low-Rank Sparse Decomposition (LSRSD) module, while capturing cross-modal shared features using the Shared Feature Extractor (SFE). In the second stage, another model performs fusion and reconstruction: it first enhances the complementarity between low-rank and sparse features through the innovatively introduced Bi-Feature Interaction (BFI) module, realizes multi-level feature fusion via the Triple-Feature Interaction (TFI) module, and finally generates fused images with rich scene representation through feature reconstruction. This separate-model design reduces memory usage and improves operational speed. Additionally, a multi-objective optimization function is designed based on the network’s characteristics. Experiments demonstrate that TFI-Fusion exhibits excellent fusion performance, effectively preserving image details and enhancing feature complementarity, thus providing reliable visual data support for downstream tasks. Full article

40 pages, 2568 KB  
Review
Intelligent Edge Computing and Machine Learning: A Survey of Optimization and Applications
by Sebastián A. Cajas Ordóñez, Jaydeep Samanta, Andrés L. Suárez-Cetrulo and Ricardo Simón Carbajo
Future Internet 2025, 17(9), 417; https://doi.org/10.3390/fi17090417 - 11 Sep 2025
Cited by 16 | Viewed by 11537
Abstract
Intelligent edge machine learning has emerged as a paradigm for deploying smart applications across resource-constrained devices in next-generation network infrastructures. This survey addresses the critical challenges of implementing machine learning models on edge devices within distributed network environments, including computational limitations, memory constraints, and energy-efficiency requirements for real-time intelligent inference. We provide comprehensive analysis of soft computing optimization strategies essential for intelligent edge deployment, systematically examining model compression techniques including pruning, quantization methods, knowledge distillation, and low-rank decomposition approaches. The survey explores intelligent MLOps frameworks tailored for network edge environments, addressing continuous model adaptation, monitoring under data drift, and federated learning for distributed intelligence while preserving privacy in next-generation networks. Our work covers practical applications across intelligent smart agriculture, energy management, healthcare, and industrial monitoring within network infrastructures, highlighting domain-specific challenges and emerging solutions. We analyze specialized hardware architectures, cloud offloading strategies, and distributed learning approaches that enable intelligent edge computing in heterogeneous network environments. The survey identifies critical research gaps in multimodal model deployment, streaming learning under concept drift, and integration of soft computing techniques with intelligent edge orchestration frameworks for network applications. 
These gaps directly manifest as open challenges in balancing computational efficiency with model robustness due to limited multimodal optimization techniques, developing sustainable intelligent edge AI systems arising from inadequate streaming learning adaptation, and creating adaptive network applications for dynamic environments resulting from insufficient soft computing integration. This comprehensive roadmap synthesizes current intelligent edge machine learning solutions with emerging soft computing approaches, providing researchers and practitioners with insights for developing next-generation intelligent edge computing systems that leverage machine learning capabilities in distributed network infrastructures. Full article

26 pages, 13544 KB  
Article
GeoJapan Fusion Framework: A Large Multimodal Model for Regional Remote Sensing Recognition
by Yaozong Gan, Guang Li, Ren Togo, Keisuke Maeda, Takahiro Ogawa and Miki Haseyama
Remote Sens. 2025, 17(17), 3044; https://doi.org/10.3390/rs17173044 - 1 Sep 2025
Cited by 1 | Viewed by 2188
Abstract
Recent advances in large multimodal models (LMMs) have opened new opportunities for multitask recognition from remote sensing images. However, existing approaches still face challenges in effectively recognizing the complex geospatial characteristics of regions such as Japan, where its location along the seismic belt leads to highly diverse urban environments and cityscapes that differ from those in other regions. To overcome these challenges, we propose the GeoJapan Fusion Framework (GFF), a multimodal architecture that integrates a large language model (LLM) and a vision–language model (VLM) and strengthens multimodal alignment ability through an in-context learning mechanism to support multitask recognition for Japanese remote sensing images. The GFF also incorporates a cross-modal feature fusion mechanism with low-rank adaptation (LoRA) to enhance representation alignment and enable efficient model adaptation. To facilitate the construction of the GFF, we construct the GeoJapan dataset, which comprises a substantial collection of high-quality Japanese remote sensing images, designed to facilitate multitask recognition using LMMs. We conducted extensive experiments and compared our method with state-of-the-art LMMs. The experimental results demonstrate that GFF outperforms previous approaches across multiple tasks, demonstrating its promising ability for multimodal multitask remote sensing recognition. Full article
(This article belongs to the Special Issue Remote Sensing Image Classification: Theory and Application)
