This is an early access version, the complete PDF, HTML, and XML versions will be available soon.
Review

A Review of Multimodal Image Feature Fusion Technology and Application

1 School of Computer and Information Engineering, Fuyang Normal University, Fuyang 236037, China
2 Anhui Engineering Research Center for Intelligent Computing and Information Innovation, Fuyang Normal University, Fuyang 236037, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(11), 5290; https://doi.org/10.3390/app16115290
Submission received: 15 April 2026 / Revised: 15 May 2026 / Accepted: 18 May 2026 / Published: 25 May 2026

Abstract

Multimodal image fusion integrates complementary information from heterogeneous sensors and has become a core technology for complex perception systems such as autonomous driving, remote sensing monitoring, and medical diagnosis. The field is evolving rapidly, driven in particular by Mamba architectures, generative diffusion models, and vision foundation models (VFMs), and traditional classification schemes no longer fully capture the ongoing paradigm shifts. Following the PRISMA guidelines to support the objectivity and reproducibility of the findings, this paper presents a systematic literature review, with structured data extraction, of multimodal image feature fusion. Within this standardized framework, a five-dimensional decoupled classification framework is proposed that characterizes models along five axes: fusion hierarchy, backbone architecture, fusion operator, supervision paradigm, and deployment constraints. Specifically, the analysis highlights the linear-complexity efficiency of Mamba in long-sequence modeling, the high-fidelity reconstruction that diffusion models achieve through generative priors, and the universal semantic alignment achieved by VFMs. The study also summarizes qualitative and quantitative evaluation metrics, together with cross-domain public datasets for performance benchmarking, and discusses critical future directions, including cross-modal alignment in complex environments, parameter-efficient fine-tuning of large models, and real-time inference at the edge.
Keywords: multimodal image fusion; feature fusion; PRISMA guidelines; Mamba architecture; vision foundation models (VFMs)

Share and Cite

MDPI and ACS Style

Cao, P.; Zhao, Y.; Duan, T.; Li, L.; Xian, C.; Li, S. A Review of Multimodal Image Feature Fusion Technology and Application. Appl. Sci. 2026, 16, 5290. https://doi.org/10.3390/app16115290

Chicago/Turabian Style

Cao, Pingping, Yuting Zhao, Tao Duan, Linguo Li, Chaole Xian, and Shujing Li. 2026. "A Review of Multimodal Image Feature Fusion Technology and Application" Applied Sciences 16, no. 11: 5290. https://doi.org/10.3390/app16115290
