A Review of Multimodal Image Feature Fusion Technology and Application

Cao, Pingping; Zhao, Yuting; Duan, Tao; Li, Linguo; Xian, Chaole; Li, Shujing

doi:10.3390/app16115290

Open Access Review

by Pingping Cao ¹,², Yuting Zhao ¹, Tao Duan ¹, Linguo Li ¹,², Chaole Xian ¹ and Shujing Li ¹,*

¹ School of Computer and Information Engineering, Fuyang Normal University, Fuyang 236037, China

² Anhui Engineering Research Center for Intelligent Computing and Information Innovation, Fuyang Normal University, Fuyang 236037, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(11), 5290; https://doi.org/10.3390/app16115290

Submission received: 15 April 2026 / Revised: 15 May 2026 / Accepted: 18 May 2026 / Published: 25 May 2026

Abstract

Multimodal image fusion has emerged as a core technology for complex perception systems, such as autonomous driving, remote sensing monitoring, and medical diagnosis, because it integrates complementary information from heterogeneous sensors. The field is evolving rapidly, driven in particular by the emergence of Mamba architectures, generative diffusion models, and Vision Foundation Models (VFMs), and traditional classification schemes no longer fully capture the ongoing paradigm shifts. This paper presents a systematic literature review of multimodal image feature fusion, with literature screening and data extraction conducted according to the PRISMA guidelines to support the objectivity and reproducibility of the findings. Within this standardized framework, a five-dimensional decoupled classification framework is proposed that characterizes models along fusion hierarchy, backbone architecture, fusion operator, supervision paradigm, and deployment constraints. Specifically, the analysis highlights the linear computational complexity of Mamba in long-sequence modeling, the high-fidelity reconstruction that diffusion models achieve via generative priors, and the universal semantic alignment achieved by VFMs. Furthermore, this study summarizes qualitative and quantitative evaluation metrics alongside cross-domain public datasets for performance benchmarking, and discusses critical future directions, including cross-modal alignment in complex environments, parameter-efficient fine-tuning of large models, and real-time inference at the edge.

Keywords: multimodal image fusion; feature fusion; PRISMA guidelines; Mamba architecture; vision foundation models (VFMs)
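
As a minimal sketch of how the five classification dimensions named in the abstract could be recorded per surveyed model, the following Python snippet defines one record type and a grouping helper. The class name `FusionModelProfile`, the helper `group_by`, and all example values are hypothetical illustrations, not categories or models taken from the review.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FusionModelProfile:
    """One surveyed model described along the five dimensions of the abstract.

    The field names mirror the abstract; the example values in the comments
    are assumed labels for illustration only.
    """
    fusion_hierarchy: str        # e.g. "feature-level" (assumed label)
    backbone_architecture: str   # e.g. "Mamba", "diffusion", "VFM"
    fusion_operator: str         # e.g. "cross-attention" (assumed label)
    supervision_paradigm: str    # e.g. "self-supervised" (assumed label)
    deployment_constraints: str  # e.g. "edge, real-time" (assumed label)


def group_by(profiles, dimension):
    """Group profiles by the value they take on one named dimension."""
    valid = {f.name for f in fields(FusionModelProfile)}
    if dimension not in valid:
        raise ValueError(f"unknown dimension: {dimension}")
    groups = {}
    for profile in profiles:
        groups.setdefault(getattr(profile, dimension), []).append(profile)
    return groups


# Two hypothetical entries; they do not describe any specific published model.
example_a = FusionModelProfile(
    "feature-level", "Mamba", "gated sum", "unsupervised", "edge, real-time"
)
example_b = FusionModelProfile(
    "feature-level", "diffusion", "cross-attention", "unsupervised", "offline"
)

by_backbone = group_by([example_a, example_b], "backbone_architecture")
```

Keeping the dimensions as independent fields reflects the decoupling idea: two models can share a fusion hierarchy while differing in backbone or deployment constraints, and each dimension can be queried on its own.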
