Hyperspectral–Polarization–LiDAR Multimodal Image Fusion Method for Few-Shot Scenarios

Yin, Yunlong; Li, Guanlin; Sun, Hongyu; Wang, Jiayu; Zhang, Jian; Liu, Jianan; Wang, Qi; Li, Yingchao; Shi, Haodong; Chen, Mingce

doi:10.3390/photonics13060540


by Yunlong Yin ^1,2, Guanlin Li ^1, Hongyu Sun ^1,2, Jiayu Wang ^1,2, Jian Zhang ^2, Jianan Liu ^1, Qi Wang ^1,2, Yingchao Li ^1,2, Haodong Shi ^1,* and Mingce Chen ^3

^1 Key Laboratory of Space Optoelectronic Technology, Changchun University of Science and Technology, Changchun 130022, China
^2 School of Optoelectronic Engineering, Changchun University of Science and Technology, Changchun 130022, China
^3 Beijing Institute of Control Engineering, Beijing 100029, China
^* Author to whom correspondence should be addressed.

Photonics 2026, 13(6), 540; https://doi.org/10.3390/photonics13060540

Submission received: 20 April 2026 / Revised: 27 May 2026 / Accepted: 28 May 2026 / Published: 31 May 2026

(This article belongs to the Special Issue Laser as a Detection: From Spectral Imaging to LiDAR for Remote Sensing Applications (2nd Edition))


Abstract

To meet the demand for high-precision target classification in complex scenes, a hyperspectral–polarimetric–LiDAR multimodal image fusion method tailored for few-shot scenarios is proposed. Feature-mapping functions for polarimetric and LiDAR images are constructed, and a multi-scale hierarchical optimization strategy is employed to jointly enhance low- and high-frequency components across modalities. This approach effectively addresses key challenges under limited training data, such as substantial cross-modal dimensional disparities and the difficulty of robust feature extraction and fusion. The proposed algorithm conducts bimodal image fusion on the NWPUSP spectral-polarization dataset and KAIST spectral-depth dataset. Compared with other fusion methods, it achieves average increases of 7.3% and 4.87% in information entropy, 53.18% and 30.35% in standard deviation, 48% and 108.28% in average gradient, as well as 96.25% and 101.13% in spatial frequency, respectively. Moreover, relying on the self-developed integrated hyperspectral-polarization imaging system and commercial LiDAR, we synchronously and efficiently acquire multimodal images including hyperspectral, polarization and LiDAR images of complex ground object scenes. Comparative experiments are implemented against six other mainstream fusion algorithms. The objective evaluation results show that the average improvements reach 7.19% in information entropy, 46.85% in standard deviation, 76.62% in average gradient and 79.74% in spatial frequency, which notably enhances the feature retention capability of fused images. Under few-shot conditions, the target recognition classification accuracy and Kappa coefficient of the fused image are improved by 9.8% and 11.05%, respectively, compared with those of the unimodal hyperspectral image. 
This effectively highlights targets under shadow occlusion and compensates for LiDAR’s response deficiencies to surface textures, achieving complementary advantages of multimodal images for ground object targets in complex scenes. This research provides a new solution for future optical multimodal remote sensing and image fusion.

Keywords: multimodal image fusion; few-shot scenario; hyperspectral remote sensing; target recognition

1. Introduction

Since the 1970s, remote sensing imagery has evolved from “single-modality” to “multi-modality” acquisition. Traditional optical remote sensing has struggled to meet the demands of high-precision urban object recognition and classification in complex scenarios. Among these modalities, hyperspectral imagery contains both spatial and rich spectral information, making it highly valuable for fine-grained object recognition and classification tasks [1]; polarization imagery enhances the contrast between objects and background in shadowed or otherwise challenging environments [2], while LiDAR provides accurate three-dimensional depth information [3]. However, these three types of imagery are rarely acquired simultaneously from the same scene and exhibit significant modality differences, which complicate unified representation. Thus, effectively integrating the rich spectral features of hyperspectral imagery, the unique physical characteristics of polarization imagery, and the precise depth information provided by LiDAR remains a fundamental challenge and central difficulty in current multi-modal remote sensing image fusion.

In multi-modal image fusion tasks, the primary challenge lies in addressing feature-space mismatch and representational differences caused by modality heterogeneity. YIN Jianling et al. [4] innovatively combined low-light polarization imaging with LiDAR point clouds. They constructed a mapping matrix with depth offsets and applied a pseudo-color mapping technique. This method effectively addressed the problem of detecting targets behind transparent media, providing a new technical approach for military reconnaissance and security applications. Athanasia Chroni et al. [5] fused multispectral and LiDAR data using a convolutional neural network (CNN)-based semantic segmentation framework, which significantly improved land-cover classification accuracy in semi-arid regions and offered powerful tools for ecological monitoring. GUO Fengqi et al. [6] proposed a spectral–polarization feature fusion algorithm based on principal component analysis (PCA) and energy weighting. This method not only compresses multidimensional features efficiently into a single image but also incorporates a pseudo-color mapping scheme consistent with human visual characteristics. It lays a foundation for real-time airborne remote sensing processing. Meiyong et al. [7] introduced a vision Transformer-based architecture to construct a dual-fusion multi-modal network. Through self-attention mechanisms, the model effectively captures deep correlations between hyperspectral and LiDAR data. This approach achieves state-of-the-art classification performance on several benchmark datasets. Nevertheless, three-modality datasets acquired from the same scene are scarce. Existing methods are also predominantly deep learning–based, which leads to strong dependence on annotated data and limited adaptability in few-shot scenarios. Moreover, most fusion frameworks are restricted to bimodal combinations, which makes it difficult to fully exploit complementary information across multiple modalities. As a result, the richness of fused information and the adaptability to complex scenes remain limited.

Therefore, this paper proposes a fusion method for hyperspectral, polarization and LiDAR depth images tailored to few-shot scenarios. Feature mapping functions are designed for polarization images and LiDAR depth images, respectively. A multi-scale hierarchical optimization strategy is adopted to reconcile low-frequency structural information with high-frequency details. Finally, the multi-scale fusion results and high-frequency details from the residual layers are combined through weighted averaging to produce the final fused image. To illustrate more clearly the fundamental differences between the proposed method and the aforementioned recent representative works, Table 1 systematically compares the proposed method with a variety of recent multimodal image fusion methods across six dimensions, including modalities, applications, fusion methods, fusion levels, and performance. The proposed method is applied to fuse and recognize spectral bands at different wavelengths, polarization images, and a LiDAR image acquired by the hyperspectral–polarization imager and the LiDAR system. The resulting multimodal fused images achieve significant improvements in both classification accuracy and Kappa coefficient. This study provides an effective solution for improving the overall visual quality of single hyperspectral images under complex environmental influences and for increasing the accuracy of remote sensing object recognition.

Contributions of this paper:

In this paper, we focus on the degree of polarization (DOP) in polarimetric images and the gradient sparsity of LiDAR depth images. We design polarization and depth mapping functions with explicit physical interpretations to directly incorporate physical priors into the fusion process. By utilizing the ground-truth DOP values and depth gradient magnitudes of each pixel, the contrast between highly and weakly polarized targets, as well as between targets at different depths, is adaptively enhanced. Consequently, the discriminability of different targets is improved, leading to better fusion quality and classification performance.
The proposed multi-scale hybrid-norm convex optimization objective function, which operates at the pixel level without relying on data-driven training, exhibits a natural compatibility with the sparse nature of polarimetric and depth features due to its sparse constraint. This compatibility enables it to perform effectively in hyperspectral-polarimetric-LiDAR image fusion tasks.
Different from the aforementioned dual-modal combination methods, this paper, for the first time, jointly models three physical imaging modalities within a unified optimization framework. This enables a trinity of complementary representations for ground objects in terms of “material–texture–geometry.”

2. Multimodal Image Fusion Method

The workflow of the proposed hyperspectral–polarization–LiDAR image fusion method for few-shot scenarios is shown in Figure 1. The process mainly comprises four steps: image registration, polarization computation, design of feature-mapping functions, and multi-scale hierarchical optimization. The input images are the acquired hyperspectral polarization images (HPSI) and LiDAR depth images. To enable cooperative multi-modal information fusion, images from different modalities are first precisely aligned in space via image registration. Next, the Stokes formalism is used to compute the S₀, DOP, and AOP images to obtain spectral–polarization images. Then, to resolve semantic heterogeneity across modalities, physical feature-mapping functions are designed specifically for the polarization and depth images. A multi-scale hierarchical optimization is applied to optimize across all scale layers, formulating the fusion as the minimization of an objective function; finally, the optimization results are combined with residual-layer high-frequency details by weighted fusion to produce the final image containing information from all three modalities.

2.1. Design of Physical Feature Mapping Functions

Different imaging modalities reflect different physical properties of a scene: spectral images reflect reflection intensity, polarization images reveal surface material characteristics, and depth images provide geometric structure information. If fusion is performed directly in the pixel space, the differences between these heterogeneous physical quantities can lead to information confusion and feature conflicts. Therefore, this paper designs feature mapping functions that transform each modality image from the original pixel space into a unified feature representation space with clearer physical significance, where different physical attributes can be harmoniously integrated into the fusion process. This mapping not only preserves the unique advantages of each modality, but also ensures comprehensive information retention from low-frequency structure to high-frequency details through multi-scale hierarchical processing. The resulting fused image complies with both physical laws and visual perception requirements.

Given a registered single-channel spectral image g, the corresponding polarization image p, and the LiDAR depth image d, the final output is the fused image x, which retains the texture details of the spectral image, the polarization characteristics of the polarization image, and the geometric structure of the depth image. The fidelity term E_{1}(x) of the spectral image is given by Equation (1).

E_{1} (x) = ‖ x - g ‖_{a}^{a}

(1)

Due to the complexity of the multi-detector system in HSI acquisition, a complex mixed noise pattern is generated. Thermal agitation and the resulting dark current cause hyperspectral data to be affected by Gaussian noise. The low radiant energy caused by narrow-band splicing further contributes to Gaussian noise [8]. The L₂ norm achieves optimal estimation when the residual x − g follows a Gaussian distribution. Therefore, the spectral fidelity term in our objective function adopts the L₂ norm. This design not only preserves the global luminance contrast of the original hyperspectral image but is also theoretically optimal for Gaussian-distributed residuals.

In Equation (1), x denotes the fused image to be solved for; g denotes the spectral image; and a = 2, i.e., the L₂ norm is adopted. Equation (1) is used to preserve the luminance contrast between the target and the background in the spectral image, without requiring the fused image to have exactly the same pixel intensities as the spectral image.

On the one hand, polarization images usually contain two important physical quantities: the degree of polarization (DOP) and the angle of polarization (AOP). These two quantities provide information about the surface material properties and the geometric shape of an object, respectively. Therefore, in order to extract features related to polarization characteristics, this paper designs a polarization feature mapping function P to associate the polarization image with DOP and AOP, expressed as P(I) = W_{P} \cdot I, where W_{P} is the polarization degree matrix, i.e., the matrix of DOP values at each pixel, and I is the input image. By utilizing the ground-truth DOP value of each pixel of the target, the contrast between highly polarized targets and weakly polarized targets is adaptively enhanced, thereby highlighting the discriminability of different targets and improving the fusion and classification performance. Subsequently, the polarization image and the fused image are each passed through the same feature mapping function to obtain their representations in the polarization feature space. The difference between them is then computed, and the polarization fidelity term E_{2}(x) is expressed as

E_{2} (x) = ‖ P (x) - P (p) ‖_{b}^{b}

(2)

Here, P(p) represents the features extracted from the polarization image, and P(x) represents the corresponding features extracted from the fused image. The surface material and micro-geometric information carried by the degree of polarization and the angle of polarization are inherently local, high-frequency, and discontinuous. Their significant responses are confined to small regions and vanish in uniform areas, exhibiting strong sparsity [9]. The L₁ norm is particularly well-suited to preserve such sparse features during fusion, thus improving material discriminability. Consequently, the L₁ norm is employed, i.e., b = 1.

On the other hand, the depth image reflects depth information. By incorporating the gradient of the depth image into the fusion process, the structural constraints provided by the depth image are preserved. The corresponding feature mapping function is expressed as D(I) = W_{\nabla d} \cdot I, where W_{\nabla d} is the depth gradient matrix, i.e., the matrix of depth gradient magnitudes at each pixel. The resulting depth-information fidelity term E_{3}(x) is given by

E_{3} (x) = ‖ D (x) - D (d) ‖_{c}^{c}

(3)

Here, D(d) denotes the features extracted from the depth image. Geometric edges are the most typical sparse features [10], and the L₁ norm can strictly constrain the edge locations in the fused image to be aligned with the depth map, thereby ensuring the accuracy of the 3D geometric structure and edge sharpness and preventing structural blurring. Therefore, the L₁ norm is selected, i.e., c = 1.

By combining Equations (1)–(3), the fusion problem is transformed into the minimization objective function in Equation (4), which lays the foundation for the subsequent solution of the overall objective function E(x) for multi-scale hierarchical optimization.

E (x) = E_{1} (x) + E_{2} (x) + E_{3} (x) = ‖ x - g ‖_{2}^{2} + ‖ P (x) - P (p) ‖_{1}^{1} + ‖ D (x) - D (d) ‖_{1}^{1}

(4)
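As a rough illustration of Equation (4), the sketch below evaluates the three fidelity terms with NumPy. It assumes that the products W_P · I and W_∇d · I are element-wise, that all inputs are registered floating-point arrays of equal shape, and that the depth gradient magnitude is approximated by finite differences. The function names are illustrative and are not taken from the authors' implementation.

```python
import numpy as np


def depth_gradient_magnitude(d):
    """Per-pixel gradient magnitude of the depth map (finite differences)."""
    gy, gx = np.gradient(d.astype(float))
    return np.hypot(gx, gy)


def fusion_objective(x, g, p, d, dop):
    """Evaluate E(x) = E1(x) + E2(x) + E3(x) as in Equation (4).

    Assumption: the DOP map (W_P) and the depth gradient magnitude
    (W_grad_d) act as element-wise weights on the images.
    """
    w_d = depth_gradient_magnitude(d)
    e1 = np.sum((x - g) ** 2)            # L2 spectral fidelity, a = 2
    e2 = np.sum(np.abs(dop * (x - p)))   # L1 polarization fidelity, b = 1
    e3 = np.sum(np.abs(w_d * (x - d)))   # L1 depth fidelity, c = 1
    return float(e1 + e2 + e3)
```

Because both mappings are linear under this assumption, P(x) − P(p) reduces to W_P · (x − p), so each L₁ term is a weighted absolute difference: pixels with a high DOP or a strong depth edge contribute more to the penalty.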

To verify the rationality of using the L₂ norm for the spectral image and the L₁ norm for the polarization and depth features, we designed an ablation study. All other parameters were fixed, and only the norm combinations in the objective function were varied. Three schemes were compared: pure L₁ norm combination, pure L₂ norm combination, and the proposed mixed L₁ + L₂ norm combination. Objective evaluation metrics were computed on the same test images, and the results are shown in Table 2. As can be seen from Table 2, the pure L₁ norm combination achieves the highest SSIM. However, this may be because the L₁ sparsity constraint excessively modifies the image structure, making it statistically more similar to the source images. Moreover, its AG and SF values are relatively low, indicating blurred images and severe loss of detail. The pure L₂ norm combination yields slightly lower MI than the pure L₁ norm combination, but its overall performance is inferior to that of the proposed mixed norm combination. The proposed mixed norm combination significantly outperforms both the pure L₂ and pure L₁ combinations in terms of AG, SF, and SD—metrics that reflect image clarity and contrast—demonstrating that its fused images have clearer textures and richer information. In summary, the proposed mixed L₂ + L₁ norm achieves high clarity and high contrast without sacrificing too much structural similarity, making it the optimal choice.
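For readers who want to reproduce this kind of comparison, the following sketch computes four of the no-reference metrics used in this paper: information entropy, SD, AG, and SF. Several variants of AG and SF exist in the literature; the forward-difference forms below are one common choice and are an assumption here, not necessarily the exact definitions used for Table 2. Inputs are assumed to be 8-bit grayscale images.

```python
import numpy as np


def information_entropy(img, levels=256):
    """Shannon entropy (bits) of the grey-level histogram of an 8-bit image."""
    hist = np.bincount(img.astype(np.uint8).ravel(), minlength=levels)
    prob = hist[hist > 0] / img.size
    return float(-np.sum(prob * np.log2(prob)))


def standard_deviation(img):
    """Standard deviation of the pixel intensities (global contrast)."""
    return float(np.std(img.astype(float)))


def average_gradient(img):
    """Mean of sqrt((dx^2 + dy^2) / 2) using forward differences."""
    f = img.astype(float)
    dx = f[:-1, 1:] - f[:-1, :-1]
    dy = f[1:, :-1] - f[:-1, :-1]
    return float(np.mean(np.sqrt((dx ** 2 + dy ** 2) / 2.0)))


def spatial_frequency(img):
    """sqrt(RF^2 + CF^2) from row-wise and column-wise first differences."""
    f = img.astype(float)
    rf2 = np.mean((f[:, 1:] - f[:, :-1]) ** 2)
    cf2 = np.mean((f[1:, :] - f[:-1, :]) ** 2)
    return float(np.sqrt(rf2 + cf2))
```

A perfectly flat image scores zero on all four metrics, while sharper edges and a wider grey-level spread raise AG, SF, and SD.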

2.2. Multi-Scale Hierarchical Optimization

This paper adopts multi-scale hierarchical optimization, primarily addressing the issue that a single scale cannot balance low-frequency information and high-frequency details. Fine-scale refinement of high-frequency details effectively avoids interference between different physical attributes during the fusion process, while significantly improving the optimization convergence speed and the visual coherence of the fusion result.

First, this paper performs multi-scale decomposition on the single-channel spectral image g, the corresponding polarization image p, and the LiDAR image d. The overall multi-scale objective function can be expressed as

\min_{\{x_{k}\}_{k = 1}^{K}, x_{res}} \sum_{k = 1}^{K} E_{k} (x_{k}) + E_{res} (x_{res})

(5)

Here, k indexes the scale layers, K is the number of multi-scale layers, x_{k} is the fused image at scale k, and the subscript res denotes the residual layer.

Next, the low-frequency information of the spectrum approximately follows a Gaussian distribution at the coarse scale. In this paper, the L₂ norm is chosen to constrain it in order to achieve optimal fidelity for the overall contrast. On the other hand, all high-frequency detail information representing material properties, geometric edges, and spectral textures naturally exhibits sparsity. The L₁ norm is uniformly applied to constrain these details at the corresponding scales, effectively promoting the sparse preservation and enhancement of these key features. Therefore, the multi-scale objective function with physical feature norm-based constraints can be expressed as

E_{k} (x_{k}) = \begin{cases} ‖ x_{1} - g_{1} ‖_{2}^{2} + ‖ P^{1} (x_{1}) - P^{1} (p_{1}) ‖_{1}^{1} + ‖ D^{1} (x_{1}) - D^{1} (d_{1}) ‖_{1}^{1}, & k = 1 \\ ‖ x_{k} - g_{k} ‖_{1}^{1} + ‖ P^{k} (x_{k}) - P^{k} (p_{k}) ‖_{1}^{1} + ‖ D^{k} (x_{k}) - D^{k} (d_{k}) ‖_{1}^{1}, & k = 2, \ldots, K \end{cases}

(6)

E_{res} (x_{res}) = δ ‖ x_{res} - \frac{g_{res} + p_{res}}{2} ‖_{1}

(7)

Here, g_{k}, p_{k}, and d_{k} represent the image representations at each scale k, and δ is the weight of the residual layer fusion term.

For the strictly convex coarse-scale L₂ norm objective function, this paper uses first-order methods, such as gradient descent, to stably converge to a unique solution. For the convex but not strictly convex problem introduced by the L₁ norm at finer scales, algorithms capable of handling nonsmooth terms, such as the proximal gradient method, are used to ensure reliable fusion results while maintaining convergence efficiency.

Coarse scale: x_{1}^{(t + 1)} = \mathrm{prox}_{η g} (x_{1}^{(t)} - η \nabla f (x_{1}^{(t)}))

(8)

Fine scale: x_{k}^{(t + 1)} = \mathrm{prox}_{η E_{k}} (x_{k}^{(t)})

(9)

Here, x_{1}^{(t)} represents the fused image at the coarse scale after the t-th iteration, η is the step size, \nabla f(x_{1}^{(t)}) is the gradient of the smooth term f of the objective function at the point x_{1}^{(t)}, x_{k}^{(t)} is the fused image at the fine scale after the t-th iteration, and \mathrm{prox}_{η g}(\cdot) and \mathrm{prox}_{η E_{k}}(\cdot) are the proximal operators.

In the objective function E_{k}(x_{k}) of Equation (6), ‖ x_{1} - g_{1} ‖_{2}^{2} is the smooth convex term, and the remaining terms are non-smooth convex terms. For such problems, the most classical solution framework is the Forward-Backward Splitting (FBS) algorithm. Specifically, gradient descent is performed on the smooth convex term as shown in Equation (8), and proximal gradient is applied to the non-smooth convex terms as shown in Equation (9).

The convergence of the FBS algorithm is supported by well-established theory. Assume the gradient of the optimization objective is Lipschitz continuous with constant L_f = 2α. When the step size η at each iteration satisfies 0 < η < 2/L_f, the FBS algorithm described by Equations (8) and (9) is guaranteed to converge to the global optimal solution.

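The iteration at each scale stops when the relative change between two consecutive iterates satisfies

\frac{‖ x^{(t + 1)} - x^{(t)} ‖_{2}}{‖ x^{(t)} ‖_{2}} < ε

(10)

Here, ε denotes a very small positive number and t is the iteration index. Through the above procedure, the iterative process gradually stabilizes and converges, ultimately yielding the global optimal solution.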
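The forward-backward iteration and the stopping rule of Equation (10) can be sketched on a deliberately simplified problem. The code below keeps only one weighted L₁ term, minimizing ‖x − g‖₂² + ‖w ⊙ (x − p)‖₁, so that the proximal step is a shifted soft-thresholding with a closed form. This is a simplification for illustration, not the full three-term objective of this paper, and the function names are hypothetical. The defaults η = 0.2, ε = 10⁻⁴, and T_max = 200 follow the values listed in Algorithm 1.

```python
import numpy as np


def soft_threshold(v, tau):
    """Proximal operator of tau * |.|, applied element-wise."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def fbs_fuse(g, p, w, eta=0.2, eps=1e-4, t_max=200):
    """Minimise ||x - g||_2^2 + ||w * (x - p)||_1 by forward-backward splitting.

    Simplified stand-in for the paper's objective: a single weighted L1
    term is kept so that the backward step has a closed form.
    """
    x = g.astype(float).copy()
    for _ in range(t_max):
        # Forward step: gradient descent on the smooth term ||x - g||^2.
        y = x - eta * 2.0 * (x - g)
        # Backward step: prox of eta * w * |x - p| (soft-threshold around p).
        x_new = p + soft_threshold(y - p, eta * w)
        # Stopping rule of Equation (10): relative change below eps.
        change = np.linalg.norm(x_new - x) / max(np.linalg.norm(x), 1e-12)
        x = x_new
        if change < eps:
            break
    return x
```

The gradient of ‖x − g‖₂² is 2(x − g), whose Lipschitz constant is 2, so any step size below 1 satisfies 0 < η < 2/L_f for this toy problem. For this simplified objective the minimizer is also available in closed form as p + soft_threshold(g − p, w / 2), which makes the iteration easy to check.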

Finally, after completing the hierarchical optimization across all scales, the multi-scale fusion results x_{K}^{*} are linearly combined with the high-frequency details x_{res}^{*} from the residual layer to obtain the fused image x = x_{K}^{*} + x_{res}^{*}.

Here, x_{K}^{*} contains the optimization results from all scales, from coarse to fine, which not only preserves the geometric structural framework provided by the depth data but also integrates the material details from the polarization data and the intensity foundation from the spectral image. The residual term x_{res}^{*} is then weighted and fused to supplement the finest texture components that may have been lost during the earlier optimization process.

This paper, by establishing a multi-scale physical feature space and hierarchical optimization framework, addresses the semantic unification of different modality information in geometric structure, material properties, and intensity distribution. It not only achieves information complementarity at the pixel level but also realizes cross-modal semantic alignment at the physical feature level. This enables the fused image to simultaneously possess the structural accuracy of depth data, the material discriminability of polarization data, and the brightness texture of spectral images, thereby providing a new solution for multimodal image fusion. We summarize the proposed method for fusion in Algorithm 1.

Algorithm 1: Proposed Hyperspectral–Polarization–LiDAR Multimodal Fusion

Input: Hyperspectral image g, polarization image p, LiDAR depth image d, parameters η = 0.2, ε = 10⁻⁴, T_max = 200, ω_res = 0.5
Output: Fused image x
1. Compute DOP and AOP from p; compute depth gradient ∇d.
2. Try each multi-scale level k = 1, 2, 3, 4, 5 and each regularization ratio δ = 10, 20, 30, 40, 50, 60. For each (k, δ) perform steps 3–6, compute the mutual information, structural similarity, and edge preservation of the fused result, and record the overall quality score.
3. Downsample g, DOP, ∇d to k levels to obtain a multi-scale pyramid.
4. Initialize the coarsest-scale fused image as the coarsest spectral image.
5. for scale from coarsest to finest do
6.     repeat
7.         At the coarse scale, use proximal gradient descent.
8.         At the fine scale, use ISTA.
9.     until the relative change between two consecutive results is less than ε or the maximum iterations T_max is reached
10.     Upsample the current scale result as the initial value for the next finer scale.
11. Select the (k, δ) that gave the highest overall quality score as the optimal parameters.
12. Perform the final fusion using the optimal parameters (repeat steps 3–10 once more).
13. Extract residual high-frequency details from the original spectral image and DOP.
14. Combine the finest-scale fusion result with the residual details using weight ω_res to obtain the final fused image x
15. return x
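The pyramid construction, upsampling, and residual extraction steps of Algorithm 1 can be sketched as follows. The paper does not specify the downsampling kernel, so 2 × 2 block averaging and nearest-neighbor upsampling are assumptions made here for simplicity, and all function names are hypothetical.

```python
import numpy as np


def downsample2(img):
    """Halve the resolution by averaging non-overlapping 2x2 blocks."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    f = img[:h, :w].astype(float)
    return f.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def upsample2(img):
    """Double the resolution by pixel replication (nearest neighbour)."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)


def build_pyramid(img, num_levels):
    """Return [coarsest, ..., finest]; the finest level is the input itself."""
    levels = [img.astype(float)]
    for _ in range(num_levels - 1):
        levels.append(downsample2(levels[-1]))
    return levels[::-1]


def residual_detail(img):
    """High-frequency residual: input minus its 2x down/up-sampled version.

    Assumes even image dimensions so that the shapes match.
    """
    f = img.astype(float)
    return f - upsample2(downsample2(f))
```

In a coarse-to-fine loop, the result at one level would be passed through `upsample2` to initialize the next finer level, and `residual_detail` applied to the spectral image and the DOP map would supply the high-frequency targets for the residual layer.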

3. Experimental Results and Analysis

3.1. Data Collection

To validate the effectiveness of the proposed image fusion method, data acquisition was performed using a laboratory-developed hyperspectral polarimetric integrated imaging system and an Aolan Avia commercial LiDAR. Fusion experiments were conducted on real measured datasets. The experimental setup is shown in Figure 2. The polarimetric hyperspectral imager consists of a lens assembly, a prism-grating (PG) dispersive element, and a polarimetric imaging detector. The PG dispersive element combines the high dispersion capability of a grating with the order-sorting function of a prism, enabling non-overlapping, high-resolution spectral dispersion of 2 nm across a broad wavelength range from 400 nm to 900 nm. The imager is mounted on a rotating stage, and a whiskbroom scanning mode is adopted to acquire images of the target scene. The signals are received by a division-of-focal-plane (DoFP) polarimetric detector, which acquires slit image lines at different spatial positions of the target. Each group of four pixels in the polarimetric detector forms one super-pixel, containing polarimetric intensities at four different directions. Subsequently, the slit images from different spatial positions are sequentially arranged to reconstruct different targets, each containing information from different polarization directions. This forms a three-dimensional data cube that integrates spatial, spectral, and polarimetric information. The LiDAR used in this experiment is a pulsed time-of-flight ranging device with an operating wavelength of 905 nm. Its scanning mechanism employs a built-in dual-prism fast scanning system, which generates a non-repetitive scanning pattern through the compound motion of two rotating prisms. The field-of-view coverage increases with integration time, enabling the capture of more detailed spatial information within the scene. 
The emitted laser pulses are directed by the dual-prism scanning system to illuminate the target, and the echo signals are received by an avalanche photodiode (APD) detector. The distance is calculated by measuring the time difference between emission and reception, ultimately generating a three-dimensional point cloud. The point acquisition rate is approximately 720,000 points per second, with a horizontal field of view of 70.4° and a vertical field of view of 77.2° [11].

Through spectro-polarimetric calibration, the core performance parameters of the polarimetric hyperspectral imaging system are characterized. These parameters mainly include the spectral response function, the operating spectral range, the central wavelength and spectral resolution of each channel, the mapping relationship between detector pixel positions and wavelengths, and the system’s response to different polarization states. A single-channel spectral image is reconstructed based on the number of pixels assigned to each channel. At this stage, each 2 × 2 pixel block in the spectral image contains polarization information from four polarization angles. The DOP and angle of polarization (AOP) images for each spectral band are then calculated using the Stokes parameters [12]. The raw LiDAR data consist of a three-dimensional point set, including the spatial coordinates and reflection intensity of each point. To register it with the hyperspectral-polarimetric images, the point cloud is first projected onto the image plane of the hyperspectral polarimetric camera using the Direct Linear Transformation (DLT) algorithm. This projection produces a sparse depth map, where the grayscale value of each pixel represents the distance to the corresponding laser point. A point spread-based nearest neighbor filling method is then applied for depth completion, resulting in a dense depth grayscale map with the same spatial resolution as the hyperspectral image. The experimental setup and procedure are shown in Figure 2, and the system specifications and parameters are summarized in Table 3.
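The projection step described above can be illustrated with a basic DLT estimate. The sketch below fits a 3 × 4 projection matrix to point correspondences by SVD, reports the mean and root-mean-square reprojection errors, and rasterizes the projected points into a sparse depth map. It is a plain, unnormalized DLT written for illustration under the assumption that depth is the Euclidean range from the LiDAR origin; the function names are hypothetical and the depth completion step is not included.

```python
import numpy as np


def estimate_projection_dlt(pts3d, pts2d):
    """Estimate a 3x4 projection matrix from >= 6 point pairs (plain DLT)."""
    rows = []
    for (X, Y, Z), (u, v) in zip(pts3d, pts2d):
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    # Solution: right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    return vt[-1].reshape(3, 4)


def project(P, pts3d):
    """Project Nx3 points to pixel coordinates (u, v) with a 3x4 matrix."""
    homog = np.hstack([pts3d, np.ones((len(pts3d), 1))]) @ P.T
    return homog[:, :2] / homog[:, 2:3]


def reprojection_errors(P, pts3d, pts2d):
    """Return (mean error, RMSE) of the reprojection, in pixels."""
    dist = np.linalg.norm(project(P, pts3d) - pts2d, axis=1)
    return float(dist.mean()), float(np.sqrt(np.mean(dist ** 2)))


def sparse_depth_map(P, pts3d, shape):
    """Rasterise projected points into a depth image (0 = no return)."""
    depth = np.zeros(shape)
    uv = np.rint(project(P, pts3d)).astype(int)
    ranges = np.linalg.norm(pts3d, axis=1)   # assumed: range from LiDAR origin
    for (u, v), r in zip(uv, ranges):
        if 0 <= v < shape[0] and 0 <= u < shape[1]:
            depth[v, u] = r
    return depth
```

With real data, normalizing the coordinates before the SVD generally improves numerical conditioning; that refinement is omitted here to keep the sketch short.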

The advantage of the focal-plane polarization detector lies in its ability to simultaneously acquire spectral, polarization, and spatial information of the target, thereby fundamentally avoiding the complex image registration problem. In contrast, traditional time-division polarization imaging techniques require continuous switching of polarization states for multiple exposures. This approach is not only inefficient but also susceptible to misalignment among images with different polarization states due to target or platform motion. As a result, the subsequent registration process becomes more complex and error-prone. Although amplitude-division polarization imaging can achieve simultaneous acquisition, its optical path is complex and the system is bulky. In addition, multiple optical paths can introduce inherent parallax, resulting in strict registration requirements and potential errors. Focal-plane technology [13], on the other hand, integrates a micro-nano polarization filter array directly into the sensor pixels. In this configuration, each 2 × 2 pixel block forms a super-pixel that captures four polarization states at 0°, 45°, 90°, and 135°. This enables the super-pixel to simultaneously and independently measure light intensity in different polarization directions, i.e., I_{0}, I_{45}, I_{90}, and I_{135}, thereby allowing the computation of polarization parameters such as Stokes components, DOP, and AOP, as shown in Equation (11). This approach enables the acquisition of multi-dimensional information in a single exposure, ensuring strict pixel-wise alignment between spectral and polarization data. Consequently, the data processing pipeline is significantly simplified, while spatial–temporal consistency is effectively maintained. This provides convenience for the spectral-polarization-LiDAR multimodal fusion task in this paper, as it allows the fusion process to focus solely on cross-modal registration between hyperspectral images and LiDAR point clouds, without requiring additional registration of polarization images.

\begin{aligned} S_{0} &= I_{0} + I_{90} \\ S_{1} &= I_{0} - I_{90} \\ S_{2} &= I_{45} - I_{135} \\ \mathrm{DOP} &= \frac{\sqrt{S_{1}^{2} + S_{2}^{2}}}{S_{0}} \\ \mathrm{AOP} &= \frac{1}{2} \arctan \left( \frac{S_{2}}{S_{1}} \right) \end{aligned}

(11)
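Equation (11) maps directly onto array operations. The sketch below splits a DoFP mosaic into its four polarization channels and computes S₀, DOP, and AOP. The arrangement of the four analyzers inside each 2 × 2 super-pixel differs between sensors, so the layout used here is an assumption that must be checked against the actual detector; the function names are likewise illustrative.

```python
import numpy as np


def split_dofp(raw):
    """Split a DoFP mosaic into I0, I45, I90, I135 sub-images.

    Assumed (hypothetical) super-pixel layout:
        [[I90,  I45],
         [I135, I0 ]]
    """
    i90, i45 = raw[0::2, 0::2], raw[0::2, 1::2]
    i135, i0 = raw[1::2, 0::2], raw[1::2, 1::2]
    return [a.astype(float) for a in (i0, i45, i90, i135)]


def stokes_dop_aop(i0, i45, i90, i135, tiny=1e-12):
    """Linear Stokes parameters, DOP and AOP following Equation (11)."""
    s0 = i0 + i90
    s1 = i0 - i90
    s2 = i45 - i135
    # Guard against division by zero in dark pixels.
    dop = np.sqrt(s1 ** 2 + s2 ** 2) / np.maximum(s0, tiny)
    # arctan2 is a quadrant-aware variant of arctan(S2 / S1) that also
    # handles S1 = 0; the resulting AOP lies in (-pi/2, pi/2].
    aop = 0.5 * np.arctan2(s2, s1)
    return s0, dop, aop
```

For light that is fully linearly polarized at 0°, the four channels are I₀ = 1, I₄₅ = I₁₃₅ = 0.5, and I₉₀ = 0, which gives DOP = 1 and AOP = 0.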

Since the DLT algorithm enables image registration without relying on camera intrinsic parameters, this paper adopts the DLT-based registration method [14]. Eight pairs of feature points are selected from the point cloud and image. To evaluate the registration accuracy of the DLT algorithm, the average reprojection error (RME) and the root mean square reprojection error (RMSE) are used as metrics. The RME and RMSE of the registered image are 0.39 pixels and 0.43 pixels, respectively. The average distance of the eight feature points measured by the LiDAR is 40 m. The pixel size of the hyperspectral polarization camera is 3.45 μm, and the system focal length is 48.75 mm. According to Equation (12), the spatial distances corresponding to 0.39 pixels and 0.43 pixels are 0.0011 m and 0.0012 m, respectively.

L = \frac{s \cdot N \cdot Z}{f}

(12)

Here, L represents the spatial width, s is the pixel size, N is the number of pixels occupied, Z is the target detection distance, and f is the system focal length.
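The conversion from reprojection error to spatial distance can be checked with a short Python sketch. The function names are hypothetical, and the definition of the reprojection error as the Euclidean distance between projected and observed image points is an assumption, since the exact definition is not given here. The numerical values are those stated above.

```python
import math

def reprojection_errors(projected, observed):
    """Mean and root-mean-square Euclidean reprojection error, in pixels."""
    dists = [math.hypot(px - ox, py - oy)
             for (px, py), (ox, oy) in zip(projected, observed)]
    mean_err = sum(dists) / len(dists)
    rmse = math.sqrt(sum(d * d for d in dists) / len(dists))
    return mean_err, rmse

def pixels_to_metres(n_pixels, pixel_size_m, distance_m, focal_length_m):
    """Equation (12): L = s * N * Z / f."""
    return pixel_size_m * n_pixels * distance_m / focal_length_m

S = 3.45e-6    # pixel size s: 3.45 um, in metres
Z = 40.0       # detection distance Z, in metres
F = 48.75e-3   # focal length f: 48.75 mm, in metres

for n in (0.39, 0.43):
    print(f"{n} px -> {pixels_to_metres(n, S, Z, F):.4f} m")
```

With these inputs the two errors correspond to about 1.1 mm and 1.2 mm at the target, in agreement with the values reported above.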

In practical visual non-measurement scenarios, subpixel-level registration accuracy is sufficient to ensure reliable feature alignment, and pursuing higher precision would lead to unnecessary computational overhead [15]. Moreover, the primary purpose of the subsequent polarization, spectral, and LiDAR data in this study is to extract the macroscopic material properties, geometric contours, and classification features of the target. This mainly relies on the low-frequency structural consistency among the data. As mentioned earlier, the impact of registration errors mainly manifests as pixel-level misalignment in high-frequency details. However, the fusion algorithm proposed in this paper incorporates multi-scale decomposition of the input data with different levels of granularity. This design provides strong spatial tolerance, effectively mitigating the effects of pixel-level registration errors. Therefore, the current registration accuracy is sufficient to ensure precise alignment of different modalities on the main structure of the target, without significantly affecting the macroscopic features or the classification results after fusion. This conclusion is further supported by the subsequent comparative experimental results. In this study, the complex three-modal registration problem is simplified into a dual-modal registration task. This not only significantly improves computational efficiency and reduces algorithm complexity, but also avoids the accumulation of errors across successive registrations. Overall, this demonstrates the advantages of the proposed method for complex multimodal perception tasks.

The hyperspectral polarization imaging data were collected between 14:00 and 15:00 on 25 September 2025, under clear and windless weather conditions, with a detection distance of approximately 40 m. Prior to image acquisition, a collimated standard radiation source was generated using a parallel light tube and directed toward both the system under test and an irradiance meter. This setup was used to measure the illuminance of the incident light and its corresponding grayscale response. A fitting algorithm was then applied to establish the linear relationship between the illuminance values and grayscale values at different wavelengths. Through radiometric calibration, meaningless digital number (DN) values are converted into radiometric brightness values with physical significance. Standard whiteboard measurements are then used for reflectance calculations, and, finally, the reflectance spectral curves of the acquired ground object targets are obtained. For the ground object recognition task, the spectral curves of different targets in the scene are analyzed to identify significant spectral differences. The ground object targets included slate roads, tanks, gravel piles, vehicles, trees, grasslands, and asphalt roads. The experimental scene and corresponding spectral curves are shown in Figure 3.
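The calibration chain described above can be sketched in Python for a single band. This is a minimal illustration under stated assumptions, not the calibration code used in this work: the response is assumed to be linear (DN = gain × illuminance + offset), the fit is an ordinary least-squares line, and reflectance is taken as the ratio of the calibrated target signal to the calibrated whiteboard signal, scaled by the whiteboard reflectance. All function names and sample readings are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = gain * x + offset."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    gain = sxy / sxx
    return gain, my - gain * mx

def dn_to_radiometric(dn, gain, offset):
    """Invert the fitted linear response to recover the radiometric quantity."""
    return (dn - offset) / gain

def reflectance(dn_target, dn_white, gain, offset, white_reflectance=1.0):
    """Ratio of target to whiteboard signal, scaled by the whiteboard reflectance."""
    target = dn_to_radiometric(dn_target, gain, offset)
    white = dn_to_radiometric(dn_white, gain, offset)
    return white_reflectance * target / white

# Hypothetical calibration readings for one band (illuminance vs. grayscale value).
illuminance = [10.0, 20.0, 30.0, 40.0]
grey = [120.0, 220.0, 320.0, 420.0]
gain, offset = fit_line(illuminance, grey)
print(gain, offset)
print(reflectance(170.0, 320.0, gain, offset))
```

Repeating the fit for each wavelength gives one gain and offset pair per band, from which the reflectance spectral curve of a target can be assembled.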

From the ground object reflectance spectra, it can be seen that grass, trees, and other vegetation have a weak reflection peak around 560 nm. Near 700 nm, due to the red-edge effect, the reflectance increases sharply and then gradually stabilizes. In contrast, gravel piles and tanks show a gradual increase in reflectance with wavelength, stabilizing around 700 nm. The reflectance of vehicles, slate roads, and asphalt roads is relatively insensitive to wavelength variations in the visible range and generally remains stable. Notably, the reflectance curves of vehicles and asphalt roads nearly overlap, making them difficult to distinguish. To effectively differentiate such visually similar ground objects and achieve accurate classification, this paper applies the Savitzky–Golay filtering algorithm to smooth and denoise the reflectance data. By calculating the first and second derivatives, the variations in the spectral curves are analyzed, and true feature peaks are selected, effectively reducing misjudgments caused by noise interference. Additionally, the system spectral resolution of 2 nm is used as the wavelength difference threshold for peak overlap determination to identify unique spectral features of ground objects. If the wavelength difference between a feature peak of one ground object and the peaks of all other objects exceeds this threshold, it is identified as a unique peak for that object. Through processing and analysis, 14 major feature peaks were detected. Among them, tanks, gravel piles, trees, and grass each exhibit one unique peak, at 518 nm, 590 nm, 758 nm, and 796 nm, respectively. These distinctive spectral features can be used for the precise identification of these ground object categories.
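The unique-peak rule can be illustrated with a short Python sketch. It covers only the threshold test; the Savitzky–Golay smoothing and derivative-based peak detection are assumed to have been carried out beforehand. The function name is hypothetical, and the peak lists are invented for illustration: only the four unique peak positions are taken from the text.

```python
def unique_peaks(peaks_by_object, threshold_nm=2.0):
    """Per object, keep the peaks farther than threshold_nm from every peak of all other objects."""
    result = {}
    for name, peaks in peaks_by_object.items():
        others = [p for other, ps in peaks_by_object.items() if other != name for p in ps]
        result[name] = [p for p in peaks
                        if all(abs(p - q) > threshold_nm for q in others)]
    return result

# Hypothetical peak positions in nm; the shared peaks are placeholders.
peaks = {
    "tank": [518.0, 640.0],
    "gravel pile": [590.0, 641.0],
    "tree": [560.0, 758.0],
    "grass": [561.0, 796.0],
}
print(unique_peaks(peaks))
```

In this example the peaks at 640 nm and 641 nm, and those at 560 nm and 561 nm, lie within the 2 nm threshold of each other and are therefore rejected as non-unique.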

For multi-class target recognition and classification, hyperspectral information alone makes it difficult to achieve precise identification of complex ground objects. However, the hyperspectral polarization camera used in this study enables the simultaneous acquisition of hyperspectral and polarization information. The DOP and AOP spectral curves of different ground objects at various wavelengths are shown in Figure 4a,b. From these curves, it can be observed that vehicles exhibit relatively high polarization characteristics in the visible wavelength range. This is attributed to their smooth painted and metallic surfaces, which produce reflected light with stronger polarization. In contrast, natural backgrounds typically generate weak or more irregular polarization responses. In the AOP curve, targets such as asphalt roads and slate roads exhibit distinct polarization angle distributions due to differences in surface dielectric properties and roughness. Figure 4 shows that vehicles, asphalt roads, and gravel piles share similar spectral feature peak positions, making it difficult to distinguish them using spectral information alone. This indicates that when spectral features overlap, additional polarization information is required for reliable discrimination. Although different ground objects exhibit varying polarization responses under different illumination and imaging conditions, polarization characteristics fundamentally reflect intrinsic surface properties such as dielectric constant, micro-geometry, surface roughness, and dominant texture orientation after light–matter interaction. These features are largely invariant to variations in illumination intensity and color, providing a complementary dimension for material identification and classification that is not captured by traditional hyperspectral imaging. Therefore, polarization feature analysis enables effective discrimination of targets with similar spectral signatures but different material properties, offering a valuable approach for optical remote sensing-based recognition tasks.

To validate the effectiveness of polarization information for target recognition and classification, grass and camouflage netting were selected as experimental subjects. Data were collected at 14:30 on 24 July 2025, under a solar zenith angle of 40° and an observation zenith angle of 50°. With these conditions fixed, the relative azimuth angle was maintained at 180° throughout the data collection process, since the polarization characteristics of ground objects are more pronounced at this angle. Prior to the experiment, spectral-polarization calibration was performed to mitigate aliasing errors caused by inherent misalignment in the focal-plane camera and to ensure measurement accuracy. In addition, Newton interpolation was applied to reduce pixel-to-pixel response variations. These procedures effectively controlled the influence on the main image region and ensured reliable analysis of overall polarization characteristics. Figure 5 shows the experimental scene and spectral curves of grass and camouflage netting at selected bands. As illustrated in Figure 5, the spectral curves of grass and camouflage netting are highly similar, making it difficult to distinguish between them using hyperspectral information alone.

Figure 6 and Figure 7 present the spectral curves of grass and camouflage netting at four polarization angles, respectively. As shown in Figure 6, the spectral curves of grass at 45° and 135° are very close. In contrast, Figure 7 shows that the spectral curves of camouflage netting at 45° and 135° exhibit more noticeable differences. Further calculations based on Equation (11) indicate that, under this observation geometry, the contrast between grass and camouflage netting in the S2 images is relatively significant. This is because grass, as natural vegetation with irregularly arranged leaves, has microscopically rough surfaces and internal organic components. These structures cause incident light to undergo multiple diffuse reflections, scattering, and internal transmission. As a result, the polarization state is significantly weakened, leading to a low polarization response in S2 images and reduced contrast. In contrast, camouflage netting is an artificial material composed of synthetic fibers. Although it mimics the macroscopic color and texture of grass, its microstructure is more regular, with relatively uniform material properties. This allows the polarization state of the incident light to be better preserved, resulting in more pronounced polarization reflection characteristics. Since grass and camouflage netting exhibit distinct polarization behaviors, polarization imaging can effectively enhance the contrast between them. This provides stronger support for subsequent target recognition and classification tasks, significantly improving the accuracy and robustness of target recognition in complex scenes.

3.2. Comparison and Analysis of the Performance of Different Fusion Algorithms

At present, publicly available remote sensing datasets lack data that simultaneously includes hyperspectral, polarization, and LiDAR modalities for the same scene. To fully validate the effectiveness of the proposed algorithm, this paper independently conducts tests on the existing NWPU-SP [16] spectral-polarimetric dataset and the KAIST [17] spectral-depth dataset, demonstrating the superiority of the proposed fusion algorithm in dual-modal combinations. Furthermore, a laboratory-developed hyperspectral polarimetric imager and a commercial Avia LiDAR are employed for simultaneous three-modal acquisition to obtain authentic three-modal data of the same scene. This further validates the performance of the algorithm in three-modal fusion and comprehensively demonstrates the effectiveness of the proposed method in multimodal image fusion tasks.

3.2.1. Algorithm Metric Analysis

This experiment was conducted on a Windows operating system using MATLAB 2024, with the following running environment: NVIDIA GeForce RTX 4050, Intel(R) Core(TM) i7-14650HX @ 2.20 GHz, 16 GB of RAM, and an image resolution of 1200 × 512. The parameter ε is set to 10⁻⁴. The iteration terminates when the relative solution change falls below 10⁻⁴ or after 200 iterations. The step size η is set to 0.2. The fusion method in this paper has two free parameters: the number of multi-scale layers k and the regularization parameter δ. To determine the impact of these two free parameters on fusion performance, the mutual information MI, structural similarity index SSIM, and edge preservation quality Q_abf were used to objectively evaluate the fusion results of test images under different parameter combinations. The control variable method was applied, and the average value of each evaluation index was taken as the evaluation result. The parameter values that gave the best result were then adopted for this study. Figure 8 shows the objective evaluation values of fusion performance under different parameters.
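The stopping rule stated above (relative solution change below 10⁻⁴, or at most 200 iterations, with step size η = 0.2) can be sketched in Python as follows. The sketch uses plain gradient descent on a toy quadratic objective purely to show the termination logic. The actual objective function and update step of the fusion method are not reproduced here, and the function name, the toy objective, and the small constant guarding against a zero norm are all assumptions.

```python
import math

def iterate(grad, x0, step=0.2, tol=1e-4, max_iter=200):
    """Gradient descent that stops on a small relative change or after max_iter steps."""
    x = list(x0)
    for it in range(1, max_iter + 1):
        g = grad(x)
        x_new = [xi - step * gi for xi, gi in zip(x, g)]
        change = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_new, x)))
        # Assumption: a tiny floor avoids dividing by zero when the iterate is all zeros.
        scale = max(math.sqrt(sum(a * a for a in x)), 1e-12)
        x = x_new
        if change / scale < tol:
            return x, it
    return x, max_iter

# Toy objective 0.5 * ||x - target||^2, whose gradient is x - target (hypothetical).
target = [1.0, 2.0]
solution, n_iter = iterate(lambda x: [xi - ti for xi, ti in zip(x, target)], [0.0, 0.0])
print(solution, n_iter)
```

On this toy problem the loop stops on the relative-change criterion well before the iteration cap is reached.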

As shown in Figure 8, when k < 3, the values of all three objective evaluation indicators increase as k increases, whereas when k > 3, they gradually decrease. When the decomposition scale is small, significant features cannot be well separated from the source images. When the scale is too large, only a few detailed areas remain in the low-frequency image, so the weight values do not achieve the best effect when fusing detail coefficients. After comprehensive consideration, the parameter k is set to 3. As δ increases, MI first decreases and then increases, reaching its minimum at δ = 40. SSIM decreases monotonically, indicating that a larger δ alters more pixel values due to the enhanced L₁ sparsity constraint, leading to a continuous decline in structural similarity. Q_abf increases monotonically, suggesting that a larger δ is beneficial for preserving gradient edge information. Considering the three metrics comprehensively, δ = 30 is selected as the optimal parameter, achieving a favorable balance among information correlation, edge preservation, and structural fidelity. Therefore, in this paper, the number of multi-scale decomposition levels is set to k = 3 and the regularization parameter to δ = 30, and all subsequent fusion experiments adopt this parameter setting.

3.2.2. Spectral-Polarimetric Image Fusion

To verify the superiority and effectiveness of the proposed multimodal fusion algorithm, and to assess its generalization ability and robustness, this paper adopts the NWPUSP public dataset and a self-collected dataset containing measured images of diverse ground objects (e.g., vehicles, buildings, and tanks) for spectral-polarization fusion validation and analysis. The proposed algorithm is compared with six other methods: dual-tree complex wavelet transform (DTCWT) [18], guided filtering fusion (GFF) [19], NSCT-based algorithm [20], Bayesian-based algorithm (Bayes) [21], latent low-rank representation-based algorithm (LatLRR) [22], and weighted least squares with visual saliency map (WLS-VSM) [23]. Three real-world scenes containing multiple typical targets such as vehicles, trees, and tanks are presented in Figure 9, Figure 10 and Figure 11. These scenes provide rich complementary spectral and polarimetric information for multimodal image fusion, facilitating the evaluation of the fusion algorithm’s ability to preserve and enhance detail textures, object contours, and multimodal features.

Figure 9, Figure 10 and Figure 11 present three sets of comparative fusion experiments. In each set, images (a) through (i) represent the 550 nm spectral image, the polarization image, and the fusion results obtained by DTCWT, GFF, NSCT, Bayes, LatLRR, WLS-VSM, and the proposed method, respectively. The first two sets are fusion results on the NWPUSP public dataset, while the third set is on the self-collected dataset. To facilitate a visual comparison of the algorithms, distinctive regions of interest are marked in each image.

In the first fusion experiment, the objective is to preserve the global structure, overall brightness, and texture details of the spectral image while simultaneously integrating the high-contrast edges, surface roughness, and material-sensitive features of the polarization image, thereby achieving complementary benefits of the two modalities. Among the seven fusion methods, only the proposed method and LatLRR successfully integrate these advantages. Compared with the proposed method, the LatLRR result exhibits fewer perceptible edge details. The Bayes and WLS-VSM results are generally darker, with significant loss of background details. The polarization information of the car windows inside the red box is not clearly evident. GFF and Bayes preserve polarization information relatively well, but they lack distinct spectral features. DTCWT and NSCT exhibit good performance in maintaining contours and hierarchical structures, but they are less effective at highlighting target saliency. In contrast, the proposed method fully preserves the vehicle contours, edge details, and luminance variation in the red-box region, while also maintaining the overall natural brightness of the spectral image and the pronounced contrast caused by material differences between the car windows and the car body in the polarization image. This demonstrates that the proposed method achieves a superior trade-off among detail enhancement, structural preservation, and visual balance.

In the second set of experiments, the fusion images generated by GFF, Bayes, and WLS-VSM are generally blurry and exhibit relatively low luminance. Although DTCWT effectively fuses the spectral and polarimetric features, it introduces severe artifacts that degrade image quality. NSCT achieves relatively high overall contrast but suffers from insufficient preservation of polarimetric features, as evidenced by the indistinct enhancement of contrast between window mullions and window panes in its images. Compared with the above methods, LatLRR and the proposed algorithm yield fusion images with higher luminance. However, within the red-boxed details, the windows and mullions in the LatLRR fusion image appear more blurred than those obtained by the proposed method, making them less distinguishable. This indicates that the fusion images generated by the proposed algorithm possess finer gradient textures. Furthermore, owing to the distinct polarimetric characteristics of high-rise building windows and their mullions, the proposed method, after incorporating polarimetric information, significantly enhances the contrast between mullions and windows, leading to a more pronounced visual effect. Compared with the spectral images, the visibility is substantially improved.

From Figure 11, it can be observed that the fused images obtained using the GFF and Bayes methods are generally darker. The DTCWT method exhibits noticeable artifacts in the edge regions, which disrupt the continuity of details and visual fidelity, especially in areas such as the vehicle contours and tree edges. These artifacts reduce the fineness of the contours and the consistency of textures, affecting the overall quality of the fused image. Although GFF has some effect on edge preservation, it lacks depth in the multi-scale fusion of texture details, leading to insufficient fusion of vehicle and ground texture information. As a result, the image tends to be over-smoothed, and the integrity of the detail layers is compromised. The NSCT method performs well in maintaining multi-scale contour continuity, but the edge transitions are not natural enough, causing slight blocky distortions that interfere with the visual consistency of the fused image. The Bayes method shows certain advantages in information fidelity but falls short in contrast enhancement and edge sharpness, resulting in slightly dull details. The LatLRR method retains the overall structure of the scene well, but serious artifacts appear, such as the reflection of streetlights, uneven vehicle background grayscale, and non-uniform road surface brightness. The WLS-VSM method effectively highlights details in salient areas, but the polarization information in non-salient areas is not clearly represented. In contrast, the proposed algorithm achieves a good balance in retaining polarization information, texture clarity, and contrast naturalness. It preserves details such as the tank camouflage texture, vehicle polarization characteristics, and lane edge textures, ensuring visual consistency in brightness and contrast. This significantly enhances the overall quality and visual experience of the fused image.

To better evaluate the quality of the fused images, six objective evaluation metrics are employed to assess the three sets of fusion experiments, and the results are presented in Table 4, Table 5 and Table 6. The six metrics are information entropy (EN), standard deviation (SD), structural similarity (SSIM), average gradient (AG), mutual information (MI), and spatial frequency (SF). All metrics are positive indicators, meaning that larger values correspond to richer information in the fused image and better imaging performance.
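For reference, four of these metrics (EN, SD, AG, and SF) can be sketched in plain Python using their commonly used definitions. The exact normalizations adopted in this paper are not stated, so the forms below are assumptions: EN is computed from the histogram of quantized gray levels in bits, AG uses forward differences averaged over (M − 1)(N − 1) positions, and SF normalizes the squared row and column differences by the total pixel count. Images are represented as nested lists of gray levels, and all function names are hypothetical.

```python
import math
from collections import Counter

def entropy(img):
    """Information entropy (EN) of the gray-level histogram, in bits."""
    pixels = [v for row in img for v in row]
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(pixels).values())

def std_dev(img):
    """Standard deviation (SD) of the gray levels."""
    pixels = [v for row in img for v in row]
    mean = sum(pixels) / len(pixels)
    return math.sqrt(sum((v - mean) ** 2 for v in pixels) / len(pixels))

def average_gradient(img):
    """Average gradient (AG) from horizontal and vertical forward differences."""
    rows, cols = len(img), len(img[0])
    total = 0.0
    for i in range(rows - 1):
        for j in range(cols - 1):
            dx = img[i][j + 1] - img[i][j]
            dy = img[i + 1][j] - img[i][j]
            total += math.sqrt((dx * dx + dy * dy) / 2.0)
    return total / ((rows - 1) * (cols - 1))

def spatial_frequency(img):
    """Spatial frequency (SF) combining row and column frequencies."""
    rows, cols = len(img), len(img[0])
    rf = sum((img[i][j] - img[i][j - 1]) ** 2 for i in range(rows) for j in range(1, cols))
    cf = sum((img[i][j] - img[i - 1][j]) ** 2 for i in range(1, rows) for j in range(cols))
    return math.sqrt((rf + cf) / (rows * cols))

# Hypothetical 2 x 2 checkerboard with two equally likely gray levels.
checker = [[0, 255], [255, 0]]
print(entropy(checker), std_dev(checker), average_gradient(checker), spatial_frequency(checker))
```

A uniform image gives zero for all four metrics, which is consistent with their role as positive indicators of information content, contrast, and detail.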

As shown in Table 4, Table 5 and Table 6, the proposed method achieves favorable performance on most objective evaluation metrics. The fused images preserve the grayscale information of the spectral images, the edge contour information of the degree-of-polarization images, and the spatial perception information of the depth images. On the NWPUSP public dataset, compared with other fusion methods, the average information entropy is improved by 7.3%, the average standard deviation by 53.18%, the average gradient by 48%, and the spatial frequency by 96.25%. The mutual information (MI) is lower than that of GFF. This is because GFF adopts guided filtering and tends to preserve all information from the source images, including redundant and common features, rather than highlighting modality-specific sparse features, which results in a high MI value; however, the imaging quality and contrast of its fused images are inferior to those of other methods in terms of visual perception. The structural similarity (SSIM) of the proposed algorithm is slightly lower than those of DTCWT, GFF, LatLRR, NSCT, and the Bayesian-based method. This is mainly because the optimization objective of the proposed method is designed to preserve physically meaningful sparse features and geometric contour information, rather than pursuing pixel-level structural similarity. The advantage of the proposed method lies in its ability to enhance target discriminability in subsequent few-shot classification tasks, so the fused images inevitably sacrifice some degree of local structural fidelity.

In summary, the objective evaluation results of the proposed method are generally consistent with the subjective visual quality. Although some evaluation metrics in the fusion experiments are inferior to those of several compared algorithms, the fused images exhibit higher brightness, natural grayscale transitions, and a high degree of polarimetric feature retention, resulting in excellent visual quality. Therefore, the proposed algorithm appears to be more targeted and demonstrates slightly better fusion performance.

3.2.3. Spectral-Depth Image Fusion

To further validate the effectiveness of the proposed multimodal fusion algorithm, this paper employs the KAIST spectral-depth dataset. Figure 12 and Figure 13 present two sets of comparative fusion experiments. In each set, images (a) through (i) represent the 550 nm spectral image, the depth image, and the fusion results obtained by DTCWT, GFF, NSCT, Bayes, LatLRR, WLS-VSM, and the proposed method, respectively. To effectively combine the textural details of the spectral images with the spatial awareness of the depth images, we examined the results of seven different fusion algorithms in the first fusion experiment. Only the proposed method and WLS-VSM successfully combine the advantages of the two modalities; the other fusion methods fail to achieve a satisfactory integration. Compared with the proposed method, the NSCT and LatLRR results exhibit severe artifacts at edges, which degrade image quality. The Bayes result shows indistinct textural details overall. GFF preserves the textural details of the spectral image well but fails to adequately represent the spatial awareness of the depth image. DTCWT lacks sufficient target detail. In contrast, the proposed method achieves a superior trade-off between detail enhancement and spatial awareness, leading to a better fusion effect.

In the second experiment, the fused image obtained by WLS-VSM is generally darker. LatLRR and NSCT produce some artifacts along the pillow edges. DTCWT and the proposed method achieve relatively high overall contrast, but DTCWT exhibits slightly insufficient texture fineness. Meanwhile, Bayes well preserves the depth features of the palette compared with the other algorithms, yet its preservation of textural features from the spectral image is somewhat inadequate. In contrast to the above methods, the fused image of the proposed method exhibits relatively higher brightness and achieves complementary advantages between the depth features and spectral features of various targets, further demonstrating that the proposed algorithm effectively extracts spectral-depth features. Table 7 and Table 8 present the objective evaluation results of the fused images for the two experiments.

As shown in Table 7 and Table 8, the proposed method achieves favorable performance across all objective evaluation metrics. The fused images preserve the grayscale information of the spectral images and the spatial perception information of the depth images. On the KAIST public dataset, compared with other fusion methods, the average information entropy is improved by 4.87%, the average standard deviation by 30.35%, the average gradient by 108.28%, and the spatial frequency by 101.13%. The proposed method successfully integrates the texture details of the spectral images with the spatial awareness of the depth images, thereby enhancing the contrast of various targets in the fused images. These results are consistent with the subjective analysis presented above.

3.2.4. Spectral-Polarimetric-LiDAR Image Fusion

To verify the effectiveness of the proposed algorithm in the spectral-polarimetric-LiDAR multimodal image fusion task, a set of well-registered self-collected three-modal data (Scene 1) is used for validation. Figure 14 presents the fusion results obtained by the proposed algorithm and the six comparative fusion methods described in Section 3.2.2. In each set of images, (a) through (j) represent the 550 nm spectral image, the polarization image, the depth image, and the fusion results of DTCWT, GFF, NSCT, Bayes, LatLRR, WLS-VSM, and the proposed method, respectively. The fusion images of DTCWT, GFF, and Bayes are generally darker. NSCT and LatLRR produce acceptable fusion of spectral and polarimetric information but inadequately preserve the spatial awareness of the LiDAR data. Only the proposed method and WLS-VSM successfully combine the textural features of the spectral image, the ability of polarization to highlight hidden targets, and the spatial awareness of the LiDAR data. However, compared with the proposed method, the brightness of the WLS-VSM result is slightly insufficient.

As shown in Table 9 (Scene 1), the proposed method achieves favorable performance across all objective evaluation metrics. Compared with other fusion methods, the average information entropy is improved by 7.19%, the average standard deviation by 46.85%, the average gradient by 76.62%, and the spatial frequency by 79.74%. The fused image effectively integrates the textural details of the spectral image, the ability of the polarization image to recognize hidden targets, and the spatial perception information of the depth image, thereby enhancing the contrast of various targets in the fused image. These results are consistent with the subjective analysis presented above.

To validate the superiority of multimodal images in actual ground object recognition, the 560 nm spectral image and its corresponding degree of polarization (DOP) image, angle of polarization (AOP) image, and fused image are each fused with LiDAR images. The depth is mapped to a pseudo-color representation to present a more accurate and realistic visual effect, as shown in Figure 15. Using the proposed method, the DOP, AOP, and S0 images are fused together. This fused result is then combined with the LiDAR depth pseudo-color image, followed by a comparison with the single-modal images. The experiment shows that spectral images typically have a higher signal-to-noise ratio and can present richer surface texture details, such as the camouflage pattern of tanks and the markings on sidewalks. The DOP image can improve the contrast between objects and the background in complex environments, highlighting the surface and contours of objects, thus increasing the probability of detecting and identifying objects. For example, in the image, a vehicle in shadow is almost invisible in the spectral image, whereas its contrast is greatly enhanced in the DOP image. The AOP information is more sensitive to the three-dimensional geometric structure of the object’s surface, especially to targets with surface undulations. This is because the value of AOP is directly determined by the projection direction of the local surface normal in the observation plane. Consequently, any change in surface curvature will lead to a change in the normal direction. This is clearly shown in the AOP image, where the hood drainage line of the last vehicle in the first row is distinctly visible. Although LiDAR images can capture the distance to a target, their accuracy is limited, making it difficult to distinguish subtle texture variations on the target’s surface. Polarization information can compensate for the limitations of LiDAR’s distance resolution. By applying the proposed fusion method, the 560 nm spectral image is fused with its DOP and AOP images, effectively combining the advantages of all three inputs: the marking lines on the sidewalk and the spectral texture of the tank, the DOP’s ability to recognize vehicles in shadow, and the AOP’s sensitivity to the hood drainage lines of vehicles. The depth information from the LiDAR image is used to add color information to this fused image, resulting in a pseudo-color image containing depth information. Compared with the spectral-polarization fusion image, the fused image with depth information is more visually informative. It not only makes it easier to recognize objects in shadow but also shows the distance information of the ground objects.
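One simple way to add depth as color to a grayscale fused image is sketched below in Python. The color mapping used for Figure 15 is not specified, so this is only an illustrative assumption: depth is normalized to a hue between red (near) and blue (far), while the fused intensity, assumed to lie in [0, 1], sets the brightness. The function name and sample values are hypothetical.

```python
import colorsys

def depth_pseudocolor(fused, depth, d_min, d_max):
    """Map depth to hue and fused intensity to brightness; returns rows of (R, G, B) in [0, 1]."""
    out = []
    for f_row, d_row in zip(fused, depth):
        row = []
        for f, d in zip(f_row, d_row):
            # Normalize depth to [0, 1], clamping values outside the chosen range.
            t = min(max((d - d_min) / (d_max - d_min), 0.0), 1.0)
            hue = (2.0 / 3.0) * t  # assumption: near = red (hue 0), far = blue (hue 2/3)
            row.append(colorsys.hsv_to_rgb(hue, 1.0, f))
        out.append(row)
    return out

# Hypothetical 1 x 2 example: a near pixel and a far pixel, both at full brightness.
rgb = depth_pseudocolor([[1.0, 1.0]], [[0.0, 10.0]], d_min=0.0, d_max=10.0)
print(rgb)
```

Because brightness is taken from the fused image, texture and polarization detail remain visible while hue conveys the distance of each ground object.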

To further evaluate the generalization capability of the proposed fusion method across different scenes, a set of three-modal data was acquired from a scene different from the main experiment, as shown in Figure 16 (Scene 2). Using the same fusion parameters and evaluation metrics as in Section 3.2, an objective quality assessment of the fused images was conducted, and the results are presented in Table 10 (Scene 2).

As shown in Table 10, the proposed method achieves favorable performance across all objective evaluation metrics. Compared with other fusion methods, the average information entropy is improved by 7.03%, the average standard deviation by 43.68%, the average gradient by 70.24%, and the spatial frequency by 80.79%. Compared with the results of the main experiment, the proposed method still achieves excellent fusion performance on the cross-scene data, with key metrics such as information entropy (EN), standard deviation (SD), average gradient (AG), and spatial frequency (SF) showing only minor fluctuations due to differences in scene complexity. This indicates that the proposed fusion method generalizes well and remains robust across different scenes and land-cover combinations, rather than overfitting to a specific scene.

3.3. Image Classification

To validate the effectiveness of the proposed multimodal image fusion method for target recognition in few-shot scenarios, a comparative target recognition experiment is conducted using hyperspectral images, the corresponding polarimetric DOP images, hyperspectral polarimetric fused images, and hyperspectral polarimetric LiDAR fused images. The detection-band images and fused images are shown in Figure 17, and the color coding used in these images is explained in Figure 18. The recognition targets comprise seven land-cover types: trees, vehicles, tanks, asphalt roads, gravel piles, stone-paved roads, and grassland. A Support Vector Machine (SVM) is employed for image classification. In this paper, a “few-shot scenario” is one in which new-class targets must be recognized accurately from only a small number of annotated samples, a situation typical of remote sensing applications where labeled data are scarce [24]. The hyperspectral polarimetric and LiDAR images acquired in this experiment have a resolution of 4800 × 2048 pixels. To improve computational efficiency, they are downsampled by a factor of 4 along each axis to 1200 × 512. Because of the mechanism of the division-of-focal-plane polarimetric detector, the polarimetric calculation must be performed on the original images before downsampling to obtain the experimental images. In the few-shot setting, the features used for classification are the fused original images, and the number of annotated samples provided per new class in the training set is referred to as the number of shots. For example, 1-shot indicates that only one annotated sample is provided for each new class, and K-shot denotes the general case in which K samples are provided per class, with K ∈ {1, 3, 5, 10, 20, 25}. From all labeled pixels of each class, K pixels are randomly selected as the training set, and the remaining pixels are used as the test set. To avoid the bias caused by a single random sampling, each experiment is repeated five times, and the mean and standard deviation of the accuracy and Kappa coefficient are calculated across the five runs. The total number of labeled pixels per class varies with the target size, but after sample extraction the test set for each class contains no fewer than 200 pixels, which ensures the statistical stability of the evaluation results. Identical training and test set splits are used for all compared image types, and all classification experiments are conducted under the same hardware and software environments. The classification results are shown in Figure 19.
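The sampling protocol described above can be sketched as follows. This is an illustrative outline rather than the authors' code: the function names, the use of the run number as the random seed, and the toy label map are all assumptions. Reusing the same seeds is one simple way to give every image type identical training and test splits.

```python
import random

def k_shot_split(labels, k, seed):
    """Split labeled pixels into K training pixels per class and a test set.

    labels: dict mapping a pixel index to its class id (labeled pixels only).
    Returns (train, test) as lists of pixel indices.
    """
    rng = random.Random(seed)           # seeded so every image type reuses the split
    by_class = {}
    for idx, cls in labels.items():
        by_class.setdefault(cls, []).append(idx)

    train, test = [], []
    for cls in sorted(by_class):
        pixels = sorted(by_class[cls])  # fixed order before shuffling
        if len(pixels) <= k:
            raise ValueError(f"class {cls} has too few labeled pixels for k={k}")
        rng.shuffle(pixels)
        train.extend(pixels[:k])        # K annotated samples for this class
        test.extend(pixels[k:])         # all remaining labeled pixels are tested
    return train, test

def repeated_splits(labels, k, runs=5):
    """One split per run; the run number doubles as the seed (an assumption)."""
    return [k_shot_split(labels, k, seed) for seed in range(runs)]

# Toy label map: 7 classes with 230 labeled pixels each (made-up sizes).
toy_labels = {i: i // 230 for i in range(7 * 230)}
splits = repeated_splits(toy_labels, k=25)
```

With these toy sizes, each class keeps 205 test pixels at K = 25, which satisfies the stated minimum of 200 test pixels per class.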

The classification results of the different image types are analyzed objectively using the classification accuracy and the Kappa coefficient. The mean and standard deviation of accuracy and Kappa over the five repeated runs are shown in Figure 20. From the classification maps and the objective evaluation data, it can be observed that hyperspectral images achieve good recognition performance for trees, grassland, gravel piles, and tanks. However, vehicles under shadows are recognized poorly, and stone-paved roads are misclassified because of tank shadows, resulting in an accuracy of 59.11% under the 25-shot setting. Polarimetric images, in contrast, recognize vehicles under shadows well, but their classification maps exhibit obvious vertical stripe artifacts, which reduce the classification accuracy. The accuracy of the hyperspectral polarimetric fused images and of the polarimetric images is slightly lower than that of the unimodal hyperspectral images, which is attributed to the overall low contrast of polarimetric images. Nevertheless, the classification maps contain fewer pixels in which shadowed areas are misclassified as other targets, indicating that the fused image enhances the difference between the background and the targets. Although the hyperspectral polarimetric fused image successfully recognizes vehicles under shadows, a large number of pixels are still misclassified as other target categories, so incorporating polarimetric information alone yields no significant improvement in accuracy. The accuracy of the polarimetric LiDAR fused image is higher than that of the hyperspectral polarimetric fused image, but vehicles under shadows are misclassified as asphalt roads.
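As a reminder of how the two evaluation measures relate, the sketch below computes overall accuracy and Cohen's Kappa from true and predicted labels through a confusion matrix. It uses the standard definitions; the function name and the toy labels are illustrative assumptions and are unrelated to the paper's data.

```python
def accuracy_and_kappa(y_true, y_pred):
    """Overall accuracy and Cohen's Kappa from two label sequences."""
    classes = sorted(set(y_true) | set(y_pred))
    index = {c: i for i, c in enumerate(classes)}
    n = len(y_true)

    # Confusion matrix: rows are true classes, columns are predicted classes.
    cm = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        cm[index[t]][index[p]] += 1

    # Observed agreement: fraction of samples on the diagonal.
    po = sum(cm[i][i] for i in range(len(classes))) / n
    # Chance agreement: sum over classes of (row total * column total) / n^2.
    pe = sum(sum(cm[i]) * sum(row[i] for row in cm)
             for i in range(len(classes))) / (n * n)

    # Kappa is undefined when chance agreement is 1; return 0.0 in that case.
    kappa = (po - pe) / (1 - pe) if pe < 1 else 0.0
    return po, kappa

# Toy example with two classes and four samples.
acc, kappa = accuracy_and_kappa([0, 0, 1, 1], [0, 0, 1, 0])
```

Because Kappa discounts agreement expected by chance, it is lower than the accuracy whenever the class distribution makes chance agreement non-negligible.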

When the hyperspectral polarimetric LiDAR fused image is used for land-cover target recognition, the accuracy under the 25-shot setting increases substantially to 71.94%, with a Kappa coefficient of 0.6568 and a marked reduction in background misclassification. Not only are vehicles under shadows correctly identified, but part of the asphalt road under shadows is also recognized. This indicates that polarimetric information enhances the ability of spectral images to recognize targets under shadows, while depth information improves the contrast between near and far targets, thereby improving classification accuracy. It further confirms that multimodal fusion detection combining hyperspectral, polarimetric, and LiDAR data improves the accuracy of land-cover target recognition and classification compared with unimodal images.

From the comparison results in the objective evaluation plots, the hyperspectral polarimetric LiDAR fused image generated by the proposed algorithm outperforms the unimodal hyperspectral image overall across the shot settings, and its advantage grows under medium-to-high shot conditions. Under the 5-, 10-, 20-, and 25-shot settings, the fused image achieves markedly higher classification accuracy and better visual quality than the unimodal hyperspectral image, with accuracies of 61.5%, 67.27%, 71.47%, and 71.94%, respectively. Compared with the unimodal hyperspectral image, the hyperspectral polarimetric LiDAR fused image improves the average accuracy and Kappa value by 9.8% and 11.05%, respectively. Paired t-tests on the accuracy data of the two image types yield p-values of 0.407, 0.035, 0.003, 0.0003, 0.000021, and 0.000020 for the six shot settings, indicating that the differences between the two image types are statistically significant for 3 shots or more. The polarimetric LiDAR fused image achieves higher accuracy than the hyperspectral polarimetric LiDAR fused image under the 1-shot, 3-shot, and 5-shot settings, reaching 51.48%, 60.72%, and 64.46%, respectively. However, its standard deviations are 10.97%, 8.89%, and 5.48%, which are much higher than those of the hyperspectral polarimetric LiDAR fused image. Its classification performance is therefore highly sensitive to the random selection of training samples, which is often a sign of overfitting to the training data: the model memorizes the characteristics of specific samples rather than learning the intrinsic patterns of the categories. Moreover, polarimetric and LiDAR images exhibit obvious sparse features, resulting in high similarity among samples from different spectral channels, which further promotes overfitting. The hyperspectral polarimetric LiDAR fused image, which integrates hyperspectral information, mitigates this overfitting while leveraging the advantages of both polarimetric and LiDAR images to improve the classification accuracy of complex land-cover targets.
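The significance test can be illustrated with a short sketch of the paired t statistic for one shot setting. It assumes that each test pairs the five per-run accuracies of the two image types, giving 4 degrees of freedom. The function name and the accuracy values below are made up for illustration and are not results from the paper. Instead of computing a p-value, the sketch compares the statistic with the two-sided 5% critical value of Student's t distribution for 4 degrees of freedom (2.776).

```python
import math
import statistics

T_CRIT_DF4 = 2.776  # two-sided 5% critical value, Student's t, 4 degrees of freedom

def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i].

    Raises ZeroDivisionError if all differences are identical
    (zero sample standard deviation).
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)      # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(n))

# Made-up per-run accuracies (%) for two image types over five runs.
fused_acc = [71.0, 72.5, 71.8, 72.1, 72.3]
hsi_acc = [58.9, 59.5, 59.0, 59.3, 58.8]

t_stat = paired_t_statistic(fused_acc, hsi_acc)
significant = abs(t_stat) > T_CRIT_DF4  # True means p < 0.05 for 5 paired runs
```

If SciPy is available, `scipy.stats.ttest_rel` returns both the statistic and the exact p-value for the same paired design.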

With the development of remote sensing detection technology, the information provided by unimodal images can no longer meet the demands of target recognition. The proposed hyperspectral-polarimetric-LiDAR multimodal fusion method for few-shot scenarios, which integrates polarization, hyperspectral, and LiDAR detection data sources, provides an effective means to enhance the recognition rate of hyperspectral images.

4. Discussion

The fusion quality of hyperspectral polarimetric LiDAR images is highly dependent on the precise registration between the hyperspectral polarimetric images and the LiDAR point cloud. Under dynamic observation conditions such as UAV or satellite platforms, platform vibration or motion can lead to larger registration errors, thereby affecting the fusion of hyperspectral polarimetric and depth features. Moreover, the academic community currently lacks a publicly available, standardized three-modal dataset originating from the same scene. This is primarily due to the need for expensive and complex synchronized hardware platforms for data acquisition, as well as the significant technical challenges in achieving strict synchronization and calibration, which limits the development of multimodal image fusion technology. All data in this paper were collected from the same geographical location, within the same time period, and for specific land cover types. The effectiveness of the proposed method in more complex environments has not yet been validated.

The runtime of each algorithm is measured in simulation experiments to compare time efficiency, and Table 11 presents the mean and standard deviation of the running times of the different fusion algorithms. The GFF algorithm achieves the best time efficiency. The NSCT and LatLRR algorithms have higher time costs, while the runtimes of the remaining algorithms are comparable, with the proposed algorithm ranking in the middle. Because multi-scale optimization involves scale decomposition and its internal parameter values are derived from image feature values, the overall runtime inevitably increases. If time efficiency is a priority, the internal parameters can be set to fixed empirical values and the number of decomposition scales can be reduced, but this sacrifices part of the visual quality of the fusion result. Given this trade-off, the runtime of the proposed algorithm is considered acceptable.
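A timing comparison of this kind can be sketched as below. The harness, its name `time_algorithm`, the number of repeats, and the placeholder `average_fusion` function are all assumptions for illustration. The placeholder is a simple pixel-wise average and does not represent any of the compared algorithms.

```python
import statistics
import time

def time_algorithm(fuse, inputs, repeats=5):
    """Run fuse(*inputs) several times; return (mean, std) of the durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fuse(*inputs)
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)

def average_fusion(a, b):
    """Placeholder fusion rule: pixel-wise mean of two equally sized images."""
    return [[(x + y) / 2 for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]

# Two small synthetic images (nested lists) used only to exercise the harness.
img_a = [[(i + j) % 256 for j in range(64)] for i in range(64)]
img_b = [[(i * j) % 256 for j in range(64)] for i in range(64)]

mean_s, std_s = time_algorithm(average_fusion, (img_a, img_b))
```

Reporting the standard deviation alongside the mean, as Table 11 does, shows whether differences between algorithms exceed the run-to-run variation of the machine.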

The proposed method is not designed for specific sensor parameters and can in principle be transferred to other hyperspectral-polarimetric-LiDAR systems. In this paper, the spatial resolution of the hyperspectral polarimetric images is 0.21 mrad, and the LiDAR acquires approximately 720,000 points per second. If the resolution of the input images differs significantly, the number of multi-scale decomposition levels, k, needs to be readjusted. Furthermore, the proposed method is particularly effective at distinguishing highly polarized man-made targets from low-polarization natural backgrounds. If the polarization characteristics of the target and the background are similar, the fusion performance may be reduced. In such cases, additional polarization features, such as the Angle of Polarization (AOP), can be introduced to enhance discriminability. In future work, a multimodal remote sensing dataset needs to be established so that the generalization ability of the proposed method can be validated further. In particular, the robustness of the method still needs to be evaluated systematically under different illumination conditions, in different seasons, and in different geographical regions.
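To make the polarization features concrete, the sketch below derives the degree of linear polarization and the AOP from intensities measured at four analyzer angles (0°, 45°, 90°, and 135°), using the standard linear Stokes relations. It works on single pixel values. The function names are assumptions, and the paper's own processing chain for the division-of-focal-plane detector may differ.

```python
import math

def linear_stokes(i0, i45, i90, i135):
    """Linear Stokes parameters from intensities at four analyzer angles."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90                       # horizontal vs vertical preference
    s2 = i45 - i135                     # +45 deg vs -45 deg preference
    return s0, s1, s2

def dolp_and_aop(i0, i45, i90, i135):
    """Degree of linear polarization (0..1) and angle of polarization (radians)."""
    s0, s1, s2 = linear_stokes(i0, i45, i90, i135)
    dolp = math.hypot(s1, s2) / s0 if s0 > 0 else 0.0
    aop = 0.5 * math.atan2(s2, s1)      # lies in (-pi/2, pi/2]
    return dolp, aop

# Ideal test cases for light of unit intensity.
horizontal = dolp_and_aop(1.0, 0.5, 0.0, 0.5)   # fully polarized at 0 degrees
diagonal = dolp_and_aop(0.5, 1.0, 0.5, 0.0)     # fully polarized at 45 degrees
unpolarized = dolp_and_aop(0.5, 0.5, 0.5, 0.5)  # no preferred direction
```

Two surfaces with similar degrees of polarization can still differ in AOP, because AOP depends on the orientation of the polarization rather than its strength. This is why it can add discriminability in the case described above.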

5. Conclusions

This paper addresses the insufficient recognition capability of single-modal remote sensing in complex environments, for example in high-precision urban ground object recognition. A hyperspectral-polarization-LiDAR multimodal image fusion method for few-shot scenarios is proposed. Feature mapping functions are designed for the polarization and LiDAR images, and multi-scale hierarchical optimization is employed to balance the low-frequency and high-frequency information of the multimodal images, which achieves effective fusion of the multimodal image information.

Bimodal image fusion experiments were conducted on the NWPUSP spectral-polarization dataset and the KAIST spectral-depth dataset. Compared with other fusion methods, the proposed algorithm increases the information entropy by 7.3% and 4.87% on average, the standard deviation by 53.18% and 30.35%, the average gradient by 48% and 108.28%, and the spatial frequency by 96.25% and 101.13%, respectively. In addition, three-modal images (hyperspectral, polarization, and LiDAR) of complex ground object scenes were collected, and comparative experiments were carried out against six other fusion algorithms. The objective evaluation results indicate average improvements of 7.19% in information entropy, 46.85% in standard deviation, 76.62% in average gradient, and 79.74% in spatial frequency, which markedly improves the feature retention of the fused images. Under few-shot conditions, the target recognition classification accuracy and Kappa coefficient of the fused image are 9.8% and 11.05% higher, respectively, than those of the single-modal hyperspectral image. The method effectively suppresses the shadow effect in hyperspectral images and improves ground object recognition in complex scenarios. It offers a new approach to remote sensing detection in complex environments, such as ground object recognition and classification under few-shot scenarios, and has broad application prospects.

Author Contributions

Methodology, Y.Y.; writing—original draft, Y.Y.; writing—review and editing, Y.Y.; project administration, G.L., H.S. (Hongyu Sun), J.W., J.Z., J.L., Q.W., and M.C.; supervision, Y.L.; resources, Y.L.; funding acquisition, H.S. (Haodong Shi). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Jilin Provincial Department of Science and Technology Free Exploration Project, grant number YDZJ202501ZYTS605; and by Optoelectronic Measurement and Intelligent Perception Zhongguancun Open Lab, grant number LabSOMP-2024-17. The APC was funded by the authors.

Data Availability Statement

The data presented in this study are not publicly available because they are part of an ongoing research project. The full dataset will be released upon project completion. The raw data are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Zhang, W.; Jiang, J.; Li, K.; Wang, X.; Zhang, F.; Zhang, R.; Jiang, D.; Yue, G. A baseline slope index to detect natural gas microleakage-stressed vegetation considering shadow removal in hyperspectral imagery. Energy 2025, 331, 137037.
Shi, H.; Gong, C.; Wang, Q.; Liu, J.; Wang, J.; Li, Y.; Sun, H.; Wang, C.; Ma, Y.; Kang, X.; et al. Airborne push-broom hyperspectral polarization imaging system design and image fusion method. Opt. Laser Technol. 2025, 191, 113409.
Li, X.; Gan, V.J.; Li, K.; Li, M. High-precision 3D BIM reconstruction for mechanical, electrical and plumbing components using terrestrial laser scanning and LiDAR point clouds. J. Build. Eng. 2025, 112, 113661.
Yin, J.; Li, G.; Zhou, B.; Cheng, L. Laser Radar and Micro-Light Polarization Image Matching and Fusion Research. Electronics 2025, 14, 3136.
Chroni, A.; Vasilakos, C.; Christaki, M.; Soulakellis, N. Fusing Multispectral and LiDAR Data for CNN-Based Semantic Segmentation in Semi-Arid Mediterranean Environments: Land Cover Classification and Analysis. Remote Sens. 2024, 16, 2729.
Guo, F.; Zhu, J.; Huang, L.; Li, F.; Zhang, N.; Deng, J.; Li, H.; Zhang, X.; Zhao, Y.; Jiang, H.; et al. Multi-Dimensional Fusion of Spectral and Polarimetric Images Followed by Pseudo-Color Algorithm Integration and Mapping in HSI Space. Remote Sens. 2024, 16, 1119.
Mei, Y.; Fan, J.; Fan, X.; Li, Q. CSTC: Visual Transformer Network with Multimodal Dual Fusion for Hyperspectral and LiDAR Image Classification. Remote Sens. 2025, 17, 3158.
Aetesam, H.; Prasath, V.B. Variational Weighted ℓp−ℓq Regularization for Hyperspectral Image Restoration under Mixed Noise. IET Image Process 2025, 19, e70073.
Akiyama, K.; Ikeda, S.; Pleau, M.; Fish, V.L.; Tazaki, F.; Kuramochi, K.; Broderick, A.E.; Dexter, J.; Mościbrodzka, M.; Gowanlock, M.; et al. Superresolution Full-polarimetric Imaging for Radio Interferometry with Sparse Modeling. Astron. J. 2017, 153, 159.
Reichardt, L.; Mangat, P.; Wasenmuller, O. DVMN: Dense Validity Mask Network for Depth Completion. In Proceedings of the 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; pp. 2653–2659.
Shenzhen Dajiang Innovation Technology Co., Ltd. Optical Component for Scanning LiDAR System. Chinese Patent No. 202080107216.1, 27 May 2025.
Liu, J.; Xia, R.; Jin, W.; Wang, X.; Du, L. Review of imaging polarimetry based on Stokes Vector. Opt. Tech. 2013, 39, 56.
Li, Z.; Zhai, A.; Ji, Y.; Li, G.; Wang, D.; Wang, W.; Shi, L.; Ji, T.; Liu, F.; Cui, Y. Research, application, and progress of optical polarization imaging technology. Infrared Laser Eng. 2023, 52, 298–313.
Abdel-Aziz, Y.I.; Karara, H.M. Direct Linear Transformation from Comparator Coordinates into Object Space Coordinates. In Proceedings of the ASP/UI Symposium on Close-Range Photogrammetry, Urbana, IL, USA, 26–29 January 1971; pp. 1–18.
Shen, Y.; Lin, Y.; Chen, H.; Wu, J.; Huang, F. Algorithm for removing mismatched feature points in heterogeneous images under spatial constraints. Acta Opt. Sin. 2024, 44, 216–227.
Tong, G.; Yao, X.; Li, B.; Fu, J.; Wang, Y.; Hao, J.; Karim, S.; Yu, Y. MSPFusion: A feature transformer for multidimensional spectral-polarization image fusion. Expert Syst. Appl. 2025, 275, 127079.
Baek, S.-H.; Ikoma, H.; Jeon, D.S.; Li, Y.; Heidrich, W.; Wetzstein, G.; Kim, M.H. Single-shot Hyperspectral-Depth Imaging with Learned Diffractive Optics. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 2631–2640.
Kingsbury, N. Complex Wavelets for Shift Invariant Analysis and Filtering of Signals. Appl. Comput. Harmon. Anal. 2001, 10, 234–253.
Yang, Y.; Li, Y.; Ding, J.; Wang, Y. Infrared and visible image fusion based on fast alternating guided filtering and CNN. Opt. Precis. Eng. 2023, 31, 1548–1562.
Da Cunha, A.; Zhou, J.; Do, M. The Nonsubsampled Contourlet Transform: Theory, Design, and Applications. IEEE Trans. Image Process. 2006, 15, 3089–3101.
Zhao, Z.; Xu, S.; Zhang, C.; Liu, J.; Zhang, J. Bayesian fusion for infrared and visible images. Signal Process. 2020, 177, 107734.
Li, H.; Wu, X.J. Infrared and visible image fusion using latent low-rank representation. arXiv 2018, arXiv:1804.08992.
Ma, J.; Zhou, Z.; Wang, B.; Zong, H. Infrared and visible image fusion based on visual saliency map and weighted least square optimization. Infrared Phys. Technol. 2017, 82, 8–17.
Gao, G.; Zhang, Z.; Zhang, W.; Shang, Y.; Dong, Y.; Xi, J. DAFSDet: Dual-Attention Guided Few-Shot Object Detection in Remote Sensing Images. Remote Sens. 2026, 18, 1345.

Figure 1. Multimodal image fusion method framework.

Figure 2. Experimental setup and flowchart.

Figure 3. Experimental scene and spectral curves of ground object targets.

Figure 4. Spectral curves of terrestrial targets: (a) DOP; (b) AOP.

Figure 5. Experimental scene and spectral curves of grass and camouflage nets in selected bands.

Figure 6. Spectral curves of grassland at four polarization angles.

Figure 7. Spectral curves of camouflage nets at four polarization angles.

Figure 8. Objective evaluation values for fusion performance with different parameters.

Figure 9. Hyperspectral polarimetric image fusion experiment 1. (a) 550 nm spectral image, (b) polarization image, (c) DTCWT, (d) GFF, (e) NSCT, (f) Bayes, (g) LatLRR, (h) WLS-VSM, (i) proposed.

Figure 10. Hyperspectral polarimetric image fusion experiment 2. (a) 550 nm spectral image, (b) polarization image, (c) DTCWT, (d) GFF, (e) NSCT, (f) Bayes, (g) LatLRR, (h) WLS-VSM, (i) proposed.

Figure 11. Hyperspectral polarimetric image fusion experiment 3. (a) 550 nm spectral image, (b) polarization image, (c) DTCWT, (d) GFF, (e) NSCT, (f) Bayes, (g) LatLRR, (h) WLS-VSM, (i) proposed.

Figure 12. Polarimetric depth image fusion experiment 1. (a) 550 nm spectral image, (b) depth image, (c) DTCWT, (d) GFF, (e) NSCT, (f) Bayes, (g) LatLRR, (h) WLS-VSM, (i) proposed.

Figure 13. Polarimetric depth image fusion experiment 2. (a) 550 nm spectral image, (b) depth image, (c) DTCWT, (d) GFF, (e) NSCT, (f) Bayes, (g) LatLRR, (h) WLS-VSM, (i) proposed.

Figure 14. Spectral-polarimetric-LiDAR image fusion experiment (Scene 1). (a) 550 nm spectral image, (b) polarization image, (c) depth image, (d) DTCWT, (e) GFF, (f) NSCT, (g) Bayes, (h) LatLRR, (i) WLS-VSM, (j) proposed.

Figure 15. Spectropolarimetric fusion images and corresponding depth map fusion images.

Figure 16. Spectral-polarimetric-LiDAR image fusion experiment (Scene 2). (a) 550 nm spectral image, (b) polarization image, (c) depth image, (d) DTCWT, (e) GFF, (f) NSCT, (g) Bayes, (h) LatLRR, (i) WLS-VSM, (j) proposed.

Figure 17. Images and fused images at the detection band.

Figure 18. Ground truth label map.

Figure 19. Classification results.

Figure 20. Objective evaluation metrics for classification.

Table 1. Comparison of the proposed method with recent multimodal image fusion methods.

Author	Applications	Input Modalities	Fusion Methods	Fusion Levels	Performance (Q)	Performance (E)	Performance (R)	Performance (T)
[4]	Covert target reconnaissance	PI+LiDAR	Depth offset mapping matrix + pseudo-color	Feature-level fusion	H	N	N	P
[5]	Land cover classification	MSI+LiDAR	CNN semantic segmentation	Feature-level fusion	Y	N	N	P
[6]	Target enhancement and recognition	HSI+PI	PCA + energy weighting	Pixel-level fusion	H	N	N	P
[7]	Multimodal image classification	HSI+LiDAR	Visual Transformer dual fusion	Feature-level fusion	Y	N	N	P
Proposed	Target enhancement and recognition	HSI+PI+LiDAR	Multi-scale optimization + L₂ + L₁ convex optimization	Pixel-level fusion	Y	H	Y	P

Table 2. Objective evaluation metrics for different norm combinations.

Method	EN	SD	SSIM	AG	MI	SF
Pure L₁	7.127	38.079	0.372	53.607	0.744	16.403
Pure L₂	7.124	35.691	0.397	50.819	0.753	14.273
Proposed	7.135	38.457	0.368	54.112	0.742	16.525