# MF-DCMANet: A Multi-Feature Dual-Stage Cross Manifold Attention Network for PolSAR Target Recognition

## Abstract

## 1. Introduction

- A multi-feature extraction method designed specifically for PolSAR images is proposed. The extracted features describe the target stably and robustly and are, as far as possible, insensitive to target pose, geometry, and radar parameters;
- A dual-stage feature cross-fusion representation framework is proposed, whose two stages are the Cross-Feature Network (CFN) and Cross-Manifold Attention (CMA);
- In MF-DCMANet, handcrafted monogenic features and polarization features are combined with deep features to improve target recognition accuracy;
- Through this dual-stage fusion, the proposed MF-DCMANet achieves the highest overall accuracy (99.75%) among the compared methods on the fully polarimetric GOTCHA dataset;
- Sufficient and comprehensive samples are often hard to obtain in practical PolSAR target recognition applications. Even so, the proposed method achieves satisfactory performance in few-shot and open-set recognition scenarios.

## 2. Related Works

#### 2.1. CNN-Based Multi-Feature Target Recognition

#### 2.2. Transformer in Target Recognition

## 3. Methods

#### 3.1. Problem Formulation

#### 3.2. Multi-Feature Extraction

#### 3.2.1. Monogenic Feature Extraction

#### 3.2.2. Polarization Feature Extraction

#### 3.3. Cross-Feature-Network (CFN)

#### 3.4. Cross-Manifold-Attention (CMA)

#### 3.4.1. Feature Representation on Grassmann Manifold

#### 3.4.2. Distance Metrics on Grassmann Manifold

#### 3.5. Predictor

## 4. Experiments

#### 4.1. Data Description

#### 4.2. Implementation Details

#### 4.3. Evaluation Metrics

#### 4.3.1. Overall Accuracy (OA)

#### 4.3.2. Receiver Operating Characteristic (ROC)

#### 4.4. Quantitative Analysis

#### 4.4.1. Classification Results and Analysis

#### 4.4.2. Classification Accuracy Evaluation under Few-Shot Recognition

#### 4.4.3. Classification Accuracy Evaluation under Open Set Recognition

#### 4.4.4. Ablation Study

**Concat**: Directly concatenate the mid-level monogenic features ${M}_{d}$ and the mid-level polarization features ${P}_{d}$ along the channel dimension.

**Parallel**: Directly add the mid-level monogenic features ${M}_{d}$ and the mid-level polarization features ${P}_{d}$ along the channel dimension.

**En-De**: Perform encoder-decoder fusion on ${M}_{d}$ and ${P}_{d}$.
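The two simplest baselines above differ only in how the two feature maps are combined. The following is a minimal NumPy sketch, assuming both mid-level feature maps share the shape (channels, height, width) and reading "add" as an element-wise sum. The names `m_d` and `p_d`, the function names, and the shapes are illustrative and are not taken from the paper's implementation.

```python
import numpy as np

def fuse_concat(m_d, p_d):
    """Concat baseline: stack the two feature maps along the channel axis."""
    return np.concatenate([m_d, p_d], axis=0)

def fuse_parallel(m_d, p_d):
    """Parallel baseline: element-wise sum, so the channel count is unchanged."""
    return m_d + p_d

# Hypothetical shapes: 128 channels on a 10 x 10 grid.
rng = np.random.default_rng(0)
m_d = rng.standard_normal((128, 10, 10))
p_d = rng.standard_normal((128, 10, 10))

fused_concat = fuse_concat(m_d, p_d)      # channels double
fused_parallel = fuse_parallel(m_d, p_d)  # channels stay the same
```

Concatenation keeps both inputs intact but doubles the channel count seen by later layers, whereas the sum keeps the width fixed at the cost of mixing the two sources irreversibly.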

#### 4.5. Qualitative Analysis

#### 4.5.1. CFN Module Analysis

#### 4.5.2. CMA Module Analysis

## 5. Conclusions

**Figure 1.** Overview of the entire framework. First, low-level features are extracted from the intensity image and the multi-polarization channels of the PolSAR data: monogenic features 𝓂 and polarization features 𝓅, respectively. Two fully convolutional neural networks then mine the mid-level semantic features contained in the monogenic and polarization features. The mid-level semantic features are fed to the first-stage cross-feature network (CFN) to obtain the fused features; the fused features ($\mathrm{cross}_{MP}$) and the mid-level semantic features (${M}_{d},{P}_{d}$) are then fed into the second-stage cross-manifold-attention (CMA) transformer. In the CMA module, these features are first encoded as tokens and then represented on the Grassmann manifold to mine the nonlinear correlations between features, which complement one another through multiple attention fusions.

**Figure 5.** The CMA framework. (**a**) Transformer encoder with local window. (**b**) Cross-manifold-attention mechanism. (**c**) The calculation process for the attention weight, where ${Q}_{1}{K}_{{\rho}_{1}\left(1\right)}$ denotes the projection distance between ${Q}_{1}$ and ${K}_{{\rho}_{1}\left(1\right)}$ calculated according to Equation (17), and ${\rho}_{\tau}\left(i\right)$ denotes the $\tau$-th neighbor of $i$.
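Equation (17) is not reproduced in this section, so the sketch below uses the standard projection metric on the Grassmann manifold, $d_P(X,Y)=2^{-1/2}\,\lVert XX^{\top}-YY^{\top}\rVert_F$, as an assumption about what that equation computes. The helper names and the dimensions `n` and `k` are hypothetical.

```python
import numpy as np

def orthonormal_basis(a):
    """Orthonormal basis (n x k) for the column space of a, via reduced QR."""
    q, _ = np.linalg.qr(a)
    return q

def projection_distance(x, y):
    """Projection metric between the subspaces spanned by orthonormal x and y."""
    diff = x @ x.T - y @ y.T
    return np.linalg.norm(diff, ord="fro") / np.sqrt(2.0)

rng = np.random.default_rng(0)
n, k = 16, 8  # hypothetical ambient and subspace dimensions
q_1 = orthonormal_basis(rng.standard_normal((n, k)))
k_1 = orthonormal_basis(rng.standard_normal((n, k)))

d_same = projection_distance(q_1, q_1)  # identical subspaces
d_diff = projection_distance(q_1, k_1)  # two random subspaces
```

Because the metric depends only on the projectors $XX^{\top}$ and $YY^{\top}$, it is unchanged when either basis is rotated within its own subspace, which is what makes it a distance between subspaces rather than between matrices.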

**Figure 9.** Confusion matrices of MF-DCMANet on the GOTCHA dataset. (**a**) Full training set. (**b**) One-third of the training set. (**c**) One-seventh of the training set. (**d**) One-tenth of the training set.

**Figure 11.** ROC curves and AUC values of different methods in the 1/10 few-shot recognition experiment.

**Figure 12.** The overall accuracy of different methods at different thresholds. (**a**) Softmax. (**b**) KL divergence.

**Figure 14.** Confusion matrices of the proposed method at different feature fusion stages. (**a**) The first stage: CFN module. (**b**) The second stage: CMA module.

**Figure 15.** t-SNE visualization of features from the GOTCHA data. (**a**) The original monogenic features. (**b**) The original polarization features. (**c**) The fused features obtained by the CFN module.

**Figure 17.** Comparison of Euclidean space and the Grassmann manifold. On the Grassmann manifold, the dimensions of $Q$ and $K$ change from ${\mathbb{R}}^{L\times D}$ to ${\mathbb{R}}^{L\times n\times k}$, where $L=9$. The horizontal axis therefore indexes the 81 measures between ${Q}_{i}$ ($i=1,\dots,9$) and ${K}_{j}$ ($j=1,\dots,9$), while the upper and lower parts of the vertical axis show the cosine angles in Euclidean space and the principal angles on the Grassmann manifold, respectively. The 81 cosine angle values from Figure 16 are plotted as the top red polyline. According to the definition of the principal angle in Equation (16), the number of principal angles between ${Q}_{i}$ and ${K}_{j}$ equals $k$ (here $k=8$). Hence, the eight differently colored points within each gray stripe are the eight principal angles of one patch pair.
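The principal angles described in this caption can be computed with a standard QR-plus-SVD recipe. The sketch below is one common way to do it and is not taken from the paper; $k=8$ follows the caption, while the ambient dimension `n` and the function name are assumptions made for illustration.

```python
import numpy as np

def principal_angles(x, y):
    """Principal angles (radians, ascending) between the column spaces of x and y."""
    qx, _ = np.linalg.qr(x)  # orthonormal basis of span(x)
    qy, _ = np.linalg.qr(y)  # orthonormal basis of span(y)
    # The singular values of qx^T qy are the cosines of the principal angles.
    cosines = np.linalg.svd(qx.T @ qy, compute_uv=False)
    # Clip guards against round-off pushing a cosine slightly above 1.
    return np.arccos(np.clip(cosines, -1.0, 1.0))

rng = np.random.default_rng(0)
n, k = 16, 8  # n is hypothetical; k = 8 as in the caption
q_i = rng.standard_normal((n, k))
k_j = rng.standard_normal((n, k))

angles = principal_angles(q_i, k_j)  # k angles for one (Q_i, K_j) pair
```

Each pair of $k$-dimensional subspaces yields $k$ angles, which matches the eight points per stripe in the figure, whereas the Euclidean comparison reduces each pair to a single cosine value.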

Layer | Input | Block1 | Block2 | Block3 | Block4 | Block5 |
---|---|---|---|---|---|---|
Network | | Ex-Net | Ex-Net | Ex-Net | Ex-Net | Fu-Net |
Convolution | | 5 × 5 Conv | 5 × 5 Conv | 5 × 5 Conv | 5 × 5 Conv | 1 × 1 Conv |
Normalization | | BN | BN | BN | BN | BN |
Activation | | ReLU | ReLU | ReLU | ReLU | ReLU |
Pooling | | 2 × 2 MP | | | | |
Size | 20 × 20 × d | 20 × 20 × 16 | 10 × 10 × 32 | 10 × 10 × 64 | 10 × 10 × 128 | 10 × 10 × 128 |

Dataset | Category | Pass | Number |
---|---|---|---|
Training set | 1–9 | 1, 3, 5, 7 | 360 × 9 |
Test set | 1–9 | 2, 4, 6, 8 | 360 × 9 |

Hyperparameter | Value |
---|---|
Batch size | 64 |
Optimizer | Adam |
Initial learning rate | 0.01 |
Learning rate decay | Exponential decay |
Momentum | 0.9 |
Weight decay | 0.0001 |
Epochs | 100 |
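The table above names exponential decay but does not give the decay factor. The following sketch shows what such a schedule looks like under the listed initial learning rate and epoch count; the value of `GAMMA` and the per-epoch update are assumptions, not settings reported in the source.

```python
def exponential_lr(initial_lr, gamma, epoch):
    """Learning rate after `epoch` epochs of exponential decay."""
    return initial_lr * gamma ** epoch

INITIAL_LR = 0.01  # from the table
EPOCHS = 100       # from the table
GAMMA = 0.95       # assumed decay factor; not stated in the source

# One learning rate per epoch, starting at INITIAL_LR and shrinking by GAMMA each time.
schedule = [exponential_lr(INITIAL_LR, GAMMA, epoch) for epoch in range(EPOCHS)]
```

A factor closer to 1 keeps the learning rate high for longer, so the choice of `GAMMA` should be tuned together with the number of epochs.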

Input | Category | Method | Classifier | OA (%) | FPS |
---|---|---|---|---|---|
Handcrafted features | Mono-based | Mono | SRC | 97.72 | 83.08 |
Handcrafted features | Mono-based | Mono-HOG | SVM | 98.15 | 58.65 |
Handcrafted features | Mono-based | Mono-BoVW | SVM | 98.02 | 19.11 |
Handcrafted features | Mono-based | Mono-Grass | SRC | 98.61 | 24.23 |
Handcrafted features | Pol-based | Polarimetric decomposition | SVM | 98.30 | 32.65 |
Handcrafted features | Pol-based | Polarimetric scattering coding | SVM | 97.65 | 39.94 |
Handcrafted features | Others | Steerable Wavelet | SVM | 98.89 | 38.54 |
Handcrafted features | Others | ASC | SVM | 98.46 | 15.49 |
Deep features | CNN-based | A-ConvNet | Softmax | 97.99 | 442 |
Deep features | CNN-based | CV-CNN | Softmax | 98.46 | 403.91 |
Deep features | CNN-based | CV-FCNN | Softmax | 98.98 | 341.89 |
Deep features | CNN-based | CVNLNet | Softmax | 99.44 | 320.46 |
Deep features | CNN-based | RVNLNet | Softmax | 98.52 | 431.71 |
Deep features | Transformer-based | ViT | Softmax | 98.77 | 389.2 |
Deep features | Transformer-based | SpectralFormer | Softmax | 98.12 | 376.27 |
Deep features | Transformer-based | CrossViT | Softmax | 99.17 | 363.07 |
Deep features | Others | SymNet | KNN | 97.28 | 263.28 |
Deep features | Others | Monogenic ConvNet layer | Softmax | 98.73 | 308.70 |
Multi-features | | FEC | Softmax | 99.10 | 195.06 |
Multi-features | | Mono-CVNLNet | Softmax | 99.54 | 227.96 |
Multi-features | | Proposed | Softmax | 99.75 | 322.93 |

Method | Known Target Accuracy (%) | Unknown Target Accuracy (%) | Overall Target Accuracy (%) |
---|---|---|---|
Mono-Grass | 89.96 | 67.64 | 85.00 |
Mono-ConvNet | 88.89 | 72.78 | 85.31 |
Wavelet | 92.30 | 70.00 | 87.35 |
CrossViT | 92.78 | 71.11 | 87.96 |
CVNLNet | 93.97 | 69.58 | 88.55 |
FEC | 93.29 | 73.19 | 88.83 |
Mono-CVNLNet | 93.81 | 75.14 | 89.66 |
Proposed | 95.20 | 77.36 | 91.23 |
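The rejection rule behind these open-set results is not spelled out in this section. One common approach, consistent with the "Softmax" threshold panel in Figure 12, is to label a sample as unknown when its highest softmax probability falls below a threshold. The sketch below illustrates that idea only; the function names, the `UNKNOWN` label, the logits, and the threshold value are all hypothetical.

```python
import math

UNKNOWN = -1  # illustrative label for rejected samples

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_open_set(logits, threshold):
    """Return the top class index, or UNKNOWN if its probability is below threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else UNKNOWN

confident = predict_open_set([6.0, 0.5, 0.1], threshold=0.9)  # one class dominates
ambiguous = predict_open_set([1.0, 0.9, 0.8], threshold=0.9)  # no class dominates
```

Raising the threshold rejects more samples, which tends to improve unknown-target accuracy at the expense of known-target accuracy, so the two columns of the table have to be read together.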

Method | M-DSMANet | P-DSMANet | MF-DCMANet |
---|---|---|---|
OA (%) | 98.58 | 97.84 | 99.75 |

Method | Concat | Parallel | En-De | CFN |
---|---|---|---|---|
OA (%) | 98.55 | 98.73 | 95.68 | 99.75 |

Method | Euclidean | Grassmann |
---|---|---|
OA (%) | 98.64 | 99.75 |

