In this section, we first provide the definition of class-incremental target classification and then introduce the detailed network structure and learning procedures of MMFAN.
  2.1. Problem Definition
Without loss of generality, the data sequence of class-incremental learning includes $T$ groups of classification tasks $\{\mathcal{D}^1, \mathcal{D}^2, \ldots, \mathcal{D}^T\}$, where $\mathcal{D}^t = \{(x_i^t, y_i^t)\}_{i=1}^{N^t}$ is the $t$-th incremental task, $N^t$ is the number of training samples in task $t$, $x_i^t \in \mathbb{R}^{W \times H}$ represents the ISAR image sample, and $W$ and $H$ denote the width and height of the image, respectively. The class label of the sample is written as $y_i^t \in \mathcal{Y}^t$, where $\mathcal{Y}^t$ denotes the label space corresponding to task $t$. The label spaces do not overlap between tasks, i.e., $\mathcal{Y}^i \cap \mathcal{Y}^j = \varnothing$ for $i \neq j$. During training task $t$, the model can only access the data $\mathcal{D}^t$, and the goal of class-incremental learning is to establish a classifier $f(\cdot; \mathcal{H})$ for all the classes by continuously learning new knowledge from the $T$ tasks. After the training of task $t$, MMFAN maintains a prototype library $\mathcal{P}^t = \{p_k\}$ of known classes, which is then validated on all test sets from task 1 to $t$, where $p_k \in \mathbb{R}^{d_e}$ is a prototypical vector of dimension $d_e$. The ideal class-incremental learning model $f(\cdot; \mathcal{H})$ not only performs well on the newly learned classes of task $t$ but also needs to retain memory for historical classes, i.e.,

$$\mathcal{H}^{*} = \arg\max_{\mathcal{H}} \sum_{i=1}^{t} \sum_{(x, y) \in \mathcal{D}^{i}_{\mathrm{test}}} \mathbb{1}\big(f(x; \mathcal{H}) = y\big),$$

where $\mathcal{H}$ denotes the parameters of model $f$ and $\mathbb{1}(\cdot)$ is an indicator function, i.e., it outputs 1 if the condition is satisfied and 0 otherwise. $\mathcal{D}^{t}_{\mathrm{test}}$ represents the test data for task $t$ and is subject to deformation distortion. The training samples do not intersect with the test samples, i.e., $\mathcal{D}^{t} \cap \mathcal{D}^{t}_{\mathrm{test}} = \varnothing$. As a non-exemplar class-incremental learning framework, MMFAN can only access $\mathcal{D}^{t}$ and the prototypes $\mathcal{P}^{t-1}$ during the training of task $t$ and must also remain robust to deformations of the test samples.
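To make this objective concrete, a minimal evaluation sketch in PyTorch is given below; it assumes a trained `model` and one test DataLoader per learned task, and the function name `incremental_accuracy` is ours rather than part of MMFAN.

```python
import torch

def incremental_accuracy(model, test_loaders, device="cpu"):
    """Average accuracy over the test sets of all tasks learned so far.
    test_loaders holds one DataLoader per task; the indicator 1[f(x; H) = y]
    in the objective becomes an equality check on the predicted labels."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for loader in test_loaders:                 # tasks 1 .. t
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                pred = model(x).argmax(dim=1)       # f(x; H)
                correct += (pred == y).sum().item() # indicator function
                total += y.numel()
    return correct / max(total, 1)
```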
  2.2. Overall Structure
The overall structure of MMFAN is shown in 
Figure 1, which can be divided into three parts: the Mix-Mamba backbone, FAN, and the loss bar. The Mix-Mamba backbone extracts the embedding features from the input samples and memorizes historical information by the prototypical vectors. FAN transfers the embedding features and prototypes between different tasks. The loss bar fuses five different loss functions to form a prototype-guided and non-exemplar incremental training procedure that guides the parameter updating of Mix-Mamba and FAN. In 
Figure 1, the dotted and solid trapezoids represent the parameters of the Mix-Mamba backbone in task 
t − 1 (old) and task 
t (new), respectively. In the embedded feature zone, solid squares represent embeddings extracted from the original input images, hatched squares represent embeddings extracted from the augmented input images, and stars represent prototypes generated from the embedded features; stars and squares with the same color correspond to the same target category. The blue arrows denote the generation and update flow of the prototypes, while the black arrows denote the data flow for network training.
Take task $t$ as an example to generally describe the training and test of MMFAN. First, the parameters of Mix-Mamba from task $t-1$ are frozen and copied as the old backbone $\Phi_{\mathrm{old}}$ (which does not involve gradient calculation) to serve as a reference for the historical memory. The new backbone $\Phi_{\mathrm{new}}$ participates in all gradient backpropagation and parameter updating and produces the classification results. Then, the input image $x_i^t$ is fed into both the old and new backbones to obtain the embedding features $e_{\mathrm{old}}$ and $e_{\mathrm{new}}$, respectively. In order to keep the new model from overfitting to the distribution of the new classes, the distillation loss $\mathcal{L}_{\mathrm{dis}}$ is calculated to minimize the difference between the features extracted by the old and new models from the same new sample. For the new model, the augmented input samples $\tilde{x}_i^t$ (obtained by scaling and rotating $x_i^t$) are fed into the new backbone to obtain the augmented features $\tilde{e}_{\mathrm{new}}$. Then $\tilde{e}_{\mathrm{new}}$ and the historical prototypes $\mathcal{P}^{t-1}$ serve as the contrastive templates for supervised contrastive learning with $e_{\mathrm{new}}$, yielding the contrastive loss $\mathcal{L}_{\mathrm{con}}$.
FAN plays the role of transferring the feature distribution from the old backbone to the new backbone, i.e., $\hat{e}_{\mathrm{old}} = A(e_{\mathrm{old}})$. To achieve better proximity between the adjusted features and the new backbone features, the feature adjustment loss $\mathcal{L}_{\mathrm{adj}}$ is designed to minimize the difference between $\hat{e}_{\mathrm{old}}$ and $e_{\mathrm{new}}$. Another role of FAN is to adjust the prototypes $\mathcal{P}^{t-1}$ corresponding to all learned classes, thus mapping the old-class prototypes to the new feature space, i.e., $\hat{\mathcal{P}}^{t-1} = A(\mathcal{P}^{t-1})$. Then, the adjusted prototypes are input into the classifier to obtain the predicted prototype labels, and the prototype loss $\mathcal{L}_{\mathrm{pro}}$ is computed by comparing the predicted prototype labels with the true class labels. The features extracted by the new backbone are also fed into the classifier, and the classification loss $\mathcal{L}_{\mathrm{cls}}$ is obtained by comparing the predictions with the true class labels. Finally, the mean of $e_{\mathrm{new}}$ over each new class is computed to obtain the new-class prototypes $\mathcal{P}^{t}_{\mathrm{new}}$ for task $t$, which are then combined with $\hat{\mathcal{P}}^{t-1}$ to update the prototype library as $\mathcal{P}^{t} = \hat{\mathcal{P}}^{t-1} \cup \mathcal{P}^{t}_{\mathrm{new}}$.
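The flow described above can be summarized in a schematic PyTorch-style sketch; the module handles (`new_backbone`, `old_backbone`, `fan`, `classifier`), the `augment` transform, and the argument names are illustrative placeholders rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def mmfan_training_step(x, y, new_backbone, old_backbone, fan, classifier,
                        prototypes, proto_labels, augment, contrastive_loss,
                        weights):
    """Schematic MMFAN training step for task t (our own sketch).

    old_backbone is the frozen copy from task t-1, fan is the feature
    adjustment network, prototypes is the library P^{t-1} with class labels
    proto_labels, and contrastive_loss is a callable for the supervised
    contrastive term."""
    with torch.no_grad():
        e_old = old_backbone(x)                      # frozen reference features
    e_new = new_backbone(x)                          # trainable new features
    e_aug = new_backbone(augment(x))                 # features of augmented samples

    l_dis = F.mse_loss(e_new, e_old)                 # distillation loss
    l_con = contrastive_loss(e_new, e_aug, y,        # supervised contrastive loss
                             prototypes, proto_labels)
    e_adj = fan(e_old)                               # FAN: old -> new feature space
    l_adj = F.mse_loss(e_adj, e_new)                 # feature adjustment loss
    p_adj = fan(prototypes)                          # adjusted old-class prototypes
    l_pro = F.cross_entropy(classifier(p_adj), proto_labels)   # prototype loss
    l_cls = F.cross_entropy(classifier(e_new), y)    # classification loss

    w = weights                                      # lambda_1 .. lambda_5
    return (w[0] * l_cls + w[1] * l_con + w[2] * l_pro
            + w[3] * l_dis + w[4] * l_adj)
```

At the end of the task, the per-class means of the new-backbone features would be computed and appended to the adjusted prototype library, mirroring the prototype update described above.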
  2.3. Mix-Mamba
The network structure of Mix-Mamba is shown in 
Figure 2, which is composed of three CST convolution stages and two Mamba stages. Compared with the original Mamba, Mix-Mamba mainly adds the CST convolution as a deformation-robust structure; in addition, motivated by the Vision Transformer, Mix-Mamba uses the Mamba vision block to extract global features from 2D feature maps. As a backbone, Mix-Mamba takes image samples as input and outputs embedding vectors of dimension $d_e$. Due to the spatial inductive bias and strong local feature-extraction capabilities of convolutional networks, the CST convolution stages with residuals are designed to extract high-resolution features while correcting deformations such as scaling, rotation, and perspective distortion in the feature maps. The Mamba stages divide the feature maps into patches and employ the selective scanning structured state-space model (S6M) [59] for global context modeling, thus overcoming the limitation of convolutional networks in capturing global spatial relationships.
CST convolution stage: The CST operation makes three improvements over the original STN to address feature mismatch and shape mismatch: (1) introducing a more flexible homography transformation instead of the affine transformation [
60], (2) applying spatial transformations independently to each channel of the multi-layer features, and (3) generating the transformation parameters through a Transformer encoder rather than a simple linear function. By utilizing the cross-attention mechanism, the homography parameters are extracted from the image features through a learnable query vector. For the input feature map $U \in \mathbb{R}^{C \times H \times W}$, where $H$ and $W$ represent the height and width of the feature map and $C$ is the number of channels, the Transformer encoder calculates the attention scores by scaled dot-product attention. Let the attention value $V$ and key $K$ be obtained through linear mappings of the feature map, while the query $Q$ is a learnable parameter that queries a preset length of output from the feature. We denote $d_h$ as the hidden dimension of the Transformer encoder, $\mathrm{FFN}(\cdot)$ as a two-layer fully-connected network followed by the Gaussian error linear unit (GELU) non-linear activation function, $\phi(\cdot)$ as the linear mapping from the feature dimension $C$ to $d_h$, and $\mathrm{LN}(\cdot)$ as the layer normalization; then the homography parameters $\theta \in \mathbb{R}^{C \times 9}$ are obtained by

$$\theta = \mathrm{FFN}\Big(\mathrm{LN}\Big(\mathrm{softmax}\Big(\tfrac{Q K^{\top}}{\sqrt{d_h}}\Big) V + Q\Big)\Big), \qquad K = \phi_K(U), \quad V = \phi_V(U),$$

where the homography transformation parameters are extracted from the image features through the learnable query vector by the Transformer cross-attention mechanism.
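A possible realization of this parameter branch is sketched below; the layer sizes, the use of `nn.MultiheadAttention`, and the single-layer design are our assumptions where the text does not pin them down.

```python
import torch
import torch.nn as nn

class HomographyParamHead(nn.Module):
    """Sketch of a cross-attention head that predicts one set of homography
    parameters per channel from a feature map. Layer sizes and the use of
    nn.MultiheadAttention are assumptions made for illustration."""

    def __init__(self, channels, d_h=128, n_heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(channels, d_h))   # learnable query Q
        self.to_kv = nn.Linear(channels, 2 * d_h)                # K, V from the feature
        self.attn = nn.MultiheadAttention(d_h, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_h)
        self.ffn = nn.Sequential(nn.Linear(d_h, d_h), nn.GELU(), nn.Linear(d_h, 9))

    def forward(self, u):                    # u: (B, C, H, W)
        b, c, h, w = u.shape
        tokens = u.flatten(2).transpose(1, 2)          # (B, H*W, C) pixel tokens
        k, v = self.to_kv(tokens).chunk(2, dim=-1)     # keys and values
        q = self.query.unsqueeze(0).expand(b, -1, -1)  # (B, C, d_h)
        out, _ = self.attn(q, k, v)                    # cross-attention
        theta = self.ffn(self.norm(out + q))           # (B, C, 9)
        return theta.view(b, c, 3, 3)                  # per-channel 3x3 homographies
```

In practice, the final linear layer would usually be biased toward the identity homography so that training starts from an undistorted sampling grid.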
After obtaining $\theta$, it can be rearranged into $C$ groups of individual $3 \times 3$ transformation matrices $\theta_c$ according to homogeneity, and then each channel of the feature map can be resampled. The homography transformation can map the projections of 3D scatter points between different IPPs, thereby enabling the adjustment of the unknown deformation of ISAR images. Specifically, it calculates the pixel at position $(x_o, y_o)$ in the output feature map $\hat{U}$ corresponding to the position $(x_s, y_s)$ in the input feature map $U$, where $(x_s, y_s, 1)^{\top} \propto \theta_c\,(x_o, y_o, 1)^{\top}$. Then, the output pixel value of channel $c$ is obtained by bilinear interpolation as follows:

$$\hat{U}_c(x_o, y_o) = \sum_{n=1}^{H} \sum_{m=1}^{W} U_c(m, n)\,\max\big(0,\, 1 - |x_s - m|\big)\,\max\big(0,\, 1 - |y_s - n|\big).$$
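The per-channel homography resampling with bilinear interpolation can be sketched with `grid_sample` by treating every channel as its own single-channel image; the normalization of coordinates to $[-1, 1]$ is a `grid_sample` convention, not something stated in the text.

```python
import torch
import torch.nn.functional as F

def channelwise_homography_warp(u, theta):
    """Warp each channel of u (B, C, H, W) with its own 3x3 homography
    theta (B, C, 3, 3), using grid_sample for the bilinear interpolation."""
    b, c, h, w = u.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=u.device),
                            torch.linspace(-1, 1, w, device=u.device),
                            indexing="ij")
    grid = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).reshape(-1, 3)
    coords = torch.einsum("bcij,nj->bcni", theta, grid)          # homogeneous coords
    coords = coords[..., :2] / coords[..., 2:].clamp(min=1e-6)   # (x_s, y_s)
    grid_s = coords.reshape(b * c, h, w, 2)                      # one grid per channel
    out = F.grid_sample(u.reshape(b * c, 1, h, w), grid_s,
                        mode="bilinear", align_corners=True)
    return out.reshape(b, c, h, w)
```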
In each CST convolution stage, the adjusted feature map $\hat{U}$ is then fed into a two-layer $3 \times 3$ convolutional network ($\mathrm{Conv}(\cdot)$) for local feature extraction, where each convolutional layer is followed by batch normalization (BN) and a GELU activation. Finally, the output of the CST convolution stage $F_{\mathrm{CST}}$ is obtained by element-wise addition of the adjusted feature map as a residual connection, followed by down-sampling with a $2 \times 2$ max-pooling ($\mathrm{MP}(\cdot)$) that doubles the channel size, i.e.,

$$F_{\mathrm{CST}} = \mathrm{MP}\big(\mathrm{Conv}(\hat{U}) + \hat{U}\big).$$
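Putting the pieces together, one CST convolution stage might look as follows, reusing the two sketches above; the 1x1 convolution placed after the pooling for channel doubling is our guess, since the text leaves its exact position open.

```python
import torch.nn as nn

class CSTStage(nn.Module):
    """One CST convolution stage as we read it: homography-based adjustment,
    a two-layer 3x3 conv block with BN and GELU, a residual connection, and
    a 2x2 max-pool with channel doubling."""

    def __init__(self, channels):
        super().__init__()
        self.param_head = HomographyParamHead(channels)       # sketch above
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.GELU())
        self.pool = nn.MaxPool2d(2)
        self.expand = nn.Conv2d(channels, 2 * channels, 1)    # assumed channel doubling

    def forward(self, u):
        u_adj = channelwise_homography_warp(u, self.param_head(u))  # CST adjustment
        out = self.conv(u_adj) + u_adj                               # residual connection
        return self.expand(self.pool(out))                           # downsample, double channels
```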
Mamba stage: The global feature extraction is realized by patch partition/reverse and two Mamba vision blocks [
61]. The feature map $F \in \mathbb{R}^{C \times H \times W}$ is firstly partitioned (denoted by $\mathrm{PP}(\cdot)$) into $N_C$ non-overlapping patches of size $P \times P$, where $N_C = HW/P^2$, and these patches are then flattened into vectors of length $P^2 C$, which are then linearly mapped and added with a learnable positional encoding $E_{\mathrm{pos}}$ to obtain the patch embedding sequence $Z_0 \in \mathbb{R}^{N_C \times d}$ as

$$Z_0 = \mathrm{Linear}\big(\mathrm{PP}(F)\big) + E_{\mathrm{pos}},$$

where $d$ denotes the feature dimension of this layer and $\mathrm{PP}(\cdot)$ denotes the process of patch partition.
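The patch partition and embedding step can be written in a few lines of plain PyTorch; the class `PatchEmbedding` and its argument names are illustrative.

```python
import torch
import torch.nn as nn

def patch_partition(f, p):
    """Split a feature map (B, C, H, W) into non-overlapping PxP patches and
    flatten each into a vector of length P*P*C (H and W are assumed to be
    divisible by P); a plain-PyTorch sketch of the PP(.) operation."""
    b, c, h, w = f.shape
    f = f.reshape(b, c, h // p, p, w // p, p)
    f = f.permute(0, 2, 4, 3, 5, 1).reshape(b, (h // p) * (w // p), p * p * c)
    return f                                          # (B, N_C, P*P*C)

class PatchEmbedding(nn.Module):
    """Linear projection of the flattened patches plus a learnable positional
    encoding, producing the sequence Z_0."""

    def __init__(self, in_ch, p, d, n_patches):
        super().__init__()
        self.p = p
        self.proj = nn.Linear(p * p * in_ch, d)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, d))

    def forward(self, f):
        return self.proj(patch_partition(f, self.p)) + self.pos
```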
In the Mamba vision block, the patch embedding sequence $Z_0$ is firstly expanded linearly along the feature dimension from $d$ to $2d$, and then split into two streams of the same size (denoted as $\mathrm{Split}(\cdot)$): the Mamba flow $Z_m$ and the residual flow $Z_r$, i.e.,

$$[Z_m, Z_r] = \mathrm{Split}\big(\mathrm{Linear}_{d \to 2d}(Z_0)\big).$$
The Mamba flow consists of a 1D convolution with a kernel size of 3 (denoted as $\mathrm{Conv1D}(\cdot)$), a sigmoid linear unit ($\mathrm{SiLU}$) activation, and an S6M operation (denoted as $\mathrm{S6M}(\cdot)$). In contrast, the residual flow maintains the same structure without the S6M operation to preserve the local spatial relationships. Thereafter, the outputs of both streams are concatenated along the feature dimension (denoted as $\mathrm{Concat}(\cdot)$) and linearly mapped back to the original dimension $d$ as

$$Z_1 = \mathrm{Linear}_{2d \to d}\Big(\mathrm{Concat}\big[\mathrm{S6M}\big(\mathrm{SiLU}(\mathrm{Conv1D}(Z_m))\big),\ \mathrm{SiLU}\big(\mathrm{Conv1D}(Z_r)\big)\big]\Big).$$
In order to recover the feature maps, the output of the Mamba vision block $F_{\mathrm{M}}$ is obtained by linear mapping and patch reverse (denoted as $\mathrm{PR}(\cdot)$):

$$F_{\mathrm{M}} = \mathrm{PR}\big(\mathrm{Linear}(Z_1)\big).$$
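The two-stream block can be sketched as follows; the selective-scan operator is passed in as a callable `s6m` because its internals are described separately below, and the layer choices are ours.

```python
import torch
import torch.nn as nn

class MambaVisionBlock(nn.Module):
    """Sketch of the two-stream Mamba vision block: expand d -> 2d, split into
    a Mamba flow (Conv1D + SiLU + S6M) and a residual flow (Conv1D + SiLU),
    concatenate, and project back to d. s6m maps (B, L, d) -> (B, L, d)."""

    def __init__(self, d, s6m):
        super().__init__()
        self.expand = nn.Linear(d, 2 * d)
        self.conv_m = nn.Conv1d(d, d, kernel_size=3, padding=1)
        self.conv_r = nn.Conv1d(d, d, kernel_size=3, padding=1)
        self.act = nn.SiLU()
        self.s6m = s6m
        self.proj = nn.Linear(2 * d, d)

    def forward(self, z):                                  # z: (B, L, d)
        z_m, z_r = self.expand(z).chunk(2, dim=-1)         # split into two streams
        z_m = self.act(self.conv_m(z_m.transpose(1, 2)).transpose(1, 2))
        z_m = self.s6m(z_m)                                # selective scan on the Mamba flow
        z_r = self.act(self.conv_r(z_r.transpose(1, 2)).transpose(1, 2))
        return self.proj(torch.cat([z_m, z_r], dim=-1))    # concat and project back to d
```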
Finally, the feature downsampling and channel doubling are realized in the same way as (4). After the last Mamba stage, a global average pooling layer is designed to obtain the embedding feature vector corresponding to the input ISAR image.
In the Mamba vision block, S6M is the core for global feature extraction and contextual representation, and it is improved from the SSM. The SSM handles long-term dependencies through sequence-to-sequence mappings and maintains a set of hidden states to predict the output. For a 1D sequence $x = (x_1, x_2, \ldots, x_L)$, where $x_k \in \mathbb{R}$ and $L$ is the sequence length, the continuous SSM defines the linear mapping from the input $x(t)$ to the hidden state $h(t) \in \mathbb{R}^{M}$ and the output $y(t)$ as

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t),$$

where $A \in \mathbb{R}^{M \times M}$ denotes a state transition matrix that governs the retention of the hidden state, $B \in \mathbb{R}^{M \times 1}$ denotes an input matrix that governs the update of the hidden state, and $C \in \mathbb{R}^{1 \times M}$ denotes the output matrix that governs the contribution of the hidden state to the output. $M$ is the dimension of the state space. Under the framework of deep learning, the continuous model is computationally inefficient and hard to train, so the SSM is discretized by converting derivatives into differences and aligning with the data sampling rate. The zero-order hold technique [62] preserves the discrete data for a certain period and generates a continuous output during that period. The preservation period is referred to as the sampling timescale $\Delta$, and the discrete SSM takes the form of

$$h_k = \bar{A}\,h_{k-1} + \bar{B}\,x_k, \qquad y_k = C\,h_k,$$

where

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\big(\exp(\Delta A) - I\big)\,\Delta B.$$
To embed the discrete SSM into a deep network, the state updating can be expanded along the time dimension and implemented by a 1D convolution, i.e.,

$$y = x * \bar{K}, \qquad \bar{K} = \big(C\bar{B},\ C\bar{A}\bar{B},\ \ldots,\ C\bar{A}^{L-1}\bar{B}\big),$$

where $\bar{K}$ is the convolution kernel. Therefore, the discrete SSM can be realized in parallel during training via convolution while retaining memory capability like a recurrent neural network (RNN). However, the SSM is linear time-invariant, i.e., $A$, $B$, and $C$ are static parameters whose values do not directly depend on the input sequence, which limits the capability of capturing long-term dependencies and global representations. Hence, S6M introduces selective scanning to grant the hidden states the ability to select content based on the input data. In the selective scanning, the step size $\Delta$, the matrix $B$, and the matrix $C$ are derived from the parallelized input sequence $x$ as follows:

$$\Delta = \mathrm{softplus}\big(\mathrm{Linear}_{\Delta}(x)\big), \qquad B = \mathrm{Linear}_{B}(x), \qquad C = \mathrm{Linear}_{C}(x),$$

where $\mathrm{softplus}(\cdot)$ ensures that the timescale is positive, and $N$ is the dimension of the selective scanning.
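A loop-based sketch of the selective scan is given below for illustration; it follows the discretization described above (with the common simplification $\bar{B} \approx \Delta B$) and is not the parallel-scan kernel used in practice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveScan(nn.Module):
    """Loop-based sketch of the S6M recurrence: Delta, B, and C are predicted
    from the input, A is discretized with the zero-order hold, and the hidden
    state is updated step by step."""

    def __init__(self, d, n):
        super().__init__()
        self.log_a = nn.Parameter(torch.randn(d, n))   # parameterizes a negative A
        self.to_delta = nn.Linear(d, d)
        self.to_b = nn.Linear(d, n)
        self.to_c = nn.Linear(d, n)

    def forward(self, x):                              # x: (B, L, d)
        bsz, seq_len, d = x.shape
        a = -torch.exp(self.log_a)                     # (d, n)
        delta = F.softplus(self.to_delta(x))           # (B, L, d), positive timescale
        b_t, c_t = self.to_b(x), self.to_c(x)          # (B, L, n) each
        h = x.new_zeros(bsz, d, a.shape[1])            # hidden state (B, d, n)
        ys = []
        for k in range(seq_len):
            a_bar = torch.exp(delta[:, k].unsqueeze(-1) * a)            # zero-order hold
            b_bar = delta[:, k].unsqueeze(-1) * b_t[:, k].unsqueeze(1)  # simplified B_bar
            h = a_bar * h + b_bar * x[:, k].unsqueeze(-1)               # state update
            ys.append((h * c_t[:, k].unsqueeze(1)).sum(-1))             # output y_k
        return torch.stack(ys, dim=1)                  # (B, L, d)
```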
  2.4. FAN
As a bridge between the distributions of old- and new-class features, FAN transfers the old-class prototypes to the feature space of the new tasks, thereby relieving the catastrophic forgetting caused by prototype mismatch. Let the backbone networks of task $t-1$ and task $t$ generate the embedding features $e^{t-1} \in \mathcal{E}^{t-1}$ and $e^{t} \in \mathcal{E}^{t}$, respectively, from the same sample of task $t$, where $\mathcal{E}^{t-1}$ and $\mathcal{E}^{t}$ are the corresponding feature spaces formed by all possible features. To transfer the distribution of $\mathcal{E}^{t-1}$ to $\mathcal{E}^{t}$ with minimal cost, the optimal transport model is established in the discrete form of the Monge formulation:

$$\mathcal{T}^{*} = \arg\min_{\mathcal{T}} \sum_{i=1}^{N^{t}} c\big(e_i^{t-1},\ \mathcal{T}(e_i^{t-1})\big),$$

where $\mathcal{T}^{*}$ is the optimal transport function that minimizes the cost, $\mathcal{T}: \mathcal{E}^{t-1} \rightarrow \mathcal{E}^{t}$ is the measure-preserving mapping that transports the feature space $\mathcal{E}^{t-1}$ to $\mathcal{E}^{t}$, and $c(\cdot, \cdot)$ denotes the cost function of transferring $e^{t-1}$ to $e^{t}$; here, it can be represented as the Wasserstein distance between the features extracted by the old and new backbones, i.e.,

$$c\big(e_i^{t-1},\ \mathcal{T}(e_i^{t-1})\big) = \big\|\, \mathcal{T}(e_i^{t-1}) - e_i^{t} \,\big\|_2.$$
In order to realize end-to-end model training, the proposed FAN is also implemented with a Transformer encoder, as shown in (2), and the self-attention mechanism is applied here to capture the internal relationship between the old and new embedded features, thereby realizing the transfer of the feature distribution. The parameters of FAN are optimized jointly by the feature adjustment loss and the prototype loss.
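A minimal FAN sketch based on a single Transformer encoder layer is shown below; treating a set of feature vectors (or prototypes) as one token sequence so that they can attend to each other is our reading of the self-attention description, and the layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class FAN(nn.Module):
    """Sketch of the feature adjustment network as a single Transformer encoder
    layer with self-attention, mapping old-backbone features (or old-class
    prototypes) toward the new feature space."""

    def __init__(self, d_e, n_heads=4, d_ff=512):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=d_e, nhead=n_heads, dim_feedforward=d_ff,
            activation="gelu", batch_first=True)

    def forward(self, e_old):                 # (B, d_e) features or (K, d_e) prototypes
        z = e_old.unsqueeze(0)                # one sequence of B (or K) tokens
        return self.encoder(z).squeeze(0)     # adjusted features in the new space
```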
  2.5. Network Training
In the context of non-exemplar incremental learning, to alleviate the conflict between catastrophic forgetting and new-task learning, MMFAN combines feature replay and knowledge distillation techniques [
63]. The loss bar is proposed for prototype-guided and non-exemplar network training, where the supervised classification loss $\mathcal{L}_{\mathrm{cls}}$, contrastive loss $\mathcal{L}_{\mathrm{con}}$, prototype loss $\mathcal{L}_{\mathrm{pro}}$, unsupervised distillation loss $\mathcal{L}_{\mathrm{dis}}$, and feature adjustment loss $\mathcal{L}_{\mathrm{adj}}$ are weighted and summed as follows:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{cls}} + \lambda_2 \mathcal{L}_{\mathrm{con}} + \lambda_3 \mathcal{L}_{\mathrm{pro}} + \lambda_4 \mathcal{L}_{\mathrm{dis}} + \lambda_5 \mathcal{L}_{\mathrm{adj}},$$

where $\lambda_1, \ldots, \lambda_5$ are the weights for each loss.
Classification loss: It measures the accuracy of the class-label predictions made by the Mix-Mamba backbone and the linear classifier $g(\cdot)$, fulfilled by the following cross-entropy loss function:

$$\mathcal{L}_{\mathrm{cls}} = -\frac{1}{N^{t}} \sum_{i=1}^{N^{t}} \sum_{k=1}^{K} y_{i,k} \log\big(\mathrm{softmax}\big(g(e_{\mathrm{new},i})\big)_k\big),$$

where $K$ is the total number of classes, and $y_{i,k}$ denotes the true class label in one-hot form.
Contrastive loss: It helps the backbone to compress the feature space and separate the different classes in incremental tasks. Since the historical data cannot be accessed, the prototypes corresponding to old classes may overlap with the features extracted by the new backbone in the new feature space, increasing the error of the decision boundaries. For a batch of input samples, the corresponding embedding features are denoted as a set $E$, while the features of the augmented samples are denoted as a set $\tilde{E}$. The supervised contrastive loss takes each $e_i \in E$ as a reference; the features in $\tilde{E}$ that share the same class as the reference are positive samples. In contrast, the prototypes in $\mathcal{P}^{t-1}$ and the features in $\tilde{E}$ belonging to classes different from that of $e_i$ are considered negative samples. Consequently, the inner product of features is employed to quantify the similarity between positive and negative samples:

$$\mathcal{L}_{\mathrm{con}} = -\frac{1}{|E|} \sum_{e_i \in E} \frac{1}{|P(i)|} \sum_{\tilde{e}_j \in P(i)} \log \frac{\exp\big(e_i \cdot \tilde{e}_j / \tau\big)}{\sum_{z_a \in \tilde{E} \cup \mathcal{P}^{t-1}} \exp\big(e_i \cdot z_a / \tau\big)},$$

where $y_i$ and $y_j$ are the labels corresponding to $e_i$ and $\tilde{e}_j$, $P(i) = \{\tilde{e}_j \in \tilde{E} \mid y_j = y_i\}$ with $|P(i)|$ denoting the number of samples in $\tilde{E}$ with the same class label as $e_i$, and $\tau$ denotes the temperature coefficient. In (18), the numerator encourages features of the same class to come closer (reducing the intra-class distance), while the denominator encourages pushing away features of different classes (increasing the inter-class distance). Therefore, the supervised contrastive loss prevents a dispersed feature distribution of the new samples, thus mitigating the failure of historical decision boundaries and knowledge forgetting.
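The supervised contrastive loss with prototype negatives can be sketched as follows; the mean reduction over anchors and the masking of prototype columns are our choices where the text is not explicit.

```python
import torch

def prototype_supcon_loss(e, e_aug, labels, prototypes, proto_labels, tau=0.1):
    """Sketch of the supervised contrastive loss described above: anchors are
    the batch features e, positives are augmented features of the same class,
    and augmented features of other classes plus old-class prototypes act as
    negatives."""
    cand = torch.cat([e_aug, prototypes], dim=0)           # candidate features
    cand_lab = torch.cat([labels, proto_labels], dim=0)    # their class labels
    sim = e @ cand.t() / tau                               # (B, B + K) inner products
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (labels.unsqueeze(1) == cand_lab.unsqueeze(0)).float()
    pos[:, e_aug.size(0):] = 0.0                           # prototypes are only negatives
    n_pos = pos.sum(dim=1).clamp(min=1.0)                  # |P(i)|
    return -((pos * log_prob).sum(dim=1) / n_pos).mean()
```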
Prototype loss: $\mathcal{L}_{\mathrm{pro}}$ measures the precision of the classification results for the prototypes $\hat{\mathcal{P}}^{t-1}$ adjusted by FAN, which is realized by a cross-entropy function as

$$\mathcal{L}_{\mathrm{pro}} = -\frac{1}{|\hat{\mathcal{P}}^{t-1}|} \sum_{\hat{p}_k \in \hat{\mathcal{P}}^{t-1}} \sum_{c=1}^{K} y^{p}_{k,c} \log\big(\mathrm{softmax}\big(g(\hat{p}_k)\big)_c\big),$$

where $y^{p}_{k}$ is the ground-truth class label of the prototype. The prototype loss ensures that the classifier retains discriminability for the adjusted old-class prototypes.
Distillation loss: $\mathcal{L}_{\mathrm{dis}}$ directly computes the L2 distance between the embedding features extracted by the old backbone $e_{\mathrm{old}}$ and the new backbone $e_{\mathrm{new}}$, i.e.,

$$\mathcal{L}_{\mathrm{dis}} = \big\| e_{\mathrm{new}} - e_{\mathrm{old}} \big\|_2^2.$$

The distillation loss enables the new backbone to retain the historical knowledge.
Feature adjustment loss: As an unsupervised loss, $\mathcal{L}_{\mathrm{adj}}$ is designed to optimize FAN such that the prototypes can be mapped to an appropriate place in the new feature space, thus reducing the mismatch between prototypes and features during cross-task incremental learning. Specifically, $\mathcal{L}_{\mathrm{adj}}$ calculates the L2 distance between the adjusted feature $\hat{e}_{\mathrm{old}}$ of the old backbone and the feature $e_{\mathrm{new}}$ of the new backbone, i.e.,

$$\mathcal{L}_{\mathrm{adj}} = \big\| \hat{e}_{\mathrm{old}} - e_{\mathrm{new}} \big\|_2^2.$$
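For completeness, the two unsupervised L2 terms can be written compactly as below; the batch-mean reduction is assumed.

```python
import torch

def l2_feature_losses(e_old, e_new, e_adj):
    """The two unsupervised L2 terms as we read them: the distillation loss
    compares old- and new-backbone features of the same sample, and the feature
    adjustment loss compares the FAN-adjusted old features with the new ones."""
    l_dis = (e_new - e_old).pow(2).sum(dim=1).mean()   # distillation loss
    l_adj = (e_adj - e_new).pow(2).sum(dim=1).mean()   # feature adjustment loss
    return l_dis, l_adj
```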