Article

ParalIMR: Bypassing Shortcut Learning in Incremental Modulation Recognition via Parallel Reconstruction and Feature Decoupling

1
National Key Laboratory of Science and Technology on Communications, University of Electronic Science and Technology of China, Chengdu 610000, China
2
National Key Laboratory of Complex Aviation System Simulation, Chengdu 610036, China
*
Author to whom correspondence should be addressed.
Electronics 2026, 15(13), 2766; https://doi.org/10.3390/electronics15132766 (registering DOI)
Submission received: 6 May 2026 / Revised: 7 June 2026 / Accepted: 8 June 2026 / Published: 23 June 2026

Abstract

Incremental automatic modulation recognition is essential for the awareness of complex electromagnetic environments but is prone to catastrophic forgetting. This forgetting is fundamentally precipitated by shortcut learning, a phenomenon where deep models prioritize stable but non-essential channel artifacts (e.g., noise, fading) over intrinsic modulation characteristics. Consequently, models rely on spurious correlations that collapse during incremental task updates or environmental shifts, leading to representation drift. To bridge this gap, we propose the ParalIMR framework, which integrates a parallel reconstruction architecture with the segment substitution (SS) strategy to decouple modulation signatures from environmental fingerprints. Specifically, the parallel branch utilizes a Denoising AutoEncoder (DAE) as a task-agnostic structural anchor, purifying feature representations and maintaining geometric consistency across varying signal-to-noise ratios without propagating noise-overfitting to the classifier. Meanwhile, the SS strategy actively disrupts the temporal coupling between class labels and hardware fingerprints through random reorganization, forcing the model to extract modulation-invariant structural cues. Experimental results on the RML2016.10a dataset demonstrate that in a three-stage incremental setup, our method achieves an overall accuracy of 84.32% at 0 dB SNR, representing a 2.69% improvement over the iCaRL baseline. Notably, this advantage expands to 4.46% on RML2018, demonstrating that ParalIMR effectively arrests catastrophic forgetting. Ultimately, this research provides a robust learning paradigm tailored for cognitive radio and electronic warfare in dynamic electromagnetic landscapes.

1. Introduction

Automatic Modulation Recognition (AMR) is essential for civilian and military applications, including signal detection, cognitive radio, and electronic countermeasures [1]. A primary objective within this field is to achieve robust AMR in complex and dynamic electromagnetic environments. Although Deep Learning (DL) models such as Convolutional Neural Networks (CNNs) have significantly advanced recognition accuracy [2], their prohibitive computational demands render full retraining cycles unsustainable. This issue is exacerbated by the continuous emergence of novel modulation techniques, driven by needs for enhanced security and spectral efficiency.
Consequently, modern AMR systems require Incremental Learning (IL) capabilities, i.e., the ability to assimilate new knowledge while preserving prior competencies without recourse to the entire historical dataset [3]. While IL paradigms have demonstrated remarkable success in domains like computer vision, their deployment within AMR encounters unique challenges.
In practice, received signals are degraded by noise (e.g., Additive White Gaussian Noise (AWGN)) and channel impairments (e.g., multipath fading, Doppler spread) [4]. As a result, features extracted by neural networks invariably conflate intrinsic modulation structure with extrinsic environmental artifacts. This feature entanglement critically erodes the stability of the feature extractor over time and emerges as a principal catalyst for Catastrophic Forgetting (CF) during the incremental learning stages [5]. This detrimental process often manifests through two interconnected pathways.
The first pathway is denoising overfitting. A common strategy to handle noise involves the use of Denoising AutoEncoders (DAEs) [6,7]. However, the efficacy of DAEs is highly contingent upon the Signal-to-Noise Ratio (SNR) and the order of the modulation. This stems from parameter overspecialization: weights optimized for low-SNR signals induce overfitting when applied to high-SNR inputs. This overspecialization hinders the model's ability to learn structural features of the signal. Instead, it yields a distorted feature representation, which accelerates CF when learning subsequent tasks.
The second, and often more impactful, pathway is shortcut learning [8]. The impairments introduced by the propagation channel possess statistical properties that are more pronounced and easily learnable than the subtler features defining the modulation itself (e.g., constellation geometry). This makes models likely to exploit shortcuts, i.e., spurious correlations between channel effects and class labels, rather than learning an invariant representation of the modulation. A crucial flaw of shortcut features is their inherent instability: variations in transceivers, radio propagation environments, or the interference profile can induce shifts in the distribution of these artifacts. It has been shown that conventional IL trained on static or past data may fail to adapt to these deeply embedded issues, leading to significant performance degradation when deployed in real scenarios [9].
In this paper, we propose ParalIMR, a novel IL approach for AMR, aiming to deal with the catastrophic forgetting caused by overfitting and shortcuts. Unlike prevailing IL models integrating DAE serially, ParalIMR employs a parallel architecture that decouples the noise suppression branch from the classification pathway. This design enhances the model’s capacity to mitigate overfitting to noise-specific patterns. To further bolster feature stability, we incorporate a Segment Substitution (SS) module [10], in which segments of the samples are randomly replaced with other samples drawn from a noise-and-channel-augmented pool. This operation disrupts the spurious correlations between channel/noise artifacts and class labels, compelling the model to focus on intrinsic modulation structures rather than relying on unstable shortcuts. In summary, the main contributions are as follows.
  • We design an auxiliary reconstruction module using a parallel architecture that decouples denoising from the primary classification task. It enables effective denoising across diverse SNR conditions and modulation orders without propagating overfitting errors to the classifier. The performance on the RadioML 2016 and RadioML 2018 datasets [11,12] demonstrates that by leveraging the decoupled parallel structure, our proposed algorithm significantly mitigates catastrophic forgetting, with particularly marked performance gains observed for high-SNR signals.
  • We adapt the segment replacement strategy for incremental learning to alleviate shortcut learning. The rationale of SS is to break spurious correlations between background artifacts and class labels, compelling the model to prioritize modulation-invariant structural cues over environment-dependent shortcuts.
  • We comprehensively evaluate ParalIMR on the RadioML 2016.10a dataset. Results confirm its robust and consistent recognition capability. Comparative experiments against benchmarks reveal its superior performance, while ablation studies validate the positive roles of both the decoupled DAE branch and the adaptive SS module in mitigating catastrophic forgetting and shortcut learning.
The rest of the paper is organized as follows: In Section 2, we briefly introduce the related works. In Section 3, we discuss the problems considered in the paper. In Section 4, we elaborate on the details of the proposed ParalIMR method. In Section 5, we report and analyze the simulation results. Finally, in Section 6, we conclude this work.

2. Related Work

In 2016, O’Shea et al. [11] introduced CNNs to AMR and conducted comparisons with traditional methods such as support vector machines and random forest algorithms. Their results demonstrate that CNNs substantially improve both average accuracy and inference speed. Further experiments showed that increasing the training data tenfold and keeping the model unchanged boosts average accuracy by over 10%.
Since then, the deep learning paradigm has attracted growing attention in the AMR community. Since received wireless signals are typically represented as in-phase (I) and quadrature (Q) components, Krzyston et al. [13] extended the convolution operation to the complex domain and devised a Complex-Valued Convolutional (CVC) neural network. Simulations revealed that the CVC network achieves higher accuracy than real-valued CNNs.
Subsequent work has followed two directions: multi-domain feature fusion and long-range temporal modeling. To overcome single-domain limitations, Lin et al. fed both time- and frequency-domain features into a CNN and proved that this fusion improves accuracy [14]. Tunze et al. incorporated multiple skip connections into a CNN, enhancing structural stability and training efficiency [15]. Emam et al. exploited the temporal correlation of wireless signals by integrating a CNN with Long Short-Term Memory (LSTM) layers, with simulation results confirming the feasibility of this hybrid approach [16].
With the rise of Transformers, attention mechanisms have been increasingly adopted in AMR. Chang et al. [17] proposed a fast multi-loss learning network combining group convolution with the multi-head attention mechanism in Transformer encoders. This enables parallel temporal feature extraction while maintaining a lightweight model. Jang et al. [18] introduced the Meta-Transformer framework, which employs a master-slave Transformer encoder structure and a meta-learning mechanism. It allows adaptation to new modulation categories with only a few samples. Lei et al. showed that complex-valued transformer architectures achieve superior performance in modulation recognition tasks by effectively capturing the inherent phase and amplitude information of complex-valued signals, which is critical for distinguishing different modulation types accurately [19].
Driven by deployment needs on edge devices (e.g., drones, tactical radios), lightweight design has emerged as another major research trend. Tunze et al. showed that sparsely connected CNNs can significantly reduce computational complexity and memory footprint while maintaining competitive modulation recognition accuracy [20]. Ma et al. designed a lightweight ConvTransformer network (SMCTNet) [21], which considerably reduces the parameter count while effectively combining local convolutional features with global attention information. When coupled with semi-supervised metric learning, SMCTNet provided high recognition accuracy even with a limited number of labeled samples per class, thus mitigating overfitting on resource-constrained devices. Moreover, Lyu et al. investigated quantization-aware collaborative inference (co-inference) for resource-limited embodied agents running large AI models [22]. These works provide practical insights for model compression and computation resource management in embodied AI systems.
However, prevailing AMR methods almost universally rely on a strict closed-set assumption, wherein the modulation categories in training and test sets are identical. Confronted with unseen modulation types, existing closed-set models often face a dilemma: full retraining entails substantial computational and storage costs, whereas merely fine-tuning on new data leads to the erosion of previously acquired knowledge [23]. Consequently, equipping deep models with the ability to incrementally learn new modulation categories without requiring access to the entire historical dataset has become a critical bottleneck in current AMR research.
To address the closed-set limitations, researchers have adopted incremental learning frameworks in AMR [24]. Montes et al. conducted a systematic evaluation of fine-tuning and cross-entropy distillation under varying memory budgets, establishing empirical performance bounds for these strategies [25]. Xu et al. [26] integrated a bias compensation layer into the ResNet classifier to alleviate catastrophic forgetting induced by bias drift in the fully connected layer toward newly introduced classes. Fan et al. [27,28] introduced a complex-valued IL adapter that enforces a physical decoupling between baseband feature extraction and classification decision-making. Coupled with a unified loss function, this design enables robust retention of previously learned representations while minimizing replay memory. Deng et al. exploited the intrinsic sparsity of wireless communication signals to dynamically instantiate task-specific sub-models, wherein each sub-model is sparsely activated and dedicated to a distinct modulation class [29]. This architecture efficiently repurposes dormant neurons for new tasks, enhancing parameter efficiency and scalability.
Moreover, Wang et al. proposed PID, a Domain-Incremental Learning (DIL)-driven AMR framework that incorporates a scenario-aware domain controller (SDC) to dynamically project each input sample onto an optimal, domain-specific parameter subspace [30]. This mechanism facilitates rapid adaptation to emerging channel or propagation domains while maintaining high classification accuracy across previously encountered domains, all with negligible parameter expansion. Tan et al. [31] developed PASS-Net, a lightweight architecture explicitly designed to strengthen few-shot class-incremental generalization. Qi et al. [32] further extended IL to distributed settings by integrating federated learning with a class-incremental recognition pipeline, enabling collaborative model evolution under realistic non-independent and identically distributed (non-IID) wireless channel conditions.
Despite these advances, existing IL approaches in AMR remain susceptible to shortcut learning: time-varying channel impairments and hardware variations are frequently mistaken for class-discriminative features. This confounding effect undermines semantic stability, resulting in poor long-term retention of modulation-specific knowledge and degraded backward compatibility.

3. Problem Statement

3.1. Signal Model

In Class-Incremental Learning (CIL), a model is trained sequentially on $T$ ($T > 1$) tasks, with each task introducing a set of new classes. Let the CIL process comprise $T$ sequentially arriving tasks $D_1, D_2, \ldots, D_T$. At the $t$-th ($1 \le t \le T$) task, the CIL model receives the dataset $D_t = \{(x_{1,t}, y_{1,t}), (x_{2,t}, y_{2,t}), \ldots, (x_{N_t,t}, y_{N_t,t})\}$, where the signal $x_{i,t} \in \mathbb{R}^{C}$, $1 \le i \le N_t$, is a training signal from category $y_{i,t} \in Y_t$, $Y_t$ denotes the category set for task $t$, $C$ represents the dimension of the training signal, and $N_t$ is the number of samples. In CIL, the category sets are disjoint, i.e., $Y_{t_1} \cap Y_{t_2} = \emptyset$ for $t_1 \neq t_2$. Any sample $(x_{i,t}, y_{i,t})$ can be modeled as
$$x_{i,t} = h_t * s_{y_{i,t},t} + n_t, \tag{1}$$
where $x_{i,t}$ is the $i$-th signal sample at task $t$, $s_{y_{i,t},t}$ is the signal modulated with category $y_{i,t}$, $h_t$ is the channel impulse response, $n_t$ is additive white Gaussian noise, and $*$ is the convolution operation.
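As an illustration of the signal model in Equation (1), the following minimal sketch generates one received sample as the convolution of a modulated signal with a channel impulse response plus complex AWGN. It is not the authors' data pipeline: the QPSK source, the 3-tap channel `h`, the truncation to the input length, and all function names are assumptions chosen for illustration, and only the Python standard library is used.

```python
import cmath
import math
import random


def qpsk_symbols(num_symbols, rng):
    """Draw unit-energy QPSK symbols as complex numbers (illustrative source)."""
    phases = [math.pi / 4 + k * math.pi / 2 for k in range(4)]
    return [cmath.exp(1j * rng.choice(phases)) for _ in range(num_symbols)]


def convolve(h, s):
    """Full linear convolution: out[n] = sum_k h[k] * s[n - k]."""
    out = [0j] * (len(h) + len(s) - 1)
    for k, hk in enumerate(h):
        for m, sm in enumerate(s):
            out[k + m] += hk * sm
    return out


def received_signal(s, h, snr_db, rng):
    """Return (x, clean) with x = h * s + n, truncated to len(s)."""
    clean = convolve(h, s)[: len(s)]
    signal_power = sum(abs(v) ** 2 for v in clean) / len(clean)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power / 2)  # std of each real/imaginary part
    noise = [complex(rng.gauss(0, sigma), rng.gauss(0, sigma)) for _ in clean]
    return [c + n for c, n in zip(clean, noise)], clean


rng = random.Random(0)
s = qpsk_symbols(128, rng)            # modulated signal for one category
h = [1.0 + 0j, 0.3 - 0.2j, 0.1j]      # hypothetical multipath channel taps
x, clean = received_signal(s, h, snr_db=0, rng=rng)
```

Changing `h` or `snr_db` between tasks mimics the task-dependent channel and noise terms in Equation (1) while the modulation category stays fixed.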
Following the replay strategy, the training set $\hat{D}_t = D_t \cup P_t$ is constructed at the beginning of task $t$, where the exemplar set $P_t \subseteq \bigcup_{k=1}^{t-1} D_k$ stores representative samples from previous tasks. Then, the model takes $\hat{D}_t$ as input to update its parameters. In this paradigm, $D_t$ enables the model to learn knowledge of the categories in $Y_t$, while $P_t$ encourages the model to retain the discriminative information that has been learned, ensuring consistent performance on old categories. Finally, after updating the parameters, the exemplar set is reconstructed to generate $P_{y,t+1}$ for subsequent tasks.
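The replay bookkeeping described above can be sketched as follows. This is a simplified illustration, not the authors' procedure: the function names and the `per_class` budget are hypothetical, and exemplars are chosen uniformly at random because this section does not state the selection rule.

```python
import random


def build_training_set(new_data, exemplars):
    """Return the union of the current task data and the exemplar set."""
    return list(new_data) + list(exemplars)


def update_exemplars(exemplars, new_data, per_class, rng):
    """Keep at most `per_class` (signal, label) pairs per category.

    Random selection is an assumption made for this sketch.
    """
    by_class = {}
    for signal, label in list(exemplars) + list(new_data):
        by_class.setdefault(label, []).append((signal, label))
    next_exemplars = []
    for label in sorted(by_class):
        samples = by_class[label]
        next_exemplars.extend(rng.sample(samples, min(per_class, len(samples))))
    return next_exemplars


rng = random.Random(0)
task1 = [([0.1 * i], "BPSK") for i in range(5)] + [([0.2 * i], "QPSK") for i in range(5)]
task2 = [([0.3 * i], "8PSK") for i in range(5)]

p2 = update_exemplars([], task1, per_class=2, rng=rng)   # exemplars kept after task 1
d_hat_2 = build_training_set(task2, p2)                  # training set for task 2
p3 = update_exemplars(p2, task2, per_class=2, rng=rng)   # exemplars kept after task 2
```

Because the category sets are disjoint across tasks, the exemplar set grows by one group per new category while the per-class budget bounds total memory.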
In accordance with Equation (1), the received signal $x_{i,t}$ results from the convolution between the modulated signal $s_{y_{i,t},t}$ and the propagation environment. Thus, the model must be resilient to channel-induced signal distortions. Furthermore, heterogeneous propagation conditions across tasks impose two additional requirements for a CIL model. First, the CIL model should disambiguate semantically overlapping classes, i.e., distinguish newly introduced categories from previously learned ones. Second, it should be able to extract modulation features that generalize well under varying physical environments. These requirements are equally critical to mitigating the risk of catastrophic forgetting.

3.2. Denoising Problem

To overcome signal distortion, noise-aware preprocessing is essential. Denoising autoencoders (DAEs) are widely adopted for this purpose, leveraging their intrinsic capacity to separate signal structure from additive noise. Conventionally, DAEs are deployed in a tandem architecture: an MSE-based reconstruction module precedes a downstream classifier, where the classifier operates exclusively on the denoised representations.
However, this structure exhibits SNR- and modulation-order-dependent performance degradation. The underlying reason is that under low-SNR conditions, the model tends to activate more parameters to fit random noise, which reduces training loss rapidly but compromises structural fidelity. When the same parameters are then applied to high-SNR signals, overfitting occurs. This phenomenon not only impedes the model's ability to learn essential signal structures, but also causes the characteristics of high-order signals to be obscured by such fitting.
Furthermore, when learning a new task under altered conditions (e.g., changes in transceiver equipment), the noise characteristics (such as variance and mean) may correspondingly change, resulting in changes in channel statistical characteristics. In such scenarios, employing the tandem structure may lead the training process to prioritize learning new noise characteristics, thereby accelerating the forgetting of previously learned features.
Therefore, effectively suppressing noise while preventing overfitting becomes the paramount challenge in developing robust classification systems.

3.3. Shortcut Learning Problem

Shortcut learning is a key factor driving catastrophic forgetting when the sampling environment changes. In deep learning, shortcut learning refers to a phenomenon in which models exploit spurious, task-correlated features rather than the underlying, task-relevant structural attributes. Modulation recognition is inherently susceptible to this issue. Recall from Equation (1) that a received signal is a mixture of the modulated signal and the channel impulse response, which reflects the imperfections of the physical environment and hardware devices. Therefore, channels always introduce a set of task-specific but non-generalizable artifacts that exhibit strong local statistical significance within a single task but inevitably drift across incremental sessions. As a result, recognition models are more likely to learn the channel-specific characteristics, i.e., shortcuts, than the structural features of the modulated signal.
Although shortcut-driven decisions yield high accuracy and rapid convergence, they severely compromise model generalization. Unfortunately, this limitation is exacerbated in the CIL process. Specifically, when a CIL model learns to associate former tasks with idiosyncratic features of transceivers or propagation environments, those associations become embedded in its parameters and are progressively reinforced during subsequent training. However, as new tasks introduce previously unseen devices or channels, the statistical distribution of the shortcut features inevitably shifts. Neither parameter solidification, exemplar replay, nor knowledge distillation can fully preserve or reconstruct the full spectrum of shortcut patterns acquired from prior tasks. As a result, the model loses its capacity to reliably recognize earlier classes, which is a key contributor to catastrophic forgetting.
Therefore, it is desirable to mitigate shortcut learning by deliberately weakening the spurious correlation between channel features and modulation labels to enhance the robustness of the recognition model. To this end, data augmentation is proposed to decouple shortcut features from the signal.
Unlike images, in which the background can be separated pixel by pixel, communication signals and channel environments are coupled through convolution. Accordingly, factors such as channel fading, multipath dispersion, and hardware fingerprints do not merely overlay the signal but induce waveform distortion and nonlinear reconstruction. This inherent nonlinearity renders modulation features highly susceptible to masking or corruption by the shortcut-inducing artifacts. One contribution of this work is therefore to design physically informed, effective data augmentation methods.

4. The Proposed Method

The overall architecture of our incremental learning framework is illustrated in Figure 1. An input sample first undergoes a Channel-Signal Decoupling Augmenter (CSDA) for weakening the shortcuts. The augmented signal is then processed by a shared feature extractor. Subsequently, the extracted features are routed to a Denoise-Robust Classifier (DRC) consisting of two branches, where the denoising branch is built upon a Denoising AutoEncoder (DAE) and the classification branch is responsible for obtaining the classification loss. Additionally, a compact exemplar set of representative samples from previous tasks is maintained and replayed during the training of new tasks. Note that these exemplars are also augmented using the CSDA, ensuring consistent representation learning across incremental tasks and mitigating distributional shift in earlier-class features.
This section is structured as follows: Section 4.1 presents the data augmentation technique used in the CSDA module. Then, Section 4.2 details the principles and implementation of the dual-branch structure. Finally, Section 4.3 describes the unified total loss function integrating reconstruction, classification, and knowledge distillation objectives.

4.1. Segment Substitution-Based Feature Decoupling

To bypass the shortcuts, the CSDA module employs the Segment Substitution (SS) strategy [10] in CIL as a key data augmentation mechanism. SS operates by randomly segmenting, shuffling, and replacing time-domain signal segments, which decouples the temporal correlations between modulation structure and channel-specific artifacts such as hardware fingerprints or channel distortions. Training on such substituted samples reduces the model's reliance on non-generalizable cues tied to particular acquisition conditions. Consequently, the downstream feature extractor is encouraged to learn modulation-invariant representations, which enhances generalization across tasks and alleviates catastrophic forgetting during incremental updates.
In this paper, the implementation of segment substitution is as follows. Define the original input signal as $x \in \mathbb{R}^{N \times 2}$, where $N$ is the signal length. The enhanced signal $\tilde{x}$ can be expressed as a combination of an original signal $x_{org}$ and a substitution signal $x_{ss}$. There are two different replacement methods.
  • Discrete Segment Substitution: Let us randomly select discontinuous indices $n_i \in [1, C]$, $1 \le i \le L$, and construct a set $I_{org} = \{n_1, \ldots, n_L\}$. The discrete-replaced signal $\tilde{x}_{dsc}[n]$ is then given by
    $$\tilde{x}_{dsc}[n] = \begin{cases} x_{ss}[n], & \text{if } n \in I_{org} \\ x_{org}[n], & \text{otherwise.} \end{cases}$$
  • Continuous Segment Substitution: A segment of length $L$ is randomly selected in signal $x_{org}$ and replaced with a continuous segment from $x_{ss}$. Let $u$ and $v$, $0 \le u, v \le N - L$, be the random starting indices in $x_{org}$ and $x_{ss}$, respectively. The continuous substitution can be expressed as follows.
    $$\tilde{x}_{cts}[n] = \begin{cases} x_{ss}[v + n - u], & \text{if } n \in [u, u + L) \\ x_{org}[n], & \text{otherwise.} \end{cases}$$
For the sake of effectiveness and robustness, a substitute signal $x_{ss}$ should meet the following two conditions. Let set $D_{t,ss}$ consist of all samples satisfying the conditions for task $t$.
  • The signals $x_{org}$ and $x_{ss}$ should belong to the same modulation category $y$. The application of substitution thus weakens the channel characteristics without compromising the structural features of the signal.
  • The SNR of $x_{ss}$ should be less than or equal to that of $x_{org}$. Incorporating sample points with lower SNR can force the model to extract features under more severe noise conditions, thereby enhancing its noise resistance.
It has been demonstrated that both substitution strategies can enhance the model’s representation of the signal constellation diagram. In Section 5.5, it will be further shown that using segment substitution can enhance the robustness of AMR in CIL tasks.
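The two substitution rules and the two eligibility conditions can be sketched as below. This is an illustrative reading of the description, not the authors' implementation: signals are lists of (I, Q) tuples, indices are 0-based, the pool is a hypothetical list of `(signal, label, snr_db)` triples, and the discrete variant draws indices uniformly without enforcing that they are non-adjacent.

```python
import random


def eligible_substitutes(x_label, x_snr, pool):
    """Keep pool signals with the same label and an SNR no higher than x_org's."""
    return [sig for sig, label, snr in pool if label == x_label and snr <= x_snr]


def discrete_substitution(x_org, x_ss, num_points, rng):
    """Replace `num_points` randomly chosen sample positions with x_ss values."""
    indices = set(rng.sample(range(len(x_org)), num_points))
    out = [x_ss[n] if n in indices else x_org[n] for n in range(len(x_org))]
    return out, indices


def continuous_substitution(x_org, x_ss, seg_len, rng):
    """Replace x_org[u:u+L] with x_ss[v:v+L] for random start indices u and v."""
    u = rng.randint(0, len(x_org) - seg_len)  # randint includes both endpoints
    v = rng.randint(0, len(x_ss) - seg_len)
    out = list(x_org)
    out[u:u + seg_len] = x_ss[v:v + seg_len]
    return out, u, v


rng = random.Random(0)
x_org = [(float(n), 0.0) for n in range(16)]        # toy original signal
x_ss = [(-float(n) - 1.0, 1.0) for n in range(16)]  # toy substitute signal
pool = [(x_ss, "QPSK", -2), (x_ss, "QPSK", 6), (x_ss, "BPSK", -2)]

candidates = eligible_substitutes("QPSK", 0, pool)  # same class, SNR <= 0 dB
x_dsc, idx = discrete_substitution(x_org, candidates[0], num_points=4, rng=rng)
x_cts, u, v = continuous_substitution(x_org, candidates[0], seg_len=5, rng=rng)
```

Filtering the pool before substitution keeps the label of the augmented sample valid, since every inserted point comes from the same modulation category.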

4.2. Denoise-Robust Classifying

Although segment replacement effectively mitigates feature interference induced by the channel response function, it offers no suppression capability against additive noise. Consequently, explicit denoising of the input signal remains indispensable. To guarantee the consistency of the denoising performance across diverse SNRs and modulation orders, we propose a parallel architecture with a shared feature encoder. As illustrated in Figure 1, DRC comprises three tightly coupled components:
  • Shared Encoder: It maps the raw I/Q time-series signal into a high-dimensional latent feature space $Z_3$, wherein semantic abstractions are progressively distilled through hierarchical representation learning;
  • Modulation Classifier: It uses $Z_3$ to perform fine-grained discrimination among modulation categories;
  • Reconstruction Decoder: It implements an invertible mapping from $Z_3$ back to the original signal, which provides waveform-level signal recovery.
In this parallel design, denoising is not treated as a sequential preprocessing step. Instead, it acts as an explicit regularization within the training objective. This creates a dual-constraint mechanism on the output of the shared encoder, where learned features must simultaneously satisfy two criteria: high discriminability, enforced by the classification loss, and high fidelity, enforced by the reconstruction loss. Through this joint optimization, the encoder is forced to disentangle modulation-invariant physical structure such as constellation geometry and transient envelope from noise-corrupted artifacts. Thus, this approach promotes robust and physically grounded feature representations.
Furthermore, in CIL, classification boundaries shift continually as new modulation classes arrive. This parallel design ensures that any overfitting in the denoising process does not propagate to the classifier, since the two branches share only the encoder and have no additional task-specific weight sharing in their core modules. In other words, the reconstruction goal remains unchanged regardless of class identity or label semantics. As a result, the reconstruction branch serves as a stable structural anchor, in which the geometric consistency and semantic coherence of the latent feature space are maintained at the architectural level.
The shared encoder constitutes the core feature extraction engine. Unlike ResNet architectures designed for two-dimensional (2D) image data in computer vision, radio modulation signals are one-dimensional (1D) dual-channel time series (I and Q components). Accordingly, we adapt ResNet to the signal domain with the following three modifications:
  • 1D Convolution: All 2D convolution kernels are replaced with 1D kernels. This change eliminates spatial redundancy and allows precise modeling of temporal dynamics, such as instantaneous phase shifts, frequency sweeps, and amplitude modulations, via sliding-window operations aligned with the signal's intrinsic time axis.
  • Residual Networks: To mitigate gradient degradation in deep stacks, we integrate residual connections. For an input $x$, the layer learns a residual mapping $F(x)$, yielding output $y = F(x) + x$. This identity mapping ensures unimpeded gradient propagation to early layers, sustaining stable and efficient parameter updates throughout prolonged incremental training.
  • Hierarchical multi-scale encoding: As detailed in Figure 2, the encoder comprises three stages. The number of channels doubles per stage ($b \to 2b \to 4b$) to enrich semantic expressivity, while temporal resolution is halved via stride-2 convolutions ($k \to 2k \to 4k$), progressively compressing redundant temporal sampling. The resulting latent tensor $Z \in \mathbb{R}^{batch \times 4k \times 4b}$ thus balances expressive power and compactness, retaining discriminative high-level semantics while discarding sample-level noise and aliasing.
The shared encoder maps raw I/Q signals into a high-dimensional latent feature space $Z$. To invert this mapping and reconstruct the original signal, we design a lightweight generative decoder. The decoder uses cascaded 1D transposed convolution layers for feature upsampling and reconstruction. Specifically, there are two upsampling blocks in the decoder. Each block uses a transposed convolution to double the temporal length of the feature maps. As shown in Figure 3, the first block restores the feature size from $k$ to $2k$, and the second block doubles it again to $4k$. Meanwhile, $1 \times 1$ convolutions progressively reduce the channel dimension to match the original dual-channel (I/Q) structure. Since preprocessed I/Q signals are normalized to $[-1, 1]$, a Tanh activation is placed at the decoder's output. Compared with ReLU or Sigmoid, the output range of Tanh aligns better with the signal's physical amplitude limits, which helps avoid bias in reconstructed waveforms and speeds up convergence during training.
To mitigate the loss of high-frequency information from repeated downsampling, we introduce skip connections between the encoder and the decoder. Unlike the channel-wise concatenation of U-Net, which increases memory overhead, we propose a lightweight weighted residual fusion, where features from Stage-2 of the encoder, denoted as $F_{\mathrm{enc}}^{(2)}$, are combined with the corresponding decoder feature $F_{\mathrm{dec}}^{(2)}$ by an element-wise addition, i.e.,
$$F_{\mathrm{fused}} = F_{\mathrm{dec}}^{(2)} + \lambda F_{\mathrm{enc}}^{(2)},$$
where λ is a learnable balancing coefficient.
This design benefits from the following two facts. First, the fine-grained temporal structure from shallow layers is directly injected into the decoding path, resulting in a more precise reconstruction of the signals. Second, gradients from the reconstruction loss can flow through the skip connection to the intermediate encoder layer during backpropagation, imposing stronger regularization on the feature extractor.
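The second point can be made explicit with a short chain-rule step. This is a sketch that assumes the fusion is the element-wise sum given above and writes $L^{\mathrm{rec}}$ for the reconstruction loss; the subscript "skip" marks the contribution of the skip path only, and $\odot$ denotes the element-wise product.

```latex
% Gradient reaching the Stage-2 encoder feature through the skip path only;
% the path through the deeper encoder stages and F_dec^{(2)} adds a further term.
\left.\frac{\partial L^{\mathrm{rec}}}{\partial F_{\mathrm{enc}}^{(2)}}\right|_{\mathrm{skip}}
  = \lambda \, \frac{\partial L^{\mathrm{rec}}}{\partial F_{\mathrm{fused}}},
\qquad
\frac{\partial L^{\mathrm{rec}}}{\partial \lambda}
  = \sum_{\text{elements}} \left( \frac{\partial L^{\mathrm{rec}}}{\partial F_{\mathrm{fused}}}
    \odot F_{\mathrm{enc}}^{(2)} \right).
```

Under this reading, $\lambda$ scales how strongly the reconstruction objective regularizes the intermediate encoder layer, and $\lambda$ itself is updated according to how well the encoder feature aligns with the reconstruction gradient.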

4.3. Training for Incremental Tasks

To mitigate catastrophic forgetting, our algorithm incorporates the Knowledge Distillation (KD) mechanism. When the model enters the $t$-th ($t > 0$) task, an augmented training set $\hat{D}_t = D_t \cup P_t$ is constructed, which consists of the new-category dataset $D_t$ for the current task and a minimal set $P_t$ of preserved exemplars from old tasks. Then, during the training of task $t$, we freeze the previous-stage model $\theta_{t-1}$ as a teacher model, leveraging its soft-label outputs generated from $\hat{D}_t$ to impose feature-level constraints on the current model.
However, the probability distribution output by the Softmax function tends to be overly extreme (with the target category probability approaching 1 and the others approaching 0), which obscures subtle similarity information between different categories. To extract and convey these softened structural relationships, this paper introduces a temperature coefficient $\tau$ to smooth the raw scores (logits) output by the network.
Let z k , o l d denote the original logical score output by the teacher model for the k-th old category, and z k , p r e d represent the original logical score generated by the current model for the same category. The softened probability distributions after scaling by the temperature coefficient τ are defined as:
p k , o l d = exp ( z k , o l d / τ ) j = 1 | Y t 1 | exp ( z j , o l d / τ )
p k , p r e d = exp ( z k , p r e d / τ ) j = 1 | Y t 1 | exp ( z j , p r e d / τ )
Based on these distributions, the knowledge distillation loss L K D is constructed by calculating the cross-entropy between the two softened probability distributions.
L t K D = D i s t i l l a t i o n p p r e d , p o l d , τ = k = 1 | Y t 1 | p k , o l d log ( p k , p r e d )
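The temperature-scaled Softmax and the distillation loss can be written out directly. The following is a minimal sketch in plain Python; the function names and the toy logits are illustrative assumptions, and only the logits of the old categories are passed in.

```python
import math


def softened_softmax(logits, tau):
    """Softmax of logits / tau, computed stably by subtracting the maximum."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, tau):
    """Cross-entropy between softened teacher and student distributions."""
    p_old = softened_softmax(teacher_logits, tau)
    p_pred = softened_softmax(student_logits, tau)
    return -sum(po * math.log(pp) for po, pp in zip(p_old, p_pred))


# Toy logits over three old categories (illustrative values only).
teacher = [4.0, 1.0, 0.5]
student = [3.0, 1.5, 0.2]

sharp = softened_softmax(teacher, tau=1.0)  # peaked distribution
soft = softened_softmax(teacher, tau=2.0)   # smoother distribution
loss = distillation_loss(student, teacher, tau=2.0)
```

Raising the temperature lowers the probability of the dominant category and spreads mass over the others, which is the inter-class similarity information that the teacher passes to the current model. The loss is smallest when the student reproduces the teacher's softened distribution, in which case it equals the entropy of that distribution.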
During training, each batch of training samples from $\hat{D}_t$ is fed simultaneously into both the current model $\theta_t$ and the teacher model $\theta_{t-1}$. The total loss function comprises three components: the classification loss $\mathcal{L}_t^{\mathrm{cls}}$, the reconstruction loss $\mathcal{L}_t^{\mathrm{rec}}$, and the distillation loss $\mathcal{L}_t^{\mathrm{KD}}$.
The classification loss is produced by the main classification branch, which is trained to maximize the feature discrimination between modulation categories. We employ the standard cross-entropy loss,
$$\mathcal{L}_t^{\mathrm{cls}} = \mathrm{CrossEntropy}(\tilde{D}_t, D_t) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{|Y_t|} h_{i,c} \log P(h_{i,c} \mid x_i),$$
where $\tilde{D}_t$ is the output of the classifier, $N$ denotes the batch size, $h_{i,c} \in \{0, 1\}$ is the one-hot encoded ground-truth label for signal $x_i$, and $P(h_{i,c} \mid x_i)$ is the classifier's predicted probability of class $y_c$ given input $x_i$. This term explicitly guides the encoder toward semantically discriminative feature representations.
For the reconstruction branch, we formulate feature decoupling and signal purification as a denoising structure-learning objective. Input signals $x^{\mathrm{aug}}$ are generated by applying the segment substitution (SS) strategy, which introduces controlled structural perturbations, such as random segment shuffling or masking, to the original signals. Crucially, the reconstruction target is always the corresponding reference signal $x^{\mathrm{ref}}$ from a predefined set $X^{\mathrm{ref}}$, regardless of the category of the input. The reconstruction loss is computed using the Mean Squared Error (MSE):
$$\mathcal{L}_t^{\mathrm{rec}} = \mathrm{MSE}(X_t^{\mathrm{rec}}, X_t^{\mathrm{ref}}) = \frac{1}{NC} \sum_{i=1}^{N} \sum_{m=1}^{C} \left\| x_{i,m}^{\mathrm{rec}} - x_{i,m}^{\mathrm{ref}} \right\|_2^2,$$
where $C$ is the signal length, and $x_{i,m}^{\mathrm{rec}}$ and $x_{i,m}^{\mathrm{ref}}$ denote the $m$-th points of the reconstructed and reference signals for the $i$-th sample, respectively. This asymmetric supervision, in which a noisy or perturbed input is paired with a clean target, induces the model to disentangle modulation-invariant signal structure from task-irrelevant artifacts (e.g., hardware distortions or channel effects), thereby promoting robust and generalizable representations.
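A worked instance of the reconstruction loss may help fix the normalization. The sketch below assumes that each signal point is an (I, Q) pair, so that the squared norm sums the I and Q errors, and that a batch is stored as an `[N][C]` list of such pairs; this data layout is an assumption made for illustration.

```python
def reconstruction_loss(x_rec, x_ref):
    """MSE between reconstructed and reference batches of (I, Q) sequences.

    x_rec and x_ref are lists of N signals, each a list of C (I, Q) tuples.
    The summed squared error is divided by N * C.
    """
    n = len(x_rec)
    c = len(x_rec[0])
    total = 0.0
    for rec_signal, ref_signal in zip(x_rec, x_ref):
        for (rec_i, rec_q), (ref_i, ref_q) in zip(rec_signal, ref_signal):
            total += (rec_i - ref_i) ** 2 + (rec_q - ref_q) ** 2
    return total / (n * c)


# One signal (N = 1) with two points (C = 2); values are illustrative only.
x_ref = [[(1.0, 0.0), (0.0, 1.0)]]
x_rec = [[(0.5, 0.0), (0.0, 0.0)]]
loss = reconstruction_loss(x_rec, x_ref)  # (0.25 + 1.0) / (1 * 2)
```

Note that the target passed as `x_ref` is the clean reference signal rather than the perturbed input, which is what makes the supervision asymmetric.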
The reconstruction loss $\mathcal{L}_t^{\mathrm{rec}}$ acts as a geometric consistency constraint, anchoring structural features to the physical waveform manifold without injecting noise-overfitting bias into the categorical discrimination space. As a result, the proposed parallel architecture can decouple the gradient flow between the denoising and classification paths. Unlike tandem structures, where denoising bias propagates to the classifier, ParalIMR ensures that the classifier optimizes its decision boundaries strictly on the latent features.
Note that the predefined set $X^{\mathrm{ref}}$ consists of high-SNR samples. This is because AMR mainly relies on the structure of a signal that is determined by its modulation type. Even if the symbol sequences of the references and the inputs differ, the structural characteristics that define the modulation type (e.g., constellation shape, frequency variation) are preserved. By using these samples as reconstruction targets, the model is guided to learn the core modulation characteristics rather than the spurious noise patterns prevalent in low-SNR scenarios. The MSE loss on the reconstructed signal thus forces the model to capture these invariant structural features rather than symbol-specific details. Consequently, reconstruction toward high-SNR signals makes the learned features robust to noise variations, fostering modulation invariance across SNR levels. This connection between reconstruction quality and invariance is critical: as the model becomes proficient at reconstructing clean, ideal signals, it develops a generalized representation that is less sensitive to environmental shifts.
Finally, the overall optimization objective is defined as a weighted sum,
$$\mathcal{L}_t^{\mathrm{total}} = \mathcal{L}_t^{\mathrm{cls}} + \alpha(e)\, \mathcal{L}_t^{\mathrm{rec}} + \lambda_{\mathrm{KD}}\, \mathcal{L}_t^{\mathrm{KD}},$$
where $\alpha(e)$ is a weight that balances representational stability and task-specific plasticity, with its value determined by the epoch count $e$, and $\lambda_{\mathrm{KD}}$ is the knowledge distillation weight.
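The weighted sum can be sketched as follows. The form of the epoch-dependent weight is not specified in this section, so the linear decay in `alpha_schedule` is purely a hypothetical choice made for illustration, as are the function names and the default distillation weight of 0.5.

```python
def alpha_schedule(epoch, total_epochs=80, alpha_max=1.0):
    """Hypothetical schedule: linear decay from alpha_max to 0 over training.

    The paper only states that the weight depends on the epoch count.
    """
    fraction = min(max(epoch / total_epochs, 0.0), 1.0)
    return alpha_max * (1.0 - fraction)


def total_loss(l_cls, l_rec, l_kd, epoch, lambda_kd=0.5, is_first_task=False):
    """Weighted sum of classification, reconstruction, and distillation losses.

    The distillation term is dropped for the first task, which has no teacher.
    """
    kd_term = 0.0 if is_first_task else lambda_kd * l_kd
    return l_cls + alpha_schedule(epoch) * l_rec + kd_term


# Illustrative loss values at the start and the middle of training.
start = total_loss(1.0, 0.4, 0.2, epoch=0)
middle = total_loss(1.0, 0.4, 0.2, epoch=40)
first = total_loss(1.0, 0.4, 0.2, epoch=0, is_first_task=True)
```

Any other monotone or piecewise schedule could be substituted for `alpha_schedule` without changing the structure of the objective.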
Minimizing $\mathcal{L}_t^{\mathrm{total}}$ jointly enforces two desiderata: (i) discriminative capability for new modulation classes, and (ii) fidelity to intrinsic signal structure across varying acquisition conditions. As a result, the encoder learns features that are simultaneously category-separable and physically consistent, effectively mitigating catastrophic forgetting in class-incremental modulation recognition. The complete incremental learning algorithm integrating the SS strategy and parallel optimization is summarized in Algorithm 1. Note that $\hat{D}_t = D_t$ and $\theta_{t-1} = \Phi$ (i.e., there are no exemplars and no teacher model) if $t = 1$.
Algorithm 1: The proposed Incremental Learning for AMR

5. Experiment and Discussion

5.1. Experimental Setup

Dataset: The experiments are conducted on two widely adopted open-source datasets, RadioML2016a and RadioML2018.
1. The RML2016a dataset includes 11 modulation types, i.e., eight digital (BPSK, QPSK, 8PSK, 16QAM, 64QAM, PAM4, CPFSK, GFSK) and three analog (AM-DSB, AM-SSB, WBFM), where the SNR ranges from −20 dB to +18 dB in 2-dB steps. Each complex-valued I/Q sample is represented as a $2 \times 128$ real-valued matrix, where the first row encodes the in-phase (I) component and the second row the quadrature (Q) component. For every SNR level, there are exactly 1000 samples per modulation type. In total, the dataset contains 220,000 labeled samples (11 classes × 20 SNR levels × 1000 samples).
The initial stage contains 5 modulation classes, followed by two incremental tasks with 3 new classes each. Specifically, the modulation types are categorized as follows: Stage 0 (Classes 0–4): CPFSK, WBFM, AM-DSB, GFSK, 8PSK; Stage 1 (Classes 5–7): PAM4, BPSK, AM-SSB; Stage 2 (Classes 8–10): QAM16, QPSK, QAM64.
2. The RML2018 dataset includes 24 modulation types: OOK, 4ASK, 8ASK, BPSK, QPSK, 8PSK, 16PSK, 32PSK, 16APSK, 32APSK, 64APSK, 128APSK, 16QAM, 32QAM, 64QAM, 128QAM, 256QAM, AM-SSB-WC, AM-SSB-SC, AM-DSB-WC, AM-DSB-SC, FM, GMSK, and OQPSK, where the SNR ranges from −20 dB to +30 dB in 2-dB steps. Each complex-valued I/Q sample is represented as a $2 \times 1024$ real-valued matrix, where the first row encodes the in-phase (I) component and the second row the quadrature (Q) component. For every SNR level, there are exactly 4096 samples per modulation type. In total, the dataset contains 2,555,904 labeled samples (24 classes × 26 SNR levels × 4096 samples).
The initial stage contains 12 modulation classes, followed by two incremental tasks with 6 new classes each. Specifically, the modulation types are categorized as follows: Stage 0 (Classes 0–11): 4ASK, QPSK, 64QAM, GMSK, 8ASK, 32QAM, 128APSK, BPSK, 16QAM, AM-SSB-WC, AM-DSB-SC and 16PSK; Stage 1 (Classes 12–17): AM-DSB-WC, 32PSK, 128QAM, 256QAM, FM and OQPSK; Stage 2 (Classes 18–23): 32APSK, AM-SSB-SC, 8PSK, 16APSK, 64APSK and OOK.
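The dataset sizes and class splits can be verified with a few lines of arithmetic. In the sketch below, the helper names are illustrative; the SNR grid of RML2016a (−20 dB to +18 dB in 2-dB steps) yields 20 levels, and the 26 levels of RML2018 are taken as stated.

```python
def snr_levels(start_db, stop_db, step_db=2):
    """Inclusive list of SNR values from start_db to stop_db."""
    return list(range(start_db, stop_db + 1, step_db))


def dataset_size(num_classes, num_snr_levels, samples_per_pair):
    """Total samples = classes x SNR levels x samples per (class, SNR) pair."""
    return num_classes * num_snr_levels * samples_per_pair


rml2016a_levels = snr_levels(-20, 18)
rml2016a_total = dataset_size(11, len(rml2016a_levels), 1000)
rml2018_total = dataset_size(24, 26, 4096)

# Number of new classes introduced at each of the three stages.
stage_sizes = {"RML2016a": [5, 3, 3], "RML2018": [12, 6, 6]}
class_totals = {name: sum(sizes) for name, sizes in stage_sizes.items()}
```

The stage sizes sum to 11 and 24 classes, so the last class index is 10 for RML2016a and 23 for RML2018 when counting from 0.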
To evaluate the incremental learning performance, the number of exemplars retained for each old class is 480. For instance, the size of the exemplar set is $|P_2| = 480 \times 6 = 2880$ for RML2016a at Task 2. The exemplars are selected in the same way as in iCaRL.
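iCaRL selects exemplars by herding: samples are added one at a time so that the mean of the selected feature vectors stays as close as possible to the class mean. The sketch below illustrates this rule in plain Python on toy one-dimensional features; in practice the features would come from the network's feature extractor, and the function name is an assumption.

```python
def herding_select(features, m):
    """Return indices of m exemplars chosen by herding.

    At step k, the sample is picked that brings the mean of the k selected
    feature vectors closest to the mean of all features of the class.
    """
    n = len(features)
    dim = len(features[0])
    class_mean = [sum(f[d] for f in features) / n for d in range(dim)]
    selected = []
    running_sum = [0.0] * dim
    for k in range(1, min(m, n) + 1):
        best_idx, best_dist = None, float("inf")
        for i, feat in enumerate(features):
            if i in selected:
                continue
            dist = sum(
                (class_mean[d] - (running_sum[d] + feat[d]) / k) ** 2
                for d in range(dim)
            )
            if dist < best_dist:
                best_idx, best_dist = i, dist
        selected.append(best_idx)
        running_sum = [running_sum[d] + features[best_idx][d] for d in range(dim)]
    return selected


# Toy class with five 1-D features; the class mean is 3.2.
features = [[0.0], [1.0], [2.0], [3.0], [10.0]]
exemplars = herding_select(features, m=3)
```

The first exemplar is the sample nearest to the class mean, and each later pick compensates for the drift of the running mean, so the order of the returned indices reflects exemplar priority.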
Training: All experiments were conducted on a PyTorch-based deep learning platform with the following hardware specifications: an NVIDIA GeForce RTX 4070 Laptop GPU (8 GB) running on the Windows 11 operating system. This computational setup ensures consistent training and inference performance for validating the proposed incremental learning approach.
The layer-wise parameter configuration of the parallel reconstruction network is detailed in Table 1 and Table 2. All nonlinear activation functions within the network are ReLU. A Dropout layer with a dropout rate of 0.2 is inserted before the fully connected layer of the classifier, while Batch Normalization (BN) is applied subsequent to all convolutional layers to stabilize training dynamics.
For model optimization, the Adam/SGD optimizer is utilized for network parameter updates. Each incremental stage is trained for 80 epochs with a batch size of 128. The initial learning rate is set to 0.1, which is decayed to 0.01 after 50 epochs. The replacement ratio is set to 1/64, the balance coefficient λ is configured as 0.5, and the knowledge distillation temperature coefficient τ is set to 2.0.
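The learning-rate schedule described above amounts to a single step decay. A minimal sketch, assuming zero-based epoch indices so that the first 50 epochs use the initial rate:

```python
def learning_rate(epoch, base_lr=0.1, decayed_lr=0.01, decay_epoch=50):
    """Step schedule: base_lr for epochs [0, decay_epoch), decayed_lr afterwards."""
    return base_lr if epoch < decay_epoch else decayed_lr


# One incremental stage of 80 epochs.
schedule = [learning_rate(epoch) for epoch in range(80)]
```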
Evaluation: Two evaluation metrics are employed to quantify model performance: class-specific accuracy across incremental tasks and overall accuracy. The class-specific accuracy for stage $t$ ($0 < t \le 3$) at SNR $\rho$ is defined as the ratio of the number of correctly classified samples, $N_{\rho,t}^{\mathrm{correct}}$, to the total number of samples, $N_{\rho,t}$, that belong to the classes newly introduced in stage $t$:
$$\mathrm{Acc}(\rho, t) = \frac{N_{\rho,t}^{\mathrm{correct}}}{N_{\rho,t}} \times 100\%.$$
The overall accuracy over all classes at a specific SNR is defined as follows. After the model completes training in the $t$-th stage, let $T_{\rho}$ be the total number of test samples with SNR $\rho$ from all classes, and let $T_{\rho}^{\mathrm{correct}}$ be the number of these samples that the model predicts correctly. The recognition accuracy for all classes that have appeared up to the $t$-th stage at this SNR is then
$$\mathrm{Acc}(\rho) = \frac{T_{\rho}^{\mathrm{correct}}}{T_{\rho}} \times 100\%.$$
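Both metrics reduce to simple counting once the test samples at a given SNR have been collected. The following sketch assumes that `y_true` and `y_pred` already contain only the samples at one SNR; the function names and the toy labels are illustrative.

```python
def stage_accuracy(y_true, y_pred, stage_classes):
    """Accuracy (%) on samples whose true label was introduced in one stage."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t in stage_classes]
    if not pairs:
        raise ValueError("no test samples for the requested stage")
    correct = sum(1 for t, p in pairs if t == p)
    return 100.0 * correct / len(pairs)


def overall_accuracy(y_true, y_pred):
    """Accuracy (%) over all classes seen so far."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)


# Toy labels at a single SNR: classes 0-4 are old, classes 5-7 are new.
y_true = [0, 1, 5, 6, 5, 0]
y_pred = [0, 1, 5, 6, 0, 0]

old_acc = stage_accuracy(y_true, y_pred, stage_classes={0, 1, 2, 3, 4})
new_acc = stage_accuracy(y_true, y_pred, stage_classes={5, 6, 7})
all_acc = overall_accuracy(y_true, y_pred)
```

In this toy case one new-class sample is misclassified as an old class, which lowers the stage accuracy of the new classes and the overall accuracy but leaves the old-class accuracy at 100%.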
Note that the primary goal of this paper is to validate the effectiveness of ParalIMR, and thus implementation issues have not been considered. However, we recognize that the deployment of Automatic Modulation Recognition (AMR) in wireless communication systems and edge devices is of considerable importance. In future research, we will investigate the feasibility of ParalIMR with respect to the deployment-related metrics, e.g., complexity, inference latency, and memory overhead.

5.2. Parameter Analysis

We analyzed the parameters of the proposed ParalIMR method in this subsection. We varied the replacement ratio $p \in \{1/4, 1/8, 1/16, 1/32, 1/64\}$, the temperature $\tau \in \{2, 4, 6, 10\}$, and the penalty factor $\lambda \in [0.001, 1]$ used in Algorithm 1, and evaluated the performance of ParalIMR. The detailed results are as follows.
First, we varied the replacement ratio $p$ while fixing the other parameters. The results are shown in Figure 4. The optimal $p$ for both Stage 2 and Stage 3 is $1/32$. When one-quarter of the signal is replaced, recognition performance declines to its lowest level. This occurs because the substituted segments carry symbol sequences that differ from the original ones; such replacement introduces the symbol distribution of the substituted data and thereby dilutes or disrupts the structural features of the original signal. On the other hand, as shown in the figure, performance under replacement ratios of 1/32 and 1/16 is comparable overall; under low-SNR conditions, however, the 1/16 ratio performs significantly better than 1/32. This indicates that when noise features overwhelm signal features, segment substitution effectively mitigates their impact on the model, thereby alleviating the problem of shortcut learning.
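To make the role of the replacement ratio concrete, the sketch below replaces one contiguous segment of a signal with the same span taken from a donor signal. This is only one plausible reading of segment substitution: the choice of a single contiguous segment, the way the donor is chosen, and the function name are all assumptions made for illustration.

```python
import random


def segment_substitution(signal, donor, ratio, rng=None):
    """Replace one contiguous segment of `signal` with the same span of `donor`.

    `ratio` is the fraction of the signal length that is substituted.
    Returns the augmented signal, the segment start, and the segment length.
    """
    rng = rng or random.Random()
    length = len(signal)
    if len(donor) != length:
        raise ValueError("signal and donor must have the same length")
    seg_len = max(1, int(length * ratio))
    start = rng.randrange(0, length - seg_len + 1)
    augmented = list(signal)
    augmented[start:start + seg_len] = donor[start:start + seg_len]
    return augmented, start, seg_len


# Toy signals of 128 (I, Q) points; the values are placeholders.
signal = [(0.0, 0.0)] * 128
donor = [(1.0, 1.0)] * 128
augmented, start, seg_len = segment_substitution(
    signal, donor, ratio=1 / 32, rng=random.Random(0)
)
```

For a 128-point frame, a ratio of 1/32 substitutes 4 points while a ratio of 1/4 substitutes 32, which gives a sense of how strongly the largest ratio perturbs the symbol structure of the original signal.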
Next, we focus on analyzing the impact of distillation temperature on the performance of ParalIMR, with detailed results presented in Figure 5. The figure visualizes the model’s performance across a range of distillation temperatures τ from 2 to 10. As observed, when the temperature rises, a gradual decrease in performance is noted, suggesting that high temperatures may dilute the knowledge transferred from the teacher model to the student model.
Figure 4 and Figure 5 confirm that selecting optimal parameters is critical for maximizing the efficiency and generalization ability of ParalIMR, laying a foundation for fine-tuning the model in subsequent phases.

5.3. Comparative Performance

This section conducts a series of comparative experiments against four representative incremental learning (IL) paradigms. Finetune [23] is a naive baseline that updates the model on new tasks without any forgetting mitigation, serving as a reference to demonstrate the severity of catastrophic forgetting. LwF [33] is a classic knowledge distillation-based method that does not rely on replay, allowing us to contrast the effectiveness of replay-based strategies (like ParalIMR) against non-replay approaches. iCaRL [23] is a well-established replay-based method, making it a direct competitor to assess the improvements of our framework in terms of forgetting mitigation and classification accuracy. Upper [10], as the upper bound (joint training of all classes simultaneously), provides a benchmark to measure how close ParalIMR’s performance is to the optimal scenario without incremental constraints. Moreover, only results for SNRs ≥−4 dB are reported here. This is because at SNRs below −4 dB, the input signals are severely corrupted by noise, making the modulation characteristics almost indistinguishable even for well-trained models. In such extremely low-SNR scenarios, the performance of all compared methods (including baselines and our ParalIMR) drops to near-random levels, which fails to provide meaningful insights into the relative effectiveness of different frameworks.
As illustrated in Figure 6a, the overall accuracy trajectories across varying SNRs in Task 2 substantiate the superiority of ParalIMR. While the Finetune model suffers a total collapse of prior knowledge mappings (0% accuracy for legacy classes, as evidenced in Figures 8 and 9), LwF demonstrates a partial mitigation of forgetting but remains significantly inferior to replay-based approaches.
Notably, ParalIMR establishes a substantial performance leap under low-SNR conditions. For instance, ParalIMR achieves an accuracy of 86.09% at −4 dB SNR, outperforming iCaRL’s 78.59%. By forcing the model to learn modulation-invariant structural cues rather than environment-dependent artifacts, ParalIMR effectively preserves discriminative signatures even when they are partially masked by Gaussian noise or fading channel interference.
In Task 3, the performance of all incremental algorithms inevitably degrades. However, ParalIMR exhibits the smallest performance drop among all baselines. As shown in Figure 6b, ParalIMR consistently outperforms iCaRL across most SNR scenarios, validating that feature decoupling through parallel reconstruction regularization mitigates catastrophic forgetting more effectively than data replay alone. Furthermore, the proposed method approaches the upper performance bound in both tasks, confirming the rationality of its dual-branch architecture for incremental modulation recognition in dynamic electromagnetic environments.
Additionally, we further evaluated the iCaRL and ParalIMR algorithms on the RML2018 dataset. As illustrated in Figure 7a, ParalIMR demonstrates enhanced performance relative to iCaRL at lower SNRs, which can be attributed to the noise reduction capability of the PR module. Conversely, at higher SNRs, both algorithms exhibit comparable performance. Figure 7a explicitly indicates that ParalIMR substantially outperforms iCaRL. The integrated mechanisms of ParalIMR, including dual-branch optimization and segment substitution augmentation, prove more efficacious in addressing practical channel impairments and hardware non-idealities, which are particularly pronounced in the RML2018 dataset. This result substantiates that our framework possesses superior adaptability to real-world communication environments in comparison with iCaRL.
Figure 8 and Figure 9 quantify the accuracy of legacy categories across SNRs. In Task 2 (Figure 8), iCaRL achieves slightly higher accuracy than ParalIMR under some SNR conditions. However, as training progresses to Task 3 (Figure 9), the cross-stage performance trend becomes distinct. With increased exposure to new task data, iCaRL's decision boundaries, which were initially shaped by noise and environmental cues, collapse in Task 3, triggering severe catastrophic forgetting. In contrast, the proposed algorithm preserves stable core modulation features, which are resilient to environmental fluctuations. This intrinsic robustness reduces noise reliance during classification, enabling consistent outperformance over iCaRL in Task 3 and validating the efficacy of ParalIMR in mitigating shortcuts.

5.4. Confusion Matrix

To quantify class-specific discrimination, confusion matrices are presented for all stages. In these matrices, rows represent true modulation categories while columns denote model-predicted categories. These confusion matrices offer a detailed visualization of the model’s discriminative performance and the evolving dynamics of knowledge retention across three incremental stages. Analyzing these results reveals critical insights into physical feature overlap, constellation-based similarities, and the mechanics of catastrophic forgetting.
First, the confusion matrices in Figure 10 and Figure 11 quantify the performance divergence between ParalIMR and iCaRL across modulation categories. Specifically, ParalIMR exhibits misclassification patterns where AM-DSB signals are primarily confused with AM-SSB, whereas iCaRL frequently misidentifies AM-DSB as PAM. A direct comparison of Figure 10c and Figure 11c further delineates performance disparities for both novel and existing categories. Collectively, these results validate that ParalIMR maintains superior retention of prior category knowledge compared with iCaRL, demonstrating robust resistance to catastrophic forgetting in incremental learning scenarios.
Figure 12 and Figure 13 illustrate the detrimental impact of catastrophic forgetting on incremental learning frameworks, revealing a pronounced tendency for misclassification of legacy classes as novel categories. Moreover, cross-referencing with Figure 10 and Figure 11 reveals that this degradation effect escalates with increasing task complexity, as measured by the number of categories per incremental step. Consistent with observations on the RML2016 dataset, both ParalIMR and iCaRL demonstrate comparable performance on ambiguous modulation classes, albeit with distinct misclassification profiles. Critically, ParalIMR preserves significantly more discriminative knowledge of legacy classes, thereby mitigating catastrophic forgetting more effectively than state-of-the-art alternatives.
To further elucidate ParalIMR’s recognition performance across various categories, this paper presents findings based on experimental results from the RML2016 dataset. Experimental results in Figure 11b reveal sustained diagonal concentration in confusion matrices compared with Figure 11a, confirming the proposed method’s ability to preserve discriminative features during incremental learning. When advancing to Task 2, which incorporates three new modulation formats (PAM4, BPSK, AM-SSB), classification accuracy remains above 93% for most categories. Notably, AM-DSB signals exhibit cross-class misclassification with the newly introduced AM-SSB, attributed to their shared analog amplitude modulation lineage and highly similar temporal envelope characteristics.
In Figure 11c, a notable feature is the misrecognition of QAM16 signals as QAM64, highlighting the model’s limitations in distinguishing complex signals. Detailed analysis of the constellation diagram reveals that QAM64 exhibits a denser point distribution compared with QAM16, attributable to its higher modulation order. Consequently, the feature space constructed by the model for QAM64 is broader and more detailed, capable of capturing subtle signal variations. On the other hand, QAM16 constellation points undergo oscillation and dispersion under AWGN or fading channel conditions. From the model’s perspective, this blurred QAM16 signal closely resembles a high-order QAM64 signal, as noise-induced dispersion mimics the inherent point density characteristic of higher-order modulation. This leads to loss of critical discriminative features and results in overfitting, where the model becomes overly dependent on superficial characteristics rather than fundamental features.
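The difference in point density can be quantified by the minimum distance between constellation points. The sketch below assumes ideal square QAM constellations normalized to unit average power, which is a simplification: it does not model the pulse shaping or channel effects present in the dataset.

```python
import math


def square_qam(m):
    """Square M-QAM constellation scaled to unit average power."""
    side = int(round(math.sqrt(m)))
    if side * side != m:
        raise ValueError("m must be a perfect square")
    levels = [2 * i - (side - 1) for i in range(side)]
    points = [(i, q) for i in levels for q in levels]
    avg_power = sum(i * i + q * q for i, q in points) / m
    scale = 1.0 / math.sqrt(avg_power)
    return [(i * scale, q * scale) for i, q in points]


def min_distance(points):
    """Smallest Euclidean distance between any two constellation points."""
    return min(
        math.dist(a, b)
        for idx, a in enumerate(points)
        for b in points[idx + 1:]
    )


d16 = min_distance(square_qam(16))  # equals sqrt(6 / 15)
d64 = min_distance(square_qam(64))  # equals sqrt(6 / 63)
```

Under this normalization, the spacing of 64QAM is roughly half that of 16QAM, so noise that scatters 16QAM points by a fraction of their spacing can produce a point cloud whose density resembles that of the higher-order constellation.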
Furthermore, Figure 11c demonstrates a reduction in confusion between AM-DSB and AM-SSB, which is attributed to the replay strategy. In Task 2, AM-SSB was introduced as a novel modulation category with ample training samples, while AM-DSB belonged to an existing category represented only by the exemplar set $P_2$. At this stage, the model was trained end-to-end on AM-SSB and thus comprehensively learned its distinctive features, including the spectral asymmetry caused by upper-sideband suppression. In contrast, in Task 3, both AM-DSB and AM-SSB are old categories with equal sample sizes, so no significant parameter adjustments are required to accommodate new samples. Additionally, the parallel reconstruction branch serves as a task-agnostic structural anchor that consistently preserves the geometric consistency of the underlying physical waveforms. Together, these factors ensure a stable representation of the fundamental structures of homologous analog signals and minimize the feature overlap between AM-SSB and AM-DSB in the latent space.

5.5. Ablation Study

To investigate the contributions of the parallel reconstruction strategy and the segment substitution strategy, ablation experiments are designed with four model configurations: Baseline (iCaRL), Baseline + SS (SS), Baseline + PR (PR) and Baseline + SS + PR (ParalIMR). Table 3 presents the accuracy as a function of SNR across the three incremental tasks, while the task-specific accuracy of each task category in Task 3 is compared in Figure 14.
Table 3 illustrates a cross-stage performance reversal phenomenon within the incremental learning process. In Task 2, the overall accuracy of the SS strategy (e.g., 87.17% at 0 dB) was marginally lower than the 87.93% achieved by the baseline model. However, as the task progressed to Task 3, the baseline model’s accuracy plummeted to 81.63%, whereas SS maintained superior performance at 85.19%. This reversal confirms that the baseline’s initial performance was artificially inflated by shortcuts. In contrast, the proposed framework utilizes feature decoupling and regularization to ensure the model focuses on intrinsic modulation signatures. Furthermore, the fact that these performance gains are more pronounced at low-SNRs validates the model’s anti-interference capability. In these high-noise environments, the proposed method successfully prevents the model from defaulting to noise-sensitive shortcuts, thereby ensuring robust recognition across incremental tasks.
Based on the results in Figure 14, the specific advantages of the SS and PR strategies are validated through their impact on task-specific accuracy. For intermediate tasks (classes 5–7), the SS strategy achieved a performance of 83.36%, while iCaRL only achieved 69.12%. This demonstrates its efficacy in blocking environmental shortcuts. On the other hand, the PR strategy exhibits its strength in feature anchoring. At 2 dB to 4 dB, PR achieved 86.26% and 92.27%, significantly exceeding the baseline’s 84.90% and 89.33%. Notably, the performance crossover for the oldest tasks (classes 0–4) confirms that reconstruction loss successfully anchors the underlying physical waveform topology. This suggests a dual-path logic: the PR module suppresses over-adaptation to new features through structural regularization, while the SS module preserves intermediate task representations through data-level feature decoupling.

5.6. Shortcut Learning

To systematically evaluate the robustness of the proposed ParalIMR model against shortcut learning, we devise an experimental protocol wherein the model is trained and tested on distinct datasets with varying noise and channel conditions. First, we construct the following four datasets in MATLAB 2023b.
  • AWGN dataset: This dataset incorporates only Gaussian white noise to simulate ideal fading-free channel conditions, serving as a baseline environment for evaluating the impact of background noise.
  • Rayleigh + AWGN dataset: Signals pass through a Rayleigh channel that simulates multipath propagation, followed by the addition of Gaussian white noise, to characterize signal behavior in a purely scattering environment.
  • Rician + AWGN dataset: Signals propagate through a Rician channel, a mixed propagation environment containing both direct and multipath components, followed by the addition of Gaussian white noise, to analyze signal performance in scenarios with a partial direct path.
  • Rician + Rayleigh + AWGN dataset: Signals first pass through a Rayleigh channel and then a Rician channel, simulating more complex fading scenarios, and are finally corrupted with Gaussian white noise to evaluate signal performance in highly dynamic channels.
All datasets were generated using MATLAB’s built-in functions and default parameters to ensure experimental reproducibility and consistency. Within each dataset, the signal size and quantities for various modulation types across different SNRs align with those of RML2016a, enabling direct application of the model parameters listed in Table 1 for fair comparison with existing studies.
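The channel conditions that distinguish these datasets can be illustrated with a simplified Python sketch. It is not the MATLAB setup used for the experiments: it assumes a single-tap, block-flat Rayleigh coefficient and sets the noise level relative to the power of the faded signal, and all function names are illustrative.

```python
import math
import random


def add_awgn(signal, snr_db, rng):
    """Add complex Gaussian noise at the given SNR relative to `signal` power."""
    sig_power = sum(abs(s) ** 2 for s in signal) / len(signal)
    noise_power = sig_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power / 2)  # standard deviation per real dimension
    return [s + complex(rng.gauss(0, sigma), rng.gauss(0, sigma)) for s in signal]


def rayleigh_flat_fading(signal, rng):
    """Multiply the whole frame by one coefficient h drawn from CN(0, 1)."""
    h = complex(rng.gauss(0, math.sqrt(0.5)), rng.gauss(0, math.sqrt(0.5)))
    return [h * s for s in signal], h


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean frame and its noisy version."""
    sig = sum(abs(c) ** 2 for c in clean)
    noise = sum(abs(n - c) ** 2 for c, n in zip(clean, noisy))
    return 10 * math.log10(sig / noise)


rng = random.Random(0)
# Unit-power QPSK symbols as a stand-in for a modulated frame.
qpsk = [
    complex(rng.choice((-1, 1)), rng.choice((-1, 1))) / math.sqrt(2)
    for _ in range(4096)
]
faded, h = rayleigh_flat_fading(qpsk, rng)
received = add_awgn(faded, snr_db=0, rng=rng)
snr_estimate = measured_snr_db(faded, received)
```

A model that has latched onto the fading statistics of its training channel rather than the modulation structure would be expected to degrade when the fading step is removed or replaced at test time, which is the effect the following configurations probe.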
Using the aforementioned datasets, we design three experimental configurations to evaluate whether the model can operate independently of specific channel effects.
  • Configuration 1 employs the Rician + Rayleigh + AWGN dataset as the training set and the AWGN dataset as the test set, aiming to verify whether the model maintains robust recognition performance under ideal AWGN conditions after training in complex fading channel environments.
  • Configuration 2 uses the AWGN dataset as the training set and the Rayleigh + AWGN dataset as the test set, verifying whether the model can maintain stable recognition accuracy in complex channel scenarios with superimposed Rayleigh fading after being trained solely on an AWGN environment.
  • Configuration 3 employs the Rayleigh + AWGN dataset as the training set, while the Rician + AWGN dataset acts as the test set, verifying whether the model trained in a channel environment with Rayleigh fading can maintain reliable recognition performance even under a Rician fading channel scenario.
Figure 15, Figure 16 and Figure 17 present the results for the three experimental configurations outlined above. ParalIMR consistently outperforms the traditional incremental learning method iCaRL across all configurations. For instance, in Configuration 1 at −4 dB, ParalIMR improves recognition accuracy over iCaRL by 2% and 8% for Task 2 and Task 3, respectively, indicating that ParalIMR does not excessively rely on specific fading effects within the training environment. In Configuration 2, the baseline method, iCaRL, consistently exhibits lower recognition accuracy than ParalIMR, particularly in Task 3. Results from Configuration 3 further confirm that ParalIMR delivers superior recognition performance during scene transitions involving diverse fading types. These findings indicate that ParalIMR effectively mitigates spurious correlations between environmental conditions and category labels through its parallel reconstruction architecture and segment substitution strategy, reducing dependence on specific channel effects in the training environment and thereby validating its robustness against shortcut learning. Furthermore, the results across these experiments reveal that ParalIMR's performance advantages are more pronounced under low-SNR conditions than in high-SNR scenarios, highlighting the effective noise reduction capability of the parallel reconstruction module.
It should be noted that we did not train on RML2016a and then test on the four generated datasets. The rationale for this decision is that specific signal parameters of RML2016a, such as bandwidth and sampling rate, are unknown to us. This uncertainty would introduce an uncontrollable bias in the distribution of signal characteristics between the training and test sets, making it impossible to ensure that the experimental results are solely attributable to differences in channel and noise conditions. Such confounding variables would compromise the accurate assessment of the model's robustness. Therefore, we opted for datasets generated with unified parameters to strictly control variables and ensure the reliability of our conclusions.
Finally, in future research, we will further collect real-world communication signal data from diverse environments (including urban, suburban, and indoor settings) to comprehensively evaluate the ParalIMR model’s generalization performance and practical applicability in real-world scenarios, thereby providing a solid foundation for optimizing its deployment. In summary, this study makes significant contributions to the field of incremental learning for modulation classification.

6. Conclusions

This paper presents an incremental learning algorithm for automatic modulation recognition. To suppress overfitting to random noise, a parallel reconstruction architecture with deeply integrated denoising mechanisms is proposed. To mitigate environment-induced shortcut learning, we develop a segment substitution strategy to break the strong correlation between radio environment features and signal labels. Together, these two components reduce the model's reliance on environment-specific shortcuts, thereby alleviating the catastrophic forgetting caused by incomplete coverage of legacy shortcut features during incremental tasks. Experimental results on RML2016a and RML2018 show that our method outperforms the iCaRL baseline by 2.69% and 4.46%, respectively, demonstrating that the proposed algorithm achieves a favorable balance between stability and plasticity in long-term incremental learning.

Author Contributions

Conceptualization, Z.W. and Z.Z.; methodology, Z.W. and Z.Z.; software, Z.W.; validation, Z.W., Z.Z. and Y.W.; formal analysis, Z.Z.; investigation, Z.W. and Z.Z.; resources, Z.W.; data curation, Z.W.; writing—original draft preparation, Z.W.; writing—review and editing, Z.Z.; visualization, Z.W. and Z.Z.; supervision, Y.W.; project administration, Z.Z.; funding acquisition, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. The code is publicly available at https://github.com/zhzhou/ParalIMR (accessed on 7 June 2026). Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhu, Z.; Nandi, A.K. Automatic Modulation Classification: Principles, Algorithms and Applications; John Wiley & Sons: Hoboken, NJ, USA, 2015.
  2. Zhang, C.; Patras, P.; Haddadi, H. Deep learning in mobile and wireless networking: A survey. IEEE Commun. Surv. Tutor. 2019, 21, 2224–2287.
  3. Parisi, G.I.; Kemker, R.; Part, J.L.; Kanan, C.; Wermter, S. Continual lifelong learning with neural networks: A review. Neural Netw. 2019, 113, 54–71.
  4. Goldsmith, A. Wireless Communications; Cambridge University Press: Cambridge, UK, 2005.
  5. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Hadsell, R. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526.
  6. Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; Manzagol, P.-A.; Bottou, L. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 2010, 11, 3371–3408.
  7. Miuccio, L.; Panno, D.; Riolo, S. A Wasserstein GAN Autoencoder for SCMA Networks. IEEE Wirel. Commun. Lett. 2022, 11, 1298–1302.
  8. Geirhos, R.; Jacobsen, J.; Michaelis, C.; Zemel, R.; Brendel, W.; Bethge, M.; Wichmann, F.A. Shortcut learning in deep neural networks. Nat. Mach. Intell. 2020, 2, 665–673.
  9. Liu, Z.; Li, Y.; Wu, Y.; Gong, Y. Learning to Optimize Resource Allocation in Dynamic Wireless Environments: Embracing the New While Engaging the Old. IEEE Trans. Wirel. Commun. 2025, 24, 7346–7359.
  10. Qu, Y.; Lu, Z.; Zeng, R.; Wang, J.; Wang, J. Enhancing automatic modulation recognition through robust global feature extraction. IEEE Trans. Veh. Technol. 2024, 74, 4192–4207.
  11. O’Shea, T.J.; West, N. Radio machine learning dataset generation with GNU Radio. In Proceedings of the GNU Radio Conference, Boulder, CO, USA, 12–16 September 2016.
  12. O’Shea, T.J.; Roy, T.; Clancy, T.C. Convolutional radio modulation recognition networks. In Proceedings of the 17th International Conference on Engineering Applications of Neural Networks, Aberdeen, UK, 2–5 September 2016; Springer International Publishing: Cham, Switzerland, 2016.
  13. Krzyston, J.; Bhattacharjea, R.; Stark, A. Complex-Valued Convolutions for Modulation Recognition using Deep Learning. In Proceedings of the 2020 IEEE International Conference on Communications Workshops (ICC Workshops), Dublin, Ireland, 7–11 June 2020; IEEE: New York, NY, USA, 2020.
  14. Lin, S.; Zeng, Y.; Gong, Y. Learning of time-frequency attention mechanism for automatic modulation recognition. IEEE Wirel. Commun. Lett. 2022, 11, 707–711.
  15. Tunze, G.B.; Huynh-The, T.; Lee, J.-M.; Kim, D.-S. Multi-shuffled convolutional blocks for low-complex modulation recognition. In Proceedings of the 2020 International Conference on Information and Communication Technology Convergence (ICTC), Jeju, Republic of Korea, 21–23 October 2020; IEEE: New York, NY, USA, 2020.
  16. Emam, A.; Shalaby, M.; Aboelazm, M.A.; Bakr, H.E.A.; Mansour, H.A.A. A comparative study between CNN, LSTM, and CLDNN models in the context of radio modulation classification. In Proceedings of the 12th International Conference on Electrical Engineering (ICEENG), Cairo, Egypt, 7–9 July 2020; IEEE: New York, NY, USA, 2020.
  17. Chang, S.; Yang, Z.; He, J.; Li, R.; Huang, S.; Feng, Z. A fast multi-loss learning deep neural network for automatic modulation classification. IEEE Trans. Cogn. Commun. Netw. 2023, 9, 1503–1518.
  18. Jang, J.; Pyo, J.; Yoon, Y.; Choi, J. Meta-transformer: A meta-learning framework for scalable automatic modulation classification. IEEE Access 2024, 12, 9267–9276.
  19. Lei, J.; Li, Y.; Lo, Y.; Leng, Y.; Lin, Q.; Wu, Y. Understanding Complex-Valued Transformer for Modulation Recognition. IEEE Wirel. Commun. Lett. 2024, 13, 3523–3527.
  20. Tunze, G.B.; Huynh-The, T.; Lee, J.-M.; Kim, D.-S. Sparsely Connected CNN for Efficient Automatic Modulation Recognition. IEEE Trans. Veh. Technol. 2020, 69, 15557–15568.
  21. Ma, J.; Hu, M.; Chen, X.; Wan, L.; Wang, J. Few-shot automatic modulation classification via semi-supervised metric learning and lightweight conv-transformer model. IEEE Trans. Cogn. Commun. Netw. 2025, 12, 1012–1024.
  22. Lyu, Z.; Xiao, M.; Skoglund, M.; Debbah, M.; Poor, H.V. Quantization-Aware Collaborative Inference for Large Embodied AI Models. arXiv 2026, arXiv:2602.13052.
  23. Rebuffi, S.A.; Kolesnikov, A.; Sperl, G.; Lampert, C.H. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017.
  24. Zhang, X.; Li, T.; Gong, P.; Liu, R.; Zha, X.; Tang, W. Open set recognition of communication signal modulation based on deep learning. IEEE Commun. Lett. 2022, 26, 1588–1592.
  25. Montes, C.; Morehouse, T.; Zhou, R. Class-incremental learning for baseband modulation classification: A comparison. In Proceedings of the 2024 International Wireless Communications and Mobile Computing (IWCMC), Ayia Napa, Cyprus, 27–31 May 2024; IEEE: New York, NY, USA, 2024.
  26. Xu, B.; Wang, H.; Wu, B.; Cui, Z.; Cao, Z. Signal Modulation Recognition via Bias Adjustment-Based Class Incremental Learning. IEEE Sens. J. 2024, 24, 41437–41450.
  27. Fan, Z.; Tu, Y.; Lin, Y.; Shi, Q. C-SRCIL: Complex-valued Class-Incremental Learning for Signal Recognition. In Proceedings of the 2023 International Joint Conference on Neural Networks (IJCNN), Gold Coast, Australia, 18–23 June 2023; IEEE: New York, NY, USA, 2023.
  28. Fan, Z.; Tu, Y.; Lin, Y.; Shi, Q. Class-Incremental Learning for Recognition of Complex-Valued Signals. IEEE Trans. Cogn. Commun. Netw. 2024, 10, 417–428.
  29. Deng, Z.; Luo, C.; Tang, Z.; Luo, Y. MSNCIL: A Domain-Agnostic Class-Incremental Learning Method Tailored for Automatic Modulation Recognition. IEEE Commun. Lett. 2025, 29, 1456–1460.
  30. Wang, G.; Liu, Z.; Zhang, X.; Chen, Y.; Zhang, Y.; Zhu, J. PID: A Parameter-Efficient Isolation Domain-Incremental Learning Framework for Signal Modulation Classification. IEEE Trans. Neural Netw. Learn. Syst. 2026, 37, 1449–1462.
  31. Tan, H.; Zhang, Z.; Li, Y.; Shi, X.; Wang, L.; Yang, X. PASS-Net: A Pseudo Classes and Stochastic Classifiers-Based Network for Few-Shot Class-Incremental Automatic Modulation Classification. IEEE Trans. Wirel. Commun. 2024, 23, 17987–18003.
  32. Qi, P.; Zhou, X.; Ding, Y.; Zheng, S.; Jiang, T.; Li, Z. Collaborative and incremental learning for modulation classification with heterogeneous local dataset in cognitive IoT. IEEE Trans. Green Commun. Netw. 2022, 7, 881–893.
  33. Li, Z.; Hoiem, D. Learning without Forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2935–2947.
Figure 1. The diagram of our proposed methodology.
Figure 2. The architecture of the shared encoder.
Figure 3. The architecture of the decoder.
Figure 4. Performance of ParalIMR with changing replacement ratio p on RML2016a.
Figure 5. Performance of ParalIMR with changing distillation temperature τ on RML2016a.
Figure 6. Overall accuracy for the second and third tasks on RML2016a.
Figure 7. Overall accuracy for the second and third tasks on RML2018.
Figure 8. Task-specific accuracy for Task 2 on RML2016a.
Figure 9. Task-specific accuracy for Task 3 on RML2016a.
Figure 10. Confusion matrices for iCaRL on RML2016a.
Figure 11. Confusion matrices for ParalIMR on RML2016a.
Figure 12. Confusion matrices for iCaRL on RML2018.
Figure 13. Confusion matrices for ParalIMR on RML2018.
Figure 14. Task-specific accuracy for ablation study at the last task.
Figure 15. Overall accuracy for Configuration 1.
Figure 16. Overall accuracy for Configuration 2.
Figure 17. Overall accuracy for Configuration 3.
Table 1. The Structure of the Denoise-Robust Classifier for RML2016a.

| Module | Name | Settings | Input Dimension | Output Dimension |
|---|---|---|---|---|
| shared encoder | input Conv | Conv1d (kernel size = 3, stride = 1, padding = 0) | 2 × 128 | 16 × 128 |
| shared encoder | stage 1 | ResNet (stride = 1) | 16 × 128 | 16 × 128 |
| shared encoder | stage 2 | ResNet (stride = 2) | 16 × 128 | 32 × 64 |
| shared encoder | stage 3 | ResNet (stride = 2) | 32 × 64 | 64 × 32 |
| decoder | upsample 1 | kernel size = 4, stride = 2 | 64 × 32 | 32 × 64 |
| decoder | skip connection | | 32 × 64 | 32 × 64 |
| decoder | upsample 2 | kernel size = 4, stride = 2 | 32 × 64 | 16 × 128 |
| decoder | output Conv | kernel size = 4, stride = 1 | 16 × 128 | 2 × 128 |
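
The temporal lengths in Table 1 can be checked with the standard output-length formulas for 1-D convolutions and transposed convolutions. The sketch below is only an illustration under stated assumptions: the ResNet stages are assumed to use kernel size 3, the upsampling layers are assumed to be transposed convolutions with padding 1, and the helper names (`conv1d_len`, `deconv1d_len`, `trace_rml2016a`) are hypothetical. Note that, under the standard formula, a kernel of size 3 with the listed padding = 0 would shorten 128 samples to 126, so the sketch assumes length-preserving zero padding for the input and output convolutions.

```python
def conv1d_len(length, kernel, stride=1, pad_total=0):
    """Output length of a 1-D convolution; pad_total is left + right zero padding."""
    return (length + pad_total - kernel) // stride + 1


def deconv1d_len(length, kernel, stride=1, padding=0):
    """Output length of a 1-D transposed convolution (dilation 1, no output padding)."""
    return (length - 1) * stride - 2 * padding + kernel


def trace_rml2016a(length=128):
    """Follow the temporal length through the layers listed in Table 1."""
    shapes = {}
    # Input conv, kernel 3: two zeros of padding in total (assumed) keep 128 samples.
    length = conv1d_len(length, kernel=3, stride=1, pad_total=2)
    shapes["input Conv"] = length
    # Stage 1 has stride 1 and leaves the length unchanged.
    shapes["stage 1"] = length
    # Stages 2 and 3: stride-2 convolutions (kernel 3 assumed) halve the length.
    for name in ("stage 2", "stage 3"):
        length = conv1d_len(length, kernel=3, stride=2, pad_total=2)
        shapes[name] = length
    # Upsampling: transposed conv, kernel 4, stride 2, padding 1 (assumed).
    for name in ("upsample 1", "upsample 2"):
        length = deconv1d_len(length, kernel=4, stride=2, padding=1)
        shapes[name] = length
    # Output conv, kernel 4, stride 1: three zeros in total (asymmetric, assumed).
    length = conv1d_len(length, kernel=4, stride=1, pad_total=3)
    shapes["output Conv"] = length
    return shapes


if __name__ == "__main__":
    for layer, n in trace_rml2016a().items():
        print(f"{layer:12s} length = {n}")
```

With these assumed paddings the traced lengths (128, 128, 64, 32, 64, 128, 128) agree with the second dimension of every row in Table 1.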
Table 2. The Structure of the Denoise-Robust Classifier for RML2018.

| Module | Name | Settings | Input Dimension | Output Dimension |
|---|---|---|---|---|
| shared encoder | input Conv | Conv1d (kernel size = 3, stride = 1, padding = 0) | 2 × 1024 | 16 × 1024 |
| shared encoder | stage 1 | ResNet (stride = 1) | 16 × 1024 | 16 × 1024 |
| shared encoder | stage 2 | ResNet (stride = 2) | 16 × 1024 | 32 × 512 |
| shared encoder | stage 3 | ResNet (stride = 2) | 32 × 512 | 64 × 256 |
| shared encoder | stage 4 | ResNet (stride = 2) | 64 × 256 | 128 × 128 |
| decoder | upsample 1 | kernel size = 4, stride = 2 | 128 × 128 | 64 × 256 |
| decoder | skip connection 1 | | 64 × 256 | 64 × 256 |
| decoder | upsample 2 | kernel size = 4, stride = 2 | 64 × 256 | 32 × 512 |
| decoder | skip connection 2 | | 32 × 512 | 32 × 512 |
| decoder | upsample 3 | kernel size = 4, stride = 2 | 32 × 512 | 16 × 1024 |
| decoder | output Conv1 | kernel size = 3, stride = 1 | 16 × 1024 | 8 × 1024 |
| decoder | output Conv2 | kernel size = 3, stride = 1 | 8 × 1024 | 2 × 1024 |
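
The skip-connection rows in Table 2 list only feature-map shapes. A minimal sketch of how those shapes line up with the encoder stages is given below, assuming a U-Net-style pairing in which each skip connection joins a decoder output with the encoder stage of identical shape. That pairing rule and the helper names (`encoder_shapes`, `decoder_shapes`, `skip_pairs`) are assumptions for illustration, not details confirmed by the table.

```python
def encoder_shapes(in_len, base_channels=16, num_stages=4):
    """(channels, length) after each encoder stage.

    Stage 1 keeps the shape of the input conv output; every later stage
    doubles the channels and halves the length (stride 2).
    """
    shapes = [(base_channels, in_len)]
    for _ in range(num_stages - 1):
        channels, length = shapes[-1]
        shapes.append((channels * 2, length // 2))
    return shapes


def decoder_shapes(bottleneck, num_upsamples):
    """(channels, length) after each upsampling step: halve channels, double length."""
    shapes = []
    channels, length = bottleneck
    for _ in range(num_upsamples):
        channels, length = channels // 2, length * 2
        shapes.append((channels, length))
    return shapes


def skip_pairs(in_len, num_stages):
    """Pair each upsampling output (except the last) with the matching encoder stage."""
    enc = encoder_shapes(in_len, num_stages=num_stages)
    dec = decoder_shapes(enc[-1], num_upsamples=num_stages - 1)
    pairs = []
    for step, shape in enumerate(dec[:-1], start=1):
        stage = enc.index(shape) + 1  # assumed pairing: identical shape
        pairs.append((f"upsample {step}", f"stage {stage}", shape))
    return enc, dec, pairs


if __name__ == "__main__":
    enc, dec, pairs = skip_pairs(in_len=1024, num_stages=4)  # RML2018 setting
    for up, stage, shape in pairs:
        print(f"{up} <-> {stage}: {shape[0]} x {shape[1]}")
```

For the RML2018 configuration this yields skip shapes of 64 × 256 and 32 × 512, which match the two skip-connection rows of Table 2. Calling `skip_pairs(128, 3)` gives the single 32 × 64 skip connection of the RML2016a variant.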
Table 3. Ablation Study on RML2016a. All results are the mean ± standard deviation of 5 runs (accuracy in %).

| SNR (dB) | Stage 1 iCaRL | Stage 1 SS | Stage 1 PR | Stage 1 PIMR | Stage 2 iCaRL | Stage 2 SS | Stage 2 PR | Stage 2 PIMR | Stage 3 iCaRL | Stage 3 SS | Stage 3 PR | Stage 3 PIMR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| −4 | 86.48 ± 9.24 | 87.34 ± 6 | 86.98 ± 7.15 | 88.87 ± 7.91 | 70.79 ± 7.43 | 73.59 ± 6.42 | 70.04 ± 7.26 | 75.64 ± 7.07 | 60 ± 3.7 | 66.53 ± 5.62 | 60.43 ± 2.93 | 67.1 ± 4.18 |
| −2 | 93.28 ± 6.27 | 94.12 ± 5.87 | 93.59 ± 4.62 | 94.29 ± 5.48 | 82.29 ± 4.96 | 80.94 ± 3.25 | 82.11 ± 3.78 | 83.51 ± 4.51 | 73.44 ± 3.18 | 77.63 ± 3.28 | 73.66 ± 1.77 | 78.01 ± 2.67 |
| 0 | 97.86 ± 1.61 | 97.88 ± 1.15 | 98.2 ± 0.81 | 98.54 ± 0.75 | 87.93 ± 2.4 | 87.17 ± 2.18 | 88.1 ± 2.67 | 88.32 ± 2.96 | 81.63 ± 3.19 | 85.19 ± 2.55 | 82.96 ± 2.71 | 84.32 ± 2.1 |
| 2 | 99.11 ± 0.31 | 98.85 ± 0.19 | 98.93 ± 0.19 | 99.32 ± 0.12 | 89.63 ± 2.33 | 88.99 ± 2.25 | 89.68 ± 2.67 | 89.78 ± 3.13 | 84.41 ± 2.63 | 86.56 ± 2.84 | 85.36 ± 1.96 | 86.26 ± 2.57 |
| 4 | 99.25 ± 0.13 | 98.91 ± 0.12 | 99.06 ± 0.14 | 99.43 ± 0.15 | 90.17 ± 1.32 | 89.18 ± 1.98 | 89.72 ± 1.84 | 90.31 ± 2.95 | 85.44 ± 3 | 87.39 ± 2.92 | 86.49 ± 4.08 | 87.16 ± 2.66 |
| 6 | 99.29 ± 0.15 | 99.22 ± 0.15 | 99.22 ± 0.16 | 99.49 ± 0.19 | 90.67 ± 1.76 | 90.74 ± 2.25 | 90.79 ± 2.03 | 90.85 ± 2.48 | 87.37 ± 3.07 | 88.61 ± 3.24 | 88.79 ± 1.72 | 88.79 ± 1.96 |
| 8 | 99.17 ± 0.15 | 99.16 ± 0.12 | 99.16 ± 0.12 | 99.3 ± 0.1 | 90.14 ± 1.98 | 89.32 ± 2.37 | 89.61 ± 2.99 | 90.3 ± 2.69 | 86.37 ± 3.18 | 88.15 ± 2.28 | 87.97 ± 3.36 | 87.94 ± 2.34 |
| 10 | 99.45 ± 0.14 | 99.09 ± 0.13 | 99.25 ± 0.12 | 99.55 ± 0.15 | 91.15 ± 1.54 | 90.2 ± 1.65 | 90.92 ± 2.55 | 91.08 ± 2.42 | 87.59 ± 3.2 | 88.38 ± 3.24 | 89.12 ± 2.7 | 88.69 ± 2.11 |
| 12 | 99.46 ± 0.27 | 99.72 ± 0.22 | 99.56 ± 0.2 | 99.62 ± 0.19 | 90.83 ± 1.53 | 91.07 ± 1.87 | 90.68 ± 1.58 | 90.82 ± 2.53 | 87.46 ± 3.29 | 88.96 ± 3.85 | 87.8 ± 2.12 | 88.81 ± 2.24 |
| 14 | 99.3 ± 0.12 | 99.34 ± 0.12 | 99.34 ± 0.15 | 99.37 ± 0.16 | 90.8 ± 1.45 | 89.69 ± 1.87 | 89.49 ± 2.01 | 90.57 ± 2.39 | 87.31 ± 3.03 | 88.12 ± 2.86 | 87.62 ± 2.71 | 88.17 ± 2.51 |
| 16 | 99.27 ± 0.41 | 98.44 ± 0.34 | 98.36 ± 0.41 | 99.34 ± 0.33 | 90.49 ± 1.85 | 90.32 ± 1.97 | 90.12 ± 3.04 | 90.54 ± 2.68 | 87.71 ± 3.14 | 88.38 ± 2.01 | 87.67 ± 2.44 | 88.96 ± 1.98 |
| 18 | 99.24 ± 0.21 | 99.33 ± 0.04 | 99.41 ± 0.12 | 99.43 ± 0.08 | 90.55 ± 1.46 | 90.41 ± 2.08 | 90.11 ± 2.11 | 90.21 ± 2.92 | 87.11 ± 3.59 | 87.31 ± 2.7 | 87.85 ± 3.08 | 87.95 ± 2.59 |
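
The headline gain over iCaRL can be recomputed from the Stage 3 columns of Table 3. The short sketch below copies the mean accuracies of the iCaRL and PIMR columns (PIMR is taken here to denote the full ParalIMR model, which is an interpretation of the column label) and subtracts them per SNR. The names `STAGE3` and `gains` are hypothetical, and standard deviations are ignored, so this is a rough check rather than a significance analysis.

```python
# Stage 3 mean accuracy (%) from Table 3: SNR (dB) -> (iCaRL, PIMR).
STAGE3 = {
    -4: (60.00, 67.10), -2: (73.44, 78.01), 0: (81.63, 84.32),
    2: (84.41, 86.26), 4: (85.44, 87.16), 6: (87.37, 88.79),
    8: (86.37, 87.94), 10: (87.59, 88.69), 12: (87.46, 88.81),
    14: (87.31, 88.17), 16: (87.71, 88.96), 18: (87.11, 87.95),
}


def gains(table):
    """Absolute accuracy gain of PIMR over iCaRL, in percentage points, per SNR."""
    return {snr: round(ours - base, 2) for snr, (base, ours) in table.items()}


if __name__ == "__main__":
    for snr, gain in gains(STAGE3).items():
        print(f"SNR {snr:>3} dB: +{gain:.2f}")
```

At 0 dB the difference is 84.32 − 81.63 = 2.69, which reproduces the improvement quoted for RML2016a. In this table the gap is largest at −4 dB and stays positive at every listed SNR.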
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wang, Z.; Zhou, Z.; Wu, Y. ParalIMR: Bypassing Shortcut Learning in Incremental Modulation Recognition via Parallel Reconstruction and Feature Decoupling. Electronics 2026, 15, 2766. https://doi.org/10.3390/electronics15132766
