Abstract
Automatic modulation recognition (AMR) is widely employed in communication systems. However, under conditions of low signal-to-noise ratio (SNR), recent studies reveal limitations in achieving high AMR accuracy. In this work, we introduce a novel network architecture that leverages a transformer-inspired approach tailored for AMR, called the Feature-Enhanced Transformer with skip-attention (FE-SKViT). This design harnesses the advantages of translation variant convolution and the Transformer framework, handling large intra-signal variance and small cross-signal variance to achieve enhanced recognition accuracy. Experimental results on the RadioML2016.10a, RadioML2016.10b, and RML22 datasets demonstrate that FE-SKViT excels over other methods, particularly under low SNR conditions ranging from −4 to 6 dB.
1. Introduction
The rapid evolution of deep learning (DL) [1,2] techniques has significantly impacted various fields, including image processing, natural language processing, and speech recognition. AMR [3,4], which acts as an intermediary between signal detection and demodulation, facilitates the efficient classification of modulation types [5]. AMR methods can be categorized into likelihood-based (LB) [6], feature-based (FB) [7], and deep learning (DL)-based [8,9] approaches. DL-based AMR methods leverage advanced learning paradigms to enhance model performance and adaptability.
DL-based AMR methods typically involve two consecutive stages. In the first stage, preprocessing is applied to the received signal to ensure an appropriate representation for subsequent use. In the second stage, deep neural networks (DNNs) [10] process the signal representations and determine the modulation scheme.
In the preprocessing stage, the signal is mainly transformed into one of four representations: image representation [11], feature representation [12], sequence representation [13], or combined representation [14]. Within each category, the received signal can be expressed in various formats.
The idea of image representation is to transform the received signal into images, such as the Constellation Diagram [15], Eye Diagram [16], Bispectrum Graph [17], and Spectral Correlation Function Image [18], so that modulation classification can be performed via image recognition. However, this approach comes with a few limitations: (1) transforming signals into images increases the algorithm’s complexity, and (2) to ensure high recognition rates, there are strict requirements on the size of the transformed images, which can impact the model’s real-time inference performance.
The purpose of feature representation is to extract a set of features that accurately characterize the received signal. Importantly, the number of extracted features is typically less than the length of the received signal, facilitating simpler DNNs with reduced complexity in neurons and layers. However, it is worth noting that calculating signal features also adds to the computational complexity.
Currently, research mainly focuses on image and feature representations, as they effectively use established deep learning techniques. Researchers are exploring the optimization of these representations for various modulation schemes.
In the second stage, we categorize these DL-based AMR methods into CNN-based [19], RNN-based [20], GRU-based [21], LSTM-based [22], and transformer-based [23] methods. Transformers utilize multi-head self-attention mechanisms to model contextual information in signals, which is particularly suitable for the sequential nature of signal data. Consequently, recent papers have endeavored to make improvements based on ViT, designing preprocessing methods tailored for IQ signals before feeding them into ViT for recognition, achieving satisfactory outcomes.
In the domain of AMR, an innovative approach was presented by the authors in [24], integrating a frame-wise embedding module (FEM) into a transformer-based architecture to capture global signal characteristics. Unlike conventional Vision Transformer (ViT) [25] models, which rely on fixed-sized patches, the FEM aggregates multiple signal samples into single tokens in the embedding stage, facilitating a more efficient token sequence. This process enables the transformer to extract global features across the entire signal while mitigating locality bias. By removing reliance on convolutional or recurrent layers, the FEM fully activates the parallelism advantages of the transformer, resulting in improved classification accuracy and reduced computation time. This novel structure optimizes the token sequence generation process, making the model more suitable for AMR tasks.
Another significant contribution is noted in [26], where the authors proposed a vision-inspired transformer model to effectively tackle the challenge of feature extraction from signals. A single kernel block possesses a fundamental trait known as translation equivariance [27], which involves the spatial sharing of filters in a sliding-window manner. While this sharing reduces the number of parameters, it limits the ability to adapt to different positions within a signal. Consequently, their model incorporates a multi-kernel block, enabling it to perceive signals with varying kernel sizes and thus capture both fine-grained and broad-scoped features.
Furthermore, in [28], the authors introduced a novel relative position embedding technique within the Complex-Valued Transformer (CV-TRN) framework for AMR. This method emphasizes relationships between symbols based on their relative positioning, which is crucial for enhancing automatic modulation recognition (AMR) performance. Through meticulous experimentation on frame length L and step size R during the frame-wise embedding process, they investigated how these hyper-parameters specifically influence improvements in AMR system capabilities.
In summary, the crucial challenge for DL-based AMR lies in effectively enhancing feature expression under low SNR conditions, which currently constrains the performance of related models. To address this issue, our contributions are summarized as follows:
- A comprehensive analysis of feature enhancement (FE) improves the effectiveness of feature extraction from input signals;
- A high-accuracy network based on ViT with FE and skip attention (SKAT), named FE-SKViT, is proposed for automatic modulation classification;
- Compared to mainstream DL methods, the proposed FE-SKViT can achieve effective and robust modulation classification performance.
2. An Overview of Signal Model
In digital communication systems, the received signal is typically modeled to account for various distortions and noise introduced during transmission. The received signal can be expressed as:
$$ r(t) = s(t) * h(t) + n(t), $$
where $s(t)$ represents the transmitted RF signal, $*$ denotes the convolution operation, $h(t)$ denotes the channel impulse response, and $n(t)$ represents additive white Gaussian noise (AWGN).
For M-ary Continuous Phase Frequency Shift Keying (CPFSK) signals, as illustrated in Figure 1, the transmitted signal can be defined as:
$$ s(t) = A\cos\!\left(2\pi f_c t + 2\pi h\int_{-\infty}^{t} m(\tau)\,\mathrm{d}\tau + \phi_0\right), $$
where $f_c$ is the carrier frequency, $h$ is the modulation index, $m(t)$ is typically a sequence of impulses modulated by the data symbols $a_k$, i.e., $m(t)=\sum_k a_k\, g(t-kT)$, and $\phi_0$ is the initial phase.
Figure 1.
Waveform of the CPFSK signal, which is normalized.
For CPFSK, the pulse shaping function is usually a rectangular pulse given by:
$$ g(t) = \begin{cases} \dfrac{1}{2T}, & 0 \le t \le T, \\ 0, & \text{otherwise}, \end{cases} $$
where $T$ is the symbol period and $g(t)$ is the pulse shaping function.
Then, the received signal is quadrature sampled at the receiver to obtain the in-phase (I) and quadrature (Q) components, which can be represented as:
$$ \mathbf{x} = \begin{bmatrix} I[0] & I[1] & \cdots & I[N-1] \\ Q[0] & Q[1] & \cdots & Q[N-1] \end{bmatrix} \in \mathbb{R}^{2\times N}, $$
where $N$ denotes the sampling length.
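To make the signal model concrete, the following minimal Python sketch (our own illustration, not the dataset-generation code of [31,32]) produces a binary CPFSK waveform with a rectangular frequency pulse, passes it through an AWGN-only channel, and quadrature-samples it into the 2 × N IQ format used in the rest of this paper. The symbol alphabet, modulation index, and function name are illustrative assumptions.

```python
import numpy as np

def cpfsk_iq(symbols, sps=8, h=0.5, snr_db=0):
    """Generate a baseband CPFSK signal and return a 2 x N IQ array.

    symbols : array of data symbols in {-1, +1} (binary CPFSK for simplicity)
    sps     : samples per symbol (the RML datasets use 8)
    h       : modulation index
    snr_db  : SNR of the AWGN channel in dB
    """
    # Rectangular frequency pulse g(t) = 1/(2T) over one symbol period, discretised to
    # `sps` samples; the cumulative sum realises the phase integral of the signal model.
    m = np.repeat(symbols, sps) / (2.0 * sps)
    phase = 2.0 * np.pi * h * np.cumsum(m)
    s = np.exp(1j * phase)                      # unit-amplitude complex baseband signal

    # AWGN channel: an ideal impulse response h(t) is assumed here for brevity.
    noise_power = 10.0 ** (-snr_db / 10.0)
    n = np.sqrt(noise_power / 2.0) * (np.random.randn(s.size) + 1j * np.random.randn(s.size))
    r = s + n

    # Quadrature sampling into the 2 x N in-phase / quadrature representation.
    return np.stack([r.real, r.imag])           # shape (2, N)

x = cpfsk_iq(np.random.choice([-1, 1], size=16), sps=8, h=0.5, snr_db=0)
print(x.shape)  # (2, 128), matching the frame length of RML2016.10a
```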
3. Proposed Method
3.1. Overall Architecture
The detailed architecture of FE-SKViT is illustrated in Figure 2, conforming to the procedure as follows.
Figure 2.
The architecture of FE-SKViT.
In our research involving software-defined radio (SDR) platforms, we encountered the inevitable presence of noise in the recorded IQ data, which poses a significant challenge to deep learning classification models that are prone to overfitting. To address this, we harnessed the concept of TVConv [29], leveraging its ability to capture layout-specific features, to effectively attenuate noise in the frequency domain. Building on these findings, we introduce a novel model termed FE-SKViT, which incorporates a feature enhancement block to suppress noise in the frequency domain, coupled with ViT’s multi-head self-attention mechanism and the SKAT [30] framework, as illustrated in Figure 3, for efficient and in-depth feature extraction.
Figure 3.
The architecture of SKAT.
3.2. Investigation on the AMC with Distribution of Cross-Signal and Intra-Signal Variance
For AMR tasks, as shown in Figure 4, we compute regional statistics from CNN feature maps obtained by feeding the RML2016.10a, RML2016.10b [31], and RML22 [32] datasets through the network (a sketch of one way to compute such statistics follows Figure 4). The inputs exhibit large intra-signal variance and small cross-signal variance: the intra-signal variance is approximately 100 times the cross-signal variance. This indicates that IQ signals can vary significantly across different time spans, so enhancing the adaptive capacity of the modulation recognition model at different moments of the IQ signal is key to improving the network’s recognition accuracy.
Figure 4.
Intra-signal and Cross-signal variance on RML datasets.
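The exact procedure behind Figure 4 is not spelled out here, but the following hedged sketch shows one way such statistics could be computed from CNN feature maps: the intra-signal variance is taken along the time axis within each feature map, while the cross-signal variance is the variance of per-signal means across a batch. The tensor layout and the helper name intra_cross_variance are assumptions.

```python
import torch

def intra_cross_variance(features: torch.Tensor):
    """features: (batch, channels, length) CNN feature maps for a batch of IQ signals.

    Returns the mean intra-signal variance (spread along the time axis within each
    signal) and the cross-signal variance (spread of per-signal means across the
    batch), each averaged over channels.
    """
    intra = features.var(dim=-1)                # (batch, channels): variance over time per signal
    per_signal_mean = features.mean(dim=-1)     # (batch, channels)
    cross = per_signal_mean.var(dim=0)          # (channels,): variance of the means across signals
    return intra.mean().item(), cross.mean().item()

feats = torch.randn(64, 32, 128)  # dummy feature maps standing in for RML inputs
intra_var, cross_var = intra_cross_variance(feats)
print(f"intra-signal: {intra_var:.3f}, cross-signal: {cross_var:.3f}")
```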
3.3. Feature Enhancement Block (FE)
ViT offers scalable architectures and excels at capturing global features. However, conventional vision transformer methods often use a 3 × 3 convolutional projection during the feature embedding phase, which, when adapted to signal classification, may overlook key signal characteristics, as detailed in VT-MCNet [26]. To improve the feature extraction process in transformer networks, especially for IQ signals whose characteristics vary continuously over time, we utilize TVConv (Translation Variant Convolution) [29] for the feature extraction component. This helps the model better adapt to the time-varying characteristics of IQ signals. TVConv introduces two primary units: the Affinity Map Unit (AMU) and the Weight-Generating Unit (WGU), which collectively enhance the network’s ability to adapt to spatial variance within the signal.
3.3.1. Affinity Map Unit (AMU)
The Affinity Map Unit (AMU) is a crucial component of TVConv, designed to capture the correlation between IQ signals at adjacent time instants. This unit generates learnable affinity maps, which help in understanding the contextual importance of different segments of the IQ signal. The process involves several key steps:
- Learnable Affinity Maps: These maps are trained to highlight the connections between various samples in the IQ signal. Similar to attention mechanisms used in transformers, these affinity maps reduce the computational overhead while capturing essential spatial relationships.
- Spatial Awareness: By focusing on relationships between pairs of time steps within the IQ sequence, the AMU can distinguish between different characteristics of the IQ signal, such as phase shifts, frequency changes, and amplitude variations, ensuring a nuanced understanding of the signal’s structure.
- Implementation Details: The AMU employs a learnable function f to compute the affinity between each pair of time steps within the IQ signal. This function takes into account the temporal position and signal characteristics, such as amplitude and phase, producing a matrix of affinities A.
Mathematically, the affinity maps A are generated as follows:
$$ A_{ij} = f(\mathbf{x}_i, \mathbf{x}_j; \theta_f), $$
where $\mathbf{x}_i$ and $\mathbf{x}_j$ are the feature vectors of the IQ sequence at time steps i and j, respectively, and $\theta_f$ represents the parameters of the learnable function f. The function f can be modeled using a neural network that takes the concatenated feature vectors of these time-step pairs and outputs a scalar affinity score.
The affinity map A then undergoes a normalization process, often using a softmax function to ensure that the affinities for each time step sum to one, providing a probabilistic interpretation.
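The following PyTorch sketch is our own hedged reading of the AMU description above (pairwise affinities produced by a small learnable network f, followed by a row-wise softmax), not the authors’ exact implementation; the class name and hidden width are assumptions.

```python
import torch
import torch.nn as nn

class AffinityMapUnit(nn.Module):
    """Computes a normalised affinity map A of shape (T, T) from an IQ feature sequence."""

    def __init__(self, feat_dim: int, hidden: int = 16):
        super().__init__()
        # f takes the concatenated feature vectors of a time-step pair (i, j)
        # and outputs a scalar affinity score.
        self.f = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, feat_dim) feature vectors of the IQ sequence.
        b, t, d = x.shape
        xi = x.unsqueeze(2).expand(b, t, t, d)      # features of time step i
        xj = x.unsqueeze(1).expand(b, t, t, d)      # features of time step j
        pair = torch.cat([xi, xj], dim=-1)          # (b, T, T, 2d): all time-step pairs
        a = self.f(pair).squeeze(-1)                # (b, T, T): raw affinity scores
        return a.softmax(dim=-1)                    # each row sums to one (probabilistic view)

amu = AffinityMapUnit(feat_dim=8)
A = amu(torch.randn(4, 32, 8))
print(A.shape, A[0].sum(dim=-1)[:3])                # (4, 32, 32), rows summing to ~1.0
```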
3.3.2. Weight-Generating Unit (WGU)
The Weight-Generating Unit (WGU) leverages the affinity maps produced by the AMU to generate sample-specific convolutional weights. This unit ensures that the convolution operation adapts dynamically to different segments of the IQ signal, enhancing feature extraction and improving overall performance. The WGU consists of several steps:
- Weight Generation: The WGU produces convolutional weights tailored to each spatial location, using the affinity maps to ensure these weights are contextually relevant. The process involves a learnable function g that maps the affinities to convolutional weights.
- Efficiency: To maintain computational efficiency, the weights can be precomputed and stored, allowing for fast retrieval during the inference phase.
- Implementation Details: The function g used in the WGU is designed to take the affinity maps and generate the appropriate weights for each convolutional filter. This function can be implemented using a neural network that outputs a set of weights for each time step based on the affinity values.
The weights W for the convolution operation are computed as:
$$ W = g(A; \theta_g), $$
where $\theta_g$ represents the parameters of the learnable function g. The function g transforms the normalized affinities into convolutional weights, ensuring that the convolution operation at each position is influenced by its spatial context.
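Analogously, the sketch below shows one plausible realization of a WGU: a small learnable network g maps each time step’s normalized affinity row to a per-position convolution kernel, which is then applied to the local window of the feature sequence. The kernel size, layer widths, and class name are illustrative assumptions, not the authors’ implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightGeneratingUnit(nn.Module):
    """Maps affinity maps to per-time-step convolutional weights and applies them."""

    def __init__(self, seq_len: int, channels: int, kernel_size: int = 3):
        super().__init__()
        self.channels, self.k = channels, kernel_size
        # g: affinity row of length T -> kernel weights for that time step.
        self.g = nn.Sequential(
            nn.Linear(seq_len, 64),
            nn.ReLU(),
            nn.Linear(64, channels * kernel_size),
        )

    def forward(self, x: torch.Tensor, affinity: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, T) features; affinity: (batch, T, T) from the AMU.
        b, c, t = x.shape
        w = self.g(affinity).view(b, t, c, self.k)          # per-position kernels
        patches = F.unfold(x.unsqueeze(-1), (self.k, 1),
                           padding=(self.k // 2, 0))        # (b, c*k, T): sliding windows
        patches = patches.view(b, c, self.k, t).permute(0, 3, 1, 2)  # (b, T, c, k)
        return (patches * w).sum(-1).permute(0, 2, 1)       # (b, c, T): position-specific conv

wgu = WeightGeneratingUnit(seq_len=32, channels=8)
y = wgu(torch.randn(4, 8, 32), torch.randn(4, 32, 32).softmax(-1))
print(y.shape)  # torch.Size([4, 8, 32])
```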
3.4. Transformer Encoder with SKAT Framework
3.4.1. Multi-Head Self-Attention
Multi-Head Self-Attention (MHSA) is a crucial component of the transformer architecture, representing each token as a weighted sum of all other tokens to capture intricate dependencies and relationships within the input sequence. For each element in the sequence, query, key, and value projections are first obtained:
$$ Q = XW_Q, \quad K = XW_K, \quad V = XW_V. $$
The attention scores are computed as:
$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V. $$
To capture different aspects of the sequence, multiple heads are computed in parallel and concatenated:
$$ \mathrm{MHSA}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W_O, \qquad \mathrm{head}_i = \mathrm{Attention}(XW_Q^{(i)}, XW_K^{(i)}, XW_V^{(i)}). $$
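For reference, the equations above correspond directly to PyTorch’s built-in multi-head attention; the embedding size, head count, and sequence length below are arbitrary.

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 64, 4
mhsa = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

tokens = torch.randn(8, 32, embed_dim)          # (batch, sequence length, embedding)
out, attn = mhsa(tokens, tokens, tokens)        # Q = K = V = tokens (self-attention)
print(out.shape, attn.shape)                    # (8, 32, 64) and head-averaged (8, 32, 32)
```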
3.4.2. SKAT Framework
The Skip-Attention enhances computational efficiency by reducing redundant computations in the self-attention mechanism. It leverages the high correlation in self-attention maps across different layers. Instead of computing self-attention in every layer, SKAT reuses representations from previous layers through a parametric function.
The SKAT parametric function approximates the output of skipped self-attention layers:
$$ \Phi(Z) = \mathrm{FC}_2\big(\mathrm{ECA}\big(\mathrm{DwC}\big(\mathrm{FC}_1(Z)\big)\big)\big), $$
where $\mathrm{FC}_1$ and $\mathrm{FC}_2$ are fully connected layers, DwC [33] represents depth-wise convolution, and ECA [34] denotes efficient channel attention.
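A hedged sketch of such a parametric function is given below: two fully connected layers wrapped around a depth-wise convolution and an ECA-style channel attention. The exact ordering of ECA relative to the second FC layer, as well as all widths and kernel sizes, are assumptions on our part rather than the authors’ implementation.

```python
import torch
import torch.nn as nn

class EfficientChannelAttention(nn.Module):
    """ECA-style gating: 1-D conv over pooled channel descriptors, sigmoid re-weighting."""
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                        # x: (batch, channels, tokens)
        y = x.mean(dim=-1, keepdim=True)         # global average over tokens
        y = torch.sigmoid(self.conv(y.transpose(-1, -2)).transpose(-1, -2))
        return x * y                             # channel-wise re-weighting

class SkatParametricFunction(nn.Module):
    """Phi(Z) = FC2(ECA(DwC(FC1(Z)))), approximating a skipped self-attention output."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.dwc = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)
        self.eca = EfficientChannelAttention()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, z):                        # z: (batch, tokens, dim)
        y = self.fc1(z).transpose(1, 2)          # (batch, hidden, tokens) for the convolution
        y = self.eca(self.dwc(y)).transpose(1, 2)
        return self.fc2(y)

phi = SkatParametricFunction(dim=64, hidden=128)
print(phi(torch.randn(8, 32, 64)).shape)         # torch.Size([8, 32, 64])
```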
3.4.3. Transformer Encoder with SKAT
The SKAT-modified transformer layer updates the representation as follows:
$$ \hat{Z}_l = \Phi(Z_{l-1}) + Z_{l-1}, \qquad Z_l = \mathrm{MLP}(\hat{Z}_l) + \hat{Z}_l, $$
where MLP denotes the multi-layer perceptron applied to the output.
The Transformer Encoder, enhanced by the SKAT framework, provides a robust and efficient method for processing sequential data. By leveraging the high correlation in self-attention maps, SKAT reduces redundant computations, thereby improving throughput and reducing FLOPs without compromising performance.
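To illustrate the skipping mechanism, the sketch below builds a simplified encoder stack in which the first layer computes full MHSA and subsequent layers replace it with a parametric function applied to the previous layer’s attention output. The stand-in parametric function, the choice of which layers skip attention, and the normalization details are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SkatEncoderLayer(nn.Module):
    """Encoder layer that can skip MHSA by reusing the previous layer's attention output."""

    def __init__(self, dim: int, heads: int, skip_attention: bool):
        super().__init__()
        self.skip_attention = skip_attention
        if skip_attention:
            # Stand-in for the FC-DwC-ECA-FC parametric function sketched above.
            self.phi = nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))
        else:
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z, prev_attn=None):
        if self.skip_attention:
            a = self.phi(prev_attn)                      # approximate the skipped MHSA output
        else:
            h = self.norm1(z)
            a, _ = self.attn(h, h, h)                    # standard multi-head self-attention
        z = z + a                                        # residual connection
        z = z + self.mlp(self.norm2(z))                  # MLP sub-block
        return z, a                                      # expose `a` for reuse by the next layer

# First layer computes attention; later layers skip it and reuse the approximation chain.
layers = nn.ModuleList([SkatEncoderLayer(64, 4, skip_attention=(i > 0)) for i in range(4)])
z, attn_out = torch.randn(8, 32, 64), None
for layer in layers:
    z, attn_out = layer(z, attn_out)
print(z.shape)  # torch.Size([8, 32, 64])
```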
4. Experimental Results and Discussion
4.1. Datasets
To evaluate the performance of the proposed method, we conduct experiments on the RadioML2016.10a [31], RadioML2016.10b, and RML22 [32] datasets. The datasets comprise digital modulation types widely used in wireless communications, including 8PSK, BPSK, CPFSK, GFSK, PAM4, 16QAM, 64QAM, and QPSK, together with analog modulation types such as WBFM, AM-DSB, and AM-SSB. Each sample consists of in-phase and quadrature (IQ) channels. The SNR ranges from −20 dB to 18 dB with an interval of 2 dB. Signals are modulated at a rate of eight samples per symbol. In addition, random walk drifting of the carrier frequency oscillator, additive white Gaussian noise (AWGN), and Rician fading of the channel impulse response are taken into account in the process of generating the signals. Furthermore, translation, dilation, and unknown scale are introduced when the signal is transmitted through harsh channels. All 11 modulation types are employed to form both the training and testing datasets, thereby ensuring a thorough evaluation of the models across the entire range of modulation types. The details of the experimental dataset are shown in Table 1.
Table 1.
The details of the experimental dataset.
All experiments are conducted on an Nvidia GeForce RTX 2080Ti GPU with 11 GB of memory, and the deep learning framework used for training is PyTorch. The training parameters are shown in Table 2. In the testing stage, to avoid the contingency caused by a single test, we use the average accuracy of 100 test experiments as the final evaluation indicator. The accuracy of a single test is defined as:
$$ \mathrm{Acc} = \frac{N_c}{N_a} \times 100\%, $$
where $N_c$ is the number of samples correctly classified and $N_a$ is the total number of samples.
Table 2.
The training parameters.
4.2. Analysis of the Feature Enhancement Module
4.2.1. Performance Comparisons with Different Parameters in the Weight-Generating Block
To evaluate the classification performance of FE-SKViT under different parameters, we examine the role of different components, including the depth (layers) and width (channels) of the weight-generating block B and the number of channels in the affinity maps A.
We run ablations under varied numbers of channels and inner layers for the weight-generating block B, as well as varied numbers of channels for the affinity maps A. Hyper-parameters highlighted in bold are used as our default setting elsewhere in the paper.
According to Table 3 and Table 4, the model generally performs better with wider layers in B or more channels in A. However, the performance drops slightly from the peak when the overparameterization goes too far, since a larger model may require more training iterations and can suffer from overfitting.
Table 3.
Ablation under a varied number of channels for the weight-generating block.
Table 4.
Ablation under a varied number of channels for the affinity maps.
From Table 5, the classification accuracy of FE-SKViT with five layers in the weight-generating block B ($B_{layers} = 5$) significantly outperforms the other layer settings. With $B_{layers} = 5$, FE-SKViT achieves the highest overall accuracy of 63.326% and the lowest minimum validation loss of 1.03231, balancing model complexity and efficiency. Increasing the layer count to 6, or decreasing it to 4, 3, 2, or 1, does not bring further gains in classification accuracy, indicating that five layers provide the best trade-off between accuracy and computational efficiency. In the following, we therefore set $B_{layers} = 5$ to construct FE-SKViT.
Table 5.
Ablation under a varied number of layers for the weight-generating block.
4.2.2. Recognition Performance by SNR and Modulation Type
We fix the hyper-parameters $B_{layers}$, $B_{chans}$, and $A_{chans}$ of FE-SKViT to 5, 64, and 4, respectively, and evaluate its performance by SNR and modulation type.
Based on a comparison of the confusion matrices of the two models, illustrated in Figure 5 and Figure 6, incorporating the translation variant block improves the model’s recognition rates for modulation schemes including QAM16, QAM64, WBFM, BPSK, and AM-SSB. However, the recognition rate for the AM-DSB modulation mode declines severely. Table 6 provides a detailed comparison of the experiments conducted at −2 dB.
Figure 5.
Overall normalized confusion matrix of ViT on RML2016.10a.
Figure 6.
Overall normalized confusion matrix of FE-ViT on RML2016.10a.
Table 6.
Recognition accuracy at −2 dB SNR of each algorithm for five modulations.
The data presented in Table 6 indicate that the FE-ViT model consistently outperforms the ViT model across most scenarios, achieving an overall recognition accuracy of 83.98%, compared to 80.56% for the ViT model. Notably, the FE-ViT model performs significantly better on the QAM16 and QAM64 modulation schemes, with accuracies of 84.50% and 90.19%, respectively, in contrast to the ViT model’s 74.00% and 80.88%. A substantial improvement is also observed for WBFM, where FE-ViT attains an accuracy of 54.50%, compared to just 31.00% for its counterpart. Both models perform identically on BPSK, each achieving an accuracy of 94.94%. The ViT model does surpass the FE-ViT model on AM-DSB, with an accuracy of 88.01% compared to 64.89%. However, as described in [32], an error in the generation of the analog information source in RML2016.10a means that the low recognition rate for AM-DSB holds little reference value.
In summary, while both models have their merits, the FE-ViT exhibits superior capabilities particularly concerning higher-order modulation schemes and frequency modulation.
Figure 7 illustrates the effects of a feature-enhanced module for these five modulation schemes.
Figure 7.
First channel of QAM16, QAM64, WBFM, BPSK, and AM-SSB signal samples (snr = −2 dB) after feature aggregation.
4.2.3. Feature Visualization
As shown in Figure 7, the initial IQ diagram of QAM16 displays significant noise and irregularity. Following enhancement, it reveals a smoother and more distinguishable signal pattern, and the power spectrum exhibits improved frequency resolution and clearer peaks. Quantitatively, the original IQ samples are confined to the range −0.02 to 0.02, whereas the enhanced signal spans −0.3 to 0.6 with substantially improved clarity. The original power spectrum presents broad and noisy peaks; after enhancement, sharper peaks appear at the critical frequencies and the background noise is diminished by approximately 60%.
In the original IQ diagram of QAM64, substantial noise and fluctuations hinder effective signal discrimination. After enhancement, the noise level decreases and clarity improves, although the inherent complexity of QAM64 remains a challenge. The peak in the power spectrum is better defined, with reduced noise levels overall. Numerically, the IQ samples, initially confined to −0.02 to +0.02, span −0.3 to +0.6 after enhancement; the mid-frequency peaks become more pronounced, and the background noise diminishes by about 55%.
The initial IQ plot for WBFM contains considerable noise and irregularities. In its enhanced version, this noise is significantly suppressed and clearer signal patterns emerge. The initially broad and noisy power spectrum becomes more concentrated after enhancement: the original IQ samples span −0.02 to +0.02, the enhanced signal spans −0.3 to +0.6, and the background noise, now concentrated around the central frequencies, is reduced by nearly 70%.
The initial BPSK IQ map shows elevated noise and irregular fluctuations; after enhancement, the noise level decreases and the resulting signal is more stable, while the previously broad spectrum with multiple noisy peaks evolves into a single, clearly defined peak. The IQ samples, originally spanning −0.02 to +0.02, lie between −0.3 and +0.6 after enhancement, and the spectral background noise is reduced by approximately 65%.
Lastly, the original IQ plot of AM-DSB reveals extensive noise and erratic patterns. The enhanced version demonstrates marked interference reduction, yielding coherent characteristics that aid identification, with the spectrum concentrated into a single prominent peak rather than the diffuse distribution seen earlier. The IQ samples, originally spanning −0.02 to +0.02, lie within −0.3 to +0.6 after enhancement, and the residual background noise is reduced by nearly 75%.
In summary, the feature enhancement module significantly improves signal quality across various modulation schemes. Enhanced IQ plots consistently show reduced noise and clearer signal patterns, and the enhanced power spectra exhibit better-defined peaks and reduced noise, indicating improved frequency resolution. This demonstrates the feature enhancement module’s effectiveness in improving the clarity and stability of IQ diagrams and power spectra across all tested modulation schemes, thereby aiding signal recognition.
4.3. Ablation Study
As shown in Table 7, FE-SKViT achieves the highest performance among the models. Its average accuracies are 63.31% on RML2016.10a, 65.05% on RML2016.10b, and 68.3% on RML22, making it the most accurate model, especially on the RML22 dataset, and highlighting its advantage over the other models. Subsequently, we standardized the model parameter configurations to enable a consistent comparison of model complexities across different architectures, as shown in Table 8 and Table 9.
Table 7.
The performances of ablation experiments on the three datasets.
Table 8.
The model parameters set for comparing the complexity of different models.
Table 9.
Complexity comparison of different models.
The integration of the SKAT module in FE-SKViT results in a notable increase in memory consumption compared to FE-ViT, escalating from 11.76 MB to 35.44 MB. This increase can be attributed to the additional intermediate computations and storage demands imposed by the SKAT module. The inference time, however, remains comparable (0.0114 s versus 0.0130 s), because SKAT bypasses redundant computations by reusing attention maps and reducing the number of times the expensive self-attention mechanism must be recalculated. As a result, the trade-off between increased memory consumption and reduced attention computation keeps the overall inference cost acceptable.
4.4. Performance Comparisons to the State-of-the-Art
To evaluate the performance of the FE-SKViT presented in this paper, we compare it against three representative methods: the adaptive wavelet network (AWN) [35], the deep residual network (ResNet18) [36], and the complex-valued depth-wise separable convolutional neural network (CDSCNN) [37]. The experimental results under different feature extraction modules are illustrated in Figure 8, Figure 9 and Figure 10. Table 10 summarizes the feature extraction module structures of the compared methods.
Figure 8.
Recognition accuracy comparison over different SNR levels on RML2016.10a.
Figure 9.
Recognition accuracy comparison over different SNR levels on RML2016.10b.
Figure 10.
Recognition accuracy comparison over different SNR levels on RML22.
Table 10.
Several different feature extraction module structures.
As shown in Table 10, the ResNet18 model includes multiple ResBlocks, each containing a 2D convolution (Conv2d) with a kernel size of 3 × 3, BatchNorm (BN) for normalization to speed up convergence, ReLU as the activation function, and a MaxPooling layer (MaxPool2D) with a kernel size of 2 × 2 to reduce data dimensions. The ResNet18 architecture is composed of these ResBlocks followed by a Flatten layer that transforms the final feature maps into a one-dimensional array for classification.
The CDSCNN architecture employs depth-wise separable convolutions to efficiently extract features. Each convolutional layer includes a depth-wise Conv2d followed by a point-wise Conv2d, both using a kernel size of 3 × 3. BatchNorm (BN) is applied after each convolution for normalization, and ReLU serves as the activation function. MaxPooling layers (MaxPool2D) with a kernel size of 2 × 2 are used for dimension reduction. The network ends with a Flatten layer to convert the data into a suitable format for the fully connected layers.
The AWN integrates adaptive wavelet decomposition with channel attention mechanisms. Initially, the network comprises Conv2d layers with a kernel size of 3 × 3, followed by BatchNorm (BN) for normalization and ReLU activation functions. MaxPooling layers (MaxPool2D) with a kernel size of 2 × 2 downsample the data. The adaptive wavelet decomposition further refines feature extraction across multiple frequency bands. An attention mechanism is applied to emphasize significant features, and a Flatten layer prepares the data for the final classification stages.
The FE-SKViT architecture is illustrated in Figure 2.
We fix the hyper-parameters $B_{layers}$, $B_{chans}$, and $A_{chans}$ of FE-SKViT to 5, 64, and 4, respectively, which offers a good balance between accuracy and complexity according to Table 7, and evaluate its performance by SNR and modulation type. The recognition accuracy comparison of FE-SKViT and the baseline methods is shown in Table 11.
Table 11.
Recognition accuracy.
According to Figure 8, on RML2016.10a, FE-SKViT performs slightly worse than some baseline methods when the SNR is below −8 dB, but performs better above −8 dB, especially in the −6 to 6 dB SNR range. For SNR values above 6 dB, FE-SKViT’s performance is comparable to that of the baseline methods. Overall, FE-SKViT outperforms the baseline methods by 2.92%, 2.49%, and 1.05% in average accuracy.
According to Figure 9, on RML2016.10b, FE-SKViT obtains better recognition accuracy than the baseline methods over almost the entire −16 to 6 dB SNR range, and the advantage is most evident at −4 to 6 dB SNRs. Overall, FE-SKViT outperforms the baseline methods by 0.86%, 1.53%, and 0.87% in average accuracy.
According to Figure 10, on RML22, FE-SKViT performs slightly worse than some baseline methods when the SNR is below −10 dB, but consistently performs better above −10 dB, especially in the 2 to 20 dB SNR range. Overall, FE-SKViT outperforms the baseline methods by 0.86%, 1.53%, and 0.87% in average accuracy.
In summary, FE-SKViT achieves state-of-the-art recognition performance on RML2016.10a, RML2016.10b, and RML22.
5. Conclusions
In this paper, to address the problem of automatic modulation recognition of communication signals under low SNR conditions, we propose FE-SKViT, a novel model that extracts local features of signals. The FE component handles large intra-signal variance and small cross-signal variance by employing learnable affinity maps and a weight-generating block, while the SKAT component improves efficiency by skipping redundant self-attention computations across layers. We validated the performance of FE-SKViT on three benchmark datasets, i.e., RML2016.10a, RML2016.10b, and RML22, achieving recognition accuracy higher than several current mainstream AMR models together with robustness to noise. Furthermore, we designed a series of ablation studies and a visualization analysis to demonstrate the effectiveness of the FE-SKViT model. However, integrating the SKAT component yields higher recognition accuracy at the cost of higher computational and storage demands, and the model’s recognition accuracy does not exhibit significant improvement under high SNR conditions. Future work will focus on feature extraction, such as Multi-Frequency Octave [38], Masked Signal Feature Extractor [39], and other approaches [40,41], to improve performance in actual scenarios. In addition, reducing the complexity of the proposed model is also a direction worth exploring.
Author Contributions
Conceptualization, G.Z. and B.Z.; methodology, G.Z., B.Z. and P.Y.; software, G.Z. and W.Z.; validation, G.Z., B.Z., W.Z. and B.L.; formal analysis, W.Z. and B.L.; investigation, G.Z., B.Z. and P.Y.; resources, B.Z., W.Z. and B.L.; data curation, B.L.; writing—original draft preparation, G.Z. and B.Z.; writing—review and editing, G.Z. and B.Z.; supervision, B.Z., P.Y. and W.Z.; project administration, B.Z. and P.Y.; funding acquisition, B.Z. and P.Y. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
In this paper, the RadioML2016.10a, RadioML2016.10b, and RadioML22 datasets are employed for experimental verification. The RadioML2016.10a dataset is a representative dataset for the testing and evaluation of current AMR methods. Readers can obtain the dataset from the author by email (zgy@stu.xidian.edu.cn).
Acknowledgments
I would like to acknowledge our colleagues for their wonderful collaboration and patient support. I also thank all the reviewers and editors for their great help and useful suggestions.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Wang, J.; Liu, X.; Zhang, Y.; Chen, H. Deep Learning Based Automatic Modulation Recognition: Models, Datasets, and Challenges. arXiv 2022, arXiv:2207.09647. [Google Scholar]
- Xia, H. Cellular signal identification using convolutional neural networks: AWGN and Rayleigh fading channels. In Proceedings of the 2019 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN), Newark, NJ, USA, 11–14 November 2019. [Google Scholar]
- Peng, S.; Sun, S.; Yao, Y.D. A survey of modulation classification using deep learning: Signal representation and data preprocessing. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 7020–7038. [Google Scholar] [CrossRef] [PubMed]
- Jiang, X.R.; Chen, H.; Zhao, Y.D.; Wang, W.Q. Automatic modulation recognition based on mixed-type features. Int. J. Electron. 2021, 108, 105–114. [Google Scholar] [CrossRef]
- Zhang, D.; Lu, Y.; Li, Y.; Ding, W.; Zhang, B.; Xiao, J. Frequency Learning Attention Networks based on Deep Learning for Automatic Modulation Classification in Wireless Communication. Pattern Recognit. 2023, 137, 109345. [Google Scholar] [CrossRef]
- Xu, J.L.; Su, W.; Zhou, M. Likelihood-ratio approaches to automatic modulation classification. IEEE Trans. Syst., Man, Cybern. C Appl. Rev. 2011, 41, 455–469. [Google Scholar] [CrossRef]
- Al-Sa’d, M.; Boashash, B.; Gabbouj, M. Design of an optimal piece-wise spline wigner-ville distribution for TFD performance 365 evaluation and comparison. IEEE Trans. Signal Process 2021, 69, 3963–3976. [Google Scholar] [CrossRef]
- Abd-Elaziz, O.F.; Abdalla, M.; Elsayed, R.A. Deep learning-based automatic modulation classification using robust CNN architecture for cognitive radio networks. Sensors 2023, 23, 9467. [Google Scholar] [CrossRef]
- Wang, X.; Li, Y.; Zhang, P.; Wang, Z. RNN-based melody generation using sequence-to-sequence learning. IEEE Access 2019, 7, 165346–165356. [Google Scholar]
- Sainath, T.N.; Vinyals, O.; Senior, A.; Sak, H. Convolutional, long short-term memory, fully connected deep neural networks. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 4580–4584. [Google Scholar]
- Tu, Y.; Lin, Y. Deep neural network compression technique towards efficient digital signal modulation recognition in edge device. IEEE Access 2019, 7, 58113–58119. [Google Scholar] [CrossRef]
- Lee, S.H.; Kim, K.-Y.; Kim, J.H.; Shin, Y. Effective feature-based automatic modulation classification method using DNN algorithm. In Proceedings of the 2019 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Okinawa, Japan, 11–13 February 2019; pp. 557–559. [Google Scholar]
- Shi, J.; Hong, S.; Cai, C.; Wang, Y.; Huang, H.; Gui, G. Deep learning based automatic modulation recognition method in the presence of phase offset. IEEE Access 2020, 8, 42841–42847. [Google Scholar] [CrossRef]
- Hiremath, S.M.; Behura, S.; Kedia, S.; Deshmukh, S.; Patra, S.K. Deep learning-based modulation classification using time and stockwell domain channeling. In Proceedings of the 2019 National Conference on Communications (NCC), Bangalore, India, 20–23 February 2019; pp. 1–6. [Google Scholar]
- Peng, S.; Jiang, H.; Wang, H.; Alwageed, H.; Zhou, Y.; Sebdani, M.M.; Yao, Y.D. Modulation classification based on signal constellation diagrams and deep learning. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 718–727. [Google Scholar] [CrossRef] [PubMed]
- Wang, D.; Zhang, M.; Li, Z.; Li, J.; Fu, M.; Cui, Y.; Chen, X. Modulation format recognition and OSNR estimation using CNN-based deep learning. IEEE Photon. Technol. Lett. 2017, 29, 1667–1670. [Google Scholar] [CrossRef]
- Li, Y.; Shao, G.; Wang, B. Automatic modulation classification based on bispectrum and CNN. In Proceedings of the 2019 IEEE 8th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), Chongqing, China, 24–26 May 2019; pp. 311–316. [Google Scholar]
- Mendis, G.J.; Wei, J.; Madanayake, A. Deep learning-based automated modulation classification for cognitive radio. In Proceedings of the 2016 IEEE International Conference on Communication Systems (ICCS), Shenzhen, China, 14–16 December 2016; pp. 1–6. [Google Scholar]
- Huang, S.; Jiang, Y.; Gao, Y.; Feng, Z.; Zhang, P. Automatic modulation classification using contrastive fully convolutional network. IEEE Wireless Commun. Lett. 2019, 8, 1044–1047. [Google Scholar] [CrossRef]
- Hu, S.; Pei, Y.; Liang, P.P.; Liang, Y.-C. Deep neural network for robust modulation classification under uncertain noise conditions. IEEE Trans. Veh. Technol. 2020, 69, 564–577. [Google Scholar] [CrossRef]
- Liu, Y.; Zhang, Y.; Wang, Z. A novel deep learning automatic modulation classifier with fusion of multichannel information using GRU. EURASIP J. Wirel. Commun. Netw. 2023, 2023, 66. [Google Scholar]
- Graves, A.; Fernández, S.; Schmidhuber, J. Bidirectional LSTM networks for improved phoneme classification and recognition. In Proceedings of the 15th International Conference on Artificial Neural Networks (ICANN), Warsaw, Poland, 11–15 September 2005; pp. 799–804. [Google Scholar]
- Chen, Y.; Li, Q.; Zhao, L.; Wang, M. Enhancing Automatic Modulation Recognition for IoT Applications Using Transformers. arXiv 2023, arXiv:2403.15417. [Google Scholar]
- Chen, Y.; Dong, B.; Liu, C.; Xiong, W.; Li, S. Abandon locality: Framewise embedding aided transformer for automatic modulation recognition. IEEE Commun. Lett. 2023, 27, 327–331. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Dao, T.-T.; Noh, D.-I.; Pham, Q.-V.; Hasegawa, M.; Sekiya, H.; Hwang, W.-J. VT-MCNet: High-Accuracy Automatic Modulation Classification Model Based on Vision Transformer. IEEE Commun. Lett. 2024, 28, 98–102. [Google Scholar] [CrossRef]
- Worrall, D.E.; Garbin, S.J.; Turmukhambetov, D.; Brostow, G.J. Harmonic networks: Deep translation and rotation equivariance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5028–5037. [Google Scholar]
- Li, W.; Deng, W.; Wang, K.; You, L.; Huang, Z. A Complex-Valued Transformer for Automatic Modulation Recognition. IEEE Internet Things J. 2024, 11, 22197–22207. [Google Scholar] [CrossRef]
- Chen, J.; He, T.; Zhuo, W.; Ma, L.; Ha, S.; Chan, S.H.G. Tvconv: Efficient translation variant convolution for layout-aware visual processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12548–12558. [Google Scholar]
- Venkataramanan, S.; Ghodrati, A.; Asano, Y.M.; Porikli, F.; Habibian, A. Skip-attention: Improving vision transformers by paying less attention. arXiv 2023, arXiv:2301.02240. [Google Scholar]
- O’shea, T.J.; West, N. Radio machine learning dataset generation with GNU radio. In Proceedings of the 6th GNU Radio Conference, Boulder, CO, USA, 12–16 September 2016; pp. 1–6. [Google Scholar]
- Sathyanarayanan, V.; Gerstoft, P.; Gamal, A.E. RML22: Realistic Dataset Generation for Wireless Modulation Classification. IEEE Trans. Wirel. Commun. 2023, 22, 7663–7675. [Google Scholar] [CrossRef]
- Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
- Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; Volume 22, pp. 11534–11542. [Google Scholar]
- Zhang, J.; Wang, T.; Feng, Z.; Yang, S. Toward the Automatic Modulation Classification With Adaptive Wavelet Network. IEEE Trans. Cogn. Commun. Netw. 2023, 9, 549–563. [Google Scholar] [CrossRef]
- Liang, Z.; Tao, M.; Wang, L.; Su, J.; Yang, X. Automatic modulation recognition based on adaptive attention mechanism and ResNeXt WSL model. IEEE Commun. Lett. 2021, 25, 2953–2957. [Google Scholar] [CrossRef]
- Xiao, C.; Yang, S.; Feng, Z. Complex-valued Depth-wise Separable Convolutional Neural Network for Automatic Modulation Classification. IEEE Trans. Instrum. Meas. 2023, 72, 1–10. [Google Scholar]
- Hao, X.; Feng, Z.; Yang, S.; Wang, M.; Jiao, L. Automatic Modulation Classification via Meta-Learning. IEEE Internet Things J. 2023, 10, 12276–12292. [Google Scholar] [CrossRef]
- Wang, S.; Xing, H.; Wang, C.; Zhou, H.; Hou, B.; Jiao, L. SigDA: A Superimposed Domain Adaptation Framework for Automatic Modulation Classification. IEEE Trans. Wirel. Commun. 2024, 23, 13159–13172. [Google Scholar] [CrossRef]
- Chen, Y.; Shao, W.; Liu, J.; Yu, L.; Qian, Z. Automatic modulation classification scheme based on LSTM With random erasing and attention mechanism. IEEE Access 2020, 8, 154290–154300. [Google Scholar] [CrossRef]
- Huynh-The, T.; Pham, Q.-V.; Nguyen, T.-V.; Nguyen, T.T.; Costa, D.B.D.; Kim, D.-S. RanNet: Learning residual-attention structure in CNNsfor automatic modulation classification. IEEE Wirel. Commun. Lett. 2022, 11, 1243–1247. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).