Hybrid Data Augmentation and Dual-Stream Spatiotemporal Fusion Neural Network for Automatic Modulation Classification in Drone Communications

Abstract: Automatic modulation classification (AMC) is one of the most important technologies in various communication systems, including drone communications. It can be applied to confirm the legitimacy of access devices, help drone systems better identify and track signals from other communication devices, and prevent drone interference to ensure the safety and reliability of communication. However, the classification performance of previously proposed AMC approaches still needs to be improved. In this study, a dual-stream spatiotemporal fusion neural network (DSSFNN)-based AMC approach is proposed to enhance the classification accuracy in support of drone communication. DSSFNN can effectively mine spatiotemporal features from modulation signals through residual modules, long short-term memory (LSTM) modules, and attention mechanisms. In addition, a novel hybrid data augmentation method based on phase shift and self-perturbation is introduced to further improve performance and avoid overfitting. The experimental results demonstrate that the proposed AMC approach achieves an average classification accuracy of 63.44% and a maximum accuracy of 95.01% at SNR = 10 dB, which outperforms the previously proposed methods.


Introduction
Automatic modulation classification (AMC) is the process of identifying the modulation types of communication signals. It has been widely applied in various communication systems to enhance communication efficiency, ensure security, and enable safe and reliable drone operations [1-4]. In drone communications, it also plays an important role in distinguishing signals between drones, detecting unauthorized devices or signals, enabling automated control for optimal communication performance, and so on [5,6].
Traditional AMC methods can be classified into two categories: likelihood-based (LB) methods and feature-based (FB) methods. LB methods [7,8] typically involve significant computational complexity or require prior knowledge about the channel or noise [9]. FB methods [10,11] are based on expert features derived from time-frequency analysis and statistical theory. However, with these methods it is difficult to process massive signal samples in parallel, and the classification accuracy does not meet expectations. Moreover, with the rapid development of communication technologies, the electromagnetic environment has become

| Related Works | DNN | Strength | Weakness |
|---|---|---|---|
| Shea et al. [28] | CNN | Innovative use of deep learning for modulation recognition achieved significant accuracy improvement compared to traditional methods. | Only explored the application of CNN for modulation recognition. |
| Ramjee et al. [29] | CLDNN | Both time and spatial features were extracted from IQ signals, resulting in a more diverse feature set. | Using only IQ data for feature extraction results in insufficiently diverse feature sets. |
| | ResNet | Further exploration was conducted on top of CNN. | The time feature of IQ data was neglected. |
| | LSTM | RNNs were utilized to extract time information from IQ signals for modulation recognition. | The absence of convolutional neural networks (CNNs) for spatial feature extraction is a limitation. |
| Zhang et al. [9] | CNN-LSTM | Features were extracted separately from IQ and AP data for modulation recognition. | Extracting temporal features from the spatial features extracted by CNN may have an impact on the accuracy of modulation recognition. |
| Liao et al. [30] | SCRNN | The accuracy of the model is ensured while reducing the required training time. | Extracting features solely from IQ data limits the diversity of the features. |
| Chang et al. [31] | MLDNN | Explored the interaction features and temporal information of both IQ and AP data, which resulted in a significantly improved accuracy rate. | The WBFM modulation scheme is highly susceptible to misclassification as AM-DSB. |
| | CGDNN2 | The parameter estimator and the parameter transformer were introduced, resulting in a significant reduction in the model's parameter count. | The design of the model architecture lacks significant innovation, impeding the extraction of improved features. |

In this study, we propose an AMC method using hybrid data augmentation and a dual-stream spatiotemporal fusion neural network (DSSFNN). The former expands the training samples to prevent model overfitting, while the latter is a parallel architecture that extracts spatiotemporal features for high classification performance. In detail, the spatial feature extraction branch is responsible for in-phase/quadrature (IQ) data, while the temporal feature extraction branch is designed for amplitude/phase (AP) data. The features extracted from these branches are fused for modulation classification. The contributions of the paper are listed as follows:

• We propose a hybrid data augmentation method based on phase shift and self-perturbation, which can effectively expand training samples without introducing additional information.
• We propose a DSSFNN structure for AMC, which can extract features from both the spatial and temporal dimensions of data. Compared to the single-dimensional feature extraction method, the features extracted by DSSFNN are more diverse and effective, which improves the accuracy of AMC.

The remaining parts of this paper are as follows: Section 2 elaborates on the problem formulation. In Section 3, a detailed description of the proposed AMC method is provided, including the data augmentation technique and the DSSFNN architecture. Section 4 presents the simulation results and analysis. Finally, in Section 5, the conclusions drawn from the study are presented, highlighting the contributions of the proposed method.

Signal Model
The complex baseband signal model [32] can be represented, without loss of generality, as follows:

$$x(t) = s(t) * h(t) + n(t)$$

where $x(t)$ is the received signal, $s(t)$ is the modulated signal, $h(t)$ is the channel, $*$ denotes convolution, and $n(t)$ is zero-mean complex AWGN with a bilateral power spectral density of $N_0/2$ [32].

Dual-Stream Data
In this paper, IQ signals along with the corresponding AP data were used as the training data. In general, signal reception equipment can be used to receive signals in a communication channel and store them in the format of IQ data. The AP data can then be obtained from the IQ data using mathematical formulas. This approach is commonly used in the field of digital signal processing for wireless communication systems. The model for storing signal data in IQ format is shown as follows:

$$x_{IQ} = \begin{bmatrix} x_I[1] & x_I[2] & \cdots & x_I[L] \\ x_Q[1] & x_Q[2] & \cdots & x_Q[L] \end{bmatrix}$$

where $x_I[i]$ and $x_Q[i]$ represent the real and imaginary parts of the $i$-th sample of the IQ signal, respectively, and $L$ is the number of samples. By decomposing IQ data into in-phase and quadrature components and calculating their amplitudes and phases, one can obtain the corresponding AP data. This process can be represented as:

$$A_i = \sqrt{I_i^2 + Q_i^2}, \qquad \phi_i = \arctan\left(\frac{Q_i}{I_i}\right)$$

where $I_i$ and $Q_i$ represent the real and imaginary parts of the IQ signal, $A_i$ represents the amplitude of the $i$-th sample in the AP data, and $\phi_i$ represents the phase of the $i$-th sample in the AP data.
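The IQ-to-AP conversion described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the function name `iq_to_ap` is hypothetical, and the use of `atan2` (which resolves the quadrant that a plain arctangent of $Q_i/I_i$ cannot) is an assumption.

```python
import math

def iq_to_ap(i_samples, q_samples):
    """Convert in-phase/quadrature sequences into amplitude/phase sequences."""
    if len(i_samples) != len(q_samples):
        raise ValueError("I and Q sequences must have the same length")
    # Amplitude: A_i = sqrt(I_i^2 + Q_i^2)
    amplitude = [math.hypot(i, q) for i, q in zip(i_samples, q_samples)]
    # Phase: atan2 is an assumption here; it keeps the correct quadrant
    # and avoids division by zero when I_i == 0.
    phase = [math.atan2(q, i) for i, q in zip(i_samples, q_samples)]
    return amplitude, phase

# Three toy samples: 1 + 0j, 0 + 2j, -1 + 0j
amp, ph = iq_to_ap([1.0, 0.0, -1.0], [0.0, 2.0, 0.0])
```

For the toy input, the amplitudes are 1, 2, and 1, and the phases are 0, π/2, and π.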

Problem Description
Modulation classification is the process of determining, from the sampled signal sequence, which of $N$ candidate modulation schemes a received signal uses. The deep-learning-based modulation classification scheme can be represented as [32]:

$$\hat{M} = \arg\max_{M_i \in \mathcal{M}} f(M_i \mid x; W)$$

where $\hat{M}$ represents the predicted modulation type, $M_i$ represents the $i$-th candidate modulation type, $\mathcal{M}$ represents the set of modulation schemes [32], $f(W)$ represents the deep learning model that maps the signal sample $x$ to a modulation type, and $W$ represents the parameter weights of the model. The deep-learning-based modulation classification scheme can therefore be simplified to the task of obtaining a high-precision deep learning model $f(W)$.
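As a small illustration of the decision rule, the sketch below selects the scheme with the largest model score. The scheme names and score values are hypothetical placeholders, not outputs of the proposed model.

```python
def classify(scores, schemes):
    """Return the modulation scheme whose model score is the largest."""
    if len(scores) != len(schemes):
        raise ValueError("one score per candidate scheme is required")
    best = max(range(len(scores)), key=lambda k: scores[k])
    return schemes[best]

schemes = ["BPSK", "QPSK", "QAM16"]  # illustrative subset of the scheme set
scores = [0.1, 0.7, 0.2]             # hypothetical model outputs f(M_i | x; W)
predicted = classify(scores, schemes)
```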

The Framework of the Proposed Method
Our proposed robust AMC method based on hybrid data augmentation and a dual-stream spatiotemporal fusion neural network is illustrated in Figure 1. The proposed scheme consists of three key parts: hybrid data augmentation, AP information extraction, and the dual-stream spatiotemporal fusion neural network. The hybrid data augmentation includes phase transformation data augmentation and self-perturbation data augmentation. The IQ data are fed into the model as the raw input and first pass through the hybrid data augmentation part, which increases the amount of data and makes the data features richer and more diverse. After that, amplitude and phase information is extracted from the augmented data. The augmented IQ data and the resulting AP data are then fed into the dual-stream spatiotemporal fusion neural network for modulation classification: the IQ data go to the spatial feature extraction branch, and the AP data go to the temporal feature extraction branch. Finally, the spatial and temporal features are fused and fed into a fully connected layer for classification.

Hybrid Data Augmentation
In real-world scenarios, due to the complex electromagnetic environment, the signals received by the receiver are often not as satisfactory as expected. At the same time, deep learning models often fail to extract good features and are prone to overfitting when training samples are insufficient. To enhance the robustness and generalization ability of the trained deep learning model, a data augmentation algorithm is proposed in this paper. The proposed algorithm first performs phase transformations on the original data to generate data samples at different phases, thereby increasing the quantity of the training data and reducing the risk of overfitting. Next, the augmented data are subjected to self-perturbation data augmentation, a method that enhances data diversity and helps the model learn different features.

Phase-Shift Data Augmentation
Phase transformation is a simple and effective data augmentation method in the field of modulation classification. By varying the phase angle, data can be obtained at different phase angles, thereby achieving the purpose of data augmentation. The phase-shift data augmentation process can be represented as [33]:

$$\begin{bmatrix} R(\hat{x}) \\ L(\hat{x}) \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} R(x) \\ L(x) \end{bmatrix}$$

where $x$ denotes the original data, $\hat{x}$ denotes the augmented data, $R(\cdot)$ and $L(\cdot)$ represent the operations that take the real and imaginary parts, and $\theta$ takes the values $0$, $\pi/2$, $\pi$, and $3\pi/2$.
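A minimal sketch of phase-shift augmentation is given below, assuming the shift is applied as a plane rotation of each (I, Q) pair. The function names are hypothetical, and the rotation direction is an assumption.

```python
import math

def phase_shift(i_samples, q_samples, theta):
    """Rotate every (I, Q) pair by the angle theta (in radians)."""
    c, s = math.cos(theta), math.sin(theta)
    i_rot = [c * i - s * q for i, q in zip(i_samples, q_samples)]
    q_rot = [s * i + c * q for i, q in zip(i_samples, q_samples)]
    return i_rot, q_rot

def augment_with_phase_shifts(i_samples, q_samples):
    """Return one rotated copy per angle in {0, pi/2, pi, 3*pi/2}."""
    angles = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]
    return [phase_shift(i_samples, q_samples, t) for t in angles]

# Two toy samples: 1 + 0j and 0 + 1j
views = augment_with_phase_shifts([1.0, 0.0], [0.0, 1.0])
```

Each input sample yields four views, so the training set grows fourfold. Because a rotation preserves amplitude, only the phase component of the derived AP data changes.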

Self-Perturbation Data Augmentation
Data augmentation through self-perturbation refers to the process of randomly cropping a portion of the data and then splicing it back into the remaining data. Let $D$ denote the data to be augmented, $D_{cut}$ the data remaining after cropping, $|D_{cut}|$ the length of the remaining data, and $s$ the random segment taken from the original data. The process of self-perturbation data augmentation can be represented as follows:

$$D_{OutPut} = D_{cut}[0:p] \oplus s \oplus D_{cut}[p:|D_{cut}|]$$

where $D_{OutPut}$ represents the output of the self-perturbation data augmentation algorithm, $p$ represents a random position within $D_{cut}$, and $\oplus$ denotes concatenation. The self-perturbation algorithm crops part of a sequence and reinserts it at a random position, which expands the data while enriching its features in automatic modulation classification. This approach has the potential to improve the model's generalization performance, that is, its ability to perform well on data outside the training set. Some advantages of this algorithm include:

• Enhancing data diversity: The self-perturbation algorithm enhances the robustness and generalization ability of a model by adding new variations to the dataset. This increase in data diversity can enable the model to better capture the features of the dataset and improve its accuracy.
• Reducing overfitting: Overfitting is a common problem in machine learning, and the self-perturbation algorithm can reduce the risk of overfitting by increasing the size of the dataset. Training the model on more data helps it learn the true distribution of the dataset.
• Simplicity and ease of implementation: The self-perturbation algorithm is relatively simple to implement, requiring only a small amount of manipulation of the original data. Compared to more complex data augmentation techniques, it has lower implementation costs and higher practicality.
• No introduction of additional information: The self-perturbation algorithm achieves data augmentation by cropping parts of the original data and then splicing them together. This approach ensures that no additional information is introduced into the data. In contrast, adding noise as a form of data augmentation introduces additional information that may sometimes affect classification performance.
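The crop-and-splice procedure described in this section can be sketched as follows. The source does not state how long the cropped segment is, so `segment_len` is a hypothetical parameter, and `self_perturb` is an illustrative name.

```python
import random

def self_perturb(data, segment_len, rng=random):
    """Cut a random segment out of data and splice it back at a random position."""
    if not 0 < segment_len < len(data):
        raise ValueError("segment_len must be between 1 and len(data) - 1")
    start = rng.randrange(len(data) - segment_len + 1)
    segment = data[start:start + segment_len]            # s
    d_cut = data[:start] + data[start + segment_len:]    # D_cut
    p = rng.randrange(len(d_cut) + 1)                    # random position in D_cut
    return d_cut[:p] + segment + d_cut[p:]               # D_OutPut

rng = random.Random(0)            # seeded only to make the example repeatable
original = list(range(10))
perturbed = self_perturb(original, 3, rng)
```

The output has the same length as the input and contains exactly the same values, which reflects the claim that no additional information is introduced. For IQ data, one option is to apply the function to a list of (I, Q) pairs so that both components stay aligned in time.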

Spatial Feature Extraction Module for IQ Data
The proposed AMC method includes a spatial feature extraction module for IQ data, which is based on ResNeXt [34] and a self-attention mechanism [35]. ResNeXt processes the real and imaginary parts of the IQ signal to extract features from the spatial dimension of the IQ data [34]. The self-attention mechanism weights the feature maps to enhance the discriminability of the features [35]. This module plays a crucial role in the overall AMC method, as it extracts informative features from the IQ data, which are then used for modulation classification. ResNeXt can be represented as:

$$H(x) = F(x) + x$$

where $x$ denotes the input data, $H(x)$ represents the mapping function, and $F(x)$ refers to the residual block. The residual block can be expressed as:

$$H(x) = F(x, \{W_i\}) + x$$

In the residual block represented above, $F(x, \{W_i\})$ denotes the mapping function, and $W_i$ represents the weight parameters. To improve the performance and efficiency of the network, ResNeXt introduces grouped convolutions into the residual block $F(x, \{W_i\})$. The input data $x$ are divided into several groups with the same number of channels; then, a convolution operation is performed on each group of data. Finally, the convolution results of each group are merged. Grouped convolutions can be expressed as:

$$Y_{i,j} = \mathrm{Group}(X)_i * K_{i,j}$$

where $X$ represents the input data, $Y$ represents the output data, $K$ denotes the convolution kernel, $\mathrm{Group}$ denotes the partitioning of the input data into multiple groups, $i$ denotes the $i$-th group, and $j$ denotes the channel within each group [34].
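To make the grouped convolution and the shortcut connection concrete, the following pure-Python sketch splits the input channels into groups, convolves each group independently, concatenates the results, and adds the input back. It is a toy illustration under stated assumptions: kernel size 1 is chosen so that the output shape matches the input, and all names and kernel values are hypothetical.

```python
def conv1d_valid(channels, kernels):
    """'Valid' 1-D cross-correlation summed over the input channels."""
    k = len(kernels[0])
    out_len = len(channels[0]) - k + 1
    return [
        sum(kernels[c][j] * channels[c][t + j]
            for c in range(len(channels)) for j in range(k))
        for t in range(out_len)
    ]

def grouped_conv1d(x, group_kernels):
    """x: list of channels. group_kernels[g]: filters applied to group g only."""
    groups = len(group_kernels)
    if len(x) % groups:
        raise ValueError("channel count must be divisible by the number of groups")
    per_group = len(x) // groups
    out = []
    for g, filters in enumerate(group_kernels):
        x_g = x[g * per_group:(g + 1) * per_group]   # channels of group g
        for f in filters:                            # one output channel per filter
            out.append(conv1d_valid(x_g, f))
    return out                                       # merged group outputs

def residual(x, fx):
    """Shortcut connection: H(x) = F(x) + x, element by element."""
    return [[a + b for a, b in zip(xc, fc)] for xc, fc in zip(x, fx)]

x = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]   # 4 channels, length 3
group_kernels = [
    [[[1], [0]], [[0], [1]]],    # group 0: pass through channel 0, then channel 1
    [[[1], [1]], [[1], [-1]]],   # group 1: sum and difference of channels 2 and 3
]
fx = grouped_conv1d(x, group_kernels)
y = residual(x, fx)
```

Each filter sees only the channels of its own group, which is what reduces the parameter count relative to an ordinary convolution over all channels.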
At the final stage of the spatial feature extraction module, a self-attention mechanism is employed [35]. The extracted features are fed into the self-attention mechanism, which computes weights that emphasize the more salient features. The calculation can be divided into three parts: computing the attention scores, computing the weighted sum, and computing the output. For any element $x_i$ in the input data, its attention scores $a_i$ with respect to the other elements can be expressed as:

$$a_i = \mathrm{softmax}\left(\frac{q_i K^{T}}{\sqrt{d_k}}\right)$$

where $q_i$ and the rows of $K$, each in $\mathbb{R}^{d_k}$, are the query and keys derived from the elements of the input sequence, and $d_k$ represents their dimensionality. The attention score $a_i$ represents the relevance between the $i$-th element in the input sequence and the other elements, and the softmax function normalizes the attention scores. Using the attention scores $a_i$, the elements of the input data are weighted and summed to obtain a weighted sum $z$, which can be represented as:

$$z = \sum_i a_i v_i$$

where $v_i \in \mathbb{R}^{d_v}$ represents the value of the $i$-th element in the input sequence, and $d_v$ is the dimension of the value representation. The output sequence $y$ is obtained by applying a linear transformation and a non-linear activation to the weighted sum $z$:

$$y = \mathrm{ReLU}(W_o z + b_o)$$

where $W_o$ and $b_o$ are the weight matrix and bias vector of the linear transformation.

The base block builds on the grouped convolutions of ResNeXt [34]. Specifically, we designed four branches for each base block, with each branch composed of three convolutional layers. After passing through the three convolutional layers in each branch, the data from the four branches are combined. In addition, we added a shortcut connection between the input and output of the base block, which helps reduce gradient explosion and vanishing problems as the network deepens and accelerates the convergence of the model. The detailed structures of the ResNet block and base block are illustrated in Figure 2.
We incorporated a Transformer Encoder with a self-attention mechanism as the core into the end of the spatial feature extraction branch.By utilizing the characteristics of the self-attention mechanism, the feature weights of the extracted features are calculated, which enhances the core features and accelerates convergence while improving the accuracy of modulation classification.The excellent performance of the self-attention mechanism can mainly be attributed to the following aspects:

• Global information: the self-attention mechanism [35] can consider the entire sequence of information while processing the information at each position.
• Interpretability: the self-attention mechanism [35] can increase the interpretability of the model by assigning different weights to information from different positions.
• Addressing long-range dependencies: the self-attention mechanism can solve the problem of long-range dependencies, where the model is capable of correctly processing distantly related contextual information.
• Powerful feature representation capability: the self-attention mechanism can fuse information from different positions to obtain powerful feature representation capability.
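The score, normalization, and weighted-sum steps of self-attention can be sketched in pure Python as shown below. This is an illustration under simplifying assumptions: the queries, keys, and values are taken to be the input itself (no learned projections), and the final linear transformation with ReLU is omitted.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention for short sequences of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Attention scores of this query against every key, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        z = [sum(w * v[d] for w, v in zip(weights, values))
             for d in range(len(values[0]))]
        outputs.append(z)
    return outputs

x = [[1.0, 0.0], [0.0, 1.0]]      # two toy feature vectors
out = self_attention(x, x, x)     # identity projections (an assumption)
```

In this toy case each output is a convex combination of the two inputs, and each position places the larger weight on itself.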

Time Feature Extraction Module for AP Data
The time feature extraction module for AP data is constructed based on the LSTM model [36].The LSTM model is capable of extracting temporal features from both phase and amplitude information, while addressing the issues of gradient vanishing and exploding in traditional RNN models.The LSTM architecture comprises a memory cell and three gating components, namely, the input gate, output gate, and forget gate.
Because low-dimensional word vectors contain limited feature information, the input data are first expanded through word vector expansion in the time feature extraction module to enhance the LSTM's ability to extract temporal information. Suppose the length of the input AP data is $T$; then, the shape of the input data $X$ is $T \times 2$. The word vector expansion can be represented as follows:

$$Y = XW + b$$

where $W$ represents a linear layer weight matrix of size $2 \times E$, $b$ represents a bias vector, $E$ represents the dimensionality of the extended word vectors, and $Y$ denotes the extended word vectors. With word vector expansion, the input data are expanded from $T \times 2$ to $T \times E$, as shown in Figure 3. The time feature extraction module consists of two LSTM layers. After the first LSTM layer, the model takes the data output by the output gate of the last LSTM unit. The extracted data are then subjected to another word vector expansion operation and fed into the next LSTM layer for further temporal feature extraction. Finally, the output is sent to the feature fusion module.
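The word vector expansion is a per-time-step linear map, which can be sketched as follows. The weight and bias values are arbitrary placeholders chosen for illustration (in the model they would be learned), and `expand` is a hypothetical name.

```python
def expand(x, w, b):
    """Apply Y = XW + b, mapping a T x 2 input to a T x E output."""
    return [
        [sum(row[k] * w[k][e] for k in range(len(w))) + b[e]
         for e in range(len(b))]
        for row in x
    ]

x = [[1.0, 0.5], [2.0, -0.5], [0.0, 1.0]]   # T = 3 rows of (amplitude, phase)
w = [[1.0, 0.0, 1.0, 2.0],                  # 2 x E weight matrix with E = 4
     [0.0, 1.0, 1.0, -2.0]]
b = [0.0, 0.0, 0.5, 0.0]                    # bias vector of length E
y = expand(x, w, b)
```

The same weights are applied at every time step, so the sequence length $T$ is unchanged while the feature width grows from 2 to $E$.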

Spatiotemporal Feature Fusion Mechanism
In the feature fusion module, we concatenate the temporal and spatial features and feed the concatenated feature vector to a linear layer for further processing. Specifically, assuming the dimensionality of the temporal features is $d_t$ and the dimensionality of the spatial features is $d_s$, we concatenate them along the feature dimension to obtain a new feature vector of dimensionality $d_{t+s} = d_t + d_s$. This vector is fed into a linear layer, where it undergoes a linear transformation to obtain a new feature vector that can be used for classification. This process can be represented as follows:

$$y = \mathrm{Linear}([x_t, x_s])$$

where $[x_t, x_s]$ denotes the concatenation of the temporal and spatial features, and $\mathrm{Linear}(\cdot)$ represents the linear transformation applied to the concatenated feature vector.
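A minimal sketch of the fusion step is shown below: the two feature vectors are concatenated and passed through one linear layer. The dimensions, weights, and the name `fuse` are hypothetical and chosen only to keep the example small.

```python
def fuse(x_t, x_s, weight, bias):
    """Concatenate temporal and spatial features, then apply a linear layer."""
    fused = list(x_t) + list(x_s)             # dimensionality d_t + d_s
    if any(len(row) != len(fused) for row in weight):
        raise ValueError("each weight row must have d_t + d_s entries")
    return [sum(w * v for w, v in zip(row, fused)) + b
            for row, b in zip(weight, bias)]

x_t = [1.0, 2.0]                   # temporal features, d_t = 2
x_s = [3.0]                        # spatial features, d_s = 1
weight = [[1.0, 0.0, 0.0],         # 2 outputs x (d_t + d_s) inputs
          [0.0, 1.0, 1.0]]
bias = [0.5, -1.0]
out = fuse(x_t, x_s, weight, bias)
```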

Loss Functions and Optimization Algorithms
In this paper, the cross-entropy loss function is adopted as the objective function to solve the multi-class classification problem. As for the optimizer, the AdamW optimizer is used with a training cycle of 128 and a learning rate of 0.001.
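For reference, the cross-entropy loss for a single sample can be computed from raw class scores as in the sketch below. It uses the log-sum-exp form for numerical stability; the function name and the toy logits are illustrative only.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log softmax(logits)[target]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], 2)   # equals log(4)
confident_right = cross_entropy([5.0, 0.0, 0.0], 0)
confident_wrong = cross_entropy([5.0, 0.0, 0.0], 1)
```

A confident correct prediction gives a loss near zero, while a confident wrong prediction is penalized heavily, which is the behavior that drives the classifier toward the true modulation type.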

Experimental Results
This section presents the experimental results of the proposed automatic modulation classification (AMC) method. The hybrid data augmentation algorithm is evaluated by comparing the classification accuracy of the dual-stream spatiotemporal fusion neural network (DSSFNN) model with and without it. This study also investigates the optimal architecture of the DSSFNN model, evaluating the number of LSTM layers, the ResNet structure, and the necessity of the self-attention mechanism. The performance of the proposed approach is compared with other state-of-the-art models in terms of classification accuracy, and the findings indicate that the proposed method surpasses the existing methods. Additionally, the classification performance of the DSSFNN model on different modulation types under various SNR conditions is analyzed.

Simulation Environment, Parameters, and Performance Metrics
To demonstrate the performance of the proposed AMC scheme, the RML2016.10a open radio machine learning dataset is used for evaluation. This dataset contains 220,000 samples comprising 11 modulation types, each with 20 SNR levels ranging from −20 dB to 18 dB [28]. Each sample consists of two components, I and Q, each containing 128 points. During the experiments, 70% of the samples were randomly assigned to the training set, 15% to the validation set, and 15% to the test set. The models were trained on a Windows 11 operating system with an NVIDIA RTX 3060 GPU. Python was used as the programming language, and the deep learning models were constructed using the PyTorch framework.
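The 70/15/15 random split can be sketched as follows. The seed and the function name are assumptions made for repeatability of the example; the source does not state how the random assignment was implemented.

```python
import random

def split_indices(n, train_pct=70, val_pct=15, seed=0):
    """Shuffle sample indices and split them into train/validation/test sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = n * train_pct // 100     # integer arithmetic avoids float rounding
    n_val = n * val_pct // 100
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]       # the remaining samples
    return train, val, test

train, val, test = split_indices(220_000)   # dataset size stated in the text
```

With 220,000 samples, this yields 154,000 training samples and 33,000 samples each for validation and testing.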

Ablation Experiment of Hybrid Data Augmentation Algorithm
Figure 4 compares the DSSFNN model trained with the hybrid data augmentation method against the same model trained with other augmentation settings. The results show a notable improvement in the classification accuracy of the DSSFNN model trained with the hybrid data augmentation approach compared to the model trained without it, which highlights the effectiveness of the hybrid method for AMC.
For $-20\ \text{dB} \le \text{SNR} \le 18\ \text{dB}$, the mean classification accuracy of the DSSFNN model trained without hybrid data augmentation is 60.90%. The DSSFNN model trained solely with the self-perturbation data augmentation scheme attained a mean accuracy of 62.52%, and the model trained exclusively with the phase-shift data augmentation scheme attained 62.60%. The DSSFNN model trained with the hybrid data augmentation approach proposed in this paper achieved a mean classification accuracy of 63.44%. The proposed approach therefore improved the mean classification accuracy of DSSFNN by 2.54 percentage points (from 60.90% to 63.44%). These results validate the efficacy of the proposed approach in improving the classification accuracy of the DSSFNN model.

Ablation Experiment of DSSFNN
Figure 5 illustrates the changes in modulation classification accuracy of the DSSFNN model under different structures of the base block. With two grouped convolutions, the highest classification accuracy achieved by the model is 93.75%. With four grouped convolutions, the highest classification accuracy is 95.01%. With six grouped convolutions, the highest classification accuracy is 93.81%. When the number of grouped convolutions exceeded four, the accuracy of the model decreased slightly. The experimental results suggest that increasing the number of branches in the DSSFNN model beyond a certain threshold leads to a slight decrease in classification accuracy. Therefore, in this study, the number of grouped convolutions in the base block of the DSSFNN model was set to four.
In addition, we also conducted experiments on DSSFNN without using group convolutions.The results show that the DSSFNN model with group convolutions achieved significantly higher accuracy compared to the one without group convolutions.
Figure 6 illustrates the variation in the DSSFNN modulation classification accuracy for different numbers of LSTM layers and after removing the Transformer Encoder. When the Transformer Encoder was removed from the DSSFNN model, a significant decrease in accuracy was observed: at a high signal-to-noise ratio (SNR), the average accuracy of the model was only 90.89%. When a single LSTM layer was used, the DSSFNN model achieved an average accuracy of 92.00%. Compared with the models without the attention block and with a single LSTM layer, the proposed DSSFNN model improved the average accuracy by 1.91% and 0.8%, respectively, under high-SNR conditions. The experimental results demonstrate that using a double-layer LSTM in the DSSFNN model achieves the best classification accuracy. Moreover, the presence or absence of a self-attention mechanism has a significant impact on the accuracy of the DSSFNN model. Table 2 shows the results of the ablation experiments of DSSFNN in the high-SNR case.

Figure 7 illustrates a comparison between the proposed approach and existing modulation classification methods, including ResNet [29], CLDNN [29], CNN-LSTM [9], SCRNN [30], CNN4 [31], MLDNN [31], CGDNN [31], and the proposed DSSFNN model. Among the models compared, MCLDNN achieved the lowest classification accuracy. Specifically, for $0\ \text{dB} \le \text{SNR} \le 18\ \text{dB}$, the mean accuracy achieved by MCLDNN was 81.06%, with a maximum classification accuracy of 81.74%. ResNet attained a mean accuracy of 82.82% over the SNR range [0, 18] dB. The average classification accuracy of CNN4 was 83.36%, with a maximum accuracy of 84.8%. The average accuracy of the LSTM-CNN dual-stream model was only 60.97%, which partially demonstrates the effectiveness of the model proposed in this paper. Comparatively, under high-SNR conditions, SCRNN, CLDNN, and CGDNN demonstrated strong performance. The average classification accuracy of SCRNN was 89.92%, that of CLDNN was 90.61%, and that of CGDNN2 was 91.65%. The modulation classification scheme proposed in this paper achieves the highest classification accuracy. Specifically, at high SNR, the proposed scheme achieved an accuracy of 93.14%, with the highest accuracy being 95.01%. The experimental results indicate that the proposed modulation classification scheme outperforms the other schemes.

Figure 8 shows a comparison of the classification results obtained by the proposed modulation classification scheme and SCRNN at SNR = 10 dB. It can be observed that, at SNR = 10 dB, the proposed scheme shows significant improvement over SCRNN in the classification of the QAM64 and WBFM modulation schemes, with increases of 17% and 16%, respectively. However, the classification accuracy of the proposed scheme for WBFM is still not high enough. Therefore, how to improve the classification accuracy of deep learning models for WBFM is a research problem worthy of further investigation.

Conclusions
This paper proposed a robust AMC method based on data augmentation and deep learning models, which achieves high-precision classification of signal modulation methods. Firstly, a hybrid data augmentation method was used to augment the original data. With data augmentation, the trained model has higher robustness and generalization ability, and overfitting during training is effectively suppressed. Additionally, it should be noted that AMC plays a crucial role in drone communication due to the requirement of reliable data transmission between drones and ground stations. Next, this paper proposed a novel AMC model, DSSFNN, which adopts a parallel design to extract temporal and spatial features separately and fuses them for high-precision classification of most modulation schemes. Spatial and temporal features are extracted by two independent branches. This approach ensures diversity between spatial and temporal features while avoiding the extraction of features from previously extracted ones, which removes a potential factor that may degrade model performance. In experiments on the publicly available dataset RML2016.10a, the proposed modulation classification scheme achieved the highest classification accuracy among the compared models.
Funding: This research received no external funding.

Figure 1. The overall architecture of the automatic modulation classification method based on hybrid data augmentation and dual-stream spatiotemporal fusion neural network.
In the self-attention output transformation, $W_o \in \mathbb{R}^{d_h \times d_v}$ and $b_o \in \mathbb{R}^{d_h}$ are the weight matrix and bias vector used for the linear transformation, $d_h$ denotes the dimensionality of the output representation, and ReLU is a non-linear activation function. In the spatial feature extraction branch of DSSFNN, three ResNet blocks are stacked, each consisting of a one-dimensional 2 × 32 convolutional layer, two base blocks, and a max pooling layer. After each base block in this branch, a ReLU activation layer and a one-dimensional batch normalization layer are added to prevent overfitting. We made appropriate adjustments to the grouped convolutions proposed by Xie et al. in 2017 [34].

Figure 2. ResNet block and base block architecture.

Figure 4. Performance comparison of DSSFNN modulation classification under various data augmentation methods.

Figure 7. Comparison of modulation classification results between DSSFNN and other models.

Figure 8. Modulation classification comparison of SCRNN and the proposed model under different SNR scenarios.

Table 2. DSSFNN experimental results of ablation study under different SNR scenarios.