A Novel Automatic Modulation Classiﬁcation Method Using Attention Mechanism and Hybrid Parallel Neural Network

Abstract: Automatic Modulation Classification (AMC) is of paramount importance in wireless communication systems. Existing methods usually adopt a single category of neural network or stack different categories of networks in series, and rarely extract different types of features simultaneously in a proper way. At the output layer, the softmax function is commonly applied for classification to expand the inter-class distance. In this paper, we propose a hybrid parallel network for the AMC problem. The proposed method designs a hybrid parallel structure which utilizes a Convolutional Neural Network (CNN) and a Gated Recurrent Unit (GRU) to extract spatial features and temporal features, respectively. Instead of superposing these two categories of features directly, three different attention mechanisms are applied to assign weights to the different types of features. Finally, a cosine similarity metric named the Additive Margin softmax function, which can expand the inter-class distance and compress the intra-class distance simultaneously, is adopted for the output. Simulation results demonstrate that the proposed method achieves remarkable performance on an open-access dataset.


Introduction
Automatic Modulation Classification (AMC) is an intermediate step between signal detection and demodulation. The aim of AMC is to identify the modulation type of a received signal correctly and automatically, which has broad application prospects in both civil and military technologies [1]. In civilian applications, AMC can be used to ensure the normal communication of legitimate users by monitoring the spectrum and identifying illegitimate interference. AMC can also be used to reinforce situational awareness in software-defined radio systems for better spectrum utilization [2]. In a non-orthogonal multiple access system, AMC can identify the modulation type and decide whether successive interference cancellation is required [3]. In military applications, AMC is a core technology of electronic countermeasures, communication surveillance and jamming. AMC helps intercept receivers obtain the correct modulation type, which provides a reference for selecting the demodulation algorithm and for choosing the optimal jamming pattern or jamming cancellation algorithm in electronic warfare, so as to protect friendly communication while suppressing enemy communication [4]. Consequently, it is imperative to develop AMC technology.
Extant literature can be divided into two categories: maximum likelihood-based and feature-based approaches. Maximum likelihood-based approaches can be divided into three categories: the average likelihood ratio test (ALRT) [5], the generalised likelihood ratio test (GLRT) [6] and the hybrid likelihood ratio test (HLRT) [7]. In maximum likelihood-based approaches, the AMC problem is viewed as a hypothesis testing problem. Feature-based approaches, by contrast, extract discriminative features from the received signal and feed them to a classifier; in recent years, deep-learning-based feature extraction has become the dominant choice in this category. O'Shea [28] first provided the standard dataset and the baseline network for AMC. Deep Belief Networks [29] were applied to AMC early on but showed lower accuracy than conventional discriminative models. Rajendran [30] converted each sample to its amplitude and phase representation, fed the transformed signal into a 2-layer LSTM network for classification, and achieved more than 80% accuracy at an SNR of 10 dB on O'Shea's dataset. Ref. [31] applied GRU to AMC for resource-constrained end-devices. Utrilla [32] designed an LSTM-based denoising autoencoder classifier and exceeded 90% accuracy at an SNR of 4 dB. West [33] utilized the Inception structure [34] and residual modules to extract features, reaching 80% accuracy at an SNR of 0 dB. Liu [35] proposed a Convolutional Long Short-term Deep Neural Network (CLDNN) which cascades CNN and LSTM. Hermawan [36] proposed a CNN-based method with an added Gaussian noise layer that performed better than the baseline. The Dropout-replaced Convolutional Neural Network (DrCNN) [37], which replaces the max pooling layer with a dropout layer [38], obtained a competitive result. Huang [39] combined ResNet and GRU and reached 95% accuracy at an SNR of 5 dB. Tao [40] proposed a sequential convolutional recurrent neural network that connects a CNN and a bi-LSTM in series and also achieved a competitive result. However, a serial structure of CNN and RNN mixes different types of features, which inevitably results in a loss of information. Thus, a parallel structure with a proper feature combination method can achieve better performance.
The attention mechanism was proposed in 2017 [41] and has attracted widespread interest. It reweights the input features to highlight important parts and suppress unimportant ones, and has been widely used in natural language processing [42] and computer vision [43]; it has also attracted interest in AMC. Yang [44] combined the attention mechanism with one-dimensional convolution modules to propose a One-Dimensional Deep Attention Convolution Network (OADCN) for modulation types with different channel coding modes. Nevertheless, the attention mechanism is only applied there to assign weights to the spatial features from the CNN model, so it still has broad application prospects.
The softmax function [45] is the most popular classification function in AMC; it amplifies the differences between outputs. However, there are many other output functions for neural networks, and cosine-similarity-based functions have become popular recently [46]. Existing literature rarely considers cosine-similarity distance for the AMC problem. As a consequence, this paper proposes applying a cosine-similarity softmax, the Additive Margin softmax (AM-softmax) [47], which is widely utilized in metric learning, to the AMC problem.
In this paper, we propose a hybrid parallel module which extracts spatial features and temporal features simultaneously and utilizes attention mechanisms to reweight all features. First, a hybrid parallel feature extraction module is designed to extract features in parallel: CNN and GRU are applied to extract spatial features and temporal features, respectively, and the two types of features are then concatenated along the channel dimension. Different attention mechanisms are applied to assign weights to the features: the Squeeze-and-Excitation (SE) block and the multi-head attention mechanism assign weights in the channel dimension and the feature dimension, respectively, while the iterative Attentional Feature Fusion (iAFF) module assigns weights in the residual shortcut structure. We use the SE block and the multi-head attention mechanism to construct the hybrid parallel feature extraction module, and build a residual structure with the hybrid parallel feature extraction module and the iAFF module. Finally, we flatten the output features, apply the multi-head attention mechanism to each point in the features, and send the features to AM-softmax to calculate the cosine-similarity distance. To our knowledge, this is the first paper utilizing a cosine-similarity softmax in the AMC domain.
The remainder of the paper is organized as follows: In Section 2, we will show the details of our proposed method. In Section 3, we evaluate the performance of the proposed model with simulation results. Finally, we summarize the content of this paper in Section 4.

System Model
The system model of our method is shown in Figure 1. The first module is the preprocessing module. x_IQ is the input data of our model, given in IQ representation: the real part and the imaginary part of the input signal form 2 channels. We first calculate the amplitude and phase features of the IQ signal, then concatenate the raw complex data with these two features to obtain a dataset with 4 channels. The purpose of this concatenation is to achieve data augmentation; the details of this operation are given in Section 3.2. After the preprocessing module, the concatenated data are sent to the classification module, which is implemented by our proposed hybrid parallel network. The output of the network is the classification result.
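As a concrete illustration, the following minimal sketch (our own example in NumPy, assuming an IQ array of shape (N, 2, 128); none of the names are fixed by the paper) shows how the 4-channel input can be built from the raw IQ data. We use arctan2 rather than a plain arctan for numerical robustness.

import numpy as np

def preprocess_iq(x_iq):
    """Concatenate amplitude and phase channels to raw IQ data.

    x_iq: array of shape (N, 2, L), channel 0 = real (I), channel 1 = imaginary (Q).
    Returns an array of shape (N, 4, L): [I, Q, amplitude, phase].
    """
    i, q = x_iq[:, 0, :], x_iq[:, 1, :]
    amplitude = np.sqrt(i ** 2 + q ** 2)      # a = sqrt(real^2 + imag^2)
    phase = np.arctan2(q, i)                  # p = angle of the complex sample
    return np.stack([i, q, amplitude, phase], axis=1)

# Example: 1000 random IQ frames of length 128 -> (1000, 4, 128)
x = np.random.randn(1000, 2, 128).astype(np.float32)
print(preprocess_iq(x).shape)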

Hybrid Parallel Network
The hybrid parallel network is stacked from the hybrid parallel modules we propose. The hybrid parallel module combines an Inception module, a Gated Recurrent Unit (GRU), a Squeeze-and-Excitation (SE) block, an iterative Attentional Feature Fusion (iAFF) module and a multi-head attention mechanism. The output function is an Additive Margin (AM)-Softmax function. We first show the whole structure and then introduce each module in turn. Figure 2 shows the structure of the proposed hybrid parallel network, which is composed of several hybrid parallel modules. The core of the hybrid parallel module is the hybrid parallel feature extraction module, illustrated in Figure 2a. The parallel structure is constructed from an Inception module and a GRU, which extract spatial features and temporal features, respectively. The SE block adjusts the weight of each channel. Before the Inception module we normalize the features by Batch Normalization (BN). A Maxpool layer is applied when the stride of the Inception module is greater than 1, and the length of the Maxpool is equal to that stride. After obtaining the spatial and temporal features, we normalize and reweight the concatenated features with the multi-head attention mechanism. Our proposed hybrid parallel module combines the parallel feature extraction module with the iAFF module. The short skip to the iAFF module is a convolution module with a 1 × 1 kernel that adjusts the channel size of the initial input. The hybrid parallel network is presented in Figure 2b. We stack several hybrid parallel modules in a residual structure to extract temporal and spatial features. After the last hybrid parallel module, we flatten all the features and readjust the weight of each sample point by the multi-head attention mechanism. Finally, we output the result through the AM-softmax function. The details of all the constructed modules are given in the following sections.

Convolutional Neural Network (CNN)
The convolutional neural network was first proposed in [17] for the ImageNet competition and immediately won first place. CNNs have more powerful data processing abilities than traditional fully connected neural networks: they preserve the neighborhood relations and spatial locality of the input data in the feature representation [48]. As long as enough input data are available, the CNN model trains itself automatically. The CNN structure is derived entirely from the training data, so the network is fully adapted to the data and can obtain more representative features. The core idea of CNN is the convolution step, which can be viewed as a correlation process. Assuming that ω_1, ω_2, . . . , ω_m represent the weights of a 1-D convolution kernel of length m, the computational process of 1-D convolution can be represented as

y_t = f( Σ_{i=1}^{m} ω_i x_{t−i+1} ),

where x_t is the input sample at time t, y_t is the superposition of the information generated at the current moment t and the information delayed from previous moments, and f(·) is the activation function. The activation function we choose in this paper is the Rectified Linear Unit (ReLU) [49], which can be defined as

ReLU(x) = max(0, x).

In this paper, we choose ReLU as the activation function after 1-D convolution processing. The typical way to increase the capacity of a CNN is to go deeper and wider in structure. Going deeper means stacking more layers and going wider requires more channels; both increase the computational overhead. Inception [34] was proposed to solve this problem.
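Before turning to the Inception structure, the short NumPy sketch below (our own illustration, not code from the paper; the correlation-style indexing is an assumption) makes the 1-D convolution and ReLU step concrete.

import numpy as np

def conv1d_relu(x, w):
    """Valid 1-D correlation followed by ReLU.

    x: input sequence of length T, w: kernel weights of length m.
    Returns y with y_t = ReLU(sum_i w_i * x_{t+i}).
    """
    m = len(w)
    y = np.array([np.dot(w, x[t:t + m]) for t in range(len(x) - m + 1)])
    return np.maximum(y, 0.0)  # ReLU(x) = max(0, x)

print(conv1d_relu(np.array([1.0, -2.0, 3.0, 0.5]), np.array([0.5, -1.0])))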
The structure of the Inception module applied in this paper is shown in Figure 3. The label 1 × 1 Kernel means that the kernel size of the 1-D convolution is 1 × 1; similarly, 1 × 3, 1 × 5 and 1 × 7 denote convolution kernels of size 1 × 3, 1 × 5 and 1 × 7, respectively. After performing the convolution operations, the Inception module concatenates all outputs along the channel dimension. Different kernel sizes help the Inception module extract features at different spatial scales. The next step of the Inception module is Batch Normalization (BN) [50]. The BN layer accelerates training by normalizing the features within one batch:

µ = (1/N_batch) Σ_{n=1}^{N_batch} x_n,    σ² = (1/N_batch) Σ_{n=1}^{N_batch} (x_n − µ)²,

x̂_n = (x_n − µ) / sqrt(σ² + ε),    y_n = γ x̂_n + β,

where N_batch denotes the batch size, x_n denotes the input data, µ and σ² denote the mean and variance of the batch, x̂_n denotes the normalized data, ε is a small constant to prevent division by zero, γ and β are learnable parameter vectors for rescaling the data, and y_n is the final output feature.
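A possible PyTorch realization of such a 1-D Inception block with BN is sketched below; the layer ordering, padding choices and channel counts are our assumptions for illustration, not the exact configuration of the paper.

import torch
import torch.nn as nn

class Inception1D(nn.Module):
    """Parallel 1-D convolutions with kernel sizes 1, 3, 5, 7, concatenated on the channel axis."""
    def __init__(self, in_ch, out_ch_per_branch, stride=1):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(in_ch, out_ch_per_branch, kernel_size=k, stride=stride, padding=k // 2)
            for k in (1, 3, 5, 7)
        ])
        self.bn = nn.BatchNorm1d(4 * out_ch_per_branch)
        self.act = nn.ReLU()

    def forward(self, x):                       # x: (batch, in_ch, length)
        out = torch.cat([b(x) for b in self.branches], dim=1)
        return self.act(self.bn(out))           # BN then ReLU on the concatenated features

x = torch.randn(8, 4, 128)                      # batch of 4-channel signals
print(Inception1D(4, 16)(x).shape)              # -> torch.Size([8, 64, 128])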

Gated Recurrent Unit (GRU)
The Gated Recurrent Unit (GRU), a type of RNN, was proposed to solve the vanishing gradient problem in the back propagation of RNNs. Compared with LSTM, GRU can reach the same accuracy at a cheaper computational cost. The forward propagation of GRU can be denoted as

r_t = σ( L_r([h_{t−1}, x_t]) ),
z_t = σ( L_z([h_{t−1}, x_t]) ),
h̃_t = tanh( L_h([r_t ⊙ h_{t−1}, x_t]) ),
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t,

where x_t is the input data at time t, h_{t−1} is the hidden state at the previous time step, h̃_t is the candidate hidden state, h_t is the new hidden state, ⊙ denotes element-wise multiplication, and r_t and z_t denote the reset gate state and the update gate state, respectively. σ(·) denotes the sigmoid function, which is defined as

σ(x) = 1 / (1 + e^{−x}),

tanh(·) is defined as

tanh(x) = (e^{x} − e^{−x}) / (e^{x} + e^{−x}),

and L(·) denotes a fully connected layer, which can be defined as

L_i(x) = W_i x + b_i,

where W_i and b_i denote the weight and bias, respectively.
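The sketch below is our own minimal NumPy implementation of a single GRU cell following the equations above; the weight shapes and the concatenation convention are assumptions made for illustration.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, Wr, Wz, Wh, br, bz, bh):
    """One GRU step. x_t: (input_dim,), h_prev: (hidden_dim,).
    W*: (hidden_dim, hidden_dim + input_dim), b*: (hidden_dim,)."""
    xh = np.concatenate([h_prev, x_t])
    r = sigmoid(Wr @ xh + br)                                      # reset gate
    z = sigmoid(Wz @ xh + bz)                                      # update gate
    h_cand = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]) + bh)  # candidate state
    return (1 - z) * h_prev + z * h_cand                           # new hidden state

rng = np.random.default_rng(0)
hid, inp = 8, 4
params = [rng.standard_normal((hid, hid + inp)) * 0.1 for _ in range(3)]
biases = [np.zeros(hid) for _ in range(3)]
h = np.zeros(hid)
for t in range(128):                                               # run over a random length-128 sequence
    h = gru_cell(rng.standard_normal(inp), h, *params, *biases)
print(h.shape)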

Attention Mechanism
The attention mechanism has brought large improvements to deep learning tasks. When human beings observe something attentively, they focus on what needs to be observed and ignore the surrounding environment; this can be interpreted as assigning more weight to the object of interest and less weight to the surroundings. The attention mechanism follows this principle by learning a reweighting scheme during training, so that it pays more attention to the parts of the input which better express the characteristics of the signal. We choose three types of attention mechanism, the Squeeze-and-Excitation (SE) block [51], the iterative Attentional Feature Fusion (iAFF) module [52] and multi-head attention [41], for our model. The SE block focuses on channel-dimension weight distribution and is applied together with the Inception module for spatial feature extraction. The iAFF module extends feature fusion from the same-layer scenario to cross-layer scenarios, i.e., the residual structure; it is placed between two hybrid parallel modules to fuse the features coming from the Inception module, the GRU module and the initial input. The multi-head attention module reweights all flattened features before classification. The following sections introduce these three attention modules.

Squeeze-and-Excitation (SE) Block
The core idea of the SE block is to assign a weight to each channel. The SE block has two steps: a squeeze step and an excitation step. The squeeze step aggregates the features along the channel dimension: the SE block applies global average pooling in each channel and obtains channel-wise statistics. The i-th element of the output z can be written as

z_i = (1/H) Σ_{j=1}^{H} x_i(j),

where x_i is the i-th channel of the input and z is a C-dimensional vector. The output vector is then sent into a gating mechanism:

s = σ( W_2 δ( W_1 z ) ),

where δ is the ReLU function, σ is the sigmoid function, and W_1 ∈ R^{(C/r)×C} and W_2 ∈ R^{C×(C/r)} are both trainable parameters; r is a hyper parameter called the reduction ratio, controlling the bottleneck formed by the two fully connected layers with the ReLU non-linearity between them. Finally, each channel is rescaled by

x̃_c = F_scale(x_c, s_c) = s_c · x_c,

where x̃_c is the c-th channel of the output feature with dimension C × H, C is the number of channels, H is the dimension of the input feature, s_c is the weight of the c-th channel, x_c refers to the c-th input channel and F_scale means channel-wise multiplication.
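A compact PyTorch version of this 1-D SE block, following the equations above, could look as follows; the reduction ratio value and tensor shapes are our assumptions for illustration.

import torch
import torch.nn as nn

class SEBlock1D(nn.Module):
    """Squeeze-and-Excitation for 1-D features of shape (batch, C, H)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)   # W1: C -> C/r
        self.fc2 = nn.Linear(channels // reduction, channels)   # W2: C/r -> C
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        z = x.mean(dim=2)                                        # squeeze: global average pooling over H
        s = self.sigmoid(self.fc2(self.relu(self.fc1(z))))      # excitation: channel weights in (0, 1)
        return x * s.unsqueeze(2)                                # rescale each channel by its weight

x = torch.randn(8, 64, 128)
print(SEBlock1D(64)(x).shape)                                   # -> torch.Size([8, 64, 128])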

Iterative Attentional Feature Fusion (iAFF) module
Existing attention-based methods only focus on features in the same layer; the iAFF module [52] can integrate cross-layer features, making the fusion scheme more heuristic and fusing the received features in a contextual, scale-aware way. The structure of iAFF is shown in Figure 4. The core of iAFF is the Multi-Scale Channel Attention Module (MS-CAM), whose local channel context L(X) is designed as

L(X) = BN(PWConv_2(δ(BN(PWConv_1(X))))),

where PWConv denotes point-wise convolution, BN denotes batch normalization and δ denotes the ReLU function. We define g(X) as global average pooling (Global Avg Pooling), which can be computed as

g(X) = (1/H) Σ_{i=1}^{H} X[:, i].

PWConv_1 has C/r channels and PWConv_2 has C channels; together they construct a convolution bottleneck. The left part of MS-CAM can be seen as global channel feature extraction and the right part as local channel feature extraction, where the global branch G(X) applies the same point-wise convolution bottleneck to the pooled feature g(X). The output feature X̃ is defined as

X̃ = X ⊗ M(X),    M(X) = σ( G(X) ⊕ L(X) ),

where ⊗ is element-wise multiplication and ⊕ means broadcasting addition; M(X) combines the global channel features and the local channel features through a sigmoid gate. The iAFF module in Figure 4b can be expressed as

X ⊎ Y = M(X + Y) ⊗ X + (1 − M(X + Y)) ⊗ Y,
Z = M(X ⊎ Y) ⊗ X + (1 − M(X ⊎ Y)) ⊗ Y,

where Z denotes the output feature and Y denotes the feature from the other layer; in this paper, Y is the input of the residual branch. The dotted arrow from MS-CAM in the figure denotes 1 − M(·). The iAFF module can thus be separated into two parts: the first part computes the initial integration X ⊎ Y, and the second part takes X ⊎ Y as the input of the next MS-CAM and assigns its output as weights to X and Y.
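The following PyTorch sketch is our own rendering of MS-CAM and iAFF for 1-D features, following the equations above; the channel counts and the reduction ratio are assumptions, not values taken from the paper.

import torch
import torch.nn as nn

class MSCAM1D(nn.Module):
    """Multi-Scale Channel Attention Module for features of shape (batch, C, H)."""
    def __init__(self, channels, r=4):
        super().__init__()
        def bottleneck():
            return nn.Sequential(
                nn.Conv1d(channels, channels // r, 1), nn.BatchNorm1d(channels // r), nn.ReLU(),
                nn.Conv1d(channels // r, channels, 1), nn.BatchNorm1d(channels))
        self.local_att = bottleneck()            # L(X): point-wise conv bottleneck on the full feature
        self.global_att = bottleneck()           # same bottleneck applied to the pooled feature
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        g = x.mean(dim=2, keepdim=True)          # g(X): global average pooling
        return self.sigmoid(self.local_att(x) + self.global_att(g))   # M(X) in (0, 1)

class IAFF1D(nn.Module):
    """Iterative attentional feature fusion of two same-shaped feature maps X and Y."""
    def __init__(self, channels, r=4):
        super().__init__()
        self.cam1 = MSCAM1D(channels, r)
        self.cam2 = MSCAM1D(channels, r)

    def forward(self, x, y):
        w1 = self.cam1(x + y)                    # first fusion: X (+) Y
        xy = w1 * x + (1 - w1) * y
        w2 = self.cam2(xy)                       # second fusion using X (+) Y as context
        return w2 * x + (1 - w2) * y

x, y = torch.randn(8, 64, 128), torch.randn(8, 64, 128)
print(IAFF1D(64)(x, y).shape)                    # -> torch.Size([8, 64, 128])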

Multi-Head Attention Mechanism
The attention mechanism can divide the features into multiple heads to form multiple subspaces, allowing the model to attend to different aspects of the feature information; this practice is called the multi-head attention mechanism. The core of the attention mechanism is scaled dot-product attention. Q, K and V represent the query, key and value, respectively; in the multi-head attention mechanism, they are all obtained by sending the input through fully connected layers. Q and K have the same dimension d_k, so the output of scaled dot-product attention can be written as

Attention(Q, K, V) = softmax( Q K^T / sqrt(d_k) ) V,

where K^T is the transpose of K. The output is the reweighted value V, and the weight assigned to each value is computed from K and Q. The dimensions of Q, K and V can be uniformly written as C × E, where C is the channel dimension and E denotes the embedding dimension. Q and K must have the same embedding dimension, while the embedding dimension of V can be different. The multi-head mechanism divides Q, K and V into h parts, performs scaled dot-product attention h times and finally concatenates all h outputs:

MultiHead(Q, K, V) = Concat(head_1, . . . , head_h) W^O,    head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),

where the W terms denote parameter matrices.
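For completeness, here is a minimal scaled dot-product and multi-head attention sketch in PyTorch; the dimensions are assumptions for illustration, and the multi-head part simply reuses the built-in nn.MultiheadAttention layer rather than the paper's own implementation.

import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v):
    """q, k: (batch, C, d_k); v: (batch, C, d_v). Returns the reweighted values."""
    d_k = q.size(-1)
    scores = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(d_k), dim=-1)  # (batch, C, C)
    return scores @ v

q = k = v = torch.randn(8, 32, 16)
print(scaled_dot_product_attention(q, k, v).shape)        # -> torch.Size([8, 32, 16])

# Multi-head version with h = 4 heads over an embedding of size 16.
mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
out, _ = mha(q, k, v)
print(out.shape)                                          # -> torch.Size([8, 32, 16])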

Additive Margin (AM)-Softmax Function
The softmax function is the most popular output function in classification tasks and is suitable for optimizing inter-class differences. However, if we can reduce the intra-class distance, we will also expand the inter-class distance, which helps increase classification accuracy. A cosine-similarity-based softmax computes the angle between the input vector and the center vector of each modulation category; the process of narrowing the angles within a class is also the process of widening the distance between classes. Therefore, we choose a cosine-similarity-based function for our modulation classification. The softmax loss can be written as

L_S = −(1/n) Σ_{i=1}^{n} log( e^{W_{y_i}^T f_i} / Σ_{j=1}^{c} e^{W_j^T f_i} )
    = −(1/n) Σ_{i=1}^{n} log( e^{‖W_{y_i}‖‖f_i‖ cos θ_{y_i}} / Σ_{j=1}^{c} e^{‖W_j‖‖f_i‖ cos θ_j} ),

where f_i is the feature of the i-th sample and W_j is the j-th column of the weight matrix of the output layer; W_{y_i}^T f_i denotes the target score of the i-th sample. As the equation shows, the exponent of the softmax function can be written as the product of the magnitudes of two vectors and the cosine of their angular distance. As a result, the cosine-similarity softmax can be implemented by normalizing the weight vectors and the features: in this way, ‖W_j‖ and ‖f_i‖ are both normalized to 1 and the exponent only keeps the cosine of the angular distance. Additive Margin Softmax (AM-Softmax) was proposed to add a cosine margin to the cosine-similarity softmax, which is denoted as

L_AMS = −(1/n) Σ_{i=1}^{n} log( e^{s(cos θ_{y_i} − m)} / ( e^{s(cos θ_{y_i} − m)} + Σ_{j≠y_i} e^{s cos θ_j} ) ),

where s and m are both preset hyper parameters; m denotes the designed cosine-similarity margin between classes. The difference between the conventional softmax decision boundary and the AM-softmax decision boundary is shown in Figure 5, where the red arrows W_1 and W_2 denote the center vectors of the two modulation categories. The decision boundary of the conventional softmax function is the green line P_0, where cos(θ_{W_1,P_0}) = W_1^T P_0 = W_2^T P_0 = cos(θ_{W_2,P_0}).
We must point out that the conventional softmax function in Figure 5 is not the same as the softmax function applied in the existing literature, because W_1, W_2 and P_0 are all normalized to 1, whereas the softmax function applied in existing literature does not normalize the parameters or the features. Thus, the conventional softmax function in Figure 5 is already a type of cosine-similarity metric, and its decision boundary bisects the angle between the two center vectors. For the AM-softmax function, the decision boundary of category W_1 is at P_1; similarly, the decision boundary of category W_2 is at P_2, and the margin m can be written as

m = cos(θ_{W_1,P_1}) − cos(θ_{W_2,P_1}) = cos(θ_{W_2,P_2}) − cos(θ_{W_1,P_2}).

The greater the value of m, the larger the cosine-similarity margin. If m is set to 0, the AM-softmax function degenerates into a plain cosine-similarity metric. s is a scale parameter used to accelerate convergence: if s is too small, convergence will be too slow, but if s is too large, convergence will be too fast to find a good local optimum. Therefore, both m and s must be chosen carefully. In the training step, we widen the gap between the target logit and the other logits by at least the cosine margin m. If the network is trained well, the output logit of the true modulation category will be larger than the others by a cosine value of m in the test step. After passing through the AM-Softmax function, the output is also sent to a loss function such as cross entropy to optimize the performance of the network.
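The following PyTorch sketch is our own minimal implementation of the AM-softmax loss described above, not the exact training code of the paper: it normalizes the features and the class weight vectors and subtracts the additive cosine margin from the target logit only.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxLoss(nn.Module):
    """Additive Margin softmax loss with scale s and cosine margin m."""
    def __init__(self, feat_dim, num_classes, s=10.0, m=0.1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))   # class center vectors
        self.s, self.m = s, m

    def forward(self, features, labels):
        f = F.normalize(features, dim=1)                   # ||f_i|| = 1
        w = F.normalize(self.weight, dim=1)                # ||W_j|| = 1
        cos = f @ w.t()                                    # cosine similarities, shape (batch, classes)
        margin = torch.zeros_like(cos).scatter_(1, labels.unsqueeze(1), self.m)
        logits = self.s * (cos - margin)                   # subtract m only from the target logit
        return F.cross_entropy(logits, labels)             # cross entropy on the scaled logits

loss_fn = AMSoftmaxLoss(feat_dim=128, num_classes=11)
loss = loss_fn(torch.randn(32, 128), torch.randint(0, 11, (32,)))
print(loss.item())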

Dataset and Parameters
RML2016.10a [28] is the most popular dataset used in AMC. It consists of 11 modulation categories: BPSK, QPSK, 8PSK, 16QAM, 64QAM, GFSK, CPFSK, PAM4, WB-FM, AM-SSB, and AM-DSB. Each signal has dimension 2 × 128: the length of each sample is 128, and each sample has real and imaginary parts, so the dataset has 2 channels. The duration of each sample is 128 µs, the sampling frequency is 1 MHz, and the number of samples per symbol is 8. The number of samples per SNR for each category is 1000; we choose 80% as our training set, 10% as our validation set and 10% as our test set. The SNR range is from −20 dB to 18 dB, but in practical applications communication below −5 dB is of little use, therefore we choose −4 dB to 18 dB for our experiments, with an interval of 2 dB. Accordingly, the number of training samples in our experiments is 105,600, and the validation and test sets both contain 13,200 samples. The shape of our training set is 105,600 × 2 × 128, and the shapes of the validation set and the test set are both 13,200 × 2 × 128.
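As an illustration of how such a split can be produced, the sketch below assumes the public pickle version of RML2016.10a, a dictionary keyed by (modulation, SNR) pairs; the file name, key layout and random split are our assumptions, not something specified in this paper.

import pickle
import numpy as np

# RML2016.10a is distributed as a pickled dict: {(mod_name, snr): array of shape (N, 2, 128)}.
with open("RML2016.10a_dict.pkl", "rb") as f:
    data = pickle.load(f, encoding="latin1")

snrs_used = range(-4, 20, 2)                              # -4 dB ... 18 dB, step 2 dB
mods = sorted({mod for (mod, snr) in data.keys()})
X, y = [], []
for (mod, snr), frames in data.items():
    if snr in snrs_used:
        X.append(frames)
        y.extend([mods.index(mod)] * len(frames))
X, y = np.concatenate(X), np.array(y)

# Shuffle, then split 80% / 10% / 10% into train / validation / test.
idx = np.random.permutation(len(X))
n_train, n_val = int(0.8 * len(X)), int(0.1 * len(X))
train_idx, val_idx, test_idx = np.split(idx, [n_train, n_train + n_val])
print(X[train_idx].shape, X[val_idx].shape, X[test_idx].shape)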
The input of the whole network is the raw complex data concatenated with the corresponding amplitude and phase. The complex dataset has 2 channels, the real part and the corresponding imaginary part. If we directly treat the two parts as two channels, we lose the relation between the real part and the imaginary part. Therefore, we project the signal onto polar coordinates and choose the amplitude and phase as two extra channels to concatenate with the initial complex data, which can be written as

a_i = sqrt(real_i² + imag_i²),    p_i = arctan( imag_i / real_i ),

where real_i denotes the real part of the i-th sample point, imag_i denotes the imaginary part, and a_i and p_i respectively denote the amplitude and phase of the sample point. Finally, we concatenate the amplitude channel and the phase channel with the complex data, so the input data has 4 channels. The output of the hybrid parallel network also needs to be optimized by a loss function. The loss function we choose is cross entropy, defined as

Loss = −Σ_k t_k log(l_k),

where l is the output vector of the AM-softmax, t denotes the label vector in one-hot form, and t_k and l_k denote the k-th elements of t and l. Cross-entropy loss can learn the difference between two distributions and converges fast, which is why we choose it as our loss function. The optimizer we choose is the Adam optimizer [53]; the initial learning rate is set to 0.001 and decayed by a factor of 10 every 16 epochs. The AM-Softmax parameters are m = 0.1 and s = 10. The batch size is set to 512 and we train for 40 epochs. Our proposed model cascades 4 hybrid parallel modules with different structure parameters, which are listed in Table 1.

Table 1. Structure parameters of the 4 hybrid parallel modules.

Module    First Inception channels    Second Inception channels    Stride
1         64                          40                           1
2         40                          80                           2
3         40                          40                           2
4         20                          20                           1

Table 1 gives the channels of the Inception modules applied in our structure. The hidden size of each corresponding GRU module is set to half of the number of channels of the second Inception layer because the GRU is bidirectional. The number of channels of the corresponding convolution modules and the iAFF module is 2 times the number of output channels of the Inception module, due to the concatenation of the Inception and GRU features. The number of heads of the multi-head attention mechanisms in the hybrid parallel modules is 4. The stride of the convolution module in the residual skip is 7 and its kernel size is 1 × 1. After the last hybrid parallel module we flatten all the features and send them to a multi-head attention module with 10 heads. Before the final AM-softmax function, we reduce the dimension of the features with a linear layer of output size 128. The total number of parameters of our model is 27,415,424. After training, the average inference time is 3 s.
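A minimal training configuration matching the hyper parameters above is sketched below; the model and the data are placeholders, and the use of a step scheduler is our assumption about how "decayed by a factor of 10 every 16 epochs" is realized.

import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and random data standing in for the hybrid parallel network and the 4-channel dataset.
model = nn.Sequential(nn.Flatten(), nn.Linear(4 * 128, 11))
train_loader = DataLoader(TensorDataset(torch.randn(2048, 4, 128),
                                        torch.randint(0, 11, (2048,))),
                          batch_size=512, shuffle=True)

optimizer = Adam(model.parameters(), lr=1e-3)                  # initial learning rate 0.001
scheduler = StepLR(optimizer, step_size=16, gamma=0.1)         # learning rate divided by 10 every 16 epochs
criterion = nn.CrossEntropyLoss()                              # cross entropy on the output logits

for epoch in range(40):                                        # 40 training epochs
    for x, t in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), t)
        loss.backward()
        optimizer.step()
    scheduler.step()
print("final learning rate:", scheduler.get_last_lr())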

Experiments and Discussion
We first explore the impact of the number of hybrid parallel modules. Figure 6 illustrates the performance with different numbers of hybrid parallel modules; we choose m = 0.1 and s = 10. The average accuracy from one hybrid parallel module to six hybrid parallel modules is 71.77%, 87.76%, 88.84%, 90.51%, 87.36% and 86.69%, respectively. We find that a single hybrid parallel module is not enough to learn the distribution of the dataset. However, when the number of modules is larger than 4, the performance of the network is also not ideal: too few modules cannot capture the distribution of the dataset, whereas too many modules make back propagation difficult and lead to network degradation. As a result, our proposed method needs a proper number of modules; in Figure 6, the best number is 4. The recognition rate of each modulation category at different SNRs is shown in Figure 7. The network used here stacks 4 hybrid parallel modules with m = 0.1 and s = 10, and the average accuracy is 90.42%. We find that AM-DSB is consistently confused with WBFM. 8PSK is misclassified at low SNR but its accuracy rises quickly, surpassing 90% at 0 dB. Although the performance of some modulation categories is not good at −4 dB, with the increase of SNR the accuracy of all modulation categories except AM-DSB exceeds 95%, which means that our proposed method performs well on digitally modulated signals and can also recognize some analog modulation categories. Next we analyze the confusion matrices at SNRs of −4 dB, 0 dB and 6 dB. The value −4 dB in Figure 8 is the lowest SNR considered, so we can observe the confusion at low SNR; at 0 dB the SNR has increased and we can see the change in the confusion matrix; at 6 dB the performance of our method becomes stable. AM-DSB is always confused with WBFM, although WBFM itself is recognized successfully. 16QAM and 64QAM are misclassified as each other at −4 dB and the accuracy improves gradually with the increase of SNR; this is because 16QAM and 64QAM both belong to the QAM modulation family. The same happens between 8PSK and QPSK: they are misclassified as each other at −4 dB, and the accuracy reaches 99% at 6 dB. In summary, the performance on digitally modulated signals is much better than on analog-modulated signals. We believe there are two reasons for this: first, the number of analog-modulated signals in this dataset is smaller than that of digitally modulated signals; second, the distance between analog-modulated signals is smaller than that between digitally modulated signals. We utilized t-SNE [54] to reduce the feature dimension. Figure 9 shows that, for an SNR of −4 dB, the cosine distance between 16QAM and 64QAM is very small, 8PSK and QPSK severely overlap, and GFSK and QPSK are also very close. These results are consistent with what we observed in the confusion matrices. From −4 dB to 6 dB, the distance between 16QAM and 64QAM expands and the accuracy increases, while AM-DSB and WBFM also partly separate from each other. We can also find that the well-separated features of different modulation categories are far away from each other even at an SNR of −4 dB.
Then we analyze the influence of the hyper parameters of AM-softmax. AM-softmax has two parameters: the margin m and the scale s. m controls the cosine margin, i.e., the difference between the cosine values on the two sides of the decision boundary. Table 2 reports the performance of the hybrid parallel network with different margins m; we stacked 4 hybrid parallel modules with s = 10, and all other parameters were fixed. We find that the difference between the maximum and minimum accuracy is less than 2%, which means the parameter m has little influence on our proposed method. This can be explained by the results above. In Figure 7, the accuracy of AM-DSB is always non-ideal, while the other categories already reach high values. In Figure 8 we can also see that, except for AM-DSB, almost all modulation categories are well classified at 6 dB; even at −4 dB, the classification accuracy of most modulation types is over 80%. In Figure 9, almost all features at the hidden layer are well separated from 0 dB onward, and the distance between categories is large except for the confused types. These results all indicate that most modulation types are already classified very well and that the cosine angular distance within each category is very small, except for WBFM and AM-DSB. As a result, whether the cosine angular margin m is small or large, it has limited influence on accuracy.
The performance of AM-softmax with different scales s is given in Table 3. We fix the network structure with 4 hybrid parallel modules and set m to 0.1. With the increase of the scale s, the performance of the hybrid parallel network improves until s = 10. The parameter s controls the rate of convergence: if s is too small, the gradients stay small, convergence is slow, and the network easily falls into poor local optima; therefore, s cannot be too small. As we can see in Table 3, when s is larger than 10, the accuracy becomes stable. As a consequence, we conclude that once s reaches a threshold, further growth of s has little effect on the outcome; from Table 3, the threshold of s in this study is 10. In Figure 10, we compare the softmax function with the AM-softmax function. We cascade 3 hybrid parallel modules with different output functions and choose m = 0.1 and s = 10. We applied AM-softmax and softmax respectively to the output of the same model, ran each 100 times, and calculated the average accuracy for each SNR. The average accuracies of AM-softmax and softmax are 87.33% and 86.36%, respectively, so AM-softmax is about 1% better than softmax. However, an improvement of 1% is within the margin of error; thus, we believe that AM-softmax achieves performance similar to softmax. This result provides a new idea for AMC research: other types of distance measurement functions can also achieve good results in addition to softmax. Finally, we compare our proposed model with some existing models, namely VTCNN [28], the LSTM model which takes amplitude and phase information as input [30], ResNet [35], the Dropout-replaced Convolutional Neural Network (DrCNN) [37], CLDNN [35], the Improved convolutional neural network based automatic modulation classification (IC-AMCNet) [36], and the ODAC Network [44]. The result is shown in Figure 11. According to our experimental results, our proposed hybrid parallel network is significantly superior to the others. Its highest recognition accuracy exceeds 93% at 8 dB, while the highest accuracy of the other methods does not exceed 85%. The lowest accuracy of our method exceeds 75% at −4 dB, whereas VTCNN and LSTM only go slightly beyond 75% even at 18 dB. The average accuracy of our model is at least 5% higher than that of the others. The best-performing comparison method is CLDNN, which combines CNN with LSTM; together with our proposed method, this shows that combining CNN with RNN can improve classification performance. ResNet also achieves good results among the other methods, which indicates that the residual structure can also improve classification performance. Our proposed method combines the advantages of these methods and achieves the best results among all methods.

Conclusions
In this paper, we propose a hybrid parallel network for the AMC problem. Our hybrid parallel network extracts features through a parallel structure, which extracts spatial features and temporal features simultaneously and concatenates them with the help of attention mechanisms. Three attention mechanisms are applied in the proposed method to reweight the features and highlight their more discriminative parts. At the output layer, instead of the conventional softmax function, we apply a cosine-similarity metric named AM-softmax to compress the within-class cosine angular distance. We first explore the influencing factors of our proposed method, including the number of hybrid parallel modules and the hyper parameters of the AM-softmax function. Then we analyze the accuracy of each modulation category and find that most modulation categories can be discriminated well for an SNR of 0 dB. We also compare AM-softmax with softmax while keeping the model fixed: the average accuracies of AM-softmax and softmax are 87.33% and 86.36%, respectively, which confirms that AM-softmax achieves performance similar to softmax within the margin of error. Finally, we compare our proposed method with other existing methods, and the experiments prove its effectiveness: the worst accuracy of our method is 75% at −4 dB, while the best accuracy of the baseline model just reaches 74%, and the average accuracy of our method is at least 5% higher than that of the comparison methods. As a result, our proposed method achieves competitive performance. Although our hybrid parallel structure achieves competitive performance, we did not take computational complexity into account, even though computational complexity plays an important role in the practicability of AMC methods. Our future work is to maintain the classification accuracy while reducing the computational complexity. Although AM-softmax achieves good results on the AMC problem, it has not shown a clear advantage over softmax; we will continue to work on cosine-similarity distances to explore more possibilities. Another direction of future work is to apply our hybrid parallel network to more complicated and larger datasets; if our method can handle more complicated environments, this will further demonstrate its superiority.

Data Availability Statement:
The data presented in this study are openly available on O'Shea's website [28], https://www.deepsig.io/datasets.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: