Modulation Recognition of Communication Signals Based on Multimodal Feature Fusion

Modulation recognition is an indispensable part of signal interception and analysis, and has long been a research hotspot in the field of radio communication. With the increasing complexity of the electromagnetic spectrum environment, interference in signal propagation is becoming more and more serious. This paper proposes a modulation recognition scheme based on multimodal feature fusion, which aims to improve the performance of modulation recognition under different channels. Firstly, different time- and frequency-domain features are extracted as the network input in the signal preprocessing stage. A residual shrinkage building unit with channel-wise thresholds (RSBU-CW) is used to construct deep convolutional neural networks that extract spatial features, which interact in pairs with the temporal features extracted by an LSTM to increase the diversity of the features. Finally, a PNN model is adopted to cross-fuse the features extracted from the network and enhance their complementarity. Simulation results indicate that the proposed scheme has better recognition performance than existing feature fusion schemes, and that it also achieves good recognition performance in multipath fading channels. Test results on the public dataset RadioML2018.01A show that the recognition accuracy exceeds 95% when the signal-to-noise ratio (SNR) reaches 8 dB.


Introduction
Modulation recognition mainly refers to analyzing noncooperative received signals through a series of processing steps to acquire their modulation types. Recognizing the modulation types of signals automatically, quickly, and accurately plays a key role in subsequent demodulation and analysis [1].
Since the publication of the first article on modulation recognition in 1969 [2], research on modulation recognition has matured considerably; it is mainly divided into recognition schemes based on maximum likelihood theory, schemes based on feature extraction, and schemes based on deep learning [3]. Schemes based on maximum likelihood theory were developed earliest. Their basic idea is that, according to the statistical characteristics of the signals and with minimizing the loss function as the goal, the log-likelihood function of the signals is obtained through theoretical derivation and calculation, and an appropriate threshold is then selected to compare the original signal with its log-likelihood function to obtain the predicted classification result [4]. In the noncooperative communication condition, the received signals contain many unknown parameters. Recognition schemes based on maximum likelihood theory can obtain the theoretically optimal solution, but they need a great deal of prior knowledge, have high computational complexity, and generalize poorly. Recognition schemes based on feature extraction transform the received signals into other domain features that better characterize the modulation types. Common features include instantaneous amplitude, phase, and frequency features, the high-order cumulant, the high-order cumulant spectrum, and

Signal Model
The baseband received signal can be expressed as

r(t) = x(t) ⊗ h(t) + n(t), (1)

where x(t) represents the baseband transmitted signal, n(t) represents the Gaussian white noise, h(t) represents the channel impulse response, and ⊗ denotes convolution. If the received signals are interfered with only by Gaussian white noise, h(t) = 1. If there exist multiple propagation paths, such as the direct beam, reflection, and refraction, the channel model can be expressed as

h(t) = Σ_{i=1}^{L} α_i(t) δ(t − τ_i(t)), (2)

where L represents the number of discrete multipath components, α_i(t) represents the attenuation factor of the received signals on the i-th propagation path, and τ_i(t) represents the propagation delay of the received signals on the i-th propagation path. Substituting Equation (2) into Equation (1), and accounting for the carrier phase rotation introduced by each path delay, we can get

r(t) = Σ_{i=1}^{L} α_i(t) x(t − τ_i(t)) e^{−jθ_i(t)} + n(t), (3)

where θ_i(t) = 2π f_c τ_i(t) and f_c is the carrier frequency.
According to Euler's formula, the multipath component of Equation (3) can be written in polar form in terms of the instantaneous envelope a(t) and phase θ(t) of the received signal:

Σ_{i=1}^{L} α_i(t) x(t − τ_i(t)) e^{−jθ_i(t)} = a(t) e^{−jθ(t)},

so the received signal is further simplified as

r(t) = a(t) e^{−jθ(t)} + n(t). (4)

Therefore, the received signal propagated over the multipath fading channel can be regarded as a sum of numerous time-varying vectors of amplitude and phase. If the channel is a Rayleigh fading channel, the envelope of the channel response at any time follows a Rayleigh distribution, and the phase follows a uniform distribution over (0, 2π) [17]. The corresponding probability density functions are

f(a) = (a/σ²) exp(−a²/(2σ²)), a ≥ 0, (5)

f(θ) = 1/(2π), θ ∈ (0, 2π); 0 otherwise, (6)

where σ² represents the average power of the signal. If the channel is a Rician fading channel, it can be viewed as the sum of a direct signal and multipath signal components following a Rayleigh distribution [18]. The probability density function of the envelope of the signal response can be expressed as

f(a) = (a/σ²) exp(−(a² + A²)/(2σ²)) I₀(aA/σ²), a ≥ 0, (7)

where A represents the amplitude of the direct signal and I₀ represents the zeroth-order modified Bessel function of the first kind. Next, the influence of the channel parameters on the received signal is analyzed [19]. The coherent bandwidth of the channel can be expressed as

B_c ≈ 1/T_d, (8)

where T_d represents the multipath delay. If the signal bandwidth is much larger than the coherence bandwidth, the amplitudes of some frequency components of the received signal are enhanced while those of others are attenuated, and frequency-selective fading occurs. If the signal bandwidth is much smaller than the coherence bandwidth, all frequency components of the received signal are subject to the same fading, and the signal experiences only flat fading.
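As a quick numerical check on the Rayleigh model above (an illustrative sketch, not part of the original scheme), a complex Gaussian channel tap can be drawn and its envelope and phase statistics compared with the stated distributions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 1.0, 200_000

# A Rayleigh-fading tap: real and imaginary parts are independent zero-mean
# Gaussians with variance sigma^2, so the envelope |h| is Rayleigh-distributed
# and the phase is uniform on (0, 2*pi).
h = rng.normal(0, sigma, n) + 1j * rng.normal(0, sigma, n)
envelope = np.abs(h)
phase = np.angle(h) % (2 * np.pi)

# Theoretical Rayleigh mean is sigma * sqrt(pi / 2) ~= 1.2533.
print(envelope.mean(), phase.mean())
```

The empirical envelope mean lands on σ√(π/2) and the phase mean on π, as the uniform-phase model predicts.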
The coherence time of the channel can be expressed as

T_c ≈ 1/f_doppler, (9)

where f_doppler represents the Doppler frequency shift. If the signal symbol period is much smaller than the channel coherence time, the channel changes more slowly than the signal, the interference caused by the frequency shift is not obvious, and slow fading occurs. If the signal symbol period is much larger than the channel coherence time, the channel changes faster than the signal, adjacent frequency components interfere with each other, and fast fading occurs.
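The two threshold rules above can be collected into a small helper (a sketch; the function name and parameters are our own, and B_c ≈ 1/T_d, T_c ≈ 1/f_doppler are the approximations used here):

```python
def classify_fading(signal_bw_hz, symbol_period_s, multipath_delay_s, doppler_hz):
    """Classify fading using the coherence-bandwidth and coherence-time rules."""
    coherence_bw = 1.0 / multipath_delay_s      # B_c ~ 1 / T_d
    coherence_time = 1.0 / doppler_hz           # T_c ~ 1 / f_doppler
    freq_fading = "frequency-selective" if signal_bw_hz > coherence_bw else "flat"
    time_fading = "fast" if symbol_period_s > coherence_time else "slow"
    return freq_fading, time_fading

# 100 kHz signal, 10 us symbols, 20 us delay spread, 50 Hz Doppler:
# B_c = 50 kHz < 100 kHz and T_c = 20 ms >> 10 us.
print(classify_fading(100e3, 10e-6, 20e-6, 50.0))  # frequency-selective, slow
```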

The Proposed Scheme
Multimodal technology has been widely used in modulation recognition. However, existing work either relies on the network to extract multi-scale feature maps [14], simply concatenates transformation-domain features [15], or interacts spatial-temporal features [16]. Feature fusion therefore deserves further exploration, so this paper proposes a modulation recognition scheme based on multimodal feature fusion to improve recognition performance under different channel interference; its framework is shown in Figure 1. Firstly, multiple transformation domains can provide multimodal information, so the I/Q waveform, the modulus and phase of the received signals, and the Welch spectrum, square spectrum, and fourth power spectrum are extracted from the time-frequency-domain perspective as network input [15]. Then, we consider how the networks learn and incorporate multimodal features. RSBU-CW12 is designed to extract spatial features of the signals. Inspired by [16], the raw I/Q signals are fed into the LSTM and the RSBU-CW12 to extract the temporal and spatial features of the signals, and the outer product operation is then performed to increase the diversity of the features. Since the outer product expands the feature dimension, a fully connected layer is used to reduce it. The modulus, phase, and spectrum features are fed into the RSBU-CW12 to extract their respective features. For the three groups of features extracted from the I/Q waveform, the modulus and phase, and the Welch spectrum, square spectrum, and fourth power spectrum, direct concatenation into the fully connected layer cannot fully fuse the features. We therefore adopt the PNN model to cross-fuse the features extracted from the network, so that the model can capture more key information.

Network Model Structure
To better extract signal features, a deep residual shrinkage network, RSBU-CW12, is designed, as shown in Figure 2. The RSBU-CW block is introduced into the convolutional layer of the network, and its processing flow is mainly as follows: the initial feature input F_0 is convolved twice to obtain the feature vector F_1, and F_1 is then fed into the sub-neural network with soft thresholding. First, the absolute value of F_1 is taken, and adaptive pooling and flattening are carried out to obtain the one-dimensional feature F_2; F_2 passes through two fully connected layers and a sigmoid operation to obtain F_3; F_4 is obtained by multiplying F_2 and F_3; redundant features are then eliminated by soft thresholding F_1 with the threshold F_4, yielding F_5; finally, the initial input F_0 and the soft thresholding result F_5 are added through the identity mapping to obtain the output, as shown in Figure 2a. Features extracted from the residual shrinkage module are then reduced through the fully connected layer to obtain a feature vector of size 1 × 50. Traditional image network models generally employ a 3 × 3 convolution kernel, but since the network input of RSBU-CW12 is a 2 × 1000 signal waveform, convolution kernels of 1 × 3 and 2 × 3 are adopted. To make the network fully learn the hopping information between symbol sequences, the pooling layer after the convolution operation is removed. As the number of network layers increases, a zero-padding operation is carried out before each convolution, and the convolution stride is set to one, to ensure that the deep network input retains enough feature information [20]. Batch normalization (BN) and dropout layers are also utilized to suppress overfitting.
Soft thresholding is a nonlinear transformation: it sets features whose absolute value is less than the threshold directly to zero, and "shrinks" features whose absolute value is greater than the threshold by subtracting the threshold from them. The thresholds are adjusted automatically through network training. The formula of soft thresholding and its derivative can be defined as

y = x − τ, x > τ;  y = 0, −τ ≤ x ≤ τ;  y = x + τ, x < −τ

∂y/∂x = 1, |x| > τ;  ∂y/∂x = 0, |x| ≤ τ

where x represents the feature input, y represents the feature output, and τ represents the threshold.
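The soft-thresholding rule, together with a channel-wise threshold of the F_2 × F_3 form described above, can be sketched in numpy as follows (the scaling factor alpha stands in for the learned FC + sigmoid branch and is an assumption for illustration):

```python
import numpy as np

def soft_threshold(x, tau):
    # Zero out features with |x| <= tau; shrink the rest toward zero by tau.
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def channelwise_threshold(features, alpha):
    # tau_c = alpha_c * mean(|x_c|): the per-channel average absolute value
    # (F_2) scaled by a sigmoid-bounded coefficient (F_3).
    return alpha * np.abs(features).mean(axis=-1, keepdims=True)

x = np.array([[-2.0, -0.5, 0.3, 1.5]])        # one channel of features
tau = channelwise_threshold(x, alpha=0.5)     # tau = 0.5 * 1.075 = 0.5375
y = soft_threshold(x, tau)                    # [[-1.4625, 0., 0., 0.9625]]
```

Small-magnitude entries are zeroed while large ones are shrunk, which is what suppresses noise-related redundant features.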

Multimodal Feature Fusion
The proposed scheme performs feature-fusion from the following three aspects.

Multimodal Feature Input in the Time-Frequency Domain
In the signal preprocessing stage, different domain-transformation features of the received signals are extracted from the time-frequency-domain perspective. The I/Q waveform, the modulus and phase, and the Welch spectrum, square spectrum, and fourth power spectrum are taken as network inputs. Figure 3 shows the time-frequency-domain feature inputs of the 12 modulation types when SNR = 18 dB.
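The preprocessing above can be sketched as follows (a simple periodogram stands in for the Welch estimate to keep the example dependency-free; the segment and window choices of the original are not specified, so none are assumed here):

```python
import numpy as np

def multimodal_inputs(iq):
    """Turn a 2 x N I/Q array into the time- and frequency-domain inputs."""
    s = iq[0] + 1j * iq[1]                     # complex baseband signal
    modulus, phase = np.abs(s), np.angle(s)    # instantaneous envelope / phase
    spec = lambda x: np.abs(np.fft.fftshift(np.fft.fft(x))) ** 2 / len(x)
    # The square and fourth-power spectra expose carrier-related spectral
    # lines of PSK/QAM signals, which helps distinguish modulation orders.
    return np.stack([modulus, phase]), np.stack([spec(s), spec(s ** 2), spec(s ** 4)])

iq = np.random.default_rng(1).normal(size=(2, 1000))   # toy 2 x 1000 sample
time_feats, spectra = multimodal_inputs(iq)            # (2, 1000), (3, 1000)
```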


Temporal and Spatial Feature-Fusion
For the I/Q waveform of the received signals, RSBU-CW12 is used to extract high-dimensional spatial features, yielding a feature vector f_a of size 1 × 50; meanwhile, LSTM is used to extract temporal features, yielding a feature vector f_b of size 1 × 50. To fully integrate the temporal and spatial features, the outer product operation is utilized to conduct pairwise interaction between the two groups of extracted features, obtaining a feature vector f_c of size 50 × 50. Finally, the feature vector f_c is reshaped to 1 × 2500, and its dimension is then reduced to 1 × 50 with fully connected layers.
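The fusion step can be sketched as follows (random vectors and weights stand in for the trained RSBU-CW12, LSTM, and fully connected layer):

```python
import numpy as np

rng = np.random.default_rng(0)
f_a = rng.normal(size=(1, 50))    # spatial features from RSBU-CW12 (stand-in)
f_b = rng.normal(size=(1, 50))    # temporal features from the LSTM (stand-in)

f_c = np.outer(f_a, f_b)          # pairwise interactions: f_c[i, j] = f_a[i] * f_b[j]
flat = f_c.reshape(1, 2500)       # flatten the 50 x 50 interaction map

W = rng.normal(size=(2500, 50))   # stand-in for the trained FC layer
fused = flat @ W                  # reduced back to a 1 x 50 vector
```

Every spatial feature multiplies every temporal feature, so the 50 × 50 map carries strictly more cross-information than concatenating the two 1 × 50 vectors.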

PNN Feature Cross Fusion
The proposed scheme extracts three groups of 1 × 50 feature vectors from the I/Q waveform, the modulus and phase, and the Welch spectrum, square spectrum, and fourth power spectrum. They are stacked to obtain a 3 × 50 feature vector. The PNN model is employed to replace the fully connected layer for recognition. The PNN model mainly adds a vector product layer between the feature inputs and the fully connected layer to improve the ability to learn cross-features, as shown in Figure 4.


The structure of the PNN model is mainly divided into the following parts:
(1) Features Input. The constant "1" represents the bias, and the feature input is the feature vector of size 3 × 50 extracted with the neural network, which can be defined as

f_input = [f_1, f_2, f_3], f_n ∈ R^{1×M}

where M = 50, and f_1, f_2, and f_3 represent the feature vectors of the I/Q waveform, the modulus and phase, and the Welch spectrum, square spectrum, and fourth power spectrum extracted with the neural network, respectively.
(2) Product Layer. f_input is fed into the product layer to obtain the linear eigenvector f_z and the nonlinear eigenvector f_p. f_z can be defined as

f_z^n = W_z^n ⊙ f_input

where N = 3 and W_z^n represents the weight of the linear part. Feature interaction adopts the inner product operation, so f_p can be defined as

f_p^n = Σ_{i=1}^{N} Σ_{j=1}^{N} θ_i^n θ_j^n ⟨f_i, f_j⟩ = ⟨Σ_{i=1}^{N} δ_i^n, Σ_{j=1}^{N} δ_j^n⟩

where δ_i^n = θ_i^n f_i, W_p^n represents the weight of the nonlinear part with (W_p^n)_{i,j} = θ_i^n θ_j^n, and i = 1, 2, . . . , N; j = 1, 2, . . . , N.
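The product layer can be sketched numerically as follows (D, the number of product-layer output nodes, and the random weights are illustrative assumptions); the last lines verify the rank-1 factorization identity used above:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, D = 3, 50, 8                  # 3 feature groups of 50 dims, D output nodes
f = rng.normal(size=(N, M))         # stacked feature groups f_1..f_3

# Linear part: each node is an elementwise-weighted sum over f_input.
W_z = rng.normal(size=(D, N, M))
f_z = np.einsum('dnm,nm->d', W_z, f)

# Nonlinear part with (W_p^n)_{ij} = theta_i^n * theta_j^n, so each node is
# <sum_i delta_i^n, sum_j delta_j^n> with delta_i^n = theta_i^n * f_i.
theta = rng.normal(size=(D, N))
delta_sum = np.einsum('dn,nm->dm', theta, f)
f_p = np.einsum('dm,dm->d', delta_sum, delta_sum)

# Cross-check node 0 against the explicit double sum over <f_i, f_j>.
gram = f @ f.T
f_p0 = sum(theta[0, i] * theta[0, j] * gram[i, j]
           for i in range(N) for j in range(N))
assert np.allclose(f_p[0], f_p0)
```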
(4) L2 Hidden Layer. The output of the second hidden layer can be expressed as

f_l2 = ReLU(W_2 f_l1 + b_2)

where W_2 represents the weight coefficient, b_2 represents the bias, and f_l1 is the output of the previous hidden layer.

Experimental Results Analysis
In this section, the performance of the multimodal feature-fusion scheme is evaluated. Simulation datasets generated for three different channels, namely Gaussian white noise, Rayleigh fading, and Rician fading, are used together with the public dataset RadioML2018.01A to verify the effectiveness of the scheme.

Simulation Results Analysis
Signal parameters of the simulation dataset are as follows: the modulation types are 16QAM, 32QAM, 64QAM, BPSK, QPSK, 8PSK, 16APSK, 32APSK, 64APSK, AM-DSB, AM-SSB, and FM; the symbol rate is 100 kBaud; the carrier frequency is 350 kHz; the oversampling factor is 10; the roll-off coefficient of the shaping filter is 0.35 and its time delay is 3; the SNR ranges from -10 dB to 18 dB in steps of 2 dB; 2200 signal samples are generated for each modulation type at each SNR, and the data format of each I/Q signal sample is 2 × 1000; the ratio of training samples to test samples is 10:1.
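One such sample might be generated as in the sketch below (QPSK shown; reading the "time delay is 3" parameter as a root-raised-cosine filter span of 3 symbols is our assumption):

```python
import numpy as np

def rrc_taps(beta, span, sps):
    """Root-raised-cosine taps: beta = roll-off, span in symbols, sps = samples/symbol."""
    t = np.arange(-span * sps, span * sps + 1) / sps
    h = np.zeros_like(t)
    for k, ti in enumerate(t):
        if abs(ti) < 1e-12:                        # t = 0 singularity
            h[k] = 1 - beta + 4 * beta / np.pi
        elif abs(abs(4 * beta * ti) - 1) < 1e-12:  # t = +-1/(4 beta) singularity
            h[k] = (beta / np.sqrt(2)) * ((1 + 2 / np.pi) * np.sin(np.pi / (4 * beta))
                                          + (1 - 2 / np.pi) * np.cos(np.pi / (4 * beta)))
        else:
            h[k] = (np.sin(np.pi * ti * (1 - beta))
                    + 4 * beta * ti * np.cos(np.pi * ti * (1 + beta))) \
                   / (np.pi * ti * (1 - (4 * beta * ti) ** 2))
    return h / np.sqrt(np.sum(h ** 2))             # unit-energy normalization

rng = np.random.default_rng(0)
sps, beta, n_sym = 10, 0.35, 100                   # oversampling 10, roll-off 0.35
bits = rng.integers(0, 2, size=(2, n_sym))
symbols = ((2 * bits[0] - 1) + 1j * (2 * bits[1] - 1)) / np.sqrt(2)  # QPSK
up = np.zeros(n_sym * sps, dtype=complex)
up[::sps] = symbols                                # zero-stuff to 1000 samples
taps = rrc_taps(beta, 3, sps)
shaped = np.convolve(up, taps, mode='same')        # pulse shaping
sample = np.stack([shaped.real, shaped.imag])      # the 2 x 1000 I/Q format
```

Noise and channel effects (AWGN, Rayleigh/Rician taps) would then be applied according to the chosen SNR and channel model.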
At present, deep learning is widely used in modulation recognition. In terms of feature input, approaches can generally be divided into two types: one directly inputs signal data into neural networks for recognition; the other transforms I/Q waveforms into other domain features by means of domain transformation and inputs them into the network for training in the form of images. According to our survey, there are relatively few studies that systematically compare these two types of inputs. Hence, four common feature inputs, namely the I/Q waveform, vector diagram, eye diagram, and time-frequency diagram, are selected for a preliminary study of their impact on recognition performance, as shown in Figure 5.
To better compare the influence of the different feature inputs, residual building units (RBUs) were adopted as the basic module to design three residual network models based on the ResNets structure, as shown in Figure 6. Since the data format of the I/Q waveform is 2 × 1000, the network convolution kernels are 1 × 3 and 2 × 3; the image format of the vector diagram, eye diagram, and time-frequency diagram is 224 × 224, so the network convolution kernel is 3 × 3. Table 1 shows the overall recognition accuracy of the different feature inputs, and Table 2 compares the complexity of the different network models. From the experimental results in Table 1, the I/Q waveform as network input has the highest recognition accuracy, so the I/Q waveform serves as the input to the neural network in subsequent experiments. Our preliminary analysis is that, when the I/Q waveform is taken as input, the network can extract features directly from the raw signal data; however, when the received signals are converted into other domain features and input in the form of images, the network captures features from the data distribution of the images, which inevitably leads to a loss of information. Comparing the three residual network models with Tables 1 and 2, the recognition effect of RBU1 is not ideal.
RBU24 has the highest recognition accuracy but numerous parameters and floating-point operations per second (FLOPs). The recognition accuracy of RBU12 is very close to that of RBU24, while its parameter count and FLOPs are considerably smaller. Comparing Figures 2b and 6c, RSBU-CW12 is RBU12 with an added sub-neural network for soft thresholding. Simultaneously, edge filling is carried out before each convolution to preserve boundary information. With the I/Q waveform as network input, the performance of RSBU-CW12 is compared with several other common modulation recognition network models, as shown in Table 3. As can be seen from Table 3, the recognition performance of RSBU-CW12 is better than that of the other network models. Compared with CLDNN(Bi-LSTM), which ranks second in overall recognition accuracy, the recognition accuracy of RSBU-CW12 is increased by 3.62%. Figure 7 shows the recognition accuracy curves of the different network models as the SNR changes. The recognition accuracy of RSBU-CW12 is higher than that of the other network models from -10 dB to 18 dB. When the SNR is 2 dB, the recognition accuracy of RSBU-CW12 exceeds 85%, while that of the other network models is below 80%. When the SNR exceeds 8 dB, the recognition accuracy is approximately 100%. This analysis further illustrates the advantages of the RSBU-CW12 network model in modulation recognition, so it is used as the basic feature extraction network in subsequent research.
To further enhance modulation recognition performance, the multimodal feature fusion methods of Section 3.2 are adopted and compared with existing feature fusion schemes [14][15][16], as shown in Table 4. As Figure 8 shows, after adding the feature-fusion methods, the recognition accuracy at low SNR is improved to some extent compared with RSBU-CW12 alone. When the SNR is 0 dB, the recognition accuracy exceeds 80%; when the SNR is 2 dB, it reaches approximately 88%; when the SNR is over 6 dB, it is approximately 100%. Meanwhile, the recognition performance of the feature-fusion scheme proposed in this paper is better than that of the other feature fusion schemes. The recognition accuracy of RSBU-CW12 is higher than those of the existing feature-fusion schemes, which indicates that RSBU-CW12 can extract more critical features, and the PNN model can better integrate multimodal features to enhance recognition performance.
Figure 9 gives the recognition performance of the proposed scheme. Figure 9a shows the recognition accuracy curve of each modulation type. High-order modulation signals, such as 32QAM, 64QAM, 16APSK, 32APSK, and 64APSK, are very difficult to recognize at low SNR. When the SNR is 6 dB, the recognition accuracy of all modulation types is more than 90%. Figure 9b shows the overall confusion matrix. The overall recognition accuracy of the low-order modulation signals BPSK and QPSK and the analog modulation signals AM-DSB, AM-SSB, and FM is over 90% and close to 100%. For QAM and APSK signals with modulation order higher than 16, the recognition accuracy is relatively low; the recognition accuracies of QAM, PSK, and APSK signals decrease as the modulation order increases.
In the actual signal propagation process, signals are affected not only by Gaussian white noise but also by multipath fading. Therefore, Rayleigh fading and Rician fading are, respectively, added to the simulation dataset, and the specific simulation channel parameters are listed in Table 5. Figure 10 shows the time-domain waveform (left) and the time-frequency spectrum (right) of QPSK. After passing through the multipath fading channels, the time-domain waveform is no longer flat. The coherent bandwidth of the Rayleigh fading channel is approximately 5 × 10^4 Hz, far less than the 100 kHz signal bandwidth, so frequency-selective fading occurs. The coherent bandwidth of the Rician fading channel is approximately 2 × 10^6 Hz, which is greater than the signal bandwidth, so the fading is flat; the Doppler frequency shift of both channels is much less than the symbol rate, so both channels exhibit slow fading.

Figure 11 compares the recognition performance over the different channels. Rayleigh fading and Rician fading cause different degrees of performance degradation, especially at low SNR. When the SNR is 0 dB, the recognition accuracies under Rayleigh fading and Rician fading fall below 70%. However, when the SNR is greater than 8 dB, the recognition accuracy still reaches more than 90%.


Public Dataset Validation
To further verify the performance of the proposed scheme, the public dataset RadioML2018.01A [10] is used for testing; its parameters are shown in Table 6. Figure 12 shows the recognition performance curves of the proposed scheme on the public dataset RadioML2018.01A. To facilitate observation, the recognition results of all the modulation types are divided into ASK+QAM, PSK+APSK, and low order+analog in Figure 12a-c. Similar to the analysis of Figure 9, compared with low-order modulation signals such as OOK and BPSK, high-order modulation signals such as 128APSK and 256QAM are more difficult to recognize and their accuracy is relatively lower. When the SNR is 4 dB, except for 16PSK (75%), the recognition accuracies of the other digital modulation signals of order 16 or less are over 90%, and the recognition accuracies of OOK, BPSK, QPSK, 8PSK, and 16APSK are close to 100%. When the SNR is 10 dB, except for AM-DSB-SC (81.84%) and AM-SSB-SC (84.77%), the recognition accuracies of the other signals are more than 90%, and most reach 100%; the recognition accuracy of 128APSK is 98.63%, that of 128QAM is 95.70%, and that of 256QAM is 90.04%. For analog modulation signals, the overall recognition accuracies of FM, AM-DSB-WC, and AM-SSB-WC are high, while the highest recognition accuracies of AM-DSB-SC and AM-SSB-SC are only 87.50% and 89.84%, respectively. Figure 12d shows the overall recognition accuracy, from which it can be seen that the proposed scheme achieves better performance than MSN, WSMF, and CNN-LSTM on the public dataset RadioML2018.01A. As the SNR improves, the recognition accuracy also increases. When the SNR is 4 dB, the overall recognition accuracy is 80.22%; when the SNR reaches 8 dB, it exceeds 95%, which further demonstrates the superiority of the proposed scheme.


Conclusions
This paper proposes a modulation recognition scheme based on multimodal feature fusion to improve modulation recognition performance under different channel interference. Firstly, experimental comparison shows that waveform data as network input achieve higher recognition performance than other domain-transformation features, so the I/Q waveform is adopted as the network input. To make more use of the useful information in the received signals, two groups of time-frequency-domain features, namely the modulus and phase, and the Welch spectrum, square spectrum, and fourth power spectrum, are extracted and fed into the network together with the I/Q waveform. The designed network RSBU-CW12 is used for spatial feature extraction, and the LSTM network is used for temporal feature extraction; the temporal and spatial features are interacted in pairs to increase feature diversity. The features extracted from the different inputs are further cross-fused with a PNN model to enhance recognition performance.
Compared with existing modulation recognition feature fusion schemes, the proposed scheme can effectively improve recognition performance. Under multipath fading channels, performance degrades, but the recognition effect remains good. In addition, experimental results on the public dataset RadioML2018.01A show that when the SNR is 4 dB, the overall recognition accuracy is 80.22%; when the SNR reaches 8 dB, the recognition accuracy exceeds 95%, which further illustrates the superiority of the proposed scheme.