Modulation Classification Using Compressed Sensing and Decision Tree–Support Vector Machine in Cognitive Radio System

In this paper, a blind modulation classification method is proposed that combines compressed sensing, higher-order cumulants and the cyclic spectrum with a decision tree–support vector machine classifier, in order to solve the problem of low identification accuracy under single feature parameters and to relax the performance requirements of the sampling system. By calculating the fourth-order and eighth-order cumulants and the cyclic-spectrum feature parameters within the compressed sensing framework, below the traditional Nyquist sampling rate, six different cognitive radio signals are effectively classified. Moreover, the influences of symbol length and compression ratio on the classification accuracy are simulated and the classification performance is improved, so that more signals can be identified while fewer feature parameters are used. The results indicate that accurate and effective modulation classification can be achieved, providing a theoretical basis and technical groundwork for the field of optical-fiber signal detection.


Introduction
As one of the booming communication technologies of the information era, modulation classification (MC) technology [1] is of great practical value in the field of wireless communication; for example, it plays an important role in communication reconnaissance, electronic countermeasures, signal authentication, interference identification, spectrum management, etc. At present, wireless communication networks maintain a steady and rapid development trend, network construction is increasingly integrated, and network applications are everywhere. At the same time, the inherent contradiction between centrally and statically configured networks and a dynamically changing environment causes serious problems such as the low utilization of spectrum resources in wireless communication networks. Therefore, cognitive radio (CR) technology [2][3][4][5] has been proposed and is considered a promising technology to solve these problems. MC plays an important role in CR based on spectrum sensing and feature analysis.
Cognitive radio has been widely accepted as a new technology in the field of wireless communication. In the cognitive radio network (CRN), in order to avoid interfering with the transmissions of the primary users, it is essential to accurately sense the presence of any contemporaneous primary-user transmission in the observed spectrum [6]. Erroneous detection of the primary user signal causes the secondary user to waste the spectrum opportunity. Noise, shadowing and multipath fading lead to a serious degradation of signal characteristics in conventional wireless channels.
Sensors 2020, 20, 1438

Feature Extraction Based on Higher-Order Cumulant (HOC)
For the wireless channel model, we exploit the property of the HOC that cumulants of order higher than two are insensitive to Gaussian noise. The kth-order cumulant C_{k,n}(m_1, m_2, ..., m_{k−1}) of a complex-valued stationary random process x(t) is defined as:

C_{k,n}(m_1, m_2, ..., m_{k−1}) = cum(x(t), x(t + m_1), ..., x(t + m_{k−1})) (1)

where the delays m_i are independent of t and cum(·) denotes the cumulant operation. Let M_{pq} = E[x(t)^{p−q} x*(t)^q] denote the pth-order mixed moment [19]. Based on the above theory, the fourth-order, sixth-order and eighth-order cumulants of the zero-mean process x(t) take the standard forms:

C_{4,0} = M_{40} − 3M_{20}^2
C_{6,0} = M_{60} − 15M_{40}M_{20} + 30M_{20}^3
C_{8,0} = M_{80} − 35M_{40}^2 − 28M_{60}M_{20} + 420M_{40}M_{20}^2 − 630M_{20}^4
In the practical application of MC, the HOC values of the signal must be estimated from the received symbol sequence in as short a time as possible. Sample estimates of the mixed moments are obtained by replacing the expectations with time averages over the received sequence. Substituting the estimated values into Equation (4), we obtain all of the features for the six considered wireless signal types. Table 1 shows these features for a number of the signals, computed under the constraint of unit variance in noise-free conditions; it can be seen that the signal types can be classified from these values.

Table 1 shows that OOK (on-off keying), DPSK (differential phase shift keying), QPSK (quadrature phase shift keying) and OQPSK (offset quadrature phase shift keying) have the same theoretical HOC values. In addition, 16QAM (16 quadrature amplitude modulation) and 64QAM (64 quadrature amplitude modulation) have similar HOC values. Therefore, we define a feature parameter T1 = |C_{8,0}| / |C_{4,0}|, calculated in Table 2, which divides the signals into three categories: (OOK, DPSK), (QPSK, OQPSK) and (16QAM, 64QAM). It is worth noting that the absolute-value and ratio form is used to eliminate the effects of phase jitter and amplitude [20].

Owing to the difference between the phase-transition rules of QPSK and OQPSK, a differential operation can be applied to the sampled sequences of both, where x(t) denotes the QPSK or OQPSK signal, a_k is the transmitted symbol sequence, f_c denotes the carrier frequency and θ_c denotes the phase jitter. For the sake of discussion, we assume that timing synchronization of f_c and θ_c has been completed. The HOC values under the differential operation are calculated in Table 3. We then define another feature parameter T2 = |C_{d8,0}| / |C_{d4,0}|^2, calculated in Table 4, to classify QPSK and OQPSK, where C_{d8,0} and C_{d4,0} represent the cumulants after the differential operation.
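As a concrete illustration of these estimators, the following minimal numpy sketch (the function names are ours, not the paper's) computes the sample mixed moments, the C_{4,0} and C_{8,0} cumulants from the standard zero-mean expansions above, and the ratio feature T1:

```python
import numpy as np

def mixed_moment(x, p, q):
    """Sample estimate of M_pq = E[x^(p-q) * conj(x)^q]."""
    return np.mean(x ** (p - q) * np.conj(x) ** q)

def c40(x):
    """Fourth-order cumulant C_{4,0} = M_40 - 3*M_20^2 for zero-mean x."""
    return mixed_moment(x, 4, 0) - 3 * mixed_moment(x, 2, 0) ** 2

def c80(x):
    """Eighth-order cumulant C_{8,0} via its standard zero-mean expansion."""
    m20, m40, m60, m80 = (mixed_moment(x, k, 0) for k in (2, 4, 6, 8))
    return (m80 - 35 * m40 ** 2 - 28 * m60 * m20
            + 420 * m40 * m20 ** 2 - 630 * m20 ** 4)

def t1(x):
    """Feature T1 = |C_{8,0}| / |C_{4,0}| (ratio removes amplitude effects)."""
    return abs(c80(x)) / abs(c40(x))

# Ideal unit-variance QPSK constellation: theory gives C_{4,0} = -1, C_{8,0} = -34.
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
```

On a finite noisy symbol sequence these sample estimates fluctuate around the theoretical values of Table 1, which is why the SNR sweeps later in the paper matter.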

Feature Extraction Based on Cyclic Spectrum
Since T1 is identical or similar for (OOK, DPSK) and (16QAM, 64QAM), a cyclic spectral density function with inherent noise suppression is used for their identification. Assume x(t) is a cyclostationary signal whose mean value and autocorrelation function are periodic with period T_0:

m_x(t + T_0) = m_x(t),  R_x(t + T_0, τ) = R_x(t, τ)

where τ is the delay variable. Because the autocorrelation function is periodic in t, it can be written as a Fourier series:

R_x(t, τ) = Σ_α R_x^α(τ) e^{j2παt}

where α stands for the frequency corresponding to the instantaneous autocorrelation and is often called the cyclic frequency. In addition, R_x^α(τ) is the Fourier-series coefficient, given by:

R_x^α(τ) = (1/T_0) ∫_{−T_0/2}^{T_0/2} R_x(t, τ) e^{−j2παt} dt

The Fourier transform of the cyclic autocorrelation function can be written as:

S_x^α(f) = ∫ R_x^α(τ) e^{−j2πfτ} dτ

where S_x^α(f) is the cyclic spectral density and f is the spectral frequency. R_x^α(τ) can be seen as the cross-correlation of the two frequency-shifted components u(t) = x(t)e^{−jπαt} and v(t) = x(t)e^{+jπαt} of x(t). From Equation (10) we obtain S_x^α(f) = S_{uv}^α(f). Through cross-spectrum analysis, the cyclic spectral density can then be estimated by time-smoothing the cyclic periodogram computed with the short-time Fourier transform (STFT), where Equation (12) is the estimate of the cyclic spectral density, Equation (13) is the cyclic periodogram, Equation (14) is the STFT formula, ∆t is the length of the received data, T_0 is the window length of the STFT, and (·)* is the complex conjugate.
According to the insensitivity of the cyclic spectrum to noise and the above theory, the characteristic parameter T3 = max(S xα ) is defined to distinguish the signal set of (OOK, DPSK) and (16QAM, 64QAM).
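A minimal numpy sketch of this frequency-shift estimator follows. It assumes the unsmoothed cross-periodogram form S_x^α(f) ≈ FFT(u)·FFT(v)*/N built from the two shifted components u(t) and v(t) defined above; the function names and the candidate-α search in `t3` are illustrative, not the paper's implementation:

```python
import numpy as np

def cyclic_periodogram(x, alpha, fs=1.0):
    """Cross-periodogram of the frequency-shifted components
    u(t) = x(t) e^{-j pi alpha t} and v(t) = x(t) e^{+j pi alpha t},
    an (unsmoothed) estimate of the cyclic spectral density S_x^alpha(f)."""
    n = len(x)
    t = np.arange(n) / fs
    u = x * np.exp(-1j * np.pi * alpha * t)
    v = x * np.exp(+1j * np.pi * alpha * t)
    return np.fft.fft(u) * np.conj(np.fft.fft(v)) / n

def t3(x, alphas, fs=1.0):
    """T3 = max over candidate cyclic frequencies of |S_x^alpha(f)|."""
    return max(np.abs(cyclic_periodogram(x, a, fs)).max() for a in alphas)
```

For example, a pure cosine at frequency f0 exhibits a strong cyclic feature at α = 2·f0, while stationary white noise averages toward zero at every α ≠ 0, which is the noise-suppression property exploited here.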

Compressed Value of HOC
CS techniques perform successfully when applied to so-called compressible and/or K-sparse signals, i.e., signals that can be represented by K ≪ N significant coefficients over an N-dimensional basis [21,22]. Compression of the K-sparse signal s(t) ∈ R^N of dimension N is accomplished by computing a measurement vector y(t) ∈ R^M that consists of M ≪ N linear projections of the vector s(t). The compressed sampling rate is f_cs = (M/N) f_s, where f_s is the traditional sampling rate and δ = M/N ∈ (0, 1) is called the compression ratio. The linear compressed sampling process can be described as:

y = Φs (15)

where Φ represents an M × N matrix, usually over the field of real numbers. It is noted that the measurement matrix Φ is a random matrix satisfying the restricted isometry property (RIP) [23], and it can take various forms, such as the Gaussian matrix [24] and the partial Hadamard matrix [25].
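The measurement model y = Φs with a Gaussian measurement matrix can be sketched in a few lines of numpy (the dimensions below are illustrative, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

N, M, K = 256, 64, 5        # ambient dimension, measurements, sparsity
delta = M / N               # compression ratio, here 0.25

# K-sparse signal s in R^N: K nonzero coefficients on a random support.
s = np.zeros(N)
support = rng.choice(N, K, replace=False)
s[support] = rng.standard_normal(K)

# Gaussian measurement matrix; such matrices satisfy the RIP with
# high probability when M is on the order of K * log(N / K).
Phi = rng.standard_normal((M, N)) / np.sqrt(M)

y = Phi @ s                 # y = Phi s: M << N linear projections of s
```

Each entry of y mixes all N samples of s, which is why the later sections derive feature parameters directly from y instead of first reconstructing s.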
According to the theory of feature extraction and compressed sensing, the linear square compressed sampling process can be defined with [[·]] representing the element-wise product of the corresponding vector entries. Starting from Equation (15), the reconstruction can then be simplified within the CS framework.
where vec(AXB) = (B^T ⊗ A)vec(X), ⊗ denotes the Kronecker product, and P is the projection matrix defined in [26]. According to Equation (4), the compressed value of the fourth-order cumulant can be defined through the discrete Fourier transform (DFT) matrix F; the linear representation of the vector-form C_{(4,0)δ} then follows, where [·]† stands for the pseudo-inverse operation. Finally, by deriving the linear compressed sampling process of the fourth-order cumulant, the compressed value of the eighth-order cumulant is obtained in the same way. Therefore, the first and second characteristic parameters after CS are T1 = |C_{(8,0)δ}| / |C_{(4,0)δ}| and T2 = |C_{d(8,0)δ}| / |C_{d(4,0)δ}|^2, the latter computed after the differential operation.

Compressed Value of Cyclic Spectrum
It can be seen from Equations (10) and (16) that there is no direct linear relationship between the compressed sampled value x in the time domain and the cyclic spectrum S_x^α, so the existing reconstruction algorithms cannot be used directly to estimate the cyclic spectrum. It is necessary to use explicit linear relations between the second-order statistics to derive its transformation and indirectly establish the linear relationship between them, so that the existing reconstruction algorithms can complete the estimation of the cyclic spectrum [27].
In order to obtain the compressed values of the cyclic spectrum, we first need the relationship between the cyclic spectrum matrix S_sδ and the cyclic autocorrelation matrix R_sδ(u, v). Let R_sδ(u, v) take the form of the time-varying covariance matrix R. When x is real-valued, R is a symmetric positive semi-definite matrix. For convenience of calculation, we convert it into an auxiliary covariance matrix, where N is the number of sampling points.
The matrix R contains all the elements of the vector R_sδ except the zero elements. The relationship between R_sδ and R can be expressed through H_N ∈ {0, 1}^{N^2 × (N(N+1)/2)}, where vec{·} denotes the vectorization operation, so that the vector R_sδ is mapped to vec{R}. The relationship between the cyclic spectrum matrix and the cyclic autocorrelation matrix then follows. The time-varying covariance matrix R_x = E[x_m x_m^T] of the compressed value x is also a symmetric positive semi-definite matrix, which can be rearranged into a vector R_xδ of length M(M+1)/2. Through linear conversion, we define two projection matrices P_m ∈ {0, 1}^{N^2 × (N(N+1)/2)} and Q_m ∈ {0, 1/2, 1}^{(M(M+1)/2) × M^2} that map the entries of x and s to those of vec(R_x) and vec(R_s), respectively, where P_m and Q_m are special mapping matrices.
Because x(t) = Φs(t), the covariance matrices satisfy R_x = ΦR_sΦ^T. Following Equation (24), vec(R_sδ) can then be expressed through the matrix Ω defined in the derivation. Through Equations (24), (27) and (28), we can express the measurement vector x as a linear function of the vector-form cyclic spectrum S_sδ, with Ξ = (F^{−T} ⊗ I_N)^{−1}, where I_N is the N-dimensional identity matrix. Therefore, the third characteristic parameter after CS is T3 = max(S_sδ).

The Principle of Support Vector Machine
Support vector machine (SVM) is based on the principle of structural risk minimization [28][29][30]. Its final solution can be transformed into a quadratic convex programming problem with linear constraints. There is no local minimum problem. By introducing the kernel function, the linear SVM can be simply extended to the non-linear SVM, and there is almost no additional computation for high-dimensional samples.
The main idea can be seen from Figure 1: for a linearly separable case, the classification margin is maximized to seek the optimal hyperplane H, while H1 and H2 are hyperplanes that pass through the samples closest to H and are parallel to it; the distance between them is called the classification margin. In the linearly non-separable case, the samples in the low-dimensional input space are transformed into a high-dimensional feature space by a non-linear mapping, so that they become linearly separable and the problem can be solved in the high-dimensional feature space by linear analysis methods.

Suppose that the training set is (x_i, y_i), i = 1, 2, ..., L, with expected output y_i ∈ {+1, −1}, where +1 and −1 represent the two classes: if x_i ∈ R^n belongs to the first class, the corresponding output is y_i = +1; if it belongs to the second class, y_i = −1. Linear separability means that there is a hyperplane (w · x) + b = 0 that places the positive and negative training points on its two sides. When the training set is not completely linearly separable, we introduce slack variables ξ_i ≥ 0, i = 1, 2, ..., L, and the objective function is transformed into:

min_{w,b,ξ} (1/2)||w||^2 + C Σ_{i=1}^{L} ξ_i   s.t.  y_i((w · x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0 (30)

where C ≥ 0 is the penalty parameter. A larger C imposes a larger penalty for misclassification, and C is the only adjustable parameter of the algorithm. By choosing a proper kernel function K(x_i, x_j) and using the Lagrange multiplier method to solve Equation (30), the corresponding dual problem is obtained as:

max_α Σ_{i=1}^{L} α_i − (1/2) Σ_{i=1}^{L} Σ_{j=1}^{L} α_i α_j y_i y_j K(x_i, x_j)   s.t.  Σ_{i=1}^{L} α_i y_i = 0,  0 ≤ α_i ≤ C (31)

Equation (31) yields the optimal solution α* = (α*_1, α*_2, ..., α*_L); selecting a positive component 0 < α*_j < C of α*, we calculate b* = y_j − Σ_{i=1}^{L} y_i α*_i K(x_i, x_j) accordingly. Finally, the decision function is f(x) = sgn(Σ_{i=1}^{L} y_i α*_i K(x_i, x) + b*).
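The soft-margin objective and its kernelized dual above are exactly what an off-the-shelf SVM solver implements. A minimal sketch using scikit-learn's `SVC` follows; the two Gaussian feature clusters and the parameter values C = 10 and gamma = 0.5 are illustrative assumptions, not the paper's data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Two toy 2-D feature clusters standing in for extracted feature vectors.
x_pos = rng.normal(loc=+2.0, scale=0.5, size=(100, 2))
x_neg = rng.normal(loc=-2.0, scale=0.5, size=(100, 2))
X = np.vstack([x_pos, x_neg])
y = np.array([+1] * 100 + [-1] * 100)

# C is the penalty parameter of the soft-margin objective (30); the RBF
# kernel K(x, x') = exp(-gamma * ||x - x'||^2) gives a nonlinear boundary.
clf = SVC(C=10.0, kernel="rbf", gamma=0.5).fit(X, y)
acc = clf.score(X, y)
```

Internally `fit` solves the dual problem (31); the support vectors are the samples with α*_i > 0, available as `clf.support_vectors_`.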

The Structure of Decision Tree-Support Vector Machine Classifier
Through the above calculation and analysis of the compressed values of the feature parameters based on CS, the feature parameters are fed into the computationally efficient decision tree–support vector machine (DT-SVM) structure to realize signal classification. Placing a support vector machine (SVM) at every node of the decision tree combines the efficient computing power of the decision-tree structure with the high classification performance of the SVM, achieving high-precision MC. In particular, for a K-class classification problem, the DT-SVM method only needs to construct K−1 SVM sub-classifiers. The classification process of the decision tree is shown in Figure 2.

Using the three features described above to complete the MC, the specific steps of the process are as follows, with the intermediate steps (2)–(4) following the decision-tree structure of Figure 2:

(1) Three feature vectors are obtained from the six kinds of wireless modulation signal data through the feature extraction module;
(5) By SVM-3 and T1, the remaining signals are divided into two categories: (QPSK, OQPSK) and (16QAM, 64QAM);
(6) T2 after the differential operation and SVM-4 are used to classify QPSK and OQPSK;
(7) Finally, the classification of the 16QAM and 64QAM signals is realized by T3 and SVM-5.
It can be seen from Figure 3a that the compressed value of the feature parameter T1 stabilizes as the SNR increases and conforms to the theoretical value. It is also evident from the figure that T1 classifies the six signals into the three categories (OOK, DPSK), (QPSK, OQPSK) and (16QAM, 64QAM). Similarly, Figure 3b shows that the feature parameter T2 after the differential operation distinguishes QPSK from OQPSK. The cyclic spectra and the cyclic-frequency cross-sections of OOK, DPSK, 16QAM and 64QAM are shown in Figure 4; the maximum value of the cyclic spectrum differs between the signals, so the remaining signals can be distinguished by T3.

Simulation Results and Discussion
For the signal modulation classification task, the modulation set is {OOK, DPSK, QPSK, OQPSK, 16QAM, 64QAM}. In the simulation, all modulation signals adopt the same modulation parameters: the carrier frequency is 100 kHz, the symbol rate is 40 kbps, and the sampling frequency is 800 kHz. For each kind of modulation signal, the simulation generates 500 feature samples under different SNRs (from −5 dB to 15 dB in 5 dB steps). The K-fold cross-validation (K-CV) method is used to evaluate the generalization ability of the model; it improves data utilization and also mitigates overfitting to a certain extent, enabling model selection. In this paper, K is chosen as 10. The basic principle of K-CV is to divide the training data set into K equal subsets; each time, K−1 subsets are used as training data and the remaining subset as test data. Repeating this K times, the expected generalization error is estimated from the average mean square error (MSE) over the K iterations, and finally a group of optimal parameters is selected.
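The K-CV split described above can be sketched in a few lines of numpy; the 500-sample size matches the simulation setup, while the function name and seed are illustrative:

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k near-equal folds;
    each fold serves once as the test set while the other k-1 folds
    form the training set."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    return np.array_split(idx, k)

folds = k_fold_indices(500, k=10)
splits = []
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    splits.append((train_idx, test_idx))
```

Averaging the MSE of the classifier over the 10 `(train_idx, test_idx)` splits gives the cross-validated error used for parameter selection.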
To evaluate the influence of different symbol lengths on the classification performance, we select symbol lengths N = 512, 1024, 2048 and 4096 for the OOK, QPSK and 16QAM signals (as examples) and analyze the classification accuracy in Figure 5. As can be seen from Figure 5, the classification accuracy gradually increases with the symbol length and finally approaches 100%; however, the symbol length affects the classification accuracy mainly at low SNR. Considering the computational cost of the classification process, 2048 is chosen as the symbol length in this paper.

To evaluate the impact of different compression ratios on the classification performance, we analyzed the classification accuracy in Figure 6 for compression ratios of 25%, 37.5%, 50% and 75% for the OOK, QPSK and 16QAM signals (as examples) with a symbol length of 2048. As can be seen from Figure 6, the classification accuracy increases slightly with the compression ratio and finally approaches 100%; the increase in compression ratio has little effect on the recognition rate. Therefore, in order to reduce the sampling rate and system complexity as much as possible, 25% is selected as the compression ratio.
To optimize the kernel-function parameters and improve the classification accuracy, the grid search method is used. Table 5 shows the optimized penalty parameter and kernel parameter for each node of the DT-SVM and the classification accuracy of each sub-SVM under the RBF kernel function. The table shows that the classification accuracy of the sub-nodes improves as the SNR increases; when the SNR is 0 dB or above, the average classification accuracy reaches 100%.
To demonstrate the superiority of this method in recognition accuracy, Table 6 shows the classification accuracy for the six different cognitive radio signals using multidimensional HOC, the cyclic spectrum and the DT-SVM classifier with the RBF kernel function. The training and testing subsets are 80% and 20% of the whole set of feature vectors, respectively. The results show that the classification accuracy of the six signals improves with increasing SNR and reaches 100% at an SNR of 0 dB, proving that the method retains high recognition accuracy at low SNR. In addition, a new modulation signal can be introduced, which extends the flexibility and compatibility of the method, although this would increase the complexity of the algorithm and the classification time of the whole system.

Conclusions
We have proposed and simulated a method to identify cognitive radio signals based on compressed sensing combined with HOC and the cyclic spectrum, which has been shown to perform well in noisy conditions. It reconstructs the feature parameters of the HOC and the cyclic spectrum directly from sub-Nyquist-rate samples rather than reconstructing the original signal. The simulation results indicate that the method can effectively classify six kinds of cognitive radio signals with three feature parameters. The proposed method is relatively simple, uses few feature parameters, and these parameters are only weakly affected by noise. By analyzing the effect of symbol length and compression ratio on the classification rate, the classification performance is improved, and the classification accuracy reaches 100% at an SNR of 0 dB. The technique uses a CS sampling rate much lower than the Nyquist rate together with noise-insensitive feature extraction, realizes blind classification without any prior information from the transmitter, and has low computational complexity.