Automatic Modulation Classification with Neural Networks via Knowledge Distillation

Abstract: Deep learning is widely used for automatic modulation recognition, and the need for high classification accuracy has driven the use of ever deeper networks. However, deep networks are computationally expensive to train and run, so their utility on mobile devices with limited memory or weak computational power is questionable. As a result, a trade-off between network depth and classification accuracy must be considered. To address this issue, this study used a knowledge distillation method to improve the classification accuracy of small network models. First, we trained Inception-Resnet as a teacher network, with a size of 311.77 MB and a final peak classification accuracy of 93.09%. We then used the method to train convolutional neural network 3 (CNN3), raising its peak classification accuracy from 79.81% to 89.36% with a network size of 0.37 MB, and similarly to train mini Inception-Resnet, raising its peak accuracy from 84.18% to 93.59% with a network size of 39.69 MB. Comparing all classification accuracy peaks, we found that knowledge distillation improved the small networks and that a student network can even outperform its teacher. With knowledge distillation, a small network model can achieve the classification accuracy of a large network model. In practice, choosing an appropriate student network based on the constraints of the deployment conditions while using knowledge distillation (KD) is a way to meet practical needs.


Introduction
A wealth of specialized expertise has become available in the field of communications due to the rapid development of communication technology, which ensures the accurate transfer of data. It is a complex and mature engineering field with many distinct areas of investigation that have all seen diminishing returns in improved performance, particularly on the physical layer [1]. Modulation types of communication signals diversify and become more complex along with the wireless communication environment, which puts increased demand on the modulation identification of signals. Deep learning is brought into signal modulation identification by employing a convolutional neural network (CNN) approach to identify the modulation types of signals, in order to further explore and solve the problem of modulation recognition.
New classification problems have arisen as a result of emerging wireless technology, which means that automatic modulation classification (AMC) in real-world environments continues to be a dynamic research field [2]. A DL-based system (or processing block) that does not require a mathematically tractable model and channel might be able to optimize the function of a communications system better [1], so deep learning is useful for solving the problem of new modulation recognition. Early work investigated neural networks on complex-valued temporal data, demonstrated that network depth does not limit wireless modulation recognition, and suggested that future research focus on synchronization and equalization [3][4][5]. In this paper, the classification accuracy of a CNN with three convolutional layers was improved through knowledge distillation from a high-precision network. Work related to neural networks and knowledge distillation is introduced in Section 2. Experiments are set up in Section 3 to train and test the proposed scheme, and the results are compared to select the best one. In Section 4, modulation recognition classification accuracy with and without KD is compared.

Basic Principle of Signal Modulation
Modulation identification means that for a given received signal r(t), 0 ≤ t ≤ T, the modulation type of r(t) is selected and identified from the set of C possible modulation types {ω_1, ω_2, ..., ω_C}. AMC combines signal processing and pattern recognition, but because communication signals and channel noise are typically modeled as stochastic processes, they are coupled with unknown signal fading, multipath propagation, and interference effects. As a result, modulation mode identification is essentially a recognition problem with multiple unknown parameters, and AMC is very important in both collaborative and non-collaborative domains.
If the channel is assumed to be ideal for a general signal model and carrier and timing synchronization are not taken into account, the modulated signal can be uniformly modeled as

r^(i)(t) = I^(i)(t) cos(2π f_c t + θ) − Q^(i)(t) sin(2π f_c t + θ) + n(t), 0 ≤ t ≤ T, (1)

where i denotes the modulation type; r^(i)(t) is the received complex signal; t denotes the simulation time; n(t) denotes the noise during signal transmission; I^(i)(t) and Q^(i)(t) denote the in-phase and quadrature components of the low-pass equivalent signal, respectively; f_c is the carrier frequency; θ is the initial phase of the carrier; and T is the observed signal duration.
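A minimal NumPy sketch of the signal model above; the function name and all parameter values below are illustrative, not taken from the paper:

```python
import numpy as np

def modulated_signal(I, Q, fc, theta, t, noise_std=0.1):
    """Sketch of r(t) = I(t)cos(2*pi*fc*t + theta) - Q(t)sin(2*pi*fc*t + theta) + n(t),
    with n(t) modeled as white Gaussian noise (illustrative assumption)."""
    rng = np.random.default_rng(0)
    n = rng.normal(0.0, noise_std, size=t.shape)
    carrier = 2 * np.pi * fc * t + theta
    return I * np.cos(carrier) - Q * np.sin(carrier) + n

t = np.linspace(0.0, 1.0, 1000)   # observation window [0, T]
r = modulated_signal(I=1.0, Q=1.0, fc=10.0, theta=0.0, t=t)
print(r.shape)
```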

SNR and Accuracy
In digital signal processing (DSP), we deal with extremely large and extremely small numbers together (e.g., the strength of a signal compared with the strength of the noise). The logarithmic decibel (dB) scale gives us more dynamic range when we express or plot such numbers.
For a given value x, we can represent x in dB using the following formula:

x_dB = 10 log10(x).

The signal-to-noise ratio (SNR) measures the difference in strength between the signal and the noise, and in practice it is almost always expressed in dB.
If someone says "SNR = 0 dB" it means the signal and noise power are the same. A positive SNR means our signal is stronger than the noise, while a negative SNR means that the noise is stronger. Detecting signals at a negative SNR is usually pretty tough.
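The dB conversion above can be sketched in a few lines; the function name is ours, not from the paper:

```python
import math

def to_db(x):
    """Convert a linear power ratio to decibels: x_dB = 10 * log10(x)."""
    return 10 * math.log10(x)

print(to_db(1.0))    # equal signal and noise power -> SNR = 0 dB
print(to_db(100.0))  # signal 100x stronger than noise -> +20 dB
print(to_db(0.1))    # noise 10x stronger than signal -> negative SNR
```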
To calculate accuracy, one needs the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) counts. A true positive means that the real class is yes and the predicted class is also yes, whereas a true negative means that the real class and the predicted class are both no. A false positive means that the real class is no while the predicted class is yes, whereas a false negative means that the real class is yes but the predicted class is no.
Because all samples were retrieved in this paper, only TP and FP were required: TP indicated that a signal predicted to be a given modulation was indeed that modulation; FP indicated that a signal predicted to be a given modulation was actually another. The accuracy metric is the ratio of correct predictions to the total number of predictions evaluated:

Accuracy = TP / (TP + FP). (5)
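Equation (5) translates directly into code; the function name is illustrative:

```python
def accuracy(tp, fp):
    """Accuracy over retrieved samples: TP / (TP + FP), Equation (5)."""
    return tp / (tp + fp)

# e.g., 90 correct predictions out of 100 total predictions
print(accuracy(90, 10))  # 0.9
```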

CNN
A CNN is designed to handle data with a grid-like structure: at least one layer of the network uses a special linear operation called convolution rather than the more common matrix multiplication [27].
In CNN terminology, the first argument x of the convolution is usually called the input and the second argument w is called the kernel. The output is called the feature map. In general, when a computer processes data, time is discretized so that the moment t can only take integer values; assuming that both x and w are defined at integer moments t [27], the convolution in discrete form is defined as Equation (6):

s(t) = (x ∗ w)(t) = Σ_a x(a) w(t − a). (6)

In machine learning, the kernel is usually a multidimensional array of parameters optimized by a learning algorithm, and the input data are mostly arrays; these multidimensional arrays are generally referred to as tensors. In practice, convolutional operations are generally carried out in multiple dimensions. For example, a two-dimensional image I is taken as input and convolution is performed using a two-dimensional kernel K as in Equation (7):

S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n). (7)
Convolution is commutative, so we can equivalently write Equation (8):

S(i, j) = (K ∗ I)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n), (8)

because the kernel can be flipped relative to the input. In the application of neural networks, however, libraries typically use the cross-correlation function, which is nearly identical to convolution but does not flip the kernel, as in Equation (9):

S(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n). (9)

Figure 1 depicts a two-dimensional convolution. The convolution kernel is a 2 × 2 matrix, and the input data is a 3 × 4 matrix. A window of the same size as the convolution kernel moves over the input data; the data in the window are multiplied by the values at the corresponding positions of the convolution kernel, and the products are summed to produce one result. The result matrix represents the convolution of the input data with the convolution kernel. In practice, a convolution layer consists of multiple parallel kernels, because a convolution with a single kernel can only extract one type of feature, and although the kernel acts on multiple spatial locations, we usually want each layer of the network to extract multiple types of features at multiple locations [27]. A CNN, which is a deep-learning algorithm, can automatically classify and identify features without any human intervention [28]. A visible trend in NNs for classification is building deeper networks to learn more complex functions and hierarchical feature relationships; deep networks enable more complex functions to be learned more readily from raw data [29]. Deep neural networks are typically used in three steps to solve modulation classification problems: first, design the network architecture; next, train the network to select weights that minimize the loss; third, validate and test the network on the problem [30].
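The sliding-window computation described for Figure 1 can be sketched in NumPy; this is the unflipped cross-correlation that most neural-network libraries compute, and the function name and example values are ours:

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Valid-mode 2D cross-correlation: the kernel slides over the
    input without being flipped, as in most NN libraries."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # elementwise product of the window and the kernel, then sum
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A 3 x 4 input and a 2 x 2 kernel, as in Figure 1: the result is 2 x 3
I = np.arange(12, dtype=float).reshape(3, 4)
K = np.array([[1.0, 0.0], [0.0, 1.0]])
print(cross_correlate2d(I, K))
```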

Residual Network
When deeper networks are able to start converging, a degradation problem is exposed: as the network depth increases, accuracy becomes saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error. Ref. [31] addressed the degradation problem by introducing a deep residual learning framework: instead of hoping that each few stacked layers directly fit a desired underlying mapping, they explicitly let these layers fit a residual mapping.
A deep residual learning framework is depicted in Figure 2. Following the activation function, the input data matrix is processed in two parallel ways: one leaves the data unchanged, and the other feeds the data through two convolutional layers; finally, the two paths are summed and fed into the activation function. The original function becomes Equation (10):

y = F(x, {W_i}) + x, (10)

where x and y are the input and output vectors of the layers and the function F(x, {W_i}) represents the residual mapping to be learned. The dimensions of x and F must be equal. If this is not the case, we can perform a linear projection W_s via the shortcut connection to match the dimensions, as in Equation (11):

y = F(x, {W_i}) + W_s x. (11)

W_s is only used for matching dimensions [31]. Residual connections are not inherently necessary for training very deep convolutional models, but they seem to improve the training speed greatly, which is a great argument for their use [32].
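A minimal sketch of Equation (10), the residual mapping with an identity shortcut; plain matrix multiplications stand in for the convolutional layers, and all names and weights are illustrative:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = F(x, {W_i}) + x: F is two linear layers (stand-ins for
    convolutions) with a ReLU between, plus the identity shortcut."""
    f = relu(x @ w1) @ w2   # residual mapping F(x, {W_i})
    return relu(f + x)      # add the shortcut, then activate

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w1 = rng.standard_normal((8, 8))
w2 = rng.standard_normal((8, 8))
y = residual_block(x, w1, w2)
print(y.shape)  # same dimension as x, as Equation (10) requires
```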

Inception
The main idea of the Inception architecture is based on finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components [33].
The Inception module with dimension reductions is depicted in Figure 3. It receives data from the previous layer and processes it in four parallel ways into different convolution channels for calculation; if the convolution channel is smaller than the input channel, the data is downscaled. Finally, all output data channels are stitched together to produce the final result. 1 × 1 convolutions are used to compute reductions before the expensive 3 × 3 and 5 × 5 convolutions. Besides being used as reductions, they also include the use of rectified linear activation which makes 1 × 1 convolutions dual purpose: most critically, they were used mainly as dimension reduction modules to remove computational bottlenecks that would otherwise have limited the size of our networks. This allowed not just an increase in depth, but also in the width of our networks without a significant performance penalty [33].
The generous use of dimensional reduction and parallel structures of the Inception modules mitigated the effect of structural changes on nearby components [34].
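The branch-and-concatenate structure with 1 × 1 reductions can be sketched at the shape level. This is our illustration only: the channel counts and fixed weights are arbitrary, and the expensive 3 × 3 and 5 × 5 convolutions are replaced by further 1 × 1 stand-ins to keep the sketch short:

```python
import numpy as np

def conv1x1(x, out_ch):
    """A 1x1 convolution is a per-pixel linear map across channels
    (fixed averaging weights here, purely for illustration)."""
    c = x.shape[-1]
    w = np.ones((c, out_ch)) / c
    return x @ w

def inception_sketch(x):
    """Shape-level sketch of an Inception module with dimension
    reduction: 1x1 convolutions shrink the channel count before the
    expensive branches, and branch outputs are concatenated."""
    b1 = conv1x1(x, 16)                 # plain 1x1 branch
    b2 = conv1x1(conv1x1(x, 8), 16)     # 1x1 reduction, then "3x3" stand-in
    b3 = conv1x1(conv1x1(x, 4), 16)     # 1x1 reduction, then "5x5" stand-in
    return np.concatenate([b1, b2, b3], axis=-1)

x = np.zeros((32, 32, 64))              # H x W x C feature map
print(inception_sketch(x).shape)        # channels of all branches stitched together
```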

Knowledge Distillation
The main idea behind knowledge distillation is that to achieve superior performance, the student model must imitate the teacher model. Knowledge types, distillation techniques, and the teacher-student learning architecture all play critical roles in student learning [19]. A general teacher-student framework for knowledge distillation is shown in Figure 4; it is the most fundamental framework for knowledge distillation. The teacher network in this paper is the pre-trained Inception-ResNet, and the student networks are a CNN with three convolutional layers and a Mini-Inception-ResNet. The output of the teacher network is distilled to form soft and hard targets used to calculate the student network's loss function. Neural networks typically produce class probabilities by using a "softmax" output layer that converts the logit z_i computed for each class into a probability q_i by comparing z_i with the other logits [18]:

q_i = exp(z_i / T) / Σ_j exp(z_j / T), (12)
where T is a temperature that is normally set to 1; using a higher value for T produces a softer probability distribution over the classes. To understand knowledge distillation, a benchmark model, which combines the distillation and student losses, is given in Figure 5. The basic process is as follows. The same input enters the teacher and student networks. The temperature of the softmax layer of the teacher network is set to t, and its output gives the soft targets; the temperature of the softmax layer of the student network is also set to t, and its soft targets together with the teacher's produce the distillation loss. The temperature of the student network's softmax layer is then set to 1, and its output together with the true labels of the data produces the student loss; the weighted combination of the distillation and student losses comprises the overall loss. When calculating losses in KD, the weighted average of two different objective functions is used. The first is the cross entropy with the soft targets, computed using the same high temperature in the softmax of the distilled model that was used to generate the soft targets from the cumbersome model. The second is the cross entropy with the correct labels. Since the magnitudes of the gradients produced by the soft targets scale as 1/T², it is important to multiply them by T² when using both hard and soft targets [19].
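The temperature softmax just described can be sketched in NumPy; the logit values below are illustrative:

```python
import numpy as np

def softmax_T(logits, T=1.0):
    """Softmax with temperature T: q_i = exp(z_i / T) / sum_j exp(z_j / T).
    A higher T produces a softer distribution over the classes."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [3.0, 1.0, 0.2]
print(softmax_T(logits, T=1.0))  # peaked distribution
print(softmax_T(logits, T=5.0))  # softer, closer to uniform
```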
The process of knowledge distillation is shown in Figure 5, and its loss function [35] is

L = α · T² · L_soft(Q_T, Q_S) + (1 − α) · L_hard(y, Q_S), (13)

where T and α are hyperparameters: T refers to the temperature of distillation, and α refers to the proportion of the soft loss in the total loss. L_soft is the distillation loss between the softened teacher output Q_T and student output Q_S, and L_hard is the cross entropy between the student output and the true labels y.

Structure of the Teacher and Student Network
The structure of the teacher network shown in Figure 6 is based on Inception-Resnet, adapted to the size of the dataset used in this paper by varying the sizes and number of the convolutional kernels. Inception-ResnetA was repeated 10 times; Inception-ResnetB, 20 times; and Inception-ResnetC, 10 times.
There are two student networks: mini-Inception-Resnet and CNN3. Figure 7 depicts the network structure of mini-Inception-Resnet. Its input was computed in the following order: stem module, Inception-ResnetA, reduction-A, Inception-ResnetB, reduction-B, and Inception-ResnetC, followed by pooling and softmax. CNN3 used three convolutional layers for classification, as shown in Figure 8, and is the simple equivalent of an entry-level network with fast inference, little computation, few parameters, and a small model footprint.
The input to Inception-Resnet was computed in the following order: stem module, 10 tandem Inception-ResnetA, reduction-A, 10 tandem Inception-ResnetB, reduction-B, and 10 tandem Inception-ResnetC, followed by pooling and softmax. The structures of Inception-ResnetA, Inception-ResnetB, and Inception-ResnetC used in this paper are shown in Figure 9. Inception-Resnet contains the stem network module shown in Figure 10, the three Inception-Resnet modules, and two reduction modules.
All three Inception-Resnet modules have a similar structure. After the previous level's input was passed through the ReLU activation function, it was divided into two parallel paths: one left the data completely unchanged; the other passed it through several parallel convolutional layers and then through a final 1 × 1 convolution into the same number of channels as the initial input data. Finally, the outputs of the two paths were summed and passed through the ReLU activation function into the next module. The Inception-Resnet module has the advantage of fast convergence, and the results show that the Inception-Resnet network had a good classification effect [32]. This article pre-trained the teacher network and the undistilled student network, using the Adam optimizer in training and BatchNorm and the ReLU activation function after all convolutional layers. The loss function was the categorical cross-entropy function. Figure 11 shows that the classification accuracy of the teacher network was higher than 90% when the signal-to-noise ratio was above 0 dB, while the undistilled student network reached only about 80% classification accuracy at high SNR, much less than the teacher network because its structure is much smaller.

Loss Function
The loss function of Equation (13), designed for knowledge distillation, was used in the experiment. The teacher network was based on the Inception-Resnet network, and the student network was a three-layer convolutional network. The soft target was the output of Inception-Resnet, which corresponded to Q_T, and the output of the three-layer convolutional network corresponded to Q_S.

Dataset and Training
This paper used the RadioML 2016.10b dataset [2] as the basis for evaluating the modulation recognition task. The dataset consisted of 11 modulations, 8 digital and 3 analog, all widely used in wireless communication systems: BPSK, QPSK, 8PSK, 16QAM, 64QAM, BFSK, CPFSK, and PAM4 for digital, and WB-FM, AM-SSB, and AM-DSB for analog. Details about the generation of this dataset can be found in [36]. Data were modulated at a rate of roughly 8 samples per symbol with a normalized average transmit power of 0 dB [3]. The dataset was split into two sections for the experiment, with 20% serving as a validation set and the remaining 80% as training data. Adam was the optimizer used in this paper; the batch size was 512 and the learning rate 0.001.
When training in the experiments, an early stop mechanism was employed to halt training when performance on the validation dataset began to deteriorate. The ability of the deep neural network to generalize was improved by stopping the training before the neural network overfitted the training dataset. The network was initially configured with a minimum loss of 100. When the network's loss on the validation set was less than the minimum loss, the network structure was saved at this point, and the loss at this time was noted as the minimum loss. Training was stopped when the network's loss on the validation set exceeded the minimum loss for 10 consecutive iterations.
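The early-stop rule just described can be sketched as follows; the initial minimum loss of 100 and the patience of 10 match the description above, while the function and variable names are illustrative and `val_losses` stands in for the per-epoch validation losses:

```python
def train_with_early_stopping(val_losses, patience=10):
    """Track the best (minimum) validation loss, record the epoch where
    the model would be saved, and stop after `patience` consecutive
    epochs without improvement."""
    best_loss = 100.0        # initial minimum loss, as in the paper
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # "save the network" here
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                break        # validation loss exceeded the minimum 10 times in a row
    return best_epoch, best_loss

# Losses improve, then plateau: training stops 10 epochs after the best one
losses = [1.0, 0.8, 0.6, 0.5] + [0.55] * 15
print(train_with_early_stopping(losses))  # (3, 0.5)
```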

Experimental Procedure
The general framework of the experimental process in this paper is shown in Figure 12, where the teacher network is the high-precision network trained in advance and the student network is the network to be trained. The experimental procedure is that of knowledge distillation: first, the soft loss is calculated using the knowledge distillation process, then the hard loss is calculated, and the two are added to yield the total loss. Typically, large amounts of data are essential for training a deep learning model to avoid overfitting. Deep neural networks have many parameters, so if there is not enough data to train them, they tend to memorize the entire training set, which results in good training performance but bad performance on the test set [37].
The knowledge distillation loss is the weighted average of the soft and hard losses. The soft loss was calculated first by inputting the dataset to the teacher and student networks and dividing their outputs by the temperature parameter T, followed by a softmax calculation to obtain the softened probability distributions; the KL divergence between the teacher's and student's softened outputs then gave the soft loss. The hard target was the true label of the sample, represented as a one-hot vector. The sample was input into the student network, softmax was computed on the output, and the cross-entropy loss was calculated between the softmax result and the true label.
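The soft/hard loss computation just described can be sketched in NumPy. This is a minimal single-sample version under our own naming; the logit values, T, and α below are illustrative only:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z -= z.max()                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

def kd_loss(teacher_logits, student_logits, true_label, T=3.0, alpha=0.7):
    """Weighted average of the soft and hard losses: the soft loss is the
    KL divergence between temperature-softened teacher and student
    outputs (scaled by T^2), the hard loss is cross entropy with the
    one-hot true label at T = 1."""
    q_t = softmax(teacher_logits, T)        # soft targets Q_T
    q_s = softmax(student_logits, T)        # softened student output Q_S
    soft = np.sum(q_t * np.log(q_t / q_s))  # KL(Q_T || Q_S)
    p_s = softmax(student_logits, T=1.0)    # student output at T = 1
    hard = -np.log(p_s[true_label])         # cross entropy with hard target
    return alpha * (T ** 2) * soft + (1 - alpha) * hard

t_logits = [4.0, 1.0, 0.5]
s_logits = [2.5, 1.5, 0.5]
print(kd_loss(t_logits, s_logits, true_label=0))
```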
The experiment raised the temperature from 1 to 9 in steps of 2. At each temperature, the proportion α of the soft loss in the total loss changed from 0.1 to 0.9 in steps of 0.1, so there were nine values of α per temperature. With five temperatures in total, each (T, α) pair formed a small experiment, and each small experiment produced a distilled student network, for a total of 45. Figures 13 and 14 display the KD effects produced by α at the various temperatures. Assuming that a model's quality is determined by its highest classification accuracy over all SNRs, it is clear that, after knowledge distillation, DSCNN3 (CNN3 after knowledge distillation) performed best at T = 1, α = 0.4; T = 3, α = 0.7; T = 5, α = 0.1; T = 7, α = 0.6; and T = 9, α = 0.1, while DSminiIRNET (mini-Inception-Resnet after knowledge distillation) performed best at T = 1, α = 0.4; T = 3, α = 0.7; T = 5, α = 0.8; T = 7, α = 0.6; and T = 9, α = 0.5.

Evaluation of Classification
It can be seen that most networks after KD achieved higher classification accuracy, and at the right temperature and α the classification accuracy of the distilled networks was very high. Figures 13f and 14f show the accuracy of the best model at each temperature. The best DSCNN3 appeared at T = 1 and α = 0.4, and the best DSminiIRNET appeared at T = 7 and α = 0.6; each corresponds to the network model discussed below.

Figure 15 depicts the optimal DSCNN3 classification accuracy (Figure 15. Classification accuracy of the teacher network, the student network, and the best network after knowledge distillation; DSCNN3 represents CNN3 after knowledge distillation). Once the SNR exceeded 0 dB, the network's classification accuracy became flat: the classification accuracy of CNN3 was stable at around 78% and that of DSCNN3 at around 89%. According to Table 1, CNN3 had a peak classification accuracy of 0.7981, and for DSCNN3 it was 0.8936.

Figure 16 depicts the optimal DSminiIRNET classification accuracy. Once the SNR exceeded 0 dB, the network's classification accuracy became flat: the classification accuracy of mini-Inception-ResNet was stable at around 83% and that of DSminiIRNET at around 93%. When the SNR was greater than -6 dB, the classification accuracy curves of DSminiIRNET and Inception-Resnet overlapped, and the curves of DSminiIRNET were slightly higher than those of Inception-Resnet. As documented in Table 1, mini-Inception-ResNet had a peak classification accuracy of 0.8418, and for DSminiIRNET it was 0.9359, which was greater than the classification accuracy peak of the teacher network. Without changing the size of the network or the amount of computation, KD improved the mini-Inception-ResNet classification accuracy peak by 9.4%.
Knowledge distillation can transfer the relationships between different classes as information to the student network for learning, allowing smaller models to achieve higher classification accuracy and even outperform the teacher network.

Computation Complexity
The complexity of an algorithm can be divided into time and space complexity. Time complexity, which is defined as the time of calculation on the algorithm, can be quantitatively analyzed with floating-point operations (FLOPs). Space complexity describes the memory occupation when the algorithm runs [17].
The total model size is the memory occupied by the model, and the total parameters indicate the number of parameters in the different models. FLOPs also indicate the complexity of a deep neural network. Figure 15 shows that after knowledge distillation, the model's accuracy can be improved without changing its complexity. Table 2 compares the computational complexity of the teacher network to that of the best student network.

Conclusions
In this paper, we proposed a scheme to improve the classification accuracy of small network models for AMC. A highly accurate teacher network guided student network model training to improve accuracy via knowledge distillation.
We conducted experiments to compare the accuracy of student network models obtained using different hyperparameters T and α in knowledge distillation, from which the best were chosen. The peak classification accuracy of the teacher model was 93.09% with a model size of 311.77 MB; the peak classification accuracies of the two student networks after knowledge distillation were 93.59% and 89.36%, with model sizes of 39.69 MB and 0.37 MB, respectively. Knowledge distillation was successful in reducing model size while improving classification accuracy. When computational complexity and model size must be reduced for memory-limited devices, or when real-time performance is required, KD is a useful approach for solving the problem.
The use of KD in AMC improved the model's classification accuracy without changing the model size. Comparing the model size, parameter count, and FLOPs of the student network with those of the teacher network demonstrated the efficacy of knowledge distillation in improving the accuracy of a small network, and provides a useful idea for applying AMC in practice. Knowledge distillation can reduce model complexity while improving network classification accuracy; it is useful whether the goal is high accuracy or reduced complexity. We aim to explore pruning and quantization methods for AMC to further reduce model complexity and training time.

Conflicts of Interest:
The authors declare that they have no known competing financial interest or personal relationships that could have appeared to influence the work reported in this paper.

Abbreviations
The following abbreviations are used in this manuscript:

AMC: Automatic modulation classification
GPUs: Graphic processing units
KD: Knowledge distillation
DSCNN3: CNN3 after knowledge distillation
DSminiIRNET: mini Inception-Resnet after knowledge distillation