Automatic Modulation Classification with Deep Neural Networks

Automatic modulation classification is a desired feature in many modern software-defined radios. In recent years, a number of convolutional deep learning architectures have been proposed for automatically classifying the modulation used on observed signal bursts. However, a comprehensive analysis of these differing architectures and importance of each design element has not been carried out. Thus it is unclear what tradeoffs the differing designs of these convolutional neural networks might have. In this research, we investigate numerous architectures for automatic modulation classification and perform a comprehensive ablation study to investigate the impacts of varying hyperparameters and design elements on automatic modulation classification performance. We show that a new state of the art in performance can be achieved using a subset of the studied design elements. In particular, we show that a combination of dilated convolutions, statistics pooling, and squeeze-and-excitation units results in the strongest performing classifier. We further investigate this best performer according to various other criteria, including short signal bursts, common misclassifications, and performance across differing modulation categories and modes.


I. INTRODUCTION
Automatic modulation classification (AMC) is of particular interest for radio frequency (RF) analysis and in modern software-defined radios to perform numerous tasks including "spectrum interference monitoring, radio fault detection, dynamic spectrum access, opportunistic mesh networking, and numerous regulatory and defense applications" [1]. Upon detection of an RF signal with unknown characteristics, AMC is a crucial initial step toward demodulating the signal. Efficient AMC allows for maximal usage of transmission mediums and can provide resilience in modern cognitive radios. Systems capable of adaptive modulation schemes can monitor current channel conditions with AMC and adjust the modulation schemes they employ to maximize usage across the transmission medium.
Moreover, for receivers that have a versatile demodulation capability, AMC is a requisite task. The correct demodulation scheme must be applied to recover the modulated message within a detected signal. In systems where the modulation scheme is not known a priori, AMC allows for efficient prediction of the employed modulation scheme. Higher performing AMC can increase the throughput and accuracy of these systems; therefore, AMC is currently an important research topic in the fields of machine learning and communication systems, specifically for software-defined radios.
Typical benchmarks are constructed on the premise that the AMC model must classify not only the mode of modulation (e.g., QAM), but the exact variant of that mode of modulation (e.g., 32QAM). While many architectures have proven to be effective at high signal-to-noise ratios (SNRs), performance degrades significantly at the lower SNRs that often occur in real-world applications. Other works have investigated increasing classification performance at lower SNR levels through the use of SNR-specific modulation classifiers [2] and clustering based on SNR ranges [3]. To perform classification, a variety of signal features have been investigated. Historically, AMC has relied upon statistical moments and higher order cumulants [4]-[6] derived from the received signal. Recent approaches [1], [7]-[9] use raw time-domain in-phase (I) and quadrature (Q) components as features to predict the modulation variant of a signal. Further works have investigated additional features including I/Q constellation plots [10]-[12].
After selecting the signal input features, machine learning models are used to determine statistical patterns in the data for the classification task. Support vector machines, decision trees, and neural networks are commonly used classifiers for this application [1], [3], [7]-[10], [13], [14]. Residual neural networks (ResNets), along with convolutional neural networks (CNNs), have been shown to achieve high classification performance for AMC [1], [3], [7]-[10]. Thus, deep learning based methods in AMC have become more prevalent due to their promising performance and their ability to generalize to large, complex datasets.
While other works have contributed to increased AMC performance, the importance of many design elements for AMC remains unclear and a number of architectural elements have yet to be investigated. Therefore, in this work, we aim to formalize the impact of a variety of architectural changes and model design decisions on AMC performance. Numerous modifications to architectures from previous works, including our own [7], and novel combinations of elements applied to AMC are considered. After an initial investigation, we provide a comprehensive ablation study in this work to investigate the performance impact of various architectural modifications. Additionally, we achieve new state-of-the-art classification performance on the RadioML 2018.01A dataset [15]. Using the best performing model, we provide additional analyses that characterize its performance across modulation modes and signal burst duration.
Fig. 1. ResNet architecture used in [1]. Each block represents a unit in the network, which may be comprised of several layers and connections as shown on the right of the figure. Dimensions of the tensors on the output of each block are also shown where appropriate.

II. RELATED WORK
The area of AMC has been investigated by several research groups. We summarize prior results in AMC to provide context and motivation for our contributions and for the corresponding ablation study described in this paper.
Corgan et al. [8] illustrate that deep convolutional neural networks are able to achieve high classification performance, particularly at low SNRs, on a dataset comprising 11 different types of modulation. CNNs were found to outperform expertly crafted features. Comparing against the architectures in [8] and [1], the work in [16] improved AMC performance using self-supervised contrastive learning. First, an encoder is pre-trained in a self-supervised manner by creating contrastive pairs with data augmentation. By creating different views of the input data through augmentation, contrastive loss is used to maximize the cosine similarity between positive pairs (augmented views of the same input). Once converged, the encoder is frozen (i.e., the weights are set to fixed values) and two fully-connected layers are added following the encoder to form the classifier. The classifier is trained using supervised learning to predict the 11 different modulation schemes. Chen et al. applied a novel architecture to the same dataset in which the input signal is sliced and transformed into a square matrix, and a residual network is applied to predict the modulation schemes [17]. Other work has investigated empirical and variational mode decomposition to improve few-shot learning for AMC [18]. In our work, we utilize a larger, more complex dataset consisting of 24 modulation schemes, as well as modeling improvements.
Spectrograms and I/Q constellation plots in [19] were found to be effective input features to a traditional CNN achieving nearly equivalent performance as the baseline CNN network in [1] which used raw I/Q signals.
Further, [10]-[12] also used I/Q constellations as an input feature in their machine learning models on a smaller scale of four or eight modulation types. Other features have also been used in AMC: [20], [21] utilized statistical features and support vector machines, while [22], [23] used fusion methods in CNN classifiers. Mao et al. utilized various constellation diagrams at varying symbol timings, alleviating symbol timing synchronization concerns [24]. A squeeze-and-excitation [25] inspired architecture was used as an attention mechanism to focus on the most important diagrams.
Although spectrograms and constellation plots have shown promise, they require additional processing overhead and have had comparable performance to raw I/Q signals. In addition, models that use raw I/Q signals could be more adept at handling varying-length signals than constellation plots because they are not limited by periodicity constraints for short duration signals (i.e., burst transmissions). Consequently, we utilize raw I/Q signals in our work.
Tridgell, in his dissertation [26], builds upon these works by investigating these architectures when deployed on resource-limited field-programmable gate arrays (FPGAs). His work stresses the importance of reducing the number of parameters for modulation classifiers because they are typically deployed in resource-constrained embedded systems.
Fig. 2. X-Vector architecture overview. The convolutional activations immediately before pooling are shown. These activations are fed into two statistical pooling layers that collapse the activations over time, creating a fixed-length tensor that can be further processed by fully connected dense layers.
Fig. 3. Proposed CNN architecture in [7]. This is the first work to employ an X-Vector inspired architecture for AMC showing strong performance. This architecture is used as a baseline for the modifications investigated in this paper. The f and k variables shown designate the number of kernels and size of each kernel, respectively, in each layer. These parameters are investigated for optimal sizing in our initial investigation.
In [1], O'Shea et al. created a dataset with 24 different types of modulation, known as RadioML 2018.01A, and achieved high classification performance using convolutional neural networks, specifically using residual connections (see Figure 1) within the network (ResNet). A total of 6 residual stacks were used in the architecture. A residual stack is defined as a series of convolutional layers, residual units, and a max pooling operation as shown in Figure 1. The ResNet employed by [1] attained approximately 95% classification accuracy at high SNR values.
Harper et al. proposed the use of X-Vectors [27] to increase classification performance using CNNs [7]. X-Vectors are traditionally used in speaker recognition and verification systems making use of aggregate statistics. X-Vectors employ statistical moments, specifically mean and variance, across convolutional filter outputs. It can be theorized that taking the mean and variance of the embedding layer helps to eliminate signal-specific information, leaving global, modulation-specific characteristics. Figure 2 illustrates the X-Vector architecture where statistics are computed over the activations from a convolutional layer, producing a fixed-length vector.
Additionally, this architecture maintains a fully convolutional structure, enabling variable size inputs into the network. Using statistical aggregations allows for this property to be exploited. When using statistical aggregations, the input to the first dense layer is dependent upon the number of filters in the final convolutional layer. The number of filters is a hyperparameter, independent of the length in time of the input signal into the neural network.
Without the statistical aggregations, the input signals into a traditional CNN or ResNet would need to be resampled, cropped, or padded to a fixed length in time such that there is not a size mismatch between the final convolutional output and the first dense layer. While the dataset used in this work has uniformly sized signals in terms of duration (1024 × 2), this is an architectural advantage in our deployment as received signals may vary in duration. Instead of modifying the inputs to the network via sampling, cropping, padding, etc., the X-Vector architecture can directly operate with variable-length inputs without modifications to the network or input signal.
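The fixed-length property of statistics pooling can be illustrated with a minimal NumPy sketch. The filter count (64) and the burst lengths are hypothetical values chosen for illustration, and random numbers stand in for the convolutional activations; this is not the implementation from [7].

```python
import numpy as np

def statistics_pooling(activations: np.ndarray) -> np.ndarray:
    """Collapse (time, filters) activations into a fixed-length vector.

    The per-filter mean and variance over time are concatenated, so the
    output length is 2 * filters regardless of the number of time steps.
    """
    mean = activations.mean(axis=0)
    var = activations.var(axis=0)
    return np.concatenate([mean, var])

rng = np.random.default_rng(0)
num_filters = 64  # hypothetical filter count of the final convolutional layer

# Random stand-ins for the final convolutional outputs of two bursts
short_burst = rng.standard_normal((256, num_filters))
long_burst = rng.standard_normal((1024, num_filters))

pooled_short = statistics_pooling(short_burst)
pooled_long = statistics_pooling(long_burst)
print(pooled_short.shape, pooled_long.shape)  # both (128,)
```

Because the pooled vector depends only on the number of filters, the first dense layer sees the same input size for both bursts.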
Figure 3 outlines the X-Vector architecture employed in [7], where mean and variance pooling are performed on the final convolutional outputs, concatenated, and fed through a series of dense layers, creating the fixed-length X-Vector. A maximum of 98% accuracy was achieved at high SNR levels.
Fig. 4. Accuracy comparison of the ResNet from [1] and the X-Vector inspired model from [7] over varying SNRs. This accuracy comparison shows the superior performance of the X-Vector architecture, especially at higher SNRs, and supports using this architecture as a baseline for the improvements investigated in this paper.
The work of [7] replicated the ResNet architecture from [1] and compared the results with the X-Vector architectures as seen in Figure 4. Harper et al. [7] were able to reproduce this architecture, achieving a maximum of 93.7% accuracy. The authors attribute the difference in performance to differences in the train and test set separation they used since these parameters were unavailable. As expected, the classifiers perform with a higher accuracy as the SNR value increases. In signals with a low SNR value, noise becomes more dominant and the signal is harder to distinguish. In modern software-defined radio applications, a high SNR value is not always a given. However, there is still significant improvement compared to random chance, even at low SNR values. Moreover, in systems where the modulation type must be classified quickly, this could become crucially important as fewer demodulation schemes would need to be applied in a trial and error manner to discover the correct scheme.
One challenge of AMC is that classifiers are expected to perform well across a large range of SNRs. For instance, Figure 4 illustrates that modulation classification performance plateaued beyond +8dB SNR and approached chance performance below −8dB SNR on the RadioML 2018.01A dataset. This range is denoted by the shaded region.
In [2], six SNR-specific modulation classifiers (MCs) were created by discretizing the SNR range to improve performance between −8dB and +8dB SNR (see Figure 5). These groupings were chosen in order to provide sufficient training data to avoid overfitting the MCs and to provide enough resolution so that combining MCs provided more value than a single classifier.
By first predicting the SNR of the received signal with a regression model, an SNR-specific MC that was trained on signals with the predicted SNR is applied to make the final prediction. Although the SNR values in the dataset are discrete, SNR is measured on a continuous scale in a deployment scenario and can vary over time. As a result, regression is used over classification to model SNR. Using this approach, different classifiers can tune their feature processing for differing SNR ranges. Each MC in this approach uses the same architecture as that proposed in [7]; however, each MC is trained with signals within each MC's SNR training range (see Table I).
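The two-stage procedure (estimate the SNR, then dispatch to the matching MC) can be sketched as follows. The bin edges below are hypothetical placeholders, since the actual training ranges are those listed in Table I, and the regressor and classifiers are stand-in callables rather than trained networks.

```python
from bisect import bisect_right
from typing import Callable, List, Sequence

# Hypothetical bin edges in dB (5 edges give 6 half-open bins);
# the real SNR training ranges are the ones given in Table I.
SNR_EDGES_DB: List[float] = [-8.0, -4.0, 0.0, 4.0, 8.0]

def select_classifier(predicted_snr_db: float) -> int:
    """Index of the MC whose bin [low, high) contains the SNR estimate."""
    return bisect_right(SNR_EDGES_DB, predicted_snr_db)

def classify(signal: Sequence[float],
             snr_regressor: Callable[[Sequence[float]], float],
             classifiers: Sequence[Callable[[Sequence[float]], str]]) -> str:
    """Estimate the SNR first, then apply the SNR-specific classifier."""
    snr_db = snr_regressor(signal)
    return classifiers[select_classifier(snr_db)](signal)

# Stand-ins for the trained models (assumptions for illustration only)
dummy_classifiers = [lambda s, i=i: f"label-from-MC{i}" for i in range(6)]
dummy_regressor = lambda s: -2.3  # pretend the estimate is -2.3 dB

label = classify([0.0, 0.0, 0.0, 0.0], dummy_regressor, dummy_classifiers)
print(label)  # label-from-MC2, the bin [-4, 0) under the hypothetical edges
```

Since the regressor output is continuous, estimates between the discrete dataset SNR values still map to exactly one classifier.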
Highlighting improvements across varying SNR values, Figure 6 shows the overall performance improvement (in percentage accuracy) using the SNR-assisted architecture compared to the baseline classification architecture described in [7]. While a slight decrease in performance was observed for −8dB and a larger decrease for −2dB, improvement is shown under most SNR conditions, particularly in the target range of −8dB to +8dB. A possible explanation for the decrease in performance at particular SNRs is that the optimization for a particular MC helped overall performance for a grouping at the expense of a single value in the group. That is, the MC for [−4, 0) boosted the overall performance by performing well at −4 and 0dB at the expense of −2dB. Due to the large size of the testing set, these small percentage gains are impactful because thousands more classifications are correct. All results were statistically significant based on McNemar's test [28], and therefore represented new state-of-the-art performance at the time.
Fig. 5. The architecture using SNR regression and SNR-specific classifiers from [2]. Each MC block shown employs the same architecture as the baseline from [7], but is specifically trained to perform AMC within a narrower range of SNRs (denoted as dB ranges in each block).
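For readers unfamiliar with McNemar's test, the following is a minimal sketch of the continuity-corrected form of the statistic. Whether [2] used this exact variant is an assumption, and the discordant counts below are invented purely for illustration; they are not results from any cited work.

```python
import math
from typing import Tuple

def mcnemar_test(b: int, c: int) -> Tuple[float, float]:
    """Continuity-corrected McNemar test on paired classifier outputs.

    b: test signals the baseline got right and the new model got wrong.
    c: test signals the baseline got wrong and the new model got right.
    Returns (chi-square statistic, p-value) with 1 degree of freedom.
    """
    if b + c == 0:
        return 0.0, 1.0
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 dof: erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical discordant counts, for illustration only
statistic, p_value = mcnemar_test(b=400, c=500)
print(f"chi2 = {statistic:.2f}, p = {p_value:.4f}")
```

Only the signals on which the two classifiers disagree enter the statistic, which is why a small accuracy gain can still be significant on a very large test set.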
Soltani et al. [3] found that the SNR regions [−10, −2]dB, [0, 8]dB, and [10, 30]dB have similar classification patterns. Instead of predicting exact modulation variants, the authors group commonly confused variants into a more generic, coarse-grained label. This grouping increases performance of AMC by combining modulation variants that are commonly confused. However, it also decreases the sensitivity of the model to the numerous possible variants.
Cai et al. utilized a transformer based architecture to aid performance at low SNR levels with relatively few training parameters (approximately 265,0000 parameters) [29]. A multiscale network along with center loss [30] was used in [31]. It was found that larger kernel sizes improved AMC performance. We further explore kernel size performance impacts in this work. Zhang et al. proposed a high-order attention mechanism using the covariance matrix, achieving a maximum accuracy of 95.49% [32].
Although many discussed works use the same RadioML 2018.01A dataset, there is a lack of a uniform dataset split to establish a benchmark for papers to report performance. In an effort to make AMC work more reproducible and comparable across publications, we have made our dataset split and accompanying code available on GitHub. While numerous works have investigated architectural improvements, we aim to improve upon these works by introducing additional modifications as well as a comprehensive ablation study that illustrates the improvement of each modification. With the new modifications, we achieve new state-of-the-art AMC performance.

III. DATASET
To evaluate different machine learning architectures, we use the RadioML 2018.01A dataset, which comprises 24 different modulation types [1], [15]. Due to the complexity and variety of modulation schemes in the dataset, it is fairly representative of typically encountered modulation schemes. Moreover, this variety increases the likelihood that AMC models will generalize to more exotic modulation schemes that are absent from the training data but are derived from these traditional variants.
Fig. 6. Summary of residual improvement in accuracy over [7] that was first published in [2]. This work showed how the baseline architecture could be tuned to specific SNR ranges. Positive improvement is observed for most SNR ranges.
There are a total of 2.56 million labeled signals, S(T), each consisting of 1024 time domain digitized intermediate frequency (IF) samples of in-phase (I) and quadrature (Q) signal components where S(T) = I(T) + jQ(T). The data was collected at a 900MHz IF with an assumed sampling rate of 1MS/sec such that each 1024 time domain digitized I/Q sample is 1.024 ms [33]. The 24 modulation types and the representative groups that we chose for each are listed as follows:
• Amplitude: OOK, 4ASK, 8ASK, AM-SSB-SC, AM-SSB-WC, AM-DSB-WC, and AM-DSB-SC
• Phase: BPSK, QPSK, 8PSK, 16PSK, 32PSK, and OQPSK
• Amplitude and Phase: 16APSK, 32APSK, 64APSK, 128APSK, 16QAM, 32QAM, 64QAM, 128QAM, and 256QAM
• Frequency: FM and GMSK
Each modulation type includes a total of 106,496 observations ranging from −20dB to +30dB SNR in 2dB steps for a total of 26 different SNR values. SNR is assumed to be consistent over the same window length as the I/Q sample window. For evaluation, we divided the dataset into 1 million different training observations and 1.5 million testing observations under a random shuffle split, stratified across modulation type and SNR. Because of this balance, the expected performance for a random chance classifier is 1/24 or 4.2%. With varying SNR levels across the dataset, it is expected that the classifier would perform with a higher degree of accuracy as the SNR value is increased. For consistency, each model investigated in this work was trained and evaluated on the same train and test set splits.
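The dataset figures quoted above are mutually consistent, which the following short arithmetic check makes explicit. The only derived quantity not stated in the text is the number of observations per modulation type and SNR value, which follows by assuming an even split across the 26 SNR values.

```python
num_modulations = 24
observations_per_modulation = 106_496
snr_values_db = list(range(-20, 31, 2))  # -20 dB to +30 dB in 2 dB steps

total_signals = num_modulations * observations_per_modulation
# Assumes observations are spread evenly over the SNR values
per_modulation_per_snr = observations_per_modulation // len(snr_values_db)
chance_accuracy = 1 / num_modulations

samples_per_signal = 1024
sample_rate_hz = 1e6  # assumed 1 MS/sec sampling rate
burst_duration_ms = samples_per_signal / sample_rate_hz * 1e3

print(len(snr_values_db))               # 26 SNR values
print(total_signals)                    # 2555904, i.e. about 2.56 million
print(per_modulation_per_snr)           # 4096
print(round(100 * chance_accuracy, 1))  # 4.2 (percent)
print(round(burst_duration_ms, 3))      # 1.024 (ms)
```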

IV. INITIAL INVESTIGATION
In this work, we use the architecture described in [7] as the baseline architecture. We note that [2] improved upon the baseline; however, each individual MC used the baseline architecture except trained on specific SNR ranges. Therefore, the base architectural elements were similar to [7], but separated for different SNRs. In this work, our focus is to improve upon the employed CNN architecture for an individual MC rather than the use of several MCs. Therefore, we use the architecture from [7] as our baseline.
Before exploring an ablation study, we make a few notable changes from the baseline architecture in an effort to increase AMC performance. This initial exploration is for clarity as it reserves the ablation study that follows from requiring an inordinate number of models. It also introduces the general training procedures that assist and orient the reader in following the ablation study, which mirrors these procedures. We first provide an initial investigation exploring these notable changes.
We train each model using the Adam optimizer [34] with an initial learning rate lr = 0.0001, a decay factor of 0.1 if the validation loss does not decrease for 12 epochs, and a minimum learning rate of 1e-7. If the validation loss does not decrease after 20 epochs, training is terminated and the models are deemed converged. For all experiments, mini-batches of size 32 are used. As has been established in most programming packages for neural networks, we refer to fully connected neural network layers as dense layers, which are typically followed by an activation function.
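The learning-rate decay and stopping rule can be sketched in a framework-independent way by replaying a sequence of validation losses. The exact bookkeeping (for example, resetting the decay counter after each reduction) is an assumption, as is the synthetic loss sequence; the constants are the ones stated above.

```python
from typing import List, Optional, Sequence, Tuple

def simulate_schedule(val_losses: Sequence[float],
                      lr: float = 1e-4, factor: float = 0.1,
                      lr_patience: int = 12, min_lr: float = 1e-7,
                      stop_patience: int = 20
                      ) -> Tuple[List[float], Optional[int]]:
    """Replay validation losses; return per-epoch learning rates and stop epoch."""
    best = float("inf")
    since_best = 0    # epochs without improvement (drives early stopping)
    since_decay = 0   # epochs without improvement since the last decay
    lrs: List[float] = []
    for epoch, loss in enumerate(val_losses):
        lrs.append(lr)  # learning rate used during this epoch
        if loss < best:
            best, since_best, since_decay = loss, 0, 0
        else:
            since_best += 1
            since_decay += 1
        if since_best >= stop_patience:
            return lrs, epoch  # training terminated, model deemed converged
        if since_decay >= lr_patience:
            lr = max(lr * factor, min_lr)
            since_decay = 0
    return lrs, None

# Synthetic losses: improvement for two epochs, then a plateau
lrs, stop_epoch = simulate_schedule([1.0, 0.9] + [0.9] * 25)
print(stop_epoch)  # 21
```

In this replay the learning rate drops by a factor of 10 after 12 stagnant epochs, and training halts after 20 stagnant epochs.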

A. Architectural Changes
Compared with the baseline architecture, we use fewer but larger kernels in the early layers of the network and a greater number of smaller kernels in the later layers, which is a common design pattern in neural networks. This is commonly referred to as the information distillation pipeline [35]. By utilizing a smaller number of large kernels in early layers, we are able to increase the temporal context of the convolutional features without dramatically increasing the number of trainable parameters. Numerous, but smaller, kernels are used in later convolutional layers to create more abstract features. Configuring the network in this manner is especially popular in image classification, where later layers represent more abstract, class-specific features.
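The tradeoff between kernel size, kernel count, and parameter count can be made concrete with a small calculation. The layer sizes below are hypothetical and are not the configurations evaluated in this paper; the sketch assumes stride-1, undilated 1-D convolutions with bias terms.

```python
from typing import Sequence

def conv1d_params(in_channels: int, filters: int, kernel_size: int) -> int:
    """Trainable parameters of a 1-D convolution (weights plus biases)."""
    return kernel_size * in_channels * filters + filters

def receptive_field(kernel_sizes: Sequence[int]) -> int:
    """Temporal context, in samples, of stacked stride-1 undilated convolutions."""
    return 1 + sum(k - 1 for k in kernel_sizes)

# Hypothetical first layer operating on the two I/Q input channels
many_small = conv1d_params(in_channels=2, filters=64, kernel_size=3)
few_large = conv1d_params(in_channels=2, filters=32, kernel_size=7)
print(many_small, few_large)  # 448 480

# Hypothetical three-layer stacks: larger early kernels widen the context
print(receptive_field([3, 3, 3]))  # 7
print(receptive_field([7, 5, 3]))  # 13
```

In this toy case, halving the filter count while widening the kernel leaves the parameter count nearly unchanged but more than doubles the span of each first-layer kernel.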
We investigate this modification in three stages, using the baseline architecture described in Figure 3 [7].We denote number of filters in the network and the filter sizes as

B. Initial Investigation Results
As shown in Table II, increasing the size of the filters in earlier layers increases both average and maximum test accuracy over [7], but at the cost of additional parameters. A possible explanation for the increase in performance is the increase in temporal context due to the larger kernel sizes. Increasing the number of filters without increasing temporal context decreases performance. This is possibly because it increases the complexity of the model without adding additional signal context. Although increasing the number of filters alone decreases performance, combining the approach with larger kernel sizes yields the best performance in our initial investigation. Increasing the temporal context may have allowed additional filters to better characterize the input signal.
Because increased temporal context improves AMC performance, we are inspired to investigate additional methods such as squeeze-and-excitation blocks and dilated convolutions that can increase global and local context [25], [36].
V. ABLATION STUDY ARCHITECTURE BACKGROUND
Building upon our findings from our initial investigation, we make additional modifications to the baseline architecture.
For the MCs, we introduce dilated convolutions, squeeze-and-excitation blocks, self-attention, and other architectural changes. We also investigate various kernel sizes and the quantity of kernels employed from the initial investigation. Our goal is to improve upon existing architectures while investigating the impact of each modification on classification accuracy through an ablation study. In this section, we describe each modification performed.

A. Squeeze-and-Excitation Networks
Fig. 8. Squeeze-and-Excitation block proposed in [25]. One SE block is shown applied to a single layer convolutional output activation. Two paths are shown, a scaling path and an identity path. The scaling vector is applied across channels to the identity path of the activations.
Squeeze-and-Excitation (SE) blocks introduce a channel-wise attention mechanism first proposed in [25]. Due to the limited receptive field of each convolutional filter, SE blocks propose a recalibration step based on global statistics across channels (average pooling) to provide global context. Although initially utilized for image classification tasks [25], [37], [38], we argue the use of SE blocks can provide meaningful global context to the convolutional network used for AMC over the time domain.
Figure 8 depicts an SE block. The squeeze operation is defined as temporal global average pooling across convolutional filters. For an individual channel, c, the squeeze operation is defined as:

$$ z_c = \frac{1}{T} \sum_{t=1}^{T} x_c(t), \quad c = 1, \ldots, C \tag{1} $$

where T is the number of samples in time, and C is the total number of channels. To model nonlinear interactions between channel-wise statistics, Z = [z_1, ..., z_C] is fed into a series of dense layers followed by nonlinear activation functions:

$$ s = \sigma\left(W_2 \, \delta\left(W_1 Z\right)\right) \tag{2} $$

where δ is the rectified linear (ReLU) activation function, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$, $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$, r is a dimensionality reduction ratio, and σ is the sigmoid activation function. The sigmoid function is chosen as opposed to the softmax function so that multiple channels can be accentuated and are not mutually exclusive. That is, the normalization term in the softmax can cause dependencies among channels, so the sigmoid activation is preferred. W_1 imposes a bottleneck to improve generalization performance and reduce parameter counts, while W_2 increases the dimensionality back to the original number of channels for the recalibration operation. In our work, we use r = 2 for all SE blocks to ensure a reasonable number of trainable parameters without over-squashing the embedding size.
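The squeeze and excitation steps, together with the channel-wise scaling that completes the block, can be sketched in NumPy as follows. The weights are random stand-ins rather than trained values, bias terms are omitted, and the channel count of 8 is a hypothetical choice; only r = 2 is taken from the text.

```python
import numpy as np

def se_block(x: np.ndarray, w1: np.ndarray, w2: np.ndarray):
    """Apply a squeeze-and-excitation block to activations x of shape (T, C).

    w1 has shape (C // r, C) and w2 has shape (C, C // r); biases are omitted.
    Returns the recalibrated activations and the per-channel gates.
    """
    z = x.mean(axis=0)                         # squeeze: average over time
    hidden = np.maximum(w1 @ z, 0.0)           # bottleneck dense layer + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # expand back to C, sigmoid gate
    return x * s, s                            # scale every channel by its gate

rng = np.random.default_rng(0)
T, C, r = 1024, 8, 2                           # C is hypothetical; r = 2 as in the text
x = rng.standard_normal((T, C))
w1 = rng.standard_normal((C // r, C))          # random stand-in for trained weights
w2 = rng.standard_normal((C, C // r))

recalibrated, gates = se_block(x, w1, w2)
print(recalibrated.shape, gates.shape)         # (1024, 8) (8,)
```

Because each gate comes from an independent sigmoid, several channels can be emphasized at once, which would not hold under a softmax.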
The final operation in the SE block, scaling or recalibration, is obtained by scaling the input X by s:

$$ \tilde{x}_c = s_c \cdot x_c \tag{3} $$

where s_c is the c-th element of s and x_c is the c-th channel of X.

B. Dilated Convolutions
Proposed in [36], dilated convolutions are depicted in Figure 10, where the convolutional kernels are denoted by the colored components. In a traditional convolution, the dilation rate is equal to 1. Dilated convolutions build temporal context by increasing the receptive field of the convolutional kernels without increasing parameter counts, as the number of entries in the kernel remains the same. Unlike strided convolutions, dilated convolutions do not downsample the signals. Instead, the output of a dilated convolution can be the exact size of the input after properly handling edge effects at the beginning and end of the signal.
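A single-channel dilated convolution with zero padding illustrates both properties: the kernel keeps the same number of weights while covering a wider span, and the output length matches the input length. Zero padding is one possible way of handling edge effects and is an assumption here, as are the toy kernel and input.

```python
import numpy as np

def dilated_conv1d_same(x: np.ndarray, kernel: np.ndarray, dilation: int = 1) -> np.ndarray:
    """Single-channel dilated cross-correlation; output length equals len(x)."""
    k = len(kernel)
    span = (k - 1) * dilation                  # receptive field minus one
    left = span // 2
    padded = np.pad(x, (left, span - left))    # zero padding handles the edges
    out = np.zeros(len(x))
    for i, w in enumerate(kernel):
        start = i * dilation                   # taps are spaced `dilation` apart
        out += w * padded[start:start + len(x)]
    return out

x = np.arange(8, dtype=float)                  # toy input signal
kernel = np.ones(3)                            # 3 weights in both cases

y_standard = dilated_conv1d_same(x, kernel, dilation=1)  # spans 3 samples
y_dilated = dilated_conv1d_same(x, kernel, dilation=2)   # spans 5 samples
print(y_standard.shape, y_dilated.shape)       # (8,) (8,)
print(y_dilated[3])                            # x[1] + x[3] + x[5] = 9.0
```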

C. Final Convolutional Activation
We also investigate the impact of using an activation function (ReLU) after the last convolutional layer, just before statistics pooling. Because ReLU transforms the input sequence to be non-negative, the distribution characterized by the pooling statistics may become skewed. In [7] and [2], no activation was applied after the final convolutional layer as shown in Figure 3. We investigate if this transformation impacts classification performance.
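The effect of ReLU on the pooled statistics can be seen with a toy example. Zero-mean, unit-variance Gaussian samples are assumed as a stand-in for the convolutional activations of one filter; real activations need not follow this distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.standard_normal(100_000)  # toy stand-in for one filter's output
rectified = np.maximum(activations, 0.0)    # ReLU clips the negative half to zero

# Statistics that the pooling layer would compute in each case
print(f"no ReLU: mean={activations.mean():.3f}, var={activations.var():.3f}")
print(f"ReLU:    mean={rectified.mean():.3f}, var={rectified.var():.3f}")
```

For a standard Gaussian input, the rectified mean moves from about 0 to about 0.40 and the variance shrinks from about 1 to about 0.34, so the mean and variance fed to the dense layers describe a different, one-sided distribution.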

D. Self-Attention
Self-attention allows the convolutional outputs to interact with one another, enabling the network to learn to focus on important outputs. Self-attention before statistics pooling essentially creates a weighted summation over the convolutional outputs, weighting their importance similarly to [39]-[41].
We use the attention mechanism described by Vaswani et al. in [42], where each output element is a weighted sum of the linearly transformed input and where the dimensionality of K is d_k, as seen in Equation (4):

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \tag{4} $$

In the case of self-attention, Q, K, and V are equal. A scaling factor of $1/\sqrt{d_k}$ is applied to counteract vanishing gradients in the softmax output when d_k is large.
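Scaled dot-product self-attention over a sequence of convolutional outputs can be sketched as follows. This minimal version sets Q = K = V directly to the input and omits any learned linear projections, so it illustrates the weighting mechanism rather than the exact layer used in the ablation study; the sequence length and channel count are hypothetical.

```python
import numpy as np

def softmax(a: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = a - a.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x: np.ndarray):
    """Scaled dot-product attention with Q = K = V = x, where x is (T, d_k)."""
    d_k = x.shape[-1]
    scores = x @ x.T / np.sqrt(d_k)   # (T, T) similarities, scaled by 1/sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ x, weights       # each output row is a weighted sum of inputs

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))      # hypothetical: 16 time steps, 8 channels
attended, weights = self_attention(x)
print(attended.shape, weights.shape)  # (16, 8) (16, 16)
```

Each row of the weight matrix sums to one, so every output time step is a convex combination of all input time steps.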

VI. ABLATION STUDY ARCHITECTURE
Applying the specified modifications to the architecture in [7], Figure 9 illustrates the proposed architecture with every modification included in the graphic. Each colored block represents an optional change to the architecture that will be investigated in the ablation study. That is, each combination of network modifications is analyzed to aid understanding of each modification's impact on the network.
Each convolutional layer has the following parameters: number of filters, kernel size, and dilation rate. The asterisk next to each dilation rate represents the changing of dilation rates in the ablation study. If dilated convolutions are used,

VII. EVALUATION METRICS
We present several evaluation metrics to compare the different architectures considered in the ablation study. In this section, we will discuss each evaluation technique used in the results section.
Due to the varying levels of SNRs in the employed dataset, we plot classification accuracy over each true SNR value. This allows for a visualization of the tradeoff in performance as noise becomes more or less dominant in the received signals. Additionally, we report average accuracy and maximum accuracy across the entire test set for each model. While we note that average accuracy is not indicative of the model's performance, as accuracy is highly correlated to the SNR of the input signal, we share this result to give other researchers the ability to reproduce and compare works.
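These metrics can be computed from per-signal predictions as sketched below. The labels, predictions, and SNR values are toy data, and the maximum accuracy is taken here to be the best per-SNR accuracy, which is our reading of how the maximum is reported.

```python
import numpy as np

def accuracy_by_snr(y_true: np.ndarray, y_pred: np.ndarray, snr_db: np.ndarray) -> dict:
    """Classification accuracy for each true SNR value in the test set."""
    correct = y_true == y_pred
    return {int(s): float(correct[snr_db == s].mean()) for s in np.unique(snr_db)}

# Toy test set: three signals at -10 dB and three at +10 dB
y_true = np.array([0, 1, 2, 0, 1, 2])
y_pred = np.array([0, 1, 0, 0, 1, 2])
snr_db = np.array([-10, -10, -10, 10, 10, 10])

per_snr = accuracy_by_snr(y_true, y_pred, snr_db)
average_accuracy = float((y_true == y_pred).mean())  # over the entire test set
maximum_accuracy = max(per_snr.values())             # best single SNR value

print(per_snr)
print(average_accuracy, maximum_accuracy)
```

With a test set stratified evenly across SNR values, the overall average equals the mean of the per-SNR accuracies, which is why it is dominated by the many low-SNR conditions.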
As discussed in [26], AMC is often implemented on resource-constrained devices. In these systems, using larger models in terms of parameter counts may not be feasible. We report the number of parameters for each model in the ablation study to examine the tradeoff in AMC performance and model size.
Additional analyses are also carried out. However, due to the large number of models investigated in this study, we will select the best performing model from the ablation study for brevity and analyze the performance of this model in greater detail. For example, confusion matrices for the best performing model from the ablation study are provided to show common misclassifications for each modulation type. Additionally, there exist several use-cases where relatively short signal bursts are received. For example, a wide-band scanning receiver may only detect a short signal burst. Therefore, signal duration in the time domain versus AMC performance is investigated to determine the robustness of the best performing model when short signal bursts are received.

A. Overall Performance
Table III lists the maximum and average accuracy performance for each model in the ablation study. A binary naming convention is used to indicate the various methods used for each architecture. Similarly to the result found in Section IV, increasing the temporal context typically results in increased performance. Models that incorporate dilated convolutions tended to have higher average accuracies than models without dilated convolutions.
The best performing model in terms of average accuracy across all SNR conditions included SE blocks, dilated convolutions, and a ReLU activation prior to statistics pooling (model 1110), with an average accuracy of approximately 63.7%. This model also achieved the highest maximum accuracy of about 98.9% at 22dB SNR. SE blocks did not increase performance compared to model 0000 with the exception of models 1110 and 1111. However, SE blocks were incorporated in the best performing model, 1110. Self-attention was not found to aid classification performance in general with the proposed architecture. Self-attention introduces a large number of trainable parameters, possibly forming a complex loss space.
Table IV lists the performances of single modification (from baseline) architectures. Each component of the ablation study, with the exception of dilated convolutions, decreased performance when applied individually. When combined, however, the best performing model was found. Therefore, we conclude that the components could possibly aid each other's optimization and that, in general, dilated convolutions tend to yield the most dramatic performance increases.

B. Accuracy Over Varying SNR
Figure 11 summarizes the ablation study in terms of classification accuracy over varying SNR levels. We add this figure for completeness and reproducibility for other researchers. The accuracy within each SNR band is shown along with the modifications used, similar to Table III. The coloring in the figure denotes the accuracy in each SNR band. Performance follows a trend similar to that of a sigmoid function, where the rate at which peak classification accuracy is achieved is the most distinguishing feature between the different models. With the improved architectures, a maximum of 99% accuracy is achieved at high SNR levels (starting around 12 dB SNR).
While the proposed changes to the architectures generally improve performance at higher SNR levels, the largest improvements occur between −12 dB and 12 dB compared to the baseline model in [7]. For example, at 4 dB, the performance increases from 75% up to 82%. Incorporating these modifications to the network may prove to be critical in real-world situations where noisy signals are likely to be obtained. Improving AMC performance at lower SNR ranges (< −12 dB) is still an open research topic, with accuracies near chance level.
One observation is that the best performing model can vary with SNR. In systems that have available memory and processing power, an approach similar to [2] may be used to utilize several models and intelligently choose predictions based on estimated SNR conditions. That is, if the SNR of the signal of interest is known, a model can be tuned to increase performance slightly, as shown in [2]. Using the results presented here, researchers could also choose the architecture differences that perform best for a given SNR range (although performance differences are subtle).
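As a rough illustration of choosing a model by estimated SNR, the sketch below routes an SNR estimate to one of several classifiers. The band edges and model names are hypothetical placeholders chosen only for illustration; they are not the groupings used in [2] or values from this study.

```python
from bisect import bisect_right

# Hypothetical band edges in dB (placeholders, not the ranges from [2]).
BAND_EDGES_DB = [-8.0, 4.0]
# Hypothetical classifier names, one per band: (-inf, -8), [-8, 4), [4, +inf).
MODEL_NAMES = ["low_snr_model", "mid_snr_model", "high_snr_model"]


def select_model(estimated_snr_db):
    """Return the name of the classifier assigned to the band containing the estimate."""
    return MODEL_NAMES[bisect_right(BAND_EDGES_DB, estimated_snr_db)]


if __name__ == "__main__":
    for snr in (-12.0, 0.0, 10.0):
        print(f"{snr:+.1f} dB -> {select_model(snr)}")
```

In practice the SNR estimate would come from a separate estimator, and the added memory cost grows with the number of models kept resident.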

C. Parameter Count Tradeoff
An overview of each model's complexity and overall performance across the entire testing set is shown in Table III. This information is also shown graphically in Figure 12 for the maximum accuracy over SNR and the average accuracy across all SNRs. Whether looking at the maximum or the average measures of performance, the conclusions are similar. The previously described binary model name also appears in the figure. We found a slight correlation between the number of model parameters and overall model performance; however, with the architectures explored, there was a range of parameter counts where performance peaked. Models with parameter counts between approximately 170k and 205k generally performed better than smaller and larger models. We note that the models with more than 205k parameters included self-attention, which was found to decrease model performance with the proposed architectures. This implies that one possible reason self-attention did not perform as well as other modifications is the increase in parameters, which results in a more difficult loss space to optimize.

IX. BEST PERFORMING MODEL INVESTIGATION
Due to the large volume of models, we focus upon the best performing model (model 1110) for the remainder of this work. As previously mentioned, this model employs all modifications except self-attention.

A. Top-K Accuracy
As discussed, in systems where the modulation scheme must be classified quickly, it is advantageous to apply fewer demodulation schemes in a trial and error fashion. This is particularly significant at lower SNR values where accuracy is mediocre. Top-k accuracy provides an in-depth view of the expected number of trials before finding the correct modulation scheme. While traditional accuracy (top-1 accuracy) characterizes the performance of the model in terms of classifying the exact variant, top-k accuracy measures the percentage of cases in which the correct variant is among the classifier's top k predictions (sorted by descending class probabilities). We plot the top-1, top-2, and top-5 classification accuracy over varying SNR conditions for each modulation grouping defined in Section III in Figure 13.
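The top-k metric described above can be sketched in a few lines. This is a generic illustration on toy data, not the evaluation code used in this study; the function name `top_k_accuracy` and the three-class probabilities are made up for the example.

```python
def top_k_accuracy(probabilities, labels, k):
    """Fraction of examples whose true label is among the k most probable classes."""
    hits = 0
    for probs, label in zip(probabilities, labels):
        # Class indices sorted by descending predicted probability.
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


if __name__ == "__main__":
    # Toy predictions over three classes (illustrative values only).
    probs = [
        [0.6, 0.3, 0.1],
        [0.2, 0.5, 0.3],
        [0.1, 0.4, 0.5],
        [0.5, 0.3, 0.2],
    ]
    labels = [0, 2, 1, 2]
    for k in (1, 2, 3):
        print(f"top-{k} accuracy: {top_k_accuracy(probs, labels, k):.2f}")
```

Top-1 accuracy is the special case k = 1, and the metric is non-decreasing in k because each larger candidate set contains the smaller one.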
Although performance decays to approximately random chance for the overall (all modulation schemes) performance curves for each top-k accuracy, it is notable that some modulation group performances drop below random chance. The models are trained to maximize the overall model performance. This could explain why certain modulation groups dip below random chance but the overall performance and other modulation groups remain at or above random chance.
Using the proposed method greatly reduces the correct modulation scheme search space. While high performance in top-1 accuracy is increasingly difficult to achieve with low SNR signals, top-2 and top-5 accuracy converge to higher values at a much faster rate. This indicates our proposed method greatly reduces the search space from 24 modulation candidates to fewer candidate types when employing trial and error methods to determine the correct modulation scheme. Further, if the group of modulation is known (e.g., FM), one can view a more specific tradeoff curve in terms of SNR and top-k accuracy given in Figure 13.

B. Short Duration Signal Bursts
Due to the rapid scanning characteristic of some modern software-defined radios, we investigate the performance tradeoff of varying signal duration and AMC performance. This analysis is meant to emulate the situation wherein a receiver only detects a short RF signal burst. We investigate signal burst durations of 1.024 ms (full length signal from original dataset), 512 µs, 256 µs, 128 µs, 64 µs, 32 µs, and 16 µs. We assume the same 1 MS/s sampling rate as in the previous analyses, such that a 16 µs burst is captured in 16 I/Q samples. In this section, we use the same test set as our other investigations; however, a uniformly random starting point is determined for each signal such that a contiguous sample of the desired duration, starting at the random point, is chosen. Thus, the chosen segment from a test set sample is randomly assigned.
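The burst extraction described above can be sketched as follows. This is a minimal illustration, assuming each test signal is a sequence of I/Q samples; the helper names `samples_for_duration_us` and `random_crop` are hypothetical, and a plain list stands in for real I/Q data.

```python
import random


def samples_for_duration_us(duration_us, sample_rate_msps=1):
    """Number of samples in a burst of the given duration (microseconds) at the given rate (MS/s)."""
    return int(round(duration_us * sample_rate_msps))


def random_crop(iq_samples, n, rng=random):
    """Return n contiguous samples starting at a uniformly random valid offset."""
    total = len(iq_samples)
    if not 1 <= n <= total:
        raise ValueError(f"crop length {n} must be between 1 and {total}")
    start = rng.randint(0, total - n)  # randint is inclusive at both ends
    return iq_samples[start:start + n]


if __name__ == "__main__":
    signal = list(range(1024))  # stand-in for a 1.024 ms signal at 1 MS/s
    rng = random.Random(0)      # seeded only to make the demo repeatable
    for duration_us in (1024, 512, 256, 128, 64, 32, 16):
        n = samples_for_duration_us(duration_us)
        burst = random_crop(signal, n, rng)
        print(f"{duration_us:>5} us -> {len(burst)} samples starting at index {burst[0]}")
```

Drawing the offset from 0 to total − n keeps every crop fully inside the original signal, so no padding is required.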
We also note that, although the sample length for the evaluation is changed, the best performing model is the same architecture with the exact same trained weights because this model uses statistics pooling from the X-Vector inspired modification. A significant benefit to the X-Vector inspired architecture is its ability to handle variable-length inputs without the need for padding, retraining, or other network modifications. This is achieved by taking global statistics across convolutional channels, producing a fixed-length vector regardless of signal duration. Due to this flexibility, the same model (model 1110) weights are used for each duration experiment. This fact also emphasizes the desirability of using X-Vector inspired AMC architectures for receivers that are deployed in an environment where short-burst and variable duration signals are anticipated to be present.
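To illustrate why statistics pooling yields a fixed-length vector for any input duration, the sketch below pools a feature map over time. It assumes the pooled statistics are the per-channel mean and (population) standard deviation, which is the usual X-Vector choice; the exact statistics used in this study are not restated here, and the function name `statistics_pooling` is hypothetical.

```python
from statistics import mean, pstdev


def statistics_pooling(feature_map):
    """Pool over time: feature_map is a list of channels, each a list of activations.

    Returns the per-channel means followed by the per-channel standard deviations,
    so the output length is 2 * (number of channels) for any number of time steps.
    """
    means = [mean(channel) for channel in feature_map]
    stds = [pstdev(channel) for channel in feature_map]  # population std is an assumption
    return means + stds


if __name__ == "__main__":
    short_input = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]  # 2 channels, 3 time steps
    long_input = [[1.0] * 100, [2.0] * 100]           # 2 channels, 100 time steps
    print(len(statistics_pooling(short_input)), len(statistics_pooling(long_input)))
```

Because the layers after pooling only ever see a vector of this fixed length, the same trained weights can be applied to bursts of any duration.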
For each signal duration in the time domain, we plot the overall classification accuracy over varying SNR conditions as well as the accuracy for each modulation grouping defined in Section III. Figure 14 demonstrates the tradeoff for various signal durations, where n is the number of samples from the time domain I/Q signal. The first observation is, as we would expect, that classification performance degrades with decreased signal duration. For example, the maximum accuracy begins to degrade at 256 µs and the degradation is more noticeable at 128 µs. This is likely a result of using sample statistics that yield unstable or biased estimates for short signal lengths, since the number of received signal data points is insufficient to characterize the sample statistics used during training. Random classification accuracy is approximately 4% (1 in 24 classes) and is shown by the black dotted line in Figure 14. Although classification performance decreases with decreased duration, we are still able to achieve significantly higher classification accuracy than random chance down to 16 µs of signal capture.
FM (frequency modulation) signals were typically more resilient to noise interference than AM (amplitude modulation) and AM-PM (amplitude and phase modulation) signals in our AMC. This was observed across all signal burst durations and in our top-k accuracy analysis. This behavior indicates that the performance of our AMC for short bursts, in the presence of increasing amounts of noise, is more robust for signals modulated by changes in the carrier frequency and is more sensitive to noise for signals modulated by varying the carrier amplitude. We attribute this behavior to our AMC architecture, the architecture of the receiver, or a combination of both.

C. Confusion Matrices
While classification accuracy provides a holistic view of model performance, it lacks the granularity to investigate where misclassifications are occurring. Confusion matrices are used to analyze the distribution of classifications for each given class. For each true label, the proportion of correctly classified samples is calculated along with the proportion of incorrect predictions for each opposing class. In this way, we can see which classes the model is struggling to distinguish from one another. A perfect classifier would yield the identity matrix, in which each diagonal entry corresponds to the predicted class matching the true class. Each matrix value represents the proportion of classifications for the true label, and each row sums to 1 (100%).
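A row-normalized confusion matrix of the kind described above can be computed as in the following sketch. The toy labels are illustrative only, and the function name `normalized_confusion_matrix` is hypothetical; classes are assumed to be encoded as integers starting at 0.

```python
def normalized_confusion_matrix(true_labels, predicted_labels, num_classes):
    """Rows index the true class, columns the predicted class; each non-empty row sums to 1."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for true, predicted in zip(true_labels, predicted_labels):
        counts[true][predicted] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Classes with no test examples are left as an all-zero row.
        matrix.append([count / total if total else 0.0 for count in row])
    return matrix


if __name__ == "__main__":
    true_labels = [0, 0, 1, 1, 1, 2]
    predicted_labels = [0, 1, 1, 1, 0, 2]
    for row in normalized_confusion_matrix(true_labels, predicted_labels, 3):
        print(["%.2f" % value for value in row])
```

Normalizing by row makes classes with different numbers of test examples directly comparable, and the diagonal entry of each row is the per-class recall.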
Figure 15 illustrates the class confusion matrices for SNR levels greater than or equal to 0 dB for model 1110, the reproduced ResNet architecture from [1], and the baseline X-Vector architecture from [7], respectively. As shown in [7], the X-Vector architecture was able to distinguish PSK and AM-SSB variants to a higher degree and performed better overall than [1]. Both architectures struggled to differentiate QAM variants.
Model 1110 improved upon these prior results for QAM signals and in general has higher diagonal components than the other architectures. This again supports a conclusion that model 1110 achieves a new state-of-the-art in AMC performance.

X. CONCLUSION
A comprehensive ablation study was carried out with regard to AMC architectural features using the extensive RadioML 2018.01A dataset. This ablation study built upon the strong performance of a new baseline model that was also introduced in the initial investigation of this study. This initial investigation informed the design of a number of AMC architecture modifications, specifically the use of X-Vectors, dilated convolutions, and SE blocks. With the combined modifications, we achieved a new state-of-the-art in AMC performance. Among these modifications, dilated convolutions were found to be the most critical architectural feature for model performance. Self-attention was also investigated but was not found to increase performance, although increased temporal context improved upon prior works.

Fig. 4. Accuracy comparison of the reproduced ResNet in [1] and the X-Vector inspired model from [7] over varying SNRs. This accuracy comparison shows the superior performance of the X-Vector architecture, especially at higher SNRs, and supports using this architecture as a baseline for the improvements investigated in this paper.

The baseline architecture used f = 64 (for all layers) and k = 3 (consistent kernel size for all layers). Our first modification to the baseline architecture is F = [32, 48, 64, 72, 84, 96, 108], keeping k = 3 for all layers. Second, we use the baseline architecture but change the size of the filters in the network, where f = 64 (same as baseline) and K = [7, 5, 7, 5, 3, 3, 3]. Third, we make both modifications and compare the result to the baseline model, where F = [32, 48, 64, 72, 84, 96, 108] and K = [7, 5, 7, 5, 3, 3, 3]. These modifications are not exhaustive searches; rather, they are meant to guide future changes to the network by understanding the influence of filter quantity and filter size in a limited context.
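To make the notion of temporal context concrete, the sketch below compares the receptive field of the two kernel-size configurations. It assumes a plain stack of seven stride-1, undilated 1-D convolutions with no pooling between them; the actual strides and layer arrangement are not restated in this passage, so the numbers are illustrative. The function name `receptive_field` is hypothetical.

```python
def receptive_field(kernel_sizes):
    """Receptive field (in input samples) of stacked stride-1, undilated 1-D convolutions."""
    field = 1
    for k in kernel_sizes:
        field += k - 1  # each layer extends the view by (kernel size - 1) samples
    return field


if __name__ == "__main__":
    baseline_k = [3] * 7                 # k = 3 for all seven layers
    modified_k = [7, 5, 7, 5, 3, 3, 3]   # K from the second and third modifications
    print("baseline:", receptive_field(baseline_k))   # 15 under the stated assumptions
    print("modified:", receptive_field(modified_k))   # 27 under the stated assumptions
```

Under these assumptions, the larger kernels nearly double the span of input samples that influences each output position, without changing the number of layers.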

Fig. 7. SNR vs. accuracy comparison of the initial investigation using the baseline architecture. Noticeable improvements can be observed across all SNRs.

Figure 7 illustrates the change in accuracy with varying SNR. The combined model, utilizing various kernel sizes and numbers of filters, consistently outperforms the other architectures across changing SNR conditions. Although increasing the number of filters alone decreases performance, combining the approach with larger kernel sizes yields the best performance in our initial investigation. Increasing the temporal context may have allowed additional filters to better characterize the input signal. Because increased temporal context improves AMC performance, we are inspired to investigate additional methods, such as squeeze-and-excitation blocks and dilated convolutions, that can increase global and local context [25], [36].

Fig. 9. Proposed architecture with modifications including SENets, dilated convolutions, optional ReLU activation before statistics pooling, and self-attention. The output tensor sizes are also shown for each unit in the diagram. An * denotes where the sizes differ from the baseline architecture.

Fig. 10. Dilated convolutions diagram. The top shows a traditional kernel applied to sequential time series points. The middle and bottom diagrams illustrate dilation rates of two and three, respectively. These dilations serve to increase the receptive field of the filter without increasing the number of trainable variables in the kernel.
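As a brief worked relation (a standard property of dilated convolutions rather than a result of this study), a kernel with k taps and dilation rate d spans the number of input samples given below, while still holding only k trainable weights per filter. The kernel size k = 3 is assumed purely for illustration, since the size drawn in the figure is not stated here.

```latex
k_{\mathrm{eff}} = d\,(k - 1) + 1, \qquad
k = 3:\quad
d = 1 \Rightarrow k_{\mathrm{eff}} = 3,\quad
d = 2 \Rightarrow k_{\mathrm{eff}} = 5,\quad
d = 3 \Rightarrow k_{\mathrm{eff}} = 7
```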

Fig. 11. Ablation study results in terms of classification accuracy across SNR ranges. The best performing model is in the second to last row and displays strong performance across SNR values.

Fig. 12. Ablation study parameter count tradeoff. The x-axis shows the number of trainable variables in each model and the y-axis shows max or average accuracy. The callout for each point denotes the model name as shown in Table III.

Fig. 14. Tradeoff in accuracy for various signal lengths across SNR, grouped by modulation category for the best performing model 1110. The top plot shows the baseline performance using the full sequence. Subsequent plots show the same information using increasingly smaller signal lengths for classification.
Harper et al. investigated methods to improve classification performance in this range by employing an SNR regression model to aid separate modulation classifiers (MCs). While other works have trained models to be as resilient as possible under varying SNR conditions, Harper et al. employed SNR-specific MCs [2].

TABLE I. SNR GROUPINGS FOR TRAINING SNR-SPECIFIC CLASSIFIERS AND DEMULTIPLEXED CLASSIFICATION RANGES FOR EACH PREDICTED SNR.
Training Range (dB) Demultiplexed Classification Range

TABLE II. INITIAL INVESTIGATION PERFORMANCE OVERVIEW. ALL ARCHITECTURES EMPLOY THE BASELINE WITH VARYING NUMBERS OF KERNELS AND KERNEL SIZES.

TABLE IV. INDIVIDUAL NETWORK MODIFICATION PERFORMANCE OVERVIEW. ENTRIES ARE REPEATED FROM TABLE III FOR CLARITY.