1. Introduction
Recent advances in Software-Defined Radio (SDR) and cognitive networking technologies, together with the growing availability of low-cost hardware, have made most applications dependent on wireless networks [1]. This dependence gives adversaries an opportunity to deploy jamming attacks (also known as intentional Radio Frequency Interference, RFI) and harm systems that rely on wireless networks [1]. Jamming attacks cause Denial-of-Service (DoS) problems such as slow web browsing and file downloads, a severely limited number of active voice users, and, as a result, increased network latency [2]. Although jammers can be launched using simple and cheap technologies, they are hard to defeat completely because of the large variety of available jammers [3].
To guarantee the Quality of Service (QoS) and security of a wireless communication system, a robust RFI detection strategy is required so that an effective mitigation process can follow [3]. Besides, it is essential to precisely determine the modulation type of the signal of interest (SoI) when it is combined with any type of RFI, since Automatic Modulation Classification (AMC) is a significant procedure in communication networks that facilitates demodulation at the receiver side [4].
To address this concern, Machine Learning (ML)-based techniques have shown promising results in multiclass RFI recognition [5,6] and AMC [6]. However, the complexity of preprocessing tasks such as feature extraction and feature selection in classical ML techniques considerably degrades classification efficiency [7]. To tackle this issue, deep learning (DL)-based approaches, a subfield of ML, have delivered outstanding RFI detection results. DL-based techniques stack numerous information-processing layers in a hierarchical design for either pattern classification or feature extraction [8]. One of the most successful types of DL model is the Convolutional Neural Network (CNN), which has typically been used for object detection in computer vision without any prior knowledge of the object’s location [9].
The main challenge of DL in supervised learning applications is often the lack of enough data to train a model from scratch. To address this issue, image-based transfer learning has gained traction for cases where the dataset is too small to build a model [10,11]. Transfer learning refers to reusing CNN architectures that were pretrained on a large prebuilt dataset, such as ImageNet [10]. Hence, transfer learning reduces the training time by reusing the pretrained layers of a model [10].
The main contribution of this work is to identify not only the type of the received signal but also the digital modulation scheme in the presence of jamming signals, using transfer learning in a real-time digital video broadcasting scenario based on the DVB-S2 standard. To this end, we propose a hierarchical classification design for RFI classification and AMC that leverages transfer learning with pretrained CNNs (AlexNet, VGG16, GoogleNet, and ResNet18) for feature learning, followed by a fully-connected classifier. This study provides a comparative analysis of these pretrained CNNs with respect to classification accuracy and consumed training time. As input data, we generate visual representations of the received signals in the time-frequency domain, namely the magnitude squared of the wavelet transform, known as the scalogram [12].
In this work, the SoI is a video stream transmitted in a digital video broadcasting scenario based on the DVB-S2 standard over real-time satellite communication (Satcom). We assume that the SoI is combined with three well-known types of jammers, namely continuous-wave interference (CWI), multi-CWI (MCWI), and chirp interference (CI), to increase the complexity of the scenarios and to simulate realistic situations [5]. As a result, the proposed methodology can determine both the type of the received signal, either the SoI alone or the SoI combined with one of the jammers, and the modulation type of the SoI. We investigate four modulation types because of their applicability in the DVB-S2 standard, namely quadrature phase-shift keying (QPSK), 8-ary amplitude and phase-shift keying (8-APSK), 16-ary APSK (16-APSK), and 32-ary APSK (32-APSK).
The rest of this paper presents the related works in Section 2, the proposed methodology in Section 3, and the simulation results in Section 4. Finally, Section 5 concludes the paper.
2. Related Works
With the rapid advances in AI technology, DL is increasingly being applied to RFI and modulation classification. For example, in [13], a robust DL-based technique built on faster region-based convolutional neural networks (Faster R-CNN) is proposed for interference and clutter detection in high-frequency surface wave radar (HFSWR). The Range-Doppler (RD) spectrum image is used as the input of the designed network. The proposed method achieves a high classification accuracy and a decent detection performance [13].
Z. Yang et al. proposed a CNN-based strategy named RFI-Net to detect interference in the Five-hundred-meter Aperture Spherical radio Telescope (FAST) [14]; it outperforms other techniques such as the CNN-based U-Net model, k-nearest neighbors (KNN), and Sum-Threshold. In [15], two DL-based strategies are used for jamming attack detection, namely deep convolutional neural networks (DCNN) and deep recurrent neural networks (DRNN). Two different jamming attacks, classical wide-band barrage jamming and reference signal jamming, are analyzed [15]. The results show that the classification accuracy reaches up to 86.1% in a realistic test environment [15].
In [16], three methods, a Convolutional Long Short-term Deep Neural Network (CLDNN), a Long Short-Term Memory (LSTM) neural network, and a deep Residual Network (ResNet), are proposed to recognize ten different modulation types. The results indicate that the classification accuracy reaches up to 90% at high SNRs. Further, Principal Component Analysis (PCA) is deployed to optimize the classification process by minimizing the size of the training dataset [16]. A combination of transfer learning and a pretrained Inception-ResNetV2 is presented in [17] to recognize three modulation types, namely Binary Phase Shift Keying (BPSK), QPSK, and 8PSK, at an SNR of 4 dB. The reported classification accuracies for BPSK, QPSK, and 8PSK are 100%, 99.66%, and 96.33%, respectively [17].
In [18], a robust hierarchical DNN architecture is presented that performs a hierarchical classification to estimate the data type (analog or digital modulation), the modulation class, and the modulation order. Spectrogram snapshots computed from the baseband in-phase and quadrature (I/Q) components of the signal are used as the input of the CNN, which reaches a classification performance of 90% at high SNR for most modulation schemes [18]. Yang et al. present an efficient methodology using CNNs and Recurrent Neural Networks (RNN) to classify six modulation types under two channel conditions, Additive White Gaussian Noise (AWGN) and Rayleigh fading [19]. According to the experimental results, the classification precision of the CNN is always close to 100% in the AWGN channel [19]. Even in the Rayleigh channel, the minimum classification accuracy still approaches 84%, whereas the maximum value is near 96%.
Ref. [20] proposes a robust CNN-based approach that can precisely classify four types of modulation, BPSK, QPSK, 8PSK, and 16QAM, in an orthogonal frequency division multiplexing (OFDM) system in the presence of phase offset (PO). In [21], CNN and LSTM are used to solve the AMC problem. Furthermore, the proposed classifiers, which are based on a fusion model in serial and parallel modes, considerably improve the classification accuracy when the SNR ranges from 0 dB to 20 dB [21]. The serial fusion mode shows the best performance among the modes.
3. Proposed Methodology
This study proposes a DL-based approach for RFI recognition and AMC that benefits from the transfer learning strategy. The general framework is a hierarchical classification in which the first level determines the type of the received signal, which is either the SoI or a combination of the SoI with one of the jamming signals, and the second level determines the modulation type of the SoI [6]. To this end, in the first classification level, a classifier is trained to determine the type of the received signal. Further, one classifier is trained per type of received signal to recognize the modulation type of the combined SoI.
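The two-level decision can be sketched in a few lines of Python. This is only an illustration of the control flow, not the implementation used in this study: the function name `classify_hierarchically`, the toy decision rules, and the feature values are all hypothetical stand-ins for trained classifiers.

```python
from typing import Callable, Dict, Sequence, Tuple

Features = Sequence[float]
Classifier = Callable[[Features], str]


def classify_hierarchically(
    features: Features,
    rfi_classifier: Classifier,
    amc_classifiers: Dict[str, Classifier],
) -> Tuple[str, str]:
    """Return (signal type, modulation type) for one received signal."""
    # Level 1: decide whether the signal is the SoI alone or SoI plus a jammer.
    signal_type = rfi_classifier(features)
    # Level 2: use the modulation classifier trained for that signal type.
    modulation = amc_classifiers[signal_type](features)
    return signal_type, modulation


# Toy stand-ins for trained classifiers (hypothetical decision rules).
def toy_rfi(features: Features) -> str:
    return "SoI+CWI" if features[0] > 0.5 else "SoI"


toy_amc: Dict[str, Classifier] = {
    "SoI": lambda f: "QPSK" if f[1] < 0.5 else "16-APSK",
    "SoI+CWI": lambda f: "8-APSK" if f[1] < 0.5 else "32-APSK",
}

result = classify_hierarchically([0.9, 0.2], toy_rfi, toy_amc)
print(result)
```

Keeping one second-level classifier per signal type means each modulation classifier only has to handle the interference pattern it was trained on.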
Figure 1 demonstrates the proposed methodology, which follows four steps: (1) data acquisition, (2) scalogram computation, (3) feature extraction using a pretrained CNN, and (4) classification. Each step is elaborated in the rest of this section.
3.1. Data Acquisition Set-Up
As fully explained in [5,6], the desired signal is a video stream, which is modulated and processed by GNU Radio and transmitted using a Universal Software Radio Peripheral (USRP-N210) [22]. To model a real-time Satcom link, a channel simulator (RTLogic T400) [23] is used. Further, the generated jamming signals are combined with the SoI by a combiner. Finally, the combined signal is received by a MegaBee modem [5]. Notably, the AWGN power can be manually adjusted in the range of −168 to −125 dBm, which approximately corresponds to an SNR of 5 to 12 dB.
Figure 2 shows the real-time RFI data acquisition set-up.
Table 1 presents a summary of the dataset specification generated in [5].
This study analyzes the efficiency of the proposed classification technique in the presence of three jamming signals, namely continuous-wave interference (CWI), multi-CWI (MCWI), and chirp interference (CI) [5].
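- (1) Continuous Wave Interference (CWI):
$$x_{\mathrm{CWI}}(t) = \exp\left(j 2\pi f_c t\right), \qquad (1)$$
where $f_c$ and $t$ represent the center frequency and the duration of interference, respectively.
- (2) Multi Continuous Wave Interference (MCWI): In this study, we have considered two-tone CW, which is defined as:
$$x_{\mathrm{MCWI}}(t) = \exp\left(j 2\pi f_{c1} t\right) + \exp\left(j 2\pi f_{c2} t\right), \qquad (2)$$
where $f_{c1}$ and $f_{c2}$ are the center frequencies of each wave.
- (3) Chirp Interference (CI): The CI has been generated according to [24] as follows:
$$x_{\mathrm{CI}}(t) = \exp\left(j 2\pi \left(f_0 t + \tfrac{b}{2} t^2\right)\right), \qquad (3)$$
where $b = (f_1 - f_0)/T$, so that the signal sweeps from $f_0$ to $f_1$, and $T$ is the sweeping duration.
Note: the center frequencies are changed randomly.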
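As an illustration, the three jammer types can be synthesized as sampled complex baseband signals. The sketch below assumes unit-amplitude complex exponentials and a linear frequency sweep; the function names, the sampling rate, and all frequency values are hypothetical and are not taken from the data acquisition set-up.

```python
import cmath
import math
import random


def cwi(fc, fs, n):
    """Single complex tone at center frequency fc (Hz), sampled at fs (Hz)."""
    return [cmath.exp(2j * math.pi * fc * k / fs) for k in range(n)]


def mcwi(fc1, fc2, fs, n):
    """Two-tone interference: the sum of two complex tones."""
    return [a + b for a, b in zip(cwi(fc1, fs, n), cwi(fc2, fs, n))]


def chirp(f0, f1, fs, n):
    """Linear chirp sweeping from f0 to f1 over the duration T = n / fs."""
    sweep_time = n / fs
    rate = (f1 - f0) / sweep_time  # sweep rate in Hz per second
    samples = []
    for k in range(n):
        t = k / fs
        phase = 2 * math.pi * (f0 * t + 0.5 * rate * t * t)
        samples.append(cmath.exp(1j * phase))
    return samples


rng = random.Random(0)          # fixed seed so the example is repeatable
fs = 1000.0                     # assumed sampling rate in Hz
n = 256                         # number of samples per signal
fc = rng.uniform(50.0, 450.0)   # randomly chosen center frequency
tone = cwi(fc, fs, n)
two_tone = mcwi(100.0, 300.0, fs, n)
sweep = chirp(50.0, 400.0, fs, n)
```

The instantaneous frequency of the chirp is `f0 + rate * t`, which equals `f0` at the start and `f1` at the end of the sweep.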
Dataset Generation
This study uses a visual representation of the received signals in the time-frequency domain, the scalogram, as the input data. The scalogram is the squared magnitude of the Continuous Wavelet Transform (CWT) and is mathematically defined as [25]:
$$z(a, b) = \left| \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt \right|^{2}, \qquad (4)$$
where $z$ and $\psi^{*}$ denote the scalogram and the complex conjugate of the mother wavelet function, $x(t)$ is the received signal, and $a$ and $b$ are the oscillatory frequency (scale) and shifting position of the wavelet, respectively [25]. CWT is widely applied for nonstationary and transient signal analysis, mainly through its scalogram [26]. The main difference between the wavelet transform and the short-time Fourier transform (STFT) is that the STFT has a fixed signal analysis window, whereas the wavelet transform uses short windows at high frequencies and long windows at low frequencies [12].
Therefore, the wavelet transform provides superior time resolution at high frequencies and superior frequency resolution at low frequencies [12]. Hence, wavelet-based analysis is an appropriate choice when the signal at hand has high-frequency components for a short duration and low-frequency components for a longer period, as is the case in this study [12]. As shown in Figure 3, the scalograms of the SoI and of its combinations with the CWI, CI, and MCWI jammers are computed using the Morse wavelet [27], which is used to calculate the wavelet transform as well as the coherence analysis of the time series. For further processing, the computed scalogram is converted to an RGB image.
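To make the scalogram computation concrete, the following sketch evaluates the squared magnitude of a CWT by direct summation. It is a simplified illustration under stated assumptions: it uses a complex Morlet wavelet instead of the Morse wavelet used in this study, it ignores edge effects, and the test tone, sampling rate, and analysis frequencies are hypothetical.

```python
import cmath
import math

W0 = 6.0  # assumed center (angular) frequency of the Morlet wavelet


def morlet(t):
    """Complex Morlet mother wavelet (admissibility correction omitted)."""
    return math.pi ** -0.25 * cmath.exp(1j * W0 * t) * math.exp(-0.5 * t * t)


def scalogram(x, scales, fs):
    """Squared CWT magnitude: one row per scale, one column per time shift."""
    n = len(x)
    dt = 1.0 / fs
    rows = []
    for a in scales:
        row = []
        for b in range(n):
            acc = 0j
            for k in range(n):
                # Correlate with the conjugate of the scaled, shifted wavelet.
                acc += x[k] * morlet((k - b) * dt / a).conjugate()
            row.append(abs(acc * dt / math.sqrt(a)) ** 2)
        rows.append(row)
    return rows


fs = 64.0      # assumed sampling rate in Hz
n = 128        # number of samples
f_tone = 8.0   # frequency of the test tone in Hz
x = [math.cos(2 * math.pi * f_tone * k / fs) for k in range(n)]

# For the Morlet wavelet, scale a corresponds to frequency W0 / (2 * pi * a).
freqs = [2.0, 4.0, 8.0, 16.0]
scales = [W0 / (2 * math.pi * f) for f in freqs]

z = scalogram(x, scales, fs)
energy = [sum(row) for row in z]
best = freqs[energy.index(max(energy))]
print(best)
```

The row with the largest energy is the one whose scale matches the tone frequency. In practice, an FFT-based CWT routine from a signal processing library would replace the triple loop, and the resulting matrix would be mapped to a color image.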
3.2. Transfer Learning Process
One of the main applications of transfer learning is feature extraction [28]. In the feature extraction approach, the output of one or more layers of the pretrained CNN is used as the input feature vector for the classification phase [29]. Since deeper layers extract higher-level features, the layer right before the classification phase is an appropriate choice for feature extraction [30].
A typical CNN structure consists of two parts: (1) a convolutional part, composed of a stack of convolutional and pooling layers that extract features from the image-based input, and (2) a classification part, consisting of a set of fully-connected (FC) layers followed by an activation function such as Soft-Max, that classifies the images using the extracted features [11]. In the transfer learning process, the classification part can be replaced by a new classifier that fits the application at hand, and the model can be tuned using one of the following strategies [11]:
Training the entire model: The pretrained CNN architecture can be trained from scratch using a new dataset. Therefore, a large dataset and a lot of computational power are required.
Training some layers and leaving the others frozen: Since the lower layers extract general features while the higher layers represent more task-specific features, the number of layers to retrain can be chosen depending on the application. For a small dataset and a model with a large number of parameters, it is efficient to leave more layers frozen, because frozen layers are kept unchanged during training, which helps to avoid overfitting. On the other hand, for a large dataset and a model with a small number of parameters, training more layers on the new task is reasonable, since overfitting is less of an issue.
Freezing the convolutional part: In this scenario, the convolutional part is kept unchanged and its output is fed to a new classifier. In other words, the pretrained model serves as a fixed feature extractor, which is beneficial when the dataset is small and computational power is limited. This is the strategy applied in this study.
Notably, the first two strategies depend strongly on the learning rate, which defines how much the weights of a network are adjusted in each update. A small learning rate is preferable to a large one because it reduces the risk of losing previously learned knowledge [11].
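The third strategy, a frozen feature extractor followed by a newly trained classifier, can be sketched as follows. This is a toy illustration only: `frozen_features` is a hypothetical fixed function standing in for a pretrained CNN, the "images" are random lists of pixel values, and the classifier is a single Soft-Max layer trained with plain stochastic gradient descent.

```python
import math
import random


def frozen_features(image):
    """Stand-in for a frozen pretrained CNN: a fixed map that is never updated."""
    return [sum(image) / len(image), max(image) - min(image)]


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scores_for(weights, bias, x):
    return [sum(w * v for w, v in zip(weights[c], x)) + bias[c]
            for c in range(len(bias))]


def train_classifier(features, labels, n_classes, lr=0.5, epochs=200, seed=0):
    """Train only the new fully connected layer; the extractor stays fixed."""
    rng = random.Random(seed)
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(n_classes)]
    bias = [0.0] * n_classes
    order = list(range(len(labels)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = features[i], labels[i]
            probs = softmax(scores_for(weights, bias, x))
            for c in range(n_classes):
                # Gradient of the cross-entropy with respect to the class score.
                grad = probs[c] - (1.0 if c == y else 0.0)
                for d in range(dim):
                    weights[c][d] -= lr * grad * x[d]
                bias[c] -= lr * grad
    return weights, bias


def predict(weights, bias, x):
    scores = scores_for(weights, bias, x)
    return scores.index(max(scores))


rng = random.Random(0)
# Toy "images": class 0 is dim and flat, class 1 is bright with high contrast.
images = [[rng.uniform(0.0, 0.2) for _ in range(16)] for _ in range(20)]
images += [[rng.uniform(0.0, 1.0) for _ in range(16)] for _ in range(20)]
labels = [0] * 20 + [1] * 20

feats = [frozen_features(img) for img in images]  # computed once, never retrained
weights, bias = train_classifier(feats, labels, n_classes=2)
accuracy = sum(predict(weights, bias, f) == y
               for f, y in zip(feats, labels)) / len(labels)
```

Because the extractor is fixed, the features can be computed once and cached, so only the small classifier is updated during training.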
3.2.1. Pretrained CNNs
As presented in the previous section, transfer learning refers to the reuse of CNN architectures pretrained on a large dataset. In this study, we have analyzed the efficiency of four well-known CNN architectures, namely AlexNet [9], GoogleNet [31], ResNet18 [32], and VGG16 [33], regarding classification precision and training time. The architectures are summarized below:
AlexNet: In 2012, AlexNet, designed by the SuperVision group, outperformed prior architectures in the ImageNet LSVRC-2012 competition [9]. AlexNet includes five convolutional layers and three FC layers, with ReLU applied after every convolutional and FC layer. In addition, dropout is applied before the first and the second FC layers [9].
GoogleNet: GoogleNet won the ILSVRC 2014 competition with a precision close to human perception. Its architecture uses several small convolutions to reduce the number of parameters drastically. It is a 22-layer deep CNN, yet it decreases the number of parameters from 60 million (in AlexNet) to 4 million [31].
VGG: VGG is a CNN proposed by the Visual Geometry Group (VGG) at the University of Oxford [33] that improves on AlexNet by replacing large kernel-sized filters with multiple consecutive 3 by 3 filters. VGG16 was trained for weeks using NVIDIA Titan Black GPUs [33].
ResNet: The Residual Neural Network (ResNet) presented an outstanding performance in ILSVRC 2015 [32]. A residual block adds its input directly, through a shortcut connection, to the output of the second transformation and passes the sum through the final ReLU function [32].
It should be noted that the output of the last layer before the classification layer is used as the feature set for the designed classifier: “fc8” for AlexNet and VGG16, “loss3-classifier” for GoogleNet, and “fc1000” for ResNet18. The input image size is 227 by 227 for AlexNet and 224 by 224 for the other three CNNs.
3.2.2. Fully Connected (FC) Layer
In a CNN, the convolutional and pooling layers can be followed by a set of FC layers that behave like any artificial neural network (ANN), such as a multilayer perceptron (MLP). The purpose of the FC layers is to combine all the features (local information) learned by the previous layers to recognize larger patterns. For classification problems, the number of neurons in the last FC layer equals the number of classes [34]. In image classification problems, the standard method is to use a stack of FC layers followed by a Soft-Max activation function [11]. The output of Soft-Max is a probability distribution over the classes, and the class whose neuron has the maximum probability is taken as the classification result [35]:
$$\hat{y}_i = \frac{\exp(x_i)}{\sum_{j=1}^{k} \exp(x_j)}, \qquad (5)$$
where $\hat{y}$ presents the predicted label, $x$ is the output of the former layer, that is, the last fully connected layer, and $k$ represents the number of neurons in that layer, which equals the number of classes. The training phase works as in an MLP: after the CNN layers are defined, training starts by choosing the optimization technique. Two well-known optimizers for minimizing the loss function in Equation (6) are adaptive moment estimation (Adam) and Stochastic Gradient Descent (SGD) [36]. In this research, the loss function is the cross-entropy, which is mathematically defined as:
$$L = -\sum_{i=1}^{N} \sum_{j=1}^{M} t_{ij} \ln y_{ij}, \qquad (6)$$
where $N$ and $M$ refer to the number of samples and classes, respectively, $y_{ij}$ is the Soft-Max output for the $j$th class given the $i$th sample, and $t_{ij}$ is an indicator that the $i$th sample belongs to the $j$th class [36].
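A small worked example may help to connect the Soft-Max output and the cross-entropy loss. The FC outputs and target labels below are hypothetical values chosen for illustration, and the loss is written as a sum over the samples (dividing by the number of samples would give the mean loss instead).

```python
import math


def softmax(scores):
    """Map the outputs of the last FC layer to class probabilities."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, targets):
    """Cross-entropy summed over the samples; targets are class indices.

    With one-hot indicators, only the probability of the true class
    contributes to the inner sum over classes.
    """
    return -sum(math.log(p[t]) for p, t in zip(probabilities, targets))


# Hypothetical FC outputs for two scalograms and four classes.
fc_outputs = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 3.0, 0.0]]
targets = [0, 2]

probs = [softmax(s) for s in fc_outputs]
predicted = [p.index(max(p)) for p in probs]  # class with maximum probability
loss = cross_entropy(probs, targets)
print(predicted, round(loss, 4))
```

The loss shrinks toward zero as the probability assigned to the true class approaches one, which is what the optimizer exploits during training.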
4. Results And Discussion
This section presents the simulation results of the proposed methodology for both RFI recognition and AMC, obtained using MATLAB. We evaluate the performance of the four pretrained CNNs (AlexNet, GoogleNet, VGG16, and ResNet18) in the classification phase using the Deep Learning Toolbox. The results provide a comparative analysis of these pretrained CNNs with respect to accuracy in transfer learning and consumed training time. The FC architecture of each classifier consists of a layer with four neurons followed by a Soft-Max classifier. The highest classification results are achieved using SGD with momentum (SGDM) for the RFI classification phase and the Adam optimizer for the AMC phase.
4.1. Simulation Results for RFI Classification
Figure 4 presents the confusion matrices of RFI classification using the four pretrained CNNs. The classification accuracy is above 90% for all of them, and ResNet18 gives the most accurate result, with a precision of 98.3%.
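For readers who want to derive summary figures from a confusion matrix, the sketch below computes the overall accuracy and the per-class precision. The counts are hypothetical and are not the values of Figure 4; rows are assumed to be true classes and columns predicted classes.

```python
def overall_accuracy(cm):
    """Fraction of all samples that lie on the diagonal."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def per_class_precision(cm):
    """Precision of class j: correct predictions of j over all predictions of j."""
    n = len(cm)
    values = []
    for j in range(n):
        predicted_j = sum(cm[i][j] for i in range(n))
        values.append(cm[j][j] / predicted_j if predicted_j else 0.0)
    return values


classes = ["SoI", "SoI+CWI", "SoI+MCWI", "SoI+CI"]
# Rows are true classes, columns are predicted classes (hypothetical counts).
cm = [
    [48, 1, 1, 0],
    [0, 50, 0, 0],
    [2, 0, 47, 1],
    [0, 0, 1, 49],
]

accuracy = overall_accuracy(cm)
precision = per_class_precision(cm)
for name, value in zip(classes, precision):
    print(f"{name}: {value:.3f}")
print(f"overall accuracy: {accuracy:.3f}")
```

Overall accuracy summarizes the whole matrix in one number, whereas per-class precision shows which signal types attract false detections.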
Figure 5 compares the elapsed running time of each pretrained CNN architecture. The consumed time has been measured using the “tic” and “toc” functions of MATLAB. AlexNet is the least time-consuming of the four networks.
4.2. Simulation Results of AMC
For the AMC phase, we have trained one additional classifier per type of received signal to specify the modulation type of the combined SoI. As already mentioned, the SoI is transmitted using four modulation types: QPSK, 8APSK, 16APSK, and 32APSK. The following figures illustrate the AMC results for each type of received signal. As can be seen, the presence of jammers considerably degrades the classification accuracy. As Figure 6 indicates, AlexNet performs best for AMC in the absence of jamming signals, with a classification precision of 95.00%.
Figure 7 shows the AMC results in the presence of CWI, in which the highest accuracy was achieved using ResNet18 with a precision of 92.2%.
As Figure 8 shows for AMC in the presence of MCWI, the highest accuracy is obtained using VGG16, with a precision of 71.90%.
Figure 9 demonstrates the AMC results in the presence of CI. As is clear, the highest precision is 81.90%, using ResNet18.
According to the AMC results, ResNet18 is the most effective overall because it shows a higher average accuracy than the other networks.
4.3. Prediction Phase
The performance of the trained classifiers is assessed on new unseen datasets generated at different AWGN powers ranging from −140 to −125 dBm, which is approximately equal to an SNR range from 5 to 9 dB.
Table 2 shows the robustness of the trained CNNs in predicting new unseen data at different SNRs for RFI classification in the first classification level.
According to the results, VGG16 shows a more precise performance in detecting the type of unseen RFI at different noise levels.
Table 3, Table 4, Table 5, and Table 6 show the prediction results for each AMC case (SoI, SoI+CWI, SoI+MCWI, and SoI+CI) using the trained classifiers for RFI recognition and AMC.
As shown, in the absence of jamming signals, AlexNet is more effective at recognizing the modulation types at different noise powers.
As
Table 4 shows, ResNet18 performs more accurately compared to the other classifiers for AMC in the presence of CWI.
In the presence of MCWI, VGG16 is more robust in recognizing four different modulation types.
As Table 6 indicates, ResNet18-based classification slightly outperforms the three other techniques. In addition, the results show that the effect of each pretrained CNN on the prediction performance varies with the type of data. To sum up, ResNet18 shows the most promising results; however, the presented techniques are highly sensitive to AWGN power, and the classifiers become less reliable as the AWGN power increases.
5. Conclusions
In this work, we presented a transfer learning-based approach for RFI recognition and modulation classification. In this approach, a pretrained CNN analyzes the scalogram of the received signal to extract informative features, which are then used in the classification phase by a fully-connected layer. This work presented a comparative analysis of four well-known pretrained CNNs, namely AlexNet, GoogleNet, VGG16, and ResNet18. As the results show, the classification accuracy depends strongly on the type of input data and the feature extraction technique. Notably, the dataset used as the input in this study consists of scalograms of signals transmitted in a satellite-to-ground video broadcasting scenario based on the DVB-S2 standard. Further, the robustness of each trained classifier in predicting unseen data was evaluated. To sum up, in terms of classification, all the pretrained architectures perform relatively similarly, although AlexNet and VGG16 lead to the shortest and the longest training times, respectively.