RF-Enabled Deep-Learning-Assisted Drone Detection and Identification: An End-to-End Approach

The security and privacy risks posed by unmanned aerial vehicles (UAVs) have become a significant cause of concern. Due to technological advancement, these devices are becoming progressively less expensive, which makes them convenient for many different applications. The massive number of UAVs makes it difficult to manage and monitor them in restricted areas. In addition, other signals in the same frequency range make UAV signals more challenging to identify. In these circumstances, an intelligent system to detect and identify UAVs is a necessity. Most previous studies on UAV identification relied on various feature-extraction techniques, which are computationally expensive. Therefore, this article proposes an end-to-end deep-learning-based model to detect and identify UAVs based on their radio frequency (RF) signature. Unlike existing studies, multiscale feature extraction without manual intervention is used to extract enriched features, which helps the model generalize well and make decisions with lower computational time. Additionally, residual blocks are used to learn complex representations and to overcome vanishing gradient problems during training. The detection and identification tasks are performed in the presence of Bluetooth and WIFI signals, which occupy the same frequency band. For the identification task, the model is evaluated on specific devices, as well as on the signatures of particular manufacturers. The performance of the model is evaluated across various signal-to-noise ratios (SNRs), and the findings are compared with the results of previous work. The proposed model yields an overall accuracy, precision, sensitivity, and F1-score of 97.53%, 98.06%, 98.00%, and 98.00%, respectively, for RF signal detection from 0 dB to 30 dB SNR on the CardRF dataset. It demonstrates an inference time of 0.37 ms (milliseconds) for RF signal detection, which is a substantial improvement over existing work. The proposed end-to-end deep-learning-based method therefore outperforms the existing work in terms of both performance and time complexity. Based on the outcomes presented in the paper, the proposed model can be used in surveillance systems for real-time UAV detection and identification.


Introduction
In recent times, unmanned aerial vehicles (UAVs), widely known as drones, have become an area of substantial interest. Without a pilot on board, UAVs can be operated from miles away with the help of a remote controller. Initially, their applications were limited to the military sector [1]. Military UAVs are used in warfare, surveillance, air strikes, investigations, etc. [2]. However, drones are now used in a diverse range of applications beyond the military, making them a valuable tool in many different industries. For example, governments use UAVs for forestry surveillance [2], learning for classifying multiple drones was presented in [16]. The authors proposed a supervised deep learning algorithm to perform the detection and classification tasks and used the short-time Fourier transform (STFT) to preprocess the RF signals. STFT was first used for preprocessing in that work and was fundamental to the improved performance of their algorithm. In [10], the authors presented RF-UAVNet, a convolutional network for drone surveillance systems that identifies and classifies drones based on RF signals. The proposed architecture uses grouped convolutional layers to reduce the network size and computational cost. DroneRF [17], a publicly available dataset for RF-based drone detection systems, was used in that work. The DroneRF dataset was also used in [18], where the authors introduced compressed sensing, replacing the traditional sampling theorem, and a multi-channel random demodulator to sample the signal. Multistep deep learning was then used: a deep neural network (DNN) detected the UAV, and a convolutional neural network (CNN) further identified it. However, the DroneRF dataset does not allow other signals present in the 2.4 GHz band to be considered [19], so Bluetooth and WIFI signals were not considered in [10,16,18].
In [6], the authors analyzed RF-based UAV detection and identification while considering interference from other wireless signals such as Bluetooth and WIFI. They used the continuous wavelet transform (CWT) and the wavelet scattering transform (WST) to extract features, and they considered both the transient and steady states when classifying and identifying the signal. Furthermore, they applied multiple image-based feature-extraction techniques and compared their performance with the coefficient-based techniques (CWT, WST). They evaluated several machine learning (ML) models, such as support vector machine (SVM), k-nearest neighbors (KNN), and ensemble classifiers, in combination with principal component analysis (PCA) for the classification and identification tasks across various noise levels. They also performed transfer learning using SqueezeNet [20], a publicly available pretrained model, for the classification and identification of UAVs. In that work, the authors only considered drone control signals for detection. However, focusing solely on control signals is a notable limitation, because UAVs can be operated from a remote location, which potentially renders them undetectable. Therefore, to obtain a more reliable outcome, signals transmitted from the drones themselves must be considered [19]. Moreover, the authors observed severe performance degradation at lower signal-to-noise ratios (SNRs). In [19], the authors proposed a framework for classification, identification, and activity recognition. They considered commonplace 2.4 GHz signals such as WIFI and Bluetooth, UAV controller signals, and UAV signals. A stacked denoising autoencoder (SDAE) was used to reduce noise and channel effects. After an unmanned aerial system (UAS) signal (a UAV controller signal or a UAV signal) was identified, classification was further performed to determine the exact model of the device, using unique features extracted with the wavelet packet transform (WPT) and the Hilbert-Huang transform (HHT). Only the steady-state signals were considered, as the transient signal can be easily affected by channel effects [6]. In [6,19], the Cardinal RF (CardRF) dataset was used for the UAV detection tasks. However, most of the aforementioned literature [6,18,19] relied heavily on separate feature-extraction and noise-reduction methods, which significantly increase the workload and complexity [21].
To mitigate the aforementioned challenges, we propose an end-to-end deep CNN-based model to detect and identify UAS signals in the presence of WIFI and Bluetooth signals at various SNRs. We exploit a multiscale convolutional architecture to detect and classify UAV and UAV controller signals. We use the CardRF [22] dataset for training and for evaluating the predictive performance of the proposed model, as other datasets available for UAV surveillance have shortcomings, as described in [19]. The stacked convolutional layers in the network extract enriched information from the noisy data, so the proposed model does not require any further denoising or feature-extraction steps. Moreover, the feature-extraction capability of the network is enhanced by the multiscale architecture, in which features of different scales are obtained by running convolutional kernels of different sizes in parallel. Residual connections are also inserted in the proposed model to avoid gradient explosion, which results in superior training outcomes. Furthermore, the residual structures and maxpooling improve the performance of the model in backpropagation [23].
In summary, the main contributions of this work are as follows:
• An end-to-end DL-based system is proposed to detect and identify UAS, Bluetooth, and WIFI signals across various noise levels.
• The model does not require any manual feature-extraction steps, which reduces the computational overhead. The model exploits the RF signature of different devices for the detection and identification tasks.
• Stacked convolutional layers along with a multiscale architecture are used in the model, which helps extract crucial features from the noisy data without any separate feature-extraction technique.
• The performance of the model is evaluated using different performance metrics (e.g., accuracy, precision, sensitivity, and F1-score) on the CardRF dataset.
• Comparative experiments show that the proposed network outperforms the existing works in terms of performance and time complexity.
The rest of this paper is structured as follows: Section 2 describes the methodology of UAV detection and identification; Section 3 presents the experimental results and implementation details; and Section 4 concludes the paper.

Methodology
This section describes the detection and identification of UAS signals, along with Bluetooth and WIFI signals, using the proposed architecture and the CardRF dataset. Figure 1 depicts the complete architecture of the proposed system for the UAS signal. The samples sourced from the RF database are preprocessed, and additive white Gaussian noise (AWGN) is added to the samples to generate noisy samples at different SNRs. Each step of UAS signal detection and identification is described in detail in the following sections.

RF Dataset Description
For the proposed system, CardRF, a large-scale dataset, is used for the detection of different RF-based signals (e.g., UAS, WIFI, and Bluetooth) and for device identification. The dataset contains signals from five UAVs (one Beebeerun (Bbrun) and four DJI), five UAV flight controllers (one 3DR and four DJI), five Bluetooth devices (iPad, iPhone, and smartwatch), and two WIFI routers (one Cisco and one TP-link). The captured signals were passed through a 2.4 GHz bandpass filter to ensure that they occupy the same frequency band [19]. Each signal contains five million sampling points at 30 dB SNR. The details of the signal acquisition experiments are given in [19]. In this article, the steady state of the signals is considered, with 1024 sampling points per slice. The dataset used in this article is detailed in Table 1.

RF Signal Preprocessing
The RF signal preprocessing shown in Figure 1 is described here in detail. In the CardRF dataset, each signal contains five million sampling points, which comprise noise, a transient state, and a steady state. In this article, we consider 10 segments from each signal, where each segment contains 1024 sampling points, for the classification tasks, as a shorter signal length reduces the time complexity of the detection and identification system [19]. Some of the classes contain a transient state, as can be seen in Figure 2, and the transient state sometimes does not contain reliable features; therefore, only the steady-state signals were considered. Each segment is taken from the steady state and normalized by scaling its values to the range [0, 1] as follows:

$$x_{\mathrm{normalized}} = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}$$

where $x_i$ denotes the amplitude of the segmented signal, and $x_{\min}$, $x_{\max}$, and $x_{\mathrm{normalized}}$ denote the minimum, maximum, and normalized amplitude of the signal, respectively.
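The segmentation and min-max scaling described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names, the `start` offset used to skip to the steady state, and the choice of consecutive, non-overlapping slices are all assumptions made for the example.

```python
def segment(signal, length=1024, count=10, start=0):
    """Cut `count` consecutive slices of `length` samples, beginning at `start`.

    `start` is a hypothetical offset that would point into the steady state.
    """
    end = start + length * count
    if end > len(signal):
        raise ValueError("signal too short for the requested slices")
    return [signal[i:i + length] for i in range(start, end, length)]


def min_max_normalize(values):
    """Scale one segment so its minimum maps to 0 and its maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant segment: avoid division by zero
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


# Toy usage with a short synthetic signal instead of a five-million-point capture.
toy_signal = [float(i % 7) for i in range(40)]
slices = [min_max_normalize(s) for s in segment(toy_signal, length=8, count=3, start=4)]
```

Each slice is normalized independently here, which matches the per-segment wording of the text; normalizing over the whole capture would be a different design choice.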

Noise Incorporation
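To investigate the model performance across various noise levels, we add AWGN to the signals to produce noisy signals of 0 dB, 5 dB, 10 dB, 15 dB, 20 dB, and 25 dB SNR. To generate a noisy signal with a desired SNR, $\mathrm{SNR}_{\mathrm{Target}}$, the desired noise power, $P_{\mathrm{Noise}}$, can be calculated from the signal power, $P_{\mathrm{Signal}}$, and $\mathrm{SNR}_{\mathrm{Target}}$ as follows:

$$P_{\mathrm{Signal}} = \frac{1}{m}\sum_{i=1}^{m} x_i^{2}$$

$$P_{\mathrm{Signal}}^{\mathrm{dB}} = 10\log_{10}\left(P_{\mathrm{Signal}}\right) \tag{3}$$

$$P_{\mathrm{Noise}}^{\mathrm{dB}} = P_{\mathrm{Signal}}^{\mathrm{dB}} - \mathrm{SNR}_{\mathrm{Target}}^{\mathrm{dB}}$$

where $m$ denotes the signal length, and $P_{\mathrm{Signal}}^{\mathrm{dB}}$ is the average signal power in dB in Equation (3). $P_{\mathrm{Noise}}^{\mathrm{dB}}$ and $\mathrm{SNR}_{\mathrm{Target}}^{\mathrm{dB}}$ are the noise power and the desired SNR in dB, respectively. The noise power can be calculated as follows:

$$P_{\mathrm{Noise}} = 10^{P_{\mathrm{Noise}}^{\mathrm{dB}}/10}$$

where $P_{\mathrm{Noise}}$ is the noise power in watts. To produce the noise signal, zero is chosen as the noise mean and $\sqrt{P_{\mathrm{Noise}}}$ as the standard deviation, and the noisy signal is generated using the following equation:

$$X_i^{\mathrm{Noisy}} = x_i + \eta\left(\mu_{\mathrm{Noise}}, \rho_{\mathrm{Noise}}\right)$$

where $X_i^{\mathrm{Noisy}}$ is the generated noisy signal, $\eta$ represents the noise signal, and $\mu_{\mathrm{Noise}}$ and $\rho_{\mathrm{Noise}}$ are the noise mean and standard deviation, respectively.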
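As a rough illustration of this procedure, the following Python sketch adds zero-mean Gaussian noise at a requested SNR using only the standard library. The function names are hypothetical, the input is treated as noise-free (the CardRF captures are in fact recorded at 30 dB SNR), and the standard deviation is taken as the square root of the noise power, which is the usual convention for zero-mean noise.

```python
import math
import random


def add_awgn(signal, snr_target_db, rng=None):
    """Return a copy of `signal` with AWGN added at the requested SNR in dB."""
    rng = rng or random.Random()
    m = len(signal)
    p_signal = sum(x * x for x in signal) / m      # average signal power
    if p_signal == 0.0:
        raise ValueError("signal power is zero; SNR is undefined")
    p_signal_db = 10.0 * math.log10(p_signal)      # signal power in dB
    p_noise_db = p_signal_db - snr_target_db       # required noise power in dB
    p_noise = 10.0 ** (p_noise_db / 10.0)          # noise power in linear units
    sigma = math.sqrt(p_noise)                     # standard deviation of the noise
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB, useful for checking the generated signal."""
    p_signal = sum(c * c for c in clean) / len(clean)
    p_noise = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(p_signal / p_noise)


# Toy usage: a sine wave stands in for a normalized RF segment.
clean = [math.sin(0.05 * i) for i in range(20000)]
noisy = add_awgn(clean, snr_target_db=10.0, rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps the generated noise reproducible across runs.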

Figure 3 shows the signal at different noise levels. Figure 3a-c show the signal at 30 dB, 25 dB, and 20 dB, respectively; the difference in the RF signal is minimal at these SNRs. However, the quality of the signal degrades as the SNR decreases, which can be seen in Figure 3e,f.

Model Description
Figure 4a shows the complete architecture of the model, which can be divided into three major sections. The first section is the initial feature extraction block. After the input layer, the one-dimensional data are reshaped and fed into a convolutional layer followed by a rectified linear unit (ReLU) activation function, which is linear for all positive values and zero for all negative values. ReLU is computationally inexpensive, which results in less training and inference time. Moreover, it converges faster than other activation functions, such as Tanh. The ReLU function can be written as follows:

$$f(x) = \max(0, x)$$

Next, the maxpooling layer is used to extract the most prominent features and to reduce the size of the feature map before the multiscale architecture.
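The two element-wise operations in this block are simple enough to show directly. The sketch below is a single-channel, pure-Python illustration; the pool size of 2 and the non-overlapping windows are assumptions, since the text does not state the pooling configuration.

```python
def relu(values):
    """Rectified linear unit: keep positive values, replace negatives with zero."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, pool=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + pool]) for i in range(0, len(values) - pool + 1, pool)]


# Toy usage: activations from a hypothetical convolutional layer.
activations = [-0.3, 0.8, 0.1, -0.5, 0.4, 0.9]
pooled = max_pool1d(relu(activations))  # half the length, strongest responses kept
```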
The second section is the multiscale feature extraction block. This section consists of both sequential and parallel layers to extract features of different spatial domains. In our network, we use an architecture with two branches for feature extraction. The two branches are identical except for the size of their kernels; different kernel sizes have been used for experimental purposes. Each branch contains four convolutional blocks (conv blocks) with different numbers of convolutional filters. The first two parallel blocks consist of one convolutional layer followed by a ReLU function and another convolutional layer, which is described as conv block 1 in Figure 4b. The layers consist of 64 convolutional filters. The residual sum for this block is

$$y = f(x_i) + x_i \tag{8}$$

where $x_i$ is the output of the maxpooling layer and $f(x_i)$ is the output of conv block 1.
The outputs of the conv block and the maxpooling layer are added, as shown in Equation (8), and passed through a ReLU layer; the result is the input of the second conv block with 128 filters, which is an instance of conv block 2. The second conv block has the architecture shown in Figure 4c. It differs from the previous block in that the output of the second conv layer is passed through a dense layer of 64 units with ReLU activation to keep the number of outputs the same as in the previous block.
The third and final section of the model, which is called the terminal block, contains flatten and softmax layers. The outputs of both branches are concatenated and flattened. For the detection task, three classes are used; for the identification task, ten classes are used for specific device identification and eight for device manufacturer identification. The same architecture is used for the identification and detection tasks except for the softmax layer. Softmax maps the outputs to values between zero and one and provides a probability distribution over all the classes. The softmax function can be defined as follows:

$$\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}$$

where $z_i$ is the flattened output of the previous stage and $k$ is the number of classes. The selection of the number of neurons and layers utilized in this article was based on extensive
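To make the data flow concrete, here is a deliberately simplified, single-channel sketch of one residual conv block per branch, the two-branch concatenation, and a softmax. It is not the authors' network: the real model uses many learned filters, four blocks per branch, and a trained classifier head. The averaging kernels are placeholder weights, and the kernel sizes 5 and 7 are just one of the pairs discussed in the text.

```python
import math


def conv1d_same(values, kernel):
    """Zero-padded 1D correlation with an odd-length kernel; output length equals input length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(values) + [0.0] * half
    return [sum(w * padded[i + j] for j, w in enumerate(kernel))
            for i in range(len(values))]


def residual_block(x, kernel_a, kernel_b):
    """conv -> ReLU -> conv, then add the block input and apply ReLU."""
    hidden = [max(0.0, v) for v in conv1d_same(x, kernel_a)]
    fx = conv1d_same(hidden, kernel_b)
    return [max(0.0, f + xi) for f, xi in zip(fx, x)]


def multiscale_features(x, kernel_sizes=(5, 7)):
    """Run one residual branch per kernel size and concatenate the outputs."""
    features = []
    for size in kernel_sizes:
        averaging = [1.0 / size] * size  # placeholder weights, not learned
        features.extend(residual_block(x, averaging, averaging))
    return features


def softmax(logits):
    """Softmax with the maximum subtracted first for numerical stability."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy usage: eight pooled samples in, sixteen concatenated features out.
pooled = [0.1, 0.4, 0.9, 0.3, 0.0, 0.7, 0.2, 0.5]
features = multiscale_features(pooled)
class_probs = softmax([2.0, 1.0, 0.1])  # three hypothetical detection logits
```

Subtracting the maximum logit before exponentiating leaves the softmax result unchanged mathematically and only guards against overflow.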

Experimental Results
In this section, the implementation details, performance metrics, and model performance are described. Finally, the proposed model is compared with existing work to analyze the effectiveness of the proposed system and to demonstrate its advantage over other existing works.

Implementation Details and Performance Metrics
From the normalized RF signals, 85% of each category is selected for training, and the remaining 15% is kept for testing, for both the detection and identification tasks. For the detection task, the training set contains 51,765 samples and the testing set contains 9135 samples, including all three categories. The classifier models are trained on the training data and optimized using an optimizer. Finally, the performance is evaluated on the testing data (see Figure 1). For the identification task, three classes (iPhone 7, iPad 3, and E5 Cruise) are excluded to allow a comparison of our work with [6]. For the specific device identification task, the training set contains 43,732 samples and the testing set contains 7718 samples. The training and testing procedures were conducted in an Anaconda Python 3.7 environment on a system featuring a 12th generation Intel Core i7 CPU with a base clock speed of 2.10 GHz, 16 GB of RAM, and a single Nvidia GeForce RTX 3050 GPU with 8 GB of dedicated GPU memory. All the hyperparameters used for training the proposed model are shown in Table 3. The performance of the proposed model is evaluated by varying the noise level while keeping the hyperparameters identical. For the cost function, categorical cross-entropy is used for both the detection and identification tasks, which are multi-class classification problems. To minimize the loss function, an adaptive moment estimation (Adam) optimizer is used. The benefit of Adam is that it adapts the learning rate individually for each parameter. Both the detection and identification models were trained for 120 epochs. To evaluate the performance of our model, we computed the accuracy (ACC), precision (PR), sensitivity (SE), and $F_1$-score ($F_1$) as evaluation metrics. PR is the ability of the classifier to avoid labeling instances as positive if they are truly negative. SE, on the other hand, is the ability of the classifier to identify the positive instances. $F_1$ is the harmonic mean of PR and SE. For the $i$th class, these are defined as follows:

$$\mathrm{ACC} = \frac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i}$$

$$\mathrm{PR} = \frac{TP_i}{TP_i + FP_i}$$

$$\mathrm{SE} = \frac{TP_i}{TP_i + FN_i}$$

$$F_1 = \frac{2 \times \mathrm{PR} \times \mathrm{SE}}{\mathrm{PR} + \mathrm{SE}}$$

where $TP_i$, $TN_i$, $FP_i$, and $FN_i$ are the true-positive, true-negative, false-positive, and false-negative counts, respectively, of the $i$th class. True-positive and true-negative stand for the number of samples of the $i$th class predicted correctly and the number of samples of other classes that are not predicted as the $i$th class, respectively. False-positive and false-negative refer to the number of samples of other classes that are predicted as the $i$th class and the number of samples of the $i$th class classified as other classes, respectively.
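The per-class counts and metrics defined above can be read off a confusion matrix, as the following sketch shows. The function name and the confusion matrix values are made up for illustration and are not results from the paper; rows are assumed to be true classes and columns predicted classes.

```python
def per_class_metrics(confusion, cls):
    """Compute ACC, PR, SE, and F1 for class `cls`.

    `confusion[t][p]` is the number of samples of true class t predicted as p.
    """
    n = len(confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp                        # class `cls` predicted as others
    fp = sum(confusion[t][cls] for t in range(n)) - tp   # others predicted as `cls`
    tn = sum(map(sum, confusion)) - tp - fn - fp         # everything else
    pr = tp / (tp + fp) if tp + fp else 0.0
    se = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pr * se / (pr + se) if pr + se else 0.0
    acc = (tp + tn) / (tp + tn + fp + fn)
    return {"ACC": acc, "PR": pr, "SE": se, "F1": f1}


# Hypothetical 3-class confusion matrix (e.g., UAS, Bluetooth, WIFI); counts are invented.
toy_confusion = [
    [50, 2, 3],
    [4, 40, 1],
    [5, 0, 45],
]
uas_metrics = per_class_metrics(toy_confusion, 0)
```

Averaging the per-class values over all classes would give macro-averaged scores; whether the paper reports macro or weighted averages is not stated in this section.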

Performance Analysis
In this section, the performance of the proposed model is analyzed across different noise levels for both the detection and identification tasks. Figure 5a,b depict the training and testing accuracy curves over 120 epochs for the detection and specific device identification tasks, respectively. The figures show that the models do not have overfitting issues and that both models converge rapidly. Training was stopped even though the training accuracy was still improving, because there was no noticeable improvement on the testing data. The overall training and testing accuracies of the proposed model are 98.7% and 97.53%, respectively, for the detection task, and the model has an accuracy of 76.42% for the identification task.

We varied the kernel sizes of the convolutional layers to find the best-performing hyperparameters. Table 4 compares the performance of the proposed model for different kernel sizes. The table shows that the accuracy differs only slightly at higher SNR values, but the differences become more visible as the noise increases. For the detection task, the model shows an accuracy of 98.63% with kernel sizes of 3 and 7, which is only 0.01% lower than with kernel sizes of 5 and 7. However, at 0 dB SNR, the model achieves an accuracy of 93.81% with kernel sizes of 5 and 7, which is 0.93% and 0.95% higher than with kernel sizes of 3 and 7 and of 3 and 5, respectively. The overall accuracy of the model is also higher with kernel sizes of 5 and 7. The same trend can be seen for the identification task. For 0 dB SNR, the accuracy of the model is 88% and 1.81% higher with kernel sizes of 5 and 7 than with kernel sizes of 3 and 7 and of 3 and 5, respectively.
The model yields better results with larger kernel sizes because they reduce false positives and improve accuracy [24]. Moreover, larger kernels capture more spatial information and extract more relevant features from the noisy signals. Table 5 shows the overall performance of the model for the detection task in terms of the four evaluation metrics, computed from $TP_i$, $TN_i$, $FP_i$, and $FN_i$. From the SE values, it can be seen that the model identifies 97.53% of UAS signals correctly. The model demonstrates a PR of 98.06%. This high precision means the model has a very high rate of $TP_i$ for UAS signals. The model also shows a high SE for UAS signals, and the high PR and SE lead to a high $F_1$ score as well. The PR, SE, and $F_1$ are similar for the UAS and Bluetooth classes, which indicates that the model classifies these two classes almost perfectly. The lower PR, SE, and $F_1$ for WIFI can be explained by the smaller number of training samples for that class.
Figures 6 and 7 show the confusion matrices of the proposed model for the detection task and the specific device identification task, respectively. The test accuracy for the detection task is 98.64%, 98.63%, 98.62%, 98.45%, 97.59%, 95.96%, and 93.81% for 30 dB, 25 dB, 20 dB, 15 dB, 10 dB, 5 dB, and 0 dB, respectively. For the identification task, the model maintains an accuracy of more than 80% for SNRs of 20 dB and above, but the accuracy drops as the noise level increases; at 10 dB SNR, the accuracy of the model is 76.16%. The performance of the models was also evaluated on a set of unseen data from different unknown noise levels. For the detection task, the accuracy was 95.89%. The confusion matrices for the unseen noise levels for the detection and identification tasks are shown in Figures 6h and 7h.
RF signal detection has a higher accuracy than specific device identification, as the model more often misclassifies UAS signals from devices made by the same manufacturer. This can be confirmed from the confusion matrices, in which all DJI UAS signals are clustered in one area.
The comparison of the model performance with [6] in terms of accuracy for both tasks is shown in Figure 8. For the detection task, the performance of the proposed model is close to that of the SqueezeNet architecture used in [6] from 30 dB to 10 dB SNR, but as the noise level increases, the performance of the SqueezeNet model decreases rapidly, as can be seen in Figure 8a. Below 10 dB SNR, the accuracy of the SqueezeNet model is lower than 90%, whereas the proposed model maintains an accuracy of over 93% for all the noise levels discussed in this work. The superior performance of the proposed model can be attributed to the multiscale architecture: the model extracts features at multiple scales, which helps it identify more prominent features in the noisy data. This shows that the proposed model is more reliable than the SqueezeNet architecture. Figure 8b shows the comparison of the models for the identification task. The proposed model not only outperforms SqueezeNet but also has a more stable and reliable performance than the methods proposed in ref. [6] for all the noise levels from 0 dB to 30 dB. Table 6 compares the average PR, SE, and $F_1$ of the proposed model with existing work for RF signals of 30 dB SNR. The table shows that the proposed model outperforms the existing work not only in accuracy but also in the other metrics. For the detection task, the proposed model exhibits a 0.4% and 0.6% higher SE than SqueezeNet with WST and CWT, respectively, which means the proposed model finds and correctly classifies more of the instances with fewer $FN_i$. As $F_1$ depends on PR and SE, the model also demonstrates a higher $F_1$. In the identification task, the model exhibits a 7.55% and 6.25% improvement in precision and a 6.97% and 7.67% improvement in sensitivity compared with SqueezeNet with WST and CWT, respectively.
The comparison of accuracies in Figure 8 and the other performance metrics in Table 6 demonstrates the superiority of the proposed model in terms of performance.
To address the higher misclassification among devices from the same manufacturer observed in Figure 9, the identification model is further modified to classify devices by manufacturer. The four DJI drones are grouped into a single class, as are the Bluetooth devices from Apple. Performance improves considerably when the model identifies the signature of the device maker. The overall training and testing accuracies for device manufacturer identification are 90.52% and 84.43%, respectively. For signals at unseen SNRs, the accuracy is 84.1%; for SNRs from 30 dB down to 15 dB, it is above 85%, and at 0 dB it is 71%. The confusion matrices in Figure 9 show the per-class performance of the model, illustrating its ability to classify devices from different manufacturers across noise levels. Figure 9h shows that the proposed model accurately identifies most of the devices at the unseen noise levels.
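The effect of grouping devices by manufacturer can be illustrated by collapsing a device-level confusion matrix into a manufacturer-level one. This is only a sketch of the idea: the paper retrains the identification model on manufacturer labels, whereas the code below merely re-labels existing predictions. The device names, the counts, and the helper `collapse_confusion` are hypothetical.

```python
from typing import Dict, List, Tuple


def accuracy(cm: List[List[int]]) -> float:
    """Fraction of samples on the diagonal of a confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def collapse_confusion(cm: List[List[int]],
                       labels: List[str],
                       group_of: Dict[str, str]) -> Tuple[List[str], List[List[int]]]:
    """Merge rows and columns of cm whose labels map to the same group."""
    groups = sorted(set(group_of[name] for name in labels))
    index = {g: k for k, g in enumerate(groups)}
    out = [[0] * len(groups) for _ in groups]
    for i, true_name in enumerate(labels):
        for j, pred_name in enumerate(labels):
            out[index[group_of[true_name]]][index[group_of[pred_name]]] += cm[i][j]
    return groups, out


# Hypothetical device labels and made-up counts (not data from the paper).
labels = ["dji_a", "dji_b", "apple_bt_a", "apple_bt_b"]
group_of = {"dji_a": "DJI", "dji_b": "DJI",
            "apple_bt_a": "Apple", "apple_bt_b": "Apple"}
cm = [
    [70, 25, 3, 2],
    [20, 75, 2, 3],
    [1, 2, 80, 17],
    [2, 1, 22, 75],
]

groups, maker_cm = collapse_confusion(cm, labels, group_of)
print("device-level accuracy:      ", accuracy(cm))
print("manufacturer-level accuracy:", accuracy(maker_cm))
```

In this toy matrix most errors occur between devices of the same maker, so they move onto the diagonal once the classes are merged, which mirrors the kind of improvement reported for manufacturer identification.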
Table 7 shows the inference time and the number of parameters of the proposed system compared with the previous work. SqueezeNet requires 180 milliseconds (ms) with CWT and 151 ms with WST. This higher inference time is due to the manual feature-extraction techniques, which are computationally expensive. In contrast, the proposed DL-based method, despite having more parameters, has an inference time of 0.379 ms for the detection task. For the specific device identification task, its inference time is 0.343 ms, which is also significantly lower than that reported in [6]. The improvement arises because the proposed model requires no manual feature extraction; the multiscale feature extraction built into the network is sufficient to extract features from the noisy RF signal.
Table 7 therefore indicates that the proposed model reduces inference time by eliminating the separate feature-extraction step, which is advantageous for real-time applications.
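The text does not describe how the inference times in Table 7 were measured. The following is a minimal sketch of one common way to obtain a per-sample latency in milliseconds, so that an end-to-end pipeline and a pipeline with a separate feature-extraction step can be compared on equal terms. The helper `mean_latency_ms` and both placeholder pipelines are hypothetical; a real measurement would call the trained networks and the actual CWT or WST transform.

```python
import time
from statistics import mean
from typing import Any, Callable, Sequence


def mean_latency_ms(predict: Callable[[Any], Any],
                    samples: Sequence[Any],
                    warmup: int = 5,
                    repeats: int = 20) -> float:
    """Average wall-clock time of one predict(sample) call, in milliseconds."""
    # Warm-up calls are excluded so one-off setup costs do not skew the mean.
    for x in samples[:warmup]:
        predict(x)
    timings = []
    for _ in range(repeats):
        for x in samples:
            start = time.perf_counter()
            predict(x)
            timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings)


# Hypothetical stand-ins for the two kinds of pipeline being compared.
def end_to_end(signal):
    return sum(signal) > 0          # placeholder for model(raw_signal)


def with_feature_step(signal):
    features = sorted(signal)       # placeholder for a costly transform
    return sum(features) > 0        # placeholder for model(features)


signals = [[float(i % 7 - 3) for i in range(1024)] for _ in range(8)]
print(f"end-to-end:        {mean_latency_ms(end_to_end, signals):.4f} ms")
print(f"with feature step: {mean_latency_ms(with_feature_step, signals):.4f} ms")
```

Timing the whole call, including any transform applied before the network, matters here because the feature-extraction step is where the baseline spends most of its time according to the discussion above.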

Conclusions
In this article, we have utilized an end-to-end deep learning architecture for detecting and identifying UAV signals based on their RF signature. Both UAV and UAV controller signals are considered by the classifier. Communication between the UAV and the flight controller takes place in the 2.4 GHz frequency band. Because Bluetooth and WIFI devices operate in the same band, their signals are considered as well. The model is trained on signals at different noise levels and can also classify signals at unseen SNRs, which makes it more effective in practice. It does not require any manual feature-extraction technique, which makes it computationally efficient: the raw RF signals are normalized and fed directly into the network. The model is trained with data from 0 dB to 30 dB SNR, and its average accuracy for detection is 97.53%. The network is further evaluated on data from unseen noise levels, where the overall accuracy for the detection task is above 94%. For specific device identification, the overall accuracy is above 76%; it is limited by the higher misclassification rate among devices from the same manufacturer. The classification accuracy improves considerably when devices from the same manufacturer are clustered together, with the model yielding an average accuracy of 84% when classifying the RF signature of the manufacturers. Finally, we have compared our work with the existing framework and found that the performance of our model, despite having no feature-extraction step, is more stable across different SNRs.
The proposed model has the potential to benefit surveillance systems by detecting and identifying UAV signals in real-time scenarios. Because it eliminates manual feature extraction, it is better suited to deployment on edge devices. Its scope of application may also extend beyond surveillance: similar architectures can be used for image segmentation, feature extraction [25], and video analysis [26] in industries such as health care. In future work, we plan to apply the model to a broader range of applications to assess its versatility.