Distinguishing Malicious Drones Using Vision Transformer

Abstract: Drones are commonly used in numerous applications, such as surveillance, navigation, pesticide spraying in autonomous agricultural systems, and various military services, owing to their variable sizes and workloads. However, malicious drones that carry harmful objects are often used to intrude into restricted areas and attack critical public places. Thus, the timely detection of malicious drones can prevent potential harm. This article proposes a vision transformer (ViT) based framework to distinguish between drones and malicious drones. In the proposed ViT-based model, drone images are split into fixed-size patches; then, linear embeddings and position embeddings are applied, and the resulting sequence of vectors is fed to a standard ViT encoder. During classification, an additional learnable classification token associated with the sequence is used. The proposed framework is compared with several handcrafted descriptors and deep convolutional neural networks (D-CNNs); the comparison reveals that the proposed model achieves an accuracy of 98.3%, outperforming the handcrafted and D-CNN models. Additionally, the superiority of the proposed model is illustrated by comparing it with existing state-of-the-art drone-detection methods.


Introduction
With recent development in remote sensing technology, drones play an essential role in developing smart cities and innovative industries due to their numerous applications, including automated irrigation, spraying pesticides and fertilizers in agriculture [1], water management [2], food services [3], UAS-based image velocimetry [4], flying base stations [5], etc. Drones of variable sizes and different shapes have been deployed in the military for navigation and surveillance purposes [6].
In spite of their numerous useful applications, drones are often used for spying and carrying dangerous loads. Such drones are termed malicious drones; they enter restricted no-fly zones and avoid radar detection because of their low-altitude flight paths. The schematic in Figure 1a,b shows the normal use cases of drones, and Figure 1c depicts the intrusion of malicious drones into restricted zones.
Hence, it is critical to develop an autonomous system that can efficiently detect the intrusion of malicious drones to avoid any potential damage. In that regard, machine learning (ML) and computer vision (CV) allow us to develop automated systems that can detect malicious drones. The existing techniques in the literature usually rely on audio, images, video, and radio frequency signals to detect drones. In [7], the authors proposed a deep learning (DL)-based hybrid framework integrating audio and visual data for detecting malicious drones, which achieved an accuracy of 98.5% on the combined audio and visual dataset. However, the model was limited to drone detection and was unable to differentiate between drones with and without loads. Along similar lines, in [8], the authors proposed a model based on mel frequency cepstral coefficients (MFCC) and an SVM for detecting malicious drones; however, the performance of the model deteriorates when detecting amateur drones in adverse weather conditions and noisy environments. Moreover, in [9], the authors proposed a handcrafted feature extraction-based technique to detect drones using audio and images. The method achieved 81% accuracy, but its performance deteriorates when detecting drones in adverse weather conditions. In [10], Dumitrescu et al. designed a DL-based system for drone detection by employing acoustic signals. However, the authors did not consider malicious drones as a separate class, and the article only addressed drone detection. In [11], Digulescu et al. investigated a radio frequency signal-based advanced signal processing model to detect the movement of drones. The model performed relatively well in a controlled environment. In [12], Singha et al. proposed a YOLOv4-based model for detecting drones. The model achieved a mean average precision (mAP) of 74.36%. Furthermore, in [13], the authors proposed DL-based detection and identification of drones using audio signals. The technique achieved a highest accuracy of 85.26%; however, the model showed limited performance in adverse weather conditions. In [14], the authors distinguished drones from birds using a laser. The framework detected drones with a mass of less than five kilograms. However, the technique was not used to detect drones with loads. In a subsequent study [15], the authors proposed a YOLOv3-based model for detecting drones and birds. The performance of the model varies with the shape and the visibility of the drones. In [16], Swinney et al. analyzed the impact of real-world interference on the classification of drones, using CNNs and radio frequency signals.
From the aforementioned discussion, it is clear that although several existing DL models can classify and detect drones based on acoustic, radio frequency, and visual signals, they may not be useful in the challenging scenario of distinguishing between several subject classes, such as drones, malicious drones, birds, airplanes, and helicopters. Furthermore, none of the existing ML and DL models address the issue of detecting drones with loads.

Proposed Methodology
Drones have different visual characteristics, such as color, shape, load, and size. Thus, images are useful for distinguishing malicious drones from the other classes, such as drones without loads, aeroplanes, helicopters, and birds. The images are fed into the handcrafted descriptors, the D-CNNs, and the ViT classifier. The handcrafted descriptors and D-CNNs extract features that are then used to train ML classifiers. The schematic in Figure 2 shows the flow diagram of the framework.

Handcrafted Descriptors
The images are resized to 224 × 224, and features are then extracted using HOG, LETRIST, LBP, GLCM, NRLBP, CJLBP, and LTrP. The extracted features are stored in feature vectors, which are used to train ML classifiers.
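As an illustration of how a handcrafted descriptor turns an image into a feature vector, the following is a minimal sketch of a basic 3 × 3 local binary pattern (LBP) histogram. It is not the implementation used in this study, whose LBP variant and parameters are not specified; the function name `lbp_histogram` and the synthetic input image are assumptions made for the example.

```python
import numpy as np

def lbp_histogram(gray):
    """Return the normalized 256-bin histogram of basic 3x3 LBP codes."""
    gray = np.asarray(gray, dtype=np.float64)
    center = gray[1:-1, 1:-1]                 # pixels that have all 8 neighbours
    h, w = center.shape
    # Offsets of the eight neighbours, clockwise from the top-left pixel.
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    codes = np.zeros((h, w), dtype=np.int64)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[dy:dy + h, dx:dx + w]
        # Set this bit when the neighbour is at least as bright as the center.
        codes += (neighbour >= center).astype(np.int64) << bit
    hist = np.bincount(codes.ravel(), minlength=256).astype(np.float64)
    return hist / hist.sum()

# Synthetic stand-in for a grayscale image resized to 224 x 224 (assumption).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224))
features = lbp_histogram(image)
print(features.shape)
```

The resulting 256-dimensional vector plays the role of the feature vector that is passed to an ML classifier such as an SVM or kNN.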

D-CNN Models
The images are resized to the input size of each D-CNN, and features are then extracted using AlexNet, ShuffleNet, ResNet-50, SqueezeNet, MobileNet-v2, Inceptionv3, GoogleNet, EfficientNetb0, Inception-ResNet-v2, DarkNet-53, and Xception. The extracted features are saved in feature vectors, which are used to train ML classifiers.

ViT-Based Classification
Initially, the images are resized to 224 × 224 and then fed into the ViT. The ViT splits each image into a 14 × 14 grid of patches of 16 × 16 pixels each and maps every patch to a patch embedding vector. Learnable position embedding vectors are then added to these patch embeddings. The embedded vectors are fed into the transformer encoder (TE) proposed in [40]. In the TE, the embedded vectors are expanded by a fully connected (fc) layer and divided into a query (a), key (b), and value (c). Then, a, b, and c are further divided and fed to the parallel attention heads (AHs). The outputs of the AHs are concatenated to form vectors whose shape is the same as that of the encoder input. These vectors go through an fc layer, a layer normalization, and a multi-layer perceptron (MLP) block with two fc layers. The TE thus encodes the embedding vectors and outputs vectors of the same size. The output of the TE is fed into the MLP head to make the final classification. The complete schematic diagram of the ViT is shown in Figure 3.
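The patch splitting, embedding, and multi-head attention steps described above can be sketched with NumPy as follows. This is only a shape-level illustration with random weights, not the trained model of this study: the embedding size `DIM`, the number of heads `HEADS`, and all function and variable names are assumptions, and layer normalization, the MLP block, and residual connections are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
IMG, PATCH = 224, 16          # image and patch size from the text
DIM, HEADS = 64, 4            # assumed small values, chosen only for the sketch

def patchify(image, patch=PATCH):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)            # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(tokens, w_qkv, w_out, heads=HEADS):
    n, d = tokens.shape
    # One fc layer expands the tokens, which are then split into
    # query, key, and value (a, b, and c in the text).
    q, k, v = np.split(tokens @ w_qkv, 3, axis=-1)

    def to_heads(t):                           # (n, d) -> (heads, n, d // heads)
        return t.reshape(n, heads, d // heads).transpose(1, 0, 2)

    q, k, v = to_heads(q), to_heads(k), to_heads(v)
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d // heads))
    # Concatenate the head outputs back to the encoder input shape.
    merged = (weights @ v).transpose(1, 0, 2).reshape(n, d)
    return merged @ w_out

image = rng.random((IMG, IMG, 3))              # synthetic stand-in image
patches = patchify(image)                      # 14 * 14 = 196 patches
w_embed = rng.normal(0.0, 0.02, (patches.shape[1], DIM))
cls_token = np.zeros((1, DIM))                 # learnable in a real model
pos_embed = rng.normal(0.0, 0.02, (patches.shape[0] + 1, DIM))
tokens = np.concatenate([cls_token, patches @ w_embed]) + pos_embed

w_qkv = rng.normal(0.0, 0.02, (DIM, 3 * DIM))
w_out = rng.normal(0.0, 0.02, (DIM, DIM))
encoded = multi_head_self_attention(tokens, w_qkv, w_out)
print(patches.shape, tokens.shape, encoded.shape)
```

The encoder output has the same shape as its input, and in a full model the row corresponding to the classification token would be passed to the MLP head.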

Dataset
In the present study, a customized dataset consisting of five different classes (i.e., aeroplanes, birds, drones, helicopters, and malicious drones) is utilized. The dataset is challenging due to the presence of occluded images, night images, images with low object visibility, and images taken in adverse weather conditions. The dataset has a total of 776 images. The aeroplane and bird classes have 105 images each. Similarly, the drone, helicopter, and malicious drone classes have 200, 167, and 199 images, respectively. All the images are resized to 224 × 224. The dataset is publicly available on Kaggle, and the link can be found in the data availability section. The dataset is divided into a training set containing 70% of the images and a test set containing the remaining 30%. Some typical images from the dataset are shown in Figure 4.
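The class counts and the 70/30 split can be checked with a few lines of Python. This is a sketch only: the text does not state how the split was drawn, so the random shuffle, the seed, the rounding rule, and the name `split_indices` are all assumptions, and the resulting set sizes may differ slightly from those used in the experiments.

```python
import random

# Per-class image counts as reported in the text.
CLASS_COUNTS = {
    "aeroplane": 105,
    "bird": 105,
    "drone": 200,
    "helicopter": 167,
    "malicious drone": 199,
}

def split_indices(n_images, train_fraction=0.7, seed=0):
    """Shuffle image indices and split them into train and test lists."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)
    n_train = round(n_images * train_fraction)   # rounding rule is an assumption
    return indices[:n_train], indices[n_train:]

total = sum(CLASS_COUNTS.values())
train_idx, test_idx = split_indices(total)
print(total, len(train_idx), len(test_idx))
```

A stratified split, which preserves the class proportions in both sets, would be a natural alternative for an imbalanced dataset of this size.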

Results
In order to evaluate the performance of the proposed classifier, several performance metrics, including accuracy, sensitivity, precision, and $F_1$-score, are considered. The accuracy of the classifier is obtained as

$$\text{Accuracy} = \frac{t_p + t_n}{t_p + t_n + f_p + f_n}, \tag{1}$$

where $t_n$ and $t_p$ denote true negatives and true positives, respectively, while $f_n$ and $f_p$ represent false negatives and false positives, respectively. The accuracy of the classifier indicates its ability to distinguish the malicious drone class correctly. Sensitivity ($s_e$) is the proportion of actual positives that are correctly predicted as positives and is determined as

$$s_e = \frac{t_p}{t_p + f_n}. \tag{2}$$

Precision ($s_p$) is the proportion of predicted positives that are actually positive and is calculated as follows:

$$s_p = \frac{t_p}{t_p + f_p}. \tag{3}$$

From the definitions of $s_e$ and $s_p$ in Equations (2) and (3), the $F_1$-score can be obtained as

$$F_1 = \frac{2\, s_e\, s_p}{s_e + s_p}. \tag{4}$$

Additionally, Cohen's kappa ($\kappa$) is considered to further evaluate the performance of the proposed model, which can be calculated as

$$\kappa = \frac{p_o - p_e}{1 - p_e}, \tag{5}$$

where $p_o$ is the observed agreement between the predicted and true labels and $p_e$ is the agreement expected by chance.

The experiments are conducted on a local system with 12 GB of RAM and a Tesla T4 GPU. The model complexity and hyperparameters of the model are shown in Table 1. The classification results show that the proposed ViT classifier achieves an overall accuracy of 98.28%. The accuracy values for the aeroplane, bird, and helicopter classes are all 100%, indicating excellent robustness of the model for these classes. However, the accuracy values for the drone and malicious drone classes drop slightly, to 96.8% each. The confusion matrix of the ViT classifier is shown in Figure 5.
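The evaluation metrics above can be computed from a multi-class confusion matrix in a one-vs-rest fashion. The following is a minimal sketch under stated assumptions: the function names `per_class_metrics` and `cohen_kappa` and the 3 × 3 confusion matrix are hypothetical and do not reproduce Figure 5. Precision and specificity are both returned because the two quantities are easily confused.

```python
def per_class_metrics(cm, k):
    """One-vs-rest metrics for class k; cm[i][j] counts true class i predicted as j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                          # class k predicted as another class
    fp = sum(cm[i][k] for i in range(n)) - tp     # other classes predicted as k
    tn = total - tp - fn - fp
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "precision": precision,
        "specificity": tn / (tn + fp),
        "f1": 2 * sensitivity * precision / (sensitivity + precision),
    }

def cohen_kappa(cm):
    """Cohen's kappa from observed (p_o) and chance (p_e) agreement."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    p_o = sum(cm[i][i] for i in range(n)) / total
    row_sums = [sum(cm[i]) for i in range(n)]
    col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]
    p_e = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical confusion matrix for three classes (not the paper's data).
toy_cm = [
    [10, 0, 0],
    [0, 8, 2],
    [0, 1, 9],
]
print(per_class_metrics(toy_cm, 1))
print(cohen_kappa(toy_cm))
```

Overall values can then be obtained by averaging the per-class results, although the averaging scheme used in this study is not stated.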
The ViT classifier achieves overall $s_e$, $s_p$, $F_1$-score, and $\kappa$ values of 99.00%, 99.00%, 99.00%, and 99.00%, respectively. The $s_e$, $s_p$, and $F_1$-score of the aeroplane, bird, and helicopter classes are all 100%. The $s_e$, $s_p$, and $F_1$-score for the drone and malicious drone classes are all 97.0%. Figure 6 shows the comparison bar chart of the classification metrics obtained from the ViT classifier for the different classes.
Table 2. Test accuracy of the handcrafted descriptors with different ML classifiers.
Analyzing Table 2, it is evident that the performance of the handcrafted descriptors is considerably lower than that of the ViT classifier, as their highest accuracy is 78.90%, obtained using HOG with the ensemble classifier. The accuracy of HOG with the SVM classifier is 76.70%, whereas with kNN, NB, and DT it is 37.90%, 57.30%, and 58.60%, respectively. Similarly, Table 3 shows the test accuracy of the AlexNet, ShuffleNet, ResNet-50, SqueezeNet, MobileNet-v2, Inceptionv3, GoogleNet, EfficientNetb0, Inception-ResNet-v2, DarkNet-53, and Xception models with different classifiers. All the D-CNNs are trained for up to 1000 epochs, and the best number of epochs is selected by monitoring the validation accuracy of the models with early stopping.
SVM and ensemble. SqueezeNet achieves its highest accuracy of 82.80% with the ensemble classifier. MobileNet-v2, Inceptionv3, GoogleNet, EfficientNetb0, and Inception-ResNet-v2 achieve 91.80%, 90.90%, 87.90%, 92.20%, and 91.80% accuracy with the SVM classifier, respectively. However, DarkNet-53 achieves its highest accuracy of 91.40% with the ensemble classifier. The proposed framework with the ViT classifier achieves an accuracy of 98.28%, which is an increase of 4.78 percentage points over Xception with SVM. The comparisons demonstrate that the proposed model outperforms the existing D-CNN models by achieving the highest classification accuracy.
We also visualized Grad-CAM heat maps to show which portion of each image contributes to the classification, using images with 10% to 90% background. The visual results are shown in Figure 7. From Figure 7, it can be observed that when the load is near the drone, it contributes to the classification even in the images with 90% background. However, when the load is tied with a string or is relatively far away from the drone body, only the drone contributes to the classification. From the performance comparison, it is evident that the proposed framework can be employed as a robust and efficient classification model for malicious drone detection. The current framework can be extended to image compression [41], classification, and other computer vision tasks, such as object detection [42][43][44], as well as motor imagery classification in the brain-computer interface (BCI) [45][46][47]. The work can further be extended to classify malicious drones using features selected with nature- and bio-inspired algorithms [48][49][50], such as particle swarm optimization (PSO), the genetic algorithm (GA), and the artificial bee colony (ABC) algorithm.

Figure 7. Heat map visualization of malicious drone images with 10% to 90% background using Grad-CAM.

Conclusions
Drones are widely used due to their numerous applications. However, malicious drones that carry harmful material can cause destruction, such as bomb blasts. Thus, it is critical to distinguish between malicious drones and other flying objects. In this article, several ML and DL techniques are analyzed, and the analysis reveals that the performance of the handcrafted descriptors with ML classifiers is relatively low. Furthermore, the performance of various D-CNNs with ML classifiers is also evaluated. Our study indicates that the highest accuracy achieved by the D-CNN models is 93.50%. In contrast, the overall classification accuracy of the ViT classifier is 98.3%, which is the highest among all models. The ViT classifier achieves overall recall, precision, $F_1$-score, and Cohen's kappa of 99.0%, 99.0%, 99.0%, and 99.0%, respectively. The precision, recall, $F_1$-score, and Cohen's kappa for the malicious drone class are 97.0%, 97.0%, 97.0%, and 97.0%, respectively. The current study illustrates that the proposed ViT-based approach can classify malicious drones more accurately than state-of-the-art D-CNN models. Training with a larger dataset can further enhance the performance of the ViT-based framework. Moreover, the current framework can be extended to various classification and computer vision tasks, such as object detection and motor imagery classification in the brain-computer interface.