Audio-Based Aircraft Detection System for Safe RPAS BVLOS Operations

Abstract: For the Remotely Piloted Aircraft Systems (RPAS) market to continue its current growth rate, cost-effective ‘Detect and Avoid’ systems that enable safe beyond visual line of sight (BVLOS) operations are critical. We propose an audio-based ‘Detect and Avoid’ system, composed of microphones and an embedded computer, which performs real-time inferences using a sound event detection (SED) deep learning model. Two state-of-the-art SED models, YAMNet and VGGish, are fine-tuned using our dataset of aircraft sounds and their performances are compared for a wide range of configurations. YAMNet, whose MobileNet architecture is designed for embedded applications, outperformed VGGish both in terms of aircraft detection and computational performance. YAMNet’s optimal configuration, with >70% true positive rate and precision, results from combining data augmentation and undersampling with the highest available inference frequency (i.e., 10 Hz). While our proposed ‘Detect and Avoid’ system already allows the detection of small aircraft from sound in real time, additional testing using multiple aircraft types is required. Finally, a larger training dataset, sensor fusion, or remote computations on cloud-based services could further improve system performance.


Motivation
In recent years, the market for Remotely Piloted Aircraft Systems (RPAS) has grown exponentially. Given the increasing number of operators, the market for professional RPAS is estimated to exceed EUR 10 billion in Europe by 2035 and to generate more than 100,000 jobs, according to SESAR's European Drones Outlook Study [1].
One of the key steps toward reaching this potential market is the transition from operations within visual line of sight to operations beyond visual line of sight (BVLOS), which offer much higher added value. In fact, SESAR estimates that approximately 50% of the professional market will focus on BVLOS operations in rural environments for applications such as linear infrastructure inspection and monitoring, precision agriculture, and surveillance. In these kinds of applications, aerial platforms fly at a Very Low Level (VLL), a term which implies flight below 150 m.

Related Work
The 'Detect and Avoid' concept has been the subject of research in recent years [3], but a standardized system for light RPAS has not yet been achieved. Current approaches involve large aerial vehicles which carry complex sensors and heavy computer systems, and which operate under Visual Flight Rules or Instrument Flight Rules [4]. However, the requirements for such approaches differ considerably from those of RPAS in VLL operations. Other solutions that help increase the pilot's ability to detect other aircraft are based on existing surveillance technologies such as Automatic Dependent Surveillance-Broadcast (ADS-B). ADS-B uses transponders to broadcast information to other airspace users, such as the aircraft's identifier, route, position, and speed. The main disadvantages are the security vulnerabilities and the need to mount an ADS-B transponder on the aircraft (which is not mandatory for general aviation or sport aircraft), making it not 100% reliable as a 'Detect and Avoid' system in VLL flights [5].
Although sensors for 'Detect and Avoid' systems are commonly placed on board the aerial platform (e.g., radar, visual cameras, ultrasonic, or laser), there are also systems which rely on ground-based infrastructure. These systems limit the operational area of the RPAS, but they have the advantage of reducing the aerial platform payload and the computational requirements on board and are especially suitable for VLL operations. Regarding ground-based technology, one development uses a Passive Secondary Surveillance Radar for aircraft surveillance [6], but its high cost limits its adoption in most use cases. Another ground-based approach proposes combining Kalman filtering and sensor fusion applied to video and acoustic vector sensor data for propeller aircraft detection and tracking [7]. Although this system achieves state-of-the-art detection accuracy (Q1 = 66%, Q2 = 77%, Q3 = 92%), it does so at distances < 300 m and the author states an upper detection limit of 1000 m. More recently, a drone detection system using LIDAR technology has been proposed, although its range is limited to a few hundred meters [8].
The current work is part of a larger project whose aim is to overcome the aforementioned limitations through the development of a ground-based 'Detect and Avoid' system. Its main goal is the early detection of manned aircraft within a two-kilometer radius, by combining images and audio acquired using low-cost sensors (cameras and microphones, respectively), advanced processing capabilities, and deep learning methods to preserve safety in BVLOS operations of RPAS.
Regarding the detection from images, the main challenges are the angle of view and the image resolution provided by low-cost cameras [9,10], since aircraft appear within just a few pixels for distances greater than two kilometers, as shown in Figure 1. Regarding the detection from audio, the two-kilometer range is also an important aspect. However, the main challenges are the real-time detection of aircraft sounds and the avoidance of false positives from surrounding and closer sounds, such as those produced by ground vehicles.
The task of determining the source of a sound is known as Sound Event Detection (SED). Although conventional machine learning algorithms have been used for SED in the past, current state-of-the-art approaches are based on deep learning models [11,12]. Over the last years, deep learning algorithms have consistently outperformed conventional machine learning ones in the annual Detection and Classification of Acoustic Scenes and Events (DCASE) challenge [11]. In fact, most DCASE 2020 competitors used deep learning models based on convolutional neural networks (CNN), recurrent neural networks (RNN), or a combination of both, with the top performer achieving an F-score of 50% for SED in domestic environments [13].
Hershey et al. [14] compared the performance of several CNN architectures using the AudioSet dataset [15], which contains 5.24 million hours of labeled sounds (extracted from YouTube videos) corresponding to 632 different sound classes. The compared models were AlexNet [16], VGG (configuration E) [17], Inceptionv3 [18], and ResNet-50 [19]. Although they were originally designed for image classification, their SED performance was similarly outstanding, with ResNet-50 achieving an Area Under the Curve (AUC) value of 0.926. Alternative approaches have achieved state-of-the-art performance using either RNNs [20], a combination of CNN and RNN architectures [21], or Transformers [22]. Consequently, SED using deep learning has a number of current and potential applications, including road surveillance [23], human activity monitoring [24], music genre recognition [25], smart wearables and hearables, health care, and autonomous navigation [12].
Both approaches within this project (detection via image and via audio) use supervised deep learning techniques to ensure the robustness of the aircraft detection system. Subsequently, sensor fusion techniques combining audio- and image-based predictions are applied with the aim of further increasing the robustness of the system to meet future requirements of 'Detect and Avoid' systems demanded by the authorities.
The aim of this paper is the development of an audio-based 'Detect and Avoid' system for small aircraft using CNN models trained on our dataset of aircraft sounds. Our contributions to the field of SED are the design and implementation of an embedded real-time aircraft detection system, and a comparison between state-of-the-art deep learning models for SED.
The remainder of the paper is organized as follows. In Section 2, we describe the proposed detection system, the dataset of aircraft sounds, and the SED models. In Section 3, we compare the performance and computational requirements of each model and discuss the implications of such results. Finally, in Section 4, we present the main conclusions of the study and make some recommendations for future research.

Aircraft Detection System
The aircraft detection system, whose task is to detect small aircraft within a two-kilometer radius, must run on a single self-contained, portable, embedded computer with a GPU that is powerful enough to run real-time inferences using deep learning models.
The system is composed of microphones, cameras, and an embedded computer, as shown in Figure 2a. The main sensors are two 20 megapixel RGB cameras (Basler acA5472-17uc with 25 mm Fujinon CF25ZA-1S lens) for image capture, and two directional microphones (Audio-Technica AT875R) connected to a two-channel audio card (Focusrite Scarlett 2i2) for audio capture. The addition of a servo motor allows a 360-degree rotation around the horizontal plane, which enables the cameras and microphones to cover the entire area around the 'Detect and Avoid' system. The embedded computer where both image and audio data are acquired and analyzed is an Nvidia Jetson TX2, and the real-time inferences are sent to a ground control station via a 4G connection provided by a USB network card. The current study focuses on the audio-based detection aspects of the system, which include recording audio through the microphones and performing real-time inferences within the embedded computer to determine the presence or absence of nearby aircraft. A detailed description of the Nvidia Jetson TX2 configuration and model implementations is provided in Appendix A.
Figure 2 shows a summary of the audio-based 'Detect and Avoid' processing pipeline. First, audio signals are captured in real time through the microphones (a). Then, the audio waveform is transformed into log-scaled Mel spectrograms (b), which are in turn fed into the pre-trained CNN model for feature extraction (c). Finally, a fully connected classifier determines if the sound belongs to an aircraft or not (d).

Dataset: Small Aircraft Sounds
The large amounts of data required to train deep learning models have motivated the creation of a number of datasets containing sounds from urban [27-29], domestic [30], industrial [31,32], and generic [15,33-35] environments. Nevertheless, sounds for our application are scarce and difficult to produce, given the particular focus of our 'Detect and Avoid' system: small aircraft flying at VLL and within two kilometers of the microphones.
In order to fine-tune existing SED models, we have curated a dataset of small aircraft sounds ('Small Aircraft' dataset) representing the 'aircraft' class. These recordings were initially collected from free online audio databases, directly accessible through their websites [36-38]. However, since there were not enough available external data to successfully train our models, a data acquisition campaign was performed at ATLAS (Air Traffic Laboratory for Advanced Unmanned Systems) [39]. ATLAS is a test flight center located in Villacarrillo (Jaén, Spain), which is ideally suited for the development of experimental flights with unmanned aerial vehicles. The data acquisition campaign, consisting of a flight plan simulating a realistic scenario according to the objectives of the 'Detect and Avoid' system, took place in September 2019. A SOCATA TB9 aircraft executed a circular trajectory around the airfield runway, flying at an altitude between 90 m and 230 m, and at a distance between 1000 m and 2250 m from the aircraft detection system. With excellent weather conditions (i.e., light wind, few clouds, and high illumination), the aircraft was able to complete 11 laps.
Given that the human ear can detect air variations between 20 Hz and 20,000 Hz [40], that most aircraft sounds are located in the range from 10 Hz to 250 Hz [41], and that the selected neural network is limited to frequencies lower than 7.5 kHz, labeling was performed by a human operator. The exclusion criteria were the lack of clear aircraft sounds, the presence of excessive noise, the presence of multiple sounds together with aircraft sounds, and the type of aircraft sound (i.e., jet or large aircraft sounds were excluded). After manual labeling, the combined duration of aircraft sounds from free online sources (used for training) and from ATLAS (used for testing) was 0.6 h.
Additionally, the 'UrbanSound8K' dataset [27], which contains more than 8000 urban sounds (<4 s) originally divided into 10 sound classes, has been used to represent the 'not aircraft' class, with a total duration of 8.75 h. We chose this dataset since it contains a number of sounds (i.e., 'drilling', 'engine idling', 'jackhammer', and 'air conditioner') which are likely to produce false aircraft detections in rural environments.
However, due to the large difference in total duration between both datasets, we face the problem of class imbalance, where the minority class (i.e., 'aircraft') contains a much lower number of samples (i.e., feature vectors) than the majority class (i.e., 'not aircraft'), as seen in Table 1. To overcome this problem, we follow four different strategies for the training data. Firstly, we undersample the 'not aircraft' class by randomly eliminating samples so that the size of both classes matches. We refer to the resulting dataset as the 'Undersampling' dataset. Secondly, we apply augmentation to the 'aircraft' class to match the size of the 'not aircraft' class. Our data augmentation strategies consist of randomly applying the following modifications to the audio waveform: time stretching or compressing; resampling; volume change; and addition of random noise with a uniform distribution [42]. The value ranges for each augmentation are presented in Table 2. This is called the 'Data augmentation' dataset. Thirdly, we apply data augmentation to both classes, approximately doubling the number of samples of the 'Data augmentation' dataset, and refer to this as the 'Data augmentation*2' dataset. Fourthly, we undersample the 'not aircraft' class by 50% and augment the 'aircraft' class so that the size of both classes matches and refer to this as the 'Hybrid' dataset.
There are a number of deep learning-based SED models in the literature, but their large number of parameters makes training them from scratch infeasible in most cases. Additionally, these models already perform the task of extracting the distinct audio features which allow us to differentiate between sounds. For these reasons, we decided to fine-tune state-of-the-art SED models using the 'Small Aircraft' and 'UrbanSound8K' datasets.
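As a rough illustration of the class-balancing strategies described in this section, the following minimal sketch treats each sample as a short list of waveform values. It is not the authors' implementation: the function names are hypothetical, the gain and noise ranges are placeholder assumptions (the actual ranges are those of Table 2), and time stretching and resampling are omitted for brevity.

```python
import random


def undersample(majority, target_size, rng):
    """Randomly keep `target_size` items of the majority class."""
    return rng.sample(majority, target_size)


def augment_waveform(samples, rng, max_gain_db=6.0, max_noise=0.005):
    """Return a copy with a random volume change and additive uniform noise.

    The default ranges are illustrative assumptions, not the values of Table 2.
    """
    gain = 10.0 ** (rng.uniform(-max_gain_db, max_gain_db) / 20.0)
    return [gain * s + rng.uniform(-max_noise, max_noise) for s in samples]


def oversample_with_augmentation(minority, target_size, rng):
    """Grow the minority class to `target_size` with augmented copies."""
    extra = [augment_waveform(rng.choice(minority), rng)
             for _ in range(target_size - len(minority))]
    return minority + extra


def hybrid(minority, majority, rng):
    """Halve the majority class, then augment the minority class to match it."""
    reduced = undersample(majority, len(majority) // 2, rng)
    return oversample_with_augmentation(minority, len(reduced), rng), reduced


rng = random.Random(0)
aircraft = [[0.0, 0.1, -0.1] for _ in range(3)]       # toy 'aircraft' clips
not_aircraft = [[0.0, 0.0, 0.0] for _ in range(20)]   # toy 'not aircraft' clips
balanced_aircraft, balanced_not_aircraft = hybrid(aircraft, not_aircraft, rng)
```

In this toy case both balanced classes end up with 10 clips each, which mirrors the 'Hybrid' idea of meeting halfway between pure undersampling and pure augmentation.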
The process of fine-tuning, which includes transforming the audio signal into images (Section 2.3.1), extracting audio features using pre-trained CNN models (Section 2.3.2), and classifying such features into 'aircraft' or 'not aircraft' (Section 2.3.3), is described next.

2.3.1. Audio Post-Processing: From Sound to Images
Figure 3 shows a detailed description of the feature extraction process using pre-trained CNN models. Firstly, the original audio waveform (a) is converted to a magnitude spectrogram with 257 frequency bins (b) using a short-time Fourier transform (STFT) with a 25 ms window size, a 10 ms window hop, and a periodic Hann window. These values, used in the official implementations of YAMNet [43] and VGGish [44], were chosen based on the work on large-scale audio classification by Hershey et al. [14]. The magnitude spectrogram is then converted to a Mel spectrogram with 64 Mel bins. The log-scaled Mel spectrogram (c) is calculated as the natural logarithm of the offset Mel spectrogram. This offset is added to avoid the calculation of the logarithm of 0. The log-scaled Mel spectrogram is framed into 90% overlapping image frames of 0.96 s with a 0.096 s frame hop (d).
These frames are then passed as individual inputs to the pre-trained CNN models (e). Finally, the CNN models produce a feature vector (f) for each corresponding image frame.
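The framing arithmetic behind these values can be checked with a short sketch. It assumes a 16 kHz sample rate, which the text does not state (it is consistent with the 257 frequency bins and the 7.5 kHz limit mentioned earlier), and it assumes that the 0.096 s frame hop is rounded to a whole number of STFT columns; the helper names are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 16000   # assumption: the sample rate is not stated in the text
STFT_WINDOW_S = 0.025    # 25 ms STFT window
STFT_HOP_S = 0.010       # 10 ms STFT hop
PATCH_WINDOW_S = 0.96    # duration of each image frame
PATCH_HOP_S = 0.096      # hop between consecutive image frames


def stft_parameters(sample_rate_hz=SAMPLE_RATE_HZ):
    """Return (window samples, hop samples, FFT size, frequency bins)."""
    window = int(round(sample_rate_hz * STFT_WINDOW_S))
    hop = int(round(sample_rate_hz * STFT_HOP_S))
    fft_size = 2 ** math.ceil(math.log2(window))  # next power of two
    return window, hop, fft_size, fft_size // 2 + 1


def patch_parameters():
    """Return (STFT columns per image frame, hop in columns)."""
    columns = int(round(PATCH_WINDOW_S / STFT_HOP_S))
    hop = int(round(PATCH_HOP_S / STFT_HOP_S))  # 9.6 columns rounded to 10
    return columns, hop


def count_patches(duration_s, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of complete image frames obtained from a clip of given length."""
    window, hop, _, _ = stft_parameters(sample_rate_hz)
    n_samples = int(round(duration_s * sample_rate_hz))
    if n_samples < window:
        return 0
    n_columns = 1 + (n_samples - window) // hop
    patch_columns, patch_hop = patch_parameters()
    if n_columns < patch_columns:
        return 0
    return 1 + (n_columns - patch_columns) // patch_hop


print(stft_parameters())   # (400, 160, 512, 257)
print(patch_parameters())  # (96, 10)
print(count_patches(2.0))  # 11
```

Under these assumptions each image frame spans 96 spectrogram columns by 64 Mel bins, and a 2 s clip yields 11 feature vectors.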

Feature Extraction Using Pre-Trained CNN Models
State-of-the-art CNN models have been identified in the SED literature and selected according to their reported performance in the main SED conferences and challenges [14,16-22]. However, since the 'Detect and Avoid' system must run on an Nvidia Jetson TX2, the limited computational resources allocated to the SED task constrain the choice of deep learning models. We have chosen the official implementations of the VGGish [44] and YAMNet [43] architectures as they provide a good trade-off between SED performance and computational cost.
VGGish, which is based on the VGG architecture [17] (configuration A), has an input size of 96 × 64, drops the last block of convolutional and maxpool layers, and uses two 4096-wide fully connected layers followed by a 128-wide fully connected layer. Thus, VGGish outputs a feature vector of size 128. YAMNet uses the MobileNet_v1 architecture [45], composed of depthwise separable convolutions which drastically reduce computational cost and model size. YAMNet's implementation consists of one convolutional layer and 13 depthwise-pointwise layer pairs with batch normalization and ReLU, followed by average pooling and a 1000-wide fully connected classifier. For the purpose of fine-tuning, the classifier is removed, making YAMNet's new output a feature vector of size 1024.
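To make the cost reduction of depthwise separable convolutions concrete, the sketch below counts multiply-accumulate operations (MACs) for a standard convolution and for its depthwise-plus-pointwise replacement. The layer dimensions are hypothetical and chosen only for illustration; bias terms and strides are ignored.

```python
def standard_conv_macs(kernel, c_in, c_out, out_h, out_w):
    """MACs of a standard kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out * out_h * out_w


def separable_conv_macs(kernel, c_in, c_out, out_h, out_w):
    """MACs of a depthwise convolution followed by a 1 x 1 pointwise one."""
    depthwise = kernel * kernel * c_in * out_h * out_w
    pointwise = c_in * c_out * out_h * out_w
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 64 -> 128 channels, 48 x 32 output map.
layer = dict(kernel=3, c_in=64, c_out=128, out_h=48, out_w=32)
ratio = separable_conv_macs(**layer) / standard_conv_macs(**layer)
print(round(ratio, 4))  # 0.1189
```

The ratio simplifies to 1/c_out + 1/kernel², so for 3 × 3 kernels the separable form needs roughly eight to nine times fewer MACs than the standard convolution.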

Aircraft Sound Classification
The classifier, which is trained using the feature vectors extracted from the 'Small Aircraft' and 'UrbanSound8K' datasets, consists of an input layer whose size matches that of the feature vectors (128 for VGGish and 1024 for YAMNet), a fully connected layer of size 1024, and an output layer with a softmax activation function and two outputs, which correspond to the 'aircraft' and 'not aircraft' classes, respectively.
During training, the classifier optimizes the validation cross-entropy loss via mini-batch stochastic gradient descent with Nesterov momentum, a constant learning rate of 0.001, a decay of $10^{-6}$, and a momentum of 0.9. The classifier trains for 1000 epochs with a variable train/test split which depends on the choice of dataset. Each mini-batch consists of 256 randomly selected unique feature vectors which are seen exactly once for every epoch. The classifier, together with the pre-trained CNN models, is implemented in Python using TensorFlow 1.15.0 and Keras 2.2.4.
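For readers unfamiliar with the optimizer, the following sketch shows one common formulation of a Nesterov-momentum update with time-based learning-rate decay, applied to a single scalar parameter with the hyperparameters quoted above. Whether this matches the exact update performed by the authors' Keras optimizer is an assumption, and the function name is hypothetical.

```python
def sgd_nesterov_step(param, grad, velocity, iteration,
                      lr=0.001, momentum=0.9, decay=1e-6):
    """Apply one Nesterov-momentum SGD update to a scalar parameter.

    The learning rate shrinks slowly with the iteration count
    (time-based decay), which is an assumed interpretation of 'decay'.
    """
    lr_t = lr / (1.0 + decay * iteration)
    velocity = momentum * velocity - lr_t * grad
    # Nesterov look-ahead: apply the momentum step once more on top of
    # the plain gradient step.
    param = param + momentum * velocity - lr_t * grad
    return param, velocity


param, velocity = sgd_nesterov_step(param=1.0, grad=2.0, velocity=0.0, iteration=0)
```

With a decay of this size, the effective learning rate changes very little over a training run, which may explain why the text still describes the learning rate as constant.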

Results and Discussion
We test the performance of VGGish and YAMNet for each dataset described in Section 2.2 and for four different inference frequencies: 1 Hz, 2 Hz, 4 Hz, and 10 Hz. The inference frequency determines how often a new inference is computed. Since the computational cost of a single inference is constant, the higher the inference frequency, the higher the overall computational cost. Furthermore, for high inference frequencies, inferences may overlap and need to be computed in parallel.

Performance Metrics
The performance metrics chosen to compare SED models and datasets are described next. The true positive rate (TPR), also called sensitivity, recall, or hit rate, is calculated as $\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$, where TP and FN represent the number of true positives and false negatives, respectively. The false positive rate (FPR), also called fall-out, is calculated as $\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}$, where FP and TN represent the number of false positives and true negatives, respectively. On the one hand, the TPR indicates what proportion of aircraft is correctly detected by the proposed 'Detect and Avoid' system. On the other hand, the FPR describes the likelihood of sending false detection alarms.
If these metrics are collected while varying the confidence threshold, a so-called Precision-Recall (P-R) curve can be obtained where the x-axis represents the recall (TPR) and the y-axis the precision (PRE), calculated as $\mathrm{PRE} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$ [46]. P-R curves are a useful technique for visualizing the performance of binary classifiers for imbalanced datasets. The points lying on the horizontal line $y = \frac{\mathrm{TP} + \mathrm{FN}}{\mathrm{TP} + \mathrm{FN} + \mathrm{FP} + \mathrm{TN}}$ correspond to a random classifier. The ideal performance is located at (1, 1), where PRE and TPR are both 100%.
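The metric definitions in this section translate directly into code. The sketch below computes them from confusion-matrix counts; the counts in the usage example are made up for illustration and are not results from this study.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute TPR, FPR, precision, and the random-classifier P-R baseline."""
    return {
        "tpr": tp / (tp + fn),              # recall / sensitivity
        "fpr": fp / (fp + tn),              # fall-out
        "precision": tp / (tp + fp),
        # Precision of a random classifier: the share of positive samples.
        "random_baseline": (tp + fn) / (tp + fn + fp + tn),
    }


# Hypothetical counts, for illustration only.
metrics = detection_metrics(tp=75, fp=6, tn=94, fn=25)
print(metrics["tpr"], metrics["fpr"], metrics["random_baseline"])  # 0.75 0.06 0.5
```

Repeating this computation for a sweep of confidence thresholds and plotting precision against TPR gives one P-R curve.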

Dataset and Inference Frequency Comparison
Each model is tested for each possible combination of datasets and inference frequencies, as shown in Table 3. For each inference frequency, the best TPR and FPR values are highlighted in bold. Furthermore, for each model, the best overall TPR and FPR are highlighted in red. For both models, the 'Data augmentation*2' and 'Hybrid' datasets show the best performance in terms of FPR and TPR, respectively, across all inference frequencies. The best overall performance for YAMNet and VGGish corresponds to an inference frequency of 10 Hz and 1 Hz, respectively. However, while VGGish's best FPR (5.03%) is about 1 percentage point lower than YAMNet's (6.11%), YAMNet's best TPR (74.90%) is about 11 percentage points higher than VGGish's (63.91%). Furthermore, the performances of each model's best configuration are compared using Precision-Recall curves in Figure 4. Here, YAMNet also performs better than VGGish, but this time in terms of precision.
Given YAMNet's superior performance across all datasets and inference frequencies, a more detailed analysis is performed next. Figure 5 shows the P-R curves corresponding to each inference frequency for all datasets. The largest differences in performance are observed across different datasets. For all inference frequencies, the 'Undersampling' dataset shows the worst performance since it suffers from the removal of 83% of 'not aircraft' samples without any data augmentation benefits. The 'Data augmentation' and 'Data augmentation*2' datasets benefit from the data augmentation strategies. Their performance is very similar, as evidenced by the P-R curves, which indicates that, although it is beneficial to apply data augmentation to the aircraft class, further data augmentation applied to both classes has negligible effects. The 'Hybrid' dataset benefited both from augmenting the minority 'aircraft' class and from undersampling the majority 'not aircraft' class to ensure class balance. Regardless of the choice of inference frequency, and excluding the 'Undersampling' dataset, the results for the remaining datasets are very similar for Recall > 0.7. This suggests that, in the absence of enough data representative of the minority class, a data augmentation strategy can provide a boost in prediction performance.
Figure 6 shows the P-R curves corresponding to each dataset for all inference frequencies. As the inference frequency is increased from 1 Hz to 10 Hz, performance improves for all datasets. Although it has a lower impact on performance than the choice of the dataset, there is also a clear benefit to using higher values of inference frequency.

Computational Performance Assessment
To assess the computational performance of both models, Table 4 shows inference times (defined as the time that the computer takes to read an audio waveform, extract a feature vector, and make a prediction) for four inference frequencies. These tests are performed on an Nvidia Jetson TX2 for the task of audio-based detection alone. YAMNet's latency is one order of magnitude smaller than that of VGGish, with an average inference time of 0.154 s for an inference frequency of 10 Hz. These differences in inference times may be explained by model size (4 M parameters for YAMNet [45] vs. 133 M for VGGish [17]) and by YAMNet's lower model complexity, thanks to its use of depthwise-separable convolutions (which reduce the number of multiply-accumulate operations). Furthermore, RAM consumption was one order of magnitude smaller for YAMNet (35 MB) than for VGGish (360 MB) for all inference frequencies. Finally, CPU and GPU usage for both models is similar, regardless of the choice of inference frequency.
For both models, depending on the value of the inference frequency, a new inference may start before the previous one has finished, meaning that two or more inferences will be computed in parallel. In the case of VGGish, high inference frequencies result in the parallel computation of dozens of inferences.
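A back-of-the-envelope estimate of this overlap is the product of inference time and inference frequency, rounded up. The sketch below applies it to the YAMNet inference time quoted above; it assumes that the inference time stays constant regardless of how many inferences run concurrently, which may not hold on a loaded device, and the function name is hypothetical.

```python
import math


def concurrent_inferences(inference_time_s, inference_frequency_hz):
    """Estimate how many inferences are in flight at the same time.

    A new inference starts every 1 / frequency seconds and each one lasts
    `inference_time_s`, so about time * frequency of them overlap.
    """
    return max(1, math.ceil(inference_time_s * inference_frequency_hz))


print(concurrent_inferences(0.154, 10))  # 2
print(concurrent_inferences(0.154, 4))   # 1
```

Under this estimate, YAMNet at 10 Hz needs two inferences in flight at once, whereas at 4 Hz and below each inference finishes before the next one starts.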

Conclusions
We have developed a real-time, audio-based aircraft detection system using deep learning models fine-tuned with our 'Small Aircraft' dataset. We have identified YAMNet, fine-tuned using our 'Hybrid' dataset and run at an inference frequency of 10 Hz, as the optimal configuration in terms of TPR and PRE. Despite the project constraints on the available computational resources, this model is able to provide accurate detection results in real time. This study has also highlighted how data augmentation strategies can significantly improve model performance for imbalanced audio datasets. Although the aircraft type used for testing (SOCATA TB9) was different from those used for training (Edge 540, Cessna 172E, Cessna 152, and a number of historical aircraft and aircraft engines), further testing using different aircraft types will be performed in the future to strengthen our current conclusions.

Future Work
On the one hand, given the limited sample size of the 'aircraft' class, one immediate direction for future work is increasing the size of the 'Small Aircraft' dataset during additional flight campaigns using different aircraft types. On the other hand, optimizing YAMNet's implementation to reduce feature extraction and inference computational cost would allow an even higher inference frequency and consequently a higher detection accuracy. Investigating the potential superiority of the wavelet transform over the STFT for audio post-processing [47] could also result in detection accuracy gains. Additionally, combining multiple aircraft detection systems located within a certain distance of each other and taking advantage of the microphones' high directionality would allow us to localize the aircraft, thus increasing the utility of the current design. After some adaptation, the detection system could be integrated within a fixed-wing UAV to reduce the effect of ground sounds, while the microphones' high directionality could also be used to minimize the effect of rotor noise. Finally, an alternative to performing inferences on the embedded computer would be to stream the audio feed via a remote connection to a cloud server. With the advent of 5G networks, this application is likely to exploit the lower latency, higher capacity, and increased bandwidth that they provide. This remote server, not being as limited by computational resources as the embedded computer onsite, would be able to use more complex SED models in real time with even higher accuracy than YAMNet.

Funding:
This work has been partially funded by the VIGIA (ITC-20181032) and iMOV3D (CER-20191007) Spanish R&D projects, funded by the CDTI.

Acknowledgments:
The authors would like to thank Carlos Albo, José Ramón López, Daniel Lozano, and Mouhsine Kassimi from CATEC for their support in gathering the dataset and integrating the system; and Justin Salamon from NYU and Adobe Research for his advice on state-of-the-art SED models.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: