Practical Study of Recurrent Neural Networks for Efficient Real-Time Drone Sound Detection: A Review

Utebayeva, Dana; Ilipbayeva, Lyazzat; Matson, Eric T.

doi:10.3390/drones7010026

Open AccessArticle

Practical Study of Recurrent Neural Networks for Efficient Real-Time Drone Sound Detection: A Review

by

Dana Utebayeva

^1,*

,

Lyazzat Ilipbayeva

² and

Eric T. Matson

³

¹

Department of ET and ST, Satbayev University, Almaty 050013, Kazakhstan

²

Department of RET, International IT University, Almaty 050040, Kazakhstan

³

Department of CIT, Purdue University, West Lafayette, IN 47907-2021, USA

^*

Author to whom correspondence should be addressed.

Drones 2023, 7(1), 26; https://doi.org/10.3390/drones7010026

Submission received: 9 November 2022 / Revised: 21 December 2022 / Accepted: 22 December 2022 / Published: 30 December 2022

Download

Browse Figures

Review Reports Versions Notes

Abstract

:

The detection and classification of engine-based moving objects in restricted scenes from acoustic signals allow better Unmanned Aerial System (UAS)-specific intelligent systems and audio-based surveillance systems. Recurrent Neural Networks (RNNs) provide wide coverage in the field of acoustic analysis due to their effectiveness in widespread practical applications. In this work, we propose to study SimpleRNN, LSTM, BiLSTM, and GRU recurrent network models for real-time UAV sound recognition systems based on Mel-spectrogram using Kapre layers. The main goal of the work is to study the types of RNN networks in a practical sense for a reliable drone sound recognition system. According to the results of an experimental study, the GRU (Gated Recurrent Units) network model demonstrated a higher prediction ability than other RNN architectures for detecting differences and the state of objects from acoustic signals. That is, RNNs gave higher recognition than CNNs for loaded and unloaded audio states of various UAV models, while the GRU model showed about 98% accuracy for determining the UAV load states and 99% accuracy for background noise, which consisted of more other data.

Keywords:

RNN; simpleRNN; LSTM; BiLSTM; bidirectional LSTM; gated recurrent networks; GRU; Kapre method; Mel-spectrogram; deep learning; UAV sound; sound detection; loaded UAV; unloaded UAV; drone detection; real-time UAV detection

1. Introduction

Very recently, the use of small unmanned aerial vehicles (UAVs) has significantly increased. These tiny vehicles, also known as drones, are being used more and more frequently, particularly for a variety of tasks, including the production of toys for kids, entertainment, photography, video monitoring in remote locations, agronomy, military intelligence, delivery, and transportation. The purpose of their widespread use in the postal service and agriculture is the automation of processes in order to replace the workforce. Due to the many recreational uses of these UAVs, illegal activities with them have also increased or at least become more visible than before. These vehicles are said to have been used in smuggling drugs across international borders, violating airport security and local laws. The main reason for their frequent use among people is the expansion of their technical capabilities, i.e., an increase in flight duration, the possibility of flexible video filming from different heights and angles, and the possibility of unhindered entry into different areas. Furthermore, over the past decade, the mass production of unmanned aerial vehicles (UAVs) at affordable prices has led to the problem of the constant use of these devices for various dubious and recreational purposes. The careless or harmful use of these vehicles can endanger people, their lives and protected institutions, and the border areas of countries. These reasons take into account the fact that UAVs are becoming more and more dangerous. Recently, there have been several incidents with drones [1,2,3,4,5,6,7,8,9]. In particular, in 2015, an intoxicated government official in front of the White House lawn in the United States provoked a drone crash, in 2017 a plane landing in Canada had a minor accident with a drone, and a suspicious drone flew over Gatwick Airport in London in 2018, causing thousands of passengers to cancel flights [3]. There are many cases of damage caused, especially by improper management. Likewise, an Unmanned Aerial Vehicle (UAV) crash occurred in the city of Asir on the border with Saudi Arabia [10]. Another drone crash happened in China: a dozen of the 200 drones took to the skies during a light show in Zhengzhou. In this case of UAV crashes, just 2.5 min after their glorious ascent, the drones began to fall back to the ground, crashing into trees, cars, and everything else as they fell. Some participants of this show ran to hide from unmanned drones [11]. In general, the reasons for the impossibility of the sustainable use of these vehicles are discussed in more detail in [3,5]. Thus, the high frequency of unauthorized drone flights requires the development of reliable real-time drone detection systems for protected areas. Therefore, the detection systems for the unauthorized use of UAVs are becoming more and more significant, especially in the buildings of institutions such as schools, kindergartens, hospitals, universities, administrations, ministries, border areas of the country, protected areas where military bases are located, reservoirs that protect large cities from snowmelt, and agricultural locations. For these reasons, there has been a growing demand for research into security measures based on drone detection systems to prevent the spread of unauthorized drones in restricted and protected strategic areas.

The potential risks with UAVs are more related to factors such as poor control or human error, as well as malicious use, especially when they have an additional load while flying in crowded areas. In these regards, the problem of automatic audio-based UAV classification and detection systems has attracted particular interest from the research community due to their ability to detect loaded and unloaded UAV states in certain areas of interest [1,3]. The recognition of the UAV’s load by acoustic sensors is more effective than other methods such as radar, camera, and radio frequency methods [5,6,7]; thus, this work aims to investigate the acoustic sensor method for the UAV detection problem.

In general, taking into account the mentioned UAV incidents, UAV threats can be divided into two main categories. Unmanned aerial vehicles used for amateur purposes, in most cases, will not carry an extra load during their flight, so they fall into the category of less threatening. At the same time, UAVs used in terrorist and malicious attacks often carry an additional load. Therefore, UAVs of a similar category or poorly controlled loaded UAVs belong to the group of the most threatening objects. In the case where the UAV poses a real public threat, resources can be directed to the real threat and not to less or more threatening phenomena. In order to ensure simultaneous detection of both threatening UAVs, this research work proposes a new approach with the task of classifying loaded and unloaded UAVs based on acoustic signals in the presence of different UAV models and different payload masses. The purpose behind identifying UAVs by acoustic signatures lies in the fact that if the UAV has an additional load during the flight, the supplemented mass makes changes in the recognition system database, and likewise, the acoustic signature will change. The signature of the UAV without load differs from the signature of the UAV with a one-kilogram Cemtex stick or other explosive material attached to the bottom of the vessel [1]. In order to recognize the differences between these UAV acoustic signatures, our research paper proposes a new recognition system structure based on the frequency features and Recurrent neural network cells. The motivation for this study was the lack of a drone recognition system that could take into account the reliable recognition of different UAV models with different payload masses in real-time. In addition, by analyzing related works in sound detection systems [12], RNN has chosen robust computational neurons for recognizing different audio signal content, including UAV audio signals too. In parallel, the CNN algorithm, which in the previous statements is claimed to be a successful and reliable system, was experimented with a similar density as RNN. The new approach of our methodology is based on the study of a few models of Recurrent Neural Networks such as Simple RNN, LSTM, Bidirectional LSTM (BiLSTM), and Gated Recurrent Units (GRU) on the basis of Mel-spectrograms using the Kapre Method with fixed hyperparameters as in Table 3 and Figure 5. In addition, studies of drone-load identification tasks in [1,3] are based solely on one type of drone, in particular, the DJI Phantom 2. Consequently, our approach performs an additional study of the sounds of several drone models, as well as their real-time detection during every second of observation time for protected areas.

Generally, Simple RNN, Long Shot-Term Memory Network (LSTM), Bidirectional Long Short-Term Memory Network (BiLSTM), and Gated Recurrent Network (GRU) Neural Cells are Recurrent Neural Network (RNN) architectures designed for sequencing problems. Another advantage of these networks is that they are also able to learn the time dependence of given data. That is, the network shows one observation in a sequence and can figure out which of the observations it has seen before are relevant and how important they are for predictions [12]. Therefore, our research work was aimed at experimentally studying these SimpleRNN, LSTM, BiLSTM, and GRU for UAV sound signals. To determine the cell robustness of RNN neurons, their results were compared with CNN models to see how they solve UAV sound classification problems. Furthermore, RNN models were actively studied in this work for the first time in the problem of classifying UAV acoustic signatures by assessing their payload. In accordance with the purpose of the work, research and experimental stages were considered based on the following tasks: (1) creation of a database of UAV sounds of different masses with different UAV models; (2) conducting extensive experimental studies for RNN cell types and developing structures for these neurons with additional hyperparameters; (3) the study of its recognition based on a system of previous studies in this direction with a small number of CNN layers and comparing the result with the results of RNN; (4) identifying and evaluating an effective system based on the results obtained.

This research work has five key sections that are all focused on research. The introductory section discusses the drone recognition system in general and the relevance of drone sound recognition in it. The second section of similar works examines the successes and drawbacks of prior research in this area. A theoretical foundation for the suggested RNN systems was attempted in the third section. The process, including data collection and frequency feature extraction, is covered in the fourth part. The results were examined, and an attempt was made to assess an efficient system in the fifth section.

2. Related Studies

Recognition of the UAVs according to their acoustic signals aroused extraordinary interest among researchers since they are able to solve scientific problems, such as UAV detection in conditions of poor visibility or identification of UAVs when it carries an additional burden. General studies of UAV detection systems have been discussed in detail in [3,5]. This section briefly discusses the survey of sound-based UAV detection and localization tasks. Successful detection or localization of these UAVs has been achieved primarily through machine learning, including deep learning techniques. Accordingly, this section discussed the classification of research works based on machine learning, deep learning, and other methods.

Machine learning methods in UAV sound recognition. Nijim et al. [13] studied a drone detection system from its acoustic signatures using a Hidden Markov Model on DJI Phantom 3 and the FPV 250 drones dataset. The authors of [14] processed the data of audio files into time and frequency parameters using the traditional ML method, SVM. They performed the task of binary classification, i.e., the presence or absence of a drone, as well as UAV sound spectrum images trained with a k-nearest neighbor (KNN) classifier (61%) and correlation (83%) methods to detect DJI Phantom 1 and 2 in [15]. The problem of classification of drones into many types in the context of clustering of environmental noise and other types of noise (car, bird, rain) was carried out in the [16] using the HMM method on the basis of 24 and 36 using MFCC features. The authors of [17] studied the classification of flight sounds of DJI Spark, Quadcopter AR Drone 2.0, and Parrot Mambo drones using the random forest method. The SVM method classified the features using MFCC and LPCC coefficients in [18]. The real-time detection of drone sounds was performed by two methods: PIL (Plotted Image Machine Learning) and KNN (K Nearest Neighbors) classification in [15]. Reference [19] integrated sensors such as a camera and a microphone; video frames were classified by HOG parameters, and audio frames were classified by SVM through MFCC parameters.

Machine learning and deep learning methods in UAV sound recognition. Jeon et al. [20] investigated the detection of the drone existence in the range of 150 m with the Gaussian Mixture Model (68%), CNN (80%), and RNN (58%). Yang B. et al. proposed classifying the MFCCs and the STFT features with SVM and CNN models in [21].

Deep learning methods in UAV sound recognition. In the works [4,22], CNN network models performed various classification tasks, such as the absence of drones in the region, the presence of drones, and the presence of two drones, and references [1,2] studied simple CNN models for the loaded and unloaded UAVs classification task with the DJI Phantom 2 model. The background noise of the test area was examined as false negatives, given as feature vectors for neural networks to distinguish the main class from noise.

UAV localization and bi-modal methods. References [23,24,25,26,27,28,29,30] suggested searching for several sound sources through a microphone (acoustic sensor), i.e., they studied the localization of drone sounds to achieve reliable control of the drone. The authors of [31,32] studied the correlation method of drone detection. The authors of [27] conducted a study with a combination of radar and acoustic sensors. The authors of [33] worked on the problem of detecting a drone with a combined system of camera and acoustic sensors to construct a bi-modal method. The advantage of the bi-modal method is the creation of a reliable system by combining the functions of two different systems because each drone detection system has its own advantages and disadvantages. For example, a computer vision system loses its reliability when the line of sight is not clear. The drone may also falsely detect the load, i.e., it may recognize the loaded empty box or cargo as a genuinely loaded drone. The acoustic method has the ability to reliably detect the loaded drone based on sound characteristics. However, acoustic systems can be unreliable over long distances. These shortcomings can be rectified by combining two systems, that is, by integrating their individual functions and capabilities into one system. More complex tasks include the creation of a multi-modal system, integrating all the functions of all detection systems.

By analyzing research studies based on acoustic sensors, the acoustic method of a UAV detection system is divided into research areas such as detection and localization. In all the studies of deep learning methods discussed above, CNNs and RNNs were not considered with hyperparameters of the same thickness. Therefore, the results of the CNN network for audio signals do not allow us to conclude that they are completely reliable. Based on this motivation, the proposed work aims to investigate all common types of RNNs, including the CNN network. Our project has considered a real-time recognition system of the UAV sound detection task. The next section will discuss the theoretical foundations of the RNN architectures that are used in this objective.

3. Methods

Based on the related works discussed in the second section, the UAV sound recognition task has shown good results using deep learning methods. Convolutional Neural Networks (CNNs), Deep Feedforward Neural Networks, and Recurrent Neural Networks (RNNs) dominate traditional machine learning methods in complex prediction problems. Due to the wide use of image data, CNNs are known in computer vision problems. A deep feedforward network is a feedforward network that has gained prominence among image recognition, computer vision, sound recognition, and other prediction problems because they use more than one hidden layer, which provides more learning on given data. The main problem with using only one hidden NN layer is overfitting, so by adding more hidden layers, overfitting can be reduced, and generalization can be improved. However, deep FF neural networks have a limitation in that overlaying more layers leads to an exponential increase in training time, which makes DFF rather impractical. Recurrent Neural Networks (RNNs) are essentially a subset of feedforward networks (FFs). They are capable of learning features and long-term dependencies from sequential and time-series data. Each neuron in the hidden layers of an RNN receives input with a certain time delay. Recurrent Neural Networks are used when current iterations require access to earlier data. For example, to predict the next word in a sentence, one needs to know the words that came before it. Any lengths and weights can be shared by the RNN over time as the input is processed. The calculations in this model take into account previous data, and the size of the model does not increase with the size of the input data. However, the problem lies in the low computation speed of this neural network [34]. Audio signals are constantly changed over time. The consistent and time-varying nature of sounds makes the RNN networks an ideal model for studying the features [35,36]. Currently, four different computational cells of RNN networks such as, simple RNN, LSTM, BiLSTM and GRU, are popular for predicting audio signals.

In general, neural network-based architectures are built with three main types of layers: input layers, intermediate hidden computational neurons or cells, and output layers. Additionally, the hidden nodes and output nodes of a neural network architecture have activation functions. As the number of layers in the network increases, the value of the derivative product decreases until, at some point, the partial derivative of the loss function approaches a value close to zero and disappears, which is known as vanishing problems and gradient descent. In essence, the activation function transforms the input value that enters the node and then sends the modified value to the subsequent group of neurons. The sigmoid and the hyperbolic tangent (tanh) were typically the two most used nonlinear activation functions. A vanishing gradient issue, however, was present when employing these activation functions with Deep Neural Networks (Deep NNs). In order to update the weights, the error is backpropagated across the network during the training. Lately, the gradient descent problems have been using the “ReLU” activation function. The expanded principle of the activation function is provided in reference [37].

RNNs have extended their applicability based on traditional Simple Recurrent Neural Networks (SimpleRNN). Simple RNN has three layers that include input, hidden, and output layers, as shown in Figure 1. According to the basic operating principle of Simple RNN, it connects nodes to understand current information by feeding the output of the neural network layer at time t to an input of the same network level at time

t + 1

, as shown in Figure 1. The input data are a sequence of vectors over time t, such as

. . ., x_{t - 1}, x_{t}, x_{t + 1}, . . .

in Input Layer. Input blocks in a fully connected Simple RNN connect to hidden blocks in a hidden layer. The hidden layer has the following hidden units

h_{t} = h_{t - 1}, h_{t}, h_{t + 1}, . . .

, which are connected to each other over time using periodic connections. Initializing hidden modules using small non-zero elements can improve overall network performance and stability. The concept of an unrolled structure of RNN networks has been shown as an “Unfolded” form in Figure 1 in the case where there are multiple input time steps (X(t), X(t+1), …), multiple internal state time steps (h(t), h(t +1), …) and several output time steps (y(t), y(t+1), …).

However, these networks have their drawbacks. The first shortcoming of Simple RNN is the gradient vanishing and explosion problems. This complicates RNN learning in two ways: it cannot handle very long sequences if tanh is used as the activation function; it is very unsteady if the function ReLU is used as the activation function [36].

Long Short-Term Memory (LSTM) is a specific architecture of the Recurrent Neural Network (RNN) that has been developed to solve the vanishing and exploding gradient problems of traditional RNNs [12,38]. They have shown successful results in predicting sequence problems, such as handwriting recognition, language modeling, image captioning, and acoustic signal classifications. Model training using LSTM networks is more accurate, and the training time is longer than other algorithms. In order to reduce the training time while maintaining a high level of training accuracy, Gated Recurrent Unit (GRU) networks have been developed. Furthermore, GRU networks are widely used in classification tasks [35]. In this paper, we propose to study SimpleRNN, LSTM-based RNN units, and Gated Recurrent Unit (GRU) models in the single and stacked architectures for the UAV acoustic representations classification task since these models have been used more efficiently in the learning of sound-based recognition systems.

3.1. LSTM

LSTM is a recurrent neural network architecture that replaces the standard layers of the neural system with long-term memory cell blocks to avoid the problem of long-term dependency; see Figure 1 and Figure 2. The cell blocks of common LSTMs are composed of four interacting layers: a cell state, an input gate, an output gate and a forget gate. The sequence data

x_{t}

of feature extraction is combined with the output data of the previous cell

h_{t - 1}

. This combination of input data goes through the forget gate (1),

f_{t}

, and the input gate (2),

i_{t}

. Both gates have sigmoid activation functions to output values between 0 and 1.

f_{t} = σ (ω_{f} [h_{t - 1}, x_{t}] + b_{f})

(1)

i_{t} = σ (ω_{i} [h_{t - 1}, x_{t}] + b_{i})

(2)

{\hat{C}}_{t} = tanh (ω_{c} [h_{t - 1}, x_{t}] + b_{c})

(3)

C_{t} = f_{t} \times C_{t - 1} + i_{t} \times {\hat{C}}_{t}

(4)

O_{t} = σ (ω_{O} [h_{t - 1}, x_{t}] + b_{O})

(5)

h_{t} = O_{t} \times tanh (C_{t})

(6)

Therefore, the forget gate (1) decides what information to throw out of the cell, and the input gate (2) decides which values from the input to update. Furthermore, that combination is squeezed by the tanh layer,

{\hat{C}}_{t}

. Here,

ω_{f}

,

ω_{i}

,

ω_{c}

are weights for the respective gate neurons; and

b_{f}

,

b_{i}

,

b_{c}

are biases for the respective gates. LSTM cells have an internal loop (cell state) consisting of a variable

C_{t}

(4) called the constant error carrousel (CEC). The old cell state

C_{t - 1}

is connected to establish an effective recurrence loop with the input data. The compressed combination

{\hat{C}}_{t}

is multiplied × with the input gate data

i_{t}

; see Figure 2b.

However, this recurrence loop is controlled by a forget gate, which determines what data should be stored or forgotten on the network. The addition method ⊕ reduces the risk of gradient vanishing instead of multiplication. Then, the system positions the cell state via the tanh function to push the values between −1 and 1 and multiply it by the output of the sigmoid gate (5). Thus, this gate (6) determines what values should be output from the

h_{t}

cell as the output value. On the whole, the forget gate and the input gate are used for updating the internal state (4). The main disadvantages of LSTM networks are the higher memory requirements and computational complexity than a simple RNN due to multiple memory slots. It differs from conventional RNNs in its long-term dependencies and robust advantages over gradient vanishing.

In summary, if we summarize LSTM benefits, it overcomes vanishing and exploding gradients and allows memory to overcome long-term time dependency problems with input sequences [12,39].

Bidirectional LSTM

Bidirectional LSTMs are a development of typical LSTMs that can increase model performance in sequence classification tasks. With all the time steps of the input sequence, bidirectional LSTMs train two LSTMs instead of a single LSTM in the input. Bidirectional LSTMs solve the problem by outputting data from the input sequence in the forward and reverse directions over time steps.

In practice, this architecture duplicates the first recurrence level in the network by having two layers side by side, then providing the input sequence as it is at the input of the first level and providing a reverse copy of the input sequence to the second layer input. Thereby, this additional context makes the results faster [12,40]. Therefore, the BiLSTM network processes

x_{t}

sequence data in the forward and reverse directions using two separate hidden layers, and their hidden layers are connected with the same output layer; see Figure 3. Like the LSTM level, the final output of the Bidirectional LSTM level represents the vector

y_{t} = [y_{t - 1}, . . ., y_{t + 1}]

, in which the last element,

y_{t + 1}

, is the predicted sequence for the next time steps. BiLSTM shows its disadvantage by increasing computational complexity over LSTM due to forward and backward learning. Their main advantage is that they better reflect both future and past contexts of the input sequence than LSTM networks.

3.2. Gated Recurrent Unit (GRU) Networks

While the LSTM network has proven to be a viable option for preventing gradients from disappearing or exploding, they have higher memory requirements given the multiple memory locations in their architecture [36]. As a solution to this problem, the GRU network was developed by the authors [41] with less learning time due to fewer parameters than the LSTM structure while maintaining high accuracy. Unlike LSTM networks, GRU networks do not have an output gate [35]. Figure 4 demonstrates the network structure of GRU. In the structure of GRU networks, at each moment of time, there are two input functions, which include the previous output vector

h_{t - 1}

and the input vector

x_{t}

. Each gate output can be taken through a logical operation and nonlinear transformation of the input. The ratio between output and input can be characterized as follows:

r_{t} = σ_{g} (ω_{r} x_{t} + U_{r} h_{t - 1} + b_{r})

(7)

z_{t} = σ_{g} (ω_{z} x_{t} + U_{z} h_{t - 1} + b_{z})

(8)

h_{t} = (1 - z_{t}) h_{t - 1} + z_{t} \hat{h_{t}}

(9)

\hat{h_{t}} = σ_{h} (ω_{h} x_{t} + U_{h} (r_{t} h_{t - 1}) + b_{h})

(10)

where

z_{t}

is the update gate vector,

r_{t}

is the reset gate vector,

ω_{r}, ω_{z}, ω_{h}

and

U_{r}, U_{z}, U_{h}

are weight matrices for the respective gate neurons.

σ_{g}

is a sigmoid function and

σ_{h}

is a hyperbolic tangent [2,34].

In this structure in Figure 4, the hidden state ignores the previous hidden state when the reset gate is close to 0, and it only resets with the current input. This allows the hidden mode to remove any information that is not important in the future, thus providing a more compact view. In addition, the update gate controls how much information from the previous hidden state will be transferred over to the current hidden state. This works in the same way as LSTM memory cells and allows the RNN to store long-term information. Since each hidden module has a separate reset and update gate, each hidden module will learn to catch dependencies at different time scales. Modules that learn to capture short-term dependencies will have a reset gate that is often active, but modules that collect long-term dependencies will have an updated gate that is most active [41].

Higher computational complexity and memory requirements than Simple RNN due to the many hidden state vectors show the disadvantage of GRU. The advantages of GRU networks, such as the ability to model long-term dependent sequences, resistance to gradient reduction, and lower memory requirements, have expanded their field of application in practice. Taking into account all the features of computing modules of RNN networks discussed above, this paper intends to carry out practical research for the UAV sound recognition task.

3.3. Single-Layer (1L) RNNs

Single-layer (1L) or Vanilla RNNs are simple configurations that consist of an input layer, a fully connected SimpleRNN/LSTM/BiLSTM or GRU hidden layer, and a fully connected output layer [12].

Single-layer (1L) RNNs have the following properties: a sequence classification due to multiple distributed input time steps; the memory of accurate input observations for thousands of time steps; sequence prediction as a function of previous time steps; resistance to inserting random time steps into the input sequence; resistance to placing signal data on the input sequence.

3.4. Stacked RNNs

The Stacked RNN or Deep RNN network is a model that consists of multiple hidden SimpleRNN/LSTM/BiLSTM or GRU layers, where each layer contains multiple memory cells. Stacking hidden layers makes the model deeper and, more precisely, earns a description as a deep learning method. The depth of neural networks is usually explained by the success of the approach in a wide range of complex prediction problems. The depth of the network is more important than the number of memory cells in a given layer to model skill. A stacked RNN is currently a suitable method for solving sequence prediction problems [42].

Concluding this section, our approach involves a practical consideration of the cognitive skills of RNN structures. In the course of empirical experimental work, single-layer or vanilla and stacked layers were studied to improve the accuracy of the problem. As a result, it was sufficient to use single RNN layers due to the proposed model structure. All details of the proposed model will be discussed in the next section.

4. Proposed Real-Time System

The creation of intelligent recognition systems is the main goal of the computational analysis of audio signals, which includes the study of the classification of audio events based on deep learning methods. The focus of this research work is the development of such a recognition system, namely the classification of UAV sounds for the simultaneous detection of several UAV models in the same class as unloaded UAVs when they are not loaded with payloads or their load state in the area of interest. Thus, this section aims to propose an acoustic sensor basis to create a recognition system that can detect the presence of UAVs in certain areas of interest in real-time by their sounds and recognize whether they are flying with an additional load or not. The stages of creating this system consist of the following two main parts:

Data preparation,
Real-time recognition framework.

The first data preparation part of the proposed work dealt with the gathering of acoustic data of unmanned aerial vehicles when they fly close and are at longer distances from microphones, as well as with and without a payload. The majority of audio data of the UAV was recorded using a professional ZaxSound SF666PRO condenser cardiomicrophone, as well as using built-in microphones from Apple products such as iPhone 11, 13, and Ipad Air (2020 release). The collected UAV sounds are grouped into three different classes. The sounds collected by a UAV flying with a special payload are called the Loaded UAV class. All the sounds of drones launched from different models were collected under the name “Unloaded UAV”, and the background noise and the sound of other objects that could be confusing were collected in the “background noise” class. The data preparation part also gathered UAV sounds from open sources. This stage was laborious due to the fact that there were not sufficient UAV sounds in known databases. This work differs from previous studies on the detection of UAV sounds in that the signal pre-processing stage, which was performed by traditional methods, is partly included in the data preparation part and inside the Recognition Framework section itself. Our approach uses the Kapre method [8], which processes audio data on a GPU or CPU with the quick implementation of deep learning models.

The second part focused on the creation of a system itself capable of recognizing UAVs in real-time and interpreting their state during their flight. To this end, this section attempted to build a recognition system architecture using RNN neural networks, such as SimpleRNN, LSTM, BiLSTM, and GRU, by performing the task of multiple classifications of UAV acoustic data based on the concept of computer vision. Based on the ability of RNNs to successfully identify audio signals, this work was primarily intended for the practical investigation of Recurrent Neural Networks for UAV sound detection. According to previous studies on UAV sound classification and estimation in [1,2,4,42], most Drone sound recognition systems prioritize CNN networks. As can be seen from the Related Works section, reviews and full studies of all types of Recurrent Neural Networks were not investigated for UAV sounds and were not compared with CNN networks. Therefore, the main reason for the comprehensive practical study of Recurrent Neural Networks is the significant success of these networks in the study of sound (audio) signals and speech. The object of study is the processing of UAV sound data and not their images, and thus, this work considers the study of types of recurrent neural systems from a practical point of view.

Generally, previous studies of UAV sound classification [1,2,4] were processed and implemented with a signal duration of different seconds on their input. This, in turn, can lead to many dangerous events over a long not-adapted duration. There was a need to develop a system capable of recognizing loaded and unloaded drones that were trained to interpret input audio signals of equal or shorter duration to represent a real-time system. Taking into account these factors, experimental work was carried out on the problem of recognizing three classes of UAV sounds with four types of computational neural cells of Recurrent Neural Networks, such as SimpleRNN, LSTM, Bidirectional LSTM, and GRU. As a result, a real-time system was developed that recognizes the sound of UAV classes trained from sound files with a duration of one second.

In the proposed framework, drone sounds are mainly divided into three classes and given to the input layers, as mentioned above, such as “Unloaded”, “Loaded” and “Background”. This is because, in most cases, identifying a specific type of drone does not create a meaningful need. Even though the possibility of this problem was studied in previous works [6,7], the problem of determining the load of the drone is a very necessary and vital system. In particular, it can be used as a solution or preventive measure for life-threatening problems, such as illegal and unauthorized transport of goods, the possibility of a load falling on people even if the load is harmless, and the transport of life-threatening goods for military purposes. Moreover, the recognition of UAV sounds can solve the problem of detecting a suspicious UAV infiltrating a strategically protected area. Determining the load of the UAV is a difficult task because it is similar to the sound of the UAV itself. The sound of flight with an additional load may differ in its matrices in the time and frequency domains and set the stage for recognition. After determining the object of study and methods, the stages of the experimental study were carried out. Accordingly, a framework for the experiment implementation was compiled, as shown in Figure 5. Features of UAV sounds of different models and states were taken in the form of Mel-spectrograms in our work. Basically, the feature extraction step is performed by processing the signals from the time domain to the frequency domain, and these frequency data are extracted as a spectrogram, Mel-spectrogram, and MFCC [43] coefficients. Pre-processing of raw audio data for real-time systems can be obtained using these basic time-frequency representations. In fact, most audio signals are non-periodic signals. It is necessary to take their frequency representations, i.e., the spectrum representations of these signals from the frequency domain, as they change over time. Consequently, multiple spectra are computed by performing an FFT on multiple segments of the signal window. This is called the short-time Fourier transform as in (11) [44].

X_{i}^{-} (K) = \sum_{n = 1}^{N} X_{i} (n) g (n) e^{- j 2 k n / N}, k = 1, . . ., K

(11)

Here, K is the number of discrete Fourier transform coefficients, and g(n) is N sample long analysis window. The FFT is calculated for the overlapping signal window segments and can be obtained from the spectrogram. Typically, the signal frame duration is 20–40 ms frames. In our case, this is 25 ms. This means that the frame length for our downsampled 16 kHz signal is 0.025 ∗ 16,000 = 400 samples. The frame step or hop length is 10 ms or 160 samples, which allows for some frame overlap. The first frame of 400 samples starts at sample 0, and the next frame of 400 samples starts at sample 160 until the end of the audio signal file is reached. Periodograms are then obtained after using (12) to extract short frames from the input signals in (12):

P_{i} (k) = 1 / N {| X_{i}^{-} (K) |}^{2}

(12)

The main reason for obtaining periodograms is to evaluate the power spectral estimate of the signal. We took the absolute value of the complex Fourier transform and performed a 512-point FFT. A unit of pitch called the Mel scale was proposed by Stevens, Volkmann, and Neumann to make equal pitch distances sound the same. In other words, the Mel scale is a pitch perception scale that is determined to be equally far from one another (after a hearing). By designating a perceptual step, the reference point between this scale and the typical frequency measurement is established. The frequencies are mathematically transformed to the Mel scale using (13).

m = 1125 l n (1 + f / 700)

(13)

A spectrogram in which frequencies are scaled to the Mel scale is known as a Mel-spectrogram. In this work, the Mel-spectrogram was chosen as a feature extraction. It has been demonstrated through our experimental part of research that higher frequencies are shown for our UAV sounds. Accordingly, we expanded the number of filters on the Mel scale to 128. The Mel-spectrograms we chose acted as the first input layer, i.e., UAV sounds are processed as an input layer before the neural network layers, thanks to the Kapre method [45]. Our Mel-spectrogram processed the matrices with a dimension of 128 by 100 in training time. The data of this created layer are given to the neural network, which is built further, or in each neural model, it works as an input layer that processes the input data in the network. On the other hand, it is convenient even at the training stage with the structure of RNN models for drone sounds, and also time-efficient for recognizing unseen UAV sound files for prediction. This structure can be proposed as a real-time system since the sound data can be processed and recognized using trained models from one-second duration sound data files. The proposed framework works according to the algorithm of two blocks: data preparation and training models based on the types of recurrent network cells such as single-layer SimpleRNN, LSTM, BiLSTM, and GRU, Figure 5.

The Data Preparation section processed and split the UAV sound data for model training and predicting the models to be built. This was discussed in more detail in the Data Preparation section. Drone sounds for model training are fed as one-second audio data to Recurrent Neural Network models or CNN. Only 70% of the total volume of sound files received here is given to the training models, while the remaining 30% is stored without being shown to the models. The reason is to check the reliability of the created models. In addition to these 30 percent test files, 100 files of one-second duration for each class were saved separately, resulting in 300 files in total to test whether the models can recognize each class efficiently and reliably. As a result, the reliability of the created models for real-time systems was tested in several directions, that is, 30% of the data for the training models were comprehensively tested using model accuracy plots, and the separately stored 300 files were predicted using the confusion matrix, F1, recall, and accuracy for effective recognition of each class. The models were created on the basis of SimpleRNN, LSTM, BiLSTM, and GRU computational cells. In addition to the four models of Recurrent Neural Networks, a CNN model was created in accordance with their structure. Therefore, a total of five models were created. As mentioned above, the input layer of each model was the Mel-spectrogram layer, which performs real-time calculations. The full structural explanation and hyperparameters of the generated models are discussed in the model preparation subsection. The problem of determining the effectiveness of one of the models and the reliability of the constructed models specifically for real-time systems is discussed in the Experimental Results and Discussion section.

4.1. Data Preparation

Data preparation is an integral part of any deep-learning-based system. The Data Preparation section first started with collecting drone sounds from different scenarios, as shown in Figure 6. The main content of the database was created by testing the flights of DJI Phantom 1 and DJI Phantom 2 without and with a load at a distance of 2–100 m from the microphone. The microphone is placed on a table height of 75 cm. During the launch of the UAVs, motorcycles, gator utility vehicle riding, and a freight train were seen moving with overlapping sounds in the area. In addition, wind, canopy whispers, and other ambient sounds were heard during the test time, and their data were also collected to distinguish UAV sounds from false negatives. The loaded UAV was flown carrying a 0.5 kg payload of modeling clay at this location using the DJI Phantom 2. At the next location, the UAV Syma X20, which is widely used for recreational purposes, was flown with 0.425 kg of an iron power bank for its loaded case and also in the unloaded case. The task of load assessment of these recreational or toy drones was considered as the probability of danger with an error while controlling the drone. Other affordable UAVs, such as Syma x5c in 2–90 m and Tarantula x6 in 2–50 m distance, were tested for unloaded UAV cases. The sounds of drones DJI Phantom 4, DJI Phantom 4 Pro, Mavic Pro, and Qazdrone were also recorded. Background sounds from their launch were also collected. Not all drone models could fly with additional payload. The data of the drone that was able to fly with a load are shown in Table 1.

All these UAVs were recorded with a 16-bit resolution of microphone depth with a frequency of 44,100 Hz, back and forth, up and down, and at different speeds by default from their technical characteristics, starting at close proximity to the microphone and the parking lot next to it. The rest of the data were obtained from open sources. Compiling audio files from open sources took much more time and additional work. This is due to the fact that the sounds of the loaded UAV were found only in amateur videos and processed using a special converter at a frequency of 44,100 Hz and a depth of 16 bits because our prediction system was based on an acoustic sensor capable of listening with a frequency of 44,100 Hz and a depth of 16 bit.

Our model was constructed to receive audio data through the wav extension and the remaining data from the open sources, which in other formats have been converted to a 44,000 Hz sampling rate with 16-bit depth resolution and “mono” mode of a microphone with the extension “.wav”. The previous studies [1,2] were limited by considering only a single model of UAV, which is the DJI Phantom 2. This work aimed to study the influence of various UAV models’ acoustic data on the problem of UAV load estimation. Thus, all data were collected using several types of UAV models for each of these three main classes, i.e., “Unloaded”, “Loaded” and “Background noise”. The duration of the recorded drone sounds varied from a few seconds to over 5 min. A general view of the duration of the sounds collected for each class is given in Table 2 in seconds.

Some UAV sounds from open sources were in “stereo” mode. Some parts of the sounds of the Qazdrone, DJI Phantom 2, DJI Phantom 4, and DJI Phantom 4 Pro drones were recorded using the Apple iPad and Apple iPhone 11 and 13 built-in microphones during the experiment. All these data files were converted to 44,100 Hz and “mono” mode with a specially designed filter, run through signal envelope methods, and converted to 1-second files, which are eventually saved in special folders. As mentioned above, this work differs from previous studies on the detection of UAV sounds in that the signal pre-processing stage, which was performed by traditional methods, is partly included in the data preparation part and inside the Recognition Framework section itself. Our approach uses the Kapre method [45], which processes audio data on a GPU or CPU with the quick implementation of deep learning models. Traditional methods of sound recognition systems required preliminary processing of input signals before applying them to neural networks using additional functions to obtain their time and frequency characteristics, which took some time and effort. All collected audio data were analyzed first in the time domain and then in the frequency domain. This is due to the fact that it was necessary to consider in advance the extent of all audio files, such as the sounds of UAVs and other motorized objects, which can be confused with UAVs or other background noises. So, to distinguish UAV sounds from the background noise of motorized objects such as trains, cars, and motorcycles, sound plots were plotted in the time domain (Figure 7) and frequency domain (Figure 8) with extended background noise classes for the investigation of information domains. The reason was the need to see and analyze the intervals of their information parts before the Framework architecture.

As can be seen in the frequency domain, the information part of the UAV sounds and the background noise we need is reflected only up to 16,000 Hz. A specially created filter program was prepared to cut off up to 16,000 Hz and, furthermore training files were planned to prepare for the model with three classes for a real-time system. The sound of some types of train signals, which were background noise, had high frequencies, but our target UAV sounds did not exceed 16,000 Hz, so only this range was captured. Objects such as motorcycles, trains, and motorized cars can potentially be confusing with the sound of UAVs. They were only analyzed separately as an additional extended background sound in the time and frequency domain during the analysis step temporarily as mentioned before. In the case of the task of recognizing train sounds, it is possible to increase the frequency range. As a guideline when examining other motorized objects, we have tried to show the extended classes of background noise objects. However, according to our task, it was adapted to the sounds of the UAV.

After selecting the required frequency range at the analysis step, the sound base was again combined into three classes using a special filter that cuts off all signals below 16,000 Hz. And the features of the Mel scale were obtained in the proposed structure itself using a range of time and frequency hyperparameters from the analysis part. Temporal and frequency characteristics based on traditional methods conducted in Python are mainly implemented using libraries Librosa and Essentia [45] in many studies. Our work involves the use of the Kapre method, which is developed as Keras layers. The research work was carried out using the python program using the layers and libraries of Keras, including KAPRE. The main advantage of the Kapre method is the optimization of audio signal processing parameters. In addition, the time and frequency representations, their normalization, and data augmentation are performed in real-time on GPU or CPU. However, it is necessary to analyze the sounds of various objects in the frequency domain in order to pre-determine the parameters of the filters up to the Kapre layers due to the characteristics of their natural appearance. In order to perform this preliminary analysis task, we obtained spectra performed individually for each class in the frequency range using Fast Fourier Transform (FFT) for our time domain signals above. Most of the changes in the audio signals are observed at low frequencies, and the downsampling step is performed at a frequency of 16,000 Hz, preserving some informative areas of signals for objects of all classes. Thus, in the next step, we divided all signals into one-second chunks from the entire duration of all audio recordings, as we are trying to build a real-time system. Therefore, our audio files need to be trained using 1-second audio files. In addition, a signal envelope method was performed to eliminate dead zones in sections of one-second duration, i.e. lower values with zero thresholds were cleaned up by creating our special filter in Figure 9.

This work is approached based on the image classification method, while its neural networks have the ability to differentiate objects based on feature vectors of all classes, especially by allowing background noise to be recognized as false negatives. The spectra of the signals that have passed the above-mentioned cleaning function are shown in Figure 9. Through empirical research, the Mel-spectrogram was chosen as an effective frequency representation for a real-time recognition system. The Mel-spectrogram parameters are shown in Table 3 in the Model Preparation section.

In conclusion, this subsection collected UAV audio and background noise data and only analyzes the signals in the time and frequency domains to find the interval of the information domains. As a result, a filter was created that collects 1-second audio signals below 16000 Hz into three classes. And the frequency characteristics of the Mel scale used for the system have not been obtained in this subsection. This is because it will be taken immediately as the first layer of the proposed structure in the next subsection.

4.2. Model Preparation

This subsection attempts to build recognition models using RNN neural networks, such as SimpleRNN, LSTM, BiLSTM, and GRU. The task is performed on the basis of multiple classifications of UAV acoustic data based on the concept of computer vision. This is because the UAV audio data are entered as one-second audio data in the input and form the Mel-spectrogram layer using the Kapre method [45]. The output launches a stream of images created from frequency representations, called a Mel-spectrogram, to the next level, as shown in Table 3.

The Mel-spectrogram is a layer of Kapre that is expanded on a spectrogram by multiplying the conversion matrix of the Mel scale from linear frequencies [45], Figure 10. Our approach has studied a large range for the Mel-spectrogram layer, as shown in Table 3. As a result of the experimental attempts, 100-time vectors and 128-frequency features were obtained. These selected layers in the proposed structure were chosen for their following features, and the hyperparameters were chosen through a series of experiments: the normalization layer is implemented in these Mel-spectrograms as the second layer of the model as it normalizes two-dimensional input data per frequency, time, batch, and channel. When the input vectors have already been selected, the TimeDistributed(Reshape) layer is used to ensure that the size of the matrix provides data to the layers of the RNN. A neural dense layer with the tanh function was added to approximate the size of the matrix to the size of the RNN cells. As the fifth layer of the model structure, computational units such as SimpleRNN, LSTM (Long-Short Term Memory), BiLSTM (Bidirectional LSTM), and GRU (Gated Recurrent Networks) networks were connected with 32 cells. A concatenation layer was added after the RNN cells, and then the dense layers were connected based on the hyperparameters in Table 3. The MaxPooling1D layer was placed after the RNN layer to make the architecture easier to learn. Then, a 32-cell dense layer with a ReLU function was created. In addition, a Flatten layer was used to make the multidimensional output linear and transfer it to a dense layer. The output of the Flatten layer is carried over to the next layers for the classification task. A Dropout layer is added to the next row so that the model does not overfit and showed a good fit when evaluating the model with a coefficient of 0.2 for a 32-cell RNN. A dense layer of 32 cells with an activity regularizer and ReLU function was added before the final dense layer. It should be noted that the activity regularizer function had a great impact on the representativeness of the accuracy graphs in model training for the given UAV sounds. Therefore, this function was considered within 0.01–0.00001. The number of dropout layers was changed according to the number of RNN cells. In the case of GRU, with an increase in the number of cells to 64, a change in the dropout coefficient from 0.25 to 0.3 was considered since the model could be retrained. In the realization of the models, the “categorical cross-entropy” loss function is optimized for the multiple classification task.

The “Adam” implementation of gradient descent is used to optimize the weights, and the classification “accuracy” is calculated during the model training, validation, and testing to evaluate learning-skill and generalizing-skill, of models in comparison to all architectures. The results of these investigations are given in the Training and Evaluation Metrics subsections.

To summarize this section, the design of the proposed structure has been adapted for training with RNN training cells and additional layers by obtaining spectrograms based on the Mel scale and other hyperparameters using Table 3.

5. Experiment Results and Discussion

In this section, RNN models such as SimpleRNN, LSTM, BiLSTM, and GRU were trained through a series of experiments empirically based on the proposed structure, as in Table 3 above. After that, the CNN model structure as in Table 5 of the work [1] was trained using our data set. All details of the conducted training are discussed in the “Training” subsection.

5.1. Training

In our series of experiments, four RNN models and one CNN model were trained during training. The experiments were performed with an Intel(R) Core(TM) i5-8265U processor at 1.60 GHz using Python (Spyder). The Kapre libraries and other required layer libraries are pre-installed for processing Mel-spectrograms to be added as input layers, just like Keras layers. Initially, training was carried out with 30 epochs. The 30-epoch training case made it possible to find an area of “good fit” that was estimated at 25 epochs from his recognition accuracy plot. All models were retrained with 25 epochs afresh and saved as Pickle files with the “.h5” extension. First, a SimpleRNN model with 32 cells was trained. Subsequently, the LSTM, BiLSTM, and GRU models were trained with 32 cells and stored. The structure of the remaining layers was left unchanged when training other RNN models. Moreover, 30% of the data was given for testing and 70% for training when training the model. Graphs from the recognition accuracy data of all considered models with 32 cells were compared with each other in Table 4 and Figure 11. The structures of all layers remained unchanged during the training of all RNN models to test the ability of RNN cell type recognition skills in practice. These trained RNN models were tested in two rounds. In the first stage, the accuracy graphs of the model were studied. This is because it is necessary to determine if the trained models are robust on other audio datasets that they have not seen. In the second step, one-second 300 audio files, which were previously stored separately, were predicted in detail. These 300 files have been called the Prediction dataset. The prediction files were first tested with the confusion matrix. The focus of the confusion matrix was first the loaded class and then the unloaded UAV class. The cognitive abilities of recognition of these models were differentiated from the class of loaded UAVs. This is due to the fact that the recognition accuracy decreased in the unloaded class when testing for several types of RNNs. The results of these predictive data are discussed in the subsection Evaluation Metrics.

The data in Table 4 show the mean and weighted model accuracy of the first learning outcomes of four RNN models and one CNN model. Since these results cannot give more detailed information, the graphs of the model accuracy formed by their values from 25 epochs were compared.

The results were analyzed after the first training steps were received. Although the average recognition accuracy scores of the models in Table 4 were similar, their recognition plots were “unrepresentative” in the plot of the SimpleRNN model, as shown in Figure 11. These metrics do not mean that the results obtained will be reliable for other databases where new data are added or for previously untrained databases. This is because the recognition accuracy of the model only shows the highest recognition rate achieved when training the same database. To show the robustness of the model for other cases, it is necessary to test the recognition abilities of each class given by the confusion matrix. In addition, the CNN network showed a lower percentage of recognition than other models. From this, it can be concluded that the CNN model has a lower recognition ability than the RNN model for sounds such as drones and background noise such as trains, motorcycles, and cars. It is possible that if the CNN layer is added as multiple layers, it may have a high recognition capability. However, this study mainly looked at the experimental study of the abilities of neural layers through a single layer. Even our CNN structure was based on work [1], so two CNN layers were added. However, it had a slightly lower recognition performance than single-layer RNN models. In addition, all experiments were tested several times with at least two to three attempts. This is because it was assumed that a model trained only once could be a mere coincidence.

The accuracy of the LSTM and BilSTM models was slightly less than that of GRU and SimpleRNN. Since the recognition history plot of the SimpleRNN network was not representative, it was not considered for further study. However, further empirical research with the GRU model was continued. At the next stage, the number of GRU cells was increased to 64, and training was carried out while maintaining 25 epochs. Due to the increase in the number of cells, the value of the “activity regularization” function in the penultimate layer was searched from a different interval. The values of the “activity regularization” function were taken as L2 = 0.00001, and this value gave a “good fit” in the recognition accuracy plot in the GRU model with 64 cells. The overall accuracy of this model is shown in Table 4 as well, and the accuracy plot based on recognition history data is shown in Figure 12 along with the CNN plot.

Therefore, as can be seen from Figure 12 above, the proposed GRU model with 64 cells provides a fairly high recognition rate. The model accuracy plot also shows a “good fit”. The CNN model built as in [1] shows a gap between the accuracy of training and testing, and the overall recognition ability also shows a decrease.

In general, the ability of a recognition system based on such a multi-class classification cannot be shown with complete accuracy. Therefore, we determined the predictive ability of the models by presenting the results in an extended form with recognition accuracy parameters for each class. These additional extended predictions for each class are discussed in the next subsection.

5.2. Evaluation Metrics

Some machine learning statistics such as precision and recall, F-scores, and false positives per image [3] or per one-second audio file can be used to evaluate any approach to detect and reliably classify objects. The confusion matrix combines these indicators into one table or a confusion matrix visualization. Our proposed work is carried out to determine the practical significance of RNN networks based on multiclassification using time and frequency representations of UAV sound data. In the case of multiple classification problems, the performance of a classifier is usually determined according to the confusion matrix associated with the classifier. Based on the elements of the matrix, sensitivity (recall), specificity, and accuracy were calculated, as shown in Table 5. Our approach was evaluated using various performance metrics [3,46], including precision, recall, and F1 score.

Precision values were found by scoring False Positive (FP) objects for True Positive (TP) objects by Equation (11):

P r e c i s i o n = \frac{T_{p}}{T_{p} + F_{p}}

(14)

The recall was evaluated by the True Positive (TP) predictions in relation to (FN) False Negative predictions that are not at all classified using Equation (12):

R e c a l l = \frac{T_{p}}{T_{p} + F_{n}}

(15)

Since precision or recall do not accurately determine the predictability of a system, the F1-score, which represents the average of the values, was determined using Equation (10) in the 13th equation

F 1 = \frac{2 * P r e c i s i o n * R e c a l l}{P r e c i s i o n + R e c a l l}

(16)

These evaluation methods were performed on a prediction dataset (300 files), and each class was adjusted to 100 files in order to be visually analyzed in Table 5.

Thus, almost all types of RNN models show a very high recognition ability when estimating the background noise class. While the CNN model showed high accuracy in the background noise class, it performed slightly lower than the RNN models. It can be seen from this that although CNN models are capable of general recognition problems, they have the property of lower recognition skills than RNN networks in the case of the same object sounds but when they have specific states. Almost all RNN cells have a high ability to recognize engine-based objects and show high recognition ability from a single layer. In addition, besides the capabilities of the RNN network, the structure of the model plays a unique role here. As can be seen from Table 3, the dense layer before the RNN model was set by the “tanh” activation function. Even after the RNN layer, dense layers were provided with a “ReLU” function. A dropout layer has also been added to prevent overtraining. As a result, optimal recognition results are achieved. Simple RNN, LSTM, and BiLSTM were not able to show a consistently high sound recognition rate in loaded and unloaded UAV cases, as shown by the analysis of RNN models in Table 5. As we can see above in the recall, which shows True Positive, the recognition rate fell by 4–5% for loaded and unloaded UAVs. The 32-cell GRU showed the highest performance for the class of loaded UAVs, which is the main focus. By increasing its number of cells to 64, better results were obtained for all classes. The conclusion to be drawn from this is that the GRU cells are very capable of detecting different states of the sounds of the same object. CNN models have shown that they handle binary classification well with more layers. However, all types of RNNs performed binary and multiple sound classification tasks with a higher stable recognition ability than CNNs. In this paper, it is concluded that the GRU model is considered an effective method in the case of recognizing the sounds of objects when they have different states. The confusion matrix obtained from the results of 64 cells of the GRU model is shown in Figure 13.

The confusion matrix was created to clearly see correctly and falsely predicted audio files. In the classes “loaded UAV” and “unloaded UAV”, which were the main focus, there was a fairly high level of recognition of the proposed model, as shown in Figure 13.

Based on the results of the experimental section, a data set with three classes was fed to the input of the classifier. To determine the predictability of the selected classifier by ensuring that each individual class is analyzed correctly, their actual predicted audio files and false negatives were presented with a confusion matrix using balanced 100 audio files per class. The experimental results were with the GRU model with 64 cells, which in turn showed the stability of the recognition skill based on the proposed database.

5.3. Practical Applications

This research work was intended to show the skills of the RNN network by substantiating the theoretical foundations with their experimental studies. The neuronal cell types of Recurrent Neural Networks have not previously been fully analyzed in a single study. Our study is proposed as a basis for the recognition and processing of sounds in the case of the detection of other types of objects. The proposed structure itself can serve for other research tasks from this direction. In addition, the developed system is recommended for use in the state security system of countries, in the protection of strategically important buildings, in ensuring security in densely populated areas, and for reliable systems based on a bi- or multi-modal method.

6. Conclusions

In this study, we investigated architectures such as SimpleRNN, LSTM, bidirectional LSTM, and GRU based on a practical application for a real-time UAV sound recognition system. We paid special attention to determine whether the UAVs were loaded or unloaded. In general, UAV sounds were collected into three main classes: loaded UAVs unloaded UAVs, and background noises, including the sound of other engine-based objects from the scene. In the next stage, we tried to assess the level of accuracy of the UAV recognition system using all the metrics of multiple class classification tasks. As a result, the GRU architecture (64) was found to be an efficient model with a high degree of predictability. According to our goal, this model can recognize loaded UAVs and unloaded UAVs with 98% accuracy and background noise with 99% accuracy. This evaluation ensures the ability of the UAV sound recognition system at a reliable level, suggesting the creation of an array of acoustic sensors using the proposed GRU model (64). Other RNN network architectures are robust for binary classification problems. Thus, based on the predictions, SimpleRNN, LSTM, BiLSTM, and GRU architectures can be used in the UAV load detection task due to their better content-based recognition ability than CNN models. The CNN model could only detect cases of binary object classification and was slightly smaller than RNN networks. Within this study, our work contributed to the following solutions:

(1) The hyperparameter size of the “good fit” Mel-spectrogram for UAV sounds was developed during an empirical experimental study.

(2) The structure of RNN networks based on the “best fit” hyperparameters and fewer layer-based structures has been developed for real-time systems.

(3) A real-time recognition system was adapted for UAV states when they have different masses of weights.

(4) For the first time, all RNN cell types were examined in one study to provide guidance on the choice of these cell types for other types of sounds.

This work was limited by the low amount of acoustic data of the loaded UAVs. However, this work has shown that UAV loads can be determined and estimated in real-time. The future work of our study could focus on two areas: first, increasing the loaded UAV’s audio dataset and exploring in detail the distance-based detection problem.

Author Contributions

Conceptualization, D.U., L.I. and E.T.M.; methodology, D.U.; software, D.U.; validation, D.U., L.I. and E.T.M.; formal analysis, D.U.; investigation, D.U.; resources, D.U. and E.T.M.; data curation, D.U. and E.T.M.; writing—original draft preparation, D.U.; writing—review and editing, D.U.; visualization, D.U.; supervision, E.T.M. and L.I. All authors have read and agreed to the published version of the manuscript.

Funding

This research was prepared within the confines of the Zhas Galym Project of the Republic of Kazakhstan, AP14971907. The conclusions drawn here are the sole responsibility of the authors.

Data Availability Statement

Not Applicable.

Acknowledgments

The authors would like to thank J.C Gallagher for kindly providing the first initial set of UAV audio data.

Conflicts of Interest

The authors declare no conflict of interest.

References

Li, S.; Kim, H.; Lee, S.D.; Gallagher, J.C.; Kim, D.; Park, S.; Matson, E.T. Convolutional Neural Networks for Analyzing Unmanned Aerial Vehicles Sound. In Proceedings of the 18th International Conference on Control, Automation, and Systems (ICCAS), PyeongChang, Republic of Korea, 17–20 October 2018; pp. 862–866. [Google Scholar]
Lim, D.; Kim, H.; Hong, S.; Lee, S.; Kim, G.; Snail, A.; Gotwals, L.; Gallagher, J.C. Practically Classifying Unmanned Aerial Vehicles Sound Using Convolutional Neural Networks. In Proceedings of the 2018 Second IEEE International Conference on Robotic Computing (IRC), Laguna Hills, CA, USA, 31 January–2 February 2018; pp. 242–245. [Google Scholar]
Seidaliyeva, U.; Akhmetov, D.; Ilipbayeva, L.; Eric, M.T. Real-Time and Accurate Detection in a Video with a Static Background. Sensors 2020, 20, 3856. [Google Scholar] [CrossRef] [PubMed]
Vemula, H.C. Multiple Drone Detection and Acoustic Scene Classification with Deep Learning; Wright State University: Dayton, OH, USA, 2018. [Google Scholar]
Taha, B.; Shoufan, A. Machine Learning-Based Drone Detection and Classification: State-of-the-Art in Research. IEEE Access 2019, 7, 138669–138682. [Google Scholar] [CrossRef]
Utebayeva, D.; Almagambetov, A.; Alduraibi, M.; Temirgaliyev, Y.; Ilipbayeva, L.; Marxuly, S. Multi-label UAV sound classification using Stacked Bidirectional LSTM. In Proceedings of the 2020 Fourth IEEE International Conference on Robotic Computing (IRC), Taichung, Taiwan, 9–11 November 2020; pp. 453–458. [Google Scholar] [CrossRef]
Utebayeva, D.; Alduraibi, M.; Ilipbayeva, L.; Temirgaliyev, Y. Stacked BiLSTM-CNN for Multiple label UAV sound classification. In Proceedings of the 2020 Fourth IEEE International Conference on Robotic Computing (IRC), Taichung, Taiwan, 9–11 November 2020; pp. 470–474. [Google Scholar] [CrossRef]
Dumitrescu, C.; Minea, M.; Costea, I.M.; Cosmin Chiva, I.; Semenescu, A. Development of an Acoustic System for UAV Detection. Sensors 2020, 20, 4870. [Google Scholar] [CrossRef] [PubMed]
Al-Emadi, S.; Al-Ali, A.; Mohammad, A.; Al-Ali, A. Audio Based Drone Detection and Identification using Deep Learning. In Proceedings of the 2019 15th International Wireless Communications and Mobile Computing Conference (IWCMC), Tangier, Morocco, 24–28 June 2019; pp. 459–464. [Google Scholar] [CrossRef]
Houthi Drone Crashes into Saudi School in Asir Province, 14 June 2021. Available online: https://thearabweekly.com/houthi-drone-crashes-saudi-school-asir-province (accessed on 20 December 2022).
Mircea, C. Light Show in China May Have Been Sabotaged, Dozens of Drones Fell from the Sky. 5 October 2021. Available online: https://www.autoevolution.com/news/light-show-in-china-may-have-been-sabotaged-dozens-of-drones-fell-from-the-sky-170962.html (accessed on 20 December 2022).
Brownlee, J. What are LSTMs. In Long Short-Term Memory Networks with Python; Edition v1.6.; 2019; pp. 10–18. Available online: https://machinelearningmastery.com/lstms-with-python/ (accessed on 20 December 2022).
Nijim, M.; Mantrawadi, N. Drone classification and identification system by phenome analysis using data mining techniques. In Proceedings of the EEE Symposium on Technologies for Homeland Security (HST), Waltham, MA, USA, 10–11 May 2016; pp. 1–5. [Google Scholar]
Bernardini, A.; Mangiatordi, F.; Pallotti, E.; Capodiferro, L. Drone detection by acoustic signature identification. Electron. Imaging 2017, 29, 60–64. [Google Scholar] [CrossRef]
Kim, J.; Park, C.; Ahn, J.; Ko, Y.; Park, J.; Gallagher, J.C. Real-time UAV sound detection and analysis system. In Proceedings of the IEEE Sensors Applications Symposium (SAS), Glassboro, NJ, USA, 13–15 March 2017; pp. 1–5. [Google Scholar]
Shi, L.; Ahmad, I.; He, Y.; Chang, K. Hidden Markov model based drone sound recognition using MFCC technique in practical noisy environments. J. Commun. Netw. 2018, 20, 509–518. [Google Scholar] [CrossRef]
Siriphun, N.; Kashihara, S.; Fall, D.; Khurat, A. Distinguishing Drone Types Based on Acoustic Wave by IoT Device. In Proceedings of the 22nd International Computer Science and Engineering Conference (ICSEC), Chiang Mai, Thailand, 21–24 November 2018; pp. 1–4. [Google Scholar]
Anwar, M.Z.; Kaleem, Z.; Jamalipour, A. Machine Learning Inspired Sound-Based Amateur Drone Detection for Public Safety Applications. IEEE Trans. Veh. Technol. 2019, 68, 2526–2534. [Google Scholar] [CrossRef]
Liu, H.; Wei, Z.; Chen, Y.; Pan, J.; Lin, L.; Ren, Y. Drone Detection Based on an Audio-Assisted Camera Array. In Proceedings of the IEEE Third International Conference on Multimedia Big Data (BigMM), Laguna Hills, CA, USA, 19–21 April 2017; pp. 402–406. [Google Scholar]
Jeon, S.; Shin, J.; Lee, Y.; Kim, W.; Kwon, Y.; Yang, H. Empirical study of drone sound detection in real-life environment with deep neural networks. In Proceedings of the 25th European Signal Processing Conference (EUSIPCO), Kos, Greece, 28 August–2 September 2017; pp. 1858–1862. [Google Scholar]
Yang, B.; Matson, E.T.; Smith, A.H.; Dietz, J.E.; Gallagher, J.C. UAV Detection System with Multiple Acoustic Nodes Using Machine Learning Models. In Proceedings of the Third IEEE International Conference on Robotic Computing (IRC), Naples, Italy, 25–27 February 2019; pp. 493–498. [Google Scholar]
Seo, Y.; Jang, B.; Im, S. Drone Detection Using Convolutional Neural Networks with Acoustic STFT Features. In Proceedings of the 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, 27–30 November 2018. [Google Scholar]
Yamada, K.; Kumon, M.; Furukawa, T. Belief-Driven Control Policy of a Drone with Microphones for Multiple Sound Source Search. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 5326–5332. [Google Scholar]
Basiri, M.; Schill, F.; Lima, P.; Floreano, D. On-Board Relative Bearing Estimation for Teams of Drones Using Sound. IEEE Robot. Autom. Lett. 2016, 1, 820–827. [Google Scholar] [CrossRef]
Fernandes, R.P.; Apolinario, J.A.; Ramos, A.L.L. Bearings-only aerial shooter localization using a microphone array mounted on a drone. In Proceedings of the IEEE 8th Latin American Symposium on Circuits and Systems (LASCAS), Bariloche, Argentina, 20–23 February 2017; pp. 1–4. [Google Scholar]
Wang, L.; Cavallaro, A. Acoustic Sensing From a Multi-Rotor Drone. IEEE Sens. J. 2018, 18, 4570–4582. [Google Scholar] [CrossRef]
Yue, X.; Liu, Y.; Wang, J.; Song, H.; Cao, H. Software defined radio and wireless acoustic networking for amateur drone surveillance. IEEE Commun. Mag. 2018, 56, 90–97. [Google Scholar] [CrossRef]
Misra, P.; Kumar, A.A.; Mohapatra, P.; Balamuralidhar, P. Aerial Drones with Location-Sensitive Ears. IEEE Commun. Mag. 2018, 56, 154–160. [Google Scholar] [CrossRef]
Ruiz-Espitia, O.; Martinez-Carranza, J.; Rascon, C. AIRA-UAS: An Evaluation Corpus for Audio Processing in Unmanned Aerial System. In Proceedings of the International Conference on Unmanned Aircraft Systems (ICUAS), Dallas, TX, USA, 12–15 June 2018. [Google Scholar]
Misra, P.; Kumar, A.A.; Mohapatra, P.; Balamuralidhar, P. DroneEARS: Robust Acoustic Source Localization with Aerial Drones. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018. [Google Scholar]
Mezei, J.; Fiaska, V.; Molnar, A. Drone Sound Detection. In Proceedings of the 16th IEEE International Symposium on Computational Intelligence and Informatics (CINTI), Budapest, Hungary, 19–21 November 2015; pp. 333–338. [Google Scholar]
Mezei, J.; Molnar, A. Drone Sound Detection by Correlation. In Proceedings of the 2016 IEEE 11th International Symposium on Applied Computational Intelligence and Informatics (SACI), Timisoara, Romania, 12–14 May 2016; pp. 509–518. [Google Scholar]
Chang, S.J.; Li, K.W. Visual and Hearing Detection Capabilities to Discriminate whether a UAV Invade a Campus Airspace. In Proceedings of the 5th International Conference on Industrial Engineering and Applications, Singapore, 26–28 April 2018. [Google Scholar]
Main Types of Neural Networks and Its Applications—Tutorial. 13 July 2020. Available online: https://towardsai.net/p/machine-learning/main-types-of-neural-networks%-and-its-applications-tutorial-734480d7ec8e (accessed on 20 December 2022).
Wang, Y.; Liao, W.; Chang, Y. Gated Recurrent Unit Network-Based Short-Term Photovoltaic Forecasting. Energies 2018, 11, 2163. [Google Scholar] [CrossRef] [Green Version]
Salehinejad, H.; Sankar, S.; Barfett, J.; Colak, E.; Valaee, S. Recent Advances in Recurrent Neural Networks, arXiv:1801.01078v3 [cs.CV]. Available online: https://arxiv.org/pdf/1801.01078.pdf (accessed on 22 February 2018).
Deep Feed Forward Neural Networks and the Advantage of ReLU Activation Function. Available online: https://towardsdatascience.com/deep-feed-forward-neural-networks-and-the-advantage-of-relu-activation-function-ff881e58a635 (accessed on 20 December 2022).
Sak, H.; Senior, A.W.; Senior, A.W. Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling. In Proceedings of the Conference of the International Speech Communication Association, Singapore, 1 January 2014. [Google Scholar]
Thakur, D. LSTM and Its Equations, Google Summer of Code. 2019. Available online: https://medium.com/@divyanshu132/lstm-and-its-equations-5ee9246d04af (accessed on 20 December 2022).
Available online: https://www.i2tutorials.com/technology/deep-dive-into-bidirectional-lstm/ (accessed on 20 December 2022).
Cho, K.; Van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar]
Sak, H.; Senior, A.; Beaufays, F. Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling. Int. J. Speech Technol. 2019, 22, 21–30. [Google Scholar]
Huzaifah, M. Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks. arXiv 2017, arXiv:1706.07156. [Google Scholar]
Robert, G.; Ruqiang, Y. Non-stationary signal processing for bearing health monitoring. Int. J. Manuf. Res. 2006, 1, 18–40. [Google Scholar] [CrossRef]
Choi, K.; Joo, D.; Kim, J. Kapre: On-GPU Audio Preprocessing Layers for a Quick Implementation of Deep Neural Network Models with Keras. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017. [Google Scholar]
Iwendi, C.; Khan, S.; Anajemba, J.H.; Mittal, M.; Alenezi, M.; Alazab, M. The use of Ensemble Models for Multiple Class and Binary Class Classification for Improving Intrusion Detection Systems. Sensors 2020, 20, 2559. [Google Scholar] [CrossRef]

Figure 1. Structure of the SimpleRNN network.

Figure 2. LSTM architecture (a) General architecture of LSTM structure (b) The computational cell of LSTM network.

Figure 3. Single layer BiLSTM architecture.

Figure 4. GRU network cell architecture.

Figure 5. Proposed real-time UAV sound recognition framework.

Figure 6. UAV sound data preparation: (a) Microphone placement and recording time; (b) Simulation of drone loading; (c) Loaded drone flight.

Figure 7. Time series for the UAV dataset in extended classes.

Figure 8. Extended UAV sound dataset in the frequency domain.

Figure 9. Result of one-second data preparation function with the signal envelope method.

Figure 10. Time–frequency representations for UAV sound dataset.

Figure 11. Model accuracy history plots for RNN models: solid line—training history, and the dotted line—test history.

Figure 12. Model accuracy history plot comparison of the proposed model GRU and CNN: solid line—training history, and the dotted line—test history.

Figure 13. Prediction with confusion matrix using the proposed GRU (64) framework.

Table 1. Technical parameters of UAV models and UAV loading simulations.

Model	Load Parameter (in kg)	Flying Distance (in meters)
DJI Phantom I	-	2–100
DJI Phantom II	0.5	2–100
DJI Phantom IV, IV Pro	0.5	2–100
Syma x20	0.425	100
Tarantula x6		2–50
Syma x5c	-	2–90
DJI Phantom III, IV, DJI Phantom IV Pro,	-	from Open Datasets
Air Hogs, quadcopter, Syma x5, and others
DJI Phantom III, IV Pro	0.454–1.36	1–40
Mavic Pro	0.156–0.256	2–8
QazDrone		2–50

Table 2. Composition of UAV sound dataset.

Name of Classes	Total Duration (s)	Duration for Train Set (s)	Duration for Prediction Set (s)
	7612	7312	300
Loaded UAV	1513	1413	100
Unloaded UAV	3334	3234	100
Ambient Noise	2765	2665	100

Table 3. Hyperparameter optimization of proposed models.

Layers	Parameter	Best Fit	Range
Mel-spectrogram	Sampling rate	16,000 Hz	6000–44,100 Hz
	Window length	512	512, 1024
	Hop length	160	160, 256
	Number of Mels	128	40–128
	(Frequency, Time)	128 ∗ 100
LayerNormalization	Batch Normalization
Reshape	TimeDistributed (Reshape)
Dense	TimeDistributed (Dense), tanh	64	32–128
SimpleRNN, LSTM,	RNN cells (Fmaps)	GRU (64)	32–64
BiLSTM, GRU
concatenate
Dense	Dense, ReLU	(64)	(32–128)
MaxPooling	MaxPooling1D
Dense	Dense, ReLU	(32)	(32–128)
Flatten
Dropout	Dropout	0.25	0.2–0.3
Dense	Dense, ReLU	32	8–32
	activity regularizer	0.01	0.01–0.00001
	activity regularizer	0.00001 for GRU (64)	0.01–0.00001
Dense	Dense	(# classes) 3	(3, 4)
	Activation in classification	softmax
	Optimization solver	adam	sgdm, adam
	# epochs	25	25–150

Table 4. Comparison of model accuracy of the SimpleRNN, LSTM, BiLSTM, GRU, and CNN models on 128-100-dimensional Mel-spectrograms.

Model	Accuracy
SimpleRNN	98
LSTM	97
Bidirectional LSTM (BiLSTM)	97
GRU with 32	98
CNN structure as in [8]	94
GRU with 64 cells	98

Table 5. The performance of the models and their prediction metrics.

Model	Classes	Precision, %	Recall, %	F1-Score, %
SimpleRNN
	Background noise	99	100	100
	Loaded UAV	100	96	98
	Unloaded UAV	96	99	98
LSTM
	Background noise	96	100	98
	Loaded UAV	98	97	97
	Unloaded UAV	98	95	96
BiLSTM
	Background noise	97	100	99
	Loaded UAV	99	95	97
	Unloaded UAV	96	97	97
GRU with 32 cell
	Background noise	99	99	99
	Loaded UAV	97	98	98
	Unloaded UAV	97	96	96
CNN as in [8]
	Background noise	99	95	97
	Loaded UAV	95	89	92
	Unloaded UAV	90	99	94
GRU with 64 cell
	Background noise	99	99	99
	Loaded UAV	99	98	98
	Unloaded UAV	97	98	98

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Utebayeva, D.; Ilipbayeva, L.; Matson, E.T. Practical Study of Recurrent Neural Networks for Efficient Real-Time Drone Sound Detection: A Review. Drones 2023, 7, 26. https://doi.org/10.3390/drones7010026

AMA Style

Utebayeva D, Ilipbayeva L, Matson ET. Practical Study of Recurrent Neural Networks for Efficient Real-Time Drone Sound Detection: A Review. Drones. 2023; 7(1):26. https://doi.org/10.3390/drones7010026

Chicago/Turabian Style

Utebayeva, Dana, Lyazzat Ilipbayeva, and Eric T. Matson. 2023. "Practical Study of Recurrent Neural Networks for Efficient Real-Time Drone Sound Detection: A Review" Drones 7, no. 1: 26. https://doi.org/10.3390/drones7010026

APA Style

Utebayeva, D., Ilipbayeva, L., & Matson, E. T. (2023). Practical Study of Recurrent Neural Networks for Efficient Real-Time Drone Sound Detection: A Review. Drones, 7(1), 26. https://doi.org/10.3390/drones7010026

Article Menu

Practical Study of Recurrent Neural Networks for Efficient Real-Time Drone Sound Detection: A Review

Abstract

1. Introduction

2. Related Studies

3. Methods

3.1. LSTM

Bidirectional LSTM

3.2. Gated Recurrent Unit (GRU) Networks

3.3. Single-Layer (1L) RNNs

3.4. Stacked RNNs

4. Proposed Real-Time System

4.1. Data Preparation

4.2. Model Preparation

5. Experiment Results and Discussion

5.1. Training

5.2. Evaluation Metrics

5.3. Practical Applications

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI