Gunshot Airborne Surveillance with Rotary Wing UAV-Embedded Microphone Array

Unmanned aerial vehicles (UAV) are growing in popularity, and recent technological advances are fostering the development of new applications for these devices. This paper discusses the use of aerial drones as a platform for deploying a gunshot surveillance system based on an array of microphones. Notwithstanding the difficulties associated with the inherent additive noise from the rotating propellers, this application brings an important advantage: the possibility of estimating the shooter position solely based on the muzzle blast sound, with the support of a digital map of the terrain. This work focuses on direction-of-arrival (DoA) estimation methods applied to audio signals obtained from a microphone array aboard a flying drone. We investigate preprocessing and different DoA estimation techniques in order to obtain the setup that performs better for the application at hand. We use a combination of simulated and actual gunshot signals recorded using a microphone array mounted on a UAV. One of the key insights resulting from the field recordings is the importance of drone positioning, whereby all gunshots recorded in a region outside a cone open from the gun muzzle presented a hit rate close to 96%. Based on experimental results, we claim that reliable bearing estimates can be achieved using a microphone array mounted on a drone.


Introduction
The interest in automatic sniper localization systems traces back to the early 1990s, pioneered by countries such as the United States of America, Russia, Canada, France, and more recently, Israel, among others. Such surveillance systems for shooter detection and localization can be useful to the police and military forces [1,2]. The shooter detection and localization problem can be approached in different ways, depending on the kind of signatures from a gunshot event, acoustic or electromagnetic, that one decides to process [3]. For instance, cameras can be used to detect the muzzle flash [4], whereas microphone arrays can be used to detect the muzzle blast and the shockwave acoustic signatures. If these two acoustic signatures are detected in the same gunshot event, one can estimate the location of the shooter using a two-step procedure [3].
The successful use of microphone arrays to tackle the direction-of-arrival (DoA) estimation problem even with low signal-to-noise ratio (SNR) can be seen in Reference [5]. In this paper, median filtering is used to enhance the collected acoustic gunshot signals. In Reference [6], an algorithm that optimizes DoA estimation using exhausive search through consistent fundamental loops is introduced. This method, in an attempt to have the best approach for the level of noise of audio signals, is a combination of standard DoA estimation, Exhaustive Search (ES) [7], and consistent fundamental loop [6].
Microphone arrays can be deployed in different platforms, e.g., stand-alone systems mounted on vehicles [8], on light posts in urban areas or on trees in a forest [9]. All these systems are currently subjects of great interest in academia and are recently associated with the internet of things (IoT) industry [10] as well. However, one such system based on a microphone array mounted on an aerial drone brings additional advantages owing to its flexibility to cover wider areas relatively quicker and at a lower cost. It also opens the opportunity for new important applications, such as search-and-rescue missions [11,12] and environmental monitoring [9]. In Reference [11], a microphone array mounted on a drone is used to detect a narrowband signal generated by a whistle, which can be very effective in search-and-rescue missions in areas of difficult access.
An application example for environmental monitoring is presented in Reference [9], suggesting the use of an open hardware to be deployed in the forest to record audio signals and using a Secure Digital (SD) card to store the data. These signals vary from bat ultrasounds to gunshot signals. For instance, in the case of detecting gunshot events in protected areas, people responsible for monitoring those areas could be triggered to carry out necessary actions against poaching. Hoshiba et al. presented detailed design and implementation of a quadcopter-embedded microphone array system for outdoor environments [12].
In order to enable new drone applications, the scientific community has developed an interest for new techniques capable of tackling the strong ego-noise presented in audio recordings from unmanned aerial vehicle (UAV)-embedded microphone arrays, especially when the target sound is a whistle or human speech. Methods based on Multiple Signal Classification (MUSIC), known to be very robust against noise, are presented in Reference [13]. The Generalized Eigenvalue Decomposition-MUSIC (GEVD-MUSIC) [14,15] is reported to have high performance even for low-SNR signals. The incremental Generalized Eigenvalue Decomposition-MUSIC (iGEVD-MUSIC) introduced in Reference [16] estimates the noise correlation matrix incrementally to cope with the high non-stationarity of the ego-noise. A supervised approach that uses UAV sensors data and motor rotation speeds to estimate the noise correlation matrix was proposed in Reference [17]. Aiming at reducing computational complexity and errors associated with inaccuracies in noise correlation matrix estimation, Reference [18] proposes the Multiple Signal Classification based on incremental Generalized Singular Value Decomposition-MUSIC (iGSVD-MUSIC) with Correlation Matrix Scaling (CMS).
A novel algorithm to sound source location with UAV-embedded microphone arrays based on time-frequency bins was proposed in Reference [19]. This method takes advantage of the fact that ego-noise and target sound (e.g., speech or emergency whistle) mainly consist of harmonic components that usually occupy different time-frequency bins. In Reference [20], the time-frequency technique is associated with a time-frequency spatial filter to enhance the signal of interest. Other interesting researches related to sound processing with drones include the following: a study about the ego-noise of multirotors micro aerial vehicles [21], that also proposes the use of Blind Source Separation (BSS) algorithm to suppress it; the use of ego-noise to measure the relative directions between multirotors in a swarm [22]; and the ability to track moving sound sources [23].
Focusing on gunshot airborne surveillance, the deployment of acoustic sensors in elevated platforms could enable advantages for shooter-position estimation, according to Reference [24]. The use of an aerial drone as a mobile elevated platform was investigated in Reference [25], using only simulations. Different noise levels were synthesized using drone noise recordings in a silent room and real gunshot signals recorded with a high-quality microphone mounted on a tripod in an open field. Also in this work, signal enhancement techniques were employed along with DoA estimation algorithms and target motion analysis was used to estimate the shooter's position. In Reference [26], the geometry deployment of the microphone array is discussed, taking into account the wind produced by the propellers, the electromagnetic interference, and the scarce space available on the drone.
In this paper, we focus on the details of DoA estimation of gunshot signals obtained from a microphone array aboard a flying drone. We used simulations to investigate the performance of preprocessing and DoA estimation techniques, also tuning theirs parameters. The most appropriated methods were evaluated with actual field recordings, given the position and attitude of the drone obtained from its GPS and inertial unit.
The rest of this paper is organized as follows. Section 2 starts with a brief overview on gunshot acoustics, followed by a discussion on the techniques used in a UAV-based gunshot surveillance system, namely signal preprocessing, gunshot detection, DoA estimation, and shooter localization. Section 3 describes the hardware used and the shooting site as well as how telemetry data is recovered and used, while Section 4 discusses experimental results from simulations and from actual gunshot signals collected using an array of sensors mounted on a flying drone. The discussion and conclusion are addressed in Section 5.

DoA Estimation and Shooter Localization
The acoustic signatures generated by a gunshot event can be divided into three parts, namely the muzzle blast, the shockwave, and sounds related to mechanical actions, which include the trigger and hammer mechanisms, ejection of spent cartridges, and the loading system. The mechanical action-related sounds can be useful in forensics analysis [27,28]. However, they are of no interest in the design of sniper localization systems, since they can only be recorded using sensors placed close to the gun.
The muzzle blast is generated by the expansion of gases in the gun barrel and is louder in the direction the barrel is pointing toward [29,30]. It propagates at the speed of sound and lasts typically from 3 to 5 ms [31]. The energy of the muzzle blast depends on the firearm used, and it is almost always audible in a given range, provided that silencers or suppressors are not used [31,32].
A shockwave will be generated for as long as a projectile is travelling faster than the speed of sound and propagates outwards from the bullet trajectory at an angle known as the Mach angle [33]. The shockwave generated by a typical supersonic bullet lasts for approximately 200 µs, and its frequency spectrum has a wider frequency band than that of the muzzle blast, as exemplified in Figure 1. Since the shockwave propagates in a cone shape following the bullet trajectory, it cannot be detected if the bullet is moving away from the position where the sensors are located [34]. This constitutes a problem for shooter localization systems that rely on the detection of both shockwave and muzzle blast signals.
The shooter's localization problem may be divided in four steps, namely preprocessing, gunshot detection and muzzle blast identification, DoA estimation, and shooter-position estimation.

Preprocessing
Localization of a shooter based on the audio acquired from a drone is especially challenging due to the presence of strong ego-noise, mainly generated by the propellers [35]. This becomes an even greater challenge under long-range shootings, whereby the detection and the direction-of-arrival estimation of the muzzle blast signal is compromised.
When signals of interest are approximately stationary, such as tiny voice snippets, whistles, or white noise, methods based on noise correlation matrix estimation, such as Wiener Filtering [36], are used. For impulsive signals, as in the case of gunshot signals, median filtering is an alternative option [5]. In this work, we evaluate the performance of these methods in the task of DoA estimation.
During steady parts of a flight, where stationarity can be assumed [36], we may use Wiener filtering to attenuate the influence of the ego-noise. We used in this work the implementation developed by Liu Ming and Pascal Scalart [37,38]. This Wiener filter, referred to as Two-Step Noise Reduction (TSNR), uses the decision-directed approach [39] to track a priori SNR and refines the SNR estimation to avoid reverberation effects.
Median filtering was employed in Reference [40] as a technique to separate percussive and harmonic components of a signal. The proposed scheme uses the concept that percussive components can be seen as outliers in the time domain while harmonic sounds can be seen as outliers in the frequency domain. Median filtering, described next, is capable of removing these outliers and of separating these different acoustic signatures to some extent. Given the input x(k), the output y(k) is the median value of the window with length ∆ centered in x(k). The parameter should be chosen accordingly, in such a way that the expected duration of the artifacts is removed but without significant impact on the signal of interest. The median filtering can be expressed according to the following: if ∆ is even, the median is defined as the mean of the two central median values. The use of median filtering to estimate background noise embedded in gunshot signals was introduced in Reference [5], which computes the enhanced signal as s(k) = x(k) − y(k). To preserve the muzzle blast's shape, ∆ should not represent less than half of its duration, which is approximately 3 ms [31].

Gunshot Detection
As previously noted, a gunshot surveillance system must be able to detect an impulsive signal and to identify if it is a muzzle blast component, a shockwave component, or none of them. There is a vast literature available about this matter [41][42][43][44][45][46].
The method in Reference [41] uses a transient detection, introduced in Reference [42], that looks for significant changes in the signal energy. For muzzle blast and shockwave classification, Reference [41] uses two tests: the first one is based on the spectral information of the signal, and the second one uses time difference of arrival between neighboring peaks.
A detection scheme based on correlation against a template is proposed in Reference [43], where the authors claim that the method could be implemented by a low power consuming hardware. Correlation against a template is also addressed in Reference [44], where it is compared against classical algorithms usually used in speech processing; their results conclude that correlation matches the performance of those algorithms, especially in noisy environments. In Reference [45], linear predictive coding (LPC) coefficients are combined with template matching to increase the performance of gunshot detection systems, especially regarding false-positive errors.
A wavelet-based approach [46] can be used to distinguish three acoustic events: muzzle blast, shockwave, and reflections. Furthermore, according to the authors, this method can classify the caliber based on the muzzle blast or on the shockwave signals.
The strong ego-noise of a drone is not white and is highly nonstationary [36]. Furthermore, it is strongly dependent on the drone used and on the positioning of the sensors. These are additional challenges to detection and muzzle blast-shockwave classification. These tasks were carried out manually in this work.

DoA Estimation Methods
In this section, we first define DoA angles and then present two DoA estimation methods: a data selection least squares method [7] and an angular spectrum-based method named the Multi-channel Blind Source Separation (MBSS) Locate [47]. Figure 2 shows the angles (azimuth φ and zenith θ) that define the DoA. It is noteworthy that the azimuth herein is taken counterclockwise, as in Reference [48]. Thus, the unit vector in the direction of sound wave propagation is given as follows: (2) Figure 2. Azimuth (φ) and zenith (θ) relative to the center of the array: The x axis is oriented to the front of the drone, and the z axis points upwards.

The Data Selection Least Squares DoA Estimation Algorithm
The first step of the least squares (LS) method is the time-delay estimation (TDE) between the sensor pairs in the array. Next, we use an LS cost function associated with a data-selective algorithm. The TDEs are obtained from the peak of the cross-correlation r x i x j (τ) defined as follows [49]: where E denotes the expectation operator and τ is the lag between two given sensors, x i and x j . In practice, we do not have statistical knowledge of the signals, and Equation (3) is usually approximated by its time average given by the following: where * is the convolution operator.
Taking the discrete Fourier transform ofr x i x j (τ) and assuming real-valued signals, we can write the cross power spectrum density between x i (k) and x j (k) as follows: The cross-correlation can then be computed using the following: Adding a frequency weighting function in Equation (6), we have the generalized cross-correlation (GCC) as follows:r where classical cross-correlation corresponds to ψ(w) = 1, ∀ w. A popular weighting scheme employed by the GCC is the phase transform (PHAT) [7,36,[49][50][51][52][53], known to have good performance in reverberating scenarios [49]. PHAT also tends to have a sharper peak than classical GCC, increasing the performance of the TDE [50]. The PHAT weighting function is given by the following [53]: Finally, the TDE is obtained as follows: where τ max is the maximum delay possible (in number of samples) between microphone i and j, which occurs when the DoA has the same direction of the vector that connects sensors i and j: where p i and p j are the position vectors of sensors i and j, v s is the speed of sound, and f s is the sampling frequency. The TDEs using inverse Fourier transform (iFFT) provide delays as integer multiples of the sampling period; this leads to errors that are particularly relevant in small arrays (small time delays between sensors) and with low sampling frequency. To mitigate this source of errors, we can interpolate the GCC, allowing more accurate estimations of the time difference of arrival (TDoA). In this work, we used cubic interpolation [54], calculated at every point in between −τ max and τ max , ensuring that all possible values of delay are covered. Figure 3 shows the effect of cubic interpolation over GCC-PHAT in a small array for a signal with f s = 8 kHz. Note thatr x i x j (τ) is the GCC-PHAT without interpolation. Figure 4 illustrates in 2-D the delay between microphones i and j. In a 3-D scenario, we write d ij = ∆p T i,j a DoA such that the TDE (in samples) is given by the following: where Based on the estimated delay, as given in Equation (9), and the delay based on the unknown vector a DoA , Equation (11), we define the least squares cost function: for all possible pairs, N = M(M − 1)/2 for the case of M microphones.
Minimizing the cost function with respect to a DoA , we find the following: where d = ∆p T τ, τ = [τ 1,2 τ 1,3 ... τ 1,N τ 2,3 ... τ M−1,M ] T and R = ∆p T ∆p, ∆p are assembled as follows: The solution provided by Equation (13) may not have unit norm, which must be ensured through normalization. Only then could azimuth and zenith be calculated using trigonometric operations, according to Equation (2).
Equation (13) provides all three coordinates only when using a spatial array. If a planar array is used, ambiguity occurs and matrix R is singular. When all sensors are in a plane (xy-plane for instance), we must adapt the sensor positions (p i ) to suppress the coordinate associated with the perpendicular axis, in our case z. This way R is non-singular and Equation (13) providesâ incomplete DoA = [a x a y ] T . As the a DoA must be unitary and assuming that the source is located above or below the array, it is possible to estimate the DoA.
The strong ego-noise could compromise the TDEs, generating outliers that would adversely affect the DoA estimation. As for the cost function defined in Equation (12), the solution can be obtained without using all available pairs of microphones; it is possible to use a data-selective algorithm to remove outliers. Using the Exhaustive Search algorithm ES(n) [7], we choose the n combination from the set of N pairs of microphones that minimizes the cost function ξ in Equation (12). We need to be cautious when choosing the number of pairs of microphones to be used, parameter "n" in ES(n), once it can generate ill-conditioned matrices [55]. The appropriate choice for "n" can be obtained according to a decision tree as done in Reference [6].

The MBSS Locate
The Multi-channel Blind Source Separation (MBSS) Locate [56] is available as a MATLAB ® toolbox. It estimates the direction of arrival of multiple sources from audio signals collected by an acoustic sensor array. This software implements multichannel versions of four state-of-the-art and three proposed SNR-based local angular spectra methods for audio signals [47].
The state-of-the-art local angular spectra methods are GCC-PHAT [49] and its version with a nonlinear function GCC-NONLIN [57], Multiple Signal Classification (MUSIC) [13], and Cumulative State Coherence Transform (cSCT) [58]. These techniques, except the cSCT method, rely on the assumption that one source is predominant in each time-frequency bin. The cSCT method assumes that there are at most two predominant sources.
The SNR-based local angular spectra tackles the multisource TDoA estimation problem. The main idea is to use the SNR as an unbounded measure to estimate if the information of a time-frequency bin results from a single source. Blandin et al. [47] proposed three methods to estimate the SNR using two microphones and the following techniques: Minimum Variance Distortionless Response (MVDR) [59], Diffuse Noise Model (DNM) [60], and Minimum Variance Distortionless Response Weighted (MVDRW).
The MBSS full version enables the user to simulate the recording scenario, e.g., room dimensions, walls absorption coefficient, and number of microphones [56]. Nevertheless, we summarize in the following only the core of the angular-spectra based method. For detailed information about its functionalities and implementation, one should refer to the user guide provided with the software.
We describe the use of the MBSS algorithm in three main steps. The first step is to define the possible angles of azimuth and zenith and to assemble the search grid. The program uses elevation instead of zenith, but it can be easily converted: zenith = π 2 − elevation. Based on the grid, the set of possible delays for each pair of microphones is computed, and then, it is resampled to limit the quantity to points in which angular spectrum is calculated. The software offers some options to compute angular spectra. For this work, we used the GCC-PHAT local angular spectra defined as follows [47]: where R is the real operator, l is the index of the time frame, f is the center frequency of the FFT bin, and t is the delay in seconds.
In the second step, the contents of all selected frequency bins are summed up. A linear interpolation is used in γ PHAT i,j (t, l) to the compute angular spectrum approximation in all possible t in each pair of microphones. This value is used to calculate the angular spectrum directly, depending on the direction of arrival, γ i,j (l, φ, θ). Then, angular spectrum of all pairs are summed, generating γ(l, φ, θ). For multiple time frames, there are two strategies: sum all time frames or use the maximum overall time frames. The last one is recommended when the signal of interest is only active in a few frames [47]. As the gunshot signals are impulsive, the maximum approach was used and the angular spectrum is then given by the following: The last step of the MBSS algorithm is a grid search to find the global maximum in the case of only one source or the local maxima when there are multiple sources. If there is only one single source, the DoA angles are obtained from the following:

Position Estimation
There are a number of ways to estimate the shooter localization. A simple approach uses DoA estimations of muzzle blast from different arrays according, for instance, to the total least squares (TLS) [61] algorithm. Since this method does not use the shockwave component, it can estimate position even of small caliber weapons of which the projectiles do not reach supersonic speed. Another advantage of this method is that, with a sufficient number of arrays, it could be combined with a data-selective algorithm, such as Exhaustive Search, seen in Section 2.3, to remove outliers expected to happen when some arrays do not have a clean sight to the firearm or are heavily affected by multipath [31,32]. On the other hand, in order to use the TLS approach, the system would be more complex and expensive to be deployed, since it requires more than one drone and they need to communicate with the node responsible for the calculation of the shooter's position with the information of all platforms.
A second approach is to combine shockwave and muzzle blast DoA estimations to compute the probable shooter location [41,62]. As this method uses shockwave components, it is only applicable in the case of supersonic projectiles and when the array is inside the shockwave field of view. Moreover, this method assumes (at least its simplest version) that the projectile has a constant speed, which tends to generate results that overestimate the distance when the shooter is more than 100 m from the array; adaptations are then required to overcome this limitation as stated in Reference [63] .
A third approach presented in Reference [24] combines muzzle blast DoA estimation from an elevated array with a digital map containing topographic information to estimate the shooter position. The main concern of this method is to obtain the appropriate digital model of the terrain. As in the TLS approach, this method can estimate the position of subsonic firearms. This approach would be appropriated for our scenario, but focusing on the DoA estimation, we have not carried out a position estimation evaluation in this work.

System Setup and Signal Acquisition
In this section, we describe the hardware, the drone, and the microphone array used in field recordings. We also provide some information about the shooting site and the environmental conditions the experiment was performed under. We also provide details regarding the data acquisition process, including audio recording and drone flight log data.

UAV and Avionics
We used a DJI Phantom 4. It weights 1.38 kg (battery and propellers included but without the extra hardware used for recording the audio signals) and has a 35 cm diagonal, also featuring a 4K camera, and support for two satellite positioning systems (GPS nad GLONASS). According to the manufacturer [64], the UAV, without any external hardware, is able to resist to wind gusts up to 36 km/h.
The microphone array was mounted in 41 cm metal rods, aligned with the propeller's arms. The size of the rods was engineered to keep the microphones away from the propellers to reduce the influence of noise caused by air displacement generated by the rotating blades. The four microphones were placed in the same height in a planar structure to avoid interference with the drone's maneuverability, especially during take off and landing. The planar coordinates for the microphones are given in Table 1, assuming the origin of the coordinate system in the center of the UAV.  The gimbal and the camera were removed, allowing the recorder to be placed under the drone (see Figure 5), aligning it with the center of mass of the multirotor, and minimizing the impact on the flight capabilities of the UAV. Care was taken in order not to cover the ultrasonic sensors, located on the underside of the drone's hull; this would severely affect flight safety and its landing ability.

Environmental Conditions and Shooting Site
The gunshot signals were collected in a shooting site located at the Brazilian Army Evaluation Center (CAEx) on a cloudy day with no strong wind and with a temperature of 24 • C. Figure 6 shows a satellite image of the shooting site. The drone's flight zone was restricted to the blue rectangle of area 30 × 120 square meters to prevent it from flying over sensitive regions and to ensure a clear line of sight to the shooter.

Data Acquisition: Audio and Drone Position and Attitude
The four microphones were connected to a four-channel recorder, TASCAM DR-40 [66], which is convenient given its relatively reduced dimensions and light weight of 0.213 kg without batteries. The TASCAM DR-40 recorder comes with two connectors for external microphones and two built-in microphones, which were rearranged to a single set with four external channels to accommodate four small electret microphones.
The recordings were carried out using a sampling frequency of 44.1 kHz and encoded using 24 bits per sample. The drone flight log data was recorded in a file and recovered using AirData.com [67]. The log data provides the following information: time (in ms), GPS coordinates (latitude and longitude), altitude, and attitude data (angles yaw, roll, and pitch), as illustrated in Figure 7. The digital audio recorder and the drone were initialized manually and simultaneously for each flight to synchronize the data about the position and the attitude of the drone with the recorded gunshot signals. As the drone was hovering when the shots were fired, the mismatch due to the manual process is negligible. Furthermore, there was no considerable drift caused by two different clocks, since the battery capacity limits the duration of each flight to a maximum of 18 minutes.

Axis Rotation
The DoA is calculated with respect to the drone's coordinates of which the axes are not necessarily aligned with the geographic axes. Therefore, after calculating the DoA with respect to the drone's coordinates, we must rotate the DoA vector in order to match the orientation of the geographic axes. The rotation can be applied by a series of matrix multiplications [69], using the attitude data and the magnetic declination of the location. Considering the axes system shown in Figure 2, the matrix that computes a rotation over axis z (yaw-α) is given by the following: The rotation matrix over axis y (pitch-β) is given by the following: Also, the matrix that performs the rotation over axis x (roll-ψ) is given by the following: Therefore, the rotated DoA vector in the geographic coordinate system is expressed as follows: Please note that matrix multiplication is not commutative, and therefore, the sequence roll, pitch, and yaw must be respected. Furthermore, the coordinates systems in Figures 2 and 7 are not the same: axes y and z point in opposite directions; it is necessary to reverse the rotation directions of pitch and yaw angles given by DJI Phantom 4. We must also take into account magnetic declination when rotating over axis z Equation (18), or the DoA vector will be aligned with magnetic north instead of geographic north.

Simulated Signals
In this work, we used simulated muzzle blast gunshot signals with different noise levels to tune parameters and to evaluate the performance of DoA estimation methods. In order to evaluate the quality of a DoA estimation, we used the angle between the estimated and the actual DoA, herein named angular error and defined as follows: where a DoA is the correct DoA vector andâ DoA is the estimated one. Angular error can vary from 0 • , when there is no error in DoA estimation, up to 180 • , when DoA estimation points towards the opposite direction of actual DoA. This metric allows us to compare objectively two different estimations while avoiding distortions in azimuth error when zenith is close to 0 • or 180 • . We used three performance measures based on angular error to evaluate the DoA estimation methods: mean, standard deviation, and percentage of estimations with angular error less than 3 • . An error of 3 • is expected to cause an error of approximately 6.28 m at the 120 m range. The simulation of the muzzle blast signal used 7 real gunshot recordings from a rifle Fz 7.62 M964 (FAL) manufactured by Indústria de Material Bélico do Brasil (IMBEL) [70]. Signals were collected with a high-quality microphone in an open and quiet environment, avoiding distortions such as additive noise and multipath propagation effect. These clean gunshot signals were clipped to be 10 ms in length. The selected muzzle blast was considered as the signal of a virtual microphone in the center of the array. Then, we inserted fractional delays to generate each one of the microphone's target signal, simulating the spatial position of the sound source with respect to the array. Noise was simulated based on eighteen recordings made during flights of the drone with the setup described in Section 3.1. During these recordings, the drone was hovering at different altitude levels, ranging from 10 m to 50 m. At each iteration of the simulation, a random muzzle blast signal and a random noise file were selected. Next, the noise file was clipped at a random point with the size of the desired window.
As the noise may have different magnitudes for each microphone, we define SNR mean as the mean SNR across all the sensors: SNR mean = 10 log 10 1 M where M is number of sensors in the array, σ 2 ni is the variance of the noise in the ith sensor, and σ 2 si is the variance of the muzzle blast component in the ith sensor defined from a 10-ms window.
We divided the results of simulations in two groups: LS method and MBSS, each one having its own parameters to be optimized. In both cases, we studied the effectiveness of preprocessing techniques. In this experiment, we ran 3000 iterations for each SNR value. In each iteration, the DoA was drawn according to a uniform distribution over a semisphere (as already mentioned, we consider that the shooter is in a lower position when compared to the drone).
For the LS method, simulations aimed at the best values of window size (from 20 ms up to 50 ms) and n in ES(n). As the array is composed of 4 sensors, N = M(M − 1)/2 = 6 pairs of microphones are available, so we tested from n = 6 to n = 3. Analyzing Table 2, we note that the best estimation was usually obtained using the smallest window. This was expected, since the muzzle blast signal lasts 10 ms and a smaller window would contain less noise without losing information about the muzzle blast signal. As stated in Reference [6], an optimal n depends on the SNR: when there is less noise, we should consider more pairs; conversely, when the SNR value gets lower, more pairs have their TDEs compromised and should be discarded. As for the preprocessing techniques, we note that the median filter improves the quality of DoA estimation. However, the Wiener filter implementation used herein did not fit well to the application at hand when combined with the GCC-PHAT. An in-depth analysis using the complete results of the LS simulation in Table A1, would indicate that the median filter has better performance among all estimates with angular errors less than 3 • .
In our simulations, we defined two basic MBSS parameters: grid resolution, which was set to 1 • , and alpha resolution, which was set to 0.5 • . The first one is the minimum increment considered in DoA angles, while the second one is related to the resample of possible delays for each pair of microphones, as mentioned in Section 2.3.2. These parameters do not have a considerable influence on performance with low SNR. Assuming that a muzzle blast would come from below the drone, the search boundaries for azimuth and zenith were set to 0 • to 359 • and 90 • to 180 • , respectively. We explored the most suitable values for window and frame sizes; the former varied from 25 ms up to 50 ms, and the latter varied from 10 ms up to 20 ms.
A summary of the MBSS simulation results containing the best parameters per SNR in relation to the rate of estimations with angular error less than 3 • are shown in Table 3. We noted that frame-based processing, together with the overall maximum strategy, led to the best performance with a 50-ms window size and a frame size of 12 ms or greater. We also notice that the MBSS method does not work well with the preprocessing techniques previously mentioned. Nevertheless, MBSS proved to be more robust to ego-noise, achieving high hit rates even for SNRs as low as −5 dB. The complete results of MBSS simulation can be seen in Table A2.
Based on the simulation results, we chose 2 schemes to process the real gunshots: MBSS with window size of 50 ms and frame size of 15 ms and LS-method using n = 4 and frame size of 20 ms preprocessed with median filtering.

Field Recordings
The recordings were carried out in 5 sets. In the first one, 3 shoots were recorded only to make sure that the system was fully operational and were not used to evaluate its performance. The next four sets contain, respectively, 50, 50, 60, and 87 gunshot recordings. Summing up, we have a total of 250 gunshots, all from a Carbine IMBEL 7.62 IA2 [70]. In each series, the drone's flight height varied from 8.8 m up to 60.5 m. The upper limit of flight height was set to ensure safety since the additional payload in the drone compromises its ability to withstand wind gusts. In some recordings, both muzzle blast and shockwave components are present, while in cases where the drone was not positioned in the propagation path of the shockwave, only the muzzle blast is present. In this work, we address DoA estimation of the muzzle blast only.
In order to avoid issues related to automatic detection, the system recorded continuously for the duration of the flight, and signals were clipped, manually preserving the muzzle blast only. These two acoustic signatures overlapped in a few recordings. When analysing the results, we found out that azimuth estimations were biased. The bias was similar in the third, fourth, and fifth sets but clearly different for the second run. This, combined with the fact that the UAV required a calibration of its magnetic sensor between the second and the third runs, indicates that this bias can be credited to electromagnetic interference in electronic compassed caused by other circuits aboard. As this bias was spotted only when processing the signals in the laboratory, the value of the compensation had to be estimated directly from the gunshots. To estimate the bias, we computed the mean azimuth error, but in order to mitigate the deleterious effects of possible outliers, we used only DoA estimations with zenith errors less than 3 • . Finally, we obtained bias correction values of −8.6743 • for the second set and −16.7746 • for the third, fourth, and fifth sets.
The experiments were designed to evaluate the performed of the algorithms under different controlled values of SNR for the gunshot signals. However, other important measurements from GPS and attitude sensors are assumed to be inherently noisy. Table 4 presents the results obtained for the 247 muzzle blast gunshot signals under test. Although for the simulated signals we used as the hit rate the percentage of shots with angular errors lower than 3 • , for the real gunshot signals, we increased this threshold to 10 • . The results in Table 4 represent an average of the different recording conditions, depending on the position of the drone as shall be seen in the following.  Figure 8 illustrates the relationship between the position of the drone and the DoA estimation error. Notice that, as the distances between drone and shooter are not substantially large, 120 m at most, the error observed in this experiment is not strongly related to the distance but is rather correlated to relative position: when the drone is within a cone in front of the weapon, the results are poorer. We analyzed the recordings from positions within this cone and observed distorted signals in most of them. These positions are in the field of view of the shockwave but also in the direction of and in a small distance from the gun barrel. This suggests that the causes of the distortions are twofold: overlap of shockwave and muzzle blast components and a great signal intensity saturating the sensor. In an attempt to measure the system performance in better positioning, we took all gunshots recorded in a region outside a 35 • cone open from the weapon muzzle and the error dropped considerably. The hit rate increased to 92.86% for the MBSS technique and to 95.54% for the LS + MF method instead of the former 72.87% and 70.45%, respectively.

Discussion and Conclusions
In this work, we analyze the problem of determining the position of a shooter based on gunshot signals acquired using a microphone array mounted on a multirotor UAV. We have conducted a comprehensive literature review on essential topics characterizing the state-of-the-art for this kind of application. We narrow down the focus on the main task, which is to determine the direction of arrival for the muzzle blast and to evaluate the performance of two well-established DoA estimation techniques as well as two important preprocessing methods.
We carry out extensive simulations to evaluate the performance of DoA algorithms and to tune their parameters before finally testing the methods with actual gunshot dates recorded in a firing range. Based on our experimental results, we claim that an aerial microphone array mounted on a drone can be used to obtain good estimates of gunshot direction of arrival using different techniques. The experiments also highlight the fact that the accuracy of the estimates are sensitive to the drone position relative to the shooter and emphasize that better results can be achieved with a system that can fly at higher altitudes, in which case it would be possible to estimate the position of the shooter as well.
Nevertheless, issues like detection, classification, and noise cancellation algorithms require further investigation, testing, and validation to achieve a fully functional, reliable, and autonomous system.

Acknowledgments:
The authors would like to thank Jorge P. do Bomfim for his support in all audio recordings, drone setup, and operation. We would also like to thank the Brazilian Army Evaluation Center (CAEx) for the support in the recordings of real gunshot signals. CAEx provided highly skilled staff, location, gun, and ammunition.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.