Prototype Development of Cross-Shaped Microphone Array System for Drone Localization Based on Delay-and-Sum Beamforming in GNSS-Denied Areas

Abstract: Drones equipped with a global navigation satellite system (GNSS) receiver for absolute localization provide high-precision autonomous flight and hovering. However, GNSS signal reception sensitivity is considerably lower in areas such as those between high-rise buildings, under bridges, and in tunnels. This paper presents a drone localization method based on acoustic information using a microphone array in GNSS-denied areas. Our originally developed microphone array system comprises 32 microphones installed in a cross-shaped configuration. Using two drones of different sizes and weights, we obtained an original acoustic outdoor benchmark dataset at 24 points. The experimentally obtained results revealed that the localization error values were lower at 0° and ±45° than at ±90°. Moreover, we evaluated the simulated localization accuracy for 28 results at 22 points while changing the tolerance range from 3.0 m to 0.5 m in 0.5 m steps. The localization accuracy of our proposed method is 71.4% with a 2.0 m tolerance. Future work includes improving the angular resolution of elevation angles, comparative evaluation of sound source localization with other methods, further investigation of the magnitude and effects of noise, simultaneous localization of multiple drones, real-time drone tracking, and expansion of benchmark datasets. Moreover, we must evaluate altitude measurement errors using a total station theodolite, an optical instrument used in surveying and building construction.


Introduction
Drones have virtually unlimited potential [1], not only for hobby use but also as innovative sensing platforms for numerous industrial applications. The global navigation satellite system (GNSS), which comprises the global positioning system (GPS), global navigation satellite system (GLONASS), BeiDou navigation satellite system (BDS), and Galileo, provides drones with autonomous flight and hovering with high accuracy based on absolute location information. Using high-precision positioning combined with real-time kinematic (RTK) technology [2], drone applications have been expanding rapidly in terms of precision agriculture [3] and smart farming [4], infrastructure inspection [5], traffic monitoring [6], object recognition [7], underground mining [8], construction management [9], parcel and passenger transportation [10], public safety and security [11], and rescue operations at disaster sites [12]. Particularly, improved efficiency, cost reduction, and labor savings have been achieved through drone applications in numerous operations and missions of various types. Nevertheless, GNSS signal reception sensitivity is significantly lower in areas between high-rise buildings, under bridges, and in tunnels [13]. Location estimation and identification for drones in such GNSS-denied areas remain challenging research tasks [14]. Moreover, for security and safety, various studies have been conducted to detect unknown drones [15][16][17].
In indoor environments, where GNSS is frequently unavailable, simultaneous localization and mapping (SLAM) [18] is used widely for localization and tracking of drones [19,20]. In recent years, environmental mapping and location identification have been achieved with visual SLAM [21][22][23][24] using depth cameras and light detection and ranging (LiDAR) sensors [25]. Numerous visual SLAM-based methods have been proposed [26], not only for research purposes but also for practical applications such as floor-cleaning robots [27]. Another strategy that can be adopted without creating or using a map is an approach using motion capture sensors [28]. The use of motion capture is mainly confined to indoor areas because of its limited sensing range. The sensing range required in outdoor environments is markedly greater than that in indoor environments. Moreover, in outdoor environments, sensors are affected by sunlight, wind, rain, snow, and fog. Therefore, localization and identification in GNSS-denied environments are tremendously challenging tasks. Taha et al. [15] categorized drone detection technologies into four categories: radar, acoustic, visual, and radio-frequency (RF)-based technologies. They presented both the benefits and shortcomings of the respective technologies based on various evaluation criteria. In addition, they identified the radar-based approach as the most realistic in terms of accuracy and detection range. Moreover, their survey showed that vision-based methods using static or dynamic images benefit from deep learning [29], which has undergone rapid evolution and development in recent years. However, vision-based methods detect only relative positions in image coordinates; absolute positions in world coordinates remain unconsidered in the detection results.
This study specifically examines acoustic-based techniques to acquire absolute positions as world coordinates. Drones with multiple rotors that revolve at high speed constantly emit noise. This feature differs from that of fixed-wing gliders in the broader category of unmanned aerial vehicles (UAVs). Originally, the drone designation signified continuous low-frequency humming sounds, similar to those of honeybees. A salient benefit of using acoustic information is that it is available over a wider sensing range than vision information. Moreover, acoustic information performs robustly in environments with limited visibility and occlusion between objects.
The objective of this study is to develop a prototype microphone array system that identifies the drone position in world coordinates and to evaluate its quantitative accuracy. Our microphone array system comprises 32 omnidirectional microphones installed on an original cross-shaped mount. We conducted a flight evaluation experiment in an outdoor environment using two drones of different body sizes and payloads. Experimentally obtained results achieved using originally collected benchmark datasets revealed that our proposed system identified drone locations in world coordinates within the accuracy range of GPS. Moreover, we demonstrated properties and profiles of position identification error values according to distances and angles. This paper is structured as follows. Section 2 briefly reviews related studies of state-of-the-art acoustic and multimodal methods of drone detection and localization. Subsequently, Section 3 presents our proposed localization method using an originally developed microphone system based on delay-and-sum (DAS) beamforming. Experiment results obtained using our original acoustic benchmark datasets obtained using two drones are presented in Section 4. Finally, Section 5 concludes this paper and highlights future work.

Related Work
As comprehensive research, Taha et al. [15] surveyed state-of-the-art acoustic methods for drone detection [30][31][32][33][34][35][36]. They asserted that far fewer research efforts have been based on acoustic methods than on other modalities. Moreover, they indicated a shortage of acoustic benchmark datasets because annotation is more difficult than for other modalities, especially visual benchmark datasets. In our brief literature survey here, additional references are reviewed below as reports of related work.
Sedunov et al. [37] proposed an acoustic system for detecting, tracking, and classifying UAVs according to their propeller noise patterns. Their system included three sensor array nodes, each comprising 15 microphones, with 100 ± 20 m intervals between nodes. To detect sound source directions, they used the steered-response phase transform (SRP-PHAT) [38] based on the computation and addition of the generalized cross-correlation phase transform (GCC-PHAT) [39]. The flight evaluation experiments conducted outdoors using five drones of different sizes and payloads revealed that their proposed method obtained notably good performance compared with existing commercial acoustic-based systems. Specifically, their system detected a drone at up to 350 m with an average precision of 4°. Moreover, they demonstrated real-time drone tracking within a predicted coverage distance of approximately 250 m.
Chang et al. [40] proposed a systematic method using two acoustic arrays for drone localization and tracking. Both arrays were composed of four microphone sensors separated by a 14 m gap distance. They developed a time difference of arrival (TDOA) estimation algorithm using a Gaussian prior probability density function to overcome multipath effects and a low signal-to-noise ratio (SNR). The analysis of data obtained during field experiments led to satisfactory performance for real-time drone localization and tracking. However, they demonstrated merely a single experimental result using one drone in the limited range of 50 × 50 m.
Dumitrescu et al. [41] proposed an acoustic system for drone detection using 30 digital micro-electro-mechanical system (MEMS) microphones in a spiral-shaped arrangement. For drone identification and classification, they proposed a concurrent neural network (CoNN) that includes supervised and unsupervised learning paradigms simultaneously. They achieved detection using their originally developed benchmark datasets obtained with six drones: three commercial drones and three originally developed drones.
Blanchard et al. [42] proposed an arrayed microphone system for 3D localization and tracking of a drone. Their system comprised ten microphones arranged along three orthogonal axes. They employed time-frequency DAS beamforming and a Kalman filter for 3D localization and tracking. They conducted an outdoor experiment to evaluate localization and tracking errors. However, the flight range was limited to a relatively small area of approximately 10 × 10 m horizontally and 5 m vertically.
As a multimodal device with new possibilities, audio images obtained using an acoustic camera have come to be used for drone detection. Zunino et al. [43] developed an optical video camera prototype with 128 digital MEMS microphones. Using a turntable that mechanically extends the camera's field of view, a maximum field of view of 90° in elevation and 360° in azimuth was achieved. Experimental results obtained using their original datasets demonstrated real-time detection and tracking of drones and people indoors and of motorbikes outdoors. The relation between the sound source positions and the rotary table angles strongly affected the detection accuracy.
Liu et al. [44] proposed a modular camera array system combined with multiple microphones. They specifically examined a camera array system capable of large-scale airspace observation outdoors. Their proposed system integrated vision and audio information from multiple sensors facing various directions. They identified the characteristics of various drones to achieve higher monitoring efficiency. However, their proposed system did not support localization in world coordinates.
Svanström et al. [45] developed an automatic multisensor drone detection system for achieving multiple modality sensing. Their proposed system comprised a normal video camera, a fisheye lens camera, a thermal infrared camera, and a microphone. They obtained original datasets containing 650 annotated infrared and visible video files of drones, birds, airplanes, and helicopters combined with audio files of drones, helicopters, and background noise. Their comparison results obtained using a you only look once (YOLO) v2 detector [46] revealed improved accuracy with multiple modalities over a single modality.
Izquierdo et al. [47] proposed a combined method for detecting drones and persons simultaneously. Their acoustic camera system comprised a monocular camera with 64 MEMS microphones. They used a 2D beamforming algorithm for target detection. The experimental results revealed that their proposed system identified rotor noise not only by the arrival angles of direct sound signals but also by the first echo reflected from a reflective surface. However, the evaluation experiments were conducted only under limited conditions in a small room as a feasibility study. The novelty and contributions of this study are the following.

•	Our proposed method can detect the horizontal position of a drone using three parameters obtained from the drone and our cross-shaped microphone array system. The challenging detection range is greater than that of the previous approach [42].
•	Compared with previous studies [44][45][46][47] based on multimodal devices combining a camera and arrayed microphones, our method can detect a drone using only acoustic information, similar to previous studies [37,40-42].
•	Compared with previous studies [37,40-42] based on a single modality of arrayed microphones, we provide a detailed evaluation that comprises 24 positions and two different drones in an actual outdoor environment.
•	To the best of our knowledge, this is the first study to demonstrate and evaluate drone localization based on DAS beamforming in GNSS-denied areas.
By contrast, as a limitation of our method, the target drone must accurately transmit its flight altitude information to our proposed system. Another limitation is that we do not currently consider applications in environments where sound reflection occurs, such as in tunnels or under bridges.

Acoustic Localization Method
Letting h be the drone flight altitude and h_m be the height of the horizontal microphone array above the ground, the height difference h_d is obtained as

h_d = h − h_m.

Using h_d and θ_e, the length r from O to (x, y) is obtained as

r = h_d / tan θ_e.

Using r and θ_h, one can calculate the length x as

x = r cos θ_h.

Subsequently, using x and θ_h, the length y is calculated as

y = x tan θ_h.

For this study, we set h_m to 900 mm when developing our original sensor mount. Parameter h is obtained from the flight controller via the drone controller. Although GNSS provides high horizontal ranging resolution, its vertical ranging resolution is significantly lower. Therefore, drones use not only IMU values but also barometric pressure and temperature values to calculate the altitude. Parameters θ_h and θ_e are calculated using the beamforming method.
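The geometric relations above (h_d = h − h_m, r = h_d / tan θ_e, x = r cos θ_h, y = x tan θ_h) reduce to a few lines of trigonometry. The following is a minimal sketch; the function and parameter names are chosen here for illustration and are not from the original system:

```python
import math

def localize(h, theta_h_deg, theta_e_deg, h_m=0.9):
    """Estimate the horizontal drone position (x, y) in metres.

    h           -- drone flight altitude reported by the flight controller [m]
    theta_h_deg -- azimuth angle theta_h from the horizontal array [deg]
    theta_e_deg -- elevation angle theta_e from the vertical array [deg]
    h_m         -- height of the horizontal microphone arm (900 mm mount)
    """
    h_d = h - h_m                                  # height above the array
    r = h_d / math.tan(math.radians(theta_e_deg))  # ground distance from O
    x = r * math.cos(math.radians(theta_h_deg))
    y = x * math.tan(math.radians(theta_h_deg))    # equivalently r * sin(theta_h)
    return x, y
```

For example, a drone at 20.9 m altitude with θ_h = θ_e = 45° resolves to roughly (14.1, 14.1) m.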

Delay-and-Sum Beamforming
As depicted in Figure 2, beamforming is a versatile technology used for directional signal enhancement with sensor arrays [48]. Letting y(t) be the beamformer output signal at time t and M be the number of microphones, then for z_m(t) and w_m(t), which respectively denote the measured signal and the filter of the m-th sensor, y(t) is calculated as

y(t) = Σ_{m=1}^{M} w_m(t) ∗ z_m(t),

where ∗ denotes convolution. For this study, we treat DAS beamforming in the temporal domain. Assuming that a single plane wave exists and letting s(t) be the sound signal, a delay τ_m occurs for the incident wave observed at the m-th sensor as

z_m(t) = s(t − τ_m),

for m = 1, …, M, where M represents the total number of microphones. By using the filter, the delay −τ_m of the incident wave and the advance +τ_m of the filter offset each other; signals from the direction θ are enhanced because the phases of the signal s(t) are aligned across all channels. For this study, the temporally compensating filter w_m(t) is defined as

w_m(t) = δ(t + τ_m),

where δ is Dirac's delta function.
Letting θ be an angle as a variable parameter, then for the comparison of sound directions, the relative mean power level G(θ) of y(t) is defined as

G(θ) = (1/T) ∫_0^T y(t)^2 dt,

where T represents the length of the time interval.
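The DAS beamformer and the relative mean power level G(θ) can be sketched in the discrete sample domain as follows. This is not the authors' implementation: the microphone spacing (40 mm, inferred from 16 microphones across a 600 mm width), the speed of sound, the sampling rate, and the integer-sample approximation of τ_m are all assumptions of this sketch:

```python
import numpy as np

C = 343.0   # speed of sound [m/s] (assumed)
FS = 48000  # sampling rate [Hz] (ADA8200 supports 44.1/48 kHz)
D = 0.04    # microphone spacing [m] (assumed: 16 mics over 600 mm)

def das_power(z, theta_deg, d=D, fs=FS, c=C):
    """Relative mean power G(theta) of the delay-and-sum output.

    z -- (M, N) array of measured signals, one row per microphone.
    The compensating filter delta(t + tau_m) is realized as an
    integer-sample shift of each channel before summation.
    """
    m_idx = np.arange(z.shape[0])
    # plane-wave delay of the m-th sensor for direction theta, in samples
    tau = m_idx * d * np.sin(np.radians(theta_deg)) / c * fs
    shifted = [np.roll(ch, -int(round(t))) for ch, t in zip(z, tau)]
    y = np.sum(shifted, axis=0)   # beamformer output y(t)
    return np.mean(y ** 2)        # (1/T) * integral of y(t)^2

def steer(z):
    """Scan theta from -90 deg to 90 deg at 1 deg intervals; return the argmax."""
    thetas = np.arange(-90, 91)
    powers = [das_power(z, th) for th in thetas]
    return thetas[int(np.argmax(powers))]
```

Applying `steer` to the horizontal and vertical sub-arrays yields estimates of θ_h and θ_e, respectively.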
We changed θ from −90° to 90° at 1° intervals. Letting P_h(θ) and P_e(θ) respectively denote G(θ) obtained from the horizontal and vertical microphone arrays, one can obtain θ_h and θ_e as

θ_h = argmax_θ P_h(θ),   θ_e = argmax_θ P_e(θ).

Devices for Experimentation

Figure 3 depicts the design of the cross-shaped sensor mount for installing microphones as an array system. The mount dimensions are 1200 mm in length, width, and height. The horizontal arm is located 900 mm above the ground. The mount legs, attached in a cross shape, were designed to be wide to prevent the mount from falling over. Two aluminum frames were placed in parallel to fasten the microphones at their front and back. We installed 16 microphones at constant intervals across a 600 mm width. Similarly, on the vertical side, we mounted 16 microphones at heights ranging from 600 mm to 1200 mm. In all, we used 32 microphones for our first prototype. For assembly, aluminum pipes with a square cross-section of 20 × 20 mm were used for the mount framework. For greater strength, L-shaped brackets were used at all joints. After assembling the mount, the 32 microphones were fastened in place with plastic fixtures. Figure 4 depicts photographs of the completed microphone array system from the front, right, and rear views. Figure 5 depicts an 8-channel audio interface unit (ADA8200; Behringer, Music Tribe; Makati, Metro Manila, Philippines) that includes an amplifier and an analog-to-digital (A/D) converter. According to its specifications, the ADA8200 sampling rate is 44.1/48 kHz with synchronization across channels. Our system comprises four interface units because it uses 32 microphones. The mean power consumption of each interface unit is 15 W. Digital signals from the units are integrated with a hub module. We used a laptop computer to capture and store the digital signals. The mean power consumption of the laptop computer is 9 W.
Therefore, the total system power consumption is approximately 69 W. We used a portable battery (PowerHouse 200; Anker Innovations Technology Limited, Shenzhen, China) with a 213 Wh battery capacity. Using this battery, the operating time of our proposed system is approximately 3 h.
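The power budget above reduces to simple arithmetic; treating the quoted consumption figures as average power draw in watts, a quick check is:

```python
# Power budget of the prototype (figures from the text).
interface_units = 4
unit_power_w = 15    # per audio interface unit [W]
laptop_power_w = 9   # laptop computer [W]

total_w = interface_units * unit_power_w + laptop_power_w  # 69 W in total
battery_wh = 213                                           # Anker PowerHouse 200
runtime_h = battery_wh / total_w                           # operating time [h]
```

This reproduces the approximately 3 h operating time stated above.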

Benchmark Datasets
We conducted acoustic data collection experiments at a track of the Honjo campus (39°39′35″ N, 140°7′33″ E), Akita Prefectural University, Yurihonjo city, Japan. The left panel of Figure 8 shows an aerial photograph of the experimental site. This campus is surrounded by paddy fields. An expressway and an ordinary road are located, respectively, east and south of the campus. We conducted this experiment during the daytime, as depicted in the right panel of Figure 8. Although traffic was scarce, noise from passing cars was included in the obtained datasets. There was no noise output from the faculty building located to the north. Table 2 presents details of the ground truth (GT) positions and angles at the 24 points of drone flights used for obtaining the sound datasets. Here, x_g and y_g correspond to the GT coordinates in the evaluation of estimation results. We used a tape measure (USR-100; TJM Design Corp.; Tokyo, Japan) to obtain GT positions. The minimum graduation of the tape measure is 2 mm. Additionally, θ_gh and θ_gv respectively signify the azimuth and elevation angles calculated from x_g, y_g, and h using our proposed method. The respective positions were assigned independent labels P1-P24. The following three steps were used to determine the drone's flight position.

1.	A 2D position on the ground was measured using a tape measure and marked.
2.	After placing a drone at the mark, it was flown to an arbitrary height in the vertical direction.
3.	The 3D flight position was confirmed by visual observation from the ground and field-of-view (FOV) images transmitted from an onboard camera of the drone.
We used the Matrice 200 at P1-P20 and the Matrice 600 Pro at P21-P24. Datasets at P1-P20, P21-P22, and P23-P24 were obtained on different days. Table 3 presents the meteorological conditions on the respective experimental days. On all three days, the meteorological conditions were suitable for drone flights.
The coordinates of flight positions P1-P20 are depicted in Figure 9. The microphone array was placed at the coordinate (0, 0). In the x-axis direction, the flight positions were assigned from ±20 m to ±50 m at ±10 m intervals. In the y-axis direction, the flight positions were assigned from 20 m to 50 m at 10 m intervals. Based on the positive x-axis direction, angles were assigned from 0° to 180° at 45° intervals. The flight altitudes were assigned from 20 m to 70 m at 10 m intervals. Similarly to P11, the P21 flight coordinate was set to 20 m on the x and y axes. Moreover, similarly to P20, the P22 flight coordinate was set to 50 m on the x and y axes. For both positions, the flight altitudes were set in four steps: 5 m, 50 m, 100 m, and 150 m. The Japanese Civil Aeronautics Law limits the maximum flight altitude to 150 m above the ground.
For P23 and P24, the microphone array system locations were changed, but the flight position was fixed. Figure 10 depicts the positional relations between the drone flight point and measurement position at P23 and P24, which correspond to the installation positions of the microphone array system. The distances from P23 and P24 to the drone in flight were, respectively, 70 m and 355 m on the x and y axes. The flight altitudes at P23 and P24 were, respectively, 100 m and 150 m. At all locations except for P23 and P24, we were able to hear the drone propeller rotation sound.
For the respective positions, we recorded 10 s of sound data while the drone was hovering. We randomly extracted 1 s of sound data at time t as the input s(t) to DAS beamforming.
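The excerpt extraction step above can be sketched as follows; the 48 kHz sampling rate and the function name are assumptions for illustration:

```python
import numpy as np

FS = 48000  # sampling rate [Hz] (the ADA8200 supports 44.1/48 kHz)

def random_one_second(recording, fs=FS, seconds=1.0, rng=None):
    """Extract a random 1 s excerpt s(t) from a 10 s multichannel recording.

    recording -- (M, N) array, one row per microphone channel.
    """
    rng = rng or np.random.default_rng()
    n = int(fs * seconds)
    start = rng.integers(0, recording.shape[1] - n + 1)  # random onset
    return recording[:, start:start + n]
```

The returned (M, fs) block is then fed to the beamformer as the measured signals z_m(t).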

Experiment Results
The experimentally obtained results are presented in groups according to the distances between the microphone array system and the flight positions. Herein, Group 1 includes P4, P5, P9, P10, and P11, which are located ±20 m along the x axis and 20 m along the y axis from the microphone array system. Groups 2-4 comprise positions that are equally distant at 10 m intervals. Angle estimation results for Groups 2, 3, and 4 are presented respectively in Figures 12-14. The estimated angles θ_hm are located at approximately 45° intervals. The estimated angles θ_vm are located at approximately 45° in all cases. The experimentally obtained angle estimation results revealed that each output curve exhibited a unimodal distribution with a distinct peak. Figures 15 and 16 respectively depict angle estimation results obtained at P21 and P22. Both results include four flight altitudes: 5 m, 50 m, 100 m, and 150 m. Regarding the horizontal results, θ_hm was extracted at around 45° in all altitude patterns. As a comprehensive trend, the flight altitudes and the mean power level exhibit an inversely proportional relation except at the 5 m altitude. We consider that the mean power level at the 5 m altitude was low because of the effects of sound waves reflected from the ground. For θ_vm, the flight altitudes and estimated angles were found to have a positive correlation. The respective curves exhibit a unimodal distribution with a distinct peak. Vertically, θ_vm was obtained at 45° for P23 and at 17° for P24. Although the sound volumes heard subjectively were low, the respective curves presented a unimodal distribution with a distinct peak. All of the results described above are presented in Table 4. Here, the positional error values x_e, y_e, and E were calculated as presented below.
x_e = x_m − x_g,   y_e = y_m − y_g,   E = sqrt(x_e^2 + y_e^2).

Figure 19 presents localization results obtained at P1-P20. Filled circles and rings respectively correspond to GT positions and estimated positions. As an overall trend, no estimated position differs considerably from the GT position. With respect to the angular characteristics, the error values at ±90° tend to be larger than those at 0° or ±45°. The cumulative error increased because of the higher delay ratio, defined theoretically as τ_m. Figure 20 portrays scatter plots of the correlation between GT and estimated positions at P1-P20 for each coordinate. These results indicate that the variance on the y axis (right panel) is greater than that on the x axis (left panel). Figure 21 depicts localization results obtained at P21 and P22. Compared with the GT position, the P21 estimation results are located farther from the microphone array system. The P22 estimation results are located both closer to and farther from the microphone array system than the GT position. Large error values were found at altitudes of 100 m for P21 and 5 m for P22. The localization results at altitudes of 50 m and 150 m were stable. Figure 22 depicts the distributions of estimated locations at P23 and P24. The localization error at P23 was 0.8 m, which was smaller than those at P1-P22. Although the localization error at P24 was 13.0 m, which was higher than those at P1-P23, it corresponds to a relative error of 3.4%. We regard these results as reference values because they were obtained from only a single trial. Table 5 presents E for each group with respect to the distance between the microphone array system and the drone. The total E represents the accumulated value of the five positions in each group. The mean E shows a slight trend of the error increasing with distance.
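The error computation above amounts to a component-wise difference followed by the Euclidean norm; a minimal sketch, assuming E is the Euclidean distance between the estimated and GT positions:

```python
import math

def localization_error(x_m, y_m, x_g, y_g):
    """Positional error components x_e, y_e and the Euclidean error E [m].

    (x_m, y_m) -- estimated position; (x_g, y_g) -- ground-truth position.
    """
    x_e = x_m - x_g
    y_e = y_m - y_g
    return x_e, y_e, math.hypot(x_e, y_e)
```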
Generally, GPS localization error values are approximated at 2 m, depending on receivers, antennas, and obstacles in the surrounding environment. Modsching et al. [49] reported that their experimentally obtained GPS error was 2.52 m in a medium-size city with few obstructions. We evaluated the simulated localization accuracy for 28 results at P1-P22 while changing the tolerance range from 3.0 m to 0.5 m in 0.5 m steps, as presented in Table 6. The localization accuracy of our proposed method is 71.4% with a 2.0 m tolerance. The accuracy dropped slightly to 67.9% with a 1.5 m tolerance. However, it dropped to 39.3% with a 1.0 m tolerance.
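The tolerance sweep can be reproduced by counting the fraction of error values E within each threshold. The error list below is hypothetical, standing in only as an illustration for the 28 results at P1-P22:

```python
import numpy as np

def accuracy_within(errors, tolerances):
    """Fraction of localization errors E falling within each tolerance [m]."""
    errors = np.asarray(errors)
    return {t: float(np.mean(errors <= t)) for t in tolerances}

# Hypothetical error values E [m]; the actual 28 results are in Table 4.
errors_m = [0.4, 0.7, 1.2, 1.9, 2.6]
acc = accuracy_within(errors_m, [0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
```

With the hypothetical values above, 4 of the 5 errors fall within the 2.0 m tolerance, giving an accuracy of 0.8.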

Conclusions
This paper presented a drone localization method based on acoustic information using a microphone array in GNSS-denied areas. Our originally developed microphone array system comprised 32 microphones installed in a cross-shaped configuration. Using two drones of different sizes and weights, we obtained an original acoustic outdoor benchmark dataset at 24 points. The experimentally obtained results revealed that the localization error values were lower at 0° and ±45° than at ±90°. Moreover, we evaluated the simulated localization accuracy for 28 results at 22 points while changing the tolerance range from 3.0 m to 0.5 m in 0.5 m steps. The localization accuracy of our proposed method is 71.4% with a 2.0 m tolerance.
Future work includes improving the angular resolution of elevation angles, comparative evaluation of sound source localization with other methods, further investigation of the magnitude and effects of noise, simultaneous localization of multiple drones, real-time drone tracking, and expansion of the benchmark datasets. Moreover, we must evaluate altitude measurement errors using a total station theodolite, an optical instrument used in surveying and building construction.