Application of UAV in Search and Rescue Actions in Underground Mine—A Speciﬁc Sound Detection in Noisy Acoustic Signal

Abstract: The possibility of the application of an unmanned aerial vehicle (UAV) in search and rescue activities in a deep underground mine has been investigated. In the presented case study, a UAV is searching for a lost or injured human who is able to call for help but is not able to move or use any communication device. A UAV capturing acoustic data while flying through underground corridors is used. The acoustic signal is very noisy since during the flight the UAV contributes high-energetic emission. The main goal of the paper is to present an automatic signal processing procedure for detection of a specific sound (supposed to contain voice activity) in presence of heavy, time-varying noise from UAV. The proposed acoustic signal processing technique is based on time-frequency representation and Euclidean distance measurement between reference spectrum (UAV noise only) and captured data. As both the UAV and “injured” person were equipped with synchronized microphones during the experiment, validation has been performed. Two experiments carried out in lab conditions, as well as one in an underground mine, provided very satisfactory results.


Introduction
The biggest underground mines look like underground cities. They consist of many "streets" and "street crossings". Unfortunately, each mining void looks very similar to the others. Due to the complexity of the network of underground corridors, locating an injured person or any other incident, if not reported on an emergency line, is complicated and time-consuming. This poses a serious danger in cases in which rapid contact with work supervisors or rescue units is crucial. The complex structure of an underground mine makes getting lost in it more serious and much harder to resolve than an accident at ground level. Another important issue is related to extremely harsh environmental conditions. An example of an emergency situation caused by them is one of the co-workers in a pair or small group fainting. In such a case, a drone sent to the area in which worsened environmental parameters have been reported can significantly shorten the time needed to reach the victim and others at risk.
There are many natural hazards and unfavorable environmental factors that can pose a significant risk to anyone working underground, such as spontaneous combustion of coal, rock bursts, methane ejections, water hazards, and high dustiness [1][2][3][4]. According to the State Mining Authority, in 2019 alone Polish mine rescue units took part in 32 rescue operations in mining plants [5]. Most of them were related to roof/wall collapse or fire occurrence. In such extreme situations in an underground mine, it may easily become necessary to search for workers. As a result of the risks present in the mining environment, such as lack of oxygen or the presence of fire fumes or other harmful gases, leading to loss of consciousness, as well as other injuries, a miner may not be able to evacuate from the hazardous area. Furthermore, a person may get trapped due to a rock collapse or a fire occurring in the evacuation route. The scale of the problem can be seen in the example of the activity of Polish mining rescue units. Considering that mining rescue activities are not limited to Polish underground mines, it can be stated that on a global scale the issue of locating humans in emergency situations is a considerable challenge. As the area to search is large and consists of many corridors, and time is critical, there is a need for a quick method of finding victims of accidents.
In this paper, we propose to use an unmanned aerial vehicle (UAV) for audio signal acquisition and human voice detection. The drone, equipped with a microphone and capable of registering the human voice and distinguishing it from the propellers' noise, is to be used for detecting emergency situations in underground mines. A method using time-frequency decomposition of the signal is presented, together with the full procedure of separating the human voice from the background noise. It is assumed that the person is not able to move, but can call for help by shouting. In practice, the mission may be performed using several drones searching corridor by corridor and using acoustic signals to find the location of the human.
From a signal processing perspective, the problem is related to voice detection in the presence of time-varying noise with a complex structure. Due to the spinning of the propellers, both deterministic and random components are present in the acoustic signal. Unfortunately, the energy of the voice signal captured by a microphone mounted below the drone is rather small, and the high energy of the acoustic signal emitted by the drone makes the voice detection problem complicated. In the paper, a procedure for voice detection in the presence of UAV noise is proposed. Our informative signal is in fact a human voice, but limited to specific words only, such as "help". We therefore call it a specific acoustic signal. We limit our research to "help" only; however, we believe that other words might also be detectable, though certainly not all. "Help" is the most intuitive word in a crisis situation. The experiments have been carried out in the lab as well as in a real mining corridor to test and validate the proposed procedure.

State of the Art
Voice activity detection (VAD), being a natural context for our sound detection method, is an inherent stage of any real-world application of speech processing. There have been numerous publications concerning various methods on this subject; however, formerly, emphasis was placed primarily on detecting voice messages between long periods of silence or low magnitude noise. If the background noise has lower energy than any material speech occurrence, then a simple thresholding criterion enables a recording or communication system to save energy, computational memory, and network bandwidth. Traditional VAD and noise estimation methods are used in VOIP (Voice over Internet Protocol), as well as various speech-based applications. Along with de-noising and feature extraction, it would be a stage of any speech recognition or voice control system [6].
More sophisticated methods were developed to deal with lower SNR (signal-to-noise ratio). Among more recent work on VAD methods, some were based on combining spectral analysis and statistical methods [7,8]. Closely related to them are the noise reduction techniques, which allow subtracting the noise to increase the intelligibility of the speech signal [9]. Extracting preliminary information from speech signals may include voiced/unvoiced speech classification [10]. Speech detection and de-noising were also considered in the context of the so-called cocktail party problem where audio sources (i.e., speech and noise) are separated through a binaural system consisting of two microphones [11].
The background noise has considerably high energy when recording near a UAV equipped with a multi-rotor propulsion system. It contains the sound of its engine and mechanical parts, such as the propellers, which cannot be damped in its natural operating mode. Methods of dealing with high-level background noise are widely discussed in the literature concerning acoustic measurements on machines with rotating elements, such as turbochargers [12][13][14], among many other examples. The ego-noise of small multi-rotor drones has been investigated by many researchers in recent years. It was thoroughly examined at static thrust [15]. Some recent research considers tracking changes in noise characteristics under realistic flight conditions based on position and noise measurements of a moving drone using both a camera and a multiple-microphone system (called a microphone array) [16].
There are some publications about de-noising methods concerning mainly a hovering drone. The azimuthal direction of sound arrival was analyzed (using both supervised and unsupervised methods) in [17,18] based on a microphone array mounted on a hovering drone. Although the drone operated at static thrust, it was able to track moving sound sources. Analyzing the sound emitted by a UAV has recently attracted interest also in the detection of illegal drone activity [19], where the dynamics of a quadrotor drone were taken into account.
In our experiment, as the drone is moving from one point to another, the background noise is both high-energy and non-stationary. It originates from the variable operating mode of the UAV and the air response, but also from air turbulence. We use only one microphone at a sound source, thus physical direction or distance cannot be tracked, but the distance is changing (in an approximately monotonic way) during our recordings.
When analyzing a non-stationary process, it is essential to perform some sort of time segmentation of the data [20]. Therefore, transformations preserving (to a certain extent) original time structure are very useful. One of the primary methods is representing a signal in the time-frequency domain via Short-Time Fourier Transform (STFT) [21].
To characterize the time intervals containing voice we adopt a method based on STFT which was discussed before in a context of cyclic impulse anomaly detection (local damage in driving units of a belt conveyor) based on calculating distance (metric) between spectral densities [22]. It was stated that informative frequency band selection may strongly affect the analysis based on the time-frequency map. To deal with it we perform both band-pass filtering and spectrogram normalization.
The application of robotics in the mining industry, and especially in missions related to rescue and the search for accident victims, has been well established and constantly developed since the beginning of the third millennium [23][24][25][26][27][28][29][30][31]. The use of UAVs as a support for search and rescue (SAR) operations has been presented, among others, in [32], where some significant considerations are included, such as energy limitations, the quality of sensors, and the exchange of information between UAVs and rescue teams. An example of a fixed-wing UAV system capable of detecting accident victims employing real-time imaging, on-board processing, and assigning coordinates of the detected target can be found in [33]. Both the UAV's hardware and the identification algorithm with a self-adjusting threshold were exhaustively described, providing a valuable overview of the current technical tools useful for SAR purposes in surface conditions. A remote-controlled drone, presented in [34], is an example of how technology of this type can not only prevent mining production obstructions but also increase safety on-site. The UAV used by the authors carried an analog thermal infrared video camera, a digital video recorder, an RGB spectrum digital video camera, and a small battery power source. This solution, based on commercially available devices, allowed evaluation of the subsurface temperatures in a flexible and low-cost way; nevertheless, the system was able to perform only short missions, lasting up to fifteen minutes, due to the limitation of battery weight. The character of sub-surface mining workplaces and the accidents that take place there created a big demand for the introduction of urban SAR robotics in underground applications and their adaptation to solving similar problems in new, more demanding conditions.
Together with maintenance-purpose robots, this technology will allow the limitation of human involvement in the most hazardous areas, contributing to the transformation of the mining sector [27,35]. The vast majority of underground mining-related applications of aerial vehicles proposed by researchers nowadays are aimed at mapping, localization of discontinuities, and general reconnaissance of subsurface areas [36][37][38][39][40]. These are undoubtedly valuable examples of how robotics can facilitate arduous and time-consuming processes, and they provide information about some of the available tools. However, recent development and increased accessibility of robotics in the raw materials sector are also aimed at increasing the safety of underground workers and providing quick and effective help in case of an accident. The detection of a human in an underground environment can be based on a video recording transmitted in real time to an operator [41], automatic detection of humans with the use of image recognition algorithms [28], or, as we propose in this article, human voice detection. Three recent reviews of the usage of unmanned aerial vehicles [42][43][44] give an exhaustive insight into the spectrum of their possible applications in both underground and open-cast mines, including rescue operations. Nonetheless, in most of the cases in which a UAV is considered for application in this area, the sensors with which it is equipped are image-capturing ones. To the best of the authors' knowledge, the problem of voice detection and its extraction from the drone background noise has not been approached in the context described in this work (underground mining rescue action).

Theoretical Background of the Novel Procedure
We propose a procedure involving time-frequency decomposition of a signal. A time-frequency map (spectrogram) of a signal is generated using the Short-Time Fourier Transform (STFT). The STFT of a vector of observations $x_0, \ldots, x_{N-1}$ is a function of discrete-valued time and frequency. It is a DFT of the signal multiplied by a shifted window selecting the observations close to a given time point (equivalently, in terms of filtering, it is a discrete convolution of a frequency-shifted signal with the window function) [45], which is given by the equation

$$\mathrm{STFT}(t, f) = \sum_{n=0}^{N-1} x_n \, w(n - t) \, e^{-2\pi i f n / N}, \tag{1}$$

where $w$ denotes the window function. Both for further analysis and for plotting a spectrogram we will consider the squared absolute value of the STFT and refer to it as a power spectral densities (PSD) matrix of size $T \times F$ containing the values

$$\mathrm{PSD}(t, f) = |\mathrm{STFT}(t, f)|^2. \tag{2}$$

We name its columns (time slices) power spectral densities and its rows (slices for a given frequency) sub-signals. This raises the problem of informative frequency band selection (see also [46][47][48]).
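The construction of the PSD matrix described above can be illustrated with a short NumPy sketch. It is not the authors' implementation: the function name `psd_matrix`, the framing without padding, and the reading of "Kaiser window of order 5" as a Kaiser shape parameter of 5 are assumptions made for illustration.

```python
import numpy as np

def psd_matrix(x, fs, nperseg=1024, overlap=0.8, beta=5.0):
    """Squared-magnitude STFT of x (hypothetical helper, not from the paper).

    Returns frequencies [Hz], frame-centre times [s] and a matrix whose rows
    are sub-signals (one per frequency) and whose columns are time slices.
    """
    x = np.asarray(x, dtype=float)
    # Hop size between consecutive windows; the rounding rule is an assumption.
    hop = nperseg - int(round(overlap * nperseg))
    window = np.kaiser(nperseg, beta)  # beta=5 is an assumed reading of "order 5"
    n_frames = 1 + (len(x) - nperseg) // hop
    # Index matrix: one row of sample indices per frame.
    idx = np.arange(nperseg)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * window
    spectrum = np.fft.rfft(frames, axis=1)  # DFT of every windowed frame
    psd = np.abs(spectrum) ** 2
    freqs = np.fft.rfftfreq(nperseg, d=1.0 / fs)
    times = (hop * np.arange(n_frames) + nperseg / 2) / fs
    return freqs, times, psd.T
```

For a one-second 1 kHz tone sampled at 48 kHz, the largest value in each time slice falls in the frequency bin closest to 1 kHz, which is a quick sanity check of the framing and windowing.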
Hidden determinism associated with the motion of the drone propellers is present in the background noise. It was stated that such a character of data appears with high-energy low-frequency bands. To avoid the necessity of further investigating (see advanced filter design discussed among others by Wodecki [49]) and selecting the band of interest, we decided to use a data-driven method which proved very efficient in the context of cyclic anomaly detection. Reducing the impact of the hidden determinism involves normalization of the sub-signals. As a norm, we use the sum of the values of a sub-signal:

$$\widetilde{\mathrm{PSD}}(t, f) = \frac{\mathrm{PSD}(t, f)}{\sum_{\tau} \mathrm{PSD}(\tau, f)}. \tag{3}$$

The main part of the presented method is calculating the measure of dissimilarity between every spectral density (time slice of the matrix) and a referential spectrum. The Euclidean metric is used, defined for power spectral densities $p(f)$ and $q(f)$ as follows:

$$d(p, q) = \sqrt{\sum_{f} \big(p(f) - q(f)\big)^2}. \tag{4}$$

The sound emitted by the drone is assumed to contain a notable noisy component. To increase the SNR we perform band-pass filtering in the frequency domain. Since the informative part of the human voice does not exceed 4 kHz, this value is used as the upper cutoff frequency. We also add a lower boundary at 0.2 kHz. Therefore, we modify Equation (4) to obtain

$$d(p, q) = \sqrt{\sum_{F_{\mathrm{lower}} \le f \le F_{\mathrm{upper}}} \big(p(f) - q(f)\big)^2}, \tag{5}$$

where $F_{\mathrm{lower}} = 0.2$ kHz and $F_{\mathrm{upper}} = 4$ kHz. For the sake of the visualization of results, we will also filter signals in the time domain.
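The normalization of sub-signals and the band-limited Euclidean metric can be sketched as follows. The function names and the guard against all-zero rows are assumptions introduced for illustration. The matrix layout (rows are frequencies, columns are time slices) follows the description above.

```python
import numpy as np

def normalize_subsignals(psd):
    """Divide every sub-signal (row) by the sum of its values over time."""
    psd = np.asarray(psd, dtype=float)
    row_sums = psd.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0.0] = 1.0  # guard for empty bands (assumption)
    return psd / row_sums

def band_limited_distance(p, q, freqs, f_lower=200.0, f_upper=4000.0):
    """Euclidean distance between two spectra restricted to [f_lower, f_upper] Hz."""
    p, q, freqs = (np.asarray(a, dtype=float) for a in (p, q, freqs))
    band = (freqs >= f_lower) & (freqs <= f_upper)
    return float(np.sqrt(np.sum((p[band] - q[band]) ** 2)))
```

After normalization every sub-signal sums to one, so a frequency band that is loud throughout the record (for instance a propeller harmonic) no longer dominates the distance.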
As the referential spectrum, the mean power spectrum calculated from a reference interval (usually the first few seconds, depending on the particular sound signal) is used. It is assumed that in this time interval only the background noise containing the sound emitted by the drone is present. Thus for every time slice $t$ of a spectrogram we obtain

$$d(t) = d\big(\widetilde{\mathrm{PSD}}(t, \cdot), \mathrm{PSD}_{\mathrm{ref}}\big), \tag{6}$$

where $\widetilde{\mathrm{PSD}}$ denotes the normalized PSD matrix, $d$ is the band-limited Euclidean metric, and for a given reference interval $I$ we have

$$\mathrm{PSD}_{\mathrm{ref}}(f) = \frac{1}{|I|} \sum_{t \in I} \widetilde{\mathrm{PSD}}(t, f). \tag{7}$$

To discriminate between the human voice and the drone noise we set a threshold on the Euclidean distance. Beforehand, we apply moving mean smoothing (with parameter $M$) to the calculated metric:

$$\bar{d}(t) = \frac{1}{M} \sum_{s \in W_M(t)} d(s), \tag{8}$$

where $W_M(t)$ is a window of $M$ consecutive time slices around $t$. As the threshold, the value of mode + standard error was chosen, where the mode is read from the kernel-smoothed empirical PDF (probability density function) as the argument of the maximum value. Values greater than the threshold are associated with the human voice; therefore, the corresponding time interval is assigned to the voice-present cluster and marked with 1. Otherwise, a time interval is assigned to the voice-absent cluster and marked with 0. Final processing of the output includes filling gaps (changing too short series of 0 to 1) and subsequently deleting immaterial voice (changing too short series of 1 to 0). It is therefore necessary to define beforehand the tolerance values Tol 0 and Tol 1 for both steps of this operation. Voice intervals merged in this way are marked in red on the plots. At the same time, the detected intervals are exported to the output audio file. They are separated by a 1-second pause to distinguish them from one another; this is the data that would then be sent by the drone to report an emergency. The outline of our algorithm is as follows:
1. Calculating the STFT of a signal and then its squared absolute value to obtain the PSD(t, f) matrix;
2. Normalization of the sub-signals (rows of the PSD matrix) in the time domain and restriction of the frequency domain;
3. Defining the referential spectrum;
4. Calculating the Euclidean metric at every time interval of the PSD matrix and applying the moving mean smoothing filter;
5. Setting a threshold on the Euclidean metric;
6. Assigning each time interval to the voice-present or voice-absent cluster;
7. Changing clusters in too short time intervals;
8. Visualisation of the voice detection results on a band-pass filtered signal.
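The steps from the referential spectrum to the binary voice labels can be sketched in NumPy as below. This is an illustrative sketch under several assumptions that the text does not settle: the moving mean is centered, the kernel density estimate uses a Gaussian kernel with a rule-of-thumb bandwidth, and "standard error" is taken as the sample standard deviation of the smoothed metric. All function names are hypothetical. The input `psd_band` is a normalized PSD matrix already restricted to the 0.2-4 kHz band, with rows as frequencies and columns as time slices.

```python
import numpy as np

def distance_to_reference(psd_band, ref_mask):
    """Euclidean distance of every time slice to the mean reference slice."""
    reference = psd_band[:, ref_mask].mean(axis=1, keepdims=True)
    return np.sqrt(((psd_band - reference) ** 2).sum(axis=0))

def moving_mean(d, m=20):
    """Centered moving mean (alignment is an assumption); edges are zero-padded."""
    return np.convolve(d, np.ones(m) / m, mode="same")

def kde_mode(values, grid_size=512):
    """Argument of the maximum of a Gaussian kernel density estimate."""
    values = np.asarray(values, dtype=float)
    spread = values.std(ddof=1)
    if spread == 0.0:
        return float(values[0])
    bandwidth = 1.06 * spread * values.size ** (-0.2)  # rule of thumb (assumption)
    grid = np.linspace(values.min(), values.max(), grid_size)
    z = (grid[:, None] - values[None, :]) / bandwidth
    density = np.exp(-0.5 * z ** 2).sum(axis=1)
    return float(grid[np.argmax(density)])

def detect_voice(psd_band, ref_mask, m=20):
    """Return binary labels (1 = voice present), smoothed metric and threshold."""
    d = moving_mean(distance_to_reference(psd_band, ref_mask), m)
    # "Standard error" is interpreted as the sample standard deviation (assumption).
    threshold = kde_mode(d) + d.std(ddof=1)
    return (d > threshold).astype(int), d, threshold
```

Because most time slices contain only drone noise, the mode of the smoothed metric sits near the noise-only level, and the added spread term lifts the threshold above ordinary fluctuations of that level.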
As a standard setting, we use the STFT with a Kaiser window of order 5 and length 1024, which is also the number of points of the FFT (Fast Fourier Transform) algorithm. Other parameters that need to be set are the number of overlapping points in the STFT, the start and end of the reference interval, the number of points in the moving mean filter, and the tolerances (in seconds) for binary merging in the final processing.
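The binary merging controlled by the two tolerances can be sketched as follows. The function names are hypothetical, and two details are assumptions: a run is "too short" when its duration is strictly below the tolerance, and only gaps lying between two voice runs are filled (leading and trailing silence is left alone).

```python
import numpy as np

def _runs(labels):
    """List (value, start, stop) for each run of equal labels (stop exclusive)."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return []
    change = np.flatnonzero(np.diff(labels)) + 1
    starts = np.concatenate(([0], change))
    stops = np.concatenate((change, [labels.size]))
    return [(int(labels[s]), int(s), int(e)) for s, e in zip(starts, stops)]

def merge_short_runs(labels, time_bin, tol0, tol1):
    """Fill short gaps (0 -> 1), then delete short voice runs (1 -> 0).

    time_bin, tol0 and tol1 are given in seconds.
    """
    out = np.array(labels, dtype=int)
    # Step 1: fill interior gaps shorter than tol0.
    for value, s, e in _runs(out):
        if value == 0 and (e - s) * time_bin < tol0 and s > 0 and e < out.size:
            out[s:e] = 1
    # Step 2: remove voice runs shorter than tol1 (runs recomputed after step 1).
    for value, s, e in _runs(out):
        if value == 1 and (e - s) * time_bin < tol1:
            out[s:e] = 0
    return out
```

The order of the two steps matters: filling gaps first lets several short fragments of one call merge into a single run that is long enough to survive the second step.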

The Experiment
In the experiment, a manually controlled drone, the DJI Mavic Mini, was used. It is a small, commercially available quadrotor, weighing 249 g, using 4 pairs of light, 119 mm diameter propellers. This model was chosen for our case study as it represents well the wider class of micro aerial vehicles (MAV), which are, due to their small size and fairly good battery life, the most suitable type of unmanned vehicle to be used in an emergency situation in an underground mine. During normal operation, its brushless motors run at 9000-15,000 RPM (150-250 Hz). Since the experiment was carried out indoors, the propeller guard (provided by the UAV manufacturer) was mounted on the drone to avoid damage in case of the drone colliding with the walls.
To evaluate the possibility of applying a microphone-equipped unmanned aerial vehicle to the detection of humans, two experiments have been performed. The first was a preliminary test performed in the corridors of a building (Figure 1), roughly approximating realistic conditions. The second took place in a slant of an inactive underground mine (Figure 2). In both experiments, a UAV with a microphone located underneath approached a person lying in a side pit of the main corridor and calling for help. In the preliminary experiment, an additional microphone was used, which was placed near the help-caller. The person was located behind the corner and loudly repeated the words 'HELP, HELP'. In the experiment performed in realistic conditions, a deep niche in the corridor was selected as the 'victim's' location. The volume of help-calling, as well as the speed of the drone, were kept as constant as possible throughout the whole experiment, given the technical and organizational capabilities. Some minor variations in both speed and volume of acoustic emission could be caused by the human factor, as well as by variations in the speed of airflow in underground excavations.

Overground Sounds
Data acquired during the first (overground) experiment consist of sounds recorded simultaneously with two microphones located at different places: the first located near the UAV ('next-to-drone' record) and the second near the person who was calling for help ('next-to-victim' record). The lengths of the signals differ (62 s and 66 s, respectively) and their real-time starting moments are shifted relative to each other. Both were recorded with a sampling frequency equal to 48 kHz. The procedure presented above was applied separately to each of the records. Time waveforms of both raw sounds are presented in Figure 3. As we can see, the first of them appears noise-like, while in the second signal the time intervals containing the human voice are clearly visible.
The STFT was calculated with 80% overlap. The resulting time-bin length is 4.3 milliseconds. Spectrograms of the raw signals recorded by both microphones are presented in Figure 4. They illustrate the time-frequency decomposition of the sounds in terms of the log-scale PSD.
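The time-bin length follows from the window length, the overlap, and the sampling frequency: it is the hop between consecutive windows divided by the sampling rate. A minimal sketch, assuming the number of overlapping samples is rounded to the nearest integer (the rounding rule and the function name are assumptions):

```python
def time_bin_ms(fs_hz, nperseg=1024, overlap=0.8):
    """Spacing between consecutive STFT time slices in milliseconds."""
    hop = nperseg - int(round(overlap * nperseg))  # samples between window starts
    return 1000.0 * hop / fs_hz
```

With a 1024-point window, `time_bin_ms(48000)` gives about 4.3 ms for the overground records. The same formula gives 20 ms for 25.6 kHz with 50% overlap and about 8 ms for 25.6 kHz with 80% overlap, matching the values reported for the underground records.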
As stated above, the analysis of the data is based on the normalized spectrogram. This type of spectrogram is shown in Figure 5. It should be noted that the frequencies in this case are restricted to the band 200-4000 Hz because, in line with the band-pass filtering of the data, we set the PSD to 0 outside of this band. In the case of the 'next-to-drone' record, we can distinguish particular moments in the evolution of the power spectrum only after normalization; the changes are visible mostly in a frequency band located around 550-800 Hz. In the spectrograms based on the 'next-to-victim' recordings, on the other hand, the changes are far more evident in numerous bands, since the noise arises only in the ending part.
We visualize band-pass filtering in the time domain in Figure 6. In the case of the 'next-to-drone' record, this processing step makes the shape of the waveform more informative compared with the waveform of the raw signal presented before. The referential power spectral density was defined as the mean over the first 5 seconds. The Euclidean metric as a function of time is shown in Figure 7. The Euclidean metric smoothed with a 20-sample moving mean is highlighted in green. The red line stands for the threshold, which we obtain as the mode + standard error value from the empirical distribution of our metric over the whole time domain. The empirical density is presented in Figure 8.
In Figure 9 the detected intervals of the found voice are highlighted as a part of the plot of the band-pass filtered signal. Tolerances Tol 0 and Tol 1 in final processing were both set to 0.25 s. The output comprises six voice-present regions selected from the 'next-to-drone' record which correspond to the analogous segments present in the 'next-to-victim' record. The last four of them can be heard in the original raw 'next-to-drone' record.
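The export of detected intervals separated by a 1-second pause, mentioned in the description of the procedure, can be sketched as below. The function name is hypothetical, and the mapping of frame `k` to the sample range starting at `k * hop` (ignoring the window length) is a simplifying assumption.

```python
import numpy as np

def concatenate_detections(x, fs, labels, hop, pause_s=1.0):
    """Join the samples of voice-present runs, separated by silent pauses."""
    x = np.asarray(x)
    labels = np.asarray(labels, dtype=int)
    # Pad with zeros so that runs touching either end are also delimited.
    padded = np.concatenate(([0], labels, [0]))
    edges = np.flatnonzero(np.diff(padded))
    starts, stops = edges[0::2], edges[1::2]  # frame indices, stop exclusive
    pause = np.zeros(int(round(pause_s * fs)), dtype=x.dtype)
    pieces = []
    for k, (s, e) in enumerate(zip(starts, stops)):
        if k > 0:
            pieces.append(pause)
        pieces.append(x[s * hop:e * hop])
    return np.concatenate(pieces) if pieces else x[:0]
```

The resulting array could then be written to an audio file, for example with `scipy.io.wavfile.write`, although the text does not state which tool was used for the export.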

Underground Sounds
Data acquired during the second (underground) experiment include only sounds recorded nearby the drone with the sampling frequency equal to 25.6 kHz. From several sounds recorded, two were chosen for the analysis.
First, we analyze a short signal with a frequent and clearly audible human voice. The signal is 34 s long and we present it in Figure 10. Its spectrogram is presented in Figure 11. The spectral structure is dominated by the noise from the drone; however, some voice-related signature is detectable, so it could be a basis for automatic signal processing. Since the voice regions are frequent in this realization of the experiment, this time we used a lower overlap of 50% so as not to blur the results. The time moments at which the voice was heard can be seen in the spectrogram, even more clearly after filtering and normalization (Figure 12). A particularly non-stationary behavior of the drone alone in the ending time interval (which is also audible in the record) was successfully removed by band-pass filtering, as can be seen in the plot visualizing band-pass filtering of the raw sound (Figure 13).
To determine the time intervals which correspond to the rescued person calling, we calculate the Euclidean metric as before. The length of the time bin is 20 ms, resulting from the parameters of the STFT computation. The referential spectrum is calculated as the mean PSD from only the first 2 seconds (so as to catch an interval without voice). The calculated metric and the determined threshold are presented in the time domain and on a kernel-smoothed density (Figures 14 and 15). The moving mean smoothing parameter was again set to 20. In both figures the chosen threshold equal to mode + standard error is shown. The detected voice time intervals (Figure 16) are fully consistent with all seven calls for help, which are audible in the raw sound (using Tol 0 and Tol 1 equal to 0.5 s).
The second sound signal is 90 s long and has a low signal-to-noise ratio (Figure 17). The way of approaching the rescued person was similar to that in the overground experiment (the background noise, though, differs due to the underground environment and the different microphone used). The STFT is computed with 80% overlap. Its time-frequency characteristic is very volatile, though harmonics, i.e., horizontal lines, can be distinguished (Figures 18 and 19). Even after filtering, the signal still has a very noise-like shape (Figure 20). Only some slow changes in mean magnitude (corresponding also to the power) are visible.
Time bins for the calculation of the spectral metric are 8 ms long. The smoothed (with a 20-sample moving mean as before) Euclidean metric referring to the mean power spectrum from the first 5 s is particularly high in the ending seconds (Figure 21). The threshold was in this case set using the mean value instead of the mode value. The kernel-smoothed PDF is presented in Figure 22.
In the set of the marked intervals no clear pattern can be seen (Figure 23). Nevertheless, listening to the detection output and comparing it with the input raw sound allows us to state that all audible voice is detected (included in the last 4 intervals). The parameters Tol 0 and Tol 1 were again fixed to 0.5 s.

Validation Experiments
To evaluate the efficiency of the presented method, several experiments in lab conditions have been performed. We investigated the impact of the drone's flight conditions on the performance of our algorithm. During the validation experiments, we additionally tested two other drones of different sizes (see Figure 24) to evaluate how the noise generated by their rotors, which differ in power, masks the informative components related to a human voice. The second drone used for validation purposes is, similarly to the one used in the original experiment, a commercial quadcopter, the DJI Phantom 4 Advanced, with dimensions of 350 × 350 mm (the dimensions of the Mavic Mini are 160 × 202 mm). The third one was a custom UAV, designed according to predefined specifications. Its rotor spacing is 900 mm. All sounds presented in this section were recorded using the same microphone with a sampling frequency equal to 25.6 kHz.
First, almost stable flight conditions were kept (see spectrogram in Figure 25). This means that the drone's ego-noise spectrum is more or less constant.
Next, we intentionally introduced maneuvers during the flight to assess the variability of acoustic signal parameters. As predicted, changes of flight parameters modify the spectrum significantly, introducing AM/FM modulations in the signal (see spectrogram in Figure 26).
The introduced changes in flight parameters make human voice activity detection much more difficult. When the drone's flight parameters are approximately constant, the effectiveness of the algorithm is very high (see Figure 25). As can be seen in Figure 26, the proposed method faces segmentation problems while the drone is performing the additional maneuvers. First of all, not every voice activity has been detected. Secondly, the noise generated by the drone is incorrectly detected as a human voice. To simplify the problem, for the two new drones (medium size and large size) we decided to analyze segments with stable flight conditions, ignoring those with rapid maneuvers, i.e., a strongly time-varying spectrum. The justification of this approach is that the dimensions of underground corridors are limited, hence the possibility of performing complex drone maneuvers during a flight is significantly limited. To reduce the risk of the drone being damaged in a collision with excavation walls or other elements in the excavation area, the drone's flight speed must not be too high or variable, and the maneuvers should not be sudden. This translates into the type of recorded noise, the energy of which will not be high and the spectrum of which will not be highly variable in time. The above factors, although they have an impact on the effectiveness of the proposed method, will not occur often in the real conditions in which the method is to be applied.
Results of additional experiments (focused on validation of the concept) are presented below in Figures 27 and 28. In Figure 27a, the raw signal is presented. Only the part of the signal plotted in black is analyzed using the proposed algorithm. Figure 27b presents the spectrogram of the analyzed signal, where a significant change in spectral content is clearly visible at ca. T = 5, 10, 15, 22, and 27 s. Figure 27c presents the Euclidean distance estimated for each spectrum. The threshold resulting from the PDF of such a measure is also plotted as a red line. Based on the proposed decision-making scheme, the final picture with detection results is presented in Figure 27d. All voice activities have been successfully detected. By analogy, the same procedure has been applied in the third case (Figure 28). In Figure 28a a raw signal within the selected segment is presented. One may notice that for signals above T = 40 s its amplitudes are much higher; we did not select this part for further analysis. The max power of the drone and noise level were significantly higher, resulting in lower SNR. We ignored this segment since in an underground mine a drone ought to be steered very carefully and one should not risk flying with such parameters. The spectrogram (Figure 28b) for the 3rd case (selected part) is clear enough and allows detection of voice activities (Figure 28c). The final detection result visualization is presented in Figure 28d.
The results achieved in this case can raise concerns regarding the utility of the presented method when using bigger UAVs. Nonetheless, taking into consideration the spatial restraints in underground mines, such drones are not recommended in this specific environment (bigger UAVs are much more cumbersome to steer and cannot reach as many locations as smaller ones).
One may conclude that, based on the obtained results, the signal having its origin in a human voice is detectable in each case. Three different noise characteristics have been considered. During maneuvers (turning left/right, acceleration, etc.) the structure of drone ego-noise is time-varying and indeed it makes voice detection much more complicated. Given the aforementioned, we decided to skip the highly non-stationary segments during the detection procedure.
Figure 28. Results obtained with the large size drone: (a) raw sound recorded in the presence of a medium-size drone, (b) normalized and band-pass filtered spectrogram selected for analysis, (c) estimated distance from reference spectrum with estimated threshold, (d) detection results.

Discussion and Conclusions
A concept of using drones for rescue operations in underground mines has been presented, assuming a situation in which an accident victim (unable to leave the accident location unaided) is capable of calling for help. The main problem to be solved in order to ensure human detection with a microphone-equipped UAV was distinguishing a characteristic sound in the presence of heavy, time-varying noise. Obtaining an automatic signal processing procedure for this task is therefore a valuable achievement, not only from the point of view of a rescue operation.
Our voice detection method tends to capture long phrases in full, while a simple word /help/ is perceived mostly as /he/ or /e/. Hence, high-energy voiced speech is detected better than low-energy, noise-like unvoiced speech, which is masked by the background noise. An improvement of the presented method can be sought in detecting voiced speech components that have a harmonic structure. It should be noted that any acoustic source that is characteristic of a given situation and easily separable from drone-induced noise can be used to identify the event of interest. Taking this into consideration, future work will focus on identifying other sound sources that are event-informative and distinguishable from noise, together with generalizing the presented signal processing technique. Such signals can be emitted by, for instance, electronic devices carried by underground workers. The characteristic spectral content obtained this way would be easier to detect than the human voice. The detection of the human voice would then constitute a backup scenario in case of a lost or faulty sound-emitting device.
The core of this research work, the procedure proposed in the article, is relatively simple: it hinges on a typical time-frequency representation of the registered signal and a comparison of its instantaneous spectrum with a reference spectrum that is known to contain only the drone's sound and not the signal sought. It relates directly to the novelty-detection approach, namely, if a spectrum of unknown or unexpected structure emerges in the analyzed signal, there is a high probability that it is related to an event of interest. Taking into account that during a rescue action the presence of various noise sources is limited (due to stopped machinery, for instance), the abovementioned assumption can be considered legitimate. It should be underlined that what is of the highest importance is not detecting and distinguishing a specific human voice or phrase, but rather a quick and reliable segmentation of useful data (containing more than just a drone's noise) from a big dataset created with the use of a team of microphone-equipped UAVs.
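The procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the frame length, the 300 to 3000 Hz band, the use of the per-bin median over all frames as the drone-only reference spectrum, and the median-plus-MAD threshold are all assumptions made for illustration (in the paper the threshold results from the PDF of the distance measure). The synthetic "drone" and "voice" signals are likewise hypothetical stand-ins for real recordings.

```python
import numpy as np


def stft_magnitude(x, fs, nperseg=512, hop=256):
    """Hann-windowed short-time FFT magnitudes, one row per time frame."""
    window = np.hanning(nperseg)
    n_frames = 1 + (len(x) - nperseg) // hop
    frames = np.stack([x[i * hop:i * hop + nperseg] * window
                       for i in range(n_frames)])
    freqs = np.fft.rfftfreq(nperseg, d=1.0 / fs)
    times = (np.arange(n_frames) * hop + nperseg / 2) / fs  # frame centres [s]
    return freqs, times, np.abs(np.fft.rfft(frames, axis=1))


def detect_novel_frames(x, fs, band=(300.0, 3000.0), k=5.0):
    """Flag frames whose spectrum is far from a drone-only reference spectrum."""
    freqs, times, mag = stft_magnitude(x, fs)
    # Band-pass: keep only the bins inside the (assumed) band of interest.
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    spec = mag[:, in_band]
    # Normalize each instantaneous spectrum to unit Euclidean norm.
    spec = spec / (np.linalg.norm(spec, axis=1, keepdims=True) + 1e-12)
    # Assumption: most frames contain drone noise only, so the per-bin
    # median over time serves as the reference spectrum.
    reference = np.median(spec, axis=0)
    # Euclidean distance of every instantaneous spectrum from the reference.
    dist = np.linalg.norm(spec - reference, axis=1)
    # Assumption: robust threshold from the distance distribution
    # (median + k scaled MADs) instead of the paper's PDF-based threshold.
    med = np.median(dist)
    mad = np.median(np.abs(dist - med))
    threshold = med + k * 1.4826 * mad
    return times, dist, threshold, dist > threshold


def synthetic_recording(fs=8000, duration=5.0, seed=0):
    """Hypothetical test signal: harmonic 'drone' hum, noise, and a tone burst."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(fs * duration)) / fs
    drone = sum(np.sin(2 * np.pi * 200 * h * t) / h for h in range(1, 6))
    x = drone + 0.3 * rng.standard_normal(t.size)
    burst = (t >= 2.0) & (t < 2.5)  # stand-in for a call for help
    x[burst] += np.sin(2 * np.pi * 1500 * t[burst])
    return x, fs


x, fs = synthetic_recording()
times, dist, threshold, flags = detect_novel_frames(x, fs)
print("flagged frame centres [s]:", np.round(times[flags], 2))
```

The median reference only works while the sought sound occupies a minority of the frames; for long or frequent calls, a reference taken from a segment known to contain drone noise only would be the safer choice.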
Several laboratory experiments and one conducted in realistic underground mine conditions proved that the developed procedure gives satisfactory results and is a promising basis for the development of a useful support tool for rescue operations. Its usage can also be extended to other cases in which drone-aided inspection would be safer or more efficient, which is common in underground mining.
The presented study is a preliminary test of a concept to be further developed and evaluated in various use-case scenarios under different conditions. The method was tested with various drones having different engine powers and emitting noise of various levels. One of the key issues is the identification of drone noise. In general, noise from a flying drone is expected to be a non-stationary signal with a time-varying structure, especially when flight parameters are changing. At this stage, to simplify the signal processing procedure, the flight conditions were kept as stable as possible. No maneuvers, except take-off and landing, have been considered. During the flight, the speed was relatively low (which also lowers the noise) and as stable as possible. Ultimately, an autonomous flight could be stabilized by an algorithm so that the variability of speed would be negligible. This simplifies the structure of the instantaneous spectrum and makes detection easier. The speed has not been measured directly, but the spectrogram shows that it is nearly stable (frequencies related to rotor speed do not change much).
Even though the method can be described as 'self-calibrating' (due to the automated calculation of the reference spectrum), repeating the experiment numerous times could be beneficial, since it would allow the method's response and accuracy to be examined and would give very significant information regarding the so-called 'false alarm ratio'. In this research, we were not able to estimate the relation between the signal-to-noise ratio and the probability of detection; hence, this is assumed to be the basis of future work. Implementation of the proposed method in a real-case scenario of a rescue action conducted in an underground mine must take into account the possible detection of a signal that does not originate from the expected voice-related source. According to the specific procedures concerning rescue actions, all activities in the mine are supposed to be stopped until the end of the rescue action. There is therefore nearly no chance of recording noise other than that from the drone's operation and a potential human calling for help. In addition, false detections in such a situation are of low probability, and since the proposed method is intended to support rescue teams, every potential detection will be verified anyway.
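If repeated trials with frame-level ground truth were available (for example, from the synchronized microphone carried by the "injured" person), the quantities mentioned above could be estimated along the following lines. This is only a sketch: the function name is hypothetical, and the definitions used here (probability of detection as the fraction of voice frames that are flagged, false-alarm ratio as the fraction of noise-only frames that are flagged) are assumptions, since the paper does not define them formally.

```python
import numpy as np


def frame_level_scores(detected, truth):
    """Return (probability of detection, false-alarm ratio) for boolean frames.

    detected: frames flagged by the detector.
    truth:    frames that really contain the sought sound (assumed known).
    """
    detected = np.asarray(detected, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    if detected.shape != truth.shape:
        raise ValueError("detected and truth must have the same shape")
    n_voice = int(truth.sum())
    n_noise = int((~truth).sum())
    hits = int((detected & truth).sum())
    false_alarms = int((detected & ~truth).sum())
    # Return NaN when a class is absent, so the ratio is not misleading.
    p_detect = hits / n_voice if n_voice else float("nan")
    p_false = false_alarms / n_noise if n_noise else float("nan")
    return p_detect, p_false


# Hypothetical example: six frames, three of which contain a voice.
p_detect, p_false = frame_level_scores(
    detected=[True, True, False, False, True, False],
    truth=[True, True, True, False, False, False],
)
```

Repeating this over recordings grouped by SNR would give an empirical curve of detection probability against SNR, which is the relation left for future work.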
Another, more complicated issue is missed detection due to the poor SNR of the acquired signal. This could be mitigated by decreasing the distance between the sound source and the drone. It has been shown that if the drone is closer to the source, the signal is more visible in the noise. The task here is not to detect all signatures, so classical measures of detection efficiency are not relevant. The main target is for the drone to provide data before the rescue team is able to carry out the rescue action.
Employing an autonomous UAV should be considered in order to reduce human presence in the most dangerous places as much as possible. It should also be stated that detection ability depends on several factors: drone noise level, human voice source level, the distance between the voice source and the drone with a microphone, etc. One could expect precise information on how close a drone should be to detect a human voice based on acoustic signals. This can be determined experimentally on the basis of many experiments in the mine. It is important to note that voice propagation in a tunnel may be slightly different due to tunnel geometry and surface properties.