3.2. Drone Class Classification without Augmentation
To establish a baseline for drone sound classification, four distinct classifiers were trained without data augmentation techniques under seemingly identical conditions. Despite using the same script for each run, significant outcome variations were observed across classifiers. Confusion matrices in Figure 2 illustrate these differences, with the accuracy for correctly classifying C0 drones fluctuating between 82.8% and 87.2% and C1 drones between 87.4% and 93.7%. Variations were also notable in the more nuanced classifications of C2 and C3 categories.
The observed inconsistencies in classifier outcomes are attributed to the stochastic nature of ML model training processes, including random weight initialization and inherent probabilistic elements in learning algorithms. These variabilities significantly impact ML models’ performance and generalization capabilities. The baseline experiment without data augmentation highlights the importance of considering these stochastic processes during model training, which can cause notable performance variability, even with identical setups. This underscores the necessity for meticulous experimental design, such as using fixed seeds for random number generators, to ensure reproducibility and reliability in ML research.
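The effect of a fixed seed can be illustrated with a minimal sketch. It uses only Python's standard `random` module; `init_weights` is a hypothetical stand-in for a network's random weight initialization, not the training script used in this study. In a real ML framework, data shuffling, dropout, and GPU kernels typically need to be seeded separately, so a fixed seed alone may not remove all run-to-run variation.

```python
import random


def init_weights(n_weights, seed=None):
    """Draw toy 'initial weights' (hypothetical stand-in for a network init)."""
    rng = random.Random(seed)  # seed=None -> seeded from OS entropy or time
    return [rng.gauss(0.0, 0.1) for _ in range(n_weights)]


# Same seed -> identical starting point on every run.
seeded_a = init_weights(5, seed=1)
seeded_b = init_weights(5, seed=1)

# No seed -> a different starting point on practically every run.
unseeded_a = init_weights(5)
unseeded_b = init_weights(5)

print("seeded runs identical:  ", seeded_a == seeded_b)
print("unseeded runs identical:", unseeded_a == unseeded_b)
```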
Table 3 summarizes the performance of sixteen classifiers, divided into four groups with four classifiers each, trained under identical conditions without data augmentation. These groups are differentiated by the random seed initialization used: no seed, seed initialized to 1, seed initialized to 2, and seed initialized to 3. The classifiers within each seed group were trained to assess the impact of controlled initial conditions on the consistency of performance metrics.
The group without seed initialization showed significant variability in performance, with the standard deviation of accuracy across classes (C0 to C3) averaging 1.2% for precision and 1.0% for recall. This reflects the stochastic influence on classifiers when random processes are not controlled. In contrast, the seeded groups exhibited more consistency, with standard deviations in accuracy reduced to between roughly 0.5% and 1.0%.
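The spread reported per group can be computed as sketched below. The numbers are placeholder values chosen for illustration and are not taken from Table 3; the use of the sample standard deviation (`statistics.stdev`) is also an assumption, since the text does not state which estimator was used.

```python
import statistics

# Placeholder per-class accuracies (%) for four runs of one seed group.
# Illustrative only; these are NOT the values from Table 3.
runs = [
    {"C0": 83.0, "C1": 88.0, "C2": 76.0, "C3": 81.0},
    {"C0": 85.0, "C1": 90.0, "C2": 75.0, "C3": 80.0},
    {"C0": 87.0, "C1": 92.0, "C2": 77.0, "C3": 82.0},
    {"C0": 85.0, "C1": 90.0, "C2": 76.0, "C3": 81.0},
]


def per_class_std(runs):
    """Sample standard deviation of each class's accuracy across runs."""
    return {c: statistics.stdev([r[c] for r in runs]) for c in runs[0]}


def mean_std(runs):
    """Average of the per-class standard deviations (one number per group)."""
    return statistics.mean(per_class_std(runs).values())


print(per_class_std(runs))
print(round(mean_std(runs), 2))
```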
Although only four classifiers were trained per seed, which may be considered a limited sample for statistical robustness, the results demonstrate a clear trend. Classifiers with controlled seed initialization yielded more consistent accuracies, suggesting that non-random weight initialization can lead to more reliable classification results. The observed trend, despite the small sample size, underscores the potential influence of controlled initial conditions on model performance. Further research with a larger number of classifiers per seed could provide additional insights into the effects of weight initialization on classifier performance.
The analysis of these outcomes highlights the necessity for careful consideration of initialization processes in ML classifiers, acknowledging the balance between random variability and the quest for replicable results. It is generally important to use randomness in ML to provide the best stability and robustness for most neural networks. For the purpose of investigating the influence of adjustments such as augmentations on general performance, however, fixed-seed initialization could reduce noise and increase comparability.
The performance of the first drone classifier, with seed 1, as detailed in Table 3, was evaluated by classifying a C3 drone (HP-X4 2020) from an outdoor experiment. The outcomes of this real-world application are depicted in Figure 3.
The spectrogram in the upper section of Figure 3 clearly illustrates a typical acoustic footprint of the drone’s activity. The typical acoustic spectrum of a drone is characterized by a distinctive pattern of harmonics across mid to high frequencies, often with peaks in lower frequencies generated by the rotors and motors. The drone initiated movement at around 7.5 s, with a stationary phase until approximately 25.5 s, and subsequently moved away from the microphone. Its farthest distance from the microphone, where the acoustic signature is weakest, was reached at around 52 s before it began its return journey. The drone passed directly overhead at 66.5 s and finally landed at 109 s.
The classifier’s temporal predictions, depicted in the bottom panel, segmented the audio signal into one-second intervals for classification. Although the classifier consistently identified drone presence, it erroneously classified the drone as C0 in 72.1% of detections, accurately recognizing it as C3 in only 9.9% of instances. No detections occurred before the drone’s takeoff, while it was distant, or after landing. The high misclassification rate was likely influenced by environmental factors. The classifier, trained on drone noise recorded in an anechoic chamber, struggled against real-world environmental variations like reflections and ambient sounds, underscoring the necessity for appropriate augmentation for generalization.
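The per-second evaluation described here can be sketched as follows. The helper names and the `"No Drone"` label string are assumptions made for illustration; the classifier itself is not shown, and the example labels are made up rather than taken from Figure 3.

```python
from collections import Counter


def split_into_seconds(samples, sample_rate):
    """Cut a signal into consecutive one-second frames.

    A trailing frame shorter than one second is dropped.
    """
    n_frames = len(samples) // sample_rate
    return [samples[i * sample_rate:(i + 1) * sample_rate]
            for i in range(n_frames)]


def label_shares(labels, ignore="No Drone"):
    """Percentage of each class among the frames where a drone was detected."""
    detected = [label for label in labels if label != ignore]
    if not detected:
        return {}
    counts = Counter(detected)
    return {label: 100.0 * c / len(detected) for label, c in counts.items()}


# Made-up per-second predictions for a short clip (illustration only).
predictions = ["No Drone", "C0", "C0", "C3", "C1"]
print(label_shares(predictions))
```

Computing the shares relative to detections only, rather than to all frames, matches the way the 72.1% and 9.9% figures are phrased above.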
It is noteworthy that the ‘Drone’ vs. ‘No Drone’ classifier’s performance was not particularly impressive, for instance in the time frame of 74 to 81 s, where the drone’s presence can unmistakably be seen in the spectrogram. However, given the study’s focus on augmentation techniques, this initial classifier was continuously used to ensure comparability across different augmentation methods.
3.3. Augmentations
3.3.1. Harmonic Distortions
The study investigated the impact of harmonic distortions on drone classification accuracy by varying distortion levels from 0% (no augmentation) to 63%, with 7% increments, based on preliminary findings that showed a decline in performance beyond 50% distortion. This approach allowed for a detailed exploration within a manageable framework, training approximately 10 classifiers for each augmentation level. The findings, summarized in Table 4, indicated that slight to moderate distortions, particularly between 7% and 14%, could enhance accuracy. Such levels of distortion may mimic the variety of sound qualities UAVs produce under different operational conditions, thereby potentially increasing the model’s adaptability to real-world situations.
The study discovered that increasing harmonic distortion levels initially boosted drone classification accuracy, peaking between 7% and 14%. This suggests that moderate levels of distortion more closely mimic the real-world acoustic conditions drones encounter, thus improving the model’s generalization from the augmented training data. Importantly, augmentation was applied solely to the training data, not the validation set, to ensure the model was evaluated against unaltered, real-world data for an accurate capability assessment. Beyond the optimal distortion range, accuracy decreased, indicating that excessive distortion introduces noise, hindering correct classification. This finding emphasizes the necessity for a balanced harmonic distortion application to preserve classification integrity.
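As an illustration of this kind of augmentation, the sketch below blends a signal with a nonlinearly shaped copy of itself, which adds harmonics of the original frequencies. How the distortion percentage is defined in the study is not stated in this section, so the tanh waveshaper, the `drive` parameter, and the interpretation of the level as a blend fraction are all assumptions. The toy sine stands in for a drone recording.

```python
import math


def add_harmonic_distortion(samples, level, drive=3.0):
    """Blend a signal (peak <= 1) with a tanh-shaped copy of itself.

    level: 0.0 leaves the signal untouched, 1.0 uses only the shaped copy.
    drive: hypothetical steepness of the nonlinearity (assumption).
    """
    norm = math.tanh(drive)  # keeps the shaped copy within [-1, 1]
    return [(1.0 - level) * s + level * math.tanh(drive * s) / norm
            for s in samples]


# Distortion grid from the text: 0% to 63% in 7% steps.
levels = [p / 100 for p in range(0, 64, 7)]

# Toy data: a 100 Hz sine sampled at 8 kHz (stand-in for a recording).
sine = [0.8 * math.sin(2 * math.pi * 100 * n / 8000) for n in range(800)]
train_set, validation_set = [sine], [sine]

# Augment the training data only; the validation data stays unaltered.
augmented_train = [add_harmonic_distortion(x, 0.14) for x in train_set]
print(len(levels), max(abs(s) for s in augmented_train[0]) <= 1.0)
```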
However, the investigation into the optimal distortion range’s impact on outdoor experiments did not yield a significant performance improvement. This outcome, while not presented due to the lack of substantial enhancement, underscores the challenge of applying controlled environment improvements to outdoor scenarios. It emphasizes the complexities involved in acoustic drone classification under real-world conditions and underscores the critical importance of comprehensive model validation strategies.
3.3.2. Environmental Noise
A rigorous investigation of the effects of environmental noise augmentation on training data was conducted, with the aim of determining the impact of different noise intensities on the classifier’s accuracy. The levels of noise introduced varied from 0% to 72%. This specific range and increment step were informed by preliminary studies, which demonstrated a clear degradation in classifier performance with noise augmentations exceeding 50%. To cover the critical range effectively, we employed incremental steps of 8%, allowing us to train 10 classifiers for each augmentation technique and to explore the impact of different noise intensities on classification accuracy comprehensively. The results summarized in Table 5 indicate that the incorporation of noise generally results in a decrease in classification accuracy. This finding is consistent with the discussion in the methods section, which focused on the selective application of augmentation to training data. It is based on the understanding that augmented data may not always accurately replicate real-world conditions [24].
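A minimal sketch of noise augmentation at a given level is shown below. It assumes that a level of, for example, 24% means the added noise has 24% of the signal's RMS amplitude; the section does not define the percentage, so this reading is an assumption. Gaussian white noise is used here only as a stand-in for recorded environmental noise.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a non-empty signal."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def add_noise(samples, noise, level):
    """Mix noise into a signal so that noise RMS = level * signal RMS.

    The noise is repeated cyclically if it is shorter than the signal.
    """
    noise_rms = rms(noise)
    if level == 0 or noise_rms == 0:
        return list(samples)
    gain = level * rms(samples) / noise_rms
    return [s + gain * noise[i % len(noise)] for i, s in enumerate(samples)]


# Noise grid from the text: 0% to 72% in 8% steps.
levels = [p / 100 for p in range(0, 73, 8)]

rng = random.Random(0)  # fixed seed so the example is repeatable
signal = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(8000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]  # stand-in noise

noisy = add_noise(signal, noise, 0.24)
added = [m - s for m, s in zip(noisy, signal)]
print(len(levels), round(rms(added) / rms(signal), 2))
```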
The classifier performance remained stable up to 32% noise, suggesting that controlled noise might enhance real-world robustness. Beyond this, accuracy dropped significantly, especially above 40%. Figure 4 shows that applying 24% noise (blue circles) improved the C3 drone classification of the above-mentioned example to 29.7%, a significant increase over the non-augmented scenario (black dots) in Section 3.2. With 32% noise (red crosses), correct classifications still occurred in 25.1% of instances.
The increase in classification accuracy observed at higher noise levels is intuitive, reflecting the outdoor measurement conditions layered with background noises. This contrasts with the anechoic chamber’s measurements, which lack such ambient sounds and served as the basis for training data. The incorporation of environmental noise through augmentation closely mirrors actual conditions, emphasizing the relevance of the validation methodology described in the methods section. Augmentation was applied exclusively to the training data to keep the validation set realistic and free from potential bias-inducing artifacts [24]. This cautious approach ensured the evaluation of classifier performance using unaltered, real-world data, leading to a more accurate determination of their effectiveness.
3.3.3. Pitch Shifting
The assessment of how pitch augmentation affects classifier performance was done by altering the maxPitch parameter, which defines the pitch change limits for each audio segment from −maxPitch to +maxPitch semitones, with 0 indicating no change. This parameter ranged from 0 to 2.5 semitones. The results, detailed in Table 6, revealed varied impacts on performance. An analysis was conducted on how pitch affects accuracy for different drone categories.
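The role of maxPitch can be sketched as follows. A shift of n semitones corresponds to a frequency ratio of 2^(n/12). The sketch draws the shift uniformly from −maxPitch to +maxPitch (the distribution is an assumption) and applies it by simple resampling with linear interpolation, which also changes the segment's duration. The pitch-shifting algorithm actually used in the study is not described in this section and may preserve duration.

```python
import random


def semitones_to_ratio(semitones):
    """Frequency ratio for a shift in semitones (12 semitones = one octave)."""
    return 2.0 ** (semitones / 12.0)


def pitch_shift_by_resampling(samples, semitones):
    """Shift pitch by reading the signal at a different speed.

    Uses linear interpolation; the output length changes with the ratio.
    """
    if not samples:
        return []
    ratio = semitones_to_ratio(semitones)
    out_len = int((len(samples) - 1) / ratio) + 1
    last = len(samples) - 1
    out = []
    for n in range(out_len):
        pos = n * ratio          # fractional read position in the input
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, last)]
        out.append((1.0 - frac) * samples[i] + frac * nxt)
    return out


def random_pitch_shift(samples, max_pitch, rng):
    """Apply a shift drawn uniformly from [-max_pitch, +max_pitch] semitones."""
    semitones = rng.uniform(-max_pitch, max_pitch)
    return pitch_shift_by_resampling(samples, semitones), semitones


rng = random.Random(0)
ramp = [float(n) for n in range(101)]
shifted, used = random_pitch_shift(ramp, 1.4, rng)
print(round(used, 3), len(shifted))
```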
The investigation into pitch augmentation’s effect on classifier accuracy revealed complex outcomes. The analysis, as shown in Table 6, indicates that minor pitch adjustments, up to +/−0.4 semitones, have minimal impact on precision, whereas larger alterations lead to reduced accuracies. The choice of +/−1.4 semitones as the augmentation level for the outdoor drone model ‘HP-X4 2020’ might therefore appear contradictory given its performance in Table 6. This decision was grounded in a comprehensive examination of the augmentation effects across all classifiers on real-world examples. It was observed that, despite the seemingly counterintuitive selection based on Table 6’s data, augmenting the pitch by +/−1.4 semitones significantly improved classification accuracy by up to 40%, as depicted in Figure 5. This substantial improvement, illustrated with blue circles, contrasts sharply with the 9.9% accuracy (represented by black dots) observed without augmentation, as previously noted in Section 3.2. Augmentations beyond +/−1.4 semitones also demonstrated substantial accuracy enhancements compared to scenarios without augmentation. The selection was thus based on detailed assessments of the augmentations’ impacts, identifying +/−1.4 semitones as the most effective for enhancing classifier performance in real-world settings (for this particular example), underscoring the broader applicability and importance of pitch augmentation.
Acknowledging the reliance on accuracy metrics from training, where augmented values were compared with non-augmented ones from identical measurements, reveals a methodological limitation in evaluating the augmentation technique’s efficacy. This shortfall highlights the need for broader assessments of augmentation strategies, particularly pitch adjustments, to accurately gauge their benefits and constraints in enhancing drone sound classification under diverse real-world scenarios.
3.3.4. Delays
The study’s investigation into the effects of introducing audio delays of 15 ms to 27 ms, with amplitudes varying from 30% to 90%, aimed to mimic real-world acoustic phenomena like echoes and ground reflections. However, no specific trend was observed in classification accuracy across different delay levels, suggesting minimal impact; the fluctuations that did occur are possibly due to noise rather than a systematic influence on performance.
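A single-reflection delay of this kind can be sketched as y[n] = x[n] + a·x[n − d], where d is the delay in samples and a the relative amplitude. The code below is a minimal illustration under that assumption; the function names are hypothetical, and drawing the delay and amplitude uniformly from the stated ranges is also an assumption.

```python
import random


def add_delay(samples, sample_rate, delay_ms, amplitude):
    """Add one delayed, scaled copy of the signal to itself.

    Implements y[n] = x[n] + amplitude * x[n - d], with d in samples.
    """
    d = round(delay_ms * sample_rate / 1000)
    return [s + (amplitude * samples[n - d] if n >= d else 0.0)
            for n, s in enumerate(samples)]


def random_delay(samples, sample_rate, rng):
    """Draw delay (15-27 ms) and amplitude (30-90%) from the stated ranges."""
    delay_ms = rng.uniform(15.0, 27.0)
    amplitude = rng.uniform(0.3, 0.9)
    return add_delay(samples, sample_rate, delay_ms, amplitude)


# A unit impulse makes the echo directly visible in the output.
impulse = [1.0] + [0.0] * 99
echoed = add_delay(impulse, sample_rate=1000, delay_ms=15, amplitude=0.5)
print(echoed[0], echoed[15])

augmented = random_delay(impulse, 1000, random.Random(0))
print(len(augmented))
```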
The analysis of delay augmentation is uniquely dependent on the specific measurement context, including the microphone–drone relative positions and surface reflectivity. Real-world conditions can produce time differences between the direct signal and its reflection of up to 30 ms, with amplitude variations based on the reflection coefficient of surfaces. This highlights the importance of incorporating a broad spectrum of delay variability to accurately reflect real-world scenarios.
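The dependence on geometry can be made concrete with a simple image-source estimate for a single ground reflection. The heights, distances, speed of sound (343 m/s), and reflection coefficient below are illustrative assumptions and not measurement parameters from this study. The sketch treats the ground as a flat mirror and scales the reflected amplitude by the reflection coefficient and by spherical spreading.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees C


def ground_reflection(mic_height, drone_height, horizontal_distance,
                      reflection_coefficient=1.0):
    """Delay (ms) and relative amplitude of a ground reflection.

    The reflected path is modeled as coming from a mirror image of the
    microphone below a flat ground plane.
    """
    direct = math.hypot(horizontal_distance, drone_height - mic_height)
    reflected = math.hypot(horizontal_distance, drone_height + mic_height)
    delay_ms = 1000.0 * (reflected - direct) / SPEED_OF_SOUND
    # Spherical spreading: amplitude falls off with 1 / path length.
    amplitude = reflection_coefficient * direct / reflected
    return delay_ms, amplitude


# Hypothetical setup: microphone at 1.5 m, drone 10 m directly overhead.
delay_ms, amplitude = ground_reflection(1.5, 10.0, 0.0)
print(round(delay_ms, 2), round(amplitude, 2))
```

In this model the path difference for a ground reflection is bounded by twice the microphone height, so larger delays would require a higher microphone position or reflections from other surfaces.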
Figure 6 presents the classification performance over time for the ‘HP-X4 2020’ drone, comparing non-augmented (black dots) and random delay-augmented (blue dots) scenarios. This augmentation significantly improved accuracy from 9.9% without augmentation to 27.3% with random delay augmentation. Such an enhancement underscores the critical need to simulate a wide array of delay variations, closely resembling the acoustic reflection conditions found in real-world environments. This finding advocates for random delay augmentation as an effective strategy to increase the robustness of classification systems in settings with prevalent echoes and reflections.