4.5. Analysis of Classification Results
In biological research and ecological monitoring, animal sound data are a key source of information about the behavior of a species and the condition of its environment. For endangered species in particular, such as the western black-crested gibbon, sound data can help researchers monitor population dynamics, reproductive behavior, and habitat changes. However, for many rare or inaccessible species, collecting large numbers of sound samples is challenging. This limits the effectiveness of automated analysis methods such as machine learning, which typically require large amounts of data to train reliable models.
To address the problem of a small dataset of western black-crested gibbon calls, this study proposed a data augmentation method that increases data diversity by adding different levels of noise to the original recordings. This method is based on the concept of signal-to-noise ratio (SNR), which is the ratio of signal strength to noise strength, and is used to simulate different listening conditions in the natural environment. The SNR can be adjusted for positive (1, 0.5, 0.2, 0.1, 0.3) and negative (−0.1, −0.2, −0.5, −1) values, covering a wide range of environments from mild background noise to extreme noise interference.
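The noise-mixing step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. The section lists positive and negative SNR values without stating a unit, and a plain power ratio cannot be negative, so the sketch assumes the values are in decibels. The function names `mix_at_snr` and `measured_snr_db`, and the choice to tile a short noise clip to the length of the call, are hypothetical.

```python
import numpy as np

# SNR levels listed in the text; treating them as decibels is an ASSUMPTION.
SNR_LEVELS = [1, 0.5, 0.2, 0.1, 0.3, -0.1, -0.2, -0.5, -1]


def mix_at_snr(signal, noise, snr_db):
    """Add noise to a signal, scaled so the mixture has the requested SNR (dB)."""
    signal = np.asarray(signal, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Repeat the noise clip if it is shorter than the signal, then trim it.
    if len(noise) < len(signal):
        repeats = int(np.ceil(len(signal) / len(noise)))
        noise = np.tile(noise, repeats)
    noise = noise[: len(signal)]

    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    if signal_power == 0 or noise_power == 0:
        raise ValueError("signal and noise must both have non-zero power")

    # SNR_dB = 10 * log10(P_signal / P_noise), solved for the target noise power.
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scale = np.sqrt(target_noise_power / noise_power)
    return signal + scale * noise


def measured_snr_db(signal, mixed):
    """Recover the SNR (dB) of a mixture when the clean signal is known."""
    signal = np.asarray(signal, dtype=np.float64)
    residual = np.asarray(mixed, dtype=np.float64) - signal
    return 10 * np.log10(np.mean(signal ** 2) / np.mean(residual ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    call = np.sin(np.linspace(0, 40 * np.pi, 1000))  # synthetic stand-in for a call
    background = rng.normal(size=500)                # synthetic stand-in for noise
    for level in SNR_LEVELS:
        augmented = mix_at_snr(call, background, level)
        print(level, round(measured_snr_db(call, augmented), 3))
```

Each original recording would yield one augmented copy per SNR level, so nine levels give nine additional samples per recording under this scheme.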
Through analysis and experimental validation, we found that applying data augmentation to the western black-crested gibbon call dataset significantly improved the classification accuracy of six deep learning networks: VGG16, VGG19, AlexNet, MobileNetV3, EfficientNet, and VBSNet. The method not only increased the volume and diversity of the dataset, but also provided an effective means of modeling complex real-world sound environments. Specific results are shown in Table 6, which shows significant improvements in classification accuracy for all six models after data augmentation.
Comparing the classification results before and after data augmentation shows that the performance of all six network models improved significantly. This demonstrates the importance of data augmentation for accurately classifying sound data from rare species. Augmentation improves the models’ ability to handle data recorded in complex, changing conditions and their ability to generalize, which makes it a valuable strategy for bioacoustic research and for the development of automated monitoring systems. These findings highlight the potential of advanced machine learning techniques in bioacoustic research and suggest new directions for future research and applications. It is important to note that the background noise was introduced artificially for the purpose of data augmentation. This may not perfectly replicate natural conditions, in which overlapping noise is present at the time of the original recording; the behavior of the animals and the acoustic properties of the environment could also influence actual recordings. Despite these limitations, the approach provides a valuable means of increasing dataset diversity and improving model robustness.
By adding different levels of noise to the original gibbon call recordings, we generated a series of sound samples under varying noise conditions, thereby modeling the auditory environments that gibbons may encounter in their natural habitats. This increases the size of the dataset and improves the model’s robustness to noise, making it more adaptable to real-world complexities. For example, in denser woodlands or harsh climatic conditions, background noise may significantly affect the transmission and reception of sound. A model that can still accurately recognize and classify gibbon calls under these conditions is considerably more useful and reliable in practice.
In addition, using different signal-to-noise levels increases the dataset’s diversity and provides an experimental basis for studying how sound data behave in different noise environments, which is important for acoustic ecology research. By analyzing how calls are affected at different noise levels, researchers can gain a deeper understanding of how sound travels through natural environments and how animals adapt acoustically to changes in their living environments.
In conclusion, the data augmentation method based on signal-to-noise ratio adjustment provides an effective solution for analyzing acoustic data of rare species such as the western black-crested gibbon. It helps to improve the quality and efficiency of bioacoustic monitoring and provides important technical support for the conservation and study of these species.
Following a systematic performance evaluation of the VBSNet model, we present the learning curves and evaluation metrics of the model after 100 training iterations.
Figure 7 shows the accuracy and loss curves of VBSNet on the training and test sets. In the early stage of training, the model has high loss values and low accuracy, which is in line with the general pattern of deep learning model training. As the number of iterations increases, the model begins to learn the patterns in the data: the loss decreases continuously and the accuracy increases steadily. The model begins to converge after 25 iterations, and with further iterations the training and test curves gradually come together, indicating that the model has achieved a significant performance improvement by learning from the data. Subsequently, performance stabilizes and both the loss and accuracy curves level off, indicating that the model has converged.
By incorporating MFCC spectrograms as input features, the VBSNet model demonstrated excellent accuracy and robustness on the sound event recognition task. The performance metrics of the model for the different categories of sound events are shown in Table 7 below.
The VBSNet model performed very well in recognizing the calls of the endangered western black-crested gibbon, achieving a precision of 98.49%, a recall of 98.00%, and an F1-score of 98.25%, as shown in Table 7. This further highlights the value of the model for ecological monitoring and species identification in biodiversity studies.
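As a quick consistency check, the F1-score is the harmonic mean of precision P and recall R. Recomputing it from the rounded precision and recall reported for the gibbon class gives a value within 0.01 percentage points of the reported 98.25%. A gap of that size is what rounding the inputs to two decimals can produce, although the unrounded values are not given in this section.

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.9849 \times 0.9800}{0.9849 + 0.9800}
    = \frac{1.930404}{1.9649}
    \approx 0.9824
```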
The confusion matrix generated by the VBSNet model on this dataset is shown in Figure 8 above. The class labels are as follows: 0, brown-necked thrush; 1, hooked thrush; 2, pachyderm warbler; 3, ruddy-tailed thrush; 4, golden-winged thrush; 5, great mockingbird; 6, cicadas; 7, rain; 8, western black-crested gibbon; and 9, wind. The confusion matrix shows that the model has very high classification accuracy, with the vast majority of samples correctly classified into their true categories. The few misclassifications may be due to the similarity of sound features between categories. Future work could further tune the model or investigate the characteristics of the misclassified samples to improve the model’s ability to distinguish between these sound events.
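Per-class precision, recall, and F1-score can be read directly off a confusion matrix, and the sketch below illustrates how. It assumes that rows are true classes and columns are predicted classes; the orientation of the published figure is not stated in this section. The 3×3 matrix is toy data chosen for illustration, not the paper's results, and `per_class_metrics` is a hypothetical helper name.

```python
import numpy as np


def per_class_metrics(confusion):
    """Return (precision, recall, f1) arrays, one entry per class.

    ASSUMPTION: confusion[i, j] counts samples of true class i predicted as j.
    """
    cm = np.asarray(confusion, dtype=np.float64)
    true_positive = np.diag(cm).copy()
    predicted_total = cm.sum(axis=0)  # column sums: samples assigned to each class
    actual_total = cm.sum(axis=1)     # row sums: samples that belong to each class

    # Classes with no predictions or no samples get a score of 0 instead of NaN.
    precision = np.divide(true_positive, predicted_total,
                          out=np.zeros_like(true_positive),
                          where=predicted_total > 0)
    recall = np.divide(true_positive, actual_total,
                       out=np.zeros_like(true_positive),
                       where=actual_total > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(true_positive),
                   where=denom > 0)
    return precision, recall, f1


# Toy 3-class matrix (illustrative only; NOT the values from the paper).
TOY_CONFUSION = [
    [48, 2, 0],
    [1, 49, 0],
    [0, 1, 49],
]

if __name__ == "__main__":
    precision, recall, f1 = per_class_metrics(TOY_CONFUSION)
    for label, (p, r, f) in enumerate(zip(precision, recall, f1)):
        print(f"class {label}: precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

Off-diagonal entries with comparatively large counts point to the class pairs that are most often confused, which is a natural starting point for the follow-up analysis of misclassified samples suggested above.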