A Randomized Bag-of-Birds Approach to Study Robustness of Automated Audio Based Bird Species Classification

Abstract: The automatic classification of bird sounds is an ongoing research topic, and several results have been reported for the classification of selected bird species. In this contribution, we use an artificial neural network fed with pre-computed sound features to study the robustness of bird sound classification. We investigate, in detail, if and how the classification results depend on the number of species and the selection of species in the subsets presented to the classifier. To this end, a bag-of-birds approach is employed to randomly create balanced subsets of sounds from different species for repeated classification runs. The number of species present in each subset is varied between 10 and 300 by randomly drawing sounds of species from a dataset of 659 bird species taken from the Xeno-Canto database. We observed that the shallow artificial neural network trained on pre-computed sound features was able to classify the bird sounds. The quality of the classifications was at least comparable to some previously reported results when the number of species allowed for a direct comparison. The classification performance is evaluated using several common measures, such as the precision, recall, accuracy, mean average precision, and area under the receiver operating characteristic curve. All of these measures indicate a decrease in classification success as the number of species present in the subsets is increased. We analyze this dependence in detail and compare the computed results to an analytic explanation assuming dependencies for an idealized perfect classifier. Moreover, we observe that the classification performance depends on the individual composition of the subset and varies across 20 randomly drawn subsets.


Introduction
The audio-based automatic recognition of bird species has become an increasingly common and effective method in the context of bird species monitoring, studying the behavior of birds, and understanding their communication patterns [1,2]. Notwithstanding the advantages of using bird vocalizations to infer ecologically relevant information, there are certain challenges associated with processing field recordings to produce robust results. Unattended field recordings can be quite noisy. Depending on the distance from the recording device, sound clips can be faded or distorted, and recordings can include overlapping sounds from the same or different bird species. Several authors [3,4] have addressed the influences of noise by adding artificial noise to recordings.
On surveying the literature available on automatic audio-based recognition of bird species [5-9], it has been found that most analyses have been performed using less noisy recordings and relatively small datasets, as has already been indicated in the review paper [2]. The measures for classification success that different authors use to report their results vary, which makes it difficult to compare the performance of different classifiers [8,10-14]. In our numerical experiments, classification was repeated 20 times for each number of species to obtain a reliable estimate of the classification performance. Figure 1 illustrates the idea of this numerical experiment. A balanced dataset was curated for the classification task in which 200 sound samples were randomly drawn for each bird species. Finally, the dataset was divided into training and testing sets such that the training sets contain 75% and the test sets contain 25% of the data. Given that, for each bird species, we use 200 sound samples, the test set for each analyzed species consequently contains m = 50 sound samples. This is described in detail in Section 2.4.

Feature Extraction
The next step entails extracting audio features from the time series of audio signals. Extracting features allows us to obtain lower-dimensional, compact statistical representations while preserving the distinguishing characteristics of the signal in a non-redundant manner. In addition to reducing the computational costs, feature extraction can maximize the classification performance of the system [20]. Studies have shown that aggregated features allow us to achieve a better classification performance compared to a single feature [21]. In this contribution, we employ the spectral centroid, the spectral rolloff, the zero-crossing rate, the spectral bandwidth, the root-mean-square energy (RMSE), and the Mel-frequency cepstral coefficients (MFCCs) as features.
Bird sounds, like audio signals in general, are non-stationary signals whose statistics change rapidly. For this reason, feature extraction is performed in a short-term processing manner, where the signal is chunked into short analysis frames that are assumed to be in a quasi-stationary state [22]. In order to preserve as much information as possible and arrive at a sufficient trade-off between frequency and temporal resolution, we selected an analysis frame size of 512 samples (23 ms) while allowing an overlap of 25%.
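The short-term framing can be sketched as follows. This is a minimal numpy illustration, assuming a sampling rate of 22,050 Hz, which is consistent with the stated frame length of 512 samples (about 23 ms); the function name is our own.

```python
import numpy as np

def frame_signal(signal, frame_size=512, overlap=0.25):
    """Chunk a 1-D signal into overlapping analysis frames.

    With frame_size = 512 and 25% overlap, the hop between
    consecutive frame starts is 512 * (1 - 0.25) = 384 samples.
    """
    hop = int(frame_size * (1 - overlap))
    n_frames = 1 + (len(signal) - frame_size) // hop
    return np.stack([signal[i * hop : i * hop + frame_size]
                     for i in range(n_frames)])

# e.g., one second of (silent) audio at 22,050 Hz
frames = frame_signal(np.zeros(22050))
```

Each row of `frames` is one quasi-stationary analysis frame from which the features below are computed.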
Therefore, the spectral features described below are computed frame-wise. In cases where an additional splitting into even shorter windows within each analysis frame is required (e.g., for the MFCCs), the temporal average of the feature is computed to generate a single value that is associated with the respective analysis frame. For each MFCC, the variance of the coefficients within the analysis frame is also computed and used as an additional feature.
In more detail, all features are computed using the Librosa 0.6 audio processing package [17] and can be described as follows [22]:
• Spectral Centroid: The spectral centroid measures the frequency where the energy of a spectrum is centered. In other words, it localizes the center of mass of the spectrum and is calculated as a weighted mean of the frequencies that the signal is composed of:
$$s_c = \frac{\sum_k f(k)\, S(k)}{\sum_k S(k)},$$
where $S(k)$ is the spectral magnitude at frequency bin $k$, and $f(k)$ represents the center frequency of the bin [23].
• Spectral Rolloff: The spectral rolloff gives the frequency $f(k)$ below which a predefined percentage (usually set to 85%) of the total spectral energy is concentrated [24].
• Zero-Crossing Rate: The zero-crossing rate $r$ measures the smoothness of a signal. It is the rate at which a signal changes its sign from negative to positive or vice versa [25].
• Root-Mean-Square Energy (RMSE): The root-mean-square energy of a signal gives the signal's total energy and is defined as:
$$E_{rms} = \sqrt{\frac{1}{N} \sum_k S(k)^2},$$
where $S(k)$ is the spectral magnitude at frequency bin $k$ and $N$ is the number of frequency bins [17].

• Spectral Bandwidth: The spectral bandwidth measures whether the power spectrum is concentrated around the spectral centroid or spread across the spectrum. It is computed as:
$$s_b = \sqrt{\frac{\sum_k \left(f(k) - s_c\right)^2 S(k)}{\sum_k S(k)}},$$
where $s_c$ is the spectral centroid and $S(k)$ is the spectral magnitude at frequency bin $k$ [26].
• Mel-Frequency Cepstral Coefficients (MFCCs): Mel-frequency cepstral coefficients are inspired by human auditory perception. After computing the Fourier transform of a signal, the magnitude spectrum is projected onto the Mel scale, which emphasizes relevant frequencies in a non-linear way: a small bandwidth at low frequencies and a large bandwidth at high frequencies. The Mel scale approximates the human auditory response better than linearly spaced frequency bands. The output is log-transformed, and the MFCCs are obtained by taking a discrete cosine transform of the logarithmic outputs [27]. In this contribution, the analysis frames are split into windows of length 512, and the first 20 MFCC values are computed as features in our system.
In total, the dimension of the feature space is 45, containing the first 20 time-averaged MFCCs, the variances of the first 20 MFCCs, and the time-averaged zero-crossing rate, spectral rolloff, spectral centroid, root-mean-square energy, and spectral bandwidth.
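As an illustration, the frame-wise spectral features (excluding the MFCCs, which the paper computes with Librosa) can be sketched in plain numpy. The function below is our own simplified version, not the Librosa implementation, and follows the definitions given above:

```python
import numpy as np

def spectral_features(frame, sr=22050, rolloff_pct=0.85):
    """Frame-wise spectral features for one analysis frame (e.g., 512
    samples). The MFCCs that complete the 45-dimensional vector used in
    the paper would be computed with a library such as Librosa."""
    spectrum = np.abs(np.fft.rfft(frame))          # S(k)
    freqs = np.fft.rfftfreq(len(frame), d=1 / sr)  # f(k)
    total = spectrum.sum() + 1e-12                 # avoid division by zero

    # weighted mean of frequencies (center of mass of the spectrum)
    centroid = (freqs * spectrum).sum() / total
    # lowest frequency below which 85% of the spectral energy lies
    cum = np.cumsum(spectrum)
    rolloff = freqs[np.searchsorted(cum, rolloff_pct * cum[-1])]
    # spread of the spectrum around the centroid
    bandwidth = np.sqrt(((freqs - centroid) ** 2 * spectrum).sum() / total)
    # fraction of sample-to-sample sign changes
    zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
    # root-mean-square energy of the frame
    rms = np.sqrt(np.mean(frame ** 2))
    return centroid, rolloff, bandwidth, zcr, rms

# example: features of a single 512-sample frame
c, ro, bw, z, r = spectral_features(np.random.default_rng(0).normal(size=512))
```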

Classification Model
The features are then fed into a feed-forward neural network, which is constructed using the sequential model within the TensorFlow framework [28]. Feed-forward neural networks are archetypal models for machine learning [29,30]. In contrast to the deep learning approaches used to classify bird sounds, the network consists of only four layers, as illustrated in Figure 2. These constitute a sequence of layers where each layer is an affine transformation followed by a non-linear transfer function $\sigma$:
$$y_i = \sigma_i(w_i x + b_i),$$
where $\theta = \{w_i, b_i\}$ constitutes the parameter space, with $w_i$ the weights, $b_i$ the biases, and $\sigma_i$ the transfer functions of the different layers $i$. The goal of the neural network is to learn the value of the parameter $\theta$ that generates the best function approximation [22,31]. The input to the neural network is the feature vector $x \in \mathbb{R}^{45}$, which maps through three intermediate layers with $d_1 = 256$, $d_2 = 128$, and $d_3 = 64$ hidden units, respectively, each activated by rectified linear units (ReLU) [31]. Finally, the output layer maps to $n$ independent classes with a softmax transfer function [31,32], where $n$ is the number of bird species.

(Figure 2 sketches the architecture: an input of dimension 45, ReLU layers with 256, 128, and 64 units, and a softmax output of size n.)
During training, the model optimizes the cross-entropy loss using the Adam stochastic optimization algorithm [33] with a constant learning rate of 0.001. To identify the parameter setting that maximizes the likelihood of the predictions, the model was trained for 100 epochs. We observed in our experiments that the loss converged to a minimum toward the end of the 100 epochs.
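Under the definitions above, the forward pass of such a network can be sketched in plain numpy. The weights below are random placeholders standing in for the trained parameters θ, not values from the study:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

# layer sizes: 45 -> 256 -> 128 -> 64 -> n  (n = number of species)
n = 10
sizes = [45, 256, 128, 64, n]
# theta = {w_i, b_i}: randomly initialized placeholder parameters
params = [(rng.normal(0, 0.1, (d_out, d_in)), np.zeros(d_out))
          for d_in, d_out in zip(sizes[:-1], sizes[1:])]

def forward(x, params):
    """Affine transformation plus transfer function per layer:
    ReLU for the hidden layers, softmax for the output layer."""
    for w, b in params[:-1]:
        x = relu(w @ x + b)
    w, b = params[-1]
    return softmax(w @ x + b)

probs = forward(rng.normal(size=45), params)  # class probabilities
```

In the study itself, the corresponding model is built with TensorFlow's sequential API and trained with Adam (learning rate 0.001, 100 epochs); the sketch only shows how a 45-dimensional feature vector is mapped to n class probabilities.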

Measuring Classification Success
This section introduces the indices used to measure the classification performance for bird vocalizations [34,35]. In each numerical experiment, we considered n randomly selected species and, for each species j = 1, 2, ..., n, a dataset of 4m sound samples. Here, 3m samples were used for training, whereas the remaining m samples were employed to evaluate the quality of the classifications. These evaluations were then done by computing a confusion matrix for each species, containing: $c_{tp}(j)$, the number of classifications that are true positives for species j; $c_{tn}(j)$, the number of true negatives; $c_{fp}(j)$, the number of false positives; and $c_{fn}(j)$, the number of false negatives.
The resulting elements of the confusion matrix enter into the computation of more advanced metrics for measuring classification success, such as the precision, recall, accuracy, mean average precision, and receiver operating characteristics. To understand how the entries of each species' confusion matrix influence the outcomes of these summarizing measures, their computations are described in the following:
• Precision: This metric gives a measure of the reliability of the predictions. The precision for a bird species j is
$$p(j) = \frac{c_{tp}(j)}{c_{tp}(j) + c_{fp}(j)}.$$
It indicates how many of all predicted positives for species j are true positives; the higher the precision, the more confident the model is about its predictions. To compute the precision for the entire test dataset, we average over all species:
$$P = \frac{1}{n} \sum_{j=1}^{n} p(j).$$
• Recall: This metric gives a measure of the predictive power of the model. The recall for a bird species j is
$$r(j) = \frac{c_{tp}(j)}{c_{tp}(j) + c_{fn}(j)}.$$
It indicates how many of all actual positives for species j in the test dataset the model predicted as positive; the higher the recall, the more positive samples the model correctly classified. The recall for the entire test dataset is again averaged over all species:
$$R = \frac{1}{n} \sum_{j=1}^{n} r(j).$$
• Accuracy: While precision and recall are computed for each class separately in a multi-class classification problem, the accuracy A is computed for the entire test dataset as
$$A = \frac{\sum_{j=1}^{n} c_{tp}(j)}{n\,m},$$
i.e., the fraction of all test samples that were correctly classified.
• Area Under the ROC Curve (AUC): An ROC curve shows the performance of a classification model at different classification thresholds. The curve is computed by plotting the true positive rate ($r_{tp}$) against the false positive rate ($r_{fp}$) at these thresholds.
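The confusion-matrix counts and the averaged measures above can be computed as in the following minimal numpy sketch (the function names are our own, not from the study):

```python
import numpy as np

def confusion_counts(y_true, y_pred, n):
    """Per-species counts c_tp, c_fp, c_fn, c_tn in a one-vs-rest view."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.array([np.sum((y_pred == j) & (y_true == j)) for j in range(n)])
    fp = np.array([np.sum((y_pred == j) & (y_true != j)) for j in range(n)])
    fn = np.array([np.sum((y_pred != j) & (y_true == j)) for j in range(n)])
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn

def macro_metrics(y_true, y_pred, n):
    """Precision and recall averaged over species, plus overall accuracy."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, n)
    precision = np.mean(tp / (tp + fp))  # average of p(j) over species
    recall = np.mean(tp / (tp + fn))     # average of r(j) over species
    accuracy = tp.sum() / len(y_true)    # fraction correctly classified
    return precision, recall, accuracy
```

Note that this sketch assumes every species is predicted (and present) at least once, so the denominators are non-zero.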
The true positive rate for a bird species j is defined as
$$r_{tp}(j) = \frac{c_{tp}(j)}{c_{tp}(j) + c_{fn}(j)},$$
and the false positive rate for a bird species j is defined as
$$r_{fp}(j) = \frac{c_{fp}(j)}{c_{fp}(j) + c_{tn}(j)},$$
with $\rho$ denoting a probability threshold that is varied from 0 to 1 in order to obtain the ROC curve. The area under the ROC curve (AUC) gives an aggregate measure of the classification performance. The ROC was originally developed for a binary classifier and was later generalized for multi-class classification systems [36]. The test set labels are binarized by employing either the one-vs-one or the one-vs-rest configuration; we employed the one-vs-one configuration for our task. In more detail, the different sound samples are ranked by their probabilities, and false positive and true positive rates are then computed by choosing different probability cut-offs $\rho$ to generate the ROC curve. The AUC is computed as the area under this curve. In the end, an average across species is computed to get one AUC value for the entire dataset, i.e.,
$$\mathrm{AUC} = \frac{1}{n} \sum_{j=1}^{n} \mathrm{AUC}(j).$$
• Mean Average Precision (mAP): This evaluation metric characterizes the performance of a classifier by monitoring how the precision changes with the classification probability threshold that the model uses to decide whether a bird sound sample belongs to a class j. A good classifier will maintain a high precision as the recall increases, while a poor classifier will take a hit on precision as the recall increases with changes in the threshold. In more detail, to compute the average precision for a species j, a list of the discrimination probabilities that our model has assigned to all test samples for class j is generated. The list is then sorted by decreasing probability, and each element is assigned a rank k. By varying the rank k (by gradually lowering the probability threshold), a list of true positives and false positives is generated. Note that, as the classification threshold is lowered, the model labels increasingly more samples as positive.
This leads to an increase in false positives. The list is consequently employed to produce a list of precision values p(k) at the different ranks. Considering all K cases in the list where the sound sample belongs to class j, the average precision is computed as
$$P_A(j) = \frac{1}{K} \sum_{k} p(k)\, \mathbb{1}(k),$$
where $\mathbb{1}(k)$ is an indicator function that equals unity if the sample at rank k is a true positive. The mean average precision $P_{mA}$ is then computed by averaging over all classes (species) [15]:
$$P_{mA} = \frac{1}{n} \sum_{j=1}^{n} P_A(j).$$
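The ranking procedure above can be sketched for a single species j as follows (a minimal numpy illustration; the function name is our own):

```python
import numpy as np

def average_precision(scores, is_positive):
    """AP for one species j: sort the test samples by the probability
    the model assigned to class j, then average the precision p(k) over
    the ranks k at which a true positive appears (the indicator 1(k))."""
    order = np.argsort(scores)[::-1]          # decreasing probability
    hits = np.asarray(is_positive)[order]     # 1 where sample belongs to j
    ranks = np.arange(1, len(hits) + 1)
    precision_at_k = np.cumsum(hits) / ranks  # p(k) at each rank
    return precision_at_k[hits.astype(bool)].mean()

# the mean average precision is then the mean of the per-species APs
```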

Results and Discussion
We evaluated the performance of our classification algorithm by testing it in 20 trials on n randomly selected species (out of the 659 species in the selected dataset), with n varying between n = 10 and n = 300. The results for the precision, AUC, mAP, recall, and accuracy are summarized in Figure 3.
Box plots were estimated from the 20 different randomized datasets for each n. Box plots, also known as box-and-whisker plots, provide robust statistical summaries even when the sample size is relatively small, as it is here with 20 samples. The box plot divides the data into quartiles: two box panels and two whiskers. The middle 50% of the data is spanned by the box, with 25% of the data falling below the lower edge of the box (first quartile) and 75% of the data falling below the upper edge (third quartile). The edges of the box are often referred to as hinges, and the length of the box is called the interquartile range (IQR). The median is indicated by the middle line of the box. The whiskers mark the extremes of the remaining 50% of the data [37]. Surprisingly, we found no increase in the size of the interquartile range of the performance measures with increasing n.
There are several aspects of these results that deserve to be addressed in more detail.

Variations due to Randomized Sub-Sets
As observed in Figure 3, for each choice of a subset of n species, we obtain a range of performance values, depending on the particular random selection of species. Our results show that the interquartile ranges vary by as much as 12% in some cases. For instance, in the case of n = 30 in Figure 3c, the mean average precision (mAP) varies from 0.72 for one subset of 30 species to 0.84 for another subset of 30 randomly drawn species. Similarly, for n = 70, the mAP varies between 0.6 and 0.71. A similar trend is visible in the figures for the other metrics considered in this work.
As mentioned earlier, the experiment was repeated 20 times for different randomized selections of n species. The variation in the results between trials with different selections of n species showed that the classification results can vary significantly depending on the species chosen for the analysis. Consequently, it can be inferred that generic claims about the performance of a certain algorithm for a certain number of non-randomly selected species must be interpreted with caution. The results might not generalize to another set of n species, even when the species are drawn from the same dataset.
One possible reason, among others, that could explain the variability in performance between different subsets or ensembles of randomly drawn sound samples from n species (bag-of-birds) is the possible degrees of similarity of sounds, or the lack of it, between species of different subsets. Ambiguity can be a consequence of the similarity of the sounds of species within an ensemble. Therefore, one possible explanation for these results is that sounds of species in the bags leading to lower performance measures have a higher degree of similarity compared to bags that generate higher classification performance.

The Dependence on the Number of Species
All performance measures decrease with an increased number of species, as is visible in Figures 3 and 4. An intuitive explanation for this could be that species are more difficult to distinguish when more species are added to the classification task. However, looking at the definitions of the performance measures (Equations (4)-(13)), we investigated whether it is possible to understand the numerical results by analytic reasoning. Consider, e.g., the precision P(n), which is defined in Equation (5). If each per-species precision p(j) contributing to the average were constant and not dependent on n (i.e., p(j) ∼ c), one would expect P(n) ∼ c. This is clearly not what is observed in Figure 4. Therefore, one must assume that p(j) depends on n, although this is not explicitly visible in Equation (4). To investigate this implicit dependence on n, we visualized the average numbers of true positives for each n and, in a similar way, the averaged numbers of false positives, true negatives, and false negatives in Figure 5a-c. These elements of the confusion matrix enter (in a non-averaged form) into the computed performance measures, and thus their dependence on n influences the performance measures. The averaging in Figure 5 was done since the number of sound samples in each trial trivially depends linearly on n; removing this trivial dependence allows us to monitor the non-trivial implicit dependence on n. As one can see, the dependence of the averaged numbers of true positives, false positives, and false negatives can be described relatively well by a quadratic function, whereas the averaged number of true negatives increases linearly with increasing n.
The fact that true negatives, as can be seen in Figure 5d, behave differently than the other elements of the confusion matrix can be understood by considering the way true negatives are computed in a multi-class classification problem using a one-vs-all configuration. Each time a sound sample was correctly not classified as the particular species j under consideration, the count of true negatives was increased by one. Therefore, e.g., in a subset of m · n = 500 sound samples recorded from n = 10 different species, with each species being represented by m = 50 sound samples, a perfect algorithm would classify 50 samples correctly as belonging to species j. Consequently, the count of true positives would be $c_{tp}(j) = 50$, and the count of true negatives $c_{tn}(j) = 450$ for a perfect classifier. In other words, we can expect $c_{tn}(j) = nm - m$, with m being the sample size as specified before, in the case of a perfect classifier. The results of the prediction experiments in this contribution, with an obviously not perfect classifier, reveal that $a_{tn}$ can be fitted by a linear function $a_{tn}(n) = b_1 n + b_0$, where $b_1 = (49.88 \pm 0.82) \times 10^{-2}$ and $b_0 = -(59.27 \pm 0.61)$. Note that the two coefficients are relatively close to the true sample size m = 50.
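The counting argument for a perfect classifier can be verified directly. This small numpy check uses the n = 10, m = 50 example from the text:

```python
import numpy as np

# For a perfect classifier on a balanced test set (m samples for each
# of n species), every sample not belonging to species j is a true
# negative for j, so c_tn(j) = n*m - m.
n, m = 10, 50
y_true = np.repeat(np.arange(n), m)  # 500 samples, 50 per species
y_pred = y_true.copy()               # perfect predictions

j = 0
c_tp = np.sum((y_pred == j) & (y_true == j))
c_tn = np.sum((y_pred != j) & (y_true != j))
```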
The dependence of the other elements of the confusion matrix on n is more subtle with respect to the range in which these numbers vary; the dependence can be described by quadratic functions with $d_2 = (8.02 \pm 1.00) \times 10^{-4}$, $d_1 = (22.87 \pm 1.34) \times 10^{-2}$, $d_0 = 6.84 \pm 0.37$, and $c_0 = 43.16 \pm 0.38$. Note that the first two coefficients $d_1$ and $d_2$ of $a_{tp}$, $a_{fp}$, and $a_{fn}$ either have the same values (up to the first eight digits, which are not shown here) or differ only in sign but not in value. These coefficients are shown in detail here since we will, in the following, demonstrate a connection between Equations (16)-(18) and the functions describing the dependence of the overall performance measures. Being able to describe $a_{tp}(n)$, $a_{fp}(n)$, and $a_{fn}(n)$, one can now attempt to understand the dependencies of the performance measures. Assuming that each species is classified equally well by a perfect classifier, one would expect $a_{tp}(n) = c_{tp}(j, n)$ for all j, and similarly $a_{fp}(n) = c_{fp}(j, n)$ and $a_{fn}(n) = c_{fn}(j, n)$. Inserting Equations (16) and (17) into Equation (4) holds since the non-constant terms in the denominator cancel each other. Inserting this into the equation for the overall precision (Equation (5)) holds since all terms p(j) are identical for the perfect classifier. Consequently, one should be able to predict the scaling of P(n), knowing the coefficients $d_2$, $d_1$, $d_0$, and $c_0$. Fitting the coefficients of the quadratic function describing P(n) as in Figure 4, one obtains $P(n) \approx g_2 n^2 + g_1 n + g_0$, with $g_2 = (0.16 \pm 0.02) \times 10^{-4}$, $g_1 = -(0.44 \pm 0.03) \times 10^{-2}$, and $g_0 = 0.86 \pm 0.76 \times 10^{-2}$.
Note that these coefficients are very close to the coefficients of $a_{tp}$ multiplied by a factor $1/(d_0 + c_0) = 1/50$, as indicated by Equation (20). Hence, we can confirm numerically that the dependence of the precision on the number of classes follows the dependence of $a_{tp}$ up to a scaling factor of $1/(d_0 + c_0) = 1/50$.
Following the same assumptions and reasoning, one obtains an analogous expression for the recall. Here, the relation between the fitting coefficients of $a_{tp}$ and R is confirmed by the quadratic function fitted to R in Figure 4. Note that, for the prediction experiments in this study, the same quadratic function is able to describe the n-dependence of both precision and recall. Extending the above reasoning (i.e., $c_{tp}(j, n) \approx a_{tp}(n)$) to explain the n-dependence of the accuracy as given by Equation (8) yields an analogous relation. This relation was numerically confirmed by comparing the coefficients of the polynomials describing A and $a_{tp}$. The values of the coefficients $d_2$, $d_1$, and $c_0$ are given after Equation (18).
Discussing the n-dependence of the multi-class AUC and the mAP analytically is not as straightforward as the previous considerations; therefore, only numerical results are presented in this contribution. As one can see in Figure 4, the n-dependence of the AUC and the mAP can also be described by quadratic functions. Additionally, we observe that the coefficients for the linear and the quadratic terms of the function describing the mAP resemble in value the coefficients describing P(n) (the values of the coefficients of P(n) are given after Equation (21)). Consequently, one can argue that the above discussion for P(n) could possibly also explain the n-dependence of the mAP. Nevertheless, the constant term ($c_0 = 43.16 \pm 0.38$) added to the function describing $P_{mA}(n)$ is higher than the constant offset of the precision.
Summarizing, we can relate the n-dependence of several measures for the classification success to the n-dependences of the confusion matrix, assuming the behaviour of a perfect classifier, and we fit functions describing these dependencies. Note that this does not imply that we claim our classifier to be a perfect classifier, neither do we claim that scaling with n, which we obtain here, is universal in the sense that it will be observed for any other classifier. The latter aspect is a question that needs to be tested in future contributions, but it is out of the scope of this work.

Metric of Confidence
The decisions made by the classifier are based on probabilities that are estimated (through the ANN) for each species. The predicted label is then assigned to the species with the highest probability. Here, we analyze the effect of introducing a confidence threshold, requiring the assigned probability to be above the threshold in order to accept the classification. In Figure 6, one can see how the precision, i.e., the ratio of true positives to all classified positives, changes as the confidence threshold is varied between 0.5 and 1.0.
We see that the precision increases as the confidence threshold is increased. For instance, for n = 30 species, the precision for a confidence threshold in the range of 0.9-1.0 is more than 0.8. Similar results can be seen for other n. This basically shows that when the model assigns high confidence to its predictions, the predictions are mostly correct, which is to be expected from a good classifier.
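The thresholding scheme can be sketched as follows. This is a hypothetical helper illustrating the idea, not code from the study:

```python
import numpy as np

def precision_above_threshold(probs, y_true, threshold):
    """Accept a prediction only if the winning softmax probability
    reaches the confidence threshold; return the precision (fraction of
    correct predictions) among the accepted samples."""
    probs = np.asarray(probs)
    y_pred = probs.argmax(axis=1)                 # most probable species
    accepted = probs.max(axis=1) >= threshold     # confident predictions
    if not accepted.any():
        return np.nan                             # nothing accepted
    return np.mean(y_pred[accepted] == np.asarray(y_true)[accepted])
```

Raising the threshold discards low-confidence predictions, so the precision among the remaining ones tends to increase, as observed in Figure 6.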

Comparing Different Measures for Classification Success
In this contribution, we used several common measures for evaluating the classification performance and comparing their results. The primary reason for this is that different indices encapsulate different aspects of the classification performance. Secondly, as mentioned earlier, there seems to be no consensus in the literature available on the choice of evaluation metric for the audio-based bird species classification task. This compelled us to study a set of indices and not rely on a specific metric.
As one can see in Figure 3, the precision and recall for n < 100 do not show much disparity and look quite similar, although, by definition, these two indices encapsulate different aspects of model performance. This can be clearly seen in Figure 7: for different species in one classification run with n = 10, the precision and recall values differ. There are species where the precision is higher than the recall (e.g., species 10) and others where the recall is higher than the precision (e.g., species 3). It seems that, for n < 100, the precision and recall values more or less equalize when an average is taken over all species.
From Figures 3 and 4, one can additionally see that the accuracy is exactly the same as the recall, since, for a balanced test set, the equations for the recall and the accuracy become identical when averages are taken over all classes.
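This equality is easy to verify numerically for a balanced test set: each per-species recall is $c_{tp}(j)/m$, so their mean is $\sum_j c_{tp}(j)/(nm)$, which is exactly the accuracy. A small sketch with arbitrary predictions:

```python
import numpy as np

# Balanced test set: m samples for each of n species.
rng = np.random.default_rng(1)
n, m = 5, 50
y_true = np.repeat(np.arange(n), m)
y_pred = rng.integers(0, n, size=n * m)  # arbitrary (imperfect) predictions

# per-species recall c_tp(j)/m, then the macro average
recalls = [np.mean(y_pred[y_true == j] == j) for j in range(n)]
macro_recall = np.mean(recalls)
accuracy = np.mean(y_pred == y_true)
# macro_recall and accuracy coincide on a balanced test set
```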
Additionally, the mean average precision (mAP) was used to evaluate classification success. An increasing number of works in recent years have used this metric to state the classification performance of their models. Note that the average precision is one way of measuring the area under the precision-recall curve. Compared to the precision and recall, which are computed for one probability threshold, the average precision is computed cumulatively by varying the threshold. We see in Figure 3 that, although it follows a similar downward quadratic trend to the recall and precision, the mAP values are slightly higher than the precision and recall values for the different n. For instance, the range for n = 10 species over the 20 runs is between 0.86 and 0.94, whereas the precision and recall ranges are between 0.79 and 0.87. This observation is also reflected in the offset of the functions describing the n-dependence, as mentioned above. Another commonly used metric for classification success is the area under the Receiver Operating Characteristic curve (AUC). Our model achieved a high score on the AUC metric, as can be seen in Figures 3 and 4. Although the AUC score decreased with the increase in the number of species n, the score was nevertheless unexpectedly high. For instance, the AUC score for n = 300 in one run was 0.94, which is unexpected for such a large number of species. (Note that, as per the definition of the AUC, a random classifier making randomized decisions should give a score of 0.5.)
In our understanding, the multi-class nature of our problem explains this result. As mentioned earlier, the AUC metric was essentially designed for a binary classifier and was later generalized for multi-class classification problems [36].
Therefore, in the case of multi-class problems, one needs to binarize the class labels to compute the AUC score, such that the problem is transformed into a binary classification problem with $\frac{n(n-1)}{2}$ binary classifiers (where n is the number of classes). Using a one-vs-one configuration [36,38], as recommended by the tutorials of many software packages, an AUC score is then computed for each of these binary classifiers, and finally an average is taken to obtain a final AUC score for the entire set of n classes.
For an actual binary classifier that classifies poorly, the misclassifications will be reflected in large enough numbers of false positives and false negatives to give a low true positive rate and a high false positive rate as per Equations (9) and (10). This results in a low AUC score. In the multi-class scenario with the one-vs-one configuration, we observe that a classifier distributing misclassifications sparsely across several classes leads to a small number of false positives and a small number of false negatives for these artificially assumed binary classifiers. One should note that this happens even if the classifier fares poorly, i.e., misclassifies at a high rate.
An example of this can be seen in Figure 8, which shows a confusion matrix for a classification run with 20 species. It can be seen that the misclassifications are spread throughout the rows and columns of the confusion matrix. Consequently, the low numbers of false positives and false negatives amount to a high true positive rate and a low false positive rate for the individual binary comparisons and, therefore, a high AUC score (refer to Equations (9) and (10)). This is exactly what is reflected on averaging the individual AUC scores to compute the total AUC score for n classes. The classifier distributes the false predictions sparsely across several classes, and the one-vs-one generalization is unable to capture the actual performance of the model. This leads us to the conclusion that the ROC is not a suitable performance measure for multi-class classification tasks. Especially in cases where the misclassifications are distributed rather evenly among several classes, it is very likely that overestimated AUC scores are obtained.

Conclusions
The novelty of this work lies in studying the dependence of classification success on the number of species for bird sound classification. Furthermore, the idea is to illustrate how these classification results are heavily contingent on the composition of bird species subsets. Therefore, we employed balanced subsets of bird sounds for n species, drawing the species randomly from a larger dataset containing 659 species, where n was varied between 10 and 300. For each n, we repeated the whole procedure (composition of the subset, training of the classifier, and testing) 20 times to produce a reliable estimate of the performance given a certain number of bird species.
The classification was performed using a shallow feed-forward neural network trained on 45 pre-computed sound features. We used a shallow neural network to conduct our analysis primarily because of its model simplicity, lower computational costs, and the relatively small amount of data required to train such networks compared to deep neural networks. We wanted to benchmark the classification performance and perform our analysis using a simple model that can be trained on hand-crafted sound features.
We evaluated the classification performance using several common measures for classification success and analyzed their dependence on n in detail. We observed that the classification performance was relatively high, even when many different species were present in the datasets under study and relatively little data was used. This is an interesting result, since many recent approaches are based on deep neural networks trained on much larger datasets of spectrogram images without any feature selection. It suggests that shallow neural networks trained on pre-computed sound features can also provide a robust approach to bird classification, one that is at the same time inexpensive in terms of computational costs and the amount of data used.
Concerning the robustness of the approach, we found that all measures of classification success showed a decline in value if the number of species present in the subset was increased. For some of these measures, this decline can be explained analytically knowing the n-dependence of the confusion matrix and assuming the behavior of an idealized perfect classifier.
Additionally, we observed that the classification success depends on the individual composition of the bird subsets, and the classification results can vary significantly depending on the species chosen for the analysis. For this reason, generic claims about the performance of a certain algorithm for, say, n non-randomly drawn species must not be interpreted as a generalized measure of performance for any n species. The classification results might not generalize to another set of n species, even when the species are drawn from the same dataset.

Data Availability Statement: A publicly available dataset was analyzed in this study. The dataset was drawn from the Xeno-Canto repository for bird sounds and was composed by the organizers of BirdClef2019 [15,16].