Animal Sound Classiﬁcation Using Dissimilarity Spaces

: The classiﬁer system proposed in this work combines the dissimilarity spaces produced by a set of Siamese neural networks (SNNs) designed using four di ﬀ erent backbones with di ﬀ erent clustering techniques for training SVMs for automated animal audio classiﬁcation. The system is evaluated on two animal audio datasets: one for cat and another for bird vocalizations. The proposed approach uses clustering methods to determine a set of centroids (in both a supervised and unsupervised fashion) from the spectrograms in the dataset. Such centroids are exploited to generate the dissimilarity space through the Siamese networks. In addition to feeding the SNNs with spectrograms, experiments process the spectrograms using the heterogeneous auto-similarities of characteristics. Once the similarity spaces are computed, each pattern is “projected” into the space to obtain a vector space representation; this descriptor is then coupled to a support vector machine (SVM) to classify a spectrogram by its dissimilarity vector. Results demonstrate that the proposed approach performs competitively (without ad-hoc optimization of the clustering methods) on both animal vocalization datasets. To further demonstrate the power of the proposed system, the best standalone approach is also evaluated on the challenging Dataset for Environmental Sound Classiﬁcation (ESC50) dataset.


Introduction
Over the past decade, research in sound classification and recognition has gained in popularity and rapidly broaden in its application from the more traditional focus on speech recognition [1] and music genre classification [2] to biometric identification [3], computer-aided heart sound detection [4], environmental audio scene and sound recognition [5,6], biodiversity assessment [7], human voice classification and emotion recognition [8], English accent classification, and gender identification [9], to list a few of a wide range of application areas. As with research in pattern recognition generally, the features fed into classifiers were initially engineered, which, in the case of sound applications, meant extracting from raw audio traces such descriptors as the Statistical Spectrum Descriptor and Rhythm Histogram [10].
In the past decade, researchers began exploring the possibility of visually representing audio signals to apply more powerful image classification descriptors and techniques. Initially, visual representations of audio traces centered around the display overtime of the frequency spectrum: examples of these representations include the spectrogram [11] and Mel-frequency Cepstral Coefficients spectrogram [12].
representation for each pattern. This vector was then used to train an SVM. Results showed that this approach worked better than the standalone CNNs.
The system proposed in this work is similar to [43] in that it generates a dissimilarity space from the training set using an SNN to define a distance function from the input spectrograms. The objective at this point in the process is to maximize the distance separating the patterns of the different classes. Unlike [43], however, four different CNN architectures are selected for the twin classifiers, and both the original input spectrograms and the spectrograms processed by Heterogeneous Auto-Similarities of Characteristics [48] make up the inputs to the SNNs. In the testing phase, the unknown pattern is compared to the centroids of the dissimilarity space using the different SNNs for comparing two samples and measure their dissimilarity. Although it is the case that the entire training set can function as the centroids of the dissimilarity space, it is desirable to reduce dimensionality by selecting a smaller number of prototypes. In this work, both a supervised and unsupervised clustering algorithm are used to reduce dimensionality. The dissimilarity space represents each input (both the original and the processed spectrograms) by its distance from each of the centroids, or prototypes, a distance that is learned by the SNNs. In other words, the SNNs compare a given spectrogram to each of the centroids to obtain a dissimilarity feature vector. A support vector machine (SVM) is then trained on these features.
The approach taken in this paper is evaluated on the same animal vocalization datasets as in [43], i.e., on cats [37] and birds [7]. In addition, the system is tested on the challenging benchmark dataset for Environmental Sound Classification (ESC-50) [49]. As in [43], the most effective system is the result of a fusion of classifiers, i.e., the sum rule of SVMs trained on different dissimilarity spaces generated by changing the value of k in the clustering approaches and network topologies. Performance is compared with both the state-of-the-art as well as with fusions with the state-of-the-art. Results demonstrate the power of using dissimilarity spaces based on an ensemble of SNNs, particularly when coupled with a standard CNN approach. The MATLAB code of the proposed method is freely available at https://github.com/LorisNanni. The remainder of this paper is broken down into the following sections. In Section 2, a detailed overview of the proposed approach is provided; in Section 3, SNN is discussed along with the different CNN architectures that make up the subnetworks. In Section 4, the supervised and unsupervised clustering methods are outlined. In Section 5, experimental results are presented along with comparisons with the state-of-the-art. The paper concludes in Section 6 with an overview and some suggestions for future research.

Proposed System
The system presented here for spectrogram classification extends that proposed in [43] and is schematized in Figure 1, which illustrates the approach using one SNN. The pseudocode in Algorithms 1 and 2 that correspond to Figure 1 are detailed in the remainder of this section.
The process begins by generating a similarity space during the training phase via a learning distance measure d(x, y) from a set of prototypes P = p 1 , . . . p k . The distance measure is learned by four SNNs trained (1) to maximize the similarity between pairs of spectrograms belonging to the same class and (2) to minimalize the similarity for pairs of spectrograms belonging to different classes. The set of prototypes generated in this phase are the k centroids of the clusters produced by a clustering approach. What results is a feature vector f ∈ R k that represents training sample x in the dissimilarity space, where a given f i is the distance between x and the prototype p i : f i = d(x, p i ). Once these features have been calculated, they are used to train an SVM classifier.  In the testing phase, each unknown pattern is represented by its projection in a dissimilarity space: the descriptor is obtained by calculating the pattern distance to the set of prototypes P, and the resulting feature vectors of the input images are then classified by SVM. In our experiments, we test both spectrograms and HASC [48], a 2D descriptor extracted from the spectrogram, as the input of the classification process. The method for extracting HASC is outlined in Section 2.5.

Siamese Neural Network Training
To define a similarity measure inside the Dissimilarity Space, a Siamese Network is trained to compare a couple of spectrograms and return a similarity value that is the greater value if they belong to the same class and the lower value if they belong to different classes. The pseudocode for training an SNN, which is called in step 1 of Algorithm 1, is reported in Algorithm 3. The SNN architecture is defined in steps 2 and 3 (more details are given in Section 3). The training process consists in repeating Steps 5-8: random extraction of batchSize spectrogram pairs from the training set (via the function GETBATCH), computation of loss and gradients for each pair passed to the SNN, and upgrade of the SNN accordingly (i.e., the subnet and the fully connected (FN) layer of the Siamese network). for iteration from 1 to numberO f Iterations do 5: X1, X2, pairLabels← getBatch (trainImgs, trainLabels, batchSize) 6: gradients, loss← Evaluate(subnet, X1, X2, pairLabels) 7: Update(subnet, gradients) 8: Update(f cWeights, gradients) 9: end for 10: return tSNN←subnet, f cWeights 11: end function Note: if SNN fails to converge on the training set, the training phase is repeated.

Prototype Selection
Prototype selection involves extracting a total of k prototypes from the training set in order to reduce the dimensionality of the dissimilarity space, which would be too high to maintain each training sample as a prototype. Dimensionality can be reduced by employing clustering techniques to calculate k centroids. In this work, we perform both prototype selection separately for each class (kc prototypes per class) and via global (i.e., unsupervised) clustering (k prototypes): in both cases, results are compared considering the same final dimension. Algorithm 4 presents the pseudocode for prototype selection. As can be observed, a clustering technique is selected from a set of four possible clustering methods, each of which is employed separately to cluster the training samples. For the sake of space, only the pseudocode for the supervised clustering methods is included, though it should be noted that this work used both supervised and unsupervised clustering approaches.

Projection in the Dissimilarity Space
Classically, classifiers are trained to predict patterns within a feature space. It is also possible, as demonstrated here, for patterns to be projected in a dissimilarity space such that each pattern x is characterized by its dissimilarity to a set of prototypes P = p 1 , . . . p k and by a dissimilarity vector defined as where the similarity of pattern d(x, y) is obtained using a trained SNN. for i from 1 to numClasses do 5: images←images of the class i from imgsTrain 6: switch type do 7: case "k-means" P i ← KMeans(imgs,kc) 8: case "k-medoids" P i ← KMedoids (imgs,kc) 9: case "hierarchical" P i ← Hierarchical (imgs,kc) 10: case "spectral" P i ← Spectral (imgs,kc) 11: P←P ∪P i 12: end for 13: return P 14: end function The projection of a set of images (from the training or the testing set) into a dissimilarity space R k is described in Algorithm 5. Each image of the input set (stored in the variable X, see step 3) is compared with the k prototypes (stored in P) according to the dissimilarity measure learned by the trained SNN (PREDICTSIAMESE function, see step 4). The number of centroids is a parameter, and different values of k were tested (depending on the number of classes c): k = kc * c, kc = {15, 30 45, 60}. The output is the feature space F that includes the projections of all the input images imgs. for j from 1 to SIZE(imgs) do 3: X←imgs[j] 4: F[j]← predictSiamese (tSNN, X, P) 5: end for 6: return F 7: end function

Classification by SVM
SVM [50] is a well-known binary learner that represents training samples as points in space. The goal of SVM training (function TRAINSVM) is to find at least one hyperplane such that it separates the data that belongs to each of the two classes. Prediction (function PREDICTSVM) is accomplished by mapping an unseen pattern to the side of the hyperplane representing the class for a given data point.
SVM, as defined above, does a poor job discriminating input that is not linearly separable in its original space. This difficulty can be overcome by selecting kernel functions that map the data into a higher-dimensional space where separation is possible. In this work, we chose a radial basis function kernel to map a single vector to a vector of higher dimensionality.
Although SVM is binary, it can be applied to nonbinary or multilabel problems by training an ensemble of SVMs and combining their decisions. In the experiments reported here, the One-Against-All method is used, where an SVM is trained systematically to distinguish each class against all the others combined. The pattern is then predicted to belong to that class that produces the highest confidence score.

Heterogeneous Auto-Similarities of Characteristics (HASC)
HASC [48] is applied to heterogeneous dense feature maps. It encodes linear relations by covariances (COV) and nonlinear associations with entropy combined with mutual information (EMI). Three reasons for considering covariance matrices as descriptors are as follows: (1) they are low in dimensionality, (2) robust to noise (but with the exception that outlier pixels can render them more sensitive to noise), and (3) the covariance among two features is optimally able to encapsulate the features of the joint PDF (but with the caveat that they be linked by a linear relation). HASC obviates these limitations by combining COV with EMI.
The entropy (E) of a random variable measures the uncertainty of its value, and the mutual information (MI) of two random variables captures their generic linear and nonlinear dependencies. The way HASC utilizes these advantages is by dividing an image into patches from which it generates an EMI matrix (d × d), such that the main diagonal entries encapsulate the amount of unpredictability of the d features. The off-diagonal entries (element i, j) capture the mutual dependency between two features, that is, the i-th and j-th feature. HASC is the concatenation of vectorized EMI and COV. Specifically, the MI of a pair of random variables A, B is where p(a), p(b), and p(a, b) are the PDF of A, the PDF of B, and their joint PDF, respectively. If A = B, then MI is the entropy of A: If there exists a finite set K of realization pairs, MI can be estimated as a sample mean inside the logarithm, thus: A fast and efficient method for calculating from the K realizations the probabilities inside the logarithm is to estimate them by building a joint 2D normalized histogram of values A and B, such that each p(a k , b k ) is estimated taking the value of the 2D histogram bin containing the pair a k , b k . In this way, p(a k ) and p(b k ) can be estimated by summing all the bins corresponding to a k and b k , respectively. Thus, the i, j-th component of the EMI matrix related to the patch P can be defined as where p( . . . ) and p(.) are the probabilities estimated with the histogram and z ki is the i-th feature at pixel K.
In this study, HASC is extracted from the whole spectrogram. Given the function HASC, the output FEAT is a three-dimensional matrix (w × h × d) containing the features extracted from the image (d is the number of low-level features). The number of bins used to evaluate the histograms in the EMI computation is 28, and the number of low-level features is 6 (default parameters).

Siamese Neural Network (SNN)
SNN, initially proposed in [45], is a class of neural networks that contains two or more twin networks that share the same weights and parameters. As illustrated in Figure 2, an SNN has two Appl. Sci. 2020, 10, 8578 8 of 18 inputs that compare two patterns and one output that corresponds to the similarity between the two inputs. In other words, an SNN identifies correlations between two different input patterns (see [51] for a fairly comprehensive overview of SNN). As shown in Figure 2, the SNNs used in this work are composed of five components, as described below.
image ( is the number of low-level features). The number of bins used to evaluate the histograms in the EMI computation is 28, and the number of low-level features is 6 (default parameters). FEAT (:,:,6)], and is resized to the right dimension for the input into a CNN.

Siamese Neural Network (SNN)
SNN, initially proposed in [45], is a class of neural networks that contains two or more twin networks that share the same weights and parameters. As illustrated in Figure 2, an SNN has two inputs that compare two patterns and one output that corresponds to the similarity between the two inputs. In other words, an SNN identifies correlations between two different input patterns (see [51] for a fairly comprehensive overview of SNN). As shown in Figure 2, the SNNs used in this work are composed of five components, as described below.

The Two Identical Twin Subnetworks
In this study, four twin CNN subnetworks are utilized. CNNs are constructed by assembling specialized layers composed of neurons. Some of the more common layers found in CNN architectures include convolutional, activation (ReLU), pooling, and fully connected (FC) layers. The specific CNN architecture for the four twin subnetworks are outlined in Table 1.
Two activation functions are explored here: the ReLU activation function [52] and the Leaky ReLU [53]. The well-known ReLU activation function for the points ( , ) is defined as: The Leaky ReLU variation of ReLU is defined as

The Two Identical Twin Subnetworks
In this study, four twin CNN subnetworks are utilized. CNNs are constructed by assembling specialized layers composed of neurons. Some of the more common layers found in CNN architectures include convolutional, activation (ReLU), pooling, and fully connected (FC) layers. The specific CNN architecture for the four twin subnetworks are outlined in Table 1. Two activation functions are explored here: the ReLU activation function [52] and the Leaky ReLU [53]. The well-known ReLU activation function for the points (x i , y i ) is defined as: The Leaky ReLU variation of ReLU is defined as The subnetworks of the Siamese CNNs 1-4 each learn the features that best represent the information in the spectrograms that are the input patterns to the two input nodes (X1 and X2) and return either a 2048 or a 4096-dimensional feature vector (F1 and F2). The subnetworks share the same parameters and training weights.

Subtract Block, FC Layer, and Sigmoid Function
The Subtract Block calculates the feature vector Y as the absolute value of the difference between the resulting vectors of the two subnetworks: Following the method outlined in [44], the FC layer learns the distance model for calculating the dissimilarity. Y, the result of the Subtract Block, is processed by the FC block to produce the measure of dissimilarity between two spectrogram patterns. Because of this block, the resulting dissimilarity measure is not a metric since it does not have the following properties: identity and triangle inequality. It does, however, have the properties of symmetry and compactness as defined in [22], so that a sufficiently small perturbation of an object leads to an arbitrary small dissimilarity value between the disturbed object and its original.
The sigmoid function is then applied to the dissimilarity value to transform it to a probability value in the range [0,1]:

Clustering
Clustering is a procedure that divides unlabeled patterns into groups that maximize the commonality between members within a group and their differences with members belonging to other groups. Clustering techniques often calculate the mean vector, or centroid, of all the patterns within a cluster when forming the clusters. Centroids encapsulate salient characteristics of patterns belonging to a cluster; for this reason, they can be used to reduce the dissimilarity space size without losing too much important information. Even more information can be retained if the number of centroids representing each class is increased.
Both supervised and unsupervised clustering approaches are considered here. If kc clusters are extracted with a supervised approach, then, in each of NK classes, kc × NK clusters are extracted using unsupervised clustering on the training set. A description of the four clustering methods used in this study follows.

K-Means
K-means is one of the most popular clustering algorithms. It partitions patterns into k clusters by placing each observation into a cluster based on the nearest centroid (according to the Euclidean Distance measure).
The standard k-means algorithm involves four steps: 1.
Randomly select a set of centroids from among the data points.

2.
For each data point x remaining in the training set, compute the distance d(x) between it and the nearest centroid.

3.
Recalculate new centroids via a weighted probability distribution.

K-Medoids
K-medoids is a clustering technique that follows the same general logic behind k-means but differs in the specific way it partitions data points into clusters: K-medoids minimizes the sum of distances between a given pattern and the center of that pattern's cluster. In short, the center of a cluster in K-Means ends up being the centroid of the cluster, but the center in K-Medoids is a member (medoid) of the cluster. In other words, a medoid is that member in a cluster whose sum of distances from all other members is minimal.
The standard K-medoids algorithm involves three steps:

1.
Step one is a build-step where each k cluster is associated with a potential medoid. There are many ways to select the first medoid; the standard MATLAB's implementation does this employing the k-means++ heuristic.

2.
Step two is a swap-step where each point in a cluster is tested as a potential medoid by checking whether the sum of the within-cluster distances is smaller when using that point as the medoid. Every point is then assigned to the cluster with the closest medoid.

3.
The last step repeats previous steps until convergence.

Spectral
Spectral clustering partitions data into clusters starting from the adjacency matrix representing the undirected similarity graph of the patterns. Each pattern is a node of the graph, and two nodes are connected only if their similarity is larger than a threshold (typically set to 0). This clustering algorithm involves the following matrices: • The similarity matrix M, whose cell m ij is the similarity value of two patterns (i.e., two spectrograms s i , s j ); • The degree matrix D, which is a diagonal matrix that is obtained by summing the rows of M: • The Laplacian matrix L, which is defined as The algorithm for spectral clustering is a five-step process: 1.
Define a local neighborhood for each data point in the dataset (there are many ways to define a neighborhood; the nearest-neighbor method is the default setting in the MATLAB implementation of spectral clustering). Then compute the local similarity matrix of each pattern in the neighborhood.

2.
Calculate the Laplacian matrix L.

3.
Create a matrix V containing columns v 1 , . . . , v k , where the columns are the k eigenvectors, i.e., the spectrums (hence the name), corresponding to the k smallest eigenvalues of L.

4.
Perform k-means or k-medoids clustering by treating each row of V as a datapoint.

5.
Cluster the original pattern according to the assignments of their corresponding rows.

Hierarchical Clustering
Hierarchical clustering partitions data by creating a tree of clusters divided into n levels selected for the specific classification task. In general, hierarchical clustering is divided into two types:

1.
Agglomerative, where each pattern corresponds to a cluster. A strategy to merge couples of clusters is defined as moving up the hierarchy: each cluster in the next level is the fusion of two clusters from the previous level.

2.
Divisive, where a single cluster contains all patterns in the first level, then a splitting strategy is defined to halve clusters by moving down the hierarchy.
In this work, the agglomerative hierarchical approach is employed as this is the default MATLAB implementation. The MATLAB algorithm involves three steps:

1.
Using a distance metric, find the similarity or dissimilarity between every pair of data points in the dataset; 2.
Aggregate data points into a binary hierarchical cluster tree by fusing pairs of clusters according to their distance; 3.
Establish the level of the tree where it is cut into k clusters.
The resulting prototypes are selected as the mean vectors of each cluster.

Experimental Results
The proposed approach is tested and compared with the canonical approaches with each using the testing protocol proposed by the authors of the datasets. The performance indicator is the classification accuracy and methods were tested on the following animal vocalization datasets: • BIRDz, which functioned as a control and a real-world audio dataset in [46], a ten-run testing protocol is used; we have used the same split used by the authors of the dataset. The real-world tracks were collected from the Xeno-canto Archive (http://www.xeno-canto.org/). BIRDz includes a total of 2762 bird acoustic samples from 11 North American bird species plus 339 "unknown" samples that include noise and unknown species' vocalizations. The observations are composed of five different spectrograms: 1) constant frequency, 2) frequency modulated whistles, 3) broadband pulses, (4) broadband with varying frequency components, and 5) strong harmonics. The dataset is balanced: the size of all the "bird" classes varies between 246 and 259; only the class "other" is a little larger. • CAT, [37,47] is a dataset that contains ten balanced classes of approximately 300 samples per class for a total of 2962 samples. The testing protocol is a 10-fold cross-validation. The ten classes represent the following cat vocalizations: (1) Resting, (2) Warning, (3) Angry, (4) Defense, In this section, we report experiments aimed at evaluating the proposed system by varying several components: i.e., the input images (spectrograms or HASC images), the topology of the Siamese Network (NN1, NN2, NN3, NN4), the clustering algorithm (K-Means, K-Medoids, Hierarchical, Spectral), the clustering modality (unsupervised or supervised, i.e., clustering on the whole training set or on each class), and the number of prototypes (kc = 15, 30, 45, 60). For the training of the Siamese networks in our experiments, we have selected the training options suggested by the MATLAB framework for Siamese networks. This choice ensures that such values have not been overfitted on the selected dataset. The binary cross-entropy gives the loss function between the predicted score and the true label value. The parameters for ADAM optimization are the following: learning rate 0.0001, gradient decay factor 0.9, squared gradient decay factor 0.99. The number of iterations has been set at 3000 (without any early stop criterion).
In the first experiment, reported in Table 2, only K-means clustering is explored. The results are obtained from the fusion by sum rule of the four SVMs trained using the dissimilarity spaces built with different values for kc (15,30,45,60). Performance is reported for only NN1 and NN2 (the first two network topologies) to reduce the computation time.
The ensembles in Table 2 are obtained by varying the input data (Sp = spectrograms, HASC = HASC images), the type of clustering (unsupervised or supervised), and the network topology. The clustering method is fixed to K-means for all the methods, and the number of prototypes belongs to the following set (15,30,45,60). The column labeled #classifiers recaps the number of classifiers in the ensemble, and the first column ("Name") assigns a name to the ensemble.
The best average performance is obtained by the ensemble FA1_2 in the last row (which is the sum rule of the methods in the first eight rows). On the BIRD dataset, there is a boost in performance with NN1 and NN2 using the HASC images instead of the spectrograms, while on the CAT dataset, HASC images boost the performance of NN2 but not NN1.
The goal of the second experiment is to compare the clustering methods. To achieve this aim, in Table 3, we report the performance of different clustering approaches using the HSup-2 approach (i.e., HASC images as input, NN2 as network topology, and unsupervised version of the clustering). The last row reports the ensemble F_Clu obtained as the sum rule among the above four approaches. F_Clu produces the highest performance, though this gain is only slightly higher than that of K-means. All the clustering algorithms are quite similar in performance. Since their fusion does not achieve a clear advantage against a single approach, in the next experiments, we use only the K-means strategy for clustering by varying the number of prototypes kc.  In Table 4, the four network topologies are coupled with K-means clustering and HSup (i.e., HASC images as input and the unsupervised version of clustering). The last row, which reports the ensemble F_NN is obtained as the sum rule among the above four approaches, and it produces the average best performance on this test. Even better results are obtained by combining by sum rule all the approaches reported in the previous tables: the combined performance on CAT is 85.76 and on BIRD 95.08. It is clear that the ensemble strongly outperforms a simple Sup-1. The superiority of one method over another is validated according to the Wilcoxon signed-rank test [54]; F_NN outperforms each of the other methods with a p-value of 0.05.
The following experiment is aimed at evaluating the possibility of creating ensembles by varying some parameters of the method. For a fair comparison among ensembles in Table 5, the performance of an ensemble obtained by retraining Siamese HSup-1 (i.e., by using the same approach as that reported in Table 2) is compared with ensembles obtained by varying the network topology. The results of Table 5 indicate that changing the network topology introduces diversity in the ensemble: the comparison among ensembles of 4, 8, and 16 networks (see HSup-1x), obtained from the topology NN1, and the ensembles named F_NN, obtained by varying the topology of the Siamese Network, show an evident performance gain in favor of the latter (with the same number of classifiers). It is also interesting to observe the similar results in rows 2 and 3: both ensembles have four networks, but the first is made simply by retraining the same model, while the second has different numbers of prototypes. Thus, varying the values of kc is not very important when building an ensemble. For obtaining superior performance, it is necessary to vary the network topologies. Finally, reported in Table 6 is a comparison between the Siamese networks and standard CNNs trained with spectrograms. The method labeled eCNN is the fusion among different CNNs (GoogleNet, VGG16, VGG19, and GoogleNetP365). From the results reported in the previous tables, the following conclusions can be drawn: • The best way for building an ensemble of Siamese networks is to combine different network topologies; • The proposed F_NN ensemble improves previous methods based on Siamese networks (cf. OLD in Table 6); • F_NN obtains a performance that is similar to eCNN on BIRD but lower than eCNN on CAT; • The best performance in both datasets is gained by sum rule between eCNN and F_NN (i.e., the fusion among CNNs and the Siamese networks).
As far as the computation requirements are concerned, using a Xeon E5-1603 v4-2.8 GHz-64 GB Ram, the proposed approach requires a classification time of less than 0.028 s per a batch of 1251 images in the testing phase, the creation of the test dissimilarity space requires 44.30 s per a batch of 1251 images, and Hasc needs 0.006 s to be performed in a given spectrogram. The training phase for a training dataset of 1511 images requires 4415 s for a single NN1 training (using a single GPU Titan Xp), 88.60 s for the creation of a single training dissimilarity space, and 0.07 s for the training of a single SVM. Thus, the time required for testing is near real-time using a powerful server/GPU.
To further validate our approach, we tested it on the ESC-50 benchmark audio classification dataset. To reduce the computation time, only Sup-1 was tested. It obtained 52% accuracy but needed a high number of training iterations for network convergence. For the ESC-50 dataset, Sup-1 was trained for 25,000 epochs rather than for 3000 epochs, as was done for CAT and BIRD. For comparison, a simple CNN with only 3-layers [55] obtained an accuracy of 54%.
In Table 7, some other state-of-the-art results in the literature are reported on the CAT and BIRD datasets (using the same testing protocol). As can be observed, the performance of the ensembles described in this paper approaches those reported in the literature. Note that in Table 7 the results from [46] are based on a feature selection approach where the number of selected features is the hyperparameters selected on that dataset. Finally, it is worth noting that for a more fair comparison among acoustic animal classification works, a deeper experimental evaluation across multiple datasets is needed. The experiments presented in this paper speak to the robustness of the proposed approach: competitive classification accuracy, compared to the state-of-the-art in the literature, has been obtained on two datasets including very different data without any ad-hoc parameter tuning. These results were produced by following a clear and unambiguous testing protocol. The value of reporting methods across datasets means that the results reported here can reasonably serve as a baseline for later research comparisons in this area.

Conclusions
This work presents a method for classifying animal vocalizations using four Siamese networks and dissimilarity spaces. Different clustering techniques taking both a supervised and unsupervised approach were used for dissimilarity space generation. A compact descriptor is obtained by projecting each pattern into the dissimilarity spaces generated by the clustering methods using different numbers of centroids and the outputs of the four Siamese networks. The classification step is performed using a set of SVMs trained from such descriptors and combined by the sum rule to obtain a highly competitive ensemble as tested on two datasets of animal vocalizations. In addition, experimental results demonstrated the diversity between the proposed approach and other state-of-the-art approaches can be exploited in an ensemble to further improve the classification performance. The fusions improved the performance on both audio classification problems, outperforming the standalone approaches.
Future work in this area will focus on experimentally deriving ensembles using the same approach. The goal will be to assess the approach proposed here for generalizability across many sound classification problems, such as those cited in [36,44]. This will involve testing the proposed method by adding more supervised and unsupervised clustering techniques and additional Siamese network architectures.