All articles published by MDPI are made immediately available worldwide under an open access license. No special
permission is required to reuse all or part of the article published by MDPI, including figures and tables. For
articles published under an open access Creative Common CC BY license, any part of the article may be reused without
permission provided that the original article is clearly cited. For more information, please refer to
Feature papers represent the most advanced research with significant potential for high impact in the field. A Feature
Paper should be a substantial original Article that involves several techniques or approaches, provides an outlook for
future research directions and describes possible research applications.
Feature papers are submitted upon individual invitation or recommendation by the scientific editors and must receive
positive feedback from the reviewers.
Editor’s Choice articles are based on recommendations by the scientific editors of MDPI journals from around the world.
Editors select a small number of articles recently published in the journal that they believe will be particularly
interesting to readers, or important in the respective research area. The aim is to provide a snapshot of some of the
most exciting work published in the various research areas of the journal.
The classifier system proposed in this work combines the dissimilarity spaces produced by a set of Siamese neural networks (SNNs) designed using four different backbones with different clustering techniques for training SVMs for automated animal audio classification. The system is evaluated on two animal audio datasets: one for cat and another for bird vocalizations. The proposed approach uses clustering methods to determine a set of centroids (in both a supervised and unsupervised fashion) from the spectrograms in the dataset. Such centroids are exploited to generate the dissimilarity space through the Siamese networks. In addition to feeding the SNNs with spectrograms, experiments process the spectrograms using the heterogeneous auto-similarities of characteristics. Once the similarity spaces are computed, each pattern is “projected” into the space to obtain a vector space representation; this descriptor is then coupled to a support vector machine (SVM) to classify a spectrogram by its dissimilarity vector. Results demonstrate that the proposed approach performs competitively (without ad-hoc optimization of the clustering methods) on both animal vocalization datasets. To further demonstrate the power of the proposed system, the best standalone approach is also evaluated on the challenging Dataset for Environmental Sound Classification (ESC50) dataset.
Over the past decade, research in sound classification and recognition has gained in popularity and rapidly broaden in its application from the more traditional focus on speech recognition  and music genre classification  to biometric identification , computer-aided heart sound detection , environmental audio scene and sound recognition [5,6], biodiversity assessment , human voice classification and emotion recognition , English accent classification, and gender identification , to list a few of a wide range of application areas. As with research in pattern recognition generally, the features fed into classifiers were initially engineered, which, in the case of sound applications, meant extracting from raw audio traces such descriptors as the Statistical Spectrum Descriptor and Rhythm Histogram .
In the past decade, researchers began exploring the possibility of visually representing audio signals to apply more powerful image classification descriptors and techniques. Initially, visual representations of audio traces centered around the display overtime of the frequency spectrum: examples of these representations include the spectrogram  and Mel-frequency Cepstral Coefficients spectrogram . Although a spectrogram is typically a graph with two dimensions (time and frequency), additional dimensions can be included, such as pixel intensity , which includes at each time step the representation of an audio signal’s amplitude in a specific frequency. Initially, popular texture descriptors, such as Haralick’s Grey Level Co-occurrence Matrices (GLCMs) , Gabor filters , and Local Binary Patterns (LBP)  and its variants  were extracted from these spectrograms for the task of music genre classification [18,19,20]. Fusions of a large set of state-of-the-art texture descriptors were then experimentally applied to spectrograms, and certain combinations were shown to enhance further the accuracy of music genre classification .
The rise in popularity of deep learning due to affordable graphic processing units (GPUs) changed the trajectory in machine learning research . Deep learners, such as the convolutional neural network (CNN), produced far superior results to most other classifiers when it came to image classification . Consequently, more attention was placed on representing acoustic traces through visual representations so that deep learning approaches could be applied. Engineered features diminished in importance since deep classifiers learn which patterns perform best for a specific problem during the training process. However, engineered features have been shown to augment deep learning approaches when fused. Early work with CNNs applied to visual representations obtained state-of-the-art in chord detection and recognition [23,24] and in music genre classification . In , for example, spectrograms were converted into GCLM maps and trained on CNNs. In , canonical approaches, such as LBP-trained SVMs were fused with CNNs and shown to outperform previous systems.
Another development in sound classification involves the design of deep learners and feature sets specific to audio classification. For example, in , the authors explored variations in CNN architecture and parameters; in  a novel sparse coding CNN was developed and shown to perform well if not better than the state-of-the-art for sound event recognition and retrieval. Also of note is the hybrid multimodal deep learning approach proposed in  for multilabel music genre classification that combined album cover images, reviews, and audio tracks; this system was shown to outperform the single-modality approaches. In  the authors present a gated CNN and a temporary attention-based classification system for audio classification from weakly labeled data.
When it comes to animal sound classification, the focus of this study, fusions of CNNs with other methods to classify animals have been evaluated in  and  for fish identification using the Fish and MBARI benthic animal dataset and in  for bioacoustic bird species classification using a dataset of audio samples from 43 species. In , the authors combined engineered features with CNNs, and in , deep learning was combined with shallow learning. Both demonstrated that the fusions performed better than the standalone approaches.
Animal sound classification is a vibrant area of research. Many benchmark datasets are available for a variety of animals, such as bats , birds [34,35], whales , cats , and frogs . In the literature, animal audio classification is broadly divided into two categories: the CNN approach discussed above and fingerprinting , which involves a compact audio representation for comparing audio segments in terms of similarity and dissimilarity . Both approaches, however, suffer from limitations: fingerprinting can only find exact matches, and CNNs require large datasets for accurate training; many animal datasets are small because of the difficulty of collecting samples.
The superiority of combining the deep learning approach with fingerprinting is demonstrated in , where a Siamese Neural Network (SNN) produced semantic descriptions of audio signals. SNNs have been applied to sound classification in [40,41,42] and have the advantage over the canonical CNN in their ability to generalize. In , a system was developed based on dissimilarity spaces, such as that proposed in  for brain image classification, where a distance model was learned by training a SNN  on dissimilarity values. This system combined several clustering approaches to obtain a dissimilarity space; each pattern was represented in such a space, and the resulting descriptors were used to train a general purpose classifier (SVM). The clustering methods transformed the spectrograms in a bird  and cat [37,47] dataset to a set of centroids that were used to generate a vector space representation for each pattern. This vector was then used to train an SVM. Results showed that this approach worked better than the standalone CNNs.
The system proposed in this work is similar to  in that it generates a dissimilarity space from the training set using an SNN to define a distance function from the input spectrograms. The objective at this point in the process is to maximize the distance separating the patterns of the different classes. Unlike , however, four different CNN architectures are selected for the twin classifiers, and both the original input spectrograms and the spectrograms processed by Heterogeneous Auto-Similarities of Characteristics  make up the inputs to the SNNs. In the testing phase, the unknown pattern is compared to the centroids of the dissimilarity space using the different SNNs for comparing two samples and measure their dissimilarity. Although it is the case that the entire training set can function as the centroids of the dissimilarity space, it is desirable to reduce dimensionality by selecting a smaller number of prototypes. In this work, both a supervised and unsupervised clustering algorithm are used to reduce dimensionality. The dissimilarity space represents each input (both the original and the processed spectrograms) by its distance from each of the centroids, or prototypes, a distance that is learned by the SNNs. In other words, the SNNs compare a given spectrogram to each of the centroids to obtain a dissimilarity feature vector. A support vector machine (SVM) is then trained on these features.
The approach taken in this paper is evaluated on the same animal vocalization datasets as in , i.e., on cats  and birds . In addition, the system is tested on the challenging benchmark dataset for Environmental Sound Classification (ESC-50) . As in , the most effective system is the result of a fusion of classifiers, i.e., the sum rule of SVMs trained on different dissimilarity spaces generated by changing the value of k in the clustering approaches and network topologies. Performance is compared with both the state-of-the-art as well as with fusions with the state-of-the-art. Results demonstrate the power of using dissimilarity spaces based on an ensemble of SNNs, particularly when coupled with a standard CNN approach. The MATLAB code of the proposed method is freely available at https://github.com/LorisNanni.
The remainder of this paper is broken down into the following sections. In Section 2, a detailed overview of the proposed approach is provided; in Section 3, SNN is discussed along with the different CNN architectures that make up the subnetworks. In Section 4, the supervised and unsupervised clustering methods are outlined. In Section 5, experimental results are presented along with comparisons with the state-of-the-art. The paper concludes in Section 6 with an overview and some suggestions for future research.
2. Proposed System
The system presented here for spectrogram classification extends that proposed in  and is schematized in Figure 1, which illustrates the approach using one SNN. The pseudocode in Algorithms 1 and 2 that correspond to Figure 1 are detailed in the remainder of this section.
The process begins by generating a similarity space during the training phase via a learning distance measure from a set of prototypes . The distance measure is learned by four SNNs trained (1) to maximize the similarity between pairs of spectrograms belonging to the same class and (2) to minimalize the similarity for pairs of spectrograms belonging to different classes. The set of prototypes generated in this phase are the centroids of the clusters produced by a clustering approach. What results is a feature vector that represents training sample in the dissimilarity space, where a given is the distance between and the prototype . Once these features have been calculated, they are used to train an SVM classifier.
In the testing phase, each unknown pattern is represented by its projection in a dissimilarity space: the descriptor is obtained by calculating the pattern distance to the set of prototypes and the resulting feature vectors of the input images are then classified by SVM. In our experiments, we test both spectrograms and HASC , a 2D descriptor extracted from the spectrogram, as the input of the classification process. The method for extracting HASC is outlined in Section 2.5.
Algorithm 1. Training phase
Input: Training images (imgsTrain), training labels (labelTrain), number of training iterations (trainIterations), batch size (trainBatchSize), number of centroids (k), and clustering technique (type).
Output: Trained SNN (tSNN), set of centroids (C), and trained SVM (svm).
Input: Test images (imgsTest), trained SNN (tSNN), Set of centroids (C), Trained SVM (tSVM).
Output: Actual test labels (labelTest).
1: F ←getDissSpaceProjection (imgsTest, P, tSNN)
2: labelTest ←predictSvm (F, tSVM)
2.1. Siamese Neural Network Training
To define a similarity measure inside the Dissimilarity Space, a Siamese Network is trained to compare a couple of spectrograms and return a similarity value that is the greater value if they belong to the same class and the lower value if they belong to different classes. The pseudocode for training an SNN, which is called in step 1 of Algorithm 1, is reported in Algorithm 3. The SNN architecture is defined in steps 2 and 3 (more details are given in Section 3). The training process consists in repeating Steps 5–8: random extraction of batchSize spectrogram pairs from the training set (via the function GETBATCH), computation of loss and gradients for each pair passed to the SNN, and upgrade of the SNN accordingly (i.e., the subnet and the fully connected (FN) layer of the Siamese network).
Algorithm 3. Siamese training pseudocode
Input: Training image (trainImgs), training labels (trainLabels), batch size (batchSize), and iterations (numberO f Iterations).
Note: if SNN fails to converge on the training set, the training phase is repeated.
2.2. Prototype Selection
Prototype selection involves extracting a total of k prototypes from the training set in order to reduce the dimensionality of the dissimilarity space, which would be too high to maintain each training sample as a prototype. Dimensionality can be reduced by employing clustering techniques to calculate k centroids. In this work, we perform both prototype selection separately for each class (kc prototypes per class) and via global (i.e., unsupervised) clustering (k prototypes): in both cases, results are compared considering the same final dimension. Algorithm 4 presents the pseudocode for prototype selection. As can be observed, a clustering technique is selected from a set of four possible clustering methods, each of which is employed separately to cluster the training samples. For the sake of space, only the pseudocode for the supervised clustering methods is included, though it should be noted that this work used both supervised and unsupervised clustering approaches.
Algorithm 4. Clustering pseudocode
Input: Training images (imgsTrain), training labels (labelTrain), number of clusters (k), and clustering technique (type).
Output: Centroids P.
2: numClasses←number of classes from labelTrain
4: fori from 1 to numClassesdo
5: images←images of the class i from imgsTrain
6: switchtype do
7: case “k-means” Pi ←KMeans(imgs,kc)
8: case “k-medoids” Pi ←KMedoids (imgs,kc)
9: case “hierarchical” Pi ←Hierarchical (imgs,kc)
10: case “spectral” Pi ←Spectral (imgs,kc)
11: P←P ∪Pi
12: end for
14: end function
2.3. Projection in the Dissimilarity Space
Classically, classifiers are trained to predict patterns within a feature space. It is also possible, as demonstrated here, for patterns to be projected in a dissimilarity space such that each pattern is characterized by its dissimilarity to a set of prototypes and by a dissimilarity vector defined as
where the similarity of pattern is obtained using a trained SNN.
The projection of a set of images (from the training or the testing set) into a dissimilarity space is described in Algorithm 5. Each image of the input set (stored in the variable X, see step 3) is compared with the k prototypes (stored in P) according to the dissimilarity measure learned by the trained SNN (PREDICTSIAMESE function, see step 4). The number of centroids is a parameter, and different values of were tested (depending on the number of classes ): . The output is the feature space F that includes the projections of all the input images imgs.
Algorithm 5. Projection in the Dissimilarity space pseudocode
Input: Images (imgs), Centroids (P), number of centroids (k), and trained SNN (tSNN).
Output: Feature vectors (F).
2: forj from 1 to SIZE(imgs) do
4: F[j]← predictSiamese (tSNN, X, P)
5: end for
7: end function
2.4. Classification by SVM
SVM  is a well-known binary learner that represents training samples as points in space. The goal of SVM training (function TRAINSVM) is to find at least one hyperplane such that it separates the data that belongs to each of the two classes. Prediction (function PREDICTSVM) is accomplished by mapping an unseen pattern to the side of the hyperplane representing the class for a given data point.
SVM, as defined above, does a poor job discriminating input that is not linearly separable in its original space. This difficulty can be overcome by selecting kernel functions that map the data into a higher-dimensional space where separation is possible. In this work, we chose a radial basis function kernel to map a single vector to a vector of higher dimensionality.
Although SVM is binary, it can be applied to nonbinary or multilabel problems by training an ensemble of SVMs and combining their decisions. In the experiments reported here, the One-Against-All method is used, where an SVM is trained systematically to distinguish each class against all the others combined. The pattern is then predicted to belong to that class that produces the highest confidence score.
2.5. Heterogeneous Auto-Similarities of Characteristics (HASC)
HASC  is applied to heterogeneous dense feature maps. It encodes linear relations by covariances (COV) and nonlinear associations with entropy combined with mutual information (EMI). Three reasons for considering covariance matrices as descriptors are as follows: (1) they are low in dimensionality, (2) robust to noise (but with the exception that outlier pixels can render them more sensitive to noise), and (3) the covariance among two features is optimally able to encapsulate the features of the joint PDF (but with the caveat that they be linked by a linear relation). HASC obviates these limitations by combining COV with EMI.
The entropy (E) of a random variable measures the uncertainty of its value, and the mutual information (MI) of two random variables captures their generic linear and nonlinear dependencies. The way HASC utilizes these advantages is by dividing an image into patches from which it generates an EMI matrix (, such that the main diagonal entries encapsulate the amount of unpredictability of the features. The off-diagonal entries (element capture the mutual dependency between two features, that is, the -th and -th feature. HASC is the concatenation of vectorized EMI and COV. Specifically, the MI of a pair of random variables is
where , and are the PDF of , the PDF of , and their joint PDF, respectively. If , then MI is the entropy of :
If there exists a finite set of realization pairs, MI can be estimated as a sample mean inside the logarithm, thus:
A fast and efficient method for calculating from the realizations the probabilities inside the logarithm is to estimate them by building a joint 2D normalized histogram of values and , such that each is estimated taking the value of the 2D histogram bin containing the pair . In this way, and can be estimated by summing all the bins corresponding to and , respectively. Thus, the -th component of the EMI matrix related to the patch can be defined as
where (…) and (.) are the probabilities estimated with the histogram and is the -th feature at pixel .
In this study, HASC is extracted from the whole spectrogram. Given the function HASC, the output FEAT is a three-dimensional matrix () containing the features extracted from the image ( is the number of low-level features). The number of bins used to evaluate the histograms in the EMI computation is 28, and the number of low-level features is 6 (default parameters). FEAT is reshaped to build the vector [FEAT (:,:,1) FEAT (:,:,2); FEAT (:,:,3) FEAT (:,:,4); FEAT (:,:,5) FEAT (:,:,6)], and is resized to the right dimension for the input into a CNN.
3. Siamese Neural Network (SNN)
SNN, initially proposed in , is a class of neural networks that contains two or more twin networks that share the same weights and parameters. As illustrated in Figure 2, an SNN has two inputs that compare two patterns and one output that corresponds to the similarity between the two inputs. In other words, an SNN identifies correlations between two different input patterns (see  for a fairly comprehensive overview of SNN). As shown in Figure 2, the SNNs used in this work are composed of five components, as described below.
3.1. The Two Identical Twin Subnetworks
In this study, four twin CNN subnetworks are utilized. CNNs are constructed by assembling specialized layers composed of neurons. Some of the more common layers found in CNN architectures include convolutional, activation (ReLU), pooling, and fully connected (FC) layers. The specific CNN architecture for the four twin subnetworks are outlined in Table 1.
Two activation functions are explored here: the ReLU activation function  and the Leaky ReLU . The well-known ReLU activation function for the points is defined as:
The Leaky ReLU variation of ReLU is defined as
The subnetworks of the Siamese CNNs 1–4 each learn the features that best represent the information in the spectrograms that are the input patterns to the two input nodes (X1 and X2) and return either a 2048 or a 4096-dimensional feature vector (F1 and F2). The subnetworks share the same parameters and training weights.
3.2. Subtract Block, FC Layer, and Sigmoid Function
The Subtract Block calculates the feature vector Y as the absolute value of the difference between the resulting vectors of the two subnetworks:
Following the method outlined in , the FC layer learns the distance model for calculating the dissimilarity. Y, the result of the Subtract Block, is processed by the FC block to produce the measure of dissimilarity between two spectrogram patterns. Because of this block, the resulting dissimilarity measure is not a metric since it does not have the following properties: identity and triangle inequality. It does, however, have the properties of symmetry and compactness as defined in , so that a sufficiently small perturbation of an object leads to an arbitrary small dissimilarity value between the disturbed object and its original.
The sigmoid function is then applied to the dissimilarity value to transform it to a probability value in the range [0,1]:
Clustering is a procedure that divides unlabeled patterns into groups that maximize the commonality between members within a group and their differences with members belonging to other groups. Clustering techniques often calculate the mean vector, or centroid, of all the patterns within a cluster when forming the clusters. Centroids encapsulate salient characteristics of patterns belonging to a cluster; for this reason, they can be used to reduce the dissimilarity space size without losing too much important information. Even more information can be retained if the number of centroids representing each class is increased.
Both supervised and unsupervised clustering approaches are considered here. If clusters are extracted with a supervised approach, then, in each of classes, clusters are extracted using unsupervised clustering on the training set. A description of the four clustering methods used in this study follows.
K-means is one of the most popular clustering algorithms. It partitions patterns into k clusters by placing each observation into a cluster based on the nearest centroid (according to the Euclidean Distance measure).
The standard k-means algorithm involves four steps:
Randomly select a set of centroids from among the data points.
For each data point x remaining in the training set, compute the distance d(x) between it and the nearest centroid.
Recalculate new centroids via a weighted probability distribution.
Repeat Steps 2 and 3 until convergence.
K-medoids is a clustering technique that follows the same general logic behind k-means but differs in the specific way it partitions data points into clusters: K-medoids minimizes the sum of distances between a given pattern and the center of that pattern’s cluster. In short, the center of a cluster in K-Means ends up being the centroid of the cluster, but the center in K-Medoids is a member (medoid) of the cluster. In other words, a medoid is that member in a cluster whose sum of distances from all other members is minimal.
The standard K-medoids algorithm involves three steps:
Step one is a build-step where each k cluster is associated with a potential medoid. There are many ways to select the first medoid; the standard MATLAB’s implementation does this employing the k-means++ heuristic.
Step two is a swap-step where each point in a cluster is tested as a potential medoid by checking whether the sum of the within-cluster distances is smaller when using that point as the medoid. Every point is then assigned to the cluster with the closest medoid.
The last step repeats previous steps until convergence.
Spectral clustering partitions data into clusters starting from the adjacency matrix representing the undirected similarity graph of the patterns. Each pattern is a node of the graph, and two nodes are connected only if their similarity is larger than a threshold (typically set to 0).
This clustering algorithm involves the following matrices:
The similarity matrix M, whose cell is the similarity value of two patterns (i.e., two spectrograms , );
The degree matrix D, which is a diagonal matrix that is obtained by summing the rows of M:
The Laplacian matrix L, which is defined as
The algorithm for spectral clustering is a five-step process:
Define a local neighborhood for each data point in the dataset (there are many ways to define a neighborhood; the nearest-neighbor method is the default setting in the MATLAB implementation of spectral clustering). Then compute the local similarity matrix of each pattern in the neighborhood.
Calculate the Laplacian matrix .
Create a matrix containing columns , …, , where the columns are the eigenvectors, i.e., the spectrums (hence the name), corresponding to the smallest eigenvalues of L.
Perform k-means or k-medoids clustering by treating each row of V as a datapoint.
Cluster the original pattern according to the assignments of their corresponding rows.
4.4. Hierarchical Clustering
Hierarchical clustering partitions data by creating a tree of clusters divided into n levels selected for the specific classification task. In general, hierarchical clustering is divided into two types:
Agglomerative, where each pattern corresponds to a cluster. A strategy to merge couples of clusters is defined as moving up the hierarchy: each cluster in the next level is the fusion of two clusters from the previous level.
Divisive, where a single cluster contains all patterns in the first level, then a splitting strategy is defined to halve clusters by moving down the hierarchy.
In this work, the agglomerative hierarchical approach is employed as this is the default MATLAB implementation. The MATLAB algorithm involves three steps:
Using a distance metric, find the similarity or dissimilarity between every pair of data points in the dataset;
Aggregate data points into a binary hierarchical cluster tree by fusing pairs of clusters according to their distance;
Establish the level of the tree where it is cut into k clusters.
The resulting prototypes are selected as the mean vectors of each cluster.
5. Experimental Results
The proposed approach is tested and compared with the canonical approaches with each using the testing protocol proposed by the authors of the datasets. The performance indicator is the classification accuracy and methods were tested on the following animal vocalization datasets:
BIRDz, which functioned as a control and a real-world audio dataset in , a ten-run testing protocol is used; we have used the same split used by the authors of the dataset. The real-world tracks were collected from the Xeno-canto Archive (http://www.xeno-canto.org/). BIRDz includes a total of 2762 bird acoustic samples from 11 North American bird species plus 339 “unknown” samples that include noise and unknown species’ vocalizations. The observations are composed of five different spectrograms: 1) constant frequency, 2) frequency modulated whistles, 3) broadband pulses, (4) broadband with varying frequency components, and 5) strong harmonics. The dataset is balanced: the size of all the “bird” classes varies between 246 and 259; only the class “other” is a little larger.
CAT, [37,47] is a dataset that contains ten balanced classes of approximately 300 samples per class for a total of 2962 samples. The testing protocol is a 10-fold cross-validation. The ten classes represent the following cat vocalizations: (1) Resting, (2) Warning, (3) Angry, (4) Defense, (5) Fighting, (6)·Happy, (7) Hunting mind, (8) Mating, (9) Mother call, and (10) Paining. The average duration of each sample is approximately 4 s. Samples were garnered from such online resources as Kaggle, Youtube, and Flickr.
In this section, we report experiments aimed at evaluating the proposed system by varying several components: i.e., the input images (spectrograms or HASC images), the topology of the Siamese Network (NN1, NN2, NN3, NN4), the clustering algorithm (K-Means, K-Medoids, Hierarchical, Spectral), the clustering modality (unsupervised or supervised, i.e., clustering on the whole training set or on each class), and the number of prototypes (kc = 15, 30, 45, 60). For the training of the Siamese networks in our experiments, we have selected the training options suggested by the MATLAB framework for Siamese networks. This choice ensures that such values have not been overfitted on the selected dataset. The binary cross-entropy gives the loss function between the predicted score and the true label value. The parameters for ADAM optimization are the following: learning rate 0.0001, gradient decay factor 0.9, squared gradient decay factor 0.99. The number of iterations has been set at 3000 (without any early stop criterion).
In the first experiment, reported in Table 2, only K-means clustering is explored. The results are obtained from the fusion by sum rule of the four SVMs trained using the dissimilarity spaces built with different values for kc (15, 30, 45, 60). Performance is reported for only NN1 and NN2 (the first two network topologies) to reduce the computation time.
The ensembles in Table 2 are obtained by varying the input data (Sp = spectrograms, HASC = HASC images), the type of clustering (unsupervised or supervised), and the network topology. The clustering method is fixed to K-means for all the methods, and the number of prototypes belongs to the following set (15,30,45,60). The column labeled #classifiers recaps the number of classifiers in the ensemble, and the first column (“Name”) assigns a name to the ensemble.
The best average performance is obtained by the ensemble FA1_2 in the last row (which is the sum rule of the methods in the first eight rows). On the BIRD dataset, there is a boost in performance with NN1 and NN2 using the HASC images instead of the spectrograms, while on the CAT dataset, HASC images boost the performance of NN2 but not NN1.
The goal of the second experiment is to compare the clustering methods. To achieve this aim, in Table 3, we report the performance of different clustering approaches using the HSup-2 approach (i.e., HASC images as input, NN2 as network topology, and unsupervised version of the clustering). The last row reports the ensemble F_Clu obtained as the sum rule among the above four approaches. F_Clu produces the highest performance, though this gain is only slightly higher than that of K-means. All the clustering algorithms are quite similar in performance. Since their fusion does not achieve a clear advantage against a single approach, in the next experiments, we use only the K-means strategy for clustering by varying the number of prototypes kc.
In Table 4, the four network topologies are coupled with K-means clustering and HSup (i.e., HASC images as input and the unsupervised version of clustering). The last row, which reports the ensemble F_NN is obtained as the sum rule among the above four approaches, and it produces the average best performance on this test.
Even better results are obtained by combining by sum rule all the approaches reported in the previous tables: the combined performance on CAT is 85.76 and on BIRD 95.08. It is clear that the ensemble strongly outperforms a simple Sup-1. The superiority of one method over another is validated according to the Wilcoxon signed-rank test ; F_NN outperforms each of the other methods with a p-value of 0.05.
The following experiment is aimed at evaluating the possibility of creating ensembles by varying some parameters of the method. For a fair comparison among ensembles in Table 5, the performance of an ensemble obtained by retraining Siamese HSup-1 (i.e., by using the same approach as that reported in Table 2) is compared with ensembles obtained by varying the network topology. The results of Table 5 indicate that changing the network topology introduces diversity in the ensemble: the comparison among ensembles of 4, 8, and 16 networks (see HSup-1x), obtained from the topology NN1, and the ensembles named F_NN, obtained by varying the topology of the Siamese Network, show an evident performance gain in favor of the latter (with the same number of classifiers). It is also interesting to observe the similar results in rows 2 and 3: both ensembles have four networks, but the first is made simply by retraining the same model, while the second has different numbers of prototypes. Thus, varying the values of kc is not very important when building an ensemble. For obtaining superior performance, it is necessary to vary the network topologies.
Finally, reported in Table 6 is a comparison between the Siamese networks and standard CNNs trained with spectrograms. The method labeled eCNN is the fusion among different CNNs (GoogleNet, VGG16, VGG19, and GoogleNetP365).
From the results reported in the previous tables, the following conclusions can be drawn:
The best way for building an ensemble of Siamese networks is to combine different network topologies;
The proposed F_NN ensemble improves previous methods based on Siamese networks (cf. OLD in Table 6);
F_NN obtains a performance that is similar to eCNN on BIRD but lower than eCNN on CAT;
The best performance in both datasets is gained by sum rule between eCNN and F_NN (i.e., the fusion among CNNs and the Siamese networks).
As far as the computation requirements are concerned, using a Xeon E5-1603 v4-2.8 GHz-64 GB Ram, the proposed approach requires a classification time of less than 0.028 s per a batch of 1251 images in the testing phase, the creation of the test dissimilarity space requires 44.30 s per a batch of 1251 images, and Hasc needs 0.006 s to be performed in a given spectrogram. The training phase for a training dataset of 1511 images requires 4415 s for a single NN1 training (using a single GPU Titan Xp), 88.60 s for the creation of a single training dissimilarity space, and 0.07 s for the training of a single SVM. Thus, the time required for testing is near real-time using a powerful server/GPU.
To further validate our approach, we tested it on the ESC-50 benchmark audio classification dataset. To reduce the computation time, only Sup-1 was tested. It obtained 52% accuracy but needed a high number of training iterations for network convergence. For the ESC-50 dataset, Sup-1 was trained for 25,000 epochs rather than for 3000 epochs, as was done for CAT and BIRD. For comparison, a simple CNN with only 3-layers  obtained an accuracy of 54%.
In Table 7, some other state-of-the-art results in the literature are reported on the CAT and BIRD datasets (using the same testing protocol). As can be observed, the performance of the ensembles described in this paper approaches those reported in the literature. Note that in Table 7 the results from  are based on a feature selection approach where the number of selected features is the hyperparameters selected on that dataset.
Finally, it is worth noting that for a more fair comparison among acoustic animal classification works, a deeper experimental evaluation across multiple datasets is needed. The experiments presented in this paper speak to the robustness of the proposed approach: competitive classification accuracy, compared to the state-of-the-art in the literature, has been obtained on two datasets including very different data without any ad-hoc parameter tuning. These results were produced by following a clear and unambiguous testing protocol. The value of reporting methods across datasets means that the results reported here can reasonably serve as a baseline for later research comparisons in this area.
This work presents a method for classifying animal vocalizations using four Siamese networks and dissimilarity spaces. Different clustering techniques taking both a supervised and unsupervised approach were used for dissimilarity space generation. A compact descriptor is obtained by projecting each pattern into the dissimilarity spaces generated by the clustering methods using different numbers of centroids and the outputs of the four Siamese networks. The classification step is performed using a set of SVMs trained from such descriptors and combined by the sum rule to obtain a highly competitive ensemble as tested on two datasets of animal vocalizations. In addition, experimental results demonstrated the diversity between the proposed approach and other state-of-the-art approaches can be exploited in an ensemble to further improve the classification performance. The fusions improved the performance on both audio classification problems, outperforming the standalone approaches.
Future work in this area will focus on experimentally deriving ensembles using the same approach. The goal will be to assess the approach proposed here for generalizability across many sound classification problems, such as those cited in [36,44]. This will involve testing the proposed method by adding more supervised and unsupervised clustering techniques and additional Siamese network architectures.
L.N. conceived of the presented idea, G.M., L.N., A.L. performed the experiments. S.B., A.L. wrote the manuscript. S.B. provided some resources. All authors have read and agree to the published version of the manuscript.
This research received no external funding.
The authors are grateful to NVIDIA Corporation for supporting this research with the donation of a Titan Xp GPU.
Conflicts of Interest
The authors declare no conflict of interest.
Padmanabhan, J.; Premkumar, M.J.J. Machine learning in automatic speech recognition: A survey. IETE Tech. Rev.2015, 32, 240–251. [Google Scholar] [CrossRef]
Nanni, L.; Costa, Y.M.; Lucio, D.R.; Silla, C.N., Jr.; Brahnam, S. Combining visual and acoustic features for audio classification tasks. Pattern Recognit. Lett.2017, 88, 49–56. [Google Scholar] [CrossRef]
Sahoo, S.; Choubisa, T.; Prasanna, S.M. Multimodal Biometric Person Authentication: A Review. IETE Tech. Rev.2012, 29, 54–75. [Google Scholar] [CrossRef]
Li, S.; Li, F.; Tang, S.; Xiong, W. A Review of Computer-Aided Heart Sound Detection Techniques. BioMed Res. Int.2020, 2020, 5846191. [Google Scholar] [CrossRef]
Chandrakala, S.; Jayalakshmi, S.L. Generative Model Driven Representation Learning in a Hybrid Framework for Environmental Audio Scene and Sound Event Recognition. IEEE Trans. Multimed.2019, 22, 3–14. [Google Scholar] [CrossRef]
Chachada, S.; Kuo, C.-C.J. Environmental sound recognition: A survey. In Proceedings of the 2013 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, Kaohsiung, Taiwan, 29 October–1 November 2013; pp. 1–9. [Google Scholar] [CrossRef]
Badshah, A.M.; Ahmad, J.; Rahim, N.; Baik, S.W. Speech emotion recognition from spectrograms with deep convolutional neural network. In Proceedings of the 2017 International Conference on Platform Technology and Service (PlatCon), Busan, Korea, 13–15 February 2017; pp. 1–5. [Google Scholar]
Lidy, T.; Rauber, A. Evaluation of feature extractors and psycho-acoustic transformations for music genre classification. In Proceedings of the 6th International Conference on Music Information Retrieval, London, UK, 11–15 September 2005; pp. 34–41. [Google Scholar]
Wyse, L. Audio spectrogram representations for processing with convolutional neural networks. arXiv2017, arXiv:1706.09559. [Google Scholar]
Rubin, J.; Abreu, R.; Ganguli, A.; Nelaturi, S.; Matei, I.; Sricharan, K. Classifying heart sound recordings using deep convolutional neural networks and mel-frequency cepstral coefficient. In Proceedings of the Computing in Cardiology (CinC), Vancouver, BC, Canada, 11–14 September 2016. [Google Scholar]
Nanni, L.; Costa, Y.M.G.; Brahnam, S. Set of texture descriptors for music genre classification. In Proceedings of the 22nd WSCG International Conference on Computer Graphics, Visualization and Computer Vision, Plzen, Czech Republic, 2–5 June 2014. [Google Scholar]
Haralick, R.M. Statistical and structural approaches to texture. Proc. IEEE1979, 67, 786–804. [Google Scholar] [CrossRef]
Ojansivu, V.; Heikkila, J. Blur insensitive texture classification using local phase quantization. In Proceedings of the ICISP, Cherbourg-Octeville, France, 1–3 July 2008. [Google Scholar]
Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell.2002, 24, 971–987. [Google Scholar] [CrossRef]
Brahnam, S.; Jain, L.C.; Lumini, A.; Nanni, L. (Eds.) Local Binary Patterns: New Variants and Applications; Springer: Berlin, Germany, 2014. [Google Scholar]
Costa, Y.M.G.; Oliveira, L.S.; Koerich, A.L.; Gouyon, F.; Martins, J.G. Music genre classification using LBP textural features. Signal Process.2012, 92, 2723–2737. [Google Scholar] [CrossRef][Green Version]
Costa, Y.M.G.; Oliveira, L.S.; Koerich, A.L.; Gouyon, F. Music genre recognition using spectrograms. In Proceedings of the 18th International Conference on Systems, Signals and Image Processing, Sarajevo, Bosnia and Herzegovina, 16–18 June 2011. [Google Scholar]
Costa, Y.M.G.; Oliveira, L.S.; Koerich, A.L.; Gouyon, F. Music genre recognition using gabor filters and LPQ texture descriptors. In Proceedings of the 18th Iberoamerican Congress on Pattern Recognition, Havana, Cuba, 20–23 November 2013. [Google Scholar]
Ren, Y.; Cheng, X. Review of convolutional neural network optimization and training in image processing. In Proceedings of the 10th International Symposium on Precision Engineering Measurements and Instrumentation (ISPEMI 2018), Kunming, China, 8–10 August 2018; SPIE: Bellingham, WA, USA, 2019. [Google Scholar]
Humphrey, E.; Bello, J.P. Rethinking automatic chord recognition with convolutional neural networks. In Proceedings of the International Conference on Machine Learning and Applications, Boca Raton, FL, USA, 12–15 December 2012. [Google Scholar]
Humphrey, E.; Bello, J.P.; LeCun, Y. Moving beyond feature design: Deep architectures and automatic feature learning in music informatics. In Proceedings of the International Conference on Music Information Retrieval, Porto, Portugal, 8–12 October 2012; pp. 403–408. [Google Scholar]
Nakashika, T.; Garcia, C.; Takiguchi, T. Local-feature-map integration using convolutional neural networks for music genre classification. In Proceedings of the Interspeech 2012 13th Annual Conference of the International Speech Communication Association, Portland, OR, USA, 9–13 September 2012; pp. 1752–1755. [Google Scholar]
Costa, Y.M.; Oliveira, L.S.; Silla, C.N., Jr. An evaluation of Convolutional Neural Networks for music classification using spectrograms. Appl. Soft Comput.2017, 52, 28–38. [Google Scholar] [CrossRef]
Sigtia, S.; Dixon, S. Improved music feature learning with deep neural networks. In Proceedings of the IEEE International Conference on Acoustic, Speech and Signal Processing, Florence, Italy, 4–9 May 2014. [Google Scholar]
Wang, C.Y.; Santoso, A.; Mathulaprangsan, S.; Chiang, C.C.; Wu, C.H.; Wang, J.C. Recognition and retrieval of sound events using sparse coding convolutional neural network. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Hong Kong, China, 10–14 July 2017. [Google Scholar]
Oramas, S.; Nieto, O.; Barbieri, F.; Serra, X. Multilabel music genre classification from audio, text and images using deep features. In Proceedings of the International Society for Music Information Retrieval (ISMR) Conference, Suzhou, China, 23–27 October 2017. [Google Scholar]
Kong, Q.; Xu, Y.; Sobieraj, I.; Wang, W.; Plumbley, M.D. Sound Event Detection and Tim Frequency Segmentation from Weakly Labelled Data. IEEE ACM Trans. Audio Speech Lang. Process.2019, 27, 777–787. [Google Scholar] [CrossRef]
Nanni, L.; Brahnam, S.; Lumini, A.; Barrier, T. Ensemble of local phase quantization variants with ternary encoding. In Local Binary Patterns: New Variants and Applications; Brahnam, S., Jain, L.C., Lumini, A., Nanni, L., Eds.; Springer: Berlin, Germany, 2014; pp. 177–188. [Google Scholar]
Cao, Z.; Principe, J.C.; Ouyang, B.; Dalgleish, F.; Vuorenkoski, A. Marine animal classification using combined CNN and hand-designed image features. In Proceedings of the MTS/IEEE Oceans, Washington, DC, USA, 19–22 October 2015. [Google Scholar]
Salamon, J.; Bello, J.P.; Farnsworth, A.; Kelling, S. Fusing sallow and deep learning for bioacoustic bird species. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017. [Google Scholar]
Acevedo, M.A.; Corrada-Bravo, C.J.; Corrada-Bravo, H.; Villanueva-Rivera, L.J.; Aide, T.M. Automated classification of bird and amphibian calls using machine learning: A comparison of methods. Ecol. Inform.2009, 4, 206–214. [Google Scholar] [CrossRef]
Fristrup, K.M.; Watkins, W.A. Marine Animal Sound Classification; WHOI Technical Reports; Woods Hole Oceanographic Institution: Woods Hole, MA, USA, 1993; Available online: https://hdl.handle.net/1912/546 (accessed on 30 October 2020).
Pandeya, Y.R.; Kim, D.; Lee, J. Domestic cat sound classification using learned features from deep neural nets. Appl. Sci.2018, 8, 1949. [Google Scholar] [CrossRef][Green Version]
Wang, A. An industrial strength audio search algorithm. In Proceedings of the ISMIR Proceedings, Baltimore, MD, USA, 26–30 October 2003. [Google Scholar]
Haitsma, J.; Kalker, T. A Highly Robust Audio Fingerprinting System. In Proceedings of the ISMIR, Paris, France, 13–17 October 2002. [Google Scholar]
Manocha, P.; Badlani, R.; Kumar, A.; Shah, A.; Elizalde, B.; Raj, B. Content-based representations of audio using siamese neural networks. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal. Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 3136–3140. [Google Scholar]
Droghini, D.; Vesperini, F.; Principi, E.; Squartini, S.; Piazza, F. Few-shot siamese neural networks employing audio features for human-fall detection. In Proceedings of the International Conference on Pattern Recognition and Artificial Intelligence, Union, NJ, USA, 15–17 August 2018. [Google Scholar] [CrossRef]
Zhang, Y.; Pardo, B.; Duan, Z. Siamese Style Convolutional Neural Networks for Sound Search by Vocal Imitation. IEEE/ACM Trans. Audio, Speech, Lang. Process.2018, 27, 429–441. [Google Scholar] [CrossRef]
Nannia, L.; Rigo, A.; Lumini, A.; Brahnam, S. Spectrogram Classification Using Dissimilarity Space. Appl. Sci.2020, 10, 4176. [Google Scholar] [CrossRef]
Agrawal, A. Dissimilarity learning via Siamese network predicts brain imaging data. arXiv2019, arXiv:1907.02591. [Google Scholar]
Bromley, J.; Bentz, J.W.; Bottou, L.; Guyon, I.; LeCun, Y.; Moore, C.; Säckinger, E.; Shah, R. Signature verification using a Siamese time delay neural network. Int. J. Pattern Recognit. Artif. Intell.1993, 7, 669–688. [Google Scholar] [CrossRef][Green Version]
Zhang, S.H.; Zhao, Z.; Xu, Z.Y.; Bellisario, K.; Pijanowski, B.C. Automatic bird vocalization identification based on fusion of spectral pattern and texture features. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal. Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 271–275. [Google Scholar]
Pandeya, Y.R.; Lee, J. Domestic Cat Sound Classification Using Transfer Learning. Int. J. Fuzzy Log. Intell. Syst.2018, 18, 154–160. [Google Scholar] [CrossRef][Green Version]
Biagio, M.S.; Crocco, M.; Cristani, M.; Martelli, S.; Murino, V. Heterogeneous auto-similarities of characteristics (hasc): Exploiting relational information for classification. In Proceedings of the IEEE Computer Vision (ICCV13), Sydney, Australia, 3–6 December 2013. [Google Scholar]
Piczak, K.J. ESC: Dataset for Environmental Sound Classification. In Proceedings of the 23rd ACM international conference on Multimedia, Brisbane, Australia, 26–30 October 2015. [Google Scholar] [CrossRef]
Vapnik, V. The support vector method. In Proceedings of the Artificial Neural Networks ICANN’97, Lausanne, Switzerland, 8–10 October 1997. [Google Scholar]
Chicco, D. Siamese neural networks: An overview. In Artificial Neural Networks. Methods in Molecular Biology; Cartwright, H., Ed.; Springer Protocols; Humana: New York, NY, USA, 2020; Volume 2190, pp. 73–94. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely
those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or
the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas,
methods, instructions or products referred to in the content.