Spectrogram Classification Using Dissimilarity Space

Abstract: In this work, we combine a Siamese neural network and different clustering techniques to generate a dissimilarity space that is then used to train an SVM for automated animal audio classification. The animal audio datasets used are (i) birds and (ii) cat sounds, which are freely available. We exploit different clustering methods to reduce the spectrograms in the dataset to a number of centroids that are used to generate the dissimilarity space through the Siamese network. Once computed, we use the dissimilarity space to generate a vector space representation of each pattern, which is then fed into a support vector machine (SVM) to classify a spectrogram by its dissimilarity vector. Our study shows that the proposed approach based on dissimilarity space performs well on both classification problems without ad-hoc optimization of the clustering methods. Moreover, results show that the fusion of CNN-based approaches applied to the animal audio classification problem works better than the stand-alone CNNs.


Introduction
Sound classification and recognition have been applied in different domains, e.g., speech recognition [1], music classification [2], environmental sound recognition, and biometric identification [3]. Traditionally, in pattern recognition problems, features have been extracted from the actual audio traces (e.g., Statistical Spectrum Descriptor and Rhythm Histogram [4]). However, by replacing audio traces with their visual representation, image classification techniques can be used to extract features in sound classification problems. The most commonly used visual representations of audio traces display their frequency spectrum as it varies in time, as in spectrograms [5] and Mel-frequency Cepstral Coefficients spectrograms [6]. A spectrogram can be described as a graph with two dimensions (time and frequency) plus a third dimension in terms of pixel intensity [7] that represents the signal amplitude at a specific frequency and a particular time step. Costa et al. [8,9] applied several classification and texture analysis techniques to music genre classification using such a method. In [9], the authors extracted grey level co-occurrence matrices (GLCMs) [10] from spectrograms, while in [8] they used the local binary pattern (LBP) [11], which is a popular texture descriptor. In [12], two other feature descriptors were extracted from audio images: local phase quantization (LPQ) and Gabor filters [13]. In 2017, Nanni et al. [2] demonstrated on multiple audio datasets how the fusion of acoustic features extracted from audio traces using state-of-the-art texture descriptors greatly improves the accuracy of acoustic and visual feature-based systems.
When deep learning became popular and Graphics Processing Units (GPUs) became more powerful at accessible costs, traditional pattern recognition changed, and attention focused [...] even audio datasets) and cannot generalize to new classes without retraining the network. The objective of this work is to solve these issues by proposing an approach based on dissimilarity spaces. Recently, Agrawal [37] proposed an approach that learns a distance model by training a Siamese neural network directly on dissimilarity values for brain image classification, and in [38] an approach is proposed for online signature verification using a Siamese neural network and a contrastive loss function. In the latter work, the authors claim that the main advantage a Siamese network offers over a canonical CNN is the ability to generalize: the Siamese network approach they developed was shown to verify the authenticity of the signature of a new user without being trained on any examples from this user.
In this work, the dissimilarity space is created using a Siamese Neural Network (SNN) trained on the entire training set to define a distance function among the samples. The training phase of the SNN aims to maximize the distance between patterns of different classes; in the testing phase, the SNN is used to compare two spectrograms and obtain a measure of their dissimilarity. In theory, all the training samples can be selected as centroids of the dissimilarity space. Dimensionality reduction is obtained by selecting a smaller number (k) of prototypes via a clustering approach. The dissimilarity space is the space where each spectrogram is represented by its distance to each centroid/prototype: the SNN is used to compare the spectrogram to every centroid, obtaining the spectrogram's dissimilarity vector, which is the final descriptor. The classification task is performed by a support vector machine (SVM) trained using the dissimilarity descriptors generated from the training samples. The proposed system is evaluated on two different datasets for animal audio classification: domestic cat sounds [27] and bird sounds [23]. Results for the different clustering methods and different values of the hyperparameter k are reported.
In addition, SVMs trained on different dissimilarity spaces (obtained by changing the value of k) are combined by sum rule, and the performance of the resulting ensemble is compared with (i) some canonical CNN approaches and (ii) the fusion of the SVMs and the CNNs. Experiments demonstrate for the first time that the use of dissimilarity spaces based on SNN is a feasible representation for image data and can, when combined with a general-purpose classifier, achieve high classification performance. Because the descriptors obtained in the dissimilarity space show high diversity with respect to the representations based on CNNs, their fusion can be exploited in an ensemble, as shown by the high classification accuracy obtained by the fusion of CNNs with our approach. The MATLAB code used in this study is freely available at https://github.com/LorisNanni.

Proposed Approach
The proposed method for spectrogram classification using dissimilarity space is based on several steps, which are schematized in Figure 1. This figure is followed by the pseudo-code for each step (Algorithms 1 and 2). In order to define a dissimilarity space, it is necessary to select a distance measure and a set of prototypes in the training phase. The distance measure $d(x, y)$ is learned by means of an SNN trained to maximize the similarity between pairs of spectrograms in the same class, while minimizing the similarity for pairs in different classes. The set of prototypes $P = \{p_1, \ldots, p_k\}$ is obtained as the k centroids of the clusters generated by a supervised clustering procedure. The final step represents each training sample x in the dissimilarity space by a feature vector $f \in \mathbb{R}^k$, where each component $f_i$ is the distance between x and the prototype $p_i$: $f_i = d(x, p_i)$. These feature vectors are used to train an SVM for the final classification task. In the testing phase, each unlabeled spectrogram is first represented in the dissimilarity space by calculating its distance to all the prototypes; the resulting feature vector is then classified by the SVM. Each of the main functions used in the pseudo-code is described below.

Siamese Neural Network Training
The SNN, described in more detail in Section 3, is trained to compare a pair of spectrograms by returning a measure of their similarity. Algorithm 3 presents the pseudo-code for this phase and corresponds to step 1 of Algorithm 1. The SNN architecture is defined in steps 2 and 3 of Algorithm 3. Steps 5-8 are repeated for each training iteration.
Step 5 randomly extracts batchSize spectrogram pairs from the training set using the function GETSIAMESEBATCH.
Step 6 feeds the pairs to the network and computes the loss and the gradients for gradient descent. Steps 7 and 8 use the gradients and loss to update the weights of the fully connected layer and the twin subnetworks.
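The pair-sampling step can be illustrated with a minimal Python sketch. This is not the authors' GETSIAMESEBATCH: the function name `get_siamese_batch`, the even split between same-class and different-class pairs, and the toy labels are all assumptions made for illustration.

```python
import random

def get_siamese_batch(labels, batch_size, rng):
    """Return (index_a, index_b, target) triples; target is 1 for same-class pairs, else 0."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    classes = list(by_class)
    # Only classes with at least two samples can supply a same-class pair.
    pairable = [c for c in classes if len(by_class[c]) >= 2]
    batch = []
    for _ in range(batch_size):
        if rng.random() < 0.5:  # assumed 50/50 split between pair types
            c = rng.choice(pairable)
            a, b = rng.sample(by_class[c], 2)  # two distinct samples of one class
            batch.append((a, b, 1))
        else:
            c1, c2 = rng.sample(classes, 2)  # two distinct classes
            batch.append((rng.choice(by_class[c1]), rng.choice(by_class[c2]), 0))
    return batch

# Toy labels standing in for the spectrogram classes of a training set.
labels = ["cat", "cat", "bird", "bird", "dog", "dog"]
batch = get_siamese_batch(labels, batch_size=4, rng=random.Random(0))
```

Each triple carries the target used by the loss: 1 when the two spectrograms share a class and 0 otherwise.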

Prototype Selection
In this phase, k prototypes are extracted from the training set. In theory, every spectrogram in the training set could be selected as a prototype, but this would be computationally too expensive, and the dimensionality of the generated dissimilarity vectors would be too high. A better alternative is to employ clustering techniques to compute k centroids for each class. Clustering significantly reduces the dimension of the resulting dissimilarity space and thus makes the process more viable. Algorithm 4 presents the pseudo-code for prototype selection, which offers a choice among four clustering procedures, each applied separately to the training samples of each class.
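As a rough sketch of per-class prototype selection, the Python below groups the samples by class, clusters each class separately, and concatenates the resulting centers. The names `select_prototypes` and `strided_means` are hypothetical, and `strided_means` is only a placeholder for a real clustering routine, not one of the four procedures used in the paper.

```python
def select_prototypes(samples, labels, kc, cluster_fn):
    """Cluster each class separately and concatenate the kc centers found per class."""
    prototypes = []
    for label in sorted(set(labels)):
        members = [s for s, l in zip(samples, labels) if l == label]
        prototypes.extend(cluster_fn(members, kc))
    return prototypes

def strided_means(members, kc):
    """Placeholder clustering (assumption): mean of kc interleaved groups of samples."""
    groups = [members[i::kc] for i in range(kc)]
    return [[sum(col) / len(g) for col in zip(*g)] for g in groups if g]

# Toy one-dimensional samples for two classes.
samples = [[0.0], [2.0], [4.0], [6.0], [10.0], [12.0], [14.0], [16.0]]
labels = ["cat"] * 4 + ["bird"] * 4
prototypes = select_prototypes(samples, labels, kc=2, cluster_fn=strided_means)
```

With kc centers per class, the dissimilarity space has kc times the number of classes dimensions, which is the quantity the clustering step keeps small.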

Projection in the Dissimilarity Space
Existing classification methods learn to classify patterns using their feature space. In this work, patterns are represented in a dissimilarity space in which every pattern x is represented by its dissimilarity to a selected set of prototypes $P = \{p_1, \ldots, p_k\}$, i.e., by a dissimilarity vector:

$$f = [d(x, p_1), d(x, p_2), \ldots, d(x, p_k)]$$

where the dissimilarity $d(x, y)$ between two patterns is obtained using a trained SNN. In order to project each image into the dissimilarity space $\mathbb{R}^k$, Algorithm 5 compares each input image (stored in X in step 3) with the k centroids (stored in P) using the trained SNN tSNN with the PREDICTSIAMESE function (step 4). The resulting feature space F includes the projected features of all the input images.
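The projection can be sketched in a few lines of Python. The function `project` is a hypothetical name, and the Euclidean distance is only a stand-in for the trained SNN (the role played by PREDICTSIAMESE in the text), chosen so the example runs on its own.

```python
import math

def project(samples, prototypes, dissimilarity):
    """Map each sample to its vector of dissimilarities to the k prototypes."""
    return [[dissimilarity(x, p) for p in prototypes] for x in samples]

def euclidean(x, p):
    """Stand-in for the learned distance d(x, p); the paper uses a trained SNN instead."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, p)))

samples = [[0.0, 0.0], [3.0, 4.0]]
prototypes = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
F = project(samples, prototypes, euclidean)  # one row per sample, one column per prototype
```

Whatever the dimensionality of the input, each row of `F` has length k, so the classifier that follows only ever sees k-dimensional vectors.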

Support Vector Machine Training and Prediction
A Support Vector Machine (SVM) is a supervised learning model which can be used to perform classification or regression. An SVM model represents each training example as a data point in space and is trained to construct one or more hyperplanes that divide the space in two, separating data points belonging to different classes (function TRAINSVM). The model will predict (function PREDICTSVM) the class of a new pattern mapped in the space according to the side of the hyperplane the data point falls into. The hyperplane found by an SVM is defined as follows:

$$D(x) = w \cdot x + b$$

where $D(x)$ is the hyperplane, x is the data point vector, w is the hyperplane's normal vector, and the ratio $b / \|w\|$ is the hyperplane's distance from the origin. The optimal hyperplane is the one that maximizes the distance to the nearest data point of any class, defined as $2 / \|w\|$, which is also called the margin. The i-th point $x_i$ will be assigned to the first class when $D(x_i) \geq +1$ and to the second class when $D(x_i) \leq -1$. The points that lie on the margin line, defined by the equation $D(x_i) = \pm 1$, completely describe the solution to the problem and are called support vectors. An example of an optimal hyperplane with highlighted support vectors is shown in Figure 2. Because SVMs use hyperplanes to discriminate data, they do not work well with data that is not linearly separable in its original space. This problem can be solved using kernel functions, which map data into a much higher dimensional space, presumably to make the separation easier in that space. To keep the computational complexity to an acceptable level, the kernel function of choice has to be computationally efficient.
Being binary classifiers, SVMs can only determine the separation surface between two classes of data; however, it is possible to apply SVMs to multi-class problems by training an ensemble of SVMs and combining them. In this work, the One-Against-All approach is used, where for each class an SVM is trained to discriminate between that class and all the other classes put together. The pattern is then assigned to the class that gives the highest confidence score.
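The Python sketch below illustrates a linear decision function of the form D(x) = w · x + b, the margin 2/||w||, and the One-Against-All rule. It only evaluates hyperplanes that are already given: the weights in `models` are hand-set, hypothetical values rather than the output of any training procedure, and the function names are illustrative.

```python
import math

def decision(w, b, x):
    """Signed score D(x) = w . x + b of point x with respect to one hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin(w):
    """Margin width 2 / ||w|| of a hyperplane with normal vector w."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def predict_one_vs_all(models, x):
    """Pick the class whose binary SVM returns the highest score."""
    scores = {label: decision(w, b, x) for label, (w, b) in models.items()}
    return max(scores, key=scores.get), scores

# Hypothetical per-class hyperplanes (w, b), one binary classifier per class.
models = {
    "cat": ([1.0, 0.0], -0.5),
    "bird": ([0.0, 1.0], -0.5),
    "other": ([-1.0, -1.0], 0.5),
}
label, scores = predict_one_vs_all(models, [0.9, 0.2])
```

In a kernel SVM the score is computed through the kernel rather than an explicit dot product, but the One-Against-All decision is taken in the same way.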

Siamese Neural Network
The Siamese Neural Network (SNN) is a class of neural network architectures that contains two or more twins, i.e., sub-networks with the same parameters and weights. SNNs are used in tasks involving similarity or in identifying correlations between different entities. The SNN was first proposed by Bromley et al. [39] for performing signature verification. SNNs have since been used successfully in other application domains, such as face verification [40], image recognition [41], human fall detection [42], content-based audio representation [34], and sound search by vocal imitation [43]. The SNN architecture used in this work is similar to the one used in [43] and is represented in Figure 3. As shown in Figure 3, the SNN used in this work is composed of five blocks:

• Two identical twin subnetworks
The twin subnetworks in our SNN are two Convolutional Neural Networks composed of 13 layers, as listed in Table 1. These subnetworks learn the features best representing the spectrograms in the input (X1 and X2), returning a 4096-dimensional feature vector for each (F1 and F2). The subnetworks share parameters and weights, which are mirrored during the training.

• Subtract block
The output vectors of the subnetworks are subtracted, resulting in a feature vector Y representing the features in which the images differ:

$$Y = |F_1 - F_2|$$

• Fully Connected Layer
As in [37], the Fully Connected Layer (FCL) learns the distance model to calculate the dissimilarity. The output vector of the subtract block is fed to the FCL which returns a dissimilarity value for the pair of spectrograms in the input.

• Sigmoid
Sigmoid functions are a class of real functions with a characteristic S-shaped curve. We apply the sigmoid to the dissimilarity value returned by the FCL to convert it to a probability value in the range [0, 1], using the standard logistic function:

$$S(x) = \frac{1}{1 + e^{-x}}$$

• Binary Cross Entropy
The Binary Cross Entropy (BCE) is a popular loss function which, given the prediction of the model and the correct observation label (in our case, 1 if the two spectrograms belong to the same class, 0 otherwise), returns a measure of the performance of the model. Loss functions are used by learning algorithms to train the network by adjusting the weights. BCE is applied to the probability obtained from the sigmoid; the gradients of the loss with respect to the weights of the network are then computed in order to adjust them. In a two-class problem, BCE can be calculated as:

$$BCE = -\left(y \log(p) + (1 - y) \log(1 - p)\right)$$

where y is the binary value that indicates whether the class label c is correct for the observation o, p is the predicted probability that observation o is of class c, and log is the natural logarithm.
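The last three blocks can be followed numerically with the small Python sketch below. It assumes that the subtract block takes the element-wise absolute difference of the two feature vectors and that the fully connected layer is a single weighted sum plus bias; the function names, weights, and feature values are hypothetical.

```python
import math

def sigmoid(z):
    """Standard logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def bce(y, p, eps=1e-12):
    """Binary cross entropy for label y in {0, 1} and predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clamp so that log never receives 0
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def siamese_head(f1, f2, weights, bias):
    """Subtract block -> fully connected layer -> sigmoid, for one pair of feature vectors."""
    diff = [abs(a - b) for a, b in zip(f1, f2)]  # assumed element-wise absolute difference
    z = sum(w * d for w, d in zip(weights, diff)) + bias
    return sigmoid(z)

# Hypothetical 3-dimensional features (the text uses 4096 dimensions).
p = siamese_head([0.2, 0.9, 0.4], [0.1, 0.3, 0.4], weights=[1.5, -2.0, 0.7], bias=0.1)
loss_same = bce(1, p)  # loss if the pair is labeled as the same class
loss_diff = bce(0, p)  # loss if the pair is labeled as different classes
```

The loss grows as the predicted probability moves away from the pair's label, which is the signal the training loop uses to adjust the weights.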

Clustering
Clustering is the task of organizing data in groups (Figure 4) so that patterns in the same cluster are more similar to each other than they are to patterns belonging to other clusters. Clustering is often used to find natural clusters in unlabeled data. Some clustering techniques calculate centroids during the process. A centroid is the mean vector of all the patterns in a cluster. Because it is a mean vector, it contains the most characterizing features of a cluster's patterns. Centroids are computed to reduce the dissimilarity space size without losing too much information. The greater the number of centroids used for each class, the more information is retained. In this work, samples are divided into classes before clustering, and the clustering procedure is applied to each class separately. The remainder of this section describes the four clustering techniques used in this study.

K-Means
K-means is a popular clustering algorithm that partitions a set of patterns into k clusters by assigning each observation to the cluster with the nearest centroid, or mean vector. There are several versions of this algorithm. In this study, the default implementation (with the Euclidean distance metric) in the MATLAB Statistics and Machine Learning Toolbox was applied. The standard k-means algorithm cycles through the following steps:
1. Choose k initial cluster centers (centroids) according to the k-means++ variation detailed below.
2. Compute point-to-cluster-centroid distances of all observations to each centroid.
3. Assign each observation to the cluster with the closest centroid.
4. Compute the average of the observations in each cluster to obtain k new centroids.
5. Repeat steps 2 through 4 until cluster assignments no longer change (i.e., until the algorithm converges) or until the maximum number of iterations is reached.
The k-means++ variation [44] employs a heuristic to find the initial centroids:
1. Choose one center uniformly at random from among the data points.
2. For each data point x, compute $d(x)$, the distance between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to $d(x)^2$.
4. Repeat steps 2 and 3 until k centers have been chosen.
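The two procedures above can be sketched in plain Python as follows. This is an illustrative implementation under simple assumptions (squared Euclidean distance, list-of-lists data, and an empty cluster keeping its previous center), not the MATLAB routine used in the study; `kmeans_pp_init` and `kmeans` are hypothetical names.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: later centers are drawn with probability proportional to d(x)^2."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:  # every point coincides with a chosen center
            centers.append(list(rng.choice(points)))
        else:
            centers.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return centers

def kmeans(points, k, rng, max_iter=100):
    """Alternate assignment and mean update until the assignments stop changing."""
    centers = kmeans_pp_init(points, k, rng)
    assign = None
    for _ in range(max_iter):
        new_assign = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        if new_assign == assign:
            break
        assign = new_assign
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign

# Two well-separated toy groups.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, assign = kmeans(points, k=2, rng=random.Random(0))
```

Because the seeding favors points far from the centers already chosen, the two initial centers usually land in different groups, and the loop then converges in very few iterations.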

K-Medoids
K-medoids is a clustering technique very similar to k-means. It partitions a set of observations into k clusters by minimizing the sum of distances between a pattern and the center of that pattern's cluster. The main difference between k-means and k-medoids is that, in the former, the center of a cluster is its centroid, or mean, whereas, in the latter, the center is a member, or medoid, of the cluster. A medoid is an observation in a cluster whose sum of distances from the other observations within the cluster is minimal. The basic algorithm for k-medoids loops through the following three steps:
1. Build-step: each of the k clusters is associated with a potential medoid. The first assignment can be performed in various ways; MATLAB's standard implementation uses the k-means++ heuristic.
2. Swap-step: within each cluster, each point is tested as a potential medoid by checking whether the sum of the within-cluster distances gets smaller using that particular point as the medoid. If so, the point is defined as a new medoid. Every point is then assigned to the cluster with the closest medoid.
3. Repeat steps 1 and 2 until medoids no longer swap (i.e., until the algorithm converges) or until the maximum number of iterations is reached.
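A simplified k-medoids loop can be written in Python as shown below. It alternates assignment and medoid update, which is a common textbook variant and not necessarily the algorithm MATLAB runs; the initial medoids are passed in by the caller rather than chosen with k-means++, and the function names are hypothetical.

```python
def manhattan(a, b):
    """City-block distance; any dissimilarity function could be passed instead."""
    return sum(abs(x - y) for x, y in zip(a, b))

def k_medoids(points, initial_medoids, dist, max_iter=100):
    """Return (medoid indices, cluster assignment) for a simple alternating k-medoids."""
    medoids = list(initial_medoids)

    def assign_all(current):
        # Each point joins the cluster of its closest medoid.
        return [min(range(len(current)), key=lambda j: dist(p, points[current[j]]))
                for p in points]

    for _ in range(max_iter):
        assign = assign_all(medoids)
        new_medoids = []
        for j, old in enumerate(medoids):
            members = [i for i, a in enumerate(assign) if a == j]
            if not members:  # an empty cluster keeps its previous medoid
                new_medoids.append(old)
                continue
            # The medoid is the member with the smallest total distance to the others.
            new_medoids.append(
                min(members, key=lambda i: sum(dist(points[i], points[m]) for m in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign_all(medoids)

points = [[0], [1], [2], [10], [11], [12]]
medoids, assign = k_medoids(points, initial_medoids=[0, 1], dist=manhattan)
```

Unlike a centroid, each returned medoid is the index of an actual sample, so in this setting a prototype would be a real spectrogram rather than an average of several.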

Hierarchical
Hierarchical clustering is a clustering technique that groups data by building a hierarchy of clusters. The hierarchy tree that is obtained is divided into n levels chosen for the application at hand. There are two main categories of hierarchical clustering:
• Agglomerative: each pattern starts in its own cluster; then, moving up the hierarchy, each cluster in one level is obtained by merging two clusters in the previous level.
• Divisive: all patterns start in one cluster; then, by moving down the hierarchy, each pair of clusters is obtained by splitting a single cluster in the previous level.
In this work, the default MATLAB implementation of hierarchical clustering is used, which is the agglomerative type. The MATLAB algorithm loops through the following three steps:
1. Find the similarity or dissimilarity between every pair of objects in the dataset using a distance metric.
2. Group the objects into a binary hierarchical cluster tree by linking objects in pairs based on their distance. As objects are paired into binary clusters, the newly formed clusters are grouped into larger clusters until a hierarchical tree is formed.
3. Determine where to cut the hierarchical tree into clusters. Here, MATLAB's cluster function is used to prune branches off the bottom of the hierarchical tree and to assign all the objects below each cut to a single cluster. In this way, k clusters are obtained.
After applying this algorithm, the centroids are computed as the mean vectors of the resulting clusters.
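A naive agglomerative procedure followed by the centroid computation can be sketched in Python as below. Single linkage (the distance between the closest pair of members) is assumed here for simplicity, merging stops once k clusters remain instead of cutting a stored tree, and the function names are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(points, k, dist):
    """Merge the two closest clusters (single linkage) until k clusters remain."""
    clusters = [[i] for i in range(len(points))]  # every point starts in its own cluster
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

def centroids(points, clusters):
    """Mean vector of the points in each cluster."""
    return [[sum(col) / len(c) for col in zip(*(points[i] for i in c))] for c in clusters]

points = [[0, 0], [0, 1], [5, 5], [5, 6], [20, 20]]
clusters = agglomerative(points, k=3, dist=euclidean)
centers = centroids(points, clusters)
```

This version recomputes all cluster distances at every merge, which is fine for a toy example but far slower than a library implementation on real data.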

Spectral
The spectral clustering technique splits data into groups using the data's undirected similarity graph, represented by a similarity matrix (also called an adjacency matrix). In the similarity graph, every node is an observation, and two nodes are connected by an edge if their similarity is larger than a certain threshold, which is often 0. The algorithm uses the following matrices:
• Similarity Matrix: a square symmetrical matrix that represents the similarity graph. Letting M be the similarity matrix, each cell value $m_{ij}$ is the similarity value of two connected nodes in the graph, which, in turn, represent the spectrogram pair $(s_i, s_j)$.
• Degree Matrix: a diagonal matrix obtained by summing the similarity matrix rows. The degree matrix is defined by the equation

$$D_g(i, i) = \sum_j m_{ij}$$

where $D_g$ is the degree matrix, and $m_{ij}$ is a value of the similarity matrix.
• Laplacian Matrix: another way of representing the similarity graph, defined as

$$L = D_g - M$$

Here are the steps required by the spectral algorithm:
• For each spectrogram in the dataset, define a local neighborhood. There are different ways such a neighborhood can be defined. The MATLAB implementation defaults to the nearest-neighbor method. Once the neighborhood is defined, compute the pairwise similarities of each spectrogram in the neighborhood using some distance metric.
• Calculate the Laplacian matrix L.
• Create a matrix V containing columns $v_1, \ldots, v_k$, where the columns are the k eigenvectors that correspond to the k smallest eigenvalues of the Laplacian matrix. The set of eigenvalues of a matrix is also called its spectrum, hence the algorithm's name.
• Treating each row of V as a pattern, perform k-means clustering or k-medoids clustering.
• Assign the original spectrograms in the dataset to the same clusters as their corresponding rows in V.
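The construction of the degree and Laplacian matrices can be checked on a small example. The Python sketch below assumes the unnormalized Laplacian (degree matrix minus similarity matrix) and a hand-written 3 x 3 similarity matrix; it stops before the eigenvector step, which would normally be delegated to a numerical library.

```python
def degree_and_laplacian(M):
    """Return the diagonal degree matrix and the unnormalized Laplacian of similarity matrix M."""
    n = len(M)
    degrees = [sum(row) for row in M]  # each diagonal entry is the sum of one row of M
    D = [[degrees[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    L = [[D[i][j] - M[i][j] for j in range(n)] for i in range(n)]
    return D, L

# Toy symmetric similarity matrix for three observations (zero diagonal).
M = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 2.0],
    [0.0, 2.0, 0.0],
]
D, L = degree_and_laplacian(M)
```

Every row of this Laplacian sums to zero, so the constant vector is an eigenvector with eigenvalue 0; the eigenvectors of the smallest eigenvalues are the ones that carry the cluster structure used in the later steps.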

Experimental Results
The approach proposed in this paper is tested, along with some canonical approaches for comparison, using a stratified ten-fold cross-validation protocol with classification accuracy as the performance indicator. Tests were performed on two datasets:
• BIRDZ, which was also used as a control and as a real-world audio dataset in [23]. The real-world tracks were obtained from the Xeno-canto Archive (http://www.xeno-canto.org/) and cover 11 widespread North American bird species. Thus, the dataset contains 11 classes: (1) Blue Jay, [...]. BIRDZ is composed of five types of spectrograms: constant frequency, frequency modulated whistles, broadband pulses, broadband with varying frequency components, and strong harmonics, for a total of 2762 bird acoustic events with 339 detected "unknown" events corresponding to noise and other unknown species vocalizations. Including the "unknown" class, BIRDZ has 3101 samples in 12 classes.
• CAT, which was first presented in [27,45]. This dataset is composed of 10 [...]
In Tables 2 and 3, the performance of the four tested clustering algorithms is reported using different values of kc (i.e., the number of clusters per class). As a baseline for comparison, the classification accuracy is also reported for the following well-known CNN models, each fine-tuned on the problem (for 30 epochs, using a batch size of 30 and a learning rate of 0.0001, with no freezing):
• GoogleNet [46], VGG16 and VGG19 [47], all pretrained on ImageNet [48];
• GoogleNetP365, a GoogleNet model pretrained on Places365 [49].
Moreover, in Tables 2 and 3, the accuracy obtained by the following fusion approaches is reported: [...]
From the results reported in Tables 2 and 3, the following conclusions can be drawn:
1. KAll outperforms each stand-alone method based on a single value of kc;
2. ALL outperforms each KAll in both datasets;
3. Performance of ALL is similar to that obtained by GoogleNet;
4. The ensemble ALL based on our dissimilarity space is a feasible representation for spectrograms and achieves a performance that is comparable to the CNNs;
5. In both datasets, the best performance is obtained by ALL+eCNN (even though the improvement with respect to eCNN is negligible);
6. ALL+GoogleNet strongly outperforms ALL and GoogleNet; this light ensemble, which uses only one CNN, is our recommended method.
The proposed approach based on the representation of animal sound in a dissimilarity space has two main advantages: (1) it produces a compact representation of the signal (with a descriptor length ranging from 15 to 60 for a single space, depending on the number of clusters, up to 150 for the KAll ensemble); (2) it generates a high diversity of classification results with respect to the baseline CNNs, which can be exploited to improve the performance in an ensemble method (i.e., ALL+GoogleNet).
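Fusion by sum rule can be illustrated with the short Python sketch below. The per-class scores are made-up numbers, and the sketch assumes that the scores of the different classifiers are already on a comparable scale; any normalization applied before fusion is outside its scope.

```python
def sum_rule(score_sets):
    """Add the per-class scores of several classifiers and return (winning class, fused scores)."""
    n_classes = len(score_sets[0])
    fused = [sum(scores[c] for scores in score_sets) for c in range(n_classes)]
    winner = max(range(n_classes), key=fused.__getitem__)
    return winner, fused

# Hypothetical scores for one test pattern over three classes.
svm_scores = [0.2, 0.5, 0.3]  # alone, the dissimilarity-space SVM would pick class 1
cnn_scores = [0.1, 0.3, 0.6]  # alone, the CNN would pick class 2
winner, fused = sum_rule([svm_scores, cnn_scores])
```

The fusion can only change a decision when the classifiers disagree, which is why the diversity between the dissimilarity-space descriptors and the CNNs matters for the ensemble.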
In Table 4, the ensembles proposed in this work are shown to achieve a performance on the two datasets that is similar to some of the state-of-the-art approaches reported in the literature. Two results are taken from [27], and are labeled [27] and [27]-CNN.
Unfortunately, most published papers in the field of acoustic animal classification focus only on a single dataset. The authors of this paper are aware that evaluating the proposed approach on two different datasets instead of focusing on just one limits the strength of the conclusions drawn. Be that as it may, the experiments reported here demonstrate the robustness of the proposed approach, which obtains good classification accuracy on two different problems without any ad-hoc parameter optimization and according to a clear and unambiguous testing protocol. As a result, the performances reported in this paper can be used for baseline comparisons with other audio classification methods developed in the future.
[...] [51] are based on a feature selection approach where the number of selected features is a hyperparameter selected on that dataset; the approach presented here has no hyperparameters selected on a given dataset.

Conclusions
In this work, a method using dissimilarity space is presented that achieves competitive results in automated audio classification of animal sounds (bird and cat sounds). Different types of clustering techniques to obtain centroids for dissimilarity space generation were tested and compared. A set of SVMs was trained on the dissimilarity spaces generated using four clustering techniques and different numbers of centroids. These SVMs were then combined by sum rule to obtain a high performing ensemble.
Moreover, it is shown that the method presented here can be fused with other state-of-the-art approaches to improve classification accuracy: the fusions of the proposed ensemble of SVMs with these approaches improved performance on the two audio classification problems and outperformed the stand-alone approaches.
In the future, this study will be further developed by including other sound classification problems, e.g., those cited in [26,37], in order to obtain a more comprehensive validation of the proposed approach. The plan is also to test the proposed method on some image classification problems using additional supervised and unsupervised clustering techniques.