# Animal Sound Classification Using Dissimilarity Spaces

(*Applied Sciences*: Invited Papers in Computing and Artificial Intelligence Section)

## Abstract

## 1. Introduction

## 2. Proposed System

Algorithm 1. Training phase

Input: Training images (imgsTrain), training labels (labelTrain), number of training iterations (trainIterations), batch size (trainBatchSize), number of centroids (k), and clustering technique (type).

Output: Trained SNN (tSNN), set of centroids (P), and trained SVM (tSVM).

1: tSNN ← trainSiamese(imgsTrain, labelTrain, trainIterations, trainBatchSize)

2: P ← Clustering(imgsTrain, labelTrain, k, type)

3: F ← getDissSpaceProjection(imgsTrain, P, tSNN)

4: tSVM ← trainSvm(labelTrain, F)

Algorithm 2. Testing phase

Input: Test images (imgsTest), trained SNN (tSNN), set of centroids (P), and trained SVM (tSVM).

Output: Predicted test labels (labelTest).

1: F ← getDissSpaceProjection(imgsTest, P, tSNN)

2: labelTest ← predictSvm(F, tSVM)

#### 2.1. Siamese Neural Network Training

Algorithm 3. Siamese training pseudocode

Input: Training images (trainImgs), training labels (trainLabels), batch size (batchSize), and number of iterations (numberOfIterations).

Output: Trained SNN (tSNN).

1: function trainSiamese

2: subnet ← NETWORK([inputLayer, ..., FullyConnectedLayer])

3: fcWeights ← randomWeights

4: for iteration from 1 to numberOfIterations do

5: X1, X2, pairLabels ← getBatch(trainImgs, trainLabels, batchSize)

6: gradients, loss ← Evaluate(subnet, X1, X2, pairLabels)

7: Update(subnet, gradients)

8: Update(fcWeights, gradients)

9: end for

10: return tSNN ← subnet, fcWeights

11: end function

Note: if the SNN fails to converge on the training set, the training phase is repeated.
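Step 5 of Algorithm 3 draws a batch of image pairs together with a label for each pair. The following is a minimal sketch of such a sampler, not the authors' implementation: the name `get_batch`, the alternation between similar and dissimilar pairs, and the convention that a label of 1 means "same class" and 0 means "different class" are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def get_batch(train_imgs, train_labels, batch_size, rng=random):
    """Return (X1, X2, pair_labels) for one Siamese training batch.

    Assumed convention: pair_labels[i] is 1 when X1[i] and X2[i] share a
    class and 0 otherwise. Even positions hold similar pairs, odd positions
    hold dissimilar pairs.
    """
    by_class = defaultdict(list)
    for img, label in zip(train_imgs, train_labels):
        by_class[label].append(img)
    classes = list(by_class)
    if len(classes) < 2:
        raise ValueError("need at least two classes to build dissimilar pairs")

    x1, x2, pair_labels = [], [], []
    for i in range(batch_size):
        if i % 2 == 0:
            # Similar pair: two draws from the same class.
            c = rng.choice(classes)
            a, b = rng.choice(by_class[c]), rng.choice(by_class[c])
            pair_labels.append(1)
        else:
            # Dissimilar pair: one draw from each of two distinct classes.
            c1, c2 = rng.sample(classes, 2)
            a, b = rng.choice(by_class[c1]), rng.choice(by_class[c2])
            pair_labels.append(0)
        x1.append(a)
        x2.append(b)
    return x1, x2, pair_labels
```

Passing a seeded `random.Random` instance as `rng` makes the batches reproducible, which helps when a non-converging run has to be repeated and compared.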

#### 2.2. Prototype Selection

Algorithm 4. Clustering pseudocode

Input: Training images (imgsTrain), training labels (labelTrain), number of clusters (k), and clustering technique (type).

Output: Centroids P.

1: function Clustering

2: numClasses ← number of classes from labelTrain

3: kc ← k/numClasses

4: for i from 1 to numClasses do

5: imgs ← images of class i from imgsTrain

6: switch type do

7: case “k-means” P_{i} ← KMeans(imgs, kc)

8: case “k-medoids” P_{i} ← KMedoids(imgs, kc)

9: case “hierarchical” P_{i} ← Hierarchical(imgs, kc)

10: case “spectral” P_{i} ← Spectral(imgs, kc)

11: P ← P ∪ P_{i}

12: end for

13: return P

14: end function
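The per-class loop of Algorithm 4 can be sketched as follows. This is an illustrative outline only: `select_prototypes` and `first_kc` are hypothetical names, integer division is assumed for `kc`, and the `cluster_fn` argument stands in for whichever of the four clustering techniques is selected.

```python
from collections import defaultdict

def select_prototypes(samples, labels, k, cluster_fn):
    """Cluster each class separately and pool the resulting prototypes.

    cluster_fn(class_samples, kc) must return a list of kc prototypes; it
    stands in for KMeans, KMedoids, Hierarchical, or Spectral.
    """
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    kc = k // len(by_class)  # prototypes per class (integer division assumed)
    if kc < 1:
        raise ValueError("k must be at least the number of classes")

    prototypes = []
    for label in sorted(by_class):
        prototypes.extend(cluster_fn(by_class[label], kc))
    return prototypes

def first_kc(class_samples, kc):
    """Trivial stand-in clusterer: take the first kc samples as prototypes."""
    return class_samples[:kc]

P = select_prototypes([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"], 4, first_kc)
```

With k = 4 and two classes, each class contributes two prototypes, so `P` holds four entries in total.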

#### 2.3. Projection in the Dissimilarity Space

Algorithm 5. Projection in the dissimilarity space pseudocode

Input: Images (imgs), centroids (P), number of centroids (k), and trained SNN (tSNN).

Output: Feature vectors (F).

1: function getDissSpaceProjection

2: for j from 1 to SIZE(imgs) do

3: X ← imgs[j]

4: F[j] ← predictSiamese(tSNN, X, P)

5: end for

6: return F

7: end function
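To make the output of Algorithm 5 concrete, the sketch below builds one feature vector per sample, with one entry per prototype. It is a simplified illustration: the Euclidean distance is only a stand-in for the dissimilarity predicted by the trained SNN, and the function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Stand-in dissimilarity; the paper uses the trained SNN's output instead."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def get_diss_space_projection(imgs, prototypes, dissimilarity=euclidean):
    """Map every sample to the vector of its dissimilarities to each prototype."""
    return [[dissimilarity(x, p) for p in prototypes] for x in imgs]

F = get_diss_space_projection([(0, 0), (3, 4)], [(0, 0), (3, 4), (6, 8)])
```

Each row of `F` has length k (here 3), so the SVM is trained on k-dimensional vectors regardless of the size of the input images.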

#### 2.4. Classification by SVM

#### 2.5. Heterogeneous Auto-Similarities of Characteristics (HASC)

## 3. Siamese Neural Network (SNN)

#### 3.1. The Two Identical Twin Subnetworks

#### 3.2. Subtract Block, FC Layer, and Sigmoid Function

## 4. Clustering

#### 4.1. K-Means

1. Randomly select a set of centroids from among the data points.
2. For each data point x remaining in the training set, compute the distance d(x) between it and the nearest centroid.
3. Recalculate new centroids via a weighted probability distribution.
4. Repeat Steps 2 and 3 until convergence.
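One common way to realize these steps is k-means++ seeding followed by the standard assign-and-update iterations. The sketch below follows that reading; it is an assumption about how the steps fit together rather than a reproduction of the implementation used in the experiments, and all function names are illustrative.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: later seeds are drawn with probability proportional to d(x)^2."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        d2 = [min(sq_dist(p, c) for c in centroids) for p in points]
        if sum(d2) == 0:  # every point coincides with an existing seed
            centroids.append(rng.choice(points))
            continue
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

def kmeans(points, k, iters=100, seed=0):
    """Return (centroids, assignment) after iterating until assignments stop changing."""
    rng = random.Random(seed)
    centroids = kmeans_pp_init(points, k, rng)
    assign = None
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        new_assign = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        if new_assign == assign:
            break
        assign = new_assign
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assign

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assign = kmeans(points, 2)
```

On this toy input the two well-separated groups end up in different clusters; which group receives index 0 depends on the seed.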

#### 4.2. K-Medoids

- Step one is a build step in which each of the k clusters is associated with a potential medoid. There are many ways to select the initial medoids; MATLAB's standard implementation uses the k-means++ heuristic.
- Step two is a swap step in which each point in a cluster is tested as a potential medoid by checking whether the sum of the within-cluster distances is smaller when that point is used as the medoid. Every point is then assigned to the cluster with the closest medoid.
- The last step repeats the previous steps until convergence.
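The swap and assignment steps can be sketched with a simple alternating scheme. This is a simplified variant for illustration, not MATLAB's implementation: the initial medoids default to the first k points instead of a k-means++ choice, and `k_medoids` is a hypothetical name.

```python
def k_medoids(points, k, dist, init=None, max_iter=100):
    """Alternate between assigning points and re-picking each cluster's medoid.

    Medoids are tracked as indices into `points`; `init` (assumed to hold k
    distinct indices) defaults to the first k points.
    """
    medoids = list(init) if init is not None else list(range(k))
    for _ in range(max_iter):
        # Assignment: each point joins the cluster of its closest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: dist(p, points[m]))
            clusters[nearest].append(i)
        # Swap: within each cluster, keep the member with the smallest
        # total distance to all members of that cluster.
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # keep the old medoid if its cluster is empty
                new_medoids.append(m)
                continue
            best = min(members, key=lambda c: sum(dist(points[c], points[j]) for j in members))
            new_medoids.append(best)
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return sorted(medoids)

medoid_idx = k_medoids([0, 1, 2, 10, 11, 12], 2, lambda a, b: abs(a - b))
```

Because medoids are always actual data points, only pairwise dissimilarities are needed, which suits a setting where the prototypes must be real spectrograms.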

#### 4.3. Spectral

- The similarity matrix $M$, whose cell $m_{ij}$ is the similarity value of two patterns (i.e., two spectrograms $s_i$, $s_j$);
- The degree matrix $D_g$, which is a diagonal matrix obtained by summing the rows of $M$: $$D_g(i,i)=\sum_{j} m_{ij};$$
- The Laplacian matrix $L$, which is defined as $$L=D_g-M.$$
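As a small worked example of these three definitions, the sketch below computes the degree matrix and the Laplacian for a hand-written 3 × 3 similarity matrix. The matrix values and the function name `graph_laplacian` are made up for illustration.

```python
def graph_laplacian(M):
    """Return (D_g, L) for a symmetric similarity matrix M given as nested lists."""
    n = len(M)
    degrees = [sum(row) for row in M]  # D_g(i, i) = sum_j m_ij
    D = [[degrees[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    L = [[D[i][j] - M[i][j] for j in range(n)] for i in range(n)]  # L = D_g - M
    return D, L

M = [[0.0, 1.0, 0.5],
     [1.0, 0.0, 0.0],
     [0.5, 0.0, 0.0]]
D, L = graph_laplacian(M)
```

Every row of `L` sums to zero by construction, which is why the constant vector is always an eigenvector with eigenvalue 0.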

- Define a local neighborhood for each data point in the dataset (there are many ways to define a neighborhood; the nearest-neighbor method is the default setting in the MATLAB implementation of spectral clustering). Then compute the local similarity matrix of each pattern in the neighborhood.
- Calculate the Laplacian matrix $L$.
- Create a matrix $V$ containing columns ${v}_{1}$, …, ${v}_{k}$, where the columns are the $k$ eigenvectors, i.e., the spectrum (hence the name), corresponding to the $k$ smallest eigenvalues of $L$.
- Perform k-means or k-medoids clustering by treating each row of $V$ as a data point.
- Cluster the original patterns according to the assignments of their corresponding rows.

#### 4.4. Hierarchical Clustering

- Agglomerative, where each pattern starts as its own cluster. A merging strategy is then applied while moving up the hierarchy: each cluster at the next level is the fusion of two clusters from the previous level.
- Divisive, where a single cluster contains all patterns at the first level, and a splitting strategy is then applied to divide clusters while moving down the hierarchy.

- Using a distance metric, find the similarity or dissimilarity between every pair of data points in the dataset;
- Aggregate data points into a binary hierarchical cluster tree by fusing pairs of clusters according to their distance;
- Establish the level of the tree where it is cut into k clusters.
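The three steps above can be sketched for the agglomerative case as follows. Average linkage is an assumed choice (the text does not fix the linkage criterion), and the function name is illustrative.

```python
def agglomerative(points, k, dist):
    """Average-linkage agglomerative clustering cut at k clusters.

    Returns a sorted list of clusters, each a sorted list of point indices.
    """
    clusters = [[i] for i in range(len(points))]  # every point starts alone

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        # Find the closest pair of clusters and fuse them.
        _, ia, ib = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merged = sorted(clusters[ia] + clusters[ib])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (ia, ib)]
        clusters.append(merged)
    return sorted(clusters)

groups = agglomerative([0, 1, 2, 10, 11, 12], 2, lambda a, b: abs(a - b))
```

Stopping the merge loop once k clusters remain is equivalent to cutting the full tree at the level that yields k clusters.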

## 5. Experimental Results

- BIRDz, which served as a control and real-world audio dataset in [46]. A ten-run testing protocol is used, with the same splits as those used by the authors of the dataset. The real-world tracks were collected from the Xeno-canto Archive (http://www.xeno-canto.org/). BIRDz includes a total of 2762 bird acoustic samples from 11 North American bird species plus 339 “unknown” samples that include noise and vocalizations of unknown species. The samples cover five spectrogram types: (1) constant frequency, (2) frequency-modulated whistles, (3) broadband pulses, (4) broadband with varying frequency components, and (5) strong harmonics. The dataset is balanced: the size of each bird class varies between 246 and 259 samples; only the “unknown” class is slightly larger.
- CAT [37,47] contains ten balanced classes of approximately 300 samples each, for a total of 2962 samples. The testing protocol is 10-fold cross-validation. The ten classes represent the following cat vocalizations: (1) Resting, (2) Warning, (3) Angry, (4) Defense, (5) Fighting, (6) Happy, (7) Hunting mind, (8) Mating, (9) Mother call, and (10) Paining. The average duration of each sample is approximately 4 s. Samples were collected from online resources such as Kaggle, YouTube, and Flickr.

- The best way to build an ensemble of Siamese networks is to combine different network topologies;
- The proposed F_NN ensemble improves on previous methods based on Siamese networks (cf. OLD in Table 6);
- F_NN obtains a performance similar to that of eCNN on BIRD but lower than that of eCNN on CAT;
- The best performance on both datasets is obtained by combining eCNN and F_NN with the sum rule (i.e., the fusion of the CNNs and the Siamese networks).
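The sum rule mentioned in the last point adds the per-class scores of the individual classifiers and picks the class with the largest total. A minimal sketch follows; the function name, the score layout, and the example scores are illustrative assumptions, and the scores of the fused classifiers are assumed to be on comparable scales.

```python
def sum_rule(score_sets):
    """Fuse classifiers by summing their per-class scores, then take the argmax.

    score_sets[c][n][j] is classifier c's score for sample n and class j.
    """
    n_samples = len(score_sets[0])
    n_classes = len(score_sets[0][0])
    fused = [
        [sum(scores[n][j] for scores in score_sets) for j in range(n_classes)]
        for n in range(n_samples)
    ]
    # For each sample, return the index of the class with the highest fused score.
    return [max(range(n_classes), key=row.__getitem__) for row in fused]

# Made-up scores for two samples and two classes from two classifiers.
scores_a = [[0.6, 0.4], [0.45, 0.55]]
scores_b = [[0.3, 0.7], [0.2, 0.8]]
predicted = sum_rule([scores_a, scores_b])
```

In the first sample the two classifiers disagree, and the more confident one decides the fused label, which is the behaviour that lets complementary models help each other.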

## 6. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## References

1. Padmanabhan, J.; Premkumar, M.J.J. Machine learning in automatic speech recognition: A survey. IETE Tech. Rev. **2015**, 32, 240–251.
2. Nanni, L.; Costa, Y.M.; Lucio, D.R.; Silla, C.N., Jr.; Brahnam, S. Combining visual and acoustic features for audio classification tasks. Pattern Recognit. Lett. **2017**, 88, 49–56.
3. Sahoo, S.; Choubisa, T.; Prasanna, S.M. Multimodal Biometric Person Authentication: A Review. IETE Tech. Rev. **2012**, 29, 54–75.
4. Li, S.; Li, F.; Tang, S.; Xiong, W. A Review of Computer-Aided Heart Sound Detection Techniques. BioMed Res. Int. **2020**, 2020, 5846191.
5. Chandrakala, S.; Jayalakshmi, S.L. Generative Model Driven Representation Learning in a Hybrid Framework for Environmental Audio Scene and Sound Event Recognition. IEEE Trans. Multimed. **2019**, 22, 3–14.
6. Chachada, S.; Kuo, C.-C.J. Environmental sound recognition: A survey. In Proceedings of the 2013 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, Kaohsiung, Taiwan, 29 October–1 November 2013; pp. 1–9.
7. Zhao, Z.; Zhang, S.H.; Xu, Z.Y.; Bellisario, K.; Dai, N.H.; Omrani, H.; Pijanowski, B.C. Automated bird acoustic event detection and robust species classification. Ecol. Inform. **2017**, 39, 99–108.
8. Badshah, A.M.; Ahmad, J.; Rahim, N.; Baik, S.W. Speech emotion recognition from spectrograms with deep convolutional neural network. In Proceedings of the 2017 International Conference on Platform Technology and Service (PlatCon), Busan, Korea, 13–15 February 2017; pp. 1–5.
9. Zeng, Y.; Mao, H.; Peng, D.; Yi, Z. Spectrogram based multi-task audio classification. Multimed. Tools Appl. **2019**, 78, 3705–3722.
10. Lidy, T.; Rauber, A. Evaluation of feature extractors and psycho-acoustic transformations for music genre classification. In Proceedings of the 6th International Conference on Music Information Retrieval, London, UK, 11–15 September 2005; pp. 34–41.
11. Wyse, L. Audio spectrogram representations for processing with convolutional neural networks. arXiv **2017**, arXiv:1706.09559.
12. Rubin, J.; Abreu, R.; Ganguli, A.; Nelaturi, S.; Matei, I.; Sricharan, K. Classifying heart sound recordings using deep convolutional neural networks and mel-frequency cepstral coefficient. In Proceedings of the Computing in Cardiology (CinC), Vancouver, BC, Canada, 11–14 September 2016.
13. Nanni, L.; Costa, Y.M.G.; Brahnam, S. Set of texture descriptors for music genre classification. In Proceedings of the 22nd WSCG International Conference on Computer Graphics, Visualization and Computer Vision, Plzen, Czech Republic, 2–5 June 2014.
14. Haralick, R.M. Statistical and structural approaches to texture. Proc. IEEE **1979**, 67, 786–804.
15. Ojansivu, V.; Heikkila, J. Blur insensitive texture classification using local phase quantization. In Proceedings of the ICISP, Cherbourg-Octeville, France, 1–3 July 2008.
16. Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. **2002**, 24, 971–987.
17. Brahnam, S.; Jain, L.C.; Lumini, A.; Nanni, L. (Eds.) Local Binary Patterns: New Variants and Applications; Springer: Berlin, Germany, 2014.
18. Costa, Y.M.G.; Oliveira, L.S.; Koerich, A.L.; Gouyon, F.; Martins, J.G. Music genre classification using LBP textural features. Signal Process. **2012**, 92, 2723–2737.
19. Costa, Y.M.G.; Oliveira, L.S.; Koerich, A.L.; Gouyon, F. Music genre recognition using spectrograms. In Proceedings of the 18th International Conference on Systems, Signals and Image Processing, Sarajevo, Bosnia and Herzegovina, 16–18 June 2011.
20. Costa, Y.M.G.; Oliveira, L.S.; Koerich, A.L.; Gouyon, F. Music genre recognition using gabor filters and LPQ texture descriptors. In Proceedings of the 18th Iberoamerican Congress on Pattern Recognition, Havana, Cuba, 20–23 November 2013.
21. Ren, Y.; Cheng, X. Review of convolutional neural network optimization and training in image processing. In Proceedings of the 10th International Symposium on Precision Engineering Measurements and Instrumentation (ISPEMI 2018), Kunming, China, 8–10 August 2018; SPIE: Bellingham, WA, USA, 2019.
22. Wang, X.; Zhao, Y.; Pourpanah, F. Recent advances in deep learning. Int. J. Mach. Learn. Cybern. **2020**, 11, 747–750.
23. Humphrey, E.; Bello, J.P. Rethinking automatic chord recognition with convolutional neural networks. In Proceedings of the International Conference on Machine Learning and Applications, Boca Raton, FL, USA, 12–15 December 2012.
24. Humphrey, E.; Bello, J.P.; LeCun, Y. Moving beyond feature design: Deep architectures and automatic feature learning in music informatics. In Proceedings of the International Conference on Music Information Retrieval, Porto, Portugal, 8–12 October 2012; pp. 403–408.
25. Nakashika, T.; Garcia, C.; Takiguchi, T. Local-feature-map integration using convolutional neural networks for music genre classification. In Proceedings of the Interspeech 2012 13th Annual Conference of the International Speech Communication Association, Portland, OR, USA, 9–13 September 2012; pp. 1752–1755.
26. Costa, Y.M.; Oliveira, L.S.; Silla, C.N., Jr. An evaluation of Convolutional Neural Networks for music classification using spectrograms. Appl. Soft Comput. **2017**, 52, 28–38.
27. Sigtia, S.; Dixon, S. Improved music feature learning with deep neural networks. In Proceedings of the IEEE International Conference on Acoustic, Speech and Signal Processing, Florence, Italy, 4–9 May 2014.
28. Wang, C.Y.; Santoso, A.; Mathulaprangsan, S.; Chiang, C.C.; Wu, C.H.; Wang, J.C. Recognition and retrieval of sound events using sparse coding convolutional neural network. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Hong Kong, China, 10–14 July 2017.
29. Oramas, S.; Nieto, O.; Barbieri, F.; Serra, X. Multilabel music genre classification from audio, text and images using deep features. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, Suzhou, China, 23–27 October 2017.
30. Kong, Q.; Xu, Y.; Sobieraj, I.; Wang, W.; Plumbley, M.D. Sound Event Detection and Time-Frequency Segmentation from Weakly Labelled Data. IEEE/ACM Trans. Audio Speech Lang. Process. **2019**, 27, 777–787.
31. Nanni, L.; Brahnam, S.; Lumini, A.; Barrier, T. Ensemble of local phase quantization variants with ternary encoding. In Local Binary Patterns: New Variants and Applications; Brahnam, S., Jain, L.C., Lumini, A., Nanni, L., Eds.; Springer: Berlin, Germany, 2014; pp. 177–188.
32. Cao, Z.; Principe, J.C.; Ouyang, B.; Dalgleish, F.; Vuorenkoski, A. Marine animal classification using combined CNN and hand-designed image features. In Proceedings of the MTS/IEEE Oceans, Washington, DC, USA, 19–22 October 2015.
33. Salamon, J.; Bello, J.P.; Farnsworth, A.; Kelling, S. Fusing shallow and deep learning for bioacoustic bird species classification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017.
34. Cullinan, V.I.; Matzner, S.; Duberstein, C.A. Classification of birds and bats using flight tracks. Ecol. Inform. **2015**, 27, 55–63.
35. Acevedo, M.A.; Corrada-Bravo, C.J.; Corrada-Bravo, H.; Villanueva-Rivera, L.J.; Aide, T.M. Automated classification of bird and amphibian calls using machine learning: A comparison of methods. Ecol. Inform. **2009**, 4, 206–214.
36. Fristrup, K.M.; Watkins, W.A. Marine Animal Sound Classification; WHOI Technical Reports; Woods Hole Oceanographic Institution: Woods Hole, MA, USA, 1993; Available online: https://hdl.handle.net/1912/546 (accessed on 30 October 2020).
37. Pandeya, Y.R.; Kim, D.; Lee, J. Domestic cat sound classification using learned features from deep neural nets. Appl. Sci. **2018**, 8, 1949.
38. Wang, A. An industrial strength audio search algorithm. In Proceedings of the ISMIR Proceedings, Baltimore, MD, USA, 26–30 October 2003.
39. Haitsma, J.; Kalker, T. A Highly Robust Audio Fingerprinting System. In Proceedings of the ISMIR, Paris, France, 13–17 October 2002.
40. Manocha, P.; Badlani, R.; Kumar, A.; Shah, A.; Elizalde, B.; Raj, B. Content-based representations of audio using siamese neural networks. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 3136–3140.
41. Droghini, D.; Vesperini, F.; Principi, E.; Squartini, S.; Piazza, F. Few-shot siamese neural networks employing audio features for human-fall detection. In Proceedings of the International Conference on Pattern Recognition and Artificial Intelligence, Union, NJ, USA, 15–17 August 2018.
42. Zhang, Y.; Pardo, B.; Duan, Z. Siamese Style Convolutional Neural Networks for Sound Search by Vocal Imitation. IEEE/ACM Trans. Audio Speech Lang. Process. **2018**, 27, 429–441.
43. Nanni, L.; Rigo, A.; Lumini, A.; Brahnam, S. Spectrogram Classification Using Dissimilarity Space. Appl. Sci. **2020**, 10, 4176.
44. Agrawal, A. Dissimilarity learning via Siamese network predicts brain imaging data. arXiv **2019**, arXiv:1907.02591.
45. Bromley, J.; Bentz, J.W.; Bottou, L.; Guyon, I.; LeCun, Y.; Moore, C.; Säckinger, E.; Shah, R. Signature verification using a Siamese time delay neural network. Int. J. Pattern Recognit. Artif. Intell. **1993**, 7, 669–688.
46. Zhang, S.H.; Zhao, Z.; Xu, Z.Y.; Bellisario, K.; Pijanowski, B.C. Automatic bird vocalization identification based on fusion of spectral pattern and texture features. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 271–275.
47. Pandeya, Y.R.; Lee, J. Domestic Cat Sound Classification Using Transfer Learning. Int. J. Fuzzy Log. Intell. Syst. **2018**, 18, 154–160.
48. Biagio, M.S.; Crocco, M.; Cristani, M.; Martelli, S.; Murino, V. Heterogeneous auto-similarities of characteristics (hasc): Exploiting relational information for classification. In Proceedings of the IEEE Computer Vision (ICCV13), Sydney, Australia, 3–6 December 2013.
49. Piczak, K.J. ESC: Dataset for Environmental Sound Classification. In Proceedings of the 23rd ACM international conference on Multimedia, Brisbane, Australia, 26–30 October 2015.
50. Vapnik, V. The support vector method. In Proceedings of the Artificial Neural Networks ICANN’97, Lausanne, Switzerland, 8–10 October 1997.
51. Chicco, D. Siamese neural networks: An overview. In Artificial Neural Networks. Methods in Molecular Biology; Cartwright, H., Ed.; Springer Protocols; Humana: New York, NY, USA, 2020; Volume 2190, pp. 73–94.
52. Glorot, X.; Bordes, A.; Bengio, Y. Deep Sparse Rectifier Neural Networks. In Proceedings of the AISTATS, Ft. Lauderdale, FL, USA, 11–13 April 2011; Available online: https://pdfs.semanticscholar.org/6710/7f78a84bdb2411053cb54e94fa226eea6d8e.pdf?_ga=2.211730323.729472771.1575613836-1202913834.1575613836 (accessed on 30 October 2020).
53. Maas, A.L. Rectifier Nonlinearities Improve Neural Network Acoustic Models. 2013. Available online: https://pdfs.semanticscholar.org/367f/2c63a6f6a10b3b64b8729d601e69337ee3cc.pdf?_ga=2.208124820.729472771.1575613836-1202913834.1575613836 (accessed on 30 October 2020).
54. Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. **2006**, 7, 1–30.
55. Huzaifah, M. Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks. arXiv **2017**, arXiv:1706.07156.
56. Nanni, L.; Costa, Y.; Lumini, A.; Kim, M.Y.; Baek, S.R. Combining visual and acoustic features for music genre classification. Expert Syst. Appl. **2016**, 45, 108–117.

**Siamese Network 1**

Layers | Activations | Learnable | Filter Size | Num. of Filters |
---|---|---|---|---|
Input Layer | 224 × 224 | | | |
2D Convolution | 215 × 215 × 64 | 6464 | 10 × 10 | 64 |
ReLU | 215 × 215 × 64 | 0 | | |
Max Pooling | 107 × 107 × 64 | 0 | 2 × 2 | |
2D Convolution | 101 × 101 × 128 | 401,536 | 7 × 7 | 128 |
ReLU | 101 × 101 × 128 | 0 | | |
Max Pooling | 50 × 50 × 128 | 0 | 2 × 2 | |
2D Convolution | 47 × 47 × 128 | 262,272 | 4 × 4 | 128 |
ReLU | 47 × 47 × 128 | 0 | | |
Max Pooling | 23 × 23 × 128 | 0 | 2 × 2 | |
2D Convolution | 19 × 19 × 64 | 204,864 | 5 × 5 | 64 |
ReLU | 19 × 19 × 64 | 0 | | |
Fully Connected | 4096 | 94,638,080 | | |

**Siamese Network 2**

Layers | Activations | Learnable | Filter Size | Num. of Filters |
---|---|---|---|---|
Input Layer | 224 × 224 | 0 | | |
2D Convolution | 220 × 220 × 64 | 1664 | 5 × 5 | 64 |
LeakyReLU | 220 × 220 × 64 | 0 | | |
2D Convolution | 216 × 216 × 64 | 102,464 | 5 × 5 | 64 |
LeakyReLU | 216 × 216 × 64 | 0 | | |
Max Pooling | 108 × 108 × 64 | 0 | 2 × 2 | |
2D Convolution | 106 × 106 × 128 | 73,856 | 3 × 3 | 128 |
LeakyReLU | 106 × 106 × 128 | 0 | | |
2D Convolution | 104 × 104 × 128 | 147,584 | 3 × 3 | 128 |
LeakyReLU | 104 × 104 × 128 | 0 | | |
Max Pooling | 52 × 52 × 128 | 0 | 2 × 2 | |
2D Convolution | 49 × 49 × 128 | 262,272 | 4 × 4 | 128 |
LeakyReLU | 49 × 49 × 128 | 0 | | |
Max Pooling | 24 × 24 × 128 | 0 | 2 × 2 | |
2D Convolution | 20 × 20 × 64 | 204,864 | 5 × 5 | 64 |
LeakyReLU | 20 × 20 × 64 | 0 | 5 × 5 | |
Fully Connected | 2048 | 52,430,848 | | |

**Siamese Network 3**

Layers | Activations | Learnable | Filter Size | Num. of Filters |
---|---|---|---|---|
Input Layer | 224 × 224 | | | |
2D Convolution | 55 × 55 × 128 | 6400 | 7 × 7 | 128 |
Max Pooling | 27 × 27 × 128 | 0 | 2 × 2 | |
2D Convolution | 23 × 23 × 256 | 819,456 | 5 × 5 | 256 |
ReLU | 23 × 23 × 256 | 0 | | |
2D Convolution | 19 × 19 × 128 | 819,328 | 5 × 5 | 128 |
Max Pooling | 9 × 9 × 128 | 0 | 2 × 2 | |
2D Convolution | 7 × 7 × 64 | 73,792 | 3 × 3 | 64 |
ReLU | 7 × 7 × 64 | 0 | | |
Max Pooling | 3 × 3 × 64 | 0 | 2 × 2 | |
Fully Connected | 4096 | 2,363,392 | | |

**Siamese Network 4**

Layers | Activations | Learnable | Filter Size | Num. of Filters |
---|---|---|---|---|
Input Layer | 224 × 224 | | | |
2D Convolution | 218 × 218 × 128 | 6400 | 7 × 7 | 128 |
Max Pooling | 54 × 54 × 128 | 0 | 4 × 4 | |
ReLU | 54 × 54 × 128 | 0 | | |
2D Convolution | 50 × 50 × 256 | 819,456 | 5 × 5 | 256 |
ReLU | 50 × 50 × 256 | 0 | | |
2D Convolution | 48 × 48 × 64 | 147,520 | 3 × 3 | 64 |
Max Pooling | 24 × 24 × 64 | 0 | 2 × 2 | |
2D Convolution | 22 × 22 × 128 | 73,856 | 3 × 3 | 128 |
ReLU | 22 × 22 × 128 | 0 | | |
2D Convolution | 18 × 18 × 64 | 204,864 | 5 × 5 | 64 |
Fully Connected | 4096 | 84,938,752 | | |
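The activation sizes and learnable-parameter counts listed for the four networks follow from standard layer arithmetic. The sketch below reproduces the figures for Siamese Network 1 under assumptions inferred from the numbers rather than stated in the tables: a single-channel 224 × 224 input, unpadded convolutions with stride 1, and 2 × 2 max pooling with stride 2. The helper names are illustrative.

```python
def conv_output(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_params(kernel, in_channels, out_channels):
    """Weights plus one bias per filter for a square 2D convolution."""
    return kernel * kernel * in_channels * out_channels + out_channels

def fc_params(n_inputs, n_outputs):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs

# Siamese Network 1: (kernel, filters) per convolution, pooling after the first three.
size, channels = 224, 1
trace = []
for i, (kernel, filters) in enumerate([(10, 64), (7, 128), (4, 128), (5, 64)]):
    params = conv_params(kernel, channels, filters)
    size, channels = conv_output(size, kernel), filters
    trace.append((size, channels, params))
    if i < 3:
        size = conv_output(size, 2, stride=2)  # 2 x 2 max pooling, stride 2

fc = fc_params(size * size * channels, 4096)
```

Here `trace` holds the output size, channel count, and parameter count of each convolution, and `fc` is the parameter count of the final fully connected layer; both can be compared line by line with the Siamese Network 1 table.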

Name | Input Image | Network Topology | Clustering Method | Clustering Type | #Prototypes | #Classifiers | CAT | BIRD |
---|---|---|---|---|---|---|---|---|
Sup-1 | Sp | NN1 | K-means | S | 15, 30, 45, 60 | 4 | 78.64 ± 1.2 | 92.46 ± 0.71 |
Sup-2 | Sp | NN2 | K-means | S | 15, 30, 45, 60 | 4 | 76.95 ± 1.3 | 92.74 ± 0.82 |
UnS-1 | Sp | NN1 | K-means | U | 15, 30, 45, 60 | 4 | 81.69 ± 1.0 | 92.73 ± 0.95 |
UnS-2 | Sp | NN2 | K-means | U | 15, 30, 45, 60 | 4 | 75.25 ± 1.4 | 92.80 ± 0.78 |
HSup-1 | HASC | NN1 | K-means | S | 15, 30, 45, 60 | 4 | 78.64 ± 1.2 | 94.52 ± 0.65 |
HSup-2 | HASC | NN2 | K-means | S | 15, 30, 45, 60 | 4 | 81.69 ± 0.9 | 93.22 ± 0.82 |
HUnS-1 | HASC | NN1 | K-means | U | 15, 30, 45, 60 | 4 | 79.32 ± 1.1 | 94.53 ± 0.68 |
HUnS-2 | HASC | NN2 | K-means | U | 15, 30, 45, 60 | 4 | 81.36 ± 1.3 | 92.97 ± 0.72 |
FSp-1 | Sp | NN1 | K-means | S,U | 15, 30, 45, 60 | 8 | 81.02 ± 1.0 | 92.79 ± 0.85 |
FSp-2 | Sp | NN2 | K-means | S,U | 15, 30, 45, 60 | 8 | 76.95 ± 1.2 | 92.77 ± 0.76 |
FA-1 | Sp,HASC | NN1 | K-means | S,U | 15, 30, 45, 60 | 16 | 82.37 ± 0.9 | 94.50 ± 0.65 |
FA2 | Sp,HASC | NN2 | K-means | S,U | 15, 30, 45, 60 | 16 | 83.73 ± 0.9 | 94.11 ± 0.70 |
FA1_2 | Sp,HASC | NN1 + NN2 | K-means | S,U | 15, 30, 45, 60 | 32 | 84.41 ± 0.9 | 94.37 ± 0.62 |

**Table 3.** Performance obtained considering different clustering algorithms: accuracy ± standard deviation.

| Name | Input Image | Network Topology | Clustering Method | Clustering Type | #Prototypes | #Classifiers | CAT | BIRD |
|---|---|---|---|---|---|---|---|---|
| | HASC | NN2 | K-means | S | 15, 30, 45, 60 | 4 | 81.69 ± 0.9 | 93.22 ± 0.82 |
| | HASC | NN2 | K-Med | S | 15, 30, 45, 60 | 4 | 81.02 ± 1.0 | 92.85 ± 0.85 |
| | HASC | NN2 | Hier | S | 15, 30, 45, 60 | 4 | 81.69 ± 0.9 | 93.01 ± 0.87 |
| | HASC | NN2 | Spect | S | 15, 30, 45, 60 | 4 | 80.00 ± 1.1 | 93.13 ± 0.79 |
| F_Clu | HASC | NN2 | All | S | 15, 30, 45, 60 | 16 | 82.03 ± 0.9 | 93.37 ± 0.75 |

**Table 4.** Performance obtained considering different network topologies: accuracy ± standard deviation.

| Name | Input Image | Network Topology | Clustering Method | Clustering Type | #Prototypes | #Classifiers | CAT | BIRD |
|---|---|---|---|---|---|---|---|---|
| | HASC | NN1 | K-means | S | 15, 30, 45, 60 | 4 | 78.64 ± 1.2 | 94.52 ± 0.65 |
| | HASC | NN2 | K-means | S | 15, 30, 45, 60 | 4 | 81.69 ± 1.1 | 93.22 ± 0.72 |
| | HASC | NN3 | K-means | S | 15, 30, 45, 60 | 4 | 78.64 ± 1.2 | 94.91 ± 0.64 |
| | HASC | NN4 | K-means | S | 15, 30, 45, 60 | 4 | 82.37 ± 1.1 | 93.33 ± 0.68 |
| F_NN | HASC | All | K-means | S | 15, 30, 45, 60 | 16 | 84.07 ± 1.0 | 94.99 ± 0.64 |

**Table 5.** Comparison between ensembles of reiterated Siamese Networks with NN1 and ensembles obtained considering different network topologies: accuracy ± standard deviation.

Name | Input Image | Network Topology | Clustering Method | Clustering Type | #Prototypes | #Classifiers | CAT | BIRD |
---|---|---|---|---|---|---|---|---|
HSup-1(1) | HASC | NN1 | K-means | S | 15 | 1 | 75.93 ± 1.5 | 93.92 ± 0.85 |
HSup-1(4) | HASC | NN1 | K-means | S | 15 | 1×4 | 81.69 ± 1.3 | 94.50 ± 0.78 |
HSup-1 | HASC | NN1 | K-means | S | 15, 30, 45, 60 | 4×1 | 78.64 ± 1.2 | 94.52 ± 0.65 |
HSup-1(8) | HASC | NN1 | K-means | S | 15, 30, 45, 60 | 4×2 | 80.68 ± 1.1 | 94.56 ± 0.75 |
HSup-1(16) | HASC | NN1 | K-means | S | 15, 30, 45, 60 | 4×4 | 81.02 ± 1.0 | 94.63 ± 0.77 |
F_NN(4) | HASC | All | K-means | S | 15 | 4 | 83.39 ± 0.9 | 94.73 ± 0.62 |
F_NN(8) | HASC | All | K-means | S | 15, 30 | 8 | 84.07 ± 0.8 | 94.90 ± 0.60 |
F_NN | HASC | All | K-means | S | 15, 30, 45, 60 | 16 | 84.07 ± 0.8 | 94.99 ± 0.58 |

Method | CAT | BIRD |
---|---|---|
OLD [43] | 82.41 | 92.97 |
F_NN | 84.07 | 94.99 |
GoogleNet | 82.98 | 92.41 |
VGG16 | 84.07 | 95.30 |
VGG19 | 83.05 | 95.19 |
GoogleNetP365 | 85.15 | 92.94 |
eCNN | 87.36 | 95.81 |
OLD + eCNN | 87.76 | 95.95 |
F_NN + eCNN | 88.47 | 96.03 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Nanni, L.; Brahnam, S.; Lumini, A.; Maguolo, G. Animal Sound Classification Using Dissimilarity Spaces. *Appl. Sci.* **2020**, *10*, 8578.
https://doi.org/10.3390/app10238578
