Finding Meanings in Low Dimensional Structures: Stochastic Neighbor Embedding Applied to the Analysis of Indri indri Vocal Repertoire

Simple Summary
The description of the vocal repertoire represents a critical step before investigating other aspects of animal behaviour. Repertoires may contain both discrete vocalizations (acoustically distinct and distinguishable from each other) and graded ones, with a less rigid acoustic structure. The gradation level is one of the factors that make repertoires challenging to quantify objectively. Indeed, the higher the level of gradation in a system, the higher the complexity in grouping its components. A large sample of Indri indri calls was divided into ten putative categories on the basis of their acoustic similarity. We extracted frequency and duration parameters and then performed two different analyses that were able to group the calls according to the a priori categories, indicating the presence of ten robust vocal classes. The analyses also showed a neat grouping of discrete vocalizations and a weaker classification of graded ones.

Abstract
Although a growing number of studies focus on acoustic communication, the lack of shared analytic approaches leads to inconsistency among studies. Here, we introduce a computational method used to examine 3360 calls recorded from wild indris (Indri indri) from 2005–2018. We split each sound into ten portions of equal length and, from each portion, we extracted spectral coefficients, considering frequency values up to 15,000 Hz. We submitted the set of acoustic features first to a t-distributed stochastic neighbor embedding (t-SNE) algorithm, then to a hard-clustering procedure using a k-means algorithm. The t-SNE mapping indicated the presence of eight different groups, consistent with the acoustic structure of the a priori identification of calls, while the cluster analysis revealed that an overlap between distinct call types might exist. Our results indicated that t-SNE, which has been successfully employed in several studies, also performed well in the analysis of the indris' repertoire and may open new perspectives towards the achievement of shared methodological techniques for the comparison of animal vocal repertoires.

as either continuous or discontinuous, may constitute an oversimplification [27,29], as repertoires may show both graded and discrete features (e.g., Papio ursinus [29]; Cercopithecus neglectus, Cercopithecus campbelli, Cercocebus torquatus [30]), and the differentiation within vocal types may occur to varying degrees [31,32]. Traditionally, a large number of studies relied on the comparison of sound similarity using clustering methods [33] based on acoustic features extracted from spectrograms. Still, although these algorithms showed good results in the classification of sounds, they could fail to describe the graded transition between call types that may occur in vocal repertoires [29]. Moreover, the gradation level is precisely one of the main reasons for the lack of consistency in assessments of vocal repertoire size. Indeed, the higher the level of gradation, the higher the potential for information diffusion but also the higher the complexity in grouping the components of a system [28]. We expected to find a repertoire containing both graded and conspicuous signals [29,30] and, according to the call social function hypothesis, an acoustic variation of calls associated with their function [27,28,30,34]. Calls related to social contexts show the highest variation level when associated with affiliative value, while the highest level of stereotypy is associated with agonistic contexts (Cercopithecus campbelli [35]); alarm calls show an intermediate gradation level. Hence, we expected to find great flexibility in calls with an affiliative social function, a rigid structure in signals associated with negative contexts, and an intermediate variation in alarm calls. Accordingly, in agreement with Peckre and colleagues [28], we expected to find a clearer clustering of discrete calls and a weaker grouping accuracy for graded ones.
Finally, in agreement with the "social complexity-vocal complexity hypothesis" [30] and the social complexity hypothesis for communicative complexity [28], we expected indris to possess a small repertoire compared with that of other lemurs [21] or other primates [36] living in larger social groups.

Data Collection
We recorded spontaneous vocalizations of 18 groups of indris at four different forest sites: Six groups (1R, 2R, 3R, 5R, 6R, and XR) were recorded in Analamazaotra Special Reserve ( , with a 16-bit amplitude resolution. Vocalizations were recorded at a distance of 2–10 m, since all the study groups were habituated, and all efforts were made to ensure that the microphone was oriented toward the vocalizing animal. Focal animal sampling [37] and the presence of individual-specific natural marks allowed the attribution of each vocalization to a signaler. Only spontaneous utterances were recorded, avoiding the use of playback stimuli.

Acoustical Analysis
Animals 2019, 9, 243
We visually inspected all recordings using spectrograms (Praat 6.0.28) (Phonetic Sciences, University of Amsterdam, Amsterdam, The Netherlands) [38]. High-quality vocal emissions were then cut, normalized, saved as single files (n = 3360), and assigned to nine putative categories on the basis of their acoustic and spectrographic evaluation, according to the vocal types identified in a previous study [39]: clacsons (n = 622), grunts (n = 1145), hums (n = 418), kisses (n = 296), long tonal calls (n = 31), roars (n = 62), short tonal calls (n = 44), wheezes (n = 150), and wheezing grunts (n = 297). Moreover, all indris within a family group participate in a chorusing song, mainly consisting of harmonic frequency-modulated notes [40]. We also isolated units from the songs and grouped them in a tenth category (songbits, n = 295). Eight vocal types and 1275 vocalizations out of 3360 were included in a previous analysis [39]; wheezing grunts were previously identified [41] but not detected by Maretti and colleagues [39], and song units were not considered in that former repertoire description.
For each call, we extracted spectral coefficients using a custom-made script in Praat [38]. The script first calculated the overall duration of a sound and then split it into ten portions of equal length. For each portion, the frequency range between 50 Hz and 15,000 Hz was divided into sets of frequencies called bins or bands (e.g., 50-500 Hz, 501-1000 Hz, 1001-1500 Hz, and 2001-2500 Hz). For each bin, we extracted the energy value using the function 'Get band energy' in Praat. The resulting dataset contained 3360 samples with 151 attributes each: 150 were frequency parameters and the last was the duration of the sound.
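The layout of the feature vector described in this section can be illustrated with a minimal Python sketch. It is not the custom Praat script and does not reproduce Praat's 'Get band energy'; it uses a naive discrete Fourier transform, and the function names, the band edges, and the synthetic 1000 Hz tone are all hypothetical choices made for illustration. With ten portions and 15 bands per portion (an assumption inferred from the 150 frequency parameters reported), the same layout would give 151 attributes.

```python
import cmath
import math

def band_energies(frame, sample_rate, band_edges):
    """Sum squared DFT magnitudes inside each [low, high) frequency band."""
    n = len(frame)
    energies = [0.0] * (len(band_edges) - 1)
    for k in range(n // 2 + 1):                      # non-negative frequencies only
        freq = k * sample_rate / n
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        for b in range(len(energies)):
            if band_edges[b] <= freq < band_edges[b + 1]:
                energies[b] += abs(coeff) ** 2
                break
    return energies

def extract_features(samples, sample_rate, band_edges, n_portions=10):
    """Band energies for each equal-length portion, followed by the duration (s)."""
    size = len(samples) // n_portions
    features = []
    for p in range(n_portions):
        frame = samples[p * size:(p + 1) * size]
        features.extend(band_energies(frame, sample_rate, band_edges))
    features.append(len(samples) / sample_rate)      # last attribute: duration
    return features

# Hypothetical input: 0.1 s of a 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(800)]
edges = [0, 500, 1500, 2500, 4001]                   # four illustrative bands (Hz)
features = extract_features(tone, sample_rate, edges)
print(len(features))                                 # 10 portions x 4 bands + 1 = 41
```

In this toy case, the energy of every portion falls almost entirely in the second band (500 to 1500 Hz), which contains the tone.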

Acoustic Embedding and Classification Procedure
We embedded the spectral feature vectors into a two-dimensional space using t-distributed stochastic neighbor embedding (t-SNE) [10] with a Barnes-Hut implementation, using the Rtsne package [42] in R (R Core Team 2018; version 3.5.1, R Foundation for Statistical Computing, Vienna, Austria) [43]. We then used the t-SNE model (perplexity = 40, theta = 0.5, dims = 2) to group the cases, using k-means clustering [44]. t-SNE was also used for data visualization. We then used the WEKA 3.8 (Waikato Environment for Knowledge Analysis) [45] machine learning tool for the implementation of two classification algorithms. We applied a multi-layer perceptron (MLP) [46,47] for the quantitative categorization of both the cluster assignment and the vocal type prediction, using 67% of the dataset to train the neural network. We then computed two mean confusion matrices, one from the vocal types assigned a priori and the classes predicted by the MLP, the other from the clusters assigned with the t-SNE procedure and the classes predicted by the network. Finally, to compare the results of the t-SNE cluster assignment with those of a k-means clustering (with k = 7, calculated through the average silhouette width) performed on a dataset reduced with a principal component analysis (which indicated six principal components), we applied a third network for the quantitative categorization of the cluster assignment.
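The role of the perplexity parameter can be illustrated with a short sketch. In t-SNE, the similarity of point j to point i is a Gaussian conditional probability, and the bandwidth sigma of each point is tuned so that the perplexity of its neighbor distribution matches the requested value (40 here). The Python code below is a simplified illustration of that calibration for a single point, not the internals of Rtsne or of the Barnes-Hut approximation; the function names and the synthetic squared distances are hypothetical.

```python
import math

def conditional_probabilities(sq_dists, sigma):
    """p(j|i) for one point i, from its squared distances to the other points."""
    nearest = min(sq_dists)                          # shift for numerical stability
    weights = [math.exp(-(d - nearest) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probs):
    """Perplexity = 2 ** (Shannon entropy in bits)."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy

def calibrate_sigma(sq_dists, target, low=1e-3, high=1e3, steps=100):
    """Bisection on sigma; perplexity grows monotonically with sigma."""
    for _ in range(steps):
        mid = (low + high) / 2.0
        if perplexity(conditional_probabilities(sq_dists, mid)) > target:
            high = mid
        else:
            low = mid
    return (low + high) / 2.0

# Hypothetical squared distances from one call to 100 other calls.
sq_dists = [float(i * i) for i in range(1, 101)]
sigma = calibrate_sigma(sq_dists, target=40.0)
probs = conditional_probabilities(sq_dists, sigma)
print(round(perplexity(probs), 3))
```

A larger perplexity spreads the probability mass over more neighbors, so the embedding weighs broader structure more heavily; a smaller value emphasizes very local neighborhoods.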

t-SNE Mapping
The t-SNE algorithm identified eight clouds (Figure 1a); we therefore performed a k-means clustering with k = 8. As highlighted in Figure 1a,b, the analysis recognized eight different clusters; all groups but three were consistent with the acoustic structure of the a priori identification. Clusters one, two, and three each contain exclusively one vocal type: Wheezing grunts ( (Table 1). Kisses and wheezes (Figure 2d,e, Figure S1c,d) were grouped in cluster five (66.37% and 33.63%, respectively), while grunts and hums (Figure 2b,c, Figure S1a,b) were both included in clusters four, seven, and eight. Specifically, cluster four contained mainly grunts (85.04%) and a small percentage of hums (14.96%); cluster seven, just like cluster four, comprised mostly grunts (99.00%). Conversely, cluster eight included a large portion of hums (82.06%) and a smaller part of grunts (17.94%). Short tonal (Figure 2g, Figure S1f Figure S2a), although emerging as single clouds in the map, were grouped together in cluster six (respectively, 22.63%, 45.36%, and 32.12%, Table 1).
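How per-cluster percentages of this kind can be derived is sketched below in Python: a plain k-means (Lloyd's algorithm) is run on two-dimensional embedding coordinates, and the a priori call types are then tabulated within each cluster. This is only an illustration under stated assumptions; the coordinates, call labels, starting centroids, and function names are invented toy values, not the study data or the R code used for the analysis.

```python
import math
from collections import Counter

def kmeans(points, centroids, iterations=50):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append((sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members)))
            else:
                updated.append(centroids[c])         # keep an empty cluster in place
        if updated == centroids:
            break
        centroids = updated
    return labels

def cluster_composition(labels, call_types):
    """Percentage of each call type within each cluster."""
    counts = {}
    for label, call in zip(labels, call_types):
        counts.setdefault(label, Counter())[call] += 1
    return {label: {call: 100.0 * n / sum(c.values()) for call, n in c.items()}
            for label, c in counts.items()}

# Toy embedding: two well-separated clouds of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
call_types = ["grunt", "grunt", "grunt", "hum", "roar", "roar", "roar", "roar"]
labels = kmeans(points, centroids=[points[0], points[4]])
composition = cluster_composition(labels, call_types)
print(composition)
```

In the toy data, the first cluster mixes two call types (75% grunts, 25% hums) while the second is homogeneous, mirroring the distinction between mixed and single-type clusters.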

Call Recognition
For the quantitative categorization of both the cluster assignment and the vocal type prediction, the network we selected, trained for 500 iterations, yielded the best performance with a learning rate of 0.2 and a momentum of 0.2. The correct attribution for the vocal type prediction reached 85.57% (n = 949; kappa statistic: 0.820; mean absolute error: 0.034; root mean squared error: 0.157; Table 2). The network recognized all vocal categories, with percentages of correct classification ranging from 58.76% for the wheezing grunts to 100.00% for the long tonal calls and roars. Clacsons and songbits were almost always correctly classified (99.03% and 98%, respectively). Classification performance was lower for grunts (84.25%), hums (84.56%), kisses (77.89%), short tonal calls (75.00%), and wheezes (78.57%; Table 3).
The model built for the cluster assignment showed better results. Of a total of 1109 instances, 1059 were correctly classified (95.49%; kappa statistic: 0.947; mean absolute error: 0.016; root mean squared error: 0.088; Table 4). The network recognized all clusters with high percentages of correct classification (Table 5). Five groups (clusters 1, 2, 3, 5, and 6) were entirely correctly classified, with a rate of correct assignment of 100%. The classification of the remaining three groups showed almost as good results. The lowest performance was achieved by cluster 4, which was correctly classified in 85.35% of cases. Clusters 7 and 8 performed better: the first was correctly classified in 96.92% of cases, while the second reached 95%. These groups, which contained almost all of the cases misclassified with respect to the clustering assignment, corresponded to the clusters showing a less homogeneous composition (Table 1): clusters 4 and 7 contained mainly grunts (85.04% and 99.00%, respectively) and smaller percentages of hums (14.96% and 1%, respectively). Conversely, cluster 8 included a large portion of hums (82.06%) and a smaller part of grunts (17.94%).
The third model, built using the PCA-based clustering as class, showed slightly weaker results when compared to the t-SNE model (93.05% vs. 95.49%; kappa statistic: 0.897; mean absolute error: 0.02; root mean squared error: 0.13).
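The summary statistics reported here can be related to a confusion matrix with a small Python sketch. It assumes that the reported kappa statistic is Cohen's kappa and that rows are true classes and columns predicted classes; the 2 x 2 matrix is an invented toy example, not Tables 2 to 5. The final lines only check that the reported counts are consistent with the stated percentages and with a test set of about one third of the 3360 calls (the exact rounding of the split is an assumption).

```python
def accuracy_and_kappa(confusion):
    """Overall accuracy and Cohen's kappa from a square confusion matrix."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(n)) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return observed, (observed - expected) / (1.0 - expected)

# Toy confusion matrix (rows: true class, columns: predicted class).
toy_accuracy, toy_kappa = accuracy_and_kappa([[45, 5], [10, 40]])
print(round(toy_accuracy, 2), round(toy_kappa, 2))

# Consistency checks on the counts reported in the text.
test_size = 3360 - round(3360 * 0.67)        # instances left after a 67% training split
vocal_type_rate = round(100 * 949 / 1109, 2)
cluster_rate = round(100 * 1059 / 1109, 2)
print(test_size, vocal_type_rate, cluster_rate)
```

Kappa discounts the agreement expected by chance, which is why it is lower than the raw accuracy whenever class sizes are unbalanced or errors are not negligible.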

Discussion
We described the use of a computationally simple but powerful method applied to the automatic recognition of acoustic signals. The t-SNE embedding and the use of the MLP allowed an efficient analytical performance: our results indicate that it was possible to automatically identify vocal types by using a dataset consisting of high-dimensional vector representations of objects, assigning similarities between those objects as conditional probabilities [10]. Still, although both t-SNE [15–19] and neural networks [50,51] are widely used to analyze acoustic characteristics in a wide range of research fields, ours represents the first attempt to combine these kinds of computational tools and apply them to the identification of the vocal repertoire in nonhuman primates. Our findings support what was found in a previous analysis of the indris' vocal repertoire [39]. Indeed, our analysis confirmed the presence of the eight call types that emerged in that study, but we also identified two further categories: the songbits, consisting of all units given by an indri during the choral song of the group, which were not considered for the purposes of the previous qualitative assessment of the Indri indri vocal repertoire; and the wheezing grunts [41], particular vocalizations given after agonistic physical interactions (pers. obs.), which were not detected by Maretti and colleagues [39]. Although our analysis allowed us to easily distinguish the different vocal types, the algorithm's map contained some points clustered within the wrong class. Most of these points correspond to sounds belonging to vocal classes that grade into one another to a certain degree and may therefore be difficult to identify [29]. In particular, we found an overlap between hums and grunts and between kisses and wheezes.
Hums (also known as weak grunts) [52] and grunts are both low-frequency and low-intensity calls; hums show a more defined harmonic structure than grunts, which, in contrast, show a clearer, low-pitched pulsed structure [39].
Furthermore, hums serve as group-cohesion calls [39], and their gradation level is consistent with what was found in Campbell's monkeys (Cercopithecus campbelli), where calls associated with high affiliative social values show an elevated gradation level [35]. The high gradation in these calls may allow for flexible usage and the encoding of multiple elements of information, in agreement with the findings of Keenan and colleagues on Cercopithecus campbelli [27]. Overall, our results are in line with findings on red-capped mangabeys (Cercocebus torquatus), whose contact calls show more acoustic dissimilarity than long-distance and alarm signals [53], and in contrast with findings on chacma (Papio ursinus), olive (P. anubis), and Guinea (P. papio) baboons, whose loud calls are more differentiated than grunts [54]. Kisses and wheezes, on the other hand, are both brief medium-intensity vocalizations, often uttered together (85% of cases) [39]. They are stress-related vocalizations that can be emitted as contact-rejection calls, before a song, or in response to anxiety-causing stimuli [39,41,55]. In our analysis, the identification of categories relied on a human visual assessment, and the grouping of vocal classes, although supported by our findings, may reflect dissimilarities perceived by humans but not necessarily by the species [56,57]. Moreover, in agreement with what was hypothesized, our results indicated the presence of signals showing features of both conspicuousness and gradedness, as found in other primate species [27,29,30], and the analysis showed a higher accuracy in the classification of discrete calls than in that of graded ones [28]. We expected the variation of calls to be associated with their social function [35], with calls having affiliative value showing the highest variation level, calls associated with agonistic contexts showing the highest stereotypy, and alarm calls showing an intermediate gradedness.
This prediction was not entirely supported by our results, as we found the two alarm calls (roars and clacsons) well separated from one another. The result seems instead to be in line with studies on call referentiality [58–60]. Additionally, the roars were grouped together with long tonal and short tonal calls; these three vocal types are the only ones with a chaotic component [39], and the result may depend on their spectral features, known to affect vocalization recognition [21,61].
Finally, according to the social complexity-vocal complexity hypothesis [30] and the social complexity hypothesis for communicative complexity [28], vocal repertoire size is directly proportional to group size. We therefore expected indris to possess a small repertoire compared with that of other lemurs [21] and other primates [36] living in larger social groups. A ten-category vocal repertoire combined with an average group size of four to six individuals does not seem to be in line with this theory, in accordance with findings on Eulemur rubriventer, which has a vocal repertoire of 14 vocal types and a group size of about three individuals [21]. Notably, both species also show a stable, socially monogamous organization [62,63], in agreement with the hypothesis stating that the diversity in communication signals may be favored by an egalitarian social structure or a stable social group [64]. These findings are also in agreement with the studies on the Asian colobines Pygathrix nemaeus [65] and Nasalis larvatus [66,67], which show a repertoire size smaller than or similar to that of indris despite an average group size that is sometimes considerably larger.

Conclusions
As hypothesized earlier, the vocal repertoire structure may be determined by both the species' environment and its social structure [68]. This could also be the case for indris, where the presence of loud and discrete calls, like alarm calls [27,68] and even the song, may have evolved to cope with a noisy environment and poor visual ranges, like those of dense rainforests, to reduce the misinterpretation of signals in long-distance and even inter-group communication. On the other hand, contact calls and, in general, vocalizations that may serve intra-group and short-range communication do not have to face such obstacles and may show a more graded structure.

Conflicts of Interest:
The authors declare no conflict of interest.