Fusion of Linear and Mel Frequency Cepstral Coefficients for Automatic Classification of Reptiles

Abstract: Bioacoustic research on reptile calls and vocalizations has been limited because reptiles are generally considered voiceless. However, several species of geckos, turtles, and crocodiles are able to produce simple and even complex vocalizations which are species-specific. This work presents a novel approach for the automatic taxonomic identification of reptiles through their bioacoustics by applying pattern recognition techniques. The sound signals are automatically segmented, extracting each call from the background noise. Then, the calls are parametrized using Linear and Mel Frequency Cepstral Coefficients (LFCC and MFCC) to serve as features in the classification stage. In this study, 27 reptile species have been successfully identified using two machine learning algorithms: K-Nearest Neighbors (kNN) and Support Vector Machine (SVM). Experimental results show average classification accuracies of 97.78% and 98.51%, respectively.


Introduction
The taxonomic class Reptilia is formed by turtles, crocodiles, snakes, lizards, and tuataras, some of which are able to vocalize [1], although they do not do so often. As a result, there are few studies in the literature about reptile acoustic communication, which was considered unimportant until recently. However, some geckos, crocodiles, and turtles are very active in producing vocalizations [2,3], although the social roles of these vocalizations are still not fully understood. Crocodiles are probably the most vocal reptiles, with a rich variety of hissing, distress, and threatening calls, owing to their close relation with birds; they are even capable of vocalizing in the egg before hatching [4]. Moreover, some species such as turtles, crocodiles, and alligators can emit sound both in air and underwater [5].
Reptiles emit vocalizations in a broad range of frequencies: they produce sounds mainly between 0.1 and 4 kHz, but some turtles, crocodiles, and lizards are able to generate calls above 20 kHz [5,6]. In addition, as a consequence of their behavior and small size, most reptiles can be very difficult to detect in the field using visual surveys [7], which can lead to an underestimation of species richness.
Bioacoustic technologies are an efficient way to sample populations in extended areas where visibility is limited [8], so they may be able to provide additional data for estimating reptile populations. Traditional bioacoustic monitoring methods rely on human observers who categorize acoustic patterns according to sound similarities. However, this procedure is slow, and it depends on the observer's ability to identify species, which leads to bias [9]. Hence, machine learning techniques are being applied in many research areas to design automatic intelligent classification systems, such as mosquito identification based on morphological features [10], carbon fiber fabric classification to minimize risks in engineering processes [11], automatic recognition of arrhythmias for the diagnosis of heart diseases [12], or this work, where bioacoustic signals of reptile species are used for taxonomic classification.
In recent decades, several techniques have been proposed to automate the acoustic classification of species through intelligent systems. For instance, Acevedo et al. [13] successfully classified three bird and nine frog species by characterizing their calls with 11 variables: minimum and maximum frequency, call length, maximum power, and the frequencies of the eight highest energy points in the call. Then, the results of three classification algorithms, Decision Tree (DT), Linear Discriminant Analysis (LDA), and Support Vector Machine (SVM), were compared. In their work, SVM achieved an identification rate of 94.95%, outperforming DT and LDA, but the sample calls were selected manually. Brandes [14] used a Hidden Markov Model (HMM) to recognize the vocalizations of nine bird, ten frog, and eight cricket species. This approach employed the peak frequencies and bandwidth from the spectrogram to characterize the sound samples, achieving high classification rates for each animal group individually, though it had difficulties coping with complex broadband calls. Another interesting approach can be found in Le-Qing's research [15], where 50 different insect sounds were classified with an accuracy of 96.17%. In that work, Mel Frequency Cepstral Coefficients (MFCCs) were employed as features, and a Probabilistic Neural Network (PNN) was applied for classification. However, the sounds were taken from noise-free sections of the recorded files. More recently, Henríquez et al. [16] recognized seven different species of bats using Gaussian Mixture Models (GMM), achieving a low average error of 1.8% with a combination of linear and non-linear parameters.
There is no doubt about the progress made in the field of bioacoustic identification to enable an efficient classification of species. However, a robust machine learning technique to recognize reptile calls has still not been found. Previous studies have focused on the analysis of spectro-temporal characteristics of reptile calls, but to the best of our knowledge, no research has used their acoustic signals for automatic inter-species classification. For this reason, this paper proposes a novel approach to the taxonomic identification of reptiles through their acoustic features. To achieve this goal, Linear Frequency Cepstral Coefficients (LFCCs) and MFCCs [17] have been used to parametrize the reptile acoustic signals, and they have been fused to obtain a robust characterization of the signal in the frequency domain. In addition, two widely used machine learning algorithms, K-Nearest Neighbors (kNN) [18] and Support Vector Machine (SVM) [19], have been utilized to verify the robustness of the proposed parameters. The approach has been validated on three public collections of reptile sounds selected by experts, which contain 27 different species with several types of calls. Therefore, despite the small corpus, this study may serve researchers as a first reference in the field of automatic acoustic recognition of reptile specimens.
The remainder of this paper is organized as follows. Section 2 presents the proposed technique, describing the audio signal segmentation and the feature extraction procedure. Two classification systems based on the kNN and SVM algorithms, particularized for acoustic recognition, are described in Section 3. The experimental methodology, the sound dataset, and the results obtained are shown in Section 4, where the features and classification algorithms are compared. Finally, the conclusions and future work are given in Section 5.

Proposed Method
The proposed method is based on the following phases. First, reptile acoustic signals are automatically segmented into syllables and labeled by species. Secondly, for each syllable, the cepstral feature parameters are computed and fused into a single feature vector per call. Afterwards, these vectors are employed in the classification stage to train and test the two pattern recognition algorithms used in this work. Figure 1 illustrates the proposed system.

Segmentation
The segmentation stage splits the recordings into as many syllables as possible to yield useful information for the taxonomic identification. The procedure is based on the Härmä segmentation algorithm [20]. It uses short-time analysis to divide the acoustic signal into a set of frequency- and amplitude-modulated pulses, where each pulse corresponds to one detected call. For this purpose, the spectrogram of the acoustic signal was calculated using the Short Time Fourier Transform (STFT) with a Hamming window of 5.8 milliseconds and a 33% overlap, values that were chosen heuristically. The spectrogram is represented by a matrix $S(f, t)$, where $f$ is the frequency and $t$ denotes the time. Then, the algorithm explores the matrix using the following strategy:
1. Find $t_n$ and $f_n$ such that $|S(f_n, t_n)| \ge |S(f, t)| \;\forall (f, t)$, placing the nth syllable at $t_n$. The amplitude of this point is calculated as Equation (1):
$$Y_n(0) = 20 \log_{10} |S(f_n, t_n)| \quad (1)$$
2. If $Y_n(0) < Y_0(0) - \beta$ dB, the segmentation process is stopped, as the signal amplitude is below the stopping criterion $\beta$. For reptile sounds, $\beta$ has been set to 25 dB.
3. From $t_n$, seek the highest peak of $|S(f, t)|$ for $t > t_n$ and $t < t_n$, until $Y_n(t) < Y_n(0) - \beta$ dB on both sides. Thus, the starting and ending times of the nth syllable are denoted as $t_n - t_s$ and $t_n + t_e$.
4. Save the amplitude trajectories as the nth syllable.
5. Delete the nth syllable from the matrix, $S(f, t_n - t_s, \dots, t_n + t_e) = 0$, and set $n = n + 1$.
6. Repeat from Step 1 until the end of the spectrogram.
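The strategy above can be sketched in a few lines of Python. This is a simplified illustration rather than the authors' Matlab implementation: it assumes the spectrogram is already available as a list of magnitude rows, it tracks only the per-frame peak magnitude instead of the full frequency trajectory, and it marks frames as used instead of zeroing the matrix. The function name `segment_syllables` and its arguments are hypothetical.

```python
import math

def segment_syllables(spec, beta_db=25.0):
    """Return inclusive (start_frame, end_frame) pairs for detected syllables.

    spec[f][t] holds the spectrogram magnitude |S(f, t)|.
    """
    n_frames = len(spec[0])
    # Peak magnitude of each frame in dB; the floor avoids log10(0).
    frame_db = [
        20.0 * math.log10(max(max(row[t] for row in spec), 1e-12))
        for t in range(n_frames)
    ]
    active = [True] * n_frames  # frames not yet assigned to a syllable
    first_peak_db = None
    syllables = []
    while any(active):
        # Step 1: strongest remaining frame.
        t_n = max((t for t in range(n_frames) if active[t]),
                  key=lambda t: frame_db[t])
        peak_db = frame_db[t_n]
        if first_peak_db is None:
            first_peak_db = peak_db
        # Step 2: stop once the peak falls beta dB below the first peak.
        if peak_db < first_peak_db - beta_db:
            break
        # Step 3: grow the syllable while frames stay within beta dB of its peak.
        start = end = t_n
        while (start > 0 and active[start - 1]
               and frame_db[start - 1] >= peak_db - beta_db):
            start -= 1
        while (end < n_frames - 1 and active[end + 1]
               and frame_db[end + 1] >= peak_db - beta_db):
            end += 1
        # Steps 4 and 5: store the syllable and remove its frames.
        syllables.append((start, end))
        for t in range(start, end + 1):
            active[t] = False
    return sorted(syllables)
```

With the 25 dB threshold used for reptile sounds, two pulses separated by near-silent frames come back as two separate frame ranges, while a very small threshold keeps only the frame of the strongest peak.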
Figure 2 shows four spectrograms belonging to several kinds of reptiles: Crotalus atrox (Diamondback rattlesnake), Gecko, Alligator mississippiensis (American alligator), and Chelonoidis nigra (Galapagos giant tortoise). The alligator spectrogram presents a more complex and richer vocalization, similar to those of birds, while the other species present a simpler sound production of hissing and groaning calls.

Feature Extraction
For each syllable, two cepstral features were extracted to characterize the acoustic signal. MFCCs and LFCCs have been broadly used in speech recognition with success [21], and they have also been applied in animal bioacoustic classification [22-25], due to their easy implementation and high performance. Reptiles mainly produce sounds at low frequencies within the human auditory range. For this reason, MFCCs have been adopted to get high resolution in the low frequency region. However, reptiles are capable of producing sound above 20 kHz, so LFCCs have also been computed to capture the information in high frequency ranges.
The coefficients are calculated using the STFT (a 25-millisecond Hamming window with an overlap of 50%), applying the Discrete Fourier Transform (DFT) over each frame of the signal. The resulting magnitude spectrum is warped by a bank of 40 triangular band-pass filters. For MFCCs, the filters are non-uniformly spaced to perform the mel scale transformation. Finally, the coefficients are retrieved by taking the lowest Discrete Cosine Transform (DCT) values from the log-magnitude filter outputs, $\log |Y_i|$. They are computed following Equation (2):
$$c_j = \sum_{i=1}^{B} \log |Y_i| \cos\left[ j \left( i - \frac{1}{2} \right) \frac{\pi}{B} \right] \quad (2)$$
where $j$ denotes the index of the cepstral coefficient, $B$ is the number of triangular filters, and $N$ denotes the number of cepstral coefficients to compute. LFCCs are calculated similarly, but using a linearly spaced triangular filter bank instead of the mel-scale filters. Furthermore, the number of coefficients, $N$, has been established by experimentation in order to reach the highest classification rate in the last stage. As a result, 18 coefficients have been taken for both features.
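The difference between the two filter banks, and the DCT step that turns filter outputs into cepstral coefficients, can be illustrated with a short sketch. Several choices here are assumptions not stated in the text: the common mel formula `2595 * log10(1 + f / 700)`, a filter bank spanning 0 Hz to the Nyquist frequency, coefficient indices starting at 0, and all function names.

```python
import math

def hz_to_mel(f):
    # Widely used mel-scale formula (an assumption; the paper does not give one).
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def filter_centres(n_filters, f_max, scale="mel"):
    """Centre frequencies in Hz of a triangular filter bank on [0, f_max]."""
    if scale == "mel":
        top = hz_to_mel(f_max)
        return [mel_to_hz(top * (i + 1) / (n_filters + 1))
                for i in range(n_filters)]
    # Linear spacing, as used for LFCCs.
    return [f_max * (i + 1) / (n_filters + 1) for i in range(n_filters)]

def cepstral_coefficients(filter_outputs, n_coeffs):
    """DCT of the log filter outputs, keeping the n_coeffs lowest terms."""
    n_filters = len(filter_outputs)
    logs = [math.log(max(y, 1e-12)) for y in filter_outputs]
    # The filter index i runs from 0 here, so (i + 0.5) plays the role of (i - 1/2).
    return [
        sum(logs[i] * math.cos(j * (i + 0.5) * math.pi / n_filters)
            for i in range(n_filters))
        for j in range(n_coeffs)
    ]
```

For 40 filters and a 22.05 kHz upper limit, far more mel centres than linear centres fall below 4 kHz, which is the resolution argument made above for using MFCCs on low-frequency reptile calls.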
In this work, the feature vectors have been fused, appending the characteristics horizontally as in Label (3), where each row represents a syllable from the segmentation stage. These features have been combined to hold information from both the higher and the lower frequency regions, obtaining a broad spectral representation of the reptile calls. Therefore, each row contains 36 coefficients, which are used as inputs to the classification stage.
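The horizontal fusion can be illustrated as a row-wise concatenation. The sketch below assumes one MFCC row and one LFCC row per syllable, in the same order; the function name `fuse_features` and the MFCC-first column order are illustrative choices.

```python
def fuse_features(mfcc_rows, lfcc_rows):
    """Append the LFCCs of each syllable to its MFCCs (one row per syllable)."""
    if len(mfcc_rows) != len(lfcc_rows):
        raise ValueError("both feature sets must describe the same syllables")
    return [list(m) + list(l) for m, l in zip(mfcc_rows, lfcc_rows)]
```

With 18 coefficients per feature type, every fused row holds 36 values.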

Classification System
For the classification stage, the performance of two machine learning algorithms (kNN and SVM) has been compared, both of which have been parametrized for the classification of acoustic signals.

K-Nearest Neighbor
This algorithm determines the classification of new observations based on the closest training samples in the feature space. It assigns the class by measuring the distances from the test sample to its k nearest training points. Then, a simple majority vote among these neighbors determines the class prediction. In this study, the number of neighbors has been set to $k = \sqrt{N}$, where $N$ is the number of feature coefficients per syllable.
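A minimal sketch of this classifier follows. The Euclidean distance is an assumption, since the text does not name the metric, and `knn_predict` is a hypothetical name. When `k` is omitted it defaults to the rounded square root of the feature length, which gives k = 6 for the 36 fused coefficients.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=None):
    """Majority vote among the k training samples closest to query."""
    if k is None:
        # k = sqrt(N), with N the number of coefficients per syllable.
        k = max(1, round(math.sqrt(len(query))))
    # Euclidean distance is an assumption; math.dist needs Python 3.8+.
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```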

Support Vector Machine
SVM has been used to some extent in bioacoustic species recognition with success [13,26]. It discriminates the data by seeking the optimal hyperplane that separates the training data into two classes. However, the reptile call features are not linearly separable, so a non-linear kernel function has been used to separate the features in a higher dimensional space. For the experiments, a Gaussian or Radial Basis Function (RBF) kernel, $K(x, x') = \exp(-\gamma \|x - x'\|^2)$, has been selected, where the parameter $\gamma$ was optimized using a grid approach. For multiclass classification, the "one-versus-one" strategy [27] has been implemented, which trains a binary SVM classifier for each pair of classes. Therefore, for $N$ different classes, $N(N - 1)/2$ binary classifiers are required to distinguish the samples, where $N$ represents the number of reptile species. The SVM decision function is defined as in (4):
$$f(x) = \operatorname{sign}\left( \sum_{i} \alpha_i y_i K(x_i, x) + b \right) \quad (4)$$
where $b$ is a numeric offset threshold, $\alpha_i$ are Lagrange multipliers, and $y_i$ is the class label of training sample $x_i$. The magnitude of $\alpha_i$ is bounded by the $C$ parameter, which imposes a penalty on misclassified samples ($0 < \alpha_i \le C$).
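As a small worked illustration of the two quantities discussed here, the sketch below evaluates the RBF kernel for a pair of vectors and enumerates the class pairs of a one-versus-one scheme. The function names are hypothetical, and the value 0.45 for gamma is used only as an example input.

```python
import math
from itertools import combinations

def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def one_vs_one_pairs(classes):
    """Every unordered pair of classes; one binary SVM is trained per pair."""
    return list(combinations(classes, 2))
```

For the 27 species in this study, `one_vs_one_pairs(range(27))` yields 27 × 26 / 2 = 351 pairs, so 351 binary classifiers are needed.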

Experimental Procedure
This section describes the datasets and the experimental methodology used in the experiments to evaluate the effectiveness of the proposed method.
To ensure independence between the training and testing sets in each experiment (at least 100 simulations per experiment), the syllables obtained automatically from the segmentation of each sound have been randomly shuffled and split 50/50 into two datasets, one for training and another for testing (k-fold cross-validation with k = 2), in order to obtain statistically meaningful results.
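The repeated random split can be sketched as follows. The helper names, the fixed seed, and the `evaluate(train, test)` callback (which would train a classifier and return its accuracy) are assumptions for illustration; whether the split was stratified by species is not stated in the text, so this sketch does not stratify.

```python
import random

def random_half_split(samples, rng):
    """Shuffle the samples and split them 50/50 into (train, test)."""
    order = list(range(len(samples)))
    rng.shuffle(order)
    half = len(order) // 2
    train = [samples[i] for i in order[:half]]
    test = [samples[i] for i in order[half:]]
    return train, test

def repeated_evaluation(samples, evaluate, repetitions=100, seed=0):
    """Mean and standard deviation of evaluate(train, test) over random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repetitions):
        train, test = random_half_split(samples, rng)
        scores.append(evaluate(train, test))
    mean = sum(scores) / len(scores)
    std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
    return mean, std
```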
Furthermore, accuracy has been calculated following Equation (5) for each class and averaging the results:
$$\text{Accuracy}\,(\%) = \frac{\text{syllables correctly identified}}{\text{total number of syllables}} \times 100 \quad (5)$$
The F-Measure [28] has also been calculated as $2PR/(P + R)$, where $P$ (precision) is the number of correct positive results divided by the number of all positive results returned, and $R$ (recall) is the number of correct positive results divided by the number of positive results that should have been returned. The acoustic classification system was implemented in Matlab, where the SVM implementation was based on the libsvm library [29], applying C-Support Vector Classification (C-SVC) [30].
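The precision, recall, and F-Measure definitions can be made concrete with a per-class computation. This is an illustrative sketch; the function name and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
def class_metrics(y_true, y_pred, label):
    """Precision, recall, and F-Measure of one class."""
    true_positives = sum(t == label and p == label
                         for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # all positive results returned
    actual = sum(t == label for t in y_true)     # positives that should be returned
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

For example, if a class has three true syllables, two of which are recognized, and one syllable of another class is wrongly assigned to it, then P = R = 2/3 and the F-Measure is also 2/3.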
In addition, a non-dedicated standard laptop with an Intel Core i7-4510 2.0 GHz CPU and 16 GB RAM under Windows 8.1 operating system was used to carry out the experiments.
As indicated in Section 2.2, the number of cepstral features was obtained by experimentation. They were selected by applying a wrapper method, varying the number of coefficients, N, from 6 to 25 for each feature until there was no improvement in prediction.
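One possible reading of this wrapper procedure is a forward sweep that stops at the first value of N that fails to improve the score. The sketch below follows that reading, which is an assumption; `evaluate(n)` is a hypothetical callback that would extract n coefficients, train the classifier, and return its accuracy.

```python
def select_num_coefficients(evaluate, n_min=6, n_max=25):
    """Increase the number of coefficients until the score stops improving."""
    best_n, best_score = n_min, evaluate(n_min)
    for n in range(n_min + 1, n_max + 1):
        score = evaluate(n)
        if score <= best_score:
            break  # no improvement in prediction: keep the previous value
        best_n, best_score = n, score
    return best_n, best_score
```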
As for the SVM parameters, to find the optimum values of the penalty parameter of the error term ($C$) and the kernel gamma ($\gamma$) parameter, a grid search was utilized with exponentially growing sequences of the parameters ($C = 2^{-2}, 2^{-1}, \dots, 2^{10}$; $\gamma = 2^{-12}, 2^{-11}, \dots, 2^{2}$), employing cross-validation. Finally, a finer grid search was conducted, establishing a gamma value of 0.45 and a penalty term of 30 for all experiments.
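The coarse grid described above can be generated and searched as in the following sketch. The `score(c, gamma)` callback is hypothetical and stands for the cross-validated accuracy of an SVM trained with those parameters; the names `exponent_grid`, `C_VALUES`, `GAMMA_VALUES`, and `grid_search` are illustrative.

```python
from itertools import product

def exponent_grid(lowest, highest):
    """Powers of two from 2**lowest to 2**highest, inclusive."""
    return [2.0 ** e for e in range(lowest, highest + 1)]

C_VALUES = exponent_grid(-2, 10)      # 13 candidate penalty values
GAMMA_VALUES = exponent_grid(-12, 2)  # 15 candidate kernel widths

def grid_search(score, c_values=C_VALUES, gamma_values=GAMMA_VALUES):
    """Return the (C, gamma) pair with the highest score."""
    return max(product(c_values, gamma_values), key=lambda cg: score(*cg))
```

The coarse grid therefore covers 13 × 15 = 195 parameter pairs; a finer search around the best pair would then follow, as described above.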
Initially, the features have been analyzed individually to determine their effectiveness. However, different species of reptiles can produce sounds at frequencies anywhere in the spectrum. Therefore, to capture the useful information across the whole spectrum, the two types of cepstral coefficients have been fused to better represent the acoustic content of the sounds.

Sound Dataset
The number of reptile sound repositories is quite limited. As a consequence, the dataset has been constructed using three internet sound collections. The main source of audio recordings has been the Natural Museum of Berlin [31], which contains 120,000 audio recordings of diverse species. A third of them were recorded in controlled conditions, employing animals in captivity, and the rest in natural habitats with background noise. In addition, two online collections of reptiles from California have also been used: California Herps [32] and the California Tortoise Club [33]. Hence, the final dataset is composed of 1,895 samples, which correspond to 27 different reptile species. Table 1 shows the list of species employed in this work, indicating the number of segmented syllables extracted from each species and their family group. All files have been sampled at 44.1 kHz.

Results and Discussion
In order to validate the proposed data fusion, the features have been analyzed individually to compare their performance. At the same time, the acoustic features have been combined with the classification algorithms to seek the best model. The algorithms have been run 100 times to obtain significant results. Additionally, the dataset has been randomly shuffled in each repetition, dividing the data 50/50 for training and testing purposes.
In Table 2, the experimental results show that MFCCs are more suitable for the identification of reptile calls than LFCCs. Mel features present more resolution at the lowest frequencies, emphasizing the spectrum regions where most reptile acoustic energy occurs. In fact, most of the sounds produced by reptiles are in the range 0.1 to 4 kHz. However, some reptiles, mainly lizards, can generate harmonic components at high frequencies, even into the ultrasound range (>20 kHz). At these frequencies, MFCCs hold insufficient information because the area under the triangular filters used in the mel filter bank analysis increases at higher frequencies. Therefore, LFCCs are more suitable to model these reptile calls, since no frequency warping is applied. Thus, LFCC surpasses MFCC in some experiments, for instance in class 19 (Kinixys belliana or Hinge-back tortoise), where the mating calls contain high levels of low frequency noise. It can also be observed that Alligator sinensis (class 15) was poorly classified in all experiments, as this class was created by appending several audio recordings because each of them contains only one or two sample calls. Therefore, it presents diverse types of calls, which hinders the classification process. The distress calls emitted by the alligator present a complex harmonic pattern over a wide bandwidth (see Figure 2), occasionally extending above 15 kHz. Hence, the linear cepstral coefficients show a superior performance in both classifiers on class 15.
Crotalus durissus (South American rattlesnake) also achieved low recognition rates, caused by its distinctive rattle noise, which overlaps with the hissing call in several syllables. On the other hand, nine species reached a classification accuracy of 100% regardless of the features used, because the spectrum distribution of those reptile species is clearly different from that of the others.
Regarding the classifiers, it is observed that SVM performs slightly better than kNN. This result is because the SVM approach is able to separate the classes more efficiently using the Gaussian kernel. However, as a consequence of the small corpus, the difference is not significant. In fact, for MFCC features, kNN outperforms SVM by 0.12%. Nevertheless, the difference is expected to increase when new species are added to the corpus.
The feature fusion technique (MFCC/LFCC) exhibits high classification results in both algorithms, outperforming each of the individual features. In most cases, the resultant accuracy per class is equal to or greater than that achieved by single features. This confirms that the method provides a further characterization of the reptile calls by appending information from the high and low frequency regions, which leads to a higher classification accuracy. Furthermore, 13 classes were identified with a success rate of 100% when this approach was applied.

Table 2 (fragment; the column headings and the class label of the first row were not preserved):
-   100.00% ± 0.00  100.00% ± 0.00  100.00% ± 0.00  100.00% ± 0.00  99.33% ± 0.08   100.00% ± 0.00
9   100.00% ± 0.00  100.00% ± 0.00  100.00% ± 0.00  100.00% ± 0.00  100.00% ± 0.00  100.00% ± 0.00
10  100.00% ± 0.00  100.00% ± 0.00  100.00% ± 0.00  100.00% ± 0.00  100.00% ± 0.00  100.00% ± 0.00
11  100.00% ± 0.00  100.00% ± 0.00  100.00% ± 0.00  100.00% ± 0.00  100.00% ± 0.00  100.00% ± 0.00
12  100.00% ± 0.00  100.00% ± 0.00  100.00% ± 0.00  100.00% ± 0.00  100.00% ± 0.00  100.00% ± 0.00

In the second experiment, the training samples were reduced from 50% to 5% to validate the robustness of the method using the best model found in the first experiment (MFCC/LFCC fusion + SVM). Table 3 shows that the fusion approach is able to deal with a low number of training samples, keeping accuracy and F-Measure values above 90% in almost all cases. This indicates that the fusion of both cepstral coefficients is effective for modeling the discriminant information in the reptile calls. Moreover, the reduction in training set size saves time in calculating the support vectors. However, when the training share reaches 5%, the system clearly declines in effectiveness because most of the reptile species are characterized by only one syllable; thus, SVM has serious difficulties finding discriminant information to identify the classes. Nevertheless, this approach is able to maintain the classification results above 85%, confirming the robustness of the feature fusion method.

Conclusions and Future Work
Automated methods to detect and identify species are particularly useful for biodiversity studies and conservation purposes. In this paper, a novel automatic method for the bioacoustic recognition of reptile species by a fusion of frequency cepstral features has been presented. Reptile acoustic characteristics have been analyzed to seek the most discriminant features to parametrize their acoustic signals. It has been concluded that MFCCs are able to represent the reptile calls efficiently because their acoustic signals are emitted predominantly at low frequencies. However, some species can also produce sounds at high frequencies; hence, LFCCs have also been utilized to hold information regarding that part of the spectrum. The experimental results have demonstrated that the fusion of both features allows a broad characterization of the signal, increasing the classification rate. The method has been validated on 27 different reptile species, achieving an average accuracy of 98.52% ± 3.26. In addition, the proposed solution has been tested under low training sample conditions, showing the strength of the technique.
Traditional reptile surveys rely on visual searching, which is costly and time-consuming. Therefore, this approach can lead to the development of new remote monitoring systems for reptile research. In addition, the authors are not aware of other studies that have assessed the use of reptile acoustic signals for inter-species recognition. However, despite the promising results of this first research, it is necessary to increase the corpus and extend the solution to the entire animal group. Furthermore, it would be useful to enhance the approach by recognizing individuals within the same species. Finally, this approach could be applied to classify animals with similar sound production mechanisms (such as frogs or birds) by adjusting the system parameters.

Figure 2. Example of four reptile call spectrograms.

Table 3. Classifier performance by training set size.