Large-Scale Whale-Call Classification by Transfer Learning on Multi-Scale Waveforms and Time-Frequency Features

Whale vocal calls contain valuable information and abundant characteristics that are important for the classification of whale sub-populations and related biological research. In this study, an effective data-driven approach based on pre-trained Convolutional Neural Networks (CNNs) using multi-scale waveforms and time-frequency feature representations is developed to classify whale calls from a large open-source dataset recorded by sensors carried by whales. Specifically, the classification is carried out through transfer learning, using state-of-the-art CNN models pre-trained in the field of computer vision. 1D raw waveforms and 2D log-mel features of the whale-call data are respectively used as the inputs of the CNN models. For raw waveform input, windows are applied to capture multiple sketches of a whale-call clip at different time scales, and the features from the different sketches are stacked for classification. When using the log-mel features, the delta and delta-delta features are also calculated to produce a 3-channel feature representation for analysis. In training, a 4-fold cross-validation technique is employed to reduce the overfitting effect, while the Mix-up technique is applied for data augmentation in order to further improve the system performance. The results show that the proposed method improves the accuracy by more than 20 percentage points for the classification into 16 whale pods compared with the baseline method, which uses groups of 2D shape descriptors of spectrograms and Fisher discriminant scores on the same dataset. Moreover, classifications based on log-mel features have higher accuracies than those based directly on raw waveforms. A phylogeny graph is also produced to illustrate the relationships among the whale sub-populations.


Introduction
Acoustic methods are an established technique to monitor marine mammal populations and their behaviors. The automated detection and classification of marine mammal vocalizations is a central aim of these methods. Whales produce a series of whistles and other complex sounds to survey their surroundings, hunt for food, and communicate with each other. Several types of whales, such as killer whales (Orcinus orca) [1,2] and pilot whales (Globicephala spp.) [3], are gregarious and live within socially stable family units known as 'pods'. Within a pod, whales share a unique repertoire (also known as a dialect) of stereotyped calls, which are composed of a complex pattern of pulsed and tonal elements [4]. Classification of killer whale and pilot whale calls is of great importance not only for a better understanding of whale societies, but also for promoting investigative research on large underwater acoustic datasets.
With the continuous development of devices such as hydrophones deployed from ships, or digital acoustic recording tags (DTAGs) placed on marine mammals, large datasets of whale sound samples are increasingly being acquired. Often, an expert must analyze these large, voluminous datasets. Such manual analysis is a very labor-intensive task that would greatly benefit from automation. Different approaches for analyzing large audio datasets are being developed. Most methods are feature-based classifiers, which generally first extract or search for deterministic features of audio data in the time or frequency domain and then apply classification algorithms. Characterization methods based on short-time overlapping segments with the short-time Fourier and the wavelet packet transforms have been proposed to classify blue whale calls [5]. A large acoustic dataset collected by Directional Autonomous Seafloor Acoustic Recorders (DASARs) has been used to test the performance of contour tracing methods and image segmentation techniques in detecting and classifying bowhead whale calls [6]. More recently, for the classification of humpback whale social calls, PCA-based and connected-component-based methods have been applied to derive features from the relative power in the frequency bins of spectrograms, with a supervised Hidden Markov Model (HMM) algorithm then used as a classifier to investigate the classification feasibility [7]. A generalized automated detection and classification system (DCS) was developed to efficiently and accurately identify low-frequency baleen whale calls in order to tackle the large volume of acoustic data and reduce the laborious task [8].
In 2014, Shamir et al. [9] proposed an automatic method for analyzing large acoustic datasets from the Whale FM project [10] and studied the differences between sounds of different sub-populations of whales. In their study, groups of 2D image-like features of whale calls, which include FFT features, wavelet features, edge features, and so on, were extracted by using the Wndchrm toolbox [11] for biological image analysis and ranked with the Fisher discriminant scores algorithm. The significant features were then used to classify or evaluate the similarity between the different populations of samples without expert-based auditing. Although this work has already made progress in the unsupervised classification and similarity analysis of large acoustic datasets of whale calls, it still relies heavily on the effectiveness of different polynomial decomposition techniques and the Fisher scores algorithm.
Nowadays, as a class of highly non-linear machine learning models, Convolutional Neural Networks (CNNs) are becoming very popular, having achieved state-of-the-art results in image recognition [12] and other fields. In particular, a CNN combines hierarchical feature extraction and classification, acting as an automated feature extractor and classifier at the same time. In the field of acoustic signal processing, CNN-based models have also been adopted in speech recognition systems [13] for large-scale speech datasets, showing a great improvement in performance [14] over more traditional classifiers. More recently, some research [15,16] has demonstrated that a basic CNN can generally outperform existing methods for environmental sound classification, provided sufficient data. Moreover, the applicability of basic CNN models is also being explored for the bio-acoustic task of whale call detection, for example with respect to North Atlantic right whale calls [17] and humpback whale calls [18]. The recent successful applications of CNN-based models to time series classification have motivated studies aiming for better input representations of audio signals in order to train CNNs more efficiently. Various time-frequency representations, such as spectrograms, can typically offer a rich representation of the temporal and spectral structure of the original audio. A comparative study of commonly used time-frequency representations was conducted to evaluate their impact on the CNN classification performance of audio data [19]. To obtain superior feature representations for time series classification, a multi-scale time-frequency analysis framework was also proposed to automatically extract features at different scales and frequencies [20].
To thoroughly explore the potential of CNNs for the classification of optical remote sensing imagery, a detailed investigation of the performance of seven well-known CNN architectures is presented in [21,22], where two types of training, i.e., fine-tuning of a pre-trained CNN and training of a CNN from scratch, were each applied to obtain results.
This study aims to apply CNNs to efficiently extract the informative features from large datasets of whale calls for classification and similarity analysis. The four main contributions of our work can be summarized as follows: (1) To overcome the lack of sufficient training data and make use of the excellent CNN architectures well trained on a large set of labeled data in computer vision [23,24], transfer learning was implemented by fine-tuning the trained CNN models, i.e., ResNext101 and Xception [25], on whale-call data to perform the classification task. (2) Both the raw waveforms from the multi-scale sketches of the whale-call audio and the log-mel time-frequency representations were explored in order to find a suitable feature representation as the input of the CNN. (3) To ensure the convergence of the deep neural networks, a Mix-up-based data augmentation technique [26] was also employed, which significantly increases the amount of training data available for fully fine-tuning the CNN models. (4) In addition, the similarity analysis was carried out based on the 'likelihood' output of the Softmax layer, while the whale phylogeny was drawn to further illustrate the detailed relationship among different whale sub-populations. The code is available on GitHub (https://github.com/Blank-Wang/whale-call-classification).

Methodology
As illustrated in Figure 1, the proposed method consists of two main branches that use two different types of inputs for the CNN models to perform the classification, i.e., 1D raw waveforms and 2D time-frequency features. The CNN models are pre-trained on 'ImageNet' [12], a large dataset of labeled image samples, and are fine-tuned on a whale-call training dataset through 4-fold cross-validation. The performance of the system is finally evaluated on a whale-call validation dataset.

Classification on Raw Waveforms
Extracting features directly from raw whale-call time series by using neural networks is an end-to-end way to do the classification, which avoids having to pre-process the raw audio signals into other forms of inputs, such as spectrograms, for classifiers. It is a natural way to apply deep learning techniques directly to raw whale-call waveforms. Because different kinds of whale sounds may require feature representations at different time scales, we use windows to randomly capture sketches of the raw audio time series at different time scales. As shown in Figure 1, the proposed architecture consists of a set of parallel 1D convolutional layers with different filter sizes and strides to learn feature representations with multiple temporal resolutions. In this setup, high-frequency features can be learned by convolutional filters with a short size and a small stride, while low-frequency features can be learned by convolutional filters with a long size and a large stride.
In the experiment, three branches of 1D convolutional layers are employed to obtain feature maps with different temporal resolutions, where all the branches have the same number of filters (220) but different filter parameters (Branch I with filter size 11 and stride 1, Branch II with filter size 51 and stride 5, Branch III with filter size 101 and stride 10). Following the branch convolutional layer, another 1D time-domain convolutional layer with filter size 3 and stride 1 is employed to create invariance to phase shifts. In this case, the feature maps of the three branches have the same size along the filter axis but different sizes along the time axis. Since the pre-trained CNN classifiers are transferred from the field of image classification, they expect image-type inputs with 3 channels corresponding to the red, green and blue color maps. To reduce modifications to the well-trained CNN models, we apply a max-pooling layer following each convolutional branch to pool the feature maps to the same size along the time axis and then stack these feature maps into a 3-channel image-like input for the classifiers.
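The shape bookkeeping described above can be illustrated with a minimal NumPy sketch. It does not reproduce the learned convolutions: random arrays stand in for the branch outputs, and only the output-length arithmetic, the pooling to a common time length and the 3-channel stacking are shown. The unpadded convolution, the equal-bin pooling scheme, the pairing of window lengths with branches, and the helper names `conv1d_out_len` and `max_pool_to_length` are all assumptions made for illustration.

```python
import numpy as np

SR = 22050          # sampling rate after resampling (from the text)
N_FILTERS = 220     # number of filters per branch (from the text)

def conv1d_out_len(n_in, kernel, stride):
    """Output length of an unpadded 1D convolution (no padding is an assumption)."""
    return (n_in - kernel) // stride + 1

def max_pool_to_length(fmap, out_len):
    """Max-pool a (filters, time) map along time into out_len contiguous bins."""
    n_time = fmap.shape[1]
    assert n_time >= out_len, "time axis must be at least as long as the target"
    edges = (np.arange(out_len + 1) * n_time) // out_len  # integer bin edges
    return np.stack([fmap[:, edges[i]:edges[i + 1]].max(axis=1)
                     for i in range(out_len)], axis=1)

# (window length in ms, filter size, stride); the window-to-branch pairing is assumed
branches = [(150, 11, 1), (300, 51, 5), (1500, 101, 10)]

rng = np.random.default_rng(0)
channels = []
for window_ms, kernel, stride in branches:
    n_samples = window_ms * SR // 1000
    # branch convolution followed by the size-3, stride-1 convolution
    n_time = conv1d_out_len(conv1d_out_len(n_samples, kernel, stride), 3, 1)
    fmap = rng.standard_normal((N_FILTERS, n_time))  # stand-in for learned features
    channels.append(max_pool_to_length(fmap, 220))

image_like = np.stack(channels, axis=-1)  # 3-channel image-like input
print(image_like.shape)
```

Under these assumptions, each branch yields a map with 220 rows and a different number of time steps, and pooling brings all three to 220 columns before stacking.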

Classification on Log-Mel Features
Apart from extracting feature maps from the raw waveforms using 1D convolutional layers, there are several more popular transforms to convert the raw audio signals into feature representations for classifiers. The most common choice in audio signal processing is a 2D time-frequency representation, for example one based on the short-time Fourier transform, usually with a log-like frequency scale such as the mel scale. In this study, log-mel filter bank features (log-mel) of the whale-call audio are employed as feature representations to be fed into the CNN classifiers. Based on the results of comparative studies [19,27], log-mel features are considered to be the best time-frequency feature representations of audio signals for deep-learning methods. As demonstrated in Figure 1, we compute the log-mel features (with 128 filters, a 2800-point sliding window and a 220-point hop size) and the corresponding delta and delta-delta features from the raw audio signals at the same time. To produce a 3-channel image-type input for the CNN classifiers, these three feature maps are stacked along a channel axis. Because the whale-call clips normally have different time lengths, we apply a random selection strategy that stochastically captures a fixed-length segment along the time axis (150 frames) from the 3-channel feature maps to generate the inputs for the classifiers. It should be noted that the selected segments inherit the clip labels for analysis, and the classification result for each audio clip is obtained through majority voting over all the selected segments.
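The stacking, random segment selection and majority voting described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: a random array stands in for a real log-mel map, `np.gradient` is used as a simple stand-in for the delta estimator, zero-padding of clips shorter than 150 frames is assumed, and the segment-level predictions in the last line are hypothetical.

```python
import numpy as np
from collections import Counter

N_MELS, SEG_FRAMES = 128, 150   # sizes given in the text

def stack_with_deltas(logmel):
    """Stack log-mel, delta and delta-delta maps into (mels, frames, 3).

    np.gradient is a simple stand-in for a delta estimator (an assumption).
    """
    delta = np.gradient(logmel, axis=1)
    delta2 = np.gradient(delta, axis=1)
    return np.stack([logmel, delta, delta2], axis=-1)

def random_segment(features, seg_frames, rng):
    """Crop a random fixed-length segment along the time axis."""
    n_frames = features.shape[1]
    if n_frames < seg_frames:  # zero-padding short clips is an assumption
        pad = seg_frames - n_frames
        features = np.pad(features, ((0, 0), (0, pad), (0, 0)))
        n_frames = seg_frames
    start = rng.integers(0, n_frames - seg_frames + 1)
    return features[:, start:start + seg_frames, :]

def majority_vote(segment_predictions):
    """Clip-level label is the most frequent segment-level label."""
    return Counter(segment_predictions).most_common(1)[0][0]

rng = np.random.default_rng(0)
logmel = rng.standard_normal((N_MELS, 400))   # stand-in for one clip's log-mel map
features = stack_with_deltas(logmel)
segments = [random_segment(features, SEG_FRAMES, rng) for _ in range(5)]
clip_label = majority_vote([3, 3, 7, 3, 7])   # hypothetical segment predictions
```

Each selected segment has the (128, 150, 3) shape that is fed to the classifier, and the clip inherits the label chosen by most of its segments.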

Pre-Trained CNN Models
In this study, the classification task is finally solved by using state-of-the-art CNN models, which have excellent performance on the identification of large-scale image data in the field of computer vision [12]. Because training large CNN models from scratch entails a high computational cost and a huge number of samples, one can normally reuse the pre-trained layers and weights from a solved problem to solve a target problem, provided these layers are fine-tuned. In the training of deep neural networks, the low-level semantic features, which involve local texture information, color information and so on, are learned in the front layers. These low-level semantic features are largely shared across classification tasks such as computer vision and audio tagging. The main difference lies in the high-level semantic features, which are generated in the top layers of the neural networks. Thus, it is reasonable to employ deep CNN models pre-trained on images for audio classification by modifying the top layers. This approach is called transfer learning, and consists of using the knowledge acquired in solving a source problem to facilitate the resolution of a target problem. It generally allows for faster training and smaller classification errors [28]. The possibility of doing transfer learning on deep neural networks was investigated in [28,29].
In our experiments, we apply the ResNext101 [30] and Xception [25] models pre-trained on 'ImageNet' [12] as the classifiers, since these two models performed better than others in our previous study [31]. The ResNext architecture is an extension of the deep residual network which replaces the standard residual block with one that leverages the 'split-transform-merge' strategy used in the Google Inception models [30]. Xception is constructed as a linear stack of depth-wise separable convolution layers with linear residual connections, and is also similar to Inception [21]. The top layers in the original model architectures need to be modified to adapt to the whale-call classification problem. As shown in Figure 1, the CNN model is followed by a modified fully connected (FC) layer and a Softmax layer. The classification result is finally obtained from the 'likelihood' that the Softmax layer outputs for each class.

Mix-Up Data Augmentation
To expand the training data size in order to fully fine-tune the pre-trained CNN models, a Mix-up technique is applied in the experiment. It has been reported in many tasks [32] that the Mix-up method can improve the performance of CNN models and help to reduce overfitting, i.e., the generalization gap between the training and testing data. Specifically, the Mix-up technique produces synthetic samples, including new data and labels, by interpolation, i.e., as a weighted sum of original data and labels:

$$x_s = \lambda x_i + (1 - \lambda) x_j, \qquad y_s = \lambda y_i + (1 - \lambda) y_j,$$

where $(x_s, y_s)$ is the new sample with data $x_s$ and label $y_s$, the random mixing weight is $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ for $\alpha \in (0, \infty)$ with $\lambda \in [0, 1]$, and $(x_i, y_i)$ and $(x_j, y_j)$ are pairs of original samples selected from the training inputs for the classifiers. Mix-up can be viewed as a way to generate new data points among the original training data points scattered in a high-dimensional space, which reduces the relative distance between these scattered points. In our experiment, we apply Mix-up with an alpha value of 0.2 on the input samples.
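A minimal NumPy sketch of this interpolation is given below, assuming one-hot labels, a single mixing weight drawn per batch, and partner samples obtained by shuffling the batch. These batching choices, the small stand-in batch and the function name `mixup_batch` are assumptions for illustration; only the weighted sum and the value alpha = 0.2 come from the text.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha, rng):
    """Mix each sample with a randomly chosen partner from the same batch.

    Drawing one lambda per batch and pairing by permutation are assumptions.
    """
    lam = rng.beta(alpha, alpha)          # mixing weight in [0, 1]
    perm = rng.permutation(len(x))        # index of each sample's partner
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam

rng = np.random.default_rng(0)
n_classes = 16
x = rng.standard_normal((8, 128, 150, 3))            # stand-in batch of inputs
y = np.eye(n_classes)[rng.integers(0, n_classes, 8)]  # one-hot pod labels
x_mix, y_mix, lam = mixup_batch(x, y, alpha=0.2, rng=rng)
```

Because the mixed labels are no longer one-hot, the training loss is computed against soft targets, which is consistent with the low training accuracies reported later for Mix-up.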

Similarity Analysis and the Phylogeny
The whale sub-populations to be classified have strong biological relationships and potential correlations with each other. Thus, it is meaningful to evaluate and quantify these relationships in terms of similarity analysis and phylogeny construction. The likelihood output of the Softmax layer indicates the similarities among different whale pods, and can be used as a measure to quantify the relationships between classes. After the fine-tuning of the pre-trained CNN models on the whale-call training dataset, the validation dataset is used to evaluate the performance of the proposed approach. Given 16 whale pods to be classified, the Softmax layer outputs a one-dimensional likelihood vector of size 1 × 16 for each validation sample. All the likelihood vectors of one pod are averaged to obtain a general likelihood vector, which indicates the similarities between that pod and all pods. In this way, a 16 × 16 likelihood (similarity) matrix, working as a kind of 'correlation matrix', can be obtained by combining the 16 general likelihood vectors of the pods, which provides a good indication of the relationships among whale-call pods. The phylogeny can also be constructed from the general likelihood vectors to further investigate the biological relationship of the whale sub-populations. With the aid of the Phylip software [33], the phylogenic tree can be easily generated using the criterion of squared Euclidean distance.
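The averaging step and the distance criterion can be sketched in NumPy as follows. Random logits stand in for the outputs of a fine-tuned model, and the function names are hypothetical; the sketch only shows how per-pod mean Softmax vectors form a 16 × 16 matrix and how pairwise squared Euclidean distances between them can be computed before being passed to a tree-building tool.

```python
import numpy as np

def similarity_matrix(probs, labels, n_classes):
    """Row k is the mean Softmax vector over validation samples of pod k."""
    return np.stack([probs[labels == k].mean(axis=0) for k in range(n_classes)])

def squared_euclidean(vectors):
    """Pairwise squared Euclidean distances between row vectors."""
    diff = vectors[:, None, :] - vectors[None, :, :]
    return (diff ** 2).sum(axis=-1)

rng = np.random.default_rng(0)
n_classes, n_samples = 16, 320
labels = np.arange(n_samples) % n_classes            # every pod is represented
logits = rng.standard_normal((n_samples, n_classes))  # stand-in model outputs
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # Softmax

sim = similarity_matrix(probs, labels, n_classes)     # 16 x 16 likelihood matrix
dist = squared_euclidean(sim)                         # 16 x 16 distance matrix
```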

Data Preparation
The experimental dataset comes from the Whale FM website [10], which is a citizen science project from Zooniverse and Scientific American. All the data are collected by DTAGs [34], which are normally attached to individual whales to record the whale calls, as well as calls from other animals nearby. The tags are also equipped with motion sensors that make it possible to follow the underwater movement of the whales. The dataset consists of about 10,000 audio files in WAV format ranging between 1 s and 8 s in length. There are 16 separate recording events based on sensors carried by 7 pilot whales and 9 killer whales at 4 locations close to the coasts of Norway, Iceland, and the Bahamas, respectively. A detailed description of the dataset is shown in Table 1. Since the sampling rates of the audio samples differ, all the audio samples are first resampled to 22,050 Hz. For the classification of raw waveforms, windows of three different lengths (1.5 s, 0.3 s and 0.15 s) are used to randomly capture multiple sketches from a whale-call clip for analysis. Through the convolutional branches, 3-channel feature maps with a size of (220, 220, 3) are produced as the inputs for the CNN models. For the classification of log-mel features, the log-mel, delta and delta-delta maps are calculated with the Librosa toolbox in Python. After the segment selection, 3-channel feature maps with a size of (128, 150, 3) are generated as inputs to the CNN models. All the input data for the CNN models are produced and stored as pickle files in advance in order to accelerate the training process.

Model Training
To evaluate the performance of the proposed approach, the dataset is divided into a development dataset and a validation dataset. Since there are different numbers of audio files in the pods, we first divide all the audio in each pod equally into 5 groups and select 1 group from each pod to sum them up to make the validation dataset (1862 files in total). All the remaining audio in each pod is gathered into the development dataset (7457 files in total). A 4-fold cross-validation strategy [35] is performed on the development dataset to fine-tune the CNN models. This means that each time, 5960 samples are selected from the development dataset (the same proportion from all pods) to train the CNN models, while 1497 samples are used for testing.
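A pod-wise split of this kind can be sketched as follows. The pod sizes are hypothetical toy values, the shuffling within each pod is an assumption, and the function names are illustrative; the sketch holds out one of five groups per pod for validation and then partitions the remainder into four stratified folds. It does not attempt to reproduce the file counts reported above.

```python
import numpy as np

def holdout_split(labels, n_groups, rng):
    """Per pod: shuffle, split into n_groups, keep the first group for validation."""
    dev_idx, val_idx = [], []
    for pod in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == pod))
        groups = np.array_split(idx, n_groups)
        val_idx.extend(groups[0])
        for group in groups[1:]:
            dev_idx.extend(group)
    return np.array(sorted(dev_idx)), np.array(sorted(val_idx))

def stratified_folds(labels, indices, n_folds, rng):
    """Partition indices into n_folds, each with the same proportion of every pod."""
    folds = [[] for _ in range(n_folds)]
    for pod in np.unique(labels[indices]):
        idx = rng.permutation(indices[labels[indices] == pod])
        for f, part in enumerate(np.array_split(idx, n_folds)):
            folds[f].extend(part)
    return [np.array(sorted(fold)) for fold in folds]

rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], [50, 20, 10])   # hypothetical pod sizes
dev_idx, val_idx = holdout_split(labels, n_groups=5, rng=rng)
folds = stratified_folds(labels, dev_idx, n_folds=4, rng=rng)

# In each round, one fold is used for testing and the other three for training.
test_idx = folds[0]
train_idx = np.concatenate(folds[1:])
```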
In the network training, stochastic gradient descent with momentum is used as the optimizer to minimize the cross-entropy loss. The learning rate follows a multi-step schedule, i.e., 0.01 for the first 35 epochs and then multiplied by 0.1 every 20 epochs. The model is trained for 60 epochs in total before stopping, which ensures convergence. A batch size of 32 is applied, while rectified linear units (ReLUs) are used as nonlinear activation functions. The models are implemented in PyTorch using GPU acceleration on hardware consisting of a Xeon E5 2683V3 CPU and 2 GTX 1080Ti GPU cards driven by CUDA with cuDNN. The classification accuracy of the validation audio clips, instead of that of the CNN input segments, is employed to evaluate the performance of the proposed method. The clip-level accuracy is obtained from the segment-level predictions through the majority-voting strategy. Figure 2 illustrates the evolution of accuracies and losses in one fold of training based on the pre-trained ResNext101 model with both log-mel and wave inputs. It can be observed that the model with log-mel input generally has lower losses and higher accuracies than the model with wave input. In addition, the training accuracies are markedly lower because the Mix-up technique has changed the labels of the training data.
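One reading of the multi-step schedule is sketched below in plain Python: the rate is held at 0.01 for epochs 0 to 34, dropped by a factor of 0.1 at epoch 35, and dropped again 20 epochs later. The zero-based epoch indexing and the exact placement of the second drop at epoch 55 are assumptions, and the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=0.01, first_drop=35, step=20, gamma=0.1):
    """Multi-step schedule: constant until first_drop, then scaled by gamma every step epochs."""
    if epoch < first_drop:
        return base_lr
    n_drops = 1 + (epoch - first_drop) // step  # number of decays applied so far
    return base_lr * gamma ** n_drops

# Learning rate for each of the 60 training epochs (zero-based indexing assumed)
schedule = [learning_rate(epoch) for epoch in range(60)]
```

Under this reading, the last five epochs run at a rate of about 0.0001.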

Classification of Whale-Call Data
The classification on the large-scale whale-call dataset consists of 3 sub-tasks, which aim to classify the data respectively into 2 species (pilot whales and killer whales), 4 groups (Norwegian killer whales, Iceland killer whales, Bahamas short-finned whales and Norwegian long-finned whales) and 16 pods, as illustrated in Table 1. The proposed approach is implemented on the classification into 16 pods, while the classifications into 2 species and 4 groups are then derived from the results of the 16-pod classification. Table 2 lists the classification accuracies achieved on the validation dataset and compares them with the corresponding results of other researchers [9]. The classifications using log-mel features (ResN-logm and Xcep-logm) have higher accuracies than those using raw waveforms (ResN-wav and Xcep-wav), which probably indicates that the time-frequency transforms are more powerful for extracting discriminative features than the 1D convolutional layers. The time-frequency features also have a much clearer physical meaning. Moreover, the ResNext101 model performs better than Xception in classifying whale-call data. A significant improvement is observed when comparing the proposed approaches with the Wndchrm method. The results of models trained from scratch are also presented in Table 2 as a reference, where the accuracies decrease by about 2% to 4% compared with the results of the corresponding models pre-trained on ImageNet. In addition, as shown in Table 3, the classifiers perform differently across the 16 pods, with accuracies ranging from 63.2% to 100%. It should be noted that a class imbalance problem exists among the pods, whereby the largest pod has 1526 audio files and the smallest one only has 116, which might affect the class-wise classification results. This problem will be considered in our further study.
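One plausible way to derive the coarser results from pod-level predictions is to map each predicted pod to its group and species and score the mapped labels, as sketched below. The group-to-species mapping follows the group names in the text, but the pod-to-group assignment and the example predictions are hypothetical toy values, since Table 1 is not reproduced here; whether the authors used exactly this mapping approach is also an assumption.

```python
GROUP_TO_SPECIES = {
    "Norwegian killer whales": "killer whale",
    "Iceland killer whales": "killer whale",
    "Bahamas short-finned whales": "pilot whale",
    "Norwegian long-finned whales": "pilot whale",
}

# Hypothetical pod-to-group assignment for five toy pods (not taken from Table 1)
POD_TO_GROUP = {
    0: "Norwegian killer whales",
    1: "Iceland killer whales",
    2: "Bahamas short-finned whales",
    3: "Norwegian long-finned whales",
    4: "Norwegian killer whales",
}
POD_TO_SPECIES = {pod: GROUP_TO_SPECIES[g] for pod, g in POD_TO_GROUP.items()}

def coarse_accuracy(true_pods, pred_pods, mapping):
    """Accuracy after mapping both true and predicted pods to a coarser label."""
    hits = sum(mapping[t] == mapping[p] for t, p in zip(true_pods, pred_pods))
    return hits / len(true_pods)

true_pods = [0, 1, 2, 3, 4, 0]   # hypothetical clip-level ground truth
pred_pods = [4, 1, 3, 3, 0, 1]   # hypothetical clip-level predictions

pod_acc = coarse_accuracy(true_pods, pred_pods, {p: p for p in POD_TO_GROUP})
group_acc = coarse_accuracy(true_pods, pred_pods, POD_TO_GROUP)
species_acc = coarse_accuracy(true_pods, pred_pods, POD_TO_SPECIES)
```

In this toy example, confusions between pods of the same group or species no longer count as errors at the coarser levels, so the accuracy can only stay the same or increase as the labels become coarser.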

Similarity and Phylogenic Analysis
The biological relationship of different whale sub-populations is an important target to be studied in processing these whale-call data. In this work, the similarity and phylogenic relationship of whale sub-populations are demonstrated by using the output quantities from the Softmax layer in the proposed architecture. As discussed in Section 2.5, the likelihood (similarity) matrix that consists of Softmax outputs indicates the relationship among whale-call pods. Figure 3 illustrates the correlation matrix, normalized by column into the range [0, 1], based on the ResN-wav method on the validation dataset. Since the diagonal values represent the pod self-correlations, which are too large to be properly plotted together with the other correlation values, these diagonal values are excluded from the normalization and uniformly labeled as '1.00' for convenience. As shown in Figure 3, the 16 pods can be clustered into 4 groups, as highlighted by the red rectangles, which is consistent with the 4 groups (Norwegian killer whales, Iceland killer whales, Bahamas short-finned whales and Norwegian long-finned whales) shown in Table 1. The similarity analysis of the whale-call pods can also be implemented by using a series of classical methods based on likelihood vectors, for example, the hierarchical cluster analysis method in the SPSS software [36], the Pearson correlation coefficient, Euclidean distances, and so on. In addition, the phylogeny is constructed by using Phylip [33] in our experiment on the basis of the squared Euclidean distance of the likelihood vectors. As shown in Figure 4, the phylogeny graph can be clearly separated into pilot whales and killer whales along the middle dashed line. The influence of the geographic locations on the whale sub-populations is also distinguished by our method, as four distinct branches are displayed. Specifically, there are two obvious points which clearly indicate the bifurcation of different sub-populations. Compared with the phylogeny shown in [9], our result is more distinct and informative.
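The column normalization with the diagonal excluded can be sketched in NumPy as follows. Min-max scaling is assumed as the normalization, since the text only states that columns are normalized into [0, 1]; the random input matrix and the function name are illustrative stand-ins.

```python
import numpy as np

def normalize_columns_offdiag(sim):
    """Min-max scale the off-diagonal entries of each column; set the diagonal to 1."""
    n = sim.shape[0]
    off = ~np.eye(n, dtype=bool)            # mask of off-diagonal entries
    out = np.ones_like(sim, dtype=float)    # diagonal stays at 1.00
    for j in range(n):
        col = sim[off[:, j], j]
        lo, hi = col.min(), col.max()
        out[off[:, j], j] = (col - lo) / (hi - lo) if hi > lo else 0.0
    return out

rng = np.random.default_rng(0)
sim = rng.random((16, 16))                  # stand-in for the likelihood matrix
normalized = normalize_columns_offdiag(sim)
```

Excluding the diagonal keeps the large self-similarity values from compressing the off-diagonal entries into a narrow range, which is the stated reason for treating them separately.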

Conclusions
Classification of whale sounds is a long-standing problem that has been studied for decades. Despite great advancements in feature engineering and machine learning techniques, some challenging problems remain to be solved. In this paper, we investigate the possibility of using transfer learning for the classification task and study the similarity and phylogenic relationship of different whale sub-populations. Both raw waveforms and log-mel features are applied as the classifier inputs for comparison. Instead of training a deep CNN model from the very beginning, we applied pre-trained ResNext101 and Xception models to do the classification, with the goal of improving both efficiency and accuracy. The results show that the proposed approach is able to accurately classify these datasets into different categories, with accuracies increased by more than 20% compared with the traditional method. The similarity analysis for whale sub-populations was carried out using the informative likelihood output of the Softmax layer in the proposed architecture, and 4 groups of whale-call pods can be clearly observed in the plot of the similarity matrix. Finally, the phylogeny graph was also produced based on the Softmax outputs in order to achieve a better understanding of the relations among whale sub-populations, and it distinctly demonstrates the phylogenic relationships. Further study will focus on analysis of the abstract features learned by the CNN models in order to obtain an appropriate physical interpretation.