Classification of Hydroacoustic Signals Based on Harmonic Wavelets and a Deep Learning Artificial Intelligence System

This paper considers two approaches to hydroacoustic signal classification, taking the sounds made by whales as an example: a method based on harmonic wavelets and a technique involving deep learning neural networks. The study deals with the classification of hydroacoustic signals by a kNN algorithm, using as features the coefficients of the harmonic wavelet transform (with fast computation), the short-time Fourier transform (spectrogram) and the Fourier transform. Classification quality metrics (precision, recall and accuracy) are given for different signal-to-noise ratios, and ROC curves are also obtained. The use of a deep neural network for the classification of whale sounds is considered. The effectiveness of harmonic wavelets for the classification of complex non-stationary signals is demonstrated. A technique to reduce the feature space dimension using a ‘modulo N reduction’ method is proposed. A classification of 26 individual whales from the Whale FM Project dataset is presented. It is shown that the deep-learning-based approach provides the best result for the Whale FM Project dataset, both for whale types and for individuals.


Introduction
The whale was one of the main commercial animals in the past. Whalers were attracted by the huge carcass of this animal: from one whale they could obtain much more fat and meat than from any other marine animal. Today, many whale species have been driven almost to extinction, and for this reason they are listed in the IUCN Red List of Threatened Species [1]. Currently, the main threat to whales is anthropogenic: the disruption of their habitual way of life and the pollution of the seas. To ensure the safety of these rare animals, the number of individuals must be monitored. Within the framework of environmental monitoring programs approved by governments and public organizations of different countries, cetacean monitoring is carried out year-round using the latest achievements in data processing [2]. Monitoring includes work at sea and post-processing of the collected data: determining the coordinates of whale encounters and developing algorithms that extract useful signals arriving from particular directions. Preliminary processing includes de-noising, estimation of the degree of randomness, extraction of short-term local features, pre-filtering, etc. Preprocessing affects all further analysis within a hydroacoustic monitoring system [11][12][13]. Although the preprocessing of hydroacoustic signals has been studied for a long time, several problems remain unresolved: working under a priori uncertainty of signal parameters; processing complex non-stationary hydroacoustic signals with multiple local features; and analyzing multicomponent signals. A further set of problems concerns effective preliminary visual processing of hydroacoustic signals and the need for a mathematical apparatus for signal preprocessing tasks.
Current advances in applied mathematics and digital signal processing, along with the development of high-performance hardware, allow the effective application of numerous mathematical techniques, including continuous and discrete wavelet transforms. Wavelets are an effective tool for signal preprocessing due to their adaptability, the availability of fast computational algorithms and the diversity of wavelet bases.
Applications of wavelets to hydroacoustic signals include:
1. Detection of foreign objects in marine and river areas, including icebergs and other ice formations, estimation of the size of these objects, and hazard assessment based on the analysis of local signal features;
2. Detection and classification of marine targets based on the analysis of local signal features;
3. Detection of hydroacoustic signals in the presence of background noise;
4. Efficient visualization and processing of hydroacoustic signals based on multiscale wavelet spectrograms.
Classification is an important task of modern signal processing. The quality of classification depends on the noise level, the sizes of the training and test datasets, and the algorithm. It is also important to choose the classification features and to determine the dimension of the feature space. A classification feature is a characteristic of an object that is used to assign it to a class. When real non-stationary signals are classified, informative classification features are essential, and wavelet coefficients are among them.

Harmonic Wavelets
The wavelet transform uses wavelets as basis functions. All functions of a wavelet basis are obtained from one function (the "mother" wavelet) by translations and dilations in the time domain. The wavelet transform is commonly used to analyze non-stationary (seismic, biological, hydroacoustic, etc.) signals, usually together with various spectral analysis algorithms [16,17].
Consider the basis of harmonic wavelets, whose spectra are rectangular in a given frequency band [15,16]. Harmonic wavelets are usually represented in the frequency domain. The wavelet function (mother wavelet) can be written as:
There are several techniques that decompose input signals using different basis functions: wavelets, sine waves, damped sine waves, polynomials, etc. These functions form a dictionary of atoms (basis functions), each of which is localized in the time and frequency domains. Often the dictionary of atoms is complete (all types of functions are used) and redundant (the functions are not mutually independent). One of the main problems of these techniques is the selection of basis functions and the optimization of the dictionary to achieve optimal decomposition levels [17].
Decomposition levels for wavelets can be defined as follows, where j is the decomposition level and k is the translation:
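For reference, a sketch of the standard harmonic wavelet definitions due to Newland, which we assume correspond to the basis of [15,16]: the mother wavelet has a flat spectrum over one octave, the wavelet at level $j$ and translation $k$ is $w(2^j x - k)$, and the scaling function covers the band below the first octave. The normalization follows Newland's convention, with $W(\omega)=\frac{1}{2\pi}\int w(x)e^{-i\omega x}\,dx$, and may differ from the one used in this paper.

$$W(\omega)=\begin{cases}\dfrac{1}{2\pi}, & 2\pi\le\omega<4\pi,\\[4pt] 0, & \text{otherwise},\end{cases}\qquad w(x)=\frac{e^{i4\pi x}-e^{i2\pi x}}{i2\pi x},$$

$$W_{jk}(\omega)=\begin{cases}\dfrac{1}{2\pi\,2^{j}}\,e^{-i\omega k/2^{j}}, & 2\pi\,2^{j}\le\omega<4\pi\,2^{j},\\[4pt] 0, & \text{otherwise},\end{cases}$$

$$\Phi(\omega)=\begin{cases}\dfrac{1}{2\pi}, & 0\le\omega<2\pi,\\[4pt] 0, & \text{otherwise},\end{cases}\qquad \phi(x)=\frac{e^{i2\pi x}-1}{i2\pi x}.$$

Here $W_{jk}$ is the Fourier transform of $w(2^j x-k)$: the dilation by $2^j$ moves the octave band to $[2\pi\,2^j, 4\pi\,2^j)$, and the translation by $k$ contributes the phase factor $e^{-i\omega k/2^j}$.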
Wavelets are often chosen as basis functions because of their useful properties [14] and their ability to process signals in the time-frequency domain.
The Fourier transform of a scaling function can be written as:
We can formulate the following properties of harmonic wavelets, which relate them to other classes of wavelets:

• Harmonic wavelets have compact support in the frequency domain, which can be used for localizing signal features.
• There are fast algorithms based on the fast Fourier transform (FFT) for computing wavelet coefficients and reconstructing signals in the time domain.
The drawback of harmonic wavelets is their weak localization in the time domain compared with other types of wavelets. A rectangular spectrum leads to decay in the time domain as 1/x, which is not sufficient for extracting short-term singularities of a signal in the time domain.
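The 1/x decay can be checked numerically. The sketch below assumes the standard (Newland) mother wavelet $w(x) = (e^{i4\pi x} - e^{i2\pi x})/(i2\pi x)$, whose modulus simplifies to $|\sin(\pi x)|/(\pi|x|)$, so its envelope is exactly $1/(\pi|x|)$; the function name `harmonic_wavelet` is ours, for illustration only.

```python
import numpy as np

def harmonic_wavelet(x):
    """Newland mother harmonic wavelet w(x) = (exp(i4πx) - exp(i2πx)) / (i2πx)."""
    x = np.asarray(x, dtype=float)
    out = np.ones(x.shape, dtype=complex)  # limiting value w(0) = 1
    nz = x != 0
    xn = x[nz]
    out[nz] = (np.exp(4j * np.pi * xn) - np.exp(2j * np.pi * xn)) / (2j * np.pi * xn)
    return out

# At half-integer points |sin(pi x)| = 1, so |w(x)| should equal the 1/(pi x) envelope.
x = np.array([10.5, 100.5, 1000.5])
print(np.abs(harmonic_wavelet(x)) * np.pi * x)  # ratios close to 1
```

Even a thousand units away from its centre, the wavelet still has about 3 × 10⁻⁴ of its peak amplitude, which illustrates why short-term singularities are poorly localized.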

Wavelet Transform in the Basis of Harmonic Wavelets
Detail coefficients $a_{jk}$, $\tilde a_{jk}$ and approximation coefficients $a_{\phi k}$, $\tilde a_{\phi k}$, where j is the decomposition level and k is the translation:
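Assuming Newland's harmonic wavelet $w$ and scaling function $\phi$ (an overbar denotes complex conjugation), these coefficients are usually defined by the integrals below. This is a sketch of the common formulation, not necessarily the exact normalization used here; the factor $2^j$ compensates for $\int |w(2^j x-k)|^2\,dx = 2^{-j}$.

$$a_{jk}=2^{j}\int_{-\infty}^{\infty} f(x)\,\overline{w}(2^{j}x-k)\,dx,\qquad \tilde a_{jk}=2^{j}\int_{-\infty}^{\infty} f(x)\,w(2^{j}x-k)\,dx,$$

$$a_{\phi k}=\int_{-\infty}^{\infty} f(x)\,\overline{\phi}(x-k)\,dx,\qquad \tilde a_{\phi k}=\int_{-\infty}^{\infty} f(x)\,\phi(x-k)\,dx.$$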
If f(x) is a real-valued function, then $\tilde a_{jk} = \overline{a_{jk}}$ and $\tilde a_{\phi k} = \overline{a_{\phi k}}$.
Wavelet decomposition [14]:
Wavelet decomposition using harmonic wavelets [18]:
Computing the coefficients directly from the last two formulae is inefficient.
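Under the same assumption (Newland's harmonic wavelets, with the coefficients $a_{jk}$, $\tilde a_{jk}$, $a_{\phi k}$, $\tilde a_{\phi k}$ introduced in this section), the expansion of f(x) over harmonic wavelets is commonly written as follows. For real f, the second term in each bracket is the complex conjugate of the first, so only one coefficient of each pair has to be stored.

$$f(x)=\sum_{k=-\infty}^{\infty}\Big[a_{\phi k}\,\phi(x-k)+\tilde a_{\phi k}\,\overline{\phi}(x-k)\Big]+\sum_{j=0}^{\infty}\sum_{k=-\infty}^{\infty}\Big[a_{jk}\,w(2^{j}x-k)+\tilde a_{jk}\,\overline{w}(2^{j}x-k)\Big].$$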

Fast decomposition can be implemented in the following way. The substitution is of the following form:
We can show that:
Thus, the algorithm for computing the wavelet coefficients of the octave harmonic wavelet transform [19] of a continuous-time function f(x) can be written as follows:
1. The function f(x) is sampled to obtain $N = 2^n$ samples.
2. The discrete Fourier transform is calculated using the fast Fourier transform, giving a set of complex Fourier (DFT) coefficients F(n), n = 0, …, N−1.
3. Octave blocks of F are processed using the inverse discrete Fourier transform to obtain the coefficients $a_{jk} = a_{2^j+k}$.
The distribution of the resulting coefficients among decomposition levels is given in Table 1.
Table 1. Distribution of wavelet coefficients among decomposition levels.

Decomposition level j | Wavelet coefficients | Number of wavelet coefficients
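A minimal NumPy sketch of the three steps above for a real signal of length $N = 2^n$. It follows Newland's formulation as we understand it: the FFT is scaled by $1/N$, the DC term is kept as the level $j=-1$ coefficient $a_0$, and each positive-frequency octave block $F[2^j : 2^{j+1}]$ is passed through an inverse DFT (without numpy's $1/M$ factor) to give $a_{2^j+k}$, $k = 0, \dots, 2^j-1$. For simplicity, the Nyquist term $F[N/2]$ is omitted. The negative-frequency half is also omitted, because for real signals it holds the conjugate coefficients. The function name is hypothetical.

```python
import numpy as np

def harmonic_wavelet_coeffs(f):
    """Octave harmonic wavelet coefficients of a real signal (sketch after Newland).

    Returns {level j: complex array a_{2^j + k}, k = 0..2^j - 1}; level -1 holds a_0.
    """
    f = np.asarray(f, dtype=float)
    N = f.size
    if N < 4 or N & (N - 1):
        raise ValueError("signal length must be a power of two, at least 4")
    F = np.fft.fft(f) / N                 # step 2: DFT coefficients scaled by 1/N
    coeffs = {-1: F[:1].copy()}           # mean value (scaling-function coefficient)
    j = 0
    while 2 ** (j + 1) <= N // 2:         # octave blocks F[2^j : 2^(j+1)] below Nyquist
        block = F[2 ** j: 2 ** (j + 1)]
        # step 3: inverse DFT of the block; multiply by its length to undo numpy's 1/M
        coeffs[j] = np.fft.ifft(block) * block.size
        j += 1
    return coeffs

# Usage: a cosine in bin 5 of a 16-sample record falls in the octave block F[4:8] (level 2).
n = np.arange(16)
c = harmonic_wavelet_coeffs(np.cos(2 * np.pi * 5 * n / 16))
print({j: np.round(np.abs(v), 3) for j, v in c.items()})
```

Level j contributes $2^j$ coefficients, so for N = 4096 this yields 2048 complex coefficients over levels −1 to 10. That count is consistent with the 50% reduction used for the HWT features in the classification experiments.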
Next, we consider two approaches to classifying bioacoustic signals. We used real hydroacoustic recordings of whales from the database [20].

Classification Using the kNN-Algorithm
The classification was based on 14,822 records of whales of two types: 'killer' (4673 records) and 'pilot' (10,149 records). The data were taken from [20]. Experiments were conducted for the following signal-to-noise ratios (SNR): 100, 3, 0 and −3 dB. The classifier was trained on 85% of the records of each class and tested on the remaining 15% of each class. The following features were compared: harmonic wavelet transform (HWT) coefficients, short-time Fourier transform (STFT) coefficients and discrete Fourier transform (DFT) coefficients.
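The noise model is not specified. A common choice, assumed here, is additive white Gaussian noise whose power is set from the mean signal power, so that SNR = 10 log₁₀(P_signal / P_noise). A sketch (the function name is hypothetical):

```python
import numpy as np

def add_white_noise(signal, snr_db, rng=None):
    """Return signal + white Gaussian noise at the requested SNR (in dB)."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / 10 ** (snr_db / 10)   # SNR = 10 log10(Ps / Pn)
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# Usage: the four SNR levels used in the experiments
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 0.01 * np.arange(100_000))
noisy = {snr: add_white_noise(clean, snr, rng) for snr in (100, 3, 0, -3)}
```

At 100 dB the added noise is negligible, so that setting effectively serves as the noise-free baseline.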
All records had different numbers of samples (8064–900,771) and different sampling rates. To perform classification, we had to change the record lengths so that they were equal to a power of two. To reduce the feature space dimension, we employed the approach based on modulo N reduction [21]. This approach reduces the data dimension when an N-point DFT is computed for a signal of length L > N. With N = 4096, the final signal matrix was 14,822 × 4096.
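Modulo N reduction is usually understood as time-domain folding. The record is cut into consecutive blocks of N samples, with the last block zero-padded, and the blocks are summed. The N-point DFT of the folded sequence then equals the spectrum of the full record sampled at the N frequencies 2πk/N. A sketch, assuming this is the method of [21] (the function name is hypothetical):

```python
import numpy as np

def modulo_n_reduce(x, N):
    """Fold a length-L record into N samples: y[n] = sum_m x[n + m*N] (zero-padded tail)."""
    x = np.asarray(x, dtype=float)
    M = -(-x.size // N) * N          # L rounded up to a multiple of N
    padded = np.zeros(M)
    padded[: x.size] = x
    return padded.reshape(-1, N).sum(axis=0)   # add the consecutive N-sample blocks

# Usage: the 4096-point DFT of the folded record samples the record's spectrum at 2*pi*k/4096.
x = np.random.default_rng(0).standard_normal(10_000)
y = modulo_n_reduce(x, 4096)
```

Here a 10,000-sample record is padded to 12,288 samples. The FFT of the 4096-sample folded record equals every third bin of a 12,288-point FFT of the original, so no spectral values at the retained frequencies are lost.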
To reduce the feature space dimension further, we also used the symmetry of the harmonic wavelet transform and DFT coefficients: only 50% of the coefficients were used (matrix: 14,822 × 2048). For the short-time Fourier transform (Hamming window of size 256, 50% overlap), the final signal matrix was 14,822 × 3999.
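The STFT dimension is consistent with 4096-sample records framed by a 256-point Hamming window with a hop of 128 samples. This gives 1 + (4096 − 256)/128 = 31 frames of 256/2 + 1 = 129 one-sided bins, i.e. 31 × 129 = 3999 values. The sketch below reproduces both feature sizes. Using magnitudes rather than complex values is our assumption, and the function names are hypothetical.

```python
import numpy as np

def stft_features(x, win_len=256, hop=128):
    """Flattened one-sided STFT magnitudes with a Hamming window (hop = win_len/2 -> 50% overlap)."""
    x = np.asarray(x, dtype=float)
    if x.size < win_len:
        raise ValueError("record shorter than the window")
    n_frames = 1 + (x.size - win_len) // hop
    idx = np.arange(win_len)[None, :] + hop * np.arange(n_frames)[:, None]
    spec = np.abs(np.fft.rfft(x[idx] * np.hamming(win_len), axis=1))  # (n_frames, win_len//2 + 1)
    return spec.ravel()

def half_dft_features(x):
    """Magnitudes of the first N/2 DFT bins; for real x, X[N-k] is the conjugate of X[k]."""
    X = np.fft.fft(np.asarray(x, dtype=float))
    return np.abs(X[: X.size // 2])

x = np.random.default_rng(0).standard_normal(4096)
print(stft_features(x).size, half_dft_features(x).size)  # 3999 2048
```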
The classification results obtained with the kNN algorithm [22] for different features and SNR values are given below (Tables 2–13, Figure 1).
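The value of k and the distance metric are not stated. The sketch below assumes Euclidean distance and majority voting, with k = 5 by default. It computes all pairwise distances through the expansion ‖a − b‖² = ‖a‖² + ‖b‖² − 2a·b, which avoids a three-dimensional intermediate array but still holds an n_test × n_train matrix in memory. The function name is hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Majority-vote kNN with Euclidean distance (ties go to the smallest label)."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)
    # squared distances via |a-b|^2 = |a|^2 + |b|^2 - 2 a.b, shape (n_test, n_train)
    d2 = (np.sum(X_test ** 2, axis=1)[:, None]
          + np.sum(X_train ** 2, axis=1)[None, :]
          - 2.0 * X_test @ X_train.T)
    nearest = np.argsort(d2, axis=1)[:, :k]      # indices of the k closest training rows
    preds = []
    for votes in y_train[nearest]:
        labels, counts = np.unique(votes, return_counts=True)
        preds.append(labels[np.argmax(counts)])  # argmax picks the first (smallest) label on ties
    return np.array(preds)

# Usage with two well-separated toy clusters
X_tr = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
y_tr = [0, 0, 0, 1, 1, 1]
print(knn_predict(X_tr, y_tr, [[0.5, 0.5], [10.5, 10.5]], k=3))
```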
Appl. Sci. 2020, 10, x FOR PEER REVIEW
The classification problem is to assign vectors to classes. We have two classes: positive and negative. In this case, four different situations can occur at the classifier output:
• If the classification result is positive and the true value is positive as well, we have a true positive (TP).
• If the classification result is positive but the true value is negative, we have a false positive (FP).
• If the classification result is negative and the true value is negative as well, we have a true negative (TN).
• If the classification result is negative but the true value is positive, we have a false negative (FN).
We calculated the following classification quality metrics: precision, recall and accuracy.

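The metric formulas are not written out here. The standard definitions are precision = TP/(TP + FP), recall = TP/(TP + FN) and accuracy = (TP + TN)/(TP + TN + FP + FN). The sketch below computes them from label vectors. The function name and the convention of returning 0 when a denominator is zero are our assumptions.

```python
import numpy as np

def binary_metrics(y_true, y_pred, positive=1):
    """Confusion counts plus precision, recall and accuracy for a two-class problem."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    tn = int(np.sum((y_pred != positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    return {
        "TP": tp, "FP": fp, "TN": tn, "FN": fn,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / y_true.size,
    }

m = binary_metrics([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```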
Tables 14–16 contain precision, recall and accuracy for different classification features and signal-to-noise ratios. The tables also give an averaged final efficiency score characterizing the use of each classification feature. The final score is obtained from the scores of a particular metric at each SNR. The "averaged score for three metrics" is the average of the scores of the three metrics at the same SNR. The final score was then determined for each feature (HWT, STFT, DFT) over the different SNRs. Using HWT coefficients as features gives the best result.

Classification Using a Deep Neural Network
The classification was based on 14,822 records of whales of two types: 'killer' (4673 records) and 'pilot' (10,149 records). The data were taken from [20], which contains sound recordings of 26 whales of two types: killer whales (15 individuals) and pilot whales (11 individuals).
In [23], two classifiers based on the kNN algorithm were constructed for this dataset. In the first case, the sounds were classified as pilot (grind) whale or killer whale sounds; 800 sounds of each class were used for training and 400 of each for testing, and a classification accuracy of 92% was obtained. In the second experiment, 18 whales were distinguished from each other, with 80 records used for training and 20 for testing; the classification accuracy was 51%.
In this work, records shorter than 960 ms were removed from the dataset. After that, 14,810 records with an average duration of 4 s remained: 10,149 pilot (grind) whale records and 4661 killer whale records.
The classifier for both tasks was based on the VGGish model [24], a modified VGG deep neural network [25] pre-trained on the YouTube-8M dataset [26]. Cross-entropy was used as the loss function. The audio files were pre-processed according to the procedure presented in [24]: each record is divided into non-overlapping 960 ms frames, each frame inherits the label of its parent recording, and a log-mel spectrogram patch of 96 × 64 bins is calculated for each frame. These patches form the set of inputs to the classifier. The output for an entire audio recording was determined by the maximum likelihood over all classes across its segments. The output of the penultimate layer, of dimension 128, was taken as the feature vector. More details can be found in [23].
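The sketch below builds VGGish-style input patches in NumPy. It assumes the parameters of the publicly released VGGish preprocessing as we understand them: 16 kHz audio, a 25 ms periodic Hann window, a 10 ms hop, a 512-point FFT, 64 mel bands between 125 and 7500 Hz, a log offset of 0.01 and 96-frame patches. It is an approximation for illustration, not the reference implementation of [24], and resampling to 16 kHz is assumed to have been done already. The function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f_hz):
    """HTK-style mel scale, as used by the VGGish preprocessing."""
    return 1127.0 * np.log(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_matrix(n_fft=512, sr=16000, n_mels=64, f_lo=125.0, f_hi=7500.0):
    """Triangular mel filters, shape (n_fft // 2 + 1, n_mels)."""
    bin_mel = hz_to_mel(np.linspace(0.0, sr / 2, n_fft // 2 + 1))
    edges = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_mels + 2)
    weights = np.zeros((bin_mel.size, n_mels))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (bin_mel - lo) / (mid - lo)
        falling = (hi - bin_mel) / (hi - mid)
        weights[:, m] = np.maximum(0.0, np.minimum(rising, falling))
    return weights

def log_mel_patches(x, sr=16000, win=400, hop=160, n_fft=512, patch=96):
    """Non-overlapping 96-frame (0.96 s) log-mel patches, shape (n_patches, 96, 64)."""
    x = np.asarray(x, dtype=float)
    if x.size < win:
        raise ValueError("record shorter than one STFT window")
    n_frames = 1 + (x.size - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    hann = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(win) / win)   # periodic Hann window
    mag = np.abs(np.fft.rfft(x[idx] * hann, n=n_fft, axis=1))     # (n_frames, 257)
    log_mel = np.log(mag @ mel_matrix(n_fft, sr) + 0.01)          # (n_frames, 64)
    n_patches = n_frames // patch                                 # incomplete tail is dropped
    return log_mel[: n_patches * patch].reshape(n_patches, patch, -1)

# Usage: a 4 s record at 16 kHz
patches = log_mel_patches(np.random.default_rng(0).standard_normal(4 * 16000))
print(patches.shape)
```

A 4 s record yields 398 STFT frames and therefore four complete 96 × 64 patches; the remaining frames are discarded.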

Experiment 1-Classification by Type
For the first task, we divided the dataset into training and test data in the proportion 85:15, so that no whale contributes sounds to both the training and the test sample. The killer whale was labeled 0 and the pilot whale 1. Training set: 8486 records of class 1 and 3995 records of class 0. Test set: 1663 records of class 1 and 666 records of class 0.
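Keeping every whale entirely in one subset is a grouped split. The sketch below assumes a per-record array of whale IDs. It holds out a fraction of whales rather than of records, so the record ratio only approximates 85:15. The function name is hypothetical.

```python
import numpy as np

def group_split(groups, test_fraction=0.15, seed=0):
    """Split record indices so that no group (whale ID) appears in both subsets."""
    groups = np.asarray(groups)
    rng = np.random.default_rng(seed)
    ids = rng.permutation(np.unique(groups))
    n_test = max(1, int(round(test_fraction * ids.size)))   # number of whales held out
    is_test = np.isin(groups, ids[:n_test])
    return np.flatnonzero(~is_test), np.flatnonzero(is_test)

# Usage: 20 hypothetical whales with 5 records each
whale_ids = np.repeat(np.arange(20), 5)
train_idx, test_idx = group_split(whale_ids)
```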
The following results were obtained. On the training set, the confusion matrix was:
Figure 2 shows the ROC curve for the test set.
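An ROC curve is obtained by sweeping a threshold over the classifier's scores for the positive class and recording the true-positive rate against the false-positive rate at each distinct threshold. The sketch below also computes a trapezoidal AUC. The scores are assumed to be positive-class probabilities, and the function names are hypothetical.

```python
import numpy as np

def roc_points(y_true, scores):
    """(FPR, TPR) pairs for every distinct threshold, starting at (0, 0)."""
    y = np.asarray(y_true).astype(bool)
    s = np.asarray(scores, dtype=float)
    order = np.argsort(-s, kind="mergesort")   # highest score first
    y, s = y[order], s[order]
    tps = np.cumsum(y)                          # true positives above each cut
    fps = np.cumsum(~y)                         # false positives above each cut
    last = np.r_[np.flatnonzero(np.diff(s)), y.size - 1]  # last index of each tied-score run
    tpr = np.r_[0.0, tps[last] / y.sum()]
    fpr = np.r_[0.0, fps[last] / (~y).sum()]
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

fpr, tpr = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc(fpr, tpr))  # 0.75
```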

Experiment 2-Classification by Individual
The data were divided into training and test sets in the ratio 85:15, maintaining the class proportions. As Figure 3 shows, the classes are highly unbalanced: in the training set, only 15 and 12 files are available for classes 26 and 5, respectively, whereas 3684 files are available for class 20.
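Maintaining class proportions is a stratified split: each class is shuffled and cut separately. In the sketch below, the test share is rounded per class, so very small classes receive only one or two test files. This is a direct consequence of the imbalance noted above. The function name is hypothetical.

```python
import numpy as np

def stratified_split(labels, test_fraction=0.15, seed=0):
    """Per-class random split that keeps class proportions in train and test."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train, test = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_test = int(round(test_fraction * idx.size))   # per-class test count
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return np.sort(np.array(train, dtype=int)), np.sort(np.array(test, dtype=int))

labels = np.repeat([0, 1, 2], [100, 40, 12])
train_idx, test_idx = stratified_split(labels)
print(np.bincount(labels[test_idx]))
```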

The confusion matrix for the training set is given in Figure 4, and the confusion matrix for the test set is presented in Figure 5. The accuracy of the classification of individuals (in percent) on the test sample is presented in Figure 6: blue lines indicate true positives and orange lines indicate false positives. As can be seen, the 25th class (whale ID 26) is never predicted. Only for the 9th (whale ID 10), 14th (whale ID 15) and 24th (whale ID 25) classes was the classification accuracy below 60%; for all the others it was higher. For some classes, the classification accuracy exceeds 95%.
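We assume the per-individual accuracy in Figure 6 is the share of each whale's test records that are correctly identified, i.e. the diagonal of the confusion matrix divided by its row sums. Under that assumption, both this accuracy and the classes that are never predicted (all-zero columns) can be read off the confusion matrix as follows. The function names are hypothetical.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows: true class, columns: predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def per_class_accuracy(cm):
    """Percentage of each class's records that are classified correctly (0 if a class is empty)."""
    support = cm.sum(axis=1)
    return 100.0 * np.divide(np.diag(cm), support,
                             out=np.zeros(len(cm)), where=support > 0)

cm = confusion_matrix([0, 0, 1, 1, 2, 2], [0, 0, 1, 0, 1, 1], n_classes=3)
print(per_class_accuracy(cm))                 # per-class accuracy in percent
print(np.flatnonzero(cm.sum(axis=0) == 0))    # classes that are never predicted
```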

Discussion
Classification of whale sounds is a challenging problem that has been studied for a long time. Despite great achievements in feature engineering, signal processing and machine learning, some major problems remain to be solved. In this paper, we used harmonic wavelets and deep neural networks. The results of classifying whale types and individuals with deep neural networks are better than those of previous work [23] on this dataset. However, the accuracy of type classification with harmonic wavelet features and of individual classification with deep neural networks still needs to be improved. In further studies, we will use the Hilbert–Huang transform [27] and adaptive signal processing algorithms [28] to generate features.
Two approaches can be suggested for improving individual classification. The first combines data augmentation with other neural network architectures, but this leads to large computational costs. The second is to use the technology for simple and non-iterative improvement of multilayer and deep learning neural networks and artificial intelligence systems proposed in [29,30]. Our further research on the classification of hydroacoustic signals will follow these two approaches. We also intend to test them with noise added at different SNRs, as was done for harmonic wavelets.


Conclusions
In this paper, we considered the harmonic wavelet transform and its application to classifying hydroacoustic signals from whales of two types. We have provided a detailed description of the mathematical tools, including fast computation of the harmonic wavelet transform coefficients. The classification results support the use of harmonic wavelets for analyzing complex data: with the kNN algorithm, the smallest classification error was obtained using harmonic wavelet transform coefficients as features.
The analysis (Table 17, Figures 5 and 6) shows the superiority of the neural network for the Whale FM Project dataset in comparison with the known work [23] and with a kNN classifier for the classification problem [31]. However, implementing a neural network of such complex structure requires significant computational resources. A classification of 26 individual whales from the Whale FM Project dataset was proposed, and better results than in previous work [23] were achieved.
The proposed approach can be used to study ocean fauna by research institutes, environmental organizations and manufacturers of sonar monitoring equipment. In addition, the study showed that the same methods can be used for speech processing and for the classification of underwater bioacoustic signals, which may subsequently allow the creation of effective medical devices based on these methods.