1. Introduction
The underwater environment is characterized by a multitude of sound sources, including natural phenomena such as marine mammal vocalizations, waves, and icebergs, as well as human activities such as shipping, seismic exploration, and naval sonar operations [
1]. Among these, marine mammal vocalizations provide important information for species identification and distribution tracking, which are essential for understanding species behavior, communication, and ecology. Monitoring marine mammal vocalizations has therefore become an important approach for studying marine species. Classification systems can support this process by effectively distinguishing vocalizations from different species, enabling efficient analysis of large acoustic datasets.
Recent research has been dedicated to developing marine mammal sound classification methods using artificial intelligence (AI) algorithms, including machine learning and deep learning, because traditional approaches, such as manual interpretation of acoustic data, are labor-intensive and time-consuming [2]. These AI-based techniques are becoming increasingly popular because they can process and analyze large amounts of data, allowing researchers to identify vocalization patterns with greater speed and efficiency [3]. Specifically, studies such as [4,5] have investigated classification methods using decision trees, support vector machines, and sparse representation-based classifiers. However, the performance of these methods depends heavily on the manually extracted features used for training. Manual feature extraction is not only time-consuming but also inherently subjective, as it relies on the researcher’s expertise and choice of features, which restricts the ability of these methods to generalize across diverse scenarios [6].
To address these limitations, automated feature extraction techniques, such as convolutional neural networks (CNNs), have been widely adopted, excelling in performance by extracting high-level representations directly from the data and enhancing both classification accuracy and generalizability [
7,
8]. However, to facilitate this process, preprocessing methods such as the Short-Time Fourier Transform (STFT) are commonly used to convert raw audio signals into spectrograms in the frequency domain. The frequency domain offers advantages such as dimensionality reduction by visualizing key information from raw acoustic data, enhancing analysis and classification [
9].
Furthermore, the frequency domain offers compatibility with advanced image-based deep learning architectures, such as AlexNet and EfficientNet, which can help enhance classification performance in marine bioacoustics. Owing to advances in these image recognition deep learning technologies, these architectures offer robust feature extraction capabilities and high classification accuracy. Through frequency-domain representations and transfer learning, researchers can easily adapt these models to acoustic data, reducing both computational and time costs. Thus, many studies have employed frequency-domain representations when developing automated classification methods, including [
10,
11].
Despite these advancements, frequency-domain representations have inherent limitations. For example, STFT-based spectrogram generation requires several user-defined parameters, such as FFT length, window size, and hop length, which strongly influence the resulting time–frequency resolution. Selecting appropriate values for these parameters can be challenging, as it involves balancing the trade-off between time and frequency resolution, which may ultimately affect model performance. In addition, spectrogram-based classification methods typically require input signals to have uniform lengths. As a result, additional preprocessing steps such as padding or cropping are often applied, which may introduce distortions or remove important information from the original signal [
12,
13].
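The resolution trade-off can be made concrete with a short calculation. The following is a minimal sketch, not part of the study: the function name, the sampling rate, and the window and hop lengths are assumed values chosen for illustration, and the FFT length is assumed to equal the window length.

```python
def stft_resolution(sample_rate_hz, window_length, hop_length):
    """Return (frequency bin spacing in Hz, window duration in s, hop duration in s).

    Assumes the FFT length equals the window length (no zero-padding).
    """
    freq_spacing_hz = sample_rate_hz / window_length
    window_duration_s = window_length / sample_rate_hz
    hop_duration_s = hop_length / sample_rate_hz
    return freq_spacing_hz, window_duration_s, hop_duration_s


# Assumed example settings: a longer window gives finer frequency bins
# but spans more time per frame, and vice versa.
for window_length in (256, 1024, 4096):
    spacing, window_s, hop_s = stft_resolution(16000, window_length, window_length // 4)
    print(f"window={window_length}: {spacing:.2f} Hz per bin, {window_s * 1000:.1f} ms per frame")
```

Because the bin spacing and the frame duration are reciprocals of each other, no single window length improves both at once.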
As a result, alternative approaches have explored applying deep learning models directly to raw audio signals [
14]. Previous studies have developed end-to-end models that process raw audio without requiring frequency-domain transformations, including transformer-based architectures [
15,
16]. While transformer architectures can capture long-range dependencies in audio data, they typically require large-scale datasets and higher computational resources for effective training. In contrast, CNN-based models are computationally efficient and perform well on smaller datasets, making them a practical choice for many bioacoustic classification tasks involving relatively short and isolated vocalizations.
However, CNNs generally require a fixed input size, because the fully connected layers have predefined dimensions and therefore assume feature maps of a consistent size during training [17]. If input sizes vary, the feature maps also vary in size, which prevents training through layers that expect fixed dimensions. Common approaches to address this issue include cropping a single important portion of the signal or padding the signals to standardize their lengths [18,19]. However, these methods can distort the original signal, potentially degrading the quality of the extracted features and the overall model performance.
In this study, we present a methodological framework for classifying marine mammal vocalizations directly from raw audio signals using a one-dimensional (1D) CNN approach. To address the issue of varying signal durations, the approach incorporates a random cropping strategy in which multiple fixed-length portions are randomly extracted from each signal during training. Each segment is processed independently by the 1D CNN, and the final classification is determined by aggregating predictions across portions through majority voting.
Aggregation strategies have been explored in classification tasks to improve robustness by combining information from multiple signal representations. For example, Raza et al. [
20] proposed a CNN framework incorporating multi-scale deep feature aggregation (MSDFA) for automated classification of humpback whale calls. While their approach aggregates feature representations extracted from spectrogram inputs, our study aggregates predictions from multiple randomly cropped portions of the same recording through majority voting, providing a simple and computationally efficient alternative that does not require padding or additional preprocessing.
The framework is evaluated using recordings from the Watkins Marine Mammal Sound Database, an open-source dataset containing vocalizations from multiple marine mammal species. Recordings from five marine mammal species, Bowhead whales, Humpback whales, Bearded seals, Harp seals, and Walruses, are used to evaluate the method under three experimental scenarios that differ in the number of cropped portions and the use of data augmentation. Although the current study focuses on a limited set of species, the results demonstrate the feasibility of the approach and highlight its potential as a deep learning-based framework for marine mammal vocalization classification.
3. Methodology
3.1. Proposed 1D CNN Architecture
1D CNNs have been widely used for analyzing raw audio signals because they can directly learn patterns from waveform data without requiring frequency-domain transformations. This allows the model to preserve the original structure of acoustic signals while reducing preprocessing requirements. For example, Abdoli et al. [
31] demonstrated that 1D CNN architectures can effectively extract representations directly from raw audio waveforms for classification tasks.
Motivated by these advantages, this study adopts a 1D CNN architecture for classifying marine mammal vocalizations from raw audio signals.
Figure 3 illustrates the proposed 1D CNN architecture for classifying the five marine mammal species in our dataset. The network consists of two main components: a feature extraction module and a classification module.
The feature extraction module employs a sequence of convolutional layers with a kernel size of 3, a stride of 1, and padding of 1. Each convolutional layer is followed by instance normalization, a SiLU activation function, and a max pooling layer. The convolutional layers have increasing channel dimensions of 16, 32, and 64, allowing the network to progressively extract higher-level temporal features from the input audio signals. The classification module consists of fully connected layers, where the first layer maps the extracted features into a smaller dimensional space with 64 units, followed by a ReLU activation function. The second fully connected layer outputs the final predictions corresponding to the five classes considered in this study.
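To illustrate why this architecture ties the classifier to a fixed input length, the sketch below traces the feature-map length through the three convolution and pooling blocks. It is a rough illustration under stated assumptions: the pooling size of 2 and the flattening of the final feature map before the first fully connected layer are not given in the text, and the function names are hypothetical.

```python
def feature_map_length(input_length, num_blocks=3, kernel=3, stride=1, padding=1, pool=2):
    """Length of the feature map after the convolution + max pooling blocks.

    The pooling size (pool=2, non-overlapping) is an assumption.
    """
    length = input_length
    for _ in range(num_blocks):
        # Convolution: kernel 3, stride 1, padding 1 preserves the length.
        length = (length + 2 * padding - kernel) // stride + 1
        # Max pooling with kernel = stride = pool (assumed) divides the length.
        length = length // pool
    return length


def flattened_features(input_length, out_channels=64):
    """Number of inputs the first fully connected layer would see,
    assuming the final 64-channel feature map is simply flattened."""
    return out_channels * feature_map_length(input_length)


print(flattened_features(8000))   # 64000
print(flattened_features(16000))  # 128000
```

Under these assumptions, two inputs of different lengths produce different numbers of flattened features, so a fully connected layer sized for one cannot accept the other.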
3.2. Variable Duration Signal Handling Framework
Although 1D CNN architectures are capable of learning features directly from raw waveform data, CNN models typically require inputs with fixed dimensions.
Figure 4 illustrates the distribution of total durations for all audio signals in the dataset after segmentation. Real-world biological signals usually vary significantly in duration, which makes them difficult to input directly into CNN layers without additional preprocessing. Conventional methods typically address this issue by either cropping a single important portion from the original signal or applying padding to standardize signal lengths, both of which can distort the signal. To avoid this distortion, some studies instead divide the audio signal into multiple fixed-length frames using a sliding window of appropriate size, as demonstrated in [31].
However, a wide range of signal durations makes the sliding window approach computationally demanding, particularly for longer recordings. To reduce this computational burden, a longer sliding window can be employed to minimize the number of segments generated for very long signals. While this may be effective, it requires padding shorter signals to match the sliding window length, potentially introducing artificial data that does not accurately represent the original signal.
To address the computational cost while still utilizing as much information from the audio signals as possible, our framework employs a random cropping method. This involves extracting up to N random portions of fixed length L from each signal. Each cropped portion is then processed by the CNN model to generate a class prediction. The final classification is determined by aggregating these predictions through majority voting. This strategy, similar to ensemble learning, helps improve classification robustness by incorporating information from different regions of the signal [32].
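The aggregation step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name is hypothetical, and the tie-breaking rule (the label that appears first wins) is an assumption, since the text does not state how ties are resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label among per-crop predictions.

    Ties are broken in favor of the label that appears first (assumed rule).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    return Counter(predictions).most_common(1)[0][0]


# Example: per-crop labels for one recording (illustrative labels only).
print(majority_vote(["walrus", "harp_seal", "walrus"]))
```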
Figure 5 illustrates the overall framework, showing how random cropping is applied and processed through the model to generate class predictions, enabling adaptive processing of signals with varying lengths.
The fixed length L is determined by the shortest signal in the dataset. By doing so, no padding is required, preserving the integrity of the original data. In practice, shorter signals yield fewer than N portions, ensuring that the method adapts naturally to the signal’s duration while maintaining computational efficiency. In addition, different portions are randomly cropped during each training iteration, allowing the model to be exposed to a wide range of regions within each recording and learn representative features from different parts of the signal as training progresses.
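A minimal sketch of the random cropping step is shown below. The function name is hypothetical, and two details are assumptions because the text does not specify them: the number of portions for a short signal is taken as the number of non-overlapping windows of length L that fit, capped at N, and the random portions themselves are allowed to overlap.

```python
import random


def random_crops(signal, crop_length, max_crops, rng=random):
    """Extract up to max_crops random portions of crop_length samples.

    Assumptions: the crop count is min(max_crops, len(signal) // crop_length),
    and crops are drawn independently, so they may overlap.
    """
    if len(signal) < crop_length:
        raise ValueError("signal is shorter than the crop length")
    num_crops = min(max_crops, len(signal) // crop_length)
    last_start = len(signal) - crop_length
    starts = [rng.randint(0, last_start) for _ in range(num_crops)]
    return [signal[start:start + crop_length] for start in starts]


# A signal only 2.5 crop lengths long yields 2 portions even though N = 6.
toy_signal = list(range(10))
print(len(random_crops(toy_signal, crop_length=4, max_crops=6)))
```

Calling the function again in each training iteration draws new start positions, which corresponds to the re-cropping behavior described above.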
3.3. Data Augmentation
Another key challenge in this study is class imbalance, which is an inevitable characteristic as the dataset originates from real-world acoustic recordings, where certain species are naturally underrepresented. Such an imbalance can lead to biased model predictions, where the model favors majority classes, ultimately affecting both training and evaluation performance. To address this issue, we employed data augmentation, a widely used technique that modifies the original data to generate additional synthetic samples, enhancing dataset diversity [
33].
Specifically, we utilized two augmentation methods: time shifting and amplitude scaling. Time shifting repositions the waveform in time by randomly translating the signal to the left or right within a specified range. Because the recordings were segmented around vocalization events, the starting positions of the vocalizations within the input segments may become relatively consistent during training. However, in real acoustic recordings, vocalizations can occur at any position within the signal. Time shifting helps mitigate this positional bias by allowing the CNN to observe the same vocalization at different temporal positions, enabling the model to learn the acoustic pattern itself rather than relying on its fixed location within the segment. In our implementation, each signal is randomly shifted by up to 10% of its total length.
Amplitude scaling modifies the amplitude of the waveform by multiplying it by a randomly chosen scaling factor. Although RMS normalization was applied to ensure that amplitudes fall within a comparable range, variations may still remain due to inherent differences. Amplitude scaling helps simulate these variations in signal strength, allowing the model to become less sensitive to absolute amplitude differences and instead focus on the underlying acoustic patterns of the vocalizations. For this study, the scaling factor is randomly selected between 0.5 and 1.5.
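The two augmentations can be sketched as below. The function names are hypothetical, and the handling of samples shifted past the signal boundary is an assumption: this sketch wraps them around to the other end (a circular shift), whereas zero-filling would be an equally plausible reading of the text.

```python
import random


def time_shift(signal, max_fraction=0.10, rng=random):
    """Shift a list of samples left or right by up to max_fraction of its length.

    Samples pushed past one end wrap around to the other (assumed behavior).
    """
    max_shift = int(len(signal) * max_fraction)
    shift = rng.randint(-max_shift, max_shift)
    if shift == 0:
        return list(signal)
    # Positive shift moves the waveform to the right, negative to the left.
    return signal[-shift:] + signal[:-shift]


def amplitude_scale(signal, low=0.5, high=1.5, rng=random):
    """Multiply every sample by one factor drawn uniformly from [low, high]."""
    factor = rng.uniform(low, high)
    return [factor * sample for sample in signal]


toy_signal = [0.0, 0.1, 0.4, 0.9, 0.4, 0.1, 0.0, 0.0, 0.0, 0.0]
augmented = amplitude_scale(time_shift(toy_signal))
print(len(augmented) == len(toy_signal))
```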
Figure 6 shows an example of an audio signal before and after time shifting and amplitude scaling. Augmentation is implemented on the fly, meaning that augmented samples are not pre-generated but created dynamically as part of the training process. This approach helps conserve memory resources and eliminates the need to store large numbers of pre-generated augmented samples.
A probabilistic scheme is also applied in which underrepresented classes have a higher probability of augmentation, while major classes have a lower probability, ensuring a more balanced representation across all classes in the dataset. Specifically, probabilities are derived based on the class distribution in the training data. For each class, an initial weight is computed as the inverse of the number of training samples belonging to that class, such that underrepresented classes receive higher weights. These weights are then linearly normalized to a bounded probability range, with a minimum probability of 0.5 and a maximum probability of 0.8. As a result, the majority class is assigned a probability of 0.5, the most underrepresented class is assigned a probability of 0.8, and all remaining classes receive probabilities between these bounds based on their relative weights.
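The probability assignment described above can be sketched as a min-max rescaling of inverse class counts. The function name and the class counts in the example are hypothetical, and the fallback for perfectly balanced classes (all weights equal) is an assumption, since that case is not discussed in the text.

```python
def augmentation_probabilities(class_counts, p_min=0.5, p_max=0.8):
    """Map each class to an augmentation probability in [p_min, p_max].

    Weights are inverse sample counts, linearly rescaled so the majority
    class receives p_min and the rarest class receives p_max.
    """
    weights = {label: 1.0 / count for label, count in class_counts.items()}
    w_min, w_max = min(weights.values()), max(weights.values())
    if w_max == w_min:
        # Balanced classes: no basis for ranking, so use p_min (assumed).
        return {label: p_min for label in weights}
    scale = (p_max - p_min) / (w_max - w_min)
    return {label: p_min + (w - w_min) * scale for label, w in weights.items()}


# Hypothetical counts, for illustration only (not the dataset's class sizes).
print(augmentation_probabilities({"class_a": 100, "class_b": 50, "class_c": 25}))
```

Note that because the weights are inverse counts, intermediate classes are spaced according to 1/count rather than according to the counts themselves.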
4. Application
To evaluate the proposed method, we apply it to our dataset under three different cases:
Case 1: Random cropping is applied to extract a single portion of length L from each audio sample without any data augmentation. This serves as the base model for comparison with the other cases.
Case 2: Random cropping is extended to extract at most N portions of length L from each audio sample, still with no data augmentation.
Case 3: Random cropping of at most N portions of length L for each audio sample is combined with data augmentation techniques.
For all three cases, the cropped portion length L is fixed according to the shortest signal in the dataset, as previously noted, to avoid unnecessary padding.
During training, model performance is assessed using k-fold cross-validation, a commonly used technique that provides a more reliable estimate of model performance [31]. In our study, we employed 5-fold cross-validation, where the training set is divided into five subsets (folds). For each of the five runs, four folds are used for training and the remaining fold for validation. This process yields five trained models, and the performance metrics from the validation folds are averaged to produce a more robust estimate of the model’s generalization performance, reducing potential bias and variability introduced by relying on a single train–validation split. The overall process is illustrated in Figure 7.
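The fold construction can be sketched in a few lines. This is an illustrative version only: the function name and seed are hypothetical, and the split is a plain shuffled partition. Whether the study stratified the folds by class is not stated, so no stratification is applied here.

```python
import random


def k_fold_indices(num_samples, k=5, seed=0):
    """Return k (train_indices, validation_indices) pairs.

    Each sample appears in exactly one validation fold. The split is
    shuffled but not stratified by class (an assumption of this sketch).
    """
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, validation))
    return splits


for run, (train, validation) in enumerate(k_fold_indices(23), start=1):
    print(f"run {run}: {len(train)} training samples, {len(validation)} validation samples")
```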
Throughout training, the training and validation losses and the accuracy are monitored to track the model’s progress. For each training run, the model with the highest validation accuracy is saved, resulting in a total of five trained models, which are then used to produce the final predictions on the test set. Additionally, we implemented an early stopping strategy, in which training is stopped if there is no improvement in validation accuracy after a specified number of training steps [34]. This helps prevent overfitting and reduces training time by avoiding unnecessary additional training once the model has converged.
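The stopping rule can be sketched as a small helper that counts validation checks without improvement. The class name and the `patience` parameter are hypothetical; the actual number of steps used in the study is not given in the text.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive checks without improvement."""

    def __init__(self, patience):
        self.patience = patience
        self.best_accuracy = float("-inf")
        self.checks_without_improvement = 0

    def update(self, validation_accuracy):
        """Record one validation result; return True when training should stop."""
        if validation_accuracy > self.best_accuracy:
            # New best: this is also the point at which the model would be saved.
            self.best_accuracy = validation_accuracy
            self.checks_without_improvement = 0
            return False
        self.checks_without_improvement += 1
        return self.checks_without_improvement >= self.patience
```

In a training loop, `update` would be called after each validation pass, and the loop would exit as soon as it returns `True`.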
The five trained models with optimal parameters, determined by the highest validation accuracy during each training run, are evaluated on the test dataset to assess generalization capability. These models are aggregated using majority voting across their predictions to produce the final predictions on the test set. As the test set is initially separated from the original dataset, it is not included in the training process, ensuring an unbiased evaluation. However, to enable an accurate comparison across the three cases, the test set uses fixed cropped portions, in which the same portions are consistently extracted from each audio sample, processed through the trained models, and aggregated to make the final class prediction. These portions are evenly distributed across each signal, preventing bias toward any specific region of the data. In other words, unlike the training set, random cropping is not applied to the test set, maintaining consistency and ensuring an unbiased performance evaluation.
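The deterministic test-time cropping can be sketched as follows. The function name is hypothetical, and the exact placement rule is an assumption: start positions are spaced evenly from the beginning of the signal to the last valid start, and a signal that fits only one portion is cropped at its center.

```python
def evenly_spaced_crops(signal, crop_length, max_crops):
    """Extract up to max_crops fixed portions spread evenly over the signal.

    The result depends only on the signal length, so every model and every
    experimental case sees the same test portions.
    """
    if len(signal) < crop_length:
        raise ValueError("signal is shorter than the crop length")
    num_crops = min(max_crops, len(signal) // crop_length)
    last_start = len(signal) - crop_length
    if num_crops == 1:
        starts = [last_start // 2]  # single centered portion (assumed)
    else:
        starts = [round(i * last_start / (num_crops - 1)) for i in range(num_crops)]
    return [signal[start:start + crop_length] for start in starts]


toy_signal = list(range(10))
print(evenly_spaced_crops(toy_signal, crop_length=4, max_crops=6))
```

Each portion would then be passed through the five trained models, and the resulting predictions combined by majority voting to give one label per recording.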
Figure 8 summarizes the overall process of our method, from data preparation to evaluation.
Performance evaluation of the trained model over the test set is conducted using a confusion matrix, which provides a detailed breakdown of the model’s classification performance across all classes [35]. Each row of the confusion matrix represents the true class labels, while each column corresponds to the predicted class labels. The diagonal elements represent correctly classified samples, while the off-diagonal elements correspond to misclassifications between classes. This representation allows direct analysis of class-specific performance and shows which classes are commonly confused by the model.
Furthermore, performance metrics including precision, recall, and F-score were computed to provide a quantitative evaluation of the model’s classification performance. Precision measures the proportion of correctly predicted samples among all samples predicted for a given class, while recall represents the proportion of correctly identified samples among all true instances of that class. The F-score is defined as the harmonic mean of precision and recall. These metrics were calculated for each class to provide a more detailed assessment of the model’s performance across different marine mammal species.
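These definitions can be computed directly from a confusion matrix whose rows are true classes and whose columns are predicted classes. The sketch below uses a hypothetical function name and a toy two-class matrix for illustration; the values are not results from this study.

```python
def per_class_metrics(confusion):
    """Compute precision, recall, and F-score for each class.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    num_classes = len(confusion)
    metrics = []
    for c in range(num_classes):
        true_positives = confusion[c][c]
        predicted_as_c = sum(confusion[row][c] for row in range(num_classes))  # column sum
        actually_c = sum(confusion[c])  # row sum
        precision = true_positives / predicted_as_c if predicted_as_c else 0.0
        recall = true_positives / actually_c if actually_c else 0.0
        denominator = precision + recall
        # Harmonic mean of precision and recall.
        f_score = 2 * precision * recall / denominator if denominator else 0.0
        metrics.append({"precision": precision, "recall": recall, "f_score": f_score})
    return metrics


# Toy matrix, for illustration only.
for class_index, values in enumerate(per_class_metrics([[8, 2], [1, 9]])):
    print(class_index, values)
```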
5. Results & Discussion
5.1. Determination of the Optimal Number of Cropped Portions N
Determining the optimum number of cropped portions N remains challenging, as it is a user-defined hyperparameter that must be specified prior to training. A small value of N may limit the amount of information utilized, especially for longer recordings, whereas a very large value may increase computational cost without providing significant improvements in performance. Therefore, selecting an appropriate value of N is necessary to balance classification performance and computational efficiency.
In this study, the optimal value of N was determined by evaluating several candidate values using the training dataset. For each value, the model was trained and evaluated using a 5-fold cross-validation procedure. In this process, the training data were divided into five folds, where four folds were used for training and the remaining fold was used for validation. The validation accuracies obtained from the five folds were then averaged to provide a robust estimate of model performance for each value of N.
Figure 9 shows the validation accuracy obtained for different values of N, where the mean accuracy across the five cross-validation folds is presented with the corresponding standard deviation. Based on the results from this dataset, the optimal value of N was determined to be six. Therefore, this value was adopted for the Case 2 and Case 3 scenarios. It should be noted, however, that this value is dataset-dependent. For different datasets, the optimal value of N may vary and should therefore be re-evaluated accordingly.
5.2. Performance Evaluation Across Experimental Cases
Table 3, Table 4, and Table 5 summarize the performance metrics for Case 1, Case 2, and Case 3 on the test set, respectively, while Figure 10 presents the corresponding confusion matrices for each of the three cases. Across all cases, the highest classification accuracy is observed for walruses, which are almost perfectly classified. This is likely due to their distinctive vocalizations, which differ substantially from those of the other species.
In Case 1, the overall accuracy is approximately 0.853, indicating a strong baseline performance. Figure 11 illustrates sample spectrograms showing the frequency characteristics of the raw waveform signals. Although the proposed model operates directly on raw waveforms, the convolutional filters implicitly capture frequency-related information, enabling the model to distinguish between different vocalizations to a certain extent. However, a noticeable number of misclassifications are observed across several classes. As shown in the class-level performance metrics and confusion matrix, the Bowhead whale class exhibits high precision but relatively low recall: predictions labeled as Bowhead whale are usually correct, but many true Bowhead whale vocalizations are misclassified as Humpback whale vocalizations. A similar pattern is observed for the seal species, where Bearded seal and Harp seal show lower precision and recall values because the two species are frequently confused with each other. The use of a single cropped portion limits the amount of acoustic information captured from each signal, particularly for longer recordings, which reduces the model’s ability to capture species-specific patterns and increases the likelihood of confusion between species. In addition, class imbalance may further contribute to this issue, as reflected in the higher misclassification rates observed for minority classes compared with more prevalent classes.
In Case 2, the overall accuracy improved to 0.956, indicating a substantial improvement in classification performance when multiple randomly cropped segments are used. As shown in the class-level performance metrics, the precision, recall, and F-score values increased for all five species, suggesting that the model is able to detect and classify vocalization patterns more reliably. The use of multiple randomly cropped portions allows the model to capture information from different temporal regions of the signal, enabling a better representation of the acoustic characteristics of each species. However, misclassifications among the seal species still persist, as shown in the confusion matrix, suggesting that class imbalance may remain a contributing factor.
In Case 3, the overall accuracy further improved to 0.970 after applying data augmentation techniques. However, the improvement from Case 2 to Case 3 is smaller than the improvement from Case 1 to Case 2. This indicates that the primary performance gain is achieved through the proposed multi-cropping and aggregation framework, while data augmentation mainly provides an additional benefit by partially mitigating the effects of class imbalance in the dataset.
Figure 12 illustrates the distribution of the best validation accuracy obtained from each fold for the three experimental cases. As the cases progress, the distribution shifts toward higher accuracy values and becomes more consistent across folds, indicating improved classification performance when multiple cropped portions and data augmentation are applied.
5.3. Limitations and Future Works
Overall, the results of this study demonstrate the potential of a classification framework that operates directly on raw waveforms without requiring preprocessing steps such as frequency-domain transformations or signal padding. However, there are several limitations of the current study that must be addressed. First, the dataset used in this study is relatively small and includes a limited number of marine mammal species. As such, the present work primarily serves as a baseline investigation for evaluating the feasibility of a raw-waveform-based 1D CNN approach. Future studies should extend the framework to larger and more diverse datasets encompassing a broader range of species.
Second, this study focuses exclusively on marine mammal vocalizations. In real ocean environments, acoustic recordings also contain various sources of background noise, including abiotic noise (e.g., waves and ice movement) and anthropogenic noise (e.g., shipping, seismic exploration, and naval sonar operations). Incorporating these additional sound sources into the classification framework will be important for improving its robustness and real-world applicability.
Finally, marine mammal species often produce multiple vocalization types and call patterns within the same species, each serving different behavioral functions. The present study focuses on species-level classification and does not distinguish between different call types produced by the same species. Future work should therefore consider call-type-level classification rather than only species-level identification. Such an approach would enable more detailed ecological analysis of marine mammal communication and behavior.
6. Conclusions
This study presents a methodological analysis of a 1D CNN framework for classifying marine mammal vocalizations directly from raw audio waveforms that accounts for varying signal lengths. In contrast to conventional preprocessing strategies such as padding, single cropping, or sliding window segmentation used to satisfy the fixed input size requirements of CNNs, this study employs random cropping to extract different segments of the audio signal during training. This approach aims to balance computational efficiency with the preservation of the original signal characteristics.
The framework was evaluated under three experimental scenarios to examine how different signal sampling strategies and data augmentation influence classification performance. The results indicate that using multiple random crops allows the model to utilize a broader range of acoustic information within each signal, while the inclusion of data augmentation improves the classification of underrepresented species by mitigating class imbalance.
Although the experiments were conducted on a relatively small dataset with a limited number of species, the findings provide insights into how raw waveforms can be applied to bioacoustic classification tasks. Future work should extend this analysis to larger datasets that incorporate additional ocean acoustic sources, such as environmental and anthropogenic noise, and investigate the classification of various call patterns within species. Such extensions will help improve the robustness and practical applicability of automated classification systems.