A Methodological Study of 1D CNN Classification of Marine Mammal Vocalizations with Variable Signal Durations

Kim, Won-Ki; Lee, Dawoon; Bae, Ho Seuk

doi:10.3390/jmse14070639


by Won-Ki Kim ¹, Dawoon Lee ² and Ho Seuk Bae ²,*

¹ Department of Earth and Environmental Science, Chungbuk National University, Cheongju 28644, Republic of Korea

² Agency for Defense Development, Changwon 51678, Republic of Korea

* Author to whom correspondence should be addressed.

J. Mar. Sci. Eng. 2026, 14(7), 639; https://doi.org/10.3390/jmse14070639

Submission received: 19 February 2026 / Revised: 23 March 2026 / Accepted: 29 March 2026 / Published: 30 March 2026

(This article belongs to the Section Ocean Engineering)


Abstract

Marine mammal sound classification plays an important role in understanding species behavior, communication, and ecology. Automated classification methods have received increasing attention due to their ability to efficiently process and analyze large volumes of acoustic data. Traditional classification approaches often rely on frequency-domain representations, such as spectrograms, and image-based classifiers, which can be highly influenced by user-defined parameters. In this study, we investigate a classification method for marine mammal vocalizations using a one-dimensional convolutional neural network (1D CNN) that directly processes raw audio signals. The approach can handle signals of varying durations through a random cropping technique, minimizing signal distortion that is commonly introduced by conventional methods. The model was evaluated using marine mammal vocalization recordings obtained from the Watkins Marine Mammal Sound Database under three experimental scenarios. The results demonstrate the feasibility of using raw audio inputs with a 1D CNN for classifying marine mammal vocalizations with variable signal durations.

Keywords: convolutional neural network; marine species classification; varying signal lengths; data augmentation

1. Introduction

The underwater environment is characterized by a multitude of sound sources, including natural phenomena such as marine mammal vocalizations, waves, and icebergs, as well as human activities such as shipping, seismic exploration, and naval sonar operations [1]. Among these, marine mammal vocalizations provide important information for species identification and distribution tracking, which are essential for understanding species behavior, communication, and ecology. Monitoring marine mammal vocalizations has therefore become an important approach for studying marine species. Classification systems can support this process by effectively distinguishing vocalizations from different species, enabling efficient analysis of large acoustic datasets.

Recent research has been dedicated to developing marine mammal sound classification methods using artificial intelligence (AI) algorithms, including machine learning and deep learning, as traditional approaches, such as manual interpretation of acoustic data, are labor-intensive and time-consuming [2]. These AI-based techniques are becoming increasingly popular due to their ability to process and analyze large amounts of data effectively, allowing researchers to identify vocalization patterns with greater speed and efficiency [3]. Specifically, studies such as [4,5] have investigated classification methods using decision trees, support vector machines, and sparse representation-based classifiers. However, their performance is heavily dependent on manually extracted features used for training, a process that is not only time-consuming but also inherently subjective, as it relies on the researcher’s expertise and selection of features, restricting its ability to generalize across diverse scenarios [6].

To address these limitations, automated feature extraction techniques, such as convolutional neural networks (CNNs), have been widely adopted, excelling in performance by extracting high-level representations directly from the data and enhancing both classification accuracy and generalizability [7,8]. However, to facilitate this process, preprocessing methods such as the Short-Time Fourier Transform (STFT) are commonly used to convert raw audio signals into spectrograms in the frequency domain. The frequency domain offers advantages such as dimensionality reduction by visualizing key information from raw acoustic data, enhancing analysis and classification [9].

Furthermore, the frequency domain offers compatibility with advanced image-based deep learning architectures, such as AlexNet and EfficientNet, which can help enhance classification performance in marine bioacoustics. Owing to advances in these image recognition deep learning technologies, these architectures offer robust feature extraction capabilities and high classification accuracy. Through frequency-domain representations and transfer learning, researchers can easily adapt these models to acoustic data, reducing both computational and time costs. Thus, many studies have employed frequency-domain representations when developing automated classification methods, including [10,11].

Despite these advancements, frequency-domain representations have inherent limitations. For example, STFT-based spectrogram generation requires several user-defined parameters, such as FFT length, window size, and hop length, which strongly influence the resulting time–frequency resolution. Selecting appropriate values for these parameters can be challenging, as it involves balancing the trade-off between time and frequency resolution, which may ultimately affect model performance. In addition, spectrogram-based classification methods typically require input signals to have uniform lengths. As a result, additional preprocessing steps such as padding or cropping are often applied, which may introduce distortions or remove important information from the original signal [12,13].
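The time–frequency trade-off mentioned above can be made concrete with a little arithmetic. The sketch below assumes a 40 kHz sampling rate (the rate adopted later in this study) and a few hypothetical FFT lengths; the function name is illustrative and the analysis window is assumed to be as long as the FFT.

```python
def stft_resolution(sample_rate_hz, n_fft):
    """Return (frequency bin width in Hz, analysis window duration in seconds).

    Assumes the analysis window spans exactly n_fft samples.
    """
    bin_width_hz = sample_rate_hz / n_fft
    window_duration_s = n_fft / sample_rate_hz
    return bin_width_hz, window_duration_s


# Hypothetical FFT lengths, chosen only to show the trend.
for n_fft in (256, 1024, 4096):
    bin_width, window = stft_resolution(40_000, n_fft)
    print(f"n_fft={n_fft:5d}  bin width={bin_width:8.2f} Hz  window={window * 1e3:7.2f} ms")
```

Under this assumption the product of bin width and window duration is always 1, so finer frequency resolution can only be obtained by accepting a longer, less time-localized window.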

As a result, alternative approaches have explored applying deep learning models directly to raw audio signals [14]. Previous studies have developed end-to-end models that process raw audio without requiring frequency-domain transformations, including transformer-based architectures [15,16]. While transformer architectures can capture long-range dependencies in audio data, they typically require large-scale datasets and higher computational resources for effective training. In contrast, CNN-based models are computationally efficient and perform well on smaller datasets, making them a practical choice for many bioacoustic classification tasks involving relatively short and isolated vocalizations.

However, CNNs generally require a fixed input size because the dimensions of convolutional and fully connected layers are predefined, and assume consistent feature map sizes during training [17]. If input sizes vary, each feature map would also vary in size, which prevents training through layers that expect fixed dimensions. Common approaches to address this issue include cropping a single important portion of the signal or padding the signals to standardize their lengths [18,19]. However, these methods can distort the original signal, potentially impacting the quality of the extracted features and the overall model performance.

In this study, we present a methodological framework for classifying marine mammal vocalizations directly from raw audio signals using a one-dimensional (1D) CNN approach. To address the issue of varying signal durations, the approach incorporates a random cropping strategy in which multiple fixed-length portions are randomly extracted from each signal during training. Each segment is processed independently by the 1D CNN, and the final classification is determined by aggregating predictions across portions through majority voting.

Aggregation strategies have been explored in classification tasks to improve robustness by combining information from multiple signal representations. For example, Raza et al. [20] proposed a CNN framework incorporating multi-scale deep feature aggregation (MSDFA) for automated classification of humpback whale calls. While their approach aggregates feature representations extracted from spectrogram inputs, our study aggregates predictions from multiple randomly cropped portions of the same recording through majority voting, providing a simple and computationally efficient alternative that does not require padding or additional preprocessing.

The framework is evaluated using recordings from the Watkins Marine Mammal Sound Database, an open-source dataset containing vocalizations from multiple marine mammal species. Recordings from five marine mammal species (Bowhead whales, Humpback whales, Bearded seals, Harp seals, and Walruses) are used to evaluate the method under three experimental scenarios involving multiple cropped portions and data augmentation. Although the current study focuses on a limited set of species, the results demonstrate the feasibility of the approach and highlight its potential as a deep learning-based framework for marine mammal vocalization classification.

2. Materials

2.1. Data Acquisition & Preparation

The acoustic recordings used in this study were initially obtained from the Watkins Marine Mammal Sound Database, developed by the Woods Hole Oceanographic Institution [21]. This repository contains one of the largest publicly available collections of annotated marine species vocalizations, compiled over several decades of fieldwork. Each entry is accompanied by detailed metadata, including species name, recording date, geographic location, and a description of the vocalization, enabling comparisons across species and populations and serving as a valuable reference for vocal characterization. Owing to its extensive coverage and long-term accumulation, the Watkins Marine Mammal Sound Database is among the most widely used open-source resources in marine bioacoustics research.

For this study, only a subset of species was selected from the database. While the database contains recordings from a wide range of marine species, this study aims to investigate a deep learning-based classification framework, and the focus was therefore placed on defining a reliable baseline rather than addressing unnecessary data complexity at this stage. To ensure diversity while maintaining a manageable dataset size, five representative species were selected: Bowhead whales, Humpback whales, Bearded seals, Harp seals, and Walruses. Recordings of each species were selected from the “best-of cuts” collection in the Watkins Marine Mammal Sound Database, a curated subset of recordings selected for their high quality.

The selected five species exhibit diverse vocalization characteristics, all of which are well documented in the database. Bowhead whale vocalizations primarily consist of low-frequency moans and pulsive calls, with patterns that vary depending on the season [22]. Humpback whales produce a variety of prolonged sound patterns, commonly referred to as songs, which can last from several minutes to hours [23]. Bearded seals are known for long, frequency-modulated trills and sweeps, which also can vary in duration and patterns [24]. Harp seals produce multiple call types that vary in structure and frequency characteristics across locations [25]. Walrus vocalizations mainly consist of pulsed and knocking sounds that vary in both temporal and frequency characteristics [26].

Each recording corresponds to a single species and typically contains a representative vocalization, although variations in call types or acoustic patterns may be present. In this study, the classification task focuses on species-level identification rather than distinguishing between individual call categories. Separating recordings by specific call types would substantially reduce the available data for each category, which is not suitable for training the framework. Therefore, all recordings from the same species were treated as a single class in the current classification task. Future studies with larger datasets may explore classification at the call-type level.

The recordings from the Watkins Marine Mammal Sound Database are provided in standard WAV format. The audio files were imported into the Python environment (Version 3.10.14) using the Librosa library, a widely adopted framework for audio signal processing [27]. Librosa enables efficient reading and manipulation of waveform data, including tasks such as resampling and amplitude normalization, making it a popular choice for bioacoustic signal analysis. The raw waveforms can be easily visualized using Librosa, as shown in Figure 1, which presents sample recordings from the five species selected for this research and illustrates the type of data available in the repository.

2.2. Data Preprocessing

Table 1 summarizes the dataset, including the number of raw recordings, total duration, and original sampling frequencies. A key challenge in this study is the limited availability of raw acoustic recordings, as a small training dataset may affect classification performance [28]. To address this issue, the raw recordings were manually segmented to increase the number of training samples.

This process involved a detailed examination of each recording to identify segments containing clear acoustic signals. Segmentation was performed through manual auditory inspection, during which the recordings were reviewed and segments were extracted whenever a clearly distinguishable vocalization event was detected. Regions consisting solely of background noise or acoustically ambiguous signals were excluded. This procedure was applied consistently across all recordings to ensure that only meaningful audio data was used for training. Figure 2 illustrates the class distribution, providing an overview of the dataset’s composition across different classes after segmentation.

Additionally, the original recordings were sampled at varying frequencies, which could lead to inconsistencies in temporal and spectral resolution across the dataset [29]. To address this, recordings are typically resampled to a uniform frequency. Determining the optimal frequency is challenging, as a low sampling frequency may fail to capture all important information, whereas a high sampling frequency results in significantly higher computational costs. In this study, we selected a standardized sampling frequency of 40 kHz based on empirical evaluation. Although the maximum sampling frequency in our dataset was 44 kHz, we confirmed that resampling to 40 kHz preserved the relevant frequency information across all samples while staying within our available computational resources.

Following resampling, because our dataset is composed of signals recorded at various locations, data normalization was applied to reduce amplitude variability caused by differences in the acquisition process and recording equipment. Specifically, we used root mean square (RMS) normalization to rescale each waveform to a common amplitude range. This ensures that all input signals have comparable amplitude levels, allowing the model to focus on relevant acoustic patterns during training rather than on absolute amplitude values that are unrelated to the underlying biological signal.
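RMS normalization of a waveform can be sketched as follows. This is a minimal illustration rather than the exact procedure used in the study: the target RMS level of 0.1 and the small constant `eps` are assumptions, and the function name is hypothetical. Resampling is not shown; with Librosa, one option is to request the target rate when loading, for example `librosa.load(path, sr=40000)`.

```python
import numpy as np


def rms_normalize(waveform, target_rms=0.1, eps=1e-8):
    """Rescale a 1D waveform so that its RMS amplitude is close to target_rms.

    target_rms and eps are illustrative values, not taken from the paper.
    """
    waveform = np.asarray(waveform, dtype=np.float64)
    rms = np.sqrt(np.mean(waveform ** 2))
    # eps guards against division by zero for silent input.
    return waveform * (target_rms / (rms + eps))


# Two copies of the same tone at very different recording levels.
t = np.linspace(0.0, 1.0, 40_000, endpoint=False)
quiet = 0.01 * np.sin(2 * np.pi * 200 * t)
loud = 5.0 * np.sin(2 * np.pi * 200 * t)
print(np.allclose(rms_normalize(quiet), rms_normalize(loud), atol=1e-6))
```

After normalization the two waveforms are essentially identical, which is the sense in which absolute recording level is removed from the input.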

2.3. Dataset Split for Training and Testing

When developing machine learning or deep learning models, it is important not only to achieve high predictive accuracy but also to ensure that the models generalize effectively to new, unseen data. During model development, this is typically accomplished by dividing the dataset into distinct training and test subsets, where the model is trained and optimized on the training set, and its performance is evaluated on the separate, unseen test set. In this study, the dataset is divided into training and test sets using a stratified split, ensuring that both sets contain all classes in proportions representative of the original data [30].

However, because the dataset consists of signals segmented from raw, long recordings, segments from the same recording may include vocalizations produced by the same individual. If segments from the same individual are split between the training and test sets, this can lead to data leakage, causing the model to overestimate performance due to similarities across sets. To minimize this risk, the stratified split was performed so that all segments derived from the same raw recording are exclusively grouped in either the training or the test set. The resulting distribution of signals is shown in Table 2.
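A recording-level split of this kind might be sketched as below. The sketch assumes each segment is described by a hypothetical `(segment_id, recording_id, species)` tuple, that recording identifiers are unique across species, and that 20% of each species' recordings go to the test set; none of these details are specified in the text. Because whole recordings are assigned, the per-class proportions of segments are only approximately preserved.

```python
import random
from collections import defaultdict


def split_by_recording(segments, test_fraction=0.2, seed=0):
    """Split segments so that all segments of a recording land on one side.

    segments: list of (segment_id, recording_id, species) tuples (hypothetical layout).
    Recordings are sampled species by species to keep every class on both sides.
    """
    rng = random.Random(seed)
    recordings_by_species = defaultdict(set)
    for _, recording_id, species in segments:
        recordings_by_species[species].add(recording_id)

    test_recordings = set()
    for species in sorted(recordings_by_species):
        recordings = sorted(recordings_by_species[species])
        rng.shuffle(recordings)
        # At least one test recording per species, but never all of them.
        n_test = min(len(recordings) - 1, max(1, round(len(recordings) * test_fraction)))
        test_recordings.update(recordings[:n_test])

    train = [s for s in segments if s[1] not in test_recordings]
    test = [s for s in segments if s[1] in test_recordings]
    return train, test
```

A species with only one recording stays entirely in the training set under this rule, which is one possible design choice among several.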

3. Methodology

3.1. Proposed 1D CNN Architecture

1D CNNs have been widely used for analyzing raw audio signals because they can directly learn patterns from waveform data without requiring frequency-domain transformations. This allows the model to preserve the original structure of acoustic signals while reducing preprocessing requirements. For example, Abdoli et al. [31] demonstrated that 1D CNN architectures can effectively extract representations directly from raw audio waveforms for classification tasks.

Motivated by these advantages, this study adopts a 1D CNN architecture for classifying marine mammal vocalizations from raw audio signals. Figure 3 illustrates the proposed 1D CNN architecture for classifying the five marine mammal species in our dataset. The network consists of two main components: a feature extraction module and a classification module.

The feature extraction module employs a sequence of convolutional layers with a kernel size of 3, a stride of 1, and padding of 1. Each convolutional layer is followed by instance normalization, a SiLU activation function, and a max pooling layer. The convolutional layers have increasing channel dimensions of 16, 32, and 64, allowing the network to progressively extract higher-level temporal features from the input audio signals. The classification module consists of fully connected layers, where the first layer maps the extracted features into a smaller dimensional space with 64 units, followed by a ReLU activation function. The second fully connected layer outputs the final predictions corresponding to the five classes considered in this study.
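To see why a fixed input length matters for this architecture, the feature-map length can be traced through the three convolution and pooling stages. The sketch below is illustrative only: the pooling size of 2 and the example input lengths are assumptions not given in the text, and the function names are hypothetical.

```python
def conv1d_output_length(length, kernel_size=3, stride=1, padding=1):
    """Output length of a 1D convolution (kernel 3, stride 1, padding 1 keeps the length)."""
    return (length + 2 * padding - kernel_size) // stride + 1


def pool1d_output_length(length, pool_size=2):
    """Output length of max pooling, assuming stride equals pool_size and no padding."""
    return length // pool_size


def flattened_feature_size(input_length, channels=(16, 32, 64), pool_size=2):
    """Number of values entering the first fully connected layer."""
    length = input_length
    for _ in channels:
        length = conv1d_output_length(length)
        length = pool1d_output_length(length, pool_size)
    return channels[-1] * length


print(flattened_feature_size(8000))   # 64000
print(flattened_feature_size(12000))  # 96000
```

Under these assumptions, changing the input length changes the size of the flattened feature vector, so a fully connected layer with a predefined input dimension can only accept one input length.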

3.2. Variable Duration Signal Handling Framework

Although 1D CNN architectures are capable of learning features directly from raw waveform data, CNN models typically require inputs with fixed dimensions. Figure 4 illustrates the distribution of total durations for all audio signals in the dataset after segmentation. Real-world biological signals usually vary significantly in duration, which makes them difficult to input directly into CNN layers without additional preprocessing. Conventional methods typically address this issue by either cropping a single important portion from the original signal or applying padding to standardize signal lengths, both of which can distort the signal. Thus, studies have tried to address this limitation imposed by the CNN input layer by dividing the audio signal into multiple fixed-length frames using a sliding window of appropriate size, as demonstrated in [31].

However, a wide range of signal durations makes the sliding window approach computationally demanding, particularly for longer recordings. To reduce this computational burden, a longer sliding window can be employed to minimize the number of segments generated for very long signals. While this may be effective, it requires padding shorter signals to match the sliding window length, potentially introducing artificial data that does not accurately represent the original signal.

To address the computational cost while still utilizing as much information from the audio signals as possible, our framework employs a random cropping method. This involves extracting up to N random portions of fixed length L from each signal. Each cropped portion is then processed by the CNN model to generate a class prediction. The final classification is determined by aggregating these predictions through majority voting. This strategy, similar to ensemble learning, helps improve classification robustness by incorporating information from different regions of the signal [32]. Figure 5 illustrates the overall framework, showing how random cropping is applied and processed through the model to generate class predictions, enabling adaptive processing of signals with varying lengths.
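The aggregation step can be illustrated with a short majority-voting function. The tie-breaking rule (the label encountered first wins) is an assumption, since the text does not state how ties are resolved, and the labels in the example are placeholders.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent predicted label.

    Ties are resolved in favor of the label that appears first (an assumption).
    """
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]


# Hypothetical per-portion predictions for one recording.
portion_predictions = ["harp_seal", "bearded_seal", "harp_seal", "harp_seal"]
print(majority_vote(portion_predictions))  # harp_seal
```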

The fixed length L is determined by the shortest signal in the dataset. By doing so, no padding is required, preserving the integrity of the original data. In practice, shorter signals yield fewer than N portions, ensuring that the method adapts naturally to the signal’s duration while maintaining computational efficiency. In addition, different portions are randomly cropped during each training iteration, allowing the model to be exposed to a wide range of regions within each recording and learn representative features from different parts of the signal as training progresses.
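A minimal sketch of the random cropping step is given below. The rule used to decide how many portions a short signal yields, `min(N, len(signal) // L)`, is an assumption, as is allowing portions to overlap; the text only states that shorter signals yield fewer than N portions. The function name is hypothetical.

```python
import numpy as np


def random_crops(signal, crop_length, max_crops, rng):
    """Extract up to max_crops random fixed-length portions from a 1D signal.

    The number of portions is min(max_crops, len(signal) // crop_length),
    which is an assumed rule. Portions may overlap.
    """
    signal = np.asarray(signal)
    if len(signal) < crop_length:
        raise ValueError("signal is shorter than crop_length")
    n_crops = min(max_crops, len(signal) // crop_length)
    last_start = len(signal) - crop_length
    # The upper bound of rng.integers is exclusive, hence last_start + 1.
    starts = rng.integers(0, last_start + 1, size=n_crops)
    return np.stack([signal[s:s + crop_length] for s in starts])


rng = np.random.default_rng(0)
long_signal = np.arange(1000)
short_signal = np.arange(250)
print(random_crops(long_signal, 100, 6, rng).shape)   # (6, 100)
print(random_crops(short_signal, 100, 6, rng).shape)  # (2, 100)
```

Calling the function again in each training iteration draws new start positions, which is how different regions of a recording are seen as training progresses.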

3.3. Data Augmentation

Another key challenge in this study is class imbalance, which is an inevitable characteristic as the dataset originates from real-world acoustic recordings, where certain species are naturally underrepresented. Such an imbalance can lead to biased model predictions, where the model favors majority classes, ultimately affecting both training and evaluation performance. To address this issue, we employed data augmentation, a widely used technique that modifies the original data to generate additional synthetic samples, enhancing dataset diversity [33].

Specifically, we utilized two augmentation methods: time shifting and amplitude scaling. Time shifting repositions the waveform in time by randomly translating the signal to the left or right within a specified range. Because the recordings were segmented around vocalization events, the starting positions of the vocalizations within the input segments may become relatively consistent during training. However, in real acoustic recordings, vocalizations can occur at any position within the signal. Time shifting helps mitigate this positional bias by allowing the CNN to observe the same vocalization at different temporal positions, enabling the model to learn the acoustic pattern itself rather than relying on its fixed location within the segment. In our implementation, each signal is randomly shifted by up to 10% of its total length.

Amplitude scaling modifies the amplitude of the waveform by multiplying it by a randomly chosen scaling factor. Although RMS normalization was applied to ensure that amplitudes fall within a comparable range, variations may still remain due to inherent differences. Amplitude scaling helps simulate these variations in signal strength, allowing the model to become less sensitive to absolute amplitude differences and instead focus on the underlying acoustic patterns of the vocalizations. For this study, the scaling factor is randomly selected between 0.5 and 1.5.
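The two augmentations can be sketched in a few lines of NumPy. How the samples vacated by a shift are handled is not stated in the text; this sketch assumes a circular shift (`np.roll`), and zero-filling would be an equally plausible choice. The function names are hypothetical.

```python
import numpy as np


def time_shift(signal, rng, max_fraction=0.10):
    """Shift a 1D signal left or right by up to max_fraction of its length.

    Uses a circular shift, which is an assumption about the edge handling.
    """
    signal = np.asarray(signal, dtype=np.float64)
    max_shift = int(len(signal) * max_fraction)
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(signal, shift)


def amplitude_scale(signal, rng, low=0.5, high=1.5):
    """Multiply a 1D signal by a factor drawn uniformly from [low, high)."""
    factor = rng.uniform(low, high)
    return np.asarray(signal, dtype=np.float64) * factor


rng = np.random.default_rng(0)
waveform = np.sin(np.linspace(0.0, 20.0, 4000))
augmented = amplitude_scale(time_shift(waveform, rng), rng)
print(augmented.shape)  # (4000,)
```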

Figure 6 shows an example of an audio signal before and after time shifting and amplitude scaling. Augmentation is implemented on the fly, meaning that augmented samples are not pre-generated but created dynamically as part of the training process. This approach helps conserve memory resources and eliminates the need to store large numbers of pre-generated augmented samples.

A probabilistic scheme is also applied in which underrepresented classes have a higher probability of augmentation, while major classes have a lower probability, ensuring a more balanced representation across all classes in the dataset. Specifically, probabilities are derived based on the class distribution in the training data. For each class, an initial weight is computed as the inverse of the number of training samples belonging to that class, such that underrepresented classes receive higher weights. These weights are then linearly normalized to a bounded probability range, with a minimum probability of 0.5 and a maximum probability of 0.8. As a result, the majority class is assigned a probability of 0.5, the most underrepresented class is assigned a probability of 0.8, and all remaining classes receive probabilities between these bounds based on their relative weights.
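The probability assignment described above amounts to a linear rescaling of inverse class counts. The sketch below follows that description; the class counts are hypothetical, and returning the minimum probability when all classes are equally sized is an assumption for a case the text does not cover.

```python
def augmentation_probabilities(class_counts, p_min=0.5, p_max=0.8):
    """Map inverse class counts linearly onto the range [p_min, p_max].

    class_counts: dict mapping class name to number of training samples.
    """
    weights = {name: 1.0 / count for name, count in class_counts.items()}
    w_min, w_max = min(weights.values()), max(weights.values())
    if w_max == w_min:
        # All classes equally frequent: fall back to p_min (assumed behavior).
        return {name: p_min for name in weights}
    scale = (p_max - p_min) / (w_max - w_min)
    return {name: p_min + (w - w_min) * scale for name, w in weights.items()}


# Hypothetical training counts, not the values from this study.
counts = {"class_a": 400, "class_b": 200, "class_c": 100}
for name, p in augmentation_probabilities(counts).items():
    print(f"{name}: {p:.2f}")
```

With these illustrative counts, the largest class receives 0.5, the smallest receives 0.8, and the intermediate class receives 0.6.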

4. Application

To evaluate the performance of our proposed method, it is applied to our dataset under three different cases:

Case 1: Random cropping is applied to extract a single portion of length L from each audio sample without any data augmentation. This serves as the base model for comparison with the other cases.
Case 2: Random cropping is extended to extract at most N portions of length L from each audio sample, still with no data augmentation.
Case 3: Random cropping of at most N portions of length L for each audio sample is combined with data augmentation techniques.

For all three cases, the cropped portion length L is fixed according to the shortest signal in the dataset, as previously noted, to avoid unnecessary padding.

During training, model performance is assessed using k-fold cross-validation, a commonly used technique that provides a more reliable estimate of model performance [31]. In our study, we employed 5-fold cross-validation, where the training set is divided into five subsets (folds). For each of the five runs, four folds are used for training and the remaining fold for validation. This process yields five trained models, and the performance metrics from the validation folds are averaged to produce a more robust estimate of the model’s generalization performance, reducing potential bias and variability introduced by relying on a single train–validation split. The overall process is illustrated in Figure 7.
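The fold construction can be sketched as a small index generator. This is a plain shuffled k-fold; whether the folds in this study were additionally stratified or grouped by recording is not stated, so the sketch makes no such attempt. The function name and seed are illustrative.

```python
import numpy as np


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        validation_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, validation_idx


for fold, (train_idx, val_idx) in enumerate(k_fold_indices(23), start=1):
    print(f"fold {fold}: {len(train_idx)} training, {len(val_idx)} validation samples")
```

Each sample appears in exactly one validation fold, so averaging the five validation scores uses every training sample once for validation.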

During training, key metrics, namely the training and validation losses and accuracy, are monitored continuously. For each training run, the model with the highest validation accuracy is saved, resulting in a total of five trained models, which are then used to produce the final predictions on the test set. Additionally, we implemented an early stopping strategy, in which training is stopped if there is no improvement in validation accuracy after a specified number of training steps [34]. This helps prevent overfitting and reduces training time by avoiding unnecessary additional training once the model has converged.
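The checkpointing and early stopping logic can be sketched with a small helper class. The patience value is an assumption, since the text only refers to "a specified number of training steps", and the class name and accuracy values are hypothetical.

```python
class EarlyStopping:
    """Track the best validation accuracy and signal when to stop training."""

    def __init__(self, patience=10):
        self.patience = patience  # illustrative value, not from the paper
        self.best_accuracy = float("-inf")
        self.steps_without_improvement = 0

    def update(self, val_accuracy):
        """Record one validation result; return (is_new_best, should_stop)."""
        if val_accuracy > self.best_accuracy:
            self.best_accuracy = val_accuracy
            self.steps_without_improvement = 0
            return True, False
        self.steps_without_improvement += 1
        return False, self.steps_without_improvement >= self.patience


# Hypothetical validation accuracies from successive evaluations.
stopper = EarlyStopping(patience=2)
for step, accuracy in enumerate([0.60, 0.70, 0.69, 0.70, 0.80]):
    is_new_best, should_stop = stopper.update(accuracy)
    if is_new_best:
        pass  # a real training loop would save the model parameters here
    if should_stop:
        print(f"stopped at step {step}, best accuracy {stopper.best_accuracy}")
        break
```

In this toy run, training stops at step 3 because two consecutive evaluations fail to exceed the best accuracy of 0.70.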

The five trained models with optimal parameters, determined by the highest validation accuracy during each training run, are evaluated on the test dataset to assess generalization capability. These models are aggregated using majority voting across their predictions to produce the final predictions on the test set. As the test set is initially separated from the original dataset, it is not included in the training process, ensuring an unbiased evaluation. However, to enable an accurate comparison across the three cases, the test set uses fixed cropped portions, in which the same portions are consistently extracted from each audio sample, processed through the trained models, and aggregated to make the final class prediction. These portions are evenly distributed across each signal, preventing bias toward any specific region of the data. In other words, unlike the training set, random cropping is not applied to the test set, maintaining consistency and ensuring an unbiased performance evaluation. Figure 8 summarizes the overall process of our method, from data preparation to evaluation.
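The fixed test-time portions can be sketched as evenly spaced crops. The exact placement rule is not given in the text; this sketch assumes start positions spread linearly from the beginning of the signal to the last valid start, with the same assumed count rule `min(N, len(signal) // L)`. The function name is hypothetical.

```python
import numpy as np


def evenly_spaced_crops(signal, crop_length, max_crops):
    """Extract up to max_crops fixed-length portions at evenly spaced positions.

    The result is deterministic, so repeated evaluations see identical portions.
    """
    signal = np.asarray(signal)
    if len(signal) < crop_length:
        raise ValueError("signal is shorter than crop_length")
    n_crops = min(max_crops, len(signal) // crop_length)
    last_start = len(signal) - crop_length
    starts = np.linspace(0, last_start, num=n_crops).astype(int)
    return np.stack([signal[s:s + crop_length] for s in starts])


crops = evenly_spaced_crops(np.arange(1000), crop_length=100, max_crops=6)
print(crops[:, 0])  # [  0 180 360 540 720 900]
```

Because no random numbers are involved, the same portions are extracted for every case being compared.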

Performance evaluation of the trained model over the test set is conducted using a confusion matrix, which provides a detailed breakdown of the model’s classification performance across all classes [35]. Each row of the confusion matrix represents the true class labels, while each column corresponds to the predicted class labels. The diagonal elements represent correctly classified samples, while the off-diagonal elements correspond to misclassifications between classes. This representation allows direct analysis of class-specific performance and shows which classes are commonly confused by the model.

Furthermore, performance metrics including precision, recall, and F-score were computed to provide a quantitative evaluation of the model’s classification performance. Precision measures the proportion of correctly predicted samples among all samples predicted for a given class, while recall represents the proportion of correctly identified samples among all true instances of that class. The F-score is defined as the harmonic mean of precision and recall. These metrics were calculated for each class to provide a more detailed assessment of the model’s performance across different marine mammal species.
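These per-class metrics follow directly from the confusion matrix, with rows as true classes and columns as predicted classes. The sketch below uses a toy two-class matrix with made-up counts, not results from this study; the function name is hypothetical.

```python
import numpy as np


def per_class_metrics(confusion):
    """Compute per-class precision, recall, and F-score.

    confusion[i, j] is the number of samples of true class i predicted as class j.
    """
    confusion = np.asarray(confusion, dtype=np.float64)
    true_positive = np.diag(confusion)
    predicted_total = confusion.sum(axis=0)  # column sums: samples predicted as each class
    actual_total = confusion.sum(axis=1)     # row sums: true samples of each class
    precision = np.divide(true_positive, predicted_total,
                          out=np.zeros_like(true_positive), where=predicted_total > 0)
    recall = np.divide(true_positive, actual_total,
                       out=np.zeros_like(true_positive), where=actual_total > 0)
    denominator = precision + recall
    f_score = np.divide(2 * precision * recall, denominator,
                        out=np.zeros_like(true_positive), where=denominator > 0)
    return precision, recall, f_score


# Toy confusion matrix with made-up counts.
toy = [[8, 2],
       [1, 9]]
precision, recall, f_score = per_class_metrics(toy)
print(np.round(recall, 3))  # [0.8 0.9]
```

Classes that are never predicted or never present receive a value of zero rather than raising a division error, which is a convention chosen for this sketch.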

5. Results & Discussion

5.1. Determination of the Optimal Number of Cropped Portions N

Determining the optimum number of cropped portions N remains challenging, as it is a user-defined hyperparameter that must be specified prior to training. A small value of N may limit the amount of information utilized, especially for longer recordings, whereas a very large value may increase computational cost without providing significant improvements in performance. Therefore, selecting an appropriate value of N is necessary to balance classification performance and computational efficiency.

In this study, the optimal value of N was determined by evaluating several candidate values using the training dataset. For each value, the model was trained and evaluated using a 5-fold cross-validation procedure. In this process, the training data were divided into five folds, where four folds were used for training and the remaining fold was used for validation. The validation accuracies obtained from the five folds were then averaged to provide a robust estimate of model performance for each value of N.

Figure 9 shows the validation accuracy obtained for different values of N, where the mean accuracy across the five cross-validation folds is presented with the corresponding standard deviation. Based on the results from this dataset, the optimal value of N was determined to be six. Therefore, this value was adopted for the Case 2 and Case 3 scenarios. It should be noted, however, that this value is dataset-dependent. For different datasets, the optimal value of N may vary and should therefore be re-evaluated accordingly.

5.2. Performance Evaluation Across Experimental Cases

Table 3, Table 4, and Table 5 summarize the performance metrics for Case 1, Case 2, and Case 3 on the test set, respectively, while Figure 10 presents the corresponding confusion matrices for each of the three cases. Across all cases, the highest classification accuracy is observed for walruses, which are almost perfectly classified. This is likely due to their distinctive vocalizations, which differ substantially from those of the other species.

In Case 1, the overall accuracy is approximately 0.853, indicating a strong baseline performance. Figure 11 illustrates sample spectrograms showing the frequency characteristics of the raw waveform signals. Although the proposed model operates directly on raw waveforms, the convolutional filters implicitly capture frequency-related information, enabling the model to distinguish between different vocalizations to a certain extent. However, a noticeable number of misclassifications are observed across several classes. As shown in the class-level performance metrics and confusion matrix, the Bowhead whale class exhibits high precision but relatively low recall, indicating that while predictions labeled as Bowhead whale are usually correct, many true Bowhead whale vocalizations are misclassified as Humpback whale vocalizations. A similar pattern is observed for the seal species: the Bearded seal and Harp seal classes show lower precision and recall values because the two species are frequently misclassified as each other. The use of a single cropped portion limits the amount of acoustic information captured from each signal, particularly for longer recordings, which reduces the model’s ability to capture species-specific patterns and increases the likelihood of confusion between species. In addition, class imbalance may further contribute to this issue, as reflected in the higher misclassification rates observed for minority classes compared with more prevalent classes.

In Case 2, the overall accuracy improved to 0.956, indicating a substantial improvement in classification performance when multiple randomly cropped segments are used. As shown in the class-level performance metrics, the F-score increased for all five species, although Harp seal recall decreased from 0.854 to 0.750 and Walrus precision decreased from 1.000 to 0.988 (Table 3 and Table 4). Overall, this suggests that the model is able to detect and classify vocalization patterns more reliably. The use of multiple randomly cropped portions allows the model to capture information from different temporal regions of the signal, enabling a better representation of the acoustic characteristics of each species. However, misclassifications among the seal species persist, as shown in the confusion matrix, suggesting that class imbalance may remain a contributing factor.
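The multi-crop idea can be sketched in a few lines: draw N random fixed-length portions from one signal, score each portion, and combine the per-portion outputs into one prediction. The sketch below is illustrative only. It assumes the signal is at least as long as the crop length, that aggregation is a simple mean of class-probability vectors, and it uses a hypothetical `toy_model` in place of the 1D CNN; the paper's actual aggregation rule and handling of short signals may differ.

```python
import random
from typing import Callable, List, Sequence


def random_crops(signal: Sequence[float], crop_len: int, n_crops: int,
                 rng: random.Random) -> List[Sequence[float]]:
    """Draw n_crops windows of crop_len samples at random start positions."""
    if len(signal) < crop_len:
        raise ValueError("signal shorter than crop length (not handled in this sketch)")
    max_start = len(signal) - crop_len
    crops = []
    for _ in range(n_crops):
        start = rng.randint(0, max_start)  # inclusive on both ends
        crops.append(signal[start:start + crop_len])
    return crops


def aggregate_mean(prob_vectors: Sequence[Sequence[float]]) -> int:
    """Average the per-crop class probabilities and return the argmax class."""
    n_classes = len(prob_vectors[0])
    mean_probs = [sum(p[c] for p in prob_vectors) / len(prob_vectors)
                  for c in range(n_classes)]
    return max(range(n_classes), key=mean_probs.__getitem__)


def predict_multi_crop(signal: Sequence[float], crop_len: int, n_crops: int,
                       model: Callable[[Sequence[float]], List[float]],
                       rng: random.Random) -> int:
    """Score every crop with the model and aggregate into one class index."""
    crops = random_crops(signal, crop_len, n_crops, rng)
    return aggregate_mean([model(crop) for crop in crops])


def toy_model(crop: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for the CNN: two-class score from mean amplitude."""
    level = min(1.0, sum(abs(x) for x in crop) / len(crop))
    return [1.0 - level, level]


rng = random.Random(0)
signal = [0.9] * 1000  # placeholder signal, not real audio
predicted_class = predict_multi_crop(signal, crop_len=100, n_crops=6,
                                     model=toy_model, rng=rng)
```

With a single crop the prediction depends entirely on where that one window happens to fall; averaging over several windows reduces this dependence, which is consistent with the gain observed between Case 1 and Case 2.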

In Case 3, the overall accuracy further improved to 0.970 after applying data augmentation techniques. However, the improvement from Case 2 to Case 3 (0.014) is considerably smaller than the improvement from Case 1 to Case 2 (0.103). This indicates that the primary performance gain is achieved through the proposed multi-cropping and aggregation framework, while data augmentation mainly provides an additional benefit by partially mitigating the effects of class imbalance in the dataset. Figure 12 illustrates the distribution of the best validation accuracy obtained from each fold for the three experimental cases. As the cases progress, the distribution shifts toward higher accuracy values and becomes more consistent across folds, indicating improved classification performance when multiple cropped portions and data augmentation are applied.
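For readers unfamiliar with the two augmentation operations shown in Figure 6 (time shifting and amplitude scaling), a minimal waveform-level sketch follows. The circular shift, the parameter names, and the default ranges are assumptions made for illustration; the paper's exact augmentation settings are not reproduced here.

```python
import random
from typing import List, Sequence, Tuple


def time_shift(signal: Sequence[float], shift: int) -> List[float]:
    """Circularly shift the waveform by `shift` samples (positive = delay).

    A circular shift is an assumption of this sketch; zero-filling the
    vacated samples would be an equally plausible variant.
    """
    n = len(signal)
    if n == 0:
        return []
    k = shift % n
    return list(signal[n - k:]) + list(signal[:n - k])


def amplitude_scale(signal: Sequence[float], gain: float) -> List[float]:
    """Multiply every sample by a constant gain."""
    return [gain * x for x in signal]


def augment(signal: Sequence[float], rng: random.Random,
            max_shift: int = 100,
            gain_range: Tuple[float, float] = (0.8, 1.2)) -> List[float]:
    """Apply a random time shift followed by a random amplitude scaling."""
    shift = rng.randint(-max_shift, max_shift)
    gain = rng.uniform(*gain_range)
    return amplitude_scale(time_shift(signal, shift), gain)


rng = random.Random(0)
original = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]  # placeholder waveform
augmented = augment(original, rng, max_shift=3)
```

Both operations leave the signal length unchanged, so augmented copies of minority-class signals can pass through the same cropping and classification steps as the originals.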

5.3. Limitations and Future Work

Overall, the results of this study demonstrate the potential of a classification framework that operates directly on raw waveforms without requiring preprocessing steps such as frequency-domain transformations or signal padding. However, there are several limitations of the current study that must be addressed. First, the dataset used in this study is relatively small and includes a limited number of marine mammal species. As such, the present work primarily serves as a baseline investigation for evaluating the feasibility of a raw-waveform-based 1D CNN approach. Future studies should extend the framework to larger and more diverse datasets encompassing a broader range of species.

Second, this study focuses exclusively on marine mammal vocalizations. In real ocean environments, acoustic recordings also contain various sources of background noise, including abiotic noise (e.g., waves and ice movement) and anthropogenic noise (e.g., ships). Incorporating these additional sound sources into the classification framework will be important for improving its robustness and real-world applicability.

Finally, a single marine mammal species often produces multiple vocalization types and call patterns, each serving a different behavioral function. The present study focuses on species-level classification and does not distinguish between different call types produced by the same species. Future work should therefore consider call-type-level classification in addition to species-level identification. Such an approach would enable more detailed ecological analysis of marine mammal communication and behavior.

6. Conclusions

This study presents a methodological analysis of a 1D CNN framework that classifies marine mammal vocalizations directly from raw audio waveforms and accounts for varying signal lengths. In contrast to conventional preprocessing strategies such as padding, single cropping, or sliding window segmentation used to satisfy the fixed input size requirements of CNNs, this study employs random cropping to extract different segments of the audio signal during training. This approach aims to balance computational efficiency with the preservation of the original signal characteristics.

The framework was evaluated under three experimental scenarios to examine how different signal sampling strategies and data augmentation influence classification performance. The results indicate that using multiple random crops allows the model to utilize a broader range of acoustic information within each signal, while the inclusion of data augmentation improves the classification of underrepresented species by mitigating class imbalance.

Although the experiments were conducted on a relatively small dataset with a limited number of species, the findings provide insights into how raw waveforms can be applied to bioacoustic classification tasks. Future work should extend this analysis to larger datasets that incorporate additional ocean acoustic sources, such as environmental and anthropogenic noise, and should investigate the classification of different call patterns within species. Such extensions will help improve the robustness and practical applicability of automated classification systems.

Author Contributions

W.-K.K.: Conceptualization, Methodology, Software, Visualization, Writing—Original Draft. D.L.: Resources, Supervision, Writing—Review & Editing, Data QC. H.S.B.: Resources, Project administration, Funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Agency For Defense Development by the Korean Government (UE240509DD).

Data Availability Statement

The raw acoustic recordings used in this study are publicly available from the Watkins Marine Mammal Sound Database (WHOI) and can be accessed through its official website (https://www.whoi.edu/; accessed on 2 April 2025). The code developed for this study is publicly available at https://github.com/wkk-0505/1DCNN_MarineMammalClassification.git (accessed on 19 February 2026).

Acknowledgments

The authors would like to acknowledge the Watkins Marine Mammal Sound Database for sharing their data publicly. ChatGPT (based on GPT-5.3, OpenAI) was used for English language polishing to improve grammar and readability.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Sample raw waveforms from the five target species (Bearded Seal, Bowhead Whale, Harp Seal, Humpback Whale, and Walrus), obtained from the Watkins Marine Mammal Sound Database.

Figure 2. Class distribution of the number of audio signal segments across the five classes in the dataset after segmentation.

Figure 3. Proposed one-dimensional CNN (1D CNN) architecture.

Figure 4. Distribution of total durations for all audio signals in our dataset. The y-axis is shown on a logarithmic scale to account for the large variation in the number of signals across duration ranges.

Figure 5. Illustration of the proposed random cropping method for processing signals of varying lengths. Up to N portions of length L are randomly cropped during each training process, with each portion processed through the proposed CNN model, and the results are aggregated to make the final class prediction.

Figure 6. Comparison of original audio data with augmentation techniques. The left panel shows the original data (black) and the data after time shifting (blue). The right panel shows the original data (black) and the data after amplitude scaling (red).

Figure 7. Overview of the 5-fold cross-validation process. The dataset is first split into independent training and test sets. The training set is further divided into five folds. In each run, four folds are used for training (blue) and one for validation (yellow). This process is repeated five times so that each fold serves once as the validation set. The test set (green) remains separate and is used only for final evaluation.

Figure 8. Flowchart of the data preparation and evaluation process for our proposed method.

Figure 9. Validation accuracy for different values of the number of cropped portions N used in the random cropping framework. Mean accuracy across the five cross-validation folds is shown, with error bars indicating the standard deviation.

Figure 10. Normalized confusion matrix for each case: (a) Case 1, (b) Case 2, (c) Case 3. The diagonal elements indicate correct classifications, while off-diagonal elements represent misclassifications.

Figure 11. Spectrograms of representative acoustic signals from the five marine mammal classes shown in Figure 1, resampled to a standardized sampling frequency of 40 kHz. Time is shown on the horizontal axis, frequency is shown on the vertical axis, and color indicates signal amplitude in decibels.

Figure 12. Distribution of the best validation accuracy from each fold during 5-fold cross-validation for Case 1, Case 2, and Case 3. Circular markers indicate that the minimum value equals the first quartile.

Table 1. Overview of the dataset used in this study, including the number of raw recordings, total recording duration, and original sampling frequencies for the target species.

Class	Number of Raw Data Files	Total Duration (s)
Bearded Seal	37	348.80
Bowhead Whale	125	3815.56
Harp Seal	47	390.43
Humpback Whale	12	5821.74
Walrus	38	134.24

Original sampling frequency (kHz): varies by sample, ranging from approximately 5 to 44.

Table 2. Number of signal segments per class in the training and test datasets.

Class	Training Dataset	Test Dataset
Bearded Seal	59	26
Bowhead Whale	329	311
Harp Seal	223	68
Humpback Whale	875	750
Walrus	320	223

Table 3. Performance metrics over the test dataset for Case 1.

Class	Precision	Recall	F-Score
Bowhead Whale	0.934	0.583	0.718
Humpback Whale	0.807	0.977	0.884
Bearded Seal	0.538	0.412	0.467
Harp Seal	0.774	0.854	0.812
Walrus	1.000	0.986	0.979

Overall accuracy: 0.853

Table 4. Performance metrics over the test dataset for Case 2.

Class	Precision	Recall	F-Score
Bowhead Whale	0.990	0.924	0.956
Humpback Whale	0.957	0.998	0.977
Bearded Seal	0.580	0.853	0.693
Harp Seal	0.960	0.750	0.842
Walrus	0.988	0.988	0.988

Overall accuracy: 0.956

Table 5. Performance metrics over the test dataset for Case 3.

Class	Precision	Recall	F-Score
Bowhead Whale	0.990	0.975	0.982
Humpback Whale	0.980	0.997	0.988
Bearded Seal	0.638	0.882	0.741
Harp Seal	0.986	0.760	0.859
Walrus	0.980	0.996	0.988

Overall accuracy: 0.970

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
