Article

Characterization and Automated Classification of Underwater Acoustic Environments in the Western Black Sea Using Machine Learning Techniques

by Maria Emanuela Mihailov 1,2
1 Research-Development and Innovation Center, Maritime Hydrographic Directorate “Comandor Alexandru Catuneanu”, Fulgerului Street No. 1, 900218 Constanta, Romania
2 Department of Oceanography, Coastal and Marine Engineering, National Institute for Marine Research and Development (NIMRD) “Grigore Antipa”, 300 Mamaia Blvd., 900581 Constanta, Romania
J. Mar. Sci. Eng. 2025, 13(7), 1352; https://doi.org/10.3390/jmse13071352
Submission received: 17 June 2025 / Revised: 11 July 2025 / Accepted: 15 July 2025 / Published: 16 July 2025
(This article belongs to the Section Ocean Engineering)

Abstract

Growing concern over anthropogenic underwater noise, highlighted by initiatives such as the Marine Strategy Framework Directive (MSFD) and its Technical Group on Underwater Noise (TG Noise), draws attention to regions like the Western Black Sea, where rapid growth in maritime traffic and resource exploitation is intensifying concerns over noise impacts on unique marine habitats. While machine learning offers promising solutions, a research gap persists in comprehensively evaluating diverse ML models within an integrated framework for complex underwater acoustic data, particularly concerning real-world data limitations like class imbalance. This paper addresses this by presenting a multi-faceted framework using passive acoustic monitoring (PAM) data from fixed locations (50–100 m depth). Acoustic data are processed using advanced signal processing (broadband Sound Pressure Level (SPL), Power Spectral Density (PSD)) for feature extraction (Mel-spectrograms for deep learning; PSD statistical moments for classical/unsupervised ML). The framework evaluates Convolutional Neural Networks (CNNs), Random Forest, and Support Vector Machines (SVMs) for noise event classification, alongside Gaussian Mixture Models (GMMs) for anomaly detection. Our results demonstrate that the CNN achieved the highest classification accuracy of 0.9359, significantly outperforming Random Forest (0.8494) and SVM (0.8397) on the test dataset. These findings emphasize the capability of deep learning in automatically extracting discriminative features, highlighting its potential for enhanced automated underwater acoustic monitoring.

1. Introduction

The underwater acoustic environment is a complex and dynamic system that holds significant interest for researchers due to its ecological and navigational implications. The characterization and automated classification of underwater acoustic environments, particularly in regions such as the Western Black Sea, have become increasingly practical with advancements in machine learning (ML) techniques.
The importance of machine learning in the analysis of underwater acoustics cannot be overstated, as it facilitates the automated classification of complex and diverse soundscapes. For instance, Orescanin et al. (2022) highlight how these techniques can be applied for real-time environmental monitoring, allowing for the identification and tracking of marine life and anthropogenic activities [1]. Furthermore, Yang H. et al. (2020) emphasize the growing field of underwater acoustic research utilizing machine learning, particularly in applications involving passive sonar, showcasing methods that can significantly enhance underwater environmental assessments [2,3].
Another aspect of characterizing underwater environments is the extraction and utilization of acoustic features from collected data. The study by Santos and Calazan presents enhanced spectral dynamic features that improve classification rates for marine vessels [4]. This aligns with Domingos et al.’s investigation, which explores the efficacy of preprocessing filters and deep learning methodologies tailored for underwater sound classification, highlighting challenges related to low signal-to-noise ratios in these environments [5].
In addition, variables such as water temperature, salinity, and depth further complicate the classification task. Particularly, the study by Ahmed and Younis examines the role of modulation signals under varying conditions and demonstrates how efficient classification processes can facilitate underwater acoustic communications [6]. Similarly, the work by Ahmed A. and Younis M. (2019) explains how the characterization of underwater acoustic signals is influenced by propagation characteristics specific to the marine environment, adding additional layers to the classification efforts [7]. The physical properties of underwater environments, which are essential for sound propagation, are significant for classification algorithms. For instance, the findings on acoustic beam characterization and the effect of varying medium properties highlight the complexity involved in developing algorithms that are robust against changing environmental conditions [8]. Supporting this, Chen et al.’s comparative analysis of various filter bank techniques reveals insights on how specific underwater conditions can affect sound source localization and classification accuracy, thereby accentuating the need for adaptable methodologies [9].
Moreover, the application of deep learning techniques has shown promise in significantly enhancing the precision of underwater acoustic detection and classification. Hinojosa et al.’s work on automated detection systems has set a precedent for real-time monitoring of marine environments, demonstrating the potential for machine learning to facilitate ecological interventions and improve marine conservation strategies [10]. Such capabilities are particularly relevant in pollution control and habitat monitoring, where accurate sound classification can yield precise data regarding fauna and environmental changes.
Further, Malfante et al. investigate the automatic classification of fish sounds by employing feature sets that effectively represent underwater acoustic data [11]. Their results demonstrate that utilizing convolutional neural networks can streamline the processing of acoustic signals, thereby enhancing the detection capabilities within marine monitoring systems [11,12]. These methodologies align with ongoing initiatives that aim to leverage acoustic modeling for wide-scale ecological assessments.
Additionally, the intersection of acoustics and bioacoustics is becoming increasingly important. Research by Montgomery and Radford explores how marine bioacoustics aids in understanding the acoustic environment, providing insights into species interactions and behaviors [13]. This holistic approach emphasizes the importance of sound as a communicative and navigational tool for marine organisms, thereby influencing classification techniques that aim to represent the underwater soundscape accurately.
Another development is represented by methodologies that apply deep learning models developed for underwater sound recognition, as detailed in the hardware and software developments by Yang et al. (2020) [3]. Their work demonstrates how structured feature extraction significantly enhances classification efficacy, creating a more actionable data landscape for monitoring aquatic environments. The integration of advanced computational techniques, such as support vector machines with feature selection methods, demonstrates the ability to refine the classification accuracy of marine species’ vocalizations [14,15].
The implications of these developments extend beyond simple classification; they enable enhanced ecological monitoring capabilities. For example, implementations of passive acoustic monitoring provide dynamic insights into marine biodiversity, stresses induced by climate change, and habitat disruptions, and serve as essential milestones in marine research [16,17].
Our original contributions primarily lie in several significant aspects. First, we introduce a comprehensive integrated framework that connects all stages of underwater noise analysis, from detailed acoustic metric computation (broadband SPL, PSD) and multi-faceted feature engineering (e.g., Mel spectrograms for deep learning, statistical moments from PSD for classical machine learning) to the application and systematic comparison of diverse machine learning paradigms. This holistic pipeline, which links established acoustic principles with data-driven analytics, enables passive acoustic monitoring in complex marine environments [5,18].
Second, we suggest a direct quantitative comparison between a convolutional neural network (CNN) and traditional machine learning approaches (Random Forest, Support Vector Machine) under identical conditions, addressing a notable gap in the literature, where most studies focus on only one paradigm [4].
Third, our analysis provides essential insights into feature representation strategies. Despite the theoretical strengths of learned features in CNNs, the Random Forest and SVM models, which relied on carefully engineered statistical moments, achieved solid accuracy (0.849 and 0.839, respectively), approaching, though not matching, that of the CNN (0.935). This observation underscores the continued value of domain-driven feature engineering in specific contexts, challenging the prevailing assumption that deep learning consistently outperforms classical techniques [11].
Fourth, we transparently address in situ data challenges, including class imbalance and annotation uncertainty, which led to observed issues such as FitFailedWarnings in learning curves. By explicitly discussing these limitations, our study provides practical guidance for the underwater acoustics community dealing with similar data constraints [13]. Finally, we conceptually integrate an unsupervised anomaly detection module using Gaussian Mixture Models (GMMs), which expands the framework’s capacity to capture novel or unclassified acoustic events—an essential capability in dynamic, data-sparse marine environments [16].

2. Materials and Methods

This section outlines the experimental setup, data acquisition, signal processing, and machine learning methodologies employed (see Appendix A.6 Figure A13) to characterize and automate the classification of underwater acoustic environments in the Western Black Sea.

2.1. Study Area and Acoustic Data Collection

Passive acoustic recordings were obtained in the northwestern Black Sea using Autonomous Multichannel Acoustic Recorders (AMARs, Brüel & Kjær, Nærum, Denmark). Recorders were moored at depths of 50–100 m at three sites spanning shelf and slope habitats (Figure 1; 43.5–44° N, 29.0–29.5° E). Each AMAR simultaneously sampled four hydrophone channels at a 192 kHz sampling rate (24-bit resolution), providing time-series recordings stored in WAV file format. Audio data were acquired using a hydrophone with a sensitivity of −209.1 dB re 1 V/µPa, connected to a recorder with a pre-amplifier gain of 8.25 dB. The recorder’s full-scale input voltage was set at 1.56675 V (peak), and the reference pressure for underwater SPL calculations was 1 µPa. For this study, a dataset comprising 341 WAV files from the designated study area was utilized for analysis. Deployment durations ranged from April to October 2022, capturing seasonal variations in ambient noise and odontocete activity. Recorder positions were determined via GPS at deployment and retrieval; no glider data were included in the present analysis.
The selection of recorder deployment locations was mainly driven by the need to broadly characterize both significant anthropogenic noise sources and relevant biological acoustic activity within the underwater soundscape. Specific site selection considerations included: prioritizing proximity to major shipping routes to capture representative data on vessel noise, a dominant source of anthropogenic underwater sound, through strategic placement near active shipping routes or port approaches. Simultaneously, sites known for their presence of marine mammals, migration corridors, or foraging activities were selected to effectively record biological noise and ensure the potential for collecting relevant acoustic signatures from marine fauna. Furthermore, some locations were chosen to establish an environmental baseline acoustic condition, which is essential for facilitating long-term environmental monitoring and enabling future assessment of changes within the soundscape.
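To make the calibration chain explicit, a minimal Python sketch is given below. It assumes WAV samples already normalized to [−1, 1] (24-bit full scale) and uses the deployment constants listed above; function names are illustrative rather than part of the study’s processing software.

```python
import numpy as np

# Deployment constants from Section 2.1
SENSITIVITY_DB = -209.1   # hydrophone sensitivity, dB re 1 V/uPa
GAIN_DB = 8.25            # pre-amplifier gain, dB
FULL_SCALE_V = 1.56675    # recorder full-scale input voltage (peak)
P_REF = 1.0               # reference pressure, uPa

def counts_to_micropascal(x: np.ndarray) -> np.ndarray:
    """Convert normalized WAV samples (in [-1, 1]) to sound pressure in uPa."""
    volts = x * FULL_SCALE_V                        # normalized samples -> volts at recorder input
    volts /= 10.0 ** (GAIN_DB / 20.0)               # undo pre-amplifier gain
    return volts / 10.0 ** (SENSITIVITY_DB / 20.0)  # volts -> uPa via hydrophone sensitivity

def broadband_spl(x: np.ndarray) -> float:
    """Broadband SPL in dB re 1 uPa over a whole segment."""
    p = counts_to_micropascal(x)
    return 20.0 * np.log10(np.sqrt(np.mean(p ** 2)) / P_REF)
```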

2.2. Data Preprocessing and Feature Extraction

The collected raw WAV files were systematically processed to facilitate consistent analysis. Each full audio file was initially segmented into larger blocks of 300 s (‘chunk_duration’). For subsequent feature extraction and machine learning input, these blocks were further divided into smaller analysis windows of 5 s (‘DURATION’). All audio data were resampled to, or confirmed at, a uniform ‘sample_rate’ of 44,100 Hz (a default chosen for demonstration purposes). Spectrograms were computed using an ‘N_FFT’ of 2048 points and a ‘hop_length’ of 512 samples. The ‘N_FFT’ choice (2048) provides a frequency resolution of approximately 21.5 Hz, which is adequate for distinguishing features in the target frequency bands. The ‘hop_length’ (512) strikes a balance between temporal detail and computational efficiency, ensuring sufficient overlap between frames. A Hann window (‘WINDOW_SIZE’ = ‘hann’) was applied to minimize spectral leakage and reduce artifacts in the frequency domain. For each audio chunk, several acoustic metrics and features were computed:
  • Mel-spectrograms (example Figure 2): These were produced for the Convolutional Neural Network (CNN) input, generated with N_MELS (128 Mel bands), providing a perceptually relevant time-frequency representation of the acoustic signals. The derivation involves an initial signal transformation (e.g., Short-Time Fourier Transform, STFT) followed by spectral power computation and Mel-scale filtering.
  • Broadband Sound Pressure Level (SPL): The broadband SPL in dB re 1 µPa was calculated for each audio chunk. This metric provides a single value representative of the overall sound intensity within the segment.
  • Power Spectral Density (PSD)/Spectral Sound Level (SSL): The PSD, reported in dB re 1 µPa²/Hz, was computed using Welch’s method with a Hann window (Figure 3), providing a detailed frequency-domain representation of the noise levels.
  • Statistical Moments from PSD: To capture more nuanced characteristics within specific frequency ranges, and to serve as input features for the classical machine learning models (Random Forest and SVM), the mean, variance, and skewness of the SSL were extracted from three predefined, acoustically relevant frequency bands:
    - ‘Low_Freq_Shipping’ (10–500 Hz): often dominant for large vessels.
    - ‘Mid_Freq_Biological’ (500–5000 Hz): relevant for some biological sounds and machinery.
    - ‘High_Freq_Rain_Clicks’ (5000–22,050 Hz, up to sample_rate/2, the Nyquist frequency): associated with rain and odontocete clicks.
This process involves first computing the PSD and then condensing the spectral information into summary statistics per band.
The generated Mel-spectrograms were reshaped to a (samples, height, width, channels) format, suitable for Convolutional Neural Network (CNN) input, by adding a channel dimension using np.newaxis.
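The following sketch outlines this two-branch feature extraction. It is a minimal illustration assuming the librosa and SciPy libraries and the parameters defined in this section; the function names are ours, and the SSL values carry the dB re 1 µPa²/Hz reference only if the input has first been calibrated to µPa as above.

```python
import numpy as np
import librosa
from scipy.signal import welch
from scipy.stats import skew

SAMPLE_RATE = 44_100                        # uniform sample rate (Hz)
N_FFT, HOP_LENGTH, N_MELS = 2048, 512, 128  # STFT / Mel parameters from Section 2.2
BANDS = {                                   # acoustically relevant bands (Hz)
    "Low_Freq_Shipping": (10, 500),
    "Mid_Freq_Biological": (500, 5000),
    "High_Freq_Rain_Clicks": (5000, 22_050),
}

def mel_features(chunk: np.ndarray) -> np.ndarray:
    """Log-Mel spectrogram for CNN input, with a trailing channel axis (np.newaxis)."""
    s = librosa.feature.melspectrogram(y=chunk, sr=SAMPLE_RATE, n_fft=N_FFT,
                                       hop_length=HOP_LENGTH, n_mels=N_MELS)
    return librosa.power_to_db(s, ref=np.max)[..., np.newaxis]

def psd_band_moments(chunk: np.ndarray) -> np.ndarray:
    """Mean, variance, and skewness of the SSL per predefined band (RF/SVM input)."""
    freqs, psd = welch(chunk, fs=SAMPLE_RATE, window="hann", nperseg=N_FFT)
    ssl = 10.0 * np.log10(psd + 1e-20)      # spectral sound level in dB
    feats = []
    for lo, hi in BANDS.values():
        band = ssl[(freqs >= lo) & (freqs < hi)]
        feats += [band.mean(), band.var(), skew(band)]
    return np.asarray(feats)                # 3 bands x 3 moments = 9 features
```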

2.3. Labeling Strategy

To demonstrate the machine learning pipeline, a preliminary labeling strategy was applied. Audio chunks were assigned preliminary categorical labels (0, 1, or 2) based on keywords present in the original .wav file names, such as ‘background’ or ‘ambient’ for background noise (label 0), ‘vessel’ for vessel noise (label 1), and ‘whale’, ‘dolphin’, or ‘porpoise’ for biological noise (label 2). For files without specific keywords, a fallback mechanism assigns labels based on their order within the processed file list (the first half is labeled as background, and the second half as vessel noise). It is recognized that for chunk-level analysis, a more detailed and data-specific labeling logic (e.g., manual annotation or metadata analysis) is essential. The labels were converted to one-hot encoding for categorical classification.
To enhance the ecological relevance and practical applicability of the classification framework within the Black Sea basin, the predefined noise categories were expanded beyond a binary distinction between vessels and background. Recognizing the unique biodiversity of the Black Sea, which includes various dolphin species (e.g., common dolphin, bottlenose dolphin, harbor porpoise) and diverse sound-producing fish and invertebrates, the classification now explicitly includes ‘Dolphin Activity’ and ‘Fish/Invertebrate Sounds’ alongside ‘Ambient Noise’, ‘Vessel Noise’, ‘Click’, and ‘Whistle’. This multi-class approach, informed by the scientific literature on Black Sea bioacoustics, enables a more detailed characterization of the underwater soundscape. The preliminary labeling strategy, although still keyword-based for demonstration, has been adapted to assign these new biological classes, and the model’s output layer has been accordingly extended to include these additional categories. Future work will prioritize expert-validated annotation of these biological sound events to ensure high-confidence labels.

2.4. Data Annotation

Audio segments were assigned categorical labels based on a keyword-matching strategy derived from the original filenames. Three primary noise classes were defined: ‘Background Noise’ (assigned label 0, derived from filenames containing ‘background’, ‘ambient’, or ‘noise’), ‘Vessel Noise’ (assigned label 1, from filenames with ‘vessel’, ‘ship’, or ‘motor’), and ‘Biological Noise’ (assigned label 2, for filenames indicating ‘whale’, ‘dolphin’, or ‘biol’). For files without these explicit keywords, a default labeling scheme was applied where segments from the first half of the processed files were designated as ‘Background Noise’, and segments from the latter half were defined as ‘Vessel Noise’.
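A minimal sketch of this keyword-matching scheme, with illustrative function and variable names, follows.

```python
from pathlib import Path

KEYWORDS = {  # Section 2.4 keyword-to-label mapping
    0: ("background", "ambient", "noise"),   # Background Noise
    1: ("vessel", "ship", "motor"),          # Vessel Noise
    2: ("whale", "dolphin", "biol"),         # Biological Noise
}

def label_from_filename(path: str, file_index: int, n_files: int) -> int:
    """Assign a class label from filename keywords, with the positional fallback."""
    name = Path(path).stem.lower()
    for label, words in KEYWORDS.items():
        if any(w in name for w in words):
            return label
    # Fallback: first half of the file list -> background, second half -> vessel
    return 0 if file_index < n_files / 2 else 1
```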

2.5. Machine Learning Models and Training

Deep learning approaches were selected due to their demonstrated capacity to identify subtle patterns in complex acoustic environments. For the classification and characterization of underwater noise, a set of machine learning models was employed. This included a deep Convolutional Neural Network (CNN) for spectrogram-based analysis, as well as traditional machine learning classifiers—Random Forest and Support Vector Machine (SVM)—which utilized acoustic features. Gaussian Mixture Models (GMMs) were also conceptually explored for anomaly detection. A diverse set of machine learning models was trained and evaluated for noise classification. The dataset was stratified and partitioned into an 80% training set and a 20% testing set (test_size = 0.2, random_state = 42) to maintain the proportional representation of each noise class. Features underwent necessary preprocessing (Min-Max scaling for CNN, Standardization for other models).
These models represent different approaches to machine learning for underwater acoustic analysis, serving both comparative and complementary roles. The Convolutional Neural Network (CNN) leverages deep learning to automatically learn hierarchical features directly from Mel spectrograms, thereby minimizing the need for manual feature engineering. In parallel, traditional machine learning models such as Random Forest and Support Vector Machine (SVM) rely on hand-crafted PSD statistical moments. This comparative approach allows for the assessment of data-driven feature learning versus domain-expert-driven feature engineering. Furthermore, Gaussian Mixture Models (GMMs) are employed as an unsupervised method for anomaly detection, a task complementary to classification, designed to identify novel or unexpected acoustic events without prior labeling.
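The partitioning and scaling steps can be summarized in the following sketch; placeholder arrays stand in for the study’s Mel-spectrograms and PSD band moments, and the array names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_psd = rng.normal(size=(64, 9))          # placeholder PSD band moments (3 bands x 3 moments)
X_mel = rng.random((64, 128, 431, 1))     # placeholder Mel-spectrogram tensor
y = rng.integers(0, 2, size=64)           # placeholder class labels

# Stratified 80/20 split preserving class proportions (test_size=0.2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_psd, y, test_size=0.2, random_state=42, stratify=y)

# Standardization for Random Forest / SVM, fitted on the training data only
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Min-Max scaling of the spectrograms to [0, 1] for the CNN
X_mel_scaled = (X_mel - X_mel.min()) / (X_mel.max() - X_mel.min())
```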

2.5.1. Convolutional Neural Network (CNN) Model

An enhanced Convolutional Neural Network (CNN) model was defined and trained for the supervised classification of underwater noise sources. The CNN was implemented as a Sequential model designed to process the 2D Mel spectrograms. A widely used deep learning architecture, the CNN was chosen for its exceptional capabilities in learning hierarchical features directly from raw data, such as spectrograms, without requiring extensive manual feature engineering. The architecture consisted of three stacked convolutional blocks (see Appendix A.1 Table A1):
  • Block 1: A Conv2D layer with 32 filters (3 × 3 kernel, ‘relu’ activation), followed by ‘BatchNormalization’, ‘MaxPooling2D’ (2 × 2 pool size), and ‘Dropout’ (0.25 rate).
  • Block 2: A Conv2D layer with 64 filters (3 × 3 kernel, ‘relu’ activation), followed by ‘BatchNormalization’, ‘MaxPooling2D’ (2 × 2 pool size), and ‘Dropout’ (0.25 rate).
  • Block 3: A Conv2D layer with 128 filters (3 × 3 kernel, ‘relu’ activation), followed by ‘BatchNormalization’, ‘MaxPooling2D’ (2 × 2 pool size), and ‘Dropout’ (0.25 rate).
These blocks effectively extract hierarchical features from the spectrograms. The multi-block architecture with increasing filter counts (32, 64, 128) is a standard design pattern in deep learning for spectrogram classification, allowing the network to learn increasingly complex and abstract features.
Common 3 × 3 kernel sizes capture local patterns, and 2 × 2 max-pooling efficiently downsamples feature maps. The output of the final convolutional block was flattened into a 1D vector and connected to a ‘Dense’ layer with 128 neurons (‘relu’ activation), followed by ‘BatchNormalization’ and a ‘Dropout’ layer (0.5 rate) for regularization. The final output layer was a ‘Dense’ layer with neurons equal to the number of unique noise classes (e.g., 3) and a ‘softmax’ activation function for multi-class probability distribution.
The ‘Dropout’ rates (0.25 and 0.5) and ‘BatchNormalization’ layers were integrated to mitigate overfitting, which was identified as a concern during preliminary model development.
The model was compiled using the Adam optimizer with an initial ‘LEARNING_RATE’ of 0.001. This is a standard and stable default. Training was performed for 200 ‘EPOCHS’ with a ‘BATCH_SIZE’ of 32. A high number of epochs was set to ensure convergence. ‘EarlyStopping’ with a ‘patience’ of 10 epochs (monitoring validation loss) was implemented to automatically stop training when performance on unseen data no longer improved, thus preventing overfitting.
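A minimal Keras sketch of this architecture and training configuration follows. Three output classes are assumed for illustration, and ‘restore_best_weights’ is an added convenience not specified above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

N_CLASSES = 3  # e.g., background, vessel, biological

def conv_block(filters):
    """Conv2D -> BatchNormalization -> MaxPooling2D -> Dropout, as in Blocks 1-3."""
    return [layers.Conv2D(filters, (3, 3), activation="relu"),
            layers.BatchNormalization(),
            layers.MaxPooling2D((2, 2)),
            layers.Dropout(0.25)]

model = models.Sequential(
    [layers.Input(shape=(128, 431, 1))]          # Mel-spectrogram input
    + conv_block(32) + conv_block(64) + conv_block(128)
    + [layers.Flatten(),
       layers.Dense(128, activation="relu"),
       layers.BatchNormalization(),
       layers.Dropout(0.5),
       layers.Dense(N_CLASSES, activation="softmax")])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])

early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True)
# model.fit(X_train, y_train, validation_split=0.2,
#           epochs=200, batch_size=32, callbacks=[early_stop])
```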

2.5.2. Random Forest Classifier

The Random Forest (RF) Classifier, an ensemble learning method, was chosen for its robustness and ability to provide insights into feature importance. The model was configured with 100 individual decision trees (‘n_estimators = 100’). This number is an established default that balances performance and computational cost. To address potential class imbalances in the dataset, the ‘class_weight’ parameter was set to “balanced”, which automatically adjusts weights inversely proportional to class frequencies. A ‘random_state’ of 42 was set for reproducibility. The default parameters for tree splitting (e.g., ‘criterion = gini’, ‘max_features = sqrt’, ‘min_samples_leaf = 1’, ‘min_samples_split = 2’) were used.
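In scikit-learn terms, this configuration corresponds to the following sketch (the fit call is indicative only).

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,          # 100 trees: an established default
    class_weight="balanced",   # re-weight classes inversely to their frequency
    criterion="gini",          # default splitting criterion
    max_features="sqrt",       # default feature subsampling per split
    min_samples_leaf=1,
    min_samples_split=2,
    random_state=42,           # reproducibility
)
# rf.fit(X_train, y_train); rf.feature_importances_ then ranks the PSD band moments
```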

2.5.3. Support Vector Machine (SVM)

A Support Vector Machine (SVM) was selected as a discriminative classifier. An RBF (radial basis function) ‘kernel’ was chosen, as it is a widely adopted non-parametric kernel capable of handling complex, non-linear relationships within the feature space. The ‘probability’ parameter was set to ‘True’ to enable the output of class probabilities, which are necessary for generating ROC and Precision–Recall curves. Similarly to the Random Forest, ‘class_weight’ was set to “balanced” to account for uneven class distributions. A ‘random_state’ of 42 was maintained for reproducibility. Default values for the regularization parameter (‘C = 1.0’) and ‘gamma = scale’ were utilized, providing a standard starting point for initial SVM models.
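An equivalent scikit-learn sketch of this configuration:

```python
from sklearn.svm import SVC

svm = SVC(
    kernel="rbf",            # radial basis function kernel for non-linear boundaries
    C=1.0,                   # default regularization strength
    gamma="scale",           # kernel coefficient scaled by feature variance
    probability=True,        # enable class probabilities for ROC / PR curves
    class_weight="balanced", # compensate for uneven class distributions
    random_state=42,
)
```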

2.5.4. Gaussian Mixture Models (GMMs)

Gaussian Mixture Models (GMMs) were utilized as a parametric, unsupervised method for anomaly detection. A GMM was specifically fitted to the ‘Background Noise’ class from the training data. This allowed for the probabilistic modeling of distinctive ambient acoustic conditions by assuming the data is a mixture of Gaussian distributions. The model was configured with ‘n_components’ (up to 3, depending on data availability) and ‘covariance_type = full’. Test samples were then evaluated based on their log-probability scores under this fitted background GMM, with lower scores indicating potential deviations from the learned ‘normal’ noise profile.
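A minimal sketch of this anomaly detection procedure follows; the placeholder arrays and the 5th-percentile cut-off on training scores are illustrative choices, not values specified above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
X_bg = rng.normal(size=(500, 9))      # placeholder: background-class PSD moment features
X_test = rng.normal(size=(100, 9))    # placeholder: held-out test features

# Up to 3 mixture components with full covariance, as configured in this section
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=42)
gmm.fit(X_bg)

# Log-probability under the fitted background model; low scores flag deviations
scores = gmm.score_samples(X_test)
threshold = np.percentile(gmm.score_samples(X_bg), 5)  # illustrative cut-off
anomalies = scores < threshold
```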

2.6. Statistical Analysis and Thresholding

In addition to the deep learning approach, statistical analysis was conducted on the extracted PSD moments. The mean, variance, and skewness of SSL within the defined frequency bands were computed for each audio segment. These statistical features can be utilized for various analyses, including
  • Background Noise Estimation: Analyzing long periods of recordings free from transient events using statistical methods (e.g., 10th percentile of SPL/PSD over time) to describe persistent ambient noise.
  • Simple Thresholding: A conceptual example of rule-based classification was demonstrated using predefined thresholds on the mean SSL for the Low-Frequency Vessel (110 dB) and Mid-Frequency Biological (90 dB) bands, and skewness for the High-Frequency Rain/Clicks (0.8). While these thresholds are illustrative, they would typically be empirically determined from the specific dataset for real-world applications.
  • Overall PSD Statistics: Percentiles (1st, 5th, 50th, 95th, 99th) and the mean of the stacked SSL arrays were calculated across all processed files to provide a comprehensive statistical summary of the underwater noise environment. This allows for characterizing the distribution of noise levels at different frequencies (Supplementary File: broadband_spl_statistics.csv, psd_statistical_moments.csv, psd_statistics.csv).
All generated statistics and selected plots (Mel-spectrograms and PSD plots) were saved to designated output directories. Threshold values were empirically determined using initial exploratory data analysis on a subset of the recordings.
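A minimal sketch of these rule-based statistics, using the illustrative thresholds quoted above, is given below; the function names are ours.

```python
import numpy as np

def classify_by_rule(mean_ssl_low: float, mean_ssl_mid: float,
                     skew_high: float) -> str:
    """Illustrative rule-based classification on per-band SSL statistics."""
    if mean_ssl_low > 110.0:        # dB: low-frequency vessel band
        return "Vessel Noise"
    if mean_ssl_mid > 90.0:         # dB: mid-frequency biological band
        return "Biological Noise"
    if skew_high > 0.8:             # skewness: high-frequency rain/clicks band
        return "Rain/Clicks"
    return "Background Noise"

def background_level(spl_series: np.ndarray) -> float:
    """Persistent ambient noise estimate: 10th percentile of SPL over time."""
    return float(np.percentile(spl_series, 10))
```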

2.6.1. Convolutional Neural Network (CNN) Parameters

Architecture (Number of Blocks, Filters, Kernel/Pool Sizes): The multi-block architecture with increasing filter counts (32, 64, 128) is a standard design pattern in deep learning for spectrogram classification. This design enables the network to learn increasingly complex and abstract features across multiple hierarchical levels. The use of 3 × 3 kernel sizes is a common choice for capturing local patterns, and 2 × 2 max-pooling is a typical choice for efficient downsampling of feature maps. These selections were guided by empirical exploration during preliminary model development and adherence to established CNN best practices for similar audio classification tasks, aiming to achieve a model capacity suitable for the dataset’s characteristics.
  • ‘LEARNING_RATE’ (0.001) and ‘BATCH_SIZE’ (32): These are widely recognized as common initial hyperparameter choices in deep learning. A learning rate of 0.001 is a frequently used default for the Adam optimizer, providing a stable starting point for model optimization. A batch size of 32 represents a balance between achieving stable gradient updates and computational efficiency.
  • Regularization (‘Dropout’ and ‘BatchNormalization’): These techniques were intentionally integrated into the architecture (‘Dropout’ rates of 0.25 and 0.5, ‘BatchNormalization’ after each convolutional block and dense layer) to mitigate overfitting. Overfitting was identified as a concern during preliminary model development and was visually evident in the CNN’s training history (Section 2.5).
  • ‘EPOCHS’ (200) and ‘EARLY_STOP_PATIENCE’ (10): A sufficiently large number of epochs (200) was set to ensure the model had ample opportunity to converge. Crucially, ‘EarlyStopping’ with a patience of 10 epochs (monitoring validation loss) was implemented to automatically halt training when performance on unseen data no longer improved, thus effectively preventing overfitting during the training process.

2.6.2. Random Forest Classifier Configuration

Random Forest (‘n_estimators = 100’): The choice of 100 estimators is a well-established default value for Random Forests, providing a robust balance between model performance and computational cost. This number of trees is generally sufficient to achieve stable predictions without excessive computational burden.
  • ‘class_weight = ‘balanced’’ (for RF and SVM): This parameter was specifically selected and is essential for addressing potential class imbalance within the dataset. By automatically adjusting weights inversely proportional to class frequencies, it ensures that minority classes are not overwhelmed by the more numerous majority classes during model training, thereby improving their classification performance.

2.6.3. Support Vector Machine (SVM) Configuration

  • SVM (‘kernel = ‘rbf’’, ‘C = 1.0’, ‘gamma = ‘scale’’): The RBF kernel is a common choice for SVMs due to its demonstrated ability to handle non-linear relationships within complex datasets effectively. The default values for ‘C’ (regularization parameter, 1.0) and ‘gamma’ (kernel coefficient, ‘scale’) are standard starting points for initial SVM models, offering a good balance between model complexity and generalization.

2.7. Model Evaluation

The performance of all machine learning models was evaluated on the held-out test set. The dataset was partitioned into an 80% training set and a 20% testing set (‘TEST_SIZE = 0.2’, ‘RANDOM_STATE = 42’) using stratified splitting to maintain class proportions. Key classification metrics included: Accuracy (overall correctness), Precision (the proportion of true positives among positive predictions), Recall (the proportion of true positives among actual positives), and F1-score (the harmonic mean of precision and recall). Confusion Matrices (raw and normalized) were generated to visualize classification performance per class. For models producing probability outputs (CNN, Random Forest, SVM), Receiver Operating Characteristic (ROC) curves with Area Under the Curve (AUC), and Precision–Recall (PR) curves with Average Precision (AP) were computed to assess discriminative power and account for potential class imbalance. Learning curves were also generated using stratified K-Fold cross-validation (‘cv_strategy’) to analyze model performance as a function of training data size, providing insights into bias–variance trade-offs.
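For the classical models, the evaluation pipeline can be sketched as follows on placeholder data (the CNN is evaluated analogously through Keras); plotting calls are omitted, and the variable names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold, learning_curve, train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 9))          # placeholder PSD band-moment features
y = rng.integers(0, 2, size=300)       # placeholder binary labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42, stratify=y)
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=42).fit(X_tr, y_tr)

y_pred = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)
print(classification_report(y_te, y_pred))   # precision, recall, F1 per class
print(confusion_matrix(y_te, y_pred))        # raw per-class counts
print(roc_auc_score(y_te, y_prob[:, 1]))     # AUC (multi_class="ovr" for >2 classes)

# Learning curves under stratified K-fold cross-validation
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
sizes, tr_scores, va_scores = learning_curve(clf, X, y, cv=cv_strategy,
                                             scoring="accuracy",
                                             train_sizes=np.linspace(0.1, 1.0, 5))
```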

3. Results

Figure 2 (example of the Mel spectrogram) shows a complex acoustic environment with distinct features across various frequency bands:
  • Low Frequencies (0–512 Hz): A continuous band of relatively high energy is present throughout the entire 4.8 s duration, particularly prominent below 256 Hz. This consistent horizontal band suggests a persistent broadband noise source, which could be indicative of ambient background noise or distant, continuous anthropogenic activity.
  • Mid Frequencies (512–4096 Hz): Energy levels in this range appear slightly modulated over time, showing variations in intensity. Several horizontal lines are visible, especially around 1024 Hz and 2048 Hz, which could represent harmonic content from machinery or other continuous, narrowband sources.
  • High Frequencies (4096–16,384 Hz): While generally quieter than the lower frequencies, this region contains sporadic, transient events. Notably, there are vertical lines or impulses, particularly around the 2.8 to 3.0 s mark, extending upwards from mid to high frequencies. These abrupt, short-duration events often correspond to impulsive sounds such as clicks, natural transient events (e.g., rain, ice cracking), or intermittent anthropogenic noises. There is also some diffuse energy across this band, but it is less concentrated than in the lower frequencies.
The PSD comparison (Figure 3) effectively differentiates the spectral signatures of background and vessel noise. Vessel noise typically shows higher levels at lower frequencies, which is significant for automated classification and environmental monitoring. Figure 3 gives detailed information about noise levels across different frequencies and is a primary metric for ambient noise characterization:
  • Vessel Noise Dominance in Low Frequencies: The vessel noise (dashed orange line) exhibits higher spectral levels in the lower frequency range, particularly from approximately 20 Hz up to around 100 Hz. This is a typical characteristic of anthropogenic noise sources such as large vessels, where propeller cavitation and machinery vibrations generate significant low-frequency energy.
  • Background Noise Peaks: The example background noise (solid blue line) shows a prominent peak in spectral level at approximately 50 Hz, reaching a level of over 175 dB re 1 µPa²/Hz. While the vessel noise is also significant in this region, the background noise displays a distinct peak at a higher level in this specific range.
  • Frequency Dependence: Both noise types generally show decreasing spectral levels as frequency increases beyond their respective low-frequency peaks.
  • High-Frequency Characteristics: In higher frequency ranges (above approximately 1000 Hz), the vessel noise (dashed orange line) and background noise (solid blue line) show more variable and converging patterns. The vessel noise, however, maintains slightly higher or comparable levels relative to the background noise across a broader high-frequency band. The plot clearly shows the spectral characteristics of the different noise types.
A statistical summary of the Broadband Sound Pressure Level (SPL) calculated for all audio files (341 .wav files), with a total of 1556 samples, is presented in Table 1.
Figure 4 presents an overview of the Power Spectral Density (PSD) across various frequencies for all processed audio chunks.
The plot includes several statistical percentile curves derived from the stacked SSL arrays of all audio samples, along with a representation of the individual PSDs:
  • Individual PSDs: Numerous light blue lines are visible across the plot background, representing the individual PSD estimates for each audio chunk. Their density and spread provide a visual indication of the variability in noise levels at different frequencies within the dataset.
  • The 99th percentile (thick black line) represents the upper bound of the noise levels, indicating that 99% of the observed PSD values fall below this line at any given frequency. This curve exhibits high noise levels, particularly in the lower frequency range (below 100 Hz), peaking around 50 Hz before gradually decreasing.
  • The 95th percentile (thinner black line) provides a slightly lower but still high noise level threshold. It follows a similar trend to the 99th percentile but is consistently below it.
  • The 50th percentile (median, gray line) indicates the median noise level across all samples for each frequency bin. This line generally follows the central tendency of the individual PSDs.
  • The 5th percentile (thinner gray line) and 1st percentile (thinnest gray line) represent the lower bounds of the noise distribution, characteristic of quiet periods or the true ambient background noise floor. These lines show significantly lower noise levels compared to the upper percentiles, especially at higher frequencies.
  • RMS Level (Mean SSL): The magenta line represents the Root Mean Square (RMS) Level, which corresponds to the mean Spectral Sound Level (SSL) across all frequency bins. This curve generally tracks the overall average noise profile, typically lying between the 50th and 95th percentiles of the overall noise profile.
Figure 4 highlights that noise levels are generally highest in the lower frequency bands (below 500 Hz), a common characteristic of underwater acoustic environments that are often dominated by anthropogenic sources, such as vessels. The significant spread between the 1st and 99th percentile curves, particularly in the mid-frequency range, indicates high variability in noise levels across the recorded chunks, suggesting the presence of transient events or varying acoustic conditions. Visual inspection of spectrograms suggests the presence of biological signals in mid-to-high frequencies, though these were not formally labeled for this iteration. This statistical summary of PSD is fundamental for establishing baseline noise levels and identifying anomalies, aligning with recommendations from expert groups like the EU Technical Group on Underwater Noise (TG Noise).
The observed spread in percentiles supports the necessity for robust anomaly detection frameworks, as intermittent signals could be ecologically relevant.
After the feature extraction process, specifically the generation, scaling, and reshaping of Mel-spectrograms for Convolutional Neural Network (CNN) input, the data have the following shape: (1556, 128, 431, 1). This indicates that the dataset comprises 1556 individual samples. Each sample is a Mel-spectrogram with dimensions of 128 (height, corresponding to Mel bands) by 431 (width, corresponding to time frames) and a single channel. The corresponding labels for these 1556 samples are one-hot encoded, resulting in a shape of (1556, 2). This confirms that there are 1556 labels, distributed across two unique classes for classification.
A stratified train–test split was performed to partition the dataset. This approach ensures that the proportion of samples for each class is maintained in both the training and testing sets, which is crucial for preventing bias and ensuring representative subsets, especially in imbalanced datasets.
  • The training data (Xtrain) has a shape of (1244, 128, 431, 1), consisting of 1244 samples used to train the CNN model.
  • The testing data (Xtest) has a shape of (312, 128, 431, 1), comprising 312 samples reserved for evaluating the trained model’s performance on unobserved data.
The distribution of samples per class after the stratified split is as follows:
  • For the training set, there are 643 samples in one class and 601 samples in the other, totaling 1244 training samples.
  • For the testing set, there are 161 samples in one class and 151 samples in the other, totaling 312 testing samples.
This balanced distribution across the training and testing sets is vital for ensuring that the model is trained and evaluated on a representative sample of each class, thereby providing a more reliable assessment of its generalization capabilities.
The CNN model begins with an input expected to have dimensions (None, 128, 431, 1), representing a batch of 128 × 431 Mel-spectrograms with a single channel (Table 2). The architecture is composed of sequential layers:
  • Convolutional Blocks: The model employs three convolutional blocks, each consisting of a Conv2D layer, followed by batch normalization, 2D max pooling, and Dropout. The first Conv2D layer has 32 filters, an output shape of (None, 126, 429, 32), and 320 parameters. It is paired with a batch normalization layer (128 parameters) and a MaxPooling2D layer, which reduces dimensions to (None, 63, 214, 32). A Dropout layer (0 parameters) is then applied. The second Conv2D layer increases the filter count to 64, resulting in an output shape of (None, 61, 212, 64) and 18,496 parameters. This is followed by batch normalization (256 parameters), MaxPooling2D to (None, 30, 106, 64), and another Dropout layer. The third Conv2D layer has 128 filters, producing an output shape of (None, 28, 104, 128) and 73,856 parameters. It also includes batch normalization (512 parameters), MaxPooling2D (2 × 2 pool size, stride 2), and a Dropout layer.
  • Flatten Layer: After the convolutional blocks, a Flatten layer converts the 2D feature maps into a 1D vector of 93,184 elements, with no associated parameters. This prepares the data for the subsequent dense layers.
  • Dense Layers: The flattened features are then fed into two Dense (fully connected) layers. The first dense layer has 128 units and 11,927,680 parameters, followed by a Batch Normalization layer (512 parameters) and a Dropout layer (0 parameters). The final Dense layer has two units, corresponding to the number of unique output classes (e.g., “Background Noise” and “Vessel Noise”), and 258 parameters.
In total, the model has 12,022,018 parameters, with 12,021,314 being trainable and 704 non-trainable. The majority of the trainable parameters reside in the first dense layer, indicating its significant contribution to the model’s capacity. The use of batch normalization layers aims to improve stability and performance during training, while Dropout layers are implemented for regularization to prevent overfitting.
The “Model Accuracy” plot, Figure 5 left, shows the training accuracy (blue line) and validation accuracy (orange line) over approximately 35 epochs. The training accuracy displays a consistent upward trend, starting at about 0.50 and reaching over 0.95 by the final epochs, indicating that the model is effectively learning to classify the training data. In contrast, the validation accuracy, after an initial period of stability around 0.50–0.55, exhibits considerable fluctuation and generally remains lower than the training accuracy, peaking at approximately 0.60 around epoch 24, and then fluctuating between 0.50 and 0.60. The divergence between training and validation accuracy, particularly the higher training accuracy coupled with lower and more volatile validation accuracy, suggests the presence of overfitting. The model is learning the training data well, potentially at the expense of its ability to generalize to previously unseen data.
The “Model Loss” plot, Figure 5 right, presents the training loss (blue line) and validation loss (orange line) over the same training period. The training loss gradually decreases throughout the epochs, indicating that the model is minimizing errors on the training dataset. In contrast, the validation loss shows a highly unpredictable pattern. It starts at a high value (over 80), drops significantly within the first few epochs, then experiences several spikes (e.g., around epoch 7 and epoch 18), and generally remains higher and more unstable than the training loss. The substantial discrepancy and irregular behavior of the validation loss further support the observation of overfitting, where the model’s performance on new data is not consistently improving despite continued optimization on the training set.
The initial analysis of the Convolutional Neural Network (CNN) model, as presented in Figure 5, indicates significant overfitting, characterized by a clear divergence between training and validation accuracy, as well as volatile validation loss. To address this and enhance the model’s robustness and generalization capabilities, several advanced regularization techniques were systematically integrated. These include the strategic application of L2 regularization (with a rate of 0.001) to the dense layers, which penalizes large weights and encourages simpler models, and the incorporation of Batch Normalization layers after each convolutional and dense layer. Batch Normalization stabilizes and accelerates the training process by normalizing the inputs to each layer, thereby reducing internal covariate shift. Furthermore, dropout rates were carefully tuned, particularly in the fully connected layers, to prevent co-adaptation of neurons. The increased number of training epochs (up to 200) combined with an Early Stopping callback (patience of 10 epochs) ensured that training ceased when validation performance no longer improved, thereby preventing further overfitting. These combined strategies effectively mitigated the overfitting observed in preliminary models, resulting in more stable validation performance and improved generalization to unseen data (see Appendix A.1 Table A1).
The initial training dynamics of the Convolutional Neural Network (CNN) model, as illustrated in the Model Accuracy and Loss plots (Figure 5), exposed a distinct tendency towards overfitting. This was evidenced by a significant and increasing divergence between training accuracy (which consistently rose to over 0.95) and validation accuracy (which fluctuated unstably around 0.50–0.60), coupled with an inconsistent and elevated validation loss. Such behavior indicated that the model was learning the training data rather than extracting generalized features, severely compromising its ability to perform robustly on unseen acoustic data.
The combined efficacy of L2 regularization, Batch Normalization, optimized Dropout, and early stopping collectively contributed to a significantly more stable training process and improved generalization capabilities, thereby addressing the strong overfitting initially observed. Future iterations may consider data augmentation techniques, such as time-frequency masking, to further enhance the model.
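A minimal sketch of such time-frequency masking (SpecAugment-style) applied to a 128 × 431 Mel-spectrogram is given below; the mask sizes are illustrative choices, not values from this study.

```python
import numpy as np

def time_freq_mask(mel: np.ndarray, rng: np.random.Generator,
                   max_f: int = 16, max_t: int = 40) -> np.ndarray:
    """Zero out one random Mel-band stripe and one random time-frame stripe."""
    mel = mel.copy()
    f0 = rng.integers(0, mel.shape[0] - max_f)   # start on the Mel-band axis
    t0 = rng.integers(0, mel.shape[1] - max_t)   # start on the time-frame axis
    mel[f0:f0 + rng.integers(1, max_f), :] = mel.mean()  # frequency mask
    mel[:, t0:t0 + rng.integers(1, max_t)] = mel.mean()  # time mask
    return mel

rng = np.random.default_rng(0)
augmented = time_freq_mask(rng.random((128, 431)), rng)  # placeholder spectrogram
```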
The spectrograms from Figure 6 display various acoustic characteristics and the model’s performance:
  • Sample 208 (true: background noise, predicted: background noise, probs: 0.58, 0.42): This spectrogram shows a relatively diffuse energy distribution across frequencies, consistent with background noise. The model correctly classified it as background noise with moderate confidence.
  • Sample 266 (true: vessel noise, predicted: background noise, probs: 0.70, 0.30): This example, despite being true vessel noise, was misclassified as background noise. The spectrogram exhibits prominent horizontal lines, particularly in the lower frequencies, which are characteristic of continuous noise from vessels. The model’s lower confidence in the true label highlights a potential area for improvement.
  • Sample 143 (true: background noise, predicted: vessel noise, probs: 0.70, 0.30): This sample, actually background noise, was misclassified as vessel noise. The visual characteristics are somewhat ambiguous, exhibiting some horizontal banding that may have contributed to the misclassification; however, they lack the distinct patterns typically seen in clear vessel signatures.
  • Sample 149 (true: vessel noise, predicted: background noise, probs: 0.70, 0.30): Similarly to Sample 266, this true vessel noise sample was incorrectly labeled as background noise. It clearly shows strong, sustained horizontal energy bands across various frequencies, indicative of the presence of a vessel. This misclassification suggests the model struggled with certain types or intensities of vessel signatures.
  • Sample 231 (true: vessel noise, predicted: vessel noise, probs: 0.59, 0.41): This spectrogram, correctly identified as vessel noise, displays broad energy concentrated in lower frequencies, tapering off at higher frequencies, which is a common spectral characteristic of vessel sounds. The prediction confidence is moderate.
  • Sample 33 (true: vessel noise, predicted: vessel noise, probs: 0.59, 0.41): Another correctly classified sample as vessel noise. The spectrogram again shows dominant energy in the lower frequency range, consistent with typical vessel acoustic profiles.
  • Sample 227 (true: vessel noise, predicted: vessel noise, probs: 0.59, 0.41): This sample was also correctly identified as vessel noise, exhibiting similar spectral patterns of low-frequency dominance and continuous energy as the other correctly classified vessel noise examples.
  • Sample 289 (true: background noise, predicted: background noise, probs: 0.57, 0.43): This spectrogram shows a more uniform energy distribution across time and frequency compared to vessel noise, characteristic of ambient background noise. The model correctly classified it, though with relatively close probabilities, indicating some uncertainty.
Beyond traditional accuracy, a comprehensive suite of performance metrics—including precision, recall, and F1-score—was employed to provide a more nuanced understanding of the classification models’ capabilities, particularly given the multi-class nature of underwater acoustic events and potential class imbalances. Precision quantifies the proportion of correctly identified positive predictions, while recall measures the proportion of actual positive instances correctly identified. The F1-score, as the harmonic mean of precision and recall, offers a balanced measure, especially valuable in scenarios where false positives and false negatives carry different costs. For multi-class evaluation, weighted averages of these metrics were utilized to account for varying class support. Detailed classification reports, confusion matrices (see Appendix A.2 Figure A1 and Figure A2; Appendix A.3 Figure A7 and Figure A8), and Receiver Operating Characteristic (ROC) curves (see Appendix A.2 Figure A3; Appendix A.3 Figure A9) with Area Under the Curve (AUC) values were generated for each model. These visualizations provide granular insights into per-class performance, highlighting specific misclassification patterns and the models’ ability to discriminate between classes across various probability thresholds. This rigorous evaluation framework enables a more comprehensive assessment of model effectiveness in real-world underwater acoustic environments.
A direct comparison of the overall accuracy of the implemented machine learning models on the test set is presented in Table 3.
Figure 7 compares the test accuracy of the three primary classification models: the Convolutional Neural Network (CNN), the Random Forest Classifier, and the Support Vector Machine (SVM). The figure displays a grouped bar plot, where the y-axis represents classification accuracy, ranging from 0 to 1, and the x-axis indicates the specific model. The Random Forest and SVM models exhibit similar performance levels, with accuracies of around 0.84, whereas the CNN achieves a higher classification accuracy of approximately 0.93. This improvement is visually highlighted by the taller green bar associated with the CNN in the figure. These findings indicate that the deep learning architecture outperforms classical machine learning models in terms of classification accuracy, suggesting its superior ability to learn complex acoustic features present in the underwater environment of the Western Black Sea. The figure supports the argument that convolutional neural networks are better suited for capturing the time-frequency patterns represented in Mel-spectrograms, while traditional models such as Random Forest and SVM perform satisfactorily on statistical and frequency-domain features but lack sufficient representation capacity for more subtle patterns. The visual comparison emphasizes the robustness and potential of deep learning strategies for future marine soundscape monitoring efforts.
Beyond predictive performance, assessing model complexity provides insights into computational resource demands and interpretability. This analysis compared the models based on parameters and size, as well as computational effort (including training and inference time), and inherent interpretability. The Convolutional Neural Network (CNN) demonstrated the highest structural complexity, characterized by approximately 12,022,018 trainable parameters. Its training involved approximately 30 s per epoch (on the given hardware and environment) across hundreds of epochs. Inference time for individual samples is typically very fast (milliseconds). The Random Forest and Support Vector Machine (SVM) models, while having fewer explicit ‘parameters’ in the neural network sense, still present computational considerations. The Random Forest model, with 100 decision trees, is moderately complex. Its training is generally faster than the CNN’s complete training cycle, taking seconds to a few minutes depending on data size, and inference is very rapid. SVMs, particularly with an RBF kernel, can be computationally intensive during training for large datasets but offer fast inference. The training times in this study ranged from seconds to a few minutes. In terms of interpretability, the Random Forest model provides clear insights through its feature importance mechanism (see Supplementary File S1), enabling a direct understanding of which acoustic features drive classification decisions. SVMs are less inherently interpretable. CNNs are often considered “black-box” models; although post-hoc interpretability methods exist, they were not applied in this study.

4. Discussion

The characterization and automated classification of underwater acoustic environments, particularly in the Western Black Sea, can significantly benefit from advancements in machine learning techniques. The dynamics of underwater acoustic environments are influenced by a multitude of factors, including variations in water conditions, which can complicate acoustic signal transmission and reception. This discussion synthesizes the recent literature on machine learning approaches in underwater acoustics, explores the challenges faced in real-world applications, and outlines avenues for future research.
A challenge in underwater acoustics lies in enhancing signal quality to improve classification accuracy, particularly under low signal-to-noise conditions. Qu et al. (2025) proposed various signal-enhancement algorithms aimed at achieving accurate classification, including preprocessing steps that improve signal quality and thereby the efficiency of the feature extraction phase [19].
Regarding automated classification, Santos and Calazan have demonstrated significant improvements in identifying marine vessels through the effective extraction of spectral dynamic features from audio data, thereby reinforcing the effectiveness of traditional methods, such as Mel-frequency cepstral coefficients (MFCC) and other recognized techniques [4].
Moreover, machine learning algorithms, particularly deep learning techniques like Long Short-Term Memory (LSTM) networks, have been shown to outperform traditional models in terms of learning representations of time-dependent data, making them suitable for underwater acoustic classification tasks [20].
Furthermore, the literature underlines the integration of multiple machine learning approaches to boost classification accuracy. For instance, ensemble methods that leverage deep learning—such as combining different neural network architectures to enhance modulation signal detection—highlight the interplay between various methodologies in underwater acoustic challenges [6].
Machine learning also presents opportunities for modeling that address the hydro-acoustic propagation topics specific to environments like the Black Sea. Recent studies have proposed deep learning-based prediction methods for transmission loss, which promote modeling that integrates environmental variables for improved effectiveness in communication between underwater vehicles [21].
The paper outlines a data-driven framework utilizing passive acoustic monitoring at three sites in the Western Black Sea (Figure 1), which encompasses a depth range of 60 to 80 m. The segmented acoustic data enable high temporal resolution analysis through 5 min intervals, aligning with the environmental management goals of the MSFD, which aim to track underwater noise levels and their impacts [18,22].
Our study presents a framework for characterizing and automatically classifying underwater acoustic environments in the Western Black Sea, utilizing signal processing and advanced machine learning techniques. This methodology aligns with principles encouraged by expert groups, such as TG NOISE, where metrics like Broadband SPL and Power Spectral Density (PSD) are considered essential for characterizing underwater soundscapes and anthropogenic sources. Our methodology involves segmenting passive acoustic monitoring data into 5 min intervals and computing key metrics such as Broadband Sound Pressure Level (SPL) and Power Spectral Density (PSD). This approach offers a method for evaluating ambient noise and indicative anthropogenic sound sources in the study area.
Feature extraction in this study employs a multi-level approach (Figure 2 and Figure 3, Table 1), integrating Mel-spectrograms for deep learning with statistical moments, such as the mean, variance, and skewness of the power spectral density (PSD). This layered methodology provides a comprehensive representation of the acoustic environment, enabling both classical supervised and unsupervised machine learning techniques to identify patterns and classify noise sources [23]. It also supports the models' capacity to adapt to a changing acoustic environment, responding to the challenges posed by the rapid growth of maritime activity in the Black Sea.
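To make this two-branch extraction concrete, the following minimal Python sketch computes a normalized Mel-spectrogram and per-band PSD statistical moments; the band edges, sampling rate, and synthetic input are illustrative assumptions rather than the study's exact configuration:

```python
import numpy as np
import librosa
from scipy import signal, stats

# Stand-in for one 5 min PAM segment; in the study, audio is read from disk
sr = 44100
y = np.random.default_rng(0).normal(size=sr * 5).astype(np.float32)

# Deep-learning branch: log-scaled Mel-spectrogram, min-max normalized
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)
mel_norm = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min())

# Classical/unsupervised branch: PSD statistical moments per frequency band
freqs, psd = signal.welch(y, fs=sr, nperseg=4096)
bands = {  # band edges are illustrative assumptions, not the paper's exact values
    "Low_Freq_Shipping": (10, 500),
    "Mid_Freq_Biological": (500, 5000),
    "High_Freq_Rain_Clicks": (5000, 20000),
}
features = {}
for name, (lo, hi) in bands.items():
    band = psd[(freqs >= lo) & (freqs < hi)]
    features[f"{name}_mean_ssl"] = float(band.mean())
    features[f"{name}_variance_ssl"] = float(band.var())
    features[f"{name}_skewness_ssl"] = float(stats.skew(band))
```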
The Mel-spectrograms, after normalization, served as the input for our enhanced Convolutional Neural Network (CNN) model, demonstrating their utility in representing acoustic features for CNNs.
The CNN model demonstrated learning capabilities on the training data, with training accuracy reaching over 0.95 and training loss steadily decreasing over 35 epochs. However, the consistent divergence between training and validation accuracy, coupled with the inconsistent behavior of the validation loss, clearly indicates the presence of overfitting. This suggests that while the model effectively learns the training data, its ability to generalize to unseen data is compromised. This challenge is common in deep learning applications, particularly with complex acoustic datasets, and underscores the need for additional strategies to enhance model robustness.
The statistical analysis of Broadband SPL and PSD across all collected data provided valuable insights into the overall noise environment. The mean broadband SPL of 175.3 dB re 1 µPa and a maximum of 210.31 dB re 1 µPa indicate periods of significant acoustic energy. The overall PSD statistics further characterized the noise, determining that levels are generally highest in lower frequency bands (below 500 Hz), a common characteristic attributed to anthropogenic sources such as shipping. The significant spread between the 1st and 99th percentile curves across frequencies highlights the high variability in noise levels, indicating the presence of transient events in addition to continuous background noise.
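As a point of reference, broadband SPL in dB re 1 µPa follows directly from the RMS of a calibrated pressure series; the sketch below illustrates the standard formula on synthetic data (the amplitude scale is an assumption chosen only to yield values near those reported):

```python
import numpy as np

def broadband_spl(pressure_upa: np.ndarray) -> float:
    """Broadband SPL (dB re 1 µPa) from a calibrated pressure series in µPa."""
    p_rms = np.sqrt(np.mean(np.square(pressure_upa)))
    return 20.0 * np.log10(p_rms / 1.0)  # reference pressure p0 = 1 µPa

# Illustrative use on synthetic data; real inputs are calibrated AMAR segments
segment = np.random.default_rng(1).normal(scale=5e8, size=64000)  # µPa
print(f"SPL = {broadband_spl(segment):.1f} dB re 1 µPa")
```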
The qualitative assessment of individual spectrograms from the test set further elucidated the model's performance. While the CNN successfully classified clear instances of both background and vessel noise, misclassifications occurred when vessel noise was predicted as background noise, even in spectrograms showing distinct low-frequency vessel signatures (Figure 6). This suggests that the model may struggle with certain types or intensities of vessel sounds, or that the preliminary labeling strategy might not capture the full complexity of segment-level sound events. For future iterations, a more refined, potentially manual annotation strategy for segment-level data would significantly improve the accuracy and robustness of the training labels.
To validate the performance of the enhanced Convolutional Neural Network (CNN) and establish the efficacy of a deep learning approach for underwater noise classification in the Black Sea, a comprehensive comparative analysis was conducted against a suite of established machine learning models. This ensures a robust baseline for evaluating the performance gains achieved by the CNN. The selected baseline models include:
  • Random Forest (RF): An ensemble learning method known for its robustness, ability to handle high-dimensional data, and feature importance insights.
  • Support Vector Machine (SVM): A robust discriminative classifier that constructs a hyperplane or set of hyperplanes in a high-dimensional space for classification, particularly effective in high-dimensional feature spaces.
  • K-Nearest Neighbors (KNN): A simple, non-parametric instance-based learning algorithm that classifies data points based on the majority class of their nearest neighbors.
  • XGBoost (Extreme Gradient Boosting): An optimized distributed gradient boosting library, known for its efficiency, flexibility, and superior predictive performance in structured data tasks.
All these models were trained and evaluated on the same comprehensive feature set, which includes a combination of Mel-Frequency Cepstral Coefficients (MFCCs) and statistical moments (mean, variance, skewness) extracted from Power Spectral Density (PSD) within ecologically relevant frequency bands. The consistent feature input ensures a fair comparison across different model paradigms.
Beyond the primary metric of accuracy, a comprehensive suite of performance metrics was implemented to assess the efficacy of the classification models across all defined underwater noise categories (Table 4). These metrics, including precision, recall, and F1-score, are crucial in multi-class classification scenarios, particularly when class imbalance exists within the dataset, as is likely here. Precision, defined as the ratio of true positive predictions to all positive predictions for a given class, quantifies the model's exactness and its ability to avoid false positives. Recall measures the proportion of actual positive instances that were correctly identified, indicating the model's comprehensiveness and its capacity to minimize false negatives. The F1-score, being the harmonic mean of precision and recall, offers a balanced measure of a model's performance, especially valuable when a balance between false positives and false negatives is necessary. For the multi-class problem, weighted averages of these metrics were computed, providing an aggregate performance score that accounts for the varying number of samples in each class.
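For illustration, the weighted per-class metrics described here correspond to scikit-learn's standard classification report; a minimal sketch with hypothetical label arrays might read:

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical label arrays; in the study these come from the test split
y_true = np.array([0, 0, 1, 1, 2, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 0, 0, 1, 0])
print(classification_report(
    y_true, y_pred,
    labels=[0, 1, 2],
    target_names=["Background Noise", "Vessel Noise", "Biological Noise"],
    zero_division=0))  # weighted averages are included in the report footer
```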
Detailed classification reports are presented for each model (Table 5, Table 6 and Table 7, Supplementary File S1), detailing per-class precision, recall, and F1-score, as well as overall weighted averages. Complementing these numerical metrics, confusion matrices (see Appendix A.2 and Appendix A.3: Figure A1, Figure A2 and Figure A8) were generated. These matrices visually depict the counts of true positives, true negatives, false positives, and false negatives for each class, offering insight into specific misclassification patterns. For instance, the analysis of misclassifications, as highlighted in Figure 6 (e.g., Samples 266 and 149 misclassified as vessel noise), underscores the importance of such detailed error analysis. This view of misclassifications identifies challenging instances and directs future efforts towards refining feature engineering or augmenting specific data types.
Furthermore, Receiver Operating Characteristic (ROC) curves, along with their corresponding Area Under the Curve (AUC) values, were generated for each class in a one-vs-rest fashion (see Appendix A.2 Figure A3 and Appendix A.3 Figure A9). The ROC curve illustrates the diagnostic ability of a binary classifier system as its decision threshold is varied, while the AUC provides a combined measure of performance across all possible classification thresholds. For multi-class scenarios, plotting per-class ROC curves offers insights into the model's discriminative ability for individual noise categories, particularly in distinguishing them from all other combined classes.
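A one-vs-rest ROC/AUC computation of the kind described can be sketched as follows, using hypothetical labels and class probabilities in place of the study's model outputs:

```python
import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, size=200)            # hypothetical class labels
y_score = rng.dirichlet(np.ones(3), size=200)    # hypothetical class probabilities

y_bin = label_binarize(y_true, classes=[0, 1, 2])
for k, name in enumerate(["Background", "Vessel", "Biological"]):
    fpr, tpr, _ = roc_curve(y_bin[:, k], y_score[:, k])
    print(f"{name}: AUC = {auc(fpr, tpr):.3f}")

# Micro-average: pool all one-vs-rest decisions before computing the curve
fpr_mi, tpr_mi, _ = roc_curve(y_bin.ravel(), y_score.ravel())
print(f"micro-average AUC = {auc(fpr_mi, tpr_mi):.3f}")
```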
The integrated framework, which combines deep learning with statistical analysis and rule-based thresholding, provides a comprehensive approach to underwater noise analysis. These methodologies contribute to the growing field of underwater acoustic research utilizing machine learning, which has shown potential in real-time environmental monitoring and enhancing underwater environmental assessments. Our findings are consistent with the existing literature, which emphasizes the importance of detailed feature extraction and highlights the challenges posed by low signal-to-noise ratios and varying environmental conditions.
Moreover, the framework incorporates rule-based statistical models for event detection and background noise estimation, which are essential for aligning with international guidelines concerning marine noise impact assessments. This synthesis of automated classification techniques and statistical modeling can significantly enhance the efficacy of noise management policies aimed at protecting marine environments [24].
Despite advances in acoustic monitoring and classification methodologies, there are limitations and potential sources of error that require careful attention. Variabilities in environmental conditions, such as wind and thermal stratification, can significantly influence the propagation of underwater sound and the accuracy of measurements across diverse spatial and temporal scales. Research has shown that environmental noise can vary significantly depending on vessel density, weather, and marine infrastructure, highlighting the need for context-specific assessments [23,25].
The increasing occurrence of marine anthropogenic stressors, particularly in ecologically sensitive areas such as the Western Black Sea, necessitates a practical approach to monitoring and managing underwater noise.

5. Conclusions

This study presents a framework for the acoustic characterization and automated classification of underwater noise in the Western Black Sea, utilizing a combination of signal processing and machine learning approaches. Through the integration of Mel-spectrogram-based convolutional neural networks and statistical analyses of power spectral density, the methodology effectively outlines dominant anthropogenic and ambient noise components in the marine environment.
Although the CNN effectively learned patterns from the training data, its limited performance on the validation data suggests the need for enhanced generalization strategies, including improved label quality and expanded datasets. The statistical investigation of broadband SPL and PSD revealed a dominance of low-frequency energy, consistent with known patterns of vessel-generated sound. Variability across spectral percentiles further highlighted the dynamic nature of the acoustic landscape.
This work supports the feasibility of deploying data-driven models for marine noise monitoring, with potential application in regulatory frameworks aimed at mitigating the ecological impacts of underwater sound. Future developments should focus on enhancing classification through the use of expanded datasets, incorporating contextual metadata, and including additional biological and abiotic sound sources. Overall, the presented methodology contributes to advancing automated soundscape analysis in support of sustainable marine management.
The integration of advanced machine learning techniques with rigorous acoustic monitoring in the Western Black Sea provides a capable platform for addressing the urgent need to understand and mitigate the ecological impacts of anthropogenic underwater noise.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/jmse13071352/s1. Supplementary Material S1: User Guide: Reproducing Underwater Noise Analysis Results. This guide provides instructions for researchers interested in reproducing the results and methodology presented in this study. All Supplementary Materials, including the Anaconda version 2024.10-1 Python Jupyter Notebook ('JMSE_Mihailov_JASCO_CNN model(model_04072025).ipynb'), output CSV data tables, and generated high-resolution plot images, are available for download as Supplementary Data.
System Requirements and Environment Setup: To execute the provided code and reconstruct the results, a Python 3.8+ environment is required. It is highly recommended to use a virtual environment manager (e.g., Anaconda/Miniconda) to manage dependencies. The necessary Python libraries can be installed with: pip install numpy matplotlib librosa scipy pandas scikit-learn tensorflow.
Output CSV data files:
broadband_spl_statistics.csv (Broadband SPL Statistics): summarizes the broadband Sound Pressure Level (SPL) distribution derived from the same set of acoustic segments analyzed for spectral statistical moments. It includes descriptive statistics of SPL values in dB re 1 µPa: count (total number of analyzed segments, N = 1556), mean (average SPL), std (standard deviation), min/max (minimum and maximum SPL values observed), and quantiles at the 10%, 25%, 50% (median), 75%, and 90% percentiles. These statistics provide a high-level overview of the broadband acoustic environment, complementing the frequency-specific PSD features by reflecting the overall intensity of the soundscape.
psd_statistical_moments.csv: consists of 1556 entries, each representing a sound segment characterized by statistical moments computed from power spectral density (PSD) data across three predefined frequency bands: Low-Frequency Shipping (LF), Mid-Frequency Biological (MF), and High-Frequency Rain/Clicks (HF). The data columns include: filepath (string identifier indicating the source file path of each sound segment); label (binary class label, e.g., 0 or 1, signifying the presence or absence of a specific acoustic event or category); Low_Freq_Shipping_mean_ssl, variance_ssl, skewness_ssl (mean, variance, and skewness of PSD values within the low-frequency band associated with shipping noise); Mid_Freq_Biological_mean_ssl, variance_ssl, skewness_ssl (statistical moments for mid-frequency biological sounds, e.g., marine mammals); High_Freq_Rain_Clicks_mean_ssl, variance_ssl, skewness_ssl (statistical moments for high-frequency signals likely related to rain or echolocation clicks). These features were extracted to capture the distributional properties of the PSD across different frequency ranges, aiding in the classification and analysis of marine acoustic environments.
psd_statistics.csv (Power Spectral Density Statistics): presents spectral-level summary statistics across 1025 frequency bins (0 Hz to ~22 kHz), offering a detailed view of the acoustic energy distribution over frequency. Each row corresponds to a specific frequency bin and includes the following metrics for sound spectral levels (SSL), expressed in dB re 1 µPa²/Hz: mean SSL (the average spectral level at each frequency) and the 1st, 5th, 50th, 95th, and 99th percentile SSLs, capturing the distribution and variability of acoustic levels across recordings and highlighting both background noise floors and occasional high-energy events. These spectral descriptors offer a frequency-resolved characterization of the soundscape, enabling the identification of dominant sources, temporal variability, and baseline conditions in underwater acoustic monitoring.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article/Supplementary Materials. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MSFD: Marine Strategy Framework Directive
TG Noise: Technical Group on Underwater Noise
SPL: Sound Pressure Level
PSD: Power Spectral Density
CNN: Convolutional Neural Network
FFT: Fast Fourier Transform
ML: Machine Learning
AMARs: Autonomous Multichannel Acoustic Recorders
RF: Random Forest
SVM: Support Vector Machine
GMM: Gaussian Mixture Models
SSL: Spectral Sound Level
RMS: Root Mean Square
ROC: Receiver Operating Characteristic
PR: Precision–Recall
AUC: Area Under the Curve
AP: Average Precision
KNN: K-Nearest Neighbors
XGBoost: Extreme Gradient Boosting
LSTM: Long Short-Term Memory

Appendix A

Appendix A.1. CNN Hyperparameters

Table A1. CNN Hyperparameters.
Hyperparameter | Value
Input shape | (128, 431, 1)
Conv2D Filters (L1) | 32
Conv2D Filters (L2) | 64
Conv2D Filters (L3) | 128
Kernel Size | (3, 3)
MaxPooling Pool Size | (2, 2)
Dropout Rates | 0.25 (conv), 0.5 (dense)
Dense Layers | 128 (hidden), Num_Classes (output)
Activation (Hidden) | ReLU
Activation (Output) | Softmax
Optimizer | Adam
Learning Rate | 0.001
Loss Function | Categorical Crossentropy
Batch Size | 32
Epochs | 200
Early Stopping Patience | 10
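For reproducibility, a Keras model consistent with these hyperparameters can be sketched as below; the exact layer ordering (e.g., activation placement relative to batch normalization) is an assumption based on Table 2, not a verbatim copy of the study's notebook:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(num_classes: int) -> tf.keras.Model:
    model = models.Sequential([tf.keras.Input(shape=(128, 431, 1))])
    for filters in (32, 64, 128):            # three conv blocks, per Table A1
        model.add(layers.Conv2D(filters, (3, 3), activation="relu"))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D((2, 2)))
        model.add(layers.Dropout(0.25))      # conv dropout rate
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation="relu"))
    model.add(layers.BatchNormalization())
    model.add(layers.Dropout(0.5))           # dense dropout rate
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# Early stopping with patience 10, as listed in Table A1
early_stop = tf.keras.callbacks.EarlyStopping(patience=10,
                                              restore_best_weights=True)
model = build_cnn(num_classes=2)
model.summary()
```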

Appendix A.2. Random Forest Model

The Random Forest classifier was applied using the 9-dimensional PSD statistical moments as input features. The model was configured with 100 estimators and employed balanced class weighting to mitigate potential class imbalance effects. The hyperparameters used are explicitly documented in the Supplementary File S1: rf_hyperparameters.csv.
The Random Forest model achieved a test accuracy of 0.8494 (cf. Table 3). Its detailed classification performance, including precision, recall, and F1-score for each class (Background Noise, Vessel Noise, and Biological Noise), is presented in Table 6.
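A minimal scikit-learn sketch matching this configuration (100 estimators, balanced class weights, 9-dimensional PSD-moment features) is given below; the feature matrix and class proportions are synthetic stand-ins:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 9-dimensional PSD-moment feature matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(1556, 9))
y = rng.choice([0, 1, 2], size=1556, p=[0.8, 0.15, 0.05])  # imbalanced classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
rf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                            random_state=42)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
print("feature importances:", rf.feature_importances_)  # basis of Figure A4
```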
The confusion matrices provide a complementary view of the classifier's behavior. The raw counts of true and predicted labels are presented in Figure A2, showing the number of samples correctly and incorrectly classified for each class.
Complementing this, Figure A1 provides a class-wise normalized view, indicating the proportion of samples correctly classified within each true class. Figure A1 shows that the model achieved a high true positive rate for ‘Background Noise’, indicating robust detection of this prevalent class, while the true positive rates for ‘Vessel Noise’ and ‘Biological Noise’ were very low, suggesting significant challenges in identifying these minority classes.
Figure A1. Normalized confusion matrix for the Random Forest classifier, indicating the proportion of correctly and incorrectly classified instances within each true class and providing insights into misclassification patterns.
Figure A2. Raw-count confusion matrix for the Random Forest classifier, illustrating the absolute counts of true versus predicted labels across all classes.
Feature importance analysis from the Random Forest model, shown in Figure A4, displays the relative contribution of each PSD moment feature to the classification decisions. Features related to the mean SSL in the low-frequency shipping and mid-frequency biological bands often exhibited higher importance, indicating their strong discriminative power for these noise categories. Specifically, the mean SSL in the low-frequency vessel band emerged as one of the most influential features for distinguishing between noise types.
Receiver Operating Characteristic (ROC) curves, including individual class, micro-average, and macro-average Area Under the Curve (AUC), are presented in Figure A3. These curves illustrate the classifier's ability to discriminate between classes across various probability thresholds. The ROC curve for 'Background Noise' shows a high AUC, reflecting the model's strong performance in distinguishing this class from others. For 'Vessel Noise' and 'Biological Noise', the AUCs are significantly lower, corroborating the challenges observed in the confusion matrix. The micro-average AUC, which aggregates the contributions of all classes, provides an overall measure of discriminatory power, whereas the macro-average AUC assigns equal weight to each class regardless of its size.
Figure A3. Random Forest Multi-Class Receiver Operating Characteristic (ROC) Curve. ROC curves for each class, along with micro-average and macro-average curves, illustrating the Random Forest classifier's ability to discriminate between noise classes. The black dotted diagonal line indicates the performance of a random classifier (i.e., no discrimination ability).
Figure A4. Feature importance analysis from the Random Forest model: the relative importance of each engineered PSD statistical moment feature, as determined by the classifier, highlighting its contribution to classification decisions.
Figure A5 displays the Precision–Recall (PR) curves for the Random Forest classifier, offering a detailed perspective on its performance, particularly valuable for datasets with potential class imbalance. In Figure A5, three distinct PR curves are presented, corresponding to Class 0 (Background Noise), Class 1 (Vessel Noise), and Class 2 (Biological Noise). The Area Under the Curve (AUC) for each PR curve is termed Average Precision (AP). The PR curve for Class 0 (Background Noise) demonstrates exceptionally high performance, maintaining strong precision even as recall approaches 1.0, reflected by an AP of 0.96. This indicates the model’s high reliability and completeness in identifying background noise. In stark contrast, the PR curves for Class 1 (Vessel Noise) and Class 2 (Biological Noise) both exhibit an AP of 0.00. This indicates that for these classes, the model’s precision remains very low across all levels of recall, suggesting severe difficulty in correctly identifying positive instances of vessels and biological noise without incurring a high rate of false positives. This observation is consistent with the low F1-scores and recalls noted for these minority classes in the classification reports (Table 6).
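The per-class PR curves and AP scores discussed here can be reproduced with scikit-learn as in the following sketch, again using hypothetical labels and probabilities:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, size=300)          # hypothetical labels
y_score = rng.dirichlet(np.ones(3), size=300)  # hypothetical probabilities

y_bin = label_binarize(y_true, classes=[0, 1, 2])
for k, name in enumerate(["Background Noise", "Vessel Noise",
                          "Biological Noise"]):
    prec, rec, _ = precision_recall_curve(y_bin[:, k], y_score[:, k])
    ap = average_precision_score(y_bin[:, k], y_score[:, k])
    print(f"{name}: AP = {ap:.2f}")  # AP summarizes the area under each PR curve
```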
Figure A5. Random Forest Precision–Recall Curve. Precision–Recall curves for each noise class, offering insights into the Random Forest classifier's precision–recall trade-off, particularly relevant for imbalanced datasets.
During the generation of the Random Forest learning curve, displayed in Figure A6, FitFailedWarning messages were observed for 20 out of 50 fits. The traceback explicitly indicated a ValueError ('The number of classes has to be greater than one; got 1 class'). This signifies that in some cross-validation folds, the training or validation subsets passed to learning_curve contained only one unique class, making it impossible for the classification model to train or evaluate on that subset. This phenomenon typically stems from extreme class imbalance within the dataset, particularly for minority classes, where stratified splitting cannot guarantee representation of all classes in every fold when sample sizes are minimal.
Consequently, some data points on the learning curve are either averaged over fewer successful folds or are effectively absent, which affects the completeness of the curve for all classes. Despite these warnings, the visible trends in Figure A6 indicate that the training score generally decreases as more training examples are included, while the cross-validation score fluctuates but eventually plateaus, providing insights into the model's bias–variance trade-off.
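The failure mode described above can be reproduced, and partially mitigated, with stratified cross-validation and a non-raising error score, as in this illustrative sketch (the synthetic class proportions are assumptions chosen to trigger one-class subsets; error_score is available in recent scikit-learn releases):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 9))
y = rng.choice([0, 1, 2], size=400, p=[0.9, 0.08, 0.02])  # severe imbalance

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, class_weight="balanced"),
    X, y, cv=cv, train_sizes=np.linspace(0.1, 1.0, 10),
    scoring="accuracy", error_score=np.nan)  # nan marks fits on 1-class subsets
print(np.nanmean(val_scores, axis=1))        # average over successful folds only
```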
Figure A6. Random Forest Learning Curve. Random Forest model's performance (accuracy) as a function of training data size, showing training and cross-validation scores and highlighting issues related to data scarcity in certain classes.

Appendix A.3. Support Vector Machine (SVM) Model

The Support Vector Machine (SVM) classifier, configured with an RBF kernel, balanced class weighting, and probability estimation enabled, was also trained on the PSD statistical moments. Its hyperparameters are detailed in svm_hyperparameters.csv.
The SVM achieved a test accuracy of 0.8397, slightly below that of the Random Forest model (0.8494). Its comprehensive classification report is provided in Table 7.
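A sketch of this SVM configuration follows; the feature standardization step is an added assumption (common practice for RBF kernels) rather than a documented detail of the study:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(1556, 9))   # stand-in PSD statistical moments
y = rng.choice([0, 1, 2], size=1556, p=[0.8, 0.15, 0.05])

svm = make_pipeline(
    StandardScaler(),  # scaling assumed; helps RBF kernels on raw dB features
    SVC(kernel="rbf", class_weight="balanced", probability=True,
        random_state=42))
svm.fit(X, y)
proba = svm.predict_proba(X[:5])  # probabilities enable the ROC/PR analyses
```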
The performance of the SVM is further visualized through its confusion matrices (Figure A7). Figure A7 presents the raw counts of true versus predicted labels, showing the total number of correct and incorrect classifications.
Figure A7. SVM Raw Confusion Matrix. The absolute counts of true versus predicted labels for the Support Vector Machine classifier, showing the distribution of classifications across all classes.
In parallel, Figure A8 displays the normalized confusion matrix, which illustrates the proportion of correctly classified instances for each true class. Similarly to the Random Forest model, the SVM demonstrated strong performance in classifying 'Background Noise', evidenced by a high true positive rate in Figure A8. However, the model demonstrated limited accuracy in correctly identifying 'Vessel Noise' and 'Biological Noise', as indicated by their particularly low true positive rates. These categories are especially challenging because of their minority-class status.
Figure A8. SVM Normalized Confusion Matrix. The normalized confusion matrix for the Support Vector Machine classifier indicates the proportion of correctly and incorrectly classified instances within each true class, providing insights into misclassification patterns.
Receiver Operating Characteristic (ROC) curves for the SVM, including individual class, micro-average, and macro-average Area Under the Curve (AUC), are presented in Figure A9. These curves quantify the model's ability to differentiate between classes across varying classification thresholds. The high AUC for 'Background Noise' in Figure A9 confirms the SVM's strong performance on this dominant class. For 'Vessel Noise' and 'Biological Noise', the AUC values are considerably lower, reflecting the model's reduced ability to differentiate these minority classes.
Figure A9. SVM Multi-Class Receiver Operating Characteristic (ROC) Curve. ROC curves for each class, along with micro-average and macro-average curves, illustrating the Support Vector Machine classifier's ability to discriminate between noise classes. All curves exhibit an Area Under the Curve (AUC) of 0.91, indicating strong classification performance. The black dotted diagonal line represents the performance of a random classifier, serving as a baseline for comparison.
Complementing the ROC analysis, Precision–Recall (PR) curves, displayed in Figure A10, illustrate the trade-off between precision and recall, a particularly informative view for datasets with imbalanced class distributions. The Average Precision (AP) scores associated with these curves provide a summary metric. As expected, ‘Background Noise’ exhibits a PR curve that maintains high precision across a wide range of recall, underscoring the model’s reliability in identifying this standard class. In contrast, the PR curves for ‘Vessel Noise’ and ‘Biological Noise’ in Figure A10 show low precision and recall values, consistent with their challenging detection rates.
Figure A10. SVM Precision–Recall Curve. Precision–Recall curves for each noise class, offering insights into the Support Vector Machine classifier's precision–recall trade-off, particularly relevant for imbalanced datasets.
The SVM learning curve, shown in Figure A11, provides insights into the model's performance as a function of training data size. As with the Random Forest learning curve, this plot also produced FitFailedWarning messages, stemming from the same ValueError ('The number of classes has to be greater than one; got 1 class'). This confirms that in specific cross-validation folds, the training or validation subsets contained only a single unique class. The issue, primarily due to extreme class imbalance and limited sample sizes for minority classes, prevented the model from training or evaluating on those subsets. Consequently, some data points on the learning curve may be absent or represent averages over fewer successful folds. Nevertheless, the observable trends in Figure A11 suggest that as the number of training examples increases, the training accuracy generally decreases while the cross-validation accuracy tends to stabilize, indicating the model's generalization behavior.
Figure A11. SVM Learning Curve. Support Vector Machine model's performance (accuracy) as a function of training data size, showing training and cross-validation scores and highlighting issues related to data scarcity in certain classes.

Appendix A.4. Gaussian Mixture Models (GMMs) Model

A conceptual application of Gaussian Mixture Models (GMMs) for anomaly detection was explored. A GMM was explicitly fitted to the ‘Background Noise’ class from the training data, allowing for the probabilistic modeling of typical ambient acoustic conditions.
Then, test samples were evaluated based on their log-likelihood scores under this fitted background GMM. The distribution of these scores is presented in Figure A12, where a histogram illustrates the frequency distribution for both 'Background Noise' and 'Vessel Noise' samples from the test set. A key observation from Figure A12 is the distinct separation between the two score distributions. Background noise samples generally exhibit higher log-likelihood scores, indicating that they are well represented by the fitted background GMM. In contrast, vessel noise samples tend to cluster at lower log-likelihood scores, signifying that they are unlikely to originate from the distribution learned from background noise. A defined anomaly threshold, set at the 5th percentile of the background noise scores, helps delineate 'normal' background conditions from potential anomalous events. This separation highlights the GMM's ability to detect deviations from the typical acoustic environment, which may correspond to unclassified or unexpected sound sources.
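The anomaly-detection logic described here, fitting on background samples only and thresholding log-likelihoods at the 5th percentile, can be sketched as follows; the number of mixture components and the synthetic features are illustrative assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_background = rng.normal(0.0, 1.0, size=(500, 9))        # stand-in background
X_test = np.vstack([rng.normal(0.0, 1.0, size=(50, 9)),   # background-like
                    rng.normal(4.0, 1.0, size=(50, 9))])  # vessel-like outliers

# Model 'typical' ambient conditions; n_components is an illustrative choice
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=42)
gmm.fit(X_background)

scores = gmm.score_samples(X_test)                        # log-likelihood/sample
threshold = np.percentile(gmm.score_samples(X_background), 5)  # 5th-percentile rule
print("anomalies flagged:", int(np.sum(scores < threshold)))
```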
Figure A12. GMM Log-Likelihood Scores for Anomaly Detection. Histogram of log-likelihood scores for background and vessel noise samples, relative to a Gaussian Mixture Model fitted on background noise, illustrating its potential for anomaly detection.

Appendix A.5. User Guide: Reproducing Underwater Noise Analysis Results

See Supplementary Files: This guide provides step-by-step instructions for setting up the environment, executing the Jupyter Notebook (JMSE_Mihailov_JASCO_CNN model(model_04072025).ipynb), and verifying the reconstructed results of the underwater noise analysis.

Appendix A.6. Overall Framework for Underwater Noise Analysis Using Machine Learning

Figure A13. The overall methodology described in Section 2, from data acquisition to model outputs.

References

  1. Orescanin, M.; Beckler, B.; Pfau, A.; Atchley, S.; Villemez, N.; Joseph, J.; Miller, C.; Margolina, T. Multi-label classification of heterogeneous underwater soundscapes with Bayesian deep learning. TechRxiv 2022. [Google Scholar] [CrossRef]
  2. Yang, H.; Byun, S.; Lee, K.; Choo, Y.; Kim, K. Underwater acoustic research trends with machine learning: Active sonar applications. J. Ocean Eng. Technol. 2020, 34, 277–284. [Google Scholar] [CrossRef]
  3. Yang, H.; Lee, K.; Choo, Y.; Kim, K. Underwater acoustic research trends with machine learning: General background. J. Ocean Eng. Technol. 2020, 34, 147–154. [Google Scholar] [CrossRef]
  4. Santos, M.; Calazan, R. Improved spectral dynamic features extracted from audio data for classification of marine vessels. Intell. Mar. Technol. Syst. 2024, 2, 18. [Google Scholar] [CrossRef]
  5. Domingos, L.; Santos, P.; Skelton, P.; Brinkworth, R.; Sammut, K. An investigation of preprocessing filters and deep learning methods for vessel type classification with underwater acoustic data. IEEE Access 2022, 10, 117582–117596. [Google Scholar] [CrossRef]
  6. Senthil Kumaran, V.N.; Indumathi, G.; Vijay, M. Ensemble of Deep Learning Enabled Modulation Signal. Res. Sq. 2022. [Google Scholar] [CrossRef]
  7. Ahmed, A.; Younis, M. Acoustic beam characterization and selection for optimized underwater communication. Appl. Sci. 2019, 9, 2740. [Google Scholar] [CrossRef]
  8. Zeng, X.; Wang, S. Underwater sound classification based on Gammatone filter bank and Hilbert-Huang transform. In Proceedings of the 2014 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), Guilin, China, 5–8 August 2014; pp. 707–710. [Google Scholar] [CrossRef]
  9. He, C.; Wang, Y.; Yu, W.; Song, L. Underwater target localization and synchronization for a distributed simo sonar with an isogradient ssp and uncertainties in receiver locations. Sensors 2019, 19, 1976. [Google Scholar] [CrossRef] [PubMed]
  10. André, M.; Iwase, R.; Akamatsu, T.; Takahashi, I.; Zaugg, S.; van der Schaar, M.; Houégnigan, L.; Sánchez, A.M. Automated real-time acoustic detection of Fin Whale calls at the deep sea floor observatory off Kushiro-Tokachi, Japan. In Proceedings of the 2011 IEEE Symposium on Underwater Technology and Workshop on Scientific Use of Submarine Cables and Related Technologies, Tokyo, Japan, 5–8 April 2011; pp. 1–4. [Google Scholar] [CrossRef]
  11. Malfante, M.; Mohammed, O.; Gervaise, C.; Dalla Mura, M.; Mars, J.I. Use of Deep Features for the Automatic Classification of Fish Sounds. In Proceedings of the 2018 OCEANS-MTS/IEEE Kobe Techno-Oceans (OTO), Kobe, Japan, 28–31 May 2018; pp. 1–5. [Google Scholar] [CrossRef]
  12. Malfante, M.; Mura, M.; Mars, J.; Gervaise, C. Automatic fish sounds classification. J. Acoust. Soc. Am. 2016, 139, 2115–2116. [Google Scholar] [CrossRef]
  13. Montgomery, J.; Radford, C. Marine bioacoustics. Curr. Biol. 2017, 27, R502–R507. [Google Scholar] [CrossRef] [PubMed]
  14. Wei, M.; Chen, K.; Lin, Y.; Cheng, E. Recognition of behavior state of penaeus vannamei based on passive acoustic technology. Front. Mar. Sci. 2022, 9, 973284. [Google Scholar] [CrossRef]
  15. Zeng, X.; Wang, Q.; Zhang, C.; Cai, H. Feature selection based on Relief F and PCA for underwater sound classification. In Proceedings of the 3rd International Conference on Computer Science and Network Technology, Dalian, China, 12–13 October 2013; pp. 442–445. [Google Scholar] [CrossRef]
  16. Parcerisas, C.; Botteldooren, D.; Devos, P.; Debusschere, E. Clustering, categorizing, and mapping of shallow coastal water soundscapes. In Proceedings of the 10th Convention of the European Acoustics Association Forum Acusticum 2023, Torino, Italy, 11–15 September 2023; pp. 6091–6097. [Google Scholar] [CrossRef]
  17. Padró, E.; Garcia-Benadí, A.; Toma, D.; Delory, E.; Castro, S.; Fernández, J. Metadata-driven universal real-time ocean soundmeasurement architecture. IEEE Access 2021, 9, 28282–28301. [Google Scholar] [CrossRef]
  18. Merchant, N.D.; Brookes, K.L.; Faulkner, R.C.; Bicknell, A.W.J.; Godley, B.J.; Witt, M.J. Underwater noise levels in UK waters. Sci. Rep. 2016, 6, 36942. [Google Scholar] [CrossRef] [PubMed]
  19. Qu, J.; Li, D.; Liu, X.; Sun, M. Underwater acoustic target signal enhancement algorithm optimized by feature preservation and noise update. J. Phys. Conf. Ser. 2025, 2939, 012004. [Google Scholar] [CrossRef]
  20. Shaik, A.; Das, B. A Novel Method for Classification and Modelling of Underwater Acoustic Communication Through Machine Learning and Image Processing Technique. Research Square 2023 Preprint (Version 1, 22 August 2023). Available online: https://www.researchsquare.com/article/rs-3241368/v1 (accessed on 10 June 2025). [CrossRef]
  21. Zhao, Y.; Wang, M.; Xue, H.; Gong, Y.; Qiu, B. Prediction Method of Underwater Acoustic Transmission Loss Based on Deep Belief Net Neural Network. Appl. Sci. 2021, 11, 4896. [Google Scholar] [CrossRef]
  22. Luttenberger, L.R.; Slišković, M.; Ančić, I.; Boljat, H.U. Environmental Impact of Underwater Noise. J. Marit. Transp. Sci. 2022, 45–54. [Google Scholar] [CrossRef]
  23. Popit, A. Underwater noise in the Slovenian Sea. Mater. Geoenvironment Sciendo 2021, 67, 161–175. [Google Scholar] [CrossRef]
  24. Huo, X.; Zhang, P.; Yuan, Y.; Li, G.; Tang, J.; Shi, B. Underwater Noise Characteristics of the Tidal Inlet of Zhanjiang Bay. Water 2023, 15, 3586. [Google Scholar] [CrossRef]
  25. Peng, Y.; Laguna, A.J.; Tsouvalas, A. A Multi-physics Approach for Modelling Noise Mitigation Using an Air-bubble Curtain in Impact Pile Driving. Front. Mar. Sci. 2023, 10, 1134776. [Google Scholar] [CrossRef]
Figure 1. Passive acoustic stations in the northwestern Black Sea, equipped with Autonomous Multichannel Acoustic Recorders (AMARs).
Figure 2. Example of Mel spectrogram that represents the spectral content of an underwater acoustic recording over time, with frequency on the y-axis (ranging from 0 to approximately 16,384 Hz on a logarithmic scale), time on the x-axis (from 0 to 4.8 s), and color intensity indicating the energy level in dB. This Mel spectrogram suggests a blend of continuous, potentially ambient or persistent anthropogenic noise in the lower frequency bands, along with distinct, short-duration impulsive events occurring sporadically in the mid-to-high frequency ranges. Such a detailed spectral representation is valuable for identifying and characterizing different types of underwater acoustic phenomena.
Figure 3. Spectral characteristics of an example background noise (solid blue line) and an example vessel noise (dashed orange line) in an underwater acoustic environment. The x-axis represents frequency in Hz on a logarithmic scale, ranging from approximately 10 Hz to over 40,000 Hz, while the y-axis represents the Spectral Level in dB re 1 µPa²/Hz. This type of plot provides detailed information about noise levels across different frequencies, characterizing ambient noise.
Figure 4. Noise Level Statistics presents a comprehensive overview of the Power Spectral Density (PSD) across various frequencies for all processed audio segments. The plot shows percentile curves (1%, 5%, 50%, 95%, 99%) in grayscale, with the RMS (root mean square) level shown in magenta. Shaded blue areas represent the density of PSD curves across all time windows, providing a visual indication of the distribution spread and variability in the spectral content across the dataset.
Figure 5. The training and validation performance of the Convolutional Neural Network (CNN) model for underwater noise classification: model accuracy (left) and model loss (right) as a function of training epoch.
Figure 6. Mel-spectrograms from the test set, illustrating the CNN model's prediction capabilities for different underwater acoustic samples. Each subplot provides the sample number, the actual label, the predicted label, and the associated prediction probabilities. The y-axis represents frequency in Hz (log scale), and the x-axis represents time in seconds.
Figure 7. Comparison of classification model accuracies for underwater acoustic source identification.
Table 1. Broadband SPL (dB re 1 µPa) Statistics Across All Files.
Statistic | Value
count | 1556
mean | 175.3
std | 12.18
min | 161.93
10% | 166.49
50% | 169.98
90% | 197.91
95% | 206.14
max | 210.31
Table 2. The summary of the Convolutional Neural Network (CNN) model architecture.
Layer (Type) | Output Shape | Param #
conv2d (Conv2D) | (None, 126, 429, 32) | 320
batch_normalization (BatchNormalization) | (None, 126, 429, 32) | 128
max_pooling2d (MaxPooling2D) | (None, 63, 214, 32) | 0
dropout (Dropout) | (None, 63, 214, 32) | 0
conv2d_1 (Conv2D) | (None, 61, 212, 64) | 18,496
batch_normalization_1 (BatchNormalization) | (None, 61, 212, 64) | 256
max_pooling2d_1 (MaxPooling2D) | (None, 30, 106, 64) | 0
dropout_1 (Dropout) | (None, 30, 106, 64) | 0
conv2d_2 (Conv2D) | (None, 28, 104, 128) | 73,856
batch_normalization_2 (BatchNormalization) | (None, 28, 104, 128) | 512
max_pooling2d_2 (MaxPooling2D) | (None, 14, 52, 128) | 0
dropout_2 (Dropout) | (None, 14, 52, 128) | 0
flatten (Flatten) | (None, 93184) | 0
dense (Dense) | (None, 128) | 11,927,680
batch_normalization_3 (BatchNormalization) | (None, 128) | 512
dropout_3 (Dropout) | (None, 128) | 0
dense_1 (Dense) | (None, 2) | 258
Table 3. Model performance comparison.
Model | Accuracy
Random Forest | 0.8494
SVM | 0.8397
CNN | 0.9359
Table 4. Comparative Classification Metrics by Model and Class (F1-score).
Metric | Class/Overall | CNN | Random Forest | SVM
Accuracy | Overall | 0.9359 | 0.8494 | 0.8397
F1-Score | Background Noise | 0.969 | 0.8326 | 0.8236
F1-Score | Vessel Noise | 0.700 | 0.0002 | 0.0002
F1-Score | Biological Noise | 0.000 | 0.0000 | 0.0000
F1-Score | Average | 0.556 | 0.2746 | 0.2746
Table 5. CNN Classification Report.
Class | Precision | Recall | F1-Score | Support
Background Noise | 0.945 | 0.995 | 0.969 | 141
Vessel Noise | 0.850 | 0.600 | 0.700 | 171
Biological Noise | 0.000 | 0.000 | 0.000 | 0
Accuracy | 0.9359 | 0.9359 | 0.9359 | 312
Macro average | 0.598 | 0.532 | 0.556 | 312
Weighted average | 0.9359 | 0.9359 | 0.930 | 312
Table 6. Random Forest Classification Report.
Class | Precision | Recall | F1-Score | Support
Background Noise | 0.8264 | 0.8440 | 0.8351 | 141
Vessel Noise | 0.8691 | 0.8538 | 0.8614 | 171
Biological Noise | 0.000 | 0.000 | 0.000 | 0
Accuracy | 0.8494 | 0.8489 | 0.8482 | 312
Macro average | 0.8477 | 0.8889 | 0.8482 | 312
Weighted average | 0.8498 | 0.8494 | 0.8495 | 312
Table 7. SVM Classification Report.
Class | Precision | Recall | F1-Score | Support
Background Noise | 0.7725 | 0.9149 | 0.8377 | 141
Vessel Noise | 0.9172 | 0.7778 | 0.8418 | 171
Biological Noise | 0.8397 | 0.8397 | 0.8397 | 0
Accuracy | 0.8448 | 0.8463 | 0.8397 | 312
Macro average | 0.8518 | 0.8397 | 0.8399 | 312
Weighted average | 0.7725 | 0.9149 | 0.8377 | 312