1. Introduction
The underwater acoustic environment is a complex and dynamic system that holds significant interest for researchers due to its ecological and navigational implications. The characterization and automated classification of underwater acoustic environments, particularly in regions such as the Western Black Sea, have become increasingly practical with advancements in machine learning (ML) techniques.
The importance of machine learning in the analysis of underwater acoustics cannot be overstated, as it facilitates the automated classification of complex and diverse soundscapes. For instance, Orescanin et al. (2022) highlight how these techniques can be applied for real-time environmental monitoring, allowing for the identification and tracking of marine life and anthropogenic activities [1]. Furthermore, Yang H. et al. (2020) emphasize the growing field of underwater acoustic research utilizing machine learning, particularly in applications involving passive sonar, showcasing methods that can significantly enhance underwater environmental assessments [2,3].
Another aspect of characterizing underwater environments is the extraction and utilization of acoustic features from collected data. The study by Santos and Calazan presents enhanced spectral dynamic features that improve classification rates for marine vessels [4]. This aligns with Domingos et al.’s investigation, which explores the efficacy of preprocessing filters and deep learning methodologies tailored for underwater sound classification, highlighting challenges related to low signal-to-noise ratios in these environments [5].
In addition, variables such as water temperature, salinity, and depth further complicate the classification task. In particular, the study by Ahmed and Younis examines the role of modulation signals under varying conditions and demonstrates how efficient classification processes can facilitate underwater acoustic communications [6]. Similarly, the work by Ahmed A. and Younis M. (2019) explains how the characterization of underwater acoustic signals is influenced by propagation characteristics specific to the marine environment, adding additional layers to the classification efforts [7]. The physical properties of underwater environments, which govern sound propagation, are significant for classification algorithms. For instance, the findings on acoustic beam characterization and the effect of varying medium properties highlight the complexity involved in developing algorithms that are robust against changing environmental conditions [8]. Supporting this, Chen et al.’s comparative analysis of various filter bank techniques reveals insights into how specific underwater conditions can affect sound source localization and classification accuracy, thereby accentuating the need for adaptable methodologies [9].
Moreover, the application of deep learning techniques has shown promise in significantly enhancing the precision of underwater acoustic detection and classification. Hinojosa et al.’s work on automated detection systems has set a precedent for real-time monitoring of marine environments, demonstrating the potential for machine learning to facilitate ecological interventions and improve marine conservation strategies [10]. These capabilities are particularly relevant in pollution control and habitat monitoring, where accurate sound classification can yield precise data regarding fauna and environmental changes.
Further, Malfante et al. investigate the automatic classification of fish sounds by employing feature sets that effectively represent underwater acoustic data [11]. Their results demonstrate that utilizing convolutional neural networks can streamline the processing of acoustic signals, thereby enhancing the detection capabilities within marine monitoring systems [11,12]. These methodologies align with ongoing initiatives that aim to leverage acoustic modeling for wide-scale ecological assessments.
Additionally, the intersection of acoustics and bioacoustics is becoming increasingly important. Research by Montgomery and Radford explores how marine bioacoustics aids in understanding the acoustic environment, providing insights into species interactions and behaviors [13]. This holistic approach emphasizes the importance of sound as a communicative and navigational tool for marine organisms, thereby influencing classification techniques that aim to represent the underwater soundscape accurately.
Another development is represented by methodologies that apply deep learning models developed for underwater sound recognition, as detailed in the hardware and software developments by Yang et al. (2020) [3]. Their work demonstrates how structured feature extraction significantly enhances classification efficacy, creating a more actionable data landscape for monitoring aquatic environments. The integration of advanced computational techniques, such as support vector machines with feature selection methods, demonstrates the ability to refine the classification accuracy of marine species’ vocalizations [14,15].
The implications of these developments extend beyond simple classification; they enable enhanced ecological monitoring capabilities. For example, implementations of passive acoustic monitoring provide dynamic insights into marine biodiversity, stresses induced by climate change, and habitat disruptions, and serve as essential milestones in marine research [16,17].
Our original contributions primarily lie in several significant aspects. First, we introduce a comprehensive integrated framework that connects all stages of underwater noise analysis, from detailed acoustic metric computation (broadband SPL, PSD) and multi-faceted feature engineering (e.g., Mel spectrograms for deep learning, statistical moments from PSD for classical machine learning) to the application and systematic comparison of diverse machine learning paradigms. This holistic pipeline, which links established acoustic principles with data-driven analytics, enables passive acoustic monitoring in complex marine environments [5,18].
Second, we provide a direct quantitative comparison between a convolutional neural network (CNN) and traditional machine learning approaches (Random Forest, Support Vector Machine) under identical conditions, addressing a notable gap in the literature, where most studies focus on only one paradigm [4].
Third, our analysis provides essential insights into feature representation strategies. Despite the theoretical strengths of learned features in CNNs, the Random Forest and SVM models, which relied on carefully engineered statistical moments, achieved competitive accuracies (0.849 and 0.839, respectively) relative to the CNN (0.935). This observation underscores the continued value of domain-driven feature engineering in specific contexts, challenging the prevailing assumption that deep learning consistently outperforms classical techniques [11].
Fourth, we transparently address in situ data challenges, including class imbalance and annotation uncertainty, which led to observed issues such as FitFailedWarnings in learning curves. By explicitly discussing these limitations, our study provides practical guidance for the underwater acoustics community dealing with similar data constraints [13]. Finally, we conceptually integrate an unsupervised anomaly detection module using Gaussian Mixture Models (GMMs), which expands the framework’s capacity to capture novel or unclassified acoustic events—an essential capability in dynamic, data-sparse marine environments [16].
2. Materials and Methods
This section outlines the experimental setup, data acquisition, signal processing, and machine learning methodologies employed (see Appendix A.6 Figure A13) to characterize and automate the classification of underwater acoustic environments in the Western Black Sea.
2.1. Study Area and Acoustic Data Collection
Passive acoustic recordings were obtained in the northwestern Black Sea using Autonomous Multichannel Acoustic Recorders (AMARs, Brüel & Kjær, Nærum, Denmark). Recorders were moored at depths of 50–100 m at three sites spanning shelf and slope habitats (Figure 1; 43.5–44° N, 29.0–29.5° E). Each AMAR simultaneously sampled four hydrophone channels at a 192 kHz sampling rate (24-bit resolution), providing time-series recordings stored in WAV file format. The audio data were acquired using a hydrophone with a sensitivity of −209.1 dB re 1 V/µPa, connected to a recorder with a pre-amplifier gain of 8.25 dB. The recorder’s full-scale input voltage was 1.56675 V (peak), and the reference pressure for underwater SPL calculations was 1 µPa. For this study, a dataset comprising 341 WAV files from the designated study area was utilized for analysis. Deployment durations ranged from April to October 2022, capturing seasonal variations in ambient noise and odontocete activity. Recorder positions were determined via GPS at deployment and retrieval; no glider data were included in the present analysis.
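For illustration, the calibration chain implied by these values can be expressed as a short computation. The following is a minimal Python sketch (function and variable names are ours, not taken from the authors’ processing code), assuming WAV samples normalized to [−1, 1]:

import numpy as np

SENSITIVITY_DB = -209.1   # hydrophone sensitivity, dB re 1 V/uPa
GAIN_DB = 8.25            # pre-amplifier gain, dB
FULL_SCALE_V = 1.56675    # recorder full-scale input voltage (peak)

def broadband_spl(samples: np.ndarray) -> float:
    """Broadband SPL (dB re 1 uPa) from normalized WAV samples."""
    volts = samples * FULL_SCALE_V                         # counts -> volts at the recorder
    volts_in = volts / 10 ** (GAIN_DB / 20)                # undo the pre-amplifier gain
    pressure_upa = volts_in / 10 ** (SENSITIVITY_DB / 20)  # volts -> micropascals
    p_rms = np.sqrt(np.mean(pressure_upa ** 2))
    return 20 * np.log10(p_rms)                            # reference pressure: 1 uPa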
The selection of recorder deployment locations was driven mainly by the need to broadly characterize both significant anthropogenic noise sources and relevant biological acoustic activity within the underwater soundscape. Specific site selection considerations included prioritizing proximity to major shipping routes and port approaches to capture representative data on vessel noise, a dominant source of anthropogenic underwater sound. Simultaneously, sites known for marine mammal presence, migration corridors, or foraging activity were selected to effectively record biological noise and ensure the potential for collecting relevant acoustic signatures from marine fauna. Furthermore, some locations were chosen to establish baseline acoustic conditions, which are essential for long-term environmental monitoring and for future assessment of changes within the soundscape.
2.2. Data Preprocessing and Feature Extraction
The collected raw WAV files were systematically processed to facilitate consistent analysis. Each full audio file was initially segmented into larger blocks of 300 s (‘chunk_duration’). For subsequent feature extraction and machine learning input, these blocks were further divided into smaller analysis windows of 5 s (‘DURATION’). All audio data were either resampled or confirmed to a uniform ‘sample_rate’ of 44,100 Hz, a default rate adopted for demonstration purposes. Spectrograms were computed using an ‘N_FFT’ of 2048 points and a ‘hop_length’ of 512 samples. The ‘N_FFT’ choice (2048) provides a frequency resolution of approximately 21.5 Hz, which is adequate for distinguishing features in the target frequency bands. The ‘hop_length’ (512) strikes a balance between temporal detail and computational efficiency, ensuring sufficient overlap between frames. A ‘hann’ window (‘WINDOW_SIZE’) was applied to minimize spectral leakage and reduce artifacts in the frequency domain. For each audio chunk, several acoustic metrics and features were computed:
- Mel-spectrograms (example in Figure 2): These were produced for the Convolutional Neural Network (CNN) input, generated with N_MELS (128 Mel bands), providing a perceptually relevant time-frequency representation of the acoustic signals. The derivation involves an initial signal transformation (e.g., Short-Time Fourier Transform, STFT) followed by spectral power computation and Mel-scale filtering.
- Broadband Sound Pressure Level (SPL): The broadband SPL in dB re 1 µPa was calculated for each audio chunk. This metric provides a single value representative of the overall sound intensity within the segment.
- Power Spectral Density (PSD)/Spectral Sound Level (SSL): The PSD, reported in dB re 1 µPa²/Hz, was computed using Welch’s method with a Hann window (Figure 3), providing a detailed frequency-domain representation of the noise levels. The features derived from the PSD for the classical machine learning models (Random Forest and SVM) were the mean, variance, and skewness of the SSL, extracted within three predefined, acoustically relevant frequency bands, as detailed below.
- Statistical Moments from PSD: To capture more nuanced characteristics within specific frequency ranges, statistical moments (mean, variance, and skewness) of the SSL were extracted from predefined ecologically relevant frequency bands:
  - ‘Low_Freq_Shipping’ (10–500 Hz): often dominant for large vessels;
  - ‘Mid_Freq_Biological’ (500–5000 Hz): relevant for some biological sounds and machinery;
  - ‘High_Freq_Rain_Clicks’ (5000–22,050 Hz, up to sample_rate/2, the Nyquist frequency): associated with rain and odontocete clicks.
The generated Mel-spectrograms were reshaped to a (samples, height, width, channels) format, suitable for Convolutional Neural Network (CNN) input, by adding a channel dimension using np.newaxis.
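As a concrete illustration of this pipeline, the following minimal sketch (assuming librosa, with log scaling of the Mel power spectrogram as a common, assumed choice; names mirror the constants above) produces one analysis window’s Mel-spectrogram and adds the channel axis. With these settings, a 5 s window at 44,100 Hz yields 431 time frames, consistent with the input shape reported in Section 3:

import librosa
import numpy as np

SAMPLE_RATE, N_FFT, HOP_LENGTH, N_MELS = 44100, 2048, 512, 128

def mel_feature(window: np.ndarray) -> np.ndarray:
    """Log-scaled Mel spectrogram (128 x 431) for one 5 s window."""
    mel = librosa.feature.melspectrogram(
        y=window, sr=SAMPLE_RATE, n_fft=N_FFT,
        hop_length=HOP_LENGTH, n_mels=N_MELS, window="hann")
    return librosa.power_to_db(mel, ref=np.max)

# Stack all windows and add the channel dimension for the CNN:
# (samples, 128, 431) -> (samples, 128, 431, 1)
# X = np.stack([mel_feature(w) for w in windows])[..., np.newaxis]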
2.3. Labeling Strategy
To demonstrate the machine learning pipeline, a preliminary labeling strategy was applied. Audio chunks were assigned provisional categorical labels (0, 1, or 2) based on keywords present in the original .wav file names, such as ‘background’ or ‘ambient’ for background noise (label 0), ‘vessel’ for vessel noise (label 1), and ‘whale’, ‘dolphin’, or ‘porpoise’ for biological noise (label 2). For files without specific keywords, a fallback mechanism assigns labels based on their order within the processed file list (the first half is labeled as background, and the second half is labeled as vessel noise). It is recognized that for chunk-level analysis, a more detailed and data-specific labeling logic (e.g., manual annotation or metadata analysis) is essential. The labels were converted to one-hot encoding for categorical classification.
To enhance the ecological relevance and practical applicability of the classification framework within the Black Sea basin, the predefined noise categories were expanded beyond a binary distinction between vessels and background. Recognizing the unique biodiversity of the Black Sea, which includes various dolphin species (e.g., common dolphin, bottlenose dolphin, harbor porpoise) and diverse sound-producing fish and invertebrates, the classification now explicitly includes ‘Dolphin Activity’ and ‘Fish/Invertebrate Sounds’ alongside ‘Ambient Noise’, ‘Vessel Noise’, ‘Click’, and ‘Whistle’. This multi-class approach, informed by the scientific literature on Black Sea bioacoustics, enables a more detailed characterization of the underwater soundscape. The preliminary labeling strategy, although still keyword-based for demonstration, has been adapted to assign these new biological classes, and the model’s output layer has been extended accordingly to include these additional categories. Future work will prioritize expert-validated annotation of these biological sound events to ensure high label fidelity.
2.4. Data Annotation
Audio segments were assigned categorical labels based on a keyword-matching strategy derived from the original filenames. Three primary noise classes were defined: ‘Background Noise’ (assigned label 0, derived from filenames containing ‘background’, ‘ambient’, or ‘noise’), ‘Vessel Noise’ (assigned label 1, from filenames with ‘vessel’, ‘ship’, or ‘motor’), and ‘Biological Noise’ (assigned label 2, for filenames indicating ‘whale’, ‘dolphin’, or ‘biol’). For files without these explicit keywords, a default labeling scheme was applied where segments from the first half of the processed files were designated as ‘Background Noise’, and segments from the latter half were defined as ‘Vessel Noise’.
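A compact sketch of this keyword-matching scheme (helper names are illustrative, not the authors’ code) follows; the fallback mirrors the first-half/second-half rule described above:

from tensorflow.keras.utils import to_categorical

KEYWORDS = {
    0: ("background", "ambient", "noise"),  # Background Noise
    1: ("vessel", "ship", "motor"),         # Vessel Noise
    2: ("whale", "dolphin", "biol"),        # Biological Noise
}

def label_from_filename(name: str, index: int, n_files: int) -> int:
    lower = name.lower()
    for label, words in KEYWORDS.items():
        if any(w in lower for w in words):
            return label
    # Fallback: first half of the file list -> background, second half -> vessel
    return 0 if index < n_files // 2 else 1

# y_onehot = to_categorical([label_from_filename(f, i, len(files))
#                            for i, f in enumerate(files)])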
2.5. Machine Learning Models and Training
Deep learning approaches were selected due to their demonstrated capacity to identify subtle patterns in complex acoustic environments. For the classification and characterization of underwater noise, a diverse set of machine learning models was trained and evaluated: a deep Convolutional Neural Network (CNN) for spectrogram-based analysis, and traditional machine learning classifiers—Random Forest and Support Vector Machine (SVM)—operating on engineered acoustic features. Gaussian Mixture Models (GMMs) were also conceptually explored for anomaly detection. The dataset was stratified and partitioned into an 80% training set and a 20% testing set (test_size = 0.2, random_state = 42) to maintain the proportional representation of each noise class. Features underwent the necessary preprocessing (Min-Max scaling for the CNN input, standardization for the other models), as sketched below.
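The split and scaling steps can be summarized in the following sketch (placeholder arrays stand in for the real data; variable names are ours):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.random((100, 128, 431, 1), dtype=np.float32)  # placeholder spectrogram stack
F = rng.random((100, 9))   # placeholder PSD moments (3 bands x 3 statistics)
y = rng.integers(0, 2, 100)

# Stratified 80/20 split (test_size = 0.2, random_state = 42)
X_train, X_test, F_train, F_test, y_train, y_test = train_test_split(
    X, F, y, test_size=0.2, random_state=42, stratify=y)

# Min-Max scaling of the spectrograms to [0, 1] for the CNN
X_train = (X_train - X_train.min()) / (X_train.max() - X_train.min())
X_test = (X_test - X_test.min()) / (X_test.max() - X_test.min())

# Standardization of the engineered features for Random Forest and SVM
scaler = StandardScaler().fit(F_train)
F_train_std, F_test_std = scaler.transform(F_train), scaler.transform(F_test)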
These models represent different approaches to machine learning for underwater acoustic analysis, serving both comparative and complementary roles. The Convolutional Neural Network (CNN) leverages deep learning to automatically learn hierarchical features directly from Mel spectrograms, thereby minimizing the need for manual feature engineering. In parallel, traditional machine learning models such as Random Forest and Support Vector Machine (SVM) rely on hand-crafted PSD statistical moments. This comparative approach allows for the assessment of data-driven feature learning versus domain-expert-driven feature engineering. Furthermore, Gaussian Mixture Models (GMMs) are employed as an unsupervised method for anomaly detection, a task complementary to classification, designed to identify novel or unexpected acoustic events without prior labeling.
2.5.1. Convolutional Neural Network (CNN) Model
An enhanced Convolutional Neural Network (CNN) model was defined and trained for the supervised classification of underwater noise sources. The CNN was implemented as a Sequential model designed to process the 2D Mel spectrograms. A widely used deep learning architecture, the CNN was chosen for its exceptional capabilities in learning hierarchical features directly from raw data, such as spectrograms, without requiring extensive manual feature engineering. The architecture consisted of three stacked convolutional blocks (see Appendix A.1 Table A1):
Block 1: A Conv2D layer with 32 filters (3 × 3 kernel, ‘relu’ activation), followed by ‘BatchNormalization’, ‘MaxPooling2D’ (2 × 2 pool size), and ‘Dropout’ (0.25 rate).
Block 2: A Conv2D layer with 64 filters (3 × 3 kernel, ‘relu’ activation), followed by ‘BatchNormalization’, ‘MaxPooling2D’ (2 × 2 pool size), and ‘Dropout’ (0.25 rate).
Block 3: A Conv2D layer with 128 filters (3 × 3 kernel, ‘relu’ activation), followed by ‘BatchNormalization’, ‘MaxPooling2D’ (2 × 2 pool size), and ‘Dropout’ (0.25 rate).
These blocks effectively extract hierarchical features from the spectrograms. The multi-block architecture with increasing filter counts (32, 64, 128) is a standard design pattern in deep learning for spectrogram classification, allowing the network to learn increasingly complex and abstract features.
Common 3 × 3 kernel sizes capture local patterns, and 2 × 2 max-pooling efficiently downsamples feature maps. The output of the final convolutional block was flattened into a 1D vector and connected to a ‘Dense’ layer with 128 neurons (‘relu’ activation), followed by ‘BatchNormalization’ and a ‘Dropout’ layer (0.5 rate) for regularization. The final output layer was a ‘Dense’ layer with neurons equal to the number of unique noise classes (e.g., 3) and a ‘softmax’ activation function for multi-class probability distribution.
The ‘Dropout’ rates (0.25 and 0.5) and ‘BatchNormalization’ layers were integrated to mitigate overfitting, which was identified as a concern during preliminary model development.
The model was compiled using the Adam optimizer with an initial ‘LEARNING_RATE’ of 0.001. This is a standard and stable default. Training was performed for 200 ‘EPOCHS’ with a ‘BATCH_SIZE’ of 32. A high number of epochs was set to ensure convergence. ‘EarlyStopping’ with a ‘patience’ of 10 epochs (monitoring validation loss) was implemented to automatically stop training when performance on unseen data no longer improved, thus preventing overfitting.
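A condensed Keras sketch of this architecture and training configuration (ours, not the authors’ verbatim code) is:

from tensorflow.keras import layers, models, callbacks

def build_cnn(input_shape=(128, 431, 1), n_classes=3):
    model = models.Sequential([layers.Input(shape=input_shape)])
    for filters in (32, 64, 128):              # three convolutional blocks
        model.add(layers.Conv2D(filters, (3, 3), activation="relu"))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D((2, 2)))
        model.add(layers.Dropout(0.25))
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation="relu"))
    model.add(layers.BatchNormalization())
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer="adam",            # Adam's default learning rate is 0.001
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=10)
# model.fit(X_train, y_onehot_train, epochs=200, batch_size=32,
#           validation_split=0.2, callbacks=[early_stop])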
2.5.2. Random Forest Classifier
The Random Forest (RF) Classifier, an ensemble learning method, was chosen for its robustness and ability to provide insights into feature importance. The model was configured with 100 individual decision trees (‘n_estimators = 100’), an established default that balances performance and computational cost. To address potential class imbalances in the dataset, the ‘class_weight’ parameter was set to “balanced”, which automatically adjusts weights inversely proportional to class frequencies. A ‘random_state’ of 42 was set for reproducibility. The default parameters for tree splitting (e.g., ‘criterion = gini’, ‘max_features = sqrt’, ‘min_samples_leaf = 1’, ‘min_samples_split = 2’) were used.
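In scikit-learn terms, this configuration reduces to the following sketch (the feature matrices continue the names from the split sketch in Section 2.5):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                            random_state=42)
rf.fit(F_train_std, y_train)
# rf.feature_importances_ ranks the per-band PSD moments by importance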
2.5.3. Support Vector Machine (SVM)
A Support Vector Machine (SVM) was selected as a discriminative classifier. An RBF (radial basis function) ‘kernel’ was chosen, as it is widely adopted and capable of handling complex, non-linear relationships within the feature space. The ‘probability’ parameter was set to ‘True’ to enable the output of class probabilities, which are necessary for generating ROC and Precision–Recall curves. As with the Random Forest, ‘class_weight’ was set to “balanced” to account for uneven class distributions. A ‘random_state’ of 42 was maintained for reproducibility. Default values for the regularization parameter (‘C = 1.0’) and ‘gamma = scale’ were utilized, providing a standard starting point for initial SVM models.
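The corresponding SVM configuration is equally compact (again a sketch, continuing the same variable names):

from sklearn.svm import SVC

svm = SVC(kernel="rbf", C=1.0, gamma="scale", probability=True,
          class_weight="balanced", random_state=42)
svm.fit(F_train_std, y_train)  # probability=True enables predict_proba for ROC/PR curves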
2.5.4. Gaussian Mixture Models (GMMs)
Gaussian Mixture Models (GMMs) were utilized as a parametric, unsupervised method for anomaly detection. A GMM was specifically fitted to the ‘Background Noise’ class from the training data, allowing the probabilistic modeling of typical ambient acoustic conditions by assuming the data are a mixture of Gaussian distributions. The model was configured with ‘n_components’ (up to 3, depending on data availability) and ‘covariance_type = full’. Test samples were then evaluated based on their log-probability scores under this fitted background GMM, with lower scores indicating potential deviations from the learned ‘normal’ noise profile.
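A sketch of this procedure (continuing the variable names from Section 2.5; the 5th-percentile cut-off is illustrative, not a value from the study) is:

import numpy as np
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=42)
gmm.fit(F_train_std[y_train == 0])       # fit on the 'Background Noise' class only

scores = gmm.score_samples(F_test_std)   # per-sample log-likelihood
threshold = np.percentile(scores, 5)     # illustrative anomaly cut-off
anomalies = scores < threshold           # candidate novel acoustic events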
2.6. Statistical Analysis and Thresholding
In addition to the deep learning approach, statistical analysis was conducted on the extracted PSD moments. The mean, variance, and skewness of SSL within the defined frequency bands were computed for each audio segment. These statistical features can be utilized for various analyses, including
Background Noise Estimation: Analyzing long periods of recordings free from transient events using statistical methods (e.g., 10th percentile of SPL/PSD over time) to describe persistent ambient noise.
Simple Thresholding: A conceptual example of rule-based classification was demonstrated using predefined thresholds on the mean SSL for the low-frequency shipping band (110 dB) and the mid-frequency biological band (90 dB), and on the skewness for the high-frequency rain/clicks band (0.8). While these thresholds are illustrative, they would typically be determined empirically from the specific dataset for real-world applications.
All generated statistics and selected plots (Mel-spectrograms and PSD plots) were saved to designated output directories. Threshold values were empirically determined using initial exploratory data analysis on a subset of the recordings.
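To make the band statistics and the rule-based step concrete, the following sketch (assuming scipy; the thresholds are the illustrative values quoted above) computes the per-band SSL moments and applies the rules:

import numpy as np
from scipy.signal import welch
from scipy.stats import skew

BANDS = {"Low_Freq_Shipping": (10, 500),
         "Mid_Freq_Biological": (500, 5000),
         "High_Freq_Rain_Clicks": (5000, 22050)}

def band_moments(pressure_upa, fs=44100):
    """Mean, variance, and skewness of the SSL in each predefined band."""
    f, psd = welch(pressure_upa, fs=fs, window="hann", nperseg=2048)
    ssl = 10 * np.log10(psd)          # dB re 1 uPa^2/Hz
    out = {}
    for name, (lo, hi) in BANDS.items():
        sel = ssl[(f >= lo) & (f < hi)]
        out[name] = (sel.mean(), sel.var(), skew(sel))
    return out

def rule_based_label(m):
    """Illustrative thresholds from the text; determine empirically in practice."""
    if m["Low_Freq_Shipping"][0] > 110:
        return "vessel"
    if m["Mid_Freq_Biological"][0] > 90:
        return "biological"
    if m["High_Freq_Rain_Clicks"][2] > 0.8:
        return "rain/clicks"
    return "background"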
2.6.1. Convolutional Neural Network (CNN) Parameters
Architecture (Number of Blocks, Filters, Kernel/Pool Sizes): The multi-block architecture with increasing filter counts (32, 64, 128) is a standard design pattern in deep learning for spectrogram classification. This design enables the network to learn increasingly complex and abstract features across multiple hierarchical levels. The use of 3 × 3 kernel sizes is a common choice for capturing local patterns, and 2 × 2 max-pooling is a typical choice for efficient downsampling of feature maps. These selections were guided by empirical exploration during preliminary model development and adherence to established CNN best practices for similar audio classification tasks, aiming to achieve a model capacity suitable for the dataset’s characteristics.
‘LEARNING_RATE’ (0.001) and ‘BATCH_SIZE’ (32): These are widely recognized as common initial hyperparameter choices in deep learning. A learning rate of 0.001 is a frequently used default for the Adam optimizer, providing a stable starting point for model optimization. A batch size of 32 represents a balance between achieving stable gradient updates and computational efficiency.
Regularization (‘Dropout’ and ‘BatchNormalization’): These techniques were intentionally integrated into the architecture (‘Dropout’ rates of 0.25 and 0.5, ‘BatchNormalization’ after each convolutional block and dense layer) to mitigate overfitting. Overfitting was identified as a concern during preliminary model development and was visually evident in the CNN’s training history (Section 2.5).
‘EPOCHS’ (200) and ‘EARLY_STOP_PATIENCE’ (10): A sufficiently large number of epochs (200) was set to ensure the model had ample opportunity to converge. Crucially, ‘EarlyStopping’ with a patience of 10 epochs (monitoring validation loss) was implemented to automatically halt training when performance on unseen data no longer improved, thus effectively preventing overfitting during the training process.
2.6.2. Random Forest Classifier Configuration
Random Forest (‘n_estimators = 100’): The choice of 100 estimators is a well-established default value for Random Forests, providing a robust balance between model performance and computational cost. This number of trees is generally sufficient to achieve stable predictions without excessive computational burden.
‘class_weight = balanced’ (for RF and SVM): This parameter was specifically selected and is essential for addressing potential class imbalance within the dataset. By automatically adjusting weights inversely proportional to class frequencies, it ensures that minority classes are not overwhelmed by the more numerous majority classes during model training, thereby improving their classification performance.
2.6.3. Support Vector Machine (SVM) Configuration
SVM (‘kernel = ‘rbf’’, ‘C = 1.0’, ‘gamma = ‘scale’’): The RBF kernel is a common choice for SVMs due to its demonstrated ability to handle non-linear relationships within complex datasets effectively. The default values for ‘C’ (regularization parameter, 1.0) and ‘gamma’ (kernel coefficient, ‘scale’) are standard starting points for initial SVM models, offering a good balance between model complexity and generalization.
2.7. Model Evaluation
The performance of all machine learning models was evaluated on the held-out test set. The dataset was partitioned into an 80% training set and a 20% testing set (‘TEST_SIZE = 0.2’, ‘RANDOM_STATE = 42’) using stratified splitting to maintain class proportions. Key classification metrics included: Accuracy (overall correctness), Precision (the proportion of true positives among positive predictions), Recall (the proportion of true positives among actual positives), and F1-score (the harmonic mean of precision and recall). Confusion Matrices (raw and normalized) were generated to visualize classification performance per class. For models producing probability outputs (CNN, Random Forest, SVM), Receiver Operating Characteristic (ROC) curves with Area Under the Curve (AUC) and Precision–Recall (PR) curves with Average Precision (AP) were computed to assess discriminative power and handle potential class imbalance. Learning curves were also generated using stratified K-Fold cross-validation (‘cv_strategy’) to analyze model performance as a function of training data size, providing insights into bias–variance trade-offs.
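A minimal evaluation sketch (scikit-learn, shown for the binary case with the Random Forest from Section 2.5.2) is:

from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, average_precision_score)

y_pred = rf.predict(F_test_std)
y_prob = rf.predict_proba(F_test_std)

print(classification_report(y_test, y_pred))        # accuracy, precision, recall, F1
print(confusion_matrix(y_test, y_pred))             # raw per-class error counts
auc = roc_auc_score(y_test, y_prob[:, 1])           # ROC AUC (binary case)
ap = average_precision_score(y_test, y_prob[:, 1])  # PR-curve summary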
3. Results
Figure 2 (example of the Mel spectrogram) shows a complex acoustic environment with distinct features across various frequency bands:
Low Frequencies (0–512 Hz): A continuous band of relatively high energy is present throughout the entire 4.8 s duration, particularly prominent below 256 Hz. This consistent horizontal band suggests a persistent broadband noise source, which could be indicative of ambient background noise or distant, continuous anthropogenic activity.
Mid Frequencies (512–4096 Hz): Energy levels in this range appear slightly modulated over time, showing variations in intensity. Several horizontal lines are visible, especially around 1024 Hz and 2048 Hz, which could represent harmonic content from machinery or other continuous, narrowband sources.
High Frequencies (4096–16,384 Hz): While generally quieter than the lower frequencies, this region contains sporadic, transient events. Notably, there are vertical lines or impulses, particularly around the 2.8 to 3.0 s mark, extending upwards from mid to high frequencies. These abrupt, short-duration events often correspond to impulsive sounds such as clicks, natural transient events (e.g., rain, ice cracking), or intermittent anthropogenic noises. There is also some diffuse energy across this band, but it is less concentrated than in the lower frequencies.
The PSD comparison (Figure 3) efficiently differentiates the spectral signatures of background and vessel noise. The vessel noise typically shows a higher level in lower frequencies, which is significant for automated classification and environmental monitoring.
Figure 3 gives detailed information about noise levels across different frequencies; the PSD is a primary metric for ambient noise characterization. Key observations include:
Vessel Noise Dominance in Low Frequencies: The vessel noise (dashed orange line) exhibits higher spectral levels in the lower frequency range, particularly from approximately 20 Hz up to around 100 Hz. This is a typical characteristic of anthropogenic noise sources such as large vessels, where propeller cavitation and machinery vibrations generate significant low-frequency energy.
Background Noise Peaks: The example background noise (solid blue line) shows a prominent peak in spectral level at approximately 50 Hz, reaching a level of over 175 dB re 1 µPa²/Hz. While the vessel noise is also significant in this region, the background noise displays a distinct peak at a higher level in this specific range.
Frequency Dependence: Both noise types generally show decreasing spectral levels as frequency increases beyond their respective low-frequency peaks.
High-Frequency Characteristics: In higher frequency ranges (above approximately 1000 Hz), the vessel noise (dashed orange line) and background noise (solid blue line) show more variable, converging patterns. The vessel noise, however, maintains slightly higher or comparable levels relative to the background noise across a broader high-frequency band. The plot clearly shows the spectral characteristics of the different noise types.
A statistical summary of the Broadband Sound Pressure Level (SPL) calculated for all audio files (341 .wav files), with a total count of 1556 samples, is presented in Table 1.
Figure 4 presents an overview of the Power Spectral Density (PSD) across various frequencies for all processed audio chunks.
The plot includes several statistical percentile curves derived from the stacked SSL arrays of all audio samples, along with a representation of the individual PSDs:
Individual PSDs: Numerous light blue lines are visible across the plot background, representing the individual PSD estimates for each audio chunk. Their density and spread provide a visual indication of the variability in noise levels at different frequencies within the dataset.
The 99th percentile (thick black line) represents the upper bound of the noise levels, indicating that 99% of the observed PSD values fall below this line at any given frequency. This curve exhibits high noise levels, particularly in the lower frequency range (below 100 Hz), peaking around 50 Hz before gradually decreasing.
The 95th percentile (thinner black line) provides a slightly lower but still high noise level threshold. It follows a similar trend to the 99th percentile but is consistently below it.
The 50th percentile (median, gray line) indicates the median noise level across all samples for each frequency bin. This line generally follows the central tendency of the individual PSDs.
The 5th percentile (thinner gray line) and 1st percentile (thinnest gray line) represent the lower bounds of the noise distribution, characteristic of quiet periods or the true ambient background noise floor. These lines show significantly lower noise levels compared to the upper percentiles, especially at higher frequencies.
RMS Level (Mean SSL): The magenta line represents the Root Mean Square (RMS) Level, which corresponds to the mean Spectral Sound Level (SSL) across all frequency bins. This curve generally tracks the overall average noise profile, typically lying between the 50th and 95th percentiles.
Figure 4 highlights that noise levels are generally highest in the lower frequency bands (below 500 Hz), a common characteristic of underwater acoustic environments that are often dominated by anthropogenic sources, such as vessels. The significant spread between the 1% and 99% percentile curves, particularly in the mid-frequency range, indicates high variability in noise levels across the recorded chunks, suggesting the presence of transient events or varying acoustic conditions. Visual inspection of spectrograms suggests the presence of biological signals in mid-to-high frequencies, though these were not formally labeled for this iteration. This statistical summary of PSD is fundamental for establishing baseline noise levels and identifying anomalies, aligning with recommendations from expert groups like the EU Technical Group on Underwater Noise (TGNoise).
The observed spread in percentiles supports the necessity for robust anomaly detection frameworks, as intermittent signals could be ecologically relevant.
After the feature extraction process, specifically the generation, scaling, and reshaping of Mel-spectrograms for Convolutional Neural Network (CNN) input, the data have the following shape: (1556, 128, 431, 1). This indicates that the dataset comprises 1556 individual samples. Each sample is a Mel-spectrogram with dimensions of 128 (height, the Mel bands) by 431 (width, corresponding to time frames) and a single channel. The corresponding labels for these 1556 samples are one-hot encoded, resulting in a shape of (1556, 2). This confirms that there are 1556 labels, distributed across two unique classes for classification.
A stratified train–test split was performed to partition the dataset. This approach ensures that the proportion of samples for each class is maintained in both the training and testing sets, which is crucial for preventing bias and ensuring representative subsets, especially in imbalanced datasets.
The training data (Xtrain) has a shape of (1244, 128, 431, 1), consisting of 1244 samples used to train the CNN model.
The testing data (Xtest) has a shape of (312, 128, 431, 1), comprising 312 samples reserved for evaluating the trained model’s performance on unobserved data.
The distribution of samples per class after the stratified split is as follows:
For the training set, there are 643 samples in one class and 601 samples in the other, totaling 1244 training samples.
For the testing set, there are 161 samples in one class and 151 samples in the other, totaling 312 testing samples.
This balanced distribution across the training and testing sets is vital for ensuring that the model is trained and evaluated on a representative sample of each class, thereby providing a more reliable assessment of its generalization capabilities.
The CNN model begins with an input expected to have dimensions (None, 128, 431, 1), representing a batch of 128 × 431 Mel-spectrograms with a single channel (Table 2). The architecture is composed of sequential layers:
Convolutional Blocks: The model employs three convolutional blocks, each consisting of a Conv2D layer, followed by batch normalization, 2D max pooling, and Dropout. The first Conv2D layer has 32 filters, an output shape of (None, 126, 429, 32), and 320 parameters. It is paired with a batch normalization layer (128 parameters) and a MaxPooling2D layer, which reduces dimensions to (None, 63, 214, 32). A Dropout layer (0 parameters) is then applied. The second Conv2D layer increases the filter count to 64, resulting in an output shape of (None, 61, 212, 64) and 18,496 parameters. This is followed by batch normalization (256 parameters), MaxPooling2D to (None, 30, 106, 64), and another Dropout layer. The third Conv2D layer has 128 filters, producing an output shape of (None, 28, 104, 128) and 73,856 parameters. It also includes batch normalization (512 parameters), a MaxPooling2D layer (2 × 2 pool size, stride 2) reducing dimensions to (None, 14, 52, 128), and a Dropout layer.
Flatten Layer: After the convolutional blocks, a Flatten layer converts the 2D feature maps into a 1D vector of 93,184 elements, with no associated parameters. This prepares the data for the subsequent dense layers.
Dense Layers: The flattened features are then fed into two Dense (fully connected) layers. The first dense layer has 128 units and 11,927,680 parameters, followed by a Batch Normalization layer (512 parameters) and a Dropout layer (0 parameters). The final Dense layer has two units, corresponding to the number of unique output classes (e.g., “Background Noise” and “Vessel Noise”), and 258 parameters.
In total, the model has 12,022,018 parameters, with 12,021,314 being trainable and 704 non-trainable. The majority of the trainable parameters reside in the first dense layer, indicating its significant contribution to the model’s capacity. The use of batch normalization layers aims to improve stability and performance during training, while Dropout layers are implemented for regularization to prevent overfitting.
The “Model Accuracy” plot, Figure 5 left, shows the training accuracy (blue line) and validation accuracy (orange line) over approximately 35 epochs. The training accuracy displays a consistent upward trend, starting at about 0.50 and reaching over 0.95 by the final epochs, indicating that the model is effectively learning to classify the training data. In contrast, the validation accuracy, after an initial period of stability around 0.50–0.55, exhibits considerable fluctuation and generally remains lower than the training accuracy, peaking at approximately 0.60 around epoch 24, and then fluctuating between 0.50 and 0.60. The divergence between training and validation accuracy, particularly the higher training accuracy coupled with lower and more volatile validation accuracy, suggests the presence of overfitting. The model is learning the training data well, potentially at the expense of its ability to generalize to previously unseen data.
The “Model Loss” plot, Figure 5 right, presents the training loss (blue line) and validation loss (orange line) over the same training period. The training loss gradually decreases throughout the epochs, indicating that the model is minimizing errors on the training dataset. In contrast, the validation loss shows a highly unpredictable pattern. It starts at a high value (over 80), drops significantly within the first few epochs, then experiences several spikes (e.g., around epoch 7 and epoch 18), and generally remains higher and more unstable than the training loss. The substantial discrepancy and irregular behavior of the validation loss further support the observation of overfitting, where the model’s performance on new data is not consistently improving despite continued optimization on the training set.
The initial analysis of the Convolutional Neural Network (CNN) model, as presented in Figure 5, indicates significant overfitting, characterized by a clear divergence between training and validation accuracy, as well as volatile validation loss. To address this and enhance the model’s robustness and generalization capabilities, several advanced regularization techniques were systematically integrated. These include the strategic application of L2 regularization (with a rate of 0.001) to the dense layers, which penalizes large weights and encourages simpler models, and the incorporation of Batch Normalization layers after each convolutional and dense layer. Batch Normalization stabilizes and accelerates the training process by normalizing the inputs to each layer, thereby reducing internal covariate shift. Furthermore, dropout rates were carefully tuned, particularly in the fully connected layers, to prevent co-adaptation of neurons. The increased number of training epochs (up to 200) combined with an Early Stopping callback (patience of 10 epochs) ensured that training ceased when validation performance no longer improved, thereby preventing further overfitting. These combined strategies effectively mitigated the overfitting observed in preliminary models, resulting in more stable validation performance and improved generalization to unseen data (see Appendix A.1 Table A1).
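For concreteness, the regularized dense head described here can be written as the following Keras sketch (ours; the rates follow the text):

from tensorflow.keras import layers, regularizers

dense_head = [
    layers.Flatten(),
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l2(0.001)),  # penalizes large weights
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(3, activation="softmax"),
]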
The initial training dynamics of the Convolutional Neural Network (CNN) model, as illustrated in the Model Accuracy and Loss plots (Figure 5), exposed a distinct tendency towards overfitting. This was evidenced by a significant and increasing divergence between training accuracy (which consistently rose to over 0.95) and validation accuracy (which fluctuated unstably around 0.50–0.60), coupled with an inconsistent and elevated validation loss. Such behavior indicated that the model was memorizing the training data rather than extracting generalized features, severely compromising its ability to perform robustly on unseen acoustic data.
The combined efficacy of L2 regularization, Batch Normalization, optimized Dropout, and early stopping contributed to a significantly more stable training process and improved generalization, directly addressing the strong overfitting initially observed. Future iterations may consider data augmentation techniques, such as time-frequency masking, to further enhance the model.
The spectrograms from Figure 6 display various acoustic characteristics and the model’s performance:
Sample 208 (true: background noise, predicted: background noise, probs: 0.58, 0.42): This spectrogram shows a relatively diffuse energy distribution across frequencies, consistent with background noise. The model correctly classified it as background noise with moderate confidence.
Sample 266 (true: vessel noise, predicted: background noise, probs: 0.70, 0.30): This example, despite being true vessel noise, was misclassified as background noise. The spectrogram exhibits prominent horizontal lines, particularly in the lower frequencies, which are characteristic of continuous noise from vessels. The model’s lower confidence in the true label highlights a potential area for improvement.
Sample 143 (true: background noise, predicted: vessel noise, probs: 0.70, 0.30): This sample, actually background noise, was misclassified as vessel noise. The visual characteristics are somewhat ambiguous, exhibiting some horizontal banding that may have contributed to the misclassification; however, they lack the distinct patterns typically seen in clear vessel signatures.
Sample 149 (true: vessel noise, predicted: background noise, probs: 0.70, 0.30): Similarly to Sample 266, this true vessel noise sample was incorrectly labeled as background noise. It clearly shows strong, sustained horizontal energy bands across various frequencies, indicative of the presence of a vessel. This misclassification suggests the model struggled with certain types or intensities of vessel signatures.
Sample 231 (true: vessel noise, predicted: vessel noise, probs: 0.59, 0.41): This spectrogram was correctly identified as vessel noise and displays broad energy concentrated in lower frequencies, tapering off at higher frequencies, which is a common spectral characteristic of vessel sounds. The prediction confidence is moderate.
Sample 33 (true: vessel noise, predicted: vessel noise, probs: 0.59, 0.41): Another sample correctly classified as vessel noise. The spectrogram again shows dominant energy in the lower frequency range, consistent with typical vessel acoustic profiles.
Sample 227 (true: vessel noise, predicted: vessel noise, probs: 0.59, 0.41): This sample was also correctly identified as vessel noise, exhibiting similar spectral patterns of low-frequency dominance and continuous energy as the other correctly classified vessel noise examples.
Sample 289 (true: background noise, predicted: background noise, probs: 0.57, 0.43): This spectrogram shows a more uniform energy distribution across time and frequency compared to vessel noise, characteristic of ambient background noise. The model correctly classified it, though with relatively close probabilities, indicating some uncertainty.
Beyond traditional accuracy, a comprehensive suite of performance metrics—including precision, recall, and F1-score—was employed to provide a more nuanced understanding of the classification models’ capabilities, particularly given the multi-class nature of underwater acoustic events and potential class imbalances. Precision quantifies the proportion of correctly identified positive predictions, while recall measures the proportion of actual positive instances correctly identified. The F1-score, as the harmonic mean of precision and recall, offers a balanced measure, especially valuable in scenarios where false positives and false negatives carry different costs. For multi-class evaluation, weighted averages of these metrics were utilized to account for varying class support. Detailed classification reports, confusion matrices (see Appendix A.2 Figure A1 and Figure A2; Appendix A.3 Figure A7 and Figure A8), and Receiver Operating Characteristic (ROC) curves (see Appendix A.2 Figure A3; Appendix A.3 Figure A9) with Area Under the Curve (AUC) values were generated for each model. These visualizations provide granular insights into per-class performance, highlighting specific misclassification patterns and the models’ ability to discriminate between classes across various probability thresholds. This rigorous evaluation framework enables a more comprehensive assessment of model effectiveness in real-world underwater acoustic environments.
A direct comparison of the overall accuracy of the implemented machine learning models on the test set is presented in Table 3.
Figure 7 compares the test accuracy of the three primary classification models: the Convolutional Neural Network (CNN), the Random Forest Classifier, and the Support Vector Machine (SVM). The figure displays a grouped bar plot, where the y-axis represents classification accuracy, ranging from 0 to 1, and the x-axis indicates the specific model. The Random Forest and SVM models exhibit similar performance levels, with accuracies of around 0.83, whereas the CNN achieves a higher classification accuracy of approximately 0.93. This improvement is visually highlighted by the taller green bar associated with the CNN in the figure. These findings indicate that the deep learning architecture outperforms the classical machine learning models in terms of classification accuracy, suggesting its superior ability to learn complex acoustic features present in the underwater environment of the Western Black Sea. The figure supports the argument that convolutional neural networks are better suited for capturing the time-frequency patterns represented in Mel-spectrograms, while traditional models such as Random Forest and SVM perform satisfactorily on statistical and frequency-domain features but lack sufficient representational capacity for more subtle patterns. The visual comparison emphasizes the robustness and potential of deep learning strategies for future marine soundscape monitoring efforts.
Beyond predictive performance, assessing model complexity provides insights into computational resource demands and interpretability. This analysis compared the models based on parameter counts and size, computational effort (including training and inference time), and inherent interpretability. The Convolutional Neural Network (CNN) demonstrated the highest structural complexity, with approximately 12,022,018 parameters (of which 12,021,314 are trainable). Its training involved approximately 30 s per epoch (on the given hardware and environment) across hundreds of epochs; inference on individual samples is typically very fast (milliseconds). The Random Forest and Support Vector Machine (SVM) models, while having fewer explicit ‘parameters’ in the neural network sense, still present computational considerations. The Random Forest model, with 100 decision trees, is moderately complex; its training is generally faster than the CNN’s complete training cycle, taking seconds to a few minutes depending on data size, and inference is very rapid. SVMs, particularly with an RBF kernel, can be computationally intensive during training for large datasets but offer fast inference; training times in this study ranged from seconds to a few minutes. In terms of interpretability, the Random Forest model provides clear insights through its feature importance mechanism (see Supplementary File S1), enabling a direct understanding of which acoustic features drive classification decisions. SVMs are less inherently interpretable, and CNNs are often considered “black-box” models; although post hoc interpretability methods exist, they were not applied in this study.
4. Discussion
The characterization and automated classification of underwater acoustic environments, particularly in the Western Black Sea, can significantly benefit from advancements in machine learning techniques. The dynamics of underwater acoustic environments are influenced by a multitude of factors, including variations in water conditions, which can complicate acoustic signal transmission and reception. This discussion synthesizes the recent literature on machine learning approaches in underwater acoustics, explores the challenges faced in real-world applications, and outlines avenues for future research.
A challenge in underwater acoustics lies in enhancing signal quality to improve classification accuracy, particularly under low signal-to-noise conditions. Qu et al. (2025) proposed various algorithms for signal enhancement aimed at accurate classification, including preprocessing steps that improve signal fidelity and thereby the efficiency of the feature extraction phase [19].
Regarding automated classification, Santos and Calazan have demonstrated significant improvements in identifying marine vessels through the effective extraction of spectral dynamic features from audio data, thereby reinforcing the effectiveness of traditional methods, such as Mel-frequency cepstral coefficients (MFCCs) and other recognized techniques [4].
Moreover, machine learning algorithms, particularly deep learning techniques like Long Short-Term Memory (LSTM) networks, have been shown to outperform traditional models in learning representations of time-dependent data, making them suitable for underwater acoustic classification tasks [20].
Furthermore, the literature underlines the integration of multiple machine learning approaches to boost classification accuracy. For instance, ensemble methods that leverage deep learning—such as combining different neural network architectures to enhance modulation signal detection—highlight the interplay between various methodologies in underwater acoustic challenges [6].
Machine learning also presents opportunities for modeling that address the hydro-acoustic propagation topics specific to environments like the Black Sea. Recent studies have proposed deep learning-based prediction methods for transmission loss, which promote modeling that integrates environmental variables for improved effectiveness in communication between underwater vehicles [21].
The paper outlines a data-driven framework utilizing passive acoustic monitoring at three sites in the Western Black Sea (Figure 1), which encompasses a depth range of 60 to 80 m. The segmented acoustic data enable high temporal resolution analysis through 5 min intervals, aligning with the environmental management goals of the MSFD, which aim to track underwater noise levels and their impacts [18,22].
Our study presents a framework for characterizing and automatically classifying underwater acoustic environments in the Western Black Sea, utilizing signal processing and advanced machine learning techniques. This methodology aligns with principles encouraged by expert groups, such as TG NOISE, where metrics like Broadband SPL and Power Spectral Density (PSD) are considered essential for characterizing underwater soundscapes and anthropogenic sources. Our methodology involves segmenting passive acoustic monitoring data into 5 min intervals and computing key metrics such as Broadband Sound Pressure Level (SPL) and Power Spectral Density (PSD). This approach offers a method for evaluating ambient noise and indicative anthropogenic sound sources in the study area.
Feature extraction in this study employs a multi-level approach (Figure 2 and Figure 3, Table 1), integrating Mel-spectrograms for deep learning with statistical moments, such as the mean, variance, and skewness of the power spectral density (PSD). This layered methodology provides a comprehensive representation of the acoustic environment, enabling both classical supervised and unsupervised machine learning techniques to identify patterns and classify noise sources [23]. It also supports the framework’s capacity to evolve with the changing acoustic environment, responding effectively to the unique challenges posed by the rapid growth of maritime activities in the Black Sea.
The Mel-spectrograms, after normalization, served as the input for our enhanced Convolutional Neural Network (CNN) model, demonstrating their utility in representing acoustic features for CNNs.
The CNN model demonstrated learning capabilities on the training data, with training accuracy reaching over 0.95 and training loss steadily decreasing over 35 epochs. However, the consistent divergence between training and validation accuracy, coupled with the inconsistent behavior of the validation loss, clearly indicates the presence of overfitting. This suggests that while the model effectively learns the training data, its ability to generalize to unseen data is compromised. This challenge is common in deep learning applications, particularly with complex acoustic datasets, and underscores the need for additional strategies to enhance model robustness.
The statistical analysis of Broadband SPL and PSD across all collected data provided valuable insights into the overall noise environment. The mean broadband SPL of 175.3 dB re 1 µPa and a maximum of 210.31 dB re 1 µPa indicate periods of significant acoustic energy. The overall PSD statistics further characterized the noise, showing that levels are generally highest in the lower frequency bands (below 500 Hz), a common characteristic attributed to anthropogenic sources such as shipping. The significant spread between the 1st and 99th percentile curves across frequencies highlights the high variability in noise levels, indicating the presence of transient events in addition to continuous background noise.
The qualitative assessment of individual spectrograms from the test set further elucidated the model’s performance. While the CNN successfully classified clear instances of both background and vessel noise, misclassifications occurred when vessel noise was predicted as background noise, even in spectrograms showing distinct low-frequency vessel signatures (Figure 6). This suggests that the model may struggle with certain types or intensities of vessel sounds, or that the preliminary labeling strategy might not capture the full complexity of chunk-level sound events. For future iterations, a more fine-grained and potentially manual annotation strategy for chunk-level data would significantly improve the accuracy and robustness of the training labels.
To validate the performance of the enhanced Convolutional Neural Network (CNN) and establish the efficacy of a deep learning approach for underwater noise classification in the Black Sea, a comprehensive comparative analysis was conducted against a suite of established machine learning models. This ensures a robust baseline for evaluating the performance gains achieved by the CNN. The selected baseline models include
Random Forest (RF): An ensemble learning method known for its robustness, ability to handle high-dimensional data, and feature importance insights.
Support Vector Machine (SVM): A robust discriminative classifier that constructs a hyperplane or set of hyperplanes in a high-dimensional space for classification, particularly effective in high-dimensional feature spaces.
K-Nearest Neighbors (KNN): A simple, non-parametric instance-based learning algorithm that classifies data points based on the majority class of their nearest neighbors.
XGBoost (Extreme Gradient Boosting): An optimized distributed gradient boosting library, known for its efficiency, flexibility, and superior predictive performance in structured data tasks.
All these models were trained and evaluated on the same comprehensive feature set, which now includes a combination of Mel-Frequency Cepstral Coefficients (MFCCs) and statistical moments (mean, variance, skewness) extracted from Power Spectral Density (PSD) within ecologically relevant frequency bands. The consistent feature input ensures a fair comparison across the different model paradigms.
Beyond the primary metric of accuracy, a comprehensive suite of performance metrics was implemented to assess the efficacy of the classification models across all defined underwater noise categories (Table 4). These metrics, including precision, recall, and F1-score, are crucial in multi-class classification scenarios, particularly when class imbalance may exist within the dataset. Precision, defined as the ratio of true positive predictions to the total positive predictions for a given class, quantifies the model’s exactness and its ability to avoid false positives. Recall measures the proportion of actual positive instances that were correctly identified, indicating the model’s completeness and its capacity to minimize false negatives. The F1-score, being the harmonic mean of precision and recall, offers a balanced measure of a model’s performance, especially valuable when a trade-off between false positives and false negatives is necessary. For the multi-class problem, weighted averages of these metrics were computed, providing an aggregate performance score that accounts for the varying number of samples in each class.
Detailed classification reports are presented for each model (Table 5, Table 6 and Table 7, Supplementary File S1), detailing per-class precision, recall, and F1-score, as well as overall weighted averages. Complementing these numerical metrics, confusion matrices (see Appendix A.2 and Appendix A.3: Figure A1, Figure A2 and Figure A8) were generated. These matrices visually depict the counts of true positives, true negatives, false positives, and false negatives for each class, offering insight into specific misclassification patterns. For instance, the analysis of misclassifications highlighted in Figure 6 (e.g., Samples 266 and 149 misclassified as background noise) underscores the importance of such detailed error analysis. This view of misclassifications identifies challenging instances and directs future efforts towards refining feature engineering or augmenting specific data types.
Furthermore, Receiver Operating Characteristic (ROC) curves, along with their corresponding Area Under the Curve (AUC) values, were generated for each class in a one-vs-rest fashion (see Appendix A.2 Figure A3 and Appendix A.3 Figure A9). The ROC curve illustrates the diagnostic ability of a binary classifier system as its decision threshold is varied, while the AUC provides a combined measure of performance across all possible classification thresholds. For multi-class scenarios, plotting per-class ROC curves offers insights into the model’s discriminative ability for individual noise categories, particularly in distinguishing them from all other combined classes.
The integrated framework, which combines deep learning with statistical analysis and rule-based thresholding, provides a comprehensive approach to underwater noise analysis. These methodologies contribute to the growing field of underwater acoustic research utilizing machine learning, which has shown potential in real-time environmental monitoring and enhancing underwater environmental assessments. Our findings are consistent with the existing literature, which emphasizes the importance of detailed feature extraction and highlights the challenges posed by low signal-to-noise ratios and varying environmental conditions.
Moreover, the model incorporates rule-based statistical models for event detection and background noise estimation, which are essential for aligning with international guidelines concerning marine noise impact assessments. This synthesis of automated classification techniques and statistical modeling can significantly enhance the efficacy of noise management policies aimed at protecting marine environments [24].
Despite advances in acoustic monitoring and classification methodologies, there are limitations and potential sources of error that require careful attention. Variability in environmental conditions, such as wind and thermal stratification, can significantly influence the propagation of underwater sound and the accuracy of measurements across diverse spatial and temporal scales. Research has shown that environmental noise can vary significantly depending on vessel density, weather, and marine infrastructure, highlighting the need for context-specific assessments [23,25].
The increasing occurrence of marine anthropogenic stressors, particularly in ecologically sensitive areas such as the Western Black Sea, necessitates a practical approach to monitoring and managing underwater noise.