Article

Sentinel-2 Satellite-Derived Bathymetry with Data-Efficient Domain Adaptation

by Christos G. E. Anagnostopoulos 1, Vassilios Papaioannou 1,*, Konstantinos Vlachos 2, Anastasia Moumtzidou 1, Ilias Gialampoukidis 1, Stefanos Vrochidis 1 and Ioannis Kompatsiaris 1

1 Information Technologies Institute, Centre for Research and Technology Hellas, 6th km Charilaou-Thermi, 57001 Thessaloniki, Greece
2 CDXi Solutions P.C., Filikis Etaireias 12, 54621 Thessaloniki, Greece
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(7), 1374; https://doi.org/10.3390/jmse13071374
Submission received: 13 June 2025 / Revised: 15 July 2025 / Accepted: 17 July 2025 / Published: 18 July 2025
(This article belongs to the Section Physical Oceanography)

Abstract

Satellite-derived bathymetry (SDB) enables the efficient mapping of shallow waters such as coastal zones but typically requires extensive local ground truth data to achieve high accuracy. This study evaluates the effectiveness of transfer learning in reducing this requirement while keeping estimation accuracy at acceptable levels by adapting a deep learning model pretrained on data from Puck Lagoon (Poland) to a new coastal site in Agia Napa (Cyprus). Leveraging the open MagicBathyNet benchmark dataset and a lightweight U-Net architecture, three scenarios were studied and compared: direct inference to Cyprus, site-specific training in Cyprus, and fine-tuning from Poland to Cyprus with incrementally larger subsets of training data. Results demonstrate that fine-tuning with 15 samples reduces RMSE by over 50% relative to the direct inference baseline. In addition, the domain adaptation approach using 15 samples shows comparable performance to the site-specific model trained on all available data in Cyprus. Depth-stratified error analysis and paired statistical tests confirm that around 15 samples represent a practical lower bound for stable SDB, according to the MagicBathyNet benchmark. The findings of this work provide quantitative evidence on the effectiveness of deploying data-efficient SDB pipelines in settings of limited in situ surveys, as well as a practical lower bound for clear and shallow coastal waters.

1. Introduction

Remote sensing has provided a non-invasive way to map shallow-water depths for almost 50 years. The earliest method by Lyzenga [1] used logarithms of radiance from two visible bands to estimate depth from satellite images. Although simple and effective in very clear water, it assumed uniform water properties and could only see down to about the Secchi depth. In the 1990s, more complex models combined light absorption and scattering to estimate both depth and water quality at a higher accuracy [2,3], but they needed expensive in situ data and heavy computations, limiting their use.
A breakthrough came with simpler algorithms using ratios of blue and green bands that worked over different seabed types [4]. High-resolution satellites like IKONOS helped make these methods useful for coastal management. Then, machine learning methods like support vector machines and random forests appeared, which could learn non-linear relationships between satellite data and depth without fixed formulas, improving performance and handling moderate turbidity [5,6,7]. However, they still needed significant amounts of ground truth depth data, which can be hard to obtain.
Another approach uses visible surface waves in satellite images to estimate depth through physical wave models [8]. This works for deeper water but offers lower spatial detail. Cloud computing and large satellite archives have made it possible to automate global bathymetry mapping and to combine space-borne LiDAR with passive imagery, further improving accuracy without ship surveys [9,10]. Today, methods combine color, wave, and LiDAR data with machine learning and physics to create detailed shallow-water maps at large scales.
Deep learning, especially convolutional neural networks (CNNs), now helps extract spatial and spectral features from images, reducing noise and improving accuracy in clear water [11]. However, CNNs need large labeled datasets and can struggle in turbid waters or with sensor changes. New models like the transformer-based TransBathy [12] use self-attention to learn long-range dependencies and generalize well across different regions. Transfer learning helps adapt these models to new sites with limited data by reusing learned features, reducing the need for costly local measurements [13,14,15].
Beyond bathymetry, remote sensing is advancing the monitoring of aquatic hazards like harmful algal blooms by integrating satellite, aerial, and ground data to improve early warnings and our understanding of bloom dynamics [16]. It also supports the large-scale mapping of biogeochemical variables such as dissolved CO2 in lakes, improving our knowledge of carbon cycling and emissions [17].
A critical step for many applications is optically distinguishing shallow waters from deep waters, as reflectance-based methods rely on bottom visibility. Richardson et al. developed a deep learning model trained on hundreds of satellite scenes that accurately automates this classification, improving the robustness of aquatic remote sensing [18].
Recent models like BathyFormer, based on vision transformers, show strong potential for accurate, scalable bathymetry mapping in nearshore environments [19]. Combining active sensors like ICESat-2 LiDAR with multispectral imagery and machine learning removes the need for traditional field surveys and achieves high accuracy [20]. Large-area mapping efforts using random forests calibrated by airborne LiDAR demonstrate how these techniques can be adapted to real-world coastal management challenges [21].
Although satellite-derived bathymetry (SDB) has improved substantially, a significant gap remains: the scarcity of public benchmark datasets for training and testing deep learning models. One such benchmark is the MagicBathyNet dataset [13]. It includes well-distributed image patches and bathymetry data from two coastal areas—the Mediterranean near Agia Napa (Cyprus) and the Baltic Sea’s Puck Lagoon (Poland). MagicBathyNet provides co-registered Sentinel-2, SPOT-6, and high-resolution aerial images, along with digital surface models and seabed habitat labels. With over 3300 multispectral image patches and nearly 500 seabed classification patches, this dataset is a strong candidate for advancing deep learning methods in shallow coastal waters.
In more detail, we focus on domain adaptation—how well a pretrained model from Puck Lagoon (source domain) can adapt to Agia Napa (target domain) with only a small amount of local ground truth data. Since the model and training process closely follow those described in MagicBathyNet [13], we refer readers to that work for full details. The present work addresses the following question: How many target-domain training samples are required before transfer-learning accuracy plateaus? To isolate this factor, the model architecture and the optimization protocol are kept fixed; no comparison is attempted with traditional or alternative machine learning algorithms. Three learning strategies are compared:
  • Direct inference of a model trained on the Puck Lagoon to Agia Napa;
  • A site-specific model training and evaluation from scratch on Agia Napa;
  • A transfer learning approach where a model is pretrained on the Puck Lagoon and is retrained and fine-tuned on a variable number of Agia Napa training samples.
The study evaluates pixel-wise and sample-wise performance, quantifies statistical confidence through permutation tests, and assesses spatial consistency via residual structure metrics. The contribution of this work is twofold: (i) to assess the performance that transfer learning affords to multispectral-based SDB and (ii) to determine the minimum amount of local data required to achieve results comparable to a site-specific model, based on the MagicBathyNet benchmark. In doing so, this work provides evidence of the effectiveness of lightweight deep learning-based SDB in data-limited coastal and inland water settings. It is noted that any quantitative threshold reported should be interpreted as specific to optically clear, shallow-water sites rather than as a universal rule.

2. Materials and Methods

This section outlines the data sources, model design, experimental protocols, and evaluation strategies employed in this study to investigate the effectiveness of machine learning-based SDB. Leveraging the recently published MagicBathyNet benchmark dataset, we focus on Sentinel-2 satellite imagery paired with high-resolution LiDAR-derived bathymetries acting as ground truth across two distinct coastal regions. A lightweight U-Net model architecture was utilized to estimate water depth from optical inputs, and multiple training strategies, including domain adaptation and transfer learning, were explored. Experimental configurations were carefully designed to assess model generalization across spatial domains, while a suite of regression-based and statistical metrics was applied to rigorously evaluate performance. The following subsections detail each component of the methodology.

2.1. Dataset

This study utilizes the MagicBathyNet benchmark dataset [13], a comprehensive multimodal dataset intended to assess and enhance SDB through artificial intelligence methodologies. The dataset comprises two primary regions: Agia Napa in Cyprus and Puck Lagoon in Poland, providing coregistered image–depth pairs from different sensing modalities (Figure 1). This encompasses Sentinel-2 multispectral data, SPOT-6 high-resolution optical imagery, and aerial orthophotos. The corresponding ground truth bathymetry is obtained from high-accuracy LiDAR surveys.
Specifically, only the Sentinel-2 modality was employed in this study, since these data are provided systematically and free of charge by the Copernicus program and therefore lend themselves to efficient operationalization, unlike the other modalities. In the context of this work, we use the terms “sample”, “patch”, and “tile” interchangeably. The Sentinel-2 images were already atmospherically corrected, normalized, and split into 18 × 18 pixel samples, each representing an area of 180 × 180 m (0.0324 sq. km) on the ground. This tiling approach facilitates localized learning of depth features while preserving spatial coherence along bathymetric gradients. Only the red, green, and blue (RGB) bands, corresponding to Sentinel-2 bands B4, B3, and B2, were utilized. A binary mask was applied to exclude no-data pixels, i.e., pixels without ground truth depth values or without valid spectral information across all input bands. The Agia Napa subset consists of 28 training samples and 7 testing samples, with a maximum depth of 30.29 m. The Puck Lagoon subset provides a significantly larger sample size, consisting of 2256 training samples and 566 testing samples, with a maximum depth of 10.57 m. The training and testing splits are predefined in the MagicBathyNet dataset and ensure consistent comparison and reproducibility across studies. Additionally, the samples are non-overlapping and spatially distinct, facilitating a realistic evaluation of model generalization across geographically separate portions of the target site.

2.2. Model Architecture

A lightweight U-Net architecture identical to that presented in the original MagicBathyNet study was employed. The model adheres to a standard encoder–decoder structure, originally proposed by [22], designed for dense prediction applications. The model accepts three-band Sentinel-2 imagery (RGB) and produces a single-band continuous-valued depth prediction map.
The encoder comprises four consecutive blocks. Each block includes two convolutional layers with ReLU activations, followed by a downsampling operation by max pooling. The number of features escalates progressively (32, 64, 128, and 256) to encapsulate abstract spatial characteristics. The decoder replicates the encoder by employing transposed convolutions for upsampling and utilizing skip connections to merge corresponding encoder activations. Each decoding block implements a double convolution prior to forwarding features. The output layer consists of a 1 × 1 convolution that generates the predicted depth values.
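For concreteness, a minimal PyTorch sketch of this topology is given below. The class name, helper function, and exact layer details are illustrative assumptions; the authoritative implementation is the MagicBathyNet codebase.

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, as used in each encoder/decoder block
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class LightUNet(nn.Module):
    """Lightweight U-Net: four encoder blocks (32-256 features), mirrored decoder."""
    def __init__(self, in_ch=3):
        super().__init__()
        widths = [32, 64, 128, 256]
        self.enc = nn.ModuleList()
        ch = in_ch
        for w in widths:
            self.enc.append(double_conv(ch, w))
            ch = w
        self.pool = nn.MaxPool2d(2)
        # Transposed convolutions for upsampling, one per decoder stage
        self.up = nn.ModuleList(
            nn.ConvTranspose2d(widths[i], widths[i - 1], 2, stride=2)
            for i in range(3, 0, -1))
        # Each decoder block sees upsampled features concatenated with the skip
        self.dec = nn.ModuleList(
            double_conv(widths[i - 1] * 2, widths[i - 1])
            for i in range(3, 0, -1))
        self.head = nn.Conv2d(widths[0], 1, 1)  # 1x1 conv -> depth map

    def forward(self, x):
        skips = []
        for i, block in enumerate(self.enc):
            x = block(x)
            if i < len(self.enc) - 1:
                skips.append(x)   # keep activations for skip connections
                x = self.pool(x)  # downsample between encoder blocks
        for up, dec in zip(self.up, self.dec):
            x = dec(torch.cat([up(x), skips.pop()], dim=1))
        return self.head(x)       # (B, 1, H, W) continuous depth
```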
This architecture was chosen to maintain continuity with prior work and to isolate the effects of sample count on transfer learning performance. Although more recent models (e.g., TransBathy [12]) have demonstrated enhanced performance, benchmarking these methods is beyond the scope of the present study. The primary objective is to evaluate the data-efficiency of a widely utilized network under domain shift conditions.

2.3. Experimental Setup

Three experimental scenarios were evaluated to assess the effectiveness of domain adaptation and transfer learning in SDB (Figure 2):
  • Direct inference: The model trained exclusively on the Puck Lagoon dataset was directly evaluated on the Agia Napa test set without any additional adaptation. This scenario provided a baseline assessment of model generalization across geographically and bathymetrically distinct regions, highlighting the inherent challenges associated with domain gaps.
  • Site-specific training: A model was trained and evaluated exclusively on the Agia Napa dataset. This scenario assessed the upper bound of model performance achievable when both training and testing occur within the same geographic area, serving as a performance benchmark.
  • Transfer learning: A two-stage fine-tuning approach was adopted, beginning with a model pre-trained on the Puck Lagoon (source domain) dataset and subsequently fine-tuned on the Agia Napa training set (target domain). This approach examined the capability of transfer learning to leverage domain-specific information from a related geographic area alongside limited local data. Furthermore, an additional evaluation was conducted to determine the minimal number of training samples from Agia Napa required to surpass performance baselines, systematically varying the training sample size from 5 up to the full set of 28 samples.
All models utilized the same architecture. The input multispectral imagery was normalized to a [0, 1] range based on site-specific parameters derived from each respective training set using band-wise minimum and maximum values. The ground truth depth maps were similarly scaled.
For model training, the original 18 × 18 samples were resized to 256 × 256 to match the input dimensions of the model architecture. Data augmentation included random vertical and horizontal flips to enhance generalization. Training was performed on a GPU-accelerated system (NVIDIA GeForce RTX 4060 Ti, 16 GB VRAM). The Adam optimizer was employed with an initial learning rate of 1 × 10−4, reproducing the original MagicBathyNet implementation. All stochastic operations were initialized with a fixed seed of 1. A batch size of 1 was retained, following the configuration of MagicBathyNet [13], to accommodate GPU memory constraints and the limited dataset size. Although a small batch size can increase training/validation variance, this effect was mitigated through extensive data augmentation.
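The preprocessing and optimizer setup can be sketched as follows; normalize, resize, augment, band_min/band_max, and the LightUNet class from the sketch in Section 2.2 are illustrative names rather than the benchmark's API:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(1)  # fixed seed for all stochastic operations

def normalize(img, band_min, band_max):
    # Scale each band to [0, 1] with min/max values derived from the
    # training split of the respective site
    return (img - band_min[:, None, None]) / (band_max - band_min)[:, None, None]

def resize(img):
    # 18 x 18 patches are upsampled to the 256 x 256 network input size
    return F.interpolate(img[None], size=(256, 256), mode='bilinear',
                         align_corners=False)[0]

def augment(img, depth):
    # Random horizontal/vertical flips, applied identically to the image
    # and its ground truth depth map
    if torch.rand(1) < 0.5:
        img, depth = torch.flip(img, [-1]), torch.flip(depth, [-1])
    if torch.rand(1) < 0.5:
        img, depth = torch.flip(img, [-2]), torch.flip(depth, [-2])
    return img, depth

model = LightUNet(in_ch=3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # batch size 1
```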
In scenarios (1) and (2), the learning rate remained constant throughout the training period of 10 epochs, adhering closely to the experimental setup described in the original MagicBathyNet study. For scenario (3), a two-phase learning rate scheduling strategy was implemented over a shorter span of five epochs. Initially, the encoder weights pre-trained on Puck Lagoon were frozen, facilitating the rapid adaptation of the decoder to the Agia Napa dataset. After three epochs, the encoder weights were unfrozen, and the learning rate was reduced to 1 × 10−5. This staged fine-tuning approach aimed to preserve previously learned representations, minimize forgetting, and effectively adapt the model to local conditions specific to the Agia Napa area.
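A sketch of this two-phase schedule, reusing the model object from the previous snippet and a hypothetical train_epochs helper:

```python
# Phase 1 (epochs 1-3): freeze the Puck Lagoon encoder so that only the
# decoder adapts to Agia Napa.
for p in model.enc.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
train_epochs(model, optimizer, n_epochs=3)  # hypothetical training loop

# Phase 2 (epochs 4-5): unfreeze the encoder and fine-tune end-to-end
# at a reduced learning rate to limit forgetting.
for p in model.enc.parameters():
    p.requires_grad = True
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
train_epochs(model, optimizer, n_epochs=2)
```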
The selection of a training schedule of 10 epochs is also in accordance with the original MagicBathyNet training approach and allows for effective learning without overfitting. In initial tests, after 6 epochs, the performance of the model stabilized, showing minimal improvement between epochs 8 and 10 (Figure 3). Extending the training beyond 10 epochs led to diminishing returns in terms of accuracy, and the model started to overfit.
All models were trained with a masked RMSE loss. For every pixel where annotations are available, the squared error between the model’s output and the ground truth depth is computed. A binary validity mask selects those pixels, and the result is averaged exclusively over this subset. A small constant (ε = 1 × 10−8) is added to the denominator to keep the computation numerically stable. Formally, with ground truth d, prediction d̂, and mask M, the loss is given in Equation (1):
\mathrm{Loss}_{\mathrm{RMSE}} = \sqrt{\dfrac{\sum_{i} M_i \,\big(d_i - \hat{d}_i\big)^2}{\sum_{i} M_i + \varepsilon}} \quad (1)
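Equation (1) translates directly into a few lines of PyTorch; this sketch assumes pred, target, and mask are tensors of identical shape:

```python
import torch

def masked_rmse_loss(pred, target, mask, eps=1e-8):
    # Squared error only where ground truth exists (mask == 1); the sum is
    # normalized by the number of valid pixels, with eps for numerical stability.
    sq_err = ((pred - target) ** 2) * mask
    return torch.sqrt(sq_err.sum() / (mask.sum() + eps))
```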
All experiments and model implementations were conducted using PyTorch (version 2.5.1+cu124) in a Python environment (version 3.10), ensuring reproducibility and consistent computational performance across scenarios.

2.4. Evaluation Metrics

Model performance was statistically evaluated using standard regression-based metrics calculated at the pixel level on the Agia Napa test set. Only valid pixels were included in the assessment. The following metrics were computed and are reported in meters: RMSE, mean absolute error (MAE), and standard deviation (SD).
All metrics were calculated globally by merging the predictions and ground truth values from all test samples into singular 1D arrays. This method guarantees that performance accurately represents cumulative per-pixel behavior throughout the full testing area.
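A minimal NumPy sketch of this pooling, assuming per-sample arrays of predictions, ground truth, and validity masks:

```python
import numpy as np

def global_metrics(preds, truths, masks):
    # Merge all test samples into 1-D arrays and keep valid pixels only
    valid = np.concatenate([m.ravel() for m in masks]).astype(bool)
    p = np.concatenate([x.ravel() for x in preds])[valid]
    t = np.concatenate([x.ravel() for x in truths])[valid]
    err = p - t
    return {'RMSE': float(np.sqrt(np.mean(err ** 2))),
            'MAE': float(np.mean(np.abs(err))),
            'SD': float(np.std(err))}
```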
To enhance the analysis of prediction fidelity, least-squares linear regressions were fitted to the predicted and true depth values for each model scenario. The derived slope, intercept, and 95% confidence intervals were employed to assess the degree of systematic underestimation or overestimation across the depth range. Scatter plots were generated to visualize the correlation between predictions and reference depths, with the identity line serving as a visual benchmark.
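These regression diagnostics can be reproduced with scipy.stats.linregress; the 95% confidence intervals below are built from the reported standard errors, a standard construction rather than the authors' exact code:

```python
import numpy as np
from scipy import stats

def regression_diagnostics(pred, truth, alpha=0.05):
    # OLS fit of predicted vs. ground truth depth with 95% CIs on the
    # slope and intercept
    res = stats.linregress(truth, pred)
    t_crit = stats.t.ppf(1 - alpha / 2, df=len(truth) - 2)
    return {'slope': res.slope,
            'slope_ci': (res.slope - t_crit * res.stderr,
                         res.slope + t_crit * res.stderr),
            'intercept': res.intercept,
            'intercept_ci': (res.intercept - t_crit * res.intercept_stderr,
                             res.intercept + t_crit * res.intercept_stderr),
            'r': res.rvalue}
```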
Furthermore, depth-stratified RMSE was computed using non-overlapping 1 m bins of ground truth depth values. This amounts to categorizing the pixels by ground truth depth range (bin) and calculating the RMSE per depth range, together with the standard error (SE) of the absolute errors/residuals. The SE was calculated as shown in Equations (2) and (3):
E_{\mathrm{abs}} = \left| d - \hat{d} \right| \quad (2)
\mathrm{SE} = \dfrac{\sigma_{E_{\mathrm{abs}}}}{\sqrt{n}} \quad (3)
where d is the ground truth depth (m), d̂ the predicted depth (m), σ_{E_abs} the standard deviation of E_abs, and n the number of pixels per bin.
A 1 m bin resolution was selected to optimize statistical resilience and bathymetric resolution. It is sufficiently fine-grained to capture performance variation across depth gradients while ensuring that each bin contains a significant number of valid pixels. This stratification allows for a detailed evaluation of model performance relative to true water depth, highlighting possible systematic inaccuracies in shallow or deep areas.
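A sketch of the binning procedure, implementing Equations (2) and (3) over flattened valid-pixel arrays:

```python
import numpy as np

def depth_stratified_rmse(pred, truth, bin_width=1.0):
    # RMSE and SE of absolute errors within non-overlapping depth bins
    abs_err = np.abs(pred - truth)  # Eq. (2)
    edges = np.arange(0.0, truth.max() + bin_width, bin_width)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sel = (truth >= lo) & (truth < hi)
        n = int(sel.sum())
        if n < 2:
            continue  # skip bins too sparse for a standard error
        rmse = np.sqrt(np.mean((pred[sel] - truth[sel]) ** 2))
        se = abs_err[sel].std(ddof=1) / np.sqrt(n)  # Eq. (3)
        rows.append({'bin': (lo, hi), 'n': n, 'rmse': rmse, 'se': se})
    return rows
```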
Finally, the statistical significance between models was evaluated by paired hypothesis testing at the sample-level. Three specific types of tests were performed: the paired t-test, the Wilcoxon signed-rank test, and a permutation test employing 10,000 random sign flips. This statistical methodology facilitated a dependable assessment of whether performance improvements between models were meaningful and not attributable to random variation. These evaluation parameters correspond with established approaches in recent SDB research studies [23,24]. All metrics were calculated using NumPy (version 2.2.1) and SciPy (version 1.14.1).
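The three tests can be sketched as follows for two vectors of per-sample RMSEs; the permutation scheme follows the sign-flip procedure described above:

```python
import numpy as np
from scipy import stats

def paired_tests(rmse_a, rmse_b, n_perm=10_000, seed=1):
    diff = np.asarray(rmse_a) - np.asarray(rmse_b)  # paired differences
    t_stat, p_t = stats.ttest_rel(rmse_a, rmse_b)   # paired t-test
    w_stat, p_wil = stats.wilcoxon(rmse_a, rmse_b)  # Wilcoxon signed-rank
    # Permutation test: randomly flip the sign of each paired difference
    rng = np.random.default_rng(seed)
    flips = rng.choice([-1.0, 1.0], size=(n_perm, diff.size))
    perm_means = np.abs((flips * diff).mean(axis=1))
    p_perm = float(np.mean(perm_means >= abs(diff.mean())))
    d = diff.mean() / diff.std(ddof=1)              # Cohen's d (paired)
    return {'t': t_stat, 'p_t': p_t, 'W': w_stat, 'p_wil': p_wil,
            'p_perm': p_perm, 'd': d}
```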

3. Results

This section presents the experimental findings from applying SDB models across various learning scenarios, with a focus on assessing accuracy, calibration, and data efficiency. Through a series of comparative analyses, we explore three primary configurations (direct inference, site-specific training, and transfer learning), using the Agia Napa dataset as the evaluation benchmark. Performance is evaluated using pixel-wise metrics, regression diagnostics, depth-stratified error analyses, and statistical significance testing to identify the effectiveness and limitations of each approach. The results not only quantify the benefits of transfer learning in adapting to new geographic regions but also provide evidence on the minimum training set size required for stable bathymetric predictions without compromising accuracy.

3.1. Comparison of Main Experimental Scenarios

Table 1 presents the global pixel-wise evaluation metrics for the three primary scenarios: direct inference, site-specific training, and full transfer learning using all 28 annotated training samples of the Agia Napa dataset. The direct inference model, trained solely on the Puck Lagoon site and directly applied to Agia Napa, exhibited the weakest performance, with an RMSE of 4.111 m and an MAE of 3.235 m. Its high error standard deviation (2.537 m) reflects the substantial domain shift between regions and the challenges of out-of-distribution generalization.
In contrast, the site-specific model, trained and tested exclusively on Agia Napa, achieved a considerably lower RMSE of 1.068 m and a MAE of 0.694 m, indicating a strong fit to local bathymetric characteristics. The best performance was obtained via transfer learning, where a model pre-trained on Puck Lagoon was fine-tuned using all 28 Agia Napa training samples. This approach yielded an RMSE of 0.810 m and the lowest MAE (0.488 m) and standard deviation (0.646 m).

3.2. Performance with Varying Training Set Samples

To assess how much in-domain data is required to achieve sufficiently strong transfer learning performance, additional models were trained using subsets of the 28 available Agia Napa training samples (i.e., 5, 10, 15, 20, and 25 samples). Results are summarized in Table 2. Performance improved consistently as more local data were introduced. With only 5 samples, the RMSE was 1.974 m, but it dropped to 1.361 m at 10 samples and reached 0.984 m at 15 samples. Beyond 20 samples, improvements leveled off, suggesting diminishing returns. Importantly, models trained on 15 or more samples approached the performance of the full 28-sample model: RMSE dropped from 1.97 m for 5 samples to 0.98 m for 15 samples (~50% reduction) and reached 0.81 m for the full 28-sample set (Table 2). The marginal gain shrank from 0.61 m (from 5 to 10 samples) to 0.04 m (from 20 to 25 samples), indicating diminishing returns once roughly 70% of the local data were exploited.

3.2.1. Linear Regression Consistency

Least-squares linear regressions were employed to analyze the consistency and potential biases in depth prediction by comparing predicted depth values with ground truth values for each transfer learning model. The slope, intercept, and 95% confidence intervals of each regression line elucidate the direction and magnitude of systematic overestimation and underestimation throughout the depth range. Pearson’s correlation coefficient (r) serves as an indicator of linear agreement.
Figure 4 presents scatter plots that demonstrate the consistency of predictions across various transfer learning configurations. The identity line (1:1 dashed black line) indicates perfect agreement (slope = 1, intercept = 0), whereas the fitted regression line (solid red) and its corresponding statistics highlight deviations from the ideal behavior.
  • Transfer learning (5 samples) (Figure 4a) displayed poor calibration, exhibiting a severely flat slope of 0.054 and a large positive offset of +1.9 m, revealing severe compression of the dynamic range and a large positive bias. The Pearson correlation coefficient was low (r = 0.169), indicating weak and, as the figure shows, visibly non-linear agreement.
  • Transfer learning (10 samples) (Figure 4b) yielded a marked improvement in fit, with a slope of 0.789 and an offset of +0.289 m. Correlation increased to r = 0.913, and the regression line approached the 1:1 line; however, depths of >12 m remain underestimated.
  • Transfer learning (15 samples) (Figure 4c) achieved near-ideal alignment, indicated by a slope of 1.025 and an intercept of −0.479 m. Both parameters still differ statistically from their ideal targets, but the overall scale and bias errors fall below 3% and 0.5 m, respectively (r = 0.92).
  • Transfer learning (20 samples) (Figure 4d) slightly overshot the identity line (slope = 1.172). The intercept confidence interval (CI) spans zero [−0.059 m, 0.151 m], indicating negligible global bias. Correlation peaks at 0.97.
  • Transfer learning (25 samples) (Figure 4e) yielded a slope of 1.251 and an intercept of −0.559 m. While the fit slightly diverged from the identity line, it systematically overestimates at the deepest 10% of pixels (>12.5 m).
  • Transfer learning (28 samples) (Figure 4f) demonstrated the most balanced and consistent predictions, with slope = 1.143, intercept = −0.271 m, and a correlation coefficient of r = 0.963. Although the slope remains significantly above 1, the residual depth-dependent error is ≤0.5 m down to 18 m depth.
These results indicate that regression alignment consistently improves with the number of fine-tuning samples. From 15 samples onward, the fitted lines are nearly parallel to the 1:1 line, and the correlation coefficients exceed 0.92. This indicates that transfer learning can yield well-calibrated and depth-consistent predictions even with a modest number of annotated samples.
Overall, agreement between predicted and ground truth depths increases monotonically with additional in-domain samples. From 15 samples upwards, slopes deviate from 1 by ≤18%, intercept magnitudes drop below 0.6 m, and r ≥ 0.92. Models trained on 15–20 samples provide depth-consistent predictions suitable for operational SDB in data-sparse regions.

3.2.2. Depth-Stratified Error Analysis

To evaluate model performance across varying water depths, RMSE was computed within non-overlapping 1 m ground truth depth bins. Each marker shows the bin-wise RMSE, and whiskers denote ±1 standard error of the absolute depth error (E_abs) distribution, thereby conveying both accuracy and precision. This analysis allowed for a depth-aware assessment of prediction accuracy and variance, and revealed distinct trends related to both training set size and depth range.
All configurations demonstrate very high precision, with sub-meter SEs, in the shallowest ranges (<6 m), but their trajectories diverge sharply at depths greater than 6 m. When only five Agia Napa training samples are used, the profile rises almost linearly with depth, culminating in an RMSE of 15.9 m in the 17–18 m bin (Figure 5a), confirming that so few training samples cannot resolve the strong spectral attenuation that characterizes deeper pixels.
Expanding the fine-tuning set to 10 samples (Figure 5b) flattens the profile. RMSE stays below 3.7 m in the 0–10 m depth range. Beyond this range, the RMSE ascends more gradually, peaking at 4.8 m in the 14–15 m bin, before declining again in the deepest interval. For depths greater than 12 m, the standard error of the absolute error distribution decreases from ~0.45 m to <0.25 m, confirming a statistically tighter fit. These observations indicate that 10 training samples are sufficient to correct mid-depth bias.
An inflection occurs once 15 samples are included. RMSE stays below 5 m in every bin, exceeding 3 m only for depths between 6 m and 14 m, while the shallowest and deepest bins (0–6 m, 14–18 m) remain below 3 m and contract to roughly 1 m RMSE, about 1/16 of the error observed for the five-sample model. SE never exceeds 0.38 m across the profile. Thus, about half of the available training set is sufficient to stabilize both the bias and the dispersion of the retrieval.
Adding more training samples yields diminishing benefits. The 20-sample model holds shallow-water RMSE between 0.5 and 1 m but displays a modest rebound in the 12–14 m depth region (RMSE = 4.1 m). The 25-sample configuration behaves similarly and degrades slightly at several depths. In contrast, the full 28-sample model demonstrates the most balanced performance: RMSE stays below 1 m down to 5 m depth, below 3.1 m between 6 and 12 m, and peaks only slightly above 3.5 m.
Overall, the depth-stratified analysis corroborates the global metrics: the steepest accuracy improvement occurs between 5 and 15 training samples, after which the curves flatten and the residual error is potentially driven primarily by water column physics [25], rather than data scarcity.

3.2.3. Statistical Significance of Differences

Robust inference requires that the sampling units entering a hypothesis test be independent. Neighboring pixels in a Sentinel-2 scene are highly spatially autocorrelated. To address this issue, the error of every model was aggregated to a single RMSE per test sample, and the seven non-overlapping test samples were treated as independent observations (n = 7, degrees of freedom = 6). For each pair of models, the following quantities were calculated:
  • Δμ, the mean paired difference in sample RMSE:
\Delta\mu = \frac{1}{7} \sum_{k=1}^{7} \left( \mathrm{RMSE}_{1,k} - \mathrm{RMSE}_{2,k} \right)
where a positive value indicates that the first-named model is less accurate.
  • t, p-t, the statistic and two-tailed probability from a paired t-test, which gauges whether Δμ differs from zero under the assumption of normally distributed differences.
  • W, p-wil, the signed-rank statistic and probability from the Wilcoxon test, which makes no distributional assumption.
  • p-perm, a permutation probability obtained by randomly flipping the sign of the seven paired differences 10,000 times and recording the proportion of permutations whose absolute mean equals or exceeds |Δμ|.
  • d, Cohen’s effect size for paired samples, d = Δμ/σδ, where σδ is the standard deviation of the seven differences.
With only seven samples, statistical power is limited, so these results should be read as indicative. Paired t-tests therefore hover just above the 0.05 threshold for many contrasts. Nevertheless, the non-parametric Wilcoxon test and the permutation test, both robust to non-normality, converge to the same conclusion: error falls sharply from 5 to 20 samples (p ≤ 0.032, |d| ≥ 0.87) and then stabilizes. Large effect sizes (>0.8) paired with small permutation probabilities (<0.02) indicate that the observed plateau is not an artifact of limited sample size but reflects the saturation of transferable information.
The sample-level analysis confirms that the largest accuracy gains occur when the fine-tune set expands from 5 to ~20 Agia Napa samples. In the contrast “5 vs. 20”, the mean RMSE gap is 0.74 m, the Wilcoxon probability is 0.032, the permutation probability is 0.016, and the effect size is 0.87, all of which indicate a genuine improvement. Adding a further five samples yields a similar but slightly weaker signal (Δμ = 0.76, Wilcoxon p = 0.047). Once 15 samples are available, however, successive increments produce only modest and statistically uncertain changes. In more detail, the difference between the 15- and 20-sample models amounts to 0.04 m. Both parametric and non-parametric tests return probabilities > 0.15 and the effect size is small (d = 0.54). The 28-sample model remains the most accurate (RMSE = 0.83 m), but the advantage over the 20- and 25-sample configurations is not statistically reliable (p ≥ 0.07, |Δμ| ≤ 0.09 m, |d| ≤ 0.60).
The complementary pixel-level analysis of the absolute errors in Appendix A (Table A1) mirrors this order (5 < 10 < 15 < 20 ≤ 25 < 28) and likewise displays the largest error reductions for cases that add the first 10 to 15 samples. The p-values are vanishingly small because of the vast number of spatially autocorrelated pixels that inflate statistical power. However, the absolute differences between models that differ by ≤5 samples are ≤0.07 m, confirming that the practical benefits plateau beyond 20 samples.
In conclusion, the results in Table 3 demonstrate that transfer learning error declines steeply as the first 15–20 samples are incorporated and then plateaus asymptotically. Beyond this threshold, the residual error is dominated by factors other than training sample size, such as optical attenuation, sensor noise, or model capacity, and additional annotation yields diminishing improvement.

3.2.4. Visualization Results

Figure 6 depicts the mosaicked samples of the depth predictions produced by each transfer learning experiment. The qualitative progression observed in the six panels aligns with the quantitative patterns detailed in the prior sections.
With only five samples (Figure 6a), the bathymetric surface is strongly compressed. Extensive areas are assigned uniform depths. Scattered negative values arise where the network predicts depth for land that was masked during training. These values have no geophysical meaning but indicate poor generalization when very little in-domain data is provided. Moving to 10 samples (Figure 6b) recovers a wider depth range (~22 m) and begins to delineate channel incisions, yet vertical striping persists along detector rows, revealing remaining calibration bias.
With 15 samples (Figure 6c), the map achieves spatial coherence: reef heads, sand tongues, and scour pits are continuous, and the depth histogram aligns with the 0–25 m ground truth reference frame. Negative artifacts vanish, validating that the network has learned a solid land–water boundary. The 20-sample model (Figure 6d) sharpens slope breaks and small shoals, while offshore noise is visibly reduced.
The 25-sample output (Figure 6e) differs mainly by a slight smoothing of speckle in the deeper fringe, in line with the marginal Δμ improvement detected in Section 3.2.3. Finally, the 28-sample model (Figure 6f) provides the cleanest representation. Detailed variations on the outer shelf and scour pits adjacent to reef promontories are identified, with a depth range of around 27 m aligning with the ground truth benchmark. Only minor negative depths are generated, reflecting the model’s calibrated confidence throughout the site.

4. Discussion

Optimizing transfer learning pipelines in SDB, particularly under limited data conditions, requires identifying a potential minimum number or a range of target-domain training samples needed for stable and precise bathymetry estimation. This section integrates multiple quantitative analyses (global metrics, regression fits, depth-stratified errors, and statistical tests) to evaluate how model performance scales with increasing Agia Napa training samples. The benchmark test subset spans depths of up to 18 m.
Across the transfer learning configurations, models demonstrated consistent improvements in RMSE, MAE, and SD as training set size increased (Table 2), which is intuitively expected. However, these improvements were non-linear. For instance, between 5 and 15 samples, RMSE decreased from 1.974 m to 0.984 m, a 50.1% relative reduction, while beyond 15 samples, improvements decelerated. In particular, from 15 to 28 samples, RMSE decreased only marginally (0.984 m to 0.810 m, 17.7%). MAE and SD followed similar saturation patterns, indicating that the performance improvements plateau beyond 15 samples, suggesting a potential lower bound for capturing dominant spatial and spectral variability within the Agia Napa site.
Regression analysis reinforces this inflection point. At 15 samples, the fitted slope (1.025), intercept (−0.479 m), and correlation (r = 0.924) closely approximate the fully adapted 28-sample model (slope = 1.143, intercept = −0.271 m, r = 0.963), indicating near-parity in gradient sensitivity and bias correction. In contrast, the five-sample model exhibited a flat slope (0.054) and a large positive offset (+1.9 m), failing to capture bathymetric gradients and systematically underestimating depths. This progression underscores how depth-gradient fidelity is recovered incrementally through fine-tuning. However, qualitative interpretation of the relationships reveals clustering, especially apparent in the 5-, 10-, and 15-sample experiments and, to a lesser extent, in the 25-sample experiment (Figure 4a–c and Figure 4e, respectively). In fact, some of the clusters may show a variable type of correlation. For instance, qualitative assessment of the 15-sample experiment (Figure 4c) indicates a potentially non-linear overall relationship when all depths are considered: isolating the cluster between ground truth depths of 0 and 6 m reveals a linear relationship with a lower slope, whereas for the cluster above 6 m ground truth depth the slope becomes steeper. Similarly, in the 25-sample experiment (Figure 4e), three clusters can be observed based on the predicted depth, namely 0–6 m, 6–11 m, and 15 m and above, which show relatively different types of relationships. By contrast, the 20-sample experiment shows a more balanced and consistent globally linear relationship, comparable to the 28-sample experiment.
Prediction accuracy in SDB is expected to deteriorate with depth due to down-welling light attenuation. This trend is supported by the depth-binned results (Figure 5), although the size of the fine-tuning set strongly affects how quickly error grows with depth. With five training samples the model performed credibly in the uppermost depths (RMSE ≤ 1 m down to ~5 m), but error escalated rapidly thereafter, reaching 16 m in the 17–18 m depth bin. With 10 training samples, the central portion of the curve flattened, keeping RMSE below 6 m across the depth bins. A step-change is observed at 15 samples: across the entire 0–18 m depth range RMSE did not exceed 4 m, and in the critical 12–18 m interval the mean error dropped below 3.5 m. Expanding the fine-tuning set from 15 to 20 and 25 samples reduced bin-wise RMSE by only a few decimeters, and adding the final 3 samples (28 samples in total) yielded improvements within the measurement noise range (<0.1 m in every depth class). Hence, the learning curve saturates once approximately 15 to 20 well-distributed samples are provided.
This plateau is physically attributable to the optical characteristics of the water column in Agia Napa. In situ observations from the COASTLOOC dataset report median diffuse attenuation coefficients of Kd = 0.46 m−1 (490 nm), 0.41 m−1 (510 nm), and 0.32 m−1 (555 nm), corresponding to an e-folding depth of only 2–3 m [26]. At a 12 m depth, the downwelling irradiance is suppressed by more than two orders of magnitude, effectively reaching the sensor’s noise floor. Additional annotations cannot recover information absent from the radiance record. Consequently, the residual RMSE of 3–4 m observed in the 12–18 m bins is an attenuation-limited, rather than data-limited, artifact.
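As a back-of-the-envelope check, assuming Beer-Lambert attenuation in a homogeneous water column with the median Kd(490) quoted above:

E_d(z) = E_d(0)\, e^{-K_d z} \;\Rightarrow\; \frac{E_d(12\,\mathrm{m})}{E_d(0)} \approx e^{-0.46 \times 12} = e^{-5.52} \approx 4 \times 10^{-3}

That is, well under 1% of the surface irradiance reaches 12 m even before the return path to the sensor is considered, consistent with the claim above.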
All transfer learning models followed a two-stage fine-tuning strategy (Section 2.3). This staged approach minimizes forgetting of the source–domain representation and allows the rapid adaptation of spatial and spectral filters to the target site. The results showed that even minimal adaptation (e.g., 5–10 samples) can suppress domain shift, but gradient consistency and bias correction only stabilized once ≥15 samples were provided.
Sample-wise hypothesis tests demonstrated that the accuracy benefit of transfer learning rose steeply until approximately 15 to 20 Agia Napa samples were included and then plateaued. The transition from 10 to 15 samples brought RMSE within 0.25 m of the saturation level, and no pairwise contrast beyond 20 samples survived a 5% significance threshold. Consequently, 15–20 spatially diverse samples constitute a lower bound for achieving site-specific performance with Sentinel-2 RGB imagery in clear, shallow water settings.
It is important to emphasize that this threshold reflects optical and geomorphological conditions of the Agia Napa region. In regions with stronger attenuation, higher turbidity, or complex bathymetry, larger in situ datasets may be required. Thus, the 15- to 20-sample size recommendation represents a lower bound for transfer learning in optically clear and shallow coastal settings.
While the 15-sample threshold offers a potential practical lower bound for high-fidelity prediction, this recommendation must be contextualized. The 28 Agia Napa training samples were non-overlapping, spatially diverse, and predefined by the benchmark to ensure comprehensive bathymetric coverage. Reducing to 15 samples implicitly narrows coverage, potentially omitting rare substrates, complex shoreline geometries, or spectrally anomalous features. Thus, optimal sample selection is crucial. Stratified sampling or spatial uniformity is advised to avoid local overfitting.
Additionally, although transfer learning makes up for limited data, it does not remove the intrinsic limitations of optical bathymetry, such as bottom signal loss or reflectance saturation in turbid regions. Performance in deep, turbid waters is still limited by sensor physics, even with efficient fine-tuning, and cannot be entirely overcome by model architecture or learning strategy.
Collectively, the evidence supports the conclusion that 15 to 20 Agia Napa samples (about 54–71% of the available training set, equivalent to 0.486–0.648 sq. km) constitute a minimum viable fine-tuning subset. At this point, the model reproduces both global and local bathymetric structure with high fidelity, minimizes stratified uncertainty, and passes all statistical criteria for stable domain adaptation. These findings provide a quantifiable lower bound for data requirements in transfer learning-based SDB and highlight the steep performance ramp available through even moderate amounts of in-domain calibration.

Limitations

Several factors constrain the generality of the present findings. First, the results are derived from a single source-target pairing and tested on seven non-overlapping Sentinel-2 samples. Although these cover most of Agia Napa’s depth range and benthic variability, they do not encompass the optical or geomorphological diversity typical of turbid, high-chlorophyll, or structurally complex coastal systems. The reported 15-sample threshold should therefore be viewed as site-specific and not generalized to other environments without further validation.
Second, only Sentinel-2 RGB bands were utilized. These bands are limited in their ability to capture dissolved organic matter, algae, or suspended sediments. In optically complex settings, incorporating red-edge or shortwave infrared bands may be essential for achieving comparable performance with similar sample counts.
Third, the model architecture was fixed to a lightweight U-Net to isolate the effect of target-domain sample count. While this isolates one variable, it does not assess whether more advanced architectures (e.g., transformer-based or physics-informed networks) could reduce data requirements or adapt more efficiently to domain shifts.
Finally, the training–testing split derived from the original MagicBathyNet study, although geographically stratified, is not exhaustive. Reducing training samples from 28 to about 15–20 inevitably leaves parts of the coastline unsampled. If these areas contain rare substrates or atypical optical properties, the reported lower bound may underestimate data requirements and bathymetric estimations. Overall, these limitations highlight the importance of multi-region experiments to confirm whether similar plateaus emerge across diverse coastal environments and conditions.

5. Conclusions

Fine-tuning a U-Net model pretrained on the Puck Lagoon region with 15 Agia Napa samples reduced the global RMSE from 4.11 m (direct inference) to 0.98 m, effectively matching the performance of the fully adapted 28-sample model within the clear, shallow coastal setting under study. Additional training samples yielded marginal improvements (RMSE ≤ 0.05 m per 5 samples). These findings suggest that, in regions similar to Agia Napa with limited data availability, transfer learning can offer non-trivial performance gains with 15 to 20 samples. However, it is important to note that the comparative advantage of transfer learning over site-specific training is context-dependent. In some regions with complex bathymetric features or environments where the site-specific training set is large and representative, site-specific models may provide superior accuracy. Additionally, the results indicate that once the model has analyzed approximately 15 representative samples from the target domain, residual error is predominantly governed by physical constraints (e.g., optical attenuation) rather than by data scarcity.
This performance was achieved by employing a two-stage fine-tuning strategy: the decoder was initially trained independently to adjust for site-specific characteristics, followed by full-model adaptation at a reduced learning rate. The approach ensured rapid convergence while preserving source-domain features. In the context of Sentinel-2 RGB data, these results suggest that, in clear and shallow—Mediterranean in this case—waters similar to Agia Napa, 15–20 spatially diverse samples can serve as a practical lower bound for reliable transfer learning in SDB; yet this threshold may not be universally applicable and may vary in more complex environments. Future work should explore generalization to broader depth ranges and alternative model classes, including physics-informed architectures, to determine adaptive thresholds across diverse coastal and inland water environments. Also, the exploration of region-specific factors (e.g., optical properties, environmental heterogeneity) and multi-region experiments is recommended.

Author Contributions

C.G.E.A. and V.P. conceived the study and wrote the abstract; C.G.E.A. and V.P. wrote the introduction, methodology, and results; C.G.E.A., A.M. and K.V. reviewed the first draft; A.M., K.V. and I.G. supervised the methodology and conclusions sections; S.V. and I.K. acquired funding. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the European Union’s Horizon Europe Research and Innovation Program Waterverse, under grant agreement no 101070262, and Cyclops, under grant agreement no 101135513.

Data Availability Statement

The data presented in this study are openly available in Zenodo at https://doi.org/10.5281/zenodo.10470959, reference number 10470959.

Acknowledgments

This research has been carried out within the Horizon Europe Research and Innovation Programmes WATERVERSE and CyclOps.

Conflicts of Interest

Author Konstantinos Vlachos was employed by the company CDXi Solutions P.C. (Private Company), which is a spin-off of the Centre for Research and Technology Hellas (CERTH)/Information Technologies Institute (ITI). The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SDB: Satellite-Derived Bathymetry
IOP: Inherent Optical Properties
RMSE: Root Mean Squared Error
MAE: Mean Absolute Error
SD: Standard Deviation
DEM: Digital Elevation Model
CNN: Convolutional Neural Network

Appendix A

Table A1. Summary of pairwise statistical comparisons among transfer learning models (absolute per-pixel errors).

| Transfer Learning Models Comparison | Δμ (m) | t-stat | p-t | W (×10^10) | p-wil | p-perm |
|---|---|---|---|---|---|---|
| (5) vs. (10) | 0.389 | 235.028 | <5 × 10^−324 | 3.21 | <5 × 10^−324 | <1 × 10^−4 |
| (5) vs. (15) | 0.606 | 328.651 | <5 × 10^−324 | 2.63 | <5 × 10^−324 | <1 × 10^−4 |
| (5) vs. (20) | 0.620 | 342.628 | <5 × 10^−324 | 2.20 | <5 × 10^−324 | <1 × 10^−4 |
| (5) vs. (25) | 0.631 | 336.516 | <5 × 10^−324 | 2.49 | <5 × 10^−324 | <1 × 10^−4 |
| (10) vs. (15) | 0.217 | 176.304 | <5 × 10^−324 | 4.05 | <5 × 10^−324 | <1 × 10^−4 |
| (10) vs. (20) | 0.231 | 207.430 | <5 × 10^−324 | 3.48 | <5 × 10^−324 | <1 × 10^−4 |
| (10) vs. (25) | 0.242 | 213.622 | <5 × 10^−324 | 3.66 | <5 × 10^−324 | <1 × 10^−4 |
| (15) vs. (20) | 0.014 | 17.198 | 2.90 × 10^−66 | 4.89 | <5 × 10^−324 | <1 × 10^−4 |
| (15) vs. (25) | 0.025 | 31.323 | 3.8 × 10^−215 | 5.22 | 3.70 × 10^−7 | <1 × 10^−4 |
| (20) vs. (25) | 0.011 | 18.767 | 1.5 × 10^−78 | 5.26 | 5.00 × 10^−1 | <1 × 10^−4 |
| (28) vs. (5) | −0.697 | −361.995 | <5 × 10^−324 | 2.23 | <5 × 10^−324 | <1 × 10^−4 |
| (28) vs. (10) | −0.307 | −255.215 | <5 × 10^−324 | 3.25 | <5 × 10^−324 | <1 × 10^−4 |
| (28) vs. (15) | −0.091 | −131.364 | <5 × 10^−324 | 4.01 | <5 × 10^−324 | <1 × 10^−4 |
| (28) vs. (20) | −0.076 | −117.218 | <5 × 10^−324 | 4.37 | <5 × 10^−324 | <1 × 10^−4 |
| (28) vs. (25) | −0.065 | −129.600 | <5 × 10^−324 | 4.02 | <5 × 10^−324 | <1 × 10^−4 |

References

  1. Lyzenga, D.R. Passive remote sensing techniques for mapping water depth and bottom features. Appl. Opt. 1978, 17, 379–383.
  2. Maritorena, S.; Morel, A.; Gentili, B. Diffuse reflectance of oceanic shallow waters: Influence of water depth and bottom albedo. Limnol. Oceanogr. 1994, 39, 1689–1703.
  3. Lee, Z.; Carder, K.L.; Mobley, C.D.; Steward, R.G.; Patch, J.S. Hyperspectral remote sensing for shallow waters: 2. Deriving bottom depths and water properties by optimization. Appl. Opt. 1999, 38, 3831–3843.
  4. Stumpf, R.P.; Holderied, K.; Sinclair, M. Determination of water depth with high-resolution satellite imagery over variable bottom types. Limnol. Oceanogr. 2003, 48, 547–556.
  5. Pan, Z.; Glennie, C.; Legleiter, C.; Overstreet, B. Estimation of water depths and turbidity from hyperspectral imagery using support vector regression. IEEE Geosci. Remote Sens. Lett. 2015, 12, 2165–2169.
  6. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  7. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
  8. Danilo, C.; Melgani, F. Wave period and coastal bathymetry using wave propagation on optical images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 6307–6319.
  9. Kerr, J.M.; Purkis, S.J. Automated tuning of empirical models for satellite-derived bathymetry in variable water types. Remote Sens. Environ. 2018, 210, 402–413.
  10. Thomas, N.; Pertiwi, A.P.; Traganos, D.; Lagomasino, D.; Poursanidis, D.; Moreno, S.; Fatoyinbo, L. Space-borne cloud-native Satellite-Derived Bathymetry (SDB) models using ICESat-2 and Sentinel-2. Geophys. Res. Lett. 2021, 48, e2020GL092170.
  11. Wu, Z.; Zhao, Y.; Wu, S.; Chen, H.; Song, C.; Mao, Z.; Shen, W. Satellite-derived bathymetry using a fast feature cascade learning model in turbid coastal waters. J. Remote Sens. 2024, 4, 0272.
  12. Zhang, X.; Al Shehhi, M.R. Bathymetry estimation for coastal regions using self-attention. Sci. Rep. 2025, 15, 970.
  13. Agrafiotis, P.; Janowski, Ł.; Skarlatos, D.; Demir, B. MAGICBATHYNET: A Multimodal Remote Sensing Dataset for Bathymetry Prediction and Pixel-Based Classification in Shallow Waters. arXiv 2024, arXiv:2405.15477. Available online: http://www.magicbathy.eu/magicbathynet.html (accessed on 8 February 2025).
  14. Neumann, M.; Pinto, A.S.; Zhai, X.; Houlsby, N. In-domain representation learning for remote sensing. arXiv 2019, arXiv:1911.06721.
  15. Ma, Y.; Chen, S.; Ermon, S.; Lobell, D.B. Transfer learning in environmental remote sensing. Remote Sens. Environ. 2023, 288, 113924.
  16. Qiu, Y.; Huang, J.; Luo, J.; Xiao, Q.; Shen, M.; Xiao, P.; Peng, Z.; Jiao, Y.; Duan, H. Monitoring, simulation and early warning of cyanobacterial harmful algal blooms: An upgraded framework for eutrophic lakes. Environ. Res. 2025, 264, 120296.
  17. Qi, T.; Shen, M.; Kutser, T.; Xiao, Q.; Cao, Z.; Ma, J.; Luo, J.; Liu, D.; Duan, H. Remote sensing of dissolved CO2 concentrations in meso-eutrophic lakes using Sentinel-3 imagery. Remote Sens. Environ. 2023, 278, 113431.
  18. Richardson, G.; Foreman, N.; Knudby, A.; Wu, Y.; Lin, Y. Global deep learning model for delineation of optically shallow and optically deep water in Sentinel-2 imagery. Remote Sens. Environ. 2024, 311, 114302.
  19. Lv, Z.; Herman, J.; Brewer, E.; Nunez, K.; Runfola, D. BathyFormer: A transformer-based deep learning method to map nearshore bathymetry with high-resolution multispectral satellite imagery. Remote Sens. 2025, 17, 1195.
  20. Xie, C.; Chen, P.; Zhang, Z.; Pan, D. Satellite-derived bathymetry combined with Sentinel-2 and ICESat-2 datasets using machine learning. Front. Earth Sci. 2023, 11, 1111817.
  21. Mudiyanselage, S.S.J.D.; Abd-Elrahman, A.; Wilkinson, B.; Lecours, V. Satellite-derived bathymetry using machine learning and optimal Sentinel-2 imagery in South-West Florida coastal waters. GISci. Remote Sens. 2022, 59, 1143–1158.
  22. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; Volume 9351, pp. 234–241.
  23. Traganos, D.; Poursanidis, D.; Aggarwal, B.; Chrysoulakis, N.; Reinartz, P. Estimating satellite-derived bathymetry (SDB) with the Google Earth Engine and Sentinel-2. Remote Sens. 2018, 10, 859.
  24. Al Najar, M.; Thoumyre, G.; Bergsma, E.W.J.; Almar, R.; Benshila, R.; Wilson, D.G. Satellite derived bathymetry using deep learning. Mach. Learn. 2023, 112, 1107–1130.
  25. Caballero, I.; Stumpf, R.P. Confronting turbidity, the major challenge for satellite-derived coastal bathymetry. Sci. Total Environ. 2023, 861, 161898.
  26. Jamet, C.; Loisel, H.; Dessailly, D. Retrieval of the spectral diffuse attenuation coefficient Kd(λ) in open and coastal ocean waters using a neural network inversion. J. Geophys. Res. Ocean. 2012, 117, C10023.
Figure 1. Map of the study areas: (a) Puck Lagoon, Poland; (b) Agia Napa, Cyprus. CRS: WGS84 (EPSG: 4326) (basemap source: ESRI World Imagery).
Figure 2. Algorithm flowchart. The two components are as follows: (a) comparison between the three scenarios—direct inference, site-specific training, and transfer learning; (b) transfer learning training and testing based on the varying number of training samples (5 to 28 samples).
Figure 3. Indicative training learning curve of the transfer learning experiment for a 28-sample experiment after 10 epochs.
Figure 4. Predicted vs. ground truth depth scatter plots with ordinary least-square fits, 95% CI annotations and the identity line. Panels correspond to models fine-tuned with (a) 5, (b) 10, (c) 15, (d) 20, (e) 25, and (f) 28 samples, respectively.
Figure 5. Depth-stratified RMSE plots per number of samples used for transfer learning, over the Agia Napa test set with error bars depicting the bin-wise SE for (a) 5, (b) 10, (c) 15, (d) 20, (e) 25, (f) 28 samples.
Figure 6. Coast-wide bathymetric predictions for Agia Napa produced by transfer learning models fine-tuned with (a) 5, (b) 10, (c) 15, (d) 20, (e) 25, and (f) 28 samples. Base map from ESRI World Imagery. Depth units in meters (m).
Table 1. Pixel-wise global metrics across the three main scenarios.

| Scenario | RMSE (m) | MAE (m) | SD (m) |
|---|---|---|---|
| Direct inference | 4.111 | 3.235 | 2.537 |
| Site-specific | 1.068 | 0.694 | 0.940 |
| Transfer learning (28) | 0.810 | 0.488 | 0.646 |
Table 2. Pixel-wise global metrics across the varying transfer learning approaches.

| Model | RMSE (m) | MAE (m) | SD (m) |
|---|---|---|---|
| Transfer learning (5) | 1.974 | 1.185 | 1.579 |
| Transfer learning (10) | 1.361 | 0.795 | 1.104 |
| Transfer learning (15) | 0.984 | 0.579 | 0.796 |
| Transfer learning (20) | 0.948 | 0.564 | 0.761 |
| Transfer learning (25) | 0.905 | 0.554 | 0.716 |
| Transfer learning (28) | 0.810 | 0.488 | 0.646 |
Table 3. Pair-wise comparison of sample-wise RMSE (meters).

| Transfer Learning Models Comparison | Δμ RMSE (m) | t-stat | p-t | W | p-wil | d | p-perm |
|---|---|---|---|---|---|---|---|
| (5) vs. (10) | 0.445 | 2.08 | 0.0830 | 5 | 0.1560 | 0.79 | 0.1144 |
| (5) vs. (15) | 0.700 | 2.12 | 0.0782 | 3 | 0.0781 | 0.80 | 0.0588 |
| (5) vs. (20) | 0.740 | 2.31 | 0.0599 | 1 | 0.0312 | 0.87 | 0.0160 |
| (5) vs. (25) | 0.756 | 2.19 | 0.0712 | 2 | 0.0469 | 0.83 | 0.0281 |
| (10) vs. (15) | 0.255 | 1.99 | 0.0933 | 4 | 0.1090 | 0.75 | 0.0797 |
| (10) vs. (20) | 0.295 | 2.39 | 0.0544 | 0 | 0.0156 | 0.90 | 0.0151 |
| (10) vs. (25) | 0.310 | 2.07 | 0.0835 | 3 | 0.0781 | 0.78 | 0.0794 |
| (15) vs. (20) | 0.040 | 1.44 | 0.2000 | 6 | 0.2190 | 0.54 | 0.2148 |
| (15) vs. (25) | 0.055 | 1.74 | 0.1330 | 6 | 0.2190 | 0.66 | 0.1074 |
| (20) vs. (25) | 0.015 | 0.45 | 0.6660 | 11 | 0.6880 | 0.17 | 0.6634 |
| (28) vs. (5) | −0.830 | −2.23 | 0.0675 | 0 | 0.0156 | −0.84 | <1 × 10^−4 |
| (28) vs. (10) | −0.385 | −2.18 | 0.0719 | 1 | 0.0312 | −0.82 | 0.0164 |
| (28) vs. (15) | −0.130 | −2.55 | 0.0432 | 0 | 0.0156 | −0.97 | <1 × 10^−4 |
| (28) vs. (20) | −0.090 | −1.59 | 0.1630 | 6 | 0.2190 | −0.60 | 0.1539 |
| (28) vs. (25) | −0.075 | −2.16 | 0.0744 | 3 | 0.0781 | −0.82 | 0.063 |

