Integration of Photogrammetric and Spectral Techniques for Advanced Drone-Based Bathymetry Retrieval Using a Deep Learning Approach

Abstract: Shallow bathymetry mapping using proximal sensing techniques is an active field of research that offers a new perspective in studying the seafloor. Drone-based imagery with centimeter resolution allows for bathymetry retrieval in unprecedented detail in areas with adequate water transparency. The majority of studies apply either spectral or photogrammetric techniques for deriving bathymetry from remotely sensed imagery. However, spectral methods require a certain amount of ground-truth depth data for model calibration, while photogrammetric methods cannot perform on texture-less seafloor types. The presented approach takes advantage of the interrelation of the two methods in order to predict bathymetry in a more efficient way. Thus, we combine structure-from-motion (SfM) outputs along with band ratios of radiometrically corrected drone images within a specially designed deep convolutional neural network (CNN) that outputs a reliable and robust bathymetry estimation. To achieve effective training of our deep learning system, we utilize interpolated uncrewed surface vehicle (USV) sonar measurements. We perform several predictions at three locations in the southern Mediterranean Sea, with varying seafloor types. Our results show low root-mean-square errors over all study areas (average RMSE ≈ 0.3 m) when the method was trained and tested on the same area each time. In addition, we obtain promising cross-validation performance across different study areas (average RMSE ≈ 0.9 m), which demonstrates the potential of our proposed approach in terms of generalization capabilities on unseen data. Furthermore, areas with mixed seafloor types are suitable for building a model that can be applied in similar locations where only drone data are available.


Optical Remote Sensing in Seafloor Mapping
Shallow seafloor bathymetry is an essential component in several coastal studies, providing end users with valuable information about the underwater topography. Detailed shallow bathymetry data are required for a wide variety of applications, such as monitoring coastal erosion [1][2][3][4] and ecological mapping of benthic habitats [5][6][7]. However, the coastal seafloor has long been considered a "white ribbon" [8], since traditional techniques such as sonar surveying are unable to provide full coverage at high spatial resolution (<1 m) in a time- and cost-effective way. Limiting factors such as the safe operational depth of the vessel and restricted sonar coverage due to survey geometry [9] hinder the quality and feasibility of high-resolution mapping in shallow areas. Other traditional topo-bathymetric

Satellite-Derived Bathymetry
Most of the empirical SDB techniques are based on the logarithmic band-ratios approach introduced by [18] (see methods section), and lately there has been a tendency to combine machine learning techniques with empirical SDB algorithms. These novel approaches take advantage of the multi-dimensional nature of input datasets and have shown promising results. Ref. [21] applied an artificial neural network (ANN) approach to Landsat imagery, showing promising results even for predicting depths greater than 20 m. Furthermore, Ref. [22] tested two ANN algorithms on IKONOS and Landsat imagery, which outperformed the optical modeling and regression-tree techniques. Ref. [23] developed a new support vector machine (SVM) approach for deriving bathymetry using IKONOS-2 multispectral imagery. They performed training on a neighborhood scale, and by using the full training dataset they obtained bathymetry with low (<1 m) RMSE values even for deeper waters (>16 m). Recently, Ref. [24] applied a convolutional neural network (CNN) technique to Sentinel-2 imagery, trained with sonar and LIDAR bathymetry, for calculating SDB up to 15 m of water depth with approximately 1 m error.
A usual constraint in most empirical SDB studies is the requirement for comprehensive ground-truth depth measurements for tuning a regression model for bathymetry prediction. An additional constraint in empirical SDB studies is seafloor cover heterogeneity, which may induce depth inaccuracies in cases where the spectral difference due to seafloor cover is greater than the spectral difference due to depth [18]. This occurs particularly in shallow and relatively flat seafloor areas where mixed seafloor types (e.g., sand, reefs, algae, seagrasses) alternate spatially.

Structure from Motion
A technique that does not require ground-truth data input for deriving bathymetry from passive optical sensors is the application of photogrammetry along with multi-view satellite or aerial imagery [25,26]. Essentially, this approach takes advantage of image geometry and overlap, along with image texture, for producing a 3D surface from corresponding points between successive images. However, an important drawback of the method is that it requires a seafloor surface with significant texture in order to provide useful outputs. In several cases, the seafloor is naturally featureless (e.g., flat with sediment cover), preventing the application of photogrammetric techniques. Additional requirements for successful photogrammetric results include accurate camera pose initialization (e.g., RTK accuracy), clear water, low wave height, minimal breaking waves, and minimal sun glint [27,28]. Light refraction at the air-water interface is another potential issue that may be encountered during photogrammetric reconstruction of the shallow seafloor. The reconstruction error caused by refraction depends directly on the water depth and on the incidence angles of the rays. High-altitude flights allow reconstructions using only small incidence ray angles and therefore smaller refraction-related errors. However, if rays with large incidence angles are taken into account (e.g., to cover a larger area), the refraction error can remain significant even for high-altitude flights, as also suggested by [29,30].
The latest advancements in uncrewed aerial vehicle (UAV or drone) technology, along with the development of structure-from-motion (SfM) techniques, have paved the way for a new era in the field of geospatial disciplines [30]. SfM is one of the most widely applied photogrammetric techniques in studies using drone imagery. It has revolutionized traditional photogrammetry owing to its ability to reconstruct 3D scenes without a priori knowledge of camera position [30]. Recent studies from [27,31] utilized drone-based imagery and SfM for deriving shallow-water bathymetry in Mediterranean coastal areas under ideal sea-surface conditions. In order to correct for refraction, they trained a machine learning algorithm with LIDAR bathymetry using dense point clouds resulting from standard SfM, obtaining optimal results. In contrast, a similar study from [32] applied the refraction correction from [33] for reconstructing very shallow areas (<2 m) without significantly improving the final bathymetry. Another study that applies SfM for reconstructing shallow bathymetry is presented in [34]. In their study, the camera intrinsic and extrinsic calibration parameters were computed using frames from the onshore part of the dataset, while refraction was corrected according to the method proposed in [35]. Similarly, Ref. [36] applied a multimedia bundle adjustment to account for refraction. They initially process images over land areas for deriving camera intrinsic and extrinsic parameters, and then keep these parameters fixed during processing of images from further offshore. Additionally, there have been a few recent studies applying empirical SDB (i.e., extensions of the logarithmic band-ratio technique) or analytical algorithms to drone-based multispectral imagery [28,[37][38][39][40], showing good results with vertical errors of less than half a meter. In particular, Ref. [27] combined SfM and RGB color information in the same processing chain for bathymetry calculation. Initially, they performed 3D reconstruction in areas with rich seafloor texture, and then they applied the SfM outputs as inputs to machine learning models for optimizing bathymetry retrieval. Accordingly, Ref. [41] applied a deep learning methodology for extracting bathymetry by incorporating spectral and photogrammetric features of airborne imagery.

Aim of the Study
Considering the above-mentioned limitations of optical methods in deriving bathymetry, we propose a novel deep learning methodology tested on three study areas with contrasting seafloor types. This practice assists in better evaluating the proposed method under different scenarios of seafloor texture. Capitalizing on recent advancements in the field of machine learning, the presented study takes advantage of successful approaches in neural networks. The recent success of deep neural networks has led to the development of new tools with improved performance compared to their predecessors. Specifically for image analysis, convolutional neural networks (CNNs) are the type of algorithm most commonly used in the literature [42][43][44][45] and have achieved important advances in diverse areas of image processing [45][46][47]. Therefore, we apply a CNN in this study in order to integrate the geometric and spectral approaches for optical bathymetry retrieval. This is the main novelty of the study, offering a new perspective in tackling the disadvantages of both methods when applied separately. In this way, we optimize the training process for bathymetry retrieval by minimizing the need for extensive ground-truth data input.
Initially, we develop an approximate 3D surface based on SfM over areas with textured seafloor and then combine it with spectral and spatial information for producing a final bathymetric output covering the entire area. This approach was tested over three study areas with diverse seafloor types to better evaluate its performance. In addition, we examine the influence of the amount of training data on the final bathymetry predictions, and we assess the performance of each individually trained model (at each study area) by applying it to the remaining areas. In this way, we identify which areas are optimal for training a model that can be applied at a regional scale. The application of deep learning assists further in extracting maximum information from a diverse set of image datasets and thus achieves detailed bathymetry over any seafloor type. Ground-truth data collected with an uncrewed surface vehicle (USV) are utilized for guiding the training process and validating the bathymetry predictions. This study exploits the versatility of uncrewed platforms (UAV and USV) along with state-of-the-art remote sensing techniques for accurate and detailed reconstruction of shallow bathymetry in a computationally efficient way and at low overall cost compared to other methods applied so far.

Study Areas
The method was applied to the following three study areas, which were selected according to the variability of seafloor cover and structure they present. All of the study areas comprise waters with similar optical properties and have a Secchi depth greater than 10 m. These are characterized as optically transparent waters due to low concentrations of chlorophyll and suspended matter, a result of the oligotrophic character of the eastern Mediterranean Sea [48] and the absence of significant input from adjacent drainage systems. The first study area is a small bay (Stavros) located north of Chania city (Crete, Greece, Figure 1). The seafloor captured by the drone data is generally shallow, reaching a maximum depth of 4 m, with a very smooth slope. The study area is generally homogeneous, covered partly with fine sand and partly with beach rock. The smooth bathymetry results in a seafloor albedo that changes gradually with water depth. This provides an ideal case for studying the performance of both techniques in complementing each other. The second area (Kalamaki) is located a few kilometers west of Chania city (Figure 1). It comprises mainly a rough seafloor surface formed by rocky reefs covered with various types of algae, while there are also some shallow areas covered with fine sand. The maximum depth measured by the USV is 9 m and falls within the area covered by the drone images. The third area is located on the south-western coast of Crete (Elafonisi beach, Figure 1) and also comprises smooth seafloor, covered mainly with fine, foraminiferal sand and in places with submerged beach rock and rocky reefs. The maximum depth recorded within the drone coverage area is 4.5 m. This site provides an additional setting for testing the effect of mixed seafloor types and the repeatability of results.
It has to be noted that, at the boundaries of the Elafonisi and Stavros orthomosaics, the actual maximum depth might be 1-2 m deeper than the maximum recorded by the USV data. Thus, we have excluded these image parts from the prediction process in order to restrict the range of optimal bathymetry predictions within the range of the actual depth measurements.

Onshore Survey and Drone Platform Configuration
Prior to the drone surveys, a set of ground control points (GCPs) was measured along the coastline of each study area. The GCPs were measured with a real-time kinematic (RTK) GPS for achieving high accuracy (±2 cm). This level of accuracy is crucial in drone surveys that produce imagery with centimeter-scale spatial resolution, while the onboard GPS sensor has a horizontal accuracy of approximately two meters. Thus, the GCPs are used for accurate orthorectification of the point clouds and 2D reflectance mosaics. The drone platform comprises a commercial DJI Phantom 4 Pro drone equipped with a 20 Mpixel RGB camera. The aerial survey data are presented in Table 1. Flying at an altitude of 120 m assists in minimizing the effect of (a) air/water refraction and (b) image noise due to sun glint effects both on the sea surface and on the seafloor (due to wave focusing). Drone imagery was processed with SfM techniques for producing an initial bathymetric surface (see following section). The raw imagery values were converted to reflectance by using a reference reflectance panel and Pix4D software (v4.5, Lausanne, Switzerland). The blue (B) and green (G) bands correspond to shorter wavelengths (460 ± 40 nm and 525 ± 50 nm, respectively) and thus show deeper penetration through the water column [49]. The red (R) band corresponds to 590 ± 25 nm, which is highly absorbed in the first 1-2 m of water depth but assists in emphasizing extremely shallow areas. Table 1. Details of drone survey at each study area.


USV Surveys
The diagram in Figure 2 shows the USV and sensor configuration. The USV used is a remote-controlled platform mounted with an Ohmex BTX single-beam sonar with an operating frequency of 235 kHz. The sonar is integrated with an RTK-GPS sensor and collects attitude-corrected bathymetry points at a 2 Hz sampling rate. The USV ground-truth bathymetry points are shown in Figure 2A. All USV data were collected on the same date as the drone imagery in order to avoid temporal changes of bathymetry. The RTK-GPS measurements provide high spatial accuracy, which is essential in processing drone-based imagery with a pixel resolution of a few centimeters. At the Chania area, a total of 800 depth measurements were acquired, while at the Elafonisi area the USV survey yielded more than 3000 data points. Considering that the tidal range on the island of Crete is at most ±0.2 m, and that the drone data were acquired within one hour of the USV data, we infer that the tidal effect is negligible for both the USV and drone data.


Structure from Motion
The OpenSfM open-source software was utilized for photogrammetric processing of the drone images [50]. The drone's GPS and IMU data were used to initialize the camera extrinsic parameters. The intrinsic parameters were estimated during SfM using self-calibration. Given the very low ratio between the average water depth and the flight altitude, and the nadiral view of the imagery, we assume that refraction effects are minimal and thus did not account for them. The resulting 3D surface was not very accurate over smooth (featureless) seafloor areas, and thus it was interpolated and used as an explanatory variable. This way, it provided a useful approximation of the seafloor relief for guiding the predictive model.
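Densifying the sparse SfM surface into a raster can be illustrated with a thin-plate spline, the interpolant also used per patch in the training setup. The sketch below uses SciPy's `RBFInterpolator` on synthetic sparse depths; the point layout, depth values, and patch size are illustrative assumptions, not the study's data.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(1)
xy = rng.uniform(0.0, 128.0, size=(60, 2))        # sparse (x, y) sample locations
z = -2.0 - 0.01 * xy[:, 0] - 0.02 * xy[:, 1]      # synthetic depths (m, negative down)

# Thin-plate spline interpolant fitted to the sparse points.
tps = RBFInterpolator(xy, z, kernel="thin_plate_spline")

# Evaluate on a dense 128 x 128 grid to obtain a raster patch.
gx, gy = np.meshgrid(np.arange(128.0), np.arange(128.0))
patch = tps(np.column_stack([gx.ravel(), gy.ravel()])).reshape(128, 128)
```

In practice the interpolated raster would then be masked to the region actually covered by SfM points, since a spline extrapolates poorly far from the data.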

Data Pre-Processing
The photogrammetrically derived bathymetry is expected to capture in detail only areas with rich seafloor texture, leaving areas with homogeneous seafloor unreconstructed. Thus, an integrated approach with full-scene spectral data is required to compensate for this issue. This approach relies on the empirical SDB method of Ref. [18], which employs logarithmic band ratios from multispectral data (Equation (1)). This method is characterized by straightforwardness and versatility regarding its various implementations, and thus it has been applied successfully in several recent studies [23,[51][52][53][54]. The main concept of this approach is that light attenuation in the water column is exponential for wavelengths in the visible spectrum.
The original formula relies on the logarithmic ratio of two spectral bands (wavelengths) and two empirically tuned factors:

Z = m1 · ln(n · Rw(λi)) / ln(n · Rw(λj)) − m0,  (1)

where Z is the estimated depth, Rw(λi,j) is the water column reflectance of optically shallow water recorded at wavelength λi or λj nanometres (with i < j), m1 is a tunable constant to scale the ratio to depth, n is a fixed constant for all areas, and m0 is the offset for a depth of 0 m (e.g., tidal offset). The fixed value of n is chosen arbitrarily in order to ensure both that the logarithm is positive under any condition and that the ratio produces a linear response with depth. The coefficients m1 and m0 are determined from a set of ground-truth depth measurements by calibrating a linear regression equation, which can then be applied for calculating bathymetry. In addition, Equation (1) is expected to be better tuned when spectral bands with good correlation with water depth are available [55].

In order to prepare the imagery for use with the CNN model, we applied radiometric corrections using the proprietary software Pix4D©. These corrections are necessary for converting raw image data to meaningful reflectance values, which are useful in quantitative image analysis such as spectral mapping [49]. In particular, Ref. [56] showed that radiometrically corrected drone RGB imagery shows improved correlation with water depth. Initially, the pixel values are compensated for sensor bias such as sensor black level, sensitivity, gain and exposure settings, and lens vignette effects, and then they are converted to radiance values (in units of W/m²/sr/nm, i.e., watts per square meter per steradian per nanometer). Subsequently, the radiance values are converted to spectral reflectance for each band by incorporating the information from the calibrated reflectance panel (CRP). Apart from radiometric corrections, Pix4D provided geometric calibration for radial lens distortion using the specific camera model provided by the manufacturer.
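The calibration of Equation (1) can be sketched numerically as a least-squares fit of m1 and m0 against ground-truth depths. The reflectance values below are synthetic, constructed so that the ratio is exactly linear in depth, and the helper names (`stumpf_ratio`, `calibrate`) are illustrative, not the study's code.

```python
import numpy as np

def stumpf_ratio(b_i, b_j, n=1000.0):
    # Logarithmic band ratio of Equation (1); n keeps both logarithms positive.
    return np.log(n * b_i) / np.log(n * b_j)

def calibrate(ratio, depth):
    # Least-squares fit of depth = m1 * ratio - m0 on ground-truth points.
    A = np.column_stack([ratio, np.ones_like(ratio)])
    m1, c = np.linalg.lstsq(A, depth, rcond=None)[0]
    return m1, -c

# Synthetic reflectances whose ratio varies linearly with depth (for illustration).
rng = np.random.default_rng(0)
depth = rng.uniform(0.5, 8.0, 200)                 # ground-truth depths (m)
green = np.full_like(depth, 0.05)                  # constant green reflectance
blue = np.exp((1.0 + 0.05 * depth) * np.log(1000.0 * green)) / 1000.0

r = stumpf_ratio(blue, green)
m1, m0 = calibrate(r, depth)                       # recovers m1 = 20, m0 = 20 here
predicted = m1 * r - m0                            # bathymetry estimate
```

With real imagery the fit is only approximately linear, and the residuals of this regression are exactly the depth errors that seafloor-cover heterogeneity introduces.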

Convolutional Network Architecture and Training Set
The CNN model used in this study follows the stacked-hourglass architecture suggested by [57]. This type of model was especially designed to find dominant features in multichannel inputs [57]. The specific architecture that we adopt here has been successful in estimating depth values from multichannel images of faces and hands [58,59], a problem that shares some similarities with predicting bathymetry. In contrast to other networks for depth estimation [60,61], this architecture is more lightweight since it deals with image patches instead of taking the entire large image as input. In the work of [59], it was determined that between four and six stacked hourglass modules provided optimal depth reconstructions. Based on this and further experimentation (Section 3.1), we decided to use six stacked-hourglass modules in this study.
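For orientation, a highly simplified PyTorch sketch of a stacked-hourglass regressor over five-channel 128 × 128 patches is shown below. Layer sizes, channel counts, and the internal structure of each module are illustrative assumptions and do not reproduce the exact architecture of [57].

```python
import torch
import torch.nn as nn

class Hourglass(nn.Module):
    # One downsample-bottleneck-upsample module with a skip connection.
    def __init__(self, ch=32):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU())
        self.mid = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.skip = nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        return self.up(self.mid(self.down(x))) + self.skip(x)

class StackedHourglass(nn.Module):
    # Stack of hourglass modules; each stack emits an intermediate depth map
    # so that every stage can contribute to the loss (intermediate supervision).
    def __init__(self, in_ch=5, ch=32, n_stacks=6):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, ch, 3, padding=1)
        self.stacks = nn.ModuleList(Hourglass(ch) for _ in range(n_stacks))
        self.heads = nn.ModuleList(nn.Conv2d(ch, 1, 1) for _ in range(n_stacks))

    def forward(self, x):
        x = self.stem(x)
        outs = []
        for hg, head in zip(self.stacks, self.heads):
            x = hg(x)
            outs.append(head(x))
        return outs

# Five-channel patch: three band ratios, the SfM surface, and distance from coast.
depth_maps = StackedHourglass()(torch.randn(2, 5, 128, 128))
```

Training would then sum a regression loss over all six stage outputs against the interpolated USV depth patches, which is the role of the multi-stage loss described below.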
The training procedure of the CNN model is presented in Figure 3. First, an input traverses the network, giving the network's output. Next, the validity of the output is quantified with the loss function between the computed output and the defined ground-truth depth. The multichannel input of the CNN model consists of image patches of size 128 × 128 pixels drawn from five input rasters: three rasters for the logarithmic band ratios (Blue/Green, Blue/Red, and Green/Red), one for the approximate SfM surface, and one with the distance-from-coast information. In order to enhance the available geoinformation of the training set, we decided to include the distance from the coast for each pixel as an additional explanatory variable. Thus, we extracted the coastline by visual assessment of the RGB orthomosaics. The output of the training set (interpolated USV depth) consists of a 128 × 128 single-channel image patch that depicts the depth values of the respective input, originating from the measurements of the USV. In the case of the SfM and USV data, a thin-plate spline interpolation is applied within the region of each patch, since the original data are composed of sparse measurements. Apart from the CNN model, and for comparison purposes, we also applied Random Forests (RFs) [62] and Support Vector Machines (SVMs) [63] using a common training dataset from the Kalamaki area. RFs operate by aggregating the votes of several decision trees, each of which has been trained to predict a particular parameter. Aggregation helps suppress noise and extends the generalization capability of the resulting predictions. SVMs are based on the insight that, for a binary classifier, there is a decision boundary around which the classification prediction switches between the two classes. This boundary is essentially defined by the samples that are closest to each other and carry opposite labels.
Both approaches were originally proposed as classifiers and later extended to regression tasks.
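The RF and SVM baselines are standard regressors; a minimal scikit-learn sketch on synthetic stand-ins for the five input rasters (not the actual Kalamaki training data) could look like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.uniform(size=(500, 5))     # stand-ins for band ratios, SfM depth, distance to coast
y = 3.0 * X[:, 0] + 1.5 * X[:, 3] + 0.1 * rng.normal(size=500)   # synthetic depth (m)

X_train, y_train, X_test, y_test = X[:300], y[:300], X[300:], y[300:]

# Random forest regression: averages many decision trees to suppress noise.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
# Support vector regression with an RBF kernel.
svm = SVR(kernel="rbf", C=10.0).fit(X_train, y_train)

rf_rmse = float(np.sqrt(np.mean((rf.predict(X_test) - y_test) ** 2)))
svm_rmse = float(np.sqrt(np.mean((svm.predict(X_test) - y_test) ** 2)))
```

Unlike the CNN, both baselines predict each pixel from its own feature vector, so they cannot exploit the spatial context within a patch; this is the main design difference the comparison probes.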

Figure 3. Overall workflow of the proposed approach for drone-based bathymetry estimation. The images acquired by the drone are processed to generate an orthomosaic of the study area, as well as to perform structure from motion (SfM). Then, logarithmic RGB band ratios are calculated, as well as a map with the distance from the coast and a rough depth estimate derived from interpolating the SfM result. These are then fed to the adopted CNN model, which consists of convolutional layers (light blue), deconvolutional layers (light orange), and a bottle-neck convolution (light red). The layers inside a stage change the dimensions N (spatial and feature) of the inputs. The outputs of all stages contribute to minimizing the loss function between the estimations and the interpolated USV depth.

Results
In this section, we present the results of our proposed method in the three study areas. We also include an ablation study to highlight the contribution of each component of our method. Furthermore, we compare it with previous deep learning approaches as well as with conventional machine learning methods (RFs, SVMs). Finally, we show the generalization capabilities of our method with a cross-validation experiment between all study areas.

Bathymetry Results on the Study Areas
To evaluate the performance of our method, we conducted three experiments, one on each study area. Each experiment consists of a training and testing procedure on the respective train and test sets of that study area.
From all the patches that constitute a study area, we used a random subset of 60% for training and the remaining 40% for testing. The train-test split of the data was based on the checkerboard block approach suggested by [64,65] as the most effective way to eliminate spatial autocorrelation effects during data validation. Furthermore, it is worth mentioning that, while the train/test patches are in several cases adjacent, we do not use the entire test patches to calculate the reported results. In fact, we use only part of the USV points that are within the test patches: the USV points that are close (<3 m) to any train patch are discarded, leaving a smaller number of USV points, near the center of the test patch, to be used as ground-truth test depth values.
In Figure 4, we present a visualization example of the patches selected for training (in blue) and testing (in red), along with the USV measurements (white dots), using the 60/40 training/testing ratio. The 60/40 data split was chosen as a more balanced scenario in which the training set does not overwhelm the test set and thus highlights the CNN performance better. In particular, for Stavros bay, 94 patches were selected for training and 52 for testing; for Kalamaki beach, 155 patches were selected for training and 107 for testing; and for Elafonisi beach, 63 patches were selected for training and 45 for testing.
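The patch split and the 3 m exclusion buffer around training patches can be sketched as follows. The grid size, the patch footprint (128 px at roughly 10 cm/px, hence ~12.8 m), and the random shuffle are illustrative assumptions; the shuffle is a simplified stand-in for the checkerboard block selection of [64,65].

```python
import random

# Patch grid over a study area and the metric patch size (illustrative values).
rows, cols, patch_m = 10, 10, 12.8
patches = [(r, c) for r in range(rows) for c in range(cols)]
random.seed(3)
random.shuffle(patches)                            # stand-in for checkerboard selection
split = int(0.6 * len(patches))                    # 60/40 train/test ratio
train, test = set(patches[:split]), set(patches[split:])

def keep_usv_point(x, y, buffer_m=3.0):
    # Reject ground-truth points within buffer_m of any training patch,
    # so only points near test-patch centers remain for evaluation.
    for r, c in train:
        x0, y0 = c * patch_m, r * patch_m
        dx = max(x0 - x, 0.0, x - (x0 + patch_m))  # distance to patch in x
        dy = max(y0 - y, 0.0, y - (y0 + patch_m))  # distance to patch in y
        if (dx * dx + dy * dy) ** 0.5 < buffer_m:
            return False
    return True
```

A point at the center of a test patch is at least half a patch width (here 6.4 m) from any neighboring train patch, so it always survives the 3 m buffer, which matches the behavior described above.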
The test results of the three experiments are reported in Figure 5. The predicted bathymetry at the Stavros area (Figure 5A) shows maximum correlation with the USV depth measurements despite a few artifacts in the middle of the scene. These artifacts are probably due to tiling effects on the reflectance mosaic. In addition, the RMSE value for the Stavros results is less than 10 cm. The predicted bathymetry at the Kalamaki area (Figure 5C) shows very good correlation with the USV depth measurements and has an RMSE value of 35 cm. At the Elafonisi area, the predicted bathymetry (Figure 5E) shows very good correlation with the USV depth measurements and has an RMSE value of 33 cm. The error distribution of the overall CNN predictions (Figure 6) suggests that each (local) model, trained using 60% of the USV data, generalizes the bathymetry well beyond the training patches and over different seafloor types.

Figure 6. Frequency distribution of residuals (USV depth minus CNN-predicted depth using 60% for training) at each study area and corresponding statistics.

Ablation Study
To investigate the appropriate architecture choices and the optimal use of the input data, we conducted a series of ablation experiments. These experiments aim to show the advantage of using multiple stacked hourglasses (as suggested by the literature) over a single hourglass module, and to justify the selection of input rasters (RGB band ratios, SfM, distance from coast) and their benefit to estimation accuracy.
We trained variations of our model on the same training subset of Kalamaki beach and validated the choices on the respective test subset. The results reported in Table 2 indicate the benefit of using multiple stacks compared to fewer (single and triple hourglass). Furthermore, Table 2 demonstrates the benefit of multivariable input, as well as the network's ability to handle it effectively: only when all rasters are used jointly do we obtain the best bathymetry predictions.
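As a hedged illustration of how the multivariable input could be assembled per pixel, the sketch below stacks log band ratios with the SfM depth and distance-to-coast channels. The Stumpf-style log-ratio form and the scaling constant are assumptions; the paper does not spell out its exact ratio formulation.

```python
import math

N_CONST = 1000.0  # scaling constant keeping both logarithms positive

def log_band_ratio(r1, r2):
    """Stumpf-style log band ratio of two reflectances (assumed form)."""
    return math.log(N_CONST * r1) / math.log(N_CONST * r2)

def pixel_features(rgb, sfm_depth, dist_coast):
    """Stack one pixel's input channels: the three RGB band ratios,
    the SfM-derived depth, and the distance from the coastline."""
    r, g, b = rgb
    return [
        log_band_ratio(b, g),
        log_band_ratio(b, r),
        log_band_ratio(g, r),
        sfm_depth,
        dist_coast,
    ]

# Illustrative reflectances and auxiliary values for a single pixel
features = pixel_features((0.08, 0.12, 0.15), sfm_depth=2.4, dist_coast=35.0)
```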

Sensitivity Analysis of the Train-Test Split
Since the amount of our data is low compared to other studies applying deep learning approaches, we analyzed the sensitivity of our model to the amount of training/testing data by applying different split ratios. Specifically, we used 70/30, 60/40, 50/50, 40/60, and 30/70 training/testing ratios for each study area. Table 3 shows how the RMSE and R² scores are influenced by the ratio selection. These results suggest that the lower the training ratio, the higher the estimation error. This signifies that, despite the low total amount of data, the train/test sets are capable of describing the true behavior of our model.

Comparison with Artificial Neural Networks and Conventional Machine Learning Methods
A direct comparison of our CNN model with other ANN approaches is not feasible, mainly because of the lack of publicly available implementations targeting the specific problem of bathymetry estimation. We therefore approximate such a comparison by building the most well-known and commonly used ANN architecture. The single-stack hourglass is similar to the U-net type networks applied in previous bathymetry retrieval studies [41] using RGB and/or SfM as input; the results in Table 2 show how this variant of networks compares to our full-stack model. Even though deep learning approaches have achieved massive breakthroughs in many domains, conventional machine learning methods, such as random forests (RF) and support vector machines (SVM), still provide satisfactory results in shallow bathymetry mapping [66,67]. For that reason, we compared our CNN model with RF and SVM implementations, trained and tested on the same 60/40 split of the Kalamaki area. Specifically, we used an RF regression with 100 trees and a maximum tree depth of 8, and a support vector regression with a linear kernel, C = 2.0, and epsilon = 0.4 for the loss function.
The results in Table 4 show that the CNN model outperformed the commonly used machine learning algorithms and thus strengthened our motivation to apply it in this study.
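A comparison protocol of this kind amounts to fitting each candidate regressor on the same training split and scoring it on the same test split. The sketch below shows such a harness with a trivial mean-depth baseline standing in for the real models; in practice, the RF and SVM slots would be filled by implementations such as scikit-learn's `RandomForestRegressor` and `SVR`, and the baseline class here is purely illustrative.

```python
import math

def rmse(truth, pred):
    """Root-mean-square error between truth and predictions."""
    n = len(truth)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / n)

class MeanDepthBaseline:
    """Trivial stand-in regressor: always predicts the mean training depth.

    In the actual comparison this slot is filled by the RF, SVM, and CNN
    models; this class exists only to make the harness runnable."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]

def compare(models, X_train, y_train, X_test, y_test):
    """Train each candidate on the same split and report its test RMSE."""
    return {name: rmse(y_test, m.fit(X_train, y_train).predict(X_test))
            for name, m in models.items()}
```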

Cross-Validation Study
To further assess the estimation capabilities of our approach, we conducted a cross-validation experiment between the three case studies. Specifically, we trained our CNN model on all patches (100% for training) of each study area and then applied the model to all image patches of the remaining two areas. We repeated this procedure for all three cases, resulting in the 3 × 3 matrix of Table 5, which reports the RMSE values: each column corresponds to the area the model was trained on, and each row to the area it was tested on. As expected, the diagonal of the matrix holds the lowest error values, since the entire ground-truth set was used for both training and testing, providing artificially low RMSE values. Data from Kalamaki beach provide the best training case compared to the other areas, with an average RMSE of 0.8 m. This can be justified by the fact that this study area holds a greater variety of common features than the other two (rocky/featureless seafloors, and larger coverage).

The results of the ablation study and the comparison with the RF and SVM methods (Tables 2 and 4) helped to better frame the performance of our CNN model among other machine learning approaches applied in earlier studies [41,66,67]. The CNN model yielded more accurate bathymetry predictions than the RF and SVM for the presented study areas and the given amount of training/testing data. In addition, in the ablation study the full-stack architecture outperformed the single- and triple-stack architectures that resemble the deep neural network applied by [41]. The combination of spectral (band-ratio), SfM, and distance-to-coast rasters appears to explain bathymetry variability better than any subset of them (Table 2). This evidence suggests that the proposed CNN model is a promising tool for extracting shallow bathymetry from drone imagery.
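The 3 × 3 cross-validation protocol described above, training on 100% of one area and testing on 100% of every area, can be sketched generically as follows. The `train_fn`/`predict_fn` callables stand in for the actual CNN training and inference routines, which are not reproduced here.

```python
import math

def rmse(truth, pred):
    """Root-mean-square error between truth and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))

def cross_validate(areas, train_fn, predict_fn):
    """Build the cross-validation matrix: train on all patches of one area,
    test on all patches of every area (including itself).

    Returns matrix[test_area][train_area] = RMSE, matching the row/column
    layout described for Table 5."""
    models = {name: train_fn(data["X"], data["y"]) for name, data in areas.items()}
    matrix = {}
    for test_name, data in areas.items():
        matrix[test_name] = {
            train_name: rmse(data["y"], predict_fn(model, data["X"]))
            for train_name, model in models.items()
        }
    return matrix
```

As expected from the protocol, the diagonal entries (same train and test area) are artificially low, since the model has already seen the entire ground-truth set.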
The CNN model applied in this study generalized the training depths well over a wider portion of each study area. Consequently, individual models predicted bathymetry over unknown seafloor with significant accuracy when trained with 60% of the ground-truth data, leading to at least 85% of the bathymetry outputs having a low error (<0.5 m, Figure 6). This suggests that sparse sonar measurements acquired with a USV provide sufficient training data for producing accurate bathymetry (RMSE < 0.4 m) over wider shallow areas such as bays and extended coastlines. Even when a smaller amount of training data is used, the output bathymetry accuracy does not decrease considerably, suggesting that the learning procedure is effective and that each training set comprises data representative of all seafloor types and depths. Only at the Elafonisi area does the R² decrease noticeably when 40% of the data is used for training.
Technically, training the model took approximately two hours, while prediction took about 30 min on an Nvidia GeForce GTX 1080 Ti GPU. The cross-validation results suggest that the CNN model trained on the entire Kalamaki dataset provided the best overall accuracy (RMSE ∼ 0.8 m) when predicting the bathymetry of the other two study areas. This suggests that the Kalamaki model (100% of data for training) is more "regional" and thus applicable in similar locations, with small differences in seafloor types and water properties, where additional ground-truth data are not available. This is probably explained by the fact that more than 3500 interpolated depth measurements scattered over various seafloor types were used for training at the Kalamaki area. A greater number of training data would likely further decrease the prediction error at the other two areas, as also shown by the experiments in Table 3.
To better assist the evaluation of potential sources of error, Figure 7 shows the spatial distribution of residuals overlaid on the RGB orthomosaics. Test points with large absolute residuals appear in a few instances over deep, rocky places of the Kalamaki and Elafonisi areas. A possible explanation is that rocky areas often contain dark or shaded parts of the seafloor, which may lead to erroneous bathymetric predictions. This issue was also encountered in the study of [37], where large errors similarly occurred over shaded patches in rocky areas. At the Stavros area, the few outliers that appear on the map are attributed to random noisy measurements, as the maximum absolute error values are generally low (0.4 m) compared to the other two areas.

The presented approach produces bathymetry predictions with low overall error for the given amount of training bathymetry data. Compared to other similar studies using drone-based imagery for empirical bathymetry retrieval [28,39], our approach achieves comparable errors; the main difference is that our training set is significantly more restricted than that of Ref. [28], which is based on tens of thousands of points. Earlier SDB studies relied on abundant (several thousand) bathymetry data for training [22,23,67], using large training/test ratios. We consider that the efficacy of any empirical method in predicting bathymetry should also account for the amount of training data required to achieve a particular accuracy level. In our case, the USV platform assisted in obtaining targeted ground-truth depth measurements in a time- and cost-effective way. Thus, we used on average 1000-2000 depth measurements (60% scenario) to construct each training set.

Future Considerations
In this study, we considered an alternative use of the SfM output surface and did not apply it as a training output as in [28]. Instead, we produced small training patches by interpolating along the track of the USV point measurements. Interpolating the ground-truth data provides a form of data augmentation, which is essential in deep model training. In contrast to simpler machine learning algorithms, which receive 1D point-based information, a CNN model requires 2D/3D data as training outputs; data augmentation techniques are therefore necessary for closing gaps in the ground-truth data or even the explanatory variables [68]. In other words, a LIDAR or multi-beam sonar bathymetry dataset would be ideal for extracting training patches for the CNN. However, such data require significant cost and effort to collect in shallow areas and thus fall beyond the scope of this study. The presented approach could be further improved by enhancing the quality of the training datasets. First, a successful SfM reconstruction requires several environmental criteria to be met, such as a textured seafloor, minimal sun glint, and a nearly flat sea surface. Although these conditions are challenging to meet simultaneously, they would greatly improve the SfM output, which could then be used for extracting suitable training patches for the CNN model. Furthermore, the USV holds promising potential for producing 3D training patches from underwater imagery. The USV can collect underwater images with considerable detail and texture down to a maximum depth depending on water clarity (for most Mediterranean coastal areas, between 5 and 10 m). This application would greatly enhance the capabilities of single-beam sonar by incorporating the RTK depth measurements into an SfM procedure similar to the one applied to the drone images.
An additional advantage of this application is that underwater images would be free from refraction artifacts. In this way, suitable bathymetric surfaces could be extracted, providing an improved source for training and testing the CNN model.
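The construction of training patches by interpolating along the USV track, mentioned above, could be sketched with a simple inverse-distance-weighted (IDW) interpolator as below. IDW is one common choice for interpolating sparse soundings; the paper does not name its exact interpolation method, so the scheme, the power parameter, and the function names are assumptions for illustration.

```python
import math

def idw_depth(x, y, soundings, power=2.0):
    """Inverse-distance-weighted depth at (x, y) from sparse USV soundings.

    `soundings` is a list of (sx, sy, depth) tuples. IDW is an assumed
    interpolation choice, not necessarily the one used in the study."""
    num = den = 0.0
    for sx, sy, d in soundings:
        dist = math.hypot(x - sx, y - sy)
        if dist < 1e-9:
            return d  # query point coincides with a sounding
        w = 1.0 / dist ** power
        num += w * d
        den += w
    return num / den

def patch_grid(origin, size, step, soundings):
    """Interpolate a small square training patch of depths near the track."""
    ox, oy = origin
    return [[idw_depth(ox + i * step, oy + j * step, soundings)
             for i in range(size)] for j in range(size)]

# Two illustrative soundings 10 m apart; the midpoint interpolates to 2 m
track = [(0.0, 0.0, 1.0), (10.0, 0.0, 3.0)]
patch = patch_grid((4.0, 0.0), 2, 1.0, track)
```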

Conclusions
Hydrospatial datasets acquired with novel uncrewed platforms (drone, USV) show great potential for mapping shallow bathymetry at high resolution. The suggested deep learning approach combines the strengths of SfM and spectral methods simultaneously, resulting in a CNN model that predicts bathymetry with low error even when 60% of the ground-truth data is used for training. This suggests that deep learning approaches make efficient use of the input data, minimizing the cost and effort of data acquisition. The cross-validation tests showed that areas with mixed seafloor types are optimal for training a more "regional" CNN model that can be applied in unknown areas with similar water/seafloor types using only drone images. However, a greater amount of ground-truth data is required for achieving acceptable errors (<0.5 m) with the "regional" CNN model. Future work will focus on developing and testing a "regional" CNN model with better performance, and on applying SfM to underwater imagery to increase the amount of training data in the deep learning procedure.