Deep Learning for Deep Waters: An Expert-in-the-Loop Machine Learning Framework for Marine Sciences

Abstract: Driven by the unprecedented availability of data, machine learning has become a pervasive and transformative technology across industry and science. Its importance to marine science has been codified as one goal of the UN Ocean Decade. While increasing amounts of, for example, acoustic marine data are collected for research and monitoring purposes, and machine learning methods can achieve automatic processing and analysis of acoustic data, they require large training datasets annotated or labelled by experts. Consequently, addressing the relative scarcity of labelled data is, besides increasing data analysis and processing capacities, one of the main thrust areas. One approach to address label scarcity is the expert-in-the-loop approach, which allows efficient analysis of limited and unbalanced data. Its advantages are demonstrated with our novel deep learning-based expert-in-the-loop framework for automatic detection of turbulent wake signatures in echo sounder data. Using machine learning algorithms, such as the one presented in this study, greatly increases the capacity to analyse large amounts of acoustic data. It is a first step in realising the full potential of the increasing amount of acoustic data in marine sciences.


Introduction
The United Nations (UN) has declared the next ten years, 2021-2030, as the UN Decade of Ocean Science for Sustainable Development. The primary reason is that healthy and productive oceans, which rely on and require sustainable management, are essential for humanity. One of the core themes in the UN Ocean Decade is data: in particular the transformative possibilities resulting from the generation of big data within marine science and monitoring [1][2][3][4]. This is also reflected in the coordinated work within the European Union (EU) Copernicus and EMODnet platforms [5,6].
This growing volume of data, often referred to as "big data", invokes new issues and calls for new approaches to manage and analyse data [2]. Machine learning methods are often used to analyse "big data" and to detect statistically significant patterns in them [1,2]. However, state-of-the-art machine learning methods, such as deep learning, typically require a large number of labelled examples to train on. Therefore, one of the principal bottlenecks in using machine learning to facilitate analysis of marine data, is the current lack of annotated data [1,2,7].
One example of "big data" in marine science, is acoustic data obtained from different types of sonars/echo sounders: i.e., Multi-Beam Echo Sounders, Single Beam Echo Sounders, Side Scan Sonars, and Acoustic Doppler Current Profilers (ADCP) [8][9][10][11]. Acoustic data has eclectic applications in marine science [10], and the Marine Strategy Framework Directive (MSFD) stresses the importance of using acoustic methods to improve environmental monitoring in the future [12]. For example, acoustic methods can be used: (a) to map geological properties of the sea bed and benthic habitats (e.g., [11,13,14], and refs therein), (b) to monitor fish schools and individual fishes [8,9,15], zooplankton [16,17], and marine mammals [18,19], and (c) to study physical properties of the water column [20][21][22]. Here, we present a novel deep learning-based framework for automatic detection of wake signatures in echo sounder data (echograms). The advantages of our deep learning framework are demonstrated by a case study using acoustic data of turbulent ship wakes and their introduction of energy by vertical mixing, as outlined in the following sections.
Our goal was to automatically detect turbulent ship wakes in echograms, with the aim of assessing the spatiotemporal extent of the wakes, as well as the energy input from ships through vertical mixing. Of the eleven MSFD descriptors used to assess the environmental status of the marine environment (MSFD, 2008/56/EC) [23], the eleventh descriptor concerns the introduction of energy, including underwater noise, into the marine environment. Environmental impact from shipping is often mentioned in relation to descriptor 11, as ships cause both noise [12] and erosion [24,25]. However, there is another energy input from shipping which is rarely mentioned, namely the energy input through ship-induced vertical mixing.
Behind all ships, a turbulent wake is induced by the ship propeller and hull [26][27][28][29]. The wake is characterised by an increase in turbulence and a dense bubble cloud. The bubbles in the wake region can be detected using an echo sounder [30,31]. In regions with intense ship traffic the turbulence and bubbles from repeated ship passages have the potential to affect air-sea gas-exchange [30][31][32], sea floor integrity and turbidity [33], and the dispersion and distribution of pollutants and contaminants from ship discharges [34,35]. Hypothetically, water column stratification and nutrient supply could also be affected, if the ship-induced vertical mixing is deep and frequent enough. There is currently a lack of knowledge regarding the parameters determining the temporal and spatial development of the turbulent wake, which is needed to estimate the environmental impact from ship-induced vertical mixing. Hence, there is a need for suitable and efficient methods to measure the temporal and spatial scales of the turbulent wake, as well as the intensity of the vertical mixing. Echo sounder data from bottom-mounted ADCPs or multibeams provides an excellent data source for ship wake characterisation, as the bubbles in the turbulent wake create an elevated echo amplitude in the turbulent wake region in the echogram [26,[29][30][31]36,37]. However, manually identifying and annotating ship wakes in echograms requires expert knowledge and considerable time. The efficiency of the analytical process can be greatly improved by using machine learning to automatically identify and quantify ship wakes. This provides the core motivation for this study.
Our contributions: We propose adopting state-of-the-art deep neural networks, which are widely used for image classification, including applications in marine science [38], to detect wakes in the echo sounder data from a bottom-mounted ADCP. The principal bottleneck is that these deep neural networks, namely Convolutional Neural Networks (CNN), require a significant amount of labelled data to train and thus, to achieve the desired detection accuracy. To the best of the authors' knowledge, there currently exists only one publicly available annotated dataset of ship wakes in acoustic data, namely the one included in this study. Hence, in order to enable training of a deep neural network in the absence of a sufficiently large labelled dataset, data augmentation is necessary.
Data augmentation was performed by generating synthetic wake samples based on the existing ones. A deep learning model was then trained on the populated dataset. Several important steps, including the results of synthetic data generation, were supervised by a marine science expert, using the expert-in-the-loop approach [39]. In particular, for uses of machine learning in health-care, the efficacy of expert-in-the-loop or human-in-the-loop has been demonstrated for a range of tasks, such as interpreting and assessing the uncertainty of machine learning predictions, as an oracle in active learning settings, or, similar to here, to assess goodness of fit of auxiliary models and confirm labelling suggestions from the machine learning method [40,41].
We experimentally show that the proposed approach, combining deep neural networks with the expert-in-the-loop framework, achieves high accuracy in detecting ship wakes from echo sounder data with limited affirmative instances.
Though we use the proposed framework for a specific case study and data type, it provides a blueprint for addressing related problems: as long as the collected data is similar in nature (e.g., acoustic data), we expect that both the proposed deep learning model for identifying different signals in the data and the expert-in-the-loop framework for addressing label scarcity will prove beneficial.

Background: Bridging Machine Learning and Marine Science
Successes over the past decade in tasks such as object and face detection, natural language processing, and text mining have established deep learning as a suitable paradigm of machine learning for signal, image, and video analysis [42]. Therefore, we chose to use deep learning to analyse acoustic data for the automatic detection of ship wakes in echo sounder data. This setting provided a suitable scenario to evaluate the potential benefits of using deep learning to assist the analysis of marine acoustic data. The following section provides a short background on the use of machine learning techniques for acoustic signal processing, with a particular focus on applications in marine science.

Deep Learning for Signal and Image Analysis
Machine learning techniques are efficient in processing image and signal data and detecting patterns in them [43]. These techniques adapt numerical weights and other parameters, generally according to the features of a set of known training data. Such methods can often be smoothly adapted to a domain-specific task, provided there is a sufficient amount of data. When training data is scarce, machine learning models tend to lack generalisation and instead overfit to the known examples used for training. As scarce data can impede machine learning techniques, several approaches, including data generation and augmentation, have been proposed to improve performance. For example, data generation has previously been used in a study by Allken et al. [7], where synthetic samples were leveraged to expand the training dataset for a deep learning algorithm identifying fish species from video data.

Machine Learning for Acoustic Detection in Marine Science
In recent decades, machine learning has successfully been applied to analyse acoustic data in order to automatically classify the geological properties of the seafloor [44][45][46], constituting an exemplary application of machine learning in marine sciences. In contrast, machine learning methodologies are still at the developing stage for identifying biota and physical properties of the water column in acoustic data. However, the potential benefit of, and necessity for, incorporating machine learning in the analysis of the increasing amount of acoustic data have been pointed out [2,14,47]. Currently, there exist methodologies for automatic target tracking and identification of, e.g., fish schools, marine mammals, and birds, in multibeam and echo sounder data [8,[48][49][50][51]. Still, the analysis of echo sounder data often relies on manual scrutiny by experts at some step of the process [52], which is time consuming and introduces subjectivity to the analysis [49,51].
Twenty years ago, during the previous boom of neural network popularity, a few studies used Artificial Neural Networks (ANN) to automatically identify fish schools in echo sounder data [53][54][55][56]. Other machine learning methods, such as random forests, have also been applied [52]. However, all of these models were small and applied to pre-defined features extracted from already detected fish schools, and hence do not strictly classify as "deep learning". Nevertheless, there is one later example of the application of "deep learning" in acoustic classification. Brautaset et al. [57] recently applied deep learning techniques to a problem comparable to the identification of ship wakes: the acoustic classification of fish schools. Similarly, they used echo sounder data, although collected from surface vessels and at greater depths compared to the ship wake dataset. The authors applied a CNN model to their problem and achieved good results for isolated regions in the echogram.

Materials and Methods
This section presents the proposed framework, which combines an expert-in-the-loop approach with a deep neural network for classification, and specifies the methodology used to deploy this framework for wake detection from acoustic data.
The development of the proposed deep learning model for wake detection can be structured as follows:
1. In situ collection of acoustic data (Section 3.4);
2. Initial data labelling by the domain expert (Section 3.5);
3. Data augmentation and its approval by the expert (Section 3.6);
4. Implementing algorithms to make the model robust against data imbalance and noise, with additional input from the expert (Sections 3.7 and 3.8);
5. Evaluating the final model (Section 4).
A schematic view of the proposed pipeline is given in Figure 1. The following subsections explain the expert-in-the-loop framework, which is used in model development, and elaborate on each of these steps.

Expert-in-the-Loop
Expert-in-the-loop is a framework for Artificial Intelligence (AI) training [39,41], where an expert oversees the training and gives additional feedback based on the results after a training iteration. It can be beneficial in terms of both achieving better results and speeding up the training. However, in the traditional machine learning approach, the participation of a human expert is usually limited to preparing the training task (e.g., labelling). This is because large datasets are typically very expensive to create and require the work of many annotators.
In the case of this paper, the dataset was relatively small (1 month of observation and 165 identified wakes), and the data was very domain-specific. The expert who performed the initial labelling was also involved in the development, so it was possible to introduce intermediate expert evaluation in some form. The stages with the expert involvement are highlighted in Figure 1.

Machine Learning: Classification
Classification is a machine learning problem where the goal is to assign a new observation to one of two, or more, predefined categories. The prediction is based on the training dataset of observations for which the categories are known. Classification problems can be solved by a large variety of algorithms ranging from decision trees to neural networks, deep or shallow.
In the case of wake detection, the proposed model must solve a binary classification task, where the two classes are labelled 'wake' and 'background'. More formally, if X = {x_1, x_2, ..., x_n} is a set of objects and Y = {y_1, y_2, ..., y_m} is a set of classes, then a classifier f maps X into Y: f(x_i) = y_j. For wake recognition, X is the set of frames and Y = {'wake', 'background'}.
The formulation of the principal task in terms of classification allows the use of a range of machine learning classification algorithms. The nature of the available data, however, suggests using models based on Convolutional Neural Networks (CNNs) [58]. CNNs are known to be effective in image recognition and finding patterns. By treating the acoustic data in the form of time frames, the task can be mapped directly to pattern recognition, where the target pattern is the trace of bubbles in the turbulent wake behind the ship.

Convolutional Neural Network
Neural networks are a class of machine learning models that consist of multiple interconnected nodes (or neurons). Nodes perform simple processing of incoming information and pass it further down the network. Nodes in neural networks are commonly organised in layers, which can be connected in various ways. Neural Networks are trained using backpropagation, which calculates the gradient of the loss function with respect to the weights of the model for each input-output and updates the weights accordingly.
The most common type of neural network used in image analysis is the CNN. The main feature of a CNN is its convolutional layers. Each neuron in such a layer processes only the information from the receptive field corresponding to that neuron. This can be pictured as a sliding window moving over a 2-dimensional image, checking whether certain patterns of the size of the window are present. A typical CNN architecture involves stacking several convolutional layers with other types of layers, such as pooling (aggregating information from neighbouring neurons) and fully connected layers. One example of a well-known and successful deep neural network for image recognition is VGG (VGGNet by the Visual Geometry Group) [59]. CNNs have also been successfully used for the analysis of photographic images in marine science, e.g., monitoring waves [38].
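The sliding-window behaviour of a convolutional layer can be illustrated with a minimal sketch (plain NumPy, not the model used in this study): a small kernel is moved over a 2-dimensional input, and the response is large wherever the kernel's pattern occurs.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a kernel over a 2-D image and return the response map
    (valid padding, stride 1) -- the core operation of a convolutional layer."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge kernel responds strongly where intensity jumps left-to-right.
image = np.zeros((5, 6))
image[:, 3:] = 1.0                    # bright right half
kernel = np.array([[-1.0, 1.0]] * 3)  # 3x2 vertical-edge detector
response = conv2d_valid(image, kernel)
print(response.shape)  # (3, 5); the response peaks at the edge column
```

In a trained CNN the kernels are not hand-crafted as here but learned from data, and many kernels are applied in parallel to form the layer's output channels.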

Residual Neural Network
A more recent CNN architecture, which is often used as a benchmark, is the Residual Neural Network (ResNet) [60]. ResNet addresses the vanishing gradient problem, which affects the training of very large neural networks with gradient-based methods. In large networks, the gradient can become vanishingly small, and the weights can stop changing altogether during training. To address this, ResNet introduces skip-connections. In simpler architectures, layers are always connected sequentially, while in ResNet some layers can be skipped during the first stages of training. For instance, if there are three layers in the network, A, B, and C, connected in order A to B to C, then a direct connection from A allows skipping B during initial training (Figure 2). This architecture not only addresses the vanishing gradient problem, but also greatly improves training time. As variations of ResNet obtain close to state-of-the-art results on several image analysis tasks, we chose it for a different type of data that superficially resembles photographic image data, namely the data we collected from an ADCP.
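The effect of a skip-connection can be seen in a single fully connected residual block (a minimal NumPy sketch, not the ResNet18 used later): when the inner weights are near zero, the block passes its input straight through, which is what lets the signal and its gradient bypass inner layers early in training.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = relu(W2 @ relu(W1 @ x) + x): the '+ x' term is the skip-connection."""
    h = relu(W1 @ x)           # inner layer (layer B in Figure 2)
    return relu(W2 @ h + x)    # skip-connection adds the input back before C

rng = np.random.default_rng(0)
x = rng.normal(size=4)

# With zero inner weights the block is the identity (up to the final ReLU),
# so stacking many such blocks does not degrade the signal.
W_zero = np.zeros((4, 4))
assert np.allclose(residual_block(x, W_zero, W_zero), relu(x))

# With non-zero weights the block learns a residual correction on top of x.
W1 = 0.1 * rng.normal(size=(4, 4))
W2 = 0.1 * rng.normal(size=(4, 4))
y = residual_block(x, W1, W2)
```

In the real architecture the inner transform is a pair of convolutional layers rather than dense matrices, but the additive shortcut works identically.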

Figure 2. An example of a skipped layer in a Residual Neural Network. There is an additional connection (skip-connection) between layers A and C that allows skipping layer B when propagating.

Approaches for Scarce Data
One of the main challenges encountered in this research was the lack of labelled data and, more specifically, the lack of labelled positive wake examples. Supervised machine learning algorithms typically require a considerable number of labelled samples. This is especially important for a neural network model, as it includes many parameters to adjust. There are several possible approaches to address the lack of data.

Data Augmentation
Data augmentation is a common strategy when the training dataset is too small and more samples are needed [61]. To augment a dataset, original samples are changed in a minor way, while keeping the designated output value. Even simple augmentation is known to improve the performance of machine learning models. For example, image datasets can be augmented with rotated, cropped, or mirrored samples (note, we did not utilise these standard augmentation steps; see below and Section 3.6). Data augmentation is also a step that can be overseen by a domain expert. If generated or modified samples have to be approved as plausible by an expert, the chance of introducing errors during augmentation is reduced.
However, this type of augmentation might not be applicable to stationary profiler data, primarily because wakes have a fixed position at the top of the frame. There are, nevertheless, more complex approaches, which can also serve as a means of regularisation and of increasing robustness.

Probabilistic Models
New samples can also be generated using statistical methods. They preserve the patterns in the original samples while adding diversity to the dataset.
A Gaussian Mixture Model (GMM) [62] is a simple probabilistic model that fits the data to a convex combination of several uni- or multivariate Gaussian distributions. It can be an efficient sample-generating or clustering tool when the data structure is not very complex. The optimal number of components can be determined using relative model quality estimators, such as the Akaike Information Criterion (AIC) [63]. AIC penalises a large number of parameters and rewards goodness of fit via the likelihood function, thus finding a balanced number of distributions.
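This model-selection step can be sketched with scikit-learn (toy 2-D data rather than echogram frames): AIC is evaluated over a range of component counts and the minimum is chosen, after which the selected model can generate new samples.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Toy data drawn from two well-separated 2-D Gaussians.
data = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
                  rng.normal(6.0, 1.0, size=(200, 2))])

# Fit GMMs with 1..5 components; AIC rewards fit and penalises parameters.
aics = [GaussianMixture(n_components=k, random_state=0).fit(data).aic(data)
        for k in range(1, 6)]
best_k = int(np.argmin(aics)) + 1   # the two-cluster structure should win

# The selected model can then be used as a sample generator.
gmm = GaussianMixture(n_components=best_k, random_state=0).fit(data)
samples, _ = gmm.sample(10)
```

The same loop applies unchanged to compressed echogram representations; only the input array differs.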
GMMs are not very efficient when applied to high-dimensional data such as images, so in order to apply them to echo sounder data, some dimensionality reduction is needed. Compression should also be reversible, so the generated samples can be restored.
Principal Component Analysis (PCA) [64] is a method that transforms the data into a new coordinate system. The greatest sample variance in the data lies on the first component, the second-largest lies on the second component, and so on. The ratio of explained variance can regulate the number of dimensions after compression.
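The variance-ratio interface can be sketched with scikit-learn on toy data (not the real frames): passing a float in (0, 1) as `n_components` keeps just enough components to explain that fraction of the variance, and the transform is reversible via `inverse_transform`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 100-dimensional toy data whose variance lives in a 5-dimensional subspace.
latent = rng.normal(size=(300, 5))
mixing = rng.normal(size=(5, 100))
data = latent @ mixing + 0.01 * rng.normal(size=(300, 100))

# Keep as many components as needed to explain 99.9% of the variance.
pca = PCA(n_components=0.999)
compressed = pca.fit_transform(data)

# Reversibility matters here: generated low-dimensional samples must be
# mapped back into echogram frames.
restored = pca.inverse_transform(compressed)
```

On this toy data the fit recovers roughly the 5 informative dimensions out of 100, and the reconstruction differs from the input only by the discarded noise.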
While PCA is efficient, it does not specifically try to preserve the existing patterns in the data. This can be addressed by applying a more meaningful transformation first. CNNs perform well in preserving image patterns, and they can also be used for efficient compression. An autoencoder is a special type of unsupervised neural network that consists of two components: an encoder, which compresses the input, and a decoder, which reverses the process (Figure 3). The training process of an autoencoder consists of attempting to reconstruct a set of samples. A trained CNN-based autoencoder can create a compressed representation of image-like data while preserving patterns such as the shape of the bubble trace of a ship's wake.

In-Situ Data Collection
The data collection was conducted in the large ship lane outside Gothenburg harbour, which is the largest harbour in Scandinavia [65]. The 4 week measurement period included 165 clearly visible ship wakes and varying weather conditions.

ADCP Measurements
The ship wake dataset was collected using a bottom-mounted Nortek Signature 500 kHz broadband Acoustic Doppler Current Profiler (ADCP). The instrument was deployed under the ship lane for 4 weeks (28 August to 25 September 2018), at approximately 30 m depth (57.61178 N, 11.66102 E). The ADCP had four slanted beams (25° angle) and one vertical beam, all with a cell size of 1 m and a ping frequency of 1 Hz. The measured echo amplitude was used to identify the ship wake region, as the bubble cloud in the ship wake reflects sound more efficiently than water and is clearly visible as an elevation in the signal strength [26,[29][30][31]36,37].

AIS Data
A dataset of the ships passing the study area during the measurement period was used to identify periods without ship passages, to use as negative controls when training the algorithm. The dataset was purchased from the Swedish Maritime Administration, and originates from the Baltic Marine Environment Protection Commission (HELCOM) Automatic Information System (AIS) database. The data was processed according to the procedure described in the annex of the HELCOM Assessment on maritime activities in the Baltic Sea 2018 [66]. Additional files from the same HELCOM database were provided by the Swedish Institute for the Marine Environment (SIME).

Data Labelling and Preparation
The raw dataset from each ADCP beam was used in the analysis, resulting in 5 time series, with observations for 79,023 timestamps in total. For each timestamp, 28 data points were given, corresponding to depth levels from 3.5 m to 30.5 m at 1 m intervals. Due to side lobe interference, the 2.5 m closest to the surface did not have reliable data, hence the measurements start at 3.5 m depth. One of the slanted beams (beam 2) was malfunctioning, resulting in corrupted data, and was thus excluded from the analysis. The signal data was normalised, so that all values fell between 0 and 1. The nighttime data contained very high levels of noise, and several nighttime wakes were marked as ambiguous by the expert. Thus, for training and testing purposes, only daytime negative samples were used.

Data Labelling
High resolution figures of the echo amplitude of the vertical beam were used to manually identify the ship wakes in the ADCP dataset. Wake signatures in the ADCP dataset were then compared with the AIS data of passing ships, to confirm that the wake signatures could be connected to vessel passages. Next, the echo amplitude in each confirmed wake region was compared to the daily/nightly mean, and all measurements in the wake frame that were ∼15% higher than the mean were annotated as part of the wake.

Data Representation and Visualisation
Acoustic data can be visualised in the form of an echogram, as shown in Figure 4. The echogram displays the intensity of the reflected signal; wakes and other objects are visually recognisable in this representation. A total of 165 wakes were identified by an expert and marked on echograms. AIS ship tracking data was checked during initial labelling, to confirm that all the detected wakes could be related to a passing ship. This allowed the assumption that all data outside the marked time frames could be used as negative samples.

To perform binary classification, the echogram was split into fixed-size frames. The length of the time frame was 60 data points, or 30 min, and the size of one such sample was 4 × 28 × 60 data points. The first dimension corresponds to the number of functional beams. One example of such a frame is displayed in Figure 4. The wakes in the dataset ranged from large to barely noticeable, as can be seen in Figure 5.
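The framing step can be sketched as follows (NumPy, with random values standing in for the real echogram; the shapes match those described above):

```python
import numpy as np

# Hypothetical echogram array: 4 functional beams x 28 depth cells x N pings.
n_pings = 79023
echogram = np.random.default_rng(0).random((4, 28, n_pings)).astype(np.float32)

def split_into_frames(data, frame_len=60):
    """Cut a beam x depth x time echogram into fixed-size classifier inputs
    of shape (4, 28, frame_len); the trailing remainder is dropped."""
    n_frames = data.shape[-1] // frame_len
    trimmed = data[..., :n_frames * frame_len]
    # (4, 28, n_frames, frame_len) -> (n_frames, 4, 28, frame_len)
    return trimmed.reshape(*data.shape[:-1], n_frames, frame_len).transpose(2, 0, 1, 3)

frames = split_into_frames(echogram)
print(frames.shape)  # (1317, 4, 28, 60)
```

Each resulting frame is one candidate input for the binary classifier; its label depends on whether the 30-minute window overlaps an annotated wake.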
All of the examples in this paper use the data from the central vertical beam (beam 5) because it consistently showed stronger signals. The difference was confirmed by comparing Wasserstein distances between the data from different beams and an "empty" frame with a flat distribution of signals. Wasserstein distance computes the cost of transforming one distribution into another, where the cost is the size of the part that has to be changed. The Wasserstein distance from the flat distribution was consistently larger for the central beam than for the other three.
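The beam comparison can be sketched with SciPy's one-dimensional Wasserstein distance (toy signal distributions standing in for the real beam data): a beam whose amplitude distribution deviates more from a flat, "empty" distribution has the larger distance.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
flat = rng.uniform(0.0, 1.0, size=2000)                   # "empty" frame stand-in
weak_beam = np.clip(rng.normal(0.5, 0.3, size=2000), 0, 1)   # diffuse signal
strong_beam = np.clip(rng.normal(0.8, 0.1, size=2000), 0, 1) # concentrated signal

d_weak = wasserstein_distance(flat, weak_beam)
d_strong = wasserstein_distance(flat, strong_beam)
assert d_strong > d_weak  # the stronger signal deviates more from flat
```

The same comparison over the four functional beams is what singled out the central vertical beam in this study.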

Set-Aside Dataset
Before proceeding with any computational experiments, a subset covering four days (22 September to 25 September 2018) was set aside for separate evaluation. The period included 23 wakes of varying sizes and notably worse weather conditions, according to the expert. This subset was selected as a more difficult and completely independent additional evaluation task for the model. The remaining dataset containing 142 labelled wakes was used in the development of the predictive model.

Data Augmentation
To successfully perform deep learning training, a sufficient number of positively and negatively labelled data samples is needed. The ship wake dataset was heavily unbalanced and noisy, so data augmentation was necessary to generate more wake samples and allow the use of neural networks.
Data augmentation could only be based on the 142 labelled wake frames. This number is clearly not large enough to apply deep learning models, such as generative adversarial networks (GANs) [67], so the more robust approach of fitting a GMM was chosen. Note, we decided against the often used simple geometric transformations, such as stretching, flipping, rotating, or cropping, for augmentation, as the positioning of wakes is fixed and can be meaningful.

Data Compression
Since one frame containing a wake has relatively high dimensionality, a reduction had to be applied before fitting a GMM. To make the dimensionality reduction more meaningful and preserve the patterns of the wakes, a CNN-based approach was taken.
The first step of data compression was, therefore, training an autoencoder on all 142 positively labelled frames with wakes. A simple autoencoder model with three convolutional layers in both encoder and decoder was trained on these examples for 70 epochs. The autoencoder was then able to compress and restore frames while keeping the major patterns and cancelling most of the noise. In Figure 6 the frame from Figure 4 is shown after passing through the autoencoder. Most of the noise was removed, but the wake is preserved and clearly visible. After the training, the encoder was used to compress the same 142 samples.
Figure 6. The frame displayed in Figure 4 after being passed through the trained autoencoder.
After the frames were passed through the network, the dimensionality remained relatively high, and as the second compression step, PCA was applied. The model was set to keep 99.9% of the variance, which reduced each sample to only 48 dimensions. A GMM was then fitted to the compressed representations, and synthetic samples drawn from it were decompressed back into echogram frames. Some of the generated wakes are shown in Figure 7. The synthetic data has noticeably less noise and is much smoother than the original samples (Figure 5), but all the defining features have been successfully preserved. Following the pipeline (Figure 1), the generated set was approved by the expert as containing plausible shapes that could be identified as wakes in the original dataset. The expert agreed that, while being smoother than the wakes in the original observations, most of the generated data qualifies as a set of positive examples. As an additional step, to confirm the quality of the augmentation, pairwise Wasserstein distances were computed between the original wakes and a set of generated wakes. The average value of 0.0526 was smaller than the average pairwise distance between the original wake frames and a random subset of frames (0.0544).
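The compress-fit-sample-decompress pipeline can be sketched end-to-end with scikit-learn. This is a hedged toy version: the 28 × 60 "wake frames" are synthetic stand-ins, and the autoencoder stage is omitted for brevity, with PCA applied directly to the flattened frames.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
depth, time = np.mgrid[0:28, 0:60]

# 142 toy "wake frames": a near-surface blob of varying strength and position.
frames = np.stack([
    a * np.exp(-((depth - 5) ** 2) / 20.0 - ((time - c) ** 2) / 400.0)
    for a, c in zip(0.5 + rng.random(142), 20 + 20 * rng.random(142))
]) + 0.005 * rng.normal(size=(142, 28, 60))
flat = frames.reshape(142, -1)

# Step 1: reversible compression (99.9% of variance, as in the real pipeline).
pca = PCA(n_components=0.999)
codes = pca.fit_transform(flat)

# Step 2: fit a GMM to the compressed codes and sample new codes from it.
gmm = GaussianMixture(n_components=2, random_state=0).fit(codes)
new_codes, _ = gmm.sample(50)

# Step 3: decompress the sampled codes into synthetic wake frames.
synthetic = pca.inverse_transform(new_codes).reshape(50, 28, 60)
print(synthetic.shape)  # (50, 28, 60)
```

In the study itself, the encoder/decoder of the trained autoencoder wraps this PCA step on both ends, and the number of GMM components is chosen via AIC rather than fixed.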

Deep Learning Models
Two deep learning models were implemented and tested. The base model used the ResNet architecture. Since the structure of the data was not very complex, and the resolution was small, a relatively small ResNet18 model was implemented. ResNet18 consists of 18 layers in total: 1 convolutional layer, followed by 8 residual blocks with 2 convolutional layers each, and a fully connected output layer with sigmoid activation. The model treats the data from the four beams as four channels of the same image. A single input, therefore, has the shape of 4 × 28 × 60. The output consists of two values, which can be interpreted as probabilities of the input belonging to the wake and background classes.

For the second model, exactly the same ResNet18 architecture was re-implemented using the example reweighting technique. This is a strategy for addressing unbalanced data by identifying more or less valuable samples within the training dataset and assigning corresponding weights to them. Typically these weights are initialised offline, once during the training process, but there is an alternative approach proposed by Ren et al. [68]. It involves using a small subset of perfectly labelled clean data as a hyper-validation set. At each step of training, the gradient of the hyper-validation loss with respect to the weights of the samples in the current mini-batch is computed, and the weights are updated accordingly. The reweighting model can be implemented for most deep learning architectures and has shown good performance both for unbalanced datasets and for datasets with noisy labels. Perhaps the most important feature that makes this method applicable to the wake detection problem is that the size of the hyper-validation set can be as small as 10 samples while keeping a high level of performance. It also aligns well with the expert-in-the-loop framework (Figure 1): since only a small number of clean examples are needed, they can be hand-picked or approved by the expert.
For the experiment, the hyper-validation dataset was set to a size of 10: 5 positive and 5 negative examples. The negative samples were hand-picked from the dataset by the expert as clean with confidence. The purpose of this model was to test its performance compared to the baseline when the training data is unbalanced.
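The reweighting idea can be sketched in a first-order form (a NumPy toy with a logistic-regression "network" standing in for the actual ResNet): mini-batch examples whose loss gradients align with the gradient on the small clean hyper-validation set receive larger weights, so mislabelled examples are suppressed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reweighted_step(theta, Xb, yb, Xv, yv, lr=0.1):
    """One example-reweighting step for logistic regression: weight each
    mini-batch example by how well its gradient aligns with the gradient
    on the clean hyper-validation set (a first-order view of Ren et al.)."""
    grads = (sigmoid(Xb @ theta) - yb)[:, None] * Xb         # per-example grads
    gv = ((sigmoid(Xv @ theta) - yv)[:, None] * Xv).mean(0)  # validation grad
    w = np.maximum(0.0, grads @ gv)    # keep only examples that help validation
    w = w / (w.sum() + 1e-12)          # normalise the weights
    return theta - lr * (w[:, None] * grads).sum(0), w

rng = np.random.default_rng(0)
theta = np.zeros(2)
# Mini-batch of one class: the last 3 labels are deliberately corrupted.
Xb = rng.normal(size=(20, 2)) + np.array([2.0, 0.0])
Xb[-3:] = np.array([[2.0, 0.0], [2.5, 0.5], [3.0, -0.5]])
yb = np.ones(20)
yb[-3:] = 0.0
# Small, clean, expert-approved hyper-validation set.
Xv = rng.normal(size=(10, 2)) + np.array([2.0, 0.0])
yv = np.ones(10)

theta, w = reweighted_step(theta, Xb, yb, Xv, yv)
print(w[-3:])  # the mislabelled examples receive zero weight
```

In the full method the weights come from differentiating the hyper-validation loss through one virtual training step of the deep network, but the gradient-alignment intuition shown here is the same.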

Evaluation Metrics
The evaluation of the models was based on the standard metrics for binary classification. Accuracy (the ratio of correct guesses) indicates the performance on balanced datasets, and the False Negative Rate (FNR) the ratio of wakes missed by the algorithm. Since the purpose of the model was to assist in the detection of wakes in the datasets, the FNR was the most relevant metric. The Area Under Receiver Operating Characteristic Curve (AUC ROC) is a metric that measures how well the model distinguishes between classes, as the threshold to make a decision varies.
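These three metrics can be computed as follows (a small sketch with made-up predictions, using scikit-learn for accuracy and AUC ROC):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical model outputs on a balanced test set (1 = wake, 0 = background).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.3, 0.7, 0.2, 0.4, 0.1, 0.6])
y_pred = (scores >= 0.5).astype(int)   # hard decisions at a 0.5 threshold

accuracy = accuracy_score(y_true, y_pred)            # ratio of correct guesses
fn = np.sum((y_true == 1) & (y_pred == 0))
fnr = fn / np.sum(y_true == 1)                       # fraction of wakes missed
auc = roc_auc_score(y_true, scores)                  # threshold-free ranking
print(accuracy, fnr, auc)  # 0.75 0.25 0.875
```

Note that accuracy and FNR depend on the decision threshold, while AUC ROC summarises performance over all thresholds, which is why all three are reported.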
To achieve stable results, each experiment was performed 10 times, and for each metric the average and the standard deviation were calculated. For all experiments, the test dataset included 150 samples: 29 randomly selected original wake frames, 46 randomly selected generated wake frames, and 75 randomly sub-sampled frames without wakes. Test-time augmentation [69] has recently received attention for its capability to obtain more reliable or informative evaluation metrics [70], and even better results when classification is obtained as an ensemble average over augmented data [71]. We chose to include augmented data to check for generalisation, a real concern given the very low number of positive samples.
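Test-time augmentation of this kind can be sketched as an ensemble average over perturbed copies of a frame. The specific augmentations below, circular time shifts plus additive noise, are illustrative assumptions rather than the study's exact set, and all names are ours:

```python
import numpy as np

def tta_predict(model, frame, n_aug=8, sigma=0.05, seed=0):
    """Average wake probability over augmented copies of one frame.

    `model` maps a (4, 28, 60) array to a wake probability in [0, 1].
    """
    rng = np.random.default_rng(seed)
    probs = [model(frame)]  # prediction on the unmodified frame
    for _ in range(n_aug):
        shift = int(rng.integers(-3, 4))
        aug = np.roll(frame, shift, axis=-1)           # shift along time axis
        aug = aug + rng.normal(0.0, sigma, aug.shape)  # additive noise
        probs.append(model(aug))
    return float(np.mean(probs))
```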
An additional experiment was performed using the imbalanced set-aside dataset covering four days with 23 known wakes. The dataset was passed to the model sequentially in the form of overlapping frames. The purpose of this test was to show the performance on a completely independent set of noisier observations.

Results
The performance of the two main models was evaluated: the baseline ResNet18 model and the example reweighting model. In all experiments, the main metrics were the accuracy score and FNR, since the test dataset was always balanced. Additionally, for the baseline experiment, the AUC ROC score was computed. Each experiment was performed 10 times to achieve stability in the results, and average values were used. As an additional evaluation, the performance of the main model is shown for the set-aside subset of four days.

Baseline ResNet Model
To perform the baseline experiment, the ResNet18 model was trained on a balanced dataset which included 500 negative samples, 113 positive samples from the original data, and 387 generated positive samples. The test dataset was also balanced and included 75 negative samples, 29 positive samples from the original data, and 46 generated positive samples. All samples were selected randomly from their respective sets. The experiment showed a mean accuracy of 93.40 ± 1.80%, an AUC ROC of 0.97 ± 0.01, and a false negative rate of 9.87 ± 3.28% over 10 repetitions. The results can be interpreted as follows: 93.4% of predictions were correct on average, but most of the incorrect predictions were false negatives. The high AUC ROC score means the model performed well at ranking samples by the likelihood of containing wakes.

Example Reweighting Model
The second evaluated model was the example reweighting model. Its main purpose was to allow training on unbalanced data while keeping test performance from quickly degrading. The reweighting model had exactly the same architecture as the baseline model but performed an additional step at each training iteration to adjust the example weights.
The experiment included training the models on five training datasets with different class balances: 50%, 20%, 10%, 5%, and 2.5% positive samples. The total size of the training dataset was the same as in the previous experiment, namely 1000 samples. The test dataset was kept balanced, with 75 positive and 75 negative samples. The results are shown in Figure 8. As the share of positive samples in the training data decreased, the prediction quality of both models predictably dropped, but the example reweighting model showed noticeably better robustness to class imbalance.
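Constructing a training set with a prescribed positive fraction, as in this experiment, can be sketched as follows; the pool-based sampling and all names are our illustration:

```python
import numpy as np

def make_imbalanced(pos_pool, neg_pool, total=1000, pos_frac=0.05, seed=0):
    """Subsample `total` examples with a given positive fraction.

    `pos_pool` and `neg_pool` hold one example per row; returns shuffled
    features X and binary labels y.
    """
    rng = np.random.default_rng(seed)
    n_pos = int(round(total * pos_frac))
    pos = pos_pool[rng.choice(len(pos_pool), n_pos, replace=False)]
    neg = neg_pool[rng.choice(len(neg_pool), total - n_pos, replace=False)]
    X = np.concatenate([pos, neg])
    y = np.concatenate([np.ones(n_pos), np.zeros(total - n_pos)])
    idx = rng.permutation(total)  # shuffle so classes are interleaved
    return X[idx], y[idx]
```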

Set-Aside Dataset Experiment
As a secondary evaluation of the baseline model, an experiment was performed on the set-aside dataset. The whole dataset was passed through the ResNet18 model in the form of overlapping time frames of size 4 × 28 × 60, starting at each timestamp. Unlike the previous experiments, the whole set-aside dataset was used, including extremely noisy nighttime data and data from bad weather conditions during the measurement period. The 23rd of September was specifically marked by the expert as the day with the worst weather over the month. There were 23 known wakes in this subset. For each wake, only the starting timestamp was known. It was assumed that a wake was visible in a 30-min frame if the frame started at most 25 min before, or at most 5 min after, the wake start. The total number of frames in this experiment was 9780, out of which only 1338 were expected to be classified as wakes due to overlaps.
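The frame-labelling rule above can be made explicit with a small helper (a sketch with illustrative names; times are in minutes):

```python
def label_frames(frame_starts, wake_starts, before=25, after=5):
    """Mark a 30-min frame positive if it starts at most `before` minutes
    before, or at most `after` minutes after, any known wake start."""
    labels = []
    for f in frame_starts:
        visible = any(w - before <= f <= w + after for w in wake_starts)
        labels.append(int(visible))
    return labels
```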
For this setup, the mean accuracy over 10 experiments was as low as 60.6%, and the mean false negative rate was 38.06%, comparable to the test performance under a significant level of noise. Overall, the results show a significant drop in performance compared to the experiments on the cleaner and smaller test sets.

Discussion
The looming ocean crisis, for example the threat of near extinction of commercially important and ecologically fundamental species, has been caused in part by unsustainable use enabled by regulatory omissions. One contributing factor is the wide gap between the resolution and extent of the information provided by current approaches to monitoring the ocean, including the effects of commercial activities, and what would be needed to derive actionable, policy-supporting insights. In response, data has been designated a core theme of the UN Ocean Decade and a priority for other marine science platforms, such as EU Copernicus and EMODnet, acknowledging the opportunity for transformation made possible by machine learning. This growing demand for data-driven technologies to enhance marine science requires the development of machine learning techniques that can simultaneously leverage the available data, with its limitations, and the experts' knowledge within the scientific community. For acoustic data, analysis and processing have traditionally been performed manually [2]. Consequently, there is huge potential for increased analytical capacity within marine science if machine learning can be applied to acoustic data. Acoustic data is currently used in a wide array of applications within the field, benefiting from the increasing use of high-quality echo sounders [2]. The need to further increase the use of acoustic data within marine monitoring was pointed out in the recent European Marine Board (EMB) Future Science Brief No. 6: Big Data in Marine Science [1]. However, the increasing volumes of acoustic data have not been mirrored by a similar increase in analytical capacity, making data analysis a bottleneck [1,2]. Furthermore, there is still a scarcity of annotated data with which to train machine learning algorithms.
Here, we demonstrated the advantage of machine learning in marine science by detecting ship wakes from acoustic marine data in the form of echograms, using an expert-in-the-loop framework. Since the annotated data was limited in size and quality, we used machine learning algorithms to augment the data by generating new samples similar to those available. The expert was then included to validate the quality of the generated data and pass it forward in the framework to build a deep learning model for ship wake detection. Hence, the proposed collaboration between machine learning algorithms and a domain expert can circumvent the limited quality and size of acoustic marine data. This framework also allows the development of accurate and scalable deep learning models that can be used for detecting patterns of interest in marine sciences. When applied to the ship wake detection problem, it achieved an average accuracy of ∼93.40% on a very limited and unbalanced dataset. The framework also serves as a demonstration of the general applicability of the deep learning approach in marine sciences.
The experimental results demonstrate the potential of machine learning expert-in-the-loop approaches in analysing acoustic marine data, but also show that there is room for improvement. Since this is the first study of its kind, the performance cannot be compared to the results of previous studies. Increasing the algorithm's robustness to noisy and unbalanced data would be a direct path to improvement without requiring significantly more data. This could be achieved by modifying either the predictive algorithm itself or the data augmentation process. Furthermore, the proposed machine learning framework can be adapted to identify further objects of interest, which are potential sources of misclassification in the current dataset: schools of fish, vegetation, marine mammals, water mixing zones, currents, or stratification. With suitable data available, knowledge transfer [72] provides a machine learning approach to proceed.
The method presented here can be considered a stepping-stone solution for emerging fields plagued by a lack of labelled training data, which can be used and developed as the field is established. This will contribute to the effort to further integrate machine learning and marine science, as urged by Guidi et al. [1] and Malde et al. [2].
This demand for more, and more insightful, data can, as in other disciplines where monitoring biodiversity is paramount [73], be most effectively satisfied by the use of artificial intelligence and machine learning methods. Approaches differ depending on the exact phenomenon that needs to be recognised or quantified. To put things into perspective, it is important to recall that commercial machine learning methods, e.g., for recognising objects in images, are often trained with billions of labelled examples. However, as this study demonstrates, the initial labelled data a human expert provides, together with expert supervision of the often incremental model-improvement process, can rapidly lead to useful tools. This demonstration establishes that this particular type of application can be addressed with machine learning, and it should be further explored and improved in future studies.

Conclusions
This study presents a novel deep learning-based expert-in-the-loop framework for automatic detection of turbulent ship wake signatures in echo sounder data. The proposed framework enables collaboration between experts and data-driven algorithms to create novel machine learning methods, which can be adapted to work on multiple types of signal data, be it acoustic or visual. The suggested machine learning algorithms can fill the gap between the data available today and its use to rapidly address marine science problems, such as the effect commercial shipping has on marine ecosystems. In addition, it is a step towards further integrating machine learning and marine science, to propel sustainable management of our oceans.

Data Availability Statement:
The datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request.

Acknowledgments:
I.-M.H. and A.S. acknowledge a seed grant for "Deep Learning for Deep Waters" from the Transport Area of Advance at Chalmers University of Technology. The Swedish Institute for the Marine Environment (SIME) is acknowledged for supplying the AIS dataset. D.B. acknowledges the Wallenberg AI, Autonomous Systems and Software Program (WASP), funded by the Knut and Alice Wallenberg Foundation, for funding part of his tenure during this work.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: