Article

A Comparison of Self-Supervised and Supervised Deep Learning Approaches in Floating Marine Litter and Other Types of Sea-Surface Anomalies Detection

by Olga Bilousova 1,2,*, Mikhail Krinitskiy 1,2, Maria Pogojeva 3,2,4, Viktoriia Spirina 4 and Polina Krivoshlyk 2,5
1 Moscow Center for Earth Sciences, Moscow 115184, Russia
2 Shirshov Institute of Oceanology, Russian Academy of Sciences, Moscow 117997, Russia
3 Faculty of Geography, Lomonosov Moscow State University, Moscow 119991, Russia
4 N. N. Zubov’s State Oceanographic Institute, Roshydromet, Moscow 119034, Russia
5 Immanuel Kant Baltic Federal University, Kaliningrad 236041, Russia
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(2), 241; https://doi.org/10.3390/rs18020241
Submission received: 7 October 2025 / Revised: 5 November 2025 / Accepted: 19 November 2025 / Published: 12 January 2026
(This article belongs to the Section Ocean Remote Sensing)

Highlights

What are the main findings?
  • A large Arctic sea-surface imagery dataset was created and annotated for floating litter and other marine anomalies.
  • Self-supervised contrastive learning based on MoCo + ResNet50 achieved accuracy comparable to that of supervised detection based on YOLOv11 while requiring far fewer labels.
What are the implications of the main findings?
  • Data balancing markedly improved recognition of rare objects and overall model stability.
  • The presented framework opens the avenue for scalable, low-cost monitoring of floating marine litter using shipborne and autonomous camera systems.

Abstract

Monitoring marine litter in the Arctic is crucial for environmental assessment, yet automated methods are needed to process large volumes of visual data. This study develops and compares two distinct machine learning approaches to automatically detect floating marine litter, birds, and other anomalies from ship-based optical imagery captured in the Barents and Kara seas. We evaluated a supervised Visual Object Detection (VOD) model (YOLOv11) against a self-supervised classification approach that combines a Momentum Contrast (MoCo) framework with a ResNet50 backbone and a CatBoost classifier. Both methods were trained and tested on a dataset of approximately 10,000 manually annotated sea surface images. Our findings reveal a significant performance trade-off between the two techniques. The YOLOv11 model excelled in detecting clearly visible objects like birds with an F1-score of 73%, compared to 67% for the classification method. However, for the primary and more challenging task of identifying marine litter, which has a less distinct visual signature in optical imagery, the self-supervised approach was substantially more effective, achieving a 40% F1-score, versus the 10% obtained for the VOD model. This study demonstrates that, while standard object detectors are effective for distinct objects, self-supervised learning strategies can offer a more robust solution for detecting less-defined targets like marine litter in complex sea-surface imagery.

1. Introduction

Marine litter, or marine debris, is recognized as one of the most acute environmental challenges of the twenty-first century. Plastic debris, glass, metal, and other persistent materials accumulate in coastal zones, drift through marine ecosystems, and form large garbage patches in the open ocean [1,2,3]. Expeditions and monitoring programs in different regions consistently confirm the ubiquity of marine litter [4,5]. Floating litter poses multiple threats: it entangles and is ingested by organisms, alters habitats, and serves as a vector for invasive species [6,7,8,9]. Plastics are of particular concern, comprising up to 96% of observed floating marine litter [10], and their persistence amplifies ecological damage.
Reliable monitoring methods are essential for assessing and mitigating this problem. Traditional approaches, including shoreline surveys, trawling, and visual shipboard observations, are limited in scope and demand substantial labor [11,12]. Remote sensing technologies—satellite imagery, UAV surveys, shipborne cameras, and autonomous platforms—have opened new possibilities [13,14,15], but transforming these heterogeneous observations into consistent and scalable monitoring products remains difficult. The sea surface is a complex environment: objects are small relative to the image frame, backgrounds vary with light and weather, and many non-target features (foam, reflections, seaweed) resemble litter. These challenges underscore the need for robust, data-driven methods.
In recent years, machine learning (ML) and deep learning (DL) have been increasingly applied to the problem of anomaly detection on the sea surface. Convolutional neural networks (CNNs) have been trained to classify or detect marine litter in aerial and shipborne imagery, often outperforming manual observation [4]. Object detection frameworks, particularly the YOLO family, are now common tools for identifying floating marine litter in coastal and offshore settings [5]. For example, compact YOLOv5 models deployed on ship-mounted cameras have achieved detection accuracies of above 95% [6]. Similar approaches applied to drone imagery have been successful in identifying litter along beaches and nearshore waters [7]. Beyond plastics, DL models have been adapted to other anomalies: segmentation networks such as DeepLabv3+ have been applied to oil spill detection from SAR imagery [8], while Random Forest classifiers and CNNs have been used with multispectral data to identify harmful algal blooms [9,10].
The state of the art continues to evolve toward higher accuracy and operational efficiency. One-stage detectors like YOLOv3, YOLOv5, and their successors remain the backbone of many applications [5]. To further improve robustness, researchers have integrated attention mechanisms and Transformer modules. For instance, combining YOLOv7 with a Convolutional Block Attention Module increased marine litter detection F1-scores by more than 10% compared to the baseline [11]. Transformer-based frameworks have also been tested on drone imagery, where their ability to capture global context enhances the detection of marine litter under variable illumination and sea states [12]. Equally important is the drive to reduce model size for deployment in real-time systems. Pruned YOLOv4 models retain much of their accuracy with only a fraction of the parameters [13], and compact architectures such as UTNet have surpassed several YOLO versions in mean average precision while operating efficiently on continuous video streams [14]. These developments make it feasible to implement detection systems on drones, buoys, or research vessels with limited computational capacity.
Despite this progress, supervised approaches are constrained by the scarcity of labeled data. Floating litter objects are rare compared to empty ocean frames, creating severe class imbalance [15]. Annotating large datasets is costly and time-consuming [16], and many existing resources cover only specific regions or conditions. This has motivated the exploration of unsupervised and self-supervised learning. Reconstruction-based methods, including autoencoders, learn to represent normal ocean conditions and identify anomalies through elevated reconstruction errors. One study applied a Vector-Quantized Variational Autoencoder to aerial imagery, successfully flagging marine animals as anomalies against empty water backgrounds [17]. Self-supervised representation learning, including contrastive frameworks such as MoCo, SimCLR, and BYOL, offers another avenue: by training on vast unlabeled datasets, models acquire general features that can later be fine-tuned for anomaly detection [18]. Sparse autoencoders trained on magnetic background data have been used to identify anomalies without any labeled targets [19]. These approaches reduce reliance on manual annotation and can reveal unexpected anomaly types.
In our study, we address these challenges by presenting and analyzing a large dataset of shipborne video recordings collected during an Arctic expedition in 2023. The dataset contains over half a million frames, of which approximately 10,000 were systematically annotated for anomalies including marine litter, birds, glare, and lens droplets. Building on this resource, two complementary machine learning strategies are compared. The first relies on self-supervised contrastive learning (MoCo with a ResNet50 backbone), followed by classification with Random Forest, Balanced Random Forest, and CatBoost. This design leverages abundant unlabeled data while limiting annotation requirements. The second applies a supervised object detection model, YOLOv11, directly to the annotated imagery, reflecting current state-of-the-art practice. By systematically evaluating these approaches under different data balance scenarios, the study highlights how strategy, data configuration, and labeling effort influence detection performance. Particular attention is given to the rare but ecologically significant category of floating marine litter.

2. Materials and Methods

2.1. Data Collection Methodology

Video recordings captured during a scientific expedition in the Arctic Ocean conducted in autumn 2023 were used as the data source. The vessel “Dalnie Zelentsy” departed from the port of Murmansk and proceeded through the Barents and Kara Seas toward the shores of the Novaya Zemlya archipelago. Figure 1 shows the expedition route.
Table 1 contains information about weather conditions and sea states for each expedition day when the sea surface video recording was conducted. It can be observed that the sea state was rough on most days. It should also be noted that ice floes were rarely encountered.
Video recording of the sea surface was conducted during vessel movement in daylight hours throughout the expedition. The camera was mounted on the port side at a height of approximately 5.5 m. The camera’s field of view covered approximately 6 to 15 m in width and about 25 m in length from the vessel. The recording setup is shown in Figure 2.
The camera mounting location on the vessel’s side is shown in Figure 3.
Video recording was conducted in several sessions per day with breaks for battery replacement. As a result, the research team aboard the vessel collected video material with a total duration of 136 h, divided into 69 separate video files, with between 1 and 7 files recorded on each of the 17 days of the expedition. Recording was performed using an AKASO V50X camera with video resolution of 3840 by 2160 pixels.
Table 2 presents the start time of each video recording, the vessel coordinates at the start of each recording, and the recording duration. On some days, less video material was recorded than on others, primarily due to poor weather conditions.

2.2. Original Data and Exploratory Data Analysis

2.2.1. Video Frame Extraction

Since data in video format (collection details are given in Section 2.1) is harder to use for analysis, all videos were split into individual frames; the ffmpeg command-line utility was used for this task. The result is a sequence of photographs extracted at a frequency of 1 frame per second. In general, extraction should be performed at a rate of no more than 30 fps, the native frame rate of the video; at higher values, frames would simply be duplicated. Accordingly, the maximum possible data volume is obtained at fps = 30.
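For illustration, below is a minimal sketch of this extraction step, assuming an input file name, output directory, and naming scheme (the exact paths used in the study are not specified):

```python
import subprocess
from pathlib import Path

def extract_frames(video: Path, out_dir: Path, fps: int = 1) -> None:
    """Extract frames from a video with ffmpeg at the given rate."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", str(video),
         "-vf", f"fps={fps}",                       # 1 frame per second of video
         str(out_dir / f"{video.stem}_%06d.jpg")],  # hypothetical naming scheme
        check=True,
    )

# extract_frames(Path("recording_day01.mp4"), Path("frames/day01"))
```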
After processing the 69 video recordings captured over 17 expedition days (the camera recorded different time intervals on different days) and dividing these videos into individual per-second frames, a total of more than 500,000 photographs of the sea surface were obtained.

2.2.2. Dark (Nighttime) Image Removal Using MoCo and DBSCAN

At this primary processing stage, a ResNet50 model trained using the MoCo method (described in Section 2.5 in the context of the classification task) was applied for the first time. The objective of this training run was to verify the model’s capability to learn and to find anomalies in the data. This can be checked by manually inspecting the images from the dataset that the model marks as “anomalous” and by monitoring the evolution of the loss function.
Following dimensionality reduction with UMAP, we applied the DBSCAN algorithm to the resulting two-dimensional feature vectors. This step was used to automatically identify and group dense clusters of image fragments that shared similar learned representations, providing an unsupervised method for discovering inherent classes in the data before applying the final classifier [20].
The program used adjacent images as positive pairs for the neural network. Negative pairs, i.e., images whose representations are pushed apart during training, were frames separated by 10 or more seconds. When defining the positive and negative pair criteria, we were guided by the logic that adjacent frames differ insignificantly and are therefore semantically similar, whereas frames 10 or more frames apart in the directory, or taken on different days or in different recording sessions, differ substantially.
After transformations, all images were resized to 384 × 216 pixels (10% of the original photo size, 3840 × 2160).
The images were then fed to the ResNet50 neural network input, and the MoCo training process ran for 100 epochs, after which the trained model was applied to the same dataset. The learning rate was held constant at 0.001 (1 × 10−3) throughout training. The batch size ranged from 4 to 16 images across the different runs.
The optimal settings for the DBSCAN method were found to be eps = 0.04 and min_samples = 15. Over 90 training epochs, the contrastive loss showed a stable downward trend. Figure 4a,b show the obtained hidden representation vectors after dimensionality reduction with the UMAP method and clustering with the DBSCAN method.
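As an illustration, here is a minimal sketch of this reduction-and-clustering step, assuming `features` holds the hidden representation vectors produced by the trained encoder (random values stand in for them here):

```python
import numpy as np
import umap                        # provided by the umap-learn package
from sklearn.cluster import DBSCAN

# Stand-in for the (n_images, d) array of MoCo hidden representations.
features = np.random.rand(1000, 128).astype(np.float32)

# Project to two dimensions for visualization and clustering.
embedding = umap.UMAP(n_components=2).fit_transform(features)

# Cluster with the settings reported above: eps = 0.04, min_samples = 15.
labels = DBSCAN(eps=0.04, min_samples=15).fit_predict(embedding)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"{n_clusters} clusters; {np.sum(labels == -1)} points marked as noise")
```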
Manual inspection showed that the cluster displayed in yellow in Figure 4a,b consists of dark images. Examples of dark images detected by the model are shown in Figure 5a,b below.

2.3. Data Filtration

Let us summarize the preliminary data processing stages. Initially, there were more than 500 thousand “raw” photographs; from these, irrelevant dark images taken at night were first identified and removed. Then, to optimize calculations and better utilize computational resources, every 50th frame was taken sequentially from the resulting dataset. Accordingly, the actual dataset with which most of the manipulations described below were performed amounted to approximately 10,000 images.

2.4. Data Labeling

Part of the dataset was annotated for the presence of anomalies. When identifying anomalies in images, attention was paid not only to an object’s appearance but also to its presence or absence in adjacent images and to its location. White objects of unclear origin are quite common in frames; it is often impossible to definitively determine whether they are glare on the surface, foam from waves, or pollution objects. Sometimes, water droplets hit the camera lens, causing the image to become blurred.
The color characteristics of the video frames strongly depend on the time of day. Frames from the same video are most similar to each other, but even in this case, the color of the water surface can vary greatly from blue-green to bright blue.
For annotation, we used every 50th image from the dataset. Annotating all 511,262 images proved too labor-intensive for the study; meanwhile, it was necessary to use representative data from the entire expedition duration, not just the first days.
As a result, approximately 10 thousand photographs were annotated using the Label Studio application. After that, the annotation data were analyzed using pandas 2.2.2 (a Python library for data processing and analysis).
It was decided to search for and annotate the following types of anomalies:
  • Birds (Bird) (Example in Figure 6a);
  • Marine litter (Marine_litter) (Example in Figure 6b);
  • Colored glare (Glare) (Example in Figure 6c);
  • Droplets on camera (Droplets) (Example in Figure 6d).
These types of anomalies were chosen for several reasons. Searching for litter is the main objective of the study. Birds are similar to each other and occur regularly, making it interesting to analyze whether the model can consider them similar and classify them as a separate class. Colored glare is interesting due to its pronounced difference from the standard appearance of the sky and water surface. We decided to annotate droplets on the camera lens for subsequent analysis of the problem of frames corrupted by blurring of entire images or their parts.
A total of 10,064 photos from the complete dataset were annotated. The following were found:
  • A total of 2716 bird objects in 559 photographs.
  • A total of 56 litter objects in 54 photographs.
  • A total of 18,400 droplet objects in 3709 photographs.
  • A total of 1737 glare instances in 969 photographs.
It can be observed that a sufficiently large number of birds were found, often occurring in groups. Litter, as expected, appears rarely in images, typically as single objects. Droplets occur very frequently, usually in large numbers in a single image, and are typically found in groups of consecutive images. Droplets and birds occur simultaneously in 167 images, while droplets and litter occur together in 18. It is worth considering mounting the camera higher or using a camera with a windshield wiper. Glare also occurs regularly, most often as a result of bright reflections from the sunset or sunrise.
We then analyzed the annotation results. First and foremost, we wanted to determine the optimal fragment size for contrastive learning. For this purpose, we calculated the diagonal lengths of rectangles containing anomalies found during annotation in Label Studio. Histograms of diagonal distribution by anomaly classes, as well as for all fragments across all classes, are shown in Figure 7.
For each distribution, statistics were calculated for numerical understanding: mean, standard deviation, median, and percentiles 25, 75, and 95. The results are presented in Table 3.
The fragment size for contrastive learning was chosen based on the 95th percentile for the litter class. Since the table reports the diagonal of the annotation rectangle, the side of the corresponding square box is the diagonal divided by the square root of 2. For this reason, we divided the 95th percentile value (p95) by the square root of 2 and then rounded to the nearest multiple of 10 for computational convenience. Ultimately, the linear size of the base fragment side was 240 pixels.
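This choice can be reproduced in a few lines (the p95 value here is hypothetical, standing in for the actual entry in Table 3):

```python
import math

p95 = 340.0                   # hypothetical 95th-percentile diagonal, px
side = p95 / math.sqrt(2)     # side of a square with this diagonal
side = round(side / 10) * 10  # round to the nearest multiple of 10
print(side)                   # -> 240
```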
Thus, we now have a prepared dataset of 10,000 images—every 50th frame was taken from the original videos at equal intervals. From this dataset, irrelevant nighttime images were filtered out because nothing can be distinguished against a black background and they would only harm the model training process on the dataset. Manual data annotation was then conducted, from which it is clear that the dataset contains several thousand objects of interest, including marine litter and marine fauna (birds). We are now ready to proceed with describing the approaches and models used.

2.5. Classification and Contrast Learning—Description

In the first approach, we solve the floating marine litter detection problem as a classification task on substantially imbalanced data using the annotation we obtained. We solve this task with respect to image fragments of 240 × 240 pixels. Details on how exactly the fragmentation was performed and the difference between two different subtasks within the classification approach are shown in Section 2.5.3 on fragmentation.
In this method, we use a ResNet50 convolutional neural network trained with the MoCo method to compute hidden representation vectors describing individual image fragments; the downstream classifiers are then applied to these vectors.

2.5.1. MoCo Contrastive Learning Method

We employed a self-supervised learning strategy using Momentum Contrast (MoCo) [21] with a ResNet50 encoder to learn feature representations of the sea surface directly from unlabeled image fragments. In this framework, different augmented views of the same fragment serve as positive pairs, while views from different fragments serve as negative pairs. The model is trained to minimize a contrastive loss function (InfoNCE or BCE), which pushes representations of positive pairs closer together and negative pairs further apart in the feature space. This process allows the network to learn robust, low-dimensional vector representations of textural and structural patterns on the sea surface without manual labels, which are then used as input for the downstream classification task.
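For reference, the InfoNCE loss used in MoCo [21] has the standard form

$$\mathcal{L}_{q} = -\log \frac{\exp(q \cdot k_{+}/\tau)}{\sum_{i=0}^{K} \exp(q \cdot k_{i}/\tau)},$$

where $q$ is the query representation, $k_{+}$ is its positive key, $k_{i}$ runs over the one positive and $K$ negative keys, and $\tau$ is a temperature hyperparameter.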
While other prominent self-supervised methods such as SimCLR [22] and BYOL [23] offer similar capabilities, the MoCo (Momentum Contrast) architecture was selected for this study. MoCo is a well-established approach for learning powerful feature representations from unlabeled data, making it highly effective in contexts with limited annotated examples. Its advantage is that the number of negative pairs is decoupled from the batch size: the model does not require a large batch to obtain a sufficient number of negative samples, and consequently it loses little performance when the batch size is reduced.
From a software implementation perspective, the model works as follows: the same data portions are fed to both network branches, which are themselves identical and built on the ResNet50 neural network architecture. Then, the main model encoder is trained using stochastic gradient descent. Positive pairs are considered as different augmentations of the same image, which should have similar representations in the feature space, while negative pairs are augmentations of different images, whose representations should differ maximally from each other. During training, the algorithm minimizes the contrastive loss function, which brings positive pairs closer together in the feature space and pushes negative pairs apart, thereby forcing the model to learn transformation-invariant object representations. This is schematically shown in Figure 8.
The key feature of MoCo is the use of a queue to store a large number of negative examples and an exponential moving average for updating the auxiliary encoder parameters, which ensures training stability and supports efficiency when working with large sets of negative examples.
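Concretely, the auxiliary (key) encoder parameters $\theta_{k}$ are updated from the query encoder parameters $\theta_{q}$ as an exponential moving average,

$$\theta_{k} \leftarrow m\,\theta_{k} + (1 - m)\,\theta_{q},$$

with the momentum coefficient $m$ close to 1 (0.999 in the original MoCo work [21]), so the keys stored in the queue remain consistent across iterations.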

2.5.2. Dimensionality Reduction Method

The output hidden representation vectors obtained as a result of neural network training and application have high dimensionality and are poorly suited for direct analysis. It was decided to reduce the dimensionality to 2 to obtain a clear visualization of the hidden representation vectors in two-dimensional space, and the UMAP algorithm was used for this purpose.
UMAP (Uniform Manifold Approximation and Projection) is a nonlinear dimensionality reduction algorithm that constructs a low-dimensional data representation by approximating the topological structure of the original manifold. The algorithm is characterized by high speed, configuration flexibility, and ability to preserve both local and global data structure [24].
Its main parameters are n_neighbors (the balance between local and global structure), min_dist (packing density), n_components (output dimensionality), metric (the distance metric), and n_epochs (the number of optimization epochs). The algorithm works in two stages: it first constructs a weighted nearest-neighbor graph through fuzzy sets with exponentially decreasing weights, and then projects the graph into low-dimensional space through stochastic gradient optimization that minimizes the cross-entropy between distance distributions.

2.5.3. Fragmentation Method, the Difference Between BCE and InfoNCE Approaches

Previously, we applied the ResNet50 neural network trained using the MoCo method (hereinafter designated ResNet50 + MoCo for brevity) to identify anomalies among entire images; these turned out to be dark frames taken at night. Such an approach is suitable for identifying large features distinguishing photos from each other, but is completely unsuitable for detecting small details in images.
For this reason, at the current stage of ResNet50 + MoCo network training, so-called fragmentation is applied to the images, i.e., sampling the available data by extracting small fragments from the images.
Additionally, two subtasks were used within the same ResNet50 + MoCo approach. The first is conditionally called the BCE-based subtask because the BCE (Binary Cross-Entropy) loss function was used, and the second, the InfoNCE-based subtask, corresponds to the use of the InfoNCE loss function. Formulas for both loss functions are given in Section 2.5.1 on contrastive learning and the MoCo method. Furthermore, the two subtasks differ in the logic for determining negative pairs, which will be discussed below.
Fragmentation for training was performed as follows. Before cropping began, one constant was fixed: the output fragment size. Fragments were square crops of randomly chosen size ranging from 160 × 160 to 360 × 360 pixels: a pseudorandom number generator set the two coordinates of the upper left corner of the cropping square and its linear size. After that, all fragments were brought to a uniform size by rescaling to 240 × 240 pixels.
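A minimal sketch of this sampling step (the order of drawing the size and the corner is adjusted so the crop always stays within the frame; function and variable names are illustrative):

```python
import random
from PIL import Image

def random_fragment(img: Image.Image,
                    min_side: int = 160, max_side: int = 360,
                    out_side: int = 240) -> Image.Image:
    """Crop a random square fragment and rescale it to a uniform size."""
    side = random.randint(min_side, max_side)  # linear size of the crop
    x = random.randint(0, img.width - side)    # upper left corner, x
    y = random.randint(0, img.height - side)   # upper left corner, y
    return img.crop((x, y, x + side, y + side)).resize((out_side, out_side))
```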
Only two fragments from the same photograph were accepted as a positive pair, with the distance between them being no more than one reference fragment length but no less than 0.2 of it; the latter condition is introduced to exclude excessive overlap and coincidence between fragments. The definition of negative pairs also changed. For subtask #1, a negative pair was two fragments from images at least 10 frames apart (corresponding to a real-time separation of 10 s).
For subtask #2 with the InfoNCE loss function, the program code specifies only the definition of positive pairs, and any non-positive pair is considered negative.
The images were then fed to the ResNet50 + MoCo neural network input, and training ran for 100 epochs, after which the trained model was applied to the same dataset. The fragmentation algorithm was identical for both subtasks.
The hidden representation vectors, obtained as a result of training, underwent validation on the same dataset, which was now divided into fragments in a different way—into equal-sized squares of 240 × 240 pixels. As mentioned earlier, the square size was chosen based on the 95th percentile for the Litter class.
Then, the entire image canvas was divided by a grid with equal steps, so that each image was split into fragments. The step was set to 120 pixels, i.e., exactly half the fragment side. Different fragments therefore overlap each other; most of the image (all pixels not at the image edges) falls into 4 different fragments. This provides artificial data augmentation. Additionally, if an anomaly fell on a grid boundary and was divided, it is highly likely that the object fell completely into a neighboring fragment. Both circumstances (artificial data augmentation and a partial solution to the problem of objects being divided between grid squares) should, in theory, contribute to higher convolutional neural network model performance at the validation stage.
Considering that the source image size is always 3840 × 2160 pixels, when fragmented in this way, the photograph is divided into 527 square fragments; the total number of data objects in such a dataset amounted to more than 5.7 million.
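The fragment count per image follows directly from the grid geometry, as the sketch below confirms: with a 240-pixel side and a 120-pixel step there are 31 positions along the width and 17 along the height, giving 31 × 17 = 527 fragments.

```python
def grid_tiles(width=3840, height=2160, side=240, step=120):
    """Enumerate the upper-left corners of the overlapping square tiles."""
    xs = range(0, width - side + 1, step)   # 31 positions along the width
    ys = range(0, height - side + 1, step)  # 17 positions along the height
    return [(x, y) for y in ys for x in xs]

print(len(grid_tiles()))  # -> 527
```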

2.5.4. Classification Methods and Optimal Parameters Selection

This section describes the characteristics of three classifier models used in this work: CatBoost, Random Forest, and Balanced Random Forest. The main reason for selecting these algorithms lies in their ability to efficiently work with tabular and feature-heterogeneous datasets, resistance to overfitting, and high interpretability of results. Another reason is that these classifiers have previously been used in similar works on floating marine litter identification or sea surface anomaly detection and have shown acceptable results with different types of convolutional neural networks.
CatBoost is a modern gradient-boosting algorithm over decision trees that performs well on tasks with categorical features and limited data volume [25]. Random Forest is intuitively understandable, stable, and simple to configure, allowing accurate models to be built on data with different target class distributions. Its variant, Balanced Random Forest, is also a suitable choice due to its handling of imbalanced data, which is characteristic of realistic scenarios where the proportion of images with litter is far lower than that of “empty” images.

2.5.5. Experimental Configurations and Hyperparameter Optimization

To thoroughly evaluate the classification approach, we systematically tested several experimental configurations, which varied along three dimensions:
  • Contrastive Learning Strategy: We compared two subtasks for generating training pairs for the MoCo model: one using the InfoNCE loss function where any non-positive pair is considered negative, and another using Binary Cross-Entropy (BCE) with more explicit rules for negative sampling (fragments from frames at least 10 s apart).
  • Data Balancing: We trained the classifiers on five datasets derived from the training set in order to address the severe class imbalance. These datasets featured progressively smaller, random sub-samples of the majority (“empty”) class, creating different ratios of anomalous to non-anomalous fragments.
  • Classifier Optimization: We used RandomizedSearchCV to optimize the main hyperparameters of the Random Forest and Balanced Random Forest models. For all three classifiers (including CatBoost), we employed the Optuna framework to perform a targeted search for the optimal class weights to maximize the F1-score [26,27]; a minimal sketch of this weight search is given after this list.
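The sketch below illustrates such a class-weight search, using synthetic data in place of the fragment feature vectors (the weight range and trial count are illustrative, not the values used in the study):

```python
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the imbalanced fragment dataset;
# class 1 plays the role of "Marine litter".
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           weights=[0.9, 0.07, 0.03], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)

def objective(trial: optuna.Trial) -> float:
    # One weight per class, searched on a log scale.
    weights = {c: trial.suggest_float(f"w_{c}", 0.1, 100.0, log=True)
               for c in set(y_train)}
    clf = RandomForestClassifier(n_estimators=200, class_weight=weights,
                                 n_jobs=-1, random_state=0)
    clf.fit(X_train, y_train)
    # Maximize the F1-score of the rare class only.
    return f1_score(y_val, clf.predict(X_val), labels=[1], average="macro")

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```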
The final hyperparameters used for each classifier after optimization are presented in Table 4. For the VOD approach, we tested two data scenarios: training on the full dataset and training on a reduced dataset containing only images with at least one annotated object.
More details on the selection of parameters for each classifier are described in Section 3.1.1, Section 3.1.2 and Section 3.1.3, as well as in the Appendix A.

2.5.6. Performance Evaluation and Optimization

Evaluation Metric
The primary metric for evaluating both approaches was the F1-score, which is the harmonic mean of precision and recall. This metric was chosen for all optimization procedures as it is particularly well-suited for handling the significant class imbalance present in the dataset.
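In terms of precision $P$ and recall $R$,

$$F_{1} = \frac{2PR}{P + R}.$$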
Hyperparameter Optimization
Two distinct tools were used for hyperparameter optimization:
  • RandomizedSearchCV: This Scikit-learn method was used to find optimal parameters for the Random Forest models. It efficiently explores a large parameter space by testing a fixed number of random combinations with cross-validation.
  • Optuna: This modern optimization library was used to select the best class weights for the classification task. It employs advanced algorithms like Bayesian optimization to find optimal parameters more efficiently than traditional random or grid searches.
Dataset Balancing
To address the class imbalance, different dataset configurations were tested:
For the classification approach, five datasets were created. These included the original dataset and four derivatives where the number of non-anomalous fragments (the majority class) was progressively reduced to create more balanced training sets.
For the VOD approach, which processes entire images, two configurations were used: the full, unaltered dataset (approx. 10,000 images) and a smaller dataset consisting only of images that contained at least one annotation (approx. 4000 images).

2.6. Visual Object Detection—Description of the Approach

Now, we proceed to describe the second approach we used—Visual Object Detection based on the YOLO neural network. As already indicated above, based on the review of relevant scientific works, one can conclude that YOLO is one of the most successful models for this type of task.
Compared to the previous classification approach, this time we do not need to develop our own complex fragmentation, dimensionality reduction, and classification procedures; instead, we can use ready-made Python libraries with minor modifications and parameter settings.
This section will consist of two subsections—in the first, we will give a brief description of the YOLO method, and in the second, we will describe how model performance was evaluated in this approach.

2.6.1. YOLO Method

YOLO (You Only Look Once) is a family of real-time object detection algorithms. The main principle of YOLO is that the entire image is processed simultaneously, not element by element. This ensures high image processing speed, making YOLO an ideal choice for applied tasks where responsiveness is important, such as video surveillance.
This architecture was selected for the VOD task due to its exceptional processing speed, which is essential for real-time or near-real-time detection applications.
The stages of the YOLO algorithm can be described as follows:
  • Grid division: The input image is divided into a grid of cells, each responsible for objects whose centers fall within it.
  • Bounding box position prediction: Each grid cell predicts a fixed number of bounding boxes, each having a confidence score indicating the probability of object presence and box accuracy.
  • Class label prediction: Class probabilities conditioned on the fact of object presence in the box are predicted for each bounding box.
  • Non-Maximum Suppression (NMS): To eliminate multiple detections of the same object, NMS is applied, which improves final approximations [28].

2.6.2. Quality Assessment

Although mAP is more commonly used as the main metric for YOLO networks, this work applies the F1-score metric. It represents the harmonic mean between precision and recall, making it useful for obtaining a single detection quality indicator.
In the YOLO context, F1-score is calculated for each object class separately and can then be averaged across all classes. This metric is particularly useful when working with imbalanced datasets where the number of objects of different classes differs significantly. Additionally, the main quality indicator of the other model (based on classification) is also F1-score, meaning comparison of the two models based on this indicator will be most representative.

3. Results and Discussion

In this section, we compare the model performance indicators for both approaches. The order of consideration is analogous to Section 2: first, we examine the classification approach, divided into three classifiers, two object sampling methodologies (the InfoNCE and BCE subtasks), and five data subsets with varying degrees of balance, with or without algorithmic search for optimal class weights using Optuna.
To ensure a fair and direct comparison between the classification and Visual Object Detection (VOD) approaches, a consistent data division strategy was established.
For the VOD approach, the entire dataset of 10,017 annotated images was first partitioned into a training set (70% of images) and a validation set (30% of images). All models were trained only on data from the training set and evaluated on the unseen validation set.
For the classification approach, the process operated on image fragments. First, the self-supervised ResNet50 + MoCo model learned representations using fragments generated from all the images. Then, each fragment was matched to the corresponding annotation by its central point (i.e., whether or not the central point of a given fragment falls into any of the annotated boxes), and this new large dataset of image fragments (about 5.9 million items) was divided into training and validation subsets at a 70:30 ratio. The subsequent classifiers (e.g., CatBoost) were trained on the feature vectors derived from the training-set fragments and evaluated on the feature vectors derived from fragments of images in the validation partition.
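A minimal sketch of the center-point matching rule (box format and class encoding are illustrative):

```python
def fragment_label(cx: float, cy: float, boxes) -> int:
    """Return the class of the first annotated box containing the fragment
    center (cx, cy), or 0 if the fragment is non-anomalous.

    `boxes` holds (x_min, y_min, x_max, y_max, class_id) tuples.
    """
    for x0, y0, x1, y1, cls in boxes:
        if x0 <= cx <= x1 and y0 <= cy <= y1:
            return cls
    return 0

# A 240 x 240 fragment with upper-left corner (1200, 600) has its
# center at (1320, 720):
print(fragment_label(1320, 720, [(1300, 650, 1500, 800, 2)]))  # -> 2
```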

3.1. Classification Results

First, let us summarize the results of the model in the MoCo and classification approach across all configurations.
The section is structured as follows. In the next three parts, we will examine model performance on each of the classifiers—Random Forest, Balanced Random Forest, CatBoost. In each of these subsections, graphs are provided showing the dependence of the quality metric (F1-score) on “balance”—that is, the ratio of objects labeled as “anomaly” of one of three types to non-anomalous objects in the dataset. For each classifier, three graphs are provided for the “Marine litter”, “Birds”, and “Glare” classes, each of which displays the indicated dependence in three configurations: without optimal parameter selection (without Optuna), with selection for all classes, and with selection only for the “Marine litter” class. Finally, at the end of each subsection, a comparative table with quality metrics for each configuration is provided.

3.1.1. Random Forest

The graphs presented below (Figure 9a–c and Figure 10a–c) show the dependence of the average F1-score values for each class within the corresponding subtasks. First, Figure 9a–c presents the InfoNCE-based subtask. The visualization includes three configurations: the model without class weights (dashed line), the model with weights optimized to maximize the averaged F1-score across anomaly classes (dotted line), and the model with weights optimized to maximize the “Marine Litter” class F1-score (solid line).
Figure 10a–c shows the corresponding F1-score graphs for the BCE-based subtask.
As expected, setting weights to maximize the F1-score for the Litter class improves the metric value for practically all datasets in both subtasks. In the InfoNCE subtask for the most balanced dataset, the F1-score metric for the Litter class reaches a value of 0.32. In the same subtask for the most balanced dataset, the F1-score metric for the Birds class reaches a value of 0.71.
It can also be observed that the InfoNCE subtask generally performs better with Random Forest classification.
Further information regarding the choice of hyperparameters can be seen in the corresponding section in Appendix A (Figure A1).

3.1.2. Balanced Random Forest

Figure 11a–c and Figure 12a–c show the dependence of the average F1-score values for each class within the corresponding subtasks when the Balanced Random Forest classifier was used. The first triplet shows the metric evolution for the InfoNCE subtask, while the next three plots show the same for the BCE-based subtask.
There is considerable qualitative similarity with the Random Forest classifier results. However, quantitatively, the results turned out noticeably worse.
Setting class weights to maximize the F1-score for the Litter class improves the metric value for any ratio of anomalies to non-anomalies and in both subtasks. In the InfoNCE subtask, for the most balanced dataset, the F1-score for the Litter class reaches 0.15, a result two times worse than that of the RF classifier. In the same subtask, for the most balanced dataset, the F1-score for the Birds class reaches 0.64, but with weight optimization for all classes rather than just Litter, which is expected.
It can also be observed that the InfoNCE subtask generally performs better with classification, just as with the Random Forest method.
Further information regarding the choice of hyperparameters for the Balanced Random Forest method is presented in the respective section in Appendix A (Figure A2).

3.1.3. CatBoost

After that, the classification task was performed using the CatBoost algorithm (see Section 2.5.4 for a description) on latent representation vectors. These, in turn, were obtained by training the ResNet50 + MoCo network with two different sets of parameters: one from subtask #1 with the Binary Cross-Entropy loss function and with both types of pairs specified (Figure 13), and the other from subtask #2 with the InfoNCE loss function (Figure 14). For more details on the differences between the loss functions, see Section 2.5.3.
Qualitatively, the CatBoost model results do not differ much from the previous two classifiers. The best result can be achieved when using the highest ratio of anomalous objects to non-anomalies in the dataset in the InfoNCE subtask. Regarding class weight determination using Optuna, the best result for average F1-score among all classes was achieved when setting coefficients for all classes rather than exclusively for marine litter.
However, it is worth mentioning that the highest achieved F1-score metric value specifically for the “Marine litter” class was 0.43 using the BCE subtask, although the value in the InfoNCE subtask does not differ significantly and amounted to 0.4. When identifying “Birds” class objects, the maximum F1-score metric value was 0.68, and here too the difference between the two subtasks is small.
Additional information regarding the choice of hyperparameters for the CatBoost classifier can be found in the corresponding section in Appendix A (Table A3).

3.2. Visual Object Detection Results

Let us summarize the results of the model in the VOD approach for both datasets—on the full one and the reduced one.

3.2.1. Detection on a Full Dataset

Two scenarios were implemented. In the first, the neural network was supplied with data “as is”: anomalies occur quite rarely here, but this dataset is most similar to the real situation at sea. In the second, the neural network was trained on pre-selected images in which the presence of at least one of the objects of interest (bird, marine litter, camera glare) was guaranteed. Out of 10,017 photographs annotated in Label Studio, 4218 had at least one label from these classes.
The comparison was conducted using three quality metrics—Precision, Recall, and F1-score. Similar quality metrics were also used in the classification approach using ResNet + MoCo and CatBoost for convenience of comparing results with each other.
The dataset was divided in a 70:30 proportion, i.e., 70% of the data was used for model training and the remaining 30% for validation. The YOLOv11 model was used with three object classes. The “imgsz” parameter, which sets the input image size for YOLO, was set to 3840 pixels, i.e., equal to the width of the photographs in the working dataset. Thus, no photo downscaling was performed.
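A minimal sketch of this training configuration with the ultralytics Python API (the checkpoint name, dataset YAML, and epoch count are illustrative, not the exact ones used in the study):

```python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")    # pretrained YOLOv11 checkpoint (assumed variant)
model.train(
    data="sea_surface.yaml",  # hypothetical dataset config with 3 classes
    epochs=100,               # assumed; not specified in the text
    imgsz=3840,               # full frame width, i.e., no downscaling
)
metrics = model.val()         # evaluate on the validation split
```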
The graphs in Figure 15a–c show the evolution of the loss function during YOLO model training on each of 3 examined object classes.
Table 5 shows the quality metrics values after validating the YOLO model on the validation subset.
While the final F1-score was the primary metric for comparison, the training convergence of the VOD model was also monitored. For a detailed view of the model’s learning progress, the Mean Average Precision (mAP@0.5) training curves for the models trained on both the full and the anomalies-only datasets are provided in Appendix A (Figure A3).

3.2.2. Detection with a Dataset Consisting Only of Anomalies

In the second scenario, only photographs with objects in them were selected. Similar to the previous run, the dataset was divided between training and validation samples in a 70:30 ratio.
The evolution graphs of quality metrics are shown in Figure 16a–c. The results for quality metrics after validation of the trained model on new (validation) data are shown in Table 6. The quality metrics and the proportion between training and validation datasets are the same as in the first scenario.
The Mean Average Precision (mAP@0.5) training curve for the model trained on the anomalies-only dataset is provided in Appendix A (Figure A3a).

3.3. Discussion and Result Interpretation

Let us summarize the conclusions we have obtained for both approaches to solving the problem.

3.3.1. Classification Approach

Let us start with the results of various classifier models. Among the classifiers used in the ResNet50 + MoCo task (Random Forest, Balanced Random Forest, and CatBoost discussed above), the best results were expectedly achieved at the highest ratio of anomalous objects in the dataset to non-anomalous ones, i.e., with artificial increase in the proportion of objects with labels from Label Studio relative to other unlabeled objects. Regarding subtask selection, the methodology using the InfoNCE loss function more often gives better results than BCE, especially for RF and BRF classifiers.
Summarizing the classifier results above, the best result for recognizing “Birds” class objects was achieved by the Random Forest model in the InfoNCE subtask using weights for all classes. The F1-score metric value was 0.71. However, this indicator for the “Litter” class equals only 0.32.
The best model for identifying litter objects specifically turned out to be CatBoost, also with all class weights set, but in the BCE subtask; the quality metric result was 0.43. Considering that litter identification is the higher-priority task and that maximizing the F1-score for litter is the most important goal, preference should be given to the CatBoost classifier.

3.3.2. Visual Object Detection Approach

Based on the results of testing the YOLO neural network in the Visual Object Detection approach, the following conclusion can be made: training the model only on images containing anomalies (objects of “Marine litter”, “Birds”, “Glare” classes) does not provide substantial growth in quality metrics.
The best result was achieved by the YOLO model in recognizing “Birds” class objects, with F1-score reaching 0.73 when using the dataset with anomalies only compared to 0.62 on the full dataset. However, the quality measure when recognizing marine litter objects turned out to be quite low—less than 0.1—and it did not change at all when transitioning from the full dataset to the anomaly-only dataset.

3.3.3. Comparison of Best Results in Each Method

Let us now compare the best values obtained in the two approaches—a neural network trained with MoCo contrastive learning supplemented with a classifier, versus the YOLO detection network.
For the method with contrastive learning of ResNet neural network and classifier, this will be a run on a dataset with maximum ratio of meaningful fragments to zero fragments (about 3 to 2), InfoNCE loss function, with class weight tuning and CatBoost classifier (Table 7).
For the VOD approach, this would be run using a dataset from which all photos that do not contain anomalies (not containing the labels “Bird”, “Marine Litter”, or “Glare”) have been excluded (Table 8).

4. Discussion

This study compared a self-supervised classification approach against a supervised Visual Object Detection (VOD) model for identifying marine litter and birds, revealing a distinct performance trade-off. While the findings are promising, the study has several limitations that provide clear directions for future research.

4.1. Methodological and Dataset Limitations

The primary limitations of the findings of this study are rooted in the dataset and the methodologies employed. The training data was geographically and temporally constrained to the Barents and Kara Seas, which may not represent conditions found elsewhere to the full extent. Furthermore, the preprocessing step to balance the dataset by removing a significant amount of ‘empty’ sea images may have inadvertently discarded valuable contextual information.
Methodologically, the classification approach proved computationally intensive and lost spatial context by analyzing image fragments, while the VOD approach was restricted to a single model architecture. More critically, the severe dataset imbalance, with only 56 annotated instances of “Marine Litter,” rendered the F1-score statistically fragile. With such a small validation sample, minor prediction changes (correctly or incorrectly classifying just one or two instances) caused substantial fluctuations in the metric. Consequently, while the classification model’s 40% F1-score represents a notable improvement, its statistical robustness is limited and requires validation with larger datasets.

4.2. Generalizability and Model Transferability

Developing these models solely with Arctic data raises serious questions about how well they apply elsewhere. Transferring them directly to different ocean settings, such as tropical or temperate coastal areas, would likely degrade their performance to some extent. This is due to changes in light conditions, the optical characteristics of sea water, and the ways biological material accumulates, all of which alter the appearance of litter and its surroundings. In addition, the models would encounter unfamiliar animals and plants missing from the original training set, which could easily lead to misidentifications. Ultimately, building a robust model that works reliably across the globe requires assembling a training dataset covering a wide range of environmental conditions and object types.

4.3. Future Research Directions

Future research should prioritize collecting more diverse data from different geographic locations. From a modeling perspective, exploring end-to-end anomaly detection frameworks has the potential to combine the strengths of both tested approaches.
Moreover, since the data are derived from video containers holding sequential images, the use of temporal information offers significant opportunities. Implementing object tracking or enforcing temporal consistency between frames can help distinguish static debris from dynamic objects such as birds and reduce the number of short-term false positives. From the point of view of self-supervised learning, it is possible to exploit not only the spatial autocorrelation of optical imaging data but also the temporal one, which can provide an additional source of signal for formulating the contrastive learning loss function.
Finally, it is worth exploring which models perform better in handling specific types of sea surface anomalies and for what reasons. In particular, MoCo is known for its performance under limited labeled training data, and YOLO for its high real-time object detection speed, but more detailed analysis is needed in future work.

5. Conclusions

Thus, in terms of the absolute maximum value among all classes, the advantage belongs to the VOD approach: in the variant using the dataset containing only anomalies, birds are recognized with an F1-score of 73%, compared to 67% in the classification approach.
This study successfully demonstrated that the optimal machine learning strategy for monitoring the marine surface is highly dependent on the target’s characteristics. While standard object detection models like YOLO are highly effective for identifying well-defined objects such as birds, a self-supervised, fragment-based classification approach is substantially more proficient at detecting varied targets like marine litter.
Therefore, we conclude that for the primary goal of marine litter identification, the problem is more effectively framed as an anomaly classification task rather than a standard object detection task.

Author Contributions

Main research, data analysis and model training: O.B.; Supervisor and ML expert: M.K.; Data collector and marine ecology expert: M.P.; Data collecting and management: V.S., P.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was carried out under a governmental assignment by the Ministry of Science and Higher Education of the Russian Federation, No. 075-03-2025-662 dated 17 January 2025. Model development was supported by the Presidential Environmental Projects Foundation, grant No. ЭKO-25-2-003542. Data acquisition and management were supported under the governmental assignment by the Ministry of Science and Higher Education of the Russian Federation to Lomonosov Moscow State University, No. 121051100167-1.

Data Availability Statement

The dataset consisting of videos and imagery collected from “Dalnie Zelentsy” vessel was published by our research team and is available online [29].

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A

Appendix A.1. Random Forest

For the RF method, all parameters except class weights were selected using the RandomizedSearchCV optimizer; the F1-score, averaged across all anomaly classes, was maximized. Parameters were selected for the most balanced dataset. Figure A1 shows the F1-score dependencies for various parameter values for the BCE subtask.
Figure A1. Visualization of the optimal parameter selection process using RandomizedSearchCV for the Random Forest method.
Parameter values selected using the RandomizedSearchCV optimizer are provided in Table A1. They were identical for BCE and InfoNCE subtasks.
Table A1. Parameter values for the best F1-score for the Random Forest method.

Parameter           Value
n_estimators        200
min_samples_split   2
min_samples_leaf    2
max_features        sqrt
max_depth           14
bootstrap           False
The bootstrap parameter is set to False. This configuration is due to the imbalance of the original dataset: disabling bootstrap sampling ensures that all rare-class samples are included in the training of each ensemble tree.
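For reference, a minimal sketch of the randomized search described above, with the space shaped after the parameters in Table A1 (ranges and the iteration count are illustrative; X_train and y_train stand for the fragment feature vectors and labels):

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "n_estimators": randint(100, 800),
    "min_samples_split": randint(2, 10),
    "min_samples_leaf": randint(1, 5),
    "max_features": ["sqrt", "log2"],
    "max_depth": randint(5, 20),
    "bootstrap": [True, False],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=50,           # number of random combinations to test
    scoring="f1_macro",  # F1-score averaged across classes
    cv=3,
    n_jobs=-1,
)
# search.fit(X_train, y_train); print(search.best_params_)
```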
Class weight optimization was performed using the Optuna library according to two criteria:
  • Maximization of averaged F1-score across all anomaly classes.
  • Maximization of F1-score for the “Marine Litter” class.
For comparison, a baseline model was trained without assigning class weights.

Appendix A.2. Balanced Random Forest

All parameters except class weights for the Balanced Random Forest method were likewise selected using the RandomizedSearchCV optimizer; the F1-score, averaged across all anomaly classes, was maximized. Parameters were selected for the most balanced dataset. Figure A2 shows the F1-score dependencies for various parameter values for the BCE subtask.
Figure A2. Visualization of the process of selecting optimal parameters using RandomizedSearchCV for the Balanced Random Forest method.
The parameter values selected using the RandomizedSearchCV optimizer are given in Table A2. They were the same for both the BCE-based and InfoNCE-based subtasks.
Table A2. Parameter values for the best F1-score for the Balanced Random Forest method.

Parameter           Value
sampling_strategy   all
replacement         True
n_estimators        600
min_samples_split   7
min_samples_leaf    2
max_features        log2
max_depth           7
bootstrap           True
We decided to select class weights using the Optuna optimizer. We selected two sets of weights:
  • Maximizing the F1-score averaged over all anomaly classes.
  • Maximizing the F1-score for the Marine Litter class.
We also trained the classification model without specifying weights.

Appendix A.3. CatBoost

The parameter values presented in Table A3 were used.
Table A3. Parameter values for the best F1-score for the CatBoost method.

Parameter                  | Value
iterations                 | 1000
depth                      | 12
leaf_estimation_iterations | 10
learning_rate              | 0.0001
loss_function              | MultiClass
eval_metric                | TotalF1
class_weights              | Best weights selected using the Optuna optimizer
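A minimal sketch of this configuration with the catboost package is shown below. The class_weights list is a placeholder for the Optuna-selected weights (one weight per class), and the train/validation arrays are assumed to be the same embedding features used for the other classifiers.

```python
from catboost import CatBoostClassifier

# Table A3 configuration; class_weights values are placeholders, not the
# published weights, which were selected with Optuna.
model = CatBoostClassifier(
    iterations=1000,
    depth=12,
    leaf_estimation_iterations=10,
    learning_rate=0.0001,
    loss_function="MultiClass",
    eval_metric="TotalF1",       # multiclass F1 tracked during training
    class_weights=[1.0, 4.0, 2.0, 3.0],
    verbose=100,
)
model.fit(X_train, y_train, eval_set=(X_val, y_val))
```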

Appendix A.4. Mean Average Precision (mAP@0.5) Diagrams for YOLO Models

Figure A3. Mean Average Precision (mAP@0.5) training curves for the YOLO object detection model under two different dataset configurations. (a) Training performance on the dataset consisting only of images containing at least one annotation. (b) Training performance on the full, original dataset, including images with no annotated objects.
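For context, the curves in Figure A3 track the mAP@0.5 metric logged during YOLO training and validation. A minimal sketch with the ultralytics package is given below; the dataset YAML name, model size (yolo11n), and training settings are illustrative assumptions rather than the study's actual configuration.

```python
from ultralytics import YOLO

# "sea_anomalies.yaml" is a hypothetical dataset description file listing the
# train/val image paths and the Marine Litter, Bird, and Glare classes.
model = YOLO("yolo11n.pt")  # pretrained YOLO11 checkpoint
model.train(data="sea_anomalies.yaml", epochs=100, imgsz=640)

# mAP@0.5 on the validation split, the quantity plotted in Figure A3.
metrics = model.val()
print(f"mAP@0.5 = {metrics.box.map50:.3f}")
```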

References

  1. Barboza, L.G.A.; Gimenez, B.C.G. Microplastics in the marine environment: Current trends and research priorities. Mar. Pollut. Bull. 2018, 133, 336–345. [Google Scholar] [CrossRef] [PubMed]
  2. Gregory, M.R. Environmental implications of plastic debris in marine settings. Philos. Trans. R. Soc. B Biol. Sci. 2009, 364, 2013–2025. [Google Scholar] [CrossRef] [PubMed]
  3. Valdenegro-Toro, M. Deep learning for detection of marine debris. In Proceedings of the OCEANS Conference, Seattle, WA, USA, 27–31 October 2019; IEEE: New York, NY, USA, 2019. [Google Scholar]
  4. Acuña-Ruz, T.; Uribe, D.; Taylor, R.; Amézquita, L.; Guzmán, M.C.; Merrill, J.; Martínez, P.; Voisin, L.; Mattar B., C. Anthropogenic marine debris over beaches: Spectral characterization for remote sensing applications. Remote Sens. Environ. 2018, 217, 309–322. [Google Scholar] [CrossRef]
  5. Fulton, M.; Hong, J.; Islam, M.J.; Sattar, J. Robotic Detection of Marine Litter Using Deep Visual Detection Models. In Proceedings of the IEEE International Conference on Robotics and Automation, Montreal, QC, Canada, 20–24 May 2019; pp. 5752–5758. [Google Scholar] [CrossRef]
  6. Topouzelis, K.; Papageorgiou, D.; Suaria, G.; Aliani, S. Floating marine litter detection algorithms and techniques using optical remote sensing data: A review. Mar. Pollut. Bull. 2021, 170, 112675. [Google Scholar] [CrossRef] [PubMed]
  7. Kylili, K.; Kyriakides, I.; Artusi, A.; Hadjistassou, C. Identifying floating plastic marine debris using a deep learning approach. Environ. Sci. Pollut. Res. 2019, 26, 17091–17099. [Google Scholar] [CrossRef] [PubMed]
  8. Fingas, M.; Brown, C. Review of oil spill remote sensing. Mar. Pollut. Bull. 2014, 83, 9–23. [Google Scholar] [CrossRef] [PubMed]
  9. Sannigrahi, S.; Basu, B.; Sarkar Basu, A.; Pilla, F. Development of Automated Marine Floating Plastic Detection System Using Sentinel-2 Imagery and Machine Learning Models. Mar. Pollut. Bull. 2022, 178, 113527. [Google Scholar] [CrossRef] [PubMed]
  10. Baek, S.-S.; Pyo, J.; Kwon, Y.S.; Chun, S.-J.; Baek, S.H.; Ahn, C.-Y.; Oh, H.-M.; Kim, Y.O.; Cho, K.H. Deep Learning for Simulating Harmful Algal Blooms Using Ocean Numerical Model. Front. Mar. Sci. 2021, 8, 729954. [Google Scholar] [CrossRef]
  11. Shen, A.; Zhu, Y.; Angelov, P.; Jiang, R. Marine Debris Detection in Satellite Surveillance Using Attention Mechanisms. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 4320–4330. [Google Scholar] [CrossRef]
  12. Romano, D.; Mennella, C.; Lapegna, M. A deep learning-based method for efficient floating garbage debris recognition on high-performance edge computing platform. Future Gener. Comput. Syst. 2026, 155, 108000. [Google Scholar] [CrossRef]
  13. Berg, P.; Maia, D.S.; Pham, M.-T.; Lefèvre, S. Weakly Supervised Detection of Marine Animals in High Resolution Aerial Images. Remote Sens. 2022, 14, 339. [Google Scholar] [CrossRef]
  14. Cui, J.; Zhou, S.; Xu, G.; Liu, X.; Gao, X. Marine Debris Detection in Real Time: A Lightweight UTNet Model. J. Mar. Sci. Eng. 2025, 13, 1560. [Google Scholar] [CrossRef]
  15. NOAA NCCOS. Marine Debris Monitoring and Assessment Protocols; NOAA Technical Memorandum: Springfield, MO, USA, 2021.
  16. Gomes-Pereira, J.N.; Auger, V.; Beisiegel, K.; Benjamin, R.; Bergmann, M.; Bowden, D.; Buhl-Mortensen, P.; De Leo, F.C.; Dionísio, G.; Durden, J.M.; et al. Current and future trends in marine image annotation software. Prog. Oceanogr. 2016, 149, 106–120. [Google Scholar] [CrossRef]
  17. Pham, M.-T.; Gangloff, H.; Lefèvre, S. Weakly supervised marine animal detection from remote sensing images using vector-quantized variational autoencoder. In Proceedings of the IGARSS 2023 IEEE International Geoscience and Remote Sensing Symposium, Pasadena, CA, USA, 16–21 July 2023; IEEE: New York, NY, USA, 2023; pp. 5559–5562. [Google Scholar] [CrossRef]
  18. Pang, G.; Shen, C.; Cao, L.; van den Hengel, A. Deep learning for anomaly detection: A review. ACM Comput. Surv. 2021, 54, 38. [Google Scholar] [CrossRef]
  19. Wang, S.; Zhang, X.; Zhao, Y.; Yu, H.; Li, B. Self-Supervised Marine Noise Learning with Sparse Autoencoder Network for Generative Target Magnetic Anomaly Detection. Remote Sens. 2024, 16, 3263. [Google Scholar] [CrossRef]
  20. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, OR, USA, 2–4 August 1996; pp. 226–231. [Google Scholar]
  21. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar]
  22. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. arXiv 2020, arXiv:2002.05709. [Google Scholar] [CrossRef]
  23. Grill, J.-B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.H.; Buchatskaya, E.; Doersch, C.; Pires, B.A.; Guo, Z.D.; Gheshlaghi Azar, M.; et al. Bootstrap Your Own Latent—A New Approach to Self-Supervised Learning. arXiv 2020, arXiv:2006.07733. [Google Scholar] [CrossRef]
  24. McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv 2018, arXiv:1802.03426. [Google Scholar]
  25. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. Adv. Neural Inf. Process. Syst. 2018, 31. Available online: https://proceedings.neurips.cc/paper/2018/hash/14491b756b3a51daac41c24863285549-Abstract.html (accessed on 18 November 2025).
  26. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  27. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2623–2631. [Google Scholar]
  28. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  29. Bilousova, O.; Krivoshlyk, P.; Spirina, V.; Krinitskiy, M.; Pogojeva, V. Marine Monitoring, Autumn 2023, Dalnie Zelentsy (Data Set); Kaggle: San Francisco, CA, USA, 2023. [Google Scholar] [CrossRef]
Figure 1. Route of the expedition vessel “Dalnie Zelentsy” from which the video material was captured.
Figure 2. Camera installation scheme on the vessel. Red dots symbolize floating marine litter objects.
Figure 3. Photo of the scientific vessel and the board where the camera was attached.
Figure 4. Clustering using two different methods on the full dataset in the embedding-vector space: (a) DBSCAN; (b) kdeplot.
Figure 5. (a,b) Two examples of frames taken at night and labeled by us as “anomaly” objects, as they belong to a minor class separate from the main one.
Figure 6. Examples of images from the dataset containing various types of anomalies, marked with red rectangles: (a) birds; (b) glare; (c) marine litter; (d) droplets on the camera.
Figure 7. Histograms of the size distribution for objects of the four classes: birds (blue), litter (cyan), glare (green), and droplets (yellow), as well as all classes combined (red).
Figure 8. The principle of processing positive and negative pairs in contrastive learning.
Figure 9. (a–c) Average F1-score in the InfoNCE-based subtask, shown for runs without class weights (dashed line), with class weights (dotted line), and with class weights selected for the “Marine Litter” class only (solid line). Random Forest classifier. (a) “Marine Litter” class; (b) “Birds” class; (c) “Glare” class.
Figure 10. (a–c) Average F1-score in the BCE-based subtask, shown for runs without class weights (dashed line), with class weights (dotted line), and with class weights selected for the “Marine Litter” class only (solid line). Random Forest classifier. (a) “Marine Litter” class; (b) “Birds” class; (c) “Glare” class.
Figure 11. (a–c) Average F1-score in the InfoNCE-based subtask, shown for runs without class weights (dashed line), with class weights (dotted line), and with class weights selected for the “Marine Litter” class only (solid line). Balanced Random Forest classifier. (a) “Marine Litter” class; (b) “Birds” class; (c) “Glare” class.
Figure 12. (a–c) Average F1-score in the BCE-based subtask, shown for runs without class weights (dashed line), with class weights (dotted line), and with class weights selected for the “Marine Litter” class only (solid line). Balanced Random Forest classifier. (a) “Marine Litter” class; (b) “Birds” class; (c) “Glare” class.
Figure 13. (a–c) Average F1-score in the InfoNCE-based subtask, shown for runs without class weights (dashed line), with class weights (dotted line), and with class weights selected for the “Marine Litter” class only (solid line). CatBoost classifier. (a) “Marine Litter” class; (b) “Birds” class; (c) “Glare” class.
Figure 14. (a–c) Average F1-score in the BCE-based subtask, shown for runs without class weights (dashed line), with class weights (dotted line), and with class weights selected for the “Marine Litter” class only (solid line). CatBoost classifier. (a) “Marine Litter” class; (b) “Birds” class; (c) “Glare” class.
Figure 15. (a–c) Evolution of the F1-score quality metric for each of the three anomaly classes ((a) Marine Litter; (b) Birds; (c) Glare) during training of the YOLO model on the full dataset.
Figure 16. (a–c) Evolution of the F1-score quality metric for each of the three anomaly classes during training of the YOLO model on a dataset consisting only of anomalies.
Table 1. Weather observation log from the vessel during the expedition. A dash (—) denotes the absence of weather or sea-state information on that day.

Date | Weather | Sea State
12 September 2023 | Cloudy, wind direction 145 | At the beginning of the day, it was calm, the waves were small; by the evening, the height of the waves increased a little
13 September 2023 | Morning: strong winds, temperature lower than yesterday | —
14 September 2023 | Partly cloudy, light rain, good visibility, temperature higher than yesterday | —
15 September 2023 | Cold; partly cloudy; snowing; after the turn in the afternoon, the wind died down | High waves, about 5–6 points of storm
16 September 2023 | Cloudy, light wind, cold | Small waves, calm, glaciers on the horizon
17 September 2023 | Clear, calm, snowing | Small waves, ice floes on the horizon
18 September 2023 | Bad weather conditions, cold, ice crust on the cover | —
19 September 2023 | Cloudy, light wind | Small waves
20 September 2023 | Cloudy | Strong waves
21 September 2023 | Storm | —
22 September 2023 | Strong wind, snow in the morning, drizzle and rain later; cold | Strong waves with temporary calm
23 September 2023 | Cloudy, cold, thin crust of ice | —
24 September 2023 | Cloudy, strong wind, cold, ice crust on the deck | Strong wind, storm warning
25 September 2023 | Warm, no wind | Pitching
26 September 2023 | Cloudy, snowy, changeable weather, heavy snowfall at times | —
27 September 2023 | Storm | Strong pitching
28 September 2023 | — | —
Table 2. Information about video recordings for each day of the expedition: the time and GPS coordinates (degrees and decimal minutes) of the start of filming and the duration of filming in minutes.

Date | Video Start Time | Video Start Coordinate (Latitude) | Video Start Coordinate (Longitude) | Duration in Minutes
12 September 2023 | 07:07 | 70 16.080 | 36 52.512 | 5
 | 09:24 | 70 32.82 | 37 41.631 | 41
 | 11:50 | 70 50.310 | 38 39.420 | 130
 | 14:10 | 71 05.705 | 39 29.150 | 135
 | 16:30 | 71 22.667 | 40 18.365 | 70
13 September 2023 | 06:02 | 72 56.817 | 45 17.595 | 131
 | 08:20 | 73 11.475 | 46 09.753 | 150
 | 09:52 | 73 21.315 | 46 44.810 | 135
 | 12:20 | 73 37.618 | 47 42.211 | 145
14 September 2023 | 06:00 | 75 37.15 | 55 17.32 | 110
 | 12:00 | 76 06.99 | 57 44.39 | 135
 | 14:18 | 76 18.14 | 58 48.45 | 102
15 September 2023 | 06:40 | 77 37.353 | 59 46.328 | 5
 | 08:10 | 77 50.013 | 59 43.612 | 135
 | 10:30 | 78 00.3213 | 59 28.5298 | 118
 | 13:17 | 78 06.2857 | 59 38.4241 | 118
 | 16:20 | 78 17.3858 | 59 36.7700 | 100
16 September 2023 | 05:55 | 79 10.7446 | 59 40.7852 | 175
 | 09:15 | 79 36.4586 | 60 03.5846 | 155
 | 11:57 | 79 42.569 | 61 56.462 | 143
 | 14:50 | 79 48.3607 | 63 00.9514 | 150
 | 17:50 | 79 42.238 | 64 31.098 | 142
17 September 2023 | 06:30 | 79 41.6299 | 67 34.9835 | 130
 | 08:45 | 79 41.2746 | 68 18.6514 | 80
 | 10:10 | 79 41.004 | 69 06.568 | 123
 | 12:19 | 79 40.490 | 69 07.510 | 49
 | 14:25 | 79 40.4855 | 70 06.1465 | 90
 | 15:58 | 79 40.3615 | 70 41.7312 | 116
18 September 2023 | 06:48 | 79 26.2455 | 72 03.7967 | 82
 | 09:50 | 79 21.8364 | 73 07.0075 | 140
 | 16:17 | 79 14.2580 | 76 53.2420 | 89
 | 17:50 | 79 15.4395 | 78 09.4011 | 10
19 September 2023 | 07:30 | 79 02.8804 | 74 04.3837 | 145
 | 09:00 | 78 59.1033 | 72 59.7819 | 185
 | 12:09 | 78 55.3820 | 72 54.0020 | 159
 | 14:53 | 78 34.0059 | 71 55.2759 | 62
20 September 2023 | 05:25 | 77 27.1030 | 68 50.2099 | 115
21 September 2023 | 15:00 | 77 02.1310 | 67 38.4386 | 120
22 September 2023 | 04:40 | 77 02.1310 | 67 38.4386 | 160
 | 07:25 | — | — | 80
 | 08:50 | 77 34.399 | 68 75.139 | 175
 | 11:50 | 77 23.8620 | 69 23.0397 | 100
 | 13:40 | 77 35.555 | 70 07.878 | 120
 | 15:45 | 77 48.5310 | 70 55.8168 | 105
23 September 2023 | 05:25 | 78 54.4381 | 69 56.7401 | 25
 | 07:40 | 79 03.518 | 69 06.128 | 115
 | 09:40 | 79 05.5052 | 68 50.5778 | 132
 | 11:55 | 79 13.326 | 67 59.090 | 95
 | 13:35 | 79 20.602 | 67 10.2569,5 | 150
 | 16:10 | 79 28.363 | 65 00.677 | 102
24 September 2023 | 04:26 | 78 25.1809 | 66 38.7908 | 44
 | 07:15 | 78 11.190 | 67 09.491 | 186
25 September 2023 | 12:20 | 76 39.850 | 73 00.489 | 95
 | 14:05 | 76 24.034 | 72 57.5612 | 140
26 September 2023 | 05:10 | 75 41.7768 | 70 00.1976 | 160
 | 10:15 | 75 47.6183 | 70 01.5719 | 90
 | 14:45 | 76 09.3151 | 70 01.0760 | 98
27 September 2023 | 11:35 | 77 02.0505 | 63 32.7308 | 205
 | 14:40 | 76 49.819 | 61 47.338 | 140
28 September 2023 | 07:35 | 74 54.2312 | 53 11.6773 | 132
 | 09:52 | 74 37.0298 | 52 06.5156 | 158
 | 12:35 | 74 16.630 | 50 53.678 | 152
 | 15:12 | 73 57.9836 | 49 46.4342 | 142
Table 3. Mean, standard deviation, and 25th, 50th (median), 75th, and 95th percentile values of the annotated bounding-box sizes for each class and for all classes combined.

Class | Mean | STD | p25 | Median (p50) | p75 | p95
Bird | 45.89 | 47.41 | 26.97 | 37.24 | 52.15 | 95.62
Litter | 116.89 | 117.43 | 46.52 | 78.66 | 134.01 | 326.29
Glare | 800.31 | 597.10 | 410.11 | 611.07 | 934.21 | 2212.74
Droplets | 516.10 | 278.75 | 350.29 | 444.06 | 597.15 | 989.66
Total | 480.89 | 348.16 | 312.09 | 422.73 | 585.5 | 1032.03
Table 4. Final hyperparameter values for the classifier models. “Not Available” (N/A) indicates that a parameter is not applicable to that classifier.

Parameter | Random Forest | Balanced Random Forest | CatBoost
n_estimators/iterations | 200 | 600 | 1000
max_depth/depth | 14 | 7 | 12
min_samples_split | 2 | 7 | N/A
min_samples_leaf | 2 | 2 | N/A
max_features | sqrt | log2 | N/A
bootstrap | False | True | N/A
learning_rate | N/A | N/A | 0.0001
loss_function | N/A | N/A | MultiClass
Table 5. Quality metric values after training the YOLO model on the full dataset.

Object Class | Precision | Recall | F1-Score
Marine Litter | 0.19418 | 0.0625 | 0.094564
Bird | 0.72491 | 0.54591 | 0.6228
Glare | 0.11373 | 0.17202 | 0.13693

Table 6. Quality metric values after training the YOLO model on a dataset consisting only of anomalies.

Object Class | Precision | Recall | F1-Score
Marine Litter | 0.23353 | 0.0625 | 0.09861
Bird | 0.79807 | 0.67757 | 0.73290
Glare | 0.22871 | 0.32325 | 0.26788

Table 7. Best metric values in the classification approach.

Object Class | Precision | Recall | F1-Score
Marine Litter | 0.36 | 0.45 | 0.40
Bird | 0.54 | 0.90 | 0.67
Glare | 0.46 | 0.47 | 0.46

Table 8. Best metric values in the VOD approach.

Object Class | Precision | Recall | F1-Score
Marine Litter | 0.23353 | 0.0625 | 0.09861
Bird | 0.79807 | 0.67757 | 0.73290
Glare | 0.22871 | 0.32325 | 0.26788
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
