Deep-Feature-Based Approach to Marine Debris Classification

The global community has recognized the increasing amount of pollutants entering oceans and other water bodies as a severe environmental, economic, and social issue. In addition to prevention, one of the key measures in addressing marine pollution is the cleanup of debris already present in marine environments. Deployment of machine learning (ML) and deep learning (DL) techniques can automate marine waste removal, making the cleanup process more efficient. This study examines the performance of six well-known deep convolutional neural networks (CNNs), namely VGG19, InceptionV3, ResNet50, Inception-ResNetV2, DenseNet121, and MobileNetV2, utilized as feature extractors according to three different extraction schemes for the identification and classification of underwater marine debris. We compare the performance of a neural network (NN) classifier trained on top of deep CNN feature extractors when the feature extractor is (1) fixed; (2) fine-tuned on the given task; (3) fixed during the first phase of training and fine-tuned afterward. In general, fine-tuning resulted in better-performing models but was much more computationally expensive. The best overall NN performance was achieved with the fine-tuned Inception-ResNetV2 feature extractor (91.40% accuracy and 92.08% F1-score), followed by the fine-tuned InceptionV3 extractor. Furthermore, we analyze the performance of conventional ML classifiers trained on features extracted with deep CNNs. Finally, we show that replacing the NN with a conventional ML classifier, such as a support vector machine (SVM) or logistic regression (LR), can further enhance the classification performance on new data.


Introduction
The rising level of marine pollution is a growing environmental problem in today's society. Our oceans, seas, and other water bodies are polluted with various waste items that threaten coastal wildlife, habitats, human safety, and the economic health of coastal communities [1]. Discarded fishing gear continues to trap and kill marine life. Animals such as seabirds or turtles often mistake plastic debris for food due to its similar appearance and odor, which leads to their malnutrition and starvation [2,3]. In addition, seafood contaminated with microplastics, i.e., small plastic particles less than 5 mm in size, is potentially toxic for humans who consume it [4][5][6]. Furthermore, marine debris induces high economic costs in industries such as tourism, aquaculture, and fisheries [7]. The key measures for addressing marine pollution include detailed knowledge of the magnitudes, sources, and impacts of marine debris, general behavioral change, enhancement of the circular economy, prevention of waste items entering the marine environment, reduction of waste generation, and removal of debris already present in the marine environment [8]. Identification and cleanup of marine debris, especially debris deep below the water surface, is challenging and expensive. Therefore, to make the cleanup process more efficient, automated detection and removal of marine debris is desired. The latter can be realized with the aid of autonomous underwater vehicles (AUVs) and deep-learning-based visual identification of underwater waste. However, training deep networks from scratch requires large amounts of labeled data, which are scarce in this domain. A common and effective strategy to overcome a deficit of labeled training data is transfer learning, i.e., knowledge transfer from a source to the desired target domain [33]. The idea is to reuse features learned by a network pre-trained on a very large dataset, such as the ImageNet [34] dataset, which contains 1.2 million training images from 1000 classes, on a new task of interest.
There are two common approaches to transfer learning: (1) using a pre-trained network as a fixed feature extractor [35,36]; (2) fine-tuning a pre-trained network's weights on a target dataset [37,38]. In the first approach, the last, usually fully connected, part of the network is removed, while the remaining part (convolutional base) stays fixed during the training. On top of the fixed base used for feature extraction, a new classifier, either a new neural network or a conventional machine learning classifier such as a support vector machine or a logistic regression classifier, is added to learn underlying patterns from extracted features and discriminate data according to them. The fine-tuning approach uses the pre-trained network weights as a weight initialization scheme. During the training, these weights are updated by backpropagating errors from the target task into the base network to improve the final generalization performance on a new task of interest [37].
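The two approaches can be sketched with Keras' `trainable` flag; the tiny convolutional base below is an illustrative stand-in for a network pre-trained on ImageNet (names and sizes are assumptions, not the paper's setup):

```python
import tensorflow as tf

# Tiny stand-in for a pre-trained convolutional base (in practice, e.g. a
# network loaded with ImageNet weights); sizes here are illustrative only.
base = tf.keras.Sequential([
    tf.keras.Input((32, 32, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
], name="conv_base")

# Approach (1): fixed feature extractor -- the base is frozen and only the
# newly added classifier head is trained.
base.trainable = False
model = tf.keras.Sequential([base, tf.keras.layers.Dense(6, activation="softmax")])
_ = model(tf.zeros((1, 32, 32, 3)))                  # build the model
n_trainable_frozen = len(model.trainable_weights)    # Dense kernel + bias only

# Approach (2): fine-tuning -- the pre-trained weights serve as initialization
# and are updated together with the new head during training.
base.trainable = True
n_trainable_finetune = len(model.trainable_weights)  # Conv2D and Dense weights
```

Toggling `trainable` is all that separates the two regimes; in the fixed case the optimizer never touches the base's weights, while in the fine-tuning case backpropagated errors update them as well.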
The successful employment of transfer learning has been reported on computer vision tasks from various domains, such as medical image analysis [39][40][41][42], agriculture [43,44], remote sensing [45,46], and the textile industry [47]. Transfer learning has been utilized in several works considering marine debris identification and classification. The VGG16 network pre-trained on the ImageNet dataset is used as a fixed feature extractor for plastic marine waste classification in [25,26]. In [24], object-detection-based transfer learning is applied on the proposed RetinaNet model with ResNet50 backbone to detect waste items in aquatic surroundings. Successful object detection networks, YOLOv2, Tiny-YOLO, Faster RCNN, and Single Shot MultiBox Detector (SSD), are fine-tuned in [27] for underwater detection of three classes of objects, including plastic marine debris. As a backbone for their Mask R-CNN architecture for seafloor litter detection, Politikos et al. [28] use MobileNetV1 architecture pre-trained on the COCO detection dataset.
The main goals of this study were to: (1) develop a model for autonomous identification and classification of different types of marine debris; (2) compare the performance of prominent deep convolutional architectures, including VGG19, InceptionV3, ResNet50, Inception-ResNetV2, DenseNet121, and MobileNetV2, on the task of marine debris classification; (3) investigate different schemes for utilizing transfer learning for marine debris classification: fixed feature extraction, fine-tuning, and a combination of both; (4) compare the performance of conventional machine learning classifiers trained on feature vectors extracted by deep convolutional architectures.
The rest of the paper is structured as follows. Section 2 describes materials and methods used in this study: dataset (Section 2.1), implemented deep convolutional architectures (Section 2.2), and machine learning algorithms (Section 2.3). Section 3 gives implementation details and experimental settings. Experimental results are presented in Section 4 and discussed in Section 5. In Section 6, concluding remarks and directions for future work are given.

Dataset
A large annotated dataset of underwater trash is needed to utilize a deep-learning approach for marine debris detection and classification. The Japan Agency for Marine-Earth Science and Technology (JAMSTEC) has made their Deep-sea Debris Database, which contains numerous marine debris videos and photos, available online to the public [48]. In our work, we used data from the Deep-sea Debris Database complemented with Google Images. Images were manually labeled and validated by one of the researchers. Each image was visually inspected prior to being added to the dataset. The final dataset contains 2395 images from six different classes: glass, metal, plastic, rubber, other trash, and no trash. Figure 1 provides a sample of images from the dataset. To ensure that training and test sets have the same distribution of considered classes, available data are divided into a training set and a test set as follows: 20% of the images from each class were set aside for the final model evaluation, while the remaining 80% were left for training. Table 1 shows the class-wise distribution of data in the original dataset as well as in the training and test subsets.
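The stratified 80/20 split described above can be reproduced with scikit-learn's `train_test_split`; the per-class counts below are illustrative stand-ins (summing to the dataset's 2395 images), not the paper's exact distribution:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Illustrative stand-in labels for the six classes; the real dataset has 2395
# labeled images with a different (uneven) class distribution.
labels = ["glass"] * 300 + ["metal"] * 400 + ["plastic"] * 800 + \
         ["rubber"] * 200 + ["other trash"] * 350 + ["no trash"] * 345
images = list(range(len(labels)))  # placeholders for image data

# Stratified 80/20 split: each class keeps the same proportion in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.20, stratify=labels, random_state=42)

train_counts, test_counts = Counter(y_train), Counter(y_test)
```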

Deep Convolutional Architectures
Compared to traditional computer vision techniques, deep convolutional networks provide better accuracy in image classification tasks. They also offer superior flexibility, since they can be retrained on custom datasets and require less human expert analysis and fine-tuning [49]. This section describes the six deep CNN architectures utilized in this work for marine debris classification and extraction of dense image representations.
The main components repeated throughout the following architectures are convolutional and spatial pooling layers: max pooling, average pooling, and global average pooling. Convolutional layers convolve the input with kernels shared across all of the input's spatial locations to obtain feature maps. The feature map value at position (i, j) obtained with the k-th kernel (filter) is calculated as z_{i,j,k} = w_k^T x_{i,j} + b_k, where w_k and b_k denote the k-th kernel's weight vector and bias term, while x_{i,j} denotes the input patch centered at (i, j). A nonlinear activation function g : R → R is applied element-wise to the obtained feature maps, yielding activations a_{i,j,k} = g(z_{i,j,k}) [20]. The pooling layers aggregate information within a feature map a_k by replacing each local pooling region R_{i,j} in a_k with the maximum element of R_{i,j} in the case of max pooling, and with the arithmetic mean of the elements in R_{i,j} in the case of average pooling. Global average pooling, on the other hand, averages all values in a_k.
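A minimal NumPy sketch of these operations, using an arbitrary random patch and kernel for the convolutional response and a small hand-picked region for the pooling operators:

```python
import numpy as np

# One convolutional response z_{i,j,k} = w_k^T x_{i,j} + b_k for a single
# kernel at one spatial position, followed by a ReLU activation.
rng = np.random.default_rng(0)
x_patch = rng.normal(size=27)        # flattened 3x3x3 input patch x_{i,j}
w_k, b_k = rng.normal(size=27), 0.1  # k-th kernel's weights and bias term
z = w_k @ x_patch + b_k
a = np.maximum(z, 0.0)               # activation g applied element-wise

# The three pooling operations on a 2x2 region / feature map.
feature_map = np.array([[1.0, 2.0],
                        [3.0, 4.0]])
max_pooled = feature_map.max()       # max pooling: maximum of the region
avg_pooled = feature_map.mean()      # average pooling: mean of the region
gap = feature_map.mean()             # global average pooling: mean of the whole map
```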

VGG19
Simonyan et al. [22] proposed several deep configurations of CNN architecture with a different number of weight layers. These architectures, known as VGGNets, stack convolutional layers with small 3 × 3 receptive fields in blocks followed by a max-pooling layer. In one network configuration, they utilize 1 × 1 convolutions to increase the nonlinearity of the decision function without changing the receptive fields of convolutional layers. In this work, we employ VGG19 architecture that has 19 weight layers. The VGG19 architecture with ≈ 143.47M trainable parameters, as defined in [22], is illustrated in Figure 2.

InceptionV3
Although architecturally simple and with good generalization performance, VGGNets come with high computational costs. To obtain an efficient deep model with reduced computational costs and high generalization ability, Szegedy et al. stack Inception modules in the 22-weight-layer-deep GoogLeNet architecture [50]. In contrast to conventional architectures, where either a convolution (with a single filter size) or a pooling operation is performed, Inception modules perform convolutions with different filter sizes (1 × 1, 3 × 3, and 5 × 5) in parallel along with a max-pooling operation and pass the concatenated results forward through the network. Varying filter sizes enable the model to capture spatial information at different scales at the same level in the network. Computational efficiency is preserved by adding an extra 1 × 1 convolution before the expensive 3 × 3 and 5 × 5 convolutions and after the pooling layer, as illustrated in Figure 3. Two auxiliary classifiers are connected to intermediate Inception modules to propagate gradients effectively through all network layers. A later variant of the GoogLeNet architecture employed in this study, the InceptionV3 architecture [51], modifies the original one in the following ways: (1) larger convolutions in the Inception modules are factorized into smaller ones, i.e., the 5 × 5 convolution is replaced by two 3 × 3 convolutions in Inception module A; (2) Inception module B factorizes symmetric 7 × 7 convolutions into asymmetric 1 × 7 and 7 × 1 convolutions; (3) Inception module C, introduced to promote high-dimensional representations, replaces the 3 × 3 convolution with parallel asymmetric 1 × 3 and 3 × 1 convolutions; (4) one auxiliary classifier with batch normalization is used as a regularizer together with the label smoothing technique; (5) efficient size reduction with parallel convolutional and pooling blocks with stride two is employed in the reduction modules.
Figures 4 and 5 show modules and schematic representation of the InceptionV3 architecture. All convolutional layers employ batch normalization and use the ReLU activation function.

ResNet50
In [52], He et al. address the training accuracy degradation problem in deep network architectures by introducing the residual learning framework. Let f(x) be the underlying mapping to be learned with several network layers, where x denotes the input to the first of these layers. In the residual learning framework, the stacked network layers learn the residual mapping f_R(x) := f(x) − x, which is easier to optimize than the underlying f(x). The original mapping f(x) now corresponds to f_R(x) + x, which is realized with shortcut connections and element-wise additions in a feedforward convolutional network.
Residual Networks (ResNets) employ shortcut connections together with the batch normalization technique (after each convolution and before activation) to ease the training of deep network architectures and enjoy accuracy gains from increased network depth. Figure 6 shows the two main residual modules, the Convolution and Identity blocks, which comprise the 50/101/152-layer ResNet framework. Figure 6. Two types of residual modules. Modules calculate f_R(x) + g(x) (element-wise), where x denotes the module's input, f_R(x) the output of three stacked convolutional layers, and g the projection function used to match the dimensions of x and f_R(x) in the convolution block, and the identity function g(x) = x in the identity block.
Identity shortcuts in identity blocks can only be used when the input and output have the same dimension. Otherwise, a projection shortcut with 1 × 1 convolutions, i.e., the convolution block from Figure 6, is used to match the input and output dimensions. The 50-layer-deep ResNet50 architecture employed in this study is illustrated in Figure 7.
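A toy NumPy sketch of the two shortcut types, with small random matrices standing in for the stacked convolutional layers and the 1 × 1 projection:

```python
import numpy as np

# Residual learning sketch: the stacked layers learn f_R(x) = f(x) - x and the
# block outputs f_R(x) + x via the shortcut. f_R is a toy stand-in here
# (a linear map plus ReLU) for the block's stacked convolutional layers.
def f_residual(x, W):
    return np.maximum(W @ x, 0.0)

rng = np.random.default_rng(1)
x = rng.normal(size=4)

# Identity block: input and output dimensions match, so g(x) = x.
W = rng.normal(size=(4, 4))
block_output = f_residual(x, W) + x

# Convolution block: dimensions differ, so a projection g (a 1x1 convolution
# in ResNets, a plain matrix here) maps x to the residual branch's shape.
W2 = rng.normal(size=(8, 4))
P = rng.normal(size=(8, 4))          # projection shortcut
projected_output = f_residual(x, W2) + P @ x
```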

Inception-ResNetV2
The Inception-ResNet [53] architecture combines the well-performing Inception architecture [51] with the residual learning framework [52] by replacing the filter concatenation stage in Inception modules with residual connections. In the residual versions of Inception modules (Inception-ResNet A, B, and C modules in Figure 8), a 1 × 1 convolution without activation, i.e., with linear activation g(x) = x, is added before the summation to scale up the dimensionality of a given volume to match the input depth. Batch normalization is utilized only on top of the traditional layers, not on the summations. The introduction of residual connections into the Inception architecture significantly accelerates the training of Inception networks with increased depth. However, the residual version of the Inception network is prone to instabilities during training when the number of filters exceeds 1000. To stabilize the training procedure, Inception-ResNets scale down the residuals before the addition. A schematic representation of the Inception-ResNetV2 network architecture is shown in Figure 9.

DenseNet121
Huang et al. [54] address the vanishing gradient problem of deep convolutional networks by introducing a new, dense connectivity pattern in the Densely Connected Convolutional Network (DenseNet) architecture. The main idea is to connect all layers directly with each other in a feed-forward manner to improve the information and gradient flow between layers, as illustrated in Figure 10. The m-th layer L_m receives the original input x_0 and the outputs x_1, . . . , x_{m−1} of all preceding layers L_1, . . . , L_{m−1} as input and outputs x_m = H_m([x_0, x_1, . . . , x_{m−1}]), where H_m denotes the transformation performed by layer L_m and [x_0, x_1, . . . , x_{m−1}] denotes the depth-wise concatenation of the volumes x_0, x_1, . . . , x_{m−1} (it is assumed that all volumes have the same width and height). If each layer outputs g feature maps and d denotes the depth (number of channels) of the input, then layer L_m receives the input volume [x_0, x_1, . . . , x_{m−1}] of depth d + (m − 1)g and forwards a total of d + mg feature maps to the next layer. The architecture of the densely connected network with 121 weight layers (excluding batch normalization layers), DenseNet121, is shown in Figure 11. Its four dense blocks comprise 6, 12, 24, and 16 smaller building blocks, each consisting of a 1 × 1 convolution, which reduces the number of input features (added for computational efficiency), and a 3 × 3 convolution, which produces g = 32 feature maps and concatenates them to the original input volume, as illustrated in Figure 12. Transition layers following the dense blocks reduce the depth and spatial size of the input volume. Reuse of features learned in earlier layers encourages the classifier to use features of all complexity levels, removes the need to learn redundant features, and results in a narrower architecture requiring fewer parameters.
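The depth growth d + (m − 1)g can be traced with a small NumPy sketch of one dense block (six building blocks and growth rate g = 32, as in DenseNet121's first dense block; the "layers" here just fabricate feature maps for illustration):

```python
import numpy as np

# Dense connectivity sketch: layer L_m receives the depth-wise concatenation
# [x_0, x_1, ..., x_{m-1}] and contributes g new feature maps, so its input
# depth is d + (m-1)*g. Spatial size 4x4 and values are placeholders.
d, g, height, width = 64, 32, 4, 4
volumes = [np.zeros((height, width, d))]   # x_0: original input of depth d
input_depths = []

for m in range(1, 7):                      # six building blocks, as in block 1
    concatenated = np.concatenate(volumes, axis=-1)  # input to layer L_m
    input_depths.append(concatenated.shape[-1])      # equals d + (m-1)*g
    x_m = np.zeros((height, width, g))               # L_m outputs g feature maps
    volumes.append(x_m)

final_depth = np.concatenate(volumes, axis=-1).shape[-1]  # d + 6*g maps forwarded
```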

MobileNetV2
Lightweight MobileNet architectures are intended for mobile and embedded vision applications. The primary building block of MobileNetV1 [55] architecture is depth-wise separable convolution, which, unlike standard convolution, separates filtering and combining of input features into two distinct stages: (1) depth-wise convolution, which filters input features by applying a single convolution kernel per input channel; (2) point-wise 1 × 1 convolution used to linearly combine depth-wise convolution output channels into new features. Implemented 3 × 3 depth-wise separable convolutions require 8 to 9 times less computation compared to standard convolutions at the cost of a small reduction in accuracy [55].
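The claimed saving follows directly from parameter counting; a short arithmetic check for a hypothetical 3 × 3 convolution with 256 input and 256 output channels:

```python
# Parameter-count comparison between a standard k x k convolution and a
# depth-wise separable one, for c_in input and c_out output channels.
# The channel counts are illustrative, not taken from MobileNet itself.
k, c_in, c_out = 3, 256, 256

standard = k * k * c_in * c_out   # one k x k x c_in kernel per output channel
depthwise = k * k * c_in          # one k x k kernel per input channel
pointwise = c_in * c_out          # 1x1 convolutions combining the channels
separable = depthwise + pointwise

reduction = standard / separable  # approaches k*k = 9 as c_out grows
```

For 3 × 3 kernels the ratio lands between 8 and 9, matching the "8 to 9 times less computation" figure cited above.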
The MobileNetV2 [56] architecture upgrades ideas from its predecessor, MobileNetV1. It retains the depth-wise separable convolution as an efficient building block and introduces linear bottlenecks and shortcut connections into the architecture. The MobileNet architecture utilizes the ReLU6 activation function, ReLU6(x) = min{max{x, 0}, 6}, on all layers except the linear 1 × 1 point-wise convolution layers colored in yellow in Figures 13 and 14, and the final softmax layer. The inverted residual block characteristic of the MobileNetV2 architecture, illustrated in Figure 13, places shortcuts between the narrow layers instead of between the expanded data representations used in traditional residuals. Figure 13. MobileNetV2 building blocks. Idea: (1) uncompress the received data; (2) filter the data using the lightweight depth-wise convolution; (3) compress the data to a low-dimensional representation; (4) combine the input data with the new compressed data representation. The point-wise convolutional layer in the MobileNetV2 architecture is also known as the projection layer, since it projects data with a large number of channels into a smaller-depth output (ed >> f).

Machine Learning Classifiers
This section describes conventional machine learning classifiers, which were used to classify marine debris images based on extracted feature vectors.

Random Forests
The random forest (RF) classifier [57] comprises numerous decision trees that operate as an ensemble. The bagging technique combined with randomized feature selection is used to build a large collection of de-correlated decision trees. Let n_t be the number of data instances in the training set D_t, and m the number of features given for every data instance. Each decision tree in the ensemble is trained on a new dataset built by randomly sampling n_t instances (with replacement) from D_t. During tree growing, at each node a subset of m̃ ≤ m features is chosen at random, and the best feature (splitter) among them is used to split the node. At inference, each decision tree in the ensemble votes for its preferred class, and the RF outputs the class with the majority of votes as the final prediction.
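A minimal scikit-learn sketch of this procedure on synthetic data (hyperparameter values are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: 200 instances with m = 16 features and a simple separable label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

rf = RandomForestClassifier(
    n_estimators=100,     # number of trees in the ensemble
    max_features="sqrt",  # random feature subset considered at each split
    bootstrap=True,       # each tree sees n_t instances sampled with replacement
    random_state=0,
).fit(X, y)

# At inference each tree votes; the forest returns the majority class.
majority_vote = rf.predict(X[:5])
```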

k-Nearest Neighbors
The k-Nearest Neighbors (kNN) is a simple representative of lazy learning algorithms. It stores all the training data and delays the processing until a new data instance needs to be classified. The algorithm assigns a new data point to the majority class of k training data instances that are the closest in distance to the received data point. The most common choice for distance measure is Euclidean distance. However, any other metric can be used for distance calculation.

Support Vector Machines
For linearly separable training data D = {(x_1, y_1), . . . , (x_n, y_n)} ⊆ R^m × {±1}, the support vector machine (SVM) [58] classifier aims to find the optimal hyperplane ⟨w, x⟩ + b = 0, an affine subspace of dimension m − 1, which maximizes the margin of separation between the two classes. The corresponding decision function is given by f(x) = sign(⟨w, x⟩ + b). Since the weight vector w of the optimal hyperplane is a linear combination of training examples x_i [58], calculating f(x) only requires computing the inner products ⟨x, x_i⟩. When dealing with nonlinearly separable data, SVM first maps the original data into a high-dimensional (inner-product) feature space F via a nonlinear map φ : R^m → F and applies the linear method to the obtained data in F. The inner products ⟨φ(x), φ(x_i)⟩ can be calculated efficiently using a kernel function k : R^m × R^m → R, i.e., a function that satisfies k(x, y) = ⟨φ(x), φ(y)⟩ for all x, y ∈ R^m. In this paper, we use the radial basis function (RBF) kernel and extend the binary SVM classifier to the multiclass problem using the one-versus-one approach, i.e., training K(K − 1)/2 binary classifiers, one for each pair of the K > 2 possible classes, and using majority voting to obtain the final prediction.
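A minimal scikit-learn sketch: with K = 6 synthetic classes, the one-versus-one RBF SVM internally trains K(K − 1)/2 = 15 binary classifiers, visible in the width of the pairwise decision function:

```python
import numpy as np
from sklearn.svm import SVC

# Six synthetic, well-separated clusters standing in for the six classes.
rng = np.random.default_rng(0)
K = 6
X = np.vstack([rng.normal(loc=3 * i, size=(30, 2)) for i in range(K)])
y = np.repeat(np.arange(K), 30)

# RBF kernel with one-versus-one multiclass handling (scikit-learn's SVC
# uses the one-versus-one scheme internally).
svm = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, y)
n_binary = svm.decision_function(X[:1]).shape[1]  # K*(K-1)/2 pairwise scores
```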

Naive Bayes
The naive Bayes (NB) classifier relies on Bayes' theorem and the (naive) assumption of conditional independence between features. The probability of a class y given input x = (x_1, . . . , x_m) is p(y|x) = p(y)p(x|y)/p(x), where p(y) denotes the class prior probability, p(x|y) the class likelihood, and p(y|x) the posterior probability. The Bayesian classifier assigns the most probable class ŷ to an instance x, where ŷ = argmax_y p(y|x). Since p(x) is constant (and independent of y), this reduces to ŷ = argmax_y p(y) ∏_{i=1}^m p(x_i|y), where the factorization of the likelihood follows from the conditional independence assumption.
NB estimates p(x_i|y) and p(y) based on frequencies in the training data. In this paper, we employ Gaussian NB, which assumes a Gaussian likelihood of the features, p(x_i|y) = (1/√(2πσ_y²)) exp(−(x_i − µ_y)²/(2σ_y²)), where the parameters σ_y and µ_y are estimated using maximum likelihood.

Logistic Regression
Multinomial Logistic Regression (LR) extends the binary LR classifier to the multiclass classification problem. Let x = (x_1, . . . , x_m) be the input instance, y the label for the given instance, and K the number of possible classes, indexed with numbers 1 to K. Let w_i ∈ R^m and b_i ∈ R denote the weight vector and bias corresponding to class i ∈ {1, . . . , K}, which are learned from the training data by optimizing the loss function. Given input x, first the vector z = (z_1, . . . , z_K) is calculated, where z_i = ⟨w_i, x⟩ + b_i. To obtain a per-class probability distribution, the softmax function g : R^K → R^K, defined by g(z)_i = exp(z_i) / Σ_{j=1}^K exp(z_j), is applied to the vector z. We interpret the value g(z)_i as the probability p(y = i|x) of y being class i. For input x, the LR classifier predicts the class i for which p(y = i|x) is maximal.
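The forward pass can be sketched in a few lines of NumPy; the weights here are arbitrary stand-ins for learned parameters:

```python
import numpy as np

# Multinomial LR forward pass: z_i = <w_i, x> + b_i, then the softmax
# g(z)_i = exp(z_i) / sum_j exp(z_j), interpreted as p(y = i | x).
def softmax(z):
    e = np.exp(z - z.max())  # shift by max(z) for numerical stability
    return e / e.sum()

K, m = 6, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(K, m))  # one weight vector w_i per class (stand-ins)
b = rng.normal(size=K)
x = rng.normal(size=m)

z = W @ x + b
probs = softmax(z)
predicted_class = int(np.argmax(probs))  # class with maximal p(y = i | x)
```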

Experimental Setup
For the implementation of deep models, we used the Python 3.7.6 programming language together with the TensorFlow 2.1.0 [59] machine learning framework and the Tensorflow.Keras API [60]. The conventional machine learning classifiers are implemented using the Scikit-learn [61] machine learning library for Python.

Extraction of Features
In conventional CNNs, the feature maps from the last convolutional layer are flattened into a feature vector and forwarded to fully connected layers and a final softmax classification layer [21,22]. Lin et al. [62] offer an alternative with global average pooling layers, which spatially compress the information contained in the feature maps into one vector. Dense feature vectors for marine debris images are extracted using the state-of-the-art convolutional model architectures described in Section 2.2, as illustrated in Figure 15. For each model, the last layer with a three-dimensional output is selected as the feature extraction layer. Global average pooling is applied to the output of that layer to obtain a feature vector from the three-dimensional output volume. Let w × h × n be the shape of the feature extraction layer's output and let f_1, f_2, . . . , f_n denote the n feature maps in the output volume. By applying global average pooling to the given volume, we obtain a vector (f̄_1, f̄_2, . . . , f̄_n), where f̄_i represents the mean of the w · h values in the feature map f_i, i = 1, . . . , n. Thus, the size of the feature vector matches the depth of the output volume of the feature extraction layer. All deep models used for feature extraction are loaded with weights pre-trained on the ImageNet [34] dataset. In each deep CNN architecture, all layers that follow the feature extraction layer are dropped and replaced by global average pooling for feature extraction. Table 2 lists all deep architectures employed in this work for feature extraction together with the total number of their parameters (after dropping the layers following the feature extraction layer) and the extracted vectors' sizes. A new neural network (NN) classifier, which has two fully connected (FC) layers with 256 and 128 neurons followed by a softmax layer with six neurons, is then added on top of the pooling layer, as illustrated in Figure 16.
The batch normalization [63] technique is applied to the newly added FC layers. Figure 16. Classification of marine debris images using a neural network classifier, which receives extracted feature vectors as inputs and outputs a class-wise probability distribution.
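The extractor-plus-head construction can be sketched in Keras as below; `weights=None` is used only so the sketch builds without downloading anything, whereas the described setup loads `weights="imagenet"`:

```python
import numpy as np
import tensorflow as tf

# Deep feature extractor: drop everything after the last layer with a
# three-dimensional output and apply global average pooling, so the feature
# vector's size equals that output volume's depth (1280 for MobileNetV2).
extractor = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, pooling="avg", weights=None)

features = extractor(np.zeros((1, 224, 224, 3), dtype="float32"))

# The new NN classifier head: two FC layers (256 and 128 neurons) with batch
# normalization, followed by a six-way softmax layer.
head = tf.keras.Sequential([
    tf.keras.Input((features.shape[-1],)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(6, activation="softmax"),
])
class_probs = head(features)
```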
In this paper, we use the deep CNN feature extractors in three different ways: (1) we freeze all layers of the extractor and train only the newly added NN classifier; (2) we fine-tune all weights of the deep CNN and train it together with the NN classifier, i.e., we use the loaded ImageNet weights as the starting point for training the model; (3) we first freeze the deep CNN and train the NN classifier on top, and afterward unfreeze the weights of the deep CNN to fine-tune them. In the following sections, these cases are denoted as (1) fixed feature extractor (FFE); (2) fine-tuning (FT); (3) FFE+FT. For the training of all models, we use the Adam [64] optimizer with learning rates as in Table 3, β_1 = 0.9, β_2 = 0.999, and ε = 10^−7. The learning rate of each model is chosen from a set of predefined values on a logarithmic scale using 5-fold cross-validation. Table 3. Learning rates.

All models were trained for 100 epochs. In the FFE+FT case, for the first 25 epochs, we train only the NN classifier on top and keep the layers of the deep CNN frozen. In the remaining 75 epochs, we fine-tune the weights of the deep CNN together with the NN classifier. For all models, we used small mini-batches of size 16, which require a smaller memory footprint than larger ones. Moreover, small batch sizes provide better generalization performance and optimization convergence [65,66]. Since we have a limited amount of training data at hand, we use data augmentation to expand the training set artificially. During the training, we augment images using random rotations, width and height shifts, shearing, and horizontal flipping.
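The described augmentation can be sketched with Keras' `ImageDataGenerator`; the parameter values below are illustrative, not the ones used in the experiments:

```python
import numpy as np
import tensorflow as tf

# Augmentation as described: random rotations, width/height shifts, shearing,
# and horizontal flips. The ranges here are assumed, illustrative values.
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.1,
    height_shift_range=0.1,
    shear_range=0.1,
    horizontal_flip=True,
)

images = np.random.rand(32, 64, 64, 3)  # stand-in image batch
labels = np.zeros(32)
batch_x, batch_y = next(datagen.flow(images, labels, batch_size=16))
```

Each call to the generator yields a freshly transformed mini-batch, so the network rarely sees the exact same image twice during training.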
For each model architecture, images are preprocessed into an adequate format using the corresponding preprocess_input function from the tf.keras.applications (https://www.tensorflow.org/api_docs/python/tf/keras/applications, accessed on 10 June 2021) module. More precisely, for the VGG19 and ResNet50 models, the images are converted from RGB to BGR format, and each color channel is zero-centered with respect to the unscaled ImageNet data. Pixel values of input images are scaled between −1 and 1 for the InceptionV3, Inception-ResNetV2, and MobileNetV2 architectures. Finally, for the DenseNet121 architecture, pixel values are scaled between 0 and 1, and then each color channel is normalized with respect to the ImageNet data.

Simple Neural Network Architecture
To compare the performance of common pre-trained deep model architectures with that of smaller neural networks, we constructed a simple neural network with the architecture described in Table 4. All convolutional layers are followed by batch normalization [63] and use the ReLU nonlinearity. This neural network was trained from scratch on the marine debris data for 150 epochs with mini-batches of size 16. We used the Adam optimization algorithm with learning rate 10^−4, β_1 = 0.9, β_2 = 0.999, and ε = 10^−7. Image pixel values are scaled between 0 and 1. During the training, the same augmentation techniques as for the pre-trained models were used to artificially enlarge the training dataset.

Evaluation Metrics
To evaluate the performance of marine debris classifiers, we use four quantitative metrics commonly used for multiclass classification problems: accuracy, precision, recall, and F1-score [67].
Suppose a given dataset D contains K > 2 different classes encoded with numbers 1, 2, . . . , K. Let C_{i,j} denote the number of samples classified as class j that actually belong to class i. The K × K matrix C = [C_{i,j}] is known as the confusion matrix. The overall accuracy of a model gives the proportion of correctly classified data points and is calculated as Accuracy = (1/n) Σ_{i=1}^K C_{i,i}, where n denotes the cardinality of D. Since accuracy weighs highly populated classes more, so that strong errors on classes with just a few examples are hard to identify, we complement accuracy scores with precision, recall, and F1 scores. We introduce the following notation for class i: TP_i = C_{i,i} (true positives), FP_i = Σ_{j≠i} C_{j,i} (false positives), and FN_i = Σ_{j≠i} C_{i,j} (false negatives). Precision corresponding to class i is calculated as Precision_i = TP_i/(TP_i + FP_i), while the corresponding recall is given by Recall_i = TP_i/(TP_i + FN_i). Precision measures the ability of a model to return only relevant instances, while recall expresses the model's ability to find all relevant instances in a dataset. The F1-score of the i-th class combines the corresponding precision and recall scores by calculating their harmonic mean, resulting in F1_i = 2 · Precision_i · Recall_i/(Precision_i + Recall_i). The obtained per-class metrics are aggregated into overall macro scores computed as simple arithmetic means: Score^(macro) = (1/K) Σ_{i=1}^K Score_i, where Score is either Precision, Recall, or F1-score. Sometimes the macro F1-score is calculated as F1^(macro2) = 2 · Precision^(macro) · Recall^(macro)/(Precision^(macro) + Recall^(macro)) [67,68]. However, we use F1^(macro) rather than F1^(macro2), since it is more robust toward the error type distribution [68] and it is also implemented in Python's sklearn library [61]. Macro scores do not take the class imbalance into account, so weighted scores are additionally computed to address the uneven data distribution. Weighted scores average the class-wise scores weighted by the number of class instances: Score^(weighted) = (1/n) Σ_{i=1}^K n_i · Score_i, where n_i denotes the support for class i.
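All of the above scores can be computed directly from a confusion matrix; a NumPy sketch with a small hypothetical 3-class matrix:

```python
import numpy as np

# Hypothetical confusion matrix C, where C[i, j] counts samples of true
# class i predicted as class j (values chosen for illustration only).
C = np.array([[50,  5,  5],
              [10, 80, 10],
              [ 0,  5, 35]])

n = C.sum()
accuracy = np.trace(C) / n         # proportion of correctly classified points

TP = np.diag(C)
FP = C.sum(axis=0) - TP            # predicted as i but belonging elsewhere
FN = C.sum(axis=1) - TP            # belonging to i but predicted elsewhere

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

f1_macro = f1.mean()               # simple arithmetic mean over classes
support = C.sum(axis=1)            # n_i: number of instances of class i
f1_weighted = (support * f1).sum() / n
```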
In addition to the previously mentioned performance measures, we use Cohen's Kappa coefficient [69], which expresses the level of agreement between classifier predictions and actual class labels. It is defined as κ = (p_o − p_c)/(1 − p_c), where p_o denotes the observed probability of agreement and p_c the probability of agreement by chance. κ < 0 indicates poor agreement, κ ∈ (0, 0.2] slight, κ ∈ (0.2, 0.4] fair, κ ∈ (0.4, 0.6] moderate, κ ∈ (0.6, 0.8] substantial, and κ ∈ (0.8, 1] almost perfect agreement [70]. The Kappa coefficient shows how much better the given classifier performs than a random classifier that predicts class labels based on the class frequencies.
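Cohen's kappa can likewise be computed from the confusion matrix marginals; a short NumPy sketch with a hypothetical 3-class matrix:

```python
import numpy as np

# Hypothetical confusion matrix (true classes in rows, predictions in columns).
C = np.array([[50,  5,  5],
              [10, 80, 10],
              [ 0,  5, 35]])

n = C.sum()
p_o = np.trace(C) / n                                  # observed agreement
p_c = (C.sum(axis=0) * C.sum(axis=1)).sum() / n**2     # chance agreement from marginals
kappa = (p_o - p_c) / (1 - p_c)
```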

Deep CNN Architectures
This section compares the final performance of marine debris classifiers when different deep CNN architectures are utilized to extract image features. For each architecture, the three feature extraction-tuning schemes are employed and compared. Table 5 shows the values of the quantitative metrics described in Section 3.4, computed on test data that were not used for the networks' training. The overall best performance was achieved by the fine-tuned Inception-ResNetV2 architecture, which obtained the best results on all considered metrics, including 91.40% accuracy, 92.08% (macro) F1-score, and a Kappa coefficient of 0.89, while the fine-tuned InceptionV3 network achieved the second-best result with 90.57% accuracy, 91.07% F1-score, and a Kappa coefficient of 0.88. All deep feature extraction CNNs show better performance when their weights are fine-tuned in some way, either during the whole training process or only during the second phase of the training. The overall worst performance was shown by the VGG19 network used as a fixed feature extractor, with 77.15% accuracy, 78.15% F1-score, and a 0.71 Kappa coefficient, while the lightweight MobileNetV2 architecture displayed the worst performance among the fine-tuned models. Compared to the simple neural network trained from scratch (with the architecture from Table 4), which achieves an accuracy of 46.96% and (macro) average precision of 45.74%, recall of 43.07%, and F1-score of 43.68%, the models that apply transfer learning show significantly better classification performance on test images of marine debris. The neural network trained from scratch suffers from severe overfitting; it shows good performance on the training data but has low generalization capability.
Statistician Francis Anscombe demonstrated the importance of complementing numerical calculations with data visualizations to better perceive hidden data properties not captured by statistical analyses [71]. To illustrate the high-dimensional features extracted with deep CNN architectures, we use the t-SNE [72] algorithm to project feature vectors into a visualizable two-dimensional space. Figure 17 shows two-dimensional t-SNE projections of training features for the best-performing Inception-ResNetV2 and the lightweight MobileNetV2 architecture under the three transfer learning schemes. There is a notable difference in the class-wise separation of the two-dimensional features when the model fine-tunes the pre-trained weights on the task of interest and when the deep feature extractor stays fixed. During fine-tuning, a deep CNN extractor adjusts its weights so that the new features incorporate details and peculiarities of marine images important for characterizing each of the six considered classes. Thus, the fine-tuned features better discriminate between these classes. Greater separation between semantic clusters of features can be seen in the better-performing FT Inception-ResNetV2 model (Figure 17).
In general, models obtain higher precision than recall, i.e., fewer false positives than false negatives per class. Figure 18 shows class-wise F1-scores for each deep CNN architecture. The weakest performance is observed on the Other trash class, with its high in-class variation, and the best on the Rubber class, due to the unique round shape of its objects and the lower range of in-class variation. Aside from the low performance on the Other trash class, models using the ResNet50 and VGG19 extractors also underperformed on the Glass class.
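A t-SNE projection of this kind can be produced with scikit-learn. The snippet below uses random vectors as a stand-in for the 1536-dimensional pooled Inception-ResNetV2 features; the dimensionality and class count follow our setup, but the data are synthetic.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# stand-in for deep features: 300 vectors of dimension 1536 (Inception-ResNetV2)
features = rng.normal(size=(300, 1536)).astype(np.float32)
labels = rng.integers(0, 6, size=300)  # six debris classes

# project to two dimensions for plotting; perplexity must stay below n_samples
embedding = TSNE(n_components=2, perplexity=30,
                 init="pca", random_state=0).fit_transform(features)
print(embedding.shape)  # -> (300, 2)
```

The resulting two-dimensional points can then be scatter-plotted and colored by `labels` to inspect class-wise cluster separation, as in Figure 17.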

Deep Feature Classification with Conventional ML Classifiers
The neural network classifier on top of the deep feature extractor can be replaced with any conventional ML classifier. Table 6 provides the classification results of the RF, SVM, NB, LR, and KNN classifiers trained on deep features extracted with the best-performing feature extraction scheme (see Table 5) for a given CNN architecture. As we can see from Table 6, replacing the NN classifier with an SVM or LR classifier often improves the performance on new data; e.g., the SVM classifier with Inception-ResNetV2 obtains 91.61% accuracy, 92.27% macro F1-score, and a 0.90 Kappa coefficient, as opposed to the 91.40% accuracy, 92.08% F1-score, and 0.89 Kappa of the NN classifier. The most significant improvement was observed with the MobileNetV2 feature extractor, which achieves its best classification result with an SVM, at 85.32% accuracy, compared to the 82.60% accuracy of the NN classifier. Figure 20 shows the confusion matrices for the best-performing Inception architectures (InceptionV3 and Inception-ResNetV2) and the lightweight MobileNetV2 architecture designed to meet the resource constraints of mobile and embedded devices. Although replacing the NN classifier with an ML classifier slightly boosted the models' overall performance on new data, in several cases the performance on some classes dropped, e.g., Other trash for InceptionV3 and No trash for Inception-ResNetV2. Across all models (with both NN and ML classifiers), the confusion is most pronounced for the Other trash class, whose images are often assigned to the Metal and Plastic classes, and vice versa. Furthermore, plastic is often misclassified as metal and metal as plastic. The confusion of No trash images with the various categories of marine litter can be attributed to the non-waste objects found in such images, such as marine life, seagrass, and rocks.
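Swapping the NN head for a conventional classifier amounts to fitting a scikit-learn model on the extracted feature matrix. The sketch below uses the bundled digits dataset as a stand-in for deep CNN features; the pipeline (standardization followed by SVM or LR) mirrors the approach, while the hyperparameters are illustrative placeholders, not the tuned values from our experiments.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# stand-in for a deep feature matrix: 64-dimensional digit descriptors
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

scores = {}
for name, clf in [("SVM", SVC(kernel="rbf", C=10)),
                  ("LR", LogisticRegression(max_iter=2000))]:
    # scale the features, then fit the conventional classifier on top
    model = make_pipeline(StandardScaler(), clf)
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)
print(scores)
```

In our setting, `X` would be the matrix of features produced by the chosen (fine-tuned) CNN extractor rather than raw pixels.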

Discussion
This study analyzes the performance of well-known deep CNN architectures utilized as feature extractors to classify underwater images of marine debris with neural networks and other conventional ML classifiers. Fine-tuned deep CNN feature extractors that stack Inception modules, Inception-ResNetV2 and InceptionV3, show the overall best performance in our experimental setup (see Tables 5 and 6). With increasing depth, these architectures simultaneously increase the width of the network by introducing parallel convolutions. Although the MobileNetV2 architecture's performance ranked worst, it should not be written off easily, since it comes with several advantages desirable for deployment on embedded devices: faster inference, reduced network size, and low latency. By choosing the MobileNetV2 architecture for feature extraction, one trades overall accuracy for computational efficiency.
Several works have employed deep-learning-based approaches to marine debris detection and classification in underwater imagery in recent years. Table 7 gives an overview of results presented in the literature using different network architectures. While [27][28][29] use realistic underwater images, [23] uses forward-looking sonar (FLS) imagery, with constructed 96 × 96 image crops of debris objects utilized for the classifier's training and testing. As expected, in terms of accuracy, the classification model from [23] gives better results than similar models trained and validated on underwater RGB images, which do not contain only a single centered debris object. In contrast to the studies presented in Table 7, our work compares the performance of six well-known CNN architectures (VGG19, InceptionV3, ResNet50, Inception-ResNetV2, DenseNet121, and MobileNetV2) combined with NN, RF, SVM, NB, LR, and KNN classifiers. The best reported result, with an accuracy of 91.61%, is obtained with the fine-tuned Inception-ResNetV2 feature extractor and an SVM classifier. Despite the intrinsic challenges of the used dataset, with the choice of an appropriate feature-extractor architecture, the scheme for its deployment, and an appropriate classifier, satisfactory results on new images are obtained. This paper focuses exclusively on image classification based on deep-learning feature extraction. Alternatively, traditional computer vision techniques can be used to extract features from raw pixel data. The problem with the traditional approach is that it requires careful fine-tuning and expert analysis. This is especially evident in multiclass classification problems: each class requires manual feature engineering to best describe its typical object patterns, which becomes a real burden with many parameters to tweak [49].
On the other hand, DL methods mitigate the need for manual extraction of features and provide the end-to-end learning process, which extracts relevant image features automatically and often outperforms conventional feature extraction techniques [73,74].
Traditional "crisp" algorithms for image classification do not fully consider the inherent uncertainties and peculiarities present in debris images from a realistic underwater environment. Image entities belonging to the same debris class often vary significantly in shape and position. Moreover, images are often affected by uncertainties such as recording angle, water turbidity, and illumination, implying the inherently fuzzy nature of the given dataset. To address this problem, classifiers based on fuzzy techniques can be applied. The success of the fuzzy approach to classification and clustering has been reported on datasets with inherent uncertainties from various domains [75][76][77]. Based on these findings, it would be interesting to investigate whether fuzzy-based techniques can enhance classification performance on the use case of marine debris identification and classification. However, this falls outside the scope of this paper and is planned for future research.

Conclusions
Marine debris poses a major threat to the marine ecosystem and negatively affects today's society environmentally, socially, and economically. Motivated by the need for automatic and cost-effective approaches to marine debris monitoring and removal, we employ machine learning techniques together with deep-learning-based feature extraction to identify and classify marine debris in a realistic underwater environment. This paper provides a comparative analysis of common deep convolutional architectures used as feature extractors for underwater image classification. Furthermore, it explores the best ways to use deep feature extractors by analyzing three different modes for utilizing pre-trained deep feature extractors and examining the performance of different ML-based classifiers trained on top of the extracted features.
The fine-tuning of the pre-trained feature extractor network's weights with appropriate learning rates during the whole training procedure showed the most prominent results in our experimental setup. The best performance is shown by the Inception-based FT feature extractors, namely Inception-ResNetV2 and InceptionV3, achieving overall accuracies above 90% (91.40% and 90.57%, respectively) when trained with the NN classifier on top. Traditional SVM and LR classifiers proved to be credible alternatives to the NN classifier, often outperforming it. The SVM trained on Inception-ResNetV2 features achieves 91.61% accuracy, while the LR classifier trained on InceptionV3 features obtains an accuracy of 90.78%. Considering the inherent challenges that come with automatic marine debris classification in underwater imagery, the obtained results demonstrate the potential for further exploitation of deep-learning-based models for real-time marine debris identification and classification in natural aquatic environments.
In the future, we hope to assemble our own dataset of marine debris images from the underwater environment of the Croatian Adriatic Sea and utilize a deep-learning approach for automatic marine debris identification in the local marine environment. The main focus of this paper is the problem of marine debris classification. Future research should expand the conducted analysis to the detection of waste objects below the sea surface, comparing different object detection architectures with different backbone convolutional networks. Furthermore, this work discusses only three approaches to transfer learning: (1) keeping the pre-trained feature extractor network frozen; (2) fine-tuning its weights during the whole training procedure; (3) freezing the feature extractor during the first phase of training and afterward unfreezing it to fine-tune its weights during the second training phase. In future research, it would be interesting to extend the analysis to the case where the first layers of the pre-trained feature extractor remain fixed while the rest of the network, corresponding to more domain-specific features, is fine-tuned. Moreover, the optimal way to split the feature extractor into frozen and fine-tuned parts can be further analyzed for each network architecture.

Data Availability Statement:
The image and numerical data used to support the findings of this study are available from the corresponding author upon request, as the data also form part of an ongoing study.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: