1. Introduction
The amount of anthropogenic debris in marine and coastal environments is increasing dramatically and constitutes a global issue. Monitoring the abundance and composition of anthropogenic marine debris (or marine litter) is essential to identify the main sources [1,2] and to design effective mitigation measures [3,4]. In particular, as marine litter is present in large quantities on coastlines [5,6,7], action plans have been implemented to map the load and type of marine litter on beaches worldwide [8,9,10,11,12,13,14,15,16,17].
The most common technique for marine litter monitoring relies on an in-situ visual census approach. This technique, generally performed four times per year, consists of counting, classifying, and collecting the marine litter items within the same chosen area [18,19,20]. Although these surveys can be achieved at low cost, with minimal equipment, by inexperienced surveyors under instruction [19], the in-situ visual census method requires intensive human effort [7,21].
To overcome the limitations of visual census, recent works have explored the viability of an unmanned aerial system (UAS)-based approach for the detection, identification, and classification of marine litter in coastal areas [22,23,24,25,26,27,28,29]. In these works, the drone flew at heights between 6 and 40 m, with the camera gimbal at −90°, and collected high-resolution images of the beach surface. In general, the final resolution, usually expressed in terms of ground sampling distance (GSD), allowed proper identification of marine macro litter (hereinafter MML), defined as any persistent anthropogenic solid material disposed of or abandoned in marine and coastal environments, with a lower limit of 2.5 cm in the longest dimension [30].
MML items are usually detected on UAS-derived orthomosaics. The detection can be performed manually, following an image screening procedure [22,23,25,28,29,31], or by applying automated detection techniques [22,23,24,27,32]. To date, three main automated pixel-based MML detection techniques have been proposed, namely an image processing threshold method [24], random forest (RF) [23,27], and convolutional neural networks (CNN) [22]. The image processing threshold proposed by Bao et al. [24] was applied successfully on images where the beach surface was smooth and characterized by a regularly colored background. However, this approach was inadequate for universal application, since sandy beaches often present footprints and ripples on their surface. RF and CNN were tested in more complex environments. Martin et al. [23] used RF with the histogram of oriented gradients (HOG) as a feature descriptor (F-score 44%), while Gonçalves et al. [27] achieved better results (F-score 75%) adopting color feature descriptors. Fallati et al. [22] implemented a deep-learning CNN, obtaining contrasting results at two study sites (F-scores 49% and 78%). Gonçalves et al. [26] compared the manual and pixel-based automatic detection performances obtained by RF and CNN algorithms. The random forest classifier returned the best automated detection rate (F-score 70%), whereas the CNN performed slightly worse (F-score 60%) due to a higher number of false positive detections. The comparison highlighted that automated techniques can provide reliable density maps of MML load with faster surveys, and therefore an increased frequency of observations.
These previous experiences underline that the automated detection of MML in UAS-based imagery is a challenging task. The MML bulk is often composed of items of different materials, colors, and geometries, along with items that are partially buried and not fully visible. In addition, the wide variety of marine environment characteristics that constitute the image background (e.g., sandy beaches, vegetated dunes, and rocky shores) further complicates the task of finding a general solution. Finally, environmental conditions (e.g., sun brightness and shadows) and different GSDs can affect image quality, and thus the automated detection accuracy. Therefore, it is of interest to continue the search for an optimal solution.
In general, automated identification of MML items in UAS images has been performed using pixel-based image analysis (PBIA), in which every MML pixel is evaluated and grouped at the image level by means of statistical clustering of pixel values. This approach is appropriate when the objects are of similar size to, or smaller than, the pixel size. Conversely, when the pixel size is much smaller than the objects (high spatial resolution images), it is preferable to detect the objects by grouping near-homogeneous pixels [33]. In addition, due to the ultrahigh spatial resolution of a UAS-based orthomosaic (5.5 mm in this study), the classification of MML faces very high intraclass spectral variability and very low interclass statistical separability. Therefore, an object-based image analysis (OBIA) classification approach may be a better solution, since the image segmentation technique is driven by relative object heterogeneity and internal homogeneity criteria, weighted by spectral and shape characteristics [33].
The main objective of this work was to propose and evaluate a simple and cost-effective UAS-based approach for automatically generating MML abundance maps of sandy beaches. In this context, we evaluated the performance of three commonly used object-oriented machine learning (OOML) classifiers, namely support vector machine (SVM), k-nearest neighbor (KNN), and random forest (RF), to automatically detect MML items on an orthomosaic derived from a UAS flight. In addition, this work contributes to advances in remote sensing MML surveys by optimizing automated MML detection on UAS-derived orthomosaics. The MML abundance maps produced by the UAS surveys can assist environmental pollution monitoring programs and contribute to the search for and evaluation of mitigation measures. Furthermore, these MML maps can also improve the clean-up activities in coastal environments carried out by governmental authorities in close partnership with all stakeholders, including non-governmental organizations, municipalities, local communities, and the private sector.
2. Materials and Methods
A simple, cost-effective, UAS-based framework was used for generating MML abundance maps of sandy beaches, in compliance with European Directives [34]. This framework, described in Figure 1, was composed of four operational steps. First, a very low altitude UAS flight was planned and the corresponding ultrahigh resolution images were acquired over the targeted area. Then, the image block was processed using a Structure from Motion and Multi-View Stereo (SfM-MVS) workflow to generate the digital surface model (DSM) and the orthomosaic. In the third step, the MML items were detected in the orthomosaic by a supervised OOML classifier, which required a minimal training effort. In the last step, abundance maps were created using the centroids of the macro litter objects classified in the previous step.
2.1. Study Area
Cabedelo Beach (40°08′12.8″N and 8°51′47.5″W, Figure 2) is a sandy beach located on the western Portuguese coast, facing the North Atlantic Ocean (OSPAR area 5, Iberian coast [34]), south of the Mondego River estuary (Figueira da Foz). The beach is backed by a stabilized dune, with a crest height that varies between 5 and 10 m moving southwards (see Figure 2).
2.2. Field Data Acquisition and Unmanned Aerial System (UAS) Survey
The acquisition of aerial images was performed with the quadcopter Phantom 4 Pro (Figure 3a) on 15 February 2019 at 12:30 p.m., a sunny day with clear sky and light wind. The choice of this aerial platform was driven by the need for an aircraft that could be deployed in very small places and flown at a very low cruise speed. This rotary-wing platform, significantly more affordable than most, was equipped with a one-inch 20-megapixel CMOS (complementary metal oxide semiconductor) sensor (camera model FC6310, 24 mm full-frame equivalent) with a mechanical shutter. The camera was also combined with a three-axis brushless gimbal that smooths the angular movements of the camera, dampens vibrations, and maintains the camera in a predefined position [35]. This component was essential to ensure good stabilization of the image acquisition process and to avoid blurring in the very low altitude images.
Concerning the image acquisition strategy, and taking into account current practices in UAS-based environmental monitoring [36], the following three main issues were considered: mission planning, UAS georeferencing accuracy, and camera settings. The mission planning must include all the parameters that allow the UAS to perform the flight autonomously. For nadiral image acquisition, the most important parameters are as follows: (i) nominal flight height, (ii) image overlap, (iii) geometry of the surveyed area, and (iv) camera settings. On the basis of these parameters, the flight mission software computed, for the given camera model, the expected ground sampling distance (GSD) and the flight path (waypoints) to follow. In this work, mission planning was carried out using the freeware mobile application DroneDeploy (Figure 3b). The drone was set to fly at an altitude of 20 m, with the camera gimbal set to −90° so that photos were captured at nadir, pointing vertically down at the beach surface (Figure 3b). The images, with a resolution of 4864 × 3648 pixels (4:3 aspect ratio), were overlapped with 80% front and 70% side rates. The final nominal image spatial resolution (GSD) was 5.5 mm; a worked example verifying this value is given below.
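The expected GSD follows directly from the camera geometry. The sketch below is a worked example in which the sensor width, native pixel count, and focal length are assumptions taken from the published Phantom 4 Pro specifications; only the flight height and the 4:3 image width come from this survey.

```python
# Worked GSD example. Sensor width, native pixel count, and focal length are
# assumed from the published Phantom 4 Pro specifications; flight height and
# the 4:3 image width come from this survey.
SENSOR_WIDTH_MM = 13.2    # assumed 1-inch sensor width (native 3:2 frame)
NATIVE_WIDTH_PX = 5472    # assumed native 3:2 image width
FOCAL_LENGTH_MM = 8.8     # real focal length (24 mm full-frame equivalent)
IMAGE_WIDTH_PX = 4864     # 4:3 crop used in this survey
FLIGHT_HEIGHT_M = 20.0

pixel_pitch_mm = SENSOR_WIDTH_MM / NATIVE_WIDTH_PX              # ~0.0024 mm/px
gsd_mm = pixel_pitch_mm * FLIGHT_HEIGHT_M * 1000 / FOCAL_LENGTH_MM
footprint_m = gsd_mm * IMAGE_WIDTH_PX / 1000                    # swath width

print(f"Expected GSD: {gsd_mm:.1f} mm/px")            # ~5.5 mm/px, as reported
print(f"Image footprint width: {footprint_m:.1f} m")  # ~26.7 m
```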
In general, the positioning and image georeferencing accuracies of a UAS are driven by the internal quality of the on-board Global Navigation Satellite System (GNSS) sensors. Using the waypoints computed by the mission planning software, the UAS performs an autonomous flight and records digital images with the specified camera settings at the indicated geographic positions. During the flight, the camera position and attitude are also recorded by the internal UAS GNSS sensors. However, the Phantom 4 Pro navigation sensors are not accurate enough to correctly georeference the derived geospatial products. Therefore, ground control points (GCPs) are needed for georeferencing the digital surface model (DSM) and the orthomosaic in a specific cartographic coordinate system, and eventually for refining the auto-calibrated camera model. Along with GCPs, it is recommended to acquire additional points that can be used as independent check points (CHPs) for assessing the geometric accuracy of the derived geospatial products. In order to maintain a low-cost and simple approach, we acquired only five GCPs for georeferencing purposes and two CHPs for assessing the horizontal and vertical accuracy of the generated orthomosaic and DSM, respectively (Figure 3c).
Regarding camera settings, the overall exposure of each image has a significant impact on the geometric and radiometric quality of the final UAS-based geospatial products [37]. The ISO, aperture, and shutter speed are the three fundamental camera settings that determine the image exposure. In this work, ISO, shutter speed, and aperture were set to 100, 1/1250 s, and f/3.2, respectively, in order to accommodate the daytime illumination conditions and to obtain sharp and well-exposed image data.
2.3. Structure from Motion and Multi-View Stereo (SfM-MVS) Processing
Generating a DSM (and the subsequent orthomosaic) from a block of overlapping images processed with an SfM-MVS photogrammetric workflow requires that every part of the surface is imaged from two or more different positions [38,39]. The first step of this process consists of detecting features (keypoints) in each image and assigning a unique identifier to them, regardless of the image perspective and scale. The external orientation of the images (i.e., camera position and attitude) and the coordinates of the tiepoints (i.e., scene geometry) are then reconstructed simultaneously through the automatic identification of matching keypoints (tiepoints) in multiple images. These features, which are tracked across each image pair of the whole image block, allow estimation of the initial camera positions and the object coordinates of the tiepoints. These initial values are then simultaneously optimized in a bundle block adjustment (BBA), which minimizes the overall residual error and produces a self-consistent three-dimensional (3D) model with the associated camera parameters.
Agisoft Metashape (v. 1.5.3, [40]) was adopted as the Structure from Motion and Multi-View Stereo (SfM-MVS) processing software package to produce the digital surface model (DSM) and the related RGB orthomosaic. The processing strategy was divided into the following steps (a scripted sketch of these steps is given after the list):
Photo alignment: Using the keypoints detected on each image, the process computes the internal camera parameters (e.g., lens distortion), the external orientation parameters for each image, and generates a sparse 3D point cloud.
Georeferencing: The geospatial 3D point cloud is assigned to a specific cartographic (or geographic) coordinate system.
Camera optimization: Camera calibration and the estimation of its interior orientation parameters are refined by an optimization procedure, which minimizes the sum of re-projection errors and reference coordinate misalignments. For this step, the sparse point cloud is statistically analyzed to delete misallocated points and to find the optimal re-projection solution.
Dense matching: The MVS dense matching technique generates a 3D dense point cloud from multiple images with optimized internal and external orientation parameters.
DSM and orthomosaic generation: The DSM is interpolated from the 3D dense point cloud, and consequently, the orthomosaic is generated from this DSM. It is worthwhile to note that we imaged a scene with low variation in height relative to the flying height. Therefore, the extra time-consuming steps of mesh generation and 3D texture mapping were not necessary for the generation of the orthomosaic.
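For readers who script this workflow, the sketch below outlines the steps using the Agisoft Metashape Python API. It is a minimal illustration assuming the v1.5 API (method names and defaults differ between versions); the file names are hypothetical, and interactive GCP marker placement is omitted.

```python
# Minimal sketch of the five processing steps, assuming the Agisoft Metashape
# Python API as of v1.5 (names differ in later versions); file names are
# hypothetical and GCP marker placement is omitted for brevity.
import Metashape

doc = Metashape.Document()
chunk = doc.addChunk()
chunk.addPhotos(["IMG_0001.JPG", "IMG_0002.JPG"])   # hypothetical image list

# (1) Photo alignment: keypoint detection, matching, and sparse point cloud.
chunk.matchPhotos()
chunk.alignCameras()

# (2) Georeferencing: after placing GCP markers (interactively or by script),
# update the chunk transform to the target coordinate system.
chunk.updateTransform()

# (3) Camera optimization: refine interior orientation against the GCPs.
chunk.optimizeCameras()

# (4) Dense matching: MVS depth maps and dense point cloud.
chunk.buildDepthMaps()
chunk.buildDenseCloud()

# (5) DSM and orthomosaic: no mesh is needed for this low-relief scene.
chunk.buildDem()
chunk.buildOrthomosaic(surface=Metashape.ElevationData)
doc.save("cabedelo.psx")
```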
2.4. Classification Preprocessing, Nomenclature, and Training Areas
Before classification, the UAS-based orthomosaic was cropped to a manually digitized outline of the monitored beach area where the MML was present. The aim of this preprocessing step was both to simplify the beach cover nomenclature and to minimize the negative influence of non-beach areas (dune, rocks, and walkways) on the classification procedure.
Considering that we were interested in mapping MML abundance on a sandy beach, a nomenclature (classification scheme) was carefully selected and defined (Table 1 and Figure 4), taking into account that the corresponding classes had to be: (i) mutually exclusive; (ii) exhaustive; and, if necessary, (iii) hierarchical [41].
The previously mentioned literature supports that, at least, the image segmentation, training samples, feature space, and tuning parameters can have a significant impact on classification accuracy and efficiency [42,43]. Collecting adequate training data is a time-consuming and expertise-demanding task. However, as we wanted to propose a simple, easy, and accessible OBIA classification approach, we decided to use a rectangular training area, outlined manually over the orthomosaic, where the variability of each beach cover class was well represented. After carefully inspecting the orthomosaic, this training area was located at the southern part of the study area and represented only one-third of the total surface area to be classified. Within this training area, several polygons representing each class were manually digitized in a GIS environment (Figure 4).
2.5. Feature Space and Data Normalization
Implementing a successful OBIA classification requires careful selection of suitable discriminating features (or variables), such as spectral signatures, vegetation indices, transformed images, as well as textural and contextual information [44]. In our case, the spectral dimensionality was restricted to the RGB wavelengths of the low-cost onboard UAS camera, which is sensitive to illumination intensity. The bands of the RGB color space are highly correlated, mixing the color and intensity information, and in general this color space is not perceptually uniform [27]. To overcome these limitations, and considering that MML is generally characterized by its strong manufactured colors, we used transformed image features described by the following three additional color spaces (see Figure 5): hue-based (HSV), perceptually uniform (CIE-Lab), and luminance-based (YCbCr) [45]. Each of these color spaces describes color differently from the RGB additive color model [46]. In HSV (hue, saturation, and value), the color information is contained only in the hue channel. In CIE-Lab, the color information is contained in two chromaticity layers, i.e., the red-green axis (a) and the blue-yellow axis (b). In YCbCr, the intensity or luminance (Y) is easily discriminated from the two chrominance components: the blue (Cb) and the red (Cr).
Considering that the color space transformations generated a mixture of spectral bands (RGB) and synthetic bands, data normalization was important for some classifiers to treat each band equally. For the SVM and KNN classifiers, the bands were normalized by linear scaling to the range from zero to one, as sketched below.
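A minimal sketch of this feature construction, assuming the color conversions from scikit-image and a hypothetical clipped orthomosaic file; the study itself performed these steps within its OBIA software rather than in Python.

```python
# Sketch of the 12-band feature stack (RGB + HSV + CIE-Lab + YCbCr) and the
# per-band min-max scaling used for SVM and KNN. The input file name is
# hypothetical; conversions are from scikit-image.
import numpy as np
from skimage import io
from skimage.color import rgb2hsv, rgb2lab, rgb2ycbcr

rgb = io.imread("orthomosaic_clip.tif")[:, :, :3].astype(np.float64) / 255.0

# Stack the original bands with the three transformed color spaces.
features = np.dstack([rgb, rgb2hsv(rgb), rgb2lab(rgb), rgb2ycbcr(rgb)])

# Linear scaling of every band to [0, 1] so each band is treated equally.
mins = features.min(axis=(0, 1))
maxs = features.max(axis=(0, 1))
normalized = (features - mins) / (maxs - mins)
print(normalized.shape)  # (rows, cols, 12)
```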
2.6. Image Segmentation
Segmentation is the process of dividing the image into non-overlapping image objects that are spatially and spectrally homogeneous. As the first and most critical step of OBIA classification [47], the quality of the image segmentation has a significant impact on the classification accuracy. Over-segmented objects, which contain only a part of the target object class, and under-segmented objects, which contain more than one target object class, both negatively affect the predicted class signatures [48].
In this study, the segmentation of the synthetic remote sensing image was realized with the multi-resolution image segmentation (MRIS) algorithm available in Trimble eCognition Developer® (usually known as eCognition) [49]. MRIS is a bottom-up region-growing technique driven by the following three main parameters: scale, shape, and compactness. The most important is the scale parameter, which controls the average size in pixels of the resulting image objects (a higher value results in larger objects). Shape and compactness define the object homogeneity and are weighted from zero to one. Shape controls how much the segmentation is influenced by the spectral (color) information versus the object shape information (a higher value means lower influence of color). Compactness also controls the object shape (a higher value means more compact, but less spectrally homogeneous, objects) [47]. The values of these three parameters were selected using an iterative trial-and-error process, combined with a visual analysis performed by an experienced operator. In order to find a single segmentation scale that would best separate the four cover classes, and based on similar research on OBIA analysis of ultrahigh sub-decimeter UAS imagery [50], we started by fixing the values of the shape and compactness parameters to 0.1 and 0.5, respectively. Then, the training area was segmented at segmentation scales from 10 to 80, in increments of 10 (see Figure 6). Scale 30 was the best because it retained the individual marine litter items (Figure 6c); at a coarser scale (Figure 6b), these items were very often merged into broader image objects such as vegetation debris. An open-source sketch of this scale-search procedure is given below.
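eCognition's MRIS is proprietary, so the sketch below substitutes Felzenszwalb's graph-based segmentation from scikit-image, whose scale parameter behaves analogously (larger values produce larger objects). It only illustrates the scale-search procedure; it is not the algorithm used in this study, and the file name is hypothetical.

```python
# Open-source stand-in for the MRIS scale search: Felzenszwalb's graph-based
# segmentation (scikit-image), whose `scale` parameter also yields larger
# objects at higher values. This is NOT the eCognition algorithm used in the
# study; the file name is hypothetical.
from skimage import io
from skimage.segmentation import felzenszwalb, mark_boundaries

rgb = io.imread("training_area.tif")[:, :, :3]

# Mimic the trial-and-error search over scales 10 to 80 in steps of 10.
for scale in range(10, 90, 10):
    segments = felzenszwalb(rgb, scale=scale, sigma=0.5, min_size=25)
    print(f"scale={scale}: {segments.max() + 1} objects")

# Visual check of one candidate segmentation (cf. scale 30 in the study).
overlay = mark_boundaries(rgb, felzenszwalb(rgb, scale=30, sigma=0.5, min_size=25))
io.imsave("segments_scale30.png", (overlay * 255).astype("uint8"))
```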
2.7. Classifiers and User-Defined Parameters
In the context of detecting MML items from an orthomosaic with ultrahigh resolution (sub-centimeter level), the following three supervised, non-parametric, and object-oriented machine learning classifiers were evaluated: (1) RF, a decision-tree-based ensemble algorithm; (2) SVM, a statistical learning algorithm; and (3) KNN, an instance-based learning algorithm.
2.7.1. Random Forest
RF is an ensemble classifier that uses a large number of decision tree classifiers, assigning the final class of an unknown object by majority voting over the decisions taken by all trees [51]. Each tree is constructed and trained automatically using a random subset (in general, two-thirds) of the training data (referred to as in-bag samples) and a random subset of the variables [43]. The remaining training data (in general, one-third) not used by each tree, known as out-of-bag samples, is used in an internal cross-validation technique to provide an independent estimate of the overall accuracy of the RF classification [52]. In order to generate a prediction model, two important user-defined parameters need to be set, i.e., the number of decision trees to be generated (ntree) and the number of variables used at each node to grow the tree (Mvar). The published literature has highlighted that the RF classifier is more sensitive to the Mvar parameter than to the ntree parameter [53]. Since the computational efficiency and the non-overfitting properties of the RF classifier allow the error to stabilize before 500 trees are reached, this number of trees is commonly assigned to the ntree parameter [43,52]. Regarding the Mvar parameter, the square root of the total number of variables is the value commonly used in classification problems [43]. However, in some software implementations (e.g., eCognition), the RF algorithm can be subject to the same parameters as decision trees (DT). These parameters include the following: (i) depth (Dep), to regularize each tree (i.e., to limit the way it grows), preventing overfitting; (ii) minimum number of samples (Ns) that a node must contain to consider splitting; (iii) maximum categories, to cluster possible values of a categorical variable; and (iv) the use (or non-use) of surrogates to work with missing data [49]. Additional eCognition parameters are as follows: (i) active variables (Mvar); (ii) forest accuracy, for the desired level of accuracy; and (iii) termination criteria, which can be set to the maximum number of trees, the forest accuracy, or both. A sketch mapping these parameters onto an open-source implementation is given below.
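As an illustration, the sketch below maps the parameters discussed above (ntree, Mvar, Dep, Ns) onto scikit-learn's RandomForestClassifier, with synthetic data standing in for the real per-object features; the study itself used the eCognition implementation.

```python
# Sketch mapping the discussed RF parameters onto scikit-learn (the study
# used eCognition's implementation). X and y are synthetic stand-ins for the
# per-object feature vectors (12 color bands) and beach cover class labels.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=12, n_classes=4,
                           n_informative=6, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,     # ntree: errors typically stabilize before 500 trees
    max_features="sqrt",  # Mvar: sqrt of the total number of variables
    max_depth=None,       # Dep: limits tree growth to prevent overfitting
    min_samples_split=2,  # Ns: minimum samples a node needs to split
    oob_score=True,       # out-of-bag estimate of overall accuracy
    random_state=42,
)
rf.fit(X, y)
print(f"OOB accuracy estimate: {rf.oob_score_:.3f}")
```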
2.7.2. Support Vector Machine
According to the principles of statistical learning theory, the SVM constructs an optimal hyperplane (i.e., a decision surface) that separates the dataset into a discrete predefined number of classes in a way consistent with the training examples [54]. The amount of training data that can be misclassified (e.g., on the wrong side of the hyperplane) is controlled by a positive user-defined parameter C (the cost parameter). A large C value decreases the number of misclassified objects, but can create an overfitted model that may not be adequate to classify new data [55]. When it is not possible to separate the classes linearly, kernel functions are used to project the input data into a high-dimensional feature space, which increases the separability of the classes in this space [56]. The kernel functions most commonly used in remote sensing are the linear, polynomial, and radial basis function (RBF), the latter being controlled by the gamma (γ) parameter [52]. Adjusting the value of γ changes the shape of the decision boundary; smaller values produce a smoother boundary, whereas higher values produce a more complex boundary. In eCognition, the SVM classifier is implemented with the following configurable parameters: (i) C; (ii) kernel function (linear or radial basis function); and (iii) gamma (for RBF only). The optimal values of C and γ are often determined by using the grid search method (also known as exhaustive search), in which a large range (search interval) of different parameter pairs is evaluated and the pair yielding the highest classification accuracy in this interval is selected [57]. A sketch of such a grid search is given below.
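The sketch below illustrates such a grid search with scikit-learn, again with synthetic stand-in features; the search interval is an assumption for illustration, not the one used in the study.

```python
# Sketch of a grid (exhaustive) search over C and gamma for an RBF SVM,
# using scikit-learn in place of eCognition. The search interval is an
# illustrative assumption; X and y are synthetic stand-in features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=12, n_classes=4,
                           n_informative=6, random_state=0)

param_grid = {
    "svc__C": np.logspace(-1, 3, 5),      # cost parameter C
    "svc__gamma": np.logspace(-3, 1, 5),  # RBF kernel width gamma
}
pipe = make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))  # bands scaled to [0, 1]
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```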
2.7.3. K-Nearest Neighbor
KNN is a relatively simple instance-based learning approach. An object is classified based on the weighted average of the class attributes of its k spectrally nearest neighbors (e.g., k = 5) in the training set [58]. The performance of this classifier is mainly influenced by the key parameter k [55]. In eCognition, the KNN was implemented with only one configurable parameter, k.
2.8. Tuning the Primary Classifier Parameters
The strategy used for tuning the classifiers was to modify each of the primary parameters one by one, while keeping the others fixed. For RF, we started with the default values and successively modified ntree, depth, and Ns, one parameter at a time. For SVM, we also started with the default values and successively modified the γ and C parameters, one at a time. For KNN, only the number of neighbors (k) was tuned, since it was the only implemented parameter. A sketch of this strategy is given below.
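The sketch below illustrates the one-parameter-at-a-time idea for the simplest case, KNN's k, using cross-validated accuracy on synthetic stand-in data; the study instead scored candidate settings against the classification F-score within eCognition.

```python
# Sketch of the one-parameter-at-a-time strategy for the simplest case,
# KNN's k: all other settings stay fixed while k varies, and the best
# cross-validated score wins. Synthetic data stands in for the real
# per-object features (the study scored candidates by F-score instead).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=12, n_classes=4,
                           n_informative=6, random_state=0)

scores = {}
for k in (1, 3, 5, 7, 9, 11):
    knn = KNeighborsClassifier(n_neighbors=k, weights="distance")
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(f"Best k = {best_k} (CV accuracy {scores[best_k]:.3f})")
```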
2.9. Performance Assessment
In order to have a valuable reference for evaluating the detection performance of the classifiers, the RGB orthomosaic was visually screened and manually processed by an operator in the GIS environment. For each object recognized as a marine litter item by the operator, the approximate center of the item's shape was marked. For further details about the manual procedure and the types of MML encountered at Cabedelo Beach, please refer to Gonçalves et al. [26,27].
The automated detection performances were evaluated with the F-score statistical analysis. The centroids of all the objects labeled as MML by the algorithms were compared to the centroids of the MML objects delineated manually in the testing areas. When the distance between the centroids was smaller than 20 cm (the setup threshold), the detection was marked as a true positive (TP); otherwise, it was marked as a false positive (FP).
Finally, all the marine litter items not detected by the automated algorithm were counted as false negatives (FN). In detail, the precision (P) measures the ability of the method to avoid false positives and is defined as:

$$P = \frac{TP}{TP + FP}$$

The recall (R) measures the sensitivity of each method, i.e., its ability to avoid false negatives, and is given by:

$$R = \frac{TP}{TP + FN}$$

The F-score (F) is a measure of the overall quality of the method and combines the previous P and R metrics as:

$$F = 2 \times \frac{P \times R}{P + R}$$
The F-score also varies between 0% and 100%, where 0% means no agreement between the predicted and observed MML items and 100% means a perfect classification (i.e., a perfect match). A sketch of this centroid-matching evaluation is given below.
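A minimal sketch of this evaluation, assuming a greedy nearest-neighbor matching of centroids (the exact pairing rule used in the study is not specified); the coordinates are hypothetical and in meters.

```python
# Sketch of the centroid-matching evaluation: a predicted centroid within
# 20 cm of a still-unmatched reference centroid is a TP; leftover predictions
# are FPs and leftover reference items are FNs. Greedy nearest-neighbor
# pairing is an assumption (the exact pairing rule is not specified);
# coordinates below are hypothetical, in meters.
import numpy as np
from scipy.spatial import cKDTree

def f_score(pred_xy, ref_xy, max_dist=0.20):
    tree = cKDTree(ref_xy)
    matched, tp = set(), 0
    for dist, idx in zip(*tree.query(pred_xy, k=1)):
        if dist <= max_dist and idx not in matched:
            matched.add(idx)
            tp += 1
    fp = len(pred_xy) - tp
    fn = len(ref_xy) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

pred = np.array([[0.05, 0.02], [1.10, 2.00], [5.00, 5.00]])  # detected items
ref = np.array([[0.00, 0.00], [1.00, 2.05]])                 # manual items
print(f"F-score: {f_score(pred, ref):.0%}")  # 2 TP, 1 FP, 0 FN -> 80%
```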
2.10. Quantifying Macro Litter Abundance
Quantifying and mapping the abundance of MML in coastal areas is important for understanding the dynamics of its deposition, computing accumulation rates, and identifying spatial distribution patterns over time, thereby improving the planning of clean-up operations [28,29]. In this study, kernel density estimation (KDE) was used for quantifying the MML abundance. First, the polygonal macro litter items detected by a particular OOML method were converted to point features using the centroids of these polygonal features. Then, using a KDE function, these point events (i.e., the centroids of the macro litter items) were transformed into a continuous surface representing the point density (i.e., the number of MML items per square meter) in a two-dimensional (2D) space [59]. The two key parameters of a planar KDE function are the kernel function and the search bandwidth. However, there is consensus that the choice of the bandwidth, which determines the smoothness of the density surface, is more important than the choice of the kernel function [60]. In this work, the quartic function was used for estimating the MML density at each cell of the orthomosaic image. In addition, to generate a smooth MML abundance map, a sufficiently large bandwidth of 10 m was chosen. A minimal sketch of this estimator is given below.
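A minimal sketch of a planar quartic KDE with a 10 m bandwidth; the grid extent, cell size, and centroid coordinates are hypothetical, and GIS packages offer equivalent built-in tools.

```python
# Sketch of a planar KDE with a quartic kernel, K(d) = 3/(pi*h^2) *
# (1 - (d/h)^2)^2 for d < h, evaluated on a regular grid with the 10 m
# bandwidth used in the study. Grid extent, cell size, and centroids are
# hypothetical; GIS packages offer equivalent built-in tools.
import numpy as np

def quartic_kde(points, xmin, xmax, ymin, ymax, cell=0.5, h=10.0):
    xs = np.arange(xmin, xmax, cell)
    ys = np.arange(ymin, ymax, cell)
    gx, gy = np.meshgrid(xs, ys)
    density = np.zeros_like(gx)
    for px, py in points:
        d2 = (gx - px) ** 2 + (gy - py) ** 2      # squared distance to item
        inside = d2 < h ** 2                       # kernel support
        density[inside] += 3.0 / (np.pi * h**2) * (1.0 - d2[inside] / h**2) ** 2
    return xs, ys, density                         # items per square meter

centroids = [(12.0, 30.0), (15.5, 28.0), (40.0, 10.0)]  # hypothetical MML items
xs, ys, dens = quartic_kde(centroids, 0, 60, 0, 50)
print(f"Peak density: {dens.max():.4f} items/m^2")
```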
5. Conclusions
This study showed that a consumer-grade UAS combined with SfM-MVS methods can be used effectively for generating an ultrahigh resolution orthomosaic (sub-centimeter level) and for monitoring sandy beaches polluted by MML. The low spectral resolution of the orthomosaic was overcome by combining four color spaces (RGB, CIE-Lab, HSV, and YCbCr) with an OBIA approach, which proved to be highly suitable for extracting MML objects from ultrahigh resolution imagery.
After being optimally tuned, the three compared object-oriented machine learning (OOML) classifiers, namely random forest (RF), support vector machine (SVM), and k-nearest neighbor (KNN), showed quite similar performances (F-score) in detecting colored MML objects. Although the RF had more parameters to tune, and therefore appeared more complex to optimize, the number of trees (ntree) was the most influential parameter. In contrast, for the KNN, which had only one parameter to tune, the F-score was slightly worse than those of the other two machine learning classifiers. Nevertheless, the MML abundance map generated from KNN was well correlated with the abundance map produced manually. This suggests that this OOML classifier can be used effectively by nonexpert remote sensing analysts in a simple MML abundance mapping framework.
The synergistic use of small UAS with OOML classifiers is a major step towards cost-effective and efficient operational programs for monitoring MML abundance and detecting hotspots on sandy beaches, as they can be easily implemented by local, municipal, and national environmental agencies.
Future research should focus on the use of one-class classifiers with minimal labeling effort to generate abundance maps from an ultrahigh resolution orthomosaic obtained by a consumer-grade UAS incorporating RTK-GNSS sensors.