Instance Segmentation for Governmental Inspection of Small Touristic Infrastructure in Beach Zones Using Multispectral High-Resolution WorldView-3 Imagery

Abstract: Misappropriation of public lands is an ongoing government concern. In Brazil, the beach zone is public property, but many private establishments use it for economic purposes, requiring constant inspection. Among the undue targets, the individual mapping of straw beach umbrellas (SBUs) attached to the sand is a great challenge due to their small size, high presence, and agglutinated appearance. This study aims to automatically detect and count SBUs on public beaches using high-resolution images and instance segmentation, obtaining pixel-wise semantic information and individual object detection. This study is the first instance segmentation application on coastal areas and the first using WorldView-3 (WV-3) images. We used the Mask-RCNN with some modifications: (a) multispectral input (eight channels), (b) an improved sliding window algorithm for large image classification, and (c) a comparison of different image resizing ratios to improve small object detection, since the SBUs are small objects (<32² pixels) even in high-resolution images (31 cm). The accuracy analysis used standard COCO metrics considering the original image and three scale ratios (2×, 4×, and 8× resolution increase). The average precision (AP) results increased with the image resolution: 30.49% (original image), 48.24% (2×), 53.45% (4×), and 58.11% (8×). The 8× model presented 94% AP50, classifying nearly all SBUs correctly. Moreover, the improved sliding window approach enables the classification of large areas, providing automatic counting and estimating the size of the objects, proving to be effective for inspecting large coastal areas and providing insightful information for public managers. This remote sensing application impacts the inspection cost, tribute, and environmental conditions.
es; Coordination for the Improvement of Higher Education Personnel (CAPES) for postgraduate assistance; Union Heritage Secretariat of the Ministry of Economy for financial support; and the European Space Agency (ESA) for image supply within the project “Surveillance of union properties areas using deep learning technique in satellite images”. Special thanks are given to the research group of the Laboratory of Spatial Information System of the University of Brasilia for technical support.
Conflicts of Interest: The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.


Introduction
Public land management is essential for the effective use of natural resources, with implications for economic, social, and environmental issues [1]. Government policies establish public areas in ecological, social, or safety-relevant regions (e.g., natural fields and historic spaces), offering services ranging from natural protection to recreation [1,2]. However, managing public interests to promote social welfare over private goals is a significant challenge. Especially in developing countries, the recurrent misuse of public land [3] and illegal invasions (i.e., the use of public lands for private interests) [4] are among the most common problems.
Coastal zones concentrate a large part of the world population despite being environmentally sensitive, with intense natural processes (erosion, accretion, and natural disasters) [5] and constant anthropic threats (marine litter, pollution, and inappropriate use) [6,7]. The coastal zone is therefore a priority for developing programs for continuous monitoring and misuse detection. In Brazil, coastal areas belong to the Federal Government, considering a distance of 33 m from the mean high-tide line of 1831 (known as "navy land"). Beaches and water bodies have guaranteed public access according to the Brazilian Forest Code. Therefore, Brazilian legislation establishes measures for public use, economic exploitation, environmental preservation, and recovery, considering the socio-environmental function of coastal areas. The inspection of beach areas in Brazil is a challenge, as the Union's Heritage Secretariat does not have complete and accurate information about illegal occupation throughout the country. The undue economic exploitation of the urban beach strip leads to an increase in the number of illegal constructions, a reduction in government revenue due to non-registration, environmental problems, visual pollution, and beach litter, among others. Many illegal installations on urban beaches are masonry constructions for private or commercial use. In addition, tourist infrastructure for food and leisure extends onto the sand strip without permission, in the form of numerous straw beach umbrellas (SBUs) fixed in the sand by local traders. Given the potential impact on the environment and the local economy, monitoring and enforcement to curb private business development in public spaces must be constant and efficient [5], mainly to avoid uncontrolled tourism development [8,9]. The inspection must ensure compliance with legal requirements, avoid frequent changes that lead to legal loopholes, and minimize differences arising from conflicts of interest.
Conventionally, the inspection process imposes a heavy burden on state and federal agencies, which have few inspectors and visit sites infrequently. In this regard, geospatial technologies and remote sensing techniques are valuable for public managers since they enable monitoring changes in the landscape and understanding different patterns and behaviors. Thus, a promising remote sensing application for government control agencies is detecting unauthorized constructions in urban areas [10,11]. Several review articles address the use of remote sensing and geospatial technology in coastal studies [12][13][14][15][16]. Currently, geospatial technology is a key factor in the development and implementation of integrated coastal management, allowing spatial analysis for studies of environmental vulnerability, landform change (erosion and accretion), disaster management, protected areas, ecosystems, economics, and risk assessment [17][18][19][20].
However, few remote sensing studies focus on detecting tourist infrastructure objects on the beach for inspection. Beach inspection requires high-resolution images and digital image processing algorithms that identify, count, and segment small objects of interest, such as the SBUs. Among the remote sensing data, high-resolution orbital images have the advantage of periodic availability and coverage of large areas at a moderate cost, unlike aerial photographs and unmanned aircraft systems (UASs), which have limited accessibility. Typically, high-resolution satellites acquire a panchromatic band (from 1 m to sub-metric resolution) and multispectral bands (blue, green, red, and near-infrared, with spatial resolutions ranging from 1 to 4 m), such as IKONOS (Panchromatic: 1 m; Multispectral: 4 m), OrbView-3 (Panchromatic: 1 m; Multispectral: 4 m), QuickBird (Panchromatic: 0.6 m; Multispectral: 2.4 m), GeoEye-1 (Panchromatic: 0.41 m; Multispectral: 1.65 m), and Pleiades (Panchromatic: 0.5 m; Multispectral: 2 m). Unlike the satellites mentioned above, the WorldView-2 (WV-2) and WorldView-3 (WV-3) images stand out by combining the panchromatic band (0.3 m resolution) with eight multispectral bands (1.24 m resolution): coastal (400-450 nm), blue (450-510 nm), green (510-580 nm), yellow (585-625 nm), red (655-690 nm), red edge (705-745 nm), near-infrared 1 (NIR1) (770-895 nm), and near-infrared 2 (NIR2) (860-1040 nm). Therefore, WorldView-2 and WorldView-3 have additional spectral bands compared to other sensors (coastal, yellow, red edge, and NIR2), which are valuable for urban mapping [21]. The conjunction of the spectral and spatial properties of the WorldView-2 and WorldView-3 images is thus an advantage in the detailed classification of complex urban environments.
Few studies assess infrastructure detection on the beach. Llausàs et al. [22] conducted a study on private swimming pools on the Catalan coast to estimate water use from WorldView-2 images and Geographic Object-Based Image Analysis (GEOBIA). Papakonstantinou et al. [23] used UAS images and GEOBIA to detect tourist structures in the coastal region of the Santorini and Lesvos islands. Despite the wide use of GEOBIA, deep learning (DL) segmentation techniques outperform it in the following respects: (a) greater precision and efficiency; (b) a high ability to transfer knowledge to other environments and to different object attributes (light, color, size, shape, and background); (c) less human supervision; and (d) less noise interference [24][25][26][27].
Instance segmentation and object detection networks assign a distinct identification to each element of the same class, which makes them suitable for multi-object identification and counting. A drawback of instance segmentation compared with object detection is real-time processing, since its inference speed is usually lower. Nevertheless, instance segmentation models provide pixel-wise information, which is crucial for determining the exact object dimensions.
However, instance segmentation is difficult to implement. The first difficulty is the annotation format: most instance segmentation models use a specific annotation format that is not straightforward to obtain from traditional annotations. Object detection algorithms require only the bounding box coordinates, which are much simpler than the bounding boxes and polygons that instance segmentation requires for each object. The second difficulty is that most algorithms use conventional red, green, and blue (RGB) images, whereas remote sensing images often have more spectral channels and varied dimensions. The third is that the training images must be adjusted to a specific size, so classifying a large area requires post-processing procedures.
Another recurrent problem is the poor performance of DL algorithms on small objects since they present low resolutions and a noisy representation [65]. Common Objects in Context (COCO) [66] groups object sizes into three categories: (a) small objects (area < 32² pixels); (b) medium objects (32² < area < 96² pixels); and (c) large objects (area > 96² pixels). Within the COCO challenge, the average precision (AP) score (the main metric) on small objects is nearly half of that on medium and large objects. According to a review article by Tong et al. [67], few studies focus on small object detection, and despite the subject's relevance, the current state is far from acceptable in most scenarios and is still underrepresented in the remote sensing field. In this regard, an effective method is to increase the image dimensions. In this way, the small objects have more pixels, which differentiates them from noise.
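The COCO size categories above, and the effect of enlarging the image, can be illustrated with a minimal sketch. The function names and the 15 × 15 pixel example object are hypothetical, and the assignment of the exact boundary values (32² and 96²) to the larger class is an assumption.

```python
def coco_size_category(area_px: float) -> str:
    """Return the COCO size class for an object area given in pixels."""
    if area_px < 32 ** 2:
        return "small"
    if area_px < 96 ** 2:
        return "medium"
    # Assumption: areas exactly on a boundary fall into the larger class.
    return "large"


def scaled_area(area_px: float, scale: int) -> float:
    """Area of the same object after enlarging the image by `scale` per axis."""
    return area_px * scale ** 2


# Hypothetical object of 15 x 15 pixels in the original image.
for scale in (1, 2, 4, 8):
    area = scaled_area(15 * 15, scale)
    print(scale, area, coco_size_category(area))
```

Because the area grows with the square of the scaling ratio, an object that is small in the original image can leave the small category after a few doublings.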
The present research aims to effectively identify, count, and estimate SBU areas using multispectral WorldView-3 (WV-3) imagery and instance segmentation to inspect and control tourist infrastructure properly. Very few works use instance segmentation in the remote sensing field, and none of them use WV-3 images or address beach areas. Thus, our contributions are threefold: (1) a novel application of instance segmentation using multispectral WV-3 images on beach areas, (2) an improvement of an existing method for classifying large areas using instance segmentation, and (3) an analysis and comparison of the effect of image tile scaling on the accuracy metrics.

Materials and Methods
The methodology is subdivided into the following steps: (A) dataset; (B) instance segmentation approach; (C) image mosaicking using sliding window; and (D) performance metrics (Figure 1).
The study area was Praia do Futuro in Fortaleza, Ceará, Brazil, with intense tourist activities (Figure 2). The research used WorldView-3 images from 17 September 2017 and 18 September 2018, provided by the European Space Agency (ESA), with a total area of 400 km². The WorldView-3 images combine the acquisition of panchromatic (0.31 m resolution) and multispectral (1.2 m resolution and eight spectral bands) bands. Thus, we used the Gram-Schmidt pan-sharpening method [68] with nearest neighbor resampling to maximize image resolution and preserve spectral values [69]. The pan-sharpening technique combines the multispectral images (with low spatial resolution and narrow spectral bands) with the panchromatic image (with high spatial resolution and a wide spectral band), extracting the best characteristics of both and merging them into a product that favors data interpretation [70]. The Gram-Schmidt technique is fast and straightforward and renders spatial features with high fidelity.

Annotations
Annotations assign specific labels to the objects of interest and constitute the ground truth in model training. Instance segmentation programs, such as the Detectron2 software [71] with the Mask-RCNN model [72], use the COCO annotation format. Consequently, several annotation tools have been proposed for traditional photographic images considering the COCO format, such as LabelMe [73,74], Computer Vision Annotation Tool (CVAT) [75], RectLabel (https://rectlabel.com, accessed on 5 October 2021), Labelbox (https://labelbox.com, accessed on 5 October 2021), and Visual Object Tagging Tool (VoTT) (https://github.com/microsoft/VoTT, accessed on 5 October 2021). In remote sensing studies, an extensive collection of annotation tools is present in Geographic Information Systems (GIS), with several procedures to capture, store, edit, and display georeferenced data. Therefore, an alternative that takes advantage of all the technology developed for spatial data is to convert the output data from the GIS program to the COCO annotation format. In the present research, we digitized the SBUs' ground truth in ArcGIS software and converted the GIS data to the COCO annotation format [66] using the C++ program proposed by Carvalho et al. [76]. Since instance segmentation requires a unique identifier (ID) for each object, each SBU had a different value (from 1 to N, with N being the total number of SBUs).
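To make the target format concrete, the following is a minimal Python sketch of a single COCO instance annotation built from one digitized polygon. It is not the C++ converter of Carvalho et al. [76]; the function names are hypothetical, and the polygon is assumed to be already expressed in pixel coordinates of its image tile.

```python
def polygon_area(points):
    """Shoelace formula; `points` is a list of (x, y) vertices in pixels."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def to_coco_annotation(ann_id, image_id, points, category_id=1):
    """Build one COCO annotation: unique id, polygon, and [x, y, w, h] box."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, y_min = min(xs), min(ys)
    return {
        "id": ann_id,                # unique per object (1 to N)
        "image_id": image_id,        # tile that contains the object
        "category_id": category_id,  # single class assumed here (SBU)
        "segmentation": [[float(c) for p in points for c in p]],
        "bbox": [float(x_min), float(y_min),
                 float(max(xs) - x_min), float(max(ys) - y_min)],
        "area": polygon_area(points),
        "iscrowd": 0,
    }


# Hypothetical square umbrella of 8 x 8 pixels.
print(to_coco_annotation(1, 7, [(10, 10), (18, 10), (18, 18), (10, 18)]))
```

The polygon supplies the pixel-wise mask, while the bounding box is derived from it, which is why instance segmentation annotations are more demanding than box-only annotations.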

Clipping Tiles and Scaling
Our research targets are very small (<16² pixels) and very crowded in most cases. A powerful yet straightforward operation to improve the detection of small objects is to scale the input image [67]. We evaluated the ratios of 2×, 4×, and 8× the original image. The cropped tiles considered 64 × 64 pixels in the original image, which increased proportionally with the different scaling ratios (128 × 128, 256 × 256, and 512 × 512, respectively).
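A minimal NumPy sketch of the clipping and scaling step is given below. The function names are hypothetical, the tiles are assumed to be non-overlapping, and nearest-neighbor replication is an assumption, since the interpolation used for scaling is not stated here.

```python
import numpy as np


def crop_tiles(image, tile=64):
    """Split a (bands, H, W) array into non-overlapping tile x tile crops."""
    _, h, w = image.shape
    return [image[:, y:y + tile, x:x + tile]
            for y in range(0, h - tile + 1, tile)
            for x in range(0, w - tile + 1, tile)]


def upscale_nearest(tile, factor):
    """Enlarge a (bands, H, W) tile by an integer factor (pixel replication)."""
    return np.repeat(np.repeat(tile, factor, axis=1), factor, axis=2)


# Hypothetical eight-band scene of 128 x 192 pixels.
scene = np.zeros((8, 128, 192), dtype=np.float32)
tiles = crop_tiles(scene)
for factor in (2, 4, 8):
    print(factor, upscale_nearest(tiles[0], factor).shape)
```

With a 64 × 64 tile, the three ratios give tiles of 128 × 128, 256 × 256, and 512 × 512 pixels, matching the sizes listed above.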

Data Split
For supervised DL tasks, using three sets is beneficial for evaluating the proposed model. The training set usually contains most of the samples and is where the algorithm learns the patterns. However, the training set alone is insufficient since the final model may overfit or underfit. In this regard, the validation set plays a crucial role in tracking the model's progress. A common approach is to save the model with the best performance on the validation set. Nevertheless, this procedure also introduces a bias. Therefore, the final evaluation is preferably done on an independent test set. Thus, we distributed the cropped tiles into training, validation, and test sets as listed in Table 1. The number of instances shows a high object concentration, with an average of nearly ten objects per 64 × 64 pixel image.

Instance Segmentation Approach
The Mask-RCNN [72] belongs to the family of region-based convolutional neural networks, which includes R-CNN [77], Fast-RCNN [78], and Faster-RCNN [79]. The Mask-RCNN uses the Faster-RCNN as a basis and adds a segmentation branch that performs a binary segmentation on each detected bounding box using a fully convolutional network (FCN) [80] (Figure 3). The region-based algorithms have a backbone structure (e.g., ResNets [81], ResNeXts [82], or other CNNs) followed by a region proposal network (RPN). However, the Mask-RCNN uses a region of interest (RoI) align mechanism instead of RoIPool. The benefit is a better alignment of each RoI with the inputs, which removes quantization problems at the RoI boundaries. Succinctly, the model aims to identify the bounding boxes, classify them, and apply a pixel-wise mask to the objects inside them.
The loss function considers these three elements, being the sum of the bounding box loss ($Loss_{bbox}$), mask loss ($Loss_{mask}$), and classification loss ($Loss_{class}$), in which $Loss_{mask}$ and $Loss_{class}$ are log loss functions, and $Loss_{bbox}$ is the L1 loss.
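Written out, the sum described above takes the following form. This is a sketch under assumed notation that does not appear in the text: $p_{u}$ is the predicted probability of the true class $u$, $t_{i}$ and $t^{*}_{i}$ are the predicted and target box parameters, and $y_{j}$ and $\hat{y}_{j}$ are the ground truth and predicted values of the $m \times m$ mask pixels.

```latex
\begin{align}
Loss &= Loss_{class} + Loss_{bbox} + Loss_{mask} \\
Loss_{class} &= -\log p_{u} \\
Loss_{bbox} &= \sum_{i \in \{x, y, w, h\}} \left| t_{i} - t^{*}_{i} \right| \\
Loss_{mask} &= -\frac{1}{m^{2}} \sum_{j=1}^{m^{2}}
  \left[ y_{j} \log \hat{y}_{j} + (1 - y_{j}) \log (1 - \hat{y}_{j}) \right]
\end{align}
```

The three terms carry equal weight in this form; any weighting used in a particular implementation would be an additional assumption.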
Moreover, we used the Detectron2 [71] software, which is built on the PyTorch framework. Since this architecture is usually applied to traditional images (three channels), it requires some adjustments to be compatible with the WV-3 imagery (TIFF format with more than three channels) [76].

Model Configurations
To train the Mask-RCNN model, we made the necessary source code changes for compatibility and applied z-score normalization based on the training set images. We only used the ResNeXt-101-FPN (X-101-FPN) backbone since the objective is to analyze scaling.
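A minimal NumPy sketch of z-score normalization with statistics taken only from the training set is shown below. Computing one mean and one standard deviation per spectral band is an assumption, and the function names are hypothetical.

```python
import numpy as np


def channel_stats(train_tiles):
    """Per-band mean and std over a (N, bands, H, W) training array."""
    mean = train_tiles.mean(axis=(0, 2, 3))
    std = train_tiles.std(axis=(0, 2, 3))
    return mean, std


def zscore(tiles, mean, std, eps=1e-8):
    """Normalize (..., bands, H, W) data with the training-set statistics."""
    return (tiles - mean[:, None, None]) / (std[:, None, None] + eps)


# Hypothetical training set: 10 eight-band tiles of 16 x 16 pixels.
rng = np.random.default_rng(0)
train = rng.normal(100.0, 20.0, size=(10, 8, 16, 16))
mean, std = channel_stats(train)
normalized = zscore(train, mean, std)
print(normalized.mean(axis=(0, 2, 3)).round(6))
```

The same training-set statistics would then be applied to the validation and test tiles. In Detectron2, per-channel statistics of this kind are normally supplied through the `MODEL.PIXEL_MEAN` and `MODEL.PIXEL_STD` configuration entries; whether the authors used that route is not stated.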

Image Mosaicking Using Sliding Windows
In remote sensing, the areas of interest are often much larger than the images used in training, validation, and testing. Classifying them requires post-processing procedures, which are not straightforward since the edges of the frames usually present errors. In this context, the sliding window technique has been applied to various semantic segmentation problems [83][84][85][86], in which the authors establish a step value (usually less than the frame size) and average the overlapping pixels to attenuate the border errors. The problem persists in object detection and instance segmentation since adjacent frames output distinct partial predictions for the same object. Recently, de Carvalho et al. [76] proposed a mosaicking strategy for object detection using a base classifier (Figure 4B), vertical edge classifier (Figure 4C), and horizontal edge classifier (Figure 4E). Our research adapted the method by adding a double edge classifier since some errors may persist (https://github.com/osmarluiz/Straw-Beach-Umbrella-Detection, accessed on 5 October 2021).

Base Classification
The first step is to apply a base classifier (BC) (considering all elements) using a sliding window starting at x = 0 and y = 0, with stride values of 512 (Figure 5B). This procedure produces partial classifications at the edges between consecutive frames, resulting in more than one imperfect classification for the same object, which is misleading.

Single Edge Classification
The second step is to classify objects located at the borders (objects partially classified by the BC). We applied the vertical edge classifier (VEC) to classify elements in consecutive frames vertical-wise, using a sliding window that starts at x = 256 and y = 0 (Figure 5C). Similarly, for horizontal-wise consecutive frames, we applied the horizontal edge classifier (HEC), with a sliding window that starts at x = 0 and y = 256 (Figure 5D). Both strategies use 512-pixel strides. In addition, to avoid a high computational cost, the VEC and HEC only classify objects that start before the center of the image (x < 256 for the VEC and y < 256 for the HEC) and end after the image's center (x > 256 for the VEC and y > 256 for the HEC).
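The center-crossing rule for the single edge classifiers can be sketched as a small filter. The function names and the box layout (x_min, y_min, x_max, y_max, in pixels relative to the 512 × 512 edge window) are assumptions for illustration.

```python
def crosses_center(start, end, center=256):
    """True when the extent [start, end] spans the window's center line."""
    return start < center < end


def keep_for_edge_classifier(box, kind, center=256):
    """Decide whether an edge classifier keeps a detected box.

    `box` is (x_min, y_min, x_max, y_max) in window-local pixels.
    The VEC tests the x extent and the HEC tests the y extent.
    """
    x_min, y_min, x_max, y_max = box
    if kind == "VEC":
        return crosses_center(x_min, x_max, center)
    if kind == "HEC":
        return crosses_center(y_min, y_max, center)
    raise ValueError(f"unknown edge classifier: {kind}")


# Hypothetical detections inside one edge window.
print(keep_for_edge_classifier((240, 10, 270, 40), "VEC"))  # spans x = 256
print(keep_for_edge_classifier((10, 10, 40, 40), "VEC"))    # far from the seam
```

Only objects that straddle the seam between two base frames survive the filter, so the edge classifiers add few boxes to the later suppression step.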

Double Edge Classifier
An additional problem in crowded areas, such as those with SBUs, is that some objects lie on both the horizontal and vertical BC borders, producing a double edge error. Thus, we enhanced the mosaicking by applying a double edge classifier (DEC), a new sliding window starting at x = 256 and y = 256 with 512-pixel strides (Figure 5E).
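The four sliding windows differ only in their starting offsets, which the following sketch enumerates. The function names are hypothetical, and keeping only windows that fit entirely inside the image is an assumption about how scene borders are handled.

```python
# Starting offsets (x0, y0) of each classifier, as given in the text.
OFFSETS = {"BC": (0, 0), "VEC": (256, 0), "HEC": (0, 256), "DEC": (256, 256)}


def window_origins(width, height, x0=0, y0=0, stride=512, size=512):
    """Top-left corners of all full size x size windows inside the image."""
    return [(x, y)
            for y in range(y0, height - size + 1, stride)
            for x in range(x0, width - size + 1, stride)]


def all_windows(width, height):
    """Window origins for the BC, VEC, HEC, and DEC passes."""
    return {name: window_origins(width, height, x0, y0)
            for name, (x0, y0) in OFFSETS.items()}


# Example with a 3072 x 2048 pixel scene.
for name, origins in all_windows(3072, 2048).items():
    print(name, len(origins), origins[0])
```

Each DEC window is centered on a corner shared by four base frames, which is where an object can be cut both horizontally and vertically.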

Non-Maximum Suppression Sorted by Area
Furthermore, each object located at the images' borders may present more than one classification: partial classifications from consecutive BC frames (incorrect classifications) and a unique, complete classification (correct classification) from the HEC, VEC, or DEC (Figure 5). The elimination of excessive boxes used non-maximum suppression ordered by area, guaranteeing that only the largest element (the complete object) is kept. Figure 5 shows an example of an element located at double edges, where the DEC classification is the largest and the correct one.
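A minimal sketch of non-maximum suppression ranked by box area instead of confidence follows. The IoU threshold of 0.3, the function names, and the use of bounding boxes rather than masks to measure overlap are assumptions; the text does not specify them.

```python
def box_area(b):
    """Area of a box given as (x_min, y_min, x_max, y_max)."""
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])


def iou(a, b):
    """Intersection over union of two boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def nms_by_area(boxes, iou_threshold=0.3):
    """Return indices of kept boxes, visiting the largest boxes first."""
    order = sorted(range(len(boxes)), key=lambda i: box_area(boxes[i]),
                   reverse=True)
    keep = []
    for i in order:
        # Drop a box when it overlaps a larger box that was already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return sorted(keep)


# Hypothetical mosaic coordinates: one complete box, two partial boxes
# from adjacent base frames, and one unrelated object.
boxes = [(100, 100, 140, 140), (100, 100, 128, 140),
         (120, 100, 140, 140), (300, 300, 340, 340)]
print(nms_by_area(boxes))
```

Ranking by area rather than by score matters when a partial prediction is more confident than the complete one, since a score-ranked suppression would then keep the truncated box.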

Performance Metrics
The model evaluation considered the following COCO metrics [66]: average precision (AP), AP50, and AP75. The AP is a ranking metric that calculates the area under the precision-recall curve. However, in object detection, it is crucial to determine a minimum overlap between the predicted bounding box and the ground truth bounding box to evaluate a correct classification. Thus, another element is the intersection over union (IoU) (Figure 6).

The COCO AP considers the average among ten IoU thresholds (from 0.5 to 0.95 with 0.05 steps), while the AP50 and AP75 scores consider fixed thresholds of 0.5 and 0.75, respectively.

Results

Performance Metrics
Table 2 lists the detection (Box) and segmentation (Mask) results with different image scaling ratios and the X-101-FPN backbone. Results on the original image were similar to the COCO dataset scores. Moreover, scaling brought a significant improvement: 2× scaling increased the AP score by nearly 20 percentage points, and 8× scaling increased it by nearly 30 percentage points. Small objects negatively affect the strictest metrics (highest IoU, e.g., AP75). Slight errors in the bounding box position on small objects (with fewer pixels) significantly reduce the IoU (implying low AP scores). In turn, such mistakes are much less impactful when the image dimensions increase. However, the high computational cost limits how far the image dimensions can be increased.
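The sensitivity of the IoU to small localization errors can be illustrated with a short sketch. The box sizes, the two-pixel shift, and the function names are hypothetical values chosen for illustration, not measurements from this study.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def iou_after_shift(side, shift):
    """IoU between a square box and the same box moved `shift` pixels in x."""
    truth = (0, 0, side, side)
    predicted = (shift, 0, side + shift, side)
    return box_iou(truth, predicted)


# The same two-pixel error on an 8-pixel object and on its 8x enlargement.
for side in (8, 64):
    print(side, round(iou_after_shift(side, 2), 4))
```

Under these assumptions, a two-pixel error drops the 8-pixel object below the AP75 threshold, while the enlarged object still passes it, which is consistent with the gain in the strict metrics after scaling.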

Scene Classification
We used the X-101-FPN model with the best scaling ratio (8×), applying it to a 3072 × 2048 pixel image (also with 8× scaling) to validate the mosaicking technique. Figure 7A demonstrates a satisfactory classification even in crowded areas. This process excluded 66 partial classifications in total (Figure 7B), and the trained model proved able to distinguish SBUs from other elements such as tourist beach umbrellas. Figure 8 shows three zoomed areas (1, 2, and 3), where the top images present the complete (correct) classification results and the bottom images show the partial (incorrect) classifications deleted by the non-maximum suppression sorted by area. Figure 8A-C shows the DEC, VEC, and HEC, respectively. Notably, example 3.2 shows that one of the partial predictions has greater confidence than the correct prediction (97% against 96%), demonstrating that ordering the non-maximum suppression by area brings improved results.
Table 3 lists quantitative values that may be very helpful in decision making. This methodology enables automatic counting and detection within large areas using Mask-RCNN. The sizes of the SBUs are very similar, with the average and median sizes very close and a standard deviation of 0.2 m². In addition, the algorithm was able to differentiate very close objects, showing that instance segmentation models work well in crowded regions.
Table 3. Analysis of the detected objects regarding their counting, average size, median size, minimum size, maximum size, and standard deviation, considering the 8× scaled image.
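Summary statistics of the kind listed in Table 3 can be derived from the predicted masks with a sketch such as the following. The pixel size of 0.31 m / 8 for the 8× scaled image, the use of the population standard deviation, the function name, and the example pixel counts are all assumptions rather than the authors' procedure or data.

```python
import statistics

# Assumed ground size (m) of one pixel in the 8x scaled image.
PIXEL_SIZE_M = 0.31 / 8


def summarize_areas(mask_pixel_counts, pixel_size_m=PIXEL_SIZE_M):
    """Count objects and summarize their areas in square meters."""
    areas = [n * pixel_size_m ** 2 for n in mask_pixel_counts]
    return {
        "count": len(areas),
        "mean": statistics.mean(areas),
        "median": statistics.median(areas),
        "min": min(areas),
        "max": max(areas),
        "std": statistics.pstdev(areas),
    }


# Hypothetical mask sizes (in pixels) for three detected objects.
print(summarize_areas([2000, 2400, 2200]))
```

Because every kept instance has its own mask, the count and the size distribution come directly from the mosaicked prediction without manual digitization.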

Discussion
Instance segmentation is a state-of-the-art computer vision segmentation method that enables many practical approaches for identifying objects at the pixel level. Most instance segmentation studies use large datasets (e.g., COCO [66], Cityscapes [87], or Mapillary Vistas [88]) in a ready-to-use format. Developing datasets for instance segmentation is highly complex and labor intensive, requiring annotation experts and a suitable storage format for DL models. For orbital remote sensing images, the difficulties are worsened by the need to choose the location of each image tile and by the scarcity of annotation software that considers the particularities of geospatial data. In a Web of Science search up to November 11, considering the keywords "instance segmentation", "remote sensing", and "deep learning", we found only 22 peer-reviewed journal articles. Despite the gains in efficiency and quality of results, the limited number of papers using instance segmentation reflects the difficulties reported. The present research demonstrates that instance segmentation allows a significant gain in inspection efficiency in coastal areas, which have not yet been explored. Within these 22 articles, Soloy et al. [89] also explored beach areas, but with a different approach, as the authors used photos taken with an iPhone to quantify grain size on pebble beaches.

Multichannel Instance Segmentation Studies
Few studies have addressed instance segmentation using multi-channel imagery. Most studies use RGB images [90][91][92] or three-channel images combining a digital orthophoto map with the near-infrared band from Landsat-8 [93]. The usage of multiple channels in remote sensing is widespread, allowing for more efficient detection than traditional RGB images (e.g., camera photos). There are four scenarios in remote sensing for using multi-channel inputs: (1) sensors with many spectral bands, (2) time series, (3) change detection, and (4) a combination of these characteristics (e.g., a time series of multispectral images). Using multispectral imagery, de Carvalho et al. [76] studied center pivot irrigation systems using Landsat-8 images. The authors compared the usage of seven channels with the traditional RGB, showing a difference of 3% in the AP metric when using more channels. Hao et al. [94] used a multiband input for the Mask-RCNN to identify tree crowns and estimate their height. Concerning time series applications, de Albuquerque et al. [95] used Sentinel-1 time series (up to eleven channels) for mapping center pivots. The authors reported increased performance when including more time frames. In a different approach, de Albuquerque et al. [27] used Sentinel-2 time series (up to 24 channels), considering four spectral bands per temporal frame, to effectively map regions with cloud presence.

Methods for Large Area Classification
A significant challenge in adapting DL to remote sensing applications is the large size of the images. In this regard, the present research used mosaicking with sliding windows for object detection/instance segmentation. This procedure is more common in semantic segmentation approaches using overlapping pixels [84][85][86][96]. The method uses a sliding window with a step size smaller than the frame dimensions, causing overlap. Averaging the overlapping areas mitigates errors, providing better accuracy metrics and visual results. However, for instance segmentation models, the procedure must consider the bounding boxes. In this sense, we modified the method proposed by de Carvalho et al. [76], introducing the double edge classifier (DEC), which is more efficient in extremely crowded areas, such as those containing SBUs. The methodology effectively eliminates frame discontinuity problems by considering the prediction under the best circumstance, providing a viable solution for mapping large areas.
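As a rough illustration of a sliding window whose step is smaller than the frame, the sketch below enumerates overlapping window positions over a large image. The 512 × 512 frame matches the 8× scaled tile size reported in the Conclusions; the stride of 256 pixels and the function names are assumptions for illustration only, and the edge classifiers themselves are not reproduced here.

```python
def window_offsets(length, window, stride):
    """Start positions of windows along one axis, covering [0, length)."""
    if window >= length:
        return [0]
    offsets = list(range(0, length - window + 1, stride))
    if offsets[-1] + window < length:
        # Add a final window flush with the image border.
        offsets.append(length - window)
    return offsets


def sliding_windows(width, height, window=512, stride=256):
    """Return (x1, y1, x2, y2) frames; stride < window yields overlap.

    Assumption: the stride of 256 is hypothetical; the source only states
    that the step is smaller than the frame dimensions.
    """
    return [
        (x, y, x + window, y + window)
        for y in window_offsets(height, window, stride)
        for x in window_offsets(width, window, stride)
    ]


# Frames for a 3072 x 2048 pixel image such as the test image.
tiles = sliding_windows(3072, 2048)
```

Each frame would be classified independently, and predictions touching a frame border would then be reconciled with those from neighboring frames where the same object appears whole.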
The capability to apply an instance segmentation algorithm over a large area enables a thorough scene understanding, which is vital for public inspection. For example, our study allows automatic counting of all SBUs and a series of other statistics, such as average size, median size, and standard deviation of the sizes, among others. These quantitative results increase the amount of information for public managers to act. In addition, it is possible to extract the exact location of each element just by getting the coordinates of each bounding box.
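The kind of summary described above can be sketched as follows. This is an illustrative example, not the authors' code: the detection records and pixel counts are hypothetical, and the conversion assumes that the ground size of a pixel in the 8× scaled image is the 31 cm WV-3 resolution divided by eight. Centroids are returned in pixel coordinates; converting them to geographic coordinates would require the image georeferencing, which is not shown.

```python
import statistics


def pixel_area_to_m2(pixel_count, gsd_m=0.31, scale=8):
    """Convert a mask pixel count to square meters.

    Assumption: scaling the image by `scale` divides the ground
    sampling distance (gsd_m, in meters) by the same factor.
    """
    pixel_side_m = gsd_m / scale
    return pixel_count * pixel_side_m ** 2


def summarize(detections, gsd_m=0.31, scale=8):
    """Count objects and describe their sizes (needs at least two objects)."""
    areas = [pixel_area_to_m2(d["mask_pixels"], gsd_m, scale) for d in detections]
    centroids = [
        ((x1 + x2) / 2, (y1 + y2) / 2)
        for (x1, y1, x2, y2) in (d["box"] for d in detections)
    ]
    return {
        "count": len(areas),
        "mean_m2": statistics.mean(areas),
        "median_m2": statistics.median(areas),
        "min_m2": min(areas),
        "max_m2": max(areas),
        "std_m2": statistics.stdev(areas),
        "centroids_px": centroids,
    }


# Hypothetical detections in the 8x scaled image.
detections = [
    {"box": (100, 100, 140, 140), "mask_pixels": 3700},
    {"box": (200, 100, 240, 140), "mask_pixels": 3863},
    {"box": (300, 100, 340, 140), "mask_pixels": 4000},
]
summary = summarize(detections)
```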

Small Object Problem
Small objects often underperform in many datasets. For example, in the COCO dataset, the AP small metric is much lower than the AP medium and AP large metrics. This effect is related to increasing noise with decreasing object size. In the review of Tong et al. [67], image scaling is a straightforward approach to improve small object detection. Nevertheless, no study compares the effect of different scaling ratios on object detection. In this regard, this research compares three scaling ratios for mapping SBUs, which are very small objects. This comparison can guide further studies of small object detection in other scenarios. Our results show that image scaling (even as a built-in image augmentation method) may be a plausible and effective solution. The AP metric increased by more than 20 percentage points when considering eight times the original size. Even so, doubling the dimensions already provided a significant increase. This analysis is relevant since increasing the image dimensions might present computational problems (e.g., memory and processing time). Some other alternatives have been studied for detecting small objects. Zhang et al. [97] proposed a scale adaptive proposal network by modifying the Faster-RCNN architecture. This innovative approach has broad applications where datasets contain objects of many different sizes. Nonetheless, considering different scales might not be enough for very small objects, especially for AP scores, where a few mistakes in the bounding box drastically reduce this accuracy metric. Generative adversarial network (GAN) algorithms also present advances in studies with small objects [65]. In remote sensing, Ren et al. [98] proposed an advanced end-to-end GAN to increase image resolution and apply the Faster-RCNN network in object detection. Therefore, a viable alternative for future studies would be the development of algorithms using GAN for surveillance in coastal areas.
For the traditional RGB images of the COCO dataset, Kisantal et al. [99] developed an augmentation scheme based on copying and pasting small objects into different images, increasing the representation of small objects across a larger number of images. This augmentation is a promising strategy for datasets with images at different scales. However, it can be computationally expensive with multichannel imagery and when detecting many small objects.
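To make the effect of image scaling on object size concrete, the sketch below estimates how many pixels an object occupies at each scaling ratio and which COCO size category that corresponds to (COCO treats areas below 32² pixels as small and below 96² pixels as medium). The 5.8 m² area and the 31 cm resolution come from this paper; treating that average as a mask area, and dividing the pixel size by the scaling ratio, are assumptions made for illustration.

```python
COCO_SMALL_MAX = 32 ** 2   # COCO: area below 32^2 px counts as "small"
COCO_MEDIUM_MAX = 96 ** 2  # COCO: area below 96^2 px counts as "medium"


def coco_size_class(area_px):
    """Map an object area in pixels to the COCO size category."""
    if area_px < COCO_SMALL_MAX:
        return "small"
    if area_px < COCO_MEDIUM_MAX:
        return "medium"
    return "large"


def scaled_area_px(area_m2, gsd_m, scale):
    """Pixels covered by an object of area_m2 after upscaling the image.

    Assumption: upscaling by `scale` divides the ground size of a pixel
    (gsd_m, in meters) by the same factor.
    """
    return area_m2 / (gsd_m / scale) ** 2


# Illustrative object of 5.8 m^2 imaged at 0.31 m per pixel.
scales = [1, 2, 4, 8]
areas = [scaled_area_px(5.8, 0.31, s) for s in scales]
classes = [coco_size_class(a) for a in areas]
```

Under these assumptions, the object stays in the small category up to the 4× ratio and only crosses the 32² pixel threshold at 8×.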

Accuracy Metric Analysis for Small Objects
Even though the COCO metrics are broadly applicable to instance segmentation datasets (including the COCO dataset), the AP50 is the most appropriate metric for analyzing small objects (especially in datasets in which all objects are small) since very few mistakes drop the performance metrics significantly. Figure 9 shows two theoretical examples, A and B, in which the prediction and the ground truth bounding boxes have exactly the same spatial dimensions. For a small object, a slight error of one pixel horizontally and vertically yields an IoU of 69.25%, impacting the AP and AP75 metrics. The same one-pixel error in a 100 × 100-pixel bounding box yields a 96.10% IoU, showing the attenuation of slight errors in larger objects. This research shows that a simple increase in object dimensions allows the algorithm to achieve a better accuracy score. Therefore, generating ground truth data, especially for small objects, must be done rigorously to avoid misleading metrics.
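The arithmetic behind this argument can be checked with a short sketch. For two boxes of identical size offset by one pixel in each direction, the intersection is (w − 1)(h − 1) and the union is 2wh minus that intersection. The function name is hypothetical, and the 10 × 10 box is an assumed small-object size, since the dimensions used in Figure 9 are not stated in the text; the 100 × 100 case reproduces the 96.10% reported above.

```python
def shifted_iou(width, height, dx=1, dy=1):
    """IoU of two equal-sized boxes offset by (dx, dy) pixels."""
    inter = max(0, width - dx) * max(0, height - dy)
    union = 2 * width * height - inter
    return inter / union


# Large object: the one-pixel error barely changes the IoU.
iou_large = shifted_iou(100, 100)

# Assumed small object (10 x 10 px): the same error falls below the
# 0.75 IoU threshold used by AP75 but stays above the 0.50 of AP50.
iou_small = shifted_iou(10, 10)
```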

Policy Implications
The Brazilian Government is responsible for the administration and inspection of federal properties. According to Normative Instruction No. 23, of 18 March 2020, the inspection action may have a preventive or coercive nature, requiring a field inspector to investigate possible irregularities committed against federal properties. The inspection action is predominantly coercive through denunciation, when the improper action is consolidated, leaving only the repair of the damage. The lack of preventive action causes an increase in unlawful acts and the filing of numerous lawsuits, with deprivation of use of areas and legal uncertainty.
In Brazil, beach areas are public properties protected by environmental legislation (CONAMA resolution No. 303 of 20 March 2002) as permanent preservation areas and consist of Navy land, where private occupation (private, commercial, or industrial) requires payment of a fee for the use of the public area. Beach areas are constant targets of economic exploitation and improper tourism and need constant surveillance. In this context, the development of remote and semi-automated methods of surveillance of property misuse becomes fundamental.
Therefore, the instance segmentation of multispectral remote sensing images demonstrates a high potential to establish an effective action with a solid preventive impact due to the rapid detection of infractions. However, the procedure should be extended to other unauthorized activities in coastal areas, such as landfills, deforestation, construction, fences, or other improvements, which could be developed in future lines of research.

Conclusions
The automatic remote sensing detection of tourist infrastructure in beach areas is essential for government surveillance, requiring quick and periodic information for decision making. The coastal regions of Brazil are government property, being areas with specific taxation for use and environmental protection. This study proposed a methodology based on instance segmentation to identify straw beach umbrellas (SBUs), the most common tourist structure on Brazilian beaches. The developed method integrates different solutions for the use of instance segmentation in remote sensing data: (1) multi-channel models, (2) small object detection, and (3) classification of large areas. To this end, we modified Detectron2's Mask-RCNN model to account for multi-channel image inputs in TIFF format, compared different scaling ratios on the original image, and improved the existing method for classifying large images using the sliding window technique. Our results show that increasing the image dimensions significantly improves the AP metric, from 30% to 58%. In addition, the less strict metric (AP50) improved from 74% to 94%. Image scaling is a computationally expensive solution, so we initially considered the original image dimensions of 64 × 64 pixels. In addition, even though we evaluated up to eight times the original dimensions (resulting in a 512 × 512 image), a two-times resizing already provides a significant increase. Thus, a trade-off between computational cost and prediction quality must be defined.
Another problem is the accumulation of errors on the frame edges, which intensifies with overcrowded objects. Our proposal to use the double edge classifier (DEC) solved the problem simply and efficiently. The combination of the presented methods is a suitable solution for accurately detecting small objects in large areas using multispectral data, providing insightful information for public managers. For example, statistical analysis of the SBUs on a 3072 × 2048 test image identified 148 objects with an average size of 5.8 m². The bounding box centroid established the exact geographic location. Future studies in this area will consider more beach elements, exploring objects and background elements, and other segmentation tasks such as panoptic segmentation.
Carvalho Júnior, Roberto Arnaldo Trancoso Gomes, and Renato Fontes Guimarães; Coordination for the Improvement of Higher Education Personnel (CAPES) for postgraduate assistance; Union Heritage Secretariat of the Ministry of Economy for financial support; and the European Space Agency (ESA) for image supply within the project "Surveillance of union properties areas using deep learning technique in satellite images". Special thanks are given to the research group of the Laboratory of Spatial Information System of the University of Brasilia for technical support.