Instance Segmentation for Large, Multi-Channel Remote Sensing Imagery Using Mask-RCNN and a Mosaicking Approach


The DL techniques for segmentation have two subdivisions: (i) semantic segmentation (labels are class-aware); and (ii) instance segmentation (labels are instance-aware). Semantic segmentation brings pixel-wise classification to the entire scene, with information about category, localization, and shape [50]. In addition, semantic segmentation differs from image classification since it enables all object parts to interact, by identifying and grouping pixels that belong together semantically [51]. The deep semantic understanding allows us to aggregate the different parts into a whole, considering variations of colors, textures, and patterns. Several recently published reviews on semantic segmentation highlight the algorithms' innovations, applications, and taxonomy [51][52][53][54][55].
However, the semantic segmentation results do not distinguish different instances within the same category, which limits the individual separation of objects. Therefore, the new problem is not only to determine the pixels of a specific class (semantic segmentation) but also to discern different objects in the same category, obtaining the exact number of a given object in the image (instance segmentation). Thus, instance segmentation is a new paradigm and an evolution of semantic segmentation, allowing a unique understanding of each object, counting the number of objects, and analyzing objects in occlusion and contact conditions. Instance segmentation algorithms have two main approaches [56]: (a) the segmentation-first strategy, where segmentation occurs before classification; and (b) the instance-first strategy, where segmentation and classification run in parallel. In turn, the segmentation-first strategy also has two approaches: (i) segment-based, which first establishes segment candidates and then performs their classification [57][58][59]; and (ii) based on semantic segmentation masks, which tries to separate the pixels of the same class into different instances [60][61][62][63].
The instance-first strategy methods have advantages for being more straightforward and more flexible, allowing the algorithm to obtain the bounding boxes and the segmentation masks simultaneously. The main models proposed were Fully Convolutional Instance-Aware Semantic Segmentation (FCIS) [64], Mask-Region-based Convolutional Neural Network (Mask R-CNN) [56], Cascade Mask R-CNN [65,66], Mask Scoring R-CNN [67], and High-Quality Instance Segmentation Network (HQ-ISNet) (based on Cascade Mask R-CNN) [68].
However, overcoming some challenges is necessary for a broad application of instance segmentation in remote sensing (and multi-channel medical imagery). The instance segmentation frameworks (e.g., Detectron2) use configurations and libraries restricted to Red, Green, and Blue (RGB) images, traditionally applied by the computer vision community in tasks such as fruit detection [82] and animal recognition [83], among others. This is a data limitation for optical Earth observation sensors, which are generally multispectral and whose available channels provide complementary information that maximizes accuracy. In semantic segmentation, approaches to aggregate more information have considered: (a) the use of image fusion techniques, where the three bands used are data integration products [84]; and (b) adaptation of the input layer to support a larger number of channels, e.g., 14 channels [15], 12 channels [14], 7 channels [85], and 4 channels [86].
A necessary fit for satellite images comes from its large size, in contrast to traditional CNN methods that receive fixed-size inputs and produce a unique classification for the entire image. Therefore, a strategy widely used in the semantic segmentation of remote sensing images is to subdivide it into patches with the same size as the training samples from a sliding window with a step that allows establishing an overlap interval [87]. Image mosaicking considers mathematical operations (usually averaging) in the overlapping areas to avoid frame junction errors [88]. Albuquerque et al. (2020) evaluate the segmentation's accuracy, considering sliding windows with different overlapping strides. Research has also been carried out to evaluate different sliding window sizes [14,18,89]. The frames' fixed size must consider a dimension that allows the general context to perform the classification without a significant increase in computational complexity and CNN parameters. Thus, balancing these two factors is crucial to ensure object detection and computational efficiency. Instance segmentation, where each object in a category has also a unique identification, requires different adjustments in the patch mosaic compared to the semantic segmentation methods, since it is not possible to perform the simple use of an average between the overlapping areas.
For instance segmentation, image labeling requires polygons that delimit each object individually with its bounding box (coordinates) and pixel-wise segmentation mask. This annotation format is more complex and laborious, and requires highly qualified specialists to label the information correctly. Thus, a limitation for detecting remote sensing targets is the lack of publicly available data sets suitable for instance segmentation. Many publicly labeled data sets exist for photographic landscape images, such as LabelMe [90], ImageNet [91], PASCAL [92], Cityscapes [93], Open Images [94], and Common Objects in Context (COCO) [95]. In this context, the two most popular formats for annotating objects in computer vision data are COCO and Pascal Visual Object Classes (VOC). Although we do not yet have a large-scale remote sensing image dataset with the appropriate instance segmentation annotations, several databases with raster and vector information can be adapted for this purpose. Therefore, a challenge is to develop a method for converting vector data to the COCO annotation format (a data format widely used by the instance segmentation and object detection community).
This research aims to perform instance segmentation on multi-channel remote sensing imagery for Center Pivot Irrigation System (CPIS) detection. In this context, the research has three secondary objectives that improve the use of instance segmentation in remote sensing. The first is to develop a method for converting the remote sensing data with its respective vector and raster data to the COCO data format containing the corresponding JavaScript Object Notation (JSON) annotation file. The second is to adapt Detectron2 instance segmentation source code [96] to allow the multispectral data set (the seven surface reflectance bands of Landsat 8 image). Finally, the third is to develop a novel mosaicking method using the sliding window technique and a modified non-max suppression sorted by area to classify large images.

Related Works on Center Pivot Detection
The mapping of CPIS from remote sensing imagery has changed little over time, relying predominantly on visual interpretation of circular features from the 1970s and 1980s [97,98] until recently [99][100][101][102], with a significant consumption of labor and time. The different colors, textures, and spectral information inside and between the center pivots make it challenging to obtain accurate classifications with traditional machine learning methods based on pixels or vegetation indices. Consistent automatic detection of center pivots emerged with methods based on deep learning [85,103,104]. Zhang et al. [103] were the precursors in using CNNs for automatic identification of CPIS. That research used an RGB image and did not perform segmentation; it only identified the central point of each CPIS and established a bounding square with a predetermined size that did not necessarily coincide with the circumference of the center pivot. Subsequently, two articles reported the use of semantic segmentation for the detection of CPIS. Saraiva et al. [104] applied the U-Net architecture to segment images of the PlanetScope constellation containing four channels (blue, green, red, and near-infrared). De Albuquerque et al. [85] compared three CNN architectures (U-net, Deep ResUnet, and SharpMask) using Landsat-8 surface reflectance images composed of 7 bands in the rainy and dry periods. In this context, instance segmentation is still an unexplored method for this target. It is a differential for the management of irrigated areas, as it establishes the quantity and size of the center pivots, which are fundamental factors for forecasting the harvest and water consumption.

Materials and Methods
The present research had the following methodological steps: (a) image data acquisition from three different areas in the rainy and dry periods; (b) clipping frames with 512 × 512 pixel dimensions (for the original image and ground truth) with their corresponding annotations in COCO format; (c) data partition into training, development, and test sets (train/dev/test split); (d) training Detectron2 with different backbones; (e) COCO metrics evaluation; and (f) large image mosaicking (Figure 1).

Dataset and Study Areas
Despite the interest in satellite imagery, few open datasets use multichannel imagery for instance segmentation tasks. The existing datasets are either RGB or for different tasks, such as semantic segmentation or object detection [105][106][107]. Some open challenges, such as SpaceNet [108] (which provides polygons), could use the same methodology used in this paper to experiment instance segmentation algorithms. Nevertheless, we used the CPIS database developed by Albuquerque et al. [85] based on the survey of center pivots in Brazilian territory by the National Water Agency (ANA) in 2014 [100], since it presents high relevance to agricultural studies. The elaboration of the ANA dataset used visual interpretation on a computer screen. Albuquerque et al. (2020) corrected the data considering the periods of drought and rain for 2015 and 2016 in three regions of Central Brazil. We used surface reflectance images from the Landsat-8 Operational Land Imager (OLI) sensor, containing seven bands and 30-m resolution, for the three regions of Central Brazil. The images correspond to the period of drought and rain.
The study areas are located in the Cerrado biome, which presents a high expansion of center-pivot irrigation due to flat land favorable to mechanization and the dry season between May and September [109]. The three study sites consist of areas around the Federal District, Mato Grosso, and Western Bahia regions, totaling 3731 center pivots (more than six thousand instances considering both seasons) (Figure 2). The region surrounding the Federal District has the largest number of center pivots in Brazil, driven by the proximity of the country's capital and marked by conflicts over water use [110,111]. In the last decades, the Western Bahia region has presented advanced agribusiness growth with the expansion of irrigated areas and water conflicts [112][113][114]. Finally, the state of Mato Grosso has favorable environmental factors for agriculture, presenting a 175% growth in CPIS in the period 2010-2017 [99].

COCO Annotation Format
Semantic segmentation algorithms need only a ground truth mask where each element has a class, e.g., pivots (1) and non-pivots (0). Meanwhile, instance segmentation has additional complications in the labeling and annotation format, requiring that each element in a sample image in the training process needs a unique value. For example, an instance segmentation mask with ten center pivots needs different values for each pivot, contrasting with semantic segmentation masks, where all pivots have the same value.
Most of the instance segmentation algorithms follow the COCO annotation format. Thus, we developed a methodology to generate and convert training samples (composed of Landsat images and polygon labels) to the COCO annotation format. This procedure does not aim to replace labeling software, e.g., LabelMe, but to give an alternative for cases in which polygon data of the targets already exist, which is common in the remote sensing community. The conversion procedure uses two programs: (a) a program developed in this research to extract the samples in frames with a predetermined size and to produce input data compatible with the next program; and (b) the Cocosynth repository (https://github.com/akTwelve/cocosynth) [115], with some adaptations, which converts the data to the COCO annotation format (Figure 3).
The first program, developed in the C++ language, considers the following input data: remote sensing image with the respective number of bands, labeled image, vector point of the frame's centroid, and the height and width parameters of each frame. The labeled image is elaborated by converting vector polygons to raster, where each center pivot acquires a distinct integer value from 1 to N, where N is the number of center pivots in the entire scene. The program modifies the labeled image to be compatible with the Cocosynth program, which uses different colors for each instance. Thus, the program converts the polygon identifiers to RGB values, using an algorithm similar to a numerical base conversion (decimal to base-256). The RGB numerical system has 16,777,216 (256 × 256 × 256) color possibilities. The algorithm consists of performing two consecutive divisions by 256. First, the integer number is divided by 256, and the Red color value (R) is the remainder. Then, the integer part of the division result is divided by 256 again, where the Green color value (G) is the remainder, and the Blue color value (B) is the integer part of the second division (Equations (1)-(3)). The polygon values start at one instead of zero since (0,0,0) is the background color. The integer 1 is represented as (1,0,0), while the integer 16,777,215 is represented as (255,255,255). The color conversion within the image proceeds from left to right and top to bottom. Figure 4 shows the processing steps from the polygons to the RGB image. In this way, the program changes the labeled image type (".tiff" file with integer numbers ranging from 1 to the number of instances) to a more straightforward format for data conversion (".PNG" file with the RGB channels). The proposed C++ program also creates a JSON file with the information of each frame (original image and label data), such as the color, category, and super category of each object.
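The two consecutive divisions by 256 can be illustrated with a minimal Python sketch. It is not the C++ program used in this research; the function names `id_to_rgb` and `rgb_to_id` are hypothetical and serve only to show the base-256 conversion and its inverse.

```python
def id_to_rgb(instance_id):
    """Convert an instance identifier into (R, G, B) base-256 digits."""
    if not 0 <= instance_id < 256 ** 3:
        raise ValueError("identifier does not fit in three 8-bit channels")
    # First division: the remainder is the Red value.
    quotient, red = divmod(instance_id, 256)
    # Second division: the remainder is Green, the integer part is Blue.
    blue, green = divmod(quotient, 256)
    return red, green, blue


def rgb_to_id(red, green, blue):
    """Inverse conversion, recovering the identifier from its color."""
    return red + 256 * green + 256 ** 2 * blue


if __name__ == "__main__":
    for pivot_id in (1, 255, 256, 70000):
        print(pivot_id, id_to_rgb(pivot_id))
```

Because the mapping is a plain change of base, every identifier has exactly one color and the inverse function recovers it, which is what allows a color-based tool to separate the instances again.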
The next step to create the COCO annotation file was to adapt the Cocosynth code (coco_json_utils.py) [115] to allow the management of multi-channel remote sensing images in ".tiff" or ".tif" format. This code uses the JSON-file created by our C++ program with color, category, and super-category and creates a new JSON file in the COCO annotation format, which is ready to train.

Data Split
In the scientific literature, there is no predetermined optimal train/validation/test split. We used 228 images for training and 50 images each for validation and test (approximately 70/15/15). The Landsat-8 training images had a 512 × 512-pixel dimension, resulting in a 512 × 512 × 7 input shape. The choice of window size balanced a larger image size, which minimizes edge effects, against the available computational capacity. Table 1 lists the number of instances used in each process. Although the number of images is not extensive, there is a high concentration of instances, which is the most important number for training the algorithm. In addition, we applied data augmentation considering brightness, contrast, and resizing in the training data. This kind of procedure helps to reduce overfitting and enhances the model's ability to learn new features.

Mask R-CNN
One of the most powerful instance segmentation frameworks is the Mask R-CNN [116], introduced by Facebook Artificial Intelligence Research (FAIR), which combines object detection and semantic segmentation and is an evolution of the RCNN [117], Fast RCNN [118], and Faster RCNN [119] methods. This framework operates in two stages: (a) generation of region proposals; and (b) classification of each generated proposal.
We used the Detectron2 [96], a software powered by the Pytorch framework, containing many backbone structures and a faster training process ( Figure 5). The original code (https://github.com/facebookresearch/detectron2) uses libraries restricted to RGB in more traditional formats, such as PNG and JPEG formats, whereas satellite images present more channels in the TIFF format. Thus, we implemented changes to read and train multi-channel images in the TIFF format.

Backbone Structure
The input image passes through a convolutional network, also called the backbone structure ( Figure 6). The backbone may vary according to the desired tradeoff between performance, training speed, and limitations due to computational power.
The Mask-RCNN architecture consists of a bottom-up and a top-down pathway. The bottom-up section is responsible for the convolutions and generation of the feature maps, and the most used structures are ResNets [120] or ResNeXts [121] with five convolutional modules (C1, C2, C3, C4, and C5). The stride doubles from one module to the next, which means the feature map dimensions halve. Each convolutional module includes many layers, whose number depends on the chosen depth of the ResNet. The more layers, the longer it takes to train, but the accuracy tends to be higher, especially in complex object detection. In the present research, we used ResNet50, ResNet101, and ResNeXt101. ResNeXts often present better results than ResNets since they use multiple parallel convolutions. Figure 7 shows a simplified structure, where the number of those parallel convolutional blocks in the ResNeXt is the cardinality. Xie et al. [121] tested different cardinality values (1, 2, 4, 8, and 32), showing the best results using 32 (the one used in this research). The input and output dimensions (256d) from the ResNet and ResNeXt are the same, demonstrating similar levels of complexity, varying only in the convolutional structures.

The bottom-up and top-down pathways link through lateral connections, ensuring spatial cohesion from one module to another. In addition, each module in the feature extractor gives a prediction (P5, P4, P3, P2) that will be used in the Region Proposal Network. The greater the number of convolutional layers, the more complex information the algorithm tends to learn, but the risk of overfitting also rises, and applying dilation on the convolutional modules may increase performance on objects of different sizes. Thus, testing different structures is essential to obtain optimal results. We compared seven different backbone structures (ResNet50-FPN, ResNet50-DC5, ResNet50-C4, ResNet101-FPN, ResNet101-DC5, ResNet101-C4, and ResNeXt101-FPN).

Region Proposal Network and Region of Interest (ROI) Align
The backbone outputs (P2, P3, P4, P5) are feature maps used in the Region Proposal Network (RPN) to generate anchor boxes. Each region with high probability generates 9 anchor boxes with different ratios (1:1, 2:1, 1:2) and scales (0.5, 1, 2). The Regions of Interest (ROIs) pass through ROI align (Figure 8), a quantization-free bilinear interpolation that preserves spatial information (He et al., 2016). These fixed-dimension ROIs enter three parallel processes: (a) class of the object and its respective probability; (b) bounding box; and (c) segmentation mask.
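The combination of three ratios and three scales into nine anchors can be sketched as follows. This is an illustration only, not the Detectron2 anchor generator: the function name `make_anchors`, the base size of 64 pixels, the reading of each ratio as width:height, and the choice to keep the anchor area constant across ratios are all assumptions.

```python
import math


def make_anchors(cx, cy, base_size=64.0,
                 ratios=(1.0, 2.0, 0.5), scales=(0.5, 1.0, 2.0)):
    """Return nine (x1, y1, x2, y2) anchors centered at (cx, cy)."""
    anchors = []
    for scale in scales:
        # Assumption: every ratio at a given scale keeps the same area.
        area = (base_size * scale) ** 2
        for ratio in ratios:  # ratio is interpreted as width / height
            width = math.sqrt(area * ratio)
            height = width / ratio
            anchors.append((cx - width / 2, cy - height / 2,
                            cx + width / 2, cy + height / 2))
    return anchors


if __name__ == "__main__":
    for box in make_anchors(100.0, 100.0):
        print(tuple(round(value, 1) for value in box))
```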

Loss Functions
The total loss of the training process is the addition of mask loss, class loss, and box regression (Equation (4)). The segmentation mask is a binary classification that involves a single classifier per class (one versus all strategy). Therefore, each ROI will only consider one object at a time. Thus, the loss function is a simple log loss [118], in which the result is the average from all results (Equation (5)). The classification loss is also the same formula.
There are two ways to obtain the bounding box, considering the four coordinate values: (a) using "x", the centroid in the x-axis; "y", the centroid in the y-axis; "h", the height of the box; and "w", the width of the box [123]; and (b) using "x1", the minimum x value; "x2", the maximum x value; "y1", the minimum y value; and "y2", the maximum y value. The Detectron2 algorithm uses the second method, and its regression loss function uses L1 loss (Equation (6)). Figure 9 shows the process after a loss reduction from the first to the second iteration. The computed loss is lower in the second iteration because the differences between the ground truth (black dotted line) and the prediction (red line) are smaller.
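$$L_{mask} \text{ and } L_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right] \quad (5)$$

where $y_i$ is the ground truth label, $p_i$ is the predicted probability, and $N$ is the number of elements being averaged.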
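The three loss terms described in this section can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Detectron2 implementation: the function names are hypothetical, the log loss is written in its standard binary form, the L1 term is summed over the four box coordinates, and the three terms are added with equal weights.

```python
import math


def log_loss(targets, probabilities, eps=1e-7):
    """Average binary log loss over all elements."""
    total = 0.0
    for y, p in zip(targets, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(targets)


def l1_box_loss(true_box, predicted_box):
    """Sum of absolute differences over (x1, y1, x2, y2)."""
    return sum(abs(t - p) for t, p in zip(true_box, predicted_box))


def total_loss(mask_loss, class_loss, box_loss):
    """Total training loss as the sum of the three terms."""
    return mask_loss + class_loss + box_loss


if __name__ == "__main__":
    truth = (10, 10, 50, 50)
    first_iteration = l1_box_loss(truth, (16, 4, 42, 58))
    second_iteration = l1_box_loss(truth, (12, 9, 48, 53))
    # The prediction closer to the ground truth gives the lower loss.
    print(first_iteration, second_iteration)
```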

Hyperparameter Configuration
Another critical step in training a neural network is the hyperparameter configuration. We trained from scratch (unfreezing all layers) seven models using all seven channels, and the best model using only the RGB channels (Landsat-8 bands 2, 3, and 4). We used: (a) Adam optimizer with a learning rate of 0.001, divided by ten after 1000 iterations, and momentum of 0.9; (b) 256 ROIs per image; (c) 30,000 iterations, keeping track of the validation loss to find an optimal convergence point and avoid overfitting; (d) five anchor box sizes of 16, 32, 64, 128, and 256; (e) 1000 warm-up iterations (where the learning rate slowly increases to avoid errors) with a 0.001 factor; and (f) 1 image per batch. In addition, we used an Nvidia GeForce RTX 2080 TI GPU with 11 GB memory. Data normalization (z-score method) was necessary since each channel has a different range of values, which can harm the training process, e.g., through vanishing gradients [124] (Equation (7)). Furthermore, normalization accelerates the training process.
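A minimal sketch of the per-channel z-score normalization is given below. It assumes that each band is normalized independently with its own mean and population standard deviation; the function name `zscore_channel` and the sample values are hypothetical, and a real pipeline would operate on image arrays rather than Python lists.

```python
import statistics


def zscore_channel(values):
    """Normalize one channel to zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population std
    if std == 0:
        # A constant channel carries no contrast; map it to zeros.
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]


if __name__ == "__main__":
    # Two hypothetical bands with very different value ranges.
    bands = {"band_a": [200, 400, 400, 400, 500, 500, 700, 900],
             "band_b": [2, 4, 4, 4, 5, 5, 7, 9]}
    for name, pixels in bands.items():
        print(name, zscore_channel(pixels))
```

After normalization both hypothetical bands yield the same values, which illustrates how the procedure removes the difference in ranges between channels.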

Accuracy Analysis
Accuracy analysis is crucial in Deep Learning tasks to evaluate how well the trained model behaves on new data, which gives insight into its applicability in the real world. The confusion matrix shows each class's frequencies, being extremely useful to evaluate supervised Machine Learning/Deep Learning models. Figure 10 shows the confusion matrix, where True Positives (TP) and True Negatives (TN) represent elements correctly identified in their corresponding classes. In contrast, False Positives (FP) and False Negatives (FN) represent misclassified elements. The two primary metrics for evaluating instance segmentation models are precision (Equation (8)) and recall (Equation (9)). Precision is the number of correctly identified positive instances (TP) divided by the total number of positive predictions (TP + FP), and recall is the number of correctly identified positive instances divided by the total number of positive instances (TP + FN).
Precision and recall bring rich insights into the data, but, when dealing with deep learning algorithms, the results are often probabilities, and another crucial piece of information is the threshold cutoff point. The threshold considers the Intersection over Union (IoU) of the bounding boxes (Figure 11). A low IoU threshold is more permissive when considering possible targets, and a high IoU threshold is more restrictive. The optimal point may vary depending on each problem.
We evaluated the models with the COCO metrics [95]. These are the most commonly used metrics in instance segmentation tasks, proving to be satisfactory to evaluate and compare different models in object detection and segmentation (mask quality) performance, including the original Mask RCNN research [56], You Only Look At Coefficients (YOLACT) [125], YOLACT++ [126], mask scoring RCNN [67], and cascade RCNN [66], among other works applying these methods.
The average precision (AP) uses the mean value from 10 IoU thresholds, starting at 0.5 up to 0.95 with 0.05 steps (0.50: 0.05: 0.95). The closer the AP is to 1, the better the model. AP50 represents the calculation under the IoU threshold of 0.50, whereas AP75 is a stricter metric and represents the calculation under the IoU threshold of 0.75. In addition, the metrics consider the average precision for different target sizes, with three categories: (a) small (area < 32² pixels); (b) medium (32² pixels < area < 96² pixels); and (c) large (area > 96² pixels). The present research does not have objects larger than 96² pixels; thus, we only consider APsmall and APmedium. Another important metric is the Average Recall (AR), where the averaged IoU thresholds are the same as for the AP (0.50: 0.05: 0.95). Furthermore, the AR considers the maximum number of detections (Max Dets). Since the maximum number of detections in a single 512 × 512-pixel frame in our dataset is 96, we only consider the AR with a maximum of 100 detections (AR100). The other options analyzed in the COCO dataset consider 1 and 10 detections, which would not bring much value to the observations.
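The two definitions translate directly into code. The sketch below uses hypothetical function names and made-up counts purely for illustration; returning zero when the denominator is empty is an assumption about how to handle that edge case.

```python
def precision(true_positives, false_positives):
    """TP / (TP + FP): share of predictions that are correct."""
    predicted = true_positives + false_positives
    return true_positives / predicted if predicted else 0.0


def recall(true_positives, false_negatives):
    """TP / (TP + FN): share of real instances that were found."""
    actual = true_positives + false_negatives
    return true_positives / actual if actual else 0.0


if __name__ == "__main__":
    # Hypothetical counts: 90 correct detections, 10 false alarms,
    # and 30 missed objects.
    print(precision(90, 10), recall(90, 30))
```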
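As an illustration of the IoU between two bounding boxes, the following sketch assumes boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2; the function name `box_iou` is hypothetical.

```python
def box_iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes have no intersection.
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    prediction = (5, 0, 15, 10)
    ground_truth = (0, 0, 10, 10)
    iou = box_iou(prediction, ground_truth)
    # This pair would not count as a match at the 0.5 threshold.
    print(iou, iou >= 0.5)
```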
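The threshold range and the size categories can be written down explicitly. The sketch below uses hypothetical helper names; assigning an area exactly equal to a limit to the larger category is an assumption, and the final averaging step takes already computed per-threshold precisions as input rather than computing them.

```python
def coco_iou_thresholds():
    """The ten IoU thresholds 0.50, 0.55, ..., 0.95."""
    return [round(0.50 + 0.05 * step, 2) for step in range(10)]


def size_category(area_in_pixels):
    """Size class of an object from its area in pixels."""
    if area_in_pixels < 32 ** 2:
        return "small"
    if area_in_pixels < 96 ** 2:
        return "medium"
    return "large"


def mean_over_thresholds(precision_per_threshold):
    """AP as the mean of the precision obtained at each threshold."""
    return sum(precision_per_threshold) / len(precision_per_threshold)


if __name__ == "__main__":
    print(coco_iou_thresholds())
    print(size_category(900), size_category(5000))
```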

Scene Mosaicking
Remote sensing images are larger than the image size used for training and validation, which is restricted by computational limitations. For example, the center pivot survey covers a wide area, not restricted to a single frame. Therefore, the classification of a complete scene requires a mosaic reconstruction of sub-images with the training image size. For this reason, we used the sliding window technique, which runs through the image with a specific dimension (height × width) and a stride value in the horizontal and vertical directions. When the stride is smaller than the window size, it creates an overlap between consecutive frames. Semantic and instance segmentation errors occur predominantly at the frame edges and are corrected or minimized with overlapping images [85,87].
The sliding window with a stride dimension corresponding to half-frame length shows three patterns ( Figure 12): (a) base arrangement (initial position at x = 0 and y = 0) ( Figure 12A); (b) horizontal displacement arrangement (initial position at x = half-frame length and y = 0) ( Figure 12B); and (c) vertical displacement arrangement (initial position at x = 0 and y = half-frame length) ( Figure 12C).
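The three arrangements can be sketched as three grids of window origins that differ only in their starting offset. The function names below are hypothetical, and skipping windows that would extend beyond the image border (instead of padding the image) is an assumption of this sketch.

```python
def window_origins(image_width, image_height, window=512,
                   x_offset=0, y_offset=0):
    """Top-left corners of adjacent windows starting at an offset."""
    return [(x, y)
            for y in range(y_offset, image_height - window + 1, window)
            for x in range(x_offset, image_width - window + 1, window)]


def three_arrangements(image_width, image_height, window=512):
    """Base, horizontally displaced, and vertically displaced grids."""
    half = window // 2
    return {
        "base": window_origins(image_width, image_height, window, 0, 0),
        "horizontal": window_origins(image_width, image_height, window,
                                     half, 0),
        "vertical": window_origins(image_width, image_height, window,
                                   0, half),
    }


if __name__ == "__main__":
    for name, origins in three_arrangements(1024, 1024).items():
        print(name, origins)
```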
In this configuration, window overlays guarantee three classifications for the same object (disregarding an edge with half-window length). Incomplete classifications at the window edges (red and orange boxes) should be eliminated ( Figure 13A), remaining in these places only the boxes (marked in green) from the two other arrangements (horizontal or vertical) ( Figure 13B). We restricted the valid boxes to the central zone of the vertical and horizontal displacement arrangements where edge errors concentrate on the base arrangements, optimizing and eliminating information redundancy. Figure 13 shows the green boxes as the appropriate result of the conjunction of the base ( Figure 13A) and horizontal and vertical configurations ( Figure 13B).
The bounding box position of a given sliding window is repositioned to a coordinate system that considers the entire image. Consequently, data processing is windowed, but storage considers the size of the original image. In addition, each object's description uses a binary mask with the total dimension of the image (filled with zeros). Therefore, each new element is stored in a new dimension of the array, with shape (number of instances, height, width). We store four types of information in NumPy arrays: (1) bounding box coordinates (N, x1, y1, x2, y2); (2) class labels for each bounding box (N, classification); (3) prediction scores for each bounding box (N, predictions); and (4) prediction masks for each instance (N, image height, image width). To exclude excessive bounding boxes, we apply a modified non-maximum suppression algorithm that uses the box size and the overlapping area index. The method calculates the bounding box area from its coordinate pairs in the upper left corner (x1, y1) and lower right corner (x2, y2) (Equation (10)), sorts the boxes by size, and selects the largest. The boxes are eliminated from the smallest to the largest areas to avoid possible errors. To ensure that we are eliminating overlapping boxes, we use a ratio that is the total overlap area divided by the box area (Equation (11)), considering the Overlap Box Width (OBW) and Overlap Box Height (OBH) (Equations (12) and (13)) (Figure 14). The coordinate values increase from top to bottom and from left to right. We consider an overlap ratio of 0.3 to exclude excessive boxes (keeping the box with the largest area).
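The suppression sorted by area can be sketched as a greedy procedure that visits the boxes from the largest to the smallest and discards any box that overlaps an already kept, larger box by more than the threshold. This is a simplified illustration, not the code used in this research. The function names are hypothetical, the overlap extents are computed as non-negative values clamped at zero (an assumption about the sign convention), and the ratio is taken relative to the area of the smaller candidate box (also an assumption).

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)


def overlap_ratio(candidate, kept):
    """Overlap area divided by the area of the candidate box."""
    overlap_width = max(0, min(candidate[2], kept[2])
                        - max(candidate[0], kept[0]))
    overlap_height = max(0, min(candidate[3], kept[3])
                         - max(candidate[1], kept[1]))
    area = box_area(candidate)
    return overlap_width * overlap_height / area if area else 0.0


def nms_sorted_by_area(boxes, threshold=0.3):
    """Return the indices of the boxes kept, largest boxes first."""
    order = sorted(range(len(boxes)),
                   key=lambda index: box_area(boxes[index]), reverse=True)
    kept = []
    for index in order:
        if all(overlap_ratio(boxes[index], boxes[k]) <= threshold
               for k in kept):
            kept.append(index)
    return kept


if __name__ == "__main__":
    boxes = [(0, 0, 50, 100),        # partial detection at a frame edge
             (0, 0, 100, 100),       # detection of the entire object
             (60, 0, 100, 100),      # partial detection at a frame edge
             (200, 200, 260, 260)]   # a different object
    print(nms_sorted_by_area(boxes))
```

In this example the two partial boxes lie entirely inside the complete box, so they are discarded regardless of their confidence scores, while the separate object is kept.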
$$OBH = \max(B1(y_1), B2(y_1)) - \min(B1(y_2), B2(y_2)) \quad (13)$$

Figure 14. Demonstration of the bounding box coordinates.

Figure 15 shows three boxes for the same object. The red and orange boxes are at the edges of two consecutive frames, classifying only parts of the object, while the green box classifies the entire object. The ordering by area (Figure 15A) guarantees the elimination of the smaller boxes (partial target). In the present case, this procedure is more appropriate than the ranking by score (Figure 15B), which selects the box with the highest confidence score, since that is not always the box that maps the entire object.

Figure 16 shows an example of a 512 × 512-pixel frame before and after running the program that converts the polygon identifiers to the RGB system and creates the JSON format file with the annotation information. This procedure allows an easy transformation to the COCO annotation system used in the training phase.

Evaluation of COCO Metrics
Tables 2 and 3 list the COCO metrics for instance segmentation. ResNeXt101 presented the best results, followed by ResNet101-FPN. Increasing the depth of the ResNet backbone from 50 to 101 layers yields significant differences of almost 10% in average precision. The ResNeXt101 has similar results to ResNet101 when analyzing the Average Precision (AP) with the IoU threshold at 0.5. However, the difference is significant at IoU 0.75, with nearly 2% improvement compared to the second-best model (ResNet101-FPN). Detection accuracy for medium-sized CPIS is also higher than for small ones.
Another crucial analysis is the performance comparison between multi-channel imagery with seven channels and the traditional RGB images (Landsat-8 bands 2, 3, 4). Thus, we applied the best model (ResNeXt101-FPN) using the same train/dev/test images but considering only the RGB channels. Results show a strong tendency toward higher accuracy when using more channel information, demonstrating the value of multi-channel imagery, especially for remote sensing data, where the tradeoff between accuracy and processing speed in most cases tilts toward accuracy.

Image Mosaicking
The creation of the bounding boxes and the segmentation co-occur. Nevertheless, to give a better visual understanding, Figure 17a shows the results from the base classifier (starting at x = 0 and y = 0 with a 512-pixel step), which outputs a classification for all objects. Figure 17b shows the classification of the horizontal classifier (starting at x = 256 and y = 0 with a 512-pixel step), considering only center pivots that start before the center of the image (x < 256) and end after the center of the image (x > 256). Figure 17c shows the boxes deleted by the non-max suppression sorted by area algorithm, evidencing the correct elimination of the partial classifications in Figure 17a. Finally, Figure 17d shows the final classification for this small example, where only the correct boxes remain and there is only one classification per object, demonstrating the effectiveness of the algorithm. The same procedure applied to an entire scene, using the non-max suppression sorted by area, results in the classified image (Figure 18). Other information we can extract immediately is the number of objects and the average size of a center pivot in the referred region. This kind of information is vital for public managers and farmers to understand their plantations and surroundings.

Discussion
This research presents the results of state-of-the-art instance segmentation (Mask-RCNN) in satellite images with an innovative approach that uses large and multi-channel images. Instance segmentation brings more information than semantic segmentation, enabling a greater understanding of the scenes. The box boundaries and mask predictions better visualize different instances and enable useful insights, such as object coordinates, number of instances, average object size, total area occupied, and powerful to remote sensing tasks.
There are currently no works using Mask-RCNN algorithms in multi-channel imagery. Previous works on object detection using multi-channel imagery use the segmentation-first strategy (object-based Convolutional Neural Networks) [86,127,128]. The limitation of the segmentation-first strategy is that objects receive the same semantic information for all instances. In contrast, the Mask-RCNN makes a clear distinction between objects and gives per-object information, showing promising results even when objects overlap [72]. Therefore, instance segmentation of remote sensing data predominantly uses the Mask-RCNN/Faster-RCNN architecture [129][130][131][132]. However, instance segmentation in remote sensing has been limited to RGB channels or even a single channel of the Sentinel-1 image. Thus, to the best of our knowledge, the present research was the first to use Mask-RCNN with multi-channel remote sensing imagery, demonstrating that this information increases performance and target detection.
Considering instance segmentation in RGB images, studies with Mask-RCNN obtained relevant accuracy. Su et al. [130] 75 detection results), demonstrating that instance segmentation models for remote sensing imagery targets present high accuracy. Zhao et al. [133] applied a boundary regularization for building extraction using the Mask-RCNN algorithm and ResNet101-FPN as the backbone structure. The authors used the COCO annotation format and compared the proposed method with the traditional Mask-RCNN models using the F1 score metric. The Mask-RCNN outperformed their algorithm. Yekeen et al. [77] applied Mask-RCNN to oil spill detection in Synthetic-Aperture Radar (SAR) imagery using Keras and TensorFlow with a ResNet101-FPN backbone. The authors analyzed precision, recall, specificity, F1 score, IoU, and overall accuracy, showing promising results. Despite the good results, using Detectron2 (which offers more backbone structures) with the ResNeXt101 architecture would likely increase performance.
In this research, the instance segmentation of large images used a mosaic of overlapping frames from sliding windows with non-maximum suppression by area index. This approach is essential for remote sensing images, which predominantly have large dimensions. The large image reconstruction from the sliding window mosaic is widely used in the literature for semantic segmentation. The work in [85] proposes a sliding window technique for semantic segmentation to minimize border effects. To show these metrics, they monitored the Area Under the Receiver Operating Characteristic (ROC) curve to measure the increasing performance, demonstrating a powerful tool for semantic segmentation scene mosaicking. Similarly, Yi et al. [87] applied scene reconstruction in a semantic segmentation algorithm for building extraction, training with 256 × 256 pixel patches and mosaicking with a sliding window with a 64-pixel stride to minimize errors. Nevertheless, these solutions are not applicable to object detection, where each instance has a bounding box and different values. Martins et al. [86] applied a segmentation-first strategy algorithm in multi-channel National Agriculture Imagery Program (NAIP) imagery (four channels) to classify large scenes. They used different patch sizes in the convolutional neural network training process to better predict objects of different sizes. In our work, the Mask-RCNN algorithm uses different anchor boxes that do this job very efficiently, especially when using deep backbone structures, such as ResNeXt101-FPN. In addition, the instance-first strategy, where each object has a unique mask segmentation, gives better results when there are overlapping objects, which is very common in object detection.
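The sliding window tiling discussed above can be sketched as follows. This is a minimal Python illustration under stated assumptions and not the code used in any of the cited works. The helper names are hypothetical, windows are square, and an extra window flush with the image border is added when the stride does not divide the image evenly.

```python
def window_origins(length, window, stride):
    """Start offsets of windows of size `window` along an axis of `length` pixels."""
    if length <= window:
        return [0]  # a single window; padding the image is left to the caller
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        # Assumption: add one window aligned with the border so no pixel is skipped.
        starts.append(length - window)
    return starts


def sliding_windows(width, height, window, stride):
    """List of (x1, y1, x2, y2) frames covering a width x height image."""
    return [
        (x, y, x + window, y + window)
        for y in window_origins(height, window, stride)
        for x in window_origins(width, window, stride)
    ]


# 512-pixel frames with a 256-pixel stride over a 1024 x 512 image.
frames = sliding_windows(1024, 512, window=512, stride=256)
```

With these values the frames start at x = 0, 256, and 512, so every object that straddles a frame border is fully visible in a neighboring frame, which is what allows the later suppression step to discard its partial detections.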

Conclusions
Instance-level recognition, which requires the limits of individual objects, allows a more thorough understanding of the image content, with high potential for remote sensing. Instance segmentation is exceptionally suitable for applications that require counting different objects and estimating their areas individually. This research used the Detectron2 algorithm, the current state-of-the-art in instance segmentation, which is still little explored in satellite images. The present research innovates in the following aspects: (a) development of a method to convert vector polygons from the interpretation of remote sensing images to the COCO format with its JSON file; (b) adaptation of the Detectron2 algorithm for multi-channel processing; and (c) proposition of a method for processing large images considering sliding windows and mosaic reconstruction by non-maximum suppression. The novel approach to instance segmentation using the sliding window technique gives a more substantial analysis, since it makes it possible to gather information from large images.
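To make aspect (a) concrete, the sketch below builds a COCO-style annotation dictionary from polygons. It is a hedged illustration and not the conversion tool developed in this research. It assumes the polygon vertices are already expressed in pixel coordinates of the image tile (the geographic-to-pixel transformation is not shown). The function names, the file name, and the category name `center_pivot` are hypothetical.

```python
import json


def polygon_area(points):
    """Polygon area in square pixels using the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def polygons_to_coco(polygons, image_id, file_name, width, height,
                     category_name="center_pivot"):
    """Build a COCO-style dictionary for one image and one category."""
    annotations = []
    for ann_id, points in enumerate(polygons, start=1):
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        annotations.append({
            "id": ann_id,
            "image_id": image_id,
            "category_id": 1,
            # COCO stores each polygon as a flat [x1, y1, x2, y2, ...] list.
            "segmentation": [[c for point in points for c in point]],
            # COCO boxes are [x, y, width, height].
            "bbox": [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)],
            "area": polygon_area(points),
            "iscrowd": 0,
        })
    return {
        "images": [{"id": image_id, "file_name": file_name,
                    "width": width, "height": height}],
        "annotations": annotations,
        "categories": [{"id": 1, "name": category_name}],
    }


# Hypothetical rectangular polygon in pixel coordinates.
square = [(10, 10), (110, 10), (110, 60), (10, 60)]
coco = polygons_to_coco([square], image_id=1, file_name="tile_0001.tif",
                        width=512, height=512)
coco_json = json.dumps(coco)
```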
This study applied the developed methodology to CPIS detection, which is a vital aspect of the support system for agricultural management and water resources. The detection of CPIS is a challenging task due to the different and complex crop patterns. Previous surveys have applied manual methods, and, only recently, semantic segmentation methods have been used for automatic detection. However, semantic segmentation has limitations for the individual detection of areas, especially when areas are in contact or overlap. We compared seven backbone structures in the Mask-RCNN model (ResNet50-FPN, ResNet50-DC5, ResNet50-C4, ResNet101-FPN, ResNet101-DC5, ResNet101-C4, ResNeXt101-FPN). In the ResNet50 and ResNet101, the FPN feature extractor outperformed C4 and Dilated C5. In addition, the detection of medium objects is significantly better, with an APmedium nearly 20% higher than the APsmall. The ResNeXt101-FPN is considerably better than the other models, with an AP 3% higher than ResNet101-FPN (the second-best model).
Furthermore, a critical conclusion concerns the difference between training with RGB and multi-channel imagery. We compared the best model (ResNeXt101-FPN) trained with the same samples but considering only the RGB bands (2, 3, and 4). The results show that using multi-channel imagery improves the accuracy metrics by nearly 3%, which encourages other researchers to use multi-channel imagery to improve accuracy.
The proposed methodology improves the analysis of remote sensing images and applies to studies previously carried out with semantic segmentation. Future work may include creating new backbone structures and small adaptations to allow instance segmentation for multiclass problems. Besides, the present method is applicable to other scientific fields that use larger images or a greater number of channels, such as biomedical imaging. In addition, an extensive database of CPIS data can be developed for model training to provide better results in transfer learning.