LabelRS: An Automated Toolbox to Make Deep Learning Samples from Remote Sensing Images

Deep learning technology has achieved great success in the field of remote sensing image processing. However, tools for making deep learning samples from remote sensing images are lacking, so researchers have to rely on a small number of existing public data sets, which may limit the learning effect. Therefore, we developed an ArcGIS add-in (LabelRS) to help researchers make their own deep learning samples in a simple way. In this work, we propose a feature merging strategy that enables LabelRS to adapt automatically to both sparsely and densely distributed scenarios. LabelRS handles the size diversity of targets in remote sensing images through sliding windows. We designed and built in multiple band stretching, image resampling, and gray level transformation algorithms for LabelRS to deal with multispectral remote sensing images of high bit depth. In addition, the attached geographic information enables seamless conversion between natural samples and geographic samples. To evaluate the reliability of LabelRS, we used its three sub-tools to make semantic segmentation, object detection, and image classification samples, respectively. The experimental results show that LabelRS can produce deep learning samples from remote sensing images automatically and efficiently.


Introduction
With the development of artificial intelligence, deep learning has achieved great success in image classification [1,2], object detection [3,4], and semantic segmentation [5,6] tasks in the field of computer vision. At the same time, more and more researchers use techniques such as convolutional neural networks (CNNs) to process and analyze remote sensing images. Compared with traditional image processing methods, deep learning has achieved state-of-the-art results. For example, Rezaee et al. [7] used AlexNet [8] for complex wetland classification, and the results show that the CNN outperforms random forest. Chen et al. [9] built an end-to-end aircraft detection framework using VGG16 and transfer learning. Wei et al. [10] regarded road extraction from remote sensing images as a semantic segmentation task and used boosting segmentation based on D-LinkNet [11] to enhance the robustness of the model. In addition, deep learning is also used for remote sensing image fusion [12,13] and image registration [14,15].
Samples are the foundation of deep learning. The quality and quantity of samples directly affect the accuracy and generalization ability of the model. Because deep learning depends on massive samples, making samples is always an important task that consumes a great deal of manpower and time and relies on expert knowledge. At present, more and more researchers and institutions are paying attention to how to design and implement high-efficiency annotation methods and tools for images [16,17], video [18,19], text [20,21], and speech [22,23]. In the field of computer vision, representative tools and platforms include Labelme [24], LabelImg [25], Computer Vision Annotation Tool (CVAT) [26], RectLabel [27] and Labelbox [28]. A brief comparison of these annotation tools is shown in Table 1. These tools are fully functional and support the use of boxes, lines, dots, polygons, and bitmap brushes to label images and videos. Advanced commercial annotation tools also integrate project management, task collaboration, and deep learning functions. However, none of them support labeling multispectral remote sensing images. The main reason is that the processing of remote sensing images is very complicated. Because the data volume of a remote sensing image is generally huge compared to an ordinary natural image, ordinary annotation tools cannot even complete basic image loading and display, let alone complex tasks such as pyramid building, spatial reference conversion, and vector annotation [29]. Without an effective and universal annotation tool for remote sensing images, researchers can only rely on existing public data sets, such as the UC Merced Land Use Dataset [30], WHU-RS19 [31], RSSCN7 [32], AID [33], the Vaihingen dataset [34] and the DeepGlobe Land Cover Classification Challenge dataset [35]. But these data sets have limited categories. For example, WHU-RS19 and RSSCN7 contain 19 categories and 7 categories, respectively.
In addition, they have specific image sources and spatial-temporal resolutions, and the quality of their annotations is uneven. These factors make it difficult for the existing remote sensing data sets to meet the actual needs of complex scenarios. Therefore, it is necessary to develop a universal remote sensing image annotation tool.
ArcGIS is one of the most widely used software packages in geography, earth sciences, environment, and other related disciplines. It has diverse functions such as large image display, data processing, spatial analysis, and thematic mapping. Although recent versions (ArcGIS 10.6 and later) have added a function for making deep learning samples, there are still obvious limitations: (1) The tool cannot be used in lower versions of ArcGIS. We used "ArcGIS + version number" and "ArcMap + version number" as keywords and retrieved a total of 765 related papers from the past three years from the Web of Science (WoS). We counted the ArcGIS versions used in these papers, as shown in Figure 1. More than 90% of the ArcGIS installations currently in use lack the function of making deep learning samples, and ArcGIS 10.2 and 10.3 are still the mainstream versions. (2) Output format restriction. The tool does not consider the high color depth and multiple bands of remote sensing images, which means the format of the output samples must be consistent with the input. (3) The target size and distribution patterns are ignored. (4) Poor flexibility. The sample creation tool in ArcGIS requires that the input vector layers follow the training sample format generated by the ArcGIS image classification tool.
According to the above analysis, and considering both development and usage costs, we developed LabelRS, an annotation tool for remote sensing images based on ArcGIS 10.2. LabelRS enables researchers to easily and efficiently make remote sensing samples for computer vision tasks such as semantic segmentation, object detection, and image classification. Specifically, our tool supports multispectral input images with arbitrary bands and adapts to both sparsely and densely distributed scenarios through a feature merging strategy. LabelRS solves the scaling problem of objects of different sizes through a sliding window, and a variety of band stretching, image resampling, and gray level transformation algorithms enable the output samples to meet the actual needs of users and reduce the postprocessing workload. In addition, we designed XML files to store metadata information to ensure the traceability of the samples. Each sample contains a spatial coordinate file to seamlessly realize the conversion between ordinary images and geographic images. More importantly, the sample production process we designed can also potentially be applied to other multispectral image classification problems, such as mineral recognition from X-ray maps [36,37] and breast cancer diagnosis from medical images [38]. All of these images have high spectral and spatial resolution; adapting the image reading and writing routines to the image type would support the migration and reuse of LabelRS. The main contributions of our work are summarized as follows.
1. An efficient framework is proposed to make deep learning samples from multispectral remote sensing images, which contains sub-tools for semantic segmentation, object detection, and image classification. To our knowledge, it is the first complete framework for image annotation with remote sensing images.
2. Three cases are implemented to evaluate the reliability of LabelRS, and the experimental results show that LabelRS can automatically and efficiently produce deep learning samples for remote sensing images.
The remainder of this paper is structured as follows. Section 2 explains the design principle and implementation process of LabelRS. Section 3 introduces three cases and the corresponding experimental results. Finally, in Section 4, conclusions are drawn, and recommendations for use are given.

Functionality and Implementation
The LabelRS toolbox we designed contains three sub-tools, namely the Semantic Segmentation Tool, the Object Detection Tool, and the Image Classification Tool. Section 2.1 first introduces the design principles and functions of these three tools, and Section 2.2 then introduces their specific implementation, including the interface design and input parameters.

Semantic Segmentation
The left part of Figure 2 shows the processing flow of the semantic segmentation module. In addition to remote sensing images, the tool also requires a vector file of the regions of interest and a field indicating the categories of the different features. Such vector files can be drawn by users themselves in ArcGIS or other GIS tools, can be derived from NDVI [39] or NDWI [40], or can come from land survey data or other open-source geographic data, such as OpenStreetMap [41].
(1) Feature Merging Strategy
We divide the distribution patterns of objects into sparse distribution and dense distribution, as shown in Figure 3. Figure 3a shows dense buildings, and Figure 3b shows sparse ponds. For densely distributed objects, if each object is processed separately, a lot of computing resources are consumed, and the generated samples contain much redundancy and overlap. To solve this problem, we propose a feature merging strategy, as shown in Figure 4. First, create a buffer around each object. The buffer of the building in Figure 3a is shown in Figure 4a. The radius of the buffer depends on the input sample size and the spatial resolution of the image. The radius d is defined as:

d = (t × r) / 2

where r is the spatial resolution of the image in the vertical or horizontal direction; it represents the spatial range covered by each pixel and can be obtained by reading the metadata of the image. t represents the sample size input by the user, in pixels. Then, the overlapping buffers are merged, and the obtained range is shown in Figure 4b. Finally, the densely distributed targets become independent objects.
(2) Split Large Targets
The independent objects are divided into three cases according to their size, as shown in Figure 5. We use the sliding window algorithm to split the targets. First, determine the relationship between the height h and width w of the target envelope and the sample size t.
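The buffer-and-merge step above can be sketched with plain axis-aligned envelopes. This is only an illustration under simplified assumptions (the real tool operates on polygon features via ArcPy buffer and dissolve operations; the function names here are hypothetical):

```python
def buffer_radius(r, t):
    """Buffer radius d = (t * r) / 2 from pixel resolution r
    (ground units per pixel) and sample size t (pixels)."""
    return r * t / 2.0

def expand(box, d):
    """Grow an envelope (xmin, ymin, xmax, ymax) by d on every side."""
    xmin, ymin, xmax, ymax = box
    return (xmin - d, ymin - d, xmax + d, ymax + d)

def overlaps(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))

def merge_buffers(boxes, d):
    """Expand each envelope by d, then merge overlapping envelopes so
    densely packed objects become independent sample regions."""
    merged = [expand(b, d) for b in boxes]
    changed = True
    while changed:                      # repeat until no pair overlaps
        changed = False
        out = []
        while merged:
            cur = merged.pop()
            i = 0
            while i < len(merged):
                if overlaps(cur, merged[i]):
                    cur = union(cur, merged.pop(i))
                    changed = True
                else:
                    i += 1
            out.append(cur)
        merged = out
    return merged
```

For example, with 4 m pixels and 256-pixel samples, two buildings 600 m apart fall inside each other's 512 m buffers and collapse into one sample region, while a distant pond stays separate.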
If h < 2t and w < 2t, take the center of the envelope as the sample center and r × t as the side length to construct a square as the range of the output sample. If h < 2t and w ≥ 2t, take the center of the envelope as the sample center O, and then slide left and right, respectively. Denote the center point after sliding as O'; it can be calculated as:

O' = (x ± (t − m) × r, y)

where x and y represent the longitude and latitude of the sample center, respectively, and m is the overlap size defined by the user. If h ≥ 2t and w < 2t, slide up and down instead; the principle is similar. In these two cases, we choose to start from the center of the envelope instead of the upper left corner. This is necessary because we found that when sliding from the upper left corner, an originally complete object can become very fragmented; starting from the center preserves its original distribution pattern to the greatest extent. Finally, if h ≥ 2t and w ≥ 2t, start from the upper left corner of the envelope and slide to the right and down, respectively, to generate the sample areas. The detailed sliding window algorithm for semantic segmentation is given in Algorithm 1. A potential problem with the above process is that when the boundaries of the vector features are very complicated, creating and fusing buffers is very time-consuming. Therefore, another innovation of this paper is to use the Douglas-Peucker algorithm [42] to simplify the polygons in advance. Our experiments show that for irregular objects with complex boundaries, the sample generation efficiency can be increased by more than 100 times after adding this polygon simplification step.
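As a rough illustration of the center-outward sliding cases, the sketch below computes sample-grid centers for one target envelope. It covers only the single-sample and horizontal-sliding cases; the names, the coverage rule, and the comparison of the envelope against the sample's ground extent (t × r) are simplifying assumptions, not the tool's exact Algorithm 1:

```python
import math

def sample_centers(env, t, m, r):
    """Sample-grid centers for a target envelope env = (xmin, ymin,
    xmax, ymax). t: sample size (pixels), m: overlap (pixels),
    r: ground resolution. Each slide moves the center by (t - m) * r."""
    xmin, ymin, xmax, ymax = env
    w, h = xmax - xmin, ymax - ymin
    cx, cy = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
    side = t * r          # ground extent of one sample
    step = (t - m) * r    # O' = O shifted by (t - m) * r per slide
    if w < 2 * side and h < 2 * side:
        return [(cx, cy)]                       # one centered sample
    if h < 2 * side <= w:
        # start from the center, slide left and right symmetrically
        # until both ends of the envelope are covered
        n = int(math.ceil((w / 2.0 - side / 2.0) / step))
        return [(cx + k * step, cy) for k in range(-n, n + 1)]
    # vertical case is symmetric; the both-large case slides from the
    # upper-left corner instead (omitted in this sketch)
    return []
```

With t = 256, m = 32, and r = 4, a 4096 m wide, 400 m tall envelope yields five horizontally sliding centers around the envelope center, while a small object yields a single centered sample.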

(3) Band Combination and Stretching
The above steps create regular grids for making samples (we call them sample grids), and these sample grids are then used to crop the images and labels. Multispectral remote sensing images generally have more than three bands. Meanwhile, the pixel depth of remote sensing images differs from that of ordinary natural images and can reach 16 bits or 32 bits, whereas in the field of computer vision, natural images such as JPEG and PNG are more common. Therefore, when users need to generate samples in JPEG or PNG format, we allow them to choose the three bands mapped to RGB and to set the image stretching method. The image stretching algorithms we defined are: (1) Percentage Truncation Stretching (PTS). The algorithm requires a minimum percentage threshold minP and a maximum percentage threshold maxP, because the two ends of the gray histogram of a remote sensing image are usually noise. Assuming that the values corresponding to minP and maxP in the histogram are c and d, respectively, and values outside [c, d] are first truncated to c and d, the stretched pixel value x' is calculated as:

x' = 255 × (x − c) / (d − c)

where x is the pixel value before stretching. (2) Standard Deviation Stretching (SDS). It is similar to PTS; the difference lies in the calculation of c and d:

c = m − k × s, d = m + k × s

where m and s represent the mean and standard deviation of the band, respectively, and k is a user-defined parameter with a default value of 2.5.
(4) Gray Level Transformation
For the gray values written to the output labels, LabelRS provides four transformation methods. The first two operate on numeric attribute values. (3) Ordinal assignment. It is suitable for non-numeric attribute values. For example, the input shapefile may use the string 'water' or the character 'w' to represent water bodies. In this case, the first two methods are invalid, and the different types of features are assigned positive integers starting from 1. (4) Custom. Users customize the gray values of the different types of features in the label. The gray-level transformation method used is recorded and saved in the output XML file.
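A minimal stretching sketch, assuming a flat list of band values and the clipping behavior described above (the tool itself operates on whole raster bands; these helper names are illustrative):

```python
def percent_truncation_stretch(band, min_p=2.0, max_p=98.0):
    """Percentage Truncation Stretching (PTS): find the pixel values c
    and d at the min_p / max_p percentiles, truncate values outside
    [c, d], then rescale linearly to the 0-255 range."""
    vals = sorted(band)
    n = len(vals)
    c = vals[min(n - 1, int(n * min_p / 100.0))]
    d = vals[min(n - 1, int(n * max_p / 100.0))]
    if d <= c:
        return [0 for _ in band]      # degenerate histogram
    out = []
    for x in band:
        x = min(max(x, c), d)         # clip noisy histogram tails
        out.append(int(round((x - c) * 255.0 / (d - c))))
    return out

def stddev_stretch_bounds(band, k=2.5):
    """Standard Deviation Stretching (SDS) bounds:
    c = m - k*s, d = m + k*s, with band mean m and std s."""
    n = float(len(band))
    m = sum(band) / n
    s = (sum((x - m) ** 2 for x in band) / n) ** 0.5
    return m - k * s, m + k * s
```

Once the SDS bounds c and d are computed, the same linear rescaling as PTS maps the clipped values to 0-255.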

(5) Quality Control
The last step is quality control. Since buffers are created before the sliding and cropping process, some of the generated samples may not include any target, or only a few pixels of one, which causes class imbalance and affects the training of the deep learning model. Therefore, we set a filter parameter f: if the ratio of foreground pixels to background pixels in the label is less than f, the sample is considered unqualified and discarded. Another problem is that the 'no data' areas of the remote sensing images may be cropped into samples; these are also automatically identified and eliminated.
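The filter step might look like the following sketch (the label encoding, with 0 as background, and the helper name are assumptions for illustration):

```python
def keep_sample(label, f, nodata=None):
    """Quality control: reject a label patch whose foreground/background
    pixel ratio is below f, or that contains 'no data' pixels.
    label: 2-D list of gray values, 0 assumed to be background."""
    flat = [v for row in label for v in row]
    if nodata is not None and nodata in flat:
        return False                    # touches a 'no data' area
    fg = sum(1 for v in flat if v != 0)
    bg = len(flat) - fg
    if bg == 0:
        return True                     # pure-foreground patch is fine
    return fg / float(bg) >= f
```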
In addition, different from the semantic segmentation samples in computer vision, spatial reference information is also an important component of remote sensing image samples. Therefore, we create jgw and pgw files for JPEG and PNG images, respectively, which are used to store the geographic coordinates of the upper left corner of the sample and the spatial resolution in the east-west and north-south directions. Finally, we use an XML file to record the metadata information of the sample to facilitate traceability and inspection.
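A world file (.jgw/.pgw) is just six plain-text lines of affine parameters. A minimal writer, assuming north-up imagery with no rotation terms, could look like:

```python
def write_world_file(path, ulx, uly, xres, yres):
    """Write an ESRI world file so a JPEG/PNG sample keeps its
    georeferencing. ulx/uly: coordinates of the upper-left pixel;
    xres/yres: ground resolution. The y pixel size is written as a
    negative value because image rows run from north to south.
    Line order: x-size, rotation, rotation, -y-size, ulx, uly."""
    lines = [xres, 0.0, 0.0, -abs(yres), ulx, uly]
    with open(path, "w") as fh:
        fh.write("\n".join("%.10f" % v for v in lines) + "\n")
```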

Object Detection
We first explain the related definitions. The entire image is regarded as a sample, and the coordinate range of the image is the sample range, as shown in the yellow box in Figure 6. The objects marked by the user in the sample are called targets or objects, as shown in the red box in Figure 6. The object detection tool records the sample range and the target range of each labeled object separately. The processing flow of the object detection tool is shown in the middle of Figure 2. For object detection samples, we are more concerned with the relationship between the sample size and the target size, and with the position of the target in the sample. truncated is an important attribute of a sample in the object detection task; it represents the completeness of the target in the sample. If the target is completely within the image range, truncated = 0, indicating that no truncation has occurred. Suppose we want to generate samples with a size of 512 × 512, but the length and width of the target are greater than 512; then the target needs to be truncated, and each sample contains only a part of the target. Therefore, we first need to use the sliding window algorithm to segment large targets. Different from semantic segmentation, no buffer is created in the object detection tool, so the length and width of the object are compared with the sample size, rather than with twice the sample size. Assuming that the red rectangle in Figure 7a is the target O marked by the user, and the grids obtained after segmentation with the sliding window are the yellow rectangles marked as G, the truncation is calculated as:

T_i = 1 − S(O ∩ G_i) / S(O)

where T_i represents the truncation of the i-th grid, S() is the function that calculates the area, and G_i is the i-th grid. The actual labels of the targets after splitting are shown as the green rectangles in Figure 7b.
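For axis-aligned rectangles, the truncation formula can be computed directly. A sketch, assuming boxes are laid out as (xmin, ymin, xmax, ymax):

```python
def truncation(obj, grid):
    """T_i = 1 - S(O ∩ G_i) / S(O): the fraction of target O that
    falls outside grid G_i. Boxes are (xmin, ymin, xmax, ymax)."""
    def area(b):
        return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    # intersection rectangle (empty if the boxes do not overlap)
    inter = (max(obj[0], grid[0]), max(obj[1], grid[1]),
             min(obj[2], grid[2]), min(obj[3], grid[3]))
    if area(obj) == 0:
        return 0.0
    return 1.0 - area(inter) / area(obj)
```

A target fully inside a grid gives T = 0 (no truncation); a target half covered by the grid gives T = 0.5.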
Different from the pixel-level labeling of semantic segmentation, the object detection task needs to minimize the truncation of targets and ensure that each target is located as close to the center of the sample as possible. Therefore, after segmenting the large targets, a sample is made with each object as its center, and the sample may also contain other surrounding targets. To improve retrieval efficiency, we use t × r/2 as the radius for retrieval and only consider other targets within this radius. The calculated sample range is then used to crop the input image while recording the pixel position of the target in the sample. We provide three popular object detection metadata formats for users to choose from, namely PASCAL VOC [43], YOLO [4], and KITTI [44]. PASCAL VOC uses an XML file to store xmin, xmax, ymin, ymax, the category, and the truncation value of each object. YOLO uses a txt file to store the category of each target, the normalized coordinates of the target center, and the normalized width and height of the target. KITTI is mainly used for autonomous driving and uses a txt file to store the category, truncation, and bounding box of each target. In addition, because the annotations are recorded as text, users cannot directly judge their quality. We therefore designed a visualization function that superimposes the bounding box of each target onto the sample while creating samples and annotations, so users can visually browse the annotation results.
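As an example of the YOLO record layout described above, one annotation line can be produced from a pixel-space bounding box like this (a sketch; the pixel box convention (xmin, ymin, xmax, ymax) is an assumption):

```python
def yolo_line(cls, box, img_w, img_h):
    """One YOLO txt record: class index, then the normalized center
    coordinates and normalized width/height of the bounding box.
    box is (xmin, ymin, xmax, ymax) in pixels."""
    xmin, ymin, xmax, ymax = box
    cx = (xmin + xmax) / 2.0 / img_w     # normalized center x
    cy = (ymin + ymax) / 2.0 / img_h     # normalized center y
    w = (xmax - xmin) / float(img_w)     # normalized width
    h = (ymax - ymin) / float(img_h)     # normalized height
    return "%d %.6f %.6f %.6f %.6f" % (cls, cx, cy, w, h)
```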

Image Classification
In the image classification task, the entire image has only one value as its label, and there is no information about specific object positions. Therefore, the biggest difference from the above two tools is that no extraneous pixel information can be introduced during the splitting process. The processing flow of the image classification tool is shown on the right of Figure 2. First, segment the large targets, as shown in Figure 8. Slide right and down from the upper left corner of the target; when the remaining distance to the right or bottom edge is less than one sample size, align the window back from the right or bottom edge instead. This guarantees the integrity of the target without introducing new image information. Objects smaller than the sample are resampled to the sample size. We provide three interpolation methods, Nearest, Bilinear, and Cubic: Nearest is simple and fast, but the result is rougher; Cubic generates smoother images, but the computational cost is higher. Finally, the samples support arbitrary band combinations and stretching methods, and different types of samples are stored in different folders.
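The nearest-neighbor case can be sketched in a few lines of pure Python, for illustration only (the tool's actual resampling operates on raster bands, and the Bilinear and Cubic variants are not shown):

```python
def resample_nearest(img, out_size):
    """Nearest-neighbor resampling of a 2-D patch to
    out_size x out_size, used when an object is smaller than the
    sample size. Each output pixel copies the closest source pixel."""
    h, w = len(img), len(img[0])
    return [[img[int(i * h / out_size)][int(j * w / out_size)]
             for j in range(out_size)]
            for i in range(out_size)]
```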

Implementation
Based on ArcGIS 10.2, we developed LabelRS, an upward-compatible annotation tool for remote sensing images, using Python 2.7. The Python libraries imported by LabelRS mainly include ArcPy, OpenCV, and Pillow. LabelRS has two versions. One is the ArcGIS toolbox. Its advantage is a visual graphical interface, which is convenient for parameter input and can be quickly integrated into ArcGIS. The other is Python scripts, which facilitate code debugging and integration. This version has higher flexibility and is more suitable for batch data processing environments. The following describes the implementation of the three sub-modules of LabelRS.

Semantic Segmentation Tool
The dialog box of the semantic segmentation tool is shown in Figure 9a. There are 14 input parameters in total, four of which are required. The meaning of each parameter is shown in Table 2.

Object Detection Tool
The dialog box of the object detection tool is shown in Figure 9b. There are 11 input parameters in total, four of which are required. The meaning of each parameter is shown in Table 3.

Image Classification Tool
The dialog box of the image classification tool is shown in Figure 9c. There are 11 input parameters in total, four of which are required. The meaning of each parameter is shown in Table 4.

Making Water Samples
Water is one of the most common and precious natural objects on the Earth's surface [45]. Many researchers try to use deep learning methods to extract water bodies [46][47][48]. The boundary of a water body is very complicated, and manual labeling is time-consuming and labor-intensive. Therefore, we combined NDWI and LabelRS to propose an automated production process for water body samples. First, use the green and near-infrared bands to calculate the NDWI, and then use the OTSU algorithm [49] to determine the segmentation threshold between water and non-water. Interference from non-water objects such as farmland and mountain shadows is then filtered out with an area threshold. The resulting water body vector can be input into the semantic segmentation tool to make water body samples.
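The NDWI-plus-OTSU step can be sketched as follows, assuming flat lists of band values (McFeeters' green/NIR form of NDWI, and a stdlib-only OTSU over a fixed 256-bin histogram; the real processing runs on whole rasters):

```python
def ndwi(green, nir):
    """Per-pixel NDWI = (Green - NIR) / (Green + NIR)."""
    return [(g - n) / float(g + n) if g + n else 0.0
            for g, n in zip(green, nir)]

def otsu_threshold(values, bins=256):
    """Otsu's method: choose the threshold that maximizes the
    between-class variance of a two-class split of the histogram."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    hist = [0] * bins
    for v in values:
        hist[min(bins - 1, int((v - lo) / (hi - lo) * bins))] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best, thresh, w_b, sum_b = 0.0, lo, 0, 0.0
    for i in range(bins):
        w_b += hist[i]                  # background pixel count
        if w_b in (0, total):
            continue
        sum_b += i * hist[i]
        w_f = total - w_b               # foreground pixel count
        m_b, m_f = sum_b / w_b, (sum_all - sum_b) / w_f
        var = w_b * w_f * (m_b - m_f) ** 2
        if var > best:
            best = var
            thresh = lo + (i + 1) * (hi - lo) / bins
    return thresh
```

Pixels whose NDWI exceeds the returned threshold are labeled as water; small false positives are then removed with the area threshold mentioned above.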
We used GaoFen-2 (GF-2) satellite images, which have a spatial resolution of 4 m and contain four bands: red, green, blue, and near-infrared. The Beijing-Tianjin-Hebei region and Zhenjiang, Jiangsu Province were selected as the experimental areas; the climate types and land covers of these two regions are completely different. Due to the unique sensitivity of water to the near-infrared band, we chose the near-infrared, red, and green bands of GF-2 as the output bands of the samples. The sample size was set to 256, the gray level transformation method was Maximum Contrast, and the band stretching method was PTS. The experiment was carried out on a desktop computer with an Intel Core i7-6700 3.40 GHz CPU and 32 GB RAM. Some samples made with the semantic segmentation tool are shown in Figure 10. It can be seen that LabelRS combined with NDWI segments the water body areas well; the water body boundaries in the generated labels are very detailed and smooth. Table 5 shows the processing times for different tasks, from which we can see that LabelRS is very efficient: the average time to produce a single multispectral remote sensing sample is 1-2 s.

Making Dam Samples
Dams are important water conservancy infrastructure with functions such as flood control, water supply, irrigation, hydroelectric power generation, and tourism. We chose the same experimental areas and data source as in the previous section. Due to the similarities between dams and bridges in geometric, spectral, and texture features, we treat bridges as negative samples. First, we manually marked the locations of dams and bridges in ArcGIS and saved them in a vector file. Then the object detection tool was used to make samples. The sample size was set to 512, and the metadata format was PASCAL VOC. To perform data augmentation, both true-color and false-color composite samples were generated, as shown in Figure 11. To visualize the labeling effect, the bounding boxes of dams and bridges are drawn in different colors in the figure. Figure 12 is an example of a PASCAL VOC annotation.

Making Land Cover Classification Samples
Classifying images at the image level rather than the pixel level means that we do not need to know the detailed distribution of objects within the image. This approach is widely used in land use classification. We used Huairou District and Changping District in Beijing as experimental areas and selected GF-2 and GF-6 images as the main data sources. Figure 13 shows the main land use types in parts of the experimental area. It can be seen that the main land cover classes include forest, water, buildings, and farmland. We first manually drew the different features in ArcGIS and then used the image classification tool of LabelRS to produce classification samples. The sample size was set to 128. The samples obtained are shown in Figure 14.

Conclusions
In response to the current lack of labeling tools for remote sensing images, we developed LabelRS based on ArcGIS and Python. Unlike ordinary images, remote sensing images have more bands, higher pixel depth, and larger extent, and the targets in them have different sizes and diverse distribution patterns. LabelRS overcomes these difficulties to a certain extent: it handles densely distributed targets by merging features and divides large targets through sliding windows, and a variety of band stretching, resampling, and gray level transformation algorithms address spectral band combination and pixel depth conversion. In addition, on the basis of conventional samples, spatial information is added to realize seamless conversion between natural samples and geographic samples. Our tool can assist researchers in making their own deep learning samples, which reduces the burden of data preprocessing and the reliance on existing public data sets, and ultimately helps researchers use deep learning techniques to solve specific target detection and recognition tasks. LabelRS also has certain limitations: (1) the object detection sub-tool does not yet support rotated bounding boxes; (2) LabelRS currently relies on ArcPy scripts, and in the future we will use the GDAL library to make it fully open source. These limitations will form the basis for future development.
Finally, we offer some suggestions on parameter settings. The first concerns choosing the appropriate tool. Large and irregular objects are not suitable for object detection, because a rectangular bounding box may not cover the target effectively and will introduce a lot of interference; in this case, the semantic segmentation tool is more appropriate. The road map for selecting a tool is shown in Figure 15. The second concerns the sample size. The sample size for object detection can be set larger to prevent targets from being truncated. The sample size for image classification should be set as small as possible, because the larger the rectangle, the more pixels of non-label types may be introduced. The last concerns the output format. At present, most users are accustomed to processing ordinary JPEG or PNG images, and LabelRS therefore provides several useful stretching functions to meet this demand. But in some cases, we recommend that users choose TIFF as the output format to avoid stretching the remote sensing images. For example, for a water body with uniform color, the stretched image may look strange, which is caused by incorrect histogram truncation. In future research, we will continue to improve our code to make LabelRS easier to understand, use, and integrate.