A Novel Framework Based on Mask R-CNN and Histogram Thresholding for Scalable Segmentation of New and Old Rural Buildings

Abstract: Mapping new and old buildings is of great significance for understanding socio-economic development in rural areas. In recent years, deep neural networks have achieved remarkable building segmentation results in high-resolution remote sensing images. However, scarce training data and varying geographical environments pose challenges for scalable building segmentation. This study proposes a novel framework based on Mask R-CNN, named Histogram Thresholding Mask Region-Based Convolutional Neural Network (HTMask R-CNN), to extract new and old rural buildings even when labels are scarce. The framework adopts the result of single-object instance segmentation from the orthodox Mask R-CNN. It then classifies the rural buildings into new and old ones based on a dynamic grayscale threshold inferred from the result of a two-object instance segmentation task for which training data are scarce. We found that the framework can extract more buildings and achieve a much higher mean Average Precision (mAP) than the orthodox Mask R-CNN model. We tested the framework's performance with increasing amounts of training data and found that it converged even when the training samples were limited. The framework's main contribution is to allow scalable segmentation using significantly fewer training samples than traditional machine learning practices, making it viable to map China's new and old rural buildings.


Introduction
Monitoring the composition of new and old buildings in rural areas is of great significance to rural development [1]. In particular, China's rapid urbanization has tremendously transformed its rural settlements over the last decades [2]. However, unplanned and poorly documented dwellings pose significant challenges for understanding rural settlements [3,4]. Traditionally, field surveys have been the major solution, but they require intensive labour input and can be time-consuming, especially in remote areas. Recent breakthroughs in remote sensing technologies provide a growing availability of high-resolution imagery, such as low-altitude aerial photos and Unmanned Aerial Vehicle (UAV) images. These allow manual mapping of rural settlements at lower cost and with broader coverage, but manual work remains time-consuming. Therefore, to map the settlements of nearly 564 million rural residents in China [5], a scalable, intelligent, image-based solution is urgently needed.
Remote sensing-based mapping of buildings has been a popular research topic for decades [6-10]. Since the launch of IKONOS, QuickBird, WorldView, and more recent satellite platforms, high-resolution imagery has become increasingly available for building extraction research.

In general, single-class classifiers are more accurate than multi-class classifiers; for example, classifying dogs is easier than classifying dog breeds. Therefore, the proposed framework uses histogram thresholding as an add-on to a state-of-the-art deep learning algorithm to achieve strong segmentation results; the Methods section addresses the framework in detail. The framework's contribution to the building extraction research area is to achieve a promising classification capability while significantly reducing annotation effort. This study uses rural areas in Xinxing County, Guangdong Province, as the case study to test the proposed framework's performance.

Study Area
To test the proposed framework's performance, we collected data samples from high-resolution satellite images covering rural Xinxing County, Guangdong Province, China (see Figure 1). Xinxing is a traditional mountainous agricultural county with a large agricultural population and a relatively intact rural landscape of farmland and forest. Moreover, Xinxing is a rural revitalization pilot area and has made many achievements in rural development and governance [37]. The extraction of new and old buildings is therefore of great significance for understanding rural development in Xinxing.
Table 1 shows the new and old buildings in high-resolution satellite images. Most of the new buildings are brick-concrete structures, with roofs made of cement or colored tiles. The old houses in Xinxing are mainly Cantonese-style courtyards whose roof materials are dark tiles; moreover, the outlines of their footprints are less clear than those of new houses. New buildings are mostly distributed along the streets, while the old buildings retain a compact comb pattern.
For model training purposes, we collected 68 images with a resolution of 0.26 m. Each image has a size ranging from 900 × 900 to 1024 × 1024 pixels in the RGB color space. We used the open-source image annotation tool VIA [38] to delineate the building footprints (see Figure 2). We edited and checked all the building samples of the original vector file using the ArcGIS™ software to produce a high-quality dataset. All building samples from those 68 images were compiled as a dataset called one-class samples. We annotated only 34 out of 60 images with new and old labels (called two-class samples hereafter). Finally, the annotated images were randomly divided into training, validation, and test sets (see Table 2).
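The random division into training, validation, and test sets can be sketched as follows. This is an illustrative assumption: the paper reports the split only in Table 2, so the 70/15/15 ratio and the function name `split_dataset` are hypothetical.

```python
import random

def split_dataset(image_ids, train_frac=0.7, val_frac=0.15, seed=42):
    """Randomly split annotated image IDs into train/validation/test sets.

    The 70/15/15 ratio is an illustrative assumption; the paper only
    states that images were divided randomly (see Table 2).
    """
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }

# e.g. the 68 one-class images described above
splits = split_dataset(range(68))
```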


HTMask R-CNN
Mask R-CNN [21] has been proven to be a powerful and adaptable model in many different domains [23,39]. It operates in two phases: generation of region proposals and classification of each generated proposal. In this study, we use Mask R-CNN as our baseline model for benchmarking. As discussed above, we propose a novel segmentation framework that combines histogram thresholding with deep learning's image segmentation capability to extract new and old rural buildings. We call the proposed framework HTMask R-CNN, short for Histogram Thresholding Mask R-CNN. Its workflow is as follows (see Figure 3 for illustration):
a. We built two segmentation models (a one-class model and a two-class model) based on the one-class and two-class samples' training sets (Figure 3a). The one-class model extracts rural buildings, while the two-class model classifies new and old rural buildings. Both models used weights pre-trained on the COCO dataset [40] as the base weights.
b. A satellite image (Figure 3b) is classified by the one-class and two-class models separately, leading to a map of building footprints (R1 in Figure 3c) and a map of new and old buildings (R2 in Figure 3d).
c. Grayscale histograms are built using the pixels from the new and old building footprints in R2. The average grayscale levels of new and old buildings are computed as N and O, respectively (Figure 3e). A valley point is determined by θ = (N + O)/2.
d. The valley point θ is used as the threshold to determine the type of each building in R1. Finally, we obtain a map of the old and new buildings, R3 (Figure 3f).
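Steps (c) and (d) can be expressed in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name and the representation of instances as boolean masks are assumptions.

```python
import numpy as np

def threshold_classify(gray, r1_masks, r2_new_masks, r2_old_masks):
    """Minimal sketch of HTMask R-CNN steps (c) and (d).

    gray: 2-D grayscale image array; each *_masks entry is a boolean
    mask covering one predicted building instance.
    """
    # (c) Average grayscale levels N and O over the R2 footprints,
    # and the valley point theta = (N + O) / 2.
    N = float(np.mean([gray[m].mean() for m in r2_new_masks]))
    O = float(np.mean([gray[m].mean() for m in r2_old_masks]))
    theta = (N + O) / 2.0

    # (d) Label each R1 instance by which side of theta its mean
    # grayscale falls on (equivalently, which class mean it is closer to).
    labels = []
    for m in r1_masks:
        mu = gray[m].mean()
        labels.append("new" if abs(mu - N) <= abs(mu - O) else "old")
    return theta, labels
```

Because θ is the midpoint of N and O, comparing each instance's mean grayscale to θ is the same as assigning it to the nearer class mean, which keeps the sketch correct whichever class happens to be brighter.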
The hypothesis is that R3 performs better than R2. Specifically, R3 can take advantage of the extraction capability of R1 while utilizing the grayscale difference between new and old buildings in R2. The two-class model's performance depends on the number of training samples: presumably, the more training data added, the more robust the network training and the better the segmentation results. This study therefore also tests how the number of training samples affects the performance of R2 and R3, to evaluate how HTMask R-CNN can save annotation effort while retaining segmentation capability.

Experiment
We used R2, the prediction result of the two-class model, as the benchmark; R3 is the result of the proposed framework. We compared R2 and R3 to test how much accuracy the proposed framework gains.
We randomly selected 50% of the images from the one-class and two-class training sets for data augmentation, resulting in 1.5 times the original training size. The augmentations included rotating, mirroring, brightness enhancement, and adding noise points to the images. In the training stage for the one-class and two-class models, 50 epochs with two batches per epoch were applied, and the learning rate was set at 0.0001. The Stochastic Gradient Descent (SGD) optimization algorithm was adopted as the optimizer [41]. We set the weight decay to 0.000. The loss function is shown in Equation (S1). The learning momentum was set at 0.9, which controls to what extent the model stays in the original updating direction. We used cross-entropy as the loss function to evaluate the training performance. We performed hyperparameter tuning, and the settings addressed above achieved the best performance (refer to Table S2 for details).
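The four augmentations named above can be sketched with plain NumPy. This is a hedged illustration: the paper does not specify the rotation angles, brightness gain, or noise density, so the parameter ranges below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Produce one augmented copy of an RGB image (H, W, 3), dtype uint8.

    Sketch of the four augmentations named in the text: rotation,
    mirroring, brightness enhancement, and added noise points.
    All parameter ranges are illustrative assumptions.
    """
    out = image.copy()
    out = np.rot90(out, k=int(rng.integers(0, 4)))    # rotate 0/90/180/270 degrees
    if rng.random() < 0.5:                            # mirror horizontally
        out = out[:, ::-1]
    gain = rng.uniform(1.0, 1.3)                      # brightness enhancement
    out = np.clip(out.astype(np.float32) * gain, 0, 255).astype(np.uint8)
    # "noise points": set a small fraction of random pixels to white
    n_pts = int(out.shape[0] * out.shape[1] * 0.001)
    ys = rng.integers(0, out.shape[0], n_pts)
    xs = rng.integers(0, out.shape[1], n_pts)
    out[ys, xs] = 255
    return out
```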
To test whether HTMask R-CNN can achieve converged performance with a limited amount of training data, the training process involved an incremental number of samples (from 5 to 20 satellite images). Afterward, we compared the baseline Mask R-CNN and HTMask R-CNN by comparing R2 and R3.

Accuracy Assessment
We use the average precision (AP) to quantitatively evaluate our framework on the validation dataset. The AP equals the area under the precision-recall (PR) curve:

AP = ∫₀¹ p(r) dr (1)

where p(r) is the precision at recall r. IoU is the ratio of the intersection and union of the prediction and the reference. When a segmentation image is obtained, the IoU is calculated according to

IoU = (Prediction ∩ Reference) / (Prediction ∪ Reference) (2)

Figure 4 shows the result for an example image, Site 1 (Figures S1 and S2 present additional examples for other sites with gradually increasing building density). In terms of building footprint mapping, the one-class model identified most of the buildings. More importantly, it accurately outlined individual buildings, and the boundaries of adjacent buildings were correctly separated, which allows the texture of each building to be captured. The baseline model (two-class model) performed better and better in building extraction as the number of training samples grew (Figure 4b). However, the number of buildings in R2 is still significantly lower than in R1, especially where the buildings are very dense (see Site 3), which aligns with our assumption. In R3, the proposed framework uses R1 as the base map, so the numbers of buildings are equal between R1 and R3; in this respect the proposed framework outperforms the baseline model. In terms of new and old building segmentation, R3 is significantly better than R2 at all levels of training samples. When the number of training samples is very limited, e.g., five, the baseline model misidentified most of the new and old buildings, while the proposed framework still produced a reasonable result. Figure 5 shows the performance of the one-class model. We noticed that its performance converges at the 25th epoch, where it identified most of the buildings, and its mAP50 reached 0.70.
The baseline two-class model and HTMask R-CNN also converge at the 25th epoch (Table 3; Figure 6). When the training size is small (image_num = 5), the mAP50 of the baseline two-class model is very low (0.24), while HTMask R-CNN significantly improves the recognition (0.46). With increasing training size, the performance gap between the baseline two-class model and HTMask R-CNN becomes narrower (see Figure 6d). Finally, the mAP50 of the baseline two-class model reached 0.51 when the training size is 20. More importantly, HTMask R-CNN performs consistently (mAP50 ≈ 0.48) regardless of the training size.
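The two metrics used above, IoU per Equation (2) and AP per Equation (1), can be computed as in this minimal sketch. How detections are matched to references at a given IoU cutoff to build the PR curve is omitted here; the trapezoidal integration of the PR curve is a common approximation, assumed rather than taken from the paper.

```python
import numpy as np

def iou(pred_mask, gt_mask):
    """IoU per Equation (2): intersection over union of two boolean masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter) / float(union) if union else 0.0

def average_precision(precisions, recalls):
    """AP per Equation (1): area under the precision-recall curve,
    approximated with the trapezoidal rule over recall-sorted points."""
    pts = sorted(zip(recalls, precisions))
    ap = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        ap += 0.5 * (p0 + p1) * (r1 - r0)
    return ap
```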

Discussions
With the advance of deep learning, the extraction of building footprints from satellite imagery has made notable progress, contributing significantly to digital records of settlements. However, the scarcity of training data has always been the main challenge for scaling building segmentation. Therefore, this study proposes a novel framework based on the Mask R-CNN model and histogram thresholding to extract old and new rural buildings even when labels are scarce. We tested the framework in Xinxing County, Guangdong Province, and achieved promising results. The framework provides a viable solution for mapping China's rural buildings at significantly reduced cost. It confirms our assumption that HTMask R-CNN can perform well with a small number of samples, meaning that it can significantly reduce annotation effort while retaining segmentation capability. In contrast, the baseline two-class model performed poorly in extracting old and new buildings as two categories.

Mask R-CNN models have been proven useful in many applications. However, this study found that the orthodox Mask R-CNN model performed poorly in the old-new two-category extraction task: when the training samples are limited, its mAP50 is only 0.24. We believe the varying geographical environments lead to poor generalization of the segmentation model when the training samples cannot cover the most distinctive spatial and spectral features. For instance, the model might not classify a building with an open patio as either a new or an old building if none of the training samples contains this unique shape. Meanwhile, the single-category classification task using Mask R-CNN could reach an mAP50 of 0.70. That means utilizing the one-class model's capability in mapping building footprints can improve the recall rate for the old-new two-category classification task, especially in high-density areas. Hence, we propose this framework.
When testing the framework with increasing training samples, we found that it converges at a very early stage, even when only five training images are used. That means the framework could be applied on a large scale to map all rural buildings in China.
Before then, more careful studies should be undertaken to understand the limitations of the framework. We applied the framework in Fuliang County, Jiangxi Province, China, and found that its performance is worse than that of the benchmark two-class model R2 (see Figure S4). When the pixel grayscales of the new and old buildings are similar, the histogram of the R2 result does not exhibit a clear valley; in that case, the thresholding method loses its advantage over the orthodox model, and the prediction should come from the output of R2.
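This fallback can be expressed as a simple guard on the gap between the two mean grayscale levels. The margin value and function name below are assumptions, since the paper gives no numeric criterion for a "clear valley".

```python
def choose_prediction(N, O, r2_labels, r3_labels, min_gap=10.0):
    """Fall back to the two-class model output (R2) when the grayscale
    histogram shows no clear valley between the new (N) and old (O)
    mean levels.

    min_gap is an assumed margin in grayscale levels; the paper does
    not specify a numeric criterion.
    """
    if abs(N - O) < min_gap:
        return r2_labels   # grayscales too similar: trust R2 directly
    return r3_labels       # clear valley: thresholded result R3 is preferred
```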
Moreover, polygons produced from the proposed framework have irregular shapes, slightly different from the building footprint boundaries. Therefore, downstream regularization is needed in future studies. The recent advances of multi-angle imaging technologies and vision technologies integrated with deep learning that emerged in civil engineering provide new opportunities in the 3D reconstruction of rural building models [42,43].

Conclusions
Nearly half of the Chinese population lives in rural areas. The lack of a digital record of new and old buildings has made it difficult for governments to understand the socio-economic state of these areas. Under the central government's current rural revitalization policy, many migrant workers will return to the villages. Therefore, a scalable, intelligent, and accurate building mapping solution is urgently needed. The framework proposed in this study achieved promising results even when training samples were scarce, allowing the mapping process to scale at significantly reduced cost. We therefore believe this framework could map every settlement in rural areas, help policymakers establish a longitudinal digital building record, and monitor socio-economics across all rural regions.
Supplementary Materials: The following are available online at https://www.mdpi.com/2072-4292/13/6/1070/s1. Figure S1: The result comparison of Site 2 between the baseline Mask R-CNN model and the HTMask R-CNN framework; Figure S2: The result comparison of Site 3 between the baseline Mask R-CNN model and the HTMask R-CNN framework; Figure S3: The feature maps of the three sites in the two-class model at the 50th epoch; Figure S4: The baseline Mask R-CNN model and the HTMask R-CNN framework tested in Fuliang County, Jiangxi Province, China; Figure S5: The loss function per training iteration; Table S1: The mAP75 between R2 and R3 at all levels of training; Table S2: Hyperparameter tuning; Equation S1: Loss function of the two-class model R2.

Data Availability Statement:
The data presented in this study are available at https://github.com/liying268-sysu/HTM-R-CNN, accessed on 8 February 2021.