GeoBoost: An Incremental Deep Learning Approach toward Global Mapping of Buildings from VHR Remote Sensing Images

Abstract: Modern convolutional neural networks (CNNs) are typically trained on data sets with a fixed size. In large-scale applications of satellite images, such as global or regional mapping, however, images are generally collected incrementally over multiple stages. In other words, the size of the training data set for a mapping task grows over time rather than being fixed beforehand. In this paper, we present a novel algorithm, called GeoBoost, for incremental-learning tasks of semantic segmentation via convolutional neural networks. Specifically, the GeoBoost algorithm is trained in an end-to-end manner on newly available data, without degrading the performance of previously trained models. The effectiveness of the GeoBoost algorithm is verified on the large-scale DREAM-B data set. The method avoids retraining on the enlarged data set from scratch and becomes more effective as more data become available.


Introduction
In recent years, satellite image data sets have grown considerably compared with a decade ago. In real-world applications, such as global or regional mapping, large-scale data sets are built in multiple stages. For instance, the widely used WHU-RS data set [1] was built through three versions: it was expanded from 12 classes of aerial scenes [1] to 19 [2] and then 20 [3] classes, and the number of samples per class was also increased. In fact, it is difficult to decide whether a data set is large enough, or whether its configuration of semantic classes is reasonable, before thoroughly validating the data set. Inevitably, large-scale data sets are built in a multi-stage manner. However, most current models applied to such growing data sets have weak continual-learning ability over time. This motivates the need for models that can continually fit growing data sets.
Incremental learning approaches focus on learning from sequentially acquired data without forgetting. This type of method is also referred to as continual learning [4][5][6][7] and lifelong learning [8][9][10]. Incremental learning approaches based on neural networks can be divided into the following categories [6]: regularization approaches, dynamic architectures, and memory replay. Regularization approaches add constraints to the update of weights to alleviate forgetting [11][12][13]. Dynamic architectures expand the networks dynamically by creating new weights or layers [14][15][16][17]. Memory replay stores a subset of samples or long-term information from the previous stage [5,18,19]. Since it does not constrain the capacity of the model, the dynamic-architecture approach is the more appropriate choice for achieving better performance [7]. Therefore, we focus on this type of method for the incremental learning of satellite images in this paper.
The task of incremental learning in this paper refers to data-incremental learning, which keeps the same classes for the data at different stages. Progressive neural networks [16] reuse the intermediate layers of previous models; this strengthens the representations of the final model but discards the supervision information of the output layers. Network architecture evolving [17] employs neural architecture search (NAS) to optimize the entire model on newly arriving data by choosing among three options: creating new neurons, reusing existing parameters, and tuning existing parameters. It keeps the model size reasonable, but the NAS procedure is time-consuming. Deep adaptation networks [20] attach new convolutional filters to an existing base network. These new filters are linear combinations of the trained filters of the base network, which allows the augmented network to adapt representations from previous tasks. The weights of the linear combinations are determined by controller modules. Expert gate [21] is a network of experts. It employs an auto-encoder gate to decide which expert to use at test time.
The relatedness between the test sample and the relevant experts is measured by the reconstruction error of the auto-encoder gate. Error-driven incremental learning [22] grows deep convolutional networks in a hierarchical manner; the increased network capacity only benefits the top layers, and the feature-extractor part is not expanded. The self-organizing incremental neural network [23] combines a variant of the Growing When Required (GWR) network [24] with pre-trained convolutional neural networks (CNNs) to hierarchically learn human actions; the pre-trained CNNs limit the representational power of the model. Incremental feature learning [25] adds neurons to the hidden layers of a denoising autoencoder to enhance the capacity of the network. These newly added features are trained on collected hard examples; similar features are then merged to reduce redundancy.
Incremental boosting [14] trains a base learner for each batch of newly arriving data to construct an ensemble model. Essentially, it is an additive model, so the performance of the trained base learners is not affected by subsequent models. Thus, we develop our new algorithm based on the incremental boosting method in this paper. Previous studies utilizing boosting for incremental learning [14,15,26] employ the AdaBoost algorithm [27] to adjust the weights of samples. In recent years, the gradient boosting algorithm has produced state-of-the-art results on many application benchmarks [28,29], and it is flexible enough to accommodate diverse loss functions. Therefore, we employ gradient boosting [30] for incremental learning in this paper.
For large-scale applications of satellite images, geospatial distribution is one of the most distinguishing characteristics. As shown in Figure 1, satellite images from different regions have their own regularities and are diverse in terms of color, texture, and morphological structure. In addition, for large-scale satellite images, the available training data are not evenly distributed. Figure 2 illustrates the geospatial distribution of the training labels used in this paper, which are derived from OpenStreetMap (OSM) [31]. It can be seen clearly that labels for Europe, North America, and East Asia are more available than those for other areas. The geospatial information of satellite images can be utilized to guide the continual-learning process of models and improve the prediction results for semantic segmentation. Additionally, out of data privacy considerations, researchers of satellite images cannot always obtain training data from previous research, whereas the trained models are generally publicly available. If we have a trained global model without access to the original data set, training a new model from scratch is not the best choice when new regional data are collected. Fine-tuning the existing model is a common way to use the information from the trained model, but it decreases the performance of the model on the original data set. Therefore, we propose a novel algorithm, called GeoBoost, for geographically incremental learning according to the corresponding geospatial distribution. The GeoBoost method simplifies the optimization of gradient boosting in the task of semantic segmentation via convolutional neural networks (CNNs) and attaches the geospatial distribution information of the large-scale satellite images to the base learners of gradient boosting. The remainder of this paper is organized as follows.
Section 2 describes the essential idea of GeoBoost. In Section 3, the experimental design and results are presented in detail. We discuss the results of the experiments in Section 4. Finally, we draw some conclusions in Section 5.


Methods
Boosting of neural networks has been adopted for incremental learning [14,15,26], where each base learner is trained in a manner that is similar to AdaBoost [27]. Different from previous studies, the GeoBoost algorithm employs gradient boosting [30] for incremental learning. In this section, we illustrate how gradient boosting, composed of neural networks, can be trained in an end-to-end way for geographically incremental learning. Furthermore, when gradient boosting is applied to large-scale satellite images, we show how the proposed algorithm, GeoBoost, assembles base learners of the ensemble model according to the corresponding geospatial distribution of data to improve the prediction results.

Gradient Boosting
The gradient boosting method [30] is a special type of ensemble learning approach, which constructs an additive model in a stage-wise strategy. More formally, given training data x = {x_1, . . . , x_N} and the corresponding labels y = {y_1, . . . , y_N}, gradient boosting optimizes the ensemble model

F_M(x) = Σ_{m=0}^{M} f_m(x)

to minimize the loss function L(y, F(x)), where f_m(x) is a single base learner. The whole optimization procedure is divided into multiple stages. In the m-th stage, based on the ensemble model F_{m−1}(x) from the previous stage and the loss function L(y, F_{m−1}(x)), the base learner of the current stage can be derived as

f_m(x) = −ρ_m g_m(x).
Here, g_m(x) = ∂L(y, F_{m−1}(x)) / ∂F_{m−1}(x) is the gradient direction, and ρ_m is the step size of the line search along the negative gradient direction. In summary, the essential idea of gradient boosting is to optimize the base learner f_m(x) along the negative gradient direction of the loss function. Since g_m(x) is the gradient with respect to F_{m−1}(x), the parameters θ_m of a parameterized base learner f_m(x; θ_m) cannot be trained from this expression directly. In practice, θ_m is obtained in an equivalent form by fitting a parameterized function h_m(x; θ_m) to the negative gradient,

θ_m = argmin_θ Σ_{i=1}^{N} [−g_m(x_i) − h_m(x_i; θ)]^2,

so that f_m(x) = ρ_m h_m(x; θ_m). Taking all the mentioned factors into account, the generic procedure of the gradient boosting method is shown in Algorithm 1.
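The stage-wise procedure above can be sketched in a few lines of Python. This is a minimal illustration for a 1-D regression task with squared loss, where the parameterized function h(x; θ) is a depth-1 regression stump; the stump learner, toy data, and hyper-parameter values are illustrative assumptions, not part of the paper's method.

```python
def fit_stump(xs, targets):
    """Fit a depth-1 regression stump h(x; theta) to the given targets."""
    best = None
    for split in xs:
        left = [t for x, t in zip(xs, targets) if x <= split]
        right = [t for x, t in zip(xs, targets) if x > split]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((t - (lm if x <= split else rm)) ** 2
                  for x, t in zip(xs, targets))
        if best is None or err < best[0]:
            best = (err, split, lm, rm)
    _, s, lm, rm = best
    return lambda x: lm if x <= s else rm

def gradient_boost(xs, ys, stages=10, lr=0.1):
    ensemble = []  # F_0(x) = 0 before any stage is trained
    predict = lambda x: sum(lr * f(x) for f in ensemble)
    for _ in range(stages):
        # For squared loss, the negative gradient -g_m(x_i) is proportional
        # to the residual y_i - F_{m-1}(x_i); each stump fits these residuals.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        ensemble.append(fit_stump(xs, residuals))
    return predict

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
F = gradient_boost(xs, ys, stages=30, lr=0.3)
```

With enough stages, the shrunken stumps drive the prediction toward the step function the data describe, which is exactly the additive, stage-wise behavior Algorithm 1 formalizes.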

Algorithm 1
The generic procedure of gradient boosting.

Input:
The training data, x, and the corresponding labels, y; the parameterized function, h(x; θ).

End-to-End Gradient Boosting
When neural networks are adopted as base learners of the gradient boosting algorithm, the existing studies [32,33] simply replace g_m(x) and ρ_m with the corresponding neural-network versions. In fact, the optimization of gradient boosting and the training of a single neural network can be combined to further simplify the training process.
In practice, previous studies treat the score of the softmax function as the output of the neural network for classification. Unlike this convention, we take the activation values z of the layer before the softmax function σ as the output of the neural network. Here, z is a K-dimensional vector of real values with respect to the K classes. Then, z is normalized by the softmax function σ to obtain the final normalized score:

σ(z)_k = exp(z_k) / Σ_{j=1}^{K} exp(z_j), k = 1, . . . , K.

With the above definition of the model, the base learner of gradient boosting can be expressed as f_m(x) = z_m, and the ensemble model as

F_M(x) = σ( Σ_{m=0}^{M} f_m(x; θ_m) ),

where θ_m are the parameters of the neural network f_m. In this form, the output of F_M(x) is a normalized score. Since Algorithm 1 is proposed for a generic purpose, any suitable optimization method can be utilized to fit h_m(x; θ_m) to the negative gradient. For instance, if h_m(x; θ_m) is a decision tree, it can be optimized by the CART algorithm [30,34]. Thus, computing the gradient and fitting the base learner are two separate steps. Generally, neural networks are trained by gradient descent. Likewise, gradient boosting is optimized along the negative gradient direction of the loss function. Instead of calculating g_m(x) explicitly, we can incorporate this step into the gradient-descent process. In other words, the loss function can be optimized directly during backpropagation of the neural network, without the extra step of obtaining g_m(x) explicitly. Therefore, this type of gradient boosting can be called end-to-end gradient boosting.
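The key point of the end-to-end formulation, summing raw logits z_m across base learners and applying the softmax σ only once at the end, can be illustrated with a toy example. The two base learners and their logit values below are invented for illustration, standing in for trained networks.

```python
import math

def softmax(z):
    """Normalize a K-dimensional logit vector z into class probabilities."""
    e = [math.exp(v - max(z)) for v in z]   # subtract max for stability
    s = sum(e)
    return [v / s for v in e]

# Two frozen base learners, each mapping an input to K=2 raw logits z_m.
f0 = lambda x: [1.0 * x, 0.5]
f1 = lambda x: [0.2, 0.8 * x]

def ensemble(x):
    # F_M(x) = softmax(sum_m z_m): logits are summed, not probabilities.
    z = [a + b for a, b in zip(f0(x), f1(x))]
    return softmax(z)

p = ensemble(1.0)   # summed logits [1.2, 1.3] -> a valid distribution
```

Summing logits rather than per-learner softmax scores is what makes the additive model differentiable end-to-end: the loss gradient flows through the single softmax straight into the newest learner's parameters.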
Suppose we have a regression task with the squared loss L(y, F_m(x)) = (y − F_m(x))^2, and the base models f_0 and f_1 have been learned, so we have the trained model

F_1(x) = f_0(x) + f_1(x).

If f_2 is a decision tree, the gradient direction is

g_2(x) = ∂L(y, F_1(x)) / ∂F_1(x) = −2(y − F_1(x)).

The decision tree f_2 has not been constructed yet, thus we can only obtain the gradients with respect to F_1(x), and then fit f_2 to −g_2(x) using the CART algorithm [34]. In short, the gradient boosting method utilizes g_2(x) = ∂L/∂F_1 to guide the learning procedure of the parameters θ_2.
If f_2 is a simple neural network f_2(x) = θ_2 x + b, the gradient direction is

g_2(x) = ∂L(y, F_2(x)) / ∂F_2(x).

Different from decision trees, a neural network can be randomly initialized before the training procedure, so the gradient with respect to F_2(x) can be obtained. The backpropagation algorithm computes the gradients of the parameters θ_2 as

∂L/∂θ_2 = (∂L/∂F_2) (∂F_2/∂θ_2)

to guide the learning procedure of the parameters,

θ_2 ← θ_2 − η ∂L/∂θ_2,

where η is the learning rate. Using decision trees as base learners leads to two separate steps: calculate ∂L/∂F_1, then use it to guide the learning of the parameters. The advantage of gradient boosting is that the type of base learner is not limited; using neural networks as base learners removes the need to calculate ∂L/∂F_2 manually, as it is computed by the backpropagation algorithm. For this reason, the end-to-end gradient boosting algorithm employs only neural networks as base learners. Different from line 3 of Algorithm 1, where g_m(x) is the gradient with respect to F_{m−1}(x), the base learner f_m(x; θ_m) can be optimized with respect to F_m(x) by gradient descent:

θ_m ← θ_m − η ∂L(y, F_m(x)) / ∂θ_m.

During the training of f_m(x), the base learners from previous stages are frozen and non-trainable, and θ_m is the only set of trainable parameters. Additionally, the gradient descent of neural networks already searches for the optimal point on the loss surface during training. Therefore, the line search for ρ_m is redundant and can be omitted. Finally, we obtain a concise process of training gradient boosting in an end-to-end manner, shown in Algorithm 2.
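The neural-network case reduces to an ordinary gradient-descent step on the newest learner's parameters while the earlier ensemble stays frozen. The following toy example walks through one such update for f_2(x) = θ_2 x + b under squared loss; all numeric values are illustrative assumptions.

```python
def F1(x):
    # Frozen ensemble from the previous stages (f_0 + f_1), not updated here.
    return 0.5 * x

theta2, b, eta = 0.0, 0.0, 0.1   # randomly-initialized f_2 (zeros for clarity)
x, y = 2.0, 2.0                  # one training example

F2 = F1(x) + theta2 * x + b      # current ensemble prediction: 1.0
dL_dF2 = -2.0 * (y - F2)         # gradient of (y - F_2)^2 w.r.t. F_2: -2.0
dL_dtheta2 = dL_dF2 * x          # chain rule through f_2: -4.0
dL_db = dL_dF2 * 1.0             # -2.0
theta2 -= eta * dL_dtheta2       # one gradient-descent step: 0.4
b -= eta * dL_db                 # 0.2
```

Note that only θ_2 and b change; F_1's parameters never appear in the update, which is exactly why earlier base learners keep their performance.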

Algorithm 2
The algorithm of end-to-end gradient boosting.

Input:
The training data, x, and the corresponding labels, y; the neural network, f (x; θ); and the softmax function, σ.
For the purpose of incremental learning, each base learner of boosting is trained on a set of newly arriving data rather than on the same one. For instance, the training collection (X, Y) is composed of a group of image sets X = {x_0, . . . , x_m, . . . , x_M} and the corresponding group of label sets Y = {y_0, . . . , y_m, . . . , y_M}, where each image set x_m = {x_1, . . . , x_N} is a bunch of images and the corresponding label set is y_m = {y_1, . . . , y_N}. If x_1 is an image with a size of 512 × 512, then y_1 is the label matrix of the semantic segmentation task with a size of 512 × 512. Thus, with two image sets x_1 = {x_1, x_2, x_3} containing three images and x_2 = {x_4, x_5} containing two, the corresponding label sets are y_1 = {y_1, y_2, y_3} and y_2 = {y_4, y_5}. Hence, the group of image sets X = {x_1, x_2} contains five images in total, and the group of label sets Y = {y_1, y_2} contains five label matrices in total.
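The nested grouping described above can be written out directly as Python lists; the stand-in strings below replace the actual 512 × 512 images and label matrices.

```python
# Stage-wise training collection: X groups image sets by stage,
# Y groups the matching label sets. Strings stand in for arrays.
x1 = ["img1", "img2", "img3"]   # first stage: three images
x2 = ["img4", "img5"]           # second stage: two images
y1 = ["lbl1", "lbl2", "lbl3"]
y2 = ["lbl4", "lbl5"]

X = [x1, x2]                    # group of image sets
Y = [y1, y2]                    # group of label sets

total_images = sum(len(s) for s in X)   # five images in total
total_labels = sum(len(s) for s in Y)   # five label matrices in total
```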
As each base learner f_m(x_m; θ_m) is trained on the data set (x_m, y_m), the gradient-descent update becomes

θ_m ← θ_m − η ∂L(y_m, F_m(x_m)) / ∂θ_m.

By utilizing gradient boosting for incremental learning [14,15,26], the performance of existing base learners is not affected by the newly added component, and the base learner in the current stage is trained without touching the data involved in previous stages. Additionally, the ensemble model can reuse the classification capability of the existing base learners.
For large-scale satellite images, geospatial distribution is the most notable characteristic. Generally, the collected satellite images are clustered in certain areas (as shown in Figure 3), and the objects in images from different areas are diverse in terms of color, size, and density. When gradient boosting is applied to large-scale satellite images, the geospatial distribution of the data can be exploited to improve the prediction results.
For instance, suppose that we separately collect satellite images from Europe and America, and we train a model F_a with the data from Europe and a model F_b with the data from America. The model F_a will perform well on the data from Europe, but deteriorate on the data from America: images from these two areas are quite different, and F_a has never seen data from America, so it fails to generalize there. One solution is to dedicate each model to a certain area. In this manner, F_a is trained on and predicts the data from Europe, and F_b is trained on and predicts the data from America. Specifically, in the m-th stage of gradient boosting, the coverage area of the training data set x_m can be expressed in geographic coordinates as a bounding box B_m = (x_min, y_min, x_max, y_max).
Since the base learner f_m is trained on x_m, we can set B_m as the coverage area of the base learner f_m(x_m; θ_m, B_m). A given image x_j is classified only by the base learners whose coverage areas contain the geographic location p_j of the image. Consequently, the ensemble model becomes

F_M(x_j) = σ( Σ_{i=0}^{M} r(p_j, B_i) f_i(x_j) ),

where r(p_j, B_i) is an indicator function:

r(p_j, B_i) = 1 if p_j lies within B_i, and 0 otherwise.

Due to the property that base learners adhere to certain areas, we name this algorithm GeoBoost. In the rest of this paper, the indicator function r(p_j, B_i) is simply denoted as r_i, and f_i(x_j, p_j; B_i) as f_i(x_j). The entire process of GeoBoost is shown in Algorithm 3. The base learners of the original gradient boosting method do not take any geospatial information into consideration, which means they can be applied to any area. In other words, they can be treated as a special case of GeoBoost whose coverage is the entire range of the geographic coordinate system.
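A minimal sketch of the GeoBoost prediction rule follows: each base learner carries a bounding box B_i, and only learners whose box contains the image location p_j contribute logits before the softmax. The boxes, locations, and two-logit toy learners are illustrative assumptions, not the trained models from the paper.

```python
import math

def r(p, B):
    """Indicator r(p_j, B_i): 1 if location p falls inside bounding box B."""
    x, y = p
    x_min, y_min, x_max, y_max = B
    return 1 if x_min <= x <= x_max and y_min <= y <= y_max else 0

def softmax(z):
    e = [math.exp(v - max(z)) for v in z]
    return [v / sum(e) for v in e]

# A global learner f_0 covering the whole coordinate system, and a second
# learner f_1 dedicated to a box roughly covering Europe (toy values).
learners = [
    (lambda img: [0.2, 0.1], (-180.0, -90.0, 180.0, 90.0)),
    (lambda img: [0.0, 1.5], (-10.0, 35.0, 40.0, 70.0)),
]

def geoboost_predict(img, p):
    z = [0.0, 0.0]
    for f, B in learners:
        if r(p, B):                          # skip learners outside p
            z = [a + b for a, b in zip(z, f(img))]
    return softmax(z)

p_vienna = geoboost_predict(None, (16.4, 48.2))    # both learners fire
p_chicago = geoboost_predict(None, (-87.6, 41.9))  # only the global one
```

The global learner here is the degenerate case noted in the text: a box spanning the whole coordinate system makes r(p, B) = 1 everywhere, recovering plain gradient boosting.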
Boosting methods assemble a bunch of base learners into a final ensemble model. With so many parameters, the ensemble model can easily over-fit. The gradient boosting method applies a learning rate to each base learner as regularization to prevent over-fitting [30]. The learning rate ν and the number of base learners M jointly determine the performance of the boosting model; small learning rates reduce the dominance of any single base learner. With the learning rate ν in place, we arrive at the final algorithm shown in Algorithm 4.

Algorithm 3
The algorithm of GeoBoost.

Data Set
For the semantic segmentation tasks of satellite images, the commonly used data sets, such as the INRIA building data set [35] and the ISPRS 2D Semantic Labeling Benchmark [36], are built from just a few cities and are therefore not distributed widely enough. We thus create a new worldwide building data set, collected from 100 different cities as shown in Figure 2, to simulate real applications. This new data set is named the Building data set for Disaster Reduction and Emergency Management (DREAM-B). The training images are collected from Google Earth Engine [37], and the corresponding labels are obtained from the open-source map, OpenStreetMap (OSM) [31]. DREAM-B contains 626 image tiles with a size of 4096 × 4096. We split out 250 tiles for training, 63 for validation, and 313 for testing. The data set contains only two classes: building and non-building. Each image tile of the DREAM-B data set is composed of red (R), green (G), and blue (B) bands, and its spatial resolution is 30 cm.
For the task of incremental learning, the 250 training image tiles are divided into four groups based on their geospatial location rather than sampled uniformly. As shown in Figure 3, we build a tiny global data set for the first group and add local data sets for the other groups. The amounts of image tiles in the four groups are roughly equal. This division simulates the real situation in which a global data set might be built incrementally, e.g., from coarse to fine and from one region to another.

Implementation Details
We combine the U-Net model [38] with the NASNet-Mobile model [39] as the architecture of the base learners of the GeoBoost algorithm. Specifically, the convolutional modules in U-Net are replaced by the neural cell obtained via neural architecture search [39], which is more efficient in terms of computation. This model is called U-NASNetMobile in this paper. In recent studies [40,41], researchers found that a larger training sample size produces better performance for neural networks. However, with fixed GPU memory, a larger training sample size leads to a smaller batch size for gradient descent. According to the linear scaling rule for learning rates [40], a small batch size increases the noise in the gradient, so the learning rate may have to be decreased. Balancing the batch size against the training sample size, we set the input size of U-NASNetMobile to 512 × 512, and the original image tiles of the DREAM-B data set are split into small patches of 512 × 512 to match the model. The architecture of U-NASNetMobile is shown in Figure 4. Data augmentations are employed to avoid over-fitting, including random horizontal and vertical flipping, random rotation, and random brightness jittering.
The Adam optimizer [42] is used for optimization. We train models with the cosine learning rate annealing schedule [43], with a maximum learning rate of 3 × 10^−4 and a minimum learning rate of 1 × 10^−6. Unless otherwise specified, all the experiments are trained for 200 epochs with a mini-batch size of 16. In addition, the Intersection over Union (IoU) is employed as the evaluation metric [44]. More specifically, the IoU accuracy is defined as

IoU = TP / (TP + FP + FN),

where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels, respectively. The metrics of Precision, Recall, Overall Accuracy, and Kappa Coefficient are also provided for reference.

Figure 4. The architecture of U-NASNetMobile. Normal Cells and Reduction Cells are the structures obtained via neural architecture search [39]. The yellow circles are concatenation layers.
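The IoU metric can be computed from pixel counts of true positives (TP), false positives (FP), and false negatives (FN) as IoU = TP / (TP + FP + FN); a small self-contained check on flattened toy masks:

```python
def iou(pred, label):
    """IoU = TP / (TP + FP + FN) for a binary building / non-building mask."""
    tp = sum(1 for p, l in zip(pred, label) if p == 1 and l == 1)
    fp = sum(1 for p, l in zip(pred, label) if p == 1 and l == 0)
    fn = sum(1 for p, l in zip(pred, label) if p == 0 and l == 1)
    return tp / (tp + fp + fn)

pred  = [1, 1, 0, 0, 1, 0]   # flattened predicted mask (toy values)
label = [1, 0, 0, 1, 1, 0]   # flattened ground-truth mask (toy values)
score = iou(pred, label)     # TP=2, FP=1, FN=1 -> 2 / 4 = 0.5
```

Note that true negatives never enter the formula, which is why IoU is stricter than overall accuracy on imagery dominated by background pixels.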

Results and Discussion
In this section, we first analyze the impacts of technical factors in Section 4.1. Then, the quantitative and qualitative evaluations of different models are presented in Sections 4.2 and 4.3. Finally, we dive into the regional impacts on prediction in Section 4.4.

Pre-Training
Base learners of the boosting algorithm are trained sequentially. After the base model f_0 is trained, the weights of subsequent base learners can be initialized with the trained weights of f_0 instead of random initialization. This is an application of the pre-training approach [45]. As illustrated in Figure 5, the base model f_1 pre-trained on f_0 converges faster than the model trained from scratch and also achieves better performance. Therefore, we adopt the pre-training strategy in the rest of the experiments.
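The pre-training strategy amounts to copying the trained weights of f_0 into the new learner before its own training starts. A schematic sketch, with a plain dict of lists standing in for network parameters:

```python
import copy
import random

# Trained weights of the first base learner f_0 (toy values).
f0_weights = {"conv1": [0.12, -0.40], "head": [0.9]}

def init_from(trained):
    """Pre-training: start the new learner f_1 from f_0's trained weights."""
    return copy.deepcopy(trained)    # deep copy, so f_1 trains independently

def init_random(shapes):
    """The alternative: random initialization with the same parameter shapes."""
    return {k: [random.uniform(-0.1, 0.1) for _ in v] for k, v in shapes.items()}

f1_weights = init_from(f0_weights)   # f_1 begins where f_0 ended
```

The deep copy matters: f_1's subsequent gradient updates must not touch f_0's frozen weights, mirroring how frozen base learners are handled during boosting.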

Learning Rates
Because training ensemble models is time-consuming, all the base learners in Table 1 are trained for only 100 epochs to search for the optimal learning rate. From Table 1, it can clearly be seen that the learning rates of the base learners significantly affect the performance of the GeoBoost algorithm. With four base learners, the learning rate ν achieves the best result around 0.1. This is consistent with the gradient boosting algorithm [30], which uses many more base learners. Thus, a learning rate of 0.1 is a stable value in practice and is the default configuration in the rest of the boosting experiments.

Complexity Analysis
The GeoBoost algorithm trains M base learners to form an ensemble model, so its model complexity is O(M). It therefore contains more parameters compared with a single model. However, when more data become available, GeoBoost can be adopted without tuning the hyper-parameters of the base models, and the architectures of these base models can even differ. Since each base learner is assigned only a portion of the data, the training time of GeoBoost is not increased and keeps pace with that of a single model. At test time, the ensemble model slows down inference with a time complexity of O(M). Nevertheless, all the base learners are independent of each other, so the model is highly parallelizable, which reduces the inference time.
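Because the base learners are mutually independent, their forward passes can be dispatched concurrently; a sketch with Python's thread pool, using cheap toy functions in place of networks:

```python
from concurrent.futures import ThreadPoolExecutor

# Four independent "base learners", each producing a 2-logit output.
# The k=k default binds each learner's index at definition time.
learners = [lambda x, k=k: [k * x, 1.0] for k in range(4)]

def parallel_logits(x):
    """Run all base learners concurrently, then sum their logits."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        outs = list(pool.map(lambda f: f(x), learners))
    return [sum(vs) for vs in zip(*outs)]   # elementwise sum across learners

z = parallel_logits(1.0)   # [0+1+2+3, 1+1+1+1] = [6.0, 4.0]
```

With real networks, the same pattern would dispatch one forward pass per device or worker; the summation step is the only synchronization point.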

Quantitative Evaluation
As shown in Table 2, the result of the single net is obtained by training a U-NASNetMobile model with all the available training data x_0, x_1, x_2, and x_3 from the four areas. This is the ideal case, in which all of the data sets are available for training at once; it might not be a realistic scenario for the practical application of global mapping using VHR satellite images. The single net model has the highest accuracy of 0.6458, which serves as the baseline for comparison.

Table 2. Results for the comparison of different methods. The result of the single model is obtained by training a net with all the available training data x_0, x_1, x_2, and x_3 from the four areas. EGB is short for the end-to-end gradient boosting method.

The end-to-end gradient boosting algorithm for incremental learning obtains the worst performance in all the experiments: 0.5465 (IoU) for U-Net and 0.5887 (IoU) for U-NASNetMobile. It can be seen from Figure 6a that the first base learner of gradient boosting, f_0, already has an IoU accuracy of 0.5998, while the successive results are only 0.5564, 0.5747, and 0.5887. This means that the performance of the model is decreased by gradient boosting when the geospatial distribution of the data is not considered. Compared with the single net, GeoBoost obtains a comparable accuracy of 0.6372, and it also outperforms the end-to-end gradient boosting algorithm by a large margin. Random subsampling of the training data for base learners can improve the performance of gradient boosting [46], whereas collecting satellite images from certain areas is effectively biased subsampling without replacement rather than uniform sampling. That may be why the end-to-end gradient boosting algorithm fails on the DREAM-B data set while GeoBoost achieves a satisfactory performance.

The results of the experiments demonstrate the effectiveness of the GeoBoost algorithm in progressively utilizing the geospatial information of satellite images. Additionally, the result of the single net shows once more the importance of large-scale data. Figure 7 shows some semantic segmentation results of GeoBoost at different training stages for the cities of Chicago, Vienna, and Shanghai. Compared with the model F_0(x) = σ(r_0 f_0(x)), the model F_3(x) = σ(r_0 f_0(x) + r_1 f_1(x) + r_2 f_2(x) + r_3 f_3(x)) produces better prediction results in terms of both accuracy and visual quality. According to the location of the image from Chicago, r_2 and r_3 are 0, and r_0 and r_1 are 1 for this image, based on the definition of the indicator function. Therefore, the prediction for the image from Chicago can be simplified as F_3(x) = σ(f_0(x) + f_1(x)). Similarly, predictions for the images from Vienna and Shanghai can be simplified in the same way. As pointed out by the yellow circles in Figure 7, some misclassified pixels of F_0(x) are corrected by the model F_3(x), and the others are not affected. Taken together, Figures 6 and 7 suggest that the GeoBoost method is capable of continual learning, both quantitatively and qualitatively.

Figure 8 presents some semantic segmentation results of different models for the cities of Chicago, Berlin, and Shanghai. Yellow circles indicate some notable discrepancies among the prediction results. The three models produce quite similar results. The prediction of GeoBoost is more in accordance with that of the single model, while the end-to-end gradient boosting model misclassifies some tiny buildings. It should be noted that these visualization results are merely for reference: the tiny differences among the models can also be caused by the randomness of the training process and weight initialization. The quantitative evaluations in Section 4.2 are more reliable.

Discussion on the Regional Impacts
Some typical samples from the areas B_1, B_2, and B_3 in Figure 3b are shown in Figures 9-11, respectively. As shown in Figure 9, buildings in area B_1 have uniform distributions. Most buildings in this area are quite small, and their orientation is consistent within each image tile. Gaps between individual buildings are obvious. Thus, images in this area are easy to predict correctly. It can be seen from Figure 10 that buildings in area B_2 are quite different: their shapes are not square, and adjacent buildings are connected to each other to form big chunks. Figure 11 shows that buildings in area B_3 are more cluttered, and this area contains more high-rise buildings. Some pixels in the bottom-left corner of Figure 11i are mislabeled, which suggests that there may be some label noise in this area.
The accuracy of GeoBoost in Table 2 is measured by the overall performance of the model. For comparison, we can dive into more details of the trained model in different areas around the world. As shown in Table 3, the performance of GeoBoost fluctuates from one area to another. The model achieves its worst performance of 0.4758 in East Asia, whereas the results in America and Europe are much better. The divergence of the regional performance may be caused by two reasons. First, it may be caused by the IoU accuracy measurement, which favors big objects over small ones. For simplicity, we assume that buildings are squares with a side length of s pixels, so the total number of pixels of a building is s × s = s^2. Most misclassifications of the model occur near the borders of buildings, so the total number of misclassified pixels is roughly 4 × s. The proportion of misclassified pixels can thus be inferred as 4s / s^2 = 4/s. As the building size grows, the proportion of misclassified pixels shrinks, and the IoU rises. The above inference is a rough estimate rather than a strict analysis. With all the available labels of the training set, we can glance at the distribution of building sizes in the DREAM-B data set. From Figure 12, it can be clearly seen that buildings in Europe are considerably bigger than those in the other areas. Figure 10 presents the visualization of samples from three areas. Buildings in the city of Vienna are closely connected to each other and form big building groups. In addition, big buildings have more context from which to infer predictions. This is why the prediction for Europe achieves the best result.
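The rough 4/s border-error estimate from the paragraph above can be checked numerically; the two building sizes below are arbitrary examples.

```python
def border_error_fraction(s):
    """Rough estimate: ~4*s misclassified border pixels out of s*s total."""
    return (4 * s) / (s ** 2)   # simplifies to 4 / s

small = border_error_fraction(10)    # 10-pixel side -> 0.4 of pixels wrong
large = border_error_fraction(100)   # 100-pixel side -> only 0.04 wrong
```

The fraction falls inversely with the side length, which is the stated mechanism for IoU favoring the larger European buildings over the small ones elsewhere.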

Second, the data of East Asia contain more high-rise buildings. The image in Figure 11f is a sample from the city of Shanghai. Obviously, the upper left corner of this sample contains a considerable number of high-rise buildings. This is very common in East Asia and does not appear very often in the other areas. Figure 13 better illustrates the effect of high-rise buildings. The labels of high-rise buildings are located at their footprints. Though the model finds the positions of these high-rise buildings, there is no visual clue, such as edges or contrast, for precise segmentation. Recognizing 3D structure from a single image is a tough task for CNNs, whereas the segmentation of lower buildings is much easier and produces sharp building edges, as shown in Figure 7. Because the images are not acquired with an orthographic projection, the low accuracy on high-rise buildings is inevitable. These two reasons may explain the divergence of the regional performance. Surprisingly, CNNs can capture 3D spatial information to some extent. These results qualitatively verify the necessity of utilizing the geospatial information of satellite images for semantic segmentation.

Conclusions
In this paper, we propose a novel approach, GeoBoost, for geographically incremental learning, which is trained in an end-to-end way. It enables models of satellite images to learn continually, based on the geographical information of the data, without forgetting previous knowledge. The effectiveness of the GeoBoost algorithm is verified on the large-scale DREAM-B data set. The proposed method, utilizing U-NASNetMobile as the base learner, outperforms end-to-end gradient boosting by a large margin of 4.85% (IoU). Experiments with different base learners confirm that GeoBoost surpasses end-to-end gradient boosting consistently. At present, the algorithm is validated on the semantic segmentation task with high-resolution satellite images, but the method is flexible: by adopting an appropriate base learner, it may also benefit lower-resolution satellite images. The current GeoBoost algorithm focuses on the task of data-incremental learning; based on its flexible framework, we will adapt it to class-incremental learning in the future.