Three-Stage MPViT-DeepLab Transfer Learning for Community-Scale Green Infrastructure Extraction

Abstract: The extraction of community-scale green infrastructure (CSGI) poses challenges due to limited training data and the diverse scales of the targets. In this paper, we reannotate a training dataset of CSGI and propose a three-stage transfer learning method employing a novel hybrid architecture, MPViT-DeepLab, to help us focus on CSGI extraction and improve its accuracy. In MPViT-DeepLab, a Multi-path Vision Transformer (MPViT) serves as the feature extractor, feeding both coarse and fine features into the decoder and encoder of DeepLabv3+, respectively, which enables pixel-level segmentation of CSGI in remote sensing images. Our method achieves state-of-the-art results on the reannotated dataset.


Introduction
Urban green space (UGS) refers to natural environment areas within urban areas, which typically comprise natural elements, such as trees, grass, gardens, parks, forests, lakes, rivers, lawns, and more. These green spaces can be located in various parts of the city, including downtown areas, communities, residential neighborhoods, and industrial zones. The extraction of urban green spaces can be achieved through traditional methods [1,2] as well as deep learning-based approaches [3,4]. In [5], the concept of community-scale green infrastructure (CSGI) is introduced and pertains to public facilities built to promote community sustainability. Compared to urban green spaces, these infrastructures are smaller in scale. CSGI extraction is of significant value for community planning, landscaping, and sustainable development. In this paper, we categorize CSGI into five classes based on their specific functions within the community: Grass, Tree, Bush area, Lake, and Terrace greenery, which collectively cover over 90 percent of the green infrastructure within the community.
CSGI extraction is a remote sensing semantic segmentation task. Semantic segmentation in remote sensing imagery has long been one of the challenging problems in the field of computer vision. This difficulty arises from the significant variations in the appearance of segmented objects and their irregular spatial distribution. Convolutional neural networks (CNNs), as excellent feature extractors, have been widely applied to remote sensing image segmentation tasks [6,7]. However, conventional CNNs for classification often incorporate consecutive pooling and downsampling operations, leading to a decrease in spatial resolution and a limitation in modeling long-range dependencies. To address this issue, researchers have attempted to improve the network structure of CNNs. For instance, U-Net [8], with its encoder-decoder structure and skip connections, performs well in medical image segmentation [9] and remote sensing image segmentation tasks and achieves promising results in building extraction [10,11]. To give another example, the DeepLab series of networks [12][13][14][15] increases kernel sizes, employs atrous convolutions, and establishes spatial pyramids to expand the receptive field for pixel-level image segmentation. These optimizations enhance the long-range dependency of feature maps.
The Transformer architecture originally gained tremendous success in the field of natural language processing by stacking multi-head attention modules to model long-range dependencies, effectively compensating for the limitations of CNNs. Building on this, the Vision Transformer (ViT) [16] processes images in the form of token sequences, similar to dealing with natural language. However, it has a significant drawback in that it requires large amounts of data and long training times to match the performance of CNNs. The Swin Transformer [17] introduces the concept of a shifted window to make ViT more efficient. The Multi-path Vision Transformer (MPViT) [18] takes a different approach by constructing parallel multi-path structures, simultaneously utilizing the local connectivity of convolution and the long-range dependency modeling of the Transformer, which allows it to represent fine and coarse features for dense prediction tasks effectively.
CSGI extraction requires attention to both global and local features. Additionally, the dense and overlapping distribution of segmentation objects in CSGI, coupled with the difficulty of manual labeling and the scarcity of training data, poses significant challenges for extraction. The lack of training data and the introduction of new extraction targets led us to consider transfer learning. Early on, transfer learning found extensive applications in multi-task learning [19] and domain adaptation [20]. With the rise of deep learning in the 21st century, researchers have begun to use transfer learning to tackle more complex tasks and data. Jason Yosinski and his colleagues [21], through experimentation and trials, provide substantial support and affirmation for the transferability of deep networks. When transfer learning is appropriately employed, the accuracy of deep learning tasks can be significantly improved while reducing training time. Transfer learning has found numerous applications in fields such as medicine [22,23], industry [24], finance [25,26], biology [27], music [28], environment [29], and computer vision [30,31].
In this paper, we use DeepLabv3+ as the backbone network to compare the segmentation performance of MobileNet [32], ResNet101 [33], Xception [34], and MPViT as feature extraction networks on CSGI. Additionally, we investigate which batch normalization (BN) layers of MPViT-DeepLab should be frozen during the three-stage transfer learning process. In summary, the contributions of this paper are as follows:
1. We reannotate a dataset suitable for training in the task of CSGI extraction.
2. We feed the coarse and fine features extracted by MPViT into DeepLabv3+ for pixel-level segmentation of CSGI in the three-stage transfer learning process.
3. We confirm that three-stage MPViT-DeepLab transfer learning, along with freezing all BN layers during the second transfer learning, achieves state-of-the-art performance in the CSGI extraction task on the reannotated dataset.

Related Work
The hybrid architecture we employ, MPViT-DeepLab, utilizes DeepLabv3+ as the backbone with MPViT serving as the feature extractor. In this section, we provide a detailed explanation of the specific configurations of the MPViT and DeepLabv3+ that we use.

Multi-Path Vision Transformer
In [18], the Multi-path Vision Transformer (MPViT) is primarily constructed using a Multi-Scale PatchEmbed (MS-PatchEmbed) block and a Multi-Path Transformer (MP-Transformer) block to build a parallel multi-path structure, capturing both fine and coarse features for dense prediction tasks. Figure 1 provides a specific illustration of the MPViT network architecture. We input an image of dimensions H × W × 3. Initially, the image undergoes two consecutive convolutional batch normalization (Conv-BN) layers with a 3 × 3 convolution kernel and a stride of 2, which serve to reduce dimensionality and model parameters. After this stage, the feature size becomes $\frac{H}{4} \times \frac{W}{4} \times C'$, where $C'$ represents the channel size for the next stage. The Conv-BN layer consists of a convolutional layer, a batch normalization layer, and a Hardswish activation function. Following this, the process involves four iterations, each consisting of an MS-PatchEmbed block and an MP-Transformer block. The MS-PatchEmbed block can extract image sequences of different sizes using depth-wise convolution, while the MP-Transformer block processes the image sequences passed by the MS-PatchEmbed block through a Transformer Encoder, resulting in the aggregation of local and global features. In the following sections, we provide a detailed description of each component.
In the Multi-Scale PatchEmbed block, the output size of the convolution $F_{k \times k}(\cdot)$ (kernel size k × k, stride s, and padding p) in terms of height $H_i$ and width $W_i$ is as follows:

$$H_i = \left\lfloor \frac{H_{i-1} - k + 2p}{s} + 1 \right\rfloor, \qquad W_i = \left\lfloor \frac{W_{i-1} - k + 2p}{s} + 1 \right\rfloor$$

where $H_{i-1} \times W_{i-1}$ is the spatial size of the input. When s = 1 and p = (k − 1)/2, we can use patches of different sizes to generate features of the same size. The H × W × C image features, regardless of whether they pass through 3 × 3, 5 × 5, or 7 × 7 convolutions, will not change the H and W sizes. This allows visual tokens of the same sequence length to be passed into the Transformer Encoder. When we need to reduce spatial resolution, we increase the stride, such as when s = 2, which reduces the spatial resolution to half of its original value. Within each Multi-Scale PatchEmbed block, we perform two convolutions with s = 2. For instance, with k = 3 and p = 1, a feature whose height and width are even is reduced from $H_{i-1} \times W_{i-1}$ to $\frac{H_{i-1}}{2} \times \frac{W_{i-1}}{2}$. Stacking convolution blocks allows us to achieve the same receptive field as a larger kernel with fewer parameters. As a result, our 5 × 5 convolution is composed of two 3 × 3 convolutions ($2 \times 3^2 < 5^2$), and similarly, our 7 × 7 convolution is formed by stacking three 3 × 3 convolutions ($3 \times 3^2 < 7^2$). The depthwise separable convolutions [34] (depicted as DWConv-BN in Figure 3) consist of a 3 × 3 depthwise convolution and a 1 × 1 pointwise convolution, followed by a batch normalization layer [35] and a Hardswish activation function [36].
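The size rule for a convolution with kernel size k, stride s, and padding p, and the parameter argument for stacking 3 × 3 kernels, can be checked with a few lines of arithmetic. The following is a minimal sketch: the 56 × 56 input size is an assumption chosen for illustration, and the weight counts are per channel for depthwise kernels.

```python
def conv_output_size(size: int, k: int, s: int, p: int) -> int:
    """Spatial size after a k x k convolution with stride s and padding p."""
    return (size - k + 2 * p) // s + 1


def stacked_3x3_weights(n_layers: int) -> int:
    """Per-channel weight count of n stacked 3 x 3 depthwise kernels."""
    return n_layers * 3 * 3


# Same-size case: s = 1 and p = (k - 1) / 2 leave a 56 x 56 feature unchanged.
same = [conv_output_size(56, k, s=1, p=(k - 1) // 2) for k in (3, 5, 7)]

# Downsampling case: s = 2 halves an even spatial size.
halved = conv_output_size(56, k=3, s=2, p=1)

# Stacked 3 x 3 kernels versus one large kernel with the same receptive field:
# maps the large kernel size to (stacked weights, single-kernel weights).
savings = {5: (stacked_3x3_weights(2), 5 * 5), 7: (stacked_3x3_weights(3), 7 * 7)}

print(same, halved, savings)
```

Because all three kernel sizes return the same spatial size, the parallel paths produce token sequences of equal length, which is what allows them to be concatenated later.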

Multi-Path Transformer Block
The structure of the multi-path Transformer block is depicted in Figure 3. The purpose of the multi-path Transformer block is to capture long-range dependencies while paying attention to local feature relationships. It mainly comprises two types of modules: the Depthwise Residual Bottleneck block (DW Residual Bottleneck) and the Transformer Encoder block. The DW Residual Bottleneck block consists of a 1 × 1 convolution, a 3 × 3 depthwise convolution, and another 1 × 1 convolution [33]. Its primary role is to extract local features with a relatively lower parameter count and computational cost in both spatial and channel dimensions. Within the Transformer Encoder block, efficient factorized self-attention [37] is employed (depicted as the F-MHSA block in Figure 4) to alleviate the computational burden:

$$\mathrm{FactorAtt}(Q, K, V) = \frac{Q}{\sqrt{C}} \left( \mathrm{softmax}(K)^{\top} V \right)$$

where $Q, K, V \in \mathbb{R}^{N \times C}$ represent linearly projected queries, keys, and values, and N and C denote the numbers of tokens and channels. Due to the self-attention mechanism of the Transformer and its ability to disregard position and sequence size, it exhibits great strength in handling global features. The Transformer Encoder block is capable of extracting global features in both spatial and channel dimensions. We denote the extracted local features as $L_i \in \mathbb{R}^{H_i \times W_i \times C_i}$ and the global features as $G_{i,j} \in \mathbb{R}^{H_i \times W_i \times C_i}$. These features are then concatenated together:

$$A_i = \mathrm{Concat}\left([L_i, G_{i,1}, G_{i,2}, \ldots, G_{i,j}]\right)$$

The path index j corresponds to the position on the path, and the aggregated feature $A_i \in \mathbb{R}^{H_i \times W_i \times (j+1)C_i}$ is passed through the feature interaction function I(·) to generate the final feature $X_{i+1} \in \mathbb{R}^{H_i \times W_i \times C_{i+1}}$ with the channel dimension $C_{i+1}$ of the next stage. In I(·), a 1 × 1 convolution with a channel number of $C_{i+1}$ is used. The final feature $X_{i+1}$ serves as the input for the next stage's Multi-Scale Patch Embedding layer.
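To make the cost saving of factorized self-attention concrete, the sketch below computes (Q / sqrt(C)) · (softmax(K)ᵀ · V) with plain nested lists. It is an illustration under stated assumptions, not the authors' implementation: it uses a single head, applies the softmax over the token axis of K, and the helper names `softmax_over_tokens`, `matmul`, and `factor_att` are hypothetical.

```python
import math


def softmax_over_tokens(K):
    """Softmax applied to each channel (column) of K across the N tokens."""
    n, c = len(K), len(K[0])
    out = [[0.0] * c for _ in range(n)]
    for col in range(c):
        peak = max(K[row][col] for row in range(n))  # for numerical stability
        exps = [math.exp(K[row][col] - peak) for row in range(n)]
        total = sum(exps)
        for row in range(n):
            out[row][col] = exps[row] / total
    return out


def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def factor_att(Q, K, V):
    """Factorized attention: (Q / sqrt(C)) @ (softmax(K)^T @ V)."""
    c = len(Q[0])
    k_soft_t = [list(col) for col in zip(*softmax_over_tokens(K))]  # C x N
    context = matmul(k_soft_t, V)  # C x C, its size does not grow with N
    scaled_q = [[q / math.sqrt(c) for q in row] for row in Q]
    return matmul(scaled_q, context)  # N x C


# Three tokens with two channels each.
Q = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
K = [[0.5, -1.0], [0.0, 2.0], [1.5, 0.3]]
V = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
out = factor_att(Q, K, V)
```

The intermediate `context` matrix is C × C rather than N × N, so the cost grows linearly with the number of tokens instead of quadratically.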

DeepLabv3+
DeepLabv3+ [15] is a semantic segmentation model that excels in capturing fine details and context in images. Notable for its dilated convolutions and atrous spatial pyramid pooling, it effectively addresses challenges in various fields, providing precise object delineation and high-resolution predictions.
As shown in Figure 4, in DeepLabv3+, the feature extractor feeds the low-level feature and the output feature into the decoder and encoder, respectively. The encoder utilizes atrous convolutions to compute output features. The features processed by the encoder are combined with the low-level feature passed to the decoder, and after operations such as upsampling, the model outputs the segmented image result.
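How atrous convolutions widen the receptive field without adding weights can be illustrated numerically. This is a minimal sketch assuming 3 × 3 kernels, rates of 6, 12, and 18, a 14 × 14 input, and padding equal to the rate; the function names are hypothetical.

```python
def effective_kernel(k: int, rate: int) -> int:
    """Extent covered by a k x k kernel whose taps are spaced `rate` apart."""
    return k + (k - 1) * (rate - 1)


def atrous_output_size(size: int, k: int, rate: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial size after an atrous convolution (standard output-size relation)."""
    return (size + 2 * padding - rate * (k - 1) - 1) // stride + 1


rates = (6, 12, 18)

# Each 3 x 3 kernel still has 9 weights but spans a much larger window.
fields = [effective_kernel(3, r) for r in rates]

# With padding equal to the rate, a 14 x 14 feature map keeps its size,
# so the parallel branches can be concatenated channel-wise.
sizes = [atrous_output_size(14, 3, r, padding=r) for r in rates]

print(fields, sizes)
```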

Dataset
We select the ILSVRC2012 dataset [38] for pretraining, which spans 1000 object classes and contains 1.2 million training images, 50,000 validation images, and 100,000 test images. This dataset provides sufficient data for training ViT-based networks. Additionally, we utilize the DroneDeploy Segmentation Dataset [39], which comprises several aerial scenes captured with drones. The dataset includes six categories: BUILDING, CLUTTER, VEGETATION, WATER, GROUND, and CAR. Each scene has a ground resolution of 10 cm per pixel. We chip these aerial images into 300 × 300 slices, resulting in a training set of over 11,000 slices and a validation set of over 2000 slices. Subsequently, cropping is performed according to the requirements of different network inputs.
CSGI is classified into five categories based on different functions: Grass, Tree, Bush area, Lake, and Terrace greenery. However, the DroneDeploy Segmentation Dataset is not fully suitable for CSGI extraction tasks. To adapt to the CSGI extraction task, we use Labelme [40] to reannotate three iconic remote sensing images from the DroneDeploy Segmentation Dataset. The samples in the dataset are shown in Figure 5, where Bush area includes low bushes, flower beds, etc., and Terrace greenery refers to the greenery on the roofs of community buildings. During the annotation process, we try to ensure that the contours of each CSGI are clear. Connected objects, such as adjacent trees, are marked together. We then divide the reannotated remote sensing images into 300 × 300 slices, removing slices with a high proportion of background. The remaining slices are split into a training set and a validation set. Removing slices with a high proportion of background can improve training accuracy and reduce the impact of the background on calculation. In the end, we obtain a training set consisting of 300 images and a validation set consisting of 110 images, with a training set to validation set image quantity ratio of approximately 3:1.
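The slicing and background-filtering step can be sketched as follows. The text does not state the background threshold or the label encoding, so `max_background=0.5` and `background_id=0` are assumptions for illustration, and the function names are hypothetical.

```python
def chip_origins(height, width, chip=300):
    """Top-left corners of non-overlapping chip x chip tiles that fit entirely."""
    return [(top, left)
            for top in range(0, height - chip + 1, chip)
            for left in range(0, width - chip + 1, chip)]


def background_ratio(label_chip, background_id=0):
    """Fraction of pixels in a label tile that carry the background id."""
    pixels = [value for row in label_chip for value in row]
    return sum(1 for value in pixels if value == background_id) / len(pixels)


def keep_chip(label_chip, max_background=0.5, background_id=0):
    """Keep a tile only if background does not dominate it (assumed threshold)."""
    return background_ratio(label_chip, background_id) <= max_background


# A 650 x 900 scene yields a 2 x 3 grid of complete 300 x 300 tiles.
origins = chip_origins(650, 900)

# Toy 2 x 2 label tiles: 0 is background, other ids are CSGI classes.
mostly_green = [[0, 1], [2, 0]]
mostly_background = [[0, 0], [0, 1]]
print(len(origins), keep_chip(mostly_green), keep_chip(mostly_background))
```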

Three-Stage Transfer Learning
MPViT is a multi-path variant of the ViT model, sharing similar characteristics. One challenge that cannot be avoided is the need for a substantial amount of data and extended training times to achieve the desired segmentation results. Additionally, CSGI extraction datasets are difficult and time-consuming to annotate, which makes it challenging to obtain a large number of labeled samples. In [41], to address the issue of small-sample data, a multi-stage transfer learning approach is employed. Building upon this, we have designed a three-stage transfer learning scheme to tackle training difficulties and data scarcity.
As shown in Figure 6, the three-stage transfer learning method includes three stages: pretraining, the first transfer learning, and the second transfer learning. Firstly, the pretraining stage is essential. Due to the specificity of MPViT, it needs extensive training with a sufficient amount of data and time to achieve satisfactory results. We conduct prolonged training on the ILSVRC2012 dataset and use the obtained weights as initial weights for subsequent transfer learning processes.
Next is the first transfer learning stage. Because of the transition from a classification task to a segmentation task, MPViT is replaced by MPViT-DeepLab in the following two transfer learning processes. In MPViT-DeepLab, the fine and coarse features extracted by MPViT are separately input into the encoder and decoder of DeepLabv3+. In this stage, we input images from the DroneDeploy Segmentation Dataset into MPViT-DeepLab to obtain segmentation results for BUILDING, CLUTTER, VEGETATION, WATER, GROUND, and CAR. This differs significantly from the pretraining classification task. Therefore, in the first transfer learning, we do not freeze any parameters.
Finally, in the second transfer learning stage, the network architecture is similar to that of the first transfer learning, but we choose to freeze all batch normalization (BN) layers in MPViT-DeepLab. We input the reannotated DroneDeploy Dataset and obtain segmentation results for Grass, Tree, Bush area, Lake, and Terrace greenery. In this task, the segmentation targets partially overlap with those of the first transfer learning. To fully leverage previous model information, improve segmentation accuracy, reduce the number of trainable parameters and the computational burden, and prevent vanishing or exploding gradients, we freeze all batch normalization layers in the MPViT-DeepLab network during the second transfer learning.
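What freezing a BN layer means in practice can be shown with a framework-free toy. The sketch below assumes that freezing holds both the running statistics and the affine parameters fixed; `TinyBatchNorm` is a hypothetical one-channel stand-in, simplified relative to library implementations (for example, it uses the biased variance for the running estimate).

```python
class TinyBatchNorm:
    """One-channel batch normalization with a freeze switch (illustrative only)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.running_mean, self.running_var = 0.0, 1.0
        self.gamma, self.beta = 1.0, 0.0  # affine parameters, fixed once frozen
        self.momentum, self.eps = momentum, eps
        self.frozen = False

    def forward(self, batch):
        if self.frozen:
            # Frozen: reuse the stored statistics and update nothing.
            mean, var = self.running_mean, self.running_var
        else:
            # Training: normalize with batch statistics and update the running ones.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            self.running_mean += self.momentum * (mean - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        scale = (var + self.eps) ** 0.5
        return [self.gamma * (x - mean) / scale + self.beta for x in batch]


bn = TinyBatchNorm()
bn.forward([1.0, 3.0])           # earlier stage: statistics adapt to the data
stats_before = (bn.running_mean, bn.running_var)

bn.frozen = True                 # later stage: statistics are held fixed
frozen_out = bn.forward([10.0, 20.0])
stats_after = (bn.running_mean, bn.running_var)
```

With only a few hundred training images and a batch size of 4, batch statistics are noisy, which is one plausible reason why keeping the statistics learned on the larger dataset can help.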

MPViT-DeepLab
CSGI exhibits dense distribution and varying scales, making it challenging to achieve high precision using conventional image segmentation methods. While CNN structures are well-organized and do not rely on extensive data or long training times, they are unable to capture long-range dependencies. On the other hand, ViT exhibits end-to-end dependency but requires a significant amount of data and extensive training time to reach the target accuracy. In light of this, researchers have attempted to combine CNN and Transformer architectures to leverage the strengths of both. Zhang C et al. [42] propose a hybrid network architecture combining CNN and Transformer, using the Swin Transformer as the feature extractor and a U-shaped architecture for the encoder and decoder, for semantic segmentation of ultra-high-resolution remote sensing images. In [43], an MPViT-Unet architecture is used for medical image segmentation. Similarly, in [44], Wang W et al. employ Enhancing Multi-scale Representations With Transformer for the segmentation of remote sensing images, achieving promising results.
DeepLabv3+ has become the preferred segmentation model in many research and application domains due to its outstanding performance and robustness. Azad R et al. introduce TransDeepLab in [45], utilizing the Swin Transformer to extend DeepLabv3 and model the Atrous Spatial Pyramid Pooling (ASPP) module, marking the first use of a purely Transformer-based model to enhance the groundbreaking DeepLab model. Inspired by these works, we utilize MPViT as the backbone feature extraction network, feeding coarse features into the decoder of DeepLabv3+ and fine features into its encoder. This leads to the construction of a hybrid network architecture combining MPViT and DeepLabv3+, named MPViT-DeepLab.
Figure 7 provides an overview of the proposed MPViT-DeepLab network, where DeepLabv3+ serves as the backbone and MPViT functions as the feature extractor. Images first enter the MPViT network. In MPViT, the first Multi-Path Transformer block aggregates features using patches of size $\frac{H}{4} \times \frac{W}{4}$, and the last aggregation uses patches of size $\frac{H}{32} \times \frac{W}{32}$. Therefore, we take the aggregated coarse features after the first aggregation in MPViT as low-level features, which are then fed into the decoder. The low-level feature first goes through a 1 × 1 convolution to reduce the channel count, is then aggregated with the upsampled result from the encoder, is refined with a 3 × 3 convolution, and is finally bilinearly upsampled to output the segmentation result. Simultaneously, the aggregated fine features are passed into the encoder for atrous convolution.
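The data flow just described can be followed as a trace of spatial sizes. The sketch below assumes a low-level feature at stride 4, an encoder input at output stride 16, and two bilinear upsampling steps by a factor of 4; `trace_shapes` is a hypothetical helper that only does bookkeeping and runs no network.

```python
def trace_shapes(height, width, num_classes, low_stride=4, output_stride=16):
    """Spatial sizes along the MPViT-DeepLab path (bookkeeping only)."""
    if height % output_stride or width % output_stride:
        raise ValueError("height and width must be divisible by the output stride")
    low = (height // low_stride, width // low_stride)         # decoder input
    high = (height // output_stride, width // output_stride)  # encoder input
    factor = output_stride // low_stride                      # first upsampling
    encoder_up = (high[0] * factor, high[1] * factor)
    # Concatenation in the decoder needs matching spatial sizes.
    if encoder_up != low:
        raise ValueError("encoder and decoder features do not align")
    output = (low[0] * low_stride, low[1] * low_stride, num_classes)
    return {
        "low_level": low,
        "encoder_in": high,
        "encoder_upsampled": encoder_up,
        "output": output,
    }


# A 224 x 224 crop with six classes, as in the DroneDeploy stage.
shapes = trace_shapes(224, 224, num_classes=6)
print(shapes)
```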
The function of the atrous convolutions is to enable us to control the resolution of the convolution features and capture multi-scale information. For the input feature $X_i \in \mathbb{R}^{H_i \times W_i \times C_i}$, the output feature $Y_i \in \mathbb{R}^{H' \times W' \times C'}$ after the atrous convolution has dimensions $H' \times W' \times C'$, where $H'$ and $W'$ are calculated as follows for kernel size k, atrous rate r, stride s, and padding p:

$$H' = \left\lfloor \frac{H_i + 2p - r(k - 1) - 1}{s} \right\rfloor + 1, \qquad W' = \left\lfloor \frac{W_i + 2p - r(k - 1) - 1}{s} \right\rfloor + 1$$

The feature $Y_i$ after atrous convolution is calculated by Formula (6), in which k represents the convolution kernel size, r is the atrous rate, and $W_k$ represents the corresponding weights. For the MPViT network with an output stride of 16, we set the atrous rates for the three intermediate atrous convolutions to [6,12,18] to expand the receptive field as much as possible without losing information. We aggregate the features after atrous convolutions, reduce the channel count through a 1 × 1 convolution, and then upsample the features. We denote the output of the atrous convolutions in the encoder with atrous rates 6, 12, and 18 as $Y_i$ (i = 1, 2, 3), the result of feature pooling as P, the low-level feature input to the decoder as $X_{low}$, and the feature input to the encoder as X. Combining the F(·) functions and Equations (3) and (6) described in the previous sections, the entire process can be represented by Equations (7)-(9), in which B(·) represents bilinear upsampling by a factor of 4, $R_{decoder}$ represents the features in the decoder after a 1 × 1 convolution, $R_{encoder}$ represents the features output by the encoder after upsampling, and $R \in \mathbb{R}^{H \times W \times C}$ is the final pixel-wise segmentation result, where C represents the number of classes for segmentation.
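One plausible way to write the atrous convolution and the encoder-decoder process in symbols is sketched below. This is an assumption-laden reading of the prose, not the published formulas: the position index m, the per-branch rate $r_i$, the use of $F_{1 \times 1}$ and $F_{3 \times 3}$ for the 1 × 1 and 3 × 3 convolutions, and the order of the lines are all assumptions. Any additional encoder branch and the final classification layer are left out because the text does not describe them.

```latex
% Hypothetical reconstruction; symbols follow the surrounding text,
% but the exact published form of the equations is not known here.
\begin{align}
Y_i[m] &= \sum_{k} X[m + r_i \cdot k]\, W_k, \qquad r_i \in \{6, 12, 18\} \\
R_{\mathrm{decoder}} &= F_{1 \times 1}\left(X_{\mathrm{low}}\right) \\
R_{\mathrm{encoder}} &= B\left(F_{1 \times 1}\left(\mathrm{Concat}\left([Y_1, Y_2, Y_3, P]\right)\right)\right) \\
R &= B\left(F_{3 \times 3}\left(\mathrm{Concat}\left([R_{\mathrm{decoder}}, R_{\mathrm{encoder}}]\right)\right)\right)
\end{align}
```

Read this way, the two applications of B(·) account for the total upsampling factor of 16 that matches the output stride.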

Results and Discussion
We implement our method based on PyCharm [46] with Python 3.7.0, and all models are trained on a single NVIDIA Quadro RTX 5000 GPU.
We use DeepLabv3+ as the backbone network and MobileNet, ResNet101, Xception, and MPViT as the feature extraction networks (named Mob-D, Res-D, Xce-D, and MPViT-D). In [18], MPViT is categorized into Tiny, Xsmall, Small, and Base based on parameter sizes. In our experiment, we choose the Base version as the feature extractor, with layers set to [1,3,8,3], channels set to [128,224,368,480], and the path of MS-PatchEmbed set to [2,3,3,3]. We set the total number of iterations to 30,000, the learning rate to 0.01, the batch size to 4, and the output stride to 16. For MobileNet, ResNet101, and Xception, the crop size is set to 299, while for MPViT, it is set to 224.
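For reference, the settings listed above can be gathered into a single configuration object. This is only an organizational sketch: the class name `MPViTDeepLabConfig` and its field names are hypothetical, while the values are the ones stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MPViTDeepLabConfig:
    """Training settings for the MPViT (Base) feature extractor, as stated in the text."""

    layers: tuple = (1, 3, 8, 3)            # Transformer layers per stage
    channels: tuple = (128, 224, 368, 480)  # channel width per stage
    paths: tuple = (2, 3, 3, 3)             # MS-PatchEmbed paths per stage
    iterations: int = 30_000
    learning_rate: float = 0.01
    batch_size: int = 4
    output_stride: int = 16
    crop_size: int = 224

    def stages(self):
        """Per-stage (layers, channels, paths) triples."""
        return list(zip(self.layers, self.channels, self.paths))


config = MPViTDeepLabConfig()
for index, (depth, width, n_paths) in enumerate(config.stages(), start=1):
    print(f"stage {index}: {depth} layers, {width} channels, {n_paths} paths")
```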
To compare the segmentation performance of several networks, we use the mean intersection over union (MIOU) as the evaluation metric. The specific calculation formula is as follows:

$$\mathrm{MIOU} = \frac{1}{N} \sum_{c=1}^{N} \frac{TP_c}{TP_c + FP_c + FN_c}$$

where N is the number of classes and $TP_c$, $FP_c$, and $FN_c$ are the numbers of true positive, false positive, and false negative pixels for class c. From Table 1, we can see that the training time for each network varies due to differences in network complexity and the number of parameters. Due to its parallel network structure and large number of parameters, MPViT-DeepLab has the longest training time. In general, pretraining is effective in reducing training time and improving training accuracy. After pretraining, MPViT-DeepLab achieves an MIOU value of 54.7%, which is only 2.2% lower than the highest MIOU achieved by the ResNet101 network and outperforms the other two networks. This indicates that MPViT-DeepLab is suitable for remote sensing image segmentation tasks.
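The MIOU metric used in these comparisons averages the per-class ratio TP / (TP + FP + FN). A minimal sketch of the computation from a confusion matrix follows; the 2 × 2 matrix is made-up toy data, and skipping classes that never occur is an assumption rather than something the text specifies.

```python
def per_class_iou(confusion):
    """IoU per class from a square confusion matrix (rows: truth, columns: prediction)."""
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # truth c, predicted otherwise
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, truth otherwise
        denom = tp + fp + fn
        ious.append(tp / denom if denom else float("nan"))
    return ious


def mean_iou(confusion):
    """Mean of the per-class IoU values, ignoring classes that never occur."""
    valid = [v for v in per_class_iou(confusion) if v == v]  # drops NaN entries
    return sum(valid) / len(valid)


# Toy pixel counts for two classes.
confusion = [[8, 2],
             [1, 9]]
ious = per_class_iou(confusion)
miou = mean_iou(confusion)
```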
Similarly, using DeepLabv3+ as the backbone network, we then train the four networks on the reannotated DroneDeploy Dataset, starting from their respective models pretrained on ImageNet [38]. We record the intersection over union (IOU) for each combination for the classes Bush area, Grass, Lake, Terrace greenery, and Tree as well as the overall MIOU for CSGI. The specific values are shown in Table 2.
Analyzing Table 2, MPViT-DeepLab achieves an MIOU of 85.4%, which is 0.4% higher than the Xception and DeepLabv3+ combination. Compared to other combinations, MPViT-DeepLab achieves state-of-the-art results in CSGI extraction. We also observe that MPViT-DeepLab performs well in the segmentation of green infrastructure categories, such as Grass (+0.6%), Terrace greenery (+1.7%), and Tree (+0.1%). The Terrace greenery segmentation task requires the network to consider the relationship between buildings, greenery, and the environment, and MPViT-DeepLab effectively balances global and local features, resulting in excellent segmentation performance.
To validate the effectiveness of the three-stage transfer learning method and determine which parameters to freeze during the second transfer learning process, we utilize the MPViT-DeepLab parameters trained on the DroneDeploy Segmentation Dataset as the initial weights for training. We conduct comparative experiments under different scenarios, including no freezing (named MPViT-D-T), freezing only the BN layers in MPViT (named MPViT-D-FM), freezing only the BN layers in DeepLabv3+ (named MPViT-D-FD), and freezing all BN layers in MPViT-DeepLab (named MPViT-D-FA). The experimental results are presented in Table 3. The first row of Table 3 corresponds to the last row of Table 2 for reference and comparison. Among all the freezing strategies during the second transfer learning process, the MIOU achieved by freezing all BN layers in MPViT-DeepLab is 85.9%, which is 0.5% higher than the MIOU obtained with only the first transfer learning (MPViT-D) and 1.3% higher than the MIOU without freezing any BN layers. Additionally, the strategy of freezing all BN layers in MPViT-DeepLab significantly outperforms the other three freezing methods in the segmentation of Terrace greenery (+2.6%) and Tree (+0.7%), two classes of green infrastructure. The scenario without freezing any BN layers performs poorly in the segmentation of Terrace greenery, suggesting that, while this approach may have an advantage in simple scenarios due to retaining the original model information, it is not proficient in complex segmentation tasks.
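The four freezing scenarios differ only in which BN layers are selected. The sketch below expresses that selection over a flat listing of layers; the `mpvit.` and `deeplab.` name prefixes and the layer names themselves are hypothetical, since the text does not describe how layers are named.

```python
# Name prefixes of the sub-networks whose BN layers are frozen under each strategy.
STRATEGY_PREFIXES = {
    "MPViT-D-T": (),                       # no freezing
    "MPViT-D-FM": ("mpvit.",),             # BN layers inside MPViT only
    "MPViT-D-FD": ("deeplab.",),           # BN layers inside DeepLabv3+ only
    "MPViT-D-FA": ("mpvit.", "deeplab."),  # all BN layers
}


def bn_layers_to_freeze(layers, strategy):
    """Names of the BN layers that a strategy freezes.

    `layers` maps a layer name to its kind, for example "bn" or "conv".
    """
    prefixes = STRATEGY_PREFIXES[strategy]
    return sorted(name for name, kind in layers.items()
                  if kind == "bn" and name.startswith(prefixes))


# Hypothetical layer listing for illustration.
layers = {
    "mpvit.stem.bn": "bn",
    "mpvit.stage1.conv": "conv",
    "deeplab.aspp.bn": "bn",
    "deeplab.decoder.bn": "bn",
}
selection = {name: bn_layers_to_freeze(layers, name) for name in STRATEGY_PREFIXES}
```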
Additionally, we analyze the loss reduction in each scenario in Table 3, as illustrated in Figure 8. According to Figure 8, compared to the approach without the second transfer learning, the gradient of the loss reduction is larger, and the convergence requires fewer epochs, regardless of which layers are frozen. It is evident that adopting the second transfer learning not only improves the accuracy of CSGI extraction but also accelerates the network's convergence speed.

Conclusions
In this article, we propose a three-stage transfer learning method to address the challenges of training MPViT and other ViT series networks, which typically require extensive data and time. Additionally, we introduce a hybrid network architecture, MPViT-DeepLab, which exhibits superior integration capabilities for both global and local features compared to traditional neural networks. MPViT-DeepLab achieves state-of-the-art results in the extraction of CSGI.
While MPViT-DeepLab outperforms traditional networks in accuracy, its parallel training structure implies increased computational overhead, and the complexity of the network leads to a higher number of parameters. CSGI extraction holds significant importance in urban planning, and despite the enhanced accuracy of MPViT-DeepLab, future research will focus on simplifying the network to achieve better segmentation results with a more lightweight structure.

Figure 1. An overview of MPViT network structure for image classification.

Multi-Scale PatchEmbed Block
The Multi-Scale PatchEmbed block's specific content is illustrated in Figure 2. It designs a function $F_{k \times k}(\cdot)$, which employs convolution with overlapping patches. $F_{k \times k}(\cdot)$ represents a 2D convolution operation with a kernel size of k × k, stride s, and padding p.

Figure 2. Specific architecture of the Multi-Scale PatchEmbed block in MPViT. By adjusting the sizes of k, s, and p, we can determine whether to generate features of the same size or reduce spatial resolution. When s = 1 and p = (k − 1)/2, the output keeps the same spatial size as the input.

Figure 3. Specific architecture of the multi-path Transformer block in MPViT.

Figure 4. Specific network architecture of encoder and decoder of DeepLabv3+.

Figure 5. A few samples of the reannotated DroneDeploy Dataset, where class labels corresponding to colors are shown on the right.

Figure 6. An overview of the three-stage MPViT-DeepLab transfer learning, which includes three stages: pretraining, the first transfer learning, and the second transfer learning.

Figure 8. Loss reduction of MPViT-DeepLab networks with and without transfer learning.

Figure 9. Some CSGI extraction results of networks in Tables 2 and 3.

In the MIOU formula, TP means true positive, FP means false positive, and FN means false negative. Firstly, we choose DeepLabv3+ as the backbone network and train MobileNet, ResNet101, Xception, and MPViT on the DroneDeploy Segmentation Dataset with and without pretraining. We use image chips of sizes 512 × 512, 300 × 300, and 224 × 224 from the DroneDeploy Segmentation Dataset and find that the 300 × 300 chip size results in the highest training accuracy. Subsequently, for all following experiments, we select remote sensing images with a crop size of 300 × 300 for training. The training times and MIOU values are shown in Table 1.

Table 1. Training time (hours) and MIOU (%) of MobileNet, ResNet101, and Xception with DeepLabv3+ and MPViT-DeepLab with and without pretraining on the DroneDeploy Segmentation Dataset.

Table 3. Class segmentation IOU (%) and image segmentation MIOU (%) of no freezing, freezing only the BN layers in MPViT, freezing only the BN layers in DeepLabv3+, and freezing all BN layers in MPViT-DeepLab on the reannotated DroneDeploy Dataset.