Urban Green Plastic Cover Mapping Based on VHR Remote Sensing Images and a Deep Semi-Supervised Learning Framework

With the rapid process of both urban sprawl and urban renewal, large numbers of old buildings have been demolished in China, leading to widespread construction sites, which could cause severe dust contamination. To alleviate the accompanying dust pollution, green plastic mulch has been widely used by local governments of China. Therefore, timely and accurate mapping of urban green plastic covered regions is of great significance to both urban environmental management and the understanding of urban growth status. However, the complex spatial patterns of the urban landscape make it challenging to accurately identify these areas of green plastic cover. To tackle this issue, we propose a deep semi-supervised learning framework for green plastic cover mapping using very high resolution (VHR) remote sensing imagery. Specifically, a multi-scale deformable convolutional neural network (CNN) was exploited to learn representative and discriminative features under complex urban landscapes. Afterwards, a semi-supervised learning strategy was proposed to integrate the limited labeled data and massive unlabeled data for model co-training. Experimental results indicate that the proposed method could accurately identify green plastic-covered regions in Jinan with an overall accuracy (OA) of 91.63%. An ablation study indicated that, compared with supervised learning, the semi-supervised learning strategy in this study could increase the OA by 6.38%. Moreover, the multi-scale deformable CNN outperforms several classic CNN models in the computer vision field. The proposed method is the first attempt to map urban green plastic-covered regions based on deep learning, which could serve as a baseline and useful reference for future research.


Introduction
Nowadays, urban renewal has been widely performed around the globe, which could effectively relieve the shortage of urban land resources and improve urban land use efficiency [1][2][3]. For instance, urban renewal in China has led to a large-scale demolition of old, low-density urban areas and urban villages over the past few decades [2]. During the renewal process, construction sites can be a source of huge amounts of dust, which could easily be transferred to the air and water nearby, leading to severe environmental pollution.
To alleviate the accompanying dust contamination, plastic mulch has been widely utilized by local governments in China (Figure 1). Moreover, the plastic mulch is always green, making it appear environmentally friendly. Actually, green plastic mulch is commonly made from polyethylene. Most urban renewal projects in China use the same green plastic mulch to alleviate dust contamination. After the construction process, the plastic mulch can be recycled at relevant chemical plants. Due to the stringent environmental protection regulations in China, green plastic mulch has become a must in urban renewal projects, offering an opportunity to accurately identify construction sites during urban sprawl and renewal. Therefore, it is of great significance to monitor and detect these green plastic covers (GPC), which could provide the spatial distribution of construction sites. Moreover, the detection of GPC could also help the environmental protection department with the precise control of construction dust. However, as far as we know, there is still no report on GPC detection in the remote sensing field; therefore, we are highly motivated to propose an accurate classification method for GPC based on deep learning (DL) from VHR remotely sensed imagery.
The accurate classification of GPC is challenging for the following reasons. Firstly, the complex urban landscapes lead to a high variability of the spatial patterns of GPC. Secondly, the limited labeled data of GPC could lead to overfitting of the deep learning-based classification model. To tackle these issues, we first exploited a multi-scale deformable CNN to account for the scale and shape variability of GPC. Afterwards, we integrated unlabeled GPC samples with labeled data into a semi-supervised learning framework to increase the model's generalization capability.
Actually, urban green plastic cover could be viewed as a specific urban land cover category. Due to its synoptic view and cost-effectiveness, remote sensing has been widely utilized for urban land use and land cover (LULC) mapping [4][5][6]. Traditional methods mainly focused on the visual inspection and vectorization from VHR remotely sensed imagery. However, this is both time- and labor-intensive. Therefore, how to develop an automatic urban LULC classification method has become a hot research topic [7][8][9]. Early studies [10][11][12][13][14][15] mainly combined hand-crafted features (e.g., spectral indices, texture features) with machine learning classifiers to automatically extract a specific urban LULC type. For example, Shao et al. [10] performed the extraction of urban impervious surface based on random forest (RF) from GaoFen-1 and Sentinel-1A imagery. Yin et al. [11] applied both sub-pixel and super-pixel based methods for characterizing urban green space in Haidian District, Beijing. In our previous studies, we also adopted random forest and texture analysis for urban vegetation mapping [12] and urban inundated region extraction [13] from unmanned aerial vehicle (UAV) remote sensing data.
Meanwhile, there are still no relevant studies on urban green plastic cover mapping from remotely sensed data. Similar research mainly consists of the detection of construction sites and urban landfill. Yu et al. [16] proposed an unsupervised learning method for the classification of buildings under construction from multi-temporal UAV data. Silvestri et al. [17] utilized maximum likelihood classifier (MLC) and IKONOS images to recognize the uncontrolled urban landfills. Considering that no published studies focus on green plastic cover classification, this paper could be the first attempt to solve this important and challenging issue.
It should be noted that the aforementioned studies mainly rely on hand-crafted features and machine learning approaches for urban LULC classification. However, the design of hand-crafted features relies heavily on domain expertise, which might lead to an inability to discover high-level and discriminative features from remote sensing images. On the other hand, deep learning has a strong ability to extract representative multi-level features from original data instead of empirical feature design and can work in an end-to-end manner, which has led to impressive performance in the computer vision field [18][19][20][21][22], such as in image classification [18], object detection [19], and semantic segmentation [22]. More recently, deep learning, especially deep CNN, has also been successfully applied in numerous remote sensing applications [23][24][25][26][27][28][29]. For instance, Huang et al. [23] proposed a semi-transfer deep CNN for urban land use mapping, based on VHR WorldView-2 imagery, and achieved an accuracy of 91.25%. Zhang et al. [24] proposed an object-based CNN for urban land use classification and achieved excellent classification accuracy and computational efficiency. Dong et al. [25] exploited a hybrid approach of random forest and CNN for subtropical forest mapping, and their results indicated that the developed model could lead to an improvement in information extraction. In our previous studies [30], we modified a two-branch CNN for urban land use mapping and found that the proposed CNN model outperforms traditional machine learning algorithms such as MLC, RF, and support vector machine (SVM). Moreover, we extended the above model to a multi-branch version for the fusion of multi-sensor and multi-temporal Sentinel-1/2 imagery [31]. All of the above studies demonstrated that CNN could provide an effective tool for remote sensing image classification.
Therefore, in this study, we exploited a novel multi-scale deformable CNN to learn high-level and representative features for green plastic cover classification.
There is no denying that great improvements have been made in urban LULC mapping from remote sensing images through deep learning. However, deep learning works in an exhaustive data-driven manner, and a large number of labeled samples need to be fed into a DL model to avoid overfitting. Meanwhile, it should be noted that labeling an enormous number of training samples is both labor-intensive and time-consuming, especially in the remote sensing and geoscience fields. Therefore, how to integrate the limited labeled samples with massive unlabeled data to improve the model's generalization capability is a key question. Semi-supervised learning provides an effective tool to tackle this issue. He et al. [32] proposed generative adversarial network (GAN)-based, semi-supervised learning to classify hyperspectral images (HSI), in which the unlabeled samples were from the GAN's generator. Fang et al. [33] also utilized a semi-supervised learning strategy based on several sample selection methods for HSI classification. Inspired by these studies, we also introduced a semi-supervised learning framework for the classification of urban green plastic covers based on limited well-annotated samples.
To sum up, the contributions of this study are as follows:
(1) For the first time, we developed a deep learning method for urban green plastic cover mapping from VHR remote sensing data, which could provide an effective tool for construction site monitoring and environmental protection.
(2) We exploited a multi-scale deformable CNN to tackle the variability of land objects' scales and shapes under complex urban landscapes.
(3) We integrated the limited labeled samples with massive unlabeled data into a semi-supervised learning framework to increase the generalization capability of the classification model for green plastic covers.

ISPRS Int. J. Geo-Inf. 2020, 9, x FOR PEER REVIEW

Study Area
The study area (Figure 2) is the urban built-up region of Jinan City, which is the provincial capital of Shandong Province, China. It includes parts of Licheng District, Lixia District, Tianqiao District, Huaiyin District, Shizhong District, and Changqing District, with an approximate area of 1015 km². Jinan City lies in the midwest of Shandong Province, on the eastern edge of the North China Plain. It is characterized by a temperate, semi-humid continental monsoon climate with an annual average temperature of 13.8 °C, an average frost-free period of 178 days, and an annual average rainfall of approximately 685 mm. Recently, Jinan has witnessed rapid urban sprawl and renewal. Numerous villages on the fringe of urban areas have been demolished, and some old buildings in the urban areas have been reconstructed. Most of these renewal regions are covered by green plastic mulch.

Dataset
Considering the widespread usage and data availability, the remotely sensed data from the Google Earth (GE) platform [34] were adopted. Specifically, the image was from the GE history database (obtained in 2019) and had a spatial resolution of about 1.19 m/pixel. Actually, the corresponding remote sensing imagery was mainly provided by Maxar (namely, DigitalGlobe company, Westminster, CO, USA). The optical sensors included WorldView-2, WorldView-3, and WorldView-4. Although the WorldView series could provide multi-spectral observations, the data provided by the Google Earth platform have only three bands (namely, red, green, and blue, RGB). Moreover, the Google Earth platform only provides data at an 8-bit radiometric resolution.
The size of the image was 35,976 × 63,055 pixels, corresponding to about 43 × 75 km² (Figure 2). The classification scheme in this study included two types: green plastic cover (GPC) and non-GPC. Both the training and testing samples are image patches with a size of 224 × 224 pixels. This size has been a standard image patch size in the computer vision (CV) field, where the popular convolutional neural networks (e.g., ResNet, DenseNet) take a 224 × 224 image patch and output a predicted label. Therefore, to be comparable with these CV models, we also used this setting in this study. Furthermore, as the spatial resolution is about 1.2 m/pixel, the 224 × 224 image patch corresponds to about 268 × 268 m². In this context, the image patch covers a scene that is neither too big nor too small for the task of plastic-covered region detection. Figure 3 illustrates several samples of each land cover type.
In order to describe the material composition of GPC in detail, we downloaded Sentinel-2 L2A data acquired on 28 August 2019 from the European Space Agency (ESA) and delineated the spectral reflectance signature of GPC (Figure 4) using bands 2-8 (Visible/Near Infrared), band 8a (Near Infrared), and bands 11-12 (Shortwave Infrared). The results indicated that the spectral reflectance signature of green plastic cover is similar to that of built-up or bare land, which leads to spectral confusion in image classification (Figure 4), especially for RGB images with only three bands, as in our experiment.
Figure 5 illustrates the overview of the proposed method for green plastic cover mapping. The input is an image patch with 224 rows and 224 columns, and the final result is a predicted land cover class. More specifically, the proposed method consists of two components: (1) feature extraction based on a deep CNN; and (2) semi-supervised learning that integrates both labeled and unlabeled data. As for the former, we exploited a multi-scale deformable CNN to learn representative spatial features under complex urban landscapes. For the latter, the trained CNN was first utilized to endow the unlabeled data with a pseudo label. Afterwards, the most confident data were selected through top-k ranking and added to the training set to retrain the CNN model.
Figure 6 and Table 1 show the detailed structure of the multi-scale deformable CNN for deep feature representation. Specifically, it includes several convolutional layers, max pooling layers, and deformable multi-scale residual blocks. Meanwhile, to obtain the final classification result, a global average pooling (GAP) layer, a fully connected (FC) layer, and a Softmax layer were cascaded.
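To make the classification head concrete, the following minimal NumPy sketch chains global average pooling, a fully connected layer, and softmax for a single feature map. The feature-map shape (7 × 7 × 64), the random weights, and the function name `classification_head` are assumptions made for this illustration only; they are not taken from Table 1 or from the trained model.

```python
import numpy as np

def classification_head(feature_map, weights, bias):
    """GAP -> FC -> softmax for one feature map of shape (H, W, C)."""
    pooled = feature_map.mean(axis=(0, 1))   # global average pooling, shape (C,)
    logits = pooled @ weights + bias         # fully connected layer, shape (num_classes,)
    shifted = logits - logits.max()          # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()                   # softmax probabilities

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(7, 7, 64))      # hypothetical output of the last block
weights = rng.normal(scale=0.1, size=(64, 2))  # two classes: GPC and non-GPC
bias = np.zeros(2)
probs = classification_head(feature_map, weights, bias)
```

The two entries of `probs` sum to one and can be read as the posterior probabilities of the two classes.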

Multi-Scale Deformable CNN for Feature Representation
In this study, both deformable convolutions and multi-scale residual blocks were introduced into the deep CNN model for better feature representation. Through deformable convolution, the receptive field and sampling locations were trained to be adaptive to the shapes and scales of land objects, which was beneficial for extracting highly discriminative features. Meanwhile, a multi-scale residual block could extract hierarchical, multi-scale features and improve gradient flow at the same time. In addition, the integration of deformable convolutions into the multi-scale residual block could combine the merits of both modules, increasing the feature adaptability to the complex spatial patterns of urban landscapes. Figure 7 illustrates the detailed parameters of deformable multi-scale residual blocks. Moreover, in our previous study [31], the multi-scale deformable CNN was proposed for spatial feature learning in a coastal wetland landscape, and showed good performance. Therefore, we also introduced it in this study when considering the spatial heterogeneity of complex urban scenarios. More details of the above model can be found in [31].

Samples Selection for Semi-Supervised Learning
The data-driven nature of deep learning calls for a massive number of high-quality labeled samples to maintain the model's generalization capability. However, in the field of remote sensing and geoscience, manually labeling sufficient samples is infeasible due to both the high labor intensity and the low efficiency. Semi-supervised learning, on the other hand, aims to learn from both labeled and unlabeled data, providing a favorable strategy to address the insufficient training data issue, and can achieve satisfactory accuracy with the mining of a massive number of unlabeled samples. Therefore, we resorted to deep semi-supervised learning and proposed a two-step strategy to select the most confident unlabeled samples for model retraining.
Before the description of the two-step strategy for unlabeled samples selection, we first introduce the details of the labeled data. To begin with, we annotated 700 samples for each category, including both GPC and non-GPC, to construct the initial labeled pool. The labeled samples were randomly divided into two parts: 300 for the training set and 400 for the testing set. Meanwhile, 90% of the training set was employed to train the CNN, while the remaining 10% were used as a validation set to evaluate the performance during training.
The proposed two-step strategy for semi-supervised learning was as follows. In the first step, the trained CNN was used to predict samples from the unlabeled pool to derive the posterior probability. Only the unlabeled samples with a probability exceeding 0.5 would be selected and assigned with a predicted category (namely, pseudo-labeled samples). However, these pseudo-labeled samples may be unreliable. If we directly added all these samples into the labeled pool to retrain the CNN model, the performance would not always increase due to additional noise.
To ensure the reliability of the pseudo-labeled samples, we introduced a second step for unlabeled data selection. We calculated the similarities between each pseudo-labeled sample and all labeled samples, which are measured by the Euclidean distance:

$$s(u_i, l_j) = \lVert f(u_i) - f(l_j) \rVert_2$$

where $u_i$ and $l_j$ denote the i-th unlabeled and j-th labeled sample, respectively; $s(\cdot)$ represents the similarity metric; and $f(\cdot)$ stands for the deep feature expression. Afterwards, we sorted the labeled pool in descending order of similarity, that is, from the nearest labeled sample to the farthest. If the top-k training samples have the same category as the pseudo-labeled sample, then this pseudo-labeled sample was regarded as reliable and could be added to the labeled pool for CNN retraining [29]. In addition, we analyzed the impact of value k in top-k on GPC classification; the results are shown in Section 4.4.
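The two-step selection can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name `select_pseudo_labels`, the toy two-dimensional "deep features", and the value k = 3 are hypothetical, and the similarity is taken to be the Euclidean distance between feature vectors, with the k nearest labeled samples treated as the top-k.

```python
import numpy as np

def select_pseudo_labels(probs, unl_feats, lab_feats, lab_y, k=5, threshold=0.5):
    """Return indices and pseudo labels of unlabeled samples that pass both steps."""
    pseudo = probs.argmax(axis=1)             # step 1: predicted category
    confident = probs.max(axis=1) > threshold # step 1: posterior probability filter
    keep_idx, keep_lab = [], []
    for i in np.flatnonzero(confident):
        # step 2: Euclidean distance from this sample to every labeled sample
        dist = np.linalg.norm(lab_feats - unl_feats[i], axis=1)
        nearest = np.argsort(dist)[:k]        # k most similar labeled samples
        if np.all(lab_y[nearest] == pseudo[i]):
            keep_idx.append(int(i))
            keep_lab.append(int(pseudo[i]))
    return np.array(keep_idx, dtype=int), np.array(keep_lab, dtype=int)

# Toy labeled pool: class 0 near (0, 0), class 1 near (10, 10).
lab_feats = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
lab_y = np.array([0, 0, 0, 1, 1, 1])
# Four unlabeled samples with hypothetical CNN posteriors.
unl_feats = np.array([[0.5, 0.5], [10.5, 10.5], [5.0, 5.2], [0.2, 0.1]])
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])

idx, labels = select_pseudo_labels(probs, unl_feats, lab_feats, lab_y, k=3)
```

In this toy case the first two samples are kept. The third lies between the clusters, so its three nearest labeled neighbours disagree, and the fourth is predicted as class 1 although it sits inside the class 0 cluster; both are rejected by the second step.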

Details of Network Training
Although the number of training samples could be increased by means of semi-supervised learning, we still adopted the data augmentation technique to further boost the generalization capability and decrease the risk of overfitting. Specifically, all the initial labeled samples were rotated by 90°, 180°, or 270° and flipped up and down.
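A minimal NumPy sketch of this augmentation is given below. It assumes that the up-down flip is applied to the original patch only (the text does not say whether rotated copies are also flipped), and the helper name `augment` is hypothetical.

```python
import numpy as np

def augment(patch):
    """Return the original patch plus its rotated and flipped copies."""
    rotations = [np.rot90(patch, k) for k in (1, 2, 3)]  # 90, 180, 270 degrees
    flipped = np.flipud(patch)                           # up-down flip
    return [patch, *rotations, flipped]

rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)  # random RGB patch
augmented = augment(patch)
```

Because the patches are square, every augmented copy keeps the 224 × 224 × 3 shape and can be fed to the network unchanged.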
All the weights of the proposed CNN model were initialized with He initialization [35], and all biases were initially set to 0. For optimizing weights and biases to improve classification performance, an Adam optimizer [36] was used with an initial learning rate of $10^{-4}$. An early-stopping technique was adopted to select the best model. Cross-entropy loss [37] was adopted, whose expression is as follows:

$$L = -\sum_{i=1}^{N} y_i \log \hat{y}_i$$

where $L$ denotes cross-entropy loss; $\hat{y}_i$ stands for the probability predicted by the model; $y_i$ denotes the ground truth; and $N$ refers to the number of classes.
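As a worked example, the standard per-sample cross-entropy, L = -Σ y_i log ŷ_i summed over the N classes, can be evaluated with a few lines of NumPy. The function name `cross_entropy`, the clipping constant `eps`, and the example probabilities are assumptions made for illustration.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted class probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return float(-np.sum(y_true * np.log(y_pred)))

# A GPC sample (one-hot [1, 0]) predicted as GPC with probability 0.9.
loss = cross_entropy(np.array([1.0, 0.0]), np.array([0.9, 0.1]))
# loss equals -ln(0.9), roughly 0.105
```

Only the term of the true class contributes, so the loss shrinks towards zero as the predicted probability of the correct class approaches one.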
The training procedure included the following steps:
(1) Firstly, the backbone, i.e., the multi-scale deformable CNN, was trained using only the initial labeled data.
(2) Next, the backbone was utilized to predict the unlabeled datasets, and only the samples that passed the two-step selection process were added to the labeled pool with pseudo labels.
(3) The backbone was retrained with samples from the new labeled pool.
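The control flow of these three steps can be sketched with a deliberately simple stand-in model. In the sketch below, a nearest-centroid classifier on synthetic two-dimensional features replaces the multi-scale deformable CNN; all names (`fit_centroids`, `predict_proba`, `passes_top_k`), the synthetic data, and k = 5 are assumptions for illustration and say nothing about the actual network or its accuracy.

```python
import numpy as np

def fit_centroids(x, y):
    """Stand-in 'training': one centroid per class (0 and 1)."""
    return np.stack([x[y == c].mean(axis=0) for c in (0, 1)])

def predict_proba(centroids, x):
    """Softmax over negative distances to the class centroids."""
    dist = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    logits = -dist
    logits -= logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def passes_top_k(sample, label, lab_x, lab_y, k):
    """True if the k nearest labeled samples all share the pseudo label."""
    nearest = np.argsort(np.linalg.norm(lab_x - sample, axis=1))[:k]
    return bool(np.all(lab_y[nearest] == label))

rng = np.random.default_rng(0)
lab_x = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])
lab_y = np.repeat([0, 1], 20)
unl_x = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

# (1) Train on the initial labeled data only.
model = fit_centroids(lab_x, lab_y)

# (2) Predict the unlabeled pool and keep samples that pass both selection steps.
probs = predict_proba(model, unl_x)
pseudo = probs.argmax(axis=1)
keep = np.array(
    [i for i in range(len(unl_x))
     if probs[i].max() > 0.5 and passes_top_k(unl_x[i], pseudo[i], lab_x, lab_y, k=5)],
    dtype=int,
)
new_x = np.vstack([lab_x, unl_x[keep]])
new_y = np.concatenate([lab_y, pseudo[keep]])

# (3) Retrain on the enlarged labeled pool.
model = fit_centroids(new_x, new_y)
```

With a real CNN, the distance in `passes_top_k` would be computed on the deep features rather than on raw inputs, and steps (2) and (3) could be repeated for several rounds if desired.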
In addition, the deep learning library used was TensorFlow [38]. The entire semi-supervised learning framework was conducted on the Ubuntu 18.04 operating system with Intel Xeon(R) Gold 5118 CPU and NVIDIA TITAN V with 12 GB memory.

Accuracy Assessments
After the classification model was trained, a total of 400 testing samples were utilized to calculate the overall accuracy and confusion matrix. The following metrics were also calculated: producer's accuracy (PA), user's accuracy (UA), and the Kappa coefficient. Meanwhile, visual evaluation was also involved to check for obvious classification errors. In general, visual inspection is a subjective evaluation method that determines whether the classification result is good or not through comparing the green plastic cover mapping results with high-resolution images from Google Earth. Since the green plastic mulch could be identified by eye on Google Earth, we used the visually interpreted images as the "gold standard." Moreover, we conducted field surveys in several places in Jinan to make sure that the interpreted images were correct.
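For reference, these metrics can be derived from a confusion matrix as in the sketch below. The matrix layout (rows are reference classes, columns are predicted classes), the function name `accuracy_metrics`, and the counts are assumptions; the counts are made-up numbers and are not the values of Table 2.

```python
import numpy as np

def accuracy_metrics(cm):
    """OA, PA, UA and Kappa from a confusion matrix (rows: reference, cols: predicted)."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    oa = np.trace(cm) / total                                # overall accuracy
    pa = np.diag(cm) / cm.sum(axis=1)                        # producer's accuracy per class
    ua = np.diag(cm) / cm.sum(axis=0)                        # user's accuracy per class
    pe = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / total**2  # chance agreement
    kappa = (oa - pe) / (1.0 - pe)
    return oa, pa, ua, kappa

cm = [[90, 10],   # hypothetical counts: reference GPC
      [20, 80]]   # hypothetical counts: reference non-GPC
oa, pa, ua, kappa = accuracy_metrics(cm)
# oa = 0.85, pa = [0.90, 0.80], ua = [90/110, 80/90], kappa = 0.70
```

When both classes have the same number of reference samples in a two-class problem, the chance agreement is 0.5 and Kappa reduces to 2 × OA − 1.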
We also conducted an ablation study to justify the performance of the semi-supervised learning strategy. Furthermore, a comparison with several commonly used CNN models in the computer vision field was performed to evaluate the effectiveness of the multi-scale deformable CNN in this paper.

Classification Results of GPC
After the semi-supervised learning procedure, the best trained model was utilized to classify the entire VHR remote sensing image. A sliding window of 224 × 224 pixels was adopted for green plastic cover prediction. Figure 8 displays the spatial distribution of the GPC prediction results. It can be observed that the green plastic covered regions were mainly located in the eastern part of Jinan, indicating that Jinan has been experiencing urban renewal towards the east. The above remote sensing monitoring results are in accordance with Jinan's urban planning, which verifies the effectiveness of the proposed method in discovering key information on urban renewal.
Figure 8 also illustrates several parts of the original remote sensing imagery and the GPC prediction results. From the sub-regions, it can be seen that the urban landscape is rather complex, with a high spatial heterogeneity. Therefore, the accurate detection of GPC is a challenging task. However, careful visual inspection indicates that the extraction results of green plastic-covered areas have good consistency with the ground truth, justifying the robustness of our proposed method.
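The sliding-window prediction can be sketched as follows. The stride is not stated in the text, so the sketch assumes non-overlapping windows (stride equal to the window size) and ignores border pixels that do not fill a whole window; `sliding_window_predict` and the colour-based `greenish` rule are hypothetical stand-ins, the latter replacing the trained CNN.

```python
import numpy as np

def sliding_window_predict(image, classify_patch, size=224, stride=224):
    """Classify each window of an (H, W, 3) image and return a grid of labels."""
    rows = (image.shape[0] - size) // stride + 1
    cols = (image.shape[1] - size) // stride + 1
    label_map = np.zeros((rows, cols), dtype=np.uint8)
    for r in range(rows):
        for c in range(cols):
            top, left = r * stride, c * stride
            patch = image[top:top + size, left:left + size]
            label_map[r, c] = classify_patch(patch)
    return label_map

def greenish(patch):
    """Toy rule standing in for the CNN: 1 if green dominates the patch mean."""
    mean = patch.reshape(-1, 3).mean(axis=0)
    return int(mean[1] > mean[0] and mean[1] > mean[2])

image = np.zeros((448, 672, 3), dtype=np.uint8)  # synthetic 2 x 3 grid of patches
image[:224, 224:448, 1] = 200                    # one green patch
label_map = sliding_window_predict(image, greenish)
```

Each cell of `label_map` corresponds to one 224 × 224 window, which is why a patch-based map has blocky boundaries compared with a pixel-wise segmentation.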

Accuracy Assessment Results
Section 4.1 evaluates the classification results qualitatively through visual inspection. To assess the performance quantitatively, this section uses a confusion matrix calculated from the testing set to evaluate the accuracy of urban green plastic cover mapping. The number of testing samples is 400 for each class. Table 2 lists the accuracy assessment results.
Table 2 indicates that the overall accuracy reached 91.63% and the Kappa index reached 0.8325, showing that the proposed method performed well in urban green plastic cover mapping from VHR remote sensing data. However, since we treated GPC identification as a remote sensing scene classification task, the patch-based classification and sliding window strategy result in a serrated boundary, which introduces extra errors when calculating the total area of GPC. To address this issue, we plan to exploit semantic segmentation methods such as UNet [39] and the DeepLab series [40] in future studies to retrieve the exact boundaries of GPC. It should be noted, however, that semantic segmentation methods require the GPC to be vectorized during training data preparation, which calls for more labor than our proposed method. The proposed method could therefore be viewed as a fast, cost-effective, yet still reliable way to detect GPC, especially when considering the compromise between workload and accuracy.
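As a consistency check on the figures above, the Kappa index can be recomputed from the OA. The sketch below assumes the standard definition of Cohen's Kappa; the symbols p_o (observed agreement, i.e., the OA), p_e (chance agreement), and q (the fraction of testing samples predicted as GPC) are introduced here for illustration. With two classes and equal numbers of reference samples (400 each), p_e equals 0.5 whatever the predictions are.

```latex
\begin{aligned}
\kappa &= \frac{p_o - p_e}{1 - p_e}, \qquad
p_e = \tfrac{1}{2}\,q + \tfrac{1}{2}\,(1 - q) = \tfrac{1}{2}, \\
\kappa &= \frac{0.91625 - 0.5}{1 - 0.5} = 0.8325 .
\end{aligned}
```

An observed agreement of 0.91625 would correspond to 733 of the 800 testing samples being classified correctly, which rounds to the reported OA of 91.63%.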

Impact of Semi-Supervised Learning on GPC Classification
To assess the contribution of semi-supervised learning to GPC classification, we conducted an ablation study. Specifically, only the initial 270 labeled samples for each class (GPC and non-GPC) were used to train the classification model. The accuracy was evaluated using the same testing set as in Section 4.2. The resulting confusion matrix is given in Table 3.
Table 3 indicates that when only the limited labeled data are used, the classification performance is inferior to that of semi-supervised learning. The OA only reaches 85.25%, a decrease of 6.38 percentage points, while the Kappa index dropped from 0.8325 to 0.7050, a decrease of 0.1275. Therefore, the introduction of semi-supervised learning improves the classification performance. This is mainly due to the capability of semi-supervised learning to effectively mine the massive unlabeled data. The two-step selection strategy for pseudo-labeled data in this study helps ensure that only the most confident samples are added to the labeled pool.

Impact of k in Top-k on GPC Classification
In this section, we analyze the impact of k in top-k on GPC classification. Values of k from 45 to 270 with a step of 45 were considered. Because the number of GPC samples is much smaller than that of non-GPC samples in the unlabeled pool, we first selected M GPC samples and then selected the same number M of non-GPC samples. The accuracy assessment results are shown in Table 4.
Table 4 indicates that the number of candidate pseudo-labeled samples progressively decreased as k increased. This is understandable, since the higher the value of k, the higher the confidence threshold for these pseudo-labeled samples. When k is too high, no pseudo-labeled samples satisfy the selection strategy. Table 4 also indicates that the GPC classification accuracy is highest when k equals 90. This might reflect a compromise between the additional information gain and the introduced noise. When k is less than the optimal value (90 in this study), more pseudo-labeled samples are added to the labeled pool, but more noise is also introduced. Conversely, when k increases beyond the optimal value, both the number of pseudo-labeled samples and the accompanying information gain decrease, reducing the classification performance.
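The class-balancing step described above (select M GPC samples, then the same number of non-GPC samples) can be sketched as follows. This illustrates the balancing idea only: the candidate rule used here, a plain confidence threshold `min_confidence`, is a stand-in assumption, and the paper's two-step top-k criterion is defined outside this section and is not reproduced. The function and variable names are hypothetical.

```python
from typing import List, Tuple

# (sample_id, predicted_label, confidence); label 1 = GPC, 0 = non-GPC
Prediction = Tuple[str, int, float]


def select_balanced_pseudo_labels(predictions: List[Prediction],
                                  min_confidence: float) -> List[Tuple[str, int]]:
    """Pick the confident GPC samples, then an equal number of non-GPC samples."""
    # Stand-in candidate rule (assumption): keep predictions at or above a threshold.
    confident = [p for p in predictions if p[2] >= min_confidence]
    gpc = sorted((p for p in confident if p[1] == 1),
                 key=lambda p: p[2], reverse=True)
    non_gpc = sorted((p for p in confident if p[1] == 0),
                     key=lambda p: p[2], reverse=True)

    # GPC is the rarer class, so its count normally fixes M;
    # M is also capped by the number of confident non-GPC samples.
    m = min(len(gpc), len(non_gpc))
    selected = gpc[:m] + non_gpc[:m]
    return [(sample_id, label) for sample_id, label, _ in selected]


preds = [("a", 1, 0.99), ("b", 0, 0.98), ("c", 0, 0.97),
         ("d", 0, 0.96), ("e", 1, 0.60)]
print(select_balanced_pseudo_labels(preds, min_confidence=0.9))
```

Balancing the two classes in this way keeps the abundant non-GPC samples from dominating the enlarged labeled pool.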

Comparison with Classic CNN Models
To further evaluate the effectiveness of the proposed model, three classic CNN models from the computer vision field were adopted for comparison: VGG [41], ResNet [42], and DenseNet [43].
It should be noted that all of these models were trained using the same semi-supervised learning strategy and evaluated on the same testing set. The comparison results are listed in Table 5.
Table 5 indicates that the proposed CNN model (multi-scale deformable CNN) achieved the highest accuracy among the four deep learning models. More specifically, VGG had a lower OA (85.87%) than ResNet (86.88%) and DenseNet (89.62%). This is mainly because VGG builds its network architecture from a simple cascade of convolutional layers [41] and has difficulty in extracting highly representative features. ResNet, in contrast, mitigates the vanishing gradient problem during error back-propagation through residual learning and skip connections, which led to higher accuracy. DenseNet, whose architecture contains more skip connections for aggregating features, performed best among the three classic models. However, the multi-scale deformable CNN in this paper outperformed all of the classic CNN models. This could be because the proposed CNN adapts better to the shape and scale variations of complex urban landscapes.
Furthermore, we compared the above CNN models without the semi-supervised learning strategy, i.e., using only the initial limited labeled samples. The comparison results are given in Table 6. Similar to Table 5, Table 6 indicates that the proposed CNN model outperformed the other backbone networks, with an OA of 85.25% and a Kappa index of 0.7050. Therefore, the effectiveness of the proposed CNN in GPC classification was further verified under the condition of limited labeled samples.

Comparison with Sentinel-2 Data
Since the successful implementation of the European Copernicus program initiated by the European Space Agency (ESA), Sentinel-2 multi-spectral data have been open-access and free to the public, providing new insights for remote sensing applications such as coastal land cover classification [31], crop mapping [44], and urban area monitoring [45]. To further evaluate the proposed method, we used the proposed CNN to detect GPC from Sentinel-2 data. Specifically, the Sentinel-2 L2A data were acquired on 28 August 2019. A total of 10 bands were used in the experiment: bands 2-4 (10 m), bands 5-7 (20 m), band 8 (10 m), band 8A (20 m), and bands 11-12 (20 m). The bands with a 20 m spatial resolution were resampled to 10 m using the SNAP software developed by ESA. Since the GE image patches are 224 × 224 pixels with a spatial resolution of 1.19 m/pixel, the Sentinel-2 image patch was set to 27 × 27 pixels to maintain comparability. Moreover, the same training and testing datasets were used to train and evaluate the model. The accuracy comparison results are listed in Table 7.
Table 7 indicates that the proposed CNN could yield high performance for both Google Earth and Sentinel-2 data, with OAs of 91.63% and 90.87%, respectively. This further demonstrates that our proposed CNN model has a strong GPC identification ability with either Google Earth data or Sentinel-2 multi-spectral data as the network input. Figure 9 displays the spatial distribution of the GPC prediction results using Sentinel-2 data. A comparison with the GPC prediction results using Google Earth data shows that the results from these two image sources have similar spatial patterns.
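The 27 × 27 Sentinel-2 patch size follows from matching the ground extent of the GE patch, which the short sketch below works through. The function name `equivalent_patch_size` is hypothetical, and rounding to the nearest pixel is an assumption (rounding up gives the same value here).

```python
def equivalent_patch_size(src_pixels: int, src_res_m: float, dst_res_m: float) -> int:
    """Pixels at the destination resolution covering the same ground extent."""
    ground_extent_m = src_pixels * src_res_m   # 224 px * 1.19 m/px = 266.56 m
    return round(ground_extent_m / dst_res_m)  # 266.56 m / 10 m/px = 26.656 px


print(equivalent_patch_size(224, 1.19, 10.0))  # prints 27
```

Both patch sizes therefore cover roughly 270 m on the ground, so each patch sees a comparable neighborhood in the two data sources.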
Now that the GPC classification result from Sentinel-2 is available, it can be used to refine the result from GE data. It should be noted that the entire study region covers a large area of approximately 1015 km², making it difficult to cover the whole region with single-date VHR imagery. In fact, the study area is covered by multi-date VHR datasets from GE. Moreover, most GPCs are replaced by new buildings within a short period; therefore, if we refined the entire multi-date GE classification result with single-date Sentinel-2 data, errors would arise from the mismatch of observation dates. We therefore selected a subset of GE data whose observation date (23 August 2019) is close to that of the Sentinel-2 data (28 August 2019). We then applied a decision-level fusion to merge the classification results of GE and Sentinel-2. Only the intersection of the GE and Sentinel-2 classification results was retained to increase the reliability of GPC recognition, as shown in Figure 10.
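The intersection-based decision-level fusion can be sketched as follows. This is a minimal illustration assuming that the GE and Sentinel-2 classification results have already been brought onto a common grid of cells; how the two results are co-registered is not described in this section, and the names `Mask` and `intersect_masks` are hypothetical.

```python
from typing import List

Mask = List[List[int]]  # 1 = GPC, 0 = non-GPC, one value per grid cell


def intersect_masks(ge_mask: Mask, s2_mask: Mask) -> Mask:
    """Keep a cell as GPC only if both classification results label it GPC."""
    same_shape = len(ge_mask) == len(s2_mask) and all(
        len(a) == len(b) for a, b in zip(ge_mask, s2_mask))
    if not same_shape:
        raise ValueError("masks must share the same grid shape")
    return [[g & s for g, s in zip(ge_row, s2_row)]
            for ge_row, s2_row in zip(ge_mask, s2_mask)]


ge = [[1, 1, 0],
      [0, 1, 0]]
s2 = [[1, 0, 0],
      [0, 1, 1]]
print(intersect_masks(ge, s2))  # prints [[1, 0, 0], [0, 1, 0]]
```

Taking the intersection trades recall for reliability: a cell detected in only one of the two sources is discarded.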

Comparison with Random Forest Classification
Random forest (RF), proposed by Breiman [46], has been widely used for land use/land cover mapping in the remote sensing field with improved classification accuracy [12,13,47,48]. To further evaluate the proposed CNN, we compared it with RF, which was trained and tested with the same training and testing samples as the proposed method to ensure a fair comparison. The accuracy comparison results are listed in Table 8.
Table 8 indicates that, compared with RF classification, the proposed CNN increased the OA by 7.76% and 5.88% for GE and Sentinel-2 (S2) data, respectively. This is mainly because the CNN, unlike RF, could extract high-level discriminative features, which benefits classification accuracy.

Conclusions
This study proposed a deep semi-supervised learning framework for urban green plastic cover mapping from VHR remote sensing imagery. A multi-scale deformable CNN was exploited for discriminative feature learning in complex urban landscapes. A two-step sample selection strategy was proposed for semi-supervised learning to identify the most reliable samples from the unlabeled pool. Experiments and an ablation study were conducted to confirm the performance of the proposed method.
The experimental results indicate that the proposed method could classify green plastic-covered regions in Jinan with high accuracy. An accuracy assessment showed that the overall accuracy (OA) was 91.63% and the Kappa index was 0.8325. Moreover, a careful visual inspection showed that most of the green plastic-covered areas were correctly identified. An ablation study showed that the semi-supervised learning strategy increased the OA by 6.38 percentage points compared with supervised learning, indicating that mining the most confident unlabeled data could effectively improve the classification accuracy. Meanwhile, the comparison with several classic CNN models from the computer vision field showed that the multi-scale deformable CNN in this study yielded the highest accuracy, supporting its effectiveness for spatial feature learning in complex urban landscapes.
Moreover, this study is the first attempt to identify green plastic cover from VHR remote sensing data based on deep learning methods, and it could provide a baseline for related studies. Although the proposed CNN is used here for urban plastic-covered region recognition, it could also be applied to other tasks, such as remote sensing scene understanding. In future work, we will further validate the model's effectiveness and use semantic segmentation models to derive the exact boundaries of the green plastic-covered regions.