Machine Learning Comparison and Parameter Setting Methods for the Detection of Dump Sites for Construction and Demolition Waste Using the Google Earth Engine

Abstract: Machine learning has been successfully used for object recognition within images. Due to the complexity of the spectrum and texture of construction and demolition waste (C&DW), it is difficult to construct an automatic identification method for C&DW based on machine learning and remote sensing data sources. Machine learning includes many types of algorithms; however, different algorithms and parameters have different identification effects on C&DW. Exploring the optimal method for automatic remote sensing identification of C&DW is an important approach for the intelligent supervision of C&DW. This study investigates the megacity of Beijing, which faces a high risk of C&DW pollution. To improve the classification accuracy of C&DW, buildings, vegetation, water, and crops were selected as comparative training samples based on the Google Earth Engine (GEE), and Sentinel-2 was used as the data source. Three typical machine learning classification algorithms (classification and regression trees (CART), random forest (RF), and support vector machine (SVM)) were selected to classify C&DW from remote sensing images. Using empirical methods, the experimental trial method, and the grid search method, the optimal parameterization scheme of the three classification methods was studied to determine the optimal method of remote sensing identification of C&DW based on machine learning. Through accuracy evaluation and ground verification, the overall recognition accuracies of CART, RF, and SVM for C&DW were 73.12%, 98.05%, and 85.62%, respectively, under the optimal parameterization scheme determined in this study. Among these algorithms, RF was a better C&DW identification method than CART and SVM when the number of decision trees was 50. This study explores a robust machine learning method for automatic remote sensing identification of C&DW and provides a scientific basis for intelligent supervision and resource utilization of C&DW.


Introduction
Construction and demolition waste (C&DW) refers to all types of solid waste generated during construction, transformation, decoration, demolition, and laying of various buildings and structures and their auxiliary facilities, primarily including residue soil, waste concrete, broken bricks and tiles, waste asphalt, waste pipe materials, and waste wood [1]. China will inevitably produce more C&DW in the future due to its rapid economic development. Based on the available statistics, the output of solid domestic waste has reached 7 billion tons, C&DW contributes 30-40% of the total urban waste [2], and the newly produced C&DW will reach 300 million tons per year [3]. If C&DW is not treated and used appropriately, it will cause serious effects on society, the environment, Therefore, considering the megacity of Beijing, this study explores the feasibility of identifying C&DW with complex features based on the GEE platform and machine learning algorithm. The specific problems to be solved in this study include: (1) determining the optimal parameter scheme of each algorithm by optimizing the core parameters (e.g., minimum leaf population, number of trees, kernel type, gamma, cost for CART, RF, and SVM); (2) clarifying the intelligent recognition ability of each algorithm after determining the optimal parameter scheme of each algorithm; and (3) selecting the most robust remote sensing identification method of C&DW by field survey and accuracy evaluation. This study can reduce the monitoring and control costs of C&DW and provide the basis for research on the spatial and temporal distribution of C&DW, resource utilization, and environmental pollution risk reduction in megacities.

Study Area
Beijing is located in the northern part of China with a central location at 116°20′ E and 39°56′ N. The city has 16 districts with a total area of 16,410.54 km² (Figure 1). Based on the available statistics, the annual output of C&DW in Beijing is approximately 92,852,400 tons, while the annual transportation volume is more than 45 million tons [34]. C&DW occupies land resources, and causes air, soil, and water pollution without timely management. Most C&DW in urban areas is piled up in waste consumption farms that have been transformed from historical pit sites and kiln sites located outside the Fifth Ring Road in Beijing [35].

Data Source
GEE hosts a large catalog of geospatial remote sensing data. The catalog consists of Earth observation imagery, including that from Landsat, MODIS, Sentinel-1, and Sentinel-2. The Sentinel-2 mission collects high-resolution multispectral imagery, which is useful for a broad range of applications, including monitoring vegetation, soil and water cover, land cover change, and humanitarian and disaster risks. According to the Earth Engine Data Catalog in GEE, the product used here is a Level-2A surface reflectance product that has already been preprocessed with radiometric, geometric, and atmospheric correction (https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S2_SR, accessed on 21 January 2021). The product details are shown in Table 1. In this study, Sentinel-2 image data covering the whole year were selected, and the Quality Assessment 60 (QA60) band was used as a pre-filter to mask clouds and reduce the influence of turbid atmospheric particles.
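The QA60 pre-filter can be illustrated with a minimal sketch. The study does not list its filtering code in this section, so the following assumes the commonly used reading of the QA60 bitmask, in which bit 10 flags opaque clouds and bit 11 flags cirrus; the function names and the toy pixel values are hypothetical.

```python
# Bit positions in the Sentinel-2 QA60 band (assumed: 10 = opaque cloud, 11 = cirrus).
OPAQUE_CLOUD_BIT = 10
CIRRUS_BIT = 11


def is_clear(qa60: int) -> bool:
    """Return True when neither cloud flag is set in a QA60 pixel value."""
    cloud_mask = (1 << OPAQUE_CLOUD_BIT) | (1 << CIRRUS_BIT)
    return (qa60 & cloud_mask) == 0


def mask_clouds(reflectance, qa60):
    """Replace cloudy pixels with None; both inputs are flat lists of equal length."""
    return [r if is_clear(q) else None for r, q in zip(reflectance, qa60)]


# Toy data (hypothetical): four pixels, the second and third flagged as cloudy.
pixels = [0.12, 0.30, 0.25, 0.08]
qa = [0, 1 << OPAQUE_CLOUD_BIT, 1 << CIRRUS_BIT, 0]
print(mask_clouds(pixels, qa))
```

Only pixels for which both flags are zero are retained, which is the sense in which QA60 acts as a pre-filter before classification.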

Methods
Remote sensing data covering Beijing are quickly acquired and processed by GEE. The C&DW sample dataset was constructed based on the spectral, textural, and topographic features of construction waste, and the CART, RF, and SVM algorithms were optimized by adjusting the parameters. The optimal identification method of construction waste was finally determined by evaluating the classification accuracy and analyzing ground verification results. A technical flow chart of the GEE process is shown in Figure 2.

Sampling
This study used Sentinel-2 images of Beijing from GEE, and all data sets were preprocessed (e.g., cloud removal). Based on the spectral and textural characteristics of the satellite data, this study focused on five categories, buildings, vegetation, water bodies, crops, and C&DW, to perform target recognition. Five types of sample points were selected in the study area based on the principle of uniform selection and prior knowledge. Visual interpretation was used to build a training dataset, which ensured the accuracy of the feature subset. In addition, different regions and shapes of C&DW samples were selected to avoid redundant and highly correlated features. The sample size of each feature type is shown in Table 2. All sample points had to be merged in the same layer to form a sample set, and the selected samples were divided into two parts (training and testing), where 70% of the samples were randomly selected for training and the other 30% of the samples were used to verify and evaluate the accuracy of the algorithm. The random column function of GEE was used to assign a random number (with the random values ranging from 0 to 1) for all sample points. Therefore, all the sample data had an extra random value. The samples with a random number ≥0.7 were used for validation, while the samples with a random number <0.7 were taken as training samples.
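The random 70/30 partition described above can be sketched as follows. This is an illustration of the thresholding logic only, not the study's GEE script; the function name `split_samples`, the dictionary layout of a sample, and the fixed seed are assumptions.

```python
import random


def split_samples(samples, train_fraction=0.7, seed=42):
    """Attach a uniform random value in [0, 1) to each sample and split on it."""
    rng = random.Random(seed)
    training, validation = [], []
    for sample in samples:
        # Copy the sample and add the extra "random" column.
        tagged = dict(sample, random=rng.random())
        if tagged["random"] < train_fraction:
            training.append(tagged)      # random value < 0.7: training
        else:
            validation.append(tagged)    # random value >= 0.7: validation
    return training, validation


# Hypothetical sample points with five class labels (0..4).
points = [{"id": i, "label": i % 5} for i in range(1000)]
train, valid = split_samples(points)
print(len(train), len(valid))
```

Because the split depends on a random draw per point, the realized proportions only approximate 70% and 30%.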

CART and Parametric Optimization Scheme
The CART algorithm is a supervised machine learning algorithm [36] that uses training samples to identify and construct trees to solve the problem of "same object with different spectrum, foreign object with same spectrum" in remote sensing recognition and classification [37]. Therefore, CART is widely used for remote sensing classification.
The CART algorithm recursively divides an n-dimensional feature space into non-overlapping rectangles. First, an independent variable $x_i$ is selected, and then a value $u_i$ of $x_i$ is selected, which divides the n-dimensional space into two parts: points that satisfy $x_i \le u_i$ and points that satisfy $x_i > u_i$. For a discrete variable, the attribute test has only two outcomes: equal or not equal. Each of the two parts is then partitioned again in the same way, with an attribute reselected at every step, until the entire n-dimensional space has been divided. The attribute with the minimum Gini coefficient is used as the partition index. For a dataset D, the Gini coefficient is defined as follows:
$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{k} p(i)^{2}$$
where k is the number of categories of samples and p(i) represents the probability that a sample is classified into category i. The smaller the Gini value, the higher the "purity" of the sample and the better the division effect [38].
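As a hedged illustration of the splitting rule, the sketch below computes the Gini value of a label set in its standard form (one minus the sum of squared class proportions) and searches one feature for the threshold u that minimizes the size-weighted Gini of the two resulting parts. The weighting by subset size is the usual CART convention and is an assumption here, as are the function names and the toy data.

```python
from collections import Counter


def gini(labels):
    """Gini value of a label list: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_threshold(values, labels):
    """Find the threshold u on one feature that minimizes the weighted Gini
    of the parts {x <= u} and {x > u}. Returns (u, weighted_gini)."""
    best_u, best_score = None, float("inf")
    n = len(values)
    # The largest value is skipped because it would leave the right part empty.
    for u in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= u]
        right = [l for v, l in zip(values, labels) if v > u]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_u, best_score = u, score
    return best_u, best_score


# Toy feature (hypothetical reflectance values) with two classes.
print(best_threshold([0.1, 0.2, 0.8, 0.9], ["water", "water", "C&DW", "C&DW"]))
```

A weighted Gini of zero means both parts are pure, which corresponds to the best possible division for that feature.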
A decision tree is composed of multiple levels and many leaf nodes, so it can be pruned by controlling the parameters or thresholds that govern the creation of new branches (Figure 3). Max nodes is the maximum number of leaf nodes per tree, and min leaf population is the minimum number of training points that a node must contain in order to be created. To construct a suitable tree, sufficient nodes and branches must be created. In GEE, max nodes is unlimited if it is not specified. The empirical method was used to select the optimal min leaf population.

RF and Parametric Optimization Scheme
RF is an ensemble learning algorithm that integrates many decision trees to form a forest (Figure 4). Each tree is generated from randomly selected features or combinations of random features. The bagging method is used to generate the training sets: each one is drawn at random, with replacement, as N samples from the original training set (N being the size of the original training set). The predictions of the individual decision trees are then combined, and the final prediction is obtained by voting [39]. The final classification decision is as follows:
$$H(x) = \arg\max_{Y} \sum_{i=1}^{T} I\left(h_i(x) = Y\right)$$
where $H(x)$ is the combined classification model, $h_i$ is a single decision tree's classification model, $T$ is the number of decision trees, $Y$ is the output variable (or target variable), and $I(\cdot)$ is the indicator function.
The formula shows that RF uses majority voting to determine the final classification. RF tolerates outliers and noise well and does not overfit easily (i.e., it is stable) [40]. The adjustable parameter of the RF algorithm is the number of trees, which was selected empirically.
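The voting step can be sketched in a few lines. The example assumes each tree is simply a callable that maps a sample to a class label; the three one-rule "trees" and the feature pair are hypothetical stand-ins for trained decision trees.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label that receives the most votes."""
    return Counter(votes).most_common(1)[0][0]


def forest_predict(trees, x):
    """Apply every tree to sample x and combine the outputs by voting."""
    return majority_vote([tree(x) for tree in trees])


# Three hypothetical one-rule "trees" acting on a (brightness, ndvi) pair.
toy_trees = [
    lambda x: "C&DW" if x[0] > 0.3 else "vegetation",
    lambda x: "C&DW" if x[1] < 0.2 else "vegetation",
    lambda x: "vegetation" if x[1] > 0.1 else "C&DW",
]

print(forest_predict(toy_trees, (0.4, 0.15)))
```

In this toy case two of the three trees vote for C&DW, so the forest outputs C&DW even though one tree disagrees, which is why a single noisy tree has limited influence on the result.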

SVM and Parametric Optimization Scheme
SVM is a supervised machine learning algorithm that can cope with scarce samples, is robust, and typically yields good results in classification and regression. SVM uses a hyperplane to separate data points of different classes clearly [41], with the goal of finding the hyperplane for which the margin (distance) to the support vectors of the two classes is largest (Figure 5).

The SVM algorithm is based on the kernel method, and the selection of the kernel type has a strong effect on the classification results [42]. Currently, three types of kernels are commonly used:

• Polynomial kernel: $K(x, x') = (x \cdot x' + c)^{q}$, where q is the polynomial order and c and q are manually defined parameters.
• Radial basis function (RBF) kernel: $K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^{2}\right)$, where γ is greater than 0 and defined manually.
• Sigmoid kernel: $K(x, x') = \tanh\left(\upsilon \, x \cdot x' + c\right)$, where υ and c are manually defined parameters.
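For concreteness, the three kernel types named above can be written out in their standard textbook forms. The sketch below is an assumption about the exact parameterization (GEE's underlying SVM library may scale the terms differently), and the default parameter values are arbitrary.

```python
import math


def dot(a, b):
    """Inner product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def polynomial_kernel(a, b, c=1.0, q=2):
    """(a . b + c) ** q, with q the polynomial order."""
    return (dot(a, b) + c) ** q


def rbf_kernel(a, b, gamma=1.0):
    """exp(-gamma * ||a - b||^2), with gamma > 0."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(a, b, upsilon=1.0, c=0.0):
    """tanh(upsilon * (a . b) + c)."""
    return math.tanh(upsilon * dot(a, b) + c)


u, v = [1.0, 2.0], [3.0, 4.0]
print(polynomial_kernel(u, v), rbf_kernel(u, v, gamma=0.1), sigmoid_kernel(u, v, upsilon=0.01))
```

With the RBF kernel, a larger gamma makes the similarity between two distinct points decay faster with distance, which is why gamma has such a strong effect on the fitted decision boundary.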
In GEE, the SVM algorithm is described by the kernel type, gamma, and cost parameters [24]. The kernel type can be polynomial, sigmoid, or RBF. Gamma is the parameter of the kernel function once the RBF kernel is selected, and it implicitly determines the distribution of the data after they are mapped to the new feature space. The larger the gamma, the smaller the region influenced by each support vector, and vice versa. Gamma therefore also affects the number of support vectors, which in turn affects the speed of training and prediction. Cost represents the regularization parameter C of the error term.
Currently, SVM parameter selection methods primarily include the empirical method, the experimental trial method, the gradient descent method, the cross-validation method, and the Bayesian method [43,44]. Among the three kernel functions, the RBF kernel function is relatively stable, while the polynomial and sigmoid kernel functions have relatively low stability [43]. Therefore, the RBF kernel type was used in this study. The values of gamma and cost were selected by a combination of the experimental trial method and the grid search method.
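The grid search over gamma and cost can be sketched generically as follows. The scoring callable stands in for "train an RBF SVM with these parameters and return the validation accuracy"; the toy scoring surface, its peak location, and the candidate grids are hypothetical and are not the values or results reported in this study.

```python
from itertools import product


def grid_search(evaluate, gammas, costs):
    """Evaluate every (gamma, cost) pair and return the best (gamma, cost, score)."""
    best = None
    for gamma, cost in product(gammas, costs):
        score = evaluate(gamma, cost)
        if best is None or score > best[2]:
            best = (gamma, cost, score)
    return best


def toy_accuracy(gamma, cost):
    """Placeholder accuracy surface (hypothetical), peaking at gamma=4, cost=30."""
    return 1.0 - 0.001 * (gamma - 4) ** 2 - 0.0001 * (cost - 30) ** 2


print(grid_search(toy_accuracy, gammas=[1, 2, 4, 8], costs=[10, 20, 30, 40]))
```

An exhaustive grid is simple and reproducible, but its cost grows with the product of the two grid sizes, which is one reason to narrow the ranges first by trial experiments.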

Verification Methods
The confusion matrix is the core method of accuracy evaluation, which can describe the classification accuracy and show the confusion between categories. The basic statistics for the confusion matrix include the overall accuracy (OA), consumer accuracy (CA), producer accuracy (PA), and Kappa coefficient. In this study, 156 sample points were taken to make the confusion matrix.
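The four statistics can be computed from a confusion matrix as in the sketch below. It assumes the convention that rows hold the reference (true) classes and columns hold the predicted classes, so that PA is computed along a row and CA along a column; the 2 × 2 matrix is a made-up example, not data from this study.

```python
def accuracy_metrics(matrix, class_index):
    """Compute OA, kappa, and the PA and CA of one class.

    matrix[i][j] = number of samples of true class i predicted as class j.
    Assumes the chosen class has at least one true and one predicted sample.
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    diagonal = sum(matrix[i][i] for i in range(k))
    row_totals = [sum(matrix[i]) for i in range(k)]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]

    oa = diagonal / n
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    kappa = (oa - expected) / (1 - expected)
    pa = matrix[class_index][class_index] / row_totals[class_index]
    ca = matrix[class_index][class_index] / col_totals[class_index]
    return {"OA": oa, "kappa": kappa, "PA": pa, "CA": ca}


# Hypothetical two-class example: class 0 = C&DW, class 1 = other.
print(accuracy_metrics([[8, 2], [1, 9]], class_index=0))
```

Kappa discounts the agreement that would be expected by chance, so it is generally lower than OA whenever the classifier is imperfect.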
To verify the classification results, the ground verification method was used to visually interpret the optimized classification results of each algorithm. Different classification algorithms for C&DW based on satellite images were compared and analyzed. Based on accuracy assessment and ground verification, the optimal scheme of CART, RF, and SVM for C&DW identification was determined.

Accuracy Assessment
The detailed code is accessible via the code links (see Supplementary Materials). From the algorithm optimization study, the statistics of each algorithm's precision indices are as follows. For optimizing the CART algorithm, the min leaf population was varied between 0 and 10. As shown in Figure 6a, the OA, CA, PA, and kappa coefficient remained unchanged as the number of nodes increased, with values of 73.13%, 65.22%, 50%, and 65.22%, respectively. The CART algorithm yielded relatively low classification accuracy for C&DW, and changing the parameters had little influence on the classification. When optimizing the RF algorithm, as shown in Figure 6b, the OA, CA, PA, and kappa coefficient all increased rapidly as the number of trees increased from 1 to 20. The OA and PA tended to decrease and then increase slowly when the number of trees was approximately 40. The classification precision reached its highest values and remained stable when the number of trees was 50. The highest values of OA, CA, PA, and the kappa coefficient for RF were 98.05%, 100%, 96.67%, and 98.38%, respectively.
For SVM optimization, the optimal gamma parameter was determined first, followed by the optimal cost parameter. The empirical method was used to set the cost value to 10. Then, the experimental trial method and grid search method were used to obtain and compare the statistical values of each precision index for different values of the gamma and cost parameters. As shown in Figure 6c, the OA, CA, PA, and kappa coefficient rose markedly and then increased slowly when gamma changed between 0.2 and 0.6. The OA, CA, PA, and kappa coefficient reached a small peak when gamma was 5, at which point the classification accuracy was good. As shown in Figure 6d, the classification precision tended to be stable when gamma ≥ 10. The highest levels of the OA, CA, PA, and the kappa coefficient were 84.68%, 81.82%, 60%, and 75.6%, respectively, when gamma was 16. For experimental rigor, the cost parameter was studied and compared for both gamma = 5 and gamma = 16. As shown in Figure 6e, OA and CA increased with increasing cost when gamma was 5, while PA and the kappa coefficient fluctuated marginally from 1 to 20. The classification precision reached its highest value and remained stable when the cost was 40. The highest values of OA, CA, PA, and the kappa coefficient were 84.38%, 85.71%, 60%, and 76.37%, respectively. As shown in Figure 6f, the OA, CA, and kappa coefficient increased with increasing cost when gamma was 16, except for cost near 14. The PA increased first and then decreased, and finally tended to be stable with increasing cost. The highest values of OA, CA, PA, and the kappa coefficient were 85.62%, 85.71%, 60%, and 78.78%, respectively, when cost was 34. For the SVM algorithm, the combination of gamma = 16 and cost = 34 yielded better accuracy than the combination of gamma = 5 and cost = 40.
This study selected certain experimental results of C&DW identification and drew a distribution comparison map of C&DW in Beijing, as shown in Figure 7. The detected C&DW distribution results were least affected by the CART parameters, followed by the RF parameters and then the SVM parameters.
Statistical analyses of each accuracy index yielded an optimal parameterization scheme for CART, RF, and SVM and the corresponding precision. The results of this process are shown in Table 3.

Ground Verification
This study verified the classification results using ground truth data to compare the classification effect and the reliability of the algorithms for C&DW detection. In certain areas of Beijing, the real state and primary characteristics of C&DW were examined using remote sensing images at construction waste dumps on 15 December 2020, including formal consumption sites and informal construction waste dump areas. As shown in Figure 8, C&DW was sorted and deposited in a formal consumption area. The C&DW was piled in an orderly manner based on brick size and covered with green grids to prevent air pollution. However, certain informal C&DW dumps have existed for a long time. After many years of weathering, weeds grow on the surface of the C&DW accumulation, making it difficult to identify that area as a C&DW dump.

C&DW is scattered and irregular in shape. To analyze the prediction results of each algorithm in more detail, certain C&DW areas with concentrated accumulation and relatively regular shapes were selected for visual interpretation and comparison. Based on the distribution map of C&DW under the optimal parameterization scheme, the details of C&DW identification in different areas were analyzed, as shown in Figure 9. In general, each algorithm could identify C&DW; however, the recognition results differed marginally. In regions A and D, the classification ability of CART was inferior to that of SVM and RF. In regions B and C, CART could not completely identify C&DW, and SVM incorrectly classified ground objects that were not C&DW as C&DW. RF was shown to be more accurate in classifying C&DW. The analysis of the four regions shows that RF yielded the best recognition results compared with CART and SVM. However, SVM had good recognition results along the edges of C&DW, such as in regions B and C.
The numbers of pixels identified as C&DW in the four areas are shown in Table 4. The number of C&DW pixels identified by SVM was typically higher than the numbers identified by CART and RF. Combined with the visual results shown in Figure 9, this suggests that the SVM algorithm may have experienced overfitting.
Based on the accuracy assessment indices and ground verification of C&DW, the RF algorithm was shown to yield better recognition results for C&DW detection among the tested algorithms. Identification was best when RF had 50 decision trees.

Discussion and Conclusions
In recent years, due to the serious threat of construction and demolition waste (C&DW) to society, the economy, and the environment, C&DW management has received increasing attention. Intelligent identification of C&DW is an important method of waste supervision and resource utilization. This study attempted to determine a method suitable for C&DW identification and classification by optimizing parameters based on Google Earth Engine (GEE) and machine learning algorithms. The results of this study provide a method for C&DW remote sensing identification. The types of ground objects were divided into buildings, vegetation, water, C&DW, and crops. The parameter optimization studies found that the overall classification accuracy of each algorithm was optimal when 6 nodes were used with the classification and regression trees (CART) algorithm, 50 trees were used with the random forest (RF) algorithm, and gamma = 16 and cost = 34 were used with the support vector machine (SVM) algorithm. Ground verification results of C&DW distribution points in Beijing show that the results of this study are reliable. The results showed that CART, RF, and SVM have different recognition abilities for C&DW in Beijing. Compared to CART and SVM, RF performed better in terms of the overall accuracy (OA) and identification ability of C&DW. In some other studies, four methods (characteristic reflectivity and extreme learning machine; first-order derivative of characteristic reflectivity and extreme learning machine; grey level co-occurrence matrix and extreme learning machine; and convolutional neural network) were proposed for the automatic identification of C&DW. The reported correct rate was around 80% in these studies [20]. Our results are consistent with these previous studies and further support the effectiveness of machine learning methods in the intelligent identification of C&DW based on remote sensing images.
This study confirmed the feasibility of intelligent identification of C&DW based on GEE and machine learning algorithms. A parameter optimization scheme of the machine learning algorithm was proposed to improve the remote sensing recognition ability for C&DW. Based on a field-based accuracy assessment, the best machine learning method and its parameterization scheme for remote sensing identification of C&DW were determined. This study could provide the basis for research on the spatial and temporal distribution of C&DW, resource utilization, and environmental pollution risk reduction in megacities. The results of this study are helpful for the treatment and management of C&DW and associated cost reductions, which are beneficial for saving land resources and promoting energy conservation and emission reduction [45]. Additionally, the distribution map of C&DW in Beijing provides a scientific basis for intelligent supervision and resource utilization of C&DW.
However, the proposed method still has certain limitations, which should be further explored in future research. For example, during identification and classification, failure to consider the elevation of C&DW may lead to the classification of construction waste as buildings, and C&DW covered with green grids may be misclassified as vegetation or crops. In addition to the influence of algorithm parameters, the source of the sample data, the number of samples, and the partition ratio of the training/validation set will affect the final classification performance [46]. Moreover, deep learning methods also have great potential in the intelligent identification of C&DW, and we will undertake more studies on deep learning methods for C&DW identification based on TensorFlow. In addition, we will make further in-depth comparisons with other methods to find the optimal method of intelligent identification of C&DW. The accuracy validation of C&DW remote sensing identification results is also an important aspect of our future research. Based on the existing data, this study discussed the influence of algorithm parameter combinations on the performance of C&DW identification and classification, and preliminarily determined an optimal scheme. This paper focused on a new method of intelligent identification of C&DW based on machine learning and remote sensing data. The remaining uncertain factors should be investigated in future research.