Mapping Urban Extent at Large Spatial Scales Using Machine Learning Methods with VIIRS Nighttime Light and MODIS Daytime NDVI Data

: Urbanization poses signiﬁcant challenges on sustainable development, disaster resilience, climate change mitigation, and environmental and resource management. Accurate urban extent datasets at large spatial scales are essential for researchers and policymakers to better understand urbanization dynamics and its socioeconomic drivers and impacts. While high-resolution urban extent data products - including the Global Human Settlements Layer (GHSL), the Global Man-Made Impervious Surface (GMIS), the Global Human Built-Up and Settlement Extent (HBASE), and the Global Urban Footprint (GUF) - have recently become available, intermediate-resolution urban extent data products including the 1 km SEDAC’s Global Rural-Urban Mapping Project (GRUMP), MODIS 1km, and MODIS 500 m still have many users and have been demonstrated in a recent study to be more appropriate in urbanization process analysis (around 500 m resolution) than those at higher resolutions (30 m). The objective of this study is to improve large-scale urban extent mapping at an intermediate resolution (500 m) using machine learning methods through combining the complementary nighttime Visible Infrared Imaging Radiometer Suite (VIIRS) and daytime Moderate Resolution Imaging Spectroradiometer (MODIS) data, taking the conterminous United States (CONUS) as the study area. The e ﬀ ectiveness of commonly-used machine learning methods, including random forest (RF), gradient boosting machine (GBM), neural network (NN), and their ensemble (ESB), has been explored. Our results show that these machine learning methods can achieve similar high accuracies across all accuracy metrics ( > 95% overall accuracy, > 98% producer’s accuracy, and > 92% user’s accuracy) with Kappa coe ﬃ cients greater than 0.90, which have not been achieved in the existing data products or by previous studies; the ESB is not able to produce signiﬁcantly better accuracies than individual machine learning methods; the total misclassiﬁcations generated by GBM are more than those generated by RF, NN, and ESB by 14%, 16%, and 11%, respectively, with NN having the least total misclassiﬁcations. This indicates that using these machine learning methods, especially NN and RF, with the combination of VIIRS nighttime light and MODIS daytime normalized di ﬀ erence vegetation index (NDVI) data, high accuracy intermediate-resolution urban extent data products at large spatial scales can be achieved. The methodology has the potential to be applied to annual continental-to-global scale urban extent mapping at intermediate resolutions. study area consisting of the 48 conterminous states in the US, with urban extent data layer from the Global Rural–Urban Mapping Project (GRUMP) overlaid with state boundaries.

Machine learning methods have been demonstrated to perform well in land cover mapping [49][50][51][52], and have been effective in urban area mapping in recent years [11,13,15,16,44,45,48,53]. Schneider et al. employed a supervised decision tree algorithm (C4.5) with a one-year time series of MODIS 8-day composites of the seven land bands and the enhanced vegetation index (EVI), and produced the MODIS 500 m global urban extent data product [10]. Hu and Weng estimated impervious surfaces from medium spatial resolution imagery using multi-layer perceptron neural networks [13]. Zhou et al. developed a cluster-based method to map urban areas from the Defense Meteorological Program Operational Line-Scan System (DMSP/OLS) nightlight data [43]. Wan et al. relied on the Terra MODIS surface reflectance datasets and a positive and unlabeled learning (PUL) method for mapping US urban extent [45]. Zhang et al. applied the one-class support vector machine (OCSVM) to classify different combinations of the DMSP/OLS stable nighttime light (NTL) data, MODIS normalized difference vegetation index (NDVI) data, and land surface temperature (LST) data for regional urban extent mapping [48]. Wang et al. used a back propagation neural network to identify urban areas in China with VIIRS nighttime light and MODIS NDVI data as inputs [16]. Li et al. experimented with support vector machine (SVM) methods to extract urban extent from LJ1-01 and VIIRS nighttime light data [20].
Random forest (RF), gradient boosting machine (GBM), neural network (NN), and their ensemble (ESB) are commonly used machine learning methods in land cover mapping but have not been fully assessed in urban area mapping, especially at intermediate resolutions. The objective of this study is to explore the effectiveness of these machine learning methods for improving the accuracies of large-scale urban extent mapping at intermediate resolutions (500 m) based on the combination of the complementary VIIRS nighttime light and MODIS daytime NDVI data.

Study Area
This study takes the conterminous United States (CONUS) as the study area ( Figure 1). The reasons for choosing this study area include: First, the United States is one of the highly urbanized countries with intensive urbanization in recent decades. Based on the statistics from the US Census, four out of five Americans lived in urban areas in the 2000s and the urbanization of the United States is not uniform across its vast landscape with the fastest urbanization occurring in the northeastern region [54]. Lopez's study in 2014 [55] demonstrated that for 2010, there were 136 US metropolitan areas with a sprawl index ranging from 50 to 70, and 176 US metropolitan areas with a sprawl index greater than 75. The sprawl index values were calculated based on the formula: where: SIi = sprawl index for metropolitan area i S%i = percentage of total population in low-density census tracts in metropolitan area i D%i = percentage of total population in high-density census tracts in metropolitan area i Sprawl index values range between 0 and 100, with 100 representing the highest level of sprawl and 0 representing the lowest level of sprawl. In addition, the Joint Research Centre (JRC)'s degree of urbanization calculation indicates that from 1975 to 2015, the United States' total built-up area increased from 80,417 km 2 to 161,379 km 2 . Secondly, the application of VIIRS nighttime light data in large-scale urban extent mapping is not fully studied for the CONUS region, especially regarding the use of machine learning methods [7,32,37,39,56,57]. Thirdly, the CONUS covers an area of about 7.6 million km 2 and contains various land cover types such as urban built-up areas, water, forests, grasslands, bare lands, croplands, wetlands, shrubs, and other land cover types. Fourthly, the urban extent data products for this region have abundant regional users from both the scientific research community and government agencies [24].
where: SIi = sprawl index for metropolitan area i S%i = percentage of total population in low-density census tracts in metropolitan area i D%i = percentage of total population in high-density census tracts in metropolitan area i Sprawl index values range between 0 and 100, with 100 representing the highest level of sprawl and 0 representing the lowest level of sprawl. In addition, the Joint Research Centre (JRC)'s degree of urbanization calculation indicates that from 1975 to 2015, the United States' total built-up area increased from 80,417 km 2 to 161,379 km 2 . Secondly, the application of VIIRS nighttime light data in large-scale urban extent mapping is not fully studied for the CONUS region, especially regarding the use of machine learning methods [7,32,37,39,56,57]. Thirdly, the CONUS covers an area of about 7.6 million km 2 and contains various land cover types such as urban built-up areas, water, forests, grasslands, bare lands, croplands, wetlands, shrubs, and other land cover types. Fourthly, the urban extent data products for this region have abundant regional users from both the scientific research community and government agencies [24].

Definition of Urban Area
The definition of an urban area varies from different research perspectives [33,58,59]. For example, census-related urban studies refer mainly to population distributions while those using nighttime lights or multi-spectral data may be related to economic conditions or "built-up areas" (physical attributes of land surface) [9,10,23,31,57,60,61]. As the characteristics of all these definitions are correlated, most of the urban areas identified by relevant methods are consistent. However, significant inconsistencies remain among the urban extent data products, especially for the large differences in total urban areas that are partially caused by the different definitions of an urban area. Physical attribute-based urban extent data products have broader application potential including

Definition of Urban Area
The definition of an urban area varies from different research perspectives [33,58,59]. For example, census-related urban studies refer mainly to population distributions while those using nighttime lights or multi-spectral data may be related to economic conditions or "built-up areas" (physical attributes of land surface) [9,10,23,31,57,60,61]. As the characteristics of all these definitions are correlated, most of the urban areas identified by relevant methods are consistent. However, significant inconsistencies remain among the urban extent data products, especially for the large differences in total urban areas that are partially caused by the different definitions of an urban area. Physical attribute-based urban extent data products have broader application potential including population analysis, economic research, disaster modeling, and environmental impact assessment. Therefore, this study employed the definition proposed by Schneider et al. [11]. That is, urban areas are defined by the physical attributes and land cover composition of the land surface: urban areas are places dominated by the built environment with a minimum mapping unit of 1 km by 1 km, which includes all non-vegetated, human-constructed elements such as buildings, roads, runways, etc. Here, 'dominated' implies the coverage of human-constructed elements is greater than 50% in a 1 km by 1 km area. Based on this definition, when vegetation, water, and other non-human-constructed elements cover most of a 1 km by 1 km area, that area will not be considered as an urban area, while any 1 km by 1 km area with over 50% built-up area, whether they are continuous or not, will be considered as an urban area in practice ( Figure 2). Another reason to adopt this definition of an urban area in this study is that many recent research activities on urban extent mapping with satellite data used this physical attribute-based definition [10,11,16,32].
Remote Sens. 2019, 11, x FOR PEER REVIEW 5 of 18 population analysis, economic research, disaster modeling, and environmental impact assessment. Therefore, this study employed the definition proposed by Schneider et al. [11]. That is, urban areas are defined by the physical attributes and land cover composition of the land surface: urban areas are places dominated by the built environment with a minimum mapping unit of 1 km by 1 km, which includes all non-vegetated, human-constructed elements such as buildings, roads, runways, etc.
Here, 'dominated' implies the coverage of human-constructed elements is greater than 50% in a 1 km by 1 km area. Based on this definition, when vegetation, water, and other non-human-constructed elements cover most of a 1 km by 1 km area, that area will not be considered as an urban area, while any 1 km by 1 km area with over 50% built-up area, whether they are continuous or not, will be considered as an urban area in practice ( Figure 2). Another reason to adopt this definition of an urban area in this study is that many recent research activities on urban extent mapping with satellite data used this physical attribute-based definition [10,11,16,32]. The size of the square is 1 km by 1 km: urban area contains more than 50% of built-up areas (left) while a non-urban area contains less than 50% built-up area (right).  (Figure 3). The capability of MODIS daytime spectral reflectance and NDVI for urban extent mapping has been extensively demonstrated by many researchers [10,11,14,35,44,45,62]. However, as urban areas are spectrally similar to none or low vegetated non-urban areas, such as uncropped soils or bare lands [6,59], depending totally on MODIS daytime spectral data for urban extent mapping often leads to classification errors [57].

Data and Preprocessing
Nighttime lights are straight forward for applications in urban extent mapping as artificial lights in urban areas are easier to separate from the darker non-urban areas at night [63]. The most widelyused nighttime light data are the stable light data products from the Defense Meteorological Satellite Program/Operational Linescan System (DMSP/OLS) [34,42,43,64]. While DMSP nighttime light datasets provide a longer time series of nighttime light observations (1992-2013) than VIIRS and a method exists to deblur for the blooming [65], their applications are still affected by the coarser 1 km spatial resolution, light intensity saturation in urban areas, and intra-sensor calibration problems Figure 2. Illustration of urban and non-urban areas based on the definition of urban areas employed in this study. The size of the square is 1 km by 1 km: urban area contains more than 50% of built-up areas (left) while a non-urban area contains less than 50% built-up area (right).  (Figure 3). The capability of MODIS daytime spectral reflectance and NDVI for urban extent mapping has been extensively demonstrated by many researchers [10,11,14,35,44,45,62]. However, as urban areas are spectrally similar to none or low vegetated non-urban areas, such as uncropped soils or bare lands [6,59], depending totally on MODIS daytime spectral data for urban extent mapping often leads to classification errors [57].

Data and Preprocessing
Nighttime lights are straight forward for applications in urban extent mapping as artificial lights in urban areas are easier to separate from the darker non-urban areas at night [63]. The most widely-used nighttime light data are the stable light data products from the Defense Meteorological Satellite Program/Operational Linescan System (DMSP/OLS) [34,42,43,64]. While DMSP nighttime light datasets provide a longer time series of nighttime light observations (1992-2013) than VIIRS and a method exists to deblur for the blooming [65], their applications are still affected by the coarser 1 km spatial resolution, light intensity saturation in urban areas, and intra-sensor calibration problems [66]. The nighttime light data provided by the VIIRS DNB band are superior to DMSP [67], with significant improvements including increased spatial resolution (15 arc-second, approximately 500 m), lower light imaging detection limits (~2 × 10 −11 W·cm −2 ·sr −1 ), and higher radiometric quantization (14 bit), thus providing the potential to better delineate urban areas and enhancing the capability to detect urban areas effectively [1, 37,67]. Because nighttime light emissions are related to a number of factors including energy policies at country level, levels of access to electricity, and measurable levels of luminosity affected by varied economic conditions, solely depending on nighttime light data may not be able to produce accurate urban extent maps, especially at a continental-to-global scale [16,24,57].
Remote Sens. 2019, 11, x FOR PEER REVIEW 6 of 18 [66]. The nighttime light data provided by the VIIRS DNB band are superior to DMSP [67], with significant improvements including increased spatial resolution (15 arc-second, approximately 500 m), lower light imaging detection limits (~2 × 10 −11 W·cm −2 ·sr −1 ), and higher radiometric quantization (14 bit), thus providing the potential to better delineate urban areas and enhancing the capability to detect urban areas effectively [1, 37,67]. Because nighttime light emissions are related to a number of factors including energy policies at country level, levels of access to electricity, and measurable levels of luminosity affected by varied economic conditions, solely depending on nighttime light data may not be able to produce accurate urban extent maps, especially at a continental-to-global scale [16,24,57]. Combination of VIIRS nighttime light data and MODIS daytime spectral data might overcome their individual limitations, thus achieving better performance in mapping urban extent than using MODIS or VIIRS data alone [16,32,46,68].
The Earth Observations Group (EOG) at NOAA/National Centers for Environmental Information (NCEI) is producing a version 1 suite of annual average radiance composite images using nighttime light data from VIIRS DNB, which was available only for 2015 at the time of this study ( Figure 3 (Left)). Impacts from stray lights, lightning, lunar illumination, and cloud-cover have been filtered in this dataset, and gas flares have also been removed by referencing the gas flares' locations accompanied with the nighttime light data provided by NCEI [69]. Removing cloud effects from MODIS images is critical when using the data for land cover mapping. The greenest pixel compositing method uses the maximum normalized difference vegetation index (NDVI) of a MODIS time series to composite. MODIS NDVI annual composite at 500 m for 2015 was created using the greenest pixel compositing method through GEE API and downloaded to our local server (Figure 3 (Right)), and co-registered with the VIIRS nighttime light data.

Reference Sample Data
For training and validating machine learning models, "ground truth" reference samples were collected for both urban and non-urban land cover types (e.g., forest, cropland, wetland, water, grassland, and bare land) based on high-resolution images (e.g., ESRI World Imagery) and visual Combination of VIIRS nighttime light data and MODIS daytime spectral data might overcome their individual limitations, thus achieving better performance in mapping urban extent than using MODIS or VIIRS data alone [16,32,46,68].
The Earth Observations Group (EOG) at NOAA/National Centers for Environmental Information (NCEI) is producing a version 1 suite of annual average radiance composite images using nighttime light data from VIIRS DNB, which was available only for 2015 at the time of this study (Figure 3 (Left)). Impacts from stray lights, lightning, lunar illumination, and cloud-cover have been filtered in this dataset, and gas flares have also been removed by referencing the gas flares' locations accompanied with the nighttime light data provided by NCEI [69]. Removing cloud effects from MODIS images is critical when using the data for land cover mapping. The greenest pixel compositing method uses the maximum normalized difference vegetation index (NDVI) of a MODIS time series to composite. MODIS NDVI annual composite at 500 m for 2015 was created using the greenest pixel compositing method through GEE API and downloaded to our local server (Figure 3 (Right)), and co-registered with the VIIRS nighttime light data.

Reference Sample Data
For training and validating machine learning models, "ground truth" reference samples were collected for both urban and non-urban land cover types (e.g., forest, cropland, wetland, water, grassland, and bare land) based on high-resolution images (e.g., ESRI World Imagery) and visual interpretation for a given 1 km by 1 km area. Data recorded for each reference sample include the location of the sample site and its attribute (urban or non-urban).
Urban areas, unlike other land cover types, have special spatial characteristics. For example, in most urban areas, the core or central portions are covered by more human-constructed features (e.g., buildings and roads) and less natural features (e.g., vegetation) than the periphery portions; relatively smaller urban areas tend to have all human-constructed features distributed uniformly within their boundaries, while the density of human-constructed features in bigger urban areas tend to decrease from core or central portions to the periphery portions; urban areas located in the northeastern US contain more vegetation than those in the southwestern US. Therefore, during the process of reference sample data collection, we applied the stratified random sampling method to reduce sampling bias. That is, in addition to considering randomness when picking a reference sampling site, we also considered (1) the core-periphery spatial structure of urban areas to balance the number of urban area samples for the core urban areas and peripheral urban areas, (2) balancing the number of urban area samples for bigger urban areas and smaller urban areas, and (3) balancing the number of urban area samples for urban areas located in different geographic regions. Ignoring such characteristics of urban areas when collecting reference samples may make the accuracy assessment falsely high and, thus, untrustworthy (e.g., using urban area samples collected only from the core urban areas achieved a user's accuracy of 99.77%, a producer's accuracy of 98.86%, and an overall accuracy of 99.15% with a Kappa coefficient of 0.9820 for the machine learning methods, as core urban areas are easy to identify and almost all the inconsistencies among the available urban extent data products occur in the periphery portions of urban areas). For each reference sample site, as long as the surrounding 1 km by 1 km area contained more than 50% human-constructed features, it was recorded as an urban sample, and vice visa. As a result, small patches of urban forest or parks that account for less than 50% in a 1 km by 1 km area were also considered as urban areas.
A total of 2772 reference samples (1455 urban samples and 1317 non-urban samples) were collected for this study in two steps: (1) reference samples were collected solely for training the machine learning models -in total, 295 training samples (198 urban samples and 97 non-urban samples) were collected in this step; (2) after the urban extent data products were generated by each of the machine learning models, a new set of reference samples was collected solely for independent and rigorous accuracy assessment -in total, 2477 samples were collected in this step, including 1257 urban samples and 1220 non-urban samples (Figure 4). interpretation for a given 1 km by 1 km area. Data recorded for each reference sample include the location of the sample site and its attribute (urban or non-urban). Urban areas, unlike other land cover types, have special spatial characteristics. For example, in most urban areas, the core or central portions are covered by more human-constructed features (e.g., buildings and roads) and less natural features (e.g., vegetation) than the periphery portions; relatively smaller urban areas tend to have all human-constructed features distributed uniformly within their boundaries, while the density of human-constructed features in bigger urban areas tend to decrease from core or central portions to the periphery portions; urban areas located in the northeastern US contain more vegetation than those in the southwestern US. Therefore, during the process of reference sample data collection, we applied the stratified random sampling method to reduce sampling bias. That is, in addition to considering randomness when picking a reference sampling site, we also considered (1) the core-periphery spatial structure of urban areas to balance the number of urban area samples for the core urban areas and peripheral urban areas, (2) balancing the number of urban area samples for bigger urban areas and smaller urban areas, and (3) balancing the number of urban area samples for urban areas located in different geographic regions. Ignoring such characteristics of urban areas when collecting reference samples may make the accuracy assessment falsely high and, thus, untrustworthy (e.g., using urban area samples collected only from the core urban areas achieved a user's accuracy of 99.77%, a producer's accuracy of 98.86%, and an overall accuracy of 99.15% with a Kappa coefficient of 0.9820 for the machine learning methods, as core urban areas are easy to identify and almost all the inconsistencies among the available urban extent data products occur in the periphery portions of urban areas). For each reference sample site, as long as the surrounding 1 km by 1 km area contained more than 50% human-constructed features, it was recorded as an urban sample, and vice visa. As a result, small patches of urban forest or parks that account for less than 50% in a 1 km by 1 km area were also considered as urban areas.
A total of 2,772 reference samples (1,455 urban samples and 1,317 non-urban samples) were collected for this study in two steps: (1) reference samples were collected solely for training the machine learning models -in total, 295 training samples (198 urban samples and 97 non-urban samples) were collected in this step; (2) after the urban extent data products were generated by each of the machine learning models, a new set of reference samples was collected solely for independent and rigorous accuracy assessment -in total, 2,477 samples were collected in this step, including 1,257 urban samples and 1,220 non-urban samples (Figure 4).
.  (1) the cross "+" symbols indicate reference sample sites solely for training purposes, (2) the triangle " " symbols indicate reference sample sites solely for accuracy assessment. Sample sites of (1) and (2) were collected in two separate steps and are, therefore, totally independent.

Methods
Machine learning methods build mathematical models from training sample data to make predictions or decisions automatically and have been applied in intermediate-resolution remote sensing of urban extent. RF, GBM, and NN are three relatively mature and commonly-used machine learning methods in data analytics and have been increasingly applied to satellite image classifications for land cover mapping at different spatial scales and resolutions in recent years [11,16,49,51,71,72]. Machine learning ensembles are learning algorithms that construct a group of different classifiers and then classify the data by taking a weighted vote of the individual classifier predictions [73]. Such ensembles or model combinations are usually more accurate than a single classifier and were introduced to land cover mapping by Walsh in 2015 [51]. There have been no studies reported yet in the literature regarding the performance of RF, GBM, NN, and ESB in intermediate-resolution urban extent mapping with nighttime light and daytime satellite data as inputs.
To explore and compare the effectiveness of RF, GBM, NN, and ESB in mapping urban extent, exactly the same datasets were used as inputs, which include VIIRS nighttime light luminosity annual composite, MODIS NDVI annual composite, and the training reference samples. VIIRS nighttime light and MODIS NDVI were stacked together; therefore, at each 500 m pixel location, there is a 2-dimension vector: where: The construction of the ESB is based on the outputs from the individual machine learning models [51,73]. As each of the three individual machine learning models uses the same reference samples and satellite data inputs, their predictions for urban and non-urban land cover types at unknown pixel locations are correlated inherently, which must be considered during the ensemble step to achieve better results. Linear stacking with the elastic net was used to address this issue through both ridge and lasso penalizations [74]. Figure 5 shows the entire workflow of this study. First, for each of the reference sample sites, the VIIRS luminosity value and MODIS NDVI value were extracted. Secondly, two-thirds of the 295 training reference samples were randomly selected to train the three individual machine learning models while the remaining one-third reference samples were used for regularizing the ESB weights and finding an optimal set of model weights that would not diminish the predictive performance of the ensemble [51]. Specifically, RF was set up using training parameters including out-of-bag (OOB) error, GBM was set up using training parameters including 10-fold repeated cross-validation and 5 repeats, NN was set up with one hidden layer and training parameters including 10-fold cross-validation, and the ESB was set up with a 10-fold regularized ensemble weighting. All these machine learning algorithms were implemented in R, which can be run on Linux, Windows, and Mac. Thirdly, the urban probability grids outputted by RF, GBM, NN, and ESB were classified into urban and non-urban using 0.95 probability as the threshold. At the last step, all the validation reference samples were overlaid with the four urban extent maps corresponding to RF, GBM, NN, and ESB, and their accuracies for correctly identifying urban areas were assessed and compared.

New Urban Extent Maps for CONUS
After around 3 h of program execution, four urban extent maps at 500 m resolution were produced for CONUS 2015 (Figure 6a) corresponding to the four machine learning methods (RF, GBM, NN, and ESB). By visually analyzing these maps, it is observed that all the maps have correctly revealed the spatial patterns of urban area distributions in CONUS. Zoom-in detailed comparisons demonstrate that their differences occur mainly in the areas where urban and non-urban features mix and are located either in the peripheral areas or inner urban areas bordering with vegetation (e.g., big parks inside a city) (Figure 6b). (a)

New Urban Extent Maps for CONUS
After around 3 h of program execution, four urban extent maps at 500 m resolution were produced for CONUS 2015 (Figure 6a) corresponding to the four machine learning methods (RF, GBM, NN, and ESB). By visually analyzing these maps, it is observed that all the maps have correctly revealed the spatial patterns of urban area distributions in CONUS. Zoom-in detailed comparisons demonstrate that their differences occur mainly in the areas where urban and non-urban features mix and are located either in the peripheral areas or inner urban areas bordering with vegetation (e.g., big parks inside a city) (Figure 6b).

New Urban Extent Maps for CONUS
After around 3 h of program execution, four urban extent maps at 500 m resolution were produced for CONUS 2015 (Figure 6a) corresponding to the four machine learning methods (RF, GBM, NN, and ESB). By visually analyzing these maps, it is observed that all the maps have correctly revealed the spatial patterns of urban area distributions in CONUS. Zoom-in detailed comparisons demonstrate that their differences occur mainly in the areas where urban and non-urban features mix and are located either in the peripheral areas or inner urban areas bordering with vegetation (e.g., big parks inside a city) (Figure 6b). (a)

Comparing the Four Urban Extent Maps through Quantitative Accuracy Assessment
To rigorously compare and evaluate which urban extent maps generated by the four machine learning algorithms are more accurate, a quantitative accuracy assessment was conducted against the reference samples specifically collected for validation and accuracy assessment. While the overall accuracy (OA) is the commonly-used index for accuracy assessment, because of class sample imbalance and different performance of classification methods for different land cover types, it can be biased towards the majority classes, ignoring the minority classes [16]. Therefore, we generated the confusion matrices and calculated the producer's accuracy, the user's accuracy, and the Kappa coefficients for each of the four urban extent maps to fully assess the accuracies (Table 2). Table 2. Confusion matrices and accuracy assessment using "ground truth" reference samples: (a) random forest (RF)-based urban extent accuracy assessment, (b) gradient boosting machine (GBM)based urban extent accuracy assessment, (c) neural network (NN)-based urban extent accuracy assessment, and (d) ESB-based urban extent accuracy assessment.

Comparing the Four Urban Extent Maps through Quantitative Accuracy Assessment
To rigorously compare and evaluate which urban extent maps generated by the four machine learning algorithms are more accurate, a quantitative accuracy assessment was conducted against the reference samples specifically collected for validation and accuracy assessment. While the overall accuracy (OA) is the commonly-used index for accuracy assessment, because of class sample imbalance and different performance of classification methods for different land cover types, it can be biased towards the majority classes, ignoring the minority classes [16]. Therefore, we generated the confusion matrices and calculated the producer's accuracy, the user's accuracy, and the Kappa coefficients for each of the four urban extent maps to fully assess the accuracies ( Table 2).
The confusion matrices show that these four machine learning methods can achieve similar high accuracies across all accuracy metrics (>95% overall accuracy, >98% producer's accuracy, and >92% user's accuracy, Kappa coefficients > 0.90), which have not been achieved by existing data products, previous studies, and associated methods; the ESB of RF, GBM, and NN is not able to produce significantly better accuracies than the three individual machine learning methods; the total misclassified validation samples generated by GBM (121) are more than those generated by RF (107), NN (104), and the ESB (109) by 14%, 16%, and 11%, respectively, with NN having the least total misclassified validation samples. If we must pick the best and worst among these four machine learning methods, NN performs the best while GBM performs the worst based on their relatively low and high numbers of misclassifications and similar level of accuracies across all accuracy metrics.
The reasons for the ESB of RF, GBM, and NN not being able to outperform the individual machine learning methods could be that RF-, GBM-, and NN-based urban extent data products are mostly consistent; thus, none is adding significant new information to the others, or each of them has already achieved quite high accuracy (not many errors), or their outputs are correlated because they use the same data inputs, especially the satellite data. Considering the more computing resources needed for running the ESB and its inability to improve significantly on the accuracies relative to the individual machine learning methods, constructing an ensemble from RF, GBM, and NN appears unnecessary for such urban extent mapping applications.
The differences in accuracies for NN and RF are negligible while GBM apparently performs a little worse than both (with user's accuracy of 1% lower than RF and NN, and misclassified validation samples of about 15% lower than RF and NN). Therefore, NN and RF should be better choices for intermediate-resolution urban extent mapping. Table 2. Confusion matrices and accuracy assessment using "ground truth" reference samples: (a) random forest (RF)-based urban extent accuracy assessment, (b) gradient boosting machine (GBM)-based urban extent accuracy assessment, (c) neural network (NN)-based urban extent accuracy assessment, and (d) ESB-based urban extent accuracy assessment.
(a) Additionally, the reason for NN and RF outperforming GBM could be that RF can break the correlation between individual base learner predictions, thus hopefully reducing the variance of final predictions, and NN is flexible and capable of effectively representing both structured and non-structured data (pixel values).

Comparing the New Urban Extent with Existing Data Products
The comparison between our newly created urban extent (year 2015) with other existing urban extent data products is to characterize their differences and confirm whether the new urban extent data perform well in delineating urban areas. Not all the datasets listed in Table 1 were available to us at the time of this study. We chose the available GRUMP urban extent (1 km for the year 1995) and GlobCover artificial surfaces (309 m for the year 2009) datasets and selected US cities for the comparison. As there were no urban extent data products available for the same year (2015), urban extent data for different nominal years were selected. Figure 7 shows the comparison between our new urban extent and GRUMP urban extent. While these datasets are 20 years apart and comparing cannot discriminate between actual urban changes and improvements of the new methodology, considering urban extents for all cities in 1995 must be smaller than or the same as those in 2015 (actually most of US cities are sprawling based on the sprawl analysis [55]), it clearly shows that our new urban extent data product delineates more realistically the urban areas (boundaries between urban areas and non-urban areas) as the base map is dated between 2016 and 2017.

Conclusions
Urban land changes are related to sustainable development, environmental quality, public health, natural hazards, poverty, climate change adaptation, and other environmental and socioeconomic issues [75][76][77][78][79][80]. Satellite urban extent mapping provides a fundamental dataset for analyzing urban land changes and the relevant environmental or socioeconomic drivers and impacts [81,82].
Our results and analyses have demonstrated that both NN and RF can be used to create high-accuracy intermediate-resolution urban extent data products at large spatial scales when the complementary VIIRS nighttime light and MODIS daytime NDVI data are combined. Such data products can be used to update the existing intermediate-resolution urban extent data products with better accuracy, spatial resolution, and consistency (e.g., our NN-based 500 m urban extent data product for CONUS is being used to update the 23-years-old 1995 GRUMP urban extent at 1 km for CONUS region at SEDAC). In addition, these machine learning methods can be used to create annual urban extent data products based on the availability of VIIRS nighttime light and MODIS daytime spectral data or data from other similar satellites so that urban land change dynamics can be studied.

Discussion
These machine learning methods, especially NN and RF, have the potential to be applied to continental-to-global scale urban extent mapping at intermediate resolutions. However, as the characteristics of urban areas in other parts of the world might be different from the CONUS, further study is needed before these methods can be applied to other continents or globally. For example, there are recognized issues in using nighttime light satellite data to capture urban areas located in poorly lit countries like North Korea or African countries or because of the deliberate decision of some countries in Europe to rarely light urban areas [67] (it is known that when comparing the USA to European cities, American cities have five times more lights per capita than European cities). While incorporating daytime MODIS spectral data with VIIRS nighttime light can help to mitigate these issues, they may still have an impact on the machine learnings' calibration and limit their performance when applying them directly to other continents or at a global scale. One possible solution to these issues is to train and calibrate the machine learning methods by continent; however, the effectiveness and accuracies that can be achieved can only be found out through further testing of other world regions. At the very least, these methods can be used to effectively map urban areas in the US and other developed or well-lit countries of the world.
Although several recent studies have explored the most suitable sensors and methods for mapping urban extent, and data products derived from high-resolution sensors (e.g., Landsat, Sentinel-1) and very high-resolution sensors (e.g., TerraSAR-X) are currently available, producing these kinds of data products is time consuming, especially when repeated mapping at large spatial scales is needed. Further, when applying such high-resolution urban extent data products to continental-to-global scale models and applications, the data usually needs to be resampled to lower resolutions because of the huge data volume involved and the computing resources needed. A recent study [30] based on 30 m Landsat-derived urban extent has demonstrated that 480 m resolution performs the best for urbanization process analysis. Therefore, improving the accuracy of intermediate-resolution urban extent mapping at 500 m with VIIRS nighttime light, MODIS daytime spectral data or other similar satellite data, and machine learning methods is still needed for data users even when there are open global urban extent products at relatively high resolutions.
As shown in the results and analyses, the major differences or inconsistencies between the urban extent data products generated by different machine learning methods or other studies are located mainly in the peripheral portions of urban areas or along the borders of built-up areas and vegetated urban areas (e.g., parks) where mixing pixels are mainly located. Therefore, introducing unmixing image analysis in the future may further help increase the urban extent mapping accuracies or provide another interesting data product to represent urbanization degree. For example, through unmixing, different percentages of built-up area coverage in a pixel can be extracted and, thus, a non-binary continuous urban extent data layer can be generated. This might be helpful for mapping low-density sprawl settlements and for some urban models or applications.
Urban land changes occurring on small scales may not be detectable at the relatively coarse resolution of VIIRS and MODIS. When such urban changes are important in relevant applications (e.g., local applications), higher resolution satellite data such as those from Landsat and Sentinel 2 need to be introduced and the effectiveness of machine learning methods in improving urban extent mapping at such scales can be explored.
Author Contributions: X.L. and A.d.S. conceived and designed the experiments; X.L. and Y.Z. performed the experiments; X.L. and Y.Z. analyzed the data; X.L. wrote the original draft of the paper, which was revised by X.L. and A.d.S.

Funding:
The authors would like to acknowledge support under the National Aeronautical and Space Administration (NASA) contract NNG13HQ04C for the continued operation of the Socioeconomic Data and Applications Center (SEDAC).