An Unsupervised Urban Extent Extraction Method from NPP-VIIRS Nighttime Light Data

: An accelerating trend of global urbanization accompanying various environmental and urban issues makes frequently urban mapping. Nighttime light data (NTL) has shown great advantages in urban mapping at regional and global scales over long time series because of its appropriate spatial and temporal resolution, free access, and global coverage. However, the existing urban extent extraction methods based on nighttime light data rely on auxiliary data and training samples, which require labor and time for data preparation, leading to the di ﬃ culty to extract urban extent at a large scale. This study seeks to develop an unsupervised method to extract urban extent from nighttime light data rapidly and accurately without ancillary data. The clustering algorithm is applied to segment urban areas from the background and multi-scale spatial context constraints are utilized to reduce errors arising from the low brightness areas and increase detail information in urban edge district. Firstly, the urban edge district is detected using spatial context constrained clustering, and the NTL image is divided into urban interior district, urban edge district and non-urban interior district. Secondly, the urban edge pixels are classiﬁed by an adaptive direction ﬁltering clustering. Finally, the full urban extent is obtained by merging the urban inner pixels and the urban pixels in urban edge district. The proposed method was validated using the urban extents of 25 Chinese cities, obtained by Landsat8 images and compared with two common methods, the local-optimized threshold method (LOT) and the integrated night light, normalized vegetation index, and surface temperature support vector machine classiﬁcation method (INNL-SVM). The Kappa coe ﬃ cient ranged from 0.687 to 0.829 with an average of 0.7686 (1.80% higher than LOT and 4.88% higher than INNL-SVM). The results in this study show that the proposed method is a reliable and e ﬃ cient method for extracting urban extent with high accuracy and simple operation. These imply the signiﬁcant potential for urban mapping and urban expansion research at regional and global scales automatically and accurately. extracted by our method had high spatial continuity and higher consistency with the reference data at urban edge district. Overall, these results indicate that our method performs well in extracting urban extent from NTL data, enhances the internal spatial continuity of the extracted results, and improves the underestimation of the edge pixels.


Introduction
Urbanization is speeding up the consumption of natural resources, energy, and altering land use and cover, which is closely related to almost all aspects of global change [1,2]. The rapid expansion of urban areas has brought about various environmental and urban issues, such as deforestation, farmland decrease, ecological damage, air pollution, water shortage, climate change, traffic congestion, massive and chaotic urban settlements and mismatched infrastructure in recent years [3,4]. Accurate and consistent information on the distribution and extent of urban areas is fundamental to understanding urbanization dynamics, evaluating the impacts of urbanization on environments, and addressing comparable results to the LOT method. Xu et al. [28] demonstrated that ANN could provide an effective and accurate alternative in extracting urban built-up areas from NTL. Liu et al. [29] explored the effectiveness of commonly used machine learning methods for urban extent extraction from NTL data, including random forest (RF), gradient boosting machine (GBM), neural network (NN), and their ensemble (ESB), and the results showed that these machine learning methods can achieve similar high accuracies. He et al. [30] proposed a fully convolutional network (FCN) to extract urban land using NTL data and employed it to detect global urban expansion from 1992-2016. The supervised classification method avoids the cumbersome process of threshold selection and raises the automaticity to extract urban extent from NTL data [26]. However, it is difficult to label enough urban pixels. The extracted results using the supervised classification are sensitive to the quality of samples [31].
The previous methods rely highly on the auxiliary data or urban training samples, which largely increases the uncertainty of the results and limits its application in practice, especially for NPP-VIIRS NTL data with higher spatial resolution and radiometric resolution. We attempted to develop an automatic method based on unsupervised methods to improve the efficiency of NTL data in urban mapping. The new method, designed for NPP-VIIRS NTL data, assigned a label to a pixel-based on the light intensity of the pixel and the pixels that are spatially close to it. Clustering is an unsupervised learning method widely used for classification and segmentation of remote sensing images, which was used in this study [32]. Twenty-five cities with different levels of the natural environment and economic development in China were selected as evaluation areas. The urban extent derived from the Landsat8 OLI image was used as the reference for accuracy assessment, and the most commonly used methods the local-optimized threshold method (LOT) and the integrated night light, normalized vegetation index, and surface temperature support vector machine classification method (INNL-SVM) [33] were used to compare, the validity, reliability, and robustness of our method are verified.

Study Area
This study took China as the experimental area because China has experienced rapid urbanization in the last four decades and it has a vast territory with diverse natural conditions and an uneven geographical pattern of socioeconomic development ( Figure 1). To evaluate the effectiveness and robustness of the proposed method for urban extent extraction using NPP-VIIRS NTL data, 25 representative cities, located from east to west and from north to south across China, were selected as evaluation areas. The populations of these selected cities varied from 220 thousand to over 21 million and the gross domestic product (GDP) ranged from approximately 20 billion RMB to over 2400 billion RMB (Table 1). These cities, with various urban sizes, diverse physical environments, and different economic and social development levels, can assess the applicability of the proposed urban extent extraction method for different urbanization levels well.

Data
(1) Remote Sensing Data The Earth Observations Group (EOG) is producing a version 1 suite of average radiance composite images using nighttime data from the Visible Infrared Imaging Radiometer Suite (VIIRS) Day/Night Band (DNB). The version 1 series of monthly composites and annual composites is now publicly available through the Payne Institute for Public Policy at the Colorado School of Mines (https://payneinstitute.mines.edu/eog/). The products are 15 arc-second geographic grids, spanning

Data
(1) Remote Sensing Data The Earth Observations Group (EOG) is producing a version 1 suite of average radiance composite images using nighttime data from the Visible Infrared Imaging Radiometer Suite (VIIRS) Day/Night Band (DNB). The version 1 series of monthly composites and annual composites is now publicly available through the Payne Institute for Public Policy at the Colorado School of Mines (https://payneinstitute.mines.edu/eog/). The products are 15 arc-second geographic grids, spanning the globe from 75N to 65S in latitude with six tiles. The DN unit is nanoWatts·cm −2 ·sr −1 . The geographic coordinate system is WGS-1984. Prior to averaging, the NTL data is filtered to exclude data impacted by stray light, lightning, lunar illumination, and cloud-cover. The monthly composites have not been filtered to screen out lights from aurora, fires, boats, and other temporal lights, while the annual composites have layers with additional separation, removing temporal lights and background (non-light) values. We downloaded the 2015 annual NPP-VIIRS NTL composite data and made an outlier removal process to filter out moonlight reflected from water by setting water pixels value to zero.
The Geospatial Data Cloud website (http://www.gscloud.cn/) provides Moderate-resolution Imaging Spectroradiometer (MODIS) daily, 5-days, 10-days, and monthly composite images of normalized difference vegetation index (NDVI) and land surface temperature (LST) covering the whole of China. The monthly composite images of NDVI (MODND1M) of China in 2015 were obtained and processed by maximum value composition (MVC) to create a yearly maximal NDVI image [33]. We also applied the MVC method to the monthly nighttime LST (MODLT1M.LTN) product of China from January to December 2015 to create a yearly maximal LST image.
The 2015 Landsat8 OLI images covering 25 cities with less than 1% cloud cover were obtained from the Geospatial Data Cloud website. Radiation calibration, atmospheric correction, mosaic, clipping, and other pre-processing operations were performed on Landsat8 OLI images of each city with ENVI. The FLAASH (Fast Line-of-sight Atmospheric Analysis of Spectral Hypercubes) tool was applied to atmospheric correction. The impervious surface was initially classified using maximum likelihood method. Then the actual urban extent was acquired by visually interpreted with the help of 2015 google history images at the fifteenth layer. The spatial resolution of google history images in the 15th layer is 3.66 m, which we downloaded by 91 Weitu (http://www.91weitu.com).
All images were resampled to a spatial resolution of 500m on the Albers Conical Equal Area projection.
(2) Auxiliary Data The administrative boundaries of Chinese cities were applied to partition the research areas into cities. The images were cutting out by the extent of each city's administrative boundary. The water mask images excluded the water pixels from the nighttime light images to remove the effect of water on night light. The administrative boundaries and water mask were obtained from the Resource and Environment Data Cloud Platform website (http://resdc.cn/). The 1 km grid population dataset [34] and 1 km grid GDP dataset [35] of China in 2015 were from the Resource and Environment Data Cloud Platform. The statistical data of Table 1 was from the China City Statistical Yearbook 2016. The China City Statistical Yearbook 2016 covered the main socio-economic statistical data of cities at all levels for 2015.

(3) Other Global Urban Data Products
Recently, there are many global or national urban/impervious surface products released. We chose four existing products for comparison purpose, including the annual maps of global artificial impervious area (GAIA) [36], the high-resolution multi-temporal mapping of global urban land based on a normalized urban areas composite index (NUACI-based GUL) [37], the global urban expansion product based on a fully convolutional network (FCN) [30] and the MCD12Q1.006 MODIS Land Cover Type Yearly Global 500 m product (MODIS500). The MODIS500 were obtained from Google Earth Engine. We extracted the "urban and built-up lands" type for comparison. Because the 2015 urban mapping data is not available from the FCN, we replaced with 2016 FCN data. The spatial resolution of the GAIA and the NUACI-based GUL was 30m. The original spatial resolution was maintained in urban spatial comparison and area calculation. All urban products were resampled to a spatial resolution of 500 m on the Albers Conical Equal Area projection for accuracy assessment.

Method
The NTL data records the light intensity emitted by objects on the ground in the night, helping to distinguish bright urban areas from dim background [2]. The key to extracting urban extent based on NTL data is to identify urban areas according to the spatial distribution pattern of nighttime light intensity. The clustering method, widely used for classification and segmentation of remote sensing images as an unsupervised approach, makes inferences from input data without referring to known, or labeled, outcomes [38]. This study applied the clustering method based on pixels to extract urban extent from NTL data. The clustering method regarded input pixels as an out-of-order dataset and ignored the spatial context information of pixels. This resulted in the clustering results that were sensitive to noise and outliers. In addition, adjacent pixels in an NTL image were interdependent, and a large part of NTL radiation of a pixel came from the surroundings. Therefore, the spatial context information of each pixel was considered to improve the classification accuracy of urban pixels. There was a roughly negative correlation between nighttime light intensity and distance from the urban center. It was easy to distinguish between urban inner pixels and non-urban inner pixels because of their distinct brightness differences. However, it was difficult to classify urban pixels in urban edge district because of the complex spatial distribution of light intensity and the large light intensity range in urban edge district. Therefore, extracting urban areas separately by urban inner pixels and urban edge pixels was more efficient and suitable. Therefore, we proposed an automatic method for extracting urban extent from NTL data based on spatial context constraints and clustering algorithm. We first segmented the NTL image into urban interior district, urban edge district and non-urban interior district by a spatial context constrained clustering algorithm. We then classified urban pixels in urban edge district by developing a direction-based spatial context constrained clustering algorithm. Finally, the full urban extent was obtained by merging the urban inner pixels and the urban pixels in urban edge district. In addition, we assessed the performance of our method by evaluating the accuracy of the extracted results and comparing it with two commonly used urban extent extraction methods based on NTL data. The workflow of the proposed method is shown in Figure 2.
information of each pixel was considered to improve the classification accuracy of urban pixels. There was a roughly negative correlation between nighttime light intensity and distance from the urban center. It was easy to distinguish between urban inner pixels and non-urban inner pixels because of their distinct brightness differences. However, it was difficult to classify urban pixels in urban edge district because of the complex spatial distribution of light intensity and the large light intensity range in urban edge district. Therefore, extracting urban areas separately by urban inner pixels and urban edge pixels was more efficient and suitable. Therefore, we proposed an automatic method for extracting urban extent from NTL data based on spatial context constraints and clustering algorithm. We first segmented the NTL image into urban interior district, urban edge district and non-urban interior district by a spatial context constrained clustering algorithm. We then classified urban pixels in urban edge district by developing a directionbased spatial context constrained clustering algorithm. Finally, the full urban extent was obtained by merging the urban inner pixels and the urban pixels in urban edge district. In addition, we assessed the performance of our method by evaluating the accuracy of the extracted results and comparing it with two commonly used urban extent extraction methods based on NTL data. The workflow of the proposed method is shown in Figure 2.

The Spatial Context Constrained Clustering Algorithm
K-means clustering is a simple and popular unsupervised machine learning algorithm that partitions data points into k clusters by allocating each point to the cluster with the nearest cluster centroid. The silhouette method plays a significant role in k-means clustering evaluation and help to find the optimum number of clusters [39]. The Silhouette Coefficient is calculated using the mean

The Spatial Context Constrained Clustering Algorithm
K-means clustering is a simple and popular unsupervised machine learning algorithm that partitions data points into k clusters by allocating each point to the cluster with the nearest cluster centroid. The silhouette method plays a significant role in k-means clustering evaluation and help to find the optimum number of clusters [39]. The Silhouette Coefficient is calculated using the mean intra-cluster distance and the mean nearest-cluster distance for each point data in the silhouette method. A higher average Silhouette Coefficient indicates a better clustering effect. Interestingly, when performing K-Means clustering on NTL pixels for different cities, the average Silhouette coefficient always has the highest value when the number of clusters is 2 ( Figure 3). Therefore, in this study, we used the K-Means clustering algorithm to classify the NTL pixels into two groups: a low-NTL light intensity group and a high-NTL light intensity group. We assigned pixels in the high-NTL light intensity group are urban pixels and pixels in the low-NTL light intensity group are non-urban pixels.
A sliding window was used to define the surrounding neighborhood. Take the pixels in the window of (2 × R + 1) × (2 × R + 1) with radius R, centered on the pixel (x, y) as the spatial context of the pixel (x, y). We applied the filtering algorithms to integrate the spatial context information to correct, enrich and constrains the current pixel. The adopted filtering algorithms could be mean filtering, median filtering, and other filtering methods, which were determined according to the spatial distribution of light intensity. Then input the filtered pixels into the K-means algorithm for clustering to get the urban pixels. The loss function of the spatial context constrained clustering algorithm was defined as follows.
In Equations (1)-(3), E is the loss function; K is the total number of clusters, this study specified K as 2; C j is the collection of jth cluster; d j and µ j is the number of pixels and the center of the jth cluster, respectively; f is the adopted filtering algorithm and F(x, y) is the filtered result of pixel in (x, y); N (x,y) represents the set of all pixels in the sliding window centered on the pixel (x, y).
Remote Sens. 2020, 12, x FOR PEER REVIEW 7 of 20 intra-cluster distance and the mean nearest-cluster distance for each point data in the silhouette method. A higher average Silhouette Coefficient indicates a better clustering effect. Interestingly, when performing K-Means clustering on NTL pixels for different cities, the average Silhouette coefficient always has the highest value when the number of clusters is 2 ( Figure 3). Therefore, in this study, we used the K-Means clustering algorithm to classify the NTL pixels into two groups: a low-NTL light intensity group and a high-NTL light intensity group. We assigned pixels in the high-NTL light intensity group are urban pixels and pixels in the low-NTL light intensity group are non-urban pixels. A sliding window was used to define the surrounding neighborhood. Take the pixels in the window of (2 × R + 1) × (2 × R + 1) with radius R, centered on the pixel ( , ) as the spatial context of the pixel ( , ) . We applied the filtering algorithms to integrate the spatial context information to correct, enrich and constrains the current pixel. The adopted filtering algorithms could be mean filtering, median filtering, and other filtering methods, which were determined according to the spatial distribution of light intensity. Then input the filtered pixels into the K-means algorithm for clustering to get the urban pixels. The loss function of the spatial context constrained clustering algorithm was defined as follows.
In Equations (1)-(3), E is the loss function; K is the total number of clusters, this study specified K as 2; is the collection of jth cluster; and is the number of pixels and the center of the jth cluster, respectively; is the adopted filtering algorithm and F(x, y) is the filtered result of pixel in ( , ); ( , ) represents the set of all pixels in the sliding window centered on the pixel ( , ).

Urban Edge District Detection
We first grouped the NTL pixels into two clusters and regarded the cluster with high light intensity as potential urban ( ) areas and the other as potential non-urban ( ) areas. Due to the low light intensity of water, green spaces, and inactive areas at night within urban areas, it was easy to be mistaken as non-urban pixels. We applied the median filter to smooth areas in the urban inner. The median filter is a well-known nonlinear filter with good performance for removing noise,

Urban Edge District Detection
We first grouped the NTL pixels into two clusters and regarded the cluster with high light intensity as potential urban (P urb ) areas and the other as potential non-urban (P non ) areas. Due to the low light intensity of water, green spaces, and inactive areas at night within urban areas, it was easy to be mistaken as non-urban pixels. We applied the median filter to smooth areas in the urban inner. The median filter is a well-known nonlinear filter with good performance for removing noise, especially for some specific noise types such as "Gaussian", "random", and "salt and pepper" noises. It sorts all the pixel values from the surrounding neighborhood and replaces the current pixel value with the middle value. Compared to the mean filter, the median filter is less affected by the background and better preserves edges while removing noise ( Figure 4). The filtered pixels were grouped into P urb areas and P non areas with k-means.
Remote Sens. 2020, 12, x FOR PEER REVIEW 8 of 20 especially for some specific noise types such as "Gaussian", "random", and "salt and pepper" noises. It sorts all the pixel values from the surrounding neighborhood and replaces the current pixel value with the middle value. Compared to the mean filter, the median filter is less affected by the background and better preserves edges while removing noise ( Figure 4). The filtered pixels were grouped into areas and areas with k-means. The spatial distribution pattern of nighttime light intensity corresponds to the urban spatial pattern. In NTL images, the urban core areas are brighter than the surrounding areas, while the countryside and background are dim. Thus, pixels with relatively low light intensity in areas were more likely to be in urban edge district; similarly, pixels with relatively high light intensity in areas were more likely to be in urban edge district. The mean value and standard deviation of The spatial distribution pattern of nighttime light intensity corresponds to the urban spatial pattern. In NTL images, the urban core areas are brighter than the surrounding areas, while the countryside and background are dim. Thus, pixels with relatively low light intensity in P urb areas were more likely to be in urban edge district; similarly, pixels with relatively high light intensity in P non areas were more likely to be in urban edge district. The mean value and standard deviation of P urb areas were calculated to define the threshold t2. P urb areas were divided into urban inner areas and edge areas (Equations (4) and (6)). Similarly, P non areas were divided into non-urban inner areas and edge areas by the threshold t1 defined by its mean and standard deviation (Equations (5) and (6)). The urban edge district was obtained by merging the edge areas of P urb areas and P non areas (Equation (6)). P urban inner = pixels in P urb , pixel value > t2 (4) P non−urban inner = pixels in P non , pixel value < t1 (5) In Equations (4)- (8), P urb _mean and P urb _std is the mean value and the standard deviation of P urb areas, respectively; P non−urb _mean and P non−urb _std is the mean value and the standard deviation of P non areas, respectively; P urban inner and P non−urban inner is the pixels within the urban inner areas and the non-urban inner areas, respectively; P edge areas is the pixels of the urban edge district; t1 and t2 are the segment thresholds for P non areas and P urb areas, respectively, which are determined by the mean value, the standard deviation of the clusters. N is a coefficient, usually 1.

Urban Pixels Recognition in the Urban Edge District
The urban and non-urban pixels at the edge district have similar and confusing nighttime light intensity, and the urban edge pixels contain more complex meanings and neighborhood spatial heterogeneity than the non-edge pixels. It cannot simply treat the urban edge pixels as noise or one class and requires mining deep spatial information in a larger neighborhood space. Besides, the spatial distribution of light intensity in the edge district presents a certain directionality. We first used an adaptive directional filter algorithm to integrate a larger spatial neighborhood information and filter interference information at the same time, which could enhance the discrimination of urban and non-urban pixels. We then clustered the filtered result into two groups with K-Means. Finally, pixels in a high-intensity group were classified as urban pixels.
Direction filter replaced the center pixel value with the average value of the pixels in the considered direction within the window. We considered eight directions including east, south, west, north, southeast, northeast, southwest, and northwest. The eight direction templates were showed as Figure 5. The filtering direction discrimination function was as Equations (9) and (10). S i was the standard deviation of all available pixel values in the i-th direction template centered on the current pixel. The filtering direction was satisfied Equations (9), which was selected the direction template with the minimum standard deviation.
Remote Sens. 2020, 12, 3810 9 of 19 considered direction within the window. We considered eight directions including east, south, west, north, southeast, northeast, southwest, and northwest. The eight direction templates were showed as Figure 5. The filtering direction discrimination function was as Equations (9) and (10). was the standard deviation of all available pixel values in the i-th direction template centered on the current pixel. The filtering direction was satisfied Equations (9), which was selected the direction template with the minimum standard deviation.

Accuracy Evaluation
According to the previous studies, the most popular way of conducting the accuracy evaluation is to compare the extracted urban extent with those derived from Landsat 8 OLI images [19,26,33]. Therefore, we used urban extent obtained from Landsat8 OLI images as the reference for accuracy evaluation. We defined urban areas as human settlements and functionally-related regions that have high density of roads, buildings, shops, and restaurants that attract people, including suburban regions closely related to urban cores. Firstly, we classified the impervious surface from Landsat8 OLI images and produced discrete and bounded patches having a high percentage of impervious surface. Then the coarse urban extent was derived through recognizing and adjusting those patches under visually interpreting Landsat8 OLI images. For those areas that were not easy to recognize, we interpreted with the help of google high-resolution images and POIs to interpret, and got the finer urban extent that was regarded as the actual urban extent.
The confusion matrix was calculated by comparing the urban extent extracted from the NTL data with the reference data pixel by pixel. Quantified the performance of the urban extraction methods with four accuracy indexes, the overall accuracy (OA), the kappa coefficient, the omission error (OE) and the commission error (CE). The specific calculation formula of evaluation index is as follows.
OA is the proportion of correctly classified pixels to the total number of pixels, including correctly classified urban pixels and correctly classified non-urban pixels.
Kappa coefficient indicates the consistency between the extracted results and the reference data, and statistically reflects the extent to which the classification results are better than the random classification results. Pe is the proportion of misinterpretations caused by accidental factors.

Accuracy Evaluation
According to the previous studies, the most popular way of conducting the accuracy evaluation is to compare the extracted urban extent with those derived from Landsat 8 OLI images [19,26,33]. Therefore, we used urban extent obtained from Landsat8 OLI images as the reference for accuracy evaluation. We defined urban areas as human settlements and functionally-related regions that have high density of roads, buildings, shops, and restaurants that attract people, including suburban regions closely related to urban cores. Firstly, we classified the impervious surface from Landsat8 OLI images and produced discrete and bounded patches having a high percentage of impervious surface. Then the coarse urban extent was derived through recognizing and adjusting those patches under visually interpreting Landsat8 OLI images. For those areas that were not easy to recognize, we interpreted with the help of google high-resolution images and POIs to interpret, and got the finer urban extent that was regarded as the actual urban extent.
The confusion matrix was calculated by comparing the urban extent extracted from the NTL data with the reference data pixel by pixel. Quantified the performance of the urban extraction methods with four accuracy indexes, the overall accuracy (OA), the kappa coefficient, the omission error (OE) and the commission error (CE). The specific calculation formula of evaluation index is as follows.
OA is the proportion of correctly classified pixels to the total number of pixels, including correctly classified urban pixels and correctly classified non-urban pixels.
Kappa coefficient indicates the consistency between the extracted results and the reference data, and statistically reflects the extent to which the classification results are better than the random classification results. Pe is the proportion of misinterpretations caused by accidental factors.
OE is the proportion of prediction errors in city pixels, reflecting the underestimation of extracted results. The producer's accuracy (PA) is the recognition rate of the actual urban pixels. The sum of OE and PA is 1.
CE is the proportion of prediction errors in the pixels predicted as cities, reflecting the overestimation of the extracted results. The user's accuracy (UA) is the correct proportion of the predicted urban pixels. The sum of CE and UA is 1.
N is the total number of the input pixels. Urban real and Non_Urban real are the number of urban pixels and the number of non-urban pixels in reference data, respectively. Urban pre is the number of urban pixels in extracted result, and Non_Urban pre is the number of non-urban pixels in extracted result. Urban true is the number of pixels that are the urban pixels in the extracted result and are also the urban pixels in the reference data. Urban f alse is the number of urban pixels in the extracted result which are non-urban pixels in the reference data. Non_Urban f alse is the number of non-urban pixels in extracted result which are urban pixels in the reference data.

Selection of Neighborhood Size
We tested various neighborhood size for urban edge district detection and urban edge pixels classification ( Figure 6). We found that the larger the neighborhood size for urban edge district detection, the lower PA and the higher UA of the extracted results were, and the Kappa coefficient reached the highest value in a certain range. It was similar with the neighborhood size for urban edge pixels classification. By comparing the filtering and clustering effects with different neighborhood sizes, we used 5 × 5 window size to detect urban edge district and 9 × 9 window size to classify urban edge pixels.
CE is the proportion of prediction errors in the pixels predicted as cities, reflecting the overestimation of the extracted results. The user's accuracy (UA) is the correct proportion of the predicted urban pixels. The sum of CE and UA is 1.
N is the total number of the input pixels. and _ are the number of urban pixels and the number of non-urban pixels in reference data, respectively. is the number of urban pixels in extracted result, and _ is the number of non-urban pixels in extracted result.
is the number of pixels that are the urban pixels in the extracted result and are also the urban pixels in the reference data.
is the number of urban pixels in the extracted result which are non-urban pixels in the reference data.
_ is the number of non-urban pixels in extracted result which are urban pixels in the reference data.

Selection of Neighborhood Size
We tested various neighborhood size for urban edge district detection and urban edge pixels classification ( Figure 6). We found that the larger the neighborhood size for urban edge district detection, the lower PA and the higher UA of the extracted results were, and the Kappa coefficient reached the highest value in a certain range. It was similar with the neighborhood size for urban edge pixels classification. By comparing the filtering and clustering effects with different neighborhood sizes, we used 5 × 5 window size to detect urban edge district and 9 × 9 window size to classify urban edge pixels.

Accuracy and Comparison
The urban extent for 25 cities were extracted and compared using the proposed method, the LOT method, and the INNL-SVM method. The urban extent extraction process of the LOT method and the INNL-SVM method were referred to Dou et al. [33]. Then we calculated the confusion matrix of these three methods between the extracted results and the actual urban extent extracted from Landsat8 OLI, respectively. OA, kappa coefficient, OE, and CE of three methods were as shown in Table 2. The average OA of our method, LOT and INNL-SVM, were 0.9625, 0.9622, and 0.9571, and the average kappa coefficients were 0.7679, 0.7544, and 0.7304, respectively. The method had the highest average kappa coefficient and the second highest average of OA very close to LOT. As a whole, the three methods had high classification accuracy. Since there was a large number of non-urban background pixels in the input data, which can greatly improve the value of OA, we mainly compared the kappa coefficients of the three methods. The lowest kappa coefficients of our method, LOT and INNL-SVM were 0.6871, 0.6645, and 0.6083, respectively. Kappa coefficient over 0.61 means that the results is highly consistent with the reference data. Therefore, the extracted results of our method had high reliability. The kappa coefficients of the extracted results by our method were mostly higher than LOT and INNL-SVM in 25 cities, which indicated that our method had strong robustness and better performance. We used OE and CE to evaluate the underestimation and overestimation of the three methods, respectively. The higher value of OE, the more serious of underestimation, and the higher value of CE, the more serious of overestimation. For most cities, the OE and CE of the three methods were between 0.1 and 0.35. The results of LOT had similar OE and CE in most cities. The OE and CE of INNL-SVM varied greatly in different places. It meant that the performance of INNL-SVM was greatly affected by different regionally environment. The OE of our method was less than CE in most cities, except in Hohhot, Lanzhou, Lhasa, and Shenyang. Our method had OE greater than CE in these four cities. This was because human activities in these four cities were mainly concentrated urban interior areas, while urban fringe areas were more prone to underestimation. In most cities, the OE of our method was smaller than that of LOT and INNL-SVM, while CE was generally between LOT and INNL-SVM.  The urban extent extraction results of seven representative cities using three methods were illustrated in Figure 7. These cities range from mega cities (e.g., Beijing, Shanghai, and Guangzhou), middle-size cities (e.g., Wuhan) to small cities (e.g., Lanzhou, Urumqi, and Kunming). In order to clearly present the difference between the extracted results and the reference data, the common parts, overestimated parts and underestimation parts were showed in red, green, and blue, respectively. The complete urban extent result was composed of the common parts (red pixels) and the overestimated parts (green pixels). The visual comparison revealed that the urban extent extracted from NTL data by these three methods match the reference data well. Nevertheless, all the urban extent extracted by three methods existed some overestimation. This may partly be explained by the inconsistency between NTL data and Landsat 8 images in expressing the urban and the low resolution of NTL data. In addition, there were some problems with the extracted results of LOT and INNL-SVM. The results of LOT and INNL-SVM had many holes in urban inner areas and many pixels at urban edge district were mistaken as non-urban pixels. These problems were improved in our method. The urban extent extracted by our method had high spatial continuity and higher consistency with the reference data at urban edge district. Overall, these results indicate that our method performs well in extracting urban extent from NTL data, enhances the internal spatial continuity of the extracted results, and improves the underestimation of the edge pixels. resolution of NTL data. In addition, there were some problems with the extracted results of LOT and INNL-SVM. The results of LOT and INNL-SVM had many holes in urban inner areas and many pixels at urban edge district were mistaken as non-urban pixels. These problems were improved in our method. The urban extent extracted by our method had high spatial continuity and higher consistency with the reference data at urban edge district. Overall, these results indicate that our method performs well in extracting urban extent from NTL data, enhances the internal spatial continuity of the extracted results, and improves the underestimation of the edge pixels.

Comparison with Other Products
We also compared our results in the year 2015 with the existing global urban datasets, the GAIA (2015), the MODIS500 (2015), the NUACI-based GUL (2015), the FCN (2016). Figure 8 shows the comparisons in seven representative cities, in which GAIA and NUAIC-based GUL had 30 m spatial resolution and our result, MODIS500 and FCN had 500 m spatial resolution. When visually compared with the referenced urban extent and Landsat8 OLI images, our results and FNC provided a good delineation of urban extent, while GAIA and NUACI-based GUL demonstrate the finest spatial details of urban land. The urban spatial pattern of these five dataset MODIS500 had relatively strong similarity. MODIS500 overestimated urban land for cities such as Beijing, while had more omission errors for the cities of Urumqi and Kunming. We resampled these urban datasets for the seven representative cities to a spatial resolution of 500 m on the Albers Conical Equal Area projection and calculated urban pixels and the kappa coefficient ( Table 3). The Kappa value of our method and FCN ranged from 0.7 to 0.8, while that of GAIA, NUACI-based GUL, and MODIS500 spanned 0.5 to 0.7. GAIA had the highest urban area for all cities because it is an artificial impervious surface product, not only covering urban areas but also non-urban areas.
delineation of urban extent, while GAIA and NUACI-based GUL demonstrate the finest spatial details of urban land. The urban spatial pattern of these five dataset MODIS500 had relatively strong similarity. MODIS500 overestimated urban land for cities such as Beijing, while had more omission errors for the cities of Urumqi and Kunming. We resampled these urban datasets for the seven representative cities to a spatial resolution of 500 m on the Albers Conical Equal Area projection and calculated urban pixels and the kappa coefficient ( Table 3). The Kappa value of our method and FCN ranged from 0.7 to 0.8, while that of GAIA, NUACI-based GUL, and MODIS500 spanned 0.5 to 0.7. GAIA had the highest urban area for all cities because it is an artificial impervious surface product, not only covering urban areas but also non-urban areas.   Note: "Area" means the number of urban pixels in spatial resolution 500 m, and "Landsat" means the urban extent obtained from Landsat.

The Proposed Method Can Effectively Extract Urban Extent from NPP-VIIRS NTL
This study proposed an automatic and reliable approach to extract urban extent from NTL data quickly and accurately. Different neighborhood sizes of spatial context constraints are utilized to characterize the edge and interior of the urban areas and reduce the effects of noise and abnormal light intensity. The spatial context constrains can enhance the signals of urban areas and weaken the influence of non-urban areas light. The major difference from previous studies is that we attempt to develop a simple method without the help of ancillary data and transform urban extent extraction as an urban and non-urban pixels-based clustering problem, which could avoid the tedious procedure of threshold or sample selection. The experimental results prove that the proposed method is reliable and robust, which can extract the urban extent of different areas with different regional environment and social and economic development level accurately. The kappa ranged from 0.687 to 0.829 with an average of 0.768 (1.80% higher than LOT and 4.88% higher than INNL-SVM). The extraction results of the proposed method demonstrate great integrity and connectivity, which are consistent with the reference data. Comparison with LOT and INNL-SVM showed that the proposed method could guarantee high extraction accuracy, while increasing detail information of areas in edge district ( Figure 9a) and reducing errors affected by low brightness of areas in urban inner (Figure 9b). Among three popular methods, LOT requires ancillary data to determine optimal thresholds and the extraction results of INNL-SVM are highly rely on sample data. However, for large-scale studies, considerable time and labor are needed to obtain high-accuracy ancillary data and sample data, which limits the practicability of LOT and INNL-SVM. The proposed method automatically recognizes urban extent depending on NTL data itself, without referring to other data, increasing its efficiency for extracting urban extent at large scales or long-term temporal scales. These prove that the proposed method has extensive applicability in the field of urban extent extraction. Note: "Area" means the number of urban pixels in spatial resolution 500 m, and "Landsat" means the urban extent obtained from Landsat.

The Proposed Method Can Effectively Extract Urban Extent from NPP-VIIRS NTL
This study proposed an automatic and reliable approach to extract urban extent from NTL data quickly and accurately. Different neighborhood sizes of spatial context constraints are utilized to characterize the edge and interior of the urban areas and reduce the effects of noise and abnormal light intensity. The spatial context constrains can enhance the signals of urban areas and weaken the influence of non-urban areas light. The major difference from previous studies is that we attempt to develop a simple method without the help of ancillary data and transform urban extent extraction as an urban and non-urban pixels-based clustering problem, which could avoid the tedious procedure of threshold or sample selection. The experimental results prove that the proposed method is reliable and robust, which can extract the urban extent of different areas with different regional environment and social and economic development level accurately. The kappa ranged from 0.687 to 0.829 with an average of 0.768 (1.80% higher than LOT and 4.88% higher than INNL-SVM). The extraction results of the proposed method demonstrate great integrity and connectivity, which are consistent with the reference data. Comparison with LOT and INNL-SVM showed that the proposed method could guarantee high extraction accuracy, while increasing detail information of areas in edge district ( Figure 9a) and reducing errors affected by low brightness of areas in urban inner (Figure 9b). Among three popular methods, LOT requires ancillary data to determine optimal thresholds and the extraction results of INNL-SVM are highly rely on sample data. However, for large-scale studies, considerable time and labor are needed to obtain high-accuracy ancillary data and sample data, which limits the practicability of LOT and INNL-SVM. The proposed method automatically recognizes urban extent depending on NTL data itself, without referring to other data, increasing its efficiency for extracting urban extent at large scales or long-term temporal scales. These prove that the proposed method has extensive applicability in the field of urban extent extraction.  In addition, we applied our method to extracting urban extent in the years 2013 and 2018 for the seven representative cities to demonstrate the potential of our method in urban dynamic analysis. We selected the GAIA to comparison because of its appropriate time coverage and relatively high resolution. Figures 10 and 11 showed urban expansion of Shanghai and Lanzhou from 2013 to 2018. The urban extent extracted using our method in the years 2013, 2015, and 2018 revealed the urban growth trace and the growth regions of our method matched to those of GAIA to certain degrees considering the difference between imperviousness-and NTL-indicated urbanization.
seven representative cities to demonstrate the potential of our method in urban dynamic analysis. We selected the GAIA to comparison because of its appropriate time coverage and relatively high resolution. Figures 10 and 11 showed urban expansion of Shanghai and Lanzhou from 2013 to 2018. The urban extent extracted using our method in the years 2013, 2015, and 2018 revealed the urban growth trace and the growth regions of our method matched to those of GAIA to certain degrees considering the difference between imperviousness-and NTL-indicated urbanization.

The Disadvantages of the Proposed Method
There were some overestimation and underestimation of our extraction results. For example, Lhasa and Hohhot had relatively high OE and Ningbo and Guangzhou had relatively high CE. We compared Lhasa and Urumqi because they both were in west China, while Urumqi had much lower OE and CE. The brightness distribution in Urumqi was relatively uniform and urban areas were noticeably brighter than non-urban areas in urban edge district, while the brightness distribution in Lhasa showed obviously spatial aggregation ( Figure 12). The great brightness difference between urban center and the surrounding areas resulted in mistaking many urban areas for non-urban areas. Observation of incorrect urban areas illustrated that the inconsistent description of urban areas between NTL data and Landsat images was one of the main causes of overestimating the urban areas. In Landsat, area A in Figure 13 was regarded as a non-urban area because of its low percentages of impervious surfaces, while NTL identified it as an urban area because of its relatively high brightness. From the google image, there are several roads through area A and some high-density building areas are embedded in contiguous farmland ( Figure 13). It seems that there is a close relationship between area A and the main urban areas. seven representative cities to demonstrate the potential of our method in urban dynamic analysis. We selected the GAIA to comparison because of its appropriate time coverage and relatively high resolution. Figures 10 and 11 showed urban expansion of Shanghai and Lanzhou from 2013 to 2018. The urban extent extracted using our method in the years 2013, 2015, and 2018 revealed the urban growth trace and the growth regions of our method matched to those of GAIA to certain degrees considering the difference between imperviousness-and NTL-indicated urbanization.

The Disadvantages of the Proposed Method
There were some overestimation and underestimation of our extraction results. For example, Lhasa and Hohhot had relatively high OE and Ningbo and Guangzhou had relatively high CE. We compared Lhasa and Urumqi because they both were in west China, while Urumqi had much lower OE and CE. The brightness distribution in Urumqi was relatively uniform and urban areas were noticeably brighter than non-urban areas in urban edge district, while the brightness distribution in Lhasa showed obviously spatial aggregation ( Figure 12). The great brightness difference between urban center and the surrounding areas resulted in mistaking many urban areas for non-urban areas. Observation of incorrect urban areas illustrated that the inconsistent description of urban areas between NTL data and Landsat images was one of the main causes of overestimating the urban areas. In Landsat, area A in Figure 13 was regarded as a non-urban area because of its low percentages of impervious surfaces, while NTL identified it as an urban area because of its relatively high brightness. From the google image, there are several roads through area A and some high-density building areas are embedded in contiguous farmland ( Figure 13). It seems that there is a close relationship between area A and the main urban areas.

The Disadvantages of the Proposed Method
There were some overestimation and underestimation of our extraction results. For example, Lhasa and Hohhot had relatively high OE and Ningbo and Guangzhou had relatively high CE. We compared Lhasa and Urumqi because they both were in west China, while Urumqi had much lower OE and CE. The brightness distribution in Urumqi was relatively uniform and urban areas were noticeably brighter than non-urban areas in urban edge district, while the brightness distribution in Lhasa showed obviously spatial aggregation ( Figure 12). The great brightness difference between urban center and the surrounding areas resulted in mistaking many urban areas for non-urban areas. Observation of incorrect urban areas illustrated that the inconsistent description of urban areas between NTL data and Landsat images was one of the main causes of overestimating the urban areas. In Landsat, area A in Figure 13 was regarded as a non-urban area because of its low percentages of impervious surfaces, while NTL identified it as an urban area because of its relatively high brightness. From the google image, there are several roads through area A and some high-density building areas are embedded in contiguous farmland ( Figure 13). It seems that there is a close relationship between area A and the main urban areas.
In general, the proposed method has the following shortcomings. It is based solely on NTL data result in some misclassification. The overflow effect of NTL data and the strong light from non-human active area facilities (such as city periphery, airport, oil field) may cause the overestimation of urban areas. The overpass time of VIIRS is near 1:30 a.m., which leads to underestimation in places with declining brightness in midnight. In addition, the proposed method uses a simple way to represent the spatial context information and classify urban pixels, which is not sufficient for complex areas.
In the area of rapid urban development and continuous emergence of small towns, it is more likely to exist overestimation between cities and towns. In the area where human activities are mainly concentrated in the center part, it is easy to be underestimated in the edge district. Besides, this study remains narrow in focus dealing only with the NPP-VIIRS data. The developed method is not suitable for other NTL products due to the different of spatial resolution, radiation resolution and process procedures. To extract urban extent from other NTL products such as DMSP-OLS and Luojia-01, the appropriate clustering method based on spatial context constraints should be designed according to the characteristics of the NTL product. Future research might focus on developing the unsupervised urban extent extraction model with better spatial features representation capabilities to adapt to different NTL products and improve the accuracy of complex environmental areas. Consider using convolution neural network methods to enhance the ability of spatial information expression and integrating other data sources (such as the human activity data) to descript urban areas more comprehensively.  In general, the proposed method has the following shortcomings. It is based solely on NTL data result in some misclassification. The overflow effect of NTL data and the strong light from nonhuman active area facilities (such as city periphery, airport, oil field) may cause the overestimation of urban areas. The overpass time of VIIRS is near 1:30 a.m., which leads to underestimation in places with declining brightness in midnight. In addition, the proposed method uses a simple way to represent the spatial context information and classify urban pixels, which is not sufficient for complex areas. In the area of rapid urban development and continuous emergence of small towns, it is more likely to exist overestimation between cities and towns. In the area where human activities are mainly concentrated in the center part, it is easy to be underestimated in the edge district. Besides, this study remains narrow in focus dealing only with the NPP-VIIRS data. The developed method is not suitable for other NTL products due to the different of spatial resolution, radiation resolution and process procedures. To extract urban extent from other NTL products such as DMSP-OLS and Luojia-01, the appropriate clustering method based on spatial context constraints should be designed according to the characteristics of the NTL product. Future research might focus on developing the unsupervised urban extent extraction model with better spatial features representation capabilities to adapt to  In general, the proposed method has the following shortcomings. It is based solely on NTL data result in some misclassification. The overflow effect of NTL data and the strong light from nonhuman active area facilities (such as city periphery, airport, oil field) may cause the overestimation of urban areas. The overpass time of VIIRS is near 1:30 a.m., which leads to underestimation in places with declining brightness in midnight. In addition, the proposed method uses a simple way to represent the spatial context information and classify urban pixels, which is not sufficient for complex areas. In the area of rapid urban development and continuous emergence of small towns, it is more likely to exist overestimation between cities and towns. In the area where human activities are mainly concentrated in the center part, it is easy to be underestimated in the edge district. Besides, this study remains narrow in focus dealing only with the NPP-VIIRS data. The developed method is not suitable for other NTL products due to the different of spatial resolution, radiation resolution and process procedures. To extract urban extent from other NTL products such as DMSP-OLS and Luojia-01, the appropriate clustering method based on spatial context constraints should be designed according to the characteristics of the NTL product. Future research might focus on developing the unsupervised urban extent extraction model with better spatial features representation capabilities to adapt to different NTL products and improve the accuracy of complex environmental areas. Consider using A Figure 13. Analysis of overestimation urban area for Ningbo: (black box is the overestimation area zoomed in column 1, which was indicated by "A" in text).

The Sensitivity Analysis of Parameters
The algorithm of the proposed method was presented in Supplementary Table S1, where there are three free parameters: R1, R2, and N. The parameters R1 and R2 are used to determined the filtering window radius for step 1 and step 2 (Figure 2), respectively. The window size of step 1 is (2 × R1 + 1) × (2 × R1 + 1), and the window size of step 2 is (2 × R2 + 1) × (2 × R2 + 1). The parameter N is used to specify the threshold value for urban edge district detection.
A sensitivity analysis was performed for R1, R2, and N to examine how changes in these parameters impact urban extent extraction accuracy. We use the average kappa coefficient of 25 experiment cities to represent the extraction accuracy. Figure 14 showed that the accuracy kept in the range of 0.68 to 0.8 with fewer outliers, which indicated that no matter how the parameters change, our results keep a relative high accuracy. However, parameters also can have important impacts on the accuracy.
The accuracy increased with the increase of N, and it achieved the highest value when N was 1. For R2, the accuracy had the highest value at 4 or 5. For R1, the results had the highest value at 2. Besides, we found that these parameters have little influence on each other.
A sensitivity analysis was performed for R1, R2, and N to examine how changes in these parameters impact urban extent extraction accuracy. We use the average kappa coefficient of 25 experiment cities to represent the extraction accuracy. Figure 14 showed that the accuracy kept in the range of 0.68 to 0.8 with fewer outliers, which indicated that no matter how the parameters change, our results keep a relative high accuracy. However, parameters also can have important impacts on the accuracy. The accuracy increased with the increase of N, and it achieved the highest value when N was 1. For R2, the accuracy had the highest value at 4 or 5. For R1, the results had the highest value at 2. Besides, we found that these parameters have little influence on each other.

Conclusions
This study set out to develop an unsupervised method for urban extent extraction from NTL data, avoiding the time-consume preparation of ancillary or labelled data. The proposed method applies the pixel-based clustering algorithm to distinguish urban areas from background. Multi-scale spatial context constraints are utilized to integrate surrounding information of pixels, to increase the distinction between urban pixels and nonurban pixels and correct those areas with abnormal light intensity. The experiments confirmed that it has high accuracy and can extract urban extent with different regional environments and socioeconomic development levels (the kappa of the proposed method ranged from 0.687 to 0.829, while the kappa range of LOT and INNL-SVM was 0.664 to 0.811 Figure 14. The Sensitivity analysis of parameters: (a-c) were the results of all parameter combinations; (d) was the results withR2 = 4; (e) was the results with R1 = 2; (f) was the results with N = 1.

Conclusions
This study set out to develop an unsupervised method for urban extent extraction from NTL data, avoiding the time-consume preparation of ancillary or labelled data. The proposed method applies the pixel-based clustering algorithm to distinguish urban areas from background. Multi-scale spatial context constraints are utilized to integrate surrounding information of pixels, to increase the distinction between urban pixels and nonurban pixels and correct those areas with abnormal light intensity. The experiments confirmed that it has high accuracy and can extract urban extent with different regional environments and socioeconomic development levels (the kappa of the proposed method ranged from 0.687 to 0.829, while the kappa range of LOT and INNL-SVM was 0.664 to 0.811 and 0.608 to 0.821, respectively). The overlay between extraction results and reference data demonstrates that the urban extent extracted by the proposed method had greater spatial integrity and connectivity in urban inner and lower omission errors in urban edge district. The method can delineate urban extent accurately and automatically, without the time-consuming and laborious process of threshold selection and training dataset construction, which saves a large amount of time and labor. These indicate the broad application prospects of the proposed method in large-scale urban research. In addition, the results in this study also proved the feasibility of applying unsupervised methods for urban extent extraction, which provides new inspiration for urban extent extraction based on NTL data and has a number of important implications for future practice.