Global Mapping of GDP at 1 km² Using VIIRS Nighttime Satellite Imagery

Frequent and rapid spatially explicit assessment of socioeconomic development is critical for achieving the Sustainable Development Goals (SDGs) at both national and global levels. Over the past decades, scientists have proposed many methods for estimating human activity on the Earth’s surface at various spatiotemporal scales using Defense Meteorological Satellite Program Operational Linescan System (DMSP-OLS) nighttime light (NTL) data. However, limitations of the DMSP-OLS NTL data and the associated processing methods have restricted their reliability and applicability for systematically measuring and mapping socioeconomic development. This study used Visible Infrared Imaging Radiometer Suite (VIIRS) NTL data and the Isolation Forest machine learning algorithm for more intelligent data processing to capture human activities. We used machine learning and NTL data to map gross domestic product (GDP) at 1 km². We then used these data products to derive inequality indexes (e.g., Gini coefficients) at nationally aggregated levels. This flexible approach processes the data in an unsupervised manner at various spatial scales. Our assessments show that this method produces accurate subnational GDP data products for mapping and monitoring human development uniformly across the globe.


Introduction
The United Nations has established a set of sustainable development goals to achieve a better future for people and the planet. Building on the success of the Millennium Development Goals (MDGs), the 2030 Agenda for Sustainable Development aims to promote and stimulate a series of actions to transform our world. The 17 Sustainable Development Goals (SDGs) with 169 associated targets will unite and mobilize efforts from countries across the world to address urgent development issues like poverty, inequality, and climate change [1][2][3][4]. Although significant progress has been made towards the achievement of these goals, some of the actions and policies have not been implemented effectively because of the complexity of the Earth system and human-environment interactions. Global climate change is still progressing quickly and many people are still living in poverty. Therefore, it is important to understand the global distribution of wealth, characterize socioeconomic well-being, and predict environmental change at appropriate spatiotemporal resolutions to facilitate the implementation of policies and the achievement of the SDGs [5].
Measuring socioeconomic data in a timely and accurate manner is important for evaluating current socioeconomic status and assessing policy effectiveness. Doing this well helps countries achieve many of the SDGs, including sustainable development, eradication of poverty, and reduction of inequality and exclusion. It also helps practitioners, scientists, and policymakers compare levels of development across the globe to inform efforts toward achieving the SDGs. However, collecting these data can be costly and challenging for many less-developed countries. In recent years, the availability of remotely sensed images has greatly helped scientists monitor human activity on the Earth [6]. For instance, nighttime light (NTL) data are widely used for estimating and evaluating socioeconomic activities since they capture the artificial light on the Earth's surface [7][8][9]. Remote sensing technology and satellite imagery have provided global and regional economic data for understanding and evaluating the relationship between human development and nature [10]. There are many difficulties associated with collecting traditional census data for measuring human well-being. For example, accurate information about the size and distribution of the human population is not available for many regions of the world, and sometimes these data are of poor quality [11]. Hence, remote sensing data offer scientists an alternative way to study and monitor human activities that is timely, consistent, and affordable. NTL data differ from other remote sensing data in that they capture the artificial light on the Earth's surface and offer a unique view of human activity [9][12][13][14][15][16]. For example, NTL imagery has been used to demonstrate quantitative relationships between NTL and both population and energy consumption in the USA [17,18].
The Visible Infrared Imaging Radiometer Suite (VIIRS) is a new and improved platform for developing global NTL data products. Its predecessor, the Defense Meteorological Satellite Program Operational Linescan System (DMSP-OLS), was primarily designed for detecting cloud cover. Researchers discovered that DMSP-OLS nighttime images in the visible and near-infrared (VNIR) band could be used to observe and detect VNIR emission sources (e.g., city lights, auroras, gas flares, and fires). Thus, DMSP-OLS NTL data have been used in many fields, including the measurement of (1) human settlements, (2) urban population and socioeconomic activity, and (3) energy and electricity consumption, and the monitoring of (4) gas flaring, (5) forest fires, and (6) the impacts of military actions and natural disasters [19]. In recent years, the VIIRS sensor, which is equipped with the Day/Night Band (DNB), has outperformed DMSP-OLS in many ways, including greater dynamic range, finer spatial resolution, and lower detection limits [20,21].
Over the past decades, scientists have proposed many methods to estimate GDP at national and subnational levels using NTL [19][22][23][24]. For example, Sutton et al. [25] estimated marketed and non-marketed economic value on a global scale and found, based on DMSP-OLS NTL, that a country's GDP was correlated with the amount of light energy it emitted. Shi et al. [26] used VIIRS NTL to estimate GDP and electric power consumption and concluded that it can be a strong tool for evaluating socioeconomic indicators. Nevertheless, some researchers found that NTL alone is insufficient to capture the spatial heterogeneity of GDP at subnational levels. First, NTL is not a direct measurement of socioeconomic activity. Second, in many developing regions like Sub-Saharan Africa, a large portion of the population is engaged in agricultural activities, which emit little light. Thus, NTL alone cannot estimate GDP accurately in these regions [27]. Some researchers have therefore started to estimate socioeconomic indicators from NTL separately for urban and rural regions in order to capture their different levels of productivity [28].
This paper presents an approach that uses VIIRS NTL data to estimate socioeconomic development metrics for the world. We studied socioeconomic development with a focus on the measurement of inequality, as inequality indicators reflect socioeconomic status and the current level of development. Rising levels of economic inequality can lead to a series of consequences, including lower rates of economic growth and happiness and higher rates of crime, health problems, and poverty [29]. To improve on current data processing methods, we estimated GDP separately for rural and urban regions using land cover classification data. We also used a machine learning-based data processing method to filter NTL outliers that were not related to socioeconomic activities, so as to capture the spatial heterogeneity of GDP distribution at the pixel level. We adopted the unsupervised Isolation Forest (iForest) machine learning model to automatically detect and remove irrelevant NTL data and thereby improve model accuracy [30]. We applied this method to develop two NTL-based indexes: (1) NTL-Gini, which is adapted from the Gini coefficient and based on the cumulative shares of population and gross domestic product (GDP), and (2) NTL-2020, which is adapted from the 20:20 ratio and shows how much richer the top 20% of the population is than the bottom 20% based on the cumulative distribution of GDP and population. We produced the two NTL-based indexes to investigate and estimate the current development progress of countries around the world [31].
This paper is organized as follows. Section 2 describes the data and methods we used for developing NTL-based indexes and the model that was developed for evaluating social and economic status. In Sections 3 and 4, we present and evaluate the results and compare them with the actual data. Finally, we summarize the results and draw conclusions in Section 5.

Data Collections
The datasets used in this study are described in Table 1. This study used multisource geospatial data in tandem with aggregate national socioeconomic data to develop NTL-based inequality indexes. We used the stable, cloud-free VIIRS NTL ("vcm-orm-ntl") product, which is produced by the National Oceanic and Atmospheric Administration (NOAA) and the National Aeronautics and Space Administration (NASA). We used the "vcm-orm-ntl" data to calculate the total NTL intensity in the urban region of each administrative unit in order to estimate development in urban regions. NTL can be used to estimate the economic activity contributed by commercial and industrial activities. This product improves our estimates of socioeconomic data, since it contains cloud-free average radiance values from which outliers caused by fires and other ephemeral lights have been removed. The administrative boundary file was collected from the Database of Global Administrative Areas (GADM) [32]. We mainly used the country (level 0) and subdivision (levels 1 and 2) data for national and subnational data processing. Additionally, population distribution and settlement data at the 1 km² level were obtained from the Global Human Settlement (GHS) datasets, which contain location, population, and urban extent information for the human presence on the planet from 1975 to 2015. Human settlement information was obtained from the GHS Settlement Model grid (SMOD), which distinguishes urban centers (densely populated areas), urban clusters (towns and suburbs), and rural grid cells (rural areas) and thus allows urban regions to be separated from rural regions. The GHS population data describe the distribution and density of population, expressed as the number of people in each cell. For example, Figure 1 shows the NTL, SMOD, and population data distribution in mainland China and Afghanistan.
Socioeconomic statistics were obtained from the World Bank and United Nations Development Programme (UNDP) databases. We used income classification data to group countries, and we used the agriculture, forestry, and fishing value added (% of GDP) indicator to calculate the proportion of GDP in rural regions as an estimate of agricultural activity.

Dataset | Description | Sources
Population | Global spatial information for the human presence on the planet in 2015 with 1 km² spatial resolution. | GHS [33]
Human Settlement | Global spatial information for the human settlement (urban and rural) in 2015 with 1 km² spatial resolution. |

Data Pre-Processing
Results from previous studies show that using NTL alone is insufficient to accurately measure GDP at subnational levels [27]. To assess the different levels of socioeconomic development within a country, we separated the data into urban and rural subsets to capture the different levels of industrial, commercial, and agricultural activity on a subnational scale using level 0, 1, and 2 administrative districts from GADM. The data preparation consisted of three steps (Figure 2). In step 1, we converted the GHS population raster to points in ArcGIS Pro 2.4.1 and used the converted points to sample the human settlement class and the NTL intensity from the SMOD and VIIRS NTL layers. Each data point therefore carried the following attributes: population, NTL value, SMOD value, administrative district at each level, income classification (based on country), and a unique point identifier. In step 2, we separated the population points into urban and rural sets based on the SMOD value. In step 3, we applied the iForest method to identify NTL outliers in urban regions that were not related to socioeconomic activities and reclassified them as 0. Although we selected the "vcm-orm-ntl" product, irrelevant light sources that are not filtered by the product processing algorithm still remain due to its sensitivity [28]. Moreover, NTL does not directly measure socioeconomic activity and contains irrelevant data that can greatly affect GDP estimates, especially at the 1 km² level [38,39]. We therefore adopted a series of measures to remove these irrelevant data in step 3. Since we used NTL to measure GDP only in urban regions, we first selected the data points in urban regions based on their attributes (SMOD value > 20).
Then, we used the unsupervised iForest machine learning model to detect anomalies among the urban pixels based on the population and NTL attributes of the urban data points. Over the past decades, many anomaly detection models have been developed based on classification, clustering, and statistical methods. Some researchers have used the DMSP data as a mask to extract the VIIRS data, or have built a normal profile for the NTL data in order to remove outliers based on the range of the NTL intensity distribution. However, these methods are very difficult to apply globally because of the variation among countries. Furthermore, some of them can only be applied to small, low-dimensional datasets. By contrast, the iForest model can automatically detect irrelevant NTL outliers [30] because it (1) does not rely on the distribution profile of the NTL data, (2) is suitable for processing large datasets, and (3) is specifically designed to detect anomalies. Studies have demonstrated that the iForest method can outperform many existing model-based, distance-based, and density-based methods [30], and that it scales better to large datasets than traditional methods like density-based spatial clustering of applications with noise (DBSCAN) [40]. We used the data points for all countries as input so that iForest could detect outliers based on the joint pattern of population and NTL values. For instance, points with an extremely high NTL value and a low population were identified as anomalies (Figure 3). The identified NTL outlier values were changed to 0. Since we had already used a filtered VIIRS NTL product, only a small proportion (about 0.1%) of the NTL values were detected as outliers.
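The outlier-filtering step can be illustrated with a compact, standard-library-only version of the Isolation Forest idea. This is a minimal sketch rather than the implementation used in this study: the function names, the number of trees, the subsample size, the synthetic urban points, and the choice to zero out only the single highest-scoring point are all assumptions made for illustration.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup
    over n points; used to normalise isolation depths."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Recursively split the points on a random feature at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return ("leaf", len(points))
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # simplification: stop if the chosen feature is constant
        return ("leaf", len(points))
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return ("node", dim, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))

def path_length(tree, point, depth=0):
    """Depth at which the point lands, plus a correction for unsplit leaves."""
    if tree[0] == "leaf":
        return depth + c_factor(tree[1])
    _, dim, split, left, right = tree
    child = left if point[dim] < split else right
    return path_length(child, point, depth + 1)

def anomaly_scores(points, n_trees=100, sample_size=256, seed=0):
    """Score in (0, 1]; values near 1 mean the point is isolated quickly.
    Assumes at least two points."""
    rng = random.Random(seed)
    size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(size))
    trees = [build_tree(rng.sample(points, size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c_factor(size)
    scores = []
    for p in points:
        mean_depth = sum(path_length(t, p) for t in trees) / n_trees
        scores.append(2.0 ** (-mean_depth / norm))
    return scores

# Synthetic urban points as (population, NTL radiance); values are made up.
data_rng = random.Random(42)
urban = []
for _ in range(200):
    pop = data_rng.uniform(500, 5000)
    urban.append((pop, pop / 100 + data_rng.uniform(-2, 2)))
urban.append((5.0, 400.0))  # hypothetical bright, nearly unpopulated pixel

scores = anomaly_scores(urban)
worst = max(range(len(urban)), key=scores.__getitem__)
# Set the NTL value of the flagged point to 0, as described in the text.
cleaned = [(pop, 0.0 if i == worst else ntl)
           for i, (pop, ntl) in enumerate(urban)]
```

Because each split is drawn uniformly within a feature's observed range, the scores do not depend on the units of population or radiance, which is one reason a single model can be run over points from all countries.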

GDP and Inequality
The Gini coefficient is a statistical measure of economic inequality based on income distribution [41,42]. A higher Gini coefficient indicates a higher degree of income inequality, whereas a lower value indicates a lower degree of income inequality. Elvidge et al. [12] developed the night light development index (NLDI), which uses Lorenz curve analysis to examine the co-distribution of NTL and population by sorting the NTL values in ascending order. Nevertheless, this index may not be sufficient to capture the distribution of economic activity: because NTL is not a direct measure of income or wealth, it cannot accurately represent the spatiotemporal variation of income distribution. Therefore, we combined the NTL values (nW/cm²/sr) with the actual agricultural production ratios and the population distribution in order to improve this characterization of economic activity. We used (1) the urban data points (SMOD value > 20) with NTL values to measure the distribution of economic activity in urban regions and (2) the rural data points (SMOD value < 20) with population density to measure the distribution of economic activity in rural regions. We obtained the aggregate GDP of each district as the sum of its rural and urban GDP (in constant 2011 U.S. dollars). We defined the urban pixel value (UV) and the rural pixel value (RV) of GDP as follows, where i is the unique identifier of the population pixel (derived from the GHS population layer), SnNTL is the total NTL in the district, TotNTL is the total NTL for the country, PopVi is the population count of the corresponding pixel, TotRuPop and TotUrPop are the total rural and urban populations of the country, AgRatio is the proportion of agricultural production in total GDP, and GDP is the national GDP at purchasing power parity (in constant 2011 U.S. dollars) obtained from the World Bank database.
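The allocation idea described above can be sketched in a few lines. The weighting used here is an assumption for illustration and should not be read as the definitions of UV and RV: national GDP is split into a rural pool (GDP × AgRatio) and an urban pool (the remainder), the urban pool is distributed in proportion to each urban pixel's NTL value, and the rural pool in proportion to each rural pixel's population. The function and argument names are hypothetical, and the sketch does not use the district-level term SnNTL or the urban population total TotUrPop.

```python
def allocate_gdp(urban_ntl, rural_pop, national_gdp, ag_ratio):
    """Distribute national GDP over urban and rural pixels.

    urban_ntl: NTL values of the urban pixels (outliers already set to 0)
    rural_pop: population counts of the rural pixels
    ag_ratio:  share of agriculture in national GDP, between 0 and 1
    Assumed rule: urban pool weighted by NTL, rural pool by population.
    """
    tot_ntl = sum(urban_ntl)
    tot_ru_pop = sum(rural_pop)
    urban_pool = national_gdp * (1.0 - ag_ratio)
    rural_pool = national_gdp * ag_ratio
    uv = [urban_pool * v / tot_ntl for v in urban_ntl]
    rv = [rural_pool * p / tot_ru_pop for p in rural_pop]
    return uv, rv

# Toy country: two urban pixels, two rural pixels, GDP of 1000 units.
uv, rv = allocate_gdp([10.0, 30.0], [200.0, 600.0],
                      national_gdp=1000.0, ag_ratio=0.25)
```

Under this rule the pixel values always sum back to the national GDP, so aggregating pixels to any district level preserves the national total.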
Based on the procedures described above, we produced a gridded GDP product at the 1 km² level for countries around the world (Figure 4). The subnational GDP calculation was based on the aggregate NTL and population of level 2 GADM districts. For countries without AgRatio data, we estimated AgRatio as the total rural population divided by the total national population. The NTL-Gini and NTL-2020 ratios were calculated from the cumulative distribution of aggregate GDP over level 1 and 2 districts for countries around the world. We sorted the district-level aggregate GDP data in ascending order to construct a GDP distribution profile for each country, based on the fraction of population and the cumulative share of GDP at the subnational levels, and to plot the Lorenz curve (Figure 5). The NTL-Gini coefficient is equal to the area marked A divided by the sum of the areas marked A and B in Figure 5a (Gini index = Area A/(Area A + Area B)). We also calculated the 20:20 ratio from the district-level distribution of NTL GDP as a second measure of inequality (Appendix A). Higher 20:20 ratios indicate higher income inequality [43][44][45]. The 20:20 ratio can be more revealing than the Gini coefficient since it shows how much wealthier the top 20% of the population is than the bottom 20%. Many studies have shown that it can be a more useful measure for evaluating other development issues like health and social problems. To calculate the 20:20 ratio, we used the same distribution profile for each country and took the ratio between the total GDP of the top 20% of the population and the total GDP of the bottom 20% of the population.
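The two indexes can be computed from district totals as in the following minimal sketch. Several details are assumptions rather than statements about the procedure used here: districts are ordered by GDP per capita, the Lorenz curve is treated as piecewise linear between districts, and the 20% population cut-offs are obtained by linear interpolation along that curve. All function names are hypothetical.

```python
def lorenz_curve(districts):
    """districts: list of (population, gdp) pairs with positive population.
    Returns cumulative population shares and cumulative GDP shares."""
    ordered = sorted(districts, key=lambda d: d[1] / d[0])  # poorest first
    tot_pop = sum(p for p, _ in ordered)
    tot_gdp = sum(g for _, g in ordered)
    xs, ys = [0.0], [0.0]
    cum_pop = cum_gdp = 0.0
    for pop, gdp in ordered:
        cum_pop += pop
        cum_gdp += gdp
        xs.append(cum_pop / tot_pop)
        ys.append(cum_gdp / tot_gdp)
    return xs, ys

def gini(xs, ys):
    """Area A / (Area A + Area B), where B is the area under the Lorenz
    curve (trapezoid rule) and A + B = 0.5."""
    area_b = sum((xs[k] - xs[k - 1]) * (ys[k] + ys[k - 1]) / 2.0
                 for k in range(1, len(xs)))
    return (0.5 - area_b) / 0.5

def gdp_share_at(xs, ys, x):
    """Cumulative GDP share held by the poorest fraction x of the population."""
    for k in range(1, len(xs)):
        if x <= xs[k]:
            t = (x - xs[k - 1]) / (xs[k] - xs[k - 1])
            return ys[k - 1] + t * (ys[k] - ys[k - 1])
    return ys[-1]

def ratio_20_20(xs, ys):
    """GDP of the top 20% of the population over GDP of the bottom 20%.
    Undefined if the bottom 20% holds no GDP."""
    bottom = gdp_share_at(xs, ys, 0.2)
    top = 1.0 - gdp_share_at(xs, ys, 0.8)
    return top / bottom

# Two equally populated districts, one with three times the GDP of the other.
xs, ys = lorenz_curve([(100, 1.0), (100, 3.0)])
ntl_gini = gini(xs, ys)         # 0.25
ntl_2020 = ratio_20_20(xs, ys)  # approximately 3.0
```

In the two-district example the Lorenz curve passes through (0.5, 0.25), so the area under it is 0.375 and the Gini value is (0.5 − 0.375)/0.5 = 0.25, while the 20:20 ratio simply recovers the 3:1 gap in GDP per capita.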

Subnational GDP Validation
The performance of the NTL-based development indexes was evaluated using actual GDP data from the Organisation for Economic Co-operation and Development (OECD) Regional Statistics and Indicators, Gini coefficient data from the World Bank databank, and 20:20 ratios from UNDP. We first validated the subnational GDP product against the 249 regional GDP records (large regions, TL2) from OECD administrative units that matched the level 1 districts from GADM. Regional totals of the NTL-based GDP were aggregated from the 1 km² gridded NTL GDP product using the zonal statistics tool in ArcGIS Pro. We produced a cross-sectional fit comparing the NTL-based GDP against the actual GDP of the OECD regions. As shown in Figure 6, the NTL-based subnational GDP has a high coefficient of determination (R² = 0.761). In addition, since many researchers have studied the relationship between total NTL within regions and GDP using simple linear regression [26,46], we also produced a cross-sectional fit comparing the sum of NTL within districts against the actual GDP of the OECD regions (R² = 0.684). The results in Figure 6 show that NTL GDP better reflects the actual GDP values. Nevertheless, due to the small size of the validation dataset (n = 246), it is difficult to evaluate the accuracy of the results globally at various spatial scales. We also compared our data with the electricity accessibility data [47] and the Gridded GDP datasets [48]. Figure 7 shows that because the Gridded GDP datasets only contain national-level GDP per capita information for Uganda, the GDP data (Figure 7c) depend mainly on population density (Figure 7a). They therefore fail to show the subnational variation of economic activity in urban and rural regions. Figure 7b shows the electricity access estimation (distribution of people without access to electricity) near Kampala, Uganda [47].
Both the NTL GDP data and the electricity access rate show that the Gridded GDP dataset [48] overestimates GDP in many regions outside Kampala despite the fact that these regions have a low electricity access rate and are less developed. This is possibly because the Gridded GDP product is incapable of differentiating levels of productivity within districts and fails to capture the spatial heterogeneity of GDP at various spatial scales.
Figure 7. Comparisons of population distribution, electricity accessibility [47], Gridded GDP at the 1 km² level [48], and NTL GDP data around Kampala, Uganda. (a) Population data, (b) electricity access data, (c) GDP data from Gridded global datasets, (d) NTL-based GDP.
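The validation described in this section, summing pixel GDP within each region and then fitting the regional totals against reported GDP, can be sketched as follows. This is an illustrative stand-in, not the ArcGIS Pro workflow: the region identifiers and values are invented, the function names are hypothetical, and the fit is assumed to be an ordinary least-squares line on untransformed values.

```python
def zonal_sum(pixels):
    """pixels: iterable of (region_id, gdp_value); returns totals per region."""
    totals = {}
    for region, value in pixels:
        totals[region] = totals.get(region, 0.0) + value
    return totals

def r_squared(x, y):
    """Coefficient of determination of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    syy = sum((b - mean_y) ** 2 for b in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return 1.0 - ss_res / syy

# Invented pixel values for three hypothetical regions.
pixels = [("R1", 0.4), ("R1", 0.6), ("R2", 2.0), ("R3", 1.5), ("R3", 1.5)]
estimated = zonal_sum(pixels)                 # R1: 1.0, R2: 2.0, R3: 3.0
reported = {"R1": 1.0, "R2": 3.0, "R3": 2.0}  # invented reference values

regions = sorted(estimated)
fit = r_squared([estimated[r] for r in regions],
                [reported[r] for r in regions])  # 0.25 for these toy values
```

The same `r_squared` function applied to the sum of raw NTL per region, instead of the allocated GDP, gives the baseline comparison mentioned in the text.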

Inequality Validation
To evaluate whether the NTL-based GDP distribution can predict inequality accurately, we compared the NTL-based inequality indexes against the actual Gini index and 20:20 ratio data using the root mean square error (RMSE) and the mean absolute error (MAE) (Figure 8). We normalized all inequality data into the range of 0 to 1. We compared the data both by income-level category and for all countries together. The inequality validation results show an overall smaller deviation between the NTL-2020 ratios and the actual data from UNDP, indicating that NTL and population data better capture the difference in wealth between the top and bottom 20% of the population in urban and rural regions. The NTL-Gini coefficient and the NTL-2020 ratio have similar RMSE and MAE for high-income and low-income countries, whereas the NTL-2020 ratios have smaller RMSE and MAE for upper-middle-income and lower-middle-income countries. This suggests that the overall GDP distribution profile may be more accurate for developed and less-developed countries, as they tend to rely more on tertiary industry (which can be captured by NTL) and primary industry (captured by population density), respectively. For many developing countries (with upper-middle and lower-middle incomes), where there tends to be greater socioeconomic inequality, the NTL-2020 ratios better capture this unequal distribution of income and opportunity. For instance, studies [27] have shown that the correlation between light intensity values and economic activity is much weaker for countries that are dependent on agriculture. This is probably because most of these countries are developed or industrialized countries that have advanced their technology infrastructures and developed their economies. Therefore, it is harder to measure socioeconomic development in these countries.
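The error metrics used in this comparison can be written out explicitly. The sketch below assumes min-max scaling for the normalization to the range 0 to 1, which is one common choice and is not specified in the text; the function names and the sample values are hypothetical.

```python
import math

def min_max(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1.
    Assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def rmse(estimated, actual):
    """Root mean square error between paired estimates and reference values."""
    n = len(estimated)
    return math.sqrt(sum((e - a) ** 2 for e, a in zip(estimated, actual)) / n)

def mae(estimated, actual):
    """Mean absolute error between paired estimates and reference values."""
    n = len(estimated)
    return sum(abs(e - a) for e, a in zip(estimated, actual)) / n

# Invented index values for three countries, already on a 0 to 1 scale.
ntl_index = [0.0, 0.5, 1.0]
reference = [0.1, 0.5, 0.8]
error_rmse = rmse(ntl_index, reference)  # sqrt(0.05 / 3)
error_mae = mae(ntl_index, reference)    # 0.1
scaled = min_max([20.0, 40.0, 60.0])     # [0.0, 0.5, 1.0]
```

RMSE weights large deviations more heavily than MAE, so reporting both indicates whether the error for a country group is driven by a few poorly estimated countries.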

Discussion
The NTL global GDP data at 1 km² can be aggregated to different subnational levels to support analysis at multiple spatial scales. In general, the NTL data collected by VIIRS can help us not only monitor light sources but also study various human activities. By using multisource data, we developed an NTL-based index to estimate different levels of socioeconomic development in urban and rural regions in an efficient and accurate manner. The VIIRS data were capable of capturing commercial and industrial activities more accurately than DMSP-OLS. Although DMSP-OLS NTL data have been widely used to study human activities because they detect anthropogenic lighting sources, they still present many significant problems. For instance, the data have deficiencies such as coarse spatial resolution, saturation, lack of in-flight calibration, and lack of low-light imaging spectral bands suitable for discriminating lighting types [8]. Elvidge et al. [20] compared the capabilities of DMSP-OLS and VIIRS and concluded that VIIRS is superior to DMSP-OLS in many ways. Therefore, as more VIIRS products are released (monthly and annual), there is great potential to capture human development at various spatiotemporal scales. Furthermore, NTL imagery can potentially become an alternative means for scientists to measure and assess socioeconomic development in support of the SDGs. First, there are many limitations in collecting and calculating traditional inequality data. NTL can help us generate reliable estimates to evaluate whether cities have achieved the sustainable development goals on a global scale. Second, NTL estimates can be combined with multisource data to help people understand the current water, energy, and food security nexus and to evaluate and manage the capacity of our growth. Third, it is important to develop different measures based on various sources of data and to evaluate a nation's sustainable development based on multiple indexes.
The current method is also limited by the availability of accurate data for model optimization and validation. For example, our model assumed that rural regions depend mainly on agricultural activities. However, other labor-intensive activities, such as mining, oil extraction, and refining, can also contribute to rural GDP. Therefore, it is also important to collect more accurate socioeconomic data at various spatial scales to improve the accuracy of the GDP estimates.

Conclusions
Our approach is suitable for measuring the distribution of GDP at both subnational and national levels. Estimating NTL-based GDP separately for urban and rural regions helped us capture the spatial heterogeneity of the GDP distribution better than the simple linear regression method based on NTL values only. The iForest machine learning method made it easier to detect outliers in the urban NTL data and thus to better estimate the GDP distribution at the pixel level (1 km²). Furthermore, the NTL-based indexes were useful for estimating a variety of inequality indicators. Nevertheless, the performance of the NTL-based indexes varied with the level of development of the countries considered. In the future, several options could extend this research: (a) incorporating an advanced machine learning model or hybrid model to improve model performance; (b) collecting more historical socioeconomic data to analyze development changes over time and make forecasts; (c) estimating inequality at subnational levels and validating the results using ground-truth data; and (d) incorporating more variables to train the model so that it can be customized and adjusted for different inequality evaluation purposes. Moreover, monthly VIIRS NTL products are now also available. There is great potential for us to understand the dynamics of human population changes within cities, assess our ecological footprints, estimate the demand for resources, and evaluate the limits of our growth [49][50][51].