Evaluating the Ability to Use Contextual Features Derived from Multi-Scale Satellite Imagery to Map Spatial Patterns of Urban Attributes and Population Distributions

: With an increasing global population, accurate and timely population counts are essential for urban planning and disaster management. Previous research using contextual features, using mainly very-high-spatial-resolution imagery (<2 m spatial resolution) at subnational to city scales, has found strong correlations with population and poverty. Contextual features can be deﬁned as the statistical quantiﬁcation of edge patterns, pixel groups, gaps, textures, and the raw spectral signatures calculated over groups of pixels or neighborhoods. While they correlated with population and poverty, which components of the human-modiﬁed landscape were captured by the contextual features have not been investigated. Additionally, previous research has focused on more costly, less frequently acquired very-high-spatial-resolution imagery. Therefore, contextual features from both very-high-spatial-resolution imagery and lower-spatial-resolution Sentinel-2 (10 m pixels) imagery in Sri Lanka, Belize, and Accra, Ghana were calculated, and those outputs were correlated with OpenStreetMap building and road metrics. These relationships were compared to determine what components of the human-modiﬁed landscape the features capture, and how spatial resolution and location impact the predictive power of these relationships. The results suggest that contextual features can map urban attributes well, with out-of-sample R 2 values up to 93%. Moreover, the degra-dation of spatial resolution did not signiﬁcantly reduce the results, and for some urban attributes, the results actually improved. Based on these results, the ability of the lower resolution Sentinel-2 data to predict the population density of the smallest census units available was then assessed. The ﬁndings indicate that Sentinel-2 contextual features explained up to 84% of the out-of-sample variation for population density.


Introduction
The world population is projected to reach 9.8 billion by 2050, with most of the growth in developing countries and with 68% predicted to live in urban areas [1]. Current and accurate population data are important for managing time-sensitive issues such as vulnerable populations identification, disease impact, natural disaster or emergency response, management, and evacuations [2][3][4][5][6][7][8]; administrative and legislative issues such as resource and service allocation, policymaking, planning, and boundary delineation [9][10][11][12][13]; private and social research such as selecting sites for businesses or assessing health care accessibility [13][14][15][16]; and assessing environmental impacts [17,18].
Census data, while useful, has a number of limitations: (1) countries usually conduct censuses at most once every 10 years, as recommended by the United Nations [13], which affects their relevance, as high migration and urban growth rates can make existing data quickly outdated [19]; (2) due to privacy reasons, census data are usually aggregated, generalized, and not available at the local scale [20][21][22]; (3) census units do not necessarily align with human settlement boundaries [20][21][22][23]; (4) censuses are resource intensive, which

Spatial Features and Remote Sensing
Texture analysis and spatial feature extraction can identify patterns and homogeneity in spatial configurations that go beyond spectral patterns or color intensities. Approaches include Gabor, gray-level co-occurrence matrix (GLCM) method, histogram of oriented gradients (HOG), local binary pattern (LBPM), and local edge pattern (LEP) [51,52].
In remote sensing, spatial feature extraction has primarily been used for land use and land cover (LULC) classification [51]. It claims to improve classification accuracy when compared to per-pixel classification, especially for very-high-spatial-resolution (VHSR; less than 2 m spatial resolution) imagery, due to greater pixel variance [53].
The traditional, most popular approach for LULC classification with spatial features is GLCM [51,54]. Bayram et al., [51], however, found that GLCM is not particularly accurate, along with LEP and edge-oriented features. Abeigne Ella et al., [54] tried to identify the texture feature extraction methods that produce the best results for urban settlement classification, specifically focusing on informal settlements. The authors found that LBPM was better than GLCM as it was more consistently accurate and not significantly affected by the number of sampling points.
Zhang et al., [55] found that multiple texture features were necessary to accurately classify LULC. They recommended using at least three or four, and beyond that threshold, the classification accuracy improvement became negligible. For homogeneous areas, the number of texture features needed was less than for spatially varying areas. They found that using the mean derived from 10 m SPOT (Satellite Pour l'Observation de la Terre) panchromatic imagery of Beijing produced the best results when only one texture feature is used to capture urban spatial patterns, while the combination of mean and GLCM produced the best results when two texture features are used.
Graesser et al., [56] introduced the technique of using various pixel windows over which to calculate spatial features and then reporting the statistical values (e.g., mean, minimum, maximum) back to neighborhoods of pixels instead of looking at individual pixels. As the scales of human settlements vary spatially, this technique captures a variety of patterns for an area while maintaining a pixel's neighborhood context, and hence can be considered a contextual features approach. For instance, Engstrom et al., [57] and Graesser et al., [56] used contextual features to map slums or deprived areas in multiple cities. This tends to work well with VHSR imagery in urban areas because the objects of interest are generally made up of multiple pixels. Moreover, this approach reduces the amount of class outliers and computational processing needed.
While contextual features have been used in classification, other research has examined if the features themselves are directly correlated to measures of building density, road density, building count, population, and poverty. Engstrom et al., [58] calculated contextual features from VHSR imagery and correlated them with urban attributes (building area, building count, building density, built-up area, built-up percent, road area, road length, and road density) from OpenStreetMap (OSM). Their model explained 70% to 92% of the variance in urban attributes. Engstrom, Hersh, and Newhouse [59] and Engstrom, Newhouse, et al., [60] found that contextual features calculated from VHSR imagery were also highly correlated to poverty measures. Engstrom, Newhouse, et al., [60] found that with an ordinary least squares (OLS) linear regression model, spatial features explained up to 54% of the variance in poverty within Sri Lanka. This underscores the ability of spatial features to detect human modifications to the landscape.
Many researchers have emphasized the importance of accurate and timely population counts for various purposes such as disaster management [26,28,34,35,38,[40][41][42]47,52,61]. To improve population models, these researchers use either VHSR satellite imagery [29,32,35,41,47,52,58] or medium-spatial-resolution imagery such as from Landsat [23,26,28,34,35,37,38,40,42,47,50,62]. While VHSR imagery may be more accurate than lower-spatial-resolution imagery [62], it is not always accessible as it is not free. Similar to the challenges of conducting a census [24], this can introduce cost barriers for resourcelimited countries. Given that Landsat has been effective at modeling demographic variables, the freely available Sentinel-2 imagery with its 10 m spatial resolution can also be beneficial and possibly more accurate. While coarser than VHSR imagery, the every-five-day global coverage of Sentinel-2 and its availability in Google Earth Engine for global processing makes this a powerful resource for working over large areas [63].
The bottom-up methodology of using contextual features in remote sensing fits into the trend of using imagery-derived textures to model population [32]. The methodologies discussed in Engstrom, Newhouse, et al., [60] and Engstrom et al., [57] are context-dependent, where the correlated variables in one area of a country were not necessarily the same in another. To date, there has not been any work identifying how correlations compare across countries and sub-regions or if there is a global indicator for estimating these attributes. This paper examines how contextual features are related to population density and human modification to the landscape (hereinafter synonymous with urban attributes). We define contextual features as the statistical quantification of edge patterns, pixel groups, gaps, textures, and the raw spectral signatures calculated over groups of pixels or neighborhoods. Given the ability of texture analysis to identify human presence on satellite imagery and the demonstrated relationship between remotely sensed variables and population density, it was hypothesized that contextual features will be strongly correlated with human settlements and population density. We specifically address the following questions:

1.
What do contextual features derived from VHSR imagery represent in the humanmodified landscape? 2.
How do these representations of the landscape change as the spatial resolution of the satellite imagery changes (from VHSR imagery to Sentinel-2 imagery)?

3.
How do contextual features derived from Sentinel-2 relate to population density based on census data?

4.
To what extent can a population density model be built based on contextual features to allow for the dasymetric mapping of population density in multiple countries?

Materials and Methods
To answer the research objectives, contextual features were analyzed in relation to urban attributes and population density ( Table 1). The methodology used largely followed that of Engstrom et al., [58] (Figure 1).

Study Areas
The study areas for this research include portions of Belize, Sri Lanka, and the city of Accra, Ghana. These three locations comprise a range of cities, urban and rural populations, and land cover characteristics in three different regions-Latin America and the Caribbean, South Asia, and Sub-Saharan Africa-spanning three different continents ( Figure 2).
For each area, we have access to fine spatial resolution census data and urban attributes from OSM. Polygon shapefiles delineating the Enumeration Areas (EA) in Ghana were provided by Ghana Statistical Service (GSS) [64], Enumeration Districts (ED) in Belize were provided by the Statistical Institute of Belize (SIB) [65], and Gram Niladhari Divisions (GN) in Sri Lanka were provided by the Department of Census and Statistics (DCS) [66]; these enumeration units are the census units used in our analysis. There are 2403 EAs in Accra, 723 EDs in Belize, and 14,021 GNs in Sri Lanka ( Table 2). Shapefiles of national-level administrative boundaries (level 0) were also obtained [67,68].    [70]; La Dade-Kotopon, 183,528 [69]; and LEKMA, 227,932 [71]. In the AMA, the 2010 census counted 149,689 houses, with an average household size of 3.7 people [70]; in La Dade-Kotopon, 19,174 houses, 3.6 people [69]; and in LEKMA, 21,366 houses, 3.6 people [71]. Most lived in compound houses [69][70][71]. As of 2010, Ghana is projected to experience a two-fold increase in population by 2038 [72]. Accra has been experiencing rapid urbanization [71,73] and population growth due to natural increase and rural-urban migration [71,[74][75][76], creating socio-economic, health, environmental, and institutional challenges [69][70][71]73 [80] found that most households resided in undivided private houses. Living conditions in Belize are unique in that cities and towns have relatively low densities of residential development due to the availability of land, small city sizes, and ownership of large properties [81,82]. Belize is thus still a rural country, with 54% of its citizens living in rural areas [83]. The country, however, has seen its population double since 1980, mostly via immigration [84] and a relatively new trend of lifestyle migration [85]. . The proportion of Sri Lanka's population living in urban areas has remained close to 18.5% due to an emphasis on rural development programs [83,88]. These official statistics may not reflect that some communities have urban characteristics but are officially classified as rural [88,89]. Sri Lanka faces numerous urban land management challenges including increasing population density, urban sprawl, rapid urban expansion, and pressure on the country's road infrastructure [88,89].
Sentinel-2 image mosaics were created in Google Earth Engine. A cloud-free image for each country or city was extracted using the median pixel in the four 10 m bands: blue, green, red, and NIR (near-infrared). For Belize and Sri Lanka, imagery from Sentinel-2 A and B satellites between 1 January 2017, and 31 March 2018, was obtained to create the single image composites [93]. For Accra, imagery covering an area of approximately 1250 km 2 from 1 January 2019, to 1 January 2020, was used [94].

Urban Attributes
To capture the human-modified landscape, building footprint polygons and road polylines were downloaded from OSM [95,96] via GeoFabrik (https://www.geofabrik.de accessed on 30 July 2019). OSM is an open-source database of physical and man-made features digitized on a base map and continually updated [97]. OSM data for Sri Lanka were downloaded on May 21, 2019 [95]; for Belize, 20 August 2019 [95]; and for Accra, 25 June 2020 [96].

Population
Census data were provided in table format by GSS [64] for Accra in 2010, SIB [65] for Belize in 2010, and DCS [66] for Sri Lanka in 2012. These data were joined to the shapefiles for Accra at the EA level with 2318 records; Belize at the ED level, 775 records; and Sri Lanka at the GN level, 14,001 records.

Contextual Features
The contextual features are calculated by comparing each pixel or group of pixels (block) with its surrounding pixels (scale; Figure 3). The block size is also the pixel size to which the contextual feature statistics are reported [58]. Multiple scale sizes were chosen because the extent and variability of neighborhoods vary [56]. For VHSR imagery, all contextual features were created at a block size of 8 pixels and scales of 8 m, 16 m, 32 m, and 64 m, as in Engstrom et al., [58]. For Sentinel-2 imagery, most were created using a block size of 1 pixel (10 m  The 11 contextual features calculated are a combination of features that capture edge patterns, pixel groups, gaps, textures, and the raw spectral signatures. These features are Fourier, Gabor, HOG, lacunarity, LSR, LBPM, mean, normalized difference vegetation index (NDVI), ORB, PanTex, and SFS. SpFeas (https://github.com/jgrss/spfeas accessed on 30 July 2019), an open-source Python library developed by Graesser [98], was used to process the imagery (in the blue, green, red, and NIR bands).
Fourier Transform. Fourier transform captures the frequency of patterns across an image. Any signal can be represented as a series of sinusoidal signals [99,100]; thus, an image can be decomposed into sine and cosine waves with various amplitudes and frequencies [101]. The Fourier transform consists of magnitude and phase parts, with the former usually displayed as the output image (power spectrum). In these magnitude outputs, low-frequency features, such as water, are located closer towards the origin (center), with increasing frequency farther from the origin [99]. A radial profile can be derived from a power spectrum, within which pixel frequencies can be summarized. Fourier produces two outputs: mean and variance.
Gabor. Gabor is a linear filter used for edge detection [51]. Multiple filters consisting of strips are created by a sinusoidally modulated Gaussian function [102][103][104], forming the filter bank [105,106]. The size, shape, and orientation of the filters can be set, and the various orientations enable extraction of features with those associated orientations [102,104]. A Gabor wavelet transformation is outputted [107]. There are 16 Gabor outputs: mean, variance, and 14 individual filters that examine different angles.
Histogram of Oriented Gradients. HOG identifies the orientation and magnitude of shades [108], distinguishing settlement and non-settlement classes [56]. Gradient magnitudes in both the x and y directions are calculated for each pixel and combined to obtain the magnitude and direction of the gradient [109]. The image is divided into subregions (cells), and within each, the gradient direction bins the pixels by angles (1 • -180 • ) [108][109][110]. The magnitude of each pixel is distributed to its associated bin, with the magnitude value split among two bins if the gradient direction falls between two. The aggregated magnitudes in each bin form a histogram (vector) for the cell [108,111].
Next, four cells (and their four histograms) are concatenated into a block and normalized [109,110]. All block vectors are combined to form the final HOG vector [108], and statistics can be extracted. The five statistical outputs are the maximum, mean, variance, skew, and kurtosis.
Lacunarity. Lacunarity measures the homogeneity of the landscape via the spatial distribution of gap sizes. For heterogeneous images, all gap sizes are not the same; thus, the image is not translationally invariant, and lacunarity is high [112,113]. For instance, in urban areas, there are gaps between buildings; in high density areas, there tend to be less gaps [56]. Variation in gap sizes is scale dependent [112][113][114].
One way to calculate lacunarity involves a moving window in which the number of holes is calculated [113,115]. First, an intensity surface, where the plane is the image and the z-axis (height) is the intensity (value) of the pixels, is created. A moving window of a set size is centered over one pixel, with a smaller gliding box placed in the upper left corner. If necessary, multiple boxes are stacked so all the pixel intensities fall within. The relative height is calculated using the minimum and maximum pixel values (or the boxes in which they fall) within the column. As the gliding box moves across the image window, all the relative heights are summed, and a formula is used to calculate lacunarity for that center pixel. The window repeats the process across the image [116]. Only one lacunarity value is calculated.
Line Support Regions. LSR extracts straight lines from imagery, which can determine the area and spatial configuration of settled areas [56,117,118]. Gradient orientations on an image are first calculated and used to group pixels into LSRs with similar gradient orientations. The groups that do not have enough support (pixels appropriated to a region, as described in Burns et al., [117]) are removed. A plane fit to the pixel intensities in each line support region using a least squares fit and a horizontal plane of average pixel intensities, both weighted by local gradient magnitude, are created. A line is extracted where the two planes intersect. The line's length, width, contrast (intensity change over the line), steepness (slope of intensity change), and straightness can subsequently be obtained [117]. LSR produces three outputs: line length, line mean, and line contrast.
Local Binary Pattern. LBPM assesses the homogeneity of an image, detecting bright and dark spots, flat areas, and edges [119]. After the radius and number of neighbors are specified, the value of a center pixel is compared with those of its surrounding neighbors. If the center pixel value is smaller or equal, the neighbor is given a value of 1; otherwise, the value is 0 [54,[119][120][121][122]. The values around the center pixel are taken sequentially (forming a binary string) and inputted into an equation to obtain the LBPM code for the center pixel [119,121,123]. Patterns with more than two 0-1 or 1-0 switches are not uniform, with two or less considered uniform [119,120,122]. A histogram is built with separate bins for each uniform pattern and one bin for all non-uniform patterns [119,122]; this is based on Ojala et al.'s [119] observation that certain uniform patterns appear more frequently in textures. Five statistical outputs of LBPM are produced: maximum, mean, variance, skew, and kurtosis.
Mean. The mean of the image is calculated using inverse distance weighting (IDW). IDW is an interpolation method where the influence of a point on an unknown point is inversely related with distance and dependent on the specified power setting, which controls the rate at which the influence of points decreases with increasing distance [124]. For SpFeas, pixels near the center of a frame are given higher weights [98]. In addition to mean, the variance of the pixels within the scale used is also calculated.
Normalized Difference Vegetation Index. NDVI assesses vegetation by incorporating a pixel's value in the NIR and red regions. High values (towards 1) reflect a higher density of green vegetation, and low values (towards -1) reflect a lower density [99]. NDVI values are generally lower in and negatively correlated with built-up areas due to sparser vegetation [125]. Both the mean and variance of NDVI are calculated for each scale.
ORB. A feature-based matching method introduced by Rublee et al., [126], ORB combines the Features from Accelerated Segment Test (FAST)-a feature detector-and Binary Robust Independent Elementary Features (BRIEF)-a feature descriptor-approaches.
The FAST algorithm is used to identify keypoints at each level in a scale pyramid of the image, and the Harris corner measure orders the keypoints and rejects edges picked up by FAST [126][127][128]. Intensity centroid is used to assign an orientation to the corner [126,129,130]. BRIEF selects a random pair of pixels around a keypoint, compares their intensity values, and assigns them binary values [126,131]. The orientation from the intensity centroid is used to steer BRIEF towards this orientation, as BRIEF is not invariant to rotation. A greedy algorithm takes all the pairs and creates a subset (usually 256) of uncorrelated pairs, forming a 256-bit feature descriptor output (rotated BRIEF or rBRIEF) [126,129,132]. Five statistical outputs from ORB are produced: maximum, mean, variance, skew, and kurtosis.
PanTex. PanTex extracts built-up areas from panchromatic imagery using the GLCM approach [133,134]. The textural contrast is calculated in all directions within a window around a pixel. The minimum value is taken, and the output with all the minimum values is the PanTex index. For urban areas, this minimum value would be consistently high. Pesaresi et al., [134] used minimum values over average values, reasoning that averages produce an edge effect that could overestimate built-up areas. PanTex produces one output, which is the minimum contrast.
Structural Feature Sets. SFS extracts information on direction-lines [135]. Lines from the center pixel are created in all directions. For a direction-line, a pixel is compared with the center pixel to determine whether it is considered homogenous. If it is, it is added to the direction line; the line keeps extending until a pixel is not considered homogenous based on set threshold levels or until the line reaches a set maximum length [135,136]. This is repeated for all line directions. A histogram is built from the lines, and statistics can be extracted [135]. SFS produces six outputs: maximum line length, minimum line length, mean, w-mean (weighted mean), standard deviation, and maximum ratio of orthogonal angles.
Finally, zonal statistics-mean, sum, and standard deviation-were calculated on each contextual feature output. For VHSR imagery of Sri Lanka, there were 576 contextual feature outputs in total; for Sentinel-2 imagery, there were 429 contextual feature outputs total for each census unit in all study areas. Of the 723 EDs and 14,021 GNs in Belize and Sri Lanka, respectively, 687 EDs and 13,402 GNs were completely covered with imagery and were used in the analysis.

Urban Attributes
All census units with complete and accurate road and building OSM data were identified by overlaying OSM data on top of satellite imagery [137] (Figure 4). For Accra, 314 EAs had complete OSM data; for Belize, 80 EDs; and for Sri Lanka, 333 GNs. Of the 333 GNs, 192 had VHSR imagery coverage (Colombo, Kurunegala, Batticaloa, and Negombo). In total, there were 727 census units used in this analysis. OpenStreetMap data were identified. Sources: [67,95,96,137]. Basemaps [137] reprinted in accordance with Terms of Use from Esri (2021). Copyright 2021 Esri.
The road and building shapefiles were clipped to each census unit. Within each unit, building area, building count, building density, built-up area, built-up percent, road area, road density, and road length were calculated in a fashion similar in Engstrom et al., [58]. Building area is the total area of building footprints in square meters. Building count is the number of buildings. Building density is the building count divided by the census unit area in square kilometers. Built-up area is the sum of road area and building area in square meters. Built-up percent is built-up area divided by census unit area. Road length is the aggregated length of all road segments in meters. Road density is road length in meters divided by census unit area in square kilometers.
Road area is the total area of all road segments in square meters. OSM road polylines were multiplied by estimated road widths based on their classifications and the traffic direction. Widths were determined using GIS based on satellite imagery and the OSM metadata guidance provided by Ramm [138] (Table 3).

Population Density
The census datasets were joined to their respective shapefiles. The area was calculated for each census unit, and the population was divided by the area to obtain the population density (people per km 2 ) for each unit.

Data Preparation
The datasets were combined in accordance with the four main analyses (Table 1). When the population density and contextual feature datasets were combined, 2216 EAs, 687 EDs, and 13,402 GNs remained. To reduce the large number of independent variables, bivariate correlations were conducted between each independent variable and the dependent variable, which was similarly performed in Engstrom et al., [58] and Joseph et al., [42]. Pearson's correlations, which characterize the strength and direction of a relationship, were calculated. The associated p-value for each correlation was obtained, and the 200 independent variables with the strongest correlation coefficients and p-values less than a significance level of 0.05 were kept. Finally, all variables were scaled and normalized.

Model Building
The processed data were split for each analysis: 80% for training and 20% for outof-sample testing. For an individual study area (Accra, Belize, Sri Lanka), the 80%/20% split consisted of that individual study area's dataset only; for the combined study area (Accra-Belize-Sri Lanka combined), the 80%/20% split was performed after combining all the individual study areas' datasets. The 80% subsets were used to create elastic net regularization (ENR) and random forest models to predict urban attributes and population density. A model's predictive power was assessed using the out-of-sample R-squared statistic (R 2 ). This statistic was calculated-within each area for the individual study areas and across all areas for the combined study area-using the 20% of the data set aside for testing. Given the small sample sizes for some portions of the study, which can make models unstable, each analysis went through 100 trials, with random seed values set from 1 to 100. The output statistics from the 100 trials were averaged.

Elastic Net Regularized Regression
Developed by Zou and Hastie [139], ENR is a variable selection method that combines the least absolute shrinkage and selection operator (LASSO) and ridge regressions with ordinary least squares (OLS). Both have similarities with OLS [140]. The ridge regression applies a regularization term equal to the sum of squared coefficients-the l 2 norm-which can shrink coefficients close to 0. The LASSO regression performs variable selection by applying a regularization term equal to the sum of absolute values of the coefficients-the l 1 norm-which can remove independent variables by forcing their coefficients to 0 [140,141]. Each regularization term is multiplied by a tuning parameter λ, which together forms the shrinkage penalty (l 1 penalty and l 2 penalty). The tuning parameter λ controls the weight or extent of the penalties. When λ is large, the coefficients approach 0 in ridge regression and approach or equal 0 in LASSO regression, and ENR becomes a null model. When λ is small or equal to 0, the penalties are voided, and ENR becomes equal to OLS [140]. The mixing parameter α is set to control the ratio between ridge (α = 0) and LASSO (α = 1) [142].  ENR combines the advantages of LASSO and ridge regressions and is more accurate than solely using LASSO [139,141]. The ENR equation is written as [139,143]: where: In (1),β is the elastic net estimator, y is the dependent variable, X is an array of independent variables, β is a vector of estimated coefficients, λ is the tuning parameter, β 1 is the l 1 norm, and β 2 is the l 2 norm. As Equation (1) shows, ENR minimizes the residual sum of squares (RSS, which is used for OLS; y − Xβ 2 ) with the constraint of the added regularization terms ( β 2 and β 1 ). ENR gives a model of best fit by using the least number of independent variables to explain the dependent variables, improving on OLS and reducing multicollinearity (when independent variables are correlated with each other) [144,145].
Using ElasticNetCV from the scikit-learn library [142,146], select parameters were tuned via five-fold cross-validation to produce the best model (Table 4). Output variables from each trial were a list of features (independent variables) and their coefficients, out-ofsample R 2 , out-of-sample mean square error (MSE), the l1 ratio, and alpha (α).

Random Forest Regression
Introduced by Breiman [147], a random forest is an ensemble modeling approach that consists of many decision trees. Graphically, decision trees are tree-like diagrams with numerous splits used to predict an output value given an input value [140].
To build a decision tree, the data are split into J leaf nodes or distinct regions-R 1 , R 2 , . . . , R j -where each observation, with its known response value y i , within a region is given the same predictionŷ R j , which is the mean of the region's training observation response values [140]. To split the regions R into J regions, each split is determined by minimizing the overall RSS between the separated groups, which is [140]: (2)  [142,146].
For random forests, each tree is chosen from a different sample, and each split at the node of a tree is determined by a random subset of independent variables [50,140,147]. Using a random subset prevents one strong predictor from overpowering other variables and creating similar decision trees, ensuring that predictions from the trees are not strongly correlated and the average of all trees is more reliable [140]. Random forests work well even when there are many predictors, including when some are co-related. Random forests are also nonparametric [50]. For regressions, the mean prediction of all the trees is outputted.
When building random forest models using RandomForestRegressor and GridSearchCV from the scikit-learn library [146,148,149], select parameters were tuned via five-fold crossvalidation to produce the best model (Tables 5 and 6). Output variables from each trial were a list of features and their importance values, out-of-sample R 2 , and out-of-sample MSE. Table 5. User-defined parameters for random forest using RandomForestRegressor from the scikitlearn library. The remaining parameters were default; some parameters were not used until Grid-SearchCV (Table 6).

Parameter 1 Description of Purpose 1 Value(s)
n_estimators number of trees in forest [see Table 6] min_samples_leaf minimum number of samples at leaf node [see Table 6] max_features maximum number of features to be considered during split [see Table 6] 1 Parameters and descriptions from Pedregosa et al., [146,148].

Human-Modified Landscape and Very-High-Spatial-Resolution Imagery Contextual Features
First, contextual features derived from VHSR imagery of Sri Lanka (sample size of 192 GNs) were used to model urban attributes to investigate what contextual features derived from VHSR imagery represent in the human-modified landscape. Across all models, ENR results indicated that VHSR contextual features explained 43% to 85% of the out-of-sample variation in urban attributes (Table 7). Random forest results indicated that VHSR contextual features explained 51% to 83% (Table 8). Table 7. R 2 and mean square error values for urban attributes using elastic net regularization models at different spatial resolutions for Sri Lanka (192 Grama Niladhari Divisions).

Urban Attribute
Very-High-Spatial-Resolution Imagery Sentinel-2 Imagery  Table 8. R 2 and mean square error values for urban attributes using random forest models at different spatial resolutions for Sri Lanka (192 Grama Niladhari Divisions).

Urban Attribute Very-High-Spatial-Resolution Imagery Sentinel-2 Imagery
In-Sample R 2 Out-of-Sample R 2

Mean Square Error
In-Sample R 2

Human-Modified Landscape and Imagery Spatial Resolution
Contextual features derived from Sentinel-2 imagery of Sri Lanka were used to model urban attributes and compared to the VHSR-derived contextual features to examine how these representations of the landscape change as the spatial resolution of the satellite imagery changes (for the same 192 GNs as in the VHSR imagery analysis). Across all models, ENR and random forest results indicated that Sentinel-2 contextual features explained 46% to 80% and 47% to 84% of the out-of-sample variance in urban attributes, respectively (Tables 7 and 8).

Human-Modified Landscape and Sentinel-2 Imagery Contextual Features
To further investigate the ability of contextual features derived from Sentinel-2 imagery to map urban attributes, an analysis was run with data from all three study areas: Accra (314 EAs), Belize (80 EDs), and Sri Lanka (333 GNs). ENR and random forest models were built for each area individually (Tables 9 and 10) and then on all areas (Table 11). ENR results indicated that Sentinel-2 contextual features explained up to 78% of the out-ofsample variance in urban attributes in Accra, 42% to 81% in Sri Lanka, and 34% to 90% in Accra-Belize-Sri Lanka. Random forest results indicated that contextual features explained 12% to 80% in Accra, 44% to 86% in Sri Lanka, and 45% to 93% in Accra-Belize-Sri Lanka. Table 9. R 2 and mean square error values for urban attributes using elastic net regularization and random forest models with Sentinel-2 imagery for Accra (314 Enumeration Areas).

Urban Attribute Elastic Net Regularization Random Forest
In-Sample R 2 Out-of-Sample R 2

Mean Square Error
In-Sample R 2  Table 10. R 2 and mean square error values for urban attributes using elastic net regularization and random forest models with Sentinel-2 imagery for Sri Lanka (333 Grama Niladhari Divisions).

Urban Attribute Elastic Net Regularization Random Forest
In-Sample R 2 Out-of-Sample R 2

Mean Square Error
In-Sample R 2

Population Density and Sentinel-2 Contextual Features
Finally, contextual features derived from Sentinel-2 imagery of Accra (2216 EAs), Belize (687 EDs), and Sri Lanka (13,402 GNs) were used to model population density to explore how contextual features derived from Sentinel-2 relate to population density based on census data (Table 12). ENR results indicated that Sentinel-2 contextual features explained 57% of the out-of-sample variance in population density in Accra, 73% in Belize, 65% in Sri Lanka, and 67% in Accra-Belize-Sri Lanka. Random forest results indicated that contextual features explained 74% in Accra, 78% in Belize, 77% in Sri Lanka, and 84% in Accra-Belize-Sri Lanka.

Human-Modified Landscape and Very-High-Spatial-Resolution Imagery Contextual Features
Random forest and ENR approaches had similar levels of performance, with four urban attributes having higher out-of-sample R 2 values with ENR and the other four having higher or equal values with random forest (Tables 7 and 8). The lower out-ofsample R 2 values for two of the three building variables (building area and building count) suggest that contextual features are only able to modestly capture the building attributes. The relatively high out-of-sample R 2 values for the road attributes indicate that contextual features represent roads well. Since the built-up area attribute consisted of the building area and road area attributes, the lower out-of-sample R 2 value for the built-up area attribute was likely pulled down by the low out-of-sample R 2 value for building area. The outof-sample R 2 value for the built-up percent attribute was strong likely due to the strong out-of-sample R 2 values for building density and road density.

Human-Modified Landscape and Imagery Spatial Resolution
Sentinel-2 is generally less powerful-yet still effective-at predicting urban attributes when compared to VHSR imagery. This is reflected in the out-of-sample R 2 decreasing when comparing the values from VHSR to those from Sentinel-2 (especially for building area, building density, and road density), although there were some increases (such as road length and road area; Table 13). Table 13. Impacts of degrading spatial resolution from very-high-spatial-resolution (VHSR) to Sentinel-2 data on urban attribute model performance. The differences in out-of-sample R 2 between the VHSR and Sentinel-2 models were calculated. A negative value indicates the out-of-sample R 2 decreased (a decrease in predictive power) from the VHSR model to Sentinel-2 model; a positive value indicates the out-of-sample R 2 increased (an increase in predictive power) from VHSR to Sentinel-2. Differences may not correspond to actual R 2 values due to rounding. Abbreviation: ENR, elastic net regularization. The results expand on the claim by Henebry and Kux [114] that lacunarity is scale dependent by reinforcing scale as an important component for all contextual features. When conducting classification or object identification, the homogeneity and the variance of pixel values change at differing spatial resolutions because different phenomena occur on different scales [150]. In addition to neighborhoods being on various scales [56], urban features within neighborhoods are also on various scales. This can explain why most outof-sample R 2 values decreased while some did not (Table 13). The out-of-sample R 2 values may have increased for some road attributes because the sizes of the roads were generally smaller and closer to the spatial resolutions of Sentinel-2 and VHSR imagery, whereas the sizes of the buildings were much larger. The different sizes and compositions of buildings and roads could have then influenced their pixel variance at each spatial resolution.

Human-Modified Landscape and Sentinel-2 Imagery Contextual Features
Analysis suggests contextual features derived from Sentinel-2 imagery can effectively capture urban attributes, except for road density. Overall, random forest models were slightly more effective than ENR, indicating that the relationships may be non-linear (Table 9, Table 10, and Table 11).
The lower out-of-sample R 2 values for the building attributes (building area and building count) for individual study areas (especially Accra) suggest that Sentinel-2 contextual features can only somewhat capture building attributes in specific areas (Tables 9 and 10). Although building area and building count generally had the weakest models with VHSR contextual features (Tables 7 and 8), those attributes also mostly experienced larger drops in out-of-sample R 2 values compared to other attributes when degrading spatial resolution in Sri Lanka (Table 13); the lower spatial resolution of Sentinel-2 may have contributed to the lower predictive power.
Likewise, road density also had a larger drop in its out-of-sample R 2 value with moderate-resolution Sentinel-2 imagery (Table 13), which could explain why road density had lower out-of-sample R 2 values with Sentinel-2 imagery (Table 9, Table 10, and Table 11). The extremely low road density out-of-sample R 2 value for Accra (indicating that a null model was a better fit) was particularly surprising given the higher values for the other road attributes, which make up road density (Table 9). This may suggest a simple explanation that Accra's road network might be more nuanced than was captured by OSM. The remaining out-of-sample R 2 values for the road attributes in Accra and Sri Lanka (Tables 9 and 10), which were generally higher, suggest that contextual features can capture roads better than buildings in both areas even with moderate-resolution imagery.
Contextual features captured built-up variables the best. The out-of-sample R 2 values for built-up area and built-up density were strong; built-up percent had the highest out-of-sample R 2 values of all the urban attributes (Table 9, Table 10, and Table 11). One unexpected observation was that aggregated data (built-up percent and the Accra-Belize-Sri Lanka study area) generally had stronger out-of-sample R 2 values than their individual constituent datasets (building and road area and the individual study areas, respectively; Table 9, Table 10, and Table 11). For both aggregations, with more data, outliers within constituent areas may be less influential. In the combined study area specifically, contextual features appear capable of capturing the landscape in multiple areas better than in individual areas, possibly highlighting global landscape trends, given the larger area covered.

Population Density and Sentinel-2 Contextual Features
Sentinel-2 contextual features can generally predict population density well for individual countries and when all countries are combined, highlighting that a population density model with countries from various regions might be feasible (Table 12). Random forest models appeared to be more effective. Similar to the urban attribute results, one unexpected result was that the random forest out-of-sample R 2 value for the combined study area was higher than those of the individual areas. Outliers in individual areas may be less influential, possibly highlighting global population density trends.
The strong performances modeling urban attributes and population density are likely related. In previous work, researchers modeled population and population density using imagery-derived characteristics such as urban areas, land use, dwelling units, and raw spectral values (Lo, as cited in G. Li and Weng [34]). With contextual features capturing the landscape well, they are likely effective proxies for many of the variables that Lo (as cited in G. Li and Weng [34]) claim are important for modeling population. Within the context of this research, built-up attributes (which include building and road attributes) captured by contextual features could be a proxy for land use and urban areas, as spatial variations in building and road data can be representative of specific human landscapes; building attributes captured by contextual features could be a proxy for dwelling units by utilizing counts and average sizes of these buildings. Likewise, NDVI and mean both utilize raw spectral values and could be capturing open spaces or other indicators of population. Contextual features might partially explain population density by picking up various urban attributes that have been shown to model population well. There may be other factors unrelated to urban attributes that can model population; building counts, for instance, have been previously used to estimate population [29], yet contextual features did not capture building counts well. Overall, this method reflects a promising way of using open-source, freely available, remotely sensed data to model population, which can be especially helpful for government officials and researchers when costs are a concern.

Limitations and Future Work
One limitation is that the image collection dates were not the same as when the census and OSM variables were collected, as our study used and was limited to the best data that were available to us. While a limitation, the fact that the relationships were still strong indicates that the timing of data collection may not have that large of an influence on these relationships. Second, there is a large number of independent variables used within this study, and bivariate correlations were used to reduce the number of variables prior to their incorporation into the ENR and random forest models. While ENR and random forest models are designed to reduce issues resulting from multicollinearity, correlations among independent variables may have influenced the results presented in this study. A third limitation is the small sample size of some of the analyses relative to the number of predictors, which potentially resulted in low degrees of freedom and increased the risk of overfitting. With multicollinearity, this could cause the models to be unstable [140]. To mitigate this issue, we ran 100 trials and calculated out-of-sample statistics for each analysis [140]. While overfitting may still be the case for some of the analyses performed, the large sample size and strong out-of-sample results when using all of the datasets indicate that our results are more robust. Fourth, OSM data may have errors due to the nature in how and when the data are collected. While there were likely errors in the data, they were probably minor, as the data were visually verified with satellite imagery prior to analysis.
Future research should investigate whether multicollinearity tests or dimension reduction techniques such as principal component analysis should be performed to reduce any possible impacts of multicollinearity. Future work could also expand the models to include other countries, evaluate individual contextual feature performance, conduct time-series analysis with features once Sentinel-2 has acquired enough historical data, and test if different block and scale sizes for each of the features would work better, as these are scale dependent. Finally, while outside the scope of the current study, this analysis could theoretically be done at global scales and used to predict populations in areas where there is limited to no census data.

Conclusions
This study analyzed the ability of contextual features to model attributes of the humanmodified landscape and population density. The results suggest that contextual features can model urban attributes well at very high spatial resolutions (<2 m), with out-of-sample R 2 values up to 85%, and less so-yet still effectively-at lower spatial resolutions (10 m), with out-of-sample R 2 values up to 93%. Contextual features can model population density well in individual and multiple countries, with out-of-sample R 2 values up to 84%, and the results here are very encouraging, as the data used in the study are freely available and global in coverage.
This research fits into the broader work of using contextual features to model socioeconomic variables and using remote sensing to predict population. The strong results with freely available Sentinel-2 imagery have important implications for researchers and government officials, as those with limited resources can use contextual features to model population density at a specified time and place, allowing for accurate and timely population counts utilizing both top-down and bottom-up approaches when census data are outdated or unavailable.