1. Introduction
Remote sensing has attracted growing interest in urban planning and spatial analysis because of its cost-effectiveness and technological reliability [1,2,3]. Several global urban products are now available, including the Global Human Settlement Layer (GHSL) [4] and the Global Urban Footprint (GUF), which provide significant insights into urban land change at regional and global scales. However, these global products, while useful for broad-level analysis, fall short in detailing land cover change at the local level [5]. Because these global frameworks aim to capture a variety of urban land change contexts, they often miss the drivers required for local-level accuracy. In particular, when land cover information is extracted through object-based image analysis (OBIA), effectiveness and accuracy depend largely on the variability and uncertainty in the OBIA process.
In recent years, a notable trend has emerged in the field of land use development. It involves the creation of comprehensive regional databases through the skillful fusion of local data by planning analysts who possess an in-depth understanding of the local conditions. These regional databases are of substantial significance, as they provide valuable insights and data for local development planning [
6,
7,
8]. However, the transformation of such extensive data on high-fidelity urban land use predictions into actionable policy guidance remains an ongoing challenge [
9]. Multiple phases of the policy cycle are rarely tackled with remote sensing data, as many studies remain at the conceptual level, rarely providing recommendations for actions [
10,
11]. Especially at the local site level, the drivers for implementing real change in the environment lie in the incorporation of conceptual and technological advances in policy, strategic thinking, and design [
12]. For example, when designing best management practices, precise information on permeable and impermeable surfaces down to the scale of individual roads is essential to guide pond design. Multifaceted patterns of urban land use make it difficult for policymakers to derive direct and effective insights from land use changes. In contrast, reclassifying land use types and then comparing them appears to offer a more tenable way to discern the nature of urban land use/cover changes [13].
Existing studies that reclassify land use types and conduct comparative analyses to understand land cover changes have played a crucial role in advancing our knowledge of environmental transformations. Before remote sensing datasets were widely used, research efforts focused on reclassifying surveyed land use data and applying context-based models such as the SLEUTH model to understand the transition from agricultural to urban land use in rapidly growing cities [14]. With the increasing availability of aerial imagery and LiDAR data, more studies integrate data from multiple sensors to produce an enhanced land cover product and then classify that product for a specific use [
15,
16,
17]. These applications aim to understand the correlation between land cover and neighborhood characteristics, such as adult mosquito abundance, which informs critical public health concerns [18], urban heat [19,20], and tree cover and social equity [21]. These findings have notable implications for public health, water resources, community- and site-level planning, and land use regulation [22].
The Chesapeake Bay watershed plays a crucial role as both an ecological and economic asset, supporting a range of ecosystems and sustaining vital industries like agriculture, tourism, and fisheries. Nonetheless, the US Environmental Protection Agency has listed the Chesapeake Bay as impaired due to its non-point source loads of nutrients and sediment (Chesapeake Bay Program, 2000). Knowing the probability of land conversion from agriculture, wetland, or forest (resource lands) to residential, commercial, or industrial use (built) will guide the development of practical alternatives and contingency plans related to Bay trends and indicators [
23]. However, the region faces escalating environmental challenges due to the combined effects of sea-level rise and land subsidence. It ranks as the nation’s second-most vulnerable area to flooding and storm surge, second only to New Orleans [
24]. Central to confronting these issues and driving forward climate adaptation and mitigation planning is the accurate prediction of land cover changes, particularly the transition from pervious to impervious surfaces.
Existing studies have delved into modeling pervious surface changes to facilitate climate adaptation in urban planning [
25]. These investigations recognize the pivotal role of land cover alterations, particularly the expansion of impervious surfaces, in exacerbating urban heat island effects and other climate-related challenges [
26]. Researchers have employed various approaches, such as remote sensing and geographic information systems, to monitor and predict changes in land cover, emphasizing the need for increased green spaces, green infrastructure, and urban forestry to improve the urban ecology [
27], avoid extreme heat exposure [
28], and flooding [
29]. Moreover, these studies often stress the importance of precise land cover data for accurate climate modeling and adaptation strategies, underlining the potential of innovative data fusion techniques to enhance the accuracy of land cover maps for urban areas [
30,
31].
There is a significant gap in the current models regarding the study of changes between impervious and pervious surfaces, as they predominantly focus on detection rather than prediction [
32]. Models that utilize land cover data for prediction often center on urban growth, employing conventional methods like the SLEUTH models. However, these models lack the necessary precision in identifying changes between pervious and impervious surfaces, resulting in suboptimal accuracy [
33,
34]. This shortcoming is largely due to the under-utilization of reclassified land cover databases as reliable predictive tools, particularly for pervious land cover changes at the cell level.
To overcome these issues, our model leverages high-resolution land cover prediction data, significantly enhancing the accuracy and reliability of our predictions. We adopt the random forest algorithm, enriched with carefully selected socioeconomic features (Table 1) and environmental factors inspired by the SLEUTH models. This advanced approach positions our model as a notably accurate and effective tool in comparison to existing research in this field.
Historically, health warnings regarding Perfluorobutane sulfonic acid and GenX chemicals have been closely linked with the region’s agricultural practices, posing threats to the well-being and lives of residents [
35]. The significant role of spatial data becomes evident in highlighting the cooling effect of green buffers on runoff, even as impervious surface proportions increase due to urban development [
36]. With strategic green buffer distribution, the impact of impervious surface expansion on urban growth can still be mitigated. Established urban greenways can reduce stormwater runoff and combat the urban heat island effect caused by impervious surfaces [
37,
38]. Parks, wetlands, and rooftop gardens can reduce the adverse impacts of impervious surface expansion on the city’s microclimate, biodiversity, and flood resilience [
39]. Additionally, research studies have shown that urban green buffers along urban streams can significantly reduce the input of pollutants into aquatic ecosystems [
40].
2. Materials and Methods
2.1. Study Area
This study focuses on three distinct counties within the Chesapeake Bay watershed, each representing unique developmental contexts (Figure 1). Portsmouth serves as an urban archetype, characterized by its dense residential, commercial, and industrial zones. James City County exemplifies the suburban context, encompassing a mix of rural, suburban, and urban developments, featuring diverse landscapes including forests, wetlands, and historic sites. Lastly, Isle of Wight County embodies the rural aspect, predominantly marked by agriculture, forestry, and extensive natural habitats. By encompassing these diverse counties, our objective is to create a comprehensive and universally applicable model for predicting land cover changes across a variety of regional development scenarios.
In partnership with CIC, we procured longitudinal and high-resolution land use and land cover data aimed at driving “Precision Conservation” strategies in the realms of climate adaptation and mitigation planning. This initiative holds significant potential, particularly through analyses that offer applicability across partner counties and municipalities spanning the Planning District and the wider Chesapeake Basin. Our analysis focuses on three emblematic counties of the Hampton Roads Planning District Commission (HRPDC) Zone [41], namely Isle of Wight, James City, and Portsmouth, which represent rural, suburban, and urban landscapes, respectively.
The choice of our study area originates from a specific necessity. In the Hampton Roads planning district, discussions of prospective developments raise the possibility of spatial overlap with the Chesapeake Bay Preservation area, which is subject to preservation acts and conservation regulations. Notably, construction projects may necessitate special permissions to safeguard existing green infrastructure, while the conservation of local wetlands and habitats remains a paramount concern. To proactively address this, early identification of potential developments is vital. Leveraging a machine learning model to predict the probability of land cover transitions to imperviousness at a fine resolution, we can accurately anticipate future development and allocate resources accordingly [
42]. The outcomes of our predictive analysis in this chosen study area will offer invaluable insights that can guide targeted regulatory actions and funding allocation for green infrastructure.
2.2. Data Source
We obtained high-resolution land cover data for 2013/14 and 2017/18 from the Chesapeake Conservancy [43]. This comprehensive raster dataset has a 1 m resolution, so each 30 m × 30 m cell of the commonly used National Land Cover Database (NLCD) corresponds to 900 cells in this dataset. Such granularity is crucial to capture subtle changes in land cover. The Chesapeake Conservancy, the U.S. Geological Survey (USGS), and the University of Vermont Spatial Analysis Lab (UVM SAL), with funding from the Chesapeake Bay Program (CBP), produce these 1 m resolution land cover and land use datasets for the Chesapeake Bay watershed region (encompassing 206 counties and an area of over 250,000 km²). These data offer a foundational, authoritative, and transformative perspective on the region’s landscape and its holistic management.
The production of the CBP 1 m “land cover” data involves the delineation and classification of image objects derived from aerial imagery (primarily the National Agriculture Imagery Program, NAIP), topographical details derived from LiDAR, and other ancillary data. Land cover represents the surface characteristics of the land with classes such as impervious cover, tree canopy, herbaceous, and barren. In contrast, “land use” represents how humans use and manage the land with classes such as turf grass, cropland, and timber harvest. Producing land use from land cover data requires a variety of ancillary datasets combined with spatial rules that leverage the contextual information inherent in the land cover data.
The CBP land use/land cover (LULC) data are distinctly named to highlight their amalgamation of cover and use classes, such as extractive barren and solar–herbaceous. These classes are critical to understanding the impact of human activities on the Chesapeake Bay. For example, one singular land cover class, like herbaceous vegetation, can encapsulate both the highest polluting land use (e.g., corn production) and the least impactful ones (e.g., natural succession). LULC data contextualize land cover classes for decision-making, such as informing outcomes in the Chesapeake Bay Watershed Agreement and serving as the basis for developing the next generation of watershed and land change models.
Additionally, the CBP 1 m LULC data have over 50 unique classes, providing more categorical context than the 13-class CBP land cover data or the 17-class NLCD data. This detailed classification scheme is necessary to ensure that these data are widely applicable for supporting data-driven decision-making by the Chesapeake Bay Program and other regional stakeholders.
Land cover classification includes pervious surfaces, such as tree canopies and shrubs, which allow water to infiltrate the ground. In contrast, impervious surfaces encompass categories like roads and structures that prevent water infiltration, leading to increased runoff and potential flooding issues. Although water and wetlands are often considered impervious surfaces, in this study, we classify them as pervious surfaces due to their dynamic nature, interaction with groundwater, floodplain connectivity, and the critical functions of wetlands in water storage and infiltration.
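The reclassification described above can be illustrated with a minimal sketch. The class names below are hypothetical stand-ins rather than the actual CBP legend, and the helper functions are introduced here only for illustration; the point is simply that water and wetlands fall on the pervious side and that the outcome variable flags a pervious-to-impervious transition.

```python
# Hypothetical class names; the real CBP legend uses its own class codes.
IMPERVIOUS_CLASSES = {"roads", "structures", "other impervious"}


def reclassify(land_cover_class: str) -> int:
    """Return 1 for impervious classes and 0 for pervious ones.

    Water and wetlands are not in IMPERVIOUS_CLASSES, so they are
    treated as pervious, following the convention stated in the text.
    """
    return 1 if land_cover_class.lower() in IMPERVIOUS_CLASSES else 0


def changed_to_impervious(class_t0: str, class_t1: str) -> int:
    """Return 1 when a cell is pervious at t0 and impervious at t1."""
    return int(reclassify(class_t0) == 0 and reclassify(class_t1) == 1)


# Toy cells: (class in 2013/14, class in 2017/18)
cells = [
    ("tree canopy", "tree canopy"),
    ("herbaceous", "structures"),
    ("wetlands", "roads"),
    ("roads", "roads"),
    ("water", "water"),
]
y = [changed_to_impervious(t0, t1) for t0, t1 in cells]
print(y)  # [0, 1, 1, 0, 0]
```

Cells that are already impervious at the first date yield 0, because no pervious-to-impervious transition can occur there.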
The temporal resolution of the data is set at a four-year interval. This decision stems from the variability in regions covered by NAIP aerial imagery. To maintain uniformity in temporal statistics and error analysis across the entire watershed area, this level of temporal granularity represents the best achievable consistency. Nevertheless, for purposes such as detailed policy implementation, robust management control, and effective pollution monitoring, this level of detail typically proves adequate. This project is one of several efforts to support the assessment of water quality parameters (WQPs), which have traditionally been analyzed and monitored through sampling and laboratory testing, an approach that is expensive, labor-intensive, time-consuming, and unsuitable for large-scale analysis. In our study, we explore how the data can support other possible use cases.
To comprehensively understand how land cover change is influenced by various environmental, social, and economic factors, we acquired additional data from the following sources: a digital elevation model (DEM) with a 1 arc-second resolution from USGS, soil data from the Web Soil Survey, and Census tract-level data (population, white population, household units, and median household income) from the American Community Survey (ACS) for the years 2014, 2018, and 2021.
2.3. Data Processing
For computational efficiency and scalability, we resampled the original dataset to 500,000 data points for model building and then tested the fitted model on the whole dataset with 850,000 data points for subsequent predictions. We employed geo cross-validation at the block group level to ensure the robustness of our model and avoid overfitting.
Performance evaluation and model selection were conducted using the confusion matrix for binary threshold setting. We calculated the accuracy, sensitivity, and specificity of the models and compared their performance based on the confusion matrix.
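As a minimal sketch of the confusion-matrix metrics named above, the following computes accuracy, sensitivity, and specificity for a binary threshold. The function names and the toy labels and probabilities are assumptions made for illustration and are not results from this study.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_prob, threshold=0.5):
    """Binarize probabilities at `threshold` and summarize performance."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Toy data for illustration only.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.05, 0.8, 0.7, 0.2]
metrics = classification_metrics(y_true, y_prob, threshold=0.5)
print(metrics)
```

With unbalanced classes, sensitivity and balanced accuracy are usually more informative than overall accuracy, since a model that never predicts change can still score high on accuracy alone.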
Figure 2 shows the full workflow of our modeling process. We ran the land cover prediction model based on Equation (1), which is composed of the socioeconomic factors, environmental factors, and original land cover types that affect land cover change. All factors are calculated for each 10 m × 10 m raster cell and are detailed in Table 1. All variables are calculated as changes over each four-year interval.
2.3.1. Unit of Analysis
To maintain consistency in our analysis and facilitate future investigations, we re-interpolated all remote sensing data to a 10 m × 10 m resolution, which serves as our primary analytical unit. This resolution strikes a balance: it provides sufficient detail for planning purposes (for example, minor roads remain resolvable at 10 m) while significantly reducing noise and dataset size, thus enabling faster computation.
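One plausible way to move from 1 m cells to 10 m cells is block aggregation, sketched below. This is an assumption for illustration: the text does not specify the interpolation method, and the function name and toy grid are hypothetical. Each 10 m cell summarizes 100 of the 1 m cells, here as the impervious fraction.

```python
def aggregate_blocks(grid, factor):
    """Average non-overlapping factor x factor blocks of a 2-D list.

    `grid` holds fine-resolution values (e.g. 1 = impervious, 0 = pervious);
    the result holds one mean value per coarse cell.
    """
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be multiples of factor")
    coarse = []
    for r in range(0, rows, factor):
        coarse_row = []
        for c in range(0, cols, factor):
            block = [
                grid[i][j]
                for i in range(r, r + factor)
                for j in range(c, c + factor)
            ]
            coarse_row.append(sum(block) / len(block))
        coarse.append(coarse_row)
    return coarse


# Toy 20 m x 20 m tile of 1 m cells: a 5 m wide impervious strip on the left.
fine = [[1 if j < 5 else 0 for j in range(20)] for _ in range(20)]
coarse = aggregate_blocks(fine, factor=10)
print(coarse)  # [[0.5, 0.0], [0.5, 0.0]]
```

A majority rule could be substituted for the mean when a categorical 10 m class is needed rather than a fraction.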
We have opted to build our model at the county level, a decision driven by two key advantages. Firstly, this approach effectively translates the broad-scale management of watershed ecological health into practical policy implementation scales. Secondly, it enhances the accuracy and efficacy of predictive outcomes [
44]. In the realm of land use and conservation, a common concern revolves around the potential impact of land preservation on economic development. Conducting cell-level (spatial dimension: 10 m × 10 m) planning at the county level aids in coordinating different administrative units under the guidance of overarching conservation policies. For example, in the context of our project, the Chesapeake watershed has established policies for land use within its designated protection zones. However, the execution of these policies is the responsibility of individual administrative regions. With the assistance of our models, each administrative unit can make adaptive adjustments based on their unique land use conditions. Through our analysis, it becomes apparent that socioeconomic indicators wield significant influence over the overall predictions. This highlights that, while collaborative and indicator-based control at the watershed scale is feasible and logical for comprehensive environmental health planning and management, personalized planning, modeling, and control for each specific region prove more effective and meaningful for precise planning, detection, and management.
2.3.2. Equation for Feature Engineering
Equations (2)–(5) show the calculations of the socioeconomic factors used in this study. Equations (6)–(9) calculate the values of natural splines at specific locations to create smooth surfaces of the input variables.
Spatial lag factors: the terms in these equations denote, in order, the cell coordinate in the raster dataset; the value of the natural spline at X; the intercept or constant term; the coefficients associated with the polynomial terms; and the deviation of the predictor from the left endpoint a of the segment.
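For orientation, a generic cubic segment that matches these verbal definitions can be written as follows. The symbols S, β₀ to β₃, and the right endpoint b are assumptions introduced here for illustration; this is a textbook form and not a reproduction of Equations (6)–(9).

```latex
S(X) = \beta_0 + \beta_1 (X - a) + \beta_2 (X - a)^2 + \beta_3 (X - a)^3,
\qquad a \le X < b
```

Here β₀ plays the role of the intercept, β₁ to β₃ are the coefficients of the polynomial terms, and (X − a) is the deviation of the predictor from the left endpoint a of the segment.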
2.3.3. Spatial Pattern Analysis
According to most urban growth models [45], land use development and urban growth are shaped by change drivers and constraints. For impervious surfaces specifically, cover change is correlated with particular land cover types such as roads, tree canopy, and water. The original land use data can be treated as cells and evaluated by their distance to existing land cover types. Because our data products are built upon existing land use and land cover data, and because the land use types affecting the transition from permeable to impermeable surfaces are both dispersed and represent a relatively small proportion of the dataset, we adopted a strategy that relies on focal calculation (as detailed in Table A1) based on individual features. This approach effectively facilitates spatial interpolation, making the most of the data points that constitute a minority. It also helps prevent data leakage.
During our focal calculation, we achieved a controlled distance decay of the spatial impact of various land elements through multiple iterations of focal mean calculations. By incorporating a range of morphological parameters, we successfully identified subtle spatial variations within our data.
Figure 3 illustrates the outcomes of spatial feature calculations within 1000 m × 1000 m land units. It demonstrates the varying spatial influence of different features on the outcome variable y. This scale of 1000 m × 1000 m is commonly employed in urban planning, allowing us to encompass diverse land use categories and observe fine-grained changes in land use types, such as roads and water bodies, within urban areas [46].
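The distance-decay effect of repeated focal means can be sketched as follows. This is a simplified illustration under stated assumptions: a square moving window, edge cells averaged over their in-bounds neighbors only, and a toy grid with a single road cell. The function names and parameters are hypothetical, and the actual window sizes and iteration counts are those given in Table A1.

```python
def focal_mean(grid, radius=1):
    """Mean of each cell's (2 * radius + 1)^2 neighborhood, clipped at edges."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                grid[i][j]
                for i in range(max(0, r - radius), min(rows, r + radius + 1))
                for j in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            out[r][c] = sum(window) / len(window)
    return out


def iterated_focal_mean(grid, passes, radius=1):
    """Apply focal_mean repeatedly; each pass spreads influence further."""
    for _ in range(passes):
        grid = focal_mean(grid, radius)
    return grid


# Toy 7 x 7 raster with one road cell in the center.
size = 7
road = [[1.0 if (r, c) == (3, 3) else 0.0 for c in range(size)] for r in range(size)]

one_pass = iterated_focal_mean(road, passes=1)
two_pass = iterated_focal_mean(road, passes=2)

# Two cells away from the road: no influence after one pass, some after two.
print(one_pass[3][5], two_pass[3][5])
```

Because the feature is computed from the land cover at the start of the interval, the value assigned to a cell reflects its surroundings rather than its own later class, which is how this kind of feature can limit leakage.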
2.4. Modeling
We modeled the probability of land cover change from pervious to impervious as a function of socioeconomic factors, environmental factors, spatial lag factors, and proximity to existing land cover types. We refined our final model through a series of steps, including experimenting with different feature combinations, rigorous model testing, and careful parameter tuning to optimize results.
To ensure computational efficiency and scalability, we down-sampled the original dataset to 500,000 data points for model building and then applied the selected model to the whole dataset for future predictions. We divided the dataset into a training set and a validation set. We employed geo cross-validation at the block group level to ensure the robustness of our model and avoid overfitting.
2.4.1. Model Selection
Recognizing that different types of models perform well on different datasets, we experimented with three model types to predict land cover change: random forest, XGBoost, and a binomial generalized linear model (GLM). These models were chosen due to their ability to handle complex interactions and nonlinear relationships within the data.
After evaluating the balanced accuracy of each model (Table A2), we selected random forest as the most suitable model for our analysis, as it demonstrated the best accuracy. Random forest is an ensemble learning method that builds multiple decision trees and combines their results to improve overall accuracy and stability. This model is particularly well-suited for handling high-dimensional and noisy data, making it an ideal choice for predicting land cover changes in our study area.
2.4.2. Model Re-Sampling
In model selection, we acknowledged the challenge of overfitting caused by an extremely unbalanced dataset and feature leakage, which impaired our model’s change detection accuracy despite the high overall accuracy. We refined our approach by filtering for original land cover data and balancing the sample at a 1:10 ratio (y = 1 : y = 0), with multiple training rounds for stability. We tested ratios of 1:3 and 1:10 by comparing results for the two smaller counties and adopted 1:10 for our models. To minimize the impact of random variation introduced by down-sampling, we resampled and retrained the model 2–3 times for each county. We then selected the best-performing model based on the Kappa statistic and the p-value for the [Acc > NIR] comparison (Table A3). This method ensures a more reliable model selection that better represents the underlying patterns and relationships between land cover changes and various factors across the different counties.
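The 1:10 down-sampling step can be sketched as follows. The record layout, function name, and seeds are assumptions for illustration; the sketch keeps every changed cell (y = 1) and draws at most ten times as many unchanged cells (y = 0), and repeating it with different seeds mirrors the repeated resampling rounds described above.

```python
import random


def downsample_majority(rows, ratio=10, seed=0):
    """Keep all y == 1 rows plus at most `ratio` times as many y == 0 rows."""
    rng = random.Random(seed)
    positives = [row for row in rows if row["y"] == 1]
    negatives = [row for row in rows if row["y"] == 0]
    n_keep = min(len(negatives), ratio * len(positives))
    sample = positives + rng.sample(negatives, n_keep)
    rng.shuffle(sample)
    return sample


# Toy table: 20 changed cells among 1000 (hypothetical proportions).
rows = [{"cell_id": i, "y": 1 if i < 20 else 0} for i in range(1000)]

# Several rounds with different seeds, one candidate training set per round.
rounds = [downsample_majority(rows, ratio=10, seed=s) for s in (1, 2, 3)]
balanced = rounds[0]
n_pos = sum(row["y"] for row in balanced)
n_neg = len(balanced) - n_pos
print(n_pos, n_neg)  # 20 200
```

Each round would then be used to train one candidate model, and the candidates compared on the chosen statistics.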
2.4.3. Geo Cross-Validation
Instead of randomly splitting the data into folds as in traditional cross-validation, geo cross-validation considers the spatial distribution of the data. Block groups are a suitable choice as validation units since they represent spatially contiguous areas. We assessed the model’s generalizability by calculating the mean absolute percentage error (MAPE) for each census block group. The results (Figure 4) show that, in Portsmouth, our model has a low MAPE for the majority of neighborhoods, with slightly higher MAPE values observed in some block groups located in the southern region. In James City, the MAPE is relatively consistent across the block groups, with only one exhibiting a higher MAPE that could be considered an outlier. Figure A5 suggests that the main source of error may be the change in household units in the area, which may result from new construction. These findings suggest that our models are generalizable and can effectively capture the patterns of land cover changes across different block groups within the counties.
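A minimal sketch of spatially grouped folds and a per-block-group error is given below. It rests on assumptions not stated in the text: block groups are assigned to folds in round-robin order, and the percentage error compares the observed and predicted counts of changed cells within each block group. The function names and toy data are hypothetical.

```python
from collections import defaultdict


def group_folds(group_ids, n_folds):
    """Assign whole groups to folds so that no block group is split."""
    unique_groups = sorted(set(group_ids))
    fold_of_group = {g: k % n_folds for k, g in enumerate(unique_groups)}
    folds = [[] for _ in range(n_folds)]
    for index, group in enumerate(group_ids):
        folds[fold_of_group[group]].append(index)
    return folds


def percentage_error_by_group(group_ids, y_true, y_pred):
    """Absolute percentage error of the changed-cell count per group.

    Groups with no observed change are skipped to avoid division by zero.
    """
    observed = defaultdict(int)
    predicted = defaultdict(int)
    for group, truth, pred in zip(group_ids, y_true, y_pred):
        observed[group] += truth
        predicted[group] += pred
    return {
        group: abs(observed[group] - predicted[group]) / observed[group]
        for group in observed
        if observed[group] > 0
    }


# Toy data: three block groups with three cells each.
group_ids = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
y_true = [1, 1, 0, 1, 0, 0, 1, 1, 1]
y_pred = [1, 0, 0, 1, 0, 0, 1, 1, 0]

folds = group_folds(group_ids, n_folds=3)
errors = percentage_error_by_group(group_ids, y_true, y_pred)
print(folds)
print(errors)
```

Holding out entire block groups means the validation cells are never direct neighbors of training cells from the same block group, which gives a stricter test of spatial generalization than a random split.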
We further enhanced model performance by optimizing the classification threshold to balance sensitivity and specificity. Because cells whose land cover changes (y = 1) are inherently scarce, we offer not only a recommended binary prediction but also a continuous result categorized by the probability distribution. This approach offers planners more meaningful insights for decision-making.
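Threshold tuning of this kind can be sketched as follows. The selection criterion used here, maximizing sensitivity + specificity − 1 (Youden's J), is an assumption, since the text does not name the criterion; the probability cut points, band labels, function names, and toy data are likewise hypothetical.

```python
def sensitivity_specificity(y_true, y_prob, threshold):
    """Return (sensitivity, specificity) after binarizing at `threshold`."""
    tp = fn = tn = fp = 0
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob >= threshold else 0
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


def best_threshold(y_true, y_prob, candidates):
    """Pick the candidate threshold with the largest Youden's J."""
    def youden_j(threshold):
        sens, spec = sensitivity_specificity(y_true, y_prob, threshold)
        return sens + spec - 1
    return max(candidates, key=youden_j)


def probability_band(prob, cuts=(0.25, 0.5, 0.75)):
    """Map a probability to an ordered category (hypothetical labels)."""
    labels = ("low", "moderate", "high", "very high")
    return labels[sum(prob >= cut for cut in cuts)]


# Toy data for illustration only.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.05, 0.8, 0.7, 0.2]
candidates = [i / 10 for i in range(1, 10)]

threshold = best_threshold(y_true, y_prob, candidates)
bands = [probability_band(p) for p in y_prob]
print(threshold, bands)
```

The binary output answers whether a cell is expected to change, while the banded output preserves the ranking of cells, which can help when resources must be prioritized among many candidate locations.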