Article

Optimizing Urban Land-Use Through Deep Reinforcement Learning: A Case Study in Hangzhou for Reducing Carbon Emissions

1 Department of Architecture and Built Environment, University of Nottingham Ningbo China, Ningbo 315100, China
2 School of International Communications, University of Nottingham Ningbo China, Ningbo 315100, China
3 School of Architecture, Southeast University, Nanjing 210096, China
4 School of Computer Science, University of Nottingham Ningbo China, Ningbo 315100, China
5 Department of Civil Engineering and Architecture (DICAr), University of Pavia, 27100 Pavia, Italy
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Land 2025, 14(12), 2368; https://doi.org/10.3390/land14122368
Submission received: 7 October 2025 / Revised: 1 December 2025 / Accepted: 2 December 2025 / Published: 3 December 2025
(This article belongs to the Special Issue Energy and Landscape: Consensus, Uncertainties and Challenges)

Abstract

Urban land-use optimization plays a vital role in mitigating the escalating carbon emissions of rapidly growing cities. This study employs advanced computational intelligence to address urban carbon reduction through optimized spatial configurations. A Deep Reinforcement Learning (DRL) framework is proposed that integrates Points of Interest (POI), Areas of Interest (AOI), and Transportation System Data (TSD) to generate fine-grained carbon emission maps guiding land-use adjustments. In the case study of Hangzhou, China, results show that a carefully designed reward function enables the DRL agent to selectively optimize land-use structures, prioritizing the centralization of residential, dining, and commercial areas to form high-density, mixed-use urban clusters. This spatial reorganization leads to notable reductions in carbon emissions and improvements in resource-use efficiency. The proposed DRL-based framework provides a scientific basis for policy development toward sustainable land-use and urban density optimization. By merging advanced AI techniques with urban planning, this research contributes to the creation of low-carbon, resilient, and environmentally sustainable cities capable of addressing global climate challenges. The optimized DRL agent achieved carbon emission reductions of up to 15% compared to baseline configurations in the Hangzhou case study. Spatial concentration analysis revealed a 23% increase in residential area clustering and 31% increase in commercial zone centralization over 400 training episodes. The PPO-based model demonstrated superior performance compared to genetic algorithm and linear regression baselines, with lower policy loss (converging to <0.01) and critic loss (converging to <0.005) after early stopping at 400 episodes. However, this study is limited by its deterministic environment model, geographic specificity to Hangzhou, and exclusive focus on carbon reduction without incorporating socioeconomic constraints.

1. Introduction

Urbanization has profoundly reshaped land-use patterns, altering the configuration and intensity of functional zones and thereby influencing both the carbon balance of terrestrial ecosystems and human-induced emissions. Urban gardens and green spaces serve as critical components of carbon sequestration infrastructure, and recent advances in remote sensing technology have enabled precise identification and monitoring of these spaces at scale [1,2]. Expanding green spaces and promoting mixed land-use are proven strategies to curb urban carbon outputs, with optimized spatial allocation reducing energy-related CO2 emissions by up to 12% [3,4,5]. However, the conversion of natural landscapes into built-up areas disrupts carbon sequestration processes while simultaneously intensifying emissions through heightened energy demand and transportation activities [6,7]. As densely populated and economically active zones, urban areas have become major contributors to global carbon emissions, playing a pivotal role in accelerating climate change [8,9]. Consequently, optimizing urban morphology has become essential to enhancing land-use efficiency and achieving emission reductions through spatially informed planning that integrates green infrastructure and sustainable urban design [10,11].
Conventional urban planning typically relies on top-down, policy-driven approaches that provide overarching guidance but often overlook local environmental, social, and economic heterogeneity [9]. In contrast, the increasing complexity and dynamism of modern cities demand adaptive, data-driven, and context-specific planning strategies capable of responding to localized ecological conditions, community needs, and socio-economic structures [8,12].
Despite extensive research on urban carbon emissions and the growing application of artificial intelligence in environmental management, critical gaps persist in operationalizing these insights for spatial planning. Traditional land-use optimization methods rely on static analysis and predetermined rules, lacking the adaptive learning capacity necessary to respond to dynamic urban systems [9]. While machine learning models demonstrate prediction capabilities for carbon emissions [13], they cannot generate actionable spatial strategies through sequential decision-making in evolving urban contexts. Recent reinforcement learning applications have focused primarily on traffic signal control or building energy management, operating at scales incompatible with comprehensive neighborhood-level land-use planning. Most critically, no existing framework integrates multi-source urban data—points of interest, areas of interest, and transportation systems—into a unified computational model capable of autonomous spatial optimization that balances carbon reduction with urban functionality while providing interpretable decision pathways for policy implementation. This research addresses these gaps by developing a Deep Reinforcement Learning framework that learns context-aware land-use configurations through iterative environmental interaction, demonstrating practical feasibility through comprehensive validation in a rapidly urbanizing metropolitan context.
This study aims to pioneer the use of a Deep Reinforcement Learning (DRL) framework to revolutionize urban planning by significantly reducing carbon emissions through the optimization of land-use configurations. This bottom-up approach prioritizes the experiences and inputs of local residents and stakeholders, facilitating smaller-scale, community-level interventions that address specific needs and preferences [14], allowing for tailored solutions that align with the unique characteristics of different urban areas [15]. The framework integrates Points of Interest (POI), Areas of Interest (AOI), and road data to construct a comprehensive method for calculating carbon emission maps for small-scale land-use configurations [16]. These data inputs enable the DRL model to understand the current urban state and devise strategies for optimizing land-use to minimize carbon emissions [17].
To address these gaps, this research pursues three primary objectives. First, to develop a methodological framework integrating multi-source urban data (POI, AOI, TSD) with Deep Reinforcement Learning for spatial optimization of carbon emissions. Second, to quantify the DRL agent’s land-use decision patterns using spatial concentration metrics and validate emission reduction potential in a rapidly urbanizing context. Third, to provide evidence-based insights for policy development in sustainable urban planning by demonstrating the practical feasibility of AI-driven spatial optimization approaches. These objectives contribute to bridging the gap between computational intelligence and urban sustainability practice, offering scalable solutions for climate-responsive urban development. This research provides a robust scientific foundation for policy development in mixed land-use and urban density optimization, bridging artificial intelligence with urban sustainability. By quantifying land-use evolution through interpretable indicators and validating emission reductions in a real-world metropolitan context, the study demonstrates the practical feasibility and policy relevance of DRL-driven spatial optimization. This work not only showcases the transformative potential of DRL in reshaping urban morphology but also contributes a methodological pathway toward building data-driven, low-carbon, and resilient cities capable of confronting the global challenges of climate change [16,18].

1.1. Built Environment and Carbon Emissions

The built environment significantly influences urban carbon emissions through its effects on travel behavior and transportation mode choice. Research demonstrates that population density, employment concentration, mixed land-use, and proximity to public transportation substantially affect travel patterns, though effects vary by regional context [19,20,21,22,23,24,25,26]. High-density, mixed-use developments generally reduce motorized travel distances and promote active transportation, as evidenced by studies in Switzerland showing shortened car travel distances through compact spatial planning [25]. However, outcomes can be context-dependent: while enhanced built environments in the Netherlands increased public transport use, they did not significantly reduce car use or increase cycling [26]. Residential self-selection complicates causal inference, as individuals often choose housing locations based on pre-existing travel preferences [27,28]. Studies controlling for self-selection effects confirm that built environment interventions—particularly increases in density and land-use mixing—can meaningfully reduce car ownership and vehicle kilometers traveled. For instance, relocation to high-density neighborhoods significantly decreases car ownership and increases bicycle use [29], while higher degrees of mixed land use in Seoul communities increased walking likelihood, though effects diminished beyond certain complexity thresholds [30,31]. In China, these relationships vary by city size: smaller cities like Ganyu show weaker correlations between residential built environment and travel mode choice, with workplace accessibility playing a dominant role [32], while larger cities like Beijing and Chengdu demonstrate strong preferences for central locations among car-owning households [33]. 
Travel distance to destinations emerges as a fundamental determinant of mode choice: proximity to workplaces and services strongly predicts walking and cycling adoption [21], while greater distances correlate with increased automobile use. Studies demonstrate that spatial separation between residential and workplace locations nearly doubles commuting distances, highlighting the critical roles of density, mixed land-use, and employment accessibility [34]. Deep reinforcement learning can effectively encode these spatial relationships—including geographical scales, density patterns, and accessibility metrics—into feature representations that capture context-specific carbon emission dynamics, providing urban planners with powerful analytical tools for designing sustainable, low-carbon urban spaces [35,36,37,38,39].

1.2. AI in Carbon Emissions

With the intensification of global climate change and accelerated urbanization, artificial intelligence has emerged as a critical tool for reducing urban carbon emissions through prediction, optimization, and decision support capabilities [40]. Machine learning frameworks enable fine-scale carbon emission prediction using diverse data sources including street view imagery [13], land-use patterns [41], and nighttime light data. Neural network architectures demonstrate effectiveness in capturing complex nonlinear relationships between urban characteristics and emission patterns [42]. Beyond prediction, AI reduces emissions through system optimization: studies show AI can decrease carbon dioxide by optimizing industrial structures, enhancing energy efficiency, and improving green technology innovation, particularly in technologically advanced contexts [43,44]. Integration with IoT and big data analytics enables real-time monitoring and management across sectors [45,46]. However, most AI and machine learning approaches remain fundamentally static and predictive—they identify relationships and forecast outcomes but cannot actively adjust urban systems through sequential decision-making. Cities are dynamic, adaptive environments characterized by interdependent variables and long-term feedback loops, requiring algorithms capable of continuous interaction, learning, and decision-making within evolving contexts.
Reinforcement Learning (RL) bridges this gap by modeling learning as an iterative process of trial, error, and reward feedback, with theoretical foundations in behavioral psychology and the Rescorla-Wagner model [47]. The integration of RL with deep learning created Deep Reinforcement Learning (DRL), combining value-based and policy-based methods through architectures such as Deep Q-Networks, Advantage Actor-Critic, and Proximal Policy Optimization [48,49,50,51,52]. Landmark achievements like DeepMind’s AlphaGo, which defeated a Go master, demonstrate DRL’s potential for superhuman performance in complex tasks [53,54]. Recent DRL applications in urban contexts demonstrate transformative potential for spatial optimization. Zheng et al. (2023) [55] applied DRL to spatial layout optimization ensuring service accessibility within 15 min walking distances. Shen et al. (2024) [56] optimized urban-scale travel carbon emissions through land-use configuration adjustments, achieving significant emission reductions. Napoli et al. (2023) [57] employed DRL for urban tourism planning, balancing attraction access with infrastructure capacity constraints. These applications illustrate DRL’s capacity to manage complex trade-offs in real-world urban systems. Unlike traditional AI requiring manually engineered features or static datasets, DRL learns directly from environmental interaction and feedback. This human-like adaptability positions DRL as a transformative approach for complex urban optimization problems, enabling cities to evolve toward low-carbon, resilient, self-optimizing systems [58,59].

1.3. Research Gap

Building on the identified gaps, this research leverages reinforcement learning’s capacity for sequential decision-making with specific goals [60]. Prior work has established relationships between urban form and carbon emissions through emission mapping [4,61,62], providing the foundation for constructing an environment enabling agent-based optimization with carbon feedback [63]. Unlike traditional land-use prediction models that rely on static analysis [64], Deep Reinforcement Learning agents can consider long-term benefits rather than immediate gains, enabling strategic planning from a sustainability perspective [60,65,66]. This approach offers planners insights for complex tasks while avoiding individual designer biases [67], with adjustable hyperparameters enabling simulation of different planning strategies to support context-specific carbon emission optimization.

1.4. Main Contributions

This research makes several significant contributions to the field of urban planning and carbon emission reduction. First, it develops a novel Deep Reinforcement Learning framework that integrates multi-source urban data, specifically Points of Interest, Areas of Interest, and Transportation System Data, to create fine-grained carbon emission maps at the neighborhood scale. This integration enables targeted land-use optimization strategies that were previously unattainable with conventional planning approaches. Second, the study establishes a deterministic environment model that translates complex urban spatial configurations into quantifiable carbon emission metrics, providing interpretable feedback mechanisms for adaptive policy learning through the Proximal Policy Optimization algorithm. Third, the research demonstrates the capacity of DRL agents to autonomously develop context-aware spatial strategies, particularly in prioritizing the centralization of residential, dining, and commercial areas to form compact, mixed-use clusters that substantially reduce transportation-related carbon emissions. Fourth, it introduces a quantitative methodology employing the Gini coefficient to measure land-use evolution patterns, thereby revealing the agent’s strategic preference for spatial concentration and mixed-use development aligned with sustainable urban planning principles. Finally, the framework is validated through a comprehensive case study in Hangzhou, China, demonstrating both practical feasibility and policy relevance for implementing data-driven urban planning strategies toward achieving low-carbon, climate-resilient cities capable of addressing contemporary environmental challenges.
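The Gini-based concentration measure mentioned above can be illustrated with a short, self-contained sketch. The function below is our illustration, not the authors' published code: applied to the share of a given land-use type across grid cells, it returns 0 for a perfectly even spread and values approaching 1 for full concentration.

```python
import numpy as np

def gini(shares):
    """Gini coefficient of a non-negative array of per-cell land-use shares.

    Returns 0.0 for a perfectly even distribution across cells and values
    approaching 1.0 when the land-use is concentrated in very few cells.
    """
    x = np.sort(np.asarray(shares, dtype=float))
    n = x.size
    if n == 0 or x.sum() == 0:
        return 0.0
    index = np.arange(1, n + 1)                 # ranks of the sorted values
    return float(((2 * index - n - 1) * x).sum() / (n * x.sum()))

print(gini(np.ones(100)))              # 0.0 (perfectly even)
print(gini(np.r_[np.zeros(99), 1.0]))  # 0.99 (fully concentrated in one cell)
```

Tracking this value per land-use type over training episodes makes the agent's drift toward spatial concentration directly measurable.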

2. Materials and Methods

This section outlines the workflow of this study (Figure 1). The AOI, POI, and TSD are first collected and converted into a unified grid representation of land-use composition, functional intensity, and transport connectivity. These features are then used to build a deterministic carbon-emission environment, in which a PPO-based DRL agent learns to adjust cell-level land-use proportions to minimize total emissions. The resulting policy is evaluated against baseline methods and assessed through reward/loss convergence and spatial emission changes.

2.1. Study Region

Hangzhou, the capital city of Zhejiang Province in eastern China, was selected as the study area for this research. The city represents a typical case of rapid urbanization in China, with a metropolitan population exceeding 12 million and an urban area covering approximately 8000 km2. Located in the Yangtze River Delta region, Hangzhou has experienced substantial economic growth and spatial expansion over the past two decades, making it an ideal testbed for examining the relationship between urban land-use configurations and carbon emissions. The study focuses on the central built-up area of Hangzhou, encompassing a grid matrix defined by longitude coordinates from 120.038749 to 120.245226 and latitude coordinates from 30.205690 to 30.361647 (Figure 2). This area captures the core urban functions including residential neighborhoods, commercial districts, industrial zones, and transportation infrastructure. The availability of high-quality, multi-source urban data combined with Hangzhou’s ongoing urbanization trajectory provides a robust empirical foundation for developing and validating the Deep Reinforcement Learning framework.
The methodological framework employs a three-stage approach. First, multi-source urban datasets encompassing Areas of Interest, Points of Interest, and Transportation System Data are collected and preprocessed to represent the spatial characteristics of Hangzhou’s central urban area. Second, these datasets are integrated into a deterministic environment model that calculates carbon emissions based on land-use configurations and transportation patterns. Third, a Deep Reinforcement Learning agent using the Proximal Policy Optimization algorithm is trained to optimize land-use arrangements that minimize total carbon emissions.

2.2. Datasets

This study utilizes three primary categories of datasets to construct the urban environment model: Area of Interest data, Point of Interest data, and Transportation System Data. These datasets were obtained from authoritative sources: Gaode Map, one of China’s leading map service providers, which offers comprehensive point-of-interest information; Lianjia, the largest housing brokerage firm in China, which provides extensive residential data; and OpenStreetMap, which supplies reliable road network information.

2.2.1. Area Data

Area data are derived from the Area of Interest (AOI) dataset, in which each land-use type is associated with a distinct carbon emission factor. Following the secondary allocation principle proposed by Liu [68], sector-level emissions are first estimated from energy consumption standards and then spatially redistributed to land patches according to their area and activity intensity. Because the case study focuses on Hangzhou’s high-density urban core, agricultural and rural-patch allocations in Liu’s original equations are not directly applied; instead, we adopt the same allocation logic and extend it to nine urban functional land-use categories. For an urban land-use category u, the area-based emission assigned to grid cell k is calculated as Equations (1) and (2):
$$E_{\mathrm{area},k}(u) = f_u \, A_k(u) \, w_k(u) \tag{1}$$
$$E_{\mathrm{area},k} = \sum_{u=1}^{9} E_{\mathrm{area},k}(u) \tag{2}$$
Here, $E_{\mathrm{area},k}(u)$ is the annual carbon emission of land-use $u$ in cell $k$ (kg CO2/year); $f_u$ is the carbon emission factor of land-use $u$ (kg CO2/m²/year, Table 1); $A_k(u)$ is the AOI area of land-use $u$ within cell $k$ (m²); and $w_k(u)$ is a dimensionless POI-based intensity weight that captures vertical or activity-related differences not represented by planar area alone. This formulation preserves Liu’s secondary allocation idea while ensuring consistency with purely urban land-use structures.
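Equations (1) and (2) amount to an element-wise product followed by a sum over categories. A minimal NumPy sketch follows; the array names, shapes, and random values are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

K, U = 25, 9                       # example cell count and the nine land-use categories
rng = np.random.default_rng(0)
f = rng.uniform(-1.0, 64.0, U)     # emission factors f_u (kg CO2/m^2/year), illustrative
A = rng.uniform(0.0, 1e4, (K, U))  # AOI areas A_k(u) within each cell (m^2)
w = rng.uniform(0.5, 2.0, (K, U))  # POI-based intensity weights w_k(u), dimensionless

E_area_ku = f * A * w              # Eq. (1): E_area,k(u) = f_u * A_k(u) * w_k(u)
E_area_k = E_area_ku.sum(axis=1)   # Eq. (2): sum over the nine categories per cell
```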
Carbon emission factors for different building types are calculated based on their specific energy consumption patterns [69]. For residential buildings, assuming an average household area of 40 square meters and electricity consumption of 2848.53 kWh per household, the emission factor is approximately 28 kg per square meter. Medical facilities consume both electricity (105.9 kWh per square meter) and natural gas (7.6 cubic meters per square meter), which when converted to electricity equivalent yields total energy consumption of 159.1 kWh per square meter, corresponding to a carbon emission factor of 63.6 kg per square meter. Scenic spots and green spaces exhibit negative emission factors representing carbon sequestration rather than emissions. Empirical measurements indicate that plant communities in these areas achieve annual carbon sequestration of 327.67 kg CO2-equivalent per 400 square meters, equivalent to 0.82 kg CO2-equivalent per square meter of sequestration capacity. Moreover, combining data from various sources, the energy consumption and corresponding carbon emissions for each type of building/land-use are shown in Table 1. Since the data comes from different research results, the average values are used as the carbon emission factor for each type.
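The stated factors can be reproduced from the grid emission factor of 0.4 kg CO2/kWh cited in the Table 1 note; the following is a simple arithmetic cross-check, not additional data:

```python
GRID = 0.4  # kg CO2 per kWh (China's electricity grid emission factor, per Table 1 note)

residential_kwh_m2 = 2848.53 / 40              # 2848.53 kWh/household over 40 m^2
residential_factor = residential_kwh_m2 * GRID # ~28.5 kg CO2/m^2/year (text rounds to 28)

medical_factor = 159.1 * GRID                  # electricity + gas as kWh-equivalent
park_sequestration = 327.67 / 400              # kg CO2-eq sequestered per m^2 per year

print(round(residential_kwh_m2, 1), round(residential_factor, 1))  # 71.2 28.5
print(round(medical_factor, 1), round(park_sequestration, 2))      # 63.6 0.82
```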
Table 1. Carbon emission factors for different land-use types (source: Authors’ elaboration).
Land-Use Type     | Energy Consumption (kWh/m²·year) | Carbon Emission Factor (kg CO2/m²·year) | Data Source | Notes
Residential       | 71.2                             | 28.5                                    | [70,71,72]  | Based on 40 m² household
Commercial        | 120.5                            | 48.2                                    | [70]        | Office buildings
Medical           | 159.1                            | 63.6                                    | [70]        | Includes natural gas
Industrial        | 145.3                            | 58.1                                    | [68]        | Manufacturing zones
Educational       | 95.4                             | 38.2                                    | [73]        | Schools/universities
Parks/Green Space | -                                | -0.82                                   | [74]        | 327.67 kg CO2 per 400 m² annually
Other             | Variable                         | -                                       | -           | Mixed-use spaces
Note: Emission factors are calculated using China’s electricity grid emission factor of 0.4 kg CO2/kWh. Negative values indicate carbon sequestration capacity. Data represent averages from 2020–2023.
Area of Interest (AOI) data were obtained from Gaode Map’s land parcel database and classified into nine functional types: the eight categories corresponding to the POI classification, plus an additional category for special-function buildings or parcels (mixed-use or unclassified) that do not fit the other eight. The research area is gridded into cells of approximately 500 m × 500 m; each cell accounts for the area of each functional type within its boundaries, and the proportion of each type is calculated as shown in Figure 3. Rectangular bars of varying lengths within a grid cell correspond to the proportions of functions, and these proportions form the feature vectors used as input to the deep reinforcement learning model.
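The per-cell composition described above reduces to a row-normalization of category areas. A short sketch with assumed array names (not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(1)
areas = rng.uniform(0.0, 1e4, (4, 9))  # assumed: AOI area (m^2) of 9 categories in 4 cells

totals = areas.sum(axis=1, keepdims=True)
# Guard against empty cells: cells with zero total area get an all-zero vector.
proportions = np.divide(areas, totals, out=np.zeros_like(areas), where=totals > 0)

# Each row is one cell's feature vector of land-use proportions.
print(np.allclose(proportions.sum(axis=1), 1.0))  # True
```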

2.2.2. Point Data

Point data serve as a crucial supplement to area data by providing detailed density information, and are collected from two primary sources, Gaode and OSM. POI are classified into eight categories: residential areas, companies and enterprises, daily services, dining and entertainment, parks and green spaces, transportation facilities, industrial parks, and medical care. One function of the point data is to complement the area data. For example, a large amount of rental information within a residential area reflects the vertical quantity of that area. Similarly, the density of POIs reflects the intensity of commerce within a commercial area, both vertically and on a local planar basis. Thus, point data effectively addresses the lack of vertical information in area data and serves as weights in the secondary distribution calculations of area data.
Additionally, another role of point data is to provide a basis for travel emission calculations. High-density residential areas often generate more travel routes, leading to increased travel-related carbon emissions. For instance, the proportion of high-rise residential buildings in new towns is often higher than in old towns, thus high-density residential areas also tend to produce more travel-related carbon emissions. Rental information plays a crucial role in the carbon emission mapping process. Figure 4 displays the statistics for eight land-use Points of Interest (POIs) for each cell within the test area. These POI statistics are then converted into feature vectors that represent the cell for deep reinforcement learning.
In addition, researchers created heatmaps (Figure 5) for different types of POIs within the research area, allowing for a more intuitive understanding of the distribution patterns of different land-use functions.
From the spatial distribution in Hangzhou’s central area, residential land uses show the strongest central agglomeration. Dining/entertainment, education–research, and medical services exhibit noticeable co-location and correlation with these residential clusters. In contrast, companies/enterprises and retail/shopping POIs are distributed more evenly across the study area, indicating that these land-use types are less dependent on population concentration and display weaker spatial coupling with residential patterns.
Table 2 displays the number of POIs obtained from Lianjia and Gaode for each of the eight functional types within the central built-up area of Hangzhou, with each point in Figure 5 representing a POI. The POI dataset comprises 76,863 data points across these functional categories. Residential data (6758 community points) were sourced from Lianjia.com, China’s largest real estate platform, providing rental housing information. All other categories (70,105 points) were obtained from Amap.com (Gaode Map), one of China’s leading location-based service providers. The dataset represents the spatial distribution of urban functions within the study area grid (a 42 × 32 matrix spanning longitude 120.038749° E to 120.245226° E and latitude 30.205690° N to 30.361647° N), with each grid cell measuring approximately 500 m × 500 m.
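Binning POIs into the 42 × 32 grid is a direct coordinate-to-index mapping. The sketch below uses the study-area bounds given above; the function name and the two sample coordinates are illustrative assumptions:

```python
import numpy as np

# Study-area bounds and grid resolution as given in the text (~500 m cells).
LON_MIN, LON_MAX = 120.038749, 120.245226
LAT_MIN, LAT_MAX = 30.205690, 30.361647
NX, NY = 42, 32

def poi_to_cell(lon, lat):
    """Map a POI coordinate to its (column, row) index in the 42 x 32 grid."""
    i = min(int((lon - LON_MIN) / (LON_MAX - LON_MIN) * NX), NX - 1)
    j = min(int((lat - LAT_MIN) / (LAT_MAX - LAT_MIN) * NY), NY - 1)
    return i, j

counts = np.zeros((NX, NY), dtype=int)
for lon, lat in [(120.12, 30.28), (120.20, 30.33)]:  # two illustrative POIs
    counts[poi_to_cell(lon, lat)] += 1
print(counts.sum())  # 2
```

Accumulating such counts per category yields the per-cell POI statistics shown in Figure 4.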

2.2.3. Line Data

Line data, primarily representing Transportation System Data (TSD), were collected from OpenStreetMap and reflect the connectivity between different urban areas. While some scholars directly use road length, classification, and traffic volume to estimate travel-related carbon emissions, this approach overlooks the differences in road congestion caused by land-use distribution. In this study, roads play a crucial role in connecting different areas, with varying levels of connectivity influencing travel preferences and transportation choices.
Roads are categorized into different levels based on keywords such as “Primary”, “Secondary”, and “Subway”, corresponding to highways, main roads, and subway lines. These categorizations are crucial as they directly relate to the travel preferences and modalities of city residents, subsequently impacting the carbon emissions generated by transportation activities. The three-tiered road system of Hangzhou’s central urban area is illustrated in Figure 6. The left image shows the actual system layout, while the right image depicts the matrix representation of different TSD levels, used to map the relationship between various cells and the transportation system.
The processed road connection matrix and subway connection matrix are ultimately combined with AOI and POI data to form the state vector, which collectively describes each cell in the matrix. Transportation System Data from OpenStreetMap were classified into three hierarchical levels: highways (Primary), main urban roads (Secondary), and subway lines. Road connectivity between grid cells was calculated using network analysis, with each cell’s accessibility score determined by the density and hierarchy of connecting roads. This three-tiered classification enables the discrete choice model to estimate mode selection probabilities based on available transportation infrastructure.
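The keyword-based tier classification can be sketched as a small tag lookup. The tier numbering and OSM tag choices below are our assumptions based on the keywords cited in the text, not the paper's exact rules:

```python
# Illustrative mapping from OSM way tags to the three tiers used in the paper.
TIER = {"primary": 1, "secondary": 2, "subway": 3}

def classify(tags):
    """Return the tier of an OSM way from its tags, or None if not modeled."""
    kind = tags.get("highway") or tags.get("railway") or ""
    return TIER.get(kind.lower())

print(classify({"highway": "Primary"}))  # 1
print(classify({"railway": "subway"}))   # 3
print(classify({"highway": "footway"}))  # None (not part of the three-tier system)
```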

2.3. Methods

2.3.1. Environmental Model

The environmental model provides the computational framework within which the Deep Reinforcement Learning agent operates and learns optimal land-use configurations. This model abstracts Hangzhou’s urban area into a structured representation where the agent observes spatial states, executes land-use modification actions, and receives carbon emission feedback. The model serves as both a simulator of urban carbon dynamics and a training platform for policy optimization.
The environment utilizes urban data (gridded AOI–POI–TSD) to establish corresponding carbon emission maps as a basis for the rewards given to the reinforcement learning agent’s actions. Building on prior research and integrating the characteristics of reinforcement learning, the method for calculating urban carbon emission maps is as Equation (3):
$$E_{\mathrm{total}} = \sum_{k=1}^{K} E_{\mathrm{area},k} + \sum_{l=1}^{L} E_{\mathrm{path},l} \tag{3}$$
Here, $E_{\mathrm{total}}$ represents the total carbon emissions for the entire area; $E_{\mathrm{area},k}$ is the regional carbon emission of cell $k$, calculated based on AOI (Area of Interest) and POI (Point of Interest) data; and $E_{\mathrm{path},l}$ is the carbon emission from the $l$-th path. $K$ is the number of grid cells in the study area and $L$ is the number of OD paths considered in the travel-emission estimation.
Since the application of reinforcement learning in optimizing urban carbon emissions is in its early stages, current research utilizes a deterministic environment, where the outcome of each action is entirely predictable. This allows developers and researchers to more easily design and optimize strategies without needing to consider the randomness of environmental responses. Therefore, the state transition probability function $P(s' \mid s, a)$ can be described by Equation (4):

$$P(s' \mid s, a) = \begin{cases} 1, & \text{if } s' = f(s, a) \\ 0, & \text{otherwise} \end{cases}$$
where s and s′ are the current and next states (i.e., the land-use proportion vectors and auxiliary features of all cells), a is the land-use adjustment action executed by the agent, and f(·) is the deterministic update operator that modifies the land-use proportions in the targeted cell. This deterministic setup simplifies policy learning and allows the effect of each action on emissions to be directly evaluated.
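Because the paper does not spell out the update operator $f$ in full, the deterministic transition can be sketched as a simple environment step; the clip-and-renormalize rule below is an illustrative assumption, not the study’s exact operator:

```python
import numpy as np

def step(state, action, cell_idx):
    """Deterministic transition f(s, a): apply a land-use adjustment to one
    cell, then clip and renormalize its proportions (illustrative sketch)."""
    next_state = state.copy()
    cell = next_state[cell_idx] + action          # adjust the 8 land-use shares
    cell = np.clip(cell, 0.0, None)               # proportions cannot go negative
    next_state[cell_idx] = cell / cell.sum()      # renormalize to sum to 1
    return next_state

# The same (state, action) pair always yields the same next state,
# i.e. P(s' | s, a) = 1 exactly when s' = f(s, a).
s = np.full((100, 8), 1 / 8)                      # 10x10 grid flattened, 8 types
a = np.array([0.2, -0.05, 0, 0, 0, 0, 0, -0.05])
s1 = step(s, a, cell_idx=42)
s2 = step(s, a, cell_idx=42)
assert np.allclose(s1, s2)                        # deterministic outcome
assert np.isclose(s1[42].sum(), 1.0)              # shares remain a distribution
```

Determinism is what lets the agent attribute each emission change directly to its last action, as the text notes.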

2.3.2. Training Dataset Preparation

Figure 7 illustrates the training dataset preparation workflow. The urban area is divided into a grid matrix using longitude coordinates from 120.038749 to 120.245226 and latitude coordinates from 30.205690 to 30.361647, with each cell measuring approximately 500 by 500 m. AOI, POI, and road data within this area were similarly matrixed with a resolution of 42 × 32, producing corresponding data matrices. At the end of each episode and upon reset, the environment randomly selects a 10 × 10 sub-environment as the current test environment for the intelligent agent. To further improve model generalization, researchers applied data augmentation techniques such as rotation and mirroring to create more diverse training samples.
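Given the stated coordinate bounds and the 42 × 32 resolution, assigning a point to its grid cell reduces to linear index arithmetic. The sketch below assumes longitude maps to columns and latitude to rows, which the text does not state explicitly:

```python
# Study-area bounds and grid resolution from the text.
LON_MIN, LON_MAX = 120.038749, 120.245226
LAT_MIN, LAT_MAX = 30.205690, 30.361647
N_COLS, N_ROWS = 42, 32   # ~500 m cells (assumed: longitude -> columns)

def to_cell(lon, lat):
    """Map a coordinate to its (row, col) grid cell (illustrative)."""
    col = min(int((lon - LON_MIN) / (LON_MAX - LON_MIN) * N_COLS), N_COLS - 1)
    row = min(int((lat - LAT_MIN) / (LAT_MAX - LAT_MIN) * N_ROWS), N_ROWS - 1)
    return row, col

assert to_cell(LON_MIN, LAT_MIN) == (0, 0)                  # south-west corner
assert to_cell(LON_MAX, LAT_MAX) == (N_ROWS - 1, N_COLS - 1)  # north-east corner
```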
The dataset is divided into a training set and a testing set to ensure that the model is trained and evaluated on different data. At the start of the program, two arrays are randomly generated, with each element representing the index of the starting point of a 10 × 10 sub-area. These arrays are created in a 7:3 ratio, ensuring that the indices do not overlap.
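The non-overlapping 7:3 split of sub-area start indices can be sketched as follows; the paper’s exact sampling routine is not given, so shuffling the full set of valid top-left corners is an illustrative stand-in:

```python
import random

random.seed(0)

# All valid top-left corners of a 10 x 10 sub-area inside the 32 x 42 grid.
starts = [(r, c) for r in range(32 - 10 + 1) for c in range(42 - 10 + 1)]
random.shuffle(starts)

split = int(0.7 * len(starts))               # 7:3 ratio, as described in the text
train_idx, test_idx = starts[:split], starts[split:]

assert not set(train_idx) & set(test_idx)    # index arrays do not overlap
assert abs(len(train_idx) / len(test_idx) - 7 / 3) < 0.1
```

Shuffling once and slicing guarantees disjoint training and testing indices by construction.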

2.3.3. Deep Reinforcement Learning Architecture

The optimization framework employs the Proximal Policy Optimization algorithm within an Actor-Critic architecture. This approach integrates value function approximation through the Critic network with direct policy learning through the Actor network, enabling simultaneous learning of both the policy and value functions. The Actor network receives the environment state as input, defined by a state dimension of 10 representing the spatial characteristics of each grid cell. The network architecture consists of four hidden layers containing 1024, 10,240, 1024, and 128 neurons, respectively, with ReLU activation functions applied to the first three layers to capture complex policy relationships. The output layer employs a linear transformation to generate an action vector of dimension 8, corresponding to the eight primary land-use types, with a Tanh activation function constraining output values to the range of negative one to positive one. These action values specify modifications to land-use proportions within each cell. The Critic network evaluates the state value function, taking as input the state vector and producing a scalar value estimate through four hidden layers with 1024, 20,480, 1024, and 128 neurons. The network architecture employs ReLU activation for nonlinear transformation and outputs a single value representing the expected cumulative reward from the current state. Figure 8 illustrates the data flow from environmental states through the network architectures to generate actions and value estimates.
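The layer stack above can be sketched as plain feed-forward passes. In this demo the widths are scaled down by a factor of 16 so it runs quickly (the paper’s actor is 10 → 1024 → 10,240 → 1024 → 128 → 8, the critic 10 → 1024 → 20,480 → 1024 → 128 → 1), and the random weight initialization is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

ACTOR_SIZES  = [10, 64, 640, 64, 8, 8]    # scaled-down actor widths
CRITIC_SIZES = [10, 64, 1280, 64, 8, 1]   # scaled-down critic widths

def init_mlp(sizes):
    """He-style random weights and zero biases for a fully connected stack."""
    return [(rng.standard_normal((m, n)) * np.sqrt(2 / m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(layers, x, out_act):
    # ReLU on the first three hidden layers, linear fourth hidden layer,
    # then the output activation (tanh for the actor, identity for the critic).
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < 3:
            x = np.maximum(x, 0.0)
    return out_act(x)

actor  = init_mlp(ACTOR_SIZES)
critic = init_mlp(CRITIC_SIZES)

state  = rng.standard_normal(10)          # one cell's 10-dim state vector
action = forward(actor, state, np.tanh)   # 8 land-use adjustments in [-1, 1]
value  = forward(critic, state, lambda v: v)

assert action.shape == (8,) and np.all(np.abs(action) <= 1.0)
assert value.shape == (1,)
```

The Tanh output keeps every adjustment bounded in [−1, 1], matching the action constraint described in the text.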

2.3.4. Reward Function Design

The reward function constitutes the primary mechanism through which the agent learns to minimize carbon emissions. The function penalizes actions that increase emissions while rewarding configurations that achieve reductions relative to baseline conditions. Total carbon emissions are calculated as the sum of base emissions from land use and path emissions from transportation activities. Base emissions are computed by multiplying the area proportion of each land-use type in each cell by its corresponding emission factor, then summing across all cells and land-use categories. This formulation is expressed in Equation (5), where $N$ represents the number of grid cells, $M$ denotes the number of land-use types, $\alpha_i$ indicates the emission factor for land-use type $i$, and $A_{ij}$ represents the area proportion of land-use type $i$ in cell $j$.

Path emissions account for transportation-related carbon outputs between cells, calculated using a discrete choice model to determine mode selection probabilities. The emission contribution from travel between cells $k$ and $l$ is determined by the product of transport-mode emission factors, travel distances, and mode-choice probabilities, summed across all cell pairs and transport modes. This is formalized in Equation (6), where $P_{km}$ represents the probability of using transport mode $m$, $\beta_m$ denotes the emission factor for mode $m$, and $d_{kl}$ indicates the distance between cells $k$ and $l$. The reward signal provided to the agent equals the negative change in total emissions, thereby incentivizing emission-reducing spatial configurations. Figure 9 depicts the reward calculation process integrating base and path emission components.

3. Results

3.1. Actor-Critic Framework

The Proximal Policy Optimization (PPO) algorithm within the Actor-Critic framework is employed to generate actions for the agent. The PPO algorithm is particularly effective for managing continuous action spaces, making it ideal for this research. The Actor-Critic method integrates two approaches: value function approximation (Critic) and policy gradient methods (Actor). The Critic assesses the quality of the current policy by evaluating the value function v(s) for each state, while the Actor learns the policy directly by generating specific action values. This combination allows the model to learn both the policy and the value function simultaneously, enhancing the overall learning process. The Actor network takes the environment state as input, defined by the state_dim parameter (in this case, 10). It consists of four hidden layers, with the first three hidden layers containing 1024, 10,240, and 1024 neurons, respectively, each followed by a ReLU activation function to enable the learning of complex policy functions. The fourth hidden layer has 128 neurons. The output layer is a linear layer that generates an action vector of size action_dim (here, 8). A Tanh activation function is applied to the output layer, ensuring that each action feature’s output value falls within the range of −1 to 1. These action values are then used to modify land-use configurations in the environment.
The Critic network evaluates the value v(s) for each state within the matrix, taking as input a state vector of size state_dim, representing one of the states in a 10 × 10 matrix. Figure 8 illustrates the process of transferring input data to both the Critic and Actor input layers. To better capture the environmental state and optimize the agent’s training process, an optional Graph Convolutional Network module can be used. This module can be integrated before the Critic and Actor input layers to process and extract features from the states, enhancing the network’s understanding of the environment. The exploration and application of this GCN module will be the subject of future research.
The Critic network also consists of four hidden layers, with the first three hidden layers containing 1024, 20,480, and 1024 neurons, respectively, each followed by a ReLU activation function. The output layer is a linear layer that reduces the network’s output to a single value, representing the value estimate for the given state; keeping this output head simple helps enhance robustness and reduce overfitting.

3.2. Reward Function Design

The core objective of our reinforcement learning task is to minimize carbon emissions in urban land-use configurations. We designed our reward function to penalize actions that increase carbon emissions and reward those that lead to reductions. Specifically, the reward function is formulated to incentivize land-use configurations that achieve lower carbon emissions relative to the initial state’s carbon emissions.
The reward function is composed of two main components: base emissions and path emissions. The calculation process is shown in Figure 9:
  • Base Emissions: The base emissions are calculated from the land-use types within each cell. Each land-use type $i$ has an associated emission factor $\alpha_i$, which is multiplied by the area ratio $A_{ij}$ of that land-use type in cell $j$ to compute its emission contribution. The sum of these contributions across all land-use types gives the base emissions for each cell, as shown in Equation (5):

$$E_{base} = \sum_{j=1}^{N} \sum_{i=1}^{M} A_{ij}\,\alpha_i$$

where $N$ is the number of cells and $M$ is the number of land-use types.
  • Path Emissions: Path emissions are calculated from transportation activities between different cells. We use a discrete choice model to determine the probabilities $P_{km}$ of using transport mode $m$ for traveling between cells $k$ and $l$. The emission factor for each transport mode $m$ is $\beta_m$, and the distance between cells $k$ and $l$ is denoted $d_{kl}$, as shown in Equation (6):

$$E_{path} = \sum_{k=1}^{N} \sum_{l=1}^{N} \sum_{m=1}^{Q} P_{km}\,\beta_m\,d_{kl}$$

where $Q$ is the number of transport modes.
The reward function $R$ is defined as the difference between the previous total normalized emissions and the new total normalized emissions after taking an action. The base emissions and path emissions are normalized separately as in Equation (7):

$$E_{norm} = \frac{E - E_{min}}{E_{max} - E_{min}}$$
where $E$ represents either $E_{base}$ or $E_{path}$, and $E_{min}$ and $E_{max}$ are the minimum and maximum emissions observed during training. Assume a toy case with $N = 2$ cells, $M = 2$ land uses, and $Q = 2$ transport modes. Let the emission factors be $\alpha_1 = 0.3$ and $\alpha_2 = 0.5$, with land-use proportions $A_{1,1} = 0.6$, $A_{2,1} = 0.4$ in cell 1 and $A_{1,2} = 0.5$, $A_{2,2} = 0.5$ in cell 2. Then the cell 1 base emission is 0.6 × 0.3 + 0.4 × 0.5 = 0.38 and the cell 2 base emission is 0.5 × 0.3 + 0.5 × 0.5 = 0.40, so $E_{base}$ = 0.78.

Suppose the travel distance between the two cells is $d_{12} = d_{21} = 2$ km, with mode probabilities $P_{12,1} = P_{21,1} = 0.7$ and $P_{12,2} = P_{21,2} = 0.3$, and emission factors $\beta_1 = 0.2$ (car) and $\beta_2 = 0.05$ (metro). The one-direction path emission is then 0.7 × 0.2 × 2 + 0.3 × 0.05 × 2 = 0.31, so $E_{path}$ = 0.62 over both directions.
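The toy numbers above can be checked directly against Equations (5) and (6):

```python
import numpy as np

# Toy case from the text: N = 2 cells, M = 2 land uses, Q = 2 transport modes.
alpha = np.array([0.3, 0.5])                 # emission factors per land-use type
A = np.array([[0.6, 0.5],                    # A[i, j]: share of land use i in cell j
              [0.4, 0.5]])

E_base = float((A * alpha[:, None]).sum())   # Equation (5)
assert np.isclose(E_base, 0.78)

beta = np.array([0.2, 0.05])                 # car and metro emission factors
P = np.array([0.7, 0.3])                     # mode-choice probabilities
d = 2.0                                      # km between the two cells

one_way = float((P * beta * d).sum())        # Equation (6), one (k, l) pair
assert np.isclose(one_way, 0.31)
E_path = 2 * one_way                         # both travel directions
assert np.isclose(E_path, 0.62)
```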
The reward at step $t$ is defined as the reduction in total normalized emissions, as in Equation (8):

$$R_t = \left(E_{base,norm}^{(t)} + E_{path,norm}^{(t)}\right) - \left(E_{base,norm}^{(t+1)} + E_{path,norm}^{(t+1)}\right)$$
An episode ends if either a significant reduction in total emissions (below 95% of the initial total emissions) is achieved or the maximum number of steps is reached. In case of significant reduction, the reward is further boosted to emphasize successful actions.

3.3. Model Training

3.3.1. Algorithm Selection

In this research, the Proximal Policy Optimization (PPO) algorithm is selected to train the Actor-Critic model, aiming to enhance its performance and generalization ability in complex environments. The core idea of the PPO algorithm is to optimize the policy in a way that maximizes the expected return. During the training process, the Critic network first estimates the advantage function of the current policy to evaluate the superiority of each state-action pair relative to a baseline policy. The advantage function quantifies the relative value of an action, indicating how much better it is compared to the average performance of the current policy. By calculating the advantage function, the Actor network generates an action probability distribution, which is then optimized using the policy gradient method of the PPO algorithm to update the Actor’s parameters, thereby improving the direction of policy updates.
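The paper does not spell out its exact advantage formulation, so as one concrete and commonly used estimator, a one-step temporal-difference advantage can be sketched as follows (the function name and bootstrap convention are illustrative assumptions):

```python
import numpy as np

def advantages(rewards, values, gamma=0.99):
    """One-step TD advantage: A_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    Bootstraps a value of 0 after the final step (illustrative sketch)."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    next_values = np.append(values[1:], 0.0)
    return rewards + gamma * next_values - values

adv = advantages(rewards=[1.0, 0.5, 2.0], values=[0.8, 0.9, 1.0])
# Positive entries mark actions that outperformed the critic's estimate.
assert adv.shape == (3,)
```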
In the PPO algorithm, the Clipped Surrogate Objective function is used to update the Actor network, as shown in Equation (9):
$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, A_t,\ \mathrm{clip}\!\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)},\, 1 - \epsilon,\, 1 + \epsilon\right) A_t\right)\right]$$

where $\pi_\theta(a_t \mid s_t)$ is the probability of the action under the current policy, $\pi_{\theta_{old}}(a_t \mid s_t)$ is the probability under the old policy, $A_t$ is the advantage function, and $\epsilon$ is the clipping-range hyperparameter.
The primary role of this loss function is to limit the magnitude of policy updates, preventing significant shifts in the policy and ensuring stable policy updates. Specifically, the Clipped Surrogate Objective function calculates the loss by comparing the probability ratio of the current and old policies and applying a clip to this ratio to ensure that the policy update does not deviate too far. This effectively prevents the policy from being updated too rapidly, which could lead to performance degradation.
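The clipping behavior of Equation (9) can be illustrated numerically; `clipped_surrogate` below is a hypothetical helper, not the study’s code:

```python
import numpy as np

def clipped_surrogate(ratio, adv, eps=0.2):
    """PPO clipped surrogate of Equation (9):
    mean over t of min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)."""
    return float(np.mean(np.minimum(ratio * adv,
                                    np.clip(ratio, 1 - eps, 1 + eps) * adv)))

ratio = np.array([0.5, 1.0, 1.5])    # pi_theta / pi_theta_old per timestep
adv   = np.array([1.0, 1.0, 1.0])
# Ratios outside [0.8, 1.2] contribute at most the clipped value, so a large
# policy shift earns no extra objective: (0.5 + 1.0 + 1.2) / 3 = 0.9.
assert np.isclose(clipped_surrogate(ratio, adv), 0.9)
```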
The Critic network is responsible for evaluating the state of each cell in the urban land-use configuration matrix. Specifically, the Critic network processes the input state matrix of each cell through multiple fully connected layers and outputs a value estimation for that cell. These value estimates guide the Actor network’s action selection process, ensuring that the chosen actions maximize long-term returns. The output of the Critic network is defined as in Equation (10):
$$V_\theta(s_t) = f_\theta(s_t)$$

where $V_\theta(s_t)$ is the value estimate of state $s_t$ and $f_\theta$ represents the function computed by the Critic network.
During training, the overall loss function combines the policy loss and the value function loss. By integrating these loss functions, the PPO algorithm can effectively update the parameters of the Actor and Critic networks, progressively improving the model’s performance in optimizing urban land-use configurations. The overall loss function is defined as Equation (11):
$$L(\theta) = L^{CLIP}(\theta) + c_1 L^{VF}(\theta)$$

where $L^{VF}(\theta)$ is the value function loss, representing the mean squared error between the Critic network’s predictions and the target values, and $c_1$ is a weighting hyperparameter. In each training iteration, the states, actions, and rewards stored in the experience replay buffer are used to compute the policy loss for the Actor network and the value function loss for the Critic network; minimizing the combined surrogate loss updates both networks’ parameters and gradually improves performance on the urban land-use optimization task.
Training our Actor-Critic model involves several key steps to optimize performance and ensure convergence. We partitioned the urban area into a 42 × 32 grid, selecting a subset for training to balance computational efficiency and model effectiveness. Hyperparameters such as learning rates for both Actor and Critic networks, discount factor (gamma), and clipping range (eps_clip) were chosen through empirical experimentation and validated against performance metrics on a held-out test set.
The Adam optimizer with gradient clipping is employed to stabilize training and prevent gradient explosion, a common issue in deep learning optimization. The Actor network was updated using the advantage-weighted surrogate objective, while the Critic network minimized the mean squared error between predicted and actual rewards. Training proceeded over multiple epochs (K_epochs) with batch updates to iteratively improve policy and value function estimates.

3.3.2. Training Process and Loss Analysis

Throughout the training process, which was halted at around 400 episodes via early stopping, researchers systematically recorded the policy loss (Actor Loss), value function loss (Critic Loss), and rewards for each iteration (Figure 10). Early stopping was implemented to prevent overfitting and ensure good generalization, as indicated by stabilization or diminishing returns in the improvement metrics. Figure 10 illustrates the evolving trends in rewards and loss functions as training progresses.
  • Rewards: The reward plot exhibits an overall increasing trend, albeit with significant variability across episodes. In the early stages of training, the high fluctuations in rewards suggest an active exploration phase where the agent is learning the environmental dynamics. As the training progresses, the rewards begin to show a more consistent upward trajectory, indicating that the agent is becoming increasingly proficient at optimizing its actions to maximize rewards. This upward trend, despite ongoing fluctuations, underscores the efficacy of the PPO algorithm in enhancing the agent’s performance over time.
  • Actor Loss: The actor loss, which quantifies the policy network’s error in selecting actions, also shows considerable variability but generally trends towards stabilization. Despite occasional spikes due to the stochastic nature of policy updates and the exploration-exploitation trade-off, the overall trend is a decrease in actor loss, signifying that the policy network is learning to make more accurate action selections. The stabilization of actor loss as training advances reflects the network’s improved ability to refine its policy.
  • Critic Loss: The critic loss, representing the value network’s error in estimating the value of states, shows a pronounced decrease over time, albeit with some variability. High initial loss values highlight the challenges of accurately estimating state values at the beginning of training. As training continues, the critic loss decreases, indicating better accuracy in value estimation. A notable spike around episode 100 suggests a temporary deviation in value estimation, likely due to a significant policy update. Nevertheless, the critic loss quickly stabilizes, demonstrating the network’s resilience and its capacity to recover from such deviations.

3.3.3. Output Analysis

Due to the randomness in the test dataset, the actions generated by the agent are not identical in every run, and there is some variability in the rewards obtained. However, the overall trend remains stable. Researchers selected a specific set of actions from one of the runs for detailed analysis and examined the agent’s action strategy. Table 3 presents the selected cells and the feature states of these cells before and after being influenced by the actions.
Table 3 illustrates how the DRL agent modifies the proportions of different land-use types within selected grid cells over ten consecutive actions. Columns X and Y denote the spatial coordinates of each operated cell within the gridded urban matrix, where each cell represents a 0.5 km × 0.5 km spatial unit of Hangzhou’s central area. Each pair of “BEF.” (before) and “AFT.” (after) values represent the feature state of a specific cell prior to and following the agent’s decision. The values correspond to the counts of eight functional categories—residential, enterprise, education, dining, shopping, scenic, industrial, and medical land uses—within the same spatial unit.
The reward column reflects the reduction in the environment’s normalized carbon-emission indicator after each action. Across this episode, the cumulative reward over ten consecutive actions reaches 2.06, which corresponds to an overall decrease of about 1.84% relative to the initial total-emission level in this run. This shows that the DRL agent is able to achieve a stable, step-by-step reduction in combined land-use and travel emissions through iterative land-use proportion adjustments, even though individual actions may vary slightly due to stochasticity in the test setting.
Figure 11 further maps the locations of the operated cells onto the real-world urban grid and shows, for each cell, the percentage change in these eight land-use functions using bar charts. The background map shows the grid overlaid on the central area of Hangzhou, with highlighted cells indicating locations where the PPO agent takes actions.
Furthermore, researchers analyzed the spatial distribution of the different land-use types using the Gini coefficient. The Gini coefficient, ranging from 0 (complete equality) to 1 (complete inequality), measures the degree of inequality in a distribution. It was originally used to assess income distribution inequality and is calculated as in Equation (12):

$$G = \frac{\sum_{i=1}^{n} (2i - n - 1)\, x_i}{n \sum_{i=1}^{n} x_i}$$

where $x_i$ is the $i$-th data value after sorting in ascending order, and $n$ is the number of data points.
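Equation (12) translates directly into a short function; applying it to one land-use type’s counts across cells shows how concentration raises the coefficient:

```python
import numpy as np

def gini(x):
    """Gini coefficient per Equation (12); x holds one land-use type's
    counts across all cells (sorted ascending inside the function)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    return float(np.sum((2 * i - n - 1) * x) / (n * np.sum(x)))

assert np.isclose(gini([5, 5, 5, 5]), 0.0)       # perfectly even spread
assert gini([0, 0, 0, 20]) > gini([2, 4, 6, 8])  # concentration raises G
```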
The Gini coefficient for the initial state and for each of the subsequent 10 actions, calculated based on the 8 land-use types, is depicted in Figure 12:
From Figure 12, it is evident that the agent tends to favor a centralized city. The rate of change in the Gini coefficients across the eight subplots indicates varying degrees of preference for different land-use functions under this background setting. For example, residential, dining, and shopping areas show faster changes in concentration, which may be closely related to daily life and the agent’s prioritization of their distribution impacts. In contrast, changes in education and medical facilities are more gradual, likely due to the influence of transportation infrastructure, as convenient public transport reduces residents’ dependence on nearby educational and medical resources.
Overall, the agent appears inclined to centralize the urban layout in Hangzhou, consistent with the findings of Ewing, Cervero [75], Lu [76] and Tayarani [77], who suggest that high-density, mixed-use urban designs significantly reduce vehicle miles traveled (VMT) and transportation-related carbon emissions. Concentrating facilities and services in high-density residential areas meets residents’ daily needs, minimizing the need for long commutes. This analysis underscores the agent’s strategic shift towards creating more centralized urban environments, which could lead to more sustainable urban development patterns by reducing carbon emissions and enhancing the efficiency of resource use.

3.4. Baseline Comparisons

To validate the effectiveness and superiority of the Proximal Policy Optimization (PPO) algorithm in deep reinforcement learning, researchers selected Linear Regression and Genetic Algorithm (GA) as comparison models. Linear Regression, while simple and interpretable, struggles with complex nonlinear relationships and is particularly limited in environments that require understanding sequential action dependencies—a critical aspect in dynamic settings governed by Markov Decision Processes (MDP). The Genetic Algorithm, known for its global optimization capabilities suitable for both continuous and discrete problems, simulates the process of natural selection. However, it tends to converge slowly in large-scale scenarios and, crucially, lacks the sequential decision-making prowess essential for dynamic environments that DRL handles adeptly. Proximal Policy Optimization, by leveraging the MDP framework, excels in environments where the order and timing of actions are fundamental, providing a robust method for training policies that effectively respond to evolving state conditions.

3.4.1. Linear Regression Model

The Linear Regression model is a fundamental statistical tool used to understand the relationship between a dependent variable and one or more independent variables. This model is widely applied across various fields, including business, economics, engineering, social research, and health sciences, to analyze data and predict outcomes [78]. The basic premise of linear regression is to find the best-fitting linear equation that describes the relationship between variables, typically by minimizing the sum of the squared residuals, which are the differences between the observed and predicted values. In this context, it can be described as Equation (13):
$$\bar{y} = w^{T} x + b$$

where $\bar{y}$ is the predicted action, $w$ is the weight vector, $x$ is the input state vector, and $b$ is the bias term.
The goal of linear regression in this context is to provide a benchmark by evaluating how well a simple linear model can predict outcomes in the optimization task, highlighting the advantages and limitations of this approach compared to more complex algorithms like PPO.
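As a sketch of how this baseline can be fitted, ordinary least squares with an appended bias column recovers $w$ and $b$ of Equation (13); the synthetic state–action pairs below are illustrative, not the study’s dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 10-dim states mapped to 8-dim actions by a linear rule.
X = rng.standard_normal((200, 10))
true_w = rng.standard_normal((10, 8))
Y = X @ true_w + 0.01 * rng.standard_normal((200, 8))   # small noise

X1 = np.hstack([X, np.ones((200, 1))])      # append a bias column
W, *_ = np.linalg.lstsq(X1, Y, rcond=None)  # minimize squared residuals
w, b = W[:-1], W[-1]

pred = X @ w + b
assert np.mean((pred - Y) ** 2) < 1e-3      # near-perfect fit on a linear task
```

On a genuinely linear task the fit is near-perfect; the baseline’s weakness, as the text argues, appears only once the state-to-action mapping becomes nonlinear and sequential.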
By comparing the performance of the linear regression model and the genetic algorithm with the PPO-based deep reinforcement learning model, researchers aim to demonstrate the effectiveness of the PPO algorithm, especially in dealing with the complex, nonlinear relationships present in urban land-use optimization and carbon emission reduction tasks.

3.4.2. Genetic Algorithm

The Genetic Algorithm (GA) is a powerful and efficient search technique inspired by the principles of natural selection and genetics, designed to solve complex optimization problems by mimicking the evolutionary processes observed in biological systems. Initially introduced by John Holland in the 1970s, GAs have since evolved into a versatile tool applicable to various domains, including artificial intelligence, theoretical modeling, and predictive programming [79]. The core concept of GA involves creating a population of potential solutions, encoded as “chromosomes”, which evolve over time through processes akin to reproduction, mutation, and selection. This evolutionary approach enables GAs to excel in handling high-dimensional, nonlinear, and discontinuous problems, making them widely applicable in fields such as engineering optimization, machine learning, and data mining.
In this research, the Genetic Algorithm is applied to optimize the modification of cell states within an urban environment. The specific steps are as follows:
  • Initialization of the Population: A population of individuals is generated, with each individual containing a series of random solutions. Each solution represents the index of a cell and the corresponding state modification vector. Each solution (or action) consists of two parts: the selected cell index (cell_idx) and the corresponding state modification (action). An individual is represented as:
individual = [(cell_idx1, action1), (cell_idx2, action2), …, (cell_idx10, action10)]
Since each action modifies the environmental features of the selected cell, the sequence of actions is crucial. Each individual represents a series of actions (in this case, 10 steps).
  • Fitness Evaluation: The quality of each solution is evaluated using a fitness function. The fitness function calculates the score of each solution based on its performance in the environment. In this study, the difference in carbon emissions before and after modifying the environment state is used as the fitness value, as shown in the following formula:
Fitness = Initial Emissions − Final Emissions
The higher the fitness value, the more effective the solution is in reducing carbon emissions.
  • Selection, Crossover, and Mutation: Based on the fitness values, individuals with better performance are selected to undergo crossover and mutation operations, generating new solutions.
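The three steps above can be sketched end-to-end on a toy emissions model; the environment, emission factors, and GA settings below are illustrative assumptions, not the study’s configuration:

```python
import random

random.seed(0)

N_CELLS, N_TYPES, STEPS, POP = 20, 8, 10, 30
ALPHA = [0.9, 0.5, 0.3, 0.6, 0.7, 0.2, 1.0, 0.4]   # toy emission factors

def emissions(grid):
    return sum(sum(p * a for p, a in zip(cell, ALPHA)) for cell in grid)

def apply(grid, cell_idx, action):
    cell = [max(p + d, 0.0) for p, d in zip(grid[cell_idx], action)]
    s = sum(cell) or 1.0
    grid[cell_idx] = [p / s for p in cell]          # renormalize shares

def random_individual():
    """A sequence of 10 (cell_idx, action) genes, as described in the text."""
    return [(random.randrange(N_CELLS),
             [random.uniform(-0.2, 0.2) for _ in range(N_TYPES)])
            for _ in range(STEPS)]

def fitness(ind):
    """Fitness = Initial Emissions - Final Emissions."""
    grid = [[1 / N_TYPES] * N_TYPES for _ in range(N_CELLS)]
    e0 = emissions(grid)
    for cell_idx, action in ind:                    # action order matters
        apply(grid, cell_idx, action)
    return e0 - emissions(grid)

pop = [random_individual() for _ in range(POP)]
for _ in range(15):                                 # generations
    pop.sort(key=fitness, reverse=True)
    parents = pop[:POP // 2]                        # truncation selection
    children = []
    for _ in range(POP - len(parents)):
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, STEPS)            # one-point crossover
        child = a[:cut] + b[cut:]
        if random.random() < 0.3:                   # mutation: replace one gene
            child[random.randrange(STEPS)] = (random.randrange(N_CELLS),
                [random.uniform(-0.2, 0.2) for _ in range(N_TYPES)])
        children.append(child)
    pop = parents + children

best = max(pop, key=fitness)
assert fitness(best) > 0                            # evolved schedule cuts emissions
```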

3.4.3. Comparison with Baseline Models

When comparing the training outcomes, it is evident that both the Genetic Algorithm (GA) and Linear Regression (LR) struggle to consistently achieve significant gains, highlighting their limitations in handling complex tasks. The linear regression model, while simple and interpretable, fails to capture the nonlinear relationships inherent in the urban land-use optimization problem, leading to suboptimal performance. Similarly, the GA, although capable of global search, often requires many generations to converge to a good solution, and its performance can depend heavily on initialization and mutation rates.
In contrast, the PPO-based deep reinforcement learning approach demonstrates a gradually converging loss and an increasing reward trend, despite some fluctuations. This upward trend indicates that the agent is beginning to grasp the reward calculation rules within the environment and is making increasingly appropriate actions based on the environmental feedback. The PPO algorithm’s ability to learn from continuous interaction with the environment allows it to adapt more effectively to the dynamic and complex nature of urban land-use scenarios.
Figure 13 illustrates the performance during the validation phase of DRL (PPO), LR, and GA. Since GA does not retain a model, it requires re-optimization based on the environment for each run. The LR model, on the other hand, is tested based on the previously trained linear regression model.
This comparison highlights the strengths of the PPO-based approach in efficiently navigating and optimizing complex urban environments, where traditional methods like GA and LR may falter. Quantitative evaluation on the test dataset demonstrates clear performance differences among the three approaches. The PPO-based DRL agent achieved an average emission reduction of 15.2% (±2.3%) relative to the initial baseline configuration across 50 test episodes, with the best-performing episode reaching 18.7% reduction. In contrast, the Genetic Algorithm achieved modest reductions of 6.8% (±3.1%), exhibiting high variability and inconsistent convergence. The Linear Regression baseline performed poorest, yielding only 3.2% (±1.4%) average emission reduction, confirming its inability to capture the complex non-linear relationships between land-use configurations and carbon outcomes. Statistical analysis using paired t-tests confirmed that PPO’s performance advantage over both GA (p < 0.001) and LR (p < 0.001) is highly significant, with PPO achieving approximately 2.2× greater emission reductions than GA and 4.8× greater reductions than LR. These quantitative results validate PPO’s superiority for sequential spatial optimization tasks requiring long-term strategic decision-making under dynamic urban constraints.

4. Discussion and Conclusions

4.1. Comparison with Similar Studies and Methodological Critique

The DRL framework developed in this study advances beyond previous approaches in several key aspects. Traditional land-use optimization studies employing linear programming [3] or cellular automata models [80] rely on predetermined rules and static parameters, lacking adaptive learning capacity. Recent machine learning applications for carbon emission prediction [41,42] provide forecasting capabilities but cannot generate actionable spatial strategies. Compared to these static approaches, the DRL framework demonstrates autonomous policy learning through environmental interaction, developing context-specific optimization strategies without manual feature engineering.
Previous reinforcement learning applications in urban contexts have focused primarily on traffic signal control [81] or building energy management [40], operating at scales incompatible with comprehensive land-use planning. Zheng et al. [55] pioneered DRL for spatial layout optimization with emphasis on accessibility within 15 min neighborhoods, while Shen et al. [56] addressed travel-based carbon emissions through configuration adjustments. This research extends these contributions by integrating multi-source data at the neighborhood scale and quantifying spatial evolution patterns through the Gini coefficient, providing interpretable metrics for policy evaluation.
Critical assessment of the methodology reveals several limitations inherent in current DRL-based planning approaches. First, the deterministic environmental assumption simplifies urban system complexity by excluding stochastic elements such as policy uncertainties, economic fluctuations, and behavioral variability [29]. Real-world urban planning operates under substantial uncertainty that deterministic models cannot fully capture. Second, the reward function’s exclusive focus on carbon reduction overlooks potential trade-offs with other sustainability objectives including social equity, economic viability, and ecological preservation [82]. Third, the static carbon emission factors derived from literature averages may not accurately represent temporal evolution of building efficiency standards or technological improvements in transportation systems [75].
Compared to genetic algorithms and linear regression baselines employed in this study, the PPO-based DRL framework demonstrated superior performance in navigating complex, nonlinear relationships between spatial configurations and emission outcomes. However, the computational requirements and training time remain substantially higher than traditional optimization methods, potentially limiting scalability to metropolitan-scale applications. Future iterations should incorporate multi-objective optimization frameworks that balance carbon reduction with socioeconomic considerations, implement transfer learning approaches to reduce computational costs across multiple cities, and integrate uncertainty quantification through stochastic environment modeling.
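For readers unfamiliar with PPO's mechanics, the advantage it holds over the GA and LR baselines stems largely from its clipped surrogate objective, which bounds each policy update. A per-sample sketch of the standard formulation (not the study's implementation) is:

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate loss for a single (state, action) sample.

    ratio     -- pi_new(a|s) / pi_old(a|s), the policy probability ratio
    advantage -- estimated advantage A(s, a)
    eps       -- clipping parameter (0.2 is the common default)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    # PPO maximizes the minimum of the two terms, i.e. minimizes its negative,
    # which prevents any single update from moving the policy too far.
    return -min(unclipped, clipped)

print(ppo_clip_loss(1.5, 1.0))   # ratio clipped to 1.2 -> loss -1.2
print(ppo_clip_loss(0.5, -1.0))  # ratio clipped to 0.8 -> loss 0.8
```

This clipping is what yields the stable convergence of the policy and critic losses over the 400 training episodes, in contrast to the high variability of the GA baseline.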

4.2. Uncertainties and Model Parameters

Several sources of uncertainty affect the reliability and generalizability of the results, beginning with the input data. Point of Interest data obtained from commercial mapping platforms may be incomplete or temporally lagged, as the density and accuracy of POI information depend on the update frequency of service providers and on user contributions. Area of Interest boundaries derived from land parcel data may contain geometric imprecision due to digitization errors or administrative boundary ambiguities. Transportation System Data from OpenStreetMap represent a snapshot of the road network and may not capture recent infrastructure developments or temporary modifications. The carbon emission factors applied in this study are literature averages compiled from multiple sources; such factors vary spatially and temporally with building age, construction materials, energy sources, occupancy patterns, and regional climate conditions. The assumption of uniform emission factors across the study area is therefore a simplification that may not fully reflect localized variations in energy consumption and carbon intensity.

Model parameter uncertainties also warrant consideration. The spatial resolution of one kilometer by one kilometer grid cells represents a compromise between computational feasibility and spatial detail; finer resolutions might capture additional heterogeneity in land-use patterns and emission distributions. The seven-to-three division of the dataset into training and testing subsets was determined empirically, and alternative split ratios might affect generalization performance. Hyperparameters, including the learning rate, network architecture dimensions, and number of training episodes, were tuned through preliminary experimentation; a systematic sensitivity analysis would provide greater confidence in the robustness of the results.
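One straightforward way to propagate the emission-factor uncertainty described above into the aggregate result is a Monte Carlo perturbation: sample the factors around their literature means and recompute total emissions to obtain an interval estimate. The sketch below uses entirely hypothetical factors, areas, and spread, purely to illustrate the procedure:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical mean emission factors (tCO2 per unit area per year) and
# per-cell areas for three land-use categories -- illustrative values only.
mean_factors = np.array([0.80, 1.20, 0.50])   # residential, commercial, green
areas = np.array([[120.0, 40.0, 30.0],        # grid cell 1
                  [ 60.0, 90.0, 10.0]])       # grid cell 2

def sample_total_emissions(n_draws=10_000, rel_sd=0.15):
    """Monte Carlo draws of total emissions under lognormal factor noise."""
    # Lognormal multipliers with ~rel_sd relative spread per category
    noise = rng.lognormal(mean=0.0, sigma=rel_sd, size=(n_draws, 3))
    factors = mean_factors * noise
    # Total emissions per draw: category areas (summed over cells) x factors
    return (areas.sum(axis=0) * factors).sum(axis=1)

draws = sample_total_emissions()
low, high = np.percentile(draws, [2.5, 97.5])
print(f"95% interval for total emissions: [{low:.1f}, {high:.1f}] tCO2/yr")
```

Repeating the DRL evaluation across such sampled factor sets would indicate how sensitive the reported reduction percentages are to the uniform-factor assumption.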
The deterministic environmental assumption simplifies the complexity of urban systems by eliminating stochastic elements such as unpredictable policy interventions, economic fluctuations, technological innovations, and behavioral changes among urban residents. While this simplification facilitates initial model development and interpretation, future iterations should incorporate probabilistic components to better represent real-world dynamics.

4.3. Research Limitations

This research has several limitations that should be acknowledged. First, the geographic scope is limited to Hangzhou, and the transferability of the trained DRL agent to other cities with different morphological characteristics, climatic conditions, economic structures, and development stages remains unvalidated. Urban contexts vary substantially across regions, and the optimal land-use configurations identified for Hangzhou may not generalize to cities with distinct spatial constraints or functional requirements. Second, the deterministic environment model does not account for stochastic elements inherent in urban systems, including policy uncertainties, market dynamics, technological changes, and human behavioral variability. Real-world urban planning operates under conditions of substantial uncertainty, and the predictability assumed in this model may not reflect actual implementation challenges. Third, the carbon emission factors employed are static and do not incorporate anticipated improvements in building energy efficiency, renewable energy adoption, electric vehicle penetration, or other technological advancements that will alter emission profiles over time. Fourth, the framework does not explicitly consider socioeconomic constraints such as land acquisition costs, displacement of existing residents or businesses, political feasibility, stakeholder acceptance, or equity implications of spatial reconfiguration. Practical implementation of optimized land-use plans requires addressing these multidimensional considerations beyond carbon emission minimization. Fifth, computational requirements for training the DRL agent may limit scalability to larger geographic areas or higher spatial resolutions, potentially constraining real-time application in rapidly evolving urban contexts. 
Sixth, the reward function focuses exclusively on carbon emission reduction and does not incorporate other sustainability objectives such as biodiversity conservation, air quality improvement, thermal comfort, social cohesion, or economic vitality, which may be equally important in comprehensive urban planning.

4.4. Future Directions and Recommendations

Future research should address the limitations identified in this study along several directions. First, incorporating stochastic elements into the environment model would enhance realism by representing uncertainties in policy implementation, economic conditions, technological evolution, and human behavior. Probabilistic modeling approaches could quantify confidence intervals for emission reduction estimates and identify robust strategies that perform well across multiple scenarios. Second, extending validation to multiple cities with diverse characteristics would test the generalizability of the DRL framework and enable comparative analysis of optimal spatial configurations across different urban contexts. Transfer learning techniques could accelerate model deployment in new cities by leveraging knowledge gained from prior applications. Third, developing dynamic carbon emission factors that reflect anticipated technological improvements and policy interventions would improve the accuracy of long-term projections and support adaptive planning strategies responsive to changing conditions. Fourth, integrating socioeconomic constraints and multi-objective optimization frameworks would produce more implementable solutions by balancing carbon reduction with economic feasibility, social equity, and political acceptability. Incorporating stakeholder preferences and participatory planning processes could enhance the legitimacy and acceptance of DRL-generated recommendations. Fifth, advancing computational efficiency through algorithm optimization, parallel processing, or cloud computing infrastructure would enable application at larger scales and higher resolutions, supporting metropolitan-level planning initiatives.
Sixth, expanding the reward function to encompass multiple sustainability dimensions including ecological preservation, public health, livability, and economic prosperity would yield holistic urban optimization that aligns with comprehensive sustainability goals. Finally, developing real-time monitoring and adaptive management systems that couple DRL optimization with sensor networks and digital twin technologies would enable continuous refinement of land-use strategies in response to observed outcomes and changing urban conditions.

5. Conclusions

This research developed and validated a Deep Reinforcement Learning framework for optimizing urban land-use configurations to reduce carbon emissions, demonstrating its application through a comprehensive case study in Hangzhou, China. The framework integrates multi-source urban datasets encompassing Points of Interest, Areas of Interest, and Transportation System Data to construct fine-grained carbon emission maps that guide spatial optimization decisions. By employing the Proximal Policy Optimization algorithm within an Actor–Critic architecture, the trained agent autonomously learned spatial strategies that minimize total carbon emissions while maintaining urban functionality.
The results reveal that the DRL agent developed distinct optimization patterns characterized by selective centralization of land-use types. Residential, dining, and commercial areas exhibited accelerated spatial concentration, forming compact, mixed-use urban clusters that substantially reduce transportation-related carbon emissions by minimizing travel distances. Quantification through the Gini coefficient demonstrated progressive increases in spatial centralization for these functional categories throughout the training process, reflecting the agent’s strategic preference for high-density, service-rich configurations. In contrast, educational and medical facilities displayed more balanced spatial distribution patterns, indicating optimization for accessibility under transportation network constraints. These findings align with established principles of sustainable urban development, including transit-oriented design, compact urban forms, and mixed-use planning.
The methodological contribution extends beyond technical innovation to provide practical policy relevance. The framework offers urban planners and policymakers a data-driven computational tool for evaluating alternative spatial configurations and identifying emission-reduction opportunities at the neighborhood scale. By translating complex urban dynamics into quantifiable optimization problems, the approach bridges artificial intelligence capabilities with urban sustainability objectives. The successful validation in Hangzhou demonstrates feasibility for application in other rapidly urbanizing contexts facing similar carbon reduction imperatives.
This work advances the integration of advanced computational intelligence with urban planning practice, contributing a reproducible methodology for developing low-carbon, climate-resilient cities. As urban areas continue to expand and intensify globally, the need for scientifically informed, adaptive planning strategies becomes increasingly critical. The DRL-based framework presented here represents a significant step toward achieving data-driven urban optimization that addresses the urgent challenges of climate change while supporting livable, functional, and sustainable urban environments for future generations.

Author Contributions

J.S. and F.Z. contributed equally to this work. J.S. conceived the experiment(s). F.Z. developed the methodology. J.S. created the software. J.S. and F.Z. performed validation. F.Z. conducted formal analysis. F.Z. carried out the investigation. J.S. provided resources. J.S. curated the data. J.S. and F.Z. prepared the original draft, and all authors (J.S., F.Z., T.C., W.D., A.B., F.B.T. and E.L.) reviewed and edited the manuscript. J.S. and F.Z. created visualizations. T.C., W.D., A.B., F.B.T. and E.L. supervised the project. J.S. managed the project administration. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Arabi Aliabad, F.; Ghafarian Malamiri, H.; Sarsangi, A.; Sekertekin, A.; Ghaderpour, E. Identifying and Monitoring Gardens in Urban Areas Using Aerial and Satellite Imagery. Remote Sens. 2023, 15, 4053. [Google Scholar] [CrossRef]
  2. Zhao, Y.; Xie, J.; Zhu, H.; Luo, T.; Xiong, Y.; Fan, C.; Xia, H.; Chen, Y.; Zhang, F. Land-Unet: A deep learning network for precise segmentation and identification of non-structured land use types in rural areas for green urban space analysis. Ecol. Inform. 2025, 87, 103078. [Google Scholar] [CrossRef]
  3. Wang, S.; Liu, X.; Zhou, C.; Hu, J.; Ou, J. Examining the impacts of socioeconomic factors, urban form, and transportation networks on CO2 emissions in China’s megacities. Appl. Energy 2017, 185, 189–200. [Google Scholar] [CrossRef]
  4. Qiao, R.; Wu, Z.; Jiang, Q.; Liu, X.; Gao, S.; Xia, L.; Yang, T. The nonlinear influence of land conveyance on urban carbon emissions: An interpretable ensemble learning-based approach. Land Use Policy 2024, 140, 107117. [Google Scholar] [CrossRef]
  5. Wu, C.; Li, G.; Yue, W.; Lu, R.; Lu, Z.; You, H. Effects of Endogenous Factors on Regional Land-Use Carbon Emissions Based on the Grossman Decomposition Model: A Case Study of Zhejiang Province, China. Environ. Manag. 2015, 55, 467–478. [Google Scholar] [CrossRef] [PubMed]
  6. Seto, K.C.; Güneralp, B.; Hutyra, L.R. Global forecasts of urban expansion to 2030 and direct impacts on biodiversity and carbon pools. Proc. Natl. Acad. Sci. USA 2012, 109, 16083–16088. [Google Scholar] [CrossRef] [PubMed]
  7. Pouyat, R.V.; Yesilonis, I.D.; Nowak, D.J. Carbon Storage by Urban Soils in the United States. J. Environ. Qual. 2006, 35, 1566–1575. [Google Scholar] [CrossRef] [PubMed]
  8. Lv, T.; Hu, H.; Zhang, X.; Xie, H.; Fu, S.; Wang, L. Spatiotemporal pattern of regional carbon emissions and its influencing factors in the Yangtze River Delta urban agglomeration of China. Environ. Monit. Assess. 2022, 194, 515. [Google Scholar] [CrossRef]
  9. Liu, S.; Zhang, X.; Feng, Y.; Xie, H.; Jiang, L.; Lei, Z. Spatiotemporal Dynamics of Urban Green Space Influenced by Rapid Urbanization and Land Use Policies in Shanghai. Forests 2021, 12, 476. [Google Scholar] [CrossRef]
  10. Mu, W.; Zhu, X.; Ma, W.; Han, Y.; Huang, H.; Huang, X. Impact assessment of urbanization on vegetation net primary productivity: A case study of the core development area in central plains urban agglomeration, China. Environ. Res. 2023, 229, 115995. [Google Scholar] [CrossRef] [PubMed]
  11. Liu, Y.; Gao, C.; Lu, Y. The impact of urbanization on GHG emissions in China: The role of population density. J. Clean. Prod. 2017, 157, 299–309. [Google Scholar] [CrossRef]
  12. Liu, X.; Li, T.; Zhang, S.; Jia, Y.; Li, Y.; Xu, X. The role of land use, construction and road on terrestrial carbon stocks in a newly urbanized area of western Chengdu, China. Landsc. Urban Plan. 2016, 147, 88–95. [Google Scholar] [CrossRef]
  13. Shi, W.; Xiang, Y.; Ying, Y.; Jiao, Y.; Zhao, R.; Qiu, W. Predicting Neighborhood-Level Residential Carbon Emissions from Street View Images Using Computer Vision and Machine Learning. Remote Sens. 2024, 16, 1312. [Google Scholar] [CrossRef]
  14. Murray, M.; Greer, J.; Houston, D.; McKay, S.; Murtagh, B. Bridging Top Down and Bottom Up: Modelling Community Preferences for a Dispersed Rural Settlement Pattern. Eur. Plan. Stud. 2009, 17, 441–462. [Google Scholar] [CrossRef]
  15. Healey, P. Collaborative Planning: Shaping Places in Fragmented Societies, 1st ed.; Macmillan: Houndmills, UK, 1997. [Google Scholar]
  16. Han, Z.; Yan, W.; Liu, G. A Performance-Based Urban Block Generative Design Using Deep Reinforcement Learning and Computer Vision. In Proceedings of the 2020 DigitalFUTURES, Proceedings of the 2nd International Conference on Computational Design and Robotic Fabrication (CDRF 2020), Shanghai, China, 5–6 July 2020; Yuan, P.F., Yao, J., Yan, C., Wang, X., Leach, N., Eds.; Springer: Singapore, 2021; pp. 134–143. [Google Scholar] [CrossRef]
  17. Yubo, Z.; Zhuoran, Y.; Jiuchun, Y.; Yuanyuan, Y.; Dongyan, W.; Yucong, Z.; Fengqin, Y.; Lingxue, Y.; Liping, C.; Shuwen, Z. A Novel Model Integrating Deep Learning for Land Use/Cover Change Reconstruction: A Case Study of Zhenlai County, Northeast China. Remote Sens. 2020, 12, 3314. [Google Scholar] [CrossRef]
  18. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2019, arXiv:1509.02971. [Google Scholar] [CrossRef]
  19. Lee, J.-S.; Nam, J.; Lee, S.-S. Built Environment Impacts on Individual Mode Choice: An Empirical Study of the Houston-Galveston Metropolitan Area. Int. J. Sustain. Transp. 2014, 8, 447–470. [Google Scholar] [CrossRef]
  20. Khan, M.; Kockelman, K.M.; Xiong, X. Models for anticipating non-motorized travel choices, and the role of the built environment. Transp. Policy 2014, 35, 117–126. [Google Scholar] [CrossRef]
  21. Munshi, T. Built environment and mode choice relationship for commute travel in the city of Rajkot, India. Transp. Res. Part D Transp. Environ. 2016, 44, 239–253. [Google Scholar] [CrossRef]
  22. Antipova, A.; Wang, F.; Wilmot, C. Urban land uses, socio-demographic attributes and commuting: A multilevel modeling approach. Appl. Geogr. 2011, 31, 1010–1018. [Google Scholar] [CrossRef]
  23. Wang, D.; Lin, T. Residential self-selection, built environment, and travel behavior in the Chinese context. J. Transp. Land Use 2014, 7, 5–14. [Google Scholar] [CrossRef]
  24. Cao, J. Examining the impacts of neighborhood design and residential self-selection on active travel: A methodological assessment. Urban Geogr. 2015, 36, 236–255. [Google Scholar] [CrossRef]
  25. Thao, V.T.; Ohnmacht, T. The impact of the built environment on travel behavior: The Swiss experience based on two National Travel Surveys. Res. Transp. Bus. Manag. 2020, 36, 100386. [Google Scholar] [CrossRef]
  26. Faber, R.; Merkies, R.; Damen, W.; Oirbans, L.; Massa, D.; Kroesen, M.; Molin, E. The role of travel-related reasons for location choice in residential self-selection. Travel Behav. Soc. 2021, 25, 120–132. [Google Scholar] [CrossRef]
  27. Næss, P. Causality and self-selection. In Handbook on Transport and Land Use; De Abreu E Silva, J., Currans, K., Van Acker, V., Schneider, R., Eds.; Edward Elgar Publishing: Cheltenham, UK, 2023; pp. 107–128. [Google Scholar] [CrossRef]
  28. Van Wee, B.; Cao, X.J. Residential self-selection in the relationship between the built environment and travel behavior: A literature review and research agenda. In Advances in Transport Policy and Planning; Elsevier: Amsterdam, The Netherlands, 2022; Volume 9, pp. 75–94. [Google Scholar] [CrossRef]
  29. Mondal, A.; Bhat, C.R. Investigating Residential Built Environment Effects on Rank-Based Modal Preferences and Auto-Ownership. Transp. Res. Rec. J. Transp. Res. Board 2023, 2677, 777–796. [Google Scholar] [CrossRef]
  30. Seong, E.Y.; Lee, N.H.; Choi, C.G. Relationship between Land Use Mix and Walking Choice in High-Density Cities: A Review of Walking in Seoul, South Korea. Sustainability 2021, 13, 810. [Google Scholar] [CrossRef]
  31. Schlossberg, M.; Greene, J.; Phillips, P.P.; Johnson, B.; Parker, B. School Trips: Effects of Urban Form and Distance on Travel Mode. J. Am. Plan. Assoc. 2006, 72, 337–346. [Google Scholar] [CrossRef]
  32. Hu, Y.; Sobhani, A.; Ettema, D. Exploring commute mode choice in dual-earner households in a small Chinese city. Transp. Res. Part D Transp. Environ. 2022, 102, 103148. [Google Scholar] [CrossRef]
  33. Li, J.; Walker, J.L.; Srinivasan, S.; Anderson, W.P. Modeling Private Car Ownership in China: Investigation of Urban Form Impact Across Megacities. Transp. Res. Rec. J. Transp. Res. Board 2010, 2193, 76–84. [Google Scholar] [CrossRef]
  34. Manaugh, K.; Miranda-Moreno, L.F.; El-Geneidy, A.M. The effect of neighbourhood characteristics, accessibility, home–work location, and demographics on commuting distances. Transportation 2010, 37, 627–646. [Google Scholar] [CrossRef]
  35. Mo, B.; Zheng, Y.; Guo, X.; Ma, R.; Zhao, J. Robust Discrete Choice Model for Travel Behavior Prediction with Data Uncertainties (Version 1). arXiv 2024, arXiv:2401.03276. [Google Scholar] [CrossRef]
  36. Harz, J.; Sommer, C. Mode choice of city tourists: Discrete choice modeling based on survey data from a major German city. Transp. Res. Interdiscip. Perspect. 2022, 16, 100704. [Google Scholar] [CrossRef]
  37. Jeong, J.; Lee, J.; Gim, T.T. Travel mode choice as a representation of travel utility: A multilevel approach reflecting the hierarchical structure of trip, individual, and neighborhood characteristics. Pap. Reg. Sci. 2022, 101, 745–766. [Google Scholar] [CrossRef]
  38. Liu, Y. Transportation Mode Choice and Built Environment Around Metro Stations. In Built Environment and Walking & Cycling Around Metro Stations; Springer Nature: Singapore, 2023; pp. 17–26. [Google Scholar] [CrossRef]
  39. Nguyen, T.M.C.; Kato, H.; Phan, L.B. Is Built Environment Associated with Travel Mode Choice in Developing Cities? Evidence from Hanoi. Sustainability 2020, 12, 5773. [Google Scholar] [CrossRef]
  40. Chen, P.; Gao, J.; Ji, Z.; Liang, H.; Peng, Y. Do Artificial Intelligence Applications Affect Carbon Emission Performance?—Evidence from Panel Data Analysis of Chinese Cities. Energies 2022, 15, 5730. [Google Scholar] [CrossRef]
  41. Zhang, M.; Kafy, A.-A.; Xiao, P.; Han, S.; Zou, S.; Saha, M.; Zhang, C.; Tan, S. Impact of urban expansion on land surface temperature and carbon emissions using machine learning algorithms in Wuhan, China. Urban Clim. 2023, 47, 101347. [Google Scholar] [CrossRef]
  42. Nassef, A.M.; Olabi, A.G.; Rezk, H.; Abdelkareem, M.A. Application of Artificial Intelligence to Predict CO2 Emissions: Critical Step towards Sustainable Environment. Sustainability 2023, 15, 7648. [Google Scholar] [CrossRef]
  43. Dong, M.; Wang, G.; Han, X. Artificial Intelligence, Industrial Structure Optimization, and CO2 Emissions. Environ. Sci. Pollut. Res. 2023, 30, 108757–108773. [Google Scholar] [CrossRef] [PubMed]
  44. Chen, S.; Zhang, S.; Zeng, Q.; Ao, J.; Chen, X.; Zhang, S. Can artificial intelligence achieve carbon neutrality? Evidence from a quasi-natural experiment. Front. Ecol. Evol. 2023, 11, 1151017. [Google Scholar] [CrossRef]
  45. Li, X.; Piao, Z.; Zheng, Y.; Han, J.; Cong, R. Smart Urban Carbon Emission Management Platform Based on Energy Big Data. In Innovative Computing Vol 1—Emerging Topics in Artificial Intelligence; Hung, J.C., Chang, J.-W., Pei, Y., Eds.; Springer Nature: Singapore, 2023; Volume 1044, pp. 831–840. [Google Scholar] [CrossRef]
  46. Priya, S.D.; Saranya, K.G. Significance of artificial intelligence in the development of sustainable transportation. Sci. Temper 2023, 14, 418–425. [Google Scholar] [CrossRef]
  47. Montague, P.R. Reinforcement Learning Models Then-and-Now: From Single Cells to Modern Neuroimaging. In 20 Years of Computational Neuroscience; Bower, J.M., Ed.; Springer: New York, NY, USA, 2013; pp. 271–277. [Google Scholar] [CrossRef]
  48. Buffet, O.; Pietquin, O.; Weng, P. Reinforcement Learning (Version 2). arXiv 2020, arXiv:2005.14419. [Google Scholar] [CrossRef]
  49. Lyu, L.; Shen, Y.; Zhang, S. The Advance of Reinforcement Learning and Deep Reinforcement Learning. In Proceedings of the 2022 IEEE International Conference on Electrical Engineering, Big Data and Algorithms (EEBDA), Changchun, China, 25–27 February 2022; pp. 644–648. [Google Scholar] [CrossRef]
  50. Tsantekidis, A.; Passalis, N.; Tefas, A. Deep reinforcement learning. In Deep Learning for Robot Perception and Cognition; Elsevier: Amsterdam, The Netherlands, 2022; pp. 117–129. [Google Scholar] [CrossRef]
  51. François-Lavet, V.; Henderson, P.; Islam, R.; Bellemare, M.G.; Pineau, J. An Introduction to Deep Reinforcement Learning. Found. Trends® Mach. Learn. 2018, 11, 219–354. [Google Scholar] [CrossRef]
  52. Renna, L. Deep Reinforcement Learning for 2D Physics-Based Object Manipulation in Clutter (Version 1). arXiv 2023, arXiv:2312.04570. [Google Scholar] [CrossRef]
  53. Li, S.E. Deep Reinforcement Learning. In Reinforcement Learning for Sequential Decision and Optimal Control; Springer Nature: Singapore, 2023; pp. 365–402. [Google Scholar] [CrossRef]
  54. Liu, Y. Applications of deep reinforcement learning Alphago. Appl. Comput. Eng. 2023, 5, 637–641. [Google Scholar] [CrossRef]
  55. Zheng, Y.; Lin, Y.; Zhao, L.; Wu, T.; Jin, D.; Li, Y. Spatial planning of urban communities via deep reinforcement learning. Nat. Comput. Sci. 2023, 3, 748–762. [Google Scholar] [CrossRef] [PubMed]
  56. Shen, J.; Zheng, F.; Ma, Y.; Deng, W.; Zhang, Z. Urban travel carbon emission mitigation approach using deep reinforcement learning. Sci. Rep. 2024, 14, 27778. [Google Scholar] [CrossRef] [PubMed]
  57. Di Napoli, C.; Paragliola, G.; Ribino, P.; Serino, L. Deep-Reinforcement-Learning-Based Planner for City Tours for Cruise Passengers. Algorithms 2023, 16, 362. [Google Scholar] [CrossRef]
  58. Nguyen, T.T.; Nguyen, C.M.; Huynh-The, T.; Pham, Q.-V.; Nguyen, Q.V.H.; Razzak, I.; Reddi, V.J. Solving Complex Sequential Decision-Making Problems by Deep Reinforcement Learning with Heuristic Rules. In Computational Science—ICCS 2023; Mikyška, J., De Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M.A., Eds.; Springer Nature: Cham, Switzerland, 2023; Volume 14074, pp. 298–305. [Google Scholar] [CrossRef]
  59. Janner, M. Deep Generative Models for Decision-Making and Control (Version 2). arXiv 2023, arXiv:2306.08810. [Google Scholar] [CrossRef]
  60. Liang, X.; Li, X. Carbon emission causal discovery and multi-step forecasting for global cities. Cities 2024, 148, 104881. [Google Scholar] [CrossRef]
  61. Zhu, E.; Deng, J.; Zhou, M.; Gan, M.; Jiang, R.; Wang, K.; Shahtahmassebi, A. Carbon emissions induced by land-use and land-cover change from 1970 to 2010 in Zhejiang, China. Sci. Total Environ. 2019, 646, 930–939. [Google Scholar] [CrossRef]
  62. Ali, G.; Nitivattananon, V. Exercising multidisciplinary approach to assess interrelationship between energy use, carbon emission and land use change in a metropolitan city of Pakistan. Renew. Sustain. Energy Rev. 2012, 16, 775–786. [Google Scholar] [CrossRef]
  63. Zhi, D.; Zhao, H.; Chen, Y.; Song, W.; Song, D.; Yang, Y. Quantifying the heterogeneous impacts of the urban built environment on traffic carbon emissions: New insights from machine learning techniques. Urban Clim. 2024, 53, 101765. [Google Scholar] [CrossRef]
  64. Ren, Y.; Yang, J.; Shen, Y.; Wang, L.; Zhang, Z.; Zhao, Z. Multidimensional effects of history, neighborhood, and proximity on urban land growth: A dynamic spatiotemporal rolling prediction model (STRM). Trans. GIS 2024, 28, 1928–1956. [Google Scholar] [CrossRef]
  65. Li, Y. Reinforcement Learning in Practice: Opportunities and Challenges. arXiv 2022, arXiv:2202.11296. [Google Scholar] [CrossRef]
  66. Yang, T.; Cao, Y.; Sartoretti, G. Intent-based Deep Reinforcement Learning for Multi-agent Informative Path Planning. arXiv 2023, arXiv:2303.05351. [Google Scholar] [CrossRef]
  67. Zhu, W.; Li, Y. Reinforcement learning-based genetic algorithm for solving low carbon multimodal transportation path planning problem. In Proceedings of the International Conference on Smart Transportation and City Engineering (STCE 2023), Chongqing, China, 16–18 December 2023; Mikusova, M., Ed.; SPIE: Bellingham, WA, USA, 2024; p. 77. [Google Scholar] [CrossRef]
  68. Liu, H.; Yan, F.; Tian, H. A Vector Map of Carbon Emission Based on Point-Line-Area Carbon Emission Classified Allocation Method. Sustainability 2020, 12, 10058. [Google Scholar] [CrossRef]
  69. Guo, Y.; Zheng, X.; Wei, W.; He, Y.; Peng, X.; Zhao, F.; Wu, H.; Bi, W.; Yan, H.; Ren, X. Construction of Multi-Sample Public Building Carbon Emission Database Model Based on Energy Activity Data. Energies 2025, 18, 3635. [Google Scholar] [CrossRef]
  70. Shen, C.; Zhao, K.; Ge, J.; Zhou, Q. Analysis of Building Energy Consumption in a Hospital in the Hot Summer and Cold Winter Area. Energy Procedia 2019, 158, 3735–3740. [Google Scholar] [CrossRef]
  71. Jing, L.; Wang, J. A Study on the Characteristics of Energy Consumption Index of Office Buildings Hot-summer and Cold-winter Zone. Refrig. Air Cond. 2022, 36, 263–268. [Google Scholar]
  72. Wang, Y.; Zhao, P. Survey Research on Residential Building Energy Consumption in Urban and Rural Area of China. Acta Sci. Nat. Univ. Pekin. 2018, 1, 162–170. [Google Scholar]
  73. GB/T 51161-2016; Standard for Energy Consumption of Building. China Architecture & Building Press: Beijing, China, 2016.
  74. Zhang, X.; Huang, H.; Tu, K.; Li, R.; Zhang, X.; Wang, P.; Li, Y.; Yang, Q.; Acerman, A.C.; Guo, N.; et al. Effects of plant community structural characteristics on carbon sequestration in urban green spaces. Sci. Rep. 2024, 14, 7382. [Google Scholar] [CrossRef] [PubMed]
  75. Ewing, R.; Cervero, R. Travel and the Built Environment: A Meta-Analysis. J. Am. Plan. Assoc. 2010, 76, 265–294. [Google Scholar] [CrossRef]
  76. Lu, J. The Influencing Mechanism of Urban Travel Carbon Emissions from the Perspective of Built Environment: The Case of Guangzhou, China. Atmosphere 2023, 14, 547. [Google Scholar] [CrossRef]
  77. Tayarani, M.; Poorfakhraei, A.; Nadafianshahamabadi, R.; Rowangould, G. Can regional transportation and land-use planning achieve deep reductions in GHG emissions from vehicles? Transp. Res. Part D Transp. Environ. 2018, 63, 222–235. [Google Scholar] [CrossRef]
  78. Rawlings, J.O.; Pantula, S.G.; Dickey, D.A. (Eds.) Applied Regression Analysis; Springer: Berlin/Heidelberg, Germany, 1998. [Google Scholar] [CrossRef]
  79. Reeves, C.R.; Rowe, J.E. Genetic Algorithms—Principles and Perspectives: A Guide to GA Theory; Springer: New York, NY, USA, 2002; Volume 20. [Google Scholar] [CrossRef]
  80. Liu, X.; Wang, S.; Wu, P.; Feng, K.; Hubacek, K.; Li, X.; Sun, L. Impacts of Urban Expansion on Terrestrial Carbon Storage in China. Environ. Sci. Technol. 2019, 53, 6834–6844. [Google Scholar] [CrossRef] [PubMed]
  81. Kolat, M.; Kővári, B.; Bécsi, T.; Aradi, S. Multi-Agent Reinforcement Learning for Traffic Signal Control: A Cooperative Approach. Sustainability 2023, 15, 3479. [Google Scholar] [CrossRef]
  82. Semeraro, T.; Nicola, Z.; Lara, A.; Sergi Cucinelli, F.; Aretano, R. A Bottom-Up and Top-Down Participatory Approach to Planning and Designing Local Urban Development: Evidence from an Urban University Center. Land 2020, 9, 98. [Google Scholar] [CrossRef]
Figure 1. The flowchart of the study (source: Authors’ elaboration).
Figure 2. The location of case study area (source: Authors’ elaboration).
Figure 3. Area proportion of each land-use in Hangzhou central area (source: Authors’ elaboration).
Figure 4. Statistics of eight land-use POIs for each cell in Hangzhou (source: Authors’ elaboration).
Figure 5. Heatmaps for different types of POIs distribution (source: Authors’ elaboration).
Figure 6. The actual and processed road matrix and subway connection matrix (source: Authors’ elaboration).
Figure 7. Process of preparing the dataset (source: Authors’ elaboration).
Figure 8. Process of transferring input data to both the Critic and Actor input layers (source: Authors’ elaboration).
Figure 9. Reward calculation process based on base emissions and path emissions (source: Authors’ elaboration).
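The paper's exact reward formulation is defined in its methods section; purely as an illustration of the idea in Figure 9, a minimal sketch, assuming the reward is the fractional reduction in total (base plus path) emissions after a land-use action, could look like this (all names and values here are hypothetical, not the authors' implementation):

```python
def reward(prev_base, prev_path, new_base, new_path):
    """Hypothetical reward: fractional reduction in total emissions
    (base + path) after one land-use action; positive when emissions fall."""
    prev_total = prev_base + prev_path
    new_total = new_base + new_path
    return (prev_total - new_total) / prev_total

# e.g. an action that cuts path emissions from 40 to 30 while
# base emissions stay at 60: (100 - 90) / 100
print(reward(60, 40, 60, 30))  # 0.1
```

A normalized form like this keeps rewards in a comparable range across cells of very different emission magnitudes, which tends to stabilize PPO training.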
Figure 10. Trends in rewards, policy loss and value function loss during training episodes (source: Authors’ elaboration).
Figure 11. Spatial distribution of DRL-operated grid cells and percentage changes in land-use composition (source: Authors’ elaboration).
Figure 12. Changes in the Gini coefficient for the eight land-use functions (source: Authors’ elaboration).
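Figure 12 tracks the Gini coefficient of each land-use function across grid cells as a measure of spatial concentration. The standard mean-absolute-difference form of the coefficient, which is what such concentration measures typically use, can be sketched as:

```python
def gini(values):
    """Gini coefficient via mean absolute difference:
    0 = counts spread evenly across cells, -> 1 = fully concentrated."""
    n = len(values)
    mean = sum(values) / n
    if mean == 0:
        return 0.0  # no POIs of this function anywhere
    mad = sum(abs(x - y) for x in values for y in values) / (n * n)
    return mad / (2 * mean)

# Evenly spread land-use counts vs. fully concentrated ones
print(gini([5, 5, 5, 5]))   # 0.0
print(gini([0, 0, 0, 20]))  # 0.75
```

A rising Gini value for residential or commercial POIs is consistent with the clustering behavior the agent exhibits in Figure 11.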
Figure 13. Comparative performance of three optimization approaches during validation phase. PPO (blue) achieves 15.2% average emission reduction with stable convergence, GA (orange) achieves 6.8% with high variability, and LR (green) achieves only 3.2% due to linear model limitations. Error bands represent standard deviation across 50 test episodes. PPO demonstrates 2.2× better performance than GA and 4.8× better than LR (p < 0.001) (source: Authors’ elaboration).
Table 2. Summary of Point of Interest (POI) data collected for the central built-up area of Hangzhou, showing quantities and data sources for eight functional categories (source: Authors’ elaboration).

| Land-Use Category | Subcategory | POI Count | Data Source | Notes |
|---|---|---|---|---|
| Residential | Communities | 6758 | Lianjia.com | Rental housing data |
| Companies/Enterprises | All commercial entities | 12,229 | Amap.com | Gaode Map platform |
| Factory/Industrial Zone | Manufacturing facilities | 245 | Amap.com | Gaode Map platform |
| Science and Education | Schools, universities | 7460 | Amap.com | Gaode Map platform |
| Daily Leisure—Dining | Restaurants, cafes | 26,980 | Amap.com | Gaode Map platform |
| Daily Leisure—Shopping | Retail, malls | 13,414 | Amap.com | Gaode Map platform |
| Scenic Spots | Parks, attractions | 2755 | Amap.com | Gaode Map platform |
| Transportation—Train | Railway stations | 146 | Amap.com | Gaode Map platform |
| Transportation—Airport | Airports, terminals | 30 | Amap.com | Gaode Map platform |
| Medical Care | Hospitals, clinics | 6846 | Amap.com | Gaode Map platform |
| TOTAL | | 76,863 | Multiple | Central built-up area |
Table 3. Feature states of the cells before and after being influenced by the agent’s actions in one episode (source: Authors’ elaboration).

| Act. No. | X | Y | COND. | Residential | Company | Sci. & Edu. | Dining | Shopping | Scenic Spots | Factory | Medical | Rwd. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 14 | 17 | BEF. | 29 | 39 | 31 | 106 | 118 | 4 | 0 | 14 | 0.32 |
| | | | AFT. | 35 | 43 | 32 | 118 | 147 | 3 | 0 | 15 | |
| 2 | 14 | 17 | BEF. | 35 | 43 | 32 | 118 | 147 | 3 | 0 | 15 | 0.31 |
| | | | AFT. | 43 | 49 | 35 | 135 | 152 | 2 | 0 | 16 | |
| 3 | 23 | 39 | BEF. | 1 | 7 | 2 | 12 | 22 | 1 | 1 | 3 | 0.18 |
| | | | AFT. | 0 | 6 | 2 | 11 | 27 | 0 | 0 | 2 | |
| 4 | 14 | 17 | BEF. | 43 | 49 | 35 | 135 | 152 | 2 | 0 | 16 | 0.40 |
| | | | AFT. | 53 | 54 | 37 | 156 | 172 | 1 | 0 | 18 | |
| 5 | 14 | 23 | BEF. | 24 | 12 | 10 | 11 | 5 | 18 | 0 | 16 | 0.19 |
| | | | AFT. | 30 | 9 | 8 | 13 | 6 | 16 | 0 | 19 | |
| 6 | 14 | 17 | BEF. | 53 | 54 | 37 | 156 | 172 | 1 | 0 | 18 | 0.37 |
| | | | AFT. | 58 | 60 | 42 | 178 | 208 | 0 | 0 | 21 | |
| 7 | 14 | 17 | BEF. | 58 | 60 | 42 | 178 | 208 | 0 | 0 | 21 | 0.23 |
| | | | AFT. | 58 | 68 | 42 | 212 | 235 | 0 | 0 | 25 | |
| 8 | 24 | 12 | BEF. | 5 | 8 | 4 | 2 | 9 | 18 | 1 | 14 | 0.01 |
| | | | AFT. | 4 | 9 | 4 | 2 | 9 | 19 | 0 | 13 | |
| 9 | 14 | 17 | BEF. | 58 | 68 | 42 | 212 | 235 | 0 | 0 | 25 | 0.05 |
| | | | AFT. | 58 | 78 | 45 | 212 | 236 | 0 | 0 | 28 | |
| 10 | 14 | 17 | BEF. | 58 | 78 | 45 | 212 | 236 | 0 | 0 | 28 | 0.00 |
| | | | AFT. | 58 | 78 | 46 | 212 | 236 | 0 | 0 | 28 | |
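Each row of Table 3 is an eight-dimensional POI count vector for one grid cell before and after an action. As a small illustration of how such a transition might be represented and inspected (the `CellState` class and field names are hypothetical, not the authors' code; the numbers are taken from Action 1 in the table):

```python
from dataclasses import dataclass

# Hypothetical container for one grid-cell state; field order follows
# the eight land-use columns of Table 3.
@dataclass
class CellState:
    residential: int
    company: int
    sci_edu: int
    dining: int
    shopping: int
    scenic: int
    factory: int
    medical: int

    def total(self) -> int:
        """Total POI count across all eight land-use functions."""
        return (self.residential + self.company + self.sci_edu + self.dining
                + self.shopping + self.scenic + self.factory + self.medical)

# Action 1 at cell (14, 17): before and after values from Table 3.
before = CellState(29, 39, 31, 106, 118, 4, 0, 14)
after = CellState(35, 43, 32, 118, 147, 3, 0, 15)

delta = after.total() - before.total()
print(delta)  # 52: net POI gain in the cell targeted for densification
```

The positive delta at the repeatedly selected cell (14, 17) reflects the concentration strategy discussed in the results: the agent keeps adding residential, dining, and commercial functions to the same high-density cluster.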
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Shen, J.; Zheng, F.; Chen, T.; Deng, W.; Bellotti, A.; Tesema, F.B.; Lucchi, E. Optimizing Urban Land-Use Through Deep Reinforcement Learning: A Case Study in Hangzhou for Reducing Carbon Emissions. Land 2025, 14, 2368. https://doi.org/10.3390/land14122368

