Analysis of Urban Congestion Traceability: The Role of the Built Environment

Abstract: Analyzing the factors influencing traffic congestion is essential for urban planning and coordinated development. Previous research frequently focuses on the internal aspects of traffic systems, often overlooking the impact of external factors on congestion sources. This study therefore uses a geospatial dataset and mobile signaling data, first applying the Fuzzy C-Means (FCM) algorithm to identify congested roads at different road levels and to trace the origins of the travelers on those congested roads. It then employs the LightGBM method to study how the built environment of the various congestion sources influences network-level congestion. The findings are as follows: (1) There is a positive correlation between traffic congestion and geographical location, with congestion predominantly caused by a few specific plots and concentrated in city centers. (2) Residential population density is the most critical factor, accounting for over 12% of the congestion contribution, followed by road density and working population density. (3) Both residential and working population densities show a non-linear positive correlation with congestion contribution, while the land-use mix displays a non-linear V-shaped influence. Additionally, a residential population density between 8000 and 11,000 notably exacerbates congestion contribution. By emphasizing land use considerations in traffic system analysis, these findings illuminate the linkages between urban planning and traffic congestion and support a more comprehensive approach to urban development strategies.


Introduction
As urban populations and vehicle ownership increase, traffic congestion has emerged as a global challenge. European Union statistics indicate that congestion results in economic losses of nearly 100 billion euros annually. In China, data from the Ministry of Transport suggest that congestion directly causes economic losses amounting to 5-8% of GDP, up to 250 billion yuan [1]. Additionally, congestion contributes to environmental pollution and energy waste [2]. A study from the 1990s in London revealed that 74% of atmospheric nitrogen oxides originated from vehicle exhaust, with low-speed driving and frequent stops and starts exacerbating energy consumption, emissions, and noise pollution [3].
Given the pivotal role of efficient transportation systems in fostering sustainable urban development [4,5], various solutions have been implemented to alleviate congestion, such as expanding subway lines and boosting public transport capacity. However, these solutions often face challenges like long implementation periods, high construction costs, and lack of flexibility, making them unsuitable for the rapidly evolving urban traffic landscape. Other measures, like congestion pricing and traffic restrictions, have also been adopted to ease congestion [6]. Regardless of the approach, understanding the mechanisms and origins of traffic congestion is essential.
Land 2024, 13, 255
On the supply side, road infrastructure and intersection signal strategies are analyzed to identify causes of network congestion, with congestion propagation analysis revealing the sources or critical nodes [15,16]. Research on internal congestion detection is thus closely tied to the analysis of congestion propagation and evolution within traffic networks.
However, this focus on internal factors such as traffic flow and road infrastructure may not fully explain the increasing trend of urban traffic congestion. A more comprehensive understanding requires considering factors external to the transportation system, such as population density, socio-economic factors, and land use, which may influence urban residents' travel demand and, consequently, the overall traffic situation.
This paper approaches the root causes of congestion from a perspective external to the traffic system. First, it identifies congested roads at each road level using the Fuzzy C-Means algorithm and traces their geographical origins. Since congestion on each road is a coupled traffic state formed by many travel trajectories, the number of travelers from each congestion source, i.e., the congestion contribution, is then obtained. Finally, the impact of factors external to the traffic system on network-level congestion is explored using the LightGBM model. This research provides decision-making support for alleviating traffic congestion.

Literature Review

Congestion Recognition
Traffic congestion is a phenomenon that arises when the road network cannot accommodate the current volume of traffic. Scholars have extensively studied indicators for assessing the degree of road congestion, such as vehicle travel speed, travel delay, and journey time [17-20]. Research has also used other, improved indicators for determining congestion. Zhu Xinglin et al. [21] introduced fuzzy theory into the division of traffic operation conditions based on speed thresholds. Yuan [22] proposed a Space-Time Congestion Index (SI) on the basis of traditional evaluation indicators. Zhang [23] developed a congestion probability discrimination indicator that can be applied to probabilistic forecasting results.
In research on methods for identifying traffic congestion, most scholars take a micro perspective. Zhao et al. [24] devised a model based on an enhanced clustering algorithm to predict lane congestion. Yang [25] created an internal grid congestion model, and Kong et al. [26] developed a model based on floating car data to identify congested roads. Liu et al. [27] proposed a real-time congestion detection algorithm for urban intersections. Macro-level studies of road network congestion are less common. Zhang's study of regional correlation and congestion area identification methods [23] and Zeng's analysis of urban traffic flow state evolution [28] offer important macro perspectives.
Unlike previous micro-focused research, this study takes a macro view, employing the Fuzzy C-Means (FCM) model to discern network-level congestion. It also considers road hierarchy in congestion categorization for more precise congestion analysis.

Analysis of Factors Influencing Traffic Congestion
To deepen the understanding of the built environment's impact on congestion, many scholars have employed models based on linear assumptions to study their relationship. Wang et al. [29] used a multivariate linear regression model to analyze the impact of land use layout on traffic congestion in Zhengzhou, highlighting significant effects of residential, office, and commercial land densities, road length, and network density on congestion. Zhang et al. [30] established a multivariate linear regression model between congestion duration and different land uses, concluding that land use proportions and congestion times are significantly correlated. Yang and Debbage [31] conducted a quantitative analysis of the relationship between urban development patterns and traffic congestion in 2011 US city regions from 1998 to 2001, finding a close link between increased urban land use intensity and worsening congestion. Sun and Lu [32] assessed the contributions of socio-economic factors and governance policies to congestion changes, finding that socio-economic factors contributed 25% to congestion increases and 66% to decreases. Additionally, Bao et al. [33] studied the temporal heterogeneity of land use factors in urban traffic congestion, discovering negative impacts of dining land use during peak times. In Toronto, Rothman et al. [34] found significant correlations between congestion in school vicinities and double-parking, reversing, and vehicle parking. Moreover, studies in smaller Chinese cities have identified a significant positive correlation between educational land use and increased traffic pressure [15,35].
However, linear models have limitations in addressing the complexity of the real world, especially their neglect of non-linearities. Some researchers have therefore turned to non-linear methods to analyze the relationship between congestion behaviors and changes in external factors. Wang et al. [36] explored the correlation between built environment factors and urban congestion patterns using the XGBoost algorithm, identifying residential land use and population density as having the most significant impact on urban congestion. Moreover, Liu and Xiao [37] used the random forest method to study the impact of built environment characteristics on commuting time, and Li et al. [38] investigated the non-linear effects of subway commuting and non-commuting flows using Gradient Boosting Regression Trees (GBRT). These studies indirectly indicate the potential impact of environmental characteristics on traffic congestion.
Previous research has extensively explored the relationship between the built environment and traffic congestion, but it focuses primarily on the direct impact of the built environment on congestion in the surrounding area and falls short in tracing and analyzing the geographic sources of congestion. In contrast, this study takes a more macro-level view, analyzing the characteristics of congestion sources and their traceable relationships with network traffic congestion.

Study Area
Xi'an, with a total area of 10,108 square kilometers and a permanent population of 12.9959 million, is a central city in the northwest region of China. The number of vehicles in Xi'an increased from 180,000 in 1997 to 4.02 million by 2021 due to the rapid expansion of urban roads and a substantial influx of population. This study focuses on the area within Xi'an's 3rd Ring Road, a prominent urban center characterized by dense population and commercial activity and by serious congestion. To analyze the congestion in this area, we divided the study area into 680 separate plots based on the main road network and land properties. After a selection process that excluded smaller units, 520 effective plots were retained as the subjects of this study, as illustrated in Figure 1.

Data Introduction
In this study, land use data and mobile signaling data were used from the perspectives of traffic planning and congestion mitigation, offering new insights into individual travel trajectories and land distribution under different activities. The integration and analysis of these datasets primarily aimed to investigate the root causes of congestion from factors external to the transportation system. Specifically, mobile signaling data can identify travel trajectories, congested roads, travel OD (Origin-Destination) volumes, and residential population numbers, while POI (Point of Interest) data can pinpoint the spatial distribution of different land uses. By analyzing these multi-source datasets, this research reveals potential causes of congestion, identifies the main external factors contributing to traffic congestion, and proposes targeted strategies to alleviate congestion in urban areas.

(1) The land use data primarily reflect the land development status within the study area and include POI and road network data. POI data, such as businesses, restaurants, and financial institutions, are sourced from the Gaode development platform, and road network data are obtained from OpenStreetMap.
(2) Mobile signaling travel data are provided by the "Smart Footprint" company, with the original data coming from China Unicom. To ensure data security, the platform sets the user ID as a prohibited field (not accessible by users), allowing only aggregated data to be exported. The data fields are described in Table 1: time records the time of entry into the road section, route_id identifies the road section, rn_seq gives the position in the trajectory sequence, is_start indicates whether the record is the start of the trip, is_end indicates whether it is the end of the trip, and trip_id gives the sequence number of the trip within a day for a user.

Built Environment Variables
In this paper, 12 variables were selected covering land use, transportation, and socio-economic aspects. The land use variables were shopping center density, community services density, recreational density, catering density, financial institutions density, company enterprise density, educational services density, and land-use mix. The transportation-related variables were road density and transit station density, and the socio-economic variables were residential population density and working population density. For each variable we calculated the mean and standard deviation (Std) and computed the Variance Inflation Factor (VIF). All variables had VIF values below 5, suggesting the absence of multicollinearity issues. The descriptive statistics of the variables are shown in Table 2.
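The multicollinearity check described above can be illustrated with a minimal sketch. It relies on the standard identity that the VIF of variable j equals the j-th diagonal entry of the inverse of the predictors' correlation matrix. The function names and the toy columns are hypothetical and are not taken from the study's data or code.

```python
from statistics import mean, pstdev


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equally long columns."""
    n = len(columns[0])
    p = len(columns)
    means = [mean(c) for c in columns]
    stds = [pstdev(c) for c in columns]
    corr = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            cov = sum((columns[i][k] - means[i]) * (columns[j][k] - means[j])
                      for k in range(n)) / n
            corr[i][j] = cov / (stds[i] * stds[j])
    return corr


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    p = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(p)]
           for i, row in enumerate(matrix)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular correlation matrix (perfect collinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * pv for v, pv in zip(aug[r], aug[col])]
    return [row[p:] for row in aug]


def vif(columns):
    """VIF of each column: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Toy data (hypothetical): two uncorrelated columns give VIF values of 1.
print(vif([[1.0, 2.0, 3.0, 4.0], [1.0, -1.0, -1.0, 1.0]]))
```

A VIF near 1 means a variable is almost uncorrelated with the others, while strongly collinear columns push the value well above the threshold of 5 used here.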

Research Methodology
Figure 2 presents an analytical framework for investigating the distribution characteristics and influencing factors of congestion sources. Initially, mobile phone signaling data were mapped onto the road network to construct travel trajectory matrices. Subsequently, the data underwent time-slice processing to calculate the undirected weight values of roads in each time slice. The Fuzzy C-Means (FCM) clustering method was then applied to identify congested sections on expressways, main roads, and secondary roads. Following this, the origins of congested sections were traced to analyze the distribution characteristics of different source plots. Lastly, the LightGBM method was employed to study the impact of various factors on congestion contribution, revealing the non-linear relationships between these factors and congestion contribution.

Construction of Network Weights
In this study, we used the undirected weight value $\mu_{ij}$ to simplify the road network structure. For bidirectional traffic segments, if $K_{ij} < \frac{2}{3}$, the traffic distribution is assumed to be balanced in both directions, and the segment's undirected weight value is the average of the bidirectional weights. Conversely, if $K_{ij} \geq \frac{2}{3}$, the traffic distribution is assumed to be significantly imbalanced, and the undirected weight value of the segment is taken as the weight value of the direction with the heavier traffic load. For unidirectional traffic segments, the undirected weight value is the sole directional weight value. This is expressed in Equation (1).
$$\mu_{ij} = \begin{cases} \dfrac{w_{ij} + w_{ji}}{2}, & \text{bidirectional, } K_{ij} < \frac{2}{3} \\ \max(w_{ij}, w_{ji}), & \text{bidirectional, } K_{ij} \geq \frac{2}{3} \\ w_{ij}, & \text{unidirectional} \end{cases} \tag{1}$$
In the formula, $w_{ij}$ ($w_{ji}$) represents the weight value of the edge from node $i$ to node $j$ (or from node $j$ to node $i$), $a_{ij}$ indicates whether the network is connected from node $i$ to node $j$, and $K_{ij}$ denotes the directional distribution coefficient.
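The weighting rule above can be sketched as a small function. The text does not spell out how the directional distribution coefficient is computed, so the sketch assumes that $K_{ij}$ is the share of the heavier direction in the two-way total; the function name and the use of `None` to mark a missing reverse edge are likewise illustrative choices.

```python
def undirected_weight(w_ij, w_ji=None, threshold=2 / 3):
    """Collapse the directional weights of one segment into an undirected weight.

    w_ji=None marks a unidirectional segment (no edge from j to i).
    """
    if w_ji is None:
        return w_ij                      # sole directional weight
    total = w_ij + w_ji
    if total == 0:
        return 0.0                       # no traffic in either direction
    # ASSUMPTION: K_ij is the heavier direction's share of the two-way total.
    k_ij = max(w_ij, w_ji) / total
    if k_ij < threshold:
        return (w_ij + w_ji) / 2         # balanced: average both directions
    return max(w_ij, w_ji)               # imbalanced: keep the heavier direction


print(undirected_weight(60, 40))   # balanced segment
print(undirected_weight(90, 10))   # imbalanced segment
print(undirected_weight(30))       # unidirectional segment
```

Under this assumption a 60/40 split counts as balanced and is averaged, whereas a 90/10 split keeps the heavier direction's weight.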

Fuzzy C-Means
The Fuzzy C-Means (FCM) algorithm is a soft clustering method that establishes the membership degree of each sample in every cluster by optimizing an objective function, and classifies the samples according to those membership degrees. Given the dataset $X = \{x_1, x_2, \cdots, x_n\}$, let the number of clusters be $m$ and let $K = \{k_1, k_2, \cdots, k_m\}$ denote the cluster centers. $A = [u_{ij}]_{n \times m}$ is the fuzzy classification matrix, where $u_{ij}$ denotes the membership degree of sample $x_i$ in the cluster centered at $k_j$. The FCM algorithm is essentially an iterative process that minimizes the objective function by repeatedly updating the cluster centers $K$ and the membership matrix $A$. The objective function is shown in Equation (2).
$$J(A, K) = \sum_{i=1}^{n} \sum_{j=1}^{m} u_{ij}^{t} \lVert x_i - k_j \rVert^{2} \tag{2}$$
In this equation, $t$ represents the weighted exponent, which is commonly set to $t = 2$. The algorithm steps are as follows: (1) Determine the number of cluster centers $m$, set the iteration count $l = 0$, and initialize the classification matrix $A = [u_{ij}]_{n \times m}$.
Choose an appropriate norm and a threshold $\varepsilon > 0$. If the condition $\lVert A^{(l+1)} - A^{(l)} \rVert < \varepsilon$ is satisfied, terminate the operation; otherwise, let $l = l + 1$ and repeat steps (3) and (4) until the condition is satisfied.
The classification coefficient is commonly used to evaluate the effectiveness of clustering algorithms. For a given number of cluster centers $m$ and a classification matrix $A$, the classification coefficient is defined by Equation (5).
$$F = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} u_{ij}^{2} \tag{5}$$
The classification coefficient indicates the fuzziness of the clustering result; the closer $F$ is to 1, the crisper the partition and the better the clustering effect.
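The iteration described above can be sketched for one-dimensional data such as per-segment speed ratios. This is a minimal illustration using the standard FCM update rules, not the implementation used in the study: the random initialization, the maximum-absolute-change stopping norm, the function names, and the toy values are all assumptions.

```python
import random


def fuzzy_c_means(data, m, t=2.0, eps=1e-6, max_iter=300, seed=0):
    """1-D Fuzzy C-Means. Returns (centers, membership matrix as n x m lists)."""
    rng = random.Random(seed)
    n = len(data)
    # Random initial memberships, normalised so that each row sums to 1.
    A = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(m)]
        total = sum(row)
        A.append([u / total for u in row])
    centers = [0.0] * m
    for _ in range(max_iter):
        # Centre update: mean of the data weighted by u_ij ** t.
        for j in range(m):
            den = sum(A[i][j] ** t for i in range(n))
            centers[j] = sum(A[i][j] ** t * data[i] for i in range(n)) / den
        # Membership update from the relative distances to every centre.
        new_A = []
        for x in data:
            d = [abs(x - c) for c in centers]
            if any(v == 0.0 for v in d):
                hits = [v == 0.0 for v in d]      # sample sits on a centre
                row = [1.0 / sum(hits) if h else 0.0 for h in hits]
            else:
                row = [1.0 / sum((d[j] / d[k]) ** (2.0 / (t - 1.0))
                                 for k in range(m)) for j in range(m)]
            new_A.append(row)
        # ASSUMPTION: the stopping norm is the largest absolute change in A.
        change = max(abs(new_A[i][j] - A[i][j])
                     for i in range(n) for j in range(m))
        A = new_A
        if change < eps:
            break
    return centers, A


def classification_coefficient(A):
    """Mean of the squared memberships per sample; 1 means a crisp partition."""
    return sum(u * u for row in A for u in row) / len(A)


# Toy speed ratios (hypothetical): three congested and three free-flowing segments.
centers, A = fuzzy_c_means([0.20, 0.22, 0.25, 0.90, 0.95, 1.00], m=2)
print(sorted(centers), classification_coefficient(A))
```

In the study the same idea would be run separately per road type with four clusters; each segment is then assigned to the level with the largest membership degree.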

Traceability of Congested Segments
In this study, we define $P$ as the set of all plots within the research area. For any two plots $a$ and $b$, $(a, b)$ denotes the set of travel ODs (Origin-Destination) from plot $a$ to plot $b$, and $crowd$ denotes the set of congested roads in the area. On this basis, $q_{ab}$ is the number of travelers from plot $a$ to plot $b$, and $r_{ab}$ is the set of paths taken by these travelers. Specifically, $r_{ab}^{l}$ ($r_{ab}^{l} \in r_{ab}$, $l \in crowd$) denotes the set of travel trajectories passing through a specific congested segment $l$. The number of travelers passing through congested segment $l$ from plot $a$ to plot $b$ is $Q_{abl}$. The contribution of plot $a$ to congested segment $l$ during the selected time period is expressed in Equation (6).
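The tracing step can be sketched as a counting pass over trajectories. Since Equation (6) is not reproduced here, the sketch assumes that the contribution of plot $a$ to segment $l$ is the sum of $Q_{abl}$ over all destination plots $b$; the trip tuple layout and the function name are hypothetical.

```python
from collections import defaultdict


def congestion_contribution(trips, congested):
    """Count travellers per (origin plot, congested segment).

    trips: iterable of (origin_plot, destination_plot, segment_ids) tuples.
    congested: set of congested segment ids (the set `crowd` in the text).
    """
    q_abl = defaultdict(int)   # (a, b, l) -> travellers from a to b through l
    for origin, destination, segments in trips:
        # A trip is counted once per congested segment it passes through.
        for segment in set(segments) & congested:
            q_abl[(origin, destination, segment)] += 1
    # ASSUMPTION: contribution of plot a to segment l = sum of Q_abl over all b.
    contribution = defaultdict(int)
    for (origin, _destination, segment), count in q_abl.items():
        contribution[(origin, segment)] += count
    return dict(contribution)


# Toy trajectories (hypothetical plot and segment ids).
trips = [("A", "B", [1, 2, 3]), ("A", "C", [2, 4]),
         ("B", "C", [4, 5]), ("A", "B", [6])]
print(congestion_contribution(trips, congested={2, 4}))
```

Summing the result over all congested segments for one origin gives that plot's total contribution, which is the quantity later modeled against the built-environment variables.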

LightGBM Model
LightGBM is an efficient gradient-boosted decision tree algorithm with significant advantages in speed and memory usage over traditional GBDT implementations. These advantages are primarily attributed to two techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). GOSS reduces the computational load by retaining the samples with larger gradients while randomly sampling those with smaller gradients. EFB exploits the sparsity of high-dimensional data by bundling mutually exclusive features, reducing the number of features while preserving information. Additionally, LightGBM employs a histogram-based strategy for node splitting, which efficiently finds the split offering the maximum information gain. It also uses a leaf-wise growth strategy that always splits the leaf with the maximum splitting gain, combined with a depth limit that curbs overfitting. The algorithm handles large-scale datasets well and suits machine learning tasks that demand both efficiency and accuracy. The structure and main experimental procedure of the LightGBM model are depicted in Figure 3.
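The GOSS idea can be illustrated in a few lines. The sketch below is a simplified stand-alone version of the sampling step, not LightGBM's internal code: it keeps the fraction of samples with the largest absolute gradients, draws a random fraction of the remainder, and up-weights the drawn samples by (1 - top_rate) / other_rate so that their gradient sum stays approximately unbiased. The function name and default rates are illustrative.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) selected by Gradient-based One-Side Sampling."""
    rng = random.Random(seed)
    n = len(gradients)
    # Rank samples from largest to smallest absolute gradient.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top = order[:n_top]                             # always kept
    sampled = rng.sample(order[n_top:], n_other)    # random draw from the rest
    # Compensate for under-sampling the small-gradient samples.
    amplify = (1.0 - top_rate) / other_rate
    indices = top + sampled
    weights = [1.0] * n_top + [amplify] * n_other
    return indices, weights


gradients = [float(i) for i in range(100)]
indices, weights = goss_sample(gradients)
print(len(indices), weights[0], weights[-1])
```

With these rates only 30% of the samples enter the split search, which is where the computational saving comes from.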

In this study, the dataset was randomly divided into a training dataset and a validation dataset at a ratio of 7:3, with the training dataset used for model fitting. The Grid Search algorithm was then employed to tune several hyperparameters, including lambda_l1, lambda_l2, min_data_in_leaf, num_leaves, and feature_fraction, to identify the optimal parameter combination. Thereafter, the predictive capability of the LightGBM model was assessed on the validation dataset using the coefficient of determination ($R^2$), Root Mean Square Error (RMSE), and Mean Absolute Error (MAE). Finally, the model was interpreted through feature importance and partial dependence plots.
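The 7:3 split and the three evaluation metrics can be written out explicitly. The sketch below is independent of any modeling library; the function names and the seed are illustrative, and the sample count of 520 simply mirrors the number of effective plots reported for the study area.

```python
import math
import random


def train_validation_split(n_samples, train_ratio=0.7, seed=0):
    """Shuffle the sample indices and cut them at the training ratio."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_ratio * n_samples))
    return indices[:cut], indices[cut:]


def regression_metrics(y_true, y_pred):
    """Coefficient of determination, root mean square error, mean absolute error."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


train_idx, valid_idx = train_validation_split(520)
print(len(train_idx), len(valid_idx))
print(regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0]))
```

RMSE and MAE are in the units of the target (here, traveler counts), so they are only comparable across models fitted to the same congestion contribution variable.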

Congestion Identification Results
20 July 2021 was chosen for analysis because heavy rainfall caused significant traffic congestion on that day. The data for the three road types were clustered in Python using the FCM algorithm, and the results are shown in Table 3. The congestion levels were classified into four categories: smooth traffic, mild congestion, moderate congestion, and severe congestion. The table shows that the clustering centers for the three road types are not significantly different, but the thresholds decrease as the road grade lowers. A value of 1 represents the free-flow state, with smaller values indicating increasing congestion.
The spatial distribution and frequency characteristics of congestion status at different times were further analyzed, as shown in Figure 4. Figure 4a shows the spatial autocorrelation (Moran's index) for each time node, with a time granularity of 20 min. The congestion status identified with the FCM method within Xi'an's 3rd Ring Road is spatially autocorrelated, with a positive correlation throughout the day. Specifically, during the morning rush hour, the spatial distribution exhibits the highest degree of clustering and the highest level of congestion, with the peak Moran's index observed at 10:00 a.m. This suggests that targeted congestion management during these peak hours could be highly effective, especially in the clustered areas where congestion is most pronounced. As time progresses, the correlation between congestion intensity and spatial distribution remains stable, and the spatial distribution retains certain clustering characteristics. Figure 4b illustrates the frequency of occurrence of each congestion level over the 24 h of the day. From 0:00 to 6:00 a.m., traffic on the network is light because most residents are resting, and road sections generally remain unimpeded. Starting from 7:00 a.m., with the onset of the morning rush hour, the frequency of unimpeded road sections gradually decreases and the level of congestion progressively increases, peaking around 10:00 a.m. After noon, the congestion status changes slowly until 18:00, when congestion starts to intensify significantly, reaching its peak around 21:00. The sharp increase in congestion in the evening highlights the need for efficient public transit systems and real-time traffic management. The observed pattern of congestion intensification and subsequent easing further underscores the need for dynamic congestion management systems that can adapt to changing traffic conditions throughout the day.
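The spatial autocorrelation reported above is measured with Moran's index. A minimal sketch of the global Moran's I is given below; the weight matrix, the chain of four segments, and the congestion values are hypothetical, and the study's actual spatial weighting scheme is not specified in this section.

```python
def morans_i(values, weights):
    """Global Moran's I for values x_i and a spatial weight matrix w_ij."""
    n = len(values)
    mean_x = sum(values) / n
    dev = [x - mean_x for x in values]
    w_sum = sum(sum(row) for row in weights)
    # Cross-products of deviations for every pair of neighbouring units.
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / w_sum) * cross / sum(d * d for d in dev)


# Four segments in a chain; neighbours share an endpoint (hypothetical layout).
chain = [[0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0]]
print(morans_i([1.0, 1.0, 0.0, 0.0], chain))   # similar values adjacent
print(morans_i([1.0, 0.0, 1.0, 0.0], chain))   # alternating values
```

Positive values indicate that similarly congested segments cluster in space, which is the pattern reported for the whole day, while negative values would indicate a checkerboard-like alternation.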

Distribution of Congestion Sources
Because traffic congestion is most pronounced during the evening rush hour, this period was selected for the source analysis. After identifying the congested roads during the evening peak hours, the study analyzed the distribution of the travelers' origins. As illustrated in Figure 5 (based on 20-min average travel flows), although the origins of travelers on congested roads are widespread, most areas contribute only limited travel flow. In contrast, a small number of sources significantly influence the main traffic flow; these are primarily concentrated within the city's first ring and in the residential areas on the right side between the first and second rings. These findings have practical implications. By identifying the primary sources of congestion, policymakers can tailor their approaches to the specific needs and characteristics of the most impactful areas, thereby improving the efficiency of the transportation network. For example, promoting public transportation or encouraging alternative modes of travel in these areas can help alleviate congestion throughout the entire region.

Parameter Experiments
In this study, a suitable parameter combination for the LightGBM model was identified using the Grid Search algorithm (Table 4). Tuning lambda_l1, lambda_l2, min_data_in_leaf, num_leaves, and feature_fraction helps prevent overfitting and improves the model's predictive accuracy. The robustness of the tuned LightGBM model was further evaluated using five-fold cross-validation, calculating $R^2$, RMSE, and MAE for each test set, with results shown in Table 5. Across the five subsets, the $R^2$ values ranged from 0.55 to 0.71, the RMSE values from 124.69 to 162.99, and the MAE values from 91.57 to 121.15. These results demonstrate that the LightGBM model exhibits good robustness.
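The five-fold procedure can be sketched as an index generator: the samples are shuffled once, dealt into five folds, and each fold serves as the test set exactly once while the other four are used for fitting. The function name and the seed are illustrative, and the model fitting itself is left out.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices); every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


for train_idx, test_idx in k_fold_indices(520):
    # A model would be fitted on train_idx and scored on test_idx here.
    print(len(train_idx), len(test_idx))
```

Reporting the range of each metric across the five test folds, as done above, shows how sensitive the model is to the particular subset of plots it is trained on.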

Table 6 presents the mean relative importance (MRI) of the independent variables for the contribution to network-level congestion. Socio-economic variables have the highest average importance, at 11.69%, indicating the strongest impact on congestion contribution. This is primarily because more economically developed areas typically have higher travel demand, leading to network-level congestion. The average importance of transportation-related variables is slightly lower, at 9.16%, and that of land-use variables is 7.29%, suggesting that land features also influence network-level congestion to a certain extent. At the level of individual variables, residential population density has the highest relative importance, at 12.93%, followed by road density at 12.27%. Working population density ranks third among all built-environment variables, at 10.45%. Additionally, land-use mix, company enterprise density, catering density, and shopping center density also demonstrate significant predictive capability.
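The group-level figures are averages of the per-variable relative importances. As a small check under that reading, the two socio-economic values quoted in the text (12.93% and 10.45%) average to the reported 11.69%. The dictionary keys and function name below are hypothetical labels, and only values stated in the text are used.

```python
def group_mean_importance(importance, groups):
    """Average the per-variable relative importance (%) within each group."""
    return {name: sum(importance[v] for v in members) / len(members)
            for name, members in groups.items()}


# Values reported in the text for the two socio-economic variables.
importance = {
    "residential_population_density": 12.93,
    "working_population_density": 10.45,
}
groups = {
    "socio_economic": ["residential_population_density",
                       "working_population_density"],
}
print(group_mean_importance(importance, groups))
```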

Non-Linear Relationships between Variables and Congestion Contributions
The four most important factors affecting congestion contribution were extracted and analyzed with partial dependence plots (PDP, Figure 6). These factors are residential population density, road density, working population density, and land-use mix. The plots reveal a clear threshold effect of these four built-environment variables on congestion contribution.
Figure 6a illustrates the N-shaped relationship between residential population density and congestion contribution. The congestion contribution increases significantly as residential population density rises from 4000, peaking at 11,000. Beyond this peak, the trend reverses and declines until the density reaches 20,000, indicating that additional population density does not translate linearly into increased congestion. This suggests that certain density thresholds may trigger more efficient use of the available transportation infrastructure, or that there is a saturation point at which the incremental impact on congestion diminishes. The congestion contribution then rises sharply once more as the density increases to 26,000, underscoring the overwhelming effect of very high residential population densities: even with increased public transportation usage, the significant growth in travel demand increases congestion contribution.
Figure 6b shows the relationship between road density and congestion contribution, which follows a sharp downward trend. The decline bottoms out at a road density of 50 and then stabilizes, indicating that higher road densities provide more route options and thus reduce congestion. This inverse relationship suggests that strategic increases in road infrastructure in heavily trafficked areas can alleviate congestion. However, the stabilization of the trend beyond a certain density highlights the diminishing returns of simply adding more roads, pointing to the need for smarter traffic management and infrastructure development strategies that go beyond road expansion.
Figure 6c shows the relationship between working population density and congestion contribution, which exhibits a positive correlation with a distinct threshold effect. The congestion contribution decreases around a working population density of 8000 but increases significantly beyond 12,000, peaking at approximately 16,300. This pattern reflects the critical role of the working population's movement patterns, especially during rush hours, in exacerbating congestion. It underscores the potential benefits of policies that stagger work start times or promote remote work to ease peak congestion pressures. This insight can also guide targeted interventions in urban transport planning, such as enhancing public transit services or developing new mobility solutions tailored to the needs of working individuals.
Figure 6d shows the relationship between land-use mix and congestion contribution, which overall follows a V-shaped pattern. Specifically, the congestion contribution declines sharply, in a near-linear fashion, within the land-use mix range of about 1.6-1.7. Beyond 1.73, the congestion contribution increases significantly, peaking around 1.9, and it rises again beyond a land-use mix of 2. Given this continued increase, caution should be exercised with land-use mixes above 1.8.
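The partial dependence curves discussed above average the model's prediction over the dataset while one variable is forced to each value on a grid. The sketch below shows that computation in plain Python. The function `toy_predict`, its thresholds, and the three sample rows are invented for illustration and are not the fitted LightGBM model or the study's data; `partial_dependence` is a hypothetical helper name.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the original row is untouched
            modified[feature] = value  # overwrite only the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_predict(row):
    """Invented stand-in model with one threshold per variable (assumption)."""
    residential_term = 0.01 * max(row["res_density"] - 8000.0, 0.0)
    road_term = -1.0 * min(row["road_density"], 50.0)
    return 100.0 + residential_term + road_term


# Hypothetical observations (assumption): not taken from the study.
rows = [
    {"res_density": 5000.0, "road_density": 10.0},
    {"res_density": 12000.0, "road_density": 30.0},
    {"res_density": 20000.0, "road_density": 60.0},
]

road_grid = [0.0, 25.0, 50.0, 75.0]
road_curve = partial_dependence(toy_predict, rows, "road_density", road_grid)
for value, average in zip(road_grid, road_curve):
    print(f"road_density={value:5.1f} -> mean prediction {average:.2f}")
```

Because every other variable keeps its observed value, a flat stretch of the curve means the model's average output no longer responds to the variable in that range, which is how the threshold effects in Figure 6 are read.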

Discussion and Conclusions
In summary, this study identified congestion and traced its origins, exploring how well different factors explain network-level congestion contributions. Compared to previous research [25-27], our study distinguished congestion from a macro perspective and considered network structure, offering new insights into the dynamics of traffic congestion.
Our work shows that congestion clustering centers, which represent the degree of traffic congestion, are influenced by road levels, with congestion thresholds decreasing as the road level decreases. Although congestion is ubiquitous across all road types, its severity is more pronounced on lower-level roads. Moreover, a stable positive correlation between congestion status and geographic location was observed, indicating significant spatial clustering. This suggests that congestion is not random but closely linked to specific geographic areas, which emphasizes the importance of location-specific mitigation strategies for urban planners and traffic managers.
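Spatial clustering of this kind is commonly quantified with a spatial autocorrelation coefficient. The text here does not name the statistic used, so the sketch below assumes global Moran's I, one widely used choice; the function name `morans_i`, the chain-shaped neighbor matrix, and the congestion values are all hypothetical.

```python
def morans_i(values, weights):
    """Global Moran's I for `values` given a square spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    # Cross-products of deviations, weighted by spatial proximity.
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    denom = sum(d * d for d in dev)
    return (n / w_sum) * (cross / denom)


# Four road segments in a row; each segment neighbors the next (assumption).
chain_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = [1.0, 1.0, 5.0, 5.0]    # similar values sit next to each other
alternating = [1.0, 5.0, 1.0, 5.0]  # neighbors always differ

print("clustered:", morans_i(clustered, chain_weights))
print("alternating:", morans_i(alternating, chain_weights))
```

A positive value indicates that neighboring locations tend to have similar congestion levels, while a negative value indicates that neighbors tend to differ.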
In addition, we traced the geographic origins of congestion, identifying the locations from which the travelers who participate in network congestion primarily originate. Interestingly, we found that a few plots are responsible for the majority of network congestion, and that these are primarily located in city centers and their surrounding areas. Optimizing traffic flow within these areas is therefore essential for improving overall traffic efficiency. For example, promoting public transportation in these areas, or providing alternative routes that divert traffic away from these hotspots, could help alleviate congestion across the entire region [10,11].
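One simple way to check how concentrated congestion sources are is to rank plots by contribution and count how many are needed to reach a given share of the total. The sketch below is illustrative only: the plot identifiers, contribution values, and the 75% share are invented, and `top_contributors` is a hypothetical helper rather than part of the study's method.

```python
def top_contributors(contributions, share=0.5):
    """Return the smallest top-ranked set of plots reaching `share` of the total."""
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("total contribution must be positive")
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    selected, running = [], 0.0
    for plot_id, value in ranked:
        selected.append(plot_id)
        running += value
        if running >= share * total:
            break
    return selected, running / total


# Hypothetical congestion contributions per plot (assumption).
toy_contributions = {
    "A": 500.0, "B": 300.0, "C": 50.0, "D": 50.0,
    "E": 40.0, "F": 30.0, "G": 30.0,
}

plots, covered = top_contributors(toy_contributions, share=0.75)
print(f"{len(plots)} of {len(toy_contributions)} plots cover {covered:.0%}")
```

When a small fraction of plots covers most of the total, targeted measures in those plots are likely to have a larger effect than measures spread evenly across the region.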
Furthermore, we examined the explanatory power of different factors contributing to network-level congestion. Unlike previous studies that focused on internal factors such as traffic facilities and signal control [15,16], our study reveals the impact of external factors on congestion. The results indicate that residential population density is the most significant factor causing congestion. Road density, working population density, and land-use mix also have considerable impacts. Notably, our study uncovered significant non-linear relationships between some built-environment variables and congestion contribution, in particular the thresholds for residential population density and land-use mix. This highlights the complex interplay between urban planning decisions and congestion outcomes, and it calls for a more comprehensive approach to urban development that considers how population dynamics, infrastructure capacity, and land-use patterns collectively affect congestion. We recommend that efforts to mitigate congestion should not only expand road capacity but also manage demand through housing planning, zoning regulations, and the promotion of mixed-use developments that reduce the need for long commutes.
Finally, this study has some limitations. First, it considers congestion only in terms of the impact of built-environment variables; other aspects and influencing factors need further examination to explore the causes of congestion from a more comprehensive perspective. Second, this study highlights the significant impact of residential and working population density on congestion. Future research could examine in more depth how the travel patterns of different population groups (such as commuters and students) affect urban congestion. Third, this study observes cyclical variations in congestion levels at different times of the day. Future research could explore the spatiotemporal characteristics of congestion over longer periods or during special events (such as heavy rain or large-scale events) to provide a more detailed understanding of congestion dynamics.

Figure 4. Congestion distribution characteristics. (a) Variation of the spatial autocorrelation coefficient. (b) Frequency distribution of congestion.

Figure 5. Distribution of congestion sources in the evening peak.

Figure 6. Non-linear effect of variables on congestion contributions: (a) residential population density; (b) road density; (c) working population density; (d) land use mix.

Table 2. Descriptive statistics for variables.

Table 3. Range of congestion thresholds.

Table 4. The tuning parameters of the LightGBM model.

Table 5. Results of five-fold cross-validation.

Table 6. Order of importance of independent variables.