Investigating the Nonlinear Effect of Built Environment Factors on Metro Station-Level Ridership under Optimal Pedestrian Catchment Areas via the Machine Learning Method

Abstract: Exploring the impact of built environment factors on metro ridership can help develop planning strategies for metro station areas. This study addresses a shortcoming of previous studies, most of which used a uniform pedestrian catchment area (PCA) around all metro stations. Beijing was divided into two zones, and 12 built environment explanatory variables were selected as independent variables based on the “7D” dimensions of the built environment. The boarding ridership during the morning peak hours was used as the dependent variable. Nineteen PCA radii from 200 to 2000 m were assumed. The optimal PCA of metro stations for each zone was determined by using the eXtreme Gradient Boosting (XGBoost) model with the objective of minimizing the Mean Absolute Percentage Error (MAPE). The nonlinear impact of the built environment factors of each zone on metro ridership was then analyzed under the optimal PCA of metro stations. The results show that (1) the optimal PCAs of metro stations inside and outside the 4th Ring Road are circular buffer zones with radii of 800 m and 1300 m, respectively; and (2) the built environment factors have a nonlinear influence on metro ridership, with strong threshold effects and spatial heterogeneity. The PCA results can be used for the built environment zoning of metro stations. The XGBoost model and the nonlinear impact results have significant implications for station-level ridership forecasting and for integrating TOD development with built environment renewal.


Introduction
The experience of some developed Western countries with high reliance on the car shows that car-based transport causes traffic congestion [1,2] and a range of environmental problems [3]. Metro transport is considered a better way of addressing the issues caused by high levels of car dependency [4-6] because it helps to reduce car dependency and congestion [7-9], improve road safety [10], and reduce social exclusion [11]. Therefore, urban policymakers highly value the construction and operation of metro transport [6,12], especially in developing countries [13]. The planning and construction of metro transport in China are growing rapidly. The 14th Five-Year Plan for the Development of a Modern Comprehensive Transport System issued by the State Council of the People's Republic of China states that the total operational mileage of metro transport in the country will reach 10,000 km by 2025 [14]. However, mega-cities like Beijing have already experienced excessive ridership at some metro stations during the morning peak hours. The urban metro operator in Beijing has had to limit ridership to ensure the safety and comfort of metro operations [15]. Still, these restrictions have greatly reduced the efficiency of residents' travel. How to ensure that metro transport can meet the commuting needs of urban residents has become a widespread concern [16]. Determining the factors that influence metro ridership is therefore very important for the planning and operation of metro transport [2,4,17].
Linear models are widely used by scholars. However, they assume a linear functional form when fitting the data, whereas in practice the influence of built environment factors may exhibit threshold effects. Using a linear model can therefore introduce errors into the results [18-20]. To date, Gradient Boosting Regression Tree (GBRT) models have been used to investigate the nonlinear influence of built environment factors on metro ridership [3,21], but GBRT models are prone to over-fitting [19]. Most studies focus on daily or weekly ridership, and few have looked separately at peak-hour inbound or outbound ridership. In addition, most existing studies have relied on experience, transit-oriented development (TOD) theory [22], or values borrowed from other studies [23,24] to identify the pedestrian catchment area (PCA) of metro stations, and most apply a PCA of uniform size to all stations when calculating built environment factors.
Therefore, our study has three major purposes: (1) to determine the optimal metro station PCA for each zone of Beijing's metro stations; (2) to investigate the nonlinear influence of the built environment of the two zones on the ridership of metro stations; and (3) to investigate the spatial heterogeneity of this nonlinear influence.

Literature Review
As an important component of public transport, metro transport has received a great deal of scholarly attention in recent years [4,17,18,25]. Existing studies have found that population density [26-28], density [17,29], accessibility [29,30], and land use mix [30] have an impact on subway ridership. This literature review focuses on four aspects: the research methodology, the determination of the PCA, the determination of dependent variables, and the determination of independent variables. The relevant transportation literature is summarized in Table 1.
In terms of research methods, the Ordinary Least Squares (OLS) model [4,16,31-35], the Geographically Weighted Regression (GWR) model [23,36-40], structural equation models [31], and the Multi-Scale Geographically Weighted Regression (MGWR) model [17,22] have been used by a large number of scholars. These models can only examine linear effects. However, some scholars have found that the effect of the built environment on metro passenger flow is nonlinear [3,25]. With the boom in machine learning methods, random forests [41] and deep learning [42,43] have been applied to transportation, and some scholars have used Gradient Boosting Regression Trees (GBRT) to analyze the nonlinear influence of the built environment on metro ridership [3,18,21,44]. However, the GBRT model has an overfitting problem [19]. The XGBoost model can capture the nonlinear influence of independent variables on the dependent variable while also mitigating the over-fitting problem [45]. In addition, XGBoost has the advantages of high accuracy and better handling of missing values and outliers [46,47]. XGBoost is currently being applied to predictive modeling [48-51], analysis of residents' travel behavior [19,52,53], factors triggering traffic accidents [54], and the impact of building configuration [55] on urban stormwater management. Few studies have used XGBoost to analyze the influence of the built environment on metro passenger flow.
Determining the PCA for a metro station is considered very important before conducting research [17]. Currently, most studies use circular buffers [1,4,16,22,23,31,34,36,52,56] centered on the metro station as the PCA. However, because station study areas overlap where metro stations are densely distributed, Thiessen polygons [57], or the intersection of Thiessen polygons with circular buffers [17,23,24], have also been used to determine the PCA. The radii of circular buffers chosen by different scholars vary widely, with common choices being 400 m [21], 500 m [31,34], 600 m [22,39,52], 800 m [3,4,16,23,36], and 1000 m [1,17]. In addition, the choice of buffer radius mostly relies on pedestrian accessibility [22,32,36], experience [4], or the research of others [23]. Existing research has shown that the PCA for metro stations varies from city to city [57] and that the PCA of metro stations in one city cannot simply be borrowed for another. Thus, the PCA of metro stations has been determined using the goodness of fit of regression models [17,39]. Although some scholars have identified the PCA by comparing the goodness of fit of regression models, they tend to use a single PCA for all metro stations. With the rapid development of cities, mega-cities like Beijing are establishing new districts on the outskirts of the city, which tend to be larger in scale. Therefore, using a uniform PCA of metro stations would increase the error of the model. Although one study has already divided the city into three zones [39], it still uses a uniform PCA of metro stations across the three zones. To our knowledge, no scholars have identified separate metro station PCAs for different zones.
In terms of the selection of dependent variables, daily ridership is the most popular among scholars [1,4,21,31,33-37,56], while some other scholars have chosen monthly ridership [33,36] or seasonal ridership [39] as dependent variables. In addition, some scholars use several dependent variables in one article [18,23,34]. Few scholars have analyzed boarding or alighting ridership during the morning peak hours on working days separately. Yet for mega-cities like Beijing, the morning rush hour is the time of most significant conflict, and there are already morning rush hour entry restrictions at metro stations. We think that a separate analysis of boarding ridership during the morning peak hours is important for improving the efficiency of metro operations and for adjusting metro station traffic later.
This study focuses on the impact of the "7D" built environment on the boarding ridership during the morning peak hours. The XGBoost model was used to determine the optimal PCA for different zones. The nonlinear influence of the built environment on subway passenger flow and its spatial heterogeneity are then studied under the optimal subway PCA.

Study Scope and Data Sources
The study covers a total of 292 metro stations in service on 19 lines in Beijing in 2020. We find that the distribution of metro stations inside the 4th Ring Road is more concentrated, while the distribution of metro stations outside the 4th Ring Road is more dispersed. So, all metro stations in Beijing are divided into two zones: metro stations inside the 4th Ring Road (white-filled areas) and metro stations outside the 4th Ring Road (yellow-filled areas) (Figure 1). The data source of the dependent variable is the Beijing public transport IC card data. We obtained the average hourly inbound passenger flow of Beijing's metro stations over the five working days from 12 October 2020 to 16 October 2020. Based on the trend of boarding ridership (Figure 2), the morning peak of Beijing's metro transit is 7:00-9:00. Considering that the conflict is more prominent in the morning peak hours and that space is limited, this paper only analyzes the boarding ridership during the morning peak hours (hereafter referred to as metro ridership).
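The aggregation described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout `(station, date, hour, boardings)`, the function name `morning_peak_boarding`, and the convention that hours 7 and 8 cover 7:00-9:00 are hypothetical and are not taken from the IC card dataset itself. All numbers are made up.

```python
from collections import defaultdict

def morning_peak_boarding(records, peak_hours=(7, 8)):
    """Average morning-peak boardings per station over the observed days.

    records: iterable of (station, date, hour, boardings) tuples, where
    `hour` is assumed to label the hour that starts at hour:00.
    """
    # Sum boardings within the peak window for each station and day.
    daily = defaultdict(float)
    for station, date, hour, count in records:
        if hour in peak_hours:
            daily[(station, date)] += count

    # Average the daily peak totals for each station.
    per_station = defaultdict(list)
    for (station, _date), total in daily.items():
        per_station[station].append(total)
    return {s: sum(v) / len(v) for s, v in per_station.items()}

# Toy records for two hypothetical stations.
RECORDS = [
    ("A", "2020-10-12", 7, 100), ("A", "2020-10-12", 8, 200),
    ("A", "2020-10-12", 9, 50),   # outside the peak window, ignored
    ("A", "2020-10-13", 7, 120), ("A", "2020-10-13", 8, 180),
    ("B", "2020-10-12", 7, 10),
]
peak = morning_peak_boarding(RECORDS)
print(peak)
```

Whether the peak total is summed or averaged per hour is a modeling choice; the sketch sums over the window and averages across days.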

Explanatory Variables of the Built Environment
The "7D" dimensions were constructed by adding demand management and demographic factors to the "5D" dimensions of the built environment [59]. They consist of seven dimensions: density, diversity, design, destination accessibility, distance, demand management, and demographics [60]. It has been shown that the number of points of interest (POIs) has an impact on metro ridership [61,62]. Therefore, the density of POIs is replaced by the number of POIs in this study. The built environment dataset was constructed on this basis and includes 12 built environment explanatory variables (Table 2). For the data sources and the calculation methods of the explanatory variables, see other research results of our research group [17].

Pedestrian Catchment Areas (PCA) Delineation for Metro Stations
A key task before analyzing the nonlinear influence of built environment factors on ridership at metro stations is to define the scope of the built environment analysis for metro stations [30]. The extent of this analysis is determined using the "maximum" walking distance or the area within walking distance of most users [4,63]. For this reason, a metro station's built environment analysis area is often referred to as its pedestrian catchment area (PCA). In existing studies, the PCA of metro stations varies widely, from a minimum of 250 m [1,56] to a maximum of 1500 m. In order to determine the PCA of the metro stations in the two zones of Beijing more accurately, circular buffer zones with radii of 200-2000 m (at 100 m intervals) are taken as candidate PCAs for the metro stations in each of the two zones. For each zone, the candidate PCA whose XGBoost model yields the minimum Mean Absolute Percentage Error (MAPE) is taken as the optimal PCA.
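The selection procedure can be sketched as a simple sweep over the 19 candidate radii. The sketch below is not the authors' code: `evaluate` is a hypothetical callback standing in for the real pipeline (recompute the built environment variables for a radius, fit the model, predict on the test stations), and `toy_evaluate` uses made-up numbers purely so that the example runs.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((p - t) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Candidate PCA radii from the text: 200 m to 2000 m in 100 m steps.
RADII_M = list(range(200, 2001, 100))

def select_optimal_radius(evaluate, radii=RADII_M):
    """Return (best_radius, scores), where scores maps radius -> test-set MAPE.

    `evaluate(radius)` is a hypothetical callback that returns
    (y_true, y_pred) for the test stations at that buffer radius.
    """
    scores = {}
    for radius in radii:
        y_true, y_pred = evaluate(radius)
        scores[radius] = mape(y_true, y_pred)
    best = min(scores, key=scores.get)
    return best, scores

def toy_evaluate(radius):
    """Made-up stand-in: the relative error is smallest at 800 m."""
    y_true = [1000.0, 2000.0, 4000.0]
    rel_err = 0.05 + abs(radius - 800) / 10000.0
    y_pred = [t * (1.0 + rel_err) for t in y_true]
    return y_true, y_pred

best_radius, scores = select_optimal_radius(toy_evaluate)
print(best_radius)
```

In the study this sweep would be run once per zone, so that each zone gets its own optimal radius.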

eXtreme Gradient Boosting (XGBoost)
XGBoost is an improved algorithm based on gradient-boosted decision trees, proposed by Chen et al. in 2016 [46]. The XGBoost model not only mitigates the overfitting problem but also has the advantages of high accuracy and better handling of missing values and outliers [46,47]. The objective function of XGBoost consists of two parts, training loss and regularization:
\[ \mathrm{Obj}(\theta) = L(\theta) + \Omega(\theta) \]
where \(\theta\) denotes the model parameters, \(L\) is the training loss function, and \(\Omega\) is the regularization term. The training loss measures the performance of the model on the training data. The regularization term controls the complexity of the model and thereby limits over-fitting [64]. In this study, the training set is 70% of the total data, and the test set is 30% of the total data. The parameter configuration of the XGBoost model selected for this study is shown in Table 3.
SHAP (Shapley Additive exPlanations) is used to explain machine learning models and was proposed by Lundberg and Lee in 2021 [65]. The formula for calculating the SHAP value is expressed as:
\[ \phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{i\}}\left(x_{S \cup \{i\}}\right) - f_S\left(x_S\right) \right] \]
where \(\phi_i\) is the SHAP value of feature \(i\); \(F\) is the set of all features; \(S\) is a subset of \(F\) that does not contain feature \(i\); \(|S|!\) is the factorial of the number of features contained in \(S\); \(x_S\) denotes the input feature values in \(S\); \(f_{S \cup \{i\}}\) is a model trained with feature \(i\); \(f_S\) is another model trained without feature \(i\); and \(f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S)\) is the difference between the outputs of the two models.
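To make the SHAP weighting concrete, the following minimal sketch computes exact Shapley values by enumerating every subset, which is only feasible for a handful of features. It is an illustration, not the procedure used in the study: `value(S)` is a hypothetical stand-in for the model output when only the features in S are available, and the toy contributions are made up. Practical SHAP implementations for tree models use much faster algorithms.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values via enumeration of all subsets S of F without i.

    `value(S)` is a hypothetical stand-in for f_S(x_S): the model output
    when only the features in frozenset S are used.
    """
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(n):
            # Weight |S|! (|F| - |S| - 1)! / |F|! for subsets of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi

# Toy additive "model" with made-up contributions (not results from the study).
CONTRIB = {"entrances": 300.0, "land_use_mix": 120.0, "bus_line_density": -40.0}

def toy_value(subset):
    return 1000.0 + sum(CONTRIB[f] for f in subset)

phi = shapley_values(list(CONTRIB), toy_value)
print(phi)
```

For an additive value function each Shapley value equals the feature's own contribution, and the values always sum to the difference between the full-model output and the baseline output.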

Mean Absolute Percentage Error
The Mean Absolute Percentage Error (MAPE) is a measure of relative error that uses absolute values to prevent positive and negative errors from canceling each other out. The MAPE has been found to be a more reliable measure of model accuracy [66], with smaller MAPE values indicating a more accurate model. The formula for calculating the MAPE is expressed as:
\[ \mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{\hat{y}_i - y_i}{y_i} \right| \]
where \(n\) is the total number of metro stations, \(\hat{y}_i\) is the predicted value of the dependent variable (metro ridership) for the \(i\)th metro station, and \(y_i\) is the actual value of the dependent variable for the \(i\)th metro station.
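As a worked example of the formula, the short sketch below computes the MAPE for three hypothetical stations. The ridership figures are made up for illustration; the guard against zero observed values is an added assumption, since the percentage error is undefined there.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when an observed value is zero")
    n = len(y_true)
    return 100.0 / n * sum(abs((p - t) / t) for t, p in zip(y_true, y_pred))

# Made-up observed and predicted morning-peak boardings for three stations.
observed = [1200.0, 800.0, 2000.0]
predicted = [1080.0, 880.0, 2000.0]

# Relative errors are 10%, 10%, and 0%, so the mean is 20/3 percent.
result = mape(observed, predicted)
print(round(result, 2))
```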

Optimal Metro Stations PCA for Different Zones
In order to verify that the XGBoost model is suitable for this analysis, we compared its accuracy with that of other machine learning models; the comparison of adjusted R² (AdjR²) values is shown in Figure 3. As can be seen from Figure 3, the accuracy of the XGBoost model is better than that of the other models: the AdjR² of the testing set is 0.74 inside the 4th Ring Road and 0.72 outside the 4th Ring Road. So, XGBoost can be used for this analysis. We then calculated the MAPE for each candidate PCA of the metro stations inside and outside the 4th Ring Road from the XGBoost predictions and the true values in the testing set, and plotted the MAPE against the PCA radius. Most scholars currently studying the nonlinear influence of built environment factors on metro ridership have used the goodness of fit of linear models [17,39] or experience [3,18,21] to determine the optimal PCA of metro stations. To our knowledge, no one has used the accuracy of nonlinear models to determine the optimal PCA of metro stations. Figure 4 shows the MAPE line graph at different PCAs of metro stations inside and outside the 4th Ring Road. The graph shows that for metro stations inside the 4th Ring Road, the MAPE reaches its lowest value of 9.64% when the buffer zone radius is 800 m. Therefore, the optimal PCA of metro stations inside the 4th Ring Road is the circular buffer zone with an 800 m radius. For metro stations outside the 4th Ring Road, the MAPE reaches a minimum value of 16.60% when the buffer radius is 1300 m. So, the optimal PCA of metro stations outside the 4th Ring Road is a circular buffer with a 1300 m radius.
Comparing the two zones, the MAPE of the metro stations outside the 4th Ring Road is larger than that of the metro stations inside the 4th Ring Road. This indicates that the model accuracy is higher inside the 4th Ring Road, which is consistent with existing research [39]. The likely reason is that the area outside the 4th Ring Road is a new urban area with a larger urban scale, where some passengers do not start their journey within the PCA but still choose to travel to the metro station.

Global Impact on Metro Ridership
The mean of the absolute SHAP values of each explanatory variable is calculated to express the degree of influence of that explanatory variable on metro ridership. The greater the mean absolute SHAP value, the greater the influence of the explanatory variable on metro ridership, and vice versa. The mean SHAP values of the explanatory variables for ridership at metro stations in the two zones are shown in Figures 5 and 6, with positive correlations in red and negative correlations in blue.
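The ranking by mean absolute SHAP value can be sketched in a few lines. The function name `global_importance`, the three feature names, and the SHAP matrix below are hypothetical and made up for illustration; they are not values from Figures 5 and 6.

```python
def global_importance(shap_rows, feature_names):
    """Rank features by the mean absolute SHAP value across stations.

    shap_rows: one list of SHAP values per station, ordered like feature_names.
    Returns a list of (feature, mean_abs_shap) sorted from most to least influential.
    """
    n = len(shap_rows)
    means = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)

# Made-up SHAP values for three stations and three explanatory variables.
FEATURES = ["entrances", "land_use_mix", "bus_line_density"]
SHAP_ROWS = [
    [250.0, -80.0, 30.0],
    [-150.0, 120.0, -10.0],
    [200.0, -40.0, 20.0],
]

ranking = global_importance(SHAP_ROWS, FEATURES)
print(ranking)
```

Taking absolute values first means that a variable with large positive effects at some stations and large negative effects at others still ranks as influential, which a plain mean would hide.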
For metro stations inside the 4th Ring Road, the top three explanatory variables in order of influence are the number of entrances and exits > mixed utilization of land > the density of bus lines. There is a positive relationship between all three explanatory variables and the SHAP values (Figure 5), i.e., the larger the feature values of these three explanatory variables, the larger the SHAP values. This means that the larger these three explanatory variables are, the greater their impact on metro ridership. The mixed utilization of land has a large impact on metro ridership, which indicates that mixed land use development strongly promotes metro ridership. That is consistent with existing research [21]. However, the floor area ratio, a very important index of land development, is negatively correlated with metro ridership. This indicates that, for metro stations inside the 4th Ring Road, stations with a higher floor area ratio do not necessarily have higher ridership. The likely reason is that higher floor area ratios are generally concentrated in the core commercial office areas, where the morning peak is dominated by alighting ridership and does not generate much boarding ridership. Conversely, residential cores can generate high boarding ridership but have a relatively low floor area ratio due to design constraints. The effect of population on ridership at metro stations inside the 4th Ring Road is positive, which is consistent with existing studies [4,36].
For metro stations outside the 4th Ring Road, the top three explanatory variables in order of influence are the number of public service facilities > building density > road density (Figure 6). Both building density and road density are negatively correlated with metro ridership. The average SHAP value for the number of public service facilities is much greater than the average SHAP values for building density and road density. This indicates that, for metro stations outside the 4th Ring Road, the number of public service facilities is the explanatory variable with the greatest degree of influence. For these stations, it may therefore be more effective to adjust ridership by adjusting the number of public service facilities.
Figures 5 and 6 show that there is a significant difference in the ranking of the effects of the explanatory variables on metro ridership inside and outside the 4th Ring Road. This demonstrates the need to partition the study area when examining the built environment's impact on metro ridership. Understanding the global impact of built environment explanatory variables on metro ridership in both zones can help planning decision makers and operations and design departments to adjust metro ridership from a zone-wide perspective.

Nonlinear Effects on Metro Ridership
We select the top three explanatory variables in each of the two zones, ranked by degree of influence, for nonlinear analysis. Figure 7 shows the nonlinear results for the explanatory variables for metro stations inside and outside the 4th Ring Road. For metro stations inside the 4th Ring Road, the relationship between the number of entrances and exits and metro ridership is positive overall. When the number of entrances and exits is between five and seven, its effect on metro ridership is stable. This means that if we want to adjust the ridership at a metro station, adjusting the number of entrances within the range of 5-7 may not change the ridership at the metro station. However, when the number of entrances and exits is greater than seven, its impact on metro ridership tends to increase (Figure 7a). When the mixed utilization of land is less than 0.84, its impact on metro ridership is minimal (Figure 7b), and the overall impact of the mixed utilization of land on metro ridership is positive. If we want to improve the ridership of a metro station inside the 4th Ring Road, it may be more effective to increase the land use mix degree beyond 0.84. The nonlinear effect of the density of bus lines on metro ridership is more complex. When the density of bus lines is less than 35, its effect on metro ridership is small. When the density of bus lines is in the range of 35-39, its effect on ridership is negative. In addition, when the density of bus lines is greater than 54, its effect on ridership is also negative (Figure 7c).
For metro stations outside the 4th Ring Road, when the number of public service facilities is less than 65, its impact on the ridership at metro stations is small. However, when the number of public service facilities is in the range of 65-80, its impact on the ridership at metro stations increases sharply. When the number of public service facilities is greater than 80, its influence on the ridership of the metro station tends to level off, and there is even a negative correlation (Figure 7d). The overall impact of building density on metro ridership is negative. The effect of building density on metro ridership decreases sharply when building density is in the range of 0.07-0.10 and is relatively flat when the building density is greater than 0.10 (Figure 7e). The effect of road density on metro ridership decreases sharply when the road density is between 1.9 and 4.5 but levels off when the road density is greater than 4.5 (Figure 7f). This suggests that the 1.9-4.5 range is the most effective one if road density is to be used to change metro ridership.
We find that the selected explanatory variables have a strong threshold effect on the ridership of metro stations. That is consistent with existing research findings [3,21]. Understanding the nonlinear effects of explanatory variables on metro ridership can help us to adjust metro ridership from an urban renewal perspective. In particular, we find that some of the explanatory variables do not have a greater impact on metro ridership at higher feature values. Therefore, while understanding the global impact of the explanatory variables on metro ridership, the nonlinear impact of the explanatory variables on metro ridership needs to be considered simultaneously.
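One simple way to inspect such threshold effects numerically, as a complement to reading them off a dependence plot, is to average the SHAP values within ranges of the feature value. The sketch below is an assumption-laden illustration and not the method used to produce Figure 7: the bin edges 65 and 80 are the thresholds reported above for the number of public service facilities, while the function name `mean_shap_by_bin` and all data points are made up.

```python
from bisect import bisect_right

def mean_shap_by_bin(feature_values, shap_values, edges):
    """Average SHAP value within bins delimited by sorted `edges`.

    Bin 0 holds x < edges[0], bin k holds edges[k-1] <= x < edges[k],
    and the last bin holds x >= edges[-1]. Empty bins map to None.
    """
    sums = [0.0] * (len(edges) + 1)
    counts = [0] * (len(edges) + 1)
    for x, s in zip(feature_values, shap_values):
        k = bisect_right(edges, x)
        sums[k] += s
        counts[k] += 1
    return [sums[k] / counts[k] if counts[k] else None for k in range(len(edges) + 1)]

# Made-up station data: number of public service facilities and its SHAP value.
facilities = [20, 50, 65, 72, 79, 90, 120]
shap_vals = [-30.0, -20.0, 10.0, 60.0, 110.0, 100.0, 90.0]

# Bin edges taken from the thresholds reported in the text (65 and 80).
bin_means = mean_shap_by_bin(facilities, shap_vals, edges=[65, 80])
print(bin_means)
```

A jump in the bin means between adjacent ranges is a rough indication of a threshold; a flat or declining mean in the top bin corresponds to the levelling-off described above.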

Spatial Heterogeneity Effect on Metro Ridership
Previous studies have mostly used Partial Dependence Plots (PDP) to study the impact of the built environment explanatory variables on metro ridership from a global perspective [3,21]. However, existing research has demonstrated spatial heterogeneity in the influence of the built environment explanatory variables on metro ridership [17,18]: the influence varies depending on the station's location. Therefore, this study links SHAP values to metro stations and visualizes them. In this section, the three explanatory variables with the greatest global influence are again selected for spatial heterogeneity analysis. The results of the visualization of metro station SHAP values are shown in Figure 8.
For metro stations inside the 4th Ring Road, the number of metro stations with positive and negative SHAP values for the number of entrances and exits are roughly evenly divided.The metro stations with high negative SHAP values are mainly located in the northern part of the 4th Ring Road.The likely reason for this is that these metro stations are saturated with passengers, and further upgrading the number of entrances and exits to the metro stations will not enhance metro ridership.In addition, the metro stations with positive high SHAP values are mainly located in the southeastern part of the 4th Ring Road (Figure 8a).For the mixed utilization of land, there are more negative SHAP metro stations than positive SHAP metro stations (Figure 8b).This demonstrates that for most metro stations inside the 4th Ring Road, the mixed utilization of land dampens metro ridership.The reason for this may be that these neighborhoods are functionally mixed.However, it is a fact that where there is a high density of residential areas, there is also a high level of ridership at metro stations, and an excessive mix of land use reduces the number of residential areas.In addition, metro stations with negative SHAP values are clustered inside the 4th Ring Road.Therefore, when urban renewal is carried out later, the agglomeration area of metro stations with negative SHAP values can be considered uniformly.For the density of bus lines, the number of metro stations with negative SHAP values is much greater than the number of positive SHAP value metro stations (Figure 8c).This demonstrates that the density of bus lines has a dampening effect on metro ridership for most metro stations inside the 4th Ring Road.There is also a strong concentration of positive high SHAP metro stations, with positive high SHAP metro stations concentrated in the southeast and southwest inside the 4th Ring Road.

For metro stations outside the 4th Ring Road, the stations with high positive SHAP values for the number of public service facilities are clustered in the north and east (Figure 8d). Combined with Figure 1, ridership at these stations is high, and it could be reduced by reducing the number of public service facilities. Most of the stations with high negative SHAP values are concentrated at the ends of metro lines (Figure 8d). A possible reason is that these stations are located in suburban areas, where large infrastructure and public service facilities may leave fewer residential neighborhoods. The stations with positive SHAP values for building density are mainly concentrated at the ends of metro lines (Figure 8e). Ridership at these stations is low, so it can be enhanced by increasing the building density. In addition, the stations with high negative SHAP values show a strong agglomeration effect, especially in the north between the 4th and 5th Ring Roads and in the southeast outside the 4th Ring Road (Figure 8e). These stations can be considered together when addressing regional problems. For road density, the stations with high negative SHAP values are mostly concentrated at the end of Line 4 and in the northeast between the 4th and 5th Ring Roads (Figure 8f). The probable reason is that these neighborhoods are dense, heavily populated residential areas, where increasing the road density would reduce the land available for use and result in smaller residential areas. The stations with high positive SHAP values are concentrated in the north outside the 4th Ring Road (Figure 8f).
By visualizing the SHAP values of the explanatory variables, we identify the effect of each explanatory variable on ridership at individual metro stations. This has substantial practical implications for adjusting the ridership of individual stations [17,18]. When doing so, we need to consider both the SHAP values of the stations and the nonlinear effects of the explanatory variables on ridership.
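The station-level reading of SHAP values described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the station names, coordinates, and SHAP values are made up, and the helper names `sign_counts`, `quadrant`, and `high_shap_by_quadrant` are hypothetical. In practice the per-station SHAP values would come from the fitted XGBoost model.

```python
from collections import Counter

# Hypothetical per-station records: (station, x, y, shap_value) for ONE feature.
# x and y are offsets in km from an assumed city-centre reference point;
# all values are synthetic and only illustrate the bookkeeping.
RECORDS = [
    ("S1", 3.0, 4.0, -0.42),
    ("S2", -2.0, 5.0, -0.31),
    ("S3", 4.0, -3.0, 0.27),
    ("S4", 2.5, -1.5, 0.35),
    ("S5", -3.0, -2.0, -0.05),
    ("S6", 1.0, 2.0, 0.08),
]


def sign_counts(records):
    """Count stations with positive and with negative SHAP values."""
    counts = Counter()
    for _, _, _, shap_value in records:
        if shap_value > 0:
            counts["positive"] += 1
        elif shap_value < 0:
            counts["negative"] += 1
    return counts


def quadrant(x, y):
    """Label a location by compass quadrant relative to the reference point."""
    ns = "north" if y >= 0 else "south"
    ew = "east" if x >= 0 else "west"
    return ns + ew


def high_shap_by_quadrant(records, threshold):
    """Group stations whose |SHAP| is at least `threshold` by sign and quadrant."""
    groups = {}
    for name, x, y, shap_value in records:
        if abs(shap_value) < threshold:
            continue
        sign = "positive" if shap_value > 0 else "negative"
        groups.setdefault((sign, quadrant(x, y)), []).append(name)
    return groups


counts = sign_counts(RECORDS)
groups = high_shap_by_quadrant(RECORDS, threshold=0.25)
print(counts)
print(groups)
```

Counting signs corresponds to statements such as "more stations with negative SHAP values than positive ones", while grouping high-magnitude values by location corresponds to the clusters read off the maps in Figure 8. The 0.25 threshold is an arbitrary choice for the synthetic data.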

Conclusions
This research provides empirical evidence for delineating the PCA of metro stations when analyzing the nonlinear impacts of the built environment on station ridership with the XGBoost model. The optimal PCAs of metro stations inside and outside the 4th Ring Road are circular buffer zones with radii of 800 m and 1300 m, respectively. Additionally, the key explanatory variables we selected in the two zones (the top three in overall impact) have a nonlinear relationship with metro ridership and a strong threshold effect. We also found spatial heterogeneity in the effects of the explanatory variables on station ridership. This indicates that site-specific renewal strategies can be developed around metro stations, taking into account the nonlinear effects of the explanatory variables on metro ridership.
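The radius selection summarized above amounts to a one-dimensional search: for each candidate PCA radius, refit the model and keep the radius with the lowest MAPE. The sketch below shows only that bookkeeping. The function `mape_for_radius` is a hypothetical stand-in that returns made-up error values, with its minimum placed at 800 m purely to mirror the reported result for stations inside the 4th Ring Road; in the study this step would rebuild the built environment variables within the buffer and evaluate the XGBoost model.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Actual values must be nonzero."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Nineteen candidate radii from 200 m to 2000 m in 100 m steps.
RADII = list(range(200, 2001, 100))


def mape_for_radius(radius_m):
    """Hypothetical stand-in for 'recompute features, refit, evaluate'.

    Returns a synthetic convex error curve with its minimum at 800 m.
    """
    return 20.0 + ((radius_m - 800) / 100.0) ** 2


def best_radius(radii, score):
    """Return the radius with the smallest score (ties go to the earlier radius)."""
    return min(radii, key=score)


optimal = best_radius(RADII, mape_for_radius)
print(len(RADII), optimal)
print(mape([100.0, 200.0], [110.0, 180.0]))
```

Running the same search separately for each zone is what yields one optimal radius per zone.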
Based on the results of this study, we make the following recommendations: (1) When considering the TOD range in Beijing, we recommend 800 m inside the 4th Ring Road and 1300 m outside it. (2) For metro stations inside the 4th Ring Road, the vitality of the surrounding area can be improved by changing the land use mix and the bus line density around the stations. For metro stations outside the 4th Ring Road, the vitality of the metro station can be improved by changing the number of public service facilities, the building density, and the road density.
There are some limitations in this study. First, the PCA is assumed to be a circular buffer, which cannot represent the actual range of passenger OD flows; this study did not use other means to assess the actual travel distribution of metro passengers, which may be important for improving model accuracy. Second, OSM data were used in our study, and this non-specialized map data can bias the results. In addition, socioeconomic variables were not included; future studies can include them to make the model results more accurate.
also shows the spatial distribution of passenger flow at the station level in Beijing.

Figure 1. Spatial distribution of metro ridership during the morning peak hours.

Figure 2. Changes in metro ridership by the time of day in Beijing.

Figure 3. Accuracy of different machine learning models: (a) Inside the 4th Ring Road; (b) Outside the 4th Ring Road.

Figure 4. MAPE diagram for different zones of metro station PCAs.

Figure 5. Global impact of explanatory variables on metro ridership inside the 4th Ring Road.

Figure 6. Global impact of explanatory variables on metro ridership outside the 4th Ring Road.

Figure 7. Nonlinear results for explanatory variables of metro stations inside and outside the 4th Ring Road: (a) Number of entrances and exits for metro stations inside the 4th Ring Road. (b) Mixed utilization of land for metro stations inside the 4th Ring Road. (c) Density of bus lines for metro stations inside the 4th Ring Road. (d) Number of public service facilities for metro stations outside the 4th Ring Road. (e) Building density for metro stations outside the 4th Ring Road. (f) Road density for metro stations outside the 4th Ring Road.

Figure 8. Local SHAP values for metro stations inside and outside the 4th Ring Road: (a) Number of entrances and exits for metro stations inside the 4th Ring Road. (b) Mixed utilization of land for metro stations inside the 4th Ring Road. (c) Density of bus lines for metro stations inside the 4th Ring Road. (d) Number of public service facilities for metro stations outside the 4th Ring Road. (e) Building density for metro stations outside the 4th Ring Road. (f) Road density for metro stations outside the 4th Ring Road.

Table 1. Summary of reference literature on transportation.

Table 3. The parameter configuration of the XGBoost model.