1. Introduction
In recent decades, the rapid expansion of high-speed rail (HSR) networks has become a transformative force reshaping the spatial organization of cities and regions [1,2]. As an efficient, high-capacity mode of intercity transport, HSR enhances regional connectivity, compresses travel time, and drives urban integration and spatial restructuring across scales [3,4,5]. In China, the development of the world’s largest HSR system has not only fulfilled key transportation demands but has also profoundly impacted urban form, land use patterns, and population distribution [6,7,8].
HSR stations, serving as both transport nodes and urban centers, play a pivotal role in promoting economic vitality and guiding development direction [
9,
10,
11]. However, the success of these stations extends beyond their transport functionality—it is increasingly reflected in the intensity, diversity, and persistence of human activities they attract. This has given rise to the concept of spatial vitality, which refers to the dynamic state of human presence and interaction within urban space over time. Spatial vitality is a multidimensional construct encompassing economic vitality (e.g., business activity, consumption), social vibrancy (e.g., pedestrian flows, diversity of social interactions), and environmental engagement (e.g., the use of public and green spaces). In the context of HSR station areas, spatial vitality serves as a key indicator of how effectively transportation infrastructure integrates with urban life [
12,
13,
14,
15,
16].
To explore what drives spatial vitality in HSR station areas, this study focuses on three categories of influencing factors—urban support capacity, station attributes, and built environment characteristics—based on a city–node–place analytical framework. These dimensions are selected because they collectively reflect the macro-level development background, the meso-level transportation functions of the station, and the micro-level spatial qualities experienced by users. Specifically, urban support indicators such as population size and urbanization level represent the demand base and economic potential of the host city. Station-level variables, including service frequency, location accessibility, and connectivity, determine how well the station functions as a mobility hub. Built environment indicators—such as commercial land use ratio, transportation facilities, and functional diversity—describe the spatial configuration and service provision in the immediate station area. By jointly analyzing these three levels, the study aims to reveal not only which factors significantly affect spatial vitality, but also why and how they do so under different conditions and in combination.
Understanding what drives this vitality is essential for enhancing land use efficiency, optimizing functional configurations, and promoting transit-oriented, sustainable development. More importantly, spatial vitality plays a fundamental role in improving the broader quality of urban life. Economically, vibrant station areas can stimulate commercial development and attract investment. Socially, they facilitate inclusiveness, accessibility, and human interaction. Environmentally, vital station areas can promote compact development, reduce vehicle dependence, and encourage the use of public and green spaces. Therefore, enhancing spatial vitality around HSR stations is not only a matter of functional efficiency, but also a strategic pathway toward inclusive, resilient, and environmentally sustainable cities.
Despite growing academic attention, current research on HSR station area vitality faces several notable limitations [
17,
18,
19]. First, many studies adopt a static perspective, relying on instantaneous indicators such as heatmaps (e.g., Baidu thermal maps), which capture short-term activity snapshots but fail to reflect the cumulative vitality that develops over time—a key aspect for assessing the long-term spatial dynamics of station areas. Second, traditional statistical models typically assume linear relationships and struggle to capture complex nonlinear interactions or threshold effects among influencing variables, limiting their explanatory power in multifactorial urban contexts. Third, conventional data sources often involve survey-based or aggregated statistics, which lack spatial precision and temporal continuity. Such data mask local variations and individual behaviors within station areas. In contrast, emerging location-based big data, like social media check-ins, provide fine-grained, real-time insights, enabling more accurate mapping of human activity patterns and overcoming the coarse granularity and static nature of prior data.
With the emergence of geospatial big data and social media-based human mobility records, there is a new opportunity to conduct high-resolution analyses of spatial vitality [20,21,22]. In particular, the use of location-based check-in data provides valuable insights into real-time activity patterns, enabling more dynamic assessments of how people interact with station environments. Moreover, combining classical regression models with machine learning algorithms offers a robust framework to capture both the interpretable and nonlinear aspects of population behavior [23,24,25,26,27].
Recent academic efforts have increasingly leveraged multi-source data and advanced analytical methods to study the dynamics of HSR station areas. For instance, Wang et al. focused on the Yangtze River Delta, developing an integrated framework to assess urban vitality through indicators such as concentration, accessibility, livability, and functional diversity [
6]. Yue et al. analyzed mobile phone signaling data in Jiangsu Province, clustering 71 HSR stations by passenger flow time series and employing geographically weighted multinomial logit models to examine the roles of entertainment POI density, population, GDP, and built area in shaping station classifications [
9]. Additionally, recent research by Doan et al. applied machine learning models combined with SHAP (Shapley Additive Explanations) to explore the nonlinear and threshold effects of built environment, traffic, and air quality factors on urban vitality in Manhattan, using street-view pedestrian presence data [
28]. Building on our earlier exploratory work, which employed panel data and a difference-in-differences (DID) approach to examine the temporal effects of HSR on commercial agglomeration, this study makes a significant conceptual and methodological advancement. We shift from a temporal to a spatial comparative perspective, analyzing the vitality mechanisms of 66 HSR station areas across 35 cities. Through interpretable machine learning and a city–node–place framework, we aim to uncover the multiscale drivers of station area vitality and provide actionable planning insights [
29]. Together, these studies highlight the growing potential of integrating diverse datasets and interpretable computational models to uncover complex spatial patterns.
Therefore, this study aims to comprehensively examine the spatial vitality of HSR station areas across China by integrating multi-source data and applying both multiple linear regression (MLR) and Light Gradient Boosting Machine (LightGBM) models. Drawing on check-in data from Sina Weibo, urban statistical indicators, and detailed built environment metrics, we analyze 66 representative HSR stations in 35 cities to identify key drivers of population clustering. We further explore the threshold and interaction effects of influencing factors using SHAP (Shapley Additive Explanations) analysis [
30,
31].
The remainder of the paper details the study area, data sources, variable design, and modeling methods, and then presents the empirical results, followed by theoretical discussion and policy implications.
2. Study Area and Data
2.1. Study Area
As of now, HSR services have been introduced in 269 cities across China, covering approximately 72.3% of all cities nationwide. To ensure the representativeness of our research, we selected 35 cities through a stratified sampling framework. The selection considered five key criteria: (1) regional diversity (eastern, central, western, and ethnic minority regions); (2) administrative level (including municipalities, provincial capitals, and sub-provincial cities); (3) urban scale and population density; (4) level of economic development and functional importance; and (5) the maturity and intensity of HSR operations. This approach ensures that the sample reflects China’s diverse urban contexts and developmental stages.
From these cities, 66 representative HSR stations were selected based on comprehensive criteria, including station classification, operational scale, geographic location, service capacity, and data availability [
11,
29,
32,
33]. In order to avoid bias and maintain consistency, stations co-located with airports or lacking adequate data—such as Chengdu Xipu Station and Guangzhou Baiyun Airport North Station—were excluded. We also avoided stations with weak service capabilities or low integration with the surrounding urban fabric. This filtering process ensures the analytical validity and robustness of our sample, making it suitable for comparative analysis.
The final sample includes a wide range of station types (e.g., central city hubs, regional gateways, peripheral nodes), covering diverse geographic and functional contexts. These stations span metropolitan cores such as Beijing and Shanghai, as well as key inland centers like Xi’an and Chengdu. This distribution enables a comprehensive understanding of how HSR integrates with varying urban structures [
34,
35]. Overall, the selected 66 stations in 35 cities provide a robust empirical foundation for analyzing spatial vitality in different socioeconomic and planning environments across China (
Figure 1).
2.2. HSR Station Areas
In the existing literature, scholars have adopted various approaches to delineate the spatial extent of areas influenced by HSR stations, depending on their specific research objectives [
36,
37,
38,
39,
40,
41,
42]. Among these, distance-based and time-based criteria are widely applied. For example, Schütz (1998) classified the surrounding areas into three development zones based on station accessibility: a 5–10 min access zone, a zone accessible within 15 min, and a zone beyond 15 min [
43]. Building on this, Zheng et al. (2024) investigated development patterns and multi-level spatial interactions within a 1500 m radius around HSR stations, shedding light on the mechanisms of station area development and spatial structure [
1]. Similarly, Wang Lan et al. (2014) proposed a concentric zoning approach based on spatial distance, dividing the station’s influence area into a core zone (2000 m), an impact zone (4000 m), and a peripheral zone (8000 m) [
44].
Findings from these studies indicate that the functional intensity of facilities around HSR stations generally decreases with increasing distance from the station, forming a layered, centripetal spatial structure [
45,
46,
47]. Based on this insight, the present study adopts a 1500 m fixed-radius buffer centered on each HSR station as the primary area of investigation. This decision is supported by both existing literature and the practical focus of this study. On one hand, the 1–2 km range is widely used in the existing literature as a typical zone of transit influence, ensuring methodological consistency. On the other hand, since this research focuses on spatial vitality, which is typically concentrated within walkable distances from the station—especially during the early and middle stages of development—a 1500 m range effectively captures the core activity zone.
Additionally, using a uniform fixed-radius buffer allows for standardized comparison across 66 HSR stations in 35 cities, enhancing the robustness of cross-sectional analysis. While network-based travel distances may offer a more realistic representation of accessibility, such data are often unavailable or inconsistent across large samples due to differences in local infrastructure quality and data availability.
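The fixed-radius rule can be illustrated with a minimal sketch that keeps only the check-ins falling inside a 1500 m circle around a station. It assumes straight-line (great-circle) distance on WGS84 coordinates; the station and check-in coordinates, the function names, and the per-km² density helper are illustrative assumptions rather than the study's actual processing code.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; an approximation
BUFFER_RADIUS_M = 1500.0      # fixed buffer radius adopted in the text


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def checkins_in_buffer(station, checkins, radius_m=BUFFER_RADIUS_M):
    """Return the check-ins located within radius_m of the station."""
    s_lat, s_lon = station
    return [c for c in checkins
            if haversine_m(s_lat, s_lon, c[0], c[1]) <= radius_m]


def checkin_density(count, radius_m=BUFFER_RADIUS_M):
    """Check-ins per square kilometre of circular buffer area."""
    area_km2 = math.pi * (radius_m / 1000.0) ** 2
    return count / area_km2


# Hypothetical coordinates (lat, lon), for illustration only.
station = (31.1940, 121.3200)
checkins = [
    (31.1950, 121.3210),  # roughly 150 m from the station
    (31.2000, 121.3300),  # roughly 1.2 km from the station
    (31.2300, 121.3200),  # roughly 4 km away, outside the buffer
]
inside = checkins_in_buffer(station, checkins)
print(len(inside), round(checkin_density(len(inside)), 3))
```

Because every station uses the same radius, the buffer area (about 7.07 km²) is identical across the sample, so raw counts and densities rank the stations in the same order.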
2.3. Data Sources and Processing
Population agglomeration is a critical indicator for evaluating the vibrancy and economic consumption potential of HSR station areas [48,49]. Meanwhile, users’ check-in behaviors at various spatial locations reflect their preferences for different types of activity spaces. Therefore, to gain deeper insights into the spatial behavior patterns of individuals within HSR station areas, this study employed a web crawler to collect check-in data from Sina Weibo within defined station boundaries.
The acquired check-in data were first matched with corresponding spatial coordinates. Subsequently, the spatial locations were categorized based on a point-of-interest (POI) classification system. This processing enabled the identification of the spatial distribution patterns of active individuals and the characterization of activity hotspots within the station areas.
Notably, Sina Weibo initially provided an open API for check-in data access in 2012 [
50,
51,
52]. However, the original interface was later discontinued due to overwhelming access traffic. In response to this limitation, this study improved upon previous methods by developing a targeted data acquisition strategy. Specifically, we conducted a detailed search of the Sina Weibo source pages, applied precise filtering techniques, and utilized both the location service API and user service API in an integrated manner. This optimized workflow significantly enhanced both the accuracy and efficiency of data collection, thereby laying a solid foundation for subsequent analysis.
To ensure the integrity of the dataset, we implemented a multi-stage preprocessing workflow. First, we removed duplicate entries based on user ID, timestamp, and spatial coordinates. Second, we filtered out invalid check-ins lacking essential attributes such as geolocation or content. Third, we eliminated advertising-related records using keyword-based content filtering to exclude promotional or irrelevant posts. These steps collectively ensured that only high-quality, user-generated activity records were retained for analysis.
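The three preprocessing stages described above can be sketched as follows. This is a minimal illustration, not the study's pipeline: the record field names (`user_id`, `timestamp`, `lat`, `lon`, `content`) and the advertising keyword list are assumptions, since the actual schema and keywords are not given.

```python
# Hypothetical advertising keywords; the study's actual list is not given.
AD_KEYWORDS = ("discount", "promotion", "buy now")


def clean_checkins(records):
    """Apply the three cleaning stages in the order given in the text."""
    # Stage 1: remove duplicates on (user ID, timestamp, coordinates),
    # keeping the first occurrence.
    seen, unique = set(), []
    for r in records:
        key = (r.get("user_id"), r.get("timestamp"), r.get("lat"), r.get("lon"))
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # Stage 2: drop records lacking geolocation or content.
    valid = [r for r in unique
             if r.get("lat") is not None
             and r.get("lon") is not None
             and r.get("content")]

    # Stage 3: drop records whose text contains an advertising keyword.
    return [r for r in valid
            if not any(k in r["content"].lower() for k in AD_KEYWORDS)]


# Hypothetical records, for illustration only.
records = [
    {"user_id": "u1", "timestamp": "2024-05-01T08:00", "lat": 31.19,
     "lon": 121.32, "content": "Waiting for my train"},
    {"user_id": "u1", "timestamp": "2024-05-01T08:00", "lat": 31.19,
     "lon": 121.32, "content": "Waiting for my train"},  # exact duplicate
    {"user_id": "u2", "timestamp": "2024-05-01T09:00", "lat": None,
     "lon": None, "content": "No location attached"},
    {"user_id": "u3", "timestamp": "2024-05-01T10:00", "lat": 31.20,
     "lon": 121.33, "content": "Big DISCOUNT, buy now!"},
    {"user_id": "u4", "timestamp": "2024-05-01T11:00", "lat": 31.20,
     "lon": 121.33, "content": "Coffee near the station"},
]
cleaned = clean_checkins(records)
print([r["user_id"] for r in cleaned])
```

Simple substring matching is only one way to implement keyword-based filtering; it will miss paraphrased promotions and may remove genuine posts that happen to mention a keyword.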
The data acquisition process involved several key steps (Figure 2): identifying available interface paths, constructing query addresses for HSR station areas, sending requests to retrieve POI collections, iteratively accessing URLs and parsing the returned JSON files, and finally, performing data validation and storage. The collected check-in data covered the period up to 31 December 2024, and each entry was precisely matched with the latitude and longitude of its corresponding HSR station area (Table 1).
2.4. Variables
The influencing factors of population agglomeration represent a multidimensional and complex research domain, involving the interaction of various spatial and functional components. At the urban level, the capacity of a city to support population concentration serves as a fundamental condition. This capacity encompasses urban scale, economic development, transportation network structure, and the orientation of planning policies (Table 2). Together, these factors determine the city’s attractiveness and its ability to accommodate large populations [53,54,55].
In selecting urban support variables, we prioritized those with solid theoretical foundations and empirical support in the field of urban vitality research. Specifically, we identified “Population Scale,” “Economic Development,” and “Transportation Network” as core dimensions, as they comprehensively reflect a city’s development stage and population attraction capacity. These macro-level factors play a fundamental role in determining the intensity of population activity and the spatial vitality of HSR station areas.
First, the scale and level of economic development define the availability of employment, education, and living resources, which directly enhance the city’s capacity to attract people. Second, the layout and efficiency of the transportation network facilitate the flow of people and goods, making the city a key node in regional mobility. Additionally, urban planning strategies and policy support optimize resource allocation, thus further reinforcing population concentration. The data for these indicators are primarily derived from municipal statistical yearbooks, government bulletins, and open-access databases.
At the node level, the characteristics of HSR stations also exert a significant influence on the degree of population aggregation. Factors such as the rationality of station location, the suitability of its design and scale, and the completeness of supporting facilities jointly determine the attractiveness of the station as a transportation hub [
56,
57]. Stations located in central urban areas or major traffic corridors are more likely to attract large flows of people. A well-organized spatial layout and functional zoning can enhance user experience, while high-quality supporting facilities—including waiting areas, commercial amenities, and accessibility features—encourage frequent use and promote further agglomeration. Data on HSR station characteristics are sourced from official railway reports, open databases, and transportation planning documents.
Table 3 presents the variables used to assess the HSR station conditions.
At the place level, the built environment of the station area serves as the immediate spatial carrier for population activity and thus plays a key role in influencing agglomeration [
16,
58,
59,
60,
61]. Core dimensions of the built environment include spatial density, functional diversity, design quality, transportation accessibility, and facility availability. Appropriate population density ensures urban vitality; functional diversity refers to the mix of land uses, commercial types, and social activities; and high-quality design enhances visual appeal and usability. Furthermore, the accessibility of the station area to the broader urban context—through both physical connections and service coverage—affects whether people choose it as a destination for travel or daily activity. Relevant data are collected from urban planning documents, remote sensing imagery, and POI data from Gaode Maps.
Table 4 presents the built environment variables of station areas.
3. Methods
3.1. Multiple Linear Regression Model
The Multiple Linear Regression (MLR) model is a widely used statistical method that explores the linear relationship between a dependent variable and multiple independent variables. In the context of high-speed rail (HSR) station area studies, this model is particularly valuable as it enables the quantification of how various factors—such as urban scale, transportation networks, and station-area facilities—affect spatial vitality.
Specifically, the MLR model helps identify and evaluate the extent to which different urban and infrastructural variables contribute to population agglomeration and activity intensity around HSR stations. By establishing a regression equation, it becomes possible to measure the individual contribution of each explanatory variable to the dependent variable and to reveal the underlying mechanisms that drive spatial dynamics.
In this study, we adopt the MLR approach by using spatial vitality in HSR station areas as the dependent variable and a set of multidimensional influencing factors as independent variables. The basic mathematical form of the model is expressed as follows:
$$\text{Vitality} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + \varepsilon$$
In this equation, Vitality (check-ins/km²) denotes the dependent variable, representing the intensity of human activity per unit area within the 1500 m buffer zone of each HSR station during its operational period. The variables X1, X2, …, Xn are the independent variables, corresponding to influencing factors such as transportation accessibility, functional diversity, and station-area infrastructure. β0 is the intercept term, and β1, β2, …, βn are the regression coefficients, each representing the marginal effect of a corresponding independent variable on the dependent variable. The term ε captures the random error, accounting for unexplained variability in the model.
Each regression coefficient has a clear interpretive meaning. For instance, the coefficient β1 indicates the magnitude and direction of the change in Vitality when the corresponding factor X1 increases by one unit, holding all other variables constant. This allows for a detailed understanding of the relative importance of different influencing factors.
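As a minimal sketch of how such coefficients are estimated, the following solves the ordinary least squares normal equations in plain Python. The data are synthetic and noise-free, generated from known coefficients so that the fit can be checked; the function names are illustrative, and the study's own estimation software is not specified.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(X, y):
    """Return [beta0, beta1, ..., betan] minimising the squared error."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic data generated from Vitality = 2 + 3*X1 - 1*X2 (no noise).
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [2 + 3 * x1 - 1 * x2 for x1, x2 in X]
beta = fit_ols(X, y)
print([round(b, 6) for b in beta])
```

With real data the recovered coefficients would not match any "true" values exactly, and standard errors and p-values would be needed before interpreting them; a statistics package is the practical choice for that.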
3.2. LightGBM Model
Unlike traditional multiple linear regression models, LightGBM (Light Gradient Boosting Machine) is an ensemble learning method based on the gradient boosting framework. It builds multiple decision trees to model data and iteratively optimizes the objective function to enhance prediction accuracy. As a result, LightGBM demonstrates significant advantages in handling complex nonlinear relationships and large-scale datasets, making it an effective tool for uncovering hidden patterns and influential features in data.
The core algorithm underlying LightGBM is the Gradient Boosting Decision Tree (GBDT). The main idea is to construct a strong predictive model by sequentially training a series of weak learners, typically shallow decision trees, and minimizing the loss function in each iteration.
The general form of the loss function is defined as
$$L = \sum_{i} l(y_i, \hat{y}_i) + \Omega(f)$$
In this expression, L represents the total loss, yi is the actual value of the i-th sample, ŷi is the corresponding predicted value, l(·) denotes the loss function for an individual sample (e.g., squared error or log loss), and Ω(f) is a regularization term that controls the complexity of the model, such as tree depth or the number of leaf nodes.
During each iteration, LightGBM leverages both the first-order gradient (i.e., the derivative of the loss function with respect to the prediction) and the second-order gradient (i.e., the second derivative) to update the model. This use of gradient information enables LightGBM to improve computational efficiency and convergence speed while maintaining high prediction accuracy.
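To make the role of the two derivatives concrete, the sketch below computes them for a squared-error loss and applies the resulting Newton-style update to a single leaf. It is a simplified illustration under stated assumptions: the loss is taken as 0.5·(y − ŷ)², all samples are placed in one leaf, no learning-rate shrinkage is applied, and `reg_lambda` stands for an L2 penalty on the leaf value.

```python
def squared_error_grad_hess(y_true, y_pred):
    """First and second derivatives of 0.5*(y - y_hat)^2 w.r.t. y_hat."""
    grads = [p - t for t, p in zip(y_true, y_pred)]
    hess = [1.0 for _ in y_true]
    return grads, hess


def optimal_leaf_value(grads, hess, reg_lambda=0.0):
    """Newton step for one leaf: -sum(g) / (sum(h) + lambda)."""
    return -sum(grads) / (sum(hess) + reg_lambda)


y_true = [3.0, 5.0, 7.0]
y_pred = [4.0, 4.0, 4.0]            # current ensemble prediction
g, h = squared_error_grad_hess(y_true, y_pred)
step = optimal_leaf_value(g, h)     # all samples fall in a single leaf
updated = [p + step for p in y_pred]
print(step, updated)
```

For squared error without regularization, the update moves the prediction to the mean of the targets in the leaf, which is why a single step suffices in this toy case. With other losses the second derivative varies across samples and the step is only an approximation.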
To further improve training speed and reduce memory consumption, LightGBM introduces a histogram-based algorithm to construct decision trees (Figure 3). Instead of evaluating every possible split point, LightGBM first buckets continuous feature values into discrete bins (i.e., histograms). Then it finds the optimal split based on these bins, significantly reducing the number of comparisons needed during training.
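The idea can be sketched for a single feature: gradients are summed per bin, and only the bin boundaries are scanned as candidate thresholds. This is a simplified illustration, not LightGBM's implementation; it assumes equal-width bins, a standard second-order gain formula with an L2 penalty `reg_lambda`, and synthetic gradients.

```python
def build_histogram(values, grads, hess, n_bins):
    """Bucket one feature into equal-width bins and sum g and h per bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # guard against a constant feature
    g_sum = [0.0] * n_bins
    h_sum = [0.0] * n_bins
    for v, g, h in zip(values, grads, hess):
        b = min(int((v - lo) / width), n_bins - 1)
        g_sum[b] += g
        h_sum[b] += h
    edges = [lo + width * (i + 1) for i in range(n_bins)]  # upper bin edges
    return g_sum, h_sum, edges


def best_split(g_sum, h_sum, edges, reg_lambda=1.0):
    """Scan bin boundaries; return (gain, threshold) of the best split."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    g_total, h_total = sum(g_sum), sum(h_sum)
    best = (0.0, None)
    g_left = h_left = 0.0
    for i in range(len(g_sum) - 1):
        g_left += g_sum[i]
        h_left += h_sum[i]
        g_right, h_right = g_total - g_left, h_total - h_left
        if h_left == 0 or h_right == 0:
            continue  # one side would be empty
        gain = (score(g_left, h_left) + score(g_right, h_right)
                - score(g_total, h_total))
        if gain > best[0]:
            best = (gain, edges[i])
    return best


# Synthetic feature with two clearly separated groups of targets.
values = [1, 2, 3, 4, 10, 11, 12, 13]
targets = [1, 1, 1, 1, 5, 5, 5, 5]
grads = [0.0 - t for t in targets]   # squared-error gradient at prediction 0
hess = [1.0] * len(values)
g_sum, h_sum, edges = build_histogram(values, grads, hess, n_bins=4)
gain, threshold = best_split(g_sum, h_sum, edges)
print(round(gain, 3), threshold)
```

Here only three boundaries are evaluated instead of seven distinct feature values; on large datasets with many unique values the saving is correspondingly larger.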
3.3. Shapley Additive Explanation (SHAP)
Although LightGBM and other machine learning methods can produce highly accurate predictions, their “black-box” nature often makes it difficult to understand how the models arrive at specific decisions. To address this issue, SHAP (Shapley Additive Explanations) was developed as a model-agnostic interpretability framework that offers transparent explanations for complex machine learning models.
The SHAP model is theoretically grounded in the Shapley value from cooperative game theory. It quantifies the contribution of each feature to the prediction by computing its marginal effect across all possible feature combinations. Therefore, SHAP not only maintains the predictive power of advanced models but also reveals the underlying mechanisms driving their outputs.
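The Shapley value of feature i is defined as
$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!} \left[ f(S \cup \{i\}) - f(S) \right]$$
In this equation, φi denotes the Shapley value of feature i, representing its contribution to the model’s prediction. N is the set of all features, and S is any subset that does not include feature i. The function f(S) refers to the model output using only the features in subset S, while f(S ∪ {i}) refers to the output when feature i is added to the subset.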
By averaging the marginal contributions over all possible subsets, the Shapley value provides a fair and comprehensive measure of feature importance. As a result, SHAP serves as a powerful tool for enhancing the interpretability of machine learning models without sacrificing predictive accuracy.
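The averaging over subsets can be made concrete with a brute-force sketch that enumerates every subset for a toy value function. The value function below is hypothetical (two features with an interaction, plus one irrelevant feature); practical SHAP implementations for tree models avoid this exponential enumeration, so this is an illustration of the definition rather than of how SHAP values were computed in the study.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating all subsets not containing i."""
    features = list(range(n_features))
    phi = [0.0] * n_features
    for i in features:
        others = [j for j in features if j != i]
        for size in range(len(others) + 1):
            # Weight |S|! (|N| - |S| - 1)! / |N|! for subsets of this size.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def toy_model(s):
    """Hypothetical model output given the set of available features."""
    out = 0.0
    if 0 in s:
        out += 4.0
    if 1 in s:
        out += 2.0
    if 0 in s and 1 in s:
        out += 2.0   # interaction between features 0 and 1
    return out       # feature 2 never affects the output


phi = shapley_values(toy_model, 3)
print([round(p, 6) for p in phi])
```

Two properties are visible in this toy case: the interaction term is shared equally between the two interacting features, and the values sum to the difference between the full-model output and the empty-set output.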
3.4. Model Selection and Comparative Justification
Given the heterogeneous and high-dimensional nature of the data used in this study—including structured socioeconomic indicators, spatial POI distributions, and user-generated check-in behaviors—we adopt a dual-modeling strategy using both Multiple Linear Regression (MLR) and Light Gradient Boosting Machine (LightGBM).
The MLR model provides robust interpretability and is well suited for structured variables with assumed linear effects. It supports inference of marginal impacts and is ideal for testing policy-relevant hypotheses. However, it falls short in capturing complex nonlinearity and variable interactions.
LightGBM, on the other hand, excels in handling nonlinearities, feature interactions, and sparse behavioral data—common in social sensing applications. It is especially valuable for uncovering threshold effects and high-order dependencies, albeit at the cost of interpretability. This is addressed by incorporating SHAP.
Therefore, the combined use of MLR and LightGBM allows us to bridge explanatory clarity and analytical depth, leading to a more comprehensive understanding of what drives spatial vitality in HSR station areas.
4. Results
4.1. Spatiotemporal Variations in the Vitality of HSR Station Areas
From the cumulative check-in bar chart, it is evident that human activity in HSR station areas exhibits a significant spatial clustering pattern, particularly concentrated in economically developed metropolitan regions (Figure 4). For instance, Shanghai Hongqiao Station, Futian Station, Beijing Station, and Shenzhen Station rank among the top nationwide in terms of cumulative check-ins, reaching 327,000, 224,000, 183,000, and 151,000 check-ins, respectively. Notably, Shanghai Hongqiao Station recorded the highest number of check-ins, highlighting its strong attractiveness as a core transportation hub.
In addition, stations such as Jinan Station (284,000), Guangzhou East Station (134,000), Zhengzhou Station (123,000), and Shanghai Station (113,000) also recorded high levels of activity. These stations are typically located near city centers and benefit from convenient transportation links, which effectively attract large crowds. It is also worth noting that popular tourist destinations like Xiamen Station have drawn substantial numbers of active users, indicating that HSR station areas also possess strong tourism appeal.
Figure 4 illustrates the distribution of cumulative check-ins across different HSR station areas. Based on functional attributes, the active spaces surrounding the stations can be categorized into commercial service space, business space, public space, leisure space, transportation space, residential space, green space, industrial space, and administrative space. The activity levels in each space type were analyzed (Figure 5). Among them, commercial service spaces exhibited the highest level of activity, with approximately 1.37 million check-ins, reflecting their strong commercial attractiveness. Transportation and residential spaces followed, with 610,000 and 600,000 check-ins, respectively. This pattern not only underscores the critical role of HSR stations as transportation hubs but also demonstrates the effective development of residential functions in adjacent areas.
Furthermore, public and green spaces recorded 420,000 and 160,000 check-ins, respectively, showing a certain degree of vitality. This suggests that in addition to transportation and commercial services, HSR station areas also provide public amenities and green leisure environments that meet the diverse needs of residents and travelers.
According to the spatial distribution of check-in data, the activity patterns of populations in HSR station areas can be generally categorized into three types: balanced distribution, fan-shaped clustering, and star-shaped dispersion (Figure 6). First, stations with a balanced distribution are typically located near city centers. These areas tend to develop evenly, with commercial and business functions highly concentrated, which generates strong attractiveness and leads to relatively uniform population aggregation. Typical examples include Beijing Station, Tianjin Station, Shanghai Station, Futian Station, Guangzhou East Station, Chengdu South Station, and Zhengzhou Station. These stations represent the typical development pattern of central urban HSR stations.
Second, fan-shaped clustering occurs in HSR stations that expand along existing urban development axes. Influenced by the built environment, these station areas extend in a directional pattern that mirrors urban spatial growth. Stations such as Chongqing North, Hangzhou South, Harbin West, and Zhengzhou East fall into this category, and their development aligns closely with the city’s expansion trajectory.
Lastly, star-shaped dispersion is observed in HSR stations located farther from urban centers. These stations often feature undeveloped land or lack adequate supporting facilities, resulting in weaker integration with the city and a more scattered pattern of population activity. Despite their current lower levels of aggregation, these stations hold the potential to become key nodes in future urban development. Representative examples include Guangzhou South, Xi’an North, Chongqing West, Guiyang North, Kunming South, and Changchun West stations.
4.2. Results of Linear Regression
In this study, all explanatory variables were subjected to a multicollinearity test. The Variance Inflation Factors (VIFs) for all variables were below 5, indicating that multicollinearity is not a serious concern and that the coefficient estimates are not unduly inflated by correlated predictors. The MLR model achieved an R-squared value of 0.3975 (that is, it explains roughly 40% of the variance in vitality), a Root Mean Squared Error (RMSE) of 5.014, and a Mean Absolute Error (MAE) of 3.712.
Table 5 presents the results of the MLR model.
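For reference, the three goodness-of-fit measures reported for the MLR model can be computed as in the following sketch. The observed and predicted values are made up for illustration and are unrelated to the study's data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R-squared, RMSE, and MAE for paired observations."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual SS
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total SS
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


# Made-up values, for illustration only.
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [3.0, 3.0, 7.0, 7.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

RMSE and MAE are expressed in the units of the dependent variable, so they are only comparable across models fitted to the same target; R-squared is unitless.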
Several factors related to urban support capacity exhibited significant effects on the spatial clustering of active populations. Specifically, both the population size of the city (C1) and the level of urbanization (C7) were strongly correlated with the degree of population aggregation, showing statistically significant positive relationships. The regression coefficient for city population size (C1) was 7.0798 with a p-value of 0.022, suggesting that larger urban populations tend to attract more active individuals to gather around HSR station areas. Similarly, the urbanization level (C7) had a coefficient of 8.0828 and a p-value of 0.03, indicating that more urbanized environments promote higher levels of population clustering.
Regarding station-level characteristics, the station location index (S11) had a significant negative effect on population aggregation, with a coefficient of −9.5416 and a p-value of 0.001. This result implies that stations with lower location index values—i.e., more favorable or central locations—tend to attract higher levels of population activity. In contrast, the number of originating and terminating train services (S15) positively influenced population clustering, with a coefficient of 6.5358 and a p-value of 0.033, demonstrating that increased service frequency enhances the station’s ability to attract people. Additionally, the number of bus stops near the station (S24) also had a statistically significant positive effect, with a coefficient of 12.4662 and a p-value of 0.001, underscoring the importance of public transportation connectivity in reinforcing population gathering.
Furthermore, several elements of the built environment around the station area were found to significantly influence the level of population activity. The area designated for commercial service facilities (E6) had a positive and significant impact, with a coefficient of 9.9389 and a p-value of 0.018. This suggests that expanding commercial land use around stations can substantially boost their attractiveness and crowd-gathering capacity. Similarly, the number of parking facilities (E22) had a positive and significant effect, with a coefficient of 8.7181 and a p-value of 0.009, indicating that adequate parking infrastructure supports both accessibility and functional completeness, thereby contributing to greater population clustering.
These findings align with urban economic theory, which suggests that larger urban populations create stronger consumption bases and attract higher levels of pedestrian activity. The positive effect of commercial land use reflects the principle of agglomeration economies, where mixed-use developments increase land efficiency and social interaction.
In contrast, the negative effect of the station location index suggests that stations located in peripheral or poorly integrated zones suffer from low spatial synergy, as predicted by urban spatial mismatch theory. Accessibility factors, such as the number of bus stops, support the notion that multimodal connections enhance last-mile connectivity, a cornerstone of transit-oriented development (TOD).
4.3. Results of the LightGBM Regression Model and SHAP Analysis
According to the convergence evaluation plot of the active population prediction model, the model’s loss value showed a significant decrease during the initial phase of training (
Figure 7). In particular, within the first 50 iterations, the loss dropped sharply from a high value near 7.0 to approximately 5.75. This indicates that the optimization algorithm quickly adjusted the hyperparameters in the early training stage and substantially reduced the prediction error. As the number of iterations increased, the rate of loss reduction gradually slowed, and the curve began to flatten. After around 250 iterations, the loss value stabilized at approximately 5.2, suggesting that further training did not lead to noticeable improvement, and the model reached convergence with stable performance.
Based on the hyperparameter tuning results, a scatterplot matrix was used to visualize the interactions between parameters and their impact on model performance (
Figure 8). The selected optimal parameter configuration achieved a well-balanced trade-off between model fitting and generalization. Specifically, the learning rate was set to 0.01 to ensure stable optimization and reduce the risk of overfitting. The number of trees was set to 1041 to improve accuracy without introducing excessive computational burden. The maximum tree depth was limited to 10, effectively controlling model complexity and enhancing generalization capability. Each tree was allowed a maximum of 15 leaf nodes, increasing fitting capacity while mitigating overfitting. The subsample ratio was set at 0.648, helping introduce randomness and reduce overfitting. Additionally, the feature subset ratio was set to 0.624, which further enhanced model diversity. Lastly, the minimum number of samples per leaf was set to 10, ensuring node stability and robust decision splits. Collectively, these hyperparameter settings enabled the model to achieve strong and stable performance in predicting active population clustering. The LightGBM model achieved an R-squared value of 0.7347, a Root Mean Squared Error (RMSE) of 3.3273, and a Mean Absolute Error (MAE) of 1.9621.
Figure 9. Learning curve of the LightGBM model showing the loss trajectories for both the training and validation sets. As the number of iterations increases, both losses decrease and gradually stabilize, with no upward trend in validation loss—indicating good model convergence and no significant overfitting.
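For readers who wish to reproduce a comparable configuration, the reported settings can be collected in a single dictionary. The sketch below keys them by parameter names from LightGBM's scikit-learn interface; the mapping from the prose descriptions to these names is our assumption, and `check_params` is a hypothetical helper that only tests internal consistency.

```python
# Hyperparameter values as reported in the text. The parameter names follow
# LightGBM's scikit-learn interface; matching prose to names is an assumption.
reported_params = {
    "learning_rate": 0.01,      # learning rate
    "n_estimators": 1041,       # number of trees
    "max_depth": 10,            # maximum tree depth
    "num_leaves": 15,           # maximum leaf nodes per tree
    "subsample": 0.648,         # row subsample ratio
    "colsample_bytree": 0.624,  # feature subset ratio
    "min_child_samples": 10,    # minimum samples per leaf
}

def check_params(params):
    """Return a list of consistency problems (empty if none are found)."""
    problems = []
    # A binary tree of depth d can hold at most 2**d leaves.
    if params["num_leaves"] > 2 ** params["max_depth"]:
        problems.append("num_leaves exceeds what max_depth allows")
    for key in ("subsample", "colsample_bytree"):
        if not 0.0 < params[key] <= 1.0:
            problems.append(f"{key} must lie in (0, 1]")
    if params["learning_rate"] <= 0:
        problems.append("learning_rate must be positive")
    return problems

problems = check_params(reported_params)
```

Such a dictionary could be passed to `lightgbm.LGBMRegressor(**reported_params)`. Note that in LightGBM, row subsampling only takes effect when `subsample_freq` is set to a positive value, and the text does not report that setting.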
After training the model, SHAP analysis was conducted to interpret the contribution of each input variable to the prediction outcomes. The resulting variable importance ranking (
Figure 10) and SHAP value distribution plots (
Figure 11) revealed that the factors influencing active population clustering differed from those affecting passenger flow or dwellers. In particular, active population distribution more strongly reflected perceptions of spatial design, economic vitality, and local consumption potential in HSR station areas.
Among all features, the proportion of land used for commercial service facilities within the station area (E6) exhibited the highest SHAP value of +1.71, indicating it had the most substantial positive impact on active population clustering. In addition, the population size (C1) and urbanization level (C7) of the HSR station’s host city had SHAP values of +1.24 and +0.92, respectively, suggesting that both city scale and development level significantly promote human activity near the stations. The station location index (S11) also showed a notable SHAP value of +0.92, confirming that favorable geographic positioning and transit hub advantages play a key role in attracting people. Moreover, the number of bus stops near the station (S24) and the number of parking facilities in the station area (E22) had SHAP values of +0.83 and +0.82, respectively, underscoring the importance of well-developed transit infrastructure in enhancing population aggregation.
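A global SHAP ranking of this kind is commonly obtained by averaging the absolute SHAP values of each feature over all samples. The text does not state how the reported figures were aggregated, so the sketch below assumes that convention, and its numbers are hypothetical rather than taken from the study.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value (largest first)."""
    n = len(shap_rows)
    scores = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def mean_signed_shap(shap_rows, feature_index):
    """Average signed SHAP value of one feature, which keeps the direction."""
    return sum(row[feature_index] for row in shap_rows) / len(shap_rows)

# Hypothetical per-station SHAP values (rows: stations; columns: features).
feature_names = ["E6", "C1", "S11"]
shap_rows = [
    [2.0, 1.0, -1.5],
    [-1.0, 0.5, -0.5],
    [1.5, -1.5, 0.5],
    [-2.5, 1.0, -0.5],
]

ranking = mean_abs_shap(shap_rows, feature_names)
s11_signed = mean_signed_shap(shap_rows, feature_names.index("S11"))
```

Under this convention the importance score is non-negative by construction, so the direction of a feature's effect has to be read from the signed values or from the distribution plot, not from the ranking itself.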
4.4. Threshold Effects
The primary purpose of the univariate driving model is to examine the independent effect of a single variable on population clustering, thereby identifying and understanding the specific degree and pattern of influence each factor has on high-speed rail (HSR) station areas. Through the quantitative assessment of each variable, the model reveals how these factors operate under different conditions and how their effects may change across value ranges.
The results indicate a strong relationship between commercial facilities and population clustering in HSR station areas. Specifically, the partial dependence plot for the proportion of commercial service land use in the station area (E6) shows that when the share increases from a low baseline to around 12%, the predicted value of active population rises sharply. However, once the proportion exceeds 15%, the predicted value tends to stabilize (
Figure 12). This pattern reflects the classical economic principle of diminishing marginal returns: moderate commercial land allocation enhances attractiveness and pedestrian density, whereas excessive concentration may cause functional redundancy, spatial congestion, or underutilization, consistent with the “resource saturation” theory in urban land use planning. In other words, a moderate increase in commercial land use can significantly enhance the vibrancy of the station area by attracting more consumers and encouraging longer stays. In planning areas surrounding HSR stations, commercial, cultural, and office functions should therefore be strategically allocated to match actual demand, strengthening the overall attractiveness and functional diversity of the district.
In addition, the station location index (S11) exhibits a significant negative correlation with the predicted value of active population clustering (
Figure 13). The partial dependence plot shows that when S11 is low (indicating a superior spatial location within the city), the predicted population level is high. As S11 increases, the predicted value drops rapidly and then levels off, particularly after the value exceeds 0.3. This result highlights the critical role of spatial positioning in attracting human activity and is consistent with spatial mismatch theory, which holds that urban facilities located far from core areas often suffer from reduced accessibility and weaker socio-economic integration. Consequently, the planning and development of HSR station areas should prioritize their integration within the broader urban spatial framework. Strengthening connectivity with central urban zones and major functional areas, improving transport accessibility, and ensuring the availability of high-quality public services can collectively enhance the locational advantages of HSR stations, thereby promoting greater population clustering and urban vitality.
Furthermore, the number of bus stops (S24) near the station has a clear and positive effect on population activity (
Figure 14). According to the partial dependence plot, as the number of bus stops increases, the predicted value of active population rises notably—especially after the number surpasses 10, at which point the prediction exhibits a step-like increase and gradually stabilizes. This pattern confirms that a well-developed bus transfer system is a key factor in enhancing the vitality of HSR station areas. Increasing the number of bus stops not only improves transportation accessibility but also expands the service coverage of the station area, thus attracting more pedestrian flow and encouraging longer stays. Accordingly, future planning of station-area transit infrastructure should prioritize the density and coverage of the bus network, support multimodal integration, and promote the construction of efficient and convenient public transportation systems to better support human activity and sustainable station-area development.
These patterns reflect the classical economic concept of diminishing marginal returns, which states that beyond a certain threshold, the additional benefit of increasing an input (e.g., commercial land ratio or bus stop density) decreases.
From a theoretical standpoint, the observed turning points align with threshold theory, which emphasizes the presence of critical values beyond which system responses change nonlinearly. Additionally, the phenomenon is related to resource saturation theory, where excessive infrastructure or functional allocation may exceed local demand or spatial capacity, leading to diminishing efficiency or even congestion effects.
These findings suggest that urban planning around HSR stations must consider not only increasing supply but also optimizing it to avoid over-concentration and spatial redundancy.
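The threshold readings in this section rest on partial dependence curves. As a minimal sketch, a one-dimensional partial dependence curve and a simple plateau detector can be written as follows. The `toy_predict` function is a hypothetical stand-in for the fitted model, and the saturation rule (all later increments below a tolerance) is one possible convention, not necessarily the one used in this study.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction when one feature is forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature_index] = value  # overwrite only the target feature
            total += predict(modified)
        curve.append(total / len(rows))
    return curve

def saturation_point(grid, curve, tolerance):
    """First grid value after which every further step gains less than tolerance."""
    last = len(grid) - 1
    for i in range(last):
        if all(curve[j + 1] - curve[j] < tolerance for j in range(i, last)):
            return grid[i]
    return grid[-1]

# Toy stand-in for a fitted model (an assumption, not the paper's LightGBM model):
# the response to commercial land share is capped at 15%.
def toy_predict(row):
    commercial_share, bus_stops = row
    return 10.0 * min(commercial_share, 0.15) + 0.2 * bus_stops

# Hypothetical station records: [commercial land share, number of bus stops].
rows = [[0.05, 4], [0.10, 8], [0.20, 12], [0.30, 6]]
grid = [0.00, 0.05, 0.10, 0.15, 0.20, 0.25]

curve = partial_dependence(toy_predict, rows, 0, grid)
threshold = saturation_point(grid, curve, tolerance=1e-9)
```

In this toy case the curve rises up to a share of 0.15 and is flat afterwards, so the detector returns 0.15. With a real model the tolerance would need to be chosen relative to the noise in the curve.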
4.5. Interaction Effects
The main objective of the multi-factor interaction model is to assess the combined influence of multiple variables on population clustering. This model emphasizes the interaction and synergy among different variables, revealing how their joint effects can significantly enhance clustering outcomes. By analyzing the composite influence of variable combinations, the model helps identify which interactions are most impactful in shaping human activity around HSR station areas.
Figure 15 shows the interaction effects of influencing factors on spatial vitality in HSR station areas.
To quantify interaction effects, we used SHAP’s built-in interaction values, which separate each prediction into the main effect of individual features and the additional effect attributable to feature pairs. For any two features, the interaction value measures how much their combined influence differs from the sum of their separate effects. If the combined effect is stronger, this indicates a synergy, which we call a “synergy bonus.” These synergies are entirely data-driven: we did not assign any fixed values or assumptions, and the SHAP interaction values reflect the nonlinear relationships learned by the LightGBM model.
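To make the pairwise decomposition concrete, the sketch below evaluates the Shapley interaction index exactly for a small toy function. It is a simplification under stated assumptions: features absent from a coalition are replaced by a single baseline value, whereas tree-based SHAP uses expectations over the data, and `toy_model` is hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_interaction(value, n_features, i, j):
    """Shapley interaction index for the pair (i, j), split equally between
    (i, j) and (j, i), which is the convention of SHAP interaction values."""
    others = [k for k in range(n_features) if k not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        weight = (factorial(size) * factorial(n_features - size - 2)
                  / (2 * factorial(n_features - 1)))
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Joint effect minus the two separate effects, given coalition s.
            delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            total += weight * delta
    return total

instance = (1.0, 1.0, 1.0)  # hypothetical observation to explain
baseline = (0.0, 0.0, 0.0)  # hypothetical reference point

def toy_model(x):
    # Features 0 and 1 interact through the product term; feature 2 is additive.
    return 2.0 * x[0] + 3.0 * x[1] + 4.0 * x[0] * x[1] + x[2]

def value(coalition):
    """Model output with coalition features at instance values, others at baseline."""
    masked = [instance[k] if k in coalition else baseline[k] for k in range(3)]
    return toy_model(masked)

phi_01 = shapley_interaction(value, 3, 0, 1)
phi_02 = shapley_interaction(value, 3, 0, 2)
```

Here the product term contributes a synergy of 4.0 in total, which the index splits into 2.0 for each ordering of the pair, while the purely additive feature shows no interaction.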
The interaction plot reveals that the urbanization level of the HSR station’s host city (C7) and the proportion of commercial service land use in the station area (E6) jointly affect the predicted intensity of population activity (
Figure 16). The results show that when the urbanization level is high (above approximately 85%) and the proportion of commercial service land use exceeds 15%, the predicted value of population activity rises significantly, peaking at around 8.9. This indicates a strong clustering effect. Conversely, when either the urbanization level is low or the proportion of commercial land is insufficient, the predicted values remain at a relatively low level. This supports the “urban function–infrastructure synergy” theory, where a mature urban context provides sufficient population density and mobility demand, while commercial land facilitates consumption and social interaction.
This suggests a clear synergistic relationship between a city’s development level and the functional capacity of the station area. A high level of urbanization typically brings higher population density and better infrastructure, while commercial services offer spaces for consumption and social engagement. The combination of these two factors maximizes the vitality of HSR station areas. Therefore, improving station-area vibrancy depends not only on local planning and design but also on aligning with the broader developmental context of the city. In cities with high urbanization levels, special attention should be paid to enhancing functional diversity within the station area—particularly by increasing the supply and quality of commercial service facilities—to ensure efficient land use and foster a positive cycle of population clustering.
Another interaction plot shows the joint effect of the number of bus stops (S24) and the proportion of commercial service land use (E6) on population activity (
Figure 17). The plot reveals that when the number of bus stops is relatively high and the commercial land use ratio is moderate to high (between approximately 10% and 20%), the predicted value of population activity increases significantly, peaking at around 10.3. This demonstrates a strong positive synergy. In contrast, if the number of bus stops is low or the commercial land share is insufficient, the predicted activity level drops considerably. This aligns with transit-oriented development (TOD) principles: multimodal connectivity enhances last-mile accessibility, while commercial functions support destination attractiveness.
Additionally, the interaction between the station location index (S11) and the number of parking facilities (E22) also influences both population clustering and mobility within the station area. As shown in
Figure 18, when the station location index is low (approximately 0.1) and the number of parking facilities increases substantially (e.g., up to 400 spaces), the predicted value of population activity rises to 7.87, reflecting a pronounced clustering effect. A higher location index typically means the station is farther from the city center, leading to weaker transport connections. In such cases, sufficient parking becomes critical, especially for car users. This interaction illustrates a compensatory mechanism: parking facilities can mitigate the disadvantage of a poor location by reducing access costs, thus broadening the station’s functional catchment. This finding resonates with access–cost theory and supports the importance of integrated planning in peripheral station areas.
As the number of parking facilities increases, the area’s transportation support capacity improves, which in turn promotes population aggregation. Particularly when the station’s spatial location is less advantageous, optimizing parking infrastructure can help offset locational disadvantages and attract more users. The interaction between station location and parking capacity highlights the importance of coordinated planning. Enhancing transportation linkages and parking management can significantly improve accessibility and appeal, contributing to sustainable human activity and functional development in HSR station areas.
The observed synergies between urbanization and commercial land use suggest that spatial vitality is a product of both macro-level urban development and micro-level spatial design. This supports theories from urban systems and complexity science, which emphasize the interdependence between urban form and function. The enhanced activity observed at well-connected, well-equipped station areas validates the role of integrated planning in maximizing land value and urban accessibility.
4.6. Individual-Level Influences
Due to differences in land use types, development intensity, and spatial configuration patterns, high-speed rail (HSR) station areas exhibit diverse developmental characteristics within the urban spatial structure. Different types of station areas show distinct patterns in how they attract populations, which factors most influence clustering, and what development strategies are most effective. Therefore, analyzing individual station areas helps identify key influencing factors and enables the formulation of tailored planning approaches.
For HSR station areas dominated by residential functions, population clustering is primarily driven by the availability of lifestyle amenities, commercial services, and transportation accessibility (
Figure 19). For example, at Beijing South Station, several variables were found to positively influence population aggregation, including the number of commercial and leisure facilities within the station area (S27), general public budget revenue of the host city (C10), centrality within the urban network (C16), and the number of directly connected stations via one-transfer routes (S4). These factors collectively indicate that the station attracts people by offering robust commercial amenities and strong connectivity. However, certain constraints were also identified, such as limited sky openness (E16), high road density (C12), and a lower number of metro lines (S23), which may hinder further population clustering. Therefore, the development strategy for such residential-oriented station areas should focus on enhancing everyday service facilities, strengthening metro and bus integration, and improving land use efficiency to increase the convenience and livability of the area, thereby enhancing its overall attractiveness.
In commercially oriented station areas, such as Zhengzhou Station, population clustering is mainly influenced by the quality of commercial facilities, the radiation effect of the city center, and transportation connectivity (
Figure 20). Key promoting factors include urban centrality (C16), the number of station-based commercial and leisure facilities (S27), and the number of directly connected stations (S4). These variables demonstrate that Zhengzhou Station effectively attracts passengers and dwellers through its commercial vitality and strong transport hub functions, enhancing its clustering effect (
Figure 20). However, several limiting factors were also identified, including general public budget revenue (C10), sky openness (E16), road network density (C12), number of road intersections (E13), and the number of metro lines (S23). These constraints indicate that although Zhengzhou Station performs well in commercial and transportation dimensions, it still has room for improvement in environmental comfort, traffic network optimization, and vertical spatial design. To further promote clustering effects in this station area, it is recommended to integrate surrounding resources through comprehensive and future-oriented planning and renewal efforts.
Overall, these station-level differences reaffirm the necessity of differentiated planning strategies. Different types of HSR station areas exhibit distinct spatial characteristics and development potential, and residential-, transport-, commercial-, and public service-oriented station areas each have their own advantages and challenges. Development strategies should therefore be tailored to the station’s functional orientation, geographic context, infrastructure quality, and socioeconomic background. For each station type, enhancing infrastructure, optimizing spatial layout, and reinforcing inter-functional connectivity are key strategies for fostering population aggregation and achieving sustainable development of HSR station areas.
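Station-level readings of this kind rely on the additive property of SHAP: a single prediction equals a base value plus the sum of the feature contributions, so positive contributions can be listed as promoting factors and negative ones as constraints. The sketch below assumes that this is how the individual plots were read. The variable codes come from the text, but the numbers are hypothetical.

```python
def split_contributions(base_value, contributions):
    """Rebuild one prediction and separate promoting from constraining factors."""
    promoters = sorted(
        ((name, v) for name, v in contributions.items() if v > 0),
        key=lambda item: -item[1],
    )
    constraints = sorted(
        ((name, v) for name, v in contributions.items() if v < 0),
        key=lambda item: item[1],
    )
    prediction = base_value + sum(contributions.values())
    return prediction, promoters, constraints

# Hypothetical SHAP contributions for one station (illustrative values only).
base_value = 5.0
station = {"S27": 1.5, "C16": 1.0, "S4": 0.5, "E16": -0.75, "S23": -0.25}

prediction, promoters, constraints = split_contributions(base_value, station)
```

With these illustrative values the prediction is 7.0, with S27, C16, and S4 acting as promoters and E16 and S23 as constraints, ordered by magnitude.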
5. Discussion
5.1. Rethinking Station Area Vitality Through Nonlinear Urban Dynamics
This study’s findings challenge the conventional linear assumptions in urban spatial analysis and contribute to a more nuanced understanding of how high-speed rail (HSR) station areas evolve. The revealed nonlinear relationships—particularly the saturation effects of commercial land and bus stop density—indicate that vitality does not increase indefinitely with infrastructural inputs. This echoes recent shifts in urban theory emphasizing complex systems thinking, where urban space responds to incremental interventions in nonlinear, often threshold-dependent ways.
In comparison with existing literature, which primarily views commercial land and transit accessibility as linear enablers of urban vitality, our results add empirical weight to the idea that “more” is not always “better.” The discovery of diminishing marginal returns in facility input highlights the importance of optimal functional balance in station-area development. From a theoretical standpoint, this finding extends the urban spatial equilibrium framework by introducing nonlinear feedback effects triggered by over-saturation or spatial crowding.
5.2. Spatial Thresholds and the Redefinition of Planning Efficiency
The identification of clear spatial thresholds—such as the optimal commercial land ratio (10–15%) and ideal bus stop density—offers a critical redefinition of what constitutes planning efficiency in station area contexts. Rather than maximizing infrastructure, planners should aim for functionally calibrated interventions based on local capacity and contextual needs. This perspective moves beyond traditional engineering-driven logic and embraces performance-based spatial planning.
Moreover, these thresholds provide actionable tools for resource prioritization in cities facing financial or spatial constraints. For instance, peripheral stations may not benefit from excessive commercial zoning or transit lines, whereas moderate-scale, context-sensitive development could yield more sustainable outcomes. This insight has practical value for transit-oriented development (TOD), supporting phased, evidence-based investment models that align with evolving population flows and spatial usage.
5.3. Theoretical and Policy Implications for Node–Place Synergy
The findings reinforce and refine the “node–place” framework by revealing that spatial vitality emerges not from the node (HSR infrastructure) or the place (urban environment) alone, but from their synergy, particularly when modulated by city-level conditions (e.g., urbanization rate, centrality). The interaction effects between macro-structural variables (e.g., C7, urbanization) and micro-spatial attributes (e.g., E6, commercial land) suggest that infrastructure performance is contingent upon urban context compatibility.
This insight has clear policy relevance. For national and provincial planning bodies, it calls for differentiated station development models based on regional typologies. For instance, stations in highly urbanized zones should prioritize function mixing and pedestrian-centric design, while those in transition zones may benefit from mobility-enhancing infrastructure like park-and-ride systems. Furthermore, integrating social sensing data (e.g., Weibo check-ins) into urban monitoring can facilitate adaptive governance and promote real-time evaluation frameworks for major transport infrastructure.
To further contextualize the findings, we compare two representative cases: Chengdu and Beijing. Chengdu, as a rapidly expanding city, features HSR stations like Chengdu East and Chengdu South that are still in the process of urban integration. Here, planning should emphasize balanced commercial land development, robust multimodal access, and staged infrastructure investment to prevent premature saturation. In contrast, mature hubs such as Beijing South or Beijing West require strategies that alleviate congestion, optimize transfer systems, and elevate service quality. These differentiated approaches underscore that the node–place synergy must be adapted to each city’s growth stage and spatial characteristics, reaffirming the importance of localized, context-aware planning in HSR station area development.
6. Conclusions
This study systematically examined the spatial vitality of high-speed rail (HSR) station areas in China by integrating multi-source data and applying both statistical and machine learning methods. Based on a city–node–place framework, we evaluated 66 HSR station areas across 35 cities using Sina Weibo check-in data, urban support capacity, station attributes, and built environment indicators.
Importantly, this study considered both the broader urban context in which each station is embedded and the detailed environmental features within the surrounding area. City-level factors such as population size, urbanization level, and transit network configuration form the underlying support conditions, while local-scale features—including commercial land use ratio, bus stop density, and parking facilities—shape the functional performance of each station area. Our use of a 1500 m buffer allowed for consistent comparison across cities, capturing meaningful differences in how both urban structure and immediate surroundings influence vitality.
The results show that city population size, urbanization level, commercial land use ratio, transit accessibility, and parking facilities are key positive contributors to station area vitality. SHAP analysis revealed nonlinear threshold effects and interaction relationships—such as diminishing returns in commercial land use and bus stop density, and strong synergies between urban development and functional configuration.
These findings have several policy implications. First, land use and transit planning can be integrated to enhance functional synergy around HSR stations. Second, moderate commercial land allocation (around 10–15%) is most effective in attracting population activity. Third, improving multimodal transport infrastructure—including bus stops and parking—can strengthen last-mile connectivity. Fourth, tailored development strategies are needed for different station types (e.g., residential vs. commercial). Lastly, data-driven evaluation tools like social media analytics and interpretable machine learning can support real-time monitoring and adaptive policy formulation.
Nonetheless, this study has some limitations. The use of Sina Weibo check-in data, while providing high spatial-temporal granularity, may introduce demographic and geographic biases. Younger users and residents of more developed cities are overrepresented, while the behavior of older adults and residents in less developed or peripheral areas may be underrepresented. These biases could influence the comprehensiveness of vitality assessment. Future research should further integrate diverse datasets—such as mobile phone signaling, smart card records, and public Wi-Fi logs—to improve representativeness. Additionally, the present study focuses on a single temporal snapshot. Future studies could explore how spatial vitality evolves over time using longitudinal data sources. Such efforts may reveal seasonal rhythms, lifecycle stages, or long-term shifts in station activity patterns, thereby enriching the understanding of station area development dynamics. Overall, this study offers empirical insights and methodological innovations to guide the sustainable, accessible, and vibrant development of HSR station areas in China and beyond.