This study integrates spatial feature extraction, cluster analysis, machine learning modeling, and interpretive analysis to develop a mathematical modeling framework for predicting subway station locations. First, 28 spatial variables, such as nighttime light intensity, population density, and road network structure, are extracted to construct a gridded feature matrix. Cluster analysis is then conducted using passenger flow data from existing stations to extract optimal positive samples. Second, the optimal positive samples are combined with a random forest model to identify key site selection drivers. Variables with a feature contribution below 1% are excluded, resulting in a refined set of 20 features. These variables are then input into a LightGBM model for regression analysis, and SHAP values are used to interpret feature contributions, revealing how key factors influence site selection suitability. Finally, spatial predictions and visualizations of site suitability are generated at the city scale, providing scientific support and a quantitative decision-making basis for subway station planning. The technical workflow of this study is illustrated in Figure 2.
3.1. Selection of Influencing Factors for Rail Transit Station Location
This study selects key factors influencing rail transit station site selection across six dimensions: socioeconomic conditions, urban vitality, land use, transportation infrastructure density, regional road network efficiency, and natural environment. The selection of these factors is informed by a synthesis of existing indicator systems in the relevant literature, which serves as the foundation for constructing the feature matrix. This approach allows indicators to be constructed flexibly according to specific research needs; they need not replicate the original indicators as long as they comply with relevant standards [27]. Therefore, this study uses the existing indicator systems as a reference framework for constructing the feature system. Furthermore, this study does not pursue a large number of indicators but instead ensures that each essential factor influencing site selection is represented by at least one measurable indicator. The specific factors corresponding to each dimension are presented in Table 2, along with the respective references for each.
- (1) Socioeconomic
Data for the socioeconomic dimension were obtained by extracting point-of-interest (POI) information via the Amap API. Based on Amap’s classification standards [37], the POI data were categorized into 12 types, including catering services, daily life services, and scenic spot services. The density of each type serves as one indicator, resulting in 12 indicators representing the socioeconomic dimension (Figure 3).
- (2) Urban vitality
Urban vitality reflects the intensity of human activity and social interaction. In previous studies, nighttime light data and population density have been widely adopted as proxies for measuring urban vitality. Accordingly, this study adopts these two indicators as feature variables representing urban vitality (Figure 4).
- (3) Land use
The land use dimension primarily includes two indicators: land use mix and building density (Figure 5). Land use mix is calculated from the distribution of POI types, as defined in Formula (1). Building density is defined as the ratio of the total floor area of all buildings within a given grid cell to the total area of that cell, as shown in Formula (2).

$$M_i = -\sum_{j=1}^{n_i} p_{ij} \ln p_{ij} \tag{1}$$

where $M_i$ denotes the land use mix within a specific grid cell, $n_i$ represents the total number of POI types in grid $i$, and $p_{ij}$ denotes the proportion of POI type $j$ relative to all POI types in the grid.

$$D = \frac{\sum_{i=1}^{n} A_i}{A_{\mathrm{grid}}} \tag{2}$$

where $A_i$ represents the area of the $i$-th building (in square meters, m²), $n$ denotes the number of buildings within a specific grid cell, and $A_{\mathrm{grid}}$ refers to the area of a single grid cell (in square meters, m²).
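The two land use indicators can be illustrated with a minimal sketch. It assumes that the land use mix takes the common Shannon entropy form over POI-type proportions (the exact form of Formula (1) may differ, for example by a normalization term) and that building density is the summed building area divided by the cell area. The function names and the sample cell are hypothetical.

```python
import math
from collections import Counter


def land_use_mix(poi_types):
    """Shannon entropy of POI-type proportions in one grid cell (assumed form)."""
    counts = Counter(poi_types)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # a cell without POIs has no mix
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def building_density(building_areas_m2, cell_area_m2):
    """Summed building area divided by the area of the grid cell."""
    return sum(building_areas_m2) / cell_area_m2


# Hypothetical 100 m x 100 m cell: four POIs of two types, three buildings.
mix = land_use_mix(["catering", "catering", "shopping", "shopping"])
density = building_density([1200.0, 800.0, 500.0], 100.0 * 100.0)
```

Two equally represented POI types give a mix of ln 2, and a cell with a single POI type gives 0, which matches the intuition that the indicator rewards functional diversity.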
- (4) Transportation facility supply density
The transportation facility supply density dimension characterizes the quantity, spatial distribution, and network coverage of transportation facilities surrounding each candidate site. It includes five indicators: intersection density, distance to the city center, bus stop density, road network density, and transportation service facility density (Figure 6). Among them, intersection density, distance to the city center, bus stop density, and road network density were calculated using spatial data provided by the Kunming City Surveying and Mapping Institute and processed in ArcGIS 10.8.1. In contrast, transportation service facility density was derived from Amap’s classification system and computed using kernel density estimation of POI data.
- (5) Efficiency of regional road network structure
The spatial efficiency of the regional road network was evaluated based on the topological efficiency of the road network, focusing on the spatial position of each candidate site within the urban road system and its influence on patterns and intensities of pedestrian flow, vehicle flow, visibility, and urban activities. The rail transit network was conceptualized as a graph in which each station corresponds to a node and the links between stations represent edges. Based on GIS data and spatial network theory, adjacency relationships between nodes are established. Indicators in this dimension are computed using Depthmap, based on space syntax theory and road network geometric data (line segments and nodes), and include three metrics: selectivity, connectivity, and integration (Figure 7). The corresponding calculation formulas are presented in Equations (3)–(5):

$$S_i = \sum_{j \neq i \neq k} \frac{\sigma_{jk}(i)}{\sigma_{jk}} \tag{3}$$

where $\sigma_{jk}(i)$ is the number of shortest paths from node $j$ to node $k$ that pass through node $i$, and $\sigma_{jk}$ denotes the total number of shortest paths between node $j$ and node $k$.

$$C_i = \left| \Gamma(i) \right| \tag{4}$$

where $C_i$ represents the connectivity of node $i$, and $\Gamma(i)$ denotes the set of nodes adjacent to node $i$.

$$I_i = \frac{N - 1}{\sum_{j \neq i} d_{ij}} \tag{5}$$

where $I_i$ represents the integration value of node $i$, $N$ denotes the total number of nodes, and $d_{ij}$ is the distance between nodes $i$ and $j$.
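The three network metrics can be sketched on a toy graph. This sketch does not reproduce Depthmap; it assumes a connected, unweighted graph, measures distance in hops, counts each unordered node pair once for selectivity, and takes integration as (N − 1) divided by the summed distance to all other nodes. The function names and the four-node graph are hypothetical.

```python
from collections import deque


def bfs(adj, source):
    """Hop distances and numbers of shortest paths from source to every node."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:            # first time v is reached
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:   # u precedes v on a shortest path
                sigma[v] += sigma[u]
    return dist, sigma


def network_metrics(adj):
    """Selectivity, connectivity, and integration for each node."""
    nodes = list(adj)
    n = len(nodes)
    dist, sigma = {}, {}
    for s in nodes:
        dist[s], sigma[s] = bfs(adj, s)
    result = {}
    for i in nodes:
        selectivity = 0.0
        for a, j in enumerate(nodes):
            for k in nodes[a + 1:]:
                if i in (j, k):
                    continue
                # i lies on a shortest j-k path only if the distances add up.
                if dist[j][i] + dist[i][k] == dist[j][k]:
                    selectivity += sigma[j][i] * sigma[i][k] / sigma[j][k]
        result[i] = {
            "selectivity": selectivity,
            "connectivity": len(adj[i]),
            "integration": (n - 1) / sum(dist[i][j] for j in nodes if j != i),
        }
    return result


# Hypothetical star-shaped network: B is the hub, A, C, and D are end points.
star = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B"], "D": ["B"]}
metrics = network_metrics(star)
```

In the star network, the hub B carries every shortest path between the other nodes, so it has the highest value on all three metrics, while the end points have zero selectivity.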
- (6) Natural environment
The selection of urban rail transit station locations in mountainous cities must adapt to the local topography. Kunming, as a representative mountainous city, features highly undulating terrain and complex geological conditions, so natural environmental factors serve as fundamental constraints in station site selection. Accordingly, the natural environment dimension in this study considers four key factors (Figure 8): elevation from a Digital Elevation Model (DEM), slope gradient, slope aspect, and land surface temperature (LST). LST was derived on the Google Earth Engine (GEE) platform through remote sensing inversion, using Kunming’s administrative boundary vector data and thermal infrared imagery (Band 10) from Landsat 9. The LST processing workflow involves cloud masking, emissivity correction, brightness temperature conversion, median compositing of annual summer images (2022–2024), gap filling, and spatial clipping. This process produces a stable surface temperature distribution map, which is used to evaluate the influence of high-temperature zones on site comfort and the potential for pedestrian traffic aggregation.
In integrating multi-source spatial data, this study first standardized the coordinate systems of all vector and raster datasets to CGCS2000. Vector data (e.g., road networks, subway stations, and subway lines) underwent spatial topology checks and corrections based on three rules: no overlapping points, no dangling lines, and no gaps in polygon features. Topological errors were batch-identified using Check Geometry (ArcGIS) and shapely.validation.explain_validity (Python 3.7.1), followed by manual inspection. Raster datasets—including POI kernel density, nighttime light intensity, and population density—were resampled to a 100 m resolution.
As the datasets originated from different sources, their original spatial extents varied. To ensure consistency in model inputs, the smallest common bounding box across all layers was used as a reference. All datasets were uniformly cropped using NumPy 2.1.3 slicing, with dimensions aligned by row and column count. Areas without data were uniformly assigned a value of 0 to eliminate the impact of NoData values on model training. Additionally, to ensure comparability across variables with differing units, all features were normalized using min–max scaling to a standardized range of [0, 1]. The final 28 input variables were stacked into a three-dimensional array of shape (434, 358, 28) and fed into the subsequent site selection prediction model.
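The cropping, NoData handling, scaling, and stacking steps can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: all layers are assumed to share the same top-left origin and cell size (so that slicing aligns them), NoData is assumed to be encoded as NaN or a hypothetical sentinel value of -9999, and the function name is hypothetical.

```python
import numpy as np


def align_and_stack(layers, nodata=-9999.0):
    """Crop to a common shape, zero-fill NoData, min-max scale, and stack."""
    rows = min(layer.shape[0] for layer in layers)
    cols = min(layer.shape[1] for layer in layers)
    scaled = []
    for layer in layers:
        band = layer[:rows, :cols].astype(float)        # crop by slicing (copy)
        band[(band == nodata) | np.isnan(band)] = 0.0   # NoData -> 0
        lo, hi = band.min(), band.max()
        # A constant layer has no spread, so map it to zeros instead of dividing by 0.
        band = (band - lo) / (hi - lo) if hi > lo else np.zeros_like(band)
        scaled.append(band)
    return np.stack(scaled, axis=-1)                    # shape (rows, cols, n_layers)


# Two hypothetical layers with different extents and NoData encodings.
a = np.array([[0.0, 5.0, 10.0], [2.0, -9999.0, 4.0], [1.0, 1.0, 1.0]])
b = np.array([[100.0, 200.0], [300.0, np.nan]])
cube = align_and_stack([a, b])
```

Applied to the 28 layers described above, the same steps would yield an array of shape (434, 358, 28). Note that filling NoData with 0 before scaling means the filled cells can set the minimum of a layer whose valid values are all positive.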
3.3. Feature Importance Calculation Based on Random Forest
The random forest model is an ensemble learning algorithm consisting of multiple decision trees. Due to its high spatial fitting accuracy, it has been widely applied to address multidimensional nonlinear problems [42]. In this study, the random forest model is primarily employed to compute feature importance, after normalization, using the Mean Decrease in Gini index (MDG), in order to identify the most influential variables affecting rail transit station site selection. The MDG value for feature importance is calculated as follows:

$$P_r = \frac{\sum_{i=1}^{n} \sum_{j=1}^{t} D_{rij}}{\sum_{r=1}^{m} \sum_{i=1}^{n} \sum_{j=1}^{t} D_{rij}} \tag{6}$$

where $P_r$ represents the importance of the $r$-th feature among all features; $n$ is the number of decision trees; $t$ is the number of nodes in a single tree; $m$ represents the total number of features; and $D_{rij}$ refers to the Gini impurity reduction contributed by the $r$-th feature at the $j$-th node of the $i$-th tree.
Based on their MDG values, all features are ranked in order of importance. A higher MDG value indicates a greater contribution to model performance, so features with higher MDG values are considered the most influential factors in rail transit station site selection. Depending on the research objectives, the top k features with the highest MDG values are selected for further analysis.
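The normalization and ranking described above can be sketched in a few lines. The sketch assumes the MDG of a feature is its summed Gini impurity reduction over all nodes and trees, divided by the same sum over all features, and it applies the 1% exclusion threshold mentioned in the workflow overview. The function names, feature names, and impurity values are hypothetical.

```python
def mdg_importance(gini_decrease):
    """gini_decrease[r][i][j]: impurity reduction by feature r at node j of tree i."""
    per_feature = [sum(sum(tree) for tree in trees) for trees in gini_decrease]
    total = sum(per_feature)
    return [value / total for value in per_feature]


def select_features(names, importance, threshold=0.01):
    """Rank features by importance and drop those below the threshold."""
    ranked = sorted(zip(names, importance), key=lambda pair: pair[1], reverse=True)
    return [name for name, score in ranked if score >= threshold]


# Hypothetical impurity reductions for three features in a two-tree forest.
names = ["population_density", "nighttime_light", "slope_aspect"]
gini_decrease = [
    [[0.25], [0.10, 0.045]],   # population_density: tree 1, tree 2
    [[0.30, 0.10], [0.20]],    # nighttime_light
    [[0.005], []],             # slope_aspect (unused in tree 2)
]
importance = mdg_importance(gini_decrease)
selected = select_features(names, importance)
```

In scikit-learn, the `feature_importances_` attribute of a fitted random forest exposes an impurity-based importance that is likewise normalized to sum to 1, so it could serve as the input to the same thresholding step.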
3.4. LightGBM-Based Site Selection Prediction Model
To evaluate the performance advantages of LightGBM in site suitability regression, this study introduces four benchmark regression models (linear regression, ridge regression, random forest regression, and support vector regression) for comparative analysis. Model performance is assessed using the coefficient of determination ($R^2$), root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). All evaluation metrics were computed using spatial 5-fold cross-validation and are reported as mean ± standard deviation. The results indicate that LightGBM outperforms the other models across $R^2$, RMSE, and the additional metrics (Table 3), demonstrating that its nonlinear modeling and feature interaction capabilities provide substantial advantages for addressing complex geospatial prediction tasks.
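For reference, the evaluation metrics can be written out as a short sketch using their standard definitions. Two choices here are assumptions rather than details from the study: MAPE is expressed in percent, and samples whose true value is 0 are skipped when computing MAPE because the ratio is undefined there. The function name and sample values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """RMSE, MAE, MAPE (percent), and R2 for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined for a true value of 0; such samples are skipped here.
    nonzero = [(t, e) for t, e in zip(y_true, errors) if t != 0]
    mape = 100.0 * sum(abs(e / t) for t, e in nonzero) / len(nonzero)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2}


# Hypothetical suitability scores for four grid cells.
scores = regression_metrics([0.2, 0.4, 0.6, 0.8], [0.1, 0.4, 0.7, 0.8])
```

Because suitability scores scaled to [0, 1] can be exactly 0, the handling of zero-valued targets affects MAPE noticeably and is worth stating alongside the reported values.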
Therefore, this study adopts LightGBM regression as the primary modeling approach for predicting subway station site suitability. The model takes multi-source spatial features of grid cells as input and outputs a continuous suitability score for potential station locations. The objective is to learn the relationship between spatial features and the likelihood of a grid cell being suitable for a subway station. LightGBM is a highly efficient implementation of the gradient boosting decision tree framework. It uses a leaf-wise (leaf-first) growth strategy and an efficient histogram-based feature binning mechanism, which significantly improve large-scale data processing capacity and training speed while maintaining high accuracy [43,44]. The model is trained iteratively by minimizing the squared error loss function, gradually constructing and stacking multiple regression trees to approximate the true target values. The objective function is presented in Equation (7). In each iteration, the loss function is approximated and optimized using a second-order Taylor expansion (Equation (9)) to improve training efficiency and model stability. To mitigate potential spatial autocorrelation and information leakage in high-resolution grid data, this study applies spatial K-fold cross-validation to evaluate the generalization capability of the LightGBM-based site selection model, using GroupKFold (n = 5) and setting grid_size = 1000 m.
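The idea behind the spatial cross-validation can be sketched without any library. The sketch assumes that grid_size is the side length of square spatial blocks, that each 100 m grid cell is assigned to the block containing it, and that whole blocks are then distributed over five folds. The round-robin assignment below is a simplified stand-in: scikit-learn's GroupKFold gives the same guarantee that no group is split between training and test sets, while also balancing fold sizes. The function names are hypothetical.

```python
def block_id(row, col, cell_size_m=100, block_size_m=1000):
    """Map a grid cell to the spatial block that contains it."""
    cells_per_block = block_size_m // cell_size_m
    return (row // cells_per_block, col // cells_per_block)


def grouped_folds(cells, n_folds=5):
    """Assign whole blocks to folds so that no block is split across folds."""
    blocks = sorted({block_id(r, c) for r, c in cells})
    fold_of_block = {block: k % n_folds for k, block in enumerate(blocks)}
    return [fold_of_block[block_id(r, c)] for r, c in cells]


# All cells of a 434 x 358 grid, matching the feature array shape in the text.
cells = [(r, c) for r in range(434) for c in range(358)]
folds = grouped_folds(cells)
```

Keeping neighbouring cells in the same fold is what limits leakage: a model cannot score well on a test cell simply because an almost identical adjacent cell was in the training set.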
- (1) Overall objective function

$$L^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \tag{7}$$

where $L^{(t)}$ denotes the total loss of the $t$-th tree; $y_i$ is the true label of the $i$-th sample; $\hat{y}_i^{(t-1)}$ is the predicted value of the $i$-th sample after the $(t-1)$-th iteration; $f_t(x_i)$ represents the prediction increment for the $i$-th sample from the $t$-th tree; $l(\cdot)$ is the loss function, such as the squared error $l(y, \hat{y}) = (y - \hat{y})^2$; and $\Omega(f_t)$ is the regularization term used to control model complexity, typically defined as in Equation (8).
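- (2) Regularization term

$$\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \tag{8}$$

where $T$ denotes the number of leaf nodes in the tree; $w_j$ is the output weight of the $j$-th leaf; and $\gamma$ and $\lambda$ are regularization parameters.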
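- (3) Approximate optimization: second-order Taylor expansion

$$L^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \tag{9}$$

where $g_i = \partial_{\hat{y}_i^{(t-1)}} l\left(y_i, \hat{y}_i^{(t-1)}\right)$ denotes the first-order derivative (gradient), and $h_i = \partial^2_{\hat{y}_i^{(t-1)}} l\left(y_i, \hat{y}_i^{(t-1)}\right)$ denotes the second-order derivative (Hessian).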
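As a worked instance of the expansion, consider the squared error loss named in the text, $l(y, \hat{y}) = (y - \hat{y})^2$. Its curvature is constant, so the second-order approximation is exact for this loss. The closed-form leaf weight in the last line is the standard result for gradient boosting with an L2 penalty on leaf weights; it is not stated in the source and is shown here only as an illustration. It assumes a regularization term of the form $\gamma T + \frac{1}{2} \lambda \sum_j w_j^2$, and the symbols $I_j$ (the set of samples falling in leaf $j$), $G_j$, and $H_j$ are introduced for this example.

$$
\begin{aligned}
g_i &= \left. \frac{\partial}{\partial \hat{y}} (y_i - \hat{y})^2 \right|_{\hat{y} = \hat{y}_i^{(t-1)}} = 2\left(\hat{y}_i^{(t-1)} - y_i\right), \\
h_i &= 2, \\
w_j^{*} &= -\frac{G_j}{H_j + \lambda}, \qquad G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i .
\end{aligned}
$$

With $\lambda = 0$ the optimal weight reduces to the mean residual of the samples in the leaf, and a larger $\lambda$ shrinks each leaf output toward zero.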
- (4) Prediction output formula
After $M$ iterations of training, the model’s prediction is given as follows:

$$\hat{y}_i = \sum_{t=1}^{M} f_t(x_i) \tag{10}$$

where $\hat{y}_i$ is the predicted value of the $i$-th sample, and $f_t(x_i)$ is the output of the $t$-th regression tree for input $x_i$.
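The additive structure of the prediction can be illustrated with a deliberately small boosting loop. This is not LightGBM: it assumes a single feature, one-split trees (stumps), the squared error loss, and leaf weights of the standard form −G / (H + λ), where G and H are the summed gradients and Hessians in a leaf. It omits histogram binning, leaf-wise growth, and shrinkage, and all names and data are hypothetical.

```python
def fit_stump(xs, grads, hess, lam):
    """Best single split; each leaf weight is -G / (H + lam)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # keep both leaves non-empty
        left = [i for i, x in enumerate(xs) if x <= threshold]
        right = [i for i, x in enumerate(xs) if x > threshold]
        g_l, h_l = sum(grads[i] for i in left), sum(hess[i] for i in left)
        g_r, h_r = sum(grads[i] for i in right), sum(hess[i] for i in right)
        # The unsplit-node term of the gain is constant, so it is omitted.
        gain = g_l ** 2 / (h_l + lam) + g_r ** 2 / (h_r + lam)
        if best is None or gain > best[0]:
            best = (gain, threshold, -g_l / (h_l + lam), -g_r / (h_r + lam))
    return best[1:]


def predict(trees, x):
    """Final prediction: the sum of the outputs of all trees for input x."""
    return sum(w_l if x <= thr else w_r for thr, w_l, w_r in trees)


def boost(xs, ys, n_trees=20, lam=1.0):
    """Add one stump per iteration, fitted to the current gradients."""
    trees = []
    for _ in range(n_trees):
        preds = [predict(trees, x) for x in xs]
        grads = [2.0 * (p - y) for p, y in zip(preds, ys)]   # g_i for squared error
        hess = [2.0] * len(xs)                               # h_i for squared error
        trees.append(fit_stump(xs, grads, hess, lam))
    return trees


def mse(trees, xs, ys):
    """Mean squared error of the additive model on the given samples."""
    return sum((y - predict(trees, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical one-dimensional feature and suitability scores.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.1, 0.1, 0.2, 0.8, 0.9, 0.9]
trees = boost(xs, ys)
```

Evaluating `mse(trees[:m], xs, ys)` for increasing `m` shows the training error falling as more trees are summed, which is the behaviour the prediction formula describes.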