1. Introduction
Cities are important spatial carriers of human civilization. Their built environment not only supports the production and living activities of residents, but also profoundly influences people’s psychological experiences and behavioral patterns [1]. In this complex process of human-land interaction, urban visual landscapes, as the direct presentation of environmental information, play a crucial role in shaping human perceptual responses. Urban perception, which refers to the subjective cognition and comprehensive evaluation of the urban spatial environment by residents based on visual cues [2], has become an important field of interdisciplinary research involving architecture, urban and rural planning, and sociology.
Existing research methods still face significant challenges. Traditional methods such as questionnaires and in-depth interviews [3,4] can obtain relatively in-depth perceptual data, but they generally suffer from inherent limitations such as limited sample size, high survey costs, and difficulty in achieving continuous spatial coverage [3,4]. In recent years, breakthroughs in deep learning technology have provided a new technical path for large-scale urban perception research [5,6]. Methods based on street view images and deep neural networks can achieve efficient and standardized perception prediction, overcoming many of the shortcomings of traditional methods. However, advanced predictive models often exhibit a “black-box” nature: while they achieve high prediction accuracy, their internal decision-making logic remains opaque [7,8]. This lack of transparency hinders a clear explanation of how predictions are generated and fails to reveal the specific environmental factors that shape urban perception. Consequently, the practical application of such models in urban planning is limited, as planners require not only predictive outcomes but also interpretable insights to inform design interventions. Traditional statistical methods, although interpretable, often fall short when dealing with the complex nonlinear relationships between the environment and perception [9,10].
Against this backdrop, the emergence of explainable machine learning methods offers new ideas for addressing the aforementioned challenges [11]. Techniques such as SHAP (Shapley Additive Explanations) have been applied to uncover the nonlinear effects of visual elements on safety perception [12] or to identify key streetscape features influencing the sense of security [13]. Despite these advances, several critical research gaps persist, limiting both theoretical understanding and practical application.
First, most studies focus on identifying key visual features that influence perception, but have not systematically analyzed the interactions between different visual elements and their synergistic or antagonistic effects on multidimensional perception. Urban perception is shaped by a combination of environmental features, yet the compound effects and potential trade-offs among elements remain underexplored. Second, existing research predominantly examines coastal metropolises or cities at specific developmental stages, paying insufficient attention to inland megacities undergoing rapid urbanization and spatial restructuring. The unique spatial patterns, development trajectories, and perceptual dynamics of such cities, which are characterized by distinct urban forms, historical layers, and development pressures, are not adequately represented in the current literature, limiting the generalizability of findings. Third, while correlations between visual elements and perceptual scores have been established, the causal mechanisms and directional effects linking specific environmental features to particular perceptual dimensions require further validation through more rigorous modeling and interpretability frameworks. The “why” behind the predictions, that is, the explicit pathways through which environmental attributes translate into subjective experience, remains inadequately addressed.
To address these gaps, this study constructs a comprehensive analytical framework integrating street view images, deep learning, and explainable machine learning. We select Zhengzhou City, a major inland transportation hub and a rapidly developing megacity in central China, as an empirical case. Zhengzhou’s urban spatial structure shows distinct east–west differences with Zhongzhou Avenue as the boundary: the eastern part is a newly built urban area with larger street scales and higher greening levels, while the western part is the old urban area with high density and complex street interfaces. This spatial duality provides an ideal setting to explore perceptual variations and their environmental drivers.
This study focuses on the following three research questions: (1) How can the multi-dimensional spatial pattern of urban perception in Zhengzhou City be measured using machine learning methods? (2) What are the key environmental elements that influence the formation of urban perception in Zhengzhou City? (3) How do these environmental elements act on different dimensions of urban perception?
By answering these questions, this study aims to advance the methodological application of explainable machine learning in urban perception research and provide a scientific, interpretable evidence base for spatial quality enhancement and human settlement optimization in Zhengzhou and similar inland megacities.
4. Research Methods
This study adopts a comprehensive analysis framework integrating street view images, deep learning, and interpretable machine learning. The framework is designed to address the three core research questions outlined in the Introduction: (1) measuring the spatial pattern of multi-dimensional urban perception, (2) identifying key visual elements influencing perception, and (3) elucidating the mechanisms through which these elements affect different perceptual dimensions. The technical route covers three successive stages (Figure 2). The first stage is visual element analysis: the SegFormer-B5 model pre-trained on the Cityscapes dataset is used to perform fine semantic segmentation on street view images, extracting and quantifying the composition ratios of 19 key visual elements. The second stage is multi-dimensional perception prediction: based on the ResNet-50 deep residual network trained on the large-scale crowdsourced dataset MIT Place Pulse 2.0, perception evaluation is framed as a binary classification task, and the perception scores of street view images in the six dimensions of beauty, safety, vitality, wealth, depression, and boredom are obtained through transfer learning. The spatial differentiation pattern of these scores is then presented through visualization. The third stage is the analysis of perceptual elements: taking the 19 types of visual elements as features and the six-dimensional perception scores and the comprehensive perception score as target variables, a LightGBM model is constructed and combined with the game-theory-based SHAP (Shapley Additive Explanations) interpretation framework to quantify the contribution and direction of effect of each visual element in different perception dimensions, thereby identifying the key visual features that affect urban perception.
4.1. Street View Image Semantic Segmentation
This study implemented the Pyramid Scene Parsing Network (PSPNet) architecture (Figure 3), trained on the MIT CSAIL-developed ADE20K dataset, to systematically quantify urban visual components. The ADE20K repository provides semantically rich annotations for 150 object categories spanning built environment features (e.g., buildings, infrastructure) and natural elements (e.g., sky, vegetation) [35]. PSPNet demonstrates particular efficacy in complex urban contexts, achieving 79.73% mean intersection-over-union (mIoU) segmentation accuracy, with its robustness for streetscape analysis empirically validated across multiple urban computing studies [36,37].
Among them, is a multi-layer perceptron with 19 classes, and each ln is a pixel-by-pixel label mapping of in. CNN stands for Convolutional Neural Network. Each label map ln contains the pixel ratio of each object category in the street view image.
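As an illustration of how a pixel-wise label map can be reduced to per-class pixel ratios, the following minimal sketch counts labels in a toy map. The function name `class_pixel_ratios`, the three class indices, and the 2 x 4 map are hypothetical and are not taken from the study's pipeline, which operates on the output of a segmentation network.

```python
from collections import Counter


def class_pixel_ratios(label_map, num_classes):
    """Return the share of pixels assigned to each class in one label map."""
    counts = Counter(label for row in label_map for label in row)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("label_map is empty")
    # Classes that never occur in the image get a ratio of 0.0.
    return [counts.get(c, 0) / total for c in range(num_classes)]


# Toy 2 x 4 label map with three hypothetical classes:
# 0 = road, 1 = vegetation, 2 = sky
toy_map = [
    [2, 2, 1, 1],
    [0, 0, 0, 1],
]
ratios = class_pixel_ratios(toy_map, num_classes=3)
print(ratios)
```

Indicators such as a green view share would then be read directly from the entry of the corresponding class, under the assumption that each indicator is a ratio of class pixels to all pixels.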
4.2. Perceptual Score Based on ResNet-50
This study adopts the ResNet-50 architecture, a 50-layer deep residual network with a bottleneck design proposed by He et al. [38]. The pre-trained form of the network is used to take advantage of its proven effectiveness in capturing the hierarchical features of urban elements such as building facades and vegetation. To address the inherent uncertainties in perceptual evaluation, the established method is adjusted by reformulating the task as a binary classification problem [39]. The implementation process consists of four consecutive components: data standardization, transfer learning, spatial quantification, and compound scoring. Data standardization retains only street view images with at least five annotations and uses Q-score thresholds (mean ± standard deviation) to exclude mid-range samples and enhance model robustness. Transfer learning is then carried out with a 5:1 training-validation split, and model performance is improved by iteratively optimizing the network parameters. In the spatial quantification stage, the classification probability is transformed into perception intensity indicators to achieve spatial mapping of subjective perception. The final output consists of perception scores in six dimensions: beauty, boredom, depression, vitality, safety, and affluence. These scores are then aggregated into a composite score through weighted averaging. Score inversion is applied to the negative perception dimensions so that higher values always correspond to positive perception attributes, keeping the interpretation consistent across all evaluation indicators.
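The thresholding and scoring steps described above can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names, the use of the population standard deviation, the 0 to 10 score scale, and the equal default weights are all illustrative choices, since the text does not specify them.

```python
from statistics import mean, pstdev

# Dimensions treated as negative and therefore inverted (as in the text).
NEGATIVE_DIMENSIONS = {"boredom", "depression"}


def binary_labels(q_scores):
    """Label images above mean + std as 1 and below mean - std as 0.

    Images inside the band are dropped, which removes ambiguous samples.
    """
    mu, sigma = mean(q_scores), pstdev(q_scores)
    labels = {}
    for idx, q in enumerate(q_scores):
        if q > mu + sigma:
            labels[idx] = 1
        elif q < mu - sigma:
            labels[idx] = 0
    return labels


def composite_score(scores, weights=None, scale_max=10.0):
    """Weighted average after inverting the negative dimensions.

    scale_max and the equal default weights are assumptions.
    """
    weights = weights or {name: 1.0 for name in scores}
    adjusted = {
        name: (scale_max - value) if name in NEGATIVE_DIMENSIONS else value
        for name, value in scores.items()
    }
    total_weight = sum(weights[name] for name in adjusted)
    return sum(weights[name] * adjusted[name] for name in adjusted) / total_weight


labels = binary_labels([1, 2, 5, 5, 5, 8, 9])
example = {"beauty": 8, "safety": 6, "vitality": 6,
           "affluence": 4, "depression": 2, "boredom": 4}
print(labels, composite_score(example))
```

With inversion, a low depression score raises the composite in the same way a high beauty score does, which is the consistency property the paragraph describes.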
4.3. Build an Environmental Perception Model Based on LightGBM
This study employs the LightGBM algorithm to model the relationship between environmental perception and visual elements. LightGBM (Light Gradient Boosting Machine) is a boosting method based on the Gradient Boosting Decision Tree (GBDT) [40]. Through Gradient-based One-Side Sampling (GOSS), the algorithm retains the large-gradient samples and only a random subset of the small-gradient samples for information gain calculation, which reduces the amount of training data while preserving model accuracy. In addition, LightGBM adopts a leaf-wise growth strategy with depth constraints: at each step, the leaf with the maximum splitting gain among all current leaf nodes is split, and this cycle continues. For the same number of splits, the leaf-wise strategy can reduce errors and improve accuracy, and the restriction on the maximum tree depth helps limit overfitting.
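The sampling idea behind GOSS can be illustrated with a short standalone sketch. It is not LightGBM's internal implementation; the function name `goss_sample`, the rates, and the toy gradients are assumptions chosen for illustration. The sketch keeps the top fraction of samples by absolute gradient, draws a random fraction of the rest, and up-weights the drawn samples by (1 - top_rate) / other_rate so that their contribution to the gain estimate stays roughly unbiased.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) chosen by gradient-based one-side sampling."""
    n = len(gradients)
    # Rank samples from largest to smallest absolute gradient.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top = order[:n_top]
    # Randomly keep a small share of the remaining low-gradient samples.
    rng = random.Random(seed)
    other = rng.sample(order[n_top:], n_other)
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in other})
    return top + other, weights


toy_gradients = [0.05, -0.9, 0.1, 0.02, 0.8, -0.03, 0.01, 0.04, -0.06, 0.07]
kept, kept_weights = goss_sample(toy_gradients)
print(kept, kept_weights)
```

In this toy case three of ten samples are used per iteration, while the two samples with the largest errors are always retained.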
To comprehensively evaluate the model’s performance, three quantitative indicators were used: the Mean Absolute Error (MAE) quantifies the average prediction deviation, the Root Mean Square Error (RMSE) emphasizes larger prediction errors, and the Coefficient of Determination (R²) assesses the explanatory power. Together, these indicators provide a multi-dimensional assessment of the model’s accuracy in simulating human perceptual responses to urban visual stimuli. The calculation formulas are as follows:

$$ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right| $$

$$ \mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2 } $$

$$ R^2 = 1 - \frac{ \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 }{ \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2 } $$

Here, $\hat{y}_i$ and $y_i$ respectively represent the predicted value and actual value of the model for the $i$-th sample, $\bar{y}$ represents the mean of the actual values of all samples, and $n$ represents the total number of samples.
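The three indicators can be computed directly from their standard definitions. The sketch below uses only the standard library; the function name `regression_metrics` and the four-sample example are illustrative and are not data from the study.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired actual and predicted values."""
    n = len(y_true)
    errors = [p - t for p, t in zip(y_pred, y_true)]
    mae = sum(abs(e) for e in errors) / n
    rmse = sqrt(sum(e * e for e in errors) / n)
    y_bar = sum(y_true) / n
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


mae, rmse, r2 = regression_metrics([2.0, 4.0, 6.0, 8.0], [3.0, 4.0, 5.0, 8.0])
print(mae, rmse, r2)
```

Because RMSE squares each error before averaging, it is never smaller than MAE, and the gap between the two grows when a few predictions are far off.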
4.4. An Explanation Based on SHAP
SHAP (Shapley Additive Explanations) is an explainable machine learning framework based on the concept of Shapley values in cooperative game theory [41], providing model-agnostic explanations of feature contributions in predictive modeling. This game-theoretic approach does not require knowledge of the internal model architecture. Instead, it quantifies the positive or negative impact of each variable on individual predictions through conditional expectation calculations that account for feature interactions and distributions. The resulting SHAP values reveal both the characteristics of the data distribution and the model’s decision-making pattern.
In implementation, this interpretive framework assesses the differentiated impact of urban visual elements on different perception dimensions through the Shapley value decomposition system, and the calculation process is formalized in Formula (7) [42]. The magnitude and polarity of these values establish a quantified ranking of feature importance, thereby enabling systematic identification of the key visual determinants that shape human perception and response to the urban environment.
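$$ g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i z'_i \tag{7} $$

Here, $g(z')$ represents the predicted perception value for the visual elements, $\phi_0$ represents the average value of environmental perception, $M$ represents the number of visual elements in the perception model, $\phi_i$ represents the SHAP value of the $i$-th visual element, and $z'_i \in \{0,1\}$ indicates whether the $i$-th visual element participates in the model prediction. The calculation formula for the SHAP value [43] is as follows:

$$ \phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!} \left[ f(S \cup \{i\}) - f(S) \right] \tag{8} $$

Here, $\phi_i$ represents the SHAP value of the $i$-th visual element, $S$ represents the set of visual elements involved in the prediction, $N$ represents the collection of all visual elements, and $f(\cdot)$ denotes the model output for a given set of visual elements.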
In Formula (8), the perceptual impact of an urban visual element is quantified as a weighted aggregation of score differences across different combinations of elements. This is equivalent to calculating the weighted sum of the changes in perception scores when the element is added to each possible combination of the other visual elements, and the formula enforces fair attribution through the symmetry principle [44,45]. By processing the visual element dataset and the corresponding perception metrics through this SHAP framework, we obtain a direct quantitative representation of the perceptual impact of environmental features. The resulting values are interpreted by magnitude (absolute values indicate the intensity of influence) and by sign (positive or negative values indicate the direction of the effect), jointly forming an interpretable mapping between the visual characteristics of a city and its psychological impact. Comparing these Shapley-derived contributions across elements allows key design elements to be identified systematically.
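To make the weighted aggregation concrete, the following sketch computes exact Shapley values for a small toy scoring function by enumerating every subset of elements. The scoring function, its element names, and its coefficients are hypothetical. Practical SHAP analyses of tree models rely on faster tree-specific algorithms rather than this brute-force enumeration, which is shown only because it mirrors the classical Shapley formula term by term.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, players):
    """Exact Shapley values by enumerating all subsets of the other players."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(n):
            # Weight |S|! (n - |S| - 1)! / n! for subsets S of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value_fn(s | {i}) - value_fn(s))
        phi[i] = total
    return phi


def toy_score(present):
    """Hypothetical perception score with one green-sky interaction term."""
    score = 5.0
    if "green" in present:
        score += 2.0
    if "sky" in present:
        score += 1.0
    if "green" in present and "sky" in present:
        score += 1.0
    if "clutter" in present:
        score -= 1.5
    return score


elements = ["green", "sky", "clutter"]
phi = shapley_values(toy_score, elements)
print(phi)
```

In this toy case the interaction bonus is split equally between the two interacting elements, and the values sum to the difference between the full score and the baseline score, which is the additive property expressed in Formula (7).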
5. Results
5.1. Spatial Distribution of Different Perception Indicators
Based on the ArcGIS Pro 3.1.5 platform, the coordinate data of street view image collection points are spatially joined with their perception scores. The natural breaks method is used to divide the scores of the six perception indicators into seven grades, generating a spatial distribution map (Figure 3). The results show that the four positive perception indicators of safety, prosperity, vitality, and beauty present a spatial pattern with Zhongzhou Avenue as the boundary in the main urban area of Zhengzhou City. The area east of Zhongzhou Avenue generally shows relatively high positive perception scores, while the old urban area west of Zhongzhou Avenue shows obvious perceptual differentiation and a concentration of local high values. The high-value zones for safety perception (> 7.874736), prosperity perception (> 8.467721), and vitality perception (> 7.923512) have very high thresholds and overlap strongly in space, concentrating mainly in areas such as the CBD of Zhengdong New District and the core business districts of Jinshui District. These places, with their well-developed infrastructure, high-quality building facades, and vibrant business atmosphere, jointly shape a highly positive urban impression. In contrast, the high-value area of beauty perception (>7.626042) shows a different distribution pattern. It is not limited to the commercial core, but is more widely distributed along waterfront green belts such as the Jinshui River and Xiong’er River (>5.358731), as well as newly built landscape areas such as Beilong Lake and Longzi Lake, indicating that natural and artificial landscapes are the main driving forces of beauty perception.
In contrast to positive perception, the two negative perception indicators of depression and boredom are high in the periphery and in specific areas. The high depression perception areas (>7.529553) are mainly distributed in the traditional industrial areas in the west, around railway stations, and at construction sites on the outskirts of the city. These areas are generally characterized by dense buildings and a chaotic environment. The high boredom perception areas (>7.806227) are concentrated in large residential areas with a single architectural style and along featureless branch roads. The wide range of boredom values, from 0.567777 to 9.447562, reveals how uneven the visual richness of the city’s streets is. Notably, in the CBD of Zhengdong New District and some waterfront areas, there are regions with a high perception of beauty but also a relatively high perception of boredom. This indicates that although high-quality greening and modern architecture support a high beauty score, the single function of the street interface and the lack of diverse commercial and public activity spaces lead to insufficient visual richness, thereby triggering a high sense of boredom. This suggests that esthetic appeal alone is not enough to offset the negative perception brought about by functional monotony.
5.2. Spatial Distribution of Different Environmental Indicators
Based on the ArcGIS platform, the coordinate data of street view image collection points are spatially associated with nine types of environmental indicators. The natural breaks method is used to divide each indicator into seven levels, generating a spatial distribution map of environmental elements (Figure 4). The analysis results show that different types of built environmental elements present significant spatial differentiation characteristics in Zhengzhou City, and these environmental features provide a physical basis for understanding the spatial differences in urban perception.
In the main urban area of Zhengzhou City, the positive elements characterizing environmental quality show spatial differentiation with Zhongzhou Avenue as the boundary. The high-value area (>0.329868) of the Green View Index is mainly distributed along waterfront green belts such as the Jinshui River and Xiong’er River, as well as newly built landscape areas such as Beilong Lake and Longzi Lake in Zhengdong New District, which is highly consistent with the high-value area of beauty perception. The high-value areas of the Pedestrian Space Ratio (>0.245400) and the Ground Net-clearance Index (>0.940466) are mainly concentrated in modern areas such as the CBD of Zhengdong New District and the core business district of Jinshui District. These areas provide material support for the perception of safety and affluence through spacious pedestrian spaces and clean street interfaces. The high-value areas of the Sky View Factor (>0.579677) are also concentrated in the new urban areas with open architectural layouts, reflecting the sense of spatial openness in these areas. The old urban area west of Zhongzhou Avenue exhibits typical features such as high building density, rich but visually chaotic street interfaces, small block scales, and strong road connectivity, which contrast sharply with the newly built urban area.
In contrast, the indicators characterizing environmental pressure present different spatial patterns. The high-value areas of the Car–Nonmotor Conflict Index (>0.755909) are mainly concentrated around railway stations, traditional business districts and intersections of major roads. The traffic flow in these areas is complex and there are significant safety hazards. The high-value areas of Clutter Index (>0.035517) are mainly distributed in the traditional industrial areas in the west, the old urban areas and some urban fringe areas. These areas generally have problems such as disorderly street facilities and messy billboards, corresponding to the perception of poor environmental tidiness. The high-value areas of Hard Surface Intensity (>0.604325) are widely distributed along the core business districts and major traffic arteries of the city, reflecting the insufficiency of ecological regulation capacity in these areas. It is worth noting that, in some areas such as the CBD of Zhengdong New District, a special phenomenon emerged where the Green View Index was relatively high (0.252913–0.329867), but the Pedestrian Space Ratio was relatively low (0.150282–0.245399). Street view image analysis shows that although these areas have high-quality central greenery and modern buildings, the scale of the streets is large, and the convenience and humanization of walking are insufficient, resulting in a mismatch between visual esthetics and walking experience. This might be the environmental cause of the phenomenon where both the perception of beauty and boredom in this area are relatively high simultaneously.
5.3. Correlation Analysis of Different Perception Indicators
To explore the intrinsic connection between urban perception and built environmental elements, this study extracted nine environmental indicators based on street view images and conducted Pearson correlation analysis with six types of perception indicators (Figure 5). The results show that there is a significant correlation between different perception dimensions and specific environmental elements.
The environmental drivers of negative perception are relatively concentrated. Boredom perception is positively correlated with Hard Surface Intensity (r = 0.48) and the Clutter Index (r = 0.40), and strongly negatively correlated with the Green View Index (r = −0.69). Depression perception is also strongly negatively correlated with the Green View Index (r = −0.66), indicating that green scarcity and high-density built environments are prone to induce negative emotions.
The environmental basis for positive perception is more diverse. The Green View Index has the strongest positive correlation with beauty perception (r = 0.72), highlighting the shaping effect of green landscapes on the esthetic appeal of the city. Affluence perception is highly correlated with the Pedestrian Space Ratio (r = 0.84) and the Ground Net-clearance Index (r = 0.83), indicating the key role of wide and clean streets in affluence perception. Vitality perception also depends on the Pedestrian Space Ratio (r = 0.83), while safety perception is positively correlated with the Green View Index (r = 0.45) and the Sky View Factor (r = 0.51), reflecting the importance of an open and transparent visual environment for a sense of safety.
Furthermore, the Car–Nonmotor Conflict Index is negatively correlated with safety perception (r = −0.26), while the Clutter Index systematically weakens positive perception and enhances negative perception, making it a key entry point for optimizing urban perception. The correlation analysis thus indicates which environmental elements can be targeted to intervene in perceptual experience.
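For reference, the Pearson coefficient reported above can be computed from its definition as the covariance of two variables divided by the product of their standard deviations. The sketch below uses only the standard library, and the sample values are invented for illustration rather than drawn from the study's data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical indicator values and perception scores at four sampling points.
green_view = [0.10, 0.20, 0.30, 0.40]
beauty = [2.0, 4.0, 6.0, 8.0]
boredom = [8.0, 6.0, 4.0, 2.0]
r_beauty = pearson_r(green_view, beauty)
r_boredom = pearson_r(green_view, boredom)
print(r_beauty, r_boredom)
```

The coefficient captures only linear association, which is one reason the later sections turn to a tree-based model and SHAP values for nonlinear effects.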
5.4. Regression Analysis of Zhengzhou City
This study conducted a regression analysis based on the LightGBM algorithm to evaluate the urban perception characteristics of Zhengzhou City in six perception dimensions: beauty, boredom, depression, vitality, safety, and prosperity. Taking the perception score predicted by the ResNet-50 model as the dependent variable and nine street view visual elements as features, a regression model was constructed for each of the six perception dimensions. During training, 5-fold cross-validation repeated twice was used to optimize the model’s hyperparameters. The goodness of fit and generalization ability of the model were evaluated through the mean square error (MSE) and the coefficient of determination (R²).
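The repeated cross-validation scheme can be sketched as an index generator. This is a minimal standard-library sketch, assuming that samples are reshuffled before each repeat; the function name `repeated_kfold`, the seed, and the sample count of 10 are hypothetical.

```python
import random


def repeated_kfold(n_samples, n_splits=5, n_repeats=2, seed=0):
    """Yield (train_indices, validation_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh partition for every repeat
        folds = [indices[k::n_splits] for k in range(n_splits)]
        for k in range(n_splits):
            validation = folds[k]
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, validation


splits = list(repeated_kfold(10))
print(len(splits))
```

Five folds repeated twice give ten train-validation pairs per hyperparameter setting, and averaging the validation metric over all ten reduces the dependence of the chosen setting on any single partition.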
The model evaluation results (Table 1) show differences in prediction performance across perception dimensions. The esthetic perception model has relatively strong explanatory power, with a test-set R² of 0.6190, indicating that the model captures the visual features related to esthetics quite well. The models for boredom perception and vitality perception also perform stably, with test-set R² values of 0.6427 and 0.6434, respectively. In contrast, the models for depression perception (test-set R² = 0.5699) and affluence perception (test-set R² = 0.4703) performed relatively weakly, possibly due to the higher subjectivity of these perception dimensions and the complexity of their visual cues. The predictive performance of the safety perception model lies in between, with a test-set R² of 0.5001.
Overall, this series of models reveals the connection between the street scene visual environment and multi-dimensional perception in Zhengzhou City. The perceptual features of the beauty, boredom, and vitality dimensions are captured effectively by the models, while the weaker results for depression and affluence perception suggest that future research needs to incorporate non-visual features, such as socio-economic attributes or the degree of functional mix, to enhance explanatory power.
5.5. LightGBM Regression Results
Based on the results of the feature importance analysis (Figure 6), the Green View Index (GVI) is a core element in shaping the positive perception of a city. Its importance in the perception of beauty is as high as 70.53%, and it contributes 22.25% and 14.67% to the perception of safety and prosperity, respectively, highlighting the fundamental role of greening in enhancing the multi-dimensional perception quality of a city. The Sky View Factor (SVF) plays a dual role across perception dimensions: on the one hand, it is a key element in the formation of a sense of safety (39.79%); on the other hand, it may exacerbate boredom through the absence of street interfaces (21.75%), revealing the complexity of the influence mechanism of built environmental elements. In addition, the Pedestrian Space Ratio (PSR) dominates the perception of affluence (31.15%), and the Ground Net-clearance Index (GNI) contributes the most to the perception of vitality (34.67%), both pointing to the importance of humanized design and spatial tidiness for high-quality urban experiences. The analysis also identified building density (SBI) and traffic conflict (CCI) as the main sources of negative perception, with a significant impact on both boredom and depression perception. These findings indicate that the formation of urban perception depends on the combined effect of multiple elements, and that a single environmental element may play very different roles in different perception dimensions. This provides a scientific basis for regulating urban perception through precise environmental design and suggests that future urban renewal should coordinate the construction of a green system, the control of spatial scale, and the optimization of functional layout.
Multidimensional analysis based on SHAP values reveals the complex influence mechanisms and directions of different visual elements on urban perception (Figure 7). The Green View Index (GVI) has a significant positive driving effect on beauty perception: high GVI values markedly raise the evaluation of beauty, reflecting the core position of greening in shaping the esthetic appeal of a city. The Sky View Factor (SVF) shows a clear positive impact on safety perception, indicating that an open visual space helps enhance a sense of safety. However, in the perception of boredom, a higher SVF value is associated with negative evaluations, revealing that the same element may play opposite roles in different perception dimensions. Further analysis indicates that the Ground Net-clearance Index (GNI), as a key positive element of vitality perception, significantly promotes the feeling of street vitality as it increases, whereas building density (SBI) and vehicle conflict (CCI) exhibit a relatively strong negative impact in the perception of depression. Notably, the contribution of the Pedestrian Space Ratio (PSR) to the perception of affluence is nonlinear: it generates positive effects only within an appropriate range, and the promoting effect weakens beyond a threshold. Hard Surface Intensity (HSI) shows a positive association with the perception of affluence but a negative association with the perception of vitality, suggesting a dual role in different contexts.
By examining both the direction and the intensity of each element’s effect, this analysis deepens the understanding of the complex relationship between the urban visual environment and subjective perception and provides a quantitative basis for precise urban design. In planning practice, attention should be paid to the coordinated allocation of elements, and differentiated strategies should be adopted for specific perception goals, so as to achieve a systematic improvement in the quality of urban space.
6. Discussion and Conclusions
6.1. Conclusions
This paper constructs a comprehensive analysis framework integrating street view images, deep learning and interpretable machine learning, and quantitatively measures and explains the mechanism of urban perception in Zhengzhou City. Unlike traditional research, this framework analyzes the contribution and direction of action of visual elements to subjective perception from within the model through the SHAP method, thereby transcending the description of phenomena and precisely revealing the spatial pattern of perception results and the deep environmental driving forces behind their formation.
(1) The urban perception of Zhengzhou City presents a spatial differentiation feature with Zhongzhou Avenue as the boundary. Positive perception (safety, prosperity, vitality, and beauty) has demonstrated a “high center” agglomeration pattern in modern areas such as the CBD of Zhengdong New District and the core business district of Jinshui District. Its high value is closely related to the high-quality environmental elements such as the open Sky View Factor, sufficient Green View Index and spacious pedestrian space in the newly built urban area. The old urban area west of Zhongzhou Avenue, on the other hand, is characterized by a rich variety of historical buildings, a dense flow of people, diverse but visually chaotic street interfaces, small block scales, and high building density. In some areas, due to the chaotic environment and single function, negative perception has accumulated.
(2) The influence of each visual element was quantified through the LightGBM model and SHAP analysis. The Green View Index was identified as a core element for shaping positive perception (with a feature importance of 70.53% for beauty perception) and suppressing negative perception. The Sky View Factor and the Pedestrian Space Ratio together constitute the basis of the perception of safety, affluence, and vitality. The research also reveals the complexity of the role of environmental factors. In the newly built urban area east of Zhongzhou Avenue, the large street scale and single building interface result in relatively high SVF values. Although this enhances the sense of openness, it intensifies the sense of boredom in specific situations, reflecting the nonlinear characteristics of the perceptual influence mechanism.
(3) Planning should take into account the characteristics of the built environment on both the east and west sides of Zhongzhou Avenue and implement differentiated renewal strategies. In the old urban area on the west side, visual order and vitality should be enhanced through interface improvement, functional integration, and micro-renewal. In the newly built urban area on the east side, the details of building interfaces should be enriched, street scale controlled, and functional diversity increased to alleviate the perceptual contradiction of “beautiful but boring”. The findings provide a scientific basis and precise entry points for urban design and renewal practice.
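To make the idea of a signed, per-element contribution in conclusion (2) concrete, the following minimal sketch computes exact Shapley values for a toy perception function. It is an illustration only and not the model used in this study: the function `toy_beauty_score`, its coefficients, the baseline values, and the sample street are all hypothetical, and the exact enumeration shown here is practical only for a handful of features.

```python
from itertools import combinations
from math import factorial

# Hypothetical baseline (reference) values for three visual elements; not from the study.
BASELINE = {"GVI": 0.20, "SVF": 0.50, "PSR": 0.10}


def toy_beauty_score(x):
    """Made-up perception model: greenery and pedestrian space help,
    while very open sky (SVF above 0.65) is penalised."""
    openness_penalty = max(0.0, x["SVF"] - 0.65)
    return 5.0 * x["GVI"] + 2.0 * x["PSR"] - 4.0 * openness_penalty


def shapley_values(model, sample, baseline):
    """Exact Shapley values; features outside a coalition take their baseline value."""
    names = list(sample)
    n = len(names)

    def value(coalition):
        mixed = {k: (sample[k] if k in coalition else baseline[k]) for k in names}
        return model(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi


# Hypothetical street with ample greenery but a very open sky
street = {"GVI": 0.30, "SVF": 0.75, "PSR": 0.15}
phi = shapley_values(toy_beauty_score, street, BASELINE)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.3f}")
```

In this toy case the greenery term receives a positive contribution and the sky-openness term a negative one, and the contributions sum to the difference between the prediction for the street and the prediction for the baseline. This is the additive property that allows SHAP values to be read as both a magnitude and a direction of influence.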
6.2. Discussion
The perceptual spatial pattern of Zhengzhou City revealed in this study not only confirms the general “center-edge” gradient of urban perception, but also demonstrates the city’s distinctiveness as a regional central city. Compared with first-tier cities such as Beijing and Shanghai, the high-value areas of beauty perception in Zhengzhou are more prominently distributed along riverside green belts such as the Jinshui River and the Xiong’er River, which is closely related to its construction concept of “Beautiful City by Water”. This finding contrasts sharply with the research of Zhang et al. [14] on Beijing, where the perception of beauty is more closely associated with historical and cultural heritage areas and modern commercial centers. Zhengzhou, by contrast, reflects the emphasis that an emerging inland central city places on ecological landscape construction. Meanwhile, in Zhengdong New District, a newly built urban area, the wide street scale, complete infrastructure, and reasonable functional layout have jointly contributed to a continuous high-value distribution of vitality perception. This feature is relatively rare in the research on Chinese cities by Wei et al. [18] and reflects the practical achievements of urban planning concepts in the new era.
The causes of negative perception are particularly worthy of in-depth exploration. The research finds that the perception of repression in Zhengzhou City presents a “dual-core” distribution: it occurs not only in the densely built old urban area but also in some newly built areas. The SHAP analysis results reveal the environmental causes of this phenomenon at the mechanism level. On the one hand, some new districts overly pursue visual effects, leading to an imbalance in spatial scale. Although an excessively high SVF value (>0.65) enhances the sense of openness, it also intensifies the sense of spatial coldness because building enclosure is lacking. On the other hand, the lack of vertical greening and facade detail has prevented the positive effects of the GVI from being fully realized, intensifying the sense of oppression in the space. The “beautiful but boring” phenomenon that has emerged in areas such as the CBD of Zhengdong New District reveals the common problem of “emphasizing landscape over function” in current urban construction. Visual beautification alone is not enough to create a vibrant urban space; it must be combined with diverse commercial facilities and public activity spaces. This finding corroborates the conclusion of Liu et al. [25] on the impact of functional mixing degree on vitality perception, and further underscores the key role of functional diversity in the formation of urban perception.
Based on the SHAP analysis results, this study proposes the following targeted optimization suggestions:
(1) In terms of vegetation configuration, it is recommended to adopt a composite greening system of trees, shrubs, and ground covers that keeps the road green coverage rate within the optimized range of 25% to 35%. The SHAP analysis indicates that this range maximizes the positive effects of GVI. At the same time, tree density should be controlled to maintain visual corridors, so that the SVF remains within the ideal range of 0.4 to 0.6 and balances openness with spatial enclosure.
(2) In terms of architectural interface treatment, visual richness should be enhanced by varying facade setbacks and projections, using harmonious material colors, and controlling the length of continuous interfaces (a maximum of 50 m is recommended). SHAP analysis shows that appropriate variation in the architectural interface can reduce the negative impact of SBI on boredom perception by approximately 23%.
(3) In terms of the organization of street spaces, emphasis should be placed on functional integration. Especially in waterfront areas where beauty perception is high but boredom is pronounced, small-scale commercial functions such as coffee shops, bookstores, and convenience stores, as well as cultural and leisure functions such as exhibitions and performances, should be appropriately added. Research shows that when the functional mixing degree is raised above 1.8, the perception of vitality can increase by approximately 35%, effectively alleviating the perceptual contradiction of “beautiful but boring”.
These design suggestions based on quantitative analysis provide specific and feasible technical paths for improving the spatial quality of Zhengzhou and other similar cities, and also offer practical examples for data-driven urban design methods.
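The numeric ranges in the three suggestions above could be applied as a simple screening check on street segments. The sketch below is a hedged illustration and not part of the study's workflow: the `StreetSegment` fields, the 0 to 1 scaling of green coverage, the treatment of road green coverage as a single ratio, and the two sample segments are hypothetical, and only the threshold values come from the text.

```python
from dataclasses import dataclass

# Thresholds quoted in the suggestions above. Field names, scaling, and
# sample values are hypothetical and chosen for illustration only.
GREEN_RANGE = (0.25, 0.35)   # road green coverage rate, 25% to 35%
SVF_RANGE = (0.40, 0.60)     # recommended Sky View Factor range
MAX_INTERFACE_M = 50.0       # maximum continuous interface length in metres
MIN_FUNCTION_MIX = 1.8       # functional mixing degree should exceed this


@dataclass
class StreetSegment:
    name: str
    green_ratio: float
    svf: float
    longest_interface_m: float
    function_mix: float


def screen(segment):
    """Return the list of recommendations that a segment does not meet."""
    issues = []
    if not GREEN_RANGE[0] <= segment.green_ratio <= GREEN_RANGE[1]:
        issues.append("green coverage outside 25%-35%")
    if not SVF_RANGE[0] <= segment.svf <= SVF_RANGE[1]:
        issues.append("SVF outside 0.4-0.6")
    if segment.longest_interface_m > MAX_INTERFACE_M:
        issues.append("continuous interface longer than 50 m")
    if segment.function_mix <= MIN_FUNCTION_MIX:
        issues.append("functional mixing degree not above 1.8")
    return issues


segments = [
    StreetSegment("new-district boulevard", 0.30, 0.72, 120.0, 1.2),
    StreetSegment("renewed old-town street", 0.28, 0.50, 35.0, 2.1),
]
for seg in segments:
    print(seg.name, "->", screen(seg) or "meets all four checks")
```

A check of this kind only flags candidate segments for review; the thresholds themselves would need to be re-derived for other cities before being reused.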
6.3. Limitations and Future Research
Despite its contributions, this study has several limitations. Firstly, the reliance on Baidu Street View imagery inherently limited data coverage, excluding some back streets and renewal areas, which may affect the generalizability of the perceptual map. Secondly, while seasonal effects were controlled, the analysis captured a static visual snapshot and did not account for dynamic factors like weather, time of day, or pedestrian flow, which are integral to real-time urban experience. Thirdly, the perception model, though robust, is based on a globally trained dataset; its cultural specificity and adaptability to the inland Chinese context could be further enhanced with a localized training corpus. Additionally, the study primarily assessed the individual impact of visual elements, leaving the interaction effects between different environmental features (e.g., greenery and building interfaces) for future exploration. Finally, urban perception is a multi-sensory process, yet this research focused solely on visual cues, omitting the influences of sound, smell, and tactile experiences. Addressing these limitations in future work, through multi-source data integration, dynamic modeling, and cross-sensory investigation, will lead to a more holistic understanding of human perception in urban environments.