Exploring Urban Spatial Quality Through Street View Imagery and Human Perception Analysis

Li, Yonghao; Lu, Jialin; Meng, Yuan; Luo, Yiwen; Ren, Juan

doi:10.3390/buildings15173116

by Yonghao Li ^1,2,†, Jialin Lu ^1,2,†, Yuan Meng ^3,†, Yiwen Luo ^1 and Juan Ren ^1,4,*

^1 School of Architecture, Chang’an University, Xi’an 710061, China
^2 School of Architecture, South China University of Technology, Guangzhou 510641, China
^3 School of Architecture, Southeast University, Nanjing 210096, China
^4 School of Architecture, Xi’an University of Architecture and Technology, Xi’an 710055, China
^* Author to whom correspondence should be addressed.
^† These authors contributed equally to this work.

Buildings 2025, 15(17), 3116; https://doi.org/10.3390/buildings15173116

Submission received: 2 August 2025 / Revised: 24 August 2025 / Accepted: 27 August 2025 / Published: 31 August 2025

(This article belongs to the Special Issue Computational Design for Low-Carbon and Climate-Responsive Architecture and Urban Environments)

Abstract

Amid the global challenges of rapid urbanization, understanding how micro-scale spatial features shape human perception is critical for advancing livable cities. This study proposes a data-driven framework that integrates street view imagery, deep learning-based semantic segmentation, and machine learning interpretation models including SHAP analysis to explore the relationship between urban spatial characteristics and subjective perceptions. A total of 12,604 street-level images from Xi’an, China, were analyzed to extract seven spatial indicators. These indicators were then linked with perceptual data across six emotional dimensions derived from the Place Pulse 2.0 dataset. The analysis revealed that natural elements significantly enhance perceived comfort and aesthetics, while high-density built environments can suppress perceived safety and liveliness. Spatial clustering further identified three urban typologies (traditional, transitional, and modern) with distinct perceptual signatures. These findings offer scalable and transferable insights for perception-informed urban design and renewal, particularly in dense urban settings worldwide.

Keywords:

urban spatial quality; subjective perception; street view imagery; clustering analysis; spatial perception

1. Introduction

The global acceleration of urbanization presents a dual challenge: accommodating growing populations while ensuring that cities remain livable, equitable, and resilient [1,2]. Within this context, the quality of urban space—how it is perceived and experienced by residents—has emerged as a critical determinant of well-being, social cohesion, and sustainable development [3,4]. However, rapid, often unplanned, urban expansion has frequently prioritized economic efficiency and infrastructural density over human-scale design, leading to widespread issues of visual overload, environmental degradation, and socio-spatial disparities [5,6]. These problems manifest as perceptible declines in safety, comfort, and aesthetic pleasure, ultimately undermining urban livability and contravening the principles of Sustainable Development Goals (SDGs), particularly SDG 11 (Sustainable Cities and Communities) [7].

Traditional methods for assessing urban spatial quality, relying on remote sensing and census data, offer valuable macro-scale insights but fail to capture the human-scale, perceptual nuances that define daily lived experience [8,9]. This gap impedes our ability to design cities that are not only functionally efficient, but also psychologically supportive and emotionally engaging. Consequently, there is a pressing need for frameworks that can quantitatively link micro-scale spatial features with subjective human perception to inform more human-centric urban design and policy.

Recent advances in geospatial technology and artificial intelligence provide an unprecedented opportunity to bridge this gap. Street View Imagery (SVI), with its ground-level perspective, offers a scalable data source that closely mirrors human visual experience [10]. Coupled with deep learning for semantic segmentation and machine learning (ML) for modeling complex relationships, this approach enables the large-scale, high-resolution quantification of the built environment and its perceptual impacts [11,12]. However, current studies often focus on isolated features (e.g., greenness) or single perceptual dimensions, neglecting the synergistic effects of multidimensional urban form. Moreover, the “black-box” nature of powerful ML models limits their interpretability and practical adoption in urban planning [13].

To address these challenges, this study develops a novel “spatial-feature–perception–response” framework that integrates SVI analysis, ML interpretation techniques, and spatial clustering. We focus on Xi’an, China—a city characterized by a stark spatial dichotomy between a historical core and modern peripheries—as an ideal living laboratory to explore how contrasting urban forms influence human perception [14]. This research is designed to solve practical urban problems by providing actionable insights for mitigating the negative perceptual impacts of high-density development and guiding the creation of restorative, people-oriented spaces. Guided by this framework, we seek to answer the following research questions (RQs).

RQ1: What is the quantitative contribution and non-linear influence of key urban spatial elements on the six dimensions of human perception? RQ2: How do interaction effects and thresholds between these spatial features shape perceptual outcomes, and what are the optimal parameter ranges? RQ3: How is the relationship between spatial configuration and perception spatially heterogeneous, and what distinct urban spatial typologies can be derived from this heterogeneity to inform targeted renewal strategies?

Moreover, the novelty of this research is articulated through the following key contributions:

(1): Theoretical: Proposing and validating a non-linear explanatory framework for urban spatial perception that moves beyond simple correlations to reveal the synergistic and threshold effects between multidimensional features.
(2): Methodological: Developing an integrated analytical pipeline that innovatively couples deep learning-based semantic segmentation with Explainable AI techniques (SHAP) and spatial clustering to enhance the interpretability and spatial explicitness of perception modeling.
(3): Data: Demonstrating the value of multi-source heterogeneous data fusion and quantitatively linking street-view-derived spatial indicators with large-scale human perception scores for a comprehensive human-scale quality assessment.
(4): Policy Relevance: Generating actionable, spatially explicit intelligence by identifying distinct urban spatial typologies with specific perceptual signatures, thereby providing an evidence-based foundation for targeted urban renewal and design strategies, particularly in dense urban settings.

2. Literature Review

The assessment of urban spatial quality has progressively evolved from a focus on macro-scale, functional metrics toward a deeper understanding of human-scale perceptual experience, acknowledging that the success of an urban space is ultimately defined by its impact on inhabitant well-being [1,2]. This paradigm shift is grounded in longstanding theoretical foundations. Environment–Behavior Studies (EBSs) provide a crucial lens, establishing that spatial attributes directly influence human cognition, emotion, and behavior [4]. This is complemented by the theory of affordances, which posits that environments offer specific action possibilities, shifting analysis from objective metrics to relational, user-centered interpretations [15]. Furthermore, research on Restorative Environments, particularly Kaplan’s Attention Restoration Theory (ART), demonstrates that exposure to natural elements can mitigate mental fatigue and enhance psychological well-being, providing a robust theoretical basis for the positive role of greenery [16,17]. The concept of Sense of Place further enriches this view, emphasizing that perceptions are filtered through cultural and historical contexts, necessitating localized inquiry [18].

Concurrent with these theoretical advances, methodologies for measuring the urban environment have transformed to capture experiential quality. The field has moved beyond traditional planning metrics to adopt human-centric indicators derived from street-level data [5]. Key among these are the GVI for quantifying visible greenery [11], SOR for measuring overhead visibility [12], and IED for assessing the continuity of street walls [13]. International studies have leveraged these metrics to reveal cross-cultural patterns. For instance, research in Japanese cities found that objective destination diversity had the strongest positive association with overall neighborhood perception, with effects varying significantly by age and gender [19]. Similarly, the seminal Place Pulse project crowdsourced global perceptions from 56 cities, linking them to SVI-derived features [20], while other work has quantified the inequitable distribution of street greenery in cities like Hartford, USA [21].

This methodological revolution is propelled by breakthroughs in data acquisition and analytics. The proliferation of Street View Imagery (SVI) platforms provides a scalable, ground-truth data source that mirrors the human visual experience [5,10]. Coupled with advances in deep learning—specifically semantic segmentation models like DeepLabv3+(Google Research, Mountain View, CA, USA) and PSPNet—researchers can now automatically and accurately parse millions of images to extract urban features at an unprecedented scale [11,12]. To measure perception, crowdsourced platforms have largely supplanted limited surveys, enabling the collection of large-scale human judgment datasets like Place Pulse 2.0 [20,22]. To decode the complex, non-linear relationships between multidimensional urban form and perception, machine learning models such as Random Forests have become the standard tool due to their superior predictive power [23,24]. However, their inherent “black-box” nature has historically limited their practical application in planning [9,25]. The recent adoption of Explainable AI techniques, particularly SHAP, has begun to address this by quantifying the contribution and interaction of individual features, making model outputs transparent [26,27]. Recent advancements, such as the Concept-Driven Framework with Visual Foundation Models, further enhance interpretability by linking visual features to human perceptual judgments through a transparent reasoning process [28].

Despite these advancements, a synthesis of the literature (Table 1) reveals persistent gaps. Firstly, many studies focus on isolated features or single perceptual dimensions, neglecting the synergistic and threshold effects of multidimensional urban form [7,8,29]. Secondly, a significant interpretability gap remains; the application of SHAP in urban perception studies is still nascent, particularly for exploring complex feature interactions across heterogeneous spaces [30,31]. Finally, there is a policy translation disconnect; advanced models often fail to distill findings into actionable, spatially explicit urban typologies for targeted intervention [32,33]. As shown in a study of Montréal’s streets, perceptions of inclusivity and practicality can vary dramatically across demographic groups, highlighting the need for planning tools that can capture and address this heterogeneity [34].

3. Materials and Methods

Utilizing a structured framework of “data-driven extraction—mechanism interpretation—spatial response”, this study integrates street-level image analysis with machine learning-based perception modeling to investigate how spatial features shape perceptual outcomes across Xi’an’s inner urban core.

3.1. Study Area and Data Sources

This study draws on street view images, architectural spatial data, and remote sensing data, all covering the area within the second ring road of Xi’an so that together they reflect the characteristics of the urban environment in this area. Figure 1 shows the scope of the study area and the technical route of the study.

Xi’an is a major city in northwestern China and a historically significant cultural hub, characterized by a spatial structure that blends historical continuity with modern urban expansion [35]. Recent studies have demonstrated that Xi’an’s evolving urban structure significantly impacts environmental conditions and urban sustainability [25]. The traditional urban core, enclosed by the Ming Dynasty city wall, retains a low-density street network and rich historical textures, offering high spatial legibility and cultural identity. In contrast, newly developed districts, driven by rapid urbanization, feature high-rise, high-density development patterns. This juxtaposition of historic and modern spatial configurations creates a strong spatial tension, manifested in shifts from narrow traditional alleyways to wide urban blocks, changes in interface continuity, and the emergence of differentiated symbolic nodes, which significantly influence how residents perceive different urban environments.

This study focuses on Xi’an’s core urban area within the second ring road to analyze the interaction between spatial features and human perception. This area represents a key zone in the city’s urbanization process, with a population density of approximately 15,000 people per square kilometer. It encompasses commercial centers, historical districts, and residential zones. Its diverse spatial forms and social functions make it an ideal setting for investigating human-scale spatial quality and perceptual variation.

Data collection followed a systematic spatial sampling approach. As shown in Figure 2, based on vector road network data for 4778 road segments within the second ring road of Xi’an, sourced from OpenStreetMap (OSM), primary and secondary roads were selected as the sampling baseline to ensure comprehensive coverage of street hierarchy and functional types. Sampling points were placed every 25 m along the selected roads, and street-level images with a horizontal viewing perspective were retrieved through the Baidu Maps API (Baidu, Inc., Beijing, China), yielding 12,604 valid images [6]. A 25 m interval has been reported as effective for capturing visible greenery [36] and as optimal for green view detection [37], which supports the representativeness of our sampling design.
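The fixed-interval sampling described above can be illustrated with a short sketch. It is not the authors' implementation: it assumes road geometries are already available as polylines in a projected coordinate system measured in metres, and the function name `sample_along_polyline` and the example road are hypothetical.

```python
import math

def sample_along_polyline(coords, spacing=25.0):
    """Return points spaced `spacing` metres apart along a polyline.

    `coords` is a list of (x, y) tuples in a projected CRS (metres);
    this is an assumption, since OSM data is normally stored in degrees.
    """
    points = [coords[0]]   # always sample the start of the road
    next_at = spacing      # distance along the line of the next sample
    walked = 0.0           # length of the segments processed so far
    for (x0, y0), (x1, y1) in zip(coords, coords[1:]):
        seg_len = math.hypot(x1 - x0, y1 - y0)
        # Emit every sample whose position falls inside this segment.
        while seg_len > 0 and next_at <= walked + seg_len:
            t = (next_at - walked) / seg_len
            points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            next_at += spacing
        walked += seg_len
    return points

# Hypothetical L-shaped road, 100 m long in total.
road = [(0.0, 0.0), (60.0, 0.0), (60.0, 40.0)]
samples = sample_along_polyline(road)
```

Each returned point would then be passed to the imagery service as a request location; that step is omitted here because it depends on the API and credentials used.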

3.2. Model Training Process

This study employed the DeepLabv3+ model, pretrained on the Cityscapes dataset, to segment street-view images [38]. Prior work has shown reliable applications in Chinese contexts [39,40], supporting the robustness of this approach in Xi’an.

The Cityscapes dataset includes 19 semantic labels, such as buildings, roads, and vegetation, collected from 50 cities, making it highly suitable for urban scene parsing.

During model training, a cross-entropy loss function was used to address the class imbalance issue, and optimization was carried out using Stochastic Gradient Descent (SGD) with an initial learning rate of 0.01 and a momentum of 0.9. To enhance generalization across complex urban scenes, data augmentation techniques such as random flipping and color jittering were applied [11].

Based on the segmentation results, seven spatial quality indicators were extracted, including the following:

Building Coverage Ratio (BCR): proportion of image pixels occupied by buildings,
Sky Openness Ratio (SOR): proportion of sky pixels,
Interface Enclosure Degree (IED): proportion of continuous building facades,
Green View Index (GVI): proportion of vegetation pixels,
Pedestrian Space Ratio (PSR): proportion of pedestrian pixels,
Vehicle Space Ratio (VSR): proportion of vehicle pixels.

These indicators comprehensively quantify urban space’s morphological, environmental, and functional dimensions, laying the foundation for integrated spatial–perceptual coupling analysis.
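As a rough illustration of how pixel-share indicators can be derived from a segmentation result, the sketch below counts class labels in a label mask. The class IDs follow the standard Cityscapes 19-class training IDs; which classes feed each indicator is an assumption made here (for example, PSR is taken as the `person` class and VSR as all vehicle classes), and the paper's exact mapping may differ. IED is left out because facade continuity cannot be read from a simple pixel share.

```python
from collections import Counter

# Cityscapes training IDs (standard 19-class mapping).
BUILDING, VEGETATION, SKY, PERSON = 2, 8, 10, 11
VEHICLES = {13, 14, 15, 16, 17, 18}  # car, truck, bus, train, motorcycle, bicycle

def pixel_ratios(mask):
    """Compute pixel-share indicators from a 2D list of class IDs."""
    counts = Counter(label for row in mask for label in row)
    total = sum(counts.values())

    def share(ids):
        return sum(counts[i] for i in ids) / total

    return {
        "BCR": share({BUILDING}),    # building pixels
        "GVI": share({VEGETATION}),  # vegetation pixels
        "SOR": share({SKY}),         # sky pixels
        "PSR": share({PERSON}),      # assumed: 'person' class only
        "VSR": share(VEHICLES),      # assumed: all vehicle classes
    }

# Hypothetical 4 x 5 label mask standing in for one segmented image.
toy_mask = [
    [10, 10, 10, 10, 10],
    [2, 2, 8, 8, 10],
    [2, 2, 8, 11, 13],
    [0, 0, 0, 1, 13],
]
ratios = pixel_ratios(toy_mask)
```

In practice the mask would be the per-pixel argmax of the segmentation model's output, and the ratios would be averaged over the images that fall in each spatial unit.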

3.3. Construction of Urban Spatial Feature System and Multi-Source Data Integration

This study establishes a multidimensional indicator system that integrates both objective spatial features and subjective perceptions. The objective indicators are categorized into the following three dimensions: morphological, environmental, and functional, as shown in Table 2, which outlines the indicators and their associations with different dimensions of human perception [6].

The subjective perception indicator system is constructed from a human-centered perspective and is categorized into the following three dimensions: basic perception, functional perception, and emotional perception. Basic perception includes aesthetic quality, safety, and comfort; functional perception relates to convenience and accessibility; and emotional perception focuses on place identity and cultural characteristics [29]. The perception data are derived from the Place Pulse 2.0 dataset, which contains crowdsourced perceptual evaluations (e.g., safety, beauty, and comfort) across 56 cities worldwide, including several in Asia such as Tokyo, Singapore, Taipei, and Hong Kong. This broad coverage reduces the risk of a purely Western bias. Moreover, its applicability in Asian contexts has been demonstrated in recent studies: for example, [41] applied Place Pulse and deep learning methods to assess street quality in Beijing and Shanghai, linking greenery and other visual features with public perception scores, while [42] combined street-view imagery, deep learning, and space syntax to evaluate the perceived street quality in Shanghai and Chengdu. These works support the representativeness of Place Pulse data in Chinese cities, which justifies its integration with objective street-view indicators in Xi’an.

The coupling of objective and subjective indicators is implemented through a spatial grid framework, integrating semantic segmentation outputs, POI density, and perceptual scores to build a multidimensional feature matrix. SHAP values are used to interpret non-linear interaction mechanisms, such as the threshold effect of the Green View Index on perceived safety. Finally, K-means clustering is applied to classify the study area into traditional, transitional, and modern spatial typologies, providing a foundation for differentiated spatial optimization strategies [30].

3.4. Data Analysis Methods

This study develops a multi-scale integrated spatial feature analysis framework to reveal the complex mechanisms underlying spatial heterogeneity and perceptual responses in urban environments through a stepwise analytical process. As illustrated in Figure 3, the analytical workflow consists of the following three primary stages: spatial feature extraction, perception modeling, and typological classification.

The multidimensional spatial indicators are then aggregated into a unified grid-based spatial unit system. Using an improved K-means++ clustering algorithm [22], the study area is categorized into the following three functional spatial typologies: traditional low-quality zones, transitional mixed-use zones, and modern high-quality integrated zones. The optimal number of clusters is determined using the elbow method to ensure robustness in pattern recognition. This step enables the identification of spatial feature distributions and functional zone characteristics. Subsequently, a Random Forest regression model is applied to predict subjective perception scores based on the extracted spatial features [24]. To improve interpretability, SHAP values are used to quantify each feature’s global and local contributions to the perceptual outcomes, consistent with recent perception studies using association-based frameworks [39]. Furthermore, SHAP interaction analysis is conducted to explore the non-linear and synergistic relationships between key spatial features and perceptual dimensions, enabling the identification of perceptual response differences across heterogeneous urban spaces [31,33]. It should be noted that SHAP provides model-based interpretability rather than causal inference, and the findings are, therefore, interpreted as robust associative patterns.

4. Results

This chapter reveals the relationships between different spatial features and residents’ subjective perceptions in the core urban area of Xi’an and quantitatively assesses their spatial heterogeneity through semantic segmentation modeling, subjective perception prediction, and spatial clustering analysis.

4.1. Semantic Segmentation Results and Spatial Distribution of Quality Indicators

In this study, 12,604 street-view images from the core urban area of Xi’an were semantically segmented using the DeepLabv3+ model [43]. Seven spatial quality indicators, including BCR, GVI, SOR, IED, SVI, PSR, and VSR, were extracted.

Figure 4 illustrates the spatial distribution of these indicators, highlighting the significant heterogeneity in urban form and environmental characteristics across the study area. In the high-density commercial zone centered around the Bell Tower, BCR reaches as high as 0.9. At the same time, GVI drops below 0.2 and SOR falls under 0.4, indicating a “high-density, low-green, low-openness” spatial configuration that tends to induce visual pressure. In contrast, areas like Qujiang New District, which emphasize ecological development, show a much higher GVI and SOR above 0.6, reflecting a more open and comfortable street environment.

Figure 5 further presents how different spatial environments affect the following six dimensions of subjective perception: aesthetics, safety, comfort, boredom, depression, and liveliness. The results indicate strong positive correlations between GVI/SOR and perceived beauty and comfort, whereas a higher BCR is commonly associated with increased feelings of oppression and lower perceived safety. This suggests that natural and open spatial elements play a critical role in enhancing residents’ experiential quality in urban streetscapes.

4.2. Subjective Perception Modeling and Feature Contribution Analysis

In this study, a perception prediction model based on Random Forest regression was constructed, and SHAP values were applied to enhance model interpretability [44]. This approach enabled an in-depth investigation into the relationship between spatial features and residents’ subjective perceptions. The analysis revealed that spatial elements such as buildings, sidewalks, vehicles, and greenery play significant roles in shaping how residents perceive their urban environment.

According to Table 3, building-related features exhibit the highest SHAP values for perceived safety, strongly influencing residents’ sense of security. Specifically, BCR shows a notable negative impact on safety perception in high-density environments, where visual crowding and spatial enclosure may lead to heightened feelings of pressure and discomfort.

In contrast, sidewalks and vegetation demonstrate a substantial positive contribution to perceptions of beauty and liveliness. Areas with a higher GVI are consistently associated with enhanced environmental aesthetics and greater comfort. Vegetation not only improves the visual quality of urban space, but also supports emotional well-being. Likewise, well-designed pedestrian infrastructure significantly affects walkability and encourages more frequent walking activities, thereby enhancing residents’ perception of vibrancy within the urban environment.

SHAP is a model interpretation method rooted in the Shapley value concept from game theory. It is mainly used to explain the prediction results of machine learning models by quantifying the contribution of each feature to the model’s prediction. The core output of SHAP is the SHAP value.

\phi_{i}(val) = \sum_{S \subseteq A \setminus \{x_{i}\}} \frac{|S|! \, (n - |S| - 1)!}{n!} \left( val(S \cup \{x_{i}\}) - val(S) \right)

where ϕ_i(val) is the SHAP value of feature x_i; S is any subset (alliance) of the full feature set A that excludes x_i; |S| is the number of features in alliance S; n is the total number of features; val(S ∪ {x_i}) is the model prediction after adding feature x_i to alliance S; and val(S) is the model prediction for alliance S alone.

Figure 6 presents the SHAP value analysis of primary spatial features across the following six subjective perception dimensions: beautiful, wealthier, livelier, safer, depressing, and boring. The results reveal that buildings are the most influential feature overall, showing strong contributions to both liveliness and safety, but also increasing depressing perceptions when density becomes excessive. Vegetation plays a key positive role in enhancing aesthetic, comfort, and wealth-related perceptions while mitigating negative feelings such as depression and boredom.

Sidewalks and fences emerge as essential contributors to vibrancy, safety, and walkability, suggesting that pedestrian infrastructure significantly shapes perceived environmental quality. Meanwhile, vehicles and cars enhance perceptions of activity and mobility, but can negatively affect safety and comfort if overly dominant. Flats and monotonous built forms are strongly associated with negative perceptions such as depression and boredom, indicating that a lack of spatial variety weakens emotional experiences.

Overall, the SHAP analysis highlights the non-linear and multidimensional effects of spatial elements on human perception. It confirms that not only the presence, but also the intensity, composition, and balance of features such as greenery, density, and circulation networks determine how urban spaces are perceived. To enhance interpretability, SHAP values are employed to describe the contribution of each feature to the model outputs and to reveal potential non-linear and interaction effects. It should be emphasized that SHAP does not provide causal inference; in this study, it is used only to identify robust associative patterns within the predictive framework. To strengthen the robustness of the findings, the SHAP results are compared with MGWR regression outcomes and show consistent trends, and they are further supported by recent studies applying SHAP in spatial and urban contexts [40,45].
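To make the Shapley formula given earlier in this section concrete, the following sketch evaluates it by brute-force enumeration for three features. The value function `toy_val` is entirely hypothetical (it is not the Random Forest used in the study), and practical SHAP tools rely on much faster approximations or tree-specific algorithms rather than this enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, val):
    """Exact Shapley values by enumerating every alliance S without x_i."""
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(n):  # |S| ranges from 0 to n - 1
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (val(s | {i}) - val(s))
        phi[i] = total
    return phi

def toy_val(alliance):
    """Hypothetical perception score for a set of 'present' features."""
    score = 0.0
    if "GVI" in alliance:
        score += 0.30   # greenery raises the score
    if "BCR" in alliance:
        score -= 0.20   # building coverage lowers it
    if "GVI" in alliance and "SOR" in alliance:
        score += 0.10   # synergy between greenery and sky openness
    return score

phi = shapley_values(["GVI", "BCR", "SOR"], toy_val)
```

In this toy case the 0.10 synergy term is split equally between GVI and SOR, and the three values sum to the difference between the score of the full feature set and that of the empty set, which is the additivity property that makes SHAP contributions comparable across features.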

Further analysis of Table 4 reveals that BCR, GVI, and IED also exhibit non-linear effects on subjective perception. Building density shows a clear negative impact on the perception of safety, while higher Green View Index values are positively associated with enhanced perceptions of beauty and comfort. Moreover, interface enclosure plays a critical role in improving safety perception within historical districts, indicating that ordered spatial organization and continuous street frontages are essential for fostering a greater sense of security among residents.

Figure 7 illustrates the mean absolute SHAP values of the seven secondary-level spatial indicators and their correlations with the six subjective perception dimensions. The results demonstrate that BCR exerts the strongest overall influence, particularly on perceptions of safety and wealth, where high-density environments tend to reduce perceived safety but convey a sense of affluence. IED also substantially contributes to safety and liveliness, highlighting the role of continuous street edges and spatial enclosure in enhancing security and vibrancy. Meanwhile, SFR and PSR are particularly impactful for liveliness and wealthier perceptions, underscoring the importance of service infrastructure in shaping positive urban experiences.

On the environmental side, GVI and SOR are more strongly linked to aesthetic and comfort-related perceptions such as beauty and depression. However, their overall influence is lower than that of morphological and functional indicators. VSR displays a limited impact across all perception dimensions [46], indicating that pedestrian network continuity alone is insufficient to substantially shift residents’ evaluations without complementary spatial features.

The model’s validity is further confirmed by its predictive performance. As shown in Table 5, the MAE and MSE values demonstrate acceptable error levels across perception dimensions, while the model’s performance on specific dimensions—such as beauty, depression, and safety—is notably stronger. This indicates the model’s practical effectiveness in capturing subjective perceptions and provides confidence in the reliability of SHAP-derived associations. Taken together, these results suggest that morphological compactness, street interface continuity, and accessibility to public services are the most influential factors in shaping residents’ perceptual evaluations of urban space, while environmental openness provides additional emotional and aesthetic value.

4.3. Analysis of Spatial Hotspot Distribution Patterns

Using the Gi* spatial clustering analysis method, this study identified the hotspot and coldspot distribution patterns of both objective spatial quality indicators and subjective perceptions within the study area [47]. Figure 8 and Figure 9 present the hotspot distribution patterns of objective spatial quality indicators and subjective spatial perception evaluations within Xi’an’s second ring road, revealing their spatial coupling characteristics.

The Getis–Ord G_{i}^{*} statistic is used to identify local spatial hotspots and coldspots. The formula is as follows:

G_{i}^{*} = \frac{\sum_{j = 1}^{n} w_{i j} x_{j} - \bar{x} \sum_{j = 1}^{n} w_{i j}}{S \sqrt{\frac{n \sum_{j = 1}^{n} w_{i j}^{2} - {(\sum_{j = 1}^{n} w_{i j})}^{2}}{n - 1}}}

where x_j is the attribute value of spatial unit j, w_{ij} is the spatial weight measuring the spatial relationship between spatial units i and j, and n is the total number of spatial units. The mean and the standard deviation of the attribute values over all spatial units are, respectively,

\bar{x} = \frac{1}{n} \sum_{j = 1}^{n} x_{j}

S = \sqrt{\frac{1}{n} \sum_{j = 1}^{n} {(x_{j} - \bar{x})}^{2}}
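The G_i* formula above can be evaluated directly, as in the minimal sketch below. The five attribute values and the binary weight matrix (each unit linked to itself and its immediate neighbours along a line) are hypothetical; the study's actual spatial units and weighting scheme are not specified here.

```python
import math

def getis_ord_gi_star(x, w):
    """Return the G_i* score of every spatial unit.

    x: list of attribute values; w: n x n weight matrix with w[i][i]
    included, as the starred statistic counts the unit itself.
    """
    n = len(x)
    mean = sum(x) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    scores = []
    for i in range(n):
        sum_w = sum(w[i])
        sum_w2 = sum(wij ** 2 for wij in w[i])
        numerator = sum(w[i][j] * x[j] for j in range(n)) - mean * sum_w
        denominator = s * math.sqrt((n * sum_w2 - sum_w ** 2) / (n - 1))
        scores.append(numerator / denominator)
    return scores

# Hypothetical GVI values for five units arranged along a line.
gvi = [0.9, 0.8, 0.2, 0.1, 0.1]
weights = [
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
]
gi_scores = getis_ord_gi_star(gvi, weights)
```

Positive scores mark units surrounded by high values (candidate hotspots) and negative scores mark candidate coldspots; because the scores behave like z-scores, a significance threshold is normally applied before mapping.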

In Figure 8, hotspots of BCR and IED are mainly concentrated around the city wall and in the high-tech zone, indicating a high building density and enclosed street environments that contribute to a sense of spatial compression. In contrast, hotspots of GVI are clustered in the Qujiang New District and the southeastern section of the study area, reflecting abundant greenery and open space, designating these areas as ecological advantage zones. PSR and SOR show hotspot patterns around the urban periphery, particularly in university campuses and newly developed neighborhoods, suggesting a favorable openness and pedestrian accessibility. Moreover, SFR displays a multi-centered hotspot distribution, indicating localized strengths in traffic safety and infrastructure.

Figure 9 illustrates the spatial responses of subjective perceptions. Hotspots for perceptions of beauty and liveliness closely match the GVI and PSR hotspots in Figure 8, highlighting the positive impact of greenery and open spaces on aesthetic and vitality-related perceptions. In contrast, feelings of boredom and depression are concentrated in areas with high BCR and IED values, suggesting that dense, enclosed urban forms reinforce negative emotional responses. Perceptions of safety are concentrated in spatially open and well-structured areas, while feelings of wealth tend to cluster around commercial hubs and high-end residential zones. Overall, both objective and subjective indicators exhibit strong spatial consistency and regional coupling, indicating a close relationship between urban form and resident perception. Optimizing urban structure and enhancing spatial environmental quality can significantly improve residents’ psychological comfort and satisfaction with a city [6,48].

The elbow method is a common technique for determining the optimal number of clusters in clustering algorithms. It evaluates the clustering quality by calculating the Sum of Squared Errors (SSE) under different values of k, and selects the optimal k based on the change trend of SSE. The formula is as follows:

\mathrm{SSE} = \sum_{j = 1}^{k} \sum_{x \in C_{j}} d {(x, c_{j})}^{2}

k is the number of clusters being tested. C_j is the j-th cluster formed after clustering, i.e., the subset of the original dataset containing all data points assigned to it. x denotes an individual sample that belongs to cluster C_j, c_j is the centroid of C_j, and d(x, c_j) is the distance between the sample and that centroid.
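
The SSE calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each cluster is given as a list of points, the centroid is taken to be the cluster mean, the distance is Euclidean, and the two toy groups are hypothetical values rather than data from the study.

```python
def centroid(points):
    """Mean of a non-empty list of m-dimensional points."""
    m = len(points[0])
    return [sum(p[t] for p in points) / len(points) for t in range(m)]

def sse(clusters):
    """Sum of squared Euclidean distances from every point to its cluster centroid.

    clusters : list of clusters, each a non-empty list of m-dimensional points
    """
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        for x in cluster:
            total += sum((x_t - c_t) ** 2 for x_t, c_t in zip(x, c))
    return total

# Hypothetical toy data with two well-separated groups.
group_a = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
group_b = [[10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]

sse_k1 = sse([group_a + group_b])  # everything in one cluster
sse_k2 = sse([group_a, group_b])   # the two natural groups
```

Here the SSE drops sharply when moving from one cluster to two. Plotting SSE against k and looking for the point where such drops level off is the elbow criterion described above.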

The silhouette coefficient method builds on a(i), the average distance from sample i to the other samples in the same cluster. a(i) quantifies the average dissimilarity between sample i and the other members of its cluster and therefore reflects the tightness of that cluster: a smaller a(i) indicates that sample i lies closer to the other samples in the cluster, meaning that the cluster is more compact. The formula is as follows:

a (i) = \frac{1}{|C_{i}| - 1} \sum_{j \in C_{i}, j \neq i} d (i, j)

C_i is the cluster that sample i belongs to, and |C_i| is the number of samples in it. j indexes the other samples in cluster C_i (all except sample i), and d(i, j) is the distance between samples i and j.
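
The sketch below shows one way a(i) could be computed and turned into a silhouette value. The text defines only a(i); the nearest-cluster distance b(i) and the coefficient s(i) = (b(i) - a(i)) / max(a(i), b(i)) used here follow the standard silhouette definition and are included as an assumption. The function names and the toy data are hypothetical.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def silhouette(points, labels, i):
    """Return (a_i, b_i, s_i) for sample i.

    Assumes at least two clusters and that the cluster of sample i
    contains at least one other sample.
    """
    own = labels[i]
    dist_by_cluster = {}
    for j, (p, lab) in enumerate(zip(points, labels)):
        if j == i:
            continue  # a(i) excludes the sample itself
        dist_by_cluster.setdefault(lab, []).append(euclidean(points[i], p))
    # a(i): mean distance to the other |C_i| - 1 members of the own cluster.
    a_i = sum(dist_by_cluster[own]) / len(dist_by_cluster[own])
    # b(i): smallest mean distance to the members of any other cluster.
    b_i = min(sum(d) / len(d) for lab, d in dist_by_cluster.items() if lab != own)
    s_i = (b_i - a_i) / max(a_i, b_i)
    return a_i, b_i, s_i

# Hypothetical one-dimensional toy data with two tight clusters.
toy_points = [[0.0], [1.0], [10.0], [11.0]]
toy_labels = [0, 0, 1, 1]
a0, b0, s0 = silhouette(toy_points, toy_labels, 0)
```

Averaging s(i) over all samples gives the silhouette score that is compared across candidate values of k.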

Based on these combined evaluation methods, three clusters were identified as the optimal grouping solution. As shown in Figure 10, this choice balances statistical performance with interpretability: although k = 2 yielded the highest silhouette score, it oversimplified the heterogeneity of urban spaces. By contrast, k = 3 retained a relatively high silhouette value, showed a clear elbow inflection, and produced spatial partitions that align well with the observed functional differences across the study area. Further increases in k led to declining silhouette values and higher Davies–Bouldin indices, indicating over-segmentation and reduced clustering quality. Therefore, the three-cluster solution not only demonstrated statistical robustness but also provided a meaningful differentiation of urban spatial types, which serves as the basis for subsequent analyses.

To further ensure the robustness of the Gi* hotspot analysis, we conducted a supplementary Global Moran’s I test for both spatial and perceptual indicators. The results consistently showed significant positive spatial autocorrelation, with Moran’s I values ranging from 0.016 to 0.406 and all z-scores and p-values indicating statistical significance. These findings support the conclusion that the clustered patterns identified in the Gi* analysis reflect intrinsic spatial structures of the data rather than random proximity effects.

Moran’s I is a key indicator for measuring the spatial autocorrelation of a variable, which reflects the correlation of the variable’s attribute values across different spatial locations. The formula is as follows:

I = \frac{n \sum_{i = 1}^{n} \sum_{j = 1}^{n} w_{i j} (x_{i} - \bar{x}) (x_{j} - \bar{x})}{(\sum_{i = 1}^{n} \sum_{j = 1}^{n} w_{i j}) \sum_{i = 1}^{n} {(x_{i} - \bar{x})}^{2}}

n denotes the total number of spatial units involved in the analysis. x_i and x_j represent the attribute values of the variable at spatial units i and j, respectively. \bar{x} stands for the average value of the variable’s attribute values across all spatial units. w_ij is an element of the spatial weight matrix between spatial units i and j, which quantifies the spatial relationship between the two units.
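
A direct transcription of the Moran’s I formula into Python may help make the roles of the terms concrete. The sketch assumes a dense weight matrix with zeros on the diagonal; the function name and the two toy value patterns are hypothetical, and the z-scores and p-values reported in the text would require an additional inference step that is not shown.

```python
def morans_i(values, weights):
    """Global Moran's I for attribute values and an n x n spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [x - mean for x in values]
    w_total = sum(sum(row) for row in weights)
    # Weighted cross-products of deviations over all pairs (i, j).
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    ss = sum(d * d for d in dev)
    return (n * cross) / (w_total * ss)

# Hypothetical toy layout: four units on a line, adjacent units are neighbors.
line_weights = [[1 if abs(i - j) == 1 else 0 for j in range(4)] for i in range(4)]

i_gradient = morans_i([1.0, 2.0, 3.0, 4.0], line_weights)     # similar values adjoin
i_alternating = morans_i([1.0, 4.0, 1.0, 4.0], line_weights)  # dissimilar values adjoin
```

The smooth gradient yields a positive value (similar values cluster together), whereas the alternating pattern yields a negative one, which is the contrast that the test of positive spatial autocorrelation relies on.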

4.4. K-Means Clustering for Type Identification and Spatial Distribution

To further refine the structural characteristics of urban space, the K-means clustering method was applied to the spatial features extracted from street-view images [49,50]. With the number of clusters set to three, as determined by the elbow and silhouette analyses (Figure 10), the study area was divided into spatially similar clustering units, as shown in Figure 11. Each unit represents areas with similar spatial quality characteristics, facilitating the identification of spatial functional types and informing directions for future spatial optimization.

K-means relies on the Euclidean distance, the most widely used distance metric for this algorithm, to assign each data point to its nearest cluster center and to update the cluster centers iteratively. The formula is as follows:

d (x, y) = \sqrt{\sum_{t = 1}^{m} {(x_{t} - y_{t})}^{2}}

x = (x₁, x₂, ⋯, x_m) and y = (y₁, y₂, ⋯, y_m) are two m-dimensional vectors, each representing a data point in the dataset. d(x, y) is the Euclidean distance between x and y, i.e., the straight-line distance between the two data points in the m-dimensional feature space. m is the number of dimensions of the data points; each dimension corresponds to a specific attribute of the data. x_t and y_t are the values of the t-th dimension of x and y, respectively.
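
The assignment and update steps that use this distance can be sketched as follows. This is a bare-bones version of the standard K-means (Lloyd) iteration with user-supplied starting centers, written for illustration only; the function names, the fixed initialization, and the two-feature toy rows are assumptions and do not reproduce the study's implementation or its seven indicators.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((x_t - y_t) ** 2 for x_t, y_t in zip(x, y)))

def kmeans(points, initial_centers, max_iter=100):
    """Plain K-means: alternate nearest-center assignment and center update."""
    centers = [list(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: euclidean(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical rows of two standardized features, e.g. [GVI, BCR].
toy_points = [[0.1, 0.8], [0.2, 0.7], [0.8, 0.1], [0.9, 0.2]]
toy_labels, toy_centers = kmeans(toy_points, [toy_points[0], toy_points[2]])
```

In practice the starting centers are usually chosen randomly or with a seeding heuristic and the algorithm is rerun several times, because the result depends on initialization.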

Figure 11 illustrates the spatial classification results of Xi’an’s central urban area based on K-means clustering, which was conducted on street-view image features to identify and refine the urban spatial structure types within the second ring road. The clustering results are categorized into the following three types: traditional low-quality function-oriented space, transitional mixed-use space, and modern high-quality integrated space.

The first type, traditional low-quality function-oriented space, is primarily distributed in the old city, urban fringes, and traffic/industrial-dominated zones. It is characterized by high building coverage and strong interface enclosure, but with a significantly low Green View Index. These areas typically lack public facilities, have poor walkability, and receive low subjective evaluations such as “depressing”, “boring”, and “unsafe”, reflecting issues of outdated environments and single-use functions.

The second type, transitional mixed-use space, represents an intermediate stage of urban development and transformation. Its spatial indicators fall between the other two types, while subjective perceptions show higher fluctuation and uncertainty. These areas are often located at the boundaries of urban cores and are closely related to zones of potential redevelopment and policy-driven growth, marking the city’s transition toward renewal.

The third type, modern high-quality integrated space, is concentrated in areas such as the Qujiang cultural zone, new CBD clusters, and key urban cultural corridors. These spaces feature high greenery, openness, and pedestrian accessibility, as well as complex and diverse building forms. They also exhibit significantly higher ratings in aesthetics, safety, and resident satisfaction, representing the leading direction of integrated, high-quality urban development.

This classification is further supported by the spatial distribution of key landmarks, such as the Bell Tower, Yongning Gate, Daming Palace, and Xingqing Palace, which match well with the identified clusters and reinforce the accuracy and representativeness of the results.

Overall, this clustering framework reveals the spatial heterogeneity of the urban environment and provides a data-driven basis for formulating differentiated optimization strategies across different city zones.

Figure 12 provides a standardized boxplot analysis of both the objective and subjective indicators across the three spatial cluster types, offering deeper insight into the structural transformation of Xi’an’s central urban area. The results clearly delineate a spatial evolution trajectory from traditional low-quality zones to modern high-quality integrated spaces. Cluster 1 is characterized by high values in morphological indicators such as BCR and IED, coupled with low scores in environmental perception indicators, including GVI, aesthetic appeal, and perceived safety. This pattern reflects typical features of traditional, function-oriented spaces with a low environmental quality and limited user satisfaction.

Cluster 2 represents a transitional typology, exhibiting intermediate values and greater variability across most dimensions. The observed fluctuations in perception-related indicators suggest an unstable and evolving spatial identity, often associated with zones undergoing regeneration or peripheral redevelopment. In contrast, Cluster 3 consistently outperforms the other types in key indicators such as GVI, PSR, and subjective evaluations of beauty and safety. Its high scores in pedestrian friendliness and resident satisfaction denote a modern, high-quality urban environment with comprehensive spatial functions and livability.

The overall clustering pattern—from Cluster 1 to Cluster 3—illustrates a gradual spatial progression radiating outward from the historical city wall, encapsulating the logic of urban evolution from “heritage preservation” through “incremental renewal” to “emergent expansion”. One-way ANOVA results confirm the statistical robustness of the clustering. All variables show significant between-group differences, with F-values exceeding the critical threshold and p-values consistently below 0.05. This evidences the reliability and discriminative validity of the K-means clustering for differentiating spatial typologies based on integrated urban form and perceptual quality metrics.
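
For readers who want to see what the one-way ANOVA compares, the sketch below computes the F statistic as the ratio of between-group to within-group mean squares. The function name and the three toy groups are hypothetical; they stand in for one indicator measured in the three clusters and are not the study's data. Converting F to a p-value requires the F distribution (available, for example, through `scipy.stats.f_oneway`) and is not shown.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical indicator values for three clusters.
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value, df_b, df_w = one_way_anova_f(toy_groups)
```

A large F means that the cluster means differ by more than the spread within clusters would suggest. Because K-means itself partitions the data to maximize such separation on the clustering variables, significant F-values for those same variables are expected and are best read as descriptive.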

5. Discussion

5.1. Spatial Drivers of Perception

Our results confirm that morphological compactness (BCR and IED) is the strongest driver of perceptions of safety, liveliness, and wealth, while greenery (GVI) and openness (SOR) are more closely associated with aesthetic and emotional responses. The SHAP importance analysis (Figure 13) shows how different spatial dimensions influence specific perceptual outcomes. The global SHAP interaction analysis reveals the complex interplay between spatial features, with Figure 14 further illustrating the key interaction effects under different subjective perception categories.

These findings demonstrate that urban form, functional infrastructure, and environmental elements jointly shape perceptual outcomes. The non-linear thresholds revealed by the SHAP analysis—particularly the interaction effects between pedestrian accessibility and environmental features—highlight that perceptual effects often depend on the combination of multiple features rather than the influence of single indicators. For instance, when investigating the impact of spatial features on aesthetic perception, we found that in low-density areas (BCR < 0.3), an increase in GVI significantly enhances aesthetic appeal. However, when BCR exceeds 0.5, the marginal benefit of GVI gradually diminishes, demonstrating the critical importance of understanding these interaction effects for urban design interventions.
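
The idea that the contribution of one feature depends on the level of another can be illustrated with exact Shapley values on a deliberately simple model. Everything in this sketch is an assumption made for illustration: the scoring function is a hypothetical stand-in for the fitted model, absent features are replaced by a single baseline value, and the enumeration over feature subsets is the textbook Shapley definition rather than the tree-based SHAP estimator used in the study.

```python
from itertools import combinations
from math import factorial

def exact_shapley(f, instance, baseline):
    """Exact Shapley values of f at `instance`, relative to `baseline`.

    Features outside a coalition take their baseline value (assumption).
    """
    names = list(instance)
    n = len(names)

    def value(subset):
        mixed = {name: (instance[name] if name in subset else baseline[name])
                 for name in names}
        return f(mixed)

    phi = {}
    for name in names:
        others = [m for m in names if m != name]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {name}) - value(coalition))
        phi[name] = total
    return phi

def toy_beauty_score(features):
    # Hypothetical model: greenery helps, but less so as building coverage rises.
    return features["GVI"] * (1.0 - features["BCR"])

baseline = {"GVI": 0.0, "BCR": 0.0}
low_density = exact_shapley(toy_beauty_score, {"GVI": 0.4, "BCR": 0.2}, baseline)
high_density = exact_shapley(toy_beauty_score, {"GVI": 0.4, "BCR": 0.6}, baseline)
```

With the same GVI, the attribution to GVI is smaller in the high-BCR case than in the low-BCR case, which mirrors the diminishing marginal benefit of greenery described above. The attributions in each case sum to the difference between the model output at the instance and at the baseline.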

5.2. Robustness, Typologies, and Policy Scenarios

The observed patterns are consistent with previous studies that validated the use of global perception data in local urban contexts [39,40] and with recent applications of explainable machine learning, where SHAP identified thresholds and interaction effects comparable to those revealed by spatial regression models [45,51]. Such convergence suggests that the associations identified in this study are stable and meaningful.

Cluster analysis further revealed three distinct spatial types with quantified characteristics. Figure 15 illustrates the spatial distribution of these three cluster types within the central urban area of Xi’an, clearly presenting the geographic differentiation and structural patterns of urban spaces.

The traditional low-quality, function-oriented spaces (Cluster 1) are mainly concentrated around the ancient city wall, within urban villages, and in infrastructure-deficient areas, characterized by compact urban forms and single-function structures. The transitional mixed-use spaces (Cluster 2) are distributed along the urban periphery and in redevelopment zones, exhibiting a degree of spatial openness and functional flexibility. In contrast, the modern high-quality integrated spaces (Cluster 3) are concentrated in areas such as the Qujiang New District, the new CBD, and along cultural corridors, displaying multifunctionality, high levels of greenery, and excellent residential environment quality.

Figure 16 quantifies the performance differences across the objective built environment indicators among the following three cluster types: Cluster 1 (traditional low-quality) exhibits high building coverage and low greenery, Cluster 2 (transitional mixed-use) shows intermediate values with high variability, and Cluster 3 (modern high-quality) demonstrates superior environmental indicators. Cluster 1 shows a significantly higher BCR and IED compared to the others but scores much lower on GVI, SOR, and PSR, indicating dense, enclosed, and greenery-deficient traditional urban forms. Cluster 3, by contrast, performs strongly across GVI, SOR, and PSR, reflecting a high spatial openness and ecological quality, typical of modern multifunctional urban areas.

Evidence-based scenarios derived from these typologies suggest practical intervention pathways: in Cluster 1 areas, reducing building density while strategically introducing green infrastructure could address the empirically identified deficits in beauty and safety perception shown in Figure 17; in Cluster 2 areas, standardizing spatial quality through balanced interventions could stabilize the observed variability in perceptual outcomes; and in Cluster 3 areas, maintaining the existing high environmental quality while addressing any remaining gaps could optimize urban livability. These examples show how, together, statistical robustness and spatial typologies provide actionable insights for planning and design.

5.3. Design and Equity Implications

From a design perspective, our empirical findings translate into specific recommendations: interface continuity improvements are particularly beneficial in traditional areas where our analysis demonstrates strong associations with safety perception, while strategic green infrastructure placement can address the characteristically low GVI values found in Cluster 1 areas, as evidenced in Figure 16. The relationship between physical spatial features and subjective perception is not linear, as demonstrated across our cluster analysis. The combination and relative weighting of spatial indicators across different types generate complex interaction effects on residents’ experiential outcomes.

Enhancing frontage continuity, introducing micro-shade, and articulating facades can mitigate negative perceptions in dense areas, while benches and accessible services are particularly important in transitional spaces. For instance, the high GVI and PSR in Cluster 3 significantly enhance perceptions of beauty and vitality, while building density shows opposite effects on perceived safety across different clusters, as clearly illustrated in Figure 17.

Equity considerations emerge directly from our clustering results: Cluster 1 areas, characterized by the lowest perceptual quality scores across multiple dimensions, are predominantly located in traditional urban districts. The quantified performance gaps between clusters provide empirical evidence for prioritizing resource allocation to these areas, benefiting residents who currently experience a disproportionately poor environmental quality. Figure 17 shows that Cluster 1 scores lower in dimensions such as aesthetic appeal, liveliness, and perceived safety, with residents more likely to experience feelings of depression and boredom. Improved accessibility interventions in mixed-use areas support diverse populations, while ensuring connectivity improvements in integrated areas can distribute perceptual quality benefits beyond affluent residents to broader urban populations.

Therefore, spatial classification and optimization strategies must not only address physical parameters, but also emphasize their adaptability and elastic responsiveness to residents’ perceptions, aiming to achieve a more human-centered and context-sensitive approach to urban spatial governance.

6. Conclusions

This study conducted semantic segmentation on street-view imagery of the central urban area of Xi’an, extracting 19 categories of spatial features and calculating seven key spatial quality indicators, including BCR, SOR, and GVI. The results revealed significant spatial heterogeneity within the urban core. Commercially dense areas, such as those surrounding the Bell Tower, exhibited a high building density and low green visibility, leading to a strong sense of spatial oppression. In contrast, areas like Qujiang New District, rich in green resources, received higher resident evaluations, reflecting a superior aesthetic quality and comfort.

By integrating the PP2.0 dataset and Artificial Neural Networks for perception score analysis, the study found that GVI and SOR significantly enhance residents’ comfort and aesthetic perceptions, particularly in areas with abundant greenery. Conversely, building density negatively impacts perceptions of safety and comfort, especially in high-density and traffic-congested zones. IED was found to play a positive role in improving regional safety and livability.

Hotspot analysis based on Getis–Ord Gi* statistics further revealed that GVI hotspots were concentrated in the Qujiang New District, while BCR hotspots were primarily located in the central commercial area. IED hotspots were mainly observed in historic neighborhoods, highlighting the contribution of traditional urban street networks to enhancing residents’ sense of safety and belonging.

Subsequent K-means clustering analysis divided the urban core into the following three spatial types: traditional low-quality function-oriented spaces, transitional mixed-use spaces, and modern high-quality integrated spaces. This classification provides a clear framework for proposing targeted optimization strategies. The findings advocate for a shift from purely physical–spatial interventions toward integrated strategies that couple environmental morphology with socio-functional enhancements. For instance, in traditional zones, increasing green visibility must be planned alongside improvements to pedestrian infrastructure and public service accessibility to comprehensively address perceptions of safety and liveliness. In modern high-density areas, maintaining spatial openness should be balanced with policies that ensure equitable access to amenities and efficient transportation networks to mitigate potential congestion and stress. Future urban renewal should, therefore, utilize these insights to design interventions that are not only spatially sound, but also socially informed and functionally robust, ultimately creating more holistic and human-centered urban environments.

Despite its contributions, this study has certain limitations. First, the analysis primarily relied on street-view imagery and subjective perception data, which may have been affected by temporal, locational, and environmental factors during data collection, potentially leading to bias or incompleteness. Second, although Random Forest regression models and SHAP analysis were employed to reveal the influence of spatial features on perceptions, the inherent “black-box” nature of machine learning models limits the interpretability of complex multi-variable interactions.

Moreover, the study focused primarily on physical spatial characteristics such as building density and green visibility, without systematically incorporating social functional factors like transportation networks or public service facilities. Future research could expand the range of spatial features, integrating data related to social infrastructure and service accessibility to develop a more comprehensive spatial quality evaluation system. Future work could extend the analysis to cities of different scales and developmental stages to test the robustness and applicability of the results, particularly considering long-term urban structural changes and their environmental impacts [25].

Additionally, since the study area was limited to the central district of Xi’an, the generalizability of the findings to other cities or regions remains to be verified. Further studies could also incorporate multi-source heterogeneous data and adopt more interpretable models, thereby improving model precision, transparency, and operational applicability and providing stronger theoretical and practical support for perception-oriented urban spatial optimization.

Author Contributions

Conceptualization, Y.M. and Y.L. (Yonghao Li); methodology, Y.L. (Yonghao Li), Y.M. and J.R.; software, Y.L. (Yonghao Li); validation, J.L., Y.M. and Y.L. (Yonghao Li); formal analysis, J.L. and Y.M.; investigation, Y.M. and Y.L. (Yiwen Luo); resources, Y.L. (Yonghao Li) and J.R.; data curation, Y.L. (Yonghao Li), J.L., Y.L. (Yiwen Luo) and Y.M.; writing—original draft preparation, Y.L. (Yonghao Li), J.L., Y.M. and J.R.; writing—review and editing, J.L. and J.R.; visualization, Y.L. (Yonghao Li), J.L. and Y.M.; supervision, J.R. and Y.M.; project administration, J.R.; funding acquisition, J.R. All authors have read and agreed to the published version of the manuscript.

Funding

Science and Technology Department of Shaanxi Province, grant number 2025JC-YBMS-372; Fundamental Research Funds for the Central Universities, CHD, grant number 211941220002.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Xu, G.; Zhou, Z.; Jiao, L.; Zhao, R. Compact Urban Form and Expansion Pattern Slow Down the Decline in Urban Densities: A Global Perspective. Land Use Policy 2020, 94, 104563.
2. Dong, H.; Fujita, T.; Geng, Y.; Dong, L.; Ohnishi, S.; Sun, L.; Dou, Y.; Fujii, M. A Review on Eco-City Evaluation Methods and Highlights for Integration. Ecol. Indic. 2016, 60, 1184–1191.
3. Jian, I.Y.; Luo, J.; Chan, E.H.W. Spatial Justice in Public Open Space Planning: Accessibility and Inclusivity. Habitat Int. 2020, 97, 102122.
4. Gifford, R. Environmental Psychology Matters. Annu. Rev. Psychol. 2014, 65, 541–579.
5. Biljecki, F.; Ito, K. Street View Imagery in Urban Analytics and GIS: A Review. Landsc. Urban Plan. 2021, 215, 104217.
6. Zhang, J.; Yu, Z.; Li, Y.; Wang, X. Uncovering Bias in Objective Mapping and Subjective Perception of Urban Building Functionality: A Machine Learning Approach to Urban Spatial Perception. Land 2023, 12, 1322.
7. Li, X.; Lv, Z.; Zheng, Z.; Zhong, C.; Hijazi, I.H.; Cheng, S. Assessment of Lively Street Network Based on Geographic Information System and Space Syntax. Multimed. Tools Appl. 2017, 76, 17801–17819.
8. Kabisch, N.; Haase, D. Green Spaces of European Cities Revisited for 1990–2006. Landsc. Urban Plan. 2013, 110, 113–122.
9. Ren, J.; Wang, Y.; Liu, Q.; Liu, Y. Numerical Study of Three Ventilation Strategies in a Prefabricated COVID-19 Inpatient Ward. Build. Environ. 2021, 188, 107467.
10. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
11. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
12. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239.
13. Sun, H.; Xu, H.; He, H.; Wei, Q.; Yan, Y.; Chen, Z.; Li, X.; Zheng, J.; Li, T. A Spatial Analysis of Urban Streets under Deep Learning Based on Street View Imagery: Quantifying Perceptual and Elemental Perceptual Relationships. Sustainability 2023, 15, 14798.
14. Wang, F.; Gu, N. Exploring the Spatio-Temporal Characteristics and Driving Factors of Urban Expansion in Xi’an during 1930–2014. Int. J. Urban Sci. 2023, 27, 39–64.
15. Gibson, J.J. The Ecological Approach to Visual Perception; Houghton Mifflin: Boston, MA, USA, 1979.
16. Kaplan, S. The Restorative Benefits of Nature: Toward an Integrative Framework. J. Environ. Psychol. 1995, 15, 169–182.
17. Hartig, T.; Mang, M.; Evans, G.W. Restorative Effects of Natural Environment Experiences. Environ. Behav. 1991, 23, 3–26.
18. Tuan, Y.-F. Space and Place: The Perspective of Experience; University of Minnesota Press: Minneapolis, MN, USA, 1977.
19. Li, J.; Koohsari, M.J.; Zhao, J.; Kaczynski, A.T.; McCormack, G.R.; Oka, K.; Nakaya, T.; Tanimoto, R.; Watanabe, R.; Hanibuchi, T. The Impact of Objective Urban Features on Perception of Neighbourhood Environments. Sci. Rep. 2025, 15, 30322.
20. Naik, N.; Kominers, S.D.; Raskar, R.; Glaeser, E.L.; Hidalgo, C.A. Computer Vision Uncovers Predictors of Physical Urban Change. Proc. Natl. Acad. Sci. USA 2017, 114, 7571–7576.
21. Li, X.; Zhang, C.; Li, W.; Kuzovkina, Y.A.; Weiner, D. Who Lives in Greener Neighborhoods? The Distribution of Street Greenery and Its Association with Residents’ Socioeconomic Conditions in Hartford, Connecticut, USA. Urban For. Urban Green. 2015, 14, 751–759.
22. Jain, A.K. Data Clustering: 50 Years beyond K-Means. Pattern Recognit. Lett. 2010, 31, 651–666.
23. Seiferling, I.; Naik, N.; Ratti, C.; Proulx, R. Green Streets—Quantifying and Mapping Urban Trees with Street-Level Imagery and Computer Vision. Landsc. Urban Plan. 2017, 165, 93–101.
24. Ryo, M.; Rillig, M.C. Statistically Reinforced Machine Learning for Nonlinear Patterns and Variable Interactions. Ecosphere 2017, 8, e01976.
25. Huo, K.; Qin, R.; Zhao, J.; Ma, X. Long-term Tracking of Urban Structure and Analysis of Its Impact on Urban Heat Stress: A Case Study of Xi’an, China. Ecol. Indic. 2025, 174, 113418.
26. Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
27. Muller, E.; Gemmell, E.; Choudhury, I.; Nathvani, R.; Metzler, A.B.; Bennett, J.; Denton, E.; Flaxman, S.; Ezzati, M. City-Wide Perceptions of Neighbourhood Quality Using Street View Images. arXiv 2022, arXiv:2211.12139.
28. Yu, Y.; Yu, Z.; Shi, X.; Wan, R.; Wang, B.; Zhang, J. Towards Transparent Urban Perception: A Concept-Driven Framework with Visual Foundation Models. ISPRS Int. J. Geo-Inf. 2025, 14, 315.
29. Wang, L.; Han, X.; He, J.; Jung, T. Measuring Residents’ Perceptions of City Streets to Inform Better Street Planning through Deep Learning and Space Syntax. ISPRS J. Photogramm. Remote Sens. 2022, 190, 215–230.
30. Zeng, Q.; Gong, Z.; Wu, S.; Zhuang, C.; Li, S. Measuring Cyclists’ Subjective Perceptions of the Street Riding Environment Using K-Means SMOTE-RF Model and Street View Imagery. Int. J. Appl. Earth Obs. Geoinf. 2024, 128, 103739.
31. Yin, H.; Xiao, R.; Fei, X.; Zhang, Z.; Gao, Z.; Wan, Y.; Tan, W.; Jiang, X.; Cao, W.; Guo, Y. Analyzing “Economy-Society-Environment” Sustainability from the Perspective of Urban Spatial Structure: A Case Study of the Yangtze River Delta Urban Agglomeration. Sustain. Cities Soc. 2023, 96, 104691.
32. Angelo, H. Added Value? Denaturalizing the “Good” of Urban Greening. Geogr. Compass 2019, 13, e12459.
33. Ma, Z. Deep Exploration of Street View Features for Identifying Urban Vitality: A Case Study of Qingdao City. Int. J. Appl. Earth Obs. Geoinf. 2023, 123, 103476.
34. Mushkani, R.; Berard, H.; Ammar, T.; Koseki, S. Public Perceptions of Montréal’s Streets: Implications for Inclusive Public Space Making and Management. J. Urban Manag. 2025; in press.
35. Schneider, A.; Chang, C.; Paulsen, K. The Changing Spatial Form of Cities in Western China. Landsc. Urban Plan. 2015, 135, 40–61.
36. Sánchez, I.A.V.; Labib, S.M. Accessing Eye-Level Greenness Visibility from Open-Source Street View Images: A Methodological Development and Implementation in Multi-City and Multi-Country Contexts. Sustain. Cities Soc. 2024, 103, 105262.
37. Teeuwen, R.; Milias, V.; Bozzon, A.; Psyllidis, A. How Well Do NDVI and OpenStreetMap Data Capture People’s Visual Perceptions of Urban Greenspace? Landsc. Urban Plan. 2024, 245, 105009.
38. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the Computer Vision—ECCV, Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 833–851.
39. Zhang, F.; Zhou, B.; Liu, L.; Liu, Y.; Fung, H.H.; Lin, H.; Ratti, C. Measuring Human Perceptions of a Large-Scale Urban Region Using Machine Learning. Landsc. Urban Plan. 2018, 180, 148–160.
40. Yao, Y.; Liang, Z.; Yuan, Z.; Liu, P.; Bie, Y.; Zhang, J.; Wang, R.; Wang, J.; Guan, Q. A Human-Machine Adversarial Scoring Framework for Urban Perception Assessment Using Street-View Images. Int. J. Geogr. Inf. Sci. 2019, 33, 2363–2384.
41. Ma, X.; Ma, C.; Wu, C.; Xi, Y.; Yang, R.; Peng, N.; Zhang, C.; Ren, F. Measuring Human Perceptions of Streetscapes to Better Inform Urban Renewal: A Perspective of Scene Semantic Parsing. Cities 2021, 110, 103086.
42. Lei, Y.; Xu, Y.; Liu, X.; Long, Y. Measuring Street Quality from a Human Perspective Using Street View Images, Deep Learning and Space Syntax: Evidence from Shanghai and Chengdu, China. Buildings 2024, 14, 1847.
43. Long, Y.; Jiao, S.; Yu, Y.; Xiao, K. An Analysis of Spatial Vitality Distribution and Formation Mechanisms in Historical Urban Areas Based on Multi-Source Big Data: A Case Study of Changsha. Front. Archit. Res. 2025; in press.
44. Xi, Y.; Hou, Q.; Duan, Y.; Lei, K.; Wu, Y.; Cheng, Q. Exploring the Spatiotemporal Effects of the Built Environment on the Nonlinear Impacts of Metro Ridership: Evidence from Xi’an, China. ISPRS Int. J. Geo-Inf. 2024, 13, 105.
45. Li, Z. Extracting Spatial Effects from Machine Learning Model Using Local Interpretation Method: An Example of SHAP and XGBoost. Comput. Environ. Urban Syst. 2022, 96, 101845.
46. Liu, K.; Zhou, D.; Qi, Y.; Zhang, M.; Ren, Y.; Wei, Y.; Wang, J. Exploring the Complex Effects and Their Spatial Associations of the Built Environment on the Vitality of Community Life Circles Using an eXtreme Gradient Boosting–SHapley Additive exPlanations Approach: A Case Study of Xi’an. Buildings 2025, 15, 1372.
47. Getis–Ord Statistics, Wikipedia. 2024. Available online: https://en.wikipedia.org/w/index.php?title=Getis%E2%80%93Ord_statistics&oldid=1222989590 (accessed on 27 April 2025).
48. Liu, L.; Tu, Y.; Sun, M.; Lyu, H.; Wang, P.; He, J. Spatial Quality Measurement and Characterization of Daily High-Frequency Pedestrian Streets in Xi’an City. Land 2024, 13, 885.
49. Zhong, T.; Ye, C.; Wang, Z.; Tang, G.; Zhang, W.; Ye, Y. City-Scale Mapping of Urban Façade Color Using Street-View Imagery. Remote Sens. 2021, 13, 1591.
50. Zhang, Z.; Zhang, M.; Song, X.; Li, Z. Image-Based Machine Learning and Cluster Analysis for Urban Road Network: Employing Orange for Codeless Visual Programming. Geo-Spat. Inf. Sci. 2025, 28, 1298–1315.
51. Ruan, Y.; Zhang, X.; Wang, J.; Liu, N. Understanding the Role of Urban Block Morphology in Innovation Vitality through Explainable Machine Learning. Sci. Rep. 2025, 15, 21337.

Figure 1. Conceptual framework diagram.

Figure 2. Street view image sampling points in the core area of Xi’an.

Figure 3. Workflow of the spatial feature analysis and perception modeling framework.

Figure 4. The impact of the street view environment on the seven evaluation indicators.

Figure 5. The impact of the street view environment on six emotional state indicators.

Figure 6. SHAP-based impact of primary spatial features on subjective perception.

Figure 7. SHAP-based importance of secondary spatial features on subjective perceptions.

Figure 8. Hotspot distribution of objective spatial quality indicators.

Figure 9. Hotspot distribution of subjective spatial quality evaluations.

Figure 10. Determination of the optimal number of clusters using the elbow method and silhouette coefficient.

Figure 11. Spatial typology classification of Xi’an based on K-means clustering.

Figure 12. Boxplot of multidimensional cluster types.

Figure 13. SHAP interaction values of spatial features under three subjective perception categories.

Figure 14. Key SHAP interaction values of spatial features under three subjective perception categories.

Figure 15. Spatial distribution of the three clustered urban space types.

Figure 16. Feature indices of the three urban space clusters.

Figure 17. Subjective perception indices of the three urban space clusters.

Table 1. Summary of regional and thematic references in literature review.

Region/Type	Example References	Key Themes	Relevance to This Study	Morphological/Cultural Differences Discussion
Asia/China (Focal Region)	[14] Wang & Gu (2023); [25] Huo et al. (2025)	Spatiotemporal expansion of Xi’an; Thermal environment	Core case study and ontological research. Provides local empirical evidence, methodological reference, and a direct baseline for comparison.	High-density historical core vs. modern ecological new districts. Discusses the unique challenge of balancing heritage preservation, high-density development, and ecological needs within limited space, reflecting typical contradictions in Asian high-density cities.
Europe (Medium Coverage)	[5] Biljecki & Ito (2021); [8] Kabisch & Haase (2013); [23] Seiferling et al. (2017)	Street view imagery in GIS review; 3D city models; Green space evolution; Street greenery quantification	Provides theoretical frameworks, indicator origins (GVI), and methodological reflection (e.g., model Level of Detail).	Macro green space planning tradition vs. micro streetscape perception. European research often stems from a long Green City tradition; metrics like SVF and GVI provide standard references, but their urban forms are generally lower-density, highlighting the particularity of Xi’an’s high-density context.
North America/Global (High Coverage)	[20] Naik et al. (2017)—Place Pulse; [26] Lundberg & Lee (2017)—SHAP; [4] Gifford (2014)—EBS; [16] Kaplan (1995)—ART; [21] Li et al. (2015)—Hartford; [34] Mushkani et al. (2025)—Montréal	Global crowdsourced perception datasets; Explainable AI frameworks; Environmental psychology and restorative environment theory; Environmental justice; Socio-demographic heterogeneity in perception	Provides foundational theories (EBS, ART), core data sources (Place Pulse), key tools (SHAP), and international case comparisons (social equity, group differences).	Perceptual universals and variations across cultural diversity. North American research reveals universal patterns and profound socioeconomic disparities. Highlights that findings require bias correction for cross-cultural application, not direct transfer.
Classic Theory and Others (Foundational Support)	[15] Gibson (1979)—Affordance; [18] Tuan (1977)—Sense of Place; [19] Li et al. (2025)—Japan Study; [28] Yu et al. (2025)—Transparent VFM	Affordance Theory; Sense of Place; Destination diversity in Japanese cities; Transparent urban perception frameworks	Provides meta-theoretical support, neighboring Asian case studies, and methodological frontier comparison.	Theoretical universality and cultural specificity. Classic theories provide globally applicable analytical lenses. Japanese research, as a cultural neighbor, may show different key factors compared to the West, providing a bridge for understanding cultural specificity. Latest research pushes the field towards interpretability.

Table 2. Multidimensional urban spatial quality indicators and their perceptual associations.

Objective Index	Objective Indicator	Associated Perception	Mathematical Formula	Interpretation
Morphological parameter	Building Coverage Ratio (BCR)	Beautiful, Comfortable	$\mathrm{BCR} = \frac{\sum A_{\text{building}}}{\sum A_{\text{plot}}}$	High density may feel oppressive; moderate improves convenience
Morphological parameter	Sky Openness Ratio (SOR)	Beautiful, Comfortable	$\mathrm{SOR} = \frac{\sum A_{\text{sky}}}{\sum A_{\text{total}}}$	Enhances visibility, openness, and spatial comfort
Morphological parameter	Interface Enclosure Degree (IED)	Safe, Convenient	$\mathrm{IED} = \frac{\sum L_{\text{enclosed}}}{\sum L_{\text{interface}}}$	Strengthens safety and spatial continuity
Environmental parameter	Green View Index (GVI)	Beautiful, Comfortable, Affective	$\mathrm{GVI} = \frac{\sum A_{\text{green}}}{\sum A_{\text{total}}}$	Enhances aesthetics and emotional identity
Environmental parameter	Safety Facility Ratio (SFR)	Safe, Convenient	$\mathrm{SFR} = \frac{\sum N_{\text{safety-facility}}}{\sum A_{\text{space}}}$	Improves security and spatial comfort
Functional parameter	Pedestrian Space Ratio (PSR)	Convenient, Accessible	$\mathrm{PSR} = \frac{\sum A_{\text{pedestrian-space}}}{\sum A_{\text{urban-space}}}$	Supports walkability and pedestrian comfort
Functional parameter	Vehicle Space Ratio (VSR)	Convenient, Accessible, Place identity	$\mathrm{VSR} = \frac{\sum A_{\text{vehicle-space}}}{\sum A_{\text{urban-space}}}$	Enhances connectivity and urban spatial legibility
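
The area-ratio indicators in Table 2 can be illustrated with a minimal sketch. It assumes that pixel counts in a semantic segmentation mask stand in for the areas in the formulas, and it uses a hypothetical class-id mapping (`CLASS_IDS`) and hypothetical helper names (`pixel_ratio`, `view_indicators`); none of these come from the study's own pipeline. Only GVI and SOR are shown, since both share the whole image as denominator.

```python
import numpy as np

# Hypothetical class ids (assumption); a real segmentation model defines its own mapping.
CLASS_IDS = {"vegetation": 1, "sky": 2, "sidewalk": 3, "road": 4}


def pixel_ratio(mask: np.ndarray, class_names: list[str]) -> float:
    """Share of pixels in `mask` whose label belongs to `class_names`."""
    ids = [CLASS_IDS[name] for name in class_names]
    return float(np.isin(mask, ids).sum() / mask.size)


def view_indicators(mask: np.ndarray) -> dict[str, float]:
    """GVI and SOR from Table 2, approximated by pixel counts (assumption)."""
    return {
        "GVI": pixel_ratio(mask, ["vegetation"]),  # green pixels / all pixels
        "SOR": pixel_ratio(mask, ["sky"]),         # sky pixels / all pixels
    }


# Toy 2x4 label mask: 2 vegetation, 3 sky, 1 sidewalk, 2 road pixels.
toy_mask = np.array([[2, 2, 2, 1],
                     [1, 3, 4, 4]])
indicators = view_indicators(toy_mask)
print(indicators)
```

For the toy mask, 2 of 8 pixels are vegetation and 3 of 8 are sky, so the sketch yields GVI = 0.25 and SOR = 0.375. Aggregating over several images of one sampling point would sum numerators and denominators before dividing, as the summation signs in Table 2 suggest.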

Table 3. SHAP values of primary-level spatial features in relation to subjective perceptions.

Feature	Beautiful	Boring	Depressing	Livelier	Safer	Wealthier	Average
building	0.0955	0.0723	0.0680	0.0715	0.1120	0.0733	0.082
sidewalk	0.0440	0.0523	0.0549	0.0568	0.0635	0.0631	0.056
vehicle	0.0437	0.0452	0.0459	0.0472	0.0589	0.0550	0.049
car	0.0427	0.0407	0.0389	0.0427	0.0407	0.0520	0.043
pole	0.0387	0.0389	0.0388	0.0405	0.0333	0.0411	0.039
person	0.0374	0.0374	0.0386	0.0393	0.0320	0.0386	0.037
wall	0.0370	0.0370	0.0385	0.0368	0.0311	0.0360	0.036
flat	0.0362	0.0368	0.0367	0.0338	0.0310	0.0351	0.035
traffic light	0.0360	0.0367	0.0362	0.0325	0.0310	0.0303	0.034
sky	0.0355	0.0366	0.0341	0.0324	0.0308	0.0299	0.033
fence	0.0335	0.0346	0.0321	0.0314	0.0280	0.0289	0.031
traffic sign	0.0313	0.0335	0.0318	0.0310	0.0274	0.0285	0.031
terrain	0.0297	0.0304	0.0309	0.0290	0.0257	0.0270	0.028
bicycle	0.0283	0.0281	0.0288	0.0284	0.0256	0.0266	0.028
train	0.0280	0.0273	0.0288	0.0278	0.0251	0.0258	0.027
human	0.0259	0.0240	0.0274	0.0271	0.0245	0.0255	0.026
rider	0.0227	0.0238	0.0270	0.0261	0.0240	0.0247	0.025
vegetation	0.0222	0.0235	0.0245	0.0260	0.0220	0.0211	0.023
road	0.0185	0.0227	0.0225	0.0235	0.0207	0.0196	0.021
bus	0.0177	0.0220	0.0223	0.0193	0.0197	0.0180	0.020
motorcycle	0.0171	0.0205	0.0215	0.0164	0.0152	0.0155	0.018
nature	0.0168	0.0155	0.0139	0.0149	0.0132	0.0111	0.014
truck	0.0118	0.0116	0.0129	0.0109	0.0089	0.0110	0.011
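
The Average column in Table 3 appears to be the unweighted mean of the six perception columns, rounded to three decimals; the table does not state this explicitly, so the short check below is an assumption tested on three rows copied from the table. The names `rows` and `row_average` are illustrative only.

```python
# Values copied from Table 3, in column order:
# Beautiful, Boring, Depressing, Livelier, Safer, Wealthier.
rows = {
    "building": [0.0955, 0.0723, 0.0680, 0.0715, 0.1120, 0.0733],
    "sidewalk": [0.0440, 0.0523, 0.0549, 0.0568, 0.0635, 0.0631],
    "vehicle": [0.0437, 0.0452, 0.0459, 0.0472, 0.0589, 0.0550],
}


def row_average(values: list[float]) -> float:
    """Unweighted mean across perception dimensions, rounded to 3 decimals."""
    return round(sum(values) / len(values), 3)


averages = {feature: row_average(values) for feature, values in rows.items()}
print(averages)
```

The three results (0.082, 0.056, 0.049) match the Average column for these rows, which supports reading that column as a simple cross-dimension mean.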

Table 4. SHAP-based correlation between secondary spatial features and subjective perceptions.

Target	BCR	IED	SFR	PSR	GVI	SOR	VSR
beautiful	−0.0009	−0.0193	−0.0038	0.0026	−0.0055	−0.0152	−0.0007
boring	0.0142	0.0147	−0.0001	−0.0140	−0.0004	−0.0076	−0.0085
depressing	−0.0121	0.0162	−0.0026	0.0066	−0.0233	0.0037	0.0083
livelier	0.0163	0.0125	0.0149	0.0047	0.0042	0.0101	0.0084
safer	−0.0182	0.0012	0.0044	0.0112	0.0235	0.0367	−0.0118
wealthier	0.0227	−0.0228	−0.0049	0.0102	−0.0085	−0.0015	−0.0260

Table 5. Prediction errors of subjective perception dimensions based on Random Forest models.

Target	MAE	MSE
beautiful	1.5772	4.6086
boring	1.7232	5.6780
depressing	1.9650	6.5111
livelier	1.8209	7.4835
safer	2.0947	8.6376
wealthier	2.0070	7.3432
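
For readers comparing the two error columns in Table 5, the sketch below shows how MAE and MSE are conventionally computed. The scores are made-up toy values, not the study's data, and the function names are illustrative.

```python
import numpy as np


def mae(y_true, y_pred) -> float:
    """Mean absolute error: average of |y_true - y_pred|."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))


def mse(y_true, y_pred) -> float:
    """Mean squared error: average of (y_true - y_pred)^2."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(diff ** 2))


# Toy perception scores (assumption, for illustration only).
y_true = [4.0, 6.0, 5.0, 7.0]
y_pred = [5.0, 4.0, 5.0, 8.0]

toy_mae = mae(y_true, y_pred)    # errors -1, 2, 0, -1 -> mean |error| = 1.0
toy_mse = mse(y_true, y_pred)    # squared errors 1, 4, 0, 1 -> mean = 1.5
toy_rmse = float(np.sqrt(toy_mse))
print(toy_mae, toy_mse, toy_rmse)
```

Because MSE squares each error, it is expressed in squared score units and weights large misses more heavily than MAE does. Taking its square root (RMSE) returns it to the score scale: for the "beautiful" model, the square root of the reported MSE is roughly 2.15, against an MAE of about 1.58.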

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Li, Y.; Lu, J.; Meng, Y.; Luo, Y.; Ren, J. Exploring Urban Spatial Quality Through Street View Imagery and Human Perception Analysis. Buildings 2025, 15, 3116. https://doi.org/10.3390/buildings15173116

