Street View Image-Based Emotional Perception Modeling of Old Residential Communities: An Explainable Framework Integrating Random Forest and SHAP

Xu, Yanqing; Fan, Xiaoxuan

doi:10.3390/ijgi14120471


by Yanqing Xu ¹,²,* and Xiaoxuan Fan ¹

¹ Department of Architecture, Yangzhou University, Yangzhou 225127, China

² Department of Architecture, Southeast University, Nanjing 210096, China

* Author to whom correspondence should be addressed.

ISPRS Int. J. Geo-Inf. 2025, 14(12), 471; https://doi.org/10.3390/ijgi14120471

Submission received: 9 September 2025 / Revised: 18 November 2025 / Accepted: 27 November 2025 / Published: 29 November 2025

(This article belongs to the Special Issue Spatial Information for Improved Living Spaces)


Abstract

Understanding how the built environment shapes residents’ emotional perceptions in old residential communities (ORCs) is essential for enhancing livability and supporting people-oriented urban regeneration. This study proposes an explainable analytical framework that integrates community attributes, streetscape indicators, and subjective evaluations. Using random forest (RF) regression combined with Shapley Additive Explanations (SHAP), we conducted an empirical study on ten ORCs in Yangzhou, China. A total of 1240 street view images (SVIs) were processed to extract social attributes, including building age, building scale, and point-of-interest (POI) diversity, as well as visual indicators such as walkability, green view index (GVI), and colorfulness. Six emotional perception scores were obtained from the MIT Place Pulse 2.0 model and further calibrated through questionnaires. The results show that the proposed framework effectively captures the spatial determinants of residents’ perceptions, with the model predictions being highly consistent with survey evaluations. Specifically, GVI and street enclosure are positively associated with perceptions of beauty, safety, and vitality, while building aging and functional monotony intensify negative feelings such as oppression and boredom. Visual diversity (VD) enhances aesthetic and vitality perceptions, whereas facility visual entropy demonstrates a dual role—reinforcing safety but potentially inducing oppressive feelings. By integrating interpretable machine learning with geospatial analysis, this study provides both theoretical and practical insights for micro-scale community renewal, and the framework can be extended to multimodal analyses including soundscapes and behavioral pathways.

Keywords:

street view images (SVIs); old residential communities (ORCs); emotional correlation; random forest (RF) models; Shapley additive explanations (SHAP)

1. Introduction

Urban renewal has become a vital strategy for addressing spatial decline, inefficient resource utilization, and the deterioration of living environments worldwide [1]. As an essential component of the urban stock, the renewal of old residential communities (ORCs) has shifted from repair-oriented interventions toward integrated governance, emphasizing environmental quality and residents’ well-being. Within the people-oriented development paradigm, growing attention is paid to residents’ subjective experiences and the creation of emotionally resonant living environments, which has emerged as an important theme in urban regeneration research [2,3,4]. In China, large-scale residential communities developed during the rapid urbanization of the mid-to-late 20th century are now facing severe challenges, including building deterioration, lack of public spaces, and declining environmental quality. Since 2017, the government has advanced pilot programs for old community renovation, and in 2020, the State Council issued the Guiding Opinions on Comprehensively Promoting the Renovation of ORCs [5], marking a transition from physical repairs to multidimensional renewal integrating environmental enhancement, community governance, and people-oriented development [6]. Such transformation requires multidisciplinary knowledge and systematic evaluation tools. Within this context, understanding how the built environment affects residents’ emotional perceptions is crucial for identifying problematic spaces and formulating targeted renewal strategies [7].

Streetscapes, often perceived as “outdoor rooms” when individuals walk through streets, turn corners, or leave their homes [8], represent the most frequently experienced public spaces for residents. They are increasingly regarded as critical entry points for assessing environmental quality and renewal potential in residential communities. As Wang et al. [9] emphasized, street view images (SVIs) provide a low-cost and scalable approach to capturing information on the urban built environment, enabling the identification of spatial quality issues and supporting strategy formulation for renewal. Recent advances in image analysis, particularly the application of deep learning–based semantic segmentation, allow researchers to extract spatial visual features such as the green view index (GVI) and sky view factor (SVF), thus quantifying the physical attributes of the built environment. For example, Ma et al. [10] employed the SegNet architecture to derive indicators such as openness and greenness, confirming the role of visual perception features in evaluating spatial livability. These methodological breakthroughs have advanced streetscape studies from subjective, manual evaluations toward the high-precision, scalable modeling of visual perception.

Nevertheless, systematic research linking streetscape environments with residents’ emotional perceptions in ORCs remains limited. Existing studies often emphasize physical features of SVIs, while neglecting essential contextual attributes such as building age, point-of-interest (POI) diversity, and residential scale. This omission restricts the comprehensive assessment of how visual and background factors jointly shape emotional experiences. Psyllidis et al. [11], for instance, demonstrated the importance of POI coverage, quality, and classification in built environment analysis, but their work concentrated primarily on geographic representation and application frameworks, without integrating fundamental community attributes. As a result, the generalizability of such models across diverse community types remains limited. At the same time, many studies remain confined to static classifications or global correlations, lacking systematic interpretation of the mechanisms through which environmental variables interact to shape emotions. For example, Luo et al. [12] combined deep learning–predicted emotional labels with building characteristics to examine housing prices, but they relied mainly on global correlations and linear regression, overlooking explanatory pathways and interaction effects. Similarly, Lindal et al. [13] analyzed how building height and morphology influence perceived restorativeness, yet they employed mean comparisons and conventional regression models, limiting interpretability. Although some studies attempt to integrate surveys and online data for analyzing residents’ perceptions, systematic modeling of emotional scores derived from SVIs remains scarce in the context of ORCs.

These limitations underscore the urgent need for a multidimensional analytical framework that integrates SVIs, community attributes, and emotional evaluations. By leveraging interpretable machine learning methods, such a framework could systematically identify key spatial and visual drivers of residents’ emotions, offering both theoretical insight and practical guidance for people-oriented and fine-grained community renewal.

This study focuses on several typical ORCs in Yangzhou, China, and proposes an integrated framework of “community attributes–streetscape features–emotional perception” to reveal the pathways and interpretable mechanisms linking the built environment and residents’ emotions. By incorporating architectural and social attributes, semantic features of SVIs, and subjective perception scores into a unified modeling system, the study employs random forest (RF) regression to capture nonlinear relationships and applies SHAP to quantify the contributions and interactions of key variables.

The main contributions are threefold:

Data: SVI is introduced as a primary data source, enabling fine-grained quantification of old residential environments beyond traditional surveys or statistics.
Method: The RF–SHAP approach captures nonlinearities and interaction effects, overcoming the limits of linear models.
Mechanism: Interpretable analysis identifies differentiated roles of variables such as GVI, street enclosure, and visual diversity (VD) across emotional dimensions, while revealing the complexity of safety perception and POI diversity.

Overall, the proposed framework offers a scalable and interpretable tool for people-oriented community renewal, with potential to expand toward multimodal and cross-regional applications. More importantly, this study advances perception-based urban analysis by introducing a localized, interpretable modeling framework that bridges community-scale spatial attributes and emotional responses, providing methodological innovation beyond existing large-scale perception studies.

2. Research Cases and Data Sources

2.1. Research Cases

This study focuses on Yangzhou, a historic city in Jiangsu Province, China, located along the lower Yangtze River. Due to its early urbanization, the central districts of Yangzhou retain many residential communities built between the 1960s and 1990s, which are characterized by aging facilities, complex spatial structures, and pressing renewal demands.

From 50 such neighborhoods in the central city, 10 representative cases were selected based on four criteria: (1) constructed before 2000, showing typical features of ORCs; (2) located within central districts, covering diverse urban forms; (3) representative in terms of facility deterioration, density, and environmental quality; and (4) typologically diverse, including state-owned enterprise housing, collective housing, and early commercial housing. Figure 1 illustrates the spatial distribution of the ten selected ORCs within Yangzhou, while Table 1 summarizes their basic characteristics.

These cases encompass historic cores, traditional residential zones, and peripheral expansion areas. Their structural types include brick-concrete, low-rise, and early slab-type apartments, while functionally they cover multiple housing forms. Most are already included in ongoing renovation plans, providing real renewal pressure and perceptual improvement potential. Together, the cases offer a solid basis for analyzing both commonalities and variations in visual perception and spatial interventions in ORCs.

2.2. Data Source

2.2.1. SVIs

In the era of big data, Baidu Street View (BSV) imagery has become one of the most important data sources for studying urban streets in China [14]. Baidu Maps, operated by Baidu Inc., is a major commercial online mapping platform that provides nationwide high-resolution street view coverage comparable to Google Street View. Although Google Maps is not accessible in mainland China, Baidu Maps offers a practical alternative with reliable image quality and wide spatial coverage. It should be noted that Baidu Maps is not an open-source platform; however, it provides an open-access Application Programming Interface (API) that allows registered users to programmatically obtain SVIs for research and non-commercial purposes. In this study, BSV imagery was acquired via the official Baidu Maps API (Baidu, n.d., https://lbsyun.baidu.com/index.php?title=webapi/guide/webservice-streetview, accessed on 9 April 2025).

SVIs were collected using the external boundaries of each residential community as the sampling range, with sampling points established every 20 m along boundary roads. Two complementary methods were employed. (1) Automated collection: Road network data from OpenStreetMap (OSM) were imported into ArcGIS 10.8 to determine 310 sampling points, which were then matched with the Baidu Maps API to automatically retrieve SVIs at four horizontal angles (0°, 90°, 180°, and 270°). (2) On-site collection: Areas not covered by the online API were supplemented through field surveys using a street view camera. To ensure consistency, all images were captured in Yangzhou during April–May under clear weather conditions, between 10:00 a.m. and 2:00 p.m. In total, 1240 images were obtained and subsequently processed by cropping, illumination normalization, and perspective correction to ensure uniform size and quality (Figure 2).
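The sampling scheme described above (one point every 20 m along the boundary roads, four headings per point) can be illustrated with a minimal sketch. It assumes the road is already available as a polyline in planar coordinates measured in metres; the function names `sample_points` and `image_requests` are hypothetical, and the sketch reproduces neither the ArcGIS workflow nor any Baidu Maps API call.

```python
import math

HEADINGS = (0, 90, 180, 270)  # horizontal angles stated in the text


def sample_points(polyline, spacing=20.0):
    """Return points every `spacing` metres along a polyline of (x, y) vertices.

    Assumption: coordinates are planar and in metres (a projected CRS).
    The first vertex is always included as a sampling point.
    """
    points = [polyline[0]]
    carried = 0.0  # distance travelled since the last emitted point
    for (x0, y0), (x1, y1) in zip(polyline, polyline[1:]):
        seg_len = math.hypot(x1 - x0, y1 - y0)
        if seg_len == 0:
            continue
        offset = spacing - carried  # distance into this segment of the next point
        while offset <= seg_len:
            t = offset / seg_len
            points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            offset += spacing
        carried = seg_len - (offset - spacing)
    return points


def image_requests(points, headings=HEADINGS):
    """Pair every sampling point with every heading: one image per pair."""
    return [(x, y, h) for (x, y) in points for h in headings]


# Hypothetical L-shaped boundary road: 100 m east, then 50 m north.
road = [(0.0, 0.0), (100.0, 0.0), (100.0, 50.0)]
pts = sample_points(road)
requests = image_requests(pts)
```

With four headings per point, the 310 sampling points reported above correspond to 310 × 4 = 1240 images, which matches the stated total.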

2.2.2. Attribute Indicators of Residential Communities

To comprehensively characterize the spatial utilization patterns and functional foundations of ORCs, this study establishes an attribute indicator system encompassing population vitality, POI diversity, building age, and community scale, serving as the basis for emotional perception modeling and interpretability analysis.

Population vitality was used to represent the level of human activity and social interaction within each community. To enhance the objectivity of measurement and reduce potential subjectivity in expert scoring, this indicator was constructed from two complementary components:

(1) Resident density, derived from the latest community statistical yearbook and census data, calculated as the number of permanent residents per unit area (persons per hectare); and (2) Expert evaluation, jointly conducted by three professionals (community administrators, property managers, and resident representatives) with at least five years of management or residence experience. They assessed the intensity of pedestrian flows and the frequency of social and commercial activities based on field observations along the main access roads and internal public spaces of each community, using a five-point grading system (1 = very low activity, 5 = very high activity). The two components were normalized to a 0–1 scale and integrated through equal-weight averaging (0.5 each) to ensure a balanced representation of both quantitative demographic features and experiential social dynamics.

This hybrid approach allows the indicator to capture both the objective concentration of residents and the subjective perception of activity intensity, providing a more comprehensive and transferable measure of population vitality across communities.
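As an illustration of the equal-weight integration described above, the following sketch combines the two components. The text states only that both components were normalized to a 0–1 scale; min-max scaling across communities is assumed here, and the input values are hypothetical rather than taken from the study.

```python
def min_max(values):
    """Rescale a list of numbers to the 0-1 range (constant lists map to 0.0)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def population_vitality(density, expert_scores, w_density=0.5):
    """Weighted blend of normalized resident density and expert scores.

    w_density=0.5 reproduces the equal weighting described in the text.
    """
    d_norm = min_max(density)
    e_norm = min_max(expert_scores)
    return [w_density * d + (1 - w_density) * e for d, e in zip(d_norm, e_norm)]


# Hypothetical values for three communities (not from the study).
density = [120.0, 300.0, 210.0]  # persons per hectare
expert = [2.0, 5.0, 3.0]         # mean expert rating on the 1-5 scale
vitality = population_vitality(density, expert)
```

A fixed rescaling of the expert score, such as (score − 1) / 4, would be an equally plausible reading of the normalization step and would change the resulting values.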

To quantify the functional diversity of facilities surrounding each community, we developed a Python-based script utilizing the Gaode Map API (Gaode, n.d., https://lbs.amap.com/, accessed on 20 April 2025) to collect POI data within each community’s actual service boundary. The boundary was delineated manually according to natural barriers (e.g., rivers, viaducts) and accessibility constraints. POIs were categorized into major urban functions such as commerce, education, healthcare, transportation, and leisure. The Shannon–Weaver diversity index was then applied to the category counts, reflecting the degree of mixed use and service accessibility around each community.

Formally, the Shannon diversity index [15,16,17] is defined as:

H = - \sum_{i = 1}^{S} P_{i} \ln (P_{i}) \qquad (1)

where S is the total number of POI categories in the area and P_{i} is the proportion of POIs of type i. This index captures the balance and heterogeneity of functional facilities.
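A minimal sketch of Equation (1) is given below. It takes a list of POI category labels and returns H; the function name and the example labels are illustrative assumptions, not the script used in the study.

```python
import math
from collections import Counter


def shannon_diversity(categories):
    """Shannon diversity H = -sum(P_i * ln P_i) over the observed POI categories."""
    counts = Counter(categories)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical POI labels within one community's service boundary.
pois = ["commerce"] * 6 + ["education"] * 2 + ["healthcare"] * 1 + ["leisure"] * 1
h_mixed = shannon_diversity(pois)
h_even = shannon_diversity(["commerce", "education", "healthcare", "transport", "leisure"])
h_single = shannon_diversity(["commerce"] * 10)
```

H is zero when all POIs fall into a single category and reaches its maximum, ln S, when the S categories are equally represented.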

The indicators of building age and population size were employed to characterize the degree of physical aging and population density, respectively. Building age was categorized into six groups based on the completion time of the main residential buildings and assigned scores ranging from 0 to 5: communities built in 2000 or later were scored as 5; 1995–1999 as 4; 1990–1994 as 3; 1985–1989 as 2; 1980–1984 as 1; and those constructed before 1980 as 0. Population size was quantified by the total number of households and used as a control variable to reflect the community’s scale and service pressure within the urban system.
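The six-level building age coding above can be written as a small lookup function. This is only a sketch of the stated rule; the function name is hypothetical.

```python
def building_age_score(completion_year):
    """Map the completion year of the main buildings to the 0-5 score in the text."""
    if completion_year >= 2000:
        return 5
    if completion_year < 1980:
        return 0
    # 1980-1984 -> 1, 1985-1989 -> 2, 1990-1994 -> 3, 1995-1999 -> 4
    return (completion_year - 1980) // 5 + 1


scores = {year: building_age_score(year) for year in (1975, 1980, 1989, 1994, 1999, 2003)}
```

Under this coding, a higher score indicates a newer community.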

2.2.3. Emotion Perception Scoring and Calibration Methods

This study adopted the ensemble framework proposed by Zhao et al. [18] (SegFormer-B5 + ConvNeXt-B + RF), which integrates objective and subjective visual features to enhance the prediction of emotional perceptions. The model achieved an accuracy of approximately 78.5% on the Place Pulse 2.0 dataset, providing a robust baseline for subsequent localized calibration and regression analyses. In this study, the framework was applied as a pre-trained feature extractor via transfer learning to generate perceptual scores, instead of performing full retraining.

The Place Pulse 2.0 dataset, developed by the Massachusetts Institute of Technology (MIT) Media Lab and formally introduced by Dubey et al. [19], extends the three perceptual dimensions from Place Pulse 1.0 (safe, lively, beautiful) to six (safe, lively, beautiful, wealthy, depressing, and boring). It includes 110,988 Google SVIs from 56 cities across 28 countries, annotated through 1,169,078 pairwise perceptual comparisons contributed by 81,630 participants worldwide. Similar validation practices have also been reported in recent Chinese studies using the Place Pulse 2.0 dataset [20,21,22], further supporting its applicability in localized contexts.

Within the ensemble, SegFormer-B5 extracts objective environmental features—the proportional composition of roads, buildings, vegetation, and other spatial elements—while ConvNeXt-B captures subjective perceptual representations across the six emotional dimensions. The semantic segmentation process enables the model to quantify the spatial composition of key streetscape elements, forming the basis for visual indicators such as the GVI, walkability, and color composition used in subsequent regression analyses. The fused features are then fed into an RF model, which produces six perceptual scores, subsequently discretized into labels ranging from 0 to 9. The RF + SHAP framework was further utilized, not as a replacement for the ensemble, but to perform interpretable regression analysis on the calibrated perceptual scores, thereby elucidating how spatial and visual indicators influence residents’ emotional perceptions.

To ensure the reliability and local validity of both the perception data and the semantic segmentation results, a statistically representative subset of 300 SVIs (about 25% of the total 1240 SVIs) was selected using the finite population sampling formula under a 95% confidence level and a 5–6% margin of error. These images were proportionally sampled from ten ORCs (GR, DH, NM, ZZ, AD, HT, ML, NY, XY, and YY), with roughly 30 images per community to ensure balanced spatial coverage and represent morphological and social diversity. The segmentation accuracy of SegFormer-B5 was further verified through manual inspection of the same 300-image subset, achieving mean Intersection-over-Union (IoU) values of 0.86, 0.79, and 0.83 for buildings, vegetation, and roads, respectively, and an overall mean IoU of 0.83, indicating high agreement between automated segmentation and human annotation.
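The text does not state which finite population formula was used. A common choice, assumed here, is Cochran's formula with the finite population correction, using z = 1.96 for the 95% confidence level and the conservative proportion p = 0.5.

```python
import math


def finite_population_sample_size(population, margin, z=1.96, p=0.5):
    """Cochran sample size with finite population correction, rounded up.

    Assumptions (not stated in the text): z = 1.96 and p = 0.5.
    """
    n0 = z ** 2 * p * (1 - p) / margin ** 2   # infinite-population size
    return math.ceil(n0 / (1 + (n0 - 1) / population))


n_at_5pct = finite_population_sample_size(1240, 0.05)
n_at_6pct = finite_population_sample_size(1240, 0.06)
```

Under these assumptions, a population of 1240 images requires 294 images at a 5% margin and 220 at a 6% margin, so the 300-image subset satisfies the stated 5–6% range.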

For cross-validation of the perception results, 126 residents participated in a questionnaire survey, each evaluating 8–10 images, yielding approximately 1350 valid ratings. Respondents were stratified by gender (48% male, 52% female), age (18–35: 30%; 36–59: 46%; 60+: 24%), residence duration (<5 years: 25%; 5–10 years: 35%; >10 years: 40%), and education level (36% high school or below, 50% college/bachelor, 14% master or above). Each image was rated by 4–5 participants to ensure inter-rater reliability. The correlation analysis between model-predicted and questionnaire-based perception scores showed strong consistency (Spearman’s ρ = 0.63–0.77, p < 0.001), with RMSE values ranging from 0.45 to 0.62 across the six emotional dimensions. These results demonstrate that the perception data used for regression modeling are statistically robust and consistent with residents’ subjective evaluations, ensuring the credibility of the subsequent analyses.
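The consistency check between model-predicted and questionnaire-based scores rests on two statistics, Spearman's ρ and RMSE. The sketch below computes both in plain Python. It assumes that the two score vectors are on the same scale and aligned image by image; the example values are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman_rho(a, b):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(a), average_ranks(b))


def rmse(a, b):
    """Root mean square error between two aligned score vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Hypothetical per-image scores for one emotional dimension.
model_scores = [6.2, 4.1, 7.8, 3.0, 5.5]
survey_scores = [6.0, 4.5, 7.0, 3.5, 5.0]
rho = spearman_rho(model_scores, survey_scores)
error = rmse(model_scores, survey_scores)
```

Because ρ depends only on rank order while RMSE depends on absolute differences, reporting both separates agreement in ordering from agreement in magnitude.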

3. Methodology

3.1. Research Framework

To systematically uncover the intrinsic relationship between the spatial environment of ORCs and residents’ subjective perceptions, this study proposes a three-dimensional analytical framework of “Community Attributes–Streetscape Indicators–Emotional Scores” (Figure 3). The framework consists of three core components:

Data acquisition and feature construction: At the community level, four fundamental attributes are extracted, including building age, population size, POI diversity, and demographic vitality. From streetscape images, ten objective indicators such as GVI, walkability, and VD are derived. In addition, six categories of emotional perception scores are obtained through deep learning models and calibrated with questionnaire surveys.
Regression modeling and predictive analysis: The above variables are used as independent variables, with six types of emotional scores (safety, liveliness, beauty, depression, boredom, and richness) as dependent variables, to construct multiple RF regression models for exploring nonlinear predictive relationships.
Interpretability analysis and result deconstruction: The Shapley Additive Explanations (SHAP) method is introduced to interpret model outcomes, identify key influencing factors, analyze marginal effects and interaction mechanisms, and ultimately propose optimization strategies for spatial environments.

This framework integrates complementary modeling techniques to ensure both predictive accuracy and interpretability. SegFormer-B5 and ConvNeXt-B serve as feature extractors, capturing objective semantic elements (e.g., roads, buildings, vegetation) and subjective perceptual features based on the six emotional dimensions from the MIT Place Pulse 2.0 dataset. The extracted features are then regressed using an RF model, which performs robustly with limited data and captures non-linear relationships. SHAP is further applied to interpret the model outputs and quantify the contribution of each indicator. This design balances technical complexity and explanatory clarity, ensuring the framework remains interpretable and adaptable for urban perception analysis.

It should be noted that in this framework, community attributes and streetscape indicators serve as independent explanatory variables, while emotional perception scores are modeled as dependent variables representing human responses. This unidirectional design prevents circular reasoning and ensures that the framework captures causal relationships from environmental features to emotional outcomes.

3.2. Semantic Segmentation

To achieve accurate extraction and quantitative analysis of spatial elements in streetscape images of ORCs, this study employs the state-of-the-art semantic segmentation model Mask2Former (Masked-attention Mask Transformer) [23,24]. Built upon the Transformer architecture, the model formulates semantic segmentation as a set prediction task, where a mask classification mechanism simultaneously outputs category labels and corresponding mask regions, thereby enabling the unified representation of structural and semantic information.

Mask2Former consists of an encoder, a pixel decoder, and a Transformer decoder. First, high-dimensional semantic features are extracted from the input image through the backbone network to generate a two-dimensional pixel embedding matrix P. Subsequently, the Transformer decoder leverages a set of query vectors Q_{q} to learn potential semantic objects, while the mask head computes the spatial mask of the q-th object via dot-product operations, which can be formally expressed as:

M_{q} = σ (Q_{q} \cdot P^{T}) \qquad (2)

where σ denotes the sigmoid function, and M_{q} represents the spatial mask corresponding to the q-th object.
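Equation (2) can be illustrated numerically with a toy example. The sketch below uses random arrays in place of learned queries and pixel embeddings, and it assumes that P is stored with one row per pixel. It shows only the dot-product mask head and omits the masked attention, multi-scale features, and class prediction of the actual Mask2Former model.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def predict_masks(queries, pixel_embeddings, height, width):
    """Compute M_q = sigmoid(Q_q . P^T) for every query q.

    queries:          (num_queries, C) array, one row per query vector Q_q
    pixel_embeddings: (height * width, C) array, one row per pixel (matrix P)
    Returns an array of shape (num_queries, height, width) with values in (0, 1).
    """
    logits = queries @ pixel_embeddings.T  # (num_queries, height * width)
    return sigmoid(logits).reshape(-1, height, width)


rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))       # 3 toy queries, embedding size 8
P = rng.normal(size=(4 * 5, 8))   # toy 4 x 5 image
masks = predict_masks(Q, P, height=4, width=5)
```

Each query yields one soft mask over the image grid; thresholding the mask, for example at 0.5, would give a binary region for that query.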

Compared with conventional CNN-based semantic segmentation models, Mask2Former offers three major advantages [21]: (1) its attention mechanism provides strong global modeling capability, enabling the capture of complex spatial structures and long-range dependencies within images; (2) it demonstrates excellent task generalization by supporting unified modeling for semantic, instance, and panoptic segmentation; and (3) it exhibits high adaptability and robustness in complex urban streetscapes, achieving accurate recognition of diverse objects such as buildings, roads, vegetation, and pedestrians, making it particularly suitable for image analysis in ORCs.

In this study, image inference was conducted using Mask2Former weights pre-trained on the Mapillary Vistas dataset [25]. This dataset encompasses diverse urban streetscapes with fine-grained category annotations, closely matching the semantic structure of the sample images in this study, thereby ensuring strong transferability of the model. The model outputs 66 semantic categories, and the pixel proportion of each category is calculated as a semantic structural feature, defined as follows:

ϕ_{area, k} = \frac{{pixel}_{k}}{{pixel}_{I}} \qquad (3)

where {pixel}_{k} denotes the number of pixels belonging to category k, and {pixel}_{I} represents the total number of pixels in the image. These semantic features constitute the streetscape vector, which is subsequently used for modeling and analysis.

Figure 4 illustrates the semantic feature extraction workflow based on Mask2Former, which generates pixel-level segmentation maps and area proportions for all 66 semantic categories. In the illustrated example, 24 categories are detected, while undetected categories are assigned an area proportion of zero.
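The area proportions of Equation (3) can be computed directly from a segmentation label map, as sketched below. The sketch assumes that the label map is an integer array with class indices from 0 to 65; the function name and the toy label map are hypothetical.

```python
import numpy as np

NUM_CLASSES = 66  # number of semantic categories stated in the text


def area_proportions(label_map, num_classes=NUM_CLASSES):
    """Return the pixel share of every class; undetected classes get 0."""
    counts = np.bincount(label_map.ravel(), minlength=num_classes)
    return counts / label_map.size


# Toy 2 x 3 label map containing three of the 66 classes.
label_map = np.array([[0, 0, 1],
                      [2, 2, 2]])
phi = area_proportions(label_map)
```

The resulting vector always has 66 entries that sum to one, so images with different sets of detected categories remain directly comparable.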

To ensure the semantic outputs are sufficiently reliable for indicator construction, we conducted an expanded local validation using 105 manually annotated images. The validation set covers a broad spectrum of street environments—including residential streets, commercial frontages, urban arterial roads, and high-traffic intersections—and incorporates stratified sampling to ensure representation from all ten study communities. Ground-truth segmentation masks were produced using the LabelMe tool [26].

Model performance was evaluated using Accuracy, IoU, Precision, Recall, and F1-score for each semantic class. The results indicate that Mask2Former maintains strong generalization in the local context: large and visually stable categories (e.g., sky, vegetation, buildings) achieve consistently high IoU (0.80–0.84) and F1-score (0.84–0.89), reflecting reliable segmentation of dominant urban structures. Facility-related classes show comparatively lower performance due to their small pixel footprint and morphological variability, which is typical in urban street-scene segmentation.
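For reference, the per-class metrics named above can all be derived from pixel-level true positives, false positives, and false negatives. The sketch below uses hypothetical counts. Whether the study pooled counts over the validation set or averaged per-image values is not stated.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall, IoU and F1 for one semantic class from pixel counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "iou": iou, "f1": f1}


# Hypothetical pixel counts for one class (not taken from Table 2).
metrics = class_metrics(tp=80, fp=10, fn=10)
```

IoU penalizes false positives and false negatives more heavily than F1, which is why IoU is the lower of the two values for the same class.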

ORCs’ streets exhibit greater complexity and more diverse, life-oriented semantic classes than urban arterials, which can occasionally lead to segmentation errors such as rare-object misclassification, vehicle occlusion, or large-structure confusion. These errors are infrequent, and because the indicators are derived from pixel proportions or total counts for larger features, their impact on spatial metrics is minimal. Although Mask2Former was not locally fine-tuned, its strong transferability from Mapillary Vistas pretraining ensures robust segmentation of dominant categories. Empirical validation confirms that the resulting pixel-level outputs are sufficiently accurate to support subsequent spatial indicator construction and analyses, despite the limited availability of representative training data for such environments.

Table 2 summarizes the metrics for the seven categories central to this study. In accordance with our indicator definitions, the pedestrian-area category includes sidewalks and crosswalks, while “other facilities” encompasses streetlights, trash bins and related elements. Overall, the expanded validation confirms that Mask2Former provides sufficiently accurate pixel-level predictions across diverse street settings, supporting the robustness of the derived spatial indicators and subsequent perceptual analyses.

Moreover, Mask2Former and its variants have been widely applied in urban environment analysis and visual computing tasks [27]. Liu et al. demonstrated the feasibility of quantifying urban perception from visual features such as semantic segmentation [28], while Ogawa et al. showed that semantic segmentation and deep features can reveal the linkage between environmental elements and subjective perceptions [29]. These studies provide a solid foundation for applying Mask2Former in this research to extract spatial elements and construct quantitative indicators for ORCs.

3.3. The Construction of Comprehensive Street View Indicators

This study develops a set of 10 streetscape indicators to quantify spatial structure, environmental features, and perceptual quality in images, as shown in Table 3. Indicators derived from semantic segmentation and color analysis are integrated into a structured system combining spatial-functional semantics and visual-perceptual information, providing a data foundation for modeling and interpretable analysis.

To verify that the color-related indicators capture distinct perceptual dimensions, we performed an exploratory factor analysis on Colorfulness (C), Color Richness (CR), and Color Harmony (CH). The results identified two clear factors: Factor 1 (Color Abundance), comprising C (loading = 0.684) and CR (loading = 0.726); and Factor 2 (Color Coordination), comprising CH (loading = −0.348).

Collinearity checks confirmed indicator independence: all pairwise correlations were below 0.7 (range: −0.092 to 0.253) and all VIF values were below 1.35. These results demonstrate that the three indicators capture complementary aspects of color perception, with each providing unique information to the model.

3.4. Random Forest

Building upon the multi-dimensional analytical framework, this study employs an RF regression model to quantify the relationship between the built environment and residents’ emotional perceptions. RF is well-suited for handling high-dimensional features, capturing nonlinear relationships, and accounting for variable interactions, without requiring the strict assumptions of normality and independence inherent in linear models. Compared with a single decision tree, the ensemble averaging of multiple trees in RF substantially reduces overfitting, thereby enhancing the model’s generalization capacity and predictive robustness.

RF, proposed by Breiman (2001) [41], generates multiple subsets from the training dataset through bootstrap sampling and uses each subset to train an individual regression tree. The final prediction is obtained by averaging the outputs of all regression trees, as expressed by the following equation:

\hat{y} = \frac{1}{T} \sum_{t = 1}^{T} h_{t} (x) \qquad (4)

where h_{t} (x) denotes the prediction of the t-th tree for input sample x, and T represents the total number of trees. The model in this study incorporates 14 feature variables, including four community attributes (building age, community size, POI diversity, and demographic vitality) and ten streetscape image indicators. Separate models were trained and evaluated for the six categories of emotional perception scores (e.g., sense of safety, sense of richness). Taking the beauty dimension as an example, the model is expressed as:

\hat{y}_{Beautiful} = f_{RF} (X_{SVI}, X_{attr}) \qquad (5)

where X_{SVI} refers to the streetscape image features and X_{attr} denotes the community attribute features. During model training, data were split into training and testing sets at an 8:2 ratio. Hyperparameters were optimized using a combination of Random Search and five-fold cross-validation, with negative MAE as the optimization criterion. Model performance was comprehensively evaluated using MAE, MSE, root mean square error (RMSE), and coefficient of determination (R²).
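The training protocol described above (8:2 split, random search with five-fold cross-validation scored by negative MAE, and evaluation by MAE, MSE, RMSE, and R²) can be sketched with scikit-learn. The data here are synthetic, and the search space, number of search iterations, and random seeds are illustrative assumptions, since the text does not report them.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the 14 features and one emotional score.
rng = np.random.default_rng(42)
X = rng.random((200, 14))
y = 5 * X[:, 0] - 3 * X[:, 1] ** 2 + rng.normal(0, 0.1, 200)

# 8:2 train/test split, as stated in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Hypothetical search space; the study's actual ranges are not reported.
param_distributions = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", 0.5, 1.0],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,                            # assumed number of sampled settings
    cv=5,                                # five-fold cross-validation
    scoring="neg_mean_absolute_error",   # negative MAE criterion
    random_state=42,
)
search.fit(X_train, y_train)
best_rf = search.best_estimator_

# Test-set evaluation with the four reported metrics.
pred = best_rf.predict(X_test)
mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = float(np.sqrt(mse))
r2 = r2_score(y_test, pred)

# Equation (4): the forest prediction is the mean of the tree predictions.
tree_mean = np.mean([tree.predict(X_test) for tree in best_rf.estimators_], axis=0)
```

In the study's setting, this procedure would be repeated once per emotional dimension, giving six separately tuned models.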

RF and its integration with interpretability methods have been widely applied in urban perception and planning research. For instance, Deb and Smith [42] combined RF with the SHAP Tree Explainer to analyze economic mobility and spatial inequality, effectively revealing variable pathways and providing a novel quantitative tool for spatial equity studies. Similarly, Wu et al. [43] introduced POI data into an RF framework, demonstrating its significant advantages in handling complex nonlinear relationships and capturing land-use dynamics. Therefore, RF not only ensures robustness and generalization in high-dimensional contexts but also enhances transparency when combined with interpretability methods, offering a reliable theoretical and methodological foundation for urban renewal and environmental perception studies.

3.5. SHAP

To enhance the interpretability of the RF model, this study introduces the SHAP method to provide both global and local explanations for emotional perception predictions. SHAP is grounded in the Shapley value principle from cooperative game theory, which quantifies the fair contribution of each feature by computing its marginal effect across different subsets. Its core formulation is:

Φ_{i} = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|! (|F| - |S| - 1)!}{|F|!} [f_{S \cup \{i\}} (x_{S \cup \{i\}}) - f_{S} (x_{S})]

(6)

where Φ_{i} denotes the SHAP value of feature i, F represents the full feature set, S is a subset of features, and f is the prediction function.
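For a small feature set, Equation (6) can be evaluated exactly by enumerating all subsets. The sketch below is illustrative only: the toy model and the function names are hypothetical, and f_S(x_S) is approximated by replacing the features outside S with baseline values, which is an assumption and not necessarily how the tree-based explainer used in practice handles missing features.

```python
from itertools import combinations
from math import factorial, isclose

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x, following Equation (6).

    Assumption: f_S(x_S) is approximated by evaluating f with every
    feature outside S replaced by its baseline value.
    """
    n = len(x)
    features = range(n)

    def value(subset):
        z = [x[j] if j in subset else baseline[j] for j in features]
        return f(z)

    phi = []
    for i in features:
        others = [j for j in features if j != i]
        total = 0.0
        for size in range(len(others) + 1):
            # Weight |S|! (|F| - |S| - 1)! / |F|! for subsets of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_model(z):
    """Hypothetical model with one additive term and one interaction term."""
    return 2.0 * z[0] + z[1] * z[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

In this toy case the additive feature receives its full effect, the two interacting features share the interaction term equally, and the values sum to the difference between the prediction at x and the prediction at the baseline.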

In this study, SHAP values are applied to interpret the contribution and direction of 14 input features (streetscape indicators and community attributes) to six categories of emotional perception scores (e.g., beauty, safety, depression). The absolute value of a SHAP value reflects the strength of influence, and its sign gives the direction: a positive value indicates a facilitating effect, while a negative value suggests a suppressing effect. For example, in the “liveliness” model, VD has a large positive SHAP value, indicating its strong contribution to enhancing liveliness perception, whereas street enclosure shows a negative SHAP value, suggesting a suppressive effect. Moreover, SHAP distributions reveal nonlinear feature–emotion relationships, such as the stronger alleviating effect of moderate POI diversity on “depression.” This approach helps identify core variables influencing emotional perceptions and provides a quantitative basis for designing perception-oriented interventions in streetscapes and communities.

In recent years, the integration of RF with SHAP has been increasingly applied to urban studies. The SHAP framework, proposed by Lundberg and Lee [44], provides a unified approach to interpreting model predictions based on game theory. Building upon this interpretable foundation, Deb and Smith [42] applied RF–SHAP analysis to explore the effects of socioeconomic variables on spatial inequality, while Li and Managi [45] examined how community attachment and livability influence environmental behaviors. Xu et al. [46] further demonstrated its effectiveness in analyzing nonlinear relationships between the built environment and carbon emissions. These studies confirm that combining machine learning with SHAP can effectively uncover the complex mechanisms linking spatial environments and human behavior. Building on this tradition, this study introduces SHAP to emotional perception modeling in ORCs, providing a solid theoretical and empirical foundation for exploring the interactions between the built environment and perception.

4. Results

4.1. Analysis of Emotion Perception Modeling Results Based on RF

Before modeling emotional perception, this study constructed a residential area attribute index system that covers construction age, POI diversity, residential area population size, and population vitality. The aim was to quantify the spatial and social characteristics of ORCs. These variables constitute the basic input for the emotional regression model and are important in revealing the sources of emotional differences. Table 4 shows the specific values of ten typical ORCs in each index and reveals obvious spatial heterogeneity.

For example, the GR community records relatively low values across all indicators: a construction age score of 0, a POI diversity index of 1.320, and a population vitality index of 1.846. These results reflect the area’s functional obsolescence, monotonous streetscape, and underutilized public spaces. In contrast, the DH community performs significantly better on all measures, with a construction age score of 1, a POI diversity index of 1.492, and a population vitality index of 3.375. These higher scores suggest a stronger foundation of service facilities and more vibrant population activity, which are likely to foster more positive emotional experiences. Such differences provide an important basis for the model’s explanatory power.

Subsequently, the study incorporates residential area attributes and SVI features into an RF regression model to estimate the scores of six emotional perception dimensions. To assess model performance, six regression evaluation metrics were calculated, including MAE, mean squared error (MSE), and R². Table 5 reports the performance of each emotional dimension model under the main evaluation criteria.

Overall, the RF model exhibits a moderate but stable predictive performance across most emotional dimensions. On average, the model achieves an R² of 0.616 and an MAE of 0.589. These values indicate that the model can reasonably capture the impacts of streetscapes and residential areas on emotional perception. The model performs particularly well in the “beauty” and “oppression” dimensions, with R² values of 0.705 and 0.666, respectively, and MAE values ranging from 0.492 to 0.634. These results suggest that perceptions of beauty and oppression are highly dependent on the quality of the visual landscape and the atmosphere of the streets.

In contrast, the model performs relatively poorly in the “safety” dimension, with an R² of only 0.512. This suggests that perceptions of safety may be more strongly influenced by non-visual factors such as public security, neighborhood relations, and historical events, while streetscape and physical spatial variables provide limited explanatory power.

Regarding model stability, the average MAE of a single tree is 0.768, whereas the overall MAE decreases to 0.568 after ensemble aggregation. This confirms that the multi-tree ensemble mechanism of RF improves predictive stability and reduces overfitting risk. Further evaluation shows a margin strength of –0.351, an inter-tree correlation of 0.474, and an estimated upper bound of generalization error of 3.372. To further validate the statistical significance and robustness of the model, a permutation test was performed with 1000 iterations by randomly shuffling the training labels. The original model achieved an MAE of 0.5331 on the independent test set, whereas the mean MAE of the permuted models was 0.8370 ± 0.0225 (p < 0.001). These results demonstrate that the model predictions are significantly better than random chance and confirm stable and reliable performance. Taken together, these results indicate that although the predictive capacity of individual trees is modest, the integration of moderately correlated learners yields a model with strong generalization ability and stable overall performance.
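Two of the figures in this paragraph can be checked with simple arithmetic. The sketch below assumes that the reported upper bound follows Breiman's expression ρ̄(1 − s²)/s² (mean inter-tree correlation ρ̄ and strength s), and that the permutation p-value uses the common "+1" correction; neither is stated explicitly in the text, and the function names are hypothetical.

```python
def breiman_upper_bound(mean_correlation, strength):
    """Breiman's bound: mean_correlation * (1 - s^2) / s^2 (assumed formula)."""
    s2 = strength ** 2
    return mean_correlation * (1.0 - s2) / s2

def permutation_p_value(observed_mae, permuted_maes):
    """Share of permuted-label models whose MAE is at most the observed MAE (+1 corrected)."""
    at_least_as_good = sum(1 for m in permuted_maes if m <= observed_mae)
    return (1 + at_least_as_good) / (1 + len(permuted_maes))

# Values reported in the text: inter-tree correlation 0.474, margin strength -0.351
bound = breiman_upper_bound(0.474, -0.351)
print(round(bound, 2))

# Placeholder permutation results: 1000 permuted models at the reported mean MAE (0.8370).
# The real per-permutation MAEs are not given in the text.
p_value = permutation_p_value(0.5331, [0.8370] * 1000)
print(p_value)
```

Under these assumptions the bound evaluates to about 3.37, in line with the reported 3.372 up to rounding of the inputs, and a test in which no permuted model matches the observed MAE yields p = 1/1001, which is consistent with the reported p < 0.001.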

To further evaluate the contribution of community attributes, an additional ablation experiment was conducted comparing two models:

Using visual indicators only;
Using both visual indicators and community attributes.

The results demonstrated that including community attributes improved the mean R² from 0.549 to 0.616 and reduced the mean MAE from 0.636 to 0.589, while the RMSE decreased from 0.817 to 0.755. These gains indicate that community-level contextual variables, particularly population vitality, enhance the interpretive and predictive capacity of the model by providing social–spatial context beyond the visual environment. The results of this comparison are presented in Table 6.
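The size of the ablation effect is easier to judge in relative terms. The short sketch below recomputes the absolute and relative changes from the values reported above; the helper name is hypothetical.

```python
# (visual indicators only, visual indicators + community attributes), as reported in the text
ablation = {
    "R2": (0.549, 0.616),
    "MAE": (0.636, 0.589),
    "RMSE": (0.817, 0.755),
}

def relative_change(before, after):
    """Change relative to the visual-only model."""
    return (after - before) / before

for name, (visual_only, full) in ablation.items():
    change = relative_change(visual_only, full)
    print(f"{name}: {full - visual_only:+.3f} ({change:+.1%})")
```

This corresponds to a gain of roughly 12% in mean R² and reductions of roughly 7% to 8% in the two error metrics.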

4.2. Interpretability Analysis of Emotional Perception in Streetscapes

4.2.1. Impact of Streetscape Features on Emotional Perception

To further examine how spatial features extracted from SVIs shape residents’ emotional perceptions, this study applies the SHAP method to interpret the regression results across six emotional dimensions (beautiful, safe, lively, wealthy, depressing, and boring) using a dataset of 1240 SVIs. Figure 5 displays a heatmap of the mean absolute SHAP values for 14 predictors across the six dimensions. This visualization quantifies the average magnitude of each variable’s marginal contribution to the model output; because absolute values are averaged, it conveys the strength of influence rather than its direction. In the heatmap, red indicates stronger contributions, while blue denotes weaker ones.

The overall results reveal notable variations in the contribution of key variables across different emotional dimensions. Nevertheless, enclosure, VD, and facility visual entropy consistently exhibit strong explanatory power, emerging as core determinants of emotional perception. Among these, enclosure shows the highest importance value (0.283) in the depressing dimension, suggesting that enclosed street scenes are more likely to evoke negative emotions. It also exerts significant influence on perceptions of beautiful, lively, and boring, underscoring its multi-dimensional impact. VD achieves a mean absolute SHAP value of 0.274 in the lively dimension, highlighting the role of element richness in enhancing residents’ perception of liveliness. Meanwhile, facility visual entropy demonstrates strong explanatory capacity in both the safe and depressing dimensions, suggesting that the complexity of street facilities is critical in regulating residents’ emotional responses.

As for secondary variables, the SVF exerts a relatively strong and consistent influence across all six emotional dimensions, demonstrating both stability and universality. This finding implies that its mechanism of action is systemic and multi-layered. In addition, colorfulness contributes notably to the lively dimension, reinforcing the importance of visual vibrancy in stimulating positive affect. Taken together, Figure 5 reveals a multi-level quantitative association between the street environments of ORCs and residents’ emotional perceptions, offering empirical evidence to guide urban renewal strategies and micro-scale street interventions.
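The importance values discussed for Figure 5 are column-wise means of absolute SHAP values. A minimal sketch of that aggregation follows; the SHAP values, the three-feature subset, and the function name are hypothetical and serve only to show the computation.

```python
def mean_abs_shap(shap_rows):
    """Column-wise mean of |SHAP| over samples; shap_rows holds one list per sample."""
    n = len(shap_rows)
    return [sum(abs(row[j]) for row in shap_rows) / n for j in range(len(shap_rows[0]))]

# Hypothetical SHAP values for 4 samples x 3 features in one emotion model
features = ["enclosure", "VD", "GVI"]
shap_lively = [
    [-0.30, 0.40, 0.05],
    [0.20, 0.10, -0.05],
    [-0.10, 0.30, 0.10],
    [0.40, -0.40, 0.00],
]

importance = dict(zip(features, mean_abs_shap(shap_lively)))
ranking = sorted(importance, key=importance.get, reverse=True)
print(importance)
print(ranking)

# The signed mean can be close to zero even when the feature matters a lot.
signed_mean_enclosure = sum(row[0] for row in shap_lively) / len(shap_lively)
print(signed_mean_enclosure)
```

Repeating this for each of the six emotion models yields one row per dimension, which is the layout of a heatmap such as Figure 5.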

Figure 6 further presents SHAP summary plots for the six emotion models, illustrating both the positive and negative effects of each variable on model predictions and their distribution across different feature values. The results show that perceptions of beauty and safety are primarily enhanced by higher levels of enclosure and VD, whereas perceptions of a space as depressing or boring are reinforced by high enclosure and low greenery coverage. These patterns suggest that enclosed and monotonous streetscapes are more likely to provoke negative emotional experiences.

Further rankings of variable importance are illustrated in Figure 7, highlighting distinct dominant factors across different emotional dimensions. For instance, facility visual entropy contributes substantially to both the depressing and safe models, whereas population vitality ranks among the top predictors in the wealthy model. This pattern reflects the interplay between the social attributes and visual characteristics of residential areas. These findings indicate that each emotional dimension is shaped by a unique combination of streetscape elements and neighborhood features. Consequently, they provide a scientific foundation for targeted interventions aimed at enhancing residents’ emotional perceptions—for example, by adjusting street enclosure levels, improving visual continuity, or achieving a more balanced distribution of facilities.

To examine the nonlinear and threshold effects of key street scene features on emotional perception, SHAP dependence plots were generated for the three most predictive mood dimensions—beautiful, boring, and depression—based on their top four contributing features (Figure 8).

For the “beautiful” dimension, enclosure exhibited a clear decreasing trend, contributing positively below a threshold of 0.31 and negatively beyond it, suggesting excessive enclosure may reduce visual pleasure. Similar threshold effects were observed for VD (17.31) and SVF (0.20), where their contributions reversed after exceeding specific cutoffs. Population vitality showed a tri-phasic pattern, shifting from negative to positive and back to negative, with inflection points at 3.16 and 4.90.

In the “boring” dimension, VD below 17.05 was positively associated with boredom, but the association reversed beyond this value, implying that overly homogeneous scenes induce monotony. Enclosure (0.31), facility visual entropy (2.11), and population vitality (2.68) showed similar turning points, indicating that moderate levels of spatial enclosure, visual complexity, and crowd activity alleviate boredom, but excessive levels may have adverse effects.

In the “depression” dimension, enclosure became a positive contributor above 0.29, intensifying depressive perception. VD (17.46) and SVF (0.19) shifted from positive to negative effects after their respective thresholds, while facility entropy followed a three-phase pattern (2.04 and 3.01), reflecting a dual role of visual complexity.

Overall, SHAP dependence plots reveal significant nonlinearities and threshold dependencies. Enclosure and SVF exhibited monotonic responses within defined intervals, while features like population vitality and visual entropy followed an “optimal-middle” pattern. Precise control over these thresholds is critical for designing emotionally responsive urban street environments.
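One simple way to locate such thresholds is to bin the samples by feature value, average the SHAP values within each bin, and interpolate where the binned mean changes sign. The sketch below illustrates this idea on synthetic data; it is an assumption about how thresholds could be estimated, not the procedure used to produce Figure 8, and the function name is hypothetical.

```python
def shap_sign_changes(values, shap_values, n_bins=10):
    """Feature values at which the binned mean SHAP value changes sign."""
    pairs = sorted(zip(values, shap_values))
    size = max(1, len(pairs) // n_bins)
    bins = [pairs[k:k + size] for k in range(0, len(pairs), size)]
    centers = [sum(v for v, _ in b) / len(b) for b in bins]
    means = [sum(s for _, s in b) / len(b) for b in bins]
    crossings = []
    for k in range(len(centers) - 1):
        x0, y0, x1, y1 = centers[k], means[k], centers[k + 1], means[k + 1]
        if y0 * y1 < 0:
            # Linear interpolation between neighbouring bin centers
            crossings.append(x0 + (x1 - x0) * (0.0 - y0) / (y1 - y0))
    return crossings

# Synthetic dependence pattern: contribution falls linearly and crosses zero at 0.31
values = [i / 100 for i in range(101)]
shap_values = [0.31 - v for v in values]
crossings = shap_sign_changes(values, shap_values)
print(crossings)
```

A feature with a tri-phasic pattern would produce two crossings, one for each inflection point. With real, noisy SHAP values the estimate depends on the bin count, so reported thresholds are best read as approximate.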

4.2.2. Single-Sample Local Interpretation

To reveal the localized mechanisms by which street scene features influence mood perception, this study conducted SHAP waterfall analyses on three representative samples (Figure 9), each corresponding to one of three emotional scene types: positively dominated (a), negatively dominated (b), and mixed (c). Each sample was analyzed based on the ranking of absolute SHAP values to identify the primary contributing factors and their directional effects. For the first two samples, the dimensions of “beauty” and “depression” were selected to represent typical positive and negative emotional responses, respectively, while the third sample, characterized by a more balanced emotional distribution, was examined through the “safety” and “vitality” dimensions that exhibited both positive and negative contributions.

Sample 1 (NY11–90°) was obtained from a relatively new residential district located in the urban core. The scene was captured from a main road with high pedestrian flow, abundant greenery, and vibrant commercial activity, showing strong positive mood perception. In the “beautiful” dimension, enclosure degree (0.098, SHAP = +0.35) and population vitality (4, SHAP = +0.21) emerged as key positive drivers. These features also contributed negatively to the “boredom” dimension, indicating that greater spatial enclosure and active street life can alleviate monotony. Moreover, facility visual entropy (2.303, SHAP = −0.15) played a negative role, suggesting that moderate visual complexity enhances aesthetic appeal.

Sample 2 (GR5–270°) was derived from an aging residential community with minimal greenery and poor infrastructure. The image was taken from a narrow dead-end street with sparse pedestrian activity, exhibiting typical negative emotional features. SHAP results showed that population vitality (1, SHAP = −0.30) and building age (1, SHAP = −0.24) were major negative drivers for the “wealth” dimension. Meanwhile, high enclosure (0.709, SHAP = +0.31), low facility visual entropy (1.306, SHAP = +0.24), and limited VD (11, SHAP = +0.2) contributed positively to “depression”. The image reflected a sense of spatial oppression due to narrow streets and highly enclosed building façades, where monotonous and low-complexity environments further intensified the depressive perception.

Sample 3 (DH25–270°), located in a large and aged neighborhood in the southern part of the city center, exhibited mixed emotional responses. The image was captured on a riverside branch road with poor environmental maintenance and low foot traffic. In the “safety” dimension, population vitality (2, SHAP = +0.13) was the most influential positive factor, while VD (19, SHAP = −0.11) and facility visual entropy (2.532, SHAP = −0.09) had negative impacts, implying that excessive visual complexity may undermine perceived safety. In the “vitality” dimension, both color richness (62.289, SHAP = −0.16) and population vitality (2, SHAP = −0.14) exerted negative effects, whereas facility visual entropy provided a slight positive contribution, suggesting that localized visual disturbance may compensate for otherwise dull environments.

Overall, SHAP waterfall results at the sample level indicate that enclosure degree and population vitality are consistent positive drivers of emotional perception. Meanwhile, facility visual entropy demonstrates a context-dependent dual role: although low entropy tends to diminish positive perceptions, moderate increases in visual complexity can help stimulate positive emotional responses in highly enclosed and low-activity settings.

4.2.3. Mechanism of the Interaction Between Residential Community Attributes and Emotional Perception

Compared with the direct spatial cues embedded in SVI, residential attributes exert a more indirect and comprehensive influence on emotional perception. Their mechanisms often arise from the interplay of multiple factors, including functional layout, demographic composition, and renewal level. Although their explanatory power is weaker than that of image-based features, SHAP analysis shows that certain residential attributes still play significant roles across multiple emotional dimensions.

Among these, population vitality stands out in positive emotions such as beautiful, wealthy, and lively, with SHAP values of 0.168, 0.252, and 0.194, respectively, all showing positive effects. Active foot traffic and frequent social interactions help cultivate an attractive community atmosphere, thereby enhancing aesthetic and vitality perceptions. However, in the safe dimension, its contribution decreases to 0.087 and even becomes negative at high values, suggesting that excessive population concentration may lead to a weakened sense of order and increased uncertainty.

The effect of construction age varies across emotional dimensions. Newly built residential areas perform better in the category of “beautiful”, reflecting the positive impact of modern planning and environmental quality, while renovated housing stock helps mitigate spatial monotony in the boring dimension. Residential scale contributes moderately to lively (SHAP = 0.069), yet insufficient planning in large-scale developments may intensify perceptions of boring and depressing. In contrast, POI diversity has relatively limited explanatory power, showing only partial nonlinear influence on wealthy and depressing.

These findings are further validated by comparing residential sentiment scores (Table 7) with activity levels. For example, the GR residential area, built before 1980, recorded a population vitality score of only 1.846 and a perceived beautiful score of 2.962. By contrast, the NY community, developed after 2000, achieved a vitality score of 3.194 and a corresponding beautiful score as high as 4.660, underscoring the advantages of newly constructed residential areas in planning and landscape quality. Similarly, a larger population size can enhance vitality and alleviate boredom to some extent, but it may also increase perceptions of an area being depressing, reflecting its complex dual effects.

Overall, while residential attributes are not directly visible in street-level images, they still play a significant structural role in emotional perception. Their pathways of action are structurally complex and multidimensional, involving interactions among spatial form, usage density, and resident experience. Future studies should integrate these aspects for more systematic interpretation and validation.

4.2.4. Emotional Effects of Key Street View Indicators

To further elucidate the specific roles of different street scene indicators in residents’ emotional perceptions, this study provides a comprehensive interpretation by integrating SHAP analysis results (Figure 5 and Figure 6) with intergroup comparison experiments (Figure 10). Figure 10 presents the results of comparing average emotional scores between samples ranked in the top 30% and bottom 30% for each indicator value in street scene images. This approach not only visually demonstrates how high- and low-level street scene characteristics differ across emotional dimensions but also statistically validates the reliability of these differences. Overall findings reveal that spatial structure, visual complexity, and natural elements exert distinct influences on both positive and negative emotions, with most dimensional differences reaching statistical significance. This highlights the dual moderating effect of street scene characteristics on residents’ psychological experiences.
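The intergroup comparison described above can be sketched as follows: rank the samples by one indicator, take the top and bottom 30%, and compare the mean emotion score of each group. The data and the function name below are hypothetical, and the significance test applied in the study is not reproduced because the text does not name it.

```python
def compare_top_bottom(indicator, scores, share=0.30):
    """Mean emotion score of the top-`share` and bottom-`share` samples ranked by indicator."""
    order = sorted(range(len(indicator)), key=lambda k: indicator[k])
    k = max(1, round(len(order) * share))
    low, high = order[:k], order[-k:]
    mean_high = sum(scores[j] for j in high) / k
    mean_low = sum(scores[j] for j in low) / k
    return mean_high, mean_low

# Hypothetical enclosure values and "safe" scores for ten images
enclosure = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.00]
safe = [6.0, 6.2, 6.1, 6.5, 6.4, 6.8, 6.9, 7.1, 7.3, 7.2]

mean_high, mean_low = compare_top_bottom(enclosure, safe)
print(mean_high, mean_low)
```

Leaving out the middle 40% of samples sharpens the contrast between high and low indicator levels, at the cost of ignoring how the score changes across the mid-range.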

First, enclosure plays the most prominent role in “safety.” The safety score for highly enclosed streetscapes was 7.212, significantly higher than the 6.126 recorded for low-enclosure areas. However, highly enclosed streetscapes also scored higher in the “depressing” and “boring” dimensions: the “depressing” scores were 6.304 and 4.255 for high and low enclosure, respectively, and the “boring” scores were 6.352 and 4.626. This indicates that high enclosure enhances the sense of safety but may also trigger negative emotions. This finding aligns with the SHAP heatmap, which shows that enclosure has the highest weight in the “depressing” dimension.

Second, VD demonstrated significant advantages in perceptions of an area as “beautiful” and “lively.” High-diversity environments scored 5.442 for “lively,” markedly higher than the 3.835 for low-diversity settings; the corresponding scores for “beautiful” were 4.567 and 3.281. This is consistent with the SHAP analysis, in which VD ranks prominently in positive emotion models, and validates the role of diverse street scene elements in enhancing positive perceptions.

Furthermore, facility visual entropy exhibits a complex dual effect. The “safe” score in high-entropy environments was 7.304, higher than the 6.046 in low-entropy environments. However, high-entropy environments also showed higher levels of “depressing” and “boring”: the “depressing” score was 6.118 versus 4.871 in low-entropy environments, and the “boring” score was 6.395 versus 4.876. This indicates that, while facility layout complexity enhances perceptions of areas as “safe” to some extent, disorder and clutter may impose psychological burdens.

Regarding natural elements, both SVF and GVI exert positive effects on perceptions of “beautiful” and “safe.” The “beautiful” score of environments with high greenery coverage reached 4.476, significantly higher than the value of 3.323 recorded in areas with low greenery coverage. Similarly, the “safe” score of streetscapes with high SVF was 7.164, surpassing the value of 6.083 observed in low-visibility settings. Concurrently, both indicators yielded relatively low scores for “depressing” and “boring,” indicating that openness and natural elements help alleviate residents’ negative emotions.

Therefore, the mechanism of street view indicators exhibits multidimensional variability: positive emotions are primarily driven by VD, green coverage, and moderate enclosure, while negative emotions are more readily triggered in overly enclosed, monotonous, or complexly furnished environments. This finding is corroborated by SHAP analysis, which not only quantifies the relationship between the street environment and residents’ emotional perception but also provides actionable guidance for future neighborhood renewal and microenvironmental interventions.

4.3. Emotional Perception Characteristics of ORCs

Based on semantic segmentation results from 1240 SVIs of ORCs, this study statistically analyzed the overall visual element composition across 10 neighborhoods, with findings presented in Figure 11. Overall, architectural structures and perimeter walls, along with natural environmental elements, dominate the street scenes. This reflects how ORCs retain substantial traditional building facades within the urban fabric while also preserving green vegetation and natural landscapes. This distribution pattern reveals a dual spatial characteristic of ORCs: the continuity of existing buildings coexists with the integration of natural elements.

Visual elements vary significantly across different residential areas. For instance, areas like NY, ZZ, and HT feature a higher proportion of natural elements, presenting street scenes characterized by open spaces and abundant greenery. This aligns closely with their high emotional scores in the “beautiful” dimension, indicating that natural elements play a crucial role in positive perceptions. In contrast, residential areas such as GR, YY, and DH feature higher proportions of buildings and fences, creating a stronger sense of spatial enclosure. While this environmental characteristic enhances residents’ perceptions of an area as “safe,” correlating with higher “safe” scores, it simultaneously restricts accessibility and interactivity in public spaces. Consequently, scores on the “lively” dimension are generally lower, reflecting the tension between environmental enclosure and social vitality.

Figure 12 displays the box plot distributions across six emotional dimensions for 10 ORCs. Overall, the results show that emotional evaluations not only differ in terms of the median values across communities but also exhibit distinct characteristics in terms of dispersion, as measured by the interquartile range (IQR) and extreme value distributions. Three common patterns are observed: For the “safe” dimension, the median values of most communities cluster at relatively high levels, with narrower IQRs—indicating that residents’ perceptions of safety are consistent. Both the median values and IQRs of the “beautiful” and “lively” dimensions vary more significantly across communities, which reflects residents’ higher sensitivity to spatial and landscape qualities. Some communities have higher median values or wider IQRs for the “boring” and “depressing” dimensions, suggesting that negative emotions are more strongly influenced by differences in contextual conditions and functional attributes.
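The two quantities compared in the box plots, the median and the IQR, can be computed per community and per dimension with the Python standard library. The scores below are hypothetical (only the community codes come from the text), and the quartile convention is the default of `statistics.quantiles`, which may differ slightly from the one used to draw Figure 12.

```python
from statistics import median, quantiles

def summarize(scores):
    """Return (median, IQR) for one community's scores in one emotional dimension."""
    q1, _, q3 = quantiles(scores, n=4)  # quartile cut points, default 'exclusive' method
    return median(scores), q3 - q1

# Hypothetical per-image "safe" scores for two communities
safe_scores = {
    "GR": [6.8, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5],
    "NY": [5.0, 5.8, 6.3, 6.9, 7.4, 8.0, 8.6],
}

summary = {name: summarize(vals) for name, vals in safe_scores.items()}
for name, (med, iqr) in summary.items():
    print(name, med, round(iqr, 2))
```

A narrow IQR, as in the first hypothetical community, is what the text interprets as consistent perception among images of the same community.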

Additionally, residential areas exhibit significant differences across the six emotional dimensions. Areas such as NY, ZZ, and HT, which are characterized by a relatively higher proportion of natural environments, consistently achieve higher median scores and lower dispersion in the “beautiful” dimension. This indicates that greenery and open spaces significantly enhance residents’ aesthetic experiences and psychological comfort. Conversely, residential areas such as YY, GR, and DH feature prominent building and wall coverage in their streetscapes, creating a stronger sense of spatial enclosure. These areas scored higher in the “safe” dimension, which is consistent with the positive effect of spatial enclosure on perceived safety. However, this environmental characteristic also inhibits resident interaction and the vitality of public spaces, resulting in generally lower scores in the “lively” dimension. This reveals a tension between safety and liveliness.

In the AD and ML residential areas—where transportation facilities and motor vehicle-related elements account for a higher proportion—residents’ emotional expressions are more complex. On one hand, the transportation infrastructure in these areas improves accessibility; on the other hand, it also creates a sense of oppression and generates environmental noise, leading to relatively higher scores for residents in the “depressing” and “boring” dimensions. This finding suggests that an excessive concentration of transportation functions may impair neighborhood livability and trigger more negative emotions. In contrast, residential areas such as NM and XY exhibit greater visual balance, with minimal variations across emotional dimensions. This indicates that environments featuring a balanced distribution of spatial elements contribute to greater stability in residents’ emotional perception.

The synthesis of Figure 9 and Figure 10 shows that differences in streetscape elements directly affect the distribution of emotions. A high proportion of natural elements typically fosters positive emotions such as “beautiful” and “lively,” whereas excessive building enclosure may enhance the “safe” perception while also intensifying feelings of “boring” and “depressing.” Meanwhile, concentrations of transportation facilities and motor vehicles tend to evoke negative emotions. The observed variations across different residential areas reflect the multidimensional challenges encountered in the renewal of ORCs: how to boost vitality while ensuring safety, how to increase the presence of natural elements while preserving existing buildings, and how to maintain traffic accessibility without imposing environmental burdens. For residential areas such as NY, ZZ, and HT, further optimizing green space design can enhance aesthetic appeal and vitality. For YY, GR, and DH, improving residents’ social interaction requires refining the design of open spaces and street layouts. As for AD and ML, priority should be given to alleviating the adverse effects of excessive concentration of transportation facilities and motor vehicles, so as to balance accessibility with living comfort.

To validate the effectiveness of the SVI-based urban sentiment perception modeling method proposed in this study, we compared the consistency between the average sentiment scores generated by the model and the results of manual subjective evaluations. The subjective evaluation results stem from field research and questionnaire surveys conducted in the 10 ORCs. Respondents included residents, neighborhood committee administrators, grassroots government officials, and urban research experts. A total of 75 questionnaires were distributed, with 75 valid responses collected. Among these, 45 respondents (60%) were community residents, 20 (27%) were community and property management staff, and 10 (13%) were grassroots government officials and urban research experts. Respondents’ age distribution was concentrated in the 31–50 age group (46.7%), followed by 20–30 (30.7%) and 51+ (22.6%). Respondents generally possessed higher educational attainment, with over 70% holding bachelor’s degrees or higher. Regarding duration of community residence or management, 80% had resided or managed for over five years, ensuring familiarity with neighborhood characteristics and response reliability. Consistency testing was conducted using Pearson correlation coefficients, with a significance criterion of Pearson r > 0.6 and p value < 0.05 (95% CI for r does not contain 0) defined as True (significant consistency).

Table 8 presents the validation results, including 95% confidence intervals (CIs) for Pearson r, MAE, and MASE. Four of the six emotion dimensions (Wealthy, Beautiful, Lively, and Safe) show statistically significant and practically meaningful agreement between model predictions and subjective evaluations. Their Pearson correlation coefficients (r) have 95% CIs that do not include zero (e.g., Wealthy: r = 0.843, 95% CI [0.542, 0.954], p = 0.002), and the comparatively narrow 95% CIs of their MAE values further indicate stable numerical prediction accuracy. Moreover, all MASE values are below 1, with CI upper bounds also below 1, indicating that the model outperforms a naive mean-based benchmark for these perceptual categories. In contrast, Boring (r = 0.611, 95% CI [−0.033, 0.895], p = 0.061) approaches but does not reach statistical significance: its CI includes zero, which is consistent with its marginal p-value. Interestingly, Boring also exhibits the narrowest MAE CI ([0.234, 0.348]), suggesting stable absolute prediction errors despite uncertainty in directional agreement. Depressing (r = 0.419, 95% CI [−0.200, 0.793], p = 0.229) has a low correlation whose CI spans zero and lacks a consistent directional trend, reflecting higher uncertainty in this negative-affect dimension.
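
As a brief illustration of why a MASE below 1 indicates an advantage over a naive mean-based benchmark, the sketch below scales the model's MAE by the MAE obtained when every prediction is the mean of the observed scores. This scaling is an assumed reading of the "mean-based benchmark" mentioned above; the exact formula used for Table 8 is not given in the text, and the numbers are hypothetical.

```python
from statistics import mean


def mae(y_true, y_pred):
    """Mean absolute error between observed and predicted scores."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def mase_mean_benchmark(y_true, y_pred):
    """Model MAE divided by the MAE of always predicting mean(y_true)."""
    baseline = mean(y_true)
    naive_mae = mean(abs(t - baseline) for t in y_true)
    return mae(y_true, y_pred) / naive_mae


# Hypothetical survey and model scores, for illustration only.
survey = [3.0, 4.0, 5.0, 6.0, 7.0]
model = [3.5, 4.5, 5.0, 5.5, 6.5]

print(round(mae(survey, model), 3))
print(round(mase_mean_benchmark(survey, model), 3))
```

In this toy case the model MAE is 0.4 and the naive MAE is 1.2, so the ratio is about 0.33; a value above 1 would mean the model is no better than predicting the average score everywhere.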

The relatively wide r-CIs for Boring, and especially for Depressing, primarily stem from the limited number of analysis units (n = 10 communities) and the greater variability typically observed in negative-affect judgments. Rui and Cai [47] have indicated that Place Pulse 2.0 tends to produce relatively higher predictions for negative perceptions such as “boring” and “depressing”, which may be attributed to cultural differences in visual cognition and the regional representativeness of its data samples. This indicates potential contextual limitations of the model when applied in cross-cultural settings.

To summarize, the proposed method aligns model-derived and survey-based assessments for four of the six emotional dimensions. It is strongest in capturing key streetscape components, including the visual dimension and positive affect dimensions (“Beautiful,” “Wealthy”), which supports the feasibility of semantic segmentation and image metric modeling for urban perception research.

5. Discussions

This study develops a streetscape image–based modeling framework for emotional perception in ORCs and applies SHAP analysis to interpret the role of spatial indicators. The results show that street enclosure, VD, GVI, and SVF significantly shape multiple emotional dimensions. The observed consistency between model predictions and survey evaluations demonstrates the methodological reliability and interpretive validity of this approach.

Compared with existing studies, the findings of this research demonstrate strong consistency and explanatory power. Color saturation shows a stable positive effect on “beauty,” “liveliness,” and “richness,” while also alleviating “depression” and “boredom.” This is consistent with the study conducted by Blijlevens et al. [48], who found that higher color saturation can modulate arousal levels to stimulate positive perceptions and reduce negative experiences. Similarly, the positive effects of facility-related visual entropy and convenience on “richness” and “liveliness” confirm the finding of Cilliers et al. [49], who emphasized facilities as core elements of vibrant public spaces, enhancing comfort, social interaction, and spatial identity.

In terms of spatial structure, this study confirms that high enclosure is significantly associated with “oppressiveness” and “depression,” consistent with Zarghami et al. [50], who showed that the visual weight of tall buildings increases feelings of oppression. Regarding natural elements, both GVI and SVF enhance “beauty,” “liveliness,” and “richness,” while mitigating “boredom” and “oppressiveness.” These findings echo the restorative theory of Ulrich et al. [51] and align with Dai et al. [52], who demonstrated that high GVI and SVF improve psychological comfort and aesthetic appreciation while reducing negative emotions.

This study also reveals variable complexity and limitations. The predictive performance for “safety” was weaker, likely because safety depends more on non-visual factors such as security conditions, social trust, and population structure, leading to concentrated evaluations. The negative association between SVF and “safety” accords with Yin et al. [53], who showed that visual openness weakens boundary perception. Finally, POI diversity did not exhibit a stable positive effect, showing nonlinear trends for “liveliness” and “richness.” This suggests that functional quantity and density alone do not guarantee positive emotions; rather, spatial organization, interface quality, and recognizability are more decisive.

These findings have several implications for the renewal of ORCs. Spatial enclosure should be optimized to avoid excessive closure and create streetscapes with appropriate scale and permeability. Façade details and cultural elements can enhance VD, strengthen streetscape identity, and foster residents’ sense of belonging. Natural elements, such as GVI and SVF, should be increased to support psychologically restorative environments. Improving the orderliness, visibility, and diversity of facilities can stimulate vitality, while perceptions of safety require a balance between openness and reinforced boundary design to enhance residents’ perceived controllability.

Nevertheless, some limitations should be acknowledged. The sample was limited to ten communities in Yangzhou, and the overall dataset size remained relatively small, which may restrict the generalizability of the findings; broader comparative studies with larger samples are needed. In addition, some of the emotion perception data used in this study were generated through neural-network-based models. Although these pre-trained models provide reliable large-scale perception data, they may introduce potential uncertainty and bias into the subsequent regression analysis. Moreover, the Place Pulse 2.0 dataset employed for emotion perception training was primarily developed based on Western urban scenes, which may not fully capture the cultural and perceptual characteristics of Chinese urban environments [47]. Therefore, localized datasets and multi-cultural model retraining are needed to enhance cross-cultural applicability.

While the proposed model achieved a moderate but stable predictive performance (average R² ≈ 0.6), this level of accuracy is reasonable given the inherent complexity and subjectivity of emotional perception data. Rather than serving solely as a predictive tool, the framework demonstrates strong methodological potential for integrating spatial, visual, and perceptual dimensions in urban environmental research. It provides a reproducible and interpretable approach to quantify how built environment attributes and visual features collectively shape residents’ emotional experiences, thereby offering valuable guidance for perception-oriented urban renewal and design practices.

Future research could further advance this work in several directions. Expanding the dataset to include a larger number of SVIs from diverse urban areas would help strengthen the statistical robustness of the model and enhance the reliability of perception analysis. Moreover, incorporating street-segment-level semantic mapping and local surveys would provide finer validation and reveal spatial heterogeneity within communities, extending the framework to more granular spatial scales. Incorporating uncertainty quantification and sensitivity analyses, such as Monte Carlo simulations or ensemble variance evaluations, could further assess and improve model stability. Finally, integrating multimodal perceptual elements such as sound, smell, and temporal experience through immersive or VR-based experiments would deepen the understanding of sensory and emotional responses. Collectively, these efforts would help build a more comprehensive, perception-driven renewal framework with real-time feedback, offering stronger support for people-oriented urban regeneration.

6. Conclusions

This study focuses on ORCs and proposes an emotional perception modeling framework that integrates streetscape image features with machine learning. Using SHAP-based interpretability analysis, it reveals how key indicators such as street enclosure, VD, GVI, and SVF influence residents’ emotional responses. The framework effectively captures both positive emotions (e.g., beauty, liveliness, richness) and negative emotions (e.g., oppressiveness, boredom), with results moderately consistent with survey data, confirming its methodological validity and potential as a data-driven approach to perception analysis.

Methodologically, the study contributes a localized and interpretable framework that integrates deep-learning–based emotion inference with explainable regression analysis. This combination bridges quantitative visual representation and subjective emotional perception, offering a transparent and transferable method for understanding how the built environment shapes residents’ experiences.

Theoretically, the findings highlight the critical role of visual openness and natural elements in fostering positive perceptions, while showing how excessive enclosure and functional monotony can reinforce negative emotions. These insights enrich the evidence base of the “space–perception” relationship in environmental psychology. Practically, the framework provides a perception-oriented tool to guide community renewal strategies (such as optimizing enclosure, enhancing VD, integrating natural elements, and improving public facilities) to promote more livable and emotionally responsive environments.

Given the limited sample size of this study, future research could expand the dataset to include a broader range of urban contexts and integrate multimodal perceptual factors such as sound, smell, and temporal experience through immersive or VR-based experiments. Such efforts would further strengthen the robustness and generalizability of perception-driven renewal frameworks and enhance their applicability to diverse urban settings.

Author Contributions

Yanqing Xu: Writing—review & editing, Validation, Supervision, Resources, Project administration, Investigation, Funding acquisition, Data curation, Conceptualization. Xiaoxuan Fan: Writing—review & editing, Writing—original draft, Visualization, Software, Methodology, Formal analysis, Data curation, Conceptualization. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (52408042); Natural Science Foundation of Jiangsu Province (BK20240931); China Postdoctoral Science Funding (2024M750430); Jiangsu Provincial Department of Education Fund of Philosophy and Social Science (2023SJYB2054).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ORCs	old residential communities
RF	random forest
SVIs	street view images
SHAP	Shapley additive explanations
POI	point-of-interest
GVI	green view index
OSM	OpenStreetMap
SVF	sky view factor
IoU	Intersection-over-Union
Mask2Former	masked-attention Mask Transformer
MAE	mean absolute error
MSE	mean squared error
RMSE	root mean square error
CIs	confidence intervals
R²	coefficient of determination
VD	visual diversity

References

1. Nachmany, H.; Hananel, R. The Urban Renewal Matrix. Land Use Policy 2023, 131, 106744.
2. Mouratidis, K. Built Environment and Social Well-Being: How Does Urban Form Affect Social Life and Personal Relationships. Cities 2018, 74, 7–20.
3. Xu, Y.; Zhou, Y. Decision-Making Method for Optimal Renewal Timing of Old Residential Communities Based on the Improved ARP Model. J. Build. Eng. 2025, 111, 113036.
4. Xu, Y.; Juan, Y.-K. Optimal Decision-Making Model for Outdoor Environment Renovation of Old Residential Communities Based on WELL Community Standards in China. Archit. Eng. Des. Manag. 2022, 18, 571–592.
5. Ministry of Housing and Urban-Rural Development and Other Departments Issued the “Notice on Solidly Promoting the Renovation of Old Urban Residential Areas in 2023”, Department News, Chinese Government Website. Available online: https://www.gov.cn/zhengce/content/2020-07/20/content_5528320.htm (accessed on 25 August 2025).
6. Asad Poor, J.; Goh, Y.W.; Thorpe, D. A Human-Centric Participatory Approach to Energy-Efficient Housing Based on Occupants’ Collaborative Image. Open House Int. 2021, 46, 615–635.
7. Kent, J.L.; Ma, L.; Mulley, C. The Objective and Perceived Built Environment: What Matters for Happiness? Cities Health 2017, 1, 59–71.
8. Mahmoud, A.M. The Impact of the Built Environment on Human Behaviors. Int. J. Environ. Sci. Sustain. Dev. 2018, 2, 29–41.
9. Li, Y.; Peng, L.; Wu, C.; Zhang, J. Street View Imagery (SVI) in the Built Environment: A Theoretical and Systematic Review. Buildings 2022, 12, 1167.
10. Ma, X.; Ma, C.; Wu, C.; Xi, Y.; Yang, R.; Peng, N.; Zhang, C.; Ren, F. Measuring human perceptions of streetscapes to better inform urban renewal: A perspective of scene semantic parsing. Cities 2021, 110, 103086.
11. Psyllidis, A.; Gao, S.; Hu, Y.; Kim, E.-K.; McKenzie, G.; Purves, R.; Yuan, M.; Andris, C. Points of Interest (POI): A Commentary on the State of the Art, Challenges, and Prospects for the Future. Comput. Urban Sci. 2022, 2, 20.
12. Luo, L.; Yang, X.; Li, J.; Song, Y.; Zhao, Z. Deciphering House Prices by Integrating Street Perceptions with a Machine-Learning Algorithm: A Case Study of Xi’an, China. Cities 2025, 156, 105542.
13. Lindal, P.J.; Hartig, T. Architectural Variation, Building Height, and the Restorative Quality of Urban Residential Streetscapes. J. Environ. Psychol. 2013, 33, 26–36.
14. Tao, Y.; Wang, Y.; Wang, X.; Tian, G.; Zhang, S. Measuring the correlation between human activity density and streetscape perceptions: An analysis based on baidu street view images in Zhengzhou, China. Land 2022, 11, 400.
15. Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 623–656.
16. Shannon, C.E.; Weaver, W.W. The Mathematical Theory of Communication; University of Illinois Press: Urbana, IL, USA, 1963; 117.
17. Chong, S.K.; Bahrami, M.; Chen, H.; Balcisoy, S.; Bozkaya, B.; Pentland, A.S. Economic Outcomes Predicted by Diversity in Cities. EPJ Data Sci. 2020, 9, 17.
18. Zhao, X.; Lu, Y.; Lin, G. An Integrated Deep Learning Approach for Assessing the Visual Qualities of Built Environments Utilizing Street View Images. Eng. Appl. Artif. Intell. 2024, 130, 107805.
19. Dubey, A.; Naik, N.; Parikh, D.; Raskar, R.; Hidalgo, C.A. Deep Learning the City: Quantifying Urban Perception at a Global Scale. In Proceedings of the European Conference on Computer Vision (ECCV) 2016, Amsterdam, The Netherlands, 8–16 October 2016.
20. Yao, Y.; Wang, J.; Hong, Y.; Qian, C.; Guan, Q.; Liang, X.; Dai, L.; Zhang, J. Discovering the Homogeneous Geographic Domain of Human Perceptions from Street View Images. Landsc. Urban Plan. 2021, 212, 104125.
21. Chen, C.; Li, H.; Luo, W.; Xie, J.; Yao, J.; Wu, L.; Xia, Y. Predicting the effect of street environment on residents’ mood states in large urban areas using machine learning and street view images. Sci. Total Environ. 2022, 816, 151605.
22. Zhong, W.; Wang, L.; Han, X.; Gao, Z. Spatiotemporal Analysis of Urban Perception Using Multi-Year Street View Images and Deep Learning. ISPRS Int. J. Geo-Inf. 2025, 14, 390.
23. Cheng, B.; Schwing, A.G.; Kirillov, A. Per-pixel classification is not all you need for semantic segmentation. Adv. Neural Inf. Process. Syst. (NeurIPS) 2021, 34, 17864–17875.
24. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-Attention Mask Transformer for Universal Image Segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022.
25. Neuhold, G.; Ollmann, T.; Bulo, S.R.; Kontschieder, P. The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
26. Zhang, S.; Lu, J.; Guo, R.; Yang, Y. Exploring the Relationship Between Visual Perception of the Urban Riverfront Core Landscape Area and the Vitality of Riverfront Road: A Case Study of Guangzhou. Land 2024, 13, 2142.
27. Ankareddy, R.; Delhibabu, R. Dense Segmentation Techniques Using Deep Learning for Urban Scene Parsing: A Review. IEEE Access 2025, 13, 34496–34517.
28. Liu, Y.; Chen, M.; Wang, M.; Huang, J.; Thomas, F.; Rahimi, K.; Mamouei, M. An Interpretable Machine Learning Framework for Measuring Urban Perceptions from Panoramic Street View Images. iScience 2023, 26, 106132.
29. Ogawa, Y.; Oki, T.; Zhao, C.; Sekimoto, Y.; Shimizu, C. Evaluating the Subjective Perceptions of Streetscapes Using Street View Images. Landsc. Urban Plan. 2024, 247, 105073.
30. Zhou, H.; He, S.; Cai, Y.; Wang, M.; Su, S. Social Inequalities in Neighborhood Visual Walkability: Using Street View Imagery and Deep Learning Technologies to Facilitate Healthy City Planning. Sustain. Cities Soc. 2019, 50, 101605.
31. Ki, D.; Lee, S. Analyzing the Effects of Green View Index of Neighborhood Streets on Walking Time Using Google Street View and Deep Learning. Landsc. Urban Plan. 2021, 205, 103920.
32. Middel, A.; Lukasczyk, J.; Maciejewski, R.; Demuzere, M.; Roth, M. Sky View Factor Footprints for Urban Climate Modeling. Urban Clim. 2018, 25, 120–134.
33. Stamps, A.E. Advances in Visual Diversity and Entropy. Environ. Plan. B Plan. Des. 2003, 30, 449–463.
34. Ye, X.; Tan, H.; Zhang, Y.; Zhang, L.; Zhang, Z. Research on Convenience Index of Urban Life Based on POI Data. J. Phys. Conf. Ser. 2020, 1646, 012073.
35. Stamps, A.E.; Smith, S. Environmental Enclosure in Urban Settings. Environ. Behav. 2002, 34, 781–794.
36. Zhang, P.; Wang, Y. Research on Public Street Facilities Design in Perspective of City Image. In Proceedings of the 2016 International Conference on Arts, Design and Contemporary Education, Moscow, Russia, 23–25 May 2016; Atlantis Press: Dordrecht, The Netherlands, 2016.
37. Ahn, S.; Lee, K.; Lee, S. Visual Entropy: A New Framework for Quantifying Visual Information Based on Human Perception. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017.
38. Hasler, D.; Suesstrunk, S.E. Measuring Colorfulness in Natural Images. In Proceedings of the Human Vision and Electronic Imaging VIII, Santa Clara, CA, USA, 20–24 January 2003.
39. Sakamoto, H.; Shirakawa, Y.; Katoh, T.; Inoue, K.; Ryu, S.; Tomotari, M. Evaluating Color Diversity by Color Entropy in an Equi-Color-Difference Space for Paintings Illuminated by Highly Color Rendering LEDs. ITE Trans. Electron. Inf. Syst. 2019, 73, 799–806.
40. Sharma, S.; Tandukar, J.; Bista, R. Generating Harmonious Colors through the Combination of N-Grams and K-Means. J. Comput. Theor. Appl. 2023, 1, 140–150.
41. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
42. Deb, D.; Smith, R.M. Application of Random Forest and SHAP Tree Explainer in Exploring Spatial (In)Justice to Aid Urban Planning. ISPRS Int. J. Geo-Inf. 2021, 10, 629.
43. Wu, R.; Wang, J.; Zhang, D.; Wang, S. Identifying Different Types of Urban Land Use Dynamics Using Point-of-Interest (POI) and Random Forest Algorithm: The Case of Huizhou, China. Cities 2021, 114, 103202.
44. Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: Red Hook, NY, USA, 2017.
45. Li, C.; Managi, S. Impacts of Community Attachment and Community Livability on Environmental Activity According to XGBoost and SHAP. Cities 2025, 156, 105559.
46. Xu, C.; Xiong, W.; Zhang, S.; Shi, H.; Wu, S.; Bao, S.; Xiao, T. Research on the Nonlinear Relationship Between Carbon Emissions from Residential Land and the Built Environment: A Case Study of Susong County, Anhui Province Using the XGBoost-SHAP Model. Land 2025, 14, 440.
47. Rui, J.; Cai, C. Plausible or misleading? Evaluating the adaption of the place pulse 2.0 dataset for predicting subjective perception in Chinese urban landscapes. Habitat Int. 2025, 157, 103333.
48. Blijlevens, J.; Carbon, C.; Mugge, R.; Schoormans, J.P.L. Aesthetic Appraisal of Product Designs: Independent Effects of Typicality and Arousal. Br. J. Psychol. 2012, 103, 44–57.
49. Cilliers, E.J.; Timmermans, W.; Van Den Goorbergh, F.; Slijkhuis, J.S.A. Designing Public Spaces through the Lively Planning Integrative Perspective. Environ. Dev. Sustain. 2015, 17, 1367–1380.
50. Zarghami, E.; Karimimoshaver, M.; Ghanbaran, A.; SaadatiVaghar, P. Assessing the Oppressive Impact of the Form of Tall Buildings on Citizens: Height, Width, and Height-to-Width Ratio. Environ. Impact Assess. Rev. 2019, 79, 106287.
51. Ulrich, R.S.; Simons, R.F.; Losito, B.D.; Fiorito, E.; Miles, M.A.; Zelson, M. Stress Recovery during Exposure to Natural and Urban Environments. J. Environ. Psychol. 1991, 11, 201–230.
52. Dai, L.; Zheng, C.; Dong, Z.; Yao, Y.; Wang, R.; Zhang, X.; Ren, S.; Zhang, J.; Song, X.; Guan, Q. Analyzing the Correlation between Visual Space and Residents’ Psychology in Wuhan, China Using Street-View Images and Deep-Learning Technique. City Environ. Interact. 2021, 11, 100069.
53. Yin, L.; Wang, Z. Measuring Visual Enclosure for Street Walkability: Using Machine Learning Algorithms and Google Street View Imagery. Appl. Geogr. 2016, 76, 147–153.

Figure 1. The location of ORCs.

Figure 2. SVIs of the four external boundaries of ORCs from 0° to 270° in four directions.

Figure 3. Research flowchart.

Figure 4. Schematic diagram of semantic segmentation and manual annotation comparison.

Figure 5. SHAP Feature Importance Heatmap.

Figure 6. Importance of factors that influence emotional perception.

Figure 7. Feature importance ranking.

Figure 8. SHAP dependency plots.

Figure 9. SHAP Waterfall Plots.

Figure 10. Emotion perception differences across top and bottom 30% groups of key streetscape features.

Figure 11. The distribution of visual elements based on semantic segmentation of SVIs in different ORCs.

Figure 12. Matrix of box plots for emotional scores of ORCs.

Table 1. The basic situation of the residential communities.

ORCs	Construction Year	Floor Area Ratio	Residential Population Size (Households)	The Basic Situation of the Community
NM	1988	2.09	1162	The site covers an area of 34,783 m², located in the historic core, bounded by Wenhe South Road on the west and Xiao Qinhuai River on the east, adjacent to cultural streets and the Grand Canal. It is dominated by aging six-story residences and renovated historical houses.
ML	1990	1.49	988	The site covers an area of 33,552 m², situated in the northern old town between Shouxihu Road on the west and Shikefa Road on the east, close to Meihualing and the city moat. The housing stock mainly consists of 3–4 story residential blocks and late-20th-century housing reform units.
GR	1958	2.23	961	The site covers an area of 23,277 m², positioned in the eastern sector of the city, bounded by Jiangdu Road on the west and the Grand Canal waterfront on the east, adjoining traditional neighborhoods. It mainly contains six-story residential buildings rebuilt or modified in the late 20th century.
AD	1998	2.15	317	The site covers an area of 16,323 m², located in the northeastern city, between Yunhe North Road on the west and Xiao Yunhe River on the east, with Zhuxi Road to the south and adjacent communities to the north. It is composed predominantly of long-established six-story apartments.
DH	1983	1.17	660	The site covers an area of 61,196 m², found in the southern residential zone between Lianyi Road on the west and Tangwang Road on the east, adjacent to Kaifa East Road and Donghuayuan Street. It comprises 3–5 story housing mixed with commercial facilities.
ZZ	1994	2.11	612	The site covers an area of 13,020 m², located in the northern part of the city, bounded by Shouxihu Road on the west and Jiangdu North Road on the east, near ecological parks and canal landscapes. It is mainly composed of long-standing six-story apartment blocks.
XY	2000	1.72	228	The site covers an area of 11,132 m², positioned in the northwestern sector, between Guangzhou Road on the west and Pingshan South Road on the east, connected to established neighborhoods and secondary roads. The housing stock is dominated by aging six-story estates.
YY	2000	1.97	349	The site covers an area of 20,819 m², situated in a central residential area, bounded by Hanjiang Middle Road on the west and Baixiang Road on the east, flanked by commercial amenities and arterial roads. It mainly consists of six-story residential blocks.
NY	2000	1.57	380	The site covers an area of 22,102 m², located in the western sector, between Xindu South Road on the west and Longcheng Road on the east, adjacent to Nanyuan Road and Changjiang East Road. It is characterized by deteriorating 5–6 story apartments.
HT	1997	1.44	190	The site covers an area of 13,760 m², found in the central city between Yangzijiang North Road on the west and Weiyang Road on the east, close to Shouxihu Scenic Area and canal branches. It mainly comprises 5–6 story residential buildings from the late 20th century.

Table 2. Statistical Data on the Accuracy of 7 Main Semantic Categories Segmented by the Mask2Former Model.

NO.	Category	Accuracy	IoU	Recall	Precision	F1-Score
1	Sky	0.92	0.84	0.88	0.90	0.89
2	Vegetation	0.84	0.81	0.83	0.85	0.84
3	Road	0.77	0.71	0.76	0.78	0.77
4	Building	0.89	0.80	0.87	0.89	0.88
5	Wall	0.85	0.79	0.84	0.86	0.85
6	Pedestrian Area	0.83	0.72	0.81	0.85	0.83
7	Other Facilities	0.69	0.48	0.65	0.72	0.68

Overall accuracy (OA) = 0.83; mean intersection over union (mIoU) = 0.73.
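
For readers less familiar with the per-class columns of Table 2, the following minimal sketch shows how precision, recall, F1-score, and IoU are commonly derived from true-positive, false-positive, and false-negative pixel counts. The pixel counts are hypothetical, and the function names are illustrative rather than taken from the study's code; only the IoU column is copied from Table 2.

```python
def class_metrics(tp, fp, fn):
    """Per-class segmentation metrics from pixel counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


def mean_iou(per_class_iou):
    """Unweighted mean of per-class IoU values."""
    return sum(per_class_iou) / len(per_class_iou)


# Hypothetical pixel counts for a single class, for illustration only.
metrics = class_metrics(tp=880, fp=100, fn=120)
print({name: round(value, 3) for name, value in metrics.items()})

# IoU column of Table 2 (Sky ... Other Facilities).
table2_iou = [0.84, 0.81, 0.71, 0.80, 0.79, 0.72, 0.48]
print(round(mean_iou(table2_iou), 3))
```

Averaging the rounded per-class IoU values gives roughly 0.74, in line with the reported 0.73; the small gap presumably comes from rounding of the per-class entries.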

Table 3. Comprehensive street view indicators.

Indicators	Calculation Formula	Description of Indicator Definitions
Walkability (W)	$W_{i} = p_{\text{sidewalk}} + p_{\text{crosswalk}}$	Refers to the extent to which the environment supports pedestrian movement [30]. It is calculated as the proportion of pedestrian areas, including sidewalks and crosswalks.
Green View Index (GVI)	$GVI_{i} = P_{\text{Vegetation}}$	Refers to the level of greenery exposure actually perceived by pedestrians [31]. It is calculated as the proportion of natural vegetation area within the image.
Sky View Factor (SVF)	$SVF_{i} = P_{\text{Sky}}$	An indicator used to evaluate the openness of streets [32]. It is defined as the ratio of visible sky area to the total sky hemisphere at a given spatial point.
Visual Diversity (VD)	$VD = \frac{N_{\text{unique}}}{N_{\text{max}}}$	This indicator is used to measure visual complexity [33]. $N_{\text{unique}}$ represents the actual number of non-repetitive object categories that appear in the semantic segmentation results (excluding the background), and $N_{\text{max}}$ represents the preset maximum possible number of object categories in the corresponding scene.
Facility Convenience (FC)	$FC_{i} = P_{\text{facility}}$	Reflects the convenience level of urban life [34]. It is measured by counting the pixels of all street facilities (e.g., benches, streetlights, bus stops) and calculating their proportion in the streetscape.
Enclosure (E)	$E_{i} = P_{\text{Building}} + P_{\text{Wall}}$	Reflects the subjective sense of being enclosed caused by physical constraints in urban spaces [35]. It is defined as the proportion of enclosed elements (e.g., buildings, walls) in the streetscape.
Facility Visual Entropy (FE)	$FE_{i} = -\sum_{i=1}^{n} p_{i} \times \log_{2}(p_{i}), \quad p_{i} = \frac{P_{i}}{\sum_{j=1}^{n} P_{j}}$	Based on Shannon’s information entropy theory, this indicator evaluates the visual complexity of facilities in the streetscape [36,37]. Here, it represents the visual diversity of facility elements in the image, where n is the number of facility categories included in the statistics, and $P_{i}$ is the pixel proportion of the i-th facility category.
Colorfulness (C)	$C = \sqrt{\sigma_{a}^{2} + \sigma_{b}^{2}} + 0.3 \times \sqrt{\mu_{a}^{2} + \mu_{b}^{2}}$	Its meaning is to quantify the overall color richness of a natural image. Here, $\sigma$ reflects the dispersion of color distribution, while $\mu$ demonstrates the degree of color deviation from the neutral axis [38].
Color Richness (CR)	$CR = -\sum_{i=1}^{n} p_{i} \log_{2} p_{i}$	Information entropy is used to evaluate uncertainty in data, reflecting the dispersion and uniformity of color distribution [39]. It is calculated based on the entropy of the image’s color distribution, where higher values indicate greater color diversity.
Color Harmony (CH)	$CH = \frac{2}{k(k-1)} \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \lVert C_{i} - C_{j} \rVert$	Simulates human preference for harmonious colors through “statistical associations of color centers” [40]. Here, $k$ denotes the number of dominant colors extracted from the image (predefined according to color complexity), while ($C_{i}$, $C_{j}$) represent the coordinates of the i-th and j-th color cluster centers.
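
The proportion-based indicators in Table 3 can be sketched from a semantic segmentation label map as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the class names, the toy 4 × 5 label map, the choice of `n_max = 18`, and the set of facility classes are all hypothetical, and the toy map contains no background class to exclude.

```python
import math
from collections import Counter


def class_proportions(label_map):
    """Pixel share of each class in a 2-D list of class names."""
    counts = Counter(label for row in label_map for label in row)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def streetscape_indicators(props, n_max):
    """Proportion-based indicators following the Table 3 definitions."""
    return {
        "W": props.get("sidewalk", 0.0) + props.get("crosswalk", 0.0),
        "GVI": props.get("vegetation", 0.0),
        "SVF": props.get("sky", 0.0),
        "E": props.get("building", 0.0) + props.get("wall", 0.0),
        "VD": len(props) / n_max,  # distinct classes present / preset maximum
    }


def facility_entropy(props, facility_classes):
    """Shannon entropy (base 2) over facility shares renormalized to sum to 1."""
    shares = [props[c] for c in facility_classes if props.get(c, 0.0) > 0]
    total = sum(shares)
    if total == 0:
        return 0.0
    return -sum((s / total) * math.log2(s / total) for s in shares)


# Toy label map (20 pixels) with hypothetical class names.
label_map = [
    ["sky", "sky", "sky", "sky", "sky"],
    ["building", "building", "vegetation", "vegetation", "vegetation"],
    ["building", "wall", "vegetation", "bench", "streetlight"],
    ["road", "road", "sidewalk", "sidewalk", "crosswalk"],
]

props = class_proportions(label_map)
indicators = streetscape_indicators(props, n_max=18)
fe = facility_entropy(props, ["bench", "streetlight", "bus_stop"])
print(indicators, fe)
```

In the toy image, two facility classes with equal pixel shares give an entropy of 1 bit; an image dominated by a single facility type would move toward 0, which is the sense in which FE measures the diversity of visible facilities.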

Table 4. The residential attribute score of ORCs (part).

ORCs	Construction Age	POI Diversity	Population Vitality
GR	0	1.320	1.846
DH	1	1.492	3.375
NM	2	1.502	2.641
ZZ	3	1.599	3.143
AD	4	1.656	3.000
HT	4	1.416	2.353
ML	4	1.705	2.912
NY	5	1.388	3.194
XY	5	1.481	2.679
YY	5	1.433	3.030

Table 5. Regression Result Indicators of Six Types of Emotional Perception Based on the RF Model.

Emotional Categories	MAE	MSE	RMSE	Median Absolute Error	Explained Variance Score	R²
Beautiful	0.492	0.388	0.623	0.420	0.706	0.705
Depressing	0.634	0.611	0.782	0.575	0.668	0.666
Boring	0.632	0.673	0.820	0.520	0.625	0.624
Lively	0.675	0.738	0.859	0.548	0.603	0.601
Wealthy	0.568	0.518	0.720	0.465	0.594	0.590
Safe	0.535	0.524	0.724	0.385	0.518	0.512
Mean	0.589	0.575	0.755	0.485	0.619	0.616
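
The indicators reported in Table 5 follow standard regression definitions, which the sketch below spells out on a small hypothetical example. It is assumed, not confirmed by the text, that the table values were computed with these conventional formulas; the reported RMSE values are at least consistent with the square root of the MSE (e.g., √0.388 ≈ 0.623 for Beautiful).

```python
import math
from statistics import mean, median, pvariance


def regression_report(y_true, y_pred):
    """MAE, MSE, RMSE, median absolute error, explained variance, and R²."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = mean(e * e for e in errors)
    y_bar = mean(y_true)
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return {
        "MAE": mean(abs(e) for e in errors),
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MedAE": median(abs(e) for e in errors),
        # Explained variance ignores a constant bias in the predictions.
        "EVS": 1 - pvariance(errors) / pvariance(y_true),
        # R² penalizes bias as well as scatter.
        "R2": 1 - ss_res / ss_tot,
    }


# Hypothetical scores: every prediction is 0.5 too high.
y_true = [4.0, 5.0, 6.0, 7.0]
y_pred = [4.5, 5.5, 6.5, 7.5]

report = regression_report(y_true, y_pred)
print({name: round(value, 3) for name, value in report.items()})
```

Because the toy predictions are uniformly biased, the explained variance score stays at 1.0 while R² drops to 0.8. In Table 5 the two columns differ by at most 0.004, which suggests that systematic bias in the RF predictions is small.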

Table 6. Ablation Experiment Results Comparing Visual Indicators and Combined Attributes.

Model Configuration	Mean R²	Mean MAE	Mean RMSE	Improvement in R²	Reduction in MAE
Visual indicators only	0.549	0.636	0.817	0.067	0.062
Visual indicators + community attributes	0.616	0.589	0.755	0.067	0.062

Note: The ablation experiment quantifies the added value of community attributes (construction age, POI diversity, and population vitality) in improving the model’s explanatory power and predictive accuracy.

Table 7. Mood perception scores of ORCs.

ORCs	Beautiful	Boring	Safe	Depressing	Wealthy	Lively
GR	2.962	6.135	7.038	5.538	4.481	4.000
DH	3.445	5.414	6.781	5.000	4.352	4.688
NM	4.109	5.506	6.750	5.686	5.083	4.667
ZZ	4.277	5.527	6.491	5.902	4.991	4.330
AD	4.661	5.607	6.500	4.955	5.098	4.536
HT	3.705	5.718	6.436	4.635	5.622	4.609
ML	3.706	5.066	6.279	5.368	5.235	5.125
NY	4.660	4.806	6.465	4.729	5.861	5.563
XY	3.875	5.089	6.527	4.991	5.295	4.982
YY	4.242	5.280	6.348	5.553	5.023	4.462

Table 8. Consistency between Model Predictions and Survey Evaluations.

Emotion	Pearson r	95% CI (r)	p Value	Consistent	MAE	95% CI (MAE)	MASE	95% CI (MASE)
Boring	0.611	[−0.033, 0.895]	0.061	False	0.291	[0.234, 0.348]	0.810	[0.665, 0.956]
Wealthy	0.843	[0.542, 0.954]	0.002	True	0.519	[0.436, 0.603]	0.550	[0.458, 0.643]
Depressing	0.419	[−0.200, 0.793]	0.229	False	0.568	[0.482, 0.655]	0.649	[0.541, 0.758]
Lively	0.744	[0.285, 0.923]	0.014	True	0.527	[0.443, 0.612]	0.763	[0.630, 0.897]
Beautiful	0.772	[0.336, 0.934]	0.009	True	0.428	[0.352, 0.505]	0.825	[0.698, 0.953]
Safe	0.723	[0.248, 0.917]	0.018	True	0.376	[0.308, 0.445]	0.778	[0.642, 0.915]
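
The correlation column of Table 8 can be illustrated with a short sketch that computes Pearson's r and an approximate 95% confidence interval. The interval here uses the Fisher z-transformation, which is an assumption: the source does not state how its intervals were obtained, so this method need not reproduce the tabulated bounds. The function names are hypothetical, and the sample size n = 10 is taken from the ten ORCs in the study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for r via the Fisher z-transformation (assumed method)."""
    z = math.atanh(r)                 # transform r to an approximately normal scale
    se = 1.0 / math.sqrt(n - 3)       # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Example: the "Boring" row, r = 0.611 with n = 10 communities.
low, high = fisher_ci(0.611, 10)
print(round(low, 3), round(high, 3))
```

For r = 0.611 and n = 10 this gives an interval of roughly [−0.030, 0.896], which spans zero, in line with the non-significant p value reported for that row. With only ten communities the intervals are wide, so moderate correlations cannot be distinguished from zero.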

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Published by MDPI on behalf of the International Society for Photogrammetry and Remote Sensing. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Xu, Y.; Fan, X. Street View Image-Based Emotional Perception Modeling of Old Residential Communities: An Explainable Framework Integrating Random Forest and SHAP. ISPRS Int. J. Geo-Inf. 2025, 14, 471. https://doi.org/10.3390/ijgi14120471

