4.1. Analysis of Emotion Perception Modeling Results Based on RF
Before modeling emotional perception, this study constructed a residential area attribute index system that covers construction age, POI diversity, residential area population size, and population vitality. The aim was to quantify the spatial and social characteristics of ORCs. These variables constitute the basic input for the emotional regression model and are important in revealing the sources of emotional differences.
Table 4 shows the specific values of ten typical ORCs in each index and reveals obvious spatial heterogeneity.
For example, the GR community records relatively low values across all indicators: a construction age score of 0, a POI diversity index of 1.320, and a population vitality index of 1.846. These results reflect the area’s functional obsolescence, monotonous streetscape, and underutilized public spaces. In contrast, the DH community performs significantly better on all measures, with a construction age score of 1, a POI diversity index of 1.492, and a population vitality index of 3.375. These higher scores suggest a stronger foundation of service facilities and more vibrant population activity, which are likely to foster more positive emotional experiences. Such differences provide an important basis for the model’s explanatory power.
Subsequently, the study incorporates residential area attributes and SVI features into an RF regression model to estimate the scores of six emotional perception dimensions. To assess model performance, six regression evaluation metrics were calculated, including MAE, mean squared error (MSE), and R2.
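As a minimal illustration of how the named metrics are computed, the following Python sketch evaluates MAE, MSE, RMSE, and R2 on a small set of hypothetical scores. The function name and the data are assumptions made for illustration only; they are not the study's implementation or results.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation
    ss_res = sum(e * e for e in errors)                 # unexplained variation
    return {"MAE": mae, "MSE": mse, "RMSE": sqrt(mse), "R2": 1.0 - ss_res / ss_tot}


# Hypothetical perception scores (not taken from the study).
y_true = [3.0, 5.0, 4.0, 6.0]
y_pred = [2.5, 5.5, 4.0, 5.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

An R2 near 1 means the predictions explain most of the variance in the observed scores, whereas MAE and RMSE are expressed in the same units as the scores themselves.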
Table 5 reports the performance of each emotional dimension model under the main evaluation criteria.
Overall, the RF model exhibits a moderate but stable predictive performance across most emotional dimensions. On average, the model achieves an R2 of 0.616 and an MAE of 0.589. These values indicate that the model can reasonably capture the impacts of streetscapes and residential areas on emotional perception. The model performs particularly well in the “beauty” and “oppression” dimensions, with R2 values of 0.705 and 0.666, respectively, and MAE values ranging from 0.492 to 0.634. These results suggest that perceptions of beauty and oppression are highly dependent on the quality of the visual landscape and the atmosphere of the streets.
In contrast, the model performed relatively poorly in the “safety” dimension, with an R2 of only 0.512. This indicates that perceptions of safety are more strongly influenced by non-visual factors such as public security, neighborhood relations, and historical events, while streetscape and physical spatial variables provide limited explanatory power.
Regarding model stability, the average MAE of a single tree is 0.768, whereas the overall MAE decreases to 0.568 after ensemble aggregation. This confirms that the multi-tree ensemble mechanism of RF improves predictive stability and reduces overfitting risk. Further evaluation shows a margin strength of –0.351, an inter-tree correlation of 0.474, and an estimated upper bound of generalization error of 3.372. To further validate the statistical significance and robustness of the model, a permutation test was performed with 1000 iterations by randomly shuffling the training labels. The original model achieved an MAE of 0.5331 on the independent test set, whereas the mean MAE of the permuted models was 0.8370 ± 0.0225 (p < 0.001). These results demonstrate that the model predictions are significantly better than random chance and confirm stable and reliable performance. Taken together, these results indicate that although the predictive capacity of individual trees is modest, the integration of moderately correlated learners yields a model with strong generalization ability and stable overall performance.
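The two stability figures above can be checked with a short Python sketch. It assumes that the reported upper bound follows the form given by Breiman (2001) for random forests, PE* <= rho(1 - s^2)/s^2, where s is the margin strength and rho the mean inter-tree correlation, and that the permutation p-value uses the common (1 + b)/(1 + B) estimator. Both are assumptions, since the text does not state the formulas, and the permuted MAE values are simulated from the reported mean and standard deviation rather than taken from the study.

```python
import random


def breiman_upper_bound(strength, mean_correlation):
    """Assumed form of the bound: rho * (1 - s^2) / s^2 (Breiman, 2001)."""
    s2 = strength ** 2
    return mean_correlation * (1.0 - s2) / s2


def permutation_p_value(observed_mae, permuted_maes):
    """Share of label-shuffled models at least as accurate as the original.

    Uses the (1 + b) / (1 + B) estimator, an assumption about the procedure.
    """
    as_good = sum(1 for m in permuted_maes if m <= observed_mae)
    return (1 + as_good) / (1 + len(permuted_maes))


# Reported margin strength and inter-tree correlation.
bound = breiman_upper_bound(-0.351, 0.474)

# Simulated stand-ins for the 1000 permuted-model MAEs (hypothetical draws).
rng = random.Random(0)
permuted = [rng.gauss(0.8370, 0.0225) for _ in range(1000)]
p_value = permutation_p_value(0.5331, permuted)

print(round(bound, 3), p_value)
```

Under these assumptions the bound evaluates to roughly 3.37, in line with the reported 3.372, and no simulated permuted model reaches the original MAE, which gives p = 1/1001 and agrees with p < 0.001.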
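To further evaluate the contribution of community attributes, an additional ablation experiment was conducted comparing two models: one trained on SVI features alone and one trained on SVI features together with community attributes.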
The results demonstrated that the inclusion of community attributes improved the mean R2 from 0.549 to 0.616, reduced the mean MAE from 0.636 to 0.589, and lowered the RMSE from 0.817 to 0.755, indicating a clear enhancement in predictive performance. This finding highlights that community-level contextual variables, particularly population vitality, enhance the interpretive and predictive capacity of the model by providing social–spatial context beyond the visual environment. The results of this comparison are presented in Table 6.
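To put the ablation figures on a common scale, the short Python sketch below converts the reported values into relative changes. The dictionary and function names are illustrative assumptions; the numbers are the means reported above for the models without and with community attributes.

```python
# Mean metrics reported for the two ablation settings.
without_attributes = {"R2": 0.549, "MAE": 0.636, "RMSE": 0.817}
with_attributes = {"R2": 0.616, "MAE": 0.589, "RMSE": 0.755}


def relative_change(before, after):
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


changes = {
    name: relative_change(without_attributes[name], with_attributes[name])
    for name in without_attributes
}

for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

In relative terms, R2 rises by about 12%, while MAE and RMSE each fall by between 7% and 8%.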
4.2. Interpretability Analysis of Emotional Perception in Streetscapes
4.2.1. Impact of Streetscape Features on Emotional Perception
To further examine how spatial features extracted from SVIs shape residents’ emotional perceptions, this study applies the SHAP method to interpret the regression results across six emotional dimensions—beautiful, safe, lively, wealthy, depressing, and boring—using a dataset of 1240 SVIs.
Figure 5 displays a heatmap of the mean absolute SHAP values for 14 predictors across the six dimensions. This visualization quantifies the average marginal contribution of each variable to the model output. Because the values are averaged in absolute terms, the heatmap conveys the magnitude of each variable's influence on emotion prediction rather than its direction. In the heatmap, red indicates stronger contributions, while blue denotes weaker ones.
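The aggregation behind such a heatmap can be sketched in a few lines of Python. The example assumes that per-sample SHAP values are already available as rows of numbers; the feature subset, the values, and the function name are hypothetical and serve only to show the calculation.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Average the absolute SHAP value of each feature over all samples."""
    n = len(shap_rows)
    return {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }


# Hypothetical SHAP values for three samples and three predictors.
features = ["enclosure", "VD", "SVF"]
shap_rows = [
    [0.30, -0.10, 0.05],
    [-0.20, 0.25, -0.05],
    [0.10, 0.15, 0.00],
]

importance = mean_abs_shap(shap_rows, features)
ranked = sorted(importance, key=importance.get, reverse=True)
print(ranked, importance)
```

Taking absolute values before averaging prevents positive and negative contributions from cancelling out, which is why this kind of score ranks features by strength but carries no sign.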
The overall results reveal notable variations in the contribution of key variables across different emotional dimensions. Nevertheless, enclosure, VD, and facility visual entropy consistently exhibit strong explanatory power, emerging as core determinants of emotional perception. Among these, enclosure shows the highest importance value (0.283) in the depressing dimension, suggesting that enclosed street scenes are more likely to evoke negative emotions. It also exerts significant influence on perceptions of beautiful, lively, and boring, underscoring its multi-dimensional impact. VD achieves an average SHAP value of 0.274 in the lively dimension, highlighting the role of element richness in enhancing residents’ perception of liveliness. Meanwhile, facility visual entropy demonstrates strong explanatory capacity in both the safe and depressing dimensions, suggesting that the complexity of street facilities is critical in regulating residents’ emotional responses.
As for secondary variables, the SVF exerts a relatively strong and consistent influence across all six emotional dimensions, demonstrating both stability and universality. This finding implies that its mechanism of action is systemic and multi-layered. In addition, colorfulness contributes notably to the lively dimension, reinforcing the importance of visual vibrancy in stimulating positive affect. Taken together, Figure 5 reveals a multi-level quantitative association between the street environments of ORCs and residents’ emotional perceptions, offering empirical evidence to guide urban renewal strategies and micro-scale street interventions.
Figure 6 further presents SHAP summary plots for the six emotion models, illustrating both the positive and negative effects of each variable on model predictions and their distribution across different feature values. The results show that perceptions of beauty and safety are primarily enhanced by higher levels of enclosure and VD, whereas perceptions of a space as depressing or boring are reinforced by high enclosure and low greenery coverage. These patterns suggest that enclosed and monotonous streetscapes are more likely to provoke negative emotional experiences.
Further rankings of variable importance are illustrated in Figure 7, highlighting distinct dominant factors across different emotional dimensions. For instance, facility visual entropy contributes substantially to both the depressing and safe models, whereas population vitality ranks among the top predictors in the wealthy model. This pattern reflects the interplay between the social attributes and visual characteristics of residential areas. These findings indicate that each emotional dimension is shaped by a unique combination of streetscape elements and neighborhood features. Consequently, they provide a scientific foundation for targeted interventions aimed at enhancing residents’ emotional perceptions, for example by adjusting street enclosure levels, improving visual continuity, or achieving a more balanced distribution of facilities.
To examine the nonlinear and threshold effects of key street scene features on emotional perception, SHAP dependence plots were generated for the three most predictive emotional dimensions (beautiful, boring, and depressing) based on their top four contributing features (Figure 8).
For the “beautiful” dimension, enclosure exhibited a clear decreasing trend, contributing positively below a threshold of 0.31 and negatively beyond it, suggesting excessive enclosure may reduce visual pleasure. Similar threshold effects were observed for VD (17.31) and SVF (0.20), where their contributions reversed after exceeding specific cutoffs. Population vitality showed a tri-phasic pattern, shifting from negative to positive and back to negative, with inflection points at 3.16 and 4.90.
In the “boring” dimension, VD below 17.05 was positively associated with boredom, but reversed beyond this value, implying that overly homogeneous scenes induce monotony. Enclosure (0.31), facility visual entropy (2.11), and population vitality (2.68) showed similar turning points, indicating that moderate levels of spatial enclosure, visual complexity, and crowd activity alleviate boredom, but excessive levels may have adverse effects.
In the “depression” dimension, enclosure became a positive contributor above 0.29, intensifying depressive perception. VD (17.46) and SVF (0.19) shifted from positive to negative effects after their respective thresholds, while facility entropy followed a three-phase pattern (2.04 and 3.01), reflecting a dual role of visual complexity.
Overall, SHAP dependence plots reveal significant nonlinearities and threshold dependencies. Enclosure and SVF exhibited monotonic responses within defined intervals, while features like population vitality and visual entropy followed an “optimal-middle” pattern. Precise control over these thresholds is critical for designing emotionally responsive urban street environments.
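One way to locate such turning points is to average SHAP values within bins of the feature and to find where the binned mean changes sign. The Python sketch below illustrates this idea on synthetic data constructed to cross zero at an enclosure value of 0.31. The binning scheme, the interpolation, and the function name are assumptions for illustration; the text does not state how the thresholds in the dependence plots were determined.

```python
def sign_change_thresholds(feature_values, shap_values, n_bins=10):
    """Estimate feature values at which the binned mean SHAP value changes sign."""
    pairs = sorted(zip(feature_values, shap_values))
    size = max(1, len(pairs) // n_bins)
    centers, means = [], []
    for start in range(0, len(pairs), size):
        chunk = pairs[start:start + size]
        centers.append(sum(x for x, _ in chunk) / len(chunk))
        means.append(sum(s for _, s in chunk) / len(chunk))

    thresholds = []
    for i in range(len(means) - 1):
        s0, s1 = means[i], means[i + 1]
        if s0 * s1 < 0:  # sign flips between two neighbouring bins
            x0, x1 = centers[i], centers[i + 1]
            # Linear interpolation of the zero crossing between bin centers.
            thresholds.append(x0 + (x1 - x0) * s0 / (s0 - s1))
    return thresholds


# Synthetic example: the contribution falls linearly and crosses zero at 0.31.
enclosure = [i / 100 for i in range(1, 61)]
shap_beautiful = [0.31 - x for x in enclosure]
thresholds = sign_change_thresholds(enclosure, shap_beautiful)
print(thresholds)
```

A feature with a tri-phasic pattern, such as the one described for population vitality, would yield two crossings under the same procedure. Real SHAP values are noisy, so the number of bins affects how many crossings are detected.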
4.2.2. Single-Sample Local Interpretation
To reveal the localized mechanisms by which street scene features influence mood perception, this study conducted SHAP waterfall analyses on three representative samples (Figure 9), each corresponding to one of three emotional scene types: positively dominated (a), negatively dominated (b), and mixed (c). Each sample was analyzed based on the ranking of absolute SHAP values to identify the primary contributing factors and their directional effects. For the first two samples, the dimensions of “beauty” and “depression” were selected to represent typical positive and negative emotional responses, respectively, while the third sample, characterized by a more balanced emotional distribution, was examined through the “safety” and “vitality” dimensions that exhibited both positive and negative contributions.
Sample 1 (NY11–90°) was obtained from a relatively new residential district located in the urban core. The scene was captured from a main road with high pedestrian flow, abundant greenery, and vibrant commercial activity, showing strong positive mood perception. In the “beautiful” dimension, enclosure degree (0.098, SHAP = +0.35) and population vitality (4, SHAP = +0.21) emerged as key positive drivers. These features also contributed negatively to the “boredom” dimension, indicating that greater spatial enclosure and active street life can alleviate monotony. Moreover, the facility visual entropy (2.303, SHAP = −0.15) played a negative role, suggesting that moderate visual complexity enhances aesthetic appeal.
Sample 2 (GR5–270°) was derived from an aging residential community with minimal greenery and poor infrastructure. The image was taken from a narrow dead-end street with sparse pedestrian activity, exhibiting typical negative emotional features. SHAP results showed that population vitality (1, SHAP = −0.30) and building age (1, SHAP = −0.24) were major negative drivers for the “wealth” dimension. Meanwhile, high enclosure (0.709, SHAP = +0.31), low facility visual entropy (1.306, SHAP = +0.24), and limited VD (11, SHAP = +0.2) contributed positively to “depression”. The image reflected a sense of spatial oppression due to narrow streets and highly enclosed building façades, where monotonous and low-complexity environments further intensified the depressive perception.
Sample 3 (DH25–270°), located in a large and aged neighborhood in the southern part of the city center, exhibited mixed emotional responses. The image was captured on a riverside branch road with poor environmental maintenance and low foot traffic. In the “safety” dimension, population vitality (2, SHAP = +0.13) was the most influential positive factor, while VD (19, SHAP = −0.11) and facility visual entropy (2.532, SHAP = −0.09) had negative impacts, implying that excessive visual complexity may undermine perceived safety. In the “vitality” dimension, both color richness (62.289, SHAP = −0.16) and population vitality (2, SHAP = −0.14) exerted negative effects, whereas facility visual entropy provided a slight positive contribution, suggesting that localized visual disturbance may compensate for otherwise dull environments.
Overall, SHAP waterfall results at the sample level indicate that enclosure degree and population vitality are consistent positive drivers of emotional perception. Meanwhile, facility visual entropy demonstrates a context-dependent dual role: although low entropy tends to diminish positive perceptions, moderate increases in visual complexity can help stimulate positive emotional responses in highly enclosed and low-activity settings.
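A waterfall reading rests on the additive property of SHAP: the prediction for one sample equals a base value plus the sum of its feature contributions. The Python sketch below orders the three contributions reported for Sample 2 in the “depression” dimension by absolute size and accumulates them. The base value of 5.0 is a hypothetical placeholder, the contributions of the remaining predictors are omitted, and the function name is an assumption.

```python
def waterfall(base_value, contributions):
    """Order contributions by absolute size and accumulate them from the base value."""
    ordered = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    running = base_value
    steps = []
    for name, phi in ordered:
        running += phi
        steps.append((name, phi, running))
    return steps


# Contributions reported for Sample 2 ("depression"); other predictors omitted.
sample2 = {"VD": 0.20, "enclosure": 0.31, "facility visual entropy": 0.24}

# Hypothetical base value (the mean model output is not given in the text).
steps = waterfall(5.0, sample2)
for name, phi, running in steps:
    print(f"{name:>24}: {phi:+.2f} -> {running:.2f}")
```

With these three terms alone, the partial prediction moves from the assumed base of 5.00 to 5.75, with enclosure contributing the largest single step.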
4.2.3. Mechanism of the Interaction Between Residential Community Attributes and Emotional Perception
Compared with the direct spatial cues embedded in SVI, residential attributes exert a more indirect and comprehensive influence on emotional perception. Their mechanisms often arise from the interplay of multiple factors, including functional layout, demographic composition, and renewal level. Although their explanatory power is weaker than that of image-based features, SHAP analysis shows that certain residential attributes still play significant roles across multiple emotional dimensions.
Among these, population vitality stands out in positive emotions such as beautiful, wealthy, and lively, with SHAP values of 0.168, 0.252, and 0.194, respectively, all showing positive effects. Active foot traffic and frequent social interactions help cultivate an attractive community atmosphere, thereby enhancing aesthetic and vitality perceptions. However, in the safe dimension, its contribution decreases to 0.087 and even becomes negative at high values, suggesting that excessive population concentration may lead to a weakened sense of order and increased uncertainty.
The effect of construction age varies across emotional dimensions. Newly built residential areas perform better in the category of “beautiful”, reflecting the positive impact of modern planning and environmental quality, while renovated housing stock helps mitigate spatial monotony in the boring dimension. Residential scale contributes moderately to lively (SHAP = 0.069), yet insufficient planning in large-scale developments may intensify perceptions of boring and depressing. In contrast, POI diversity has relatively limited explanatory power, showing only partial nonlinear influence on wealthy and depressing.
These findings are further validated by comparing residential sentiment scores (Table 7) with activity levels. For example, the GR residential area, built before 1980, recorded a population vitality score of only 1.846 and a perceived beautiful score of 2.962. By contrast, the NY community, developed after 2000, achieved a vitality score of 3.194 and a corresponding beautiful score as high as 4.660, underscoring the advantages of newly constructed residential areas in planning and landscape quality. Similarly, a larger population size can enhance vitality and alleviate boredom to some extent, but it may also increase perceptions of an area being depressing, reflecting its complex dual effects.
Overall, while residential attributes are not directly visible in street-level images, they still play a significant structural role in emotional perception. Their pathways of action are complex and multidimensional, involving interactions among spatial form, usage density, and resident experience. Future studies should integrate these aspects for more systematic interpretation and validation.
4.2.4. Emotional Effects of Key Street View Indicators
To further elucidate the specific roles of different street scene indicators in residents’ emotional perceptions, this study provides a comprehensive interpretation by integrating SHAP analysis results (Figure 5 and Figure 6) with intergroup comparison experiments (Figure 10).
Figure 10 presents the results of comparing average emotional scores between samples ranked in the top 30% and bottom 30% for each indicator value in street scene images. This approach not only visually demonstrates how high- and low-level street scene characteristics differ across emotional dimensions but also statistically validates the reliability of these differences. Overall findings reveal that spatial structure, visual complexity, and natural elements exert distinct influences on both positive and negative emotions, with most dimensional differences reaching statistical significance. This highlights the dual moderating effect of street scene characteristics on residents’ psychological experiences.
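The grouping step of this comparison can be sketched in Python as follows. The indicator values and scores are hypothetical, the function name is an assumption, and the significance test applied in the study is not reproduced because the text does not name it.

```python
from statistics import mean


def top_bottom_gap(indicator, scores, share_percent=30):
    """Mean score of the top and bottom share of samples ranked by an indicator."""
    order = sorted(range(len(indicator)), key=lambda i: indicator[i])
    k = max(1, len(order) * share_percent // 100)
    low = [scores[i] for i in order[:k]]    # lowest indicator values
    high = [scores[i] for i in order[-k:]]  # highest indicator values
    return mean(high), mean(low), mean(high) - mean(low)


# Hypothetical enclosure values and "safe" scores for ten images.
enclosure = [0.12, 0.55, 0.31, 0.78, 0.05, 0.64, 0.92, 0.40, 0.23, 0.86]
safe = [4.1, 5.9, 4.8, 6.9, 3.9, 6.2, 7.4, 5.3, 4.4, 7.0]

high_mean, low_mean, gap = top_bottom_gap(enclosure, safe)
print(round(high_mean, 3), round(low_mean, 3), round(gap, 3))
```

Comparing only the two tails of the distribution discards the middle 40% of samples, which sharpens the contrast between high and low indicator levels at the cost of some statistical power.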
First, enclosure plays the most prominent role in “safety.” The safety score for highly enclosed streetscapes was 7.212, significantly higher than the 6.126 recorded for low-enclosure areas. However, it also scored higher in “depressing” and “boring” dimensions: the depressing scores were 6.304 and 4.255, and the “boring” scores were 6.352 and 4.626, respectively. This indicates that high enclosure enhances the sense of safety but may also trigger negative emotions. This finding aligns with the SHAP heatmap, which shows that enclosure has the highest weight in the “depressing” dimension.
Second, VD demonstrated significant advantages in perceptions of an area as “beautiful” and “lively.” High-diversity environments scored 5.442 for “lively,” markedly higher than the 3.835 for low-diversity settings; their respective scores for “beautiful” were 4.567 and 3.281. Combined with SHAP analysis results, VD ranks prominently in positive emotion models, validating the role of diverse street scene elements in enhancing positive perceptions.
Furthermore, facility visual entropy exhibits a complex dual effect. The “safe” score in high-entropy environments was 7.304, higher than the 6.046 in low-entropy environments. However, high-entropy environments also showed higher levels of “depressing” and “boring”: the “depressing” score was 6.118 versus 4.871 in low-entropy environments, and the “boring” score was 6.395 versus 4.876. This indicates that, while facility layout complexity enhances perceptions of areas as “safe” to some extent, disorder and clutter may impose psychological burdens.
Regarding natural elements, both SVF and GVI exert positive effects on perceptions of “beautiful” and “safe.” The “beautiful” score of environments with high greenery coverage reached 4.476, significantly higher than the value of 3.323 recorded in areas with low greenery coverage. Similarly, the “safe” score of streetscapes with high SVF was 7.164, surpassing the value of 6.083 observed in low-SVF settings. Concurrently, both indicators yielded relatively low scores for “depressing” and “boring,” indicating that openness and natural elements help alleviate residents’ negative emotions.
Therefore, the mechanism of street view indicators exhibits multidimensional variability: positive emotions are primarily driven by VD, green coverage, and moderate enclosure, while negative emotions are more readily triggered in overly enclosed, monotonous, or complexly furnished environments. This finding is corroborated by SHAP analysis, which not only quantifies the relationship between the street environment and residents’ emotional perception but also provides actionable guidance for future neighborhood renewal and microenvironmental interventions.
4.3. Emotional Perception Characteristics of ORCs
Based on semantic segmentation results from 1240 SVIs of ORCs, this study statistically analyzed the overall visual element composition across 10 neighborhoods, with findings presented in Figure 11. Overall, architectural structures and perimeter walls, along with natural environmental elements, dominate the street scenes. This reflects how ORCs retain substantial traditional building facades within the urban fabric while also preserving green vegetation and natural landscapes. This distribution pattern reveals a dual spatial characteristic of ORCs: the continuity of existing buildings coexists with the integration of natural elements.
Visual elements vary significantly across different residential areas. For instance, areas like NY, ZZ, and HT feature a higher proportion of natural elements, presenting street scenes characterized by open spaces and abundant greenery. This aligns closely with their high emotional scores in the “beautiful” dimension, indicating that natural elements play a crucial role in positive perceptions. In contrast, residential areas such as GR, YY, and DH feature higher proportions of buildings and fences, creating a stronger sense of spatial enclosure. While this environmental characteristic enhances residents’ perceptions of an area as “safe,” correlating with higher “safe” scores, it simultaneously restricts accessibility and interactivity in public spaces. Consequently, scores on the “lively” dimension are generally lower, reflecting the tension between environmental enclosure and social vitality.
Figure 12 displays the box plot distributions across six emotional dimensions for 10 ORCs. Overall, the results show that emotional evaluations not only differ in terms of the median values across communities but also exhibit distinct characteristics in terms of dispersion, as measured by the interquartile range (IQR) and extreme value distributions. Three common patterns are observed: For the “safe” dimension, the median values of most communities cluster at relatively high levels, with narrower IQRs—indicating that residents’ perceptions of safety are consistent. Both the median values and IQRs of the “beautiful” and “lively” dimensions vary more significantly across communities, which reflects residents’ higher sensitivity to spatial and landscape qualities. Some communities have higher median values or wider IQRs for the “boring” and “depressing” dimensions, suggesting that negative emotions are more strongly influenced by differences in contextual conditions and functional attributes.
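The two quantities read from each box plot, the median and the IQR, can be computed with the Python standard library as sketched below. The community labels and scores are hypothetical, and the quartile convention (the "inclusive" method of `statistics.quantiles`) is an assumption; plotting software may use a slightly different rule.

```python
from statistics import median, quantiles


def box_summary(scores):
    """Median and interquartile range of a list of scores."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return {"median": median(scores), "IQR": q3 - q1}


# Hypothetical "safe" scores for two communities (labels are placeholders).
consistent = [6.8, 7.0, 7.1, 7.2, 7.4]  # tightly clustered ratings
dispersed = [3.0, 4.0, 5.0, 6.0, 7.0]   # widely spread ratings

summary_consistent = box_summary(consistent)
summary_dispersed = box_summary(dispersed)
print(summary_consistent, summary_dispersed)
```

A narrow IQR combined with a high median corresponds to the pattern described for the “safe” dimension, where ratings are both favorable and consistent.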
Additionally, residential areas exhibit significant differences across the six emotional dimensions. Areas such as NY, ZZ, and HT, which are characterized by a relatively higher proportion of natural environments, consistently achieve higher median scores and lower dispersion in the “beautiful” dimension. This indicates that greenery and open spaces significantly enhance residents’ aesthetic experiences and psychological comfort. Conversely, residential areas such as YY, GR, and DH feature prominent building and wall coverage in their streetscapes, creating a stronger sense of spatial enclosure. These areas scored higher in the “safe” dimension, which is consistent with the positive effect of spatial enclosure on perceived safety. However, this environmental characteristic also inhibits resident interaction and the vitality of public spaces, resulting in generally lower scores in the “lively” dimension. This reveals a tension between safety and liveliness.
In the AD and ML residential areas—where transportation facilities and motor vehicle-related elements account for a higher proportion—residents’ emotional expressions are more complex. On one hand, the transportation infrastructure in these areas improves accessibility; on the other hand, it also creates a sense of oppression and generates environmental noise, leading to relatively higher scores for residents in the “depressing” and “boring” dimensions. This finding suggests that an excessive concentration of transportation functions may impair neighborhood livability and trigger more negative emotions. In contrast, residential areas such as NM and XY exhibit greater visual balance, with minimal variations across emotional dimensions. This indicates that environments featuring a balanced distribution of spatial elements contribute to greater stability in residents’ emotional perception.
A synthesis of Figure 9 and Figure 10 shows that differences in streetscape elements directly affect the distribution of emotions. A high proportion of natural elements typically fosters positive emotions such as “beautiful” and “lively,” whereas excessive building enclosure may enhance the “safe” perception while also intensifying feelings of “boring” and “depressing.” Meanwhile, concentrations of transportation facilities and motor vehicles tend to evoke negative emotions. The observed variations across different residential areas reflect the multidimensional challenges encountered in the renewal of ORCs: how to boost vitality while ensuring safety, how to increase the presence of natural elements while preserving existing buildings, and how to maintain traffic accessibility without imposing environmental burdens. For residential areas including NY, ZZ, and HT, further optimizing green space design can enhance aesthetic appeal and vitality. For YY, GR, and DH, improving residents’ social interaction requires refining the design of open spaces and street layouts. As for AD and ML, priority should be given to alleviating the adverse effects of excessive concentration of transportation facilities and motor vehicles, so as to balance accessibility with living comfort.
To validate the effectiveness of the SVI-based urban sentiment perception modeling method proposed in this study, the consistency between the average sentiment scores generated by the model and the results of manual subjective evaluations was compared. The subjective evaluation results stem from field research and questionnaire surveys conducted in the 10 ORCs. Respondents included residents, neighborhood committee administrators, grassroots government officials, and urban research experts. A total of 75 questionnaires were distributed, with 75 valid responses collected. Among these, 45 respondents (60%) were community residents, 20 (27%) were community and property management staff, and 10 (13%) were grassroots government officials and urban research experts. Respondents’ age distribution was concentrated in the 31–50 age group (46.7%), followed by 20–30 (30.7%) and 51+ (22.6%). Respondents generally possessed higher educational attainment, with over 70% holding bachelor’s degrees or higher. Regarding duration of community residence or management, 80% had resided or managed for over five years, ensuring familiarity with neighborhood characteristics and response reliability. Consistency testing was conducted using Pearson correlation coefficients, with a significance criterion of Pearson r > 0.6 and p value < 0.05 (95% CI for r does not contain 0) defined as True (significant consistency).
Table 8 presents the validation results, including 95% confidence intervals (CIs) for Pearson r, MAE, and MASE. Among the six emotion dimensions, four (Wealthy, Beautiful, Lively, and Safe) show statistically significant and practically meaningful agreement between model predictions and subjective evaluations. Their Pearson correlation coefficients (r) have 95% CIs that do not include zero (e.g., Wealthy: r = 0.843, 95% CI [0.542, 0.954], p = 0.002), and the comparatively narrow 95% CIs of their MAE values further indicate stable numerical prediction accuracy. Moreover, all MASE values are below 1, with tight CIs, confirming that the model consistently outperforms a naive mean-based benchmark for these perceptual categories. In contrast, Boring (r = 0.611, 95% CI [−0.033, 0.895], p = 0.061) approaches but does not reach statistical significance, as its CI includes zero, a result coherent with its marginal p-value. Interestingly, Boring also exhibits one of the narrowest MAE CIs, suggesting stable absolute prediction errors despite uncertainty in directional agreement. Depressing (r = 0.419, 95% CI [−0.200, 0.793], p = 0.229) presents a much wider r-CI and lacks a consistent directional trend, reflecting higher uncertainty in this negative-affect dimension.
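The stated consistency criterion (r > 0.6 and p < 0.05) can be checked against the reported coefficients with the Python sketch below. It assumes that the p-values come from the standard two-sided t-test for a Pearson correlation with n = 10 communities, for which the 5% critical value with 8 degrees of freedom is 2.306. This test choice and the function names are assumptions, not details confirmed by the text.

```python
from math import sqrt

# Two-sided 5% critical value of Student's t with n - 2 = 8 degrees of freedom.
T_CRIT_DF8 = 2.306


def t_statistic(r, n):
    """t statistic for testing a Pearson correlation against zero."""
    return r * sqrt(n - 2) / sqrt(1.0 - r * r)


def is_consistent(r, n=10, r_min=0.6, t_crit=T_CRIT_DF8):
    """Apply the criterion r > r_min together with p < 0.05 (|t| > t_crit)."""
    return r > r_min and abs(t_statistic(r, n)) > t_crit


# Correlation coefficients reported for three of the six dimensions.
reported = {"Wealthy": 0.843, "Boring": 0.611, "Depressing": 0.419}
verdicts = {name: is_consistent(r) for name, r in reported.items()}

for name, r in reported.items():
    print(f"{name}: t = {t_statistic(r, 10):.2f}, consistent = {verdicts[name]}")
```

Under this assumption, Wealthy gives t of about 4.43 and passes, whereas Boring gives t of about 2.18: it clears the r > 0.6 requirement but falls short of the critical value, which matches its reported p = 0.061.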
The relatively wide r-CIs for Boring, and especially for Depressing, primarily stem from the limited number of analysis units (n = 10 communities) and the greater variability typically observed in negative-affect judgments. Jin et al. [47] have indicated that Place Pulse 2.0 tends to produce relatively higher predictions for negative perceptions such as “boring” and “depressing”, which may be attributed to cultural differences in visual cognition and the regional representativeness of its data samples. This indicates potential contextual limitations of the model when applied in cross-cultural settings.
To summarize, the proposed method performs well in aligning subjective and objective assessments across multidimensional emotional perception. It offers notable advantages in capturing key streetscape components—including the visual dimension and positive affect dimensions (“Beautiful,” “Wealthy”)—thus validating the feasibility and scientific rigor of semantic segmentation and image metric modeling for urban perception research.