Article
Peer-Review Record

Decoupling Urban Street Attractiveness: An Ensemble Learning Analysis of Color and Visual Element Contributions

by Tao Wu 1,†, Zeyin Chen 1,†, Siying Li 2, Peixue Xing 3, Ruhang Wei 1, Xi Meng 4, Jingkai Zhao 5,*, Zhiqiang Wu 1,6,7,* and Renlu Qiao 5
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 18 March 2025 / Revised: 22 April 2025 / Accepted: 28 April 2025 / Published: 1 May 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

I agree with many of the discussion points: the study offers a quite quantitative approach, which implies some restrictions from a cultural, artistic, or design-thinking point of view of urban morphology. However, the results are significant and very well presented.

Author Response

Comments 1: I agree with many of the discussion points: the study offers a quite quantitative approach, which implies some restrictions from a cultural, artistic, or design-thinking point of view of urban morphology. However, the results are significant and very well presented.

 

Response 1: We are deeply grateful for your affirmation of our work, which has greatly strengthened our confidence. We also appreciate your insightful observation regarding the limitations of our predominantly quantitative approach in capturing the cultural, artistic, and design‑thinking dimensions of urban form.

As you rightly point out, focusing solely on quantitative methods does constrain our exploration of these qualitative factors. To address this, we have expanded Section 4.4 Limitations and Future Prospects in the revised manuscript with the following clarification: the current study does not incorporate qualitative elements such as urban cultural context or artistic design intent; nevertheless, these dimensions play a critical role in shaping the visual aesthetic perception of public spaces.

In future work we plan to combine our quantitative model with a cultural and artistic perspective by conducting detailed interviews, gathering expert feedback and analysing case studies of design projects. This will allow us to reveal more fully how colour and visual elements jointly shape street view aesthetics.

Thank you once again for your valuable feedback and constructive suggestions. All corresponding revisions are highlighted in red for your ease of review.

 

4.4. Limitations and Future Prospects

Despite systematically examining the factors influencing street-view aesthetic perception from the two primary dimensions of urban color and visual elements, this study still has several limitations. First, due to data and research scope constraints, the study focuses only on representative indicators such as dominant color, color composition, and VgR. This may overlook other dimensions within urban space—such as sociocultural factors, historical district characteristics, microclimatic conditions, or qualitative aspects like urban cultural background and artistic design intent—that could potentially affect aesthetic perception. Second, although the data scale and sources used in this study are relatively extensive, they largely consist of static street-view information, making it difficult to capture the dynamic changes of urban spaces across different seasons, time periods, or activity contexts. Third, the global scope of 56 cities makes it challenging to fully control for local contextual variables and to delve into region‑specific dynamics. To overcome this, future research will include subgroup analyses by clustering cities according to geographic region, economic development level, or cultural heritage, and will also undertake focused case studies on selected city groups to yield deeper, more precise insights.

In future research, we will further expand the model’s features, incorporating multiple dimensions such as urban cultural context, functional zoning of neighborhoods, and behavioral patterns, to more comprehensively reveal the complex mechanisms underlying urban visual aesthetic perception. In particular, we plan to integrate qualitative methods—such as in‑depth interviews, expert reviews, and artistic design case studies—with our quantitative framework to capture the influence of cultural and design intentions on public‑space aesthetics. In addition, by leveraging real-time or periodically updated street-view data and employing neural networks and large-scale image recognition technologies, we aim to achieve automated bulk collection and feature extraction of urban street-view samples and to provide dynamic monitoring and prediction of urban aesthetic perception.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

 

The article presents an interesting and innovative topic, exploring the relationship between color, public space, and the use of machine learning. The core idea and topic selection are among the strengths of the paper. The subject is novel and interdisciplinary, connecting urban aesthetics with data analysis and machine learning. The use of modern technologies to analyze urban environments is both creative and practical. The methodology for data collection and analysis is not clearly explained. The focus is more on presenting mathematical models, which could be summarized more concisely. The study covers a large scope (56 cities), which seems excessive. Managing and controlling variables on such a scale appears challenging. A more limited and focused sample could have provided deeper and more precise insights. The article’s structure and framework need revision, particularly in the analysis and conclusion sections, which lack coherence and strong summarization. The conclusion is weak and fails to effectively summarize the findings or clarify their practical implications. Despite the novel and interesting idea, the article requires substantial revision in terms of scientific structure and methodology to be considered a strong academic contribution.

Author Response

Comments 1: The methodology for data collection and analysis is not clearly explained. The focus is more on presenting mathematical models, which could be summarized more concisely.

 

Response 1: We would like to express our sincere gratitude for your valuable comments on the Methods section. You correctly noted that our original manuscript lacked clear articulation of the data collection and analysis workflows, and that the extensive presentation of mathematical models risked distracting readers. To address these concerns, we have made the following detailed revisions in the updated draft:

In Section 2.2 Description of Data Sources, we have added specifics on street‑view image acquisition intervals, viewpoint selection, and the platform scoring procedure to clarify our data sources and enhance transparency.

In Section 2.3 Research Methods Process, we have streamlined the presentation of mathematical formulas by providing concise annotations and summarizing complex derivations and notation. This ensures that the methodological logic is clear and that readers can grasp the key steps more quickly.

Across all subsections, we have reinforced the explanation of why and how each chosen model—SegNet, K‑means, TrueSkill, and LightGBM—and each key method—semantic segmentation, clustering, ranking algorithms, regression modeling, and SHAP‑based decoupling analysis—contributes to capturing street‑view aesthetic perception.

All related changes are highlighted in red in the main text for your convenience. We trust that these improvements will significantly enhance the coherence and readability of the Methods section. Thank you again for your meticulous guidance.

 

2.2. Description of data sources

This study utilizes data from the urban street-view perception dataset Place Pulse 2.0 [58]. The dataset contains a total of 110,988 Google SVIs captured between 2007 and 2012 in 56 cities on six continents, thus covering a wide range of geographic settings. Crucially, Place Pulse is built on a global, crowdsourced rating platform rather than locally confined surveys: participants from many countries were recruited through organic media outreach and targeted Facebook advertisements, and all images were evaluated on the same web interface (centerforcollectivelearning.org/urbanperception). Because raters are not limited to the residents of the depicted cities, the resulting scores reflect a more universal aesthetic judgement and minimize city‑specific cultural bias, even though the dataset itself does not include explicit sociocultural variables for each city.

Street‑view frames were generated following the standard Place Pulse protocol. First, Google Street View panorama IDs were uniformly sampled along the OpenStreetMap road network at an adaptive spacing of roughly 50–100 m, ensuring coverage of both primary and secondary streets. For each panorama, two horizontal images (640×480 px, FOV ≈ 60°) were extracted with headings separated by 90° or 180°, while keeping the pitch at 0° (eye‑level ≈ 1.6 m). In intersections or irregular street segments, up to four directions were captured to reflect the complete surrounding context [59]. This fixed sampling interval and limited set of viewing directions provide consistent spatial density and comparable visual perspectives across all 56 cities.

Place Pulse uses pairwise comparison—a method long employed to assess subjective attributes such as style or visual appeal in clothing [60], urban façades [61,62], animated GIFs [63], and artworks [64]. Pairwise ranking is widely regarded as more reliable and efficient than direct numerical scoring [65,66].

Participants were recruited via organic media sources and targeted Facebook advertisements, and were asked to answer subjective questions across six dimensions—for instance, “Which place looks safer?” or “Which place looks more beautiful?”—by selecting one of two images. This data collection process ran from May 2013 to February 2016. In this study, we mainly focus on responses to “Which place looks more beautiful?”, for which 166,823 pairwise comparison responses were collected. Every individual image underwent an average of approximately 3.46 pairwise comparisons. Place Pulse 1.0 shows that ratings are largely independent of respondents’ age, gender, or geographic location [67]; hence the dataset offers a culturally diverse yet methodologically uniform benchmark.

Because evaluations are made on static images rather than on‑site visits, extraneous local variables (e.g., transient noise, weather, or social activity) exert relatively little influence on the scores. Consequently, the derived VAPS represent a consistent, image‑based measure of perceived beauty that can be compared across all 56 cities without the confounding effects inherent in localized, in‑person surveys.

……

2.3. Research methods process

……

2.3.1. Urban color features

Figure 3 summarises the end‑to‑end workflow that converts each SVI into quantitative colour descriptors and feeds them into the subsequent modelling pipeline. First, every image is transformed to HSV and RGB colour spaces. A three‑cluster K‑means algorithm (k = 3, chosen as the minimum that captures foreground–middle‑background variation while keeping computation light) groups pixels in HSV space; the cluster with the largest pixel share is defined as the dominant colour [68,69]. Its relative size is recorded as the Dominant‑Colour Ratio (DCR), while the cluster centroid provides the basic hue (H), saturation (S), value (V) and red–green–blue (R, G, B) channel values.
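As a concrete illustration, the dominant-colour step can be sketched in a few lines of NumPy. This is a minimal stand-in for the pipeline, not the authors' code: the function name, toy settings, and synthetic HSV pixels are ours.

```python
import numpy as np

def dominant_colour_features(hsv_pixels, k=3, iters=20, seed=0):
    """Toy K-means (k=3) over HSV pixel triples; returns the
    Dominant-Colour Ratio (DCR) and the dominant cluster centroid
    (H, S, V). A simplified sketch of the step described above."""
    rng = np.random.default_rng(seed)
    centroids = hsv_pixels[rng.choice(len(hsv_pixels), k, replace=False)]
    for _ in range(iters):
        # assign each pixel to its nearest centroid
        dist = np.linalg.norm(hsv_pixels[:, None] - centroids[None], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = hsv_pixels[labels == j].mean(axis=0)
    counts = np.bincount(labels, minlength=k)
    dom = counts.argmax()
    dcr = counts[dom] / len(hsv_pixels)   # share of the largest cluster
    return dcr, centroids[dom]

# toy demonstration: two synthetic HSV pixel clusters (70% / 30%)
rng = np.random.default_rng(1)
pixels = np.vstack([
    rng.normal([30.0, 200.0, 150.0], 2.0, (700, 3)),
    rng.normal([110.0, 80.0, 60.0], 2.0, (300, 3)),
])
dcr, dominant_hsv = dominant_colour_features(pixels)
```

The same centroid that defines DCR also supplies the basic H, S, and V feature values for that image.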

……

2.3.2. Urban visual elements features

Google Street‑View images from Place Pulse 2.0 were semantically segmented to quantify the physical components of each streetscape. We adopted SegNet, an encoder–decoder convolutional network whose class‑balanced training on the Cityscapes and CamVid benchmarks has proved both accurate and computationally efficient [73,74]. The pretrained weights were fine‑tuned on a 6,000‑image subset of Place Pulse to accommodate colour and perspective differences, using an 80∶20 train‑validation split and early stopping on mean Intersection‑over‑Union.

The final model predicts 14 pixel classes (road, building, vegetation, sky, traffic sign, etc.; see Fig. 4). Because transient objects such as cars, buses and trains add noise yet contribute little to long‑term visual aesthetics, their masks were discarded [75]. For each retained class k we computed its share of image pixels, yielding four continuous indicators that the literature recognises as pivotal to perceived environmental quality: Vegetation Ratio (VgR), Sky‑Visibility Ratio (SkVR), Building Ratio (BR) and Road Ratio (RR) [75–78]. Table 2 summarises the descriptive statistics of these visual‑element variables.
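The pixel-share computation itself reduces to counting labels in the segmentation mask. A minimal sketch, using hypothetical class ids (the real ids depend on the SegNet label map):

```python
import numpy as np

# hypothetical label ids for the four retained classes
CLASS_IDS = {"VgR": 2, "SkVR": 3, "BR": 1, "RR": 0}

def element_ratios(mask, class_ids=CLASS_IDS):
    """Pixel share of each retained class in a segmentation label mask.
    Transient classes (cars, buses, trains) simply never appear in
    class_ids, mirroring the masking step described in the text."""
    total = mask.size
    return {name: float((mask == cid).sum()) / total
            for name, cid in class_ids.items()}

# toy 2x3 mask: 3 vegetation, 1 sky, 1 building, 1 road pixel
ratios = element_ratios(np.array([[2, 2, 3], [1, 0, 2]]))
```

Each image thus yields a short vector of ratios in [0, 1], which joins the colour descriptors as model input.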

This pixel‑based workflow ensures that every SVI is translated into a consistent, image‑derived vector of visual cues that can be directly linked—via the modelling steps detailed in Section 2.3.4—to human aesthetic judgements.

……

2.3.3. Visual aesthetic perception score

To translate human pairwise judgments into a continuous aesthetic metric, we adopt the TrueSkill algorithm [79], a Bayesian ranking method originally developed for online gaming. This approach provides a robust and continually refined measure of aesthetic quality [61,58]. Specifically, the skill of each image is modeled as a Gaussian random variable N(μ, σ²) and is updated after each comparison. When a user selects image i over image j in a pairwise comparison, the update equations are as follows:

……

This dynamic updating yields a robust, confidence‐weighted VAPS for each image. Unlike simple win–loss tallies, TrueSkill accounts for the reliability of each comparison and handles ties explicitly, producing a stable ranking even when images receive different numbers of votes [58]. Figure 5 displays the final VAPS distribution across all SVIs. These scores serve as our dependent variable in the regression models of Section 2.3.4, providing a human‑grounded benchmark that is independent of any image‑derived features.
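For readers unfamiliar with TrueSkill, the win-case update can be sketched in pure Python. This is a simplified version — no draws, no dynamics factor — and the function is our illustration, not the authors' implementation; `beta = 25/6` is the conventional default.

```python
from math import sqrt
from statistics import NormalDist

_STD = NormalDist()  # standard normal, supplies pdf and cdf

def trueskill_update(winner, loser, beta=25 / 6):
    """One TrueSkill-style update after `winner` beats `loser`.
    Ratings are (mu, sigma) pairs; win case only, a sketch of the
    Bayesian update described above rather than the full model."""
    (mu_w, s_w), (mu_l, s_l) = winner, loser
    c = sqrt(2 * beta ** 2 + s_w ** 2 + s_l ** 2)
    t = (mu_w - mu_l) / c
    v = _STD.pdf(t) / _STD.cdf(t)      # how far the means shift
    w = v * (v + t)                    # how much uncertainty shrinks
    mu_w += s_w ** 2 / c * v
    mu_l -= s_l ** 2 / c * v
    s_w *= sqrt(max(1.0 - s_w ** 2 / c ** 2 * w, 1e-9))
    s_l *= sqrt(max(1.0 - s_l ** 2 / c ** 2 * w, 1e-9))
    return (mu_w, s_w), (mu_l, s_l)

# two images start at the conventional prior (mu=25, sigma=25/3)
img_a, img_b = trueskill_update((25.0, 25 / 3), (25.0, 25 / 3))
```

After a win the winner's mean rises, the loser's falls, and both uncertainties shrink — which is why images with few votes still receive stable, confidence-weighted scores.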

……

2.3.4. VAPS decoupling empirical model

To identify the most suitable machine learning model for predicting the impact of urban color features and visual element features on VAPS, we evaluate the performance of eight widely used models across different algorithmic families. Specifically, we consider tree-based models, including Random Forest (RF) [43], XGBoost [44], CatBoost [45], and LightGBM [42]; distance-based and kernel methods, including k-Nearest Neighbors (KNN) [47] and Support Vector Machine (SVM) [46]; a neural network-based model, Multi-Layer Perceptron (MLP) [48]; and a traditional decision tree model, Decision Tree (DT) [49]. These models are chosen for their effectiveness in handling structured data and their widespread use in predictive modeling tasks.

All models were trained on the same feature matrix X (urban colour and visual‑element variables) with VAPS as the response variable y. We employed five‑fold cross‑validation to obtain robust estimates of out‑of‑sample performance and guard against overfitting. To fine‑tune their performance, we used Optuna’s Bayesian optimization framework to select hyperparameters—such as learning rates, tree depths, and the number of estimators—by minimizing the validation loss in each fold [80,81].

For model comparison, we evaluated each algorithm’s prediction accuracy—using Mean Absolute Error (MAE) to gauge average error magnitude, Mean Squared Error (MSE) to place extra weight on large deviations, and R² to capture the proportion of VAPS variance explained—together with its training time as a measure of computational efficiency. By selecting the model that combined the lowest MAE and MSE, the highest R², and the shortest training time, we identified the optimal learner, which was then adopted for our final decoupling analysis and SHAP interpretation in Section 2.3.5.
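The selection protocol — five-fold cross-validation scored on MAE, MSE, and R² — can be sketched as below. We use scikit-learn regressors as stand-ins (GradientBoosting in place of LightGBM) and synthetic data; the names and settings are illustrative, not the authors' configuration.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor

def compare_models(models, X, y, folds=5, seed=0):
    """Five-fold CV comparison on MAE, MSE and R^2 — a sketch of the
    protocol described above (hyperparameter tuning omitted)."""
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    scores = {}
    for name, model in models.items():
        mae, mse, r2 = [], [], []
        for tr, te in kf.split(X):
            model.fit(X[tr], y[tr])
            pred = model.predict(X[te])
            mae.append(mean_absolute_error(y[te], pred))
            mse.append(mean_squared_error(y[te], pred))
            r2.append(r2_score(y[te], pred))
        scores[name] = {"MAE": np.mean(mae),
                        "MSE": np.mean(mse),
                        "R2": np.mean(r2)}
    return scores

# toy demonstration with synthetic colour/element-style features
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, (300, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] ** 2 + rng.normal(0.0, 0.1, 300)
scores = compare_models(
    {"DT": DecisionTreeRegressor(random_state=0),
     "GBM": GradientBoostingRegressor(random_state=0)},
    X, y)
```

The model with the lowest MAE/MSE and highest R² (here tracked per name in `scores`) would be carried forward, as LightGBM was in the study.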

……

2.3.5. Interpretation on driving factor of VAPS

To uncover how each input feature contributes to the machine‑learning model’s predictions of VAPS, we apply the SHAP framework [82]. Originating from cooperative game theory……

……

By integrating SHAP into our workflow, we obtain both global importance rankings (the mean absolute SHAP value across all images) and local explanations (the direction and magnitude of each feature’s effect on a single prediction), thereby rendering the machine‑learning model’s internal logic transparent and directly interpretable.
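To make the global/local distinction concrete, consider the linear case, where SHAP values have a closed form. The paper itself uses tree-based SHAP on LightGBM, so this is only an illustrative analogue with made-up features and coefficients.

```python
import numpy as np

def linear_shap(coefs, X):
    """Exact SHAP values for a linear model with independent features:
    phi[i, j] = coefs[j] * (X[i, j] - column mean of feature j).
    Summing a row recovers that prediction's deviation from the
    mean prediction — the additivity property SHAP guarantees."""
    return coefs * (X - X.mean(axis=0))

# toy feature matrix and coefficients (purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (200, 4))
coefs = np.array([2.0, -1.0, 0.5, 0.0])
phi = linear_shap(coefs, X)                    # local explanations
global_importance = np.abs(phi).mean(axis=0)   # global ranking
```

Each row of `phi` is a local explanation (signed per-feature contributions for one image); averaging absolute values down the columns gives the global ranking reported in the paper's Figure 10.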

 

Comments 2: The study covers a large scope (56 cities), which seems excessive. Managing and controlling variables on such a scale appears challenging. A more limited and focused sample could have provided deeper and more precise insights.

 

Response 2: Thank you very much for highlighting the concern that a large‐scale sample may complicate variable control. We fully agree and have added a discussion of this issue in Section 4.4 Limitations and Future Prospects, where we also propose conducting in‐depth case studies on specific city clusters to gain more refined intervention insights.

It is important to note that, although large‐scale studies do present challenges in variable control, the Place Pulse 2.0 data used here are not based on localized ratings by residents but on crowdsourced pairwise comparisons by global users on a unified online platform. This rating process is largely decoupled from the cultural, climatic, and immediate environmental conditions of the sampled cities, thereby substantially reducing local biases and external interference. Moreover, because evaluations are performed on static street‐view images rather than in‑situ experiences, the impact of transient local socioeconomic, cultural, and environmental variables is further minimized. In response to your suggestion, we have clarified these data characteristics in Section 2.2 Description of Data Sources. All additions and modifications are highlighted in red for your convenience. We hope these revisions address your concerns. Thank you once again for your valuable feedback and careful review!

 

2.2. Description of data sources

This study utilizes data from the urban street-view perception dataset Place Pulse 2.0 [58]. The dataset contains a total of 110,988 Google SVIs captured between 2007 and 2012 in 56 cities on six continents, thus covering a wide range of geographic settings. Crucially, Place Pulse is built on a global, crowdsourced rating platform rather than locally confined surveys: participants from many countries were recruited through organic media outreach and targeted Facebook advertisements, and all images were evaluated on the same web interface (centerforcollectivelearning.org/urbanperception). Because raters are not limited to the residents of the depicted cities, the resulting scores reflect a more universal aesthetic judgement and minimize city‑specific cultural bias, even though the dataset itself does not include explicit sociocultural variables for each city.

Street‑view frames were generated following the standard Place Pulse protocol. First, Google Street View panorama IDs were uniformly sampled along the OpenStreetMap road network at an adaptive spacing of roughly 50–100 m, ensuring coverage of both primary and secondary streets. For each panorama, two horizontal images (640×480 px, FOV ≈ 60°) were extracted with headings separated by 90° or 180°, while keeping the pitch at 0° (eye‑level ≈ 1.6 m). In intersections or irregular street segments, up to four directions were captured to reflect the complete surrounding context [59]. This fixed sampling interval and limited set of viewing directions provide consistent spatial density and comparable visual perspectives across all 56 cities.

Place Pulse uses pairwise comparison—a method long employed to assess subjective attributes such as style or visual appeal in clothing [60], urban façades [61,62], animated GIFs [63], and artworks [64]. Pairwise ranking is widely regarded as more reliable and efficient than direct numerical scoring [65,66].

Participants were recruited via organic media sources and targeted Facebook advertisements, and were asked to answer subjective questions across six dimensions—for instance, “Which place looks safer?” or “Which place looks more beautiful?”—by selecting one of two images. This data collection process ran from May 2013 to February 2016. In this study, we mainly focus on responses to “Which place looks more beautiful?”, for which 166,823 pairwise comparison responses were collected. Every individual image underwent an average of approximately 3.46 pairwise comparisons. Place Pulse 1.0 shows that ratings are largely independent of respondents’ age, gender, or geographic location [67]; hence the dataset offers a culturally diverse yet methodologically uniform benchmark.

Because evaluations are made on static images rather than on‑site visits, extraneous local variables (e.g., transient noise, weather, or social activity) exert relatively little influence on the scores. Consequently, the derived VAPS represent a consistent, image‑based measure of perceived beauty that can be compared across all 56 cities without the confounding effects inherent in localized, in‑person surveys.

……

4.4. Limitations and Future Prospects

Despite systematically examining the factors influencing street-view aesthetic perception from the two primary dimensions of urban color and visual elements, this study still has several limitations. First, due to data and research scope constraints, the study focuses only on representative indicators such as dominant color, color composition, and VgR. This may overlook other dimensions within urban space—such as sociocultural factors, historical district characteristics, microclimatic conditions, or qualitative aspects like urban cultural background and artistic design intent—that could potentially affect aesthetic perception. Second, although the data scale and sources used in this study are relatively extensive, they largely consist of static street-view information, making it difficult to capture the dynamic changes of urban spaces across different seasons, time periods, or activity contexts. Third, the global scope of 56 cities makes it challenging to fully control for local contextual variables and to delve into region‑specific dynamics. To overcome this, future research will include subgroup analyses by clustering cities according to geographic region, economic development level, or cultural heritage, and will also undertake focused case studies on selected city groups to yield deeper, more precise insights.

In future research, we will further expand the model’s features, incorporating multiple dimensions such as urban cultural context, functional zoning of neighborhoods, and behavioral patterns, to more comprehensively reveal the complex mechanisms underlying urban visual aesthetic perception. In particular, we plan to integrate qualitative methods—such as in‑depth interviews, expert reviews, and artistic design case studies—with our quantitative framework to capture the influence of cultural and design intentions on public‑space aesthetics. In addition, by leveraging real-time or periodically updated street-view data and employing neural networks and large-scale image recognition technologies, we aim to achieve automated bulk collection and feature extraction of urban street-view samples and to provide dynamic monitoring and prediction of urban aesthetic perception.

 

 

Comments 3: The article’s structure and framework need revision, particularly in the analysis and conclusion sections, which lack coherence and strong summarization.

 

Response 3: We sincerely appreciate your identification of these manuscript shortcomings and have given your invaluable suggestions thorough consideration. You correctly observed that the continuity between the analysis and conclusion sections was suboptimal and that the summaries lacked sufficient impact. In response, we have undertaken extensive revisions and refinements to enhance the manuscript’s coherence and persuasive power.

In Section 3 Results, we added linking sentences at the start and end of each subsection and inserted two to three concise summary sentences at the close of each to distill the core findings and reinforce logical flow. In Section 5 Conclusions, we restructured the narrative around a four‑step framework—background, results, implications, and recommendations—first revisiting our objectives and methods while underscoring academic and practical contributions, then summarizing key findings in clear, numbered items, and finally proposing actionable policy and design recommendations.

All changes are highlighted in red in the revised manuscript for your convenience. We believe these adjustments substantially improve the manuscript’s continuity and strength of its conclusions, and we hope they meet your expectations and earn your approval.

 

3. Results

3.1. Distribution of VAPS

The study first analyzed the distribution of VAPS across 56 major cities worldwide, which Figure 6 displays. A comparative analysis with the theoretical normal distribution indicates that the overall VAPS is approximately normally distributed, with a mean of 25.01 and a standard deviation of 5.60…… (First paragraph)

……

Overall, VAPS exhibits an approximately normal distribution worldwide, with stable data quality and pronounced variability. The spatial distribution patterns of high‑ and low‑scoring cities also provide a regional backdrop for subsequent investigations into the effects of color and visual elements. (Last paragraph)

3.2. Distribution of Street View Color and Visual Element Features

This section, in conjunction with representative street‑view examples, visually illustrates the central tendency and skewness of each color and visual‑element feature metric in the street‑view images. Figure 8 presents the kernel density distributions for different color and visual element feature indices, along with representative SVI corresponding to the respective mean standards. From the perspective of visual elements, buildings occupy a relatively large proportion in global urban street views…… (First paragraph)

……

Global street views are characterized by high building dominance, limited natural elements, and warm‑leaning but moderately bright colors. The combination of substantial color complexity and low harmony suggests visually rich yet unevenly coordinated urban palettes. (Last paragraph)

3.3. Performance Comparison of Different Machine Learning Decoupling Models

This section evaluates eight representative regression models in the street‑view aesthetic perception regression task, aiming to identify the algorithmic framework best suited to disentangle the contributions of color and visual elements. Figure 9 presents the performance metrics of eight machine learning models in the urban color perception regression task…… (First paragraph)

……

Based on the foregoing analyses and across all evaluated metrics, LightGBM not only surpasses traditional models and other ensemble algorithms in predictive accuracy but also exhibits clear advantages in training efficiency, and is thus chosen as the decoupling model for this study. Furthermore, the performance comparison explicitly shows that a single‐dimensional feature set cannot comprehensively explain street‐view aesthetics, whereas the integrated model combining color and visual‐element features markedly improves prediction accuracy, thereby further validating the methodological soundness of our multidimensional feature decoupling approach. (Last paragraph)

3.4. Analysis of Overall Feature Contribution

Building on our selection of LightGBM as the optimal decoupling model, we next examined how each feature drives the prediction of street view aesthetic perception. Using the SHAP framework, we quantified each variable’s relative contribution to the model output. SHAP values not only rank features by importance but also indicate the direction and magnitude of their effects (Figure 10).

Figure 10 shows that the VgR contributes most substantially—36.28%—underscoring the dominant role of greenery in shaping perceived urban beauty. BR and H follow, with contributions of 14.38% and 10.45%, respectively. A higher BR tends to convey urban density and structural intensity, which can suppress visual appeal, while H captures the influence of color tone on aesthetic perception.

Across all features, natural elements (VgR) emerge as the primary driver of VAPS, reinforcing the importance of green spaces in urban design. In contrast, structural elements (BR) and key color attributes (H) also exert strong but secondary influences, demonstrating that both form and color jointly shape street view aesthetics.

3.5. Non-linear Effects

Based on the feature‐contribution analysis above, this section further elucidates the complex nonlinear relationships between street‐view visual elements and color features and the VAPS, in order to explore the heterogeneous effects of each indicator across distinct threshold intervals.

3.5.1. Non-linear Effects of Street View Visual Element Features on VAPS

……

In summary, the nonlinear analysis of visual elements indicates that VgR and SkVR each have critical thresholds at 0.4 and 0.15, respectively, and that BR and RR exhibit optimal value ranges, with excessively high or low levels impairing aesthetic perception. This finding underscores the importance of maintaining all elements within appropriate bounds. (Last paragraph)

3.5.2. Non-linear Effects of Street View Color Features on VAPS

Figure 12 further reveals the nonlinear influence patterns between color features (H, S, V, CC, CH, DCR, and RGB) and VAPS. As H increases, VAPS rises, suggesting that public spaces with higher H values—particularly those in the green-blue range (approximately 70–90)—are more visually appealing to observers, likely due to their close association with natural settings…… (First paragraph)

……

In conclusion, the nonlinear effects of color features indicate that moderate levels of hue, saturation, brightness, and complexity are most conducive to aesthetic perception, whereas excessive uniformity or extreme values are detrimental. Street‑view color schemes should strike a balance between “richness” and “harmony”: moderately increasing saturation and complexity, maintaining the dominant hue proportion at 40–50%, and prioritizing blue‑green tones can significantly enhance VAPS; by contrast, overly high brightness or excessive coordination may weaken visual impact. (Last paragraph)

……

5. Conclusions

This study leveraged over 100,000 street‑view images from 56 major cities worldwide to develop a Visual Aesthetic Perception Score (VAPS) framework based on the TrueSkill algorithm. We systematically compared eight leading machine‑learning regression models—Decision Tree, KNN, SVM, MLP, RF, XGBoost, CatBoost, LightGBM—and identified LightGBM as the optimal decoupling tool due to its superior accuracy and computational efficiency. By integrating Shapley Additive Explanations (SHAP), we further quantified each color and visual‑element feature’s contribution to VAPS and uncovered their nonlinear threshold effects. Methodologically, our work fills a critical gap in large‑scale, joint quantitative analysis of urban color and visual elements, establishing a reproducible, scalable pipeline for urban aesthetic evaluation. Theoretically, it deepens understanding of the mechanisms driving urban street‑view beauty; practically, it offers actionable guidance for urban renewal and public‑space planning.

Our key findings are as follows. (1) VgR is the single most influential predictor, accounting for 36.28% of the model’s explanatory power, yet exhibits diminishing marginal returns beyond a 40% coverage threshold. (2) SkVR follows an inverted‑U pattern, enhancing VAPS up to 15% but reducing perceived appeal when exceeding this level. (3) BR and RR each display an “overuse penalty”: excessive BR (≥60%) saturates compressive effects, while RR peaks at around 20% before further increases diminish vibrancy. (4) In the color domain, optimal hue values lie in the 70–90 range (bluish‑green), with saturation above 55 and brightness between 100 and 130 significantly boosting aesthetic perception. (5) Moderate CC enriches visual impact, whereas overly high CH undermines layering. (6) DCR of 40–50% best balances visual focus without inducing monotony.

Based on these insights, we recommend the following design and policy measures. Urban green coverage should be maintained at 20–40%, and sky visibility at no more than 15%, to balance biophilic comfort with visual legibility, multifunctionality, and safety. Building height zoning and street‑section optimization should ensure BR ≤60% and RR ≤30%, preserving spatial hierarchy while avoiding oppressive or fragmented streetscapes. In color planning, a bluish‑green dominant palette at 40–50% DCR—complemented by accent colors—should be adopted, with saturation controlled above 55, brightness between 100 and 130, and color complexity elevated moderately without over‑harmonization, thereby achieving both visual impact and overall coherence. Implementing these guidelines can effectively enhance the aesthetic quality of urban street views and inform evidence‑based public‑space interventions.”

 

Comments 4: The conclusion is weak and fails to effectively summarize the findings or clarify their practical implications.

 

Response 4: Thank you sincerely for your valuable feedback on the Conclusions section. You rightly observed that the original draft did not sufficiently highlight the significance and outcomes of our study. In response, we have substantially reinforced the Conclusions in the revised manuscript:

At the end of the first paragraph, we added a succinct overview of our methodological innovations, theoretical contributions and practical relevance.

In the second paragraph, we distilled six core empirical findings into a numbered list. These findings detail the specific impacts of green‑view ratio, sky‑view ratio, building‑to‑road proportion, color threshold values and complexity on VAPS, and identify their optimal intervals.

In the third paragraph, we offered concrete recommendations for urban planning and color management grounded in these results.

All changes are highlighted in red. We believe these enhancements significantly strengthen the impact and applicability of our Conclusions. Thank you once again for your thoughtful guidance.

 

5. Conclusions

This study leveraged over 100,000 street‑view images from 56 major cities worldwide to develop a VAPS framework based on the TrueSkill algorithm. We systematically compared eight leading machine‑learning regression models—Decision Tree, KNN, SVM, MLP, RF, XGBoost, CatBoost, LightGBM—and identified LightGBM as the optimal decoupling tool due to its superior accuracy and computational efficiency. By integrating SHAP, we further quantified each color and visual‑element feature’s contribution to VAPS and uncovered their nonlinear threshold effects. Methodologically, our work fills a critical gap in large‑scale, joint quantitative analysis of urban color and visual elements, establishing a reproducible, scalable pipeline for urban aesthetic evaluation. Theoretically, it deepens understanding of the mechanisms driving urban street‑view beauty; practically, it offers actionable guidance for urban renewal and public‑space planning. (Paragraph 1)

Our key findings are as follows. (1) VgR is the single most influential predictor, accounting for 36.28% of the model’s explanatory power, yet exhibits diminishing marginal returns beyond a 40% coverage threshold. (2) SkVR follows an inverted‑U pattern, enhancing VAPS up to 15% but reducing perceived appeal when exceeding this level. (3) BR and RR each display an “overuse penalty”: excessive BR (≥60%) saturates compressive effects, while RR peaks at around 20% before further increases diminish vibrancy. (4) In the color domain, optimal hue values lie in the 70–90 range (bluish‑green), with saturation above 55 and brightness between 100 and 130 significantly boosting aesthetic perception. (5) Moderate CC enriches visual impact, whereas overly high CH undermines layering. (6) DCR of 40–50% best balances visual focus without inducing monotony. (Paragraph 2)

Based on these insights, we recommend the following design and policy measures. Urban green coverage should be maintained at 20–40%, and sky visibility at no more than 15%, to balance biophilic comfort with visual legibility, multifunctionality, and safety. Building height zoning and street‑section optimization should ensure BR ≤60% and RR ≤30%, preserving spatial hierarchy while avoiding oppressive or fragmented streetscapes. In color planning, a bluish‑green dominant palette at 40–50% DCR—complemented by accent colors—should be adopted, with saturation controlled above 55, brightness between 100 and 130, and color complexity elevated moderately without over‑harmonization, thereby achieving both visual impact and overall coherence. Implementing these guidelines can effectively enhance the aesthetic quality of urban street views and inform evidence‑based public‑space interventions. (Paragraph 3)

 

Comments 5: Despite the novel and interesting idea, the article requires substantial revision in terms of scientific structure and methodology to be considered a strong academic contribution.

 

Response 5: We are truly grateful for your invaluable feedback and have made every effort to improve and refine the manuscript. We sincerely thank the reviewers for their diligent work and hope that this optimized revision meets with your approval. Please accept our apologies for any shortcomings in the original draft; we have undertaken a comprehensive set of enhancements.

In the Methodology section, we have fully reorganized the presentation of data sources and processing workflows. We now provide detailed descriptions of the Place Pulse 2.0 street‑view image acquisition procedure, the principles and implementation of the TrueSkill rating method, and the key steps in semantic segmentation and color extraction. The mathematical models have been distilled to their core ideas to avoid unnecessary complexity and improve readability.

To address concerns about coherence in analysis and conclusions, we added transition sentences at the beginning and end of each subsection in the Results section and included two to three concise summaries at each subsection’s conclusion. In the Conclusions section, we adopted a clear logical structure: first revisiting our research objectives, methods, and contributions; then presenting six key empirical findings in numbered form; and finally offering concrete planning and color‑management recommendations.

We believe these systematic revisions significantly strengthen the manuscript’s clarity, structure, and practical relevance. Thank you again for your thoughtful guidance.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The manuscript provides a valuable contribution by emphasizing the influence of urban color and composition on aesthetic perception. Overall, the paper is well-written, with a clear structure and smooth flow. I have only a few minor suggestions that I hope can help the author further improve the quality of the manuscript:

1) In the abstract, I suggest the author clarify what is meant by “interpretable models.” The current phrasing may be a bit vague for readers unfamiliar with the concept.

2) In the introduction, it would be helpful to briefly outline the main types of machine learning models currently in use, and how they can be integrated with interpretable approaches. This would nicely echo the methodology and discussion sections. The following references may be useful:

Gao, M., & Fang, C. (2025). Deciphering urban cycling: Analyzing the nonlinear impact of street environments on cycling volume using crowdsourced tracker data and machine learning. Journal of Transport Geography, 124, 104179.

Zhang, S., Liu, N., Ma, B., & Yan, S. (2024). The effects of street environment features on road running: An analysis using crowdsourced fitness tracker data and machine learning. Environment and Planning B: Urban Analytics and City Science, 51(2), 529–545.

3) The literature review could be strengthened by offering a more critical perspective—for instance, questioning whether the role of urban color has been underestimated in previous research—and by more clearly identifying the research gap this study addresses.

4) In Figure 3, the section titled “evaluation formulate” is difficult to read.

5) In the discussion section, the author suggests that existing studies may overlook sociocultural factors in urban spaces. However, in Section 2.2, the manuscript notes that the dataset used provides broad spatial and cultural representativeness. This may come across as contradictory—clarification on this point would be helpful.

Author Response

Comments 1: In the abstract, I suggest the author clarify what is meant by “interpretable models.” The current phrasing may be a bit vague for readers unfamiliar with the concept.

 

Response 1: We greatly appreciate your insightful suggestion regarding the abstract. Indeed, as you noted, the original phrase “interpretable model” may not clearly convey its precise meaning to all readers. Accordingly, we have revised the abstract to specify “an interpretable ensemble learning approach that integrates LightGBM with Shapley Additive Explanations (SHAP),” and we have briefly explained that SHAP quantifies the contributions of color metrics and visual elements to VAPS by assigning each feature a Shapley value, thus converting black‑box predictions into transparent explanations. We trust that even readers unfamiliar with SHAP will now immediately understand how this method elucidates the role of each feature in street‑view aesthetic perception. Thank you again for helping us make the abstract clearer and more accessible.


 

Abstract: Constructing visually appealing public spaces has become an important issue in contemporary urban renewal and design. Existing studies mostly focus on single dimensions (e.g., vegetation ratio), lacking a large-scale integrated analysis of urban color and visual elements. To address this gap, this study employs semantic segmentation and color computation on a massive street-view image dataset encompassing 56 cities worldwide, comparing eight machine learning models in predicting Visual Aesthetic Perception Scores (VAPS). The results indicate that LightGBM achieves the best overall performance. To unpack this “black‑box” prediction, we adopt an interpretable ensemble approach by combining LightGBM with Shapley Additive Explanations (SHAP). SHAP assigns each feature a quantitative contribution to the model’s output, enabling transparent, post hoc explanations of how individual color metrics and visual elements drive VAPS. Our findings suggest that the vegetation ratio contributes the most to VAPS, but once greening surpasses a certain threshold, a “saturation effect” emerges and can no longer continuously enhance visual appeal. Excessive Sky Visibility Ratio (SkVR) can reduce VAPS. Moderate road visibility may increase spatial layering and vibrancy, whereas overly dense buildings significantly degrade overall aesthetic quality. While keeping the dominant color focused, moderate color saturation and complexity can increase the attractiveness of street views more effectively than overly uniform color schemes. Our research not only offers a comprehensive quantitative basis for urban visual aesthetics, but also underscores the importance of balancing color composition and visual elements, offering practical recommendations for public space planning, design, and color configuration.”
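The core idea behind SHAP, attributing a prediction to individual features via their average marginal contributions over all feature coalitions, can be illustrated without the `shap` library. The toy below computes exact Shapley values by enumerating feature subsets (exponential in the number of features, so only viable for a handful); the two-feature "aesthetic score" model and the zero baseline are hypothetical, not taken from the paper.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley attribution for one prediction.

    predict: callable taking a dict {feature: value}.
    x: the instance to explain; baseline: reference values used for
    'absent' features. Returns {feature: shapley value}. Enumerates all
    coalitions, so it is only suitable for toy models.
    """
    features = list(x)
    n = len(features)
    phi = {f: 0.0 for f in features}
    for f in features:
        others = [g for g in features if g != f]
        for k in range(n):
            for subset in combinations(others, k):
                # Shapley kernel weight for a coalition of size k
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                with_f = {g: (x[g] if g in subset or g == f else baseline[g]) for g in features}
                without_f = {g: (x[g] if g in subset else baseline[g]) for g in features}
                phi[f] += weight * (predict(with_f) - predict(without_f))
    return phi

# hypothetical 'aesthetic score' model: greenery helps, building density hurts
def model(v):
    return 0.5 * v["vegetation_ratio"] - 0.3 * v["building_ratio"]

x = {"vegetation_ratio": 40.0, "building_ratio": 60.0}
base = {"vegetation_ratio": 0.0, "building_ratio": 0.0}
phi = shapley_values(model, x, base)
print(phi)  # vegetation contributes +20, building density -18
```

For a linear model with a zero baseline, the Shapley values reduce to the individual terms, which makes the attribution easy to verify by hand; in the paper's setting, tree-model SHAP values would be computed efficiently by the `shap` library's tree explainer rather than by enumeration.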

 

Comments 2: In the introduction, it would be helpful to briefly outline the main types of machine learning models currently in use, and how they can be integrated with interpretable approaches. This would nicely echo the methodology and discussion sections. The following references may be useful:

Gao, M., & Fang, C. (2025). Deciphering urban cycling: Analyzing the nonlinear impact of street environments on cycling volume using crowdsourced tracker data and machine learning. Journal of Transport Geography, 124, 104179.

Zhang, S., Liu, N., Ma, B., & Yan, S. (2024). The effects of street environment features on road running: An analysis using crowdsourced fitness tracker data and machine learning. Environment and Planning B: Urban Analytics and City Science, 51(2), 529–545.

 

Response 2: We are sincerely grateful for your insightful comments and for the suggested references. Your feedback has greatly improved our manuscript.

According to your recommendation, we have added, in the Introduction, a concise overview of the main machine learning models in urban analytics and described how they can be combined with interpretable techniques (e.g., SHAP values). We have also incorporated the following citations:

Gao, M., & Fang, C. (2025). Deciphering urban cycling: Analyzing the nonlinear impact of street environments on cycling volume using crowdsourced tracker data and machine learning. Journal of Transport Geography, 124, 104179.

Zhang, S., Liu, N., Ma, B., & Yan, S. (2024). The effects of street environment features on road running: An analysis using crowdsourced fitness tracker data and machine learning. Environment and Planning B: Urban Analytics and City Science, 51(2), 529–545.

Changes have been highlighted in red in the text. We hope these revisions satisfactorily address your suggestion and further strengthen the coherence between our Introduction, Methodology, and Discussion sections. Thank you again for your valuable guidance.

 

1. Introduction

……

The algorithms currently used in research exhibit significant differences in their ability to model nonlinear relationships, resist noise, and scale to large datasets, yet are rarely compared systematically, which hinders cross-study synthesis. In regression tasks, machine learning methods commonly fall into four categories [39]. Tree-based models (e.g., Decision Tree, Random Forest, XGBoost, LightGBM, CatBoost) capture feature interactions via hierarchical splits [40–43]; ensemble variants such as RF, XGBoost, LightGBM, and CatBoost improve fit and generalization through bagging or boosting, with XGBoost and LightGBM exploiting efficient gradient-boosting frameworks for rapid training and high accuracy on large data—LightGBM being especially adept at sparse-feature handling—and CatBoost using ordered boosting to reduce categorical bias and overfitting. Distance- and kernel-based methods (KNN, SVM) address high-dimensional nonlinearity via nearest-neighbor assumptions and kernel transformations [44,45], respectively: KNN requires no parametric assumptions but is sensitive to noise, while SVM offers stable performance in high dimensions at the cost of greater computational expense. Neural networks (Multi-Layer Perceptron) are powerful for modeling complex nonlinear patterns but demand extensive hyperparameter tuning and longer training times [46]. Finally, the Decision Tree provides a simple, interpretable baseline [47]. All of these models have demonstrated excellent performance in regression applications such as urban environmental quality assessment and thus represent mainstream choices for predicting street-view aesthetic perception. A comprehensive comparison is therefore essential to identify the optimal algorithm for decoupling the contributions of color and visual elements.
Additionally, these models can be used in conjunction with interpretable techniques such as Shapley Additive Explanations (SHAP) to reveal the contribution of individual visual features to model predictions, improving the transparency and applicability of the analysis [48,49]. (Paragraph 3)
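The systematic-comparison workflow described in this paragraph (same train/test split, same error metric, several model families) can be sketched in miniature. The example below compares a hand-rolled k-nearest-neighbours regressor against a mean-prediction baseline on synthetic data; in a real replication one would swap in the eight libraries named in the text (LightGBM, XGBoost, CatBoost, etc.), which are omitted here to keep the sketch dependency-free. The synthetic "greenery vs. score" relationship is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data: score is a nonlinear function of a 'greenery' feature
X = rng.uniform(0, 100, size=(200, 1))
y = np.sin(X[:, 0] / 20.0) + rng.normal(0, 0.1, size=200)
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

def knn_predict(X_tr, y_tr, X_te, k=5):
    """Plain k-nearest-neighbours regression: mean of the k closest targets."""
    preds = []
    for x in X_te:
        idx = np.argsort(np.abs(X_tr[:, 0] - x[0]))[:k]
        preds.append(y_tr[idx].mean())
    return np.array(preds)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# evaluate every candidate on the identical held-out split with one metric
models = {
    "mean baseline": np.full(len(y_test), y_train.mean()),
    "KNN (k=5)": knn_predict(X_train, y_train, X_test),
}
for name, pred in models.items():
    print(f"{name}: RMSE = {rmse(y_test, pred):.3f}")
```

Holding the split and metric fixed across all candidates is what makes the eight-model comparison in the paper meaningful; in practice one would also use cross-validation rather than a single split.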

 

 

Comments 3: The literature review could be strengthened by offering a more critical perspective—for instance, questioning whether the role of urban color has been underestimated in previous research—and by more clearly identifying the research gap this study addresses.

 

Response 3: We are grateful for your thoughtful suggestion. In response, we have injected a more critical discussion of prior work—highlighting how the role of urban color has often been underplayed—and have sharpened our articulation of the gap that this study fills. The revisions have been highlighted in red in the text.


 

1. Introduction

Over the past years, the demand for urban spaces has transitioned from merely pursuing quantitative expansion to prioritizing high standards and superior quality, thereby greatly elevating the importance of urban spatial environmental quality within sustainable urban development strategies [1,2]. Urban color and visual elements together constitute a key representation of the urban spatial environment. Among these, urban color plays a crucial role in shaping the overall quality of urban spaces [3], making it imperative to accurately assess the quality of the urban color environment [4]. However, despite this acknowledged importance, many studies have treated color chiefly as an adjunct to other visual features, thereby underestimating its standalone influence on residents’ aesthetic and behavioral responses [5,6]. Such assessments not only provide guidance for urban color planning, shaping city identity and culture [7,8], but also help determine whether the quality of urban color satisfies the residents' psychological needs [4,9]. Moreover, the design and characteristics of visual elements in cities can stimulate positive emotions and create favorable sensory experiences [10], as well as directly influence how residents interact and behave in urban settings [11]. Consequently, these factors are the focus of the current study. For instance, natural elements such as sky visibility ratio and vegetation ratio play a vital role in shaping urban spatial quality [12]. Studies grounded in environmental psychology have demonstrated their impact on human perception and physical well-being [13]. Similarly, research has shown that elements of the natural environment can elicit human aesthetic and emotional responses [14]. Positive perceptual feedback from natural surroundings can enhance individuals’ inner well-being, which may have restorative effects for patients [15].
Furthermore, electroencephalogram (EEG) experiments have offered additional evidence of how visual elements influence internal human responses [16]. Therefore, comprehensively evaluating urban color and visual elements becomes especially important, and introducing a quantitative model and framework to assess how people perceive different colors and elements in cities and their communities is of considerable significance [17]. (Paragraph 1)

……

However, the current literature exhibits two main shortcomings. First, many studies have treated urban color merely as an adjunct to other visual features, thereby underestimating its standalone influence, and have seldom examined how color and visual elements interact synergistically [6,52]. Second, empirical efforts to quantify urban color perception have largely relied on small‑scale, traditional survey methods, resulting in limited generalizability and challenges in standardization. In terms of urban color, current survey approaches are primarily traditional field investigations [53]. These methods commonly involve manual photography and computation [4], color card comparisons [54], and instrument-based color measurements (Dai Jianjun et al.). Although these approaches offer high accuracy, they come with significant costs, require substantial time, and are constrained to relatively small urban regions, posing challenges for large-scale implementation [28]. Moreover, manual photography is affected by equipment parameters and weather conditions, making it difficult to standardize color capture. With respect to urban color perception, most studies use only simple questionnaire data and lack systematic, standardized research on urban color quality [4]. Although some studies have begun to use large-scale street-view data for evaluation and computation [23,55], further improvements in data volume and comprehensiveness of analytical mechanisms are needed. Regarding visual elements, most research relies on surveys, in-person interviews, and field observations to examine how individuals perceive and interact with the built environment on both visual and sensory levels [56]. However, the recorded outcomes tend to be abstract rather than quantitative [17], introducing the possibility of substantial individual variability that can affect accuracy [17,57]. (Paragraph 4)

……”

 

Comments 4: In Figure 3, the section titled “evaluation formulate” is difficult to read.

 

Response 4: We apologize for any inconvenience this may have caused readers. We have reformatted Figure 3 by moving the evaluation formula into a more prominent position and increasing its font size and weight to improve clarity. The updated figure has been replaced in the revised manuscript for your review.

 

Figure 3. The process applied to extract the color features.”

 

Comments 5: In the discussion section, the author suggests that existing studies may overlook sociocultural factors in urban spaces. However, in Section 2.2, the manuscript notes that the dataset used provides broad spatial and cultural representativeness. This may come across as contradictory—clarification on this point would be helpful.

 

Response 5: Thank you very much for drawing our attention to this potential inconsistency. We recognize that the original phrasing in Section 2.2 (“The dataset contains a total of 110,988 Google SVI captured between 2007 and 2012 in 56 cities on six continents, offering broad spatial and cultural representation.”) may have been misleading. To clarify, “cultural representation” here simply refers to the fact that our images come from cities with diverse cultural backgrounds and that ratings were crowdsourced from a globally mixed pool of participants via the same online platform. It does not imply that the dataset—or our analysis—incorporates or models explicit city‑level sociocultural variables (such as local history, traditions, or design norms). Our goal was to abstract away from specific local contexts so as to isolate the purely visual attributes of the street‑view images.

Accordingly, we have revised Section 2.2 to state clearly that, although the dataset spans a variety of cultural environments and uses global raters, it does not include explicit sociocultural descriptors for each city. We have also added a note in “Limitations and Future Prospects” to indicate that future work could integrate different social and cultural contexts—through methods such as in‑depth interviews or localized case studies—to more fully explore how resident backgrounds shape aesthetic perception, since street‑view visuals alone do not account for all influences.

We apologize for any confusion caused by our earlier wording and trust that the clarifications and highlighted revisions in red now fully address your concern.

 

2.2. Description of data sources

This study utilizes data from the urban street-view perception dataset Place Pulse 2.0 [53]. The dataset contains a total of 110,988 Google SVI captured between 2007 and 2012 in 56 cities on six continents, thus covering a wide range of geographic settings. Crucially, Place Pulse is built on a global, crowdsourced rating platform rather than locally confined surveys: participants from many countries were recruited through organic media outreach and targeted Facebook advertisements, and all images were evaluated on the same web interface (centerforcollectivelearning.org/urbanperception). Because raters are not limited to the residents of the depicted cities, the resulting scores reflect a more universal aesthetic judgement and minimize city‑specific cultural bias, even though the dataset itself does not include explicit sociocultural variables for each city.

Place Pulse uses pairwise comparison—a method long employed to assess subjective attributes such as style or visual appeal in clothing [54], urban façades [55,56], animated GIFs [57], and artworks [58]. Pairwise ranking is widely regarded as more reliable and efficient than direct numerical scoring [59,60].

Participants were recruited via organic media sources and targeted Facebook advertisements, and were asked to answer subjective questions across six dimensions—for instance, “Which place looks safer?” or “Which place looks more beautiful?”—by selecting one of two images. This data collection process ran from May 2013 to February 2016. In this study, we mainly focus on responses to “Which place looks more beautiful?”, for which 166,823 pairwise comparison responses were collected. Every individual image underwent an average of approximately 3.46 pairwise comparisons. Place Pulse 1.0 shows that ratings are largely independent of respondents’ age, gender, or geographic location [61]; hence the dataset offers a culturally diverse yet methodologically uniform benchmark.

Because evaluations are made on static images rather than on‑site visits, extraneous local variables (e.g., transient noise, weather, or social activity) exert relatively little influence on the scores. Consequently, the derived VAPS represent a consistent, image‑based measure of perceived beauty that can be compared across all 56 cities without the confounding effects inherent in localized, in‑person surveys.
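The step of converting pairwise "which place looks more beautiful?" votes into per-image scores can be sketched with a simplified rating update. The paper uses the TrueSkill algorithm (available in Python via the third-party `trueskill` package); the dependency-free stand-in below uses an Elo-style update, which shares the core mechanic of turning binary comparisons into a continuous ranking but omits TrueSkill's per-item uncertainty. The vote data and image names are hypothetical.

```python
def elo_update(r_winner, r_loser, k=32.0):
    """One Elo-style rating update from a single pairwise comparison.

    A simplified stand-in for TrueSkill: TrueSkill additionally maintains a
    per-item uncertainty (sigma) and updates means by Bayesian inference,
    which this sketch omits.
    """
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)  # smaller gain for expected wins
    return r_winner + delta, r_loser - delta

# hypothetical crowdsourced votes: (winner, loser) per comparison
votes = [("img_a", "img_b"), ("img_a", "img_c"),
         ("img_b", "img_c"), ("img_a", "img_b")]
ratings = {img: 1500.0 for img in ("img_a", "img_b", "img_c")}
for winner, loser in votes:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])

ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)  # ['img_a', 'img_b', 'img_c']
```

The resulting ratings play the role of the VAPS values in the paper: each image ends with a continuous score even though raters only ever made binary choices, which is why pairwise comparison scales well to crowdsourced platforms like Place Pulse.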

……

4. Discussion

……

4.4. Limitations and Future Prospects

Despite systematically examining the factors influencing street-view aesthetic perception from the two primary dimensions of urban color and visual elements, this study still has several limitations. First, due to data and research scope constraints, the study focuses only on representative indicators such as dominant color, color composition, and VgR. This may overlook other dimensions within urban space—such as sociocultural factors, historical district characteristics, microclimatic conditions, or qualitative aspects like urban cultural background and artistic design intent—that could potentially affect aesthetic perception. Second, although the data scale and sources used in this study are relatively extensive, they largely consist of static street-view information, making it difficult to capture the dynamic changes of urban spaces across different seasons, time periods, or activity contexts. Third, the global scope of 56 cities makes it challenging to fully control for local contextual variables and to delve into region‑specific dynamics. To overcome this, future research will include subgroup analyses by clustering cities according to geographic region, economic development level, or cultural heritage, and will also undertake focused case studies on selected city groups to yield deeper, more precise insights.

In future research, we will further expand the model’s features, incorporating multiple dimensions such as urban cultural context, functional zoning of neighborhoods, and behavioral patterns, to more comprehensively reveal the complex mechanisms underlying urban visual aesthetic perception. In particular, we plan to integrate qualitative methods—such as in‑depth interviews, expert reviews, and artistic design case studies—with our quantitative framework to capture the influence of cultural and design intentions on public‑space aesthetics. In addition, by leveraging real-time or periodically updated street-view data and employing neural networks and large-scale image recognition technologies, we aim to achieve automated bulk collection and feature extraction of urban street-view samples and to provide dynamic monitoring and prediction of urban aesthetic perception.”

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

The data measured by the authors in this article are interesting and valuable, but the reviewer thought that there are some problems in the paper. The specific contents are as follows.

  1. p.1 line 30, “Excessive SVR can reduce VAPS.”, the full abstract does not indicate what “SVR” is, the author should indicate that “SVR” is "Sky Visibility Ratio".
  2.  p.3 line 121-122, “Subsequently, we compare regression models using eight mainstream machine learning approaches and select the best-performing one. “ The author selected eight mainstream machine learning methods, but did not explain why these eight are mainstream.
    The authors should rearrange the word order and elaborate on the advantages and disadvantages of each measurement method.
  3. Fig. 3, the font of the calculation formula part in the figure is not clear.
  4. Fig. 12, the font of the calculation formula part in the figure is not clear.
  5. p.4 line 143-164, “This study utilizes data from the urban street-view perception dataset Place Pulse. The dataset contains a total of 110,988 Google SVI captured between...”  The part of “Description of data sources” is not clear.
    The specific acquisition method of street view image data is not introduced. The reviewers think that the author needs to clarify the interval distance and viewing direction of the street view image data.

Author Response

Comments 1: p.1 line 30, “Excessive SVR can reduce VAPS.”, the full abstract does not indicate what “SVR” is, the author should indicate that “SVR” is "Sky Visibility Ratio".

 

Response 1: We sincerely thank you for your careful review and insightful correction. We apologize for the oversight in failing to define “SVR” upon its first mention (page 1, line 30 of the Abstract), which may have confused readers. In the revised manuscript, we have updated the sentence to read “Excessive Sky Visibility Ratio can reduce VAPS,” and we now provide the full term followed by the abbreviation at its first occurrence in the main text to ensure clarity. Thank you again for your valuable suggestion!

 


 

Abstract: ……Our findings suggest that the vegetation ratio contributes the most to VAPS, but once greening surpasses a certain threshold, a “saturation effect” emerges and can no longer continuously enhance visual appeal. Excessive Sky Visibility Ratio can reduce VAPS.”

 

Comments 2: p.3 line 121-122, “Subsequently, we compare regression models using eight mainstream machine learning approaches and select the best-performing one. “ The author selected eight mainstream machine learning methods, but did not explain why these eight are mainstream. The authors should rearrange the word order and elaborate on the advantages and disadvantages of each measurement method.

 

Response 2: We greatly appreciate your constructive feedback. We fully agree that the introduction of the eight machine‑learning models in the original draft was abrupt and lacked the necessary background and rationale, which may have confused readers. To remedy this, we have inserted a new paragraph into the third paragraph of the Introduction that systematically reviews the principal categories of regression methods: tree‑based models and their ensemble variants (Random Forest, XGBoost, LightGBM, CatBoost), distance‑ and kernel‑based methods (KNN, SVM), neural networks (MLP), and the Decision Tree as a simple, interpretable baseline. In this addition, we discuss each category’s strengths and limitations in handling nonlinear relationships, robustness to noise, training efficiency, and interpretability. This context makes clear why we selected these eight algorithms and provides a solid theoretical foundation for our subsequent experimental comparisons. All changes have been highlighted in red for your convenience. Thank you again for your invaluable suggestion!

 

1. Introduction

……Against this background, many researchers have proposed concepts and indicator frameworks related to street-view perception, reorganizing and classifying key visual elements through semantic segmentation and information extraction. Current primary measures include openness [15,35], greenness [36], enclosure [37], and walkability [38]. The indicators above play a positive role in deepening understanding of urban scenes and providing practical guidance.

The algorithms currently used in research exhibit significant differences in their ability to model nonlinear relationships, resist noise, and scale to large datasets, yet are rarely compared systematically, which hinders cross-study synthesis. In regression tasks, machine learning methods commonly fall into four categories [39]. Tree-based models (e.g., Decision Tree, Random Forest, XGBoost, LightGBM, CatBoost) capture feature interactions via hierarchical splits [40–43]; ensemble variants such as RF, XGBoost, LightGBM, and CatBoost improve fit and generalization through bagging or boosting, with XGBoost and LightGBM exploiting efficient gradient-boosting frameworks for rapid training and high accuracy on large data (LightGBM is especially adept at sparse-feature handling), and CatBoost using ordered boosting to reduce categorical bias and overfitting. Distance- and kernel-based methods (KNN, SVM) address high-dimensional nonlinearity via nearest-neighbor assumptions and kernel transformations [44,45], respectively: KNN requires no parametric assumptions but is sensitive to noise, while SVM offers stable performance in high dimensions at the cost of greater computational expense. Neural networks (Multi-Layer Perceptron) are powerful for modeling complex nonlinear patterns but demand extensive hyperparameter tuning and longer training times [46]. Finally, the Decision Tree provides a simple, interpretable baseline [47]. All of these models have demonstrated excellent performance in regression applications such as urban environmental quality assessment and thus represent mainstream choices for predicting street-view aesthetic perception. A comprehensive comparison is therefore essential to identify the optimal algorithm for decoupling the contributions of color and visual elements. (Added Paragraph 3)

However, the current literature exhibits two main shortcomings. First, most studies focus solely on either color or visual elements, with limited attention paid to the combined effects of these two dimensions……”
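The model comparison described in this response can be illustrated with a minimal, hypothetical sketch (the synthetic data, variable names, and restriction to the scikit-learn subset of the eight models are assumptions for illustration; XGBoost, LightGBM, and CatBoost would plug into the same loop via their scikit-learn-compatible wrappers):

```python
# Minimal sketch: compare several regressors by cross-validated R^2.
# Synthetic data stands in for the street-view feature matrix and VAPS scores.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                 # e.g., color / visual-element features
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)  # nonlinear target

models = {
    "DT": DecisionTreeRegressor(random_state=0),
    "RF": RandomForestRegressor(n_estimators=200, random_state=0),
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "SVM": SVR(kernel="rbf"),
    "MLP": MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000, random_state=0),
}

# Mean 5-fold cross-validated R^2 per model; the best one is kept for analysis.
scores = {name: cross_val_score(m, X, y, cv=5, scoring="r2").mean()
          for name, m in models.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

In practice the ensemble models tend to dominate on tabular street-view features, which is consistent with the rationale given in the added Introduction paragraph.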

 

 

Comments 3: Fig. 3, the font of the calculation formula part in the figure is not clear.

 

Response 3: We apologize for the low clarity of the original figure, which made the formulas difficult to read. In the revised manuscript, we have replaced the image with a higher‑resolution version and enlarged and bolded the formula font to ensure clear readability.

 

Figure 3. The Process applied to extract the color features”

 

Comments 4: Fig. 12, the font of the calculation formula part in the figure is not clear.

 

Response 4: Thank you for pointing this out. We apologize for the low clarity of the calculation formulas in Figure 12. In the revised manuscript, we have updated Figure 12 by enlarging and emboldening the font used for the formulas within the legend to ensure they are easily readable.

 

Figure 12. Nonlinear relationship between color features and VAPS

Note: The equation shown in the legend represents the fitted regression line for the scatter-plot data.”
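The legend annotation described in this note can be reproduced with a small, hypothetical sketch (the synthetic data and variable names are assumptions; the actual figure plots color features against VAPS):

```python
# Sketch: fit a least-squares line to scatter data and format the equation
# string that a figure legend (such as the one in Figure 12) would display.
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 1, size=200)                        # e.g., a color feature
y = 0.8 * x + 0.1 + rng.normal(scale=0.05, size=200)   # e.g., VAPS score

slope, intercept = np.polyfit(x, y, deg=1)  # degree-1 polynomial = linear fit
legend_label = f"y = {slope:.2f}x + {intercept:.2f}"
print(legend_label)
```

The same string would then be passed as the `label` of the fitted line when plotting, so the legend carries the regression equation alongside the scatter points.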

 

Comments 5: p.4 line 143-164, “This study utilizes data from the urban street-view perception dataset Place Pulse. The dataset contains a total of 110,988 Google SVI captured between...”  The part of “Description of data sources” is not clear. The specific acquisition method of street view image data is not introduced. The reviewers think that the author needs to clarify the interval distance and viewing direction of the street view image data.

 

Response 5: Thank you very much for drawing our attention to this potential inconsistency. We recognize that the original phrasing in Section 2.2 (“The dataset contains a total of 110,988 Google SVI captured between 2007 and 2012 in 56 cities on six continents, offering broad spatial and cultural representation.”) may have been misleading. To clarify, “cultural representation” here simply refers to the fact that our images come from cities with diverse cultural backgrounds and that ratings were crowdsourced from a globally mixed pool of participants via the same online platform. It does not imply that the dataset—or our analysis—incorporates or models explicit city‑level sociocultural variables (such as local history, traditions, or design norms). Our goal was to abstract away from specific local contexts so as to isolate the purely visual attributes of the street‑view images.

Accordingly, we have revised Section 2.2 to state clearly that, although the dataset spans a variety of cultural environments and uses global raters, it does not include explicit sociocultural descriptors for each city. We have also added a note in “Limitations and Future Prospects” to indicate that future work could integrate different social and cultural contexts—through methods such as in‑depth interviews or localized case studies—to more fully explore how resident backgrounds shape aesthetic perception, since street‑view visuals alone do not account for all influences.

We apologize for any confusion caused by our earlier wording and trust that the clarifications and highlighted revisions in red now fully address your concern.

 

2.2. Description of data sources

This study utilizes data from the urban street-view perception dataset Place Pulse 2.0 [58]. The dataset contains a total of 110,988 Google SVI captured between 2007 and 2012 in 56 cities on six continents, thus covering a wide range of geographic settings. Crucially, Place Pulse is built on a global, crowdsourced rating platform rather than locally confined surveys: participants from many countries were recruited through organic media outreach and targeted Facebook advertisements, and all images were evaluated on the same web interface (centerforcollectivelearning.org/urbanperception). Because raters are not limited to the residents of the depicted cities, the resulting scores reflect a more universal aesthetic judgement and minimize city‑specific cultural bias, even though the dataset itself does not include explicit sociocultural variables for each city. (Paragraph 1)

Street‑view frames were generated following the standard Place Pulse protocol. First, Google Street View panorama IDs were uniformly sampled along the OpenStreetMap road network at an adaptive spacing of roughly 50–100 m, ensuring coverage of both primary and secondary streets. For each panorama, two horizontal images (640×480 px, FOV ≈ 60°) were extracted with headings separated by 90° or 180°, while keeping the pitch at 0° (eye‑level ≈ 1.6 m). In intersections or irregular street segments, up to four directions were captured to reflect the complete surrounding context [59]. This fixed sampling interval and limited set of viewing directions provide consistent spatial density and comparable visual perspectives across all 56 cities. (Added Paragraph 2)

Place Pulse uses pairwise comparison—a method long employed to assess subjective attributes such as style or visual appeal in clothing [60], urban façades [61,62], animated GIFs [63], and artworks [64]. Pairwise ranking is widely regarded as more reliable and efficient than direct numerical scoring [65,66].

…….”
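The heading scheme in the sampling protocol quoted above can be sketched as follows (a hypothetical helper, not the actual Place Pulse extraction code; the function name and parameters are assumptions for illustration):

```python
# Sketch: enumerate camera headings for one Street View panorama, following
# the protocol quoted above: two views 90° or 180° apart on ordinary street
# segments, and up to four directions at intersections. Pitch stays at 0°
# (eye level, ~1.6 m) and FOV at ~60° for every extracted frame.
def panorama_headings(base_heading: float, at_intersection: bool,
                      separation: int = 180) -> list[float]:
    """Return compass headings (degrees, 0–360) for the frames to extract."""
    if at_intersection:
        offsets = (0, 90, 180, 270)      # four directions for full context
    else:
        offsets = (0, separation)        # two opposing or perpendicular views
    return [(base_heading + o) % 360 for o in offsets]

print(panorama_headings(30.0, at_intersection=False))  # -> [30.0, 210.0]
print(panorama_headings(30.0, at_intersection=True))   # -> [30.0, 120.0, 210.0, 300.0]
```

Each returned heading would correspond to one request against the panorama at the fixed pitch and field of view, which is what yields the consistent spatial density and comparable perspectives described in the revised Section 2.2.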

Author Response File: Author Response.pdf

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

Thank you to the authors for their efforts in revising the manuscript. I have no further suggestions and recommend the manuscript for publication.

Reviewer 4 Report

Comments and Suggestions for Authors

It can be acceptted at current form.
