Integrating Objective Segmentation and Subjective Perception to Predict Urban Landscape Preference: An XAI-Driven Approach
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The manuscript addresses a topic that fits well within the scope of Land: it combines urban green space landscape assessment with perception analysis using SegFormer, CLIP, predictive models, and SHAP. This is a strong and timely idea, and the research objective is clearly stated: to integrate objective physical features with subjective cognitive evaluations in order to predict landscape preference. At the same time, in its current form the manuscript requires major revision, because the main weaknesses concern methodological consistency, validation and interpretive caution.
In Section 2.2, the authors state that they used a dataset of 159 images compiled from 16 peer-reviewed articles, but immediately afterward they write that data from 16 papers were used for training, while data from the remaining three papers were reserved for validation. This implies a total of 19 papers rather than 16. In addition, the modeling section introduces yet another validation scheme: a random 80/20 split and five-fold cross-validation for hyperparameter tuning. It is therefore unclear whether the authors used between-study validation, a random holdout approach, or both simultaneously. This issue must be clarified unambiguously, because the current version does not allow full reproducibility.
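To make the distinction between between-study validation and a random holdout concrete, the following is a minimal sketch of a leave-one-study-out split. It is not the authors' procedure; the function name `leave_one_study_out` and the toy study labels are hypothetical and serve only as illustration. The point it shows is that every image from a held-out source paper stays out of training, which a random 80/20 split does not guarantee.

```python
from collections import defaultdict


def leave_one_study_out(study_ids):
    """Yield (held_out_study, train_idx, test_idx), holding out each source study once."""
    by_study = defaultdict(list)
    for idx, study in enumerate(study_ids):
        by_study[study].append(idx)
    for study in sorted(by_study):
        test_idx = by_study[study]
        # Training indices come only from the other studies.
        train_idx = [i for s, idxs in by_study.items() if s != study for i in idxs]
        yield study, sorted(train_idx), test_idx


# Hypothetical toy labels: 6 images drawn from 3 source papers (not the real dataset).
study_ids = ["A", "A", "B", "C", "C", "C"]
splits = list(leave_one_study_out(study_ids))
for held_out, train_idx, test_idx in splits:
    print(held_out, train_idx, test_idx)
```

Reporting which of the two schemes produced each performance figure would resolve the ambiguity described above.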
The authors build their model using 159 images and 23 predictors (14 objective and 9 subjective), then compare several machine-learning models and interpret the results using SHAP. Although the authors themselves acknowledge that the sample is limited, the scale of the conclusions and the level of detail in the proposed design implications appear overly ambitious for such a dataset. The limitations related to small sample size, possible overfitting, and low estimation stability in small datasets should be emphasized earlier and more strongly.
In the abstract, the authors refer to “robust predictive performance,” whereas the best model achieves R² = 0.442. This is a promising result, but it is better described as moderate rather than robust predictive performance. This is especially true given that the models based only on physical features achieve R² = 0.291, and the models based only on psychological indicators achieve R² = 0.174. These results justify the conclusion that the hybrid approach improves the explanation of preference variance, but they do not yet support very strong normative claims or broad design recommendations.
The bibliography is extensive and, to a large extent, relevant to the topic, but its editorial quality needs improvement. At least one reference is chronologically inconsistent and requires verification: for example, Wang et al. (2026) is cited in a manuscript formatted as Land 2024. The reference list also contains at least one clearly distorted or incomplete item (“Environment, S. 19, 1; Periodicals Archive Online Pg; 1987”), and one classic paper by Lothian from 1999 appears twice, as [9] and [21].
Author Response
Thank you for your valuable and constructive comments. Your feedback has greatly helped us improve the quality of our manuscript. We have prepared a point-by-point response to each of your comments, which is provided in the attached document.
We appreciate your time and effort in reviewing our work.
Author Response File:
Author Response.docx
Reviewer 2 Report
Comments and Suggestions for Authors
This manuscript addresses a timely and relevant topic by integrating semantic segmentation, a vision-language model, and explainable machine learning to predict urban landscape preference. The overall direction is interesting and potentially valuable for landscape planning and human-centered urban green space design. However, in its current form, the manuscript still has several substantial issues related to the positioning of novelty, dataset comparability and robustness, construct validity of the psychological variables, interpretation of model results, and the practical specificity of the planning implications.
First, the manuscript would benefit from a clearer and more precise positioning of its novelty. In Lines 72–101, the authors argue that existing studies have not sufficiently bridged the gap between objective spatial characteristics and subjective human perception, and in Lines 105–136 they present their framework as an integrated XAI approach applicable to real UGS planning. While this framing is promising, the manuscript does not yet make it sufficiently clear whether its main contribution lies in the combination of SegFormer and CLIP, in the use of CLIP-derived psychological indicators, in the SHAP-based interpretability strategy, or in the specific application to urban green space preference. The contribution would be stronger if the authors more explicitly distinguished methodological novelty, variable-level novelty, and application-oriented novelty, and compared their study more directly with recent related work in AI-based landscape perception and visual preference research.
Second, the dataset size and source heterogeneity require much stronger justification. According to Lines 142–156, the study uses 159 images compiled from 16 previous studies conducted between 2000 and 2025, and the preference scores were normalized using min–max scaling. At the same time, Lines 288–329 indicate that these 159 samples were modeled using 23 variables and nonlinear algorithms such as Random Forest and XGBoost, primarily relying on an 80/20 train-test split and five-fold cross-validation. This raises two related concerns. One is that the pooled images likely differ substantially in image quality, season, weather, cultural context, participant composition, and rating protocol, so score normalization alone may not ensure full comparability. The other is that the sample size remains relatively limited in relation to the number of predictors and the complexity of the models. The manuscript would be strengthened by additional robustness checks, such as repeated cross-validation, bootstrap-based performance intervals, or sensitivity analyses across source studies. At minimum, the limitations of this data structure for generalizability should be discussed more explicitly.
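As one illustration of the bootstrap-based performance intervals suggested here, the sketch below computes a percentile interval for R² from held-out predictions. It assumes a simple percentile bootstrap over test samples; the function names and the toy values are hypothetical and are not taken from the manuscript.

```python
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def bootstrap_r2_interval(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for R^2, resampling (truth, prediction) pairs."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        if max(yt) == min(yt):
            continue  # skip degenerate resamples with zero variance in the target
        yp = [y_pred[i] for i in idx]
        stats.append(r_squared(yt, yp))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical held-out preference scores and predictions (toy values only).
y_true = [0.10, 0.40, 0.35, 0.80, 0.60, 0.90, 0.20, 0.50]
y_pred = [0.20, 0.35, 0.45, 0.70, 0.55, 0.75, 0.30, 0.55]
point = r_squared(y_true, y_pred)
low, high = bootstrap_r2_interval(y_true, y_pred)
print(f"R2 = {point:.3f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

With a test set of roughly 32 images (20% of 159), such an interval is likely to be wide, which is itself useful information for readers judging generalizability.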
Third, the validity of the CLIP-based psychological variables remains one of the central methodological concerns of the paper. In Lines 222–280, the manuscript operationalizes complex constructs such as Safety, Tranquility, Fascination, Mystery, and Legibility through zero-shot CLIP and a prompt ensemble strategy. This is an innovative idea, but these constructs are theoretically rich and highly context dependent, and the current manuscript does not yet provide enough evidence that image-text similarity can serve as a valid proxy for them. The authors should at least provide the full prompts used, explain the prompt construction procedure in greater detail, discuss the consistency or stability across prompts, and acknowledge more explicitly that these variables are indirect proxy measures rather than direct observations of human psychological states. If any external validation against human ratings is available, including even a limited validation exercise would substantially improve the credibility of this part of the study.
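One simple way to report consistency across prompts is the mean pairwise correlation of per-image scores between the prompts of a single construct. The sketch below assumes the per-prompt similarity scores have already been computed; the prompt texts and numbers are hypothetical placeholders, not CLIP outputs and not the authors' prompts.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def prompt_consistency(scores_by_prompt):
    """Mean pairwise Pearson correlation across prompts for one construct."""
    pairs = list(combinations(scores_by_prompt.values(), 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def ensemble_score(scores_by_prompt):
    """Per-image ensemble score: plain average over prompts."""
    return [sum(col) / len(col) for col in zip(*scores_by_prompt.values())]


# Hypothetical prompts and made-up similarity scores for five images.
tranquility = {
    "a calm and quiet park": [0.21, 0.30, 0.18, 0.27, 0.24],
    "a peaceful green space": [0.23, 0.29, 0.17, 0.25, 0.26],
    "a tranquil natural scene": [0.20, 0.31, 0.19, 0.28, 0.22],
}
consistency = prompt_consistency(tranquility)
ensemble = ensemble_score(tranquility)
print(f"mean pairwise r = {consistency:.3f}")
```

A low mean pairwise correlation for a construct would indicate that the ensemble score depends heavily on prompt wording, which would weaken its use as a proxy.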
Fourth, although the manuscript repeatedly emphasizes planning and design implications, the current practical recommendations remain somewhat broad. The study suggests that the proposed framework can support evidence-based design guidelines for human-centered urban green space management, and it identifies factors such as water bodies, low vegetation, and sky openness as important drivers of psychological perception. However, most of these implications remain at the level of variable ranking rather than actionable design guidance. The manuscript would become much more useful for planning practice if the authors could translate the findings into more concrete ranges, thresholds, or trade-offs, such as approximate levels of openness, visible water proportion, or building presence associated with more favorable outcomes. In particular, the finding that buildings are negatively associated with overall preference while positively associated with perceived safety is both theoretically and practically important, and it deserves deeper treatment, possibly through nonlinear dependence plots or a clearer discussion of balance and threshold effects.
Overall, the manuscript has clear potential and explores a promising interdisciplinary direction. Nevertheless, substantial revision is still needed before it can be considered for publication.
Author Response
Thank you for your valuable and constructive comments. Your feedback has greatly helped us improve the quality of our manuscript. We have prepared a point-by-point response to each of your comments, which is provided in the attached document.
We appreciate your time and effort in reviewing our work.
Author Response File:
Author Response.docx
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
-
Reviewer 2 Report
Comments and Suggestions for Authors
The research team has revised and provided detailed explanations in response to the issues I raised. I therefore recommend accepting the manuscript in its current form.
