Peer-Review Record

Towards Transparent Urban Perception: A Concept-Driven Framework with Visual Foundation Models

ISPRS Int. J. Geo-Inf. 2025, 14(8), 315; https://doi.org/10.3390/ijgi14080315
by Yixin Yu 1, Zepeng Yu 2, Xuhua Shi 1, Ran Wan 2, Bowen Wang 3 and Jiaxin Zhang 2,4,*
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 12 June 2025 / Revised: 11 August 2025 / Accepted: 14 August 2025 / Published: 18 August 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This is a magnificent article that introduces a new methodology for urban visual perception. It achieves better results than currently used methods, although, as the authors acknowledge, it shares the limitations that such methods typically have. These limitations are well explained in the text, as are the method itself and the results obtained, which are presented in very meaningful and easy-to-interpret tables and graphs.

The only thing I would add is an assessment of the time cost of the proposed method relative to the methods it is compared against, particularly with a view to applying it to large numbers of frames, as well as the processing costs involved in handling large volumes of data, noting any significant differences.
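By way of illustration, such a comparison could be as simple as measuring mean inference time per frame for each method on the same set of street-view frames; the sketch below is a generic Python outline, and the names `run_upcbm` and `run_baseline` are placeholders, not the authors' actual code.

```python
import time

def benchmark(infer_fn, frames, warmup=5):
    """Return mean seconds per frame for an inference callable."""
    for frame in frames[:warmup]:  # warm-up runs to exclude startup cost
        infer_fn(frame)
    start = time.perf_counter()
    for frame in frames:
        infer_fn(frame)
    return (time.perf_counter() - start) / len(frames)

# Hypothetical usage, comparing the two methods on the same frames:
# t_upcbm = benchmark(run_upcbm, frames)
# t_base  = benchmark(run_baseline, frames)
# print(f"UP-CBM: {t_upcbm:.4f} s/frame, baseline: {t_base:.4f} s/frame")
```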

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This study addresses important issues of transparency and explainability in urban visual perception and proposes a concept-driven framework, UP-CBM, based on visual foundation models (VFMs). The framework demonstrates promising performance and interpretability in urban perception tasks and holds significant academic value and practical application potential. However, several shortcomings remain in the current manuscript:

The introduction provides an insufficient review of existing research on visual foundation models (VFMs). The current section (lines 40–52) only briefly mentions the capabilities of CLIP and DINOv2, without delving into the known limitations of VFMs in terms of explainability. Furthermore, it does not clearly articulate how the proposed framework advances explainability compared to other VFM-based methods.

The concept generation process in the Materials and Methods section lacks transparency. While the paper states that an initial concept pool is generated using GPT-4o and refined via expert filtering (lines 207–237), it does not clarify the criteria used in expert screening: for example, how are terms with “semantic redundancy” or “weak visual grounding” defined? Moreover, no comparison is provided between the pre- and post-filtered concept lists, nor is any method mentioned for validating inter-expert agreement.
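To illustrate the last point, inter-expert agreement on keep/discard decisions could be reported with a standard statistic such as Cohen's kappa; the sketch below uses invented labels purely for illustration, not the authors' data.

```python
from sklearn.metrics import cohen_kappa_score

# 1 = keep concept, 0 = discard (e.g., for redundancy or weak grounding);
# these labels are invented for illustration.
expert_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
expert_b = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]

kappa = cohen_kappa_score(expert_a, expert_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 0.61-0.80 is often read as substantial
```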

The Concept Bottleneck Layer (CBL) design in the Materials and Methods section is overly simplified. As described in lines 262–269, the CBL relies solely on a single 1×1 convolutional layer for linear projection. This design does not incorporate multi-scale feature integration, thereby limiting the model’s ability to interpret complex concepts in urban perception tasks effectively.
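For concreteness, the sketch below contrasts a single 1×1-convolution bottleneck of the kind described with a simple two-scale variant; it is a minimal PyTorch illustration with assumed layer sizes, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class SimpleCBL(nn.Module):
    """Single 1x1 conv: a purely linear map from backbone features to concepts."""
    def __init__(self, in_channels=768, n_concepts=50):  # assumed sizes
        super().__init__()
        self.proj = nn.Conv2d(in_channels, n_concepts, kernel_size=1)

    def forward(self, feats):                      # feats: (B, C, H, W)
        return self.proj(feats).mean(dim=(2, 3))   # (B, n_concepts)

class MultiScaleCBL(nn.Module):
    """Fuses a coarser pooled scale with the original before projecting."""
    def __init__(self, in_channels=768, n_concepts=50):
        super().__init__()
        self.proj = nn.Conv2d(2 * in_channels, n_concepts, kernel_size=1)

    def forward(self, feats):
        pooled = nn.functional.avg_pool2d(feats, 2)            # coarser scale
        up = nn.functional.interpolate(pooled, size=feats.shape[2:])
        fused = torch.cat([feats, up], dim=1)                  # (B, 2C, H, W)
        return self.proj(fused).mean(dim=(2, 3))
```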

Although the experimental section (lines 353–363) claims that the UP-CBM model outperforms several common baseline models, Tables 3 and 4 only report mean results without standard deviations across five randomized trials. This lack of statistical significance testing weakens the reliability of the comparative findings.
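As a sketch of the reporting I have in mind, means and standard deviations over the five trials, together with a paired significance test, could be computed as below; the trial scores are placeholders, not results from the paper.

```python
import numpy as np
from scipy import stats

# Placeholder scores for five randomized trials of each method.
upcbm    = np.array([0.881, 0.887, 0.884, 0.879, 0.886])
baseline = np.array([0.861, 0.866, 0.858, 0.864, 0.860])

print(f"UP-CBM:   {upcbm.mean():.4f} ± {upcbm.std(ddof=1):.4f}")
print(f"baseline: {baseline.mean():.4f} ± {baseline.std(ddof=1):.4f}")

# Paired test across matched trials (same splits/seeds for both methods).
t, p = stats.ttest_rel(upcbm, baseline)
print(f"paired t-test: t = {t:.2f}, p = {p:.4f}")
```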

In Section 4.4, the concept analysis involving a user study is limited by its small sample size (n=20), and the participants’ backgrounds are not disclosed (lines 463–469). As a result, the generalizability and reliability of the findings regarding concept semantic consistency are restricted.

In the discussion section (lines 517–529), the authors mention some methodological limitations, but the analysis is overly general. It lacks discussion of practical technical bottlenecks that may arise during real-world deployment of the proposed system.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The authors have done good work on urban visual analytics: they propose a transparent framework (UP-CBM) based on Visual Foundation Models (VFMs) and concept-based reasoning, and the model is shown to be effective through comparative results. This is a significant subject in GIS. I think this paper is interesting and can provide important references for understanding urban visual perception, so I believe it is worthy of publication.
Minor comments:
1. Section 1 “Introduction” and Section 2 “Related Work” are somewhat disorganized, as both sections include literature reviews. I think Section 1 could be simplified and more of the literature review presented in Section 2.
2. Some parameters of the equations are not explained, e.g., in Equation (9). The authors should explain the meaning of these parameters.
3. The innovations of this paper should be stated in the Conclusion.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

General Assessment:

I appreciate the authors' thorough and detailed responses to my previous comments. The revised manuscript demonstrates a genuine effort to address the raised concerns, and the overall quality of the work has been significantly improved. The authors have provided substantial additional content, quantitative evidence, and honest acknowledgments of limitations. Overall, these revisions have notably enhanced the manuscript's clarity, transparency, and rigor.

Specific Comments:

Comments 1 & 2: The authors have made reasonable modifications and appropriately addressed these concerns.

Comment 3: The authors have provided justification for adopting the simple 1×1 CBL design. However, as shown in Table 6, the two-layer CBL shows a slight performance improvement on the VRVWPR dataset (0.8835 vs 0.8919). Although the numerical difference is small, given that these two experiments correspond to regression and classification tasks respectively, this performance inconsistency may warrant analytical consideration. I suggest the authors briefly discuss the potential reasons underlying this difference.

Comment 4: The authors’ addition of Figure 9 significantly strengthens the reliability arguments. Suggestion: Consider adding confidence intervals alongside the current error bars, which would enhance statistical rigor and better quantify the uncertainty in performance estimates.
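For instance, a 95% confidence interval over the five trials can be derived from the t-distribution, which is appropriate at such a small n; the scores in this sketch are placeholders, not the paper's results.

```python
import numpy as np
from scipy import stats

scores = np.array([0.881, 0.887, 0.884, 0.879, 0.886])  # placeholder trials
mean = scores.mean()
sem = stats.sem(scores)  # standard error of the mean (ddof=1 by default)
lo, hi = stats.t.interval(0.95, df=len(scores) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.4f}, 95% CI = [{lo:.4f}, {hi:.4f}]")
```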

Comment 5: The authors honestly acknowledge the limitation of the small sample size (n=20) and provide participant demographic information. While this limitation cannot be immediately addressed, such acknowledgment and commitment to future work are appropriate.

Comment 6: The authors have comprehensively addressed this concern by expanding the discussion of practical deployment challenges.

Technical Issues: Numerical formatting inconsistency: Table 5 shows inconsistent decimal formatting. Please standardize whether leading zeros are required (e.g., 0.4405 vs .8054).

Comments on the Quality of English Language

Language concerns: The manuscript contains several grammatical errors that should be addressed:

Subject-verb disagreement (line 26): “Recent achievements of computer vision has opened” should be “have opened”.

Number consistency (line 237): “for two dataset” should be “for two datasets”.

Hyphenation inconsistency: “street view images” vs “street-view images” should be standardized.

Content duplication (lines 519-528): The passage beginning with “Nevertheless, we acknowledge that incorporating more sophisticated multi-scale feature fusion...” appears to be duplicated and should be removed. Therefore, I recommend the authors conduct rigorous language proofreading before publication.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
