Article

Urban Street-Scene Perception and Renewal Strategies Powered by Vision–Language Models

1 Department of Architecture, Built Environment and Construction Engineering (AUIC), Politecnico di Milano, 20133 Milan, Italy
2 College of Architecture and Urban Planning, Tongji University, Shanghai 200092, China
3 Tongji Architectural Design (Group) Co., Ltd. (TJAD), Shanghai 200092, China
* Authors to whom correspondence should be addressed.
Land 2026, 15(2), 244; https://doi.org/10.3390/land15020244
Submission received: 21 November 2025 / Revised: 4 January 2026 / Accepted: 29 January 2026 / Published: 31 January 2026
(This article belongs to the Special Issue Big Data-Driven Urban Spatial Perception)

Abstract

With rapid urbanization, urban renewal has become increasingly important. Traditional research has relied on expert assessments and objective indicators, lacking scalable frameworks that effectively translate street-level conditions into actionable renewal strategies. This study proposes a Vision–Language Model (VLM)-based framework to address these gaps, using the Hongshan Central District of Urumqi, China, as a case study. Specifically, we collected 4215 street-view images (SVIs) and employed VLMs to assess six perceptual dimensions (i.e., safety, liveliness, beauty, wealthiness, depressiveness, and boringness), together with textual descriptions. The best-performing model, selected through validation against a 500-respondent perception survey, was used to conduct spatial pattern and text mining analyses to inform targeted urban renewal strategies. Results show that (1) VLMs show high consistency with human raters in evaluating spatial perception across the six dimensions; (2) spatial clustering analysis successfully delineated four distinct renewal priority tiers, confirming the method's capability to translate perceptual data into actionable spatial strategies; and (3) text mining of the VLMs' rationales revealed that areas with lower perceptual scores are predominantly characterized by deficiencies in foundational infrastructure and street-level order, thereby providing explanatory evidence directly linked to the generated renewal priorities. This study provides a generative artificial intelligence (GAI)-driven and interpretable evaluation framework for urban renewal decision-making, facilitating precision-oriented and intelligent urban regeneration.

1. Introduction

Over the past decade, urban planning in China has shifted from high-speed expansion to the stock-optimization phase, marking a strategic shift from scale growth and factor inputs to renewal practices centered on quality enhancement, structural repair, and fine-grained governance [1,2,3]. A fundamental question in this transition is how to measure “urban spatial quality” in a scientific, comparable, and operational manner [4]. Such measurement considers not only objective physical indicators but also people’s subjective perceptions and experiences. For instance, as early as the 1960s, Kevin Lynch [5] emphasized in The Image of the City that the material environment shapes residents’ mental images, with paths, edges, districts, nodes, and landmarks jointly forming a legible urban image. Jane Jacobs [6] subsequently proposed the “eyes on the street,” arguing that streets animated by pedestrian flow and natural surveillance can substantially improve the sense of safety. Conversely, neighborhood dilapidation undermines safety and belonging, as classically articulated by Wilson and Kelling’s [7] “broken windows” hypothesis. Recent studies have shown that the aesthetic quality of streetscapes not only enhances urban attractiveness but is also positively associated with mental and physical health [8,9,10]. Together, these studies underscore that the visual quality of the built environment is tightly linked to subjective feelings; dimensions such as safety, vitality, and beauty have become key determinants of urban spatial success [11,12]. These converging insights demonstrate that visual perception is not merely an aesthetic consideration but a fundamental determinant of urban livability.
Traditionally, urban perception has been assessed through questionnaires and field audits. However, these methods are costly and time-consuming, making it difficult to describe perceptual differences street by street at the city scale [13,14]. Over the past decade, with the rise of crowdsourcing and computer vision, researchers began to use street-view images (SVIs) and machine learning to quantify perception [15]. For example, Salesses et al. [16] collected public judgments via online pairwise image comparisons to build large-scale datasets of subjective evaluations (e.g., safety, prosperity), revealing spatial inequalities in urban perception. The Place Pulse 2.0 dataset developed at MIT Media Lab, as described by Dubey et al. [17], further distilled urban perception into six dimensions: Safety, Liveliness, Beauty, and Wealthiness, together with the opposing attributes Depressiveness and Boringness. Building on this dataset, subsequent approaches typically relied on deep learning or machine learning. Specifically, Zhang et al. [18] achieved high accuracy in inferring perceptual attributes such as safety and beauty by employing Deep Spatial Attention Neural Zero-Shot (DSANZS) models with supervised and weakly supervised learning on the Place Pulse 2.0 dataset. Additionally, Contrastive Language–Image Pre-Training (CLIP)-based zero-shot methods have demonstrated high accuracy in urban perception analysis. For example, CLIP was successfully employed by Liu et al. [19] to build reliable perceived walkability indicators and by Zhao et al. [20] to accurately quantify the visual quality of built environments for aesthetic and landscape evaluation. However, although these approaches substantially reduce labeling costs and provide an end-to-end solution, they still require adaptation to the local context and offer limited interpretability.
On the other hand, with the rise of computer vision technology, many models have been developed to segment street-view elements and correlate them with perception using machine learning approaches. Semantic segmentation models such as DeepLab [21] and PSPNet [22] can identify and quantify various urban elements such as buildings, vegetation, vehicles, and pedestrians in street imagery, providing objective measures of streetscape composition. These segmented elements are then linked to perceptual outcomes through statistical modeling or machine learning algorithms. For instance, Kang et al. [23] used semantic segmentation to extract 150 visual elements from street-view images and employed random forest regression to predict safety perception scores, finding that the proportion of sky, trees, and crosswalks positively influenced perceived safety, while cars and construction elements had negative effects. Furthermore, explainable artificial intelligence techniques, particularly SHapley Additive exPlanations (SHAP) [24], offer opportunities to open the “black box” of machine learning models by revealing which visual elements contribute most to specific perceptual judgments. For example, researchers have used SHAP values to demonstrate that the presence of greenery and well-maintained facades positively correlates with safety perception. However, these element-based approaches often oversimplify the complex interactions between urban features and may miss subtle contextual factors that influence human perception, thus limiting their ability to capture the holistic nature of urban perception.
In recent years, vision–language models (VLMs) have provided new tools for urban perception research. Large multimodal models such as LLaVA, BLIP-2, and CLIP integrate image understanding with natural-language reasoning and can analyze images in a human-like, instruction-following manner without additional local training, as demonstrated by Moreno-Vera and Poco [25]. In the urban domain, scholars have begun to explore VLMs for street-scene parsing and evaluation. For example, Yu et al. [26] extracted street-scene elements with deep learning and, supported by adversarial modeling, quantified the six perception dimensions, combining the results with space-syntax analysis to reveal links between streetscape quality and element configuration. Moreover, VLMs can simultaneously process visual information and generate actionable recommendations for urban improvement, offering a more comprehensive approach than traditional perception analysis methods. This suggests that, without relying on locally labeled data, pretrained VLMs can be used to assess streetscape perception in any city, thereby lowering data costs and improving generalization.
Against this backdrop, this study seeks to address three key research questions that guide our investigation: (1) How accurately can VLMs predict human perceptual judgments across the six urban perception dimensions compared to traditional survey methods? (2) What are the spatial patterns and salient characteristics that VLMs identify in different urban areas, and how do these align with human observations? (3) To what extent can VLMs generate actionable and contextually appropriate renewal suggestions for urban planning practice? To answer these questions, we integrate multimodal VLMs with SVIs using the Hongshan Central District of Urumqi, China, as our study area. We collected 4215 street-view panoramas and conducted a perception survey with 500 respondents across the six dimensions to establish a human-rated baseline. We then applied multiple VLMs to measure Safety, Liveliness, Beauty, Wealthiness, Depressiveness, and Boringness, accompanied by textual explanations and renewal suggestions. By systematically comparing model outputs with survey results and mining keywords from model descriptions, we evaluate VLM performance boundaries and explore their potential as decision support tools for intelligent urban planning.

2. Materials and Methods

2.1. Study Area

The Hongshan Central District lies in the urban core of Urumqi, China, spanning parts of the Tianshan, Saybagh, and Shuimogou districts (Figure 1). It is a mixed area of commerce, culture, and housing. Within an area of 20 km², we obtained 4376 street-view panoramas from the Baidu Map API, covering primary and secondary streets, alleys, and key nodes (e.g., Hongshan Park and Renmin Park); the panoramas were downloaded between April and August 2024, while the imagery itself was captured in May 2021. The sampling points were generated at 50 m intervals along the road network to ensure comprehensive spatial coverage while maintaining computational feasibility. To ensure data quality, panoramas with poor visibility (fog, heavy shadows, or construction obstructions) were manually filtered by five student volunteers, and duplicate or near-duplicate images within 10 m of each other were removed to avoid spatial redundancy. A total of 4215 high-resolution street-view images were ultimately retained, forming a georeferenced dataset that underpins all subsequent model evaluation and spatial analyses.

2.2. VLM Selection and Prompt Design

Our methodological workflow, shown in Figure 2, comprises four steps. In Step 1, we create a VLM-simulated respondent persona whose demographic characteristics are based on real-world survey data (see Section 2.3). This persona is then coupled with task prompts, enabling the model to reason from the perspective of an ordinary resident rather than an expert. In Step 2, we instruct the VLMs to judge each SVI based on visible cues and subjective spatial perception, assigning a score from 1 to 10 for each perceptual dimension. Each score is accompanied by a brief description that reflects the model's decision-making process.
In Step 3, an expert-planning agent receives the panoramic image together with the six dimensional scores and their rationales. The agent is instructed to output renewal recommendations following a schema that includes a diagnostic summary, suggestions for immediate and short-term actions, and proposals for long-term solutions to structural challenges. We then compile a geo-referenced database (Step 4), with one integrated record per panorama. Fields include the image ID, location, persona, perception scores (and their average), rationales, and expert suggestions. This database supports the subsequent spatial and textual analyses.
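As a minimal illustration of Steps 1 and 2, the persona-conditioned prompt can be assembled as below. The persona fields and prompt wording are hypothetical assumptions for illustration, not the authors' exact prompt.

```python
# Hypothetical sketch of the Step 1-2 persona-plus-task prompt.
# Persona fields and wording are illustrative, not the study's actual prompt.
DIMENSIONS = ["safety", "liveliness", "beauty", "wealthiness",
              "depressiveness", "boringness"]

def build_prompt(persona: dict) -> str:
    """Couple a survey-derived resident persona with the six-dimension scoring task."""
    return (
        f"You are a {persona['age']}-year-old {persona['occupation']} living "
        f"in this city. Look at the street-view image and, from an ordinary "
        f"resident's perspective (not an expert's), rate each of the following "
        f"dimensions from 1 to 10 based only on visible cues, and give a brief "
        f"one-sentence rationale for each score: {', '.join(DIMENSIONS)}."
    )
```

The same prompt template can then be paired with any of the nine VLMs compared in this study, keeping the task definition constant across models.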
We employed nine VLMs for comparative evaluation. The selected models represent a balanced mix of open-source and commercial VLMs across different parameter scales, providing a representative comparison of current multimodal architectures. The open-source models include Qwen2.5-VL-3B, 7B, and 32B, ChatGLM-9B-Base, GLM-9B-Thinking, and LLaVA-1.6-7B. Among these models, we selected models from the same family (i.e., Qwen2.5-VL) to facilitate observation of scoring performance across different parameter scales. Additionally, the commercial models comprise ChatGPT-5, Gemini-2.5, and Claude-3.5, which provide stable APIs [27,28,29,30]. All models were configured with consistent hyperparameters.

2.3. Human Labeling and Result Validation

To validate the reliability of the VLM-generated scores, we randomly sampled 18% of the total images (n = 800) for manual scoring. To ensure comprehensive and reasonable spatial coverage, we performed sampling using a 100 m grid. We designed a six-dimensional perception survey and recruited 500 respondents covering different demographic groups (Table 1). Each respondent was randomly assigned a set of 40 SVIs to minimize potential bias from image sequence effects. Participants were asked to provide ratings on a 10-point Likert scale (1 = very negative, 10 = very positive) across the six dimensions. Four dimensions (safety, liveliness, beauty, and wealthiness) are positive (higher is better); the remaining two (depressiveness and boringness) are negative (higher is worse). To ensure comparability, the scores for depressiveness and boringness were inverted when calculating the average perception. We also collected open-ended comments from 156 participants (31.2% of the sample) to identify the factors respondents mentioned most often. Due to their limited and unsystematic coverage, however, these comments could not be used for quantitative analysis.
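The inversion of the two negative dimensions can be sketched as follows; this is a minimal example on the 1–10 scale used in the survey, not the study's processing code.

```python
# Sketch of the normalization described above: depressiveness and boringness
# are inverted on the 1-10 scale (score -> 11 - score) so that "higher is
# better" holds for all six dimensions before averaging.
NEGATIVE = {"depressiveness", "boringness"}

def overall_perception(ratings: dict) -> float:
    """Average of the six dimension scores with negative dimensions inverted."""
    adjusted = [11 - v if dim in NEGATIVE else v for dim, v in ratings.items()]
    return sum(adjusted) / len(adjusted)

example = {"safety": 7, "liveliness": 6, "beauty": 8,
           "wealthiness": 5, "depressiveness": 3, "boringness": 4}
# after inversion: depressiveness -> 8, boringness -> 7
```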
To quantify the agreement between model scores and the human-rated baseline, we employed a set of statistical metrics. The evaluation framework incorporated four complementary measures (Pearson's r, R2, MSE, and RMSE) to capture different aspects of model performance. Specifically, for each perceptual dimension, we computed the Pearson correlation coefficient between the model assessment and human ratings:
$$ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} $$
where $x_i$ represents model scores, $y_i$ represents human baseline scores, and $\bar{x}$ and $\bar{y}$ denote their respective means. Values near 1 indicate a strong positive relationship, while values close to 0 indicate no relationship. To assess absolute deviation from the baseline ratings, we computed the Mean Squared Error (MSE) and Root Mean Squared Error (RMSE):
$$ \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 $$
$$ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} $$
Lower MSE and RMSE values indicate enhanced predictive accuracy and reduced systematic bias. All metrics were computed per perception dimension and for the six-dimensional mean.
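For reference, the three agreement metrics defined above can be computed directly; this is a standard pure-Python sketch, not the study's evaluation code.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def mse(y, yhat):
    """Mean squared error between baseline ratings y and predictions yhat."""
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

def rmse(y, yhat):
    """Root mean squared error: same units as the 1-10 rating scale."""
    return math.sqrt(mse(y, yhat))
```

Because RMSE is expressed in the units of the rating scale itself, it directly supports statements such as "average deviation below one point on a 1–10 scale."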

2.4. Analysis of Model-Generated Explanations

Furthermore, to explore the basis of the VLMs' scoring, we analyzed the reasons for the model's decisions. Specifically, to extract latent thematic structures from these explanations, we employed BERTopic, a neural topic modeling framework that combines transformer-based embeddings with dimensionality reduction and clustering algorithms. The pipeline proceeded in four stages: first, all explanations were concatenated and preprocessed through standard natural language processing operations, including tokenization, lemmatization, lower-casing, and stop-word removal, retaining content words with clear urban and perceptual relevance. Second, each cleaned document was encoded into a high-dimensional semantic vector using a pre-trained sentence transformer (specifically, the all-MiniLM-L6-v2 model), yielding an embedding matrix $E \in \mathbb{R}^{n \times d}$, where $n$ denotes the number of documents and $d$ the embedding dimension. Third, we applied Uniform Manifold Approximation and Projection (UMAP) to reduce $E$ to a lower-dimensional manifold $Z \in \mathbb{R}^{n \times k}$ ($k < d$), preserving local neighborhood structure while facilitating subsequent clustering. Fourth, Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) partitioned $Z$ into coherent topic clusters, automatically determining the number of topics and identifying outlier documents. Formally, the topic assignment for document $i$ is given by
$$ t_i = \operatorname*{arg\,max}_{j}\ \mathrm{c\text{-}TF\text{-}IDF}(w_i, T_j) $$
where c-TF-IDF denotes class-based term frequency–inverse document frequency weighting, $w_i$ represents the token set of document $i$, and $T_j$ is the $j$-th topic cluster. Each discovered topic is characterized by its top-ranked terms under the c-TF-IDF metric, which we interpret through a dual-layer framework: a physical-environment layer encompassing built form, public realm, greenery, and circulation infrastructure, and a perception layer capturing orderliness, activity, and legibility. This data-driven approach obviates the need for predefined taxonomies, allowing thematic categories to emerge organically from the corpus while maintaining interpretability for urban planning practitioners.
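A toy implementation of the class-based TF-IDF weighting may clarify how topic terms are ranked. This simplified sketch follows the BERTopic-style formula tf(t, c) · log(1 + A / f(t)), where A is the average number of words per class; it is an illustrative assumption, not the library's exact implementation.

```python
import math
from collections import Counter

def c_tf_idf(class_docs):
    """class_docs: {topic_id: [token, ...]} -- all tokens of one cluster
    concatenated. Returns {topic_id: {term: weight}} so that terms frequent
    in one cluster but rare overall receive the highest weights."""
    total = Counter()           # corpus-wide term frequencies f(t)
    per_class = {}
    for cid, tokens in class_docs.items():
        counts = Counter(tokens)
        per_class[cid] = counts
        total.update(counts)
    # A = average number of words per class
    A = sum(len(t) for t in class_docs.values()) / len(class_docs)
    weights = {}
    for cid, counts in per_class.items():
        n_c = sum(counts.values())          # words in this class
        weights[cid] = {t: (c / n_c) * math.log(1 + A / total[t])
                        for t, c in counts.items()}
    return weights
```

On a tiny example, a term occurring only in one cluster (e.g., "tree") outranks a term shared across clusters (e.g., "street") within that cluster, which is exactly the behavior that makes c-TF-IDF terms readable topic labels.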
For comparative analysis and visualization, we first constructed a two-dimensional semantic embedding space by further reducing the UMAP coordinates Z via t-distributed Stochastic Neighbor Embedding (t-SNE) or by retaining the first two UMAP dimensions, producing a global semantic scatter that positions representative topic centroids and high-frequency keywords according to their distributional affinity and thematic coherence. We then aggregated topic prevalence—defined as the proportion of documents assigned to each topic—across renewal categories derived from quantile-based stratification of overall perception scores, generating grouped bar charts that reveal how explanatory themes are distributed differentially among low-, medium-, and high-quality urban environments. Finally, we extracted the top 20 keywords by c T F I D F weight across all topics and projected them onto the semantic scatter, color-coded by renewal category, to highlight which perceptual cues are most strongly concentrated within each stratum. Taken together, this BERTopic-driven pipeline translates unstructured textual rationales into planner-facing evidence, aligning latent semantic structures with spatial perception patterns and furnishing actionable insights for subsequent diagnostic and design interventions.

2.5. Spatial Analysis

We systematically mapped six urban perception dimensions across the study area to examine their spatial distribution patterns. To quantify the overall spatial clustering tendency of each perception dimension, we employed Moran’s I statistic using a Queen-contiguity spatial weights matrix, where spatial units are considered neighbors when sharing either an edge or a corner. The global Moran’s I is calculated as:
$$ I = \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}} \cdot \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} $$
where $n$ represents the number of spatial units, $x_i$ is the perception value at location $i$, $\bar{x}$ is the mean perception value across all locations, and $w_{ij}$ is the spatial weight between locations $i$ and $j$. The statistical significance of spatial clustering was assessed using the standardized z-score:
$$ Z = \frac{I - E[I]}{\sqrt{\mathrm{Var}[I]}} $$
where $E[I] = -\frac{1}{n-1}$ represents the expected value under the null hypothesis of no spatial autocorrelation.
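Given a precomputed spatial weights matrix, global Moran's I follows directly from the formula above. The sketch below is a plain-Python illustration (construction of the queen-contiguity matrix itself is omitted; in practice libraries such as PySAL handle both steps).

```python
def morans_i(x, W):
    """Global Moran's I for values x and a spatial weights matrix W
    (list of lists; W[i][j] > 0 when units i and j are neighbors)."""
    n = len(x)
    xbar = sum(x) / n
    dev = [v - xbar for v in x]                    # deviations from the mean
    s0 = sum(sum(row) for row in W)                # sum of all weights
    num = sum(W[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)
```

For a smoothly increasing sequence on a line of adjacent units, I exceeds the null expectation of −1/(n−1), signaling positive spatial autocorrelation, as found for all six perception dimensions here.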
Building on the global analysis, we conducted Local Indicators of Spatial Association (LISA) analysis to identify localized clustering patterns and spatial outliers. The local Moran’s I statistic for each location i is defined as:
$$ I_i = \frac{x_i - \bar{x}}{S^2} \sum_{j=1}^{n} w_{ij}\,(x_j - \bar{x}) $$
where $S^2 = \frac{1}{n-1}\sum_{j=1}^{n}(x_j - \bar{x})^2$ is the sample variance. LISA cluster maps were generated for each perception dimension, categorizing spatial units into four distinct types: High-High (HH) clusters representing locations with high perception values surrounded by high-value neighbors, Low-Low (LL) clusters indicating locations with low perception values surrounded by low-value neighbors, High-Low (HL) outliers denoting locations with high perception values surrounded by low-value neighbors, and Low-High (LH) outliers representing locations with low perception values surrounded by high-value neighbors. Statistical significance of local clustering was determined using conditional permutation tests with 999 Monte Carlo simulations at $p < 0.05$.
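The local statistic and the HH/LL/HL/LH labeling can be sketched similarly; the permutation-based significance test is omitted in this illustration.

```python
def local_morans(x, W):
    """Local Moran's I_i for each unit, given values x and weights W."""
    n = len(x)
    xbar = sum(x) / n
    dev = [v - xbar for v in x]
    S2 = sum(d * d for d in dev) / (n - 1)         # sample variance
    return [(dev[i] / S2) * sum(W[i][j] * dev[j] for j in range(n))
            for i in range(n)]

def lisa_label(xi_dev, lag_dev):
    """Classify a (significant) unit by the sign of its own deviation and
    of its spatial lag: HH / LL clusters, HL / LH outliers."""
    return ("H" if xi_dev > 0 else "L") + ("H" if lag_dev > 0 else "L")
```

A positive $I_i$ corresponds to a cluster (HH or LL), while a negative $I_i$ flags an outlier (HL or LH), which is how the cluster maps in Figure 5 are colored.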

3. Results

3.1. Model Selection and Accuracy Validation

We first compared multiple VLMs in terms of their ability to assess human perceptions. Figure 3 shows the Pearson correlation heatmap across the six dimensions. Overall, ChatGPT-5 achieved the highest correlations for most dimensions, particularly liveliness (0.72 ***) and beauty (0.61 ***). Although Claude-3.5 performed well in wealthiness (0.53 ***) and depressiveness (0.53 ***), its overall performance remained somewhat lower than that of ChatGPT-5. We also observed that Qwen2.5-VL-7B outperformed the other open-source models, a result that appears not strongly related to parameter scale, as evidenced by the lack of significant correlation with human ratings even for the 32B model. Therefore, ChatGPT-5 was selected as the core model for subsequent spatial analysis and text mining.
To further evaluate ChatGPT-5's numeric accuracy, we computed the R2, RMSE, and MSE for each of the six dimensions and for the average score (Table 2). Agreement is strongest on Liveliness (R2 = 0.517); Safety and Beauty fall in the 0.32–0.37 range; Wealthiness is lower (R2 = 0.221). Across dimensions, RMSE lies between 0.50 and 0.70, i.e., the average deviation is <1 point on a 1–10 scale (e.g., Safety RMSE = 0.544), while MSE is mostly 0.25–0.50. An average deviation below one point on a 1–10 scale is sufficient for preliminary screening and large-scale diagnostics, which represent the intended application of this method. The six-dimensional mean yields R2 ≈ 0.332. Taken together, ChatGPT-5 shows good consistency with the human benchmark and low errors, sufficient for the subsequent mapping, LISA, and text-based analyses.

3.2. Spatial Distribution of Perception Scores

Using ChatGPT-5, we generated perception predictions for the remaining 3415 street-view panoramas. The mean scores were Safety (mean 4.97), Liveliness (mean 4.91), Beauty (mean 4.93), Wealthiness (mean 4.98), Depressiveness (mean 4.89), and Boringness (mean 4.92). Scores across all dimensions ranged from 3.52 to 6.03, spanning a complete gradient from poorer to better environments. The corresponding spatial distribution maps are shown in Figure 4a–f. Specifically, high values on the positive dimensions (i.e., safety, liveliness, beauty, wealthiness) tend to concentrate along urban arterials, newly developed high-quality precincts, parks, and major commercial streets. Conversely, high scores on the negative dimensions (i.e., boringness and depressiveness) are more prevalent in narrow streets and older, lower-quality neighborhoods. Quantitatively, global Moran’s I is positive and statistically significant for all six dimensions (Table 3), indicating spatial clustering of similar values rather than random dispersion.
We further analyzed the spatial clustering patterns of the six perceptual dimensions using LISA, as shown in Figure 5. The HH clusters are concentrated in newly developed areas with comprehensive functionality, while LL clusters are more prevalent in older or peripheral areas. For instance, regarding safety perception, HH clusters form around Hongshan Park and its adjacent commercial corridor, as well as the residential areas in the south, while urban villages in the north exhibit LL clusters. We observed fewer HL and LH regions; however, these areas warrant particular attention, as they represent transitions between renewed and non-renewed zones. In terms of liveliness, HH clusters appear near southern residential areas, schools, and markets, while enclosed institutional complexes display continuous LL clusters. Regarding boringness, HH clusters predominantly occur along the outer ring elevated highway, while LL clusters are mainly found around Hongshan Park and Nanhu Square. Overall, positive perceptions are primarily concentrated in the urban center, while negative perceptions are mainly distributed in the urban periphery. Moreover, transition areas such as HL and LH zones are primarily located in areas undergoing renovation and require focused attention.

3.3. Identification of Renewal Priority Tiers

Building on the six-dimensional maps, we computed an overall perception score per street segment as a weighted mean of the six dimensions (with Depressiveness and Boringness reversed), with weights derived from each dimension's R2 value to account for their varying predictive power. We then employed K-means clustering to partition the study area into four renewal tiers based on the overall perception scores. The K-means algorithm iteratively assigned street segments to clusters by minimizing within-cluster variance, yielding four naturally occurring groups (cluster centroids: 4.2, 4.7, 5.1, 5.5); the number of clusters was determined using the elbow method and silhouette analysis. This data-driven approach avoided arbitrary threshold selection while maintaining interpretable distinctions between tiers: I. Comprehensive Upgrading (centroid = 4.2), II. Targeted Intervention (centroid = 4.7), III. Incremental Renewal (centroid = 5.1), and IV. Routine Maintenance (centroid = 5.5). The clustering yielded high between-cluster separation (silhouette score = 0.71) while preserving meaningful within-cluster coherence. The validity of this classification was further supported by two analyses: (1) a bootstrapped stability assessment showing consistent cluster assignments across 1000 resamples for 89% of segments, and (2) discriminant analysis confirming distinct environmental characteristics across tiers (Wilks' λ = 0.21, p < 0.001) [31].
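A minimal one-dimensional K-means is enough to reproduce the tiering logic on overall scores; this sketch uses quantile initialization as an assumption, and the study's elbow/silhouette selection and validation steps are not reproduced here.

```python
def kmeans_1d(values, k, iters=100):
    """Tiny 1-D K-means: returns centroids and a cluster label per value.
    Illustrates only the tier-partitioning step on scalar perception scores."""
    vs = sorted(values)
    # initialize centroids at evenly spaced quantiles of the sorted scores
    cents = [vs[int((i + 0.5) * len(vs) / k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:  # assign each score to its nearest centroid
            j = min(range(k), key=lambda c: abs(v - cents[c]))
            groups[j].append(v)
        new = [sum(g) / len(g) if g else cents[j]  # recompute centroids
               for j, g in enumerate(groups)]
        if new == cents:
            break
        cents = new
    labels = [min(range(k), key=lambda c: abs(v - cents[c])) for v in values]
    return cents, labels
```

Sorting the resulting centroids then maps each cluster to a renewal tier, from Comprehensive Upgrading (lowest centroid) to Routine Maintenance (highest).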
Figure 6 shows that areas with distinctly low overall perception and multiple underperforming dimensions are classified as Tier I. These areas are typically marked by physical decay, disorder and a lack of vitality, and include Hongshan Road, Binhe Middle Road, Wuxing South Road and Huhuo Road. Tier II covers moderate conditions with localized deficiencies amenable to targeted solutions (Hetan Expressway, Xihong East Road and Qiantangjiang Road). Tier III covers generally good environments where quality enhancement and vitality infusion are appropriate (Xinyi Road, Youhao South Road, the Xinjiang Uygur Autonomous Region Department of Water Resources, etc.). Tier IV represents high-quality settings where routine maintenance is sufficient (Tianshan District Government Residential Quarter, Mashi Community, and Heping South Road).

3.4. Semantic Analysis of Street-View Descriptions

Using the descriptions and renewal suggestions generated by ChatGPT-5, we conducted text mining to extract the semantic structure of model reasoning. Using BERTopic, we classified the text into eight categories (Figure 7), each showing stable keyword clusters. Representative themes include the following. Infrastructure: streets, street facilities, signs, pavement, concrete barriers, and traffic signals, reflecting road conditions, sign clarity, and barrier mitigation. Buildings: building density, facades, building diversity, and residential blocks, representing building types and facade activity. Transportation: vehicles, vehicular traffic, and luxury vehicles, depicting traffic flow intensity and characteristics. Public Space and Environment: public realm, outdoor seating, trees, and canopy, corresponding to reparability and greenery. Cleanliness, Vitality, and Safety: maintenance, cleanliness, activity, street lighting, and perceived safety, capturing order maintenance, activity levels, and perceived security.
Combining these semantic themes with the four renewal tiers, we observed that infrastructure and building themes were dominant (Figure 8). Specifically, in Comprehensive Upgrading (Tier I), infrastructure and building themes accounted for the largest proportion, over 50% of the total, indicating that these projects emphasize overall hardware renovation. Targeted Intervention (Tier II) similarly focused on infrastructure and buildings but incorporated more environmental and public space considerations, reflecting more targeted improvement goals. Incremental Renewal (Tier III) showed a more balanced distribution, with relatively even proportions across themes, reflecting small-scale, gradual renewal and diverse project objectives. In Routine Maintenance (Tier IV), although infrastructure still occupied a large proportion, the overall scale was smallest, focusing mainly on basic functional maintenance, with smaller proportions of other themes such as safety and livability. Overall, Comprehensive Upgrading and Targeted Intervention significantly outnumbered the other two categories, reflecting the importance of large-scale renewal in this study [32].
Next, we extracted the 20 most common keywords across the six perceptual dimensions and plotted their frequencies by renewal tier (Figure 9). The data show "street scale," "vehicle," and "traffic volume" as the most frequent keywords, with "street scale" appearing nearly 5000 times in Targeted Intervention areas. Most keywords appeared across all four tiers but with notable frequency variations. Generally, Targeted Intervention and Comprehensive Upgrading showed higher frequencies across most keywords than Incremental Renewal and Routine Maintenance. Notably, terms directly related to street space, such as "sidewalk," "pedestrian," and "public space," maintained high frequencies across tiers, reflecting consistent attention to walkability and public space quality in urban renewal. While aesthetics-related terms like "streetscape" and "public art" showed relatively lower frequencies, their stable presence across all tiers indicates consideration for environmental aesthetics and cultural value in urban renewal processes.

3.5. Comparative Analysis of Representative Sites

To demonstrate how the model’s evaluations and renewal suggestions apply to concrete urban scenes, six representative sites (P1–P6) were selected from the study area for in-depth analysis (Figure 10). The boundaries of these sites were defined by land-use categories. The cases covered different functional types and perceptual score ranges.
Figure 11 contrasts six representative areas (P1–P6), arranged in low/high pairs across school, commercial, and residential areas. All areas emphasize infrastructure, with architecture-themed terms also prominent (especially in P4 and P5), while mobility ranks third in each zone. Key descriptive terms highlight each area’s character: P1 shows potholes, poor streetlight poles and sparse retail, reflecting serious deficits; P2 has active storefronts, shady canopy trees, visible social life and well-marked crossings, indicating high liveliness and order; P3 includes temporary enclosure barriers, weathered facades, scarce signage and little ground-floor activation, implying low commercial vitality; P4 exhibits refreshed facades, display windows, continuous signage and diverse tenants, denoting strong commercial frontage; P5 is defined by broken fences, insufficient streetlight poles, and bare soil, pointing to safety and aesthetic issues; and P6 features visible security gates, aligned street trees and active storefronts, indicating a well-maintained public realm. Accordingly, the three low-score areas call for different upgrade strategies: P1 needs Comprehensive Upgrading (systematic repairs to paving, lighting, parking and active street life); P3 suits Incremental Renewal (removal of fences, addition of commercial shops, façade beautification, and update of signage and wayfinding system); and P5 requires Targeted Intervention (enhanced fencing, lighting and greening of vacant lots). By contrast, P2, P4 and P6 are Routine Maintenance cases with strong performance (well-kept storefronts, landscaping and amenities), so they merit only low-intensity upkeep.
These six case comparisons demonstrate the effectiveness of the VLM-based evaluation. The model distinguishes subtle visual cues in street-view imagery and produces scores and suggestions consistent with on-the-ground perceptions. Such micro-scale examples give planners tangible, interpretable references. Through these paired case studies, we confirm that the problems identified by the model align closely with real urban conditions, informing appropriate renewal strategies at both the diagnostic and design levels.

4. Discussion

4.1. Consistency Between VLMs and Human Perception

This study demonstrates that VLMs exhibit moderate to high consistency with human raters in evaluating urban street-scene perceptions, providing an empirical foundation for employing AI in large-scale urban environmental audits. Specifically, ChatGPT-5 achieved the highest agreement with human benchmarks on the liveliness and beauty dimensions (R2 of 0.517 and 0.374, respectively), indicating its capacity to capture complex perceptual attributes related to social interaction and visual appeal. This alignment extends beyond statistical correlation to the underlying reasoning: the model frequently cited the same environmental cues as human raters when justifying its judgments. For instance, where human evaluators gave high liveliness scores for “continuous shopfronts” and “dense tree canopies,” the model’s textual explanations likewise emphasized keywords such as “active storefronts” and “shady canopy trees.” From a cognitive science perspective, this consistency likely stems from VLMs’ pre-training on vast datasets of human-annotated image–text pairs, which aligns their internal representations with human cognitive schemata [33].
Notably, the model’s performance varied significantly across perceptual dimensions. The relatively low explained variance for Wealthiness (R2 = 0.221) highlights a current technological limitation. This discrepancy may reflect the inherent subjectivity gradient among dimensions [34]: Safety and Liveliness are linked more directly to observable physical elements (e.g., lighting, pedestrians), whereas Wealthiness judgments involve more complex socio-cultural cues and aesthetic preferences that the model may not yet fully internalize [35]. Nevertheless, the model demonstrated statistically significant predictive power across all six dimensions, with a mean error of less than 1 point on a 10-point scale, confirming its feasibility as a screening tool for urban perception, particularly in preliminary planning stages that require rapid problem identification.
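The agreement metrics reported here (R2 and RMSE, cf. Table 2) can be computed directly from paired human and model scores. The sketch below uses the standard 1 − SS_res/SS_tot definition of R2; the five paired scores are illustrative values on a 10-point scale, not the study’s survey data.

```python
import math

def r2_rmse(human, model):
    """R2 (1 - SS_res/SS_tot) and RMSE between paired score lists."""
    n = len(human)
    mean_h = sum(human) / n
    ss_res = sum((h - m) ** 2 for h, m in zip(human, model))
    ss_tot = sum((h - mean_h) ** 2 for h in human)
    return 1 - ss_res / ss_tot, math.sqrt(ss_res / n)

# Illustrative 10-point scores for one dimension (not the study's data)
human = [6.0, 7.5, 5.0, 8.0, 6.5]
model = [6.4, 7.1, 5.5, 7.6, 6.8]
r2, rmse = r2_rmse(human, model)
print(f"R2 = {r2:.3f}, RMSE = {rmse:.3f}")  # per-image errors stay well under 1 point
```

Applied per dimension over the 800 validation images, this yields one (R2, RMSE) pair per perceptual dimension, as summarized in Table 2.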

4.2. The Value of VLM Interpretability for Urban Renewal

The textual rationales generated by VLMs transform numerical scores into actionable planning language, providing crucial traceability that is often absent in traditional “black-box” computer vision approaches. While our analysis confirms that VLM outputs effectively identify micro-scale maintenance issues (e.g., “potholes,” “insufficient lighting”), as indicated by the dominance of infrastructure-related terms in low-scoring areas, we acknowledge that these initial suggestions tend to be granular and operationally focused.
However, the true value of VLM interpretability lies not merely in cataloging defects, but in leveraging its semantic outputs as a diagnostic springboard for strategic, multi-scale interventions. The distinct semantic stratification we observed—from foundational infrastructure concerns in low-tier areas to maintenance and refinement themes in high-tier areas—provides a logical framework for prioritizing interventions. This stratification aligns with urban renewal theory, which posits that addressing structural deficiencies must precede functional enhancement and aesthetic refinement [36,37]. For instance, a VLM’s identification of widespread “broken fences” and “discontinuous sidewalks” in a residential district (P5) should not be interpreted solely as a call for piecemeal repairs. Instead, it serves as diagnostic evidence pointing toward deeper, systemic issues—such as inadequate public-space definition, poor connectivity, or weak territorial demarcation—that require structural solutions like block reconfiguration, greenway integration, or the creation of secured, semi-public courtyards.
Furthermore, VLM outputs can catalyze thinking beyond street-level fixes by revealing patterns that necessitate functional reconstruction. The consistent linkage between “inactive storefronts,” “monotonous facades,” and low scores for Liveliness and Wealthiness in commercial areas (P3) underscores more than a need for facade beautification. It signals potential failures in land-use mix, economic vitality, or urban design at the block or district scale. This should prompt planners to consider strategic interventions such as revising ground-floor use regulations, introducing incentives for small businesses, or redesigning public spaces to stimulate street-level activity—moves that reconstitute the area’s fundamental economic and social function.
In this light, the VLM’s “element-perception-strategy” reasoning chain should be viewed as the first, not the final, step in the planning process. Its strength is in efficiently diagnosing symptomatic urban ailments across vast areas. The planner’s role is to critically interpret these diagnostics, translating localized cues (e.g., “lack of pedestrians”) into an understanding of broader urban dynamics (e.g., land-use segregation, inadequate public transport access) and formulating corresponding structural strategies (e.g., transit-oriented development, functional re-zoning). Thus, when combined with planners’ expertise and supplementary data (e.g., land use maps, demographic trends), VLM-generated text evolves from providing “rough suggestions” to forming a robust, evidence-based foundation for strategic urban regeneration that operates at both the micro (street) and macro (district) scales [38].

4.3. Theoretical Implications and Practical Applications

Theoretically, this study introduces VLMs into the urban perception research paradigm, promoting a methodological shift from “element-driven” to “semantics-driven” analysis. Traditional research often follows an “element extraction-statistical correlation” pathway, which, while capable of identifying significant physical predictors, struggles to explain how these features interact complexly to form an overall environmental impression. By leveraging their powerful multimodal understanding, VLMs achieve a holistic interpretation of streetscapes that aligns more closely with the cognitive integrity emphasized in Kevin Lynch’s Image of the City theory [5]. Simultaneously, the language explanations generated by the models provide a novel data source for validating environmental psychology theories. For example, the strong association between “well-maintained” and safety in our findings supports the “Broken Windows Theory,” while the tight link between “active storefronts” and liveliness echoes Jane Jacobs’ “eyes on the street” concept [36,39,40].
On a practical level, the four-tier renewal zoning framework developed in this study translates abstract perceptual data into concrete spatial guidance, enabling a seamless transition from diagnosis to intervention. The Comprehensive Upgrading tier (I) focuses on foundational infrastructure upgrades and order restoration; the Targeted Intervention tier (II) emphasizes functional activation and interface optimization; the Incremental Renewal tier (III) targets quality enhancement; and the Routine Maintenance tier (IV) ensures the preservation of existing quality. This tiered strategy adheres to the logic of resource-constrained renewal, avoiding the resource misallocation common in one-size-fits-all approaches. More importantly, the methodology demonstrates significant scalability and adaptability, as any city with street-view imagery can rapidly deploy this framework, and its local responsiveness can be further enhanced through fine-tuning. For Chinese cities currently in the stock optimization phase of development, this data-driven, perception-oriented planning method offers a technical pathway towards “precision renewal” and “intelligent governance”.
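The mapping from perceptual scores to the four renewal tiers can be illustrated with a simple threshold rule. Note that the study derives its tiers from local spatial autocorrelation clusters rather than raw cut-offs; the quantile thresholds and area scores below are hypothetical, chosen only to show how a composite score translates into a tier label.

```python
# Toy threshold rule mapping a composite perception score to a renewal
# tier. The study's actual tiers come from spatial clustering; the
# quantile cut-offs (q25/q50/q75) and area scores here are hypothetical.
def renewal_tier(score, q25=4.0, q50=5.0, q75=6.5):
    if score < q25:
        return "I: Comprehensive Upgrading"
    if score < q50:
        return "II: Targeted Intervention"
    if score < q75:
        return "III: Incremental Renewal"
    return "IV: Routine Maintenance"

scores = {"P1": 3.1, "P5": 4.4, "P3": 5.2, "P6": 8.0}
tiers = {area: renewal_tier(s) for area, s in scores.items()}
print(tiers)
# → P1 falls in tier I, P5 in tier II, P3 in tier III, P6 in tier IV
```

Any city with comparable street-view coverage could substitute its own score distribution and locally calibrated thresholds (or cluster labels) into the same tiering step.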

4.4. Limitations

This study has several limitations. First, although VLMs demonstrate high consistency with human assessments, their judgments may still be influenced by biases inherent in their pre-training data, particularly when capturing nuanced perceptual differences within specific cultural contexts. Second, the online perceptual survey is inevitably subject to sampling bias; for instance, the overrepresentation of highly educated participants in our sample may amplify biases in the results. Third, the current methodology relies primarily on static street-view imagery and does not incorporate temporal dimensions (e.g., diurnal or seasonal variations) or socio-economic contextual data, limiting how comprehensively it represents dynamic urban phenomena [41]. Furthermore, while the renewal suggestions generated by the model are logically coherent, they have not yet been validated for local feasibility, and this renewal approach may overlook unique urban characteristics such as climate and cultural context. Lastly, we acknowledge that, owing to linguistic differences, VLMs trained largely on English-language datasets may struggle to accurately interpret urban renewal priorities within diverse cultural contexts. Future research should therefore focus on improving the generalizability of the proposed framework.

5. Conclusions

This study integrates VLMs with large-scale street-view data and develops an end-to-end diagnostic framework that links six-dimensional perception scoring, spatial statistics, four-tier renewal zoning, text-based semantic mining, and area-level evidence comparisons. The framework was validated on 4215 street-view samples from Urumqi’s Hongshan Central District. Our findings reveal that (1) compared to human benchmarks, ChatGPT-5 achieves relatively high agreement across multiple dimensions with controllable overall error; (2) low-scoring areas exhibit aging and disordered infrastructure, while high-scoring areas align with major corridors and improved road segments; and (3) VLM-generated reasoning text enhances result interpretability by highlighting renewal priorities for different spatial typologies. This study provides a large-scale, interpretable VLM-based approach for urban renewal. Future work will incorporate multi-source urban data to support lifecycle-oriented renewal planning.

Author Contributions

Conceptualization, Y.Y.; methodology, Y.Y.; software, Y.Y.; validation, Y.Y.; formal analysis, Y.Y.; investigation, Y.Y.; resources, F.L.; data curation, Y.Y.; writing—original draft preparation, Y.Y.; writing—review and editing, Y.Y., F.L. and G.D.; visualization, Y.Y.; supervision, F.L. and G.D.; project administration, F.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Acknowledgments

The author would like to thank Feidong Lu for his guidance on the overall research design and critical feedback on earlier drafts, and Giuliano Dall’Ò for insightful discussions on urban environmental quality. The author also gratefully acknowledges Haoran Ma and Yuankai Wang for their suggestions on spatial statistics and visualization. During the preparation of this manuscript, the author used a generative AI-based assistant for language polishing and consistency checking. After using this tool, the author reviewed and edited the content and takes full responsibility for the final version of the manuscript.

Conflicts of Interest

Author Feidong Lu was employed by the company Tongji Architectural Design (Group) Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
VLM(s)    Vision–Language Model(s)
LLM(s)    Large Language Model(s)
SVI(s)    Street-View Image(s)
SHAP      SHapley Additive exPlanations
R2        Coefficient of Determination
RMSE      Root Mean Squared Error
MSE       Mean Squared Error
AOI(s)    Area(s) of Interest

References

1. Deng, Y.; Tang, Z.; Liu, B.; Shi, Y.; Deng, M.; Liu, E. Renovation and Reconstruction of Urban Land Use by a Cost-Heuristic Genetic Algorithm: A Case in Shenzhen. ISPRS Int. J. Geo-Inf. 2024, 13, 250.
2. Sun, Y.; Li, X. Coupling Coordination Analysis and Factors of “Urban Renewal-Ecological Resilience-Water Resilience” in Arid Zones. Front. Environ. Sci. 2025, 13, 1615419.
3. Song, R.; Hu, Y.; Li, M. Chinese Pattern of Urban Development Quality Assessment: A Perspective Based on National Territory Spatial Planning Initiatives. Land 2021, 10, 773.
4. Jin, R.; Huang, C.; Wang, P.; Ma, J.; Wan, Y. Identification of Inefficient Urban Land for Urban Regeneration Considering Land Use Differentiation. Land 2023, 12, 1957.
5. Lynch, K. The Image of the City; Publication of the Joint Center for Urban Studies; 33rd printing; M.I.T. Press: Cambridge, MA, USA, 2008; ISBN 978-0-262-62001-7.
6. Jacobs, J. The Death and Life of Great American Cities; Vintage Books ed.; Vintage Books: New York, NY, USA, 1992.
7. Kelling, G.L.; Wilson, J.Q. Broken Windows: The Police and Neighborhood Safety. Atl. Mon. 1982, 249, 29–38.
8. Xu, Z.; Marini, S.; Mauro, M.; Maietta Latessa, P.; Grigoletto, A.; Toselli, S. Associations Between Urban Green Space Quality and Mental Wellbeing: Systematic Review. Land 2025, 14, 381.
9. Lu, X.; Li, Q.; Ji, X.; Sun, D.; Meng, Y.; Yu, Y.; Lyu, M. Impact of Streetscape Built Environment Characteristics on Human Perceptions Using Street View Imagery and Deep Learning: A Case Study of Changbai Island, Shenyang. Buildings 2025, 15, 1524.
10. Tang, F.; Zeng, P.; Wang, L.; Zhang, L.; Xu, W. Urban Perception Evaluation and Street Refinement Governance Supported by Street View Visual Elements Analysis. Remote Sens. 2024, 16, 3661.
11. Ewing, R.; Handy, S. Measuring the Unmeasurable: Urban Design Qualities Related to Walkability. J. Urban Des. 2009, 14, 65–84.
12. Mehta, V. Lively Streets: Determining Environmental Characteristics to Support Social Behavior. J. Plan. Educ. Res. 2007, 27, 165–187.
13. Ogawa, Y.; Oki, T.; Zhao, C.; Sekimoto, Y.; Shimizu, C. Evaluating the Subjective Perceptions of Streetscapes Using Street-View Images. Landsc. Urban Plan. 2024, 247, 105073.
14. Yao, Y.; Liang, Z.; Yuan, Z.; Liu, P.; Bie, Y.; Zhang, J.; Wang, R.; Wang, J.; Guan, Q. A Human-Machine Adversarial Scoring Framework for Urban Perception Assessment Using Street-View Images. Int. J. Geogr. Inf. Sci. 2019, 33, 2363–2384.
15. Biljecki, F.; Ito, K. Street View Imagery in Urban Analytics and GIS: A Review. Landsc. Urban Plan. 2021, 215, 104217.
16. Salesses, P.; Schechtner, K.; Hidalgo, C.A. The Collaborative Image of The City: Mapping the Inequality of Urban Perception. PLoS ONE 2013, 8, e68400.
17. Dubey, A.; Naik, N.; Parikh, D.; Raskar, R.; Hidalgo, C.A. Deep Learning the City: Quantifying Urban Perception at a Global Scale. In Computer Vision—ECCV 2016; Springer: Cham, Switzerland, 2016.
18. Zhang, C.; Wu, T.; Zhang, Y.; Zhao, B.; Wang, T.; Cui, C.; Yin, Y. Deep Semantic-Aware Network for Zero-Shot Visual Urban Perception. Int. J. Mach. Learn. Cyber. 2022, 13, 1197–1211.
19. Liu, X.; Haworth, J.; Wang, M. A New Approach to Assessing Perceived Walkability: Combining Street View Imagery with Multimodal Contrastive Learning Model. In Proceedings of the 2nd ACM SIGSPATIAL International Workshop on Spatial Big Data and AI for Industrial Applications; ACM: New York, NY, USA, 2023; pp. 16–21.
20. Zhao, X.; Lu, Y.; Lin, G. An Integrated Deep Learning Approach for Assessing the Visual Qualities of Built Environments Utilizing Street View Images. Eng. Appl. Artif. Intell. 2024, 130, 107805.
21. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
22. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
23. Kang, Y.; Zhang, F.; Gao, S.; Lin, H.; Liu, Y. A Review of Urban Physical Environment Sensing Using Street View Imagery in Public Health Studies. Ann. GIS 2020, 26, 261–275.
24. Lundberg, S.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
25. Moreno-Vera, F.; Poco, J. Assessing Urban Environments with Vision-Language Models: A Comparative Analysis of AI-Generated Ratings and Human Volunteer Evaluations. In Proceedings of the 2025 International Joint Conference on Neural Networks (IJCNN), Rome, Italy, 30 June–5 July 2025.
26. Yu, M.; Chen, X.; Zheng, X.; Cui, W.; Ji, Q.; Xing, H. Evaluation of Spatial Visual Perception of Streets Based on Deep Learning and Spatial Syntax. Sci. Rep. 2025, 15, 18439.
27. Bai, S.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Song, S.; Dang, K.; Wang, P.; Wang, S.; Tang, J.; et al. Qwen2.5-VL Technical Report. arXiv 2025.
28. Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; et al. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv 2024.
29. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual Instruction Tuning. In Proceedings of the Advances in Neural Information Processing Systems 36 (NeurIPS 2023), New Orleans, LA, USA, 10–16 December 2023.
30. GLM-V Team; Hong, W.; Yu, W.; Gu, X.; Wang, G.; Gan, G.; Tang, H.; Cheng, J.; Qi, J.; Ji, J.; et al. GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning. arXiv 2025.
31. Brewer, C.A.; Pickle, L. Evaluation of Methods for Classifying Epidemiological Data on Choropleth Maps in Series. Ann. Assoc. Am. Geogr. 2002, 92, 662–681.
32. UN-Habitat (Ed.) The Value of Sustainable Urbanization; World Cities Report; UN-Habitat: Nairobi, Kenya, 2020.
33. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021.
34. Wichmann, F.A.; Geirhos, R. Are Deep Neural Networks Adequate Behavioral Models of Human Visual Perception? Annu. Rev. Vis. Sci. 2023, 9, 501–524.
35. Mushkani, R. Do Vision-Language Models See Urban Scenes as People Do? An Urban Perception Benchmark. arXiv 2025.
36. Chen, H.; Ge, J.; He, W. Quantifying Urban Vitality in Guangzhou Through Multi-Source Data: A Comprehensive Analysis of Land Use Change, Streetscape Elements, POI Distribution, and Smartphone-GPS Data. Land 2025, 14, 1309.
37. Yu, X.; Ma, J.; Tang, Y.; Yang, T.; Jiang, F. Can We Trust Our Eyes? Interpreting the Misperception of Road Safety from Street View Images and Deep Learning. Accid. Anal. Prev. 2024, 197, 107455.
38. Gebru, T.; Krause, J.; Wang, Y.; Chen, D.; Deng, J.; Aiden, E.L.; Fei-Fei, L. Using Deep Learning and Google Street View to Estimate the Demographic Makeup of the US. Proc. Natl. Acad. Sci. USA 2017, 114, 13108–13113.
39. Torneiro, A.; Monteiro, D.; Novais, P.; Henriques, P.R.; Rodrigues, N.F. Towards General Urban Monitoring with Vision-Language Models: A Review, Evaluation, and a Research Agenda. arXiv 2025.
40. Yin, J.; Chen, R.; Zhang, R.; Li, X.; Fang, Y. The Scale Effect of Street View Images and Urban Vitality Is Consistent with a Gaussian Function Distribution. Land 2025, 14, 415.
41. Liang, X.; Zhao, T.; Biljecki, F. Revealing Spatio-Temporal Evolution of Urban Visual Environments with Street View Imagery. Landsc. Urban Plan. 2023, 237, 104802.
Figure 1. Study areas. Basemap from OpenStreetMap (https://www.openstreetmap.org).
Figure 2. Pipeline of VLMs-based urban renewal framework.
Figure 3. Pearson correlation heatmap between model predictions and human ratings across the six perception dimensions (n = 800).
Figure 4. Spatial mapping of six perception dimensions. Basemap from OpenStreetMap (https://www.openstreetmap.org).
Figure 5. Local spatial autocorrelation cluster maps for six perceptual dimensions. Basemap from OpenStreetMap (https://www.openstreetmap.org).
Figure 6. Spatial distribution of the four renewal tiers in the Hongshan Central District. Basemap from OpenStreetMap (https://www.openstreetmap.org).
Figure 7. Overall topical frequency distribution of model-generated descriptions.
Figure 8. Distribution of eight semantic themes across four renewal categories.
Figure 9. Global scatter of the top 20 keywords across four renewal categories.
Figure 10. Locations of six representative areas. (1) Low-scoring school area (P1), (2) high-scoring school area (P2), (3) low-scoring commercial street (P3), (4) high-scoring commercial street (P4), (5) low-scoring residential block (P5), (6) high-scoring residential block (P6). Basemap from OpenStreetMap (https://www.openstreetmap.org).
Figure 11. Representative areas (P1–P6): panoramas, proportion of eight semantic themes, six-dimensional scores, and top terms for each dimension.
Table 1. Demographic Characteristics of Survey Participants (n = 500).
Characteristic        Category            n      Percentage (%)
Age Group             18–30 years         198    39.6
                      31–45 years         186    37.2
                      46–60 years         116    23.2
Gender                Male                247    49.4
                      Female              253    50.6
Residential Status    Locals              342    68.4
                      Non-locals          158    31.6
Education Level       High school          89    17.8
                      Undergraduate       276    55.2
                      Graduate            135    27.0
Occupation            Students            156    31.2
                      Office workers      189    37.8
                      Service industry     78    15.6
                      Others               77    15.4
Table 2. Accuracy of ChatGPT-5 relative to human ratings across the six perception dimensions.
Perception Dimension    R2       RMSE     MSE
Safety                  0.319    0.544    0.295
Liveliness              0.517    0.654    0.427
Beauty                  0.374    0.515    0.266
Wealthiness             0.221    0.612    0.375
Depressiveness          0.273    0.705    0.497
Boringness              0.291    0.583    0.340
Overall Average         0.332    0.602    0.367
Table 3. Global Moran’s I Results for Six Perceptual Dimensions.
Perception Dimension    Moran’s I    Z-Score    p-Value
Safety                  0.441        25.216     <0.001
Liveliness              0.490        27.942     <0.001
Beauty                  0.453        25.868     <0.001
Wealthiness             0.463        26.420     <0.001
Depressiveness          0.563        32.115     <0.001
Boringness              0.478        27.269     <0.001

Share and Cite

Yao, Y.; Dall’Ò, G.; Lu, F. Urban Street-Scene Perception and Renewal Strategies Powered by Vision–Language Models. Land 2026, 15, 244. https://doi.org/10.3390/land15020244
