A Scalable Framework for Street Interface Morphology Assessment via Automated Multimodal Large Language Model Agents
Abstract
1. Introduction
1.1. Street Interface Morphological Evaluation
1.2. Street-View Imagery and Computer Vision Approaches
1.3. Multimodal Large Language Models and New Possibilities
1.4. Research Scope and Questions
2. Materials and Methods
2.1. Analytical Framework
1. Loop A (Objective metrics): This loop iteratively refines the agent’s geometric reasoning capability, for example by correcting H/D ratio estimation through comparison with field-measured spatial data.
2. Loop B (Subjective metrics): This loop aligns the agent’s semantic scoring logic with the evaluations provided by a panel of expert urban designers.
3. Validation and Scalable Extension: After verifying the stability and professional reliability of the workflow, the optimized analytical core was extended into a comprehensive street-analysis agent equipped with spatial visualization capabilities.
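The calibration loops above can be thought of as an iterate-compare-refine cycle. The sketch below is illustrative only, not the authors' implementation: the function names, the RMSE error measure, and the 0.15 threshold are hypothetical placeholders.

```python
def rmse(estimates, ground_truth):
    # Root mean square error between agent estimates and field measurements
    n = len(estimates)
    return (sum((e - g) ** 2 for e, g in zip(estimates, ground_truth)) / n) ** 0.5

def calibrate(run_agent, ground_truth, refine_prompt, threshold=0.15, max_rounds=5):
    """Loop A in miniature: rerun the agent and refine the prompt until the
    error on a benchmark (e.g., H/D ratio estimates) falls below a threshold."""
    error = float("inf")
    for _ in range(max_rounds):
        estimates = run_agent()
        error = rmse(estimates, ground_truth)
        if error <= threshold:
            break
        refine_prompt(error)  # e.g., add stricter geometric reasoning rules
    return error
```

Loop B would follow the same shape, with expert-panel scores as `ground_truth` and an agreement statistic in place of RMSE.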
2.2. Study Area and Data Source
- Historical Preservation Zone (Hengfu Historic Area). This zone includes Yueyang Road, Wukang Road, and Wuxing Road. These streets are characterized by narrow widths, strong spatial enclosure, and complex interactions between historic buildings and dense vegetation. Such characteristics make the identification of street interfaces difficult when relying solely on satellite imagery, highlighting the importance of street-level observations.
- Modern Commercial Zone (Middle Huaihai Road Commercial District). This zone includes Middle Huaihai Road, Madang Road, and South Huangpi Road. Streets in this district are characterized by wide avenues, high-rise commercial towers with significant building setbacks, and extensive glass curtain-wall façades, creating a spatial scale markedly different from that of the historical area.
- General Residential Zone (Yangpu Workers’ Village Area). This zone includes Tieling Road, Jinxi Road, and Xuchang Road. The street interfaces here exhibit relatively repetitive morphological patterns, with diverse boundary conditions ranging from gated residential walls to active ground-floor commercial frontage.
2.3. Multi-Dimensional Morphological Evaluation Model
2.3.1. The Generic Street Interface Model
1. Effective calculation segment: To balance the physical environment and human perception associated with each sampling point, calculations are not based on the entire street length but are restricted to the visual interface segment. This rule also helps reconcile two different measurement logics: the traditional D/H ratio is based on a cross-sectional view of the street, whereas human perception is shaped by a continuous street façade. The segment covers the interface extending approximately 25 m in front of and behind the sampling point; only interfaces within this visually effective range are included in the calculation, while distant or visually obscured elements are excluded. The segment length is defined as the total length of this sampling segment along the street, normally 50 m (twice 25 m).
2. Definition of valid interface: A vertical element is counted as a valid interface if it provides a clear and tangible visual boundary visible from the street view. This validity is further constrained by spatial proximity: elements positioned beyond the effective visual threshold, whose capacity to enclose the street diminishes due to excessive setback from the lot edge (over 30 m), are excluded from the calculation. Considering the Chinese urban context, valid interfaces include not only building façades but also boundary structures (both solid masonry and permeable walls) and structural vegetation (specifically dense, continuous hedges that act as visual screens) exceeding 1.5 m in height; individual street trees and sparse landscaping are excluded. The effective wall length is defined as the sum of the lengths of these valid continuous elements within the segment.
3. Handling spatial recesses: We distinguish between valid interfaces and spatial gaps based on scale. If a recess has a depth proportionate to the street scale and maintains a sense of enclosure, it is treated as a valid interface. Conversely, if the recess is too deep relative to the street scale, creating a perceived void, it is classified as a gap (discontinuity).
4. Multi-layered interfaces: The model identifies up to three potential depth layers per side. Unlike previous studies that measure setbacks from the street centerline, this study defines Street Interface Depth (SID) as the perpendicular distance from the curb line (the edge of the vehicle lane) to the dominant vertical face. This definition is easier for the agent to apply and better reflects the pedestrian’s experience on the sidewalk. If no valid interface is detected on a specific side, the corresponding SID value is set to the sidewalk width. For the Maximum Interface Height on a single side used in enclosure calculations, the maximum height among all identified layers is adopted. Notably, the effective street width (D) used for enclosure calculation is defined as the sum of the road width (W) and the setback distances (SID) of the highest interface layers detected on both sides:

$D = W + SID_{left} + SID_{right}$

5. Layer merging: To address minor architectural articulations, adjacent layers are merged when their depth difference is less than 2.0 m (for example, between the ground floor and the second floor). In such cases, the layers are treated as a single unified interface layer for morphological analysis.
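The layer-merging rule in item 5 is mechanical enough to sketch directly. Two details below are our assumptions, since the text does not specify them: a merged group is represented by the depth of its frontmost layer, and the comparison chains (each depth is compared with the last depth absorbed into the current group).

```python
MERGE_THRESHOLD_M = 2.0  # merge adjacent layers whose depth difference is < 2.0 m

def merge_layers(depths_m):
    """Collapse a side's detected depth layers (setbacks in meters) into
    unified interface layers per the 2.0 m rule; up to three layers are
    expected per side per the multi-layer model."""
    groups = []
    for d in sorted(depths_m):
        # Chain-merge assumption: compare with the last depth in the group
        if groups and d - groups[-1][-1] < MERGE_THRESHOLD_M:
            groups[-1].append(d)
        else:
            groups.append([d])
    # Frontmost-depth assumption: each group is reported by its nearest face
    return [g[0] for g in groups]
```

For example, a ground-floor podium at 0 m and an articulation at 1.5 m merge into one layer, while a tower face set back 6 m remains a separate layer.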
2.3.2. Evaluation Metrics and Calculation Methods
1. Enclosure (overall, or double-sided):
2. Continuity (single-sided):
3. Transparency (single-sided):
4. Roughness (single-sided):
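The metric formulas themselves did not survive extraction, so the sketch below should be read as a hedged reconstruction from the definitions in Section 2.3.1, not the paper's exact equations: continuity follows from the effective wall length and segment length defined there, while the enclosure formulation (mean maximum interface height over effective width) is an assumption; transparency and roughness are omitted because their definitions are not reproduced here.

```python
def continuity(valid_wall_lengths_m, segment_length_m=50.0):
    # Single-sided continuity: effective wall length over segment length,
    # capped at 1.0 (segment normally 50 m per the 25 m front/back rule)
    return min(sum(valid_wall_lengths_m) / segment_length_m, 1.0)

def enclosure(h_left_m, h_right_m, road_width_m, sid_left_m, sid_right_m):
    # Effective street width D = W + SID_left + SID_right (Section 2.3.1)
    d = road_width_m + sid_left_m + sid_right_m
    # Assumed formulation: mean of both sides' maximum interface heights
    # over the effective width
    return (h_left_m + h_right_m) / (2 * d)
```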
2.4. Agent Construction and Implementation Mechanism
2.4.1. Implementation Framework
- Interactive Visualization Module: Serving as the primary user interface for researchers (Figure 4), this module integrates two functional panels that support interactive analysis and validation. The Analysis Center (left panel) facilitates task configuration, such as setting the sampling interval (e.g., 50 m) and defining the start and end points of the sampling path. Centrally, the system integrates a dynamic GIS mapping engine powered by the Baidu Map API, which precisely aligns sampling coordinates with the urban road network and visualizes the morphological analysis results as vector graphics on the map. The lower section of this panel integrates a visualization parameter control module, allowing users to dynamically toggle between display modes of the sampling path and the result graphics so that the GIS mapping output aligns with specific analytical needs. Additionally, a Heads-Up Display (HUD) overlay provides real-time monitoring of processes including sampling planning, image capturing, and image analysis. The Result Archive (right panel) is dedicated to quality control: it displays aggregated statistical summaries, including all objective and subjective metric values across the measurement dimensions. Crucially, the intermediate outputs, from captured images to extracted fundamental parameters and natural-language descriptions, can be explicitly rendered and logged, making the entire reasoning chain fully traceable. Each captured image can be downloaded, and each point’s result links directly to its source sampling point on the map in the left panel, enabling researchers to cross-reference AI-analyzed results with actual street views and verify the reliability of the automated assessment.
- Automated Execution Engine: The system’s core logic is governed by an automated engine that manages the ReAct (Reasoning + Acting) process, enabling complex logical reasoning and controllable execution. To maintain stability during street analysis at larger scales (e.g., processing hundreds of points), the engine implements a rigorous Finite State Machine (FSM). It enforces a sequential lifecycle for each sampling point (Initializing, Capturing, Analyzing, Cooling) while managing memory allocation and API rate limits. Simultaneously, the engine serves as the interface between the system and the Gemini model, packaging visual data into structured prompts and parsing the output into standardized JSON formats. After optimization, the processing time for each sampling point is significantly reduced, enabling efficient analysis of large-scale street networks.
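The per-point lifecycle enforced by the FSM can be sketched as a simple transition table; state names follow the text, while everything else (the `DONE` terminal state, function names) is illustrative.

```python
from enum import Enum, auto

class State(Enum):
    INITIALIZING = auto()
    CAPTURING = auto()
    ANALYZING = auto()
    COOLING = auto()
    DONE = auto()  # terminal marker added for this sketch

# Legal transitions of the sequential per-point lifecycle
TRANSITIONS = {
    State.INITIALIZING: State.CAPTURING,
    State.CAPTURING: State.ANALYZING,
    State.ANALYZING: State.COOLING,
    State.COOLING: State.DONE,
}

def advance(state):
    # Reject any out-of-order transition, keeping execution controllable
    if state not in TRANSITIONS:
        raise ValueError(f"no transition from {state}")
    return TRANSITIONS[state]
```

In the real engine, each state would additionally trigger side effects (image capture, model calls, memory cleanup) and respect API rate limits before advancing.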
2.4.2. Operational Workflow
- Step 1 (sampling planning): The agent first generates the shortest walking path based on the user-defined configurations, such as start and end points, and places sampling points at fixed intervals (e.g., 50 m). For each point, it then matches the nearest valid street-view scene available on Baidu Map within a predefined search radius, ensuring that only locations with available street-view imagery are processed.
- Step 2 (image preparation): For each valid sampling point, the corresponding street-view scene from Baidu Map is loaded, and a standardized image set is generated from the scene for downstream analysis. These images are then aggregated into a unified visual matrix that serves as the input for the subsequent reasoning phase.
- Step 3 (visual analysis): The visual image matrix is packaged and transmitted to the MLLM (Gemini). Guided by the metric definitions, evaluation rules, formulas, and output requirements embedded in the prompt, the agent jointly reasons over the images and directly generates structured outputs, including objective indicator estimates, subjective perceptual scores, and supporting natural-language descriptions. For objective metrics, the MLLM estimates the quantitative indicators under formula-guided geometric reasoning. For subjective metrics, the MLLM uses prompt-defined visual cues to infer the corresponding subjective scores.
- Step 4 (result parsing and structuring): The execution engine parses the model responses and converts them into a standardized JSON structure. The resulting records are then organized according to the four morphological dimensions and prepared for downstream storage and export.
- Step 5 (spatial visualization): Finally, the structured results are transmitted to the GIS engine for real-time spatial visualization. The outputs are simultaneously logged in the results panel and exported as CSV files. Before processing the next sampling node, the system performs a brief memory reset and resource cleanup to maintain stable execution during large-scale analysis.
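The parsing stage in Step 4 can be sketched as below. The field names (`objective`, `subjective_score`, `description`) are hypothetical placeholders for whatever schema the prompt actually requests; only the four-dimension organization comes from the text.

```python
import json

DIMENSIONS = ("enclosure", "continuity", "transparency", "roughness")

def parse_response(raw_text):
    """Convert a raw model response into a standardized per-point record,
    organized by the four morphological dimensions. Missing dimensions are
    kept as NaN/empty placeholders so downstream export stays uniform."""
    data = json.loads(raw_text)
    record = {}
    for dim in DIMENSIONS:
        entry = data.get(dim, {})
        record[dim] = {
            "objective": float(entry.get("objective", float("nan"))),
            "subjective_score": entry.get("subjective_score"),
            "description": entry.get("description", ""),
        }
    return record
```

A production engine would add schema validation and retry logic for malformed responses before handing records to the GIS engine in Step 5.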

2.4.3. Visual Sampling Strategy and Joint Reasoning Mechanism
- Tier 1 (0° Elevation): 4 images (Front, Back, Left, Right) to capture street-level to mid-level spatial information, including pedestrian-scale interfaces, pavement details, lower and mid-level façades.
- Tier 2 (45° Elevation): 4 images (Front, Back, Left, Right) to capture mid-level to upper-level spatial information, including upper façades, upper setbacks, and the building skyline.
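The two-tier strategy yields eight views per sampling point (four headings at each of two elevations). A sketch of the capture-parameter matrix follows; the dictionary field names and the compass-heading convention are assumptions for illustration.

```python
def build_capture_matrix():
    """Enumerate the 8 street-view captures per point: Tier 1 at 0° pitch
    (street- to mid-level) and Tier 2 at 45° pitch (upper façades, skyline),
    each covering Front/Right/Back/Left headings."""
    headings = {"front": 0, "right": 90, "back": 180, "left": 270}
    return [
        {"direction": name, "heading": deg, "pitch": pitch}
        for pitch in (0, 45)          # Tier 1, then Tier 2
        for name, deg in headings.items()
    ]
```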

2.4.4. Agent Validation Experiment Design
1. Validation dataset and experimental split
2. Internal alignment and stability test
   (1) Spatial consistency of image sampling, by checking whether the sampled image sequences followed the intended street route and viewing directions.
   (2) Side-specific directional consistency, by checking whether features identified on each street side were mapped to the correct single-sided score outputs.
   (3) Rule compliance of the objective metric stream, by examining whether the model-generated objective estimates were consistent with the geometric definitions, spatial scope, and formula-guided reasoning rules specified for each metric.
   (4) Semantic compliance of the subjective metric stream, by examining whether prompt-defined evaluation indicators were reflected in the corresponding subjective scores.
   (5) Internal correlation between the agent-generated subjective scores and the corresponding objective metrics, assessed using Spearman’s rank correlation coefficient on the full sampled dataset (N = 36):

   $$\rho_s = \frac{\sum_{i=1}^{N}(R_i - \bar{R})(S_i - \bar{S})}{\sqrt{\sum_{i=1}^{N}(R_i - \bar{R})^2 \sum_{i=1}^{N}(S_i - \bar{S})^2}}$$

   where $R_i$ and $S_i$ represent the ranks of the agent-generated subjective score and the corresponding objective metric for the i-th sample, respectively, and $\bar{R}$ and $\bar{S}$ denote their mean ranks.
   (6) Repeated-run stability, assessed through five repeated executions on the pilot subset.
3. Construction of the objective validation benchmark
4. Construction of the subjective validation benchmark and agreement screening
5. Pilot calibration and prompt refinement
6. Independent hold-out validation with baseline and ablation-style comparisons
7. Temporal robustness test
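The rank correlation used in the internal-correlation check (item 2(5) above) can be computed directly from its definition; this sketch uses average ranks for ties, the usual convention.

```python
def ranks(values):
    # Assign 1-based ranks, averaging ranks across tied values
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the tied positions
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(subjective, objective):
    # Spearman's rho: Pearson correlation of the two rank vectors
    R, S = ranks(subjective), ranks(objective)
    n = len(R)
    rb, sb = sum(R) / n, sum(S) / n
    num = sum((R[i] - rb) * (S[i] - sb) for i in range(n))
    den = (sum((R[i] - rb) ** 2 for i in range(n))
           * sum((S[i] - sb) ** 2 for i in range(n))) ** 0.5
    return num / den
```

In practice `scipy.stats.spearmanr` provides the same statistic plus a significance test.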
2.4.5. Mixed Geometry-Perception Evaluation Experiment Design
1. Vectorized mapping and linear segment generation
2. Weighted composite evaluation
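The weighted composite in item 2 can be sketched generically as a normalized weighted sum over the four morphological dimensions. The weight values and the normalization choice here are placeholders, since the study's actual weighting scheme is not reproduced in this section.

```python
def composite_score(metrics, weights):
    """Normalized weighted sum of per-dimension metric values.

    `metrics`: dict of dimension -> value (e.g., enclosure, continuity...)
    `weights`: dict of dimension -> weight (hypothetical values)
    """
    total_w = sum(weights.get(k, 0.0) for k in metrics)
    if total_w == 0:
        raise ValueError("no positive weights for supplied metrics")
    return sum(metrics[k] * weights.get(k, 0.0) for k in metrics) / total_w
```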
3. Results
3.1. Reliability Verification of Experimental Framework
3.1.1. Verification of Internal Workflow Consistency and Stability
1. Spatial consistency of image sampling
2. Side-specific directional consistency of interpretation and scoring
3. Rule compliance of the objective metric stream
4. Semantic compliance of the subjective metric stream
5. Internal correlation logic
6. Stability checking
3.1.2. Verification of Subjective Ground Truth Reliability
3.2. Pilot Study and Prompt Calibration
3.2.1. Calibration of Objective Measurement Mechanism
3.2.2. Calibration of Subjective Measurement Mechanism
3.3. Final Validation of Optimized Agent Against Ground Truth
3.3.1. Validation of Objective and Subjective Measurements
3.3.2. Validation of Temporal Robustness
3.4. Scalable Deployment: Continuous Street Interface Morphological Analysis
3.4.1. Intra-Case Spatial Research
3.4.2. Cross-Case Spatial Research
4. Discussion and Conclusions
4.1. Principal Findings and Contribution
- Complementary assessment of street interface morphology: Traditionally, urban design research has often been divided between rigid geometric measurement and nuanced perceptual assessment. Addressing the first research objective, this study shows that the proposed MLLM-agent workflow can effectively conduct joint assessments of both objective geometric and subjective perceptual indicators directly from street-view imagery. By leveraging the joint reasoning capabilities of a natively multimodal model within an agent-based workflow, the system uses structured prompts to estimate objective geometric indicators and infer subjective morphological scores in a unified analytical process, thereby establishing an automated framework for street interface morphology assessment.
- Codifying expert logic into automated workflows: A significant barrier in large-scale urban analysis is the difficulty of replicating the interpretive and rule-based reasoning used by professional planners. For our second research objective, we demonstrated that through structured dual-benchmark calibration and prompt engineering, the agent’s reasoning can be aligned with ground-truth geometric measurement and expert consensus. This contributes a viable pathway for translating refined expert evaluation into scalable digital workflows. By mitigating the randomness of AI outputs, the framework helps ensure that automated assessments are not merely data-driven, but are anchored in established urban design principles and expert evaluative logic. Notably, this framework enables a rapid optimization process that is independent of large-scale manual annotation, thereby ensuring robust scenario generalizability across diverse and complex urban environments.
- Capturing the continuous rhythm of streetscapes: Urban design practice emphasizes the street as a continuous experience rather than a collection of isolated points. Responding to the third research objective, our workflow successfully transitioned from static point-based sampling to continuous vectorized mapping. This advancement addresses the gap in capturing the spatial rhythm of streets, allowing for a more accurate reflection of how street interface qualities fluctuate along a journey.
4.2. Advantages and Inherent Limitations
- Beyond pixel-level segmentation: Traditional computer vision approaches, primarily represented by Deep Convolutional Neural Networks (DCNNs), focus on the statistical aggregation of visual elements via pixel-level semantic segmentation. While effective at quantifying the physical presence of elements, these models often encounter a “semantic gap” when attempting to synthesize fragmented visual extractions into abstract spatial relationships or complex morphological configurations. The proposed MLLM-agent overcomes this by executing contextual semantic reasoning, mimicking a planner’s cognitive ability for morphological synthesis.
- Interpretable “White-Box” reasoning: Advanced streetscape studies often employ secondary neural networks trained on segmented data to predict perceptual scores. However, these methodologies essentially remain “black-box” systems that rely on statistical correlations, lacking the logical “why” behind the results. In contrast, the agent in this study employs evidence-based inductive reasoning through a “reasoning and acting” (ReAct) process. It does not merely output a score but provides explicit natural-language justifications in outputs aligned with established evaluation rules in prompts. This transparency transforms the evaluation from an opaque prediction into a traceable process, which is far more actionable for planners than pure statistical fitting.
- Zero-shot adaptability versus fragmented workflows: Existing large-scale SVI analytical workflows are often fragmented, requiring complex coordination between different platforms. The MLLM-agent operates as an integrated, automated workflow that leverages pre-trained large language models. Through prompt engineering, it can adapt flexibly to diverse urban contexts without task-specific retraining or large-scale manual annotation, while also reducing the need for complex cross-platform data conversion and transfer.
- Lowering technical barriers for agent-based analysis: Beyond analytical performance, the proposed workflow lowers the technical threshold for developing and refining AI-based urban analysis agents. By relying on an AI platform to coordinate multimodal reasoning, structured prompting, and spatial visualization, it enables researchers to construct, test, and optimize agent-based workflows more rapidly and with less programming overhead.
- Inherent instability of generative AI: MLLMs possess inherent randomness, so the stability of image analysis can fluctuate. This is especially pronounced when evaluating physical indicators (such as accurately estimating a large spatial depth) while the key spatial references are ambiguous or missing in the SVI, or when the evaluation rules are formulated without sufficient detail.
- Challenges in quantifying complex interfaces: Current metric models and sampling methods still face challenges in quantifying complex geometries such as arcades or overhanging structures, which necessitates the clear formulation of rigorous rules for semantic interpretation. Furthermore, boundary standards must be tailored for specific urban fabrics; for instance, the wall-defined boundaries of high-density cities like Shanghai require localized extraction rules that differ from those used for continuous building façades in European cities.
- Prompt interference: We observed that complex reasoning rules for objective and subjective metrics can occasionally interfere with each other within the same task, subtly affecting output precision; exploring strategies to mitigate this interference is a key direction for future research.
- Balancing precision and adaptability in prompt design: While the MLLM agent strictly follows the logic provided in the prompts, crafting these instructions involves a delicate trade-off. Overly rigid rules may ensure consistent results for specific street types but often fail to adapt to the vast diversity of global urban contexts. Conversely, vague or broad instructions can lead to inconsistent scoring. Finding the optimal balance between rule-based precision and morphological adaptability remains a critical area for future optimization.
4.3. Future Research Directions
- Multi-source data integration: To overcome the drawbacks of relying solely on street-level images, future research should combine the AI’s visual analysis with satellite data and GIS mapping. This integration would provide a more complete spatial context and ensure more reliable results for complex geometric measurements.
- Context-aware boundary customization: Urban forms differ significantly across cultural and historical settings. Future research should refine how the “street boundary” is calculated to fit specific local conditions. Adjusting these computational rules to reflect regional spatial characteristics will improve the accuracy of context-sensitive assessments.
- From morphological description to psychological experience: While this study introduces subjective evaluation, the current “subjective values” primarily remain at the level of describing objective physical quantities and their direct spatial effects. Future research should aim to bridge these morphological metrics with high-level human psychological experiences, such as a sense of safety, intimacy, or urban vibrancy.
- Regional-wide morphological mapping: Beyond individual street analysis, this framework can be scaled up to map the morphological features of street interfaces across an entire region. By identifying the spatial distribution of these characteristics, future research could explore how these patterns influence pedestrian movement and urban vitality. These regional-scale “morphological maps” would provide a clear visual guide for planners, helping them make more informed decisions for urban renewal.
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| LLMs | Large Language Models |
| MLLMs | Multimodal Large Language Models |
| AI | Artificial Intelligence |
| SVI | Street View Imagery |
| CV | Computer Vision |
| VLM | Vision-Language Model |
| RMSE | Root Mean Square Error |
| D/H | Width-to-Height Ratio |
| H/D | Height-to-Width Ratio |
| Acc | Qualified Rate |
| SID | Street Interface Depth |
Appendix A
| Zone | Street Segment | Longitude | Latitude |
|---|---|---|---|
| Hengfu Historic Area | Yueyang Road (West Jianguo Rd–Yongjia Rd) | 121.459126 | 31.210209 |
| | | 121.459052 | 31.21065 |
| | | 121.458957 | 31.211091 |
| | | 121.458874 | 31.211533 |
| | Wukang Road (Wukang Mansion–Hunan Rd) | 121.444564 | 31.210795 |
| | | 121.445226 | 31.211977 |
| | | 121.446096 | 31.21314 |
| | | 121.446708 | 31.214336 |
| | Wuxing Road (Hengshan Rd–Middle Huaihai Rd) | 121.450044 | 31.20979 |
| | | 121.449401 | 31.210628 |
| | | 121.448792 | 31.211305 |
| | | 121.448389 | 31.212131 |
| Middle Huaihai Road Commercial District | Middle Huaihai Road (Madang Rd–South Xizang Rd) | 121.483715 | 31.230185 |
| | | 121.484671 | 31.23057 |
| | | 121.485618 | 31.230919 |
| | | 121.486529 | 31.231278 |
| | Madang Road (Zizhong Rd–West Jinling Rd) | 121.481032 | 31.224963 |
| | | 121.480867 | 31.225837 |
| | | 121.480395 | 31.227572 |
| | | 121.480001 | 31.228423 |
| | South Huangpi Road (Middle Jinling Rd–Zizhong Rd) | 121.480907 | 31.229332 |
| | | 121.481333 | 31.228475 |
| | | 121.481713 | 31.227704 |
| | | 121.481969 | 31.226884 |
| Yangpu Workers’ Village Area | Tieling Road (Zhangwu Rd–Benxi Rd) | 121.518843 | 31.286197 |
| | | 121.519376 | 31.285487 |
| | | 121.519981 | 31.284712 |
| | | 121.520595 | 31.283925 |
| | Jinxi Road (Tieling Rd–Dahushan Rd) | 121.520107 | 31.283179 |
| | | 121.519248 | 31.282591 |
| | | 121.51765 | 31.281449 |
| | | 121.515233 | 31.279782 |
| | Xuchang Road (Kongjiang Rd–Shuangliao Branch Rd) | 121.518924 | 31.278554 |
| | | 121.519169 | 31.277703 |
| | | 121.519659 | 31.276893 |
| | | 121.520506 | 31.276356 |
References
- Verma, D.; Mumm, O.; Carlow, V.M. Generative agents in the streets: Exploring the use of Large Language Models (LLMs) in collecting urban perceptions. arXiv 2023, arXiv:2312.13126. [Google Scholar] [CrossRef]
- Feng, J.; Liu, T.; Du, Y.; Guo, S.; Lin, Y.; Li, Y. CityGPT: Empowering urban spatial cognition of large language models. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Toronto, ON, Canada, 3 August 2025; ACM: New York, NY, USA, 2025; Volume 2, pp. 591–602. [Google Scholar]
- Li, J.; Ma, M.; Lai, Y. Identifying street multi-activity potential (SMAP) and local networks with MLLMs and multi-view graph clustering. Comput. Environ. Urban Syst. 2025, 122, 102350. [Google Scholar] [CrossRef]
- Tang, Y.; Qu, A.; Yu, X.; Deng, W.; Ma, J.; Zhao, J.; Sun, L. From street views to urban science: Discovering road safety factors with multimodal large language models. arXiv 2025, arXiv:2506.02242. [Google Scholar] [CrossRef]
- Al Mushayt, N.S.; Dal Cin, F.; Barreiros Proença, S. New lens to reveal the street interface: A morphological-visual perception methodological contribution for decoding the public/private edge of arterial streets. Sustainability 2021, 13, 11442. [Google Scholar] [CrossRef]
- James, L.R.; Demaree, R.G.; Wolf, G. rwg: An assessment of within-group interrater agreement. J. Appl. Psychol. 1993, 78, 306. [Google Scholar] [CrossRef]








| Core Dimension | Interface Spatial Scope | Objective Metric | Calculation Formula | Subjective Metric | Scoring Standard (Likert 1–5) |
|---|---|---|---|---|---|
| Enclosure (Overall) | Maximum interface height | Integrated H/D Ratio | — | Sense of spatial envelope | 1 (Open) – 5 (Enclosed) |
| Continuity (Single-sided) | All interfaces above 1.5 m | Street Wall Ratio (%) | — | Sense of continuity of street interface | 1 (Fragmented) – 5 (Continuous) |
| Transparency (Single-sided) | 0–5 m in first interface layer | Opening Ratio (%) | — | Sense of permeability of street interface | 1 (Opaque) – 5 (Transparent) |
| Roughness (Single-sided) | First interface layer | SID Standard Deviation | — | Sense of roughness of street interface | 1 (Aligned) – 5 (Staggered) |
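
The four objective metrics defined above reduce to two length ratios, a height/width ratio, and a dispersion statistic. The following Python sketch is illustrative only: the per-segment data structures and function names are assumptions, not the authors' implementation, and "SID" is read here as the per-building setback (interface) distance whose spread captures roughness.

```python
# Minimal sketch of the four objective street-interface metrics,
# assuming hypothetical per-segment field measurements (all names
# and data are illustrative, not the paper's implementation).
from statistics import pstdev

def hd_ratio(max_interface_height_m: float, street_width_m: float) -> float:
    """Integrated H/D ratio: maximum interface height over street width."""
    return max_interface_height_m / street_width_m

def street_wall_ratio(wall_lengths_m: list, street_length_m: float) -> float:
    """Share of street length fronted by interfaces taller than 1.5 m (%)."""
    return 100.0 * sum(wall_lengths_m) / street_length_m

def opening_ratio(opening_lengths_m: list, facade_length_m: float) -> float:
    """Share of the 0-5 m first-layer facade occupied by openings (%)."""
    return 100.0 * sum(opening_lengths_m) / facade_length_m

def sid_std(setback_distances_m: list) -> float:
    """Roughness: population std. dev. of setback (interface) distances, in m."""
    return pstdev(setback_distances_m)

print(hd_ratio(18.0, 24.0))              # → 0.75
print(street_wall_ratio([40, 35], 100))  # → 75.0
print(sid_std([2, 2, 4, 4]))             # → 1.0
```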
| Measurement Dimension | Ground Truth | Agent-Based Estimation |
|---|---|---|
| Objective metrics | Satellite measurement and manual field survey | Direct estimation guided by geometric definitions and formulas |
| Subjective metrics | Expert scoring panel | Evaluation based on indicator prompts |
| Dimension | Scope | Retained Samples (n) | Excluded Samples (n) | Agreement Rate (rwg > 0.7) (%) |
|---|---|---|---|---|
| Enclosure | Overall | 29 (9) | 7 (3) | 80.6 |
| Continuity | Side A | 28 (10) | 8 (2) | 77.8 |
| Continuity | Side B | 26 (9) | 10 (3) | 72.2 |
| Transparency | Side A | 26 (8) | 10 (4) | 72.2 |
| Transparency | Side B | 26 (5) | 10 (7) | 72.2 |
| Roughness | Side A | 27 (8) | 9 (4) | 75.0 |
| Roughness | Side B | 27 (6) | 9 (6) | 75.0 |
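
The expert-screening step above retains samples by within-group interrater agreement with a 0.7 cutoff, consistent with the single-item rwg index of James, Demaree and Wolf. A minimal sketch, assuming a 5-point Likert scale and a uniform null distribution; the panel ratings shown are hypothetical, not the study's data.

```python
# Sketch of the rwg within-group agreement index (James et al.):
# rwg = 1 - S^2 / sigma^2_EU, where sigma^2_EU = (A^2 - 1) / 12 is the
# variance of a uniform null on an A-point scale (2.0 for A = 5).
from statistics import variance

def rwg(ratings: list, scale_points: int = 5) -> float:
    """Single-item rwg for one rated street segment."""
    sigma_eu = (scale_points**2 - 1) / 12      # expected variance under random rating
    return 1.0 - variance(ratings) / sigma_eu  # sample variance of observed ratings

panel = [4, 4, 5, 4, 4]       # hypothetical expert scores for one segment
print(round(rwg(panel), 2))   # → 0.9  (above 0.7, so the segment is retained)
```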
| Dimension | Prompt (Before) | Prompt (After) |
|---|---|---|
| Enclosure | Low: Vast sky view, low buildings, wide setbacks, empty lots. High: Narrow sky strip, tall street walls, canyon effect | Low Score Indicators: … |
| Continuity | Low: Large gaps, vacant lots, inconsistent fencing. High: Continuous wall, hedge (>1.5 m), retail frontage | Low Score Indicators: … |
| Transparency | Low: Solid masonry, shuttered doors, dense privacy hedges. High: Glass curtain walls, open metal fencing, large shop windows | Low Score Indicators: … |
| Roughness | Low: Straight wall alignment. High: Complex setbacks, jagged building line | Low Score Indicators: … |
| Dimension | Scope | RMSE | Acc (%) |
|---|---|---|---|
| Enclosure | Overall | 0.71 | 45.8 |
| Continuity | Side A | 10.24% | 75.0 |
| Continuity | Side B | 22.39% | 66.7 |
| Transparency | Side A | 36.23% | 20.8 |
| Transparency | Side B | 23.79% | 33.3 |
| Roughness | Side A | 2.00 m | 70.8 |
| Roughness | Side B | 1.71 m | 70.8 |
| Dimension | Scope | RMSE | Acc (%) |
|---|---|---|---|
| Enclosure | Overall | 1.38 | 33.3 |
| Continuity | Side A | 15.01% | 62.5 |
| Continuity | Side B | 28.16% | 47.8 |
| Transparency | Side A | 32.03% | 29.2 |
| Transparency | Side B | 23.05% | 50.0 |
| Roughness | Side A | 1.91 m | 66.7 |
| Roughness | Side B | 1.83 m | 70.8 |
| Dimension | Scope | RMSE | Acc (%) |
|---|---|---|---|
| Enclosure | Overall | 0.35 | 79.2 |
| Continuity | Side A | 12.70% | 75.0 |
| Continuity | Side B | 16.35% | 83.3 |
| Transparency | Side A | 13.88% | 66.7 |
| Transparency | Side B | 13.65% | 70.8 |
| Roughness | Side A | 1.24 m | 79.2 |
| Roughness | Side B | 1.41 m | 75.0 |
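
The objective-metric tables above report two error statistics per dimension: RMSE against ground truth and an accuracy rate. A minimal sketch of both, assuming accuracy is the share of segments whose estimate falls within a tolerance band; the tolerance value and the sample data are illustrative assumptions, not the paper's exact criterion.

```python
# Sketch of the two validation statistics: RMSE versus ground truth, and
# accuracy as the share of estimates within a tolerance of the truth.
# The tolerance (0.2) and the H/D values below are illustrative only.
import math

def rmse(estimates: list, truths: list) -> float:
    """Root-mean-square error between paired estimates and ground truth."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truths)) / len(truths))

def accuracy_within(estimates: list, truths: list, tol: float) -> float:
    """Percentage of samples with |estimate - truth| <= tol."""
    hits = sum(abs(e - t) <= tol for e, t in zip(estimates, truths))
    return 100.0 * hits / len(truths)

est   = [0.8, 1.1, 0.6, 1.4]   # hypothetical agent H/D estimates
truth = [0.9, 1.0, 0.7, 1.0]   # hypothetical field-measured ground truth
print(round(rmse(est, truth), 2))          # → 0.22
print(accuracy_within(est, truth, 0.2))    # → 75.0
```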
| Dimension | Scope | n | Spearman’s ρ | p-Value |
|---|---|---|---|---|
| Enclosure | Overall | 20 | 0.37 | 0.108 |
| Continuity | Side A | 18 | 0.54 | 0.021 |
| Continuity | Side B | 17 | 0.48 | 0.051 |
| Transparency | Side A | 18 | 0.21 | 0.403 |
| Transparency | Side B | 21 | 0.48 | 0.028 |
| Roughness | Side A | 19 | 0.34 | 0.154 |
| Roughness | Side B | 21 | 0.07 | 0.763 |
| Dimension | Scope | n | Spearman’s ρ | p-Value |
|---|---|---|---|---|
| Enclosure | Overall | 20 | 0.44 | 0.052 |
| Continuity | Side A | 18 | 0.68 | 0.002 |
| Continuity | Side B | 17 | −0.13 | 0.619 |
| Transparency | Side A | 18 | 0.16 | 0.526 |
| Transparency | Side B | 21 | 0.29 | 0.202 |
| Roughness | Side A | 19 | 0.34 | 0.154 |
| Roughness | Side B | 21 | 0.21 | 0.361 |
| Dimension | Scope | n | Spearman’s ρ | p-Value |
|---|---|---|---|---|
| Enclosure | Overall | 20 | 0.65 | 0.002 |
| Continuity | Side A | 18 | 0.71 | 0.001 |
| Continuity | Side B | 17 | 0.79 | <0.001 |
| Transparency | Side A | 18 | 0.65 | 0.004 |
| Transparency | Side B | 21 | 0.78 | <0.001 |
| Roughness | Side A | 19 | 0.72 | <0.001 |
| Roughness | Side B | 21 | 0.71 | <0.001 |
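
The subjective-metric tables above use Spearman's rank correlation to compare agent scores against the expert panel. Since Likert data contain ties, the rank correlation needs average (fractional) ranks; the pure-Python sketch below shows that computation with hypothetical scores (not the study's data), omitting the p-value.

```python
# Sketch of Spearman's rho as Pearson correlation on average ranks,
# with ties assigned fractional ranks. Sample scores are illustrative.
from statistics import mean

def _ranks(xs: list) -> list:
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                      # extend the run of tied values
        avg = (i + j) / 2 + 1           # average position, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(a: list, b: list) -> float:
    """Spearman's rank correlation between two paired score lists."""
    ra, rb = _ranks(a), _ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov   = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5

agent  = [2, 3, 3, 4, 5]   # hypothetical agent Likert scores
expert = [1, 3, 2, 4, 5]   # hypothetical expert-panel medians
print(round(spearman_rho(agent, expert), 2))   # → 0.97
```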
| Dimension | Average Difference (Objective) | Average Difference (Subjective) |
|---|---|---|
| Enclosure | 0.09 | 0.23 |
| Continuity | 2.93% | 0.20 |
| Transparency | 8.50% | 0.57 |
| Roughness | 0.15 m | 0.14 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Wang, Y.; Ye, Y.; Weng, C. A Scalable Framework for Street Interface Morphology Assessment via Automated Multimodal Large Language Model Agents. Land 2026, 15, 610. https://doi.org/10.3390/land15040610

