2.3. Research Data and Preprocessing
2.3.1. Street View Imagery Acquisition
Street view imagery was collected in Harbin using a community anchored sampling strategy to capture residential street environments at scale while maintaining a direct linkage between images and their corresponding communities. We compiled an inventory of 3905 residential communities with names and geographic coordinates from major real estate information platforms (Lianjia and Anjuke). Community coordinates were used as anchors, with size adaptive buffers defined for image querying (typically 200 m, 300 m, or 400 m based on spatial extent), and a small number of irregular or atypical communities were manually adjusted to improve spatial representativeness. Street view points were generated along accessible public roads within each buffer at approximately 50 m spacing, and images were retrieved via publicly accessible APIs from Baidu Street View and Tencent Street View. As street view coverage rarely enters internal residential spaces, the resulting imagery primarily reflects everyday streets around community perimeters rather than inside gated compounds. To avoid coverage inconsistency and seasonal appearance shifts, we restricted acquisition to summer records. This procedure yielded 24,143 images across the main urban area, and after quality control and redundancy filtering, 13,560 images were retained for analysis.
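A minimal Python sketch of this point-generation step is shown below, assuming projected (metre-based) coordinates, shapely geometries, and hypothetical road segments; the actual pipeline additionally handled API querying from Baidu and Tencent Street View, deduplication, and manual adjustment of irregular communities.

```python
# Illustrative sketch (not the production pipeline): generate candidate street-view
# sampling points at ~50 m spacing along road segments within a community-anchored
# buffer. Coordinates, buffer radius, and road geometries are placeholders.
from shapely.geometry import Point, LineString

def sample_points_in_buffer(anchor_xy, roads, radius_m=300.0, spacing_m=50.0):
    """Return points spaced ~spacing_m along the parts of each road inside the buffer."""
    buffer_geom = Point(anchor_xy).buffer(radius_m)
    samples = []
    for road in roads:
        clipped = road.intersection(buffer_geom)       # keep only the in-buffer portion
        parts = getattr(clipped, "geoms", [clipped])   # handle multi-part results
        for part in parts:
            if part.is_empty or part.length == 0:
                continue
            n_steps = int(part.length // spacing_m)
            for i in range(n_steps + 1):
                samples.append(part.interpolate(i * spacing_m))
    return samples

# Hypothetical usage with projected (metre-based) coordinates.
roads = [LineString([(0, 0), (500, 0)]), LineString([(0, -200), (0, 200)])]
points = sample_points_in_buffer((100, 50), roads, radius_m=300, spacing_m=50)
print(len(points), "candidate street-view query points")
```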
2.3.2. Conceptualization and Derivation of Subjective and Objective Perception Factors
We operationalize streetscape quality using a paired construct system that combines subjective perceptions with objectively coded physical attributes. Rather than treating physical attributes as direct causes of perceptions, this pairing is used as a measurement design that supports interpretable linkage between what is visually observable in street view imagery and how streets are commonly experienced and evaluated in planning practice.
SDG 11 provides a normative and policy relevant framing for identifying street scale qualities that are routinely discussed in relation to inclusive mobility and public space, such as pedestrian space provision, traffic pressure, perceived safety, and the visibility of greenery and amenities. Building on this framing, the factor set was informed by two established research streams and adapted to the constraints of image grounded multimodal learning. First, microscale audit instruments in walkability and active living research define street components that are observable and can be coded with reasonable reliability. Second, computational urban perception studies have shown that several subjective impressions, particularly perceived safety, can be collected at scale and are statistically associated with street view visual cues in predictive settings.
Starting from a broader candidate pool aligned with common streetscape constructs, we finalized the dimensions using four criteria: relevance to street scale interpretations of SDG 11.2 and 11.7, visual inferability from street view imagery, interpretability for planning diagnosis, and parsimony to reduce label ambiguity and improve rating stability in the initial citywide validation. Following this process, we defined eight subjective perception dimensions rated by residents and experts: pedestrian width, greenery, public amenities, visual richness, perceived safety, motorization, vehicle lane width, and commercial intensity. In parallel, we coded seven objective streetscape attributes from the same images: sidewalk width, vehicle lane width, greenery level, motorization degree, commercial activity density, sky openness, and public amenities count. These objective attributes function as a structured reference for interpreting model outputs and conducting expert audits, particularly for diagnosing cases where perceived impressions and observable conditions do not align, and they were therefore coded by an expert team using explicit rules applied to street view images rather than collected via resident questionnaires.
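For reference, the two construct sets can be summarized as the following identifier lists; the snake_case names are our shorthand, consistent with the supervision schema in Section 2.5, rather than fixed labels from the questionnaire.

```python
# Illustrative constant mapping of the paired construct system described above.
SUBJECTIVE_DIMENSIONS = [
    "pedestrian_width", "greenery", "public_amenities", "visual_richness",
    "perceived_safety", "motorization", "vehicle_lane_width", "commercial_intensity",
]
OBJECTIVE_ATTRIBUTES = [
    "sidewalk_width", "vehicle_lane_width", "greenery_level", "motorization_degree",
    "commercial_activity_density", "sky_openness", "public_amenities_count",
]
assert len(SUBJECTIVE_DIMENSIONS) == 8 and len(OBJECTIVE_ATTRIBUTES) == 7
```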
2.3.3. Stratified Community Sampling and Perceptual Q&A Instrument
To ensure that the supervised dataset covered heterogeneous neighborhood conditions, we first stratified Harbin’s residential communities into five housing-price tiers using equal-frequency quintiles (Figure 2). The tiers ranged from under ¥5000/m² to above ¥10,000/m², capturing a spectrum from remote or underdeveloped areas to newly built commercial residential zones. From these five tiers, we selected 30 representative communities based on geographic spread and accessibility to support model development, while the remaining neighborhoods were retained for broader evaluation. Observation points were placed along major roads or intersections adjacent to community perimeters to provide consistent, policy-relevant streetscape views for subsequent labeling.
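The stratification step reduces to an equal-frequency binning of community prices; a minimal pandas sketch, with synthetic prices and hypothetical column names, is given below.

```python
# Minimal sketch of the housing-price stratification: equal-frequency quintiles
# assign roughly the same number of communities to each tier. Column names and
# price values are illustrative.
import pandas as pd

communities = pd.DataFrame({
    "community_id": range(1, 11),
    "price_per_m2": [4200, 4800, 5200, 6100, 6800, 7500, 8300, 9200, 10500, 12000],
})
communities["price_tier"] = pd.qcut(
    communities["price_per_m2"], q=5, labels=["T1", "T2", "T3", "T4", "T5"]
)
print(communities.groupby("price_tier", observed=True).size())
```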
The perception questionnaire was designed in a question–answer (Q&A) format to elicit consistent judgments from street-view imagery and generate labels that can be directly reformatted for instruction-style multimodal supervision. We used a seven-point Likert scale (1 to 7), which offers sufficient granularity while remaining practical for rapid image-based assessment [30,31]. Each item was phrased as a short, attribute-specific prompt so that raters focused on one perceptual aspect at a time, reducing construct confusion during fast visual judgment.
2.3.4. Survey Distribution and Quality Control for Training Corpus
Perceptual supervision was collected via offline and online surveys using a seven-point Likert scale (1 to 7). Offline, 320 questionnaires were administered (Figure 3). Online, the same instrument was used to label a 600-image pool for model development; participants rated at least five images, with skipping permitted. We issued 1500 online assignments and applied response-level screening (completion time, repetitive patterns, and missingness). After cleaning, 1150 valid questionnaires were retained, and image-level aggregation required at least five valid ratings per image.
For external validation, we distributed 3000 assignments, each containing 10 randomly sampled images from the citywide archive; respondents rated at least six images on the same 1 to 7 scale, and only numeric ratings were collected. Quality control included a minimum viewing time of 20 s per rated image, pattern-based filtering, and an internal consistency check using a repeated image item; submissions were discarded if repeated-item ratings differed by more than 2 points. After filtering, 2530 valid questionnaires were retained, with an average of approximately 7.0 rated images per questionnaire. We retained images with at least five valid ratings, resulting in 3500 images with crowdsourced consensus scores used as the external validation set.
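The screening and aggregation rules described above can be summarized in the following Python sketch; field names such as view_time_s and repeat_item_first are hypothetical, and pattern-based filtering of repetitive responses is omitted for brevity.

```python
# Sketch of response-level and image-level screening: >= 20 s viewing time per
# rated image, repeated-item difference <= 2 points, and >= 5 valid ratings per
# image before consensus aggregation.
from collections import defaultdict

def screen_submission(sub):
    if any(r["view_time_s"] < 20 for r in sub["ratings"]):
        return False                      # insufficient viewing time
    if abs(sub["repeat_item_first"] - sub["repeat_item_second"]) > 2:
        return False                      # failed internal consistency check
    return True

def aggregate_image_scores(valid_submissions, min_ratings=5):
    pool = defaultdict(list)
    for sub in valid_submissions:
        for r in sub["ratings"]:
            pool[r["image_id"]].append(r["score"])
    return {img: sum(s) / len(s) for img, s in pool.items() if len(s) >= min_ratings}
```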
2.4. Multimodal Model Architecture, Fine-Tuning, and Validation
Street-scene perception research has historically been dominated by CNN-style pipelines that regress or rank perceptual scores from street-view imagery, often evaluated on large-scale pairwise-judgment benchmarks developed for urban perception assessment [32]. However, recent baseline comparisons suggest that this CNN-centered paradigm may not always remain the strongest once visual foundation representations and multimodal reasoning capabilities are introduced into the pipeline [33]. In particular, evaluations on established urban-perception benchmarks report that foundation-model-based representations can match or, in some cases, exceed multiple CNN-type baselines across perceptual dimensions, which is consistent with improved robustness when street scenes are heterogeneous in morphology and context [34]. Related evidence from multimodal large language model frameworks points in a similar direction: model-based pairwise street-view judgments calibrated against crowd-sourced perception benchmarks often show closer alignment with human preference signals than traditional Siamese CNN baselines, implying potential advantages for cognitively demanding, subjective evaluation tasks that benefit from higher-level visual semantics and reasoning [23]. Beyond perception scoring, vision-language pretraining has also been reported to outperform conventional computer-vision baselines in transferable street-scene sensing [23,35], supporting the broader feasibility of a “pretrain then adapt” paradigm for urban visual analytics under cross-city migration.
This shift is closely associated with the rise of vision-language models (VLMs), which align images with natural language and thereby provide semantically rich representations of streetscapes. Such alignment can support inference not only about visible physical elements but also about perception-relevant meanings, enabling more flexible querying of urban scenes than task-specific regressors. Contrastive Language–Image Pre-training (CLIP) based frameworks, for example, report systematic associations between cues such as colonnaded facades, neon signage, and surface textures and crowd-sourced descriptors including historic, chaotic, and vibrant [33]. Prompt-driven systems such as SAGAI further operationalize this capability for planning-oriented audits by producing geocoded query outputs, while UrbanCLIP suggests that image-text co-embeddings may improve socioeconomic inference relative to image-only baselines, particularly in morphologically ambiguous environments [36,37].
At the same time, several constraints can limit reliable deployment across contexts and motivate more cautious model design. Cultural sensitivity is a recurring concern, because models pretrained on dominant datasets may not fully capture local meanings and may display uneven performance in migrant or informal districts, which can translate into cross-context generalization gaps [36]. Temporal myopia is another challenge, since many street-view pipelines treat scenes as static even though multiple data sources indicate that perceived vibrancy and perceived threat can vary between daytime and nighttime [37,38]. Opacity may also complicate adoption in policy settings where legitimacy depends on traceable logic rather than only predictive outputs [39]. These considerations have encouraged approaches that decompose holistic impressions into interpretable dimensions such as cleanliness, greenness, and enclosure [40], and that use multi-label or attention-oriented formulations to better represent co-occurrence and interaction among perceptual attributes, including settings where perceived safety conditions commercial vibrancy [39].
Multimodal large language models (MLLMs), including GPT-4V and CogVLM, extend the VLM paradigm by integrating visual understanding with language reasoning in a single instruction-following interface. Their practical value lies in supporting prompt-based inference across heterogeneous questions without requiring task-specific heads, which can be useful when researchers need to probe diverse perceptual constructs under varying urban contexts. Importantly, MLLMs can generate concise rationales grounded in visible evidence, for example, linking lower pedestrian comfort to narrow sidewalks, obstructed sightlines, or limited shade, which may improve auditability and planning interpretation [41,42]. In addition, role-conditioned prompts allow perspective-aware evaluation, such as considering elderly pedestrians or nighttime users, while interpretability tools such as attention maps and token-region alignment provide complementary mechanisms to inspect decision logic and reduce black-box concerns in applied settings [43,44,45,46].
Figure 4 summarizes the architecture and training workflow of the Multimodal Street Evaluation Framework (MSEF). The framework integrates street-view panoramas with textual annotations derived primarily from resident surveys, interview transcripts, and expert notes. These multimodal inputs are encoded into unified latent representations, aligned through cross-modal attention, and optimized to produce dual outputs, including scalar ratings for multiple street-related dimensions and natural-language rationales that support planning interpretation.
The framework consists of four stages: (i) feature extraction and encoding for street-view imagery and textual annotations, (ii) cross-modal alignment via attention, (iii) a task layer for rating prediction and rationale generation, and (iv) optimization and validation, including task losses, AdamW with cosine annealing, and robustness and agreement checks.
In the feature extraction and encoding stage, each street-view panorama is divided into image patches and encoded by a vision transformer backbone into a visual embedding sequence. We denote the resulting visual feature matrix as $V_{\mathrm{vis}} \in \mathbb{R}^{n \times d}$, where $n$ is the number of visual tokens and $d$ is the hidden dimension. In parallel, textual inputs are tokenized and encoded into a text feature matrix $T \in \mathbb{R}^{m \times d}$, where $m$ is the number of text tokens. These two modalities provide complementary signals, with imagery capturing observable physical cues and text carrying human-centric evaluation language and context.
The multimodal alignment module performs cross-modal fusion using an attention mechanism that conditions visual understanding on the semantics of the textual channel. Following the notation in Figure 4, we apply linear projections to construct query, key, and value matrices:

$$Q = V_{\mathrm{vis}} W_Q, \qquad K = T W_K, \qquad V = T W_V,$$

where $W_Q \in \mathbb{R}^{d \times d_k}$, $W_K \in \mathbb{R}^{d \times d_k}$, and $W_V \in \mathbb{R}^{d \times d_v}$ are trainable projection matrices, $d_k$ is the key/query dimension, and $d_v$ is the value dimension. Accordingly, $Q$ represents queries derived from visual tokens, while $K$ and $V$ represent keys and values derived from text tokens. Cross-modal attention is then computed as:

$$F = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,$$

where $d_k$ is the key dimension. The fused multimodal representation $F$ serves as the aligned feature vector for downstream tasks.
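A minimal single-head PyTorch sketch of this cross-modal attention is given below; the token counts, hidden sizes, and single-head formulation are illustrative simplifications rather than the backbone's actual configuration.

```python
# Minimal sketch of cross-modal attention: visual tokens provide the queries,
# text tokens provide the keys and values; the output is the fused representation F.
import math
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, d_model: int, d_k: int, d_v: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_k, bias=False)   # queries from visual tokens
        self.w_k = nn.Linear(d_model, d_k, bias=False)   # keys from text tokens
        self.w_v = nn.Linear(d_model, d_v, bias=False)   # values from text tokens

    def forward(self, v_vis: torch.Tensor, t_txt: torch.Tensor) -> torch.Tensor:
        q, k, v = self.w_q(v_vis), self.w_k(t_txt), self.w_v(t_txt)
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
        return torch.softmax(scores, dim=-1) @ v         # fused representation F

# Hypothetical shapes: n = 196 visual tokens, m = 64 text tokens, hidden size 768.
fused = CrossModalAttention(768, 96, 768)(torch.randn(1, 196, 768), torch.randn(1, 64, 768))
print(fused.shape)  # torch.Size([1, 196, 768])
```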
The task layer is built on a pretrained multimodal backbone and is formulated to produce dual outputs. First, the model predicts scalar ratings for multiple dimensions, including environmental qualities and commerce-related evaluations, consistent with the assessment protocol used in this study. Second, it generates concise natural-language rationales that justify the predicted ratings in planning-relevant terms. Training samples are organized in an instruction-style format, where each instance consists of an image, a structured question, and a target answer. This design encourages the model to learn not only numeric appraisal but also explanation generation grounded in visible cues.
To adapt the pretrained backbone to streetscape analytics while preserving general knowledge, we employ parameter-efficient fine-tuning. We use Low-Rank Adaptation (LoRA) to inject trainable low-rank updates into selected transformer layers while keeping the majority of pretrained weights frozen:

$$W = W_0 + \Delta W = W_0 + B A,$$

where $W_0$ is the frozen weight matrix and $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$ are trainable low-rank factors with rank $r \ll \min(d, k)$. In addition, we apply P-Tuning v2 by learning a sequence of continuous prompt embeddings $\{p_1, \ldots, p_m\}$ that are concatenated with the image-conditioned tokens before decoding:

$$X' = [\,p_1, \ldots, p_m;\; X\,],$$

where $X$ denotes the image-conditioned token sequence passed to the decoder, so that task preferences and evaluation heuristics are encoded in the prompt space with minimal parameter updates.
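The two adaptation routes can be expressed with the Hugging Face peft library as in the following sketch; the rank, alpha, prompt length, and GLM-style target module name are illustrative assumptions rather than our reported hyperparameters, and PrefixTuningConfig is used here as peft's deep-prompt analogue of P-Tuning v2.

```python
# Sketch of the two parameter-efficient configurations. Values are placeholders.
from peft import LoraConfig, PrefixTuningConfig, TaskType

# Objective physical attributes: low-rank updates inside attention projections.
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"],   # assumed GLM-style projection name
)

# Subjective perceptual judgments: deep continuous prompts, backbone frozen.
ptuning_cfg = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=32,
)

# Either adapter would then be attached with peft.get_peft_model(backbone, cfg)
# before supervised instruction tuning on the Q&A corpus.
```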
Optimization follows the loss design shown in Figure 4. For rating prediction tasks with continuous targets, we use a mean squared error objective:

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N K} \sum_{i=1}^{N} \sum_{k=1}^{K} \left(\hat{y}_{i,k} - y_{i,k}\right)^2,$$

where $N$ is the number of training instances and $K$ is the number of rating dimensions. For rating tasks treated as binary or multi-label outcomes, we use a binary cross-entropy objective:

$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N K} \sum_{i=1}^{N} \sum_{k=1}^{K} \left[\, y_{i,k} \log \hat{y}_{i,k} + \left(1 - y_{i,k}\right) \log \left(1 - \hat{y}_{i,k}\right) \right].$$

The overall training objective combines task losses and the language modeling loss for rationale generation, with weights selected to balance scalar accuracy and explanation quality. We optimize parameters using AdamW and adopt a cosine annealing learning rate schedule consistent with Figure 4:

$$\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\!\left(\frac{t}{T}\,\pi\right)\right),$$

where $\eta_t$ is the learning rate at step $t$ and $T$ is the total number of scheduled steps.
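The following PyTorch sketch illustrates how the task losses, the language modeling loss, AdamW, and the cosine annealing schedule fit together; the loss weights, learning rate, and step counts are placeholders rather than the tuned values used in this study, and a linear layer stands in for the trainable adapter parameters.

```python
# Minimal sketch of the composite objective and optimizer schedule.
import torch
import torch.nn as nn

mse_loss, bce_loss = nn.MSELoss(), nn.BCEWithLogitsLoss()

def total_loss(pred_scores, true_scores, pred_logits, true_labels, lm_loss,
               lambda_mse=1.0, lambda_bce=1.0, lambda_lm=0.5):
    """Combine rating losses with the language-modeling loss for rationales."""
    return (lambda_mse * mse_loss(pred_scores, true_scores)
            + lambda_bce * bce_loss(pred_logits, true_labels)
            + lambda_lm * lm_loss)

model = nn.Linear(16, 8)  # stand-in for the trainable adapter parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000, eta_min=1e-6)

for step in range(3):                                   # illustrative loop
    optimizer.zero_grad()
    out = model(torch.randn(4, 16))
    loss = total_loss(out, torch.rand(4, 8),            # scalar rating targets
                      out, torch.randint(0, 2, (4, 8)).float(),  # multi-label targets
                      lm_loss=torch.tensor(0.0))        # rationale loss placeholder
    loss.backward()
    optimizer.step()
    scheduler.step()
```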
Model validation includes both predictive performance and robustness checks. For tasks evaluated as classification or discretized levels, we report the F1 score:

$$F_1 = \frac{2 P R}{P + R},$$

where $P$ and $R$ denote precision and recall. In this study, discretized levels refer to the three interval classes (1–3, 3–5, 5–7) used in the interval-based agreement check. We construct a 3 × 3 confusion matrix between predicted interval labels and expert-audit (ground-truth) interval labels, compute class-wise precision and recall from the true positives, false positives, and false negatives of each class, and report macro-F1 as the unweighted mean of the class-wise F1 scores. To assess stability, we conduct perturbation validation by measuring the maximum deviation between predictions under controlled prompt or sampling perturbations:

$$\Delta_{\max} = \max_{i} \left| \hat{y}_{i}^{\mathrm{pert}} - \hat{y}_{i} \right|.$$
For subjective perception ratings, agreement with human aggregated ratings is examined using Bland–Altman analysis, where the paired difference is defined as $d_i = \hat{y}_i - y_i$ and the limits of agreement are computed as $\bar{d} \pm 1.96\, s_d$, with $\bar{d}$ the mean difference and $s_d$ its standard deviation. This combination of performance metrics, perturbation-based robustness testing, and agreement analysis ensures that the framework’s outputs are both quantitatively reliable and suitable for planning-oriented interpretation.
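The validation quantities defined above can be computed as in the following sketch, using synthetic ratings for illustration; the binning mirrors the interval-based agreement check.

```python
# Sketch of macro-F1 over interval classes, perturbation deviation, and
# Bland-Altman limits of agreement, with synthetic predictions and targets.
import numpy as np
from sklearn.metrics import f1_score

def to_interval(scores):
    """Map 1-7 ratings to interval classes: 0 for <3, 1 for [3,5), 2 for >=5."""
    return np.digitize(scores, bins=[3, 5])

pred = np.array([2.5, 4.0, 6.2, 5.5, 3.1])
truth = np.array([2.0, 4.5, 6.8, 4.9, 2.8])
macro_f1 = f1_score(to_interval(truth), to_interval(pred), average="macro")

# Perturbation robustness: maximum deviation between original and perturbed predictions.
pred_perturbed = np.array([2.7, 3.8, 6.0, 5.9, 3.0])
delta_max = np.max(np.abs(pred_perturbed - pred))

# Bland-Altman limits of agreement between model and aggregated human ratings.
diff = pred - truth
loa = (diff.mean() - 1.96 * diff.std(ddof=1), diff.mean() + 1.96 * diff.std(ddof=1))
print(round(macro_f1, 3), round(float(delta_max), 2), [round(x, 2) for x in loa])
```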
2.5. Standardizing Questionnaire Items into a VLM-Consumable Q&A Schema
Vision language model training is sensitive to the syntactic structure of instructions, and prompt driven urban analytics systems typically rely on stable query templates to ensure comparable outputs across images and sites. Consistent with this line of work in prompt based VLM and MLLM based urban assessment, we converted the original survey instrument into a fixed question answer schema before fine tuning, so that each training instance followed the same parsing pattern while preserving the meaning of the original questionnaire content. In practice, GPT-4 was used only as a formatting normalizer to rewrite each questionnaire item into a structured Q&A pair, without altering the semantic content of the item or residents’ expressed judgments.
Specifically, each street view image was associated with multiple attributes from the survey. For each attribute, we formed a question string that mirrors the original item and explicitly retains its response scale (for example, “How safe is this street? 1 = not safe, 7 = very safe”). We then formed an answer string that includes the aggregated numeric rating together with a short rationale derived from the original responses. Importantly, the GPT-4 standardization step was constrained to preserve content: it does not add new claims, reinterpret the intent of the item, or modify what residents expressed. Its sole function is to enforce a uniform input layout (Image, Question, Answer) so that the downstream VLM does not parse the question wording inconsistently across samples. To make this step fully transparent, we provide an explicit before/after example in Figure 5, showing one original questionnaire item and its standardized Q&A version. The example demonstrates that the transformation is structural rather than semantic.
All standardized triplets were stored in JSON format with unique image identifiers, the fixed question template, and the consolidated answer. Concretely, each supervision instance is stored as a schema-constrained JSON record (JSONL, one record per line) consisting of (i) an image identifier, (ii) a fixed question template for the target attribute, and (iii) a consolidated answer that aggregates the human supervision signal. The question template is held constant within each attribute during training and evaluation to avoid instruction drift; across records, we vary only the image_id and the aggregated answer content (score and rationale), which is derived from the corresponding respondents’ responses for that image. A representative JSONL example is shown below.
{
  "image_id": "IMG_000123",
  "respondent_id": "R_0127",
  "module": "objective",
  "attribute": "sidewalk_width",
  "question": "Please rate the sidewalk width on a 1–7 scale (1 = almost none/very narrow, 7 = very wide and dominant). Provide visible evidence.",
  "answer": {"score": 5, "evidence": "A continuous sidewalk is present, but it does not dominate the frame."},
  "score": 5,
  "evidence": "A continuous sidewalk is present, but it does not dominate the frame.",
  "scale": {"min": 1, "max": 7, "anchor_1": "almost none/very narrow", "anchor_7": "very wide/dominant"},
  "source": "expert_audit",
  "split": "train",
  "schema_version": "v1",
  "language": "en"
}
{
  "image_id": "IMG_000123",
  "respondent_id": "R_4581",
  "module": "subjective",
  "attribute": "perceived_safety",
  "question": "Please rate perceived safety on a 1–7 scale (1 = very unsafe, 7 = very safe). Provide visible evidence only.",
  "answer": {"score": 4, "evidence": "Moderate traffic and limited pedestrian separation suggest mixed perceived safety."},
  "score": 4,
  "evidence": "Moderate traffic and limited pedestrian separation suggest mixed perceived safety.",
  "scale": {"min": 1, "max": 7, "anchor_1": "very unsafe", "anchor_7": "very safe"},
  "source": "crowd_consensus",
  "split": "external_val",
  "schema_version": "v1",
  "language": "en"
}
We applied automated quality control checks to ensure structural validity and content integrity, including verifying that required numeric fields were present, screening and removing personally identifiable information, and flagging outputs that were empty, off topic, or internally inconsistent with the provided score. The resulting dataset was ingested by VisualGLM-6B directly for supervised instruction tuning using the fixed Q&A schema described above.
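A simplified version of these structural checks is sketched below; the required-field list follows the schema shown above, while the personally identifiable information screen is reduced to a single placeholder pattern and the score–rationale consistency check is left to human review.

```python
# Sketch of structural checks applied to each JSONL supervision record.
import json
import re

REQUIRED_FIELDS = {"image_id", "module", "attribute", "question", "answer", "scale"}
PII_PATTERN = re.compile(r"\b\d{11}\b")        # e.g., bare 11-digit phone numbers

def validate_record(line: str):
    rec = json.loads(line)
    issues = []
    if not REQUIRED_FIELDS.issubset(rec):
        issues.append("missing required fields")
    score = rec.get("answer", {}).get("score")
    if not isinstance(score, (int, float)) or not 1 <= score <= 7:
        issues.append("score missing or out of 1-7 range")
    if not rec.get("answer", {}).get("evidence", "").strip():
        issues.append("empty rationale")
    if PII_PATTERN.search(json.dumps(rec)):
        issues.append("possible personally identifiable information")
    return issues
```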
2.6. Parameter-Efficient Fine-Tuning and Training Workflow
2.6.1. Parameter-Efficient Fine-Tuning Strategy
Recent streetscape perception studies have typically relied on convolutional neural networks that map images to scalar scores or pairwise rankings, which is effective for single-task prediction but limited in cross-attribute generalization and in producing interpretable rationales [24,30]. In contrast, modern vision-language models offer a unified instruction interface, allowing the same backbone to answer heterogeneous urban questions under a consistent supervision schema and to produce language explanations that are easier to audit. Motivated by this paradigm shift, we fine-tuned VisualGLM-6B using parameter-efficient methods rather than full fine-tuning, so that adaptation remains data-efficient and iteration-friendly.
We adopted a staged adaptation strategy in which objective physical attributes were optimized with LoRA, while subjective perceptual judgments were adapted with P-Tuning v2. This division is intentional. Objective attributes such as sidewalk width, roadway width, greening level, and sky openness are visually salient and comparatively stable targets. For these tasks, LoRA provides a direct way to adapt internal projections with a small number of trainable parameters, improving visual-text alignment for measurable cues. In contrast, subjective attributes such as perceived safety, cleanliness, and satisfaction depend on nuanced contextual interpretation and are more sensitive to instruction phrasing and distribution shift. For these tasks, P-Tuning v2 enables rapid and low-risk behavioral adjustment by learning deep continuous prompts while keeping the backbone weights frozen, which supports frequent iterative refinement without destabilizing the model.
Importantly, objective prediction is not treated as an end in itself. The objective module serves as an anchoring layer for the subjective module, supporting both alignment and verification. During subjective fine-tuning and deployment, objective outputs are used as auxiliary constraints and sanity checks. When subjective rationales contradict objective cues that are strongly visible in the image, the instance is flagged for manual review and potential correction in the supervision corpus.
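This anchoring logic can be illustrated with a simple rule-based sketch; the keyword phrases and "strong visibility" thresholds here are assumptions for exposition, not the rules used by the expert reviewers.

```python
# Illustrative sanity check: flag an instance for manual review when a subjective
# rationale asserts a cue that contradicts a strongly visible objective rating.
CONTRADICTION_RULES = [
    # (objective attribute, condition on its 1-7 rating, contradicting phrase)
    ("greenery_level", lambda v: v >= 6, "no greenery"),
    ("sidewalk_width", lambda v: v <= 2, "wide sidewalk"),
    ("motorization_degree", lambda v: v <= 2, "heavy traffic"),
]

def flag_for_review(objective_scores: dict, subjective_rationale: str) -> list:
    text = subjective_rationale.lower()
    flags = []
    for attribute, is_strong, contradicting_phrase in CONTRADICTION_RULES:
        value = objective_scores.get(attribute)
        if value is not None and is_strong(value) and contradicting_phrase in text:
            flags.append(attribute)
    return flags

# Example: high coded greenery but a rationale claiming the opposite triggers review.
print(flag_for_review({"greenery_level": 7}, "Feels unsafe because there is no greenery or lighting."))
```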
2.6.2. Training Split, Protocol, and Evaluation
Fine-tuning followed the image–question–answer triplet format described above. Model development was conducted on a 600-image streetscape pool with crowdsourced consensus labels, which was randomly split into training, validation, and test sets (480/60/60). In addition to this internal split, we used an independent external validation set of 3500 images with crowdsourced consensus perception scores to evaluate out-of-sample generalization. This setup constitutes a small-sample learning regime because high-quality perception supervision is costly to elicit and aggregate, and the number of distinct labeled images for fine-tuning remains limited relative to model capacity; therefore, we adopt parameter-efficient adaptation to improve data efficiency and mitigate overfitting while enabling rapid iteration. For objective tasks, the model was trained to output a bounded score on a fixed scale with a short evidence statement grounded in visible cues. For subjective tasks, the model was trained to output the aggregated resident perception score, accompanied by a concise rationale that reflects recurring themes from interviews and expert notes while remaining faithful to the original meaning. We report objective performance using classification or ordinal metrics consistent with the coded labels, and subjective performance using agreement with aggregated resident perceptions. Hyperparameters and adapter configurations are reported to support reproducibility, including LoRA rank and target modules, as well as the number and depth of prompt tokens in P-Tuning v2.
2.6.3. Hallucination Control and Human-in-the-Loop Revision Loop
At the same time, large vision-language models are known to suffer from hallucination, generating fluent but visually unsupported content, and this failure mode has been systematically documented across representative LVLM families [47,48,49]. Importantly, hallucination risk is not only model-dependent but also prompt-dependent, with instruction wording and visual-instruction priors affecting what is spuriously generated [50]. Syntheses of the LVLM hallucination literature further emphasize that practical mitigation is typically achieved by combining structured prompting or constrained output formats with data-centric quality control, including auditing, filtering, and iterative refinement of supervision corpora [49]. This evidence motivates our design choices in two ways. First, questionnaire standardization into a fixed Q&A schema is used to reduce prompt-structure variance and stabilize model parsing, which helps prevent uncontrolled reinterpretation of question intent. Second, a human-in-the-loop revision loop is adopted to identify visually unsupported or inconsistent supervision instances, revise the corpus at the data layer, and re-train on corrected samples, aligning with mitigation strategies advocated in the LVLM hallucination literature [50].
To reduce this risk, we enforced a structured output format and adopted a data-centric correction loop aligned with the workflow in Figure 6. First, we sampled a small subset of supervision pairs and conducted a pilot fine-tuning run. Second, the model generated image-grounded answers for the same subset under the fixed templates. Third, human reviewers audited outputs for visual grounding and semantic faithfulness. When deviations were identified, experts performed a supervision-target correction at the data layer, editing both the numeric score and the accompanying rationale to match visible evidence and the intended aggregated human judgments captured in the survey; the corrected score–rationale targets replaced the original supervision pairs and were reinserted into the training corpus for the next fine-tuning iteration. Importantly, these edits were applied only to the supervision pool used for fine-tuning; held-out evaluation sets and the independent external verification set remained fixed and were never used to rewrite training targets.
This iterative loop was repeated until the error patterns stabilized, after which the refined supervision set was used for full-scale fine-tuning and deployment. Throughout training, we also maintained a reserve buffer of alternative question–answer formulations for each image to expose the model to controlled linguistic variation without changing the underlying target signal. By combining structured prompting, community-level holdout evaluation, and repeated human-in-the-loop correction at the corpus level, we aimed to improve both robustness and interpretability while limiting hallucinated details.
Operationally, we adopted a gated, data-centric correction workflow aligned with Figure 6. We first ran pilot fine-tuning on a training-only subset under fixed Q&A templates and audited the model’s generated answers on the same subset. Human reviewers flagged three classes of failures: (i) visually ungrounded statements (hallucinated objects, facilities, or events not supported by the image), (ii) attribute confusion (answering a correlated but different dimension under similar prompts), and (iii) score–text inconsistency (numeric scores conflicting with the accompanying rationale). Flagged cases were traced back to their supervision pairs and corrected through corpus-level revision, including tightening question wording, removing ambiguous rationales, and fixing inconsistent score–text pairs. The corrected samples were then reintroduced into the training corpus for the next iteration.
This loop continued until the held-out evaluation met an explicit acceptance gate. In Phase 2, we conducted full-scale fine-tuning and assessed performance on validation and test sets under the community-level holdout design. We proceeded to Phase 3 only when the model satisfied the predefined criteria on the held-out test set, namely Accuracy@1 (within-one-point on the 7-point Likert scale) ≥ 0.85, together with a low rate of auditor-flagged ungrounded outputs (≤5%). If the gate was not satisfied, we returned to corpus revision and repeated the fine-tuning cycle. After the model met the acceptance gate on the held-out test set, we conducted Phase 3 independent external human verification using a separate post hoc dataset, which is described in the next section.
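The acceptance gate itself reduces to two scalar checks, as in the following sketch with synthetic predictions; the thresholds (0.85 within-one-point accuracy and a 5% ungrounded-output rate) follow the criteria stated above.

```python
# Sketch of the Phase-2 acceptance gate: within-one-point accuracy on the 7-point
# scale and the auditor-flagged ungrounded-output rate.
import numpy as np

def passes_gate(pred, truth, ungrounded_flags, acc_threshold=0.85, flag_threshold=0.05):
    within_one = np.mean(np.abs(np.asarray(pred) - np.asarray(truth)) <= 1)   # Accuracy@1
    flag_rate = np.mean(ungrounded_flags)                                     # share of flagged outputs
    return within_one >= acc_threshold and flag_rate <= flag_threshold

print(passes_gate([5, 4, 6, 2], [5, 5, 7, 2], [0, 0, 0, 0]))  # True in this toy case
```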