2.3. Research Data and Preprocessing
2.3.1. Street View Imagery Acquisition
Street view imagery was collected in Harbin using a community anchored sampling strategy to capture residential street environments at scale while maintaining a direct linkage between images and their corresponding communities. We compiled an inventory of 3905 residential communities with names and geographic coordinates from major real estate information platforms (Lianjia and Anjuke). Community coordinates were used as anchors, with size adaptive buffers defined for image querying (typically 200 m, 300 m, or 400 m based on spatial extent), and a small number of irregular or atypical communities were manually adjusted to improve spatial representativeness. Street view points were generated along accessible public roads within each buffer at approximately 50 m spacing, and images were retrieved via publicly accessible APIs from Baidu Street View and Tencent Street View. As street view coverage rarely enters internal residential spaces, the resulting imagery primarily reflects everyday streets around community perimeters rather than inside gated compounds. To avoid coverage inconsistency and seasonal appearance shifts, we restricted acquisition to summer records. This procedure yielded 24,143 images across the main urban area, and after quality control and redundancy filtering, 13,560 images were retained for analysis.
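A minimal Python sketch of this point-generation step is shown below, assuming projected (metre-based) coordinates, shapely geometries, and hypothetical road segments; the actual pipeline additionally handled API querying from Baidu and Tencent Street View, deduplication, and manual adjustment of irregular communities.

```python
# Illustrative sketch (not the production pipeline): generate candidate street-view
# sampling points at ~50 m spacing along road segments within a community-anchored
# buffer. Coordinates, buffer radius, and road geometries are placeholders.
from shapely.geometry import Point, LineString

def sample_points_in_buffer(anchor_xy, roads, radius_m=300.0, spacing_m=50.0):
    """Return points spaced ~spacing_m along the parts of each road inside the buffer."""
    buffer_geom = Point(anchor_xy).buffer(radius_m)
    samples = []
    for road in roads:
        clipped = road.intersection(buffer_geom)       # keep only the in-buffer portion
        parts = getattr(clipped, "geoms", [clipped])   # handle multi-part results
        for part in parts:
            if part.is_empty or part.length == 0:
                continue
            n_steps = int(part.length // spacing_m)
            for i in range(n_steps + 1):
                samples.append(part.interpolate(i * spacing_m))
    return samples

# Hypothetical usage with projected (metre-based) coordinates.
roads = [LineString([(0, 0), (500, 0)]), LineString([(0, -200), (0, 200)])]
points = sample_points_in_buffer((100, 50), roads, radius_m=300, spacing_m=50)
print(len(points), "candidate street-view query points")
```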
2.3.2. Conceptualization and Derivation of Subjective and Objective Perception Factors
We operationalize streetscape quality using a paired construct system that combines subjective perceptions with objectively coded physical attributes. Rather than treating physical attributes as direct causes of perceptions, this pairing is used as a measurement design that supports interpretable linkage between what is visually observable in street view imagery and how streets are commonly experienced and evaluated in planning practice.
SDG 11 provides a normative and policy relevant framing for identifying street scale qualities that are routinely discussed in relation to inclusive mobility and public space, such as pedestrian space provision, traffic pressure, perceived safety, and the visibility of greenery and amenities. Building on this framing, the factor set was informed by two established research streams and adapted to the constraints of image grounded multimodal learning. First, microscale audit instruments in walkability and active living research define street components that are observable and can be coded with reasonable reliability. Second, computational urban perception studies have shown that several subjective impressions, particularly perceived safety, can be collected at scale and are statistically associated with street view visual cues in predictive settings.
Starting from a broader candidate pool aligned with common streetscape constructs, we finalized the dimensions using four criteria: relevance to street scale interpretations of SDG 11.2 and 11.7, visual inferability from street view imagery, interpretability for planning diagnosis, and parsimony to reduce label ambiguity and improve rating stability in the initial citywide validation. Following this process, we defined eight subjective perception dimensions rated by residents and experts: pedestrian width, greenery, public amenities, visual richness, perceived safety, motorization, vehicle lane width, and commercial intensity. In parallel, we coded seven objective streetscape attributes from the same images: sidewalk width, vehicle lane width, greenery level, motorization degree, commercial activity density, sky openness, and public amenities count. These objective attributes function as a structured reference for interpreting model outputs and conducting expert audits, particularly for diagnosing cases where perceived impressions and observable conditions do not align, and they were therefore coded by an expert team using explicit rules applied to street view images rather than collected via resident questionnaires.
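For reference, the two construct sets can be summarized as the following identifier lists; the snake_case names are our shorthand, consistent with the supervision schema in Section 2.5, rather than fixed labels from the questionnaire.

```python
# Illustrative constant mapping of the paired construct system described above.
SUBJECTIVE_DIMENSIONS = [
    "pedestrian_width", "greenery", "public_amenities", "visual_richness",
    "perceived_safety", "motorization", "vehicle_lane_width", "commercial_intensity",
]
OBJECTIVE_ATTRIBUTES = [
    "sidewalk_width", "vehicle_lane_width", "greenery_level", "motorization_degree",
    "commercial_activity_density", "sky_openness", "public_amenities_count",
]
assert len(SUBJECTIVE_DIMENSIONS) == 8 and len(OBJECTIVE_ATTRIBUTES) == 7
```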
2.3.3. Stratified Community Sampling and Perceptual Q&A Instrument
To ensure that the supervised dataset covered heterogeneous neighborhood conditions, we first stratified Harbin’s residential communities into five housing-price tiers using equal-frequency quintiles (Figure 2). The tiers ranged from under ¥5000/m² to above ¥10,000/m², capturing a spectrum from remote or underdeveloped areas to newly built commercial residential zones. From these five tiers, we selected 30 representative communities based on geographic spread and accessibility to support model development, while the remaining neighborhoods were retained for broader evaluation. Observation points were placed along major roads or intersections adjacent to community perimeters to provide consistent, policy-relevant streetscape views for subsequent labeling.
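The stratification step reduces to an equal-frequency binning of community prices; a minimal pandas sketch, with synthetic prices and hypothetical column names, is given below.

```python
# Minimal sketch of the housing-price stratification: equal-frequency quintiles
# assign roughly the same number of communities to each tier. Column names and
# price values are illustrative.
import pandas as pd

communities = pd.DataFrame({
    "community_id": range(1, 11),
    "price_per_m2": [4200, 4800, 5200, 6100, 6800, 7500, 8300, 9200, 10500, 12000],
})
communities["price_tier"] = pd.qcut(
    communities["price_per_m2"], q=5, labels=["T1", "T2", "T3", "T4", "T5"]
)
print(communities.groupby("price_tier", observed=True).size())
```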
The perception questionnaire was designed in a question–answer (Q&A) format to elicit consistent judgments from street-view imagery and generate labels that can be directly reformatted for instruction-style multimodal supervision. We used a seven-point Likert scale (1 to 7), which offers sufficient granularity while remaining practical for rapid image-based assessment [30,31]. Each item was phrased as a short, attribute-specific prompt so that raters focused on one perceptual aspect at a time, reducing construct confusion during fast visual judgment.
2.3.4. Survey Distribution and Quality Control for Training Corpus
Perceptual supervision was collected via offline and online surveys using a seven-point Likert scale (1 to 7). Offline, 320 questionnaires were administered (Figure 3). Online, the same instrument was used to label a 600-image pool for model development; participants rated at least five images, with skipping permitted. We issued 1500 online assignments and applied response-level screening (completion time, repetitive patterns, and missingness). After cleaning, 1150 valid questionnaires were retained, and image-level aggregation required at least five valid ratings per image.
For external validation, we distributed 3000 assignments, each containing 10 randomly sampled images from the citywide archive; respondents rated at least six images on the same 1 to 7 scale, and only numeric ratings were collected. Quality control included a minimum viewing time of 20 s per rated image, pattern-based filtering, and an internal consistency check using a repeated image item; submissions were discarded if repeated-item ratings differed by more than 2 points. After filtering, 2530 valid questionnaires were retained, with an average of approximately 7.0 rated images per questionnaire. We retained images with at least five valid ratings, resulting in 3500 images with crowdsourced consensus scores used as the external validation set.
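The screening and aggregation rules described above can be summarized in the following Python sketch; field names such as view_time_s and repeat_item_first are hypothetical, and pattern-based filtering of repetitive responses is omitted for brevity.

```python
# Sketch of response-level and image-level screening: >= 20 s viewing time per
# rated image, repeated-item difference <= 2 points, and >= 5 valid ratings per
# image before consensus aggregation.
from collections import defaultdict

def screen_submission(sub):
    if any(r["view_time_s"] < 20 for r in sub["ratings"]):
        return False                      # insufficient viewing time
    if abs(sub["repeat_item_first"] - sub["repeat_item_second"]) > 2:
        return False                      # failed internal consistency check
    return True

def aggregate_image_scores(valid_submissions, min_ratings=5):
    pool = defaultdict(list)
    for sub in valid_submissions:
        for r in sub["ratings"]:
            pool[r["image_id"]].append(r["score"])
    return {img: sum(s) / len(s) for img, s in pool.items() if len(s) >= min_ratings}
```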
2.4. Multimodal Model Architecture, Fine-Tuning, and Validation
Street-scene perception research has historically been dominated by CNN-style pipelines that regress or rank perceptual scores from street-view imagery, often evaluated on large-scale pairwise-judgment benchmarks developed for urban perception assessment [32]. However, recent baseline comparisons suggest that this CNN-centered paradigm may not always remain the strongest once visual foundation representations and multimodal reasoning capabilities are introduced into the pipeline [33]. In particular, evaluations on established urban-perception benchmarks report that foundation-model-based representations can match or, in some cases, exceed multiple CNN-type baselines across perceptual dimensions, which is consistent with improved robustness when street scenes are heterogeneous in morphology and context [34]. Related evidence from multimodal large language model frameworks points in a similar direction: model-based pairwise street-view judgments calibrated against crowd-sourced perception benchmarks often show closer alignment with human preference signals than traditional Siamese CNN baselines, implying potential advantages for cognitively demanding, subjective evaluation tasks that benefit from higher-level visual semantics and reasoning [23]. Beyond perception scoring, vision-language pretraining has also been reported to outperform conventional computer-vision baselines in transferable street-scene sensing [23,35], supporting the broader feasibility of a “pretrain then adapt” paradigm for urban visual analytics under cross-city migration.
This shift is closely associated with the rise of vision-language models (VLMs), which align images with natural language and thereby provide semantically rich representations of streetscapes. Such alignment can support inference not only about visible physical elements but also about perception-relevant meanings, enabling more flexible querying of urban scenes than task-specific regressors. Contrastive Language–Image Pre-training (CLIP) based frameworks, for example, report systematic associations between cues such as colonnaded facades, neon signage, and surface textures and crowd-sourced descriptors including historic, chaotic, and vibrant [33]. Prompt-driven systems such as SAGAI further operationalize this capability for planning-oriented audits by producing geocoded query outputs, while UrbanCLIP suggests that image-text co-embeddings may improve socioeconomic inference relative to image-only baselines, particularly in morphologically ambiguous environments [36,37].
At the same time, several constraints can limit reliable deployment across contexts and motivate more cautious model design. Cultural sensitivity is a recurring concern, because models pretrained on dominant datasets may not fully capture local meanings and may display uneven performance in migrant or informal districts, which can translate into cross-context generalization gaps [36]. Temporal myopia is another challenge, since many street-view pipelines treat scenes as static even though multiple data sources indicate that perceived vibrancy and perceived threat can vary between daytime and nighttime [37,38]. Opacity may also complicate adoption in policy settings where legitimacy depends on traceable logic rather than only predictive outputs [39]. These considerations have encouraged approaches that decompose holistic impressions into interpretable dimensions such as cleanliness, greenness, and enclosure [40], and that use multi-label or attention-oriented formulations to better represent co-occurrence and interaction among perceptual attributes, including settings where perceived safety conditions commercial vibrancy [39].
Multimodal large language models (MLLMs), including GPT-4V and CogVLM, extend the VLM paradigm by integrating visual understanding with language reasoning in a single instruction-following interface. Their practical value lies in supporting prompt-based inference across heterogeneous questions without requiring task-specific heads, which can be useful when researchers need to probe diverse perceptual constructs under varying urban contexts. Importantly, MLLMs can generate concise rationales grounded in visible evidence, for example, linking lower pedestrian comfort to narrow sidewalks, obstructed sightlines, or limited shade, which may improve auditability and planning interpretation [41,42]. In addition, role-conditioned prompts allow perspective-aware evaluation, such as considering elderly pedestrians or nighttime users, while interpretability tools such as attention maps and token-region alignment provide complementary mechanisms to inspect decision logic and reduce black-box concerns in applied settings [43,44,45,46].
Figure 4 summarizes the architecture and training workflow of the Multimodal Street Evaluation Framework (MSEF). The framework integrates street-view panoramas with textual annotations derived primarily from resident surveys, interview transcripts, and expert notes. These multimodal inputs are encoded into unified latent representations, aligned through cross-modal attention, and optimized to produce dual outputs, including scalar ratings for multiple street-related dimensions and natural-language rationales that support planning interpretation.
The framework consists of four stages: (i) feature extraction and encoding for street-view imagery and textual annotations, (ii) cross-modal alignment via attention, (iii) a task layer for rating prediction and rationale generation, and (iv) optimization and validation, including task losses, AdamW with cosine annealing, and robustness and agreement checks.
In the feature extraction and encoding stage, each street-view panorama is divided into image patches and encoded by a vision transformer backbone into a visual embedding sequence. We denote the resulting visual feature matrix as $V_{\mathrm{vis}} \in \mathbb{R}^{n \times d}$, where $n$ is the number of visual tokens and $d$ is the hidden dimension. In parallel, textual inputs are tokenized and encoded into a text feature matrix $T \in \mathbb{R}^{m \times d}$, where $m$ is the number of text tokens. These two modalities provide complementary signals, with imagery capturing observable physical cues and text carrying human-centric evaluation language and context.
The multimodal alignment module performs cross-modal fusion using an attention mechanism that conditions visual understanding on the semantics of the textual channel. Following the notation in Figure 4, we apply linear projections to construct query, key, and value matrices:

$$Q = V_{\mathrm{vis}} W_Q, \qquad K = T W_K, \qquad V = T W_V,$$

where $W_Q \in \mathbb{R}^{d \times d_k}$, $W_K \in \mathbb{R}^{d \times d_k}$, and $W_V \in \mathbb{R}^{d \times d_v}$ are trainable projection matrices, $d_k$ is the key/query dimension, and $d_v$ is the value dimension. Accordingly, $Q$ represents queries derived from visual tokens, while $K$ and $V$ represent keys and values derived from text tokens. Cross-modal attention is then computed as:

$$F = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,$$

where $d_k$ is the key dimension. The fused multimodal representation $F$ serves as the aligned feature vector for downstream tasks.
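A minimal single-head PyTorch sketch of this cross-modal attention is given below; the token counts, hidden sizes, and single-head formulation are illustrative simplifications rather than the backbone's actual configuration.

```python
# Minimal sketch of cross-modal attention: visual tokens provide the queries,
# text tokens provide the keys and values; the output is the fused representation F.
import math
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, d_model: int, d_k: int, d_v: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_k, bias=False)   # queries from visual tokens
        self.w_k = nn.Linear(d_model, d_k, bias=False)   # keys from text tokens
        self.w_v = nn.Linear(d_model, d_v, bias=False)   # values from text tokens

    def forward(self, v_vis: torch.Tensor, t_txt: torch.Tensor) -> torch.Tensor:
        q, k, v = self.w_q(v_vis), self.w_k(t_txt), self.w_v(t_txt)
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
        return torch.softmax(scores, dim=-1) @ v         # fused representation F

# Hypothetical shapes: n = 196 visual tokens, m = 64 text tokens, hidden size 768.
fused = CrossModalAttention(768, 96, 768)(torch.randn(1, 196, 768), torch.randn(1, 64, 768))
print(fused.shape)  # torch.Size([1, 196, 768])
```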
The task layer is built on a pretrained multimodal backbone and is formulated to produce dual outputs. First, the model predicts scalar ratings for multiple dimensions, including environmental qualities and commerce-related evaluations, consistent with the assessment protocol used in this study. Second, it generates concise natural-language rationales that justify the predicted ratings in planning-relevant terms. Training samples are organized in an instruction-style format, where each instance consists of an image, a structured question, and a target answer. This design encourages the model to learn not only numeric appraisal but also explanation generation grounded in visible cues.
To adapt the pretrained backbone to streetscape analytics while preserving general knowledge, we employ parameter-efficient fine-tuning. We use Low-Rank Adaptation (LoRA) to inject trainable low-rank updates into selected transformer layers while keeping the majority of pretrained weights frozen:

$$W = W_0 + \Delta W = W_0 + B A,$$

where $W_0$ is the frozen weight matrix and $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$ are trainable low-rank factors with rank $r \ll \min(d, k)$. In addition, we apply P-Tuning v2 by learning a sequence of continuous prompt embeddings $\{p_1, \ldots, p_m\}$ that are concatenated with the image-conditioned tokens before decoding:

$$X' = [\,p_1, \ldots, p_m;\; X\,],$$

where $X$ denotes the image-conditioned token sequence passed to the decoder, so that task preferences and evaluation heuristics are encoded in the prompt space with minimal parameter updates.
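The two adaptation routes can be expressed with the Hugging Face peft library as in the following sketch; the rank, alpha, prompt length, and GLM-style target module name are illustrative assumptions rather than our reported hyperparameters, and PrefixTuningConfig is used here as peft's deep-prompt analogue of P-Tuning v2.

```python
# Sketch of the two parameter-efficient configurations. Values are placeholders.
from peft import LoraConfig, PrefixTuningConfig, TaskType

# Objective physical attributes: low-rank updates inside attention projections.
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"],   # assumed GLM-style projection name
)

# Subjective perceptual judgments: deep continuous prompts, backbone frozen.
ptuning_cfg = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=32,
)

# Either adapter would then be attached with peft.get_peft_model(backbone, cfg)
# before supervised instruction tuning on the Q&A corpus.
```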
Optimization follows the loss design shown in Figure 4. For rating prediction tasks with continuous targets, we use a mean squared error objective:

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N K} \sum_{i=1}^{N} \sum_{k=1}^{K} \left(\hat{y}_{i,k} - y_{i,k}\right)^2,$$

where $N$ is the number of training instances and $K$ is the number of rating dimensions. For rating tasks treated as binary or multi-label outcomes, we use a binary cross-entropy objective:

$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N K} \sum_{i=1}^{N} \sum_{k=1}^{K} \left[\, y_{i,k} \log \hat{y}_{i,k} + \left(1 - y_{i,k}\right) \log \left(1 - \hat{y}_{i,k}\right) \right].$$

The overall training objective combines task losses and the language modeling loss for rationale generation, with weights selected to balance scalar accuracy and explanation quality. We optimize parameters using AdamW and adopt a cosine annealing learning rate schedule consistent with Figure 4:

$$\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\!\left(\frac{t}{T}\,\pi\right)\right),$$

where $\eta_t$ is the learning rate at step $t$ and $T$ is the total number of scheduled steps.
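The following PyTorch sketch illustrates how the task losses, the language modeling loss, AdamW, and the cosine annealing schedule fit together; the loss weights, learning rate, and step counts are placeholders rather than the tuned values used in this study, and a linear layer stands in for the trainable adapter parameters.

```python
# Minimal sketch of the composite objective and optimizer schedule.
import torch
import torch.nn as nn

mse_loss, bce_loss = nn.MSELoss(), nn.BCEWithLogitsLoss()

def total_loss(pred_scores, true_scores, pred_logits, true_labels, lm_loss,
               lambda_mse=1.0, lambda_bce=1.0, lambda_lm=0.5):
    """Combine rating losses with the language-modeling loss for rationales."""
    return (lambda_mse * mse_loss(pred_scores, true_scores)
            + lambda_bce * bce_loss(pred_logits, true_labels)
            + lambda_lm * lm_loss)

model = nn.Linear(16, 8)  # stand-in for the trainable adapter parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000, eta_min=1e-6)

for step in range(3):                                   # illustrative loop
    optimizer.zero_grad()
    out = model(torch.randn(4, 16))
    loss = total_loss(out, torch.rand(4, 8),            # scalar rating targets
                      out, torch.randint(0, 2, (4, 8)).float(),  # multi-label targets
                      lm_loss=torch.tensor(0.0))        # rationale loss placeholder
    loss.backward()
    optimizer.step()
    scheduler.step()
```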
Model validation includes both predictive performance and robustness checks. For tasks evaluated as classification or discretized levels, we report the F1 score:

$$F_1 = \frac{2 P R}{P + R},$$

where $P$ and $R$ denote precision and recall. In this study, discretized levels refer to the three interval classes (1–3, 3–5, 5–7) used in the interval-based agreement check. We construct a 3 × 3 confusion matrix between predicted interval labels and expert-audit (ground-truth) interval labels, compute class-wise precision and recall from the true positives, false positives, and false negatives of each class, and report macro-F1 as the unweighted mean of the class-wise F1 scores. To assess stability, we conduct perturbation validation by measuring the maximum deviation between predictions under controlled prompt or sampling perturbations:

$$\Delta_{\max} = \max_{i} \left| \hat{y}_{i}^{\mathrm{pert}} - \hat{y}_{i} \right|.$$
For subjective perception ratings, agreement with human aggregated ratings is examined using Bland–Altman analysis, where the paired difference is defined as $d_i = \hat{y}_i - y_i$ and the limits of agreement are computed as $\bar{d} \pm 1.96\, s_d$, with $\bar{d}$ the mean difference and $s_d$ its standard deviation. This combination of performance metrics, perturbation-based robustness testing, and agreement analysis ensures that the framework’s outputs are both quantitatively reliable and suitable for planning-oriented interpretation.
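The validation quantities defined above can be computed as in the following sketch, using synthetic ratings for illustration; the binning mirrors the interval-based agreement check.

```python
# Sketch of macro-F1 over interval classes, perturbation deviation, and
# Bland-Altman limits of agreement, with synthetic predictions and targets.
import numpy as np
from sklearn.metrics import f1_score

def to_interval(scores):
    """Map 1-7 ratings to interval classes: 0 for <3, 1 for [3,5), 2 for >=5."""
    return np.digitize(scores, bins=[3, 5])

pred = np.array([2.5, 4.0, 6.2, 5.5, 3.1])
truth = np.array([2.0, 4.5, 6.8, 4.9, 2.8])
macro_f1 = f1_score(to_interval(truth), to_interval(pred), average="macro")

# Perturbation robustness: maximum deviation between original and perturbed predictions.
pred_perturbed = np.array([2.7, 3.8, 6.0, 5.9, 3.0])
delta_max = np.max(np.abs(pred_perturbed - pred))

# Bland-Altman limits of agreement between model and aggregated human ratings.
diff = pred - truth
loa = (diff.mean() - 1.96 * diff.std(ddof=1), diff.mean() + 1.96 * diff.std(ddof=1))
print(round(macro_f1, 3), round(float(delta_max), 2), [round(x, 2) for x in loa])
```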
2.5. Standardizing Questionnaire Items into a VLM-Consumable Q&A Schema
Vision language model training is sensitive to the syntactic structure of instructions, and prompt driven urban analytics systems typically rely on stable query templates to ensure comparable outputs across images and sites. Consistent with this line of work in prompt based VLM and MLLM based urban assessment, we converted the original survey instrument into a fixed question answer schema before fine tuning, so that each training instance followed the same parsing pattern while preserving the meaning of the original questionnaire content. In practice, GPT-4 was used only as a formatting normalizer to rewrite each questionnaire item into a structured Q&A pair, without altering the semantic content of the item or residents’ expressed judgments.
Specifically, each street view image was associated with multiple attributes from the survey. For each attribute, we formed a question string that mirrors the original item and explicitly retains its response scale (for example, “How safe is this street? 1 = not safe, 7 = very safe”). We then formed an answer string that includes the aggregated numeric rating together with a short rationale derived from the original responses. Importantly, the GPT-4 standardization step was constrained to preserve content: it does not add new claims, reinterpret the intent of the item, or modify what residents expressed. Its sole function is to enforce a uniform input layout (Image, Question, Answer) so that the downstream VLM does not parse the question wording inconsistently across samples. To make this step fully transparent, we provide an explicit before/after example in Figure 5, showing one original questionnaire item and its standardized Q&A version. The example demonstrates that the transformation is structural rather than semantic.
All standardized triplets were stored in JSON format with unique image identifiers, the fixed question template, and the consolidated answer. Concretely, each supervision instance is stored as a schema-constrained JSON record (JSONL, one record per line) consisting of (i) an image identifier, (ii) a fixed question template for the target attribute, and (iii) a consolidated answer that aggregates the human supervision signal. The question template is held constant within each attribute during training and evaluation to avoid instruction drift; across records, we vary only the image_id and the aggregated answer content (score and rationale), which is derived from the corresponding respondents’ responses for that image. A representative JSONL example is shown below.
{
  "image_id": "IMG_000123",
  "respondent_id": "R_0127",
  "module": "objective",
  "attribute": "sidewalk_width",
  "question": "Please rate the sidewalk width on a 1–7 scale (1 = almost none/very narrow, 7 = very wide and dominant). Provide visible evidence.",
  "answer": {"score": 5, "evidence": "A continuous sidewalk is present, but it does not dominate the frame."},
  "score": 5,
  "evidence": "A continuous sidewalk is present, but it does not dominate the frame.",
  "scale": {"min": 1, "max": 7, "anchor_1": "almost none/very narrow", "anchor_7": "very wide/dominant"},
  "source": "expert_audit",
  "split": "train",
  "schema_version": "v1",
  "language": "en"
}
{
  "image_id": "IMG_000123",
  "respondent_id": "R_4581",
  "module": "subjective",
  "attribute": "perceived_safety",
  "question": "Please rate perceived safety on a 1–7 scale (1 = very unsafe, 7 = very safe). Provide visible evidence only.",
  "answer": {"score": 4, "evidence": "Moderate traffic and limited pedestrian separation suggest mixed perceived safety."},
  "score": 4,
  "evidence": "Moderate traffic and limited pedestrian separation suggest mixed perceived safety.",
  "scale": {"min": 1, "max": 7, "anchor_1": "very unsafe", "anchor_7": "very safe"},
  "source": "crowd_consensus",
  "split": "external_val",
  "schema_version": "v1",
  "language": "en"
}
We applied automated quality control checks to ensure structural validity and content integrity, including verifying that required numeric fields were present, screening and removing personally identifiable information, and flagging outputs that were empty, off topic, or internally inconsistent with the provided score. The resulting dataset was ingested by VisualGLM-6B directly for supervised instruction tuning using the fixed Q&A schema described above.
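A simplified version of these structural checks is sketched below; the required-field list follows the schema shown above, while the personally identifiable information screen is reduced to a single placeholder pattern and the score–rationale consistency check is left to human review.

```python
# Sketch of structural checks applied to each JSONL supervision record.
import json
import re

REQUIRED_FIELDS = {"image_id", "module", "attribute", "question", "answer", "scale"}
PII_PATTERN = re.compile(r"\b\d{11}\b")        # e.g., bare 11-digit phone numbers

def validate_record(line: str):
    rec = json.loads(line)
    issues = []
    if not REQUIRED_FIELDS.issubset(rec):
        issues.append("missing required fields")
    score = rec.get("answer", {}).get("score")
    if not isinstance(score, (int, float)) or not 1 <= score <= 7:
        issues.append("score missing or out of 1-7 range")
    if not rec.get("answer", {}).get("evidence", "").strip():
        issues.append("empty rationale")
    if PII_PATTERN.search(json.dumps(rec)):
        issues.append("possible personally identifiable information")
    return issues
```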
2.6. Parameter-Efficient Fine-Tuning and Training Workflow
2.6.1. Parameter-Efficient Fine-Tuning Strategy
Recent streetscape perception studies have typically relied on convolutional neural networks that map images to scalar scores or pairwise rankings, which is effective for single-task prediction but limited in cross-attribute generalization and in producing interpretable rationales [24,30]. In contrast, modern vision-language models offer a unified instruction interface, allowing the same backbone to answer heterogeneous urban questions under a consistent supervision schema and to produce language explanations that are easier to audit. Motivated by this paradigm shift, we fine-tuned VisualGLM-6B using parameter-efficient methods rather than full fine-tuning, so that adaptation remains data-efficient and iteration-friendly.
We adopted a staged adaptation strategy in which objective physical attributes were optimized with LoRA, while subjective perceptual judgments were adapted with P-Tuning v2. This division is intentional. Objective attributes such as sidewalk width, roadway width, greening level, and sky openness are visually salient and comparatively stable targets. For these tasks, LoRA provides a direct way to adapt internal projections with a small number of trainable parameters, improving visual-text alignment for measurable cues. In contrast, subjective attributes such as perceived safety, cleanliness, and satisfaction depend on nuanced contextual interpretation and are more sensitive to instruction phrasing and distribution shift. For these tasks, P-Tuning v2 enables rapid and low-risk behavioral adjustment by learning deep continuous prompts while keeping the backbone weights frozen, which supports frequent iterative refinement without destabilizing the model.
Importantly, objective prediction is not treated as an end in itself. The objective module serves as an anchoring layer for the subjective module, supporting both alignment and verification. During subjective fine-tuning and deployment, objective outputs are used as auxiliary constraints and sanity checks. When subjective rationales contradict objective cues that are strongly visible in the image, the instance is flagged for manual review and potential correction in the supervision corpus.
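This anchoring logic can be illustrated with a simple rule-based sketch; the keyword phrases and "strong visibility" thresholds here are assumptions for exposition, not the rules used by the expert reviewers.

```python
# Illustrative sanity check: flag an instance for manual review when a subjective
# rationale asserts a cue that contradicts a strongly visible objective rating.
CONTRADICTION_RULES = [
    # (objective attribute, condition on its 1-7 rating, contradicting phrase)
    ("greenery_level", lambda v: v >= 6, "no greenery"),
    ("sidewalk_width", lambda v: v <= 2, "wide sidewalk"),
    ("motorization_degree", lambda v: v <= 2, "heavy traffic"),
]

def flag_for_review(objective_scores: dict, subjective_rationale: str) -> list:
    text = subjective_rationale.lower()
    flags = []
    for attribute, is_strong, contradicting_phrase in CONTRADICTION_RULES:
        value = objective_scores.get(attribute)
        if value is not None and is_strong(value) and contradicting_phrase in text:
            flags.append(attribute)
    return flags

# Example: high coded greenery but a rationale claiming the opposite triggers review.
print(flag_for_review({"greenery_level": 7}, "Feels unsafe because there is no greenery or lighting."))
```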
2.6.2. Training Split, Protocol, and Evaluation
Fine-tuning followed the image–question–answer triplet format described above. Model development was conducted on a 600-image streetscape pool with crowdsourced consensus labels, which was randomly split into training, validation, and test sets (480/60/60). In addition to this internal split, we used an independent external validation set of 3500 images with crowdsourced consensus perception scores to evaluate out-of-sample generalization. This setup constitutes a small-sample learning regime because high-quality perception supervision is costly to elicit and aggregate, and the number of distinct labeled images for fine-tuning remains limited relative to model capacity; therefore, we adopt parameter-efficient adaptation to improve data efficiency and mitigate overfitting while enabling rapid iteration. For objective tasks, the model was trained to output a bounded score on a fixed scale with a short evidence statement grounded in visible cues. For subjective tasks, the model was trained to output the aggregated resident perception score, accompanied by a concise rationale that reflects recurring themes from interviews and expert notes while remaining faithful to the original meaning. We report objective performance using classification or ordinal metrics consistent with the coded labels, and subjective performance using agreement with aggregated resident perceptions. Hyperparameters and adapter configurations are reported to support reproducibility, including LoRA rank and target modules, as well as the number and depth of prompt tokens in P-Tuning v2.
2.6.3. Hallucination Control and Human-in-the-Loop Revision Loop
At the same time, large vision-language models are known to suffer from hallucination, generating fluent but visually unsupported content, and this failure mode has been systematically documented across representative LVLM families [47,48,49]. Importantly, hallucination risk is not only model-dependent but also prompt-dependent, with instruction wording and visual-instruction priors affecting what is spuriously generated [50]. Syntheses of the LVLM hallucination literature further emphasize that practical mitigation is typically achieved by combining structured prompting or constrained output formats with data-centric quality control, including auditing, filtering, and iterative refinement of supervision corpora [49]. This evidence motivates our design choices in two ways. First, questionnaire standardization into a fixed Q&A schema is used to reduce prompt-structure variance and stabilize model parsing, which helps prevent uncontrolled reinterpretation of question intent. Second, a human-in-the-loop revision loop is adopted to identify visually unsupported or inconsistent supervision instances, revise the corpus at the data layer, and re-train on corrected samples, aligning with mitigation strategies advocated in the LVLM hallucination literature [50].
To reduce this risk, we enforced a structured output format and adopted a data-centric correction loop aligned with the workflow in Figure 6. First, we sampled a small subset of supervision pairs and conducted a pilot fine-tuning run. Second, the model generated image-grounded answers for the same subset under the fixed templates. Third, human reviewers audited outputs for visual grounding and semantic faithfulness. When deviations were identified, experts performed a supervision-target correction at the data layer, editing both the numeric score and the accompanying rationale to match visible evidence and the intended aggregated human judgments captured in the survey; the corrected score–rationale targets replaced the original supervision pairs and were reinserted into the training corpus for the next fine-tuning iteration. Importantly, these edits were applied only to the supervision pool used for fine-tuning; held-out evaluation sets and the independent external verification set remained fixed and were never used to rewrite training targets.
This iterative loop was repeated until the error patterns stabilized, after which the refined supervision set was used for full-scale fine-tuning and deployment. Throughout training, we also maintained a reserve buffer of alternative question–answer formulations for each image to expose the model to controlled linguistic variation without changing the underlying target signal. By combining structured prompting, community-level holdout evaluation, and repeated human-in-the-loop correction at the corpus level, we aimed to improve both robustness and interpretability while limiting hallucinated details.
Operationally, we adopted a gated, data-centric correction workflow aligned with Figure 6. We first ran pilot fine-tuning on a training-only subset under fixed Q&A templates and audited the model’s generated answers on the same subset. Human reviewers flagged three classes of failures: (i) visually ungrounded statements (hallucinated objects, facilities, or events not supported by the image), (ii) attribute confusion (answering a correlated but different dimension under similar prompts), and (iii) score–text inconsistency (numeric scores conflicting with the accompanying rationale). Flagged cases were traced back to their supervision pairs and corrected through corpus-level revision, including tightening question wording, removing ambiguous rationales, and fixing inconsistent score–text pairs. The corrected samples were then reintroduced into the training corpus for the next iteration.
This loop continued until the held-out evaluation met an explicit acceptance gate. In Phase 2, we conducted full-scale fine-tuning and assessed performance on validation and test sets under the community-level holdout design. We proceeded to Phase 3 only when the model satisfied the predefined criteria on the held-out test set, namely Accuracy@1 (within-one-point on the 7-point Likert scale) ≥ 0.85, together with a low rate of auditor-flagged ungrounded outputs (≤5%). If the gate was not satisfied, we returned to corpus revision and repeated the fine-tuning cycle. After the model met the acceptance gate on the held-out test set, we conducted Phase 3 independent external human verification using a separate post hoc dataset, which is described in the next section.
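The acceptance gate itself reduces to two scalar checks, as in the following sketch with synthetic predictions; the thresholds (0.85 within-one-point accuracy and a 5% ungrounded-output rate) follow the criteria stated above.

```python
# Sketch of the Phase-2 acceptance gate: within-one-point accuracy on the 7-point
# scale and the auditor-flagged ungrounded-output rate.
import numpy as np

def passes_gate(pred, truth, ungrounded_flags, acc_threshold=0.85, flag_threshold=0.05):
    within_one = np.mean(np.abs(np.asarray(pred) - np.asarray(truth)) <= 1)   # Accuracy@1
    flag_rate = np.mean(ungrounded_flags)                                     # share of flagged outputs
    return within_one >= acc_threshold and flag_rate <= flag_threshold

print(passes_gate([5, 4, 6, 2], [5, 5, 7, 2], [0, 0, 0, 0]))  # True in this toy case
```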