Article

Visually Sustainable but Spatially Broken? A Two-Level Assessment of How Generative AI Encodes Sustainable Urban Design Principles

Department of Urban Planning, Gachon University, Seongnam-daero 1342, Sujeong-gu, Seongnam-si 13120, Gyeonggi-do, Republic of Korea
Sustainability 2026, 18(6), 2943; https://doi.org/10.3390/su18062943
Submission received: 18 February 2026 / Revised: 13 March 2026 / Accepted: 16 March 2026 / Published: 17 March 2026

Abstract

Generative AI enables rapid visualization of sustainable urban design scenarios, yet the question of whether these outputs encode sustainability as operable spatial logic, rather than merely depicting it as a visual impression, remains underexplored. This study proposes a two-level assessment framework that scores the same sustainability dimensions at both the visual-representation level and the spatial-logic level, treating the systematic decoupling between the two as a form of visual greenwashing: system-induced representational distortion rather than deliberate misrepresentation. Using AI-workflow reports from two site-based urban design studios (47 students, 12 teams, 36 coded scenes), the framework integrates rubric-based scoring with qualitative process tracing of breakdown–repair logs. Results show that image-level scores consistently outperform logic-level scores across all five dimensions, with the gap most severe in mobility hierarchy and walkability and smallest in green/blue infrastructure. Case analysis reveals that breakdowns arise from failures in program encoding, urban-scale coherence, functional-boundary demarcation, and relational-condition matching, and that students deploy multi-stage repair pipelines, including prompt restructuring, tool switching, reference injection, and external-source compositing, to re-inject collapsed spatial logic. These findings reframe AI-assisted urban design as repair-centered workmanship rather than automated production. The study proposes three guardrails to prevent visual sustainability from substituting for spatial-logic sustainability: image–logic paired submission, design audit trail formalization, and gap-based red-flag review.

1. Introduction

The rapid adoption of generative AI in urban design practice and education is changing how designers explore, communicate, and evaluate early-stage options. Using text prompts and reference images, practitioners and students can quickly visualize waterfront parks, pedestrian-oriented streets, tram corridors, and public plazas, accelerating early-stage ideation and stakeholder communication. Contemporary text-to-image diffusion models achieve this by sampling from large-scale learned distributions to produce visually coherent scenes without explicitly computing spatial relationships [1]. This technological leap, however, simultaneously exposes a new vulnerability: the question of whether a generated space can actually work as a functioning urban environment [2,3,4].
Sustainability in urban design does not arise from the mere presence of individual elements—trees, pedestrians, bicycles, or trams. Rather, it depends on relational constraints: the continuity of pedestrian networks, the legibility of public-space entries and edges, the adjacency and hierarchy among programs, and the operational structure of transit nodes. Only when these relational conditions are satisfied does sustainability become operative rather than merely decorative [5,6,7,8]. In image-generation-centered workflows, however, these relational structures are not directly computed or verified; instead, they are implicitly represented through the reproduction of visual patterns. Generative AI readily reproduces visual cues associated with sustainability (e.g., greenery, cycling, transit icons), but it often fails to preserve the relational conditions required for those cues to be operational. This dynamic widens the gap between how sustainable a scene looks and how sustainably it would perform.
This concern connects to a well-established line of inquiry in communication and environmental studies: greenwashing, the strategic practice of projecting an environmentally responsible image without delivering commensurate performance [9,10]. Research shows that visual cues alone—nature imagery, verdant palettes, “clean” aesthetics—can distort sustainability judgments independently of substantive evidence. The high visual persuasiveness of AI-generated urban imagery amplifies this risk: compelling renderings can mask structural deficiencies—severed pedestrian networks, inoperable transit configurations, ambiguous public-space boundaries—beneath a veneer of environmental appeal. When novices and non-expert audiences treat such visual plausibility as evidence of performance, the result is what this study terms visual greenwashing.
Despite growing recognition of this tension, relatively few studies have systematically separated the degree to which AI outputs look sustainable (visual representation) from the degree to which they function sustainably (spatial logic). Prior work emphasizes tool efficiency and ideation capability, while empirical benchmarking of AI outputs against the relational requirements of urban design remains limited, and existing research has largely treated AI outputs as finished images rather than as processual artifacts, leaving the mechanism-level explanation of the visual–logic gap underdeveloped [3,4,11]. In actual studio settings, however, students rarely rely on single-shot generation; rather, they iteratively refine outputs through prompt revision, reference-image injection, tool switching, localized editing, and compositing. The workflow logs produced during this process constitute a rare form of empirical evidence: a design audit trail that can reveal under what conditions spatial logic breaks down and through what interventions it is repaired.
Accordingly, this study proposes a two-level assessment framework that evaluates the same sustainability dimensions at both the image level (visual representation) and the spatial-logic level (relational coherence). Whereas prior evaluations of AI-generated urban imagery have relied on holistic impression-based ratings or single-level scoring [4], the present framework scores each sustainability dimension at two analytically distinct levels and computes the discrepancy (Δ) between them, transforming the familiar observation that ‘AI images look good but do not work spatially’ into a measurable diagnostic structure. The framework also treats the generation–revision process documented in student AI-workflow reports as a design audit trail, thereby providing processual evidence for why and where spatial logic breaks down. These two contributions—dimension-level Δ measurement and audit-trail process tracing—together advance the study of AI in urban design from tool evaluation toward a diagnostic methodology for relational-logic verification. Beyond image quality, the framework tracks when the visual–logic gap emerges in the workflow and how it is addressed, and uses these patterns to propose practical guardrails for education and practice.
Specifically, the study asks: (1) Along which sustainability dimensions does the gap between visual representation and spatial logic become most pronounced? (2) What types of breakdown—program-encoding failure, contextual displacement, goal-conflict-driven boundary collapse, and relational-condition mismatch—recur in student workflows? (3) What repair strategies (prompt restructuring, tool switching, reference injection, partial editing, and external-source compositing) do students deploy to re-inject spatial logic into visually plausible but structurally deficient outputs?
The remainder of this paper is organized as follows. Section 2 reviews the literature on the shift in design rationality brought about by generative AI, the risk of visual greenwashing in built-environment representation, the plausibility gap between visual coherence and spatial intelligibility, and the sustainable urban design principles that ground the evaluation dimensions. Section 3 presents the two-level assessment framework, including data collection, coding procedures, reliability evaluation, and the analytic strategy that integrates quantitative scoring with qualitative process tracing. Section 4 reports the results: dimension-level visual–logic gaps across sustainability criteria and tool types, followed by representative breakdown–repair case studies. Section 5 discusses the findings in relation to prior research and proposes educational and practical guardrails. Section 6 concludes with the study’s contributions, limitations, and directions for future research.

2. Literature Review

2.1. Generative AI and the Shift in Design Rationality: From Rule-Based Generation to Latent-Space Synthesis

The rapid diffusion of generative AI in architecture and urban design signals more than a productivity upgrade; it marks a shift in how design rationality itself is operationalized. In rule- and constraint-driven paradigms, typical of parametric modelling and performance-oriented computational design, design intent is externalized as explicit relationships: constraints, hierarchies, dependencies, and measurable objectives. Contemporary text-to-image diffusion models operate on a fundamentally different principle. They generate outputs by sampling from learned statistical regularities in large-scale image–text distributions, producing visually coherent scenes without necessarily preserving the explicit causal structures on which designers ordinarily rely [1].
This shift moves “knowledge” from explicit rules to latent representations, making iteration closer to prompt-guided search than constraint satisfaction. In architectural discourse, this transition has been theorized as part of a broader “second digital turn,” in which computation increasingly operates through statistical inference and pattern synthesis rather than rule formalization [12]. Within the specific context of AI image generators, architectural experimentation increasingly takes the form of curation and orchestration: selecting references, steering outputs, and navigating a search space whose inner logic is not directly inspectable [2,3,13,14].
At the urban design scale, empirical work remains comparatively limited, yet a growing body of studies already illustrates both the promise and the constraints of generative models for urban-structural tasks. Conditional diffusion has been applied to generate urban road networks from contextual inputs [15], while diffusion-based pipelines have been explored for the automated synthesis and evaluation of urban spatial structures [16]. Parallel work using GAN-based approaches demonstrates that generative models can reproduce certain morphological patterns and support renewal-oriented form exploration, yet still require external metrics and human judgment to ensure contextual fit and planning feasibility [17]. These studies share a recurring observation: generative outputs can approximate plausible spatial configurations, but the relational conditions that make those configurations functional remain difficult for current models to guarantee.
For urban design practice, the immediate payoff is clear: generative imagery accelerates early-stage ideation, narrative visualization, and stakeholder communication at a speed unattainable through conventional modelling or rendering. Yet the same mechanism produces a predictable limitation. Because diffusion models optimize for visual plausibility rather than for explicit spatial constraints, they are structurally inclined to produce persuasive scenes even when the underlying relational conditions remain unsatisfied. This is the core tension that motivates the present study: generative AI can produce “convincing sustainability aesthetics” while remaining unreliable at encoding the “sustainability logics” that urban design actually requires.

2.2. Visual Greenwashing and the Risk of “Looking Sustainable”

Greenwashing is broadly defined as a strategic practice through which organizations project an environmentally responsible image without delivering commensurate environmental performance [9,10,18]. Beyond explicit verbal claims, greenwashing also operates executionally, through visual cues—nature imagery, verdant palettes, and ‘clean’ aesthetics—that trigger sustainability inferences independently of substantive evidence [19]. This distinction between claim-based and executional greenwashing is critical for the present study, because generative AI urban images rarely make explicit environmental claims; rather, they achieve their persuasive effect precisely through the accumulation of visual motifs that audiences associate with sustainability.
The greenwashing literature has traditionally assumed an intentional agent that strategically deploys environmental imagery. Even “executional greenwashing,” defined as the use of nature-evoking visual elements to enhance a brand’s ecological image without explicit verbal claims [20], presupposes a communicator who selects those elements for persuasive effect. In AI-generated urban imagery, however, the visual–logic gap examined here arises not primarily from deliberate deception by the designer but from system-induced representational distortion in image synthesis. Diffusion models reproduce sustainability-associated motifs—greenery, cyclists, waterfronts—not to deceive but because those motifs are statistically prevalent in training distributions and contribute to distributional plausibility [1]. The visual–logic gap documented in this study is therefore better understood as system-induced representational distortion: the model’s optimization for visual coherence systematically produces sustainability symbolism while leaving relational spatial structures largely unencoded. This structural, non-intentional character distinguishes the present study’s concept of visual greenwashing from both claim greenwashing (strategic verbal misrepresentation) and executional greenwashing (deliberate selection of nature-evoking cues).
This literature translates directly to the built environment, where representational artifacts—renderings, diagrams, competition boards, promotional visualizations—have long served to construct public legitimacy for “green” projects. Empirical work demonstrates that greenery in promotional images is not a neutral design depiction but an active representational device that shapes perception. Studies of residential development visualizations, for instance, show how abundant vegetation and “green ambience” can be foregrounded to create an environmental impression that exceeds what is delivered on the ground [21]. Related research on housing and development marketing documents how “green space” and sustainability language are mobilized as value-adding narratives, drifting toward greenwashing when the promised benefits are selective, weakly evidenced, or difficult to verify [22,23].
A complementary strand of built-environment scholarship scrutinizes how “green” certification regimes themselves can function as market devices that normalize a sustainability image, potentially outpacing verifiable performance. Empirical work on LEED certification, for example, reveals contextual limits: certification effectiveness and meaning vary by regional conditions, and the label alone does not guarantee environmental outcomes [24]. Critical urban research further argues that certification-led sustainability can be unevenly distributed, producing socio-spatial inequities in access to ‘green’ environments [25]. Building and environmental psychology research reinforces this concern by showing that “green” labels systematically bias occupant perception and evaluation, such that sustainability impressions can exceed measured physical performance [26,27]. These findings indicate that the gap between sustainability image and sustainability substance is not unique to AI-generated imagery; it is a structural feature of how green credentials circulate in the built environment. Generative AI, however, dramatically amplifies both the scale and the visual fluency of this gap.
Generative AI intensifies the greenwashing risk by simultaneously lowering the cost and raising the polish of sustainability symbolism. Because text-to-image models have learned strong statistical associations between sustainability-related terms and particular visual motifs—trees, waterfronts, pedestrians, cycling infrastructure—‘green-looking’ urban scenes can be produced in high volume and fidelity with minimal effort. These motifs are relevant to sustainability, but they become problematic when they function as performative substitutes for evidence. In other words, AI can make sustainability “visible” far faster than it can make sustainability “true.” In urban design settings, where audiences may include non-experts, students, or stakeholders lacking the technical capacity to interrogate underlying networks and constraints, this dynamic opens a pathway toward what this study terms visual greenwashing. The aesthetic signature of sustainability becomes detached from the operational conditions that sustainability demands, not through deliberate misrepresentation but through the structural properties of latent-space synthesis. Executional greenwashing research has shown that nature-evoking visual cues mislead non-expert audiences while leaving expert consumers relatively unaffected [20]. This pattern is directly relevant here, since diagnosing spatial-logic failures requires domain-specific expertise that cannot be assumed outside a studio or professional review setting.

2.3. The Plausibility Gap: Why Urban Design Is Especially Vulnerable

Recent scholarship at the intersection of AI and architecture has conceptualized a recurring discrepancy between visual plausibility and spatial or tectonic intelligibility as a plausibility gap: generated images can appear coherent while violating feasibility constraints, relational logic, or the operational structure implied by the depicted scene [3]. Comparative evaluations of AI image tools in urban design contexts similarly report that outputs may appear compelling while undermining “sensibleness” at the urban-structural level, where legibility depends on network continuity, functional hierarchy, and meaningful spatial relationships [4].
This gap is not merely anecdotal; it is consistent with broader technical findings indicating that text-to-image systems struggle systematically with compositional constraints and spatial relations. Maintaining correct topology, consistent relative positioning, or coherent multi-part structure under complex prompts remains among the hardest requirements for diffusion-based models to satisfy reliably. Recent computer-vision benchmarks and research explicitly identify “spatial consistency” as a known and persistent weakness targeted for improvement [28]. The difficulty is not incidental but architectural: diffusion models learn to denoise toward plausible appearances, not toward valid configurations, so relational accuracy is sacrificed whenever it conflicts with distributional plausibility.
Urban design is especially plausibility-gap-prone for a structural reason. Compared with architectural object-making, urban design is more heavily constrained by external systems: street networks, circulation hierarchies, access logic, frontage conditions, and programmatic adjacencies that must connect to an existing urban fabric. Walkability and urban design research has long emphasized that urban environmental quality depends on relational configuration, such as connectivity, continuity, and the coupling of streets, destinations, and transit access, rather than on isolated objects [5,6,8,29]. When these relational conditions are mapped onto the known limitations of diffusion models, a clear vulnerability emerges: urban design images can “pass” visually even when they fail structurally. A street can look lively while being disconnected from the surrounding network; a square can appear vibrant while remaining inaccessible; transit can be depicted as an icon without possessing an operable node structure.
It is precisely this structural vulnerability that makes separating representation from spatial logic analytically necessary. The three strands reviewed above—(i) latent-space synthesis as a new mode of design rationality, (ii) the AI-amplified risk of visual greenwashing in built-environment representation, and (iii) the plausibility gap rooted in the relational constraints that define urban design—collectively motivate this study’s two-level assessment framework, designed to empirically distinguish “sustainability that looks convincing” from “sustainability that works as spatial logic.”

2.4. Sustainable Urban Design Principles and the Rationale for the Five Evaluation Dimensions

Having established why urban design is particularly vulnerable to the plausibility gap in generative outputs, this section provides the theoretical grounding for the specific evaluation dimensions employed in the study’s assessment framework. Sustainability-oriented urban design and planning studies span multiple intellectual traditions and terminologies, including “sustainable urbanism,” “sustainable urban form,” and “planning for sustainability.” Yet they converge on a relatively stable set of spatial principles that can be translated into measurable dimensions when the empirical material permits [30,31,32,33].
Four widely cited frameworks illustrate this convergence. Farr’s [30] Sustainable Urbanism links environmental performance to the everyday spatial logic of neighborhoods, such as compactness, walkability, and a “complete” public realm, while explicitly integrating green infrastructure as a functional system rather than decorative amenity. Jabareen’s [31] sustainable urban form typology synthesizes recurring design concepts across typological families (compact city, eco-city, etc.): compactness, sustainable transport, density, mixed land uses, diversity, and greening. These are identified as the shared morphological conditions for sustainability. Wheeler’s [32] Planning for Sustainability positions physical form and mobility systems as central levers through which sustainability goals are realized, within a multi-scalar framework integrating land use, transportation, urban design, environmental restoration, and social equity. Finally, UN-Habitat’s [33] “Five Principles of Sustainable Neighbourhood Planning” codifies a practitioner-oriented articulation emphasizing adequate street space, efficient street networks, high density, mixed land use, social mix, and limited land-use specialization, all directed toward “compact, integrated, and connected” neighborhoods.
Despite their differing emphases and scales of application, these frameworks share a common analytical orientation: sustainability is not a property of isolated “green-looking” objects but of relational configurations—how streets, destinations, public realms, ecological systems, and mobility options couple into a coherent and connected urban structure [30,31,32,33]. This convergence provides the theoretical basis for the five evaluation dimensions adopted in this study, each of which operationalizes a principle that recurs across the frameworks above:
(1) Walkability operationalizes the recurring focus on connected street networks, proximity, and mixed-use access, the most consistent spatial precondition for sustainability across all four frameworks.
(2) Public space captures the centrality of a legible, accessible public realm as an organizing armature of sustainable neighborhoods, including entry conditions, edge definition, and activity distribution.
(3) Green/blue infrastructure translates the “greening” and ecological-integration principles into system-level cues—continuity, functional placement, and the relationship of ecological elements to movement corridors and public space—distinguishing it from mere decorative vegetation.
(4) Human-scale streetscape/urban design qualities reflect the emphasis on street-level experience, edge/frontage conditions, and pedestrian-oriented spatial proportion that walkability and urban design quality research has consistently identified as essential [29].
(5) Mobility hierarchy/multimodality captures the sustainability planning emphasis on prioritizing active modes and transit integration within a coherent movement system, a dimension that requires not merely the presence of transit icons but the operational structure of stops, platforms, and interchange nodes.
The selection of these five dimensions is also a methodological decision shaped by the empirical material analyzed in this study: generated images and accompanying workflow logs. Many sustainability principles that matter substantively (governance capacity, financing mechanisms, detailed carbon and energy accounting, construction feasibility, or distributive equity outcomes) cannot be reliably observed in static images and short process traces; they require fundamentally different data, such as plans with quantitative performance metrics, policy documents, stakeholder-process records, or post-occupancy evidence. The five dimensions adopted here, therefore, represent a pragmatic operationalization of widely shared sustainable urban design principles into indicators that are (i) visible or inferable in the image/log corpus and (ii) likely to yield consistent coding across evaluators. In this sense, the framework does not claim to exhaust the meaning of “sustainability”; rather, it provides a theoretically grounded and empirically tractable subset for evaluating what generative systems can plausibly encode—and what they merely depict—when asked to produce “sustainable” urban design.

3. Research Methods: A Two-Level Matrix Framework

3.1. Research Design and Data

This study employs a mixed-methods evaluation design in which student AI-workflow reports produced in urban design studio courses serve as the primary data source. The analytic corpus consists of AI workflow reports submitted in two site-based urban design studios conducted in 2024 and 2025. A total of 47 students organized into 12 teams participated across the two offerings (2024: 31 students, 8 teams; 2025: 16 students, 4 teams). All personally identifiable information has been removed, and cases are presented in anonymized form. The unit of quantitative coding is the representative generated scene selected by each team; with an average of three scenes per team, a total of 36 scenes were evaluated. The images analyzed were produced by students using commercially available generative AI tools, including Midjourney, LookX, and Adobe Photoshop Generative Fill.
The AI workflow reports are not simple portfolios of finished images. They function as design audit trails, documenting step by step which AI tool was chosen and why, how prompts were constructed, what problems arose during generation, and how outputs were subsequently revised or composited. Typical workflow entries describe prompt iteration sequences, reference-image injection, tool switching, and post-generation editing (e.g., object deletion or compositing via generative fill tools). This processual documentation is central to the study’s analytic strategy, as it provides the empirical basis not only for scoring visual and spatial-logic quality but also for tracing how and why the gap between the two levels emerges in practice.
The study integrates quantitative assessment and qualitative interpretation in a sequential design. In the first phase, each generated scene is scored on a 0–2 rubric that separately evaluates the extent to which sustainable urban design principles are realized at the visual-representation level and at the spatial-logic level; the discrepancy between the two is summarized as Δ(Image − Logic) to capture dimension-level patterns (Section 3.2). In the second phase, qualitative process tracing is performed on the prompt, tool-switching, and editing/compositing logs recorded in the reports to explain the causes of the gaps identified in the first phase (Section 3.3). This sequential integration is necessary because quantitative scores alone cannot reveal the mechanism behind a recurring observation: why visually plausible images fail to translate into operable spatial structures. The students’ workflow documentation—their design audit trails—provides the processual evidence needed to address that question.
The coding unit is defined as an individual generated scene (a single image or a variant cluster depicting the same scene) as it appears in the report. Each scene is treated as a bundled record comprising (i) the image itself, (ii) the associated prompt and design-intent narrative, and (iii) any documented failure and revision history. This bundling is essential because the same visually plausible image may warrant a different spatial-logic interpretation depending on the prompt intent and revision process behind it.
Finally, the quantitative sample of 36 scenes is not intended to support statistical generalization; rather, the study pursues analytic generalization [34], identifying recurrent patterns and mechanisms within a bounded but information-rich corpus. Accordingly, results are reported not only as dimension-level means but also with distributional summaries (score frequencies, spread indicators) so that readers can assess the stability and skew of the data for themselves.

3.2. Two-Level Assessment Framework and Rubric

3.2.1. Visual Representation Versus Spatial Logic

To evaluate how effectively generative AI “encodes” sustainable urban design principles, the framework disaggregates sustainability into two analytically distinct levels. The first, visual representation (image/surface level), assesses how plausibly the generated image renders visual cues associated with sustainability—trees, pedestrians, cyclists, waterfronts, plazas, transit vehicles, and similar motifs. The second, spatial logic (logic level), assesses whether these cues are coherently integrated into the relational structures that urban design demands: pedestrian-network continuity, public-space entry and edge conditions, green/blue-infrastructure connectivity, human-scale street proportions and frontage conditions, and multimodal transit hierarchy and interchange operability. The gap between the two levels is defined as Δ(Image − Logic); larger values of Δ indicate that a scene appears sustainable at the surface while its underlying spatial logic remains weak or incoherent.

3.2.2. Evaluation Dimensions and Scoring Rubric

The five evaluation dimensions—(1) walkability, (2) public space, (3) green/blue infrastructure, (4) human-scale streetscape/urban design qualities, and (5) mobility hierarchy/multimodality—were derived from the sustainable urban design frameworks reviewed in Section 2.4, selecting principles that are both theoretically central and empirically codable within the study’s data format (generated images and workflow logs). Each dimension is scored on the same 0–2 ordinal rubric at both the visual and the spatial-logic level. This simple ordinal scheme is consistent with prior AI-image evaluation in urban design contexts [4]; the uniform three-point scale was chosen to balance cross-dimension comparability against coding burden.
The common score definitions are as follows. A score of 0 (absent/collapsed) indicates that the relevant visual cue is absent or that the spatial structure is manifestly incoherent. A score of 1 (partial/ambiguous) indicates that the cue is present but structurally equivocal or marked by clear logical flaws. A score of 2 (clear/coherent) indicates that visual representation and spatial logic are relatively well-aligned and the scene’s operational plausibility reads convincingly. Dimension-specific scoring criteria were documented in a codebook designed to minimize impressionistic assessment and anchor coding in observable structural cues. For example, a score of 2 on the spatial-logic level for walkability requires that the continuity of the pedestrian axis and the legibility of intersections, entries, and connections can be plausibly explained; for mobility hierarchy, the critical criterion is not whether a transit vehicle appears in the image but whether stops, platforms, and interchange nodes are spatially constituted within the urban structure.
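The scoring structure described above can be sketched in code. The following is a minimal illustration, not the authors' instrument: each scene receives a 0–2 ordinal score per dimension at the image level and the logic level, and the gap is computed as Δ(Image − Logic). The dimension keys and the example scores below are hypothetical, chosen only to show the mechanics.

```python
# Illustrative sketch of the two-level rubric: five dimensions, each scored
# 0-2 at the visual-representation (image) level and the spatial-logic level.
# The per-dimension gap is defined as delta = image score - logic score.
from dataclasses import dataclass

DIMENSIONS = (
    "walkability",
    "public_space",
    "green_blue_infrastructure",
    "human_scale_streetscape",
    "mobility_hierarchy",
)


@dataclass
class SceneScores:
    image: dict  # visual-representation level, each value in {0, 1, 2}
    logic: dict  # spatial-logic level, each value in {0, 1, 2}

    def delta(self) -> dict:
        """Per-dimension gap: Δ(Image − Logic)."""
        return {d: self.image[d] - self.logic[d] for d in DIMENSIONS}


# Hypothetical scene: every dimension looks convincing (image = 2), but the
# transit node is not spatially constituted and the pedestrian network is
# only partially coherent at the logic level.
scene = SceneScores(
    image={d: 2 for d in DIMENSIONS},
    logic={
        "walkability": 1,
        "public_space": 1,
        "green_blue_infrastructure": 2,
        "human_scale_streetscape": 1,
        "mobility_hierarchy": 0,
    },
)

gaps = scene.delta()
print(gaps["mobility_hierarchy"])        # prints 2 (largest possible gap)
print(gaps["green_blue_infrastructure"])  # prints 0 (image and logic aligned)
```

A large positive Δ flags exactly the pattern the framework is designed to detect: a scene that reads as sustainable at the surface while its relational structure remains weak or collapsed.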

3.2.3. Expert Pilot Test and Codebook Refinement

To assess the initial clarity and applicability of the rubric, an online pilot test was conducted with 20 urban design experts. The panel comprised 8 professors, 7 doctoral/master’s students, and 5 practitioners, all with backgrounds in urban design or urban planning. Professors and practitioners held teaching, research, or project-management experience in urban design and planning; graduate students had completed studio coursework and related methods training in urban design or planning programs. After reviewing an orientation document explaining the study’s purpose and rubric definitions, respondents evaluated a set of example scenes (generated images accompanied by contextual summaries) across all five dimensions at both the visual and spatial-logic levels (10 items total). The task took approximately 20 min on average. The pilot was designed as a diagnostic pre-test of rubric clarity rather than as a consensus-building exercise (e.g., Delphi); accordingly, no post-response consensus procedure was conducted.
The pilot revealed that inter-rater disagreement concentrated at the boundary between “iconic presence/atmospheric impression” and “structural establishment.” Three areas required the most substantial codebook revision:
Mobility hierarchy (logic level). The distinction between “a tram or bus is visible in the scene” and “stops, platforms, boarding flows, and interchange decks are spatially constituted” produced the largest disagreement. Additional definitional criteria for judging transit-node operability were needed.
Public space (logic level). Respondents frequently conflated “a lively-looking plaza (people, street furniture, events)” with “entry, edge, and activity-distribution logic coherent with adjacent streets and programs.” Level changes (stairs, ramps) and the clarity of entry points proved decisive for logic-level scoring and required more explicit codebook guidance.
Human-scale streetscape (logic level). In scenes rich with surface cues (façade rhythm, street trees, furnishings), coders needed clearer criteria for whether the continuity of street edges, cross-sectional enclosure, and ground-floor access relationships were actually maintained.
On the basis of these recurring points of confusion, codebook definitions were sharpened and anchored with representative scored examples so that the rubric would function as a structure-cue-based coding instrument rather than an impressionistic rating scale (see Appendix A.1 and Appendix A.2 for the final codebook).

3.2.4. Reliability Assessment and Final Coding

Full coding of all 36 scenes was performed by the first author. Given the interpretive demands of spatial-logic scoring, inter-rater reliability was assessed through independent double-coding of a subset (n = 12 scenes, 33% of the corpus) by a second coder—a doctoral student with advanced training in urban design and planning—who applied the same codebook independently. Quadratic-weighted Cohen’s κ (κw) was computed for each dimension at both levels. The results are summarized in Table 1.
As Table 1 shows, inter-rater agreement ranged from κw = 0.64–0.86 at the visual level and κw = 0.58–0.82 at the logic level, with the visual level consistently exhibiting higher agreement than the logic level. The lowest agreement was observed for the logic-level scoring of mobility hierarchy/multimodality (κw = 0.58), a predictable pattern given that judging the operability of transit interchange structures requires more complex inferential reasoning and admits greater interpretive latitude than other dimensions. Test–retest intra-rater reliability on the same subset was somewhat higher (κw = 0.70–0.88), suggesting that the primary coder applied codebook rules with reasonable internal consistency.
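For readers unfamiliar with the statistic, quadratic-weighted Cohen’s κ for two raters on the 0–2 ordinal scale can be computed as in the minimal sketch below. The rating vectors used in the usage note are invented for illustration, not the study’s actual coding data.

```python
from collections import Counter

def quadratic_weighted_kappa(a, b, k=3):
    """Quadratic-weighted Cohen's kappa for two raters on a 0..k-1 ordinal scale.

    Disagreements are penalized by the squared distance between categories,
    so a 0-vs-2 disagreement weighs four times as much as a 0-vs-1 one.
    """
    n = len(a)
    obs = Counter(zip(a, b))   # observed joint rating counts
    row = Counter(a)           # rater-A marginal counts
    col = Counter(b)           # rater-B marginal counts
    # Observed weighted disagreement
    num = sum((i - j) ** 2 * obs[(i, j)] for i in range(k) for j in range(k))
    # Chance-expected weighted disagreement (from the marginals)
    den = sum((i - j) ** 2 * row[i] * col[j] / n for i in range(k) for j in range(k))
    return 1.0 - num / den
```

Perfect agreement yields κw = 1.0, while systematic maximal disagreement (one rater always scoring 0 where the other scores 2) yields κw = −1.0; values in the 0.58–0.86 range reported above indicate moderate-to-strong agreement under this weighting.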
Disagreements identified during double-coding were resolved not by forcing consensus on specific scenes but through a rule-referenced review procedure: for each disputed item, the coders identified which codebook definition had proved ambiguous and refined the wording accordingly. This approach prioritizes instrument improvement over score convergence [35]. The resulting finalized codebook was then applied to the full coding of all 36 scenes.
Following full coding, dimension-level central-tendency measures (means and medians) and gap scores Δ(Image − Logic) were computed. Given the limited sample size (n = 36), results are reported alongside score-distribution summaries (0/1/2 frequencies and spread indicators) so that readers can assess whether any given mean is driven by a few extreme scenes or reflects a stable distributional pattern.
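The distributional reporting described above can be illustrated with a short sketch. The score vector below is hypothetical (it is not the study’s data); it shows how mean, median, and 0/1/2 frequencies are derived for one dimension so that a reader can judge whether a mean reflects a stable distribution or a few extreme scenes.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical logic-level scores for one dimension across 36 scenes (invented).
scores = [0] * 10 + [1] * 18 + [2] * 8

summary = {
    "mean":   round(mean(scores), 2),
    "median": median(scores),
    "freq":   dict(sorted(Counter(scores).items())),  # 0/1/2 frequencies
}
```

Here the mean (0.94) is anchored by the large middle category rather than by outliers, which is exactly the distinction the frequency breakdown is meant to surface.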

3.3. Qualitative Analytic Strategy

The qualitative component of the study addresses a question that quantitative scores alone cannot answer: why does the visual–logic gap arise? To this end, a process-tracing analysis is conducted on the failure and revision records embedded in the team reports [36]. The analytic categories are organized around a breakdown–repair framework. A breakdown is defined as a moment at which the generated output conflicts with the design team’s intent or with the relational spatial structure required by the design brief (e.g., absence of programmatic massing, collapse of pedestrian/vehicular hierarchy, non-viable transit node, contextual mismatch). A repair is defined as any intervention undertaken to resolve the breakdown (e.g., prompt refinement, reference-image injection, tool switching, partial re-generation, generative-fill compositing/deletion, or manual reconfiguration of the layout).
Case selection follows a contrastive sampling principle organized along two axes. First, scenes with large Δ (high visual scores but weak spatial-logic scores) are compared against scenes with small Δ (closely matched visual and logic scores) to identify recurring failure points and effective repair strategies. Second, dimensions that exhibit frequent collapse (e.g., mobility hierarchy) are compared against dimensions that remain relatively stable (e.g., green-infrastructure cue reproduction) to determine how the form and severity of breakdowns vary by sustainability dimension. This dual-axis contrast yields a differentiated account of which aspects of sustainability generative AI can reproduce automatically (visual cues) and which require human interpretation, judgment, and reconstruction (relational structures).
Building on this analytic structure, three propositions are tested. First, the visual–logic gap (Δ) will not be uniform across dimensions: dimensions that can be represented largely through surface cues (e.g., green/blue infrastructure) are expected to show smaller Δ than those requiring relational reasoning (e.g., mobility hierarchy, walkability). Second, among relational dimensions, Δ will be largest where sustainability depends on multi-component systemic structures such as transit interchange nodes, consistent with compositional-consistency limitations identified in computer-vision benchmarks [28]. Third, when the gap is identified by designers, prompt refinement alone will not resolve it; multi-stage repair pipelines will be required, which indicates a structural limitation rather than a correctable prompt-quality issue.

4. Results

4.1. Tool-Type Differences in Sustainability Encoding

Across the teams’ work, generative AI tools differed markedly in how they handled sustainability elements, and these differences mapped onto the input/reference modality each tool relies on: text-prompt-driven generation, geometry/structure-referenced rendering, or localized editing. Three broad tool-type profiles emerged.
Text-based diffusion models (e.g., Midjourney) excelled at rapidly assembling visual signifiers of sustainability—trees, pedestrians, waterfronts, cycling infrastructure, warm material textures—into aesthetically persuasive scenes. Their strength lay in visual persuasiveness: the generated images were often striking in atmosphere, texture, and overall “plausibility.” However, the relational spatial structures that these scenes would need to sustain were frequently unstable. Student reports repeatedly noted that “texture, atmosphere, and plausibility are excellent, but layout and massing distortions undermine the design framework.” This observation is consistent with the architectural limitation of diffusion models discussed in Section 2.1: optimization for distributional plausibility rather than for explicit relational constraints.
Structure/geometry-referenced rendering and image-conditioned tools (e.g., SketchUp-based renderers, LookX, Adobe Firefly with structure references) take existing massing or geometry as input, and consequently exhibited higher reliability in layout and form preservation. LookX, for instance, was reported to maintain SketchUp massing relatively faithfully and to differentiate road surfaces from building volumes effectively. Yet even these tools showed breakdowns at the level of complex program encoding: when a scene required the articulation of multiple functional zones or fine-grained programmatic detail, the output frequently collapsed into generic form. Similarly, Firefly could preserve the geometry of an uploaded reference capture, but its capacity for generating creative spatial alternatives was assessed as limited.
Localized editing/generative-fill tools (e.g., Adobe Photoshop Generative Fill) were strongest not in whole-scene generation but in partial correction, alignment, and cleanup. Students explicitly observed that “perfect implementation is difficult; the key is to verify essential elements and then carry out additional work in Photoshop,” and that “at the bird’s-eye/master-plan stage, rendering-type AI may be more practical than generation-type AI.” These remarks signal an important reframing: generative AI in studio practice functions less as a design-production tool and more as a tool that entails verification and repair labor.
Consequently, a clear functional division of labor emerged across tool types: (1) text-based generation = visual persuasiveness, (2) structure/geometry reference = layout stability, and (3) Photoshop-class tools = localized repair. Importantly, students did not rely on any single tool in isolation; instead, they actively combined tools, distributing tasks across the pipeline so as to pull “visually plausible scenes” toward “operable spatial logic.” This adaptive orchestration is itself a form of design competence, one that involves diagnosing what each tool can and cannot encode, and allocating repair responsibilities accordingly.

4.2. Dimension-Level Visual–Spatial-Logic Gaps

The rubric-based (0–2) results reveal a consistent pattern: sustainability elements are rendered with relatively high plausibility at the visual-representation level, but spatial-logic scores drop sharply and unevenly across dimensions. Table 2 summarizes the dimension-level means and gap scores.
As Table 2 shows, image-level means cluster in the upper range (1.3–1.9) while logic-level means vary widely (0.2–1.4). The two largest gaps (Δ = 1.1 each) appear in walkability and mobility hierarchy/multimodality; public space follows closely (Δ = 0.9). Green/blue infrastructure exhibits the smallest gap (Δ = 0.5). The dimension-by-dimension patterns are elaborated below.
Walkability recorded the highest image-level mean (1.8), reflecting the ease with which AI populates scenes with pedestrians, street furniture, and canopy trees to compose a convincing “walkable atmosphere.” The logic-level mean, however, was only 0.7. The generated scenes consistently lacked network continuity: pedestrian paths appeared locally vivid but failed to establish legible origins, destinations, and connections. In short, the expression of walkability was rich, but the structure of walkability (where paths enter, where they lead, and how they form a continuous network) remained impoverished.
Public space exhibited a parallel pattern. Plazas and open spaces scored 1.7 at the image level, with convincing renderings of vibrancy and visual appeal. Yet the logic-level score fell to 0.8, indicating recurring failures in the spatial conditions that make public space accessible: level changes, entry-point legibility, edge definition, and the interface between the open space and surrounding streets and programs. AI generated public space as lively scenery with relative ease but struggled to maintain the adjacency relationships and hierarchical organization that public space requires to function within an urban structure.
Green/blue infrastructure showed the smallest gap among the five dimensions (image 1.9, logic 1.4, Δ = 0.5). This comparatively strong performance is attributable to the nature of greenery in AI generation: vegetation functions largely as an areal texture or background fill, so at least the distributional appearance of green coverage can be achieved with relative fidelity. When evaluation criteria extended to corridor-level ecological connectivity—continuity, the absence of fragmentation, and persuasive linkage pathways—limitations persisted. The implication is that AI can approximate the quantity and spread of green infrastructure but remains unreliable in encoding its functional connectivity as a networked system.
Human-scale streetscape/urban design qualities scored 1.6 at the image level and 0.9 at the logic level (Δ = 0.7). AI composed surface cues of “human-scale streets” persuasively, such as façade rhythm, street trees, furnishings, and pedestrian density, but the structural conditions supporting those cues were frequently inconsistent. Edge continuity, cross-sectional enclosure (the sense of being “held” by buildings), ground-floor access relationships, and the alignment of the streetscape with pedestrian-flow logic often broke down even in otherwise convincing scenes.
Mobility hierarchy/multimodality exhibited the most severe gap. Even when trams or buses were visually present (image-level mean 1.3), the operational infrastructure required for them to function, such as tracks, dedicated lanes, platforms, and interchange structures, was almost entirely absent, yielding a logic-level mean of only 0.2. This finding strongly suggests that transit elements persist in AI outputs as icons or decorative signifiers rather than as operable components of an urban mobility system. The conditions for interchange-node viability (access, flow, spatial accommodation) are not automatically encoded.
These patterns are further illustrated by the case studies in Section 4.3. The consistent message emerging from this dataset is that generative AI can produce surface-level cues of sustainability, such as greenery, water features, and pedestrian activity, with relative ease, whereas the relational spatial logics that sustainability requires constitute its most persistent and systematic weakness.

4.3. Breakdown–Repair Patterns

The quantitative gap Δ(Image − Logic) identified in Section 4.2 can be explained at the mechanism level through the breakdown–repair logs preserved in team reports. Teams did not simply accept visually appealing outputs; they documented moments at which the generated image conflicted with their design intent in terms of program, circulation, context, or functional boundaries, declared the failure explicitly, and intervened through prompt restructuring, tool switching, partial editing, or external-source compositing to re-inject spatial logic into the output. In this corpus, “repair” is therefore not a cosmetic afterthought but an integral part of how AI-assisted urban design works.
The four cases below were selected according to the contrastive-sampling principle described in Section 3.3: (i) each represents a dimension where Δ was particularly large, and (ii) the four together cover distinct breakdown typologies, including program-encoding failure, urban-scale coherence drift under prompt-driven generation, goal-conflict-driven layout instability, and relational-condition mismatch. This demonstrates that the same aggregate Δ can arise from qualitatively different workflow problems. Notably, tool choice was not uniform across these cases. The Misa Island case (Case 2) relied primarily on Midjourney, a prompt-driven diffusion model, whereas the remaining three cases were produced with LookX-centered, structure-referenced workflows. Misa Island is therefore treated as a contrast case that foregrounds the limitations of prompt-driven generation at the urban scale, rather than as a direct tool-performance comparison. Accordingly, the cross-case analysis in Section 4.3.5 focuses on breakdown typologies and repair strategies, not on ranking tools.

4.3.1. Case 1—Transfer Center: Program-Encoding Failure and External-Source Injection

Complex transit infrastructure, such as a bus terminal or intermodal transfer center, requires not merely the icon of a bus but the operational structure of boarding, alighting, waiting, signage, and interchange circulation. In this case, the team documented the failure in direct terms: “the bus stop was not represented, so we added keywords,” but “the bus terminal continued to fail to appear properly.” The breakdown therefore concerns program encoding at the level of operational structure rather than style. The workflow and repair sequence are shown in Figure 1.
The failure persisted across tool boundaries. The team switched from LookX to Adobe Photoshop’s Generative Fill, entering explicit directives such as “bus station” and “bus stop,” but reported that “a small-scale stop appeared, or no massing was generated at all.” The tools could introduce visually plausible fragments, but they did not produce the coherent interchange logic required for the scene to function as a transfer center.
Repair ultimately could not be completed within the AI pipeline. The team resolved the issue by “adding an external source to form the bus stop and produce the final image,” manually compositing a photographic reference into the scene in Photoshop. The critical insight is that the element in question—complex infrastructure program—lies beyond what current generative tools can reliably encode, regardless of prompt quality. Faced with this structural limitation, the design team must (1) diagnose the failure, (2) switch tools, and (3) when the failure persists, inject the missing program by compositing external evidence. This three-step escalation provides a mechanism-level explanation for the large Δ observed in the mobility-hierarchy dimension.

4.3.2. Case 2—Misa Island: Urban-Scale Prompting Limits and Reference Dependence

The Misa Island case illustrates a recurrent challenge in prompt-driven generation: the mismatch between high visual quality at the level of localized scenes and the difficulty of producing coherent city-scale images that align with a specific masterplan intent. The team intended to design a mixed-use resort complex on Misa Island, organized around a large sphere as the iconic centerpiece and surrounded by an event-oriented public realm capable of hosting major gatherings. The desired output required a bird’s-eye or city-scale view that could convey an integrated relationship among the landmark object, event space, circulation network, and resort program. Representative outputs from this process are shown in Figure 2.
Despite the acknowledged strength of Midjourney in producing aesthetically persuasive imagery, the team reported that limitations became pronounced at the urban scale. They experimented with prompts such as “city scale” and “bird’s-eye view” to obtain a plan-like perspective, yet found it difficult to generate images that matched their masterplan intent. Even when outputs appeared plausible in atmosphere and composition, the core relational requirements (how the landmark anchors the district, how the event space is framed, and how the surrounding resort components connect) remained unstable. The result was a repeated cycle of aesthetically persuasive images that failed to encode the intended spatial structure.
A key lesson from this case concerns the role of reference images in steering prompt-driven systems. Through repeated iteration, the team concluded that effective reference-image selection mattered as much as prompt crafting. When an appropriate reference was identified, compositional structure, style, and color palette could be “blended” into the output with high efficiency and a sharp increase in perceived completeness. However, this mechanism created a strong dependence. As the team noted: “setting the reference image was critical when using Midjourney; if the reference deviated even slightly from the desired direction, the results diverged significantly, and correcting them through prompts was extremely difficult.” The workflow consequently incurred high time costs in both prompt iteration and reference-image curation.
Overall, the Misa Island case demonstrates that city-scale urban design intent is not easily recoverable through prompt iteration when the generative system treats reference images as soft guidance rather than as enforceable structural constraints. The case reinforces the study’s broader observation: prompt-driven tools can quickly generate persuasive visuals, but at larger scales, design teams face instability and must invest substantial effort in reference curation to approximate the intended masterplan logic.

4.3.3. Case 3—Botanic Garden: Goal Conflict and Functional-Boundary Instability

The Botanic Garden case demonstrates what happens when multiple design objectives compete for priority within a single scene: preserving an already-modeled building and site layout, achieving photorealistic rendering quality, and representing lush greenhouse vegetation and park atmosphere. The team aimed to render a crystal-dome greenhouse complex and its surrounding landscape but reported that prompt-driven generation could not reproduce the exterior form as implemented. Outputs remained closer to a sketch than a realistic rendering, even when prompts explicitly emphasized realism. The workflow and resulting output are shown in Figure 3.
AI does not resolve competing objectives logically; it selectively amplifies whichever surface feature the latest prompt emphasizes. In this case, hard constraints such as layout and massing preservation became unstable. The team attempted to stabilize the layout by first rendering the SketchUp model with Veras and then conditioning generation with the directive “maintaining existing modeled building layout: no change in location or structure,” yet reported that although the image became more realistic, “the distortion of building placement and design was severe and the basic design framework was not preserved.”
The team’s repair strategy therefore shifted from prompt tuning to constraint re-injection through tool switching. They ultimately adopted LookX because it could realize the intended façade and material cues and the surrounding landscape without altering the location or appearance of the modeled design; additional Photoshop work was used for final refinement. As in the other cases, sustainability-associated imagery (lush vegetation, park atmosphere) was generated with relative ease. However, the functional boundaries and layout hierarchies required for spatial logic must be re-injected by the design team, especially when multiple objectives (realism, vegetation abundance, and strict layout preservation) compete within a single image.

4.3.4. Case 4—Waterfront IC: Multi-Stage Repair of Directional and Flow Logic

The Waterfront IC case provides the clearest example of repair operating at the level of urban-design-specific relational logic. From the outset, the team combined LookX’s “Image Generation” function with a reference image to condition the output and then, because the scene was intended as a waterfront park design image, applied LookX’s “Edit” function for partial corrections. The team also noted using “ChatGPT’s assistance” (OpenAI, GPT-3.5 model) for prompt formulation. The resulting workflow therefore already combined reference injection and staged refinement rather than relying on single-shot generation. This workflow is shown in Figure 4.
The final repair is particularly revealing. The team did not describe it as an aesthetic improvement but as the restoration of a relational condition: “We used Photoshop, taking into account the direction in which the river flows, to complete the final image.” This remark illustrates the kind of constraint that “spatial logic” denotes in this study, such as directionality, flow, and site-specific relational conditions, and demonstrates how design teams intervene when AI outputs fail to satisfy such constraints automatically.
Similar multi-stage repair pipelines appeared elsewhere in the corpus. One report described “extracting clipping paths from two viewpoint-matched images in Photoshop and layering them to produce the final image,” synthesizing the strengths of both outputs. Another documented compositing multiple images to restore the centrality of an event ground that had been weakened in generation. A third report identified unwanted elements and layout inconsistencies that compromised “openness, publicness, and visual connectivity,” and performed a cleanup-based repair. Across these instances, AI-assisted studio work was reorganized from “single-shot generation” into repair-centered assembly and editing, a mode of practice better described as workmanship than as automated production.

4.3.5. Cross-Case Synthesis: Shared Breakdown–Repair Mechanisms

The four cases converge on a common finding: generative AI use in design studios is not a one-shot production act but an iterative cycle of breakdown diagnosis and repair. In the Transfer Center, transit elements appeared in the image, but the operational logic of stops, platforms, and interchange circulation did not materialize. This mobility-logic collapse could not be resolved by prompt refinement alone and ultimately required external-source compositing. In Misa Island, attempts to generate city-scale views through Midjourney repeatedly drifted from the intended masterplan; outcomes were highly sensitive to reference-image choice, making reference curation a major driver of both quality and time cost. In the Botanic Garden, competing design goals (photorealistic rendering, lush vegetation, and strict preservation of an existing modeled layout) led to selective amplification and layout distortion; repair required shifting from prompt-driven generation to constraint re-injection through reference-conditioned rendering and post-correction. In Waterfront IC, directional and flow constraints were not automatically met, and repair involved a multi-tool pipeline of reference injection, partial editing, and post-correction.
These case narratives show that breakdowns do not arise from simple “quality deficiency” but from failures in the core logics of urban design. Recurring breakdown types include program (transit-interchange structure), urban-scale coherence (masterplan-level intent and anchoring), boundary (functional demarcation under competing objectives), and relational conditions (flow, access, directionality). Correspondingly, repair is not cosmetic post-processing; it is the re-injection of collapsed or missing logic through reference, editing, compositing, and tool switching. These case-level observations are consistent with the dimension-level Δ patterns reported in Section 4.2, particularly the large gaps in mobility, walkability, and public space. Together, they provide both the quantitative and the processual evidence for the study’s central claim: generative AI encodes the appearance of sustainability far more reliably than the relational structure that sustainability demands.

5. Discussion

5.1. The Visual–Logic Gap as a Structural Feature of Latent-Space Synthesis

The central finding of this study is that generative AI exhibits a structural gap between rendering sustainability as a visual impression and encoding it as an operable spatial logic. Across the five evaluation dimensions, image-level scores clustered in the upper range (1.3–1.9 on a 0–2 scale), whereas logic-level scores dropped sharply and unevenly, most severely in mobility hierarchy/multimodality (Δ = 1.1) and walkability (Δ = 1.1), followed by public space (Δ = 0.9). Green/blue infrastructure showed the smallest gap (Δ = 0.5), but even here the relative success reflected the ease with which vegetation is rendered as an areal texture rather than any genuine encoding of ecological connectivity.
This pattern is not reducible to the claim that “AI is not yet good enough.” The gap is better understood as a consequence of how diffusion models operate: they optimize for distributional plausibility, that is, the likelihood that a generated image resembles the training distribution, rather than for the satisfaction of explicit relational constraints [1]. López Cervantes and Sánchez Morales [3] have theorized this discrepancy as a plausibility gap inherent to predictive-architecture workflows, and the present study provides the first dimension-level empirical evidence for how that gap distributes across the core principles of sustainable urban design. Whereas López Cervantes and Sánchez Morales frame the gap at the level of architectural feasibility and tectonic coherence, this study demonstrates that urban design, because of its dependence on external relational systems (networks, hierarchies, access logic, boundary conditions), is an even more demanding test case. The finding that mobility interchange and pedestrian-network continuity produce the largest Δ values is consistent with computer-vision benchmarks showing that spatial relations and compositional consistency are among the hardest constraints for diffusion models to satisfy [28], and it extends those technical observations into the domain-specific vocabulary of urban design.
These findings are consistent with the three propositions formulated in Section 3.3. Proposition 1 (dimension-differential gap) is confirmed: green/blue infrastructure exhibited the smallest Δ (0.5), while mobility hierarchy and walkability produced the largest gaps (Δ = 1.1 each). Proposition 2 (relational-complexity gradient) is supported by the observation that mobility hierarchy, the dimension requiring the most complex multi-component systemic structure, yielded a logic-level mean of only 0.2, the lowest across all dimensions. Proposition 3 (repair as mechanism) is substantiated by the case analyses in Section 4.3, which demonstrate that prompt refinement alone could not resolve spatial-logic breakdowns; multi-stage pipelines involving tool switching, reference injection, and external-source compositing were required. Taken together, these patterns indicate that the visual–logic gap is a structurally patterned consequence of how diffusion models encode urban sustainability.
Phillips et al. (2024), in their comparative evaluation of AI image tools for urban design, noted that outputs can appear compelling while undermining “sensibleness” at the urban-structural level [4]. The present study advances that observation in two ways. First, by disaggregating the same image into a visual-representation score and a spatial-logic score, it shows that the two dimensions are not merely different; they are systematically decoupled, with the decoupling concentrated in dimensions that demand relational reasoning (mobility, walkability, public-space access). Second, by analyzing the audit trails embedded in student workflow reports, the study provides processual evidence for why the gap arises: breakdowns occur not because of “poor prompts” but because the elements in question, such as transit-interchange structures, place-specific contextual logic, functional boundaries, and directional flows, fall outside what current generative pipelines can reliably encode without explicit external constraints.

5.2. Visual Greenwashing as an Emergent Risk in Generative Urban Design

The systematic decoupling documented above carries a practical risk that connects to the greenwashing literature. As argued earlier, the visual greenwashing produced by generative AI differs from both claim greenwashing (strategic verbal misrepresentation) [9] and executional greenwashing (deliberate selection of nature-evoking cues) [20]. It arises instead from system-induced representational distortion: diffusion models produce sustainability symbolism at high volume and fidelity as a consequence of optimizing for distributional plausibility, while leaving the relational substrate unverified. The greenwashing scholarship has shown that visual cues can distort audience judgment independently of substantive evidence [19] and that “green” labels can bias perception beyond measured performance [26]. The present findings suggest that generative AI amplifies this pathway at unprecedented scale and fidelity.
This risk is heightened in contexts where reviewers lack the domain expertise to interrogate underlying spatial structures. In studio education, students demonstrated considerable diagnostic skill: they identified breakdowns, switched tools, and re-injected spatial logic through multi-stage repair pipelines (Section 4.3). But this capacity was exercised after the visually plausible image had already been generated, and it required domain-specific judgment that cannot be taken for granted outside a studio setting. In professional contexts, such as competition panels, stakeholder presentations, and public consultations, the risk is that a “sustainable-looking” rendering may be taken at face value precisely because its visual quality inhibits critical interrogation of the underlying spatial structure. The Δ metric proposed in this study offers a way to make that risk legible: a large Δ signals not an “error” per se, but a zone where the cost of verifying and repairing relational logic is elevated.
These findings carry implications beyond studio pedagogy. Sustainable urban development, as articulated in SDG 11, depends on translating sustainability commitments into verifiable spatial outcomes, including walkable networks, accessible transit, functional green infrastructure, and inclusive public spaces [32,33]. As generative AI tools are increasingly adopted in competition entries, stakeholder consultations, and ESG-related reporting, the visual–logic gap becomes a governance concern: visually compelling renderings may distort sustainability assessments and mislead non-expert decision-makers. The Δ metric proposed here thus functions not only as a design-review tool but as a transparency instrument, making legible the distance between what a generated image promises and what the underlying spatial structure can deliver.

5.3. Methodological Contribution: The Two-Level Framework and Audit-Trail Analysis

From a methodological standpoint, this study makes two interrelated contributions. First, the two-level assessment framework decomposes the familiar observation that “AI images look plausible but do not make spatial sense” into a measurable structure. By scoring the same sustainability dimension at both the visual and the logic level, the framework reveals not only that a gap exists but where it concentrates, an important distinction for both pedagogy and practice. Prior evaluations of AI-generated urban imagery have relied on holistic or impression-based ratings [4]; the present framework demonstrates that the same image can score high on representation and low on logic, and that this divergence is systematically patterned by dimension.
Second, the use of student workflow reports as design audit trails offers a methodological pathway that most existing studies have not exploited. The majority of AI-in-design research evaluates finished outputs; by contrast, this study treats the generation–revision process itself as data, enabling process-tracing analysis of the mechanisms through which spatial logic breaks down and is repaired. The breakdown–repair framework that emerged from this analysis, covering program-encoding failure, contextual displacement, goal-conflict-driven boundary collapse, and relational-condition mismatch, provides a structured vocabulary for diagnosing and discussing the kinds of failures that generative outputs are likely to produce in urban design contexts. This vocabulary may be useful beyond the present study as a pedagogical and review tool.

5.4. Guardrails for Education and Practice

On the basis of the findings and the discussion above, this study proposes three minimum guardrails designed to prevent visual sustainability from substituting for spatial-logic sustainability in generative-AI-assisted urban design.
Guardrail 1: Image–logic paired submission. Final renderings and bird’s-eye images should be accompanied by structural documentation that demonstrates the spatial logic behind the scene, for example, 2D plans, cross-sections, network diagrams, access/interchange schemata, or annotated figure-ground studies. This pairing functions as a minimum disclosure device: it exposes disconnections, hierarchy collapses, and boundary failures that a polished rendering alone may conceal. In studio pedagogy, paired submission can be integrated into design reviews as a standard requirement, analogous to the expectation that architectural presentations include plans alongside renderings. In professional practice, a similar principle can inform competition-entry requirements or stakeholder-presentation protocols.
Guardrail 2: Formalization of the design audit trail as a deliverable. Given that AI-assisted design in practice operates as a multi-stage repair pipeline rather than as single-shot generation (Section 4.3), workflows should be documented in a lightweight log recording the original generation, intermediate revisions, reference-image injections, tool switches, partial edits, and external-source compositing steps. Formalizing this audit trail as a standard deliverable, rather than treating it as informal documentation, makes the repair labor visible and reviewable, and it provides a basis for assessing whether spatial-logic issues were identified and addressed. In educational settings, the audit trail can also serve a formative function: it gives instructors a window into the student’s diagnostic and repair reasoning, not just their final output. The key to feasibility is brevity: a structured one-page log template with timestamped entries is preferable to an exhaustive narrative.
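To make the feasibility point concrete, the one-page audit trail could be held in a structure as small as the following Python sketch. All class, field, and scene names here are hypothetical illustrations, not a template used in the study.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class AuditEntry:
    """One timestamped step in the generation-revision pipeline."""
    timestamp: datetime
    step: str       # e.g., "generation", "reference injection", "tool switch"
    tool: str       # e.g., "Midjourney", "LookX", an external image editor
    note: str = ""  # brief rationale: what broke, what this step repairs

@dataclass
class AuditTrail:
    """Lightweight log for one scene, kept to one page of entries."""
    scene_id: str
    entries: list = field(default_factory=list)

    def log(self, step: str, tool: str, note: str = "") -> None:
        self.entries.append(AuditEntry(datetime.now(), step, tool, note))

# Hypothetical usage mirroring the repair pipeline described in Section 4.3
trail = AuditTrail("scene-03")
trail.log("generation", "Midjourney", "initial bird's-eye scene")
trail.log("tool switch", "LookX", "re-anchor massing to the SketchUp base model")
trail.log("external-source compositing", "image editor",
          "re-insert the transit interchange lost in generation")
print(len(trail.entries))  # 3
```

The point of the sketch is the brevity argument: three timestamped entries already capture the generation, the tool switch, and the compositing repair, which is the reviewable minimum the guardrail asks for.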
Guardrail 3: Δ-based red-flag review. The two-level rubric developed in this study can be adapted into a review checklist in which each sustainability dimension is scored at both the visual and the logic levels. Projects or scenes exhibiting a large Δ on any dimension would be flagged for additional scrutiny, not as a punitive measure, but as a signal that the verification and repair cost for relational spatial logic is elevated in that area. Importantly, Δ should not be interpreted as a simple “error score”; it functions as a diagnostic indicator of where human judgment, additional documentation, or further design iteration is most needed. In practice, the threshold for flagging could be calibrated to the pedagogical or review context. For instance, Δ ≥ 1.0—the level at which walkability and mobility hierarchy were flagged in this study—might trigger mandatory paired-submission review in a studio setting.
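A minimal sketch of how this Δ-based flagging could be operationalized, using the dimension-level means reported in Table 2; the function names and data layout are illustrative assumptions, not part of the study's instrumentation.

```python
# Illustrative sketch of the Δ-based red-flag review (Guardrail 3).
# Dimension names, means, and the Δ >= 1.0 threshold follow Table 2 and
# Section 5.4; everything else is an assumed, hypothetical layout.

FLAG_THRESHOLD = 1.0  # gap at which paired-submission review is triggered

def compute_delta(visual, logic):
    """Δ = image-level mean minus logic-level mean, per dimension."""
    return {dim: round(visual[dim] - logic[dim], 2) for dim in visual}

def red_flags(delta, threshold=FLAG_THRESHOLD):
    """Dimensions whose visual-logic gap meets or exceeds the threshold."""
    return sorted(dim for dim, gap in delta.items() if gap >= threshold)

# Dimension-level means from Table 2 (0-2 rubric, n = 36 scenes)
visual = {"Walkability": 1.8, "Public Space": 1.7,
          "Green/Blue Infrastructure": 1.9,
          "Human-Scale Streetscape": 1.6,
          "Mobility Hierarchy": 1.3}
logic = {"Walkability": 0.7, "Public Space": 0.8,
         "Green/Blue Infrastructure": 1.4,
         "Human-Scale Streetscape": 0.9,
         "Mobility Hierarchy": 0.2}

delta = compute_delta(visual, logic)
print(red_flags(delta))  # ['Mobility Hierarchy', 'Walkability']
```

With the study's own means, only walkability and mobility hierarchy reach the Δ ≥ 1.0 threshold, which is exactly the pair flagged in Table 2.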
Together, these three guardrails address complementary failure modes. Paired submission targets the output (ensuring that visual claims are accompanied by structural evidence). The audit trail targets the process (making repair labor and tool orchestration visible). The red-flag review targets the evaluation (providing a structured mechanism for identifying where the visual–logic gap is most severe). None of these measures assumes that generative AI should be avoided; rather, they are designed to position AI outputs as hypotheses requiring verification and repair rather than as finished design proposals.
In this regard, the two-level rubric can serve a dual function. As an evaluative instrument, it provides a structured basis for scoring and comparing the visual and spatial-logic quality of AI-generated outputs in reviews, juries, and research assessments. As a design-thinking scaffold, it can be introduced at the formative stage of studio instruction: students would use the five-dimensional, two-level matrix to self-diagnose their outputs before final submission, treating each Δ value as a signal for where additional repair or documentation is needed. In this scaffolding mode, the rubric operates not as a grading tool but as a diagnostic prompt that directs attention toward relational-logic verification. The three guardrails proposed above provide the institutional structure within which this dual function can operate: paired submission, audit trail formalization, and Δ-based red-flag review.

6. Conclusions

This study evaluated how effectively generative AI encodes sustainable urban design principles by applying a two-level assessment framework, contrasting visual representation with spatial logic, to 36 generated scenes drawn from student AI-workflow reports in two site-based urban design studios. The two-level assessment revealed systematic visual–logic gaps across all five dimensions, most severely in mobility hierarchy and walkability, and least in green/blue infrastructure. This pattern is consistent with the observation that diffusion models approximate surface-level cues more reliably than the relational structures on which urban design performance depends.
Qualitative process tracing revealed that the gap is not passively accepted by student designers. Rather, students diagnosed breakdowns in spatial logic, including program-encoding failure, contextual displacement, goal-conflict-driven boundary collapse, and relational-condition mismatch. To address these breakdowns, they deployed multi-stage repair pipelines involving prompt restructuring, reference injection, tool switching, partial editing, and external-source compositing. This finding reframes AI-assisted urban design as a practice of repair-centered workmanship rather than automated production, and identifies diagnose–decompose–anchor as a core competence for working with generative tools. This competence encompasses spatial literacy and evaluative judgment, not merely prompt proficiency.
To prevent visual sustainability from substituting for spatial-logic sustainability, the study proposed three guardrails: (i) image–logic paired submission, requiring structural documentation alongside final renderings; (ii) formalization of the design audit trail as a standard deliverable; and (iii) Δ-based red-flag review, using dimension-level gap scores to identify areas requiring additional scrutiny (see Section 5.4 for detailed discussion). The underlying principle is that generative AI outputs should be treated as hypotheses requiring verification rather than as finished design proposals.
Several limitations should be acknowledged. The data derive from a single university’s site-based urban design studios (2024–2025), and the influence of a single instructor and institutional context cannot be entirely separated from the findings. In professional practice, where time pressure is greater and documentation norms differ, repair processes may be more tacit, and the visual–logic gap may go undiagnosed more frequently. The sample of 36 coded scenes supports analytic generalization rather than statistical generalization, and the tool portfolio is weighted toward Midjourney and LookX; whether the observed Δ patterns are tool-contingent or reflect a structural property of diffusion-based generation more broadly cannot be resolved within the present dataset. Generative AI also evolves rapidly, and the tools used by students in 2024–2025 may already differ from those available at the time of publication.
Future research should address these limitations in three directions. First, replication across different institutions, studio formats, and professional contexts is needed to test whether the Δ patterns observed here are reproduced, attenuated, or amplified under different conditions, including competition entries, stakeholder visualization, and consultant-led design. Second, longitudinal tracking across evolving generative tools, including structure-referenced conditioning, multimodal encoding, and 3D-aware systems, should determine which dimensions of Δ shrink with technological advance, which persist, and how the locus of human intervention shifts from repair toward verification. Third, coupling image-level assessment with 2D/3D spatial data (network analysis, massing models, GIS) and performance metrics (accessibility, solar exposure, wind environment) would enable partially automated detection of relational-logic breakdowns, moving evaluation from post hoc manual scoring toward real-time diagnostic support.
In sustainable urban design, what matters is not the richness of surface cues but the viability of relational structures. This study’s two-level framework and audit-trail analysis separate what generative AI does well (visual representation) from what it cannot yet be trusted to do (relational encoding), positioning AI-generated imagery as the beginning of design inquiry rather than its conclusion.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were waived for this study by the Institutional Review Board at Gachon University because it used previously collected student work and posed minimal risk to participants.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The author would like to thank the students who participated in the studio course for their dedicated design work using AI tools and for granting permission to adapt and reproduce their project materials for publication.

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A

Appendix A.1. Codebook for Image-Level (Visual Representation) Scoring (0–2)

Scoring scale: 0 = Absent/broken; 1 = Partial/ambiguous; 2 = Clear/plausible.
(1) Walkability
  • 0: The image contains few or no pedestrian cues, and the scene reads as car-dominant.
  • 1: Some pedestrian cues (e.g., people, sidewalks, benches, crossings) appear, but they function mainly as decoration or appear discontinuous.
  • 2: Pedestrian-oriented cues are rich and consistent, producing a plausible walkable street/plaza atmosphere.
(2) Public Space
  • 0: A public space cannot be clearly identified, or it appears only as an indistinct background element.
  • 1: Activity cues suggest a public space, but the center, boundaries, and intended use remain ambiguous.
  • 2: The public space type and use are legible (e.g., staying, meeting, activities), with strong cues of “publicness.”
(3) Green/Blue Infrastructure
  • 0: Vegetation and water-related cues are largely absent.
  • 1: Green/water elements are present but mainly decorative, with little indication of a system structure.
  • 2: Green/blue infrastructure reads as a coherent system (e.g., greenway, waterfront park, continuous canopy) rather than isolated objects.
(4) Human-scale Streetscape/Urban Design Qualities
  • 0: The scene lacks cues of human-scale street experience, showing monotonous façades or arbitrary/implausible detail.
  • 1: Streetscape elements (doors/windows/trees/furniture/signage) appear, but scale, rhythm, and enclosure feel unstable or exaggerated.
  • 2: The scene supports a convincing human-scale experience, including active ground-floor cues, readable façade rhythm, and plausible enclosure and detail.
(5) Mobility Hierarchy/Multimodality
  • 0: Mobility hierarchy cues are minimal or missing, and vehicles/paths appear random or unstructured.
  • 1: Transit/bike cues appear as icons, but stops, platforms, or hubs are not spatially legible.
  • 2: Multiple modes (walk/bike/transit/vehicle) are clearly indicated, with plausible stop/hub cues that support a coherent hierarchy.

Appendix A.2. Codebook for Spatial-Logic-Level (Operability) Scoring (0–2)

Scoring scale: 0 = Inoperable/illogical; 1 = Partly operable/inconsistent; 2 = Operable/coherent.
(1) Walkability
  • 0: Pedestrian routes are clearly broken, and access/crossings/connectivity are not feasible within the depicted structure.
  • 1: Some pedestrian routes are readable, but connectivity or priority is unclear and pedestrian–vehicle logic conflicts are evident.
  • 2: A continuous, readable pedestrian network is present, and crossings, entries, and connections can be explained with minimal contradictions.
(2) Public Space
  • 0: Entries, boundaries, and level changes are non-functional or illogical, so the space cannot operate as a public realm.
  • 1: The space partly works, but adjacency to buildings/streets is awkward, and circulation or activity zones conflict.
  • 2: Access, edges, and activity layout are clear, and adjacency with surrounding streets/buildings is functionally coherent.
(3) Green/Blue Infrastructure
  • 0: Green/blue elements are isolated or internally contradictory (e.g., implausible water/terrain/drainage relationships).
  • 1: Some continuity is readable, but frequent breaks remain, and multifunctional logic is only implied rather than structured.
  • 2: The system is coherent and connected, with ecological/hydrological/walkable continuity that is plausible and explainable.
(4) Human-scale Streetscape/Urban Design Qualities
  • 0: Street-edge continuity and section proportions collapse, making pedestrian-scale experience physically unworkable.
  • 1: Some segments are operable, but continuity, section proportions, or entry relationships break at key points.
  • 2: Street edges and sections remain continuous and plausible, and entry relationships support an operable pedestrian-scale experience.
(5) Mobility Hierarchy/Multimodality
  • 0: Turning radii, lane widths, stop access, or transfer paths are clearly infeasible, and any “hub” reads as sculptural rather than functional.
  • 1: A hierarchy is partly readable, but stop/hub location, access, and transfer circulation are inconsistent.
  • 2: A coherent hierarchy (walk–bike–transit–vehicle) is evident, with stops/hubs and transfers that are spatially plausible within the overall structure.

Appendix B

Appendix B.1. Distribution Summary of “Visual Representation” Scores (Image-Level), n = 36

Visual Representation Dimension | 0 n (%) | 1 n (%) | 2 n (%) | Mean (SD) | Median [IQR]
Walkability | 0 (0.0) | 7 (19.4) | 29 (80.6) | 1.81 (0.40) | 2 [0.0]
Public Space | 1 (2.8) | 9 (25.0) | 26 (72.2) | 1.69 (0.52) | 2 [1.0]
Green/Blue Infrastructure | 1 (2.8) | 1 (2.8) | 34 (94.4) | 1.92 (0.37) | 2 [0.0]
Human-Scale Streetscape/Urban Design Qualities | 3 (8.3) | 8 (22.2) | 25 (69.4) | 1.61 (0.64) | 2 [1.0]
Mobility Hierarchy/Multimodality | 4 (11.1) | 17 (47.2) | 15 (41.7) | 1.31 (0.67) | 1 [1.0]

Appendix B.2. Distribution Summary of “Spatial Logic” Scores (Logic-Level), n = 36

Spatial Logic Dimension | 0 n (%) | 1 n (%) | 2 n (%) | Mean (SD) | Median [IQR]
Walkability | 17 (47.2) | 13 (36.1) | 6 (16.7) | 0.69 (0.75) | 1 [1.0]
Public Space | 13 (36.1) | 17 (47.2) | 6 (16.7) | 0.81 (0.71) | 1 [1.0]
Green/Blue Infrastructure | 2 (5.6) | 17 (47.2) | 17 (47.2) | 1.42 (0.60) | 1 [1.0]
Human-Scale Streetscape/Urban Design Qualities | 10 (27.8) | 19 (52.8) | 7 (19.4) | 0.92 (0.69) | 1 [1.0]
Mobility Hierarchy/Multimodality | 30 (83.3) | 5 (13.9) | 1 (2.8) | 0.19 (0.47) | 0 [0.0]
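As a consistency check, the means and standard deviations in Appendix B.1 and B.2 can be reproduced from the 0/1/2 frequency counts alone. The short Python sketch below assumes the sample (n − 1) standard deviation, which matches the reported values; the helper name is illustrative.

```python
import statistics

def summarize(counts):
    """Rebuild mean and sample SD from (n0, n1, n2) frequency counts."""
    scores = [s for s, n in zip((0, 1, 2), counts) for _ in range(n)]
    return round(statistics.mean(scores), 2), round(statistics.stdev(scores), 2)

# Frequencies taken from Appendix B (n = 36 scenes per dimension)
print(summarize((0, 7, 29)))   # visual walkability -> (1.81, 0.4)
print(summarize((30, 5, 1)))   # logic mobility hierarchy -> (0.19, 0.47)
```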

References

  1. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar] [CrossRef]
  2. Yiannoudes, S. Shaping Architecture with Generative Artificial Intelligence: Deep Learning Models in Architectural Design Workflow. Architecture 2025, 5, 94. [Google Scholar] [CrossRef]
  3. López Cervantes, J.C.; Sánchez Morales, C.E. Design in the Age of Predictive Architecture: From Digital Models to Parametric Code to Latent Space. Architecture 2026, 6, 25. [Google Scholar] [CrossRef]
  4. Phillips, C.; Jiao, J.; Clubb, E. Testing the Capability of AI Art Tools for Urban Design. IEEE Comput. Graph. Appl. 2024, 44, 37–45. [Google Scholar] [CrossRef]
  5. Cervero, R.; Kockelman, K. Travel Demand and the 3Ds: Density, Diversity, and Design. Transp. Res. Part D Transp. Environ. 1997, 2, 199–219. [Google Scholar] [CrossRef]
  6. Saelens, B.E.; Sallis, J.F.; Frank, L.D. Environmental Correlates of Walking and Cycling: Findings from the Transportation, Urban Design, and Planning Literatures. Ann. Behav. Med. 2003, 25, 80–96. [Google Scholar] [CrossRef]
  7. Frank, L.D.; Schmid, T.L.; Sallis, J.F.; Chapman, J.; Saelens, B.E. Linking Objectively Measured Physical Activity with Objectively Measured Urban Form: Findings from SMARTRAQ. Am. J. Prev. Med. 2005, 28, 117–125. [Google Scholar] [CrossRef] [PubMed]
  8. Ewing, R.; Cervero, R. Travel and the Built Environment: A Meta-Analysis. J. Am. Plan. Assoc. 2010, 76, 265–294. [Google Scholar] [CrossRef]
  9. Delmas, M.A.; Burbano, V.C. The Drivers of Greenwashing. Calif. Manag. Rev. 2011, 54, 64–87. [Google Scholar] [CrossRef]
  10. de Freitas Netto, S.V.; Sobral, M.F.F.; Ribeiro, A.R.B.; Soares, G.R.L. Concepts and Forms of Greenwashing: A Systematic Review. Environ. Sci. Eur. 2020, 32, 19. [Google Scholar] [CrossRef]
  11. McKenna, H.P. Toward a Rethinking and Reimagining of Urban Sustainability in an Era of AI. Urban Sci. 2025, 9, 401. [Google Scholar] [CrossRef]
  12. Carpo, M. The Second Digital Turn: Design Beyond Intelligence; MIT Press: Cambridge, MA, USA, 2017. [Google Scholar]
  13. del Campo, M.; Manninger, S. Strange, But Familiar Enough: The Design Ecology of Neural Architecture. Archit. Des. 2022, 92, 38–45. [Google Scholar] [CrossRef]
  14. Huang, S.-Y. The Connectionist Turn: How Contemporary Generative AI Reshapes Architectural Rationality. Architecture 2025, 5, 132. [Google Scholar] [CrossRef]
  15. Gu, X.Y.; Zhang, M.M.; Lyu, J.X.; Ge, Q. Generating Urban Road Networks with Conditional Diffusion Models. ISPRS Int. J. Geo-Inf. 2024, 13, 203. [Google Scholar] [CrossRef]
  16. Yu, D.; Wan, B.; Sheng, Q. Automated Generation of Urban Spatial Structures Based on Stable Diffusion and CoAtNet Models. Buildings 2024, 14, 3720. [Google Scholar] [CrossRef]
  17. Xu, S.; Jiang, H.; Wang, H. A Generative Urban Form Design Framework Based on Deep Convolutional GANs and Landscape Pattern Metrics for Sustainable Renewal in Highly Urbanized Cities. Sustainability 2025, 17, 4548. [Google Scholar] [CrossRef]
  18. Lyon, T.P.; Montgomery, A.W. The Means and End of Greenwash. Organ. Environ. 2015, 28, 223–249. [Google Scholar] [CrossRef]
  19. Schmuck, D.; Matthes, J.; Naderer, B. Misleading Consumers with Green Advertising? An Affect–Reason–Involvement Account of Greenwashing Effects in Environmental Advertising. J. Advert. 2018, 47, 127–145. [Google Scholar] [CrossRef]
  20. Parguel, B.; Benoît-Moreau, F.; Russell, C.A. Can Evoking Nature in Advertising Mislead Consumers? The Power of ‘Executional Greenwashing’. Int. J. Advert. 2015, 34, 107–134. [Google Scholar] [CrossRef]
  21. Gałecka-Drozda, A.; Wilkaniec, A.; Szczepańska, M.; Świerk, D. Potential Nature-Based Solutions and Greenwashing to Generate Green Spaces: Developers’ Claims versus Reality in New Housing Offers. Urban For. Urban Green. 2021, 65, 127345. [Google Scholar] [CrossRef]
  22. Quoquab, F.; Sivadasan, S.; Mohammad, J. Greenwashing in the Era of Sustainability: A Systematic Literature Review. Asia Pac. J. Mark. Logist. 2022, 34, 778–799. [Google Scholar] [CrossRef]
  23. Kronenberg, J.; Skuza, M.; Łaszkiewicz, E. To What Extent Do Developers Capitalise on Urban Green Assets? Urban For. Urban Green. 2023, 87, 128063. [Google Scholar] [CrossRef]
  24. Shin, M.H.; Kim, H.Y.; Gu, D.; Kim, H. LEED, Its Efficacy and Fallacy in a Regional Context—An Urban Heat Island Case in California. Sustainability 2017, 9, 1674. [Google Scholar] [CrossRef]
  25. Demirtas, E.; Ayas Onol, T. The Politics of Green Buildings: Neoliberal Environmental Governance and LEED’s Uneven Geography in Istanbul. Buildings 2026, 16, 363. [Google Scholar] [CrossRef]
  26. Holmgren, M.; Kabanshi, A.; Sörqvist, P. Occupant perception of “green” buildings: Distinguishing physical and psychological factors. Build. Environ. 2017, 114, 140–147. [Google Scholar] [CrossRef]
  27. Holmgren, M.; Sörqvist, P. Are Mental Biases Responsible for the Perceived Comfort Advantage in “Green” Buildings? Buildings 2018, 8, 20. [Google Scholar] [CrossRef]
  28. Huang, K.; Duan, C.; Sun, K.; Xie, E.; Li, Z.; Liu, X. T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 3563–3579. [Google Scholar] [CrossRef]
  29. Ewing, R.; Handy, S. Measuring the Unmeasurable: Urban Design Qualities Related to Walkability. J. Urban Des. 2009, 14, 65–84. [Google Scholar] [CrossRef]
  30. Farr, D. Sustainable Urbanism: Urban Design with Nature; John Wiley & Sons: Hoboken, NJ, USA, 2008. [Google Scholar]
  31. Jabareen, Y.R. Sustainable Urban Forms: Their Typologies, Models, and Concepts. J. Plan. Educ. Res. 2006, 26, 38–52. [Google Scholar] [CrossRef]
  32. Wheeler, S.M. Planning for Sustainability: Creating Livable, Equitable, and Ecological Communities; Routledge: London, UK, 2004. [Google Scholar]
  33. UN-Habitat. A New Strategy of Sustainable Neighbourhood Planning: Five Principles; United Nations Human Settlements Programme (UN-Habitat): Nairobi, Kenya, 2014. [Google Scholar]
  34. Yin, R.K. Case Study Research and Applications: Design and Methods, 6th ed.; SAGE: Thousand Oaks, CA, USA, 2018. [Google Scholar]
  35. Campbell, J.L.; Quincy, C.; Osserman, J.; Pedersen, O.K. Coding In-depth Semistructured Interviews: Problems of Unitization and Intercoder Reliability and Agreement. Sociol. Methods Res. 2013, 42, 294–320. [Google Scholar] [CrossRef]
  36. Beach, D.; Pedersen, R.B. Process-Tracing Methods: Foundations and Guidelines, 2nd ed.; University of Michigan Press: Ann Arbor, MI, USA, 2019. [Google Scholar]
Figure 1. Transfer Center workflow. (a) Student-produced SketchUp base model (input). (b) LookX intermediate output. (c) Final output with student post-editing to integrate site context.
Figure 2. Misa Island: AI outputs generated by students using Midjourney.
Figure 3. Botanic Garden workflow. (a) Student-produced SketchUp base model (input). (b) LookX intermediate output. (c) Final output with student post-editing to integrate site context.
Figure 4. Waterfront IC workflow. (a) Student-produced SketchUp base model (input). (b) LookX intermediate output. (c) Final output with student post-editing to integrate site context.
Table 1. Inter-rater reliability for the ordinal (0–2) rubric (quadratic-weighted κw).
Dimension | Visual Level κw | Logic Level κw
Walkability | 0.83 | 0.74
Public Space | 0.80 | 0.69
Green/Blue Infrastructure | 0.86 | 0.82
Human-Scale Streetscape/Urban Design Quality | 0.76 | 0.66
Mobility Hierarchy/Multimodality | 0.64 | 0.58
Note. Double-coding subset: n = 12 scenes (33% of corpus). Test–retest intra-rater reliability on the same subset: κw = 0.70–0.88.
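For readers wishing to reproduce reliability figures of this kind, quadratic-weighted κ for a three-category ordinal scale can be computed from two raters' score vectors as follows. This is a generic, self-contained implementation; the example vectors are invented and are not the study's double-coding data.

```python
def quadratic_weighted_kappa(rater_a, rater_b, categories=(0, 1, 2)):
    """Cohen's kappa with quadratic weights for an ordinal scale."""
    k, n = len(categories), len(rater_a)
    idx = {c: i for i, c in enumerate(categories)}
    # Observed joint distribution of the two raters' scores
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        obs[idx[a]][idx[b]] += 1 / n
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    # Quadratic weights: full credit on the diagonal, partial credit for
    # adjacent categories, none for a 0-vs-2 disagreement
    w = [[1 - (i - j) ** 2 / (k - 1) ** 2 for j in range(k)] for i in range(k)]
    po = sum(w[i][j] * obs[i][j] for i in range(k) for j in range(k))
    pe = sum(w[i][j] * row[i] * col[j] for i in range(k) for j in range(k))
    return (po - pe) / (1 - pe)

# Invented example: two raters disagreeing only on adjacent categories
a = [0, 1, 2, 2, 1, 0, 2, 1]
b = [0, 1, 2, 1, 1, 0, 2, 2]
print(round(quadratic_weighted_kappa(a, b), 3))  # 0.795
```

Because the weights penalize a 1-point disagreement only lightly, the ordinal κw in Table 1 rewards raters who land in adjacent rubric categories, which is the appropriate agreement model for a 0–2 scale.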
Table 2. Two-level rubric results by sustainable urban design dimension (0–2 scale).
Dimension | Visual Level Mean | Logic Level Mean | Δ (Img − Logic)
Walkability | 1.8 | 0.7 | 1.1
Public Space | 1.7 | 0.8 | 0.9
Green/Blue Infrastructure | 1.9 | 1.4 | 0.5
Human-Scale Streetscape/Urban Design Quality | 1.6 | 0.9 | 0.7
Mobility Hierarchy/Multimodality | 1.3 | 0.2 | 1.1
Note. All scores use a 0–2 rubric; values are dimension-level means across n = 36 scenes. Δ = Image-level mean − Logic-level mean. Full score distributions (0/1/2 frequencies, SD, median, IQR) are provided in Appendix B.1 and Appendix B.2.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Jung, S. Visually Sustainable but Spatially Broken? A Two-Level Assessment of How Generative AI Encodes Sustainable Urban Design Principles. Sustainability 2026, 18, 2943. https://doi.org/10.3390/su18062943

