Article

Generative AI for Architectural Façade Design: Measuring Perceptual Alignment Across Geographical, Objective, and Affective Descriptors

1 Department of Geography, University College London, London WC1E 6BT, UK
2 Department of Architecture, University of Cambridge, Cambridge CB2 1TN, UK
3 Institute of Science Tokyo, Tokyo 152-8550, Japan
4 BeautifulPlacesAI, London WC2H 9JQ, UK
* Author to whom correspondence should be addressed.
Buildings 2025, 15(17), 3212; https://doi.org/10.3390/buildings15173212
Submission received: 30 July 2025 / Revised: 31 August 2025 / Accepted: 2 September 2025 / Published: 5 September 2025
(This article belongs to the Special Issue BioCognitive Architectural Design)

Abstract

Generative AI is increasingly applied in architectural research, from automated ideation and the reshaping of design workflows to design education. Despite the increasing realism of synthetic imagery, several research gaps remain, including alignment, plausibility, explainability, and control. This study focuses on alignment with human perceptions, specifically examining how synthetic architectural façade imagery aligns with geographical, objective, and affective text descriptors. We propose a pipeline that applies a Latent Diffusion Model to generate façade images and then evaluates this alignment through both AI-based and human-based evaluations. The results reveal that while images generated with geographical prompts are notably aligned, they also show clear biases. The results also reveal that images synthesised from objective descriptors (e.g., angular/curvy) are more aligned with human perceptions than those from affective descriptors (e.g., utopian/dystopian). These initial results highlight the opportunities and limits of current generative AI models, hinting at data biases and a potential lack of the embodied understanding needed to grasp the complexity of experiencing architecture. Limitations of the study remain. Future work can expand on exploring cultural biases and semantic overlaps, and on testing more advanced embodied AI models and methods.

Graphical Abstract

1. Introduction

Generative AI has become increasingly prevalent across disciplines and applications, from the use of GPT for generating written text to Latent Diffusion Models (LDM) for creating realistic imagery and videos. These techniques are increasingly being applied in architectural research, influencing automated ideation in design practice and reshaping design workflows and design education. Despite the increasing realism of synthetic imagery, several research gaps have emerged, including model alignment with human perceptions, the plausibility of the synthetic image, the explainability of the model, and the controllability of image editing. Our study focuses on exploring alignment with human perceptions to support architectural design processes and create façades that resonate with the people who experience them. We ask the following questions: (1) Can accessible and open-source pretrained generative vision models such as Stable Diffusion (SDXL) produce geographically consistent architectural counterfactuals of façades in urban settings? (2) Do AI-generated façades align with human perceptual descriptions ranging from objective to affective descriptors?

2. Related Works

Generative Urban Design (GUD) is a well-studied area in architectural computation [1,2,3] and can be broadly categorised into rule-based methods, optimisation-based methods and AI-based methods [4,5,6,7]. For example, procedural models such as CityEngine [8,9] employ a set of predefined rules/grammars to automate city and building generation. More recently, we have seen the increasing use of AI-based generative vision models for architectural and urban design ideation, conditioned on text prompts and images as design rules [7,10]. These methods are scalable and can integrate with explicit urban design guidelines in practice. However, as Hazbei and Cucuzzella [11] highlight, computational approaches can reduce the “architectural context” to a set of parametric constraints, potentially overlooking more intangible social, cultural and perceptual dimensions. Exploring this gap requires examining not only the generative capacity of models but also their ability to engage with contextual qualities. Our work responds to this topic by focusing on perceptual alignment in vision-based generative façade design.

2.1. Vision-Based Generative Design

Early efforts in vision-based generative models include the use of generative adversarial networks (GANs) [12], which train a generator to synthesise images and a discriminator that improves image quality [13,14]. Examples in architecture and urban design include their use to synthesise street network patches [15], architectural floor plans [16,17], building footprints [18] and street scenes [19]. For example, Law et al. [19] edit the latent space of a VAE-GAN to synthesise plausible urban counterfactuals that adhere to a separate image classifier. While GAN-based image synthesis methods demonstrated early promise, they are often unstable to train, lack image diversity and offer limited control over the output.
Recent studies have shifted toward Denoising Diffusion Models, such as Latent Diffusion [20,21], for text-to-image generation, which offer improved fidelity, diversity, and controllability in generative outputs. Several studies have integrated these models into their architectural and urban design research [22,23,24,25]. This includes their use in controllable floorplan generation [26,27], urban footprint generation [22], architectural façade generation [28], image inpainting of historic architectural details [29], generating urban design renders [25], 3D urban blocks [24] and holistic urban environments [23]. Despite their potential, limited research has studied the perceptual alignment of these models in architectural façade design.

2.2. Leveraging Urban Imagery for Human Perception Studies

Interest in evaluating the perceptual quality of urban imagery has increased significantly, driven by the growing availability of street-level imagery (e.g., Google Street View, Mapillary) and advances in computer vision [30,31,32]. A seminal example is StreetScore [33,34], which collected crowd-sourced image ratings and developed machine learning models to predict the perceptual quality of streetscapes [35]. However, despite this progress, limited research has evaluated the alignment of generative street views with human perceptions. One exception is [28], which compared real and synthetic façade imagery of Antoni Gaudí’s works according to authenticity, attractiveness, creativity, harmony, and overall preference. Such research has primarily emphasized aesthetic quality rather than providing a more comprehensive understanding of perceptual alignment.

2.3. Perceptual and Affective Quality of Our Environment

The perceptual and affective quality of our environment, whether in built or natural settings, is well studied in cognitive science and environmental psychology, highlighting the influence of context on human cognition and emotional responses [36,37]. For example, top-down processing theories [38,39] illustrate how prior knowledge and expectations shape our interpretation of sensory information in an environment. Insights from embodied cognition highlight how physical interactions with our environments shape our emotional responses to different architectural features, underscoring the role of bodily experience in shaping perceptions. Appraisal theories [40] further explain how individuals evaluate and interpret environmental stimuli, linking these appraisals to affective responses.
Despite growing interest in using generative AI for architectural design and well-established theory in visual processing, limited research has explored the extent to which these artificially generated images quantitatively align with human perception and emotional response. As [41] argues, architectural form should resonate with human cognitive and perceptual structure in order to be meaningful. Building on these gaps, this research investigates how accessible and open-source pretrained generative image models can be evaluated in relation to human perception. Two experiments are conducted: the first examines geographical consistencies, and the second compares objective and affective perceptual characteristics of synthetic imagery.

3. Methodology

Denoising Diffusion Probabilistic Models (DDPMs) [42] have emerged as a widely-used generative model for text-to-image generation in architectural design [7]. One popular variant is the Latent Diffusion Model (LDM) [20], which performs the diffusion process over a learned latent representation of the image (obtained via a pretrained Variational Autoencoder VAE) instead of directly in pixel space. In an LDM, a U-Net architecture [43] is then tasked with predicting the Gaussian noise added to the latent representation at each diffusion step t. By iteratively subtracting the predicted noise in reverse (starting from a noisy latent and moving toward a clean latent), the model progressively denoises the latent and eventually reconstructs the image. The LDM is trained using a squared loss to match the predicted noise to the true noise added at each step. Formally, this is given by the following equation:
$$\mathcal{L}_{\mathrm{LDM}} = \mathbb{E}\left[\, \lVert \epsilon - \epsilon_\theta(z, c, t) \rVert_2^2 \,\right]$$
where $\epsilon \sim \mathcal{N}(0, I)$ is the Gaussian noise added to the latent, and $\epsilon_\theta(z, c, t)$ is the noise predicted by the U-Net [43] at each time step $t$. Here, $z = \mathrm{enc}(x)$ is the latent embedding of the image $x$ from the encoder $\mathrm{enc}(\cdot)$ of a pretrained Variational Autoencoder [44], and $c$ is the conditioning information, in this case the text prompt that guides the generation. The text prompt is encoded using a CLIP-based text encoder and injected via cross-attention layers.
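To make the objective concrete, the sketch below shows one training step written in the style of the diffusers library; `unet`, `vae`, `text_encoder` and `scheduler` are assumed pretrained components, and the actual SDXL training code differs in detail (e.g., two text encoders, additional size conditioning and latent scaling).

```python
import torch
import torch.nn.functional as F

def ldm_training_step(unet, vae, text_encoder, scheduler, image, prompt_ids):
    """One illustrative LDM training step: predict the noise added to the latent."""
    with torch.no_grad():
        z = vae.encode(image).latent_dist.sample()        # z = enc(x), the latent embedding
        c = text_encoder(prompt_ids)[0]                   # text condition c from a CLIP-based encoder
    noise = torch.randn_like(z)                           # epsilon ~ N(0, I)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (z.shape[0],), device=z.device)     # random diffusion step per sample
    z_t = scheduler.add_noise(z, noise, t)                # noisy latent at step t
    noise_pred = unet(z_t, t, encoder_hidden_states=c).sample
    return F.mse_loss(noise_pred, noise)                  # squared loss between true and predicted noise
```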
In this research, we leverage the popular and accessible LDM called SDXL [45], pretrained on the LAION-5B dataset [46], to study the alignment between generated architectural imagery and human descriptions. SDXL is an LDM from StabilityAI, which uses a larger backbone than previous Stable Diffusion Variants such as SD1.5, resulting in improved performance. For more details, please see [45].
Other deep generative models were considered, including Stable Diffusion variants such as SD3 and recent flow-based architectures such as Flux [47]. SDXL [45] was selected as it achieves high image quality while being more openly licensed (CreativeML Open RAIL++-M License), more mature and more computationally accessible than other Stable Diffusion and flow-based variants. We also did not consider entirely closed-source, commercial tools such as MidJourney or the recent Nano Banana, due to cost and reproducibility, but plan to benchmark against these commercial image generators in future research.

Model Pipeline

The end-to-end model pipeline generates a counterfactual image on the basis of three inputs, namely, the original image, a mask text prompt and a generative text prompt, as shown in Figure 1. The pipeline consists of three components: (1) the image segmentation module, (2) the image generation module and (3) the image evaluation module, which comprises both (a) a geolocalisation experiment and (b) a perceptual alignment experiment. We tested the end-to-end model pipeline to assess the alignment abilities of these generative models. We focus on open-weight and lightweight models for all components due to the greater control over model parameters, lower computational costs, reproducibility, and accessibility.
The image segmentation module (1) is based on a text-prompted semantic segmentation model called CLIPSeg, selected for its simplicity [48]. Given a mask prompt, the model generates a segmentation mask, which is then used to localise the image generation in the next stage. The modular design of the pipeline enables the integration of alternative segmentation modules in future work, including the widely adopted GroundingDINO and SegmentAnything approach [49].
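A minimal sketch of this masking step is shown below, using the CLIPSeg classes available in the Hugging Face transformers library; the checkpoint name and the 0.4 threshold are assumptions rather than the study’s exact settings.

```python
import torch
import numpy as np
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

# Commonly used CLIPSeg checkpoint; verify against your own setup.
processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

def facade_mask(image: Image.Image, mask_prompt: str = "windows and doors", threshold: float = 0.4):
    """Return a binary inpainting mask for the regions matching the mask prompt."""
    inputs = processor(text=[mask_prompt], images=[image], return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits               # low-resolution relevance map
    probs = torch.sigmoid(logits).squeeze().numpy()
    mask = (probs > threshold).astype(np.uint8) * 255  # threshold is an illustrative choice
    return Image.fromarray(mask).resize(image.size)    # upsample to the original resolution
```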
The image generation module (2) is based on a variant of an LDM called SDXL [45]. The model takes the mask generated in (1), a generative text prompt and the original image to generate a localised counterfactual (inpainting) of the original image. We use default components for the model, including a pretrained VAE as the encoder–decoder and an SDXL-trained denoising U-Net for the diffusion process. We qualitatively select two hyper-parameters (guidance scale, strength) based on image quality, as shown in the parameter study section. We also experimented with LoRA [50] for parameter-efficient fine-tuning and ControlNet [51] for controlled generation with a Canny edge map. We decided, first, to focus on pretrained model results to better understand the performance and biases of an accessible, open-source generative image model and, second, to exclude the Canny edge map controls, as they were restrictive for some of the affective text prompts and unnecessary once inpainting is applied.
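The inpainting step can be sketched with the diffusers library as below; the checkpoint name is an assumption (any SDXL inpainting checkpoint with the same interface works), and because the guidance values reported later in the text appear normalised, the mapping to the library’s `guidance_scale` range is also an assumption.

```python
import torch
from diffusers import AutoPipelineForInpainting

# Assumed SDXL inpainting checkpoint; substitute your own weights if needed.
pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1", torch_dtype=torch.float16
).to("cuda")

def generate_counterfactual(image, mask, prompt, strength=0.7, guidance_scale=8.0, seed=0):
    """Inpaint the masked facade regions according to the generative text prompt."""
    generator = torch.Generator("cuda").manual_seed(seed)   # fixed latent code for reproducibility
    return pipe(
        prompt=prompt,
        image=image,
        mask_image=mask,
        strength=strength,              # how strongly the masked region is transformed
        guidance_scale=guidance_scale,  # adherence to the text prompt
        generator=generator,
    ).images[0]
```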
The image evaluation module (3) consists of two types of validation. The first is model-based evaluation, where a pretrained vision model tests whether the predicted class of the generated image matches the ground-truth class. The second is human-based evaluation, where evaluators judge whether the generated image aligns with the ground-truth class or not. In this study, “ground-truth” refers to the reference class against which both machine- and human-based evaluations are compared. For the geolocalisation experiment, the ground-truth is the geographical prompt (e.g., Paris) that generates the synthetic image, which is compared to the predicted class. For the perceptual alignment experiment, the ground-truth is the perceptual prompt (e.g., calm and stressful) that generates the synthetic imagery, which is compared to the predicted perceptual class. Accuracy between ground-truth and the predicted class is the main evaluation metric for text-to-image alignment, given the largely balanced classes. Before describing the experiments, a hyper-parameter analysis is first conducted.

4. Parameter Analysis

4.1. Mask Prompt for Localised Inpainting

To restrict edits to building façades while preserving the rest of the scene, the modelling pipeline employs targeted inpainting using a mask prompt with CLIPSeg. Figure 2 shows three prompt variants, namely ‘windows’, ‘windows and doors’, and ‘doors’. ‘Windows and doors’ is selected as the base prompt as it covers both the upper floors and the ground floor. Below are two examples.
Figure 3 presents the original image from Amsterdam alongside two synthesised counterfactual images whose building façade has been transformed to Hong Kong: one without localised inpainting, which introduces unrealistic global changes (such as concrete covering the canal), and another produced with inpainting, which successfully constrains alterations to architectural features on the building façade.
Figure 4 shows the original image from San Francisco alongside two synthesised counterfactual images of Kyoto: one generated without localised inpainting, resulting in the entire road changing shape, and another produced with targeted inpainting, effectively limiting the alterations exclusively to the building façade.
To explore this further, we selected 5 random images and generated 10 synthetic geographical variants (similar to the first geolocalisation experiment), with and without inpainting. This allows us to qualitatively assess the effect of inpainting on the synthetic image, particularly in terms of preserving the geographical quality of the original image.
The qualitative results show that 36 out of 50 images maintain a good level of image coherence without inpainting, while all 50 out of 50 images with inpainting are as expected. These results indicate that inpainting is necessary to prevent significant alterations to the urban streetscape. However, they also suggest that the current hyper-parameters and prompts largely preserve the original streetscape even without inpainting. In certain cases, the counterfactuals even introduce visual elements that are characteristic of the target city, for example, overhanging cables in Tokyo. As this research focuses specifically on generating counterfactuals of building façades, we select the inpainting method to minimise significant changes to the streetscape. Future research can relax this assumption.

4.2. Image Diversity

We further study the diversity of generated images by varying the random seed during synthesis. Figure 5 illustrates an example where an image of Shanghai’s building façade (top) was transformed into Paris, Tokyo, and Kyoto, each generated using two different random latent codes (between 0 and $2^{20}$). The two rows of images for each target city display consistent, geo-specific urban features—such as the Haussmannian façades that are characteristic of Paris, the Kyo-machiya architecture that is typical of Kyoto, and a mix of modern and traditional buildings representing Tokyo.
To explore this further, we similarly selected 5 random images and generated 10 synthetic variants, each representing a different geography with two random latent codes (between 0 and $2^{20}$). The qualitative results show that 50 out of 50 pairs of images exhibit significant architectural differences between the two random codes. These results demonstrate that the generated images are both plausible and well-aligned with their geographical prompts, while also exhibiting visual diversity, a known limitation of earlier architectures such as generative adversarial networks.

4.3. Guidance and Scale Parameters

Furthermore, we conducted a simple visual ablation study to describe the effects of varying the guidance (g) and scale (s) parameters on image generation. For SDXL, the guidance parameter controls how strongly the generation follows the text prompt, while the scale parameter determines the strength of the transformation. A higher guidance value improves image coherence but reduces diversity, while a higher scale produces stronger visual edits.
We synthesise an image of London, modifying the building façades to resemble those in Hong Kong, using different combinations of these hyper-parameters as shown in Figure 6. The results show that when $s = 0.9$, the entire image changes excessively, while for $s < 0.7$, the transformation is insufficient. Holding $s = 0.7$ constant, we observe that the generated images are more plausible when $g \geq 0.6$. Based on these observations, we select $s = 0.7$ and $g = 0.8$ as the default parameters for our generative process.
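A small grid sweep of this kind can be sketched as follows, reusing the `pipe` object, `original_image` and `mask_image` from the inpainting sketch above (both assumed inputs); the prompt wording and the library-scale `guidance_scale` values are illustrative assumptions, since the normalised g values reported here do not map one-to-one onto the library parameter.

```python
import torch
from itertools import product

strengths = [0.5, 0.6, 0.7, 0.8, 0.9]
guidance_scales = [4.0, 6.0, 8.0, 10.0]   # assumed library-scale counterparts of the normalised g values

grid = {}
for s, g in product(strengths, guidance_scales):
    generator = torch.Generator("cuda").manual_seed(0)    # fixed seed isolates the effect of (s, g)
    grid[(s, g)] = pipe(prompt="a photograph of a building facade in Hong Kong",
                        image=original_image, mask_image=mask_image,
                        strength=s, guidance_scale=g, generator=generator).images[0]
# Inspect the (strength, guidance) grid qualitatively, as in Figure 6, before fixing the defaults.
```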
To avoid conflating related concepts, we use “geolocalisation” to refer to the classification task of predicting a façade’s location, “geographical prompt” to denote the city-specific input prompt used for image generation, and “geographical consistency” to describe how well a generated image visually corresponds to its intended location.

5. Experiments

5.1. Geolocalisation Experiment

The ability to interpret and synthesise geographically consistent architectural detail is important for recognising regional biases. This connects to the broader field of planet-scale image geolocalisation [52,53], an important task in GIScience and computer vision with use cases in autonomous driving, disaster management and human navigation. To test the model’s efficacy in producing geographically consistent street scenes, we conduct a geolocalisation (‘GeoGuesser’) experiment. For this experiment, we curate a dataset of 120 architectural façade images from 10 cities: [‘Hong Kong’, ‘Seoul’, ‘Tokyo’, ‘Kyoto’, ‘Shanghai’, ‘San Francisco’, ‘Paris’, ‘Amsterdam’, ‘Vienna’, and ‘London’]. The images were taken in central areas of the 10 cities between 2019 and 2024 (with the majority taken in 2024) using a standard mobile device camera (Apple iPhone 15 Pro). The main criterion for image inclusion was a clear, direct view of the architectural façade with no visual obstruction. Using the model pipeline described in the methodology section, we generated 2400 counterfactual images based on ten geographical text prompts, with one iteration using a fixed latent code and another using a random latent code. As there are no significant deviations between the model evaluation results of the two approaches, subsequent experiments use a single latent code for reproducibility and consistency (using the same latent code and prompt but conditioning on different images can generate images that share some structural similarities, e.g., lighting and perspective).
Buildings 15 03212 i001
Simple variants of the text prompts were qualitatively tested. In the future, more optimised prompts will be tested [54].
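The generation loop for this experiment can be sketched as below, reusing the `pipe` object from the earlier inpainting sketch; `dataset` (the 120 façade images with their masks) and the prompt wording are assumptions, since the study’s exact prompt template is shown in the figure above.

```python
import torch

CITIES = ["Hong Kong", "Seoul", "Tokyo", "Kyoto", "Shanghai",
          "San Francisco", "Paris", "Amsterdam", "Vienna", "London"]

def geo_prompt(city: str) -> str:
    # Hypothetical wording; the study's exact geographical prompt template differs.
    return f"a photograph of a building facade in {city}"

counterfactuals = []
for img_id, (image, mask) in enumerate(dataset):          # 120 facade images and their CLIPSeg masks
    for city in CITIES:
        for seed in (0, None):                            # fixed latent code vs. random latent code
            gen = torch.Generator("cuda").manual_seed(seed) if seed is not None else None
            out = pipe(prompt=geo_prompt(city), image=image, mask_image=mask,
                       strength=0.7, guidance_scale=8.0, generator=gen).images[0]
            counterfactuals.append({"image_id": img_id, "city": city, "seed": seed, "image": out})
# 120 images x 10 cities x 2 latent codes = 2400 counterfactuals
```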

Model-Based Evaluations

We then evaluated the alignment of the synthetic images with the geographical text prompts using model-based evaluation. Specifically, we use a CLIP-based contrastive learning model called GeoCLIP [55] to infer, for each image, the probability of its location, framed as a ten-city zero-shot classification problem. For evaluation, we report overall accuracy as well as confusion matrices. The confusion matrix compares the predicted city labels (columns) against the ground-truth labels (rows). The diagonals represent correctly classified samples, while the off-diagonals highlight misclassifications. We validate the pretrained geolocalisation model on the original image dataset as ground-truth, achieving an overall accuracy of acc ≈ 0.90, as described in the results section.
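Since GeoCLIP exposes its own coordinate-based interface, the sketch below substitutes a generic CLIP zero-shot city classifier as a stand-in that mirrors the evaluation logic (ten-city classification, accuracy and confusion matrix); the checkpoint name and the prompt wording are assumptions, and `counterfactuals` and `CITIES` come from the generation sketch above.

```python
import torch
from transformers import CLIPModel, CLIPProcessor
from sklearn.metrics import accuracy_score, confusion_matrix

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
city_texts = [f"a building facade in {c}" for c in CITIES]

def predict_city(image) -> int:
    inputs = clip_proc(text=city_texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = clip_model(**inputs).logits_per_image    # similarity of the image to each city description
    return int(logits.argmax(dim=-1))

y_true = [CITIES.index(c["city"]) for c in counterfactuals]
y_pred = [predict_city(c["image"]) for c in counterfactuals]
print("overall accuracy:", accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))                   # rows: ground-truth city, columns: predicted city
```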

5.2. Perceptual Alignment Experiment

Understanding how humans perceive and respond to architectural features is important for evaluating AI-generated designs. This research will also study how AI-generated imagery from perceptual text prompts aligns with both human and machine evaluations. The theory-inspired perceptual text prompts are classified into three categories: (i) Objective Verbal Descriptors, (ii) Affective Verbal Descriptors, and (iii) Higher-order Affective Verbal Descriptors. This categorisation aligns with established theories of visual processing and cognition such as Feature Integration Theory that supports the effectiveness of objective descriptors by explaining how basic visual features are detected and integrated, and Dual-Process Theories [56,57,58] which highlight how affective descriptors engage automatic, emotional responses (System 1), while higher-order descriptors involve more deliberate, cognitive evaluation (System 2). This hierarchical structure effectively captures the progression from concrete, observable features to more abstract and emotionally nuanced interpretations, improving our understanding of human engagement in architecture. For this research, we hypothesised that AI-generated imagery will show stronger alignment with Objective Verbal Descriptors, followed by Affective and Higher-order Verbal Descriptors. The reason for this is that AI models primarily rely on pattern matching, which makes them adept at generating objective architectural features but may overlook the subtleties and variances in human emotions and abstract cognitive evaluations derived from embodied lived experience [59].

5.2.1. Perceptual Text Prompts

From an initial list of 15 contrastive pairs, we selected 10 pairs based on maximising the cosine distance of the contrastive word embeddings [60]. This is to ensure that the word-pairs are as different as possible, thereby minimising overlaps.
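The selection criterion is not fully specified (within-pair distance versus overlap across pairs); the sketch below takes one plausible reading and ranks candidate pairs by the cosine distance between the two words of each pair, keeping the most contrastive ones. The embedding model name is an assumption, as reference [60] only specifies a sentence-embedding approach, and the candidate list is abbreviated.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed sentence-embedding model; any model with the same interface works.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

candidate_pairs = [("colorful", "dull"), ("angular", "curvy"), ("utopian", "dystopian")]  # abbreviated

def contrast_score(pair) -> float:
    """Cosine distance between the two words of a pair (higher = more contrastive)."""
    a, b = encoder.encode(list(pair), convert_to_tensor=True)
    return 1.0 - float(util.cos_sim(a, b))

# Keep the 10 most contrastive of the 15 candidate pairs.
selected = sorted(candidate_pairs, key=contrast_score, reverse=True)[:10]
```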
Objective Verbal Descriptors (4 pairs): These descriptors focus on tangible attributes such as “Colorful vs. Dull”, “Angular vs. Curvy”, “Symmetrical vs. Asymmetrical” and “Textured vs. Smooth”, as they correspond well with Low-Level Visual Processing theories such as Feature Integration Theory, which concerns the detection of basic visual features.
Affective Verbal Descriptors (4 pairs): These descriptors, like “tense vs. relaxing”, “welcoming vs. uninviting”, “stimulating vs. not stimulating” and “safe vs. unsafe”, introduce an emotional dimension that aligns with Intermediate-Level Visual Processing and Affective Appraisal Theories. These descriptors engage emotional responses that are more subjective but still fundamental in perception.
Higher-order Affective Verbal Descriptors (2 pairs): These descriptors such as “harmonious vs. discordant” and “utopian vs. dystopian” require deeper cognitive and emotional processing, fitting well with High-Level Visual Processing and Top-Down Processing Theories. These descriptors involve more complex interpretations that are often influenced by cultural and personal contexts.
Using the same set of 120 architectural façade images, we generated 2400 counterfactual images by applying 10 contrastive text prompts (e.g., “colorful vs. dull”), each with a single latent code. Each synthetic image is created using the following text prompt, where {characteristics} corresponds to the 10 perceptual text prompts defined above.
Buildings 15 03212 i002

5.2.2. Human-Based Evaluation

We conduct self-reported human-based evaluations to gauge the perceptual response to the generated images. We designed a multiple-choice task as a pilot experiment linking perceptual descriptors with the synthesised images, proceeding as follows: (i) select 10 contrastive text pairs (objective, affective, higher-order affective); (ii) generate 30 image-pairs derived from 6 base images, selected for their geographical diversity (Amsterdam, Hong Kong, Kyoto, San Francisco and Shanghai) and diverse architectural styles (Kyoto machiya, San Francisco terraces, Shanghai shikumen, Hong Kong modernist skyscrapers, and Amsterdam canal architecture); (iii) for each image-pair, generate four multiple-choice options (one correct and three random incorrect answers); an example of the multiple-choice survey is shown in Figure 7, and the full survey is given in Appendix A; (iv) conduct the evaluation on Amazon Mechanical Turk (AMT), an online crowdsourcing marketplace run by Amazon where requesters (such as researchers or companies) post small tasks, called Human Intelligence Tasks (HITs), to be completed by workers (“Turkers”), recruiting only qualified expert annotators (known as “Masters Workers” in AMT, selected for quality assurance); (v) recruit a total of 46 workers to rate 750 image-pairs; after data cleaning, which removed participants who completed only a single rating and tasks completed in under ten seconds (a threshold used to mitigate automated or inattentive task completion), we retained 646 user ratings from 31 participants, with a mean task completion time of 176 s and a mean of 21.7 ratings per image (stdev ≈ 1.29); (vi) finally, report overall accuracy as the main evaluation metric, defined as $\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}[\hat{y}_i = y_i]$, where $\hat{y}_i$ and $y_i$ denote the predicted and true labels, respectively.
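A minimal sketch of the cleaning rules and the accuracy metric is shown below; the CSV file and its column names are hypothetical stand-ins for the AMT export.

```python
import pandas as pd

# Hypothetical export of the AMT results; column names are illustrative.
ratings = pd.read_csv("amt_ratings.csv")   # columns: worker_id, image_pair_id, duration_s, answer, correct_answer

# Cleaning mirrors the rules above: drop sub-10-second tasks and single-rating participants.
ratings = ratings[ratings["duration_s"] >= 10]
counts = ratings["worker_id"].value_counts()
ratings = ratings[ratings["worker_id"].isin(counts[counts > 1].index)]

# Overall accuracy: Acc = (1/N) * sum(1[y_hat_i == y_i]).
accuracy = (ratings["answer"] == ratings["correct_answer"]).mean()
print(f"{accuracy:.2f} over {len(ratings)} ratings from {ratings['worker_id'].nunique()} participants")
```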
While the evaluation provides initial insights, it has several limitations. First, we did not restrict the geographical location or occupations of workers, which may introduce cultural and demographic biases in how perceptual descriptors were interpreted. Second, given the pilot scale of the study, the number of participants and ratings is relatively modest, which may limit statistical power. Third, inter-rater agreement was low (mean agreement ≈ 0.28), highlighting the difficulty and subjectivity of the multiple-choice task. Future work will address these limitations by conducting a more comprehensive user study on perceptual alignment, including expanding the participant pool to capture more cultural and demographic diversity, introducing more cities/regions and exploring different alignment tasks such as image ranking and natural text descriptions [19].

5.2.3. Model-Based Evaluations

To complement and cross-check the human evaluations, we apply SigLIP [61], a contrastive vision–language model similar to GeoCLIP, as a zero-shot image classifier on the same multiple-choice questions. For each question, SigLIP estimates a probability score for each of the candidate answers given the prompt. The prediction is considered correct if the correct prompt receives the highest score. We report the same classification accuracy as the evaluation metric. By comparing the outcomes of the model-based and human-based evaluations, we examine how the two sets of results covary. This analysis serves as a soft cross-check on the consistency of the human-based evaluation and provides preliminary insights into the potential of LLM-as-a-Participant/Subject for generative images, while also helping to identify systematic biases.
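The scoring step can be sketched as below using the SigLIP classes in transformers; the checkpoint name is an assumption, and while the study scores image-pairs, the sketch scores a single image for brevity (a simple extension averages the scores over the two images of a pair).

```python
import torch
from transformers import AutoModel, AutoProcessor

# Assumed SigLIP checkpoint; any SigLIP variant with this interface behaves similarly.
siglip = AutoModel.from_pretrained("google/siglip-base-patch16-224")
siglip_proc = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

def answer_multiple_choice(image, options):
    """Score each candidate descriptor pair against an image and return the index of the best match."""
    inputs = siglip_proc(text=options, images=image, padding="max_length", return_tensors="pt")
    with torch.no_grad():
        logits = siglip(**inputs).logits_per_image        # one logit per (image, option) pair
    probs = torch.sigmoid(logits)                         # SigLIP scores with a sigmoid, not a softmax
    return int(probs.argmax(dim=-1))
```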

6. Results

6.1. Geolocalisation Experiment

The results of the geolocalisation experiment in Table 1 show that the generative model achieves an accuracy of 0.39 across the ten cities, reflecting a moderate capability to generate geographically plausible counterfactuals. This performance remains below that of the baseline model on the original images, which achieves an accuracy of 0.90. Figure 8 presents the confusion matrices for the original and the generated images, where the model visibly over-predicts London for the counterfactuals. We further conduct a subgroup analysis for the five East Asian cities and the five Western cities. The Western subset shows higher accuracy (acc ≈ 0.58) than the East Asian subset (acc ≈ 0.43). These results suggest that generative models possibly reflect regional biases in their training data.
To explore these patterns, we visualise successful and failed examples. The top row of Figure 9 shows a neoclassical building in Paris as the original, alongside a similar neoclassical building generated for London and a more distinctly East Asian-style building generated for Kyoto. These results highlight the nuances and similarities between London and Paris while also demonstrating the model’s capacity to produce architectural details for more geographically distant targets such as Kyoto. In contrast, the bottom row, shown in Figure 10, presents a mixed-use building in Shanghai as the original, while the two geographical counterfactuals lack the distinctive design features that would emphasise this difference. These results help explain why London and Shanghai are among the most frequently confused pairs in the confusion matrix. This may be partly explained by the fact that Shanghai’s former foreign concession areas feature diverse architectural styles, including terraced housing reminiscent of that found in London.
To further interpret the semantic nature of the counterfactual images, we first encode the images into their latent embeddings $Z_{clip}$ using a CLIP image encoder [62]. We then run PCA and plot the first and second principal components, $PC1_{cf}$ and $PC2_{cf}$, of the counterfactual images. We colour each point based on its counterfactual city and label the approximate centroid of each city cluster (as this is an inpainting exercise, we difference out the original image’s principal component values, $PC_{orig} - PC_{cf}$, to focus on the shifts captured from inpainting). The results show clustering between cities that are geographically close, such as ‘Kyoto’ and ‘Tokyo’, ‘Shanghai’ and ‘Hong Kong’, or ‘London’ and ‘Paris’, as shown in Figure 11. However, the results also show notable geographical entanglement. The underlying reasons need to be explored more fully in future research.
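The embedding-and-PCA step can be sketched as follows; the CLIP checkpoint name is an assumption, and `pairs` is an assumed list of (original, counterfactual, city) triples for the geolocalisation set.

```python
import numpy as np
import torch
from sklearn.decomposition import PCA
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_embed(image) -> np.ndarray:
    inputs = clip_proc(images=image, return_tensors="pt")
    with torch.no_grad():
        return clip_model.get_image_features(**inputs).squeeze().numpy()

emb_cf = np.stack([clip_embed(cf) for _, cf, _ in pairs])
emb_orig = np.stack([clip_embed(orig) for orig, _, _ in pairs])

pca = PCA(n_components=2).fit(emb_cf)
shift = pca.transform(emb_orig) - pca.transform(emb_cf)   # PC_orig - PC_cf, the shift introduced by inpainting
# Plot shift[:, 0] vs. shift[:, 1], coloured by counterfactual city, and label cluster centroids (Figure 11).
```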

6.2. Perceptual Experiment

The perceptual experiment results are summarised in Table 2. Starting with the model evaluation, we observe that prompts involving objective descriptors, which require less visual processing, tend to achieve higher accuracy, such as colourful/dull (acc ≈ 0.85) and angular/curvy (acc ≈ 0.72). In contrast, affective prompts, which arguably require more nuanced visual processing, show lower accuracy, such as relaxing/tense (acc ≈ 0.54) and harmonious/discordant (acc ≈ 0.58). For the multi-class problem in Table 3, objective descriptors such as colours (acc ≈ 0.86) and materials (acc ≈ 0.65) have notably higher model accuracy than more complex affect descriptors (acc ≈ 0.36); the affect terms are borrowed from the valence and arousal grid of [63].
We then present the human evaluation results, whose accuracy is generally lower than that of the model evaluation, suggesting the difficulty of the multiple-choice task. Objective tasks show stronger alignment with human responses, as seen with colorful/dull (acc ≈ 0.58) and angular/curvy (acc ≈ 0.38), compared to affective tasks such as safe/unsafe (acc ≈ 0.24) and stimulating/non-stimulating (acc ≈ 0.27). There are some outliers, such as symmetrical/asymmetrical (acc ≈ 0.23), which displays lower accuracy despite being an objective descriptor. Additionally, there are minimal differences between affective and higher-order affective descriptors for both model and human evaluations.
We similarly plot objective counterfactual pairs (Figure 12, left) and affective pairs (Figure 12, right). The results indicate that images generated from objective prompts are visually more distinctive than those generated from affective prompts. For example, images produced with contrasting prompts such as “Stone and Glass” are more visually distinguishable than those produced with “Calm and Stressful”.
Furthermore, we plotted the overall trends between human-based and model-based evaluation scores (accuracies) for each pair of perceptual prompts. The results indicate relative correspondence, as shown in Figure 13 (r ≈ 0.90). This human and model alignment acts as a soft validity/consistency check for the survey. These results also align with recent work treating an LLM as a Participant [64].

7. Discussion

We examined how well machine-generated façade images can align with geographical and perceptual descriptors. Specifically, we found notable alignment but also discrepancies in both the geolocalisation experiment and the perceptual alignment experiment. Results from the first experiment indicate that these generative models have implicitly learned some architectural details relevant to geographical location, such as the neoclassical windows in European cities [65]. However, we also observe that these generative models perform better for certain regions. Results from the second experiment show that these generative models produce images that align better with more objective descriptors (e.g., colourful/dull) and less well with affective descriptors (e.g., relaxing/tense). The initial results are consistent with the contextual alignment gap highlighted by Hazbei and Cucuzzella [11], and this gap remains evident with more advanced generative methods (SDXL). The discrepancy can be explained in multiple ways.

7.1. Discrepancies Between Objective and Affective Alignment

Computationally, lower performance in the affective task can be related to model expressivity and training data biases, where an open-source LDM pretrained on the well-established LAION-5B dataset [46] may lack exposure to patterns of, say, stressful architecture (e.g., broken windows) [66]. This can potentially be further studied through more advanced generative models (e.g., Flux) and commercial tools (e.g., MidJourney) and through fine-tuning with human feedback on perceptual data in architecture [67].
On the other hand, these discrepancies suggest that architectural perception may involve higher-order, top-down cognitive processes [56,57,58]. Models without explicit embodiment are less likely to produce convincing affective imagery, as these qualities are shaped by how humans bodily and emotionally experience architecture. Affective imagery is also likely to be more heterogeneous and varied than objective imagery. For example, how we perceive “dystopian” or “stressful” can vary significantly depending on our experience and culture. Future research on perceptual alignment can incorporate greater cultural, demographic and geographical diversity in the research design. This can lead to deeper investigation involving culturally enhanced embodied generative systems [54] (e.g., culture-aware prompts, bio/neural feedback).
Furthermore, we note that the separation between objective and affective descriptors is complex and not absolute. Although our categorisation was motivated by cognitive theories of visual processing, both objective and affective descriptors still involve subjective judgement and are often culturally dependent. Despite selecting more contrastive pairs, clear semantic overlaps remain. For example, “Utopian and dystopian” and “Welcoming and Uninviting” share perceptual similarities (see Appendix A, Table A1). This is possibly more notable for affective descriptors in comparison to objective descriptors such as “colorful/dull-color”. This represents both a limitation of the study where results should be interpreted carefully, and a feature of these affective descriptors. Future research design can incorporate image ranking, natural text descriptions and multiple-answer study design to possibly better measure these overlaps and nuances.

7.2. Opportunities and Risks of Generative AI for Design

Future research will continue to investigate these conjectures and examine whether more embodied AI systems (e.g., prompt engineering and multi-modal learning) can address these higher-order cognitive tasks. In exploring the alignment and discrepancies between AI-generated imagery and human perception, we emphasize the richness of human engagement with architectural spaces. This has particular relevance for automation in construction, where generative design tools can meaningfully support human-centred design. Furthermore, despite some disagreements (e.g., symmetric/asymmetric), the overall trends between human-based and model-based assessments show relative correspondence between the two. These results highlight the potential of leveraging “LLM-as-a-participant/subject” [64,68] to complement human evaluations for scalable perceptual generation and evaluation in future research.
At the same time, we must be aware of the risks: AI-driven global standardisation can potentially limit design diversity and reduce context-aware cultural design principles (e.g., eroding vernacular building traditions) leading to potentially homogenous design solutions. Integrating these tools into automated design workflows will require careful attention to these perceptual gaps to ensure that AI-generated outputs remain culturally and emotionally aware. To address this, designers are encouraged to treat AI as a design tool to assist (e.g., automating routine tasks) rather than replace human creativity and critical thinking in the design loop. This is not only important in the design workflow but also in educating the next generation of designers.
By highlighting both the opportunities and limitations of current AI in fully capturing the complexity of lived experience, this paper offers not only a framework for evaluating perceptual and emotional alignment in synthetic imagery, but also serves as a reflection of humans’ complex engagement with the built environment.

8. Declaration of Generative AI and AI-Assisted Technologies in the Writing Process

During the preparation of this work, the author(s) used ChatGPT by OpenAI for checking grammar. After using this tool, the author(s) reviewed and edited the content as needed and take full responsibility for the content of the published article.

Author Contributions

Conceptualization, S.L., C.V., Y.K.; Methodology, S.L., C.V., Y.K., J.T.; Formal analysis, S.L.; Writing—original draft, S.L., C.V. and Y.K.; Writing—review and editing, S.L., C.V., Y.K., C.I.S., J.T., M.G.M. and H.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

We thank the editor for the guidance and the reviewers for their constructive comments, which have improved the quality of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Perceptual Alignment—User Survey

Table A1 shows the four multiple-choice answers for the question: “Which of the following text pairs best describes the image-pairs? (A, B, C, D)”. Figure A1 shows another example of the multiple-choice questions.
Figure A1. Example of the Human Evaluation Self-reported Survey Question. (Answer: c).
Buildings 15 03212 g0a1
Table A1. List of paired adjectives used for the perceptual alignment task.
A | B | C | D
Utopian and dystopian | angular and curvy | safe and unsafe | Textured and Smooth
colorful and dull-color | Stimulating and Unstimulating | Utopian and dystopian | Relaxing and Tense
Relaxing and Tense | Welcoming and Uninviting | Utopian and dystopian | safe and unsafe
Textured and Smooth | colorful and dull-color | Welcoming and Uninviting | harmonious and discordant
Symmetrical and Asymmetrical | Welcoming and Uninviting | harmonious and discordant | Textured and Smooth
Utopian and dystopian | Welcoming and Uninviting | safe and unsafe | Stimulating and Unstimulating
safe and unsafe | harmonious and discordant | Relaxing and Tense | Stimulating and Unstimulating
Stimulating and Unstimulating | angular and curvy | Symmetrical and Asymmetrical | Textured and Smooth
Relaxing and Tense | Textured and Smooth | Utopian and dystopian | harmonious and discordant
Welcoming and Uninviting | Textured and Smooth | safe and unsafe | Relaxing and Tense
Stimulating and Unstimulating | Relaxing and Tense | Symmetrical and Asymmetrical | Textured and Smooth
colorful and dull-color | Symmetrical and Asymmetrical | Textured and Smooth | safe and unsafe
Textured and Smooth | angular and curvy | Symmetrical and Asymmetrical | Welcoming and Uninviting
harmonious and discordant | Symmetrical and Asymmetrical | Textured and Smooth | Stimulating and Unstimulating
Stimulating and Unstimulating | Textured and Smooth | Utopian and dystopian | Relaxing and Tense
Symmetrical and Asymmetrical | Textured and Smooth | Welcoming and Uninviting | Stimulating and Unstimulating
Welcoming and Uninviting | harmonious and discordant | angular and curvy | Relaxing and Tense
Stimulating and Unstimulating | safe and unsafe | colorful and dull-color | Symmetrical and Asymmetrical
Textured and Smooth | safe and unsafe | harmonious and discordant | Relaxing and Tense
Welcoming and Uninviting | angular and curvy | Stimulating and Unstimulating | Relaxing and Tense
Utopian and dystopian | Relaxing and Tense | colorful and dull-color | Symmetrical and Asymmetrical
Welcoming and Uninviting | Stimulating and Unstimulating | Symmetrical and Asymmetrical | angular and curvy
Textured and Smooth | Utopian and dystopian | harmonious and discordant | Symmetrical and Asymmetrical
Welcoming and Uninviting | Symmetrical and Asymmetrical | Textured and Smooth | Relaxing and Tense
harmonious and discordant | colorful and dull-color | Relaxing and Tense | Utopian and dystopian
Relaxing and Tense | angular and curvy | harmonious and discordant | colorful and dull-color
Utopian and dystopian | angular and curvy | Textured and Smooth | harmonious and discordant
Symmetrical and Asymmetrical | Symmetrical and Asymmetrical | Textured and Smooth | Stimulating and Unstimulating
angular and curvy | Textured and Smooth | Utopian and dystopian | Relaxing and Tense
harmonious and discordant | Symmetrical and Asymmetrical | Stimulating and Unstimulating | Relaxing and Tense

References

  1. Alexander, C. A New Theory of Urban Design; Center for Environmental Structure: Berkeley, CA, USA, 1987; Volume 6. [Google Scholar]
  2. Hillier, B.; Hanson, J. The Social Logic of Space; Cambridge University Press: Cambridge, UK, 1989. [Google Scholar]
  3. Batty, M.; Longley, P.A. Fractal Cities: A Geometry of Form and Function; Academic Press: Cambridge, MA, USA, 1994. [Google Scholar]
  4. Koenig, R.; Miao, Y.; Aichinger, A.; Knecht, K.; Konieva, K. Integrating urban analysis, generative design, and evolutionary optimization for solving urban design problems. Environ. Plan. B Urban Anal. City Sci. 2020, 47, 997–1013. [Google Scholar] [CrossRef]
  5. Wortmann, T. Model-based Optimization for Architectural Design: Optimizing Daylight and Glare in Grasshopper. Technol.|Archit.+Des. 2017, 1, 176–185. [Google Scholar] [CrossRef]
  6. Vermeulen, T.; Knopf-Lenoir, C.; Villon, P.; Beckers, B. Urban layout optimization framework to maximize direct solar irradiation. Comput. Environ. Urban Syst. 2015, 51, 1–12. [Google Scholar] [CrossRef]
  7. Jang, S.; Roh, H.; Lee, G. Generative AI in architectural design: Application, data, and evaluation methods. Autom. Constr. 2025, 174, 106174. [Google Scholar] [CrossRef]
  8. Parish, Y.I.; Müller, P. Procedural modeling of cities. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, Los Angeles, CA, USA, 12–17 August 2001; pp. 301–308. [Google Scholar]
  9. Müller, P.; Wonka, P.; Haegler, S.; Ulmer, A.; Van Gool, L. Procedural modeling of buildings. In ACM SIGGRAPH 2006 Papers; Association for Computing Machinery (ACM): New York, NY, USA, 2006; pp. 614–623. [Google Scholar]
  10. Jiang, F.; Ma, J.; Webster, C.J.; Chiaradia, A.J.; Zhou, Y.; Zhao, Z.; Zhang, X. Generative urban design: A systematic review on problem formulation, design generation, and decision-making. Prog. Plan. 2024, 180, 100795. [Google Scholar] [CrossRef]
  11. Hazbei, M.; Cucuzzella, C. Revealing a Gap in Parametric Architecture’s Address of “Context”. Buildings 2023, 13, 3136. [Google Scholar] [CrossRef]
  12. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  13. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  14. Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and improving the image quality of stylegan. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 8110–8119. [Google Scholar]
  15. Hartmann, S.; Weinmann, M.; Wessel, R.; Klein, R. Streetgan: Towards road network synthesis with generative adversarial networks. In Proceedings of the International Conferences in Central Europe on Computer Graphics, Visualization and Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  16. Chaillou, S. Archigan: Artificial intelligence × architecture. In Architectural Intelligence, Proceedings of the 1st International Conference on Computational Design and Robotic Fabrication (CDRF 2019), Shanghai, China, 7–8 July 2019; Springer: Berlin/Heidelberg, Germany, 2020; pp. 117–127. [Google Scholar]
  17. Wu, W.; Fu, X.M.; Tang, R.; Wang, Y.; Qi, Y.H.; Liu, L. Data-driven interior plan generation for residential buildings. ACM Trans. Graph. (TOG) 2019, 38, 1–12. [Google Scholar] [CrossRef]
  18. Wu, A.N.; Biljecki, F. GANmapper: Geographical data translation. Int. J. Geogr. Inf. Sci. 2022, 36, 1394–1422. [Google Scholar] [CrossRef]
  19. Law, S.; Hasegawa, R.; Paige, B.; Russell, C.; Elliott, A. Explaining holistic image regressors and classifiers in urban analytics with plausible counterfactuals. Int. J. Geogr. Inf. Sci. 2023, 37, 2575–2596. [Google Scholar] [CrossRef]
  20. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–22 June 2022; pp. 10684–10695. [Google Scholar]
  21. Ma, H.; Zheng, H. Text Semantics to Image Generation: A method of building facades design base on Stable Diffusion model. In Proceedings of the International Conference on Computational Design and Robotic Fabrication, Shanghai, China, 24 July 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 24–34. [Google Scholar]
  22. Zhou, F.; Li, H.; Hu, R.; Wu, S.; Feng, H.; Du, Z.; Xu, L. ControlCity: A Multimodal Diffusion Model Based Approach for Accurate Geospatial Data Generation and Urban Morphology Analysis. arXiv 2024, arXiv:2409.17049. [Google Scholar] [CrossRef]
  23. Shang, Y.; Lin, Y.; Zheng, Y.; Fan, H.; Ding, J.; Feng, J.; Chen, J.; Tian, L.; Li, Y. UrbanWorld: An Urban World Model for 3D City Generation. arXiv 2024, arXiv:2407.11965. [Google Scholar] [CrossRef]
  24. Zhuang, J.; Li, G.; Xu, H.; Xu, J.; Tian, R. TEXT-TO-CITY: Controllable 3D urban block generation with latent diffusion model. In Accelerated Design, Proceedings of the 29th International Conference of the Association for ComputerAided Architectural Design Research in Asia (CAADRIA), Singapore, 20–26 April 2024; CAADRIA: Hong Kong, China, 2024; pp. 169–178. [Google Scholar]
  25. Cui, X.; Feng, X.; Sun, S. Learning to generate urban design images from the conditional latent diffusion model. IEEE Access 2024, 12, 89135–89143. [Google Scholar] [CrossRef]
  26. Zhang, H.; Zhang, R. Generating accessible multi-occupancy floor plans with fine-grained control using a diffusion model. Autom. Constr. 2025, 177, 106332. [Google Scholar] [CrossRef]
  27. Shabani, M.A.; Hosseini, S.; Furukawa, Y. Housediffusion: Vector floorplan generation via a diffusion model with discrete and continuous denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5466–5475. [Google Scholar]
  28. Zhang, Z.; Fort, J.M.; Mateu, L.G. Exploring the potential of artificial intelligence as a tool for architectural design: A perception study using Gaudí’s works. Buildings 2023, 13, 1863. [Google Scholar] [CrossRef]
  29. Zhong, X.; Chen, W.; Guo, Z.; Zhang, J.; Luo, H. Image inpainting using diffusion models to restore eaves tile patterns in Chinese heritage buildings. Autom. Constr. 2025, 171, 105997. [Google Scholar] [CrossRef]
  30. Ibrahim, M.R.; Haworth, J.; Cheng, T. Understanding cities with machine eyes: A review of deep computer vision in urban analytics. Cities 2020, 96, 102481. [Google Scholar] [CrossRef]
  31. Biljecki, F.; Ito, K. Street view imagery in urban analytics and GIS: A review. Landsc. Urban Plan. 2021, 215, 104217. [Google Scholar] [CrossRef]
  32. Law, S.; Seresinhe, C.I.; Shen, Y.; Gutierrez-Roig, M. Street-Frontage-Net: Urban image classification using deep convolutional neural networks. Int. J. Geogr. Inf. Sci. 2020, 34, 681–707. [Google Scholar] [CrossRef]
  33. Salesses, P.; Schechtner, K.; Hidalgo, C.A. The collaborative image of the city: Mapping the inequality of urban perception. PLoS ONE 2013, 8, e68400. [Google Scholar] [CrossRef] [PubMed]
  34. Naik, N.; Philipoom, J.; Raskar, R.; Hidalgo, C. Streetscore-predicting the perceived safety of one million streetscapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Columbus, OH, USA, 23–28 June 2014; pp. 779–785. [Google Scholar]
  35. Dubey, A.; Naik, N.; Parikh, D.; Raskar, R.; Hidalgo, C.A. Deep learning the city: Quantifying urban perception at a global scale. In Computer Vision–ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part I 14; Springer: Berlin/Heidelberg, Germany, 2016; pp. 196–212. [Google Scholar]
  36. Kaplan, R.; Kaplan, S. The Experience of Nature: A Psychological Perspective; Cambridge University Press: Cambridge, UK, 1989. [Google Scholar]
  37. Ulrich, R.S. Aesthetic and affective response to natural environment. In Behavior and the Natural Environment; Springer: Berlin/Heidelberg, Germany, 1983; pp. 85–125. [Google Scholar]
  38. Gregory, R.L. The Intelligent Eye; Weidenfeld & Nicolson: London, UK, 1970. [Google Scholar]
  39. Neisser, U. Cognitive Psychology: Classic Edition; Psychology Press: Hove, UK, 2014. [Google Scholar]
  40. Scherer, K.R. Appraisal theory. In Handbook of Cognition and Emotion; John Wiley & Sons: Hoboken, NJ, USA, 1999. [Google Scholar]
  41. Salingaros, N.A.; Mehaffy, M.W. A Theory of Architecture; Umbau-Verlag Harald Püschel: Solingen, Germany, 2006. [Google Scholar]
  42. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Volume 33, pp. 6840–6851. [Google Scholar]
  43. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  44. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  45. Podell, D.; English, Z.; Lacey, K.; Blattmann, A.; Dockhorn, T.; Müller, J.; Penna, J.; Rombach, R. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv 2023, arXiv:2307.01952. [Google Scholar] [CrossRef]
  46. Schuhmann, C.; Beaumont, R.; Vencu, R.; Gordon, C.; Wightman, R.; Cherti, M.; Coombes, T.; Katta, A.; Mullis, C.; Wortsman, M.; et al. Laion-5b: An open large-scale dataset for training next generation image-text models. In Proceedings of the 36th International Conference on Neural Information Processing System, New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 25278–25294. [Google Scholar]
  47. Labs, B.F. FLUX. 2024. Available online: https://github.com/black-forest-labs/flux (accessed on 1 December 2024).
  48. Lüddecke, T.; Ecker, A. Image segmentation using text and image prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7086–7096. [Google Scholar]
  49. Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In Computer Vision—ECCV 2024, Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024, Proceedings, Part XLVII; Springer: Berlin/Heidelberg, Germany, 2024; pp. 38–55. [Google Scholar]
  50. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
  51. Zhang, L.; Rao, A.; Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 3836–3847. [Google Scholar]
  52. Haas, L.; Skreta, M.; Alberti, S.; Finn, C. Pigeon: Predicting image geolocations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 12893–12902. [Google Scholar]
  53. Dufour, N.; Kalogeiton, V.; Picard, D.; Landrieu, L. Around the world in 80 timesteps: A generative approach to global visual geolocation. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10–17 June 2025; pp. 23016–23026. [Google Scholar]
  54. Hao, Y.; Chi, Z.; Dong, L.; Wei, F. Optimizing prompts for text-to-image generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Volume 36, pp. 66923–66939. [Google Scholar]
  55. Vivanco Cepeda, V.; Nayak, G.K.; Shah, M. Geoclip: Clip-inspired alignment between locations and images for effective worldwide geo-localization. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Volume 36, pp. 8690–8701. [Google Scholar]
  56. Evans, J.S.B.; Stanovich, K.E. Dual-process theories of higher cognition: Advancing the debate. Perspect. Psychol. Sci. 2013, 8, 223–241. [Google Scholar] [CrossRef]
  57. Wason, P.C.; Evans, J.S.B. Dual processes in reasoning? Cognition 1974, 3, 141–154. [Google Scholar] [CrossRef]
  58. Kahneman, D. Thinking, Fast and Slow; Farrar, Straus and Giroux: New York, NY, USA, 2011. [Google Scholar]
  59. Lefebvre, H. The Production of Space; Wiley-Blackwell: Malden, MA, USA, 2012. [Google Scholar]
  60. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv 2019, arXiv:1908.10084. [Google Scholar] [CrossRef]
  61. Zhai, X.; Mustafa, B.; Kolesnikov, A.; Beyer, L. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 11975–11986. [Google Scholar]
  62. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021. [Google Scholar]
  63. Russell, J.A.; Weiss, A.; Mendelsohn, G.A. Affect grid: A single-item scale of pleasure and arousal. J. Personal. Soc. Psychol. 1989, 57, 493. [Google Scholar] [CrossRef]
  64. Dillion, D.; Tandon, N.; Gu, Y.; Gray, K. Can AI language models replace human participants? Trends Cogn. Sci. 2023, 27, 597–600. [Google Scholar] [CrossRef] [PubMed]
  65. Doersch, C.; Singh, S.; Gupta, A.; Sivic, J.; Efros, A.A. What makes Paris look like Paris? Commun. ACM 2015, 58, 103–110. [Google Scholar] [CrossRef]
  66. Hur, M.; Nasar, J.L. Physical upkeep, perceived upkeep, fear of crime and neighborhood satisfaction. J. Environ. Psychol. 2014, 38, 186–194. [Google Scholar] [CrossRef]
  67. Wallace, B.; Dang, M.; Rafailov, R.; Zhou, L.; Lou, A.; Purushwalkam, S.; Ermon, S.; Xiong, C.; Joty, S.; Naik, N. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 8228–8238. [Google Scholar]
  68. Manning, B.S.; Zhu, K.; Horton, J.J. Automated Social Scientific Hypothesis Generation and Testing with LLMs; Technical Report Working Paper 32381; National Bureau of Economic Research: Cambridge, MA, USA, 2024. [Google Scholar] [CrossRef]
Figure 1. Our generative façade pipeline has three components: (1) a segmentation module using SegClip [48]; (2) an image generation module using SDXL [45]; and (3) an evaluation module that consists of both (A) a geolocalisation experiment and (B) a perceptual alignment experiment.
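As a rough guide to how such a segment-then-inpaint pipeline can be assembled, the sketch below uses the Hugging Face transformers and diffusers libraries with publicly available checkpoints (CIDAS/clipseg-rd64-refined for text-prompted segmentation and stabilityai/stable-diffusion-xl-base-1.0 for inpainting). The checkpoint names, the binarisation threshold, and the example prompt are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal sketch of the segment-then-inpaint pipeline in Figure 1; checkpoint
# names, the mask threshold, and the prompt are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation
from diffusers import StableDiffusionXLInpaintPipeline

# (1) Text-prompted segmentation of the façade region to be edited.
seg_processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
seg_model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("facade.jpg").convert("RGB")
inputs = seg_processor(text=["building facade"], images=[image],
                       padding="max_length", return_tensors="pt")
with torch.no_grad():
    logits = seg_model(**inputs).logits.squeeze()     # low-resolution heatmap
mask = (torch.sigmoid(logits) > 0.5).numpy()          # binarise (assumed threshold)
mask_image = Image.fromarray((mask * 255).astype("uint8")).resize(image.size)

# (2) Localised inpainting with SDXL conditioned on a geographical prompt.
pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
counterfactual = pipe(
    prompt="a residential street facade in Kyoto, Japan",
    image=image,
    mask_image=mask_image,
    strength=0.7,        # cf. the s (scale) parameter in Figure 6
    guidance_scale=8.0,  # illustrative; the paper reports g on its own scale
).images[0]
counterfactual.save("kyoto_counterfactual.png")

# (3) The evaluation module then geolocalises and scores the counterfactuals
#     (see the sketches after Figure 11, Figure 13 and Table 3).
```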
Figure 2. Segmentation masks from three different text prompts. From left to right: ‘windows and doors’, ‘windows’, ‘doors’.
Figure 3. Original image (left), a Hong Kong counterfactual with localised inpainting (middle), and an unrealistic counterfactual without inpainting in which the canal is covered by concrete (right).
Figure 4. Original San Francisco image (left), a Kyoto counterfactual with targeted architectural inpainting (middle), and an unrealistic counterfactual without inpainting where the shape of the road has changed (right).
Figure 5. Original image from Shanghai (top), and counterfactual images of Paris, Kyoto, and Tokyo generated with two different random codes (middle and bottom), showing good image diversity.
Figure 6. Visualising architectural façade counterfactuals with different guidance and scale parameters. We select s = 0.7 and g = 0.8 as the default parameters for our generative process.
Figure 7. An example of the self-reported survey question used in the human evaluation for the perceptual alignment experiment. (Answer: b).
Figure 8. Confusion matrices for both (a) original images and (b) generated images.
Figure 9. Geographically-consistent counterfactuals. Paris—Original (left); London (middle); and Kyoto (right).
Figure 10. Geographically-inconsistent counterfactuals. Shanghai—Original (left); London (middle); and Kyoto (right).
Figure 11. Image embeddings of the counterfactual images show clustering among cities that are geographically close to one another, though they also reveal a degree of entanglement.
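The clustering in Figure 11 can be reproduced in spirit with a generic image encoder and a 2D projection; the sketch below assumes a CLIP ViT-B/32 encoder, a folder-per-city layout, and a t-SNE projection, none of which are specified by the figure itself.

```python
# Minimal sketch of the embedding visualisation in Figure 11; the encoder,
# folder layout, and t-SNE projection are assumptions, not the paper's setup.
from pathlib import Path
import torch
import matplotlib.pyplot as plt
from PIL import Image
from sklearn.manifold import TSNE
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Assumed layout: counterfactuals/<city>/<image>.png
paths = sorted(Path("counterfactuals").glob("*/*.png"))
labels = [p.parent.name for p in paths]
images = [Image.open(p).convert("RGB") for p in paths]

inputs = processor(images=images, return_tensors="pt")
with torch.no_grad():
    feats = model.get_image_features(**inputs)         # (N, 512) image embeddings
feats = feats / feats.norm(dim=-1, keepdim=True)       # unit-normalise

# Project to 2D (perplexity is illustrative; requires more than 5 images).
coords = TSNE(n_components=2, perplexity=5).fit_transform(feats.numpy())
for city in sorted(set(labels)):
    idx = [i for i, label in enumerate(labels) if label == city]
    plt.scatter(coords[idx, 0], coords[idx, 1], label=city, s=12)
plt.legend()
plt.savefig("embedding_clusters.png", dpi=200)
```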
Figure 12. Perceptual alignment examples show that objective counterfactuals such as “Stone and Glass” on the left are more distinguishable than affective counterfactuals such as “Calm and Stressful” on the right.
Figure 13. Scatterplot showing correspondence between human-based and model-based evaluations.
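Assuming the scatterplot in Figure 13 plots the per-attribute accuracies reported in Table 2, the correspondence between the two evaluations can be summarised with a rank or linear correlation, as in the sketch below (the choice of correlation measure is an assumption, not the paper's stated method).

```python
# Minimal sketch of the human-vs-model correspondence in Figure 13, using the
# per-attribute Acc_m and Acc_h values from Table 2; the correlation measures
# are assumptions about how the correspondence might be quantified.
from scipy.stats import pearsonr, spearmanr

acc_m = [0.85, 0.72, 0.64, 0.65, 0.64, 0.55, 0.54, 0.53, 0.59, 0.58]
acc_h = [0.58, 0.38, 0.23, 0.30, 0.35, 0.24, 0.12, 0.27, 0.24, 0.28]

r, p_r = pearsonr(acc_m, acc_h)
rho, p_rho = spearmanr(acc_m, acc_h)
print(f"Pearson r = {r:.2f} (p = {p_r:.3f}); Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
```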
Table 1. Accuracy results of the geolocalisation experiment for generative and baseline models across different subsets. The results indicate that geolocalisation on generated images attains moderate accuracy relative to the original images, with a notable bias between the two regional subsets.

Model/Subset                        Accuracy
Baseline images (existing)          0.90
Generative images (all cities)      0.39
  · Western cities (subset)         0.58
  · East Asian cities (subset)      0.43
Table 2. Classification Accuracy for Contrastive Attributes. Objective descriptors (Obj) consistently outperform affective (Aff) and higher-order affective descriptors (HiA) in both human (Acc_h) and model (Acc_m) evaluations, indicating that AI models can produce observable visual patterns more effectively than abstract perceptual ones.

Type   Contrastive Attribute              Acc_m   Acc_h   N
Obj    Colorful vs. Dull-color            0.85    0.58    72
Obj    Angular vs. Curvy                  0.72    0.38    73
Obj    Symmetrical vs. Asymmetrical       0.64    0.23    71
Obj    Textured vs. Smooth                0.65    0.30    70
Aff    Welcoming vs. Uninviting           0.64    0.35    72
Aff    Safe vs. Unsafe                    0.55    0.24    72
Aff    Relaxing vs. Tense                 0.54    0.12    73
Aff    Stimulating vs. Not Stimulating    0.53    0.27    74
HiA    Utopian vs. Dystopian              0.59    0.24    72
HiA    Harmonious vs. Discordant          0.58    0.28    71
Table 3. Model-based classification accuracy for different multi-class categories, where objective descriptors (Obj) similarly outperform affective (Aff) ones.

Type   Multi-Class Category                    Acc_m
Obj    Red–green–blue–yellow–purple–orange     0.86
Obj    Brick–glass–stone–wood                  0.65
Aff    Exciting–depressing–calm–stressful      0.36
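A model-based accuracy of the kind reported in Tables 2 and 3 can be approximated by zero-shot classification of each generated image against its contrastive (or multi-class) descriptor set. The sketch below uses a CLIP checkpoint and a prompt template that are illustrative assumptions rather than the paper's exact evaluator.

```python
# Minimal sketch of a model-based evaluator (Acc_m) via zero-shot classification;
# the checkpoint and prompt template are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def classify(image_path: str, descriptors: list[str]) -> str:
    """Return the descriptor that CLIP scores highest for the given façade image."""
    image = Image.open(image_path).convert("RGB")
    prompts = [f"a photo of a {d} building facade" for d in descriptors]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    return descriptors[int(probs.argmax())]

# Acc_m is then the fraction of generated images whose predicted descriptor
# matches the descriptor used in the generation prompt, e.g. for one pair:
prediction = classify("generated/angular_001.png", ["angular", "curvy"])
```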
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
