Appendix A. LLM Reliability Assessment and Test–Retest Consistency
Appendix A.1. Emotional Criteria
A reliability assessment helps to identify the consistency and dependability of the LLM-based architectural evaluation tool. The inherently stochastic nature of generative AI systems means that identical prompts can yield different responses [121]. This appendix specifies how to: (i) obtain reliable decisions from a single LLM via repeated runs with the same prompt; (ii) compare results from different LLMs; and (iii) estimate uncertainty with standard measures. Systematic uncertainties are identified and quantified. The reliability analysis is independent of the LLM model used, showing that, despite run-to-run variations, criteria-driven prompts lead to convergent results.
The validation performed in this Appendix involves collecting multiple independent LLM outputs from parallel runs and then statistically comparing their results for convergence. A human analyst performs the ensemble consensus check. This classical multi-LLM reliability assessment estimates both test–retest and cross-model robustness. By design, no model evaluates another’s output. It is therefore very different from the newer and more sophisticated “LLM-as-a-judge” methodology employed in the main body of this paper, which does not rely upon human comparison.
At the time of writing, readers wishing to investigate LLM-based diagnostics are advised to use ChatGPT’s most powerful reasoning model. Perplexity Research and Qwen also give good results for analyses of this type. This recommendation is justified by LLM experiments that returned reliable results from brief, succinct prompts: the LLM drew upon neuroscientific data and did not rely upon common assumptions or subjective narratives. To achieve the same consistency, other LLMs need additional prompt refinement, such as more detailed instructions and an explicitly specified direction in the prompt.
A test–retest reliability analysis is conducted by querying ChatGPT-4o ten times with identical prompts for the same architectural pair in Figure 1. The standard deviation from the mean is the simplest consistency measure. (A test–retest Pearson reliability coefficient r is not useful here because successive trials vary randomly and carry no paired structure for r to capture.) This version of ChatGPT was chosen for the test because it was the most widely used at the time of writing, being replaced by ChatGPT-5 only after the manuscript was completed.
A slightly modified prompt is employed, and each query is entered as a new chat:
“Use the set of ten qualities {beauty, calmness, coherence, comfort, empathy, intimacy, reassurance, relaxation, visual pleasure, well-being} (“beauty–emotion cluster”) that elicit a positive-valence feeling from a person while physically experiencing a built structure, to investigate the two uploaded pictures of similar buildings. Evaluate the conjectured relative emotional feedback by comparing the two images in a binary preference (1 for the preferred image and 0 for the rejected image for each of the 10 qualities) to give a preference for one over the other. The sum of the values for each image should be 10. Give the answer as (LHS, RHS).”
ChatGPT-4o produced the following results when evaluating the emotional criteria for the department store in Figure 1: (LHS, RHS) = (0, 10) seven times and (1, 9) three times.
Mean preference = (0.3, 9.7) and standard deviation = (0.5, 0.5).
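For readers who wish to reproduce such consistency figures, the following minimal Python sketch (our illustration, not part of the evaluation workflow) computes the mean preference and standard deviation from the ten repeated outputs listed above, using the standard-library statistics module.

```python
# Minimal sketch (illustrative only): test-retest consistency from repeated runs.
from statistics import mean, stdev

# (LHS, RHS) outputs of ten independent chats with the identical prompt.
runs = [(0, 10)] * 7 + [(1, 9)] * 3

lhs = [l for l, _ in runs]
rhs = [r for _, r in runs]

print("mean preference:   ", (round(mean(lhs), 1), round(mean(rhs), 1)))    # (0.3, 9.7)
print("standard deviation:", (round(stdev(lhs), 1), round(stdev(rhs), 1)))  # (0.5, 0.5)
```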
The RHS image is favored overwhelmingly over the ten runs, picking up almost 10 out of the 10 emotional properties on average. The lesson for researchers is that, to improve reliability, an evaluation should be repeated several times. Extensive trials indicated that the best model for this comparative analysis was the more advanced ChatGPT 4.5 rather than 4o, and it is ChatGPT 4.5 that this paper quotes for the emotional evaluation. (ChatGPT-5 Thinking was used for additional trials after its release.) The detailed explanations given by ChatGPT 4.5 proved to be incisive and unbiased. According to OpenAI, version 4.5 is slower but more deterministically reliable in structured scoring tasks than 4o, because 4.5 has lower stochastic entropy and is better aligned with fixed evaluation frameworks.
The second reliability assessment checks whether different LLM versions, and distinct LLMs, produce comparable results. Inter-version (or cross-model) reliability is established by comparing evaluations across ChatGPT-4o, o3, o4-mini, o4-mini-high, 4.5, 4.1, and 5 Thinking using the image set in Figure 1. The evaluation trial is extended to include the LLMs Copilot and Perplexity (neither of which has its own AI engine but relies on those of other LLMs), and Qwen (which has its own independent AI engine). The following numbers will of course change over repeated runs, so they are merely an indication of what to look for in a reliability check.
Single-trial results from ChatGPT-4o, o3, o4-mini-high, 4.5, 4.1, 5 Thinking, Copilot, and Perplexity were all equal for this case: (LHS, RHS) = (0, 10), whereas Qwen3 scored (1, 9) and ChatGPT o4-mini scored (2, 8).
Mean preference = (0.3, 9.7) and standard deviation = (0.7, 0.7).
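The cross-model check uses the same two measures, now computed over one trial per model. A short sketch follows (ours; the dictionary keys are shorthand labels for the models named above), showing how the pooled statistics and the fraction of models preferring the RHS image can be obtained.

```python
# Minimal sketch (illustrative only): cross-model consistency from single trials.
from statistics import mean, stdev

single_trials = {
    "ChatGPT-4o": (0, 10), "o3": (0, 10), "o4-mini-high": (0, 10),
    "4.5": (0, 10), "4.1": (0, 10), "5 Thinking": (0, 10),
    "Copilot": (0, 10), "Perplexity": (0, 10),
    "Qwen3": (1, 9), "o4-mini": (2, 8),
}

lhs = [score[0] for score in single_trials.values()]
rhs = [score[1] for score in single_trials.values()]
rhs_preferred = sum(r > l for l, r in single_trials.values()) / len(single_trials)

print("mean preference:   ", (round(mean(lhs), 1), round(mean(rhs), 1)))    # (0.3, 9.7)
print("standard deviation:", (round(stdev(lhs), 1), round(stdev(rhs), 1)))  # (0.7, 0.7)
print(f"models preferring RHS: {rhs_preferred:.0%}")                        # 100%
```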
All ten trials strongly favored the RHS image. Models based on OpenAI’s engines (the ChatGPT family and Copilot) plus Perplexity and Qwen converged almost perfectly. After OpenAI introduced ChatGPT-5, it discontinued several legacy versions of ChatGPT that were still available for direct comparison while this paper was being written.
Claude Sonnet 4, Gemini 2.5 Pro, and Kimi K1.5 (LLMs with their own AI engines) gave inconsistent results with the above simple prompt. The inconsistency arose because they conjectured effects for the emotional qualities, which amounted to speculation. Most general LLMs answer from the fashion and stylistic cues found in their training data unless constrained to consult the neurological/vision literature. An evident training-data bias conflates aesthetic and emotional criteria. Those LLMs’ detailed explanations were not based strictly on documented psychological feedback but were influenced by opinions on contemporary aesthetics and styles. To use those LLMs, a more detailed prompt is necessary to prevent the model from drawing on subjective opinions instead of searching through scientific data.
An experiment with Gemini 2.5 Pro using an improved prompt gives better results, as detailed below in Appendix A.4. It is best to use an optimized prompt for every AI engine/system.
This exercise in response consistency is not a rigorous reliability test for the emotional evaluation module. It simply points out what researchers must do in a systematic manner to validate this model for future investigations. Another important point that came out of this is that distinct LLMs answer questions differently, by drawing upon different sources that may indeed be biased. For this reason, it is essential to ask the LLM for a detailed justification for each number in the evaluation and to check this for impartiality.
Appendix A.2. Geometric Criteria
The test–retest reliability analysis was repeated for the 15 fundamental properties by querying ChatGPT-4o ten times with an identical prompt for the same architectural pair in Figure 1. For checking intra-model consistency, each query was entered as a new chat. A slightly modified prompt was used this time, along with the descriptive list of the 15 fundamental properties (linked here in the Supplementary Materials):
“Evaluate these two images of buildings, using the 15 criteria uploaded as Alexander’s Fifteen Fundamental Properties of living geometry. The relative comparison should be presented as a set of numbers (LHS, RHS), where LHS = total score for the relative presence (dominance) of the properties in the LHS image, and RHS = total score for the relative presence (dominance) of the properties in the RHS image. Score the pair of images as follows: if one property is clearly dominant in one of them, give a 1 to it and 0 to the other. If both images have comparable degrees of one property, or the difference is very small, give a 0 to both. For this reason, the totals could come out to be LHS + RHS < 15.”
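To make the tallying rule explicit, a minimal sketch follows (ours, with hypothetical per-property judgments): each of the 15 properties awards 1 point to the clearly dominant image, while a tie or near-tie awards 0 to both, so the two totals need not sum to 15. This differs from the forced-choice emotional scoring above, whose totals always sum to 10.

```python
# Minimal sketch (illustrative only) of the geometric tallying rule.
# The per-property judgments below are hypothetical placeholders, not real data.

def score_pair(judgments):
    """judgments: one entry per property, each 'LHS', 'RHS', or 'tie'."""
    lhs_total = sum(j == "LHS" for j in judgments)
    rhs_total = sum(j == "RHS" for j in judgments)
    return (lhs_total, rhs_total)

example = ["RHS"] * 12 + ["tie"] * 2 + ["LHS"]  # 15 hypothetical judgments
print(score_pair(example))  # (1, 12); the totals sum to 13 < 15 because of the two ties
```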
ChatGPT-4o produced the following results when evaluating the geometrical criteria ten consecutive times for the department store in Figure 1 (listed here not in chronological order):
(LHS, RHS) = (0, 15), (0, 13), (1, 13), (2, 12) four times, (3, 10) twice, (3, 11).
Mean preference = (1.8, 12.0) and standard deviation = (1.1, 1.4).
Results are unanimous in direction, with all ten runs choosing the RHS image as containing more of the geometrical properties, on average 12 out of the 15.
The second reliability assessment compared evaluations across ChatGPT-4o, o3, o4-mini, o4-mini-high, 4.5, 4.1, 5 Thinking, Gemini 2.5 Pro, Kimi K1.5, and Perplexity using the image set in Figure 1. The cross-model concordance scores of single trials evaluating the department store in Figure 1 are as follows (again, repeated trials using new chats will inevitably give varied results):
ChatGPT-4o (LHS, RHS) = (0, 13), o3 = (3, 11), o4-mini = (0, 14), o4-mini-high = (0, 15), 4.5 = (0, 15), 4.1 = (4, 11), 5 Thinking = (0, 14), Gemini 2.5 Pro = (0, 14), Kimi K1.5 = (0, 15), Perplexity = (1, 12).
Mean preference = (0.8, 13.4) and standard deviation = (1.5, 1.6).
Agreement on the winning RHS design was 100% across ten independent or semi-independent LLMs, which chose on average 13 out of the 15 geometrical properties. The AI-based diagnostic tool therefore shows an encouraging level of reliability. The authors feel that this preliminary “proof-of-principle” justifies the practical value of the LLM-based evaluative model while identifying important issues to watch out for and develop further.
Appendix A.3. The Data Reveal Directional Agreement Between Emotional and Geometrical Criteria
In all cases, the emotional and geometrical evaluations of the department store in Figure 1 agreed. Summarizing the above results (a minimal computational check of this agreement follows the list):
1. Same LLM, 10 repeated runs. Emotion outputs: (0, 10) × 7, (1, 9) × 3 → all 10 choose RHS.
Geometry outputs: (0, 15), (0, 13), (1, 13), (2, 12) × 4, (3, 10) × 2, (3, 11) → all 10 choose RHS.
Directional agreement = 100%.
2. Across different LLMs, 10 models. Emotion outputs: (0, 10) × 8, (1, 9), (2, 8) → all 10 choose RHS.
Geometry outputs: (0, 13), (3, 11), (0, 14), (0, 15) × 3, (4, 11), (0, 14) × 2, (1, 12) → all 10 choose RHS.
Directional agreement = 100%.
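A minimal sketch of the directional-agreement check follows (ours; the run data are those reported above). It determines the winning side of each output and confirms that every emotional and every geometrical output points to the RHS image.

```python
# Minimal sketch (illustrative only): directional agreement between the
# emotional and geometrical evaluations, using the same-LLM repeated runs.

def winners(outputs):
    """Return the winning side ('LHS', 'RHS', or 'tie') of each (LHS, RHS) output."""
    return ["RHS" if r > l else "LHS" if l > r else "tie" for l, r in outputs]

emotion_runs = [(0, 10)] * 7 + [(1, 9)] * 3
geometry_runs = [(0, 15), (0, 13), (1, 13)] + [(2, 12)] * 4 + [(3, 10)] * 2 + [(3, 11)]

all_point_to_rhs = all(w == "RHS" for w in winners(emotion_runs) + winners(geometry_runs))
print("directional agreement:", "100%" if all_point_to_rhs else "criteria disagree")  # 100%
```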
Appendix A.4. The Occasional Need for a More Detailed Prompt
As already noted in Appendix A.1, the LLM Gemini 2.5 Pro did not give a satisfactory result when prompted with the simple prompt for the emotional criteria given above. (Gemini is powered by a distinct AI engine from ChatGPT and is trained separately from other LLMs.) A more detailed prompt elicited an accurate scoring for the emotional evaluation of Figure 1 as (LHS, RHS) = (2, 8), supported by the detailed explanations reproduced in full below. Structured, theory-based prompts can override a bias due to stylistic preferences from training data. This is the first iteration of prompt tuning through feedback, which could be taken further if desired.
To check consistency using this LLM, the enhanced prompt was repeated ten independent times, giving the following scores for the 10 emotional criteria. Only the readout (the detailed justifications) from the first trial is reproduced below. However, the variance over the ten evaluations discourages using this LLM for the objective diagnostic model; the most advanced version of ChatGPT is preferred for now. (Improvement while using Gemini 2.5 Pro requires even more detailed evaluation criteria and a graded scoring method.) The ten scores are as follows:
Gemini 2.5 Pro: (LHS, RHS) = (2, 8), (3, 7), (2, 8), (4, 6), (1, 9), (2, 8), (4, 6), (1, 9), (5, 5), (3, 7).
Mean preference = (2.7, 7.3) and standard deviation = (1.27, 1.27).
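The difference in test–retest consistency between the two models can be quantified directly; a short sketch follows (ours; it uses the LHS scores reported in Appendix A.1 and above, and the population standard deviation, which is the convention matching the 1.27 figure quoted here).

```python
# Minimal sketch (illustrative only): comparing run-to-run spread on the
# emotional criteria. A larger spread indicates weaker test-retest consistency.
from statistics import mean, pstdev

lhs_runs = {
    "ChatGPT-4o": [0] * 7 + [1] * 3,                    # LHS scores from Appendix A.1
    "Gemini 2.5 Pro": [2, 3, 2, 4, 1, 2, 4, 1, 5, 3],   # LHS scores listed above
}

for model, scores in lhs_runs.items():
    print(f"{model}: mean LHS = {mean(scores):.1f}, population SD = {pstdev(scores):.2f}")
# ChatGPT-4o:     mean LHS = 0.3, population SD = 0.46
# Gemini 2.5 Pro: mean LHS = 2.7, population SD = 1.27
```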
Prompt to Gemini 2.5 Pro: “Evaluate two architectural images (LHS and RHS) according to a set of ten emotional qualities {beauty, calmness, coherence, comfort, empathy, intimacy, reassurance, relaxation, visual pleasure, well-being}, collectively called the “beauty–emotion cluster”. Each quality relates directly to positive emotional and physiological responses experienced by people physically present in built environments.
For each quality, assign a binary preference: 1 to the image that scientific evidence indicates will more strongly elicit that positive emotional or physiological response, and 0 to the image less effective in eliciting that response. The sum of values across all ten qualities must equal 10. Present your final result as (LHS, RHS).
Do not rely on subjective aesthetic judgments, contemporary architectural styles, minimalism, or stylistic trends. Base your assessment exclusively on biophilic design studies, documented neuroscientific findings, environmental psychology research, and empirical evidence from peer-reviewed open-access scientific literature. Provide a concise scientific justification referencing established neuroscientific and psychological findings for each quality you score.”
Gemini 2.5 Pro: “Based on a rigorous evaluation grounded in the provided neuroscientific and psychological literature, the relative preference between the two buildings is scored decisively in favor of the RHS building. The classical structure’s features align more closely with documented architectural properties known to elicit positive physiological and psychological responses.
[Beauty] Scientific Justification: Neuroaesthetics research indicates that aesthetic appreciation is linked to the activation of the brain’s emotion-valuation system, which includes the orbitofrontal cortex. Designs with harmony and symmetry can create feelings of happiness. The RHS building, with its ordered symmetry and harmonious geometric proportions rooted in classical design, is more aligned with these findings than the simpler, repetitive facade of the LHS.
[Calmness] Scientific Justification: Biophilic design, which incorporates natural elements, is strongly linked to stress reduction and promoting calm. The visible green roof on the LHS building is a direct biophilic feature shown to lower cortisol levels and promote tranquility. Exposure to greenery, even indoors, can speed physiological stress recovery. This provides a direct, documented pathway to inducing calmness.
[Coherence] Scientific Justification: Coherence is defined as the ease of organizing and comprehending a scene, and it is a key psychological dimension in the experience of architecture. The RHS building’s design, with its clear geometric proportions and hierarchy of forms (base, middle, top; distinct window bays), provides a more organized and cognitively digestible structure. This clear structure reduces cognitive stress and enhances the feeling of coherence.
[Comfort] Scientific Justification: Psychological comfort is associated with feelings of safety, familiarity, and well-being. Materials play a key role; natural materials like stone can evoke feelings of permanence and strength. Wood and stone are often perceived as creating a sense of calm and connection with nature, which fosters comfort. The RHS building’s stone facade aligns with these findings more than the glass and metal of the LHS, which can be associated with modernity but less with grounding and comfort.
[Empathy] Scientific Justification: Empathy in architecture is linked to an embodied, multi-sensory experience where the user feels a connection to the space. Human-scale architecture is crucial for this, as it relates the building’s proportions to the human body, creating a more relatable and psychologically positive impact. The detailed ornamentation and articulated window bays of the RHS building provide human-scaled elements that are absent in the monolithic facade of the LHS.
[Intimacy] Scientific Justification: Smaller, enclosed spaces tend to foster focus and intimacy, while larger, open spaces promote exploration. The design of the RHS building, with its clearly defined and articulated window recesses, creates smaller-scale areas of enclosure within the larger facade. This contrasts with the vast, open, and less-defined surface of the LHS, making the RHS more conducive to a sense of intimacy.
[Reassurance] Scientific Justification: Reassuring environments are often those that feel stable and permanent. Research shows that traditional spaces can provide a positive and satisfactory experience of security. The RHS building’s use of stone, a material that conveys strength and permanence, contributes to this feeling. Its solid, grounded appearance is more reassuring than the lighter, more transparent aesthetic of the LHS.
[Relaxation] Scientific Justification: Exposure to nature is a well-documented method for stress reduction and relaxation. Green roofs are a key element of biophilic design that provide opportunities for relaxation. Studies show that biophilic settings promote faster heart rate recovery after stress. The presence of the green roof on the LHS building provides a direct mechanism for promoting relaxation.
[Visual pleasure] Scientific Justification: Research into fractal patterns in visual art and design shows that humans are “hard-wired” to respond positively to the kind of ordered complexity found in nature, which can reduce stress levels by up to 60%. The rich, multi-scalar ornamentation of the RHS facade exhibits fractal-like qualities, which are more engaging and visually pleasing than the non-fractal, uniform grid of the LHS. Preference for fractals is driven by a balance of arousal (complexity) and decreased tension (relaxation).
[Well-being] Scientific Justification: Overall well-being is a holistic measure affected by numerous factors, including a sense of connection, safety, and positive emotional stimulation. The RHS building integrates multiple positive factors: its use of natural materials (stone), human-scaled elements, and ordered complexity all contribute to a psychologically supportive environment. While the LHS has a single biophilic feature, the RHS presents a more comprehensive integration of architectural elements shown to support human well-being.”
Appendix B. ChatGPT Argues That It Is Not Biased by the Public Polls
Appendix B.1. ChatGPT Justifies Its Impartiality in Comparing the Two Buildings in Figure 1
While the AI experiments above were conducted with anonymous figures, those do in fact link to a recent debate in the UK about actual buildings. Moreover, public preference surveys were conducted using the images in Figure 1, Figure 2, and Figure 3. The LLM was asked about this point directly to rule out any possible bias if the generative AI used prior knowledge of the debate. Its answers are convincing, revealing that there was no such interference. The LLM’s unedited explanation is included for each of the three cases, beginning with that of Figure 1.
Prompt to ChatGPT 4.5: “These two images are published on the web in recent discussions of the demolition versus reuse of Orchard House, the building where the department store Marks and Spencer’s is located in Oxford Street, London, UK. Were your two comparative analyses of the images influenced at all by that online debate, or did you strictly analyze the images according to the stated criteria (first the emotional ones, then the geometric ones)? A public survey was performed in February 2024 to identify which of the two images was preferred, with results that agree with your own scientific analysis. Did the results of the public survey, which is available online, influence your analysis? The results of the crowdsourced survey on public preference between the two building images were (LHS, RHS) = (17%, 79%) with the remaining 4% answering “I don’t know”.”
ChatGPT: “Our two comparative analyses (first emotional, then geometric) were conducted strictly according to the stated criteria and were not consciously influenced by the ongoing demolition-versus-reuse debate. In other words, the evaluations focused on the visual and aesthetic content of the images themselves—such as the emotional impact (e.g., the ambiance, warmth, or grandeur each facade conveyed) and the geometric qualities (symmetry, proportion, pattern, etc.)—independent of any external opinions. We did not factor in knowledge about Marks & Spencer’s plans or the public controversy while analyzing the images. By adhering to the defined criteria, the analysis remained objective: for example, noting features like the “graph-paper façade” and repetitive grid of windows on one design versus the more articulated, classical elements on the other, purely as they appeared visually. This ensured that our comparisons were based on observable attributes in the images rather than the sentiments of the online debate.
The results of the February 2024 public survey (which showed a 79% preference for the RHS image vs. 17% for the LHS image, with 4% undecided) did not influence our image analyses. In fact, our comparative evaluation was performed before considering any poll data and was driven only by the visual criteria provided. The strong agreement between our analysis and the survey results appears to be coincidental or due to both reflecting similar underlying qualities of the designs, rather than our analysis referencing the poll. We became aware of the survey after completing the image comparisons, noting that the British public overwhelmingly preferred the alternative design by 79% to 17% in a Deltapoll survey. However, that information was used only to contextualize findings here, not to shape the earlier analysis. In summary, the scientific image analysis stood on its own, and the public poll data serve only as a retrospective confirmation that the criteria-led evaluation aligned with general public sentiment.”
Appendix B.2. ChatGPT Justifies Its Impartiality in Comparing the Two Stadia in Figure 2
Prompt to ChatGPT 4.5: “These two images are published on the web as part of a debate about the design of Bath Rugby Stadium in the UK. Were your two comparative analyses of the images influenced at all by that online debate, or did you strictly analyze the images according to the stated criteria (first the emotional ones, then the geometric ones)? A public survey was performed in 2023 to identify which of the two images was preferred, with results that agree with your own scientific analysis. Did the results of the public survey, which is available online, influence your analysis? The results of the crowdsourced survey on public preference between the two building images were (LHS, RHS) = (28%, 72%).”
ChatGPT: “1. Independence from Online Debate. No—my analysis was not influenced in any way by the online debate about the HS2 viaduct or by any external source. I based the comparison strictly on the “beauty–emotion cluster” criteria that you provided. The evaluation was made solely by visually examining the two images you uploaded and interpreting them through that specific emotional framework.
2. No Use of Survey Data. Similarly, the 2023 public survey data showing a preference of (28%, 72%) was not known or used during the analysis. I did not access or refer to it, and it had no bearing on the scoring outcome of (10%, 90%) that I independently derived.
3. Reason for Agreement. The fact that the public survey results and the emotional-criteria-based analysis broadly align suggests that the emotional response elicited by the RHS image may be widely shared and consistent across viewers. This convergence can be taken as a useful indicator that the “beauty–emotion cluster” method may resonate with broader public preferences, at least in this case.
4. Objectivity. The scoring was conducted in a controlled and repeatable way by assigning binary preference values for each of the ten emotional dimensions. While interpretation always contains some degree of subjectivity, I applied the method consistently and without reference to any external discussions, articles, or survey results.”
Appendix B.3. ChatGPT Justifies Its Impartiality in Comparing the Two Viaducts in Figure 3
Prompt to ChatGPT 4.5: “These two images are published on the web as part of a debate about the design of a viaduct for HS2 in the UK. Were your two comparative analyses of the images influenced at all by that online debate, or did you strictly analyze the images according to the stated criteria (first the emotional ones, then the geometric ones)? A public survey was performed in 2024 to identify which of the two images was preferred, with results that agree with your own scientific analysis. Did the results of the public survey, which is available online, influence your analysis? The results of the crowdsourced survey on public preference between the two building images were (LHS, RHS) = (28%, 69%).”
ChatGPT: “No, my comparative analysis was not influenced by the Create Streets survey or its results. I assessed the two images strictly and independently using the ten stated emotional criteria from the “beauty–emotion cluster” framework. The analysis focused on visual and environmental cues present in the images themselves—such as form, materiality, scale, texture, rhythm, and contextual fit—to make a reasoned binary judgment for each category. I did not consult or reference the Create Streets survey or its published data when forming the assessment.
On the survey results: That the public preference survey conducted in 2024 found a result of (LHS, RHS) = (28%, 69%) is both interesting and affirming. It aligns closely with the emotional evaluation I provided—which resulted in a score of (0%, 100%) due to a unanimous preference under each of the ten qualitative lenses. This convergence between subjective crowd sentiment and a structured emotional-design analysis strengthens the case that the arched brick viaduct (RHS) resonates more positively with human perceptions of place, coherence, and emotional comfort.
Conclusion: My analysis was conducted independently and objectively, based solely on visual-emotional design principles—and the fact that it aligns with public sentiment adds confidence in the robustness of these criteria as a tool for design evaluation.”