Article

Systematic Parameter Optimization for LoRA-Based Architectural Massing Generation Using Diffusion Models

School of Architecture, Kyungpook National University, Daegu 41566, Republic of Korea
* Author to whom correspondence should be addressed.
Buildings 2025, 15(19), 3477; https://doi.org/10.3390/buildings15193477
Submission received: 11 August 2025 / Revised: 23 September 2025 / Accepted: 23 September 2025 / Published: 26 September 2025
(This article belongs to the Special Issue Artificial Intelligence in Architecture and Interior Design)

Abstract

This study addresses the systematic optimization of Low-Rank Adaptation (LoRA) parameters for architectural knowledge integration in diffusion models, where existing AI research has provided limited guidance for establishing plausible parameter ranges in architectural massing applications. While diffusion models show increasing utilization in architectural design, general models lack domain-specific architectural knowledge, and previous studies have offered insufficient hyperparameter optimization frameworks for architectural massing studies—fundamental components for expressing architectural knowledge. This research establishes a comprehensive LoRA training framework specifically for architectural mass generation, systematically evaluating caption detail levels, optimizers, learning rates, schedulers, batch sizes, and training steps. Through analysis of 220 architectural mass images representing spatial transformation operations, the study recommends the following parameter settings: detailed captions, Adafactor optimizer, learning rate 0.0003, constant scheduler, and batch size 4, achieving significant improvements in prompt-to-output fidelity compared to baseline approaches. The contribution of this study is not in introducing a new algorithm, but in providing a systematic application of LoRA in the architectural domain, serving as a bridging milestone for both emerging architectural-AI researchers and advanced scholars. The findings provide practical guidelines for integrating AI technologies into architectural design workflows, while demonstrating how systematic parameter optimization can enhance the learning of architectural knowledge and support architects in early-stage massing and design decision-making.

1. Introduction

1.1. Research Background and Objectives

Recent advancements in AI technology have increased its potential for application across industrial sectors, stimulating related research and development. Carpo [1] describes this progress as “the second coming of AI”, emphasizing that technologies once considered impossible are now becoming reality.
In architecture, AI adoption is also accelerating. Platforms such as Spacemaker, an AI-based design environment for multi-family housing development acquired by Autodesk in 2020, exemplify how AI can improve design efficiency while enabling broader participation in the design process [2]. Beyond such platforms, research on automation and AI-driven image generation is expanding, laying the groundwork for AI-based architectural design environments [3,4].
Diffusion models, which generate images through iterative denoising, have gained significant attention in architectural design for their stability and diversity of outputs compared to Generative Adversarial Networks (GANs) [5,6]. These advantages make them suitable for automating early-stage design tasks and exploring diverse alternatives, and they are already being applied by leading firms such as Morphosis Architects, Coop Himmelb(l)au, MVRDV, and Zaha Hadid Architects [7].
However, diffusion models commonly used in design workflows remain limited in handling domain-specific architectural tasks due to insufficient incorporation of architectural knowledge [8]. Fine-tuning approaches such as Low-Rank Adaptation (LoRA) offer a potential solution, but prior studies provide only limited systematic guidance on how to configure plausible parameter ranges in architectural contexts [9,10]. As highlighted in related research, inappropriate hyperparameter settings can lead to unstable learning or outright failure [11].
Architectural massing studies—concerned with spatial relationships, geometric transformations, and formative logic—represent a fundamental medium for expressing architectural knowledge [12]. Yet systematic parameter optimization frameworks tailored to massing applications remain underexplored, despite their central role in early-stage architectural design.
Therefore, this study aims to explore effective ways to learn architectural knowledge related to architectural mass generation using the LoRA technique based on diffusion models. Rather than simply applying existing methods, this study systematically constructs an architectural mass-specific dataset and optimizes hyperparameters to tailor AI models for early-stage architectural design workflows. The goal is to analyze the impact of LoRA hyperparameter configurations on architectural knowledge acquisition and recommend parameter settings, within the experimental scope, for architectural mass generation. This is expected to increase the practical applicability of AI technology in the early stages of architectural design and enhance design efficiency through human-AI interaction. The contribution of this study is not in developing a new optimization algorithm but in systematically applying LoRA to embed architectural massing knowledge into diffusion models. The novelty lies in the disciplinary and methodological framing: the study provides a structured reference for fine-tuning practices in architectural contexts, thereby serving as a bridging milestone for both emerging architectural-AI researchers and advanced scholars. To this end, the following research question has been established: “How can architectural knowledge be systematically fine-tuned through the LoRA technique so that diffusion models are not only optimized in terms of technical performance but also capable of producing architecturally meaningful massing outputs that support human–AI interaction in early-stage design?”

1.2. Research Scope and Method

This study is confined to architectural knowledge, particularly focusing on architectural mass generation in the early stages of architectural design. It excludes elements such as exterior finishing materials or facades. All outputs generated and analyzed are 2D images, and no 3D models were produced. Evaluations are therefore based solely on visual analysis of 2D architectural mass forms. This clarification ensures that the methodological scope aligns with the objectives of evaluating LoRA-based diffusion models for image-level generative tasks. The diffusion model used in this study is RealisticVisionV51_v51VAE, and LoRA checkpoint files were generated with the widely used Kohya_ss training toolkit.
It should be noted that this study is confined to 2D image-based massing exploration, which is both a limitation and a necessary first step. While 3D models and façade-level applications are excluded from the present scope, these remain essential directions for future work, as evidenced by recent façade-focused research [13]. At the same time, this 2D scope inevitably restricts external validity, since practical architectural massing workflows involve 3D modeling, site constraints, and schematic CAD inputs that are not addressed in the present study.
The RealisticVisionV51_v51VAE model and the Kohya_ss library were chosen for their widespread adoption and flexibility in parameter adjustment, which make them suitable for architectural image experiments and consistent with the objectives of this study.
The research method consists of three main stages:
First, we analyze the theoretical background on architectural knowledge learning for human-AI interaction and review previous studies related to architectural image generation based on Diffusion models, with the purpose of establishing a conceptual baseline for subsequent experiments.
Second, we construct scenarios for the fine-tuning technique LoRA in order to systematically examine the influence of different hyperparameter settings. To achieve this, we perform (a) Caption Level comparison, (b) Optimizer comparison, (c) Learning Rate comparison, (d) LR Scheduler comparison, (e) Batch Size comparison, and (f) Training Steps analysis to derive key elements of the learning scenario.
Third, we comprehensively evaluate the model outputs using both qualitative and quantitative methods, with the purpose of identifying parameter settings that balance technical fidelity with architectural design relevance. Specifically, we conduct (1) quantitative evaluation through CLIP score analysis to measure the semantic alignment between prompts and generated architectural mass images, and LPIPS, which provides a complementary measure of perceptual similarity against teacher images; and (2) qualitative evaluation based on the visual clarity of architectural mass and the reflection of prompts. Additionally, we compare the performance of the fine-tuned LoRA models with that of the original, unfine-tuned Diffusion model (without LoRA) to verify the learning effectiveness of architectural mass-related knowledge.
Through this process, we identify the most effective LoRA parameter settings for architectural mass-related knowledge learning.

2. Theoretical Background and Literature Review

2.1. Fine-Tuning for Architectural Knowledge

Adjusting general AI models to fit specific domains for architectural knowledge learning is essential, and one of the main methods for this is fine-tuning. Fine-tuning is the process of adjusting a pre-trained model to suit a specific task or dataset, adding new knowledge suitable for new data while maintaining the general knowledge learned in the existing model [14]. However, existing research has provided limited systematic guidance for establishing plausible parameter ranges for fine-tuning in architectural contexts, with researchers noting that “in many cases, engineers rely on the trial-and-error method to manually tune hyperparameters” [11].
This fine-tuning is effective and widely used as it can improve performance for specific domains with relatively small amounts of data [15,16]. Yet systematic parameter optimization frameworks specifically for architectural applications remain limited, with building applications demonstrating that “the performance of DRL is highly sensitive to hyper-parameters, and selecting inappropriate hyper-parameters may lead to unstable learning or even failure” [9].
LoRA is a lightweight fine-tuning method developed to overcome the limitations of extensive computational requirements and storage space demanded by traditional fine-tuning techniques. As shown in Figure 1, LoRA proceeds with learning by adding small low-rank matrices while keeping the pre-trained weights intact, instead of updating all parameters of the model [17]. This LoRA technique saves computing resources such as GPU memory, reduces learning time, and facilitates easy switching between various tasks as the learning results are stored as LoRA checkpoint files, eliminating the need to generate the entire model anew. In other words, LoRA can be considered an efficient fine-tuning method that can perform domain-specific learning while maintaining the performance of the existing model [17].
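For illustration, the low-rank update described above can be summarized in a minimal PyTorch sketch. This is a conceptual sketch only; the class name, rank r, and scaling values are illustrative assumptions, not the configuration used in this study.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal illustrative sketch of a LoRA-wrapped linear layer."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pre-trained weights stay frozen
        # Low-rank factors: only these small matrices are trained
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus trainable low-rank correction: Wx + (alpha/r) B A x
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale

# Only the LoRA factors appear among the trainable parameters
layer = LoRALinear(nn.Linear(768, 768))
print([n for n, p in layer.named_parameters() if p.requires_grad])  # ['lora_A', 'lora_B']
```

Because B is initialized to zero, training starts from the unmodified pre-trained behavior, and the learned A and B matrices can be stored separately as a lightweight checkpoint, mirroring the LoRA checkpoint files used in this study.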
Compared with other fine-tuning methods such as full fine-tuning, DreamBooth, or ControlNet, LoRA offers clear advantages in efficiency and modularity. Full fine-tuning requires updating all model parameters, which is computationally expensive and less practical for rapid architectural prototyping. DreamBooth and ControlNet provide stronger task-specific control but demand larger datasets and higher computational resources. By contrast, LoRA enables lightweight, domain-specific adaptation with minimal resource requirements, making it particularly suitable for early-stage architectural massing studies.
Given these comparative advantages, digitizing architectural knowledge becomes all the more essential for enabling AI-based design environments. Bernstein [18] emphasized the importance of organizing architectural data and converting it into learnable structures to implement AI-based design environments. This data conversion process forms the basis for AI to effectively learn architectural knowledge and collaborate with architects in the design decision-making process.
Beyond this, computational design theory also provides an important historical context for understanding how design knowledge can be formalized. Shape grammars established rule-based generative processes [19], and parametricism emphasized the role of parameters in structuring design [20]. These approaches resonate with diffusion-based fine-tuning, where captions and prompts function as descriptors that guide generative outputs.
In this study, we further define architectural knowledge as the structured representation of spatial relationships, geometric transformations, and formative logics that underpin design reasoning, as elaborated in Hong [21]. This definition clarifies how LoRA-based fine-tuning embeds architectural reasoning into diffusion models, ensuring that our methodological framework is anchored in a coherent theoretical foundation.
The LoRA technique can fine-tune AI models to suit specific architectural design environments, generating architectural masses and forms in the early stages of architectural design and presenting new design paradigms for human-AI collaboration. Therefore, this study aims to explore effective methods of learning architectural knowledge using the LoRA technique, thereby enhancing creativity and efficiency in AI-assisted architectural design processes and supporting effective human-AI interaction in design workflows.

2.2. Diffusion Models in Architectural Image Generation

Diffusion models, a recently emerged class of deep learning-based generative models, have gained attention for surpassing the performance of existing Generative Adversarial Networks (GANs) in image synthesis [22]. These models are applied to various tasks beyond image synthesis, including machine vision, natural language processing, and multi-modal learning, with active research and technological development in interdisciplinary fields, including architecture [15].
According to del Campo [22], the origin of diffusion models stems from image caption generation research. The ImageNet dataset contains 14,197,122 images annotated according to the WordNet hierarchy. Since 2010, it has been used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a benchmark for image classification and object detection; the publicly released dataset contains a set of manually annotated training images [23]. ImageNet laid the foundation for developing algorithms capable of describing images. By 2016, algorithms transitioning from image analysis to generation emerged, presenting the initial concept of generating images from textual scene descriptions [24,25]. This development led to the birth of diffusion models, generally defined as probabilistic generative models that operate by injecting noise into images and gradually removing it.
Diffusion models can be particularly useful tools in the architectural design field for quickly generating images that visualize initial design ideas, helping to concretize or develop creative concepts. Various studies are being conducted using diffusion models in the architectural design field, including text-to-image and image-to-image generation.
Unlike conventional GAN-based generation models, diffusion models offer higher output stability and better prompt controllability, making them more suitable for creative and iterative design processes. Moreover, their structure allows for relatively straightforward integration with fine-tuning techniques such as LoRA and ControlNet, which have enabled their rapid adoption in architecture-related workflows. However, most studies in this domain remain focused on the visual output and creative utility of diffusion models, rather than on systematic learning of architectural knowledge embedded in image-caption data or prompt structures.
A range of studies illustrates how diffusion models are being integrated into architectural workflows. Çelik’s [26] research reinterpreted Andrea Palladio’s form grammar through text-to-image generation, highlighting AI’s potential in conceptual design while also stressing the continuing need for human intervention in functional design aspects. Building on this idea of concept exploration, Horvath & Pouliou [27] leveraged eVolo skyscraper competition data to investigate how prompts influence architectural form generation, reinforcing the role of diffusion models as early-stage visualization tools.
Moving beyond general form generation, Kim et al. [8] proposed the Text2Form Diffusion framework to incorporate architectural vocabulary via fine-tuning, showcasing how dataset construction and parameter optimization could lead to more refined outputs. Complementing this, Wu [28] addressed architectural style generation using the ArchDiff model, which demonstrated enhanced consistency and stylistic clarity by combining image-text pairs in a large-scale dataset.
Subsequent research has aimed to enhance control and customization. Ma & Zheng [29] integrated LoRA and ControlNet into Stable Diffusion to generate text-based architectural facades, balancing stylistic diversity with geometric consistency. Similarly, Yoo & Lee [30] focused on exterior visualization of single-family houses, using generative AI and LoRA fine-tuning to tailor design outputs to specific user preferences.
Collectively, prior research demonstrates that diffusion models and fine-tuning techniques hold considerable promise for early-stage architectural design, offering the capacity to generate diverse and creative alternatives. Nevertheless, most of these studies have remained oriented toward aesthetic visualization or stylistic adaptation, and only a few have provided systematic guidance on how domain-specific architectural knowledge might be formally encoded and optimized within such models. The present study addresses this gap by advancing a structured framework for LoRA-based parameter optimization tailored to architectural massing. In doing so, it explicitly links hyperparameter configurations with the architectural relevance of generated outputs, thereby distinguishing itself from prior approaches and underscoring both methodological rigor and design-oriented significance.
Therefore, based on the importance of digital modeling mentioned by Steenson [31], this study aims to focus on generating architectural masses using AI models. We intend to explore ways to learn architectural knowledge through the LoRA technique and generate architectural mass images reflecting the morphological characteristics of architecture by applying LoRA to diffusion models.

3. Methodology for Architectural Knowledge Learning Using LoRA Technique

3.1. LoRA Training Method

3.1.1. Training Environment and Tool

For LoRA training in this study, an AMD EPYC 7352 24-Core Processor (Advanced Micro Devices, Santa Clara, CA, USA) and an NVIDIA Quadro RTX 6000 (NVIDIA Corporation, Santa Clara, CA, USA; 24 GB VRAM) were utilized. Model training and data processing were conducted in a Python 3.10.11 environment. The base model employed was RealisticVisionV51_v51VAE, which is based on Stable Diffusion 1.5 (SD 1.5). This model allows for optimized learning in specific domains with minimal data, making it applicable in various fields such as architectural design, medical imaging, and art. Kohya_ss was used for LoRA training. This tool supports dataset preprocessing, hyperparameter setting, and visualization of the learning process, and it provides an intuitive learning environment through a Web UI. This configuration was selected to ensure sufficient computational capacity for diffusion model training while remaining accessible and reproducible for architectural researchers.

3.1.2. Training Procedure

The main procedures for LoRA training are illustrated in Figure 2. First, the essential task that must be completed before LoRA learning is building an image dataset. After collecting and preprocessing architectural mass images, an image-caption dataset is constructed through a captioning process that describes the images. Second, in LoRA training, multiple LoRA checkpoint files are generated by setting various hyperparameters such as Caption Level, Optimizer, Learning Rate, LR Scheduler, Batch Size, and Training Steps. In this process, the model is trained by applying various settings, and each checkpoint is learned according to different hyperparameter configurations. Third, architectural mass images are generated using the created LoRA checkpoint files, and the quality of the generated images is compared and analyzed. In this process, the impact of each hyperparameter combination on the generation results is evaluated to derive the optimal settings. Finally, based on the analysis results of the loss graphs, a LoRA checkpoint file specialized for generating architectural mass images is ultimately created through optimized hyperparameter settings.

3.2. Building of Architectural Mass Image-Caption Dataset

3.2.1. Image Data Collection and Preprocessing

In this study, architectural mass images were collected from two sources: Operative Design: A Catalogue of Spatial Verbs by Di Mari & Yoo [32], and Conditional Design: An Introduction to Elemental Architecture by Di Mari [33]. Operative Design illustrates the process of architectural form transformation through spatial verbs such as extrude, merge, and twist, represented as geometric mass forms, and investigates form transformation within the architectural design process. Conditional Design focuses on the various spatial conditions that arise as a result of operative design, addressing changes in mass forms and components and exploring design approaches that respond to surrounding environments and contexts through processes in which masses intersect, merge, and spaces connect or gain openness.
The architectural mass images collected from these books effectively illustrate spatial and operative verbs along with mass forms as text-image pairs, making them suitable data for learning architectural knowledge related to building masses. Each collected image was preprocessed to a resolution of 150 dpi and a size of 1024 × 1024 pixels to optimize for learning, resulting in a dataset of 220 images as shown in Table 1.
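A minimal sketch of this preprocessing step is given below; the folder names and the LANCZOS resampling choice are illustrative assumptions, while the 1024 × 1024 resolution and 150 dpi follow the settings stated above.

```python
from pathlib import Path
from PIL import Image

SRC = Path("raw_scans")       # hypothetical folder of collected images
DST = Path("dataset/images")  # hypothetical output folder for training data
DST.mkdir(parents=True, exist_ok=True)

for path in sorted(SRC.glob("*.png")):
    img = Image.open(path).convert("RGB")
    # Resize to the 1024 x 1024 training resolution used in this study;
    # LANCZOS resampling is an assumption, chosen for clean line drawings.
    img = img.resize((1024, 1024), Image.LANCZOS)
    # Record 150 dpi in the file metadata, matching the stated preprocessing.
    img.save(DST / path.name, dpi=(150, 150))
```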

3.2.2. Image Captioning

Image captioning assigns textual descriptions to images so models can better interpret them and align prompts with visual elements [34]. Kohya_ss supports AI-based automatic captioning functions like Bootstrapping Language-Image Pre-training (BLIP), allowing for fast and efficient image captioning. However, automatically generated captions have limitations in fully reflecting detailed features and diverse descriptions of images [35]. Therefore, to create more accurate captions, it is necessary to manually complement the automatically generated captions [34].
In this study, initial captions were generated using ChatGPT-4o to produce text based on each image’s spatial characteristics. ChatGPT-4o was chosen due to its contextual language generation capabilities, especially for domain-specific captioning in architecture. As shown in Table 2, we uploaded architectural mass images to ChatGPT-4o to generate descriptive captions, then manually reviewed and edited them to ensure terminological and conceptual accuracy. To further ensure reproducibility, we adopted a fixed template of seven descriptive elements—subject, medium, environment, lighting, color, mood, and composition—and applied it consistently across all captions.
According to Chang & Han [36], in the field of architectural design, it is important to consider various aspects of prompt types and compositions for effective image generation. These include elements such as Subject, Medium, Environment, Lighting, Color, Mood, and Visual Composition. ‘Medium’, ‘Environment’, ‘Lighting’, ‘Color’, and ‘Mood’ were summarized as common caption elements, while ‘Subject’ and ‘Composition’, being important items describing specific features of architectural masses, were manually reviewed and modified to improve the accuracy of the captions.
Furthermore, to analyze the impact of image-caption detail level on LoRA learning, this study constructed datasets divided into three levels, as shown in Table 3. The first is a no caption dataset learning only images, the second is a simple caption dataset using brief text, and the third is a detailed caption dataset including comprehensive descriptions. The detailed captions in Table 3 were initially generated using ChatGPT-4o and subsequently edited for clarity and domain accuracy. Through this, we aim to evaluate the performance of the LoRA technique according to the level of caption detail.
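To illustrate how the seven-element template yields a detailed caption, a caption can be assembled as follows; the field values are hypothetical examples, not captions from the actual dataset.

```python
# The seven descriptive elements applied consistently across all captions:
# subject, medium, environment, lighting, color, mood, composition.
TEMPLATE = "{subject}, {medium}, {environment}, {lighting}, {color}, {mood}, {composition}"

caption = TEMPLATE.format(
    subject="a rectangular architectural mass extruded upward from its base",
    medium="3D massing diagram",
    environment="neutral background",
    lighting="soft even lighting",
    color="monochrome white volumes",
    mood="analytical and abstract",
    composition="isometric view centered in the frame",
)
print(caption)
```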

3.3. Setting of Hyperparameters

To optimize the performance and efficiency of LoRA training, we set Optimizer, Learning Rate, LR Scheduler, and Training steps as the main hyperparameters. The Optimizer is an algorithm that adjusts the model’s weights to find the minimum value of the Loss Function, while the Learning Rate determines the magnitude of these weight adjustments. The LR Scheduler dynamically adjusts the Learning Rate to promote stable learning. Training steps represent the total number of times the model’s parameters are updated, and are directly related to the number of images, repeats, epochs, and batch size [31]. The formula for calculating this is as follows:
Total steps = (Number of images × Repeats × Epoch) ÷ Batch size
Epoch represents the number of complete passes through the entire dataset, Repeats is the number of times each image is learned within one epoch, and Batch size refers to the number of images processed simultaneously in one learning session. In this study, LoRA training was conducted based on the hyperparameters presented in Table 4.
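For illustration, with the 220-image dataset and a batch size of 4, assuming 20 repeats and 5 epochs, the formula gives (220 × 20 × 5) ÷ 4 = 5500 total steps, the step count reported in Section 4.2.5; doubling the epochs to 10 yields the extended 11,000-step run used there. The repeat and epoch values in this example are inferences that reproduce the reported step counts rather than settings stated explicitly.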
The model performance evaluation was conducted by comparing and analyzing the results generated using img2img and inpaint with prompts set as extrude, branch, and twist, as shown in Table 5. Other LoRA training hyperparameters were fixed as presented in Table 6. In addition, LoRA-specific settings (rank r, alpha, dropout, and target modules/layers) and diffusion parameters (sampler, guidance scale, random seed, and inpaint/img2img strengths) were not the focus of this study and were therefore kept at the default values of Kohya_ss.
In this process, it is important to note that the urban background visible in the 3D images is derived from the original input image and not from the prompts used for generation. Since the inpaint function was used to manually select and mask only the architectural mass area within the image, the generation result reflects only the prompt content applied to the selected mass. Therefore, the contextual elements or surrounding environment present in the image had no influence on the generation outcome. This ensured that the generated architectural mass forms were based solely on prompt-driven generation.
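For reference, the comparison settings examined in Chapter 4 can be summarized as a simple Python structure. This is a readable summary of the experimental design, not an actual Kohya_ss configuration file.

```python
# Hyperparameter comparison settings examined in Chapter 4.
search_space = {
    "caption_level": ["none", "simple", "detailed"],                # C1-C3
    "optimizer": ["AdamW", "Adafactor", "Prodigy", "Lion"],         # O1-O4
    "learning_rate": [1e-4, 2e-4, 3e-4, 4e-4],                      # LR1-LR4
    "lr_scheduler": ["constant", "cosine", "linear", "adafactor"],  # LRS1-LRS4
    "batch_size": [1, 2, 4, 8],                                     # B1-B4
}
# Parameters were tuned sequentially: each stage fixed the best value found
# so far before varying the next parameter (see Sections 4.1-4.2).
```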

3.4. Evaluation Criteria and Metrics

To evaluate the architectural mass images generated under the parameter settings in Section 3.3, both quantitative and qualitative assessments were conducted. For each LoRA model configuration, ten images were generated per prompt, and one image with low visual noise and clearly defined geometry was selected. These selected outputs were then assessed by combining objective numerical evaluation and visual judgment, aiming to determine the most effective parameter setting for learning architectural mass-related knowledge.
The quantitative evaluation was performed using the CLIP score. The CLIP score quantifies the semantic similarity between a text prompt and an image using cosine similarity. This metric objectively measures how well the generated image aligns with its corresponding prompt. In this study, we computed the CLIP score for each image by comparing the prompt with the image features extracted using the CLIP model. The average CLIP scores for each parameter setting were used to compare model performance across configurations (Figure 3).
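A minimal sketch of this computation using the Hugging Face transformers implementation of CLIP is shown below; the checkpoint name and file path are illustrative assumptions, since the specific CLIP variant used for scoring is not detailed here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # assumed variant
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(prompt: str, image_path: str) -> float:
    """Cosine similarity between prompt and image embeddings."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.nn.functional.cosine_similarity(text_emb, image_emb).item()

print(clip_score("twist", "generated/twist_01.png"))  # hypothetical file
```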
In addition to CLIP score, this study employed the Learned Perceptual Image Patch Similarity (LPIPS) metric as a complementary quantitative indicator. LPIPS measures perceptual similarity between generated outputs and reference teacher images based on deep feature embeddings, where lower values indicate greater visual similarity. In this study, LPIPS was computed by comparing generated architectural mass images against teacher images corresponding to the prompt operations—specifically “extrude”, “branch”, and “twist”, as presented in Table 1. Because the teacher images are simplified line drawings without shading or background, whereas generated outputs include tonal and contextual variations, LPIPS is reported as a supplementary diagnostic metric rather than the primary decision criterion. Accordingly, CLIP score was adopted as the main indicator of prompt fidelity, while LPIPS provided additional insight into perceptual differences across model configurations.
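Similarly, a minimal sketch of the LPIPS computation with the open-source lpips package follows; the AlexNet backbone, working resolution, and file paths are illustrative assumptions.

```python
import lpips
import numpy as np
import torch
from PIL import Image

loss_fn = lpips.LPIPS(net="alex")  # AlexNet backbone assumed

def to_tensor(path: str) -> torch.Tensor:
    """Load an image as an NCHW tensor scaled to [-1, 1], as LPIPS expects."""
    img = Image.open(path).convert("RGB").resize((256, 256))
    arr = np.asarray(img).astype(np.float32) / 255.0
    return torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0) * 2 - 1

# Compare a generated output against its teacher image (hypothetical paths);
# lower values indicate greater perceptual similarity.
d = loss_fn(to_tensor("teacher/twist.png"), to_tensor("generated/twist_01.png"))
print(float(d))
```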
Qualitative evaluation was conducted based on the visual clarity of the generated architectural masses and the degree to which the prompts were reflected. For each prompt, final images were selected by focusing on cases where the generated architectural masses best satisfied these criteria, and this process can be visually confirmed through the images presented in the following tables of Chapter 4.

4. Generation and Validation of LoRA Model Performance

4.1. Comparative Analysis Based on Architectural Mass Image-Caption Dataset

To analyze the impact of Caption Level on architectural mass image generation performance during LoRA training, three Caption levels were set (C1: No Caption, C2: Simple Caption, C3: Detailed Caption). The architectural masses generated by each model are presented in Table 7.
C1 and C2 produced forms that did not reflect the prompts. The extrusion (extrude) of the mass form was not properly executed, the branch resulted in a form closer to rotation (twist), and the rotation (twist) was not reflected at all. In contrast, C3 showed mass expressions that reflected the forms in the prompts. For extrude, a protruding mass form could be observed; for branch, an expanded form with segmentation was visible; and for twist, a form with applied torsion could be confirmed.
Quantitatively, C3 also achieved the highest average CLIP score (0.2159), compared to C2 (0.2081) and C1 (0.2052), indicating stronger alignment between prompts and outputs. By contrast, LPIPS did not follow this trend: the lowest (best) average LPIPS was observed for C2 (0.479108), followed by C3 (0.481524) and C1 (0.482933). This divergence likely reflects the domain gap between simplified teacher drawings and rendered outputs; therefore, we prioritize CLIP as the primary indicator of prompt fidelity and report LPIPS as a supplementary diagnostic measure.
Taken together, these results confirm that Detailed Captions (C3) provided the most effective learning condition. Therefore, for the LoRA training in the next chapter, we decided to select the C3 setting, Detailed Caption.

4.2. Comparative Analysis Based on LoRA Training Parameter Settings

4.2.1. Optimizer

To analyze the impact of Optimizer on model performance during LoRA training, experiments were conducted applying four different Optimizers (O1: AdamW, O2: Adafactor, O3: Prodigy, O4: Lion) as shown in Table 8.
Analysis results showed that O1 and O3 generated mass transformations that did not clearly reflect the prompts, while O4 produced visually distorted or overly abstracted masses. Although O4 achieved the highest average CLIP score (0.2163), the resulting shapes lacked clarity and did not maintain architectural coherence, making them unsuitable as architectural mass outputs. In contrast, O2 achieved a more balanced result. It successfully generated protruding architectural masses for extrude and torsional shapes for twist, while showing moderate performance for branch. The average CLIP score for O2 (0.2140) was slightly lower than that of O4, but the visual clarity and architectural fidelity of the generated outputs were superior.
LPIPS results further supported the choice of O2. Adafactor (O2) recorded an average LPIPS score of 0.483004, which was lower than O3 (0.486001) and O4 (0.487936) and comparable to O1 (0.483972). These values indicate that, in addition to the semantic fidelity captured by CLIP, O2 also maintained perceptual consistency with the teacher images. While the numerical differences among LPIPS scores were relatively small, the consistent trend across both CLIP and LPIPS suggests that O2 produced outputs that better balanced prompt alignment and perceptual similarity.
Based on these analysis results, Adafactor (O2) was selected as the Optimizer for LoRA training.

4.2.2. Learning Rate

To analyze the impact of Learning Rate (LR) on model performance during LoRA training, experiments were conducted applying four different Learning Rate settings (LR1: 0.0001, LR2: 0.0002, LR3: 0.0003, LR4: 0.0004) as shown in Table 9.
Visual analysis revealed that LR1 and LR2 produced architectural masses that were either excessively deformed or exhibited insufficient transformation, while LR4 often resulted in unstable and noisy outputs. On the other hand, LR3 demonstrated relatively clear architectural masses that best reflected the prompts: the extrude prompt generated outward-protruding masses, branch produced appropriately segmented and split forms, and twist resulted in dynamic torsional masses.
Quantitatively, the differences in average CLIP scores among the four learning rates were minimal, ranging from 0.2026 (LR2) to 0.2098 (LR3). Still, LR3 exhibited the most consistent architectural coherence and clarity among the compared settings, in conjunction with competitive CLIP performance. LPIPS results provided a complementary perspective. The lowest (best) average LPIPS score was observed for LR2 (0.478447), followed closely by LR4 (0.479749), LR3 (0.481103), and LR1 (0.481434). This indicates that although LR2 was slightly superior in perceptual similarity to teacher images, its outputs were visually less coherent in architectural form compared to LR3.
Taken together, these findings suggest that while LPIPS highlights perceptual closeness for LR2, the balance between semantic fidelity (CLIP) and visual clarity favors LR3. Therefore, the Learning Rate setting of 0.0003 (LR3) was selected for LoRA training.

4.2.3. LR Scheduler

To analyze the impact of LR Scheduler on model performance during LoRA training, experiments were conducted applying four different LR Scheduler settings (LRS1: constant, LRS2: cosine, LRS3: linear, LRS4: adafactor) as shown in Table 10.
Analysis showed that LRS2, LRS3, and LRS4 generally produced geometric masses that did not reflect the prompts. LRS3 and LRS4 in particular generated blurred masses with slight noise. Conversely, LRS1 effectively represented the prompts: extrude produced a protruding mass; twist a rotational transformation; and although branch exhibited some blur, it could still be interpreted as a segmented form.
Quantitatively, LRS1 achieved the highest average CLIP score (0.2127), followed by LRS4 (0.2110), LRS2 (0.2082), and LRS3 (0.2078). In parallel, LPIPS results indicated a slightly different trend: LRS3 recorded the lowest average LPIPS value (0.477515), suggesting relatively higher perceptual similarity with teacher images, while LRS1 (0.478304), LRS2 (0.482184), and LRS4 (0.484317) showed comparatively higher values. Although these LPIPS differences were marginal, the divergence between CLIP and LPIPS underscores their complementary roles—CLIP capturing semantic fidelity to prompts and LPIPS reflecting perceptual resemblance to teacher images.
Based on both qualitative and quantitative evaluations, Constant (LRS1) was selected as the LR Scheduler setting for LoRA training.

4.2.4. Batch Size

To analyze the impact of Batch Size on model performance during LoRA training, experiments were conducted applying four different Batch Size settings (B1: 1, B2: 2, B3: 4, B4: 8) as shown in Table 11.
B2 and B4 generated rectangular column-shaped masses that did not reflect the prompts and often exhibited visual noise. B1 produced masses that partially reflected the prompts for extrude and branch, but these effects were not consistently applied, and the twist transformation was weak. In contrast, B3 successfully reflected the prompts: extrude showed a progressively protruded mass, branch produced segmented forms in both regular and irregular configurations, and twist exhibited clear torsional deformation.
Quantitatively, B3 achieved the highest average CLIP score (0.2123), followed closely by B2 (0.2109) and B4 (0.2082). In terms of LPIPS, B3 obtained a score of 0.482865, slightly higher (i.e., less perceptually similar) than B1 (0.475726) and B2 (0.480138), but still consistent with the qualitative observations. Although LPIPS differences across batch sizes were relatively small, B3’s balance between semantic fidelity (CLIP) and perceptual consistency (LPIPS) supports its selection.
Based on these combined qualitative and quantitative evaluations, a Batch Size setting of 4 (B3) was selected for LoRA training.

4.2.5. Training Steps

The training step count for the hyperparameter settings derived so far is 5500, and the LoRA training loss graph is shown in Figure 4. Because the loss values show an increasing trend after 5000 steps, we conducted LoRA training with the training steps doubled to 11,000 (epoch 10) to locate the interval where the loss converges to its minimum (Table 12).
The experimental results (Figure 5) show that the loss value converges to its minimum of approximately 0.011 in the 8000–10,000 step range and increases again beyond 10,000 steps. In other words, we need to find a training step count close to, but not exceeding, 10,000 and derive hyperparameter settings accordingly. The explicit selection rule was to identify the interval where the training loss approached its minimum while avoiding the divergence observed beyond 10,000 steps. Within this convergence range, a step count of 9800 was determined to provide the most stable outcome.
Finally, the number of Training Steps was set to 9800, and, accordingly, as shown in Table 13, the optimal parameter settings were derived by setting the number of images to 280 and repeats to 14.
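As a consistency check against the step formula in Section 3.3, this final configuration gives (280 images × 14 repeats × 10 epochs) ÷ batch size 4 = 9800 total steps; the 10-epoch value is derived from that formula rather than stated directly.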
In order to verify the effectiveness of LoRA fine-tuning, a baseline comparison was conducted using the base diffusion model without LoRA training. Table 14 presents the results of mass generation using the final LoRA model configuration, alongside the output of the base diffusion model. Visual comparison indicates that the base model failed to generate architectural mass images with recognizable form or consistent alignment with prompts. In contrast, the LoRA-trained model consistently generated coherent and differentiated architectural masses that effectively reflected the prompts for extrude, branch, and twist. This baseline comparison serves as a validation of architectural knowledge acquisition through LoRA fine-tuning.
This comparison clearly demonstrates that LoRA fine-tuning enhances the model’s capability to generate prompt-aligned architectural masses.

5. Discussion

This study explored how to fine-tune architectural knowledge using the LoRA technique to implement a diffusion model for human-AI interaction, addressing the limited systematic guidance for establishing plausible parameter ranges in architectural contexts. The results showed that in learning architectural mass-related knowledge, detailed captions describing the corresponding images in the image-caption dataset influenced LoRA’s performance. This suggests that AI models can generate forms that better reflect prompts when architectural concepts and specific descriptions of transformations and shapes are included. Additionally, Adafactor was evaluated to be more accurate for architectural mass generation than Prodigy and Lion, which have recently gained attention for their good performance as optimizers. While cosine or linear adjustments of learning rates are generally known to be more effective, we found that constant, which maintains a fixed learning rate, was more suitable as an LR Scheduler for architectural mass generation.
The novelty of this study does not lie in presenting new AI architectures or algorithmic innovations, but in systematically applying LoRA to architectural massing knowledge, which has been underexplored in prior work. This framing positions the study as a bridging milestone, offering foundational guidance for emerging architectural-AI researchers and structured methodological references for advanced scholars.
These results imply that a different approach from conventional AI learning optimization is needed to learn architectural knowledge. It can be interpreted that an image-caption dataset including detailed descriptions of the geometric properties, spatial relationships, and formative features of architectural masses, together with accurate terms denoting mass forms and transformations, supports stable and consistent learning and positively influences the reflection of prompts. Architectural massing studies represent fundamental components for expressing architectural knowledge, encompassing spatial relationships, geometric transformations, and formal logic that define architectural spatial reasoning.
Furthermore, this study contributes to the broader discourse on human-AI interaction by demonstrating that AI’s performance in architectural design tasks can be systematically improved through domain-specific prompt structuring and fine-tuning. Unlike general image generation tasks, architectural mass generation requires an understanding of spatial logic and typological nuance, which this research attempts to capture and quantify. This reflects the importance of curating not only a suitable dataset but also selecting hyperparameters that reinforce architectural reasoning rather than merely visual coherence. Although the generated outputs remain simplified 2D mass forms, they can directly support early-stage design workflows by expanding option spaces and informing subsequent decisions on volumetric articulation and façade development [13]. Recent façade-focused studies reinforce this linkage: Wang et al. [37] demonstrated transformer-based façade feature acquisition from static street view images, and Wang et al. [38] showed that augmentation techniques improve window state detection. While these works emphasize façade-level articulation and openings (e.g., rhythm, window states), our study contributes at the upstream stage by optimizing mass generation. Together, they highlight how massing decisions form the foundation upon which façade composition and functional elements are developed, situating our framework as complementary to façade-oriented research [38,39].
While GANs have been widely explored in architectural image synthesis, their instability and susceptibility to mode collapse limit their reliability for systematic parameter optimization in domain-specific contexts. In contrast, the LoRA-based diffusion framework presented here demonstrates stable learning behavior with modest datasets, making it more suitable for embedding architectural knowledge and supporting consistent analysis of hyperparameter effects.
These findings also extend prior architectural diffusion studies by shifting the focus from stylistic adaptation to methodological rigor. For instance, Kim et al. [8] emphasized curated vocabularies for architectural form generation, and Wu [28] highlighted stylistic consistency through ArchDiff. Similarly, Ma and Zheng [29] and Yoo and Lee [30] demonstrated façade- and style-level customization. In contrast, this study situates LoRA fine-tuning at the massing stage, providing a structured framework for parameter optimization that directly embeds architectural reasoning. This positioning clarifies how our contribution complements, yet diverges from, earlier work in the field.
Rather than presenting novel AI architectures or proposing algorithmic innovations, this study offers a systematic parameter optimization framework that serves as practical guidance for practitioners and architectural researchers newly entering the field of AI. Although the scope of this research is limited to 2D architectural mass image generation and not interactive workflows or 3D modeling, it demonstrates that careful fine-tuning and dataset structuring can lead to meaningful improvements in the alignment between AI-generated outputs and architectural intentions.
By validating the feasibility and effectiveness of LoRA fine-tuning in architectural mass generation, the study adds to the growing body of knowledge that supports AI integration in the architectural design process. It highlights that systematic parameter optimization can play a crucial role in building foundational knowledge and practice for future advancements in the field.
Additionally, the dataset size (220 images) is modest, which limits generalizability; however, this controlled scale enabled a systematic examination of captioning strategies and hyperparameter effects. A limitation of this study is that we did not analyze in detail the changes in model performance according to combinations of hyperparameter settings for Batch Size, Epoch, and Repeats during the optimization process of Training Steps. Particularly, experimenting with different settings for Batch Size and Repeats while fixing Training Steps could have provided insights into how the number of images processed at once (Batch Size) and the number of times the same image is repeatedly learned (Repeats) affect the learning of architectural mass-related knowledge. Therefore, future research should analyze the effects of Batch Size and Repeats when learning the architectural mass image–Detailed Captions dataset to propose a more precise LoRA model optimization method.
Additionally, although this study sequentially tuned hyperparameters such as Caption Level, Optimizer, Learning Rate, LR Scheduler, Batch Size, and Training Steps, potential interaction effects between these variables—particularly between Batch Size and Repeats—were not systematically investigated. Future research will aim to address this limitation by exploring how different combinations of Batch Size, Repeats, and Epoch settings interact under fixed Training Steps. This expanded approach is expected to contribute to a more comprehensive understanding of LoRA-based optimization for architectural mass learning.
Another limitation lies in the evaluation framework. Although CLIP and LPIPS metrics were employed to strengthen the quantitative analysis, these measures remain proxies and cannot fully capture architectural reasoning such as spatial logic or functional coherence. This limitation is acknowledged, and we suggest that future work incorporate expert evaluation to complement quantitative indicators and reinforce architectural validity.
A further limitation concerns external validity. Because all evaluations in this study were conducted on 2D mass images, the findings cannot be directly generalized to architectural massing in practice, which requires 3D modeling, site-specific massing, or schematic CAD-based inputs. Future work will extend the framework to these contexts and test the robustness of LoRA-based optimization on out-of-distribution mass images, thereby enhancing its applicability to practical architectural design scenarios. This limitation also underscores the importance of connecting massing-level optimization with façade-level performance studies. As Wang et al. [37,38,39] have shown, façade analysis from urban-scale image data depends on reliable upstream massing structures. Extending LoRA-based optimization from simplified 2D forms to schematic CAD or 3D contexts would therefore enable a more direct interface with façade recognition and window state detection pipelines [38].

6. Conclusions

This study investigated the fine-tuning of architectural knowledge for architectural mass generation using LoRA applied to a diffusion model, addressing the systematic parameter optimization gap in architectural contexts where existing AI research has provided limited guidance for establishing plausible parameter ranges. Through a series of experiments, the study identified that the inclusion of detailed, domain-specific image captions significantly enhances prompt-to-output accuracy. It also demonstrated that constant learning rates and the Adafactor optimizer are more suitable in architectural workflows compared to generally favored optimizers or learning rate schedulers.
The findings across different hyperparameter experiments (Caption Level, Optimizer, Learning Rate, LR Scheduler, Batch Size, and Training Steps) were systematically synthesized to identify optimal parameter settings for architectural mass generation. Architectural massing studies, as fundamental components for expressing architectural knowledge, require specialized optimization approaches that maintain spatial relationships, geometric transformations, and formal logic inherent in architectural spatial reasoning.
The research provides foundational insight into how AI models can be adapted for architecture-specific design tasks, marking a shift from generalized image generation toward domain-sensitive modeling. While not claiming algorithmic novelty, the study contributes methodologically by establishing a systematic parameter optimization framework and providing practical guidelines for integrating AI technologies into architectural design workflows while maintaining architectural knowledge integrity.
The contributions of this study can be considered along three dimensions. Practically, it provides structured guidance for practitioners and educators to integrate LoRA-based diffusion models into early-stage design workflows. Empirically, it offers a systematic evaluation of hyperparameter effects using both semantic and perceptual metrics, strengthening the robustness of AI-assisted design research. Conceptually, it clarifies the role of architectural knowledge as spatial and morphological logic, demonstrating how such knowledge can be embedded into generative models and advancing discourse on architectural intelligence.
These findings are expected to aid future developments in intelligent design environments, where AI tools support architects through enhanced responsiveness, data-informed design iteration, and the encoding of architectural reasoning into generative processes. The systematic approach demonstrated in this research establishes a foundation for evidence-based AI adoption in architectural practice, enabling practitioners to implement domain-specific fine-tuning with greater confidence and effectiveness.
Nonetheless, several limitations should be noted. The modest dataset size, the focus on 2D massing rather than 3D or contextual modeling, and the absence of expert evaluation restrict the generalizability of the findings. Future research should expand datasets, incorporate site and 3D contexts, and engage expert assessments to validate architectural reasoning beyond proxy metrics, thereby strengthening the applicability of LoRA-based optimization to architectural practice.

Author Contributions

Conceptualization, S.M.H. and S.C.; methodology, S.M.H. and S.C.; software, S.M.H.; validation, S.M.H.; writing—original draft preparation, S.M.H.; writing—review and editing, S.C.; visualization, S.M.H.; supervision, S.C.; project administration, S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in 2025 by the Korea Agency for Infrastructure Technology Advancement (KAIA) grant funded by the Ministry of Land, Infrastructure and Transport (Grant RS-2021-KA163269).

Data Availability Statement

The data presented in this study are available from the corresponding author upon reasonable request. The data are not publicly available due to privacy policies.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Carpo, M. Beyond Digital: Design and Automation at the End of Modernity; MIT Press: Cambridge, MA, USA, 2023. [Google Scholar]
  2. Bao, Y.; Xiang, C. Exploration of Conceptual Design Generation Based on the Deep Learning Model—Discussing the Application of AI Generator to the Preliminary Architectural Design Process. In Proceedings of the International Conference on Architecture and Urban Planning, Suzhou, China, 11–12 November 2023; Springer Nature Singapore Pte Ltd.: Singapore, 2024; pp. 171–178. [Google Scholar] [CrossRef]
  3. Luhrs, M. Using Generative AI Midjourney to Enhance Divergent and Convergent Thinking in an Architect’s Creative Design Process. Des. J. 2024, 27, 677–699. [Google Scholar] [CrossRef]
  4. Almaz, A.F.H.; El-Agouz, E.A.E.; Abdelfatah, M.T.; Mohamed, I.R. The Future Role of Artificial Intelligence (AI) Design’s Integration into Architectural and Interior Design Education is to Improve Efficiency, Sustainability, and Creativity. Civ. Eng. Archit. 2024, 12, 1749–1772. [Google Scholar] [CrossRef]
  5. Cao, Y.; Abdul Aziz, A.; Mohd Arshard, W.N.R. Stable diffusion in architectural design: Closing doors or opening new horizons? Int. J. Archit. Comput. 2024, 23, 339–357. [Google Scholar] [CrossRef]
  6. Sadek, M.M.; Hassan, A.Y.; Diab, T.O.; Abdelhafeez, A. Creating Images with Stable Diffusion and Generative Adversarial Networks. Int. J. Telecommun. 2024, 4, 1–14. [Google Scholar] [CrossRef]
  7. Leach, N. Architecture in the Age of Artificial Intelligence: An Introduction to AI for Architects; Bloomsbury Visual Arts: London, UK, 2022. [Google Scholar]
  8. Kim, F.; Johanes, M.; Huang, J. Text2Form Diffusion: Framework for learning curated architectural vocabulary. In Proceedings of the 41st Conference on Education and Research in Computer Aided Architectural Design in Europe (eCAADe), Graz, Austria, 20–23 September 2023. [Google Scholar]
  9. Li, S.; Su, S.; Lin, X. Optimizing the hyper-parameters of deep reinforcement learning for building control. Build. Simul. 2025, 18, 765–789. [Google Scholar] [CrossRef]
  10. Manmatharasan, P.; Bitsuamlak, G.; Grolinger, K. AI-driven design optimization for sustainable buildings: A systematic review. Build. Environ. 2025, 310, 112707. [Google Scholar] [CrossRef]
  11. Ma, Z.; Cui, S.; Joe, I. An Enhanced Proximal Policy Optimization-Based Reinforcement Learning Method with Random Forest for Hyperparameter Optimization. Appl. Sci. 2022, 12, 7006. [Google Scholar] [CrossRef]
  12. Gero, J.S.; Jupp, J. Strategic use of representation in architectural massing. Build. Res. Inf. 2003, 31, 429–437. [Google Scholar] [CrossRef]
  13. Park, J.; Hong, S.M.; Choo, S.Y. A Generation Method and Evaluation of Architectural Facade Design Using Stable Diffusion with LoRA and ControlNet. J. Archit. Inst. Korea Plan. Des. 2025, 41, 85–96. [Google Scholar] [CrossRef]
  14. Panigrahi, A.; Saunshi, N.; Zhao, H.; Arora, S.K. Task-Specific Skill Localization in Fine-tuned Language Models. arXiv 2023, arXiv:2302.06600. [Google Scholar] [CrossRef]
  15. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  16. Yang, Y.; Bhatt, N.; Ingebrand, T.; Ward, W.; Carr, S.; Wang, Z.; Topcu, U. Fine-Tuning Language Models Using Formal Methods Feedback. arXiv 2023, arXiv:2310.18239. [Google Scholar] [CrossRef]
  17. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
  18. Bernstein, P. Machine Learning: Architecture in the Age of Artificial Intelligence; RIBA Publishing: London, UK, 2022. [Google Scholar]
  19. Stiny, G.; Gips, J. Shape Grammars and the Generative Specification of Painting and Sculpture. In Proceedings of the IFIP Congress, Ljubljana, Yugoslavia, 23–28 August 1971; North-Holland: Amsterdam, The Netherlands, 1972. [Google Scholar]
  20. Schumacher, P. Parametricism as Style—Parametricist Manifesto. In Proceedings of the 11th Architecture Biennale, Venice, Italy, 14 September–23 November 2008. [Google Scholar]
  21. Hong, S.M. Development of an AI-Based Architectural Mass Design System for Enhancing Creative Thinking. Ph.D. Thesis, Kyungpook National University, Daegu, Republic of Korea, 2025. [Google Scholar]
  22. del Campo, M. Ontology of diffusion models: Tools, language and architecture design. In Diffusions in Architecture: Artificial Intelligence and Image Generators; del Campo, M., Ed.; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2024; pp. 44–54. [Google Scholar]
  23. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. arXiv 2014, arXiv:1409.0575. [Google Scholar] [CrossRef]
24. Ahmad, I.S.; Siddiqui, N.; Boufama, B. A Comparative Study of Text-to-Image Generative Models. In Proceedings of the 2024 IEEE 12th International Symposium on Signal, Image, Video and Communications (ISIVC), Marrakech, Morocco, 21–23 May 2024. [Google Scholar] [CrossRef]
  25. Sudha, L.; Aruna, K.; Sureka, V.; Niveditha, M.; Prema, S. Semantic Image Synthesis from Text: Current Trends and Future Horizons in Text-to-Image Generation. EAI Endorsed Trans. Internet Things 2024, 11. [Google Scholar] [CrossRef]
  26. Çelik, T. Generative design experiments with artificial intelligence: Reinterpretation of shape grammar. Open House Int. 2023, 49, 123–135. [Google Scholar] [CrossRef]
  27. Horvath, A.S.; Pouliou, P. AI for conceptual architecture: Reflections on designing with text-to-text, text-to-image, and image-to-image generators. Front. Archit. Res. 2024, 13, 593–612. [Google Scholar] [CrossRef]
  28. Wu, Y. ArchDiff: Streamlining Architectural Design with Diffusion-Based Style Generation. Lect. Notes Comput. Sci. 2024, 14871, 300–315. [Google Scholar]
  29. Ma, H.; Zheng, H. Text Semantics to Image Generation: A Method of Building Facades Design Based on Stable Diffusion Model. In Phygital Intelligence, Computational Design and Robotic Fabrication; Springer: Berlin/Heidelberg, Germany, 2024. [Google Scholar]
  30. You, Y.; Lee, J. Generative AI-Based Construction of Architect’s Style-trained Models and its Application for Visualization of Residential Houses. Soc. Des. Converg. 2023, 22, 103–116. [Google Scholar] [CrossRef]
  31. Steenson, M.W. Architectural Intelligence: How Designers and Architects Created the Digital Landscape; MIT Press: Cambridge, MA, USA, 2017. [Google Scholar]
  32. Di Mari, A.; Yoo, N. Operative Design: A Catalogue of Spatial Verbs; BIS Publishers: Amsterdam, The Netherlands, 2013. [Google Scholar]
  33. Di Mari, A. Conditional Design: An Introduction to Elemental Architecture; BIS Publishers: Amsterdam, The Netherlands, 2014. [Google Scholar]
  34. Steen, J. Bridging the Gap Between Generative Artificial Intelligence and Innovation in Footwear Design. Master’s Thesis, Delft University of Technology, Delft, The Netherlands, 2024. [Google Scholar]
  35. Lin, H.; Hong, D.; Ge, S.; Luo, C.; Jiang, K.; Jin, H.; Wen, C. RS-MoE: Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering. arXiv 2024, arXiv:2407.02233. [Google Scholar]
  36. Chang, Z.-Y.; Han, J.-W. Analysis on prompt engineering structure of AI-generated architectural/interior design images. In Proceedings of the Korean Institute of Interior Design Spring Conference, Bucheon, Republic of Korea, 18 May 2024; Volume 26, pp. 150–154. [Google Scholar]
  37. Wang, S.; Korolija, I.; Rovas, D. Development of Approach to an Automated Acquisition of Static Street View Images Using Transformer Architecture for Analysis of Building Characteristics. Sci. Rep. 2025, 15, 29062. [Google Scholar] [CrossRef] [PubMed]
  38. Wang, S.; Korolija, I.; Rovas, D. Impact of Traditional Augmentation Methods on Window State Detection. In Proceedings of the 14th REHVA HVAC World Congress CLIMA 2022, Rotterdam, The Netherlands, 22–25 May 2022. [Google Scholar] [CrossRef]
  39. Wang, S.; Korolija, I.; Rovas, D. Transformer-Based Building Façade Recognition Using Static Street View Images. In Proceedings of the European Conference on Computing in Construction (EC3), Crete, Greece, 10–12 July 2023. [Google Scholar] [CrossRef]
Figure 1. The basic structure of LoRA: only the low-rank matrices A and B are trained. Adapted from Hu et al. [17].
Figure 2. Diagram of the LoRA training procedure.
Figure 3. Examples of CLIP score results.
Figure 4. Loss graph of experiment B3 (batch size 4).
Figure 5. Loss graph of the 11,000-step training run.
Table 1. Example of image dataset *.
Expand | Extrude | Inflate | Branch | Merge
Nest | Bend | Skew | Twist | Interlock
Intersect | Lift | Lodge | Overlap | Rotate
Shift | Carve | Compress | Fracture | Grade
Notch | Pinch | Shear | Taper | Embed
Extract | Inscribe | Puncture | Split | Reflect + Expand
Pack + Inflate | Array + Stack + Rotate | Array + Taper | Join + Array + Pinch | Join + Split
(Each cell contains the corresponding mass image, Buildings 15 03477 i001–i035.)
* The images in Table 1 were redrawn and adapted from the original sources to clearly illustrate architectural mass transformations as spatial verbs. All images were preprocessed to 150 dpi and 1024 × 1024 pixels to optimize them for LoRA training.
Table 2. Examples of image captioning using ChatGPT-4o.
ChatGPT-4o | Expand (Buildings 15 03477 i036) | Extrude (Buildings 15 03477 i037)
Common (both images):
2. Medium: illustration
3. Environment: Abstract/Conceptual
4. Lighting: Standard technical drawing lighting with shading to indicate depth
5. Color: Monochromatic (shades of grey with some darker shading/edges)
6. Mood: Neutral, technical
Feature (Expand):
1. Subject: 3D geometric shape (expansion of a smaller rectangular prism from a larger rectangular base with an internal hollow section)
7. Composition: Isometric view, showing the expansion of a smaller rectangular prism from a larger rectangular base with an internal hollow section, with dashed lines to indicate the hidden edges and shading to emphasize depth and dimension
Feature (Extrude):
1. Subject: 3D geometric shape (extrusion of a smaller rectangular prism from a larger stepped rectangular base with an internal hollow section)
7. Composition: Isometric view, showing the extrusion of a smaller rectangular prism from a larger stepped rectangular base with an internal hollow section, with dashed lines to indicate the hidden edges and shading to emphasize depth and dimension
Table 3. Examples of image–caption dataset.
Image | No Caption | Simple Caption | Detailed Caption
Buildings 15 03477 i038 | – | expand | expand, expansion of a smaller rectangular prism from a larger rectangular base with an internal hollow section, illustration, conceptual, standard technical drawing lighting, monochromatic, neutral mood, isometric view
Buildings 15 03477 i039 | – | extrude | extrude, extrusion of a smaller rectangular prism from a larger rectangular base with an internal hollow section, illustration, conceptual, standard technical drawing lighting, monochromatic, neutral mood, isometric view
Buildings 15 03477 i040 | – | reflect_expand | reflect_expand, combination of reflecting and expanding rectangular prisms forming a complex, interlocked structure, illustration, conceptual, standard technical drawing lighting, monochromatic, neutral mood, isometric view
Buildings 15 03477 i041 | – | join_split | join_split, a joined and split rectangular prism forming a zigzag pattern, extended further, illustration, conceptual, standard technical drawing lighting, monochromatic, neutral mood, isometric view
Buildings 15 03477 i042 | – | array, stack_rotate | array, stack_rotate, combination of stacking and rotating rectangular prisms forming a cubic, interlocked structure, illustration, conceptual, standard technical drawing lighting, monochromatic, neutral mood, isometric view
Buildings 15 03477 i043 | – | join, array_pinch | join, array_pinch, combination of joined and pinched rectangular prisms forming a complex, interlocked structure, illustration, conceptual, standard technical drawing lighting, monochromatic, neutral mood, isometric view
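The detailed captions above are highly regular: a trigger word (the spatial verb), a subject/composition description, and a fixed tail of style descriptors shared by all images. A minimal sketch of that assembly pattern follows; it is our illustration of the scheme in Tables 2 and 3, not code released by the authors, and the function and constant names are our own.

```python
# Illustrative helper reconstructing the "Detailed" captions of Table 3
# from the structured ChatGPT-4o fields of Table 2. Only the strings
# themselves come from the tables; the structure is our reconstruction.
COMMON_STYLE = [
    "illustration",
    "conceptual",
    "standard technical drawing lighting",
    "monochromatic",
    "neutral mood",
    "isometric view",
]

def build_detailed_caption(trigger: str, subject: str) -> str:
    # trigger: spatial verb used as the LoRA trigger word;
    # subject: the Subject/Composition description for that image.
    return ", ".join([trigger, subject, *COMMON_STYLE])

print(build_detailed_caption(
    "expand",
    "expansion of a smaller rectangular prism from a larger "
    "rectangular base with an internal hollow section",
))
# -> matches the Detailed caption for image i038 in Table 3
```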
Table 4. Hyperparameter variants for LoRA trainings.
Hyperparameter | LoRA Experiment | Value
Caption | C1 | No Caption
 | C2 | Simple Caption
 | C3 | Detailed Caption
Optimizer | O1 | AdamW
 | O2 | Adafactor
 | O3 | Prodigy
 | O4 | Lion
Learning Rate | LR1 | 0.0001
 | LR2 | 0.0002
 | LR3 | 0.0003
 | LR4 | 0.0004
LR Scheduler | LRS1 | constant
 | LRS2 | cosine
 | LRS3 | linear
 | LRS4 | adafactor
Batch Size | B1 | 1
 | B2 | 2
 | B3 | 4
 | B4 | 8
Training Steps | Analysis of Loss Graph | 5500
 | | 9800
 | | 11,000
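The three step counts in the last row are consistent with the dataset sizes, per-image repeats, epochs, and batch sizes reported in Tables 6, 12 and 13. A quick sketch of that arithmetic, assuming the relation steps = images × repeats × epochs ÷ batch size (an assumption on our part, though it reproduces all three published values exactly):

```python
# Reconstruction of the training-step counts listed in Table 4 from the
# configurations in Tables 6, 12 and 13. The relation below is assumed,
# but it reproduces all three published step counts exactly.
def total_steps(images: int, repeats: int, epochs: int, batch_size: int) -> int:
    return images * repeats * epochs // batch_size

assert total_steps(220, 20, 5, 4) == 5_500    # Table 6
assert total_steps(280, 14, 10, 4) == 9_800   # Table 13
assert total_steps(220, 20, 10, 4) == 11_000  # Table 12
```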
Table 5. Prompt and input images used for LoRA-trained model evaluation.
Prompts: Extrude, Branch, Twist
Input images for img2img and Inpaint: Buildings 15 03477 i044–i046
Table 6. Hyperparameters fixed for model evaluation.
No. Images | Optimizer | Learning Rate | LR Scheduler | Batch Size | Epoch | Repeats
220 | Adafactor | 0.0003 | constant | 4 | 5 | 20
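For readers wishing to reproduce a comparable run, the following is a hedged sketch of Table 6's configuration expressed as an invocation of the widely used kohya-ss/sd-scripts LoRA trainer. The flag names follow that tool's conventions; the base checkpoint, paths, and dataset folder naming (where the `20_` prefix encodes 20 repeats per image) are placeholders, not settings published by the authors.

```python
# Hedged sketch: Table 6's fixed hyperparameters as a kohya-ss/sd-scripts
# call (train_network.py). Paths and the base model are placeholders.
import subprocess

subprocess.run([
    "accelerate", "launch", "train_network.py",
    "--pretrained_model_name_or_path", "path/to/base-stable-diffusion",  # placeholder
    "--train_data_dir", "dataset/img",        # contains "20_massing" (20 repeats)
    "--network_module", "networks.lora",      # train only the LoRA adapters
    "--optimizer_type", "Adafactor",
    "--learning_rate", "0.0003",
    "--lr_scheduler", "constant",
    "--train_batch_size", "4",
    "--max_train_epochs", "5",
    "--resolution", "1024,1024",              # matches Table 1's preprocessing
    "--output_name", "massing_lora",
], check=True)
```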
Table 7. Comparative results of Caption Levels (C1: No Caption, C2: Simple Caption, C3: Detailed Caption). C3 produced clearer geometric transformations (extrude, branch, twist) and achieved the highest average CLIP score while maintaining a comparable LPIPS score, indicating stronger prompt fidelity.
Prompt | C1 | C2 | C3
extrude (img2img) | (0.2467) [0.350381] | (0.2446) [0.346895] | (0.2512) [0.353704]
extrude (Inpaint) | (0.2127) [0.647027] | (0.2127) [0.650432] | (0.2247) [0.647987]
branch (img2img) | (0.1979) [0.326616] | (0.2107) [0.333401] | (0.2159) [0.326094]
branch (Inpaint) | (0.1812) [0.688126] | (0.1849) [0.677643] | (0.1856) [0.675901]
twist (img2img) | (0.2038) [0.214463] | (0.2044) [0.191077] | (0.2228) [0.197385]
twist (Inpaint) | (0.1890) [0.670984] | (0.1914) [0.675118] | (0.1939) [0.684472]
Average CLIP Score | 0.2052 | 0.2081 | 0.2157
Average LPIPS Score | 0.482933 | 0.479108 | 0.481524
(Each cell also contains the corresponding generated image, Buildings 15 03477 i047–i064; values are shown as (CLIP Score) [LPIPS Score], and the row labels follow the img2img/Inpaint input order of Table 5.)
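For reference, the two metrics reported in Tables 7–11 can be computed with open-source implementations. Below is a minimal, hedged sketch using torchmetrics' CLIPScore and the lpips package; the library choices and CLIP backbone are our assumptions, not the authors' stated tooling, and torchmetrics reports CLIP score scaled by 100, so it is divided back into the 0–1 range used in these tables.

```python
# Minimal sketch of the evaluation metrics in Tables 7-11, assuming
# torchmetrics' CLIPScore and the `lpips` package
# (pip install torchmetrics lpips). The backbone is an assumption.
import torch
import lpips
from torchmetrics.multimodal.clip_score import CLIPScore

clip_metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
lpips_metric = lpips.LPIPS(net="alex")  # AlexNet-based perceptual distance

def evaluate(generated: torch.Tensor, reference: torch.Tensor, prompt: str):
    """generated/reference: (1, 3, H, W) uint8 image tensors in [0, 255]."""
    # CLIPScore returns 100 * cosine similarity; rescale to the 0-1 range
    # in which the tables report values such as 0.2157.
    clip_score = clip_metric(generated, [prompt]).item() / 100.0
    # LPIPS expects float images normalized to [-1, 1].
    to_pm1 = lambda img: img.float() / 127.5 - 1.0
    lpips_score = lpips_metric(to_pm1(generated), to_pm1(reference)).item()
    return clip_score, lpips_score
```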
Table 8. Comparative results of four Optimizers (O1: AdamW, O2: Adafactor, O3: Prodigy, O4: Lion). O4 achieved the highest CLIP score but generated distorted masses, while O2 provided the most balanced outputs with greater architectural fidelity, leading to its selection as the optimal optimizer.
Prompt | O1 | O2 | O3 | O4
extrude (img2img) | (0.2435) [0.343182] | (0.2555) [0.349378] | (0.2433) [0.354583] | (0.2481) [0.363083]
extrude (Inpaint) | (0.2100) [0.651672] | (0.2177) [0.645908] | (0.2016) [0.648482] | (0.2166) [0.651206]
branch (img2img) | (0.1766) [0.343366] | (0.2135) [0.333571] | (0.1943) [0.337466] | (0.2280) [0.344603]
branch (Inpaint) | (0.1905) [0.674829] | (0.1920) [0.677852] | (0.1792) [0.678657] | (0.1913) [0.673498]
twist (img2img) | (0.1901) [0.214797] | (0.2165) [0.214462] | (0.1961) [0.212614] | (0.2188) [0.219594]
twist (Inpaint) | (0.1813) [0.675989] | (0.1888) [0.676852] | (0.1857) [0.686007] | (0.1950) [0.675635]
Average CLIP Score | 0.1987 | 0.2140 | 0.2000 | 0.2163
Average LPIPS Score | 0.483972 | 0.483004 | 0.486301 | 0.487936
(Layout as in Table 7; generated images Buildings 15 03477 i065–i088.)
Table 9. Comparative results of Learning Rates (LR1: 0.0001, LR2: 0.0002, LR3: 0.0003, LR4: 0.0004). LR3 yielded the clearest architectural transformations and competitive quantitative scores, indicating that 0.0003 is the most suitable learning rate setting.
Prompt | LR1 | LR2 | LR3 | LR4
extrude (img2img) | (0.2341) [0.350163] | (0.2287) [0.350193] | (0.2380) [0.351559] | (0.2447) [0.342769]
extrude (Inpaint) | (0.2201) [0.642323] | (0.2130) [0.634285] | (0.2137) [0.642483] | (0.2067) [0.642551]
branch (img2img) | (0.2037) [0.347179] | (0.2141) [0.340617] | (0.2074) [0.333245] | (0.2026) [0.333787]
branch (Inpaint) | (0.1895) [0.667573] | (0.1745) [0.661532] | (0.1916) [0.676919] | (0.1873) [0.670242]
twist (img2img) | (0.2170) [0.199774] | (0.1947) [0.207238] | (0.2163) [0.211201] | (0.2095) [0.210177]
twist (Inpaint) | (0.1827) [0.681593] | (0.1909) [0.676819] | (0.1916) [0.671200] | (0.1921) [0.678881]
Average CLIP Score | 0.2078 | 0.2026 | 0.2098 | 0.2072
Average LPIPS Score | 0.481434 | 0.478447 | 0.481103 | 0.479749
(Layout as in Table 7; generated images Buildings 15 03477 i089–i112.)
Table 10. Comparative results of LR Schedulers (LRS1: constant, LRS2: cosine, LRS3: linear, LRS4: adafactor). LRS1 consistently reflected prompts more clearly and recorded the highest CLIP score, supporting its selection as the most effective scheduler.
Prompt | LRS1 | LRS2 | LRS3 | LRS4
extrude (img2img) | (0.2562) [0.347824] | (0.2496) [0.360708] | (0.2446) [0.348970] | (0.2518) [0.360331]
extrude (Inpaint) | (0.2149) [0.643275] | (0.2118) [0.640745] | (0.2231) [0.643537] | (0.2107) [0.634892]
branch (img2img) | (0.1937) [0.331238] | (0.1964) [0.336442] | (0.1931) [0.351390] | (0.2167) [0.342915]
branch (Inpaint) | (0.1978) [0.667002] | (0.1891) [0.677781] | (0.1934) [0.670332] | (0.1884) [0.671802]
twist (img2img) | (0.2201) [0.198131] | (0.2096) [0.202021] | (0.2120) [0.168319] | (0.2141) [0.212691]
twist (Inpaint) | (0.1935) [0.680731] | (0.1927) [0.675404] | (0.1807) [0.682539] | (0.1843) [0.683273]
Average CLIP Score | 0.2127 | 0.2082 | 0.2078 | 0.2110
Average LPIPS Score | 0.478034 | 0.482184 | 0.477515 | 0.484317
(Layout as in Table 7; generated images Buildings 15 03477 i113–i136.)
Table 11. Comparative results of Batch Sizes (B1: 1, B2: 2, B3: 4, B4: 8). B3 demonstrated the most coherent prompt reflection across extrude, branch, and twist, and achieved the highest CLIP score, confirming Batch Size 4 as the optimal configuration.
Prompt | B1 | B2 | B3 | B4
extrude (img2img) | (0.2448) [0.350233] | (0.2574) [0.359554] | (0.2485) [0.350928] | (0.2472) [0.356032]
extrude (Inpaint) | (0.2164) [0.635002] | (0.2101) [0.649085] | (0.2118) [0.648677] | (0.2094) [0.633884]
branch (img2img) | (0.1988) [0.325867] | (0.2167) [0.336992] | (0.2173) [0.331515] | (0.2092) [0.342192]
branch (Inpaint) | (0.1850) [0.675247] | (0.1875) [0.666970] | (0.1873) [0.685460] | (0.1766) [0.661859]
twist (img2img) | (0.2134) [0.191182] | (0.2116) [0.192647] | (0.2173) [0.206828] | (0.2183) [0.204495]
twist (Inpaint) | (0.1931) [0.676824] | (0.1821) [0.675579] | (0.1918) [0.673779] | (0.1886) [0.671848]
Average CLIP Score | 0.2086 | 0.2109 | 0.2123 | 0.2082
Average LPIPS Score | 0.475726 | 0.480138 | 0.482865 | 0.478385
(Layout as in Table 7; generated images Buildings 15 03477 i137–i160.)
Table 12. Hyperparameters for 11,000 training steps.
No. Images | Optimizer | Learning Rate | LR Scheduler | Batch Size | Epoch | Repeats
220 | Adafactor | 0.0003 | constant | 4 | 10 | 20
Table 13. Hyperparameters for 9800 training steps.
No. Images | Optimizer | Learning Rate | LR Scheduler | Batch Size | Epoch | Repeats
280 | Adafactor | 0.0003 | constant | 4 | 10 | 14
Table 14. Comparison of generation results without LoRA and with the optimal LoRA model.
Prompt | Without LoRA | The Optimal LoRA Model
extrude (img2img) | Buildings 15 03477 i161, i162 | i163, i164
extrude (Inpaint) | i165, i166 | i167, i168
branch (img2img) | i169, i170 | i171, i172
branch (Inpaint) | i173, i174 | i175, i176
twist (img2img) | i177, i178 | i179, i180
twist (Inpaint) | i181, i182 | i183, i184
(Two generated images per cell; row labels follow the img2img/Inpaint input order of Table 5.)
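To illustrate how a trained massing LoRA of this kind is applied at inference time for comparisons such as Table 14, the following is a hedged sketch using Hugging Face diffusers; the base checkpoint, file names, prompt wording, and strength value are illustrative placeholders rather than the authors' published settings.

```python
# Hedged sketch of img2img inference with a trained massing LoRA, using
# Hugging Face diffusers. Base model, paths, and parameters are placeholders.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # placeholder base checkpoint
    torch_dtype=torch.float16,
).to("cuda")
pipe.load_lora_weights("massing_lora.safetensors")  # LoRA trained as above

init_image = Image.open("input_mass.png").convert("RGB")  # e.g., a Table 5 input

result = pipe(
    prompt="twist, twisting rectangular prisms, illustration, conceptual, "
           "monochromatic, isometric view",  # detailed-caption style prompt
    image=init_image,
    strength=0.7,          # how far the output may depart from the input mass
    guidance_scale=7.5,
).images[0]
result.save("twist_with_lora.png")
```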