1. Introduction
Residential buildings represent the bulk of projects in the architectural, engineering, and construction (AEC) industries and are closely related to people’s quality of life. With the rapid development of artificial intelligence technologies [1,2,3,4,5], the automatic generation of residential building design using AI has attracted increasing attention [6,7,8,9]. Automatic generation can significantly improve the efficiency of architectural design, streamlining the repetitive and time-consuming cycle of drafting, evaluating, and modifying, and thus alleviating the designer’s workload [10,11,12]. Moreover, automatic generation can efficiently provide architects with a wide range of design options within a short timeframe. This enables them to explore and select optimal solutions, even under tight project schedules, thereby increasing the likelihood of achieving high-quality and creative design outcomes. This would hardly be possible through manual design [13]. In addition, automatic generation also benefits the consistency and standardization of design.
To fully realize the aforementioned advantages and further improve model performance, appropriate evaluation metrics for generated residential designs are crucial but have been largely overlooked in existing studies. Most existing studies used general metrics borrowed from computer vision, such as the Fréchet Inception Distance (FID), Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), Intersection over Union (IoU), and Graph Edit Distance (GED) (details in Section 2.2). While these metrics have proven effective in domains such as image generation and graph analysis, they are not specifically tailored to the context of architectural design. Consequently, they fail to adequately capture the essential characteristics and address the specific requirements of residential design generation. Partly due to the insufficiency of quantitative metrics, many existing studies also conducted user studies or expert reviews to evaluate the generated designs. While these can provide useful insights into the generated designs, their results are subjective and may vary across different groups of experts. It is also difficult to evaluate a large number of designs this way because of the heavy workload of manual evaluation.
Considering the characteristics and requirements of residential building design, and based on our literature review and analysis, we argue that successful evaluation metrics should cover the following fundamental aspects: compliance with design requirements [14], comprehensiveness of evaluation [15], diversity of generated designs [16], interpretability [17], and computational efficiency [18].
- (1)
Compliance with the design requirements: Residential design must be in line with the specific requirements of the owner or targeted users. For example, if a two-bed residence is required but the generative model yields a three-bed design, then the design should be flagged by the evaluation metrics so that the designers can discard it. This is particularly important because residential projects involve a sequence of design stages [19,20], such as conceptual design, schematic design, design development, and construction documents [21], so modification costs in later stages caused by non-compliance with design requirements in earlier stages are hefty [22,23]. Therefore, the metrics must be able to filter out the non-compliant generated designs.
- (2)
Comprehensive evaluation: Residential design involves a series of features, such as the number of rooms of each type (e.g., bedroom, balcony), the spatial connectivity and proximity of rooms (often shown as a graph, or “bubble diagram”), the location and orientation of rooms (e.g., south-facing rooms are often preferred in the northern hemisphere), and the geometries of rooms (e.g., area and aspect ratio) [24,25]. Successful evaluation metrics should reflect all of these major features.
- (3)
Diversity of generated designs: Being able to generate a large variety of designs within the requirements is an important aspect of generative models, as this allows architects to compare different possibilities and select a more desirable solution with greater confidence for further development in later design stages [26]. Therefore, the evaluation metrics should measure the diversity of the designs generated by the model.
- (4)
Interpretability: Due to the large number of generated designs, it is important for the evaluation metrics to label the designs in a definable, interpretable, and traceable way to assist architects in manual screening. For example, architects can filter the designs based on specific room sizes, or compare designs with different types of spatial connectivity.
- (5)
Computational efficiency: Due to the large number of generated designs, the evaluation metrics should be computationally efficient to avoid lengthy calculations.
To address the research gap of insufficient evaluation metrics for generated residential designs, we propose a group of systematic and comprehensive evaluation metrics, called Residential Floor Plan Assessment (RFP-A), that meets the five essential requirements above. We then conducted extensive comparisons with existing metrics, showing the superiority of RFP-A. We also evaluated the performance of existing generative models for residential floor plans using RFP-A to demonstrate its effectiveness.
To the best of our knowledge, this is the first study to establish evaluation metrics specifically tailored for generated residential designs. Our contributions include the following:
- (1)
We developed the novel RFP-A evaluation metrics, which can assess the accuracy and diversity of residential floor plans generated by AI models in a comprehensive and interpretable manner, with improved computational efficiency mainly due to our revision of GED. This not only leads to more reliable assessments of generative models but also helps architects make full use of a large number of generated designs and select the optimal solutions from them.
- (2)
We conducted an in-depth analysis and comparison of RFP-A and existing evaluation metrics, both qualitatively and quantitatively, identifying the issues of using existing metrics for residential design and showing the superiority of RFP-A.
- (3)
We evaluated the performance of six existing generative models for residential designs using RFP-A. This not only provides useful insights into the development of using generative AI in architectural design but also sets up a paradigm of performance evaluation for similar generative models for future studies.
5. Evaluation of Existing Models for Residential Plan Generation
In this section, we use RFP-A and existing evaluation metrics to evaluate current models for automatic residential plan generation. We tested six existing models for residential design generation, which can be divided into the following categories based on their input conditions: (1) graph input: HouseGAN [35], HouseGAN++ [34], and HouseDiffusion [33]; (2) room constraints: Graph2Plan [32]; (3) natural language descriptions: Tell2Design [36]; and (4) instance input: FloorplanDiffusion [38]. The input and output forms of the models are visualized in Figure 10. To ensure the generalizability and fairness of the model comparison, we tested three generation scenarios: (1) “Generating floor plans containing one bedroom, one living room, one kitchen, one bathroom, one balcony”; (2) “Generating floor plans containing two bedrooms, one living room, one kitchen, one bathroom, one balcony”; and (3) “Generating floor plans containing three bedrooms, one living room, one kitchen, one bathroom, one balcony”. We let each of the above models generate 1000 floor plans, and then used RFP-A and the existing evaluation metrics to conduct the evaluations.
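For concreteness, the three scenario requirements can be expressed as per-room-type count specifications, as in the following minimal Python sketch (the names and data structure are our own illustration, not the implementation used in the experiments):

```python
# Hypothetical encoding of the three test scenarios as required room counts per type.
SCENARIOS = {
    "one_bedroom":   {"bedroom": 1, "living room": 1, "kitchen": 1, "bathroom": 1, "balcony": 1},
    "two_bedroom":   {"bedroom": 2, "living room": 1, "kitchen": 1, "bathroom": 1, "balcony": 1},
    "three_bedroom": {"bedroom": 3, "living room": 1, "kitchen": 1, "bathroom": 1, "balcony": 1},
}
N_PLANS_PER_MODEL = 1000  # each model generates 1000 floor plans for evaluation
```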
The evaluation results are shown in Table 7. First, we evaluated the accuracy of the generation, i.e., the percentage of the generated plans that meet the design requirements. For simplicity, and for fairness across models with different forms of input, we only evaluated the accuracy regarding room numbers here. Then, we evaluated the diversity of the generated plans that meet the design requirements using the number of categories in each step. Additionally, we calculated the values of the existing evaluation metrics.
5.1. Accuracy
Since we defined the number of rooms of each type as the design requirement, accuracy here is calculated as the number of generated plans with the correct room counts divided by the total number of generated plans. Surprisingly, only HouseDiffusion and FloorplanDiffusion have reasonable accuracies, higher than 90% in all three scenarios, while the accuracies of the other four models are rather low, below or around 60% in every scenario.
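As an illustration, the room-number accuracy can be computed as in the following sketch, assuming each generated plan is represented simply as a list of room-type labels (this representation is an assumption for illustration, not the exact data format used in our experiments):

```python
from collections import Counter

def room_number_accuracy(generated_plans, required_counts):
    """Fraction of generated plans whose per-type room counts exactly match the requirement.

    generated_plans: list of plans, each a list of room-type labels, e.g. ["bedroom", "kitchen", ...]
    required_counts: dict mapping room type to the required number of rooms
    """
    target = Counter(required_counts)
    n_correct = sum(1 for plan in generated_plans if Counter(plan) == target)
    return n_correct / len(generated_plans)

# Example: two plans checked against the two-bedroom requirement
plans = [
    ["bedroom", "bedroom", "living room", "kitchen", "bathroom", "balcony"],  # compliant
    ["bedroom", "living room", "kitchen", "bathroom", "balcony"],             # one bedroom missing
]
required = {"bedroom": 2, "living room": 1, "kitchen": 1, "bathroom": 1, "balcony": 1}
print(room_number_accuracy(plans, required))  # 0.5
```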
Among the three models based on graph inputs, the superior accuracy of HouseDiffusion is likely because it is based on a diffusion architecture, which is more advanced and stable than the older GAN architecture. The difference is particularly obvious in the most complicated three-bedroom scenario, where the accuracy of HouseGAN and HouseGAN++ drops below 10%, while that of HouseDiffusion remains above 90%.
FloorplanDiffusion, the other highly accurate model, is also diffusion-based, further indicating the strong capability of diffusion architectures for generating floor plans. Moreover, the input of FloorplanDiffusion is more flexible and intuitive, providing an alternative form of input that is effective for initiating floor plan generation in addition to the more commonly used graph.
Graph2Plan is partly rule-based, resulting in relatively low accuracy, while Tell2Design, the first model to generate floor plans from natural language input, also has rather low accuracy, potentially because it is cross-modal. A more sophisticated model architecture than its current Seq2Seq structure might help improve its accuracy.
5.2. Diversity
This subsection focuses on the diversity of the correctly generated floor plans, i.e., the categories of floor plans that have the prescribed number of rooms of each type. Since the number of rooms of each type was prescribed for each scenario, there is only one correct category in Number, so the models’ diversity in Number is not distinguishable. However, to show the diversity of all generated results more comprehensively, the diversity in Number is denoted as A(T), which indicates that a total of T categories were generated, but only A categories met the requirements, while the remaining (T-A) categories were incorrectly generated. This shows that a large number of incorrect categories were generated by HouseGAN, HouseGAN++, Graph2Plan, and Tell2Design, suggesting poor compliance of the generation with the design requirements, while HouseDiffusion and FloorplanDiffusion generated far fewer incorrect categories, showing much better compliance. These results are in line with those shown in Section 5.1.
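The A(T) counts for the diversity in Number can be obtained, for instance, by treating each distinct combination of per-type room counts as a category, as in the following sketch (again assuming a simplified list-of-labels representation of each plan):

```python
from collections import Counter

def number_diversity(generated_plans, required_counts):
    """Return (A, T): T distinct room-count categories among the generated plans,
    A of which satisfy the requirement (A is at most 1 when the counts are fully prescribed)."""
    categories = {frozenset(Counter(plan).items()) for plan in generated_plans}
    required = frozenset(Counter(required_counts).items())
    correct = sum(1 for category in categories if category == required)
    return correct, len(categories)

plans = [
    ["bedroom", "living room", "kitchen", "bathroom", "balcony"],             # compliant
    ["bedroom", "bedroom", "living room", "kitchen", "bathroom", "balcony"],  # extra bedroom
    ["bedroom", "living room", "kitchen", "bathroom", "balcony"],             # same category as the first
]
required = {"bedroom": 1, "living room": 1, "kitchen": 1, "bathroom": 1, "balcony": 1}
a, t = number_diversity(plans, required)
print(f"{a}({t})")  # prints "1(2)": two categories generated, one of them compliant
```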
For the diversity in Graph, the same A(T) notation is used for the three models that take graphs as input. HouseGAN can only generate one correct category because it does not generate doors between rooms and therefore has only one type of room connectivity, while all the other models generate doors and have two types of room connectivity: door connection and wall connection. FloorplanDiffusion has the best overall diversity in Graph, being the best in the one-bedroom and three-bedroom scenarios and the third best in the two-bedroom scenario, largely because of its flexible input, which implies but does not constrain the connectivity between rooms. It can also be noted that for the three models using graphs as input, a significant proportion of the generated categories was incorrect, even for the latest HouseDiffusion, suggesting that many of the generated results do not comply with the input graph and that the controllability of these models still needs improvement.
For the diversity in Location, HouseDiffusion achieves the highest values in all three scenarios, exhibiting the best diversity in Location, while the diversity in Location of the other two graph-input models drops significantly from the simple one-bedroom scenario to the more complicated three-bedroom scenario. This further suggests the strong capability of the diffusion model in generating rich results, especially in complicated scenarios. FloorplanDiffusion, which has the best overall diversity in Graph, shows only mediocre diversity in Location, likely because its input includes strong indications of the room locations.
For the diversity in Geometry, the performance of the models fluctuates significantly across scenarios, showing the sensitivity of geometric diversity to specific generation tasks. Overall, HouseGAN has the best diversity in Geometry, being the best in the one-bedroom and three-bedroom scenarios and the third best in the two-bedroom scenario. It is followed by Tell2Design, which ranks second, first, and third in the three scenarios, respectively. FloorplanDiffusion has the least geometric diversity, ranking last in all three scenarios, likely because its input also includes strong indications of room geometries. Conversely, however, this also suggests good controllability over room geometries as well as room locations.
Overall, the diversity of the generated floor plans is affected by both the model input and the model structure. FloorplanDiffusion, HouseDiffusion, and HouseGAN have the best diversity in Graph, Location, and Geometry, respectively, while no model exhibits superior diversity across all aspects. This indicates the necessity of developing a more holistic, multi-modal model that can have good diversity in different aspects.
5.3. Existing Evaluation Metrics
FID is supposed to reflect the accuracy of the generated floor plans. However, the results show that it does not correspond well with the actual generation accuracy. For example, Tell2Design performs rather well in FID, ranking second, first, and third in the three scenarios, respectively, yet its actual accuracy in terms of room numbers is relatively poor among the models. In addition, HouseGAN has the best FID value for the three-bedroom scenario, but its accuracy in terms of room numbers is only 7%, the lowest among the six models.
GED evaluates the accuracy of the graphs of the generated floor plans. HouseDiffusion has the best values across all three scenarios, in line with the accuracies calculated using our methods, indicating that GED has some ability to evaluate generated floor plans. However, another accurate model, FloorplanDiffusion, has rather poor GED values, potentially because it does not take graphs as input. This indicates that GED alone cannot provide comprehensive evaluations of generated floor plans.
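For reference, the standard (unrevised) exact GED between an input bubble diagram and the graph recovered from a generated plan can be computed with networkx, as in the following sketch; the node and edge attribute names here are our own illustrative choices, and this is not the refined GED used within RFP-A:

```python
import networkx as nx

def bubble_graph(room_types, connections):
    """Build a labeled graph: nodes carry room types, edges carry a connection type."""
    g = nx.Graph()
    for i, room_type in enumerate(room_types):
        g.add_node(i, room_type=room_type)
    for u, v, conn in connections:  # conn is e.g. "door" or "wall"
        g.add_edge(u, v, conn=conn)
    return g

g_input = bubble_graph(["living room", "bedroom"], [(0, 1, "door")])
g_generated = bubble_graph(["living room", "bedroom"], [(0, 1, "wall")])

ged = nx.graph_edit_distance(
    g_input, g_generated,
    node_match=lambda a, b: a["room_type"] == b["room_type"],
    edge_match=lambda a, b: a["conn"] == b["conn"],
)
print(ged)  # 1.0: one edge substitution (door connection -> wall connection)
```

Exact GED of this kind becomes computationally expensive as the graphs grow, which is part of the motivation for the revised GED within RFP-A.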
IoU evaluates the accuracy of the geometric features of the generated floor plans. The values are rather low across all models and scenarios, likely due to the metric’s high sensitivity to the overall position of the floor plans (details in Section 4.1). Tell2Design performs rather well in IoU, ranking first or second across the three scenarios, but its actual accuracy in terms of room numbers is relatively poor. FloorplanDiffusion also has good IoU values, being the best for the one-bedroom and two-bedroom scenarios; however, this is potentially because its input includes clues about the location of each room and thus also the overall position within the canvas.
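The position sensitivity discussed above can be illustrated with a plain pixel-wise IoU on rasterized room masks (a simplified sketch; the exact rasterization, and the revised IoU used in RFP-A, differ):

```python
import numpy as np

def pixel_iou(mask_a, mask_b):
    """Pixel-wise IoU between two binary masks of the same shape."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union > 0 else 0.0

plan = np.zeros((64, 64), dtype=bool)
plan[10:30, 10:30] = True        # a 20 x 20 room

shifted = np.zeros((64, 64), dtype=bool)
shifted[34:54, 34:54] = True     # the same room, translated on the canvas

print(pixel_iou(plan, plan))     # 1.0
print(pixel_iou(plan, shifted))  # 0.0: identical geometry, but no overlap after translation
```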
PSNR evaluates the accuracy of fine-grained features in the generated floor plans, but similar to FID, it does not correspond well with the actual generation accuracy. Graph2Plan, which exhibits poor accuracy in terms of room numbers, especially in the one-bedroom scenario and the more complicated three-bedroom scenario, achieves the best PSNR in those two scenarios. In the two-bedroom scenario, although PSNR and accuracy are both highest for FloorplanDiffusion, the trends for the other models still do not align well. This suggests that PSNR, as a metric for evaluating image quality, is not suitable for assessing residential floor plans.
SSIM reflects the accuracy of coarse-grained features in the generated floor plans. Its values are rather similar and close to one for all models and all scenarios, indicating that it is not capable of providing meaningful evaluations of generated floor plans.
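For completeness, the following sketch shows how PSNR and SSIM might be computed on grayscale renderings of a reference and a generated plan using scikit-image; the rendering scheme here is a hypothetical stand-in for the one actually used, and since most pixels of such renderings are identical background, these image-level metrics capture little of the design differences that matter.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def render_plan(rooms, size=256):
    """Rasterize rooms as filled gray rectangles on a white canvas (hypothetical scheme).

    rooms: list of (row0, row1, col0, col1, gray_level) tuples.
    """
    canvas = np.full((size, size), 255, dtype=np.uint8)
    for r0, r1, c0, c1, gray in rooms:
        canvas[r0:r1, c0:c1] = gray
    return canvas

reference = render_plan([(40, 120, 40, 160, 120), (130, 220, 40, 120, 180)])
generated = render_plan([(40, 120, 40, 160, 120), (130, 220, 60, 140, 180)])  # one room shifted

psnr = peak_signal_noise_ratio(reference, generated, data_range=255)
ssim = structural_similarity(reference, generated, data_range=255)
print(f"PSNR = {psnr:.2f} dB, SSIM = {ssim:.4f}")
```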
6. Conclusions and Future Prospects
Conclusions: To address the lack of tailored evaluation tools for AI-generated residential floor plans, we propose Residential Floor Plan Assessment (RFP-A), a systematic and comprehensive framework specifically designed for this purpose.
We then conducted extensive comparisons with existing metrics involving case studies, correlation analysis, and computational efficiency assessments, showing that RFP-A has the following advantages:
- (1)
RFP-A includes four key evaluation steps: (1) room number compliance, (2) connectivity based on a refined GED, (3) room locations via a rotated coordinate system, and (4) geometric features using a revised IoU. This stepwise design enables a comprehensive and structured evaluation.
- (2)
It can assess both the accuracy and diversity of generated floor plans, while most existing metrics can only evaluate the accuracy.
- (3)
Its stepwise and trapezoidal structure also ensures interpretability, which is lacking in existing metrics. This makes it easier for users to filter the designs they want from a vast number of generated options.
- (4)
It is rather sensitive to key changes in residential floor plans, such as changes in room numbers and room connectivity. Meanwhile, it is insensitive to less important information, including changes in the overall location within the canvas and different levels of pixel noise, providing more detailed and robust evaluations than existing metrics.
- (5)
It has improved computational efficiency mainly thanks to the revision of GED.
Finally, we evaluated the performance of six existing generative models for residential floor plans using RFP-A and found the following:
- (1)
Among the six evaluated models, only HouseDiffusion and FloorplanDiffusion achieved over 90% accuracy in room number compliance, highlighting the advantage of diffusion-based structures.
- (2)
Existing evaluation metrics do not correspond well with the actual generation accuracy, further underscoring the necessity and importance of establishing RFP-A.
- (3)
The diversity of the generated plans is affected by both the model input and the model structure. No model exhibits superior diversity across all aspects, indicating the necessity of developing a more holistic, multi-modal model.
This study contributes a robust evaluation framework that supports more reliable and comprehensive assessment of AI-generated residential plans, facilitating future improvements in generative model performance. It also assists architects in managing and filtering large volumes of generated designs through structured, interpretable classification.
Limitations: This study serves as a pilot investigation into the evaluation of generated floor plans, and there are still areas of improvement to be addressed in future studies, such as the following:
- (1)
The current RFP-A only covers the four most essential aspects of design requirements for residential floor plans; we aim to expand the scope of requirements in future studies to include, for example, the level of privacy between rooms and the energy efficiency of the residence.
- (2)
The current RFP-A relies only on rule-based methods, which are effective only for well-definable design considerations. Future work will explore the integration of expert knowledge and large language models to evaluate subjective design qualities such as esthetics and spatial experience. This presents new challenges in quantifying implicit design intentions and balancing them with functional and physical requirements.
- (3)
In the current evaluation of existing models for residential plan generation, we only assessed the accuracy in terms of room numbers. In future studies, we will test more generation scenarios involving additional design requirements so that we can reveal the accuracy of the generated plans in terms of graph, location, and geometry.
- (4)
Our current method primarily focuses on evaluating the diversity and accuracy of AI-generated models and, as such, does not yet account for complex physical factors present in real-world residential designs, such as natural ventilation or lighting. Future research will integrate these physical performance aspects to provide a more complete assessment of the generated floor plans.