Article

FP-MAE: A Self-Supervised Model for Floorplan Generation with Incomplete Inputs

1 Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China
2 School of Architecture, South China University of Technology, Guangzhou 510640, China
3 Department of Architecture, College of Design and Engineering, National University of Singapore, Singapore 117566, Singapore
4 Institute for Environmental Design and Engineering, The Bartlett, University College London, London WC1H 0NN, UK
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Buildings 2026, 16(3), 558; https://doi.org/10.3390/buildings16030558
Submission received: 1 December 2025 / Revised: 7 January 2026 / Accepted: 27 January 2026 / Published: 29 January 2026
(This article belongs to the Special Issue Artificial Intelligence in Architecture and Interior Design)

Abstract

Floor plans are a central representational component of architectural design, operating in close relation to sections, elevations, and three-dimensional reasoning to support the production and understanding of architectural space. In this context, we address the bounded computational task of completing incomplete floor plan representations as a form of early-stage design assistance, rather than treating the floor plan as an isolated architectural object. Within this workflow, being able to automatically complete a floor plan from an unfinished draft is highly valuable because it allows architects to generate preliminary schemes more quickly, streamline early discussions, and reduce the repetitive workload involved in revisions. To meet this need, we present FP-MAE, a self-supervised learning framework designed for floor plan completion. This study proposes three core contributions: (1) We developed FloorplanNet, a dedicated dataset that includes 8000 floorplans consisting of both schematic line drawings and color-coded plans, providing diverse yet consistent examples of residential layouts. (2) On top of this dataset, FP-MAE applies the Masked Autoencoder (MAE) strategy. By deliberately masking sections of a plan and using a lightweight Vision Transformer (ViT) to reconstruct the missing regions, the model learns to capture the global structural patterns of floor plans from limited local information. (3) We evaluated FP-MAE across multiple masking scenarios and compared its performance with state-of-the-art baselines. Beyond controlled experiments, we also tested the model on real sketches produced during the early stages of design projects, which demonstrated its robustness under practical conditions. The results show that FP-MAE can produce complete plans that are both accurate and functionally coherent, even when starting from highly incomplete inputs. FP-MAE is a practical and scalable solution for automated floor plan generation. It can be integrated into design software as a supportive tool to speed up concept development and option exploration, and it also points toward broader opportunities for applying AI in architectural automation. While the current framework operates on two-dimensional plan representations, future extensions may integrate multi-view information such as sections or three-dimensional models to better reflect the relational nature of architectural design representations.

1. Introduction

1.1. Research Background

Architectural design is inherently an iterative and problem-solving process. Rather than producing complete solutions at once, architects typically begin with partial sketches or schematic floor plans and gradually refine them through repeated cycles of drawing, evaluation, and revision [1,2,3,4,5,6]. Design cognition research has long shown that sketching and diagramming are essential for externalizing thinking, supporting reflection, and fostering creativity in architectural practice [7,8]. In this sense, floor plans function not only as representational outputs but also as active cognitive tools throughout the design process. This iterative workflow is particularly evident in residential design, where the drafting and continuous modification of floor plans require significant expertise, time, and labor [3,9,10]. Early-stage designs often consist of incomplete or fragmented plans, yet transforming these drafts into coherent and viable layouts remains a labor-intensive task. Consequently, there is a growing demand for intelligent tools that can assist architects during the conceptual phase by accelerating iteration rather than replacing human judgment [11,12,13,14,15].
Within this context, automatic floor plan completion from partial sketches is not treated as an end in itself, but as a means to support architectural workflows. By inferring plausible spatial configurations from incomplete inputs, such systems can provide rapid visual feedback, generate alternative layouts, and reduce repetitive manual revisions. This early intervention of AI-based assistance has the potential to improve design efficiency, lower time and cost, and enrich design exploration, while keeping the architect’s critical decision-making at the core of the process.
Modern computer vision techniques have advanced rapidly, but they typically require large amounts of data and extensive labeled images for supervised training [16]. In natural image domains, visual data often contain substantial redundancy, allowing missing regions to be inferred from surrounding context without requiring a deep understanding of high-level semantics [17,18]. Architectural floor plans, by contrast, are not generic visual images but domain-specific representations governed by design conventions, spatial typologies, and functional constraints. They are more structured, abstract, and semantically diverse than natural images [19,20,21,22,23,24]. As a result, models operating on floor plans must capture precise geometric relationships while also learning underlying design logic and functional zoning. This combination of geometric accuracy and semantic reasoning poses additional challenges for the generation and completion of architectural plans [25,26,27,28,29].

1.2. Contribution

To address these challenges, we first created a new architectural floor plan reconstruction dataset named FloorplanNet, which was designed to capture the complexity and diversity of residential architectural plans. This dataset contains a variety of residential building layouts and floor plan images (both line-drawing style and colored plans), providing rich resources for training and evaluating floor plan generation models.
Next, we propose a floor plan reconstruction method based on masked autoencoders, as illustrated in Figure 1. In this method, parts of the input floor plan are masked, and a lightweight Vision Transformer (ViT) model is trained to reconstruct the masked sections. In this way, the model learns to infer the complete floor plan structure from limited partial information. This approach employs a self-supervised learning paradigm, allowing the model’s understanding of floor plan images to go beyond simple local features and instead capture global structural patterns.
Moreover, we conducted extensive experiments to validate the effectiveness of our method. We evaluated the performance of FP-MAE in terms of reconstruction accuracy under various masking scenarios and compared it with current state-of-the-art models. Experimental results show that FP-MAE achieves outstanding performance on this reconstruction task. Compared to existing models for automatic floor plan generation, our method directly addresses the critical task of floor plan completion from partial drawings, making a unique contribution to the field of computer-aided architectural design.
It is important to note that architectural sketches and partial drawings are, by nature, incomplete and open-ended. As a result, the completion of a partial floor plan should not be interpreted as a deterministic recovery of an author’s original design intent, but rather as the generation of a plausible hypothesis under missing information. In practice, multiple valid completions may exist for the same partial input, and divergences between generated results and reference designs are therefore to be expected, especially in ambiguous cases.
Within this scope, FP-MAE is positioned not as a complete generative architectural design system, but as a representational and reconstructive baseline that learns strong plan-level priors from incomplete floor plan inputs through self-supervised learning. By enabling robust hypothesis generation under missing or degraded information, the proposed framework provides a foundational capability that can support future generative, constraint-aware, or context-integrated floor plan design workflows. Accordingly, the scope of this work is limited to building-scale floor plan representations, with a particular focus on residential layouts and early-stage plan completion or option exploration.
Rather than offering finalized designs, FP-MAE is positioned as an assistive tool that accelerates early-phase iteration, where architects often explore multiple spatial configurations before refining structure, context, or program. The framework complements human design judgment by streamlining the generation of formal spatial proposals that can be further adapted to project-specific needs.

2. Related Works

2.1. Application of Self-Supervised Learning

In the fields of natural language processing (NLP) and computer vision (CV), self-supervised learning has become a highly influential pre-training method [30,31,32,33,34,35,36]. The core mechanism of this approach is to mask certain parts of the input data and train the model to predict the missing parts, which helps the model learn the inherent structure of the data [37,38,39]. In NLP, BERT [40] uses Masked Language Modeling (MLM), which involves randomly masking fragments of input text and training the model to predict the masked parts, while GPT [41] is trained to autoregressively predict the next token [42,43]. These approaches have proven successful in capturing deep semantic features of language. In the field of computer vision, self-supervised learning has similarly shown great potential [44,45,46]. For instance, BEiT [47,48] introduced a form of masked image modeling by predicting masked discrete visual tokens within an image, analogous to the MLM method in NLP. Although BEiT demonstrated the promise of masked prediction in vision, its approach, which relied on a discrete VAE tokenizer, could be inefficient for high-resolution images and typically required a large amount of labeled data for fine-tuning to achieve strong performance [49,50]. These methods still face challenges when training on very large-scale datasets, and designing suitable pretext tasks often requires domain-specific knowledge.
In recent years, several masked modeling frameworks for images have further advanced self-supervised Vision Transformer (ViT) pre-training [51,52]. Masked Autoencoders (MAE), proposed by He et al. [18], take a simple yet powerful approach: mask a high proportion of input image patches and train an asymmetric encoder–decoder to reconstruct the missing pixels. This showed that large ViT models can be effectively pre-trained without any labels. Xie et al. introduced SimMIM [53], a simplified masked image modeling framework that forgoes complex tokenization or teacher networks, directly using pixel regression on masked patches as the pretext task. Another notable variant is iBOT [54], which uses a self-distillation approach where a teacher network provides target features for masked patches, allowing the model to learn high-level semantics beyond just pixel reconstruction. These self-supervised ViT variants have collectively established masked image modeling as an effective strategy for learning rich visual features without annotations. In parallel, recent architectural studies have shown that learning latent spatial representations from floor plan images can significantly improve generative and reconstruction tasks [55,56,57]. Our FP-MAE builds upon this line of work by adapting the masked autoencoding concept to the domain of architectural floor plans, which have unique structural characteristics compared to natural images.

2.2. Development of Autoencoders

Autoencoders (AEs) are a class of neural networks designed for learning compressed representations of data [58]. They consist of an encoder that maps input data to a lower-dimensional latent space and a decoder that reconstructs the original input from this latent code [59]. Denoising Autoencoders (DAEs) [60] are an extension of basic AEs that introduce noise or corruption into the input data and train the network to reconstruct the original, undamaged content. This approach encourages the learned latent representation to capture salient features and enhances the model’s robustness to input perturbations. The concept of autoencoders has further evolved into several variants, such as Variational Autoencoders (VAEs) [61], which impose a probabilistic structure on the latent space. VAEs learn not just a single deterministic code but a distribution over the latent space for each input, enabling generative sampling by decoding random latent vectors from the learned distribution. VAEs provide a powerful approach for generative modeling, though they may encounter issues like blurred outputs or mode collapse during training. Pathak et al. (2016) proposed a pioneering approach in the image inpainting domain by introducing a context encoder to generate missing image regions using surrounding context [62]. This demonstrated that meaningful visual features could be learned by predicting missing parts, marking a representative use of AE-like structures for image generation. Another notable development is Neural Discrete Representation Learning (2017), which introduced VQ-VAE [63], a variant of VAE that replaces continuous latent variables with discrete codes via a learnable codebook. This quantization mitigates posterior collapse, ensuring latent variables are not ignored by an overpowered decoder. Chen & Guo (2023) reviewed the applications and challenges of AEs in deep learning [64], pointing out that existing methods often face high computational costs when handling high-dimensional data.
With the rise of Vision Transformers and self-supervised learning, autoencoder-based frameworks have recently been revitalized in the form of masked autoencoders, which reconstruct missing image regions from sparsely observed inputs. In particular, masked image modeling methods explicitly remove a large portion of the input and train the network to recover the missing content, thereby encouraging the model to learn global structural and semantic information rather than relying on local texture cues. Recent studies further demonstrate that improvements in masking strategies and positional encoding can enhance reconstruction stability and spatial representation ability in masked autoencoder architectures [65].
In our work, the FP-MAE framework can be viewed as a form of masked autoencoder: we explicitly mask large portions of the input floor plan image and train the network to reconstruct the missing regions. This approach forces the model to capture the global floor plan structure and semantics so that the decoder can regenerate the missing parts. By doing so without needing ground-truth labels for what the missing parts should contain, FP-MAE combines the strengths of autoencoder architectures with the masked prediction strategy from recent self-supervised ViT models.

2.3. Architectural Floor Plan Generation

In the field of architecture, the automatic generation of floor plans has emerged as a promising research direction aimed at improving design efficiency and quality through computational methods [66,67,68,69,70,71,72]. Early approaches to floor plan generation included rule-based algorithms or optimization techniques. For example, before the deep learning era, Merrell et al. (2010) [73] employed optimization to arrange rooms within a layout under given constraints, demonstrating the feasibility of automated layout synthesis. With the advent of deep learning, data-driven approaches have increasingly gained traction. Wu et al. (2019) [74] pioneered a large-scale data-driven technique for automatically generating residential floor plans given certain boundary constraints, such as a fixed external shape. They introduced the RPLAN dataset consisting of thousands of annotated floor plans and a model to generate interior layouts that fit a given building constraint. This work laid the data foundation for subsequent learning-based studies in floor plan generation.
More recently, graph-based deep learning methods have been proposed to address increasingly complex spatial relationships in floor plans. For instance, Lu et al. (2025) introduced a deep edge-aware graph neural network framework for large-scale floor plan generation, emphasizing non-local spatial dependencies and edge-specific relationships to better capture global layout structure, particularly for complex public buildings [67]. This line of work highlights the importance of explicitly modeling long-range spatial relations and architectural constraints when dealing with large and intricate layouts.
Several generative model frameworks have been explored for layout generation. Graph2Plan, proposed by Hu et al. (2020) [75], combines graph neural networks (GNNs) and CNN-based image generation to synthesize floor plan layouts from high-level graph specifications of room connectivity. In Graph2Plan, the input is a graph where nodes represent rooms and edges represent adjacency; the model learns to produce a floor plan that respects these topological constraints. Generative Adversarial Networks (GANs) have also been applied: House-GAN [76] was one of the first GAN-based models to generate room layouts from scratch given a set of rooms and their relationships. It treats the floor plan as a graph-constrained generation problem and produces layouts that satisfy the input adjacency relationships. Nauata et al. (2021) later introduced House-GAN++ [77], a refined GAN-based layout generation method that improves the diversity and precision of generated layouts and even incorporates door placement into the generation process. House-GAN++ uses a more complex training setup and requires a rich database of design iterations to achieve good results, illustrating the challenge of applying GANs in this domain. More recently, diffusion models have shown superior capability in generative tasks. HouseDiffusion, proposed by Shabani et al. (2024) [78], is a diffusion-based method for generating vectorized floor plans under multiple conditioning constraints. It uses a diffusion process over floor plan representations (such as coordinates of walls and doors) to gradually produce realistic layouts. This approach improved the precision and diversity of generated plans and can incorporate conditions like user-specified room counts or functional zone locations. However, diffusion-based methods can struggle with highly complex layouts or strict constraints, as the multi-step generation may deviate from expectations without careful conditioning [79,80,81]. Other recent works have explored two-stage generation, which first generates a high-level bubble diagram or room adjacency graph and then generates a consistent floor plan. For instance, Rahbar et al. (2022) [82] use a coarse-to-fine approach with a conditional GAN to refine a bubble diagram into a detailed plan. Transformer-based models have also been applied: Tang et al. (2023) [83] introduced a graph transformer to handle topological constraints, and Sun et al. (2022) [84] developed WallPlan, which uses a graph generation network for walls combined with a semantics network for room types. Upadhyay et al. (2023) [85] proposed an end-to-end layout generation method with minimal post-processing, further simplifying the pipeline. In parallel, recent studies have begun to explore progressive and interactive generation paradigms that better align with real-world architectural workflows. FloorPlan-DeepSeek (FPDS) proposes a vector-based, autoregressive “next room prediction” framework inspired by large language models, enabling incremental floor plan construction rather than one-shot generation [86]. By representing floor plans as sequences of room vectors and supporting conditional generation from partial layouts or textual prompts, FPDS emphasizes controllability, interpretability, and integration with human-in-the-loop design processes. In recent research, there have also been methods that combine natural language processing technology with building floor plan generation to improve user-friendliness, such as Tell2Design [87], HouseLLM [88] and ChatHouseDiffusion [89].
Compared to the above-mentioned works, which generate floor plans largely from scratch given input constraints such as graphs or textual descriptions, our method focuses on the reconstruction or completion of architectural floor plans from partial inputs. This is a relatively under-explored scenario. One related recent study by Gueze et al. (2023) [90] addresses floor plan reconstruction from sparse views, using a graph neural network combined with a constrained diffusion model to infer a complete plan from a sparsely observed subset of rooms or walls. Their approach incorporates explicit architectural knowledge by using GNNs on partial graph data and then refining with diffusion. In contrast, our FP-MAE works directly on images of floor plans and does not require an explicit graph or constraint input. Partial floor plan representations, such as partially masked floor plan images or incomplete drawings that are consistent with the data distribution and masking strategies used in this study, can serve as input to the model. By leveraging a self-supervised masked autoencoder framework, FP-MAE learns to inpaint missing regions in floor plan images in a way that preserves global consistency and functional zoning. FP-MAE is one of the first self-supervised vision transformer approaches applied to architectural floor plan completion, providing a novel perspective on floor plan generation. Rather than synthesizing designs from abstract specifications, we enable interactive design refinement, where an initial sketch drawn by an architect can be automatically completed into a reasonable full layout. This complementary direction has broad applications in computer-aided design.

3. Methodology

In this section, we provide a detailed description of the FP-MAE architectural design, dataset construction, and training procedure. The FP-MAE model consists of two primary components: a lightweight Vision Transformer encoder and an asymmetric decoder, which together form a masked autoencoder tailored for floor plan reconstruction. We also discuss the key design decisions, including patch size, masking strategy, loss function, dataset and sampling settings.
The model was implemented in PyTorch 2.0 (Meta AI, Menlo Park, CA, USA) using Python 3.10 (Python Software Foundation, Wilmington, DE, USA), and trained with CUDA 11.8 (NVIDIA Corporation, Santa Clara, CA, USA) on NVIDIA RTX 4090 GPUs (NVIDIA Corporation, Santa Clara, CA, USA).

3.1. FP-MAE Encoder

Our encoder is based on a ViT architecture and operates only on the visible, unmasked image patches from the input floor plan. First, an input floor plan image of a fixed size is divided into regular, non-overlapping patches. Each visible patch is embedded by a linear projection to a latent vector, and positional embeddings are added to encode each patch’s location within the floor plan. We then feed the set of embedded visible patch tokens through a series of Transformer encoder blocks. These Transformer blocks use self-attention to allow global information exchange among the visible tokens, enabling the model to capture relationships between distant parts of the floor plan. Notably, the encoder does not see or process the masked regions at all, which means no placeholder tokens for masked patches are input to the encoder. Through the encoding process, the model learns a latent representation encapsulating the layout structure present in the partial input.
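To make this concrete, the following is a minimal PyTorch sketch of the visible-patch encoding step. It assumes a 256 × 256 input divided into 16 × 16 patches; the dimensions and depths shown are illustrative placeholders rather than our exact training configuration.

```python
import torch
import torch.nn as nn

class FPMAEEncoder(nn.Module):
    """Sketch: a ViT-style encoder that processes only visible patches."""

    def __init__(self, img_size=256, patch_size=16, in_chans=3,
                 embed_dim=512, depth=8, num_heads=8):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Linear projection of each patch to a latent vector (via strided conv).
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        # Learned positional embedding, one per patch location.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, imgs, visible_idx):
        # imgs: (B, C, H, W); visible_idx: (B, N_vis) indices of unmasked patches.
        tokens = self.patch_embed(imgs).flatten(2).transpose(1, 2)  # (B, N, D)
        tokens = tokens + self.pos_embed                            # add positions
        # Masked patches never enter the encoder: keep visible tokens only.
        batch_idx = torch.arange(tokens.size(0)).unsqueeze(-1)
        visible = tokens[batch_idx, visible_idx]                    # (B, N_vis, D)
        return self.blocks(visible)
```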

3.2. FP-MAE Decoder

The decoder’s role is to take the encoded representation of the visible patches and predict the content of the masked patches, thereby reconstructing the complete floor plan. The input to the decoder is the union of the encoded visible patch tokens from the encoder and a set of mask tokens that serve as placeholders for each missing patch. Each mask token is a learned vector indicating a position where the model needs to fill in missing content. Before feeding into the decoder Transformer blocks, we add positional embeddings to all tokens in this full token set, both encoded visible tokens and mask tokens. This step is important: without positional information, the decoder would not know where in the floor plan image each mask token belongs. After adding positional embeddings, we insert the mask tokens into the sequence of encoder outputs, maintaining the order corresponding to the original patch positions, and feed the combined sequence into the decoder. The decoder then processes this full sequence through its own Transformer blocks. Essentially, the decoder learns to inpaint the masked areas based on the encoded context provided by the visible patches. Because the decoder has access to both the mask tokens and the encodings of the visible context, it can attend to known and unknown regions alike and attempt to hallucinate plausible content for the unknown parts. We keep the decoder relatively shallow to ensure the overall model remains efficient; most of the model’s capacity is concentrated in the encoder for feature learning, while the decoder is mainly tasked with reconstruction. This asymmetric encoder–decoder design has proven effective for floor plan reconstruction, particularly because large portions of the input may be missing, and a heavier decoder would risk overfitting to specific masking patterns seen during training.
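The decoding step can be sketched as follows, again with illustrative dimensions. Mask tokens are placed at the masked patch positions, positional embeddings are added to the full sequence, and a shallow Transformer followed by a linear head predicts the pixel values of every patch.

```python
import torch
import torch.nn as nn

class FPMAEDecoder(nn.Module):
    """Sketch: a shallow decoder that fills mask tokens with predicted pixels."""

    def __init__(self, num_patches=256, embed_dim=512, dec_dim=256,
                 depth=2, num_heads=8, patch_pixels=16 * 16 * 3):
        super().__init__()
        self.proj = nn.Linear(embed_dim, dec_dim)
        # One shared, learned placeholder vector for every missing patch.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dec_dim))
        layer = nn.TransformerEncoderLayer(dec_dim, num_heads,
                                           dim_feedforward=4 * dec_dim,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dec_dim, patch_pixels)  # per-patch pixel prediction

    def forward(self, enc_tokens, visible_idx, num_patches):
        # enc_tokens: (B, N_vis, embed_dim) from the encoder.
        B = enc_tokens.size(0)
        x = self.proj(enc_tokens)
        # Start from mask tokens everywhere, then write the encoded visible
        # tokens back to their original patch positions.
        full = self.mask_token.expand(B, num_patches, -1).clone()
        batch_idx = torch.arange(B).unsqueeze(-1)
        full[batch_idx, visible_idx] = x
        full = full + self.pos_embed          # positions for visible AND mask tokens
        return self.head(self.blocks(full))  # (B, N, patch_pixels)
```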

3.3. Loss Function

The core goal of FP-MAE is to restore complete floor plans from partial ones, simplifying the frequent task of modifying floor plans during the architectural design process and quickly providing architects with intuitive inspiration. The FP-MAE framework for architectural floor plan reconstruction adopts a straightforward autoencoding method, reconstructing the full image signal from partially observed input. We define the loss function as the Mean Squared Error (MSE) between the reconstructed and original images over the masked regions:
$$\mathcal{L}_{\mathrm{MAE}} = \frac{1}{N_{\mathrm{mask}}} \sum_{i \in \mathrm{mask}} \left( x_i - \hat{x}_i \right)^2$$
where $x_i$ represents the i-th masked input value, $\hat{x}_i$ denotes the model’s prediction for the i-th value, and $N_{\mathrm{mask}}$ indicates the number of masked input points. This loss function measures the average reconstruction error on the missing portions of the image. By minimizing this loss function, we can effectively update the model parameters, allowing the model to gradually learn how to reconstruct the masked portions of the images, thereby generating more accurate complete architectural floor plans. We found that a simple MSE loss was sufficient and effective, likely because floor plans are structurally constrained and relatively low-texture, so using MSE keeps the training objective straightforward and stable.
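A direct implementation of this objective takes only a few lines; the sketch below assumes per-patch pixel targets and a binary mask in which 1 marks a masked patch.

```python
import torch

def fp_mae_loss(pred, target, mask):
    """MSE averaged over masked patches only; visible patches contribute nothing.

    pred, target: (B, N, patch_pixels); mask: (B, N) with 1 = masked.
    """
    per_patch = ((pred - target) ** 2).mean(dim=-1)  # MSE within each patch
    return (per_patch * mask).sum() / mask.sum()     # average over masked patches
```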

3.4. Dataset

The FloorplanNet dataset used in this study consists of residential architectural floor plans: 300 collected Real-plan floor plans and 7700 standardized R-plan floor plans, including both colored plans and line drawings. In the line drawing subset, where standardized layouts are generated by the RPLAN toolbox, different colors represent various architectural components, as illustrated in Figure 2.
The colored floor plan subset, as shown in Figure 3, depicts different residential layouts, with various colors and shapes denoting different room types or functional areas (bedrooms, bathrooms, kitchens, living rooms, and balconies), providing semantic cues.
Both subsets are accompanied by detailed labels and metadata, which not only enrich the dataset but also ensure the robustness and broad applicability of the experimental results. Although our FP-MAE training uses only the raw images, the diversity and scale of FloorplanNet make it well-suited for training our self-supervised model and evaluating its generalization. Taken together, these components make FloorplanNet a comprehensive dataset for the evaluation of architectural floor plan reconstruction methods.
It is worth noting that FloorplanNet is intentionally constructed with a focus on residential layouts, providing a controlled and well-defined domain for studying floor plan completion. This design choice allows the model to learn stable spatial priors under consistent functional and regulatory conditions. At the same time, it establishes a clear baseline from which future research can expand toward more complex building typologies, such as office buildings, public facilities, and mixed-use developments, each of which introduces distinct spatial organizations and design constraints. In this sense, FloorplanNet serves as a targeted and extensible dataset rather than a fixed representation of architectural form.

3.5. Sampling Strategies

We divide architectural floor plans into regular, non-overlapping image patches. We then sample a portion of these patches and mask the remaining ones. As shown in Figure 4, our sampling strategy starts with random sampling: small patches are randomly selected without replacement, following a uniform distribution (a minimal code sketch of this step follows the list below). The advantages of using random sampling are as follows:
(1)
Random sampling with a high masking ratio reduces redundancy, creating a reconstruction task that cannot be solved by simple extrapolation from visible neighboring patches.
(2)
Sampling from a uniform distribution avoids a potential center bias; nevertheless, because information is unevenly distributed across floor plans, we also employ diverse sampling strategies (Section 4.2).
(3)
Large-scale masking lets the encoder operate efficiently on a small set of visible patches, from which the model must predict the entire architectural floor plan.
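As referenced above, the random sampling step can be sketched as follows; the 75% ratio matches the setting discussed in Section 4.2, and the helper name is illustrative.

```python
import torch

def random_masking(num_patches, mask_ratio=0.75, device="cpu"):
    """Return sorted indices of visible patches and a binary mask (1 = masked)."""
    num_visible = int(num_patches * (1 - mask_ratio))
    perm = torch.randperm(num_patches, device=device)  # uniform, no replacement
    visible_idx, _ = perm[:num_visible].sort()
    mask = torch.ones(num_patches, device=device)
    mask[visible_idx] = 0.0
    return visible_idx, mask
```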

4. Experiments

4.1. Self-Supervised Pretraining

We train FP-MAE in a self-supervised manner, using only incomplete floor plan images and their original complete versions, without external labels. To generate training data, we take images from our FloorplanNet dataset and randomly mask out portions of each image (details on masking strategies follow in Section 4.2). The model is then trained to predict the missing parts. We divided the FloorplanNet dataset into training, validation, and test sets, containing 7000, 500, and 500 floor plan images, respectively. During training, only the training set images are used for learning. We monitor performance on the validation set, to which masks are likewise applied, to tune hyperparameters and decide when to stop training.
We also experimented with data augmentation to improve generalization. We found that heavy augmentations were not necessary for FP-MAE. The model performed very well even with only basic geometric augmentations like random cropping and horizontal flipping. In fact, we observed that augmentations which alter the color or brightness of the floor plans tended to hurt performance. This is likely because in our dataset, different room types or architectural elements are color-coded, so altering colors confuses the model. Therefore, we avoided color-based augmentations and mainly used random crops, random resized crops and flips during training to inject some variation. The training procedure thus largely teaches the model to reconstruct missing parts under a distribution of partial observations that include random spatial variations but preserves the inherent structural and color patterns of floor plans.
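For illustration, this augmentation policy can be written as a short torchvision pipeline; the 256 × 256 crop size is an assumption of the sketch, and the key point is the deliberate absence of color transformations.

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(256, scale=(0.8, 1.0)),  # mild spatial variation
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    # No ColorJitter or brightness changes: colors encode room types in FloorplanNet.
])
```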

4.2. Masking and Sampling Strategies

A key aspect of our methodology is the strategy used to mask parts of the floor plan during training. As shown in Figure 5, we employed multiple masking strategies to diversify the training scenarios and simulate practical use cases. We began with a random sampling masking strategy, randomly selecting a portion of the floor plan patches for masking. By doing so, the model is forced to infer the complete floor plan structure from limited information. We found that, unlike natural images where relatively lower masking ratios are typically adopted, architectural floor plans benefit from a higher masking ratio (e.g., 75%). In our experiments, such high masking not only proved feasible but also encouraged the model to learn the global structure and semantic information of the entire plan. This is because floor plans inherently exhibit more repetitive and constrained spatial patterns, such as wall continuity and room arrangements, which allow the model to infer missing regions effectively even with fewer visible clues.
If the model is trained exclusively with purely random masks, it may fail to generalize to the types of structured missing regions that frequently occur in real design practice, for instance when an architect deliberately leaves an entire portion of the drawing blank for later refinement. To approximate such scenarios, we introduced several alternative masking strategies, including center sampling, edge sampling, and one-sided sampling, among others. These strategies mask larger, contiguous regions of the floor plan, requiring FP-MAE to reconstruct missing content from broader contextual cues rather than evenly scattered fragments. It should be noted that these geometric masking patterns are introduced as a controlled abstraction to systematically study missing regions during training and evaluation, rather than as a faithful representation of real architectural sketch incompleteness.
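Two of these structured masks can be expressed on the patch grid as in the sketch below; the grid size, coverage fractions, and helper names are illustrative assumptions rather than the exact implementation.

```python
import torch

def center_mask(grid=16, frac=0.3):
    """Mask a contiguous square of patches in the plan center (1 = masked)."""
    mask = torch.zeros(grid, grid)
    side = int(grid * frac ** 0.5)   # square covering roughly `frac` of patches
    start = (grid - side) // 2
    mask[start:start + side, start:start + side] = 1.0
    return mask.flatten()

def one_sided_mask(grid=16, frac=0.3, side="right"):
    """Mask a contiguous band along one boundary of the plan (1 = masked)."""
    mask = torch.zeros(grid, grid)
    width = max(1, round(grid * frac))
    if side == "top":
        mask[:width, :] = 1.0
    elif side == "bottom":
        mask[grid - width:, :] = 1.0
    else:  # "right"
        mask[:, grid - width:] = 1.0
    return mask.flatten()
```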
Our experiments reveal that reconstruction quality is not only determined by the amount of visible information, but also closely tied to the spatial distribution of patches. With a fixed patch size, more uniform distributions tend to yield clearer and more coherent results, while irregular arrangements often produce degraded outputs. Furthermore, we observed an intriguing phenomenon: as the distance between masked and visible patches increases, the reconstructed regions become visually blurrier and less precise, yet the reconstruction task paradoxically becomes easier. In these cases, the model can rely on structural regularities in floor plans (such as wall continuity or typical room arrangements) to generate plausible completions without being constrained by fine-grained details.
These findings underscore the importance of tailoring masking strategies to the unique characteristics of architectural drawings, rather than relying solely on approaches developed for natural images. By employing a variety of masking and sampling strategies, we ensure that FP-MAE is robust to diverse partial input scenarios. Whether the missing data is randomly distributed or concentrated in a specific region, the model has encountered similar situations during training and learned to handle them effectively.

4.3. Qualitative Experimental Results

In this section, we present the reconstruction results of FP-MAE under different masking strategies and compare its performance under varying conditions. Below is an analysis of the experimental results for each strategy:
(1)
Random Masking (80% masked): Figure 6 and Figure 7 show examples of FP-MAE reconstructions where the input patches were randomly masked (Figure 6 for a line-drawing plan, Figure 7 for a colored plan). Despite the extremely large amount of missing information, the model is still able to restore a reasonable architectural floor plan structure and make local adjustments that align with the visible content. In the line drawing case, the input might have only sparse wall segments and a few visible corners. FP-MAE completes the structure by drawing walls that enclose spaces, effectively guessing the presence of rooms. One can see that the primary functional zoning remains clear in the output. Some small details are understandably imperfect, for example, the exact positions of doors and windows in the reconstructed portion may not match the original and room dimensions might be slightly off, since multiple configurations could satisfy the limited cues. In the colored plan case, FP-MAE not only reconstructs the walls but also fills the correct color in many areas, indicating it inferred the room types. However, those differences do not detract from the overall feasibility of the layout, and an architect could take the output for a valid design. It should be noted that FP-MAE operates at the level of two-dimensional plan configuration and does not model experiential or inhabitation-related aspects of space symbolized by the walls. These results highlight FP-MAE’s ability to perform extrapolation: given disjoint glimpses of a layout, it extrapolates the likely connecting structure.
(2)
Center Masking (30% masked in center): Figure 8 provides examples where the entire center of the floor plan was missing, and the input had only the outer frame, like an empty house outline with some peripheral rooms drawn. FP-MAE’s output in this case successfully reconstructs the main internal layout. We do notice in such cases that while the structure is usually accurate, with walls aligning cleanly with existing ones at the boundary of the masked area, the content of the central reconstructed area can have more variability. For instance, the model might place two small rooms where the original had one large room. This reflects the model’s training to minimize MSE loss, and in ambiguous situations it might produce an “average” solution or select one plausible option among many. Despite this, crucial architectural requirements like the continuity of walls and the maintenance of accessible pathways were satisfied. These cases show that FP-MAE has a good understanding of how rooms typically connect in a residential layout even if the middle portion is unseen.
(3)
Perimeter (Edge) Masking (70% edges masked): Figure 9 shows an example input where the border of the floor plan is missing, and only some interior rooms are visible, along with FP-MAE’s reconstruction of the perimeter. The model effectively restores the outline of the house, adding exterior walls that close off the layout. If the visible interior suggested a rectangular overall shape, the output reflected that shape at the edges. In the illustrated output, edge rooms that were entirely masked are reconstructed; for instance, an edge might show a balcony connecting to the interior. These reconstructed edge rooms display slight variations: sometimes the model may not perfectly recover their exact shape or may confuse a balcony with a room due to limited evidence. Nonetheless, FP-MAE consistently generates plausible completions in masked edge regions, rather than leaving gaps. In practical applications, this ability enables the model to suggest extensions for a partially drawn interior plan.
(4)
Spotted-perimeter Masking (75% masked): Figure 10 presents an example input where the outer boundary and some central patches of the floor plan are masked, leaving only discontinuous central fragments visible. Unlike edge masking, which removes only the outline, spotted-perimeter masking adds intermittent gaps in the center, forcing FP-MAE to reconstruct the overall outline and simultaneously fill in missing central details. In this scenario, the model attempts to close the layout by extrapolating exterior walls and reconstructing rooms attached to the edge. Because as much as 75% of the plan is missing, the reconstructed results often appear simplified or slightly blurred. Certain details, such as window placement or exact room proportions, may be inaccurately inferred, and in some cases the model may misalign walls or confuse a balcony with a small room. Nevertheless, FP-MAE consistently produces a coherent and plausible boundary rather than leaving large gaps. This shows that even under extremely sparse input conditions, the model retains sufficient prior knowledge to imagine reasonable edge structures. In practical applications, this means that spotted-perimeter masking could support early-stage design tasks in which only a small portion of the layout is provided, offering plausible completions of the entire plan.
(5)
Biased-keyhole Masking (90% masked): Figure 11 shows an extreme case in which only a very small “keyhole-shaped” region of the floor plan is left visible, while the remaining 90% of the layout is masked. This setting emulates a scenario where only a limited fragment of the design is available, for instance a small corner or partial core, while the remainder of the plan is left unspecified. FP-MAE must therefore extrapolate nearly the entire structure from this narrow clue. In the reconstructed outputs, the model attempts to extend the visible fragment into a complete house layout. For example, if the revealed patch suggests part of a corridor or a wall junction, FP-MAE generates surrounding rooms and boundaries consistent with that structure. However, given the extreme sparsity of information, the results tend to be simplified and blurred. Fine elements such as windows, doors, or exact proportions are often missing or roughly approximated. Some reconstructed walls also exhibit lower contrast and slight misalignment, reflecting the uncertainty of inferring large-scale structure from minimal evidence. Despite these challenges, FP-MAE still produces layouts that are structurally coherent rather than chaotic, filling masked regions with plausible room divisions and exterior walls. This demonstrates the model’s ability to rely on strong prior knowledge of floor plan regularities, even when trained under very sparse conditions.
(6)
One-Sided Masking (30% masked on one side): This strategy masks only a single boundary of the floor plan image. This setting emulates realistic design scenarios where architects extend or modify a specific edge of a layout, such as adding new rooms to one wing of a house or expanding a facade. In our implementation, we systematically masked either the top, bottom, or right side of the plan in separate experiments, thereby creating diverse but structured missing regions. A masking ratio of approximately 30% was applied to these boundary regions, indicated as gray blocks in Figure 12. This level of masking was carefully selected to preserve most of the original spatial organization while still introducing a meaningful reconstruction challenge. Because the essential layout remains visible, FP-MAE can effectively leverage the intact structural cues to infer the missing parts. The model not only restores the broader spatial configuration but also accurately reconstructs fine-grained elements such as walls, doors, and room boundaries. The results demonstrate that one-sided masking allows FP-MAE to excel at producing reconstructions that are both structurally coherent and visually convincing, highlighting its ability to generalize even when large continuous areas are absent.
(7)
Corner Masking (75% masked, only a corner visible): Figure 13 provides an example with only a small top-left corner of the floor plan given and everything else masked. The output in this case shows the model’s attempt to imagine the rest of the house. In the example, the visible corner might contain a bedroom. FP-MAE then conjectured a layout that extends from that bedroom—it might add a living room adjacent to it, bedrooms on the other end, etc. The reconstructed areas are a bit more generic: walls are correctly placed to form rooms, but one can see they are somewhat simplified or blurred in drawing quality. For instance, windows might be missing or roughly placed since the model is less certain where they should go. Some walls may also appear with lower contrast or slight misalignment as the model balances many possibilities. Still, the output is a valid floor plan. It is impressive that with a tiny fraction of input, the model does not produce chaotic lines; instead, it outputs a structured set of spaces. This demonstrates strong prior knowledge learned by FP-MAE about what constitutes a plausible floor plan.
(8)
Spotted-corner Masking (80% masked): Figure 14 illustrates an input where most of the floor plan is obscured, leaving only a fragmented corner region partially visible. Unlike standard corner masking, where a single continuous corner is revealed, spotted-corner masking introduces discontinuous patches in the top-left corner area. This makes the reconstruction task even more challenging, as FP-MAE must infer the overall layout from scattered local clues rather than a single coherent boundary. In the outputs, the model extends the incomplete fragments into a full residential layout. For instance, if a visible fragment suggests part of a balcony or corridor, FP-MAE extrapolates plausible adjacent spaces, adding living areas or bedrooms in reasonable positions. However, due to the high masking ratio (80% of the plan missing) and the fragmented nature of the input, the reconstructions often appear blurred or simplified. Windows and doors may be roughly placed or omitted, and some walls show reduced contrast or slight misalignment. Nevertheless, FP-MAE consistently produces structured and coherent layouts rather than chaotic lines. Even with such limited and discontinuous input, the model leverages strong architectural priors to fill in plausible spaces, demonstrating its robustness in handling highly sparse and irregular masking scenarios.
Overall, the qualitative experiments confirm that FP-MAE successfully learns to complete floor plan drawings in a variety of scenarios. It respects architectural conventions (e.g., aligning walls, preserving room adjacencies, not blocking circulation paths) even when guessing large parts of the layout. The model effectively generalizes from common patterns in the training data, which means it can serve as a smart assistant for architects: given a sketch or partial plan, it produces a reasonable complete plan that can then be further refined.

4.4. Quantitative Experimental Results

To quantitatively evaluate FP-MAE, we computed several image-based metrics that are commonly used to assess image reconstruction quality: FID (Fréchet Inception Distance), PSNR (Peak Signal-to-Noise Ratio), and SSIM (Structural Similarity Index). We evaluated these metrics on the test set of FloorplanNet under various masking strategies, and we also compare FP-MAE with two baseline methods: Pix2Pix and CycleGAN, which were trained or tuned for the floor plan completion task.
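For reference, the per-image PSNR and SSIM computations follow standard definitions; a minimal scikit-image sketch is shown below, with FID computed once over the whole test set by a standard implementation and therefore omitted here.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(original: np.ndarray, reconstructed: np.ndarray):
    """Compute PSNR and SSIM for one (H, W, 3) uint8 image pair."""
    psnr = peak_signal_noise_ratio(original, reconstructed, data_range=255)
    ssim = structural_similarity(original, reconstructed,
                                 channel_axis=-1, data_range=255)
    return psnr, ssim
```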
Table 1 summarizes the results. We report metrics for FP-MAE under eight representative masking scenarios on the line drawing floor plan images. We also include FP-MAE’s performance on the colored floor plan images under the random masking scenario to show its effectiveness on colored inputs. Finally, we list the performance of Pix2Pix and CycleGAN on the line drawing task with random masking for comparison. In each column, higher PSNR/SSIM and lower FID indicate better reconstruction quality, and the best result for each metric is shown in bold.

5. Discussion and Conclusions

Our experimental results clearly demonstrate that FP-MAE achieves outstanding performance in the task of floor plan reconstruction, providing strong evidence for the promise of self-supervised learning methods in architectural design automation. Unlike traditional supervised approaches that depend on large volumes of carefully annotated data, FP-MAE leverages the intrinsic structure of floor plans through a self-supervised masking strategy. By reconstructing missing regions of an image, the model is encouraged to learn rich structural and semantic representations without explicit labels. This learning paradigm not only reduces the cost of data preparation but also offers a scalable solution that can be generalized to diverse architectural contexts.

5.1. Contribution to Architectural Image Completion

One of the core contributions of FP-MAE lies in the introduction of a novel architecture for architectural image completion. While masked autoencoders have been extensively studied in natural image domains, their adaptation to architectural drawings presents distinct challenges. Architectural floor plans are abstract representations that encode geometric relationships, circulation patterns, and functional zoning. Unlike natural images, where local texture redundancy often allows simple patch-based inference, floor plans require strong global structural coherence. Even minor geometric deviations, such as a misplaced wall segment, can fundamentally alter circulation, adjacency, and usability. To address these characteristics, FP-MAE adopts a Vision Transformer–based architecture, which is particularly well suited to capturing long-range spatial dependencies across an entire plan that are difficult to model with conventional convolutional networks. This design enables the model to reason about architectural structure beyond local neighborhoods, prioritizing representational capacity and global consistency. Our experimental results demonstrate that FP-MAE preserves structural integrity in reconstructed plans, maintaining both local detail and overall spatial logic even under severe masking conditions. At the same time, the transformer-based formulation establishes a flexible foundation for future efficiency-oriented extensions. Promising directions include exploring lighter transformer variants, reducing model depth or attention heads, and applying model compression or knowledge distillation techniques to transfer learned spatial priors into more compact architectures. Moreover, the self-supervised pretraining stage can be performed offline or centrally, allowing downstream fine-tuning or inference with substantially reduced computational requirements. Together, these directions outline a clear pathway toward improving the accessibility and scalability of FP-MAE for broader architectural practice contexts without compromising its core architectural capabilities.
Alongside the model, we also introduced a new dataset, FloorplanNet, specifically tailored to the architectural domain. This dataset encompasses a wide range of residential layouts represented in both line drawings and colored semantic floorplans. Such diversity allows FP-MAE to learn not only geometric continuity but also semantic regularities, such as the placement of bedrooms adjacent to bathrooms or kitchens connected to living rooms. The combination of model and dataset forms a unified framework that pushes the boundary of how self-supervised learning can be applied in architecture.
A particularly noteworthy outcome of our study is the model’s ability to capture pixel-level details while simultaneously demonstrating a deeper understanding of architectural semantics. In many experiments, FP-MAE successfully reconstructed fine features such as door openings, room boundaries, and circulation corridors, even when these regions were completely masked. More importantly, the reconstructions were functionally coherent, preserving adjacent relationships that align with architectural norms. This suggests that the model is not simply memorizing patterns but is internalizing design logic that governs spatial organization. Such a capability is significant because it points toward the possibility of using self-supervised learning not just for visual restoration but for capturing design knowledge embedded in floor plans. This knowledge could later be leveraged for tasks such as automated space evaluation, circulation optimization, or even the generation of entirely new layouts conditioned on specific requirements.

5.2. Implications for Architectural Practice

Beyond technical performance, FP-MAE has important implications for architectural practice. Rather than replacing architectural authorship, the framework can be used as an optional digital assistant to support the exploration of alternative plan configurations and to reduce repetitive redrawing during early design iterations. In early-stage design, where rapid iteration is essential, FP-MAE can provide immediate feedback by completing partially sketched ideas into coherent layouts. This capability may support iterative comparison and discussion of design alternatives, while preserving the central role of the architect in evaluating, refining, and validating spatial decisions.
In architectural education, FP-MAE-like tools can facilitate learning by making spatial organization more explicit. Students often struggle to translate abstract concepts into coherent floor plans and to understand the implications of partial design decisions. A system that completes incomplete sketches offers immediate visual feedback, helping students grasp spatial relationships, circulation patterns, and programmatic logic. Recent research indicates that generative AI can support design pedagogy by acting as an active collaborator in exploratory learning processes [91]. Within studio teaching, FP-MAE can encourage iterative experimentation by lowering the cost of revision and allowing students to quickly test and refine design ideas.
FP-MAE also has potential applications in historical building preservation and digital reconstruction. Many historical floor plans are incomplete, damaged, or fragmentary, limiting their usability for architectural analysis and conservation. Recent machine learning research has demonstrated the feasibility of reconstructing missing architectural elements from partial historical drawings [92,93]. In this context, FP-MAE can assist in inferring plausible completions of missing plan segments that are consistent with the spatial logic and conventions of a given historical period. Such reconstructions can support restoration analysis, comparative studies, and the digital documentation of architectural heritage.

5.3. Limitations and Future Research Directions

Although FP-MAE demonstrates clear advantages, it is important to acknowledge the limitations of the current study. First, while the reconstructions are structurally sound, there remain cases where finer details such as window placement or exact dimensions deviate from the ground truth. These deviations highlight the challenge of aligning pixel-level fidelity with semantic accuracy. Future work could address this by incorporating geometric constraints into the reconstruction process, ensuring that generated walls remain orthogonal or that circulation paths remain accessible. Beyond plan-internal geometry, real-world architectural applications are also shaped by external constraints such as climatic conditions, geographical context, and regulatory or environmental requirements, which are not considered in the current framework but could be integrated in future extensions. In addition, real architectural sketches often exhibit heterogeneous incompleteness, where partial lines, local continuities, and corrections coexist within the same drawing, introducing levels of ambiguity that are not captured by the current masking schemes. Extending the framework to account for such sketch-level uncertainty, for example through stroke-based, vectorized, or graph-based representations, represents an important direction for future research. It should be emphasized that the outputs of FP-MAE are speculative reconstructions intended to support human interpretation, rather than authoritative or definitive design decisions. Under highly ambiguous or underspecified inputs, the model may generate plausible but incorrect or undesired configurations, and human oversight remains essential.
Second, the current study establishes a foundation for several promising research directions. While the present experiments focus on residential floor plans, this setting provides a stable testbed for developing and validating self-supervised reconstruction strategies. However, architectural practice spans a much broader range of building typologies, including offices, educational facilities, healthcare buildings, and large public spaces, each characterized by distinct spatial logics and regulatory constraints. Extending FP-MAE to these typologies would significantly broaden its applicability and offer a rigorous evaluation of the model’s robustness under more complex functional conditions. In parallel, expanding the dataset in scale and geographic diversity would enable the model to capture cross-regional and cross-cultural variations in architectural conventions and spatial norms, leading to more generalized and transferable representations. Beyond individual buildings, future work may also explore extensions to the urban scale, where predictive and assistive modeling can support scenario-based analysis of land use and long-term development, serving as a suggestive tool for planning and policy rather than a deterministic solution.
Third, while our self-supervised approach eliminates the need for labeled data, it may benefit from multimodal integration. For instance, combining floor plan images with textual descriptions of programmatic requirements could allow the model to generate reconstructions that are not only structurally plausible but also aligned with user intent. Similarly, extending FP-MAE to operate on 3D building layouts or multi-level floor plans would open new avenues for real-world application.
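To make the multimodal idea more tangible, the sketch below shows how a text embedding, for instance from a frozen CLIP text encoder, could be prepended as a conditioning token to an MAE-style decoder. This is a speculative extension rather than part of the current model; the class name, dimensions, and the assumption that the decoder maps token sequences to token sequences are all illustrative.

```python
import torch
import torch.nn as nn

class TextConditionedDecoder(nn.Module):
    """Speculative multimodal wrapper: a projected text embedding is
    prepended as an extra token so that decoding of masked patches can
    attend to programmatic requirements (e.g., "three bedrooms,
    south-facing living room")."""

    def __init__(self, decoder: nn.Module, text_dim: int = 512, dec_dim: int = 512):
        super().__init__()
        self.decoder = decoder                 # assumed token-to-token MAE decoder
        self.proj = nn.Linear(text_dim, dec_dim)

    def forward(self, patch_tokens: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        cond = self.proj(text_emb).unsqueeze(1)       # (B, 1, dec_dim)
        tokens = torch.cat([cond, patch_tokens], dim=1)
        out = self.decoder(tokens)                    # decode with conditioning
        return out[:, 1:]                             # drop the conditioning token
```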
Another important direction for future research concerns robustness and generalization under more heterogeneous input conditions. In the present study, FP-MAE is evaluated using controlled masking strategies that provide a systematic and reproducible way to analyze reconstruction behavior under varying degrees of incompleteness. These settings establish a clear baseline for understanding how the model extrapolates global structure from partial observations. Building on this foundation, future work will extend evaluation to more sketch-like inputs that better reflect real design practice, including freehand drawings with irregular proportions, uneven line quality, and localized noise. In addition, more complex and unstructured occlusion patterns, such as irregular gaps, fragmented strokes, or non-contiguous missing regions, will be explored to approximate real-world sketching and revision scenarios. These extensions will enable a more comprehensive assessment of FP-MAE’s robustness while further strengthening its applicability to early-stage architectural workflows.
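As one possible way to approximate such unstructured occlusions, the following NumPy sketch grows several short random walks on the ViT patch grid to produce fragmented, non-contiguous masks. This generator is an illustrative assumption on our part, not one of the masking schemes evaluated in this study.

```python
import numpy as np

def fragmented_mask(grid: int = 14, n_fragments: int = 6,
                    walk_len: int = 8, seed=None) -> np.ndarray:
    """Sketch-style masking: several short random walks over the patch
    grid yield irregular, non-contiguous masked regions, unlike a single
    square or band. Returns a boolean (grid, grid) array; True = masked."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((grid, grid), dtype=bool)
    for _ in range(n_fragments):
        r, c = rng.integers(0, grid, size=2)          # fragment seed cell
        for _ in range(walk_len):
            mask[r, c] = True
            dr, dc = rng.choice([-1, 0, 1], size=2)   # random step
            r = int(np.clip(r + dr, 0, grid - 1))
            c = int(np.clip(c + dc, 0, grid - 1))
    return mask

# mask.flatten() could then select which patch tokens to drop,
# analogously to the structured masking strategies evaluated above.
```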
An additional avenue for future research concerns the qualitative assessment and functional usability of reconstructed floor plans. While the present study focuses on image-based reconstruction metrics to establish a clear and reproducible baseline, future extensions can incorporate architecture-aware evaluation criteria that reflect functional and spatial performance. Recent studies have proposed dedicated spatial and graph-based metrics to quantify aspects such as room accessibility, adjacency logic, and proportional consistency, offering a promising complement to visual similarity measures [26]. In addition, rule-based checks or lightweight constraint validation modules could be integrated to flag implausible proportions, missing circulation paths, or violations of basic architectural conventions. Such evaluation layers would enable FP-MAE to operate more effectively within architect-led workflows, where automated layout completion is combined with professional judgment and iterative refinement. These directions position usability assessment as a natural next step in the evolution of the framework rather than a prerequisite at the current stage.
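As a simple example of the rule-based checks envisaged here, the following sketch uses a room-adjacency graph to flag rooms that lack a circulation path from the entrance. The room names and adjacency input are hypothetical, and this validator is not an implemented component of FP-MAE.

```python
import networkx as nx

def unreachable_rooms(rooms, door_adjacencies, entrance="living_room"):
    """Flag rooms with no circulation path from the entrance, given the
    rooms and door/opening adjacencies extracted from a completed plan."""
    g = nx.Graph()
    g.add_nodes_from(rooms)
    g.add_edges_from(door_adjacencies)
    reachable = nx.node_connected_component(g, entrance)
    return [r for r in rooms if r not in reachable]

# Example: a bedroom that the model sealed off would be flagged.
print(unreachable_rooms(
    rooms=["living_room", "kitchen", "bedroom", "bathroom"],
    door_adjacencies=[("living_room", "kitchen"), ("living_room", "bathroom")],
))  # -> ['bedroom']
```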
A further avenue for future development concerns the integration of FP-MAE into existing architectural production workflows. While the present study focuses on image-based floor plan completion, the raster outputs produced by FP-MAE provide a practical foundation for downstream conversion into vectorized CAD or BIM representations. For example, reconstructed walls, room boundaries, and openings can be extracted and translated into editable geometric elements. Building on this capability, future work will explore lightweight plugins or standalone interfaces that allow architects to sketch partial layouts and receive real-time completion suggestions within familiar CAD or BIM environments. Such interfaces would support interactive acceptance, modification, or rejection of generated elements, ensuring that FP-MAE operates as an assistive component within architect-led workflows. These developments outline a clear pathway toward practical deployment without altering existing design practices.
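As an indication of how such a raster-to-vector step might look, the OpenCV sketch below binarizes a reconstructed plan, traces wall contours, and simplifies them into polylines that a CAD/BIM importer could consume. This is an assumed post-processing utility under the stated conventions (dark walls on a light background), not part of the evaluated pipeline.

```python
import cv2
import numpy as np

def vectorize_walls(raster: np.ndarray, eps: float = 3.0):
    """Convert a grayscale reconstruction (uint8, dark walls on a light
    background) into simplified wall polylines."""
    # Binarize so that walls become the white foreground.
    _, binary = cv2.threshold(raster, 127, 255, cv2.THRESH_BINARY_INV)
    # Trace the outer wall contours.
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # Douglas-Peucker simplification yields editable polylines.
    return [cv2.approxPolyDP(c, eps, True) for c in contours]
```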
Additionally, pretraining with FP-MAE can provide a strong foundation for a range of downstream architectural tasks. Recent studies have shown that building energy performance can be reliably predicted from early-stage representations such as floor plans or conceptual images, indicating that spatial layout encodes meaningful performance-related information [29]. The latent representations learned by FP-MAE are therefore likely to capture generic spatial priors that are not only useful for geometric completion, but also beneficial for performance-oriented tasks. Building on this, future work may explore coupling FP-MAE pretraining with downstream energy prediction models, enabling a unified pipeline that links layout completion with early-stage performance feedback. Beyond energy analysis, the learned representations may enhance tasks such as room type classification, anomaly detection in architectural drawings, and interactive design assistance. This direction aligns with broader machine learning trends, where pretrained models serve as reusable backbones that can be fine-tuned to support diverse, domain-specific applications.
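To illustrate how the pretrained encoder might be reused, the sketch below freezes an FP-MAE backbone and attaches a small regression head, for example for early-stage energy-use prediction. The `pretrained_encoder` object, its output shape, and the 768-dimensional embedding are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class EnergyRegressionHead(nn.Module):
    """Downstream sketch: a frozen FP-MAE ViT encoder feeds a small
    regression head that predicts a scalar such as energy-use intensity."""

    def __init__(self, pretrained_encoder: nn.Module, embed_dim: int = 768):
        super().__init__()
        self.encoder = pretrained_encoder
        for p in self.encoder.parameters():   # freeze the pretrained backbone
            p.requires_grad = False
        self.head = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, 1),          # scalar prediction
        )

    def forward(self, plan_image: torch.Tensor) -> torch.Tensor:
        tokens = self.encoder(plan_image)     # assumed (B, N, embed_dim) patch tokens
        return self.head(tokens.mean(dim=1))  # global average pooling
```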
Finally, while FP-MAE primarily focuses on spatial layout completion, we recognize the significance of broader contextual factors such as structural logic, material choices, site orientation, environmental responsiveness, and socio-cultural influences in architectural decision-making. Although these aspects are not explicitly represented in the current model, our framework provides a solid foundation for their future integration. We see meaningful opportunities to incorporate such variables through multimodal inputs, including site data and material constraints, or by linking the model with rule-based systems and simulation-driven components. Through these developments, FP-MAE can progressively evolve into a more comprehensive design assistant that supports context-aware, architect-led workflows.

5.4. Conclusions

In conclusion, FP-MAE represents a significant step forward in the application of self-supervised learning to architectural floor plan generation. By reconstructing masked regions of floor plan images, the model learns both pixel-level details and semantic-level structures, enabling it to produce high-quality, functionally coherent reconstructions. While the current work focuses on reconstructive completion, the learned priors and masked autoencoding strategy offer a foundation upon which future generative floor plan systems may be developed, particularly when combined with additional constraints. The combination of a novel architecture, a tailored dataset, and rigorous experimentation demonstrates the scalability and adaptability of this approach.
While there remain challenges in terms of precision, generalization, and multimodal integration, the progress demonstrated by FP-MAE establishes a strong foundation for future research. More broadly, it illustrates how self-supervised methods can move beyond natural image domains and into specialized fields such as architecture, where the stakes of structural and functional accuracy are particularly high.
As architectural design continues to embrace digital and AI-driven tools, methods like FP-MAE will become increasingly valuable, not as replacements for human creativity, but as collaborators that amplify efficiency, expand possibilities, and enrich the design process.

Author Contributions

Conceptualization, P.L., T.L. and P.Z.; Methodology, J.Z., R.L., T.L., P.Z. and Z.L.; Validation, T.F.; Investigation, P.L. and Z.L.; Writing—original draft, J.Z., R.L. and P.L.; Writing—review and editing, T.L., T.F. and J.Y.; Visualization, J.Y.; Supervision, T.F. and J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Acknowledgments

The authors gratefully acknowledge University College London (UCL) for supporting the publication of this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Purcell, A.T.; Gero, J.S. Drawings and the design process: A review of protocol studies in design and other disciplines and related research in cognitive psychology. Des. Stud. 1998, 19, 389–430.
2. Hettithanthri, U.; Hansen, P.; Munasinghe, H. Exploring the architectural design process assisted in conventional design studio: A systematic literature review. Int. J. Technol. Des. Educ. 2023, 33, 1835–1859.
3. Sami, Z.; Özer, Y.S. The importance of figural and verbal sketches in creativity within the architectural design studio. Megaron Yildiz Tech. Univ. Fac. Archit. E J. 2024, 19, 539–549.
4. Li, M.; Wang, C.; Wu, Y.; Santamouris, M.; Lu, S. Assessing spatial inequities of thermal environment and blue-green intervention for vulnerable populations in dense urban areas. Urban Clim. 2025, 59, 102328.
5. Vandenhende, K. Mixing Specific and More Universal Design Media to Deal with Multidisciplinarity. Athens J. Archit. 2023, 9, 319–334.
6. Lu, S.; Xu, W.; Chen, Y.; Yan, X. An experimental study on the acoustic absorption of sand panels. Appl. Acoust. 2017, 116, 238–248.
7. Heckmann, O.; Schneider, F. Floor Plan Manual: Housing; Birkhäuser: Basel, Switzerland, 1997.
8. Shi, M.; Seo, J.; Cha, S.H.; Xiao, B.; Chi, H.L. Generative AI-powered architectural exterior conceptual design based on the design intent. J. Comput. Des. Eng. 2024, 11, 125–142.
9. Zeng, P.; Yin, J.; Zhang, M.; Shen, Y.; Wang, X. CARD: A cross-modal agent framework for generative and editable residential design. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 1–12.
10. Yin, J.; Gao, W.; Li, J.; Xu, P.; Wu, C.; Lin, B.; Lu, S. ArchiDiff: Interactive design of 3D architectural forms generated from a single image. Comput. Ind. 2025, 168, 104275.
11. Chen, L.; Song, Y.; Guo, J.; Sun, L.; Childs, P.; Yin, Y. How generative AI supports human in conceptual design. Des. Sci. 2025, 11, e9.
12. Zeng, P.; Gao, W.; Li, J.; Yin, J.; Chen, J.; Lu, S. Automated residential layout generation and editing using natural language and images. Autom. Constr. 2025, 174, 106133.
13. Karadağ, D.; Ozar, B. A new frontier in design studio: AI and human collaboration in conceptual design. Front. Arch. Res. 2025, 14, 1536–1550.
14. Zeng, T.; Ma, X.; Luo, Y.; Ji, Y.; Lu, S. Improving outdoor thermal environmental quality through kinetic canopy empowered by machine learning and control algorithms. Build. Simul. 2025, 18, 699–720.
15. Agri, M.E.; Le, A.; Phung, Q. AI integration in architectural design and management: Professionals’ perspectives. Arch. Eng. Des. Manag. 2025, 1–16, ahead of print.
16. Yenew, A.B.; Assefa, B.G. From Algorithms to Architecture: Computational Methods for House Floorplan Generation. SN Comput. Sci. 2024, 5, 589.
17. Liu, S.; Wang, Y.; Liu, X.; Yang, L.; Zhang, Y.; He, J. How does future climatic uncertainty affect multi-objective building energy retrofit decisions? Evidence from residential buildings in subtropical Hong Kong. Sustain. Cities Soc. 2023, 92, 104482.
18. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2022; pp. 16000–16009.
19. Huang, W.; Zheng, H. Architectural drawings recognition and generation through machine learning. In Proceedings of the 38th Annual Conference of the Association for Computer Aided Design in Architecture, Mexico City, Mexico, 18–20 October 2018.
20. Lu, Z.; Wang, T.; Guo, J.; Meng, W.; Xiao, J.; Zhang, W.; Zhang, X. Data-driven floor plan understanding in rural residential buildings via deep recognition. Inf. Sci. 2021, 567, 58–74.
21. Lv, X.; Zhao, S.; Yu, X.; Zhao, B. Residential floor plan recognition and reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2021.
22. Xu, Z.; Yang, C.; Alheejawi, S.; Jha, N.; Mehadi, S.; Mandal, M. Automatic floor plan analysis using a boundary attention-based deep network. Int. J. Doc. Anal. Recognit. IJDAR 2025, 28, 19–30.
23. Abouagour, M.; Garyfallidis, E. GFLAN: Generative Functional Layouts. arXiv 2025, arXiv:2512.16275.
24. Zeng, P.; Yin, J.; Gao, Y.; Li, J.; Jin, Z.; Lu, S. Comprehensive and Dedicated Metrics for Evaluating AI-Generated Residential Floor Plans. Buildings 2025, 15, 1674.
25. Yin, J.; Zeng, P.; Sun, H.; Dai, Y.; Zheng, H.; Zhang, M.; Zhang, Y.; Lu, S. Floorplan-llama: Aligning architects’ feedback and domain knowledge in architectural floor plan generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2025.
26. Hu, S.; Wu, W.; Wang, Y.; Xu, B.; Zheng, L. GSDiff: Synthesizing Vector Floorplans via Geometry-enhanced Structural Graph Generation. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2025; Volume 39, pp. 17323–17332.
27. Zhong, J.; Li, P.; Luo, R.; Yin, J.; Ding, Y.; Bai, J.; Hong, C.; Deng, X.; Ma, X.; Lu, S. EnergAI: A Large Language Model-Driven Generative Design Method for Early-Stage Building Energy Optimization. Energies 2025, 18, 5921.
28. Yin, J.; Zhong, J.; Zeng, P.; Li, P.; Zheng, H.; Zhang, M.; Lu, S. ArchShapeNet: An interpretable 3D-CNN framework for evaluating architectural shapes. Int. J. Archit. Comput. 2025, 23, 14780771251352965.
29. Yin, J.; Zeng, P.; Huang, Y.; Sun, H.; Zhong, J.; Hao, T.; Lu, S. AI-empowered prediction of office building energy use from single-view conceptual images for early-stage design. Appl. Energy 2026, 406, 127289.
30. Hendrycks, D.; Mazeika, M.; Kadavath, S.; Song, D. Using self-supervised learning can improve model robustness and uncertainty. Adv. Neural Inf. Process. Syst. 2019, 32, 15663–15674.
31. Liu, X.; Zhang, F.; Hou, Z.; Mian, L.; Wang, Z.; Zhang, J. Self-supervised learning: Generative or contrastive. IEEE Trans. Knowl. Data Eng. 2021, 35, 857–876.
32. Liu, Y.; Jin, M.; Pan, S.; Zhou, C.; Zheng, Y.; Xia, F. Graph self-supervised learning: A survey. IEEE Trans. Knowl. Data Eng. 2022, 35, 5879–5900.
33. Gui, J.; Chen, T.; Zhang, J.; Cao, Q.; Sun, Z.; Luo, H. A survey on self-supervised learning: Algorithms, applications, and future trends. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 9052–9071.
34. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning; PMLR: New York, NY, USA, 2021; pp. 8748–8763.
35. Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.-T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.-H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning; PMLR: New York, NY, USA, 2021; pp. 4904–4916.
36. Baevski, A.; Hsu, W.N.; Xu, Q.; Babu, A.; Gu, J.; Auli, M. Data2vec: A general framework for self-supervised learning in speech, vision and language. In Proceedings of the International Conference on Machine Learning; PMLR: New York, NY, USA, 2022; pp. 1298–1312.
37. Huang, Y.; Zeng, T.; Jia, M.; Yang, J.; Xu, W.; Lu, S. Fusing Transformer and diffusion for high-resolution prediction of daylight illuminance and glare based on sparse ceiling-mounted input. Build. Environ. 2025, 267, 112163.
38. Sun, H.; Xia, B.; Zhao, Y.; Chang, Y.; Wang, X. Positive Enhanced Preference Alignment for Text-to-Image Models. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2025; pp. 1–5.
39. Zhou, J.; Wei, C.; Wang, H.; Shen, W.; Xie, C.; Yuille, A.; Kong, T. iBOT: Image BERT pre-training with online tokenizer. arXiv 2022, arXiv:2111.07832.
40. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019.
41. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018. Available online: https://openai.com/research/language-unsupervised (accessed on 1 December 2025).
42. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9.
43. Sinha, K.; Jia, R.; Hupkes, D.; Pineau, J.; Williams, A.; Kiela, D. Masked language modeling and the distributional hypothesis: Order word matters pre-training for little. arXiv 2021, arXiv:2104.06644.
44. Shurrab, S.; Duwairi, R. Self-supervised learning methods and applications in medical imaging analysis: A survey. PeerJ Comput. Sci. 2022, 8, e1045.
45. Chen, L.; Bentley, P.; Mori, K.; Misawa, K.; Fujiwara, M.; Rueckert, D. Self-supervised learning for medical image analysis using image context restoration. Med. Image Anal. 2019, 58, 101539.
46. Huang, S.-C.; Pareek, A.; Jensen, M.; Lungren, M.P.; Yeung, S.; Chaudhari, A.S. Self-supervised learning for medical image classification: A systematic review and implementation guidelines. NPJ Digit. Med. 2023, 6, 74.
47. Bao, H.; Dong, L.; Piao, S.; Wei, F. BEiT: BERT pre-training of image transformers. arXiv 2021, arXiv:2106.08254.
48. Wang, W.; Bao, H.; Dong, L.; Bjorck, J.; Peng, Z.; Liu, Q.; Aggarwal, K.; Mohammed, O.K.; Singhal, S.; Som, S.; et al. Image as a foreign language: BEiT pretraining for vision and vision-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2023.
49. Li, P.; Yin, J.; Zhong, J.; Luo, R.; Zeng, P.; Zhang, M. Segment any architectural facades (SAAF): An automatic segmentation model for building facades, walls and windows based on multimodal semantics guidance. arXiv 2025, arXiv:2506.09071.
50. Zhong, J.; Yin, J.; Li, P.; Zeng, P.; Zhang, M.; Luo, R.; Lu, S. ArchiLense: A framework for quantitative analysis of architectural styles based on vision large language models. arXiv 2025, arXiv:2506.07739.
51. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
52. Khan, A.; Sohail, A.; Fiaz, M.; Hassan, M.; Afridi, T.H.; Marwat, S.U.; Munir, F.; Ali, S.; Naseem, H.; Zaheer, M.Z.; et al. A survey of the self supervised learning mechanisms for vision transformers. arXiv 2024, arXiv:2408.17059.
53. Xie, Z.; Zhang, Z.; Cao, Y.; Lin, Y.; Bao, J.; Yao, Z.; Dai, Q.; Hu, H. SimMIM: A Simple Framework for Masked Image Modeling. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2022; pp. 9653–9663.
54. Zhou, J.; Wei, C.; Wang, H.; Shen, W.; Xie, C.; Yuille, A.; Kong, T. iBOT: Image BERT Pre-Training with Online Tokenizer. In Proceedings of the 10th International Conference on Learning Representations, Online, 25–29 April 2022.
55. Zeng, P.; Yin, J.; Zhang, M.; Li, J.; Zhang, Y.; Lu, S. Unified residential floor plan generation with multimodal inputs. Autom. Constr. 2025, 178, 106408.
56. Yin, J.; Zeng, P.; Li, P.; Zhong, J.; Thao, T.; Zheng, H.; Lu, S. Drag2Build++: A drag-based 3D architectural mesh editing workflow based on differentiable surface modeling. Front. Archit. Res. 2025, 14, 1602–1620.
57. Yin, J.; Zeng, P.; Shen, L.; Zhang, M.; Zhong, J.; Han, Y.; Lu, S. ArchiSet: Benchmarking Editable and Consistent Single-View 3D Reconstruction of Buildings with Specific Window-to-Wall Ratios. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2025; pp. 26004–26014.
58. Baldi, P. Autoencoders, unsupervised learning, and deep architectures. In Proceedings of the ICML Workshop on Unsupervised and Transfer Learning, Edinburgh, Scotland, 26 June–1 July 2012; JMLR Workshop and Conference Proceedings.
59. Berahmand, K.; Daneshfar, F.; Salehi, E.S.; Li, Y.; Xu, Y. Autoencoders and their applications in machine learning: A survey. Artif. Intell. Rev. 2024, 57, 28.
60. Vincent, P.; LaRochelle, H.; Bengio, Y.; Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning; Association for Computing Machinery: New York, NY, USA, 2008.
61. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114.
62. Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2016.
63. Van Den Oord, A.; Vinyals, O.; Kavukcuoglu, K. Neural discrete representation learning. Adv. Neural Inf. Process. Syst. 2017, 30, 6306–6315.
64. Chen, S.; Guo, W. Auto-encoders in deep learning—A review with new perspectives. Mathematics 2023, 11, 1777.
65. Wang, Y.; Wang, H.; Zhang, F. Mask autoencoder for enhanced image reconstruction with position coding offset and combined masking. Vis. Comput. 2025, 41, 7477–7491.
66. Rodrigues, E.; Gaspar, A.R.; Gomes, Á. Automated approach for design generation and thermal assessment of alternative floor plans. Energy Build. 2014, 81, 170–181.
67. Lu, Z.; Li, Y.; Wang, F. Complex layout generation for large-scale floor plans via deep edge-aware GNNs. Appl. Intell. 2025, 55, 400.
68. Liu, J.; Xue, Y.; Ni, H.; Yu, R.; Zhou, Z.; Huang, S.X. Computer-Aided Layout Generation for Building Design: A Review. arXiv 2025, arXiv:2504.09694.
69. Yan, S.; Wu, C.; Zhang, Y. Generative design for architectural spatial layouts: A review of technical approaches. J. Asian Archit. Build. Eng. 2025, 1–21.
70. Meselhy, A.; Almalkawi, A. A review of artificial intelligence methodologies in computational automated generation of high performance floorplans. NPJ Clean Energy 2025, 1, 2.
71. Zeng, P.; Yin, J.; Sun, H.; Dai, Y.; Jiang, M.; Zhang, M.; Lu, S. MRED-14: A Benchmark for Low-Energy Residential Floor Plan Generation with 14 Flexible Inputs. In Proceedings of the 33rd ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2025; pp. 11298–11307.
72. Zeng, P.; Yin, J.; Huang, Y.; Zhong, J.; Lu, S. AI-based generation and optimization of energy-efficient residential layouts controlled by contour and room number. Build. Simul. 2025, 18, 2777–2805.
73. Merrell, P.; Schkufza, E.; Koltun, V. Computer-generated residential building layouts. ACM Trans. Graph. 2010, 29, 181.
74. Wu, W.; Fu, X.-M.; Tang, R.; Wang, Y.; Qi, Y.-H.; Liu, L. Data-driven Interior Plan Generation for Residential Buildings. ACM Trans. Graph. 2019, 38, 234.
75. Hu, R.; Huang, Z.; Tang, Y.; Van Kaick, O.; Zhang, H.; Huang, H. Graph2Plan: Learning Floorplan Generation from Layout Graphs. ACM Trans. Graph. 2020, 39, 118.
76. Nauata, N.; Chang, K.-H.; Cheng, C.-Y.; Mori, G.; Furukawa, Y. House-GAN: Relational Generative Adversarial Networks for Graph-Constrained House Layout Generation. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference; Springer: Cham, Switzerland, 2020; pp. 162–177.
77. Nauata, N.; Hosseini, S.; Chang, K.-H.; Chu, H.; Cheng, C.-Y.; Furukawa, Y. House-GAN++: Generative Adversarial Layout Refinement Networks. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2021.
78. Shabani, M.A.; Hosseini, S.; Furukawa, Y. HouseDiffusion: Vector floorplan generation via a diffusion model with discrete and continuous denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2023.
79. Xia, D.; Wu, Z.; Zou, Y.; Chen, R.; Lou, S. Developing a bottom-up approach to assess energy challenges in urban residential buildings of China. Front. Archit. Res. 2025, 14, 1810–1833.
80. Zou, Y.; Chen, Z.; Lou, S.; Huang, Y.; Xia, D.; Cao, Y.; Li, H.; Lun, I.Y.F. Accelerating long-term building energy performance simulation with a reference day method. Build. Simul. 2024, 17, 2331–2353.
81. Liu, X.; He, J.; Xiong, K.; Liu, S.; He, B.-J. Identification of factors affecting public willingness to pay for heat mitigation and adaptation: Evidence from Guangzhou, China. Urban Clim. 2023, 48, 101405.
82. Rahbar, M.; Mahdavinejad, M.; Markazi, A.H.; Bemanian, M. Architectural Layout Design through Deep Learning and Agent-Based Modeling: A Hybrid Approach. J. Build. Eng. 2022, 47, 103822.
83. Tang, H.; Zhang, Z.; Shi, H.; Li, B.; Shao, L.; Sebe, N.; Timofte, R.; Van Gool, L. Graph Transformer GANs for Graph-Constrained House Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2173–2182.
84. Sun, J.; Wu, W.; Liu, L.; Min, W.; Zhang, G.; Zheng, L. WallPlan: Synthesizing Floorplans by Learning to Generate Wall Graphs. ACM Trans. Graph. 2022, 41, 92.
85. Upadhyay, A.; Dubey, A.; Kuriakose, S.M.; Agarawal, S. FloorGAN: Generative Network for Automated Floor Layout Generation. In Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD), CODS-COMAD ’23; Association for Computing Machinery: New York, NY, USA, 2023; pp. 140–148.
86. Yin, J.; Zeng, P.; Zhong, J.; Li, P.; Zhang, M.; Luo, R.; Lu, S. FloorPlan-DeepSeek (FPDS): A multimodal approach to floorplan generation using vector-based next room prediction. arXiv 2025, arXiv:2506.21562.
87. Leng, S.; Zhou, Y.; Dupty, M.H.; Lee, W.S.; Joyce, S.; Lu, W. Tell2Design: A dataset for language-guided floor plan generation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2023.
88. Zong, Z.; Zhan, Z.; Tan, G. HouseLLM: LLM-Assisted Two-Phase Text-to-Floorplan Generation. arXiv 2024, arXiv:2411.12279.
89. Qin, S.; He, C.; Chen, Q.; Yang, S.; Liao, W.; Gu, Y.; Lu, X. ChatHouseDiffusion: Prompt-guided generation and editing of floor plans. arXiv 2024, arXiv:2410.11908.
90. Gueze, A.; Ospici, M.; Rohmer, D.; Cani, M.-P. Floor Plan Reconstruction from Sparse Views: Combining Graph Neural Network with Constrained Diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2023; pp. 1583–1592.
91. Medel-Vera, C.; Britton, S.; Gates, W.F. An exploration of the role of generative AI in fostering creativity in architectural learning environments. Comput. Educ. Artif. Intell. 2025, 9, 100501.
92. Swaileh, W.; Kotzinos, D.; Ghosh, S.; Jordan, M.; Vu, N.-S.; Qian, Y. Versailles-FP dataset: Wall detection in ancient floor plans. In Proceedings of the International Conference on Document Analysis and Recognition; Springer International Publishing: Cham, Switzerland, 2021.
93. Karadag, I. Machine learning for conservation of architectural heritage. Open House Int. 2023, 48, 23–37.
Figure 1. FP-MAE Architecture Diagram.
Figure 2. Illustration of the line drawing floor plan subset.
Figure 3. Illustration of the colored drawing floor plan subset.
Figure 4. Random Masking Method.
Figure 5. Example Results Comparison of Different Masking Strategies.
Figure 6. Detailed Diagram of Random Masking—Line Drawing.
Figure 7. Detailed Diagram of Random Masking—Colored Floor Plan.
Figure 8. Detailed Diagram of Center Masking.
Figure 9. Detailed Diagram of Perimeter Masking.
Figure 10. Detailed Diagram of Spotted-perimeter Masking.
Figure 11. Detailed Diagram of Biased-keyhole Masking.
Figure 12. Detailed Diagram of One-Sided Masking.
Figure 13. Detailed Diagram of Corner Masking.
Figure 14. Detailed Diagram of Spotted-corner Masking.
Table 1. Results on FloorplanNet. The best results are indicated in bold. Lower FID and higher PSNR/SSIM indicate better reconstruction quality.

Method                     Data             FID        PSNR (dB)   SSIM
Random Masking             Line Drawing     12.6053    79.7688     0.9757
Random Masking             Colored Drawing  17.2192    79.7638     0.9724
Center Masking             Line Drawing     22.0109    77.7863     0.9738
Perimeter Masking          Line Drawing     12.4044    78.6158     0.9609
Spotted-perimeter Masking  Line Drawing     16.8733    79.3059     0.9637
Biased-keyhole Masking     Line Drawing     17.4593    79.3542     0.9632
One-sided Masking          Line Drawing     8.84111    80.1424     0.9794
Corner Masking             Line Drawing     27.3668    76.0285     0.9247
Spotted-corner Masking     Line Drawing     19.3524    81.3495     0.9423
Pix2Pix                    Line Drawing     76.3275    68.3546     0.9334
CycleGAN                   Line Drawing     90.3235    64.3487     0.9468