1. Introduction
The renewal and renovation of existing buildings represent a significant direction for the future development of the civil engineering industry [1,2,3]. Such engineering projects generally require reference to the original construction drawings of the building. However, for a large number of buildings constructed many years ago, the original architectural drawings are typically paper-based and may have been damaged or lost after long-term storage, resulting in a lack of reliable documentation to support project implementation. At present, the preparation of drawings for existing buildings primarily relies on manual surveying and mapping. This process is not only complex, time-consuming, and labor-intensive, but it is also highly dependent on the experience of operators, making it difficult to achieve both efficiency and accuracy simultaneously. Consequently, traditional approaches can no longer meet the demands of current engineering practice.
Therefore, there is an urgent need for a technical approach capable of efficiently and automatically acquiring information from existing buildings so as to improve the efficiency of drawing generation. Indoor floor plan generation and indoor mapping have become important research directions [4,5]. For example, Tan et al. [6] used terrestrial laser scanning (TLS) to measure indoor geometric quality, demonstrating that traditional manual methods are time-consuming and prone to errors, while laser scanning can improve accuracy. Kim et al. [7] proposed a method to automatically extract geometric primitives from unstructured point clouds obtained by laser scanning and generate 2D floor plans; by combining machine learning algorithms, structured geometric lines can be obtained directly from raw 3D point clouds, enabling automated construction of floor plan models that reflect the actual building layout without relying on manual reference drawings or adjustments. However, these approaches usually involve high equipment costs and complex processing workflows. With the development of computer vision, image-based methods have gradually emerged. By incorporating semantic segmentation and geometric constraints, building component information can be extracted from images and converted into floor plans, providing a new, low-cost pathway for automated modeling.
In recent years, deep learning methods have achieved remarkable performance improvements in computer vision tasks, providing new research perspectives for extracting structural information from real-world building images and generating architectural drawings. At present, deep learning technologies have been widely applied in various tasks within the construction industry, including building component recognition, semantic segmentation, and three-dimensional reconstruction, thereby enhancing the level of automation in these domains [8,9,10,11]. Wang et al. [12] proposed an improved U-Net-based deep learning network, MBF-UNet, to achieve accurate segmentation of spalling areas on building exterior wall tiles. Yu et al. [13] developed the SI3FP pipeline, which integrates advanced deep learning techniques such as Neural Radiance Fields (NeRF), ResNet-50, and RetinaNet. This approach enables the automatic extraction of precise façade and window information from ordinary photographs (e.g., Google Street View images) and generates compliant thermal 3D models, assisting planners in accurately estimating building energy consumption and formulating energy-saving renovation strategies. Oktavianus et al. [14] combined deep learning with Building Information Modeling (BIM) technology and proposed an intelligent post-earthquake building recovery framework. This framework addresses the strong subjectivity and time-consuming nature of traditional post-earthquake assessments by enabling automatic classification of structural component damage levels and intelligent generation of recovery plans. These studies demonstrate the strong capability of deep learning in automatically recognizing and analyzing building information from visual data. However, most existing works focus on façade analysis, damage detection, or component recognition, while the automatic extraction of indoor structural elements and the generation of architectural floor plans from image data remain relatively underexplored.
In research on the application of deep learning to architectural information extraction, semantic segmentation, as a major research direction in computer vision, has been widely employed in building component recognition and scene understanding [15,16,17]. Through semantic segmentation, key structural elements such as walls, doors, windows, and floors can be effectively extracted from real-world architectural images. Wong et al. [18] constructed the HBD dataset, which contains 2235 indoor images and 15,768 instances, annotated with 11 core building component categories including walls, floors, and ceilings. This dataset supports the training and benchmarking of indoor building component instance segmentation models for 3D reconstruction tasks. Mao et al. [19] proposed the GA2Net network, which enhances the utilization of RGB and depth information through global feature extraction and a gated augmentation transformation module, thereby improving semantic segmentation accuracy in indoor scenes under varying illumination conditions and object occlusion. Liang et al. [20] introduced the MHIBS-Net network, which integrates point cloud normal information and relative positional encoding to directly process complete room-scale point clouds, achieving efficient and accurate semantic segmentation of seven categories of architectural structures, including ceilings, beams, and columns, in complex indoor environments. These studies demonstrate that semantic segmentation provides an effective technical approach for automatically extracting structural component information from building images, laying an important foundation for subsequent geometric analysis and architectural reconstruction.
The training performance of semantic segmentation models largely depends on the availability of sufficient and high-quality data [21]. However, the acquisition of real-world indoor building datasets heavily relies on manual effort. From image collection to annotation, the process requires substantial labor investment and is often time-consuming. Undoubtedly, the use of synthetic datasets provides an effective solution to these challenges. Synthetic datasets can be generated through modeling and automated annotation, ensuring labeling accuracy while significantly reducing data collection and annotation costs.
Existing approaches for constructing synthetic datasets can generally be categorized into two types. The first type is based on three-dimensional modeling and rendering technologies, where virtual scenes are built to generate images with accurate geometric information and semantic annotations. For example, Cordeiro et al. [22] used Blender to automatically generate an instance segmentation dataset for robotic picking scenarios, while He et al. [23] constructed the UnityShip dataset using the Unity engine to support ship-detection tasks in aerial imagery. The second type is based on parametric generation methods, which automatically produce diverse scenes by adjusting predefined structural parameters. Kikuchi et al. [24] adopted a procedural modeling strategy to generate large-scale urban 3D models and street-view datasets, while Schmedemann et al. [25] generated industrial defect datasets by randomizing parameters such as defect types, illumination conditions, and camera configurations. Although these approaches demonstrate the effectiveness of synthetic data generation, constructing diverse indoor architectural scenes still requires substantial manual modeling effort and professional expertise in 3D modeling software. In addition, designing scene parameters and ensuring sufficient diversity across layouts, materials, and lighting conditions can be challenging when using traditional modeling or parametric generation approaches. Therefore, developing more efficient and automated scene generation strategies is essential for constructing large-scale annotated datasets to support deep learning-based architectural information extraction.
To complete the restoration and regeneration of missing architectural drawings, obtaining only the two-dimensional regional information of components through semantic segmentation is insufficient. It is also necessary to further extract the actual geometric coordinates and spatial layout relationships of structural elements. Stereo vision technology effectively addresses this limitation [26]. By simulating the imaging principle of human binocular vision, a stereo vision system simultaneously captures two images of the same building scene from slightly different viewpoints. Combined with stereo camera calibration parameters and disparity computation models, the two-dimensional component regions obtained from semantic segmentation can be transformed into three-dimensional spatial coordinate data with real-world scaling, thereby enabling subsequent generation of floor plans for existing buildings.
Compared with other three-dimensional measurement techniques, stereo vision offers lower cost and greater deployment flexibility. It does not require complex site preprocessing and can accurately obtain the three-dimensional spatial coordinates of target objects through disparity calculation, meeting the requirements for extracting geometric parameters of building components. Stereo vision technology has been widely applied in multiple domains. In robotics and autonomous driving, stereo vision is commonly used for environmental perception and depth information acquisition. Real-Moreno et al. [27] achieved rapid computation of target spatial coordinates through left–right image matching in a stereo vision system, while Ulusoy et al. [28] utilized stereo imaging to obtain environmental depth information, providing critical technical support for target detection in autonomous driving and obstacle avoidance decision-making in indoor autonomous vehicles, respectively. In industrial measurement and three-dimensional reconstruction tasks, stereo vision has been employed to achieve high-precision spatial coordinate estimation [29]. Furthermore, in fields such as medical imaging and scene modeling, stereo vision has also demonstrated considerable application potential [30]. In recent years, with the advancement of computer vision technologies, stereo vision has gradually been introduced into research on architectural and indoor scene modeling [31], providing new technical approaches for acquiring spatial structural information.
To address the limitations of existing methods in terms of dataset construction efficiency, model generalization, and spatial information extraction, this paper proposes a floor plan generation method for existing buildings that integrates LLM-driven synthetic data generation, deep learning–based semantic segmentation, and stereo vision–based 3D reconstruction into an end-to-end pipeline. Compared with conventional floor plan reconstruction methods, which typically rely on manual scene modeling, manual annotation, or semi-automatic measurement processes, the proposed method focuses on reducing human intervention throughout the entire workflow. Specifically, natural language–driven scene generation, automated rendering-based annotation, and programmatic stereo reconstruction are integrated into a unified pipeline, enabling large-scale dataset construction and floor plan generation with minimal manual effort. This design significantly improves efficiency and reproducibility while maintaining geometric accuracy. The main contributions of this study are summarized as follows:
We introduce an LLM-driven automated indoor scene generation strategy to construct synthetic datasets. Indoor-scene layouts and component configurations described in natural language are translated into structured modeling instructions, enabling rapid generation of three-dimensional indoor scenes and corresponding rendered images. Compared with traditional manual modeling or parameter-based scene generation approaches, this strategy significantly reduces modeling effort while improving the efficiency and diversity of synthetic data generation.
An automated dataset construction strategy for indoor architectural scenes is implemented through synchronized rendering and annotation. By configuring material nodes and rendering nodes in Blender, rendered images and corresponding material ID maps are generated simultaneously, allowing semantic segmentation labels to be derived automatically. This mechanism eliminates the need for manual labeling of walls, doors, and windows. Furthermore, synthetic images are combined with manually annotated real indoor images at varying ratios to form hybrid datasets, substantially enhancing model robustness and generalization to real-world indoor environments beyond standard segmentation pipelines.
A floor plan generation pipeline integrating semantic segmentation and stereo vision–based spatial reconstruction is developed. After architectural components such as walls, doors, and windows are identified through semantic segmentation, boundary corner points of segmented regions are matched across stereo image pairs. Combined with stereo camera calibration parameters and disparity computation, the three-dimensional spatial coordinates of structural components can be recovered, enabling the automatic generation of architectural floor plans according to their geometric relationships.
3. Indoor Synthetic Image Dataset Generation Method
3.1. LLM–Blender Collaborative Modeling Based on MCP
Traditional manual modeling processes are complex and time-consuming, which to some extent restricts the scalability and scene diversity of synthetic datasets. To support large-scale synthetic dataset generation, this study leverages a collaborative modeling workflow integrating a large language model (LLM) with Blender through the Model Context Protocol (MCP). User-provided natural language descriptions are parsed into structured modeling instructions, which are automatically executed in Blender, enabling rapid indoor scene construction.
Since the instructions generated by a large language model cannot be directly transmitted to three-dimensional modeling software, a standardized communication bridge is required to enable data interoperability between the two systems. MCP provides a standardized communication bridge between the LLM and Blender, defining unified instruction formats and supporting bidirectional feedback for iterative adjustments.
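To make the instruction flow concrete, the sketch below shows one plausible shape for a structured modeling instruction and its translation into Blender Python commands. The JSON fields and the helper function are hypothetical illustrations of the idea, not the actual MCP message schema used in this study:

```python
import json

# Hypothetical structured instruction as the LLM might emit it over MCP
# (field names and units are illustrative assumptions).
EXAMPLE_INSTRUCTION = json.dumps({
    "component": "wall",
    "size": [4.0, 0.2, 2.8],      # width, thickness, height in metres
    "location": [0.0, 2.0, 1.4],  # centre of the component
    "material": "plaster_white",
})

def instruction_to_bpy(instr_json: str) -> str:
    """Translate one structured instruction into an executable bpy snippet."""
    instr = json.loads(instr_json)
    sx, sy, sz = instr["size"]
    x, y, z = instr["location"]
    lines = [
        "import bpy",
        f"bpy.ops.mesh.primitive_cube_add(location=({x}, {y}, {z}))",
        "obj = bpy.context.active_object",
        f"obj.name = '{instr['component']}'",
        # Blender's default cube spans 2 m per side, so halve the target size.
        f"obj.scale = ({sx / 2}, {sy / 2}, {sz / 2})",
        f"obj['material_id'] = '{instr['material']}'",
    ]
    return "\n".join(lines)
```

The generated snippet is what would be forwarded over the MCP bridge and executed inside Blender's Python environment.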
To fully leverage the large language model and the MCP-based communication mechanism, a three-dimensional modeling tool capable of accurately receiving and automatically executing structured modeling instructions is required. Blender [32], with its comprehensive Python API and highly parametric modeling framework, serves as an ideal platform for integration with the large language model and the MCP.
The overall workflow of the proposed method is shown in Figure 2. Indoor scene requirements provided via natural language are parsed by the LLM into structured instructions, including component geometry, layout, and material specifications. These instructions are executed in Blender to create, arrange, and assign materials to components such as walls, doors, and windows. Modeling results can be fed back to the LLM for iterative adjustments if needed.
Compared with traditional modeling approaches, this collaborative modeling method enables indoor scene construction directly through natural language instructions, without requiring users to master professional operations in Blender, significantly improving modeling efficiency and supporting the generation of diversified synthetic datasets.
3.2. Image Rendering and Material ID Map Generation
To improve the efficiency and accuracy of constructing the indoor synthetic dataset, this study synchronously outputs rendered images and material ID maps during rendering, ensuring precisely matched data pairs and reducing manual labeling effort and potential annotation errors.
The different node types available in Blender are summarized in Table 1. To achieve one-to-one correspondence between rendered images and material ID maps, as well as to ensure that structural components can be precisely distinguished in the material ID maps, appropriate configurations of Shader Nodes and Compositing Nodes must be sequentially implemented. As shown in Figure 3, Shader and Compositing Nodes in Blender, combined with Cryptomatte, assign unique IDs to walls, windows, and doors. Cryptomatte nodes automatically extract masks for each component, which are merged via Mix nodes to produce customized annotation colors. The resulting material ID maps are exported along with rendered images, ensuring pixel-level alignment for automated dataset construction.
To overcome the inefficiency of manual rendering, this study develops an automated image acquisition script using Blender’s Python API; its overall workflow is illustrated in Figure 4. The script loads scenes, sets rendering parameters, and programmatically moves the camera across predefined viewpoints. At each position, scene images and their material ID maps are rendered and saved according to strict naming conventions. This workflow significantly improves generation efficiency, reduces human error in data pairing, and lowers the cost of constructing large-scale synthetic datasets.
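The viewpoint sweep and paired naming convention can be sketched in plain Python. The grid spacing, yaw count, and file-name pattern below are illustrative assumptions rather than the authors' actual script:

```python
import math

def viewpoint_grid(room_w, room_d, cam_height=1.6, step=1.5, yaw_count=4):
    """Enumerate (position, yaw) camera viewpoints on a grid inside a room.

    room_w, room_d: room width/depth in metres; at each grid cell the
    camera is rotated through `yaw_count` evenly spaced headings.
    """
    views = []
    nx = max(1, int(room_w // step))
    ny = max(1, int(room_d // step))
    for i in range(nx):
        for j in range(ny):
            pos = ((i + 0.5) * room_w / nx, (j + 0.5) * room_d / ny, cam_height)
            for k in range(yaw_count):
                views.append((pos, 2 * math.pi * k / yaw_count))
    return views

def frame_names(index):
    """Strict naming convention: a render and its material ID map share an index."""
    return f"scene_{index:04d}_rgb.png", f"scene_{index:04d}_id.png"
```

Pairing the two outputs by index is what allows the later annotation stage to match every rendered image with its material ID map automatically.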
3.3. Color Processing of Material ID Maps
During the rendering and output of material ID maps, although dedicated annotation colors are predefined for walls, windows, and doors through mask nodes, the actual displayed colors in the exported material ID maps may exhibit slight deviations in saturation, brightness, or hue. These variations are influenced by factors such as scene illumination intensity and material reflectance properties in Blender. If these color deviations are not properly addressed, they may directly affect the quality of subsequent annotation map generation and potentially lead to incorrect label assignments. Statistical observation indicates that these illumination-induced deviations cause the pixel values of a single category to distribute across a narrow range in the RGB color space, rather than clustering at a single discrete coordinate. Therefore, this study performs color extraction and correction on the actual component colors in the material ID maps to ensure the reliability of the automatically generated annotation maps.
To ensure the accuracy of subsequent semantic labeling, a color extraction and standardization procedure is implemented to process the exported material ID maps. Using the OpenCV and NumPy libraries, the actual RGB characteristics for each component (walls, windows, and doors) are statistically analyzed by calculating the mean pixel values within their respective Cryptomatte-defined masks. Specifically, this process involves computing a representative color centroid for each category to derive precise color references that represent the rendered output. By aligning these extracted values with the corresponding category labels, this standardization procedure ensures a highly consistent mapping between the rendered pixels and their intended semantic categories, thereby effectively eliminating the systematic misclassification that would otherwise occur during the automated annotation stage, as shown in Figure 5.
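The color-centroid computation reduces to a masked mean in NumPy. The function name and array shapes below are illustrative (the actual pipeline also uses OpenCV for image I/O):

```python
import numpy as np

def mask_mean_color(id_map: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Mean RGB (color centroid) over the pixels of one component mask.

    id_map: (H, W, 3) uint8 material ID map; mask: (H, W) boolean array,
    e.g. derived from the component's Cryptomatte channel.
    """
    pixels = id_map[mask].astype(np.float64)  # (N, 3) masked RGB values
    return pixels.mean(axis=0)
```

The centroid returned for each category becomes its calibrated reference color in the class-mapping dictionary used by the annotation stage.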
3.4. Automatic Annotation of the Synthetic Dataset
Following the extraction of characteristic color references, an automated annotation procedure is implemented to convert the multi-channel material ID maps into single-channel grayscale labels. To achieve this, a class-mapping dictionary is first constructed based on the calibrated RGB values of the walls, windows, and doors identified in the previous stage. This dictionary binds the specific color signatures of each component category to a unique class ID, establishing a standardized lookup table for the subsequent batch-processing script.
The core classification logic employs a color-matching function based on Euclidean distance to ensure robust pixel-level assignment. For each pixel in the material ID map, the function computes the spatial distance between its RGB vector and the reference colors stored in the mapping dictionary, identifying the closest category match. To mitigate potential labeling errors caused by rendering inconsistencies, a predefined distance threshold is applied; pixels that exceed this threshold are automatically classified as background (Class ID = 0). The resulting single-channel maps, where each pixel value directly represents a building component category, provide high-quality, structured training data for the semantic segmentation model, as shown in Figure 6.
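A minimal NumPy sketch of this distance-based assignment follows, assuming an RGB ID map and a small set of reference colors; the function name and threshold value are illustrative, not the study's exact parameters:

```python
import numpy as np

def id_map_to_labels(id_map, class_colors, threshold=30.0):
    """Convert an RGB material ID map to a single-channel label map.

    class_colors: dict {class_id: (r, g, b)} of calibrated reference
    colors; any pixel farther than `threshold` (Euclidean distance in
    RGB space) from every reference is labeled background (0).
    """
    refs = np.array(list(class_colors.values()), dtype=np.float32)  # (K, 3)
    ids = np.array(list(class_colors.keys()), dtype=np.uint8)       # (K,)
    # Distance of every pixel to every reference color: (H, W, K).
    diffs = id_map[..., None, :].astype(np.float32) - refs
    dists = np.linalg.norm(diffs, axis=-1)
    labels = ids[dists.argmin(axis=-1)]
    labels[dists.min(axis=-1) > threshold] = 0  # unmatched -> background
    return labels
```

Because the operation is fully vectorized, it scales to batch-processing entire rendered datasets without per-pixel Python loops.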
4. Semantic Segmentation of Indoor Components Based on SegFormer
On the basis of the constructed synthetic indoor dataset, real indoor images are further incorporated to perform semantic segmentation of architectural components. The SegFormer model is employed to train and evaluate datasets with varying proportions of real and synthetic data, in order to investigate the influence of synthetic data participation on segmentation performance. Compared with conventional semantic segmentation models such as FCN, U-Net, and DeepLabV3+, which mainly rely on convolutional operations for feature extraction, SegFormer adopts a Transformer-based encoder that is capable of capturing long-range dependencies and global contextual information. This characteristic is particularly beneficial for indoor scenes, where architectural components such as walls, doors, and windows exhibit strong spatial relationships. In addition, SegFormer utilizes a lightweight MLP decoder, which reduces model complexity while maintaining competitive segmentation performance, making it suitable for the limited-scale dataset used in this study. This section first introduces the SegFormer architecture and discusses its suitability for indoor component segmentation. The construction of the real-image dataset is then described, followed by the specification of the training environment and hyperparameter settings. Finally, quantitative comparisons under different dataset configurations are conducted based on evaluation metrics, providing experimental support for subsequent analysis.
4.1. Network Architecture
Deep learning approaches for image segmentation can generally be categorized into convolutional neural networks (CNNs) and Transformer-based models. In recent years, Transformer-based methods have achieved rapid development and widespread adoption in semantic segmentation tasks. SegFormer [33], a representative Transformer-based network, adopts a hierarchical encoder with a lightweight decoder, which maintains competitive segmentation accuracy while reducing computational overhead. Its multi-level feature extraction captures both local details and global context, effectively delineating walls, doors, and windows in indoor scenes. Based on these strengths, SegFormer is selected as the backbone network for indoor component semantic segmentation in this study (Figure 7).
During training, the input image is first resized to a predefined resolution and processed by the encoder to produce four feature maps at different scales, forming a multi-scale feature representation. These features are then adaptively fused to integrate shallow spatial details with deep semantic information. The fused features are directly passed to the lightweight fully connected decoder to generate pixel-wise semantic segmentation results with the same spatial resolution as the input image, without requiring complex upsampling operations or transposed convolutions.
4.2. Real-World Image Dataset Construction
To build a real-world indoor component dataset for comparative analysis and mixed training, this study collected real images of components such as doors, windows, and walls within actual indoor environments. The images were captured using smartphones, covering various room types and interior styles to ensure diversity in both component categories and spatial contexts.
During the image acquisition process, the shooting height and camera angle were adjusted to reduce the influence of perspective distortion on the geometric appearance of building components. The shooting distance was controlled within approximately 1.5–3 m, ensuring that each image contained one to two categories of target components, with the target regions occupying the primary area of the image. This setup ensured that the boundaries and detailed features of the components were clearly distinguishable.
To enhance the dataset’s coverage of variations in real indoor environments, a diversified shooting strategy was employed, including capturing images from different rooms and multiple locations within the same room, varying the shooting angles of components, and adjusting indoor lighting conditions through a combination of natural and artificial light sources. Whereas the labels for synthetic images were automatically generated through the integrated Blender pipeline, the real images were manually annotated. These measures help improve the model’s adaptability to changes in viewpoint, lighting variations, and scene complexity during subsequent training, with some representative annotation results shown in Figure 8.
4.3. Training Environment and Training Parameters
The model training was conducted on a computer running Windows 10, equipped with an NVIDIA GeForce RTX 3080 GPU with 10 GB of VRAM, using a single-GPU training setup. In terms of software, Python 3.9 served as the core programming language, while PyTorch 2.0.0 was used as the deep learning framework, accompanied by TorchVision 0.15.0 for image preprocessing and feature extraction. Low-level computation acceleration was provided by CUDA 11.8 and cuDNN 8.7.0 to support parallel hardware computation. Additionally, the overall training and validation workflow was built based on the MMEngine 0.10.7 framework, with OpenCV 4.12.0 employed for image reading, format conversion, and data preprocessing.
The dataset was split into training and validation sets at a 9:1 ratio, while the test set consisted of real indoor scene images captured by smartphones. Transfer learning was employed, utilizing pretrained weights from the COCO dataset to optimize model initialization. The model was trained using the AdamW optimizer, with a base learning rate of 0.0001, a weight decay of 0.001, a batch size of 2, and for a total of 50 epochs.
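For reproducibility, the reported optimization settings can be expressed in the MMEngine config style used by the training framework. Field names follow common MMEngine/MMSegmentation conventions; this fragment is an illustration of the listed hyperparameters, not the authors' actual configuration file:

```python
# Optimizer, batch size, and schedule as reported above (MMEngine-style
# config fragment; values mirror the text, structure is illustrative).
optim_wrapper = dict(
    optimizer=dict(type='AdamW', lr=1e-4, weight_decay=1e-3),
)
train_dataloader = dict(batch_size=2)
train_cfg = dict(by_epoch=True, max_epochs=50)
```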
4.4. Analysis of Model Evaluation Metrics
To quantitatively analyze the impact of varying proportions of real and synthetic data on segmentation performance, several key evaluation metrics were employed. These include mean Intersection over Union (mIoU), mean F-score (mFscore), mean precision (mPrecision), and mean recall (mRecall), which collectively reflect the model’s overall accuracy, component detection capability, and prediction reliability. By evaluating these metrics across different dataset configurations, a comprehensive comparative analysis was conducted to assess the effectiveness of the synthetic data integration.
mIoU measures the overall segmentation accuracy of the model across all categories. By calculating the average Intersection over Union (IoU) between the predicted results and the ground-truth annotations, it comprehensively reflects the model’s accuracy in segmenting different component regions and is one of the most commonly used metrics in semantic segmentation tasks:

$$\mathrm{mIoU} = \frac{1}{k}\sum_{i=1}^{k}\frac{TP_i}{TP_i + FP_i + FN_i}$$

mFscore provides a comprehensive evaluation based on precision and recall, using the harmonic mean to balance the two. It reflects the model’s overall ability to recognize target components while maintaining prediction accuracy:

$$\mathrm{mFscore} = \frac{1}{k}\sum_{i=1}^{k}\frac{2\,\mathrm{Precision}_i \cdot \mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i}$$

mPrecision characterizes the reliability of the model’s predictions, reflecting the proportion of pixels predicted as a certain component category that are actually correctly classified. A higher mPrecision value indicates fewer false positive predictions:

$$\mathrm{mPrecision} = \frac{1}{k}\sum_{i=1}^{k}\frac{TP_i}{TP_i + FP_i}$$

mRecall measures the model’s ability to detect target components, reflecting the proportion of ground-truth component pixels that are successfully identified by the model. A higher mRecall value indicates fewer false negative predictions in component regions:

$$\mathrm{mRecall} = \frac{1}{k}\sum_{i=1}^{k}\frac{TP_i}{TP_i + FN_i}$$

In the above formulas, $k$ is the number of component categories; $TP_i$ denotes the number of pixels whose ground-truth class is $i$ and whose predicted class is also $i$; $FP_i$ denotes the number of pixels that are predicted as class $i$ but whose ground-truth class is not $i$; and $FN_i$ denotes the number of pixels whose ground-truth class is $i$ but are predicted as other classes. These metrics evaluate the segmentation performance of the model from different perspectives, providing a comprehensive reflection of its overall performance in indoor building component semantic segmentation tasks. Based on the aforementioned evaluation metrics, comparative experiments were conducted on models trained with different proportions of real and synthetic data.
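All four metrics can be derived from a single pixel-level confusion matrix; the following NumPy sketch (function name illustrative) mirrors the standard definitions:

```python
import numpy as np

def segmentation_metrics(conf):
    """Mean IoU/precision/recall/F-score from a confusion matrix.

    conf[i, j] = number of pixels with ground-truth class i that were
    predicted as class j (classes indexed 0..k-1).
    """
    conf = conf.astype(np.float64)
    tp = np.diag(conf)               # correctly classified pixels per class
    fp = conf.sum(axis=0) - tp       # predicted as class i, ground truth differs
    fn = conf.sum(axis=1) - tp       # ground truth class i, predicted otherwise
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fscore = 2 * precision * recall / (precision + recall)
    return dict(mIoU=iou.mean(), mPrecision=precision.mean(),
                mRecall=recall.mean(), mFscore=fscore.mean())
```

Note that this simple sketch assumes every class appears in both predictions and ground truth; empty classes would require guarding against division by zero.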
In the following, “R” denotes real images and “S” denotes synthetic images. The numbers following R and S indicate the percentage of real and synthetic data in the corresponding training set. The composition of each training set is summarized in Table 2.
Table 3 summarizes the evaluation results of each dataset configuration on the test set, serving to analyze the impact of dataset composition on segmentation performance.
As shown in Table 3, models trained under different real–synthetic data ratio configurations exhibit noticeable differences in prediction performance. These results also reflect the domain gap between synthetic and real indoor images. Overall, as the proportion of real data in the training set increases, the model performance improves progressively. Meanwhile, under several mixed data configurations, the prediction results approach or closely approximate those obtained using exclusively real data, indicating that synthetic data can serve as a complementary training resource rather than a full replacement for real data.
Under the S100 configuration, where only synthetic data are used for training, the model still achieves relatively stable prediction performance. This demonstrates that synthetic data generated through virtual modeling can provide a practical baseline for indoor building component semantic segmentation tasks, although its overall performance is slightly inferior to that of mixed training configurations. This further highlights the necessity of incorporating real data to mitigate potential domain gaps.
When synthetic data are introduced into the real dataset for mixed training, the model performance shows a consistent improvement compared to using synthetic data alone. More importantly, these results indicate that combining synthetic and real data can effectively complement each other, helping to reduce the reliance on large-scale real-world annotations while enhancing generalization to practical indoor environments.
Although the R100 configuration yields the best overall performance, the results of RS5050 and RS8020 demonstrate that using synthetic data in combination with limited real samples allows the model to achieve prediction levels close to those trained with fully real datasets, providing an efficient trade-off between annotation effort and predictive accuracy.
As shown in Figure 9, the variation trends of different dataset configurations across multiple evaluation metrics are consistent with the tabulated results, further confirming that mixed training with synthetic and real data helps maintain stable prediction performance, enhances generalization capability, and alleviates the impact of the domain gap between synthetic and real data.
Comprehensive analysis indicates that in indoor building component semantic segmentation tasks, synthetic data alone can establish a viable baseline, and by appropriately adjusting the proportion of real and synthetic data in mixed training, the model can achieve performance approaching that of fully real-data training. This approach provides a practical balance between predictive performance and the cost of acquiring large-scale real annotations, while also addressing challenges posed by differences in lighting, textures, and scene complexity between synthetic and real environments.
5. Spatial Coordinate Determination of Indoor Components and Floor Plan Reconstruction Based on Stereo Vision
5.1. Stereo Imaging Model and Data Acquisition Method
Through semantic segmentation, the component segmentation results of indoor two-dimensional images are obtained. However, pixel-level segmentation alone cannot directly reflect the spatial positions of components. Therefore, it is necessary to further acquire three-dimensional coordinate information. In this study, a stereo vision approach is adopted for 3D coordinate measurement. Compared with monocular vision, stereo vision can estimate the spatial depth of target objects through disparity information, enabling three-dimensional localization of structural points. This approach is applicable to indoor component geometric information acquisition and floor plan reconstruction.
In a stereo vision system, the acquisition of spatial coordinate information generally requires the establishment of a camera imaging model, followed by camera calibration to determine intrinsic and extrinsic parameters. On this basis, combined with the geometric constraints between the left and right views, the necessary parameters can be provided for subsequent spatial coordinate computation.
In this chapter, the Stereoscopy function in Blender version 4.3.0 is employed to simulate the imaging process of a stereo camera system. The use of stereoscopy enables the simultaneous generation of left and right views while preserving spatial depth information, thereby meeting the requirements for three-dimensional coordinate computation based on disparity. Blender’s stereoscopy toolkit provides multiple stereo imaging modes, including Off-axis, Parallel, and Toe-in configurations. These modes differ in terms of optical axis relationships, disparity generation mechanisms, and geometric characteristics. In this study, the Parallel mode is selected because it produces left and right views with parallel optical axes under a unified camera parameter configuration, effectively simulating a real stereo camera system composed of two identical cameras with parallel optical axes. Under this configuration, the left and right cameras share identical intrinsic parameters, and the projection of the same spatial point in the two images differs only along the horizontal direction, generating purely horizontal disparity. This property facilitates the subsequent computation of three-dimensional coordinates of indoor component corner points.
The implementation procedure is as follows. First, the stereoscopy function is enabled in Blender’s rendering settings, and the stereo imaging mode is set to Parallel in the camera parameters. This configuration allows the automatic generation of corresponding left and right images during the rendering process. To ensure consistency in camera parameter settings and to avoid errors caused by manual adjustment of camera position and orientation, the rendering workflow is automated through scripting. During rendering, the camera rotates around its initial position as the rotation center at fixed angular intervals, and at each rotation angle, a pair of left and right images is synchronously generated. The resulting stereo image pairs are subsequently used for component segmentation and spatial geometric computation, providing the data foundation for three-dimensional coordinate measurement and floor plan reconstruction of indoor components, with the configuration results shown in Figure 10.
5.2. Spatial Coordinate Computation of Indoor Components and Floor Plan Reconstruction Based on Stereo Vision
After completing indoor scene image acquisition and component semantic segmentation, the transformation from two-dimensional images to a building floor plan requires the determination of three-dimensional spatial coordinates of indoor components. Based on the parallel stereo vision imaging model and in combination with the semantic segmentation results, this study computes the spatial coordinates of boundary corner points of components such as doors, windows, and walls. The overall workflow is as follows:
First, the output of the semantic segmentation model is converted into a binary image to distinguish component regions from the background, thereby reducing the influence of noise regions on subsequent computations. On this basis, a contour detection method is employed to extract the boundary contours of components in the image plane, retaining only the outer contours to represent the overall geometric shape of each component. To transform the continuous component contours into discrete feature points suitable for spatial computation, polygonal approximation is applied to the extracted contours. The resulting boundary corner points are regarded as key geometric features of the components and are used for subsequent stereo matching and three-dimensional spatial coordinate calculation. After completing image rectification for the left and right views, a feature matching algorithm is applied to match corner points within the same component category across the stereo image pair, thereby obtaining corresponding point pairs. This strategy effectively reduces the likelihood of mismatches and provides reliable inputs for subsequent three-dimensional coordinate computation, as illustrated in Figure 11.
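The polygonal-approximation step described above is typically performed with a contour simplifier such as the Ramer–Douglas–Peucker algorithm (OpenCV's approxPolyDP implements the same idea). As a minimal illustration of how a dense contour collapses to corner points, here is a pure-Python sketch; the function name `rdp` and the tolerance `eps` are our own choices, not identifiers from the paper:

```python
import numpy as np

def rdp(points, eps):
    """Ramer-Douglas-Peucker: reduce a dense contour to its corner points.

    Points farther than eps from the chord between the two endpoints are
    kept recursively; everything else collapses onto straight segments.
    """
    pts = np.asarray(points, dtype=float)
    if len(pts) < 3:
        return pts.tolist()
    start, end = pts[0], pts[-1]
    dx, dy = end - start
    norm = np.hypot(dx, dy)
    if norm == 0.0:  # closed contour: fall back to distance from the endpoint
        dist = np.hypot(pts[:, 0] - start[0], pts[:, 1] - start[1])
    else:            # perpendicular distance to the chord start-end
        dist = np.abs(dx * (pts[:, 1] - start[1])
                      - dy * (pts[:, 0] - start[0])) / norm
    idx = int(np.argmax(dist))
    if dist[idx] > eps:
        left = rdp(pts[: idx + 1], eps)
        right = rdp(pts[idx:], eps)
        return left[:-1] + right  # drop the duplicated split point
    return [pts[0].tolist(), pts[-1].tolist()]
```

Run on a densely sampled rectangular outline with a small `eps`, the edge samples collapse onto the four corner vertices, which then serve as the discrete feature points for stereo matching.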
At this stage, the matched corner points only contain pixel coordinate information in the image plane and cannot directly represent their positions in real three-dimensional space. Therefore, in order to convert the matched corner points from pixel coordinates to three-dimensional spatial coordinates, a parallel stereo vision model is introduced for 3D coordinate computation. Under this model, the horizontal pixel coordinate difference of a spatial point between the left and right images is defined as the disparity, which is expressed as

d = x1 − x2

where x1 and x2 denote the horizontal pixel coordinates of the spatial point in the left and right images, respectively.
Based on the disparity relationship, the depth of the spatial point in the camera coordinate system can be further calculated using the following formula:

Z = f·B / d

where f is the focal length of the camera and B is the baseline distance between the two cameras. Based on this relationship, the distance from the spatial point to the camera plane can be obtained.
Furthermore, by incorporating the camera intrinsic parameters, the pixel coordinates can be transformed into three-dimensional coordinates in the camera coordinate system. The calculation is performed as follows:

Xc = (u − cx)·Z / fx,  Yc = (v − cy)·Z / fy,  Zc = Z

where (u, v) denotes the point coordinates in the pixel coordinate system, (Xc, Yc, Zc) represents the three-dimensional coordinates of the spatial point in the camera coordinate system, fx and fy denote the effective focal lengths of the camera in the horizontal and vertical directions, respectively, and (cx, cy) represents the principal point coordinates of the camera.
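The two relations above (depth from disparity, then back-projection through the intrinsics) can be sketched as a single function. This is an illustrative sketch, not the paper's code; the parameter names mirror the symbols in the text, with the intrinsics assumed to come from calibration:

```python
import numpy as np

def pixel_to_camera(u, v, disparity, f, B, fx, fy, cx, cy):
    """Back-project a matched pixel into the camera coordinate system.

    Z follows the parallel-stereo relation Z = f*B/d; X and Y then follow
    from the pinhole intrinsics (fx, fy, cx, cy).
    """
    Z = f * B / disparity          # depth from disparity
    X = (u - cx) * Z / fx          # horizontal offset scaled by depth
    Y = (v - cy) * Z / fy          # vertical offset scaled by depth
    return np.array([X, Y, Z])
```

For example, with f = fx = fy = 1000 px, a 0.1 m baseline, and a disparity of 50 px, a pixel 100 columns right of the principal point maps to roughly (0.2, 0, 2.0) m in the camera frame.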
To obtain the spatial coordinates under different viewpoints, it is necessary to further transform the three-dimensional points from the camera coordinate system to the world coordinate system. This process can be accomplished using the camera extrinsic parameters, namely the rotation matrix R and the translation vector t, and the relationship can be expressed as follows:

[Xw, Yw, Zw]^T = R·[Xc, Yc, Zc]^T + t
Since the stereo camera remains at a fixed position during data acquisition and rotates discretely around the Z-axis of the world coordinate system, the component corner points computed from different viewpoints are initially expressed in their respective camera coordinate systems. To unify the multi-view results into a common world coordinate system, the predefined rotation angle specified in the script is used to construct the corresponding rotation matrix for the i-th frame. Combined with the camera position vector in the world coordinate system, the corner point coordinates of the i-th frame are transformed accordingly to achieve coordinate unification. After obtaining the world coordinates of the component boundary corner points, the spatial coordinates are categorized and stored according to component type.
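Because the camera only rotates about the world Z-axis by a scripted angle per frame, the unification step reduces to one planar rotation plus the fixed camera position. A minimal sketch, assuming the per-frame angle `theta_i` and position `t` are the known values from the rendering script:

```python
import numpy as np

def camera_to_world(p_cam, theta_i, t):
    """Transform a camera-frame point into the world frame for frame i.

    R_z(theta_i) is the rotation about the world Z-axis prescribed by the
    rendering script; t is the fixed camera position in world coordinates.
    """
    c, s = np.cos(theta_i), np.sin(theta_i)
    R_z = np.array([[c,  -s,  0.0],
                    [s,   c,  0.0],
                    [0.0, 0.0, 1.0]])
    return R_z @ np.asarray(p_cam, dtype=float) + np.asarray(t, dtype=float)
```

Applying this per frame expresses all corner points from every viewpoint in one shared world frame before they are categorized and stored.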
After obtaining and categorizing the world coordinates of boundary corner points for different component types, this study further investigates methods for generating architectural floor plans based on these discrete spatial points.
The overall workflow of the floor plan generation process is shown in Figure 12. Due to the inherent discreteness and local perturbations of component points reconstructed from stereo vision, directly using these points for floor plan generation may result in discontinuous lines or unstable geometric structures. Therefore, prior to floor plan drawing, it is necessary to perform appropriate planarization processing on the discrete spatial points. Considering that architectural floor plans primarily represent the geometric relationships of components on the horizontal plane, the three-dimensional coordinates of the components are first projected onto the horizontal plane, retaining only their planar positional information. This step effectively reduces the influence of height-direction noise on subsequent geometric analysis.
On this basis, given that walls, doors, and windows typically exhibit approximately linear distributions in the plane, a collinearity tolerance parameter ε is introduced to constrain deviations of point sets along coordinate axes. When the deviation of a point set in a certain direction is smaller than the predefined threshold, the points are regarded as satisfying the geometric consistency condition of the same linear component and are accordingly merged. Points that significantly deviate from this constraint are excluded from subsequent processing, thereby suppressing noise interference.
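A minimal version of this axis-aligned collinearity test might look like the following; the tolerance ε and the choice to snap merged points onto the mean coordinate are modeling choices of this sketch, not details prescribed by the text:

```python
import numpy as np

def merge_collinear(points, eps):
    """Classify a projected point set as one axis-aligned wall line.

    If the spread along one axis is below eps, the set is merged onto a
    single line at the mean coordinate; otherwise it is rejected as noise.
    """
    pts = np.asarray(points, dtype=float)
    spread = pts.max(axis=0) - pts.min(axis=0)  # extent along x and y
    if spread[1] <= eps:                        # nearly constant y
        return ("horizontal", float(pts[:, 1].mean()))
    if spread[0] <= eps:                        # nearly constant x
        return ("vertical", float(pts[:, 0].mean()))
    return None                                 # violates the constraint
```

Point sets that return None are excluded from further processing, which is how noise points are filtered out before line drawing.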
During the component point parsing stage, a coordinate file with category labels is adopted as an intermediate data format to uniformly store and retrieve points corresponding to walls, doors, and windows. The point sets are then organized according to their respective categories, providing a foundation for subsequent geometric modeling. For door and window components, considering their geometric characteristics as linear openings within walls in architectural floor plans, the corresponding door and window line segments are directly generated from adjacent point pairs to reconstruct their positional relationships in the planar representation.
As the primary structural elements in a floor plan, wall point sets generally exhibit strong directional consistency in the planar space. Based on the distribution characteristics of wall points, they are classified into two categories: approximately horizontal and approximately vertical. By analyzing the value ranges along the principal directions, continuous wall line segments are reconstructed, thereby transforming discrete points into structured wall lines. Considering that doors and windows interrupt the continuity of walls, an interval-based trimming strategy is introduced during the wall drawing stage. According to the projected positions of doors and windows on the wall lines, local segments of the walls are trimmed to prevent wall lines from overlapping door and window openings. After completing the reconstruction of component line segments, walls and doors/windows are drawn within a unified two-dimensional planar coordinate system. Dimension annotations are then generated based on the geometric lengths of the line segments, ultimately producing the final floor plan output.
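The interval-based trimming strategy can be illustrated in one dimension along a wall's principal axis: each door or window projects to an interval that is subtracted from the wall's extent. The function name `trim_wall` and the interval representation are our own sketch, assuming openings are given as (start, end) pairs along the wall axis:

```python
def trim_wall(wall_span, openings):
    """Subtract door/window intervals from a wall's 1-D extent,
    returning the solid wall segments that remain."""
    segments = [wall_span]
    for a, b in openings:
        next_segments = []
        for s, e in segments:
            if b <= s or a >= e:          # opening outside this segment
                next_segments.append((s, e))
                continue
            if a > s:                      # keep wall left of the opening
                next_segments.append((s, a))
            if b < e:                      # keep wall right of the opening
                next_segments.append((b, e))
        segments = next_segments
    return segments
```

For instance, a wall spanning 0–10 m with openings at 2–3 m and 6–8 m yields the solid segments (0, 2), (3, 6), and (8, 10), so the drawn wall lines never overlap the door and window openings.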
5.3. Engineering Validation
To verify the feasibility of the proposed stereo camera-based method for determining the spatial coordinates of indoor components and generating architectural floor plans in real-world scenarios, a classroom located in the Civil Engineering Building of a university was selected as the experimental object for engineering validation. The classroom environment contains typical architectural components such as walls, doors, and windows, which are commonly present in most indoor spaces. Therefore, it provides a representative environment for validating the effectiveness of the proposed method in practical indoor scenarios.
The overall procedure of the real-scene validation experiment is shown in Figure 13. The experiment first involved calibrating the stereo camera used in the study to obtain the intrinsic parameters, distortion coefficients of the left and right cameras, as well as the relative pose between them. The calibration results were employed to establish an accurate imaging model, providing geometric constraints for the subsequent three-dimensional reconstruction of component corner points. The calibration accuracy was evaluated based on the estimated parameter uncertainties, and the relatively small deviations of both intrinsic and extrinsic parameters indicate that the calibration is stable and sufficiently accurate for reliable stereo matching and 3D reconstruction. After calibration, the stereo camera was positioned at an appropriate location inside the classroom to conduct multi-view image acquisition of the indoor environment. Paired left and right images were captured to ensure that major components such as walls, doors, and windows were effectively covered. For the acquired stereo image pairs, the previously trained model was applied to perform semantic segmentation of indoor components, obtaining segmentation results for three categories: walls, doors, and windows. Based on these results, further processing was conducted on the segmented images, including binarization, component contour extraction, and corner detection, in order to extract the key feature points required for subsequent spatial computation.
After obtaining the corresponding component corner points in the left and right images, the stereo vision geometry was applied to establish point correspondences and estimate disparity values. Prior to disparity computation, the stereo image pairs were rectified using precomputed mapping functions to ensure epipolar alignment, so that corresponding points lie on the same horizontal lines. Disparity estimation was then performed using a block matching–based stereo algorithm implemented in OpenCV. Both left and right disparity maps were computed to improve matching robustness, which helps mitigate errors caused by occlusions and mismatches through a left–right consistency constraint. To further improve computational efficiency, a multi-scale strategy was optionally adopted by performing disparity estimation on downsampled images and then resizing the results back to the original resolution. The stereo camera calibration parameters were used to compute the three-dimensional coordinates of the corner points in the camera coordinate system. Considering that the camera rotated around its own position during image acquisition, the known camera pose information was further utilized to transform the three-dimensional points computed from different viewpoints into a unified world coordinate system. This allowed the corner points of components within the classroom to be represented in a consistent spatial reference frame. Subsequently, the corner point coordinates were organized according to component categories and stored in a text file with category labels, facilitating retrieval and processing in the subsequent floor plan generation stage.
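The left–right consistency constraint mentioned above can be expressed compactly: a left-image disparity is accepted only if the right map, sampled at the disparity-shifted column, agrees within a tolerance. In practice the two maps come from a block matcher such as OpenCV's StereoBM; the NumPy sketch below and its tolerance value are our own illustration:

```python
import numpy as np

def lr_consistency_mask(disp_left, disp_right, tol=1.0):
    """Flag pixels whose left disparity is confirmed by the right map.

    For rectified pairs, a left pixel (x, y) with disparity d_L should map
    to column x - d_L in the right image, where d_R should match d_L.
    """
    h, w = disp_left.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # column in the right image implied by the left disparity, clipped to bounds
    x_r = np.clip((xs - disp_left).round().astype(int), 0, w - 1)
    return np.abs(disp_left - disp_right[ys, x_r]) <= tol
```

Pixels where the mask is False are treated as occlusions or mismatches and excluded before triangulation, which is the error-mitigation role the consistency check plays in the workflow.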
In the floor plan generation stage, the component point coordinates were first planarized and noise points were removed, retaining only their geometric distribution in the horizontal plane. Next, based on the distribution characteristics of the component points, walls and doors/windows were processed separately: wall points were used for line fitting and wall line reconstruction according to the directional consistency of the point sets, while door and window line segments were generated individually as linear openings within the walls. When constructing the walls, local trimming of wall lines was performed according to the projected positions of doors and windows to prevent wall lines from overlapping the openings. Finally, the complete wall and door/window structures were drawn within a unified two-dimensional planar coordinate system, and dimension annotations were generated based on the geometric lengths of the line segments, producing the final architectural floor plan.
As shown in Table 4, the relative errors of all architectural components are within 6%, with most values concentrated around 2–4%. These results demonstrate that the proposed method can achieve reliable geometric estimation for indoor scenes. In particular, smaller elements such as doors and windows exhibit relatively higher accuracy, while wall structures show slightly larger deviations due to error accumulation over longer distances. Nevertheless, the overall accuracy is sufficient for indoor floor plan generation tasks, where capturing the spatial layout is more critical than achieving high-precision measurements.
Through the above engineering validation experiment, the proposed method was able to complete the entire process from stereo image acquisition to architectural floor plan generation in a real indoor environment, demonstrating its feasibility for practical application. It should be noted that the current validation was conducted in a single classroom environment with relatively regular geometric structures. This type of environment is representative of many typical indoor spaces where architectural components such as walls, doors, and windows form predominantly linear boundaries. The floor plan generation stage in this study relies on planarization, collinearity analysis, and line fitting of component points, which are well suited for indoor environments characterized by straight wall structures. Therefore, the proposed method is primarily applicable to buildings with relatively regular geometric layouts, such as classrooms, offices, and residential rooms. For indoor spaces containing more complex geometries, such as curved walls or highly irregular layouts, additional geometric modeling strategies may be required. Future work will further investigate the applicability of the proposed framework in more complex indoor environments, including multi-room spaces and buildings with more diverse architectural configurations.
6. Discussion and Conclusions
6.1. Implementation Summary and Comparative Analysis
This study proposes a method for generating indoor architectural floor plans by combining synthetic dataset construction, semantic segmentation, and stereo vision measurement in an end-to-end workflow. The complete workflow—from image acquisition to floor plan generation—was validated in a real indoor environment. While this experimental validation was conducted in a single classroom, the results demonstrate the practical feasibility of the proposed approach for simple indoor layouts. Focusing on the overall implementation and performance of this method, the study conducted the following research tasks and experimental analyses:
In terms of synthetic dataset construction, this study implemented a collaborative modeling and rendering workflow between a large language model and Blender based on the MCP framework to generate indoor virtual scenes with well-defined structures and controllable component types. During the rendering process, rendered images and material ID maps were output simultaneously, and semantic segmentation annotations were automatically generated based on the material ID maps, thereby constructing a synthetic semantic segmentation dataset that complements real data rather than directly replacing it. The dataset contains component categories such as doors, windows, and walls, providing data sources for subsequent model training and validation. At the same time, by combining manually annotated real indoor images, synthetic and real data were mixed in different proportions to form multiple training data configurations, allowing evaluation of the effects of real–synthetic data ratios and providing insights into mitigating domain discrepancies;
In terms of component recognition, this study employed the SegFormer semantic segmentation model to perform component-level segmentation on the collected or generated indoor images, obtaining pixel-level regions for components such as doors, windows, and walls. The segmentation results were able to reflect the spatial distribution and boundary shapes of the components relatively completely, providing the necessary basis for subsequent geometric processing such as contour extraction and corner detection;
In terms of spatial coordinate computation and floor plan generation, this study processed the component contours obtained from the segmentation results based on a parallel stereo vision imaging model. Through corner point extraction and stereo matching, the image coordinates were converted into corresponding three-dimensional spatial coordinates. On this basis, the component corner point coordinates were categorized and organized, noise points were removed, and line fitting was performed, ultimately completing the generation of indoor architectural floor plans. It should be noted that the current planarization and linear fitting approach is best suited for relatively simple room geometries.
In Table 5, we compared the proposed end-to-end workflow with two existing categories of methods: monocular depth-based and point cloud-based methods [34,35]. While monocular depth estimation is cost-efficient, it suffers from scale ambiguity, which affects the generation of floor plans with accurate physical dimensions. Conversely, point cloud-based methods provide geometric precision but involve higher hardware costs and manual effort for semantic labeling.
As summarized in Table 5, monocular depth-based methods rely on single RGB images and therefore offer low hardware cost, but they typically suffer from scale ambiguity, which can affect the accuracy of floor plan generation in real-world measurement scenarios. In contrast, point cloud-based methods utilize dense three-dimensional data to achieve high geometric precision and reliable physical scale, yet they usually require specialized sensors and substantial manual effort for semantic labeling.
Compared with these two categories of methods, the proposed workflow adopts stereo images as the primary data modality, enabling the establishment of a calibrated physical scale while preserving relatively low hardware requirements. Unlike approaches that rely solely on manual annotation, this workflow incorporates a hybrid training strategy. By augmenting real-world data with synthetic samples, the manual labeling workload is reduced without relying exclusively on synthetic environments. This integration provides an alternative for architectural surveying and floor plan reconstruction.
6.2. Limitations and Future Work
The proposed method has practical significance in reducing manual surveying and hand-crafted modeling efforts. By integrating semantic segmentation with stereo vision–based spatial reconstruction, the study establishes a complete pipeline from indoor image acquisition to automatic floor plan generation, providing a practical workflow for extracting architectural information from indoor scenes. However, certain limitations remain. First, the synthetic data differ from real indoor scenes in terms of lighting variations and texture complexity, and the model’s generalization ability in complex real-world environments still has room for improvement. Although the hybrid training strategy combining synthetic and real images helps alleviate this issue to some extent, further improvements in dataset diversity may enhance model robustness under more challenging indoor conditions. Second, the stereo-based spatial coordinate computation is sensitive to calibration accuracy and corner point matching; when component boundaries are blurred or severely occluded, local geometric accuracy in the floor plan may be affected. Third, the floor plan generation stage relies on planarization, collinearity analysis, and line fitting of component points, which are particularly suitable for indoor environments dominated by straight wall structures. As a result, the current method is mainly applicable to typical architectural spaces with relatively regular geometries, such as classrooms, offices, and residential rooms. For indoor environments containing curved walls or highly irregular layouts, additional geometric modeling strategies may be required to achieve more accurate reconstruction.
Finally, this study conducted experiments in a single classroom environment, and the method has not yet been validated for entire buildings or multi-room scenarios. Although the classroom contains typical architectural components such as walls, doors, and windows and thus provides a representative test environment, further validation in more complex indoor layouts is still necessary. Future work will focus on expanding experimental scenarios to more diverse indoor environments, including multi-room buildings and spaces with more complex architectural geometries, while further improving the robustness and accuracy of the overall workflow. In addition, the framework could be extended to handle additional structural elements, such as columns and stairs, to broaden its applicability in renovation and architectural analysis scenarios.