Article

Approach to Enhancing Panoramic Segmentation in Indoor Construction Sites Based on a Perspective Image Segmentation Foundation Model

1 Department of Architecture and Architectural Engineering, Yonsei University, Seoul 03722, Republic of Korea
2 Architectural Engineering Program, School of Architecture, Seoul National University of Science and Technology, Seoul 01811, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(9), 4875; https://doi.org/10.3390/app15094875
Submission received: 27 March 2025 / Revised: 25 April 2025 / Accepted: 25 April 2025 / Published: 27 April 2025

Abstract

Panoramic images in indoor construction sites are gaining attention as valuable tools for process monitoring and quality assessment. However, despite the environmental complexity and the demand for high segmentation performance in indoor construction environments, the scarcity of specialized segmentation models and datasets has created a gap between technological advancements and practical application, thus hindering the effective utilization of panoramic images. To address these challenges, this study proposes a novel approach leveraging the Segment Anything Model (SAM), a perspective image segmentation foundation model, to enhance the performance of existing segmentation models. The proposed method iteratively executes SAM with adjusted input parameters to extract objects of varying sizes and subsequently applies filtering algorithms to retain valid objects. Then, label assignment and merging processes are performed based on the predictions from the target model to improve segmentation accuracy. The experimental study was conducted using Panoplane360, a model specifically designed for plane segmentation, as the target model. A quantitative evaluation was conducted to measure the exactness of label assignment, and two qualitative evaluations were performed to assess whether the assigned labels accurately represent the actual planar information. The evaluation results confirmed that the proposed method significantly improves segmentation performance compared to conventional approaches. The findings of this study highlight the potential of SAM-based methods to enhance segmentation accuracy in dynamic indoor construction environments. Furthermore, the proposed approach provides practical advantages, as it improves segmentation performance without requiring the construction of additional datasets. Future research will focus on resolving computational efficiency issues resulting from iterative SAM execution and will extend the applicability of the proposed approach to diverse segmentation tasks and models.

1. Introduction

Panoramic imaging offers a promising solution for comprehensive visual monitoring in indoor construction environments. This study aims to evaluate and improve segmentation model performance for such imagery to support practical tasks in construction site management.
In indoor construction sites, panoramic images are gaining attention as a valuable visual tool for process and quality management, as they can comprehensively capture the entire environment in a single shot [1]. With recent advancements in deep learning, segmentation models have been integrated with panoramic images to enable automated indoor construction monitoring [2,3,4] and dimensional quality assessment (DQA) [5], thereby supporting site management and accelerating the digital transformation of the construction industry. However, unlike outdoor construction sites where large-scale objects can be relatively easily captured and analyzed in open spaces, indoor construction sites are structurally complex, spatially constrained, and require the precise identification of small-scale objects under limited visibility conditions [2,6]. In particular, various environmental factors in indoor construction sites, such as uneven lighting due to artificial illumination, occlusions from obstacles (workers, machinery, etc.), and complex backgrounds with unfinished interiors, make precise segmentation challenging, requiring careful consideration during data acquisition [6,7,8]. Moreover, segmentation models specifically tailored for indoor construction environments remain scarce, and publicly available datasets are highly limited [2]. These limitations create a gap between technological advancement and practical application, thereby impeding the effective deployment of panoramic segmentation in construction site contexts.
To achieve effective panoramic image segmentation, a large-scale dataset for model training must be constructed. However, constructing datasets for indoor construction sites is a labor-intensive and time-consuming process [9]. To address these issues, research has explored synthetic dataset generation using building information modeling (BIM) [10,11], game engines [12], and generative AI [13]. These approaches allow the generation of high-quality synthetic data that can be structured into various formats, such as panoramic images. However, generating synthetic datasets requires additional modeling processes, which involve considerable manual labor [10,11,12]. Furthermore, synthetic datasets may not fully capture the detailed characteristics of real-world objects, and if the original dataset contains biases, those biases may be amplified in the generated synthetic data [13,14].
Meanwhile, the Segment Anything Model (SAM) [15], a representative foundation model for image segmentation, has garnered considerable attention from researchers due to its strong generalization performance, even when applied to construction-related data. For example, Wang et al. proposed Omni-Scan2BIM, which integrates SAM and DINOv2 to automate the scan-to-BIM process in mechanical, electrical, and plumbing (MEP) construction sites [16]. Teng et al. utilized SAM to develop a segmentation framework for detecting both land and underwater cracks in bridges [17], while Peng et al. combined SAM with the RANSAC algorithm to address quality degradation issues in 3D reconstruction of indoor buildings, particularly in texture-deficient areas [18]. These studies have demonstrated that SAM can be effectively applied even in complex construction environments, yielding higher-quality segmentation results compared to conventional methods, without the need to construct additional datasets.
In response to these challenges, this study proposes a novel SAM-based approach to improve panoramic image segmentation in indoor construction environments, addressing challenges posed by dataset limitations and environmental complexities. The proposed method is designed to extract segmentation masks for all objects in panoramic images by leveraging the iterative execution of SAM and post-processing algorithms. This design allows for the accurate detection of both large and small objects, even under the challenging environmental conditions of indoor construction sites. Subsequently, a label assignment algorithm is developed to assign labels predicted by a target model to the detected object masks. For evaluation, plane segmentation was selected as the target task among various indoor construction segmentation applications, and Panoplane360 [19] was adopted as the target model. The proposed method was applied to the target model, and both quantitative and qualitative comparisons were conducted to verify improvements in segmentation quality. The purpose of this study is to explore and enhance object segmentation techniques using panoramic images in indoor construction environments. Accurate segmentation of construction elements (e.g., walls, columns, and MEP components) can assist in various applications, such as progress monitoring, safety inspections, and spatial planning. This is particularly useful during the interior finishing and quality management stages of building construction.
The contributions of the proposed method are as follows:
(1) This study improves panoramic segmentation performance in complex indoor construction sites by leveraging SAM, a foundation model for perspective image segmentation.
(2) To enable accurate object extraction in indoor construction images, filtering algorithms were developed for iterative SAM execution, along with a labeling algorithm that assigns semantic labels to objects based on predictions from a target model.
(3) The method is validated through quantitative and qualitative evaluations on real construction site datasets, confirming its applicability and practical value in real environments.
This paper is structured as follows: Section 2 discusses panoramic image segmentation, and the challenges associated with its application in indoor construction environments. It then examines plane segmentation, its significance, and the limitations of Panoplane360, followed by an introduction to SAM and its automatic mask generation model. Section 3 details the proposed methodology, dividing it into the object extraction stage and the label extraction and labeling stage, along with a description of the developed algorithms at each stage. Section 4 describes the experimental setup, including the dataset and evaluation methods, and presents the experimental results and analysis. Finally, Section 5 summarizes the study, discusses the significance of the findings, and addresses research limitations and future directions.

2. Literature Review

2.1. Panoramic Image Segmentation in Indoor Construction Site

Panoramic images are lightweight compared to other data formats, providing efficient storage and processing capabilities while offering 360-degree environmental coverage. This characteristic eliminates blind spots and allows for comprehensive spatial analysis. Additionally, they can capture the entire space in a single shot, offering a more efficient and accurate means of environmental analysis compared to traditional perspective images with a limited field of view [1,2]. Due to these advantages, panoramic image segmentation has been widely utilized in such applications as autonomous driving, machine vision inspection, and medical imaging, and has recently been extended to the construction domain as well [20,21]. However, applying panoramic image segmentation to indoor construction sites presents considerable challenges, as models must be capable of understanding the dynamic and complex characteristics of these environments.
Indoor construction environments exhibit distinct characteristics compared to outdoor settings. In outdoor construction sites, large-scale structures can be reliably captured across wide areas using drones or fixed-position cameras under uniform natural lighting and relatively simple background conditions that facilitate the application of segmentation models. In contrast, indoor construction sites are spatially constrained and highly complex and are frequently subject to visual noise caused by uneven brightness from artificial lighting, the movement of workers and materials during ongoing construction, and unfinished surfaces and finishes. These environmental conditions pose inherent limitations on the ability of segmentation models to accurately identify object boundaries [6,7,8]. Moreover, existing panoramic segmentation models have predominantly been trained on static, fully completed indoor spaces, such as residential units and offices [22,23,24]. As a result, applying these pre-trained models to dynamic and physically complex indoor construction sites without domain-specific knowledge injection often leads to a decline in segmentation performance.
To address these challenges, additional training using domain-specific datasets is necessary. However, constructing such datasets is a labor-intensive and time-consuming process [7]. This study focuses on addressing this challenge by developing a method that enables existing segmentation models to accurately segment objects in panoramic images of indoor construction sites without requiring additional dataset construction.

2.2. Plane Segmentation with Panoplane360

Plane segmentation focuses on identifying and representing major geometric structures within an environment, such as walls, floors, and ceilings [25,26,27,28]. Panoramic image-based plane segmentation is widely utilized in various indoor construction site applications, including room layout estimation [29], construction progress monitoring and site management [30], and site layout planning [31]. However, due to the structural characteristics of panoramic images—such as a wide field of view (FoV), projection distortions, and non-centralized object distributions—specialized methods are required to process them effectively [32].
Panoplane360, a CNN-based deep learning model developed by Sun et al. in 2021 [19], was designed specifically for plane segmentation in panoramic images. The model adopts a divide-and-conquer approach to group pixels based on planar orientation. It then applies pixel embedding clustering to identify individual plane instances and predicts geometric parameters to generate labeled masks. Additionally, it incorporates a yaw-invariant V-plane reparameterization technique, allowing the model to infer vertical plane orientations without prior knowledge of the 360° camera’s yaw rotation. By leveraging these techniques, Panoplane360 effectively extracts vertical and horizontal plane masks from panoramic images.
However, as Panoplane360 is trained on static and fully constructed indoor environments, it lacks an understanding of construction site conditions, leading to reduced segmentation quality when applied to such environments. Recognizing the need for precise segmentation in indoor construction sites, this study selects plane segmentation as the target task and adopts Panoplane360 as the target model for evaluating the effectiveness of the proposed method.

2.3. Segment Anything Model

The Segment Anything Model (SAM) is a foundation model for image segmentation developed by Meta AI in 2023, recognized for its high generalization capability across diverse segmentation tasks [15]. SAM has two key strengths. First, it incorporates promptable segmentation, allowing it to generate multiple valid segmentation masks from a single input prompt (e.g., point, box, or text) to reduce ambiguity. Second, it has been trained on SA-1B, a large-scale dataset that enables the model to learn extensive object-centric spatial distributions.
The promptable segmentation feature of SAM resolves mask ambiguity by predicting multiple segmentation masks with confidence scores for each input prompt. By default, SAM generates three mask predictions (full, partial, and sub-part) per input, allowing for precise object differentiation. Additionally, SA-1B, which consists of 1.1 billion masks, surpasses existing large-scale datasets, such as COCO and Open Images V5, by incorporating a broader range of objects, including those near image edges. Consequently, SAM can effectively segment objects positioned not only in central regions but also along image boundaries.
Due to these capabilities, SAM has the potential to function effectively in complex indoor construction environments, where object positions and appearances vary significantly. This study leverages SAM’s superior generalization capability to enhance segmentation performance without requiring additional dataset construction or retraining.
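For illustration, the following is a minimal sketch of SAM’s promptable interface using the publicly released segment-anything package; the checkpoint file name, image path, and prompt coordinates are placeholder assumptions rather than values from this study.

import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load a pre-trained SAM backbone (checkpoint path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Embed the image once, then query it with prompts.
image = cv2.cvtColor(cv2.imread("site_panorama.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single foreground point prompt; multimask_output=True returns the three
# candidate masks (whole, part, sub-part) together with confidence scores.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[512, 256]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]  # keep the highest-scoring candidate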

2.4. Automatic Mask Generation Model

The automatic mask generation model is a specialized variant of SAM designed for fully automated mask generation, prioritizing segmentation quality over inference speed. This model operates in three stages. First, during the mask generation stage, a point grid is created by uniformly distributing points across the input image. If n points are placed along each side, the full grid contains n² points. The image is then cropped based on user-defined scales, and masks are predicted in the enlarged regions. To refine these predictions, non-maximum suppression (NMS) is applied, eliminating overlapping masks and prioritizing those with higher intersection over union (IoU) scores. Second, in the filtering stage, masks exceeding a predefined IoU threshold are retained, while unstable masks or masks covering the entire image are removed. Finally, in the post-processing stage, connected components with an area smaller than 100 pixels are removed, and holes within masks smaller than 100 pixels are filled to enhance segmentation consistency.
In this study, the automatic mask generation model was iteratively executed to detect objects in panoramic images of indoor construction sites. During each iteration, the point grid generation and threshold parameters were adjusted, and multiple filtering algorithms were developed to refine the predicted masks. For consistency, this study refers to the automatic mask generation model simply as “SAM” throughout the paper.
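The stages above correspond to constructor arguments of SamAutomaticMaskGenerator in the segment-anything package; the configuration below is a sketch with the checkpoint path and image file assumed, not the exact setup used in the experiments.

import cv2
import torch
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

device = "cuda" if torch.cuda.is_available() else "cpu"
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(device)

mask_generator = SamAutomaticMaskGenerator(
    model=sam,
    points_per_side=32,           # n points per side -> an n x n prompt grid
    pred_iou_thresh=0.85,         # IoU-based filtering threshold
    stability_score_thresh=0.97,  # stability-based filtering threshold
    min_mask_region_area=100,     # remove/fill regions smaller than 100 pixels
)

image = cv2.cvtColor(cv2.imread("site_panorama.jpg"), cv2.COLOR_BGR2RGB)  # placeholder path
masks = mask_generator.generate(image)
# Each entry is a dict with "segmentation", "predicted_iou", "stability_score", etc.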

3. SAM-Based Approach to Improve Panoramic Image Segmentation in Indoor Construction Environments

This section details the proposed method for panoramic image segmentation. As summarized in Figure 1, the method consists of two major stages. The first stage is object extraction, where SAM is iteratively executed to extract all objects from panoramic images of indoor construction sites. However, a simple iterative execution of SAM results in duplicated detections and overlapping objects. To address these issues, filtering algorithms that incorporate three additional techniques were developed. The second stage is label extraction and labeling, in which a labeling algorithm was developed to assign the vertical and horizontal plane label masks extracted from the Panoplane360 to the object masks obtained from SAM. The labeling algorithm overlays the label masks from Panoplane360 onto the object masks extracted by SAM, assigning the most dominant label to each object mask. Additionally, objects with the same assigned label are merged into a single mask to ensure a one-to-one correspondence between objects and labels.

3.1. Object Extraction

Applying SAM to panoramic images of indoor construction sites allows for the extraction of valid object masks based on input prompts. However, since the model operates on a single image at a time, it must be configured to run iteratively to extract all objects within the image. To achieve this, we configured SAM to be executed repeatedly on a single image and set a termination condition based on the background proportion. Specifically, after each execution of SAM, the detected object masks were stored and combined to generate a mask representing the detected regions (foreground). Subsequently, the accumulated detected regions were integrated into a single mask, which was then used to calculate the proportion of undetected regions (background). Finally, SAM was continuously executed until the background proportion was reduced to less than 2% of the original image, ensuring the continuous detection of objects. This sequence of processes is illustrated in Figure 2.
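A condensed sketch of this loop is shown below; run_sam is a hypothetical helper that wraps the automatic mask generator (masking the input image and updating the point labels, as described in the following subsections) and returns boolean object masks for the current iteration.

import numpy as np

def extract_all_objects(image, run_sam, max_iters=20, bg_target=0.02):
    """Run SAM repeatedly until less than 2% of the image remains undetected."""
    h, w = image.shape[:2]
    cumulative_fg = np.zeros((h, w), dtype=bool)
    all_masks = []

    for _ in range(max_iters):
        # run_sam is assumed to handle input-image masking and prompt updates.
        masks = run_sam(image, cumulative_fg)
        all_masks.extend(masks)

        for m in masks:                        # accumulate detected regions
            cumulative_fg |= m

        bg_ratio = 1.0 - cumulative_fg.mean()  # proportion of undetected pixels
        if bg_ratio < bg_target:               # terminate once background < 2%
            break

    return all_masks, cumulative_fg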
However, a simple iterative execution of SAM resulted in duplicated detections and overlapping objects, necessitating additional processing. To address these issues, this study developed three techniques, namely (1) the adjustment of input parameters, (2) updating of input images and prompt labels, and (3) handling of duplicate and overlapping objects. These techniques enable SAM to iteratively detect new objects without redundant detections.

3.1.1. Adjustment of Input Parameters

As previously explained, SAM enables fully automated mask generation through point grid sampling, threshold-based filtering, and post-processing. These parameters can be user-defined, and since they involve a trade-off between mask quality and quantity, their optimal configuration is crucial.
Table 1 summarizes the initial input parameters used in this study. The points_per_side parameter determines the number of points per side for grid generation, where a higher value results in finer segmentation but increases computational load. stability_score_thresh and pred_iou_thresh are thresholds between 0 and 1, representing mask stability and intersection over union (IoU), respectively. Higher values generate higher-quality masks but reduce the total number of detected masks. min_mask_region_area defines the threshold for post-processing, removing disconnected components and filling small holes within masks below the threshold size. To ensure the detection of new objects in each iteration, the points_per_side value was increased by 4 per iteration. Additionally, the initial value of stability_score_thresh was set to 0.97 and decreased by 0.002 per iteration. If no new objects were detected and the background proportion remained unchanged, the pred_iou_thresh value was reduced by 0.01 from its initial value of 0.85.
Figure 3 illustrates the difference in the number and quality of detected objects when the stability_score_thresh is set high (0.97) versus low (0.85). When a high threshold (0.97) was applied, only 31 objects were detected, but they maintained overall high quality. In contrast, when a low threshold (0.85) was applied, 135 objects were detected. However, many masks exhibited overlapping regions or lower quality. In this study, we first set a high threshold to prioritize the detection of high-quality segmentation masks. Then, we gradually lowered the threshold to detect additional objects that were not captured in the previous stages.
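This coarse-to-fine schedule can be written as a small helper; the function below is a sketch based on the stated increments and the initial values in Table 1, not the authors’ exact implementation.

def schedule_params(iteration, stagnant_iters=0):
    """Parameter values for a given SAM pass (0-based iteration index).

    points_per_side grows by 4 and stability_score_thresh drops by 0.002 each
    iteration; pred_iou_thresh drops by 0.01 only for iterations in which no
    new objects were detected (counted by stagnant_iters).
    """
    return {
        "points_per_side": 32 + 4 * iteration,
        "stability_score_thresh": 0.97 - 0.002 * iteration,
        "pred_iou_thresh": 0.85 - 0.01 * stagnant_iters,
        "min_mask_region_area": 100,
    }

# Example: parameters for the third pass after one stagnant iteration.
params = schedule_params(iteration=2, stagnant_iters=1)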

3.1.2. Updating Input Images and Point Labels

Despite parameter adjustments, our experimental results indicated that iterating SAM on the same input image increased the likelihood of redundant object detections. Since duplicate detections lead to unnecessary computational overhead, updating the input image and prompt labels for each iteration was essential. Following Algorithm 1, the input image was updated by masking the background predicted in the previous iteration (i.e., multiplying the original image by the inverted cumulative foreground mask).
Algorithm 1: Iterative input image update with cumulative mask
Input: Input image I of size H × W, Cumulative mask Mcusum
Output: Next input image Inext
 1: Generate object masks: M ← SamAutomaticMaskGenerator(I)
 2: Sort masks by stability score: M ← Sort(M, stability score)
 3: Initialize sum mask Msum ← 0^(H×W)
 4: for each mask Mseg ∈ M do
 5:  Update current mask: Msum ← Msum + Mseg
 6: end for
 7: Update cumulative mask: Mcusum ← Mcusum + Msum
 8: Update input image by masking detected areas: Inext ← I × (1 − BinaryMask(Mcusum))
 9: Return Inext
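For reference, Algorithm 1 maps onto a few lines of NumPy. The sketch below assumes the mask dictionary format returned by SamAutomaticMaskGenerator (keys "segmentation" and "stability_score") and is illustrative rather than the authors’ implementation.

import numpy as np

def update_input_image(image, masks, cumulative_mask):
    """Accumulate this iteration's masks and black out the detected regions.

    image:           H x W x 3 array, the current SAM input
    masks:           list of dicts from SamAutomaticMaskGenerator.generate()
    cumulative_mask: H x W integer array of previously detected regions (Mcusum)
    """
    masks = sorted(masks, key=lambda m: m["stability_score"], reverse=True)

    sum_mask = np.zeros(image.shape[:2], dtype=np.uint16)
    for m in masks:
        sum_mask += m["segmentation"].astype(np.uint16)    # Msum <- Msum + Mseg

    cumulative_mask = cumulative_mask + sum_mask           # Mcusum <- Mcusum + Msum
    binary = (cumulative_mask > 0).astype(image.dtype)

    next_image = image * (1 - binary)[..., None]           # Inext = I x (1 - BinaryMask)
    return next_image, cumulative_mask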
As the input image was modified, the corresponding prompt labels (point grid labels) also required updating. According to Algorithm 2, points located within the foreground region (i.e., the cumulative foreground mask) were assigned a value of 0, ensuring that objects would not be detected at those points.
Algorithm 2: Input point label update with cumulative mask
Input: Cumulative mask Mcusum of size H × W, Input points P = (wi, hi)
Output: Updated point labels L
 1: for each point (w, h) ∈ P satisfying Valid do
 2:  Update label: L[w, h] ← 1 − Mcusum[h, w]
 3:  (If the Mcusum value is 1, set label to 0; otherwise, set it to 1)
 4: end for
 5: Return L
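Algorithm 2 reduces to a per-point lookup of the cumulative mask; a minimal sketch is given below, assuming the grid points are supplied as (w, h) pixel coordinates.

import numpy as np

def update_point_labels(points, cumulative_mask):
    """Label 0 for grid points inside already-detected regions, 1 otherwise.

    points:          N x 2 array of (w, h) pixel coordinates
    cumulative_mask: H x W array, nonzero where objects were already detected
    """
    labels = np.ones(len(points), dtype=np.int32)
    for i, (w, h) in enumerate(points):
        if cumulative_mask[int(h), int(w)] > 0:   # L[w, h] <- 1 - Mcusum[h, w]
            labels[i] = 0
    return labels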
Figure 4 presents a schematic representation of the iterative update process applied to the input image and point labels as they progress through SAM to subsequent stages. Using the input image and point labels, SAM detects objects and accumulates the detected object masks to generate a foreground mask. Subsequently, Algorithm 1 and Algorithm 2 are sequentially applied to the input image and the accumulated foreground mask, respectively, to update them accordingly. The updated input image and point labels are then reprocessed by SAM, enabling the detection of different objects in subsequent iterations.

3.1.3. Handling Duplicate and Overlapping Objects

While updating the input images and prompt labels mitigated redundant object detections, duplicate and overlapping objects were still observed. Object overlaps posed significant challenges in the subsequent label assignment process, where a single object could receive multiple conflicting labels. Addressing these issues was crucial. Two primary challenges were identified, namely (1) duplication and overlap among newly detected object masks and (2) duplication and overlap between newly detected masks and previously stored object masks. To resolve these issues, Algorithm 3 was implemented, with the overall process illustrated in Figure 5.
Algorithm 3: Mask overlap resolution and intersection division
Input: Sorted list of masks M
Output: Updated list of masks Mnew
 1: Step 1: Duplicate Mask Removal
 2: for each pair of masks (m1, m2) ∈ M do
 3:  Compute IoU: iou ← IoU(m1, m2)
 4:  if iou > 0.85 then
 5:   Remove m2
 6:  end if
 7: end for
 8: Step 2: Separate Masks Based on Intersection
 9: Initialize Minter, Mnon-inter
10: for each mask m1 ∈ M do
11:  if m1 intersects any m2 ∈ M where m1 ≠ m2 then
12:   Add m1 to Minter
13:  else
14:   Add m1 to Mnon-inter
15:  end if
16: end for
17: Step 3: Process and Combine Masks
18: Extract intersection regions: I ← GetIntersections(Minter)
19: if I is not empty then
20:  Perform set difference: D ← ComputeSetDifference(Minter, I)
21: else
22:  Set D ← ∅
23: end if
24: Combine all masks: Mnew ← Mnon-inter + I + D
25: Return Mnew
First, to resolve object duplication, masks were sorted in descending order based on their stability scores. The IoU was calculated between each pair of masks, and if it exceeded a threshold of 0.85, the masks were considered duplicates. In such cases, only the most stable mask was retained, while all redundant masks were discarded. Next, for handling overlapping objects, masks were categorized into the following two groups: those with overlapping regions and those without. The intersection regions (i.e., overlapping object areas) were extracted, and each mask was updated by subtracting these intersections from their respective areas. Finally, the updated masks were combined with non-overlapping masks and intersection masks to form the final set of object masks.
In the first case, overlap was more prevalent than duplication, and the number of masks was relatively small, allowing Algorithm 3 to resolve overlap issues efficiently. However, in the second case, duplication was more significant, and the number of processed masks was more than twice that of the first case. Thus, an additional step was introduced, namely removing duplicate objects first, followed by applying Algorithm 3 to resolve overlapping cases. These optimizations enabled the extraction and storage of non-duplicated, non-overlapping objects from the panoramic images.
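The core of this step can be sketched with boolean mask operations; the helper below follows the structure of Algorithm 3 (duplicate removal, intersection extraction, set difference) but is a simplified illustration, assuming the masks arrive pre-sorted by stability score.

import numpy as np

def iou(m1, m2):
    """Intersection over union of two boolean masks."""
    union = np.logical_or(m1, m2).sum()
    return np.logical_and(m1, m2).sum() / union if union > 0 else 0.0

def resolve_masks(masks, dup_thresh=0.85):
    # Step 1: drop near-duplicates, keeping the more stable (earlier) mask.
    kept = []
    for m in masks:
        if all(iou(m, k) <= dup_thresh for k in kept):
            kept.append(m)

    # Step 2: collect pairwise intersection (overlap) regions.
    intersections = []
    for i in range(len(kept)):
        for j in range(i + 1, len(kept)):
            inter = np.logical_and(kept[i], kept[j])
            if inter.any():
                intersections.append(inter)

    # Step 3: subtract the intersections from each mask, then recombine.
    if intersections:
        all_inter = np.logical_or.reduce(intersections)
        resolved = [np.logical_and(m, ~all_inter) for m in kept]
    else:
        resolved = list(kept)

    return resolved + intersections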

3.2. Label Extraction and Labeling

While the proposed method effectively extracted high-quality object masks using SAM and the three processing techniques, the segmented objects lacked labels, necessitating an additional labeling process. As mentioned in Section 2.2, this study used the Panoplane360 labels for vertical and horizontal plane segmentation as a reference to label the SAM-generated object masks.
Given a panoramic image of an indoor construction site, Panoplane360 outputs the following two label masks: one for vertical planes and another for horizontal planes. To assign these labels to the SAM-extracted objects effectively, Algorithm 4 was applied. First, the object masks were overlaid with the label masks, and the dominant label within each object region was determined. The most dominant label was then assigned to an object only if it occupied at least one-third of the object mask area, ensuring that minor overlaps did not lead to incorrect label assignment.
Algorithm 4: Object mask assignment and merging
Input: List of object masks Mobj, label mask L of size H × W
Output: Merged labeled masks Mmerged
 1: Step 1: Overlapping masks and Filter Labeled masks
 2: Initialize enhanced label mask Lenhanced ← 0^(H×W)
 3: Initialize labeled mask list Mlabeled
 4: for each object mask Mi ∈ Mobj do
 5:  Compute object mask area: Ai
 6:  Compute label overlap: Loverlap ← Mi · L
 7:  Extract labels and compute frequencies: L, C ← UniqueCounts(Loverlap)
 8:  Select dominant label: Ldom ← L[arg max C]
 9:  Get dominant label count: Cdom ← max C
10:  if Cdom ≥ Ai/3 then
11:   Assign dominant label: Mlabeled ← Mi · Ldom
12:   Append Mlabeled to Mlabeled
13:   Update enhanced label mask: Lenhanced ← Lenhanced + Mlabeled
14:  end if
15: end for
16: Step 2: Merge Object Masks Based on the Same Labels
17: Extract sorted unique labels: Lsorted ← SortByFrequency(Lenhanced)
18: Initialize merged mask list Mmerged
19: for each label L ∈ Lsorted do
20:  Initialize merged mask Mmerged ← 0^(H×W)
21:  for each labeled mask Mlabeled ∈ Mlabeled do
22:    if L ∈ Mlabeled then
23:     Mmerged ← Mmerged + Mlabeled
24:    end if
25:  end for
26:  Append Mmerged to Mmerged
27: end for
28: Return Mmerged
After the label assignment, multiple object masks were found to have the same label, necessitating an additional merging process. Using the filtered labels, object masks with identical labels were merged into a single mask, ensuring a one-to-one correspondence between objects and labels. Figure 6 illustrates the overall process, where the extracted Panoplane360 labels are systematically assigned to the SAM-generated objects.
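A compact sketch of this assignment-and-merging step is shown below, assuming the SAM objects are boolean masks and the Panoplane360 output is an integer label mask in which 0 denotes unlabeled pixels; the helper name and the defaultdict-based merging are illustrative choices, not the authors’ code.

import numpy as np
from collections import defaultdict

def assign_and_merge(object_masks, label_mask, min_fraction=1 / 3):
    """Assign each SAM object its dominant plane label, then merge by label.

    object_masks: list of boolean H x W arrays from SAM
    label_mask:   H x W integer array of plane labels (0 = unlabeled)
    """
    merged = defaultdict(lambda: np.zeros(label_mask.shape, dtype=bool))

    for obj in object_masks:
        area = obj.sum()
        overlap = label_mask[obj]                 # labels covered by this object
        overlap = overlap[overlap > 0]
        if overlap.size == 0:
            continue
        labels, counts = np.unique(overlap, return_counts=True)
        dominant = labels[np.argmax(counts)]
        if counts.max() >= area * min_fraction:   # at least one-third of the object
            merged[int(dominant)] |= obj          # merge same-label objects

    return dict(merged)                           # {label: merged boolean mask}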

4. Experimental Study

In this study, a comparative experiment was conducted to evaluate the performance of panoramic image segmentation in indoor construction environments. Three different methods were assessed, as follows: (1) using Panoplane360 alone for segmentation, (2) applying a single execution of SAM followed by label assignment from Panoplane360, and (3) implementing iterative SAM execution with filtering algorithms, after which labels were assigned using Panoplane360. To conduct this evaluation, a dataset was constructed using panoramic images collected from indoor construction sites in South Korea, and the segmentation results were analyzed using both quantitative and qualitative evaluation metrics. This experiment aims to demonstrate that the proposed method can achieve more precise object boundary segmentation and reliable label assignment, even in complex and dynamic indoor construction environments.

4.1. Dataset Description

As illustrated in Figure 7, the evaluation dataset consists of panoramic images collected from three indoor construction sites in South Korea. The dataset includes two newly constructed apartment buildings (Site B and Site J) and one university renovation project (Site K), representing different architectural types, namely residential and educational facilities. The images were captured using an Insta360 ONE X camera, with 12 images per site, totaling 36 panoramic images. All images were taken during the construction phase when the structural framework was completed but while the finishing work was in progress. The ground truth labels for the collected data were manually annotated using Roboflow.
This dataset incorporates various environmental factors that influence segmentation performance, including lighting conditions that combine natural and artificial illumination, structural complexity with both open and enclosed spaces, and obstacles with varying densities, such as workers, materials, and equipment. For instance, in Site B, large windows obstruct portions of the walls, allowing natural light to enter through openings. Site J contains scaffolding and workers, with numerous scattered construction materials on the floor. In Site K, an open space with abundant natural light creates distinct shadow patterns on the floor. By incorporating these realistic environmental conditions, the dataset enables a comprehensive and practical evaluation of the proposed method’s segmentation performance.

4.2. Evaluation Methods

We structured the segmentation performance evaluation around two key aspects. First, we quantitatively evaluated the exactness of label assignment within object boundaries. To achieve this, mean intersection over union (MIoU) was used as the primary evaluation metric. MIoU is a widely adopted metric for segmentation tasks where precise boundary delineation is crucial, such as plane segmentation. MIoU is calculated using the following Equation (1):
MIoU = \frac{1}{N} \sum_{i=1}^{N} \frac{|P_i \cap G_i|}{|P_i \cup G_i|}   (1)
where N is the total number of segmented objects in an image, P_i represents the predicted segmentation mask, and G_i denotes the ground truth mask. The MIoU value ranges from 0 to 1, with values closer to 1 indicating higher segmentation accuracy.
Since the evaluation was conducted across multiple images with varying environmental conditions, Mean MIoU was used as the final performance metric, calculated using the following Equation (2):
\mathrm{Mean\ MIoU} = \frac{1}{M} \sum_{i=1}^{M} \mathrm{MIoU}_i   (2)
where M is the total number of images and MIoU_i is the MIoU of the i-th image. A higher Mean MIoU indicates that the model consistently achieves high segmentation performance across diverse indoor construction environments.
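Both metrics can be computed directly from binary instance masks; the functions below are a minimal NumPy sketch of Equations (1) and (2), assuming predicted and ground-truth masks have already been matched one-to-one.

import numpy as np

def miou(pred_masks, gt_masks):
    """Equation (1): mean IoU over the N plane instances in one image."""
    ious = []
    for p, g in zip(pred_masks, gt_masks):        # boolean H x W masks
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union if union > 0 else 0.0)
    return float(np.mean(ious))

def mean_miou(per_image_pairs):
    """Equation (2): average MIoU over the M evaluation images."""
    return float(np.mean([miou(p, g) for p, g in per_image_pairs]))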
Second, we qualitatively evaluated whether the assigned labels preserved the correct geometric information. This evaluation was based on (1) a visual comparison of the plane segmentation masks and (2) a visual inspection of the 3D models reconstructed from the predicted planar information. While the previous quantitative metric provides numerical insight, it has inherent limitations in directly verifying whether the labels are correctly assigned to the corresponding planar regions. To overcome these limitations and conduct a more detailed analysis, qualitative evaluation was additionally conducted.
We visualized the predicted plane segmentation masks by substituting the labels of each plane with distinct colors. Furthermore, we reconstructed the planar regions in 3D space using the geometric information inferred from the segmentation masks and visualized the resulting 3D models. If the predicted planar objects are labeled correctly, the plane segmentation mask should represent each plane with a consistent color, and the reconstructed 3D model should accurately reflect the spatial structure of the indoor environment, including the ceiling, walls, and floor.
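For reference, such a visualization can be produced by mapping each label to a fixed color; the function below is a simple illustrative sketch, not the rendering pipeline used in the study.

import numpy as np

def colorize_labels(label_mask, seed=0):
    """Map each nonzero plane label to a distinct RGB color for visual inspection."""
    rng = np.random.default_rng(seed)
    h, w = label_mask.shape
    out = np.zeros((h, w, 3), dtype=np.uint8)
    for label in np.unique(label_mask):
        if label == 0:                    # 0 = unlabeled / background
            continue
        out[label_mask == label] = rng.integers(0, 256, size=3, dtype=np.uint8)
    return out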

4.3. Quantitative Analysis

We conducted a quantitative evaluation based on the Mean MIoU across the following three methods: (1) Panoplane360 alone, (2) Panoplane360 combined with a single execution of SAM, and (3) the proposed method. The results for the horizontal planes (H-plane) and vertical planes (V-plane) are presented in Table 2.
The experimental results demonstrated that integrating SAM enhanced segmentation performance relative to Panoplane360 alone, with the proposed iterative SAM execution and filtering algorithms yielding the highest performance. Specifically, Panoplane360 alone achieved a value of 0.504 for the H-plane and a value of 0.519 for the V-plane. When SAM was applied only once, the performance improved to 0.546 for H-plane and 0.571 for V-plane. However, the proposed method achieved the highest scores of 0.642 for the H-plane and 0.606 for the V-plane.
These quantitative results indicate that incorporating SAM-based object segmentation enhances the segmentation quality of Panoplane360, and that iterative SAM execution with filtering algorithms further improves exactness, ensuring that the labels are more accurately assigned within object boundaries.

4.4. Qualitative Analysis

While the quantitative evaluation based on Mean MIoU confirmed that the proposed method improved the exactness of label assignment compared to the target model, it remained unclear whether the assigned labels accurately reflected the correct geometric information. Therefore, an additional visual analysis was conducted to validate the geometric accuracy of the labels.

4.4.1. Plane Segmentation Mask

Figure 8 presents a visual comparison of plane segmentation masks for the three indoor construction sites. Panoplane360 alone exhibited blurred label boundaries, frequent label omissions in objects, such as windows, walls, and ceilings, and inconsistent label assignment within the same object. When SAM was applied once, label boundaries became sharper, and missing labels were assigned to previously unannotated objects, but incorrect label assignment remained frequent. In contrast, the proposed method not only produced clearer label boundaries but also removed incorrect labels and filled missing labels, resulting in more complete plane segmentation. Specifically, in Site B and Site J, the proposed method effectively filled missing labels on windowed walls, which were not fully addressed by the other methods. In Site K, although some wall labels were missing, the method successfully filled gaps between critical structural elements, such as walls and floors, ensuring a more complete segmentation.

4.4.2. 3D Model

Figure 9 presents a comparison of reconstructed 3D models generated from the plane segmentation masks shown in Figure 8. In Sites B and J, the proposed method effectively filled missing labels on windowed walls, and these improvements were consistently reflected in the reconstructed 3D models. In Site K, when using Panoplane360 with a single execution of SAM, ceiling labels were incorrectly assigned to wall objects, resulting in critical errors. However, the proposed method correctly assigned labels, accurately filling gaps between walls and floors.
Visual comparisons of plane segmentation masks and reconstructed 3D models supplement the limitations of the quantitative evaluation. The results indicate that the proposed method achieves clearer delineation of label boundaries and effectively addresses incomplete label assignment made by the target model. Moreover, the proposed approach not only enhances the precision of label assignment but also preserves accurate planar information, thereby enabling a more faithful representation of the spatial structure.

5. Conclusions

This study proposed a method to address the degradation of segmentation performance in indoor construction environments caused by environmental complexity and the lack of standardized datasets. The approach integrates the iterative application of the Segment Anything Model (SAM), post-processing algorithms, and a label assignment algorithm to enhance segmentation quality without requiring additional training data.
Through experimental study, we verified that the proposed method significantly improves segmentation quality in two key aspects. First, we quantitatively evaluated the exactness of label assignment within object boundaries. Relative to Panoplane360 alone, the proposed method improved the Mean MIoU by 27.4% for horizontal (H-plane) surfaces and 16.8% for vertical (V-plane) surfaces (from 0.504 to 0.642 and from 0.519 to 0.606, respectively). Additionally, compared to a method that simply integrates a single execution of SAM, the proposed approach improved segmentation performance by 17.6% for the H-plane and 6.1% for the V-plane. Second, we qualitatively evaluated whether the assigned labels preserved accurate geometric information. Visual inspection of the plane segmentation masks confirmed that the proposed method reduced label omissions and improved boundary delineation. Additionally, the visualized 3D models reconstructed from the predicted labels confirmed that the segmented planes were spatially accurate and structurally consistent with the environment.
These results suggest that the proposed method not only improves segmentation accuracy by refining object boundaries but also ensures that labels are appropriately assigned to maintain spatial consistency in indoor construction environments. Notably, the method enhances segmentation quality without requiring the construction of additional datasets for indoor construction, highlighting its practical significance.
However, this study has certain limitations. First, while the proposed method improves segmentation accuracy, it does not consider computational efficiency. The iterative execution of SAM leads to high computational costs and increased processing time. Second, the performance of the method depends on parameter configurations, yet no systematic approach for determining optimal hyperparameters has been established. Several key hyperparameters—such as SAM’s initial parameter values and adjustment increments, mask filtering thresholds, and label assignment thresholds—affect segmentation performance, but in this study, they were determined empirically. Third, the evaluation was limited to plane segmentation using Panoplane360, and the applicability of the method to other segmentation tasks and models was not explored. Further research is needed to verify the robustness of the approach across diverse segmentation tasks and deep learning models.
Future studies should focus on improving computational efficiency by optimizing SAM’s iterative execution process and developing a structured approach for hyperparameter tuning to enhance performance across different environments. Additionally, the applicability of the proposed method should be extended beyond plane segmentation with Panoplane360 to other panoramic segmentation tasks and related deep learning models for further performance validation.
Through these efforts, the proposed method is expected to facilitate the expansion and advancement of panoramic segmentation technology in the construction domain. This is expected to improve segmentation performance and promote its broader adoption in real-world applications, such as construction progress monitoring, safety inspections, and spatial layout planning in indoor environments.

Author Contributions

Conceptualization, J.H.; Methodology, J.H.; Software, J.H.; Formal analysis, J.H.; Writing—original draft, J.H.; Visualization, M.K.; Supervision, S.Y.; Project administration, T.K.; Writing—review and editing, T.K.; Funding acquisition, T.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by a grant (RS-2022-00143493) from the Digital-Based Building Construction and Safety Supervision Technology Research Program funded by the Ministry of Land, Infrastructure, and Transport of the Korean Government.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Erazo-Rondinel, A.A.; Melgar, M.A. Exploring the Benefits of 360-Degree Panoramas for Construction Project Monitoring and Control. In Proceedings of the 1st International Online Conference on Buildings, Online, 24–26 October 2024. [Google Scholar]
  2. Fang, X.; Li, H.; Wu, H.; Fan, L.; Kong, T.; Wu, Y. A fast end-to-end method for automatic interior progress evaluation using panoramic images. Eng. Appl. Artif. Intell. 2023, 126, 106733. [Google Scholar] [CrossRef]
  3. Wei, Y.; Akinci, B. Panorama-to-model registration through integration of image retrieval and semantic reprojection. Autom. Constr. 2022, 140, 104356. [Google Scholar] [CrossRef]
  4. Kang, M.; Yoon, S.; Kim, T. Computer Vision-Based Adhesion Quality Inspection Model for Exterior Insulation and Finishing System. Appl. Sci. 2024, 15, 125. [Google Scholar] [CrossRef]
  5. Li, D.; Liu, J.; Feng, L.; Cheng, G.; Zeng, Y.; Dong, B.; Chen, Y.F. Towards automated extraction for terrestrial laser scanning data of building components based on panorama and deep learning. J. Build. Eng. 2022, 50, 104106. [Google Scholar] [CrossRef]
  6. Ekanayake, B.; Wong, J.K.-W.; Fini, A.A.F.; Smith, P. Computer vision-based interior construction progress monitoring: A literature review and future research directions. Autom. Constr. 2021, 127, 103705. [Google Scholar] [CrossRef]
  7. Zhang, C.; Shen, J. Object Detection and Instance Segmentation in Construction Sites. In Proceedings of the 2024 3rd Asia Conference on Algorithms, Computing and Machine Learning, Shanghai, China, 22–24 March 2024; pp. 184–190. [Google Scholar]
  8. Pokuciński, S.; Mrozek, D. Object Detection with YOLOv5 in Indoor Equirectangular Panoramas. Procedia Comput. Sci. 2023, 225, 2420–2428. [Google Scholar] [CrossRef]
  9. Regona, M.; Yigitcanlar, T.; Xia, B.; Li, R.Y.M. Opportunities and Adoption Challenges of AI in the Construction Industry: A PRISMA Review. J. Open Innov. Technol. Mark. Complex. 2022, 8, 45. [Google Scholar] [CrossRef]
  10. Tang, S.; Huang, H.; Zhang, Y.; Yao, M.; Li, X.; Xie, L.; Wang, W. Skeleton-guided generation of synthetic noisy point clouds from as-built BIM to improve indoor scene understanding. Autom. Constr. 2023, 156, 105076. [Google Scholar] [CrossRef]
  11. Ying, H.; Sacks, R.; Degani, A. Synthetic image data generation using BIM and computer graphics for building scene understanding. Autom. Constr. 2023, 154, 105016. [Google Scholar] [CrossRef]
  12. Lee, H.; Jeon, J.; Lee, D.; Park, C.; Kim, J.; Lee, D. Game engine-driven synthetic data generation for computer vision-based safety monitoring of construction workers. Autom. Constr. 2023, 155, 105060. [Google Scholar] [CrossRef]
  13. Wang, L.; Zhou, X.; Liu, J.; Cheng, G. Automated layout generation from sites to flats using GAN and transfer learning. Autom. Constr. 2024, 166, 105668. [Google Scholar] [CrossRef]
  14. Goyal, M.; Mahmoud, Q.H. A Systematic Review of Synthetic Data Generation Techniques Using Generative AI. Electronics 2024, 13, 3509. [Google Scholar] [CrossRef]
  15. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar]
  16. Wang, B.; Chen, Z.; Li, M.; Wang, Q.; Yin, C.; Cheng, J.C.P. Omni-Scan2BIM: A ready-to-use Scan2BIM approach based on vision foundation models for MEP scenes. Autom. Constr. 2024, 162, 105384. [Google Scholar] [CrossRef]
  17. Teng, S.; Liu, A.; Situ, Z.; Chen, B.; Wu, Z.; Zhang, Y.; Wang, J. Plug-and-play method for segmenting concrete bridge cracks using the segment anything model with a fractal dimension matrix prompt. Autom. Constr. 2025, 170, 105906. [Google Scholar] [CrossRef]
  18. Peng, H.; Liao, Y.; Li, W.; Fu, C.; Zhang, G.; Ding, Z.; Huang, Z.; Cao, Q.; Cai, S. Segmentation-aware prior assisted joint global information aggregated 3D building reconstruction. Adv. Eng. Inform. 2024, 62, 102904. [Google Scholar] [CrossRef]
  19. Sun, C.; Hsiao, C.-W.; Wang, N.-H.; Sun, M.; Chen, H.-T. Indoor panorama planar 3d reconstruction via divide and conquer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11338–11347. [Google Scholar]
  20. Zheng, Z.; Lin, C.; Nie, L.; Liao, K.; Shen, Z.; Zhao, Y. Complementary bi-directional feature compression for indoor 360deg semantic segmentation with self-distillation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 4501–4510. [Google Scholar]
  21. Gao, S.; Yang, K.; Shi, H.; Wang, K.; Bai, J. Review on Panoramic Imaging and Its Applications in Scene Understanding. IEEE Trans. Instrum. Meas. 2022, 71, 1–34. [Google Scholar] [CrossRef]
  22. Yu, H.; He, L.; Jian, B.; Feng, W.; Liu, S. PanelNet: Understanding 360 Indoor Environment via Panel Representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 878–887. [Google Scholar]
  23. Gkitsas, V.; Sterzentsenko, V.; Zioulis, N.; Albanis, G.; Zarpalas, D. Panodr: Spherical panorama diminished reality for indoor scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3716–3726. [Google Scholar]
  24. Shen, Z.; Lin, C.; Liao, K.; Nie, L.; Zheng, Z.; Zhao, Y. PanoFormer: Panorama transformer for indoor 360° depth estimation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 195–211. [Google Scholar]
  25. Anagnostopoulos, I.; Pătrăucean, V.; Brilakis, I.; Vela, P. Detection of walls, floors, and ceilings in point cloud data. In Proceedings of the Construction Research Congress 2016, San Juan, Puerto Rico, 31 May–2 June 2016; pp. 2302–2311. [Google Scholar]
  26. Shen, Z.; Zheng, Z.; Lin, C.; Nie, L.; Liao, K.; Zheng, S.; Zhao, Y. Disentangling orthogonal planes for indoor panoramic room layout estimation with cross-scale distortion awareness. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17337–17345. [Google Scholar]
  27. Zhong, Y.; Zhao, D.; Cheng, D.; Zhang, J.; Tian, D. A Fast and Precise Plane Segmentation Framework for Indoor Point Clouds. Remote Sens. 2022, 14, 3519. [Google Scholar] [CrossRef]
  28. Holz, D.; Holzer, S.; Rusu, R.B.; Behnke, S. Real-time plane segmentation using RGB-D cameras. In Proceedings of the RoboCup 2011: Robot Soccer World Cup XV 15, Istanbul, Turkey, 5–11 July 2012; pp. 306–317. [Google Scholar]
  29. Yao, H.; Miao, J.; Zhang, G.; Chu, J. 3D layout estimation of general rooms based on ordinal semantic segmentation. IET Comput. Vis. 2023, 17, 855–868. [Google Scholar] [CrossRef]
  30. Maalek, R.; Lichti, D.D.; Ruwanpura, J. Robust Classification and Segmentation of Planar and Linear Features for Construction Site Progress Monitoring and Structural Dimension Compliance Control. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2015, II-3/W5, 129–136. [Google Scholar] [CrossRef]
  31. Hammad, A.W.A.; Rey, D.; Akbarnezhad, A. A cutting plane algorithm for the site layout planning problem with travel barriers. Comput. Oper. Res. 2017, 82, 36–51. [Google Scholar] [CrossRef]
  32. Zhang, J.; Yang, K.; Ma, C.; Reiß, S.; Peng, K.; Stiefelhagen, R. Bending reality: Distortion-aware transformers for adapting to panoramic semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16917–16927. [Google Scholar]
Figure 1. Framework of the proposed method.
Figure 2. Calculation of background proportion by cumulative summing of the foreground.
Figure 3. Comparison of segmentation results based on threshold differences.
Figure 4. Process of updating the input image and input point labels (green: points for segmentation; red: points excluded from segmentation).
Figure 5. Separation of intersecting detected object masks and definition of new masks.
Figure 6. Label assignment and merging process of the object masks.
Figure 7. Examples of panoramic images captured at different construction sites.
Figure 8. Qualitative comparison of plane segmentation results. (A) Input image; (B) ground truth; (C) H&V plane segmentation from Panoplane360; (D) H&V plane segmentation from SAM+Pano (no algorithm); (E) H&V plane segmentation from the proposed method.
Figure 9. Qualitative comparison of 3D reconstruction results. (A) 3D reconstruction from Panoplane360; (B) 3D reconstruction from SAM+Pano (no algorithm); (C) 3D reconstruction from the proposed method.
Table 1. Initial input parameters and parameter variation.

Parameter                 Initial Value   Variation   Conditional Variation
points_per_side           32              +4          –
stability_score_thresh    0.97            −0.002      –
pred_iou_thresh           0.85            –           −0.01
min_mask_region_area      100             –           –
Table 2. Quantitative comparison of Mean MIoU for the plane segmentation mask.

Label     Site      Panoplane360   SAM+Pano (No Algorithm)   Ours (Proposed)
H_plane   Site B    0.563          0.657                     0.733
H_plane   Site J    0.572          0.607                     0.665
H_plane   Site K    0.377          0.375                     0.527
H_plane   Total     0.504          0.546                     0.642
V_plane   Site B    0.538          0.616                     0.650
V_plane   Site J    0.512          0.556                     0.616
V_plane   Site K    0.508          0.542                     0.552
V_plane   Total     0.519          0.571                     0.606
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
