Automatic Indoor as-Built Building Information Models Generation by Using Low-Cost RGB-D Sensors

Generating indoor as-built building information models (AB BIMs) automatically and economically is a great technological challenge. Many approaches have been developed to address this problem in recent years, but it is far from settled, particularly for point cloud segmentation and the extraction of relationships among different elements in complicated indoor environments. This is even more difficult for the low-quality point clouds generated by low-cost scanning equipment. This paper proposes an automatic AB BIM generation framework that transforms the noisy 3D point cloud produced by a low-cost RGB-D sensor (about 708 USD for the data collection equipment: 379 USD for the Structure sensor and 329 USD for the iPad) into as-built BIMs without any manual intervention. The experimental results show that the proposed method has competitive robustness and accuracy compared to a high-quality Terrestrial Lidar System (TLS), with an element extraction accuracy of 100%, a mean dimension reconstruction accuracy of 98.6%, and a mean area reconstruction accuracy of 93.6%. The proposed framework also makes BIM generation workflows more efficient in both data collection and data processing. In the experiments, the data collection time for a typical room, with an area of 45–67 m², is reduced from 50–60 min with TLS to 4–6 min with the RGB-D sensor. The processing time to generate BIM models automatically is about half a minute, down from around 10 min with a conventional semi-manual method.


Introduction
Building information models (BIMs), including as-designed BIMs (AD BIMs) and as-built BIMs (AB BIMs), are digital representations for whole-life-cycle management from design and construction to demolition [1], and are widely used in construction management [2], computer games [3], indoor navigation [4], and emergency response [5]. Different from traditional 3D models, which contain spatial information only, BIMs carry more information (e.g., material, functions, and topological

Automatic BIM Generation Framework
This section describes the workflow, as well as the detailed algorithms, of our automatic BIM generation method. As shown in Figure 1, the whole framework can be divided into three stages.
The first stage is to collect the datasets and convert them into the structure accepted by the neural networks.
The raw RGB images and depth images are collected by an RGB-D sensor. We then apply the calibration method proposed in [28] to reduce the systematic error of the RGB-D sensor. A depth encoding method proposed by Gupta et al. [29], named HHA, is used to encode the depth information, as neural networks are poorly suited to processing unstructured geometric information due to their fixed grid kernel structure. Finally, the RGB images and HHA images, together with the ground truth labels, are packed together as the input of the training procedure in the semantic 3D reconstruction stage.
The second stage is to obtain the semantic 3D reconstruction of the environment. First, a neural network for 2D image semantic segmentation is established in the training procedure. This network is trained with the image pairs and ground truth labels, and in the prediction procedure it predicts the semantic label of each pixel from the input RGB and HHA images. Then, with the RGB and depth images as input, the 3D reconstruction is performed using a simultaneous localization and mapping (SLAM) method [30], which outputs a textured point cloud. Finally, the point cloud with semantic labels is generated by integrating the 2D semantic segmentation results with the 3D reconstruction.
The final stage is to extract the digital spatial and relationship information for BIM generation. In this stage, the semantic point cloud is divided into different parts based on the label information provided in stage 2. The planes of floors and ceilings are estimated first using the random sample consensus (RANSAC) algorithm [31]. As mentioned in Section 1, the point clouds collected by low-cost RGB-D sensors are very noisy, which makes it challenging to extract planes with conventional plane extraction methods. We propose a new "add-remove" method to overcome this problem and improve the quality of the extracted wall planes. The wall planes are then projected onto the floor plane to obtain the 2D map of the walls, and the start points and endpoints of all lines are recovered using a line fitting algorithm. At the same time, the door and window point clouds are segregated into parts that each contain only one element by using local descriptors [32]. The positions of the elements and their relationships to walls are recovered by calculating the closest distances. Meanwhile, the sizes of the elements are calculated from single-frame point clouds rather than all point clouds, to reduce the effects of measurement noise. Finally, the BIM model is automatically generated by integrating all the segmentations with a plug-in program on the Revit platform.
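The closest-distance rule for attaching doors and windows to walls can be sketched as follows. The helper name and the (start, end) 2D segment representation of a projected wall line are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def assign_to_nearest_wall(element_center, wall_lines):
    """Attach a door/window to the wall whose projected 2D line segment
    is closest to the element's centroid (hypothetical helper sketching
    the closest-distance rule). Each wall line is a (start, end) pair
    of 2D points."""
    def point_to_segment(p, a, b):
        ab = b - a
        # Clamp the projection parameter so the distance is to the
        # finite segment, not the infinite line.
        t = np.clip((p - a) @ ab / (ab @ ab), 0.0, 1.0)
        return float(np.linalg.norm(p - (a + t * ab)))
    dists = [point_to_segment(element_center, a, b) for a, b in wall_lines]
    return int(np.argmin(dists))

# Usage: a door centred near the first wall is assigned to it.
walls = [(np.array([0.0, 0.0]), np.array([4.0, 0.0])),
         (np.array([0.0, 0.0]), np.array([0.0, 4.0]))]
print(assign_to_nearest_wall(np.array([2.0, 0.1]), walls))  # 0
```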

Data Collection and Preprocessing
The hardware used to collect data in this paper is a Structure sensor, a type of RGB-D sensor that can be fitted to an iPad, an iPhone, or other mobile devices (Figure 2). The system outputs 640 × 480 aligned RGB and depth images at up to 30 frames per second. The benefit of semantic segmentation with depth information has been demonstrated by many studies [33–35]. However, the depth images from low-cost RGB-D sensors contain notable systematic errors. The calibration method [28] is first applied to improve depth measurement accuracy. The comparison between the raw and calibrated depth values for a wall is shown in Figure 3; the improvement from calibration is clear, as the wall should appear as a straight line. Additionally, as shown in Figure 4b, occlusion or clutter, caused by unpredictable variation of scene illumination, reflection from objects, or out-of-range issues, makes the use of depth information more challenging. In this paper, the algorithm proposed by Levin et al. [36] is adopted to fill the occluded regions (as shown in Figure 4c), since the filled depth is the input of the depth encoding algorithm used in the following steps. Finally, as shown in Figure 5, the depth information is encoded in the HHA (horizontal disparity, height above ground, and angle with gravity) format, because neural networks are poorly suited to processing unstructured geometric information due to their fixed grid kernel structure [29].
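A simplified HHA encoding can be sketched as below. This is not Gupta et al.'s implementation: the intrinsics are illustrative defaults rather than the Structure sensor's calibrated values, the gravity direction is assumed to be the camera y-axis, and normals are estimated from depth gradients instead of a robust fit:

```python
import numpy as np

def encode_hha(depth, fx=570.3, fy=570.3, cx=320.0, cy=240.0):
    """Simplified HHA encoding of a depth image (metres) into three
    8-bit channels: horizontal disparity, height above the lowest
    observed point, and angle of the surface normal with the assumed
    gravity direction (0, 1, 0)."""
    h, w = depth.shape
    d = np.where(depth > 0, depth, np.nan)  # mask invalid pixels

    # Channel 1: horizontal disparity (inverse depth).
    disparity = 1.0 / d

    # Back-project pixels to 3D camera coordinates.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    y = (v - cy) * d / fy
    z = d

    # Channel 2: height above the lowest observed point.
    height = np.nanmax(y) - y

    # Channel 3: angle between a gradient-based normal estimate and
    # the assumed gravity direction.
    dzdu = np.gradient(z, axis=1) * fx / d
    dzdv = np.gradient(z, axis=0) * fy / d
    n = np.dstack([-dzdu, -dzdv, np.ones_like(z)])
    n /= np.linalg.norm(n, axis=2, keepdims=True)
    angle = np.degrees(np.arccos(np.clip(n[..., 1], -1.0, 1.0)))

    # Scale each channel to 0-255 so the image pair can feed an
    # RGB-style network branch.
    def to_u8(c):
        c = np.nan_to_num(c, nan=0.0)
        lo, hi = c.min(), c.max()
        return ((c - lo) / (hi - lo + 1e-9) * 255).astype(np.uint8)

    return np.dstack([to_u8(disparity), to_u8(height), to_u8(angle)])
```

This is why the occlusion filling step matters: invalid (zero) depth pixels would otherwise poison all three channels.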
Figure 2. The main elements of the hardware used in this paper, which include one RGB-D camera and one iPad [37]. The total cost of the equipment is about 708 USD: 379 USD for the Structure sensor and 329 USD for the iPad.

Semantic 3D Reconstruction
The semantic 3D reconstruction stage is divided into two parts. The first part is the semantic segmentation in 2D images based on the RGB information and depth information. The second part is the integration of 2D semantic segmentation results with RGB-D SLAM mapping based on the relationship between pixel-level semantic labels and point clouds.
For the pixel-level semantic segmentation task, the FCN [27] is accepted as the standard deep-learning approach; it was the first end-to-end neural network to accept input of arbitrary size and produce dense pixel-level output of the corresponding size [27,38,39]. However, the semantic segmentation result from the FCN is not fine-grained enough, even when combining coarse high-layer information with fine low-layer information. The conditional random field (CRF) [40] is one of the most widely used algorithms to refine the semantic segmentation of RGB images, but it only uses boundary information from the color image. Meanwhile, convolutional oriented boundaries (COB) can extract boundary information from the depth image [41]. Considering that RGB-D sensors provide aligned color and depth images, we add a CRF layer and a COB layer at the end of the raw two-branch FCN architecture to refine the semantic segmentation result of the neural network. The network remains a single end-to-end architecture with image pairs as input and pixel-level semantic segmentation as output. The label classes used in this paper are door, window, floor, ceiling, and wall.
Then, we generate the semantic 3D point cloud by integrating the textured point cloud from RGB-D SLAM mapping with the 2D semantic segmentation from the neural network. First, the RGB-D SLAM method of Endres et al. [42] is employed to generate the 3D point cloud from the RGB and depth information. This graph-based SLAM system calculates the geometric relationships of adjacent frames based on RANSAC and the iterative closest point (ICP) algorithm. For each frame, the textured point cloud in the local coordinate system is generated from the color and depth information, and the full textured 3D point cloud is obtained by combining the point clouds of different frames with the transformation matrices from the SLAM system. Secondly, with the depth information and the pixel-level semantic segmentation result, the semantic point cloud of each frame is output. Finally, similar to the conventional 3D reconstruction method, the semantic 3D point cloud is generated by integrating the semantic point cloud of each frame and the transformation information from the SLAM system. In practice, the overlap of adjacent frames and incorrect 2D semantic segmentation results make the label information at a 3D location ambiguous. Our system fuses the overlapping semantic segmentation results using a Bayesian update: with the transformation information provided by SLAM, the segmentation results are aligned into the same coordinate system, and the overlap information is used to update the label probability distribution. The recursive update function is shown in Equation (1).
where L_i is the predicted label, C_k is the semantic point cloud of the k-th frame, and K is a constant that normalizes the distribution.
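Equation (1) itself was lost in extraction; a standard recursive Bayesian label fusion consistent with the symbols listed above (a reconstruction under that assumption, not necessarily the authors' exact formula) can be sketched as:

```python
import numpy as np

# Per-point label probabilities for the classes used in the paper.
LABELS = ["door", "window", "floor", "ceiling", "wall"]

def bayesian_label_update(prior, frame_probs):
    """Fuse one frame's per-pixel softmax output `frame_probs` into the
    running label distribution `prior` for a 3D point:
        P(L_i | C_1..k) = (1/K) * P(L_i | C_k) * P(L_i | C_1..k-1)
    where K renormalizes the product back to a distribution."""
    posterior = prior * frame_probs
    return posterior / posterior.sum()

# Usage: two noisy observations of a point that is most likely "wall".
p = np.full(len(LABELS), 1.0 / len(LABELS))   # uniform prior
p = bayesian_label_update(p, np.array([0.1, 0.1, 0.1, 0.1, 0.6]))
p = bayesian_label_update(p, np.array([0.2, 0.1, 0.1, 0.1, 0.5]))
print(LABELS[int(np.argmax(p))])  # "wall" dominates after fusion
```

The multiplicative update means a single misclassified frame is outvoted by consistent observations of the same point from overlapping frames.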

The Transformation from Semantic 3D Reconstruction to BIM Format 3D Model
In this stage, the semantic 3D point cloud is first separated into different parts (e.g., floor, ceiling, wall, door, and window) based on the label information. Then, the properties of the elements, as well as their bilateral relationships with each other, are extracted. Finally, the BIM format 3D models are generated from the obtained information using the Revit platform. As shown in Figure 6, the point cloud generated by low-cost RGB-D sensors is very noisy even after the systematic error calibration, which makes most element extraction methods unsuitable for this situation. We develop a new element extraction algorithm based on the features of the dataset as well as empirical knowledge of indoor environments.

Wall Boundary Extraction
In the wall boundary extraction procedure, the wall point cloud is first extracted based on the semantic label, as Figure 7b shows. This point cloud always contains a large number of noisy points (red circle areas in Figure 7b) due to measurement errors and the alignment errors between the color and depth images. In this study, we remove these sparse outliers based on the distribution of each point's distances to its neighbors in the input point cloud. The comparison between the raw point cloud and the filtered result is shown in Figure 7b,c.
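The neighbor-distance filter described above can be sketched as a statistical outlier removal; the parameter names and the brute-force neighbor search are illustrative choices, not the paper's exact implementation:

```python
import numpy as np

def remove_sparse_outliers(points, k=8, std_ratio=1.0):
    """Drop points whose mean distance to their k nearest neighbours
    exceeds the global mean by more than std_ratio standard deviations.
    Brute-force O(n^2) distances for clarity; a k-d tree would be used
    on real scans."""
    # Pairwise distance matrix; row i holds distances from point i.
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    d.sort(axis=1)
    mean_knn = d[:, 1:k + 1].mean(axis=1)   # skip the zero self-distance
    thresh = mean_knn.mean() + std_ratio * mean_knn.std()
    return points[mean_knn <= thresh]
```

Points belonging to a wall sit in a dense sheet and keep small neighbor distances, while the speckle caused by depth noise and color/depth misalignment is isolated and gets pruned.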
The first step of wall boundary extraction is wall plane detection. As Figure 7d shows, points in different colors belong to different wall planes. Traditionally, plane extraction for a dense point cloud is based on iterating the RANSAC plane fitting method. In iteration i, the algorithm detects a plane p_i from the input point cloud {P_i}. Points whose distance to plane p_i is less than the threshold d_0 are treated as inliers and are removed from {P_i}. The remaining point cloud {P_i+1} is the input of the next iteration i + 1. This process repeats until there are not enough points left in {P_i+1}. This method is effective for point clouds collected by TLS, which have almost no overlapping points and are of high quality. However, as Figure 8a-c shows, it causes an over-detection problem when the input is the low-quality point cloud collected by a low-cost RGB-D sensor: the value of d_0 is too small to remove all the points of plane p_i. As Figure 8c shows, some noisy points belonging to plane p_i are not removed, which leads to another detected plane close to p_i. The parameter d_0 is used to count the inlier points for the RANSAC-based plane fitting. In this paper, we still use d_0 for the plane fitting but use another, looser parameter T_d1 to remove the points from {P_i}. This significantly reduces the over-detection, as Figure 8c-e shows.
Another problem is over-removal of the point cloud. We remove the points close to the detected wall plane to overcome the over-detection problem; however, some of those points, located at the joint areas of walls, are useful for detecting other planes. As Figure 9 shows, the gray points are the input point cloud, the green line is the wall plane detected by the plane fitting algorithm, and the yellow area is the region of removed points. In this case, the red rectangular area is another, small wall that would not be detected because its points have been removed. To address this problem, we project the detected wall plane onto the floor plane to get the line of the wall in the 2D map. The points around the endpoints of the line (the blue circle areas in Figure 9) are preserved to overcome the over-removal problem.
The detailed implementation of the algorithm is presented in Algorithm 1, where F(.) is the plane fitting function based on the RANSAC algorithm, C(.) counts the points of a point cloud, D_1(.) calculates the distance between points and a plane, D_2(.) calculates the minimum distance between points and a line in the floor plane, Dis(.) calculates the distance between planes, Ang(.) calculates the angle between planes, and Proj(.) projects a plane onto the floor plane.
Additionally, Figure 10 presents an example of extracting 2D wall lines from the raw point cloud. Figure 10a is the input wall point cloud. Figure 10b,c shows the iteration of plane detection and point removal with the method described above; the iteration terminates when the number of remaining points falls below the threshold. We then obtain the segmented wall planes, shown in different colors in Figure 10d. Finally, the 2D wall lines are extracted by projecting the wall planes onto the floor plane, and the result is shown in Figure 10e.
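The iteration with the two thresholds d_0 and T_d1 can be sketched as below. This is a minimal reconstruction of the "add-remove" idea only; the endpoint-preservation step around wall joints, and the distance/angle checks Dis(.) and Ang(.) of Algorithm 1, are omitted:

```python
import numpy as np

def fit_plane_ransac(pts, d0, iters=200, rng=None):
    """RANSAC plane fit; returns (n, d) with n.p + d = 0. The tight
    threshold d0 is used only to score inliers."""
    rng = np.random.default_rng(0) if rng is None else rng
    best, best_count = None, -1
    for _ in range(iters):
        s = pts[rng.choice(len(pts), 3, replace=False)]
        n = np.cross(s[1] - s[0], s[2] - s[0])
        if np.linalg.norm(n) < 1e-9:
            continue  # degenerate (collinear) sample
        n = n / np.linalg.norm(n)
        d = -n.dot(s[0])
        count = int((np.abs(pts @ n + d) < d0).sum())
        if count > best_count:
            best, best_count = (n, d), count
    return best

def extract_wall_planes(pts, d0=0.03, td1=0.1, min_pts=50):
    """Iterative plane extraction with the two-threshold "add-remove"
    idea: score inliers with the tight d0, but remove points within the
    looser Td1 so residual noise near a detected plane cannot spawn a
    duplicate plane (the over-detection problem)."""
    planes = []
    while len(pts) >= min_pts:
        n, d = fit_plane_ransac(pts, d0)
        dist = np.abs(pts @ n + d)
        if (dist < d0).sum() < min_pts:
            break
        planes.append((n, d))
        pts = pts[dist >= td1]  # remove with the looser threshold Td1
    return planes
```

With td1 set equal to d0 this degenerates to the conventional iteration, which is exactly the configuration that produces the duplicate planes shown in Figure 8.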
As shown in Figure 7h, some extracted lines are not connected, because the collected point cloud does not cover the whole environment, or because there are still deviations between the true values and the detected results. To overcome this problem, we propose a 2D wall line connection and refinement algorithm based on the vertical distances between the lines and the normal angles of the corresponding detected wall planes. This method is designed under the assumption that the walls in the applied environment are perpendicular and straight. First, one reference line L_0 is randomly chosen from all the 2D wall lines {L}, and the two vertexes of L_0 are denoted as the start point Point_s and the end point Point_e. Then, all the distances between the end point of L_0 and the vertexes of the other lines are calculated. The line whose vertex has the minimum distance is treated as the adjacent line, denoted L_1; this vertex is treated as the start point of L_1, and the other vertex as its end point. With line L_1 as the new reference line, this procedure repeats until the found start point is the start point of L_0. In practice, as Figure 11 shows, the isolated lines can be divided into two groups: the first group is caused by missing edges, and the second group is caused by adjacent lines being too short to produce an intersection point.
The classification of these two groups is based on the angle between the normal vectors of the corresponding detected wall planes. For example, for line L_i with plane normal vector N_i and adjacent line L_i+1 with plane normal vector N_i+1, the angle between N_i and N_i+1 is calculated and denoted ϕ_i. As Figure 11 shows, if the value of ϕ_i is smaller than the threshold ϕ_0, the connection relationship is initialized as the first group. The central point of the line whose vertexes are the end point of L_i and the start point of L_i+1 is calculated. The end point of L_i is adjusted to the foot of the perpendicular from the central point to L_i. Similarly, the start point of L_i+1 is adjusted to the foot of the perpendicular from the central point to L_i+1. Otherwise, if the value of ϕ_i is larger than the threshold ϕ_0, the connection relationship is initialized as the second group. In this group, L_i and L_i+1 are connected by replacing the end point of L_i and the start point of L_i+1 with the extended intersection point of the two lines. Here, ϕ_0 is equal to 45 degrees. Finally, with the height of the wall calculated from the distance between the floor plane and the ceiling plane, the space boundary can be obtained. The detailed implementation of the algorithm is presented in Algorithm 2, where A(.) is the function to calculate the angle between lines, and Dv(.) is the function to calculate the distance between the close vertexes of two lines.

Algorithm 2: Connection and refinement of the 2D wall lines
1: initialize: randomly select one line L_0 from {L}
2: remove L_0 from {L}, and add L_0 to {L'}
3: while {L} is not empty do
4:   for i = 1, i ≤ length({L}), i++ do
5:     form the distance set D: D_i = Dv(L_0, L_i)
6:   end for
7:   find the line candidate L_c referring to the minimum value in D
8:   remove L_c from {L}, and add L_c to {L'}
9: end while
10: for j = 1, j ≤ length({L'}), j++ do
11:   calculate the angle between adjacent lines: Ang = A(L_j, L_j+1)
12:   if Ang < ϕ_0 then
13:     extend the lines to obtain the intersection point
14:     update the vertexes of L_j, L_j+1
15:   else
16:     add one new line between L_j, L_j+1
17:   end if
18: end for
19: Return: {L'}
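The two connection cases can be sketched in a few lines of numpy, following the prose description of the two groups (corner walls are extended to their intersection; nearly parallel walls receive a connector through the perpendicular feet of the gap's central point). The function names and the tuple-based line representation are illustrative, not the paper's implementation.

```python
import numpy as np

def foot_of_perpendicular(p, a, b):
    """Project point p onto the infinite line through a and b."""
    d = b - a
    return a + (np.dot(p - a, d) / np.dot(d, d)) * d

def line_intersection(a0, a1, b0, b1):
    """Intersection of the infinite 2D lines (a0, a1) and (b0, b1)."""
    da, db = a1 - a0, b1 - b0
    denom = da[0] * db[1] - da[1] * db[0]          # zero only if parallel
    t = ((b0[0] - a0[0]) * db[1] - (b0[1] - a0[1]) * db[0]) / denom
    return a0 + t * da

def connect_adjacent(l_i, l_j, n_i, n_j, phi0_deg=45.0):
    """Connect adjacent 2D wall lines l_i = (start, end), l_j = (start, end).

    Large normal angle -> extend both lines to their intersection (corner).
    Small normal angle -> insert a connector ending at the perpendicular
    feet of the gap's central point on the two lines (missing edge).
    Returns (new_l_i, new_l_j, connector_or_None)."""
    cos_ang = np.clip(abs(np.dot(n_i, n_j)), 0.0, 1.0)
    if np.degrees(np.arccos(cos_ang)) > phi0_deg:  # second group: corner
        p = line_intersection(l_i[0], l_i[1], l_j[0], l_j[1])
        return (l_i[0], p), (p, l_j[1]), None
    mid = 0.5 * (l_i[1] + l_j[0])                  # first group: missing edge
    f_i = foot_of_perpendicular(mid, l_i[0], l_i[1])
    f_j = foot_of_perpendicular(mid, l_j[0], l_j[1])
    return (l_i[0], f_i), (f_j, l_j[1]), (f_i, f_j)
```

For two perpendicular walls with a small gap, the first branch snaps both lines to the shared corner; for two offset parallel walls, the second branch produces the short perpendicular segment that represents the missing edge.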

Door and Window Extraction
The basic information of BIM elements such as doors and windows normally includes the position, the size, and the relationships with other related elements. In this section, we estimate the positions and sizes of doors and windows and reconstruct their relationships to the walls.
As Figure 12a,b shows, the input point cloud is segmented into several clusters based on local descriptors [32], so that each cluster contains only one element of interest. For each cluster, as Figure 12c shows, the point cloud is projected onto the floor plane for the 2D line fitting process. The position of the central point of the optimally fitted line is treated as the X, Y coordinates of the corresponding element. Considering that the bottom of a door always aligns with the floor plane, we assign the Z value of the door position as zero. For a window, the Z value of the position is calculated by projecting the point cloud onto the wall plane to get the optimally fitted line; the height of the central point of this line is treated as the Z value of the window position. In this paper, as Figure 12d shows, the width and height of the elements are estimated from the image frames rather than the global 3D point cloud, because the point cloud noise makes the extraction of boundaries difficult.
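The position estimate for a cluster can be sketched as below. This is a simplified illustration: centroids stand in for the midpoints of the optimally fitted lines used in the paper, and the floor is assumed to be the plane z = 0.

```python
import numpy as np

def element_position(cluster, element_type):
    """Estimate the (X, Y, Z) position of a door or window cluster.

    cluster: (N, 3) points; element_type: "door" or "window".
    The centroid approximates the midpoint of the fitted 2D line."""
    x, y = cluster[:, 0].mean(), cluster[:, 1].mean()  # floor-plane projection
    if element_type == "door":
        z = 0.0                     # a door's bottom aligns with the floor
    else:
        z = cluster[:, 2].mean()    # window: central height on the wall
    return np.array([x, y, z])
```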
In the relationship reconstruction, we use the constraint that a door or window always lies on a wall. First, for each element (a door or window), the distances {Dis} between its position and the wall planes {P}, as well as the angles {A} between the element plane and the wall planes, are calculated. Then, {P} is filtered with the condition that the value in {A} is less than the threshold θ_0, and the filtered result is {P_f}. The element is then assigned to the wall that is in {P_f} and has the minimum value in {Dis}. Here, the value of θ_0 is set to 30 degrees.
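The angle-filter-then-nearest-wall rule can be sketched as follows; the plane representation and function name are illustrative assumptions, not the paper's code.

```python
import numpy as np

def assign_element_to_wall(elem_pos, elem_normal, walls, theta0_deg=30.0):
    """Return the index of the wall an element belongs to, or None.

    walls: list of (unit_normal, d) for planes n.x + d = 0. Walls whose
    normal differs from the element's by theta_0 or more are filtered
    out; among the rest, the wall at minimum distance wins."""
    best_idx, best_dist = None, np.inf
    for idx, (n, d) in enumerate(walls):
        cos_ang = np.clip(abs(np.dot(elem_normal, n)), 0.0, 1.0)
        if np.degrees(np.arccos(cos_ang)) >= theta0_deg:
            continue                                # orientation mismatch
        dist = abs(np.dot(n, elem_pos) + d)         # point-to-plane distance
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx
```

The angle filter is what prevents a door near a corner from being attached to the perpendicular wall that happens to be slightly closer.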


BIM Format 3D Model Generation Based on Geometry Information
Finally, we developed a plug-in based on Revit 2018 for BIM format 3D model generation. As Figure 13 shows, the input is the information extracted in the previous stage, and the BIM format models are generated automatically. The first line stores a sequence of points with the format (X, Y), which represents the corner coordinates of the wall boundary in the 2D projected mapping. The second line is the height of the wall, calculated from the height difference between the floor plane and the ceiling plane. The third line stores the information of the doors, with the position coordinates (X, Y), height, and width. The last line stores the information of the windows, which includes the position coordinates (X, Y, Z) as well as the height and width of the corresponding window element.
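A serializer for this four-line intermediate file can be sketched as below. The paper does not specify the plug-in's exact delimiters, so the comma/space separators here are assumptions.

```python
def write_plugin_input(path, corners, wall_height, doors, windows):
    """Write the four-line text file consumed by the Revit plug-in.

    corners: [(x, y), ...]           wall boundary corners, 2D mapping
    doors:   [(x, y, h, w), ...]     position (X, Y), height, width
    windows: [(x, y, z, h, w), ...]  position (X, Y, Z), height, width"""
    lines = [
        " ".join(f"{x:.3f},{y:.3f}" for x, y in corners),
        f"{wall_height:.3f}",
        " ".join(f"{x:.3f},{y:.3f},{h:.3f},{w:.3f}" for x, y, h, w in doors),
        " ".join(f"{x:.3f},{y:.3f},{z:.3f},{h:.3f},{w:.3f}"
                 for x, y, z, h, w in windows),
    ]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
```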


Experimental Tests and Discussion
In this section, three experiments were conducted in three different classrooms in Block Z of the Hong Kong Polytechnic University to test the performance of our proposed method. The operator collects the raw color and depth images by holding the hardware (shown in Figure 2) in hand and walking around the room along a given route to cover as much area as possible. The metrics used to validate the performance of the algorithm include the accuracy of element detection, the length measurement accuracy of the room dimensions, and the area measurement accuracy of the main reconstructed elements. The actual values of the measured dimensions were collected manually with a range finder, and the actual values of the measured areas were calculated manually from the correct dimensions. Moreover, the efficiency of the proposed method is evaluated by comparison with the TLS-based method and the range-finder-based method in terms of time consumption and manual load. The values of the parameters used in the test are listed in Table 1.

Figure 14 shows the detailed processes of the three experiments. Each of the three rooms has one ceiling and one floor. The room in the first experiment has two windows, two doors, and eight different walls; the second room has three windows, two doors, and eight walls; and the last room, which is more complex, has three windows, two doors, and ten different walls. Figure 15 presents the BIM format 3D models generated by the proposed method as well as the ground truth generated manually by a skilled modeler.

Firstly, we validate the element extraction accuracy of the proposed method, and the results are shown in Table 2. The results indicate that our proposed method extracts all the element objects, even some walls with small dimensions, such as the one shown in the red circle area in Figure 14e.
Secondly, we compare the measured room dimensions with the actual values measured by the range finder. Figure 16 shows the details of the comparison: the first row gives the dimensions measured by our proposed method, and the second row gives the true values. A quantitative analysis is shown in Table 3. For each room, we measure the width and length of the room, which determine the size of the room, as well as two other dimensions for evaluation. As Table 3 shows, the average accuracy for the three experiments is 98.6%, 98.4%, and 98.6%, respectively, with a maximal error of 214 mm and a minimal error of 20 mm. The semantic segmentation based on deep learning can effectively extract the classes of elements in each frame, which makes the recognition more robust compared to traditional methods.

Thirdly, considering that area is one of the most important attributes of BIM elements, we compare the area measurements of the extracted elements with the actual values, and the results are shown in Table 4.
The average area measurement accuracy for the three experiments is better than 91.9%, with the best, 96.5%, achieved in experiment three. For all three experiments, the area measurements of the walls, ceiling, and floor are highly accurate, better than 92.2%. The measured accuracies of the doors and windows range from 74.7% to 96.9%, which is not as good as for the other elements. The reason is that the true areas of the windows and doors are small, which makes the accuracy more sensitive to measurement errors.

Fourthly, we test the performance of the proposed method in "narrow" wall extraction. In this test, we treat a wall whose length is less than three meters as a "narrow" wall. There are four, four, and six narrow walls, respectively, in the three experiments. As shown in Figure 17, all the narrow walls, with lengths ranging from 319 mm to 2554 mm, are detected by the proposed algorithm. This is because the algorithm developed in this paper significantly reduces the influence of the over-detection problem and prevents the removal of the point cloud of narrow walls. Table 5 shows the quantitative analysis of the measured narrow walls. Apart from some very narrow walls (length less than 400 mm) and individual cases, the accuracy of most measurements is better than 80%, with average measured accuracies of 75.3%, 81.3%, and 80.5% for the three experiments. There are two reasons why the measured results for narrow walls are not as good as the measured room sizes. The first is that the lengths of some walls are very small, which makes the accuracy sensitive to measurement error. The second is that the accumulated error of the SLAM system causes a significant error at the end of the frame sequences. The closure error between the start frames and end frames makes the extraction of narrow walls more challenging, especially when the point cloud is of low quality.
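The accuracy figures quoted above can be reproduced with a simple relative-error metric. The paper does not state its formula explicitly, so this sketch assumes the standard definition.

```python
def measurement_accuracy(measured, actual):
    """Relative accuracy of one measurement: 1 - |measured - actual| / actual."""
    return 1.0 - abs(measured - actual) / actual

def mean_accuracy(pairs):
    """Average accuracy over (measured, actual) pairs, e.g. one experiment."""
    return sum(measurement_accuracy(m, a) for m, a in pairs) / len(pairs)
```

Under this definition, a fixed absolute error hurts a 400 mm wall far more than a 15 m room dimension, which is consistent with the lower accuracies reported for narrow walls.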
Finally, we compare the time consumption of our proposed framework with the conventional TLS-based method and the manual surveying method. In the manual surveying method, the operator measures the dimensions of all the required elements, such as the height of the room, the sizes of the doors, and the lengths of the walls, and creates the BIM format model from the collected information, without a colored point cloud or 3D mesh output. The TLS used in the test is a Leica ScanStation 2, which is interoperable with the Leica System 1200. Table 6 shows the information about the datasets collected by the TLS and the Structure sensor for the three experiments. A downsampling operation with a factor of five is applied to the raw point cloud generated by the Structure sensor, because the frame overlapping makes the raw point cloud enormous. As Table 7 shows, using TLS to collect the dataset always costs more time and requires more manual load, because the setup of the equipment and the scanning process are typically time-consuming. The manual surveying method costs less time for data collection than the TLS-based method, but the data processing costs more time without the point cloud as a reference. Our proposed method costs much less time than those two methods, in both data collection and data processing. In data collection, the handheld RGB-D sensor we used does not require as much preparation work as TLS and does not need measurements to be recorded manually as in the manual surveying method. Also, the data collection in our framework is handled by only one operator. In data processing, our method is genuinely automatic, without any manual intervention, and the processing time for all three cases is around 30 s. With the improvements in these two aspects, the whole workflow is accelerated from about 200 min for TLS to about 17 min with our method.

Conclusions
In this paper, we proposed an automatic and efficient indoor AB BIMs generation framework using low-cost RGB-D sensors. Firstly, we calibrate the low-accuracy RGB-D sensor to increase the measurement accuracy and operating range. Then, a deep-learning-based method is used for the semantic 3D reconstruction of the indoor environment, with the color and depth image pairs as input. This method is more effective and economical than traditional manual segmentation methods. We also design a new procedure to transform the unstructured 3D point cloud from a low-cost RGB-D sensor into a BIM format 3D model.
The experimental results indicate that this method is robust and handles the noisy 3D point cloud with acceptable accuracy: an average accuracy of 98.6% for dimension reconstruction, 93.6% for area reconstruction, and about 80% for narrow-wall dimension reconstruction. The total time consumption for the three experiments is reduced from 200 min to 16.7 min compared to the traditional manual TLS-based method. In detail, the time required for data collection is reduced from 170 min to 15 min, and the time required for data processing is reduced from 30 min to 1.7 min. Thus, the framework proposed in this paper, which uses a low-cost and portable RGB-D sensor to replace the costly TLS for collecting the 3D indoor dataset, provides a potential solution for AB BIMs generation. The next step of this research will address the extraction of attribute information (e.g., material and functions) of the individual construction components, as well as the extraction of more complex construction components (e.g., furniture and appliances). We will also apply the proposed framework to other sensors, such as the Structure Mark II and the Microsoft Azure Kinect DK, to obtain better results.