Semantic Structure from Motion for Railroad Bridges Using Deep Learning

Abstract: Current maintenance practices consume significant time, cost, and manpower; thus, a new maintenance technique is required. Construction information technologies, including building information modeling (BIM), have recently been applied to the field to carry out systematic and productive planning, design, construction, and maintenance. Although BIM is increasingly being applied to new structures, its application to existing structures has been limited. To apply BIM to an existing structure, a three-dimensional (3D) model that accurately represents the as-is status of the structure must be constructed and each structural component must be specified manually. This study proposes a method that constructs the 3D model and specifies the structural components automatically using photographs taken with a camera installed on an unmanned aerial vehicle. This procedure is referred to as semantic structure from motion (SSfM) because it constructs a 3D point cloud model together with semantic information. A validation test was carried out on a railroad bridge to evaluate the performance of the proposed system. The average precision, intersection over union, and BF scores were 80.87%, 66.66%, and 56.33%, respectively. The proposed method could improve the current scan-to-BIM procedure by generating the as-is 3D point cloud model and specifying the structural components automatically. A detailed explanation of the proposed system, the validation test, and a discussion are presented.


Introduction
Civil infrastructure such as roads, railroads, and bridges plays an important role in human activities; thus, it is important to ensure a long service life through proper maintenance. Current maintenance practice relies on manpower to inspect the exterior of a structure and check for damage, deterioration, and erosion. Such manual inspection is inefficient in terms of time, cost, and manpower. Thus, techniques that can improve current maintenance practice are being introduced.
Recent developments in information technology (IT) have influenced the civil engineering domain, including the maintenance field. In particular, numerous studies have been carried out to replace the exterior survey dependent on manpower in the prior maintenance techniques through sensors or images. Yoon et al. [1] carried out health monitoring of structures using drones and imaging equipment. Cha et al. [2] carried out a study to automatically discriminate cracks on concrete surfaces using artificial intelligence. Narazaki et al. [3] reported automatic recognition of structural elements using artificial intelligence. Lee et al. [4] automatically extracted bridge design parameters based on point cloud data (PCD). Park et al. [5] predicted the dynamic characteristics of structures using image data.
Building information modeling (BIM) has recently been integrated into the field to conduct systematic and effective structure planning, design, construction, and maintenance. In Korea, a BIM guideline, the "BIM application guide for architecture", has been established and is required for all projects with a total construction cost over 50 million dollars [6].

Background
Computer vision, a field that enables computers to recognize and analyze visual information, has been continuously developed [17][18][19][20][21]. Deep learning techniques have recently produced numerous applications in computer vision [22][23][24]. In particular, the convolutional neural network (CNN), a model that classifies images automatically, attracted wide attention at the image recognition contest "ImageNet Large-Scale Visual Recognition Challenge" (ILSVRC) held in 2012. This study used a CNN-based semantic segmentation algorithm, Deeplab-V3+ [25], to classify each image pixel into a bridge component.

Semantic Segmentation Using Deep Learning
Semantic segmentation is a method of predicting and displaying the semantic information of an image by classifying the image pixel by pixel using a CNN. The general structure of a CNN is shown in Figure 1. The CNN extracts convolution features by applying convolution and pooling layers alternately to the input image and classifies the extracted features using a fully connected layer.
In this study, semantic segmentation was carried out using Deeplab-V3+. The structure of the Deeplab-V3+ model is shown in Figure 2. Deeplab-V3+ separates the network into an encoder and a decoder to overcome a disadvantage of the CNN, which loses location information owing to the loss of spatial dimensions while passing through the fully connected layer. The encoder extracts features at an arbitrary resolution through atrous convolution applied to a deep convolutional neural network (DCNN); the output stride, the ratio of the resolution of the input image to the resolution of the output image, controls this resolution. The decoder reduces the number of channels by performing a 1 × 1 convolution on the final output of the encoder and concatenates the result after bilinear upsampling. Through this process, the decoder efficiently preserves object segmentation details [25].
Deeplab-V3+ has the advantage of being able to use a pretrained deep learning model. In this study, ResNet-50 was used as the backbone together with Deeplab-V3+. ResNet-50 uses residual learning to address the vanishing gradient problem, in which gradients shrink toward zero as they pass through successive layers. The structures of the conventional convolution layer and the residual learning block are shown in Figure 3.
Residual learning adds a residual connection (or skip connection) to a conventional convolution block: the input is added to the output of a stack of two convolution layers. Figure 3a describes the structure of the existing CNN. Back-propagation training is performed to obtain weights that minimize the difference between the predicted value H(x) of the network and the target value (label) of the training data. Figure 3b shows the structure of the residual learning block of ResNet-50. The hypothesis is that approximating H(x) by F(x) + x is easier than approximating H(x) directly with a complex nonlinear function. When H(x) ≈ F(x) + x, we have F(x) ≈ H(x) − x; learning F(x) therefore means learning the difference (residual) between the output H(x) and the input x. Learning the optimal F(x) amounts to driving F(x) toward 0, and because x is passed through the skip connection unchanged, there is no increase in the amount of computation [26].
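The identity shortcut can be illustrated numerically. The following toy example (plain Python, not the actual ResNet-50 implementation) shows that when the learned residual F(x) is the zero function, the block reduces to the identity mapping, which is easy to learn:

```python
# Minimal sketch of a residual (skip) connection: the block computes a
# learned transform F(x) and adds the unchanged input x back, so the
# layers only have to model the difference H(x) - x.

def residual_block(x, f):
    """Apply a learned transform f and add the identity shortcut."""
    return [fi + xi for fi, xi in zip(f(x), x)]

# Toy "learned" transform: when F(x) is the zero function, the block
# output equals the input exactly (the identity mapping).
zero_f = lambda x: [0.0 for _ in x]
x = [1.0, 2.0, 3.0]
print(residual_block(x, zero_f))  # [1.0, 2.0, 3.0]
```

Because the shortcut is a plain addition, the gradient of the loss with respect to x always has a direct path around the convolution stack, which is what mitigates the vanishing gradient.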

SfM
SfM is one of the most popular methods for generating a 3D model from image data [27][28][29]. The first step in SfM is to identify correspondences between images. SfM uses the scale-invariant feature transform (SIFT) and speeded-up robust features (SURF) algorithms to find feature points in the photographs and identify correspondences between them [30,31]. However, not all of the obtained feature points belong to a true correspondence: outliers, in which the feature points do not actually coincide, can exist. SfM therefore uses random sample consensus (RANSAC), a method that repeatedly fits a model to randomly selected sample data and keeps the model with the largest consensus, to remove outliers. As shown in Figure 4, a feature model of each photograph was obtained by applying RANSAC and a feature matching process was performed to compare the feature models. In this process, if the points match through feature matching, they are treated as a correspondence; if not, they are regarded as outliers and removed.
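The RANSAC idea can be illustrated with a minimal line-fitting sketch; this is a generic illustration of the sample-and-consensus loop, not the actual implementation used in the SfM pipeline:

```python
import random

def ransac_line(points, iters=200, tol=0.5, seed=0):
    """Fit y = a*x + b by repeatedly sampling 2 points and keeping the
    model with the largest consensus set (inliers within tol)."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # degenerate sample, cannot define a line
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) < tol]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers

# 8 points on y = 2x plus two gross outliers that RANSAC should reject.
pts = [(x, 2.0 * x) for x in range(8)] + [(1.0, 40.0), (5.0, -30.0)]
(a, b), inliers = ransac_line(pts)
print(round(a, 2), round(b, 2), len(inliers))  # ~2.0, ~0.0, 8 inliers
```

In SfM the "model" is an epipolar or homography constraint between two images rather than a line, but the consensus logic is the same.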

When the correspondences between the photographs have been identified through the above process, the camera positions can be estimated and the feature points can be expressed as 3D PCD. However, 3D PCD generated only from feature points cannot express the full shape of an object. Therefore, a 3D dense reconstruction technique, which interpolates using the pixels around the feature points, is used to obtain dense 3D PCD. Normalized cross-correlation (NCC) and inverse distance weighting are used.
NCC measures geometric similarity by comparing the red-green-blue (RGB) values of pixels. The 3D dense reconstruction technique refers to the pixels surrounding a feature point; their RGB values are extracted using a filter of a given pixel size, and the NCC is computed using Equation (1):
NCC = [1 / (n σf σt)] Σx,y (f(x,y) − f̄)(t(x,y) − t̄), (1)
where n is the size of the filter, f(x,y) is the RGB value at the coordinates (x,y) of the filter, t(x,y) is the RGB value at the coordinates (x,y) of the comparison filter, f̄ is the average RGB value of the filter, t̄ is the average RGB value of the comparison filter, σf is the standard deviation of the RGB values of the filter, and σt is the standard deviation of the RGB values of the comparison filter. NCC has a value between −1 and 1; if NCC is close to 1, the two filters are similar. However, because the NCC does not include the distance information of the photograph, high reliability cannot be ensured by NCC alone, and a distance-dependent weight is required. Inverse distance weighting is therefore applied, expressed by Equation (2):
Zp = Σi Wi Zi / Σi Wi, i = 1, …, n, (2)
where Zp is the estimated (interpolated) value at the estimation point, Zi is the reference value at the position (xi, yi), Wi is the weight, and n is the number of reference values. The weight Wi can be calculated using Equation (3), where di is the distance between the estimation point and the reference point:
Wi = 1 / di². (3)
Through this process, interpolation is performed in regions where no feature point exists, and dense 3D PCD can be generated from the result.
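The two interpolation ingredients can be sketched as follows. This is a simplified single-channel illustration in the spirit of Equations (1)–(3), not the reconstruction software's actual code:

```python
import math

def ncc(f, t):
    """Normalized cross-correlation of two equal-size pixel lists:
    mean-centred dot product divided by n * sigma_f * sigma_t."""
    n = len(f)
    mf, mt = sum(f) / n, sum(t) / n
    sf = math.sqrt(sum((v - mf) ** 2 for v in f) / n)
    st = math.sqrt(sum((v - mt) ** 2 for v in t) / n)
    return sum((a - mf) * (b - mt) for a, b in zip(f, t)) / (n * sf * st)

def idw(samples, px, py, power=2):
    """Inverse-distance-weighted interpolation at (px, py) from
    (x, y, value) samples: Z_p = sum(W_i*Z_i)/sum(W_i), W_i = 1/d_i^power."""
    num = den = 0.0
    for x, y, z in samples:
        d = math.hypot(x - px, y - py)
        if d == 0:
            return z  # exactly at a sample point
        w = 1.0 / d ** power
        num += w * z
        den += w
    return num / den

print(ncc([10, 20, 30], [11, 21, 31]))  # identical up to a shift -> 1.0
print(idw([(0, 0, 0.0), (2, 0, 10.0)], 1, 0))  # midpoint -> 5.0
```

Note that NCC is invariant to a constant brightness offset (the second patch is the first plus 1), which is why it is preferred over raw pixel differences for matching.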


System Development
This study developed a semantic structure from motion (SSfM) system that automatically classifies image pixels into bridge components using deep learning and generates a 3D point cloud model while preserving the information on bridge components such as piers and girders. Figure 5 shows an overview of the proposed system. Each component of the system is described in detail below.

Bridge Component Classification Using Semantic Segmentation
The first step of the proposed system is to assign semantic information on the bridge components to each pixel of the two-dimensional (2D) images. In this study, information on bridge components was assigned by performing transfer learning based on Deeplab-V3+. Transfer learning is a method of constructing a network from a pretrained deep learning model and can achieve high accuracy by starting from a high-performance pretrained model. In this study, transfer learning was performed using Deeplab-V3+/ResNet-50 and a network was developed to classify the components of the bridge. The network classifies each pixel of an image into 10 classes: pole, building, girder, pier, ground, grass, water, sky, car, and road. The eight classes other than girder and pier, which are components of the bridge, were later merged into a background class. A total of 245 photographs with a resolution of 800 × 600 pixels were used as training data. Among them, 103 were collected through Google Street View, while 142 were obtained by unmanned aerial photography at the Osong 5th test track bridge in Nojang-ri, Jeon-myeon, Sejong. The equipment for unmanned aerial photography consisted of a DJI Inspire 2 and a Zenmuse X5S, as shown in Figure 6. The specifications of the equipment are listed in Tables 1 and 2.
Figure 5. Overview of the proposed SSfM system.


To train the semantic segmentation network, a labeling process is required to designate which bridge component each pixel of a photograph belongs to. In this study, pixel label data were generated using the MATLAB Image Labeler. The labels comprise 10 classes: pole, building, girder, pier, ground, grass, water, sky, car, and road. Figures 7 and 8 show the photographic data and label data used for training. In addition, the number of pixels in each class was used as a weight to balance the classes.
The holdout method was used for network training and verification, as shown in Figure 9. For the training data, 80% of the 245 images obtained from the Osong 5th test track and Google Street View were randomly selected; the remaining 20% were used for network self-validation. The hyperparameters used for training are listed in Table 3. The batch size was set to 5 and the number of epochs to 500. The Adam optimizer was used for optimization with an initial learning rate of 0.001, and the learning rate was multiplied by 0.3 every 10 epochs.
The configuration of the deep learning network used for training is shown in Figure 10.
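The holdout split, class weighting, and learning-rate schedule described above can be sketched as follows. The median-frequency weighting convention and the function names are illustrative assumptions, not the exact MATLAB code used in the study:

```python
import random

def holdout_split(items, train_frac=0.8, seed=42):
    """Random 80/20 holdout split, as used for the 245 training images."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def inverse_frequency_weights(pixel_counts):
    """Class weights from per-class pixel counts so rare classes count more:
    weight_c = median(counts) / count_c (one common convention)."""
    counts = sorted(pixel_counts.values())
    median = counts[len(counts) // 2]
    return {c: median / n for c, n in pixel_counts.items()}

def lr_at_epoch(initial_lr, epoch, drop=0.3, every=10):
    """Step schedule: multiply the learning rate by `drop` every `every` epochs."""
    return initial_lr * drop ** (epoch // every)

train, val = holdout_split(list(range(245)))
print(len(train), len(val))  # 196 49
print(lr_at_epoch(0.001, 0))   # 0.001
print(round(lr_at_epoch(0.001, 20), 7))  # 0.001 * 0.3^2 -> 9e-05
```

The weight dictionary would typically be passed to the loss function so that abundant classes such as sky or grass do not dominate training.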

Construction of 3D PCD Using SfM
The trained network classifies images into 10 classes, including bridge components. Eight classes that are not bridge components were integrated into the background, resulting in three classes: girder, pier, and background. The bridge components classified using the above method are in 2D images and not in a 3D point cloud model. To convert the semantic information from a 2D image to a 3D point cloud model, an SfM technique, as shown in Figure 11, was applied.
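The class-merging step can be sketched as a simple label remap; the nested-list label representation here is an illustrative assumption:

```python
# Collapse the 10 predicted classes to 3 (girder, pier, background),
# as done before building the 3D PCD. Class names follow the paper.

TEN_CLASSES = ["pole", "building", "girder", "pier", "ground",
               "grass", "water", "sky", "car", "road"]
KEEP = {"girder", "pier"}

def merge_to_three(label_image):
    """Map a 2-D grid of class-name labels to girder/pier/background."""
    return [[lab if lab in KEEP else "background" for lab in row]
            for row in label_image]

labels = [["sky", "girder", "girder"],
          ["grass", "pier", "road"]]
print(merge_to_three(labels))
# [['background', 'girder', 'girder'], ['background', 'pier', 'background']]
```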


The SfM technique, which generates a 3D point cloud model from 2D images, finds location and geometry information by using key points common to several images. Key points are matched by SIFT using image intensity information. In this study, the label data (the semantic information) were overlaid on the photographs so that both location information and semantic information could be reflected. The key points were found using the intensity information of the original photographs, and 3D location information was obtained for each pixel of the image. After the location information was found, the semantic information was converted to an RGB value and visualized with a transparency of 50% over the original image data. Through this process, the 3D PCD of the bridge could be visualized together with the bridge component information.


Validation Test
To verify the performance of the proposed SSfM system, an experiment was carried out using test data collected from the Osong 3rd test track in Osong-eup, Heungdeok-gu, Cheongju-si, Chungcheongbuk-do, Korea. A total of 183 test images were collected from the Osong 3rd test track using drones, and the semantic segmentation network developed in this study was applied to automatically classify the components of the bridge. SfM was then applied to the 183 images with semantic information, and a 3D point cloud model including information on the components of the bridge was obtained.
In general, a confusion matrix, as shown in Table 4, is used to evaluate the result of semantic segmentation. Measures computed from the confusion matrix include accuracy, intersection over union (IoU), precision, recall, and F1 score. Accuracy is a measure used to intuitively evaluate the performance of a classification model, as shown in Equation (4):
Accuracy = (TP + TN) / (TP + TN + FP + FN). (4)
However, on data with unbalanced classes, accuracy can give a distorted picture of model performance, so other measures must also be used.
IoU is an intuitive measure that evaluates a classification model as the ratio of the intersection to the union of the predicted and true regions. It is the measure most often used to evaluate prediction results in semantic segmentation and object detection:
IoU = TP / (TP + FP + FN). (5)
Precision and recall are usually used together and exhibit an inverse relationship. Because each measure has drawbacks on its own, they are combined in the F1 score, an index representing the harmonic mean of precision and recall that can evaluate model performance accurately even on unbalanced data. Precision, recall, and the F1 score (denoted BF) are defined in Equations (6)-(8), respectively:
Precision = TP / (TP + FP), (6)
Recall = TP / (TP + FN), (7)
BF = 2 × Precision × Recall / (Precision + Recall). (8)
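The standard accuracy, IoU, precision, recall, and F1 formulas above can be computed directly from confusion-matrix counts; the counts in the usage example are illustrative, not values from the paper:

```python
def segmentation_metrics(tp, fp, fn, tn):
    """Per-class evaluation measures from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "iou": iou,
            "precision": precision, "recall": recall, "f1": f1}

m = segmentation_metrics(tp=80, fp=10, fn=20, tn=90)
print(round(m["iou"], 3), round(m["f1"], 3))  # 0.727 0.842
```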
The proposed semantic segmentation network classified the 183 test images collected from the Osong 3rd test track. The results are shown in Figure 12 and Table 5. The bridge component classification network proposed in this study was able to automatically classify the girder and pier in the images, with an average accuracy of approximately 80%, an IoU of 66%, and a BF score of approximately 56%. After applying the 183 images to the bridge component classification network, 3D PCD were generated by applying SfM. The results are shown in Figure 13 and Table 6. The semantic segmentation results were successfully expressed in the 3D point cloud model: the average precision was approximately 74%, the IoU approximately 65%, and the BF score approximately 55%. Additional error was expected in SfM, which converts 2D images to a 3D point cloud model, so the SfM results were expected to be lower than the 2D segmentation results. However, because the semantic segmentation results of several images are averaged into a single 3D point, the IoU and BF score of a specific class (pier) slightly increased.


Conclusions and Discussion
In this study, an automatic procedure for generation of a 3D point cloud model that contains bridge component information was proposed using deep learning and computer vision. The verification test was carried out at the Osong 3rd test track located in Osong-eup, Heungdeok-gu, Cheongju-si, Korea, by applying the proposed technique to the collected images. The proposed method was able to automatically generate a 3D point cloud model containing information on bridge components with an accuracy of 74.23%, IoU of 65.90%, and average BF score of 55.59%.
Lee et al. [4] conducted a study to automatically extract bridge design parameters from 3D point cloud data, and the results showed high reliability. However, that study used LiDAR to acquire the 3D point cloud data, and the bridge must be shut down in order to obtain 3D point cloud data with LiDAR. To solve this problem, this study collected 2D image data with an unmanned aerial vehicle and generated the 3D point cloud data from the 2D images.
It was confirmed that the proposed method suffers from error accumulation through the stages of the SSfM process. However, if these errors are minimized in the future by using improved deep learning models and larger training datasets, it is expected that the time and cost of modeling existing structures for BIM can be reduced.