Change Detection in Unmanned Aerial Vehicle Images for Progress Monitoring of Road Construction

Currently, unmanned aerial vehicles are increasingly being used in various construction projects such as housing developments, road construction, and bridge maintenance. If a drone is used at a road construction site, elevation information and orthoimages can be generated to acquire the construction status quantitatively. However, the detection of detailed changes in the site owing to construction depends on visual video interpretation. This study develops a method for automatic detection of the construction area using multitemporal images and a deep learning method. First, a deep learning model was trained using images of the changing area as reference. Second, we obtained an effective application method by applying various parameters to the deep learning process. The application of the time-series images of a construction site to the selected deep learning model enabled more effective identification of the changed areas than the existing pixel-based change detection. The proposed method is expected to be very helpful in construction management by aiding in the development of smart construction technology.


Introduction
The fields where drones can be applied are very diverse, such as terrain information construction, cadastral surveying, disaster management, environmental monitoring, inspection of various facilities, and exploration (mineral and gas) [1,2]. The field of construction involves the merging of design and construction information, visualization using the overlay of orthoimages and two-dimensional (2D) drawings, digital work, three-dimensional (3D) modeling and process comparison based on the process progress, construction quantity confirmation, and workload distribution; in this field, unmanned aerial vehicle (UAV) images and products are being used [3,4]. The life cycle of a road can be divided into four stages: planning, design, construction, and maintenance. In the planning, design, and maintenance stages, the use of drone images has developed, and drones can be used in a less time-consuming and a cost-effective manner in the construction stage [5].
In drone-based construction management, construction supervisors can additionally check whether the construction site has been constructed according to the design specification using a drone [6]. If drone technology is applied to the advanced payment tasks that are performed every month in construction projects, more accurate constructions can be achieved, and construction progress can be recorded through construction history management, which can be used for future maintenance or accident analysis.
UAV images are being used for point cloud extraction, digital surface and terrain model production, orthoimages, topographic mapping, and 3D model production on construction sites [4,7]. Airsight's UAV-based next-generation airport road safety inspection
For construction management using drones, the road construction process must be photographed periodically with a UAV, developed into a 3D model, and recorded as a time series. To analyze the road construction process through the time series, a standard image is required. In an area with trees, the ground cannot be analyzed using a UAV photo, and accurate ground shape data cannot be obtained. If the beehive removal process and the earthmoving process are performed sequentially at the construction site, the overall process flow must be properly managed to ensure that the drone photography is performed after the removal of beehives and before the earthwork process.
Individual photos, orthoimages, and videos can be utilized as a basic output for UAV-based time-series monitoring. Individual photos are single photos taken by a UAV, which can be used to check the condition of an object. When an image is acquired with the same path and geometry by the same sensor during a subsequent acquisition, the site change can be estimated through comparison between individual photos. The digital cameras, global navigation satellite systems, and inertial measurement units currently used in UAVs generally have similar performances; however, if real-time kinematic positioning is supported by a UAV [26], differences between multitemporal images for detecting changes can be diminished in photographs acquired using the same flight path. The ground sampling distance (GSD) in a UAV photo is the most important setting factor in UAV photogrammetry for understanding the construction situation, and it is deeply related to the UAV flight altitude. It is necessary for the appropriate GSD of an image to be identified according to the road facilities and for characteristics to be distinguished at the construction site.
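The relationship between GSD and flight altitude mentioned above can be sketched with the standard pinhole relation for a nadir-looking camera over flat terrain, GSD = altitude × pixel pitch / focal length. The focal length and pixel pitch used below are hypothetical placeholder values, not the specifications of the camera used in this study.

```python
def ground_sampling_distance(altitude_m, focal_length_mm, pixel_pitch_um):
    """Return the GSD in cm/pixel for a nadir image over flat terrain."""
    pixel_pitch_m = pixel_pitch_um * 1e-6
    focal_length_m = focal_length_mm * 1e-3
    return altitude_m * pixel_pitch_m / focal_length_m * 100.0


def altitude_for_gsd(target_gsd_cm, focal_length_mm, pixel_pitch_um):
    """Return the flight altitude in metres that yields the target GSD."""
    pixel_pitch_m = pixel_pitch_um * 1e-6
    focal_length_m = focal_length_mm * 1e-3
    return (target_gsd_cm / 100.0) * focal_length_m / pixel_pitch_m


# Hypothetical camera parameters (assumptions for illustration only).
FOCAL_LENGTH_MM = 10.6
PIXEL_PITCH_UM = 2.4

altitude = altitude_for_gsd(4.0, FOCAL_LENGTH_MM, PIXEL_PITCH_UM)
print(f"Altitude for a 4 cm GSD: {altitude:.1f} m")
```

Because the relation is linear, doubling the altitude doubles the GSD, which is why the flight height has to be chosen according to the smallest road facility that must remain distinguishable.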
An orthoimage refers to an image in which orthogonal projection has been performed, and the geometric distortion of the terrain caused by elevation in the image acquired by a UAV is corrected. In addition, all topographic features are converted into the image with a vertical view using a digital surface model (DSM). In a DSM, regular grid data represent the height of the ground surface and the artificial features of buildings, trees, and vegetation. The orthoimages are produced through differential rectification using commercial software and can be georeferenced and used to obtain the current status map of topographical features [27]. In particular, even when there are differences in the camera, flight path, and acquisition geometry between images, the image is converted into the same field of view for more comprehensive use in the detection of changes.

Introduction to Convolutional Siamese Metric Networks
Machine learning enables computers to learn and act like humans by providing them with data and information in the form of observations and real-world interactions, so that their learning improves autonomously over time. Deep learning is a form of machine learning in which a model is trained on a considerable amount of data and then used to classify data. In recent years, deep learning has performed better than before in various fields, and it is particularly useful in image-based applications [28]. In particular, time-series analysis for image classification and change detection is a promising area of application for deep learning. Unsupervised deep learning change detection can be performed using a deep learning model that is generally used for semantic segmentation. Some existing studies were conducted by adapting networks applied to image classification, such as SegNet, U-Net, Deeplab-V3+, and Siamese networks [29].
Siamese networks are a type of neural network that contains multiple identical subnetwork components, as shown in Figure 1 [30]. The subnetworks share the same model characteristics, such as configuration, parameters, and weights. A conventional CNN contains convolutional, pooling, and fully connected layers. The convolutional layers extract hierarchical features from the input image. The pooling layers enlarge the receptive field and reduce dimensionality, which reduces the size of the output feature maps. The fully connected layer uses the results of the convolution/pooling layers to predict the class of the input image.
The similarity between the feature vectors of the input images can be measured using distance metrics, such as those induced by the norms, or with a similarity function such as cosine similarity [31]. In this study, we used the Euclidean distance and the contrastive loss function introduced by Chopra et al. during the training phase of the convolutional Siamese network [22,32]. Let $X = \{x(i, j) \mid 1 \le i \le h, 1 \le j \le w\}$ be an image, and let $X_1$ and $X_2$ be two input images, each of size $h \times w \times c$, where $w$ and $h$ are the spatial dimensions and $c$ is the channel dimension of the input image. The parameterized distance function to be learned, $D_W$, between $X_1$ and $X_2$ is defined as the Euclidean distance between the outputs of $G_W$:
$$D_W(X_1, X_2)_{i,j} = \left\| G_W(X_1)_{i,j} - G_W(X_2)_{i,j} \right\|_2$$
Here, $G_W(X_1)$ and $G_W(X_2)$ are the output vector tensors, and $G_W(X_1)_{i,j}$ and $G_W(X_2)_{i,j}$ are the feature vectors of the pixel with location $(i, j)$ in image $X$. $D_W(X_1, X_2)_{i,j}$ is written as $D_{i,j}$ for simplicity.
During the training phase, we use the contrastive loss function, which can be defined as follows:
$$L\left(W, (Y, X_1, X_2)^k\right) = \sum_{i,j} \frac{1}{2}\left[ \left(1 - y(i, j)\right) D_{i,j}^2 + y(i, j) \max\left(0, m - D_{i,j}\right)^2 \right]$$
where $Y$ is the binary ground-truth map assigned to the input image pair, and $y(i, j) = 0$ if the corresponding pixel pair is considered to be similar; otherwise, $y(i, j) = 1$ if it is considered different. $(Y, X_1, X_2)^k$ is the $k$th training sample pair with labeling. $m > 0$ is a constant called a margin, and it was set to two in the experiment. Changed pairs contribute to the loss function only if their parameterized distance is within this margin.
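The pixel-wise distance and contrastive loss described above can be illustrated with a small sketch in plain Python. It assumes the commonly used form of the loss with a factor of 1/2 on both terms; the function names and the nested-list data layout are hypothetical, and a real implementation would operate on tensors in a deep learning framework.

```python
import math


def pixel_distance(f1, f2):
    """Euclidean distance between two feature vectors of one pixel."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))


def contrastive_loss(feat1, feat2, labels, margin=2.0):
    """Sum the contrastive loss over all pixels of one image pair.

    feat1, feat2: h x w grids of feature vectors (outputs of the two branches).
    labels: h x w grid with 0 for unchanged and 1 for changed pixels.
    margin: the constant m (set to two in the experiment).
    """
    loss = 0.0
    for row1, row2, row_y in zip(feat1, feat2, labels):
        for f1, f2, y in zip(row1, row2, row_y):
            d = pixel_distance(f1, f2)
            unchanged_term = (1 - y) * d ** 2
            changed_term = y * max(0.0, margin - d) ** 2
            loss += 0.5 * (unchanged_term + changed_term)
    return loss


# Two pixels with identical distance but different labels.
feat_a = [[[0.0, 0.0], [0.0, 0.0]]]
feat_b = [[[3.0, 4.0], [3.0, 4.0]]]
print(contrastive_loss(feat_a, feat_b, [[0, 1]]))
```

In this example both pixels have a feature distance of 5. The unchanged pixel is penalized for being far apart, whereas the changed pixel adds nothing because its distance already exceeds the margin.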

Image Change Detection
To perform change detection, a pair of images from two periods is input into the convolutional Siamese network, and a feature pair is obtained for the input images. We calculate the dissimilarity of each feature pair using a predefined distance metric (the Euclidean, or L2, distance in this study). The contrastive loss function is applied to differentiate between unchanged and changed pairs. The change distance images, which are converted from the distances between the feature pairs, were enhanced for visualization contrast. As visible in the last column, "output," in Figure 2, the detected area is shown in colors such as green, yellow, orange, and red, whereas blue represents the unchanged area. Depending on the training data, the change detection result may not match the actual changed area; therefore, it is important to specify the extent of the changed area carefully. The flowchart of the proposed method is presented in Figure 2.
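As a rough illustration of how feature distances can be turned into a color-coded change distance image, the sketch below computes a per-pixel Euclidean distance map, rescales it to [0, 1], and assigns each pixel to one of five color classes running from blue to red. The equal-width classes and all names are assumptions made for illustration; the study itself uses a continuous rainbow color map.

```python
import math

# Colour classes ordered from "unchanged" (blue) to "strongly changed" (red).
COLOURS = ["blue", "green", "yellow", "orange", "red"]


def distance_map(feat1, feat2):
    """Per-pixel Euclidean distance between two h x w grids of feature vectors."""
    return [[math.dist(a, b) for a, b in zip(r1, r2)] for r1, r2 in zip(feat1, feat2)]


def rescale(dmap):
    """Min-max rescale a distance map to the range [0, 1]."""
    flat = [d for row in dmap for d in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in dmap]
    return [[(d - lo) / (hi - lo) for d in row] for row in dmap]


def colour_class(value):
    """Map a rescaled distance in [0, 1] to one of the equal-width colour classes."""
    index = min(int(value * len(COLOURS)), len(COLOURS) - 1)
    return COLOURS[index]


features_t1 = [[(0, 0), (0, 0)], [(0, 0), (0, 0)]]
features_t2 = [[(0, 0), (3, 4)], [(0, 1), (6, 8)]]
scaled = rescale(distance_map(features_t1, features_t2))
for row in scaled:
    print([colour_class(v) for v in row])
```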

When change detection was performed using only the Siamese network model trained on the general training data, satisfactory results were obtained; however, detection errors owing to construction equipment, automobiles, and shadows were observed in some images (Figure 3). To address the errors caused by automobiles, we can either use a technique that removes automobiles while generating orthoimages or exclude automobiles from the examples of changed areas during change detection training. Small-scale shadows were not recognized as changed areas, but large-scale shadows were incorrectly detected as changed areas. In particular, the change due to the growth of vegetation is the type of change considered in this study. To reduce such false positives, it is important to reflect the changes caused by construction correctly during the creation of training data.

Evaluation Metrics
We evaluate our network based on the test data by computing the F-measure, which is calculated using the precision (Pr) and recall (Re) [18]:
$$Pr = \frac{TP}{TP + FP}, \qquad Re = \frac{TP}{TP + FN}, \qquad F\text{-measure} = \frac{2 \cdot Pr \cdot Re}{Pr + Re}$$
where $TP$ is the number of true positives, $FP$ is the number of false positives, and $FN$ is the number of false negatives.
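The evaluation metrics above can be computed from a predicted change mask and a reference mask as in the following sketch. The function names and the nested-list mask layout (1 for changed, 0 for unchanged) are assumptions for illustration.

```python
def confusion_counts(predicted, reference):
    """Count true positives, false positives, and false negatives over two binary masks."""
    tp = fp = fn = 0
    for row_p, row_r in zip(predicted, reference):
        for p, r in zip(row_p, row_r):
            if p == 1 and r == 1:
                tp += 1
            elif p == 1 and r == 0:
                fp += 1
            elif p == 0 and r == 1:
                fn += 1
    return tp, fp, fn


def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure); undefined ratios are reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


predicted_mask = [[1, 1], [0, 0]]
reference_mask = [[1, 0], [1, 0]]
print(precision_recall_f(*confusion_counts(predicted_mask, reference_mask)))
```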

Study Area and Devices
The study area is the construction site of Pyeongtaek-West Pyeongtaek Road, Gyeonggido, South Korea. The site contains earthworks, drainage, and a structure (bridge), and some paving works are also in progress. The slopes within the site are changing owing to construction, and there are areas that require safety management, such as the slopes and roads to and from the construction site. The placement of construction equipment and materials around the construction site was confirmed. It is possible to identify materials or construction waste other than those involved in current construction processes, during UAV image acquisition.
To detect site changes, data were taken at different times using an eBee Plus (Table 1). The eBee Plus, a fixed-wing drone, weighs 1.1 kg and is 110 cm long, with a maximum flight time of 59 min. Fixed-wing drones use manual take-off and automatic landing methods, and the flight type follows an automatic route during flight based on route information generated in advance through eMotion. This drone supports various positioning accuracy correction functions, such as the real-time kinematic method.

Data Acquisition
A well-established flight plan is necessary to obtain high-quality field data. Before the flight, we set the acquisition area, checked the construction control points, selected the take-off and landing sites and the ground control points (GCPs) in advance, and set the flight path. In the field survey, the flight checklist established at the planning stage is confirmed through a field visit; the key points to be considered are the flight restriction factors (high-rise buildings, radio wave interference factors, flight obstacles, etc.) and the safety of the UAV. Important image parameters such as the flight height, image overlap, and ground resolution were also checked to ensure that image acquisition could be performed effectively. Three time-series images were acquired by reusing, in the subsequent flights, the route planned when the target area was first photographed.
The study site was confirmed by superimposing the blueprint provided by the road construction corporation on the satellite image and photographed by setting a width of 50 m based on the road center line and the outline of the road plan. The width of the test section is approximately 450 m, and the length is 3500 m.
When the flight area was selected, 14 GCPs and 5 check points (CPs) were surveyed by checking the location of the construction control point, leveling point, and integrated control point located in the study area. The GCPs were selected such that they were evenly distributed across the left and right sections of the construction site, and the control points were selected as far outside as possible because the road was in operation. The GCP survey was divided into a plane control point survey and an elevation control point survey. Figure 4 shows the study area, layout of the GCPs and CPs, and the imaging position. The ground spatial resolution was approximately 4 cm/pixel, the forward overlap was 80%, and the lateral overlap was 70%.
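To show how the reported GSD and overlaps translate into exposure and flight-line spacing, the following sketch applies the usual footprint relations. The image dimensions are a hypothetical 5472 × 3648 pixel sensor, and the long image side is assumed to lie across the flight direction; neither is stated in the text.

```python
def footprint_m(gsd_m, width_px, height_px):
    """Ground footprint (across-track, along-track) of one image in metres."""
    return gsd_m * width_px, gsd_m * height_px


def exposure_spacing_m(gsd_m, width_px, height_px, forward_overlap, side_overlap):
    """Return (distance between exposures, distance between flight lines) in metres."""
    across, along = footprint_m(gsd_m, width_px, height_px)
    photo_spacing = along * (1.0 - forward_overlap)
    line_spacing = across * (1.0 - side_overlap)
    return photo_spacing, line_spacing


# Reported values: GSD of about 4 cm, 80% forward and 70% lateral overlap.
# The image size is an assumption for illustration.
photo, line = exposure_spacing_m(0.04, 5472, 3648, 0.80, 0.70)
print(f"Exposure spacing: {photo:.1f} m, flight-line spacing: {line:.1f} m")
```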

Creation of an Orthoimage
To create an orthoimage using the captured images, "Pix4D mapper" with a concise user interface was used. The processing sequence of Pix4D mapper is divided into a total of six steps: photo input, GCP selection, tie point creation and aerotriangulation, point densification, DSM, and orthoimage production.
The orientation accuracy obtained using 14 GCPs demonstrated that the root mean square errors in the X, Y, and Z directions were 2.7 cm, 3.3 cm, and 10.6 cm, respectively, with a total error of 11.4 cm. At a GSD of approximately 4 cm, the orthoimages were generated by processing point clouds (Figure 5).
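The total error quoted above is consistent with combining the three component errors as a root sum of squares, as the short sketch below shows. The helper names are hypothetical, and the per-point residuals needed by `rmse` are not given in the text, so only the combination step uses reported values.

```python
import math


def rmse(residuals):
    """Root mean square error of a list of residuals (same unit as the input)."""
    return math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))


def total_rmse(rmse_x, rmse_y, rmse_z):
    """Combine per-axis RMSE values into a single 3D error."""
    return math.sqrt(rmse_x ** 2 + rmse_y ** 2 + rmse_z ** 2)


# Reported per-axis errors in cm.
print(f"Total error: {total_rmse(2.7, 3.3, 10.6):.1f} cm")
```

The vertical component dominates the total, which is typical for photogrammetric blocks, so improvements in Z accuracy would reduce the total error the most.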

Change Detection Implementation
Changed areas were extracted using a fully convolutional Siamese network, which learns from image pairs through deep learning and comprises two CNNs that share weights. The network model was refined from a model pretrained on the CDnet dataset [33]. The CDnet dataset consists of 91,595 image pairs from 31 indoor and outdoor scene videos. The CDRnet dataset that we created was used to fine-tune the pretrained network. The CDRnet dataset consists of 134 image pairs from multitemporal road construction orthoimages and was created with a size of 720 × 480 pixels in part of the test area, including the visually interpreted reference change detection data.
The proposed network was implemented using Facebook's PyTorch framework in a Linux environment. All of our experiments were performed on a Xeon 20-core CPU and two NVIDIA Tesla V100 GPUs. The learning rate, weight decay, momentum, and batch size were 0.00001, 0.00005, 0.9, and 32, respectively. These parameters were set by conducting several experiments. The change detection accuracy was calculated by inputting independent image pairs into the optimal model generated in training. As the processing time for change detection varies according to the size of the input image, the accuracy based on the image size was investigated.
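The text does not name the optimizer, but a momentum term together with weight decay suggests stochastic gradient descent with momentum. Under that assumption, the sketch below shows how the reported hyperparameters would enter a single parameter update; `TrainConfig` and `sgd_step` are hypothetical names, and the update rule follows the common formulation in which weight decay is added to the gradient before the momentum buffer is updated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters reported in the text."""
    learning_rate: float = 1e-5
    weight_decay: float = 5e-5
    momentum: float = 0.9
    batch_size: int = 32


def sgd_step(weights, grads, velocity, cfg):
    """One SGD-with-momentum update (assumed optimizer, not confirmed by the text)."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + cfg.weight_decay * w   # L2 weight decay folded into the gradient
        v = cfg.momentum * v + g       # momentum buffer
        new_velocity.append(v)
        new_weights.append(w - cfg.learning_rate * v)
    return new_weights, new_velocity


cfg = TrainConfig()
weights, velocity = sgd_step([1.0], [2.0], [0.0], cfg)
print(weights, velocity)
```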

Change Detection
The change detection accuracy of the proposed method was evaluated for the training image pair using the original image resolution. Quantitatively, the method performs comparatively well, achieving an F-measure of 85.98%, a recall (Re) of 89.70%, and a precision (Pr) of 82.57%.
To evaluate the proposed method, we compared it to conventional image differencing. A few qualitative change detection examples are presented in Figures 6 and 7. In Figure 6, the area surrounding the construction site and the roads under construction can be observed. The image difference results in Figure 6 confirm that there is a high possibility of misclassification in areas where differences arise from shadows, vegetation, and cars. It is also difficult to determine a pixel value difference threshold for delineating the changed area. Figure 7 shows the true changed area in white, the overlay of the true changed area on the image for the same area as in Figure 6, and the change detection result obtained using the proposed method. Unlike the image difference result, the proposed method yields change detection results similar to the true changed area, and misclassifications owing to vehicles, shadows, and vegetation rarely occur.
The binary images in the first column of Figure 7 represent the true changed area; the white and black areas represent the changed and unchanged areas, respectively. In the last column of Figure 7, the colors represent the distances between the feature pairs of the bitemporal images. The change distance images were enhanced using a rainbow color map ranging from blue to red for visualization clarity: blue represents the unchanged area, whereas the colors from sky blue to red represent the changed area. The proposed method detects road surface changes, such as asphalt construction, as well as changes owing to road slope construction. However, errors may occur in some small areas rather than large areas, as shown in the last row of Figure 7, and the need for future improvements remains.
Figure 5. Orthoimages of some areas using (left) the first flight data and (right) the third flight data.
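The conventional image difference baseline discussed above can be sketched as a per-pixel absolute difference followed by a threshold. The grey-level values and thresholds below are made up for illustration; the example only shows why choosing the threshold is difficult.

```python
def image_difference(image_t1, image_t2, threshold):
    """Binary change mask: 1 where the absolute grey-level difference exceeds the threshold."""
    return [
        [1 if abs(a - b) > threshold else 0 for a, b in zip(row1, row2)]
        for row1, row2 in zip(image_t1, image_t2)
    ]


# Hypothetical 2 x 2 grey-level patches from two dates.
before = [[100, 100], [100, 100]]
after = [[100, 60], [105, 200]]  # 60: e.g. a shadow, 105: noise, 200: new surface

print(image_difference(before, after, threshold=30))
print(image_difference(before, after, threshold=3))
```

With the higher threshold the darkened pixel is still flagged although nothing was built there, and with the lower threshold even small radiometric noise is flagged. A single pixel-value threshold cannot separate construction changes from shadows, vegetation, or vehicles.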

Image Size Effect
To apply the proposed method efficiently, the accuracy was examined based on the image size. Road construction sites are usually several kilometers long or more. Although this varies depending on the purpose, the GSDs of images taken using UAVs are approximately several centimeters. In other words, there are a considerable number of images to process. It is necessary to find an effective image resolution such that a large number of construction images can be rapidly processed for change detection. In this experiment, we evaluated the change detection accuracy of the proposed method while reducing the image size. Starting with the original image, the image scale was reduced in steps of 0.1 down to 0.1 times the original size, giving a total of 10 scale levels.
Furthermore, to observe the effect of normalization of the image pixel values, the accuracy of the proposed method was compared by changing the range of the image pixel values. Case 1 uses the original image, case 2 uses the image after subtracting 127.5, which is the middle of the 8-bit image pixel value, from each image, and case 3 subtracts the average value of each image from the image.
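The scale levels and the three pixel-range cases described above can be expressed as a short sketch. It reads the scale experiment as ten levels from 1.0 down to 0.1, which is an interpretation of the text; the function names and the nested-list image layout are hypothetical.

```python
def scale_levels():
    """Ten image scales from 1.0 (original) down to 0.1 in steps of 0.1."""
    return [round(1.0 - 0.1 * k, 1) for k in range(10)]


def scaled_size(width, height, scale):
    """Image size in pixels after rescaling, with at least one pixel per side."""
    return max(1, round(width * scale)), max(1, round(height * scale))


def normalise(image, case):
    """Apply one of the three pixel-range cases to an 8-bit grey-level image.

    case 1: original values, case 2: subtract 127.5, case 3: subtract the image mean.
    """
    if case == 1:
        offset = 0.0
    elif case == 2:
        offset = 127.5
    elif case == 3:
        pixels = [p for row in image for p in row]
        offset = sum(pixels) / len(pixels)
    else:
        raise ValueError("case must be 1, 2, or 3")
    return [[p - offset for p in row] for row in image]


for s in scale_levels():
    print(s, scaled_size(720, 480, s))
```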
As shown in Figure 8, the average accuracy for the three cases, from scale 0.5 to 1.0, is 84.5-85.2%. The accuracy is the highest for the original image with a scale of 1.0, but the difference in accuracy is less than 1%. The accuracy drops slightly at scales of 0.3 and 0.4, but the difference is not significant. As shown in Figure 9, the execution time increases approximately linearly with the image size. Therefore, if the execution time is important, change detection can be performed by decreasing the image scale to 0.3. Moreover, the change detection accuracy based on the image pixel range was not significantly different in the three cases. However, the accuracy for the images from which the average pixel value had been subtracted was slightly lower, although the difference was not large.

Conclusions
The purpose of this study was to develop a methodology that supports smooth construction management by capturing the changes caused by construction through images taken at the construction site. The study presented a method for producing orthoimages of a construction site using a UAV and a method for automatically detecting the changes in the images using a convolutional Siamese network. The proposed method can detect color changes, such as for asphalt construction, and the presence or absence of changes owing to the construction of facilities. It is possible to provide reliable information regarding the construction progress by removing the effects of shadows, vegetation, automobiles, and work equipment, which act as false positives in the existing image change detection technique. Furthermore, to apply change detection efficiently, the detection accuracy based on the image size was analyzed to select the image size that should be applied according to the processing time.
At the construction site, videos are mainly taken using a UAV, and the construction status is checked manually by site managers or supervisors while viewing the video by eye. Simply viewing the video makes it difficult to accumulate construction progress records or to analyze the construction status quantitatively. Therefore, further research is necessary to develop a method capable of performing quantitative analysis by producing a video-based orthoimage so that it can be easily used for identifying and recording changes in the construction site. Video-based analysis is expected to be more applicable to straight or gently curved road sections than to structures. Using the ground control points, an accurate combination of adjacent sections is possible. Conversely, owing to the complex shape of the structure section, data are acquired by securing visibility for several parts of the structure; the difficulty in detecting changes in such structures is greater. If the detection model of the proposed method is updated using detailed changed objects for the road construction site, the change type can be determined in the future. To achieve this, it will also be necessary to build and train time-series images of various construction sites.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.