Detection of Road Surface Changes from Multi-Temporal Unmanned Aerial Vehicle Images Using a Convolutional Siamese Network

Abstract: Road quality commonly decreases due to the aging and deterioration of road surfaces. As the number of roads that need to be surveyed increases, general maintenance, particularly surveillance, can be quite costly if carried out using traditional methods. Therefore, using unmanned aerial vehicles (UAVs) and deep learning to detect changes via surveys is a promising strategy. This study proposes a method for detecting changes on road surfaces using pairs of UAV images captured at different times. First, a convolutional Siamese network is introduced to extract the features of an image pair, and a Euclidean distance function is applied to calculate the distance between the two features. Then, a contrastive loss function is used to enlarge the distance between changed feature pairs and reduce the distance between unchanged feature pairs. Finally, the initial change map is improved based on the preliminary differences between the two input images. Our experimental results confirm the effectiveness of this approach.


Introduction
The quality of road surfaces decreases during use due to aging and deterioration. Some damage will always appear on a road surface, such as potholes and cracks, the two most common categories of road surface damage. To ensure traffic safety, maintaining road surface quality is both necessary and urgent. The number of roads to survey is increasing, which poses a real challenge for managers using traditional surveying methods, as it leads to increasing costs. Usually, an inspector needs to go into the field to collect information about the position and condition of the surface, then plan repairs for the damaged locations. Currently, the use of an unmanned aerial vehicle (UAV) supported by a high-performance computing device and an artificial neural network makes this kind of survey more efficient and more cost-effective than traditional methods.
Image change detection aims to detect the changed areas in images of the same scene taken at different points in time [1,2]. Over the last three decades, many different methods have been reported for detecting changed areas [3][4][5][6][7]. Alcantarilla et al. proposed a novel approach to change detection in Google Street View using monocular video sequences [8]. The method combines geometric methods with the learning made possible by an efficient convolutional network to discriminate between actual and nuisance changes. Guo et al. proposed a method based on convolutional neural network (CNN) architecture. It measures changes to a region using an implicitly learned metric, then develops a contrastive loss threshold to overcome noisy changes caused by different viewpoints [9]. To detect temporal

Methodology
The schematic of the proposed method is shown in Figure 1. A pair of images is input through a convolutional Siamese network (ConsimNet) [19] to obtain feature pairs. A simple predefined distance metric (the Euclidean distance function, L_2) is then used to measure the dissimilarity of the feature pairs. The contrastive loss function is applied to bring unchanged pairs together and to separate changed pairs. However, the initial change map is not commensurate with the real changed area; the extent of the changed area is not fully detectable. To obtain full coverage, the boundary of the real changed area needs to be obtained as a reference from which to adjust the range of the detected area.
Figure 1. Schematic of the proposed method: an image pair is fed to the convolutional Siamese network (ConsimNet) to obtain feature pairs. After obtaining the dissimilarity of the feature pairs, contrastive loss is applied to pull unchanged pairs together and push changed pairs apart; this is then used to improve the accuracy of the change map.

Convolutional Siamese Metric Network
Siamese networks are neural networks containing two or more identical sub-network components [19]. The sub-networks have the same configuration, parameters, and weights. There are three types of layers in conventional CNNs: convolutional, pooling, and fully connected, as demonstrated in Figure 2. The convolutional layers extract hierarchical features from the input image. The pooling layers enlarge the receptive field and reduce dimensionality, i.e., they reduce the size of the output feature maps. The fully connected layers are used as a classifier, which outputs the probability that the input image belongs to each class.
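As a rough illustration (not the exact ConsimNet architecture, whose configuration follows [19]), a weight-sharing Siamese encoder with a pixel-wise Euclidean distance can be sketched in PyTorch as follows; the layer sizes are placeholders:

```python
# Minimal sketch of a Siamese feature extractor with shared weights.
# The small conv encoder below stands in for the ConsimNet backbone;
# its layer sizes are illustrative only.
import torch
import torch.nn as nn

class SiameseEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Illustrative stand-in: two conv blocks with one pooling stage.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x1, x2):
        # The SAME weights are applied to both inputs (weight sharing).
        f1 = self.features(x1)
        f2 = self.features(x2)
        # Pixel-wise Euclidean (L2) distance between the feature pairs.
        dist = torch.sqrt(((f1 - f2) ** 2).sum(dim=1) + 1e-8)
        return dist

x1 = torch.rand(1, 3, 64, 64)  # image at time t_0
x2 = torch.rand(1, 3, 64, 64)  # image at time t_1
model = SiameseEncoder()
dist_map = model(x1, x2)       # one distance value per spatial location
```

Because both branches share one set of weights, an identical image pair yields a near-zero distance map, while changed regions produce large distances.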

Contrastive Loss Function
The contrastive loss function is used to enlarge the distance between changed pairs and reduce the distance between unchanged pairs simultaneously. Let X = {x(i, j) | 1 ≤ i ≤ h, 1 ≤ j ≤ w} be an aerial image, and let X_1 and X_2 be two input images, each with a size of h × w × 3, where w and h are the spatial dimensions and 3 is the channel dimension (RGB). Define the parameterized distance function to be learned, D_W, between X_1 and X_2 as the Euclidean distance between the outputs of G_W:

D_W(X_1, X_2)_{i,j} = || G_W(X_1)_{i,j} − G_W(X_2)_{i,j} ||_2,

where G_W(X) is the output feature tensor and G_W(X)_{i,j} is the feature vector of the pixel at location (i, j) in image X. To shorten the notation, D_W(X_1, X_2)_{i,j} is written as D_{i,j}. The loss function in its most general form is:

L(W, Y, X_1, X_2) = Σ_{i,j} [ (1 − y(i, j)) · L_U(D_{i,j}) + y(i, j) · L_C(D_{i,j}) ],

where Y is the binary ground-truth map assigned to the input image pair, with y(i, j) = 0 if the corresponding pixel pair is deemed similar and y(i, j) = 1 if it is deemed dissimilar. L_C is the partial loss function for a pair of dissimilar points and L_U is the partial loss function for a pair of similar points. L_C and L_U must be designed such that minimizing L with respect to D_{i,j} produces a low value for a pair of unchanged pixels and a high value for a pair of changed pixels. L_C and L_U are defined as follows:

L_U(D_{i,j}) = (1/2) · D_{i,j}²,
L_C(D_{i,j}) = (1/2) · (max(0, m − D_{i,j}))²,

where m is a margin: changed pixel pairs contribute to the loss function only if their parameterized distance is within this margin. In the experiment, m was set to 2. Thus, the final loss function is:

L(W, Y, X_1, X_2) = Σ_{i,j} [ (1 − y(i, j)) · (1/2) · D_{i,j}² + y(i, j) · (1/2) · (max(0, m − D_{i,j}))² ].
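The loss above can be sketched in a few lines of NumPy (a plain sketch of the formula, not the training code; m = 2 as in the experiment, with D the per-pixel distance map and y the binary ground truth, 0 = unchanged, 1 = changed):

```python
import numpy as np

def contrastive_loss(D, y, m=2.0):
    # Unchanged pairs (y = 0) are pulled together: quadratic in D.
    l_u = 0.5 * D ** 2
    # Changed pairs (y = 1) are pushed apart: they contribute only
    # while their distance is still within the margin m (hinge term).
    l_c = 0.5 * np.maximum(0.0, m - D) ** 2
    return np.mean((1 - y) * l_u + y * l_c)
```

For a changed pair at zero distance, the per-pixel loss equals m²/2 = 2; unchanged pairs at zero distance, and changed pairs already farther apart than the margin, contribute nothing.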

Improvement of the Results
The purpose of this step is to improve the initial results based on the preliminary difference between the two input images. Therefore, it is necessary to find the boundaries of the areas of difference between the two images. This boundary is treated as the reference extent: the area detected in the previous step is expanded until it touches the boundary.
The steps are as follows:
Step 1: Find the difference between the two images I_1 and I_2 in each RGB color channel:
Red_{I_1−I_2} = Red_{I_1} − Red_{I_2},
Green_{I_1−I_2} = Green_{I_1} − Green_{I_2},
Blue_{I_1−I_2} = Blue_{I_1} − Blue_{I_2}.
Step 2: Detect the edges of the two images I_1 and I_2 by Canny edge detection; combining this with the result of the step above yields a result such as that shown in Figure 3c.
Step 3: Group all adjacent pixels, and fill the groups that form a closed pixel area (Figure 3d).
Step 4: Remove the small regions and determine the boundary locations (Figure 3e,f).
Step 5: Based on this boundary, expand the initial change map. Figure 3f shows the preliminary difference between the two input images (white pixels). The result includes not only the real changes (red box), with the whole area covered, but also road lane marks and some other areas (blue box) picked up by the edge detection. However, the initial change map detected by ConsimNet does not include these noisy objects. Therefore, the edge map can be used as a reference for the initial change map to improve the accuracy of the result.
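Steps 1, 3, and 4 above can be sketched as follows (a simplified NumPy version under assumed inputs; the Canny edge detection of Step 2 and the boundary-based expansion of Step 5 are omitted, and the minimum region size is illustrative):

```python
import numpy as np
from collections import deque

def channel_difference(img1, img2):
    # Step 1: signed per-channel difference (cast to avoid uint8 wrap-around).
    return img1.astype(np.int16) - img2.astype(np.int16)

def remove_small_regions(mask, min_size):
    # Steps 3-4: group adjacent foreground pixels (4-connectivity) and
    # keep only the regions of at least min_size pixels.
    h, w = mask.shape
    seen = np.zeros_like(mask, dtype=bool)
    out = np.zeros_like(mask, dtype=bool)
    for i in range(h):
        for j in range(w):
            if mask[i, j] and not seen[i, j]:
                queue, region = deque([(i, j)]), [(i, j)]
                seen[i, j] = True
                while queue:  # breadth-first flood fill of one region
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            queue.append((ny, nx))
                            region.append((ny, nx))
                if len(region) >= min_size:
                    for y, x in region:
                        out[y, x] = True
    return out

# Tiny synthetic example: a 2x2 changed patch plus one isolated noisy pixel.
diff = channel_difference(np.full((4, 4, 3), 200, dtype=np.uint8),
                          np.full((4, 4, 3), 50, dtype=np.uint8))
mask = np.zeros((5, 5), dtype=bool)
mask[0:2, 0:2] = True
mask[4, 4] = True
cleaned = remove_small_regions(mask, min_size=2)  # isolated pixel removed
```

In practice a library routine (e.g., a connected-components labeling function) would replace the hand-written flood fill; the loop version is shown only to make the grouping step explicit.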


Evaluation Metrics
In this study, the accuracy of the method was defined using three different performance metrics [5,20,21]: the precision (Pr), the recall (Re), and the F-measure (F):

Pr = T_P / (T_P + F_P), Re = T_P / (T_P + F_N), F = 2 · Pr · Re / (Pr + Re),

where T_P is the number of true positives (i.e., the changed pixels that were correctly classified), F_P is the number of false positives (i.e., the unchanged pixels that were incorrectly classified as changed), and F_N is the number of false negatives (i.e., the changed pixels that were incorrectly classified as unchanged).
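These three metrics can be computed directly from binary change masks; a small NumPy sketch (mask convention assumed: True = changed):

```python
import numpy as np

def precision_recall_f(pred, gt):
    # pred: detected change mask; gt: ground-truth change mask (both boolean).
    tp = np.logical_and(pred, gt).sum()    # changed pixels correctly detected
    fp = np.logical_and(pred, ~gt).sum()   # unchanged pixels flagged as changed
    fn = np.logical_and(~pred, gt).sum()   # changed pixels that were missed
    pr = tp / (tp + fp) if tp + fp else 0.0
    re = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * pr * re / (pr + re) if pr + re else 0.0
    return pr, re, f

pred = np.array([[True, True], [False, False]])
gt = np.array([[True, False], [True, False]])
pr, re, f = precision_recall_f(pred, gt)  # tp = 1, fp = 1, fn = 1
```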

Study Area and Devices
The object of the survey was the Deokyang Bridge (Figure 4) in Yeosu City, Korea. Deokyang Bridge is 530 m long and 25 m wide. To detect the surface changes, data were captured at different times by a Phantom 4 RTK drone (Table 1). The first recording was conducted on 11 January 2019, and the second on 17 April 2019. An orthomosaic with an average ground resolution of 14 mm/pixel was generated through photogrammetric processing. The bridge area was selected as the test area of the orthomosaic image. That area was divided into tiles of a computer-processable size, and the same areas of the two period images were paired to finally generate 163 comparison pairs.

Implementation Details
To train the proposed network, this study used the CDnet dataset [20,21], which has already been used in [9,22]. The CDnet dataset consists of 31 videos with 91,595 image pairs depicting indoor and outdoor scenes with pedestrians, boats, and trucks captured at different times. The dataset represents various challenges divided into categories such as dynamic backgrounds, camera jitter, shadow, night video, challenging weather, and intermittent object motion. A background image with no foreground object was selected as the reference image at time t_0, and the other images were taken at time t_1. A total of 91,595 image pairs were used, comprising 73,276 pairs for the training set and 18,319 for the validation set. All images were scaled to 512 × 512 during training. The proposed Siamese network was implemented using the PyTorch framework [23]. In the training procedure, the learning rate was set to 0.00001, and the weight decay and momentum were set to 0.00005 and 0.9, respectively. The batch size was set to 32. The entire process of training, testing, and checking the results was performed in Python on the PyTorch platform [23] running on Ubuntu 18.04. The training hardware used an NVIDIA Titan Xp graphics processing unit.
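The reported hyperparameters suggest an SGD-style optimizer (the momentum value implies this, though the paper does not name the optimizer explicitly); a sketch of the corresponding PyTorch setup, with a placeholder module standing in for ConsimNet:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, kernel_size=3)  # placeholder for the ConsimNet network
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.00001,            # learning rate
    momentum=0.9,          # momentum
    weight_decay=0.00005,  # weight decay
)
```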

Results
To verify the detection performance, 163 pairs of small images were used to preliminarily determine how many locations the method identified correctly and how many incorrectly. The results are shown in Table 2 and Figure 5.

Out of the 163 image pairs, 138 (84.7%) provided correct results, and 25 (15.3%) provided incorrect results. These figures reflect the accuracy of the method.
For a more detailed evaluation, seven image pairs were used for testing. An image-to-image registration process was used to ensure that the image pairs were matched and co-located. The results are shown in Figures 6-12, which represent Tests 1 to 7, respectively. Each figure contains smaller images, labeled (a)-(h). Image (a) is the image at time t_0, captured on 11 January 2019, and image (b) is the image at time t_1, captured on 17 April 2019. Image (c) is the initial, unimproved result, and image (d) is the blend of the image at time t_1 and the initial result. Images (e) and (f) are the preliminary differences between the two input images. Image (g) is the result after improvement, and image (h) is the blend of the improved result and the image at time t_1.
As can be seen in images (c), (d), (g), and (h) of Figures 6-12, there are various colors in the detected areas, including green, yellow, orange, and red. These result from the different distances between the two feature pairs, which were calculated by the Euclidean distance and contrastive loss functions. The change distance images between the feature pairs were enhanced with a rainbow color map for visualization contrast.
As seen in the blended images (d), all potholes were correctly detected; however, the extents of the detected areas and the real changed areas in the images at time t_1 were unequal. Figure 6d (Test 1), Figure 7d (Test 2), Figure 9d (Test 4), and Figure 11d (Test 6) show that, before improvement, the extents of the detected areas were smaller than those of the real damaged areas. However, as Figures 6h, 7h, 9h, and 11h show, after improvement, the full extent of the damaged areas could be detected. The effectiveness of the method is shown numerically in Table 3.
To evaluate the performance of the proposed method, the precision (Pr) was calculated as the correctly classified changed area (T_P, true positives) divided by the sum of the correctly classified changed area (T_P) and the incorrectly classified unchanged area (F_P, false positives).
The recall (Re) was the correctly classified changed area (T_P) divided by the sum of the correctly classified changed area (T_P) and the changed area incorrectly classified as unchanged (F_N, false negatives). The F-measure (F) was the harmonic mean of the precision (Pr) and the recall (Re).
Seven image pairs were used for the test; in general, the values of Pr, Re, and F improved after the refinement step. Across all seven tests, before the adjustment, the average value of Pr was 60.49%, Re was 0.57%, and F was 1.13%; after the adjustment, the values of Pr, Re, and F were 78.7%, 1.10%, and 2.17%, respectively.

Conclusions
In this study, a change detection method based on a convolutional Siamese network was introduced for UAV-obtained road surface images. The feature pairs of two UAV images taken at different times were extracted by the convolutional Siamese network. Then, the distance between the features was computed to detect changes between image pairs. The contrastive loss was applied to push changed pairs apart and pull unchanged pairs together. Finally, edge detection was used to obtain the boundaries of the changed areas, and based on these boundaries, it was possible to adjust the detected area in the initial change map. This method can help warn managers and experts about road surface conditions. The method not only determined the location of the changing area but also ensured that its full extent was detected. Once a defect was detected, quantitative values such as its area or position could be obtained based on the pixel size. If classification by damage type is added in the future, this method could be further developed into a pavement management system that includes damage location.
However, there are still some difficulties caused by noise-generating objects, as some unwanted objects can still be detected and cause confusion. The most problematic noise in this research was caused by severe shadows. Although shadows themselves were not detected as changes in the images, actual defects lying beneath these shadows can remain undetected. If the view geometries of the cameras at different times differ significantly (e.g., by 30° or more), the detection rate is lowered. Standing water on roads also causes errors. Therefore, it is recommended to shoot the road almost vertically when the sun is high, or when the weather is slightly cloudy.
Determining the type and size of the detected damage depends on the ground spatial resolution of the images. In high-resolution images, it is possible to detect changes in minute linear cracks; in centimeter-level, lower-resolution images, only the presence of potholes on roads can be detected. Sub-millimeter-resolution images are necessary to detect minute linear cracks and crack changes at the millimeter level. In the future, we will study the change detection of small features due to seasonal variation or deterioration.