1. Introduction
The quality of road surfaces will decrease during use due to aging and deterioration. Some damage will always appear on a road surface, such as potholes and cracks—the two most common categories of road surface damage. To ensure safety for traffic, maintaining road surface quality is both necessary and urgent. The number of roads to survey is increasing, which poses a real challenge for managers using traditional surveying methods as it leads to increasing costs. Usually, an inspector needs to go outside to collect information about the position and condition of the surface, then plan to repair the damaged location. Currently, the use of an unmanned aerial vehicle (UAV) supported by a high-level computing device and artificial neural network makes this surveying aspect more efficient and more cost-effective than traditional methods.
Image change detection aims to detect the changed areas in images of the same scene taken at different points in time [1,2]. Over the last three decades, many different methods have been reported for detecting a changing area [3,4,5,6,7]. Alcantarilla et al. proposed a novel approach to change detection in Google Street View using monocular video sequences [8]. The method combines geometric methods with the learning made possible with an efficient convolutional network to discriminate between actual and nuisance changes. Guo et al. proposed a method based on convolutional neural network (CNN) architecture. It measures changes to a region using an implicitly learned metric, then develops a contrastive loss threshold to overcome noisy changes using a different viewpoint [9]. To detect temporal changes in a scene from a pair of images, a new method that integrates CNN features with superpixel segmentation has been introduced [10]. Superpixel segmentation is integrated to estimate the precise segmentation boundaries of the changes. Nemmour and Chibani proposed that the combination of fuzzy sets and neural networks provides complete information on changes in a remotely sensed image [11]. A fuzzy membership model classifies multi-temporal images into changed and unchanged classes, and newly urbanized areas are detected based on an artificial neural network with the input of two Landsat Thematic Mapper images obtained at different times. The result of the method is effective, and detailed classes are created [12]. On the other hand, Wang et al. investigated the uncertainty in detecting the change of images. According to the authors, there is a need to be transparent in assessing that uncertainty. Therefore, they proposed a framework for evaluating binary land change utilizing remote sensing images. First, changed and unchanged classes are classified by two widely adopted image change detection methods. Second, binary decisions are reached through thresholding on change maps. Finally, two sampling designs (i.e., stratified sampling and random sampling) are used to evaluate the results [13].
There are also many studies on change detection in UAV images. Zhan et al. proposed a novel model for change detection in optical aerial images, which is based on the supervised deep Siamese CNN [14]. A multi-temporal change detection framework proposed by Song et al. covers changes to cultivated land in mountainous terrain [15]. The data in the paper, with very fine spatial and temporal resolutions, was collected by small UAVs. Shi et al. introduced an object-based method to detect change using multi-temporal images obtained by UAV [16]. This method can overcome distortion effects and can fully use the high resolution of UAV images [17]. Changes to urban areas in the city of Konya, Turkey are detected by finding the difference in digital elevation models based on comparisons of time-series point cloud data from aerial images taken at different times [17].
In this study, a change detection method that takes the road surface as its target was presented, using high-resolution UAV images acquired for road surface inspection. First, a convolutional Siamese network (ConsimNet) was proposed to extract the features of image pairs, and a Euclidean distance function was applied to calculate the distance between the two features. Then, the contrastive loss function [14,18] was used to pull the unchanged pairs together and push the changed pairs apart. Finally, an edge detection technique was applied to improve the detected area in the initial change map. The edge detection finds the boundary of changed areas, and the detected area in the initial change map is adjusted based on this boundary.
ConsimNet has proven effective at overcoming certain problems encountered in detecting changes in high-resolution images, such as differences in an object due to a different viewpoint, wrong detection due to the shadow of an object, and inaccurate geometric correction. However, the limitation of ConsimNet is that it can only detect areas of significant change; it is difficult to detect changes that are unclear or blurry. Of an entire changed area, ConsimNet often detects only the central part of the change region, neglecting the rest because the former is usually the area with the most major changes and thus has clearer differences than the surrounding region (the area in the process of being broken up). Therefore, the area of the change is not fully detected. In this study, edge detection was used to overcome this issue and ensure that such changes, specifically the boundary of the changing region, would be found. Using this range as a standard reference for the initial change map means it would be able to detect exact defects and the range of defects. With this method, the locations of the changing regions were identified while ensuring that the entire area of those regions was detected. This method can be used in pre-warnings of road conditions in road inspections, even if existing conditions are not bad. Small potholes, indentations, or other abnormal features of the road surface can be detected. The rest of this paper is structured as follows. The methodology is described in
Section 2. The experiment is presented in Section 3. The results and discussion are shown in Section 4. Finally, the conclusion from this study is drawn in Section 5.
2. Methodology
The schematic of the proposed method is shown in Figure 1. A pair of images is passed through a convolutional Siamese network (ConsimNet) [19] to obtain feature pairs. A simple predefined distance metric (the Euclidean distance, i.e., the L2 norm) is then used to measure the dissimilarity of the feature pairs. The contrastive loss function is applied to bring together unchanged pairs and separate changed pairs. However, the initial change map does not match the real changed area; the full extent of the changed area is not detected. To obtain full coverage, the boundary of the real changed area is needed as a reference from which to adjust the extent of the detected area.
2.1. Convolutional Siamese Metric Network
Siamese networks are neural networks containing two or more identical sub-network components [19]. The sub-networks have the same configuration, parameters, and weights. There are three types of layers in conventional CNNs: convolutional, pooling, and fully connected, as demonstrated in Figure 2.
The convolutional layers extract hierarchical features from the input image. The pooling layers enlarge the receptive field and reduce dimensionality, that is, they reduce the size of the output feature maps. The fully connected layers act as a classifier, which outputs the probability that the input image belongs to each class.
2.2. Contrastive Loss Function
The contrastive loss function was used to enlarge the distance between changed pairs and reduce the distance between unchanged pairs simultaneously. Let $X$ be an aerial image, and $X_1$ and $X_2$ be two input images, each with a size of $h \times w \times 3$, where $w$ and $h$ are the spatial dimensions and 3 is the channel dimension (RGB channels). Define the parameterized distance function to be learned, $D_W$, between $X_1$ and $X_2$ as the Euclidean distance between the outputs of $G_W$:

$$D_W(X_1, X_2)_{i,j} = \left\| G_W(X_1)_{i,j} - G_W(X_2)_{i,j} \right\|_2$$

where $G_W(X_1)$, $G_W(X_2)$ are the output feature tensors, and $G_W(X)_{i,j}$ is the feature vector of the pixel with location $(i, j)$ in the image $X$. To shorten the notation, $D_W(X_1, X_2)_{i,j}$ is written as $D_W$. Then, the loss function in its most general form is:

$$L(W, Y, X_1, X_2) = \sum_{i,j} \left[ (1 - Y_{i,j})\, L_S(D_W) + Y_{i,j}\, L_D(D_W) \right]$$

where $Y$ is a binary ground-truth map assigned to the input image pair, with $Y_{i,j} = 0$ if the corresponding pixel pair is deemed similar or $Y_{i,j} = 1$ if it is deemed dissimilar. $L_D$ is the partial loss function for a pair of dissimilar points and $L_S$ is the partial loss function for a pair of similar points. $L_S$ and $L_D$ must be designed such that minimizing $L$ with respect to $W$ produces a low value of $D_W$ for a pair of unchanged pixels and a high value for a pair of changed pixels. $L_S$ and $L_D$ are defined as follows:

$$L_S(D_W) = \frac{1}{2} D_W^2, \qquad L_D(D_W) = \frac{1}{2} \left[ \max(0,\, m - D_W) \right]^2$$

where $m$ is a margin. Changed pixel pairs contribute to the loss function only if their parameterized distance is within this margin. In the experiment, $m$ was set to 2. Thus, the final loss function is:

$$L(W, Y, X_1, X_2) = \sum_{i,j} \frac{1}{2} \left[ (1 - Y_{i,j})\, D_W^2 + Y_{i,j} \left[ \max(0,\, m - D_W) \right]^2 \right]$$
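The per-pixel distance and the margin behavior described in this subsection can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes the standard contrastive loss form (quadratic terms with a factor of 1/2), a label convention of 0 for unchanged and 1 for changed pixel pairs, and hypothetical function names `distance_map` and `contrastive_loss`. The feature maps stand in for the outputs of the two Siamese branches.

```python
import numpy as np

def distance_map(f1, f2):
    """Per-pixel Euclidean distance between two (H, W, C) feature maps."""
    return np.sqrt(np.sum((f1 - f2) ** 2, axis=-1))

def contrastive_loss(f1, f2, y, margin=2.0):
    """Contrastive loss summed over all pixels.

    y is an (H, W) binary map; 0 = unchanged pair, 1 = changed pair
    (this label convention is an assumption of the sketch).
    """
    d = distance_map(f1, f2)
    loss_similar = 0.5 * d ** 2                               # pulls unchanged pairs together
    loss_dissimilar = 0.5 * np.maximum(0.0, margin - d) ** 2  # pushes changed pairs apart, up to the margin
    return float(np.sum((1 - y) * loss_similar + y * loss_dissimilar))

# Toy 1 x 2 feature maps with 2 channels: pixel distances are 5 and 1.
f1 = np.zeros((1, 2, 2))
f2 = np.array([[[3.0, 4.0], [0.0, 1.0]]])

loss_if_changed = contrastive_loss(f1, f2, np.array([[1, 1]]))
loss_if_unchanged = contrastive_loss(f1, f2, np.array([[0, 0]]))
```

In the toy example, the pixel at distance 5 lies beyond the margin of 2 and adds nothing to the loss when labeled as changed, while the pixel at distance 1 is still penalized. When both pixels are labeled as unchanged, the loss grows with the squared distance.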
2.3. Improvement of the Results
The purpose of this step is to refine the initial change map using the preliminary difference between the two input images. To do so, it is necessary to find the boundaries of the areas where the two images differ. This boundary is taken as the reference range: the area detected in the previous step is expanded until it touches the boundary.
The steps are as follows:
Step 1: Find the difference between the two images I1 and I2 in each RGB color channel.
Step 2: Detect the edges of the two images I1 and I2 by Canny edge detection, and combine them with the result of Step 1 to obtain a result such as that shown in Figure 3c.
Step 3: Group all adjacent pixels, and fill the groups that form a closed pixel area (Figure 3d).
Step 4: Remove the small regions and determine the boundary location (Figure 3e,f).
Step 5: Based on this boundary, expand the initial change map.
Figure 3f is the preliminary difference between the two input images (white pixels). The result not only includes real changes (red box) with the whole area covered, but also road lane marks and some other areas (blue box) because of edge detection. However, the initial change map detected by ConsimNet does not include this noisy object. Therefore, the edge map can be considered as a reference of the initial change map to improve the accuracy of the result.
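The paper does not specify how the expansion in Step 5 is carried out. One possible reading, sketched below, is a flood fill: each pixel of the initial change map acts as a seed and grows within the filled reference region until it reaches that region's boundary. Reference regions that contain no seed (for example, lane marks picked up by edge detection) are never reached. The function name `expand_change_map`, the use of 4-connectivity, and the boolean mask inputs are all assumptions of this sketch.

```python
from collections import deque

import numpy as np

def expand_change_map(initial, reference):
    """Grow initial detections to fill the reference regions they touch.

    initial:   (H, W) bool array, initial change map (seed pixels).
    reference: (H, W) bool array, filled regions enclosed by the detected boundaries.
    Returns the initial map plus every reference pixel that is 4-connected
    to a seed through reference pixels; reference regions without a seed
    are left out.
    """
    h, w = reference.shape
    out = initial.copy()
    queue = deque((int(r), int(c)) for r, c in np.argwhere(initial & reference))
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and reference[nr, nc] and not out[nr, nc]:
                out[nr, nc] = True
                queue.append((nr, nc))
    return out

# Toy example: a 2 x 2 damaged region (left) and a lane-mark-like strip (right).
reference = np.array([[1, 1, 0, 0, 1],
                      [1, 1, 0, 0, 1],
                      [0, 0, 0, 0, 0]], dtype=bool)
initial = np.zeros_like(reference)
initial[0, 0] = True  # the network detected only part of the damaged region

expanded = expand_change_map(initial, reference)
```

In the toy example, the single detected pixel grows to cover the whole left-hand region, while the right-hand strip stays excluded because the initial change map contains no detection there.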
2.4. Evaluation Metrics
In this study, the accuracy of the method was defined using three different performance metrics [5,20,21]: precision ($Pr$), recall ($Re$), and F-measure ($F$):

$$Pr = \frac{T_p}{T_p + F_p}, \qquad Re = \frac{T_p}{T_p + F_n}, \qquad F = \frac{2 \cdot Pr \cdot Re}{Pr + Re}$$

where $T_p$ is the number of true positives (i.e., the positive pixels that were correctly classified as positive), $F_p$ is the number of false positives (i.e., the negative pixels that were incorrectly classified as positive pixels), and $F_n$ is the number of false negatives (i.e., the positive pixels that were incorrectly classified as negative).
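As a worked illustration of these metrics, the following sketch computes precision, recall, and F-measure from pixel counts. The function name `change_metrics` and the example counts are hypothetical and are not taken from the experiments.

```python
def change_metrics(tp, fp, fn):
    """Return (precision, recall, F-measure) from pixel counts.

    tp: changed pixels correctly detected as changed
    fp: unchanged pixels incorrectly detected as changed
    fn: changed pixels that were missed
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure

# Hypothetical counts: 80 correct detections, 20 false alarms, 120 missed pixels.
pr, rec, f = change_metrics(tp=80, fp=20, fn=120)
```

With these counts, most detections are correct (high precision) but more than half of the changed pixels are missed (lower recall), which is the situation the boundary-based expansion is meant to improve.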
3. Experiment
3.1. Study Area and Devices
The object of the survey was the Deokyang Bridge (Figure 4) in Yeosu City, Korea. Deokyang Bridge is 530 m long and 25 m wide. To detect the surface changes, data were acquired at two different times by a Phantom 4 RTK drone (Table 1). The first recording was conducted on 11 January 2019, and the second on 17 April 2019. An orthomosaic with an average ground resolution of 14 mm/pixel was generated through photogrammetric processing. The bridge area of the orthomosaic was selected as the test area. That area was divided into tiles of a size the computer could process, and the same areas in the images from the two periods were paired, finally generating 163 comparison pairs.
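Because the orthomosaic has a known ground resolution, pixel measurements in a change map convert directly to ground units. The sketch below shows the arithmetic for the 14 mm/pixel resolution reported here. The function names and the example pixel counts are hypothetical, and the sketch assumes square pixels and a uniform resolution across the image.

```python
GSD_M = 0.014  # ground sampling distance in metres per pixel (14 mm/pixel)

def defect_area_m2(n_pixels, gsd_m=GSD_M):
    """Ground area, in square metres, covered by n_pixels changed pixels."""
    return n_pixels * gsd_m ** 2

def pixel_offset_m(row, col, gsd_m=GSD_M):
    """Ground offset in metres (down, right) of a pixel from the image's top-left corner."""
    return row * gsd_m, col * gsd_m

# A hypothetical detected region of 50 x 50 pixels:
area = defect_area_m2(50 * 50)      # about 0.49 square metres
offset = pixel_offset_m(100, 200)   # about 1.4 m down and 2.8 m right
```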
3.2. Implementation Details
To train the proposed network, this study used the CDnet dataset [20,21]. This dataset has already been used in [9,22]. The CDnet dataset consists of 31 videos with 91,595 image pairs depicting indoor and outdoor scenes with pedestrians, boats, and trucks captured at different times. The dataset represents various challenges divided into categories such as dynamic backgrounds, camera jitter, shadow, night video, challenging weather, and intermittent object motion. A background image with no feature object was selected for the reference image at time t0, and other images were taken at time t1. A total of 91,595 image pairs were used for the training, comprising 73,276 pairs for the training set and 18,319 for the validation set. All images were scaled to 512 × 512 during the training. The proposed Siamese network was implemented using the PyTorch framework [23]. In the training procedure, the learning rate was set to 0.00001, and the weight decay and momentum were set to 0.00005 and 0.9, respectively. The batch size was set to 32. The entire process of training, testing, and checking the results was performed in Python on the PyTorch platform [23] running on a Linux 18.04 operating system. The training hardware was an NVIDIA Titan Xp graphics processing unit.
4. Results
To check whether the method detected changes at the correct locations, 163 pairs of small images were used to make a preliminary count of the number of correct and incorrect detections. The results are shown in Table 2 and Figure 5.
Of the 163 image pairs, 138 (84.7%) gave correct results and 25 (15.3%) gave incorrect results. These figures give a first indication of the accuracy of the method.
For a more detailed evaluation, seven image pairs were used for testing. An image-to-image registration process was used to ensure that the image pairs were matched and located. The results are shown in Figure 6, Figure 7, Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12, which represent tests 1 to 7, respectively. Each figure contains smaller images, labeled (a)–(h). Image (a) is the image at time t0 captured on 11 January 2019 and image (b) is the image at time t1 captured on 17 April 2019. Image (c) is the initial unimproved result, and image (d) is the blended result between the image at time t1 and the initial results. Images (e) and (f) are the preliminary differences between the two input images. Image (g) is the result after improving, and image (h) is the blended result between the improved result and the image at time t1.
As can be seen in images (c), (d), (g), and (h) of Figure 6, Figure 7, Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12, there are various colors in the detected area, including green, yellow, orange, and red. The colors reflect the different distances between the two features of each pair, calculated as the Euclidean distance by the network trained with the contrastive loss function. For visualization, the distance images were rendered with a rainbow color map to enhance contrast.
As seen in the blended image (d), all potholes were correctly detected; however, the extent of the detected area did not match that of the real changed area in the image at time t1. Figure 6d (Test 1), Figure 7d (Test 2), Figure 9d (Test 4), and Figure 11d (Test 6) show that, before improvement, the extents of the detected areas were smaller than those of the real damaged areas. However, Figure 6h, Figure 7h, Figure 9h and Figure 11h show that, after improvement, the full extent of the damaged areas could be detected. The effectiveness of the method is shown numerically in Table 3.
To evaluate the performance of the proposed method, the precision (Pr) was calculated as the correctly classified changed area (Tp, true positive) divided by the sum of the correctly classified changed area (Tp) and the unchanged area incorrectly classified as changed (Fp, false positive). The recall (Re) was the correctly classified changed area (Tp) divided by the sum of the correctly classified changed area (Tp) and the changed area incorrectly classified as unchanged (Fn, false negative). The F-measure (F) was the harmonic mean of the precision (Pr) and the recall (Re).
Seven image pairs were used for the test; in general, the values of Pr, Re, and F all increased after the improvement step. Averaged across the seven tests, before the adjustment, Pr was 60.49%, Re was 0.57%, and F was 1.13%; after adjustment, the values of Pr, Re, and F were 78.7%, 1.10%, and 2.17%, respectively.
5. Conclusions
In this study, a change detection method based on a convolutional Siamese network was introduced for UAV-obtained road surface images. The feature pairs of two UAV images taken at different times were extracted by the convolutional Siamese network. Then, the distance between the features was computed to detect changes between image pairs. The contrastive loss was applied to push changed pairs apart and pull unchanged pairs together. Finally, edge detection was used to obtain the boundaries of changed areas, and based on these boundaries, it was possible to adjust the detected area in the initial change map. This method can help to warn managers and experts about road surface conditions. The method not only determined the location of the changing area, it also ensured that the full extent of the changing area was detected. Once a defect was detected, quantitative values such as the area or position of the defect could be obtained based on the pixel size. If a classification by damage type is added in the future, we could further develop this method into a pavement management system along with damage location.
However, there are still some difficulties caused by noise-generating objects, as some unwanted objects can still be detected and cause confusion. The most troublesome noise in this research is caused by severe shadows. Although shadows themselves were not detected as changes in the images, the actual defects lying beneath these shadows can remain undetected. If the view geometry of the cameras at different times is significantly different (e.g., with a difference of 30° or more), the detection rate is lowered. Standing water on roads also causes errors. Therefore, it is recommended to shoot the road almost vertically when the sun is high, or when the weather is slightly cloudy.
Determining the type and size of the detected damage depends on the ground spatial resolution of the images. That is, high-resolution images make it possible to detect changes in minute linear cracks, whereas centimeter-level low-resolution images only make it possible to detect the presence of potholes on roads. Sub-millimeter resolution images are needed to detect minute linear cracks and crack changes at the millimeter level. In the future, we will study the change detection of small features due to seasonal variation or deterioration.