Change Detection in Aerial Images Using Three-Dimensional Feature Maps

: Interest in aerial image analysis has increased owing to recent developments in and availability of aerial imaging technologies, like unmanned aerial vehicles (UAVs), as well as a growing need for autonomous surveillance systems. Variant illumination, intensity noise, and different viewpoints are among the main challenges to overcome in order to determine changes in aerial images. In this paper, we present a robust method for change detection in aerial images. To accomplish this, the method extracts three-dimensional (3D) features for segmentation of objects above a deﬁned reference surface at each instant. The acquired 3D feature maps, with two measurements, are then used to determine changes in a scene over time. In addition, the important parameters that affect measurement, such as the camera’s sampling rate, image resolution, the height of the drone, and the pixel’s height information, are investigated through a mathematical model. To exhibit its applicability, the proposed method has been evaluated on aerial images of various real-world locations and the results are promising. The performance indicates the robustness of the method in addressing the problems of conventional change detection methods, such as intensity differences and shadows.


Introduction
The rapid development of unmanned aerial vehicles (UAVs) and cameras has provided a major opportunity to obtain high-quality images from low altitudes. The flexibility, availability, and decreased cost of UAVs have made aerial image analysis popular for many remote sensing and surveillance applications in different fields [1,2]. Forest inventory [2], infrastructure inspection [3], search and rescue missions [4], surveillance of urban areas [5], and environmental management [6] are among many applications that require change detection in aerial images. However, comparing images is subject to alterations in viewpoint, lighting conditions, and scale, along with the presence of noise and other factors that introduce errors [7].
Change detection systems conventionally consist of preprocessing steps followed by a change detection step. Image registration is an essential preprocessing step that aligns aerial images taken from different viewpoints to remove geometrical error [7]. Change detection methods are then applied that usually contain feature extraction from multitemporal aerial images followed by analysis of the obtained feature map to distinguish changed from unchanged regions [1]. The algorithms used in the analysis include thresholding (global or local) or more advanced techniques, such as k-means clustering [8] or Markov random field [9,10]. In the context of change detection in aerial images, the use of deep neural networks (DNNs) has been demonstrated to achieve fair performance [11,12].
Additional strategies, such as transfer learning, can be considered to enhance performance. However, DNNs require a large amount of labeled data sets for training in order to perform well. These techniques are also particularly vulnerable to noise (e.g., strong shadows and reflection) and lack generality for various applications.
Furthermore, advancement of image matching algorithms, such as structure from motion (SfM), has yielded highly detailed 3D image point clouds (IPCs) [13]. This has led to approaches that focus on the height and 3D properties of a scene in order to detect changes over time [14][15][16]. These algorithms determine transformations between multiple point clouds to generate 3D models. The obtained 3D models are then compared and analyzed for change or deformation detection. A system for structural change detection for street views using a monocular camera, GPS, and inertial odometry has been presented in [17]. In that work, 3D scene reconstruction is performed at different times for accurate alignment and then a deep deconvolutional architecture is employed for a pixel-wise change detection. Similarly, a building change detection technique using 3D reconstruction from aerial images has been proposed in [18]. By that approach, two sets containing many aerial images at different times are employed in order to create dense point clouds and consequently the depth maps. These depth maps are then analyzed to detect changed buildings while filtering out the unwanted regions using the signature of histograms of orientations descriptor.
An application for a 3D difference has been discussed in [19] for object segmentation and, further on, object discovery and learning. The obtained changes from 3D difference is employed to detect and segment newly discovered objects in a scene using dense depth maps. A 3D difference detection [20] based on depth cameras has been presented that checks the similarity between the 3D geometry of a real object and its virtual 3D model. That method can improve the 3D model based on the detected 3D differences, although the application of common depth cameras is limited in the case of aerial images. In [21], a visual relation detection has been explored to understand relations between entities in the image with emphasis on depth maps. Depth maps are then obtained from RGB images using a fully connected convolutional network. Moreover, the authors of [22] have proposed a projective volume-monitoring method for intrusion detection using stereo images to obtain depth maps in order to represent pixels in terms of 3D Cartesian coordinates. These methods are more robust to noise and changes in illuminations. However, the quantity and distribution of images besides camera parameters are essential components for precise deployment of such techniques [15].
The aforementioned issues related to the conventional methods described highlight the need for a change detection method that is robust to noise, versatile, and, at the same time, computationally easy to implement. The required technique should perform reliably in different scenarios while not demanding a high quantity of images or significant computational power to detect changes in a scene. Therefore, this paper presents a robust change detection method that measures the height information of each pixel corresponding to the points in the physical world from only a pair of aerial images. The obtained 3D feature maps at each instant are then compared to detect the changed regions in a scene over time. Moreover, the main parameters that impact measurements, such as camera sampling rate, image resolution, the height and the speed of the drone, and their relation to the pixel's height information, are investigated. The main contributions of the proposed system are to address the earlier mentioned drawbacks of limited robustness and versatility and offer reliable measurement with low computational power demand. A patent was published as a result of the work reported in this paper.
This manuscript is organized as follows. Section 2 describes the proposed system methodology, including the mathematical model, image registration, and parallax displacement, along with the 3D comparison model used to distinguish changed regions. Section 3 describes the experimental results of the proposed system and evaluation parameters. Finally, Section 4 provides a summary and discussion of key findings.

Proposed Methodology
In this section, the methodology for the proposed system is presented. In the first step of the method, a pair of aerial images (I(t 1 ) and I(t 1 + ∆t)) is taken with the same camera but with a small time separation. Based on the movement of the platform, the images will have slightly different viewpoints over the area of interest. The uniqueness of our proposed method is that the camera's intrinsic and extrinsic parameters are not required for further analysis. Thus, a reference surface is defined and a set of keypoints on that surface are extracted and matched for both aerial images. Employing the matched keypoints, the transformation matrix is calculated and the pair of images is registered. In the next step, the parallax displacements between the two images are computed and turned into a 3D feature map, which illustrates the height of each pixel in the image with respect to the reference surface in the physical world.
The same procedure is repeated to acquire a second 3D feature map based on another pair of aerial images, (I(t 2 ) and I(t 2 + ∆t)), taken at a different time (e.g., one hour or day later). The acquired 3D feature maps are then associated to calculate the 3D comparison model of the multitemporal aerial images. This model determines the changed regions in the aerial images over time based on their different height characteristics at each point.
A detailed description of each step of the proposed algorithm is presented in the following sections.

3D Map Generation Unit
In this section, the 3D feature map generation using a pair of aerial images is described (see Figure 1). This approach utilizes a pinhole camera model [23] in which light is envisioned from the scene entering the camera from a pinhole. This results in only one ray from any particular point in the scene being projected on the image plane. Therefore, points at various heights from the ground appear on the same image plane with the camera. Upon movement of the camera, the same points appear again on the image plane but with different positions. For simplicity, it can be assumed that the reference surface is on the ground in the physical world and the image planes are parallel to the ground (see Figure 2).  In addition, ground sampling distance (GSD) (m/px) can be defined as the relation between scene coverage at the reference surface, S, and the image width as is the camera's height from the ground, and α is the camera's field of view [24]. The points that were located initially beyond (or below) the reference surface in the physical world will not appear at aligned positions even after projective mapping. This discrepancy is directly proportional to the height of the point from the reference surface in the physical world. Let S be the reference surface, Q a point above the surface, G its projection on the surface, and O and O' the center of projections for the two images I(t) and I(t + ∆t) as portrayed in Figure 3. The projections of Q and G on I(t) are q and g, and their projections on I(t + ∆t) are q' and g', respectively. Furthermore, let M and N be the points that rays OQ and O'Q intersect the surface, S. After projective mapping, points g and g' are aligned, while the residual parallax vector, qq', remains non-zero. The following relation is obtained considering the similar triangles of QMN and QOO'.

OO'
where h Q (m) and h U AV (m) are the distance between Q and O to the surface, S, respectively. The parallax displacement of qq' is proportional to the physical displacement, MN, based on GSD (m/px). In addition, the physical displacement, OO', is equal to the traveled distance of the drone with the constant speed, v U AV (m/s), and the sampling interval T (s). Thus, we can rewrite the previous equation and organize it according to h Q (m) as where d (px) is the parallax displacement in pixels between q and q' ( qq' ). It should be noted that (v U AV · T) in the earlier equation can be expressed as the drone traveled distance between samplings in the case of various speed of the drone.

Image Registration
The pair of images (I(t) and I(t + ∆t)) have slightly different viewpoints, and therefore image registration is required. The different viewpoints are obtained by the displacement of the drone traveling parallel to the ground at a constant speed, v U AV (m/s), with a sampling time of T (s). It can be assumed the speed of the drone, v U AV (m/s), is known and constant, so then the sampling time, T (s), is computed based on h min (m), the minimum height of the objects that are supposed to be detected, Equation (4) is derived from Equation (3), which takes into account that a successful 3D point detection (any point above the reference surface) requires at least one pixel parallax displacement between corresponding points (d = 1 px). According to Equations (3) and (4), theoretically there is an inverse relationship between the sampling time and the minimum detection height for the given speed and height of the drone. Thus, it affects the accuracy of the 3D feature map of a scene. The obtained pair of aerial images are then aligned using a projective mapping.
A projective mapping is a linear transformation from one preferred plane in the first image to the same plane in the second image. Thus, the points on the reference surface in two images are aligned together. This mapping can be expressed in terms of matrix multiplications using homogeneous coordinates and a transformation matrix, H(t) [25], where (x(t), y(t)) and (x (t + ∆t), y (t + ∆t)) are coordinates of corresponding points on the reference surface and H(t) is a non-singular 3 × 3 matrix. The main application of the proposed system is change detection for transportation surveillance systems such as at parking spaces in a harbor area. Therefore, the reference surface is considered on the ground/road in order to simplify the detection process of changed regions at other instants. In this analysis, H(t) is computed using four corresponding points on each image as only the ratio of the elements of this matrix is significant [23]. Alternatively, in order to estimate H(t), the keypoints on the reference surface in images with variant viewpoints can be extracted based on a feature detector and descriptor, such as scale-invariant feature transform (SIFT) [26]. The detected keypoints in two images should then be matched by finding the nearest descriptors to each other using the fast-approximate nearest-neighbors (FLANN) algorithm [27].
The computed transformation matrix, H(t), contains rotation and translation to relate the selected planes in the pair of aerial images (I(t) and I(t + ∆t)), and thus H(t) is applied to all pixels on I(t + ∆t). For a constant sampling time, T, the transformation matrix, H(t), is required to be computed only once as the relative positions of the two image planes remain the same.
As discussed earlier, if two arbitrary views of the same scene are registered as all points on the reference surface to be aligned, the residual parallax displacement can be used to estimate the height of any point of the image. If the aligned surface is a plane, then the parallax displacement is directly proportional to the height of the point and inversely proportional to its depth from the camera [28].

Parallax Displacement D(t)
The parallax displacement, D(t), of the aerial image for all pixels should be computed in order to determine the height of each point using (3). In theory, the parallax displacement for each pixel is the Euclidean distance between (x(t), y(t)) and (x (t + ∆t), y (t + ∆t)) for corresponding points. However, in practice, this leads to computation of sparse pixel parallax displacements that is not adequate to describe the height of all pixels in the image. Therefore, in this paper, the parallax displacement for each pixel is estimated based on polynomial expansion [29]. This involves estimating a neighborhood of each pixel with a quadratic polynomial function in the first image. By equating the coefficients, the parallax displacement, d, for that pixel can be estimated.
In order to improve approximation and reduce noise, it is assumed that the displacement varies slowly over a neighborhood. Thus, the parallax displacement, d, is computed as a weighted average of parallax displacements over a neighborhood of M around the pixel. The final output is a matrix, D(t), that is equal in size to the input images which describes the magnitude of the parallax displacement vectors at each pixel (see Figure 4).

3D Feature Map J(t)
The 3D feature map J(t 1 ) (m) that defines the height of each pixel in the aerial image at time t 1 is computed based on the parallax displacements using the following equation, Similarly, the second 3D feature map J(t 2 ) is computed using the second pair of images, I(t 2 ) and I(t 2 + ∆t). In order to compare these 3D maps with each other, the corresponding regions should be aligned.

Change Detection Unit
After acquiring the second 3D feature map J(t 2 ), the pair of 3D feature maps, J(t 1 ) and J(t 2 ), are compared to detect 3D point changes in the scene. The comparison model contains the height difference of corresponding pixels in two aerial images, I(t 1 ) and I(t 2 ). The height difference is compared with a threshold to obtain the changed regions in a scene over time (see Figure 5).

Defining a reference surface
Change-detection model

3D Feature Map Registration
In order to register J(t 2 ) based on J(t 1 ), a projective mapping is required using the transformation matrix, H(t 1 , t 2 ). The transformation matrix, H(t 1 , t 2 ), is computed with extracted corresponding keypoints on the reference surface on I(t 1 ) and I(t 2 ). Subsequently, the 3D feature map J(t 2 ) is warped according to J(t 1 ).

3D Comparison Model
The aim of the 3D comparison model is to determine the change in the height of points and consequently objects in the scene at two instants, t 1 and t 2 , where J'(t 2 ) is the transformed 3D feature map J(t 2 ).

Change Detection Model
The difference is considered significant if it is greater than a predefined threshold, and therefore a change in the pixel is detected, where τ is the predefined threshold based on the specific application.

Experimental Results and Discussion
Data were collected at 10 locations over two days using a commercial UAV with a built-in camera. The height of the flights was h U AV = 100 m, the camera's field of view was α = 84 • , GSD ≈ 0.039 m/px, and the image size was 3840 × 2160 px. The respective covered area on the ground in each aerial image was approximately 150 m×84 m. The ground truth of the measurement was obtained manually, representing pixels in the image that are above the reference surface in order to evaluate the detected changes. In addition, the results were compared using an adaptive model known as Kendall's Tau-d local pattern correlation with the intensity similarity threshold, alpha = 0.7, and the texture similarity threshold, beta = 0.95 [30].
The objective of the proposed system was to extract the height information of the pixels from aerial images taken at different times and, consequently, to compare the extracted 3D feature maps in order to determine regions with significant changes in their height profiles. The sampling time is computed to be T = 2.3 s using (6) by setting the drone speed, v U AV = 4.8 m/s, and the minimum height of the objects, h min = 0.42 m.
The evaluation parameters to assess the performance of the proposed change detection system are the accuracy rate (ACC), the true positive rate (TPR), and the false positive rate (FPR).
where TP, FP, TN, and FN are true positive, false positive, true negative, and false negative, respectively. The robustness and accuracy of the proposed system is closely related to the detection of parallax displacements. Therefore, one of the main parameters that impacts performance is the size of the neighborhood, M, around pixels where the weighted average of parallax displacements is computed. This is particularly important as not all pixels contain dominant keypoints, and thus displacements at those pixels are not reliable [29]. A suitable size for the neighborhood window depends on shape and size of the objects in the scene. In this paper, variant neighborhood window sizes are employed and 3D feature maps are compared against their ground truth. The results are displayed using a receiver operating characteristic (ROC) curve in Figure 6. In this context, positive is considered a pixel above and negative a pixel on the reference surface. Accordingly, the performance of the proposed model with respect to the 3D feature map extraction for variant neighborhood window sizes, M, is presented by the true positive rate and false positive rate computed using Equations (10) and (11). Based on the obtained ROC curve, the neighborhood window size is set to M = 40 and M = 50 and the 3D feature maps are generated for 10 locations over two days. The experimental results are discussed with examples from aerial images in different scenarios in order to evaluate the performance of the proposed model more specifically (see Figure 7).
In the first scenario, a single object, a truck trailer, is observed on a sunny day with strong shadows standing alone in the scene (see Figure 7a). The generated 3D feature map of the scene accurately corresponds to the object and its surroundings. Even the variant height of the trailer section and the wind deflector on the cabin are reflected accurately in the 3D map. The second scenario considers multiple objects in the scene (see Figure 7b). In this case, several trailers with variant heights are located beside each other with strong shadows covering them partially. The generated 3D feature map corresponds to each trailer individually with correct height reconstruction of the scene. The third scenario involves multiple objects with little space in between (see Figure 7c). In this case, the dense objects are reconstructed correctly, and variant heights as well as corresponding covered areas are generated accordingly in the 3D feature map. These results demonstrate the performance of the proposed model under various scenarios using only a pair of aerial images.
The analysed surfaces in focus for change detection in this work were mainly vehicles and trailers in a harbour area as for transportation applications. In addition, it was observed that construction of 3D feature maps from homogeneous surfaces (areas with a uniform appearance, e.g., a truck trailer's roof) raised difficulty due to having less distinctive keypoints [31]. Thus, as mentioned earlier, the parallax displacement was computed over a neighborhood window, M, in order to compensate for the lack of distinctive keypoints. As a result, the obtained 3D feature maps were more reliable especially for homogeneous surfaces; however, it introduced some errors over the edges. Moreover, the proposed method can potentially be applied for more complex surfaces such as hilly and even urban areas as we have partially observed in the experimental results.
It is worth mentioning that there are available commercial algorithms in order to achieve 3D reconstruction of a scene [32,33]. However, the main goal of the proposed method is to perform change detection in a scene by utilizing 3D feature maps. Moreover, the 3D feature map is acquired using only one pair of aerial images rather than several. The precision of the 3D feature maps might be compromised while it is computationally highly efficient to detect changed regions over large areas. The aligned 3D feature maps are then compared, and regions with significant changes are detected and compared against the ground truth. In addition, regions with similar heights but above the reference surface are also marked. This information is particularly important when studying the utilization of the scene in the physical world. Figure 8 illustrates change detection examples for three regions on two different days using the corresponding generated 3D feature maps. In the first and second regions, vans have disappeared on the second day; therefore, there are changes in the respective scenes over time. However, in the second region the vehicle's color was very similar to the background that makes it more difficult for the conventional change detection methods. In the third region, several trailers have been replaced with some other trailers with variant heights, positions and colors in a more challenging scenario. The absolute difference in the corresponding pixel heights successfully establishes the amount of detected changes in the respective regions for all three scenarios (see Figure 8c,f,i).  Table 1 demonstrates the change detection results over two days using the proposed system on 10 pairs of aerial images on each day. The smallest observed objects were passenger vehicles approximately 2 m by 4 m with a height of 1.5 m. In order to compare the results of the proposed system with another model, a local pattern correlation descriptor called Kendall's Tau-d is implemented and the obtained results are presented. The change detection results using the proposed 3D change detection approach are more accurate than those obtained using the intensity model. The main reason for this is that any intensity comparison model is highly dependent on brightness and illumination within two different instances of time. In contrast, the proposed 3D comparison model is affected by brightness conditions only at nearly the same time to construct a 3D feature map and the comparison of the 3D feature maps can accommodate changes in illumination. Moreover, the amount of noise in the image based on reflection (e.g., having a wet surface), strong shadows, and other factors can be very high at two different times, potentially affecting intensity methods negatively. Furthermore, intensity comparison methods cannot distinguish between objects with similar patterns and colors (e.g., white trailers in aerial images), meaning they fail to detect changes. On the contrary, the proposed 3D model features height information of objects in the scene at both instants. The computation time of the presented method can be considered in two parts: first, to create the 3D feature maps and second to detect changes using the obtained 3D feature maps. The computation time of the first part for creating the 3D feature map at each instant from a pair of aerial images with sizes of 3840 × 2160 px was approximately 1.76 s. Then, it took~0.06 s to compute the changed regions in a scene for the second part. In comparison, a well-known commercial software [32] was used to create the 3D feature map with the same pair of aerial images and giving four pair of corresponding keypoints to define the reference surface. The total computation time of the software to perform this task was approximately 28.32 s that includes aligning the images, building the dense cloud, and finally building the mesh. Furthermore, the Kendall's Tau-d model [30] determined changes in aerial images over time in~33.03 s (non-vectorized implementation). These computation times show a significant improvement in the processing time and, consequently, the computational cost as the result of the proposed method.
In addition, the proposed method can be used to detect 3D objects with respect to the reference surface in oblique aerial images (see the Section 5). However, in the case of oblique images (in contrast to orthogonal aerial images), perspective distortion causes significant depth variation. Nevertheless, using the described method, the relative position of any pixel to the reference surface can be determined so that 3D objects can be detected in the scene.

Conclusions
Change detection in multitemporal aerial images is challenging because of variant brightness, shadows, colors, and surface conditions at two or more instants. The proposed method in this paper determines changes in the scene regardless of the aforementioned undesired conditions. This method is based on height extraction of each point in the aerial images of a scene. The height information is computed using the parallax displacements and then related to the actual distance from the reference surface in the physical world. Consequently, the computed 3D feature maps are aligned and compared at different times to detect changes in the scene. The experimental results show that the proposed system effectively detects changed regions in the scene with an accuracy rate and true positive rate of 82.56% and 92.93%, respectively. Furthermore, this method is able to accommodate changes in illumination, strong shadows, and noise at different instants. In addition, a mathematical model is presented to calculate the suitable image sampling rate based on the minimum height of interest. It can be concluded that the proposed method determines changed regions in multitemporal aerial images quickly, efficiently, and with relatively low computational cost as an alternative to conventional intensity and pattern correlation comparison models.

Patent
Pettersson, M.I.; Javadi, S.; Dahl, M. A method for detecting changes of objects in a Scene. Swedish Patent 1850695-6, filed 8 June 2018, and expected to be issued 1 May 2020.