YOLOv3-Based Matching Approach for Roof Region Detection from Drone Images

Due to the large data volume, UAV image stitching and matching suffer from high computational cost. Traditional feature extraction algorithms—such as Scale-Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF), and Oriented FAST and Rotated BRIEF (ORB)—require heavy computation to extract and describe features in high-resolution UAV images. To overcome this issue, You Only Look Once version 3 (YOLOv3) is combined with traditional feature point matching algorithms to extract descriptive features from a drone dataset of residential areas for roof detection. Unlike the traditional feature extraction algorithms, YOLOv3 performs feature extraction only on the proposed candidate regions rather than the entire image, so the complexity of image matching is reduced significantly. The extracted features are then fed into the Structural Similarity Index Measure (SSIM) to identify the corresponding roof region pair between consecutive images in a sequence. In addition, the candidate corresponding roof pair produced by our architecture serves as the coarse matching region pair and limits the search range of feature matching to the detected roof region only. This further improves feature matching consistency and reduces the chance of wrong feature matches. Experimental results show that the proposed method is 13× faster than traditional image matching methods with comparable performance.


Introduction
Image registration is a traditional computer vision problem with applications in various domains ranging from military, medical, surveillance, and robotics to remote sensing [1]. With advances in robotics, cameras can be effortlessly mounted on a UAV to capture ground images from a top view. A UAV is often operated in a lawn-mower scanning pattern to capture a region of interest (ROI). The captured ROI images are then stitched together to provide an overview representation of the entire region. Drones are relatively low-cost and can be operated in remote areas.
The process of image stitching is useful in a number of tasks, such as disaster prevention, environmental change detection, road surveillance, land monitoring, and land measurement. The task of image matching can be divided into two sub-tasks: feature detection and feature description. Researchers have extensively used handcrafted feature descriptor algorithms, such as SIFT [2,3], SURF [4,5], and ORB [6]. In the feature detection task, distinctive and repeatable features are first detected and input into a non-ambiguous matching algorithm [7,8]. These features are then summarized by region descriptor algorithms such as SIFT, SURF, or ORB. These handcrafted descriptors work by summarizing the histogram of gradients in the region surrounding the feature. SIFT is the pioneering work on descriptor handcrafting that is robust to scale and orientation changes; SURF and ORB are approximate, faster versions of SIFT. Features are then matched by measures such as brute-force matching or FLANN-based matching, which rely on the nearest descriptor distance and keep the matches that satisfy the ratio test suggested by Lowe [2]. As the raw matches based on these measures often contain outliers, Random Sample Consensus (RANSAC) [9] is often adopted to perform a match consistency check that filters the outliers. Drone image motion is generally caused by the movement of the camera. Hence, the camera motion can be modeled as a global motion in which every pixel in the image shares a single motion. The global motion is generally modeled as a transformation matrix, which can be estimated from as few as four matching pairs.
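As a concrete illustration of the matching step in this pipeline, Lowe's ratio test can be sketched as follows. This is a minimal brute-force matcher over synthetic descriptors; the 0.75 ratio threshold is the commonly used value, not a parameter specified in this paper:

```python
import numpy as np

def match_descriptors(desc1, desc2, ratio=0.75):
    """Brute-force matching with Lowe's ratio test.

    For each descriptor in desc1, find its two nearest neighbours in
    desc2 (Euclidean distance) and accept the match only if the closest
    distance is sufficiently smaller than the second closest.
    Returns a list of (index_in_desc1, index_in_desc2) pairs.
    """
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)
        order = np.argsort(dists)
        best, second = order[0], order[1]
        if dists[best] < ratio * dists[second]:
            matches.append((i, best))
    return matches

# Toy 4-D descriptors: rows 0 and 1 of desc1 have clear counterparts
# in desc2; row 2 is ambiguous and should be rejected by the ratio test.
desc1 = np.array([[1.0, 0, 0, 0],
                  [0, 1.0, 0, 0],
                  [0.5, 0.5, 0, 0]])
desc2 = np.array([[1.0, 0.1, 0, 0],
                  [0.1, 1.0, 0, 0],
                  [0.5, 0.45, 0, 0],
                  [0.45, 0.5, 0, 0]])
matches = match_descriptors(desc1, desc2)
```

The ambiguous descriptor is dropped because its two nearest neighbours are nearly equidistant, which is exactly the case the ratio test is designed to reject.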
Recent advances in deep learning and convolutional neural networks have been applied in various fields such as natural language processing and subsequently in computer vision, especially in the tasks of object detection and object classification [10,11]. The concept of the convolutional neural network was first introduced in LeNet [12]; AlexNet [13] made it well known after winning the 2012 ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [14]. Various studies have shown that training a deep network on a large dataset can lead to better testing accuracy [13][14][15][16]. Advances in hardware such as the graphics processing unit (GPU) have made it possible to process larger data in a shorter time. Recent deep learning methods, specifically YOLOv3 [17], have shown consistently good results for object detection and classification.
The most straightforward idea for reducing the computational time of drone image registration is a high-performance computing (HPC) approach. This study introduces a novel method that integrates a GPU-based deep learning algorithm into traditional image matching methods. The use of a GPU is a significant recent advance that makes the training stage of deep network methods more practical [18][19][20][21]. The proposed method generates robust candidate regions by adopting YOLOv3 [17] and performs traditional image matching only on those candidate regions. Similar to Fast R-CNN [22,23], candidate regions are used, but for the image matching task instead of image classification.
Structural similarity (SSIM) is then adopted to determine the similarity of the candidate regions. Mismatched regions are filtered out, and the remaining overlaps are matched to confirm the corresponding relationship between the overlapping regions of two adjacent images. A traditional feature extraction algorithm is then run to extract and match features from the matched regions. The search region is thus limited to a very small area of the image, reducing the matching error. In urban areas, the roof is an important piece of infrastructure information [20]. The proposed approach therefore leads to a significant reduction in computational requirements, as image matching is performed only on the candidate roof regions, which makes it well suited for real-time image registration applications. In this paper, it is shown that our proposed method is 13× faster than the traditional methods of SIFT, SURF, and ORB.

Traditional Image Stitching Methods and Deep Learning
Image stitching has long been studied in the fields of computer vision and remote sensing. Traditional image matching methods involve handcrafting descriptors that are robust to photometric and geometric variations at some distinctive, repeatable feature locations. The computational cost of the image stitching process rises linearly with the image size, as more features are detected and matched for the stitching. Recent advances in convolutional neural networks and deep learning have shown remarkable results in the fields of language processing and image processing. Deep learning has revolutionized high-level computer vision tasks such as object detection and classification. However, further research is needed on adapting deep learning methods to low-level computer vision tasks such as image matching.

Traditional Image Matching
Traditional image matching methods can be classified as feature-based or pixel-based. For drone image registration, the motion is caused only by the movement of the drone. This motion can be approximated by a single global motion shared by all the pixels in the image; hence, feature-based matching is popular in drone image registration. Moreover, feature-based matching is robust to photometric and geometric variations. Only a few distinctive, repeatable feature points are detected, and their descriptors are matched. Well-known feature detection methods include the Harris corner detector [7], the Hessian affine region detector [24], and the Shi-Tomasi feature detector [8]. Handcrafted feature descriptors such as SIFT [2], SURF [4], and ORB [6] are based on the histogram of gradients (HOG) of a local region surrounding a keypoint location. SIFT [2] is a pioneering feature descriptor and is the basis for the faster approximate variants SURF [4] and ORB [6].

Scale-Invariant Feature Transform (SIFT)
David Lowe presented the Scale-Invariant Feature Transform (SIFT) algorithm in 1999 [2]. SIFT is perhaps one of the earliest works providing a comprehensive keypoint detection and feature descriptor extraction technique. The SIFT algorithm has four basic steps: scale-space extrema detection, keypoint localization, orientation assignment, and keypoint descriptor generation.
Thirdly, to characterize the image at each keypoint, the Gaussian-smoothed image L at the pyramid level with the closest scale is used, so that all computations are performed in a scale-invariant manner. At each pixel L(x, y), the gradient magnitude m(x, y) and orientation θ(x, y) can be calculated as shown in Equations (2) and (3):

m(x, y) = √[(L(x + 1, y) − L(x − 1, y))² + (L(x, y + 1) − L(x, y − 1))²] (2)

θ(x, y) = tan⁻¹[(L(x, y + 1) − L(x, y − 1)) / (L(x + 1, y) − L(x − 1, y))] (3)
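A minimal sketch of Equations (2) and (3), assuming the standard SIFT central-difference formulation on the smoothed image L (here indexed as L[y, x]):

```python
import numpy as np

def gradient_mag_ori(L, x, y):
    """Gradient magnitude m(x, y) and orientation theta(x, y) of the
    Gaussian-smoothed image L, using the central differences of the
    standard SIFT formulation (Equations (2) and (3))."""
    dx = L[y, x + 1] - L[y, x - 1]
    dy = L[y + 1, x] - L[y - 1, x]
    m = np.sqrt(dx ** 2 + dy ** 2)
    theta = np.arctan2(dy, dx)
    return m, theta

# A simple horizontal ramp: intensity grows along x only, so the
# gradient at the centre should point along +x with magnitude 2.
L = np.tile(np.arange(5.0), (5, 1))
m, theta = gradient_mag_ori(L, 2, 2)
```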
The final step of the SIFT algorithm is the computation of the local image descriptors, for which location, scale, and orientation have been determined at each keypoint.

Speeded Up Robust Features (SURF)

SURF is based on the Hessian matrix, which is used to find feature points [4]. The Hessian matrix is a square matrix of second-order partial derivatives of a scalar-valued function; it describes the local curvature of a function of many variables and thus measures the local change around each point. SURF selects the points at which the determinant of the Hessian is maximal. Given a point X = (x, y) in image I, the Hessian matrix H(X, σ) at point X and scale σ is defined as

H(X, σ) = [L_xx(X, σ), L_xy(X, σ); L_xy(X, σ), L_yy(X, σ)] (4)

where L_xx(X, σ) denotes the convolution of the Gaussian second-order derivative ∂²g(σ)/∂x² with image I at point X, and L_xy(X, σ) and L_yy(X, σ) are defined similarly. For orientation assignment, SURF uses wavelet responses in both horizontal and vertical directions, weighted with adequate Gaussian weights. SURF also uses wavelet responses for the feature description: a neighborhood around the keypoint is selected and divided into subregions, and for each subregion the wavelet responses are taken and aggregated to form the SURF feature descriptor. The sign of the Laplacian, which is already computed during detection, is stored for each interest point: it distinguishes bright blobs on dark backgrounds from the reverse case. During matching, features are compared only if they have the same type of contrast (based on the sign), which allows faster matching [5].
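The Hessian determinant response can be sketched with Gaussian derivative filters; this is an illustrative approximation (SURF itself replaces the Gaussian derivatives with box filters evaluated via integral images, and weights L_xy by 0.9 to compensate for that approximation):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian_determinant(img, sigma):
    """det H(X, sigma) = Lxx * Lyy - (0.9 * Lxy)^2, where each L term is
    the image filtered with the corresponding Gaussian second-order
    derivative at scale sigma. The 0.9 factor is the weight SURF uses to
    balance its box-filter approximation."""
    Lxx = gaussian_filter(img, sigma, order=(0, 2))  # d^2/dx^2
    Lyy = gaussian_filter(img, sigma, order=(2, 0))  # d^2/dy^2
    Lxy = gaussian_filter(img, sigma, order=(1, 1))  # d^2/(dx dy)
    return Lxx * Lyy - (0.9 * Lxy) ** 2

# A bright Gaussian blob centred at (32, 32): the determinant response
# should be positive at the centre and peak there.
yy, xx = np.mgrid[0:64, 0:64]
img = np.exp(-((xx - 32) ** 2 + (yy - 32) ** 2) / (2 * 4.0 ** 2))
resp = hessian_determinant(img, sigma=4.0)
peak = np.unravel_index(np.argmax(resp), resp.shape)
```

The positive determinant at the blob centre illustrates why thresholding det H isolates blob-like structures regardless of their polarity.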

Oriented FAST and Rotated BRIEF (ORB)
Oriented FAST and Rotated BRIEF (ORB) is a fast feature extractor and descriptor algorithm presented by Rublee et al. [6].
ORB is a fusion of the FAST keypoint detector and the BRIEF descriptor with some modifications. Initially, to determine the keypoints, it uses FAST, as shown in Equation (5).
The FAST corner detector uses a circle of 16 pixels to classify whether a candidate point p is actually a corner. Each pixel in the circle is labeled from 1 to 16 clockwise. I_p is the intensity of the candidate pixel p, and I_k is the intensity of the k-th pixel on the circle. The Corner Response Function (CRF) gives a numerical value for the corner strength at a pixel location based on the image intensity in the local neighborhood, and t is a threshold intensity value. A Harris corner measure is then applied to select the top N points. FAST does not compute orientation and is therefore rotation-variant. ORB computes the intensity-weighted centroid of the patch with the located corner at its center; the direction of the vector from the corner point to this centroid gives the orientation, and moments are computed to improve the rotation invariance. The BRIEF descriptor performs poorly under in-plane rotation. In ORB, a rotation matrix is computed from the orientation of the patch, and the BRIEF descriptors are steered according to this orientation [6].
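The intensity centroid orientation described above can be sketched as follows; the 31 × 31 patch size mirrors ORB's default, but the example is illustrative rather than a reproduction of the ORB implementation:

```python
import numpy as np

def intensity_centroid_orientation(patch):
    """Orientation of a keypoint patch by the intensity centroid method
    used in ORB: moments m10 and m01 are computed with coordinates
    relative to the patch centre, and the orientation is the angle of
    the vector from the centre to the centroid."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    xs -= (w - 1) / 2.0   # centre the coordinate frame
    ys -= (h - 1) / 2.0
    m10 = np.sum(xs * patch)  # first-order moment in x
    m01 = np.sum(ys * patch)  # first-order moment in y
    return np.arctan2(m01, m10)

# Patch whose intensity mass sits to the right of the centre: the
# orientation should point along the +x axis (angle close to 0).
patch = np.zeros((31, 31))
patch[:, 20:] = 1.0
theta = intensity_centroid_orientation(patch)
```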

RANdom SAmple Consensus (RANSAC)
The descriptors of an image pair are matched against each other to identify the best match with the minimum distance using the brute-force method. As the matches often contain outliers, a consistency check such as RANSAC [9] is often used to remove inconsistent matches. Figure 1 shows the matched points of an input image pair after adopting the RANSAC algorithm [9]. The consistent matches are then used to model a transformation matrix that estimates a global motion for every pixel.
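The sample-score-keep-best structure of RANSAC can be sketched with a toy example; for brevity, a translation-only global motion model (one pair per hypothesis) stands in for the four-pair homography used in practice:

```python
import numpy as np

def ransac_translation(src, dst, iters=200, tol=2.0, seed=0):
    """Toy RANSAC for a translation-only global motion model.

    Repeatedly samples one correspondence (a translation needs only
    one), counts how many pairs agree with it within `tol` pixels, and
    keeps the hypothesis with the most inliers. A full homography works
    the same way but samples four pairs per iteration.
    Returns (best_translation, inlier_mask)."""
    rng = np.random.default_rng(seed)
    best_t, best_mask = None, None
    for _ in range(iters):
        i = rng.integers(len(src))
        t = dst[i] - src[i]                       # hypothesised motion
        err = np.linalg.norm(dst - (src + t), axis=1)
        mask = err < tol
        if best_mask is None or mask.sum() > best_mask.sum():
            best_t, best_mask = t, mask
    # Refine the translation on the inliers only.
    best_t = (dst[best_mask] - src[best_mask]).mean(axis=0)
    return best_t, best_mask

# 8 correct matches displaced by (30, -12), plus 2 gross outliers.
src = np.array([[0, 0], [10, 5], [20, 40], [7, 7], [50, 3],
                [33, 21], [8, 60], [44, 44], [1, 2], [3, 4]], float)
dst = src + np.array([30.0, -12.0])
dst[8] = [500, 500]   # outlier
dst[9] = [-40, 300]   # outlier
t, mask = ransac_translation(src, dst)
```

The two outliers cannot gather more support than the true motion, so they are excluded from the final inlier set and the refined translation.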

Deep Learning Algorithms
The development of neural network-based systems has drastically increased, and these systems have demonstrated extraordinary performance [22]. Neural network-based methods have recently emerged as potential alternatives to the traditional methods [24,25]. The recent success of deep learning in computer vision has led to the adoption of the convolutional neural network (CNN) in low-level computer vision tasks such as image matching. Hardware advances such as the GPU enable the training of very deep CNNs that incorporate hundreds of layers [11].

Object Detection Network
Most current object detection frameworks are either one-stage or two-stage. Regions with convolutional neural network features (R-CNN) [26], Fast R-CNN [22], and Faster R-CNN [27,28] are two-stage object detection frameworks. Two-stage object detectors often achieve high object detection accuracy at a high computational cost. One-stage object detectors, including the single shot multibox detector (SSD) [29] and YOLOv3 [17], formulate object detection on an input image as a regression problem that outputs class probabilities as well as bounding box coordinates. One-stage object detectors have recently gained popularity, as they achieve comparable object detection accuracy at better speed than two-stage object detectors. Specifically, YOLOv3 [17] has reported consistently high accuracy in object detection. On a Pascal Titan X, YOLOv3 [17] runs in real time at 30 FPS and has a mAP-50 of 57.9% on COCO test-dev.
In this paper, we construct a YOLOv3-based [17] end-to-end trained convolutional neural network to detect the class "roof". YOLOv3 [17] uses a single neural network to directly predict the bounding boxes and class probabilities. Detailed information about YOLOv3 object detection is given in the next section.

Proposed Method
This study presents a novel method that uses YOLOv3 [17] object detection, running on an NVIDIA TITAN Xp, to generate a few plausible candidate regions for two subsequent drone images. The proposed method performs traditional image matching procedures, such as feature extraction and description, only in the candidate roof regions, thus significantly reducing complexity compared to conventional methods (such as SIFT [2], SURF [4], and ORB [6]). Figure 2 shows the complete flow chart of the algorithm. All the default YOLOv3 [17] parameter settings were applied, except that the network was trained for only a single class, "roof". The image was divided into S × S grid cells of 13 × 13, 26 × 26, and 52 × 52 for detection at the corresponding scales. Each grid cell is responsible for outputting three bounding boxes, B = 3. Each bounding box outputs five parameters, x, y, w, h, and confidence (see Equation (6)), which define the bounding box location as well as a confidence score indicating the likelihood that the bounding box contains an object.
Pr(Object) denotes the probability that the box contains an object. If a cell has no object, the confidence score should be 0; otherwise, the confidence score should equal the intersection over union (IOU) between the predicted box and the ground truth:

Confidence = Pr(Object) × IOU_predict^truth (6)

IOU is the ratio between the intersection and the union of the predicted box and the ground truth box; when the IOU exceeds a threshold, the bounding box is considered correct, as shown in Equation (7):

IOU = area(box_truth ∩ box_predict) / area(box_truth ∪ box_predict) (7)

This standard measures the correlation between the ground truth, box_truth, and the prediction, box_predict; a higher value represents a higher correlation.
IOU is frequently adopted as an evaluation metric to measure the accuracy of an object detector. The importance of IOU is not limited to assigning anchor boxes during preparation of the training dataset; it is also very useful in the non-max suppression algorithm for cleaning up whenever multiple boxes are predicted for the same object. The IOU threshold is set to 0.5 (the usual default), which means that at least half of the ground truth and the predicted box must cover the same region. When the IOU is greater than the 50% threshold, the test case is predicted as containing an object.
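A minimal IOU computation, assuming boxes are given as (x1, y1, x2, y2) corner coordinates (the coordinate convention is an assumption for illustration):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as
    (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two 10x10 boxes offset by 5 pixels in x: intersection 50, union 150.
score = iou((0, 0, 10, 10), (5, 0, 15, 10))
```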
Each grid cell is assigned one conditional class probability, Pr(class|object), which is the probability that the object belongs to the class "roof" given that an object is present. The class confidence score for each prediction box is then calculated as in Equation (8), which combines the classification confidence and the localization confidence.
Class Confidence = Box Confidence × Pr(class|object) (8)

The detection output tensor is of size S × S × B × (5 + C), where the value 5 accounts for the four bounding box attributes and one confidence score. Figure 3 shows the detection process using YOLOv3 [17]. Figure 4 shows the backbone network adopted in YOLOv3 [17] for multiscale object detection. This study adopted the network model for a single object class, "roof"; the detected "roof" objects became our candidate regions.
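The output tensor size and Equation (8) can be checked with a few lines; S, B, and C follow the values used in this paper (three scales, B = 3, and a single class "roof", so C = 1):

```python
def detection_tensor_size(S, B, C):
    """Number of values predicted per scale: S x S grid cells, B boxes
    per cell, each box carrying 4 coordinates + 1 confidence + C class
    probabilities."""
    return S * S * B * (5 + C)

def class_confidence(box_confidence, class_prob):
    """Equation (8): classification confidence combined with
    localization confidence."""
    return box_confidence * class_prob

# Single class "roof" (C = 1), B = 3 boxes per cell, at the three
# detection scales used in the paper.
sizes = [detection_tensor_size(S, 3, 1) for S in (13, 26, 52)]
conf = class_confidence(0.9, 0.8)
```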

Experiment Environment
The experiment environment includes an Intel(R) Core(TM) i7-8770 @ 3.2 GHz CPU with 24 GB of memory and an NVIDIA GeForce TITAN Xp GPU with 24 GB of memory, using CUDA 9.0. Table 1 shows the hardware and software configurations for the training process. To evaluate the effectiveness of this research method, we used a set of real images acquired by a UAV equipped with imaging sensors spanning the visible range. The camera is a SONY a7R, characterized by an Exmor R full-frame CMOS sensor with 36.4 megapixels. All images were acquired from the National Science and Technology Center for Disaster Reduction, New Taipei, on 13 October 2016, at 10:00 a.m. The images have three channels (RGB) with 8 bits of radiometric resolution and a spatial resolution of 25 cm ground sample distance (GSD). Table 2 shows the UAV platform and sensor characteristics. In this study, the dataset comprises 99 drone images of 6000 × 4000 pixels, captured in the Xizhi District, New Taipei City, Taiwan. As YOLOv3 [17] is designed to train and test on images of 416 × 416 pixels, the original images were cropped into 1000 × 1000 pixel tiles with overlapping areas of 70% between subsequent tiles. The cropped images were then randomly split into training and testing data at a ratio of 9:1.
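A sketch of the tiling step under the stated parameters: 1000 × 1000 crops at 70% overlap imply a 300-pixel stride, and a final crop is shifted to end at the image border (the border-handling rule is our assumption, not stated in the paper):

```python
def crop_origins(image_size, crop=1000, overlap=0.7):
    """Top-left origins of square crops covering one image dimension
    with the given fractional overlap between neighbouring crops; the
    last crop is shifted so it ends exactly at the image border."""
    stride = int(crop * (1 - overlap))  # 1000 px at 70% overlap -> 300
    origins = list(range(0, image_size - crop + 1, stride))
    if origins[-1] + crop < image_size:
        origins.append(image_size - crop)
    return origins

xs = crop_origins(6000)   # along the 6000-px image width
ys = crop_origins(4000)   # along the 4000-px image height
```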
In order to train the network to output the location of the object, all the ground truth objects in the images need to be labeled first. We used LabelImg [29], an open-source annotation tool hosted on GitHub (tzutalin.github) and currently the most widely used, to create the ground truth bounding boxes for the object detection task. Figure 5 shows a screenshot of the process of creating the ground truth bounding boxes using the LabelImg software. As the drone images mostly cover residential areas, only a single object class, "roof", was labeled. The annotations of the training images, in XML format, were used directly in the YOLOv3 end-to-end training network.

Evaluation Methods
Precision is the ratio of true positives (correct predictions) to the total number of predicted positives, precision = TP/(TP + FP), where TP denotes the number of true positives, FP the number of false positives, and FN the number of false negatives. Recall is the ratio of true positives to the total number of ground truth positives, recall = TP/(TP + FN).
The average precision (AP) is the area under the precision-recall curve, and p(k) denotes the precision value at recall = k.
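Precision, recall, and AP can be sketched as follows; the trapezoidal integration is one of several common AP variants and is not necessarily the one used to obtain the AP reported later in this paper:

```python
import numpy as np

def precision_recall(tp_flags, n_ground_truth):
    """Cumulative precision and recall over a detection list sorted by
    descending confidence; tp_flags[i] is 1 if detection i matches a
    ground-truth box (IOU above threshold), else 0."""
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(1 - np.asarray(tp_flags))
    precision = tp / (tp + fp)
    recall = tp / n_ground_truth
    return precision, recall

def average_precision(precision, recall):
    """Area under the precision-recall curve, integrated with the
    trapezoidal rule over recall."""
    return float(np.sum((recall[1:] - recall[:-1])
                        * (precision[1:] + precision[:-1]) / 2))

# 5 detections ranked by confidence, 4 ground-truth roofs.
p, r = precision_recall([1, 1, 0, 1, 0], n_ground_truth=4)
ap = average_precision(p, r)
```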
The loss function is a function that maps an event or value of one or more variables onto a real number intuitively representing some 'cost' associated with the event. Therefore, the performance of the training model can be measured by calculating the loss function.
YOLOv3 uses multiple independent logistic classifiers instead of Softmax to classify each box, since Softmax is not suitable for multi-label classification, and using independent logistic classifiers does not decrease the classification accuracy. The optimization loss function can therefore be expressed as shown in Equation (12).
In Equation (12), the first term (weighted by λ_coord) calculates the loss related to the predicted bounding box position (x, y), and the second term calculates the loss related to the predicted box width and height (w, h). The third term is the loss associated with the confidence score of each box predictor, where C_i is the predicted confidence score and Ĉ_i is the intersection over union of the predicted box with the ground truth, as expressed in Equation (13). The final term is the classification loss.

Ĉ_i = Pr(Object) × IOU_predict^truth (13)

In this study, the dataset comprises 99 drone images. We used 50 of these images, containing 2200 house roofs in total, divided into 2000 training samples and 200 testing samples. Training took 4 h; the training time of the deep learning algorithm is excluded from the execution time (T) computation. Figure 6 shows the precision-recall curve generated by the YOLOv3 [17] roof detection model trained with our dataset (2000 training samples, 200 testing samples). The average precision obtained is AP = 80.91%. Figure 7 depicts the roof detection results on the dataset.
The matching rate (MR) is the ratio between the number of correct matching feature points and the total number of matching feature points detected by the algorithm. In Equation (14), Keypoint_1 and Keypoint_2 refer to the numbers of keypoints detected in the first and second images, respectively, and Matches is the number of matches between these two sets of interest points. In Equation (15), we use the match performance (MP) to describe the matching status per unit time. In Equation (16), k is the number of filtered match pairs, where p ∈ [1, k], q ∈ [1, k], and (x_p, y_p) and (x_q, y_q) are the spatial coordinates of the corresponding matching points on the registration image and the reference image, respectively. A smaller RMSE means higher registration accuracy, and RMSE < 1 means that the registration accuracy reaches the sub-pixel level.
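A plausible reading of the RMSE in Equation (16) can be sketched as follows (the exact form of Equation (16) is not reproduced in the text, so the formula below is an assumption consistent with the surrounding definitions):

```python
import numpy as np

def registration_rmse(pts_registered, pts_reference):
    """Root mean squared error over k matched point pairs: the square
    root of the mean squared Euclidean distance between corresponding
    coordinates on the registered and reference images. RMSE < 1
    indicates sub-pixel registration accuracy."""
    d = np.asarray(pts_registered, float) - np.asarray(pts_reference, float)
    return float(np.sqrt(np.mean(np.sum(d ** 2, axis=1))))

# Three matched pairs, each off by (0.6, 0.8): per-pair error is 1.0.
reg = [[10.6, 20.8], [30.6, 40.8], [50.6, 60.8]]
ref = [[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]]
rmse = registration_rmse(reg, ref)
```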

Xizhi District, New Taipei City CASE 1
After the training of YOLOv3 is completed, the weights generated by the training can be used to detect candidate overlapping areas in other UAV images. Figure 8a shows the first image (taken at time = t), which is the reference image for the proposed YOLOv3-based roof region detection. In Figure 8b, three roof regions detected by the YOLOv3-based roof region detection are highlighted. Figure 8c-e shows the candidate regions in the reference image detected by the YOLOv3 object detector. Figure 9a shows the second image (taken at time = t + interval shooting time), which is the image to be registered. In Figure 9b, three roof regions detected by the YOLOv3-based roof region detection are highlighted. Figure 9c-e shows the candidate regions in the registered image detected by the YOLOv3 object detector. The candidate roof regions were matched to find the corresponding region pairs using SSIM [30]. Table 3 shows the SSIM measures between candidate regions and their respective execution times. After obtaining the corresponding region pairs, the traditional feature matching algorithms SIFT [2], SURF [4], and ORB [6] were performed, as shown in Figures 10-12 (the right image is the registered image; the left image is the reference image). To compare the proposed method with the traditional image matching algorithms SIFT [2], SURF [4], and ORB [6], feature extraction and matching were also performed with these algorithms on the original image pairs, as shown in Figures 13-15, respectively. We recorded the number of keypoints, the execution time, and the match point coordinates to compute the match rate (MR), match performance (MP), and root mean squared error (RMSE). In this paper, we used the ENVI (Environment for Visualizing Images) software to compute the RMSE; the manually selected ground control points (GCPs), combined with the match point coordinates, were used in the RMSE calculation.
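The SSIM comparison of candidate regions can be sketched as follows; for brevity this computes a single global SSIM over each region pair, rather than the windowed, averaged SSIM of the original formulation [30]:

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Structural similarity computed globally over two equal-size
    patches. C1 and C2 are the standard stabilising constants
    (0.01 and 0.03 times the data range, squared)."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cxy = ((x - mx) * (y - my)).mean()   # covariance
    return ((2 * mx * my + c1) * (2 * cxy + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(1)
a = rng.random((32, 32))
noisy = np.clip(a + 0.2 * rng.random((32, 32)), 0, 1)
s_same = global_ssim(a, a)       # identical regions -> 1.0
s_noisy = global_ssim(a, noisy)  # degraded copy -> below 1.0
```

In the region-pairing step, the candidate pair with the highest SSIM would be kept as the coarse matching region pair.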
As shown in Figure 16, the 20 pairs of red markers denote the manually selected GCPs. Table 4 and Figure 17 summarize the comparison of the traditional image matching algorithms SIFT [2], SURF [4], and ORB [6] with the YOLOv3-based candidate region matching algorithms YOLOv3+SIFT, YOLOv3+SURF, and YOLOv3+ORB. As shown in Table 4, the proposed method was more than 13× faster than the traditional image matching algorithms. Table 4. Comparison between the traditional image matching methods and the YOLOv3-based candidate region image matching method for the image pair in Figure 8. Our proposed method is compared with the traditional image matching algorithms using the quantitative evaluation indexes execution time (T), match rate (MR), match performance (MP), and root mean squared error (RMSE). Among the traditional algorithms, SIFT [2] had the largest number of matches but also the longest execution time and the lowest match rate; SURF [4] achieved the best match rate (MR) and RMSE; and ORB [6] had the best execution time (T). As shown in Table 4, the experimental results show that the proposed method performed better than the traditional image matching algorithms. The proposed method can be implemented rapidly and has high accuracy and strong robustness.

Xizhi District, New Taipei City CASE 2
We also evaluated the performance of the YOLOv3-based roof region detection on other cases. Figure 19 shows the reference image and the candidate regions in the reference image detected by the YOLOv3 object detector. Table 5 shows the SSIM measures between candidate regions and their execution times. The traditional feature matching algorithms SIFT [2], SURF [4], and ORB [6] were run on the corresponding region pairs, as shown in Figures 21 and 22 (the right image is the registered image; the left image is the reference image). The manually selected ground control points (GCPs), combined with the match point coordinates, were used in the root mean square error (RMSE) calculation, as shown in Figure 26. Figure 27 summarizes the comparison between the traditional image matching algorithms SIFT [2], SURF [4], and ORB [6] and the YOLOv3-based candidate region matching algorithms YOLOv3+SIFT, YOLOv3+SURF, and YOLOv3+ORB. As shown in Table 6, the proposed method was more than 15× faster than the traditional image matching algorithms. Table 6. Comparison between the traditional image matching methods and the YOLOv3-based candidate region image matching method for the image pair in Figure 19. Figure 28 shows the registration result of the proposed YOLOv3-based matching method. As shown in Table 6, the proposed method performed better than the traditional image matching algorithms, especially in execution time (T), where it was 15× faster. The proposed method can be implemented rapidly and has high accuracy and strong robustness.

Conclusions
Traditional feature-based image matching algorithms have dominated image matching for decades. A fast image matching algorithm is desirable as image resolution and size grow significantly. With the advances of the GPU, deep learning algorithms have been adopted in various computer vision and language processing fields. In this paper, we proposed a YOLOv3-based image matching approach for fast roof region detection from drone images. As the feature-based matching is performed only on the corresponding region pair instead of the original image pair, the computational complexity is reduced significantly. The proposed approach showed comparable results and performed 13× faster than the traditional methods. In future work, our model will be trained using overlapping regions with different object conditions, and the proposed approach will be applied to other UAV images.