Diverse Scene Stitching from a Large-Scale Aerial Video Dataset

Diverse scene stitching is a challenging task in aerial video surveillance. This paper presents a hybrid stitching method based on the observation that aerial videos captured in real surveillance settings are neither totally ordered nor completely unordered. Often, human operators apply continuous monitoring of the drone to revisit the same area of interest. This monitoring mechanism yields to multiple short, successive video clips that overlap in either time or space. We exploit this property and treat the aerial image stitching problem as temporal sequential grouping and spatial cross-group retrieval. We develop an effective graph-based framework that can robustly conduct the grouping, retrieval and stitching tasks. To evaluate the proposed approach, we experiment on the large-scale VIRATaerial surveillance dataset, which is challenging for its heterogeneity in image quality and diversity of the scene. Quantitative and qualitative comparisons with state-of-the-art algorithms show the efficiency and robustness of our technique.


Introduction
Diverse scene stitching from a large-scale aerial video dataset is highly desired in the field of remote sensing.Different from single scene stitching, diverse scene stitching simultaneously processes data from multiple scenes.For each scene, the data may come from revisited observations of the same aerial platform at different times or from co-observations of multiple aerial platforms.The purpose of diverse scene stitching is to combine all observations with various times, illuminations, viewpoints and even platforms together to generate scene oriented panoramas.The biggest advantage is the ability to automatically explore multiple scenes and to generate panoramas from large-scale aerial surveillance data.Further, the results also provide useful inputs for high-level scene understanding.
Image stitching has been well studied over the past decade [1][2][3][4][5][6][7][8][9][10].We refer the readers to the comprehensive survey [11] for a background.In the field of remote sensing, image stitching or image mosaicking is one of the main tasks, and there are many relevant works, such as environmental monitoring, land, water or marine resources survey, aerial video surveillance, etc.Here, we mainly focus on stitching methods related to aerial video surveillance, and the related stitching methods [1,5,6,9,12] can be classified into two categories, including sequential stitching and retrieval stitching.
Sequential stitching algorithms [1,5,12] are based on the assumption that the input data are ordered.Small baseline algorithm, such as optical flow [13,14], can then be used to efficiently estimate the image transformation between consecutive frames.Assuming the time consistency cannot be broken, these sequential methods require that the input data should be taken on a stable aerial platform without sudden viewpoint changes or large motion blurs.Otherwise, no time consistency can be used to align these sequential images, resulting in discontinuous panoramas.While advanced methods based on wide baseline image matching techniques, such as the well-known scale-invariant feature transform [15] and speeded-Up robust features (SURF) [16], can reduce these artifacts, they cannot handle large changes of aerial data.For example, Figure 1 shows samples from the benchmark VIRATdataset [17] of a diverse scene (as shown in Figure 1a).Motion blurs, low contrast and significant viewpoint changes (as shown in Figure 1b,c) are common in the acquired images.The continuity assumption hence is not valid in this case.
The stitching problem can also resort to image matching techniques.Techniques, such as [2,4,6,[18][19][20], assume that the input data are completely unordered.Brown et al. [4,19] formulate stitching as a multi-image matching problem and use brute-force searching to find the overlapping relationships among all images.These methods can recognize multiple panoramas from a small-scale image dataset.
However, given a dataset with n images, this results in n(n − 1)/2 possible image pairs and, hence, leads to O(n 2 ) complexity.Consequently, the computational cost of these approaches is generally very high and becomes a bottleneck in real-time surveillance.
There is also an emerging trend for using retrieval techniques [6,18,20] to find one-to-many image matches.These solutions then apply two-view matching on all selected image pairs.The retrieval stitching method is based on feature indexing.Although using indexing is much less expensive than matching, it requires sufficient distinctive image features.If the observed scene lacks enough unique features, the variance of different images is not enough for the method to reliably find correct matches.
As a result, the retrieval stitching method cannot find all overlapping aerial images from the same scene and yields low performance: typically about 80% correct matches can be found [18].For instance, a large number of indistinctive or self-similar scenes in the VIRAT dataset (as shown in Figure 1b) will lead to retrieval failure.The main contribution of this paper is a hybrid stitching method that unifies the temporal continuity and spatial repeatability of aerial videos in a graph-based framework.The heart of our method is built on the observation that aerial videos captured in diverse realistic scene are neither totally ordered nor completely unordered.Even though the aerial videos are corrupted by large motion blurs, sudden scene changes, low contrast and high camera noise, we observe that for each scene, there always exists short continuous videos from different times.We therefore adopt sequential grouping to first roughly partition the entire video into small continuous groups and then present a cross-group retrieval method to efficiently find spatially overlapping images among different groups.Finally, a graph-based method is applied to find global optimal paths for stitching the images into panoramas.Experiments show that our method can robustly stitch VIRAT aerial surveillance video and achieve a few orders of magnitude accelerations over the state-of-the-art stitching systems, such as PTGui [21], AutoStitch [19] and the most recent scheme by Autopano [22].

Problem Formulation
In this section, we tackle the diverse scene stitching problem by proposing a reliable hybrid stitching model to automatically find overlapping images from a large-scale aerial dataset.The proposed model is built upon two characteristics of aerial video.(1) Temporal continuity: the human operator usually applies continuous monitoring on the area of interest, which results in short, but temporal continuous aerial video clips.(2) Spatial repeatability: in many surveillance task, especially long time persistent surveillance, same spatial area may be repeatedly visited at various time from one or multiple aerial platforms.We term this method hybrid stitching and integrate both sequential grouping and cross-group retrieval in a graph-based framework to solve it (as shown in Figure 2).We use a undirected graph G = (V, E) to represent the pairwise image relations of the unordered image set.The graph vertices Vdenote the aerial image, and graph edges E represent the relations of two aerial images.If an image pair < I i , I j > is verified as overlapping images, there exists an edge between vertices V i and V j .For example, a graph is visualized in Figure 2c, in which graph vertices V represent aerial images, and graph edges E are shown as a black edge between overlapping images.
A similar graph-based framework [23] can be found in the field of remote sensing, but with quite different applications.In this paper, the key objective of our hybrid stitching is to explore edges in the graph completely and efficiently for stitching.

Sequential Grouping
In this section, we will introduce the method to generate initial small groups or sequential subgraphs.As mentioned above, we do not expect to use sequence information to handle all challenges of the VIRAT aerial video, because many kinds of disturbances, such as sudden viewpoint changes and large motion blur, can cause failure of sequential stitching.Moreover, sequential methods cannot deal with repeated observation of the same scene at different times.
However, we observe that sequential information often exists in aerial video sequence.Even if not providing a complete global picture of the entire surveillance scene, small sequential fragments are very valuable for further cross-group retrieval.
Standard optical flow tracker [13,14] or wide baseline matching methods [15,16] can be used to get frame-to-frame alignment between continuous image pairs.In this work, we select scale-invariant feature transform (SIFT) [15] features for the following two reasons: First, compared against other local invariant features, SIFT is proven to have the most robust features under geometric and illumination changes.Second, the extracted SIFT features are also used for further feature indexing and retrieval.
Given a set of VIRAT images I = {I 1 , I 2 , ..., I n } (as shown in Figure 2a), we firstly extract the SIFT features F = {f 1 , f 2 , ..., f m } from the entire image set.Then, for each continuous image pair, we apply SIFT matching and outlier removal [24] to check if they are overlapped.After that, new edges are added into G = (V, E) between overlapped image vertices.Finally, the original isolate notes are classified as many small sequential subgraphs {G s 1 , G s 2 , ..., G s w } (As shown in Figure 2(b) with the same color).With the sequential stitching, we can quickly set up the relationships between successive overlapping images.More importantly, we can transfer the problem from complicated image-to-image matching into group-to-group retrieval, which means that even if only one connected edge between two sequential groups is found, the big panorama will be generated completely without breaks.Thus, we can significantly improve the probability of finding all overlapping images, meanwhile reducing the most time-consuming part of image stitching, which is feature matching between candidate image pairs.

Cross-Group Retrieval
In this section, we propose a novel cross-group retrieval method to find an optimal edge between candidate groups quickly.Our method contains two steps: (1) feature indexing; and (2) greedy searching-based optimal edge selection.

• Feature Indexing
In the first step, the hierarchical K-means tree (HKT) is adopted to find the overlapping image pairs from the unordered image set.HKT has proven to be powerful in image recognition, image classification and retrieval [25], demonstrating that the searching hierarchical tree can speed the matching of high-dimensional vectors by up to several orders of magnitude compared to a linear search.
Given a set of SIFT features F = {f 1 , f 2 , ..., f m } from the entire image set, HKT is constructed by splitting the SIFT feature points at each level into k distinct clusters using a k-means clustering.We apply the same method recursively to the points in each cluster.The recursion will be stopped when the number of feature points in a cluster is smaller than k.Finally, we save the index correspondences between the feature points and images in a look-up table for efficient retrieval.In all experiments, the k used for the hierarchical k-means tree is 32; the maximum number of iterations to use in the k-means clustering stage when building the k-means tree is 100; and we pick the initial cluster centers randomly when performing k-means clustering.

• Greedy Searching-Based Optimal Edge Selection
To find the relationships between two groups, we define a n × n accumulator matrix A, which is initialized by zeros.For each query feature point f in image I i , we search for the HKT to find k closest points in the high-dimensional feature space, then we check the look-up table to find the index set = { 1 , 2 , ..., t } of corresponding images and increase the accumulator matrix A i,c by one, where c ∈ .After querying all feature points, image pairs < I i , I j > with sufficient high matching times as A i,j ≥ 1 n n c=1 A i,c will be labeled as candidate edge (as shown in Figure 2c, red dotted lines).Given two sequential subgraphs G s α and G s β , we use the following greedy searching to find the optimal edge Θ efficiently.For all vertices V i ∈ G s α and V j ∈ G s β , we firstly select the candidate image pairs Θ 0 with maximal retrieval score A i,j by Equation (1), and then, we use feature matching and random sample consensus (RANSAC) [24] to verify the edge between V i and V j .
If sufficient inliers have been found, we keep this edge and remove all other retrieval edges between sequential subgraphs G s α and G s β .Otherwise, we select and check the second optimal edge from the rest of the candidate edges between the two subgraphs.This process is repeated until all existing candidate edges between two groups have been checked.Actually, in our experiments, we find that image pairs connected by the first selected optimal edge usually contains sufficient correct feature correspondences.As a result, our approach only needs to apply one feature matching to connect two sequential groups in most cases.This characteristic is essential for reducing the heavy computational cost of feature matching on redundant edges.
Here, we make a brief discussion about the relationship and advantage of cross-group retrieval compared to traditional single image retrieval.
For a given retrieval sampling with one image, we use S w to represent the probability of finding correct overlapping images from the entire dataset.To extract panorama with enough confidence from a large-scale aerial dataset, the probability value should be as high as possible.However, due to the challenges of realistic aerial video with similar background or low texture regions, the probability S w of a traditional single image retrieval is usually low.
Given S w , the probability of retrieval failure for one image is 1 − S w .For a giving image group G s with λ sample images, the probability of all retrieval results being wrong is (1 − S w ) λ .Finally, the probability that at least one sample image of group G s can find correct overlapping images is: Thus, through integrating the sequential grouping and crossing group retrieval in one framework, we can greatly increase the probability to find panoramas completely by Equation (2).

Graph-Based Global Panorama Rendering
After finding the cross-group edges, we have generated a undirected graph G = (V, E) to represent the pairwise image relations of the dataset.With this undirected graph, it is convenient to identify image relations that have not been explicitly established by the pairwise image matching.
Since the number of panoramas is unknown in the original image set, we have to extract all of the connected subgraphs in the graph firstly and then compute the image transforms inside each component, respectively.
A connected component of a undirected graph is a subgraph in which any two vertices are connected to each other by paths.In this paper, we use the depth-first search [26] to compute the connected components of a graph in linear time.A search begins at a particular vertex V j , and each new vertex reached is marked.When no more vertices can be reached along edges from marked vertices, a connected component has been found.An unmarked vertex is then selected, and the process is repeated until the entire graph is explored.For each connected component, we need to find a homography H r,j between reference image vertices V r and other vertices V j , j = 1, ..., l.In this work, we pick the image vertices with the maximal number of connected edges as the reference vertices V r .Although the H r,j may be calculated by chaining together the homography on any path between vertices V r and V j , to reduce the accumulation error of long chains, we find a shortest path from V j to V r with the Dijkstra algorithm (as shown in Figure 2d, black solid lines).
Finally, images of the same group are warped together with the homography model, which is estimated by the shortest path in the previous graph.This strategy is very quick, but may retain seams in the scene with large depth variation, such as buildings or trees.In these cases, we can employ graph cuts [27] to minimize the total squared difference between two images, and the chosen seam is through those pixels where the image colors are similar.The pseudocode of the proposed hybrid stitching approach is shown in Algorithm 1.
Algorithm 1 Pseudocode for diverse scene stitching.

8:
Match candidate image pairs with SIFT features.

9:
Remove outliers with RANSAC. 10: Estimate homography with correct inliers. 11: If (u, v) is a connected edge with enough inliers, remove all other edges between G s i and G s j , and compare the next group pairs.12: Otherwise, remove (u, v), and repeat Step 7 until all existing edges between G s i and G s j have been checked.13 for each image vertices V j of group g do 17: Find the shortest path between image vertices V j and reference image vertices V r . 18: Warp corresponding image I j by homography on the shortest path.

19:
Seam cutting and stitching between downsampled warped image and previous panorama.

Experiments
In this section, we will present a comparative performance evaluation of the proposed method.

• Dataset
We use VIRAT benchmark aerial video dataset [17] with huge diversity in the scene.The VIRAT dataset contains 24 videos, which are selected from 25-h original videos recorded at various times with a CCD and IR camera on an aerial platform.Each selected video is 5 min long with an image size of 640 × 480.The VIRAT dataset is highly realistic, natural and challenging for aerial video stitching (as shown in Figure 1b,c).Given the input aerial images, we label images from the same scene and take the total number of image groups at the ground truth for evaluation (as shown in Table 1, third column).For instance, if the input aerial images are captured from two scenes, then the ground truth will be set as the scene Number 2 in this case.

• Implementation Details
We implement our algorithm with C++.The experiments in this paper are all performed on a laptop computer with Intel i5 1.6 GHz CPU and 12 G RAM.For sequential grouping, we apply a distinctive SIFT [15] descriptor to get the feature correspondence between images.For cross-group retrieval, we build the hierarchical K-means tree [28] for feature indexing.The only input parameter of our method is the total number of correct feature correspondences, which determines whether a valid homography transformation can be found between two images or not.This parameter is set as 20 in all of our experiments.

• Evaluation Metrics
We use the standard ground truth (GT), true positive (TP), false positive (FP) and processing time to evaluate the robustness and efficiency of different algorithms.
Ground truth scene numbers are provided by our manually labeling results.For each output panorama, we define two criteria to identify whether it is a true positive or false positive.One is the label consistency, and another is scene integrity.Label consistency means all images in one panorama must have the same scene label.Scene integrity describes the matching ratio between the output panorama and the corresponding scene, which can be calculated by the image number of output panorama divided by the total image number of the corresponding scene.Only the panorama whose image label is consistent and the scene integrity is higher than 90% can be identified a true positive panorama.Otherwise, it is a false positive.Finally, the processing time is the total time from loading images to stitching all panorama results in our experiments.

• Quantitative Comparison Results
First, we compare our method to the sequential stitching and retrieval stitching.We choose them as the baseline results, since the sequential stitching model is widely used for aerial video stitching, and retrieval stitching shares the similar goal of finding overlapping images efficiently without brute-force matching.For a fair comparison, we use standard SIFT [15] for sequential stitching and combine SIFT [15] and bag-of-words (BOW) [6,29] for retrieval stitching.In this experiment, we extract 3 frames every second from VIRAT video.Thus, there are 932 images in each video.Figure 3a shows an example of sequential stitching.The sequential stitching methods are based on the assumption that the input aerial images is continuous in time, and they usually require the input data to be taken on a stable platform without sudden changes between consecutive frames.However, in the realistic VIRAT aerial dataset, challenges, such as motion blur, sudden scene or view point changes, are common in the aerial images, and the continuity assumption of the sequential stitching method hence is not valid in this case.As a result, although SIFT is more robust for wide baseline image matching, it still cannot handle those image changes between consecutive frames.The panorama of sequential stitching is split into 4 fragments (as shown in Figure 3a).In contrast, our approach recovers the panorama completely (as shown in Figure 3b).Figure 4 shows an example of retrieval stitching and our approach.The retrieval stitching method is based on feature indexing, and it requires sufficient distinctive image features.If the scene lacks enough unique features, the variance of different images is not enough for the method to reliably find correct matches.In Figure 4a, in the low texture and self-similar grassland scene, retrieval stitching outputs 12 small scene fragments from 3 scenes.Through exploring the sequential groups firstly, our method significantly reduces the risk of retrieval failure by cross-group retrieval and completely generates three panoramas (as shown in Figure 4b).Further, we see that retrieval stitching achieves similar TP and FP as hybrid stitching only in two videos (Table 1, VIRAT 02 and 10).In all of the other 22 videos, hybrid stitching outperforms retrieval stitching in all metrics.In addition, the processing time of hybrid stitching is only half of retrieval stitching, and this advantage becomes more apparent as the increase of continuity inside the video (Table 1, VIRAT 10 and 17).
Figure 5 compares the statistic performance of 24 VIRAT videos.Our hybrid stitching achieves the highest TP and lowest FP, and its processing speed is even comparable with the fastest sequential stitching approach.On average, our method only costs 333.4 s to find 96% panoramas with 2 false positives.The significant improvements over retrieval stitching show the utility of combing temporal sequential grouping and cross-group retrieval in one hybrid stitching framework.
The previous experiments show two sample results between sequential stitching, retrieval stitching and our approach.Next, we compare the three methods on all 24 VIRAT videos.
Table 1 shows the quantitative comparison results on 24 VIRAT videos.The first column is the video indexing; the second column shows the image number of each VIRAT video.The manually-labeled ground truth (GT) is shown in the third column.The TP, FP and processing time of sequential stitching, retrieval stitching and our hybrid stitching are shown in the other columns.
Although the processing speed of sequential stitching is the the fastest, it is easily broken by diverse changes of aerial videos.As a result, we observe that both retrieval stitching and hybrid stitching significantly outperform sequential stitching in every video with higher TP and lower FP.

• Qualitative Comparison Results
Second, we compare our system with state-of-the-art stitching systems, including PTGui [21], AutoStitch [19] and the very recent scheme by Autopano [22].In order to compare the performance on the entire VIRAT dataset, we extract one image of every two seconds from the 24 VIRAT videos, and the total number of images is 2312.
Figure 6 shows the results of the state-of-the-art systems.Due to the lack of the ability for scene recognition, PTGui cannot automatically find overlapping image groups from the input video.It costs 2 h, 18 min and 52 s to generate one false panorama with enormous artifacts (as shown in Figure 6).Although AutoStitch [19] can recognize a scene from a small-scale image dataset, it cannot generate a panorama even after 10 h for the large-scale dataset of this experiment.The most recent Autopano [22] is the only system that can generate multiple panoramas from the large-scale input data.However, it costs over 10 h to create 38 panoramas, in which 18 panoramas contain significant artifacts (samples are shown in Figure 6, Autopano 1-5), and only 20 panoramas have reasonable visual effects (samples are shown in Figure 6, Autopano 6-8).In contrast, our approach costs only 15 min and 33 s to output panoramas of 48 scenes (as shown in Figure 7), which is much faster than the above state-of-the-art systems.Specifically, our approach firstly spends 136.112 s for feature detection and description and 160.049 s for feature indexing.Then, it uses 389.115 s for sequential grouping and cross-group retrieval.Finally, it costs 0.625 s for finding optimal paths in the entire graph, and 247.238 s for panorama stitching.As can be seen in Figure 7, the quality of our panoramas is much better than the state-of-the-art systems with less artifacts.
We also show a sample stitching result in Figure 8, which illustrates the dynamic stitching process of the overall site of VIRAT dataset (as shown in Figure 7, white bounding box).Our approach successfully extracts 22 revisits of this scene and generates a dynamic panorama with images captured at various times and viewpoints.We believe this reorganized scene panorama is particularly helpful for long-term surveillance and high level scene understanding.Experiments demonstrate the superiority of the proposed method.

Conclusions and Future Works
In this paper, we have proposed a powerful hybrid model for diverse scene stitching.The proposed method is based on the main findings that a large-scale aerial dataset has a close relationship to the human operators' monitoring behavior.For the same area of interest, human operators often apply continuous monitoring and repeated monitoring, which yield short, successive video clips that overlap in time or space.
Inspired by the temporal continuity and spatial repeatability of aerial surveillance video, our model integrates the sequential grouping and cross-group retrieval into a graph-based framework.We experiment with our method on the large-scale VIRAT aerial dataset [17], which is much more challenging than many other aerial datasets, due to its heterogeneity in image quality and diversity of the scene.To the best of our knowledge, this is the first stitching work on VIRAT.Experimental results show that our method can explicitly explore multiple panoramas from this challenging dataset.Moreover, our approach achieves a few orders of magnitude accelerations over the state-of-the-art stitching systems.
One limitation of the proposed method is that we use the homography model for vertical image stitching.Currently, most of the aerial images in our experiments are vertical images.However, For oblique aerial images, such as images taken parallel to the surface of the Earth, artifacts may appear with the homography model.For instance, for a three-dimensional building in an urban environment, the oblique images taken from different points of view are quite different.As a result, in order to generate a good panorama, highly accurate interior and exterior camera parameter extraction and 3D scene reconstruction may be required, which is more complicated compared with homography-based vertical image stitching.We would like to extend our work to diverse scene stitching with oblique aerial images in the future.
The current work is a pure image-based method for exploring diverse scenes from a large-scale aerial dataset.This image-based design makes our system very convenient to be used in many remote sensing tasks, even without any other sensor information.Considering that many aerial images may contain GPS/INS information, it is possible to embed the GPS information to further improve the robustness and efficiency of the diverse scene stitching.For instance, the GPS/INS information may be used to filter image pairs without overlapping area before feature matching or it can be used to split the entire aerial images into several groups with a similar GPS location.We would also like to consider integrating the GPS/INS information in our future work.

Figure 1 .
Figure 1.Examples of the large-scale VIRATaerial dataset [17], which includes images selected from 25-h realistic aerial videos.The dataset is published in [17] and available from www.viratdata.org.(a) Diverse scenes of VIRAT aerial dataset.(b) Sample image shots with clouds, motion blur, low contrast and camera noise.(c) Sample image shots with varying scales and viewpoints over time

Figure 2 .
Figure 2. Framework of our method, which mainly contains three parts: sequential grouping, cross-group retrieval and global stitching.(a) Input large-scale aerial images; (b) example results of sequential grouping; the same color represents the same group; (c) example results of cross-group retrieval; black lines denote the sequential edges, and red dotted lines represent the candidate edges generated by retrieval; (d) example results of global stitching graph; black lines denote the final edges after optimization; (e) example results of diverse scene stitching.

Figure 3 .
Figure 3. Example of sequential stitching vs. hybrid stitching.(a) Sequential stitching results over time with 4 panorama patches from 1 scene.(b) Our hybrid stitching result with a complete scene panorama

Figure 4 .Figure 5 .
Figure 4. Example of retrieval stitching vs. hybrid stitching.(a) Retrieval stitching results with 12 panorama patches from 3 scenes.(b) Our hybrid stitching results with complete panorama of 3 scenes.

Figure 7 .
Figure 7.Our diverse scene stitching results from the VIRAT dataset with 2312 images.(Top) Examples of input VIRAT images.(Bottom) 48 panoramas of diverse scene by our approach after only 15 min and 33 s.The dynamic stitching results of panorama with the white bounding box are shown in Figure 8.

FrameFigure 8 .
Figure 8. Example of dynamic stitching process of a surveillance scene with 22 revisits from 2312 VIRAT images; the yellow dotted line shows the first image from a new revisit.

Table 1 .
Quantitative comparison of Sequential Stitching, Retrieval Stitching and Hybrid Stitching on 24 VIRAR videos.