# Diverse Scene Stitching from a Large-Scale Aerial Video Dataset


## Abstract


## 1. Introduction

^{2}) complexity. Consequently, the computational cost of these approaches is generally very high and becomes a bottleneck in real-time surveillance.

## 2. Hybrid Stitching Model

#### 2.1. Problem Formulation

If an image pair $\langle I_i, I_j \rangle$ is verified as overlapping, there exists an edge between vertices $V_i$ and $V_j$. For example, a graph is visualized in Figure 2c, in which the graph vertices V represent aerial images and the graph edges E are shown as black lines between overlapping images.

#### 2.2. Sequential Grouping

Given the input images $\{I_1, I_2, \dots, I_n\}$ (as shown in Figure 2a), we first extract the SIFT features $F = \{f_1, f_2, \dots, f_m\}$ from the entire image set. Then, for each pair of consecutive images, we apply SIFT matching and outlier removal [24] to check whether they overlap. After that, new edges are added to G = (V, E) between the vertices of overlapping images. Finally, the originally isolated nodes are grouped into many small sequential subgraphs $\{G_1^s, G_2^s, \dots, G_w^s\}$ (shown in the same color in Figure 2b).
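The grouping step can be illustrated with a minimal sketch. It assumes the overlap check (SIFT matching plus outlier removal) is available as a black-box predicate; the function name `sequential_groups`, the `overlaps` callback and the toy scene cuts are hypothetical and not taken from the paper.

```python
from typing import Callable, List


def sequential_groups(n_images: int,
                      overlaps: Callable[[int, int], bool]) -> List[List[int]]:
    """Split frame indices 0..n-1 into runs of consecutive overlapping frames.

    `overlaps(a, b)` stands in for SIFT matching + outlier removal
    between frames a and b (hypothetical callback).
    """
    if n_images == 0:
        return []
    groups = [[0]]
    for i in range(1, n_images):
        if overlaps(i - 1, i):
            groups[-1].append(i)   # edge (i-1, i) exists: stay in the same group
        else:
            groups.append([i])     # no overlap: start a new sequential group
    return groups


# Toy overlap test: consecutive frames overlap unless a scene cut starts at b.
cuts = {3, 5}  # hypothetical positions where the camera jumps to a new scene
groups = sequential_groups(7, lambda a, b: b not in cuts)
print(groups)  # [[0, 1, 2], [3, 4], [5, 6]]
```

Because only consecutive pairs are tested, this step needs n − 1 overlap checks for n images.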

#### 2.3. Cross-Group Retrieval

**Feature Indexing**

Given the SIFT features $F = \{f_1, f_2, \dots, f_m\}$ from the entire image set, the hierarchical k-means tree (HKT) is constructed by splitting the SIFT feature points at each level into k distinct clusters using k-means clustering. We apply the same method recursively to the points in each cluster. The recursion stops when the number of feature points in a cluster is smaller than k. Finally, we save the index correspondences between the feature points and images in a look-up table for efficient retrieval. In all experiments, the branching factor k of the hierarchical k-means tree is 32, the maximum number of k-means iterations when building the tree is 100, and the initial cluster centers are picked randomly.
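The recursive construction can be sketched in plain Python as follows. This is an illustrative toy, not the implementation used in the experiments: the function names, the dictionary-based tree layout, the guard against degenerate splits and the 2-D toy data are all assumptions, and a small k is used instead of the paper's k = 32.

```python
import random
from typing import List, Sequence

Point = Sequence[float]


def _dist2(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Point], ids: List[int], k: int, iters: int,
           rng: random.Random):
    """Lloyd's k-means on points[ids] with randomly picked initial centers."""
    centers = [list(points[i]) for i in rng.sample(ids, k)]
    clusters: List[List[int]] = [[] for _ in range(k)]
    for _ in range(iters):
        new_clusters: List[List[int]] = [[] for _ in range(k)]
        for i in ids:
            nearest = min(range(k), key=lambda c: _dist2(points[i], centers[c]))
            new_clusters[nearest].append(i)
        if new_clusters == clusters:      # assignments stable: converged
            break
        clusters = new_clusters
        for c, members in enumerate(clusters):
            if members:                   # an empty cluster keeps its old center
                dim = len(centers[c])
                centers[c] = [sum(points[i][d] for i in members) / len(members)
                              for d in range(dim)]
    return centers, clusters


def build_hkt(points: List[Point], ids: List[int], k: int, iters: int,
              rng: random.Random) -> dict:
    """Recursively split ids into k clusters; stop below k points per node."""
    if len(ids) < k:
        return {"leaf": True, "ids": ids}
    centers, clusters = kmeans(points, ids, k, iters, rng)
    if max(len(c) for c in clusters) == len(ids):
        # Assumed guard: duplicated points cannot be split any further.
        return {"leaf": True, "ids": ids}
    children = [(centers[c], build_hkt(points, clusters[c], k, iters, rng))
                for c in range(k) if clusters[c]]
    return {"leaf": False, "children": children}


def leaf_ids(node: dict) -> List[int]:
    """Collect the feature indices stored in all leaves below node."""
    if node["leaf"]:
        return list(node["ids"])
    out: List[int] = []
    for _, child in node["children"]:
        out.extend(leaf_ids(child))
    return out


rng = random.Random(0)
features = [(rng.random(), rng.random()) for _ in range(40)]  # toy 2-D "descriptors"
feature_to_image = [i // 10 for i in range(40)]  # look-up table: feature -> image index
tree = build_hkt(features, list(range(40)), k=3, iters=100, rng=rng)
```

Each feature index ends up in exactly one leaf, and `feature_to_image` plays the role of the look-up table that maps a retrieved feature back to its source image.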

**Greedy Searching-Based Optimal Edge Selection**

For each feature point of image $I_i$, we search the HKT to find the k closest points in the high-dimensional feature space; then, we check the look-up table to find the index set $\ell = \{\ell_1, \ell_2, \dots, \ell_t\}$ of the corresponding images and increase the accumulator matrix entry $A_{i,c}$ by one for each $c \in \ell$. After querying all feature points, image pairs $\langle I_i, I_j \rangle$ with a sufficiently high number of matches, i.e., $A_{i,j} \ge \frac{1}{n}\sum_{c=1}^{n} A_{i,c}$, are labeled as candidate edges (red dotted lines in Figure 2c).
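The voting and thresholding rule can be sketched as follows. The nearest-neighbor search itself is abstracted away: `neighbor_images[i]` is assumed to already hold, for each query feature of image i, the set of image indices returned by the look-up table. Ignoring self-matches and rows with no votes are additional assumptions of this sketch.

```python
from typing import List, Set, Tuple


def vote(n_images: int, neighbor_images: List[List[Set[int]]]) -> List[List[int]]:
    """Fill the accumulator A, where A[i][c] counts retrieval hits of image i in image c."""
    A = [[0] * n_images for _ in range(n_images)]
    for i, per_feature in enumerate(neighbor_images):
        for image_set in per_feature:      # one index set per query feature
            for c in image_set:
                if c != i:                 # assumption: self-matches are ignored
                    A[i][c] += 1
    return A


def candidate_edges(A: List[List[int]]) -> List[Tuple[int, int]]:
    """Keep pairs (i, j) with A[i][j] >= (1/n) * sum_c A[i][c]."""
    n = len(A)
    edges = []
    for i in range(n):
        threshold = sum(A[i]) / n
        for j in range(n):
            # assumption: pairs without any vote are never candidates
            if j != i and A[i][j] > 0 and A[i][j] >= threshold:
                edges.append((i, j))
    return edges


# Toy retrieval results for 4 images (hypothetical data).
neighbor_images = [
    [{1}, {1, 2}, {1}, {3}],   # image 0: four query features
    [{0}, {0}, {2}],           # image 1
    [],                        # image 2: no features retrieved anything
    [{0}],                     # image 3
]
A = vote(4, neighbor_images)
edges = candidate_edges(A)
print(A[0], edges)
```

For image 0 the row is [0, 3, 1, 1], so the threshold is 5/4 and only the pair (0, 1) survives as a candidate edge.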

We select the optimal edge with the maximal retrieval score $A_{i,j}$ by Equation (1), and then we use feature matching and random sample consensus (RANSAC) [24] to verify the edge between $V_i$ and $V_j$.

We use $S_w$ to represent the probability of finding correct overlapping images from the entire dataset. To extract panoramas with enough confidence from a large-scale aerial dataset, this probability should be as high as possible. However, because realistic aerial video contains similar backgrounds and low-texture regions, the probability $S_w$ of a traditional single-image retrieval is usually low.

Given $S_w$, the probability of retrieval failure for one image is $1 - S_w$. For a given image group $G^s$ with λ sample images, the probability that all retrieval results are wrong is $(1 - S_w)^{\lambda}$. Finally, the probability that at least one sample image of group $G^s$ can find correct overlapping images is:

$$1 - (1 - S_w)^{\lambda}$$
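A short numeric check shows how quickly group-level retrieval improves on single-image retrieval. The complement of "all λ retrievals fail" is 1 − (1 − S_w)^λ; the value S_w = 0.3 used here is a hypothetical example, not a measurement from the paper.

```python
def group_success_probability(s_w: float, lam: int) -> float:
    """Probability that at least one of lam sample images retrieves a correct overlap."""
    return 1.0 - (1.0 - s_w) ** lam


s_w = 0.3  # hypothetical single-image retrieval success probability
for lam in (1, 5, 10):
    print(lam, round(group_success_probability(s_w, lam), 5))
```

With these assumed numbers, five sample images already raise the success probability from 0.3 to about 0.83.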

#### 2.4. Graph-Based Global Panorama Rendering

The search starts from a vertex $V_j$, and each new vertex reached is marked. When no more vertices can be reached along edges from marked vertices, a connected component has been found. An unmarked vertex is then selected, and the process is repeated until the entire graph is explored. For each connected component, we need to find a homography $H_{r,j}$ between the reference image vertex $V_r$ and every other vertex $V_j$, j = 1, …, l. In this work, we pick the image vertex with the maximal number of connected edges as the reference vertex $V_r$. Although $H_{r,j}$ may be calculated by chaining together the homographies on any path between vertices $V_r$ and $V_j$, to reduce the accumulated error of long chains, we find the shortest path from $V_j$ to $V_r$ with the Dijkstra algorithm (black solid lines in Figure 2d).
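The graph operations described here (connected components by depth-first search, reference vertex by maximal degree, shortest path by Dijkstra) can be sketched as below. The adjacency-dictionary representation, the function names and the unit edge weights are assumptions of this sketch; the pairwise homographies that would be chained along each path are not modeled.

```python
import heapq
from typing import Callable, Dict, List, Optional, Set

Graph = Dict[int, Set[int]]


def connected_components(graph: Graph) -> List[List[int]]:
    """Iterative depth-first search that marks every reached vertex."""
    marked: Set[int] = set()
    components = []
    for start in graph:
        if start in marked:
            continue
        stack, comp = [start], []
        marked.add(start)
        while stack:
            v = stack.pop()
            comp.append(v)
            for u in graph[v]:
                if u not in marked:
                    marked.add(u)
                    stack.append(u)
        components.append(sorted(comp))
    return components


def reference_vertex(graph: Graph, component: List[int]) -> int:
    """Vertex with the maximal number of incident edges."""
    return max(component, key=lambda v: len(graph[v]))


def dijkstra_parents(graph: Graph, ref: int,
                     weight: Callable[[int, int], float] = lambda u, v: 1.0
                     ) -> Dict[int, Optional[int]]:
    """Parent pointers of the shortest-path tree rooted at ref."""
    dist = {ref: 0.0}
    parent: Dict[int, Optional[int]] = {ref: None}
    heap = [(0.0, ref)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist[v]:
            continue                       # stale heap entry
        for u in graph[v]:
            nd = d + weight(v, u)
            if nd < dist.get(u, float("inf")):
                dist[u], parent[u] = nd, v
                heapq.heappush(heap, (nd, u))
    return parent


def path_to_reference(parent: Dict[int, Optional[int]], v: int) -> List[int]:
    """Vertices visited when walking from v back to the reference vertex."""
    path = [v]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


# Toy graph with two scenes (hypothetical edges).
graph: Graph = {0: {1}, 1: {0, 2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3},
                5: {6}, 6: {5}}
components = connected_components(graph)
ref = reference_vertex(graph, components[0])
parent = dijkstra_parents(graph, ref)
print(components, ref, path_to_reference(parent, 4))
```

In the toy graph, image 4 reaches the reference image 1 through image 3, so its warp would chain two pairwise homographies instead of following the longer route through image 2.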

Input: The large-scale aerial video dataset.

Algorithm:

1: Build an undirected graph G = (V, E).

2: Extract SIFT features of all input images.

3: Generate sequential groups $\{G_1^s, \dots, G_w^s\}$ by matching consecutive images.

4: Build an HKT with all SIFT features.

5: Retrieve candidate edges for each image vertex.

6: for each sequential group pair $\{G_i^s, G_j^s\}_{i \ne j}$ do

7: Select the optimal edge (u, v) by Equation (1).

8: Match the candidate image pair with SIFT features.

9: Remove outliers with RANSAC.

10: Estimate the homography with the correct inliers.

11: If (u, v) is a connected edge with enough inliers, remove all other edges between $G_i^s$ and $G_j^s$ and continue with the next group pair.

12: Otherwise, remove (u, v), and repeat Step 7 until all existing edges between $G_i^s$ and $G_j^s$ have been checked.

13: end for

14: Extract all connected subgraphs $\{G_1^c, \dots, G_h^c\}$ by depth-first search in the global graph G.

15: for each group g ∈ $\{G_1^c, \dots, G_h^c\}$ do

16: for each image vertex $V_j$ of group g do

17: Find the shortest path between image vertex $V_j$ and the reference image vertex $V_r$.

18: Warp the corresponding image $I_j$ by the homographies on the shortest path.

19: Perform seam cutting and stitching between the downsampled warped image and the previous panorama.

20: end for

21: end for

22: Output the complete panorama image set $\{P_1, P_2, \dots, P_\tau\}$.
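Steps 6 to 13 amount to a greedy loop: for each pair of groups, candidate edges are tried in decreasing order of retrieval score until one passes geometric verification. The sketch below illustrates that control flow only; the `verify` callback stands in for SIFT matching, RANSAC and the inlier-count test, and the data layout and names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Edge = Tuple[int, int]        # (image u, image v)
GroupPair = Tuple[int, int]   # (group i, group j)


def select_cross_group_edges(
        candidates: Dict[GroupPair, List[Tuple[float, Edge]]],
        verify: Callable[[Edge], bool]) -> Dict[GroupPair, Edge]:
    """Keep at most one verified edge per group pair, trying best scores first."""
    kept: Dict[GroupPair, Edge] = {}
    for pair, scored in candidates.items():
        for _score, edge in sorted(scored, key=lambda s: s[0], reverse=True):
            if verify(edge):       # enough inliers: keep it, drop the rest
                kept[pair] = edge
                break
            # otherwise the edge is discarded and the next best one is tried
    return kept


# Toy candidates: (retrieval score, edge) per group pair (hypothetical data).
candidates = {
    (0, 1): [(5.0, (2, 9)), (8.0, (3, 7)), (1.0, (1, 8))],
    (0, 2): [(4.0, (0, 12))],
}
truly_overlapping = {(2, 9), (1, 8)}
checked: List[Edge] = []


def verify(edge: Edge) -> bool:
    checked.append(edge)           # record how many expensive checks were run
    return edge in truly_overlapping


kept = select_cross_group_edges(candidates, verify)
print(kept, checked)
```

In this toy run only three of the four candidate edges are verified, because the search for groups 0 and 1 stops as soon as the edge (2, 9) passes.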

## 3. Experiments

**Dataset**

**Implementation Details**

**Evaluation Metrics**

**Quantitative Comparison Results**

**Qualitative Comparison Results**

## 4. Conclusions and Future Works

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References

1. Kumar, R.; Sawhney, H.; Samarasekera, S.; Hsu, S.; Tao, H.; Guo, Y.L.; Hanna, K.; Pope, A.; Wildes, A.R.; Hirvonen, D.; Hansen, M.; Burt, P. Aerial video surveillance and exploitation. Proc. IEEE **2001**, 89, 1518–1539.
2. Brown, M.; Lowe, D.G. Recognizing panoramas. Proceedings of the IEEE International Conference on Computer Vision, Nice, France, 13–16 October 2003; pp. 1218–1225.
3. Agarwala, A.; Agrawala, M.; Cohen, M.; Salesin, D.; Szeliski, R. Photographing long scenes with multi-viewpoint panoramas. ACM Trans. Graph. **2006**, 25, 853–861.
4. Brown, M.; Lowe, D.G. Automatic panoramic image stitching using invariant features. Int. J. Comput. Vis. **2007**, 74, 59–73.
5. Indelman, V.; Gurfil, P.; Rivlin, E.; Rotstein, H. Real-time mosaic-aided aerial navigation: I. Motion estimation. Proceedings of the AIAA Guidance, Navigation and Control Conference, Chicago, IL, USA, 10–13 August 2009; pp. 1–23.
6. Botterill, T.; Mills, S.; Green, R. Real-time aerial image mosaicing. Proceedings of the 2010 25th International Conference of Image and Vision Computing New Zealand, IEEE, Queenstown, New Zealand, 8–9 November 2010; pp. 1–8.
7. Zaragoza, J.; Chin, T.J.; Brown, M.S.; Suter, D. As-projective-as-possible image stitching with moving DLT. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2339–2346.
8. Li, J.; Yang, T.; Yu, J.Y.; Lu, Z.Y.; Lu, P.; Jia, X.; Chen, W.J. Fast aerial video stitching. Int. J. Adv. Robot. Syst. **2014**, 11.
9. Molina, E.; Zhu, Z.G. Persistent aerial video registration and fast multi-view mosaicing. IEEE Trans. Image Process. **2014**, 23, 2184–2192.
10. Zhang, F.; Liu, F. Parallax-tolerant image stitching. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3262–3269.
11. Szeliski, R. Image alignment and stitching: A tutorial. Found. Trends Comput. Graph. Vis. **2006**, 2, 1–104.
12. Kekec, T.; Yildirim, A.; Unel, M. A new approach to real-time mosaicing of aerial images. Robot. Auton. Syst. **2014**, 62, 1755–1767.
13. Lucas, B.D.; Kanade, T. An iterative image registration technique with an application to stereo vision. Proceedings of the International Joint Conference on Artificial Intelligence, University of British Columbia, Vancouver, BC, Canada, 24–28 August 1981; pp. 674–679.
14. Baker, S.; Matthews, I. Lucas-Kanade 20 years on: A unifying framework. Int. J. Comput. Vis. **2004**, 56, 221–255.
15. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. **2004**, 60, 91–110.
16. Bay, H.; Ess, A.; Tuytelaars, T.; Gool, L.V. Speeded-Up Robust Features (SURF). Comput. Vis. Image Underst. **2008**, 110, 346–359.
17. Oh, S.; Hoogs, A.; Perera, A.; Cuntoor, N.; Chen, C.C.; Lee, J.T.; Mukherjee, S.; Aggarwal, J.K.; Lee, H.; Davis, L.; et al. A large-scale benchmark dataset for event recognition in surveillance video. Proceedings of the Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 20–25 June 2011; pp. 3153–3160.
18. Schaffalitzky, F.; Zisserman, A. Multi-view matching for unordered image sets. Proceedings of the European Conference on Computer Vision, Copenhagen, Denmark, 27 May–2 June 2002; Springer: Berlin, Germany, 2002; pp. 414–431.
19. Brown, M. AutoStitch. Available online: http://www.cs.bath.ac.uk/brown/autostitch/autostitch.html (accessed on 25 May 2015).
20. Sibiryakov, A.; Bober, M. Graph-based multiple panorama extraction from unordered image sets. Proc. SPIE **2007**.
21. New House Internet Services. PTGui Software. Available online: http://www.ptgui.com (accessed on 25 May 2015).
22. Kolor Company. Kolor autopano. Available online: http://www.kolor.com (accessed on 25 May 2015).
23. Kang, X.C.; Lin, X.G. Graph-based divide and conquer method for parallelizing spatial operations on vector data. Remote Sens. **2014**, 6, 10107–10130.
24. Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM **1981**, 24, 381–395.
25. Muja, M.; Lowe, D.G. Fast approximate nearest neighbors with automatic algorithm configuration. Proceedings of the International Conference on Computer Vision Theory and Applications, Lisboa, Portugal, 5–8 February 2009; pp. 331–340.
26. Hopcroft, J.; Tarjan, R. Efficient algorithms for graph manipulation. Commun. ACM **1973**, 16, 372–378.
27. Kwatra, V.; Schödl, A.; Essa, I.; Turk, G.; Bobick, A. Graphcut textures: Image and video synthesis using graph cuts. ACM Trans. Graph. **2003**, 22, 277–286.
28. Muja, M.; Lowe, D.G. Scalable nearest neighbor algorithms for high dimensional data. IEEE Trans. Pattern Anal. Mach. Intell. **2014**, 36, 2227–2240.
29. Nister, D.; Stewenius, H. Scalable recognition with a vocabulary tree. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 17–22 June 2006; Volume 2, pp. 2161–2168.

**Figure 1.** Examples of the large-scale VIRAT aerial dataset [17], which includes images selected from 25-h realistic aerial videos. The dataset is published in [17] and available from www.viratdata.org. (**a**) Diverse scenes of the VIRAT aerial dataset. (**b**) Sample image shots with clouds, motion blur, low contrast and camera noise. (**c**) Sample image shots with varying scales and viewpoints over time.

**Figure 2.** Framework of our method, which mainly contains three parts: sequential grouping, cross-group retrieval and global stitching. (**a**) Input large-scale aerial images; (**b**) example results of sequential grouping; the same color represents the same group; (**c**) example results of cross-group retrieval; black lines denote the sequential edges, and red dotted lines represent the candidate edges generated by retrieval; (**d**) example results of the global stitching graph; black lines denote the final edges after optimization; (**e**) example results of diverse scene stitching.

**Figure 3.** Example of sequential stitching vs. hybrid stitching. (**a**) Sequential stitching results over time with 4 panorama patches from 1 scene. (**b**) Our hybrid stitching result with a complete scene panorama.

**Figure 4.** Example of retrieval stitching vs. hybrid stitching. (**a**) Retrieval stitching results with 12 panorama patches from 3 scenes. (**b**) Our hybrid stitching results with complete panoramas of 3 scenes.

**Figure 7.** Our diverse scene stitching results from the VIRAT dataset with 2312 images. (Top) Examples of input VIRAT images. (Bottom) The 48 panoramas of diverse scenes produced by our approach in only 15 min and 33 s. The dynamic stitching results of the panorama marked with the white bounding box are shown in Figure 8.

**Figure 8.** Example of the dynamic stitching process of a surveillance scene with 22 revisits from 2312 VIRAT images; the yellow dotted line marks the first image of each new revisit.

**Table 1.** Quantitative comparison of Sequential Stitching, Retrieval Stitching and Hybrid Stitching on 24 VIRAT videos.

Dataset | Number of Images | GT | Sequential TP | Sequential FP | Sequential Total Time (s) | Retrieval TP | Retrieval FP | Retrieval Total Time (s) | Hybrid TP | Hybrid FP | Hybrid Total Time (s)
---|---|---|---|---|---|---|---|---|---|---|---
VIRAT#01 | 932 | 17 | 6 | 17 | 247.6 | 12 | 4 | 780.1 | 17 | 0 | 365.9
VIRAT#02 | 932 | 15 | 8 | 21 | 264.4 | 14 | 1 | 946.2 | 14 | 1 | 425.8
VIRAT#03 | 932 | 20 | 7 | 25 | 269.4 | 11 | 9 | 668 | 19 | 1 | 416.3
VIRAT#04 | 932 | 25 | 9 | 25 | 196.3 | 16 | 2 | 613.8 | 25 | 0 | 323.4
VIRAT#05 | 932 | 20 | 10 | 22 | 249.4 | 16 | 4 | 743.3 | 19 | 2 | 371.8
VIRAT#06 | 932 | 23 | 16 | 14 | 208.9 | 21 | 1 | 546.4 | 22 | 1 | 337.5
VIRAT#07 | 932 | 19 | 6 | 27 | 250.3 | 14 | 7 | 585.4 | 16 | 6 | 370
VIRAT#08 | 932 | 19 | 9 | 19 | 241.2 | 17 | 3 | 623.6 | 18 | 2 | 312.7
VIRAT#09 | 932 | 18 | 9 | 19 | 209.3 | 12 | 4 | 590.7 | 18 | 3 | 320.4
VIRAT#10 | 932 | 2 | 1 | 10 | 224.2 | 2 | 0 | 722.7 | 2 | 0 | 302.0
VIRAT#11 | 932 | 13 | 12 | 5 | 222.4 | 10 | 3 | 774.0 | 13 | 1 | 336.0
VIRAT#12 | 932 | 22 | 8 | 29 | 253.2 | 14 | 13 | 701.7 | 21 | 2 | 440.8
VIRAT#13 | 932 | 10 | 3 | 25 | 238.5 | 7 | 9 | 668 | 9 | 3 | 253.2
VIRAT#14 | 932 | 12 | 6 | 14 | 240.0 | 10 | 3 | 624.9 | 12 | 1 | 306.3
VIRAT#15 | 932 | 14 | 6 | 17 | 234.8 | 10 | 7 | 540.3 | 14 | 1 | 311.2
VIRAT#16 | 932 | 19 | 7 | 16 | 216.6 | 14 | 1 | 485.3 | 18 | 2 | 342.5
VIRAT#17 | 932 | 9 | 6 | 2 | 196.7 | 7 | 3 | 461 | 9 | 0 | 228.5
VIRAT#18 | 932 | 14 | 10 | 16 | 223.4 | 9 | 10 | 526.4 | 13 | 5 | 281.6
VIRAT#19 | 932 | 11 | 8 | 14 | 176.6 | 9 | 6 | 382.7 | 10 | 3 | 211.4
VIRAT#20 | 932 | 18 | 13 | 10 | 198.7 | 13 | 7 | 490.7 | 18 | 3 | 349.5
VIRAT#21 | 932 | 12 | 8 | 12 | 228.4 | 9 | 4 | 460 | 10 | 2 | 392.7
VIRAT#22 | 932 | 9 | 5 | 14 | 231.0 | 4 | 11 | 481.2 | 9 | 4 | 297.6
VIRAT#23 | 932 | 13 | 8 | 12 | 192.2 | 11 | 3 | 547.9 | 13 | 0 | 316.8
VIRAT#24 | 932 | 16 | 6 | 27 | 227.9 | 7 | 16 | 550.9 | 16 | 5 | 389.1

© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Yang, T.; Li, J.; Yu, J.; Wang, S.; Zhang, Y. Diverse Scene Stitching from a Large-Scale Aerial Video Dataset. *Remote Sens.* **2015**, *7*, 6932-6949.
https://doi.org/10.3390/rs70606932
