Integration of Multi-Camera Video Moving Objects and GIS

This work discusses the integration of multi-camera video moving objects (MCVO) and GIS. This integration was motivated by the characteristics of multi-camera videos distributed in the urban environment, namely, large data volume, sparse distribution, and complex spatial–temporal correlation of MCVO, which result in low efficiency of manual browsing and retrieval of videos. To address these drawbacks, on the basis of multi-camera video moving object extraction, this paper first analyzed the characteristics of different video-GIS information fusion methods and investigated the integrated data organization of MCVO by constructing a spatial–temporal pipeline among different cameras. Then, the conceptual integration model of MCVO and GIS was proposed on the basis of spatial mapping, and the GIS-MCVO prototype system was constructed. Finally, this study analyzed the applications and potential benefits of the GIS-MCVO system, including a GIS-based user interface for video moving object expression in the virtual geographic scene, video compression storage, blind zone trajectory deduction, retrieval of MCVO, and video synopsis. Examples show that the integration of MCVO and GIS can improve the efficiency of expressing video information, compress video data, and help the user rapidly browse video objects from multiple cameras.


Introduction
At present, billions of surveillance video cameras have been deployed worldwide. The images acquired from these cameras are widely used in security, transportation, environmental detection, and other fields to monitor real-time changes in geographic scenes 24 hours per day. In the actual monitoring process, displaying multiple video images in a grid interface cannot effectively express the spatial relationship among different camera images in the urban environment (Figure 1). Furthermore, monitoring tasks, such as spatial-temporal behavior analysis, video scene simulation, and regional condition, cannot be effectively completed by relying on image data alone. To solve these problems, monitoring images should be introduced into geographic information systems (GIS) to construct a video-GIS (V-GIS) surveillance system. V-GIS is a geographical environment perception and analysis platform that integrates a traditional video analysis system within GIS. Under a unified geographic reference, a geospatial data service supports intelligent analysis of surveillance images and realizes image-scene integrated modeling [1], video data spatialization [2], video data management analysis [3], virtual reality (VR) fusion expression [4], and other related functions. Unlike other GIS data, videos have large data volume, sparse distribution of video moving objects (such as pedestrians and vehicles), and complex spatial-temporal correlation of video moving objects from different cameras. These characteristics result in low efficiency of manual browsing and video retrieval and an insufficient ability to analyze the geospatial correlation of videos. As a result, traditional GIS data research methods cannot effectively analyze video data, and a specific research method for V-GIS must be determined.
Currently, studies on V-GIS involve a series of investigations on the specificity of video data, including geographic video semantic expression [5], structured processing of geo-video [6], and integration of video and GIS [1]. However, most existing research focuses on the integration of video images and GIS and ignores the integration of video moving objects and GIS. Video moving objects are the main focus of users, whereas traditional manual retrieval and analysis of moving objects from massive surveillance video requires considerable computational resources and time and generates many misdetections and misjudgments. In recent years, new algorithms, such as mask region-convolutional neural network (Mask R-CNN) [7] and "you only look once" (YOLOv3) [8], have been presented. These algorithms improve the efficiency and accuracy of video moving object detection and rapid recognition and make the information fusion between video moving objects and GIS accurate and feasible. A new fusion type for the integration of video moving objects and GIS should be determined to enhance the effectiveness of retrieval and analysis in V-GIS.
In this work, we considered the integration of multi-camera video moving objects (MCVO) and GIS. This integration was essentially an augmented VR technique applied in GIS [9]. The integration of single camera video moving objects and GIS has been discussed before [10]. Practically, V-GIS is oriented toward multi-camera video information processing in a surveillance network, in which a single video moving object may appear in multiple camera fields of view, and its trajectory and semantic behavior have complex spatial-temporal associations related to multiple camera shots [11]. Therefore, a system that can effectively organize the video moving objects associated with multiple cameras should be proposed and constructed to perform comprehensive geospatial processing and analysis.
This study attempted to construct an integrated MCVO and GIS system. This integration could achieve not only the video moving object information extraction and spatially-correlated visualization but also the MCVO's spatial-temporal associated analysis, which effectively assists users in understanding multi-camera videos. The main contributions of this paper are presented as follows: 1. The conceptual fusion model between MCVO and GIS is proposed by comparing the characteristics of different V-GIS fusion methods on the basis of the specificity of MCVO organization. 2. The GIS-MCVO prototype system is proposed by describing the architecture and function design of the system. 3. The unique functions and benefits of GIS-MCVO system, including GIS-based user interface on video moving object expression, video compression storage, blind zone trajectory deduction, retrieval of multiple camera video objects, and video synopsis, which cannot be easily achieved in the traditional V-GIS integrated system, are analyzed.
This study only investigated the video data obtained by camera with a fixed position and attitude. The organizational structure of this work is presented as follows. Chapter 2 presents an overview of the related work. Chapter 3 discusses the fusion modes and the organization of MCVO integrated with 3D GIS. Chapter 4 describes the architecture of the GIS-MCVO prototype system and analyzes its main features and key technologies. Chapter 5 lists the related applications and the potential advantages of the GIS-MCVO system. Chapter 6 summarizes and concludes the study.

Related Work
An MCVO and GIS integrated monitoring system was developed on the basis of video moving object detection and V-GIS integrated data organization technology to express video information in the virtual geographic environment. This section describes the related research on intelligent surveillance videos, video moving object extraction, V-GIS integrated data organization, and V-GIS fusion expression.
Although the surveillance video technology has been partially used in the last century [12,13], video surveillance systems that consist of closed-circuit television and analog signals have poor range monitoring capabilities, information storage, and comprehensive analysis. Since the "9/11" incident in 2001, the demand for enhancing video surveillance has greatly increased along with the amount of surveillance image data, thereby causing great difficulties on the manual retrieval and analysis of video information. Computer vision technology has been applied to large-scale video surveillance systems, and video image analysis technology is becoming more automated and intelligent to effectively process massive image information [14].
Video object detection, tracking, and cross-camera recognition are required to execute MCVO extraction. Object detection is conducted to determine the object's location area from the image. The current video object detection methods are mainly based on deep learning and are divided into two categories: the two-stage model based on region proposal and the end-to-end one-stage model. The two-stage model is developed from region-convolutional neural network (R-CNN) and mainly includes fast R-CNN [15], faster R-CNN [16], and mask R-CNN [7], whereas the one-stage model mainly includes the YOLO [8,17] series of methods. Furthermore, object tracking is conducted to predict the size and location of the video object in subsequent video streams after a tracking object has been selected from a given image. Currently, the most widely used tracking methods are filtering [18] and deep learning-based methods [19]. Cross-camera recognition is necessary to determine the spatial-temporal correspondence of video objects from multiple cameras. The implementation methods include the metric-based learning recognition of image features [20] and the local feature-based recognition [21] developed from prior studies. Meanwhile, video sequence [22] and GAN-based recognition methods [23] are developed to introduce learning contents to overcome the insufficiency of the training samples.
In the area of V-GIS integrated data organization, early researchers constructed prototype systems such as multimedia GIS [24], geo-video [25], and video GIS [26] by describing the correspondence between video frames and geographic locations. These studies implemented geographic video image retrieval. In recent years, researchers have focused on the fusion of a projected video image and geographic scene. The geographic video-scene data fusion organization method based on camera spatialization model is formed by constructing image-geographic spatial mapping [2]. Typical camera spatialization models include quadrilateral models applied to 2D planes [27], quadrilateral pyramid models in 3D scenes [28], and grid camera-based coverage analysis models [29]. Researchers developed a series of methods for organizing V-GIS data fusion by geolocation and annotation of video data using the aforementioned models. In some of these methods (e.g., view-based R-tree [3] and camera-based topology indexing [30]), video data organization is analyzed by examining the camera field of view. The other methods used moving object texture association [31], spatial-temporal behavior association [32], and semantic association [33].
A suitable mapping method must be selected to project the video on to the virtual scene model to integrate videos with geospatial information [34,35]. Katkere A. [36] first proposed the concept of fusion expression between video and GIS by using different mapping methods for various semantic-type regions, such as moving objects and video scenes, and building an immersing system based on multi-camera video data. According to different mapping methods, the information fusion methods of surveillance video and virtual scene are divided into two categories: GIS-video image fusion (image projection) [37] and GIS-video moving object fusion (object projection) [38]. The implementation forms of GIS-video image fusion, including video image linked search analysis [4] and videos that are projected to the geographic scene [33], are easy to implement but lack the ability to analyze and understand video image contents. The object projection method extracts video semantic objects from the original video through object detection. Specific implementation methods are divided into three types: (1) foreground and background independent projection [28,39]; (2) foreground projection [31,40]; and (3) foreground abstraction [41]. The foreground and background independent projections project the foreground moving object subgraphs to their corresponding spatial-temporal position in the virtual scene. The foreground projection shows the foreground moving object's subgraph in the corresponding spatial-temporal position in the scene while omitting the projection of the video background image. Foreground abstraction projects the moving object's avatar to the corresponding spatial-temporal position in the geographic scene and simplifies the representation of the video moving object.

Fusion between Multiple Camera Objects and GIS
In this section, we first introduce the extraction and data organization for MCVO and then analyze the characteristics of different V-GIS fusion methods and determine the implementation form of the integration of MCVO and GIS.

Extraction and Data Organization of Multi-Camera Video Moving Objects
Video moving object extraction includes three steps: video moving object detection, tracking, and cross-camera recognition. Figure 2 shows their process relationships. After completing the processes, the original video images are converted into the data of each video moving object:

$$O = \{(P_j, I_j)\}, \quad j = 1, 2, \ldots, n,$$

where $O$ represents the data of a video moving object; $n$ is the number of the video frames in which the object occurred; and $P_j$ and $I_j$ represent the location and sub-graph image of the current object in the $j$-th frame, respectively. This study adopts the spatial-temporal pipeline (STP) as the basic unit to uniformly describe the MCVO information.
Let $M$ be the total set of all the moving objects from multiple cameras, and let $C_k$ represent the set of STPs in the video range of the $k$-th camera. A total of $N_k$ video moving objects exist in $C_k$, and $O_{k,i}$ is the STP of each moving object. $M$ and $O_{k,i}$ can be expressed as:

$$M = \{C_k\}, \quad C_k = \{O_{k,i}\} \quad (i = 1, 2, \ldots, N_k),$$

$$O_{k,i} = \{P_{k,i,j}, I_{k,i,j}\} \quad (j = 1, 2, \ldots, n),$$

where $P_{k,i,j}$ and $I_{k,i,j}$ represent the geolocation and sub-graph of the $i$-th video moving object in the $j$-th video frame of the $k$-th camera.

Because a single video moving object may appear in many different cameras, this study proposes a data-based MCVO ontology so that the organization of the video moving object data can effectively reflect the association among cross-camera video moving objects. This organization keeps a unified record of a cross-camera video moving object that is the same entity in different camera videos, which requires merging the original STPs of that object. Let $N_o$ denote the total number of video moving objects after merging the cross-camera video moving objects with the same entity, and let $N_c$ denote the number of cameras. The expression of the STP of each video moving object is:

$$M = \{G_i\} \quad (i = 1, 2, \ldots, N_o), \quad G_i = \{O_{1,i}, O_{2,i}, \ldots, O_{N_c,i}\},$$

where $M$ is the set of all video moving objects in multiple cameras, $G_i$ indicates the global STP of the video moving object with serial number $i$ in the surveillance video network, and $O_{k,i}$ denotes the STP subsequence of video moving object number $i$ in the $k$-th camera.
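The merged organization described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names `FrameSample` and `LocalSTP`, the `entity_id` label (assumed to come from cross-camera recognition), and the sub-graph file names are all hypothetical and are not taken from the prototype system.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class FrameSample:
    """One frame of a moving object: geolocation plus a key to its sub-graph image."""
    frame: int
    x: float
    y: float
    subgraph: str  # hypothetical key or path of the cropped sub-graph


@dataclass
class LocalSTP:
    """STP of one object inside one camera's video."""
    camera_id: int
    entity_id: int  # hypothetical label assigned by cross-camera recognition
    samples: list = field(default_factory=list)


def merge_stps(local_stps):
    """Group per-camera STPs by entity so that each entity has one global STP."""
    merged = defaultdict(list)
    for stp in local_stps:
        merged[stp.entity_id].append(stp)
    return dict(merged)


# Entity 7 is seen by cameras 1 and 2; entity 9 only by camera 2.
local = [
    LocalSTP(1, 7, [FrameSample(10, 0.0, 0.0, "c1_o7_f10.png")]),
    LocalSTP(2, 7, [FrameSample(95, 30.0, 4.0, "c2_o7_f95.png")]),
    LocalSTP(2, 9, [FrameSample(40, 12.0, 8.0, "c2_o9_f40.png")]),
]
global_stps = merge_stps(local)
```

Each key of `global_stps` then corresponds to one physical entity, and its value lists the per-camera STP subsequences of that entity.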

Fusion Between Video Moving Object and GIS
From the perspective of spatial positioning, the video's field of view was georeferenced to determine the geospatial range of the video image (Figure 3a). The video moving object had to be located along its geospatial trajectory to determine the position of the object's subgraph in each video frame (Figure 3b). Both fusions were carried out by determining pairs of corresponding points (points with the same name) between the image and geospatial coordinates, solving the camera parameters, and constructing the video's geospatial mapping equation. The equations for geospatial video mapping were established using the homography matrix method. Figure 4 shows the relationship between the geospatial and the image space coordinate systems. The center of the camera is denoted by $C$; the image space coordinate system is denoted by $O_i X_i Y_i$; and the geospatial coordinate system is denoted by $O_g X_g Y_g Z_g$. Assuming that $q = (x, y, 1)^T$ and $Q = (X, Y, 1)^T$ are corresponding points in the image space and geographic coordinate systems, let the homography matrix be $H$ such that the relationship between $q$ and $Q$ is:

$$q = HQ,$$

where $H$ is represented as follows:

$$H = \begin{bmatrix} k_1 & k_2 & t_x \\ k_3 & k_4 & t_y \\ 0 & 0 & 1 \end{bmatrix}.$$

$H$ has six unknowns. Thus, at least three pairs of image and geospatial points should be determined to solve $H$. When $H$ is determined, the coordinates of any image point in the geographic space can be solved as:

$$\begin{bmatrix} X \\ Y \\ 1 \end{bmatrix} = H^{-1} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}.$$

The fusion of video moving object and GIS includes three implementations: foreground and background independent projection, foreground projection, and foreground abstraction. We qualitatively analyzed the information expression ability of the various video moving object-GIS fusion methods to select the appropriate fusion form of MCVO and GIS. Table 1 shows the results. Table 1 shows that the foreground and background independent projection simultaneously mapped the background and foreground information of the video, thereby making the video's foreground object less prominent.
Meanwhile, foreground abstraction completely abandoned the expression of the original video image data, and no image- and scene-associated expressions were observed. As a result, neither of the two implementations was suitable for the integration of MCVO and GIS. By contrast, foreground projection can express image and scene correlation and represent images spatially. The remaining discussion of this paper on the integration of video moving object and GIS was based on foreground projection because it was suitable for MCVO and geographic scene fusion expression. However, one disadvantage of foreground projection is that it can only support a certain range of viewpoints displayed in the virtual geographic scene, although this problem can be alleviated to some extent by optimizing the selection of the viewpoints.
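The six-unknown mapping matrix described above can be solved from exactly three point pairs, as the following minimal sketch shows. The function names and the control-point values are hypothetical. Note that a matrix with six unknowns and a fixed last row of (0, 0, 1) is an affine special case of a general homography, so this sketch assumes the affine form and a planar ground surface; with more than three pairs a least-squares fit would be the usual choice.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve3(A, b):
    """Solve the 3x3 linear system A u = b by Cramer's rule."""
    d = det3(A)
    if abs(d) < 1e-12:
        raise ValueError("control points are collinear")
    out = []
    for col in range(3):
        Ac = [row[:] for row in A]
        for r in range(3):
            Ac[r][col] = b[r]
        out.append(det3(Ac) / d)
    return out


def fit_mapping(geo_pts, img_pts):
    """Return (k1, k2, tx, k3, k4, ty) from exactly three (geo, image) point pairs."""
    A = [[X, Y, 1.0] for X, Y in geo_pts]
    k1, k2, tx = solve3(A, [x for x, _ in img_pts])
    k3, k4, ty = solve3(A, [y for _, y in img_pts])
    return k1, k2, tx, k3, k4, ty


def image_to_geo(params, x, y):
    """Map an image point (x, y) to geographic (X, Y) by inverting the mapping."""
    k1, k2, tx, k3, k4, ty = params
    det = k1 * k4 - k2 * k3
    dx, dy = x - tx, y - ty
    return ((k4 * dx - k2 * dy) / det, (-k3 * dx + k1 * dy) / det)


# Hypothetical control points: geographic (X, Y) and matching image (x, y).
geo = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
img = [(100.0, 200.0), (120.0, 200.0), (100.0, 180.0)]
params = fit_mapping(geo, img)
X, Y = image_to_geo(params, 110.0, 190.0)
```

Applying `image_to_geo` to the foot point of each detected object in each frame yields the geospatial positions from which the object's trajectory is built.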

Data Organization of Spatial-Temporal Trajectory
A single camera can only record the spatial-temporal trajectories of the moving objects within its own field of view. Because the fields of view of different cameras are not continuous, analyzing the trajectory of a video moving object in a multi-camera environment requires reorganizing the trajectories of the same object that occur in different cameras.
The processing steps are as follows. First, an automatic object detection algorithm from computer vision detects the video moving objects and extracts their positions and sub-pictures. Then, a tracking algorithm generates the spatial-temporal trajectories of the moving objects. Based on the mapping model described in Chapter 3.2, we can obtain the instantaneous position of the moving object in each frame. In each camera's field of view, all instantaneous spatial positions of a moving object are temporally aligned and combined into the local trajectory of this object in the current camera. Since the fields of view of different cameras are discontinuous in geographic space, a moving object re-recognition algorithm is needed to perform cross-camera recognition and obtain the global trajectory of each video moving object in a multi-camera environment (as shown in Figure 5).

To organize the association among the different levels of video moving object trajectory, let $M$ be the total set of all moving objects in the geographical scene. There are $N_k$ moving objects in the $k$-th camera's field of view, and the local trajectory of each moving object within the camera's field of view is $L_{k,i}$. $L_{k,i}$ is expressed as follows:

$$L_{k,i} = \{P_{k,i,j}\} \quad (j = 1, 2, \ldots, n),$$

where $P_{k,i,j}$ represents the geospatial position of the $i$-th moving object in the $j$-th video frame of the $k$-th camera. Since the same moving object may appear in different camera fields of view, the cross-camera association of moving objects is expressed as follows. Let $N_o$ represent the actual total number of moving objects in the geographic scene, let $N_c$ be the number of cameras, and let $T_i$ be the global trajectory of each moving object. $M$ and $T_i$ are expressed as follows:

$$M = \{T_i\} \quad (i = 1, 2, \ldots, N_o), \quad T_i = \{L_{1,i}, L_{2,i}, \ldots, L_{N_c,i}\},$$

where $T_i$ represents the global trajectory of the $i$-th moving object in the geographical scene, and the local trajectories of the moving object in cameras $1, 2, \ldots, N_c$ are $L_{1,i}, L_{2,i}, \ldots, L_{N_c,i}$. $M$ is still the total set of all moving objects in the geographical scene.

Architecture of GIS-MCVO Surveillance System
The video surveillance system was integrated with GIS, and the prototype of the GIS-MCVO system was designed and developed on the basis of data organization and MCVO spatialization. The system helps users rapidly reorganize and effectively understand multi-camera video content and its geospatial association in the urban environment. This is achieved by storing geospatial information, surveillance video, and video moving object information separately and then performing fusion display and comprehensive analysis.

Design Schematic of the System
The overall system design of GIS-MCVO followed a service-oriented software architecture. The framework of the system was divided, from bottom to top, into a function layer, data layer, service layer, business layer, and representation layer, as shown in Figure 6.

Figure 6. Design schematic of the system.
(1) Function layer: The function layer is a server with data processing and analysis functions. This layer is used for pre-processing GIS and video data and comprises functional modules for video data acquisition, video moving object extraction, video data geospatial mapping, and cross-camera object recognition. In addition, the function layer can provide basic data support for real-time publishing.
(2) Data layer: This layer is supported by the database and is mainly used to store, access, and manage geospatial, video image, and video moving object data and to provide data services to clients.
(3) Service layer: The service layer publishes the data services of the underlying system database, including video stream image, video moving object, and geospatial information data services. This layer provides real-time multisource data services to terminal users and remote command centers.
(4) Business layer: The business layer selects relevant data service content according to the demand of the system user. Through analysis, this layer fetches different services and generates and transmits the corresponding result to the representation layer.
(5) Representation layer: In the representation layer, the user can apply multiple modes of MCVO and GIS fusion, along with related application and analysis functions, by using a common browser on various operating system platforms.

Design of System Functions
This section describes the modules in the function layer and their functional support relationships (Figure 7).
(1) Moving object extraction module: This module uses detection and tracking algorithms to extract moving objects; separates the video's foreground and background; performs cross-camera recognition on objects from different cameras; and stores the trajectory, type, set of sub-graphs, and other associated information of the moving objects.

Applications and Potential Benefits for GIS-MCVO Surveillance System
In this chapter, we briefly describe the applications and potential benefits of the GIS-MCVO surveillance system and compare the results with the evaluation results of traditional and GIS-video image surveillance systems. The implementation of the GIS-MCVO system is shown in Figure 8.

GIS-Based User Interface
The user interface of a traditional video surveillance system includes a monitor that displays video from a selected camera, which can switch among different cameras or simultaneously display multiple channels of video in a grid interface. As the number of cameras increases, the visual interface architecture becomes less usable. When users need to retrieve and comprehensively analyze multi-camera video information with complex spatial relationships, the situation becomes unfavorable because the grid interface can express neither the spatial relationship of the cameras nor the video object information in different cameras. Effective identification requires three factors: familiarity of the users with each camera's shooting direction; knowledge of the spatial relationships among the cameras' fields of view; and the ability to manually and rapidly complete camera selection, switching operations, and moving object tracking tasks. These operations require long-term training and experience, and even if the user can effectively master them, mistakes are still inevitable. The defects of the grid interface of multi-camera videos include: (1) the lack of ability to express the spatial relationship among camera views; and (2) the lack of ability to extract video moving object information and perform cross-camera identification. The video image and GIS integrated visualization system can effectively solve defect (1) but does not solve defect (2). Because it lacks extraction and analysis processes for video moving objects, the user still needs to manually search and analyze moving objects in long-duration, large-volume video and to manually determine the spatial relationship of the video objects from different cameras. Therefore, the GIS and video image integrated visualization system still imposes considerable manual processing pressure when dealing with actual multi-camera video monitoring tasks such as object retrieval and behavior analysis (Figure 9a).
To reduce the user's information retrieval pressure on surveillance video and improve the expression of video moving objects together with the 3D geographic scene model, the GIS-MCVO system was constructed on the basis of video object trajectory and sub-graph data. In this system, the geospatial visualization of video information is achieved by dynamically expressing the video object subgraph in a geographic scene. The advantages of this visual interface are as follows: (1) only the video moving object's sub-graph is expressed, thereby greatly reducing the amount of video information that the user needs to retrieve and watch; (2) the video object sub-graph is displayed in a 3D geographic scene using a planar map, thereby avoiding the video image texture distortion-alignment problem in 3D scene model fitting, reducing the amount of image rendering calculation, and improving the efficiency of video information expression; and (3) because the trajectory of the same entity video object is associated with multiple cameras in space and time, the GIS-MCVO system can express the MCVO trajectory synchronously and in association by adding the view polygon, identifying the same entity object, and optimizing a virtual viewpoint (Figure 9b). In summary, the proposed GIS-MCVO system can demonstrate the spatialization of video image and video object trajectory. In addition, the system can not only effectively express the spatial relationship among the cameras' fields of view but also provide a visual representation of the video information based on the video object trajectory and sub-graph data, thereby resolving the defects of the grid interface of multi-camera videos. The GIS-MCVO system eliminates the manual processing of multi-camera surveillance video information.

Video Compression Storage
The GIS-MCVO system stores only the sub-graph and spatial-temporal trajectory information of the video moving object, not the background information. Therefore, storing and analyzing video moving objects also compresses the video data; this is known as VR fusion video compression [42]. This method converts video data from the image level to the object level. The obtained compressed data constitute a series of correlated compression levels in different combinations (Figure 10). From the perspective of the compression mechanism, video image compression is achieved by constructing predictive models in accordance with the H.264 standard, which predicts video image pixels via intra- or inter-frame prediction. Video image compression aims to reconstruct the original video from the compressed data. Thus, the capability of recovering the original video images should be considered. By contrast, VR fusion video compression represents video information in simplified forms (e.g., showing only the sub-graphs or avatars of the moving object). As a result, its ability to recover the original video image does not need to be considered. In terms of the data compression effect, the video data used in the object projection fusion patterns have data compression relations with the original video sequence images. Furthermore, data compression relations exist among the three patterns of object projection fusion.
First layer of compression: In this layer, the compressed data were oriented to the foreground and background independent projection patterns. The sub-graphs of the moving objects, spatial-temporal position, and background images were extracted and stored separately. This compression layer converted video information from the image level to the object level.
Second layer of compression: In this layer, the compressed data were oriented to the foreground projection pattern, and the virtual scene model was used instead of the video background. This compression layer transferred the background that represented the camera view from the image to the virtual scene model.
Third layer of compression: In this layer, the compressed data were oriented to the abstract of the foreground projection pattern. The virtual avatar in a semantic symbol was used instead of the sub-graphs to display video moving objects in a virtual geographic scene, and the spatial-temporal position was the only information that needed to be stored.
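The relation among the three layers can be illustrated with a rough storage estimate. This is only a sketch under simplifying assumptions that are not from the paper: frames and sub-graphs are counted at fixed hypothetical byte sizes, a single background image is stored in the first layer, and each spatial-temporal position takes 24 bytes (`POS_BYTES`). It does not reproduce the experiment in Figure 11.

```python
POS_BYTES = 24  # assumed: frame index, x, and y stored as three 8-byte values


def layer_sizes(n_frames, frame_bytes, objects):
    """Estimate stored bytes for the raw frames and the three compression layers.

    objects: list of (n_samples, subgraph_bytes_per_sample), one entry per moving object.
    """
    raw = n_frames * frame_bytes
    positions = sum(n * POS_BYTES for n, _ in objects)
    subgraphs = sum(n * size for n, size in objects)
    return {
        "raw": raw,
        # Layer 1: sub-graphs + positions + one background image (assumption)
        "layer1": subgraphs + positions + frame_bytes,
        # Layer 2: the virtual scene replaces the background image
        "layer2": subgraphs + positions,
        # Layer 3: avatars replace sub-graphs, so only positions remain
        "layer3": positions,
    }


# Hypothetical clip: 1000 frames of 1 MB each, two objects.
sizes = layer_sizes(1000, 1_000_000, [(200, 5000), (100, 4000)])
```

Under these assumptions each layer stores strictly less than the one before it, which mirrors the nesting of the three fusion patterns.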
To test the compression efficiency of the data for storage, we examined a set of video images and recorded the trend of the compression rate with respect to the number of input video frames for the different layers. The experimental results are presented as follows: In Figure 11, the magnitudes of compression rate in the first layers as , was in the order of

Trajectory Deduction in Visual Blind Zone
In the GIS-MCVO system, the trajectory of an arbitrary video object that appears in the fields of view of several cameras can be determined through cross-camera video object recognition. However, the recoverable trajectory is limited by factors such as the size of each camera's field of view, the number of cameras, and the size of the monitoring area. Several visual blind zones exist in which visual information cannot be captured by any camera, but the object's general motion trends and paths in these zones can be determined on the basis of camera spatial relationships, the road network, and the 3D geographic scene model (Figure 12). Through trajectory deduction in visual blind zones, the global trajectory in multi-camera surveillance areas can be estimated and supplemented, which supports specific functions in GIS-MCVO such as video object retrieval, video synopsis, and video object behavior understanding.
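One simple way to deduce a path through a blind zone is to assume that the object follows the shortest route on the road network between the point where it leaves one camera and the point where it re-enters another. The sketch below uses Dijkstra's algorithm for this. The shortest-path assumption, the graph layout, and the node names are illustrative choices and are not stated in the paper.

```python
import heapq


def shortest_path(graph, start, goal):
    """Dijkstra over a road graph given as {node: {neighbor: length_in_meters}}."""
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == goal:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, length in graph.get(node, {}).items():
            if nxt not in visited:
                heapq.heappush(queue, (dist + length, nxt, path + [nxt]))
    return None  # the two points are not connected


# Hypothetical road network: "exit_A" is where the object leaves camera A,
# "entry_B" is where it reappears in camera B.
roads = {
    "exit_A": {"n1": 40.0, "n2": 70.0},
    "n1": {"exit_A": 40.0, "entry_B": 50.0},
    "n2": {"exit_A": 70.0, "entry_B": 10.0},
    "entry_B": {"n1": 50.0, "n2": 10.0},
}
length, route = shortest_path(roads, "exit_A", "entry_B")
```

Dividing `length` by the time gap between the two sightings gives an implied speed, which can serve as a plausibility check on the deduced route.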

Retrieval of MCVO
Even when a user detects a large amount of video moving objects, they will always be interested in several video moving objects with specific conditions. Thus, video moving object retrieval is needed. The search conditions are divided into two types: appearance and trajectory descriptions. In this section, we discuss the problems of the trajectory description of MCVO retrieval.
The input modes of trajectory description-based video object retrieval include moving object instance, graphical description, and semantic description. The trajectory graphical description is divided into an area of interest (AOI)-based type and a trajectory template-based type. The topological relationship between the search object and the AOI or template trajectory is used to determine whether the search condition is met. Traditional research on trajectory description is limited to generating the AOI or trajectory template in the image space [43,44]. Consequently, comprehensive geospatial retrieval of video object trajectories is impossible in a multi-camera environment; the search work has to be divided and refined, and the retrieval analysis processed individually in each camera to obtain the search results. In recent years, some studies have attempted to construct an AOI [45] and trajectory template [46] in geographic space by combining multi-camera fields of view to spatialize the description of trajectory retrieval conditions. Although these methods realize the geospatial analysis of the retrieval conditions, the video content of each camera must still be retrieved separately due to the lack of spatial-temporal correlation analysis of video object trajectories from different cameras. To solve these problems, the proposed GIS-MCVO system provides integrated representation and analysis of geospatial video moving objects. The video object retrieval function in GIS-MCVO can analyze the video object trajectory retrieval condition globally, thereby overcoming the limitation to individual camera views. Video object sub-graphs can be integrated in the 3D scene model to express the retrieval results (Figure 13). The camera-by-camera search and the global search can be realized by sketching the trajectory template in the virtual geographic scene of the GIS-MCVO system.
The camera-by-camera search analyzes the spatial relationship between the search condition and each camera's field of view, matches the search condition with the trajectories of video moving objects in that field of view, and returns the results of the query. The global search directly matches the search condition with the cross-camera global trajectories of video moving objects and returns the results of the query. The search results are shown in Figure 14.
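The difference between the two search modes can be illustrated with a minimal sketch. It assumes that trajectories are already mapped into a shared geographic coordinate system, that the search condition is "the trajectory visits every sketched AOI", and that a global trajectory is simply the time-ordered concatenation of an object's per-camera segments. All names (`point_in_polygon`, `camera_by_camera_search`, `global_search`, the sample data) are hypothetical and are not taken from the GIS-MCVO system; the field-of-view clipping step is omitted for brevity.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]            # (x, y) in a shared geographic frame
Segment = Tuple[str, List[Point]]      # (camera_id, trajectory points in that camera)


def point_in_polygon(p: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: True if p lies strictly inside the polygon."""
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from p cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def visits_all(trajectory: List[Point], aois: List[List[Point]]) -> bool:
    """Assumed search condition: the trajectory has a point inside every AOI."""
    return all(any(point_in_polygon(p, aoi) for p in trajectory) for aoi in aois)


def camera_by_camera_search(objects: Dict[str, List[Segment]],
                            aois: List[List[Point]]) -> List[Tuple[str, str]]:
    """Match the condition against each per-camera segment separately."""
    return [(obj_id, cam_id)
            for obj_id, segments in objects.items()
            for cam_id, points in segments
            if visits_all(points, aois)]


def global_search(objects: Dict[str, List[Segment]],
                  aois: List[List[Point]]) -> List[str]:
    """Match the condition against the concatenated cross-camera trajectory."""
    hits = []
    for obj_id, segments in objects.items():
        merged = [p for _, points in segments for p in points]
        if visits_all(merged, aois):
            hits.append(obj_id)
    return hits


# Hypothetical data: two square AOIs, each covered by a different camera.
aoi_a = [(0, 0), (2, 0), (2, 2), (0, 2)]
aoi_b = [(10, 0), (12, 0), (12, 2), (10, 2)]
objects = {
    "car1": [("cam1", [(-1, 1), (1, 1), (3, 1)]),
             ("cam2", [(9, 1), (11, 1), (13, 1)])],
    "car2": [("cam1", [(-1, 1), (1, 1), (3, 1)])],
}

per_camera_hits = camera_by_camera_search(objects, [aoi_a, aoi_b])
global_hits = global_search(objects, [aoi_a, aoi_b])
```

In this toy setup no single camera segment visits both AOIs, so the per-camera search returns nothing, whereas the global search finds `car1` because its merged trajectory crosses both areas.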

Synopsis of Multiple Videos
The video moving objects are sparsely distributed in time in the original video. If they are watched in their original time order, the screen may show no moving objects for long periods, which hinders the efficient expression of the video and wastes the user's working time. In response to this problem, several studies have attempted to increase the temporal display density of video moving objects by reducing the video playback duration; this method is referred to as video synopsis [47]. Video synopsis changes the playback sequence of video moving objects according to their spatial-temporal relationships and expresses a large number of video moving objects in a short period of time, even though these objects appear in different time segments of the original video (Figure 15). Conventionally, video synopsis is generated on an image platform by creating a short video. When this generation mode is applied to multi-camera video synopsis, it produces problems in two aspects: first, in image generation, the video moving object sub-graphs overlap and the video background image must be updated continuously; second, in spatial information expression, the spatial-temporal associations among video objects from different cameras cannot be expressed and the virtual view cannot be optimally selected. At present, existing research focuses on fusing video information from different cameras using image matching [48], camera angle optimization [49], and video image cropping fusion. These studies consider only the correlation of pixel content among video images and fail to solve the spatial information expression problems, such as MCVO correlation expression and optimized selection of the virtual view.
To overcome the problems above, the proposed GIS-MCVO system was designed on the basis of video moving object extraction and cross-camera correlation to flexibly select the video moving objects and cameras, thereby effectively expressing the spatial-temporal information of a large number of MCVO in the geographic scene and achieving multi-camera video synopsis. GIS-MCVO-based multi-camera video synopsis extends concentration from the object level, at which the playback time of each video moving object is adjusted, to the camera level, which supports integrated operations including camera lens switching and cross-camera resetting of video moving object playback time (Figure 16). The proposed GIS-MCVO system expresses the video objects in a 3D geographic scene model. This mode not only avoids image optimization problems, such as overlapping video objects and continuous updating of the video background, but also optimizes the selection of object and camera display sequences while effectively expressing the spatial-temporal association of video objects among different cameras (Figure 17). To analyze the duration reduction achieved by video synopsis, the following function was used to calculate the compression ratio of video frames of multiple cameras.
The function depends on four quantities: the number of video moving objects; the starting interval, in frames, between each two adjacent video moving objects; the largest number of frames among all the video moving objects; and the total number of original video frames from all the cameras. The compression ratio of video frames under different values of these parameters is shown in Figure 18. As can be seen in Figure 18, video synopsis based on GIS-MCVO reduces the video playback time; moreover, the compression ratio obtained after cross-camera recognition of video objects is lower than that obtained without it, which indicates that cross-camera recognition improves the efficiency of expressing MCVO.
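As a rough illustration of how these quantities could combine, the sketch below assumes that the synopsis length is approximated by (n - 1) × interval + max(frames), where n is the number of objects, and that the compression ratio is this length divided by the total number of original frames. This form is an assumption consistent with the four quantities described above and is not necessarily the authors' exact function; the function names and sample numbers are hypothetical.

```python
from typing import List


def synopsis_length(frame_counts: List[int], interval: int) -> int:
    """Assumed synopsis length in frames.

    Objects are started `interval` frames apart, so the last of n objects
    starts at (n - 1) * interval; the longest object bounds the tail.
    """
    if not frame_counts:
        raise ValueError("at least one video moving object is required")
    return (len(frame_counts) - 1) * interval + max(frame_counts)


def compression_ratio(frame_counts: List[int], interval: int,
                      total_original_frames: int) -> float:
    """Assumed ratio: synopsis frames / original frames from all cameras."""
    if total_original_frames <= 0:
        raise ValueError("total_original_frames must be positive")
    return synopsis_length(frame_counts, interval) / total_original_frames


# Hypothetical example: four objects, 50-frame start interval,
# four cameras with 9000 original frames each.
frames_per_object = [120, 300, 90, 200]
ratio = compression_ratio(frames_per_object, interval=50,
                          total_original_frames=4 * 9000)
```

Under this assumed form, merging detections of the same entity across cameras reduces the number of objects n, which shortens the synopsis and lowers the ratio whenever the longest object does not grow by more than the saved interval.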
In summary, GIS-MCVO-based video synopsis can assist users in rapidly retrieving the spatial-temporal trajectory and image information of many video moving objects in a multi-camera environment, along with efficiently understanding the spatial-temporal behavior of MCVO.

Conclusion
This work mainly discussed the integration of MCVO and GIS. The integration was implemented on the basis of V-GIS fusion [9], V-GIS practical applications [1,3], and our preliminary work on the integration of single-camera video objects and GIS [10]. The advantage of this integration is that it achieves not only the extraction of key video information and the visualization of spatial correlation but also the spatially associated analysis of multi-camera video objects, thereby assisting users in monitoring video effectively. The main contributions of this paper are as follows:
(1) The research on single-camera video object and GIS integration is extended to MCVO and GIS integration, taking into account the spatial-temporal correlation among objects from different cameras. In addition, the conceptual fusion model between MCVO and GIS is proposed.
(2) The GIS-MCVO system is built on the basis of a data organization designed for the integration of MCVO and GIS. The overall architecture and function design of the system are presented and analyzed.
(3) This paper analyzed the related applications of the GIS-MCVO system by comparing the traditional image-based surveillance system with the GIS-video image integrated surveillance system. The results showed that GIS-MCVO is advantageous in the applications of system user interface, video compression storage, trajectory deduction, video object retrieval, and video synopsis.
GIS-MCVO realizes data organization, spatial-temporal analysis, and visual expression of MCVO integrated with GIS. Compared with GIS-video image fusion, the fusion between GIS and MCVO has the following advantages: (1) multi-camera video objects, rather than whole video images, are integrated with the geographic scene for expression and analysis; (2) video moving objects from different cameras that correspond to the same entity are integrated, analyzed, and expressed in GIS; (3) the GIS-based user interface assists users in retrieving and analyzing video moving objects more quickly and concisely; and (4) the compression rate of video data achieved by the integration is on the order of $10^{-2}$ to $10^{-4}$. However, the fusion between GIS and MCVO has the following disadvantages: (1) loss of video information, in that some visual image information of video objects is discarded during extraction; and (2) uncertainty of video objects due to missed or false detections. Moreover, the visual blind zones between cameras decrease the accuracy of the extracted MCVO, so the resulting cross-camera trajectories of video moving objects are not necessarily accurate.
We implemented a prototype of the GIS-MCVO surveillance system, which can serve as a proof of concept for GIS-video object integrated analysis or, once other functions are added, as a platform for the interactive monitoring and analysis of video and geographic information. In further research, we will introduce GPS real-time positioning data, remote sensing images, and lyric data into the integration of video and GIS and investigate geographic video understanding and analysis supported by multisource information.