With the widespread deployment of ground, air and space sensor sources (the Internet of Things (IoT), social networks and sensor networks), the integrated application of real-time geospatial data from ubiquitous sensors has become a challenging issue, especially in public security management and the facility management of smart cities, where the data present the “4V” characteristics (volume, velocity, variety and value) [1]. Owing to the rapid progress of data acquisition tools and Internet techniques, the “big data era” has arrived and has exerted a significant influence on spatial science research. Great challenges remain in efficiently archiving, retrieving, and mining massive unstructured sensor data and user-generated datasets for instant perception and understanding [3].
The traditional geographic information system (GIS) maps a “snapshot” of the geographical world at a given moment, in a structured format, into commercial relational databases (RDBs) such as Oracle and MySQL. Geospatial data are persisted first, and on-demand application functions are then developed and integrated on top of “outdated” database records [5]. The “current” snapshot in GIS databases still falls out of sync with the real-time input data from the fast-paced, constantly changing world because of the input/output (I/O) bottleneck and the high latency of consistency maintenance in RDBs [6]. This limits on-the-fly access to real-time geospatial data for online analysis. Meanwhile, RDBs can hardly provide sufficient storage capacity for fast-growing big geospatial data because of their “scale-up” expansion scheme, which requires repeated upgrades of storage devices [8]. In addition, the “store first, then compute” mode stores the incremental unstructured input data tuple by tuple, together with a large amount of invalid or duplicated data. This not only puts considerable pressure on incremental storage and efficient scheduling, but also fails to meet the growing requirements of spatio-temporal big data linking analysis in big-data-driven GIS studies [9]. Therefore, traditional approaches that organize and manage geospatial data in RDB-based GIS databases cannot support online analysis in real time.
To address the above issues, Not Only SQL (NoSQL) database management systems (DBMSs) are emerging as a new solution. The mainstream NoSQL databases can be classified into four categories in terms of the data storage model: key-value store, document store, column store, and graph store. The emergence of NoSQL DBMSs was accompanied by the urgent demand for handling the continuous generation, large volumes and unstructured formats of real-time data, and their main characteristics are low-latency access, extensibility and low hardware, software and labor costs [11]. For example, Hao et al. used the Hadoop Distributed File System (HDFS) to store real-time multi-sensor stream data from the IoT for object tracking, such as tracking people in indoor spaces using radio frequency identification (RFID) [14]. Kang et al. proposed a sensor-integrated data repository model that uses MongoDB to integrate heterogeneous IoT data sources such as RFID, sensors and global positioning systems (GPS), and optimized the shard key to maximize query speed and distribute data uniformly over the data servers [15]. Van et al. compared the read/write performance of Cassandra and MongoDB for sensor data, and concluded that Cassandra is a good choice for relatively large amounts of sensor data, while MongoDB is good for smaller amounts [16]. Kim et al. used Redis to handle the heavy traffic of web services under concurrent access [17]. Although these NoSQL DBMSs offer high-performance writing and querying for large-volume real-time input data, a novel approach to organizing real-time geospatial data is still needed to cope with both the fast-growing real-time input data and the on-the-fly analysis results of real-time geo-processing.
For real-time input data, the relevant problems mainly concern low-latency writing and querying of large data volumes and fast-growing storage. For on-the-fly analysis results, by contrast, the problems mainly concern flexible queries and transaction processing. Although RDBs excel at structured data storage and transaction processing, they can hardly provide sufficient performance for fast-growing big data because of their “scale-up” expansion scheme [15]. As complements to RDBs, NoSQL databases provide a series of features that RDBs cannot, such as horizontal scalability, in-memory and distributed indexes, and dynamic modification of the data schema. They can store unstructured data efficiently because the schema is easy to modify, and they require a lower server expansion cost than RDBs because of their “scale-out” scheme. However, NoSQL databases lack full atomicity, consistency, isolation and durability (ACID) guarantees as well as support for complicated queries, transaction processing and join operations [18]. Therefore, it is feasible to combine the features of NoSQL and SQL databases to support real-time storage, query, and computation for real-time geospatial data.
To lay the foundations for online GIS analysis in real time, this manuscript presents a hybrid database organization and management approach that combines SQL RDBs with NoSQL databases (a main memory database, MMDB, and a distributed file system, DFS). This hybrid approach exploits the respective advantages of NoSQL and SQL DBMSs for real-time access to input data and to structured on-the-fly analysis results. The MMDB facilitates real-time access to the latest input data, such as data from the sensor web and the IoT, and supports real-time queries for online geospatial analysis. The RDB stores change information, such as multi-modal features and abnormal events, extracted from the real-time input data. The DFS on disk manages the massive geospatial data, and the extensible storage architecture and distributed scheduling of a NoSQL database satisfy the performance requirements of incremental storage and multi-user concurrent access. The proposed approach can manage large-volume real-time geospatial data and meets the growing requirements of spatio-temporal big data linking analysis. This research will help the research community to conduct big-data-driven GIS studies more efficiently and productively. A case analysis of geographic video (GeoVideo) surveillance for public security is presented to demonstrate the feasibility of this hybrid organization and management approach.
The remainder of the paper is organized as follows. Section 2 describes the NoSQL–SQL hybrid organization and management approach. Section 3 takes real-time GeoVideo data as a case study to demonstrate the dataflow and key algorithms in the proposed approach. Section 4 reports several sets of comparative experiments conducted to demonstrate the advantages of the proposed approach. Section 5 presents the conclusions of the study and applications of this research.
3. Case Study of Real-Time GeoVideo Data in Hybrid NoSQL–SQL DBMS
GeoVideo is typical real-time geospatial data, which maps video frames into a geographic space. In this section, we take public security video surveillance as an example to illustrate the dataflow of massive GeoVideo data between typical NoSQL databases (MMDB Redis and DFS MongoDB) and the SQL database (RDB MySQL).
3.1. Dataflow in Hybrid Databases
Real-time motion-area detection and foreground extraction are critical procedures for public security GeoVideo surveillance. They transform the original video frames into geographic elements, including geo-referenced vehicles, human bodies and static background, for higher-level visual processing and GIS analysis [22]. Therefore, linking the storage and computation of the real-time GeoVideo stream is an effective way to reduce the response time in an emergency. Figure 3 shows the dataflow of GeoVideo data between Redis, MySQL and MongoDB.
Step 1: Access the GeoVideo stream and convert the video stream into video frames according to frame rate. Calculate the field of view (FOV) based on the attitude angle and focal length of the camera.
Step 2: Store the sequence of GeoVideo frames in Redis, with one Redis list per video channel, which supports queries over the most recent period of video data. The video analysis system obtains the latest video frames from the head of the list for real-time motion region detection and foreground extraction, and a first-in-first-out (FIFO) replacement algorithm keeps the list length constant.
Step 3: Store the change attribute extracted by a comparative analysis of video frames in the memory table named Change_Attribute_Table of MySQL, and create an insert trigger to perform real-time monitoring of abnormal changes. If the change value exceeds predefined threshold values, the trigger creates a corresponding event according to the range of change and inserts the event into the memory table named Event_Table of MySQL.
Step 4: Create an insert trigger on Event_Table and use the distributed scheduling system Gearman to synchronize the event to Redis as a cache. The event tuple in MySQL maps to a Redis hash, with the globally unique event ID as the hash key.
Step 5: Use the “subscribe/publish” message mechanism of Redis to actively push events to the relevant geographic objects. Take the event category as “channel” to execute event push operations.
Step 6: Geographic objects receive subscribed events, analyze properties of the events, and automatically load spatio-temporal related GeoVideo data using predefined event templates.
Step 7: Group the preprocessed video frames in Redis by geographic process semantics and write them to the MongoDB cluster through the mongos router. Choose the compound shard key “day+cameraID” and create a compound index on these two attributes to support distributed storage of massive GeoVideo data.
Step 8: Group the FOVs of the video frames in Redis as the motion trajectory of the camera. Calculate the 3D minimum bounding rectangle of these FOVs and update the spatio-temporal index while synchronizing video frames from Redis to MongoDB.
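The fixed-length, per-channel frame list of Step 2 can be illustrated with a minimal in-memory sketch. It assumes Python and uses `collections.deque` as a stand-in for a Redis list; the class name `ChannelBuffer`, its methods and the capacity of three frames are hypothetical and are not taken from the described system. On a Redis server, a comparable effect is commonly obtained by pushing with LPUSH and trimming with LTRIM.

```python
from collections import deque


class ChannelBuffer:
    """Bounded FIFO of recent frames for one video channel (stand-in for a Redis list)."""

    def __init__(self, capacity):
        # A deque with maxlen drops the item at the opposite end once it is full.
        self._frames = deque(maxlen=capacity)

    def push(self, frame):
        # New frames enter at the head, so the oldest frame falls off the tail.
        self._frames.appendleft(frame)

    def latest(self, n=1):
        # Return up to n frames, newest first, without removing them.
        return list(self._frames)[:n]

    def __len__(self):
        return len(self._frames)


buffer = ChannelBuffer(capacity=3)
for frame_id in range(1, 6):  # frames 1..5 arrive in order
    buffer.push({"frame_id": frame_id})

newest_ids = [f["frame_id"] for f in buffer.latest(3)]
```

After five pushes into a buffer of capacity three, only frames 5, 4 and 3 remain, newest first, which is the constant-length behaviour that the analysis system relies on when it reads from the head of the list.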
3.2. Structure and Algorithm of Interoperability
3.2.1. Change Detection and Event Trigger
An abnormal change value extracted from the GeoVideo stream in Redis is the trigger for video analysis. Using the “trigger mechanism” of the RDB to link the storage and computation of GeoVideo data makes it possible to detect abnormal changes actively and create events in real time without external I/O operations. In the video surveillance of a museum, the distance between a visitor and an exhibit, the stay duration in an exhibition hall, exhibit movement and fire hazards are all important factors in determining exhibit safety. For example, when the distance between a visitor and an exhibit is monitored, different distances trigger different levels of security events. Table 1 describes the structure of the table Safe_Distance, which records the distance between an exhibit and a visitor extracted from a video frame. Table 2 describes the structure of the table Event_Safe_Distance, which records the events detected from the table Safe_Distance by checking the distance value between an exhibit and a visitor. The trigger Trigger_Safe_Distance (Algorithm 1) checks for abnormal distances in the table Safe_Distance and inserts the corresponding events into the table Event_Safe_Distance. Together, the tables Safe_Distance and Event_Safe_Distance and the trigger Trigger_Safe_Distance in MySQL constitute an integrated procedure of real-time change detection and event triggering.
|Algorithm 1: Trigger_Safe_Distance (Tuple DistanceValue)|
create trigger Trigger_Safe_Distance
after insert on Safe_Distance
for each row
begin
  -- NEW is the inserted row; NEW.attributes stands for its remaining columns
  -- the thresholds are checked in ascending order, so the first match wins
  if NEW.distance < safe_distance_level1 then
    insert into Event_Safe_Distance values (event_type1, NEW.attributes);
  elseif NEW.distance < safe_distance_level2 then
    insert into Event_Safe_Distance values (event_type2, NEW.attributes);
  elseif NEW.distance < safe_distance_level3 then
    insert into Event_Safe_Distance values (event_type3, NEW.attributes);
  else
    insert into Event_Safe_Distance values (event_type4, NEW.attributes);
  end if;
end
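The detection logic of Algorithm 1 can be tried out without a MySQL server. The following is a minimal sketch that assumes Python's built-in `sqlite3` module as a stand-in for MySQL memory tables. SQLite triggers have no IF/ELSEIF, so the branching is written as a CASE expression. The column names and the threshold values (0.5, 1.0 and 2.0) are hypothetical, since the source only names `safe_distance_level1` to `safe_distance_level3`.

```python
import sqlite3

# Hypothetical thresholds; the source does not give numeric values.
LEVEL1, LEVEL2, LEVEL3 = 0.5, 1.0, 2.0

conn = sqlite3.connect(":memory:")
conn.executescript(f"""
CREATE TABLE safe_distance (
    frame_id INTEGER, exhibit_id INTEGER, visitor_id INTEGER, distance REAL
);
CREATE TABLE event_safe_distance (
    event_type INTEGER, frame_id INTEGER, exhibit_id INTEGER,
    visitor_id INTEGER, distance REAL
);
-- One event row is created for every inserted distance row.
CREATE TRIGGER trigger_safe_distance AFTER INSERT ON safe_distance
BEGIN
    INSERT INTO event_safe_distance
    VALUES (
        CASE WHEN NEW.distance < {LEVEL1} THEN 1
             WHEN NEW.distance < {LEVEL2} THEN 2
             WHEN NEW.distance < {LEVEL3} THEN 3
             ELSE 4 END,
        NEW.frame_id, NEW.exhibit_id, NEW.visitor_id, NEW.distance
    );
END;
""")

# Each insert fires the trigger; no application-side scan is involved.
rows = [(1, 7, 100, 0.3), (2, 7, 100, 0.8), (3, 7, 100, 1.5), (4, 7, 100, 3.0)]
conn.executemany("INSERT INTO safe_distance VALUES (?, ?, ?, ?)", rows)

event_types = [r[0] for r in conn.execute(
    "SELECT event_type FROM event_safe_distance ORDER BY frame_id")]
```

With these assumed thresholds, the four inserted distances produce event types 1, 2, 3 and 4 in that order, which mirrors the if/elseif/else cascade of the trigger.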
3.2.2. Event Subscribing and Publishing
The “subscribe/publish” message mechanism can actively dispatch events to the related geographic objects as soon as they are generated. We define the insert trigger Dispatch_Event (Algorithm 2) on the table Event_Safe_Distance in MySQL. In the body of the trigger Dispatch_Event, we use the distributed scheduling system Gearman to synchronize events from MySQL to Redis by implementing a Gearman worker named “SyncAndDispatchEvent” (Algorithm 3), which uses the built-in “subscribe/publish” message mechanism of Redis to push newly generated events to the related geographic objects. In the video surveillance application, each event category is a separate channel, and the various safety departments subscribe to different types of events. For example, the museum safety department receives security incidents of all levels, whereas the local police station receives only those at a higher level. Through the real-time publishing of security events, subscribers receive event notifications and query the event-related real-time and historical GeoVideo data from Redis and MongoDB for comprehensive analysis.
|Algorithm 2: Trigger_DispatchEvent (Tuple SafeDistanceEvent)|
create trigger Dispatch_Event
after insert on Event_Safe_Distance
for each row
begin
  -- hand the new event to the Gearman worker as a background job;
  -- each attribute of the MySQL tuple becomes a key-value pair for Redis
  set @ret = gman_do_background('SyncAndDispatchEvent',
                                json_object(NEW.attributes));
end
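|Algorithm 3: GearmanWorker_SyncAndDispatchEvent (Tuple SafeDistanceEvent)|
$worker = new GearmanWorker();
$worker->addServer();
$redis = new Redis();
$redis->connect($ip, $port);
// register the job handler and wait for jobs submitted by the trigger
$worker->addFunction('SyncAndDispatchEvent', 'SyncAndDispatchEvent');
while ($worker->work());

function SyncAndDispatchEvent($job) {
    global $redis;
    $workString = $job->workload();
    $work = json_decode($workString);
    // cache the event as a Redis hash keyed by its event ID
    $redis->hMSet($work->eid, (array) $work);
    // publish the event ID on the channel named after the event type
    $redis->publish($work->event_type, $work->eid);
}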
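The synchronize-then-publish behaviour of Algorithm 3 and the channel-per-category subscription described above can be sketched in memory. This is an illustrative stand-in written in Python rather than a Redis client: the class `EventBus`, the channel names `level1` to `level4`, the event IDs and the assumption that `level1` is the most severe category are all hypothetical.

```python
from collections import defaultdict


class EventBus:
    """In-memory stand-in for the Redis hash cache plus subscribe/publish channels."""

    def __init__(self):
        self.hashes = {}                      # event ID -> event attributes
        self.subscribers = defaultdict(list)  # channel -> list of callbacks

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def sync_and_dispatch(self, event):
        # Cache the event under its unique ID, then announce the ID on the
        # channel named after the event category.
        self.hashes[event["eid"]] = dict(event)
        receivers = self.subscribers[event["event_type"]]
        for callback in receivers:
            callback(event["eid"])
        return len(receivers)  # number of subscribers that were notified


bus = EventBus()
museum_inbox, police_inbox = [], []
for level in ("level1", "level2", "level3", "level4"):
    bus.subscribe(level, museum_inbox.append)  # museum: every level
bus.subscribe("level1", police_inbox.append)   # police: most severe level only

bus.sync_and_dispatch({"eid": "e-001", "event_type": "level3", "distance": 1.5})
bus.sync_and_dispatch({"eid": "e-002", "event_type": "level1", "distance": 0.3})
```

Here the museum receives both event IDs, while the police station receives only `e-002`. A subscriber that gets an ID can then look up the cached attributes in `bus.hashes`, in the same way that a geographic object would read the Redis hash.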
4. Experimental Study
In this section, we present experiments using the organization and management approach described above and analyze the performance results in terms of real-time GeoVideo data access, abnormal event detection and dispatch, and distributed storage of massive historical data. In Section 4.1, we describe the software and hardware environment, the dataset and other preparations involved in the experiments. The detailed experiments are presented in Section 4.2.
4.1. Experimental Setting
All experiments were performed on two Dell OPTIPLEX 9020 workstations, each of which created three virtual machines as work nodes. Each node had a 4-Core Intel I7-4790M 3.60 GHz CPU with eight hardware threads and 4 GB of RAM. The two workstations communicated via a gigabit network. All six nodes used a 64-bit Linux operating system (CentOS Enterprise Server). A 64-bit MongoDB, 64-bit Redis, 64-bit MySQL, task distribution system Gearman and 64-bit OpenCV were used to implement the hybrid organization method described above on the two work nodes for real-time GeoVideo data.
In the experiments, we chose 58 randomly distributed cameras both inside and outside an office building. Each camera recorded 48 h of video at a sampling rate of 25 frames per second, with one geo-referenced frame per second because of the sampling rates of the GPS and the compass (the digital compass can achieve 30 or 40 readings per second, but the GPS provides only one sample per second). These readings were key parameters for calculating the spatial information of GeoVideo, such as the FOV and the position of the monitored object. We therefore had a dataset of about 250 million video frames and 10 million geo-referenced key frames, with each frame about 100 K in size. The change detection of video frames, including motion-area detection and foreground extraction, was implemented with the open-source library OpenCV.
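The stated dataset size follows directly from the camera count, the recording time and the two sampling rates. The short check below reproduces that arithmetic in Python. The nominal storage figure is only an estimate and rests on the assumption that “100 K” means 100 kilobytes per frame and that every frame is kept at that size.

```python
CAMERAS = 58
HOURS = 48
FPS = 25        # video sampling rate, frames per second
GEO_FPS = 1     # geo-referenced key frames per second (GPS-limited)
FRAME_KB = 100  # assumption: "100 K" is read as 100 kilobytes

seconds = CAMERAS * HOURS * 3600            # camera-seconds of recording
total_frames = seconds * FPS                # all video frames
key_frames = seconds * GEO_FPS              # geo-referenced key frames
raw_terabytes = total_frames * FRAME_KB / 1e9  # decimal units: 1 TB = 1e9 KB

print(f"{total_frames:,} frames, {key_frames:,} key frames, ~{raw_terabytes:.1f} TB")
```

This gives 250,560,000 frames and 10,022,400 key frames, consistent with the rounded figures of about 250 million and 10 million, and a nominal volume of roughly 25 TB under the stated assumption.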
To validate the proposed method, we carried out a series of experiments using a sample GeoVideo dataset. The first experiment compared the update and query performance of the Redis-MongoDB-based repository, the MongoDB-based repository and the MySQL-based repository, to check whether the hybrid method outperformed a representative standalone RDB and DFS in real-time access. The second experiment compared the abnormal event detection and dispatch performance of active detect-and-push with that of regular scanning, to verify that linking storage and computation through local access greatly reduces response time. The third experiment compared the data distribution and query performance of different shard keys, to validate the choice of the compound shard key, and checked whether increasing the number of MongoDB shards improved access performance.
4.2. Experimental Results
4.2.1. Real-Time Accessing
This test compared the real-time access performance of the proposed Redis-MongoDB-based repository with that of a standalone MongoDB-based repository and a MySQL-based repository. For this test, we configured a MongoDB-based repository with one config server, three shard servers and one mongos router, and a Redis-based repository and a MySQL-based repository on a single node. We used three update/query proportions (75% updating and 25% querying, 50% updating and 50% querying, and 25% updating and 75% querying) and averaged the results to measure the update/query throughput and the response latency of querying the latest data.
From Figure 4a–c, we can conclude that the access efficiency and the latency of querying the latest data were significantly better for the Redis-MongoDB-based repository than for the other two. The Redis-MongoDB-based repository outperformed the MongoDB-based repository, which in turn outperformed the MySQL-based repository. This is because Redis removed the I/O bottleneck and the high cost of global index maintenance, and MongoDB replaced the complex operations needed to maintain strong consistency during real-time updating with eventual consistency. In general, the access throughput of the three repositories was stable initially and then declined gradually as the amount of data stored in the database increased. The volume of GeoVideo data had a small effect on the Redis-MongoDB-based repository and a large effect on the MySQL-based repository. Therefore, separating real-time and historical video data, using the MMDB to manage the real-time data and the DFS to manage the massive historical data, was an efficient way to satisfy the requirements of real-time access.
4.2.2. Event Detection and Dispatch
To evaluate the performance of real-time abnormal change detection and event dispatch, we designed a comparative trial of the time cost of event detection and dispatch in the following four modes.
Mode 1: The MySQL trigger mechanism on the memory table Safe_Distance detects abnormal changes, and a Gearman worker embedded in the trigger of the memory table Event_Safe_Distance pushes events to the related geographic objects through the “subscribe/publish” message mechanism, without external I/O operations.
Mode 2: As in mode 1, the MySQL trigger mechanism detects abnormal changes and the embedded Gearman worker pushes events to the related geographic objects, but the tables Safe_Distance and Event_Safe_Distance are stored on disk.
Mode 3: The MySQL trigger mechanism detects abnormal changes in the memory table Safe_Distance and stores different types of events in different memory tables. Applications then periodically (every 500 ms) scan the event tables to obtain the latest events.
Mode 4: MySQL procedure functions periodically (every 500 ms) detect abnormal changes and store the calculated events in the memory table Event_Safe_Distance. The trigger on the memory table Event_Safe_Distance then dispatches the events to the related geographic objects and modules using the embedded Gearman worker.
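Modes 3 and 4 rely on a periodic scan, so an event has to wait for the next scan before it is noticed. As a rough illustration only, and not a reproduction of the experiment, the sketch below computes that waiting time for events arriving at evenly spaced instants. It assumes the 500 ms interval stated above and ignores all processing and network costs. The function name `scan_delay_ms` is hypothetical.

```python
SCAN_INTERVAL_MS = 500  # polling period used by modes 3 and 4


def scan_delay_ms(arrival_ms, interval_ms=SCAN_INTERVAL_MS):
    """Time an event waits until the next periodic scan picks it up."""
    # Scans run at 0, interval, 2 * interval, ...; an event that arrives
    # exactly on a scan instant is picked up immediately.
    return (-arrival_ms) % interval_ms


arrivals = range(0, 10_000)  # one event per millisecond over 10 s
delays = [scan_delay_ms(t) for t in arrivals]
mean_delay = sum(delays) / len(delays)
max_delay = max(delays)
```

Under these assumptions the average wait is 249.5 ms, about half the scan interval, and the worst case is 499 ms. A trigger-driven push has no such waiting term, which is one plausible contributor to the gap between the active and the periodic modes.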
From Figure 5a, we can conclude that the event detection and dispatch efficiency of mode 1 was significantly higher than that of the other three modes. The comparison between modes 1 and 2 demonstrated that memory tables can greatly decrease the time cost of event triggering and event delivery. The comparisons between modes 1 and 3 and between modes 1 and 4 revealed that active event detection and event dispatch were more efficient than periodic operations. Figure 5b,c show that mode 1 consumed the fewest MySQL resources and had little influence on the access performance of MySQL when compared with the other three modes.
4.2.3. Historical Data Distribution
This experiment aimed to validate whether the MongoDB cluster guaranteed an even distribution of GeoVideo data, better query performance and scalable storage. We configured a MongoDB cluster with one config server, three shard servers, one mongos router and one additional shard server available for extension. We compared three shard keys: the compound key “day+cameraID”, the hash key “cameraID” and the incremental key “day”.
Figure 6a,b showed that the compound shard key had better access performance than the other two types of shard key. Although the hash key gave a more even distribution and higher update efficiency for GeoVideo data, it had a query bottleneck because the data were distributed randomly, without clustering, which made the distributed index ineffective. The incremental key concentrated the update operations on the one shard server that recorded the latest data, which resulted in an update bottleneck that prevented the scalable storage from being fully used. Figure 6c showed that increasing the number of shards improved update performance, which means that MongoDB had a strong capability to scale out. Increasing the shard number brought a small performance improvement for the incremental key and a large performance improvement for the compound key and the hash key.
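The update hot spot of the incremental key can be illustrated with a toy model of range partitioning. The sketch below is written in Python and is not MongoDB's actual chunk splitting or balancing logic: the chunk boundaries, the day number 30 and the function `shard_for` are hypothetical, and the model assumes that chunks have already been split on `cameraID` within the current day.

```python
from bisect import bisect_right
from collections import Counter


def shard_for(key, boundaries):
    """Range partitioning: index of the first chunk whose upper bound exceeds key."""
    return bisect_right(boundaries, key)


TODAY = 30
CAMERA_IDS = range(1, 59)  # 58 cameras, as in the experiment

# Hypothetical chunk boundaries for three shards.
DAY_BOUNDS = [10, 20]                         # shard key: day
COMPOUND_BOUNDS = [(TODAY, 20), (TODAY, 40)]  # shard key: (day, cameraID)

# One write per camera on the current day, routed under each shard key.
writes_by_day = Counter(shard_for(TODAY, DAY_BOUNDS) for _ in CAMERA_IDS)
writes_by_compound = Counter(
    shard_for((TODAY, cam), COMPOUND_BOUNDS) for cam in CAMERA_IDS
)
```

In this model all 58 writes of the current day land on the last shard when the key is “day” alone, whereas the compound key spreads them as 19, 20 and 19 writes over the three shards while still keeping the frames of one camera and day together.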
The complex structures of geospatial systems create a pressing need for appropriate management [25]. Recent developments in information technology commonly referred to as “big data”, along with the related fields of data science and analytics, are needed to process, analyze, and determine the value of the overwhelming amounts of geospatial data [3]. Existing geospatial analysis methods have been developed primarily in the context of small data, yet the processes of interest to the general public and decision makers operate in a real-time and heterogeneous context. The imbalance between rich data products and poor data utilization calls attention to techniques for real-time data access and management. The hybrid NoSQL–SQL organization and management approach for real-time geospatial data combines the real-time access of the NoSQL MMDB for heterogeneous input data, the flexible queries and transaction processing of the SQL RDBMS for on-the-fly analysis results, and the scalability of the NoSQL DFS for massive data. This approach is considered effective for supporting real-time storage, query, and computation in real-time GIS. To address the difficulty of linking storage and computation in GIS online analysis, this paper presented an internal and external collaborative storage strategy that manages real-time and historical geospatial data separately, and designed a workflow that integrates real-time change detection and active event delivery to drive geographic process evolution. A novel ID structure was also designed to associate multi-granularity geographic elements for unified scheduling in the hybrid NoSQL–SQL DBMS. Additionally, by using the subscribe/publish message mechanism to map the relationships between geographic objects, events and processes, the method can efficiently decrease the latency of the collaborative scheduling of real-time and historical spatio-temporal data. Experimental results from a concrete GeoVideo application based on Redis, MongoDB and MySQL demonstrate the practicality and reliability of this method in supporting real-time GIS applications.
This NoSQL–SQL hybrid organization approach is an important foundation of the real-time GIS platform for environment monitoring. The system has been successfully applied to GeoVideo monitoring and trajectory tracking, and provides public security decision support for the security department (see Figure 7). The data management approach has facilitated real-time access to, and abnormal change detection in, big GeoVideo data. The developed organization and management approach for real-time geospatial data will enable advancements in a broad spectrum of applications by assisting researchers in tackling the challenges posed by big data.