3.1. Trajectory Data Gathering and Storage
The data-gathering step can involve multiple processing tasks that improve the quality of the trajectory data before mining and analysis begin. For example, the system can perform a cleaning process to remove outliers. Zhen’s survey [1] divides the solutions to the outlier problem into three categories: mean (or median) filters; Kalman and particle filters; and heuristics-based outlier detection. The mean (or median) filter estimates the true value of a given trajectory point as the mean (or median) of the points within a sliding window. This filter is best suited to trajectories with a high sampling rate. Kalman and particle filters are algorithms that estimate actual measurements from noise-contaminated data. Both build models that depend on the initial measurements; if the first points of the trajectory are noisy, the effectiveness of the model drops significantly. The heuristics-based outlier detection method removes noise points from the trajectory and keeps only the points within the calculated limits: the method computes the velocity and distance between each point and its successor, and if these values exceed the given limits, the point is not included in the trajectory.
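The median filter and the heuristics-based method described above can be sketched as follows. This is a minimal illustration, not the implementation of any surveyed system: the point layout `(t, lat, lon)`, the window size, and the 50 m/s speed limit are assumptions, and the heuristic compares each point with the last point that was kept rather than strictly with its original predecessor.

```python
from math import radians, sin, cos, asin, sqrt
from statistics import median


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in meters
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * r * asin(sqrt(a))


def median_filter(points, window=5):
    """Replace each coordinate by the median over a centered sliding window.

    points: list of (t, lat, lon) tuples (assumed layout), sorted by time.
    """
    half = window // 2
    smoothed = []
    for i, (t, _, _) in enumerate(points):
        chunk = points[max(0, i - half): i + half + 1]
        smoothed.append((t, median(p[1] for p in chunk), median(p[2] for p in chunk)))
    return smoothed


def speed_heuristic(points, max_speed_mps=50.0):
    """Drop points whose implied speed from the last kept point exceeds the limit.

    Assumes the first point is valid; max_speed_mps is an illustrative threshold.
    """
    if not points:
        return []
    kept = [points[0]]
    for t, lat, lon in points[1:]:
        t0, lat0, lon0 = kept[-1]
        dt = t - t0
        if dt <= 0:
            continue  # skip duplicated or out-of-order timestamps
        if haversine_m(lat0, lon0, lat, lon) / dt <= max_speed_mps:
            kept.append((t, lat, lon))
    return kept


track = [(0, 0.0, 0.0), (10, 0.0, 0.0001), (20, 5.0, 5.0), (30, 0.0, 0.0003)]
cleaned = speed_heuristic(track)    # the jump to (5.0, 5.0) is discarded
smoothed = median_filter(track, 3)  # the jump is replaced by the window median
```

The median variant is used here because a single extreme point does not shift the median of its window, whereas it would shift the mean.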
Vast volumes of data have an impact on data storage, transmission, processing, and display. The purpose of trajectory data compression is to reduce the size of the data set without distorting the trajectory trend [
23]. There are two categories of trajectory data compression algorithms [
52]:
offline compression: this category reduces the size of the trajectory after the trajectory has been fully generated. The classical algorithm is Douglas–Peucker (DP), a heuristic that recursively divides the sequence of positions and stores only the representative positions of each sub-sequence. Modifications and improvements of DP have since been proposed, such as Top-Down Time-Ratio (TD-TR) [53];
online compression: the trajectory is compressed while the object is still moving along it, which makes this category well suited to real-time environments such as traffic monitoring. The main algorithms are Sliding Window, Open Window [53], and STTrace [54]. Sliding Window and Open Window are similar algorithms that differ in how the point that closes the window is chosen. In both, a window grows along the trajectory points for as long as the error between the fitted line segment (the line from the first to the last point of the window) and the original trajectory does not exceed the specified error limit. The STTrace algorithm uses the coordinates, speed, and orientation of the current trajectory point to calculate a safe area in which the next position is expected to lie; if the next point falls in this area, it can be ignored.
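The two compression categories can be illustrated with a short sketch. It assumes planar `(x, y)` coordinates and uses perpendicular distance as the error measure; the function names and the error limit `epsilon` are illustrative. The window function closes each window at the point before the one that violated the limit, which is only one of the possible choices mentioned above.

```python
from math import hypot


def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / hypot(dx, dy)


def douglas_peucker(points, epsilon):
    """Offline compression: recursively keep the point farthest from the chord."""
    if len(points) < 3:
        return list(points)
    dmax, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > dmax:
            dmax, index = d, i
    if dmax <= epsilon:
        return [points[0], points[-1]]
    left = douglas_peucker(points[: index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right  # the split point appears in both halves


def opening_window(points, epsilon):
    """Online-style compression: grow a window until the error limit is exceeded."""
    if len(points) < 3:
        return list(points)
    kept = [points[0]]
    anchor, end = 0, 2
    while end < len(points):
        violated = any(
            perpendicular_distance(points[i], points[anchor], points[end]) > epsilon
            for i in range(anchor + 1, end)
        )
        if violated:
            anchor = end - 1  # close the window just before the violation
            kept.append(points[anchor])
            end = anchor + 2
        else:
            end += 1
    kept.append(points[-1])
    return kept


route = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]  # an L-shaped route
offline = douglas_peucker(route, 0.1)
online = opening_window(route, 0.1)
```

On this L-shaped route both functions keep only the two end points and the corner `(2, 0)`.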
Once collected, organized, cleaned and compressed, the trajectory data can be transformed into a geographic representation before being stored into a database. There are two common formats of geospatial data types: raster and vector. The graph format is a subset of the vector model [
55]. It was observed among the analyzed research that the raster format is more commonly used in Data Warehouses that work with summary information at the cell level. In raster format, the map is divided into several cells of a shape (square, triangle, or polygon), and each cell can contain information about a particular variable, e.g., precipitation, temperature, humidity, soil type, etc. [
56]. On the other hand, in the vector format, the map is built using points, lines, and polygons and is often used to represent the movement of trajectories geographically. In the trajectory context, the basic logical unit in a vector model is the line, used to encode the object location, and represented as a string of coordinates of points along the line [
57]. Finally, the geographical graph represents the geolocated characteristics of data in a map. This representation is generally used to describe the urban grid, where roads are represented as edges and reference points (or street intersections) as vertices. It is the representation type used to implement the map-matching process, in which the geographical representation of the trajectory points is transformed so that their coordinates match the representation of the urban mesh. The graph also yields another trajectory representation: the path. Here, the trajectory is represented by a sequence of segments, where each segment is composed of two vertices of the graph and two consecutive segments share a common vertex [
13].
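The path representation can be made concrete with a small sketch. The road graph, vertex names, and coordinates below are hypothetical, and snapping each point to its nearest vertex is a deliberately naive stand-in for a real map-matching algorithm.

```python
from math import hypot


def snap_to_vertices(points, vertices):
    """Naive map-matching: map each (x, y) point to the nearest vertex id.

    vertices: dict of vertex id -> (x, y). Consecutive repeats are dropped.
    """
    matched = []
    for x, y in points:
        nearest = min(vertices, key=lambda v: hypot(vertices[v][0] - x, vertices[v][1] - y))
        if not matched or matched[-1] != nearest:
            matched.append(nearest)
    return matched


def to_path(vertex_ids):
    """Turn a vertex sequence into segments, each made of two vertices."""
    return list(zip(vertex_ids, vertex_ids[1:]))


def is_valid_path(segments, edges):
    """Check that every segment is a graph edge and that consecutive
    segments share a common vertex."""
    on_graph = all(frozenset(s) in edges for s in segments)
    chained = all(a[1] == b[0] for a, b in zip(segments, segments[1:]))
    return on_graph and chained


# Hypothetical urban grid: three intersections joined by two roads.
vertices = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (1.0, 1.0)}
edges = {frozenset(("A", "B")), frozenset(("B", "C"))}

gps_points = [(0.1, 0.0), (0.2, 0.1), (0.9, 0.1), (1.0, 0.8)]
path = to_path(snap_to_vertices(gps_points, vertices))
```

Here `path` is `[("A", "B"), ("B", "C")]`, and the two segments share the vertex `"B"`.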
Trajectory data are stored in different formats according to the device type, monitored objects, and purpose of the application. Besides raw trajectory data, other relevant properties can be obtained and stored, such as speed, direction, and acceleration [
12]. Typically, trajectory data are captured in real time, composing a data stream that feeds a type of spatio-temporal database called MOD (Moving Object Database) [
39]. A large structure for storage is needed to save the massive and ever-increasing data stream [
58]. Current systems can use dataspace technologies and Big Data platforms, such as Apache Spark and Hadoop. The goal of dataspace support is to provide basic functionality over several data sources, regardless of how integrated they are [
59]. Dataspace systems offer services over data without requiring upfront semantic integration and follow a pay-as-you-go approach, that is, integration effort is invested incrementally, only where and when tighter integration is needed [
60]. However, if more sophisticated operations are required, such as relational DB-style operations or data mining, additional efforts can be employed to integrate existing heterogeneous data sources into a Dataspace Support Platform (DSSP) [
61].
We have noticed two types of trajectory data manipulation: the trajectory data can be handled in real time, as in navigation systems, e.g., Waze (https://www.waze.com), or analyzed on a historical basis. Real-time trajectory applications maintain the current location of the objects; that is, their queries are posed on the current location and the expected future positions of the object. The CRISIS system [
48] is an example of an application that deals with trajectory data streams and uses Apache Jena to hold in memory an RDF (Resource Description Framework) graph containing a semantic representation of data received from various sensors. In that system, data of several heterogeneous sensors are integrated into a structure that uses Semantic Web to embed the data in a context (in this case, maritime navigation), facilitating the interoperability and the discovery of new knowledge about the environment to be monitored [
48]. Data streaming is produced by AIS (Automatic Identification System) sensors and climate and glacier monitoring stations. The streaming data are processed and represented as an RDF graph that can be stored either locally or in the LOD (Linked Open Data) cloud. The MobyDick system [
43] presents a prototype framework for managing and monitoring mobile objects. This prototype does not store any information in a database; it works only with information held in main memory. MobyDick implements a data model based on the ISO temporal and spatial specifications ISO 19108:2002 [62] and ISO 19107:2003 [
63]. MobyDick functions as a layer above the Apache Flink [
64] platform, which implements parallel distributed processing of data.
Unlike applications that use data streams, a Trajectory Database maintains the history of the movement. The new tendency to maintain a historical trajectory database fed continuously by a moving object data stream requires a robust structure with large storage capacity. Computational clusters with parallel processing and horizontal scalability are infrastructures that support Big Data storage and analysis [
65]. The Bao et al. research [
42] presents a system that focuses on urban trajectories. Their system uses Microsoft Azure to store the large volume of data and is composed of three modules: trajectory storage, space-time indexing, and map-matching. The most recent data are stored in a Redis database, while historical data are kept in Azure. ST-Hadoop [47] was the first open-source MapReduce framework with native spatio-temporal data support. It trades additional storage for better performance by storing data at the level of the day, month, and year. The data are stored in files in HDFS (Hadoop Distributed File System) with spatio-temporal indexing that speeds up query processing.
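The idea of organizing data at day, month, and year levels can be illustrated with an in-memory sketch. This is not ST-Hadoop's actual file layout or API: the record layout `(epoch_seconds, object_id, lat, lon)`, the key formats, and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone


def temporal_keys(epoch_seconds):
    """Return the day, month, and year keys (UTC) for a timestamp."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return {
        "day": ts.strftime("%Y-%m-%d"),
        "month": ts.strftime("%Y-%m"),
        "year": ts.strftime("%Y"),
    }


def build_temporal_index(records):
    """Index every record once per temporal level.

    Each record is stored three times, which costs storage but lets a query
    pick the level that best matches its time range.
    """
    index = {level: defaultdict(list) for level in ("day", "month", "year")}
    for rec in records:
        for level, key in temporal_keys(rec[0]).items():
            index[level][key].append(rec)
    return index


DAY = 86400
records = [
    (0, "bus-1", 41.15, -8.61),
    (DAY, "bus-1", 41.16, -8.60),
    (31 * DAY, "bus-2", 41.17, -8.62),
]
index = build_temporal_index(records)
january = index["month"]["1970-01"]  # answers a month-level query directly
```

A query covering a whole month reads one month-level bucket instead of scanning about thirty day-level buckets, which is the storage-for-performance trade-off described above.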
Traditional trajectory management systems, such as PostgreSQL, Oracle, HDFS, and Azure, are disk-oriented, which can cause problems of scalability and slow query processing. Hence, the use of Big Data platforms like Apache Spark has become increasingly common in the management of trajectory data. The Spark platform is a distributed system that provides an abstraction called RDD (Resilient Distributed Dataset) (
https://spark.apache.org/docs/latest/rdd-programming-guide.html, Apache Spark RDD Programming Guide). These RDDs maintain a collection of objects in memory that can be handled conveniently by Spark. The TrajSpark system [
46] extends Apache Spark by building a global and local indexing structure to speed up the search process. In addition, TrajSpark relies on a load-balancing monitor that improves the use of data partitions. In some applications, new data are added on an hourly or daily basis, and the data distribution changes over time. Re-partitioning the entire dataset whenever new data are loaded would incur a high overhead cost, and re-partitioning old data is not worthwhile because new data are more valuable. Therefore, TrajSpark only tries to partition the new data groups, without touching the existing data.
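The strategy of partitioning only newly loaded data can be sketched without Spark. The class below is a hypothetical illustration, not TrajSpark code: the `AppendOnlyStore` name, the record layout `(timestamp, payload)`, and the fixed partition size are all assumptions.

```python
def partition_batch(batch, max_size):
    """Split one newly arrived batch into time-ordered chunks of bounded size."""
    ordered = sorted(batch, key=lambda rec: rec[0])  # order by timestamp
    return [ordered[i: i + max_size] for i in range(0, len(ordered), max_size)]


class AppendOnlyStore:
    """Keeps existing partitions untouched and partitions only new batches."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.partitions = []

    def load(self, batch):
        """Partition the new batch and append it; return the partitions added."""
        new_parts = partition_batch(batch, self.max_size)
        self.partitions.extend(new_parts)
        return len(new_parts)


store = AppendOnlyStore(max_size=2)
store.load([(3, "c"), (1, "a"), (2, "b")])  # first load creates two partitions
before = [list(p) for p in store.partitions]
added = store.load([(5, "e"), (4, "d")])    # second load adds one partition
```

After the second load, the first two partitions are exactly as they were, so the cost of a load depends only on the size of the new batch.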
Another system that uses the Spark architecture is the DiStRDF (Distributed Spatio-temporal RDF system) [
49]. DiStRDF is a distributed system that uses RDF to process spatio-temporal queries in a network of heterogeneous databases. In Nikitopoulos et al.’s experiments, data were stored in an HDFS system managed by an Apache Spark environment. The RDF data acts as a large dictionary containing an approximate location summary of the object and the event time. This dictionary is stored in a Redis database to speed up query processing.
Based on storage system and geometric representation, some of the analyzed systems are summarized in Table 2. This table presents the platforms used to manage the trajectory data in each work, and the column Geometric Representation shows the geometric representation type used. In the integration step, none of the proposed architectures deals with data in raster format. All the analyzed works store the trajectory data in vector format, and one of them also represents the information in graph form. Each data manager was chosen in accordance with the trajectory data model used in the respective work.
Table 2 shows some systems that use spatial databases such as PostgreSQL, together with its spatial extension PostGIS, and Oracle. More recent works have adopted Big Data technologies because of the large volume of trajectory data produced by sensors and social media. It is estimated that the amount of digital data doubles every two years, and geospatial data are a major contributor to the Big Data scenario [
66]. Traditional storage technologies, such as those used in [
26,
40], cannot organize and query this large volume of data. Computational clusters with parallel processing and horizontal scalability are infrastructures that support Big Data storage and analysis [
65]. Big Data technologies such as Hadoop, MongoDB, Flink, and Spark are becoming increasingly common in large Database Management Systems [
65,
67]. We can conclude that newer trajectory systems tend to use Big Data technologies (Azure, Spark, ST-Hadoop, MongoDB) to deal with trajectory data. Nevertheless, cloud computing platforms, such as those using Azure, HDFS, and Spark, are not by themselves optimized to deal with spatio-temporal data.
The trajectory systems shown in
Table 2 can also be grouped according to the adopted data structure: structured data or semi-structured data. The T-Warehouse system [
39] presents the complete architecture of a trajectory system, with MOD and TDW modules. The MOD module uses the Hermes framework [
68] to provide an Object-Relational DBMS (ORDBMS) for trajectory data. The Oracle DBMS is used to build the TDW. Older works, such as SeMiTri [26], use a simple relational database with a spatial extension, as in the case of PostgreSQL + PostGIS. Other works use a semi-structured data model, especially when the semantic information of the trajectory must be represented. Modeling trajectory data using RDF graphs or ontologies has gained strength as new works on semantic trajectory enrichment have emerged [
33,
49]. Representing semantic trajectory data using RDF enables not only inferring new knowledge, but also the publication of data as Linked Open Data (LOD), making it accessible on the Semantic Web. For example, the MASTER [
33] project uses a database called Rendezvous [
69] that stores graphs in the RDF format and intends to make its data available on the Semantic Web. Rendezvous is a triplestore [
70] based on a NoSQL distributed database that stores data in RDF format. According to [29], the storage of trajectory data is well served by existing technologies, and the new challenge is the semantic enrichment of trajectory data, which is the subject of the next subsection.
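To make the RDF-based modeling concrete, the sketch below serializes a small annotated trajectory as N-Triples-style statements using only the standard library. The `http://example.org/traj#` namespace and the property names (`hasPoint`, `time`, `lat`, `lon`, `label`) are hypothetical and do not come from MASTER, Rendezvous, or any published ontology; typed literals are omitted for brevity.

```python
EX = "http://example.org/traj#"  # hypothetical namespace for illustration
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def iri(local_name):
    """Build an IRI term in the illustrative namespace."""
    return f"<{EX}{local_name}>"


def literal(value):
    """Build a plain string literal, escaping backslashes and quotes."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def trajectory_to_triples(traj_id, points):
    """Describe a trajectory and its annotated points as (s, p, o) triples.

    points: list of (t, lat, lon, label) tuples (assumed layout).
    """
    traj = iri(traj_id)
    triples = [(traj, RDF_TYPE, iri("Trajectory"))]
    for i, (t, lat, lon, label) in enumerate(points):
        point = iri(f"{traj_id}_p{i}")
        triples.extend([
            (traj, iri("hasPoint"), point),
            (point, iri("time"), literal(t)),
            (point, iri("lat"), literal(lat)),
            (point, iri("lon"), literal(lon)),
            (point, iri("label"), literal(label)),
        ])
    return triples


def to_ntriples(triples):
    """Serialize triples one statement per line, each terminated by ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


triples = trajectory_to_triples("t1", [(0, 41.15, -8.61, "home")])
document = to_ntriples(triples)
```

Because every statement is a subject, predicate, object triple, new kinds of semantic annotation can be added as extra predicates without changing a fixed schema, which is the flexibility that motivates the semi-structured model.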
3.2. Semantic Trajectories
This section describes the semantically enriched MODs (
Table 2) with the respective semantic information type.
The SeMiTri system [
26] is an example of an application that processes geometric data and context data to produce semantically enriched trajectories. The system performs three types of semantic annotation: by region, line, and point. The annotation by region is computed through online maps like OpenStreetMap and can identify areas such as residential, industrial, and commercial. For line annotation, the system performs a map-matching operation and then, based on context, infers the user’s transport type (bus, subway, hike, etc.). Point annotations are associated with those trajectory segments where the moving object is stationary. For this segment type, the system identifies the PoI (home, work, market, etc.) using a Markov Chain algorithm [71], which is better suited to this kind of segment.
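Point annotation presupposes that the stationary segments have been located first. The sketch below shows one simple way to find them, using distance and duration thresholds; it is an assumption made for illustration, not SeMiTri's method, and it uses planar `(t, x, y)` points with hypothetical threshold values.

```python
from math import hypot


def detect_stops(points, max_radius, min_duration):
    """Find stationary segments in a trajectory.

    points: list of (t, x, y) tuples sorted by time (assumed planar layout).
    Returns (start_index, end_index) pairs, inclusive, where all points stay
    within max_radius of the first point for at least min_duration.
    """
    stops = []
    i, n = 0, len(points)
    while i < n:
        j = i + 1
        # Extend the candidate segment while points stay close to its start.
        while j < n and hypot(points[j][1] - points[i][1],
                              points[j][2] - points[i][2]) <= max_radius:
            j += 1
        if points[j - 1][0] - points[i][0] >= min_duration:
            stops.append((i, j - 1))
            i = j  # continue after the detected stop
        else:
            i += 1
    return stops


track = [(0, 0.0, 0.0), (10, 0.1, 0.0), (20, 0.0, 0.1), (30, 5.0, 5.0), (40, 10.0, 10.0)]
stops = detect_stops(track, max_radius=1.0, min_duration=15)
```

Here the first three points form a single stop, `(0, 2)`; each detected segment could then be matched against nearby PoIs to assign a label such as home or work.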
The system named VISTA [
50] presents a tool with visual analytics functionalities that support the users: (i) in exploring and processing trajectory data; and (ii) in creating features and semantic information, to guide the user to comprehend how to label trajectories properly. Another system that also assigns trajectory annotations is ANALYTiC [
45], which uses machine-learning algorithms to infer semantic annotations for trajectory data. In that article, a semantic annotation, or label, is any contextual information related to the trajectory, for example, activity information such as walking, studying, driving, or fishing. ANALYTiC uses an active learning strategy to maintain good classifier performance while using a smaller number of training examples.
The CONSTAnT model [
3] is purely a conceptual data model that defines the important aspects of implementing a semantic trajectory system. The model is divided into two parts. The first part covers the simplest entities, which contain information about the object, trajectory, sub-trajectories, semantic points, environment, place, and events. The second part covers the more complex entities, such as purpose, means of transport, and behavior, which require data mining techniques to be instantiated.
The MASTER system presents not only the conceptual model but also the logical model and an example of data storage and information query. The focus of the MASTER project is not how to get semantic information, but how to represent semantic information by conceptual and logical models. The logical model is represented by an RDF graph because it is generic enough to model trajectories and aspects extracted from heterogeneous data sources [
33]. The MASTER system uses the database Rendezvous [
69] in order to manage the large volume of data.
Table 3 highlights the projects that use some form of semantic annotation for trajectories. The table also indicates the semantic information type, according to the 5W1H model, adopted in each system that has some semantic information linked to the trajectory. Among the applications discussed in this section, the MASTER project [
33] would be able to fit the 5W1H model, besides allowing the input of other contextual information.
Some systems assign only a single label to the trajectory [45], while others allow annotation of each part of the trajectory: point, segment, and entire trajectory. The column
Semantic Annotation from
Table 3 highlights the semantic annotation allowed for each part of the trajectory. The SeMiTri [
26] and MASTER [
33] allow semantic information to be associated with a point, a segment, and/or the entire trajectory. The ANALYTiC [45] system associates semantic information with the entire trajectory, and the VISTA [50] system associates semantic information with trajectory segments.