1. Introduction
The explosive growth in geospatial and temporal data, together with the emergence of new technologies, emphasizes the need for automated discovery of spatiotemporal knowledge. Spatiotemporal data mining studies the process of discovering interesting, previously unknown, and potentially useful patterns from large spatial and spatiotemporal databases [
1,
2,
3,
4,
5].
Figure 1 shows the process of spatiotemporal data mining. Given input spatiotemporal data, the first step is often preprocessing to correct noise, errors, and missing data and exploratory space-time analysis to understand the underlying spatiotemporal distributions. Then, an appropriate spatiotemporal data mining algorithm is selected to run on the preprocessed data, and produce output patterns. Common output pattern families include spatiotemporal outliers, associations and tele-couplings, predictive models, partitions and summarization, hotspots, as well as change patterns. Spatiotemporal data mining algorithms often have statistical foundations and integrate scalable computational techniques. Output patterns are post-processed and then interpreted by domain scientists to find novel insights and refine data mining algorithms when needed.
Figure 1.
The process of spatiotemporal data mining.
Societal importance: Spatiotemporal data mining techniques are crucial to organizations which make decisions based on large spatial and spatiotemporal datasets, including NASA, the National Geospatial-Intelligence Agency [
6], the National Cancer Institute [
7], the US Department of Transportation [
8], and the National Institute of Justice [
9]. These organizations are spread across many application domains. In ecology and environmental management [
10,
11,
12,
13], researchers need tools to classify remote sensing images to map forest coverage. In public safety [
14], crime analysts are interested in discovering hotspot patterns from crime event maps so as to effectively allocate police resources. In transportation [
15], researchers analyze historical taxi GPS trajectories to recommend fast routes between places. Epidemiologists [
16] use spatiotemporal data mining techniques to detect disease outbreaks. There are also other application domains such as earth science [
17], climatology [
1,
18], precision agriculture [
19], and Internet of Things [
20].
The interdisciplinary nature of spatiotemporal data mining means that its techniques must be developed with awareness of the underlying physics or theories in the application domains [
21]. For example, climate science studies find that observable predictors for climate phenomena discovered by data science techniques can be misleading if they do not take into account climate models, locations, and seasons [
22]. In this case, statistical significance testing is critically important in order to further validate or discard relationships mined from data.
Challenges: In addition to interdisciplinary challenges, spatiotemporal data mining also poses statistical and computational challenges. Extracting interesting and useful patterns from spatiotemporal datasets is more difficult than extracting corresponding patterns from traditional numeric and categorical data due to the complexity of spatiotemporal data types and relationships. According to Tobler’s first law of geography, “Everything is related to everything else, but near things are more related than distant things.” For example, people with similar characteristics, occupations, and backgrounds tend to cluster together in the same neighborhoods. In spatial statistics, such spatial dependence is called the spatial autocorrelation effect. Ignoring autocorrelation and assuming independent and identically distributed (i.i.d.) samples when analyzing data with spatial and spatiotemporal characteristics may produce hypotheses or models that are inaccurate or inconsistent with the data set [
23]. In addition to spatial dependence at nearby locations, phenomena of spatiotemporal tele-coupling also indicate long range spatial dependence such as El Niño and La Niña effects in the climate system. Another challenge comes from the fact that spatiotemporal datasets are embedded in continuous space and time, and thus many classical data mining techniques assuming discrete data (e.g., transactions in association rule mining) may not be effective. A third challenge is the spatial heterogeneity and temporal non-stationarity,
i.e., spatiotemporal data samples do not follow an identical distribution across the entire space and over all time. Instead, different geographical regions and temporal periods may have distinct distributions. The modifiable areal unit problem (MAUP), or multi-scale effect, is another challenge, since the results of spatial analysis depend on the choice of spatial and temporal scales. Finally, flow and movement and the Lagrangian frame of reference in spatiotemporal networks pose challenges (e.g., directionality, anisotropy, etc.).
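The spatial autocorrelation effect described above can be made concrete with a small numerical sketch. The code below computes global Moran's I, a standard spatial autocorrelation statistic, for two toy sequences of six cells on a line. The function names, the binary adjacency weights, and the data values are illustrative assumptions, not material from the surveyed work.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]                      # deviations from the mean
    w_sum = sum(sum(row) for row in weights)            # total weight
    num = sum(weights[i][j] * z[i] * z[j]
              for i in range(n) for j in range(n))      # weighted cross-products
    den = sum(d * d for d in z)                         # total squared deviation
    return (n / w_sum) * (num / den)


def chain_weights(n):
    """Binary adjacency for n cells on a line (assumed toy neighborhood)."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = w[i + 1][i] = 1.0
    return w


w = chain_weights(6)
clustered = [1, 1, 1, 5, 5, 5]      # similar values sit next to each other
alternating = [1, 5, 1, 5, 1, 5]    # neighbors always differ

i_clustered = morans_i(clustered, w)
i_alternating = morans_i(alternating, w)
print(i_clustered, i_alternating)
```

Both sequences contain the same values, so a method that treats the samples as i.i.d. cannot tell them apart, whereas the statistic is positive for the clustered arrangement and negative for the alternating one.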
Previous surveys: As shown in
Figure 2, surveys in spatial and spatiotemporal data mining can be categorized into two groups: those without a focus on statistical foundations, and those with such a focus. Among the surveys without a focus on statistical foundations, Koperski
et al. [
24] and Ester
et al. [
25] reviewed spatial data mining from a spatial database approach; Roddick
et al. [
12] provided a bibliography for spatial, temporal and spatiotemporal data mining; Miller
et al. [
26] covered a list of recent spatial and spatiotemporal data mining topics, but without a systematic view of their statistical foundations. Among surveys covering statistical foundations, Shekhar
et al. 2003 [
23] reviewed several spatial pattern families focusing on spatial data’s unique characteristics; Kisilevich
et al. [
27] reviewed spatiotemporal clustering research; Aggarwal
et al. [
28] have a chapter summarizing spatial and spatiotemporal outlier detection techniques; Zhou
et al. [
29] reviewed spatial and spatiotemporal change detection research from an interdisciplinary view; Cheng
et al. 2014 [
30] reviewed state-of-the-art spatiotemporal data mining research including spatiotemporal autocorrelation, space-time forecasting and prediction, space-time clustering, as well as space-time visualization; Shekhar
et al. 2011 [
31] give the most recent review of spatial data mining research. However, its discussion of spatiotemporal patterns is limited (e.g., nothing on spatiotemporal change patterns or statistically significant hotspots). In summary, there is no current survey in the literature that provides a systematic overview of spatiotemporal data mining that covers its statistical foundations as well as all major spatiotemporal pattern families.
Figure 2.
Categorization of spatial and spatiotemporal data mining surveys.
We hope this survey contributes to spatiotemporal data mining research by filling these two gaps, namely the lack of systematic coverage of statistical foundations and of all major spatiotemporal pattern families. More specifically: (1) we provide a taxonomy of spatiotemporal data types; (2) we provide a taxonomy of spatial and spatiotemporal statistics organized by different data types; (3) we survey common computational techniques for all major spatiotemporal pattern families, including spatiotemporal outliers, spatiotemporal coupling and tele-coupling, spatiotemporal prediction, spatiotemporal partitioning and summarization, spatiotemporal hotspots and change patterns. Within each pattern family, techniques are categorized by the input spatiotemporal data types; (4) we analyze the research trends and future research needs.
Organization of the paper: This survey starts with the characteristics of the data inputs of spatiotemporal data mining (
Section 2) and an overview of its statistical foundation (
Section 3). It then describes in detail six main output patterns of spatiotemporal data mining related to outliers, association and tele-coupling, prediction, partitioning and summarization, hotspot, and change patterns (
Section 4). Common software tools are discussed in
Section 5. The paper concludes with an examination of research needs and future directions in
Section 6.
5. Spatial and Spatiotemporal Analysis Tools
This section lists existing spatial and spatiotemporal analysis tools, including geographic information system (GIS) software, spatial and spatiotemporal statistical tools, spatial database management systems, as well as spatial big data platforms.
GIS Software: ArcGIS [
175] is currently the most widely used commercial GIS software for working with maps and geographic information. It has an extension named Tracking Analyst to support visualization and analysis of spatiotemporal data. QGIS [
176] (previously Quantum GIS) is a popular open-source GIS package.
Spatial Statistical Tools: R provides many packages for spatial and spatiotemporal statistical analysis [
177], such as
spatstat for point pattern analysis,
gstat and
geoR for Geostatistics,
and spdep for areal data analysis. MATLAB also provides the Mapping Toolbox [
178] and other spatial statistical toolboxes. SAS has recently added support for spatial statistics [
179], such as the KRIGE2D procedure for kriging, the SIM2D procedure for Gaussian random fields, the SPP procedure for spatial point patterns, and the VARIOGRAM procedure for variograms.
Spatial Database Management Systems: Many commercial databases provide extensions to support spatial data, such as Oracle Spatial [
180], and DB2 Spatial Extender [
181]. PostGIS [
182] is a widely used open-source spatial database management system, built as an extension to PostgreSQL, an object-relational DBMS.
Spatial Big Data Platforms: Emerging spatial big data from vehicle GPS trajectories, cellphone location data, and remote sensing imagery exceeds the capabilities of traditional spatial DBMS and requires new platforms to support scalable spatial analysis. Current spatial big data platforms include ESRI GIS Tools for Hadoop [
183,
184], Hadoop GIS [
185], and SpatialHadoop [
186].
6. Research Trend and Future Research Needs
Most current research in spatiotemporal data mining uses Euclidean space, which is often assumed to be isotropic with symmetric neighborhoods. However, in many real-world applications, the underlying space is a network space, such as river networks and road networks [
187,
188,
189]. One of the main challenges in spatial and spatiotemporal network data mining is to account for the network structure in the dataset. For example, in anomaly detection, spatial techniques do not consider the spatial network structure of the dataset, that is, they may not be able to model graph properties such as one-ways, connectivity, left-turns,
etc. The network structure often violates the isotropy and symmetry of neighborhoods, and instead requires asymmetric neighborhoods and directional neighborhood relationships (e.g., network flow direction).
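The asymmetry of network neighborhoods can be illustrated with a minimal sketch, assuming a hypothetical three-node road network with one-way segments. The node names, edge lengths, and helper functions below are invented for illustration. Neighborhoods are defined by shortest-path distance along directed edges, computed with Dijkstra's algorithm.

```python
import heapq


def network_distances(graph, source):
    """Shortest-path distances from source over directed, weighted edges."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, length in graph.get(u, []):
            nd = d + length
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def network_neighbors(graph, source, radius):
    """Nodes reachable from source within the given network distance."""
    dist = network_distances(graph, source)
    return {v for v, d in dist.items() if v != source and d <= radius}


# Hypothetical one-way loop: A -> B (1.0), B -> C (2.0), C -> A (2.0).
roads = {
    "A": [("B", 1.0)],
    "B": [("C", 2.0)],
    "C": [("A", 2.0)],
}

a_neighbors = network_neighbors(roads, "A", 1.5)
b_neighbors = network_neighbors(roads, "B", 1.5)
print(a_neighbors, b_neighbors)
```

In this toy network B lies in the neighborhood of A, but A does not lie in the neighborhood of B, because the only way back runs through C. A Euclidean distance threshold would always yield a symmetric relationship between the two nodes.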
Recently, some cutting-edge research has been conducted in spatial network statistics and data mining [
80]. For example, several spatial network statistical methods have been developed, e.g., the network K function and network spatial autocorrelation. Several spatial analysis methods have also been generalized to the network space, such as network point cluster analysis and the clumping method, network point density estimation, network spatial interpolation (Kriging), as well as the network Huff model. Because spatial network space is distinct from Euclidean space, these statistics and analyses often rely on advanced spatial network computational techniques [
80].
We believe more spatiotemporal data mining research is still needed in the network space. First, though several spatial statistics and data mining techniques have been generalized to the network space, few spatiotemporal network statistics and data mining techniques have been developed, and the vast majority of research is still in the Euclidean space. Future research is needed to develop more spatial network statistics, such as spatial network scan statistics, spatial network random field models, as well as spatiotemporal autoregressive models for networks. Furthermore, phenomena observed on spatiotemporal networks need to be interpreted in an appropriate frame of reference to prevent a mismatch between the nature of the observed phenomena and the mining algorithm. For instance, moving objects on a spatiotemporal network need to be studied from a traveler’s perspective,
i.e., the Lagrangian frame of reference [
190,
191,
192] instead of a snapshot view. This is because a traveler moving along a chosen path in a spatiotemporal network experiences each road segment (and its properties such as fuel efficiency, travel time, etc.) as it is at the time of arrival at that segment, which may differ from the departure time at the start of the journey. These unique requirements (non-isotropy and Lagrangian reference frame) call for novel spatiotemporal statistical foundations [
187] as well as new computational approaches for spatiotemporal network data mining.
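The difference between a snapshot view and the Lagrangian frame of reference can be sketched with a toy path evaluation. The segment names, the `travel_time` function, and its congestion rule are hypothetical values chosen only to make the contrast visible. The snapshot estimate reads every segment at the departure time, while the Lagrangian estimate advances a clock and reads each segment at the traveler's arrival time.

```python
def snapshot_travel_time(path, travel_time, departure):
    """Sum segment costs, all evaluated at the departure time."""
    return sum(travel_time(segment, departure) for segment in path)


def lagrangian_travel_time(path, travel_time, departure):
    """Sum segment costs, each evaluated when the traveler reaches it."""
    clock = departure
    for segment in path:
        clock += travel_time(segment, clock)
    return clock - departure


def travel_time(segment, t):
    """Assumed toy model: s2 takes 30 minutes from t = 10 onward, else 10."""
    if segment == "s2" and t >= 10:
        return 30
    return 10


path = ["s1", "s2"]
snapshot = snapshot_travel_time(path, travel_time, departure=0)
lagrangian = lagrangian_travel_time(path, travel_time, departure=0)
print(snapshot, lagrangian)
```

Under these assumptions the snapshot view underestimates the trip, because congestion on the second segment begins while the traveler is still on the first one.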
Another future research need is to develop spatiotemporal graph big data platforms, motivated by the rich spatiotemporal network data now being collected from vehicles. Modern vehicles have rich instrumentation to measure hundreds of attributes at high frequency and are generating big data (on the exabyte scale [
193]). This vehicle measurement big data (VMBD) consists of a collection of trips on a transportation graph, such as a road map, annotated with several measurements of engine sub-systems. Collecting and analyzing VMBD during real-world driving conditions can aid in understanding the underlying factors which govern real-world fuel inefficiencies or high greenhouse gas (GHG) emissions [
194]. Current relevant big data platforms for spatial and spatiotemporal data mining include ESRI GIS Tools for Hadoop [
183,
184], Hadoop GIS [
185],
etc. These provide distributed systems for geometric data (e.g., lines, points and polygons), including geometric indexing and partitioning methods such as R-tree, R+-tree, or quadtree. Recently, SpatialHadoop has been developed [
186]. SpatialHadoop embeds geometric notions in its language, visualization, storage, MapReduce, and operations layers. However, spatio-temporal graphs (STGs) violate the core assumption of current spatial big data platforms that geometric concepts are adequate for conveniently representing STG analytics operations and for partitioning data for load balancing. STGs also violate core assumptions underlying graph analytics software (e.g., Giraph [
195], GraphLab [
196] and Pregel [
197]) that traditional location-unaware graphs are adequate for conveniently representing STG analytics operations and for partitioning data for load balancing. Therefore, novel spatiotemporal graph big data platforms are needed. Several challenges should be addressed, e.g., spatiotemporal graph big data requires novel distributed file systems (DFS) to partition the graph, and a novel programming model is still needed to support abstract data types and fundamental STG operations,
etc.
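As a rough illustration of the geometric partitioning mentioned above, the following sketch splits a set of points with a simple quadtree until each cell holds at most a fixed number of points. It is a minimal sketch under stated assumptions, not the partitioning code of any of the cited platforms. The function name, the capacity, and the sample points are hypothetical.

```python
def quadtree_partition(points, bounds, capacity, max_depth=16):
    """Split bounds into quadtree leaves holding at most `capacity` points.

    Returns a list of (bounds, points) leaves. The depth limit guards against
    endless splitting when many points share the same coordinates.
    """
    xmin, ymin, xmax, ymax = bounds
    if len(points) <= capacity or max_depth == 0:
        return [(bounds, points)]
    xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2
    quadrants = {
        (False, False): (xmin, ymin, xmid, ymid),   # south-west
        (True, False): (xmid, ymin, xmax, ymid),    # south-east
        (False, True): (xmin, ymid, xmid, ymax),    # north-west
        (True, True): (xmid, ymid, xmax, ymax),     # north-east
    }
    buckets = {key: [] for key in quadrants}
    for x, y in points:
        buckets[(x >= xmid, y >= ymid)].append((x, y))
    leaves = []
    for key, cell in quadrants.items():
        leaves.extend(
            quadtree_partition(buckets[key], cell, capacity, max_depth - 1)
        )
    return leaves


points = [(1, 1), (2, 2), (1, 2), (9, 9), (8, 1)]
leaves = quadtree_partition(points, (0, 0, 10, 10), capacity=2)
non_empty = [(cell, pts) for cell, pts in leaves if pts]
print(non_empty)
```

The partition balances load by spatial density alone. It uses no information about which points are connected by edges or when those edges are traversed, which is the kind of limitation the text raises for spatio-temporal graphs.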