Spatial Data Science

The field of data science has had a significant impact in both academia and industry, and with good reason. The ability to make use of large amounts of data to find solutions for pressing problems in society, the environment, and business constitutes both an opportunity and a challenge. Data are our best prospect for significantly improving our understanding of the world, easing the friction in human/environment interaction, optimizing resource allocation, and mitigating human suffering and deprivation.
Recently, there have been many examples of the "unreasonable effectiveness of data" (Halevy et al. 2009 [1]), where sizable high-quality datasets unlock the solution to difficult and perennial problems. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al. 2015 [2]) is probably one of the most spectacular examples of how data can play a pivotal role in advancing a whole field of research. The competition, which ran from 2010 to 2017, completely transformed the landscape of image recognition in a mere seven years. In this period, the winning accuracy in the classification of objects in the dataset rose from 71.8% to 97.3%, and the performance gap between teams narrowed drastically: in the last year of the competition, 29 of the 38 teams achieved an accuracy rate above 95% (for an interesting account of the competition's origins and development, see Gershgorn 2017 [3]). The spectacular results of this competition promoted a paradigm shift in which data take center stage, and their impact on improving the performance of existing models and on the development of new and better ones became evident. After all, the idea is to let the data do the heavy lifting (Domingos 2012 [4]), and large, high-quality datasets proved equal to the task.
The ImageNet competition example is far from unique as far as the relevance of data goes, but it is especially appropriate when talking about spatial data science. In fact, the ImageNet story also holds an interesting lesson for spatial data science, related to the pivotal role of convolutional neural networks in the results of the competition. The year 2012 was a turning point, with the results achieved by AlexNet (Krizhevsky et al. 2012 [5]), which beat the competition by a massive 10.8% margin. This feat was probably one of the most critical events in the establishment of the deep learning phenomenon and in the renewed interest in machine learning and artificial intelligence. From 2012 onwards, convolutional neural networks (CNNs) dominated the competition and spread into many other areas of application. However, the compelling aspect of CNNs for this Special Issue, and for spatial data science in general, is the smart way in which they take into account the spatial structure of data, effectively encoding the first law of geography ("everything is related to everything else, but near things are more related than distant things." (Tobler 1970 [6])) into the algorithm.
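The locality that lets a CNN encode Tobler's law can be illustrated with a plain 2D convolution. In the minimal sketch below (NumPy only; the 3x3 averaging kernel is a hypothetical stand-in for a learned filter), each output cell is computed exclusively from the values in its local window, so nearby cells influence the result while distant ones do not:

```python
import numpy as np

def convolve2d(grid, kernel):
    """Naive 2D convolution ('valid' mode): every output cell mixes only
    the values inside the kernel's local window -- near things matter,
    distant things do not."""
    kh, kw = kernel.shape
    gh, gw = grid.shape
    out = np.zeros((gh - kh + 1, gw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(grid[i:i + kh, j:j + kw] * kernel)
    return out

# A 3x3 averaging kernel: a crude, illustrative stand-in for a CNN filter.
kernel = np.full((3, 3), 1.0 / 9.0)
grid = np.arange(25, dtype=float).reshape(5, 5)
smoothed = convolve2d(grid, kernel)
print(smoothed[0, 0])  # 6.0: the mean of the 3x3 neighborhood grid[0:3, 0:3]
```

A learned CNN filter replaces the fixed averaging weights with trainable ones, but the neighborhood structure, and hence the spatial inductive bias, is the same.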
The data deluge and the consequent digital transformation processes in the economy and society [7] have also created new opportunities and challenges in the study of geographical phenomena. Given the plethora of georeferenced data collected today by sensors and people, the transition from theory-driven research to data-driven research has been discussed in the literature (Miller and Goodchild 2015 [8]: "geographic research has shifted from a data-scarce to a data-rich environment"). This view is amplified by the emergence of the so-called fourth paradigm of science: after experimental science, theoretical science, and computational science (the simulation of complex phenomena) comes data science (data-intensive science) (Hey et al. 2009 [9]; Kitchin 2014 [10]).
While in the 1980s and 1990s the geographic information science community debated whether there was something special about spatial data (Gahegan 2003 [11]; Anselin 1990 [12]; Bação et al. 2005 [13]), today the question does not seem so relevant, as data science is forced to deal with a myriad of data types, most of which suffer from pathologies similar to those of spatial data. Take the example of spatial dependence, which can be seen as a particular form of dependency between observations. The problem is not one of violating the independence assumption, as most data science methods are essentially assumption-free. The problem is that, if we do not account for spatial dependence in the model, the results will probably never be very good or very relevant. Here we assume that every phenomenon is defined by a process and expressed in a context, where the process represents the factors underlying the phenomenon and the context represents the frame in which the phenomenon is observed (e.g., space and time). Spatial dependence indicates that the context has a meaningful impact on the process: the phenomenon at a particular location is a function of the underlying factors, but also of the intensity of that same phenomenon at neighboring locations. This adds complexity to the analysis, for it would be much simpler to concentrate our attention on the underlying factors and assume a neutral context. It is also the reason why spatial data science needs to produce spatially explicit models.
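This neighborhood effect can be made concrete with the spatial lag familiar from spatial econometrics. The toy example below (hypothetical intensities and a hand-made contiguity matrix, for illustration only) computes Wy, the average intensity of the phenomenon at each location's neighbors, which is the extra term a spatially explicit specification such as y = ρWy + Xβ + ε adds to an aspatial model:

```python
import numpy as np

# Hypothetical toy data: four locations on a line, with rook contiguity
# (each location neighbors its immediate left and right).
y = np.array([1.0, 2.0, 4.0, 8.0])  # observed intensity of a phenomenon

# Binary contiguity matrix W (1 = neighbor), row-normalized so each row
# sums to one -- the standard construction for spatial weights.
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
W /= W.sum(axis=1, keepdims=True)

# The spatial lag Wy: the mean intensity at each location's neighbors.
spatial_lag = W @ y
print(spatial_lag.tolist())  # [2.0, 2.5, 5.0, 4.0]
```

An aspatial model would regress y on the underlying factors alone; a spatially explicit one also lets each location borrow information from its neighbors through Wy.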
The question now is what we mean by spatially explicit models. According to Goodchild (2001 [14]), these are models that are not invariant under relocation, that include spatial representations in their implementations and spatial concepts in their formulations, and whose inputs and outcomes have different spatial structures. The important thing about spatially explicit models is that they harness the geographic frame to produce better results whenever space is the relevant context in which the phenomenon is expressed. Therefore, building spatially explicit models in spatial data science is not so much a philosophical question as a utilitarian one.
Several authors (Miller and Goodchild 2015 [8]; Li et al. 2015 [15]; Jiang and Shekhar 2017 [16]) have already highlighted that spatial data science must support decision making in a meaningful way and should not aim to replace human decisions, which call for intelligence and skepticism (see Miller and Goodchild 2015 [8] on the risk of a 'data dictatorship'). Thus, the knowledge and theories of the disciplines should not be ignored in the course of spatial analyses; otherwise, the results (e.g., patterns and correlations in data) discovered by (big data) algorithms quickly tend to be uninteresting and of little use (Jiang and Shekhar 2017 [16]): "Ignoring domain knowledge and theories, patterns discovered by spatial big data science algorithms may be spurious."

The collection of papers accepted for this Special Issue is broad and eclectic, dealing with topics that range from motion activity and trajectories to epidemic spreading. Some papers focus more on developing theoretical aspects, others on real-world applications, although all of them report experimental results. We are sure that the International Journal of Geo-Information reader will find some exciting and thought-provoking ideas in this Special Issue.
The paper "Spatio-Temporal Analysis of Intense Convective Storms Tracks in a Densely Urbanized Italian Basin" (Sangiorgio and Barindelli 2020 [17]) combines the spatial and temporal dimensions to identify the most favorable conditions for the formation of convective events. Intense convective storms usually produce large rainfall volumes in short time periods, leading to more frequent floods and the corresponding damage. The use of visualization solutions allows for an improved understanding of the phenomenon and for identifying the geographic areas where these convective thunderstorms are more frequent.
The paper "Analyzing Road Coverage of Public Vehicles According to Number and Time Period for Installation of Road Inspection Systems" (Sangiorgio et al. 2020 [18]) deals with the problem of using sensors to monitor aging road infrastructure efficiently. The authors propose a methodology to automate road inspection based on a smartphone-based system, and analyze long-term global positioning system (GPS) data collected from public vehicles in two Japanese cities. They conclude that, with only a fraction of the public vehicles, full coverage of the road inspection area can be achieved efficiently.
Living through the current pandemic, we are all too aware of the relevance of having appropriate spatio-temporal tools to identify, understand, and promptly react to the spread of pathogens. Hamer et al. 2020 [19] propose papros, an R package for spatio-temporal prediction based on local data, using various deterministic, geostatistical regionalization, and machine learning methods. To showcase the package, the authors present a use case based on the prediction of powdery mildew infestation events.
Moreover, "Quantitative Identification of Urban Functions with Fishers' Exact Test and POI Data Applied in Classifying Urban Districts: A Case Study within the Sixth Ring Road in Beijing" (Yi et al. 2019 [20]) puts forward a quantitative methodology to identify urban functions. The authors use Fisher's exact test and point of interest (POI) data, and apply the methodology to classify the urban districts within the Sixth Ring Road in Beijing based on their urban functions. After applying a k-modes clustering algorithm, the authors identify four main groups of districts according to their urban functions.
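The core of such a test can be sketched in a few lines. The function below is a stdlib-only illustration, not the authors' implementation (in practice one would typically reach for scipy.stats.fisher_exact); it computes the right-tail p-value of Fisher's exact test for a 2x2 table, e.g., asking whether a POI category is over-represented in a district relative to the rest of the city. The counts in the example are hypothetical:

```python
from math import comb

def fisher_right_tail(a, b, c, d):
    """Right-tail Fisher's exact test p-value for the 2x2 table
    [[a, b], [c, d]]: the probability, under the hypergeometric null
    with fixed row and column totals, of a top-left count >= a."""
    row1, col1, n = a + b, a + c, a + b + c + d
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        # Hypergeometric probability of exactly k in the top-left cell.
        p += comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)
    return p

# Hypothetical counts: restaurant POIs vs. other POIs, inside vs. outside
# one district. A small p-value would suggest the category is
# over-represented in the district.
p = fisher_right_tail(8, 12, 22, 158)
print(p)
```

The exactness of the test (no large-sample approximation) is what makes it attractive for districts with few POIs of a given category.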
Dealing with trajectory data continues to be a challenge, and there are still many problems to tackle before relevant and accurate knowledge can be extracted from it. Pulshashi et al. [21] propose an application to simplify trajectory data in both batch and streaming environments in their paper "Simplification and Detection of Outlying Trajectories from Batch and Streaming Data Recorded in Harsh Environments." The application seeks to reduce noise, especially the outlying point-locations that can mislead the analysis and alter the statistical properties of trajectories. The authors conclude with an experimental evaluation of the proposed method and a comparison with other outlier detection algorithms.
Finally, the last paper of this Special Issue (Crivellari and Beinat 2019 [22]) uses motion traces to build a behavioral portrait of places based on how people move between them. In their proposal, the authors ignore geographical coordinates and spatial proximity and, building on the word2vec concept, create a motion-to-vector (Mot2vec) model. They start by transforming the original trajectories into sequences of locations, and then use the skip-gram word2vec model to build the location embeddings. According to the authors, these embeddings constitute a meaningful representation of locations, "allowing a direct way of comparing locations' connections and providing analogous similarity distributions for places of the same type."

With this Special Issue of the ISPRS International Journal of Geo-Information, devoted to spatial data science, we hope to contribute to promoting the discussion and interest around the role of the spatial dimension in data science. More importantly, we hope that this volume encourages the geographic information science community to become (even more) involved and to contribute to the advance of this exciting and thriving field.