Machine Learning of Spatial Data

Abstract: Properties of spatially explicit data are often ignored or inadequately handled in machine learning for spatial domains of application. At the same time, resources that would identify these properties, investigate their influence, and present methods to handle them in machine learning applications are lagging behind. In this survey of the literature, we seek to identify and discuss the spatial properties of data that influence the performance of machine learning. We review some of the best practices in handling such properties in spatial domains and discuss their advantages and disadvantages. We recognize two broad strands in this literature. In the first, the properties of spatial data are encoded in the spatial observation matrix without amending the substance of the learning algorithm; in the other, spatial data properties are handled in the learning algorithm itself. While the latter approaches have been far less explored, we argue that they offer the most promising prospects for the future of spatial machine learning.


Introduction
Machine learning (ML) has become a widely used approach in almost every discipline to solve a broad range of tasks and problems with structured and unstructured data, including but not limited to regression, grouping, classification, and prediction. It has proved itself to be a powerful and effective tool in various disciplinary fields and domains of application where spatial aspects are essential, including the following: land use and land cover classification [1,2], cross-sectional characterization [3,4] and longitudinal change [5], urban growth [6] and gentrification [7], disaster management [8], agriculture and crop yield prediction [9], infectious disease emergence and spread [10], transportation and crash analysis [11], map visualization and cartography [12,13], delineation of geographic regions [14] and habitat mapping [15], geographic information retrieval and text matching [16], POI and region recommendation [17], trajectory and movement pattern prediction [18], point cloud classification [19], spatial interaction [20], spatial interpolation [21], and spatiotemporal prediction [22][23][24].
Spatial data exhibit certain distinctive properties that set them apart from other data types, such as spatial dependence, spatial heterogeneity, and scale. As in other modeling approaches, we need to be aware of the specificities that these properties entail when we conduct ML on spatial data. Indeed, the explicit handling of these spatial properties can improve the performance of the ML model or add meaningful insights into the process of learning a task. At the same time, failure to appropriately include these properties into the ML model can negatively impact learning [11,[21][22][23][24][25][26].
Surveys of the extant literature have previously been conducted on several themes intersecting with spatial ML, namely knowledge discovery and data mining [27][28][29], spatial prediction methods [30], artificial neural networks in geospatial analysis [31], ML for spatial environmental data [25], and active deep learning for geo-text and image classification [32]. While their contribution to furthering scholarship in spatial ML is unquestionable, these surveys are rather limited in scope. For instance, while there are overlaps between data mining and ML, they have distinct definitions, follow different processes, and have different goals. The emphasis of other literature surveys is on a specific discipline or learning method. In addition, the focus has usually been on applications; a detailed discussion of spatial properties and their role in the ML process is still missing. At this juncture, there is extensive literature that applies ML to spatial data, but research that explicitly features the spatial properties of data in ML remains rather limited. Thus, this paper aims to review some of the best recent practices in ML of spatial data, considering both the learning process and the properties of spatial data, guided by the following questions:
• How much progress in handling spatial properties of data in ML has been accomplished?
• What are some of the best practices in this respect?
• What gaps may remain in this literature, and where do opportunities exist for future research with, and on, spatially explicit ML?
Figure 1a shows our review framework. An initial bibliography search (search strategy 1) through article titles, review papers, and book chapters for exact combinations of the keywords 'spatial heterogeneity', 'spatial dependence', 'scale', 'machine learning', 'neural network', and 'deep learning' in the Web of Science abstract and citation database resulted in 923 records after filtering out duplicates.
Analysis of the records indicated that, while significant attention has been directed toward the machine learning of spatial data in the past few years (Figure 1b), not much attention has so far explicitly been devoted to the fundamental properties of spatial data (Figure 1c). Based on this observation, most of the previous literature that applies ML to spatial data either ignores the spatial properties of the data or addresses them only marginally. Given the focused objective and research questions of this paper, and the fact that spatial properties have usually been addressed marginally in the literature, a systematic review with keywords may not be the best way to search the literature. Thus, we conduct a more focused literature search, mainly in journals related to spatial science. This second strategy is useful for reducing the search space and finding papers that go beyond application and aim to address properties of spatial data, even marginally, as part of their research questions. Thus, we design a repetitive search process in high-quality journals, including but not limited to the International Journal of Geographical Information Science (IJGIS), Transactions in GIS, GIScience & Remote Sensing, Computers, Environment and Urban Systems, Journal of Spatial Science, Remote Sensing of Environment, ISPRS International Journal of Geo-Information, Cartography and Geographic Information Science, Remote Sensing, and ISPRS Journal of Photogrammetry and Remote Sensing. We initially search combinations of the keywords 'machine learning', 'deep learning', 'heterogeneity', 'spatial dependence', and 'scale' to extract papers. Then, we go through their abstracts to see whether they account for properties of spatial data, and extract those with a direct relation to the objective of this paper. We then scan through these papers.
If a new challenge is raised in a paper, the related keywords are extracted, and a supplementary search in the journals or in Google Scholar is conducted to find related literature. This process is repeated until the topic is sufficiently discussed; otherwise, we consider it an existing gap in the literature. Within the final database, we divide the related literature into two main categories. The first category includes processes that handle spatial data within the spatial observation matrix. The second category instead handles the spatial properties of data in the learning process. Depending on the target application and problem, one may employ one or several of the approaches presented here to account for the spatial properties of data. Interested readers can explore more deeply the topics broached in this paper by following some of the references provided herein. Furthermore, it is not our intention to provide a general comparison between ML methods; for such a purpose, interested readers are referred to [5]. The rest of this paper is organized as follows. We begin with a brief overview of ML (Section 2) and of the spatial properties of data (Section 3). Then, we lay out the process of ML of spatial data (Section 4) in two steps, entailing the construction of a spatial observation matrix (Section 4.1) and a learning algorithm (Section 4.2). Section 5 is concerned with the ML of spatiotemporal data. Sections 4 and 5 serve to answer the first two questions at the core of this review paper. Finally, we summarize the review work conducted for this paper, discuss the gaps in the current state of knowledge and practice in ML for spatial data, and identify possible areas of fruitful research for the future (Section 6). The third question that motivates this paper is answered in that final section.

Machine Learning
ML can broadly be defined as the capability of a computer program to improve automatically with experience via the performance of certain tasks. A performance measure quantifies this experience, and if it improves, we say that the machine is learning [33]. As shown in Figure 2, ML can be classified into three types: unsupervised, supervised, and reinforcement learning. In learning tasks, we usually aim to estimate one or more output variables Y = (Y_1, ..., Y_m) for a given set of input variables X = (X_1, ..., X_n). When the desired output variables Y are in hand, the learning task is dubbed supervised learning or, metaphorically, learning from a teacher [34]. In other words, we know the correct answer, and we try to learn the dependency between the input variables X and Y. At the same time, one can take (X, Y) as random variables, since many factors may influence data and measurements, making the whole setting stochastic [25]. Then, we can represent these random variables by a joint probability density P(X, Y). In this case, supervised learning is concerned with determining the properties of the conditional density function P(Y|X). Output variables Y can be class labels (classification) or real numbers (regression), possibly resulting from the coding of unstructured data [35].
When the desired target variables Y are not available, the learning task is called unsupervised learning or learning without a teacher [34]. In this case, training only involves observations of the random variables X with joint density P(X), and the goal is to directly infer the properties of this probability density function. These properties may include the joint values of the variables X = (X_1, ..., X_n) that frequently appear in the data (association rules), the grouping or segmentation of a collection of similar or dissimilar samples into subsets (clustering), or the projection of the data into lower dimensions, usually ranked by variance (dimensionality reduction). Unsupervised learning is typically concerned with finding patterns in the data and is thus close to data mining and knowledge discovery in databases [35].
Existing somewhere between supervised and unsupervised learning, semi-supervised learning extends either paradigm to include additional information typical of the other. The two main settings of semi-supervised learning are semi-supervised classification and constrained clustering. The former is a classification task with partially labeled data (usually useful when the training sample size is small). The latter is unsupervised learning with some sort of supervised information about the clusters [36]. The recently popular active learning methods are a subset of semi-supervised machine learning. The idea behind active learning is that a machine can learn with less training data if it is allowed to choose from the training data set by asking questions [37]. A comprehensive review of these methods for application in (geo) text and image classification is available in [32].
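The question-asking loop behind active learning can be illustrated with a small, self-contained sketch (purely hypothetical data and a simple nearest-centroid learner chosen for brevity; this is not a method from the works cited above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unlabeled pool: two 2-D Gaussian classes (illustrative data only).
pool_X = np.vstack([rng.normal([-2, 0], 1.0, size=(100, 2)),
                    rng.normal([+2, 0], 1.0, size=(100, 2))])
pool_y = np.array([0] * 100 + [1] * 100)

def score(X, y, query):
    """Nearest-centroid decision value: distance to the class-0 centroid minus
    distance to the class-1 centroid (values near zero are ambiguous)."""
    c0, c1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    return (np.linalg.norm(query - c0, axis=1)
            - np.linalg.norm(query - c1, axis=1))

labelled = [0, 100]                      # one seed label per class
unlabelled = [i for i in range(200) if i not in labelled]

for _ in range(10):                      # query budget: ten "questions"
    s = score(pool_X[labelled], pool_y[labelled], pool_X[unlabelled])
    pick = unlabelled[int(np.argmin(np.abs(s)))]   # most ambiguous point
    labelled.append(pick)                # the oracle supplies pool_y[pick]
    unlabelled.remove(pick)

pred = (score(pool_X[labelled], pool_y[labelled], pool_X) > 0).astype(int)
accuracy = float((pred == pool_y).mean())
print(f"accuracy after 12 labels: {accuracy:.2f}")
```

With only 12 labels, queried at the decision boundary rather than drawn at random, the toy learner already separates the pool well, which is the intuition behind learning "by asking questions".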
Finally, in reinforcement learning, the machine produces some actions and interacts with its environment. These actions affect the state of the environment and, in turn, it receives some scalar rewards (positive reinforcement learning) or punishments (negative reinforcement learning). The goal of the machine is to learn to act in a way that maximizes (minimizes) the future rewards (punishments) [35]. Deep reinforcement learning has been used for navigation tasks [38], autonomous driving [39], spatial dynamic and agent based systems, and some other domains [40,41].
Regardless of the type of ML and its task, when spatial data are used, we need to account for their properties. Before reviewing the best practices on ML of spatial data, we review the main properties of spatial data in the next section.

Spatial Data Properties
In machine learning, observations are represented by a matrix X where the rows are instances (samples) of a phenomenon under study, and columns are different attributes associated with each of these instances. The same applies to spatial data, but the samples on each row are also referenced to a specific location in the geographic space. To define the relationship between the real world and this matrix, we can choose between two well-known views of the world, namely, the field view or the object view [42]. Field entities are usually represented by regular grids and object entities by points, lines, or polygons. Being referenced to a specific location in space creates unique properties for spatial data that geographers and econometricians have studied over the years [43][44][45]. There is a broad agreement in the literature that there are three fundamental properties for spatial data: spatial dependence, spatial heterogeneity, and scale.

Spatial Dependence
Named the first law of geography by [46], "near things are more related than distant things" formulates the first property of spatial data, which is known as spatial dependence. Spatial dependence is a fundamental and useful property of spatial data that stems from the general continuity of space. A variety of ways exist to express and measure spatial dependence in data sets. For example, a spatial autocorrelation statistic can summarize the similarity of the values for a variable of interest at different locations as a function of the distance that separates them or of their adjacency to each other [47,48]. The Global Moran's I is an indicator of spatial autocorrelation and is a similarity index, which is usually used for areal data. It is calculated based on the cross-product of variations from the mean [49]:

I = \frac{\sum_{i}\sum_{j} W_{S_i S_j}\,\left(Z(S_i) - \bar{Z}\right)\left(Z(S_j) - \bar{Z}\right)}{\sigma^{2} \sum_{i}\sum_{j} W_{S_i S_j}}

where Z(S_i) and Z(S_j) are the attribute values at sample locations S_i and S_j, respectively, \bar{Z} is the sample mean, and \sigma^2 is the sample variance. The weight matrix element W_{S_i S_j} represents the spatial relationship between paired geographic units, which can be defined based on contiguity (the topological relationship of spatial units) or on distance (determined on a physical or social network, or in 2D or 3D Cartesian space). The I index takes positive or negative values between −1 and +1. A positive value means that similar values occur in close proximity, while a negative value shows that dissimilar values are spatially grouped close to each other. A value close to zero shows no spatial autocorrelation for the variable, indicative of a spatially random process; in the latter case, the assumption of independence, essential for many statistical methods, is met. More measures of spatial autocorrelation have also been suggested, such as Geary's C [50], which considers the square of the variations of the variable between two locations.
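As an illustration, Global Moran's I can be computed directly from this definition. The sketch below uses a hypothetical chain of six areal units with contiguity weights (illustrative values only):

```python
import numpy as np

def morans_i(z, w):
    """Global Moran's I: weighted cross-product of deviations from the mean,
    normalized by the sample variance and the sum of the weights."""
    z = np.asarray(z, dtype=float)
    w = np.asarray(w, dtype=float)
    n = z.size
    dev = z - z.mean()
    num = (w * np.outer(dev, dev)).sum()   # sum_ij W_ij * dev_i * dev_j
    den = (dev ** 2).sum()                 # n * sample variance
    return (n / w.sum()) * num / den

# Chain of six areal units with rook-style contiguity weights.
w = np.zeros((6, 6))
for i in range(5):
    w[i, i + 1] = w[i + 1, i] = 1

print(morans_i([1, 1, 1, 5, 5, 5], w))   # clustered values -> positive I
print(morans_i([1, 5, 1, 5, 1, 5], w))   # alternating values -> negative I
```

The clustered arrangement yields a positive index, while the perfectly alternating arrangement yields I = −1, matching the interpretation given above.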
Alternatively, semi-variogram analysis (see [51]) can be conducted to estimate the spatial autocorrelation structure of a stochastic process based on the variance of the observations over a range of distances (usually for geographic phenomena represented by points). The empirical semi-variogram \hat{\gamma}(h) is defined as half of the average squared difference between the values of a random field Z at pairs of points S_i and S_j separated by a given distance h in a region:

\hat{\gamma}(h) = \frac{1}{2\,d(h)} \sum_{(i,j):\,\|S_i - S_j\| = h} \left( Z(S_i) - Z(S_j) \right)^2

where d(h) is the number of pairs of points located at distance h from each other. The semi-variogram is visually analyzed in a variogram. A range parameter is defined as the distance beyond which spatial autocorrelation disappears in the data, thus expressing the spatial dependence structure. In principle, spatial autocorrelation decreases as the distance among geographic units increases [52]. From a statistical point of view, spatial dependence violates the assumption of independent and identically distributed (i.i.d.) observations, central to many statistical methods. While this assumption is not required for ML methods, spatial dependence still needs to be considered for some applications and may significantly contribute to enhancing the quality of the learning process, as will be discussed later (see Section 4.1.1). Two more complex components of spatial dependence are the neighborhood effect and the spillover effect. While the spatial weight matrix is usually used to capture the spatial dependence of the dependent variable, spatial association rules, which are based on linguistic or topological (in mathematical language) rules, are used to capture these two more implicit spatial effects. Association rules such as "if block group A is next to a high-crime neighborhood, then block group A has high crime" (neighborhood effect) or "if block group A is next to a shopping mall, then block group A will experience high crime" (spillover effect) are examples of these implicit spatial effects [53,54].
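The empirical semi-variogram can likewise be estimated in a few lines by binning point pairs by their separation distance (a sketch with hypothetical 1-D transect data; illustrative only):

```python
import numpy as np

def empirical_semivariogram(coords, values, bins):
    """Half the average squared difference between values of Z at point
    pairs, with pairs grouped into the given distance bins."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    n = len(values)
    dists, sqdiffs = [], []
    for i in range(n):
        for j in range(i + 1, n):
            dists.append(np.linalg.norm(coords[i] - coords[j]))
            sqdiffs.append((values[i] - values[j]) ** 2)
    dists, sqdiffs = np.array(dists), np.array(sqdiffs)
    gamma = []
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (dists >= lo) & (dists < hi)   # pairs falling in this lag bin
        gamma.append(0.5 * sqdiffs[mask].mean() if mask.any() else np.nan)
    return np.array(gamma)

# Hypothetical transect: a smooth trend, so nearby points carry similar values.
coords = np.arange(20).reshape(-1, 1)
values = np.sin(coords.ravel() / 5.0)
gamma = empirical_semivariogram(coords, values, bins=[0, 2, 4, 8])
print(gamma)   # semivariance grows with lag under spatial dependence
```

Because the hypothetical field is spatially dependent, the estimated semivariance increases with the lag distance, which is exactly the structure a variogram plot would reveal.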
However, most traditional machine learning algorithms do not consider the impact that the spatial dependence property may have on learning these association rules. For example, while we expect the crime rate in a block to be more similar to that of its neighboring blocks than to that of blocks farther away, traditional machine learning algorithms, such as support vector machines (SVM) or neural networks, have not been designed to incorporate this characteristic [25].

Spatial Heterogeneity
All of the patterns that we can observe result from four main processes, namely, interaction, dispersion, diffusion, and exchange over geographic space [42]. A specific observed pattern can be the result of one or multiple processes. These processes may happen at different spatial as well as temporal scales, over varied durations, and with differentiated intensities. A mixture of these processes with place-based environmental factors and contexts (the existing patterns resulting from the interweaving of previous processes at that location) in different sub-regions creates even more complex patterns of outcomes [55,56].
Global measures of spatial autocorrelation may confirm the existence of positive or negative self-similarity with regard to distance, but this comes at the cost of a fundamental assumption. The parameters (mean and variance) of the random function representing the process are assumed to be constant, which means that the sample's distribution is even over the extent of the territory over which the data are generated. This is called the stationarity of the random function associated with that process [51], and when it is violated (a non-stationary process), the process is heterogeneous. In other words, a spatial process is said to be stationary when the difference between the values of an attribute is explained only by the distance between the points or units [57,58]. Another source of spatial heterogeneity arises when the spatial dependence differs across directions (anisotropy). For example, high precipitation patterns may be interrupted in a specific direction where the topography of the terrain (mountains) blocks the clouds [30]. Spatial heterogeneity is not limited to geo-spatial domains and has been studied in other fields, such as the molecular biology of the cell [59] and neuroscience [60].

Scale
Scale is also important because it can inform sampling for the training experience. Learning is more reliable when the distribution of the samples in the training experience is similar to the distribution of the test experience [33]. In many geographic studies, training occurs on data from a specific geographic area. This makes it challenging to use the trained model for other geographic regions, because the distributions of the test and training data sets are not similar, due to spatial heterogeneity [61]. This means that the sampling strategy for the training data set is essential to cover the heterogeneity of the phenomena of interest over the spatial frame of study. Scale can inform the sampling of the training data set by defining its elements, namely, resolution (measurement scale), context (the scale at which the process is operating), and spatial extent (the extent of observation) [62]. By increasing the extent of the study area, more processes and contextual environmental factors may alter the variable and result in non-stationarity, by interweaving spatial patterns of different scales or through inconsistent effects of processes in different regions [63]. This is especially important because collecting spatial samples is an expensive undertaking.
There are two distinct challenges related to scale and zoning when working with spatial data with areal units: the modifiable areal unit problem (MAUP) and the uncertain geographic context problem (UGCoP). The first notion is related to the sensitivity of the analytical results to the definition of the geographic units for which data are collected [64].
Ref. [65] demonstrated how different levels of spatial aggregation and zoning can result in different values of the correlation coefficient for areal units. They generated two variables with high positive and negative spatial autocorrelation and investigated the effect of varying levels of aggregation on the correlation coefficient. The results show that, even in the absence of spatial autocorrelation, the correlation coefficient increases with grouping and aggregation. Attempts have been made to provide solutions for the MAUP based on the size and interconnectedness of the areas [66] and on spatial entropy [67,68].
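A small simulation can reproduce this aggregation effect (a hypothetical illustration, not a re-implementation of the experiment in [65]): averaging cells into larger areal units cancels spatially uncorrelated noise but preserves a spatially structured signal, so the correlation coefficient grows with the size of the units.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 64x64 fine grid: x is spatially autocorrelated (constant over
# 8x8 patches); y is x plus heavy, spatially *uncorrelated* cell-level noise.
x = np.kron(rng.normal(size=(8, 8)), np.ones((8, 8)))
y = x + rng.normal(scale=3.0, size=(64, 64))

def aggregate(a, block):
    """Merge block x block groups of cells into larger areal units by averaging."""
    s = a.shape[0] // block
    return a.reshape(s, block, s, block).mean(axis=(1, 3))

rs = {}
for block in (1, 4, 8):
    rs[block] = np.corrcoef(aggregate(x, block).ravel(),
                            aggregate(y, block).ravel())[0, 1]
    print(f"{block}x{block} units: r = {rs[block]:.2f}")
# Averaging suppresses the uncorrelated noise but not the structured signal,
# so r rises as the zoning becomes coarser.
```

The same underlying data thus yield very different correlation coefficients depending solely on the zoning, which is the essence of the MAUP.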
The second problem, UGCoP, refers to the sensitivity of contextual variables and analytical results to different delineations of contextual units [69]. Kwan highlighted how varying delineations of contextual units, even when everything else is the same, can lead to uncertain and inconsistent analytical results over time. Note that this problem is different from the MAUP because it refers to the contextual influences on the individuals being studied in geographic units with unknown or uncertain spatial configurations. In this regard, it is closer to the ecological fallacy [70], which concerns drawing inferences about individuals from inferences at the aggregate level. An example is found in environmental health studies, where residential neighborhoods (e.g., census, postal code, or buffer units) are typically used as contextual units. However, these geographic units may not accurately represent the actual areas that exert contextual influences on the health outcome. People are exposed to health risks in different locations (home, school, etc.) during the day, and it is not easy to delineate the boundaries of such exposures. The UGCoP is not due to different zoning or spatial scales; rather, contextual influences naturally change across granularities, making it even more complicated to delineate geographic contextual units [69].
The scale of analysis and appropriate geographic contextual units are among the first questions that one may need to answer before applying any spatial machine learning. However, the above-mentioned fundamental problems have rarely been investigated in spatial machine learning applications. For example, ref. [71] showed how MAUP can lead to perturbations in the convolution-based residual neural networks used for urban traffic prediction. Thus, the investigation of such effects on spatial machine learning is critically needed.

Other Properties
In addition to the above main properties of spatial data, some less fundamental but still essential properties may result from the specific representation of the data (polygon, line, point, regular grid, text) or the process of measurement [42]. For example, discretizing continuous space into pixels may cause a loss of information at the sub-pixel level [72], while delineating the boundaries of geographic regions entails some generalization and may not be easy, due to the Boolean nature of typical regionalization approaches [73]. When working with text data, ambiguity in place names and addresses, arising from multiple entities sharing a single name, a single entity with multiple names, and different addressing systems, can create uncertainties [14,16,74,75]. Additionally, the geometric interpretation of vague place names, such as 'midland' or 'near', is usually not straightforward [76,77].
Moving from a 2D space to a 3D space may impose some limitations and create some biases when ML applications are used. An example is the case of 3D point clouds. The density of the point cloud usually changes with distance from the sensor: the closer to the sensor, the higher the density. The implication of this effect is that spatial features identifiable in denser areas of the cloud may not be recognizable in sparser areas [78].
Finally, for many image-based applications (less so for satellite images and more for close-range images), the orientation of the sensor at the time of capturing the image may influence the amount of energy received by the sensor. This makes orientation an important element. A case susceptible to such an effect in remote sensing is that of night-light images, where satellite viewing vertical and zenith angles can significantly impact the measured amount of light in urban areas [79]. Statistical models are commonly used to handle these characteristics in remote sensing. In close-range photogrammetry and computer vision applications, however, the geometry of the sensor at the time the image is captured is usually reconstructed. Interested readers are referred to the well-known structure from motion [80] and visual simultaneous localization and mapping (Visual SLAM) [81] techniques. As for applications in ML, the simplest way to make a model invariant to orientation is to train it with samples from different orientations and with image augmentation. In the next section, we categorize the literature into two pathways through which the spatial properties of data can be handled.
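The orientation-augmentation strategy mentioned above can be sketched as follows (a minimal, hypothetical example restricted to right-angle rotations; real pipelines typically also use arbitrary-angle rotations, flips, and radiometric jitter):

```python
import numpy as np

def augment_with_rotations(images, labels):
    """Expand a training set with 90/180/270-degree rotated copies so the
    learner sees every patch in four orientations."""
    out_x, out_y = [], []
    for img, lab in zip(images, labels):
        for k in range(4):                     # 0, 90, 180, 270 degrees
            out_x.append(np.rot90(img, k))
            out_y.append(lab)                  # the label is unchanged
    return np.stack(out_x), np.array(out_y)

# Hypothetical batch: two 8x8 single-band image patches.
imgs = np.random.default_rng(0).normal(size=(2, 8, 8))
labs = np.array([0, 1])
aug_x, aug_y = augment_with_rotations(imgs, labs)
print(aug_x.shape)   # four orientations for each of the two patches
```

Training on the expanded set encourages the model to produce the same prediction regardless of how a patch happens to be oriented.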

Machine Learning of Spatial Data
To conduct machine learning of spatial data, we need to add location, distance, or topological relations to the process of learning. Figure 3 organizes the learning process into two steps, the spatial observation matrix and the learning algorithm. We review the literature to address how spatial data properties can be involved in each of these steps.

Spatial Observation Matrix
One typical way to include spatial properties in ML is to find a representation for these properties in the observation matrix X. The principle here is that, after we design and engineer the observation matrix X to include spatial properties, we can effectively use typical ML methods (e.g., families of decision trees and random forests, support vector machines, neural networks, and ensemble models) without making any changes to the learning algorithm. Several critical aspects are involved in creating the final spatial observation matrix used as input to the learning algorithm, namely, spatial sampling, spatial features, dimensionality reduction, and the handling of missing data. These are discussed hereunder.

Spatial Sampling
While tremendous progress has been made in spatial data collection technologies, ML methods still face essential challenges in acquiring optimized samples for training. The current view in ML is to move from model-centric ML to a more data-centric ML [82]. From a statistical point of view, the size and distribution of a sample set should represent the entire population or distribution. Two important points need to be made here. First, this is distinct from the commonplace concern about the poor arrangement of samples in training and test data sets in ML, which leads to a typical generalizability issue (see Section 4.2.11 on this matter). Instead, the entire sample set (including both the training and test data sets) should represent the phenomenon being learned, which is a sampling problem [83].
Second, the representativeness of the samples is defined in the attribute, spatial, or temporal space (or multiple spaces at a time) depending on the application. The structure of the data in each space may impose some special properties. As far as the spatial data matrix is concerned, data properties in a spatial space, such as spatial dependence, can inform an optimized sampling configuration to avoid redundancy.
It is not always the scarcity of samples that creates challenges for learning. Oversampling will not impact the learning process itself, because the i.i.d. assumption is not required in ML [84]. However, it may lead to an overestimation of accuracy in the assessment process. For example, let us consider a land cover classification task using satellite imagery. Suppose that a large batch of samples is selected from nearby locations for a single class category (e.g., vegetation). It is most likely that these samples are similar to each other. In this case, even if samples from other sites are available in the sample data set, the classifier is biased, since it will most probably label the frequent, familiar samples correctly. Thus, the confusion-matrix class accuracy and recall, which are usually used to evaluate classification results, are not robust for assessing the learning results under varying sample sets [85,86]. This problem is known as intra-class imbalance [87]. Intra-class imbalance is not solved by the cross-validation methods typically used in the generalization step (Section 4.2.11). This means that we either need an evaluation method that can account for such effects or we need to be careful in sampling the data that enter the learning algorithm. For a useful review of spatial sampling methods, see [52]. It is worth noting that inter-class imbalance, in which the number of samples in the class categories is highly uneven, can also degrade the accuracy of classification. The performance on a class with many samples will be higher compared to classes with fewer samples. While this overestimates the overall accuracy, inter-class imbalance is still identifiable by looking at single-class accuracy and recall in the confusion matrix [88].
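The optimistic bias that spatially clustered samples introduce into a standard accuracy assessment can be demonstrated with a small simulation (hypothetical data and a plain 1-nearest-neighbour classifier; not a method from the cited works):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical study area: the class follows a smooth spatial trend, so
# nearby samples are near-duplicates of one another (spatial dependence).
n = 400
coords = rng.uniform(0, 10, size=(n, 2))
labels = (np.sin(coords[:, 0]) + np.sin(coords[:, 1])
          + rng.normal(scale=0.1, size=n) > 0).astype(int)

def nn_accuracy(train_idx, test_idx):
    """1-nearest-neighbour classification using coordinates only."""
    d = np.linalg.norm(coords[test_idx, None] - coords[None, train_idx], axis=2)
    pred = labels[train_idx][d.argmin(axis=1)]
    return float((pred == labels[test_idx]).mean())

# Random split: many test points sit right next to a training point.
perm = rng.permutation(n)
acc_random = nn_accuracy(perm[:300], perm[300:])

# Spatially blocked split: hold out the whole eastern strip (x > 6).
east = coords[:, 0] > 6
acc_blocked = nn_accuracy(np.where(~east)[0], np.where(east)[0])

print(f"random split: {acc_random:.2f}  spatial block split: {acc_blocked:.2f}")
```

The random split looks far more accurate only because its test samples are spatially, and hence statistically, close to training samples; the spatially blocked split gives a more honest picture of performance on unseen areas.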
One application of reinforcement learning is for efficient spatial sampling. Over space, sampling can be conducted adaptively, according to local contextual characteristics using reinforcement learning. For example, Ref. [89] trained an agent to select appropriate spatial resolutions for satellite images that run through detectors. When a low spatial resolution image is dominated by large objects, the agent passes this image to a coarse level detector. Alternatively, when a high spatial resolution image is dominated by small objects, the agent passes that image to a fine level detector. Such an approach reduces dependency on high spatial resolution images that are more expensive and need more time to process. One of the problems in object detection algorithms is that the algorithm must use a sliding window that passes exhaustively through the entire image to detect an object. Ref. [90] presents a reinforcement learning method to select a small set of sequential samples from the image before deciding whether a target is present in an image.

Spatial Features
Several methods exist to include the spatial components of data in the observation matrix. One way is to directly add the spatial reference of the data as attributes, embedding the spatial references directly into the attribute space. Practically, this can be implemented in one of two ways. The first consists of adding coordinates (e.g., latitude and longitude) alongside the semantic attributes of each observation to the observation matrix [91,92]. For example, in [93], the authors demonstrated that in many propagation environments (e.g., in sensor networks deployed outdoors for environmental monitoring), where the wireless channel strength is a function of the distance-dependent path loss, adding the geographic locations of the transmitters and receivers as input to a neural network model can efficiently schedule links between transmitters and receivers. Such a capability is important in dense sensor networks, in which simultaneously activated nearby links (connections between transmitters and sensors) produce significant interference for each other. However, using the coordinates as spatial variables may generate considerable overfitting because they are highly correlated [94]. The other typical way is to add observations tied to a region as fixed effects of that region in the observation matrix [95,96]. This approach is effective for handling inclusion relationships; however, it cannot capture complex spatial structures.
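Both variants amount to widening the observation matrix, as the following sketch with hypothetical data shows (coordinates appended as extra columns, and region membership encoded as one-hot fixed-effect columns):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
semantic = rng.normal(size=(n, 2))            # e.g. two spectral attributes
coords = rng.uniform(0, 100, size=(n, 2))     # e.g. easting / northing

# Variant 1: append the spatial reference directly as attributes.
X_coords = np.hstack([semantic, coords])      # any standard learner can use it

# Variant 2: encode region membership as fixed effects (one-hot columns).
region = np.array([0, 1, 1, 0, 2])            # hypothetical region index
X_fixed = np.hstack([semantic, np.eye(3)[region]])

print(X_coords.shape, X_fixed.shape)
```

Either matrix can be fed unchanged to a standard learner; the trade-offs noted above (overfitting on raw coordinates, and the inability of fixed effects to capture complex structure) still apply.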
In addition to spatial reference information, spatial entities and phenomena have within-object information (geometric, spectral, textural, and statistical) and between-object information (contextual and relational), which can be created as new features and added to the observation matrix directly [1]. The geometric information of single spatial entities can be used in different forms such as the length, area, and ratio thereof. Spectral and textural information have extensively been used in the remote sensing community for land-cover and land-use classification. However, spectral features are insufficient for this purpose due to heterogeneity, especially in urban environments [2].
Other features, such as texture, which indicates the coarseness or smoothness of a pattern in an image, have been suggested as complementary information. From a mathematical point of view, numerous functions exist that can represent the local variability of pixel values as texture features, including first-order statistics (e.g., mean, variance, and standard deviation) and second-order statistics (gray-level co-occurrence matrix, spatial autocorrelation) [97]. These methods, however, focus on a single scale. Multi-resolution spatial and frequency analysis tools, such as the Gabor transform, Wigner distribution, and wavelet transforms, have effectively been used to overcome this problem [98][99][100][101].
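A minimal sketch of first-order texture features, computed with a naive moving window over a toy image; the window size and edge-padding strategy are illustrative choices, not taken from [97]:

```python
import numpy as np

def window_texture(image, size=3):
    """First-order texture features (mean, variance) in a size x size
    moving window, with edge padding. A bare-bones sketch; production code
    would use vectorized or integral-image implementations."""
    pad = size // 2
    padded = np.pad(image, pad, mode="edge")
    h, w = image.shape
    mean = np.empty_like(image, dtype=float)
    var = np.empty_like(image, dtype=float)
    for i in range(h):
        for j in range(w):
            win = padded[i:i + size, j:j + size]
            mean[i, j] = win.mean()
            var[i, j] = win.var()
    return mean, var

# Toy image: two homogeneous regions with a vertical boundary.
img = np.array([[0, 0, 8, 8],
                [0, 0, 8, 8],
                [0, 0, 8, 8],
                [0, 0, 8, 8]], dtype=float)
mean, var = window_texture(img)
# Variance is zero inside each homogeneous region and peaks along the edge.
```

These per-pixel texture maps can be appended to the observation matrix as additional feature columns.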
Texture analysis is usually conducted using moving windows (regularly shaped grid) or fields (irregular shape). Compared to moving windows, fields partition the area into homogeneous regions and provide a more realistic representation of the spatial entities [97]. The boundaries of these fields are determined using existing polygon features or digital image segmentation methods (e.g., edge detection, region growing) [102]. In addition to textural and spectral information, geometric and contextual attributes of the fields can also be used. In this respect, [103] used the proportion of build-up area, vegetation, and water surfaces, and [104] calculated the spatial metrics for land covers and semi-variogram for the Normalized Difference Vegetation Index (NDVI).
We provide here another example of how to identify and delineate the boundaries of functional zones based on moving object patterns. Figure 4 shows two zones inside and outside the port of Rotterdam, the Netherlands. The points represent the recorded locations of vessels that visited an unlabeled zone, extracted from the stop and move segments of vessel trajectories. On the one hand, the point clusters may inform about the boundaries of each zone. On the other hand, shape and simple statistics within each zone, such as the duration of visits, the number of unique vessels, and the distributions of vessel course (direction) and vessel type, may provide some information about the functionality of the space (e.g., anchorage, containerized ship berth, bulk ship berth). This within-object information can be added as new features to the observation matrix to recognize the functionality of the zones. Further, between-object information, such as connectivity, contiguity, distance, association, and direction, can be used to create new features [1]. In the vessel movement example, the association of zones where different types of ships dock in close distance to load and unload cargoes may become a discriminating feature to distinguish the port area from the anchorage zones outside the port (in Figure 4, compare the bottom-left and bottom-right quadrants).
Point cloud classification is another area where machine learning methods have been used extensively. The simplest way to classify point clouds is to add the third coordinate as an attribute of each point and use standard 2D ML methods to classify the point cloud. However, there are two critical shortcomings of such approaches [78]. First, point clouds may have multiple z values for a single (x,y) coordinate pair, and compressing the 3D point cloud into a 2D image causes a loss of information. Second, points in point clouds are usually irregularly spaced, making the selection of a fixed window size difficult. In other words, point density may vary with the distance to the sensor. As a result, features learned in a dense area are not generalizable to sparse areas.
Ref. [105] used regularly spaced voxels and voxel feature encoding (VFE) to address the first problem. This method suffers from loss of information due to the decreased spatial resolution, and from increased memory usage in the voxelization process. Alternatively, Ref. [106] projected point clouds onto multiple synthetic 2D images and labeled pixels based on the prediction scores from these synthetic images to handle the first issue (Figure 5a). Ref. [78] suggested creating a point cloud pyramid with l scale levels by subsampling the original point cloud. The deep hierarchical features (see Section 4.2.6) for each point are extracted using a deep neural network within each scale. Such an approach forms a feature pyramid (Figure 5b). The feature vectors of a point along this pyramid are concatenated to create a final feature vector that is fed into a classifier. This final feature vector contains both hierarchical and multi-scale features of the original point cloud, which can address both issues discussed earlier.
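The voxelization step can be sketched with a hypothetical `voxelize` helper that bins points into a regular grid and averages the points per voxel; a learned VFE-style encoding, as in [105], would replace the simple mean used here:

```python
import numpy as np

def voxelize(points, voxel_size):
    """Assign each 3D point to a regular voxel and average the points that
    fall into it. Illustrative sketch of the binning step only."""
    idx = np.floor(points / voxel_size).astype(int)
    # Unique voxel indices and an inverse map from points to voxels.
    voxels, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.ravel()  # flatten for safety across NumPy versions
    counts = np.bincount(inverse)
    centroids = np.zeros((len(voxels), 3))
    for d in range(3):
        centroids[:, d] = np.bincount(inverse, weights=points[:, d]) / counts
    return voxels, centroids

pts = np.array([[0.1, 0.1, 0.1],
                [0.2, 0.3, 0.1],
                [1.6, 0.1, 0.2]])
voxels, centroids = voxelize(pts, voxel_size=1.0)
# The first two points share a voxel; the third falls in its own voxel.
```

The trade-off discussed above is visible here: the voxel size controls both the information loss (points collapsed per voxel) and the memory footprint of the grid.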

Dimensionality Reduction
ML tasks may end up with a large number of input variables in the observation data matrix. A disproportionately large number of interrelated variables may negatively impact learning in a variety of ways. Apart from the need for more training data and an increase in processing time, an unnecessarily large number of variables may introduce new uncertainties into the learning process because of the correlation between variables [107]. One way to handle this problem is to select the subset of features that provides the best results. A variety of methods exist for feature selection, such as genetic algorithms [108]. Dimensionality reduction methods are another way to handle a large number of variables, especially when the influence of each individual variable is not of interest.
To reduce unnecessary variables, we need to understand the structure of the variance-covariance matrix. The problem is that calculating the variance-covariance matrix C (n × n) for a given observation matrix X (m × n) is computationally expensive for a large number of variables. Several dimensionality reduction methods exist, including but not limited to principal components analysis (PCA), factor analysis, independent components analysis, and self-organizing maps (SOM) [109][110][111]. PCA is a statistical method well known for its application in dimensionality reduction. PCA transforms the observation matrix X (m × n) with many interrelated variables into a smaller set of new uncorrelated variables (components). These new variables are ordered so that the first few retain most of the variation present in the original variables [110].
It can be shown that, by applying standard linear algebra operations, C can be decomposed as C = LVL^T, where L is a matrix of eigenvectors (representing the weight of each variable on the corresponding principal components), and V is a diagonal matrix of eigenvalues (representing the variances of the corresponding principal components). The observations are then transformed into a space where the column vectors in L represent the axes. In practice, a few principal components that represent most of the variation in the data set are selected (Figure 6). These transformed observations are then entered into the learning algorithm. For an observation matrix with spatial references, a simple PCA assumes the structure of the variance-covariance matrix remains stationary across the study area. Ref. [112] showed that PCA could be weighted locally either in attribute space (LWPCA) or in geographic space (GWPCA) to account for certain heterogeneities. For the former, we assume the covariance structure is homogeneous for observations that are close to one another in attribute space, which leads to L_i V_i L_i^T = C_i with respect to sub-region i.
The GWPCA for location (S_i, S_j), however, is written as LVL^T | (S_i, S_j) = C(S_i, S_j), with C(S_i, S_j) as the variance-covariance matrix at location (S_i, S_j). While PCA is a common approach for summarizing (sub)sets of variables in ML, LWPCA and GWPCA have not been investigated. Such methods can additionally provide insights into the spatial distribution of each composite variable by mapping component scores, which may lead to a more complete understanding of the process [113]. Moreover, eigenvalues and eigenvectors can be obtained for locations with no observations. Challenges include bandwidth selection and higher computational cost. A locally weighted PCA can also optimize sampling re-design by identifying local and spatial outliers rather than global and aspatial outliers [112].
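The global PCA decomposition described above can be sketched in a few lines of NumPy; the observation matrix below is synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical observation matrix X (m x n) with correlated columns, centered.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Xc = X - X.mean(axis=0)

# Variance-covariance matrix C (n x n) and its decomposition C = L V L^T.
C = np.cov(Xc, rowvar=False)
eigvals, L = np.linalg.eigh(C)   # columns of L = eigenvectors, V = diag(eigvals)

# Order components by decreasing variance and keep the first k.
order = np.argsort(eigvals)[::-1]
eigvals, L = eigvals[order], L[:, order]
k = 2
scores = Xc @ L[:, :k]           # transformed observations (component scores)

# Fraction of total variance retained by the first k components.
explained = eigvals[:k].sum() / eigvals.sum()
```

The locally weighted variants (LWPCA, GWPCA) would repeat this decomposition with observation weights that depend on attribute-space or geographic-space proximity.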

Missing Data
Nowadays, data are more available, both spatially and temporally. However, given that they are more often organic, being a byproduct of the processes that created them in the first place, there are always gaps in the temporal and spatial dimensions because of effects and circumstances that are out of our control. Therefore, missing data is an important challenge, and many analyses are simply not implementable without a way to deal with this problem. Missing observations can be independent of each other, dependent on their neighboring points, or follow specific patterns [114]. There are different ways to address missing values, such as aggregating data into coarser granularity, removing the observations with missing values from the data set, and imputing values. Although imputing data adds a preprocessing step to the analysis, it leverages the existing data and avoids the loss of information caused by aggregation or by discarding observations. Spatial prediction methods can be used to impute values for data sets with missing values. The best-known approaches for spatial prediction are spatial statistical models (e.g., geographically weighted regression) and geostatistical models, such as kriging [115]. Several studies have demonstrated that kriging in its universal form is preferable to geographically weighted regression for prediction, due to its optimal statistical properties [116].
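As an illustration of imputation by spatial prediction, the sketch below uses inverse-distance weighting, a deliberately simple stand-in for the kriging or GWR models discussed above; the `idw_impute` helper and its power parameter are illustrative, and unlike kriging this ignores the spatial covariance structure:

```python
import numpy as np

def idw_impute(coords, values, power=2.0):
    """Fill missing entries (NaN) with an inverse-distance-weighted mean of
    the observed entries. Minimal sketch; no covariance model is fitted."""
    values = values.astype(float).copy()
    missing = np.isnan(values)
    obs_xy, obs_v = coords[~missing], values[~missing]
    for i in np.where(missing)[0]:
        d = np.linalg.norm(obs_xy - coords[i], axis=1)
        w = 1.0 / np.maximum(d, 1e-12) ** power  # guard against zero distance
        values[i] = np.sum(w * obs_v) / np.sum(w)
    return values

coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
vals = np.array([1.0, 3.0, 3.0, np.nan])
filled = idw_impute(coords, vals)
# The missing central point is equidistant from the three observed points,
# so it receives their simple mean.
```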
Apart from the statistical and geostatistical models, other approaches in machine learning and other paradigms have also been adapted to spatial properties for missing data imputation. Probabilistic principal component analysis (PPCA), for example, is a probabilistic extension of PCA and has been used to impute missing values for different applications [114,117]. This method has proven useful when a significant portion of the observation matrix is unknown [118]. A comparison of different types of PPCA can be found in [119], and an extension of PPCA to consider temporal and spatial dependence can be found in [120]. All systematic errors must be removed before using PPCA for imputation [114].

Learning Algorithm
Instead of generating new spatial features and processing them with traditional non-spatial machine learning methods, we can directly incorporate spatial properties in the learning algorithm. Among all ML methods, decision trees and random forests, support vector machines (SVM), neural networks, and deep neural networks (DNN) have received considerable attention in spatial science.

Decision Trees
Decision trees (DT) are popular ML methods that have been adapted for spatial problems to overcome violation of the i.i.d. assumption. As a class, spatial entropy-based decision tree classifiers use information gain coupled with spatial autocorrelation to select candidate tree node tests in a raster spatial framework. For example, Ref. [121] added a spatial autocorrelation measure to the target function evaluated at each node of the tree. Predictive clustering trees (PCTs) are a multi-task approach in which hierarchies of clusters of similar data are identified and a predictive model is associated with each group. When splitting a group is considered at a node, the test that maximizes within-cluster variance reduction is selected. To account for spatial nonstationarity in the target variable, a term based on global measures of spatial autocorrelation was added to this test.
A common trait of such approaches is to add a constraint to account for spatial properties, while still relying on local entropy testing at tree nodes. One frequently occurring issue in image classification using decision trees is salt-and-pepper noise, which happens when the predicted label of a specific pixel differs from those of its neighboring pixels and can result from high spatial autocorrelation in the class labels of the sample data used for training. Ref. [122] proposed a focal-test-based spatial decision tree (FTSDT), in which the tree traversal direction of a learning sample is based on local and focal (neighborhood) properties of features. They used local indicators of spatial association (LISA) [123] as spatial autocorrelation statistics to measure the spatial dependence between neighborhood pixels.
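The flavor of such spatially informed split criteria can be sketched by blending information gain with a global Moran's I of the split indicator. The blend below, including the `alpha` weight, is a hypothetical illustration of the idea, not the exact formulation of [121] or [122]:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def morans_i(x, W):
    """Global Moran's I of a numeric variable x under spatial weights W."""
    z = x - x.mean()
    return (len(x) / W.sum()) * (z @ W @ z) / (z @ z)

def spatial_split_score(labels, left_mask, W, alpha=0.5):
    """Hypothetical node-test score: information gain of the split, blended
    with the spatial autocorrelation of the split indicator."""
    n = len(labels)
    gain = entropy(labels) - (
        left_mask.sum() / n * entropy(labels[left_mask])
        + (~left_mask).sum() / n * entropy(labels[~left_mask])
    )
    return alpha * gain + (1 - alpha) * morans_i(left_mask.astype(float), W)

# Four pixels along a line, adjacent pixels as neighbors (rook contiguity).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
labels = np.array([0, 0, 1, 1])
left_mask = np.array([True, True, False, False])
score = spatial_split_score(labels, left_mask, W)
# A pure, spatially contiguous split scores high on both terms.
```

A split that is both class-pure and spatially contiguous is rewarded over one that is pure but fragments the map, which is exactly the mechanism that discourages salt-and-pepper predictions.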
Geographically weighted versions of decision trees have also been developed to account for spatial heterogeneity. Ref. [26] presented a geographically weighted random forest (GW-RF) to visualize the relationships between type 2 diabetes mellitus and risk factors in U.S. counties. Similar to geographically weighted regression (GWR), the spatial weight matrix and RF are integrated into a local regression analysis framework. Compared to a global RF, GW-RF captured higher local variability in type 2 diabetes prevalence. One downside of GWR when interpreting the importance of the variables is that it overlooks collinearity among the predictors, which leads to parameter redundancies; thus, any attempt to interpret a local coefficient independently of the other local coefficients at the same location is not valid. GW-RF, however, is not impacted by collinearity [124].

Support Vector Machines
Support vector machines (SVMs) have been used for classification and regression problems [125]. The idea of SVM is to map the original input space to a higher-dimensionality feature space where the observations are separable by hyperplanes (Figure 7a). Among all possible hyperplanes, the one that maximizes the margin width is selected (l1). In reality, observations may not be separable easily, due to the presence of outliers (Figure 7b). Instead of increasing the complexity of the model structure (in this example, a nonlinear curve), we allow misclassification of some observations and penalize them based on their distance from the margins, measured by slack variables ξ. In essence, the objective function has two terms, one for the margin width and one for the slack ξ, and the goal is to maximize the former and minimize the latter. The slack penalty acts as a regularization mechanism; regularization terms are usually added to objective functions to control the complexity of the model and avoid overfitting (see Section 4.2.11). SVM performs well in high-dimensional spaces. It is less sensitive to class imbalance and powerful in generalization [11].
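The two-term objective can be made concrete with a small NumPy sketch of the primal soft-margin objective; the toy data and the trade-off constant C are illustrative:

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C=1.0):
    """Primal soft-margin SVM objective. The first term controls the margin
    width (maximizing the margin = minimizing ||w||^2); the second penalizes
    the slack xi_i of observations on the wrong side of their margin."""
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))  # slack variables
    return 0.5 * np.dot(w, w) + C * np.sum(xi)

# Two well-separated points plus one point inside the margin (xi = 0.8).
X = np.array([[2.0], [-2.0], [0.2]])
y = np.array([1.0, -1.0, 1.0])
w, b = np.array([1.0]), 0.0
obj = soft_margin_objective(w, b, X, y)
```

Minimizing this objective over w and b (e.g., by quadratic programming or subgradient descent) trades margin width against the total slack, with C setting the exchange rate.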
Ref. [126] suggested an extension of SVM called support vector random field that explicitly models spatial dependencies in the classification using conditional random fields (CRF). The model contains two components: the observation-matching potential function and the local-consistency potential function. The former models the relationship between the observations and the class labels using an SVM classifier, and the latter models the relationship to neighborhood labels. The local-consistency potential function penalizes discontinuity in between the pairwise sites.
Figure 7. Observations are transformed to a higher-dimensionality space (for simplicity, a 2D space). (a) Among all possible hyperplanes (lines or curves in the current 2D example), the hyperplane that maximizes the margin is optimal (l1). (b) In practice, outliers exist in observations. While there is still a hyperplane that can separate the two classes, it may result in overfitting.

Self-Organizing Maps
Self-organizing maps (SOM) are among the nonlinear clustering methods that have been used with spatial and non-spatial data [111]. A SOM is a simple neural network with no hidden layers. It maps an n-dimensional feature vector to a regular grid of square (four-neighbor) or hexagonal (six-neighbor) neurons in the output layer, each initialized with n weights. First, a similarity measure is used to find the neuron most similar to the input feature vector. Then, the weights of the activated neuron and its neighboring neurons are adjusted to make them even more similar to the input vector. This process is repeated for a set of input feature vectors. Finally, it creates a spatial organization of the neurons in a one-, two-, or three-dimensional space in which dissimilar units lie farther apart. SOM is similar to K-means clustering but differs in two ways. First, K-means is based on the nearest distance, while SOM utilizes distances between all paired neurons (weighted by a neighborhood kernel) [20]. Second, SOM also visualizes the relations between clusters by representing how far they are from each other in a topological space [111]. Such a property makes SOM attractive for visual data mining. For example, it is possible to compare SOM visualizations with other forms of visualization (e.g., geographic visualization), especially in an interactive platform [127]. Ref. [12] used SOM to visualize demographic trajectories, and [20] demonstrated how SOM could be employed to visually mine spatial interaction systems using a large domestic air travel data set. This property can even be used to map data without spatial properties in a 2D or 3D space. For example, Ref. [128] mapped massive textual databases with several hundred dimensions in feature space into a 2D space using SOM (Figure 8).
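The training loop described above can be sketched as follows. The grid size, constant learning rate, and fixed Gaussian neighborhood are simplifying assumptions; practical SOMs decay both over the course of training:

```python
import numpy as np

def train_som(data, grid_w=4, grid_h=4, epochs=50, lr=0.5, sigma=0.5, seed=0):
    """Minimal SOM on a square grid: find the best-matching unit (BMU) for
    each input, then pull it and its grid neighbors toward the input with a
    Gaussian neighborhood kernel. Bare-bones sketch, no decay schedules."""
    rng = np.random.default_rng(seed)
    n_features = data.shape[1]
    weights = rng.normal(size=(grid_h, grid_w, n_features))
    # Grid coordinates of each neuron, used by the neighborhood kernel.
    gy, gx = np.mgrid[0:grid_h, 0:grid_w]
    for _ in range(epochs):
        for x in data:
            # BMU: the neuron whose weight vector is closest to the input.
            d = np.linalg.norm(weights - x, axis=2)
            bi, bj = np.unravel_index(d.argmin(), d.shape)
            # Gaussian neighborhood around the winner on the output grid.
            g = np.exp(-((gy - bi) ** 2 + (gx - bj) ** 2) / (2 * sigma ** 2))
            weights += lr * g[..., None] * (x - weights)
    return weights

data = np.array([[0.0, 0.0], [1.0, 1.0]])
weights = train_som(data)
```

After training, each input has a nearby unit on the map, and the grid positions of the winning units encode the similarity structure that makes SOM useful for visual data mining.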
Apart from visualization, one concern about using SOM for geo-referenced data is that the geographic reference of the data is ignored in mapping. Ref. [129] suggested a variation of SOM called Geo-SOM to address this issue by considering spatial dependency. Geo-SOM forces the algorithm to search among the neurons geographically close to the data pattern when seeking the winning unit for a specific data pattern. Ref. [130] suggests using a one-dimensional SOM to create a sequence of numbers (cluster indices) that are ordered according to the similarity of attributes within the high-dimensional space. Each spatial point can then be labeled with its associated index number and represented in a choropleth map. The spatial pattern of these number sequences, called contextual numbers, summarizes the variations of geographic locations in high-dimensional space in a single contextual map (Figure 8b). Like other dimensionality reduction methods in Section 4.1.3, such a feature (with the advantage of being a nonlinear dimensionality reduction method) can be used as input to machine learning. For a complete review of SOM applications in GIS, see [127].

Radial Basis Function Networks
Radial basis function (RBF) networks have an input layer, a hidden layer, and an output layer. Instead of applying a nonlinear activation function to a linear combination of the input vector and the hidden neurons' weights, RBF networks compute the weighted norm (distance) between the input vector and each neuron and pass it through a radially symmetric activation function, which is usually Gaussian. Ref. [131] compared the RBF network with MLP networks for modeling urban change and showed that RBF achieves higher prediction accuracy. Ref. [132] used RBF networks for spatial interpolation by incorporating a semivariogram model, where the neurons in the hidden layer are centered on the observation points. Ref. [133] demonstrated how a hybrid MLP-RBFN network can improve spatial interpolation results, with MLP and RBF collaborating to fit surfaces of different types.
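A minimal sketch of an RBF network used for spatial interpolation, with hidden units centered on the observation points as in [132]; the Gaussian length scale and the least-squares/linear solve for the output weights are illustrative choices, in place of the semivariogram model used there:

```python
import numpy as np

def rbf_interpolate(obs_xy, obs_z, query_xy, length_scale=1.0):
    """RBF network sketch: Gaussian hidden units centered on the observation
    points; output weights solved from the hidden-layer activation matrix.
    Interpolates the observations exactly and is smooth elsewhere."""
    def phi(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * length_scale ** 2))
    G = phi(obs_xy, obs_xy)        # hidden-layer activations at the centers
    w = np.linalg.solve(G, obs_z)  # output-layer weights
    return phi(query_xy, obs_xy) @ w

obs_xy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
obs_z = np.array([1.0, 2.0, 3.0])
z_hat = rbf_interpolate(obs_xy, obs_z, obs_xy)  # reproduces obs_z exactly
```

The Gaussian kernel here plays the role the semivariogram plays in [132]: it encodes how influence decays with distance from each observation point.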

Adaptive Resonance Theory Networks
Adaptive resonance theory (ART)-based networks are a family of neural networks that have been used in spatial interaction flows, crop classification, and land-use change applications [134][135][136]. ART-based networks are supervised, self-organizing, and self-stabilizing neural networks that can learn fast in nonstationary environments [137]. Fuzzy ARTMAP, which couples ART-based networks with fuzzy logic, is the best-known ART-based network [138]. It includes two input modules, ARTa and ARTb, each with two layers connected by a map field module. ARTa matches the input vector to the most similar neurons in its second layer. If the vector is not similar to the current neurons in memory, a new neuron is created. This property enables the ARTa neural network to adaptively change the topology of the network and add new experiences to its memory. ARTb, which maintains the class labels, is connected to ARTa through a map module. However, Fuzzy ARTMAP depends on the quality of training data and is sensitive to noise and outliers, which may be treated as novel patterns. Ref. [134] used the adaptive nature of Fuzzy ARTMAP in forming its topology to account for spatial heterogeneity in land-use change.
Their proposed ART-P-MAP considers the land-use change as a regression problem instead of a classification problem and uses the density of training observations as a confidence measure for prediction through a Bayesian decision approach. This approach increases the generalizability of the model and avoids the problem of adding new neurons due to noise and outliers.

Deep Convolutional Neural Networks
In the past few years, deep neural networks (DNNs) have proven more promising than conventional ML methods at processing data in their raw form. DNNs are usually composed of several nonlinear but simple modules that represent data at different levels. Starting with raw data, each module transforms the representation at one level into a representation at a higher (more abstract) level. In the process, and using the backpropagation algorithm, the machine can learn very complex functions [139].
Popular DNNs for spatial data can roughly be classified into four categories, namely, convolutional neural networks (CNNs), deep graph neural networks (GNNs), generative neural networks, and recurrent neural networks (RNNs) when combined with CNNs. Here, we discuss several main DNNs that can consider spatial properties of data in their architecture or have been used to solve problems in spatial domains starting with CNNs. RNNs are used primarily to learn from sequence data and will be discussed in Section 5.
CNNs are the most popular and well-established form of DNNs for processing and analyzing images. They include convolutional layers and pooling layers as the two main types (Figure 9a). Convolutional layers work by convolving a sliding window (filter) with pixel values across the image (Figure 9b). The filter weights are determined automatically through the learning process by the network. This is the main advantage of deep CNNs in comparison to conventional ML methods, where user-defined filter weights are needed. This feature of CNNs has been used on a limited basis to automatically define W matrix weights in other spatial applications [24]. Pooling layers aggregate neighboring pixels into a single pixel, reducing the image's overall dimensions [140]. As the number of convolutional and pooling layers increases, the network becomes deeper and can extract more abstract features. A classification network finally follows these processing layers. In the past few years, several prominent CNN-based DNNs have also been introduced for semantic segmentation (labeling pixels as opposed to image patches) using deep residual networks with less complexity and depthwise separable convolutions, which reduce the computation time and the number of network parameters significantly [141][142][143][144][145].
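The two layer types can be sketched in a few lines. The edge-detecting filter weights below are fixed for illustration; in a CNN they would be learned from data:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution of a sliding filter with pixel values."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(image, size=2):
    """Max pooling: aggregates neighboring pixels, shrinking each dimension."""
    h, w = image.shape[0] // size, image.shape[1] // size
    return image[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

img = np.arange(16, dtype=float).reshape(4, 4)
grad_filter = np.array([[-1.0, 1.0]])  # fixed horizontal-gradient filter
features = conv2d(img, grad_filter)    # feature map, shape (4, 3)
pooled = max_pool(features)            # downsampled map, shape (2, 1)
```

Stacking many such convolution-plus-pooling stages, with the filter weights trained by backpropagation, is what lets deeper layers respond to increasingly abstract patterns.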
Two main differences exist between deep convolutional networks and conventional ML methods [140]. First, CNNs make the manual spatial feature extraction described in Section 4.1.2 an automatic part of the learning algorithm; that is, the weights of the convolution filters are not known a priori but are instead determined through the training process, in a way that resembles the connection weights in ordinary neural networks. These convolution filters partially account for spatial dependence by considering the neighborhood pixels, but their size and number within each layer must still be determined in a hyperparameter optimization process, and these regularly shaped filters cannot account for variation of spatial dependence in different directions. Second, pooling operations in CNNs allow them to consider hierarchical features in training. The number and architecture of convolutional and pooling layers are arbitrary to some degree and can cause unnecessary model complexity; thus, one should always optimize these hyperparameters.
The input images in CNNs have a regular shape, while actual objects and regions might be irregularly shaped in the real world. An example is urban land use classification, where a regularly shaped input image may ignore the real boundaries of objects [1]. One attempt to address this problem is the object-based convolutional neural network (OCNN). This approach involves a two-step process in which image segmentation is first applied, using the mean-shift method [146], to partition the image into two object categories with homogeneous spectral and spatial properties: linearly shaped objects (e.g., roads) and generally shaped objects (e.g., buildings). Two separate CNNs are then trained on these categories of objects with different input image sizes: a small input image size performs better for linear objects, and a large input image size for general objects. While this approach outperforms regular deep image segmentation networks, it has some limitations. First, it depends on the accuracy and robustness of the segmentation process; the extracted partitions do not necessarily represent the actual boundaries of objects. Second, the segmentation itself is a time-consuming operation, and its parameters still need to be defined manually by the user.

Deep Graph Neural Networks
Graphs have been used as a data model to represent many spatial entities (e.g., street networks) and relations (e.g., socioeconomic relations). Compared to regular grids and Euclidean spaces, graphs are irregularly structured (i.e., they have irregular neighborhood relationships). Thus, many machine learning methods cannot directly be used on graphs to perform tasks such as classification (e.g., the role of a node in a social network), prediction (e.g., whether or not a social relation exists between two nodes), and community detection (e.g., discovery of criminal groups) [147]. For example, CNNs have been applied quite extensively for image classification and segmentation. However, many problems (e.g., social and biological networks) usually cannot be represented in grid format, making it challenging to apply convolutions. Additionally, feature engineering is needed every time conventional ML methods are used for graphs [148]. In recent years, much interest has been directed toward developing ML and, especially, deep learning methods that are directly applicable to graphs. These methods are called graph neural networks.
Graph neural networks were first introduced by [149]. The main idea behind graph neural networks is to aggregate features from neighborhood nodes to represent the feature value of a node [150]. This process is similar to applying convolutions on regular grids, and attempts have recently been made to generalize convolution to graphs. In general, graph neural networks can be categorized into spectral and non-spectral approaches [151]. Spectral methods create a spectral representation of the graph and apply convolution through the graph Fourier transform (using the eigenvalues and eigenvectors of the Laplacian matrix of the graph) [152][153][154]. The challenge with these types of methods is that, if the structure of the graph changes, the model trained on the previous structure cannot directly be applied to a graph with a new structure. Spectral methods are also computationally expensive and have practical issues with scalability [153]. Spatial (non-spectral) methods apply convolutions directly to close neighbors in the graph [155]. These approaches are relatively new and have shown impressive performance in many applications, such as disease spread forecasting [156], traffic analysis [157], medical diagnosis and analysis [158], fraud detection [148], and natural language processing [159]. Spatial methods do not need to process the entire graph simultaneously and are computationally less expensive [150]. Graph attention neural networks are similar to spatial graph neural networks, but they assign larger weights (more attention) to the important nodes in the neighborhood [150]. Ref. [160] adopted the attention mechanism to extract the spatial dependencies among road segments for traffic forecasting and showed that attention networks exhibit outstanding robustness against noise. Graph neural networks can extract effective features from graphs, similar to classical CNNs on regular grids. More importantly, they can effectively reflect the true geometric structure of the original data [161].
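The neighborhood-aggregation idea can be sketched as a single graph convolution layer with symmetric normalization; the `gcn_layer` helper and the toy path graph below are illustrative, not any specific cited architecture:

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph convolution layer: each node's new features are a
    degree-normalized aggregate of its own and its neighbors' features,
    followed by a linear map and a ReLU nonlinearity."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalization
    return np.maximum(0.0, A_norm @ X @ W)    # aggregate, transform, ReLU

# Hypothetical 4-node path graph (e.g., road segments in a chain),
# with 2 features per node and an identity weight matrix for clarity.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
W = np.eye(2)
H = gcn_layer(A, X, W)  # new node representations, shape (4, 2)
```

Stacking such layers lets information propagate over increasingly large graph neighborhoods, which is the graph analogue of growing receptive fields in a CNN.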
In addition to the method that is used for learning, the chosen graph representation is important for successful learning. Ref. [162] showed how a heterogeneous representation of spatial elements in a network from OpenStreetMap (i.e., a graph with more than one node/edge type) improves the performance of a graph neural network for the semantic inference of the spatial network. As another example, Ref. [163] showed how the conventional primal (intersections represented as nodes) and dual (road segments represented as nodes) representations of street networks are limited for assessing OpenStreetMap data quality with ML methods, and provided a hybrid graph representation for street networks. Other novel approaches to representing spatial phenomena by networks can be found in [164,165].
A survey of deep learning methods for graphs can be found in [166]. The application of graph neural networks has yet to be explored in spatial domains, especially for non-grid-based spatial data, such as social networks.

Deep Generative Neural Networks
Models in ML can also be categorized as discriminative or generative [167]. On the one hand, discriminative models are usually used for classification and are called classifiers. They return P(Y|X), the probability that a sample belongs to class Y given the feature attributes X. Generative models, on the other hand, attempt to generate realistic features of a class of objects, given the distribution of the class P(X|Y); a new set of features is thus generated using the distribution of a specific class. In the past few years, generative neural network models have attracted considerable attention. Among several types of generative models, variational autoencoders (VAE) (Figure 10a) and generative adversarial neural networks (GAN) (Figure 10b) have found applications for spatial prediction [168,169].
GANs were first suggested by [168]. They are a type of generative neural network composed of two sub-networks: a discriminator and a generator. These two networks compete with one another to learn through training. On the one hand, the generator attempts to create realistic sets of features X̂ from the data class distribution Y by adding random noise Z (Figure 10b). The random noise Z ensures that a new instance X̂ of class Y is generated each time. The generator attempts to confuse the discriminator by introducing this new feature set X̂ as a new instance of the class. The discriminator, on the other hand, is a binary classifier that compares X̂ with actual instances of the class and recognizes whether it is real or fake. As training continues, both networks compete until the discriminator fails to distinguish the unreal instance from the actual sample. VAE works based on an encoder-decoder architecture, which has the same number of units in the input and output layers [169]. For image classification, for example, VAE creates a new representation of a labeled input image in a space of lower dimensionality; it creates a distribution of the object class in this latent space and reconstructs a sample image from the distribution. By comparing the actual and reconstructed images, the network can learn through training.
Many spatial phenomena are heterogeneous and nonlinear, rendering conventional data analytics methods less effective. Generative networks have been applied successfully to DEM spatial interpolation [170], spatiotemporal imputation of aerosol data when a substantial amount of data is missing [171], and predicting regional desirability with VGI data [17]. However, CNN-based methods are not appropriate when a large amount of data is missing, since they require complete images, or images with limited random missing values, for training [171]. Another application of generative models is data augmentation, especially when the training data set is small or the class balance is uneven. Data augmentation works by slightly manipulating the training data to generate new training samples [172].
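The "slight manipulation" behind data augmentation can be sketched as follows: each training image is flipped, rotated, and perturbed with noise to yield additional samples. The transformations and array sizes are illustrative choices.

```python
import numpy as np

# Simple image-style data augmentation: each training "image" yields
# several slightly manipulated copies. Toy arrays stand in for imagery.
rng = np.random.default_rng(0)

def augment(image, rng):
    """Return a list of slightly manipulated copies of a 2-D array."""
    return [
        np.fliplr(image),                          # horizontal flip
        np.flipud(image),                          # vertical flip
        np.rot90(image),                           # 90-degree rotation
        image + rng.normal(0, 0.01, image.shape),  # small additive noise
    ]

train = [rng.random((8, 8)) for _ in range(5)]    # tiny toy training set
augmented = [a for img in train for a in augment(img, rng)]
```

Each original sample thus contributes four extra samples, which can rebalance small or uneven classes without collecting new data.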

Learning with DNNs
Apart from discussing the specific capability of the various types of networks, it is also necessary to look at the ways existing DNNs can be used. One can use pre-trained networks, adapt a pre-trained network, or train a network on new data from scratch [173]. In the first method, pre-trained models are used as feature extractors for classification problems on a new data set. This is similar to the methods in Section 4.1.2, where new features are created and added to the observation matrix. These methods have been shown to be effective at classifying remote sensing and photogrammetric imagery [174,175].
Alternatively, one can fine-tune a pre-trained network on a small set of new observations. This approach is useful when the available training data set is small. One may even be able to use a network pre-trained for a problem in a very different domain or topic. For example, the authors of [176] successfully fine-tuned a network pre-trained on the ImageNet data set (close-range images) [177] to classify land use with a very small new training data set from the UC Merced Land Use satellite images [178].
The success of fine-tuning pre-trained networks stems from the capability of models to generalize and, specifically, from the fact that, even though data sets may come from different settings or domains, they still exhibit fundamental common features that machines can learn in one setting and apply in another. For example, the geometric boundaries of objects in both sets of images mentioned above are composed of corners and horizontal, vertical, and diagonal lines. In practice, the coefficients of several layers in the trained network (usually the layers encoding low-level features) are frozen, and only the coefficients of the remaining layers are fine-tuned with the new observations. Compared to the first approach, fine-tuning is expected to provide better results since the features are adapted to the new data set [179]. Finally, the last method is to train the network on the new data set from scratch. However, this method is subject to overfitting if the training data set is small [173].
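The freezing mechanism can be sketched in miniature: below, a random matrix W1 plays the role of pre-trained low-level feature weights and is kept fixed, while only the output weights W2 are updated on a small "new" data set. This is a deliberately simplified stand-in for a deep network, with all data and dimensions chosen for illustration.

```python
import numpy as np

# Sketch of fine-tuning with frozen layers: W1 ("pre-trained" features)
# is never updated; gradient descent adjusts only W2 on the new data.
rng = np.random.default_rng(1)

X = rng.normal(size=(20, 4))          # 20 "new" observations
y = np.sin(X.sum(axis=1))             # toy targets
W1 = rng.normal(size=(8, 4))          # frozen "pre-trained" layer
W2 = np.zeros(8)                      # layer to fine-tune
H = np.tanh(X @ W1.T)                 # fixed low-level features

def loss(W2):
    return np.mean((H @ W2 - y) ** 2)

loss_before = loss(W2)
for _ in range(200):                  # gradient descent on W2 only
    grad = 2 * H.T @ (H @ W2 - y) / len(y)
    W2 -= 0.01 * grad
loss_after = loss(W2)
```

Because only the last layer is optimized, the fit on the new observations improves without disturbing the transferred low-level features.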

Spatial Accuracy
Measures of the accuracy of spatial data need to be considered in the objective function of machine learning for a number of geospatial applications. For example, [180] showed how the lack of a spatial accuracy measure can distort the evaluation of location prediction performance for bird nests based on several habitat and environmental factors. When the variables are rasterized, one typical way to evaluate prediction accuracy is to measure the similarity between the predicted and actual maps. In the example being discussed, a cell is labeled as either nest or no-nest (a binary classification task), and the classification accuracy computed from the confusion matrix can serve as the objective function of learning performance: Objective function: Maximize(Similarity(Actual, Predicted)). (3) Figure 11a,b shows the location of sample nests and their rasterized format. Figure 11c,d represents the predicted locations at two different iterations during learning. Objective function (3) returns the same similarity value for c and d, even though d is more accurate spatially. The authors therefore suggest adding a term to the objective function that measures spatial accuracy, which could be based on the average distance to the nearest prediction. The objective function can then be rewritten as: Objective function: Maximize(Similarity + α × Spatial accuracy), (4) where α is a regularization parameter that is fine-tuned during training.
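The argument can be reproduced with a toy raster: two predicted nest maps have identical cell-wise accuracy, yet one places its predictions much closer to the actual nests. The grid size and coordinates below are illustrative, not the data of [180].

```python
import numpy as np

# Two predictions with the same cell-wise accuracy but different
# average distances to the actual nests, on a toy 8x8 grid.
actual = np.array([[2, 2], [5, 5]])        # actual nest cells (row, col)
pred_close = np.array([[2, 3], [5, 6]])    # wrong cells, but nearby
pred_far = np.array([[0, 7], [7, 0]])      # wrong cells, far away

def cellwise_accuracy(actual, pred, shape=(8, 8)):
    """Fraction of matching cells when both maps are rasterized."""
    a = np.zeros(shape); a[tuple(actual.T)] = 1
    p = np.zeros(shape); p[tuple(pred.T)] = 1
    return (a == p).mean()

def avg_nearest_distance(actual, pred):
    """Average distance from each actual nest to its nearest prediction."""
    d = np.linalg.norm(actual[:, None, :] - pred[None, :, :], axis=2)
    return d.min(axis=1).mean()

acc_close = cellwise_accuracy(actual, pred_close)
acc_far = cellwise_accuracy(actual, pred_far)
dist_close = avg_nearest_distance(actual, pred_close)
dist_far = avg_nearest_distance(actual, pred_far)
```

A map-similarity objective alone scores both predictions identically; only the distance-based term distinguishes the spatially accurate one.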

Generalization
Generalization is the capability of a model trained on one data set to perform well on future data sets. A data set is usually divided into three mutually exclusive sets, namely, a training set, a validation set, and a test set. At each iteration, we fit a model to the training data set and compute an objective function to measure learning performance. Then, we use the validation data set to evaluate the model's capability to fit another data set, i.e., to generalize. This process is repeated until we have a model that fits both data sets well. The critical point is that the validation data set is still being used in learning, so we need to ensure that the model is ultimately tested on a completely separate data set. The test data set serves this purpose, after we select the best model from the previous step. More complex validation methods, such as k-fold cross-validation, can be used to make the best use of the available data across the training, validation, and test sets [181].
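The mutually exclusive splitting at the heart of k-fold cross-validation can be sketched as follows; the fold count and sample size are arbitrary illustrative choices.

```python
import numpy as np

# k-fold cross-validation index generation: the data set is split into
# k mutually exclusive folds; each fold serves once as the validation
# set while the remaining folds form the training set.
def k_fold_indices(n_samples, k, seed=0):
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

splits = list(k_fold_indices(10, 5))
```

Every observation appears in exactly one validation fold, so each candidate model is evaluated on data it never saw during that fit; a held-out test set should still be kept entirely outside this loop.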
If the validation loss fails to decrease while the training loss keeps decreasing, we have the so-called overfitting problem (Figure 12a). On the other hand, if the model is very simple, the training and validation losses remain close to each other but do not decrease significantly (Figure 12c). This means that we need to control the complexity of the model to reach a trade-off between the loss values on the training and validation data sets. Regularization is the method usually used for this purpose; it works by penalizing the weights in the loss function. We can add a regularization term to the objective function as follows: Objective function: Minimize(Loss + λ × reg) or Maximize(Similarity + λ × reg), (5) Two common regularization terms are the L1 norm (the sum of the absolute values of the weights) and the L2 norm (the sum of the squared values of the weights). The former tosses out some of the parameters by driving their weights to exactly zero, while the latter keeps the weights close to zero. λ is the parameter that we need to tune to determine the best complexity level for our model, and it is usually domain-dependent. A small λ value places more weight on fitting the training data set (Figure 12a), and a large value means we are selecting a very simple model (Figure 12c). We always need to fine-tune the λ value for the data set at hand to obtain a model that generalizes well (Figure 12b). Regularization is essential when working with small data sets, which are more prone to overfitting, and when we have many features that introduce computational complexity and noise, both of which are very common when working with spatial data. As mentioned above and in Section 4.2.8, the α and λ values are domain-specific, which means that they need to be optimized at training time. Such parameters are called hyperparameters; unlike standard parameters, which are learned from the data set, hyperparameters are properties of the model itself [182].
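The shrinking effect of the L2 penalty in Equation (5) can be seen directly in the closed-form ridge solution, which adds λI to the normal equations; the data-generating weights and λ value below are illustrative.

```python
import numpy as np

# L2 (ridge) regularization sketch: the penalized solution shrinks the
# weight vector toward zero relative to ordinary least squares (lam=0).
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.1, 50)

def fit(X, y, lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 in closed form."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_ols = fit(X, y, lam=0.0)       # unregularized fit
w_ridge = fit(X, y, lam=10.0)    # penalized fit: smaller weights
```

An L1 penalty has no such closed form but goes further, setting some weights exactly to zero and thereby discarding the corresponding features.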
Therefore, failure to choose optimized values for these hyperparameters can cause overfitting. Hyperparameters are not limited to regularization parameters, however, and their number varies with the machine learning method and the problem formulation. For example, random forests have three primary hyperparameters: the ensemble size, the maximum size of the individual trees, and the number of randomly selected variables at each node. In neural networks, by contrast, the architecture, learning algorithm, number of training iterations, learning rate, and momentum must all be set [183]. In addition, new hyperparameters may be added for spatial applications, such as the spatial accuracy weight, the initial size and scale of input images, the size of the filters, and the appropriate sample size. A detailed discussion of hyperparameter optimization in spatial science is provided in [182].
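A basic form of hyperparameter optimization is a grid search over a validation set. The sketch below tunes the number of neighbors k of a simple k-NN classifier on toy two-class data; the classifier, grid, and data are illustrative choices, not a method from the cited works.

```python
import numpy as np

# Grid search over the hyperparameter k of a k-nearest-neighbor
# classifier, scored by accuracy on a held-out validation set.
rng = np.random.default_rng(7)
X_train = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])
y_train = np.array([0] * 30 + [1] * 30)
X_val = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(4, 1, (10, 2))])
y_val = np.array([0] * 10 + [1] * 10)

def knn_predict(X_train, y_train, X, k):
    """Majority vote among the k nearest training points."""
    d = np.linalg.norm(X[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

scores = {k: (knn_predict(X_train, y_train, X_val, k) == y_val).mean()
          for k in (1, 3, 5, 7)}
best_k = max(scores, key=scores.get)
```

The same pattern applies to any hyperparameter, including spatial ones such as the weight on a spatial accuracy term, as long as a validation criterion is defined.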

Spatiotemporal Learning
A significant amount of attention has been devoted to spatiotemporal learning in the past few years, with the availability of technology to collect data at a much higher frequency, if not continuously. Among machine learning methods, neural networks and SVMs are often used for space-time learning. Similar to the spatial properties discussed in this paper, spatiotemporal dependence and non-stationarity may also exist in data with spatiotemporal dimensions [184]. The number of parameters in spatiotemporal ML may become very large, which can make learning impossible if the model cannot capture the underlying spatiotemporal structure well [185].
Geographically and temporally weighted regression models have already been developed for geospatial applications. Still, the challenges of expressing complex and nonlinear space-time proximity and of finding optimal weights for kernels remain unresolved in these methods. Ref. [24] proposed a spatiotemporal proximity neural network (STPNN) that constructs a nonstationary weight matrix, instead of using fixed and conventional methods, to address the complex nonlinear interactions between time and space. Ref. [22] used a multi-stage approach to address spatial heterogeneity and dependence for space-time prediction. The authors used GeoSoM to divide space and time into homogeneous regions. In the second step of the latter model, a space-time lag within each cluster was estimated to capture the space-time dependence structure among the space-time series. Finally, a feedback recurrent neural network predicts values in each cluster locally. Although such techniques perform well, they are usually multi-step, and their computational cost is high. Furthermore, complexities related to anisotropy are not modeled.
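The notion of a space-time lag used in such multi-stage models can be illustrated on a regular grid: for each cell, average the values of its four rook neighbors at the previous time step. This is one simple way to encode space-time dependence as an input feature; the neighborhood choice and data cube below are illustrative, not the estimator of [22].

```python
import numpy as np

# Space-time lag on a (T, rows, cols) data cube: mean of the four rook
# neighbors at the previous time step; border cells are left as NaN.
def space_time_lag(cube):
    prev = cube[:-1]                                   # values at t-1
    lag = np.full_like(prev, np.nan, dtype=float)
    for i in range(1, prev.shape[1] - 1):
        for j in range(1, prev.shape[2] - 1):
            lag[:, i, j] = (prev[:, i - 1, j] + prev[:, i + 1, j] +
                            prev[:, i, j - 1] + prev[:, i, j + 1]) / 4.0
    return lag

cube = np.arange(2 * 3 * 3).reshape(2, 3, 3)           # toy 2-step cube
lag = space_time_lag(cube)
```

For the center cell at t = 1, the lag is the mean of its neighbors' values at t = 0, i.e., (1 + 7 + 3 + 5) / 4 = 4.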
Convolutional recurrent neural networks and especially convolutional long short-term memory (LSTM) networks can be applied extensively in spatiotemporal learning of grid data [186]. LSTM is a type of recurrent neural network with the capability of memorizing temporal dependencies in data. A combination of this feature with the power of CNNs to learn the hierarchical spatial features can provide an automatic single-step ML model to account for space-time dependence (see ConvLSTM [187] and PredRNN [188]).
Recurrent neural networks have also been used in combination with graph neural networks [189]. Ref. [190] modified graph neural networks to include gated recurrent units and demonstrated their capability on simple non-spatial graphs. Ref. [160] proposed a spatio-temporal graph neural network with an attention mechanism and LSTM units to jointly learn spatial-temporal dependencies on traffic networks. Most current spatio-temporal graph neural networks assume spatial dependencies that are fixed by connectivity and distance across all time steps. However, this assumption does not hold for many real-world applications, such as traffic prediction in a road network, where spatial dependencies change dynamically based on factors such as accidents or weather conditions [150]. A recent survey of machine learning methods for spatiotemporal sequences is available elsewhere for interested readers [185].

Discussion and Concluding Remarks
We conducted a state-of-the-art survey of literature where ML cross-pollinates spatial domains in which data exhibit distinctive properties, such as spatial dependence, spatial heterogeneity, and scale. We identified two broad approaches in this body of literature, which are respectively motivated by the two components of a spatial ML learning system, namely, the spatial observation matrix and the learning algorithm. The former explicitly handles the spatial properties of data before the process of learning begins. In other words, no change in the learning algorithm is implemented following this step. It is now well recognized that considering spatial properties in sampling strategies and in addressing missing data is necessary in any spatial ML application. In addition to these matters, creating new spatial features was discussed as one of the main approaches to augment the observation matrix with new spatial properties of data. To date, a large body of literature in ML of spatially explicit data has resorted to spatial features mainly because the idea comes naturally, because extensive research in geographic information science has focused on these matters over the past two decades, and because this approach permits using existing ML algorithms without further modifications. Many of these methods were successfully used for a variety of applications, ranging from point cloud classification to trajectory analysis and pattern recognition in satellite imagery.
We also discussed how spatial properties can be handled explicitly in the other component of ML, namely the learning algorithm, an approach that has only recently started to be explored. Here, spatial properties are addressed in the learning algorithm's representation or objective function rather than at the level of the observation matrix. When dealing with learning algorithms in spatial domains, we argued for focused attention to spatial hyperparameter optimization and spatial accuracy. Different learning algorithms require various numbers of hyperparameters to be optimized, with deep learning methods usually having the largest number. When it comes to accuracy, spatial accuracy is often ignored, while evidence shows that it can significantly influence the results and the generalizability of the model and, in turn, degrade the predictive power of the ML model. New measures of spatial accuracy may be needed to alleviate these issues. Space-time learning has also recently become a focus of considerable attention, both in identifying technical challenges and in advancing modeling solutions, as processes intermingle in space and time. With the proliferation of panel and other space-time data and the focused interest in process-based knowledge triggered by the COVID-19 outbreak and pandemic, this area has emerged as a priority research area.
Our literature review shows that progress in the learning algorithm component of ML is still in its early stages, compared to advances made in enhancing the spatial observation matrix, and there is a lot more room to develop and apply some of these methods in different spatial domains. Advances may be anticipated in several areas. Here, the main takeaways are as follows:
• CNNs can be used to automatically estimate the spatial weight matrix, which is usually unknown and needs to be defined by the user to reflect spatial data properties in many spatial problems;
• Deep neural networks with convolutional layers have been shown to automatically extract patterns at multiple scales and hierarchies. However, they have so far mainly been used to recognize patterns in raster data sets. Therefore, use cases in a broader range of domains of application are called for;
• Graph-based deep learning methods provide a new opportunity to apply CNN-based deep learning to problems with graph structure (e.g., social networks) or where the geographic units are irregularly shaped (e.g., census data);
• Further studies of learning in spatio-temporal domains will need to be undertaken as well. Deep neural networks based on a combination of LSTMs and CNNs enable simultaneous learning across space, time, scales, and hierarchies. When augmented with reinforcement learning to add feedback within systems, as is the case in many spatial, social, and environmental applications, they could realize the dream of a single universal ML method [139].
However, deep neural networks have their own limitations and pitfalls. First, they need a large amount of training data. Second, they have a large number of parameters and hyperparameters, which makes them computationally expensive. Third, there is no limit to the complexity of DNNs because of the arbitrary nature of architecture design. Thus, there is always the risk that researchers develop less reproducible and unnecessarily complex models.
In addition to the above suggestions for future research, which both extend past research practices and leverage some of the proposed methods discussed in this paper to apply ML for spatial data analytics, a long-term line of research in this area may align with a more fundamental recasting of concepts related to the definition of space. Properties of spatial data that we discussed in this paper are the byproducts of the current conceptual definition of space as a fixed "container". This container is independent of the objects and events that exist and occur inside it. With the current approach, space is defined in a reference coordinate system tied to an origin with a scale measure, where the location of entities is determined with respect to this origin. This is how all of the current GIS and remote sensing software and tools conceptualize space, as does, by extension, spatial ML.
While this absolute view of space has proved practical thus far, it has its own limitations. For example, the spatial weight matrix, designed to encode the neighborhood relationships between geographic entities, may be an inadequate representation of space. It is independent of time, usually assumes isotropy in the definition of neighborhoods, and can only partially account for spatial dependence. Additionally, it cannot capture the full complexity of spatial problems when multiple variables are present. Another example is spatial lags, which are limited because they differ across variables. If one wants to include them in learning, the final model may end up with a very large number of variables, and the user still needs to determine how many spatial lags to use and for which variables. Such methods cannot address more complex problems, such as the UGCoP and the MAUP, making it difficult to predetermine scales and appropriate units of observation. Contextual influences naturally change across granularities, which makes the problem even more complex.
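For reference, the conventional construct in question can be sketched as follows: a row-standardized rook-contiguity weight matrix W for a 3×3 grid and the spatial lag Wy it induces. The grid and attribute values are illustrative; note that W is built once and remains the same for every variable and every time step, which is precisely the rigidity at issue.

```python
import numpy as np

# Row-standardized rook-contiguity spatial weight matrix W for a 3x3
# grid of cells, and the spatial lag Wy (each cell's neighborhood mean).
n = 3
W = np.zeros((n * n, n * n))
for i in range(n):
    for j in range(n):
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):  # rook neighbors
            ni, nj = i + di, j + dj
            if 0 <= ni < n and 0 <= nj < n:
                W[i * n + j, ni * n + nj] = 1.0
W = W / W.sum(axis=1, keepdims=True)   # row standardization

y = np.arange(9, dtype=float)          # toy attribute on the 9 cells
spatial_lag = W @ y                    # neighborhood average per cell
```

The center cell's lag is the mean of its four neighbors' values; a time-varying or anisotropic dependence structure cannot be expressed without rebuilding W.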
Contrary to the absolute view, a relative view of space assumes that space is a construct of the events that take place within it and is not completely separate from them [45]. From this point of view, Euclidean coordinates are not favored over other semantic attributes; together, they construct the spatial structure. Representing this relative view requires a reconceptualization of space in current tools and software, as no easy and robust tuning of existing paradigms can be envisioned. From this point of view, we see attractive and, thus far, almost untouched areas that lie primarily in the second component of Figure 2, the learning algorithm. For example, graph-based deep learning has remarkable potential for future research within this broader view. Graphs can provide a good representation of relative space, mainly because they model relations explicitly and do not require a common coordinate reference frame. Current GIS tools and software need all layers to be in the same reference coordinate system, and they perform quite poorly at handling networks of social interactions. There is no limit on the definition of neighbors in graphs, and neighbors can change dynamically. Nested and hierarchical networks with different geometric references and topology are easy to link (e.g., in bipartite networks) and manage. The MAUP disappears because objects and phenomena no longer need to be aggregated into partitioned areal units. Nested and hierarchical graphs may also offer a solution to the UGCoP. One can actually delineate the boundaries of environmental risk exposures for individuals who visit different locations (home, school, etc.) during the day within graphs [69].
Graph-based deep learning can become a gateway for machines to learn in social networks, where relations are the building blocks of everything. It is worth noting that the phrase 'social networks' is not equivalent to online social media platforms, but is a broader term for any network with human, biological, and natural social interactions. One attractive example is the weighted stochastic block model [191,192], which is a powerful approach to detecting communities in social networks. The spatial organization of communities in a weighted stochastic block model can reveal new relationships in data. Another example is SOMs that organize clusters in space and have widely been studied in the geography community.
The geometric attributes, such as coordinates, can be treated like other attributes within a relative space, so we do not expect ML methods to change. At the same time, the concept of relative space presents fundamental challenges and questions, such as the following: Are graphs an adequate data model to represent such a space, or do we need new data models? How can one acquire the ground truth required for most ML methods in a space that is elastic and changing dynamically? What can machines learn about a space that is a construct of its components? Future research should focus on these challenges and questions.