In general, estimators based on second-order moments, such as Kriging, require assumptions of second-order and intrinsic stationarity, which must be predefined in a two-point statistics model. In the variogram modelling that reflects these stationarity hypotheses, setting the parameters of the variogram requires specialized knowledge, which can introduce expert subjectivity. In contrast, the MLA proposed in this study for spatial estimation requires neither stationarity assumptions nor variogram modelling. However, if the spatial data are trained without additional pre-processing and without consideration of spatial bias, the prediction results may ignore spatial autocorrelation.

Figure 1 shows the process of Kriging and that of the proposed methodology for spatial estimation. When applying Kriging to spatial problems, it is essential to check the normality of the target attribute data. If the distribution is skewed rather than normal, the data should be transformed, because the Kriging estimator is sensitive to large data values. There are two typical methods for applying Kriging to spatial data with such skewness. The first is to apply ordinary Kriging after data transformation; in this case, a Box–Cox transformation, which includes the logarithmic transformation as a special case, is typically used. However, the Kriging results are biased when the data are back-transformed by applying the inverse of the original transformation to the Kriging estimates of the transformed data. The second is to use IK, which avoids back-transformation; however, the data must be converted into separate indicators based on specific thresholds, and a variogram must be modelled separately for each indicator. After considering these points, variogram modelling using the transformed data was conducted, and spatial estimation was performed by applying the theoretical variogram model in the Kriging calculation. During this process, the results of the data transformation and variogram modelling differ according to expert knowledge, which affects the performance of the spatial estimation.
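The back-transformation bias noted above can be illustrated numerically. The following is a minimal sketch using SciPy's `boxcox`; the lognormal sample is synthetic and purely illustrative, not the study's data:

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(0)
# Hypothetical skewed attribute values (e.g., pollutant concentrations),
# drawn from a lognormal distribution for illustration only.
z = rng.lognormal(mean=1.0, sigma=0.8, size=500)

# Box-Cox transformation toward normality; lambda fitted by maximum likelihood.
z_t, lam = stats.boxcox(z)
print(stats.skew(z), stats.skew(z_t))  # skewness drops sharply after transformation

# Naively back-transforming the mean of the transformed values does not
# recover the mean of the original data -- the back-transformation bias
# that motivates caution with transformed Kriging estimates.
naive_back = inv_boxcox(z_t.mean(), lam)
print(z.mean(), naive_back)
```

For right-skewed data the naive back-transform typically underestimates the original mean, which is why unbiased back-transformation of Kriging estimates requires a dedicated correction rather than a simple inverse transform.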

Figure 1b shows the process of spatial estimation through the MLA, which is divided into four main steps: (i) data preparation and processing; (ii) data partitioning; (iii) machine-learning algorithm selection and hyper-parameter optimization; and (iv) training and estimation of spatial data.

#### 3.1. Data Preparation and Processing

Spatial estimation based on the MLA is, like Kriging, a coordinate-based data-inference process. The target attribute value (e.g., the concentration of pollutants, mineral deposits, or the thickness of layers) is set as the output of the MLA. For input, location information (e.g., coordinates, altitude) is used. Although other covariates affecting estimation of the target values can be included, we used coordinates as inputs to compare the spatial-prediction performance of the different methodologies under the same data conditions. Conventionally, when using the MLA for spatial estimation, coordinates are used without transformation. Recently, however, to mimic the spatial correlation used in Kriging, a distance vector containing the distances between the point to be predicted and all sample points has been used [12,17]. Although there are various distance calculation algorithms, we used the Euclidean distance to account for spatial correlation. On the other hand, when raw coordinates are transformed into a distance vector, the number of input variables increases with the number of sample points. For example, if there are 10 sample points, a 10 × 10 distance matrix is calculated, and each sample point has a distance vector containing 10 distance variables; with 1000 sample points, the distance vector used as input has 1000 variables. In this case, the computational complexity of the MLA increases rapidly, which can increase computation time and reduce performance. Therefore, PCA was applied to reduce the dimension of the input variables and thereby the computational complexity of the MLA.
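The distance-vector construction and its PCA-based reduction can be sketched with NumPy alone (PCA via SVD of the centered matrix). The coordinates and the number of retained components `k` below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
coords = rng.uniform(0, 1000, size=(n, 2))  # hypothetical sample coordinates

# Euclidean distance matrix: row i is the distance vector of sample i
# to all n sample points, i.e., n input variables per sample.
diff = coords[:, None, :] - coords[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))        # shape (n, n)

# PCA via SVD of the column-centered distance matrix, keeping k components
# (k = 5 is an arbitrary choice here) to shrink the input dimension.
k = 5
Dc = D - D.mean(axis=0)
U, S, Vt = np.linalg.svd(Dc, full_matrices=False)
X_reduced = Dc @ Vt[:k].T                    # shape (n, k)
print(D.shape, X_reduced.shape)
```

In practice, the number of retained components would be chosen from the explained-variance ratio (proportional to `S**2`) rather than fixed in advance.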

#### 3.2. Data Partitioning

For training and evaluating the performance of the MLA, the sample data should be divided into a training dataset and a test dataset. With a large amount of sample data, it is common to divide the training and test datasets according to a fixed ratio, the so-called hold-out validation method. However, when there are few sample data points, training performance can be significantly reduced depending on the data separation ratio; hence, the k-fold validation method is generally used. As this method partitions the data into k folds and performs training and validation k times, it has the advantage of evaluating the prediction performance of the algorithm on the entire dataset even with a small amount of data. When applying this method to spatial data, the individual fold partitions should be composed so as to cover the entire area that needs to be estimated [15].
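A plain k-fold split can be sketched with scikit-learn's `KFold` (the coordinates below are synthetic placeholders). Note that this sketch shuffles at random; a spatially aware partition, as the text recommends, would additionally require blocking or stratifying the folds over the study area:

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
coords = rng.uniform(0, 1000, size=(50, 2))  # hypothetical sample locations

kf = KFold(n_splits=5, shuffle=True, random_state=0)
test_counts = []
for train_idx, test_idx in kf.split(coords):
    # each sample appears in the validation fold exactly once across the k folds
    test_counts.append(len(test_idx))
print(test_counts)  # five folds of 10 validation samples each
```

Because every sample serves as validation data exactly once, the k-fold scores summarize predictive performance over the whole dataset, which is the advantage described above for small samples.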

#### 3.3. Machine Learning Algorithm and Hyper-Parameter Optimization

Various machine-learning algorithms have been used to solve regression and classification problems with spatial data. For example, RF algorithms have many variants that can be used for spatial regression problems. We used quantile RFs [31], which can calculate the variance of the prediction error as well as the target attribute values of a spatial region. In the MLA, because the training performance of the algorithm varies with the hyper-parameters, parameter optimization is necessary. In RF, the number of trees, the minimum leaf size, and the number of features considered at each split node are the typical hyper-parameters. As the number of trees increases, the RF becomes more robust to overfitting, but the performance gain becomes very small above a certain number. Therefore, it is recommended to set the number of trees to a large value within the limits of the researcher's computing environment [32]. Meanwhile, in this study, hyper-parameter optimization for the minimum leaf size and the number of features at each split node was performed through the grid search method.
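The grid search over the two tuned hyper-parameters can be sketched with scikit-learn; the data, grid values, and the fixed tree count below are illustrative assumptions, and a standard `RandomForestRegressor` stands in for the quantile RF used in the study:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(120, 2))                  # hypothetical coordinates
y = np.sin(X[:, 0] / 20) + 0.1 * rng.normal(size=120)   # synthetic target values

# Grid over minimum leaf size and features per split; the number of trees
# is fixed at a moderately large value, following the recommendation above.
param_grid = {
    "min_samples_leaf": [1, 3, 5],
    "max_features": [1, 2],
}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=100, random_state=0),
    param_grid, cv=5, scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)
```

Each grid point is evaluated by cross-validation, so the selected combination is the one that generalizes best across the folds rather than the one that fits the training data most closely.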