Article

A Comparative Study of Machine Learning and Spatial Interpolation Methods for Predicting House Prices

Department of Geography, Kyung Hee University, Seoul 02447, Korea
*
Author to whom correspondence should be addressed.
Sustainability 2022, 14(15), 9056; https://doi.org/10.3390/su14159056
Submission received: 13 June 2022 / Revised: 17 July 2022 / Accepted: 18 July 2022 / Published: 24 July 2022
(This article belongs to the Special Issue Big Data, Information and AI for Smart Urban)

Abstract

As the volume of spatial data has rapidly increased over the last several decades, there is a growing concern that missing and incomplete observations may lead to biased conclusions. Several recent studies have reported that machine learning techniques can address this limitation in emerging data sets more efficiently than conventional interpolation approaches, such as inverse distance weighting and kriging. However, most existing studies focus on data from the environmental sciences, so further evaluation is required to assess their strengths and limitations for socioeconomic data, such as house prices. In this study, we conducted a comparative analysis of four commonly used methods: neural networks, random forests, inverse distance weighting, and kriging. We applied these methods to the real estate transaction data of Seoul, South Korea, and demonstrated how the prices of houses for which no transactions were recorded could be predicted. Our empirical analysis suggested that neural networks and random forests can provide more accurate estimates than their interpolation counterparts. Of the two machine learning techniques, the random forest model performed slightly better than the neural network model. However, the neural network appeared to be more sensitive to the amount of training data, implying that it has the potential to outperform the other methods when sufficient data are available for training.

1. Introduction

Spatial data are multidimensional and include the location information of certain events or objects and their non-spatial attributes. Ideally, the relationship between locations and attributes should be investigated using complete data for every location in the study region. In practice, however, such spatially continuous and complete data are difficult, if not impossible, to collect due to the time and cost required [1]. Missing and incomplete observations are common in most real-world datasets, and thus, the use of reliable and accurate imputation methods has become crucial to complement them [2].
In recent decades, there have been numerous studies on the estimation of missing values in spatial data. The relevant literature can be divided into two broad groups: (1) empirical studies that focus on utilizing spatial statistics and other imputation methods and (2) more methodological works that aim to identify the most accurate and reliable imputation method through comparative analysis. In the former, spatial interpolation techniques, such as inverse distance weighting (IDW) and kriging, have been commonly used to estimate unobserved values, especially when spatial autocorrelation is present in the data. However, machine learning techniques, such as neural networks and random forests, have become popular in recent years because of their successful application in various domains [3,4,5,6,7,8].
Although machine learning has shown its potential and usefulness in various academic and industrial fields, it is still essential to understand its advantages and limitations as a spatial imputation tool compared to the existing interpolation methods [9]. While previous studies have demonstrated that machine learning techniques, such as neural networks and random forests, can estimate the missing values more accurately than the traditional distance-based functions, the results varied depending on the training data size and the characteristics of the target variable to be imputed. Considering that much of the existing literature is in the field of environmental sciences, the application of machine learning techniques to socio-spatial data still needs further evaluation.
To complement this limitation in the current literature, this work evaluates the strengths and drawbacks of machine learning as an imputation method in the context of social and spatial studies. More specifically, we aim to address the following two research questions:
  • Are the state-of-the-art machine learning models capable of estimating missing values in the spatial data representing complex urban phenomena, such as house prices?
  • If so, are they more accurate than the conventional spatial interpolation approaches?
To answer these questions, we employ two of the most popular techniques, namely neural networks and random forests, to estimate the prices of the houses for which no transactions are recorded in the real estate transaction data of Seoul, South Korea. The results are then compared to those from IDW and kriging using a set of accuracy metrics, such as the root mean square error (RMSE) and mean absolute error (MAE). The findings from this empirical analysis will help us better understand the capabilities of machine learning in social sciences and urban analytics, especially for imputing missing values in massive spatial data with socioeconomic attributes.

2. Past Studies

In statistics, imputation refers to the process of replacing missing values with substituted ones using ancillary data [10]. Proper use of imputation can minimize the bias caused by missing values and ensure the efficiency of data processing and analysis. Single imputation methods, such as hot-deck, mean substitution, and regression, have been widely used because of their simplicity. However, multiple imputation, which captures the uncertainty of the estimates by performing a set of single imputations, has also been increasingly adopted in recent years.
In the fields of geography, geology, and environmental sciences, interpolation methods such as IDW and kriging have been commonly used to predict unobserved values in datasets [11]. Past studies show that a simple distance-based function, such as IDW, works reasonably well in certain settings (e.g., the case of bathymetry in [12]), but kriging is often considered a more robust tool as it can take auxiliary variables into account. For example, Wu and Li [13] argued that the temperature could be more accurately interpolated when the variable ‘altitude’ was included in the model, along with latitude and longitude. Bhattacharjee et al. [14] also demonstrated that a cokriging model incorporating a range of weather variables into the temperature prediction could produce more precise estimations than a model relying only on distances.
While these interpolation methods have usually been applied to environmental variables, such as precipitation, temperature, and humidity, some efforts have been made to use IDW and kriging for the imputation of socioeconomic data with location information [15,16]. Montero and Larraz [17] attempted to estimate the prices of commercial properties in Toledo using IDW, kriging, and cokriging and concluded that cokriging performed better than the others when auxiliary variables were chosen appropriately. Similarly, Kuntz and Helbich [18] found that cokriging predicted the house prices in Vienna more precisely than its other variants. These results seem to be supported by the cases of other cities and countries [19,20].
Meanwhile, more recent studies focus on how developments in machine learning can enhance imputation accuracy and replace the existing interpolation methods. Because machine learning techniques, such as neural networks and random forests, can handle the complexity and nonlinearity of real-world datasets, they can be advantageous in predicting missing and unobserved values [21]. For the imputation of environmental variables, several studies have assessed the performance of machine learning, and promising results have been reported over the past decade [3,9]. In particular, random forests yielded significant improvements in various empirical cases, suggesting that they can be a reliable alternative to conventional statistical approaches.
These efforts have not been limited to environmental studies. Pérez-Rave and colleagues [22], for example, proposed an approach integrating the hedonic price regression model with machine learning and demonstrated its advantages in predicting real estate data. Čeh et al. [23] also compared random forests with the hedonic model based on multiple regression for predicting apartment prices. They found that the random forest models could detect and predict data variability more accurately than the traditional approaches.
Nonetheless, it is still difficult to generalize that machine learning outperforms interpolation methods in predicting missing values. While many studies illustrate the successful applications of such state-of-the-art techniques, their effectiveness and efficiency for a particular problem depend on the size and nature of the data at hand [24]. In the case of spatial data representing socioeconomic attributes, such as house prices, more empirical analyses should be conducted to verify the relative strengths and drawbacks of machine learning [2]. To this end, this study applies neural networks and random forests, currently the two most popular machine learning techniques, to estimate unobserved values in real estate transaction data and compares the results to those from the interpolation methods.

3. Data and Methods

3.1. Data

This study uses the real estate transaction data for Seoul, South Korea. It is a public dataset provided by the Ministry of Land, Infrastructure, and Transport (MOLIT) and contains all the sales and rental transaction records from 2006. We utilized only the sales records between 2015 and 2019 as input data because the rental prices are determined by various external factors and may not represent the actual value of the properties.
The real estate transactions data consist of several transaction-related variables, ranging from the dates of the transactions to the sales (or rental) prices. However, they do not include cadastral information, such as the year of construction, land and building areas, and the number of floors. MOLIT provides such information as a separate dataset called the integrated building information data. Therefore, we merged the two datasets manually to prepare the training and test datasets, as illustrated in Figure 1.
The real estate transactions data have two fields for location information, namely lot numbers and street addresses (or sometimes called road name addresses). The integrated building information data have the same fields as well, but some adjustments are required to match the location information of the two datasets. Therefore, we concatenated the two fields to form a single field of [street address, lot number] and used it as a key field during the joining process. Any transactions not matched to the integrated building information data were removed from the final dataset.
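The joining step described above can be sketched with pandas. The column names and sample values below are hypothetical stand-ins, not the actual MOLIT field names:

```python
import pandas as pd

# Hypothetical transaction records (column names are illustrative only)
transactions = pd.DataFrame({
    "street_address": ["Teheran-ro 1", "Sejong-daero 2"],
    "lot_number": ["100-1", "200-3"],
    "price_per_unit": [1200, 950],
})
# Hypothetical integrated building information records
buildings = pd.DataFrame({
    "street_address": ["Teheran-ro 1", "Sejong-daero 2"],
    "lot_number": ["100-1", "200-3"],
    "year_built": [2005, 1998],
})

# Concatenate the two location fields into a single join key
for df in (transactions, buildings):
    df["key"] = df["street_address"] + ", " + df["lot_number"]

# Inner join drops any transaction not matched to a building record,
# mirroring the removal of unmatched transactions described above.
merged = transactions.merge(
    buildings.drop(columns=["street_address", "lot_number"]),
    on="key", how="inner",
)
```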
It may be worth noting that the sales transactions were made multiple times for some properties during the study period. Consequently, there were duplicate points in the final dataset, each representing one transaction. Although we tried to minimize data loss during the merging process, some transaction records were dropped because of incomplete or inaccurate address information.
The final dataset consisted of 287,034 observations with 16 variables. We randomly selected approximately 70% of these, or 200,924 observations, to train the prediction models; we tested the accuracy of each model using the remaining 30% (i.e., 86,110 observations). Table 1 lists the names, descriptions, and types of the variables. The sales price per unit was the target variable to be imputed, and all the other variables were used as predictor variables. Note that the interpolation methods adopted in the present study cannot consider auxiliary variables other than the x and y coordinates; therefore, the estimation is based only on the distances between the properties.
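The random 70/30 split can be sketched as follows, assuming the merged records are simply indexed 0 to n − 1:

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen for reproducibility

n = 287_034                      # observations in the final dataset
idx = rng.permutation(n)         # shuffle the record indices

n_train = int(n * 0.7)           # roughly 70% for training
train_idx, test_idx = idx[:n_train], idx[n_train:]
```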

3.2. Methods

In this study, we examine the accuracy and efficiency of four methods—neural networks, random forests, IDW, and kriging—as imputation tools for spatial data representing socioeconomic attributes, specifically house prices.
A neural network model comprises one input layer and one output layer, and there can be multiple hidden layers between the two. The input layer takes raw data values and feeds them into one or more hidden layers in the middle of the model. The data values are weighted and summed at each hidden layer and then passed through an activation function to the next hidden or output layer [25]. The number of hidden layers in a neural network model is directly related to its ability to handle nonlinearity in the data. As the number of hidden layers increases, a more complex relationship between the input and output variables can be addressed; however, the risk of overfitting simultaneously escalates [26].
The learning process of a neural network involves backpropagation, which propagates errors in the reverse direction from the output layer to the input layer and updates the weights in each layer. While this is a crucial step for ensuring the accuracy and reliability of the model [27,28], backpropagation in a large network with many hidden layers and nodes can be computationally expensive. Therefore, for successful and effective use of neural networks, it is essential to choose the appropriate number of hidden layers and nodes, along with other hyperparameters, through iterative refinement.
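To illustrate this process, the following is a minimal NumPy sketch of backpropagation for a one-hidden-layer regression network with a ReLU activation and a mean squared error loss. The toy data and all hyperparameters are placeholders, not the models used in this study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 3x + noise
X = rng.uniform(-1, 1, (200, 1))
y = 3 * X + 0.05 * rng.normal(size=(200, 1))

# One hidden layer (16 ReLU units), linear output
W1 = rng.normal(scale=0.5, size=(1, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.5, size=(16, 1)); b2 = np.zeros(1)
lr = 0.1

for _ in range(2000):
    # Forward pass
    h = np.maximum(0, X @ W1 + b1)          # ReLU activation
    pred = h @ W2 + b2
    err = pred - y                          # error at the output layer
    # Backward pass: propagate the error from the output layer
    # toward the input layer, computing gradients for each weight
    gW2 = h.T @ err / len(X); gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (h > 0)             # ReLU derivative gates the error
    gW1 = X.T @ dh / len(X); gb1 = dh.mean(axis=0)
    # Gradient descent update of the weights in each layer
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

mse = float(np.mean((np.maximum(0, X @ W1 + b1) @ W2 + b2 - y) ** 2))
```

With many hidden layers, every iteration repeats this forward and backward sweep through all layers, which is why backpropagation becomes computationally expensive in large networks.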
Random forest is one of the most popular ensemble techniques; it forms multiple decision trees using subsets of the input data, trains them through random bootstrapping, and aggregates the trees into a final model [29]. In a random forest model, one predictor variable is randomly selected at each branch of decision trees, and such randomness minimizes bias and reduces the correlation between the trees [29]. Previous studies found that random forests are robust to missing observations and outliers [30] and can produce more accurate and reliable results than traditional decision trees.
One practical drawback of random forests is that their results cannot be visualized in an intuitive graph form, as is the case with individual decision trees. If the significance of the predictor variables is required to simplify and optimize the model, it must be estimated separately. In this study, we estimate the relative importance of the variables by calculating the out-of-bag (OOB) errors for decision trees with n − 1 variables. If the error changes only marginally when a specific variable is removed, that variable can be eliminated from the model, as it has only a slight impact on the result [31]. To obtain the best possible prediction results from random forests, the predictor variables should be chosen based on the OOB errors, and the other hyperparameters, such as the number of trees, should be carefully tuned [32].
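As a sketch of these ideas, scikit-learn exposes both an OOB score and impurity-based variable importances; note that `feature_importances_` is a convenient proxy rather than the leave-one-variable-out OOB procedure described above, and the synthetic data below are purely illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Synthetic data: only the first two of four predictors carry signal
X = rng.normal(size=(500, 4))
y = 2 * X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=500)

# oob_score=True evaluates each tree on the samples left out of its bootstrap
rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=1)
rf.fit(X, y)

print(rf.oob_score_)            # out-of-bag R^2
print(rf.feature_importances_)  # impurity-based relative importances
```

An uninformative predictor receives a near-zero importance here, matching the intuition that removing it would barely change the OOB error.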
IDW is the simplest spatial interpolation method and has been used for several decades [11]. For each location to be imputed, it computes a weighted average of all the observed values, where the weight of observations is determined based on their distance to the location of interest, s0 [33]. The equation for IDW can be formulated as
$$\hat{Z}(s_0) = \frac{\sum_{i=1}^{n} w(s_i)\,Z(s_i)}{\sum_{i=1}^{n} w(s_i)}$$
where $\hat{Z}(s_0)$ refers to the estimated value at point $s_0$, and $Z(s_i)$ is the observed value at point $s_i$. $w(s_i)$ represents the weight applied to $Z(s_i)$ and is calculated as
$$w(s_i) = \lVert s_i - s_0 \rVert^{-p}$$
The distance decay parameter, p, defines how rapidly the influence of distant points decreases. As p increases, the weights for points distant from s0 become significantly smaller than those for the close points. It is set to 2 by default in most computer software; however, because the choice of p is crucial for the performance of IDW, it is desirable to test several candidates and choose the one that produces the most accurate and explainable results.
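The two equations above can be written in a few lines of NumPy. This is a generic sketch of IDW, not the exact implementation used in the study:

```python
import numpy as np

def idw(xy_obs, z_obs, xy_new, p=2.0):
    """Inverse distance weighted estimates at each row of xy_new."""
    # Pairwise distances between target points and observations
    d = np.linalg.norm(xy_new[:, None, :] - xy_obs[None, :, :], axis=2)
    d = np.where(d == 0, 1e-12, d)   # guard against exact coincidences
    w = d ** -p                      # weight decays with distance
    return (w @ z_obs) / w.sum(axis=1)

obs = np.array([[0.0, 0.0], [1.0, 0.0]])
vals = np.array([1.0, 3.0])
est = idw(obs, vals, np.array([[0.5, 0.0], [0.25, 0.0]]), p=2.0)
```

At the midpoint, both observations receive equal weight, so the estimate is simply their mean; at the second point, the nearer observation dominates.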
Kriging is a weighted linear regression that utilizes observed values from neighboring locations to predict missing or unknown values [34]. Kriging has many variants, including simple, ordinary, and universal kriging. The most appropriate variant depends on the variable of interest. If the mean of the target variable is constant in the study area and its exact value is known, simple kriging can minimize the prediction errors. However, if the mean of the target variable is unknown, or if it is non-stationary, the use of either ordinary or universal kriging is ideal [35,36,37].
Cokriging refers to a family of methods that incorporate associated variables into the kriging models. Cokriging has the same variants as kriging—simple, ordinary, and universal—and it tends to perform better than its univariate counterparts, as briefly discussed in Section 2 (see, for example, [14]). However, in this study, none of the ancillary variables available (i.e., those presented in Table 1) was sufficiently correlated with the target variable. The strongest correlation of 0.285 was observed between the land areas and sales prices per unit, but it was not strong enough to expect any improvements from the use of cokriging [38]. Furthermore, there was no apparent trend in the mean of the target variable; so, we chose ordinary kriging as the most suitable method for the provided data. The semivariogram parameters, including nugget, sill, and range, were chosen through iterations to obtain the best fitting model for ordinary kriging.
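For illustration, a minimal ordinary kriging sketch with a spherical semivariogram is given below. The nugget, sill, and range values are placeholders, not the fitted parameters of this study:

```python
import numpy as np

def spherical(h, nugget, sill, range_):
    """Spherical semivariogram model (gamma(0) = 0 by definition)."""
    h = np.asarray(h, dtype=float)
    r = np.minimum(h / range_, 1.0)
    g = nugget + (sill - nugget) * (1.5 * r - 0.5 * r ** 3)
    return np.where(h == 0.0, 0.0, g)

def ordinary_kriging(xy_obs, z_obs, s0, nugget=0.0, sill=1.0, range_=5.0):
    """Ordinary kriging estimate at a single location s0."""
    n = len(xy_obs)
    d = np.linalg.norm(xy_obs[:, None, :] - xy_obs[None, :, :], axis=2)
    # Kriging system with a Lagrange multiplier enforcing
    # the unbiasedness constraint (weights sum to one)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = spherical(d, nugget, sill, range_)
    A[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = spherical(np.linalg.norm(xy_obs - s0, axis=1), nugget, sill, range_)
    lam = np.linalg.solve(A, b)[:n]   # drop the Lagrange multiplier
    return float(lam @ z_obs)

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
vals = np.array([1.0, 2.0, 3.0])
exact = ordinary_kriging(pts, vals, np.array([0.0, 0.0]))
```

With a zero nugget, ordinary kriging is an exact interpolator: the estimate at an observed location reproduces the observed value.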

4. Model Optimization

To obtain the most accurate predictions from each method presented in the previous section, it is crucial to optimize the model hyperparameters [39]. The hyperparameters were iteratively changed in this study, and the accuracies were assessed through cross-validation. To avoid overfitting, we divided all of the data into training and test sets for model building and validation, respectively. For IDW and kriging, however, there were no learning processes involved; therefore, these methods were directly applied to the test set. The root mean square error (RMSE) was employed as the primary metric during the optimization process, and the hyperparameters that minimized the RMSE were selected for each model.
The neural network models were constructed by adjusting the number of hidden layers and the number of nodes in each hidden layer. The batch size, i.e., the number of training samples processed in a single batch, was fixed at 64, and the number of epochs for model optimization was set to 150, considering the size of the training set. The activation and loss functions were set to the rectified linear unit (ReLU) and mean squared error (MSE), respectively. Table 2 indicates that the RMSE generally decreases as the number of hidden layers and nodes increases. However, it also suggests that the RMSE would increase if the scale of the network exceeded a certain point. Therefore, in this study, we chose a model with four hidden layers, a maximum of 256 nodes, and a minimum of 128 nodes as the optimal neural network model, as it yielded the lowest RMSE.
The random forest models were built by varying the number of trees and the number of predictor variables used for node branching. The number of trees was increased from 50 to 200 in increments of 50, and the number of predictor variables for each decision tree was set to 4, 8, and 16, yielding 12 models in total. The cross-validation results show that the error tends to decrease as the number of trees and predictor variables increases. In Table 3, the model with 200 trees and 16 predictor variables has the lowest RMSE; so, we chose it as the optimal random forest model.
In IDW, because the model estimation depends only on the distance decay parameter p, we examined the accuracy by gradually increasing it from 2 to 8 by 0.1. The results are presented in Figure 2, where the RMSE is the lowest at p = 2.2; therefore, it was chosen as the optimal parameter for IDW.
In ordinary kriging, the models were constructed by varying the semivariogram parameters, namely the model type, nugget, sill, and range (Figure 3). As shown in Table 4, we tested the spherical, exponential, and Gaussian models to fit the semivariogram. For each model, we used four different range values (i.e., 1000, 3000, 5000, and 7000) with default nugget and sill values. Overall, the spherical function provided more accurate results, and the RMSE seemed to decrease as the range decreased (Table 4).

5. Results

The models selected from the previous section were applied to the test datasets. For comparison, Figure 4 shows the distribution of the actual house prices and the residuals (i.e., the difference between the actual and the predicted prices) from each method. Due to the volume of the data and the presence of many duplicate points, the results could not be visualized using raw points. Instead, we created grids and calculated the average values for each grid. In Figure 4a, the grids with high average sales prices are marked in dark red, whereas those with low prices are marked in light yellow. The red and blue colors in Figure 4b–e indicate the positive and negative residuals, respectively.
Figure 4a shows that the high sales prices are clustered around the central and bottom parts of the city, as well as along the Han River. The residuals from the neural network model tend to be small and positive, but there are some large discrepancies between the actual and the predicted values around where the expensive properties are located (Figure 4b). The random forest model has a similar visual impression (Figure 4c), and it is difficult to conclude which model is better from these maps.
In contrast, it is relatively apparent that IDW and kriging produced higher residuals in some parts of the city. Figure 4d implies that both over- and underestimation of house prices occurred: there are both negative and positive residuals, and some of those along the river are considerable in size. The pattern in Figure 4e is similar; however, the residuals tend to be smaller than those from IDW, and there appears to be no clustering of large residuals in the central part of the city. Overall, these results imply that IDW and kriging are less reliable for predicting house prices than the machine learning methods. House prices are determined by a complex interplay of geographic, demographic, and social factors, and thus, conventional spatial interpolation methods that rely on a distance-based function may not be sufficient to establish an accurate model.
In addition to these visual comparisons, we adopted several accuracy metrics, including the mean absolute error (MAE), mean absolute percentage error (MAPE), mean absolute scaled error (MASE), and RMSE, for a more comprehensive evaluation.
The MAE is the mean of the absolute errors between the actual and predicted values; a smaller MAE implies a more accurate model. Because it does not square the errors, it is less affected by outliers than the RMSE. The MAPE is defined as the absolute error divided by the actual, observed value. As the actual value is used as a denominator, it must be non-zero, and as it approaches zero, the MAPE approaches infinity. The last metric in this study, the MASE, instead divides the error by a baseline variation range to determine whether the error falls outside the normal range of variation. This metric is particularly useful when comparing variables with different variances.
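The four metrics can be written out directly. Note that the MASE shown here follows a common variant for non-sequential data, scaling the error by the mean absolute deviation of the training values; the exact scaling used in the study is not specified here:

```python
import numpy as np

def mae(y, yhat):
    """Mean absolute error."""
    return float(np.mean(np.abs(y - yhat)))

def rmse(y, yhat):
    """Root mean square error (squaring amplifies outliers)."""
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def mape(y, yhat):
    """Mean absolute percentage error; actual values must be non-zero."""
    return float(np.mean(np.abs((y - yhat) / y)) * 100)

def mase(y, yhat, y_train):
    """MASE variant: scale the error by the mean absolute
    deviation of the training values (assumed baseline)."""
    scale = np.mean(np.abs(y_train - np.mean(y_train)))
    return float(np.mean(np.abs(y - yhat)) / scale)
```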
Table 5 presents the calculated accuracy metrics. All metrics indicate that the random forest model is the most accurate, followed by the neural network model. The accuracies of IDW and kriging were found to be the lowest. These findings indicate that the machine learning techniques can be superior to the distance-based interpolation methods for predicting house prices.
This result can be attributed to the number of data points used in this study. In ordinary kriging, although the imputation process is simple when the amount of data is small, the calculation time increases significantly for large datasets, making it difficult to fully utilize the given data [40]. Most machine learning techniques, on the other hand, are designed to work with large data, and the accuracy of a model generally improves as the volume of data increases. The data in this study amounted to almost 300,000 transactions, resulting in higher accuracy with the machine learning techniques than with the spatial interpolation counterparts.
Among the machine learning techniques, the random forest model yielded higher accuracy than the neural network model. One reason behind this is that the real estate transaction data are in a tabular form. When tabular data are used in neural networks, the structure of the model can become highly dimensional or overparameterized, resulting in overfitting issues during the analysis process [41,42,43]. Conversely, the tree-based ensemble model can be more efficient for high-dimensional tabular data [44], and it has the advantage of preventing overfitting by considering the balance between biases and variances [29]. For this reason, the random forest model might perform better than the neural network for predicting house prices in this study.
As mentioned in the previous section, the relative importance of the predictor variables in a random forest model can be calculated. This compensates for a drawback of random forests, namely that their results are difficult to visualize and interpret intuitively.
Figure 5 shows the relative importance of the predictor variables in the selected random forest model. The most important variable, the y coordinate, is assigned a score of 100, and the scores of the other variables are scaled relative to it. Following the y coordinate, the x coordinate, the year built, the height, and a binary variable indicating whether the building is located in Gangnam-gu, one of the 25 districts in Seoul, were relatively significant. Considering that five of the listed variables (i.e., site_area, far, bcr, flr_area, and bldg_area) are area-related, area is among the most important factors affecting house prices, which is consistent with the results of previous studies. In addition, the location-related variables also exert considerable influence on the target variable: not only the absolute locations represented by the x and y coordinates but also the districts in which the properties are located appear to have a significant influence.

6. Conclusions

Although it is crucial to secure data integrity across the entire study region when utilizing spatial data, missing values almost always exist for various reasons. Several studies have been conducted to develop imputation methods for spatial data to address this problem. In geography, IDW and kriging have been commonly used, and they have contributed to solving the discontinuity of data. Recently, it has been suggested that machine learning could replace the existing spatial interpolation methods, and its superior performance has been demonstrated in a number of cases. However, there are still uncertainties about whether it can be applied to spatial data representing complex urban phenomena.
While there is a considerable number of studies that compare machine learning and the existing interpolation methods, the existing works tend to focus on the environmental fields. To fill this gap in the literature, we compared machine learning (i.e., neural networks and random forests) and spatial interpolation methods (i.e., IDW and kriging) using real estate transactions data in Seoul, South Korea. In this process, we constructed a dataset by combining the real estate transactions data and the integrated building information data. We proceeded with model optimization by varying the parameters to obtain the best results from each method.
From the maps of the residuals from each model, we found that the machine learning models performed better than the spatial interpolation methods. The neural network model and the random forest model did not show high residuals across the study area. This visual impression was confirmed by a set of accuracy metrics, i.e., the MAE, RMSE, MAPE, and MASE. The random forest model was the most accurate, followed by the neural network. The spatial interpolation methods showed clusters of high residuals, with kriging performing somewhat better than IDW on the accuracy indicators. High residuals tended to appear in the regions with high transaction prices. To confirm how the selected variables affect the prediction results, the relative variable importance was calculated, and the variables related to areas and locations were found to have a significant influence.
In this study, we conducted a comparative analysis of machine learning and spatial interpolation methods for predicting house prices. Although the application of machine learning to the imputation of spatial data has been successful in many studies, this study is significant in that it provides additional empirical evidence to support the use of machine learning in social science research and urban analytics.
Nevertheless, it is important to note that these results should not be generalized to other fields of study and other types of spatial data. With the development of machine learning, many methods have been proposed, and their performance is continuously improving. However, the performance of a specific model for a specific application is affected by various factors, such as the size and structure of the data. Therefore, it is important to verify, through careful evaluation, that the method we adopt and the model we are building are superior to other candidates.

Author Contributions

Conceptualization, S.-Y.H.; methodology, J.K. and S.-Y.H.; validation, J.K. and S.-Y.H.; data curation, J.K. and M.-H.L.; formal analysis, J.K.; investigation, J.K.; writing—original draft preparation, J.K. and Y.L.; writing—review and editing, S.-Y.H.; visualization, J.K.; supervision, S.-Y.H.; project administration, S.-Y.H.; funding acquisition, S.-Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grants funded by the Korean government (Ministry of Science and ICT) (NRF-2021R1C1C1009849; NRF-2019R1C1C1006555).

Data Availability Statement

The real estate transactions data of South Korea are available to download from the following website: https://rt.molit.go.kr/ (accessed on 13 June 2022).

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Figure 1. Joining process of the real estate transactions data and the integrated building information data.
Figure 2. RMSE with different distance decaying parameters, p.
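For context on Figure 2, inverse distance weighting predicts an unobserved value as a weighted average of nearby observations, with weights decaying as a power p of distance. A minimal sketch follows; the function name and sample points are illustrative, not taken from the study's pipeline.

```python
# Hedged sketch of inverse distance weighting (IDW); p is the distance-decay
# parameter tuned in Figure 2. Larger p makes the estimate more local.
import math

def idw(points, target, p=2.0):
    """Predict the value at `target` from observed (x, y, value) points."""
    num, den = 0.0, 0.0
    for x, y, value in points:
        d = math.hypot(x - target[0], y - target[1])
        if d == 0.0:            # exact location match: return the observation
            return value
        w = 1.0 / d ** p        # weight decays with distance
        num += w * value
        den += w
    return num / den

obs = [(0.0, 0.0, 100.0), (1.0, 0.0, 200.0), (0.0, 1.0, 300.0)]
print(idw(obs, (0.2, 0.2), p=2.0))   # the nearby low-priced point dominates
```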
Figure 3. Semivariogram and fitting functions representing the spherical model, the exponential model, and the Gaussian model.
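The three fitting functions in Figure 3 can be written out from their textbook forms, as sketched below. The nugget, sill, and range values are illustrative rather than the fitted ones, and conventions for the range parameter of the exponential and Gaussian models differ between software packages (a practical-range factor of 3 is assumed here).

```python
# Hedged sketch of the spherical, exponential, and Gaussian semivariogram
# models; `sill` is the total sill, so (sill - nugget) is the partial sill.
import math

def spherical(h, nugget, sill, rng):
    if h == 0.0:
        return 0.0
    if h >= rng:
        return sill              # levels off exactly at the range
    r = h / rng
    return nugget + (sill - nugget) * (1.5 * r - 0.5 * r ** 3)

def exponential(h, nugget, sill, rng):
    if h == 0.0:
        return 0.0
    return nugget + (sill - nugget) * (1.0 - math.exp(-3.0 * h / rng))

def gaussian(h, nugget, sill, rng):
    if h == 0.0:
        return 0.0
    return nugget + (sill - nugget) * (1.0 - math.exp(-3.0 * (h / rng) ** 2))

# All three rise from the nugget toward the sill as the lag distance h grows.
for model in (spherical, exponential, gaussian):
    print(model.__name__, round(model(500.0, 10.0, 100.0, 1000.0), 2))
```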
Figure 4. Geographic distributions of the actual (observed) sales prices and the residuals from each method: (a) actual values, (b) neural networks, (c) random forests, (d) IDW, and (e) ordinary kriging.
Figure 5. Relative variable importance in the selected random forest model.
Table 1. List of predictor and target variables and descriptions.
Type | Name | Description | Class
Predictor | bldg_area | Land area occupied by the building | Numeric
Predictor | yr_built | Year built | Numeric
Predictor | flr_area | Total floor area of the building | Numeric
Predictor | site_area | Area of the site on which the building is located | Numeric
Predictor | height | Height of the building (in metres) | Numeric
Predictor | bcr | Building coverage ratio | Numeric
Predictor | far | Floor area ratio | Numeric
Predictor | x | Latitude | Numeric
Predictor | y | Longitude | Numeric
Predictor | district | District in which the building is located | Categorical
Predictor | yr | Year of the sales transaction | Numeric
Predictor | month | Month of the sales transaction | Numeric
Predictor | net_area | Floor area of the property | Numeric
Predictor | flr_lv | Floor level | Integer
Predictor | type | Type of the property (e.g., apartments, detached houses) | Categorical
Target | price | Sales price | Numeric
Table 2. RMSE of neural network models with different combinations of hyperparameters.
Model | No. of Hidden Layers | No. of Nodes (Max) | No. of Nodes (Min) | RMSE
1 | 2 | 32 | 16 | 193.75
2 | 2 | 128 | 64 | 152.35
3 | 2 | 256 | 128 | 143.19
4 | 3 | 32 | 16 | 210.97
5 | 3 | 128 | 64 | 114.08
6 | 3 | 256 | 128 | 103.03
7 | 4 | 32 | 16 | 124.04
8 | 4 | 128 | 64 | 101.99
9 | 4 | 256 | 128 | 97.90
10 | 5 | 32 | 16 | 117.59
11 | 5 | 128 | 64 | 98.65
12 | 5 | 256 | 128 | 107.60
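The 12 neural network configurations in Table 2 form a full grid: four hidden-layer counts crossed with three (max, min) node settings. A minimal sketch of enumerating such a grid follows; the model numbering is assumed to follow the table's row order, and the training loop itself is omitted.

```python
# Hedged sketch of the hyperparameter grid behind Table 2: 4 hidden-layer
# counts x 3 node settings = 12 candidate models. Only the enumeration is
# shown; each configuration would then be trained and scored by RMSE.
from itertools import product

hidden_layers = [2, 3, 4, 5]
node_settings = [(32, 16), (128, 64), (256, 128)]

grid = [
    {"model": i, "layers": layers, "max_nodes": mx, "min_nodes": mn}
    for i, (layers, (mx, mn)) in enumerate(product(hidden_layers, node_settings), start=1)
]
print(len(grid))   # 12 candidate configurations
print(grid[8])     # model 9: 4 hidden layers, 256/128 nodes
```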
Table 3. RMSE of random forest models with different numbers of decision trees and predictor variables.
Model | No. of Trees | Predictor Variables | RMSE
1 | 50 | 4 | 116.12
2 | 50 | 8 | 87.47
3 | 50 | 16 | 82.58
4 | 100 | 4 | 115.44
5 | 100 | 8 | 86.85
6 | 100 | 16 | 82.07
7 | 150 | 4 | 115.27
8 | 150 | 8 | 86.68
9 | 150 | 16 | 81.92
10 | 200 | 4 | 115.13
11 | 200 | 8 | 86.57
12 | 200 | 16 | 81.86
Table 4. RMSE of ordinary kriging models with different sets of parameters.
Model | Fitting Function | Nugget | Sill | Range | RMSE
1 | Spherical | 18,189.04 | 45,968.56 | 1000 | 131.00
2 | Spherical | 25,750.27 | 55,018.88 | 3000 | 134.11
3 | Spherical | 27,728.88 | 61,251.09 | 5000 | 134.95
4 | Spherical | 28,872.34 | 66,397.40 | 7000 | 135.38
5 | Exponential | 21,004.69 | 54,192.23 | 1000 | 131.89
6 | Exponential | 26,625.02 | 67,663.44 | 3000 | 134.32
7 | Exponential | 28,158.26 | 78,786.51 | 5000 | 134.97
8 | Exponential | 29,120.28 | 94,920.26 | 7000 | 135.37
9 | Gaussian | 28,021.85 | 51,190.93 | 1000 | 136.91
10 | Gaussian | 31,722.12 | 65,187.40 | 3000 | 137.62
11 | Gaussian | 33,177.85 | 76,133.24 | 5000 | 137.74
12 | Gaussian | 34,099.72 | 98,675.17 | 7000 | 137.79
Table 5. Prediction accuracy of machine learning and conventional interpolation techniques measured by four error metrics, MAE, RMSE, MAPE, and MASE.
Method | MAE | RMSE | MAPE | MASE
Neural network | 64.4615 | 102.0348 | 0.1260 | 0.2759
Random forest | 48.1897 | 80.8789 | 0.0952 | 0.2063
IDW | 78.4847 | 126.5288 | 0.1544 | 0.3360
Ordinary kriging | 83.5145 | 131.0042 | 0.1655 | 0.3575
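The four error metrics in Table 5 can be computed from their standard definitions, as sketched below. For MASE, the scaling baseline is assumed here to be a naive training-mean predictor; the paper's exact benchmark may differ, so the MASE helper is illustrative only.

```python
# Sketch of the four error metrics reported in Table 5, from their standard
# definitions. Inputs are parallel lists of actual and predicted prices.
import math

def mae(actual, pred):
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def rmse(actual, pred):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))

def mape(actual, pred):
    # Expressed as a fraction, matching the 0-1 scale used in Table 5.
    # Undefined when any actual value is zero (not an issue for prices).
    return sum(abs((a - p) / a) for a, p in zip(actual, pred)) / len(actual)

def mase(actual, pred, train):
    # MAE scaled by the MAE of a naive baseline; the training mean is an
    # assumed baseline here, not necessarily the one used in the paper.
    naive = sum(train) / len(train)
    scale = sum(abs(a - naive) for a in actual) / len(actual)
    return mae(actual, pred) / scale

actual = [100.0, 200.0, 300.0]
pred = [110.0, 190.0, 330.0]
train = [150.0, 250.0]
print(round(mae(actual, pred), 2), round(rmse(actual, pred), 2),
      round(mape(actual, pred), 4), round(mase(actual, pred, train), 4))
```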
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Kim, J.; Lee, Y.; Lee, M.-H.; Hong, S.-Y. A Comparative Study of Machine Learning and Spatial Interpolation Methods for Predicting House Prices. Sustainability 2022, 14, 9056. https://doi.org/10.3390/su14159056

