Augmenting Geostatistics with Matrix Factorization: A Case Study for House Price Estimation

: Singular value decomposition (SVD) is ubiquitously used in recommendation systems to estimate and predict values based on latent features obtained through matrix factorization. But, oblivious of location information, SVD has limitations in predicting variables that have strong spatial autocorrelation, such as housing prices which strongly depend on spatial properties such as the neighborhood and school districts. In this work, we build an algorithm that integrates the latent feature learning capabilities of truncated SVD with kriging, which is called SVD-Regression Kriging (SVD-RK). In doing so, we address the problem of modeling and predicting spatially autocorrelated data for recommender engines using real estate housing prices by integrating spatial statistics. We also show that SVD-RK outperforms purely latent features based solutions as well as purely spatial approaches like Geographically Weighted Regression (GWR). Our proposed algorithm, SVD-RK, integrates the results of truncated SVD as an independent variable into a regression kriging approach. We show experimentally, that latent house price patterns learned using SVD are able to improve house price predictions of ordinary kriging in areas where house prices ﬂuctuate locally. For areas where house prices are strongly spatially autocorrelated, evident by a house pricing variogram showing that the data can be mostly explained by spatial information only, we propose to feed the results of SVD into a geographically weighted regression model to outperform the orginary kriging approach.


Introduction
The price of a house or apartment is challenging to estimate. The price not only depends on various variables of the house itself, such as the number of rooms, but also on the location [1]. Different school districts [2], proximity to public transport [3], and safety of the neighborhood [4] affect the price of a house. Fluctuations in house prices can have a strong impact on real economic activity. For instance, the rapid rise and subsequent collapse in US residential housing prices is widely considered as one of the major causalities of the financial crisis of 2007-2009 [5], which has in turn led to a deep recession and a protracted decline in employment. In this work, our goal is to leverage powerful, but spatially oblivious, recommendation systems to improve geostatistics for house price estimation.
Recommender engines have become an integral part of our e-commerce. With the progression of the recommender systems, there have been many improvements to it's implementation. Recommender system engines allow to predict the price of a house by leveraging knowledge about similar houses and their prices. One enhancement has been in integrating spatial elements [6,7] based on Tobler's first law of geography. The idea of these existing work is to build individial recommender systems for each areal unit. Secondly, we look at how RK has been used in its dominant field, environmental science, and how it is being used as part of machine learning algorithms.

Spatially-Aware Recommender Systems
Most recommender systems use collaborative or content-based filtering techniques allowing the customer to sift through items that may not be of interest and to help them navigate to personalized and relevant items faster. There has been extensive research on improving these methods by integrating spatial elements to recommender engines. A pioneering approach in 2009 by Bohnert et al. [7] used Gaussian spatial processes, also known as kriging, to predict which exhibits in a museum users would enjoy seeing next. They compared their process with a rule-based collaborative filtering recommender system, since matrix factorization was not yet used as the state-of-the-art method for recommendation systems at this time. The study did not combine the two approaches but rather compared one to the other and found that the kriging-based approach had a higher accuracy to the collaborative filtering method [7]. The study [12] also compares both kriging and their deep learning approach against each other but also do not combine them. Another study incorporated a k-d tree based space partitioning technique to collaborative filtering to split the users' space with respect to location. In the study they first grouped the users that were spatially similar and then fed each partition into the collaborative filtering recomender engine [6]. They found that this method is effective in producing a more quality recommendation along with reducing the run time. More recently, a study was conducted to develop a recommendation engine, STAP, that inferred interest in activities for a given time, around the current location of a user [13]. Using matrix factorization to infer temporal preferences and a personal function region to infer spatial activity preferences, the researchers improved Point-of-Interest (POIs) recommendation engines. In 2016, a group of researchers made additional improvements to recommender engines called SPORE using the classical Markov chain models [14]. Unlike the method from [13], this study utilized spatial information as direct features of the matrix factorization model [14]. Another approach by Yang et al. [15] used derived spatial information such as the distance from a users home location as explanatory variables to improve the recommendation of sites to users in a location-based social network. All of these approaches have in common, that they enrich the recommendation system with simple explicit and task-specific spatial features.

Machine Learning with Spatial Statistic
Ordinary kriging is a widely used spatial interpolation method and is based on a stochastic model of continuous spatial variation which can be depicted though a variogram or covariance function [16,17]. It is by far the most popular form of kriging which has been combined with SVD using traditional classification algorithms in [18] as illustrated in Figure 1a. However, rather than combining the results independently returned by kriging and SVD, the goal of this work is to integrate SVD directly in the kriging process, thus allowing kriging to leverage learned patterns and estimated house prices provided by SVD as sketched in Figure 1.
The technique of using auxiliary attributes of data as independent variables in universal kriging is also referred to as regression kriging (RK). This method has been used extensively in the environmental sciences. The authors of [19] describes different case studies where RK is traditionally applied in the field of environmental science. The study explores mapping soil organic matter using RK, mapping presence/absence of the yew plant, and mapping land surface temperatures. More recently, RK has also been extended to the machine learning field of study. In 2011, a study experimented with using residuals from machine learning into ordinary kriging (OK) [16]. For their combined methods, they applied LM, GLM, GLS, Rpart, RF, SVM and KSVM to the data, then OK or Inverse Distance Squared (IDS) was applied to the residuals of these models. They predicted values of each model and the corresponding interpolated residual values were added together to produce the final predictions of each combined method. The combination of RF and SVM with OK and IDS were novel at the time and had not been applied in previous studies in environmental sciences. In 2015, a study used a similar approach to [16] and applied the output of their neural network as covariates in the universal kriging (UK) algorithm and also applied the neural network residuals as covariates to the ordinary kriging algorithm to interpolate atmospheric temperatures [20] . However, the study found that their approaches were only successful 50% of the time. More recently in 2019, another study used the residuals from the machine learning algorithms Random Forest and Boosted Regression Trees to predict topsoil organic carbon [21]. However, the researchers did not find any significant improvements using the residuals as covariates to OK. Our approach spatially improves recommender engines using similar approaches in [20,21] but build models for each distinctive region. For our algorithm, truncated SVD is segmented by county and calculated individually. Then, both the outputs and residuals from the collaborative filtering method for each county are used as independent variables in UK. The observed values from truncated SVD are weighted based on the spatial structures calculated from UK, resulting in statistically significant improvements to truncated SVD.

Preliminaries
In this section, we discuss the Zillow House Price dataset used for our study and we present the existing concepts exploited by our algorithm used for our approach to combine recommender systems with spatial statistics. By selecting the appropriate hyper-parameters and models within each algorithm, we are able to utilize the strengths of each method to optimize the interpolation of housing estimates.

Study Area and Dataset
The data used for this study was derived from the Zillow Price Competition on Kaggle in 2016 and can still be found on the Kaggle website (Kaggle Data Set available at: https://www.kaggle.com/ c/zillow-prize-1/data). The competition was to develop an algorithm that makes predictions about the future sale prices of homes and in turn help increase the accuracy of the current Zillow algorithm at the time. While the competition has now been closed, we utilized this data set on our algorithm because the data points are densely populated, as shown in Figure 2, which makes it an excellent candidate for testing our algorithm that requires at least a moderate level of spatial autocorrelation.
The data set contains points from three different counties in California: Ventura, Los Angeles, and Orange county with County ID's 2061, 3101, and 1286 respectively. Each county will be evaluated to examine what methods are best to predict housing prices based on the characteristics of each county. The data contains dozens of attributes of each home but for this study we selected the attributes that had the least "NA" values which include, latitude, longitude, bedroom count, and bathroom count. Figure 3 provides a boxplot to show the range of housing prices in each of the three counties. The boxes show the 25% to 75% quantiles of housing prices for each county, whereas their whiskers (the lines reaching out from the boxes) show the 2.5% to 97.5% quantiles. We observe that County 3101 has the largest range of house prices, including outliers costing less than USD 1,000, and houses priced at as much as ten million USD.

Ordinary and Universal Kriging
Geostatistics was originally developed from the gold mining industry in the 1950s to determine if the local metal concentrations in ores were sufficient to make the extraction worthwhile. With much success, this method had proven to be an effective way to interpolate values assuming a continuous surface underneath. No model can perfectly describe the natural world and any technique used for interpolation will results in some amount of error. It is here where geostatistics, unlike other methods such as Voronoi polygons or trend surface analysis, is an optimal choice because it can provide estimates of the model errors. Even now, geostatistics is a popular method in environmental sciences [22,23]. Many scientists want to measure properties of interest to them (i.e., nutrients in soil, air pollution, rainfall) over a large continuous areas over which they often have only sparse observed data at limited numbers of places [17]. In these circumstances the best that they can do is to estimate or predict in a spatial sense values between observed points and use geostatistics in this way.
The geostatistics technique kriging is used to interpolate values of non-sampled locations from a collection of sampled areas and estimate the uncertainty surrounding the predicted value [24]. Kriging is not effective if there is no or negative spatial autocorrelation among data points [25]. But for the case of house pricing estimation, the literature provides evidence of spatial auto-correlation [1] confirming Tobler's first law of geography, stating that "Everything is related to everything else. But near things are more related than distant things." [26]. For more information on a recent tutorial on kriging can be found at [27]. Following this intuition, we assume that housing prices are spatially autocorrelated, which can be exploited for predicting using a kriging estimate.
Typically, ordinary kriging interpolates values assuming stationarity and homoscedasticity, thus assuming no trend. In other words, the mean and variance across the entire spatial domain is assumed to be constant. This study incorporates Ordinary Kriging which assumes a constant stationary function and is calculated using Definition 1: Definition 1 (Ordinary Kriging Estimate). The kriging estimate is of the form: whereẐ(x 0 ) is the estimated value (specifically, housing prices in our use-case) at location x 0 ; Z(x i ) are all other observed values at locations x i ; N is the sample size; and λ i weights chosen for x i .
The challenging part of Definition 1 is to find appropriate weights λ i to define how much an observed value Z(x i ) at location x i affects the value Z(x 0 ) to be estimated. For this purpose, different models (variograms) are used in the literature to choose the weight depending on the distance d(x 0 , x i ) between the estimated location x 0 and the location of the predictor value x i .
where N(h) is the set of experimental pairs having distance h, that is, a set of observations (

the observed values at places x i and x i + h.
In order to build the appropriate models, kriging builds an empirical Variogram (Definition 2) which is calculated by substituting the distance between two observed points into Definition 2.
The empirical variogram values provide information on the spatial autocorrelation of the dataset and can then be used to fit a model to ensure that kriging predictions have positive kriging variances [28]. This data, which has N = 16,997 house prices as described in Section 3.1, leads to N 2 − N = 288,881,012 pairs of data points. These pairs are grouped by their Euclidean distance, and for each distance group at 0.01 mi the corresponding empirical covariance (as defined in Definition 2) is given on the y-axis.
We can empirically observe whether there is a strong spatial autocorrelation within the data based on the empirical variograms, showing at what point the correlation drops and the covariance as distance increases. An experimental variogram that increases with increasing gradient as the lag distance increases signifies trend [17]. When trend is present within the data, the process should be modelled as a combination of a deterministic trend and spatially correlated random residuals from the trend [17]. Thus, for this study we also use Universal Kriging where the mean will differ throughout the spatial domain while still assuming homoscedasticity, i.e., assuming that the variance is constant and is used when data has a strong trend.
To leverage the empirical variogram (Definition 2) for kriging, the values of λ i in Definition 1 are chosen by solving the following set of equations: whereγ(d(x i , x j )) is a model of the semivariance between as defined in Definition 2 given their distance d(x i , x j ). By definition, the kriging weights are subject to the constraint that they sum to one [17].
ψ(x 0 ) is a Lagrange multiplier introduced for the minimization of the error variance, thus λ i must sum to one. Ordinary Kriging assumes a stationary stochastic process, i.e., having a constant (but unknown) mean. But for housing prices, different areas may have different mean house price, depending on neighborhood, school district, and proximity to public transport [2][3][4]. Universal Kriging (UK) is used for non-stationarity when drift is present. It allows for non-stationarity thus allowing the mean to differ throughout the spatial domain while still assuming homoscedasticity, i.e., assuming that the variance is constant and is used when data has a strong trend. UK assumes that the prediction variable, Z(x), can be written as a sum of deterministic trend component m(x) and the spatially correlated random residual component (x): where a i is the ith coefficient to be estimated from the data; f i is the ith basic function of spatial coordinates which describes the drift; k is the number of functions used in modeling the drift; (x) is a spatially correlated random residual.
For this study, the mean, m(x) is assumed to have a linear function of two-dimensional coordinates with k = 3 and can be approximated by a suitable model with the form [29]: where a i is the coefficient to be estimated from the data, and x and y are the spatial coordinates in formal UK. With the appropriate equations mentioned above, we can calculate an empirical variogram to determine the weights of the observed values. Unlike Ordinary Kriging, where the variances are calculated by the observed location values x i and x i + h, in UK they are calculated by the observed location residuals from the trend. In contrast to the mean in OK which assumes to be constant, the mean in UK is a function of the spatial coordinates and the local drift which is represented by a linear equation of the independent variables.
To build a modelγ of the empirical semivariogram as required in Equation (3) different parametric models can be used. For this study, the Spherical and Matern model were interchangeably chosen since they best fit the empirical semivariogram in different counties. The Spherical model is given by the following expression: where d is the distance parameter or range, c 0 is the nugget variance, c is the variance of spatially correlated component, and h is the lag interval. The Matern model is given by the following expression [30]: where h is the distance between the observed point and unobserved point, K θ 2 (·) is the modified Bessel function of order θ 2 , θ 1 (the "range" parameter) controls the rate of decay of the correlation between observations as distance increases, and θ 2 controls the behavior of the autocorrelation function for observations that are separated by small distances [31].

Truncated Singular Value Decomposition (SVD)
Matrix factorization has become the state-of-the-art technique for recommendation systems as popularized by the "Funk" Singular Value Decomposition method used in the 2015 Netflix Challenge [32]. The idea of matrix factorization is to represent a data matrix by multiple (much-) lower dimensionality matrices of latent factors. Then, multiplying of the factorized matrices yields an approximation of the original matrix that allows to estimate missing values within the matrix. The goal of truncated SVD is to keep only the k latent factors having the most information. SVD has also been used to compare ground data and measurements collected by Earth systems because SVD offers an effective method in finding frequent and simultaneous patterns between two spatio-temporal fields [8].
The idea behind such latent factor models is that a large portion of a n × m data matrix R may be highly correlated, and can be approximated by matrices of a lower rank. For example, a matrix that stores a rating value for each of n user combinations, and each of m items may have high redundancy: Some users may be very similar to each other, while some items may exhibit similar reactions to topics. Definition 4 (Singular Value Decomposition (SVD)). Let R be a m × n matrix. SVD factorizes R into three components: where U is a m × m matrix, V is a n × n matrix, and Σ is a m × n diagonal matrix containing the singular values of R in descending magnitude. Intuitively, Σ can be seen as a scaling matrix, that scales dimensions according to their variance/information. U and V can be seen as rotation matrices, which rotate the data space in the directions of the highest variance components.
The idea of truncated SVD is to simply truncate all but the k highest variance components, discarding the rest of the matrices. This not only allows for an efficient approximation of R, but it also allows to discard detailed and potentially overfitting information.
Definition 5 (Truncated SVD). Let R be a m × n matrix and let k ≤ n, m. Truncated SVD factorizes R into three components: where U k is a m × k matrix, describing each line of R by k latent features, V k is a n × k matrix, describing each column of R by k latent features, and Σ k is a k × k diagonal matrix retaining the largest singular values of Σ. Σ can be seen as a scaling matrix, that scales dimensions according to their variance/information. U and V can be seen as rotation matrices, which rotate the data space in the directions of the highest variance components.
The matrix factorization method provides a clear way to observe the correlations between rows and columns in a one-shot estimation to the entire matrix [33]. To leverage this factorization for prediction, we rebuild an approximationR of data matrix R, by multiplying the truncated matrices. In the resulting matrix, each predicted value,r ij , inR is calculated by the dot product of the ith row factor u i of U k and the jth column factor v j of V k [33]: Once the latent vectors for rows and columns are established, it is simple to use Equation (10) for estimating the values of R. Since SVD cannot directly compute estimates with an incomplete matrix, a simple approach is to take the average of the columns or rows of the observed values and input them into the missing ratings. However, this approach can be biased. By using the truncated SVD algorithm with Stochastic Gradient Descent (SGD) which is a convex optimization method, only observed values will be taken into consideration while avoiding overfitting through a regularized model [32]. Overfitting is common when a training dataset is small. Through regularization, the models incorporate a bias which discourages extremely large values of the U k and V k coefficients to promote stability. SGD is used in truncated SVD as an approach to minimize error by finding at what point the slope of the error loss function is closest to zero by looping through all observed values, computing the prediction error, and then using the error from each observed value to update the k entries in row i of U and k entries in column j of V. Details of this SGD approach for approximating missing values in truncated SVD can be found in [32][33][34].
While truncated SVD has proven its predictive power in many applications such as user-movie recommendation [35] and face recognition [36], it does not explicitly allow to model spatial locations and spatial auto-correlation. This limitation is evident due to truncated SVD using linear algebra only, which is too limited to describe spatial trends. Therefore, in the following, we propose a modification to kriging which augments the power of latent factor modeling using truncated SVD with a kriging approach that gives spatial information appropriate special treatment.

Methodology
In this study, with the Zillow data discussed in Section 3.1, we first show that ordinary kriging can provide better house price prediction better than machine learning approaches based on SVD. As shown in Figure 4, if the variogram for a specific couny fits the data well, than ordinary kriging in the method of choice. However, if trend is present in the variogram, than our algorithm SVD-RK can enhance ordinary kriging by generating stronger predictions.
In a nutshell, the idea of our approach is to first assess how well the deviation of prices can be explained by distance. Cases where the covariance function shows a strong fit to the empirical variogram indicate that that ordinary kriging may be sufficient to yield good predictions, without the need to further explain the deviation using SVD. However, in cases where some places cannot be explained by their local neighborhoods, such as an expensive area adjacent to cheaper housing, or areas in close proximity having varying house prices due to different school districts, we may be able to leverage SVD. The idea is that SVD may find that a location is similar to other areas (regardless of distance), and thus, estimate house prices similar to these locations having similar (latent) features. The following subsection describes our approach, as shown in Figure 4, to leverage universal kriging with information derived from latent feature analysis using truncated SVD.

Step 1: Segmenting
Our proposed algorithm integrates hidden patterns that the human eye cannot detect using distance and spatial arrangement of nearby observations to improve the prediction of unobserved values. In a previous study [18], we combined the outputs of SVD and OK into a hybrid model using a multi-linear regression model and a neural network as shown in Figure 1a. The shortcoming of this approach is the assumption of a stationary data set, where mean and standard deviation of the variable of interest are identical in all locations. This assumption held in the previous study [18], of predicting emotion values rather than house prices. However, different neighborhoods can have significantly different mean house prices, incurring a large error if stationary means and homoscedasticity are assumed, and leading to major confusion by the neural network approach as shown in Table 1.  Table 1 also shows us that when the whole data set is ingested into SVD, UK, and OK, then SVD outperforms both types of kriging significantly. However, as shown in the second row of Table 1, when the data is segmented by county, OK performs better than when the whole data set is utilized at once. This is because, when segmented, each county has its own covariance function that describes the unique spatial auto-correlation. We leverage this segmentation for both SVD matrices and kriging models, and build an algorithm that can outperform both SVD and kriging. Thus, as shown as in Figure 4, the first step from the Original Data is to separate the data by county.

Step 2: Variogram Construction
In the second step of Figure 4, we take the separated data and construct house pricing variograms for each study area to understand their spatial structure. The purpose of the variogram for a study area is to decide whether kriging alone is expected to yield good predictions, or if we should leverage SVD to explain the unexplained residuals which is the decision point in Figure 4. Our experimental evaluation in Section 5 will depict example cases in which the model variogram strongly fits the empirical variogram such that SVD cannot improve the housing prediction (county id 3101), and show cases in which the model variogram does not fit the empirical variogram well such that the strength of SVD can improve the housing price prediction (county id 1286 and 2061).

Step 3: Applying SVD to House Price Matrix
In order to utilize truncated SVD as part of our algorithm, we used location ids that were derived from unique latitude and longitude combinations as the rows for the matrix, and the non-spatial attributes (bedroom and bathroom combinations) for the columns of the matrix as shown in Figure 5. This data structure allows the truncated SVD algorithm to make housing price predictions of different locations for all the possible bedroom-bathroom possibilities that were unavailable, depicted in orange cells in Figure 5. Truncated SVD allows our algorithm to leverage the benefits of traditional recommender systems by identifying similar locations (i.e., locations having similar prices for the similar types of houses), and by identifying similar types of houses (i.e., types of houses having similar prices for similar locations). Truncated SVD only takes into consideration hidden patterns within the data and does not look at spatial elements in any way. Our approach is designed to use the SVD outputs and residuals and weigh the truncated SVD predictions based on their spatial structure through RK, which we call SVD-RK.

Step 4: Feeding Truncated SVD into Regression Kriging
As described in Figure 4, kriging and SVD are combined by feeding the SVD output into a regression kriging as covariates as sketched in Figure 1b and architected in Figure 4. The reason for feeding SVD outputs into kriging instead of kriging outputs into SVD is because in real estate housing price estimation, the SVD outputs are stronger than UK when compared individually as shown in Table 1. Thus, feeding more accurate estimations from SVD into kriging, our algorithm will further improve the predictions by taking into account spatial autocorrelation.
The idea of our approach is that truncated SVD leverage latent features of locations for prediction. For example, SVD may find that one location exhibits similar housing prices than a set of other locations, thus leveraging these other locations for prediction. Truncated SVD treats each location as a nominal variable, and thus, is oblivious of distances between locations and unable to leverage spatial autocorrelation. By leveraging the variogram constructed as described in Section 4.2 (and shown in Section 5 for the Zillow dataset described in Section 3.1), we can explore if a given dataset is strongly spatially autocorrelated. In that case, we use the unobserved values estimated by trunacated SVD as independent variables for universal kriging (UK). Traditionally, the mean of UK is a function of the site coordinates, such that the latitude and longitude are independent variables of a linear model [17]. However, UK can also use a regression from other independent variables which is known as regression kriging (RK). RK is a spatial interpolation technique that combines a regression of the dependent variable on auxiliary variables (such as soil nutrients, or temperature) with simple kriging of the regression residuals. It is mathematically equivalent to the UK, where auxiliary predictors are used directly to solve the kriging weights [19]. Instead of using latitude and longitude as traditional auxiliary variables, we use other attributes, the SVD outputs and residuals, as independent variables which lead to better results. This allows UK to use new auxiliary variables to calculate the weight matrix for all observed points. The study [20] explored regression kriging using the outputs of an artificial neural network as covariates to UK. The approach showed some promising properties, in that it did better than UK alone in four out of eight examined model cases, however, the evidence of benefits was inconclusive since it showed varying degrees of success. In the study [21], Random Forest and Boosted Regression Tree residuals were used as covariates with ordinary kriging but RT residuals with OK did not show any improvements as covariants. The study found that OK and UK were the most accurate mapping methods in comparison to RK. We experimented using the truncated SVD outputs and the truncated SVD residuals as covariates separately for OK and UK. However, we found that that these methods, as explained in [20,21] showed no benefit in predicting points. Thus we applied a segmentation enhancement to the approach that has shown strong improvements to the final predictions.
The strength of kriging is that a variance can be calculated to determine the certainty of the kriging interpolation. SVD-OK feeds SVD into kriging which not only outputs a stronger prediction than either SVD or kriging individually, but it also outputs a variance. This can be utilized to make intelligible predictions because it quantigies the confidence of the prediction that the algorithm makes. Thus the variance causes a more explainable approach to machine learning. Another important aspect to UK is that the technique is able to calculate the weight vector for unobserved points using independent variables that may only be present in training the model but not present or available in the unobserved points. With this in mind, we have utilized both the truncated SVD output and the truncated SVD residual as independent variables together in creating our covariance function. Thus, the kriging approach not only knows the SVD prediction based on latent factors, but it also knows the confidence of these prediction given its difference to the ground truth of training data. Our results show that our combined approach SVD-RK, which feeds the truncated SVD results in regression kriging outperforms individual approaches based on matrix factorization and kriging alone as well as the approaches in previous studies [20,21].

Coding Language and Packages
This study was conducted using both the R and Python programming languages. The python package "surprise" was used to apply truncated SVD to the data. Specifically the NMF function, under the matrix factorization algorithms, within the "surprise" package was used to to perform truncated SVD. This function ensures that the user and item vectors stay positive. More information on this technique can be found in [37,38]. For this function, Table 2 describes the hyperparameters used for this study. The R package "gstat" was used in applying kriging to the data. The "variogram", "fit.variogram", and "krige" functions were utilized in making the predictions using the truncated SVD outputs and residuals. All code for this study can be found on on GitHub [39].

Experimental Evaluations
Our aim is to explore and combine the techniques of kriging and Recommender Engines using truncated SVD such that the results will stabilize and reduce the mean absolute error (MAE) of housing price predictions. In order to leverage the strengths of both methods, we first built variograms for each county. Each county describes a different spatial structure as shown in Figure 6. As shown with Figure 6a, the covariance function fits the empirical variogram very well for county id 3101 which gives us insight into what algorithm to use for this county. Since the covariance function can describe the spatial structure well for county id 3101, OK will be the most suitable approach to predicting housing prices. As shown with Figure 6b,c, the best fit curve function still cannot fully describe the spatial structure for county id 1286 and 2061. However, with our proposed algorithm, we found that SVD-RK can utilize hidden patterns within the data and kriging can weigh the SVD outputs and residuals based on spatial correlation to produce stronger results than OK alone.
For the experiments conducted, we split the data into training and testing sets. Since the data was derived from the Zillow Kaggle Challenge as explained in Section 3.1, the challenge specified which data points were to be used as part of the training and testing. This study followed the exact same train/test split procedure for the purpose of producing comparable results. Following the exact same procedure allows all the experiments in this study to be evaluated fairly. In addition, there were no issues with minimum point density as a limitation for kriging. Since the data has already been pre-processed for the Kaggle challenge, there was a sufficient amount of points to build a variogram for every county. Furthermore, for data points that had NA values for some attributes, these lines of data were dropped. Only values within the data set that had a value for latitude, longitude, bedroom count, bathroom count, and price were utilized in this study.

Evaluation of SVD-RK
Integrating truncated SVD into regression kriging, we first created separate matrices for each county to find the latent patterns. An important SVD parameter is the choice of K, the number of latent features, shown in Table 2. From running experiments shown in Figure 7, we found that any number less than 10 features yields similarly good prediction results. In particular, we get similar (only slightly worse results) using k = 1 latent features instead of k = 5. This shows that a single feature (which may latently describe the "priceliness" of a neighborhood) yields good price prediction results. Table 2 also shows the appropriate amount of epochs, which is the number of iterations of the SGD procedure. The learning rate in SVD is the average rating given by a user minus the average ratings of all items. This term is used to reduce the error between the predicted and actual value. For this study, the learning rate set to 5 × 10 −8 was found to produce optimal results. The SVD results are then fed into separate kriging models by county to weigh the observed values based on the spatial description of that county. Since every county may have different characteristics, segmenting the models helps to weigh the observations appropriately. Table 3, our method, SVD-RK, which applies truncated SVD matrices and kriging models by region presents improvements to counties 1286 and 2061, which could not be fully described by the covariance function in kriging alone as shown in Figure 6. We found that when appling SVD-RK to each county reperately, truncated SVD matrices are able to better learn the hidden features of the data for a particular area thus producing results with less outliers which leads to better kriging results. For truncated SVD, every county may have different hidden characteristics which can be extracted more clearly when segmented. For UK, segmenting each model by county, in addition to segmenting matrices in truncated SVD, allows the algorithm to better describe the spatial structure of each county better than apply one generic covariance function to describe all three counties. Thus, applying SVD-RK by county, allows the algorithm to tailor the spatial structure differently for a given area. One area may be best described with a Spherical covariance function while another may be better described with a Matern covariance function. Table 4 lists the spatial structure for each county to highlight their differences and why segmenting would stabilize the results and make better predictions for certain counties. For comparison purposes we also tested how Non-Negative Matrix Factorization (NMF), an implementation for SVD that follows [38], would react if fed into RK. NMF is similar to SVD, except for having an additional constraint of imposing non-negative weights in matrices U and V. Table 3 shows the results of NMF-RK and show that SVD-RK yields better results. Thus this study focuses on SVD-RK as the method of choice.

As shown in
Our approach, as shown in Table 3, does not work well for county 3101, which covers regions of different densities with large distances between these regions (see Figure 2), which also has outliers with extremely low/high house prices, as shown in Figure 3. In this case, the large variance of house prices combined with the large distance to nearest points in some locations yields to vast mis-estimations. We also see that ordinary kriging yields extremely good results for this county, due to the variogram fitting the empirical covariances very closely as shown in Figure 6a. Thus, in this case, the spatial location (captured by orginary kriging) is already able to tightly fit the data and explain most of the error. Yet, we will propose two approaches to improve our approach to work in cases such as County 3101. First, we will propose an approach for outlier remove in Section 5.2. While this appraoch stabilizes our SVD-RK approach, it still yields worse results than ordinary kriging. Thus, we propose to substitute regression kriging with geographically weighted regression (GWR), in our approach, leading to better results in Section 5.3.

Outlier Removal for SVD-RK
While SVD-RK has shown significant improvements to the truncated SVD predictions, county id 3101 had no improvement. To better understand why the improvement for county id 3101 was not as strong as the others, we created a boxplot to visually see the outliers in Figure 3 and mapped the housing prices shown in Figure 8. Figure 8a shows county id 3101 with a relatively even distribution of housing price values except near the bend where there are red housing prices showing more extreme values, while Figure 8b for county id 2061 and 1286 shows almost no extreme values. We then calculated the minimum, maximum, average, and standard deviation of the housing prices for each county id shown in Table 5. From Table 5, it is evident that county id 3101 has the largest difference between the minimum and maximum house price points and has visible outliers. Since kriging is highly sensitive to skewed distributions and extreme positive values having a large impact on semi-variance calculations, this caused the improvement in error to be relatively smaller for county id 3101 [17].  To improve the mean absolute error for county id 3101, we proposed removing the outliers when kriging to decrease standard deviation and range between the maximum and minimum values. We removed all outliers deviating by more than two standard deviations from the mean. This resulted in a smaller standard deviation and an increase in accuracy by 25.9% (shown in Table 6) from the original segmentation OK method (shown in Table 3). For those outliers that were left from being ingested into the algorithm, the results from using OK without treating outliers can still be used with an MAE of 83,000.

Feeding Truncated SVD into Geographically Weighted Regression (SVD-GWR)
In addition to non-stationary kriging, this study also explored improving truncated SVD with Geographically Weighted Regression (GWR) since it is one of the recent developments of spatial analytical methods [40]. GWR is a local spatial statistical technique for exploring non-stationary data and, although controversial, it can also be used to make predictions. Similar to kriging, this algorithm is closely related to Tobler's first law of geography assuming that closer observations are more related than distant observations [26]. GWR calibrates a separate ordinary least squares regression at each location in the data set within a certain bandwidth and allows the relationships between the independent and dependent variables to vary locally. The GWR model can be described as the following: where y i is the dependent variable at location i, β i0 is the intercept coefficient at location i, x ik is the k-th explanatory variable at location i, β ik is the k-th local regression coefficient for the kth explanatory variable at location i, i is the random error term associated with location i, and i is indexed by two-dimensional geographic coordinates (u i , v i ) [41]. The equation for GWR is similar to traditional regression however, for GWR, the regression coefficients are estimated at each data location rather than being fixed for a study area [40]. Further details for the GWR method can be found in [40][41][42][43]. While some studies have shown the potential of GWR as a predictor [44] there is still some controversy whether to use this method to make predictions [45]. Thus we compared GWR to OK for county id 3101 and our algorithm SVD-RK for county id's 1286 and 2061 to determine if this method would increase the prediction accuracy better. For this experiment we used the truncated SVD output and residual as independent variables for Equation (11). Table 7 shows the results of applying integrating truncated SVD to GWR, which we called SVD-GWR, compared to the OK algorithm (since this method performed the best for county id 3101 as seen in Table 3). We found that SVD-GWR improves the accuracy by 6.2%. However, Table 8 shows that SVD-GWR performed worse for county id 1286 and 2061 in comparison to SVD-RK (the optimal method for the two counties as shown in Table 3). Table 7. GWR over OK in USD.

3101
78,300 83,500 6.2% Table 8. GWR over SVD-RK in USD. Since removing outliers for county id 3101 improved the OK method (shown in Table 6), we also wanted to determine if removing outliers would improve the SVD-GWR method. Table 9 shows that there was an improvement to the error in comparison to SVD-GWR without being treated for outliers, however the improvement of OK treated with outliers in Table 6 is still the method that produced the lowest error for county id 3101. Table 9. GWR over OK without Outliers in USD.

Conclusions and Future Work
Housing price estimation is a challenging task that depends not only on properties of a house, but also on its location. We propose a system that leverages the predictive power of a latent-feature based recommender engine using truncated singular value decomposition (SVD). As such systems are spatially oblivious, we integrate the result into regression kriging. This way, kriging fits a model to a data set in which gaps are filled by SVD, and where outliers, which cannot be explained by the main latent features of the data, are removed by truncation of non-principal components of SVD. Our experimental evaluation shows that our proposed integrated approach SVD-RK outperforms both standalone SVD, standalone ordinary kriging, and universal kriging in two out of three study areas where the variogram fit shows a degree of trend. However, for the study area that has a good variogram fit a straightforward application of our proposed approach may fail. Therefore, we propose to augment our approach using an outlier detection approach, as well as combining it with GWR to better account for outliers in the data.
Our work shows that, given only basic information about houses such as the number of rooms and bathrooms, the combination of SVD and kriging yields the best of both worlds of recommender systems and spatial interpolation. A next step of this work is consider further non-spatial features, such as the size of a house (in square meters), and the age of house (in years), in the recommender system module to give SVD a more accurate prediction, and thus, improve the SVD-RK result.
Funding: This research received no external funding.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: