1. Introduction
The urban space zone is a multilayer spatial structure that gathers people and the products of their activities in nearby places. Urban space has multiple functions, such as economic [1,2], social [3,4], industrial [5], transport [6], cultural [7], administrative [8], and housing [9,10]. The intensity and direction of urbanization processes are determined to a large extent by local demographic, economic and administrative conditions. As a rule, the valuation of urban space is strictly related to the basic economic good, i.e., land. In this sense, modelling the spatial distribution of land value, expressed by the prices of urbanized land intended for housing development, is a significant element in supporting a series of decisions in the property management system. The issue of urban space valuation, in the form of land value map generation, has been the subject of a series of studies [11,12,13,14,15,16] demonstrating that the relations between price and neighborhood have both spatial and non-spatial features [17,18].
The methods for developing land value maps are based primarily on relations between land prices and selected reference points in urban space. Liu et al. [19] analyzed the development of prices and values in relation to the distance from the CBD (Central Business District), elements of social infrastructure, schools, etc. In a similar manner, Bugs [20] suggests that the value map should be developed on the basis of distance, e.g., from the city center, main streets, places particularly affecting the value, as well as areas at risk of flood. Spatial analysis with GIS tools then yields maps of urban space valuation, which also reflect land value. In a slightly different line of research on land value map development, hedonic maps play a particular role, taking into consideration selected features of properties as price determinants [21,22,23,24,25,26]. The application of GIS tools and geostatistical methods to model the area representing the land value appears particularly interesting [27]. Fik et al. [28] suggest the use of hedonic models together with surface trends for LVS (location value signature) evaluation. In a similar vein, Bourassa et al. [29] postulate the application of hedonic models and geostatistical methods in conjunction. Geostatistical methods can be treated as a natural supplement to traditional statistical analysis, taking into account the spatial distribution of the phenomenon under analysis.
Both theory and practice indicate that models that do not take spatial autocorrelation and spatial heterogeneity into consideration can provide inaccurate results [30,31,32]. In the spatial econometrics literature, the negative consequences of ignoring spatial autocorrelation and/or spatial heterogeneity are well known; numerous publications therefore postulate the application of spatial models for market analyses and price prediction [29,33]. Spatial effects can be taken into account either directly, so that they become part of the model structure, or indirectly, so that they are pre-treated prior to building the model. Models that take these effects into account directly include, among others, the spatial econometric model [30,34], spatially switching regression [35], random coefficient models [36], and geographically weighted regression [37,38].
In recent years, increasing attention has been given to the synergy between hierarchical and spatial modelling, which forms the basis for constructing hierarchical spatial models [39,40,41,42]. Some studies also concern the use of hierarchical spatial models in the analysis of the real estate market [42,43,44]. This paper presents the results of research on the methodological bases for constructing land value maps in Olsztyn, Poland. The aim of the study was to demonstrate that two-level hierarchical spatial autoregressive models can provide a significant alternative to the models typically applied for the construction of land value maps: the Linear Model (LM), the Hierarchical Linear Model (HLM), and the Spatial Autoregressive Model (SAR). The paper is structured as follows. After the introduction, Section 2 describes hierarchical spatial autoregressive models, together with an overview of previous results published in the field and the theoretical basis for the performed research. Section 3 presents the data description, the procedure of the applied methodology and a discussion of the obtained results. Section 4 presents the conclusions drawn from this work.
2. Theoretical Basis of the Conducted Research
Spatial factors (e.g., neighborhood attractiveness) affecting the property market are relatively difficult to describe with mathematical models [10,45]. Their analytical depiction explains only part of the relations behind events such as a property transaction occurring in a given location, characterized by a specific set of attributes and a price. A specific feature of the property market is the occurrence of spatial relations at both the individual and the group level.
The individual level is formed by point objects with known geographical coordinates (properties), and the group level can be obtained by classifying properties into territorial units, e.g., housing estates, districts, cities, communes or regions. Multilevel consideration can also be applied to spatial data concerning the level of land value, e.g., the value of individual properties (level I), the values of land in individual zones or sections (level II), and values at the level of city districts (level III). The multilevel structure of real estate market data provided a direct motivation for applying hierarchical spatial autoregressive (HSAR) models to create land value maps.
The analysis of market data requires taking into account the spatial effects typical of the property market, namely heterogeneity and spatial dependence, which in turn provide reasons to apply multilevel models. The term heterogeneity can be applied to all changes in the distribution of a given phenomenon, whether continuous or discrete. The source of data heterogeneity is the absence of spatial stationarity, which manifests itself in the instability of relations between phenomena in geographical space and/or their lack of uniformity in spatial distribution [30,34,46]. In the case of geographically located data, uncontrolled heterogeneity can lead to inaccurate conclusions concerning the examined relations. Instability of structural parameters in the regression model can take the form of a continuous or a discrete change. If the value of a parameter changes along with the location coordinates of the object, the instability is continuous. In such a case, the value of the parameter is functionally related to the location of the object. This is quite common in hedonic regression models of property prices [47], which make it possible to analyze the relation between the value of a property and its features.
Another form of spatial heterogeneity is heteroskedasticity of the random component. Grouping (clustering) of the values of the random component is most frequently the effect of omitting one or more significant explaining variables or the result of model specification errors. In the case of spatially located data, it is difficult to expect the distribution of a feature in geographical space to be even and regular. It is also difficult to identify and quantify all the factors responsible for the existing irregularities and heterogeneities. As a result, models estimated on spatial data quite frequently demonstrate a lack of homogeneity of the random component [34,48,49].
Spatial interactions for spatially located data can refer to the endogenous variable, the explaining variables, or the random component. If these interactions concern the endogenous variable, then spatial autoregression is involved. This means that the values of this variable in other locations affect its value in the analyzed location. If the interactions concern the random component, spatial autocorrelation of the random component of the model occurs. Depending on the type of spatial interaction, two basic spatial regression models are most frequently used: the spatial lag model and the spatial error model [35,50,51,52]. Intergroup differentiation estimated with a multilevel model that ignores spatial relations is overestimated. This may lead to incorrect conclusions concerning the scale of heterogeneity of the phenomenon when the analyzed process is marked by spatial autocorrelation.
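For reference, the two specifications named above are usually written as follows. This is textbook notation rather than an equation taken from this paper; the symbol $\kappa$ for the error-autoregressive parameter is chosen here only to avoid a clash with the $\lambda$ used later for the group level.

```latex
% Spatial lag model: the lagged dependent variable Wy enters the regression
y = \rho W y + X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n)

% Spatial error model: the autoregression sits in the random component
y = X\beta + \xi, \qquad \xi = \kappa W \xi + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n)
```

In both cases W is a spatial weight matrix, and setting $\rho = 0$ or $\kappa = 0$ returns the classical linear model.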
In traditional spatial econometric models, the presence of spatial interactions is understood as the occurrence of relations between each (or selected) pair of observations i and i’, always forming only one level of analysis [53]. In multilevel models, the level of observation is one of several possible levels of analysis. If level II and subsequent levels were obtained through spatial aggregation of geographically located data, then spatial interactions can exist at one or several levels of the analysis, e.g., at the individual level, at the group level, or at both at the same time [54,55,56]. A hierarchical spatial autoregressive (HSAR) model can provide an alternative to previous spatial interaction methods, making it possible to analyze complex forms of heterogeneity. The tools explored so far, based on spatial modelling of the property market, are directed mainly towards the identification and interpretation of interactions between objects. Spatial dependencies have been considered at the individual level, typically without taking into account intergroup effects, understood as connections between random group means [56].
An additional problem in price level analysis is that the evaluated space is limited, so that conclusions concerning broader spatial systems are drawn from a limited area situated in the direct neighborhood of the examined area. The evaluation of a place through the prism of its nearest neighbor is more evident when a larger spatial structure is subject to evaluation. In practice, this can mean that the level of property value will be the resultant of partial evaluations attributed to spatial units, e.g., a district, a housing estate, etc., and it will therefore reflect the intuitively understood spatial hierarchy. For this reason, the multilevel approach seems to be an adequate method for modelling the area representing the land value. It makes it possible to control heterogeneity at several levels of spatial aggregation at the same time, without the need to introduce ten to twenty binary variables and, consequently, without the loss of degrees of freedom that occurs in fixed effect models [57]. Therefore, the differentiation of the land value level between territorial units occupying a higher position in the spatial hierarchy, modelled as random effect variation, is adjusted for the differentiation observed between territorial units occupying a lower position in this hierarchy.
An additional advantage of multilevel analysis is the possibility of introducing context variables, i.e., those for which data are only available at levels of aggregation higher than the level at which the explained variable is considered. This is also possible with the traditional OLS estimator, but in that case biased results can be expected [58]. Therefore, the level of land value analyzed at the individual level can be explained not only with variables characterizing the given unit, but also with variables reflecting the features of its environment.
The above arguments justify the attempt to apply hierarchical spatial autoregressive models both to the analysis of the spatial differentiation of prices in the property market and to the creation of land value maps. Taken together, the above-mentioned aspects of research on the level of prices and land value encourage the search for models that control both the multilevel structure of the phenomenon and spatial interactions. These requirements can be satisfied by HSAR models, the structure of which is suited to the simultaneous identification of both types of spatial effects. Taking heterogeneity and spatial interactions into account with the HSAR model can therefore extend the analysis and statistical modelling used in property market studies. The main aim of this study is thus to present the concept and principles of spatial analyses using hierarchical spatial autoregressive models as a substantive basis for developing land value maps. Additionally, the study identifies the possibilities for applying HSAR class models in research concerning the development of land value maps (price prediction).
The basic model used in spatial econometrics is the spatial autoregressive model (SAR), used for explaining processes characterized by spatial autocorrelation. In multilevel modelling, the central role is played by the traditional hierarchical (multilevel) model (HLM) with random effects at the higher level, which can also be applied to explain processes characterized by spatial heterogeneity [58]. Including both hierarchy and spatial heterogeneity in one model provides a basis for spatial multilevel modelling [39,40,42]. The class of HSAR models is described in detail in, e.g., References [42,56]. These models extend the typical SAR model to include the hierarchical data structure.
Many spatial data sets demonstrate a hierarchical structure, e.g., a property situated in a district located in an urban area [59]. According to the literature on multilevel modelling, the individual objects being measured form the lower level, while objects aggregated in the form of, e.g., regions, belong to the higher level [58]. The main assumption of multilevel modelling is the existence of differences between the objects at the higher level and within-group relations at the lower level. This means that the features of objects at the lower level are correlated because the same factors affect the given region. This can be described as vertical group dependence. Horizontal dependence, however, cannot be modelled using classical multilevel modelling. This type of relationship belongs to spatial econometric modelling of single-level spatial data sets and results from the interaction or interpenetration of spatial units due to geographic proximity. If a spatial data set has a hierarchical structure, both types of relations can be expected: vertical and horizontal. The former, concerning dependencies at the higher level, is related to regional (context) effects, while the latter concerns dependencies of the spatial autocorrelation type. In principle, we can distinguish between spatial interaction (at lower levels) and spatial heterogeneity (at the regional level). Group dependencies in the hierarchical spatial model mean that the allocation of objects to groups should have a geographical nature, while traditional hierarchical models usually do not consider spatial hierarchy [56].
The proposed model adds a spatial autoregressive element to the classical regression model in the form of a spatially lagged term Wy, where y is the n-element vector of observations of the explained variable and W is a spatial weight matrix. The model can also be estimated without this component. In such a case, the Wy term is omitted and the resulting model is equivalent to the HLM model (Hierarchical Linear Model) [60], although HSAR fits a simultaneous autoregressive (SAR) spatial random effect rather than a conditional autoregressive (CAR) spatial random effect.
Examples of the application of hierarchical spatial models were described by, among others, Dong and Harris [56], Páez and Scott [51], and Bivand et al. [57], who analyzed the market in a similar manner, with some covariates observed for each individual-level observation and others observed only at the aggregated district level. They estimated, among others, district-level spatial random effects. Additionally, in the context of property market analyses, the hierarchical aspect of the data was analyzed by, e.g., Chasco and Le Gallo [61] and Brunauer et al. [62]. In this case, the hierarchical spatial autoregressive model can be applied, as it allows for spatially correlated random effects and spatial dependence among individuals.
The general formula of the HSAR model can be presented as follows [56,63]:
where:
Y: an N × 1 vector of the dependent variable,
ρ, λ: parameters of spatial interactions,
W: spatial weight matrix at the individual level,
β: parameter vector,
X: matrix of explaining variables,
Δ: matrix demonstrating the classification of entities i to objects j,
θ: vector of random effects for the absolute term,
u: vector of random group effects,
ε: vector of the random component,
M: spatial weight matrix at the group level.
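Read together, the symbols listed above match the two-level specification given by Dong and Harris [56]. As a reconstruction under that assumption (the normal error distributions and the dimensions N and J are stated here for illustration), the model can be written as:

```latex
% Level 1 (transactions): spatial lag, fixed effects and group effects
Y = \rho W Y + X\beta + \Delta\theta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma_\varepsilon^2 I_N)

% Level 2 (groups): spatially autoregressive random effects
\theta = \lambda M \theta + u, \qquad u \sim N(0, \sigma_u^2 I_J)

% Equivalent reduced form of the group effects
\theta = (I_J - \lambda M)^{-1} u
```

Here N is the number of transactions and J the number of groups (zones), so Δ is an N × J matrix with a single 1 in each row.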
If ρ = 0 and λ = 0, the model corresponds to the two-level HLM model with a random absolute term (random intercept). Bayesian methods with a properly specified likelihood function can be used to estimate the model parameters. This function is described by the following equation [56]:
Statistical inference concerning a given parameter is based on the posterior distribution of this parameter. The Bayesian paradigm assumes as a basic principle that the posterior distribution of $\theta^* = \{\rho, \lambda, \beta, \theta, \sigma_u^2, \sigma_\varepsilon^2\}$ is proportional to the product of the likelihood of the data and the prior distributions [34]. For the k-element vector of β parameters, we consider a multivariate normal prior distribution with expected value $M_\beta$ and diagonal variance-covariance matrix $T_\beta$. Therefore, the posterior distribution of the β parameters will be as follows [56,63]:
Posterior distribution of the θ random effects will have the following form:
The posterior distribution of the random component variance $\sigma_\varepsilon^2$ can be presented in the following way:
where IG($a_\varepsilon$, $b_\varepsilon$) stands for the inverse gamma distribution with shape parameter $a_\varepsilon$ and scale parameter $b_\varepsilon$. The posterior distribution of the random effect variance $\sigma_u^2$ will have the following form:
Once the posterior distributions are known, samples are drawn from them using Markov Chain Monte Carlo (MCMC) methods, for example Gibbs sampling [64].
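A minimal sketch of Gibbs sampling for a deliberately simplified model may help fix ideas: a single mean β with a normal prior and a variance σ² with an inverse gamma prior. It is not the HSAR sampler of [56]; the function name, the prior values and the simulated data are all hypothetical.

```python
import random
import statistics


def gibbs_normal(y, m0=0.0, t0=100.0, a0=0.01, b0=0.01,
                 n_iter=3000, burn_in=500, seed=1):
    """Gibbs sampler for y_i = beta + eps_i, eps_i ~ N(0, sigma2).

    Assumed priors (illustrative only): beta ~ N(m0, t0), sigma2 ~ IG(a0, b0).
    """
    rng = random.Random(seed)
    n = len(y)
    sum_y = sum(y)
    beta, sigma2 = 0.0, 1.0  # arbitrary starting values
    draws = []
    for it in range(n_iter):
        # beta | sigma2, y is normal (conjugate update)
        v = 1.0 / (n / sigma2 + 1.0 / t0)
        m = v * (sum_y / sigma2 + m0 / t0)
        beta = rng.gauss(m, v ** 0.5)
        # sigma2 | beta, y is inverse gamma: draw a gamma variate and invert it
        shape = a0 + n / 2.0
        rate = b0 + 0.5 * sum((yi - beta) ** 2 for yi in y)
        sigma2 = 1.0 / rng.gammavariate(shape, 1.0 / rate)
        if it >= burn_in:
            draws.append((beta, sigma2))
    return draws


# Simulated log-prices (hypothetical values, not the Olsztyn data)
data_rng = random.Random(42)
y = [data_rng.gauss(5.5, 0.4) for _ in range(200)]

draws = gibbs_normal(y)
beta_mean = statistics.fmean([b for b, _ in draws])
sigma2_mean = statistics.fmean([s for _, s in draws])
```

Each parameter is drawn in turn from its conditional posterior given the current value of the other, which is the same alternating scheme that a full sampler would apply to β, θ and the two variances.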
For the ρ parameter, whose posterior distribution does not correspond to any known standard distribution, Gibbs sampling cannot be applied directly. Therefore, the inverse distribution function method was used to draw the sample. This method comes down to numerical integration of the density, which can be expressed as [56]:
where C is a constant, while:
In a similar way, the distribution of the parameter of spatial interactions at the group level can be determined:
Detailed principles of model estimation are presented, e.g., in References [56,63].
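The numerical-integration step used for ρ and λ can be sketched as a grid-based inverse distribution function draw. The density below is a toy stand-in, not the conditional posterior from [56], and the interval (-1, 1), the grid size and all names are assumptions made for illustration.

```python
import bisect
import math
import random


def draw_by_inverse_cdf(log_density, lower, upper, rng, n_grid=401):
    """Draw one value from an unnormalized density on [lower, upper]."""
    step = (upper - lower) / (n_grid - 1)
    grid = [lower + i * step for i in range(n_grid)]
    log_vals = [log_density(x) for x in grid]
    max_log = max(log_vals)  # subtract the maximum for numerical stability
    dens = [math.exp(v - max_log) for v in log_vals]
    # The trapezoidal rule gives the unnormalized cumulative distribution.
    cdf = [0.0]
    for i in range(1, n_grid):
        cdf.append(cdf[-1] + 0.5 * (dens[i - 1] + dens[i]) * step)
    # Scaling the uniform draw by the total mass makes the constant C irrelevant.
    u = rng.random() * cdf[-1]
    idx = min(max(bisect.bisect_left(cdf, u), 1), n_grid - 1)
    lo, hi = cdf[idx - 1], cdf[idx]
    frac = 0.0 if hi == lo else (u - lo) / (hi - lo)
    return grid[idx - 1] + frac * step  # linear interpolation within the cell


def toy_log_density(rho):
    """Hypothetical unimodal log-density centred at 0.4 (illustration only)."""
    return -0.5 * ((rho - 0.4) / 0.1) ** 2


rng = random.Random(7)
samples = [draw_by_inverse_cdf(toy_log_density, -1.0, 1.0, rng) for _ in range(500)]
sample_mean = sum(samples) / len(samples)
```

Inside a sampler, one such draw would be made per iteration, with the log-density re-evaluated at the current values of the other parameters.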
3. Data Description
The conducted research concerns the market of undeveloped land properties intended for residential development, situated in the city of Olsztyn in the northeastern region of Poland. The city currently has approximately 180,000 inhabitants in an area of almost 90 km². The varied spatial structure of the city, numerous lakes and forests, as well as quite strong planning intervention have resulted in significant spatial heterogeneity of property prices. The transaction data used for the analyses originate from the register of prices and values of real estate properties held by the City Hall in Olsztyn. Overall, 520 data entries concerning undeveloped land property transactions carried out in 2010–2017 were used for the analyses. A similar number of transactions has been used in many previous studies concerning, e.g., mass valuation (e.g., References [65,66,67]). The logarithm of the price per 1 m² was assumed as the explained variable. The logarithm was chosen because of the relatively large span of prices and a distribution demonstrating strong right-skewness. Each of the sold properties was additionally described by a set of eleven features forming the explaining variables, as presented in Table 1.
Variables were selected so as to reflect spatial conditions and location values to the greatest possible extent. Because the relations between the explaining variables and the explained variable are non-linear for area and distance from characteristic places, these values are expressed as logarithms. The general numerical characteristics of the variables are presented in Table 2 and Appendix A.
The prices in the property market, after a turbulent period of growth in the previous decade in 2006–2007, just like in other cities of Poland, demonstrated relative stabilization and even a slight decrease in the examined period. Preliminary studies demonstrated that no adjustments for the passage of time were needed (Figure 1).
All observations assumed for analyses were grouped into two hierarchy levels. The first level refers to transactions and their attributes, while the second level concerns the location of the property in planning zones resulting from the division of the area into functional zones (Figure 2).
This division results, first of all, from planning documents (a study of conditions and directions for land development and a local zoning plan), as well as from natural conditions. The area in each distinguished zone is relatively uniform in terms of environment, the type of use and prevailing development. The administrative division was not used for the analysis due to the fact that, in this case, boundaries are imposed by formal considerations and not by market conditions. Therefore, an assumption was made that the adopted division into zones, which reflects location values, should also indicate areas that are relatively uniform in terms of prices and the factors affecting them. According to the assumptions made, this should result in a higher homogeneity of prices for properties situated in the same zone and at the same time, higher heterogeneity of properties situated in various zones.
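The two-level grouping can be illustrated with the classification matrix Δ from Section 2, which maps each transaction to its zone. The sketch below uses hypothetical zone labels and effect values; `build_delta` is an illustrative helper, not a function of any package.

```python
def build_delta(zone_ids):
    """Return the sorted zone labels and the N x J 0/1 classification matrix."""
    zones = sorted(set(zone_ids))
    column = {zone: j for j, zone in enumerate(zones)}
    delta = [[1 if column[zone] == j else 0 for j in range(len(zones))]
             for zone in zone_ids]
    return zones, delta


# Five hypothetical transactions located in three zones
zone_ids = ["A", "B", "A", "C", "B"]
zones, delta = build_delta(zone_ids)

# Hypothetical zone-level random effects (log scale), ordered like `zones`
theta = [0.10, -0.05, 0.02]

# The product Delta * theta assigns each transaction the effect of its zone
delta_theta = [sum(d * t for d, t in zip(row, theta)) for row in delta]
```

Every row of Δ contains exactly one 1, so all transactions in the same zone share one random effect, which is what produces the expected within-zone homogeneity.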
4. Results and Discussion
This study was aimed not only at the identification of spatial effects in the analysis of prices in the property market [56], but primarily at indicating the possibilities of applying the HSAR model to develop a land value map. The use of the hierarchical data structure can contribute to improving the quality of the obtained models [68]. The additional inclusion of spatial autocorrelation in the property market [31] makes it possible to construct a hierarchical spatial model of not only a diagnostic [56] but also a predictive nature.
In the course of the research, apart from the HSAR model, three other models were constructed on the same data set (LM: Linear Model, HLM: Hierarchical Linear Model, and SAR: Spatial Autoregressive Model), which made it possible to carry out a simple comparative analysis. Those models were then used to develop land value maps based on price prediction and residual analysis. The R environment with the packages HSAR, lme4, sp and spdep (among others) was used for statistical computing.
A classical multiple regression model (LM) provided a point of reference for further research and allowed a simple analysis of relations between the assumed variables and the transaction price. Another model (SAR) was built to account for spatial autocorrelation. Detailed principles concerning the construction and testing of spatial autoregressive models are provided by, among others, [35] and [52]. A key issue in this case is the proper determination of the spatial weight matrix, which reflects mutual relations between objects located in space. An assumption was made that the mutual interaction between events in the property market (land prices) decreases exponentially with distance [56], while the range of similarity estimated on the basis of the variogram was about 2500 m. The estimated range confirms the results of previous analyses, e.g., References [69,70]. However, the threshold distance should also reflect the spatial location of the transactions, and its value must be selected so as to ensure that each object is spatially related to other objects through the weight matrix. Therefore, the weight matrix was determined as [56]:
where d is the threshold value, established on the basis of the data at 2500 m. This matrix was then row-standardized. Based on the Lagrange multiplier test [35], it was established that the proper model in this case would be the spatial lag model. In the subsequent model (HLM), fixed effects related to the explaining variables and random effects for the second hierarchy level were taken into account [71], without taking spatial dependencies into consideration. The identifier of the zone in which a property is located played the role of the second-level hierarchy variable. The HSAR model assumes that the spatial weight matrix for the first level is the same as in the SAR model, while the weight matrix for the second level is based on the common border criterion (contiguity) of the distinguished zones. A problem in this case was the emergence of islands: some of the distinguished zones do not have any neighbour, and therefore a zero row occurs in the weight matrix. This means that the random effects for those zones in the HLM and HSAR models will be the same. The results of parameter estimation for the individual models are presented in Table 3 and Table 4.
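The construction of a thresholded, row-standardized weight matrix and of the spatial lag Wy can be sketched as follows. The Gaussian-type decay used here is an assumption (the exact kernel is given in [56]), and the coordinates and log-prices are hypothetical; the last point is placed far away to show how an island yields a zero row.

```python
import math


def distance_decay_weights(coords, threshold):
    """Weights decay with distance and are zero beyond the threshold (assumed kernel)."""
    n = len(coords)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # an object is not its own neighbour
            d_ij = math.dist(coords[i], coords[j])
            if d_ij <= threshold:
                weights[i][j] = math.exp(-0.5 * (d_ij / threshold) ** 2)
    return weights


def row_standardize(weights):
    """Scale each row to sum to 1; rows of islands (all zeros) are left as zeros."""
    result = []
    for row in weights:
        total = sum(row)
        result.append([w / total for w in row] if total > 0 else list(row))
    return result


def spatial_lag(weights, y):
    """Return Wy: the weighted average of neighbouring values for each object."""
    return [sum(w * v for w, v in zip(row, y)) for row in weights]


# Hypothetical coordinates in metres; the fourth point has no neighbour within 2500 m
coords = [(0.0, 0.0), (1000.0, 0.0), (0.0, 2000.0), (10000.0, 10000.0)]
log_price = [5.0, 5.5, 6.0, 4.0]

W = row_standardize(distance_decay_weights(coords, threshold=2500.0))
lag = spatial_lag(W, log_price)
```

For the isolated point the lag is zero, so no neighbouring information enters its prediction, which is consistent with the remark above that island zones receive the same random effects in the HLM and HSAR models.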
In the LM model, apart from the constant, six variables proved statistically significant at a significance level of p < 0.05. The most significant variables were property right, distance from public transport stops and utility network. Statistically insignificant variables included, among others, plot area and transaction date. The HLM model indicates four statistically significant variables. A comparison of the two models indicates a slightly better fit of the HLM model, as shown by, among others, the AIC. The residual standard deviation SEresid also indicates that the hierarchical model explains the price variability slightly better. In the SAR model, four variables proved statistically significant (property right, density of main roads, distance from public transport stops and utility network).
The coefficient of spatial autocorrelation ρ amounted to 0.551, which justifies the application of a model using spatial relations. In the HSAR model, six variables, as well as the constant, are significant at the significance level of p < 0.05. The relatively high AIC value of the HSAR model results from the effective number of parameters, which strongly depends on the variance of the group-level parameters [71]. In addition, a different method (the Bayesian approach) was used to estimate the HSAR model, which may hinder an unambiguous interpretation of a common information criterion for all the analysed models [64]. Therefore, when assessing the models, priority was given to the criterion of minimizing errors. The value of the residual standard deviation was 0.361, which means that the average relative fit error is about 6.6% of the average transaction price logarithm. The value of the spatial autocorrelation coefficient ρ is lower than in the SAR model, while the autocorrelation specified for the second level of the hierarchy was 0.383, which indicates moderate spatial dependency. The distribution of random effects for individual zones for the HLM and HSAR models is presented in Figure 3 and Figure 4. The distribution of random effects for both hierarchical models is very similar, which results from the moderate spatial autocorrelation of prices at the level of zones. Although those effects differ only slightly, after converting logarithms into prices they can have quite a significant effect on the final land value map.
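To see why small differences on the logarithmic scale matter, recall that a difference $\delta$ between two log-price predictions corresponds to a price ratio of $e^{\delta}$, holding all other terms equal. The figures below (a difference of 0.05, a base price of 200 PLN/m² and a 1000 m² plot) are hypothetical and serve only as a worked example.

```latex
\frac{\hat{P}_{\mathrm{HSAR}}}{\hat{P}_{\mathrm{HLM}}}
  = \exp\left(\hat{\theta}_j^{\mathrm{HSAR}} - \hat{\theta}_j^{\mathrm{HLM}}\right) = e^{\delta}

\delta = 0.05: \quad e^{0.05} \approx 1.051

200 \ \mathrm{PLN/m^2} \times (1.051 - 1) \approx 10.2 \ \mathrm{PLN/m^2}

10.2 \ \mathrm{PLN/m^2} \times 1000 \ \mathrm{m^2} \approx 10{,}200 \ \mathrm{PLN}
```

A difference that looks negligible in the table of random effects can therefore shift the mapped value of a typical plot by several thousand PLN.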
Land value maps were generated by overlaying successive raster layers resulting from the model coefficients, the interpolated component ρWy (for the SAR and HSAR models), random effects (for the HLM and HSAR models) and the layer constructed by interpolating the residuals, according to the scheme presented in Figure 5.
Determination of the reference layer consisted in generating a raster whose values corresponded to the model values obtained as the sum of the layers of explaining variables multiplied by their coefficients. It should be noted that seven of the variables assumed for the analyses can be referred to each point in the area under analysis. The beginning of 2018 was assumed as the date of the study, ownership was assumed as the type of property right, and the assumed area was 1000 m². To obtain the map of values, after overlaying the individual layers, the obtained values in the form of logarithms were converted directly into values in PLN. The value estimated on the basis of the LM model results from simple prediction, while the map generated on the basis of the SAR model required additional interpolation of the spatial lag (ρWy). Value maps developed on the basis of the LM and SAR models are schematically presented in Figure 6.
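The overlay procedure can be sketched with small rasters stored as nested lists. The layer names, coefficients and cell values below are hypothetical, and `overlay_value_map` is an illustrative helper rather than part of any GIS package.

```python
import math


def overlay_value_map(intercept, coefficient_layers, additive_layers=()):
    """Sum log-scale layers cell by cell and convert the result to prices.

    coefficient_layers: (coefficient, raster) pairs for the explaining variables.
    additive_layers: rasters added unchanged, e.g. interpolated rho*Wy,
    zone random effects or interpolated residuals.
    All rasters are lists of rows with the same shape.
    """
    first_raster = coefficient_layers[0][1]
    n_rows, n_cols = len(first_raster), len(first_raster[0])
    value_map = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            log_value = intercept
            for coefficient, raster in coefficient_layers:
                log_value += coefficient * raster[r][c]
            for raster in additive_layers:
                log_value += raster[r][c]
            row.append(math.exp(log_value))  # direct conversion from the log scale
        value_map.append(row)
    return value_map


# Hypothetical 2 x 2 layers
log_distance_centre = [[6.0, 7.0], [7.5, 8.0]]
utilities = [[1, 1], [0, 0]]
zone_effect = [[0.10, 0.10], [-0.05, -0.05]]

value_map = overlay_value_map(
    intercept=7.0,
    coefficient_layers=[(-0.3, log_distance_centre), (0.4, utilities)],
    additive_layers=[zone_effect],
)
```

Passing an empty `additive_layers` gives the LM case, while adding the lag, random-effect or residual rasters corresponds to the SAR, HLM and HSAR variants of the scheme.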
The distribution of values in both models is similar. Relatively high values prevail in the centre, decreasing when moving towards the city boundaries. Due to the fact that the developed maps are only of a demonstrative nature, water and forest areas were not excluded, although they should not be subject to the analysis.
For the HLM and HSAR models, random effects for zones also have to be taken into account. Therefore, values were first predicted on the basis of the fixed effects of the model, and the mean value was then estimated for each of the zones, taking random effects into account. For zones with no transactions, a zero value of the random effect was assumed in the HLM model, while in the HSAR model this value was obtained through interpolation. The obtained maps are schematically presented in Figure 7 and Figure 8.
Differences in the estimation of land value obtained based on HLM and HSAR models are small and result from the assumption made with reference to spatial autocorrelation both at the first level (transactions) and the second level (zones). The similarity results first of all from the assumptions made concerning the effect of assumed factors on transaction prices.
The use of hierarchical spatial models can be an alternative or addition to the methods previously used for the development of land value maps. Geostatistical methods, which usually give satisfactory results, are commonly used in the development of such maps [72,73]. However, it should be emphasized that hierarchical models can be particularly useful when the spatial hierarchy (e.g., division into districts or functional zones), which assumes rapid changes in value in space, is to be taken into account. Similar conclusions are drawn by, among others, Arribas et al. [68], who used hierarchical models for price analysis in Alicante, and Dong and Harris [42], who conducted research in Beijing.