Exploring Determinants of Housing Prices in Beijing: An Enhanced Hedonic Regression with Open Access POI Data

The housing markets in Chinese metropolises have become significantly inflated over the last decade. In addition to an economic upturn and housing policies that have potentially fueled the real estate bubble, the spatial heterogeneity of housing prices can be shaped by the amenity value in the proximity of communities, such as accessibility to business centers and transportation hubs. In the past, scholars have employed the hedonic pricing model to quantify the amenity value in relation to structural, locational, and environmental variables. These studies, however, are limited by two methodological obstacles that are relatively difficult to overcome. The first pertains to the difficulty of data collection in regions where geospatial datasets are strictly controlled and limited. The second refers to the spatial autocorrelation effect inherent in the hedonic analysis. Using Beijing, China as a case study, we addressed these two issues by (1) collecting residential housing and urban amenity data in terms of Points of Interest (POIs) through web-crawling on open access platforms; and (2) eliminating the spatial autocorrelation effect using the Eigenvector Spatial Filtering (ESF) method. The results showed that the effects of nearby amenities on housing prices are mixed. In other words, while proximity to certain amenities, such as convenient parking, was positively correlated with housing prices, other amenity variables, such as supermarkets, showed negative correlations. This mixed finding is further discussed in relation to community planning strategies in Beijing. This paper provides an example of employing open access datasets to analyze the determinants of housing prices. Results derived from the model can offer insights into the reasons for housing segmentation in Chinese cities, eventually helping to formulate effective urban planning strategies and equitable housing policies.


Introduction
Over the last decade, the housing markets in Chinese metropolises have become inflated to an unprecedented extent. Due to rapid urbanization and domestic migration, cities are not able to accommodate medical, educational, transportation, and cultural needs [1]. These services or amenities, such as hospitals, schools, and transportation hubs, are distributed unevenly across urban landscapes. As a result, the housing markets tend to manifest a clear pattern of spatial heterogeneity in terms of property values.

Housing Development and Data Issues in China
Housing markets in cities differ significantly in terms of locations and prices. For example, research on Western cities has shown a clear segmentation pattern due to racial segregation and economic exclusion [20]. In addition to the political and ethical impact of these trends, the value of a single property is always subject to structural, locational, and environmental variables in its neighborhood [21]. While the mantra "location, location, location" is critical for determining investment potential in the housing market [22], locational (e.g., distance to CBD; [23]) and environmental variables (e.g., scenic views; [4,24]) are often relatively difficult to quantify.
In China, housing prices are determined less by neighborhood segmentation than by proximity to urban amenities. Since the economic reform of the 1990s that restructured urban land use, the housing market has gradually transitioned from government-assigned units into a free-trade system [25]. Before the reform, housing was characterized by work-unit compounds (the Chinese danwei), where residents lived and worked within the same community and were provided with a wide range of amenities, such as retail stores, schools, and hospitals [26,27]. The commodification of residential housing has substantially transformed the urban landscape and, therefore, residents' commuting patterns. Due to heavy reliance on public transit systems, resident mobility has become the primary driver of property value in the housing market [28]. In other words, housing units with relative proximity to urban centers, amenities, and transportation hubs tend to have considerably higher market value. For example, a recent study on the effect of proximity to Beijing railway stations on the values of surrounding properties found that while both non-transfer and transfer stations positively affect the price, the impact of nearby transfer stations was more significant [13]. The vital role that public transit plays also extends to suburban regions, where housing units located within transit-oriented development zones are priced much higher than units further away [29]. In addition, property value increases with the presence of other retail amenities, such as convenience stores, supermarkets, and drugstores that cater to nearby residents [11,12,16]. In this regard, exploring the effect of price determinants, especially accessibility to public transit and retail amenities, is a fundamental step towards understanding the Chinese housing market.
The spatial granularity of datasets is another essential aspect of urban studies in China. With the constant emergence and shifting of Chinese cities and townships, there has been an inconsistency in the definition of administrative units for census and economic purposes [30]. This lack of clear demarcation is further exacerbated when the hierarchy of administrative units is directly involved in the identification of zonal patterns. Moreover, fine-scale urban geospatial datasets (e.g., transport infrastructure, housing units) are relatively scant due to flaws in data sharing policies and confidentiality concerns. Although there has been a recent resurgence of volunteered geographic information (VGI) collected from social media platforms [31], there remains a pressing need for alternative sources of urban datasets with more clearly defined spatiotemporal granularity.
Recent explorations of POI data on open access platforms have provided new opportunities to address these data issues. Particular amenities in terms of POIs can be extracted and categorized, as POIs represent the finest spatial resolution of the built environment of urban landscapes. Many scholars have employed POIs to evaluate potential city growth [32,33], as well as to analyze land use and classification strategies [34]. There is an enormous need to utilize POIs for housing price studies in China, as the pace at which new communities sprout and the sheer number of geospatial variables generated in the process cannot be efficiently captured in a traditional manner.

Hedonic Pricing Model and Spatial Autocorrelation
Studies on property values have often resorted to the hedonic pricing model [21], the theoretical foundation of which stems from consumer theory [35]. In the latter theory, it is assumed that merchandise is composed of a collection of quantifiable characteristics, and consumers make purchases by logically ranking the characteristics that best suit their needs. Residential properties, as a particular form of merchandise, can thus be appraised as a function of structural, locational, and environmental determinants [3,6].
The hedonic pricing model is plagued by the statistical effect of spatial autocorrelation [19], which violates the assumption of independence in regression analysis. Neglecting spatial autocorrelation in the hedonic pricing model may lead to biased estimation and insufficient confidence levels [36]. Thus, two spatial autoregressive approaches, the spatial error model and the spatial lag model, have been proposed to ameliorate this effect [37,38]. If the regression is expressed as y = α + βX + e, the spatial error model considers that the autocorrelation effect exists in the error component (e) [6]; the spatial lag model, on the other hand, considers the autocorrelation effect in the dependent variable (y) [38,39]. Although these two approaches have been employed in many cases of hedonic modeling, two challenges remain. First, it is uncertain whether the autocorrelation effect exists in the random error e or in the dependent variable y [40]. This lack of specification raises fundamental questions about the usefulness of these models. Second, incorporating the autoregressive method poses technical barriers to model estimation compared with ordinary least squares (OLS) regression [41].
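For reference, the two autoregressive specifications are commonly written as below. This is the standard textbook form, not notation taken from the cited studies: the spatial weight matrix W, the scalar parameters ρ and λ, and the independent error term u are symbols introduced here for illustration only.

```latex
% Spatial lag model: autocorrelation enters through the dependent variable y
y = \rho W y + \alpha + \beta X + e

% Spatial error model: autocorrelation enters through the error term e
y = \alpha + \beta X + e, \qquad e = \lambda W e + u
```

Setting ρ = 0 or λ = 0 recovers the ordinary regression y = α + βX + e, which is why choosing between the two forms amounts to deciding where the autocorrelation resides.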
Based on Moran's autocorrelation coefficient (also known as Moran's I), the ESF method has been proposed to tackle these two methodological challenges [41]. Compared to the nonlinear approaches, the ESF method provides a linear formulation and solution. This method has been implemented in various cases of spatial analysis, such as crime analyses [42], migration [43], and species distribution [44]. However, the model has rarely been applied to real estate valuation [45].

Method
The fundamental premise of the hedonic pricing model is that a product is marketed as a set of characteristics, which can be formulated in a matrix X = (x_1, x_2, x_3, ..., x_n) [46]. The market price of the product can thus be illustrated as a function of these characteristics, or y = f(x_1, x_2, x_3, ..., x_n). The general form of the hedonic pricing model is shown in Equation (1):
    y = α + βX + e    (1)
where y is the price vector, X is the matrix of independent variables, α, β are the vectors of coefficients, and e is the vector of random error. Researchers have noted that the distribution of both y and x_n could be skewed, leading to the problem of heteroscedasticity [21,47,48]. Thus, the Box-Cox transformation has been applied to Equation (1) during implementation of the model. The hedonic price, as discussed in the last section, is influenced by the spatial autocorrelation effect. The theoretical foundation to address the issue is rooted in the findings that spatial correlation can be interpreted using a set of eigenvectors (EVs) from Moran's I [49,50]. The ESF method was proposed to justify using EVs to represent the spatial autocorrelation effect in regression analysis [51]. Here, a set of selected eigenvectors E_k = (EV_1, EV_2, EV_3, ..., EV_k) are presented [52], as shown in Equation (2). By introducing γE_k (where γ is the coefficient of the eigenvectors), the method is able to isolate the spatial autocorrelation effect in the hedonic regression:
    y = α + βX + γE_k + e    (2)
The eigenvectors E_k in Equation (2) are selected from all the eigenvectors E_n = (EV_1, EV_2, EV_3, ..., EV_n) that correspond to the matrix (I − 11^T/n)C(I − 11^T/n), according to the Moran's I, as shown in Equation (3):
    Moran's I = (n / 1^T C 1) · [EV^T (I − 11^T/n) C (I − 11^T/n) EV] / [EV^T (I − 11^T/n) EV]    (3)
where C is an n-by-n binary spatial weight matrix, I is an n-by-n identity matrix, 1 is an n-by-1 vector of 1, and n is the number of observations.
Employing all n EVs is not practical, as it may cause over-specification when the number of variables is equal to or greater than the number of observations [53]. Therefore, we selected k EVs as a subset of n EVs (k << n) using a two-step procedure. First, we determined a threshold to preselect m EV candidates (k ≤ m < n), as proposed in a previous study [54]. The threshold was set to Moran's I_m / Moran's I_max > 0.25. The m EV candidates and the 20 independent variables (i.e., the variables in the case study) were employed to construct the original model. Second, we applied a stepwise regression to the original model to select the set with the best fit, including the k EVs and l (l ≤ 20) independent variables in the final model.
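The first step of this procedure, extracting eigenvectors and preselecting candidates by the 0.25 ratio, can be sketched with NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `esf_candidates` is hypothetical, C is assumed to be symmetric with a zero diagonal, and the subsequent stepwise regression is not shown.

```python
import numpy as np

def esf_candidates(C, ratio=0.25):
    """Preselect ESF eigenvector candidates from a spatial weight matrix.

    C     : (n, n) symmetric binary spatial weight matrix (assumed zero diagonal)
    ratio : threshold on Moran's I / max Moran's I (0.25 in the text)
    Returns (eigenvectors as columns, their Moran's I), sorted descending.
    """
    C = np.asarray(C, dtype=float)
    n = C.shape[0]
    ones = np.ones((n, 1))
    M = np.eye(n) - ones @ ones.T / n           # centering projector I - 11^T/n
    eigvals, eigvecs = np.linalg.eigh(M @ C @ M)
    # For an eigenvector of MCM, Moran's I equals its eigenvalue times n / (1^T C 1)
    morans_i = eigvals * n / C.sum()
    top = morans_i.max()
    if top <= 0:                                # no positively autocorrelated pattern
        return eigvecs[:, :0], morans_i[:0]
    keep = morans_i / top > ratio
    order = np.argsort(-morans_i[keep])
    return eigvecs[:, keep][:, order], morans_i[keep][order]

# Toy example: five locations on a line, each adjacent to its neighbours
C = np.zeros((5, 5))
for i in range(4):
    C[i, i + 1] = C[i + 1, i] = 1
E, mi = esf_candidates(C)
```

The returned columns of `E` would then enter the regression alongside the independent variables, with stepwise selection deciding which of them remain in the final model.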

Data Collection
Beijing, the capital of China, has long been regarded as the cultural, financial, and political center of the nation. With an estimated population of 22 million, Beijing had a gross domestic product (GDP) value of 2.49 trillion RMB in 2016. This capital city is demarcated by the Fifth Ring Road (Figure 1), circumscribing an urban area of approximately 750 square kilometers. The properties beyond this ring road were excluded from the study, because these suburban areas support very few urban functions. The datasets for this study included both POIs of the properties and those of amenities nearby. First, the amenity POI data were gathered from Gaode Maps (http://ditu.amap.com/), the primary product of the AutoNavi Software. The amenity POI data were gathered using a POI extractor developed based on the Gaode Map JavaScript application programming interface (API). The distribution of the POIs included 76,786 amenities, as shown in Figure 1.
Due to the lack of real estate information in Gaode Maps, we augmented the POI database with data sourced from Lianjia (http://www.lianjia.com/), the largest real estate brokerage firm in Beijing. We collected 6959 purchase transactions in the third and fourth quarters of 2015 using a web crawling Python program and georeferenced them by property addresses. Most of these properties were typical housing units, namely apartments within a gated community (the Chinese xiaoqu). We further classified the property characteristics into structural, locational, and environmental variables to explore their impact on housing prices, as shown in Table 1. In addition, we determined the local Air Quality Index (AQI) at each location from an interpolated surface of the mean AQI in 2015 [55], as shown in Figure 2. This index serves as an essential environmental indicator of the city that has been shrouded in smog pollution [56]. The amenity variables included eight subcategories (Table 2). In reference to other research on housing prices [57,58], we also added four location dummy variables to examine the differences among administrative regions (i.e., Xicheng, Dongcheng, Chaoyang, and Haidian) and a time dummy variable to account for price fluctuations over the two quarters. To compare the intensity of the influences, the variables in Tables 1 and 2 were all rescaled.

Accessibility Measure
One widely utilized method to quantify the amenity value is the accessibility measure. In transportation research, the geographical accessibility to amenities can be evaluated as the function of distance or other cost variables, such as time, money, discomfort, and risk [59]. Another facet of the accessibility measure is the emphasis on the spatiotemporal constraints on human activities with regard to time geography [60]. As the exploration of various accessibility metrics was not the focus of this study, we applied a classic accessibility measurement, the cumulative opportunity method, to estimate the effect of spatial separation, as shown in Equation (4). This accessibility index (A_ij) summarizes the total number of amenities near the property, where d_ij is the Euclidean distance from property i to amenity j and D is the threshold distance. We included a negative-linear function [59,61] to determine the distance decay. As a walkable distance to urban centers has been identified as in the range of 0.5 to 2 miles [62], we chose 1.5 km as the threshold D. The accessibility indices by amenity subcategory are shown in Table 2.
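A cumulative opportunity measure with negative-linear decay can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the decay weight 1 − d/D is one common reading of "negative-linear", and the function name `accessibility` and the use of planar coordinates in metres are hypothetical choices made here.

```python
import math

def accessibility(prop, amenities, threshold=1500.0):
    """Cumulative-opportunity index with negative-linear distance decay.

    prop      : (x, y) of the property, in projected metres (assumed)
    amenities : iterable of (x, y) amenity coordinates in the same units
    threshold : cut-off distance D in metres (1.5 km in the text)
    """
    total = 0.0
    for ax, ay in amenities:
        d = math.hypot(ax - prop[0], ay - prop[1])  # Euclidean distance d_ij
        if d <= threshold:
            total += 1.0 - d / threshold            # assumed weight: 1 at d = 0, 0 at d = D
    return total

# One amenity at the property, one at 750 m, one beyond the threshold
score = accessibility((0.0, 0.0), [(0.0, 0.0), (750.0, 0.0), (3000.0, 0.0)])
```

Computing one such score per amenity subcategory yields a set of indices of the kind reported in Table 2, although the exact weighting used there may differ from this sketch.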

Results and Discussion
In this study, we utilized both OLS and ESF models for the hedonic analysis. The results are shown in Table 3. The relatively low variance inflation factor (VIF) suggested that no severe collinearity existed within the estimated variables. The Breusch-Pagan test indicated no significant heteroscedasticity in the regression residuals of the two models. Although the Jarque-Bera test results show that the residuals in both models are not normally distributed, this does not affect the asymptotic normality of the OLS estimators when the sample size is large [63]. While there was a positive spatial correlation in the residuals of the OLS model (Moran's I = 0.2324, p < 0.001), no significant autocorrelation was detected in the ESF residuals (Moran's I = −0.03, p > 0.1). This result indicated that the ESF model is capable of eliminating the spatial autocorrelation effect in the regression analysis. The adjusted R^2 values (OLS = 0.8384, ESF = 0.8768) suggested that the ESF model fit the data better. The OLS model failed to identify the correlation with the accessibility to bus stops (ESF β = −0.0059), metro stations (ESF β = 0.0045), and parking lots (ESF β = 0.0119).
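The residual check reported above relies on the global Moran's I statistic, which can be sketched in plain Python as follows. This is a minimal illustration: the function name `morans_i` and the toy weight matrix are hypothetical, the neighbor definition behind the study's weight matrix is not assumed here, and significance testing (the p-values) is not covered.

```python
def morans_i(values, weights):
    """Global Moran's I of values under a spatial weight matrix.

    values  : list of n numbers (e.g., regression residuals)
    weights : n-by-n nested list, weights[i][j] > 0 if i and j are neighbors
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    # Weighted cross-products of deviations between neighboring locations
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    var = sum(d * d for d in dev)
    return (n / w_sum) * (cross / var)

# Four locations on a line, each adjacent to its immediate neighbors
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
smooth = morans_i([1.0, 2.0, 3.0, 4.0], W)         # similar neighbors: positive
alternating = morans_i([1.0, -1.0, 1.0, -1.0], W)  # dissimilar neighbors: negative
```

A value near zero for the model residuals, as reported for the ESF model, indicates that little spatial pattern remains unexplained.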
We will now discuss the results in more detail with respect to the structural, locational, and environmental characteristics. As both models revealed, the structural variables had a significantly positive influence on housing prices, as the property value is dictated mainly by the floor space (size, ESF β = 0.1736) and floor plans (bedrooms, ESF β = 0.0221; living rooms, ESF β = 0.0287). The correlation with the orientation variable (ESF β = 0.0098) can be explained by consumers' preferences for properties with plentiful natural light. Due to the centralized location of Beijing, proximity in terms of travel time to the nearest development center was negatively correlated (ESF β = −0.006). The positive β values for the four district dummy variables (Xicheng, Dongcheng, Chaoyang, Haidian) suggested that property prices in the inner cities were significantly higher than those in suburban regions, and that prices in the Xicheng district exceeded those in the other districts. The accessibility to most amenities was positively correlated because these amenities fulfill people's needs for transportation (metro stations, ESF β = 0.0045), shopping (convenience stores, ESF β = 0.0126; shopping malls, β = 0.0058), and exercise (gyms, ESF β = 0.0117). The positive correlation with parking lots (ESF β = 0.0119) can be explained by the rapid increase in popularity of private cars in Beijing over the past decades; thus, parking spaces have become a valued resource and an indispensable element of new community development [64,65]. While the accessibility to primary schools was identified as positively correlated with housing prices, we believe that the strength of this correlation was underestimated in this study (ESF β = 0.0083). The influence of schools on the neighborhood housing market depends not only on proximity but, more importantly, on the reputation of nearby schools [66].
The case of Beijing is more complex because school zoning is not a direct reflection of proximity but is affiliated with individual communities or xiaoqu. The demarcation by community in lieu of physical distance has been largely criticized for creating inequality of educational opportunities and causing social segregation throughout the city [67]. We were surprised by our finding that housing prices were negatively correlated with accessibility to bus stops (ESF β = −0.0059) and supermarkets (ESF β = −0.0126). We believe this result can be explained by the formation of large-scale housing estates in suburban Beijing, such as Huilongguan in the northwest and Wangjing in the northeast. These estates were developed in the early 2000s to ease conflicts caused by limited land use in the inner city. Although housing prices have skyrocketed in these communities over the last decade, certain types of amenities are still limited. This phenomenon is very likely the result of the traditional residential-oriented communities, which were literally surrounded by walls that prevented easy access to nearby amenities [27]. These communities have also been criticized for generating excessive traffic due to the spatial mismatch of job opportunities and housing units [68,69]. Thus, recent urban planning guidelines in China have focused on the removal of walls and fences in new community designs in an attempt to ease traffic congestion and improve access to amenities [70].
Not surprisingly, we found that the AQI was negatively correlated with housing prices (ESF β = −0.0601). This finding is in line with previous hedonic analyses of housing prices based on air pollutants such as PM-10 and sulfur dioxide (SO2) in Chicago [71] and Seoul [39]. This result revealed that the property value can be affected by the natural environment near the housing unit. Thus, alleviating air pollution and promoting sustainable urban design should be a priority for both community developers and urban planners.

Conclusions
The use of the hedonic pricing model has a long tradition in urban studies. In this study, we have developed a complete housing and amenity POI database via web-crawling and employing an open access API. We also incorporated the ESF method to eliminate the spatial autocorrelation effect inherent in hedonic analysis. It is our hope that the results, in addition to revealing the roles that structural, locational, and environmental variables play in the housing market, will help urban planners and stakeholders to better understand the local real estate economy [72].
The contribution of the paper is two-fold. First, we utilized the ESF method to eliminate the spatial autocorrelation effect in hedonic analyses of housing prices. While most studies on price determinants fail to consider spatial autocorrelation, our study revealed that the level of correlation may fluctuate when a set of eigenvectors is incorporated into the hedonic regression. Second, this study provides a solid example of acquiring fine-scale urban datasets from open sources. Unlike most Western countries, where geospatial data are a matter of public record on various geographical scales (e.g., the US TIGER products), in China datasets containing georeferenced urban features are not available in the public sector. Proprietary data are often limited in size, granularity, and timeliness of information delivery. The increasing transparency of data sharing policies and the emergence of map service providers such as Google, Bing, Baidu, and Gaode have made this geographical information more accessible to public users. However, caution is needed when employing such open access platforms. For example, housing prices and structural variables in this study were sourced from a private company and might not adequately represent their actual market value. Therefore, triangulation of the dataset with other sources is necessary to ensure data quality.
Methodological limitations of this study should be recognized by other scholars as opportunities for further research. First, we did not consider the service capacity of amenities and variations of distance decay in the accessibility assessment; the spatial separation to amenities can be network-based [62,73] or follow a power law [12], thus mediating the correlation [13]. Second, in the future, researchers should focus on communities located in a smaller urban sector rather than an entire city to minimize the impact of geographical heterogeneity. Being aware of these methodological limitations is a prerequisite for model improvement.
In conclusion, in addition to the three tiers of price determinants explored in the study, the Chinese housing markets are largely dictated by local policies that restrict property transactions. In Beijing, the desire to buy a new property can be thwarted by economic externalities, such as lack of a residency permit (the Chinese hukou), educational resources, and investment potential. For example, school zoning in Beijing is a dominant factor fueling the housing market and creating residential economic segregation [74]. The enforcement of limits on home purchases and residency requirements is another set of factors that affect the health of the housing market [75]. To fully understand the facets of the housing market in Chinese cities, these policy initiatives and regulations must be thoroughly assessed.
Author Contributions: Yixiong Xiao designed the experiment, collected the data, performed the analysis, and drafted the manuscript; Xiang Chen performed the accessibility measurements and reorganized the manuscript; Qiang Li supervised the data analysis and created a research agenda for project implementation; and Xi Yu, Jin Chen, and Jing Guo assisted with adjusting the model and analyzing the data.

Conflicts of Interest: The authors have no conflict of interest.