Understanding the Effects of Influential Factors on Housing Prices by Combining Extreme Gradient Boosting and a Hedonic Price Model (XGBoost-HPM)

Abstract: The characteristics of housing and location conditions are the main drivers of spatial differences in housing prices, which is a topic attracting high interest in both real estate and geography research. One of the most popular models, the hedonic price model (HPM), has limitations in identifying nonlinear relationships and distinguishing the importance of influential factors. Therefore, extreme gradient boosting (XGBoost), a popular machine learning technology, and the HPM were combined to analyse the comprehensive effects of influential factors on housing prices. XGBoost was employed to identify the importance order of factors and HPM was adopted to reveal the value of the original non-market priced influential factors. The results showed that combining the two models can lead to good performance and increase understanding of the spatial variations in housing prices. Our work found that (1) the five most important variables for Shenzhen housing prices were distance to city centre, green view index, population density, property management fee and economic level; (2) space quality at the human scale had important effects on housing prices; and (3) some traditional factors, especially variables related to education, should be modified according to the development of the real estate market. The results showed that the demonstrated multisource geo-tagged data fusion framework, which integrated XGBoost and HPM, is practical and supports a comprehensive understanding of the relationships between housing prices and influential factors. The findings in this article provide essential implications for informing equitable housing policies and designing liveable neighbourhoods.


Introduction
In the last decade, housing prices have become one of the top issues in economic development and for determining whether urban residents can live a better life [1][2][3]. The rapid growth of housing prices and spatial differentiation greatly concern managers, scholars, developers and residents [4][5][6]. As cities continue to expand and renew, the non-uniformity of spatial reconstruction and resource allocation is becoming increasingly prominent, which accelerates the spatial variations in housing prices within cities [5,7,8,9,10]. Understanding the mechanisms influencing spatial variations in housing prices is essential to formulate scientific housing policies, divide submarkets, optimize urban spatial layouts, allocate public infrastructure and equalize spatial resources [11,12]. Greater efforts and improvements to previous studies are required to deeply understand the complex relationships between housing prices and influential factors.
First, although the hedonic price model (HPM) has been widely applied to housing prices and can identify the economic value of influential factors well [5,13,14], the traditional HPM has been criticized for some limitations, including: (1) a poor ability to reduce the impact of collinearity; (2) the assumption of linear relationships between influential factors and housing prices; and (3) a lack of robustness in the results [15][16][17][18]. The above limitations of the HPM might directly reduce the accuracy of housing price modelling and muddle our overall understanding of the influential factors of housing prices; thus, housing prices modelling should be improved by applying new data sources, methods and technologies [5,19].
Second, despite some efforts [20,21], studies using eye-level data to evaluate the neighbourhood environment and explore the effects of factors at the human scale on housing prices are rather limited. Related literature has shown that spatial differentiation in housing prices is the result of the effects of internal and external factors [11,22,23,24,25,26,27,28,29]. Specifically, the internal factors of housing prices include housing type, housing age, community environment, hardware facilities, property services, etc. The external factors can be classified into location conditions, traffic accessibility, landscape environment, living facilities, spatial quality, etc. Many studies have proven that street view images and semantic segmentation methods can be employed to derive human perceptual evaluations of the local environment [30,31]. In fact, people's perceptions of streetscapes, such as openness, security, walkability, development intensity, and greenness, can significantly affect their willingness to pay for housing [32][33][34]. However, these factors at the human scale are rarely discussed in real estate studies.
Increasingly available big geo-data and machine learning demonstrate considerable potential to alleviate some of the aforementioned limitations. First, a combination of extreme gradient boosting (XGBoost) and the traditional HPM was applied in this article with the aim of identifying the spatial variations in housing prices more accurately. XGBoost excels at selecting the influential factors that have a significant effect on model accuracy, which can reduce the risk of model overfitting and identify the importance of variables [20,35]. The HPM was used to evaluate the quantified effects of influential factors on housing prices. Second, the wide application of street view images and semantic segmentation algorithms shows that, as a new and available data source, street view data can be used for urban local environmental assessment at the human scale. An increasing number of studies suggest that it is necessary to use street view data to extract human-scale influential factors to study spatial variations in housing prices [20,36].
To date, limited effort has been exerted to integrate machine learning technologies and regression models to estimate spatial variations in housing prices [37], although these were proven effective in modelling geographic phenomena and spatial distributions in the fields of PM2.5, traffic fatalities and tropical cyclone intensity [38][39][40]. Therefore, this article proposes a multisource geo-tagged data fusion framework to estimate spatial variations in housing prices by extending the influential factors to the human scale and applying a combination of XGBoost and the HPM. Taking Shenzhen, China as the study area, this article analyses the main factors and combination characteristics that result in spatial differentiation in housing prices. We aim to contribute to the related literature in the following two ways: (1) an analytic framework combining XGBoost and the HPM is proposed to estimate spatial variations in housing prices; and (2) this article constructs a multidimensional and multilevel system of housing price influential factors using multisource big geo-data, including building, point of interest (POI), road network, land-use, and street view data. In particular, based on street view data, a series of metrics for measuring spatial quality are proposed to reflect the physical and social structure characteristics of the urban local environment under fine spatial granularity. The research results of this paper are expected to provide references for constructing a theory of urban housing price differentiation, promoting the healthy and stable development of the real estate market and constructing a liveable neighbourhood environment.
Figure 1 presents the overall methodological framework for understanding spatial variations in housing prices; it comprises data collection, factor quantification, modelling, and results mapping and analysis.
The first step was the collection of multi-source geodatasets, including housing prices, community information, land-use data, POI data and building data. Second, we quantified the dependent variable (i.e., average home price) and independent variables. Third, the models, including XGBoost and the HPM, were implemented in turn to identify the spatially varying effects of influential factors on housing prices. Finally, the results were visualized and discussed from various perspectives, including the importance degree of influential factors and quantified relationships between housing prices and influential factors. In this article, housing price modelling adopting XGBoost and the HPM was built and implemented using Python 3.6.

Data Collection
Our work was conducted in Shenzhen, one of the largest and fastest growing cities in China, which is located on the southeast coast of Guangdong Province. As a special economic zone, Shenzhen has experienced a series of highly urbanized development projects, urban renewal, space contraction and industrial upgrading, which have also provided many favourable policies to appeal to young people nationwide. The two main reasons for selecting Shenzhen as the study area are as follows: first, the immense housing demand, the shortage of residential land supply, and the tension between people and available land have inevitably led to the high cost of land and recent increases in housing prices in Shenzhen. To maintain a stable and healthy real estate market, it is necessary to understand the spatial variations and the influential factors of housing prices in Shenzhen. The results of our work can also provide some suggestions for the real estate development of similar cities. Second, there are abundant existing studies on housing prices in Shenzhen [14,41], which can be used as a reference and for comparison with our work.
In this article, housing price data for gated commercial communities were collected from one of the leading real estate information service platforms in China, the Anjuke website (https://shenzhen.anjuke.com/, accessed on 16 January 2021). Based on a massive user dataset, Anjuke has launched research reports on real estate data, housing rentals, etc., analysing market trends and user behaviours, providing guidance, helping home buyers find apartments and providing references for developers to help them make decisions. Moreover, gated commercial community information, including the spatial location, green ratio, property management fee and plot ratio, can also be acquired from the Anjuke website. Therefore, through large-scale real estate data collection, a database of real housing information was constructed that can ensure the reliability and authenticity of housing information. This paper collected data on 12,137 housing units in a total of 3186 gated commercial communities. A 1 km buffer zone was constructed to designate the neighbourhood of a residential area [42,43], and all of the location-related characteristics of communities were thus aggregated within a 1 km buffer. The study area, Shenzhen, and the locations of the communities are shown in Figure 2. The POI data were obtained from an online mapping services provider, AutoNavi Map (Amap, https://lbs.amap.com/, accessed on 16 January 2021). POI data refer to geographic entities that are of interest to users within a specific spatial scope and that can be abstracted as points, such as schools, hospitals, parks, and supermarkets. The road network data were acquired from OpenStreetMap (OSM, https://www.openstreetmap.org/, accessed on 16 January 2021), which is a free, open, editable map service built by volunteer contributors. Economic development levels were extracted from Luojia-1 night-time light imagery.
Urban land-use data for 2019 (including residential, commercial, industrial, transportation, public management, and service) were acquired by land change investigation, which was provided by the Shenzhen Municipal Bureau of Planning and Natural Resources. The building data were acquired from the official department mentioned above to ensure the authority and reliability of the datasets.
Notably, street view images were used to measure the space quality at the human scale. At present, network map service providers, represented by Google Maps, Tencent Maps and Baidu Maps, can provide street view services. The street view images used in this article were obtained from Baidu Maps (http://quanjing.baidu.com/#/, accessed on 16 January 2021). Based on road network data obtained from OSM, the sampling points of the street view images were generated using 50 m intervals along the road network to ensure the continuity of the street landscapes and avoid data redundancy. To reflect the human perspective, four angles of 0°, 90°, 180° and 270° were selected for street view data collection to achieve comprehensive inclusion of the visual environment around each viewpoint. The 1 km buffer and examples of the data used are shown in Figure 3. An example of street view images from the four angles for a sample location is shown in Figure 4.
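The sampling scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: road geometries are taken to be polylines in a projected coordinate system with units of metres, and the function names `sample_points` and `image_requests` are hypothetical helpers, not part of any street view API. The sketch generates one viewpoint every 50 m along a road and pairs each viewpoint with the four headings.

```python
import math

HEADINGS = (0, 90, 180, 270)  # the four viewing angles used in the article


def sample_points(polyline, spacing=50.0):
    """Return points every `spacing` metres along a polyline.

    Assumption: `polyline` is a list of (x, y) vertices in a projected
    coordinate system whose units are metres.
    """
    points = [polyline[0]]          # always keep the start of the road
    dist_to_next = spacing          # distance left until the next sample
    for (x0, y0), (x1, y1) in zip(polyline, polyline[1:]):
        seg_len = math.hypot(x1 - x0, y1 - y0)
        pos = 0.0                   # distance already walked on this segment
        while seg_len - pos >= dist_to_next:
            pos += dist_to_next
            t = pos / seg_len
            points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            dist_to_next = spacing
        dist_to_next -= seg_len - pos  # carry the remainder to the next segment
    return points


def image_requests(polyline, spacing=50.0):
    """Pair every sampling point with the four headings (hypothetical helper)."""
    return [(x, y, h) for (x, y) in sample_points(polyline, spacing) for h in HEADINGS]


# Hypothetical L-shaped road, 200 m long in total.
road = [(0.0, 0.0), (120.0, 0.0), (120.0, 80.0)]
pts = sample_points(road)
requests = image_requests(road)
```

Carrying the leftover distance across vertices keeps the spacing uniform along the whole road rather than restarting at every bend.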

Factor Quantification
The acquisition of massive data usually relies on web crawling tools. The average price of each residence was obtained from the real estate information platform Anjuke. To ensure that the housing price data conformed to the assumptions of the regression model, this paper used the HPM in semi-logarithmic form; that is, housing prices were transformed to a logarithmic scale. According to previous studies, housing prices are affected by many factors, and different influential factors result in spatial variations in housing prices. Therefore, a number of influential factors for housing prices were quantified from the perspectives of structure, community, facilities, location, space quality and landscape patterns [20,23,41].

Structural Information
Based on the information provided by the real estate information service platform Anjuke, the number of rooms and the number of halls were used to represent the shape and size of the residence. The area of a residence and the floor on which it is located both have a direct impact on the lives of residents, affecting factors such as comfort, views and light exposure.

Community Information
This article selected the construction year of the community, plot ratio, green rate and property management fee to represent community information. The plot ratio indicator is an important index reflecting the intensity of land development and the efficiency of land use. The higher the residential plot ratio is, the greater the development intensity and the higher the land utilization rate. However, plot ratios that are too high affect both the quality of the urban landscape and the living environment [44]. The green rate of a community is one of the most direct measurements of green, ecological and healthy communities. Although the property management fee is not the only standard used to measure the quality of property management services, many property management companies in residential areas charge higher fees to provide better services. The main reason is that high-quality services require more manpower and material resources; property management fees will thus have an impact on the quality of property services to some extent [45].

Locational Attributes
This paper mainly reflects the locational attributes of residential areas by considering two perspectives: traffic conditions and socio-economic activities. The evaluation of traffic conditions mainly depended on the bus stations, subway stations and road networks near the community, including the number of subway entrances and bus stations in the neighbourhood, the distance to the nearest subway entrance and bus stations, and the road density in the neighbourhood. The factors used for evaluating socio-economic activities included the distance to the city centre (Shenzhen Civic Centre in Futian district), the population density of the 1 km buffer, and the economic development level represented by night-time light imagery [46].

Facility Proximity
Facilities, including educational facilities, medical facilities, recreational facilities and commercial facilities, have an important influence on the convenience and quality of residents' daily lives. The proximity of educational facilities was measured by the number of kindergartens in a 1 km buffer, the number of primary and secondary schools, the number of high schools, the distance to the nearest kindergarten, and the distance to the nearest primary and secondary schools. Medical facilities were measured by the number of AAA class hospitals (3A hospitals are some of the highest-level hospitals) in the 1 km buffer. The proximity of recreational facilities was quantified using the number of community parks in the 1 km buffer and distance to the nearest community park. Finally, the proximity of commercial facilities was determined by using factors including the number of supermarkets in the 1 km buffer, the distance to the nearest supermarket, and the distance to the nearest farmers' market.
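The two kinds of proximity measure used above (the number of facilities inside the 1 km buffer and the distance to the nearest facility) can be illustrated with a short sketch. It assumes that communities and POIs are given as (longitude, latitude) pairs and uses the haversine great-circle distance; the coordinates below are made up for illustration, and the article does not state which distance computation was actually used.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def proximity(community, pois, radius_m=1000.0):
    """Return (count of POIs within the buffer, distance to the nearest POI)."""
    dists = [haversine_m(community[0], community[1], p[0], p[1]) for p in pois]
    count = sum(1 for d in dists if d <= radius_m)
    nearest = min(dists) if dists else float("nan")
    return count, nearest


# Hypothetical community and kindergarten locations (lon, lat).
community = (114.0579, 22.5431)
kindergartens = [(114.0600, 22.5440), (114.0620, 22.5480), (114.1000, 22.6000)]
count, nearest = proximity(community, kindergartens)
```

The same function can be reused for each facility type by passing a different POI list.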

Space Quality
Space quality is a concept formed to evaluate space by reflecting the comprehensive demand of the population for urban space. As an intuitive representation of streets, street view images can be segmented into streetscape scenes by semantic segmentation algorithms, which can be applied to measure the quality of urban space. This study adopted SegNet, an efficient pixel-level semantic segmentation algorithm, to segment a street view [47]. SegNet is a fully convolutional neural network consisting of encoders and decoders, followed by a softmax layer that classifies each pixel. One of the most obvious innovations of SegNet is the way its decoder upsamples low-resolution feature maps. SegNet can segment a street view image into 12 types of streetscapes. The green view index, sky openness (openness) and walkability, as measured by scene elements, were adopted to measure the space quality of the neighbourhood. In addition, the average normalized difference vegetation index (NDVI) of the 1 km buffer was included to measure space quality. A good visual environment provides places for residents to rest and engage in leisure activities.
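As a rough illustration of how a scene-element index can be derived from segmentation output, the sketch below computes the share of pixels of one class in each of the four images at a viewpoint and averages the shares. The label ids, the tiny label maps and the averaging rule are assumptions made for this example; the article does not give the exact formulas used for the green view index or sky openness.

```python
# Hypothetical class ids in the segmentation output (not SegNet's actual ids).
VEGETATION, SKY, ROAD = 1, 2, 3


def class_share(label_map, class_id):
    """Fraction of pixels in one label map that belong to `class_id`."""
    total = sum(len(row) for row in label_map)
    hits = sum(row.count(class_id) for row in label_map)
    return hits / total


def view_index(label_maps, class_id):
    """Average class share over the images taken at one viewpoint."""
    return sum(class_share(m, class_id) for m in label_maps) / len(label_maps)


# Toy 2 x 4 label maps standing in for the 0, 90, 180 and 270 degree images.
north = [[2, 2, 1, 1], [3, 3, 1, 1]]
east = [[2, 2, 2, 2], [3, 3, 3, 1]]
south = [[1, 1, 1, 1], [3, 3, 1, 1]]
west = [[2, 2, 2, 1], [3, 3, 3, 3]]
maps = [north, east, south, west]

green_view_index = view_index(maps, VEGETATION)
sky_openness = view_index(maps, SKY)
```

Viewpoint-level values computed this way could then be averaged over all sampling points inside a community's 1 km buffer.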

Landscape Patterns
Landscape patterns can be used to reflect both ecological and socio-economic functions, both of which affect housing prices [41,48]. This article adopted landscape metrics based on land-use data to describe the landscape patterns of neighbourhoods. The three selected metrics, measured at the landscape scale, were Shannon's diversity index (SHDI), patch density and patch area. Specifically, Shannon's diversity index quantifies diversity by describing the uncertainty that occurs among individuals in a population. The higher the uncertainty is, the higher the diversity, and vice versa. Patch density is the ratio of patch number to area, which can be used to represent landscape fragmentation. Patch area is used to represent the area of a patch and is a shape metric. The three selected factors are not strongly correlated with one another, so together they can reveal the multidimensional features of landscape patterns while limiting the risk of information redundancy. In addition, the three factors reflect the characteristics of landscape patterns from different perspectives, thereby offering a comprehensive view. Descriptive statistics for the selected influential factors are shown in Table 1.
XGBoost is a boosting ensemble model that combines the gradient boosting algorithm with decision trees, using a sequence of weak learners (i.e., decision trees) to complete the learning task [35,49,50]. Instead of using a search method, XGBoost directly utilizes the first and second derivative values of the loss function and improves the performance of the algorithm through techniques such as pre-sorting of feature values and quantile-based splitting of nodes. With a regularization term in the objective, XGBoost favours a simple model that still performs well. The regularization term is used to suppress overfitting of the weak learner in each iteration and does not participate in the integration of the final model.
In each iteration, the objective function is expanded by Taylor's formula, as in Equation (1):

$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n}\left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t) \qquad (1)$$

where $t$ denotes the $t$th iteration, $i$ is the $i$th sample, $n$ is the number of samples, $l$ is the loss function, $f_t$ is the tree added in iteration $t$, and $y_i$ is the real value of the $i$th sample; $\hat{y}_i^{(t-1)}$ represents the predictive value of the $(t-1)$th iteration; $g_i$ and $h_i$ are the first and second derivatives of the loss function with respect to $\hat{y}_i^{(t-1)}$, respectively; and $\Omega(f_t)$ is the regularization term. The complexity of the tree is shown in Equation (2):

$$\Omega(f_t) = \gamma T_t + \frac{1}{2}\lambda \sum_{j=1}^{T_t} \omega_j^{2} \qquad (2)$$

where $T_t$ is the number of leaf nodes in the $t$th iteration, $\omega_j$ represents the weight of the $j$th leaf node, and $\gamma$ and $\lambda$ are regularization coefficients. In the process of constructing the decision tree, sorting the values of the features to determine the optimal split point is the most time-consuming step. The greatest advantage of XGBoost is that the data features are sorted before the training and then stored as blocks. As a result, the existing blocks can be used for subsequent iterations, which greatly reduces the amount of computation required. XGBoost yields values measuring feature importance that can identify how important each feature is in its feature set. In this article, XGBoost was used to analyse the feature contributions and calculate the importance of the influential factors, which can deliver effective inputs of independent variables for HPM modelling.
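To make the role of the first and second derivatives concrete, the following sketch works through one split by hand. It assumes the standard XGBoost formulation, in which minimizing the second-order objective with a penalty of γ per leaf and (λ/2) times the squared leaf weights gives an optimal leaf weight of −G/(H + λ), where G and H are the sums of g_i and h_i in the leaf; it also assumes a squared-error loss. The data, the chosen split and the values of λ and γ are hypothetical, and the code does not call the XGBoost library.

```python
def grad_hess(y, y_pred):
    """g and h for the squared-error loss l = 0.5 * (y - y_pred) ** 2."""
    return y_pred - y, 1.0


def leaf_weight(G, H, lam):
    """Optimal leaf weight under the second-order approximation."""
    return -G / (H + lam)


def split_gain(GL, HL, GR, HR, lam, gamma):
    """Reduction of the regularized objective obtained by splitting a leaf."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma


# Hypothetical log10 prices and a constant prediction from the previous round.
y = [4.6, 4.7, 4.9, 5.0]
y_prev = [4.8, 4.8, 4.8, 4.8]
stats = [grad_hess(yi, pi) for yi, pi in zip(y, y_prev)]

# Candidate split: first two samples go left, last two go right.
GL = sum(g for g, _ in stats[:2])
HL = sum(h for _, h in stats[:2])
GR = sum(g for g, _ in stats[2:])
HR = sum(h for _, h in stats[2:])

lam, gamma = 1.0, 0.0           # assumed regularization coefficients
w_left = leaf_weight(GL, HL, lam)
w_right = leaf_weight(GR, HR, lam)
gain = split_gain(GL, HL, GR, HR, lam, gamma)
```

A positive gain means the split lowers the regularized objective; summing such gains per feature is one common way tree ensembles derive importance scores.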

Hedonic Price Model-Based Exploration of the Effects of Influential Factors
The HPM is the most commonly used method for studying the market values of the influential factors of housing prices, which are not priced directly in the market. In 1974, Rosen [51] first applied the HPM to real estate research and it has been widely used since to study the marginal price of factors influencing housing prices. The essence of the HPM is ordinary linear regression, which can reveal the marginal price of housing property. The HPM has a variety of forms, such as logarithmic form, semi-logarithmic form and exponential form. The semi-logarithmic equation can effectively solve or reduce the heteroscedasticity problem, and its form is simple. Therefore, this paper modelled housing prices and their influential factors in semi-logarithmic form. The HPM model in semi-logarithmic form is as follows:

$$\log P_i = \beta_0 + \sum_{j} \beta_j S_{ij} + \sum_{k} \beta_k C_{ik} + \sum_{l} \beta_l L_{il} + \sum_{m} \beta_m F_{im} + \sum_{n} \beta_n Q_{in} + \varepsilon_i$$

where $P_i$ is the home price, and $\log P_i$ is the logarithm base 10 of the home price. $S_{ij}$, $C_{ik}$, $L_{il}$, $F_{im}$ and $Q_{in}$ represent the structural, community, locational, facility and space quality-related factors. $\beta_0$ is the intercept term and $\varepsilon_i$ is the error term. $\beta_j$, $\beta_k$, $\beta_l$, $\beta_m$ and $\beta_n$ are estimated coefficients for the structural, community, locational, facility and space quality-related factors. That is, they can be understood as semi-elasticities of price.
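A small worked example may help with reading coefficients in the semi-logarithmic form. The sketch below fits an ordinary least squares line of log10 price on a single factor and converts the slope into a percentage price change. It is a simplification under stated assumptions: the real model has many regressors, and the green view and price values here are invented so that each 0.1 increase in the factor raises the price by exactly 10%.

```python
import math


def ols_simple(x, y):
    """Closed-form OLS for one regressor: returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data: prices grow by a factor of 1.1 per 0.1 of green view index.
green_view = [0.10, 0.20, 0.30, 0.40]
price = [40000.0, 44000.0, 48400.0, 53240.0]
log_price = [math.log10(p) for p in price]

beta0, beta1 = ols_simple(green_view, log_price)

# With a base-10 log, a change of dx in the factor multiplies the price by
# 10 ** (beta1 * dx); the percentage change follows directly.
dx = 0.10
pct_change = (10 ** (beta1 * dx) - 1) * 100
```

Because the dependent variable is a base-10 logarithm, coefficients should be converted with a power of 10 rather than the exponential function before being read as percentage effects.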

The Results of XGBoost and the HPM
To understand the significant inequality evident in the housing prices of Shenzhen, it was necessary to study the influencing mechanisms behind those prices. First, XGBoost was used to explore the importance of different factors of housing prices by revealing non-linear relationships between influential factors and housing prices. The performance of XGBoost is summarized in Table 2. It generally presented a good performance, with an R-square of 0.944. Other indicators, including root mean squared error (RMSE) and its percentage (%RMSE), mean absolute error (MAE) and its percentage (%MAE) and accuracy (P), were also used to evaluate the performance of the XGBoost model. Specifically, the validation indicators of the prediction results showed that the RMSE was 0.057, the %RMSE was 1.203%, the MAE was 0.039, and the %MAE was 0.825%. The HPM was then applied to study the effects of influential factors on housing prices. The values of the variance inflation factor (VIF) were all less than 10, which indicates that there is no obvious multicollinearity between the variables. The R-square and RMSE in the HPM were 0.600 and 0.153, respectively, which indicates that about 60% of the variation in housing prices can be explained. The statistics for both models are shown in Table 2.
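The validation indicators named above can be computed as in the following sketch. The definitions of RMSE, MAE and R-square are the usual ones; expressing %RMSE and %MAE as the error divided by the mean observed value is an assumption, since the article does not state its formula. The observed and predicted log10 prices are hypothetical.

```python
import math


def rmse(obs, pred):
    """Root mean squared error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def r_squared(obs, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def percent_of_mean(error, obs):
    """Error as a percentage of the mean observation (assumed definition)."""
    return 100.0 * error / (sum(obs) / len(obs))


# Hypothetical observed and predicted log10 prices.
observed = [4.5, 4.7, 4.8, 5.0]
predicted = [4.6, 4.7, 4.7, 5.0]

rmse_val = rmse(observed, predicted)
mae_val = mae(observed, predicted)
r2_val = r_squared(observed, predicted)
pct_rmse = percent_of_mean(rmse_val, observed)
pct_mae = percent_of_mean(mae_val, observed)
```

Because the errors are measured on the log10 scale, the percentage indicators describe errors relative to the mean log price, not relative to the price itself.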

The Importance of the Different Factors of Housing Prices Based on XGBoost
XGBoost can measure the importance of features using their weights. Figure 5 shows the relative importance of the influential factors based on the XGBoost results. As shown in Figure 5, the top five variables were distance to city centre, green view index, population density, property management fee and economic level. The top five variables can predominantly explain the spatial variability of housing prices. The variable of distance to the city centre had the largest effect on home prices, and its effect was much larger than that of other variables. Two reasons may explain this. First, the economic development of Futian and Nanshan has been very rapid. Traffic conditions are also becoming more developed and health care and education resources have become more advanced, which attracts many people to these areas. Therefore, there is high and urgent demand for homes, which has caused a short supply in the property market. Second, the level of urbanization in these areas is higher, and land resources are scarcer. Land prices are high, which naturally drives high housing prices. Compared with other variables for green space, such as the greening ratio, NDVI and park accessibility, the green view index showed a much greater impact on housing prices. The green view index records, systematically and in detail, the greening quality at the urban street level from the perspective of pedestrians. It provides the spatial distribution of greenness at the street level, which can directly and accurately reflect the greenery that is visible from the street. This result also showed that street greening, which is more common and accessible to residents, should not be ignored in planning. The effect of population density on housing prices also operates mainly through supply and demand. Following the top five variables, the landscape index, represented by patch density and Shannon's diversity index, also played a role in housing prices.
A better urban ecological landscape can enhance the comfort of residents and improve the value and competitiveness of the city. This result also verified the effects of the urban ecological landscape on housing prices, consistent with a previous study [41]. Subway entrances, kindergartens, 3A hospitals, supermarkets and parks are facilities that are very important to the daily life of residents and that provide them with convenience [17]. The distance to schools had less of an effect on home prices than expected. An increasing number of studies have pointed out that the quality of the public school district in which housing is located has become increasingly important. Finally, the structure of a residence and its age had a relatively small effect on its price. Buyers paid more attention to the floor location of a home, as this is directly related to its lighting, ventilation and views.

Effects of Influential Factors on Housing Prices: HPM Results
A summary report of the HPM was obtained by inputting the attributes into ordinary least squares (Table 3). Some variables warrant attention according to Table 3. First, the structural factors of a residence did not have a significant effect on housing prices, except for the floor on which the home was located. This is consistent with the XGBoost results, which suggests that these structural variables were less important for home prices in Shenzhen. The floor of the home had a positive effect on its prices, and higher floors were generally perceived to have better views and lighting than lower floors. Second, for the variables related to community information, the plot ratio, greening rate and property management fee all had significant positive effects on housing prices. Notably, the plot ratio variable was related to building density, building height, building spacing and number of residents, which directly affected the residential quality of the community. Generally, the lower the plot ratio, the higher the residential quality of the community. One of the main reasons for the positive relationship between the plot ratio and housing prices was that, given a background of high housing prices and tight land supply, high plot ratio residential communities have become mainstream. The factors for locational attributes and facility proximity, except for distance to the nearest subway entrance, all showed statistically significant effects on housing prices. Locational attributes and facility proximity were essential attributes of housing prices. For example, residences near city centres were favoured by families due to well-established cultural and athletic facilities, clusters of medical treatment and health organizations, and the presence of entertainment and amusement functions, which can be highly convenient and accessible for families. 
In addition, many high-rise office buildings cluster in Futian, encouraging a large number of high-income groups to aggregate there, thus effectively stimulating housing prices. The effects of some variables related to educational resources, such as the number of kindergartens, the number of middle schools and the distance to the nearest primary school, were unexpected. The main reason was that the housing price in school districts was promoted by educational resources, while the housing price in high-quality school districts was not affected by the distance from the school.
In this paper, a POI-based mixed-use variable and Shannon's diversity index were used to represent the degree of functional mix in a neighbourhood. However, the two variables had very different effects on home prices. Specifically, the POI-based mixed-use variable had a significant positive effect on housing prices. Shannon's diversity index showed significant negative effects on housing prices for the following reasons: (1) there were obvious differences in the classifications used for POI and those for land use; (2) buyers focused much more on facilities, and POI data can better reflect interactions with people than land use. Finally, this paper adopted street view images to determine the space quality of each neighbourhood. There was a positive relationship between the green view index and housing prices. Compared with the number of community parks and the distance to the nearest community park, the green view index had a larger effect on housing prices. Many studies show that greenery visible at eye level has important effects on residents' health and living environment [52]. Green view, as measured by street view images, can accurately assess people's daily exposure to greenery.

Implications for Housing Policy
Identifying the relationships between housing prices and influential factors is crucial for policy decision making to optimize the urban infrastructure layout, develop a healthy and equitable real estate market, and design liveable neighbourhoods. First, the relationships between housing prices and factors are complex, and it is necessary to study them by combining advanced technologies with classical methods. Therefore, this paper adopted one of the most popular machine learning methods, XGBoost, and the classical housing price method, the HPM, to reveal the ranked importance of factors for housing prices and their quantified effects on housing prices. Notably, XGBoost has shown potential in providing insights for real estate studies and the appraisal of real estate prices. Using the research framework of this article, real estate-related management departments can monitor or assess urban property prices. On the one hand, a machine learning algorithm can be used to rapidly map the relative importance of factors that affect housing prices. On the other hand, the traditional hedonic price model can reveal the market value of the factors that affect prices. Second, the influential factors of housing prices should be continuously expanded, supplemented, and improved as new data and methods become available. For example, we used street view images and semantic segmentation to extract factors at the human scale. Street view images, as a new type of data covering urban areas and the general landscape, can fully represent the physical appearance of city space. These new data and methods have great potential and applicability for study because they reflect the living environment of residential districts from a human perspective. Finally, our work demonstrated that current urban infrastructure is not perfect, as indicated by features such as differences in school resources, subway distribution, etc.
These imperfections are the main reasons for the difference in housing prices in cities. Therefore, in future urban design and planning, we should pay attention to the factors with important influence to promote the sustainable development of the real estate market and urban design. Overall, this paper provided a theoretical reference for the systematic, scientific and comprehensive evaluation of the factors influencing urban housing prices.

Limitations and Prospects
Although this paper applied new methods to explore the complex relationships between housing prices and influential factors, and further improved the system of influential factors for housing prices, we also recognize several limitations of this work that should be addressed in future studies. First, temporal variations in housing prices should also be considered. Housing price appreciation rates and influence mechanisms are equally important for buyers, property developers and governments [20]. Second, the proportions of the scenes extracted from street view images were not sufficient to evaluate space quality. Residents' perception of the neighbourhood environment, such as safety, liveliness, depression and wealth, should be evaluated based on street view images [53]. Finally, this paper proposed a framework integrating machine learning and a hedonic price model that was only applied in Shenzhen, China. In the future, more overseas and domestic cities can be studied to improve the generalizability and reproducibility of the research framework presented in this paper.

Conclusions
A deep and comprehensive understanding of the spatial variations in housing prices and their influential factors is critical for housing price control, public facility construction and urban planning [54]. We presented a multisource geo-tagged data fusion framework integrating XGBoost and the HPM to study the complex relationships between housing prices and influential factors. Specifically, XGBoost and the HPM can reveal different aspects of the complex relationships between housing prices and influential factors by ranking the influential factors by importance and determining their quantified effects. The XGBoost results identified the five most important variables for Shenzhen housing prices as distance to city centre, green view index, population density, property management fee and economic level. The HPM results proved that green view index is a good objective indicator for measuring street-level greenery, which has significant and positive effects on housing prices. Understanding the variations in housing prices and their influential factors can better inform sustainable city planning and urban and housing policy making. In addition, some new factors at the human scale extracted from street view data were added to enrich the system of factors affecting housing prices. Our work also demonstrated that urban big data, machine learning and spatial statistical methods provide us with new data sources and enable interdisciplinary approaches to understanding the distribution of housing prices.