Real Estate Market Forecasting for Enterprises in First-Tier Cities: Based on Explainable Machine Learning Models

Song, Dechun; Hu, Guohui; Li, Hanxi; Zhao, Hong; Wang, Zongshui; Liu, Yang

doi:10.3390/systems13070513

by Dechun Song ¹, Guohui Hu ², Hanxi Li ¹,*, Hong Zhao ¹, Zongshui Wang ²,³ and Yang Liu ⁴

¹ School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
² Business School, Beijing Information Science and Technology University, Beijing 100192, China
³ Institute of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
⁴ International Business School, Beijing University of Financial Technology, Beijing 101117, China
* Author to whom correspondence should be addressed.

Systems 2025, 13(7), 513; https://doi.org/10.3390/systems13070513

Submission received: 27 April 2025 / Revised: 4 June 2025 / Accepted: 23 June 2025 / Published: 25 June 2025

Abstract

The real estate market significantly influences individual lives, corporate decisions, and national economic sustainability. Therefore, constructing a data-driven, interpretable real estate market prediction model is essential. It can clarify each factor’s role in housing prices and transactions, offering a scientific basis for market regulation and enterprise investment decisions. This study comprehensively measures the evolution trends of the real estate markets in Beijing, Shanghai, Guangzhou, and Shenzhen, China, from 2003 to 2022 through three dimensions. Then, various machine learning methods and interpretability methods like SHAP values are used to explore the impact of supply, demand, policies, and expectations on the real estate market of China’s first-tier cities. The results reveal the following: (1) In terms of commercial housing sales area, adequate housing supply, robust medical services, and high population density boost the sales area, while demand for small units reflects buyers’ balance between affordability and education. (2) In terms of commercial housing average sales price, growth is driven by education investment, population density, and income, with loan interest rates serving as a stabilizing tool. (3) In terms of commercial housing sales amount, educational expenditure, general public budget expenditure, and real estate development investment amount drive revenue, while the five-year loan benchmark interest rate is the primary inhibitory factor. These findings highlight the divergent impacts of supply, demand, policy, and expectation factors across different market dimensions, offering critical insights for enterprise investment strategies.

Keywords:

real estate market; real-estate enterprises; market forecasting; explainable machine learning models

1. Introduction

The real estate sector, as a fundamental pillar of the national economy of China, is integral to macroeconomic stability and sustainable development. However, China’s real estate market has exhibited significant volatility over the past decade, characterized by pronounced disparities among first-, second-, and third-tier cities, thereby increasing the complexity of regulatory interventions [1]. To address these challenges, the Ministry of Housing and Urban–Rural Development has underscored the necessity for cities, particularly first-tier cities, to enhance regulatory autonomy in managing the real estate market and adjust housing purchase restrictions in accordance with local conditions. In first-tier cities, factors such as high population mobility, constrained land resources, and frequent policy interventions have contributed to multiple cycles of housing price surges since the housing reform. These fluctuations have not only influenced the overall real estate market but have also had broader implications for national monetary policy [2]. The impact of first-tier city housing market dynamics is manifested through spillover effects, demonstration effects, and structural market adjustments. The spillover effect arises as elevated housing costs in first-tier cities drive homebuyers to adjacent regions, subsequently exerting upward pressure on property prices in those areas [3]. The demonstration effect occurs when rising housing prices in first-tier cities are perceived as indicative of national market trends, reinforcing buyer confidence and precipitating price increases in other cities, particularly strong second-tier cities, at times leading to market overvaluation [4]. Additionally, sustained price appreciation in first-tier cities may prompt real estate firms to adjust their investment strategies, prioritizing high-return urban centers, while third- and fourth-tier cities experience stagnation or decline in housing demand and pricing. 
Moreover, fluctuations in first-tier-city housing prices may impose “passive constraints” on national monetary policy, compelling the central bank to navigate trade-offs between financial stability in the real estate sector and broader economic growth objectives. This dynamic may, under certain circumstances, result in deviations from an optimal monetary policy trajectory. For instance, during the 2016–2017 period, China’s economy was undergoing deepening supply-side structural reforms, and the slowing macroeconomic growth theoretically warranted a more accommodative monetary policy to support the real economy. However, the overheating of the real estate market in first-tier cities, accompanied by surging speculative demand and rising financial leverage, necessitated monetary tightening by the central bank to mitigate the risks associated with a potential real estate bubble.

Given the complexities of the current economic landscape, the development of rigorous forecasting models is imperative for anticipating market trends and informing evidence-based policymaking. Accurate predictions of real estate market dynamics in first-tier cities can provide critical empirical support for governmental policy formulation, market regulation, and corporate strategic planning, thereby facilitating rational market development and optimizing resource allocation.

2. Literature Review

2.1. Factors Influencing the Real-Estate Enterprise Market

As a crucial pillar of the national economy in China, the real estate sector’s market performance is influenced by a complex interplay of factors. With the continuous advancement of domestic economic restructuring and industrial upgrading, real-estate enterprises must conduct in-depth analyses of these multifaceted influences to establish and sustain competitive advantages.

In real estate market research, housing prices are commonly used as key indicators to assess market development trends [2]. Given the diverse determinants affecting housing prices, scholars have integrated theories from various disciplines to develop comprehensive analytical frameworks. For instance, Cui et al. adopted an economic theoretical approach to examine the impacts of public goods supply, international trade, real estate investment, income levels, and credit rationing on housing prices across cities of different tiers [5]. Meanwhile, Liu and Chen applied the concept of physical potential energy to explore how differences in housing prices, purchasing policies, and market information between high- and low-tier cities create potential disparities that influence housing market dynamics [4]. In recent years, with the development of big data and natural language processing (NLP) technologies, media sentiment analysis has emerged as an innovative approach to capturing the emotional expectations of market participants, offering new perspectives for studying real estate market volatility [6].

From the perspective of market fundamentals, prior research has primarily analyzed housing price determinants by categorizing them into demand-side and supply-side factors. The demand-side factors encompass demographic and social structures, income levels and purchasing power, and the financial environment. Population size and migration patterns are frequently employed to characterize demographic influences in real estate research. Using population size and appropriate forecasting methods, scholars can predict future real estate demand [7]. Wang et al. examined the effects of interregional migration and urbanization on housing prices, finding that city-level population mobility is positively correlated with housing prices, particularly when migrants have higher educational backgrounds [8]. Income levels and purchasing power reflect both the objective economic conditions of residents and their subjective willingness to invest in housing. In recent years, some studies have taken Guangzhou as a case study to explore the factors influencing the settlement intentions of international migrants, demonstrating an inverted U-shaped relationship between the number of rooms and settlement intention, while no significant correlation was found between per capita housing area and settlement intention [9]. In addition, regional differences in income levels significantly affect housing price disparities [10,11]. The pricing sentiment spiral phenomenon further exacerbates market polarization, reinforcing cyclical price trends [12]. Additionally, the financial environment, including stock market fluctuations and monetary policy, plays a crucial role in shaping housing market dynamics. Given the dual nature of housing as both a consumable good and an investment asset, stock market volatility can enhance the attractiveness of real estate investments, thereby driving up housing prices through substitution effects [13]. Lu et al. highlighted that increased money supply directly raises housing prices and indirectly affects them through stock market overshooting effects [14].

On the supply side, housing construction costs and market inventory are the primary determinants. Property supply negatively correlates with housing price increases, with first-tier cities experiencing a more pronounced effect [15]. In slower-growing regions, the costs associated with housing price inflation exhibit distinct characteristics [16]. Among these costs, land acquisition is a critical determinant of housing prices [11]. Wu et al. found that developers who acquire land at inflated prices tend to adopt the following two strategies: extending the development cycle to capitalize on higher future prices or leveraging their market position to directly increase sales prices [17]. Construction costs, as a controllable component, have a significant positive correlation with housing prices [18]. Consequently, substantial research has been conducted on optimizing construction cost-efficiency. Furthermore, as real estate is a capital-intensive industry, financing costs play a crucial role. Developers primarily rely on bank loans, trust funds, and real estate investment trusts, with capital liquidity directly affecting market activity and investment cycles.

In summary, existing studies have largely focused on identifying real estate market determinants based on city-tier classifications or cross-tier comparisons, predominantly using housing prices as the dependent variable. However, given the complexity and systemic nature of the real estate market, housing prices alone fail to capture the full spectrum of market dynamics. Consequently, scholars have sought to expand the scope of measurement dimensions to construct more comprehensive market assessment frameworks. For instance, Chen and Zhu incorporated indicators such as commercial property sales area, housing construction area, and the commercial property sales ratio to enhance real estate market evaluations [19]. Similarly, Cui utilized investment indicators, supply–demand metrics, and housing prices as characterization variables, contributing to the theoretical and methodological advancement of real estate market forecasting [20]. These multidimensional approaches facilitate a more precise and holistic understanding of market trends, thereby supporting more effective policy formulation and strategic decision making within the real estate sector.

2.2. Market Forecast Models for Real-Estate Enterprises

The real estate market is a crucial component of the national economy, with its development directly influencing macroeconomic stability and growth. Accurate house price forecasting not only aids the government in formulating regulatory policies but also provides a valuable reference for investors in decision making. In recent years, advancements in data analysis techniques have driven the evolution of house price forecasting methods, encompassing traditional qualitative forecasting, econometric models, machine learning, and hybrid approaches. Qualitative forecasting methods are typically used alongside quantitative techniques, serving as both a complement and a foundation for quantitative forecasting. These methods primarily include the omen forecasting method, the Delphi method, and the extrapolation forecasting method. The omen forecasting method relies on experience and observation, identifying and analyzing leading indicators or signals to anticipate potential future trends. In real estate market forecasting, this method often employs market dynamics indicators as early signals of house price fluctuations, leveraging expert experience, historical pattern analysis, and market-sensitive information. While suitable for short-term trend assessments, its strong subjectivity introduces uncertainties and potential misjudgments [20]. The Delphi method, characterized by multiple rounds of anonymous surveys, gathers expert opinions, refining and adjusting responses in each round to achieve a high degree of consensus. Abdur et al. developed a forecasting model incorporating 35 core micro- and macroeconomic variables and integrated insights from 11 real estate experts. Through a four-round Delphi survey, they ranked the influence of eight key variables and reached a high level of agreement [21]. 
The extrapolation forecasting method, based on logical reasoning and causal analysis, typically uses historical real estate data to infer future trends by analyzing past data variation rates [22].

Econometric forecasting methods, grounded in statistical and mathematical modeling, quantify relationships between economic variables to predict future trends. Spatial econometric models, widely applied in house price analysis, effectively capture spatial dependence. Liu Dong et al. employed multifactor regression, state-space models, and Kalman filtering to predict real estate price trends in China [23]. The autoregressive distributed lag (ARDL) model is particularly useful in depicting dynamic adjustments among variables by estimating short-term coefficients and long-term equilibrium relationships. Rapach and Strauss demonstrated that real house price growth predictability varied significantly across U.S. states from 1995 to 2006. Their findings indicated that ARDL models performed well in certain inland states but exhibited weaker predictive power for coastal states, highlighting a degree of “disconnect” between house prices and economic fundamentals in these regions [24]. Yang et al. employed exponential smoothing, ARIMA models, and regression–time series combination models to forecast the average price of commercial housing in China. They further developed a combined forecasting model based on the IOWHA operator and an evaluation index system, revealing significant complementarities among different methods [25]. Additionally, some scholars have innovated within econometric model development by integrating short-term positive serial correlation with long-term mean reversion. During periods of market prosperity, greater weight is assigned to positive serial correlation, while in economic downturns, reversion to fundamental value is prioritized, thereby enhancing prediction accuracy [26].

With the advancement of computational power and the widespread application of big data, machine learning and hybrid models have demonstrated superior capabilities over traditional statistical methods in capturing complex nonlinear relationships, improving prediction accuracy and handling high-dimensional and heterogeneous data. Liu integrated the gray relational method, wavelet neural networks, and Markov chain techniques to enhance predictive performance [27]. Shi and Xiao developed a novel valuation model for second-hand houses by incorporating historical transaction prices and quantitative adjustments for property-specific characteristics. By merging econometric and machine learning approaches, their model significantly improved interpretability [28]. Cui et al. proposed a hybrid forecasting approach that combines decomposition-based ensemble learning from machine learning, time series analysis from econometrics, and boom signal methods from qualitative forecasting. Their methodology involves selecting an appropriate benchmark model based on data characteristics, demonstrating superior predictive accuracy over single-model approaches in forecasting real estate market investment, demand, and price trends [20]. Chen et al. categorized house price determinants into the following four dimensions: supply, demand, policy, and expectations. By incorporating machine learning techniques and SHAP-based interpretability methods, they analyzed the dynamic evolution of key factors influencing house prices in China’s four first-tier cities. Their findings indicated that expectations play a dominant role in driving price increases in these cities [2] and are also a primary factor contributing to price differentiation across first-, second-, and third-tier cities [29].

2.3. Summary

Although extensive research has been conducted on housing price forecasting, studies specifically focusing on the real estate markets of China’s first-tier cities remain relatively scarce. Given that these cities serve as “barometers” for the national real estate market, influencing market expectations and policy adjustments, further investigation is warranted. Moreover, existing studies predominantly examine the determinants of housing price fluctuations, such as macroeconomic indicators, supply–demand dynamics, and policy interventions. However, relatively little attention has been paid to core market indicators—such as housing sales area and sales revenue—that comprehensively reflect market activity. This gap limits the ability of forecasting models to capture market dynamics and structural shifts effectively. Compared with traditional statistical and econometric modeling approaches, machine learning methods demonstrate superior adaptability and predictive accuracy in real estate forecasting due to their ability to process high-dimensional features, model nonlinear relationships, and optimize automatically. In particular, the integration of SHAP (Shapley additive explanations) enhances model interpretability by quantifying the marginal contributions of different factors to housing prices and market transactions. This not only deepens researchers’ and policymakers’ understanding of the underlying drivers of the real estate market but also provides a scientific basis for more precise market regulation and investment decision making.

3. Methods

The purpose of this study is to explore the multidimensional impact of different types of influencing factors on the real estate market using machine learning models and interpretability techniques. The research framework is shown in Figure 1.

3.1. Machine Learning Models

The complex, nonlinear interactions between multi-level social, economic, political, and expectation-related factors and the scale of the real estate market pose significant challenges for traditional modeling approaches [2]. Machine learning models, however, can effectively capture these nonlinear relationships that simple linear regression models struggle to approximate [20]. Therefore, this study selects several models that have been widely applied in real estate price forecasting to explore different machine learning-based alternatives.

(1): Support Vector Regression (SVR)

Support vector regression (SVR) applies the support vector machine (SVM) to regression problems and has been widely adopted in real estate price forecasting, where it has performed well [30,31]. Given a dataset $(x_{i}, y_{i})$, $i = 1, 2, \dots, N$, where $x_{i} \in \mathbb{R}^{n}$ represents the feature vectors and $y_{i} \in \mathbb{R}$ denotes the corresponding continuous target values, the objective of SVR is to find a function, $f(x)$, that minimizes the error between $f(x)$ and the actual values, $y$. The loss function is defined as follows:

$$L(y, f(x)) = \max(0, |y - f(x)| - \epsilon) \qquad (1)$$

where $\epsilon$ is a predefined threshold representing the maximum allowable error between the predicted and actual values. If the absolute error between the predicted and actual values is less than $\epsilon$, the loss is zero; otherwise, the loss equals the amount by which the absolute deviation exceeds $\epsilon$. The corresponding optimization problem is formulated as follows:

$$\min_{w, b, \xi, \xi^{*}} \left\{ \frac{1}{2} \|w\|^{2} + C \sum_{i=1}^{N} (\xi_{i} + \xi_{i}^{*}) \right\} \qquad (2)$$

where $w$ is the weight vector, $b$ is the bias term, $\xi_{i}$ and $\xi_{i}^{*}$ are slack variables used to handle data points that fall outside the $\epsilon$-insensitive zone, and $C$ is the regularization parameter that controls the trade-off between model complexity and the penalty on errors. By applying the Lagrange multiplier method, the above optimization problem can be transformed into its dual form, as follows:

$$\max_{\alpha, \alpha^{*}} \left\{ -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (\alpha_{i} - \alpha_{i}^{*}) (\alpha_{j} - \alpha_{j}^{*}) \langle x_{i}, x_{j} \rangle - \epsilon \sum_{i=1}^{N} (\alpha_{i} + \alpha_{i}^{*}) + \sum_{i=1}^{N} y_{i} (\alpha_{i} - \alpha_{i}^{*}) \right\} \qquad (3)$$

where $\alpha_{i}$ and $\alpha_{i}^{*}$ are the Lagrange multipliers. In the solution to the dual problem, only a subset of $\alpha_{i}$ and $\alpha_{i}^{*}$ are nonzero, and the corresponding $x_{i}$ are referred to as support vectors. The final regression function can be expressed as follows:

$$f(x) = \sum_{i=1}^{N} (\alpha_{i} - \alpha_{i}^{*}) \langle x, x_{i} \rangle + b \qquad (4)$$

The above equation indicates that the decision function of the SVR relies only on the support vectors rather than the entire training dataset. To handle nonlinear problems, SVR can replace the inner product, $\langle x, x' \rangle$, with a kernel function, $K(x, x')$; common choices include linear, polynomial, and radial basis function kernels. Through these steps, SVR finds a regression function that approximates the training data points as closely as possible within the $\epsilon$-insensitive zone while maintaining the model's generalization capability.
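The pieces above can be illustrated with a minimal Python sketch. It only evaluates the $\epsilon$-insensitive loss of Equation (1) and the kernelized form of Equation (4); it does not solve the dual problem. The function names, the `gamma` parameter, and all numeric values are hypothetical and chosen purely for illustration.

```python
import math

def eps_insensitive_loss(y, f_x, eps):
    """Equation (1): zero inside the eps-tube, linear outside it."""
    return max(0.0, abs(y - f_x) - eps)

def rbf_kernel(x, z, gamma=0.5):
    """Radial basis function kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def svr_predict(x, support_vectors, dual_coefs, b, kernel=rbf_kernel):
    """Equation (4) with the inner product replaced by a kernel.

    dual_coefs[i] holds (alpha_i - alpha_i^*) for support vector i.
    """
    return sum(c * kernel(x, sv) for c, sv in zip(dual_coefs, support_vectors)) + b

# Hypothetical support vectors and coefficients (not fitted to any data).
support_vectors = [(0.0, 0.0), (1.0, 1.0)]
dual_coefs = [0.8, -0.3]
b = 0.1

pred = svr_predict((0.0, 0.0), support_vectors, dual_coefs, b)
loss_inside = eps_insensitive_loss(y=1.0, f_x=0.95, eps=0.1)   # error 0.05 < eps
loss_outside = eps_insensitive_loss(y=1.0, f_x=0.5, eps=0.1)   # error 0.5 > eps
```

Only the support vectors enter `svr_predict`, which mirrors the observation that the decision function does not depend on the full training set.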

(2): Random Forest (RF)

Random forest (RF) is an ensemble learning method that improves model accuracy and robustness by constructing multiple decision trees and combining their predictions. It has been utilized by scholars in real estate market-related research [32,33]. When constructing each decision tree, RF employs the bootstrap sampling method, where multiple subsample sets, $D'$, are randomly drawn with replacement from the original training set, $D$. The size of each subsample set is the same as that of the original training set, but it may contain duplicate samples. This approach helps reduce the model's variance and enhances its generalization ability.

In each node split, RF randomly selects a subset of features, $D'$, from all available features, and then chooses the best feature, $f^{*}$, from this subset for the split. This process can be expressed as follows:

$$f^{*} = \arg\min_{f \in D'} \mathrm{Impurity}(f) \qquad (5)$$

where $\mathrm{Impurity}(f)$ represents the impurity of feature $f$, such as the Gini impurity or information gain. In regression problems, RF determines the final prediction result by averaging the predictions of all decision trees. For a given input, $x$, the predicted result can be expressed as follows:

$$\hat{f}(x) = \frac{1}{T} \sum_{t=1}^{T} f_{t}(x) \qquad (6)$$

where $T$ represents the number of decision trees, and $f_{t}(x)$ denotes the prediction result of the $t$-th decision tree. RF can estimate the model's generalization error using the out-of-bag (OOB) error, which helps detect overfitting without a separate validation set. The OOB error of a sample is the average error of the decision trees whose bootstrap samples did not contain it. For a given sample, $x_{i}$, the OOB error can be expressed as follows:

$$\mathrm{Error}(x_{i}) = \frac{1}{T_{OOB}(x_{i})} \sum_{t \in OOB(x_{i})} L(y_{i}, f_{t}(x_{i})) \qquad (7)$$

where $T_{OOB}(x_{i})$ represents the number of decision trees that were trained without sample $x_{i}$, $OOB(x_{i})$ is the set of these decision trees, $y_{i}$ is the true value of sample $x_{i}$, and $L(y_{i}, f_{t}(x_{i}))$ denotes the loss between the true value, $y_{i}$, and the prediction, $f_{t}(x_{i})$, of the $t$-th decision tree.
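The bootstrap, feature-subset, averaging, and OOB steps can be sketched in plain Python as follows. This is an illustration under simplifying assumptions, not the implementation used in the study: each "tree" is a depth-one regression stump, impurity is the summed squared error, the loss in Equation (7) is the squared error, and all function names and the toy data are hypothetical.

```python
import random

def fit_stump(X, y, feature_ids):
    """Fit a depth-1 regression tree, searching only the given features.

    Returns (feature, threshold, left_mean, right_mean) for the split with
    the lowest summed squared error (the impurity used in this sketch).
    """
    mean_all = sum(y) / len(y)
    best = (None, None, mean_all, mean_all)
    best_sse = sum((v - mean_all) ** 2 for v in y)
    for f in feature_ids:
        for thr in sorted({row[f] for row in X}):
            left = [v for row, v in zip(X, y) if row[f] <= thr]
            right = [v for row, v in zip(X, y) if row[f] > thr]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if sse < best_sse:
                best, best_sse = (f, thr, lm, rm), sse
    return best

def predict_stump(stump, row):
    f, thr, lm, rm = stump
    if f is None:          # no valid split was found; predict the mean
        return lm
    return lm if row[f] <= thr else rm

def fit_forest(X, y, n_trees=25, n_sub_features=1, seed=0):
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]              # n draws with replacement
        feats = rng.sample(range(n_features), n_sub_features)   # random feature subset
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx], feats)
        forest.append((stump, set(idx)))                        # remember in-bag indices
    return forest

def forest_predict(forest, row):
    """Equation (6): average the predictions of all trees."""
    return sum(predict_stump(s, row) for s, _ in forest) / len(forest)

def oob_error(forest, X, y):
    """Average over samples of Equation (7), using squared loss."""
    errors = []
    for i, (row, target) in enumerate(zip(X, y)):
        oob_trees = [s for s, in_bag in forest if i not in in_bag]
        if oob_trees:
            losses = [(target - predict_stump(s, row)) ** 2 for s in oob_trees]
            errors.append(sum(losses) / len(oob_trees))
    return sum(errors) / len(errors) if errors else float("nan")

# Toy data: the target jumps from 1.0 to 5.0 when the first feature reaches 6.
X = [(float(i), float(i % 3)) for i in range(12)]
y = [1.0 if i < 6 else 5.0 for i in range(12)]
forest = fit_forest(X, y)
pred = forest_predict(forest, X[0])
err = oob_error(forest, X, y)
```

Because a stump has a single node, drawing the feature subset once per tree is equivalent here to drawing it at every node split.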

(3): Extreme Gradient Boosting (XGBoost)

XGBoost is an ensemble learning method based on gradient boosting decision trees (GBDT). It improves model accuracy and robustness by constructing multiple decision tree models and combining their predictions [2]. XGBoost has been shown to provide accurate housing price predictions [34]. Its objective function can be expressed as follows:

$$\mathrm{Obj} = \sum_{i=1}^{N} L[F_{m}(x_{i}), y_{i}] + \sum_{j=1}^{m} \Omega(f_{j}) = \sum_{i=1}^{N} L[F_{m-1}(x_{i}) + f_{m}(x_{i}), y_{i}] + \sum_{j=1}^{m} \Omega(f_{j}) \qquad (8)$$

where $f_{m}(x_{i})$ represents the sub-model at the current step, while $F_{m-1}(x_{i})$ denotes the combined prediction of the $m-1$ previously trained and fixed sub-models. Based on the boosting additive model, each iteration optimizes only the sub-model in the current step. Additionally, XGBoost employs a second-order Taylor expansion to approximate the loss function, as follows:

$$\mathrm{Obj} = \sum_{i=1}^{N} \left[ L[F_{m-1}(x_{i}), y_{i}] + \frac{\partial L}{\partial F_{m-1}(x_{i})} f_{m}(x_{i}) + \frac{1}{2} \frac{\partial^{2} L}{\partial F_{m-1}(x_{i})^{2}} f_{m}^{2}(x_{i}) \right] + \sum_{j=1}^{m} \Omega(f_{j}) \qquad (9)$$

where $L$ represents the loss function, while $\frac{\partial L}{\partial F_{m-1}(x_{i})}$ and $\frac{\partial^{2} L}{\partial F_{m-1}(x_{i})^{2}}$ correspond to its first-order and second-order derivatives, and $\Omega(f_{j})$ denotes the regularization term, defined as follows:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \|w\|^{2} \qquad (10)$$

where $T$ represents the number of leaf nodes in tree $f$, $w$ denotes the vector of leaf node output regression values, and $\gamma$ and $\lambda$ are regularization parameters.
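As a small numerical check of Equations (9) and (10), the sketch below evaluates the second-order approximation for one candidate tree and compares it with the exact objective. It assumes a squared-error loss, for which the second-order expansion is exact, and it keeps only the regularization term of the tree being added because the terms of earlier trees are constants at this step. All names and numbers are hypothetical.

```python
def squared_loss(pred, y):
    return (pred - y) ** 2

def grad_hess_squared_loss(pred, y):
    """First and second derivatives of (pred - y)^2 with respect to pred."""
    return 2.0 * (pred - y), 2.0

def regularizer(leaf_weights, gamma, lam):
    """Equation (10): gamma * T + 0.5 * lambda * ||w||^2."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)

def taylor_objective(prev_preds, tree_outputs, ys, leaf_weights, gamma, lam):
    """Equation (9) for the tree being added (earlier regularizers omitted)."""
    total = 0.0
    for f_prev, f_m, y in zip(prev_preds, tree_outputs, ys):
        g, h = grad_hess_squared_loss(f_prev, y)
        total += squared_loss(f_prev, y) + g * f_m + 0.5 * h * f_m ** 2
    return total + regularizer(leaf_weights, gamma, lam)

def exact_objective(prev_preds, tree_outputs, ys, leaf_weights, gamma, lam):
    """Equation (8) for the tree being added (earlier regularizers omitted)."""
    total = sum(squared_loss(f_prev + f_m, y)
                for f_prev, f_m, y in zip(prev_preds, tree_outputs, ys))
    return total + regularizer(leaf_weights, gamma, lam)

# Hypothetical state after m - 1 trees, plus a candidate tree with two leaves.
ys = [3.0, 5.0, 4.0]
prev_preds = [2.5, 4.0, 4.5]          # F_{m-1}(x_i)
leaf_weights = [0.4, -0.2]            # w, one value per leaf
tree_outputs = [0.4, 0.4, -0.2]       # f_m(x_i): the leaf value each sample falls into

approx = taylor_objective(prev_preds, tree_outputs, ys, leaf_weights, gamma=1.0, lam=1.0)
exact = exact_objective(prev_preds, tree_outputs, ys, leaf_weights, gamma=1.0, lam=1.0)
```

For losses other than the squared error, the two values would generally differ slightly, since Equation (9) is then only an approximation.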

(4): Explainable Artificial Intelligence (XAI)

Although machine learning methods such as XGBoost can accurately predict real estate markets in different cities [30,31,32,33,34], they suffer from the “black-box” problem compared to linear methods, making it difficult to interpret which factors have significant influences on housing prices. To further analyze the factors affecting urban real estate markets, this study employs the SHAP (Shapley additive explanations) method for interpretation and analysis [35]. SHAP is rooted in cooperative game theory and decomposes the prediction for each sample into per-feature Shapley values, where the Shapley value allocated to each feature within the sample reflects the magnitude and direction of its impact on real estate market growth.

Assume that the $i$-th sample is $x_{i}$, the $j$-th influencing factor of $x_{i}$ is $x_{ij}$, and the predicted value of the urban real estate market using the machine learning method is $y_{i}$. The reference value for the model prediction is $y_{0}$. According to the SHAP value interpretability method, the predicted value, $y_{i}$, of the machine learning method for the training sample can be decomposed as follows:

$$y_{i} = y_{0} + \sum_{j=1}^{n} f(x_{ij}) \qquad (11)$$

where $f(x_{ij})$ represents the Shapley value of $x_{ij}$, indicating the contribution of feature $j$ to the final predicted value, $y_{i}$, in sample $i$. When $\sum_{j=1}^{n} f(x_{ij}) < 0$ $(> 0)$ holds, this suggests that all influencing factors collectively drive the contraction (or expansion) of the urban real estate market. The Shapley value can be calculated using Equation (12), as follows:

$$f(x_{ij}) = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|! \, (n - |S| - 1)!}{n!} \left[ f(x_{i}^{S \cup \{j\}}) - f(x_{i}^{S}) \right] \qquad (12)$$

where $F$ represents the complete set of $n$ influencing factors for the urban real estate market, and $S$ denotes a feature subset that excludes factor $j$. Selecting $|S|$ factors other than $j$ from the full set, $F$, yields $\frac{(n-1)!}{|S|! \, (n - |S| - 1)!}$ possible combinations, which serve as the basis for constructing the sample, $x_{i}^{S}$, used for simulation and computation. $f(x_{i}^{S \cup \{j\}}) - f(x_{i}^{S})$ represents the change in the predicted outcome of sample $x_{i}$ when the observed value of feature $j$ in sample $x_{i}$ is randomly replaced. By constructing different combinations of variables, $S$, and the corresponding samples, $x_{i}^{S \cup \{j\}}$ and $x_{i}^{S}$, the contribution of factor $j$ to the predicted value, $y_{i}$, of the real estate market can be approximated.
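To make Equations (11) and (12) concrete, the following sketch computes exact Shapley values for a three-feature toy model by enumerating every subset $S$. It assumes that features outside $S$ are set to a fixed baseline value, which is one common way to construct $x_{i}^{S}$ but is not necessarily the replacement scheme used in the study. The toy model, its feature names, and all values are hypothetical; practical SHAP implementations rely on faster approximations.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values for one sample, following Equation (12).

    Features outside the subset S take their baseline value; this masking
    rule is an assumption of the sketch.
    """
    n = len(x)

    def value(subset):
        masked = [x[k] if k in subset else baseline[k] for k in range(n)]
        return model(masked)

    phis = []
    for j in range(n):
        others = [k for k in range(n) if k != j]
        phi = 0.0
        for size in range(n):  # |S| ranges over 0 .. n - 1
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for S in combinations(others, size):
                phi += weight * (value(set(S) | {j}) - value(set(S)))
        phis.append(phi)
    return phis

def toy_model(features):
    """Hypothetical predictor with one interaction term."""
    supply, density, rate = features
    return 2.0 * supply + 3.0 * density - 4.0 * rate + supply * density

x = [1.0, 2.0, 0.5]
baseline = [0.0, 0.0, 0.0]

phi = shapley_values(toy_model, x, baseline)
y0 = toy_model(baseline)   # reference value y_0
yi = toy_model(x)          # prediction y_i
```

In this example the contributions sum to `yi - y0`, as Equation (11) requires, and the interaction term `supply * density` is split equally between the two features involved.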

3.2. Market Forecasting Indicator System

Referring to existing studies [2], this paper categorizes the factors influencing the real-estate enterprise market into the following four dimensions: supply, demand, policy, and expectations. Based on this classification, an indicator system for urban real-estate enterprise markets and their influencing factors is constructed, as shown in Table 1. Among these factors, supply and demand serve as fundamental determinants of market price and scale, while policy and expectation factors regulate supply–demand dynamics, thereby affecting the overall market size of real-estate enterprises.

(1): Supply Factors

From a supply-side perspective, the funding available to real estate developers [5] and the supply of housing space [36] are the primary determinants of real estate market supply. Considering both developers’ supply behavior and data availability, this study employs indicators such as real estate development investment, floor space under construction, completed floor space, and land acquisition area to assess the impact of supply factors on real-estate enterprise markets in China’s four first-tier cities.

(2): Demand Factors

A city’s population size, age structure, income level, and public service provision all influence the potential market for real-estate enterprises. First, population size [37] and age structure [38] are key determinants of urban housing prices. Therefore, this study measures population size using the total number of permanent residents and population density, while the natural growth rate of the permanent population serves as an indicator of the city’s demographic structure. Second, household-income levels and savings balances affect housing demand and, consequently, the scale of a real-estate enterprise market. Following prior research [37,39], this study adopts per capita GDP, per capita disposable income, and household savings balance as indicators to evaluate the impact of income and savings on the urban real estate market. Additionally, since the real estate industry serves both investment and residential purposes, urban public services—an essential component of real estate’s added value—exert a certain influence on the market size of real-estate enterprises [40]. Drawing on previous studies [41,42], this study incorporates indicators such as education expenditure, the number of hospital beds, urban road area, park green space area, and urban green coverage rate to assess the level of public services across education, healthcare, transportation, and environmental dimensions.

(3): Policy Factors

Policy factors primarily include monetary and fiscal policies. Monetary policy influences homebuyers’ affordability and effectively regulates housing prices by affecting the availability and cost of personal housing loans [43]. Therefore, the national benchmark interest rate for loans over five years—serving as a reference for urban mortgage rates—is used to measure the impact of central bank monetary policy on the scale of the real-estate enterprise market. Moreover, real estate development depends on fiscal policy support at the local level, where government fiscal expenditures may drive real estate capitalization [44]. In line with previous studies [2], this study employs general government fiscal revenue, general government fiscal expenditure, and the general fiscal deficit ratio (i.e., general government fiscal revenue divided by the general government fiscal expenditure) to assess the role of fiscal policy in urban real estate markets.

(4): Expectation Factors

Real estate price expectations influence the behavior of both firms and consumers, making them a critical determinant of real estate market dynamics [44]. Given real estate’s nature as a fixed-asset investment with long production cycles and low liquidity, it is significantly affected by investor expectations. The existing literature classifies real estate price expectation measurement methods into the following two categories: adaptive expectations [45] and survey-based expectations [46]. Compared to the latter, the former offers higher data availability. Accordingly, following prior research [2], this study employs the lagged one-period housing price growth rate as a proxy for housing price expectations.
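The adaptive-expectation proxy described above can be sketched in a few lines. This is a minimal illustration, assuming annual prices for one city in chronological order; the function name and the sample prices are hypothetical and do not come from the study's data.

```python
from typing import List, Optional


def lagged_growth_expectation(prices: List[float]) -> List[Optional[float]]:
    """Return the one-period-lagged price growth rate for each period.

    growth[t] = (p[t] - p[t-1]) / p[t-1]
    expectation[t] = growth[t-1]  (adaptive expectation proxy)
    Periods without enough history are returned as None.
    """
    growth: List[Optional[float]] = [None]  # no growth rate for the first period
    for prev, curr in zip(prices, prices[1:]):
        growth.append((curr - prev) / prev)
    # Shift by one period so that period t sees the growth observed in t-1.
    return [None] + growth[:-1]


# Hypothetical average sales prices (CNY per square metre) for four years.
example_prices = [10000.0, 11000.0, 12100.0, 11495.0]
expectation = lagged_growth_expectation(example_prices)
print(expectation)
```

The first two periods have no usable expectation value, so a model trained on this feature would either drop those rows or fill them in the same way as other missing values.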

3.3. Model Evaluation Metrics

Based on existing studies [20,47,48,49], we selected a comprehensive set of evaluation metrics to assess the performance of the market forecasting model for real estate companies, as shown in Formulas (13)–(18). These include the root mean squared error (RMSE) and its percentage form %RMSE, the mean absolute error (MAE) and its percentage form %MAE, model fit (R^{2}), and the directional evaluation metric, D_{stat}.

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}}

(13)

\%RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}} / \bar{y}

(14)

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_{i} - \hat{y}_{i}|

(15)

\%MAE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_{i} - \hat{y}_{i}}{y_{i}} \right|

(16)

R^{2} = 1 - \frac{\sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i=1}^{n} (y_{i} - \bar{y})^{2}}

(17)

D_{stat} = \frac{1}{n} \sum_{i=1}^{n} a_{i}, \quad a_{i} = \begin{cases} 1, & (y_{i+1} - y_{i})(\hat{y}_{i+1} - \hat{y}_{i}) \geq 0 \\ 0, & \text{otherwise} \end{cases}

(18)

where n represents the total number of observed housing price data points in each city; y_{i} and \hat{y}_{i} denote the actual and predicted housing prices, respectively, for the i-th observation in each city; and \bar{y} refers to the average housing price during the observation period in each city.
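As an illustration, Formulas (13)–(18) can be computed with the standard library alone. The sketch below is not the authors' code. One assumption is made explicit: Formula (18) writes the average over n terms, but only n − 1 consecutive pairs exist, so the sketch divides by n − 1. The function name and the sample values are hypothetical.

```python
import math
from typing import Dict, Sequence


def evaluate_forecast(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Compute RMSE, %RMSE, MAE, %MAE, R^2, and D_stat for one series."""
    n = len(y_true)
    if n != len(y_pred) or n < 2:
        raise ValueError("need two equally long series with at least 2 points")

    mean_y = sum(y_true) / n
    errors = [a - p for a, p in zip(y_true, y_pred)]

    ss_res = sum(e * e for e in errors)                # residual sum of squares
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)    # total sum of squares

    rmse = math.sqrt(ss_res / n)                       # Formula (13)
    mae = sum(abs(e) for e in errors) / n              # Formula (15)
    pct_mae = sum(abs(e / a) for e, a in zip(errors, y_true)) / n  # Formula (16)

    # Directional accuracy: share of consecutive steps where the predicted
    # change has the same sign as the actual change (assumption: n - 1 pairs).
    hits = sum(
        1
        for i in range(n - 1)
        if (y_true[i + 1] - y_true[i]) * (y_pred[i + 1] - y_pred[i]) >= 0
    )

    return {
        "RMSE": rmse,
        "%RMSE": rmse / mean_y,                        # Formula (14)
        "MAE": mae,
        "%MAE": pct_mae,
        "R2": 1.0 - ss_res / ss_tot,                   # Formula (17)
        "D_stat": hits / (n - 1),                      # Formula (18)
    }


# Hypothetical actual and predicted values for four observations.
metrics = evaluate_forecast([100.0, 110.0, 120.0, 115.0], [102.0, 108.0, 123.0, 118.0])
print(metrics)
```

Note that %MAE as defined in Formula (16) divides each error by the actual value, so it is undefined when any y_{i} is zero; the sketch does not guard against that case.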

4. Explainable Machine Learning Model Construction

4.1. Data Collection and Preprocessing

The empirical analysis in this study utilizes real estate market data from the period 2003–2022, published by the National Bureau of Statistics, for the super-first-tier cities. These data are used for a preliminary validation of the research framework for real estate market forecasting models. The market scale and potential are primarily measured by indicators such as sales revenue of commercial housing [20], sales area [19,50], and average sales price [51]. Descriptive statistics of these indicators are presented in Table 2, with their time trends shown in Figure 2.

The three indicators in Table 2 can partially measure the transaction market scale of real estate in each city. In terms of sales revenue, the city with the lowest commercial housing sales revenue among China’s first-tier cities in 2003 was Shenzhen, with a total of CNY 25,755,952 thousand. By 2022, its sales revenue had increased to CNY 350,284,938 thousand, with an average annual growth rate of 63%. The highest sales revenue for first-tier cities in China was recorded in Shanghai in 2022, reaching CNY 746,747,697.6 thousand, compared to CNY 121,624,152 thousand in 2003, with an average annual growth rate of 25.7%. In terms of sales area, as shown in Figure 2, Shanghai and Beijing had higher sales areas between 2003 and 2007, followed by slight annual fluctuations and a gradual decline. Shenzhen and Guangzhou maintained relatively stable sales areas, with Guangzhou showing a slight increase in recent years. Regarding average sales price, all cities showed an upward trend, with Shenzhen experiencing the fastest price increase and Guangzhou the slowest. Meanwhile, in 2022, housing prices in Beijing, Guangzhou, and Shenzhen declined to some degree.
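The arithmetic behind the growth rates quoted above can be checked directly. The reported 63% and 25.7% appear consistent with total growth divided by a 20-year span (a simple, non-compounded average); this is an inference from the figures, not a method stated in the text. The sketch below contrasts that with a compound annual growth rate over the 19 intervals between 2003 and 2022. Function and variable names are illustrative.

```python
def simple_average_growth(start: float, end: float, years: int) -> float:
    """Total relative growth spread evenly over `years` (no compounding)."""
    return (end / start - 1.0) / years


def compound_annual_growth(start: float, end: float, periods: int) -> float:
    """Constant yearly rate that turns `start` into `end` after `periods` steps."""
    return (end / start) ** (1.0 / periods) - 1.0


# Sales revenue in CNY thousand, 2003 and 2022, as quoted in the text.
shenzhen = (25_755_952, 350_284_938)
shanghai = (121_624_152, 746_747_697.6)

simple_sz = simple_average_growth(*shenzhen, years=20)
simple_sh = simple_average_growth(*shanghai, years=20)
cagr_sz = compound_annual_growth(*shenzhen, periods=19)
cagr_sh = compound_annual_growth(*shanghai, periods=19)

print(f"Shenzhen: simple {simple_sz:.1%}, compound {cagr_sz:.1%}")
print(f"Shanghai: simple {simple_sh:.1%}, compound {cagr_sh:.1%}")
```

The compound rates are much lower than the simple averages, so the two definitions should not be compared with each other when reading growth figures across studies.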

4.2. Model Training and Evaluation

(1): Data Preprocessing

The initially collected data contained some missing values, each of which was filled with the mean of the non-missing values from the adjacent preceding and succeeding years, which keeps the imputed values consistent with the local trend. Additionally, to eliminate the effect of differing measurement scales across indicators, the data underwent range (min-max) normalization, as shown in Formula (19).

X_{norm} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}

(19)

where X represents the original data, X_{\min} is the minimum value of the feature, and X_{\max} is the maximum value of the feature. This transformation method preserves the original distribution of the data while standardizing the scale across different features.
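The two preprocessing steps can be sketched as follows. This is a minimal illustration under stated assumptions: each feature is a yearly series, missing values are represented as None, and a value whose two neighbours are both missing is left unfilled. The function names and sample values are hypothetical.

```python
from typing import List, Optional


def fill_with_neighbor_mean(series: List[Optional[float]]) -> List[Optional[float]]:
    """Fill each missing value with the mean of its non-missing adjacent years."""
    filled = list(series)
    for i, value in enumerate(series):
        if value is None:
            neighbors = [
                series[j]
                for j in (i - 1, i + 1)
                if 0 <= j < len(series) and series[j] is not None
            ]
            if neighbors:  # stays None if neither neighbor is available
                filled[i] = sum(neighbors) / len(neighbors)
    return filled


def min_max_normalize(values: List[float]) -> List[float]:
    """Apply Formula (19): (X - X_min) / (X_max - X_min), mapping to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical yearly indicator with one missing year.
raw = [10.0, None, 14.0, 20.0]
filled = fill_with_neighbor_mean(raw)
normalized = min_max_normalize(filled)
print(filled, normalized)
```

If the minimum and maximum are taken over the full dataset, information from the test years enters the training features, so in practice they are usually computed on the training portion only and then reused for the test portion.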

(2): Model Construction

To enhance the objectivity of the model selection and effectively prevent performance evaluation inaccuracies caused by data leakage, we used 75% of the dataset as the training set and 25% as the test set. Based on the training data, we constructed real estate market forecasting models for real estate companies, including LR, SVR, RF, and XGBoost. During the training process, K-fold cross-validation (Formula (20)) was employed to measure various model metrics, thereby reducing the uncertainty and overfitting in model construction.

avgMetric_{i} = \frac{1}{K} \sum_{k=1}^{K} Metric_{i}^{k}

(20)

where Metric_{i} represents the i-th metric used for the model evaluation, and Metric_{i}^{k} is its value on the k-th fold; K denotes the number of equal-sized subsets into which the entire dataset is randomly divided, denoted as D_{1}, D_{2}, \dots, D_{K}. For each k from 1 to K, D_{k} is used as the test set and D_{1} \cup D_{2} \cup \dots \cup D_{k-1} \cup D_{k+1} \cup \dots \cup D_{K} as the training set. Through this approach, K-fold cross-validation provides a more robust evaluation of the model’s generalization ability, reducing performance evaluation bias caused by different data splits.
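The hold-out split and the fold averaging in Formula (20) can be sketched without any library. The example below is an illustration only: it uses a one-feature least-squares line as a stand-in for the LR, SVR, RF, and XGBoost models, applies the folds to the 75% training portion (an assumption), and runs on synthetic data. All names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def train_test_split_indices(n: int, test_fraction: float = 0.25, seed: int = 0):
    """Shuffle row indices and hold out `test_fraction` of them as the test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def k_fold_indices(indices: Sequence[int], k: int) -> List[List[int]]:
    """Split the given indices into k folds of (nearly) equal size."""
    return [list(indices[i::k]) for i in range(k)]


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept (stand-in model)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true))


def cross_validated_rmse(xs, ys, train_idx: Sequence[int], k: int) -> float:
    """Formula (20): average the fold-level metric over the K held-out folds."""
    folds = k_fold_indices(train_idx, k)
    scores = []
    for held_out in folds:
        fit_idx = [i for fold in folds if fold is not held_out for i in fold]
        slope, intercept = fit_line([xs[i] for i in fit_idx], [ys[i] for i in fit_idx])
        preds = [slope * xs[i] + intercept for i in held_out]
        scores.append(rmse([ys[i] for i in held_out], preds))
    return sum(scores) / k


# Synthetic, noise-free data: 20 observations on an exact line.
xs = [float(i) for i in range(20)]
ys = [3.0 * x + 5.0 for x in xs]
train_idx, test_idx = train_test_split_indices(len(xs))
avg_rmse = cross_validated_rmse(xs, ys, train_idx, k=5)
print(len(train_idx), len(test_idx), avg_rmse)
```

The test indices are never touched during cross-validation, which is what keeps the final test-set evaluation free of leakage.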

To further reduce overfitting during model training and improve the practical value of the model, we repeated the model construction task 100 times, changing the composition of the cross-validation dataset each time. The performance evaluation results of the models are shown in Table 3. In addition, we conducted Friedman’s test [52] on the performance differences among the models; the results are shown in Table 4. During the training phase, each model used the default hyperparameters from scikit-learn (version 1.3.2, Python 3.8).
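For readers unfamiliar with Friedman’s test, the statistic can be computed by ranking the models within each repetition and comparing their rank sums. The sketch below implements the standard chi-square form of the statistic without tie correction; the scores are made-up illustrative values, not results from Table 3, and the function names are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 (smallest) upward; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos..end
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def friedman_statistic(scores: Sequence[Sequence[float]]) -> float:
    """Friedman chi-square for n repetitions (rows) and k models (columns)."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    sum_sq = sum(r * r for r in rank_sums)
    return 12.0 * sum_sq / (n * k * (k + 1)) - 3.0 * n * (k + 1)


# Hypothetical RMSE values: 5 repetitions x 4 models (LR, SVR, RF, XGBoost).
example_scores = [
    [0.30, 0.10, 0.20, 0.25],
    [0.32, 0.12, 0.18, 0.26],
    [0.29, 0.11, 0.21, 0.24],
    [0.31, 0.09, 0.19, 0.27],
    [0.33, 0.13, 0.22, 0.28],
]
statistic = friedman_statistic(example_scores)
print(statistic)
```

Under the null hypothesis of no difference among models, the statistic is compared against a chi-square distribution with k − 1 degrees of freedom. SciPy provides the same test, including the p-value, as `scipy.stats.friedmanchisquare`.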

It can be observed that the SVR performs well in forecasting commercial housing sales area and sales price, which is attributed to its suitability for datasets with a small number of samples and a large number of features. However, the RF model achieved the best performance in predicting the commercial housing sales amount, possibly due to the more complex nonlinear interaction with its influencing factors compared to sales area and sales price. Therefore, we selected the SVR and RF to construct the final market forecasting models for real-estate enterprises.

Bayesian optimization is commonly used for hyperparameter tuning in machine learning models [53,54]. Compared to grid search, it typically converges faster. Therefore, based on cross-validation in the training set, we employed Bayesian optimization to fine-tune the hyperparameters of the baseline SVR and RF models. The model performance evaluation results on the test set before and after hyperparameter tuning are shown in Table 5. The search range for the SVR hyperparameters is {“C”: (0.01, 100), “gamma”: (1e−6, 1), “kernel”: [“rbf”, “linear”, “poly”, “sigmoid”]}, and the search range for the RF model is {“n_estimators”: (50, 500), “max_depth”: (2, 6), “min_samples_split”: (2, 10), “min_samples_leaf”: (1, 6)}. After optimization, the model’s performance on the test set improved, effectively enhancing its generalization capability. This demonstrates that the model can serve as a valuable reference for forecasting the real estate market size in China’s first-tier cities.
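The search ranges above can be written down as plain data. The sketch below only encodes the two spaces and draws random candidates from them; it does not implement Bayesian optimization, which would replace the random proposal step with a surrogate-model-guided one. Sampling C and gamma on a logarithmic scale and treating the RF ranges as inclusive integers are assumptions, and the helper names are hypothetical.

```python
import math
import random
from typing import Any, Dict

# Search ranges as stated in the text (parameter names follow scikit-learn).
SVR_SPACE = {
    "C": (0.01, 100.0),
    "gamma": (1e-6, 1.0),
    "kernel": ["rbf", "linear", "poly", "sigmoid"],
}
RF_SPACE = {
    "n_estimators": (50, 500),
    "max_depth": (2, 6),
    "min_samples_split": (2, 10),
    "min_samples_leaf": (1, 6),
}


def log_uniform(rng: random.Random, lo: float, hi: float) -> float:
    """Draw a value whose logarithm is uniform on [log10(lo), log10(hi)]."""
    value = 10 ** rng.uniform(math.log10(lo), math.log10(hi))
    return min(max(value, lo), hi)  # clamp against floating-point overshoot


def sample_svr(rng: random.Random) -> Dict[str, Any]:
    """One random SVR candidate (assumption: log scale for C and gamma)."""
    return {
        "C": log_uniform(rng, *SVR_SPACE["C"]),
        "gamma": log_uniform(rng, *SVR_SPACE["gamma"]),
        "kernel": rng.choice(SVR_SPACE["kernel"]),
    }


def sample_rf(rng: random.Random) -> Dict[str, int]:
    """One random RF candidate (assumption: integer bounds are inclusive)."""
    return {name: rng.randint(lo, hi) for name, (lo, hi) in RF_SPACE.items()}


rng = random.Random(42)
svr_candidates = [sample_svr(rng) for _ in range(50)]
rf_candidates = [sample_rf(rng) for _ in range(50)]
print(svr_candidates[0], rf_candidates[0])
```

Each candidate would then be scored by cross-validation on the training set, and the best-scoring configuration evaluated once on the held-out test set.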

4.3. Analysis of Key Influencing Factors

Based on the results of the model selection, we ultimately constructed a market forecasting model for real-estate enterprises using SVR. To further analyze the impact mechanisms of various influencing factors on the market, we employed the SHAP interpretability method to visualize the contribution of each feature. Taking the 2022 real estate market size in Beijing as a case study, we utilized SHAP to generate an individual force plot illustrating the impact of different factors, as shown in Figure 3.
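A force plot rests on the additivity property of SHAP values: the base value plus the per-feature contributions equals the model's prediction for that observation. The sketch below computes exact Shapley values by enumerating feature subsets for a toy three-feature model, with absent features replaced by values from a background sample. It is not the estimator the shap library would use for an SVR, and the toy model, feature meanings, and numbers are all hypothetical.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence, Tuple


def shapley_values(
    model: Callable[[Sequence[float]], float],
    x: Sequence[float],
    background: Sequence[Sequence[float]],
) -> Tuple[float, List[float]]:
    """Return (base value, per-feature Shapley values) for one observation x."""
    n = len(x)

    def value(subset: frozenset) -> float:
        # Features in `subset` take the value from x; the rest are averaged
        # over the background rows.
        total = 0.0
        for row in background:
            mixed = [x[j] if j in subset else row[j] for j in range(n)]
            total += model(mixed)
        return total / len(background)

    phis = []
    for j in range(n):
        others = [i for i in range(n) if i != j]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi += weight * (value(s | {j}) - value(s))
        phis.append(phi)
    return value(frozenset()), phis


def toy_model(z: Sequence[float]) -> float:
    """Hypothetical model: z[0] ~ education expenditure, z[1] ~ loan rate,
    z[2] ~ population density (all min-max normalized)."""
    return 2.0 * z[0] - 1.5 * z[1] + z[0] * z[2]


background_rows = [[0.0, 0.0, 0.0], [0.5, 0.4, 0.2], [1.0, 1.0, 1.0]]
observation = [0.9, 0.1, 0.6]
base_value, contributions = shapley_values(toy_model, observation, background_rows)

# Additivity: base value + sum of contributions reproduces the prediction.
print(base_value + sum(contributions), toy_model(observation))
```

Exact enumeration costs 2^n model evaluations per background row, which is why approximate estimators are used for models with many features.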

(1): Key Factors Influencing Commercial Housing Sales Area

In Figure 3A, the primary positive factors affecting the commercial housing sales area in China’s first-tier cities include the floor area under construction by real estate development enterprises, land acquisition area by real estate developers, the number of hospital beds, and the permanent population. Specifically, better healthcare conditions and a larger permanent population contribute to a more abundant housing supply, thereby increasing the sales area of commercial housing. Conversely, increases in education expenditure, general fiscal budget revenue, general fiscal budget expenditure, and the natural growth rate of the permanent population have a suppressive effect on the expansion of the commercial housing sales area to some extent. This is because first-tier cities in China are rich in educational resources, and purchasing “school district housing” is often seen as a way to secure an advantageous position in educational resource allocation. Additionally, the growth in general fiscal budget revenue and expenditure reflects the developed economic level of first-tier cities. To balance the high housing prices in these cities, buyers may tend to choose smaller housing units, which ultimately affects the total commercial housing sales area. The natural growth rate of the permanent population reflects the age structure of urban populations. A higher natural growth rate implies an increase in the young and middle-aged populations. However, young people today generally have lower fertility intentions and higher expectations for living quality, making them more inclined toward smaller household sizes or even choosing “DINK” (dual-income, no kids) or single-person households, thereby influencing the total sales area of commercial housing.

(2): Key Factors Influencing Commercial Housing Average Sales Price

As shown in Figure 3B, the key factors driving the increase in the average sales price of commercial housing in China’s first-tier cities include population density, education expenditure, time, and per capita disposable income. Over the past two decades, commercial housing prices in the four first-tier cities have exhibited a significant upward trend. Higher education quality, denser populations, and higher income levels enhance buyers’ price tolerance and strengthen housing demand, thereby driving up the average selling price of commercial housing. Meanwhile, the five-year benchmark loan interest rate negatively affects the average selling price of commercial housing in first-tier cities. An increase in loan interest rates raises consumer mortgage costs, making it harder for buyers to afford high-priced housing. Conversely, reducing loan interest rates can increase housing affordability and indirectly raise the selling price of commercial housing.

(3): Key Factors Influencing Commercial Housing Sales Amount

As shown in Figure 3C, the factors that significantly contribute to the growth in commercial housing sales revenue in China’s first-tier cities include educational expenditure, general public budget expenditure, real estate development investment amount, household deposit balance, time, and permanent population. In recent years, commercial housing sales amount in first-tier cities has shown an expanding trend. From a supply perspective, an adequate market supply is a necessary condition for the growth in commercial housing sales revenue. On the demand side, the expansion of educational expenditure and the increase in the permanent population contribute to the expansion of the real estate market. In terms of expectations, the investment attributes of real estate determine that higher price expectations will promote the expansion of the real estate market. In contrast, the five-year benchmark loan interest rate, general fiscal budget expenditure, and the general fiscal deficit rate have a restraining effect on the growth in commercial housing sales revenue. Since real estate transactions involve large capital investments, buyers often rely on mortgage loans. As a result, they must balance loan costs and housing prices. Additionally, increases in general fiscal budget expenditure and deficit rates may indicate that the government is allocating more funds to infrastructure construction, public services, and technological innovation.

(4): Case Study on Influencing Factors

As shown in Figure 3D, various factors contributed to the decline in Beijing’s commercial housing sales area in 2022. Specifically, a high number of hospital beds, a large permanent population, and extensive real estate development activity positively influenced sales area. However, education expenditure, general fiscal budget expenditure, and general fiscal budget revenue negatively impacted sales. Due to Beijing’s advanced economic development and high housing prices, as well as its superior educational resources, consumers may increasingly prefer smaller housing units or opt for affordable rental strategies, leading to an overall reduction in commercial housing sales area. As shown in Figure 3E, the average selling price of commercial housing in Beijing exhibited an upward trend in 2022. The main positive factors included high levels of education expenditure, time, high per capita regional GDP, high general fiscal budget revenue, and low five-year benchmark loan interest rates. However, compared to other first-tier cities, Beijing’s relatively lower population density played a certain suppressive role in price growth. As shown in Figure 3F, the total sales revenue of commercial housing in Beijing showed a slight upward trend in 2022. Key positive factors included time, high educational expenditure, high general public budget expenditure, high real-estate-development investment amount, and high household deposit balance. In contrast, low population density had a certain negative impact on total sales revenue.

5. Conclusions and Discussion

5.1. Research Findings

This study constructs a predictive model for the real estate market in China’s first-tier cities, utilizing data from 2003 to 2022, along with other relevant datasets, through the application of machine learning techniques combined with SHAP interpretability methods. Initially, multiple machine learning models were developed, and the optimal baseline model—support vector regression (SVR)—was selected based on cross-validation performance. Subsequently, Bayesian optimization was employed to fine-tune the model parameters, and the generalization ability of the model was independently validated, enabling accurate forecasting of real estate market trends in China’s first-tier cities.

The findings indicate that the floor area under construction by real estate development enterprises, the land area acquired by these enterprises, the number of hospital beds, and the permanent resident population are the primary positive factors influencing the sales area of commercial housing in China’s first-tier cities. Additionally, permanent population density, education expenditure, time, and per capita disposable income are key determinants that contribute to higher average housing prices. Furthermore, educational expenditure, general public budget expenditure, and real estate development investment amount are identified as significant drivers promoting the growth in total commercial housing sales value in these cities.

5.2. Theoretical Contributions

The innovation and theoretical contribution of this study are mainly reflected in the following four aspects:

Firstly, the study adopts multiple machine learning models combined with SHAP interpretable technology, following a rigorous and complete framework for model construction, selection, validation, and optimization, providing a reliable research framework for real estate market prediction. It can effectively capture the nonlinear relationship and complex interaction effects between various indicators and real estate and provide theoretical support for understanding the logic behind the model prediction results.

Secondly, this study reveals the comprehensive mechanism of the influencing factors in the real estate market. A comprehensive analysis was conducted on the impact of supply, demand, policy, and expectation factors on the real estate market. Previous studies often focused on a single type of factor or a few factors [9,10], whereas this study constructs a comprehensive model that considers the combined effects of multiple factors, which can more accurately reflect the complexity and diversity of the real estate market.

Thirdly, the study decomposes the real estate market into three dimensions, which helps to reveal the differences in the roles of different factors in different market dimensions and provides strong support for building a more comprehensive theoretical system for the real estate market.

Finally, this study fosters interdisciplinary theoretical integration by bridging computer science, data science, and real estate economics. It not only broadens the theoretical frontiers of real estate market research but also offers valuable theoretical insights for emerging interdisciplinary domains such as urban planning and financial technology.

5.3. Implications

The conclusions of this study have important reference value for enterprise decision making and government regulation. The specific implications are as follows:

In the short term, from the supply-side perspective, the land acquisition area, construction area, completed area, and investment volume of real-estate enterprises will influence the stable operation of the real estate market across multiple dimensions [11]. As economically developed urban centers in China, first-tier cities have attracted a substantial influx of migrants due to their advantages in income, healthcare, education, environment, and infrastructure [5,15]. Given the rapid increase in housing prices and the large resident populations in these cities, the government should, in the short term, correspondingly increase the supply of residential land to ensure that the provision of residential land aligns with the population size. This strategy not only helps stabilize land supply expectations but also contributes to curbing the excessive rise in housing prices. The short-term strategy emphasizes the significance of supply-side management, preventing price volatility caused by insufficient housing supply and providing local governments with practical policy tools and approaches to achieve “stable land prices, stable housing prices” in response to the central government’s policy directives of “housing for living, not speculation” and “city-specific policies”.

From the demand-side perspective, in the medium term, it is necessary to address the excessive housing demand in first-tier cities caused by large-scale population inflows [7]. The government can further promote the equalization of basic public services, such as education, healthcare, elderly care, and environmental protection, across different cities, especially narrowing the gap between first-tier cities and surrounding second- and third-tier cities, to reduce the siphoning effect of first-tier cities on population inflows. This approach not only helps alleviate the population pressure in first-tier cities and promotes the coordinated development of surrounding cities but also helps mitigate overheated housing demand and price surges in first-tier cities. Meanwhile, real-estate enterprises can, guided by policies and through field research, accelerate their industrial deployment in the surrounding areas of first-tier cities, offering homebuyers more choices. This not only effectively alleviates the housing pressure in first-tier cities but also expands the real estate markets in neighboring cities, thereby fostering regional economic synergy. The medium-term strategy should focus on the coordination of supply and demand, emphasizing the role of basic public service equalization in guiding real estate demand and contributing to the realization of the “people-centered new urbanization” strategy. For local governments, this provides practical and feasible policy references for establishing a multi-center, multi-level regional development pattern.

In the long term, it is essential to maintain the continuity, consistency, and stability of real estate financial policies to stabilize market expectations. The Chinese government has repeatedly emphasized the need to “stabilize land prices, housing prices, and expectations”. Recent policies, such as the reduction in five-year loan interest rates and first-home loan interest rates, reflect the government’s expectations and determination to ensure the stable operation of the real estate market [13]. Local governments should reduce their reliance on land-based finance, avoid using real estate as the primary tool for local economic growth, and instead increase long-term investments in education, healthcare, elderly care, environmental protection, and infrastructure to achieve the sustainable and healthy development of urban economies. From a long-term perspective, the conclusions of this study not only contribute to maintaining market stability and reducing systemic risks but also highlight the importance of consistency and foresight in macroeconomic regulation. They provide both theoretical foundations and practical guidance for establishing a “virtuous cycle between the real estate market and the national economy.” Moreover, they offer critical reference points for local governments to transform development models, reduce dependence on real estate, and promote high-quality development.

5.4. Limitations and Future Research Directions

The real estate market has a very complex operational mechanism, so our research, like other data-driven prediction models, has limitations. The limitations and future research directions are mainly reflected in the following four aspects: (1) We selected only the four major first-tier cities in China as research objects to verify the effectiveness of the research framework; it should be extended to more types of cities in the future. (2) Although we adopted cross-validation, statistical testing, and an independent test set to reduce the risk of model overfitting, the dataset used is relatively small because of limitations in data availability and timeliness. Future work should expand the range of research subjects and time periods to reduce the overfitting risk at the data level. (3) Many indicators affect the real estate market. Although we have comprehensively measured supply, demand, policies, and expectations, some omissions are inevitable. More reasonable and effective indicators should be included in the future to build a more complete indicator system for real estate market forecasting. (4) The models used in this study are relatively basic and do not yet combine the spatial location characteristics of cities with the time series characteristics of the real estate market; this is left for future research.

Author Contributions

Conceptualization, D.S., Z.W., H.Z. and H.L.; methodology, D.S., G.H. and H.L.; formal analysis, D.S. and G.H.; writing—original draft preparation, D.S., G.H., H.L. and Y.L.; data curation, G.H.; software, G.H.; validation, G.H., Z.W. and H.Z.; visualization, G.H.; writing—review and editing, G.H., H.L., Y.L., Z.W. and H.Z.; supervision and project administration, Z.W. and H.Z.; funding acquisition, Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by The Project of Cultivation for Young Top-Notch Talents of Beijing Municipal Institution (no. BPHR202203237).

Data Availability Statement

The research data are not publicly available in the paper at this time. If needed, please feel free to contact the authors via email to request access.

Acknowledgments

The authors would like to thank the anonymous reviewers for their reviews and comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Chen, X.; Chen, Y. Evaluation and Improvement Strategies of the Long-Term Mechanism for Real Estate: A Perspective of the “Three-in-One” Macro Policy Approach. Stud. Explor. 2022, 8, 99–112.
Chen, X.; Cheng, S.; Chen, K.; Xiao, Z.Y. A Study on the Influencing Factors of Housing Prices in First-Tier Cities Based on Machine Learning Methods. Nankai J. Philos. Soc. Sci. 2023, 6, 146–163.
Gong, J.; Zheng, T. Investor Attention and Intercity Housing Price Spillover Effects: A Study Based on Baidu Search Index Between Pair Cities. J. Financ. Econ. 2023, 6, 55–70.
Liu, S.; Chen, M. Study on the Transmission Effect of Chinese Urban Housing Price Fluctuation Diffusion Level. Reg. Res. Dev. 2021, 40, 45–50.
Cui, Z.; Zhou, M.; Kong, L. Study on the Heterogeneity of Influencing Factors of Urban Housing Prices in China. Tax Econ. 2022, 6, 65–74.
Paulus, N.M.; Lautenschlaeger, L.; Schaefers, W. Social Media and Real Estate: Do Twitter Users Predict REIT Performance? J. Real Estate Res. 2024, 1, 1–34.
Alhefnawei, A.M.M.; et al. Population Modeling and Housing Demand Prediction for the Saudi 2030 Vision: A Case Study of Riyadh City. Int. J. Hous. Mark. Anal. 2024, 17, 1558–1572.
Wang, X.-R.; Hui, E.C.-M.; Sun, J.-X. Population Migration, Urbanization and Housing Prices: Evidence from the Cities in China. Habitat Int. 2017, 66, 49–56.
Du, H.; Xu, M.; Wang, Y.; Chen, L. Housing wellbeing and settlement intentions of skilled migrants in China: The effects of subjective housing feelings and objective housing outcomes. Appl. Spat. Anal. Policy 2024, 17, 983–1015.
Long, J.; Cui, C.; Kohl, S.; Yang, Y. The ladder of prosperity: An analysis of housing wealth accumulation across income groups in urban China. China Econ. Rev. 2025, 92, 102428.
Lin, X.; Lü, P. Research on the Spatial Correlation and Influencing Factors of Housing Prices in the Beijing-Tianjin-Hebei Urban Agglomeration Based on the Spatial Durbin Model. Econ. Prob. Explor. 2021, 1, 79–90.
Shen, S.; Zhao, Y.; Pang, J. Local Housing Market Sentiments and Returns: Evidence from China. J. Real Estate Financ. Econ. 2024, 68, 488–522.
Cheng, Y.; Li, H.; Dai, Y.; Xu, Y.F. Stock Market Volatility and Urban Housing Prices: An Empirical Analysis Based on Monthly Panel Data from 2011 to 2017. Econ. Sci. 2025, 1, 27–47.
Lu, J.; Liu, F.; Hua, Y. Monetary Policy, Stock Prices, and Housing Price Volatility. Stat. Inf. Forum 2023, 38, 81–94.
Wu, W.; Zhou, A. Causes of Housing Price Fluctuations: Demand Rigidity or Land Finance? Shandong Soc. Sci. 2021, 3, 126–132.
Li, Z.; Zhang, H. Cost Push, Demand Pull: What Has Driven the Increase in Housing Prices in China? China Manag. Sci. 2015, 23, 143–150.
Wu, J.; Li, H.; Hu, B. The Impact of Land Cost on the Price of Newly Built Commercial Housing. Price Theor. Pract. 2015, 9, 52–54.
Melecky, A.; Paksi, D. Drivers of European Housing Prices in the New Millennium: Demand, Financial, and Supply Determinants. Empirica 2024, 51, 731–753.
Chen, S.; Zhu, W. The Impact of Sheltered Housing Monetization on the Real Estate Market. Jianghan Forum 2024, 6, 33–42.
Cui, M.; Liu, X.; Li, X.; Dong, J.C. Data-Driven Integrated Prediction of the Real Estate Market. Manag. Rev. 2020, 32, 89–101.
Yakub, A.A.; et al. An Analysis of the Determinants of Office Real Estate Price Modeling in Nigeria: Using a Delphi Approach. Prop. Manag. 2022, 40, 758–779.
Wang, Q.; Gu, L.; Guo, H. Analysis and Forecast of Housing Price Cycles in First-Tier Cities in China. Price Theor. Pract. 2014, 5, 55–57.
Liu, D.; Wang, W.; Xie, H.; Wang, S.Y.; Lu, F.B. Analysis and Forecast of Influencing Factors of Real Estate Prices in China. Manag. Rev. 2010, 22, 3–10.
Rapach, D.E.; Strauss, J.K. Differences in Housing Price Forecast Ability Across U.S. States. Int. J. Forecast. 2009, 25, 351–372.
Yang, G.; Luo, Y.; Gao, J. Study on Housing Price Forecasting Models in China. Stat. Decis. 2014, 12, 17–20.
Kouwenberg, R.; Zwinkels, R. Forecasting the U.S. Housing Market. Int. J. Forecast. 2014, 30, 415–425.
Liu, C.; Yao, J. Real Estate Price Forecasting Model Based on Multiple Factors. Stat. Decis. 2017, 17, 33–38.
Shi, J.; Xiao, Y. An Evaluation Model for Second-Hand Housing Prices Based on Urban Big Data. China Manag. Sci. 2025, 1, 1–17.
Chen, X.; Chen, K.; Wang, Z.; Xiao, Z.Y. Factors Affecting Housing Price Differentiation Among Cities. Econ. Theor. Econ. Manag. 2024, 44, 49–64.
Habbab, F.Z.; Kampouridis, M. An In-Depth Investigation of Five Machine Learning Algorithms for Optimizing Mixed-Asset Portfolios Including REITs. Expert Syst. Appl. 2024, 235, 121102.
Tang, X.; Zhang, R.; Liu, L. Research on the Forecasting of Beijing’s Second-Hand Housing Prices Based on Bat Algorithm and SVR Model. Stat. Res. 2018, 35, 71–81.
Zhang, W.; Ma, L. Research on Urban Second-Hand Housing Price Evaluation Method: An Analysis of Beijing’s Second-Hand Housing Prices Based on Lasso-GM-RF Combined Model. Price Theor. Pract. 2020, 9, 172–175+180.
Shokoohyar, S.; Sobhani, A.; Sobhani, A. Determinants of Rental Strategy: Short-Term vs Long-Term Rental Strategy. Int. J. Contemp. Hosp. Manag. 2020, 32, 3873–3894.
Guliker, E.; Folmer, E.; van Sinderen, M. Spatial Determinants of Real Estate Appraisals in the Netherlands: A Machine Learning Approach. ISPRS Int. J. Geo-Inf. 2022, 11, 125.
Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. Adv. Neural Inf. Process. Syst. 2017, 30.
Li, L.; Fu, B. Study on Real Estate Bubble Measurement in China Based on Beijing Data. Commer. Res. 2011, 3, 61–67.
Tang, L.Z.; Zhu, J.F.; Luo, J. A New Measurement of the Influencing Factors of Real Estate Prices from the Perspective of Macroeconomic Regulation. Econ. Theor. Econ. Manag. 2014, 1, 102–107.
Li, J.N.; You, W.X.; Sun, P.Y. Does In-Migration Promote Housing Price Increases in Cities? An Empirical Study Based on Chinese Urban Data. Nankai Econ. Stud. 2017, 1, 58–76.
Li, J.L. Study on the Influencing Factors of Housing Price Fluctuations: An Empirical Analysis Based on 2005–2015 Data. Econ. Theor. Econ. Manag. 2017, 9, 30–37.
Gao, Y.; Li, X.T.; Dong, J.C. Why Do Housing Prices Differ in Cities? The Impact of Public Services on Housing Prices. Syst. Eng. Theor. Pract. 2019, 39, 2255–2262.
Tang, Y.G.; Chen, Q.; Man, L.P. Capitalization, Fiscal Incentives, and Local Public Service Provision: An Empirical Analysis Based on 35 Medium and Large Cities in China. Econ. Q. J. 2016, 15, 217–240.
Liu, C.; Yang, J.D. Strategic Land Supply and Housing Price Differentiation. Financ. Res. 2019, 45, 68–82.
Li, F.; Shi, Y.N.; Xu, Z.H.; Zhang, H.Y. Credit and Housing Prices: The Impact of First-Time Home Loan Rates on Housing Prices in China. J. Southwest Univ. (Soc. Sci.) 2022, 43, 132–142.
Shi, Y.N.; Wang, J.; Ye, J.P. The Pathway of Housing Price Increase: A Re-Verification of Fiscal Policy, Population, and Expectations. J. Zhejiang Gongshang Univ. 2021, 2, 94–106.
Gao, B.; Wang, H.L.; Li, W.J. Expectations, Speculation, and Housing Price Bubbles in Chinese Cities. Financ. Res. 2014, 2, 44–58.
Dong, J.C.; He, J.; Li, X.T.; Dong, Z. The Mechanism of Public Housing Price Expectation Formation: An Empirical Study Based on Social Learning Perspective. Manag. Rev. 2020, 32, 34–46.
Hu, L.; He, S.; Han, Z.; Xiao, H.; Su, S.; Weng, M.; Cai, Z. Monitoring Housing Rental Prices Based on Social Media: An Integrated Approach of Machine-Learning Algorithms and Hedonic Modeling to Inform Equitable Housing Policies. Land Use Policy 2019, 82, 657–673.
Rico-Juan, J.R.; de La Paz, P.T. Machine Learning with Explainability or Spatial Hedonics Tools? An Analysis of the Asking Prices in the Housing Market in Alicante, Spain. Expert Syst. Appl. 2021, 171, 114590. [Google Scholar] [CrossRef]
Baur, K.; Rosenfelder, M.; Lutz, B. Automated Real Estate Valuation with Machine Learning Models Using Property Descriptions. Expert Syst. Appl. 2023, 213, 119147. [Google Scholar] [CrossRef]
Yan, X.X.; Xia, E.J. Forecasting the Sales Area of Ordinary Residential Commercial Houses in Beijing. J. Beijing Inst. Technol. (Soc. Sci.) 2002, S1, 88–90. [Google Scholar]
Wang, X.Z.; Shao, J.; Hong, J.K.; Chen, M.K. A Short-Term Housing Price Forecasting Method Integrating Multi-Source Heterogeneous Information and Data Features. Syst. Eng. Theor. Pract. 2025, 1, 1–20. [Google Scholar]
Friedman, M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Am. Stat. Assoc. 1937, 32, 675–701. [Google Scholar] [CrossRef]
Gold, C.; Holub, A.; Sollich, P. Bayesian Approach to Feature Selection and Parameter Tuning for Support Vector Machine Classifiers. Neural Netw. 2005, 18, 693–701. [Google Scholar] [CrossRef] [PubMed]
Zhou, X.J.; Gu, X. Robust Parameter Design Based on Bayesian Support Vector Regression Machine. Stat. Decis. 2023, 39, 23–28. [Google Scholar]

Figure 1. Research framework.

Figure 2. Market size trend of real-estate enterprises in first-tier cities of China.

Figure 3. Analysis of market influence factors for real-estate enterprises. Note: (A) Commercial Housing Sales Area; (B) Commercial Housing Average Sales Price; (C) Commercial Housing Sales Amount; (D) Beijing’s Commercial Housing Sales Area in 2022; (E) Beijing’s Commercial Housing Average Sales Price in 2022; (F) Beijing’s Commercial Housing Sales Amount in 2022.

Table 1. Indicator system of market influence factors for real-estate enterprises.

Factor Category	Indicator Name	Variable	Data Source
Supply Factors	Real estate development investment amount	$X_{1}$	National Bureau of Statistics
	Floor area under construction by real estate developers	$X_{2}$	National Bureau of Statistics
	Completed floor area by real estate developers	$X_{3}$	National Bureau of Statistics
	Land area purchased by real estate developers	$X_{4}$	National Bureau of Statistics
Demand Factors	Permanent population	$X_{5}$	China Urban Statistical Yearbook
	Population density	$X_{6}$	China Urban Statistical Yearbook
	Natural population growth rate	$X_{7}$	China Urban Statistical Yearbook
	Per capita GDP	$X_{8}$	China Urban Statistical Yearbook
	Per capita disposable income	$X_{9}$	China Urban Statistical Yearbook
	Household deposit balance	$X_{10}$	National Bureau of Statistics
	Educational expenditure	$X_{11}$	China Urban Statistical Yearbook
	Number of hospital beds	$X_{12}$	China Urban Statistical Yearbook
	Urban road area	$X_{13}$	China Urban Statistical Yearbook
	Park green space area	$X_{14}$	China Urban Statistical Yearbook
	Urban green coverage ratio	$X_{15}$	China Urban Statistical Yearbook
Policy Factors	General public budget revenue	$X_{16}$	China Urban Statistical Yearbook
	General public budget expenditure	$X_{17}$	China Urban Statistical Yearbook
	General public fiscal deficit ratio	$X_{18}$	Estimated data
	National benchmark interest rate for loans over 5 years	$X_{19}$	People’s Bank of China
Expectation Factors	House price growth rate (lagging one period)	$X_{20}$	Estimated data

Table 2. Descriptive statistics of real estate market indicators.

Real Estate Market Indicator	Unit	Min.	Mean	Max.
Commercial Housing Sales Amount	CNY 10,000	2,575,595.20	27,565,412.08	74,674,769.76
Commercial Housing Sales Area	10,000 sq. m	381.69	1524.10	3694.96
Commercial Housing Average Sales Price	CNY/sq. m	4211.00	20,742.84	58,593.00

Table 3. Cross-validation evaluation results of real estate market forecasting models.

Task	Model	Evaluation Metrics
Task	Model	RMSE	%RMSE	MAE	%MAE	$R^{2}$	$D_{stat}$
Commercial Housing Sales Area	LR	3.5969	1543.16%	0.5674	243.42%	−254.7684	0.8614
	SVR	0.1305	55.97%	0.1003	43.04%	0.6856	0.8498
	RF	0.1418	60.83%	0.1005	43.12%	0.6287	0.8431
	XGBoost	0.1523	65.36%	0.1120	48.06%	0.5706	0.8237
Commercial Housing Average Sales Price	LR	1.4724	546.64%	0.2613	97.01%	−34.5733	0.8115
	SVR	0.0905	33.60%	0.0734	27.27%	0.8868	0.8603
	RF	0.0946	35.11%	0.0677	25.12%	0.8756	0.8214
	XGBoost	0.1141	42.35%	0.0720	26.73%	0.8170	0.8047
Commercial Housing Sales Amount	LR	2.5017	1128.55%	0.4080	184.07%	−139.0169	0.8424
	SVR	0.1063	47.97%	0.0817	36.86%	0.7691	0.8563
	RF	0.1047	47.24%	0.0810	36.54%	0.7762	0.8881
	XGBoost	0.1117	50.39%	0.0865	39.03%	0.7443	0.8712
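
The metrics reported in Table 3 can be illustrated with a minimal sketch. The RMSE, MAE, and $R^{2}$ functions follow their standard definitions. The `pct_of_mean` and `d_stat` functions are assumptions, since this part of the article does not restate the formulas for %RMSE, %MAE, and $D_{stat}$: the percentage variants are taken here as the error divided by the mean observed value, and $D_{stat}$ as the share of steps in which the predicted and actual changes from the previous observation have the same sign. All function names and the toy data are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination; negative when the model is worse than the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pct_of_mean(error, y_true):
    """ASSUMPTION: %RMSE / %MAE expressed as error relative to the mean observation."""
    return 100.0 * error / (sum(y_true) / len(y_true))


def d_stat(y_true, y_pred):
    """ASSUMPTION: fraction of steps whose predicted direction of change is correct."""
    hits = sum(
        1
        for i in range(1, len(y_true))
        if (y_true[i] - y_true[i - 1]) * (y_pred[i] - y_true[i - 1]) >= 0
    )
    return hits / (len(y_true) - 1)


# Hypothetical normalized series: one reasonable forecast, one poor forecast.
y_true = [0.2, 0.3, 0.1, 0.4]
good = [0.25, 0.28, 0.15, 0.35]
bad = [1.0, -0.5, 0.9, -0.4]

for name, pred in (("good", good), ("bad", bad)):
    e = rmse(y_true, pred)
    print(name, round(e, 4), round(pct_of_mean(e, y_true), 2),
          round(mae(y_true, pred), 4), round(r2(y_true, pred), 4),
          d_stat(y_true, pred))
```

The poor forecast shows why the LR rows in Table 3 can carry strongly negative $R^{2}$ values: $R^{2}$ drops below zero whenever a model's squared error exceeds that of simply predicting the mean of the observations.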

Table 4. Results of Friedman’s test for real estate market forecasting models.

Task	Model	Average Rank of Evaluation Metrics
Task	Model	RMSE	%RMSE	MAE	%MAE	$R^{2}$	$D_{stat}$
Commercial Housing Sales Area	LR	4.00	4.00	4.00	4.00	4.00	1.55
	SVR	1.05	1.05	1.56	1.56	1.05	2.24
	RF	2.09	2.09	1.56	1.56	2.09	2.63
	XGBoost	2.86	2.86	2.88	2.88	2.86	3.58
	$χ^{2}$	279.012	279.012	249.696	249.696	279.012	90.589
	p-Value	0.000	0.000	0.000	0.000	0.000	0.000
Commercial Housing Average Sales Price	LR	4.00	4.00	4.00	4.00	4.00	2.69
	SVR	1.46	1.46	2.44	2.44	1.46	1.18
	RF	1.70	1.70	1.39	1.39	1.70	2.67
	XGBoost	2.84	2.84	2.17	2.17	2.84	3.46
	$χ^{2}$	245.232	245.232	215.676	215.676	245.232	142.750
	p-Value	0.000	0.000	0.000	0.000	0.000	0.000
Commercial Housing Sales Amount	LR	4.00	4.00	4.00	4.00	4.00	3.12
	SVR	1.95	1.95	1.81	1.81	1.95	2.86
	RF	1.53	1.53	1.59	1.59	1.53	1.46
	XGBoost	2.52	2.52	2.60	2.60	2.52	2.56
	$χ^{2}$	209.628	209.628	213.852	213.852	209.628	136.782
	p-Value	0.000	0.000	0.000	0.000	0.000	0.000
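
As a rough check on how the $χ^{2}$ rows relate to the average ranks, the sketch below applies the textbook Friedman statistic, $χ^{2} = \frac{12N}{k(k+1)}\left[\sum_{j} \bar{R}_{j}^{2} - \frac{k(k+1)^{2}}{4}\right]$, to the RMSE column for the $k = 4$ models. The number of paired evaluations $N = 100$ is an assumption: it is not stated in this part of the article and is chosen because it is consistent with the reported values. The formula omits any tie correction, and the function names are hypothetical.

```python
import math


def friedman_chi2(avg_ranks, n_blocks):
    """Friedman chi-square from average ranks (no tie correction)."""
    k = len(avg_ranks)
    spread = sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4
    return 12 * n_blocks / (k * (k + 1)) * spread


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom
    (closed form; valid only for k = 4 models, i.e. k - 1 = 3)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Average RMSE ranks from Table 4, in the order LR, SVR, RF, XGBoost.
rmse_ranks = {
    "Commercial Housing Sales Area": [4.00, 1.05, 2.09, 2.86],
    "Commercial Housing Average Sales Price": [4.00, 1.46, 1.70, 2.84],
    "Commercial Housing Sales Amount": [4.00, 1.95, 1.53, 2.52],
}

N_BLOCKS = 100  # ASSUMPTION: number of paired evaluations, inferred rather than reported

for task, ranks in rmse_ranks.items():
    stat = friedman_chi2(ranks, N_BLOCKS)
    print(f"{task}: chi2 = {stat:.3f}, p = {chi2_sf_3df(stat):.3f}")
```

Under this assumed $N$, the three statistics come out at 279.012, 245.232, and 209.628, matching the RMSE column, and each p-value rounds to 0.000.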

Table 5. Model performance evaluation results on the test set.

Task	Metrics						Hyperparameters
Task	RMSE	%RMSE	MAE	%MAE	$R^{2}$	$D_{stat}$	C	Gamma	Kernel
Commercial Housing Sales Area	0.1084	46.60%	0.0863	37.10%	0.7827	0.8421	1.0	“scale”	“rbf”
Commercial Housing Sales Area	0.1002	43.05%	0.0789	33.91%	0.8146	0.8947	14.8915	0.0174	“rbf”
Commercial Housing Average Sales Price	0.0898	46.23%	0.0803	41.43%	0.7855	1.0000	1.0	“scale”	“rbf”
Commercial Housing Average Sales Price	0.0770	39.73%	0.0639	32.98%	0.8422	1.0000	73.9421	0.1124	“rbf”
Task	RMSE	%RMSE	MAE	%MAE	$R^{2}$	$D_{stat}$	MD	MSL	MSS	NE
Commercial Housing Sales Amount	0.0806	35.43%	0.0551	24.23%	0.8745	0.8947	None	1	2	100
Commercial Housing Sales Amount	0.0796	35.02%	0.0529	23.26%	0.8774	0.8947	4	1	2	150

Note: MD denotes “max_depth”; MSL denotes “min_samples_leaf”; MSS denotes “min_samples_split”; and NE denotes “n_estimators”.
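
The hyperparameters in Table 5 could be set up as in the following sketch. It assumes scikit-learn, which this part of the article does not name; the parameter names in the note above match scikit-learn's `SVR` and `RandomForestRegressor`, and the first row of each pair in Table 5 coincides with that library's default settings. The second row of each pair is treated here as the tuned configuration, which is an inference from the table. The dictionary keys, the `random_state` seed, and the toy data are hypothetical additions for illustration only.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# Second row of each pair in Table 5 (assumed to be the tuned configuration).
models = {
    "sales_area": SVR(C=14.8915, gamma=0.0174, kernel="rbf"),
    "average_sales_price": SVR(C=73.9421, gamma=0.1124, kernel="rbf"),
    "sales_amount": RandomForestRegressor(
        n_estimators=150,
        max_depth=4,
        min_samples_leaf=1,
        min_samples_split=2,
        random_state=0,  # ASSUMPTION: seed added only to make the sketch repeatable
    ),
}

# Tiny synthetic data, used only to show that the estimators fit and predict.
X = [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5], [0.6, 0.7], [0.8, 0.9]]
y = [0.1, 0.2, 0.3, 0.4, 0.5]

predictions = {}
for name, model in models.items():
    model.fit(X, y)
    predictions[name] = model.predict([[0.5, 0.6]])
    print(name, predictions[name])
```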

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

