Using Machine Learning Models and Actual Transaction Data for Predicting Real Estate Prices

Real estate price prediction is crucial for the establishment of real estate policies and can help real estate owners and agents make informed decisions. The aim of this study is to employ actual transaction data and machine learning models to predict real estate prices. The actual transaction data contain attributes and transaction prices of real estate, which serve as the independent and dependent variables, respectively, for the machine learning models. The study employed four machine learning models, namely least squares support vector regression (LSSVR), classification and regression trees (CART), general regression neural networks (GRNN), and backpropagation neural networks (BPNN), to forecast real estate prices. In addition, genetic algorithms were used to select the parameters of the machine learning models. Numerical results indicated that least squares support vector regression outperforms the other three machine learning models in terms of forecasting accuracy. Furthermore, the forecasting results generated by least squares support vector regression are superior to those of previous related studies of real estate price prediction in terms of the mean absolute percentage error. Thus, the machine learning-based model is a substantial and feasible way to forecast real estate prices, and least squares support vector regression can provide relatively competitive and satisfactory results.


Introduction
The real estate market is one of the most crucial components of any national economy. Hence, observations of the real estate market and accurate predictions of real estate prices are helpful for real estate buyers and sellers as well as economic specialists. However, real estate forecasting is a complicated and difficult task owing to the many direct and indirect factors that inevitably influence the accuracy of predictions. In general, factors influencing real estate prices can be quantitative or qualitative [1]. Quantitative factors include macroeconomic factors [2], business cycles [3], and real estate attributes [4]. Macroeconomic factors comprise unemployment rates, share indices, a country's current account, industrial production, and gross domestic product [2]. Attributes of real estate include, for example, past sale prices, land area, years of construction, floor space, surface area, number of floors, and building conditions [1,4]. Qualitative factors refer to the subjective preferences of decision makers, such as views [5], building styles, and living environment [1]. However, collecting data on qualitative factors is difficult; such data often suffer from a lack of measurements, and qualitative factors are therefore hard to quantify [1]. Consequently, this study did not take qualitative factors into consideration and instead used quantitative data gathered from records of actual real estate transactions in Taiwan. Four machine learning models were used to forecast real estate prices accordingly. The rest of this study is organized as follows. Section 2 presents a literature review of real estate price prediction. Section 3 illustrates the methods used in this study. Section 4 introduces the proposed real estate appraising system. The numerical results are depicted in Section 5. Section 6 provides conclusions.

The Literature Review of Real Estate Price Prediction
Some studies of real estate price prediction are presented as follows. Singh et al. [6] employed the concept of big data to predict housing sale data in Iowa, using three models to forecast house sale prices: linear regression, random forest, and gradient boosting. The numerical results indicated that the gradient boosting model outperforms the other forecasting models in terms of forecasting accuracy. Segnon et al. [7] presented a logistic smooth transition autoregressive fractionally integrated process to predict housing price volatility in the U.S.A. and analyzed complicated statistical models based on assumptions about the variance process. The numerical results revealed that the Markov-switching multifractal and fractionally integrated generalized autoregressive conditional heteroscedastic models provide satisfactory forecasting accuracy.
Kang [8] developed a news article-based forecast model to predict Jeonse prices in South Korea, with the Internet search intensity of keywords from news articles treated as the independent variable. The numerical results showed that the designed models obtain more accurate results than time series techniques. Giudice et al. [9] used genetic algorithms to forecast real estate rental prices when geographic locations and four real estate attributes are considered. The multivariate regression technique was applied to forecast the same data. Numerical results indicated that the genetic algorithms are superior to the multivariate regression in terms of prediction accuracy. Park and Bae [10] designed a house price prediction system using machine learning approaches to help house sellers and real estate agents with house price evaluations. In their investigation, data were collected from the Multiple Listing Service of the Metropolitan Regional Information Systems. The numerical results indicated that the repeated incremental pruning to produce error reduction (RIPPER) algorithm obtains more accurate forecasting results than the other forecasting models.
Bork and Møller [11] employed dynamic model averaging and dynamic model selection to forecast house price growth rates in the 50 states of the U.S.A. The presented forecasting system captures house price growth rates by allowing the model and the coefficients to change over time and across locations. Thus, the forecasting results provided by the proposed system are substantial. Plakandaras et al. [12] designed a hybrid model as an early warning system, including ensemble empirical mode decomposition and support vector regression, for predicting sudden house price drops. The proposed forecasting approach generates superior results over other forecasting models in terms of forecasting accuracy. Chen et al. [13] developed a housing price analysis system that uses information from public websites to analyze the total ratio of the average value and standard deviation of housing prices. They reported that the designed housing price analysis system is a helpful way to obtain insights into housing prices. Lee et al. [14] employed fuzzy adaptive networks to forecast pre-owned housing prices in Taipei, taking both objective and subjective variables into consideration. The empirical results indicated that the fuzzy adaptive networks outperform backpropagation neural networks and the adaptive network fuzzy inference system in terms of forecasting accuracy.
Antipov and Pokryshevskaya [15] used random forest to appraise residential real estate in Saint Petersburg, Russia, and reported that the random forest approach outperforms the other forecasting methods in prediction accuracy. In addition, the authors claimed that the random forest approach is capable of dealing with missing values and categorical variables. Kontrimas and Verikas [16] presented an ensemble learning system integrating ordinary least squares linear regression and support vector regression to appraise real estate. In addition, weights based on value zones provided by experts of the register center and weights generated by the self-organizing map (SOM) were used by the ensemble learning system. The numerical results revealed that the ensemble learning system with weights generated by the SOM outperforms the ensemble learning system with weights provided by experts. Furthermore, the ensemble systems achieve more accurate forecasting results than the single forecasting models.
Gupta et al. [17] employed time series models, with and without the information content of 10 or 120 additional quarterly macroeconomic series, to forecast the U.S. real house price index. Their study concluded that the use of fundamental economic variables could increase forecasting accuracy and was especially effective for the 10 fundamental economic variables in the dynamic stochastic general equilibrium model. Kusan et al. [18] developed a grading fuzzy logic model to forecast house selling prices. Many housing attributes, such as public transportation systems and environmental factors, served as inputs of the fuzzy logic system. The numerical results indicated that the proposed fuzzy logic system captures the patterns of house selling prices and leads to satisfactory forecasting accuracy. The collection of actual transaction data on real estate has been carried out since August 2012 by the Ministry of the Interior of Taiwan. Thus, the motivation of this study is to examine the performance of machine learning models with actual transaction data in forecasting real estate prices.

Least Squares Support Vector Regression
The support vector machine [19,20] technique was originally designed for classification problems. To deal with regression problems, support vector regression [21-23] was proposed and has become a popular alternative for estimating linear or non-linear prediction problems. However, because support vector regression requires solving a quadratic programming problem, its computational burden is considerable. Thus, least squares support vector regression [24] was designed to decrease the computational load by converting the quadratic programming problem into a linear one. For an input-output dataset {(X_i, Y_i)}, i = 1, ..., N, the LSSVR model can be illustrated as Equation (1) [24].
where W represents the weight vector, Ω denotes the penalty factor controlling the trade-off between minimizing the approximation error and the flatness of the approximated function, τ_i is the i-th error variable, ψ(x_i) is the nonlinear mapping function transferring the original input space into a high-dimensional feature space, and δ indicates the bias term. Using the Lagrange function, the Lagrangian form of Equation (1) is depicted as Equation (2), where α_i represents the Lagrange multipliers. Applying the Karush-Kuhn-Tucker conditions [25-27] and setting the derivatives with respect to the four variables W, δ, α, and τ equal to zero, Equations (3)-(6) can be obtained.
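In the notation above, the derivation referenced by Equations (1)-(6) can be sketched as follows; this is a reconstruction of the standard LSSVR formulation [24], not the paper's original typesetting:

```latex
% Equation (1): primal optimization problem
\min_{W,\,\delta,\,\tau}\ \frac{1}{2}\,W^{\top}W \;+\; \frac{\Omega}{2}\sum_{i=1}^{N}\tau_i^{2}
\quad \text{s.t.} \quad Y_i = W^{\top}\psi(X_i) + \delta + \tau_i,\quad i = 1,\dots,N

% Equation (2): Lagrangian of the primal problem
L(W,\delta,\tau,\alpha) = \frac{1}{2}W^{\top}W + \frac{\Omega}{2}\sum_{i=1}^{N}\tau_i^{2}
 - \sum_{i=1}^{N}\alpha_i\left(W^{\top}\psi(X_i) + \delta + \tau_i - Y_i\right)

% Equations (3)-(6): Karush-Kuhn-Tucker optimality conditions
\frac{\partial L}{\partial W} = 0 \;\Rightarrow\; W = \sum_{i=1}^{N}\alpha_i\,\psi(X_i)
\qquad
\frac{\partial L}{\partial \delta} = 0 \;\Rightarrow\; \sum_{i=1}^{N}\alpha_i = 0
\frac{\partial L}{\partial \tau_i} = 0 \;\Rightarrow\; \alpha_i = \Omega\,\tau_i
\qquad
\frac{\partial L}{\partial \alpha_i} = 0 \;\Rightarrow\; W^{\top}\psi(X_i) + \delta + \tau_i - Y_i = 0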
Then, by the least squares method, the LSSVR model can be reformulated as Equation (7), where K(x, x_i) represents a kernel function satisfying Mercer's condition [28]. The radial basis function with variance σ² expressed in Equation (9) is specified here as the kernel function.
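A minimal numerical sketch of this solution, assuming the RBF kernel of Equation (9) and the standard bordered linear system for the dual variables (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Equation (9): K(x, x_i) = exp(-||x - x_i||^2 / (2 * sigma^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def lssvr_fit(X, y, Omega=100.0, sigma=1.0):
    # Solve the KKT linear system of LSSVR:
    # [ 0   1^T         ] [delta]   [0]
    # [ 1   K + I/Omega ] [alpha] = [y]
    N = len(y)
    K = rbf_kernel(X, X, sigma)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(N) / Omega
    b = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, b)
    return sol[0], sol[1:]          # bias delta, multipliers alpha

def lssvr_predict(Xnew, X, delta, alpha, sigma=1.0):
    # Equation (7): f(x) = sum_i alpha_i K(x, x_i) + delta
    return rbf_kernel(Xnew, X, sigma) @ alpha + delta
```

The quadratic program of standard support vector regression is thus replaced by one linear solve, which is the computational saving motivating LSSVR.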

Classification and Regression Trees
Proposed by Breiman et al. [29], the classification and regression tree is one of the most popular techniques for dealing with classification or regression problems. The Gini measure and the least-squared deviation measure are used by CART models for categorical and numerical problems, respectively [29,30]. Let the p-th sample be illustrated as (I_p,1, I_p,2, ..., I_p,n, O_p), where I_p,n is the value of the n-th feature of the p-th sample and O_p is the corresponding output value of the sample. For a regression problem, the minimization of the least-squared deviation measure of impurity [29], represented by Equation (10), serves as the criterion for determining the split of a tree node into branches, where N is the total number of training samples, U_r and U_l are the training data subsets directed to the right child node and the left child node, respectively, and O_r and O_l are the mean output values of the right node and the left node, respectively.
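The split criterion just described can be sketched as follows, assuming the common form of the least-squared deviation impurity (sum of within-child squared deviations divided by the number of training samples); the helper names are illustrative:

```python
import numpy as np

def lsd_impurity(outputs, left_mask):
    # Least-squared deviation of a candidate split, cf. Equation (10):
    # (1/N) * [ sum_{left child} (O_p - mean_l)^2 + sum_{right child} (O_p - mean_r)^2 ]
    O = np.asarray(outputs, dtype=float)
    left, right = O[left_mask], O[~left_mask]
    sse = 0.0
    for child in (left, right):
        if len(child):
            sse += ((child - child.mean()) ** 2).sum()
    return sse / len(O)

def best_threshold(feature, outputs):
    # Scan candidate thresholds on one feature; keep the split minimizing impurity.
    xs = np.unique(feature)
    best_t, best_imp = None, np.inf
    for t in (xs[:-1] + xs[1:]) / 2:      # midpoints between consecutive values
        imp = lsd_impurity(outputs, feature <= t)
        if imp < best_imp:
            best_t, best_imp = t, imp
    return best_t, best_imp
```

A full CART builder would apply this scan recursively over all features at every node; the sketch shows only the single-node decision.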

General Regression Neural Networks
Based on the Parzen window non-parametric estimation [31] of a probability density function, the general regression neural network [32] is a probabilistic neural network that can cope with linear or non-linear forecasting problems with continuous outcome values. Suppose a probability density function f(I, O) is associated with an input vector I and an output vector O. The regression of O on I, namely the expected value of O conditional on I, is illustrated as Equation (11).
Furthermore, the process of conducting a general regression neural network model can be treated as a kernel regression problem expressed as Equation (12), where (I_p, O_p) is a data pair of (I, O) and σ is the smoothing parameter.
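The kernel regression of Equation (12) amounts to a Gaussian-weighted average of the training outputs; a minimal sketch (the function name and the choice of Euclidean distance are illustrative assumptions):

```python
import numpy as np

def grnn_predict(I_new, I_train, O_train, sigma=0.5):
    # Equation (12): Y_hat(I) = sum_p O_p * w_p / sum_p w_p,
    # with w_p = exp(-||I - I_p||^2 / (2 * sigma^2))
    d2 = ((I_train - I_new) ** 2).sum(axis=1)   # squared distances to each training point
    w = np.exp(-d2 / (2 * sigma ** 2))          # Parzen-window weights
    return (w @ O_train) / w.sum()
```

Note that GRNN has no iterative training phase: the entire training set is stored, and only the smoothing parameter σ needs tuning, which is exactly the parameter the genetic algorithms adjust later in this study.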

Backpropagation Neural Networks
The backpropagation learning algorithm [33] is a powerful and popular method for training multilayer perceptron neural networks. A backpropagation neural network containing one or more hidden layers delivers data patterns forward from the input layer through the hidden layers to the output layer and generates output values. Errors, represented by the differences between the actual output values and the output values of the network, are then propagated backward from the output layer to the input layer. During the error propagation process, the chain rule is applied to obtain the updated weights between neurons. By passing data forward and transmitting errors backward iteratively, the training error decreases. For simplicity and generality, a three-layer MLP is used to illustrate the learning procedure of backpropagation neural networks [34]. Suppose the input data set is represented as I; the h-th hidden neuron obtains an input depicted as Equation (13), where f(·) is the activation function, B_h is the bias of the h-th hidden neuron, I_p is the p-th input value, and U_hp is the weight between the p-th input neuron and the h-th hidden neuron.
The output of the h-th hidden neuron is represented as Equation (14).
Then, the output of the q-th output neuron is represented as Equation (15),
where n is the number of output nodes, Y_q is the output value of the q-th output neuron, and A_q is the q-th actual value. The training error is expressed as Equation (16).
Then, by the gradient-descent method and the chain rule, the updated weights between the output layer and the hidden layer can be obtained, as illustrated in Equations (17) and (18).
V_qh(t + 1) = Ω V_qh(t) + γ ΔV_qh(t) (18)
where V_qh(t + 1) is the weight connecting hidden neuron h and output neuron q for the (t + 1)-th epoch, Ω is the momentum, and γ denotes the learning rate. Then, the learning algorithm between the input layer and the hidden layer can be illustrated as Equations (19) and (20).
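The forward pass, backward error propagation, and momentum updates described above can be sketched for a three-layer network as follows; the standard momentum rule (new step = momentum times previous step minus learning rate times gradient) is assumed, and the sigmoid activation, layer sizes, and variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_bpnn(X, A, hidden=4, gamma=0.3, Omega=0.7, epochs=3000):
    """One-hidden-layer MLP trained by backpropagation with momentum."""
    n_in, n_out = X.shape[1], A.shape[1]
    U = rng.normal(0.0, 0.5, (n_in, hidden))    # input -> hidden weights
    V = rng.normal(0.0, 0.5, (hidden, n_out))   # hidden -> output weights
    Bh, Bq = np.zeros(hidden), np.zeros(n_out)  # biases
    dU, dV = np.zeros_like(U), np.zeros_like(V) # previous momentum steps
    for _ in range(epochs):
        H = sigmoid(X @ U + Bh)                 # hidden outputs, cf. Equation (14)
        Y = sigmoid(H @ V + Bq)                 # network outputs, cf. Equation (15)
        gY = (Y - A) * Y * (1.0 - Y)            # output-layer deltas via the chain rule
        gH = (gY @ V.T) * H * (1.0 - H)         # hidden-layer deltas
        # momentum updates, cf. Equations (17)-(20); biases use plain gradient steps
        dV = Omega * dV - gamma * (H.T @ gY); V += dV; Bq -= gamma * gY.sum(0)
        dU = Omega * dU - gamma * (X.T @ gH); U += dU; Bh -= gamma * gH.sum(0)
    return U, Bh, V, Bq

def bpnn_predict(X, U, Bh, V, Bq):
    return sigmoid(sigmoid(X @ U + Bh) @ V + Bq)
```
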

Data Collection
The data were collected from the actual transaction price records of Taichung, Taiwan (https://lvr.land.moi.gov.tw/) from 2016/04 to 2019/04. Real estate with buildings served as the objects of this study. Owing to blank, unclear, and outlier data, data cleansing was conducted before using the data. Blank and unclear records were deleted, and the interquartile range technique was applied to handle outliers. After data cleansing, a total of 32,215 observations were used in this study. In addition, the real estate attributes were rearranged, and a total of twenty-three independent variables and one dependent variable were finally utilized to forecast real estate transaction prices. For example, the real estate addresses were transformed into geographical coordinates by the Taiwan Geospatial One-Stop system (https://www.tgos.tw/tgos/Web/Address/TGOS_Address.aspx) provided by the Ministry of the Interior of Taiwan. Three attributes, namely "non-urban use land zone", "urban use land zone", and "non-urban land use compilation", were integrated into the attribute "purpose of land use". Ages of real estate were derived from the transaction dates and the completion dates of buildings. Table 1 depicts the variables used in this study. Moreover, the Pearson correlation coefficient was used to select independent variables with significant correlations. The Pearson correlation coefficients of the independent variables with respect to the real estate price are listed in Table 2. Independent variables with absolute Pearson correlation coefficients larger than 0.1 were retained. In total, eleven independent variables were selected.
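The two cleansing and selection steps described above can be sketched as follows; the column names, and the 1.5 × IQR fence (the text does not specify the multiplier), are illustrative assumptions:

```python
import pandas as pd

def iqr_filter(df, col):
    # Drop rows whose value in `col` lies outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[(df[col] >= q1 - 1.5 * iqr) & (df[col] <= q3 + 1.5 * iqr)]

def select_by_pearson(df, target, threshold=0.1):
    # Keep numeric independent variables whose |Pearson r| with the target
    # exceeds the threshold (0.1 in this study).
    r = df.corr(numeric_only=True)[target].drop(target)
    return r[r.abs() > threshold].index.tolist()
```
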
Table 1. Variables used in this study.

  Independent variables
  X2   With or without parking space          Categorical
  X3   Longitude                              Numerical
  X4   Latitude                               Numerical
  X5   Transaction area of land               Numerical
  X6   Purpose of land use                    Categorical
  X7   Ages of buildings                      Numerical
  X8   Transaction amount of property         Numerical
  X9   Transaction floors                     Numerical
  X10  Total floors of buildings              Numerical
  X11  Types of buildings                     Categorical
  X12  Use of buildings                       Categorical
  X13  Materials of buildings                 Categorical
  X14  Total transaction areas of buildings   Numerical
  X15  Number of bedrooms                     Numerical
  X16  Number of living rooms                 Numerical
  X17  Number of bathrooms                    Numerical
  X18  With or without compartments           Categorical
  X19  With or without management committee   Categorical
  X20  Prices per square meter                Numerical
  X21  Types of parking space                 Categorical
  X22  Area of parking space                  Numerical
  X23  Prices of parking space                Numerical

  Dependent variable
  Y    Transaction prices                     Numerical

The Proposed Real Estate Appraising Framework
Figure 1 illustrates the proposed real estate appraising framework of this study. After the data preparation was completed, the data were split into a training set and a testing set containing eighty percent and twenty percent of the total data, respectively. Then, a 5-fold cross-validation was performed to examine the robustness of the machine learning models in forecasting real estate prices. In this study, genetic algorithms (GA) [35] were used to tune the parameters of the machine learning models, with the mean absolute percentage error serving as the objective function. The parameters of each machine learning model were expressed by a chromosome of ten genes in binary-coded form, and the population size was twenty. Single-point crossover was conducted, and the crossover and mutation rates were both 0.6. In the parameter tuning procedure, tentative models with different parameters were iteratively generated by the genetic algorithms.
The parameters tuned by the genetic algorithms for the four machine learning models are specified as follows. Two parameters, the penalty factor and the width of the Gaussian kernel function, were selected for the least squares support vector regression models. For the classification and regression tree models, the maximum depth of the tree, the minimum number of leaf nodes, and the number of predictors were adjusted. The smoothing parameter was determined for the general regression neural networks. Learning rates and momentums were tuned for the backpropagation neural networks.
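A minimal sketch of the binary-coded GA described in the text (ten genes, population of twenty, single-point crossover, crossover rate 0.6) follows. The decoding range, tournament selection, elitism, generation count, and the interpretation of the 0.6 mutation rate as roughly 0.6 expected gene flips per chromosome are illustrative assumptions not specified in the text:

```python
import numpy as np

rng = np.random.default_rng(42)

def decode(bits, lo=0.01, hi=2.0):
    # Map a 10-gene binary chromosome to a real-valued parameter in [lo, hi].
    v = int("".join(map(str, bits)), 2)
    return lo + (hi - lo) * v / (2 ** len(bits) - 1)

def ga_minimize(objective, n_genes=10, pop_size=20, p_cross=0.6, p_mut=0.6,
                generations=30):
    # Binary-coded GA minimizing `objective` (e.g., the MAPE of a tuned model).
    pop = rng.integers(0, 2, (pop_size, n_genes))
    best_bits, best_fit = None, np.inf
    for _ in range(generations):
        fit = np.array([objective(decode(ind)) for ind in pop])
        if fit.min() < best_fit:
            best_fit, best_bits = fit.min(), pop[fit.argmin()].copy()
        # size-2 tournament selection
        idx = rng.integers(0, pop_size, (pop_size, 2))
        parents = pop[np.where(fit[idx[:, 0]] < fit[idx[:, 1]], idx[:, 0], idx[:, 1])]
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):       # single-point crossover
            if rng.random() < p_cross:
                cut = rng.integers(1, n_genes)
                children[i, cut:], children[i + 1, cut:] = (
                    parents[i + 1, cut:].copy(), parents[i, cut:].copy())
        mutate = rng.random(children.shape) < p_mut / n_genes  # per-gene flips
        pop = np.where(mutate, 1 - children, children)
        pop[0] = best_bits                        # elitism
    return decode(best_bits), best_fit
```

In this study the objective would wrap a full model-fit-and-score step (for instance, fitting an LSSVR with the decoded kernel width and returning its cross-validated MAPE); the sketch exercises only the search loop itself.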

Figure 1. The proposed real estate appraising framework.

Numerical Results
Two indices, namely the mean absolute percentage error (MAPE) and the normalized mean absolute error (NMAE), were employed to evaluate the performance of the machine learning models in real estate price forecasting. The mathematical forms of MAPE and NMAE are expressed as Equations (21) and (22):

MAPE = (1/N) Σ_{t=1..N} |A_t - P_t| / A_t × 100% (21)

NMAE = (1/N) Σ_{t=1..N} |A_t - P_t| / (R_h - R_l) (22)

where N is the number of forecasting periods, A_t is the real value at period t, P_t is the predicted value at period t, R_h is the highest actual value, and R_l is the lowest actual value. Tables 3 and 4 show the average MAPE and average NMAE values of the machine learning models without and with attribute selection, respectively. The computation results revealed that all four machine learning models achieve better results with the selected independent variables. According to Lewis [36], MAPE values of less than 10 percent indicate highly accurate predictions, and values between 10 percent and 20 percent indicate good predictions. Thus, with attribute selection, LSSVR, GRNN, and CART yield highly accurate predictions and BPNN yields good predictions in this study. Moreover, real estate prices can be expressed in different currencies or units. To avoid the influence of real estate prices in various currencies or units, MAPE is specified when comparing the prediction accuracy of this study with the forecasting results of previous studies. This study collected previous related studies that employed MAPE as the measurement. Table 5 lists the MAPE values of previous studies and of the LSSVR models in this study for real estate price forecasting. Table 5 reveals that the LSSVR models are superior to the forecasting models of previous studies in terms of MAPE. However, the performance differences may stem from many sources of variance in real estate, such as countries, cultures, market trends, and economic conditions. Furthermore, buyers' and owners' expectations are continually varying owing to changes in lifestyles, building materials, environmental legislation, and regulations on the rational use of energy [5].
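The two evaluation indices of Equations (21) and (22) can be implemented directly; a small sketch with illustrative function names, assuming array inputs:

```python
import numpy as np

def mape(actual, pred):
    # Equation (21): MAPE = (1/N) * sum_t |A_t - P_t| / A_t * 100%
    a, p = np.asarray(actual, dtype=float), np.asarray(pred, dtype=float)
    return np.mean(np.abs(a - p) / a) * 100.0

def nmae(actual, pred):
    # Equation (22): NMAE = (1/N) * sum_t |A_t - P_t| / (R_h - R_l),
    # where R_h and R_l are the highest and lowest actual values.
    a, p = np.asarray(actual, dtype=float), np.asarray(pred, dtype=float)
    return np.mean(np.abs(a - p)) / (a.max() - a.min())
```

Because NMAE divides by the range of the actual values rather than by each value, it complements MAPE when prices span very different magnitudes.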
Thus, this study provides a feasible and competitive alternative for forecasting real estate prices. Forecasting models should be continually adjusted and improved to maintain stability and feasibility over different time periods. In this study, the number of residential properties is 31,397, and residential properties constitute about 97.46 percent of the total data; the remaining 2.54 percent is commercial real estate. Commercial real estate appraisal relies on factors such as spatial autocorrelation, which differ from those of residential real estate appraisal [37-39]. Therefore, the focus of this study is on the residential aspect.

Conclusions
The decision to purchase real estate is undeniably essential in the lives of most adults. Thus, the appraisal and prediction of real estate prices can provide useful information to help facilitate real estate transactions. Real estate prices vary owing to a wide variety of attributes. Machine learning models, including least squares support vector regression, classification and regression trees, general regression neural networks, and backpropagation neural networks, were used in this investigation to forecast real estate transaction prices with actual transaction data from Taichung, Taiwan. Genetic algorithms were employed to determine the parameters of the machine learning models. Empirical results revealed that attribute selection does improve the forecasting accuracy of all four models. With attribute selection, three machine learning models offer highly accurate predictions, and one machine learning model presents good predictions. Thus, the four machine learning models with feature selection used in this study are appropriate for forecasting real estate prices. Furthermore, least squares support vector regression outperformed the other three forecasting models and obtained more accurate results than some previous studies in terms of MAPE. Thus, least squares support vector regression with genetic algorithms is a feasible and promising machine learning technique for forecasting real estate prices.
For future work, diverse data types, such as comments on real estate attributes and prices from social media, images from Google Maps, and economic indicators, are possible sources that could be added as inputs to machine learning models to improve forecasting accuracy. In this study, general regression neural networks and backpropagation neural networks seem not to generate satisfactory results when compared with the results provided by least squares support vector regression and the classification and regression tree. Thus, another potential opportunity for future research might be the use of deep learning techniques to forecast real estate prices.