# Identifying Real Estate Opportunities Using Machine Learning


## Abstract


## 1. Introduction

## 2. State of the Art

## 3. Data

#### 3.1. Source and Description

- Zone: division within the Salamanca district where the asset is located. This zone is determined by Idealista based on the asset location.
- Postal code: the postal code for the area where the asset is located.
- Street name: the name of the street where the asset is located.
- Street number: number within the street where the asset is located.
- Floor number: the number of the floor where the asset is located.

- Type of asset: whether the asset is an apartment (flat, condo...) or a villa (detached or semi-detached house).
- Constructed area: the total area of the asset indicated in square meters.
- Floor area: the floor area of the asset indicated in square meters, which will in some cases be smaller than the constructed area because terraces, gardens, etc., are ignored in this feature.
- Construction year: construction year of the building.
- Number of rooms: the total number of rooms in the asset.
- Number of baths: the total number of bathrooms in the asset.
- Is penthouse: whether the asset is a penthouse.
- Is duplex: whether the asset is a duplex, i.e., a house of two floors connected by stairs.
- Has lift: whether the building where the asset is located has a lift.
- Has box room: whether the asset includes a box room.
- Has swimming pool: whether the asset has a swimming pool (either private or in the building).
- Has garden: whether the asset has a garden.
- Has parking: whether the asset has a parking lot included in the price.
- Parking price: if a parking space is offered at an additional cost, that price is specified in this feature.
- Community costs: the monthly fee for community costs.

- Date of activation: the date when the ad was published in the listing and displayed publicly.
- Date of deactivation: the date when the ad was removed from the listing, which could indicate that it was already sold or rented.
- Price: current price of the asset, or the price when the ad was deactivated, which in most cases corresponds to the price at which the asset was sold or rented.
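Features like these are naturally handled with pandas. A minimal sketch, using hypothetical column names (not the actual field names of the Idealista dataset), that parses the activation and deactivation dates and derives the time an ad spent on the market:

```python
import pandas as pd

# Illustrative listings; column names and values are made up for this sketch.
listings = pd.DataFrame({
    "activation_date": ["2018-01-10", "2018-02-01"],
    "deactivation_date": ["2018-03-15", None],  # None: ad still active
    "price": [1_250_000, 980_000],
})

# Parse dates; a still-active ad yields NaT and hence a missing duration.
listings["activation_date"] = pd.to_datetime(listings["activation_date"])
listings["deactivation_date"] = pd.to_datetime(listings["deactivation_date"])
listings["days_on_market"] = (
    listings["deactivation_date"] - listings["activation_date"]
).dt.days
```

A deactivated ad thus gets a concrete duration in days, while active ads remain missing and can be filtered out before analysis.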

#### 3.2. Data Cleansing

#### 3.3. Exploratory Data Analysis

## 4. Machine Learning Proposal

- Support vector regression: this method is also known as a “kernel method” and constitutes an extension of classical support vector machine classifiers to regression [19]. Kernel methods transform the data into a higher-dimensional space in which instances become separable by some function, and then learn that function to discriminate between instances. When applied to regression, a parameter epsilon ($\epsilon$) is introduced, with the aim that the learned function does not deviate from the real output by more than epsilon for any instance.
- k-nearest neighbors: this is an example of a geometric technique for regression. No model is actually built from the data; instead, this is considered a “lazy” algorithm, since it must traverse the whole learning set in order to produce a prediction for a single instance. In particular, k-nearest neighbors computes the distance from the instance to be predicted to every instance in the learning dataset, based on some distance function or metric (such as Euclidean or cosine distance). Once distances are computed, the k instances in the training set closest to the instance being predicted are retrieved, and their known outputs are aggregated to generate the predicted output, for example, by computing their average. This method is interesting because it considers assets similar to the one we want to predict, but it can be problematic when dealing with high-dimensional data with binary attributes.
- Ensembles of regression trees: regression trees are logical models that are able to learn a set of rules in order to compute the output of an instance given its features. A regression tree can be learned from a training dataset by choosing the most informative feature according to some criterion (such as entropy or statistical dispersion) and then dividing the dataset based on some condition over that feature. The process is repeated with each of the divided training subsets until a tree is completely formed, either because no more features are available to be selected or because fitting a regression model on the subset performs better than computing a regression sub-tree. In this paper, we will use ensembles of trees, which means that several models will be combined in order to reduce regression bias. To build the ensemble, we will use the technique known as extremely randomized trees, introduced by Geurts et al. [20] as an improvement to random forests.
- Multi-layer perceptron: the multi-layer perceptron is an example of a connectionist model, more particularly an implementation of an artificial neural network. This kind of model comprises an input layer, which receives the feature values of each instance, and several hidden layers, each of which contains several hidden units, or neurons. The model is fully connected, meaning that each neuron in one layer is connected to every neuron in the following layer. Each connection has an associated floating-point number, called a weight, used to aggregate the inputs to a neuron before a nonlinear activation function is applied. During training, a gradient descent algorithm fits the connection weights via a process known as backpropagation.
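All four techniques above are implemented in scikit-learn [21]. A minimal sketch, fitting each of them on synthetic data that stands in for the real listings (the feature values and coefficients below are illustrative only):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for the listings data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # e.g., area, rooms, baths (standardized)
y = X @ np.array([3.0, 2.0, 1.0]) + rng.normal(scale=0.1, size=200)

# One regressor per technique discussed above.
models = {
    "svr": SVR(kernel="rbf"),
    "knn": KNeighborsRegressor(n_neighbors=5),
    "extra_trees": ExtraTreesRegressor(n_estimators=50, random_state=0),
    "mlp": MLPRegressor(hidden_layer_sizes=(128, 64, 32), max_iter=500,
                        random_state=0),
}

predictions = {}
for name, model in models.items():
    model.fit(X, y)                           # train on the toy data
    predictions[name] = model.predict(X[:1])  # predict a single instance
```

All four expose the same `fit`/`predict` interface, which is what makes the comparison in the evaluation section straightforward to run.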

## 5. Evaluation

#### 5.1. Experimental Setup

- Support vector regression: we will specify the kernel type.
- k-nearest neighbors: we will configure the number of neighbors to consider, the distance metric and the weight function used for prediction.
- Ensembles of regression trees: we will set up the number of trees that make up the forest, the criterion for determining the quality of a split, and whether or not bootstrap samples are used when building trees.
- Multi-layer perceptron: we will consider different architectures, i.e., different configurations of how hidden units are distributed among layers.
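Search spaces like these can be enumerated with scikit-learn's `ParameterGrid`. The sketch below mirrors the varied parameters just listed (the criterion values `mae`/`mse` follow the paper's naming; recent scikit-learn versions rename them `absolute_error`/`squared_error`):

```python
from sklearn.model_selection import ParameterGrid

# Candidate configurations per technique; fixed parameters are omitted.
grids = {
    "knn": ParameterGrid({
        "n_neighbors": [5, 10, 20, 50],
        "metric": ["minkowski", "cosine"],
        "weights": ["uniform", "distance"],
    }),
    "extra_trees": ParameterGrid({
        "n_estimators": [10, 20, 50],
        "criterion": ["mae", "mse"],
        "bootstrap": [True, False],
    }),
    "mlp": ParameterGrid({
        "hidden_layer_sizes": [(1024,), (256, 128), (128, 64, 32)],
    }),
}

# Number of configurations evaluated for each technique.
counts = {name: len(grid) for name, grid in grids.items()}
```

Enumerating the grids makes the cost of the experiment explicit: 16 k-NN configurations, 12 tree-ensemble configurations, and 3 perceptron architectures.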

#### 5.2. Results and Findings

- Explained variance regression score, which measures the extent to which a model accounts for the variation of a dataset. Letting $\widehat{y}$ be the predicted output and y the actual output, this metric is computed as follows in Equation (1):$${E}_{var}\left(y,\widehat{y}\right)=1-\frac{Var\left\{y-\widehat{y}\right\}}{Var\left\{y\right\}}.$$In this equation, $Var$ is the variance of a distribution. The best possible score is 1.0, which would occur when $y=\widehat{y}$.
- Mean absolute error, which computes the average error over all the instances, as shown in Equation (2):$$MAE\left(y,\widehat{y}\right)=\frac{1}{n}\sum _{i=1}^{n}\left|{y}_{i}-{\widehat{y}}_{i}\right|.$$Since this is an error metric, the best possible value is 0.
- Median absolute error, similar to the previous score but computing the median of the distribution of absolute differences between the expected and actual values, as shown in Equation (3):$$MedAE\left(y,\widehat{y}\right)=median\left(\left|{y}_{1}-{\widehat{y}}_{1}\right|,\dots ,\left|{y}_{n}-{\widehat{y}}_{n}\right|\right).$$Again, since this is an error metric, the best possible value is 0.
- Mean squared error, similar to MAE but with each error squared, computed as described in Equation (4):$$MSE\left(y,\widehat{y}\right)=\frac{1}{n}\sum _{i=1}^{n}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}.$$As with MAE, since this is an error metric, the best possible value is 0.
- Coefficient of determination (${R}^{2}$), which provides a measure of how well future samples are likely to be predicted. It is computed using Equation (5):$${R}^{2}\left(y,\widehat{y}\right)=1-\frac{{\sum }_{i=1}^{n}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}}{{\sum }_{i=1}^{n}{\left({y}_{i}-\overline{y}\right)}^{2}}.$$Here, $\overline{y}$ is the average of the real outputs. The maximum value of the coefficient of determination is 1.0, obtained when the predicted output matches the real output for every instance. ${R}^{2}$ is 0 if the model always predicts the average output, and it can even take negative values, since a model can perform arbitrarily worse than simply predicting that average.
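All five metrics are available in `sklearn.metrics`. A small worked example with made-up values (prices in millions) to make the definitions above concrete:

```python
import numpy as np
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             median_absolute_error, mean_squared_error,
                             r2_score)

# Hypothetical actual and predicted prices for four assets, in millions.
y_true = np.array([1.2, 0.8, 2.5, 1.0])
y_pred = np.array([1.0, 0.9, 2.4, 1.3])

metrics = {
    "E_var": explained_variance_score(y_true, y_pred),  # Equation (1)
    "MAE": mean_absolute_error(y_true, y_pred),         # Equation (2)
    "MedAE": median_absolute_error(y_true, y_pred),     # Equation (3)
    "MSE": mean_squared_error(y_true, y_pred),          # Equation (4)
    "R2": r2_score(y_true, y_pred),                     # Equation (5)
}
```

For these values the absolute errors are 0.2, 0.1, 0.1 and 0.3, so MAE is 0.175 and MedAE is 0.15; note that ${E}_{var}$ and ${R}^{2}$ coincide only when the residuals have zero mean.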

#### 5.2.1. Which Model Performs Best?

#### 5.2.2. How Are Models Affected by Setup?

#### 5.2.3. When Does Normalization Provide an Advantage?

#### 5.2.4. How Much Time Do Models Require to Train and Run?

## 6. Conclusions

## 7. Data Statement

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## References

1. Teuben, B.; Bothra, H. Real Estate Market Size 2017—Annual Update on the Size of the Professionally Managed Global Real Estate Investment Market; Technical Report; MSCI, Inc.: New York, NY, USA, 2018. Available online: https://www.msci.com/documents/10199/6fdca931-3405-1073-e7fa-1672aa66f4c2 (accessed on 20 November 2018).
2. Idealista. Índice Idealista 50: Evolución del Precio de la Vivienda de Segunda Mano en España. 2018. Available online: https://www.idealista.com/news/estadisticas/indicevivienda#precio (accessed on 20 November 2018).
3. Jiang, L.; Phillips, P.C.B.; Yu, J. A New Hedonic Regression for Real Estate Prices Applied to the Singapore Residential Market. Technical Report, Cowles Foundation Discussion Paper No. 1969. 2014. Available online: https://ssrn.com/abstract=2533017 (accessed on 20 November 2018).
4. Jiang, L.; Phillips, P.C.; Yu, J. New Methodology for Constructing Real Estate Price Indices Applied to the Singapore Residential Market. J. Bank. Financ. **2015**, 61, S121–S131.
5. Greenstein, S.M.; Tucker, C.E.; Wu, L.; Brynjolfsson, E. The Future of Prediction: How Google Searches Foreshadow Housing Prices and Sales. In Economic Analysis of the Digital Economy; The University of Chicago Press: Chicago, IL, USA, 2015; pp. 89–118.
6. Sun, D.; Du, Y.; Xu, W.; Zuo, M.; Zhang, C.; Zhou, J. Combining Online News Articles and Web Search to Predict the Fluctuation of Real Estate Market in Big Data Context. Pac. Asia J. Assoc. Inf. Syst. **2015**, 6, 19–37.
7. Zurada, J.; Levitan, A.; Juan, G. Non-Conventional Approaches to Property Value Assessment. J. Appl. Bus. Res. **2016**, 22, 1–14.
8. Guan, J.; Shi, D.; Zurada, J.M.; Levitan, A.S. Analyzing Massive Data Sets: An Adaptive Fuzzy Neural Approach for Prediction, with a Real Estate Illustration. J. Organ. Comput. Electron. Commer. **2014**, 24, 94–112.
9. Sarip, A.G.; Hafez, M.B.; Nasir Daud, M. Application of Fuzzy Regression Model for Real Estate Price Prediction. Malays. J. Comput. Sci. **2016**, 29, 15–27.
10. Del Giudice, V.; De Paola, P.; Cantisani, G. Valuation of Real Estate Investments through Fuzzy Logic. Buildings **2017**, 7, 26.
11. Rafiei, M.H.; Adeli, H. A Novel Machine Learning Model for Estimation of Sale Prices of Real Estate Units. J. Construct. Eng. Manag. **2016**, 142, 04015066.
12. Park, B.; Kwon Bae, J. Using Machine Learning Algorithms for Housing Price Prediction: The Case of Fairfax County, Virginia Housing Data. Expert Syst. Appl. **2015**, 42, 2928–2934.
13. Manganelli, B.; Paola, P.D.; Giudice, V.D. Linear Programming in a Multi-Criteria Model for Real Estate Appraisal. In Proceedings of the International Conference on Computational Science and Its Applications, Salamanca, Spain, 12–16 November 2007; Volume 9786, pp. 182–192.
14. Del Giudice, V.; De Paola, P.; Forte, F. Using Genetic Algorithms for Real Estate Appraisals. Buildings **2017**, 7, 31.
15. Del Giudice, V.; De Paola, P.; Forte, F.; Manganelli, B. Real Estate Appraisals with Bayesian Approach and Markov Chain Hybrid Monte Carlo Method: An Application to a Central Urban Area of Naples. Sustainability **2017**, 9, 2138.
16. White, H. A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity. Econometrica **1980**, 48, 817–838.
17. MacKinnon, J.G.; White, H. Some Heteroskedasticity-Consistent Covariance Matrix Estimators with Improved Finite Sample Properties. J. Econom. **1985**, 29, 305–325.
18. Frank, E.; Trigg, L.; Holmes, G.; Witten, I.H. Technical Note: Naive Bayes for Regression. Mach. Learn. **2000**, 41, 5–25.
19. Smola, A.J.; Schölkopf, B. A Tutorial on Support Vector Regression. Stat. Comput. **2004**, 14, 199–222.
20. Geurts, P.; Ernst, D.; Wehenkel, L. Extremely Randomized Trees. Mach. Learn. **2006**, 63, 3–42.
21. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. **2011**, 12, 2825–2830.
22. Liu, W.; Wang, Z.; Liu, X.; Zeng, N.; Liu, Y.; Alsaadi, F. A Survey of Deep Neural Network Architectures and Their Applications. Neurocomputing **2017**, 243, 11–26.
23. Kleine-Deters, J.; Zalakeviciute, R.; Gonzalez, M.; Rybarczyk, Y. Modeling PM2.5 Urban Pollution Using Machine Learning and Selected Meteorological Parameters. J. Electr. Comput. Eng. **2017**, 2017, 5106045.

**Figure 1.**Evolution of the Spanish resale real estate market, focusing on four different regions: Barcelona (blue), Madrid (yellow), Palma de Mallorca (red) and Lugo (green). Source: Idealista [2], reproduced with permission from Idealista.


**Figure 10.** Distribution of the median absolute error based on the different parameters for the ensembles of regression trees.

**Figure 11.** Distribution of the median absolute error based on the different parameters for the k-nearest neighbors.

**Figure 12.** Distribution of the median absolute error based on the different parameters for the multi-layer perceptron.

**Figure 13.** Distribution of the median absolute error based on the use of normalization for the different machine learning techniques.

**Table 1.**Data description, showing the range of values for each feature and the number of empty values, as well as mean and standard deviation in the case of numerical values.

| Feature | Type | Range | Mean (Std. Dev.) | Empty Values |
|---|---|---|---|---|
| Zone | Categorical | 1, 2, 3, 4, 5, 6 | – | – |
| Postal code | Categorical | 28001, 28006, 28009, 28014, 28028, 28046 | – | – |
| Street name | Categorical | 65 values | – | 1453 |
| Street number | Categorical | 77 values | – | 2049 |
| Floor number | Categorical | Basement, Floor, Mezz, 1–14 | – | 119 |
| Type of asset | Categorical | Apartment, Villa | – | – |
| Constructed area | Numerical | 50–2041 sq.m. | 288.76 (133.71) | – |
| Floor area | Numerical | 93–1700 sq.m. | 257.63 (126.43) | 1673 |
| Construction year | Numerical | 1848–2018 | 1953.23 (31.35) | 1517 |
| Number of rooms | Numerical | 0–20 | 4.19 (1.35) | 6 |
| Number of baths | Numerical | 0–10 | 3.53 (1.14) | 5 |
| Is penthouse | Boolean | T (169)/F (288) | – | 1809 |
| Is duplex | Boolean | T (50)/F (260) | – | 1956 |
| Has lift | Boolean | T (2123)/F (20) | – | 123 |
| Has box room | Boolean | T (1212)/F (269) | – | 785 |
| Has swimming pool | Boolean | T (127)/F (413) | – | 1726 |
| Has garden | Boolean | T (155)/F (391) | – | 1720 |
| Has parking | Boolean | T (687)/F (77) | – | 1502 |
| Parking price | Numerical | 115–750,000 | 52,359.50 (102,670) | 2209 |
| Community costs | Numerical | 0–3000 | 353.71 (299.61) | 1536 |

**Table 2.** Ordinary least squares estimations. Column 1 shows a “non-robust” estimation; Column 2 shows a “robust” estimation (HC0) following White [16]. Price in millions. Stars indicate p-value. Legend: * = $p<0.1$, ** = $p<0.05$, *** = $p<0.01$.

| Dependent Var.: Price (Method: OLS) | Coef. | Error (Non-Robust) | p-Value (Non-Robust) | Error (HC0) | p-Value (HC0) |
|---|---|---|---|---|---|
| Floor number | 0.045 | 0.023 | * | 0.038 | – |
| Constructed area | 0.004 | 0.002 | ** | 0.001 | *** |
| Usable area | 0.001 | 0.002 | – | 0.002 | – |
| Is penthouse | 0.090 | 0.090 | – | 0.090 | – |
| Is duplex | −0.258 | 0.284 | ** | 0.107 | ** |
| Room number | −0.057 | 0.037 | – | 0.037 | – |
| Bath number | 0.156 | 0.053 | ** | 0.053 | *** |
| Has lift | −0.055 | 0.470 | – | 0.756 | – |
| Has box room | 0.011 | 0.163 | – | 0.061 | – |
| Has swimming pool | 0.050 | 0.243 | – | 0.095 | – |
| Has garden | 0.143 | 0.251 | – | 0.079 | * |
| Has parking | 0.260 | 0.244 | – | 0.086 | *** |
| Is apartment | −0.143 | 0.235 | – | 0.302 | – |
| Is villa | −0.065 | 0.278 | – | 0.264 | – |
| Location (dummy) | Yes | – | – | – | – |
| Postal code (dummy) | Yes | – | – | – | – |
| Street (dummy) | Yes | – | – | – | – |
| Observations | 2266 | – | – | – | – |

**Table 3.**Configuration in scikit-learn of the different machine learning algorithms that will be used for addressing the regression problem.

| ML Algorithm | Parameter | Values |
|---|---|---|
| Support vector regression (svm.SVR) | Kernel type (kernel) | Radial basis function kernel (rbf) |
| | Penalty (C) | 1.0 |
| | Kernel coefficient (gamma) | Inverse of the number of features |
| k-nearest neighbors (neighbors.KNeighborsRegressor) | Number of neighbors (n_neighbors) | 5, 10, 20, 50 |
| | Distance metric (metric) | Minkowski, cosine |
| | Weight function (weights) | Uniform, inverse to distance (distance) |
| Ensembles of regression trees (ensemble.ExtraTreesRegressor) | Number of trees in the forest (n_estimators) | 10, 20, 50 |
| | Criterion for split quality (criterion) | Mean absolute error (mae), mean squared error (mse) |
| | Whether bootstrap samples are used (bootstrap) | True, false |
| Multi-layer perceptron (neural_network.MLPRegressor) | Network architecture (hidden_layer_sizes) | 1024, 256–128, 128–64–32 |
| | Activation function (activation) | Rectified linear unit (relu) |
| | Learning rate (learning_rate_init) | 0.001 |
| | Optimizer (solver) | Adam |
| | Batch size (batch_size) | 200 |
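Grids like those in Table 3 are commonly explored with cross-validated search. A sketch using scikit-learn's `GridSearchCV` over the k-nearest neighbors grid, with synthetic data standing in for the real listings:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the listings data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=300)

# 5-fold cross-validated search over part of the k-NN grid from Table 3.
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={
        "n_neighbors": [5, 10, 20, 50],
        "weights": ["uniform", "distance"],
    },
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
best = search.best_params_  # configuration with the lowest mean MSE
```

The scoring function is negated because `GridSearchCV` maximizes its score, while MSE should be minimized.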

**Table 4.**Quality metrics per model. Average (and standard deviation) is shown for stochastic models. Only the best setup based on mean MSE is shown. Price in millions.

| ML Algorithm | E_{var} | MAE | MedAE | MSE | R^{2} |
|---|---|---|---|---|---|
| k-nearest neighbors | 0.3625 (–) | 0.4404 (–) | 0.2068 (–) | 4.044 (–) | 0.3598 (–) |
| Multi-layer perceptron | 0.3113 (0.0020) | 0.5637 (0.0026) | 0.3355 (0.0041) | 4.2262 (0.0029) | 0.3067 (0.0027) |
| Ensembles of regression trees | 0.1303 (0.1381) | 0.3714 (0.0075) | 0.1319 (0.0038) | 4.3468 (0.1548) | 0.1253 (0.1386) |
| Support vector regression | 1.73 × 10^{−5} (–) | 0.7384 (–) | 0.4540 (–) | 4.9015 (–) | −0.0664 (–) |

**Table 5.**Average (and standard deviation) of training and prediction times for different machine learning configurations.

| ML Algorithm | Parameter | Value | Train Time (s) | Predict Time (s) |
|---|---|---|---|---|
| Support vector regression | – | – | 0.500 (0.022) | 0.093 (0.0015) |
| Ensembles of regression trees | n_estimators | 10 | 0.999 (1.068) | 0.0011 (0.00006) |
| | | 20 | 2.037 (2.185) | 0.0021 (0.00011) |
| | | 50 | 5.192 (5.587) | 0.0049 (0.00024) |
| | bootstrap | true | 1.752 (2.250) | 0.0026 (0.0015) |
| | | false | 3.734 (4.905) | 0.0028 (0.0017) |
| | criterion | mae | 5.337 (4.195) | 0.0027 (0.0016) |
| | | mse | 0.149 (0.103) | 0.0027 (0.0016) |
| k-nearest neighbors | n_neighbors | 5 | 0.0016 (0.0014) | 0.013 (0.0038) |
| | | 10 | 0.0017 (0.0015) | 0.014 (0.0036) |
| | | 20 | 0.0017 (0.0016) | 0.015 (0.0031) |
| | | 50 | 0.0016 (0.0015) | 0.018 (0.0016) |
| | weights | distance | 0.0017 (0.0015) | 0.015 (0.0035) |
| | | uniform | 0.0016 (0.0014) | 0.015 (0.0037) |
| | metric | cosine | 0.00027 (0.000012) | 0.018 (0.0013) |
| | | minkowski | 0.0030 (0.00038) | 0.012 (0.0031) |
| Multi-layer perceptron | hidden_layer_sizes | (128, 64, 32) | 1.727 (1.104) | 0.00076 (0.00013) |
| | | (256, 128) | 3.045 (1.487) | 0.0011 (0.000061) |
| | | (1024) | 8.376 (0.094) | 0.0021 (0.000061) |
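Wall-clock timings like those in Table 5 can be collected with `time.perf_counter`. A minimal sketch (not the paper's actual benchmark harness) measuring training and prediction time for one model on synthetic data:

```python
import time

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data standing in for the real listings.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = rng.normal(size=2000)

model = KNeighborsRegressor(n_neighbors=5)

# Time the fit step.
t0 = time.perf_counter()
model.fit(X, y)
train_time = time.perf_counter() - t0

# Time the predict step over the whole set.
t0 = time.perf_counter()
preds = model.predict(X)
predict_time = time.perf_counter() - t0
```

In practice each measurement would be repeated across configurations and folds, then averaged, to produce the means and standard deviations reported in the table.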

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Baldominos, A.; Blanco, I.; Moreno, A.J.; Iturrarte, R.; Bernárdez, Ó.; Afonso, C. Identifying Real Estate Opportunities Using Machine Learning. *Appl. Sci.* **2018**, *8*, 2321.
https://doi.org/10.3390/app8112321
