4.1. OLS Model Interpretation
Before performing PCA on the set of explanatory variables, the whole data set was preprocessed. Preprocessing included scaling, standardization and label encoding, in which each categorical variable with n possible values is transformed into n binary variables. The PC predictors were then derived, and 10 PCs with eigenvalues higher than 1 were retained (Kaiser–Harris criterion). The list of retained principal components is given in Table 2.
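The preprocessing and component selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `one_hot` and `kaiser_pca` and the toy columns are hypothetical, and PCA is assumed to be run on the correlation matrix of the standardized variables.

```python
import numpy as np

def one_hot(labels):
    """Encode a categorical column with n distinct values as n binary columns."""
    categories = sorted(set(labels))
    return np.array([[1.0 if v == c else 0.0 for c in categories] for v in labels])

def kaiser_pca(X):
    """Standardize X, run PCA on the correlation matrix, keep eigenvalues > 1."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)          # returned in ascending order
    order = np.argsort(eigvals)[::-1]                # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0                             # Kaiser-Harris criterion
    # Loadings: correlations between the variables and the retained components.
    loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])
    scores = Z @ eigvecs[:, keep]                    # PC predictors
    return eigvals, loadings, scores

# Hypothetical toy data: three numeric predictors and one categorical predictor.
rng = np.random.default_rng(0)
area = rng.uniform(20, 200, size=200)
rooms = np.round(area / 30) + rng.integers(0, 2, size=200)
year = rng.integers(1900, 2013, size=200).astype(float)
construction = rng.choice(["brick", "concrete", "wood"], size=200)

X = np.column_stack([area, rooms, year, one_hot(construction)])
eigvals, loadings, scores = kaiser_pca(X)
communalities = (loadings ** 2).sum(axis=1)          # the h2 column of a loading table
```

The `scores` array plays the role of the PC predictors used in the regression, while `loadings` and `communalities` correspond to the PCi and h2 columns of a loading table.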
The column labelled PCi contains the component loadings, which represent the correlations of the observed explanatory variables with the principal components. Component loadings are used to interpret the meaning of the components. The dark shaded table cells indicate strong correlations (over 0.60), while light shaded cells indicate high or moderate correlations between the PCi and the explanatory variable. The column labelled h2 contains the component communalities, which represent the amount of variance in a variable explained by the components, that is, the proportion of each variable’s variance that can be explained by the principal components.
The row labelled SS loadings contains the eigenvalues associated with the components. An eigenvalue is the standardized variance associated with a particular component (in this case, the value for the first component is 3.44). The row labelled Proportion Var represents the proportion of variance accounted for by each component. Here, it can be seen that the first principal component, PC1, includes a number of variables with high loadings and accounts for 10 percent of the variance in the 36 variables.
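As a worked check of the figures above, and assuming the standard definitions for PCA on standardized variables, the proportion of variance and the communality can be written as follows. The symbols are introduced here for illustration only: \lambda_k is the eigenvalue of component k, p = 36 is the number of variables, and \ell_{jk} is the loading of variable j on component k.

```latex
\text{Proportion Var}_k = \frac{\lambda_k}{p},
\qquad
\text{Proportion Var}_1 = \frac{3.44}{36} \approx 0.096 \approx 10\%,
\qquad
h_j^2 = \sum_{k=1}^{10} \ell_{jk}^2
```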
Component loadings indicate the relative contribution of the observed variables to each principal component, which can help in the interpretation of the meaning of the PCs in real-world settings.
Concerning component loadings for PC1 and strong correlations with predictors such as year of installations (pipelines, wires) replacement, year of construction, year of window replacement, year of facade insulation and year of roof replacement, PC1 could be interpreted as “age—quality of building and apartment”.
Due to its strong correlation with the apartment floor number, total number of floors in the building, apartment floor number (starting above the ground floor) and number of apartments in the building, PC2 is interpreted as “floors”.
PC3 is strongly positively correlated with the elevation of the building itself above sea level, the northing Y coordinate and the proximity to river banks, and negatively correlated with the easting X coordinate and the proximity to the airport.
The threat of flooding has been noted in Ljubljana, and buildings above common flood levels are highly valued. The main city business development axis is directed towards the north, and spatial price trends of housing follow it. On the other hand, the eastern part of the city is more industrialized, and a large railway yard extends eastward from the downtown core. Distance to the airport (noise), as an environmental element, influences prices negatively. The proximity to riverbanks and the lake shore, as an environmental amenity, influences apartment prices positively. PC3 is interpreted as “topography and general environment”.
Due to its high component loadings on the apartment living area, apartment total area and number of rooms in an apartment, PC4 is recognized as the “Size” context.
PC5 has a high positive correlation with proximity to main city roads, distance to green areas such as forests, proximity to “city bus lines” stations and distance to university facilities, and a negative correlation with the distance from regional roads. For that reason, it can be interpreted as “accessibility within the city”.
PC6 is interpreted as the “recreation, rivers and forests” context because of its high component loadings on distances to river banks and distances to green areas and forests. The river courses and forest boundaries coincide with cadastral community boundaries, and for that reason this principal component also has a strong correlation with cadastral community IDs.
An exceptionally high correlation (>0.9) with the distance to highway entrances and the distance to highway lane makes PC7 the “regional accessibility” context.
Given that railway corridors represent the boundaries of real estate market zones in Ljubljana and that PC8 has a high correlation with the real estate market zoning and the distance to the railway, PC8 is interpreted as “RE market zoning”.
PC9 is interpreted as “apartment density in the building” due to its correlation with the number of apartments in a building and the apartment ID within a building. Apartment IDs increase from the bottom to the top of a building.
PC10 is interpreted as the “construction” context owing to its component loading values of construction type (brick, concrete, wood) and housing type (single, double, row houses).
The date of transaction is a time-specific variable and shows no significant correlation with any of the principal components, which implies that the regression model in our case does not take into account the price differences over the considered time period. The summary output of the multiple regression model on the PCs shows that the model is statistically significant but explains only 23% of the variability.
4.2. Random Forest Model Interpretation
Similar to the application method of the multiple regression model, we examined the importance of the predicting variables, calculated by permuting the out-of-bag (OOB) data. Using a trial-and-error method, we decided to use only the ten highest-ranked predictors out of 36 (Figure 3) to build the RF model on the training data (70% bootstrap sample).
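The idea behind permutation-based importance can be sketched as follows. This is a simplified, model-agnostic illustration rather than the exact procedure used here: in a random forest the permutation is applied tree by tree to each tree's out-of-bag samples, whereas this sketch permutes the columns of a single evaluation set. The function `permutation_importance`, the stand-in model `toy_predict` and the synthetic data are all hypothetical.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each column of X is shuffled, one at a time."""
    rng = np.random.default_rng(seed)
    base_mse = np.mean((y - predict(X)) ** 2)
    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling column j breaks its link with y but keeps its distribution.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(np.mean((y - predict(X_perm)) ** 2) - base_mse)
        importance[j] = np.mean(increases)
    return importance

# Hypothetical data: the third column has no influence on the response.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1]

def toy_predict(X):
    """Stand-in for a fitted model's predict method (assumption for illustration)."""
    return 3.0 * X[:, 0] + 0.5 * X[:, 1]

importance = permutation_importance(toy_predict, X, y)
ranking = np.argsort(importance)[::-1]   # predictor indices, most important first
```

Keeping only the first ten entries of such a ranking corresponds to the selection of the ten highest-ranked predictors.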
The scatter plots (Figure 3) were used to depict the bivariate relationships (predictors’ values versus price/m2). Since the overlap of data points in scatter plots makes it difficult to discern the relationship, a smoothed curve was fitted through the cloud of points to describe the general relationship between the variables and apartment prices.
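A smoothed curve of this kind can be sketched as follows. The text does not name the smoother that was used, so a Gaussian-kernel (Nadaraya-Watson) smoother is assumed here purely for illustration; the function name `kernel_smooth`, the bandwidth and the synthetic data are hypothetical.

```python
import numpy as np

def kernel_smooth(x, y, grid, bandwidth):
    """Nadaraya-Watson estimate: Gaussian-weighted mean of y around each grid point."""
    diffs = (grid[:, None] - x[None, :]) / bandwidth
    weights = np.exp(-0.5 * diffs ** 2)
    return (weights @ y) / weights.sum(axis=1)

# Hypothetical scatter: price per m2 falling with living area, plus noise.
rng = np.random.default_rng(0)
living_area = rng.uniform(20, 200, size=400)
price_m2 = 2500 - 4.0 * (living_area - 20) + rng.normal(0, 150, size=400)

grid = np.linspace(30, 190, 17)
trend = kernel_smooth(living_area, price_m2, grid, bandwidth=10.0)
```

A larger bandwidth gives a smoother curve that hides local features, while a smaller one follows the noise more closely.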
The interpretation of the relationships between particular variables and apartment price is given in ranked order of importance:
1. Year built (year of construction): recently built apartments have higher prices per m2, in general. Nowadays, prices for apartments in buildings built from the year 1900 to the period before the end of World War II (WW2) have risen. The prices are lower for apartments in buildings built in the period just after WW2, as the economic standard of living in that period was low and, consequently, a lower quality of construction was common in Slovenia and in Ljubljana. Apartment prices for buildings built after 1950 rise from an average of 2100 €/m2 up to 3200 €/m2 for newly built buildings.
2. Living_area (apartment living area): smaller apartment living areas (studios and one-room apartments for young couples or single households are the most frequently traded on the market) indicate a higher price per square meter. In Ljubljana, the price per m2 declines from around 3000 €/m2 to about 2100 €/m2 as the living area grows from 20 m2 to 65 m2 (19.5 €/m2 for each additional m2 on this interval). Prices per m2 then rise rapidly with each additional m2 of living area, up to 2300 €/m2 (for 115 m2 apartments), which might indicate the market segment of higher-income households, i.e., families with children and both parents employed. For larger apartments (from 115 m2 to 200 m2), the value per square meter declines with each additional m2.
3. Trans_date (date of transaction): represents the change in price/m2 in time (days) on the relative timescale from the beginning of 2008 until the middle of 2013 (1989 days)—a period of recession in Slovenia. The graph of price changes corresponds to the reported decline of the market price, from about 3000 €/m2 to about 2000 €/m2 in four and a half years.
4. Total_area (apartment total area): represents the sum of an apartment’s living area and the additional area for storage and the balcony. It is highly correlated with the predictor apartment living area. For that reason, changes in price/m2 follow approximately the same trend as for the apartment living area, but the influence of an enlarged area is reduced by about 200 €/m2.
5. Year_ren_inst (year of installations’ replacement): the utilities and installations in an apartment physically deteriorate or depreciate over time and must be replaced. The more recent the replacements are, the higher the average price/m2 of apartments in the sample is. Apartments with old installations have an average price of 1700 €/m2, whereas apartments with newly replaced gas, electrical and plumbing installations have an average price of 2800 €/m2.
6. Dist_Reginal road (proximity to regional roads): regional roads bring traffic to the city of Ljubljana from surrounding regions and are connected to the ring motorway built around the city. Their influence on the housing price decreases slightly at up to 2 km of distance and then increases to a distance of approximately 3 km. The price then decreases again over larger distances from the ring motorway.
The decline in average prices from 2600 €/m2 to 2300 €/m2 compared to the growing distance from regional roads might be understood as a negative influence of increased walking distances to public transportation flow, which is usually located along regional roads.
7. ID_apartment (apartment ID within a building): apartments in Slovenia are strictly numbered from the bottom to the top of the building. The prices are fairly constant as the apartment ID increases but begin to plateau at around ID number 300. Only high-rise condominiums have apartment IDs over 300. Slightly growing prices above building unit 300 represent top-floor positions of apartments in the buildings and penthouse positions with excellent views over lower condominiums. In Ljubljana, these higher positions offer views over the Kamnik–Savinja Alps mountain range to the north or views onto the medieval city centre and Castle Hill, surrounded by the river Ljubljanica. This predictor is, as expected, correlated with the predictors nu_flt_in_build (number of apartments in the building), flt_floor (apartment floor number) and flt_floor_base (apartment floor number above the ground floor). There are some underground and half-underground apartments which are not desirable on the market and therefore do not reach high prices.
8. Dist_Railway (proximity to railway): in general, apartments closer to railroads yield lower prices; the average price/m2 close to railroads is about 2250 €/m2, and the (environmental noise) influence of railroad proximity on housing disappears after about 1.5 km, where the average prices are above 2700 €/m2 in Ljubljana. However, there is a special situation in Ljubljana, where degraded land and abandoned buildings in several locations close to railroads were recently replaced by modern condominiums of high-quality construction, with an average price of around 2400 €/m2.
9. No_apart (number of apartments in the building): the price/m2 grows from 2100 €/m2 to about 2300 €/m2 for buildings with up to 20 apartments. In this size class, about 83% of the 1550 buildings are not equipped with an elevator. The remaining 17% of the buildings have higher apartment prices. From 20 to about 100 apartments per building, the price is almost stable. However, the price/m2 declines for buildings with very large numbers of apartments.
10. ID_building (building ID within cadastral community): new buildings have the largest available number in the sequence within the cadastral community, and apartment prices/m2 for newly built structures are higher than older ones. However, there is an anomaly in the graph for the interval from approximately ID 3000 to ID 5000, which is the result of random chance. The ID numbers of buildings from the abovementioned interval (in four cadastral communities) correspond to the neighbourhoods of higher prices (Zupančičeva jama, Trnovski bloki, Vič south of Cesta na Brdo and Šiška elite settlement Koseze pond).
Comparing, semantically, the set of interpretations of the top ten PCs with the set of top ten ranked predictors selected by RF (importance calculated from permuting OOB data), we can conclude that the two models have equivalent ratings for 7 out of 10 variables (Figure 4).
4.3. The Comparison of OLS and RF Performance
The comparison of OLS and RF performance was conducted in an out-of-sample prediction context using a stratified five-fold cross-validation procedure with training sets consisting of 70% of all transactions and test sets consisting of 30% of all transactions. In “stratified” cross-validation, training and test sets have the same spatial and price value distribution as the full dataset [31]. In addition, “stratified” is a variant of the k-fold within training data and also ensures that each fold has the right proportion of samples in regard to spatial location and price values.
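The stratification step can be sketched as follows. This is a minimal illustration, assuming that strata are formed by crossing a spatial zone identifier with price quartiles; the text does not state how the strata were actually built, and the function `stratified_folds`, the four zones and the synthetic prices are hypothetical. Only the fold assignment is shown; a stratified 70/30 split can be drawn in the same way.

```python
import numpy as np

def stratified_folds(strata, k=5, seed=0):
    """Assign each sample to one of k folds so every fold mirrors the strata mix."""
    rng = np.random.default_rng(seed)
    strata = np.asarray(strata)
    folds = np.empty(len(strata), dtype=int)
    for s in np.unique(strata):
        # Shuffle the members of this stratum, then deal them out round-robin.
        idx = rng.permutation(np.flatnonzero(strata == s))
        folds[idx] = np.arange(len(idx)) % k
    return folds

# Hypothetical transactions: a spatial zone and a price per m2 for each sale.
rng = np.random.default_rng(0)
zone = rng.integers(0, 4, size=1000)
price = rng.lognormal(mean=7.8, sigma=0.25, size=1000)

# Price quartile (0..3) crossed with zone gives up to 16 strata.
price_bin = np.digitize(price, np.quantile(price, [0.25, 0.5, 0.75]))
strata = zone * 4 + price_bin
folds = stratified_folds(strata, k=5)

# One train/validation split per fold.
splits = [(np.flatnonzero(folds != f), np.flatnonzero(folds == f)) for f in range(5)]
```

Because each stratum is dealt out evenly, every fold contains nearly the same share of each zone and price class as the full sample.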
Predefined performance measures for both sets of data are given in Table 3. All performance measures (SR, MAPE, COD) indicate that significantly better results were obtained by RF in comparison to OLS. Considering the R2 values for OLS and RF (0.23 and 0.57, respectively) and the noticeably lower MAPE and COD values for the RF model, it can be concluded that we are facing a non-linear problem. RF outperforms the OLS model in this non-linear situation; the COD values for OLS even exceed the recommended upper limit (17% > 15%). The obtained results for the sales ratio (SR) were adequate for both applied methods, for both the training and test data sets, in accordance with the approved range of 0.9–1.1 [29].
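The performance measures can be sketched as follows. The definitions used here are assumptions based on common mass appraisal practice, not taken from this section: the sales ratio is the predicted price divided by the actual price, SR is its mean, and COD is the mean absolute deviation of the ratios from their median, expressed as a percentage of the median. The function `appraisal_metrics` and the four example prices are hypothetical.

```python
import numpy as np

def appraisal_metrics(actual, predicted):
    """Return SR, MAPE (%), COD (%) and R2 for a set of predictions."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ratios = predicted / actual                       # sales ratio per transaction
    median_ratio = np.median(ratios)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return {
        "SR": ratios.mean(),
        "MAPE": 100.0 * np.mean(np.abs(predicted - actual) / actual),
        "COD": 100.0 * np.mean(np.abs(ratios - median_ratio)) / median_ratio,
        "R2": 1.0 - ss_res / ss_tot,
    }

# Hypothetical prices in EUR/m2.
actual = [2000, 2500, 3000, 3500]
predicted = [2200, 2500, 2700, 3500]
metrics = appraisal_metrics(actual, predicted)

sr_in_range = 0.9 <= metrics["SR"] <= 1.1             # approved SR range
cod_ok = metrics["COD"] <= 15.0                       # recommended upper limit
```

Under these definitions, an SR above 1 means that prices are overestimated on average, and an SR below 1 means that they are underestimated.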
The effects of predictions performed by the two methods were also compared by detailed visual inspection of the differences between the sales ratios (SR) of OLS and RF predictions at identical locations. Figure 5a shows the kernel density [32] of the differences between the average sales ratio values of the OLS and RF models for the apartments in each building (SR(OLS) − SR(RF)). The spatial distribution of the kernel density of sales ratio differences is shown in seven classes, from −0.100 to 0.100 (from dark brown to azure blue; the middle class is transparent). The locations representing positively signed differences between the sales ratios of OLS and RF assessments, coloured azure blue, are predominantly situated to the north of the CBD (north, northeast and northwest), which are the main directions of business activity development. They are also located, in smaller numbers, towards the east (at the confluence of the river Ljubljanica) and southwest, and are radially dispersed from the CBD along the regional connecting roads.
Most interesting is the distribution of negatively signed differences between the average sales ratios (RF sales ratio values are higher than OLS sales ratios). The locations of negative differences represent contemporary settlements of apartments, and they are marked by a red check mark symbol (Figure 5).
In order to obtain more insight into how the prediction models behave over the case study area, a Hot Spot Analysis was performed by calculating Getis–Ord Gi statistics [33]. The best way to interpret the Getis–Ord Gi statistic is in the context of the standardized Z-score values. A positive Z-score of the Gi statistic (red points, H-H, high clustering of high values) appears when the spatial clustering is formed by similarly high values (in our case average SR > 1). If the spatial clustering is formed by low values (in our case average SR < 1), the Z-score (blue points, L-L, high clustering of low values) tends to be negative. A Z-score of around 0 (transparent points, insignificant clustering) indicates no apparent spatial association pattern.
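The Z-score interpretation above can be illustrated with a small sketch. It computes the Gi* variant of the statistic (each location is included in its own neighbourhood) with binary fixed-distance weights; both choices are assumptions, since the weighting scheme is not stated here. The function `getis_ord_gi_star`, the 10 × 10 grid and the synthetic SR values are hypothetical.

```python
import numpy as np

def getis_ord_gi_star(coords, values, radius):
    """Z-scores of the Getis-Ord Gi* statistic with binary distance-band weights."""
    coords = np.asarray(coords, dtype=float)
    x = np.asarray(values, dtype=float)
    n = len(x)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    w = (dist <= radius).astype(float)        # includes the location itself
    mean = x.mean()
    s = np.sqrt(np.mean(x ** 2) - mean ** 2)  # population standard deviation
    w_sum = w.sum(axis=1)
    numerator = w @ x - mean * w_sum
    denominator = s * np.sqrt((n * (w ** 2).sum(axis=1) - w_sum ** 2) / (n - 1))
    return numerator / denominator

# Hypothetical average SR on a 10 x 10 grid of buildings, index = 10 * i + j.
rng = np.random.default_rng(0)
coords = np.array([(i, j) for i in range(10) for j in range(10)])
sr = 1.0 + rng.normal(0, 0.01, size=100)
high = (coords[:, 0] < 3) & (coords[:, 1] < 3)      # overestimated corner
low = (coords[:, 0] >= 7) & (coords[:, 1] >= 7)     # underestimated corner
sr[high] += 0.2
sr[low] -= 0.2

z = getis_ord_gi_star(coords, sr, radius=1.5)
hot = z > 1.96      # H-H clusters, average SR > 1
cold = z < -1.96    # L-L clusters, average SR < 1
```

The radius must be small enough that no neighbourhood covers the whole sample, otherwise the denominator becomes zero.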
The hot spot spatial clusters of average SR were mapped over the mass appraisal map obtained by kriging interpolation of the considered transactions (Figure 5).
It is obvious that both models followed similar spatial patterns over the case study area. Both methods underestimated the higher prices of apartments (SR < 1, blue points over orange areas), and overestimated the lower prices of apartments (SR > 1, red points over light green areas).
By coupling the results of the Hot Spot Analysis with the kernel density of the difference between average SRs for OLS and RF, it is evident that blue points, where RF underestimated actual prices, coincide with dark brown areas (areas where SR(RF) > SR(OLS)). Therefore, it can be concluded that RF predictions are closer to actual prices than OLS predictions in those areas. Those particular areas are the CBD area as well as areas with contemporary locations, i.e., the areas with high apartment prices.
On the basis of the above facts, and considering the obtained performance metrics, it is suggested that RF predictions outperform OLS predictions. Namely, at the locations of higher differences in sales ratios (where the values are slightly higher), the RF model shows more sensitivity than the OLS model for capturing differences in values.