Comparing the Performance of Regression and Machine Learning Models in Predicting the Usable Area of Houses with Multi-Pitched Roofs

Dawid, Leszek; Barańska, Anna Marta; Baran, Paweł

doi:10.3390/app15116297


by Leszek Dawid ¹, Anna Marta Barańska ² and Paweł Baran ³,*

¹ Faculty of Civil Engineering, Environmental and Geodetic Sciences, Technical University of Koszalin, 75-453 Koszalin, Poland

² Faculty of Mining Surveying and Environmental Engineering, AGH University of Krakow, 30-059 Krakow, Poland

³ Institute of Economics and Finance, University of Szczecin, 71-101 Szczecin, Poland

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(11), 6297; https://doi.org/10.3390/app15116297

Submission received: 18 April 2025 / Revised: 23 May 2025 / Accepted: 30 May 2025 / Published: 3 June 2025


Abstract

The usable floor area is one of the key parameters when appraising residential property. In Poland, valuers have to base their analysis on data from the Real Estate Price Register (RCN) in order to value a property. Unfortunately, these data often turn out to be incomplete, especially with regard to floor area, which makes the selection of reference properties difficult and can lead to erroneous valuation results. To address this problem, a study was conducted that used linear models, non-linear models and machine learning algorithms to calculate the floor area of buildings with complex multi-pitched roofs. The analysis was conducted using data sourced from the Database of Topographic Objects (BDOT10k). Three key factors were identified to provide a reliable estimate of usable floor area: the covered area, the height of the building and, optionally, the number of storeys. The results show that the linear model based on the design data achieved an accuracy of 88%, the non-linear model achieved 89% and the machine learning algorithms achieved 93%. For the existing building data from the city of Koszalin, the best model achieved an accuracy of 90%. The estimated values of the usable area of the building designs for the best model on the test set differed on average from the true ones by 8.7 m², while for the existing buildings, the difference was 9.9 m² on average (in both cases, the average relative error was about 7%).

Keywords: real estate appraisal; neural networks; urban remote sensing; GIScience; linear and non-linear regression; mathematical modelling; machine learning

1. Introduction

This study is a continuation of the research presented in [1]. The comparative approach is recognised as one of the most widely used and most reliable methods of property valuation [2]. It involves comparing the property to be valued with others that have recently been sold on the local property market. The basis of this approach is data on comparable transactions, in which the usable area of the property plays a key role [3,4]. In Poland, valuers use the Real Estate Price Register (REPR), a database that is accessible to authorised professionals and is held at the local County Office. It contains information on property prices derived from notarial deeds. The REPR was introduced in 2001 and updated in 2021 to comply with new legislation [5]. The data in this database relate to all purchase/sale transactions and cover the entire country. The REPR is a practical source of real estate data that is linked to other state registers [6,7,8,9,10,11].

Although the REPR is an important tool to support property valuation, its main drawback is the incompleteness of the data, especially with regard to the usable floor area of residential buildings. Research carried out in the West Pomeranian Voivodeship, in the districts of Koszalin and Kołobrzeg, showed that only in 40% of cases was information on the usable area of the properties under investigation available in the REPR [12,13,14]. Such deficiencies significantly hamper the comparison of properties and may lead to errors in valuations, especially in small local markets where the number of comparable transactions is severely limited.

Given the limitations of the REPR in terms of data completeness, one potential solution to the lack of floor area data is the use of the publicly available Database of Topographic Objects (BDOT10k). BDOT10k is a detailed spatial database developed for the entire territory of Poland. It contains topographic information, including data on buildings, transportation networks, watercourses, land use, and other landscape features [15,16,17]. For buildings, the database includes parameters such as height, length, width, and perimeter. The data sources include geodetic measurements and airborne laser scanning (LiDAR) technology. In this study, data on buildings located in Koszalin were obtained from the Geoportal in CityGML 2.0 format [18] and processed using QGIS Desktop v. 3.28.2. software [19]. Missing information, such as the number of floors, was supplemented using observations from Google Street View [20].

In Poland, LiDAR datasets are available in LoD1 and LoD2 formats. The LoD1 configuration features a density of 4 points/m² but does not include detailed roof geometry. The second level, LoD2, has a higher density of 12 points per m² and includes a detailed roof structure, including building height and additional building elements such as garages. Most LiDAR data in Poland are available at the LoD2 level, but for the residential buildings in Koszalin analysed in the article, only the LoD1 level is available, which limits the ability to estimate their height. Limitations in scanning accuracy at the LoD1 level can lead to significant errors in the assessment of building heights, which can be as high as 50%. This was calculated using Monte Carlo simulations in an earlier article [21]. In Poland, the quality of LiDAR scanning varies regionally. In the western part of the country, including Koszalin, the LoD1 standard applies, while in the eastern part, the more detailed LoD2 standard is used. In this paper, in order to avoid height measurement errors at the LoD1 level at this stage of the proposed method, building heights measured by surveyors were used.

In this article, we analyse detached residential buildings with multi-pitched roofs. These roofs are characterised by having more than two pitched planes that intersect at various angles [22]. They are primarily valued for their functionality, aesthetic appeal, and durability. Multi-pitched roofs not only drain rainwater efficiently and resist winds, but they also allow for efficient use of attic space. The aesthetic appeal of these structures is a key factor in their popularity, particularly in villas and modern residences where visual appeal is a priority. In addition, the larger surface area of these roofs allows for better design of the ventilation system, thereby enhancing the comfort of the building and extending its lifespan. However, it should be noted that the construction of multi-pitched roofs is more complicated, resulting in higher costs compared to flat or gable roofs. This study focuses on hipped and more complex roof types. A hipped roof consists of four slopes—two triangular and two trapezoidal. This structure provides effective protection against precipitation and contributes to an aesthetically pleasing appearance.

This work presents an approach for determining the usable floor space in residential buildings with sloped roofs, utilising information derived from the BDOT10k dataset. Other studies and analyses are primarily focused on property values [23,24,25,26]. The current study combines mathematical modelling, commonly used in property valuation, with machine learning techniques, which are less frequently applied to determining building parameters. This is an innovative approach. One of the study’s strengths is the incorporation of multiple types of data. The proposed approach is ready to use the LiDAR-based BDOT10k data. The study demonstrates that the usable floor area of a building can be reliably estimated using just a few parameters. Previous analyses, including those by Dawid et al. [27], demonstrated that it is possible to accurately determine the usable floor area of flat-roofed houses using topographic data and methods such as regression or neural networks, achieving an error below 9%. In the case of buildings with pitched roofs [21], it has been shown that linear regression models and neural networks, based on variables such as building area, number of floors, and knee–wall height, can provide accuracy with errors of up to 5% (regression) and 3% (neural network). However, for newer buildings from 2020 to 2022, which differ in architectural style, accuracy was lower, with a maximum error of 15%. In another study [1], based on 191 designs and 48 existing buildings in Koszalin, fairly high precision in estimation was achieved. The non-linear model demonstrated a marginal improvement in performance when compared to the linear model, with an estimated accuracy of approximately 91.5%. For the existing buildings in Koszalin, the optimal fit was 93%.

A similar study by Janowski et al. [28] gave promising results. The presented approach employed LiDAR data, official data on building area, and optical recognition of the number of rooms and floors from photographs of buildings. These were modified with specific multipliers in groups of homogeneous buildings. This resulted in a discrepancy between the estimation of usable area by the model and that calculated according to the standards. The discrepancy ranged from 6% for buildings of typical size and construction to 17% for large or atypical buildings. This translated into absolute errors ranging from 7.5 to 34 m². However, a comparison with the actual floor area of the buildings presented in the same article shows a high variance in the results (for several buildings in the test group, the discrepancy between the modelling result and the actual floor area is of the order of 100 m², and in one case exceeds 300 m²), which currently rules out the applicability of this approach in practice.

This article utilises data from two distinct sources. The initial source was architectural designs, and the second was existing buildings. The primary rationale for this approach is that the documentation from the architectural offices contains a comprehensive and detailed specification of the buildings, developed according to known standards, which is a significant advantage compared to the data on existing buildings. Architectural designs also provide precise information about the interior of the building and the structural solutions used, which have a considerable effect on the accurate determination of the floor area. The existing building data, on the other hand, can be subject to measurement inaccuracies, which can compromise the integrity of the analyses. The utilisation of detailed architectural designs is a notable advantage, given the strong correlation between the accuracy of linear models, non-linear models and machine learning methods, and the quality of the input data. In this study, we have extended previous research by analysing buildings with multi-pitched roofs, in which estimation of the floor area is even more difficult because the slopes limit usable space on the top floor.

2. Materials and Methods

2.1. Polish Standards for Calculating Usable Floor Area

The usable area is one of the most important parameters for property valuation, but its calculation is not a straightforward process. Over the years, different measurement standards have been in force in Poland, which has led to inconsistencies in the results. Until 1999, the PN-B-02365:1970 standard [29] was widely used, according to which only usable rooms such as living rooms, kitchens or bathrooms were to be considered as part of the usable area. In the case of multi-storey houses, the type of roof and its slope were also important, as the usable area also depends on the height of the rooms (see [30,31] for further details). Subsequently, the standard PN-ISO 9836:1997 [32] was introduced, which excluded areas such as basements or garages. Since 2012, the standard PN-ISO 9836:2015-12 [33] has been in force, which only applies to new buildings (see details in [29,31,32,33,34,35,36,37,38,39]). This standard is consistent with European standards. It provides comprehensive regulations on measuring usable area, including the area under sloping roofs. According to this standard, the usable floor area is calculated as follows: the area with a height above 2.2 m is fully included, the area with a height between 1.4 and 2.2 m is counted at 50%, and areas below 1.4 m are completely excluded from the calculations. Unlike previous standards, this standard does not take partition walls into account when calculating the usable floor area. The changing regulations and the variety of standards lead to difficulties in interpreting the data. In the case of older properties, non-compliance with the current standards can result in valuation errors of several percent or more, depending on the size of the building or the type of roof. Moreover, the REPR frequently lacks information on the standard according to which the usable floor area of the building was calculated, which makes it even more difficult to compare properties. Detailed information on the norms and their discrepancies can be found in available scientific studies and Polish standards [21,22,23,24,25,26]. This study focuses on estimating the floor area of single-family homes with multi-pitched roofs using the principles set out in the PN-ISO 9836:2015-12 standard.
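The height-based weighting described for PN-ISO 9836:2015-12 can be illustrated with a short sketch. It is only an illustration of the three height bands quoted in the text. The function name, the input layout (a list of floor-area zones with their clear heights) and the treatment of the exact boundary values 1.4 m and 2.2 m are assumptions, not part of the standard's wording as cited here.

```python
def usable_area(zones):
    """Sum usable floor area from (area_m2, clear_height_m) zones.

    Weights follow the bands quoted in the text:
    height above 2.2 m -> 100 %, 1.4 m to 2.2 m -> 50 %, below 1.4 m -> 0 %.
    Counting exactly 1.4 m and exactly 2.2 m at 50 % is an assumption.
    """
    total = 0.0
    for area, height in zones:
        if height > 2.2:
            total += area          # fully included
        elif height >= 1.4:
            total += 0.5 * area    # counted at 50 %
        # zones lower than 1.4 m are excluded
    return total


# Hypothetical attic: 40 m2 at full height, 10 m2 under the slope, 6 m2 too low.
attic = [(40.0, 2.5), (10.0, 1.8), (6.0, 1.0)]
print(usable_area(attic))  # 45.0
```

In this example only 45 m² of the 56 m² of attic floor counts as usable, which shows why roof slopes make the usable area hard to infer from the covered area alone.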

2.2. Data from Design Offices and Data on Existing Single-Family Houses in Koszalin

In order to develop regression models, both linear and non-linear, and to efficiently train machine learning models for predicting the buildings’ usable area, it is crucial to use reliable and complete data. In the present study, two data sources were utilised: architectural designs obtained from online sources and measurements of actual buildings.

The first dataset came from the architectural offices Lipinscy [40], Archon [41] and Extradom [42], which provide detailed building designs and interior plans. This documentation has been developed according to existing standards for calculating floor space, thus enabling precise analysis and minimising errors in the models. Architectural designs also offer a more accurate indication of living spaces, which is a crucial factor in the property valuation process. A total of 436 residential projects were analysed, divided into two groups. The data from the first two sources constituted the first sample (previously partially analysed in paper [1]), and the data from the third source constituted the second sample. All design parameters were adopted without modification. The buildings analysed also included garages and boiler rooms, which are not included in the building floor area. The building parameters considered are shown in Table 1.

The second dataset contained data on buildings in Koszalin, in the north-western region of Poland. The selected buildings, constructed between 2020 and 2023, featured multi-pitched roofs and diverse architectural styles. Information about the parameters of these buildings was obtained from documentation provided by the District Building Control Inspector.

The study was divided into several stages. Initially, classical multiple linear regression was employed to estimate the floor area of buildings. The second stage involved analysis using non-linear regression. The third stage involved the use of regularised linear regression to increase the accuracy of the models and eliminate variables with less predictive power. The fourth stage of analysis used advanced machine learning methods. The objective of each stage was to identify the optimal model for predicting the value of the floor area of buildings. The analyses were carried out independently for a dataset of architectural designs and existing buildings. The paper also makes reference to earlier studies presented in [1]. In particular, a comparison is provided of the results obtained using machine learning methods in the current study with the results obtained in the current and previous articles using linear and non-linear modelling. The parameters of the models were estimated on the basis of the analysed datasets. If attempts are made to adapt the models to other datasets, optimal parameter values should be estimated separately. Nevertheless, the analyses which were carried out indicate a recurring set of significant independent variables in the regression models, and the approach itself has sufficient versatility to potentially serve as a general methodology for modelling missing data in databases containing different property characteristics.

2.3. Linear and Non-Linear Modelling

During the first and second phases of the analysis, the parameters of both multiple linear and non-linear regression models were estimated. Linear models took on the form:

A_{U} = a_{0} + \sum_{i = 1}^{m} a_{i} \cdot X_{i}

(1)

while non-linear models were of the form:

A_{U} = a_{0} + \sum_{i = 1}^{m} f_{i} (X_{i})

(2)

where A_U—the explained variable, usable floor area of the building;

a₀—the constant term in the equation;

a_i—linear regression model coefficients;

X_i—explanatory variables of the model, listed in Table 1;

m—the number of explanatory variables in the model;

f_i—a function of the dependence of A_U on X_i, selected on the basis of the scatterplot of A_U vs. X_i.

In model (1), it is assumed that each explanatory variable has a constant effect on the explained variable A_U. To select the form of the non-linear model, the first step of the parameter estimation procedure was to produce a series of scatter plots showing the dependence of A_U on each explanatory variable. As in [1], the objective of this step was to build a non-linear model (2) that would approximate the relationships between the buildings’ features more accurately than a linear regression model. The detailed procedure can be found in [43].
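As an illustration of how the coefficients a₀ and a_i of model (1) can be estimated, the following minimal sketch solves the least-squares normal equations in plain Python. It is not the authors' estimation code. The function name and the toy data (two explanatory variables standing in for covered area and building height) are hypothetical.

```python
def fit_ols(X, y):
    """Least-squares fit of y = a0 + sum_i a_i * X_i via the normal equations."""
    n = len(X)
    # Design matrix with a leading column of ones for the constant term a0.
    D = [[1.0] + list(row) for row in X]
    p = len(D[0])
    # Augmented system [D^T D | D^T y].
    A = [
        [sum(D[k][i] * D[k][j] for k in range(n)) for j in range(p)]
        + [sum(D[k][i] * y[k] for k in range(n))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Hypothetical toy data generated from A_U = 10 + 0.8 * X1 + 5 * X2.
X = [(100, 6), (120, 7), (90, 5), (150, 8), (110, 9)]
y = [10 + 0.8 * x1 + 5 * x2 for x1, x2 in X]
print([round(c, 3) for c in fit_ols(X, y)])
```

A non-linear additive model of form (2) with polynomial f_i can be fitted with the same routine by adding columns such as X_i² to the input rows.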

The quality of the model’s fit to the data was verified using the coefficient of determination. This coefficient was then tested for its significance and corrected for the number of degrees of freedom of the model. The presence of outliers, which have the capacity to reduce the model’s fit and distort the true relationships, was also analysed. It was assumed that the number of such observations should not exceed approximately 10% of the total set. The next step in strengthening the model was to progressively eliminate the statistically insignificant components of the multivariate model, starting with the least significant, while monitoring the increasing coefficient of determination and changing the significance level of the remaining components. Among the non-significant explanatory variables were h (knee–wall height) and B (boiler room) in both types of models. Following the elimination of the variables h, B and 26 outlier observations (10%) in the linear model, there was a substantial improvement in the model’s fit: the adjusted coefficient of determination increased from the original 74% to 88%. In the case of the non-linear model, 28 outliers (11%) and 4 of the 10 model components had to be removed, which eliminated the same two explanatory variables (h and B), to achieve an 89% fit of the model to the data, after the initial 75%.

The need to remove a larger number of outlier cases in the non-linear model indicates that it fits the full dataset slightly worse than the linear model, thus confirming the conclusions of earlier analyses reported in [1]. The gain in performance of the non-linear model is so small that the earlier conclusion remains valid: the effort required by non-linear modelling is not justified when estimating the usable floor area of residential buildings.

An alternative to regression models with a stepwise procedure for removing variables and cases is the use of regularised models. Regularisation is a technique that aims to reduce the over-fitting of a model. It involves adding a penalty function to the minimised sum of squares, which results in either a reduction in the number of variables in the model or a reduction in the values of the model parameters; this in turn reduces the dependence of the estimate on random fluctuations in the explanatory variables. Linear models with both L1-type (LASSO, least absolute shrinkage and selection operator) [44,45] and L2-type regularisation (ridge regression), as well as a hybrid of these, were used in an approach described in [45,46].

Formally, we write the model with L1 regularisation as a function minimisation task (finding the parameter vector β that gives the smallest value of the expression):

\sum_{i} {(y_{i} - β_{0} - \sum_{j} β_{j} x_{i j})}^{2} + λ \sum_{j} |β_{j}|

(3)

where x_{ij} is the value of the j-th explanatory variable observed in the i-th object, β_{0} and β_{j} are model parameters, and λ is the regularisation parameter (penalty term).

Similarly, the L2 regularisation can be written as minimising the expression:

\sum_{i} {(y_{i} - β_{0} - \sum_{j} β_{j} x_{i j})}^{2} + λ \sum_{j} β_{j}^{2}

(4)

A hybrid of the two approaches, elastic net, in which the penalty function consists of both expressions, will instead be written as the following minimisation:

\sum_{i} {(y_{i} - β_{0} - \sum_{j} β_{j} x_{i j})}^{2} + λ_{1} \sum_{j} | β_{j} | + λ_{2} \sum_{j} β_{j}^{2}

(5)

where the parameters λ_{1}, λ_{2} are often replaced by a pair λ, α in a modified penalty term λ (α \sum_{j} |β_{j}| + (1 - α) \sum_{j} β_{j}^{2}), and the simulation proceeds in two stages: for each value of λ from a certain set, a value of α is found that minimises (5), and then the best of the resulting models is selected.

Regularised models generally exhibit superior forecasting capabilities in comparison to the corresponding baseline linear regression models (containing the same explanatory variables). Moreover, models derived through LASSO eliminate some of the explanatory variables.
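The objective functions (3)–(5) and the (λ, α) parametrisation can be made concrete with a small sketch. It only evaluates the penalised sum of squares for given parameters and does not perform the minimisation itself. The function names and the toy data are hypothetical.

```python
def elastic_net_objective(X, y, beta0, beta, lam1, lam2):
    """Value of expression (5): residual sum of squares plus L1 and L2 penalties.

    lam2 = 0 gives the LASSO objective (3), lam1 = 0 the ridge objective (4).
    """
    rss = sum(
        (yi - beta0 - sum(b * x for b, x in zip(beta, xi))) ** 2
        for xi, yi in zip(X, y)
    )
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return rss + lam1 * l1 + lam2 * l2


def split_penalty(lam, alpha):
    """Convert the (lambda, alpha) pair into (lambda_1, lambda_2)."""
    return lam * alpha, lam * (1.0 - alpha)


# Hypothetical toy data that beta = (1, 1) fits exactly, so only the penalty remains.
X = [(1.0, 2.0), (2.0, 1.0)]
y = [3.0, 3.0]
lam1, lam2 = split_penalty(1.0, 0.25)  # (0.25, 0.75)
print(elastic_net_objective(X, y, 0.0, [1.0, 1.0], lam1, lam2))  # 2.0
```

Setting α = 1 reproduces the LASSO penalty and α = 0 the ridge penalty, which is why a single grid over (λ, α) covers all three regularised variants.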

2.4. Machine Learning Models

Within the approaches categorised as machine learning (ML) techniques, two groups of models were used. The first comprised boosted regression tree models obtained using the XGBoost method. The second comprised neural networks (NN) of simple design, containing one or two hidden layers. The rationale for this choice is as follows. We wanted to make a clear reference to the previous works by L. Dawid et al. [21,27] that we build upon, hence the neural networks. We were also looking for methods that can compete with neural networks in small to medium-sized regression tasks, which is where regression tree ensembles come in handy; we chose XGBoost as the representative of this group over Random Forests, which were also considered.

The XGBoost (eXtreme Gradient Boosting) method was proposed by Chen and Guestrin [47]. It is a variant of a multi-model approach, using regression trees, in which a certain model loss function is minimised, with a regularisation condition imposed on the tree structure, taking into account the number and weights of the leaves. Technically, the application of this method involves minimising the expression

\sum_{i} l (\hat{y_{i}}, y_{i}) + \sum_{k} Ω (f_{k})

(6)

where Ω (f) = γ T + \frac{1}{2} λ {‖w‖}^{2} is a regularisation term, l is a convex loss function (usually the mean square error, MSE), k is the index of the model (tree) in the ensemble, T is the number of leaves, w is the vector of leaf weights, and γ and λ are regularisation parameters. Such a task is solved with an iterative algorithm, described in detail in [47].
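To make expression (6) more tangible, the sketch below evaluates the regularised objective for a given ensemble. It assumes a squared-error loss and represents each tree only by its list of leaf weights. It does not reproduce the XGBoost training algorithm or its library API, and all names and numbers are hypothetical.

```python
def tree_penalty(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * ||w||^2 for one tree."""
    T = len(leaf_weights)  # number of leaves
    return gamma * T + 0.5 * lam * sum(w * w for w in leaf_weights)


def boosting_objective(y_true, y_pred, trees, gamma, lam):
    """Expression (6) with squared-error loss (an assumption for this sketch)."""
    loss = sum((yh - yt) ** 2 for yh, yt in zip(y_pred, y_true))
    return loss + sum(tree_penalty(w, gamma, lam) for w in trees)


# One hypothetical tree with three leaves.
trees = [[1.0, -2.0, 0.5]]
print(tree_penalty(trees[0], gamma=0.1, lam=2.0))
print(boosting_objective([1.0, 2.0], [1.5, 2.0], trees, gamma=0.1, lam=2.0))
```

Larger γ penalises trees with many leaves, while larger λ shrinks the leaf weights, so both parameters limit the complexity of each added tree.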

Next, we used neural networks as predictive models. We made several assumptions about the NN architecture. Since the task was relatively uncomplicated (non-linear regression on 6–8 variables), we excluded from our considerations all NNs other than simple multilayer perceptrons with either one or two hidden layers, as these are proven to be sufficient for regression tasks. Then, so as not to overcomplicate the architecture, we chose the ranges for the number of nodes in each hidden layer. This resulted in a relatively short list of NN architectures under consideration. The best-performing architecture for each task was then found by training multiple networks over a grid of these architectures. The neural networks used in the modelling procedure had one or two hidden layers and a ReLU activation function, and were trained with the Adam algorithm. The hidden layers in the different model variants for the design data contained between 3 and 20 fully connected neurons, and the training procedure covered between 800 and 2000 epochs, with 10–20% of the training data held out as test data and a training batch size of 12–24 observations (4.9–9.8% of the total number of observations). For the existing building data, the models had between 3 and 12 neurons in the hidden layers, the training procedure consisted of 1000 epochs, the test data consisted of 10% of the training set, and a batch of 4–8 observations was presented to the network at one time.

After obtaining the results from ML models, we decided to build hybrids combining baseline models with ML models or one ML model with another, in order to further improve our predictions.

To compare the predictions of the individual models, the MAE (mean absolute error) and RMSE (root mean squared error) measures were used, as in [1], given by the formulae

\mathrm{MAE} = \frac{1}{N} \sum_{i = 1}^{N} |y_{i} - {\hat{y}}_{i}|

(7)

and

\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} {(y_{i} - {\hat{y}}_{i})}^{2}}

(8)

where N is the number of observations in the validation set, y_{i} is the value of the dependent variable in the i-th object, and {\hat{y}}_{i} is the estimated value of A_U.
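Formulae (7) and (8) translate directly into code. The following minimal sketch uses hypothetical usable areas in m². The function names are illustrative only.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error, formula (7)."""
    return sum(abs(y - yh) for y, yh in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, formula (8)."""
    mse = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Hypothetical true and estimated usable areas (m2) of three buildings.
y_true = [100.0, 120.0, 150.0]
y_pred = [110.0, 115.0, 150.0]
print(mae(y_true, y_pred))   # 5.0
print(rmse(y_true, y_pred))
```

Because RMSE squares the errors before averaging, it is never smaller than MAE and reacts more strongly to a few large misestimates.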

Shapley values [48,49] were used to assess the impact of individual input variables on the estimated value of the dependent variable in ML models. This concept originates from game theory, where it is used to calculate the contribution of each player to the outcome of a cooperation between a group of players called a coalition. Formally, the Shapley value of the i-th player’s contribution to the payoff in a coalition game with n players and giving the coalition S⊆N payoff v(S) is given by the formula

ϕ_{i} (v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|! (n - |S| - 1)!}{n!} (v (S \cup \{i\}) - v (S))

(9)

where N is the set of all n players. Consequently, ϕ_{i} (v) is the weighted sum of all value differences between coalitions that contain the i-th player and the corresponding coalitions without them.
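Formula (9) can be evaluated exactly for a small game by enumerating all coalitions. The sketch below does this for a hypothetical three-player payoff function in which the players are named after explanatory variables. The payoff values are invented for illustration and are not results from the study, and this exact enumeration is not the simulation-based SHAP procedure.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, v):
    """Exact Shapley values by enumerating every coalition S of N without i."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for S in combinations(others, size):
                S = frozenset(S)
                total += weight * (v(S | {i}) - v(S))
        phi[i] = total
    return phi


def payoff(S):
    """Hypothetical payoff: A_C contributes alone, H and S_N only jointly."""
    value = 60.0 if "A_C" in S else 0.0
    if "H" in S and "S_N" in S:
        value += 30.0
    return value


phi = shapley_values(["A_C", "H", "S_N"], payoff)
print(phi)
```

The values sum to the payoff of the full coalition (90 in this toy game), and the joint contribution of H and S_N is split equally between them.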

The effect of individual variables on the estimated usable area in ML models (boosted trees) was measured using the SHAP (SHapley Additive exPlanations) method [50], which calculates Shapley values by simulation [49]. The SHAP value is a Shapley value, interpreted as the effect of the i-th variable on the model’s estimated value of v, where S∪{i} is the set of explanatory variables of the model.

The ML models were estimated on a laptop computer with an i5 processor and 16 GB of RAM in R v.4.4.2 (including glmnet v.4.1 and xgboost v.1.7 libraries) except for the NN models, which were estimated in a Python 3.10 environment (including Tensorflow 2.0 library).

3. Results

3.1. Data Characteristics and Preprocessing

Three data samples from the two sources described in Section 2 were used in the study. The data in every sample are divided into two parts: a training set and a validation set used to compare the performance of the estimated/trained models. In regression models with regularisation and in ML models, data from the training set were further subdivided into a training set and a test set of varying contents. Two of the samples, the old design data (a 191-object training set and a 10-object validation set) and the existing building data (48 training and 5 validation observations), are the datasets used previously in [1]. The third one, containing new design data, consists of 245 training records and 25 validation records. Some characteristics of the data samples, including their respective ranges, are presented in Table 1, and the data distributions are depicted in Figure 1. The main difference between the old and the new design data is that in the former the validation set was added incrementally to existing data, mimicking a real-life framework with new data being observed after modelling, while in the latter the validation set was drawn from a larger group of observations and thus better represents the respective distributions.

In the distribution diagram (Figure 1), it can be seen that there is considerable variation in the data and that the validation data (dark points) generally correspond with the training data in terms of distribution. The validation set of old designs matches the distribution of the A_U variable least well, with buildings of small-to-medium usable area being underrepresented. This can cause difficulties in the proper final evaluation of the models.

In order to preserve the best features of the algorithms used, data preprocessing was performed. All regularised regression models and all neural networks were trained using standardised features; i.e., all the variables (except for binary B and three-valued GP) were scaled according to the formula:

u_{i} = \frac{x_{i} - \bar{x}}{S (x)}

(10)

where x_{i} is the observed value of a variable X, \bar{x} is the mean, and S (x) is the standard deviation of X.
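The scaling in formula (10) can be sketched as follows. The text does not state whether the sample or the population standard deviation was used; the sample version (n − 1 in the denominator) is assumed here. The function name and the data are hypothetical.

```python
from statistics import mean, stdev


def standardise(values):
    """Scale a variable to zero mean and unit standard deviation, formula (10).

    Uses the sample standard deviation; this choice is an assumption.
    """
    m = mean(values)
    s = stdev(values)
    return [(x - m) / s for x in values]


# Hypothetical building heights in metres.
print(standardise([2.0, 4.0, 6.0]))  # [-1.0, 0.0, 1.0]
```

In practice the mean and standard deviation would be computed on the training set and reused for the validation set, so that both are scaled identically.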

3.2. Linear Regression

In the first stage of modelling, linear regression parameters were estimated based on a set of design data obtained from architectural offices. This set consisted of 245 one- and two-storey residential buildings with multi-pitched roofs. The final, optimal form of the linear model was obtained after rejecting 26 outliers and dropping two irrelevant variables, knee–wall height and boiler room; the adjusted coefficient of determination of this model is 88%. Table 2 shows the detailed results of the multiple regression analysis for a sample of 219 observations, described by 5 explanatory variables.

The standardised regression coefficients (β-weights) show that the A_C variable (built-up area) has the greatest influence on the explained variable and is thus the most effective predictor of floor area. The results of the regression model parameter significance tests (the last two columns of Table 2) confirm the adequacy of the five exogenous variables. In the rest of the article, the model is referred to as Model D.

3.3. Non-Linear Regression

Only two of the graphs of the dependence of the A_U variable (floor area of the building) on individual independent variables (Figure 2a,b) reveal a non-linear character. Based on these, the forms of the mathematical functions f_i for the given explanatory variables were selected, from which a multivariate additive non-linear regression model was built. The forms of the tested f_i functions that make up model (2) are given in Table 3. Polynomial functions of degrees 1, 2 and 3 and the exponential function were tested. For the binary variable S_N (number of storeys), a linear function was the only one that could be considered.

The initial acceptable form of the additive multivariate non-linear regression model took the form of Formula (11):

\begin{array}{l} A_{U} = a_{0} + b_{1} \cdot X_{1} + b_{2} \cdot X_{2} + c_{1} \cdot X_{3} + c_{2} \cdot X_{3}^{2} + \\ + d_{1} \cdot X_{4} + d_{2} \cdot X_{4}^{2} + d_{3} \cdot X_{4}^{3} + b_{3} \cdot X_{5} + b_{4} \cdot X_{6} + b_{5} \cdot X_{7} \end{array}

(11)

where A_U—the explained variable, usable floor area of the building;

a₀—the constant term;

X_i—independent variables, listed in Table 3;

b_i, c_i, d_i—model parameters.

As with the linear model, parameter estimation of the non-linear model was carried out several times, gradually removing outliers or irrelevant variables. Table 4 shows the results of parameter estimation of the final form of non-linear multiple regression, considered optimal, built on the basis of 217 observations, of Formula (12):

A_{U} = a_{0} + b \cdot A_{C} + c \cdot H + d_{1} \cdot G_{A} + d_{2} \cdot G_{A}^{2} + g \cdot S_{N} + h \cdot R_{S}

(12)

Twenty-eight cases (11%) were considered outliers. Half of them also appeared as outliers in the linear model, which supports the conclusion that they are truly incompatible with the analysed set.

The model predicts the value of a building’s floor area based on variables A_C, H, G_A², G_A, S_N, and R_S. An almost identical set of variables was found to be significant in our previous study [1]. Then, however, instead of the R_S variable, the set of significant variables included the h variable. The value of R² is 0.893; its adjusted value is slightly smaller and equals 0.890. This means that as much as 89% of the variance in the response variable is explained by the model. A small difference between R² and adjusted R² confirms that the model remains stable even after taking into account the relationship between the amount of data and the number of variables. The fit of the model to the data is not weaker than that of the linear model.
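The relationship between the two reported fit measures can be checked with the usual adjusted R² formula. The sketch below assumes that formula, and assumes p = 6 regressors counted from Formula (12) (A_C, H, G_A, G_A², S_N, R_S) with n = 217 observations; the function name is illustrative.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Values quoted in the text: R² = 0.893, n = 217; p = 6 is an assumption.
print(round(adjusted_r2(0.893, 217, 6), 3))
```

Under these assumptions, the result rounds to the reported 0.890.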

The final model includes explanatory (exogenous) variables with the parameters being statistically significant (last two columns of Table 4). The model shows that most of the buildings’ characteristics have high predictive power (statistically significant parameters at p-value < 0.05), except for the variable S_N, where the parameter can also be considered significant (but at p-value of 0.10). In the remainder of this paper, the model is referred to as model D_N.

For the purpose of clarity in the presentation, a comprehensive summary of the groups of models that are compared in the article is presented in Table 5, with the models divided according to the modelling technique and the dataset that was utilised. The models in each column are arranged in the order in which they were created.

3.4. Prediction of Usable Area in a Test Set for Linear, Non-Linear, and Regularised Models

In the majority of applications of regression models, the most significant issue is their ability to accurately predict the value of the dependent variable for cases that did not occur in the sample used to estimate the model parameters. Consequently, the ultimate evaluation of the quality of the aforementioned models, which also serves as a comparison with other predictive models (see below), will be the results of the usable area modelling for a sample of 25 new observations. The characteristics of the models are summarised in Table 6. The table also presents the results of linear models with regularisation. The results of the machine learning (ML) models are presented in Table 7. Finally, the results obtained by hybrids of linear, non-linear and ML models are summarised in Table 8. The hybrid approach yielded the most accurate estimates of the usable floor area of the buildings under study.

In the creation of predictive models for the other two datasets, a similar approach was adopted. Models for 191 building designs from the work [1] are compared with machine learning (ML) models on a sample of 10 new observations (see Table 9). Models for 48 existing buildings that were derived from the same source are compared with ML models obtained in the following subsections (see Table 10). This comparison was made on a sample of five observations due to the considerable difficulty in collecting new, reliable data with the same set of variables.

The best of the linear models—model D—performed reasonably well in determining the usable floor area for new house designs. The mean absolute error indicated an average deviation of 12.8 m², whereas the root mean squared error indicated an average deviation of 15.1 m². Furthermore, it was observed that half of the errors did not exceed 12 m², while the others did not exceed 28 m² (the maximum error recorded was 27.7 m²). As large errors occur in only 1–2% of cases on the learning set and in 1–2 cases on the test set, such a result can be provisionally accepted.
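The error measures used throughout this comparison (MAE, RMSE, the median and the maximum absolute error) can be computed as in the sketch below. The function name and the four observed and predicted values are hypothetical and serve only to show the arithmetic.

```python
from math import sqrt
from statistics import median


def error_summary(observed, predicted):
    """Return MAE, RMSE, median and maximum absolute error (same units as input)."""
    abs_err = [abs(o - p) for o, p in zip(observed, predicted)]
    n = len(abs_err)
    return {
        "MAE": sum(abs_err) / n,
        "RMSE": sqrt(sum(e * e for e in abs_err) / n),
        "median_abs_error": median(abs_err),
        "max_abs_error": max(abs_err),
    }


# Hypothetical usable areas in m² (not from the test set of the study).
observed = [100.0, 120.0, 150.0, 180.0]
predicted = [110.0, 115.0, 150.0, 165.0]
summary = error_summary(observed, predicted)
print(summary)
```

RMSE weights large deviations more heavily than MAE, so a gap between the two, as reported for model D, indicates that a few cases carry comparatively large errors.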

In contrast, the best of the acquired non-linear models, model D_N, described in detail in Section 3.3, despite the slightly worse characteristics obtained on the learning data, predicted the usable area in the test sample even better, with a mean absolute error of 12.1 m², a root mean squared error of 14.09 m² and a maximum error of 23.8 m². These two models will be treated as baseline models for comparison in the following section. Both provided a benchmark of high requirements for subsequent methods.

In the next step, models with LASSO regularisation were estimated, and this scheme was then repeated for ridge regression and elastic net. None of these models gave better results than models D and D_N for the test sample. Subsequently, models with LASSO regularisation were estimated on a learning sample reduced by 20 observations (indicated by the expert as potential outliers). The best model in this approach is denoted in Table 5 as LASSO+ λmin = 0.72. The models obtained in this instance predicted the usable area of the objects in the test set more accurately, but their indications were still weaker than the baseline models. It is worth noting that the best model of the LASSO+ family contains only four variables, which is the fewest of all the models described. These features are the covered area A_C, the building height H, the number of storeys S_N and the roof slope R_S. The model is of the form:

A_{U} = 0.05 + 0.38 \cdot A_{C} + 6.13 \cdot H + 4.47 \cdot S_{N} + 0.14 \cdot R_{S}

(13)
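Formula (13) can be evaluated directly. In the sketch below, the coefficients are those of Formula (13), whereas the function name and the example building are hypothetical; the inputs are assumed to be expressed in the same units as in the study.

```python
def lasso_plus_usable_area(covered_area, height, storeys, roof_slope):
    """Usable area A_U according to Formula (13) of the LASSO+ model."""
    return (0.05
            + 0.38 * covered_area   # A_C, covered area
            + 6.13 * height         # H, building height
            + 4.47 * storeys        # S_N, number of storeys
            + 0.14 * roof_slope)    # R_S, roof slope


# Hypothetical building (not from the dataset): A_C = 120, H = 8, S_N = 2, R_S = 35.
estimate = lasso_plus_usable_area(120.0, 8.0, 2, 35.0)
print(round(estimate, 2))
```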

In particular, the removal of the G_A variable (garage area) from the model may be surprising, as this is an important feature in the context of determining the usable area. Most relevant here, however, is the fact that there is no correlation between the size (or even the presence) of the garage and the usable area of the building, so this variable has little predictive power, as can be seen, for example, in the graph of SHAP values (Figure 3). There is no clear relationship between the value of the garage area and its impact on the estimated value of A_U, as measured by SHAP.

The use of regularised models did not result in the expected improvement in modelling performance on the test sample. The results obtained with the best of the models considered, the LASSO+ model, are slightly worse than those obtained with the D and D_N models (cf. Table 6). The other, worse-predicting models were omitted from the comparative analyses.

3.5. Machine Learning for Buildings’ Designs

3.5.1. XGBoost and NN Results for the New Designs Dataset

Among the models obtained using the XGBoost method (boosted trees), the best model was the one for which the MAE in the test set was 12.61 m², and the highest error was just over 34 m². This model outperformed all linear models in terms of MAE, but it predicted building floor areas on average worse than the D_N model.

Among the neural network models, multiple architectures were tested, with all the selected explanatory variables in the input layer, and with one and two hidden layers of varying numbers of neurons. The NN5-(10-5) network with five input variables (all except the area of the boiler room B and the height of the knee–wall h), ten neurons on the first and five on the second hidden layer, and the NN6-(10-6) network with six variables (variables as in the previous model and additionally the square of the variable garage area G_A²) performed best. However, attention should be paid to the above-mentioned doubt concerning the variable garage area G_A. The model NN5-(10-5) was identified as the most effective model within this class, with regard to the mean absolute error (MAE) and root mean squared error (RMSE) on the test set. However, model NN6-(10-6) exhibited a smaller error in the instances where the prediction differed most from the actual value of the usable floor area. Both neural network models outperform models D and D_N in terms of estimating the usable area of buildings on average. However, the estimates obtained for the worst case deviated more from the actual values than those of the base models D and D_N (the maximum error was higher).

In the context of the boosted trees model, the study examined the contribution of individual variables to the estimation of individual results, expressed in the form of SHAP values. The result is presented in Figure 3. As can be seen, the values of the G_A variable have the most diverse effect on the estimated usable area, with both high and low values of this variable capable of increasing or decreasing the value of the modelled variable. In the case of the A_C, H, and R_S variables, high values lead to an increase, while low values result in a decrease in the estimated usable area. Meanwhile, for the B variable, there is a noticeable, albeit not very strong, opposite effect on A_U.
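The idea behind per-observation contributions of this kind can be illustrated with an exact Shapley decomposition relative to a single reference building. This is a brute-force sketch under stated assumptions, not the procedure used in the study: the toy model, its coefficients and both feature vectors are hypothetical, and dedicated tools for tree ensembles compute such values differently and far more efficiently.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to one baseline point."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their actual values, the rest the baseline.
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(features):
    """Hypothetical stand-in for a fitted model (not the XGBoost model)."""
    a_c, h, g_a = features
    return 0.4 * a_c + 6.0 * h + 0.01 * a_c * h - 0.1 * g_a


x = [150.0, 9.0, 20.0]         # hypothetical building: A_C, H, G_A
baseline = [120.0, 8.0, 15.0]  # hypothetical reference building
phi = shapley_values(toy_model, x, baseline)
print([round(v, 3) for v in phi])
```

The contributions sum to the difference between the prediction for the building and the prediction for the reference, which is the property that makes such values readable as a breakdown of an individual estimate.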

It is worth noting that most of the models obtained were slightly overfitted, with MAE values for the learning data in the 〈9, 10〉 range and RMSE values in the 〈11, 13〉 range, while for the validation set, the respective values of MAE ranged from 11 to 13 and RMSE ranged from 15 to 16 (see Table 7). This means that some potential still exists here to improve the predictive properties of the model.

3.5.2. Hybrid Models—Combining Regression with ML

The best results in estimating the usable area were obtained using hybrid models, which combine features of the various models described above. This is not surprising, as the idea behind their use is to take advantage of specific, desirable features of the individual models.

The models developed as part of this study were created with the following observations in mind:

Values of the explained variable below 90 m² are very rare, so a threshold can be set below which the estimated values of the A_U variable will be replaced by a threshold value.
The baseline linear D and non-linear D_N models handle observations closer to the extreme values of the domain better than the regularised and ML models. In particular:
Both D and D_N models predict the usable area values well in the part of the domain referred to above (values lower than 90 m²), so it is possible to replace the estimated values of the A_U variable that are below the threshold with the values estimated by model D or D_N.
Models D and D_N predict usable area values for buildings with a large covered area quite well; thus, a threshold can be found above which the ML model estimate will be replaced by the D or D_N model estimate (two thresholds were tested: 180 and 200 m²).

A hybrid model that outperforms its component models within specific domain subsets is obtained by defining the conditions for applying each component model according to the above list of observations. These conditions depend on the variant/value of selected features or the value of the explained variable. Additionally, a combination of a regression model (other than D and D_N) and a machine learning model can be introduced to achieve the desired properties. Several such hybrids have been tested, and this approach yields promising results. Consequently, it should be explored in future research.

As an example to illustrate the construction of this type of hybrid model, Figure 4 presents a schematic of the NN6 90 A180D_N model, which combines the best neural network model, the D_N baseline model and a threshold value.
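The switching logic of such a hybrid can be sketched as a simple rule. The code below is one possible reading of the model name (a floor of 90 m² on the estimate and the baseline estimate for covered areas above 180 m²); the function name, the order in which the two rules are applied and the example values are assumptions, not the authors' implementation.

```python
def hybrid_estimate(ml_estimate, baseline_estimate, covered_area,
                    area_threshold=180.0, floor_value=90.0):
    """Combine an ML estimate with a regression baseline (illustrative rule)."""
    # Large covered area: use the regression baseline instead of the ML model.
    if covered_area > area_threshold:
        estimate = baseline_estimate
    else:
        estimate = ml_estimate
    # Usable areas below the floor are very rare, so clip to the floor
    # (applying the floor after the switch is an assumption of this sketch).
    return max(estimate, floor_value)


# Hypothetical estimates in m² (not from the study).
print(hybrid_estimate(130.0, 125.0, covered_area=150.0))  # ML estimate kept
print(hybrid_estimate(210.0, 195.0, covered_area=200.0))  # baseline used
print(hybrid_estimate(82.0, 88.0, covered_area=100.0))    # clipped to the floor
```

The same rule covers the 200 m² variant by changing `area_threshold`.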

In this part of the study, numerous combinations of models were tested. The most important results are summarised in Table 8. All the models presented here estimate the usable floor area more accurately than those previously described—at least according to some criteria (cf. Table 6 and Table 7).

The first of the hybrid models, and the only representative of the group of ordinary regression models, is a hybrid of the best model with LASSO regularisation (LASSO λ = 0.41) and the D_N model for observations where the covered area (A_C) exceeds 180 m², denoted as LASSO A180D_N. This hybrid of simple regression models estimated usable areas in the test set more accurately than all previously presented models.

Two models, based on the best gradient-boosted trees (which themselves did not receive the highest ratings), combined with models D and D_N in a domain part where houses have large covered areas (denoted as XGB A180D and XGB A180D_N, respectively), provide highly accurate estimates of the usable floor area. The second model produced test set predictions that outperformed all other models, including hybrid models, except for the hybrid combining the best neural network and the non-linear D_N model for the same buildings, with a covered area exceeding 180 m². In the latter case, the NN6 90 A180D_N hybrid model achieved a mean absolute error (MAE) of 8.73 m², meaning it made errors that were, on average, 28% smaller than those of the D_N base model, while inheriting its maximum error (cf. Table 7). Meanwhile, the hybrid of the two best models, which averages their predictions and is denoted as XGB + NN6, produced an even slightly lower average error measured by RMSE, though, naturally, a slightly higher MAE than the neural network hybrid (as it represents the arithmetic mean of the MAE of both component models). The NN6 90 A180D_N model and its hybrid with the XGB A180D_N, i.e., the XGB + NN6 model, can therefore be considered the best predictors.
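An averaging hybrid of the XGB + NN6 type can be sketched as follows. The function names and all numbers are hypothetical; the example only shows how the predictions of two component models are combined and scored.

```python
def average_predictions(first, second):
    """Element-wise arithmetic mean of two lists of predictions."""
    return [(a + b) / 2 for a, b in zip(first, second)]


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical usable areas in m² (not from Table 8).
observed = [100.0, 150.0, 200.0]
tree_pred = [110.0, 140.0, 205.0]  # stand-in for a boosted-trees hybrid
net_pred = [95.0, 155.0, 190.0]    # stand-in for a neural-network hybrid

combined = average_predictions(tree_pred, net_pred)
print(mae(observed, tree_pred), mae(observed, net_pred), mae(observed, combined))
```

In this toy case, the errors of the two components have opposite signs for every building, so averaging cancels part of them.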

The results of usable area modelling with the NN6 90 A180D_N model for the test set, along with the results of its component models—the NN6-(10-6) neural network and the base D_N model—are presented in Figure 5. In Figure 5c, an effective reduction in the error variance can be observed. The graph in Figure 5d reveals that for small usable areas, the hybrid model inherited estimates of the neural network that are closer to the observed values than those of the base model (blue points are positioned closer to the dashed line). At the same time, for larger observed usable areas, values from the base model are preferred (green points—estimates from the D_N model).

3.5.3. XGBoost and NN Results for the Old Designs Dataset

As a supplement to the study on the building design dataset previously analysed in [1], a set of analyses was conducted, following the approach outlined in the previous section. The test data used for model comparison in the original study turned out to be an unrepresentative draw, making it difficult, both in the original study and in the repeated analysis, to achieve satisfactory precision in estimating usable floor area. Linear, non-linear and regularised models yielded high average error values for the test set, as well as significant errors in individual cases. In the current part of the study, it turned out that gradient-boosted tree models using the XGBoost algorithm did not perform particularly well. However, with relatively little additional effort, a noticeable improvement in modelling results was achieved using neural networks. The best obtained model, featuring six neurons in the first hidden layer and ten in the second, reduced the mean absolute error (MAE) by 3.68 m² (15.9%), the root mean squared error (RMSE) by 6.91 m² (23.2%) and the maximum error by 17.76 m² (29.9%) compared to the baseline A+ model (see Table 9).

In this case, the test dataset consisted of ten observations collected after assembling the training dataset. This setup closely mirrors a scenario that would occur during the deployment of a fully developed model. Unfortunately, under such circumstances, it is difficult to obtain a dataset that accurately represents the entire population or precisely reflects the structure of the training data. In this particular task, the test data differed significantly from the training data, making it impossible for any of the considered models to estimate the usable floor area for new observations with a level of precision comparable to that achieved in Section 3.5.2.

It is worth noting that with a different division into training and test datasets—where the latter better represented the population under study—significantly better results were obtained, with average errors around 7 m², comparable to the case analysed in Section 3.5.2. This indicates that there is still substantial potential for improving results within this variant of the analysis.

Due to the non-specific distribution of errors and their high variance, hybrid models were excluded from presentation for this dataset. In all examined cases, such models gave little or no benefit over the baseline models.

3.6. Machine Learning for Existing Buildings

For the dataset containing variables describing existing buildings in Koszalin, a complementary analysis, similar to that in Section 3.5.3, was conducted. In Table 10, alongside the results of the existing baseline linear model (Model C) and the best LASSO-regularised model from [1], the performance of the best-obtained gradient-boosted tree model (XGBoost-eb) and the top neural network model (NN6(10-6)eb) is presented. Next, models were combined according to the principles described in Section 3.5.2. Table 10 presents the results achieved by three such hybrids.

The baseline linear model obtained in [1] had average characteristics, with a mean absolute error (MAE) of 20.95 m². The regularised model performed only slightly better in predicting the usable floor area of buildings.

Gradient-boosted tree models significantly reduced the average errors in estimating the usable floor area of buildings compared to the baseline model. For the best-performing model, the MAE value dropped to 15.94 m² (an improvement of over 20%), while the maximum error was reduced to 30 m². These results were further improved with the use of neural networks. The best of the network architectures tested, featuring 6 inputs and 10 and 6 neurons in the hidden layers, delivered another significant improvement in estimation accuracy. For this model, the mean absolute error (MAE) decreased to 10.81 m², with a maximum error below 21 m², results that can already be considered satisfactory. They are comparable to those obtained for house design projects in Section 3.5.1, and even slightly better.

Several hybrid models were also created, with the results of three of them presented in Table 10. Notably, even the model that averages the baseline model C and the LASSO-regularised model outperforms each of its component models. Thus, the simple act of combining two weaker models resulted in a clear benefit, significantly reducing MAE and noticeably lowering RMSE. This result is worth highlighting. Another hybrid model presented is a regularised linear model whose predictions are replaced by those of the baseline model if the covered area exceeds 180 m². This model proved to be slightly better than the C + LASSO model in terms of RMSE, meaning the largest errors were reduced, particularly the maximum error, which was lowered to the level previously achieved by the LASSO model. The last model is a hybrid of a neural network and the baseline model C run for observations with a large covered area (exceeding 200 m²). This combination allowed for further improvements in the neural network model estimations. On the test set, it achieved a mean absolute error (MAE) of 9.99 m² and a root mean squared error (RMSE) of 11.91 m², making its results comparable to the best models presented in Section 3.5.2.
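The relative improvements discussed in this subsection follow from simple arithmetic on the average errors quoted above for the baseline model C (20.95 m²), the best boosted-trees model (15.94 m²), the best neural network (10.81 m²) and the final hybrid (9.99 m²). The helper below is an illustrative sketch; its name is not from the study.

```python
def relative_reduction(baseline_error, model_error):
    """Percentage by which model_error is lower than baseline_error."""
    return 100.0 * (baseline_error - model_error) / baseline_error


baseline_c = 20.95  # average error of the baseline model C, m²
for label, error in [("boosted trees", 15.94),
                     ("neural network", 10.81),
                     ("network + model C hybrid", 9.99)]:
    print(label, round(relative_reduction(baseline_c, error), 1))
```

The first value is consistent with the "over 20%" improvement stated for the boosted-trees model, and the last one shows that the final hybrid roughly halves the average error of the baseline.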

4. Discussion

The analysis of the obtained results leads to several methodological conclusions, as well as insights into the technical and practical aspects of modelling. Additionally, it highlights systemic solutions for real estate data collection, reliability assessment, and their application in real-world property management tasks.

The hypothesis of the superiority of ML models was confirmed by the analyses conducted. Although no single, definitively best model was identified, in most cases, the obtained ML models proved to be more resilient to data variability and delivered slightly better results than regression models. At the same time, no model type can be directly transferred between real-data and design-data modelling tasks. Also, for each estimation task related to a specific local market, a separate analysis must be conducted, and models must be estimated independently. These two features are probably the main limitations of the proposed method. The overall conclusion of the analysis is that if we are investing effort in modelling, it is more efficient to invest in ML models, in particular, simple neural networks. In addition, as demonstrated in Section 3.4 and Section 3.5, it is worth considering the use of hybrid models that utilise different component models for specific groups of buildings. In the conducted study, the hybrid model combining gradient-boosted trees and regression delivered significantly better results than any of its individual component models. A similar improvement was observed for the hybrid of a neural network and regression, with the hybrid containing a non-linear model providing the most accurate overall estimates of the usable floor area for designed buildings. Meanwhile, the simplest hybrid model, which replaced underestimated usable floor area values below a certain threshold with that threshold value, produced almost as accurate estimations as the best-performing model while requiring considerably less effort to develop. At the same time, it should be noted that creating such post hoc hybrid models is merely a temporary measure. 
The correct approach should involve the ex ante identification of coherent groups of similar objects, for which optimal models will be developed within individual classes, followed by the integration of these models across groups during the testing phase. This approach will be further analysed in future research on building usable area models, particularly in the development of a solution that incorporates models for buildings with different roof structures—flat, gable, and multi-pitched. Since all models are local, the generalisation of such a modelling task turns into a new modelling round with every change in the scope (switching a local market will end up in a new set of models). However, it is still possible to build an application (e.g., with a modelling backend mechanism running on a server in the cloud) that will not only use previously estimated models but also estimate new models on the fly. Building such an application is planned.

An important methodological conclusion is the confirmed necessity of traditional variable selection by an operator. The study demonstrated that regularisation within the model does not work as effectively in the examined cases as manual adjustments to the set of explanatory variables. This is likely due to both the relatively small size of the models under consideration and the considerable heterogeneity of the data. In such cases, the role of an expert is invaluable in designing the learning procedure and selecting models tailored to the specific scenario, such as a local real estate market. Another important factor is that in tasks with only a few explanatory variables and a training set ranging from several dozen to around two hundred instances, the number of parameters in a regression model (without interactions) is too small to fully capture the relationships between variables, which are not entirely revealed in such a limited dataset. In contrast, the number of parameters in a neural network model—where most parameters effectively measure interactions between variables—is typically much too large to maintain a proper balance between model sensitivity and overfitting. The obtained results confirm these intuitive observations.
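The contrast in the number of parameters can be made concrete with a short count. The sketch assumes fully connected layers with one bias per neuron and a single output neuron, which is not stated explicitly in this section; the regression count of 7 parameters is read from Formula (12).

```python
def dense_parameter_count(layer_sizes):
    """Weights plus biases of a fully connected network, e.g. [6, 10, 6, 1]."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


nn6_params = dense_parameter_count([6, 10, 6, 1])  # NN6-(10-6), assumed layout
nn5_params = dense_parameter_count([5, 10, 5, 1])  # NN5-(10-5), assumed layout
regression_params = 7  # a0, b, c, d1, d2, g, h in Formula (12)

print(nn6_params, nn5_params, regression_params)
```

With roughly two hundred training observations, such a network has fewer than two observations per parameter under these assumptions, whereas the regression model has about thirty.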

Most of the estimated linear models, as well as the non-linear models obtained for the building project dataset, share a common characteristic—they tend to overestimate the usable area of smaller houses while underestimating the feature in larger houses. This suggests difficulties in generalising at the domain’s extremes, leading to a narrowing of the area onto which explanatory variable sets are projected. A probable cause is the small number of observations in the training sample within these regions, so increasing the sample size should significantly mitigate this issue. A preliminary test was conducted using the SMOTE method for regression tasks [52,53], generating synthetic observations with large floor areas. However, no better results were obtained compared to the original training sample. Nonetheless, this remains a promising technique that could be utilised in future research. It appears that the key prerequisite for its successful application is the precise classification of studied objects and the identification of homogeneous, underrepresented classes within the training sample. In regression tasks, this is particularly challenging due to the continuous nature of the dependent variable, making it more difficult than in classification problems [54,55].
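The idea of generating synthetic observations in an underrepresented part of the domain can be sketched as follows. This is a deliberately simplified illustration: it interpolates between randomly paired rare cases and interpolates the target with the same factor, whereas published SMOTE variants for regression typically rely on nearest neighbours and a relevance function. All names and data are hypothetical.

```python
import random


def synthesize(case_a, case_b, rng):
    """Interpolate one synthetic (features, target) pair between two cases."""
    (x_a, y_a), (x_b, y_b) = case_a, case_b
    t = rng.random()  # position along the segment between the two cases
    x_new = [a + t * (b - a) for a, b in zip(x_a, x_b)]
    y_new = y_a + t * (y_b - y_a)
    return x_new, y_new


def oversample_rare(cases, is_rare, n_new, seed=0):
    """Create n_new synthetic cases from pairs of rare observations."""
    rng = random.Random(seed)
    rare = [case for case in cases if is_rare(case[1])]
    if len(rare) < 2:
        return []  # nothing to interpolate between
    synthetic = []
    for _ in range(n_new):
        first, second = rng.sample(rare, 2)
        synthetic.append(synthesize(first, second, rng))
    return synthetic


# Hypothetical cases: ([covered area, height], usable area), not from the study.
cases = [([100.0, 7.0], 95.0), ([110.0, 7.5], 105.0), ([210.0, 10.0], 190.0),
         ([230.0, 11.0], 215.0), ([250.0, 11.5], 240.0)]
synthetic = oversample_rare(cases, lambda area: area > 180.0, n_new=4)
print(synthetic)
```

Because the synthetic points lie on segments between existing rare cases, the technique can only densify regions that are already populated, which is consistent with the requirement of homogeneous, well-identified classes noted above.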

In the part of the study that continued the work from [1], the results were better for the group of existing buildings than for the design dataset. In both cases, machine learning models, despite slight overfitting, demonstrated superior performance in predicting usable floor area compared to regression models. This appears to be primarily due to greater data consistency and higher quality in terms of modelling requirements. On the other hand, the relatively high error values may be due to an unfortunate selection of objects for the test dataset. Specifically, in a separate study on the project dataset, it was confirmed that when the test set properly reflects the structure of the entire population, model errors decrease several-fold. This provides further confirmation that the most crucial factor influencing the obtained results is data quality, and above all, the homogeneity of the training data. At the same time, this provides another argument for dividing the modelled group of buildings into smaller, more cohesive sets of homogeneous building types and conducting modelling within these subgroups. The clear advantage of machine learning models also suggests that ML models should be preferred for estimating usable floor area when high-quality data are available (especially data that cover the entire domain). Conversely, traditional regression models are more resilient to declines in data quality (e.g., missing or sparse data in certain areas of the domain). This is an important practical conclusion.

LiDAR-derived data are available in many European Union countries [56], making national sources a viable basis for conducting similar analyses. The method described in this article could be applied in other countries having such data, although further research is needed to validate this hypothesis. Given that the models consider only the general characteristics of buildings, the proposed approach appears to have a significantly broader application.

5. Final Conclusions

This study proposes a method for estimating the usable floor area of buildings with multi-pitched roofs using linear and non-linear regression models alongside machine learning techniques. The research is significant in that it integrates diverse data sources, including the Real Estate Price Register, cadastral records, the BDOT10k Topographic Object Database based on LiDAR-collected data, and independently gathered surveyor data. The findings indicate that a building’s usable area can be reliably estimated using only a few key parameters, even if it requires multiple attempts and the use of different classes of models or their combination. This approach is especially relevant for the Polish real estate market, where gaps in the Real Estate Price Register frequently hinder accurate property valuations. While not a substitute for standard internal measurements, this method offers a reasonable estimation when only limited data, such as topographic information, are available. It can support property valuers in determining the usable area of comparable buildings when the Real Estate Price Register lacks these data. Additionally, for administrative purposes, it could serve as a supplementary tool for assessing taxable usable floor area.

Author Contributions

Conceptualisation, L.D.; methodology, L.D., A.M.B. and P.B.; validation, L.D., A.M.B. and P.B.; formal analysis, L.D.; investigation, L.D., A.M.B. and P.B.; resources, L.D., A.M.B. and P.B.; data curation, L.D.; writing—original draft preparation, L.D., A.M.B. and P.B.; writing—review and editing, L.D., A.M.B. and P.B.; visualisation, A.M.B. and P.B.; supervision, L.D.; project administration, L.D. All authors have read and agreed to the published version of the manuscript.

Funding

Open access funding provided by the University of Szczecin and the Technical University of Koszalin.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

We thank Anna Dawid for the useful discussions.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

BDOT10k	Database of Topographic Objects (pol. Baza Danych Obiektów Topograficznych)
LiDAR	Light Detection and Ranging
LoD	Level of Detail
REPR	Real Estate Price Register (pol. Rejestr Cen Nieruchomości)
MAE	Mean Absolute Error
RMSE	Root Mean Squared Error
ML	Machine Learning

References

1. Dawid, L.; Barańska, A.; Baran, P.; Ala-Karvia, U. Linear and Nonlinear Modelling of the Usable Area of Buildings with Multi-Pitched Roofs. Appl. Sci. 2024, 14, 11850.
2. Dydenko, J. (Ed.) Szacowanie Nieruchomości; Wolters Kluwer: Warszawa, Poland, 2024. (In Polish)
3. Sawiłow, E. Analysis of the real estate valuation methods in comparative approach. Geod. Rev. 2008, 80, 3–7. (In Polish)
4. Cymerman, R.; Hopfer, A.; Kotlewski, L. Zasady Określania Wartości Nieruchomości: Metodyczne i Prawne; Educaterra: Olsztyn, Poland, 2022. (In Polish)
5. Rozporządzenie Ministra Rozwoju, Pracy i Technologii z Dnia 27 Lipca 2021 r. w Sprawie Ewidencji Gruntów i Budynków, Dz.U. 2021 poz. 1390. Available online: https://isap.sejm.gov.pl/isap.nsf/DocDetails.xsp?id=WDU20210001390 (accessed on 12 January 2025). (In Polish)
6. Hycner, R. Basics of the Cadastre; AGH University of Science and Technology Press: Kraków, Poland, 2004; pp. 241–282. (In Polish)
7. Cienciała, A.; Sobolewska-Mikulska, K.; Sobura, S. Credibility of the cadastral data on land use and the methodology for their verification and update. Land Use Policy 2021, 102, 105204.
8. Wierzbicki, D.; Matuk, O.; Bielecka, E. Polish Cadastre Modernization with Remotely Extracted Buildings from High-Resolution Aerial Orthoimagery and Airborne LiDAR. Remote Sens. 2021, 13, 611.
9. Kocur-Bera, K.; Frąszczak, H. Coherence of Cadastral Data in Land Management—A Case Study of Rural Areas in Poland. Land 2021, 10, 399.
10. Larsson, K.; Paasch, J.M.; Paulsson, J. Representation of 3D cadastral boundaries—From analogue to digital. Land Use Policy 2020, 98, 104178.
11. Mika, M. An Analysis of Possibilities for the Establishment of a Multipurpose and Multidimensional Cadastre in Poland. Land Use Policy 2018, 77, 446–453.
12. Dawid, L. Analysis of Data Completeness in the Register of Real Estate Prices and Values Used for Real Estate Valuation on the Example of Koszalin District in the Years 2010–2016. Folia Oecon. Stetin. 2018, 18, 17–26.
13. Dawid, L. Analysis of Completeness of Data from the Price and Value Register on the Example of Kołobrzeg and Koszalin Districts in Years 2010–2017. Stud. Res. FEM SU 2018, 1, 91–102. (In Polish)
14. Foryś, I.; Kokot, S. Problems with Real Estate Market Analysis. In Microeconomy in Theory and Practice; Res. Bull. Univ. Szczec.: Szczecin, Poland, 2001; pp. 175–182. (In Polish)
15. Database of Topographic Objects (pol. Baza Danych Obiektów Topograficznych) (BDOT). Available online: https://www.geoportal.gov.pl/pl/dane/baza-danych-obiektow-topograficznych-bdot10k/ (accessed on 10 September 2024).
16. Wężyk, P. (Ed.) Textbook for Participants of Trainings on Using LiDAR Products; Head Office of Land Surveying and Cartography: Cracow, Poland, 2015. (In Polish)
17. Ren, X.; Yu, B.; Wang, Y. Semantic Segmentation Method for Road Intersection Point Clouds Based on Lightweight LiDAR. Appl. Sci. 2024, 14, 4816.
18. 2.0 CityGML; Open Geospatial Consortium: Arlington, TX, USA, 2012.
19. QGIS Development Team. QGIS Geographic Information System. Open Source Geospatial Foundation Project. Available online: http://qgis.osgeo.org (accessed on 21 May 2024).
20. Head Office of Land Surveying and Cartography. Geoportal of National Spatial Data Infrastructure. Available online: https://www.geoportal.gov.pl/ (accessed on 12 May 2024).
21. Dawid, L.; Cybiński, K.; Stręk, Z. Machine Learning of Usable Area of Gable-Roof Residential Buildings Based on Topographic Data. Remote Sens. 2023, 15, 863.
22. Dudzik, P. Geometria dachów (Roof geometry). Inżynieria I Bud. 2023, LXXIX, 293–298. (In Polish)
23. Barańska, A. Linear and Nonlinear Weighing of Property Features. Real Estate Manag. Valuat. 2019, 27, 59–68.
24. Pinter, G.; Mosavi, A.; Felde, I. Artificial Intelligence for Modeling Real Estate Price Using Call Detail Records and Hybrid Machine Learning Approach. Entropy 2020, 22, 1421.
25. Baldominos, A.; Blanco, I.; Moreno, A.J.; Iturrarte, R.; Bernárdez, Ó.; Afonso, C. Identifying Real Estate Opportunities Using Machine Learning. Appl. Sci. 2018, 8, 2321.
26. Kim, J.; Lee, Y.; Lee, M.-H.; Hong, S.-Y. A Comparative Study of Machine Learning and Spatial Interpolation Methods for Predicting House Prices. Sustainability 2022, 14, 9056.
27. Dawid, L.; Tomza, M.; Dawid, A. Estimation of usable area of flat-roof residential buildings using topographic data with machine learning methods. Remote Sens. 2019, 11, 2382.
28. Janowski, A.; Renigier-Biłozor, M.; Walacik, M.; Chmielewska, A. Remote measurement of building usable floor area–Algorithms fusion. Land Use Policy 2021, 100, 104938.
29. PN-70/B-02365; Surface Area of Buildings—Classification, Definitions, and Methods of Measurement. Polish Committee of Standardization: Warszawa, Poland, 1970. Available online: http://rzeczoznawca-zachodniopomorskie.pl/pliki/PN_70_B_02365.pdf (accessed on 22 April 2024). (In Polish)
30. Zbroś, D. The Rules for Calculating the Usable Area by Two Current Polish Standards. Saf. Eng. Anthropog. Objects 2016, 3, 19–22. (In Polish)
31. Pogorzelski, A.; Sieczkowski, J. Obliczanie Powierzchni i Kubatur Budynku; Polcen: Warszawa, Poland, 2023. (In Polish)
32. PN-ISO 9836:1997; Performance Standards in Building—Definition and Calculation of Area and Space Indicators. Polish Committee of Standardization: Warszawa, Poland, 1997. Available online: http://rzeczoznawca-zachodniopomorskie.pl/pliki/PN_ISO_9836_1997.pdf (accessed on 10 April 2024). (In Polish)
33. PN-ISO 9836:2015-12; Performance Standards in Building—Definition and Calculation of Area and Space Indicators. Polish Committee of Standardization: Warszawa, Poland, 2015. Available online: https://sklep.pkn.pl/pn-iso-9836-2015-12p.html (accessed on 18 October 2024). (In Polish)
34. Benduch, P.; Butryn, K. Legal and standard principles of buildings and their parts usable floor area quantity surveying. In Infrastructure and Ecology of Rural Areas; Polish Academy of Sciences: Cracow, Poland, 2018; pp. 225–238. ISSN 1732-5587. (In Polish)
35. Pogorzelski, A.; Sieczkowski, J. Wysokość i powierzchnia użytkowa pomieszczeń w budynkach ze stropami pochyłymi. Bud. I Prawo 2024, 27, 17–20. (In Polish)
36. Ebing, J. Calculating of Area and Cubic Volume of Facilities with Different Intended Use; Dashofer Sp. z o.o. Publishing House: Ljubljana, Slovenia, 2011; ISBN 978-83-7537-108-6. (In Polish)
37. Benduch, P.; Hanus, P. The Concept of Estimating Usable Floor Area of Buildings Based on Cadastral Data. Rep. Geod. Geoinform. 2018, 105, 29–41.
38. Pogorzelski, A.; Sieczkowski, J. Wybrane zagadnienia dotyczące obliczania powierzchni zabudowy i powierzchni użytkowej budynków. Przegląd Bud. 2023, 94, 54–58. (In Polish)
39. Regulation of the Minister of Transport, Construction and Maritime Economy of April 25, 2012 on Detailed Scope and Form of a Construction Project. In J. Laws; 2012; p. 462. Available online: https://isap.sejm.gov.pl/isap.nsf/DocDetails.xsp?id=wdu20120000462 (accessed on 20 May 2020). (In Polish)
40. Lipińscy, M.L. Design Office. Houses Projects. Available online: https://lipinscy.pl/ (accessed on 21 May 2024).
41. Mendel, B. ARCHON+ Project Office. Available online: https://www.archon.pl/ (accessed on 21 May 2024).
42. Extradom, Design Office. Available online: https://www.extradom.pl/projekty (accessed on 2 December 2024).
43. Barańska, A. Statystyczne Metody Analizy i Weryfikacji Proponowanych Algorytmów Wyceny Nieruchomości; AGH Publishing: Kraków, Poland, 2010.
44. Santosa, F.; Symes, W.W. Linear inversion of band-limited reflection seismograms. SIAM J. Sci. Stat. Comput. 1986, 7, 1307–1330.
45. Tibshirani, R. Regression Shrinkage and Selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996, 58, 267–288.
46. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2021.
47. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the KDD’16 and the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–16 August 2016.
48. Shapley, L.S. Notes on the n-Person Game—II: The Value of an n-Person Game; RAND Corporation: Santa Monica, CA, USA, 1951; Available online: https://www.rand.org/content/dam/rand/pubs/research_memoranda/2008/RM670.pdf (accessed on 20 March 2025).
49. Molnar, C. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable, 3rd ed. 2025. Available online: https://christophm.github.io/interpretable-ml-book/ (accessed on 28 March 2025).
50. Lundberg, S.M.; Lee, S. A Unified Approach to Interpreting Model Predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 4768–4777. Available online: https://dl.acm.org/doi/pdf/10.5555/3295222.3295230 (accessed on 25 March 2025).
51. Liu, Y.; Just, A. SHAPforxgboost: SHAP Plots for ’XGBoost’. R package version 0.1.0. Available online: https://github.com/liuyanguu/SHAPforxgboost/ (accessed on 15 January 2025).
52. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
53. Torgo, L.; Ribeiro, R.P.; Pfahringer, B.; Branco, P. SMOTE for Regression. In Progress in Artificial Intelligence; EPIA 2013; Lecture Notes in Computer Science; Correia, L., Reis, L.P., Cascalho, J., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; Volume 8154.
54. Branco, P.; Torgo, L.; Ribeiro, R. SMOGN: A Pre-Processing Approach for Imbalanced Regression. Proc. Mach. Learn. Res. 2017, 74, 36–50.
55. Song, X.Y.; Dao, N.; Branco, P. DistSMOGN: Distributed SMOGN for Imbalanced Regression Problems. Proceedings of the Fourth International Workshop on Learning with Imbalanced Domains: Theory and Applications. Proc. Mach. Learn. Res. 2022, 183, 38–52.
56. Kakoulaki, G.; Martinez, A.; Florio, P. Non-Commercial Light Detection and Ranging (LiDAR) Data in Europe; Publications Office of the European Union: Luxembourg, 2021; ISBN 978-92-76-41150-5. EUR 30817 EN.

Figure 1. Feature distributions in the data: (A) new designs sample, (B) old designs sample, (C) existing buildings sample; black dots represent validation data.

Figure 2. Relationship between the usable area (A_U) variable and selected explanatory variables: garage area (a) and boiler room (b).

Figure 3. SHAP values of the individual features for XGBoost model in explaining the usable area of buildings. Source: own calculation, visualisation in SHAPforxgboost [51].

Figure 4. Idea of the hybrid model NN6 90 A180D_N.

Figure 5. Hybrid model NN6 90 A180D_N error-based composition and observed vs. fitted A_U. (a) case selection based on NN6-(10-6) model; (b) case selection based on D_N model; (c) errors after selection; (d) observed vs. fitted values of A_U.

Table 1. Features of 436 residential buildings from architectural offices and features of 48 existing buildings with their respective ranges of values in the dataset.

Feature	Variable	Design Projects	Existing Buildings (Koszalin)
Usable area	A_U	48.12–302.42 m²	80.71–407.51 m²
Covered area	A_C	69.8–334.68 m²	100.5–394.71 m²
Number of storeys	S_N	1–2	1–2
Height	H	5.7–9.57 m	5.7–10.17 m
Knee wall’s height	h	0–1.8 m	No data
Garage area	G_A	0–56.55 m²	0–66.25 m²
Number of rooms	R	No data	4–12
Number of roof surfaces	R_N	No data	4–15
Presence of a boiler room	B	0; 1	No data
Presence of an in-built garage (0.5–attached or partially in-built)	G_P	No data	0; 0.5; 1
Roof slope	R_S	23.8–40°	No data

Table 2. Multiple Linear Regression estimates for the building designs dataset (model D). Source: own calculation.

N = 219	β	σ(β)	a	σ(a)	t(213)	p-Value
intercept			12.080	8.301	1.455	0.147
A_C	1.086	0.040	0.532	0.019	27.363	<0.001
H	0.091	0.025	3.432	0.955	3.596	<0.001
G_A	−0.260	0.039	−0.403	0.061	−6.660	<0.001
S_N	−0.158	0.079	−9.111	4.580	−1.989	0.048
R_S	0.236	0.078	0.453	0.150	3.022	0.003

R = 0.938 R² = 0.881 Adj. R² = 0.878 F(5, 213) = 314.48 p < 0.001 SE: 8.722.
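Read as a prediction rule, the estimates in Table 2 can be applied directly to a building's features. The sketch below assumes that column a holds the unstandardised coefficients (the ratios a/σ(a) reproduce the reported t values, which supports this reading). The function name and the example house are hypothetical and are not taken from the study.

```python
# Unstandardised coefficients of model D, copied from column "a" of Table 2.
MODEL_D = {
    "intercept": 12.080,
    "A_C": 0.532,   # covered area, m^2
    "H": 3.432,     # building height, m
    "G_A": -0.403,  # garage area, m^2
    "S_N": -9.111,  # number of storeys
    "R_S": 0.453,   # roof slope, degrees
}


def predict_model_d(features):
    """Return the usable area A_U (m^2) predicted by model D."""
    return MODEL_D["intercept"] + sum(
        MODEL_D[name] * value for name, value in features.items()
    )


# Hypothetical house, chosen inside the value ranges listed in Table 1.
house = {"A_C": 120.0, "H": 8.0, "G_A": 20.0, "S_N": 2, "R_S": 35.0}
print(round(predict_model_d(house), 2))
```

Keeping the coefficients in a dictionary keyed by the variable symbols makes it easy to check the code against the table row by row.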

Table 3. Function types to describe the dependence of the usable area on individual exogenous variables.

Independent Variable	Function Type
Covered area—A_C	linear function
Building’s height—H	linear function
Garage area—G_A	exponential function; polynomial 2°
Boiler room—B	linear function; polynomial 3°
Knee–wall’s height—h	polynomial 3°
Number of storeys—S_N	linear function
Roof slope —R_S	linear function

Table 4. Non-linear regression estimates for the building designs dataset (model D_N). Source: own calculation.

N = 217	β	σ(β)	a	σ(a)	t(210)	p-Value
intercept			11.625	8.108	1.434	0.153
A_C	1.026	0.041	0.505	0.020	24.732	<0.001
H	0.102	0.024	3.950	0.939	4.208	<0.001
G_A	−0.521	0.083	−0.818	0.131	−6.252	<0.001
G_A²	0.324	0.089	0.013	0.004	3.638	<0.001
S_N	−0.130	0.076	−7.665	4.480	−1.711	0.089
R_S	0.214	0.075	0.419	0.146	2.863	0.005

R = 0.945 R² = 0.893 Adj. R² = 0.890 F(6, 210) = 291.67 p < 0.001 SE: 8.4863.
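Written out with the estimates from column a of Table 4, model D_N takes the form below. This is only a transcription of the table, assuming that column a holds the unstandardised coefficients. The position of the minimum of the garage term is computed from the rounded values and is therefore approximate.

```latex
\hat{A}_U = 11.625 + 0.505\,A_C + 3.950\,H - 0.818\,G_A + 0.013\,G_A^{2} - 7.665\,S_N + 0.419\,R_S

\frac{\partial \hat{A}_U}{\partial G_A} = -0.818 + 2 \cdot 0.013\,G_A = 0
\quad\Longrightarrow\quad G_A \approx 31.5\ \mathrm{m}^2
```

With these rounded coefficients the net garage contribution is most negative (roughly −13 m²) for a garage of about 31 m² and shrinks again towards the upper end of the observed range of G_A.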

Table 5. Full list of models compared in the study.

Technique	Model Type	Design Data	Design Data (Old Dataset)	Existing Buildings
Regression	Linear	D	A+ ¹	C ¹
	Non-linear	D_N	—	—
Regularised regression	LASSO	LASSO λ = 0.41; LASSO+ λ = 0.72	LASSO+ ¹	LASSO ¹
	Ridge	Ridge λ = 0.37	—	—
	Elastic net	Elastic net⁺	—	—
Machine learning	XGBoost	XGBoost	XGBoost ods	XGBoost eb
	Neural networks	NN6-(10-6); NN5-(10-5)	NN 6-(10-6)ods	NN 6-(10-6)eb
Hybrid models		LASSO A180D_N; XGB A180D; XGB A180D_N; NN5 90; NN5 90D_N A180D_N; NN6 90 A180D; NN6 90 A180D_N; XGB + NN6	—	C+ LASSO; LASSO A180C; NN6 A200C

¹ source: Dawid et al. [1].

Table 6. Quality of the estimation of the usable area in the test group (N = 25) of the design data—best linear and non-linear models. Source: own calculation.

	Linear Models					Non-Linear Model
	Model D	LASSO λ = 0.41⁺	Ridge λ = 0.37	Elastic net⁺	LASSO+ λ = 0.72	Model D_N
MAE	12.82	13.33	13.52	13.39	13.04	12.12
RMSE	15.09	16.15	16.23	16.20	15.67	14.09
$\max |y - \hat{y}|$	27.73	35.08	35.05	35.19	34.12	23.83
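The three quality measures reported in Tables 6–10 (MAE, RMSE and the maximum absolute error) can be computed as in the following minimal sketch. The function name is hypothetical and the numbers are made up for illustration; they are not data from the study.

```python
import math


def error_metrics(y_true, y_pred):
    """Return MAE, RMSE and the maximum absolute error of the predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    mae = sum(abs_errors) / len(abs_errors)
    rmse = math.sqrt(sum(e * e for e in abs_errors) / len(abs_errors))
    return {"MAE": mae, "RMSE": rmse, "max_abs_error": max(abs_errors)}


# Made-up usable areas (m^2), for illustration only.
observed = [100.0, 150.0, 200.0]
fitted = [110.0, 145.0, 185.0]
print(error_metrics(observed, fitted))
```

Because RMSE squares the errors before averaging, it is never smaller than MAE, and a gap between the two points to a few large individual errors, which is what the maximum absolute error row makes explicit.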

Table 7. Quality of the estimation of the usable area in the test group (N = 25) of the design data—best ML (XGBoost and NN) models. Source: own calculation.

	ML Model	Neural Networks
	XGBoost	NN6-(10-6)	NN5-(10-5)
MAE	12.61	11.89	11.15
RMSE	15.64	15.19	15.15
$\max |y - \hat{y}|$	34.09	33.83	37.67

Table 8. Quality of the estimation of the usable area in the test group (N = 25) of the design data—selected hybrid models.

	Hybrid Models
	LASSO A180D_N	XGB A180D	XGB A180D_N	NN5 90	NN5 90D_N A180D_N	NN6 90 A180D	NN6 90 A180D_N	XGB + NN6
MAE	11.69	10.11	9.58	10.88	10.60	9.26	8.73	8.87
RMSE	13.58	12.39	11.60	14.71	12.83	12.07	11.25	11.15
$\max |y - \hat{y}|$	23.06	27.73	23.06	34.95	23.06	27.73	23.06	23.06

Naming conventions: first part of the name: LASSO—the best LASSO model (presented in Table 6), XGB—the best XGBoost model, NN5—the best neural network with 5 inputs, NN6—the best neural network with 6 inputs (all presented in Table 7); subsequent parts of the name: 90—a model in which predictions lower than 90 are replaced by 90, 90D_N—a model in which predictions lower than 90 are replaced by predictions from model D_N, A180D/A180D_N—a model in which predictions for objects with A_C higher than 180 are replaced with predictions from model D/model D_N, respectively; XGB + NN6—a model averaging the predictions from models XGB A180D_N and NN6 90 A180D_N.
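The naming conventions above amount to simple replacement rules, which can be sketched as follows for a model of the "NN6 90 A180D_N" type and for the averaged "XGB + NN6" model. The order in which the two rules are applied is not stated in the conventions; this sketch assumes that the A_C rule takes precedence over the floor of 90. Function names and all numbers are hypothetical.

```python
def hybrid_predict(base_pred, fallback_pred, covered_area,
                   floor=90.0, area_threshold=180.0):
    """Compose a hybrid prediction in the style of 'NN6 90 A180D_N'.

    base_pred     -- predictions of the ML model (e.g. NN6), in m^2
    fallback_pred -- predictions of the regression model (e.g. D_N), in m^2
    covered_area  -- covered area A_C of each building, in m^2
    """
    result = []
    for ml, reg, a_c in zip(base_pred, fallback_pred, covered_area):
        if a_c > area_threshold:
            # 'A180D_N': buildings with A_C above 180 take the regression prediction.
            result.append(reg)
        else:
            # '90': predictions lower than 90 are replaced by 90.
            result.append(max(ml, floor))
    return result


def average_models(pred_a, pred_b):
    """'XGB + NN6': element-wise mean of two hybrid prediction lists."""
    return [(a + b) / 2.0 for a, b in zip(pred_a, pred_b)]


# Made-up numbers for illustration only; not data from the study.
nn6 = [85.0, 120.0, 210.0]
d_n = [92.0, 118.0, 230.0]
a_c = [95.0, 140.0, 200.0]
print(hybrid_predict(nn6, d_n, a_c))
```

The "90D_N" variant would differ only in the else branch, where a prediction below 90 is replaced by the D_N prediction instead of the constant.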

Table 9. Quality of the estimation of the usable area in the test group (N = 10) of the design data—selected hybrid models. Source: * Dawid et al. [1]; ** own calculation.

	Linear Models *		ML and Neural Networks **
	Model A⁺	LASSO⁺	XGBoost ods	NN 6-(10-6)ods
MAE	23.19	27.42	27.58	19.51
RMSE	30.22	33.57	32.05	23.31
$\max |y - \hat{y}|$	59.36	67.22	53.13	41.60

Table 10. Quality of the estimation of the usable area in the test group (N = 5) of the existing buildings data—baseline linear models, best ML (XGBoost and NN) models, and selected hybrid models. Source: ¹ Dawid et al. [1]; ² own calculations.

	Linear Models ¹		ML and Neural Nets ²		Hybrid Models ²
	Model C	LASSO	XGBoost eb	NN6(10-6)eb	C+ LASSO	LASSO A180C	NN6 A200C
MAE	20.95	20.43	15.94	10.81	16.65	16.95	9.99
RMSE	27.92	24.16	18.81	12.77	23.12	21.14	11.91
$\max |y - \hat{y}|$	55.34	39.55	30.73	20.73	47.45	39.55	20.73

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
