Article

Predicting Real Estate Prices Using Machine Learning in Bosnia and Herzegovina

by Zvezdan Stojanović 1,*, Dario Galić 2 and Hava Kahrić 3
1 Department of Electrical Engineering, Faculty of Engineering, European University Brčko District, 76100 Brčko, Bosnia and Herzegovina
2 Department of Interdisciplinary Areas, Faculty of Dental Medicine and Health, 31000 Osijek, Croatia
3 Department of Electrical Engineering, Faculty of Engineering, Kallos University, 75000 Tuzla, Bosnia and Herzegovina
* Author to whom correspondence should be addressed.
Data 2025, 10(9), 135; https://doi.org/10.3390/data10090135
Submission received: 2 July 2025 / Revised: 4 August 2025 / Accepted: 19 August 2025 / Published: 23 August 2025

Abstract

The real estate market has a major impact on the economy and everyday life. Accurate real estate valuation is essential for buyers, sellers, investors, and government institutions. Traditionally, valuation has been conducted using various estimation models. However, recent advancements in information technology, particularly in artificial intelligence and machine learning, have enabled more precise predictions of real estate prices. Machine learning allows computers to recognize patterns in data and create models that can predict prices based on the characteristics of the property, such as location, square footage, number of rooms, age of the building, and similar features. The aim of this paper is to investigate how the application of machine learning can be used to predict real estate prices. A machine learning model was developed using four algorithms: Linear Regression, Random Forest Regression, XGBoost, and K-Nearest Neighbors. The dataset used in this study was collected from major online real estate listing portals in Bosnia and Herzegovina. The performance of each model was evaluated using the R2 score, Root Mean Squared Error (RMSE), scatter plots, and error distributions. Based on this evaluation, the most accurate model was selected. Additionally, a simple web interface was created to allow non-experts to easily obtain property price estimates.

1. Summary

Before the advent of modern technologies and algorithms, real estate valuation was carried out using several basic, so-called traditional methods [1,2]. These methods are still widely used today, especially by real estate appraisers and financial institutions.
The most commonly used methods, as given in [3], are as follows:
  • Market Approach:
    • The valuation is carried out by comparing similar properties that have recently been sold in the same or a comparable market.
    • Key factors considered include location, size (area), number of rooms, building condition, and age.
    • This method relies heavily on the appraiser’s expertise and the availability of relevant market data.
  • Income Approach:
    • This approach is mostly used for commercial properties that generate income (e.g., apartment rentals).
    • The value is estimated based on the expected income the property can generate over time.
    • It is used in investment valuations.
  • Cost Approach:
    • This method estimates the value of a property based on the cost of constructing a new, identical, or similar property, minus depreciation.
    • It is typically used for unique properties for which there is limited comparable market data.
Although traditional methods remain useful, they often rely heavily on the subjective judgment of experts and are limited by their ability to process large volumes of data and complex relationships.
In this context, machine learning offers significant improvements, as it enables automatic analysis of complex data and makes more objective predictions. Machine learning algorithms can consider a far greater number of variables than traditional models, not only including property-specific features, but also contextual factors such as proximity to schools, kindergartens, and local crime rates. Moreover, these models can be easily updated or expanded with new variables, allowing for greater flexibility and adaptability.
Real estate price prediction is fundamentally a regression task, which is why various machine learning algorithms were employed in this study. Compared to traditional approaches, artificial intelligence techniques offer several advantages: the ability to incorporate a significantly larger number of variables that may influence property value, the capacity to uncover complex relationships and dependencies among these variables, and the potential for continuous model improvement through the integration of new data. These methods also help improve prediction accuracy, reduce human subjectivity, and allow scalable expansion of the dataset. Accurate real estate valuation is not only essential for sellers and buyers, but also for investors and financial institutions, particularly in the banking sector for assessing loan risk. Given that many of these stakeholders may lack specialized knowledge in artificial intelligence, we developed a simple web application to make property price estimation accessible and user-friendly. To ensure that the model reflects the real market conditions in Bosnia and Herzegovina, we did not rely solely on publicly available real estate databases. Instead, we collected approximately 2000 records from online real estate listings within the country. The limited size of the dataset influenced our choice of regression algorithms, leading us to select those best suited for smaller datasets while still ensuring reliable performance.
The remainder of this paper is organized as follows: Section 2 reviews relevant related work in the field of real estate price prediction. Section 3 outlines the machine learning algorithms used in this study, the evaluation metrics applied to compare their performance, and the software tools that were utilized. Section 4 details the data preparation process and the training of the predictive models. Section 5 presents the development of the web interface designed to make the prediction tool accessible to end users.

2. Background

According to [4]: “Machine Learning (ML) is a field of artificial intelligence that deals with the development of techniques and algorithms that allow computers to learn from data and improve their performance over time, without explicit programming. Machine learning gives a machine the ability to adapt its behavior in unpredictable and uncertain situations to achieve a greater degree of autonomy”.
The main goal of machine learning is to develop models that can generalize from experience (data) to make informed decisions or provide accurate predictions in new situations. In other words, ML systems learn from previous experiences to predict current performance and anticipate future outcomes [5]. By learning from previous computations and identifying patterns within large datasets, machine learning can support the creation of reliable and consistent decision-making systems [6].
For this reason, ML algorithms have been successfully applied in many areas, such as spam detection, natural language processing, speech recognition, optical character recognition, image recognition, fraud detection, unassisted vehicle control, medical diagnosis, recommendation systems, and so on [7,8].
A key subfield of machine learning is supervised learning, whose goal is to uncover unknown relationships based on known input–output pairs. Common forms of supervised learning include numerical regression, classification, and artificial neural networks [9].
Figure 1 illustrates some of the most widely used machine learning algorithms. In the context of property value estimation, as described in this paper, supervised learning is applied, specifically the regression algorithms presented in the figure.
In contrast, unsupervised learning aims to uncover natural structures or patterns within input data without predefined labels. A related concept is reinforcement learning, in which the system does not have access to input–output pairs, but learns through interaction with the environment by receiving feedback in the form of rewards for correct actions and penalties for incorrect ones [9].
Clearly defining the end goal of a machine learning task is essential, as it directly influences how the problem is approached, which algorithms are selected, the choice of performance metrics, and the extent of effort devoted to model fine-tuning [10,11].
Regression is one of the key techniques used in machine learning, especially in the context of predicting numerical values. Its objective is to model the relationship between one or more independent variables (features) and a dependent variable (target) that is numerical in nature.
In this study, the dependent variable is the price of real estate property [12,13]. Several machine learning algorithms can be applied to regression tasks, including those listed below [5,12]:
  • Linear Regression: One of the most basic and widely used regression models. It predicts outcomes by computing a weighted sum of the input features along with a constant (bias term).
  • Multiple Linear Regression: An extension of linear regression that handles multiple input variables. The model learns individual weights (coefficients) for each feature. It is appropriate for more complex problems where the price is influenced by several factors, such as location, square footage, number of rooms, etc.
  • Decision Tree Regressor: This model uses a tree-like structure to split data into nodes based on a criterion that minimizes the variance of the target variable. It can capture non-linear relationships, and does not require feature scaling.
  • Random Forest Regressor: An ensemble learning method that constructs multiple decision trees and aggregates their predictions to improve accuracy and reduce overfitting. It is known for its robustness and reliability, especially with heterogeneous data.
  • Gradient Boosting and XGBoost: An advanced technique that builds models sequentially, with each new model correcting the errors of its predecessor. XGBoost (Extreme Gradient Boosting) is a high-performance, optimized implementation of Gradient Boosting. These algorithms are frequently used in competitive machine learning due to their high predictive power.
  • K-Nearest Neighbors (KNN): A non-parametric algorithm that can be used for both classification and regression. It predicts values based on the average of the k-nearest data points. KNN is simple, but sensitive to irrelevant features and the size of the dataset, which can affect its performance.
  • Multi-Layer Perceptron (MLP): A type of feedforward artificial neural network capable of modeling complex relationships. While powerful, MLPs typically require large datasets, longer training times, and careful hyperparameter tuning. With smaller datasets, they are prone to overfitting and may not generalize well.
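As an illustration of how these regressors are typically instantiated in Python, the following is a minimal sketch using the scikit-learn and xgboost packages with default settings; it is not the configuration used later in this study, only an example of the shared fit/predict interface.

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor

# Each model exposes the same fit/predict interface, so they can be
# swapped freely inside a single evaluation loop.
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor(),
    "Gradient Boosting": GradientBoostingRegressor(),
    "XGBoost": XGBRegressor(),
    "K-Nearest Neighbors": KNeighborsRegressor(),
    "Multi-Layer Perceptron": MLPRegressor(),
}
```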
Several studies have compared the performance of these algorithms. In [12], the authors compared XGBoost, Support Vector Regressor, Random Forest Regressor, and Multilayer Perceptron, concluding that the XGBoost algorithm performs best for real estate price prediction. The study by [13] evaluated four algorithms, Random Forest (RF), Decision Tree (DT), Gradient Boosting (GB), and Extreme Gradient Boosting (XGBoost), finding that GB and XGBoost outperformed the others by delivering the highest prediction accuracy. Similarly, [14] compared Random Forest and XGBoost for housing price prediction, concluding that XGBoost provided superior results.
In [15], Support Vector Machine (SVM), Random Forest (RF), and Gradient Boosting Machine (GBM) were compared. The main conclusion was that RF and GBM produced similarly accurate price estimations with lower prediction errors compared to SVM.

3. Methods

When deciding which machine learning algorithms to use, we considered both the results of previous research, presented above, and the size of our relatively small dataset. We chose Linear Regression and K-Nearest Neighbors for their simplicity and broad applicability, Random Forest for its efficiency with small datasets, and XGBoost due to its strong performance with small- to medium-sized datasets.
One of the contributions of this paper is to evaluate the accuracy of different machine learning algorithms when applied to small- and medium-sized datasets for real estate price prediction.
In recent years, several papers have addressed similar topics; however, only a few have focused on small- and medium-sized datasets. Therefore, it was challenging to evaluate how different machine learning algorithms would perform on our dataset. We trained and evaluated multiple regression models on real estate data from Bosnia and Herzegovina, including Linear Regression, Random Forest Regressor, K-Nearest Neighbors, and XGBoost Regressor. The evaluation metrics applied were R2 Score (coefficient of determination), RMSE (Root Mean Squared Error), scatter plots (actual vs. predicted values), and error distributions (boxplots), which will be described in the next section. After selecting the best-performing algorithm for our dataset, we developed a Streamlit application to make the real estate price predictions accessible and understandable to a broader audience, including users without technical or programming backgrounds. Often, a practical aspect is overlooked: how to assist ordinary individuals, who are not AI experts, during the stressful process of buying or choosing a property. For this reason, alongside selecting the most accurate algorithm, we created an easy-to-use web application to support users in their decision-making. The application is designed to be easily upgradable with new features and adaptable for use in any region or country. Additionally, it helps reduce human subjectivity, which is an inherent factor in real estate transactions.

3.1. Comprehensive Evaluation of Various Regression Models

After a machine learning model is trained, it is necessary to evaluate its accuracy and reliability using appropriate evaluation metrics. These metrics quantify how well the model predicts actual values based on the input data. In the context of regression problems, such as predicting real estate prices, the metrics described below are used [16,17].
Mean Squared Error (MSE) calculates the average of the squared differences between the actual and predicted values.
MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
Due to squaring the differences, the MSE penalizes larger errors more heavily. This makes it suitable for detecting a model that often makes large deviations. However, its value is expressed in squared units, which differ from the units of the target variable.
Root Mean Squared Error (RMSE) is the square root of MSE.
RMSE = \sqrt{MSE}
The advantage of RMSE is that the result is expressed in the same units as the target value, making it easier to interpret. Like MSE, it penalizes larger errors more strongly by assigning them greater weight.
R2 score measures how much of the variance in the dependent variable can be explained by the model.
Values range from −∞ to 1:
  • R2 = 1: The model perfectly explains the variability in the data.
  • R2 = 0: The model does not explain the variability better than simply predicting the mean of the target variable.
  • R2 < 0: The model performs worse than the mean-based prediction, indicating poor predictive power.
R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}
SSE (Sum of Squares due to Error) represents the explained variation, while SSR (Sum of Squares due to Residuals) refers to the unexplained variation. The SST (Total Sum of Squares) is the total variation and is calculated as the sum of SSE and SSR: SST = SSE + SSR [16]. This metric is particularly useful when evaluating the relative success of a predictive model compared to simpler approaches, such as using the mean of the target variable as a baseline prediction.
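To make these definitions concrete, the following is a minimal sketch that computes MSE, RMSE, and the R2 score with NumPy, using small hypothetical price arrays purely for illustration.

```python
import numpy as np

# Hypothetical actual and predicted prices (in KM), purely for illustration
y_true = np.array([150_000, 210_000, 95_000, 320_000, 180_000])
y_pred = np.array([155_000, 200_000, 99_000, 310_000, 185_000])

# MSE: average of the squared differences between actual and predicted values
mse = np.mean((y_true - y_pred) ** 2)

# RMSE: square root of MSE, expressed in the same units as the target (KM)
rmse = np.sqrt(mse)

# R2 = 1 - SSR/SST, where SSR is the unexplained (residual) variation and
# SST is the total variation around the mean of the target
ssr = np.sum((y_true - y_pred) ** 2)
sst = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ssr / sst

print(f"MSE = {mse:.0f}, RMSE = {rmse:.0f} KM, R2 = {r2:.3f}")
```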
It can be concluded that the proper use of evaluation metrics is essential for selecting the most efficient model for a given problem. In this study, RMSE and R2 are used to compare the performance of different regression models, as they jointly provide valuable insight into both the average prediction error and the model’s ability to generalize.
In the context of regression, where the goal is to predict continuous numerical values, evaluation metrics serve not only to measure the deviation between predicted and actual values, but also to assess the stability, robustness, and generalization capacity of the model. Proper interpretation of these metrics allows for researchers to understand model behavior under different conditions, detect potential issues such as bias or overfitting, and improve performance through parameter tuning.
To complement the numerical evaluation, this study also employs boxplots and scatter plots as visual tools for comparing the performance of the machine learning algorithms.

3.2. Real Estate Price Forecasting Tools and Libraries

In the development of the real estate price prediction model, various software tools and libraries were utilized to support data collection, preprocessing, analysis, and machine learning model construction. During the initial development phase, the Jupyter Notebook (version 7.0.3) environment within Anaconda was used to test the model on a smaller dataset of approximately 2000 records. In later stages, for full-scale implementation and application development, the code was organized and executed in the Visual Studio Code (VS Code) editor (version 1.102.3), where the final version of the model and the Streamlit web application were developed.
The key software tools and technologies used in this study are described below:
  • Python is currently the dominant programming language in the fields of machine learning, data science, and artificial intelligence [7]. It offers a wide range of libraries for data loading, visualization, statistical analysis, natural language processing, image processing, and more. One of Python’s main advantages is the ability to interact directly with the code through a terminal or interface such as Jupyter Notebook [18].
  • Anaconda provides an IPython console and an interactive environment for writing and testing code, particularly suited for experimenting with AI and machine learning functions. It includes a suite of pre-installed libraries such as NumPy (version 2.3.2), SciPy (1.16.1), Matplotlib (version 3.10.5), Pandas (version 2.3.1), IPython (8.12.0), Jupyter Notebook (version 7.0.3), and Scikit-learn (version 1.7.1) [18]. This makes it ideal for rapid prototyping and iterative development.
  • Visual Studio Code (VS Code) is a powerful, free, cross-platform, and highly customizable development environment created by Microsoft. It supports a wide range of programming languages, including Python (version 3.11.12), JavaScript (version 13.4), C++20, Java (version 21.0.1), etc.

4. Data Description

The real estate market in Bosnia and Herzegovina is complex and influenced by numerous factors, including economic conditions, local infrastructure, demographic changes, and the legislative framework. While numerous features have been considered in previous studies, such as [19], this paper focuses on a selected subset of features, listed below, that are most relevant and available within our dataset.
Location: Location is one of the most important factors in determining the price of a property. In urban areas, property prices are significantly higher in the city center, near universities, hospitals, and administrative buildings, while properties on the outskirts are typically more affordable.
Size and structure: Even though larger square footage usually means a higher price, the relationship is not always linear. Also, the layout and number of rooms have a significant impact. Two- and three-bedroom apartments are the most sought-after in Tuzla Canton, while studios and four-bedroom apartments cater to more specific buyer groups.
The age of the building and the condition of the property: Newer constructions (especially those built after 2015) tend to have higher market values due to modern designs, improved energy efficiency, and additional amenities (elevator, parking, video surveillance). On the other hand, older buildings often require renovation, which reduces their market value.
Number of floors and elevator availability: The number of floors plays a role, especially in buildings without elevators. Apartments on higher floors in older buildings without elevators tend to be cheaper, while apartments on middle floors are the most sought after.
Access to infrastructure and amenities: Proximity to schools, kindergartens, shopping malls, public transportation, and parking facilities significantly affects the property value.
Legal documentation and ownership status: Clear ownership status, availability of building and occupancy permits, and the ability to register and obtain a mortgage often affect the speed of sale and final price.
Influence of the market and economic conditions: Inflation, interest rates, bank lending policies, and the supply of new investment construction in large cities directly affect price trends.
By analyzing these factors, it is possible to identify key input variables used in modeling real estate prices. Data on location, square footage, number of floors, year of construction, and additional amenities were included in the final dataset used for training and evaluation.

4.1. Data Preparation and Processing

To create a model for estimating real estate prices in BiH, data were collected from publicly available real estate advertisements and sales platforms. The real estate price prediction model was developed to accurately estimate the market value of a residential unit, based on common features: location, square footage, number of rooms, floor, year of construction (or age of the building), elevator, parking, and price (target value). Before conducting more advanced analysis or applying machine learning algorithms, it is necessary to carry out the appropriate data preprocessing procedures [12,18]. The goal of preprocessing is to bring raw data into a format suitable for further analysis, reduce variances, and ensure consistency [18].
The phases involved in the development of the real estate price prediction model are illustrated in Figure 2.
Data loading was performed using the Pandas library, which enables efficient handling of tabular data using the DataFrame object.
Data preprocessing consisted of the following steps:
  • Data cleaning: Unnecessary whitespace was removed from column names using df.columns.str.strip(), and columns were checked for duplicate entries and missing values (df.isnull().sum()).
  • Feature engineering: A new column, Price per m2, was added, which represents the ratio of the total price to the area of the apartment. This transformation allows for a more realistic comparison of market value across properties of varying sizes.
  • Scaling of numeric variables: The StandardScaler from the sklearn.preprocessing module was applied to standardize numerical variables, ensuring that they have a mean of 0 and a standard deviation of 1. This step is especially important for algorithms sensitive to differences in feature scale (e.g., KNN, SVM, Logistic Regression).
  • Encoding of categorical variables: Columns such as Location, Elevator, Parking, and Balcony were encoded using OneHotEncoder, which converted them to binary values (0 or 1). The parameter drop = ‘first’ was used to avoid false correlations between categories (dummy variable trap).
The ColumnTransformer was used to perform all transformations, allowing different processing steps to be applied to numerical and categorical data within a single processing flow (Pipeline); a minimal code sketch of this flow is given after the train/test split described below.
After preprocessing, the data was divided into the following:
  • Training set (e.g., 80%): Used to train the model.
  • Test set (e.g., 20%): Used to assess generalization.
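A minimal sketch of this preprocessing and splitting flow is given below. The file name and the exact feature column names are hypothetical placeholders standing in for the collected dataset described above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

# Hypothetical file and column names, standing in for the collected dataset
df = pd.read_csv("real_estate_bih.csv")

# Data cleaning: strip whitespace from column names, inspect missing values
df.columns = df.columns.str.strip()
print(df.isnull().sum())

# Feature engineering: price per square metre for comparing properties of different sizes
df["Price per m2"] = df["Price"] / df["Area"]

numeric_features = ["Area", "Rooms", "Floor", "Year of construction"]
categorical_features = ["Location", "Elevator", "Parking", "Balcony"]

# ColumnTransformer: scale numeric columns, one-hot encode categorical ones
# (drop='first' avoids the dummy variable trap)
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(drop="first"), categorical_features),
])

X = df[numeric_features + categorical_features]
y = df["Price"]

# 80% training / 20% test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```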
The subsequent phase—namely, training various machine learning models, selecting the optimal model based on performance metrics, and integrating it into a web-based application—constitutes the core contribution of this research, and will be discussed in detail in the following sections.

4.2. Comparison of Different Machine Learning Algorithms

Once the data was properly preprocessed and transformed, the next phase involved applying machine learning algorithms to model the relationship between the input features and the target variable: real estate prices [15].
This study focuses on the implementation and comparative evaluation of multiple regression models, with the aim of identifying the algorithm that yields the most accurate predictions.
To assess model performance, standard regression evaluation metrics—namely, Root Mean Squared Error (RMSE) and R2 score—were employed, as described in the previous section.
To achieve optimal accuracy in predicting real estate prices, four different machine learning models were applied: Linear Regression, Random Forest Regressor, XGBoost Regressor, and K-Nearest Neighbors. Each model was trained on the same dataset containing 2000 real estate examples in Bosnia and Herzegovina.
Table 1 compares the four machine learning algorithms selected earlier using the RMSE and R2 metrics. XGBoost showed the highest accuracy with the lowest prediction error, making it the best choice for implementation in a web-based application. The combination of multi-layer analysis and a robust model enabled more reliable price prediction in the real estate market in Bosnia and Herzegovina.
Hyperparameters manage the training process and structures of the models, significantly affecting their performance and generalization capabilities [20]. Effectively tuning hyperparameters can lead to significant improvements in a model’s accuracy and efficiency, meaning that this is a critical step in the machine learning workflow.
Column 4 of Table 1 shows the hyperparameters used in each of the models; a configuration sketch follows the list below.
  • Linear regression in its basic form does not use hyperparameters, although extended forms do [21].
  • In the Random Forest algorithm [22], the hyperparameters used are as follows:
    • n_estimators: Number of trees in the forest (default: n_estimators = 100);
    • max_depth: Maximum number of levels in each decision tree;
    • random_state: Hyperparameter that makes the model’s output replicable.
  • In the XGBoost algorithm [23], the hyperparameters used are as follows:
    • max_depth: Determines how deep each tree can grow;
    • n_estimators: Determines the number of trees in the model;
    • eta (learning_rate): Determines how much each tree affects the prediction (default = 0.1).
  • In the KNN algorithm [24], the hyperparameters used are as follows:
    • n_neighbors: Determines the number of nearest points in the training dataset that are considered (int, default = 5);
    • weights: Weight function used in prediction (default = uniform);
    • metric: Used for the distance computation (default = minkowski);
    • seed: Used for reproducibility.
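The following is a minimal sketch of how the hyperparameters in Table 1 can map onto the model constructors and how the RMSE/R2 comparison can be produced, assuming the preprocessor object and the train/test split from the sketch in Section 4.1; in the XGBoost constructor, learning_rate and random_state correspond to the eta and seed values listed in Table 1.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor

# Models configured with the hyperparameters listed in Table 1
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(
        n_estimators=100, max_depth=15, random_state=42),
    "XGBoost": XGBRegressor(
        n_estimators=500, objective="reg:squarederror",
        max_depth=6, learning_rate=0.1, random_state=42),  # eta = 0.1, seed = 42
    "K-Nearest Neighbors": KNeighborsRegressor(
        n_neighbors=5, weights="uniform", metric="minkowski"),
}

predictions = {}
for name, model in models.items():
    # Chain preprocessing and the regressor so every model sees identical inputs
    pipe = Pipeline([("prep", preprocessor), ("model", model)])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    predictions[name] = y_pred

    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    print(f"{name}: R2 = {r2:.4f}, RMSE = {rmse:,.0f} KM")
```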
Using different metrics to compare algorithms, it was found that XGBoost is the best machine learning algorithm for our dataset.
In addition to evaluating the models using numerical metrics, visual inspection was performed to further validate the results, specifically using a box plot analysis and scatter plot verification [25].
As shown in Figure 3, the XGBoost Regressor demonstrates the lowest error dispersion, as evidenced by the narrowest box among all four algorithms. This indicates a higher level of consistency and reliability in its predictions. The box plot also includes whiskers (horizontal black lines extending from the box), which represent the range of most of the error values, while individual points outside the whiskers (black dots) represent outliers, i.e., extreme prediction errors [26]. The boxes (blue rectangles) in Figure 3 represent the interquartile range (IQR), i.e., the middle 50% of prediction errors:
  • The bottom edge of the box = 25th percentile (Q1)
  • The top edge of the box = 75th percentile (Q3)
  • The green line inside the box = median (Q2).
From the box plot, it is clear that K-Nearest Neighbors (KNN) has the widest box and the most outliers, making it the least reliable model among the ones tested.
Finally, to further assess the performance of the four algorithms, scatter plots of predicted versus actual prices were used (Figure 4). Scatter plots are a common method for evaluating regression models, as they allow for a visual comparison between predicted and true values across the entire dataset [25,26,27]. Figure 4 shows scatter plots comparing actual house prices (x-axis) vs. predicted prices (y-axis). Each blue circle (dot) represents a single property, and the red dashed line is the perfect-prediction line. If a dot lies on the red line, the prediction is exact; if it lies above the line, the model overestimated the price; and if it lies below the line, the model underestimated the price.
From the visual analysis, it is evident that the XGBoost Regressor produces predictions that are most closely aligned with the actual values, exhibiting minimal deviation and a tight clustering around the regression line.
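A minimal sketch of how such box plots and scatter plots can be produced with Matplotlib is shown below, assuming a dictionary predictions that maps each model name to its predicted prices on the test set (collected, for example, in the comparison loop above).

```python
import matplotlib.pyplot as plt

# Prediction errors per model (actual minus predicted), from the loop above
errors = {name: y_test.values - y_pred for name, y_pred in predictions.items()}

# Box plot of errors: narrower boxes and fewer outliers indicate more consistent models
fig, ax = plt.subplots(figsize=(8, 4))
ax.boxplot(list(errors.values()), labels=list(errors.keys()))
ax.set_ylabel("Prediction error (KM)")
ax.set_title("Error distribution per model")
plt.show()

# Scatter plots of actual vs. predicted prices with the perfect-prediction line
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, (name, y_pred) in zip(axes.ravel(), predictions.items()):
    ax.scatter(y_test, y_pred, s=10, alpha=0.5)
    lims = [y_test.min(), y_test.max()]
    ax.plot(lims, lims, "r--")  # points on this dashed line are exact predictions
    ax.set_title(name)
    ax.set_xlabel("Actual price (KM)")
    ax.set_ylabel("Predicted price (KM)")
plt.tight_layout()
plt.show()
```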

5. Application Creation and Testing

After completing the implementation and evaluation of the XGBoost model, the challenge was to make the research findings accessible and practical for a broader audience beyond researchers. To address this, a Streamlit application was developed to present the real estate price predictions generated using the XGBoost model in a user-friendly manner, catering especially to users without technical or programming expertise. Although XGBoost is a very powerful and accurate model for regression tasks, its results were initially limited to Python code and numerical output in the terminal.
The motivation for transitioning from model analysis to developing an interactive application was multifaceted, including the following factors:
  • The need for the real-time practical application of the model.
  • The demand for an interactive tool that allows users to input data (e.g., area, number of rooms, location) and receive immediate price predictions. This not only demonstrates the model’s practical utility, but also provides a clear visual presentation of the results suitable for potential investors, buyers, and researchers.
Streamlit was a natural, strategic, and technically optimal choice, considering the tool we already use (Python 3.11.12), the nature of our work (ML prediction), time constraints, and target audience (both technical and non-technical users).
This approach enabled us to combine advanced machine learning techniques with a simple and functional web application, thereby extending the work beyond theoretical and research value to practical real-world applications. Using Streamlit (version 1.47.1) in this context was a logical step for transforming the static models into interactive tools.
Streamlit proved to be the most logical and efficient choice because it integrates directly with Python code, has direct support for machine learning, and allows such applications to be created quickly. It enables the rapid development of interactive applications using only basic knowledge of Python and CSS syntax. Features such as drop-down menus, sliders, and numeric inputs are readily available. Additionally, Streamlit applications can be run both locally and online. Being open-source and free further contributed to its suitability for this project.
Figure 5 illustrates the layout of the application. The example test case features an apartment in Sarajevo with an area of 60 square meters, two rooms, an elevator, parking, and a balcony, with an estimated price of KM 351,313.
To improve the visual appearance of the application, a CSS (Cascading Style Sheet) was added, so the final application looked as shown in Figure 6.
The machine learning model cannot process text values (e.g., “Sarajevo”, “Apartment”), so it was necessary to convert text and binary data into numeric formats:
  • Binary values (Elevator, Parking, Balcony) were encoded as 0 (No) and 1 (Yes).
  • Categorical values (City, Type) were converted into one-hot encoded variables—e.g., if the apartment is in Sarajevo, the column City_Sarajevo has the value 1, while all other city columns are set to 0.
After the model was trained, it was saved using the joblib library in two separate files: “real_estate_model_xgb.pkl” and “model_column_xgb.pkl”.
These files contain the trained model and the list of feature column names expected during training. This is important because the user input must have an identical structure during prediction (same order and column names).
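A minimal sketch of this persistence step is shown below; here, model is assumed to be an XGBoost regressor trained on a one-hot-encoded DataFrame X_train_encoded (so that its column order can be stored), matching the file names mentioned above.

```python
import joblib

# Persist the trained model and the exact feature-column order used in training
joblib.dump(model, "real_estate_model_xgb.pkl")
joblib.dump(list(X_train_encoded.columns), "model_column_xgb.pkl")

# Later (e.g., inside the web application), both artifacts are restored
model = joblib.load("real_estate_model_xgb.pkl")
model_columns = joblib.load("model_column_xgb.pkl")
```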
A web application was developed using the Streamlit library, operating as follows:
  • The user enters data (city, square footage, number of rooms, etc.).
  • The application converts these inputs into a numerical feature vector matching the model’s expected format.
  • The previously trained model and feature list are loaded via joblib.load.
  • The model predicts the property price based on the input features.
  • The predicted price is displayed to the user.
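The sketch below illustrates this flow using the Streamlit API; the option lists, column names, and value ranges are hypothetical placeholders, and the snippet is a simplified stand-in for the actual application rather than its full source code.

```python
import joblib
import pandas as pd
import streamlit as st

# Load the trained XGBoost model and the feature columns expected at prediction time
model = joblib.load("real_estate_model_xgb.pkl")
model_columns = joblib.load("model_column_xgb.pkl")

st.title("Real Estate Price Estimator")

# User inputs (hypothetical option lists and ranges, for illustration only)
city = st.selectbox("City", ["Sarajevo", "Tuzla", "Banja Luka", "Mostar"])
area = st.number_input("Area (m2)", min_value=15, max_value=300, value=60)
rooms = st.number_input("Number of rooms", min_value=1, max_value=6, value=2)
year = st.number_input("Year of construction", min_value=1950, max_value=2025, value=2010)
elevator = st.checkbox("Elevator")
parking = st.checkbox("Parking")
balcony = st.checkbox("Balcony")

if st.button("Estimate price"):
    # Build a single-row feature vector; one-hot columns such as City_Sarajevo
    # are aligned via reindex so the structure matches the training data exactly
    row = {
        "Area": area,
        "Rooms": rooms,
        "Year of construction": year,
        "Elevator": int(elevator),
        "Parking": int(parking),
        "Balcony": int(balcony),
        f"City_{city}": 1,
    }
    X_input = pd.DataFrame([row]).reindex(columns=model_columns, fill_value=0)

    price = model.predict(X_input)[0]
    st.success(f"Estimated price: KM {price:,.0f}")
```

Saved as, for example, app.py, such a script is started locally with the command streamlit run app.py.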

6. Conclusions

This paper presents the application of machine learning techniques for predicting real estate prices, integrating theoretical foundations with practical implementation through the development of a web application.
Multiple regression models were trained and evaluated on real estate data from Bosnia and Herzegovina, including Linear Regression, Random Forest Regressor, K-Nearest Neighbors, and XGBoost Regressor. Model performance was assessed using several evaluation methods: R2 score (coefficient of determination), RMSE (Root Mean Squared Error), residual analysis (residual plots), scatter plots (actual vs. predicted values), and error distribution analysis (box plots).
Based on a comprehensive evaluation of various regression models, the final decision to implement the model in the web application favored the XGBoost algorithm. This decision was based on quantitative indicators, where XGBoost achieved the highest value of the coefficient of determination (R2) and the lowest value of the Root Mean Squared Error (RMSE), indicating its high accuracy and reliability in predicting real estate prices. Furthermore, graphical analyses, scatter plots, and error distributions confirmed that XGBoost exhibited the smallest deviations and the best linear fit between the actual and predicted values. Due to its advanced capabilities in learning non-linear relationships and built-in regularization preventing overfitting, the XGBoost model was selected as the core component of the application for automatic real estate price prediction based on user input.
In this study, the selection of the optimal model for real estate price regression analysis was not merely a technical decision, but the outcome of a rigorous analysis of performance metrics, model stability, and generalization ability. After implementing and evaluating multiple algorithms (Linear Regression, Random Forest, KNN, and XGBoost), the Extreme Gradient Boosting (XGBoost)-based model proved to be superior in almost all key aspects.
A particularly notable result is that XGBoost proved to be extremely efficient and stable, despite being trained and tested on a relatively limited set of around 2000 samples. This robustness makes it particularly suitable for real-world applications where the available data is often limited, fragmented, or heterogeneous.
An application was developed using the Streamlit library, enabling users to input property characteristics and receive real-time price predictions. The most significant advantages of the developed system include the following:
  • Automated data processing and price prediction generation;
  • A user-friendly interface accessible to non-technical users;
  • System scalability across different countries and real estate markets;
  • Integration of evaluation metrics for the quantitative assessment of model performance.
On the other hand, one of the main challenges was the limited availability of structured real estate data in Bosnia and Herzegovina. This limitation highlights the need and opportunity for future research and development projects aimed at building a comprehensive national real estate database. Such a database would significantly enhance market analysis, enable more accurate price predictions, and support more transparent decision-making for buyers, sellers, investors, and public institutions. Developing this type of data infrastructure would offer substantial value, and could serve as a promising topic for future research or a final thesis in the fields of data science, machine learning, or information systems development.

Author Contributions

Conceptualization, Z.S. and D.G.; methodology, H.K.; software, H.K.; validation, Z.S., D.G. and H.K.; formal analysis, Z.S.; investigation, H.K.; resources, D.G.; data curation, Z.S.; writing—original draft preparation, Z.S. and H.K.; writing—review and editing, Z.S. and D.G.; visualization, H.K.; supervision, Z.S.; project administration, D.G.; funding acquisition, D.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Tse, R.Y. Estimating neighbourhood effects in house prices: Towards a new hedonic model approach. Urban Stud. 2002, 39, 1165–1180. [Google Scholar] [CrossRef]
  2. Hansen, J. Australian house prices: A comparison of hedonic and repeat-sales measures. Econ. Rec. 2009, 85, 132–145. [Google Scholar] [CrossRef]
  3. Pagourtzi, E.; Assimakopoulos, V.; Hatzichristos, T.; French, N. Real estate appraisal: A review of valuation methods. J. Prop. Investig. Financ. 2003, 21, 383–401. [Google Scholar] [CrossRef]
  4. Samuel, A.L. Some Studies in Machine Learning Using the Game of Checkers. IBM J. Res. Dev. 1959, 44, 206–226. [Google Scholar] [CrossRef]
  5. Jaafar, N.S.; Mohamed, J.; Ismail, S. Machine Learning for Property Price Prediction and Price Valuation: A Systematic Literature Review. J. Malays. Inst. Plan. 2021, 19, 411–422. [Google Scholar] [CrossRef]
  6. Janiesch, C.; Zschech, P.; Heinrich, K. Machine Learning and Deep Learning. Electron. Mark. 2021, 31, 685–695. [Google Scholar] [CrossRef]
  7. Mohri, M.; Rostamizadeh, A.; Talwalkar, A. Foundations of Machine Learning, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018; ISBN 978-02-6203-940-6. [Google Scholar]
  8. Galić, D.; Stojanović, Z.; Čajić, E. Application of Neural Networks and Machine Learning in Image Recognition. Tech. Gaz. 2024, 31, 316–323. [Google Scholar] [CrossRef]
  9. Mehmedović, L.B. Intelligent Systems, 1st ed.; Harhograf Tuzla: Tuzla Grad, Bosnia and Herzegovina, 2011; ISBN 978-99-5849-030-9. [Google Scholar]
  10. Nagy, Z. Fundamentals of Artificial Intelligence and Machine Learning, 1st ed.; Komjuter Biblioteka: Beograd, Serbia, 2019; ISBN 978-86-7310-544-4. [Google Scholar]
  11. Muller, A.C.; Guido, S. Introduction to Machine Learning with Python, 1st ed.; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2017; ISBN 978-14-4936-989-7. [Google Scholar]
  12. Sharma, H.; Harsora, H.; Ogunleye, B. An Optimal House Price Prediction Algorithm: XGBoost. Analytics 2024, 3, 30–45. [Google Scholar] [CrossRef]
  13. Deng, Z. Comparative Analysis of the Effectiveness of Different Algorithms for House Price Prediction in Machine Learning. Adv. Econ. Manag. Political Sci. 2024, 136, 101–107. [Google Scholar] [CrossRef]
  14. Lee, H. House Price Prediction and Analysis Based on Random Forest and XGBoost Models. Highlights Bus. Econ. Manag. 2023, 21, 934–938. [Google Scholar] [CrossRef]
  15. Ho, W.K.O.; Tang, B.-S.; Wong, S.W. Predicting property prices with machine learning algorithms. J. Prop. Res. 2021, 38, 48–70. [Google Scholar] [CrossRef]
  16. Kim, S.-J.; Bae, S.-J.; Jang, M.-W. Linear Regression Machine Learning Algorithms for Estimating Reference Evapotranspiration Using Limited Climate Data. Sustainability 2022, 14, 11674. [Google Scholar] [CrossRef]
  17. Plevris, V.; Solorzano, G.; Bakas, N.P.; Ben Seghier, M.E.A. Investigation of performance metrics in regression analysis and machine learning-based prediction models. In Proceedings of the 8th European Congress on Computational Methods in Applied Sciences and Engineering (ECCOMAS 2022) Congress, Oslo, Norway, 24 November 2022. [Google Scholar] [CrossRef]
  18. Geron, A. Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow, 2nd ed.; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2019; ISBN 978-14-9203-263-2. [Google Scholar]
  19. Baldominos, A.; Blanco, I.; Moreno, A.J.; Iturrarte, R.; Bernárdez, Ó.; Afonso, C. Identifying Real Estate Opportunities Using Machine Learning. Appl. Sci. 2018, 8, 2321. [Google Scholar] [CrossRef]
  20. Yang, L.; Shami, A. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing 2020, 415, 295–316. [Google Scholar] [CrossRef]
  21. Linear regression: Hyperparameters. 2025. Available online: https://developers.google.com/machine-learning/crash-course/linear-regression/hyperparameters (accessed on 1 August 2025).
  22. Random Forest Hyperparameter Tuning in Python. 2025. Available online: https://www.geeksforgeeks.org/machine-learning/random-forest-hyperparameter-tuning-in-python/ (accessed on 1 August 2025).
  23. A Guide on XGBoost Hyperparameters Tuning. 2025. Available online: https://www.kaggle.com/code/prashant111/a-guide-on-xgboost-hyperparameters-tuning (accessed on 1 August 2025).
  24. KNeighborsClassifier. 2025. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html (accessed on 1 August 2025).
  25. Box and Whisker Plots: Understanding, Creating, and Interpreting Data Visualization. 2025. Available online: https://www.6sigma.us/six-sigma-in-focus/box-and-whisker-plots/ (accessed on 2 August 2025).
  26. Jiang, Q.; Xu, L.; Sun, S.; Wang, M.; Xiao, H. Retrieval model for total nitrogen concentration based on UAV hyperspectral remote sensing data and machine learning algorithms—A case study in the Miyun Reservoir, China. Ecol. Indic. 2021, 124, 107356. [Google Scholar] [CrossRef]
  27. Ouyang, X. House Price Prediction Based on Machine Learning Models. Highlights Sci. Eng. Technol. 2024, 85, 870–878. [Google Scholar] [CrossRef]
Figure 1. Some of the most popular machine learning algorithms.
Figure 2. The phases of forming a real estate price prediction model.
Figure 3. Comparing machine learning algorithms using a boxplot.
Figure 4. Model fitting for machine learning.
Figure 5. Web application layout.
Figure 6. Web application layout using CSS.
Table 1. Model comparison.

Model | R2 Score | RMSE (KM) | Experimental Set Up
Linear Regression | 0.9445 | 33,184 | none
Random Forest | 0.9621 | 27,431 | n_estimators = 100, max_depth = 15, random_state = 42
XGBoost | 0.9918 | 6637 | n_estimators = 500, objective = ‘reg:squarederror’, max_depth = 6, eta = 0.1, seed = 42
K-Nearest Neighbors | 0.5926 | 89,943 | n_neighbors = 5, weights = ‘uniform’, metric = ‘minkowski’
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
