Article

Advancing Real-Estate Forecasting: A Novel Approach Using Kolmogorov–Arnold Networks

by Iosif Viktoratos * and Athanasios Tsadiras
Department of Economics, Aristotle University of Thessaloniki, 541 24 Thessaloniki, Greece
* Author to whom correspondence should be addressed.
Algorithms 2025, 18(2), 93; https://doi.org/10.3390/a18020093
Submission received: 11 December 2024 / Revised: 23 January 2025 / Accepted: 2 February 2025 / Published: 7 February 2025
(This article belongs to the Section Evolutionary Algorithms and Machine Learning)

Abstract

Accurately estimating house values is a critical challenge for real-estate stakeholders, including homeowners, buyers, sellers, agents, and policymakers. This study introduces a novel approach to this problem using Kolmogorov–Arnold networks (KANs), a type of neural network based on the Kolmogorov–Arnold theorem. The proposed KAN model was tested on two datasets and demonstrated superior performance compared to existing state-of-the-art methods for predicting house prices. By delivering more precise price forecasts, the model supports improved decision-making for real-estate stakeholders. Additionally, the results highlight the broader potential of KANs for addressing complex prediction tasks in data science. This study aims to provide an innovative and effective solution for accurate house price estimation, offering significant benefits for the real-estate industry and beyond.

1. Introduction

1.1. Residential Real-Estate Domain: Motivation and Challenges

The real-estate sector plays a vital role in the economy and includes various domains such as residential, commercial, industrial, and retail. It relates to a broad spectrum of activities centered on the purchase, sale, rental, lease, and management of properties [1]. Its dynamics and developments have a pertinent impact on people, companies, and state authorities around the globe [2].
The residential real-estate sector is an integral part of the overall property market, covering the purchase, sale, and leasing of residential property (single-family homes, townhomes, and condominiums) [2]. This segment has a huge impact on the overall economy [3]. Residential properties represent about 65% of the entire real-estate market in the USA. Home sales in the United States hit an all-time high of nearly seven million in 2021, after a sustained recovery from post-crisis lows. The average price per square foot of floor space in new single-family housing fell after the 2008 financial crisis and then stagnated for several years. Since 2012, the price has increased continuously, reaching approximately USD 144 per square foot in 2021. In the same year, the average sales price of a new home was over USD 454,000, and, in 2022, house prices increased further (https://www.statista.com/statistics/682549/average-price-per-square-foot-in-new-single-family-houses-usa/ accessed on 28 November 2024). These figures indicate that understanding the sector's current state, trends, and prospects is essential for both industry participants and investors.

1.2. The Specific Case of the House Price Prediction Problem

Like the broader real-estate market, the housing segment is subject to various factors that influence its growth and performance. House prices are one of the most important economic indicators, and changes in prices have substantial consequences for households, policymakers, investors, and others. Understanding the dynamics of house prices is therefore crucial for assessing the health of the housing market [4]. It also helps investors identify trends, forecast future changes, and make successful decisions about buying and selling property.
Identifying the factors that affect house prices has been a topic of interest for many years. Factors and trends that influence the price of a house include location, size, condition, the use of technology, etc. [5].

1.3. Scientific Background and the Proposed Approach

Researchers and industries are actively working on projects related to predicting house prices [6]. Based on a set of input features/factors derived from real-estate datasets, they develop models capable of estimating house prices. Machine learning techniques and deep neural networks (NNs) have been widely employed for this purpose [7].
The objective of this study is to implement and propose a novel approach using KANs. KANs are based on the Kolmogorov–Arnold theorem, which states that any continuous function of several variables can be represented as a superposition of continuous functions of one variable and the operation of addition. Formally, for any continuous function $f: \mathbb{R}^n \rightarrow \mathbb{R}$, there exist continuous univariate functions $g_i$ and $h_{ij}$ such that $f(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{2n+1} g_i\left(\sum_{j=1}^{n} h_{ij}(x_j)\right)$.
This theorem has several important implications for function approximation and neural network design. It suggests that complex, multi-dimensional functions can be decomposed into simpler, one-dimensional functions, which can be more easily modeled and interpreted. Various generalizations of the Kolmogorov–Arnold theorem have been proposed, extending its applicability to different types of functions and higher dimensions. These generalizations have significant implications for fields such as machine learning, where understanding the structure of data is crucial for developing effective predictive models. Unlike the static, non-learnable activation functions in Multi-Layer Perceptrons (MLPs), KANs incorporate univariate functions that act as both weights and activation functions, adapting as part of the learning process. They use learnable spline functions (B-splines) to represent nonlinear input transformations, with relatively fewer parameters than a traditional neural network. By decentralizing activation functions to edges rather than nodes, KANs align with the Kolmogorov–Arnold representation theorem, enhancing model interpretability and performance (Figure 1) [8]. These models are designed to handle complex, nonlinear relationships within the data, capturing intricate patterns that traditional models might miss. KANs excel at representing highly nonlinear functions, making them particularly suited to domains such as customer behavior analysis, where interactions between features can be complex and multi-faceted. By leveraging the Kolmogorov–Arnold representation theorem, these networks can approximate any continuous function, providing high flexibility and accuracy in prediction problems [9].
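To make the edge-based design concrete, the following is a minimal PyTorch sketch of a KAN-style layer. It is an illustrative simplification, not the implementation used in this study: each edge's learnable univariate function is parameterized as a piecewise-linear spline on a fixed uniform grid (a hat-function basis), whereas full KAN implementations use cubic B-splines with a residual base activation. All dimensions and grid ranges are assumptions for demonstration.

```python
import torch
import torch.nn as nn

class KANLayer(nn.Module):
    """KAN-style layer: one learnable univariate function per
    input-to-output edge. Each edge function is a piecewise-linear
    spline on a uniform grid (hat-function basis) -- a deliberate
    simplification of the B-splines used in full KAN libraries."""

    def __init__(self, in_dim, out_dim, num_knots=7, x_min=-3.0, x_max=3.0):
        super().__init__()
        self.x_min, self.x_max = x_min, x_max
        self.register_buffer("knots", torch.linspace(x_min, x_max, num_knots))
        self.h = (x_max - x_min) / (num_knots - 1)          # knot spacing
        # coef[o, i, k] = value of the edge function (input i -> output o)
        # at knot k; values between knots are linearly interpolated.
        self.coef = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, num_knots))

    def forward(self, x):                                   # x: (batch, in_dim)
        x = x.clamp(self.x_min, self.x_max)
        # Triangular (hat) basis: each knot contributes linearly to its
        # two neighboring intervals, giving piecewise-linear interpolation.
        basis = torch.relu(1 - (x.unsqueeze(-1) - self.knots).abs() / self.h)
        # Evaluate every edge function and sum over inputs per output unit.
        return torch.einsum("bik,oik->bo", basis, self.coef)

class TinyKAN(nn.Module):
    """A small stack of KAN-style layers, e.g., for tabular regression."""

    def __init__(self, dims=(8, 16, 1)):
        super().__init__()
        self.layers = nn.ModuleList(
            KANLayer(dims[i], dims[i + 1]) for i in range(len(dims) - 1)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
```

Training such a stack with gradient descent adjusts every edge function simultaneously, which is precisely the sense in which KAN activations are "learnable" rather than fixed.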

1.4. Contribution

The contribution of this study can be summarized as follows:
  • Scientific contribution: This study expands the methodological landscape in regression by presenting Kolmogorov–Arnold networks (KANs) as a robust alternative to traditional models like gradient boosting, ensemble methods, and neural networks such as MLPs, GRUs, and LSTMs. Gradient boosting algorithms like XGBoost and LightGBM, while effective on structured datasets, struggle with capturing intricate, nonlinear relationships in high-dimensional data and face scalability and interpretability challenges. Similarly, MLPs require extensive tuning to handle complex, high-dimensional data effectively, while GRUs and LSTMs, optimized for sequential data, are less suitable for non-sequential regression tasks and introduce significant computational overhead. KANs address these limitations by leveraging the Kolmogorov–Arnold representation theorem, which decomposes multivariate functions into sums and compositions of univariate functions. This approach allows KANs to model complex nonlinear relationships efficiently while maintaining interpretability, scalability, and adaptability to high-dimensional data. The spline-based architecture further enhances their ability to generalize well across diverse regression tasks, offering a significant advantage over both traditional and modern machine learning models. By outperforming these methods in terms of accuracy and computational efficiency, KANs demonstrate their potential as a transformative tool for tackling complex regression problems across various domains.
  • Economic and social impact: By improving prediction accuracy in housing price estimation, this research contributes to a more transparent and informed real-estate market. Enhanced predictive capabilities help stakeholders make better decisions, optimize investments, and mitigate risks. Furthermore, the insights gained from the study can lead to the development of smarter urban planning and housing policies, ultimately benefiting society by fostering equitable and sustainable growth in the housing sector.
  • Broader applicability across industries: While the focus is on house price estimation, the versatile nature of KANs offers potential applications beyond real estate. KANs can be applied in areas such as financial modeling, healthcare predictions, environmental analysis, and supply chain optimization. The study underscores the adaptability of this approach, setting a precedent for its application in diverse regression and prediction tasks.
This work not only establishes the utility of KANs in regression problems but also sets the stage for a future exploration of their potential across a wide range of applications. Section 2 provides an overview of related works in the field of KANs and house price prediction, while Section 3 explains the research methodology and the proposed KAN model in detail. In Section 4 and Section 5, the experiments are described, and the results are discussed. Finally, Section 6 concludes the paper and provides insights for future research in this area.

2. Literature Review and State of the Art

2.1. KAN-Based Approaches

Liu et al. introduce KANs, a novel neural network architecture inspired by the Kolmogorov–Arnold representation theorem, and show that it has significant advantages over traditional MLPs [9]. KANs are designed to effectively capture compositional structures in data, enabling them to approximate complex functions, such as exponential and sine functions, with greater efficiency and accuracy. The authors demonstrate that KANs not only outperform MLPs in terms of accuracy, particularly in small-scale AI and science tasks, but also enhance interpretability by revealing the underlying compositional structures in symbolic formulas. Furthermore, KANs can utilize grid extension techniques to further improve their performance, positioning them as a promising foundation for future AI applications in scientific discovery and mathematical exploration.
Moradi et al. investigate the potential of KANs as a transformative approach to deep learning, particularly in the context of autoencoders for image representation tasks. The study compares KAN-based autoencoders against traditional convolutional neural networks (CNNs) across several datasets, revealing competitive reconstruction accuracy and suggesting KANs’ viability in advanced data analysis applications [10].
Jiang et al. present a novel hybrid model combining KANs and artificial NNs for short-term load forecasting, addressing the limitations of traditional NN approaches. By leveraging KANs to identify essential periodic and nonlinear patterns in load data, while utilizing NNs to fit the residual components, the proposed method enhances both interpretability and predictive accuracy. The integration of regularization techniques ensures that KAN-derived predictions maintain dominance, leading to clearer analytical expressions for forecasting. Experimental validation against various models demonstrates superior performance of the hybrid approach, marking a significant advancement in short-term load forecasting methodologies [11].
Yang et al. propose the Kolmogorov–Arnold transformer (KAT), an innovative architecture that enhances traditional transformer models by substituting MLP layers with KAN layers [12]. This approach significantly improves expressiveness and performance, addressing key challenges in integrating KANs into transformers and paving the way for more efficient deep learning models.
Hollósi et al. present a study focused on enhancing road safety by detecting mobile phone usage among bus drivers [13]. The research utilizes KANs to improve the accuracy of driver monitoring systems in compliance with regulations against phone use while driving.
Danish et al. introduce a Kolmogorov–Arnold recurrent network (KARN), a novel approach combining KAN’s flexibility with RNN’s temporal modeling capabilities for improved load forecasting in energy management. Through the integration of learnable temporal spline functions and edge-based activations, KARN addresses the limitations of traditional RNNs and LSTMs in capturing complex load variations. Extensive evaluation across diverse consumer types, including residential and industrial buildings, demonstrates KARN’s superior performance and versatility compared to conventional approaches, establishing it as a robust solution for energy-load forecasting [14].
Gao et al. explore the application of KANs for predicting renewable energy variables, addressing two key challenges in deep learning: interpretability and architectural sensitivity. Using data from the Tokyo Meteorological Observatory, the research demonstrates KAN’s effectiveness in predicting solar radiation and outdoor temperature, achieving superior performance even with minimal architecture [15].

2.2. Literature Review on House Price Prediction

Predicting house prices is a popular task in machine learning, and there are numerous state-of-the-art papers that use NNs and other machine learning techniques for this task.
Limsombunchai et al.'s work is one of the first research studies to demonstrate that an artificial neural network model has significant potential in predicting house prices. The purpose of the research was to evaluate and compare the effectiveness of the hedonic model and an artificial neural network model in predicting house prices. The study uses a sample of 200 houses in Christchurch, New Zealand, obtained randomly from the Harcourt website, taking into account factors such as house size, age, type, the number of bedrooms, bathrooms, garages, amenities, and location [16].
Afonso et al. propose a hybrid model that combines deep learning recurrent neural networks (RNNs) and random forest algorithms for predicting housing prices. The model uses features extracted from the image and text data of the houses, along with other relevant numerical data. The results show that the proposed model outperforms both standalone deep learning and random forest models for the prediction of housing prices in Brazil [17].
Nouriani et al. propose a novel approach for estimating housing prices using a combination of interior and exterior images of houses, as well as satellite images. The proposed method utilizes deep convolutional neural networks to extract features from images and then (along with other house features) employs multiple regression models to estimate house prices based on these features [18].
Mora-Garcia et al. present a study on the use of machine learning algorithms for housing price prediction in California during the COVID-19 pandemic. The authors analyzed different regression models, including linear regression, decision trees, and others to predict the price of houses in different regions of California. The study also evaluated the impact of the pandemic on the housing market and the performance of the different models [7].
Kim et al. explore the application of various machine learning algorithms, including support vector machine, random forest, XGBoost, LightGBM, and CatBoost, as automatic real-estate valuation models. The study involves analyzing approximately 57,000 records on apartment transactions in Seoul in 2018 and proposes combination methods to improve the models’ predictive power. The findings suggest that machine learning predictors outperform conventional models, and an efficient averaging of the predictors can improve their predictive accuracy. Additionally, machine learning algorithms can recommend which algorithm should be selected for making predictions [19].
Joshi et al. propose to use multiple different models that can be used for prediction and focus on more accurate results. They propose to use an ensemble learning method and combine multiple ML models to improve the results [20].
Ragb et al. propose a hybrid Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) model for real-estate price prediction. The model combines the strengths of both GRU and LSTM networks. The results of these experiments in the Boston dataset showed that the proposed model has better performance when the networks are used in the fusion process than when they act individually [21].
Dobrovolska et al. investigate the dynamics of the real-estate market in Kyiv, Ukraine, focusing on the impact of various economic factors on housing prices. The study employs a dual-approach forecasting model, combining traditional methods with advanced analytical tools to enhance prediction accuracy. Ultimately, the research aims to provide a robust framework for future forecasting in the volatile real-estate sector [22].

3. Materials and Methods

3.1. Problem Formulation and Proposed Model Description

The primary goal of this research is to predict housing prices based on a set of input features derived from real-estate datasets. Formally, we define the problem as follows:
Let $X = \{x_1, x_2, \ldots, x_n\}$ represent the input features, where $x_i$ corresponds to specific characteristics of a property (e.g., size, location, and number of rooms). The target variable $y$ represents the predicted price of the property. The objective is to learn a mapping function $f: X \rightarrow y$ such that the prediction error $L(y, \hat{y})$, where $\hat{y} = f(x)$, is minimized. Here, $L$ is the loss function used to quantify the prediction error.
To ensure the reliability and generalizability of the proposed KAN model, we explicitly state the following assumptions:
  • Independence: The input features are assumed to be conditionally independent given the target variable.
  • Sufficient data: The dataset is assumed to be sufficiently large and diverse to capture the underlying patterns in housing prices.
  • Feature engineering: The input features are assumed to be preprocessed (e.g., normalized) to ensure compatibility with the neural network.
These assumptions are tested and validated during the experimental phases.
KAN is a neural network architecture inspired by the Kolmogorov–Arnold representation theorem. This property allows KAN to decompose complex multivariate functions into simpler univariate components, making it uniquely suited for capturing nonlinear interactions in high-dimensional data.
The Kolmogorov–Arnold representation theorem provides the theoretical foundation for KAN. For a continuous function $f: \mathbb{R}^n \rightarrow \mathbb{R}$, the theorem states the following:
$$f(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{2n+1} g_i\left(\sum_{j=1}^{n} h_{ij}(x_j)\right)$$
where $g_i$ and $h_{ij}$ are continuous univariate functions, and $x_1, x_2, \ldots, x_n$ are the input variables.
In the KAN model, this decomposition is approximated using neural network layers.
The input features are transformed into intermediate representations:
$$z_i = \sum_{j=1}^{n} h_{ij}(x_j)$$
where $h_{ij}$ are univariate transformations modeled by spline functions.
The intermediate features $z_i$ are passed through another set of univariate functions $g_i$ to compute the final output:
$$\hat{y} = \sum_{i=1}^{2n+1} g_i(z_i)$$
The model minimizes a loss function $L$, such as the Mean Squared Error (MSE):
$$L(y, \hat{y}) = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2$$
where $m$ is the number of training samples, $y_i$ is the true value, and $\hat{y}_i$ is the predicted value. To prevent overfitting, regularization terms are added:
$$L_{\text{total}} = L(y, \hat{y}) + \lambda_1 \|\theta\|_1 + \lambda_2 \|\theta\|_2^2$$
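In code, this regularized objective can be sketched as follows (PyTorch); the lambda values are illustrative, not the study's tuned settings:

```python
import torch

def kan_loss(model, y_pred, y_true, lam1=1e-5, lam2=1e-4):
    """MSE plus L1 and L2 penalties on all trainable parameters,
    mirroring L_total = L(y, y_hat) + lambda1*||theta||_1 + lambda2*||theta||_2^2.
    lam1 and lam2 are placeholder values for illustration."""
    mse = torch.mean((y_true - y_pred) ** 2)
    l1 = sum(p.abs().sum() for p in model.parameters())
    l2 = sum((p ** 2).sum() for p in model.parameters())
    return mse + lam1 * l1 + lam2 * l2
```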
Below, a detailed architectural description is provided:
  • Input Layer: The input layer accepts a vector X of size n, where n corresponds to the number of features. Each feature is standard-scaled to ensure consistent scaling across all inputs.
  • Intermediate Layers:
    • Three KAN layers are sequentially stacked, each performing a decomposition of the input into univariate components using splines.
    • Each KAN layer introduces nonlinear transformations, enabling the model to capture intricate feature interactions.
    • The splines are defined by a set number of knots and a specific order, which control the granularity and smoothness of the approximations. The number of knots determines how many polynomials are strung together. The grid range and epsilon define the bounds and resolution of the decomposition.
    • Fully connected layers refine the outputs of the KAN layers.
    • Activation functions introduce nonlinearity and enhance the model’s expressiveness.
  • Output Layer: The output layer consists of a single neuron with a linear activation function, producing the predicted price $\hat{y}$.
  • Loss Function: The Mean Squared Error (MSE) is used as the loss function.
  • Regularization: Dropout and L1 and L2 regularization techniques are applied to prevent overfitting.
  • Optimization Algorithm: The model is trained using the Adam optimizer, which combines the benefits of momentum and adaptive learning rates to ensure efficient convergence.
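Below is a hedged sketch of how these components could be assembled in PyTorch, reusing the KANLayer sketch from Section 1.3. Layer widths, the dropout rate, and the learning rate here are placeholders rather than the tuned values reported later in Table 1.

```python
import torch
import torch.nn as nn

class HousePriceKAN(nn.Module):
    """Three stacked KAN-style layers followed by fully connected
    refinement layers and a single linear output neuron."""

    def __init__(self, n_features: int):
        super().__init__()
        self.kan = nn.Sequential(
            KANLayer(n_features, 64),   # KANLayer: sketch from Section 1.3
            KANLayer(64, 32),
            KANLayer(32, 16),
        )
        self.head = nn.Sequential(
            nn.Linear(16, 16),
            nn.ReLU(),
            nn.Dropout(0.2),            # dropout regularization
            nn.Linear(16, 1),           # linear output: predicted price
        )

    def forward(self, x):
        return self.head(self.kan(x)).squeeze(-1)

model = HousePriceKAN(n_features=20)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Typical training step, using kan_loss from the sketch above:
#   loss = kan_loss(model, model(X_batch), y_batch)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```

In this sketch, dropout together with the L1/L2 penalty in the loss provides the regularization described in the list above.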

3.2. Research Methodology

The research methodology consists of the following phases (Figure 2):
1. Literature Review
First, a thorough literature review was conducted to identify state-of-the-art models and available datasets. This step was crucial for deciding which models would be implemented so that the proposed KAN-based model could be compared with them on the corresponding datasets.
2. Data Collection and Preprocessing
After that, data collection, preprocessing, and cleaning were performed to ensure the quality and consistency of the data for subsequent analysis. To mitigate the impact of extreme values on model performance, outlier detection was implemented using a quantile-based approach: observations where the price feature fell outside the 5th and 95th percentiles were excluded from the datasets. Additionally, feature engineering incorporated standardization techniques, including Min–Max scaling and standard scaling, to normalize the input features and enhance model performance. The next section provides all the necessary details.
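The sketch below illustrates this cleaning-and-scaling step with pandas and scikit-learn. The column name "price" is a placeholder, and fitting the scaler on the full frame (rather than only the training split) is a simplification for brevity; the study also used Min–Max scaling for some features, while standard scaling is shown here.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def clean_and_scale(df: pd.DataFrame, target: str = "price") -> pd.DataFrame:
    # Quantile-based outlier filter: drop rows whose target value falls
    # outside the 5th-95th percentile band, as described above.
    lo, hi = df[target].quantile([0.05, 0.95])
    df = df[(df[target] >= lo) & (df[target] <= hi)].dropna().copy()
    # Standard-scale the numeric input features.
    num_cols = df.select_dtypes("number").columns.drop(target)
    df[num_cols] = StandardScaler().fit_transform(df[num_cols])
    return df
```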
3. Model Development and Validation
State-of-the-art models found in the literature were developed and validated, specifically tailored for regression and house price prediction tasks. In detail, the following deep learning neural network models were examined, implemented, and compared with KAN:
  • MLP
The MLP is a type of feedforward neural network comprising an input layer, one or more hidden layers, and an output layer. Each layer is fully connected to the subsequent layer, and neurons use an activation function to introduce nonlinearity. The mathematical representation for a neuron in a hidden layer is given by the following:
$$h = f(Wx + b)$$
where
- $h$ is the output of the neuron,
- $W$ is the weight matrix,
- $x$ is the input vector,
- $b$ is the bias vector, and
- $f$ is the activation function (e.g., ReLU or sigmoid).
MLPs are general-purpose neural networks that approximate complex functions through multiple layers. However, their black-box nature and reliance on global optimization make them less interpretable than KAN, which provides a transparent decomposition of feature interactions.
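For reference, an MLP baseline of this kind can be expressed in a few lines of Keras (the study used TensorFlow for the MLP); the layer sizes and activations below are illustrative, not the tuned configuration:

```python
import tensorflow as tf

# Minimal Keras MLP baseline: each hidden layer computes h = f(Wx + b).
def build_mlp(n_features: int) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1),                 # linear output for price
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model
```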
  • LSTM
The Long Short-Term Memory (LSTM) network enhances the RNN by incorporating memory cells and three gating mechanisms: input, forget, and output gates [21]. These gates enable the model to retain long-term dependencies. The mathematical operations are provided below:
$$\text{Forget Gate:}\quad f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
$$\text{Input Gate:}\quad i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\text{Candidate Cell State:}\quad \tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$
$$\text{Cell State Update:}\quad C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
$$\text{Output Gate:}\quad o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$\text{Hidden State:}\quad h_t = o_t \odot \tanh(C_t)$$
LSTM networks are designed for sequential data, leveraging memory cells to retain temporal dependencies. While effective for time-series tasks, their sequential nature makes them less suited for static datasets like house price prediction, where feature interactions are more critical than temporal patterns. In contrast, KAN focuses on capturing nonlinear dependencies across static features.
  • GRU
The Gated Recurrent Unit (GRU) is a variant of the recurrent neural network (RNN) that addresses the vanishing gradient problem through gating mechanisms [21]. GRUs utilize reset and update gates to control the flow of information. The mathematical formulations are the following:
$$\text{Update Gate:}\quad z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$$
$$\text{Reset Gate:}\quad r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$$
$$\text{Candidate Hidden State:}\quad \tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)$$
$$\text{Hidden State:}\quad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
GRU, like LSTM, is a recurrent neural network architecture designed for sequential data. It simplifies the LSTM architecture by combining the forget and input gates, making it computationally efficient. However, GRU’s focus on temporal dependencies limits its applicability to static datasets, where KAN’s spline-based decomposition excels at modeling complex, nonlinear feature interactions.
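For illustration, a hybrid stack in the spirit of the GRU–LSTM baseline of [21] can be sketched in Keras as follows. Treating the static feature vector as a length-one "sequence" is an assumption made here so the recurrent layers can consume tabular data; the exact topology in [21] may differ.

```python
import tensorflow as tf

# Illustrative hybrid: a GRU layer feeding an LSTM layer.
def build_gru_lstm(n_features: int) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(1, n_features)),    # (timesteps=1, features)
        tf.keras.layers.GRU(64, return_sequences=True),
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model
```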
  • Regression-Based Models
These models are primarily focused on fitting data through regression techniques. First, linear regression was utilized. Linear regression tries to fit a straight line to minimize errors between predictions and actual values. Also, a random forest regressor was developed and tested. This algorithm combines multiple decision trees by averaging predictions for robustness. Regression models, including linear and nonlinear variants, are foundational for predictive modeling. However, their performance is often constrained by assumptions of linearity or predefined functional forms. KAN overcomes these limitations by learning univariate functions adaptively, providing a more flexible and accurate framework for nonlinear prediction.
  • Boosting-Based Models
These models use boosting techniques to combine weak learners (e.g., decision trees) into a strong ensemble. XGBoost, CatBoost, and LightGBM were implemented and tested [20,21,22]. Boosting methods combine weak learners (e.g., decision trees) to create strong predictive models, excelling in structured data tasks. However, their reliance on hierarchical splits can oversimplify complex feature interactions. KAN addresses this gap by decomposing multivariate relationships into univariate functions, enabling precise and interpretable predictions.
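A minimal sketch of these three boosting baselines follows; the hyperparameters shown are illustrative, since the study tuned each model separately (see Section 4):

```python
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

def fit_boosting_baselines(X_train, y_train):
    """Fit the three boosting baselines compared against KAN.
    X_train/y_train are assumed to be preprocessed arrays."""
    models = {
        "XGBoost": XGBRegressor(n_estimators=500, learning_rate=0.05),
        "LightGBM": LGBMRegressor(n_estimators=500, learning_rate=0.05),
        "CatBoost": CatBoostRegressor(iterations=500, verbose=0),
    }
    for model in models.values():
        model.fit(X_train, y_train)
    return models
```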
4. Integration and Testing—Experiments
Comprehensive testing was conducted to evaluate the performance of the models. Two widely used datasets were utilized for the experiments, namely, the Greece listings (https://www.kaggle.com/datasets/argyrisanastopoulos/greece-property-listings/ accessed on 10 September 2024) and the California house pricing (https://www.kaggle.com/code/ahmedmahmoud16/california-housing-prices accessed on 10 September 2024). Metrics such as the Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Symmetric Mean Absolute Percentage Error (SMAPE) were used to assess model effectiveness; functions computing them are sketched below. Based on the testing outcomes, the models were iteratively refined to enhance their performance and reliability. The next section provides detailed information about the experiments and the results.
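For reference, the three evaluation metrics can be computed as follows. The SMAPE variant shown (with a factor of 2 and a small epsilon guard) is one common definition; the exact variant used in the study is not specified.

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def smape(y_true, y_pred):
    # Symmetric MAPE in percent; the epsilon avoids division by zero.
    denom = np.abs(y_true) + np.abs(y_pred) + 1e-12
    return 100.0 * np.mean(2.0 * np.abs(y_pred - y_true) / denom)
```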

4. Results

The performance of the models was evaluated using the MAE, RMSE, and SMAPE. In both cases, 90% of the records were randomly assigned for training the models, while 10% were used for testing. The experiments were carried out in Google Colab using the Python programming language and its relevant libraries (e.g., deep KAN for implementing the Kolmogorov–Arnold network and TensorFlow for the MLP and GRU–LSTM). To ensure a fair comparison between the models in predicting house prices, the study adhered to the following principles:
(a) Exclusion of feature extraction models: No feature extraction methods, such as autoencoders or Boltzmann Machines, were used beforehand.
(b) Model optimization: The most effective configuration parameters (the number of layers, neurons, optimizer, etc.) for each model were identified through hyperparameter tuning, a critical aspect of model development and validation. Hyperparameters play a pivotal role in defining both the structural and operational characteristics of a model, significantly influencing its performance. The tuning process combined manual trial-and-error with grid-search techniques and involved several key steps, sketched after this list. For the learning rate, a range of values (e.g., 0.001 and 0.01) was tested to balance precision in convergence with training efficiency, as smaller learning rates can lead to more precise convergence but increase training time, while larger rates risk overshooting the optimal solution. The architecture was iteratively refined by testing different configurations of layers and neurons, starting with a simple structure and gradually increasing complexity; performance on the validation set was closely monitored to determine the optimal depth and number of neurons. Various activation functions were evaluated for their impact on convergence speed and overall model accuracy, with the final choice made based on empirical performance. Additional parameters, including batch size and dropout rates, were systematically varied to enhance model robustness and mitigate overfitting, and their effects on validation performance were carefully observed to ensure stability and generalization. The optimal hyperparameter settings, summarized in Table 1, were determined based on the best validation performance across multiple runs, ensuring that the model generalizes effectively to unseen data. Table 1 below shows the optimal values for the KAN-based model on the two datasets.
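A simplified grid-search skeleton over the hyperparameters discussed above is sketched below; train_and_validate is a hypothetical helper standing in for a full training run, and the candidate values are illustrative:

```python
from itertools import product

search_space = {
    "learning_rate": [0.001, 0.01],
    "batch_size": [8, 16, 32],
    "num_knots": [5, 7, 9],
    "dropout": [0.0, 0.2],
}
best_err, best_cfg = float("inf"), None
for values in product(*search_space.values()):
    cfg = dict(zip(search_space.keys(), values))
    err = train_and_validate(cfg)   # hypothetical: trains and returns val error
    if err < best_err:
        best_err, best_cfg = err, cfg
print("best config:", best_cfg, "validation error:", best_err)
```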

4.1. Dataset A—Greece Listings

The Greece House Listings dataset provides information on numerous properties for sale in Greece. The dataset comprises 20,000 rows and 25 columns, where each row represents a distinct property for sale. The columns that are included are the following:
  • location_name (the municipality of the house; categorical feature; 73 unique values; most common: Athens, 21%).
  • location_region (the region of the house; categorical feature; possible values: Attiki/Thessaloniki; most common: Attiki, 94%).
  • res_type (the type of the property; possible values: building, apartment, etc.; five unique values).
  • res_address (secondary location attribute; this stands for the exact neighborhood of the property; categorical feature; 987 unique values).
  • res_price (the advertised price for the property, in euros; this is the feature that the model has to predict; the mean value is 367,000).
  • res_sqr (square meters of the property. The mean value is 169).
  • construction_year (the construction year of the property).
  • levels (levels for the property, e.g., 1st floor, 2nd floor, etc.).
  • bedrooms (the number of bedrooms. The mean is 2.58).
  • bathrooms (the number of bathrooms. The mean is 1.48).
  • status (the current status for the property; categorical feature; possible values such as ‘good’, ‘renovated’, etc.; eight unique values).
  • energyclass (categorical feature; the energy class is from the lowest level (H) to the highest (A+) in the following order: H, Ζ, Ε, Δ, Γ, Β, Β+, A, and A+. There are also three possible values for the energy class: Non-effective, Excluded, and Pending).
  • auto_heating (autonomous heating: 1 for Yes, 0 for No).
  • solar (solar water heater: 1 for Yes, 0 for No).
  • cooling (cooling: 1 for Yes, 0 for No).
  • safe_door (safety door: 1 for Yes, 0 for No).
  • gas (1 for Yes, 0 for No).
  • fireplace (1 for Yes, 0 for No).
  • furniture (1 for Yes, 0 for No).
  • student (Is it appropriate for students? 1 for Yes, 0 for No).
Beginning with the data cleaning process, dependent columns, such as price per square meter, and columns that showed no statistically significant correlation (e.g., parking) were removed. Rows that contained empty fields were also removed. Additionally, 5% of the outliers with respect to the price were removed, including the minimum and maximum values. Categorical variables were encoded using one-hot encoding and ordinal encoding techniques (an encoding sketch follows below), and numerical variables were standardized using a standard scaler. Table 2 and Figure 3 below present the results sorted by performance (error), showing that KAN significantly outperforms all of the existing state-of-the-art approaches. Figure 4 presents a comparison plot between the values predicted by KAN and the actual values.
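As an illustration of the ordinal encoding step, the ordered energyclass feature can be mapped to integers as follows. Placing the three special statuses below the H-to-A+ scale is an assumption made here for illustration, since the paper does not state how they were ranked.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Ordered levels, lowest to highest; Ζ, Ε, Δ, Γ, Β are the Greek
# capitals that appear in the listings.
levels = ["Non-effective", "Excluded", "Pending",
          "H", "Ζ", "Ε", "Δ", "Γ", "Β", "Β+", "A", "A+"]
df = pd.DataFrame({"energyclass": ["A+", "Γ", "H", "Pending"]})  # toy sample
encoder = OrdinalEncoder(categories=[levels])
df["energyclass_enc"] = encoder.fit_transform(df[["energyclass"]]).ravel()
print(df)
```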

4.2. Dataset B—California Housing

The California Housing Prices dataset is a popular dataset from Kaggle that contains information on the median house prices for various districts in California. The dataset contains a total of 20,640 records, with each record representing a different district in California. The dataset has a total of 10 columns, with each column representing a different attribute of the district. The attributes included in the dataset are as follows:
  • longitude: Represents the longitude coordinate of the district.
  • latitude: Represents the latitude coordinate of the district.
  • housing_median_age: Represents the median age of the houses in the district. The mean value is 28.64.
  • total_rooms: Represents the total number of rooms in the district. The mean is 2643.66.
  • total_bedrooms: Represents the total number of bedrooms in the district. The mean is 538.43.
  • population: Represents the total population of the district. The mean is 1425.48.
  • households: Represents the total number of households in the district. The mean is 499.54.
  • median_income: Represents the median income of the households in the district. The mean value is 3.87.
  • median_house_value: Represents the median house value in the district. The mean is 206,855.
  • ocean_proximity: Represents the proximity of the district to the ocean. It contains five different categories.
Regarding the data cleaning process, rows that contained empty fields were removed. Additionally, 5% of the outliers with respect to the median house value were removed, including the minimum and maximum values. The categorical variable was encoded using one-hot encoding, and numerical variables were standardized using a standard scaler. Table 3 and Figure 5 below present the results sorted by performance (error), showing KAN significantly outperforming all of the existing state-of-the-art deep NN approaches. The traditional machine learning algorithms XGBoost and LGB achieve slightly better results because of the nature of the dataset: California Housing has few input variables and relatively few rows, which makes it difficult for NN-based models to reach their full potential. Additionally, Figure 6 presents a comparison plot between the values predicted by KAN and the actual values.
To address the concern regarding the limited input variables and dataset size of the California Housing dataset, which may constrain the potential of the model, an additional benchmarking experiment was conducted to evaluate the performance of KAN. The evaluation process included a comparison of KAN with XGBoost, the best-performing baseline, using an augmented version of the California Housing dataset. The augmentation process involved generating 3000 additional samples by applying Gaussian noise to the numerical features of randomly selected rows from the original dataset. Specifically, Gaussian noise with a standard deviation of 0.05 was added to these features, ensuring that the augmented samples retained the original data distribution while introducing diversity. These perturbed samples were then appended to the original dataset, resulting in an expanded dataset for training and testing. Both KAN and XGB models were fine-tuned and evaluated on the test set using the augmented dataset. The results, presented in Table 4 below, indicate that KAN outperformed XGB in MAE, RMSE, and SMAPE.
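A sketch of the described augmentation procedure is given below. Whether the noise was injected before or after feature scaling is not stated, so treating sigma = 0.05 as a perturbation of the scaled numerical features is an assumption here.

```python
import numpy as np
import pandas as pd

def augment_with_noise(df, num_cols, n_new=3000, sigma=0.05, seed=42):
    """Append n_new rows sampled (with replacement) from df, with
    Gaussian noise of standard deviation sigma added to the numerical
    features, mirroring the augmentation described above."""
    rng = np.random.default_rng(seed)
    sampled = df.sample(n=n_new, replace=True, random_state=seed).copy()
    sampled[num_cols] = sampled[num_cols] + rng.normal(
        0.0, sigma, size=(n_new, len(num_cols))
    )
    return pd.concat([df, sampled], ignore_index=True)
```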

5. Discussion

Starting with the first dataset, the Kolmogorov–Arnold network (KAN) model demonstrated its efficacy by achieving the lowest Mean Absolute Error (MAE) of 35,861, Root Mean Squared Error (RMSE) of 52,158, and Symmetric Mean Absolute Percentage Error (SMAPE) of 16.53%. These results highlight the model’s ability to capture complex nonlinear relationships effectively, outperforming advanced architectures such as GRU–LSTM and widely used machine learning models, including CatBoost, XGBoost, and LightGBM.
In contrast, on the default California Housing dataset, traditional machine learning models such as XGBoost and LightGBM emerged as the top performers. XGBoost achieved the lowest MAE (30,625), RMSE (42,477), and SMAPE (15.81%), closely followed by LightGBM. While KAN remained competitive, it recorded slightly higher metrics, with an MAE of 31,961, an RMSE of 44,191, and an SMAPE of 16.45%. These results suggest that the relatively small size and limited feature set of the California Housing dataset may constrain the full potential of neural-network-based models like KAN.
To further explore this limitation, in the augmented version of the California Housing dataset, KAN demonstrated its ability to leverage the increased data complexity and size, outperforming XGBoost across all metrics. Specifically, KAN achieved a test MAE of 34,716, an RMSE of 53,378, and an SMAPE of 17.74%, compared to XGBoost’s MAE of 36,102, RMSE of 55,533, and SMAPE of 18.31%. Notably, KAN showed an improvement of approximately 3.6% in prediction accuracy. This additional benchmarking highlights the robustness of KAN in handling augmented datasets and underscores its potential to outperform state-of-the-art methods in scenarios where data complexity and size are increased. The findings demonstrate that while traditional models may excel on simpler datasets, KAN offers significant advantages when applied to richer, more diverse datasets, further establishing its applicability to complex predictive tasks.
To further assess the performance differences between KAN and the other models on the two datasets, the Wilcoxon Signed-Rank test was conducted [23]. This non-parametric test was chosen over the paired t-test to address violations of the normality assumption in the data. Additionally, the overall performance improvement of KAN was calculated as the average percentage improvement across all metrics. The results, summarized in Table 5, provide p-values indicating the statistical significance of the comparisons and the average percentage improvements for each model. The statistical analysis underscores the effectiveness of the KAN model in predictive tasks, particularly in comparison to several state-of-the-art models. KAN demonstrates statistically significant improvements in performance metrics against widely used models such as CatBoost, GRU–LSTM, MLP, random forest, and linear regression, as indicated by p-values less than 0.05. Notably, the KAN model achieves a 9.19% improvement over CatBoost, a model known for its robustness, and a 27.98% improvement over linear regression, emphasizing its ability to capture complex nonlinear relationships.
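For reproducibility, the test can be run with SciPy as below. The paired error vectors shown are illustrative values taken from Tables 2 and 4 (KAN vs. XGBoost), not necessarily the exact pairing used in the study; in practice, more paired observations give the test more power.

```python
from scipy.stats import wilcoxon

# Paired, non-parametric comparison of KAN's and a baseline's errors.
kan_err      = [35861, 52158, 16.53, 34716, 53378, 17.74]
baseline_err = [38157, 55711, 17.41, 36102, 55533, 18.31]
stat, p = wilcoxon(kan_err, baseline_err)
print(f"Wilcoxon statistic = {stat}, p-value = {p:.4f}")
```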
This difference highlights a notable aspect of KAN’s behavior: its performance tends to excel with larger datasets and high-dimensional inputs, where its advanced approximation capabilities and ability to model intricate patterns shine. However, on smaller datasets like California Housing, with fewer rows and limited input variables, ensemble-based methods such as XGBoost and LightGBM might be better suited due to their inherent ability to optimize simpler feature interactions and adapt efficiently to constrained datasets. While KAN’s improvements over XGBoost (1.10%) and LightGBM (2.40%) are not statistically significant (p-values of 0.6875 and 0.502, respectively), it remains competitive with these ensemble methods, which are well regarded for their efficiency and performance on such small datasets.
These findings highlight both the strengths and the challenges of KANs in regression problems. KAN addresses critical challenges in predictive modeling by capturing intricate nonlinear relationships and providing a theoretically grounded framework for feature decomposition. While LSTM models are effective for sequential data, their temporal focus is not suitable for static datasets like those used in this study. Similarly, boosting methods such as XGBoost and LightGBM rely on hierarchical tree-based splits, which may fail to capture complex nonlinear interactions present in high-dimensional data. KAN offers a distinct advantage by leveraging its spline-based architecture to learn univariate functions that are composable, scalable, and interpretable. Unlike regression models, which are constrained by linear assumptions, KAN provides a flexible framework for modeling nonlinear dependencies with theoretical rigor and empirical robustness. The proposed KAN model excels in scenarios with high-dimensional, nonlinear data, as evidenced by its superior performance.
In general, KANs have demonstrated exceptional capabilities in scenarios that demand complex function approximation. This makes such models particularly transformative for datasets characterized by extensive variables and intricate high-dimensional relationships, such as those found in marketing, financial modeling, etc. By modeling nonlinear interactions and identifying subtle patterns within data, KANs can often surpass conventional regression techniques in terms of accuracy and insight generation.
On the other hand, several limitations warrant discussion and present opportunities for future research. A primary challenge lies in KAN’s performance on small or low-dimensional datasets, where simpler models can achieve superior results with a significantly lower computational overhead. Moreover, KANs, due to their architecture and the nature of their computations, can be computationally intensive. This complexity may lead to longer training times and increased resource requirements, particularly when scaling to larger datasets or more intricate models. To address these challenges, several optimization methods can be applied. Regularization techniques, such as L2 regularization or dropout, can help to prevent overfitting by penalizing excessive complexity. Data augmentation or synthetic data generation can enhance the dataset, helping KANs generalize better. Moreover, simplifying the model architecture or combining KANs with simpler learning methods, like decision trees for final predictions, could reduce computational costs and improve performance on small datasets. These adjustments can enhance the adaptability of KANs, making them more effective in predictive regression tasks.

6. Conclusions

In this study, a custom KAN-based deep neural network model is proposed for the house price estimation problem. The architecture is a deep neural network consisting of multiple stacked KAN and fully connected layers. To evaluate the performance of the KAN model, we used two datasets: the Greece Listings dataset, which consists of real-estate listings in Greece, and the California Housing dataset, which contains data related to houses in California. Both datasets were obtained from Kaggle and are publicly available there.
Based on the analysis conducted on the two datasets, the results show that the custom KAN-based model can outperform the best regression algorithms and traditional neural network models in terms of predictive accuracy (with respect to the MAE, RMSE, and SMAPE metrics tested). These findings demonstrate that KANs have significant potential for improving predictive performance in real-world applications. Furthermore, the outcomes demonstrate the effectiveness of KANs in modeling complicated, dynamic systems with incomplete and noisy data. Overall, the results provide strong evidence for the use of KANs as a powerful and adaptable tool for data analysis and decision-making.
Regarding the future directions of this research, multiple plans have been outlined to further extend this study. One direction is to conduct more experiments with various types of KAN-based architectures to explore their full potential in greater detail. While the experiments conducted utilized public datasets, a future research direction includes the validation of the KAN model with real market data to assess its practical applicability. This validation strategy encompasses collaboration with real-estate firms to access actual transaction data and conducting targeted pilot studies in specific geographic regions. Our research roadmap includes the development of a mobile application to enhance accessibility and practical implementation.
Beyond house price estimation, the KAN model demonstrates significant potential for broader applications in the real-estate industry. The model’s versatility extends to investment analysis for property valuation, market trend forecasting through historical data analysis, risk assessment considering multiple economic indicators, and urban planning support through data-driven insights. These applications collectively position KAN as an innovative analytical tool that can significantly enhance decision-making processes in the real-estate sector.
This comprehensive approach to practical implementation and diverse application scenarios underscores the potential of KAN as a transformative technology in real-estate analytics while maintaining rigorous standards for validation and continuous improvement. Finally, it is hoped that this research will inspire new research directions and collaborations in the field of neural networks, opening the way for further innovation and breakthroughs in the field of artificial intelligence.

Author Contributions

Conceptualization, I.V. and A.T.; Approach, I.V. and A.T.; Implementation, I.V.; Resources, I.V.; Writing—Original Draft Preparation, I.V.; Writing—Review and Editing, I.V. and A.T.; Supervision, A.T.; Project Administration, I.V. and A.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets used in this study can be downloaded at https://www.kaggle.com (accessed on 2 September 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Feth, M. Proptech: The real estate industry in transition. In The Routledge Handbook of FinTech; Taylor & Francis Group: Abingdon, UK, 2021; pp. 385–391.
  2. Siniak, N.; Kauko, T.; Shavrov, S.; Marina, N. The impact of proptech on real estate industry growth. IOP Conf. Ser. Mater. Sci. Eng. 2020, 869, 062041.
  3. He, X.; Lin, Z.; Liu, Y. Volatility and Liquidity in the Real Estate Market. J. Real Estate Res. 2018, 40, 523–550.
  4. Braesemann, F.; Baum, A. PropTech: Turning Real Estate Into a Data-Driven Market? SSRN Electron. J. 2020, 1–22.
  5. Chiang, M.-C.; Sing, T.F.; Wang, L. Interactions Between Housing Market and Stock Market in the United States: A Markov Switching Approach. J. Real Estate Res. 2020, 42, 552–571.
  6. Adamczyk, T.; Bieda, A. The applicability of time series analysis in real estate valuation. Geomat. Environ. Eng. 2015, 9, 15–25.
  7. Mora-Garcia, R.-T.; Cespedes-Lopez, M.-F.; Perez-Sanchez, V.R. Housing Price Prediction Using Machine Learning Algorithms in COVID-19 Times. Land 2022, 11, 2100.
  8. Kundu, A.; Sarkar, A.; Sadhu, A. KANQAS: Kolmogorov-Arnold Network for Quantum Architecture Search. EPJ Quantum Technol. 2024, 11, 76.
  9. Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T.Y.; Tegmark, M. KAN: Kolmogorov-Arnold Networks. arXiv 2024, arXiv:2404.19756.
  10. Moradi, M.; Panahi, S.; Bollt, E.; Lai, Y.C. Kolmogorov-Arnold Network Autoencoders. arXiv 2024, arXiv:2410.02077.
  11. Jiang, B.; Wang, Y.; Wang, Q.; Geng, H. A Hybrid KAN-ANN based Model for Interpretable and Enhanced Short-term Load Forecasting. Authorea Prepr. 2024.
  12. Yang, X.; Wang, X. Kolmogorov-Arnold Transformer. arXiv 2024, arXiv:2409.10594.
  13. Hollósi, J.; Ballagi, Á.; Kovács, G.; Fischer, S.; Nagy, V. Detection of Bus Driver Mobile Phone Usage Using Kolmogorov-Arnold Networks. Computers 2024, 13, 218.
  14. Danish, M.U.; Grolinger, K. Kolmogorov–Arnold recurrent network for short term load forecasting across diverse consumers. Energy Rep. 2025, 13, 713–727.
  15. Gao, Y.; Hu, Z.; Chen, W.-A.; Liu, M.; Ruan, Y. A revolutionary neural network architecture with interpretability and flexibility based on Kolmogorov–Arnold for solar radiation and temperature forecasting. Appl. Energy 2025, 378, 124844.
  16. Limsombunchai, V.; Gan, C.; Lee, M. House Price Prediction: Hedonic Price Model vs. Artificial Neural Network. Am. J. Appl. Sci. 2004, 1, 193–201.
  17. Afonso, B.; Melo, L.; Oliveira, W.; Sousa, S.; Berton, L. Housing Prices Prediction with a Deep Learning and Random Forest Ensemble. In Proceedings of the Encontro Nacional de Inteligência Artificial e Computacional (ENIAC), Rio Grande, Brazil, 24 September 2020; pp. 389–400.
  18. Nouriani, A.; Lemke, L. Vision-based housing price estimation using interior, exterior & satellite images. Intell. Syst. Appl. 2022, 14, 200081.
  19. Kim, J.; Lee, Y.; Lee, M.-H.; Hong, S.-Y. A Comparative Study of Machine Learning and Spatial Interpolation Methods for Predicting House Prices. Sustainability 2022, 14, 9056.
  20. Joshi, H.; Swarndeep, S. A Comparative Study on House Price Prediction using Machine Learning. Int. Res. J. Eng. Technol. 2022, 9, 782–788.
  21. Ragb, H.; Muntaser, A.; Jera, E.; Saide, A.; Elwarfalli, I. Hybrid GRU-LSTM Recurrent Neural Network-Based Model for Real Estate Price Prediction. TechRxiv 2023.
  22. Dobrovolska, O.; Fenenko, N. Forecasting Trends in the Real Estate Market: Analysis of Relevant Determinants. Financ. Mark. Inst. Risks 2024, 8, 227–253.
  23. Rey, D.; Neuhäuser, M. Wilcoxon-Signed-Rank Test. In International Encyclopedia of Statistical Science; Springer: Berlin/Heidelberg, Germany, 2011; pp. 1658–1659.
Figure 1. KAN architecture.
Figure 2. Research methodology.
Figure 3. Models' performance in Dataset A.
Figure 4. Comparison plot of predicted and actual values in Dataset A—KAN.
Figure 5. Models' performance in Dataset B.
Figure 6. Comparison plot of predicted and actual values in Dataset B—KAN.
Table 1. Optimal parameters for KAN.

Parameter           | Optimal Values: Dataset A | Optimal Values: Dataset B
Model architecture  | 256-128-64-32-1           | 50-40-30-20-1
Optimizer           | Adam                      | Adam
Activation function | ReLU                      | ReLU
Number of knots     | 7                         | 7
Spline layer        | 9                         | 9
Learning rate       | 0.001                     | 0.01
Batch size          | 8                         | 8
Table 2. Results in Dataset A.

Model Name        | MAE    | RMSE   | SMAPE (%)
KAN               | 35,861 | 52,158 | 16.53
GRU–LSTM          | 36,972 | 53,763 | 16.92
CatBoost          | 37,615 | 54,918 | 17.14
XGBoost           | 38,157 | 55,711 | 17.41
LGB               | 38,961 | 55,906 | 17.98
MLP               | 39,123 | 57,034 | 18.07
Random forest     | 46,124 | 67,341 | 21.16
Linear regression | 46,451 | 68,176 | 21.65
Table 3. Results in Dataset B—California Housing.

Model Name        | MAE    | RMSE   | SMAPE (%)
XGBoost           | 30,625 | 42,477 | 15.81
LGB               | 30,792 | 42,709 | 15.93
KAN               | 31,961 | 44,191 | 16.45
Random forest     | 32,774 | 45,458 | 16.87
GRU–LSTM          | 33,174 | 46,012 | 16.98
MLP               | 35,723 | 49,548 | 17.14
CatBoost          | 36,345 | 50,411 | 18.67
Linear regression | 40,452 | 56,107 | 20.26
Table 4. KAN and XGBoost test results in the augmented California Housing dataset.

Model Name | MAE    | RMSE   | SMAPE (%)
KAN        | 34,716 | 53,378 | 17.74
XGBoost    | 36,102 | 55,533 | 18.31
Table 5. Statistical significance results.

Model Compared    | p-Value              | Performance Improvement
CatBoost          | 0.03 (Significant)   | 9.19%
GRU–LSTM          | 0.003 (Significant)  | 3.28%
LGB               | 0.502                | 2.40%
Linear regression | 0.005 (Significant)  | 27.98%
MLP               | 0.003 (Significant)  | 9.31%
Random forest     | 0.0312 (Significant) | 15.62%
XGBoost           | 0.6875               | 1.10%


