Article

Advancing Real-Estate Forecasting: A Novel Approach Using Kolmogorov–Arnold Networks

by Iosif Viktoratos * and Athanasios Tsadiras
Department of Economics, Aristotle University of Thessaloniki, 541 24 Thessaloniki, Greece
* Author to whom correspondence should be addressed.
Algorithms 2025, 18(2), 93; https://doi.org/10.3390/a18020093
Submission received: 11 December 2024 / Revised: 23 January 2025 / Accepted: 2 February 2025 / Published: 7 February 2025
(This article belongs to the Section Evolutionary Algorithms and Machine Learning)

Abstract

Accurately estimating house values is a critical challenge for real-estate stakeholders, including homeowners, buyers, sellers, agents, and policymakers. This study introduces a novel approach to this problem using Kolmogorov–Arnold networks (KANs), a type of neural network based on the Kolmogorov–Arnold theorem. The proposed KAN model was tested on two datasets and demonstrated superior performance compared to existing state-of-the-art methods for predicting house prices. By delivering more precise price forecasts, the model supports improved decision-making for real-estate stakeholders. Additionally, the results highlight the broader potential of KANs for addressing complex prediction tasks in data science. This study aims to provide an innovative and effective solution for accurate house price estimation, offering significant benefits for the real-estate industry and beyond.

1. Introduction

1.1. Residential Real-Estate Domain: Motivation and Challenges

The real-estate sector plays a vital role in the economy and includes various domains such as residential, commercial, industrial, and retail. It relates to a broad spectrum of activities centered on the purchase, sale, rental, lease, and management of properties [1]. Its dynamics and developments have a pertinent impact on people, companies, and state authorities around the globe [2].
The residential real-estate sector is an integral part of the overall property market, covering the purchase, sale, and leasing of residential property (single-family homes, townhomes, and condominiums) [2]. This segment has a huge impact on the overall economy [3]. Residential properties represent about 65% of the entire real-estate market in the USA. Home sales in the United States hit an all-time high of nearly seven million in 2021, after a sustained recovery from post-crisis lows. The average price per square foot of floor space in new single-family housing fell after the 2008 financial crisis and then stagnated for several years. Since 2012, the price has increased continuously, reaching approximately USD 144 per square foot in 2021. In the same year, the average sales price of a new home was over USD 454,000, and, in 2022, house prices increased further (https://www.statista.com/statistics/682549/average-price-per-square-foot-in-new-single-family-houses-usa/ accessed on 28 November 2024). These figures indicate that understanding the sector's current state, trends, and prospects is essential for both industry participants and investors.

1.2. The Specific Case of the House Price Prediction Problem

Like the broader real-estate market, the housing segment is subject to various factors that influence its growth and performance. House prices are one of the most important economic indicators, and changes in prices have substantial consequences for households, policymakers, investors, and others. Understanding the dynamics of house prices is therefore crucial for assessing the health of the housing market [4]. It also helps investors identify trends, forecast future changes, and make successful decisions about buying and selling property.
Identifying the factors that affect house prices has been a topic of interest for many years. Factors and trends that influence the price of a house include location, size, condition, the use of technology, etc. [5].

1.3. Scientific Background and the Proposed Approach

Researchers and industries are actively working on projects related to predicting house prices [6]. Based on a set of input features/factors derived from real-estate datasets, they develop models capable of estimating house prices. Machine learning techniques and deep neural networks (NNs) have been widely employed for this purpose [7].
The objective of this study is to implement and propose a novel approach using KANs. KANs are based on the Kolmogorov–Arnold theorem, which states that any continuous function of several variables can be represented as a superposition of continuous functions of one variable and the operation of addition. Formally, for any continuous function $f: \mathbb{R}^n \rightarrow \mathbb{R}$, there exist continuous univariate functions $g_i$ and $h_{ij}$ such that $f(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{2n+1} g_i\left(\sum_{j=1}^{n} h_{ij}(x_j)\right)$.
This theorem has several important implications for function approximation and neural network design. It suggests that complex, multi-dimensional functions can be decomposed into simpler, one-dimensional functions, which can be more easily modeled and interpreted. Various generalizations of the Kolmogorov–Arnold theorem have been proposed, extending its applicability to different types of functions and higher dimensions. These generalizations have significant implications for fields such as machine learning, where understanding the structure of data is crucial for developing effective predictive models. Unlike the static, non-learnable activation functions in Multi-Layer Perceptrons (MLPs), KANs incorporate univariate functions that act as both weights and activation functions, adapting as part of the learning process. They use learnable spline functions (B-splines) to represent nonlinear input transformations, with relatively fewer parameters than a traditional neural network. By decentralizing activation functions to edges rather than nodes, KANs align with the Kolmogorov–Arnold representation theorem, enhancing model interpretability and performance (Figure 1) [8]. These models are designed to handle complex, nonlinear relationships within the data, capturing intricate patterns that traditional models might miss. KANs excel at representing highly nonlinear functions, making them particularly suited to domains such as customer behavior analysis, where interactions between features can be complex and multi-faceted. By leveraging the Kolmogorov–Arnold representation theorem, these networks can approximate any continuous function, providing high flexibility and accuracy in prediction problems [9].
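To make the edge-based design concrete, the following is a minimal PyTorch sketch of a KAN-style layer. It is an illustrative simplification, not the implementation used in this study: each edge's learnable univariate function is parameterized as a piecewise-linear spline on a fixed uniform grid (a hat-function basis), whereas full KAN implementations use cubic B-splines with a residual base activation. All dimensions and grid ranges are assumptions for demonstration.

```python
import torch
import torch.nn as nn

class KANLayer(nn.Module):
    """KAN-style layer: one learnable univariate function per
    input-to-output edge. Each edge function is a piecewise-linear
    spline on a uniform grid (hat-function basis) -- a deliberate
    simplification of the B-splines used in full KAN libraries."""

    def __init__(self, in_dim, out_dim, num_knots=7, x_min=-3.0, x_max=3.0):
        super().__init__()
        self.x_min, self.x_max = x_min, x_max
        self.register_buffer("knots", torch.linspace(x_min, x_max, num_knots))
        self.h = (x_max - x_min) / (num_knots - 1)          # knot spacing
        # coef[o, i, k] = value of the edge function (input i -> output o)
        # at knot k; values between knots are linearly interpolated.
        self.coef = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, num_knots))

    def forward(self, x):                                   # x: (batch, in_dim)
        x = x.clamp(self.x_min, self.x_max)
        # Triangular (hat) basis: each knot contributes linearly to its
        # two neighboring intervals, giving piecewise-linear interpolation.
        basis = torch.relu(1 - (x.unsqueeze(-1) - self.knots).abs() / self.h)
        # Evaluate every edge function and sum over inputs per output unit.
        return torch.einsum("bik,oik->bo", basis, self.coef)

class TinyKAN(nn.Module):
    """A small stack of KAN-style layers, e.g., for tabular regression."""

    def __init__(self, dims=(8, 16, 1)):
        super().__init__()
        self.layers = nn.ModuleList(
            KANLayer(dims[i], dims[i + 1]) for i in range(len(dims) - 1)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
```

Training such a stack with gradient descent adjusts every edge function simultaneously, which is precisely the sense in which KAN activations are "learnable" rather than fixed.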

1.4. Contribution

The contribution of this study can be summarized as follows:
  • Scientific contribution: This study expands the methodological landscape in regression by presenting Kolmogorov–Arnold networks (KANs) as a robust alternative to traditional models like gradient boosting, ensemble methods, and neural networks such as MLPs, GRUs, and LSTMs. Gradient boosting algorithms like XGBoost and LightGBM, while effective on structured datasets, struggle with capturing intricate, nonlinear relationships in high-dimensional data and face scalability and interpretability challenges. Similarly, MLPs require extensive tuning to handle complex, high-dimensional data effectively, while GRUs and LSTMs, optimized for sequential data, are less suitable for non-sequential regression tasks and introduce significant computational overhead. KANs address these limitations by leveraging the Kolmogorov–Arnold representation theorem, which decomposes multivariate functions into sums and compositions of univariate functions. This approach allows KANs to model complex nonlinear relationships efficiently while maintaining interpretability, scalability, and adaptability to high-dimensional data. The spline-based architecture further enhances their ability to generalize well across diverse regression tasks, offering a significant advantage over both traditional and modern machine learning models. By outperforming these methods in terms of accuracy and computational efficiency, KANs demonstrate their potential as a transformative tool for tackling complex regression problems across various domains.
  • Economic and social impact: By improving prediction accuracy in housing price estimation, this research contributes to a more transparent and informed real-estate market. Enhanced predictive capabilities help stakeholders make better decisions, optimize investments, and mitigate risks. Furthermore, the insights gained from the study can lead to the development of smarter urban planning and housing policies, ultimately benefiting society by fostering equitable and sustainable growth in the housing sector.
  • Broader applicability across industries: While the focus is on house price estimation, the versatile nature of KANs offers potential applications beyond real estate. KANs can be applied in areas such as financial modeling, healthcare predictions, environmental analysis, and supply chain optimization. The study underscores the adaptability of this approach, setting a precedent for its application in diverse regression and prediction tasks.
This work not only establishes the utility of KANs in regression problems but also sets the stage for a future exploration of their potential across a wide range of applications. Section 2 provides an overview of related works in the field of KANs and house price prediction, while Section 3 explains the research methodology and the proposed KAN model in detail. In Section 4 and Section 5, the experiments are described, and the results are discussed. Finally, Section 6 concludes the paper and provides insights for future research in this area.

2. Literature Review and State of the Art

2.1. KAN-Based Approaches

Liu et al. introduce KANs, a novel neural network architecture inspired by the Kolmogorov–Arnold representation theorem, and show that it has significant advantages over traditional MLPs [9]. KANs are designed to effectively capture compositional structures in data, enabling them to approximate complex functions, such as exponential and sine functions, with greater efficiency and accuracy. The authors demonstrate that KANs not only outperform MLPs in terms of accuracy, particularly in small-scale AI and science tasks, but also enhance interpretability by revealing the underlying compositional structures in symbolic formulas. Furthermore, KANs can utilize grid extension techniques to further improve their performance, positioning them as a promising foundation for future AI applications in scientific discovery and mathematical exploration.
Moradi et al. investigate the potential of KANs as a transformative approach to deep learning, particularly in the context of autoencoders for image representation tasks. The study compares KAN-based autoencoders against traditional convolutional neural networks (CNNs) across several datasets, revealing competitive reconstruction accuracy and suggesting KANs’ viability in advanced data analysis applications [10].
Jiang et al. present a novel hybrid model combining KANs and artificial NNs for short-term load forecasting, addressing the limitations of traditional NN approaches. By leveraging KANs to identify essential periodic and nonlinear patterns in load data, while utilizing NNs to fit the residual components, the proposed method enhances both interpretability and predictive accuracy. The integration of regularization techniques ensures that KAN-derived predictions maintain dominance, leading to clearer analytical expressions for forecasting. Experimental validation against various models demonstrates superior performance of the hybrid approach, marking a significant advancement in short-term load forecasting methodologies [11].
Yang et al. propose the Kolmogorov–Arnold transformer (KAT), an innovative architecture that enhances traditional transformer models by substituting MLP layers with KAN layers [12]. This approach significantly improves expressiveness and performance, addressing key challenges in integrating KANs into transformers and paving the way for more efficient deep learning models.
Hollósi et al. present a study focused on enhancing road safety by detecting mobile phone usage among bus drivers [13]. The research utilizes KANs to improve the accuracy of driver monitoring systems in compliance with regulations against phone use while driving.
Danish et al. introduce a Kolmogorov–Arnold recurrent network (KARN), a novel approach combining KAN’s flexibility with RNN’s temporal modeling capabilities for improved load forecasting in energy management. Through the integration of learnable temporal spline functions and edge-based activations, KARN addresses the limitations of traditional RNNs and LSTMs in capturing complex load variations. Extensive evaluation across diverse consumer types, including residential and industrial buildings, demonstrates KARN’s superior performance and versatility compared to conventional approaches, establishing it as a robust solution for energy-load forecasting [14].
Gao et al. explore the application of KANs for predicting renewable energy variables, addressing two key challenges in deep learning: interpretability and architectural sensitivity. Using data from the Tokyo Meteorological Observatory, the research demonstrates KAN’s effectiveness in predicting solar radiation and outdoor temperature, achieving superior performance even with minimal architecture [15].

2.2. Literature Review on House Price Prediction

Predicting house prices is a popular task in machine learning, and there are numerous state-of-the-art papers that use NNs and other machine learning techniques for this task.
Limsombunchai et al.'s work is one of the first research studies to demonstrate that an artificial neural network model has significant potential in predicting house prices. The purpose of the research was to evaluate and compare the effectiveness of the hedonic model and an artificial neural network model in predicting house prices. The study uses a sample of 200 houses in Christchurch, New Zealand, obtained randomly from the Harcourt website, taking into account factors such as house size, age, type, the number of bedrooms, bathrooms, garages, amenities, and location [16].
Afonso et al. propose a hybrid model that combines deep learning recurrent neural networks (RNNs) and random forest algorithms for predicting housing prices. The model uses features extracted from the image and text data of the houses, along with other relevant numerical data. The results show that the proposed model outperforms both standalone deep learning and random forest models for the prediction of housing prices in Brazil [17].
Nouriani et al. propose a novel approach for estimating housing prices using a combination of interior and exterior images of houses, as well as satellite images. The proposed method utilizes deep convolutional neural networks to extract features from images and then (along with other house features) employs multiple regression models to estimate house prices based on these features [18].
Mora-Garcia et al. present a study on the use of machine learning algorithms for housing price prediction in California during the COVID-19 pandemic. The authors analyzed different regression models, including linear regression, decision trees, and others to predict the price of houses in different regions of California. The study also evaluated the impact of the pandemic on the housing market and the performance of the different models [7].
Kim et al. explore the application of various machine learning algorithms, including support vector machine, random forest, XGBoost, LightGBM, and CatBoost, as automatic real-estate valuation models. The study involves analyzing approximately 57,000 records on apartment transactions in Seoul in 2018 and proposes combination methods to improve the models’ predictive power. The findings suggest that machine learning predictors outperform conventional models, and an efficient averaging of the predictors can improve their predictive accuracy. Additionally, machine learning algorithms can recommend which algorithm should be selected for making predictions [19].
Joshi et al. propose to use multiple different models that can be used for prediction and focus on more accurate results. They propose to use an ensemble learning method and combine multiple ML models to improve the results [20].
Ragb et al. propose a hybrid Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) model for real-estate price prediction. The model combines the strengths of both GRU and LSTM networks. The results of these experiments in the Boston dataset showed that the proposed model has better performance when the networks are used in the fusion process than when they act individually [21].
Dobrovolska et al. investigate the dynamics of the real-estate market in Kyiv, Ukraine, focusing on the impact of various economic factors on housing prices. The study employs a dual-approach forecasting model, combining traditional methods with advanced analytical tools to enhance prediction accuracy. Ultimately, the research aims to provide a robust framework for future forecasting in the volatile real-estate sector [22].

3. Materials and Methods

3.1. Problem Formulation and Proposed Model Description

The primary goal of this research is to predict housing prices based on a set of input features derived from real-estate datasets. Formally, we define the problem as follows:
Let $X = \{x_1, x_2, \ldots, x_n\}$ represent the input features, where $x_i$ corresponds to specific characteristics of a property (e.g., size, location, and number of rooms). The target variable $y$ represents the predicted price of the property. The objective is to learn a mapping function $f: X \rightarrow y$ such that the prediction error $L(y, \hat{y})$, where $\hat{y} = f(x)$, is minimized. Here, $L$ is the loss function used to quantify the prediction error.
To ensure the reliability and generalizability of the proposed KAN model, we explicitly state the following assumptions:
  • Independence: The input features are assumed to be conditionally independent given the target variable.
  • Sufficient data: The dataset is assumed to be sufficiently large and diverse to capture the underlying patterns in housing prices.
  • Feature engineering: The input features are assumed to be preprocessed (e.g., normalized) to ensure compatibility with the neural network.
These assumptions are tested and validated during the experimental phases.
KAN is a neural network architecture inspired by the Kolmogorov–Arnold representation theorem. This property allows KAN to decompose complex multivariate functions into simpler univariate components, making it uniquely suited for capturing nonlinear interactions in high-dimensional data.
The Kolmogorov–Arnold representation theorem provides the theoretical foundation for KAN. For a continuous function $f: \mathbb{R}^n \rightarrow \mathbb{R}$, the theorem states the following:
$$f(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{2n+1} g_i\left(\sum_{j=1}^{n} h_{ij}(x_j)\right)$$
where $g_i$ and $h_{ij}$ are continuous univariate functions, and $x_1, x_2, \ldots, x_n$ are the input variables.
In the KAN model, this decomposition is approximated using neural network layers.
The input features are transformed into intermediate representations:
$$z_i = \sum_{j=1}^{n} h_{ij}(x_j)$$
where $h_{ij}$ are univariate transformations modeled by spline functions.
The intermediate features $z_i$ are passed through another set of univariate functions $g_i$ to compute the final output:
$$\hat{y} = \sum_{i=1}^{2n+1} g_i(z_i)$$
The model minimizes a loss function $L$, such as the Mean Squared Error (MSE):
$$L(y, \hat{y}) = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2$$
where $m$ is the number of training samples, $y_i$ is the true value, and $\hat{y}_i$ is the predicted value. To prevent overfitting, regularization terms are added:
$$L_{\text{total}} = L(y, \hat{y}) + \lambda_1 \|\theta\|_1 + \lambda_2 \|\theta\|_2^2$$
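In code, this regularized objective can be sketched as follows (PyTorch); the lambda values are illustrative, not the study's tuned settings:

```python
import torch

def kan_loss(model, y_pred, y_true, lam1=1e-5, lam2=1e-4):
    """MSE plus L1 and L2 penalties on all trainable parameters,
    mirroring L_total = L(y, y_hat) + lambda1*||theta||_1 + lambda2*||theta||_2^2.
    lam1 and lam2 are placeholder values for illustration."""
    mse = torch.mean((y_true - y_pred) ** 2)
    l1 = sum(p.abs().sum() for p in model.parameters())
    l2 = sum((p ** 2).sum() for p in model.parameters())
    return mse + lam1 * l1 + lam2 * l2
```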
Below, a detailed architectural description is provided:
  • Input Layer: The input layer accepts a vector X of size n, where n corresponds to the number of features. Each feature is standard-scaled to ensure consistent scaling across all inputs.
  • Intermediate Layers:
    • Three KAN layers are sequentially stacked, each performing a decomposition of the input into univariate components using splines.
    • Each KAN layer introduces nonlinear transformations, enabling the model to capture intricate feature interactions.
    • The splines are defined by a set number of knots and a specific order, which control the granularity and smoothness of the approximations. The number of knots determines how many polynomials are strung together. The grid range and epsilon define the bounds and resolution of the decomposition.
    • Fully connected layers refine the outputs of the KAN layers.
    • Activation functions introduce nonlinearity and enhance the model’s expressiveness.
  • Output Layer: The output layer consists of a single neuron with a linear activation function, producing the predicted price $\hat{y}$.
  • Loss Function: The Mean Squared Error (MSE) is used as the loss function.
  • Regularization: Dropout and L1 and L2 regularization techniques are applied to prevent overfitting.
  • Optimization Algorithm: The model is trained using the Adam optimizer, which combines the benefits of momentum and adaptive learning rates to ensure efficient convergence.
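Below is a hedged sketch of how these components could be assembled in PyTorch, reusing the KANLayer sketch from Section 1.3. Layer widths, the dropout rate, and the learning rate here are placeholders rather than the tuned values reported later in Table 1.

```python
import torch
import torch.nn as nn

class HousePriceKAN(nn.Module):
    """Three stacked KAN-style layers followed by fully connected
    refinement layers and a single linear output neuron."""

    def __init__(self, n_features: int):
        super().__init__()
        self.kan = nn.Sequential(
            KANLayer(n_features, 64),   # KANLayer: sketch from Section 1.3
            KANLayer(64, 32),
            KANLayer(32, 16),
        )
        self.head = nn.Sequential(
            nn.Linear(16, 16),
            nn.ReLU(),
            nn.Dropout(0.2),            # dropout regularization
            nn.Linear(16, 1),           # linear output: predicted price
        )

    def forward(self, x):
        return self.head(self.kan(x)).squeeze(-1)

model = HousePriceKAN(n_features=20)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Typical training step, using kan_loss from the sketch above:
#   loss = kan_loss(model, model(X_batch), y_batch)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```

In this sketch, dropout together with the L1/L2 penalty in the loss provides the regularization described in the list above.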

3.2. Research Methodology

The research methodology consists of the following phases (Figure 2):
1. Literature Review
First, a thorough literature review was conducted to identify state-of-the-art models and available datasets. This step was crucial for deciding which models would be implemented so that the proposed KAN-based model could be compared with them on the corresponding datasets.
2. Data Collection and Preprocessing
After that, data collection, preprocessing, and cleaning were performed to ensure the quality and consistency of the data for subsequent analysis. To mitigate the impact of extreme values on model performance, outlier detection was implemented using a quantile-based approach: observations where the price feature fell outside the 5th and 95th percentiles were excluded from the datasets. Additionally, feature engineering incorporated standardization techniques, including Min–Max scaling and standard scaling, to normalize the input features and enhance model performance. The next section provides all the necessary details.
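The sketch below illustrates this cleaning-and-scaling step with pandas and scikit-learn. The column name "price" is a placeholder, and fitting the scaler on the full frame (rather than only the training split) is a simplification for brevity; the study also used Min–Max scaling for some features, while standard scaling is shown here.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def clean_and_scale(df: pd.DataFrame, target: str = "price") -> pd.DataFrame:
    # Quantile-based outlier filter: drop rows whose target value falls
    # outside the 5th-95th percentile band, as described above.
    lo, hi = df[target].quantile([0.05, 0.95])
    df = df[(df[target] >= lo) & (df[target] <= hi)].dropna().copy()
    # Standard-scale the numeric input features.
    num_cols = df.select_dtypes("number").columns.drop(target)
    df[num_cols] = StandardScaler().fit_transform(df[num_cols])
    return df
```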
3. Model Development and Validation
State-of-the-art models found in the literature were developed and validated, specifically tailored for regression and house price prediction tasks. In detail, the following deep learning neural network models were examined, implemented, and compared with KAN:
  • MLP
The MLP is a type of feedforward neural network comprising an input layer, one or more hidden layers, and an output layer. Each layer is fully connected to the subsequent layer, and neurons use an activation function to introduce nonlinearity. The mathematical representation for a neuron in a hidden layer is given by the following:
$$h = f(Wx + b)$$
where
- $h$ is the output of the neuron,
- $W$ is the weight matrix,
- $x$ is the input vector,
- $b$ is the bias vector, and
- $f$ is the activation function (e.g., ReLU or sigmoid).
MLPs are general-purpose neural networks that approximate complex functions through multiple layers. However, their black-box nature and reliance on global optimization make them less interpretable than KAN, which provides a transparent decomposition of feature interactions.
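For reference, an MLP baseline of this kind can be expressed in a few lines of Keras (the study used TensorFlow for the MLP); the layer sizes and activations below are illustrative, not the tuned configuration:

```python
import tensorflow as tf

# Minimal Keras MLP baseline: each hidden layer computes h = f(Wx + b).
def build_mlp(n_features: int) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1),                 # linear output for price
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model
```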
  • LSTM
The Long Short-Term Memory (LSTM) network enhances the RNN by incorporating memory cells and three gating mechanisms: input, forget, and output gates [21]. These gates enable the model to retain long-term dependencies. The mathematical operations are provided below:
$$\text{Forget Gate:}\quad f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
$$\text{Input Gate:}\quad i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\text{Candidate Cell State:}\quad \tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$
$$\text{Cell State Update:}\quad C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
$$\text{Output Gate:}\quad o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$\text{Hidden State:}\quad h_t = o_t \odot \tanh(C_t)$$
LSTM networks are designed for sequential data, leveraging memory cells to retain temporal dependencies. While effective for time-series tasks, their sequential nature makes them less suited for static datasets like house price prediction, where feature interactions are more critical than temporal patterns. In contrast, KAN focuses on capturing nonlinear dependencies across static features.
  • GRU
The Gated Recurrent Unit (GRU) is a variant of the recurrent neural network (RNN) that addresses the vanishing gradient problem through gating mechanisms [21]. GRUs utilize reset and update gates to control the flow of information. The mathematical formulations are the following:
$$\text{Update Gate:}\quad z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$$
$$\text{Reset Gate:}\quad r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$$
$$\text{Candidate Hidden State:}\quad \tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)$$
$$\text{Hidden State:}\quad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
GRU, like LSTM, is a recurrent neural network architecture designed for sequential data. It simplifies the LSTM architecture by combining the forget and input gates, making it computationally efficient. However, GRU’s focus on temporal dependencies limits its applicability to static datasets, where KAN’s spline-based decomposition excels at modeling complex, nonlinear feature interactions.
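For illustration, a hybrid stack in the spirit of the GRU–LSTM baseline of [21] can be sketched in Keras as follows. Treating the static feature vector as a length-one "sequence" is an assumption made here so the recurrent layers can consume tabular data; the exact topology in [21] may differ.

```python
import tensorflow as tf

# Illustrative hybrid: a GRU layer feeding an LSTM layer.
def build_gru_lstm(n_features: int) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(1, n_features)),    # (timesteps=1, features)
        tf.keras.layers.GRU(64, return_sequences=True),
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model
```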
  • Regression-Based Models
These models are primarily focused on fitting data through regression techniques. First, linear regression was utilized. Linear regression tries to fit a straight line to minimize errors between predictions and actual values. Also, a random forest regressor was developed and tested. This algorithm combines multiple decision trees by averaging predictions for robustness. Regression models, including linear and nonlinear variants, are foundational for predictive modeling. However, their performance is often constrained by assumptions of linearity or predefined functional forms. KAN overcomes these limitations by learning univariate functions adaptively, providing a more flexible and accurate framework for nonlinear prediction.
  • Boosting-Based Models
These models use boosting techniques to combine weak learners (e.g., decision trees) into a strong ensemble. XGBoost, CatBoost, and LightGBM were implemented and tested [20,21,22]. Boosting methods combine weak learners (e.g., decision trees) to create strong predictive models, excelling in structured data tasks. However, their reliance on hierarchical splits can oversimplify complex feature interactions. KAN addresses this gap by decomposing multivariate relationships into univariate functions, enabling precise and interpretable predictions.
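A minimal sketch of these three boosting baselines follows; the hyperparameters shown are illustrative, since the study tuned each model separately (see Section 4):

```python
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

def fit_boosting_baselines(X_train, y_train):
    """Fit the three boosting baselines compared against KAN.
    X_train/y_train are assumed to be preprocessed arrays."""
    models = {
        "XGBoost": XGBRegressor(n_estimators=500, learning_rate=0.05),
        "LightGBM": LGBMRegressor(n_estimators=500, learning_rate=0.05),
        "CatBoost": CatBoostRegressor(iterations=500, verbose=0),
    }
    for model in models.values():
        model.fit(X_train, y_train)
    return models
```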
4. Integration and Testing—Experiments
Comprehensive testing was conducted to evaluate the performance of the models. Two widely used datasets were utilized for the experiments, namely, the Greece listings (https://www.kaggle.com/datasets/argyrisanastopoulos/greece-property-listings/ accessed on 10 September 2024) and the California house pricing (https://www.kaggle.com/code/ahmedmahmoud16/california-housing-prices accessed on 10 September 2024). Metrics such as the Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Symmetric Mean Absolute Percentage Error (SMAPE) were used to assess model effectiveness; functions computing them are sketched below. Based on the testing outcomes, the models were iteratively refined to enhance their performance and reliability. The next section provides detailed information about the experiments and the results.
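For reference, the three evaluation metrics can be computed as follows. The SMAPE variant shown (with a factor of 2 and a small epsilon guard) is one common definition; the exact variant used in the study is not specified.

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def smape(y_true, y_pred):
    # Symmetric MAPE in percent; the epsilon avoids division by zero.
    denom = np.abs(y_true) + np.abs(y_pred) + 1e-12
    return 100.0 * np.mean(2.0 * np.abs(y_pred - y_true) / denom)
```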

4. Results

The performance of the models was evaluated using the MAE, RMSE, and SMAPE. In both cases, 90% of the records were randomly assigned for training the models, while 10% were used for testing. The experiments were carried out in Google Colab using the Python programming language and its relevant libraries (e.g., deep KAN for implementing the Kolmogorov–Arnold network and TensorFlow for the MLP and GRU–LSTM). To ensure a fair comparison between the models in predicting house prices, the study adhered to the following principles:
(a) Exclusion of feature extraction models: No feature extraction methods, such as autoencoders or Boltzmann Machines, were used beforehand.
(b) Model optimization: The most effective configuration parameters (the number of layers, neurons, optimizer, etc.) for each model were identified through hyperparameter tuning, a critical aspect of model development and validation. Hyperparameters play a pivotal role in defining both the structural and operational characteristics of a model, significantly influencing its performance. The tuning process combined manual trial-and-error with grid-search techniques and involved several key steps, sketched after this list. For the learning rate, a range of values (e.g., 0.001 and 0.01) was tested to balance precision in convergence with training efficiency, as smaller learning rates can lead to more precise convergence but increase training time, while larger rates risk overshooting the optimal solution. The architecture was iteratively refined by testing different configurations of layers and neurons, starting with a simple structure and gradually increasing complexity; performance on the validation set was closely monitored to determine the optimal depth and number of neurons. Various activation functions were evaluated for their impact on convergence speed and overall model accuracy, with the final choice made based on empirical performance. Additional parameters, including batch size and dropout rates, were systematically varied to enhance model robustness and mitigate overfitting, and their effects on validation performance were carefully observed to ensure stability and generalization. The optimal hyperparameter settings, summarized in Table 1, were determined based on the best validation performance across multiple runs, ensuring that the model generalizes effectively to unseen data. Table 1 below shows the optimal values for the KAN-based model on the two datasets.
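A simplified grid-search skeleton over the hyperparameters discussed above is sketched below; train_and_validate is a hypothetical helper standing in for a full training run, and the candidate values are illustrative:

```python
from itertools import product

search_space = {
    "learning_rate": [0.001, 0.01],
    "batch_size": [8, 16, 32],
    "num_knots": [5, 7, 9],
    "dropout": [0.0, 0.2],
}
best_err, best_cfg = float("inf"), None
for values in product(*search_space.values()):
    cfg = dict(zip(search_space.keys(), values))
    err = train_and_validate(cfg)   # hypothetical: trains and returns val error
    if err < best_err:
        best_err, best_cfg = err, cfg
print("best config:", best_cfg, "validation error:", best_err)
```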

4.1. Dataset A—Greece Listings

The Greece House Listings dataset provides information on numerous properties for sale in Greece. The dataset comprises 20,000 rows and 25 columns, where each row represents a distinct property for sale. The columns that are included are the following:
  • location_name (the municipality of the house; categorical feature; 73 unique values; most common: Athens, 21%).
  • location_region (the region of the house; categorical feature; possible values: Attiki/Thessaloniki; most common: Attiki, 94%).
  • res_type (the type of the property; possible values: building, apartment, etc.; five unique values).
  • res_address (secondary location attribute; this stands for the exact neighborhood of the property; categorical feature; 987 unique values).
  • res_price (the advertised price for the property, in euros; this is the feature that the model has to predict; the mean value is 367,000).
  • res_sqr (square meters of the property. The mean value is 169).
  • construction_year (the construction year of the property).
  • levels (levels for the property, e.g., 1st floor, 2nd floor, etc.).
  • bedrooms (the number of bedrooms. The mean is 2.58).
  • bathrooms (the number of bathrooms. The mean is 1.48).
  • status (the current status for the property; categorical feature; possible values such as ‘good’, ‘renovated’, etc.; eight unique values).
  • energyclass (categorical feature; the energy class is from the lowest level (H) to the highest (A+) in the following order: H, Ζ, Ε, Δ, Γ, Β, Β+, A, and A+. There are also three possible values for the energy class: Non-effective, Excluded, and Pending).
  • auto_heating (autonomous heating: 1 for Yes, 0 for No).
  • solar (solar water heater: 1 for Yes, 0 for No).
  • cooling (cooling: 1 for Yes, 0 for No).
  • safe_door (safety door: 1 for Yes, 0 for No).
  • gas (1 for Yes, 0 for No).
  • fireplace (1 for Yes, 0 for No).
  • furniture (1 for Yes, 0 for No).
  • student (Is it appropriate for students? 1 for Yes, 0 for No).
Beginning with the data cleaning process, dependent columns, such as price per square meter, and columns that showed no statistically significant correlation (e.g., parking) were removed. Rows that contained empty fields were also removed. Additionally, 5% of the outliers with respect to the price were removed, including the minimum and maximum values. Categorical variables were encoded using one-hot encoding and ordinal encoding techniques (an encoding sketch follows below), and numerical variables were standardized using a standard scaler. Table 2 and Figure 3 below present the results sorted by performance (error), showing that KAN significantly outperforms all of the existing state-of-the-art approaches. Figure 4 presents a comparison plot between the values predicted by KAN and the actual values.
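As an illustration of the ordinal encoding step, the ordered energyclass feature can be mapped to integers as follows. Placing the three special statuses below the H-to-A+ scale is an assumption made here for illustration, since the paper does not state how they were ranked.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Ordered levels, lowest to highest; Ζ, Ε, Δ, Γ, Β are the Greek
# capitals that appear in the listings.
levels = ["Non-effective", "Excluded", "Pending",
          "H", "Ζ", "Ε", "Δ", "Γ", "Β", "Β+", "A", "A+"]
df = pd.DataFrame({"energyclass": ["A+", "Γ", "H", "Pending"]})  # toy sample
encoder = OrdinalEncoder(categories=[levels])
df["energyclass_enc"] = encoder.fit_transform(df[["energyclass"]]).ravel()
print(df)
```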

4.2. Dataset B—California Housing

The California Housing Prices dataset is a popular dataset from Kaggle that contains information on the median house prices for various districts in California. The dataset contains a total of 20,640 records, with each record representing a different district in California. The dataset has a total of 10 columns, with each column representing a different attribute of the district. The attributes included in the dataset are as follows:
  • longitude: Represents the longitude coordinate of the district.
  • latitude: Represents the latitude coordinate of the district.
  • housing_median_age: Represents the median age of the houses in the district. The mean value is 28.64.
  • total_rooms: Represents the total number of rooms in the district. The mean is 2643.66.
  • total_bedrooms: Represents the total number of bedrooms in the district. The mean is 538.43.
  • population: Represents the total population of the district. The mean is 1425.48.
  • households: Represents the total number of households in the district. The mean is 499.54.
  • median_income: Represents the median income of the households in the district. The mean value is 3.87.
  • median_house_value: Represents the median house value in the district. The mean is 206,855.
  • ocean_proximity: Represents the proximity of the district to the ocean. It contains five different categories.
Regarding the data cleaning process, rows that contained empty fields were removed. Additionally, 5% of the outliers with respect to the median house value were removed, including the minimum and maximum values. The categorical variable was encoded using one-hot encoding, and numerical variables were standardized using a standard scaler. Table 3 and Figure 5 below present the results sorted by performance (error), showing KAN significantly outperforming all of the existing state-of-the-art deep NN approaches. The traditional machine learning algorithms XGBoost and LGB achieve slightly better results because of the nature of the dataset: California Housing has few input variables and relatively few rows, which makes it difficult for NN-based models to reach their full potential. Additionally, Figure 6 presents a comparison plot between the values predicted by KAN and the actual values.
To address the concern regarding the limited input variables and dataset size of the California Housing dataset, which may constrain the potential of the model, an additional benchmarking experiment was conducted to evaluate the performance of KAN. The evaluation process included a comparison of KAN with XGBoost, the best-performing baseline, using an augmented version of the California Housing dataset. The augmentation process involved generating 3000 additional samples by applying Gaussian noise to the numerical features of randomly selected rows from the original dataset. Specifically, Gaussian noise with a standard deviation of 0.05 was added to these features, ensuring that the augmented samples retained the original data distribution while introducing diversity. These perturbed samples were then appended to the original dataset, resulting in an expanded dataset for training and testing. Both KAN and XGB models were fine-tuned and evaluated on the test set using the augmented dataset. The results, presented in Table 4 below, indicate that KAN outperformed XGB in MAE, RMSE, and SMAPE.
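A sketch of the described augmentation procedure is given below. Whether the noise was injected before or after feature scaling is not stated, so treating sigma = 0.05 as a perturbation of the scaled numerical features is an assumption here.

```python
import numpy as np
import pandas as pd

def augment_with_noise(df, num_cols, n_new=3000, sigma=0.05, seed=42):
    """Append n_new rows sampled (with replacement) from df, with
    Gaussian noise of standard deviation sigma added to the numerical
    features, mirroring the augmentation described above."""
    rng = np.random.default_rng(seed)
    sampled = df.sample(n=n_new, replace=True, random_state=seed).copy()
    sampled[num_cols] = sampled[num_cols] + rng.normal(
        0.0, sigma, size=(n_new, len(num_cols))
    )
    return pd.concat([df, sampled], ignore_index=True)
```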

5. Discussion

Starting with the first dataset, the Kolmogorov–Arnold network (KAN) model demonstrated its efficacy by achieving the lowest Mean Absolute Error (MAE) of 35,861, Root Mean Squared Error (RMSE) of 52,158, and Symmetric Mean Absolute Percentage Error (SMAPE) of 16.53%. These results highlight the model’s ability to capture complex nonlinear relationships effectively, outperforming advanced architectures such as GRU–LSTM and widely used machine learning models, including CatBoost, XGBoost, and LightGBM.
In contrast, on the default California Housing dataset, traditional machine learning models such as XGBoost and LightGBM emerged as the top performers. XGBoost achieved the lowest MAE (30,625), RMSE (42,477), and SMAPE (15.81%), closely followed by LightGBM. While KAN remained competitive, it recorded slightly higher metrics, with an MAE of 31,961, an RMSE of 44,191, and an SMAPE of 16.45%. These results suggest that the relatively small size and limited feature set of the California Housing dataset may constrain the full potential of neural-network-based models like KAN.
To further explore this limitation, in the augmented version of the California Housing dataset, KAN demonstrated its ability to leverage the increased data complexity and size, outperforming XGBoost across all metrics. Specifically, KAN achieved a test MAE of 34,716, an RMSE of 53,378, and an SMAPE of 17.74%, compared to XGBoost’s MAE of 36,102, RMSE of 55,533, and SMAPE of 18.31%. Notably, KAN showed an improvement of approximately 3.6% in prediction accuracy. This additional benchmarking highlights the robustness of KAN in handling augmented datasets and underscores its potential to outperform state-of-the-art methods in scenarios where data complexity and size are increased. The findings demonstrate that while traditional models may excel on simpler datasets, KAN offers significant advantages when applied to richer, more diverse datasets, further establishing its applicability to complex predictive tasks.
To further assess the performance differences between KAN and the other models on the two datasets, the Wilcoxon Signed-Rank test was conducted [23]. This non-parametric test was chosen over the paired t-test to address violations of the normality assumption in the data. Additionally, the overall performance improvement of KAN was calculated as the average percentage improvement across all metrics. The results, summarized in Table 5, provide p-values indicating the statistical significance of the comparisons and the average percentage improvements for each model. The statistical analysis underscores the effectiveness of the KAN model in predictive tasks, particularly in comparison to several state-of-the-art models. KAN demonstrates statistically significant improvements in performance metrics against widely used models such as CatBoost, GRU–LSTM, MLP, random forest, and linear regression, as indicated by p-values less than 0.05. Notably, the KAN model achieves a 9.19% improvement over CatBoost, a model known for its robustness, and a 27.98% improvement over linear regression, emphasizing its ability to capture complex nonlinear relationships.
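For reproducibility, the test can be run with SciPy as below. The paired error vectors shown are illustrative values taken from Tables 2 and 4 (KAN vs. XGBoost), not necessarily the exact pairing used in the study; in practice, more paired observations give the test more power.

```python
from scipy.stats import wilcoxon

# Paired, non-parametric comparison of KAN's and a baseline's errors.
kan_err      = [35861, 52158, 16.53, 34716, 53378, 17.74]
baseline_err = [38157, 55711, 17.41, 36102, 55533, 18.31]
stat, p = wilcoxon(kan_err, baseline_err)
print(f"Wilcoxon statistic = {stat}, p-value = {p:.4f}")
```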
This difference highlights a notable aspect of KAN’s behavior: its performance tends to excel with larger datasets and high-dimensional inputs, where its advanced approximation capabilities and ability to model intricate patterns shine. However, on smaller datasets like California Housing, with fewer rows and limited input variables, ensemble-based methods such as XGBoost and LightGBM might be better suited due to their inherent ability to optimize simpler feature interactions and adapt efficiently to constrained datasets. While KAN’s improvements over XGBoost (1.10%) and LightGBM (2.40%) are not statistically significant (p-values of 0.6875 and 0.502, respectively), it remains competitive with these ensemble methods, which are well regarded for their efficiency and performance on such small datasets.
These findings highlight both the strengths and the challenges of KANs in regression problems. KAN addresses critical challenges in predictive modeling by capturing intricate nonlinear relationships and providing a theoretically grounded framework for feature decomposition. While LSTM models are effective for sequential data, their temporal focus is not suitable for static datasets like those used in this study. Similarly, boosting methods such as XGBoost and LightGBM rely on hierarchical tree-based splits, which may fail to capture complex nonlinear interactions present in high-dimensional data. KAN offers a distinct advantage by leveraging its spline-based architecture to learn univariate functions that are composable, scalable, and interpretable. Unlike regression models, which are constrained by linear assumptions, KAN provides a flexible framework for modeling nonlinear dependencies with theoretical rigor and empirical robustness. The proposed KAN model excels in scenarios with high-dimensional, nonlinear data, as evidenced by its superior performance.
In general, KANs have demonstrated exceptional capabilities in scenarios that demand complex function approximation. This makes such models particularly transformative for datasets characterized by extensive variables and intricate high-dimensional relationships, such as those found in marketing, financial modeling, etc. By modeling nonlinear interactions and identifying subtle patterns within data, KANs can often surpass conventional regression techniques in terms of accuracy and insight generation.
On the other hand, several limitations warrant discussion and present opportunities for future research. A primary challenge lies in KAN’s performance on small or low-dimensional datasets, where simpler models can achieve superior results with a significantly lower computational overhead. Moreover, KANs, due to their architecture and the nature of their computations, can be computationally intensive. This complexity may lead to longer training times and increased resource requirements, particularly when scaling to larger datasets or more intricate models. To address these challenges, several optimization methods can be applied. Regularization techniques, such as L2 regularization or dropout, can help to prevent overfitting by penalizing excessive complexity. Data augmentation or synthetic data generation can enhance the dataset, helping KANs generalize better. Moreover, simplifying the model architecture or combining KANs with simpler learning methods, like decision trees for final predictions, could reduce computational costs and improve performance on small datasets. These adjustments can enhance the adaptability of KANs, making them more effective in predictive regression tasks.

6. Conclusions

In this study, a custom KAN-based deep neural network model is proposed for the house price estimation problem. The architecture is a deep neural network consisting of multiple stacked KAN and fully connected layers. To evaluate the performance of the KAN model, we used two datasets: the Greece Listings dataset, which consists of real-estate listings in Greece, and the California Housing dataset, which contains data related to houses in California. Both datasets were obtained from Kaggle and are publicly available there.
Based on the analysis conducted on the two datasets, the results show that the custom KAN-based model can outperform the best regression algorithms and traditional neural network models in terms of predictive accuracy (with respect to the MAE, RMSE, and SMAPE metrics tested). These findings demonstrate that KANs have significant potential for improving predictive performance in real-world applications. Furthermore, the outcomes demonstrate the effectiveness of KANs in modeling complicated, dynamic systems with incomplete and noisy data. Overall, the results provide strong evidence for the use of KANs as a powerful and adaptable tool for data analysis and decision-making.
Regarding the future directions of this research, multiple plans have been outlined to further extend this study. One direction is to conduct more experiments with various types of KAN-based architectures to explore their full potential in greater detail. While the experiments conducted utilized public datasets, a future research direction includes the validation of the KAN model with real market data to assess its practical applicability. This validation strategy encompasses collaboration with real-estate firms to access actual transaction data and conducting targeted pilot studies in specific geographic regions. Our research roadmap includes the development of a mobile application to enhance accessibility and practical implementation.
Beyond house price estimation, the KAN model demonstrates significant potential for broader applications in the real-estate industry. The model’s versatility extends to investment analysis for property valuation, market trend forecasting through historical data analysis, risk assessment considering multiple economic indicators, and urban planning support through data-driven insights. These applications collectively position KAN as an innovative analytical tool that can significantly enhance decision-making processes in the real-estate sector.
This comprehensive approach to practical implementation and diverse application scenarios underscores the potential of KAN as a transformative technology in real-estate analytics while maintaining rigorous standards for validation and continuous improvement. Finally, it is hoped that this research will inspire new research directions and collaborations in the field of neural networks, opening the way for further innovation and breakthroughs in the field of artificial intelligence.

Author Contributions

Conceptualization, I.V. and A.T.; Approach, I.V. and A.T.; Implementation, I.V.; Resources, I.V.; Writing—Original Draft Preparation, I.V.; Writing—Review and Editing, I.V. and A.T.; Supervision, A.T.; Project Administration, I.V. and A.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets used in this study can be downloaded at https://www.kaggle.com (accessed on 2 September 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Feth, M. Proptech: The real estate industry in transition. In The Routledge Handbook of FinTech; Taylor & Francis Group: Abingdon, UK, 2021; pp. 385–391.
  2. Siniak, N.; Kauko, T.; Shavrov, S.; Marina, N. The impact of proptech on real estate industry growth. IOP Conf. Ser. Mater. Sci. Eng. 2020, 869, 062041.
  3. He, X.; Lin, Z.; Liu, Y. Volatility and Liquidity in the Real Estate Market. J. Real Estate Res. 2018, 40, 523–550.
  4. Braesemann, F.; Baum, A. PropTech: Turning Real Estate Into a Data-Driven Market? SSRN Electron. J. 2020, 1–22.
  5. Chiang, M.-C.; Sing, T.F.; Wang, L. Interactions Between Housing Market and Stock Market in the United States: A Markov Switching Approach. J. Real Estate Res. 2020, 42, 552–571.
  6. Adamczyk, T.; Bieda, A. The applicability of time series analysis in real estate valuation. Geomat. Environ. Eng. 2015, 9, 15–25.
  7. Mora-Garcia, R.-T.; Cespedes-Lopez, M.-F.; Perez-Sanchez, V.R. Housing Price Prediction Using Machine Learning Algorithms in COVID-19 Times. Land 2022, 11, 2100.
  8. Kundu, A.; Sarkar, A.; Sadhu, A. KANQAS: Kolmogorov-Arnold Network for Quantum Architecture Search. EPJ Quantum Technol. 2024, 11, 76.
  9. Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T.Y.; Tegmark, M. KAN: Kolmogorov-Arnold Networks. arXiv 2024, arXiv:2404.19756.
  10. Moradi, M.; Panahi, S.; Bollt, E.; Lai, Y.C. Kolmogorov-Arnold Network Autoencoders. arXiv 2024, arXiv:2410.02077.
  11. Jiang, B.; Wang, Y.; Wang, Q.; Geng, H. A Hybrid KAN-ANN based Model for Interpretable and Enhanced Short-term Load Forecasting. Authorea Prepr. 2024.
  12. Yang, X.; Wang, X. Kolmogorov-Arnold Transformer. arXiv 2024, arXiv:2409.10594.
  13. Hollósi, J.; Ballagi, Á.; Kovács, G.; Fischer, S.; Nagy, V. Detection of Bus Driver Mobile Phone Usage Using Kolmogorov-Arnold Networks. Computers 2024, 13, 218.
  14. Danish, M.U.; Grolinger, K. Kolmogorov–Arnold recurrent network for short term load forecasting across diverse consumers. Energy Rep. 2025, 13, 713–727.
  15. Gao, Y.; Hu, Z.; Chen, W.-A.; Liu, M.; Ruan, Y. A revolutionary neural network architecture with interpretability and flexibility based on Kolmogorov–Arnold for solar radiation and temperature forecasting. Appl. Energy 2025, 378, 124844.
  16. Limsombunchai, V.; Gan, C.; Lee, M. House Price Prediction: Hedonic Price Model vs. Artificial Neural Network. Am. J. Appl. Sci. 2004, 1, 193–201.
  17. Afonso, B.; Melo, L.; Oliveira, W.; Sousa, S.; Berton, L. Housing Prices Prediction with a Deep Learning and Random Forest Ensemble. In Proceedings of the Encontro Nacional de Inteligência Artificial e Computacional (ENIAC), Rio Grande, Brazil, 24 September 2020; pp. 389–400.
  18. Nouriani, A.; Lemke, L. Vision-based housing price estimation using interior, exterior & satellite images. Intell. Syst. Appl. 2022, 14, 200081.
  19. Kim, J.; Lee, Y.; Lee, M.-H.; Hong, S.-Y. A Comparative Study of Machine Learning and Spatial Interpolation Methods for Predicting House Prices. Sustainability 2022, 14, 9056.
  20. Joshi, H.; Swarndeep, S. A Comparative Study on House Price Prediction using Machine Learning. Int. Res. J. Eng. Technol. 2022, 9, 782–788.
  21. Ragb, H.; Muntaser, A.; Jera, E.; Saide, A.; Elwarfalli, I. Hybrid GRU-LSTM Recurrent Neural Network-Based Model for Real Estate Price Prediction. TechRxiv 2023.
  22. Dobrovolska, O.; Fenenko, N. Forecasting Trends in the Real Estate Market: Analysis of Relevant Determinants. Financ. Mark. Inst. Risks 2024, 8, 227–253.
  23. Rey, D.; Neuhäuser, M. Wilcoxon-Signed-Rank Test. In International Encyclopedia of Statistical Science; Springer: Berlin/Heidelberg, Germany, 2011; pp. 1658–1659.
Figure 1. KAN architecture.
Figure 2. Research methodology.
Figure 3. Models' performance in Dataset A.
Figure 4. Comparison plot of predicted and actual values in Dataset A—KAN.
Figure 5. Models' performance in Dataset B.
Figure 6. Comparison plot of predicted and actual values in Dataset B—KAN.
Table 1. Optimal parameters for KAN.

Parameter           | Optimal Values: Dataset A | Optimal Values: Dataset B
Model architecture  | 256-128-64-32-1           | 50-40-30-20-1
Optimizer           | Adam                      | Adam
Activation function | ReLU                      | ReLU
Number of knots     | 7                         | 7
Spline layer        | 9                         | 9
Learning rate       | 0.001                     | 0.01
Batch size          | 8                         | 8
Table 2. Results in Dataset A.

Model Name        | MAE    | RMSE   | SMAPE (%)
KAN               | 35,861 | 52,158 | 16.53
GRU–LSTM          | 36,972 | 53,763 | 16.92
CatBoost          | 37,615 | 54,918 | 17.14
XGBoost           | 38,157 | 55,711 | 17.41
LGB               | 38,961 | 55,906 | 17.98
MLP               | 39,123 | 57,034 | 18.07
Random forest     | 46,124 | 67,341 | 21.16
Linear regression | 46,451 | 68,176 | 21.65
Table 3. Results in Dataset B—California Housing.

Model Name        | MAE    | RMSE   | SMAPE (%)
XGBoost           | 30,625 | 42,477 | 15.81
LGB               | 30,792 | 42,709 | 15.93
KAN               | 31,961 | 44,191 | 16.45
Random forest     | 32,774 | 45,458 | 16.87
GRU–LSTM          | 33,174 | 46,012 | 16.98
MLP               | 35,723 | 49,548 | 17.14
CatBoost          | 36,345 | 50,411 | 18.67
Linear regression | 40,452 | 56,107 | 20.26
Table 4. KAN and XGBoost test results in the augmented California Housing dataset.

Model Name | MAE    | RMSE   | SMAPE (%)
KAN        | 34,716 | 53,378 | 17.74
XGBoost    | 36,102 | 55,533 | 18.31
Table 5. Statistical significance results.

Model Compared    | p-Value              | Performance Improvement
CatBoost          | 0.03 (Significant)   | 9.19%
GRU–LSTM          | 0.003 (Significant)  | 3.28%
LGB               | 0.502                | 2.40%
Linear regression | 0.005 (Significant)  | 27.98%
MLP               | 0.003 (Significant)  | 9.31%
Random forest     | 0.0312 (Significant) | 15.62%
XGBoost           | 0.6875               | 1.10%


