Probabilistic Kolmogorov–Arnold Network: An Approach for Stochastic Modelling Using Divisive Data Re-Sorting

Polar, Andrew; Poluektov, Michael

doi:10.3390/modelling6030088

by Andrew Polar ¹ and Michael Poluektov ²,³,*

¹ Independent Researcher, Duluth, GA 30096, USA

² Department of Mathematical Sciences and Computational Physics, School of Science and Engineering, University of Dundee, Dundee DD1 4HN, UK

³ International Institute for Nanocomposites Manufacturing, WMG, University of Warwick, Coventry CV4 7AL, UK

* Author to whom correspondence should be addressed.

Modelling 2025, 6(3), 88; https://doi.org/10.3390/modelling6030088

Submission received: 27 June 2025 / Revised: 11 August 2025 / Accepted: 18 August 2025 / Published: 22 August 2025

Abstract

The Kolmogorov–Arnold network (KAN) is a regression model that is based on a representation of an arbitrary continuous multivariate function by a composition of functions of a single variable. Experimentally obtained datasets for regression models typically include uncertainties, which in some cases, cannot be neglected. The conventional way to account for the latter is to model confidence intervals of the systems’ outputs in addition to the expected values of the outputs. However, such information may be insufficient, and in some cases, researchers aim to obtain probability distributions of the outputs. The present paper proposes a method for estimating probability distributions of the outputs by constructing an ensemble of models. The suggested approach covers input-dependent probability distributions of the outputs and is capable of capturing the multi-modality, as well as the variation of the distribution type with the inputs. Although the method is applicable to any regression model, the present paper combines it with KANs, since their specific structure leads to the construction of computationally efficient models. The source codes are available online.

Keywords:

uncertainty quantification; deep ensemble learning; Kolmogorov–Arnold representation; divisive data re-sorting; multi-modality in posterior distributions

1. Introduction

The popularity of using the Kolmogorov–Arnold representation [1,2] as a regression model has significantly grown over the last year, largely due to the publication of preprint [3], which promotes the term ‘network’ (leading to the abbreviation ‘KAN’). However, the idea of using the Kolmogorov–Arnold representation as a machine-learning model has been around for decades [4,5], and successful implementations of this model and its training method have been available for years [6,7,8,9,10,11,12,13], including the developments by the authors of the present paper [14,15]. The pioneering work of Igelnik and Parikh [9] should be especially emphasised, as it was the first to propose a KAN with spline basis functions. A more detailed literature review on KANs can be found in the recent publication by the authors of the present paper [15].

Most recent preprints focus on deterministic modelling, and only a few authors have tried using this model for estimating probabilistic properties. For example, in [16], a method for propagating statistical moments through the model is suggested, and in [17], KANs are combined with Bayesian inference for estimating posterior distributions. The present paper proposes an ensemble training method for uncertainty quantification and combines it with KANs.

The prior research on probabilistic modelling can be divided into Bayesian neural networks (BNNs) [18,19,20] and a large group of ensemble methods [21,22,23,24,25,26,27,28]. There are also methods tailored to specific models, such as the Monte Carlo dropout [29]. One particular ensemble method can be highlighted—quantile regression forests (QRFs) [30]. It is a customisation of Random Forests by using a specific classifier, which is designed to perform quantile regression. The original publication did not report quantile values for the benchmark tests, only pinball loss metrics [31]; however, the experiments can be reproduced using freely available libraries (Python package scikit-garden).

Most ensemble methods do not require making choices regarding the expected probability types, as opposed to some probabilistic models, such as BNNs. Practical utilisation of BNNs is facilitated by the publicly available Keras (https://keras.io/examples/keras_recipes/bayesian_neural_networks, accessed on 17 August 2025) library. The users should define and configure the type of the expected distribution. In cases when it is more complex than a single bell-shaped curve, the users need to specify its form, for example a mixture of Gaussian distributions. Assigning the type of the expected distribution can be an additional source of uncertainty: choosing a uni-modal type leaves no chance for the library to detect multi-modality, while assigning a mixture of Gaussians can reduce the accuracy by forcing the estimation of unnecessary extra parameters when the actual distribution is much simpler.

Another popular family of algorithms, natural gradient boosting (NGBoost) [32], also requires the users to choose the distribution type. It aims at improving the probabilistic predictions of gradient boosting models by replacing the standard gradient with the natural gradient, which accounts for the geometry of the output distribution. The accuracy metric, natural log-likelihood, requires parameters of the assumed distribution to be estimated from the modelled sample. When samples are short and the distribution type is a mixture, this adds uncertainty to the quantification method on top of the uncertainty already present in the data.

The aim of the approach proposed in the present paper is to ‘pass’ the burden of the identification of the distribution type from the user to the code. It is an ensemble method and has a capability of making samples large enough for the estimation of the probability distributions, including the capturing of the multi-modality. The method is combined with the KANs’ architecture, resulting in shallow probabilistic models. The reason for building the approach on KANs is the descriptive capabilities of the latter, enabling it to identify hidden properties of the modelled systems via deep-learning steps. The approach was also designed to have a low computational cost.

2. Divisive Data Re-Sorting (DDR)

A dataset is considered to contain independent data records

\{x_{i}, y_{i}\}

,

i \in \{1, \dots, N\}

, where

x_{i} \in R^{m}

is the input of the i-th record,

y_{i} \in R

is the output of the i-th record, and N is the number of records. The records are observations of a system with uncertainty: the output of such a system is a random variable with an input-dependent probability distribution.

Suppose some deterministic regression model

M_{0} : R^{m} \to R

can be built using a particular error minimisation process,

M_{0} = \underset{M \in A}{argmin} \sum_{i} {(y_{i} - M (x_{i}))}^{2},

where

A

is the space of possible models. Such models will be called expectation models.

The method starts by building one expectation model for the entire dataset

M_{1, 1}

, where the first subscript is the step number and the second subscript is the index of the model within the step. Then, the data records are re-sorted according to residuals

r_{i} = y_{i} - M_{1, 1} (x_{i})

and subdivided into two even clusters over the median residual in the sorted list. At the second step, new expectation models are built for each cluster, resulting in

M_{2, 1}

and

M_{2, 2}

. Then, within each cluster, the records are again re-sorted according to the residuals. Each cluster is then subdivided into two clusters each over the median residual, resulting in four clusters in total, for which new expectation models

M_{3, 1}

,

M_{3, 2}

,

M_{3, 3}

, and

M_{3, 4}

are built. The process is continued in a similar way.

The user must ensure that the average error for each cluster declines in the subdividing process. Declining errors indicate that the selected model type is good enough for the entire dataset to be represented by a collection of models—the larger the number of models in the collection, the better the fit to all data points. Non-declining errors, in turn, mean that the dataset cannot be represented by a collection of models of the selected type. The latter may be due to model inadequacy.

When the models become sufficiently accurate, the process stops. It is up to the user to decide when the residuals are sufficiently small, i.e. the ‘depth’ of the process (the number of models constituting the final ensemble) becomes a hyperparameter. The ensemble of models obtained at the last subdividing step can now be used for approximating the distribution of the output for each individual record—the outputs of the ensemble can be handled as a sample from the probability distribution of the real output.
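The subdividing procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the affine least-squares expectation model, the function names (`fit_linear`, `predict_linear`, `ddr`) and the use of the mean absolute residual as the per-step error are all choices made for the example.

```python
import numpy as np

def fit_linear(X, y):
    """Affine least-squares fit; a stand-in for any expectation model."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def ddr(X, y, steps, fit=fit_linear, predict=predict_linear):
    """Run `steps` >= 1 DDR steps.
    Returns the models of the last step and the mean absolute residual per step."""
    clusters = [np.arange(len(y))]            # step 1: a single cluster with all records
    errors = []
    for step in range(steps):
        models = [fit(X[idx], y[idx]) for idx in clusters]
        residuals = [y[idx] - predict(m, X[idx]) for m, idx in zip(models, clusters)]
        errors.append(float(np.mean(np.abs(np.concatenate(residuals)))))
        if step == steps - 1:
            break
        next_clusters = []
        for idx, r in zip(clusters, residuals):
            order = idx[np.argsort(r)]        # re-sort the records by residual
            half = len(order) // 2            # split at the median residual
            next_clusters += [order[:half], order[half:]]
        clusters = next_clusters
    return models, errors

# Hypothetical bi-modal data: the output jumps by +1 or -1 at random.
rng = np.random.default_rng(0)
X = rng.uniform(size=(800, 2))
y = X[:, 0] + rng.choice([-1.0, 1.0], size=800) + 0.05 * rng.normal(size=800)
models, errors = ddr(X, y, steps=4)           # 2**(4 - 1) = 8 models
query = np.array([[0.5, 0.5]])
sample = np.array([predict_linear(m, query)[0] for m in models])
```

Here `errors` is the quantity the user would monitor for a decline across steps, and `sample` plays the role of the sample from the output distribution at the query input.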

If for a given expectation model, the computational complexity is known and is given by dependency

C (N)

, where N is the number of records on which the model is trained, then, knowing that at the n-th step of the DDR algorithm, exactly

2^{n - 1}

models are built, each on

N / 2^{n - 1}

records, the complexity of just building the models of this step is

2^{n - 1} C (2^{1 - n} N)

. DDR is of course ideal for parallelisation: each expectation model is independent of the other; hence, at the n-th step of the DDR algorithm, the entire process of constructing the ensemble can be executed within exactly

2^{n - 1}

threads.
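As a small worked illustration of the cost expression above, the per-step and cumulative costs can be evaluated for an assumed complexity function. The functions `linear_cost` and `quadratic_cost` below are hypothetical examples of C(N), chosen only to show how the expression behaves.

```python
def ddr_step_cost(cost, n_records, step):
    """Cost of building all 2**(step - 1) models of one DDR step,
    each trained on n_records / 2**(step - 1) records."""
    n_models = 2 ** (step - 1)
    return n_models * cost(n_records / n_models)

def ddr_total_cost(cost, n_records, steps):
    """Cumulative model-building cost over steps 1..steps."""
    return sum(ddr_step_cost(cost, n_records, n) for n in range(1, steps + 1))

def linear_cost(n):       # hypothetical C(N) = N
    return n

def quadratic_cost(n):    # hypothetical C(N) = N**2
    return n ** 2

print(ddr_step_cost(linear_cost, 1024, 3))     # same cost at every step
print(ddr_step_cost(quadratic_cost, 1024, 3))  # cheaper than step 1 by a factor of 4
```

Under these assumptions, a linear C(N) gives the same cost at every step, while a superlinear C(N) makes later steps cheaper than earlier ones.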

The proposed approach falls into the boosting category, since a series of models is built sequentially, where the next group of models depends on the previous one. It has similarities to some known methods, but in the form proposed here, it is novel to the best knowledge of the authors. A possibly close concept is called anti-clustering [33,34,35], where the data is clustered such that the clusters have maximum similarity, with maximum diversity of the elements within them. Although DDR is similar to the established boosting frameworks, such as AdaBoost, it differs from the latter in that each ‘weak’ model is trained on its own subset of records, without the introduction of weights; the main mechanism of DDR is the formation of the subsets of records (re-sorting), rather than fitting of the weights.

2.1. Interpretation of DDR and Uncertainty Type

The capability of detecting multi-modality by the proposed algorithm is illustrated schematically in Figure 1. The columns of different output values are multiple outputs corresponding to a single input. The red lines are the expectation models of the ensemble. The line ‘densities’ within each column approximately match the ‘densities’ of the output values, which follows from the construction of the ensemble of the models. Of course, expectation models with the output continuously dependent on the input require the probability distribution of the output of the modelled system to be continuously dependent on the input.

The DDR algorithm does not specifically rely on the type of data uncertainty. Sometimes, it is not easy to identify the uncertainty type even in the case of known underlying physical properties of the system. For example, the uncertainty type of a dataset with self-reported incomes depending on demographic parameters (e.g., age, education level, etc.) can be interpreted in various ways: it can be said that this data has aleatoric uncertainty because multiple records with identical inputs are likely to have different outputs; on the other hand, by expanding the list of inputs and obtaining more demographic details, the uncertainties can be reduced, which may classify them as epistemic. Not in all, but in many cases, the boundary between epistemic and aleatoric uncertainties is arbitrary and is defined by the users as they understand it.

2.2. Relation to k-Nearest Neighbours (kNN)

In general, kNN is considered to be a competitive method for various types of probabilistic modelling [36]. For example, it can be applied to the scenario illustrated in Figure 1. If a given input matches any of the inputs associated with the columns of data points, the application of the kNN algorithm with the proper choice of the number of nearest neighbours will directly give the set of the points from the column. The latter is close to the set of outputs of the models comprising the DDR ensemble. However, the figure shows an ideal scenario only for the explanation of the concept. For non-ideal data, the points will be spread along the input axes and there will also be gaps in the data. Furthermore, the applicability of kNN cannot always be determined from the data: even for a deterministic system, the neighbouring points can be collected and variances can be estimated, which may lead to a wrong conclusion.

Thus, the similarity between DDR and kNN consists in returning a pre-defined number of outputs, which in the ideal case (e.g., Figure 1), are similar or identical. These outputs represent the scatter of points corresponding to one input point, given that the considered system is stochastic. However, the DDR’s outputs are the results of the models, while kNN’s outputs are the values directly picked from the dataset.

DDR allows researchers to avoid difficult choices related to kNN, such as the number of neighbours to pick, the weights to be assigned for the points that are slightly shifted, how to cover the gaps in the data, and the applicability of kNN in the first place (whether the object is deterministic or stochastic). The main advantage of DDR is that it does not require the entire dataset to be retained in the memory for each prediction. The numerical comparison of DDR and kNN is provided below.

2.3. Prediction of Probability Density

When the residuals decline sharply during the DDR steps, the number of obtained clusters can be small, e.g., 8 or 16. The modelling goal, however, is to predict the probability distribution, which requires a relatively large ensemble. Some approximation methods, such as kernel density estimation (KDE) [37], can be applied, but DDR allows for significant enlargement of the size of the ensemble, which can be used to build empirical cumulative distribution functions (ECDFs).

When the re-sorting process is completed, any large enough group of adjacent records in the sorted list can be used for building an additional model for the ensemble. This is possible because each cluster contains records that already have been sorted at the preceding step of the algorithm. Building additional models for the existing ensemble can be one option.

The second option is constructing a different (additional) ensemble of models using a sliding-window technique. It is applied to the sorted dataset obtained at the last step. Exactly k sequential records are selected, starting from the first record, and are used to build the first model for the additional ensemble. Then, k sequential records starting from record

(s + 1)

are used to build the second model for the additional ensemble. This is repeated in a similar way. Such a process can be interpreted as having a window of a selected size that is moved along the sorted dataset, and the records that fall into this window are used for training new models for the additional ensemble. When

s < k

, the window is moved with overlapping. The users must make a choice regarding size k and shift s. The size must be selected such that a reliable estimation of model parameters is possible; it is subject to the choice of the parameter estimation algorithm. The shift defines the desired size of the additional ensemble, i.e. the number of points for the ECDFs.
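The window positions implied by size k and shift s can be enumerated as follows. This is a small sketch; the function name `sliding_windows` and the zero-based, half-open index convention are assumptions of the example.

```python
def sliding_windows(n_records, k, s):
    """Yield (start, stop) index pairs of windows of size k shifted by s
    along a sorted list of n_records records (zero-based, stop exclusive)."""
    for start in range(0, n_records - k + 1, s):
        yield start, start + k

# 1000 sorted records, window of 20 records, no overlap (s = k).
windows = list(sliding_windows(1000, k=20, s=20))
print(len(windows), windows[0], windows[-1])

# Overlapping windows (s < k) give a larger additional ensemble.
overlapping = list(sliding_windows(1000, k=20, s=5))
print(len(overlapping))
```

Each index pair selects the records on which one model of the additional ensemble would be trained, so the number of windows equals the number of points available for the ECDF.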

Within the discussed algorithms, the clusters are always even. One can construct modifications of such algorithms considering uneven clusters (with varying number of records); however, this will create difficulties for the choice of the number of the parameters of the expectation models, especially on small datasets.

3. Kolmogorov–Arnold Model/Network

The two-layer Kolmogorov–Arnold model (or network, KAN) is given by the following expression:

{\tilde{y}}_{i} = \sum_{k = 1}^{d} Φ_{k} (\sum_{j = 1}^{m} f_{k j} (x_{i, j})),

(1)

where scalar

{\tilde{y}}_{i}

is the calculated model output of the i-th record, scalars

x_{i, j}

denote the j-th component of input vector

x_{i}

, functions

f_{k j}

and

Φ_{k}

constitute the model. In the original work [1,2], it has been shown that the model with

d = 2 m + 1

can represent any continuous multivariate function (more recently, the restrictions on continuity have been somewhat relaxed, see e.g., [38] and references therein). Functions

f_{k j}

and

Φ_{k}

, which are referred to as the ‘inner’ and the ‘outer’ functions, respectively, are further decomposed into the basis functions and the parameters. There are multiple model training methods and choices for the basis functions. It is assumed that the reader is already familiar with this model; an overview of the model and its training methods can be found in the recent publication by the authors [15]; practical examples of the model application can be found on the authors’ website (http://openkan.org, accessed on 17 August 2025).
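To make the structure of Equation (1) concrete, the following sketch evaluates a two-layer model whose inner and outer functions are piecewise linear. It is only an illustration: the helper names (`piecewise_linear`, `kan_forward`), the node positions and the random node values are assumptions, and no training is performed.

```python
import numpy as np

def piecewise_linear(nodes, values):
    """Univariate piecewise linear function through (nodes, values).
    np.interp holds the end values constant outside the node range."""
    nodes, values = np.asarray(nodes, float), np.asarray(values, float)
    return lambda t: np.interp(t, nodes, values)

def kan_forward(x, inner, outer):
    """Evaluate Equation (1) for one record x of length m.
    inner[k][j] plays the role of f_kj, outer[k] the role of Phi_k."""
    y = 0.0
    for k, phi in enumerate(outer):
        theta_k = sum(f(x[j]) for j, f in enumerate(inner[k]))
        y += phi(theta_k)
    return float(y)

# Hypothetical untrained model with m = 2 inputs and d = 2m + 1 = 5 outer functions.
rng = np.random.default_rng(0)
m, d = 2, 5
inner = [[piecewise_linear(np.linspace(0, 1, 5), rng.normal(size=5)) for _ in range(m)]
         for _ in range(d)]
outer = [piecewise_linear(np.linspace(-3, 3, 7), rng.normal(size=7)) for _ in range(d)]
y_tilde = kan_forward([0.3, 0.8], inner, outer)
```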

3.1. KAN and DDR

The present paper proposes the two-layer KAN to be used as an expectation model for the DDR ensemble; more precisely, a specific modification of it, the shallow probabilistic model, is proposed in the subsection below. However, the expectation model can of course be arbitrary, and its complexity can be optimised using standard tools, such as minimum description length (MDL) [39].

The first versions of the present paper proposed another expectation model: a multi-layer deep KAN with a specific architecture. It is called a binary Urysohn tree and is characterised by doubling the number of the underlying functions from layer to layer from top to bottom. It is a successful model with descriptive capabilities equivalent to shallow KANs and with some of its own advantages. The main reason for mentioning it is its historical significance: it was a successful implementation of a multi-layer deep KAN back in 2021 by the authors of the present paper. Interested readers can access version 3 of the present paper [40] containing a detailed description of the model and numerical experiments. Subsequent versions of the present paper refocused on the shallow KANs.

3.2. Shallow Probabilistic Model

For relatively large datasets, the entire KAN can of course be used as the expectation model for DDR. In this case, its structure can be optimised via a range of techniques; for example, running variational Bayesian inference on KANs can yield a model that is optimal in terms of size (e.g., as is carried out for other models [41]). However, for relatively small datasets, that might not be possible: subdivision of the dataset into clusters may lead to clusters in which the number of records is insufficient for model training without overparametrisation. The latter can be addressed using the following idea.

Model (1) can be rewritten as

\begin{matrix} {\tilde{y}}_{i} = \sum_{k = 1}^{d} Φ_{k} (θ_{i, k}), \end{matrix}

(2)

\begin{matrix} θ_{i, k} = \sum_{j = 1}^{m} f_{k j} (x_{i, j}), \end{matrix}

(3)

where intermediate variable

θ_{i} \in R^{d}

has been introduced, with index i indicating the record number; scalars

θ_{i, k}

denote the k-th component of

θ_{i}

. A single expectation model (KAN) is fitted for the entire dataset first. Then, the model is used to calculate the values of intermediate variables

θ_{i}

. These values are then assembled into a new dataset

\{θ_{i}, y_{i}\}

,

i \in \{1, \dots, N\}

. Finally, the DDR algorithm is applied to the new dataset with the expectation model having the form given by Equation (2).

Such an approach allows for a significant reduction in the number of estimated parameters within a single expectation model in the ensemble, also leading to faster training (compared to using KANs as expectation models). Model (2) is called a generalised additive model (GAM); it can be viewed as a constituent part of KAN.
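The construction of the intermediate dataset via Equation (3) can be sketched as below. The function name `intermediate_dataset` and the toy inner functions are hypothetical; in practice the inner functions would come from the KAN fitted to the entire dataset.

```python
import numpy as np

def intermediate_dataset(X, inner):
    """Map inputs X (N x m) to intermediate variables theta (N x d), Equation (3).
    inner[k][j] plays the role of the fitted inner function f_kj."""
    n_records = X.shape[0]
    theta = np.zeros((n_records, len(inner)))
    for k, row in enumerate(inner):
        for j, f in enumerate(row):
            theta[:, k] += f(X[:, j])
    return theta

# Toy inner functions (assumptions) for m = 2 inputs and d = 2 intermediate variables.
inner = [[np.sin, np.cos],
         [lambda t: t, lambda t: 2.0 * t]]
X = np.array([[0.0, 0.0],
              [1.0, 2.0]])
theta = intermediate_dataset(X, inner)
# The pairs (theta[i], y[i]) then form the dataset to which DDR is applied,
# with an additive model of the form of Equation (2) as the expectation model.
```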

4. Elementary Example—Application of DDR

The aim of the first test is to demonstrate the capabilities of DDR using a simple expectation model (without the combination with KAN). Several computational tests of the present paper use synthetic data, as the reference (true) distributions are required for the validation; in the case of experimentally-obtained datasets of stochastic systems, true distributions are unknown.

4.1. Data and Expectation Model

Synthetic data is generated using the Monte Carlo (MC) algorithm. The data corresponds to a stochastic system, the output of which is the sum of outcomes of q ten-sided dice rolls. There are three inputs:

q_{1}

,

q_{2}

, and p, where

1 \leq q_{1, 2} \leq 10

. The number of dice is then chosen as

q = q_{1}

with probability p, and

q = q_{2}

with probability

(1 - p)

. The output of such a system is a discrete random variable, the distribution of which is input-dependent and is also bi-modal for a large range of inputs. The training and the validation datasets consist of 1000 and 100 records, respectively; the inputs are drawn independently from a uniform distribution; the validation dataset is generated independently from the training dataset.
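A possible Monte Carlo generator for this system is sketched below. It assumes that the dice faces are numbered 1 to 10 and that q_1 and q_2 are integers; neither detail is stated explicitly in the text, and the function name `roll_system` is hypothetical.

```python
import numpy as np

def roll_system(q1, q2, p, rng, size=1):
    """Sample the output: the sum of q ten-sided dice,
    where q = q1 with probability p and q = q2 otherwise."""
    q = np.where(rng.random(size) < p, q1, q2)
    q_max = max(q1, q2)
    rolls = rng.integers(1, 11, size=(size, q_max))   # assumed faces 1..10
    keep = np.arange(q_max) < q[:, None]              # use only the first q dice
    return (rolls * keep).sum(axis=1)

rng = np.random.default_rng(0)
reference = roll_system(q1=4, q2=9, p=0.3, rng=rng, size=10_000)
```

Repeated sampling at a fixed input, as in `reference`, gives the kind of MC population against which the ensemble outputs can be validated.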

Seven steps of the DDR algorithm are performed, resulting in

2^{7 - 1} = 64

expectation models in the DDR ensemble. Afterwards, the sliding-window technique is applied with the window size of 20 records, which is moved along the sorted records without overlapping, providing the final ensemble of 50 models. The sliding-window method is not necessary for this particular dataset, but its usage is a part of the test.

For this example, the simplest expectation model is chosen, namely a multilinear (trilinear) model:

{\tilde{y}}_{i} = c_{0} + c_{1} x_{i, 1} + c_{2} x_{i, 2} + c_{3} x_{i, 3} + c_{4} x_{i, 1} x_{i, 2} + c_{5} x_{i, 1} x_{i, 3} + c_{6} x_{i, 2} x_{i, 3} + c_{7} x_{i, 1} x_{i, 2} x_{i, 3},

(4)

where

{\tilde{y}}_{i}

is the estimated output of the i-th record,

x_{i, j}

denote the j-th component of the input vector, and

c_{j}

are the model parameters.
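Since Equation (4) is linear in its parameters, it can be fitted by ordinary least squares. The following sketch shows one way to do this; the function names are hypothetical and the choice of `numpy.linalg.lstsq` as the solver is an assumption, not necessarily what the authors' code uses.

```python
import numpy as np

def trilinear_features(X):
    """Design matrix of Equation (4) for inputs X of shape (N, 3)."""
    x1, x2, x3 = X[:, 0], X[:, 1], X[:, 2]
    return np.column_stack([np.ones(len(X)), x1, x2, x3,
                            x1 * x2, x1 * x3, x2 * x3, x1 * x2 * x3])

def fit_trilinear(X, y):
    """Least-squares estimate of the parameters c_0, ..., c_7."""
    c, *_ = np.linalg.lstsq(trilinear_features(X), y, rcond=None)
    return c

def predict_trilinear(c, X):
    return trilinear_features(X) @ c
```

With eight parameters per model, a cluster needs comfortably more than eight records for the estimate to be reliable, which is one reason the window and cluster sizes matter.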

4.2. Accuracy Metrics

To assess the performance of the approach, the modelling results are compared to the MC simulations that can be considered to represent true distributions.

The first comparison step considers means and standard deviations obtained from the DDR ensemble (modelling) and from the MC simulations (reference solution). The normalised root mean square error (RMSE) measure is used:

E_{RMSE} = \frac{1}{max (Z_{MC}) - min (Z_{MC})} \sqrt{\frac{1}{N} \sum_{i} {(Z_{i, DDR} - Z_{i, MC})}^{2}},

(5)

where Z stands for either mean or standard deviation, subscript ‘

DDR

’ denotes the value obtained from the DDR ensemble, subscript ‘

MC

’ denotes the value obtained from the MC sampling, N is the number of records used for the comparison, and the sum is over record number i.
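Equation (5) translates directly into code. In the sketch below, the function name `normalised_rmse` is an assumption; Z may hold either the means or the standard deviations, one value per record.

```python
import numpy as np

def normalised_rmse(z_ddr, z_mc):
    """Equation (5): RMSE between DDR and MC values of Z,
    normalised by the range of the MC values."""
    z_ddr = np.asarray(z_ddr, dtype=float)
    z_mc = np.asarray(z_mc, dtype=float)
    rmse = np.sqrt(np.mean((z_ddr - z_mc) ** 2))
    return float(rmse / (z_mc.max() - z_mc.min()))
```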

Furthermore, goodness-of-fit comparisons are also performed. Both the tested sample and the MC-generated population have a large number of ties (equal values), which makes many standard tests, such as Cramér–von Mises and Anderson–Darling, inapplicable. Also, this particular case is not limited to two tested samples only; independent true samples of any size can be generated by MC. This advantage is used to introduce a customised test similar to the Cramér–von Mises test.

The statistic is the relative distance between median trees, which are constructed as follows. When two samples are compared, each one is subdivided by its median into two clusters; this median is the first entry in a list; the medians of the two obtained clusters are added to the list; the process continues in the same way until the predefined number of clusters (and hence of nested medians) is obtained. The medians are then sorted, resulting in a vector that can be called the median tree. The vectors corresponding to the two samples (U and V) are compared using the relative distance:

S = 2 \sqrt{2} \frac{∥ U - V ∥}{∥ U ∥ + ∥ V ∥} .

(6)

Each median tree is, in fact, a set of quantiles, i.e. the metric compares vectors of multiple quantiles of the two samples. It can be shown that if two samples correspond to the same continuous random variable with probability density function f, then the numerator of S is related to statistic T from the Cramér–von Mises test [42]. In particular, statistic

T_{*} = \frac{a b}{c (a + b)} \sum_{i} {((U_{i} - V_{i}) f (\frac{U_{i} + V_{i}}{2}))}^{2},

(7)

where a and b are the sizes of two samples, c is the number of medians (elements in U or V), has the same limiting distribution as T when

a \to \infty

,

b \to \infty

, and

a / b \to λ

, where

λ

is finite. Statistic S simply omits f and is scaled differently to

T_{*}

. Theoretical limits for S are

0 \leq S \leq 2 \sqrt{2}

, but the expected value of S for two random U and V, assuming the uniform distribution with the same limits, is 1; therefore, this value is a convenient measure with conventional range

[0, 1]

.

The standard approach for most similar testing procedures is to compare the statistic computed for two samples to tabulated values. Since this is a new test, such tables are not available, but can be created ad hoc. The MC sample can be as large as necessary and represents the population, against which the model sample is tested. Random sub-samples from the MC population are taken, statistic S is calculated for each taken sub-sample and the MC population, the values are sorted and used as an ECDF for the validation. For the examples of this paper, the number of sub-samples is taken to be 100, and when the statistic for the model sample is below the maximum value, the test is considered to be passed at a

1 %

significance level.
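The median tree, the statistic of Equation (6) and the ad hoc acceptance rule can be sketched as follows. Several details are assumptions of the example rather than statements from the text: the tree depth, the handling of odd-sized clusters (the middle element goes to the upper cluster), the choice of making the MC sub-samples the same size as the model sample, and all function names.

```python
import numpy as np

def median_tree(sample, depth):
    """Sorted vector of 2**depth - 1 nested medians.
    Assumes every cluster stays non-empty, i.e. len(sample) >= 2**(depth - 1)."""
    medians = []
    clusters = [np.sort(np.asarray(sample, dtype=float))]
    for _ in range(depth):
        next_clusters = []
        for c in clusters:
            medians.append(np.median(c))
            half = len(c) // 2
            next_clusters += [c[:half], c[half:]]
        clusters = next_clusters
    return np.sort(medians)

def relative_distance(u, v):
    """Statistic S of Equation (6)."""
    return float(2 * np.sqrt(2) * np.linalg.norm(u - v)
                 / (np.linalg.norm(u) + np.linalg.norm(v)))

def passes_test(model_sample, population, rng, depth=3, n_sub=100):
    """Compare S of the model sample with S of n_sub random MC sub-samples."""
    ref = median_tree(population, depth)
    size = len(model_sample)
    null = [relative_distance(
                median_tree(rng.choice(population, size, replace=False), depth), ref)
            for _ in range(n_sub)]
    s = relative_distance(median_tree(model_sample, depth), ref)
    return s < max(null), s
```

A call such as `passes_test(ensemble_outputs, mc_population, np.random.default_rng(0))` would return the pass/fail flag together with the value of S, where both arrays are hypothetical one-dimensional samples for a single input.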

4.3. Test Results

The results are shown in Table 1. Each column is obtained by independent execution of the programme, which includes the data generation, the training, and the validation (against MC) as described above. For this example, the normalised RMSEs for the mean and for the standard deviation obtained from the DDR ensemble are approximately

4 %

and

14 %

, respectively. The samples given by the DDR ensemble pass approximately 85 out of 100 goodness-of-fit tests.

The last row of Table 1 shows the application of kNN to the example above. The number of neighbours is the same as the number of models in the DDR ensemble. It can be seen that kNN samples pass fewer goodness-of-fit tests, approximately 73 out of 100.

Estimation of the probability density using KDE applied to the DDR ensemble is shown in Figure 2 for

p = 0.3

,

q_{1} = 4

,

q_{2} = 9

, where it can be seen that the bi-modal character of the distribution is captured. The method consists in replacing a single value in a sample by a smooth distribution curve (e.g., Gaussian) and obtaining the distribution as a result of superposition. The expected smoothness of the distribution is controlled by a numerical parameter.
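The superposition described above can be written out explicitly for a Gaussian kernel. This is a minimal sketch: the function name `kde_density`, the fixed `bandwidth` argument (the smoothness parameter) and the toy ensemble outputs are assumptions of the example.

```python
import numpy as np

def kde_density(sample, grid, bandwidth):
    """Gaussian kernel density estimate evaluated on `grid`:
    each sample value is replaced by a Gaussian of width `bandwidth`,
    and the curves are averaged."""
    sample = np.asarray(sample, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - sample[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)

# Toy ensemble outputs at one input, forming two groups (hypothetical values).
outputs = np.array([20.0, 22.0, 24.0, 48.0, 50.0, 52.0])
grid = np.linspace(-50.0, 150.0, 4001)
density = kde_density(outputs, grid, bandwidth=3.0)
```

A smaller bandwidth keeps the two groups separated as distinct modes, while a larger one smooths them towards a single peak.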

Table 2 shows a sharp decline of the relative error in the subdividing process of the DDR algorithm computed during training on the training dataset itself, confirming the aleatoric nature of the uncertainty. The source code is available online (https://github.com/andrewpolar/vdice_bilinear, accessed on 17 August 2025).

4.4. Comparison to Bayesian Neural Networks (BNNs)

The multi-modal input-dependent probability distributions of the outputs of a stochastic system can also be estimated using BNNs [20]. A well-tested and popular code Keras (https://keras.io/examples/keras_recipes/bayesian_neural_networks, accessed on 17 August 2025) is taken for the comparison. Since the reference input-dependent distributions should be known (i.e., needed for the validation), in the provided example (https://archive.ics.uci.edu/ml/datasets/wine+quality, accessed on 17 August 2025), the experimentally obtained dataset has been replaced by the above-described synthetic dataset. The code has been used almost as is, with only one necessary change—to support multi-modality, the posterior (output) distribution type has been changed from a single normal distribution to a mixture of two normal distributions, e.g.,

f (x) = \sum_{j} p_{j} f_{N} (x; μ_{j}, σ_{j}^{2}),

(8)

where weights

p_{j}

, expectations

μ_{j}

, and standard deviations

σ_{j}

are estimated from the data. The latter is a typical way of incorporating multi-modality in probabilistic neural networks [43].

The results are shown in Table 3. It can be seen that the BNN is less accurate than DDR for this particular dataset. The reason for this is the relatively small dataset size, but it is a part of the test. The BNN also captures multi-modality, which can be shown by making the dataset less challenging, i.e. by increasing the size, excluding uni-modal records, or narrowing the variation ranges for the inputs, e.g.,

3 \leq q_{1, 2} \leq 6

. The source code is available online (https://github.com/andrewpolar/vdice-python, accessed on 17 August 2025).

5. Comparison to Quantile Regression Forests (QRFs)

Quantile regressions are commonly used for prediction of the confidence intervals along with the expectations and the medians. The experiment in this section is the comparison of the confidence intervals estimated for the Boston housing dataset by DDR and QRF. The latter was installed as Python package scikit-garden. In both tests, 5-fold cross validation was used. The dataset is short (506 records) and significantly imbalanced.

The generalised additive model (GAM) was chosen as an expectation model for DDR:

{\tilde{y}}_{i} = \sum_{j = 1}^{m} f_{j} (x_{i, j}),

(9)

with 6 nodes and piecewise linear basis functions. For DDR, the sliding-window size of 20 and the shift of 14 were taken. In the test, the number of points falling into the interval defined by two quantile values was counted and divided by the number of test records in the validation set.
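The counting procedure described above can be sketched as follows. The function name `interval_coverage` and the array layout (one row of ensemble outputs per validation record) are assumptions, as is the use of `numpy.quantile` with its default interpolation to obtain the quantile values.

```python
import numpy as np

def interval_coverage(ensemble_outputs, y_true, lower=0.05, upper=0.95):
    """Fraction of validation outputs falling between two predicted quantiles.
    ensemble_outputs has shape (n_records, n_models)."""
    lo = np.quantile(ensemble_outputs, lower, axis=1)
    hi = np.quantile(ensemble_outputs, upper, axis=1)
    inside = (np.asarray(y_true) >= lo) & (np.asarray(y_true) <= hi)
    return float(np.mean(inside))
```

For a well-calibrated ensemble, the returned value should be close to `upper - lower`.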

For confidence interval

[0.05, 0.95]

, the ideal prediction result is of course

0.90

. The comparison is shown in Table 4, where the sequential numbers denote different code executions. The difference is insignificant; both methods give good results for this dataset. Table 5 shows the same test for narrower limits

[0.20, 0.80]

, for which the ideal prediction value is

0.60

. It can be seen that the results again are very close and, considering the given data quality, can be judged to be very accurate.

The hyperparameters, such as the sliding-window size and the shift in DDR, can be conveniently calibrated by dividing the dataset into three subsets: training, selection, and testing. The selection subset is used for the comparison of quantile values for different hyperparameters and is not directly used in the training.

6. Probabilistic KAN Test

This example introduces a more complex, closer-to-reality dataset that cannot be modelled by simple models, requiring a complex network-type model with multiple layers. The data is computed using the following formula:

\begin{matrix} y = \frac{2 + 2 x_{3}^{*}}{3 π} (arctan (20 (x_{1}^{*} - \frac{1}{2} + \frac{x_{2}^{*}}{6}) exp (x_{5}^{*})) + \frac{π}{2}) + \\ + \frac{2 + 2 x_{4}^{*}}{3 π} (arctan (20 (x_{1}^{*} - \frac{1}{2} - \frac{x_{2}^{*}}{6}) exp (x_{5}^{*})) + \frac{π}{2}), \\ x_{j}^{*} = x_{j} + δ (c_{j} - 0.5), j \in \{1, \dots, 5\}, \end{matrix}

(10)

where y is the system output,

x_{j}

are the system inputs,

c_{j} \sim unif (0, 1)

are uniformly distributed random variables (noise), multiplier

δ = 0.4

is the aleatoric uncertainty level (the system becomes deterministic for

δ = 0

). For data generation, inputs

x_{j} \sim unif (0, 1)

are taken.
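A data generator following Equation (10) can be sketched as below. The function name `system_output` and the array layout (one row per record, five columns) are assumptions of the example; the formula itself is transcribed from Equation (10).

```python
import numpy as np

def system_output(x, rng, delta=0.4):
    """Sample y of Equation (10) for inputs x of shape (N, 5)."""
    c = rng.uniform(0.0, 1.0, size=x.shape)       # noise c_j ~ unif(0, 1)
    xs = x + delta * (c - 0.5)                    # perturbed inputs x_j^*
    x1, x2, x3, x4, x5 = xs.T

    def branch(gain, sign):
        arg = 20.0 * (x1 - 0.5 + sign * x2 / 6.0) * np.exp(x5)
        return (2.0 + 2.0 * gain) / (3.0 * np.pi) * (np.arctan(arg) + np.pi / 2.0)

    return branch(x3, +1.0) + branch(x4, -1.0)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(1000, 5))         # inputs x_j ~ unif(0, 1)
y = system_output(X, rng)

# Repeated sampling at a fixed input gives an MC sample of the output distribution.
x_a = np.full((10_000, 5), 0.5)
y_a = system_output(x_a, rng)
```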

The probability density of the output changes significantly depending on the inputs. In Figure 3, in the insets (blue figures), four examples of probability densities of $y$ are shown for four different inputs:

$$
\begin{aligned}
x_{a} &= \begin{bmatrix} 0.5 & 0.5 & 0.5 & 0.5 & 0.5 \end{bmatrix}, \\
x_{b} &= \begin{bmatrix} 0.65 & 0 & 0.5 & 0.5 & 0.5 \end{bmatrix}, \\
x_{c} &= \begin{bmatrix} 0.68 & 1 & 0.5 & 0.5 & 0.5 \end{bmatrix}, \\
x_{d} &= \begin{bmatrix} 0.74 & 1 & 0.5 & 0.5 & 1 \end{bmatrix},
\end{aligned}
$$

corresponding to Figure 3a–d, respectively. The probability density functions (PDFs) are built using MC sampling of $10^{5}$ points.

To benchmark the DDR procedure, a total of 40 runs of the programme have been performed. During each run, a dataset of $10^{6}$ records has been generated, the ensemble of models has been constructed using the DDR algorithm, and the ECDFs for the four points given above have been calculated using the sliding-window technique. The number of the outer functions of the Kolmogorov–Arnold model has been selected to be 11; the inner and the outer functions have been taken to be piecewise linear with 5 and 7 equidistant nodes, respectively.

In Figure 3, the four inputs given above are considered: the ECDFs of the output built using MC sampling are shown in black, and the ECDFs obtained using the DDR procedure (with the sliding-window technique) are shown in grey. The output of the system, denoted as $y$ in Equation (10), is a random variable; hence, the ECDF of $y$ is denoted as $F_{y}$, and its argument is denoted as $\kappa$, which stands for a value that $y$ may take. Furthermore, the average ECDF across all realisations (i.e., the average of the grey curves) is shown in red. For the purposes of this discussion, the MC-sampling ECDFs are referred to as exact. It can be seen that the ensemble of models qualitatively reproduces the major features of the exact CDFs and quantitatively predicts the range of the output. The subfigures show different representative scenarios: for input $x_{a}$, the ensemble ECDF is very close to the exact CDF; for input $x_{b}$, the ensemble ECDF is quantitatively somewhat far from the exact CDF, although it qualitatively reproduces the change in the slope; for inputs $x_{c}$ and $x_{d}$, the ensemble ECDF is rather close to the exact CDF, although somewhat smoothed.
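For readers who wish to reproduce curves of this kind, the ECDF $F_{y}(\kappa)$ and its average over several realisations can be sketched as below. The sample values, the grid, and the 40 synthetic normal "runs" are placeholders assumed for illustration; they are not the paper's data.

```python
import numpy as np

def ecdf(samples, kappa):
    """Empirical CDF F_y(kappa): share of samples not exceeding kappa."""
    s = np.sort(np.asarray(samples, dtype=float))
    return np.searchsorted(s, kappa, side="right") / s.size

samples = np.array([0.2, 0.4, 0.4, 0.9, 1.3])
print(ecdf(samples, 0.4))  # → 0.6

# Average ECDF over several realisations, evaluated on a common grid of kappa values.
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 2.0, 21)
runs = [rng.normal(1.0, 0.3, size=200) for _ in range(40)]  # placeholder outputs
mean_curve = np.mean([ecdf(r, grid) for r in runs], axis=0)
```

Evaluating every realisation on the same grid is what makes the pointwise average well defined.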

The DDR algorithm builds the ensemble of models using a specifically created set of data clusters. Therefore, it is crucial to emphasise its advantage over an ensemble built using clusters containing random records. To show this, a set of 100 input points has been randomly generated. For each input point, the true mean and the true standard deviation have been calculated using MC sampling. Next, for each input point, the mean and the standard deviation have been calculated using the DDR ensembles and the random ensembles (the entire dataset is shuffled randomly, split into the same number of clusters as the DDR ensemble contains, and the ensemble of models is created such that each model is trained on its cluster). In Figure 4, the results are compared, and it can be seen that the random ensembles can model the means well, which is a known fact, but cannot model the standard deviations at all. Meanwhile, the DDR ensembles predict the standard deviations both qualitatively and quantitatively.
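The reason random clusters fail for the standard deviation can be seen in a deliberately simplified toy case. The sketch below is not the DDR algorithm: it assumes a single fixed input, uses a constant (the cluster mean) as each "model", and uses plain sorting of the outputs as a stand-in for the re-sorting step.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(0.0, 1.0, size=16_000)   # outputs observed at one fixed input (toy case)
n_clusters = 16

def cluster_means(values, k):
    """Split values into k equal consecutive clusters and fit a constant to each."""
    return values.reshape(k, -1).mean(axis=1)

random_ensemble = cluster_means(rng.permutation(y), n_clusters)  # random clusters
sorted_ensemble = cluster_means(np.sort(y), n_clusters)          # clusters of sorted records

print("true std        :", y.std())
print("random clusters :", random_ensemble.std())
print("sorted clusters :", sorted_ensemble.std())
```

Both ensembles recover the mean, but every random cluster is a small copy of the whole dataset, so its models nearly coincide and their spread collapses towards zero. Clusters formed from sorted records each cover a different part of the output distribution, so the spread of the ensemble tracks the spread of the data.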

Shallow Probabilistic KAN

Following Section 3.2, a shallow model is constructed for the example above. The number of records is decreased to $10^{4}$. The structure of the Kolmogorov–Arnold model is changed to 6 and 12 equidistant nodes for the inner and the outer functions, respectively (as above, piecewise linear functions are used). Five steps of the DDR algorithm are performed, resulting in $2^{5 - 1} = 16$ clusters. For the training of the DDR ensemble models, the number of equidistant nodes for the outer functions is decreased to 7. A sliding window of 500 records with a shift of 300 records is used, providing the final ensemble of 32 models. Thus, each individual model of the final ensemble, containing 77 parameters (since there are 11 outer functions with 7 nodes per function), is trained on 500 records (the sliding-window size).
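The counts quoted above can be checked with a few lines of arithmetic. The window-count formula below (number of full windows placed every `shift` records) is an assumption made here; it is consistent with the stated ensemble size of 32.

```python
def n_windows(n_records, window, shift):
    """Number of full windows of length `window` placed every `shift` records."""
    return (n_records - window) // shift + 1

n_clusters = 2 ** (5 - 1)                 # five DDR steps
n_models = n_windows(10_000, 500, 300)    # sliding window over 10^4 records
n_params = 11 * 7                         # outer functions x nodes per function
print(n_clusters, n_models, n_params)     # → 16 32 77
```

With these numbers, each ensemble member has roughly 6.5 training records per parameter (500 / 77).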

The same accuracy metrics as in the dice example are used, and the results are shown in Table 6. The errors for the mean and for the standard deviation are even lower than for the dice example; the number of passed goodness-of-fit tests is lower, but accounting for the complexity of the input-dependent distributions (shown in the insets of Figure 3), the authors consider this result to be acceptable. The source code is available online (https://github.com/andrewpolar/pkan, accessed on 17 August 2025).

One common problem for probabilistic models is a relatively long training time (e.g., several minutes even for relatively small datasets). The construction of the shallow probabilistic KAN using DDR in this example took approximately 0.25 s on an Intel(R) Core(TM) i7-8550U 1.80 GHz CPU, which can be considered a relatively good result. This can be crucial for cases focusing on unsupervised learning using large datasets, which typically require hours or days for training.

7. Detecting Public Bias in Bookmakers’ Odds for the English Premier League

The sports betting market is a tightly-regulated social system involving multiple large groups with opposing interests but common rules. Bookmakers’ odds can be regarded as a probabilistic model, crafted by highly-skilled and motivated professionals to secure a long-term profit. Football match outcomes, however, exhibit substantial aleatoric uncertainty, making them an attractive testing ground for probabilistic modelling.

In this experiment, the objective is not to maximise the percentage of correct predictions, but to design a betting strategy that maximises the net profit using the approach proposed above. In this scenario, the match outcome prediction accuracy is a misleading performance metric: for example, consistently betting on favourites in the English Premier League yields roughly 50% correct outcomes but no monetary gain. Purely random betting results in small but consistent losses, typically around 5%. Thus, the approach focuses on detecting patterns of public bias embedded in bookmakers’ odds and exploiting them for profitable betting decisions.

7.1. What Is Public Bias?

Bookmakers operate in a competitive market, which limits their commission rates, typically to around 3% to 7% as reported on their websites. This commission, embedded in the offered odds, is often referred to as a Dutch booking. In a perfectly balanced market, a Dutch book guarantees profit regardless of the match result.
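How a commission is embedded in the odds can be illustrated with a short calculation. The sketch assumes decimal odds, and the three odds values are hypothetical; the text does not give a specific example.

```python
def overround(decimal_odds):
    """Sum of implied probabilities (1 / odds) minus one."""
    return sum(1.0 / o for o in decimal_odds) - 1.0

def balanced_book_take(decimal_odds):
    """Bookmaker's share of all stakes when the payout is the same for every outcome."""
    return 1.0 - 1.0 / sum(1.0 / o for o in decimal_odds)

odds = [2.10, 3.40, 3.60]   # hypothetical home / draw / away odds
margin = overround(odds)
take = balanced_book_take(odds)
print(f"{margin:.3f} {take:.3f}")
```

For these odds the implied probabilities sum to about 1.048, i.e. the excess over one is in the 3% to 7% range mentioned above. If stakes are distributed in proportion to the implied probabilities, the bookmaker retains the same share whichever outcome occurs.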

In practice, bettors’ preferences are often biased and mismatched with presumed probabilities. To manage this, bookmakers adjust their odds to influence betting behaviour and restore the balance—sometimes visible in last-minute odds changes [44]. Most bookmakers also prohibit AI-assisted betting and may suspend accounts of suspected violators, though their detection methods are not disclosed.

7.2. Modelling Concept

The model produces probabilistic predictions for each possible match outcome: home win, draw, or away win, based on historical records. For each match, it selects the single outcome with the highest expected profit given the offered odds. A fixed virtual stake of £100 is placed on a single selected outcome for each match. Profit or loss is accumulated over the full Premier League season of 380 matches (each of the 20 teams playing every other team twice, once at home and once away).
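The selection rule can be sketched as follows, assuming decimal odds, for which the expected profit of a stake $B$ on an outcome with probability $p$ and odds $o$ is $B (p \, o - 1)$. The function name `best_bet` and the probabilities and odds used are hypothetical.

```python
def best_bet(probabilities, decimal_odds, stake=100.0):
    """Return the outcome with the highest expected profit and that profit."""
    expected = {
        k: stake * (probabilities[k] * decimal_odds[k] - 1.0)
        for k in probabilities
    }
    choice = max(expected, key=expected.get)
    return choice, expected[choice]

probs = {"home": 0.50, "draw": 0.25, "away": 0.25}   # model output (hypothetical)
odds = {"home": 1.80, "draw": 3.60, "away": 4.80}    # offered odds (hypothetical)
choice, profit = best_bet(probs, odds)
print(choice, round(profit, 2))
```

In this example the home team is the most likely winner, yet the away bet has the highest expected profit, which illustrates why such a rule need not favour the most probable outcome.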

Performance is reported as return on total stake (ROS) R:

$$R = \frac{M}{n B}, \tag{11}$$

where $n$ is the number of bets, $B$ is the virtual stake in the given currency ($B = £100$), and $M$ is the total profit in the given currency. For example, $R = 0.1$ corresponds to a 10% return, which is a £3800 profit on £38,000 in total wagers. The denominator is used only for reference; an actual bankroll of £38,000 is not required. For comparison, purely random betting produces $R \approx -0.05$.
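Equation (11) translates directly into code. The settlement helper below assumes decimal odds, and the three example bets are hypothetical.

```python
def return_on_stake(total_profit, n_bets, stake=100.0):
    """ROS as in Equation (11): R = M / (n * B)."""
    return total_profit / (n_bets * stake)

def settle(bets, stake=100.0):
    """Total profit M for a list of (decimal_odds, won) pairs with a fixed stake."""
    return sum(stake * (o - 1.0) if won else -stake for o, won in bets)

print(return_on_stake(3800.0, 380))   # → 0.1, the example from the text

bets = [(4.80, True), (3.60, False), (2.10, False)]   # hypothetical bets
M = settle(bets)
R = return_on_stake(M, len(bets))
```

Here one winning bet at long odds outweighs two losing bets, so $R$ is positive even though only a third of the bets are correct.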

7.3. Modelling Details

The model is the probabilistic KAN as described above. It predicts the goal difference (home minus away) as a real-valued output. Predictions are then converted into the categorical outcomes (home win, away win, draw) using a simple threshold (above 0.5, below -0.5, and between these values, respectively).
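One way to turn thresholded predictions into outcome probabilities is sketched below. It assumes that the outputs of the individual ensemble members are treated as samples of the goal difference and that values exactly at ±0.5 count as a draw; neither detail is stated in the text, and the sample values are hypothetical.

```python
import numpy as np

def outcome_probabilities(goal_diff_samples, threshold=0.5):
    """Share of predicted goal differences falling into each outcome category."""
    d = np.asarray(goal_diff_samples, dtype=float)
    return {
        "home": float(np.mean(d > threshold)),
        "away": float(np.mean(d < -threshold)),
        "draw": float(np.mean(np.abs(d) <= threshold)),
    }

samples = [1.2, 0.7, 0.1, -0.3, -0.9, 2.0, 0.4, -1.5]   # hypothetical ensemble outputs
probs = outcome_probabilities(samples)
print(probs)  # → {'home': 0.375, 'away': 0.25, 'draw': 0.375}
```

The three shares always sum to one, so they can be compared directly with the probabilities implied by the offered odds.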

Two seasons were modelled with different feature sets:

2019–2020 season:
‑ Features: current standings positions for each team (integers 0–19).
‑ Standings at the season start are taken from the previous season; updated after each match.
‑ Training data: the preceding 15 seasons.
‑ Newly promoted teams are assigned the initial standings positions.
‑ Results: Table 7 (code/data publicly available (http://openkan.org/bpl1.html, accessed on 17 August 2025)).

2020–2021 season:
‑ Features: recent match performance (the total goal differences in lost matches, the total goal differences in won matches, and the number of draws) for both home and away teams (6 features per match).
‑ Features are updated after each match.
‑ Features at the season start are taken from the previous season.
‑ Training data: the preceding 16 seasons.
‑ Newly promoted teams are excluded; only 17 teams from the prior season are modelled.
‑ Results: Table 8 (code/data publicly available (http://openkan.org/bpl2.html, accessed on 17 August 2025)).

7.4. Discussion

Across both seasons, the model’s recommended wagers fell predominantly on underdogs. Only about 30% of these bets were correct, yet the overall ROS was strongly positive, which is evidence that the model identified instances where potential payouts justified the risk. This behaviour is consistent with the principle of value betting, in which profit arises from exploiting market odds that underestimate an outcome’s true probability.

Considering the deliberately elementary feature sets used, the performance is notable. By comparison, studies focused on maximising predictive accuracy for sports outcomes, e.g., [45], often employ extensive feature sets, sometimes exceeding 100 variables, and report higher outcome prediction accuracy. The objective of the present experiment, however, was to detect systematic mismatches between bookmakers’ publicly offered odds, shaped by bettors’ biases, and the actual probabilities inferred from the historical records, rather than to maximise raw prediction accuracy.

8. Conclusions

The present paper has introduced a new approach to the modelling of stochastic systems, the divisive data re-sorting (DDR) method. It consists of recursive steps of training an ensemble of models, each on an individual cluster of data records, sorting the records, and increasing the ensemble size. The method can be combined with any expectation model, but the model should have adequate descriptive capabilities for the considered data. The suggested method can capture the multi-modality of the input-dependent probability distribution of the system’s output, as numerical experiments have shown. The method can also be used when users do not need a distribution and are satisfied with the first two statistical moments of the computed outputs. The method is related to kNN but is free from many of its hard choices, as well as from the need to retain the entire dataset for each prediction.

The DDR method has then been combined with the Kolmogorov–Arnold model (or network, KAN), resulting in the shallow probabilistic KAN. It has been found to be an efficient combination of a regression model with high descriptive capabilities and a simple model (the generalised additive model, or GAM) as a component of the DDR ensemble for producing the probabilistic output. Its relatively short training time may be crucial for some applications.

Author Contributions

Conceptualisation, A.P. and M.P.; methodology, A.P. and M.P.; software, A.P. and M.P.; validation, A.P. and M.P.; formal analysis, A.P. and M.P.; investigation, A.P. and M.P.; resources, A.P. and M.P.; data curation, A.P. and M.P.; writing—original draft preparation, A.P. and M.P.; writing—review and editing, A.P. and M.P.; visualization, A.P. and M.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The generation of synthetic datasets is described in the text of the paper. Links to publicly-available datasets that have been used for the numerical experiments are provided in the text of the paper.

Conflicts of Interest

The authors have no conflicts of interest to declare.

References

1. Arnold, V.I. On functions of three variables. In Doklady Akademii Nauk; Russian Academy of Sciences: Moscow, Russia, 1957; Volume 114, pp. 679–681.
2. Kolmogorov, A.N. On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. In Doklady Akademii Nauk; Russian Academy of Sciences: Moscow, Russia, 1957; Volume 114, pp. 953–956.
3. Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T.Y.; Tegmark, M. KAN: Kolmogorov-Arnold Networks. arXiv 2024, arXiv:2404.19756.
4. Bryant, D. Analysis of Kolmogorov’s Superposition Theorem and Its Implementation in Applications with Low and High Dimensional Data. Ph.D. Thesis, University of Central Florida, Orlando, FL, USA, 2008.
5. Liu, X. Kolmogorov Superposition Theorem and Its Applications. Ph.D. Thesis, Imperial College London, London, UK, 2015.
6. Sprecher, D.A. A numerical implementation of Kolmogorov’s superpositions. Neural Netw. 1996, 9, 765–772.
7. Sprecher, D.A. A numerical implementation of Kolmogorov’s superpositions II. Neural Netw. 1997, 10, 447–457.
8. Köppen, M. On the training of a Kolmogorov network. In Proceedings of the Artificial Neural Networks—ICANN 2002, Madrid, Spain, 28–30 August 2002; Dorronsoro, J.R., Ed.; Springer: Berlin/Heidelberg, Germany, 2002; pp. 474–479.
9. Igelnik, B.; Parikh, N. Kolmogorov’s spline network. IEEE Trans. Neural Netw. 2003, 14, 725–733.
10. Coppejans, M. On Kolmogorov’s representation of functions of several variables by functions of one variable. J. Econom. 2004, 123, 1–31.
11. Actor, J.; Knepley, M.J. An algorithm for computing Lipschitz inner functions in Kolmogorov’s superposition theorem. arXiv 2017, arXiv:1712.08286.
12. Actor, J. Computation for the Kolmogorov Superposition Theorem. Master’s Thesis, Rice University, Houston, TX, USA, 2018.
13. van Deventer, H.; van Rensburg, P.J.; Bosman, A. KASAM: Spline additive models for function approximation. arXiv 2022, arXiv:2205.06376.
14. Polar, A.; Poluektov, M. A deep machine learning algorithm for construction of the Kolmogorov-Arnold representation. Eng. Appl. Artif. Intell. 2021, 99, 104137.
15. Poluektov, M.; Polar, A. Construction of the Kolmogorov-Arnold networks using the Newton-Kaczmarz method. Mach. Learn. 2025, 114, 185.
16. Duda, J. Biology-inspired joint distribution neurons based on hierarchical correlation reconstruction allowing for multidirectional neural networks. arXiv 2024, arXiv:2405.05097.
17. Hassan, M.M. Bayesian Kolmogorov Arnold Networks (Bayesian KANs): A probabilistic approach to enhance accuracy and interpretability. arXiv 2024, arXiv:2408.02706.
18. MacKay, D.J.C. A practical bayesian framework for backpropagation networks. Neural Comput. 1992, 4, 448–472.
19. Neal, R.M. Bayesian learning via stochastic dynamics. In Proceedings of the NIPS’92: Proceedings of the 5th International Conference on Neural Information Processing Systems; Advances in Neural Information Processing System; Hanson, S.J., Cowan, J.D., Giles, C.L., Eds.; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1993; pp. 475–482.
20. Jerfel, G. Multimodal Probabilistic Inference for Robust Uncertainty Quantification. Ph.D. Thesis, Duke University, Durham, NC, USA, 2021.
21. Lakshminarayanan, B.; Pritzel, A.; Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Association for Computing Machinery: New York, NY, USA, 2017.
22. Efron, B. Bootstrap methods: Another look at the jackknife. Ann. Stat. 1979, 7, 1–26.
23. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140.
24. Schapire, R.E. The strength of weak learnability. Mach. Learn. 1990, 5, 197–227.
25. Freund, Y.; Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 1997, 55, 119–139.
26. Wolpert, D.H. Stacked generalization. Neural Netw. 1992, 5, 241–259.
27. Smyth, P.; Wolpert, D.H. Linearly combining density estimators via stacking. Mach. Learn. 1999, 36, 59–83.
28. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
29. Gal, Y.; Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the International Conference on Machine Learning (ICML), New York, NY, USA, 20–22 June 2016; pp. 1050–1059.
30. Meinshausen, N. Quantile regression forests. J. Mach. Learn. Res. 2006, 7, 983–999.
31. Koenker, R.; Bassett, G. Regression quantiles. Econometrica 1978, 46, 33–50.
32. Duan, T.; Avati, A.; Ding, D.Y.; Thai, K.K.; Basu, S.; Ng, A.; Schuler, A. NGBoost: Natural gradient boosting for probabilistic prediction. In Proceedings of the 37th International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; Volume 119, pp. 2690–2700.
33. Valev, V. Set partition principles. In Proceedings of the Transactions of the Ninth Prague Conference on Information Theory, Statistical Decision Functions, and Random Processes, Prague, Czech Republic, 28 June–2 July 1982; Kozesnik, J., Ed.; Springer: Dordrecht, The Netherlands, 1983; pp. 251–256.
34. Späth, H. Anticlustering: Maximizing the variance criterion. Control Cybern. 1986, 15, 213–218.
35. Papenberg, M.; Klau, G.W. Using anticlustering to partition data sets into equivalent parts. Psychol. Methods 2021, 26, 161–174.
36. Fathabadi, A.; Seyedian, S.M.; Malekian, A. Comparison of Bayesian, k-nearest neighbor and Gaussian process regression methods for quantifying uncertainty of suspended sediment concentration prediction. Sci. Total Environ. 2022, 818, 151760.
37. Parzen, E. On estimation of a probability density function and mode. Ann. Math. Stat. 1962, 33, 1065–1076.
38. Ismayilova, A.; Ismailov, V.E. On the Kolmogorov neural networks. Neural Netw. 2024, 176, 106333.
39. Bazzi, A.; Slock, D.T.M.; Meilhac, L. Detection of the number of superimposed signals using modified MDL criterion: A random matrix approach. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 4593–4597.
40. Polar, A.; Poluektov, M. Urysohn forest for aleatoric uncertainty quantification. arXiv 2021, arXiv:2104.01714v3.
41. Bazzi, A.; Slock, D.T.M.; Meilhac, L. Sparse recovery using an iterative variational Bayes algorithm and application to AoA estimation. In Proceedings of the 2016 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), Limassol, Cyprus, 12–14 December 2016; pp. 197–202.
42. Anderson, T.W. On the distribution of the two-sample Cramér-von Mises criterion. Ann. Math. Stat. 1962, 33, 1148–1159.
43. Mohebali, B.; Tahmassebi, A.; Meyer-Baese, A.; Gandomi, A.H. Probabilistic neural networks: A brief overview of theory, implementation, and application. In Handbook of Probabilistic Models; Elsevier: Amsterdam, The Netherlands, 2020; pp. 347–367.
44. Odachowski, K.; Grekow, J. Using bookmaker odds to predict the final result of football matches. In Proceedings of the Knowledge Engineering, Machine Learning and Lattice Computing with Applications, San Sebastian, Spain, 10–12 September 2012; Graña, M., Toro, C., Howlett, R.J., Jain, L.C., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; Volume 7828.
45. Taspinar, Y.S.; Cinar, I.; Koklu, M. Improvement of football match score prediction by selecting effective features for Italy Serie A League. MANAS J. Eng. 2021, 1, 1–9.

Figure 1. A schematic illustration of the concept of the proposed algorithm building the ensemble of models.

Figure 2. Probability density of the output for the dice example: the reference solution obtained using the Monte Carlo (MC) sampling (solid line) and the estimation made using the divisive data re-sorting (DDR) algorithm (bar chart).

Figure 3. The empirical cumulative distribution functions (ECDFs) of the output of the considered stochastic system obtained using the Monte Carlo (MC) sampling (black) and using the realisations of the divisive data re-sorting (DDR) algorithm (grey). Averages over the realisations are shown in red. The corresponding probability density functions (PDFs) obtained using the MC sampling are shown in the insets. Subfigures (a–d) correspond to inputs $x_{a}$–$x_{d}$.

Figure 4. The mean (a) and the standard deviation (b) for 100 input points. The points are sorted based on their $x_{1}$ value. The values are obtained using the MC sampling (red squares), using the DDR ensembles (blue crosses), and using the ensembles of models trained on randomly selected disjoint sets of records (grey pluses).

Table 1. Test results for the dice example.

sequential number	1	2	3	4	5	6	7	8
RMSE for mean	0.04	0.03	0.05	0.03	0.04	0.05	0.04	0.03
RMSE for standard deviation	0.15	0.14	0.14	0.15	0.14	0.14	0.14	0.15
passed goodness-of-fit tests, DDR	79	91	80	92	86	77	89	87
passed goodness-of-fit tests, kNN	62	81	70	62	77	73	83	73

Table 2. Decline of the relative error in the DDR algorithm with increase of the number of clusters.

number of clusters	1	2	4	8	16	32	64
RMSE for output	0.153	0.107	0.058	0.032	0.018	0.009	0.004

Table 3. Test results for the dice example, modelled using BNN.

sequential number	1	2	3	4	5	6	7	8
RMSE for mean	0.19	0.17	0.19	0.14	0.15	0.18	0.15	0.17
RMSE for standard deviation	0.20	0.21	0.30	0.43	0.43	0.26	0.32	0.23
passed goodness-of-fit tests	19	23	18	16	13	25	23	23

Table 4. Quantile regression test for estimation of confidence interval [0.05, 0.95] for QRF and DDR.

sequential number	1	2	3	4	5	6	7	8
QRF	0.8735	0.8597	0.8775	0.8656	0.8794	0.8557	0.8773	0.8745
DDR	0.9051	0.8794	0.8577	0.8597	0.8695	0.9012	0.8834	0.8932

Table 5. Quantile regression test for estimation of confidence interval [0.20, 0.80] for QRF and DDR.

sequential number	1	2	3	4	5	6	7	8
QRF	0.6047	0.5929	0.6166	0.6107	0.6304	0.6107	0.6067	0.6146
DDR	0.6067	0.6086	0.6107	0.6166	0.5790	0.5770	0.6126	0.5810

Table 6. Test results for shallow probabilistic KAN.

sequential number	1	2	3	4	5	6	7	8
RMSE for mean	0.04	0.03	0.04	0.03	0.04	0.03	0.04	0.03
RMSE for standard deviation	0.10	0.13	0.12	0.16	0.10	0.14	0.12	0.13
passed goodness-of-fit tests	58	58	62	60	64	63	63	64

Table 7. Sports betting results for 2019–2020 season of the English Premier League.

sequential number	1	2	3	4	5	6	7	8
ROS	0.13	0.21	0.14	0.14	0.15	0.12	0.15	0.09

Table 8. Sports betting results for 2020–2021 season of the English Premier League.

sequential number	1	2	3	4	5	6	7	8
ROS	0.15	0.13	0.20	0.23	0.21	0.14	0.15	0.13

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
